Thresholds and alerts for WMAS metrics

WMAS generates a notification when a metric crosses its configured threshold. Each notification is logged as a warning or an error, depending on how long the condition persists or how far the metric deviates from the threshold.

The following table lists each metric and the rules that determine when WMAS logs a warning or an error:

Metric Notification rules
Throughput

If the throughput remains below the specified threshold continuously for 30 minutes or more, a warning is logged. If it remains below the threshold continuously for 60 minutes or more, an error is logged. The default threshold is 50%, and you can configure it to any value from 0% to 100%. Work units with a throughput value of -1 or in an idle state do not receive notifications.

WMAS reports the following special throughput values under specific system conditions:
  • Idle: If no work units were completed in the past 60 minutes, the throughput is set to -1, which indicates that the system is idle.
  • No dispatched work units, but some completed: Throughput is reported as 100%.
  • Dispatched work units, but none completed: Throughput is reported as 0%.
  • No dispatched or completed work units, but an active work unit exists: Throughput is reported as 0%.
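The special cases above can be sketched as a small function. This is an illustration only, not the WMAS implementation: the function name, its parameters, the order in which the cases are checked, and the ordinary-case formula (completed divided by dispatched, capped at 100%) are all assumptions, because the text does not define them.

```python
IDLE = -1  # sentinel value the text uses for an idle system


def throughput_percent(dispatched: int, completed: int, has_active: bool = False) -> float:
    """Return throughput as a percentage, or IDLE (-1) when the system is idle.

    Counts are assumed to cover the past 60 minutes (hypothetical interface).
    """
    if dispatched == 0 and completed == 0:
        # Nothing dispatched or completed: 0% if a work unit is still active,
        # otherwise idle. This precedence is an assumption.
        return 0.0 if has_active else IDLE
    if dispatched == 0:
        # Nothing dispatched, but some work units completed.
        return 100.0
    if completed == 0:
        # Work units dispatched, but none completed.
        return 0.0
    # Assumed ordinary case: share of dispatched work that completed, capped at 100%.
    return min(100.0, 100.0 * completed / dispatched)
```

For example, `throughput_percent(4, 2)` evaluates to `50.0` under the assumed formula, and `throughput_percent(0, 0)` returns the idle sentinel.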
Utilization rate

If the utilization rate remains below the specified threshold continuously for 30 minutes or more, a warning is logged. If it remains below the threshold continuously for 60 minutes or more, an error is logged. The default threshold is 50%, and you can configure it to any value from 0% to 100%.
Backlog rate

If the backlog rate exceeds the specified threshold of system capacity continuously for 30 minutes or more, a warning is logged. If the backlog rate exceeds the specified threshold of system capacity continuously for 60 minutes or more, an error is logged. The default threshold is 200%, but you can configure it within a range of 100% to 500%.
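The duration-based rules for throughput, utilization rate, and backlog rate share one pattern: 30 continuous minutes in breach produces a warning, and 60 produces an error. The sketch below illustrates that pattern under stated assumptions. The function names, the one-sample-per-minute input, and the choice to let a -1 (idle) sample end the streak are hypothetical and not taken from WMAS.

```python
from typing import Optional, Sequence

WARNING_MINUTES = 30  # continuous breach duration that logs a warning
ERROR_MINUTES = 60    # continuous breach duration that logs an error


def breach_minutes(samples: Sequence[float], threshold: float,
                   breach_when_below: bool, interval_minutes: int = 1) -> int:
    """Length of the unbroken run of breaching samples ending at the newest one.

    `samples` is ordered oldest to newest, one value per interval (assumption).
    """
    run = 0
    for value in reversed(samples):
        if value == -1:
            break  # assumption: an idle sample ends the streak
        breached = value < threshold if breach_when_below else value > threshold
        if not breached:
            break
        run += 1
    return run * interval_minutes


def severity(minutes_in_breach: int) -> Optional[str]:
    """Map a continuous breach duration to 'warning', 'error', or None."""
    if minutes_in_breach >= ERROR_MINUTES:
        return "error"
    if minutes_in_breach >= WARNING_MINUTES:
        return "warning"
    return None
```

With throughput or utilization, pass `breach_when_below=True`; with backlog rate, pass `breach_when_below=False` so that values above the threshold count as a breach.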
Failure rate

If the failure rate exceeds the specified threshold continuously for 30 minutes or more, a warning is logged. If the failure rate exceeds the specified threshold continuously for 60 minutes or more, an error is logged. The default threshold is 25%, and you can configure it to any value from 0% to 100%. Work units with a failure rate value of -1 or in an idle state do not receive notifications.

WMAS reports the following special failure rate values under specific system conditions:
  • No failed or completed work units: The failure rate is set to -1, which indicates that no work units were completed during the monitored period.
  • No failed work units, but some completed: The failure rate is reported as 0%.
  • Failed work units, but none completed: The failure rate is reported as 100%.
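As a rough illustration of these cases, the following sketch maps counts of failed and completed work units to a failure rate. The function name and the ordinary-case formula (failed divided by failed plus completed) are assumptions. The formula was chosen only because it agrees with the 0% and 100% cases listed above, and the text does not state how WMAS computes the value.

```python
NO_DATA = -1  # sentinel the text uses when nothing failed or completed


def failure_rate_percent(failed: int, completed: int) -> float:
    """Return the failure rate as a percentage, or NO_DATA (-1)."""
    if failed == 0 and completed == 0:
        return NO_DATA   # no failed or completed work units
    if failed == 0:
        return 0.0       # some completed, none failed
    if completed == 0:
        return 100.0     # some failed, none completed
    # Assumed ordinary case: failures as a share of all finished work units.
    return 100.0 * failed / (failed + completed)
```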
Error rate

If the error rate exceeds the specified threshold continuously for 30 minutes or more, a warning is logged. If the error rate exceeds the specified threshold continuously for 60 minutes or more, an error is logged. The default threshold is 10 errors per work unit, but you can configure it within a range of 1 to 1000 errors per work unit. Work units with an error rate value of -1 or an idle state are excluded from receiving notifications.

WMAS reports the following special error rate values under specific system conditions:
  • No completed work units: The error rate is set to -1, which indicates that no activity occurred.
  • Completed work units exist: The error rate is calculated normally, regardless of whether errors occurred.
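Because the threshold is expressed in errors per work unit, one plausible reading is that the error rate is the error count divided by the number of completed work units. The sketch below assumes that reading. The function name and the formula are hypothetical and are not confirmed by the text.

```python
NO_ACTIVITY = -1  # sentinel the text uses when no work units completed


def error_rate(errors: int, completed: int) -> float:
    """Return errors per completed work unit, or NO_ACTIVITY (-1)."""
    if completed == 0:
        return NO_ACTIVITY
    # Calculated the same way whether or not any errors occurred.
    return errors / completed
```

Under this assumption, 30 errors across 3 completed work units gives a rate of 10.0, which equals the default threshold and therefore does not exceed it.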
Long running work units

A work unit is considered long running if it lasts at least 60 minutes and takes more than 25% longer than the average for similar work units. In this case, WMAS logs a warning, which escalates to an error if the elapsed time is more than 50% above the average.
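The long running work unit rule combines a minimum duration with a comparison against the average. A minimal sketch follows, assuming the average for similar work units is already known and passed in. The function and parameter names are hypothetical.

```python
from typing import Optional

MIN_ELAPSED_MINUTES = 60  # a work unit must last at least this long to qualify


def long_running_severity(elapsed_minutes: float, average_minutes: float) -> Optional[str]:
    """Classify a work unit as None, 'warning', or 'error'."""
    if elapsed_minutes < MIN_ELAPSED_MINUTES or average_minutes <= 0:
        return None  # too short to qualify, or no usable average
    if elapsed_minutes > 1.5 * average_minutes:
        return "error"    # more than 50% above the average
    if elapsed_minutes > 1.25 * average_minutes:
        return "warning"  # more than 25% above the average
    return None
```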
Long running activities

If an activity exceeds its defined threshold, WMAS reviews up to ten prior executions to calculate the average elapsed time. If the current elapsed time exceeds this average by 25%, a warning is logged. If the current elapsed time exceeds this average by 50%, an error is logged. Although long-running activities may occur during high-volume periods, consistent or isolated delays may indicate a systemic issue that requires further investigation.
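The activity rule can be sketched in the same way. This example assumes that the ten executions reviewed are the most recent ones, that the history is ordered oldest to newest, and that "exceeds by 25%" means at least 25% above the average. None of these details are stated in the text, and the names are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

MAX_PRIOR_EXECUTIONS = 10  # the text says up to ten prior executions are reviewed


def activity_severity(elapsed: float, threshold: float,
                      prior_elapsed: Sequence[float]) -> Optional[str]:
    """Classify the current execution as None, 'warning', or 'error'."""
    if elapsed <= threshold:
        return None  # the activity has not exceeded its defined threshold
    recent = list(prior_elapsed)[-MAX_PRIOR_EXECUTIONS:]
    if not recent:
        return None  # no history to compare against (assumed behavior)
    average = mean(recent)
    if elapsed >= 1.5 * average:
        return "error"
    if elapsed >= 1.25 * average:
        return "warning"
    return None
```

All elapsed times and the threshold must use the same unit, for example minutes.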