Optimizing Alert Responsiveness – DataSet Customer Portal

Introduction

When you configure an alert, you specify a timeframe that defines the evaluation window the trigger processes.

Under normal circumstances, logs are ingested shortly after they are generated. Depending on external factors, a delay of several seconds (or more) beyond a log event's timestamp is normal and does not affect the functionality of alerts. However, longer delays can occur, and these can affect whether your alerts trigger.

This article includes the best practices to use when configuring alerts. More information can be found in our documentation.

Examples

timeframe

The timeframe is a mandatory parameter. It defines a window that begins a user-defined interval (e.g. X seconds, minutes, or days) before the current time and extends up to the current time.

In the following example, the timeframe covers the 10 minutes up to the present. Any log events that match the trigger conditions within that window will cause the alert to issue notifications to the recipient(s) specified in the alertAddress field.

alerts: [
  {
    alertAddress: "<address1>,<Slack>,<PagerDuty>,...",
    trigger: "count:10m(<trigger conditions>) > 0",
    ...
  },
  ...
]

Suppose that it is presently 1:45PM. One or more matching events between 1:35PM and 1:45PM (the red box in the figure) will trigger the alert.

timeframe + offset

In addition to the timeframe, an optional offset can be specified to fine-tune the evaluation window.

In the next example, the timeframe covers the 20 minutes leading up to the offset, which is 10 minutes before the present time.

alerts: [
  {
    alertAddress: "<address1>,<Slack>,<PagerDuty>,...",
    trigger: "count:20m:10m(<trigger conditions>) > 0",
    ...
  },
  ...
]

If the current time is 11:30AM, any matching event that occurs between 11:00AM and 11:20AM (red box) will trigger the alert. The green box represents the 10-minute offset, from 11:20AM to 11:30AM.
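
The window arithmetic in both examples can be sketched in a few lines of Python. This only illustrates the description above and is not DataSet's implementation; the function name evaluation_window and its parameters are hypothetical, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

def evaluation_window(now, timeframe, offset=timedelta(0)):
    """Return (start, end) for a trigger of the form count:<timeframe>:<offset>."""
    end = now - offset        # the window ends `offset` before the present
    start = end - timeframe   # and covers `timeframe` before that point
    return start, end

# count:10m(...) evaluated at 1:45PM
start, end = evaluation_window(datetime(2024, 1, 1, 13, 45), timedelta(minutes=10))
print(start.time(), end.time())  # 13:35:00 13:45:00

# count:20m:10m(...) evaluated at 11:30AM
start, end = evaluation_window(
    datetime(2024, 1, 1, 11, 30), timedelta(minutes=20), timedelta(minutes=10)
)
print(start.time(), end.time())  # 11:00:00 11:20:00
```

With no offset, the window ends at the present time, which matches the first example.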

Offsets are also used to compare alert activity in the present to some time in the past (e.g. compare the last 10 minutes to a 30-minute period 2 weeks ago). In most cases, customers use a factor to smooth out irregularities and minimize false alarms. This factor is determined by testing platform behavior against when the alert is expected to trigger. More information on comparisons with offsets can be found here.
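
To make the role of the factor concrete, here is a rough Python sketch of such a comparison. It is not DataSet trigger syntax. It assumes the two windows are compared as per-minute rates and that the factor multiplies the historical baseline; the function name and the event counts are invented for illustration.

```python
def exceeds_baseline(current_count, current_minutes,
                     baseline_count, baseline_minutes, factor):
    """True when the current per-minute rate exceeds factor x the baseline rate."""
    current_rate = current_count / current_minutes
    baseline_rate = baseline_count / baseline_minutes
    return current_rate > factor * baseline_rate

# Last 10 minutes: 40 events (4.0/min). Baseline 30 minutes, 2 weeks ago: 60 events (2.0/min).
print(exceeds_baseline(40, 10, 60, 30, factor=1.5))  # True, since 4.0 > 3.0

# A smaller bump of 25 events (2.5/min) stays under the factored baseline.
print(exceeds_baseline(25, 10, 60, 30, factor=1.5))  # False, since 2.5 <= 3.0
```

A larger factor tolerates more variation before the alert triggers, which is why it is usually tuned against observed behavior.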

Responsiveness

Evaluation Interval

The minimum evaluation interval is 1 minute (60 seconds). A common misconception is that this results in one notification per matching log event. For example, if 5 log events satisfy a trigger condition of count:1m("severe error") > 0, the alert is triggered only once, and it will not be triggered again until it has resolved (minimum resolution time: 1 minute).
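
The trigger-once behavior can be pictured as a two-state machine. The sketch below is a simplified model rather than the platform's actual logic: it assumes one evaluation per minute, a trigger of the form count(...) > 0, and it ignores gracePeriodMinutes, renotifyPeriodMinutes and resolutionDelayMinutes. The function name is hypothetical.

```python
def state_changes(matches_per_evaluation):
    """Return (evaluation_index, new_state) for each change of alert state."""
    changes = []
    triggered = False
    for i, count in enumerate(matches_per_evaluation):
        condition = count > 0
        if condition and not triggered:
            triggered = True
            changes.append((i, "triggered"))
        elif not condition and triggered:
            triggered = False
            changes.append((i, "resolved"))
        # If the condition holds while already triggered, nothing new is issued.
    return changes

# 5 matching events in the first minute produce a single "triggered" change.
print(state_changes([5, 0, 2, 3, 0]))
# [(0, 'triggered'), (1, 'resolved'), (2, 'triggered'), (4, 'resolved')]
```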

There are also some drawbacks to using a 1 minute interval:

Log events which are delayed may not trigger the alert. For example, the Okta monitor downloads logs from Okta's API endpoint and may be subject to latencies of ~60 seconds. A log event with a correct timestamp that arrives outside of the alert's evaluation window won't trigger an alert.
The shorter the evaluation window, the noisier the alert. Alert notifications may be throttled to avoid spamming our partner services (Slack, Okta, PagerDuty, your organization's email, etc.) if they occur at a heightened frequency.

In most use cases, it makes sense to extend the timeframe beyond 1 minute to ensure that log events which experience ingestion delays are included in the evaluation.

If the 1 minute timeframe is an essential requirement, specifying an offset with the trigger can reduce the effects of delayed logs. In this example, a 1-minute evaluation window is applied that ends 5 minutes before the present time:

alerts: [
  {
    alertAddress: "<address1>,<Slack>,<PagerDuty>,...",
    description: "Test alert",
    gracePeriodMinutes: 0,
    renotifyPeriodMinutes: 0,
    resolutionDelayMinutes: 9999, // disabled = 9999
    trigger: "count:1m:5m(<trigger conditions>) > 0",
  },
  ...
]
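
The effect of the offset on delayed logs can be sketched with a small simulation. This is a simplified model under stated assumptions, not the platform's behavior: evaluations run on whole minutes, an event counts only if its timestamp lies inside the window and it has already been ingested at evaluation time, and all names (visible, ever_caught, horizon) are hypothetical.

```python
from datetime import datetime, timedelta

def visible(eval_time, event_ts, arrival, timeframe, offset=timedelta(0)):
    """True if the event is inside the window and already ingested at eval_time."""
    end = eval_time - offset
    start = end - timeframe
    return start <= event_ts <= end and arrival <= eval_time

def ever_caught(event_ts, delay, timeframe, offset=timedelta(0),
                horizon=timedelta(minutes=30)):
    """Check every whole-minute evaluation within `horizon` of the event."""
    arrival = event_ts + delay
    t = event_ts.replace(second=0, microsecond=0)
    stop = t + horizon
    while t <= stop:
        if visible(t, event_ts, arrival, timeframe, offset):
            return True
        t += timedelta(minutes=1)
    return False

event = datetime(2024, 1, 1, 12, 0, 20)   # event timestamp (arbitrary date)
delay = timedelta(seconds=90)             # assumed ingestion delay

# count:1m(...): the event arrives after its window has already passed.
print(ever_caught(event, delay, timedelta(minutes=1)))                        # False
# count:1m:5m(...): the window is evaluated 5 minutes later, after ingestion.
print(ever_caught(event, delay, timedelta(minutes=1), timedelta(minutes=5)))  # True
```

The trade-off is that the alert now reacts about 5 minutes later than it would with no offset.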

Alert Resolution

An alert's evaluation window and the resolutionDelayMinutes parameter both affect the time needed for an alert to resolve, which in turn impacts the responsiveness of the alert.

alerts: [
  {
    alertAddress: "<address1>,<Slack>,<PagerDuty>,...",
    description: "Test alert",
    gracePeriodMinutes: 0,
    renotifyPeriodMinutes: 0,
    resolutionDelayMinutes: 9999, // disabled = 9999
    trigger: "count:10m(<trigger conditions>) > 0",
  },
  ...
]

In this example, the alert has a 10 minute timeframe and was triggered at 2:10PM. Although events matching the trigger only occurred from 2:10-2:15PM, the alert won't resolve until 2:25PM (dotted red lines), because the last matching event at 2:15PM remains inside the 10 minute window until then.

If the alert used a 5 minute timeframe, it would resolve at 2:20PM (solid red lines).
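
The resolution times above follow from simple arithmetic: the alert can resolve once the last matching event has left the evaluation window. A minimal sketch, assuming no offset and ignoring resolutionDelayMinutes (the function name is hypothetical and the date is arbitrary):

```python
from datetime import datetime, timedelta

def earliest_resolution(last_match, timeframe):
    """Time at which the last matching event falls out of the evaluation window."""
    return last_match + timeframe

last_match = datetime(2024, 1, 1, 14, 15)  # last matching event at 2:15PM

print(earliest_resolution(last_match, timedelta(minutes=10)).time())  # 14:25:00
print(earliest_resolution(last_match, timedelta(minutes=5)).time())   # 14:20:00
```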

Conclusion

Alerts with a short timeframe can cycle between triggered and resolved states more quickly and tend to be noisier. They are less forgiving when ingestion delays occur, but an offset can be used to fine-tune this behavior. Shorter timeframes are useful when immediate responses on a per-alert basis are required.

Longer timeframes are best applied when additional alerts do not increase the urgency of the original notification. Additionally, notifications can be resent as reminders. Extended timeframes tolerate ingestion delays well, but the alert won't resolve until the evaluation window is clear of matching events.