A typical use case for DataSet alerts is testing for the presence of a heartbeat signal, and alerting if that signal is interrupted. For example, this trigger
count:2m($k8s-cluster='k8s.prod.us-east-1' $k8s-deployment = 'scanner'
'scan completed') < 1
will fire an alert if we don't see at least one "scan completed" message from the Kubernetes deployment named scanner on the Kubernetes cluster k8s.prod.us-east-1 within a 2-minute period.
One caveat to this approach is that false alarms can result from connectivity issues between the agent and the DataSet infrastructure, or from ingestion delays within DataSet. DataSet's own production deployments can also make services unavailable for a few minutes; the Agent keeps retrying, so no data is lost in these situations, but a sensitive alert will still notice that the flow of logs was briefly interrupted. Adding a second test to the trigger helps limit these false alarms.
count:2m($k8s-cluster='k8s.prod.us-east-1' $k8s-deployment = 'scanner'
'scan completed') < 1
&& count:2m($k8s-cluster=*) > 0
In addition to testing for the heartbeat, we are now also checking that logs are coming in from any Kubernetes source. If all of these logs are missing, chances are this is an ingestion issue rather than an interruption of the heartbeat we are trying to monitor with this alert.
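The following is an added illustration, not DataSet's own query engine or code: a small Python evaluator that approximates the semantics of the two triggers above (a trailing 2-minute count, with $k8s-cluster=* treated as "the field is present with any value", both of which are assumptions about how DataSet evaluates the expression) and runs them over constructed sample events for three situations: a healthy scanner, a stalled scanner, and a total ingestion outage.

```python
"""Evaluate the page's heartbeat triggers over constructed sample events.

Added example; the window and filter semantics are an approximation of
DataSet's count:Nm(...) and are not taken from DataSet's implementation.
"""
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Values taken from the trigger expression on the page.
CLUSTER = "k8s.prod.us-east-1"
DEPLOYMENT = "scanner"
HEARTBEAT_TEXT = "scan completed"
WINDOW = timedelta(minutes=2)


@dataclass(frozen=True)
class LogEvent:
    """Minimal stand-in for an ingested event with its Kubernetes fields."""
    ts: datetime
    message: str
    k8s_cluster: Optional[str] = None      # None: not a Kubernetes source
    k8s_deployment: Optional[str] = None


def count_window(events, now, window, cluster=None, deployment=None, text=None):
    """Mimic count:<window>(filters): events in (now - window, now] matching
    every given filter. cluster="*" means 'field present with any value',
    which is how $k8s-cluster=* is interpreted here (assumption)."""
    start = now - window
    n = 0
    for e in events:
        if not (start < e.ts <= now):
            continue
        if cluster == "*":
            if e.k8s_cluster is None:
                continue
        elif cluster is not None and e.k8s_cluster != cluster:
            continue
        if deployment is not None and e.k8s_deployment != deployment:
            continue
        if text is not None and text not in e.message:
            continue
        n += 1
    return n


def naive_trigger(events, now):
    """count:2m($k8s-cluster=... $k8s-deployment=... 'scan completed') < 1"""
    return count_window(events, now, WINDOW, CLUSTER, DEPLOYMENT, HEARTBEAT_TEXT) < 1


def guarded_trigger(events, now):
    """Same heartbeat test && count:2m($k8s-cluster=*) > 0"""
    if not naive_trigger(events, now):
        return False
    return count_window(events, now, WINDOW, cluster="*") > 0


def scenarios():
    """Return {scenario: (naive_fires, guarded_fires)} for constructed data."""
    now = datetime(2024, 1, 1, 12, 0, 0)

    def ago(minutes):
        return now - timedelta(minutes=minutes)

    # Other workloads on the same cluster, logging steadily (constructed).
    background = [
        LogEvent(ago(1.5), "GET /healthz 200", CLUSTER, "api"),
        LogEvent(ago(0.5), "GET /healthz 200", CLUSTER, "api"),
    ]
    healthy = background + [
        LogEvent(ago(1), "scan completed: 42 images", CLUSTER, DEPLOYMENT)]
    # Scanner last reported 7 minutes ago while everything else keeps logging.
    stalled = background + [
        LogEvent(ago(7), "scan completed: 42 images", CLUSTER, DEPLOYMENT)]
    # Nothing from any Kubernetes source in the last 2 minutes; only a
    # non-Kubernetes host line arrived, so the guard clause is not satisfied.
    outage = [
        LogEvent(ago(7), "scan completed: 42 images", CLUSTER, DEPLOYMENT),
        LogEvent(ago(6), "GET /healthz 200", CLUSTER, "api"),
        LogEvent(ago(1), "syslog: cron job finished"),
    ]
    return {
        name: (naive_trigger(ev, now), guarded_trigger(ev, now))
        for name, ev in (("healthy", healthy), ("stalled", stalled), ("outage", outage))
    }


if __name__ == "__main__":
    for name, (naive, guarded) in scenarios().items():
        print(f"{name:8s} naive={naive!s:5s} guarded={guarded!s:5s}")
```

In the outage scenario the naive trigger fires while the guarded one stays quiet, which is exactly the false alarm the second clause is meant to suppress; in the stalled scenario both fire, so the guard does not hide a real heartbeat loss. One limit worth keeping in mind when sizing the guard: it only recognises a cluster-wide loss of logs. If just the agent forwarding the scanner's pods loses connectivity while other agents on the cluster keep shipping, the guard clause is still satisfied and the alert will fire as it would for a genuine stall.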