Prior to version 2.1.21, the Scalyr K8s Agent has a defect where, if the liveness probe times out before the probe command itself fails, the agent pod is never restarted. This can result in dropped pod logs.
This issue is partly due to the behavior mentioned in the note here. Before Kubernetes 1.20, the liveness probe's timeout configuration was not respected and effectively did not exist. After that was fixed, the probe timeout defaults to 1 second, which is lower than the 5-second wait time of the scalyr-agent-2 status -H command. Combined with the defect described above, this means Kubernetes kills the probe command after 1 second, before the agent ever reaches the status -H timeout of 5 seconds, so the liveness probe never fails in a way that restarts the pod and stuck pods were never restarted.
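The race can be illustrated with a minimal sketch: here `sleep 5` is a hypothetical stand-in for a stuck scalyr-agent-2 status -H that needs its full 5-second wait to fail on its own, wrapped in the 1-second timeout that Kubernetes enforces around the probe command.

```shell
# The outer 1-second timeout (as K8s applies) always fires before the
# 5-second inner command can complete or fail on its own terms.
timeout 1 sleep 5
echo "exit code: $?"   # 124: killed by the outer timeout, not a real probe failure
```

Exit code 124 here comes from the timeout wrapper itself, which is the analogue of the DeadlineExceeded error Kubernetes reports below.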
When the issue occurs, the agent pod reports the following event:
Liveness probe errored: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded
This implies that the agent pod is hitting the Kubernetes probe timeout rather than the actual status timeout, which produces a different message. Running the agent status command directly would return:
"Failed to get status within 5 seconds. Giving up. The agent process is possibly stuck. See /var/log/scalyr-agent-2/agent.log for more details. command terminated with exit code 1"
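This also explains why the agent's own timeout message never appears in the probe events: the 1-second Kubernetes timeout kills the command before the message is printed. A hypothetical stand-in for the stuck status command makes this visible (the echoed message below mimics, rather than invokes, the real agent output):

```shell
# The inner command would print the agent's 5-second failure message,
# but the 1-second K8s-style timeout kills it first, so nothing is
# captured -- kubelet only ever sees the DeadlineExceeded error.
out=$(timeout 1 bash -c 'sleep 5; echo "Failed to get status within 5 seconds. Giving up."')
echo "agent output captured: '${out}'"   # empty: the message was never printed
```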
To resolve the issue, modify the K8s agent manifest to increase the liveness probe timeout to 10 seconds:
livenessProbe:
  exec:
    command:
    - scalyr-agent-2
    - status
    - -H
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 10
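For context, a sketch of where this probe sits inside the DaemonSet pod spec; the container name and image tag below are illustrative assumptions, not taken from the official manifest:

```
spec:
  containers:
  - name: scalyr-agent          # assumed name, for illustration only
    image: scalyr/scalyr-k8s-agent:2.1.21   # assumed tag
    livenessProbe:
      exec:
        command:
        - scalyr-agent-2
        - status
        - -H
      initialDelaySeconds: 60
      periodSeconds: 60
      timeoutSeconds: 10
```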
The fix was introduced in agent version 2.1.21, so upgrading to the latest agent release and its default manifest avoids the timeout issue and the resulting loss of pod logs.