On a per-logfile basis and with its default settings, the agent will only try to send log content that is no more than 15 minutes old and that is within 200 MB of the end of the log file. When either of those limits is exceeded, the agent skips to the end of the log, resets its offsets, and continues from there. There are a couple of reasons we established these defaults.
(1) If the agent is falling behind because logs are being generated faster than it can keep up with, the assumption is that it is better to skip ahead and send new logs than to be perpetually behind (or to keep falling further behind).
(2) The DataSet server architecture originally included a requirement that logs be no more than a few minutes old; older logs would either be ignored or have their timestamps "dragged forward" to a more recent time. The DataSet architecture now supports what we call "stale log ingestion"; in other words, we are able to ingest logs with an older timestamp and preserve that timestamp, so the logs appear in context with other logs from the same time.
This gives us the opportunity to tweak the limits mentioned above in order to allow older logs to be ingested. The primary use case we envision here is making the agent more resilient in cases where connectivity between the agent and DataSet is lost for more than a few minutes.
The parameters are:
max_log_offset_size
- This option controls how far behind the end of the file the agent can be the first time it starts tracking a file, before it skips to the end. Specified in bytes. Default is 200,000,000 bytes (200 MB). In other words, if the agent starts tracking a file and the file is larger than 200 MB, the agent will only upload log content that was written after tracking started. If the file is smaller than 200 MB when tracking starts, the agent will try to upload the entire existing file from the beginning.
max_existing_log_offset_size
- This option controls how far behind the end of the file the agent can be for a file it is already monitoring, before it skips to the end. Specified in bytes. Default is 200,000,000 bytes (200 MB). In other words, if logs are written to a file faster than the agent can process them, the agent will gradually fall behind the end of the log file; if it gets more than 200 MB behind, it will ignore those 200 MB, skip to the end of the log, and continue from there.
copy_staleness_threshold
- If the agent notices that new bytes have appeared in a file but does not read them before this threshold is exceeded, it considers those bytes stale and skips to reading from the end of the file to get the freshest bytes. Specified in seconds. Default is 900 seconds (15 minutes). A configuration sketch using all three options follows this list.
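As a concrete illustration, the sketch below raises all three limits to tolerate roughly a one-hour, 6 GB backlog. This is a minimal sketch, not official documentation: it assumes these options are top-level fields in the agent's agent.json configuration file, and the specific values are examples rather than recommendations.

```python
# Minimal sketch, assuming the options are top-level agent.json settings.
# The ~6 GB / 60-minute allowance is illustrative only.
import json

config = {
    "api_key": "YOUR-WRITE-API-KEY",            # placeholder
    # Allow a newly tracked file to start up to ~6 GB behind its end.
    "max_log_offset_size": 6_000_000_000,
    # Allow an already-monitored file to fall up to ~6 GB behind before skipping.
    "max_existing_log_offset_size": 6_000_000_000,
    # Treat unread bytes as stale after 60 minutes instead of the default 15.
    "copy_staleness_threshold": 3600,
}

print(json.dumps(config, indent=2))
```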
These parameters are interdependent and are also tied to the maximum rate at which the agent sends logs to DataSet. For example, if you wish to set copy_staleness_threshold to 3600 seconds (60 minutes), you should determine your average peak log volume in bytes over a 60-minute period and set max_existing_log_offset_size to a value greater than that, as in the sketch below.
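A rough sizing helper along these lines makes that rule concrete; the 25% headroom factor is an illustrative choice, not a documented recommendation.

```python
# Sizing sketch: choose max_existing_log_offset_size large enough to cover the
# peak volume written during one copy_staleness_threshold window, plus headroom.
def size_existing_log_offset(peak_bytes_per_hour: float,
                             staleness_seconds: int = 3600,
                             headroom: float = 1.25) -> int:
    bytes_in_window = peak_bytes_per_hour * staleness_seconds / 3600.0
    return int(bytes_in_window * headroom)

# e.g. a 6 GB/hour peak with a 60-minute staleness window -> 7,500,000,000 bytes
print(size_existing_log_offset(6_000_000_000))
```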
You should also consider the maximum rate at which the agent sends logs to DataSet. This can be set as a hard limit using max_send_rate_enforcement. The default for this value is unlimited. The format for specifying a maximum upload rate is "<rate><unit_numerator>/<unit_denominator>"; more detail is available in the 2.1.6 release notes.
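For instance, a value following that format might look like the fragment below. The exact unit strings the agent accepts are defined in the 2.1.6 release notes, so treat this particular string as an assumption rather than verified syntax.

```python
# Hypothetical fragment: caps uploads at 2.5 megabytes per second using the
# "<rate><unit_numerator>/<unit_denominator>" pattern described above.
config_fragment = {
    "max_send_rate_enforcement": "2.5MB/s",
}
print(config_fragment)
```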
Alternatively, the right limit can depend on your network speed, RTT to DataSet, and so on. As a general rule of thumb, use 0.5 MB/sec if max_send_rate_enforcement is set to legacy or you are using an agent older than 2.1.6. Otherwise, if your log lines are generally shorter than 500 bytes, use 2.5 MB/sec; if your log lines are generally longer, use 5 MB/sec.
To calculate the nominal upload rate in MB/second, take the number of bytes you specified in max_existing_log_offset_size, divide by 1,000,000 to convert to MB, and divide by your copy_staleness_threshold in seconds (3600 in this example). If this nominal send rate is more than half of the maximum send rate described above, it will take more than an hour for the agent to catch up after being offline for an hour. The closer the nominal send rate is to the maximum send rate, the higher the chance that the agent will be perpetually behind.
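That arithmetic can be written out as a small check; the function names and the one-half threshold test are simply a convenience for illustration.

```python
# Nominal upload rate implied by max_existing_log_offset_size and the staleness window.
def nominal_rate_mb_per_sec(offset_bytes: float, staleness_seconds: int = 3600) -> float:
    return offset_bytes / 1_000_000 / staleness_seconds

def catch_up_is_risky(offset_bytes: float, max_send_mb_per_sec: float,
                      staleness_seconds: int = 3600) -> bool:
    # More than half the maximum send rate means an outage as long as the
    # staleness window takes longer than that window to recover from.
    return nominal_rate_mb_per_sec(offset_bytes, staleness_seconds) > 0.5 * max_send_mb_per_sec
```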
For example, if you have a machine that typically sends 6 GB/hour of logs at its peak (i.e., 6,000,000,000 bytes), you could set max_log_offset_size to that value and copy_staleness_threshold to 3600 seconds. However, 6 GB/hour translates to roughly 1.67 MB/sec, and if your maximum send rate is 2 MB/sec, the agent will have a hard time catching up after being offline for an hour, so you may wish to consider decreasing the values for max_log_offset_size and copy_staleness_threshold.
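Plugging this scenario's numbers into the same formula shows why:

```python
# Scenario from the example above: 6 GB/hour peak volume, 2 MB/sec maximum send rate.
offset_bytes = 6_000_000_000
max_send_mb_per_sec = 2.0

nominal_mb_per_sec = offset_bytes / 1_000_000 / 3600      # ~1.67 MB/sec
print(f"nominal: {nominal_mb_per_sec:.2f} MB/sec "
      f"({nominal_mb_per_sec / max_send_mb_per_sec:.0%} of the maximum send rate)")
# ~83% of the maximum rate -- well over one half, so catching up after an
# hour offline would take considerably longer than an hour.
```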
Also see: Restoring Logs with the Scalyr Agent