Introduction
This article was written to assist customers who have noticed an unexpected decline in log activity for a particular log. It includes a number of best practices that can be used to isolate the potential cause(s).
Step 1 - Verify the log file path exists and is accessible to the Scalyr Agent
This step is fairly self-explanatory, and will vary by OS and host configuration.
- Review /etc/scalyr-agent-2/agent.json (or any additional configuration file under /etc/scalyr-agent-2/agent.d/*) to confirm that the log path(s) in question are correctly defined. On Windows, the configuration is located at C:\Program Files (x86)\Scalyr\config\agent.json or C:\Program Files (x86)\Scalyr\config\agent.d\*
- Verify the path of any missing logfile(s). Confirm that each log file exists, file permissions are correct, and current log events are still being appended.
- Linux log rotation: If you're running logrotate with copytruncate and compress, be sure that you are also using delaycompress and a reasonable size (>= 200). delaycompress prevents logrotate from immediately compressing logs once a rotate cycle is initiated. Instead, logrotate will wait one cycle before compressing, which gives the Agent enough time to complete its upload to DataSet.
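For reference, a minimal logrotate stanza along these lines satisfies the guidance above. This is only a sketch: the log path is a hypothetical placeholder and the size unit is assumed to be megabytes, so adjust both for your environment.
/etc/logrotate.d/myapp (hypothetical example)
/var/log/myapp/*.log {
    # rotate once the file exceeds this size (unit assumed; adjust as needed)
    size 200M
    # keep 7 rotated copies
    rotate 7
    # copy the live file, then truncate it in place
    copytruncate
    # gzip rotated copies, but wait one rotation cycle before compressing
    # so the Agent has time to finish its upload (delaycompress)
    compress
    delaycompress
    missingok
    notifempty
}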
Step 2 - Verify timestamps
Faulty timestamps are a leading cause of "missing" logs. The timestamp could either be incorrectly generated by a customer's platform (ex. wrong timezone, lack of NTP server) or misinterpreted by a DataSet parser regex. In both cases, an updated parser is the solution.
If a parser is not used to extract the timestamp field, the log's time of ingestion will be used instead.
In the query examples below, <placeholder> represents a string or number. The actual value will not include the < > symbols.
- Identify the log file(s) that should be visible on DataSet. Run one of the following search queries to determine log activity for the file in question. Use an initial timeframe of 4-12 hours to ensure that any recent activity is included:
K8s
Note: You may use any of the other available K8s attributes (ex. k8s-controller, app.kubernetes.io/name, etc.) to improve the search results
tag='logVolume' metric='logBytes' app.kubernetes.io\/instance='<app.k8s.io_instance>' k8s-cluster='<K8s_cluster>'
Standard
tag='logVolume' metric='logBytes' forlogfile = '<logfile_for_missing_log>' host='<serverHost_for_missing_log>'
| group sum(value) by timestamp=timebucket('1h')
- If the above query indicates that the logfile from serverHost is presently generating log volume, subtract an hour from the current time and obtain the corresponding 10-digit unix epoch. Otherwise, use the query in step #1 to identify when the log was last generated and subtract an hour from that. An online epoch converter is helpful for calculating the epoch, or see the shell sketch after this list.
- The following PowerQuery will display logs by their ingested time and includes the delta of when the log was ingested vs its assigned timestamp:
K8s
Note: You may use any of the other available K8s attributes (ex. k8s-controller, app.kubernetes.io/name, etc.) to improve the search results
!(tag=*) sca:ingestTime >= <epoch> app.kubernetes.io\/instance='<app.k8s.io_instance>' k8s-cluster='<K8s_cluster>'
| let ingested.timestamp = (sca:ingestTime * 1000000000)
| let log_orig.timestamp = timestamp
| let delta = ((ingested.timestamp - log_orig.timestamp) / 1000000000) / 60
| columns ingested.timestamp, assigned.timestamp=log_orig.timestamp, deltaMin=delta, serverHost, message
| sort ingested.timestamp
| limit 100000
Standard
!(tag=*) sca:ingestTime >= <epoch> logfile = '<logfile_for_missing_log>' serverHost='<serverHost_for_missing_log>'
| let ingested.timestamp = (sca:ingestTime * 1000000000)
| let log_orig.timestamp = timestamp
| let delta = ((ingested.timestamp - log_orig.timestamp) / 1000000000) / 60
| columns ingested.timestamp, assigned.timestamp=log_orig.timestamp, deltaMin=delta, serverHost, message
| sort ingested.timestamp
| limit 100000
- Set the timeframe of the search to be as large as your retention period. For example, if you have 30 days of retention, set the start to 30d and the end to NOW.
- If logs are returned, click the "Search" menu and open "New Search" in a new tab. Click on the affected log event. From the "Inspect Log Line" panel, identify the value of timestamp and click "Edit Parser." Modify the parser to fix the timestamp extraction.
- If no logs are returned, the timestamps are being set to values that are beyond your log retention period (see Note 2). We'll need to identify a sample of these logs so that we can update the parser.
- From the affected host, retrieve and review the latest log events by using tail or less on the logfile. Next, identify the parser that is associated with the log. Modify the parser to fix the timestamp extraction.
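For the epoch calculation mentioned earlier in this section, you can also compute the value from the command line instead of using a converter website. This is a sketch assuming GNU date on Linux (with the macOS/BSD equivalent shown as well):
# 10-digit unix epoch for one hour before the current time (GNU date on Linux)
date -d '1 hour ago' +%s
# equivalent on macOS / BSD date
date -v-1H +%s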
Parsers only apply to incoming logs and aren't retroactively applied to existing logs, so any log events with an incorrect timestamp will not be changed once the parser is modified. You may also need to set the timezone parameter within the parser if the log timestamp is not set to the GMT timezone.
Note 1: Log volume metrics apply to when the logs were uploaded, rather than the timestamp associated with the log events.
Note 2: If the timestamp falls outside your retention period (ex. you have 30d retention and the log's timestamp is from 32 days prior to the present), the log event won't be included in any search results.
Note 3: sca:ingestTime is the epoch of when the log event was ingested. It is only present when the timestamp differs from the log's actual time of ingestion by 30 minutes or more.
Example
By using the Parsing Tester, we confirmed that the timestamp format for a set of logs was evaluated in the European format (Day/Month/Year) rather than the US format (Month/Day/Year), which caused logs to be associated with the wrong day.
The timestamp of "10/04/2023 21:06:30.444" was interpreted as "Tue Apr 11, 2023 1:06:30.444 AM GMT"
We updated the parser by adding a timezone (in this case, timezone="America/New_York") and a rewrites statement to resolve this issue:
...
{
format: ".*\\>$timestamp=tsPattern$\\<.*",
rewrites: [
{
input: "timestamp",
output: "timestamp",
match: "(\\d{2})\/(\\d{2})\/(\\d{4}) (\\d{2}:\\d{2}:\\d{2}\\.\\d{3})",
replace: "$3\/$1\/$2 $4"
}
]
},
...
As you can see, the rewrites statement is used to rearrange the contents of the timestamp field into an unambiguous Year/Month/Day order so that it is interpreted correctly.
Hint 1: We recommend using an ISO 8601-compatible timestamp format (ex. 2023-04-10T21:06:30.444Z)
Hint 2: The GMT / UTC timezone is preferred, since no adjustments for daylight saving time need to be made.
Step 3 - Check discard rules
Before proceeding with any other steps, verify that the logs in question are not being dropped by a discard rule on the "Cost Management" page. The following query makes the identification process easier, especially if many discard rules have been configured.
To begin:
- Identify when the last log event arrived with the "Search" function
- Go back ~10 minutes or so
- Run the following Search query after adjusting the time range. Note: "Full" permissions are required in order to access the audit logs.
tag='audit' action='updateFilter' filterRequest.disabledUntil contains "1970"
This query will list any "Cost Management" discard rules that were activated during the timeframe when the logs went missing. You will need to confirm whether a returned discard rule actually affects the logs in question. Discards that occur via a parser or the Agent are not returned.
You can also review overall discard activity, broken down by filter, by running this query:
tag='budgetCategoryStatistics' discardedByteCount>0
| group bytesDiscarded=sum(discardedByteCount), eventsDiscarded=sum(discardedEventCount) by discardFilterText
| sort -bytesDiscarded
Step 4 - Confirm that the log is being uploaded
Agent Diagnostics
Status Output
If you're using the Agent, run sudo scalyr-agent-2 status -v to confirm that the log in question is being monitored for changes and uploaded to DataSet. In particular, refer to the "Log Transmission" section.
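For example, you can narrow the verbose status output to the file in question. This is only a sketch; the log path below is a hypothetical placeholder:
# show the status entries surrounding a specific log path (path is an example)
sudo scalyr-agent-2 status -v | grep -A 5 '/var/log/myapp/app.log'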
agent.log
Review the /var/log/scalyr-agent-2/agent.log file for errors and/or reported issues that are associated with the missing log events. From the DataSet search, you can run logfile='/var/log/scalyr-agent-2/agent.log' severity > 3, if implicit_agent_log_collection = true. Otherwise, a cursory search can be performed against the local file.
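If the agent log is only available locally, a quick scan along these lines can surface recent problems. This is a sketch assuming a standard Linux install path; the exact severity strings in agent.log may vary:
# show the most recent error/warning entries from the local agent log
grep -iE 'error|warning' /var/log/scalyr-agent-2/agent.log | tail -n 50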
Log Volume
From your DataSet account, run tag='logVolume' forlogfile!='none' for the timeframe determined in Step 3. Review the resultant forlogfile values and verify whether the log in question is present.
Step 5 - Contact DataSet Support
If you have completed the above steps and have any questions, please contact the DataSet Support team by signing into the Support Portal and submitting a ticket. When doing so, kindly include the following in a .tar.gz file:
- The output of sudo scalyr-agent-2 status -v
- The /etc/scalyr-agent-2/* directory, with any API keys redacted
- The /var/log/scalyr-agent-2/agent.log file
- The last 20 lines or so of the log file in question (tail -n 20 <path> > logfile_snippet.txt)
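A rough sketch of how these artifacts could be gathered on Linux follows. The application log path and archive name are hypothetical placeholders, and any API keys in the copied configuration should be redacted before archiving:
# collect the requested diagnostics into a single archive (paths are examples)
sudo scalyr-agent-2 status -v > agent_status.txt
tail -n 20 /var/log/myapp/app.log > logfile_snippet.txt
sudo cp -r /etc/scalyr-agent-2 scalyr-agent-config
# redact API keys in scalyr-agent-config/agent.json before creating the archive
sudo tar -czf dataset_support_bundle.tar.gz agent_status.txt logfile_snippet.txt scalyr-agent-config /var/log/scalyr-agent-2/agent.log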