Introduction
This article was written to assist customers who have noticed an unexpected decline in log activity for a particular log. It includes a number of best practices that can be used to isolate the potential cause(s).
Step 1 - Verify the log file path exists and is accessible to the Scalyr Agent
This step is fairly self-explanatory, and will vary by OS and host configuration.
- Review /etc/scalyr-agent-2/agent.json (or any additional configuration file under /etc/scalyr-agent-2/agent.d/*) to confirm that the log path(s) in question are correctly defined. On Windows, the configuration is located at C:\Program Files (x86)\Scalyr\config\agent.json or C:\Program Files (x86)\Scalyr\config\agent.d\*
- Verify the path of any missing logfile(s). Confirm that each log file exists, file permissions are correct, and current log events are still being appended.
- Linux log rotation: If you're running logrotate with copytruncate and compress, be sure that you are also using delaycompress and a reasonable size (>= 200). delaycompress prevents logrotate from immediately compressing logs once a rotate cycle is initiated. Instead, logrotate will wait one cycle before compressing, which gives the Agent enough time to complete its upload to DataSet.
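For reference, a minimal logrotate stanza along these lines satisfies the guidance above. This is only a sketch: the log path is a hypothetical placeholder and the size unit is assumed to be megabytes, so adjust both for your environment.
/etc/logrotate.d/myapp (hypothetical example)
/var/log/myapp/*.log {
    # rotate once the file exceeds this size (unit assumed; adjust as needed)
    size 200M
    # keep 7 rotated copies
    rotate 7
    # copy the live file, then truncate it in place
    copytruncate
    # gzip rotated copies, but wait one rotation cycle before compressing
    # so the Agent has time to finish its upload (delaycompress)
    compress
    delaycompress
    missingok
    notifempty
}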
Step 2 - Verify timestamps
Faulty timestamps are a leading cause of "missing" logs. The timestamp could either be incorrectly generated by a customer's platform (ex. wrong timezone, lack of NTP server) or misinterpreted by a DataSet parser regex. In both cases, an updated parser is the solution.
If a parser is not used to extract the timestamp field, the log's time of ingestion will be used instead.
In the query examples below, <placeholder> represents a string or number. The actual value will not include the < > symbols.
- Identify the log file(s) that should be visible on DataSet. Run one of the following search queries to determine log activity for the file in question. Use an initial timeframe of 4-12 hours to ensure that any recent activity is included:
K8s
Note: You may use any of the other available K8s attributes (ex. k8s-controller, app.kubernetes.io/name, etc.) to improve the search results
tag='logVolume' metric='logBytes' app.kubernetes.io\/instance='<app.k8s.io_instance>' k8s-cluster='<K8s_cluster>'
Standard
tag='logVolume' metric='logBytes' forlogfile = '<logfile_for_missing_log>' host='<serverHost_for_missing_log>'
| group sum(value) by timestamp=timebucket('1h')
- If the above query indicates that the logfile from serverHost is presently generating log volume, subtract an hour from the current time and obtain the corresponding 10-digit unix epoch. Otherwise, use the query in step #1 to identify when the log was last generated and subtract an hour from that. An online epoch converter is helpful for calculating the epoch, or see the shell sketch after this list.
- The following PowerQuery will display logs by their ingested time and includes the delta of when the log was ingested vs its assigned timestamp:
K8s
Note: You may use any of the other available K8s attributes (ex. k8s-controller, app.kubernetes.io/name, etc.) to improve the search results
!(tag=*) sca:ingestTime >= <epoch> app.kubernetes.io\/instance='<app.k8s.io_instance>' k8s-cluster='<K8s_cluster>'
| let ingested.timestamp = (sca:ingestTime * 1000000000)
| let log_orig.timestamp = timestamp
| let delta = ((ingested.timestamp - log_orig.timestamp) / 1000000000) / 60
| columns ingested.timestamp, assigned.timestamp=log_orig.timestamp, deltaMin=delta, serverHost, message
| sort ingested.timestamp
| limit 100000
Standard
!(tag=*) sca:ingestTime >= <epoch> logfile = '<logfile_for_missing_log>' serverHost='<serverHost_for_missing_log>'
| let ingested.timestamp = (sca:ingestTime * 1000000000)
| let log_orig.timestamp = timestamp
| let delta = ((ingested.timestamp - log_orig.timestamp) / 1000000000) / 60
| columns ingested.timestamp, assigned.timestamp=log_orig.timestamp, deltaMin=delta, serverHost, message
| sort ingested.timestamp
| limit 100000
- Set the timeframe of the search to be as large as your retention period. For example, if you have 30 days of retention, set the start to 30d and the end to NOW.
- If logs are returned, click the "Search" menu and open "New Search" in a new tab. Click on the affected log event. From the "Inspect Log Line" panel, identify the value of timestamp and click "Edit Parser." Modify the parser to fix the timestamp extraction.
- If no logs are returned, the timestamps are being set to values that are beyond your log retention period (see Note 2). We'll need to identify a sample of these logs so that we can update the parser.
- From the affected host, retrieve and review the latest log events by using tail or less on the logfile. Next, identify the parser that is associated with the log. Modify the parser to fix the timestamp extraction.
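For the epoch calculation mentioned earlier in this section, you can also compute the value from the command line instead of using a converter website. This is a sketch assuming GNU date on Linux (with the macOS/BSD equivalent shown as well):
# 10-digit unix epoch for one hour before the current time (GNU date on Linux)
date -d '1 hour ago' +%s
# equivalent on macOS / BSD date
date -v-1H +%s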
Parsers only apply to incoming logs and aren't retroactively applied to existing logs, so any log events with an incorrect timestamp will not be changed once the parser is modified. You may also need to set the timezone parameter within the parser if the log timestamp is not set to the GMT timezone.
Note 1: Log volume metrics apply to when the logs were uploaded, rather than the timestamp associated with the log events.
Note 2: If the timestamp falls outside your retention period (ex. you have 30d retention and the log's timestamp is from 32 days prior to the present), the log event won't be included in any search results.
Note 3: sca:ingestTime is the epoch of when the log event was ingested. It is only present when the timestamp differs from the log's actual time of ingestion by 30 minutes or more.
Example
By using the Parsing Tester, we confirmed that the timestamp format for a set of logs was evaluated in the European format (Day/Month/Year) rather than the US format (Month/Day/Year), which caused logs to be associated with the wrong day.
The timestamp of "10/04/2023 21:06:30.444" was interpreted as "Tue Apr 11, 2023 1:06:30.444 AM GMT"
We updated the parser by adding a timezone (in this case, timezone="America/New_York") and a rewrites statement to resolve this issue:
...
{
format: ".*\\>$timestamp=tsPattern$\\<.*",
rewrites: [
{
input: "timestamp",
output: "timestamp",
match: "(\\d{2})\/(\\d{2})\/(\\d{4}) (\\d{2}:\\d{2}:\\d{2}\\.\\d{3})",
replace: "$3\/$1\/$2 $4"
}
]
},
...
As you can see, the rewrites statement is used to rearrange the contents of the timestamp field into an unambiguous Year/Month/Day order so that it is interpreted correctly.
Hint 1: We recommend using an ISO 8601-compatible timestamp format (ex. 2023-04-10T21:06:30.444Z)
Hint 2: The GMT / UTC timezone is preferred, since no adjustments for daylight saving time need to be made.
Step 3 - Check discard rules
Before proceeding with any other steps, verify that the logs in question are not being dropped by a discard rule on the "Cost Management" page. The following query makes the identification process easier, especially if many discard rules have been configured.
To begin:
- Identify when the last log event arrived with the "Search" function
- Go back ~10 minutes or so
- Run the following Search query after adjusting the time range. Note: "Full" permissions are required in order to access the audit logs.
tag='audit' action='updateFilter' filterRequest.disabledUntil contains "1970"
This query will list any "Cost Management" discard rules that were activated during the timeframe when the logs went missing. You will need to confirm whether a returned discard rule actually affects the logs in question. Discards that occur via a parser or the Agent are not returned.
You can also review overall discard activity, broken down by filter, by running this query:
tag='budgetCategoryStatistics' discardedByteCount>0
| group bytesDiscarded=sum(discardedByteCount), eventsDiscarded=sum(discardedEventCount) by discardFilterText
| sort -bytesDiscarded
Step 4 - Confirm that the log is being uploaded
Agent Diagnostics
Status Output
If you're using the Agent, run sudo scalyr-agent-2 status -v to confirm that the log in question is being monitored for changes and uploaded to DataSet. In particular, refer to the "Log Transmission" section.
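For example, you can narrow the verbose status output to the file in question. This is only a sketch; the log path below is a hypothetical placeholder:
# show the status entries surrounding a specific log path (path is an example)
sudo scalyr-agent-2 status -v | grep -A 5 '/var/log/myapp/app.log'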
agent.log
Review the /var/log/scalyr-agent-2/agent.log file for errors and/or reported issues that are associated with the missing log events. From the DataSet search, you can run logfile='/var/log/scalyr-agent-2/agent.log' severity > 3, if implicit_agent_log_collection = true. Otherwise, a cursory search can be performed against the local file.
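If the agent log is only available locally, a quick scan along these lines can surface recent problems. This is a sketch assuming a standard Linux install path; the exact severity strings in agent.log may vary:
# show the most recent error/warning entries from the local agent log
grep -iE 'error|warning' /var/log/scalyr-agent-2/agent.log | tail -n 50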
Log Volume
From your DataSet account, run tag='logVolume' forlogfile!='none' for the timeframe determined in Step 3. Review the resultant forlogfile values and verify whether the log in question is present.
Step 5 - Contact DataSet Support
If you have completed the above steps and have any questions, please contact the DataSet Support team by signing into the Support Portal and submitting a ticket. When doing so, kindly include the following in a .tar.gz file:
- The output of sudo scalyr-agent-2 status -v
- The /etc/scalyr-agent-2/* directory, with any API keys redacted
- The /var/log/scalyr-agent-2/agent.log file
- The last 20 lines or so of the log file in question (tail -n 20 <path> > logfile_snippet.txt)
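A rough sketch of how these artifacts could be gathered on Linux follows. The application log path and archive name are hypothetical placeholders, and any API keys in the copied configuration should be redacted before archiving:
# collect the requested diagnostics into a single archive (paths are examples)
sudo scalyr-agent-2 status -v > agent_status.txt
tail -n 20 /var/log/myapp/app.log > logfile_snippet.txt
sudo cp -r /etc/scalyr-agent-2 scalyr-agent-config
# redact API keys in scalyr-agent-config/agent.json before creating the archive
sudo tar -czf dataset_support_bundle.tar.gz agent_status.txt logfile_snippet.txt scalyr-agent-config /var/log/scalyr-agent-2/agent.log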