Introduction
This article covers some best practices that I discovered after configuring many parsers for different applications. It assumes that you have a basic knowledge of how to work with DataSet parsers and regular expressions. If you have not already done so, please be sure to review https://app.scalyr.com/help/parsing-logs before proceeding.
Also, feel free to submit any of your favorite parsing tips or tricks! We love hearing from our customers.
Make use of built-in DataSet values
The timestamp field is probably the most important of all built-in fields, as it ensures that DataSet is synced with when the log event occurred. Timestamp formats vary, so it’s important to get this right. If DataSet doesn’t have a predefined format for the timestamp, one can be constructed with regular expressions (and occasionally, a rewrite rule). More on this below.
Video Explanation: https://youtu.be/uNOiu8CVnJU?t=643
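If a custom format is needed, a named pattern can bound the extraction. A minimal sketch, assuming a timestamp such as 2020-02-21 12:49:56 at the start of each line (the pattern name is illustrative):

{
  patterns: {
    // illustrative: matches "2020-02-21 12:49:56"
    tsPattern: "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
  },
  formats: [
    { format: "$timestamp=tsPattern$\\s.*" }
  ]
}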
severity is commonly used to quickly identify what a log event represents. Is it normal output or a critical error? severity allows this to be determined at a glance. This information is usually present in a consistent position within the log line, making it easy to extract (e.g., INFO, DEBUG, ERROR).
Video Explanation: https://youtu.be/bplhUtiaso8?t=498
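As a hedged sketch, the level keyword can be extracted by its position. The bracketed keyword after the timestamp is an assumption about the log layout, not taken from the examples below:

{ format: "$timestamp$ \\[$severity=identifier$\\] .*" }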
The message field contains the complete log message that was sent to DataSet. Regardless of the parsing operations that take place, we retain each log event in its entirety. It can be quickly evaluated from the search results, or by clicking a particular result and viewing the "Inspect Log Line" dialog:
[Screenshot: search results]
[Screenshot: Inspect Log Line dialog]
Video Explanation: https://youtu.be/bplhUtiaso8?t=203
Focus on similarities first
DataSet parsers process rules in the order they occur. I typically begin a parser by processing fields that are common to an application’s log events from left to right. For example, the timestamp is almost always in the same location and often maintains the same format. Since parser rules are applied in cascading order, it’s possible to target specific segments of each log event. Don’t feel obligated to parse entire log lines at once!
Once your parser has been configured for the similarities of an application’s logs, expand to the differences. Each format statement within the formats block can be used to process variations as needed. Exit the formats block with a halt: true statement, if necessary.
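A minimal sketch of this progression (the error and status fields are illustrative):

formats: [
  // similarity: every line begins with a timestamp
  { format: "$timestamp$ .*" },
  // difference: error lines get their own handling, then processing stops
  { format: ".*ERROR $errorCode=identifier$.*", halt: true },
  // difference: remaining lines carry a status field
  { format: ".*status=$status=identifier$.*" }
]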
Distilling Log Data
Log Lines
Log lines that don't contain any useful information can be removed with the discard statement (in conjunction with a format statement). The entire log line will be ignored (and won't be stored). See https://app.scalyr.com/help/parsing-logs#discard for more information.
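For example, a sketch that drops load-balancer health checks (the matched text is an assumption about your logs):

{ format: ".*GET /healthcheck.*", discard: true }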
Video Explanation: https://youtu.be/ug-Br8TO__4?t=303
Key-Value Pair Data
Since key-value pair / JSON data is parsed automatically, matching fields are extracted without additional configuration. However, you can use the attrBlacklist field to avoid extracting / storing unneeded attributes. More information can be found here: https://app.scalyr.com/help/parsing-logs#valueLists
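A hedged sketch, assuming attrBlacklist sits at the top level of the parser; the attribute names here are hypothetical (consult the linked documentation for exact placement):

{
  // drop attributes whose names match this regex (names are hypothetical)
  attrBlacklist: "debug_info|internal_.*",
  patterns: {
    valueChars: "[^\\s\"]+"
  },
  formats: [
    { format: ".*$_=identifier$=$_=valueChars$.*", repeat: true }
  ]
}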
Key-Value Pair Processing
Whenever possible, apply key-value pair processing to extract values. This approach has several advantages:
- Parameters will retain the names they were assigned in the log events
- Minimizes overhead
- Increases versatility - fields are extracted by their formatting, not their static position within the log line
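To illustrate the last point, compare a positional rule with a key-value rule (the user field is illustrative):

// Positional: breaks if another field is inserted ahead of the value
{ format: "\\S+\\s+\\S+\\s+$user=identifier$\\s.*" }
// Key-value: matches user= wherever it appears in the line
{ format: ".*user=$user=identifier$.*" }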
Patterns
The patterns field can be used to define regular expressions that are intended to be reused within a format line. This cleans up the parser configuration while ensuring that the regular expression isn’t defined in multiple (inline) locations. Note: At this time, the rewrites statement cannot utilize an entry in patterns.
In cases where the character content is not easily defined, you can use exclusions. For example, [^\\s]+ will accept any character(s) other than whitespace.
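A small sketch using an exclusion pattern (the srcIp field name is illustrative; src= appears in the example logs below):

patterns: {
  // any run of non-whitespace characters
  notSpace: "[^\\s]+"
},
formats: [
  { format: ".*src=$srcIp=notSpace$.*" }
]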
Video Explanation: https://youtu.be/3Miia70yADE?t=281
Regex lookbehinds / lookaheads are not supported.
Video Explanation: https://youtu.be/trDh8omSlvA?t=169
Example
Log Events
<110> id=firewall sn=2CB8EARGH780 time="2020-02-21 12:49:56" fw=192.168.115.2 pri=6 c=0 m=1154 msg="Application Control Detection Alert: PROTOCOLS HTTP Protocol -- HEAD" sid=6546 appcat="HTTP Protocol -- HEAD" appid=1277 catid=74 n=1698418 src=192.168.15.30:42744:X0-V170:PH-NUT dst=10.0.0.2:80:X5 srcMac=ff:bb:cc:dd:ee:9a dstMac=aa:bb:cc:dd:ee:c1 proto=tcp/http fw_action="NA"
<110> id=firewall sn=2CB8EARGH780 time="2020-02-21 12:49:56" fw=192.168.115.2 pri=6 c=512 m=602 msg="DNS packet allowed" app=49169 appName='DNS' n=27445261 src=172.217.9.78:53:X6 dst=10.0.0.3:53448:X0-V40 srcMac=ff:bb:cc:dd:ee:8b dstMac=aa:bb:cc:dd:ee:b1 proto=udp/dns rule="484 (Whee->VPN)" fw_action="forward"
<110> id=firewall sn=2CB8EARGH780 time="2020-02-21 12:49:56" fw=192.168.115.2 pri=6 c=0 m=1197 msg="NAT Mapping" app=49177 appName='HTTPS' n=61564336 src=172.217.9.69:57873:X0-V40 dst=10.0.0.4:443:X5 dstMac=aa:bb:cc:dd:ee:a1 proto=tcp/https note="Source: 192.168.115.2, 41829, Destination: 172.217.4.228, 443, Protocol: 6" rule="482 (Whee->WAN)" fw_action="NA"
Parser
{
  timezone: "GMT-0600",

  patterns: {
    valuePattern: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<>]+",
    valuePattern_s: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<> ]+"
  },

  formats: [
    {
      format: "<\\d+>\\s+.*"
    },
    {
      format: ".*time=\"$timestamp$\".*"
    },
    {
      format: ".*$_=identifier$=$_=valuePattern$.*",
      repeat: true
    },
    {
      format: ".*$_=identifier$=[\"']$_=valuePattern_s$[\"'].*",
      repeat: true
    }
  ]
}
Explanation
Define Timezone
timezone: "GMT-0600"
It’s always a good idea to define a timezone if the log’s timestamps do not include one, or are not recorded in UTC. If no timezone is defined, UTC is used as the default.
Define Patterns
patterns: {
  valuePattern: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<>]+",
  valuePattern_s: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<> ]+"
},
In this example, valuePattern and valuePattern_s both contain a list of expected characters; valuePattern_s additionally allows spaces (note the space at the end of its character class), which suits quoted values. The patterns section enables users to configure regular expressions in a central location that can be cleanly referenced (instead of inline definitions). Note that patterns cannot presently reference each other.
Ignore sequence prefix
format: "<\\d+>\\s+.*"
This line is extraneous, but was included to demonstrate how the sequence ID can be explicitly ignored without assigning a variable to it.
Extract timestamp
format: ".*time=\"$timestamp$\".*"
DataSet recognizes the timestamp format used in this log, so it can be readily extracted.
Extract key-value pairs
format: ".*$_=identifier$=$_=valuePattern$.*",
repeat: true
This line extracts key-value pairs that do not employ quotes; consequently, no spaces are anticipated in the value. The repeat: true statement applies the format repeatedly, capturing every match within the log event.
Extract key-value pairs that are enclosed in quotes
format: ".*$_=identifier$=[\"|']$_=valuePattern_s$[\"|'].*",
repeat: true
This line extracts key-value pairs whose values are enclosed in double or single quotes (and may therefore include spaces).
Use Wildcards Sparingly
The following regex could potentially cause a performance issue for the parser:
{ id: "agg_err", format:".*\\\\s.*\\\\s$db$ Aggregation finished with error:.*"} // BAD
Depending on the customer's incoming log volume, something like .*\\s.*\\s could cause performance issues, whereas a single .*\\s would match the same log lines with far less work.
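Applying that simplification, the rule above becomes:

{ id: "agg_err", format: ".*\\s$db$ Aggregation finished with error:.*" } // BETTER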
In conclusion, we recommend simplifying regular expressions whenever possible.