Introduction
This article covers some best practices that I discovered after configuring many parsers for different applications. It assumes that you have a basic knowledge of how to work with DataSet parsers and regular expressions. If you have not already done so, please be sure to review https://app.scalyr.com/help/parsing-logs before proceeding.
Also, feel free to submit any of your favorite parsing tips or tricks! We love hearing from our customers.
Make use of built-in DataSet values
The timestamp field is probably the most important of all built-in fields, as it ensures that DataSet is synced with when the log event occurred. Timestamp formats vary, so it’s important to get this right. If DataSet doesn’t have a predefined format for the timestamp, one can be constructed with regular expressions (and occasionally, a rewrite rule). More on this below.
Video Explanation: https://youtu.be/uNOiu8CVnJU?t=643
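If a custom format is needed, a named pattern can bound the extraction. A minimal sketch, assuming a timestamp such as 2020-02-21 12:49:56 at the start of each line (the pattern name is illustrative):

{
  patterns: {
    // illustrative: matches "2020-02-21 12:49:56"
    tsPattern: "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
  },
  formats: [
    { format: "$timestamp=tsPattern$\\s.*" }
  ]
}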
severity is commonly used to quickly identify what a log event represents. Is it normal output or a critical error? severity allows this to be determined at a glance. This information is usually present in a consistent position within the log line, making it easy to extract (e.g., INFO, DEBUG, ERROR).
Video Explanation: https://youtu.be/bplhUtiaso8?t=498
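As a hedged sketch, the level keyword can be extracted by its position. The bracketed keyword after the timestamp is an assumption about the log layout, not taken from the examples below:

{ format: "$timestamp$ \\[$severity=identifier$\\] .*" }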
The message field contains the complete log message that was sent to DataSet. Regardless of the parsing operations that take place, we retain each log event in its entirety. It can be quickly evaluated from the search results, or by clicking a particular result and viewing the "Inspect Log Line" dialog:
[Screenshot: search results]
[Screenshot: Inspect Log Line dialog]
Video Explanation: https://youtu.be/bplhUtiaso8?t=203
Focus on similarities first
DataSet parsers process rules in the order they occur. I typically begin a parser by processing fields that are common to an application’s log events from left to right. For example, the timestamp is almost always in the same location and often maintains the same format. Since parser rules are applied in cascading order, it’s possible to target specific segments of each log event. Don’t feel obligated to parse entire log lines at once!
Once your parser has been configured for the similarities of an application’s logs, expand to the differences. Each format statement within the formats block can be used to process variations as needed. Exit the formats block with a halt: true statement, if necessary.
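A minimal sketch of this progression (the error and status fields are illustrative):

formats: [
  // similarity: every line begins with a timestamp
  { format: "$timestamp$ .*" },
  // difference: error lines get their own handling, then processing stops
  { format: ".*ERROR $errorCode=identifier$.*", halt: true },
  // difference: remaining lines carry a status field
  { format: ".*status=$status=identifier$.*" }
]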
Distilling Log Data
Log Lines
Log lines that don't contain any useful information can be removed with the discard statement (in conjunction with a format statement). The entire log line will be ignored (and won't be stored). See https://app.scalyr.com/help/parsing-logs#discard for more information.
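For example, a sketch that drops load-balancer health checks (the matched text is an assumption about your logs):

{ format: ".*GET /healthcheck.*", discard: true }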
Video Explanation: https://youtu.be/ug-Br8TO__4?t=303
Key-Value Pair Data
Since key-value pair / JSON data is parsed automatically, matching fields are extracted without additional configuration. However, you can use the attrBlacklist field to avoid extracting / storing unneeded attributes. More information can be found here: https://app.scalyr.com/help/parsing-logs#valueLists
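A hedged sketch, assuming attrBlacklist sits at the top level of the parser; the attribute names here are hypothetical (consult the linked documentation for exact placement):

{
  // drop attributes whose names match this regex (names are hypothetical)
  attrBlacklist: "debug_info|internal_.*",
  patterns: {
    valueChars: "[^\\s\"]+"
  },
  formats: [
    { format: ".*$_=identifier$=$_=valueChars$.*", repeat: true }
  ]
}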
Key-Value Pair Processing
Whenever possible, apply key-value pair processing to extract values. This approach has several advantages:
- Parameters will retain the names they were assigned in the log events
- Minimizes overhead
- Increases versatility - fields are extracted by their formatting, not their static position within the log line
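To illustrate the last point, compare a positional rule with a key-value rule (the user field is illustrative):

// Positional: breaks if another field is inserted ahead of the value
{ format: "\\S+\\s+\\S+\\s+$user=identifier$\\s.*" }
// Key-value: matches user= wherever it appears in the line
{ format: ".*user=$user=identifier$.*" }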
Patterns
The patterns field can be used to define regular expressions that are intended to be reused within a format line. This cleans up the parser configuration while ensuring that the regular expression isn’t defined in multiple (inline) locations. Note: At this time, the rewrites statement cannot utilize an entry in patterns.
In cases where the character content is not easily defined, you can use exclusions. For example, [^\\s]+ will accept any character(s) other than whitespace.
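A small sketch using an exclusion pattern (the srcIp field name is illustrative; src= appears in the example logs below):

patterns: {
  // any run of non-whitespace characters
  notSpace: "[^\\s]+"
},
formats: [
  { format: ".*src=$srcIp=notSpace$.*" }
]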
Video Explanation: https://youtu.be/3Miia70yADE?t=281
Regex lookbehinds / lookaheads are not supported.
Video Explanation: https://youtu.be/trDh8omSlvA?t=169
Example
Log Events
<110> id=firewall sn=2CB8EARGH780 time="2020-02-21 12:49:56" fw=192.168.115.2 pri=6 c=0 m=1154 msg="Application Control Detection Alert: PROTOCOLS HTTP Protocol -- HEAD" sid=6546 appcat="HTTP Protocol -- HEAD" appid=1277 catid=74 n=1698418 src=192.168.15.30:42744:X0-V170:PH-NUT dst=10.0.0.2:80:X5 srcMac=ff:bb:cc:dd:ee:9a dstMac=aa:bb:cc:dd:ee:c1 proto=tcp/http fw_action="NA"
<110> id=firewall sn=2CB8EARGH780 time="2020-02-21 12:49:56" fw=192.168.115.2 pri=6 c=512 m=602 msg="DNS packet allowed" app=49169 appName='DNS' n=27445261 src=172.217.9.78:53:X6 dst=10.0.0.3:53448:X0-V40 srcMac=ff:bb:cc:dd:ee:8b dstMac=aa:bb:cc:dd:ee:b1 proto=udp/dns rule="484 (Whee->VPN)" fw_action="forward"
<110> id=firewall sn=2CB8EARGH780 time="2020-02-21 12:49:56" fw=192.168.115.2 pri=6 c=0 m=1197 msg="NAT Mapping" app=49177 appName='HTTPS' n=61564336 src=172.217.9.69:57873:X0-V40 dst=10.0.0.4:443:X5 dstMac=aa:bb:cc:dd:ee:a1 proto=tcp/https note="Source: 192.168.115.2, 41829, Destination: 172.217.4.228, 443, Protocol: 6" rule="482 (Whee->WAN)" fw_action="NA"
Parser
{
  timezone: "GMT-0600",

  patterns: {
    valuePattern: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<>]+",
    valuePattern_s: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<> ]+"
  },

  formats: [
    {
      format: "<\\d+>\\s+.*"
    },
    {
      format: ".*time=\"$timestamp$\".*"
    },
    {
      format: ".*$_=identifier$=$_=valuePattern$.*",
      repeat: true
    },
    {
      format: ".*$_=identifier$=[\"']$_=valuePattern_s$[\"'].*",
      repeat: true
    }
  ]
}
Explanation
Define Timezone
timezone: "GMT-0600"
It’s always a good idea to define a timezone if the log’s timestamps do not include one, or are not recorded in UTC. If no timezone is defined, UTC is used as the default.
Define Patterns
patterns: {
  valuePattern: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<>]+",
  valuePattern_s: "[a-zA-Z0-9\/\\-,_.:;@\\(\\)<> ]+"
},
In this example, valuePattern and valuePattern_s both contain a list of expected characters; valuePattern_s additionally allows spaces (note the space at the end of its character class), which suits quoted values. The patterns section enables users to configure regular expressions in a central location that can be cleanly referenced (instead of inline definitions). Note that patterns cannot presently reference each other.
Ignore sequence prefix
format: "<\\d+>\\s+.*"
This line is extraneous, but was included to demonstrate how the sequence ID can be explicitly ignored without assigning a variable to it.
Extract timestamp
format: ".*time=\"$timestamp$\".*"
DataSet recognizes the timestamp format used in this log, so it can be readily extracted.
Extract key-value pairs
format: ".*$_=identifier$=$_=valuePattern$.*",
repeat: true
This line extracts key-value pairs that do not employ quotes; consequently, no spaces are anticipated in the value. The repeat: true statement applies the format repeatedly, capturing every match within the log event.
Extract key-value pairs that are enclosed in quotes
format: ".*$_=identifier$=[\"|']$_=valuePattern_s$[\"|'].*",
repeat: true
This line extracts key-value pairs whose values are enclosed in double or single quotes (and may therefore include spaces).
Use Wildcards Sparingly
The following regex could potentially cause a performance issue for the parser:
{ id: "agg_err", format:".*\\\\s.*\\\\s$db$ Aggregation finished with error:.*"} // BAD
Depending on the customer's incoming log volume, something like .*\\s.*\\s could cause performance issues, whereas a single .*\\s would match the same log lines with far less work.
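Applying that simplification, the rule above becomes:

{ id: "agg_err", format: ".*\\s$db$ Aggregation finished with error:.*" } // BETTER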
In conclusion, we recommend simplifying regular expressions whenever possible.