Introduction
Parsers are used to formally extract data from log events. Once parsed, the data can be used in search queries, alerts, or PowerQueries.
DataSet parsers use regular expressions to extract data from a variety of log formats. If you master the following three tips, you will be able to build a parser that handles most use cases.
Here is the sample log that will be used to demonstrate each step:
2023/02/18 19:39:28 { "name":"john", "level":"DEBUG", "msg":"parser test, env: demo4" }
2023/02/18 19:39:29 { "name":"joel", "host":"prod", "level":"INFO", "code":"200", "url":"/testdemo", "msg":[
"i-0ecbcdc50907aba74 [api][P] com-scalyr"
] }
2023/02/18 19:39:30 { "name":"wei", "host":"dev", "level":"INFO", "code":"200", "url":"/serviceping", "msg":[
"i-0ecbcdc50907aba74 [search][P] com-scalyr"
] }
2023/02/18 19:39:31 { "name":"mark", "level":"DEBUG", "msg":"end of test: Reply sent to demo4" }
You can copy it into the log parser tester and follow the instructions below to build a parser step by step.
1. Extract the timestamp of the parsed message
Video Explanation: https://youtu.be/uNOiu8CVnJU?t=643
If your message includes a timestamp, extract it. This ensures that the timestamp assigned by your platform is also recognized by DataSet. Once the timestamp field is parsed, its value appears in the output preceded by the timestamp: field name. Once the field has been recognized, verify that the timestamp interpreted by the parser matches the actual timestamp of the message.
I am going to use the first sample line to demonstrate this feature.
The following parsing format parses the timestamp from the log line and assigns the rest of the message to a custom field msg.
{
formats: [
{ format: "$timestamp$ $msg$"}
]
}
After the parser processes the message, the parser tester output shows that only the date part of the timestamp field was extracted.
timestamp: 2023/02/18 (parsed as: Sat Feb 18, 2023 12:00:00 AM GMT, i.e. x minutes ago)
To fix this, we can define a custom pattern called tsPattern that uses a regular expression to match the date and time segments of the timestamp. This ensures that only data matching the format we defined is extracted as the timestamp.
{
patterns:{ tsPattern: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+"},
formats: [
{ format: "$timestamp=tsPattern$ $msg$"}
]
}
Re-run the parser tester to verify that it works as expected.
2023/02/18 19:39:28 { "name":"john", "level":"DEBUG", "msg":"parser test, env: demo4" }
message: 2023/02/18 19:39:28 { "name":"john", "level":"DEBUG", "msg":"parser test, env: demo4" }
msg: { "name":"john", "level":"DEBUG", "msg":"parser test, env: demo4" }
timestamp: 2023/02/18 19:39:28 (parsed as: Sat Feb 18, 2023 7:39:28 PM GMT, i.e. x minutes ago)
Note the timestamp: prefix, which confirms the timestamp was processed.
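DataSet evaluates tsPattern itself, but you can sanity-check the regular expression outside the parser before re-running the tester. This is a minimal Python sketch (illustrative only, not part of DataSet) showing that the pattern matches the full date and time, leaving the JSON remainder for msg:

```python
import re

# Python equivalent of the parser's tsPattern: four digits, slash,
# two digits, slash, two digits, a space, then digits and colons.
ts_pattern = re.compile(r"\d{4}/\d{2}/\d{2} [\d:]+")

line = '2023/02/18 19:39:28 { "name":"john", "level":"DEBUG", "msg":"parser test, env: demo4" }'

match = ts_pattern.match(line)
print(match.group(0))              # 2023/02/18 19:39:28
print(line[match.end():].strip())  # the JSON remainder becomes msg
```

Without the time portion in the pattern, the match would stop at the first space after the date, which is exactly the truncation the parser tester reported above.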
2. Group lines that belong to the same event
Video Explanation: https://youtu.be/3Miia70yADE?t=516
A single message may span multiple lines, so it's important to group those lines together so that they are returned as one message when searching in DataSet. The lineGroupers setting was designed specifically for this use case.
You can find a couple of these multi-line messages in the sample log (above):
...
2023/02/18 19:39:29 { "name":"joel", "host":"prod", "level":"INFO", "code":"200", "url":"/testdemo", "msg":[
"i-0ecbcdc50907aba74 [api][P] com-scalyr"
] }
2023/02/18 19:39:30 { "name":"wei", "host":"dev", "level":"INFO", "code":"200", "url":"/serviceping", "msg":[
"i-0ecbcdc50907aba74 [search][P] com-scalyr"
] }
...
Please refer to the "Multi-Line Messages" section of the parser documentation for the available syntax to group lines for your use case. Since my logs always start with a timestamp, I decided to use the timestamp as the message separator and used the haltBefore statement to stop appending additional lines to the message.
{
patterns:{ tsPattern: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+"},
lineGroupers: [
{
start: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+",
haltBefore: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+"
}
],
formats: [
{ format: "$timestamp=tsPattern$ $msg$"}
]
}
The lineGroupers statement appears to be working as expected, since the entire JSON object is now assigned to the msg field.
2023/02/18 19:39:29 { "name":"joel", "host":"prod", "level":"INFO", "code":"200", "url":"/testdemo", "msg":[
"i-0ecbcdc50907aba74 [api][P] com-scalyr"
] }
message: 2023/02/18 19:39:29 { "name":"joel", "host":"prod", "level":"INFO", "code":"200", "url":"/testdemo", "msg":[
"i-0ecbcdc50907aba74 [api][P] com-scalyr"
] }
msg: { "name":"joel", "host":"prod", "level":"INFO", "code":"200", "url":"/testdemo", "msg":[
"i-0ecbcdc50907aba74 [api][P] com-scalyr"
] }
timestamp: 2023/02/18 19:39:29 (parsed as: Sat Feb 18, 2023 7:39:29 PM GMT, i.e. x minutes ago)
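The start/haltBefore behavior can be approximated outside DataSet to build intuition for it. This is an illustrative Python sketch (the group_lines helper is hypothetical, not a DataSet API): a new event begins at each line matching the start pattern, and the current event halts before the next matching line.

```python
import re

# Same expression used for start and haltBefore in the config above.
start = re.compile(r"\d{4}/\d{2}/\d{2} [\d:]+")

def group_lines(lines):
    """Start a new event at each timestamped line; continuation
    lines are appended to the current event (haltBefore semantics)."""
    events, current = [], []
    for line in lines:
        if start.match(line) and current:
            events.append("\n".join(current))  # flush previous event
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

log = [
    '2023/02/18 19:39:29 { "name":"joel", "msg":[',
    '  "i-0ecbcdc50907aba74 [api][P] com-scalyr"',
    '] }',
    '2023/02/18 19:39:30 { "name":"wei", "msg":[',
    '  "i-0ecbcdc50907aba74 [search][P] com-scalyr"',
    '] }',
]
print(len(group_lines(log)))  # 2 events, each spanning three lines
```

The six physical lines collapse into two events, mirroring how the tester output above shows each multi-line JSON object as a single message.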
3. Use predefined patterns to extract values from dynamic fields
Video Explanation: https://youtu.be/bplhUtiaso8?t=240
Dynamic fields are populated by a set of consistently formatted values, such as a JSON object, a URI, or key-value pairs; DataSet supports a number of well-known data formats. Attributes within a dynamic field are extracted by name and value, wherever they occur. Consequently, you no longer need to configure format statements to extract specific values from fixed positions in a log line, provided that the log formatting is consistent and correct.
For example, we use the $=json{parse=json}$ statement to extract values from the JSON object.
{
patterns:{ tsPattern: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+"},
lineGroupers: [
{
start: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+",
haltBefore: "\\d{4}\/\\d{2}\/\\d{2} [\\d:]+"
}
],
formats: [
{
format: "$timestamp=tsPattern$ $=json{parse=json}$"
}
]
}
Apply the parser to the sample log lines to confirm that the JSON fields are correctly parsed.
2023/02/18 19:39:31 { "name":"mark", "level":"DEBUG", "msg":"end of test: Reply sent to demo4" }
level: DEBUG
message: 2023/02/18 19:39:31 { "name":"mark", "level":"DEBUG", "msg":"end of test: Reply sent to demo4" }
msg: end of test: Reply sent to demo4
name: mark
timestamp: 2023/02/18 19:39:31 (parsed as: Sat Feb 18, 2023 7:39:31 PM GMT, i.e. x minutes ago)
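Conceptually, the final format statement splits each event into a timestamp and a JSON payload. The following Python sketch (illustrative only, not how DataSet is implemented) mimics that two-step extraction: match the timestamp, then hand the remainder to a JSON parser.

```python
import json
import re

# Same timestamp pattern as tsPattern in the parser config.
ts_pattern = re.compile(r"\d{4}/\d{2}/\d{2} [\d:]+")

line = ('2023/02/18 19:39:31 '
        '{ "name":"mark", "level":"DEBUG", "msg":"end of test: Reply sent to demo4" }')

m = ts_pattern.match(line)
timestamp = m.group(0)
fields = json.loads(line[m.end():])  # analogous to parse=json

print(timestamp)        # 2023/02/18 19:39:31
print(fields["name"])   # mark
print(fields["level"])  # DEBUG
```

Each JSON attribute (name, level, msg) becomes its own searchable field, matching the parser tester output above, with no per-field format statements required.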