Introduction
DataSet parsers include a number of convenient predefined formats for structured data. These dynamic routines minimize the manual configuration needed when working with key-value pairs.
Configuration
Parsing JSON Data
Parse Entire Lines of JSON
Setting up a dynamic parser that extracts key-value pairs from JSON data is as simple as one format statement:
{
  formats: [
    {
      format: "${parse=json}$"
    }
  ]
}
This applies to log lines that are composed entirely of JSON.
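For example, a log line like the following (a hypothetical event, not taken from the examples below) needs no additional configuration:

{"status":"ok","latency":42}

With the format statement above, the parser would extract the attributes status and latency directly from the JSON.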
Parse JSON Instances Within Log Event
However, in some cases, log events may consist of different formats. For example,
{"blah":"whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
The original parser can be modified to ensure that the snippets of JSON within the log event are still processed:
{
  formats: [
    {
      format: ".*$=json{parse=json}$.*",
      repeat: true
    }
  ]
}
Where:
- The additional wildcards (.*) specify that the statement can occur anywhere within the line.
- The $=json notation means "if the format is JSON", while repeat: true applies the format statement to the entire log event.
Results
{"blah": "whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
bazArgh: ubuntu
bazComputer: blue
blah: whee
foo: bar
message: {"blah": "whee"} test1 test2 test3 {"foo": "bar", "baz": { "argh": "ubuntu", "computer": "blue" }} test4 test5
Extract the Positional Attributes
Since fields beginning with "test" are not key-value pairs, they can be extracted directly from the log event by their positions. For example,
{
  patterns: {
    testPattern: "test[0-9]+"
  },
  formats: [
    {
      format: "$=json{parse=json}$ $value1=testPattern$ $value2=testPattern$ $value3=testPattern$ $=json{parse=json}$ $value4=testPattern$ $value5=testPattern$"
    }
  ]
}
We've made a number of changes to the original:
- Added a pattern (testPattern), which defines the contents of each value.
- The repeat statement has been replaced by two JSON parse statements. Since we're also extracting positional attributes, some manual definition is required.
- We extract the value of each test field by its position within the log event. The pattern we defined (testPattern) helps ensure that the value we expect is extracted.
Result
{"blah": "whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
bazArgh: ubuntu
bazComputer: blue
blah: whee
foo: bar
message: {"blah": "whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
value1: test1
value2: test2
value3: test3
value4: test4
value5: test5
Nested Fields
Nested fields are flattened by combining the parent parameter ("baz", in this case) with its child parameters in camelCase format.
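For instance, in the examples above the nested object {"baz":{"argh":"ubuntu"}} produces the attribute bazArgh. Assuming the same camelCase combination applies at each level of nesting (a reasonable inference from the behavior shown, not confirmed here for deeper structures), a hypothetical event such as:

{"request":{"user":{"id":"42"}}}

would be flattened into an attribute along the lines of requestUserId.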
Accommodating Formatting Quirks and Irregularities
Built in parser formats are well suited for extracting attributes from data that follows an established standard (like JSON or Ruby hashes). However, unlike user-defined key-value pair parsers, they can't be customized to accommodate quirks in log data that doesn't adhere to the standard.
Example
A customer had logs containing Python dict data, but a number of fields didn't adhere to the expected Python dict format. We considered using scrubbing rules or redaction rules (within the Scalyr Agent) to restructure the log events before ingestion, but the better alternative was to sanitize the log data before sending it to DataSet. Although custom key-value pair formatting could be implemented at the parser level, there wasn't a good way to duplicate the functionality offered by the built-in format.
Blacklist / Whitelist
Built-in formats can also use the attrBlacklist or attrWhitelist parameters to simplify the process of including or excluding particular keys. Since both accept regular expressions, they offer a lot of flexibility when working with arbitrary key names; however, the general shape of the key name still needs to be known in advance. More on these parameters can be found in the DataSet parser documentation.
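As a rough sketch only (the exact placement and syntax of the parameter may differ; consult the parser documentation), a format that excludes any key whose name matches a debug-related pattern might look like the following, where attrBlacklist=debug.* is a hypothetical key pattern and its placement alongside parse=json is an assumption:

{
  formats: [
    {
      format: "${parse=json, attrBlacklist=debug.*}$"
    }
  ]
}

attrWhitelist would be used the same way, inverted: only keys matching the expression are kept.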