Introduction
DataSet parsers include a number of convenient predefined formats for structured data. These dynamic routines minimize the manual configuration needed when working with key-value pairs.
Configuration
Parsing JSON Data
Parse Entire Lines of JSON
Setting up a dynamic parser that extracts key-value pairs from JSON data is as simple as one format statement:
{
  formats: [
    {
      format: "${parse=json}$"
    }
  ]
}
This applies to log lines that are composed entirely of JSON.
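For example, a log line like the following (a hypothetical event, not taken from the examples below) needs no additional configuration:

{"status":"ok","latency":42}

With the format statement above, the parser would extract the attributes status and latency directly from the JSON.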
Parse JSON Instances Within Log Event
However, in some cases, log events may consist of different formats. For example,
{"blah":"whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
The original parser can be modified to ensure that the snippets of JSON within the log event are still processed:
{
  formats: [
    {
      format: ".*$=json{parse=json}$.*",
      repeat: true
    }
  ]
}
Where:
- The additional wildcards (.*) specify that the statement can occur anywhere within the line.
- The $=json notation means "if the format is JSON", while repeat: true applies the format statement to the entire log event.
Results
{"blah": "whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
bazArgh: ubuntu
bazComputer: blue
blah: whee
foo: bar
message: {"blah": "whee"} test1 test2 test3 {"foo": "bar", "baz": { "argh": "ubuntu", "computer": "blue" }} test4 test5
Extract the Positional Attributes
Since fields beginning with "test" are not key-value pairs, they can be extracted directly from the log event by their positions. For example,
{
  patterns: {
    testPattern: "test[0-9]+"
  },
  formats: [
    {
      format: "$=json{parse=json}$ $value1=testPattern$ $value2=testPattern$ $value3=testPattern$ $=json{parse=json}$ $value4=testPattern$ $value5=testPattern$"
    }
  ]
}
We've made a number of changes to the original:
- Added a pattern (testPattern), which defines the contents of each value.
- The repeat statement has been replaced by two JSON parse statements. Since we're also extracting positional attributes, some manual definition is required.
- We extract the value of each test field by its position within the log event. The pattern we defined (testPattern) helps ensure that the value we expect is extracted.
Result
{"blah": "whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
bazArgh: ubuntu
bazComputer: blue
blah: whee
foo: bar
message: {"blah": "whee"} test1 test2 test3 {"foo":"bar","baz":{"argh":"ubuntu","computer":"blue"}} test4 test5
value1: test1
value2: test2
value3: test3
value4: test4
value5: test5
Nested Fields
Nested fields are flattened by combining the parent parameter ("baz", in this case) with its child parameters in camelCase format.
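For instance, in the examples above the nested object {"baz":{"argh":"ubuntu"}} produces the attribute bazArgh. Assuming the same camelCase combination applies at each level of nesting (a reasonable inference from the behavior shown, not confirmed here for deeper structures), a hypothetical event such as:

{"request":{"user":{"id":"42"}}}

would be flattened into an attribute along the lines of requestUserId.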
Accommodating Formatting Quirks and Irregularities
Built in parser formats are well suited for extracting attributes from data that follows an established standard (like JSON or Ruby hashes). However, unlike user-defined key-value pair parsers, they can't be customized to accommodate quirks in log data that doesn't adhere to the standard.
Example
A customer had logs containing Python dict data, but a number of fields didn't adhere to the expected Python dict format. We considered using scrubbing rules or redaction rules (within the Scalyr Agent) to restructure the log events before ingestion, but the better alternative was to sanitize the log data before sending it to DataSet. Although custom key-value pair formatting could be implemented at the parser level, there wasn't a good way to duplicate the functionality offered by the built-in format.
Blacklist / Whitelist
Built-in formats can also use the attrBlacklist or attrWhitelist parameters to simplify the process of including or excluding particular keys. Since both accept regular expressions, they offer a lot of flexibility when working with arbitrary key names; however, the general shape of the key name still needs to be known in advance. More on these parameters can be found in the DataSet parser documentation.
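As a rough sketch only (the exact placement and syntax of the parameter may differ; consult the parser documentation), a format that excludes any key whose name matches a debug-related pattern might look like the following, where attrBlacklist=debug.* is a hypothetical key pattern and its placement alongside parse=json is an assumption:

{
  formats: [
    {
      format: "${parse=json, attrBlacklist=debug.*}$"
    }
  ]
}

attrWhitelist would be used the same way, inverted: only keys matching the expression are kept.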