Introduction
Due to the open-ended nature of PowerQueries, there are many ways to accomplish a single task. I received some helpful input from our Engineering team regarding best practices for creating PowerQueries and wanted to take this opportunity to share them with our customers.
At their core, PowerQueries are best suited for spot analysis. Since results are limited to 100k rows or fewer, they weren't intended to replace data warehouse operations. However, PowerQueries make it simple to analyze attribute (key-value pair) data that originated in dissimilar log formats.
I will be adding more to this section over time, so please feel free to check back!
Please see this page for the most recent updates to DataSet PowerQueries.
Best Practices
Use a parser to extract attributes
Parsers were designed to extract values from log data with regular expressions. A parser is manually assigned through the DataSet upload method that handles your logs (normally the Agent or the HTTP API). After log data is ingested, the assigned parser is applied to each event. Log fields that match the customer-defined regular expressions are stored as attributes (key-value pairs). The resultant attributes are optimized for speed and can be used in PowerQueries, searches, graphs, alerts, and dashboards.
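To make this concrete, a parser definition is a JSON snippet whose format strings name the fields to extract. The sketch below is illustrative only; the log layout and field names are hypothetical, and real parsers will vary:
// Illustrative only: a hypothetical parser for log lines such as
//   2024-01-01T00:00:00Z status=500 euidalert=jdoe
{
  formats: [
    { format: "$timestamp$ status=$status$ euidalert=$euidalert$" }
  ]
}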
The parse function
PowerQueries are capable of extracting attributes on the fly with the parse function. This is particularly useful if you're trying to parse data in an attribute that did not originate from the message field. However, please note that:
- Attributes extracted by the parse function are only usable within the context of the PowerQuery where parse was defined
- The parse function incurs some overhead, which may result in slower PowerQuery execution
- The parse function does not support automatic key-value pair extraction (e.g. the ${parse=<type>}$ directive, where <type> is JSON, pythonDict, rubyHash, etc.)
Therefore, in most use cases, we recommend using a parser.
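For illustration, here is a minimal sketch of parse extracting fields on the fly. The pattern and attribute names are hypothetical; it assumes an attribute named detail containing text like status=500 euid=jdoe:
logfile == "/var/log/argh.log"
// Hypothetical: extract status and euid from the detail attribute
| parse "status=$status$ euid=$euid$" from detail
| group c = count() by euid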
Recommendations
Since PowerQueries are relatively similar to SQL, I tend to follow SQL best practices when implementing them, such as:
- Minimize aggregation and avoid using passthroughs. For example, if you're doing a group at an early stage of the query and are carrying a column through multiple group statements, consider calculating this column by aggregation at a later stage (of course, this depends on the data structure and what you're trying to do).
- Avoid sorting until the final level of a query; sorting before then has no benefit and typically isn't needed unless an aggregate function like first/last/newest/oldest requires it.
- Use join and union statements sparingly.
- Assign shorter / more relevant column names as soon as possible (in SQL this is normally done in the SELECT statement; for PowerQueries, use the columns function, as shown in the sketch after this list), as this reduces the hassle of grouping columns with special characters in their name.
- Only extract and process columns that are needed for the final solution.
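Here is a brief sketch of that renaming advice, borrowing attribute names from the example query in the next section; the point is simply that columns can shorten a name before later stages reference it:
euidalert=* logfile == "/var/log/argh.log" status=500
// Rename the attribute up front so later stages use the short name
| columns euid = euidalert
| group c = count() by euid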
Example Query
This article will be referencing the following simple query that I wrote:
(($euidalert == *)) && $logfile == "/var/log/argh.log" |
filter ($status==500) |
group c=count() by euidAlert=lower($euidalert) |
filter ($c >= 2) |
sort -c
Query Structure
The $ Prefix
The $ prefix (sigil) is optional and can be omitted from PowerQueries.
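For example, these two search expressions are equivalent:
$logfile == "/var/log/argh.log"
// Becomes
logfile == "/var/log/argh.log"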
Implied AND (&&)
The double ampersand (&&) is implied and not necessary in the first line.
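For example, using the first line of our example query:
euidalert=* && logfile == "/var/log/argh.log"
// Becomes
euidalert=* logfile == "/var/log/argh.log"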
Line 1
The first line of a PowerQuery is a standard DataSet search query; whenever possible, include as much detail here as you can to improve relevance. Combining these criteria into the first line improves the performance of the overall query.
Additionally, filter statements are typically used after aggregation functions are applied to log events; minimize their use when possible. In the context of a DB operation, the DataSet query establishes the temporary table that further operations will be applied to.
For example, lines 1-2 of the original query can be simplified:
(($euidalert == *)) && $logfile == "/var/log/argh.log" |
filter ($status==500) |
...
// Becomes
euidalert=* logfile == "/var/log/argh.log" status=500
...
Pipe Characters (|)
We recommend prefixing each line with the pipe character, as this makes the PowerQuery more readable. It's also easier to distinguish between new lines and continued lines when this convention is followed.
The Final Query
euidalert=* logfile == "/var/log/argh.log" status=500
| group c=count() by euidAlert=lower(euidalert)
| filter c >= 2
| sort -c
By following these steps, our original query was reduced from 156 to 127 characters (nearly 20%) while becoming significantly more legible.
Conclusion
As you can see from the final example, our PowerQuery retained its original functionality with fewer characters and improved legibility. Creating a query with fewer characters is particularly advantageous when the query is complex and the DataSet API is used to issue it.