Introduction
Due to the open-ended nature of PowerQueries, there are many ways to accomplish a single task. I received some helpful input from our Engineering team regarding best practices for creating PowerQueries and wanted to take this opportunity to share them with our customers.
At their core, PowerQueries are best suited for spot analysis. Since results are limited to 100k rows or fewer, they weren't intended to replace data warehouse operations. However, PowerQueries make it simple to analyze attribute (key-value pair) data that originated in dissimilar log formats.
I will be adding more to this section over time, so please feel free to check back!
Please see this page for the most recent updates to DataSet PowerQueries.
Best Practices
Use a parser to extract attributes
Parsers were designed to extract values from log data with regular expressions. A parser is manually assigned through the DataSet upload method that handles your logs (normally the Agent or the HTTP API). After log data is ingested, the assigned parser is applied to each event. Log fields that match the customer-defined regular expressions are stored as attributes (key-value pairs). The resultant attributes are optimized for speed and can be used in PowerQueries, searches, graphs, alerts, and dashboards.
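To make this concrete, a parser definition is a JSON snippet whose format strings name the fields to extract. The sketch below is illustrative only; the log layout and field names are hypothetical, and real parsers will vary:
// Illustrative only: a hypothetical parser for log lines such as
//   2024-01-01T00:00:00Z status=500 euidalert=jdoe
{
  formats: [
    { format: "$timestamp$ status=$status$ euidalert=$euidalert$" }
  ]
}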
The parse function
PowerQueries are capable of extracting attributes on the fly with the parse function. This is particularly useful if you're trying to parse data in an attribute that did not originate from the message field. However, please note that:
- Attributes extracted by the parse function are only usable within the context of the PowerQuery where parse was defined
- The parse function incurs some overhead, which may result in slower PowerQuery execution
- The parse function does not support automatic key-value pair extraction (e.g. the ${parse=<type>}$ directive, where <type> is JSON, pythonDict, rubyHash, etc.)
Therefore, in most use cases, we recommend using a parser.
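For illustration, here is a minimal sketch of parse extracting fields on the fly. The pattern and attribute names are hypothetical; it assumes an attribute named detail containing text like status=500 euid=jdoe:
logfile == "/var/log/argh.log"
// Hypothetical: extract status and euid from the detail attribute
| parse "status=$status$ euid=$euid$" from detail
| group c = count() by euid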
Recommendations
Since PowerQueries are relatively similar to SQL, I tend to follow SQL best practices when implementing them, such as:
- Minimize aggregation and avoid using passthroughs. For example, if you're doing a group at an early stage of the query and are carrying a column through multiple group statements, consider calculating this column by aggregation at a later stage (of course, this depends on the data structure and what you're trying to do).
- Avoid sorting until the final level of a query; sorting before then has no benefit and typically isn't needed unless an aggregate function like first/last/newest/oldest requires it.
- Use join and union statements sparingly.
- Assign shorter / more relevant column names as soon as possible (in SQL this is normally done in the SELECT statement; for PowerQueries, use the columns function, as shown in the sketch after this list), as this reduces the hassle of grouping columns with special characters in their name.
- Only extract and process columns that are needed for the final solution.
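Here is a brief sketch of that renaming advice, borrowing attribute names from the example query in the next section; the point is simply that columns can shorten a name before later stages reference it:
euidalert=* logfile == "/var/log/argh.log" status=500
// Rename the attribute up front so later stages use the short name
| columns euid = euidalert
| group c = count() by euid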
Example Query
This article will be referencing the following simple query that I wrote:
(($euidalert == *)) && $logfile == "/var/log/argh.log" |
filter ($status==500) |
group c=count() by euidAlert=lower($euidalert) |
filter ($c >= 2) |
sort -c
Query Structure
The $ Prefix
The $ prefix (sigil) is optional and can be omitted from PowerQueries.
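For example, these two search expressions are equivalent:
$logfile == "/var/log/argh.log"
// Becomes
logfile == "/var/log/argh.log"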
Implied AND (&&)
The double ampersand (&&) is implied and not necessary in the first line.
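For example, using the first line of our example query:
euidalert=* && logfile == "/var/log/argh.log"
// Becomes
euidalert=* logfile == "/var/log/argh.log"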
Line 1
The first line of a PowerQuery is a standard DataSet search query; whenever possible, include as much detail here as you can to improve relevance. Combining these criteria into the first line improves the performance of the overall query.
Additionally, filter statements are typically used after aggregation functions are applied to log events; minimize their use when possible. In the context of a DB operation, the DataSet query establishes the temporary table that further operations will be applied to.
For example, lines 1-2 of the original query can be simplified:
(($euidalert == *)) && $logfile == "/var/log/argh.log" |
filter ($status==500) |
...
// Becomes
euidalert=* logfile == "/var/log/argh.log" status=500
...
Pipe Characters (|)
We recommend prefixing each line with the pipe character, as this makes the PowerQuery more readable. It's also easier to distinguish between new lines and continued lines when this convention is followed.
The Final Query
euidalert=* logfile == "/var/log/argh.log" status=500
| group c=count() by euidAlert=lower(euidalert)
| filter c >= 2
| sort -c
By following these steps, our original query was reduced from 156 to 127 characters (nearly 20%) while becoming significantly more legible.
Conclusion
As you can see from the final example, our PowerQuery retained its original functionality with fewer characters and improved legibility. Creating a query with fewer characters is particularly advantageous when the query is complex and the DataSet API is used to issue it.