Introduction
DataSet uses regular expressions to match and extract patterns from your log data. Regular expressions compatible with the standard Java and Python libraries are supported, although some operations are not (more details below). The tips and best practices that follow cover the most common cases.
Tips
Conventions
- DataSet uses java.util.regex as its regex library
- The Scalyr Agent uses the 're' Python library
- DataSet Regex is case insensitive so [A-Z] = [a-z]
- Group naming is not supported
- In the Scalyr Agent redaction rules, \\1 and \\2 are the group back reference syntax. More information can be found here: https://app.scalyr.com/help/scalyr-agent#redaction
- In parsers, $1 and $2 are the group back reference syntax. The $0 group is also supported by rewrite rules
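As a rough illustration of the back-reference convention above, the sketch below uses Python's `re` module, the library the Scalyr Agent relies on. The log line and the `redact` helper are made up for this example and are not part of the agent. In plain Python the back reference is written `\1`; the `\\1` form shown above is what the redaction rule syntax expects. `re.IGNORECASE` is used only to approximate DataSet's case-insensitive matching.

```python
import re

def redact(line: str) -> str:
    # Hypothetical helper: mask a password value while keeping its key.
    # Group 1 captures the "password=" prefix, and the replacement string
    # refers back to it so that only the value is overwritten.
    return re.sub(r"(password=)\S+", r"\1REDACTED", line, flags=re.IGNORECASE)

sample = "user=alice Password=hunter2 action=login"
redacted = redact(sample)
print(redacted)  # user=alice Password=REDACTED action=login
```

Because the prefix is captured rather than retyped in the replacement, the original capitalization of the key survives the substitution.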
Testing Parsers
- Test parsers with free text here.
- This page allows you to paste in log data and confirm that your parser works as expected.
- Video Explanation: https://youtu.be/uNOiu8CVnJU?t=72
- Test parsers using your live data. Although this page looks similar to the "free text" parser above, it allows you to debug the workings of your format statements. Access this editor by:
- Clicking the User Menu -> "Parsers" and choosing the "Edit" or "Create" buttons, or
- Selecting a log line in the "Search" page and clicking "Inspect Fields" then "Edit Parser"
- Video Explanation: https://youtu.be/uNOiu8CVnJU?t=161
Lookaheads / Lookbehinds / Lookarounds
- Certain regex features are restricted:
- Positive and negative lookaheads and lookbehinds are not supported at all, because of the performance issues associated with them.
- Video Explanation: https://www.youtube.com/watch?v=trDh8omSlvA&feature=youtu.be&t=169
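One common way to avoid a lookaround is to match the surrounding text normally and capture only the part you need. The Python sketch below shows the idea; the log line and the `user` field name are invented for illustration, and the same pattern shape should carry over to a parser that extracts a capturing group.

```python
import re

line = "2020-08-28 user=alice status=200"

# Lookbehind style, which is not supported:  (?<=user=)\w+
# Alternative: match the "user=" prefix as ordinary text and capture the value.
match = re.search(r"user=(\w+)", line)
user = match.group(1) if match else None
print(user)  # alice
```

The prefix is still consumed by the match, so this works whenever only the captured group, and not the whole match, is what gets extracted.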
Searches
- When performing searches, $"regex" searches the message field. For example, $"tomcat" would match all log lines with "tomcat" in the message field. For those who are not yet familiar with it, the message field contains the original log line in its entirety
- This is shorthand for $message matches "regex"
- Double escaping is required everywhere except the $"regex" syntax in the search
- In the shorthand format, the regex does not need to be escaped:
$"\d+\.\d+\.\d+\.\d+"
- In the full syntax, the regex needs to be double escaped:
$message matches "\\d+\\.\\d+\\.\\d+\\.\\d+"
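The relationship between the two forms can be sketched in Python: the double-escaped version is the shorthand regex with every backslash doubled. The `double_escape` helper below is hypothetical and only builds the query text; it does not handle regexes that themselves contain double quotes.

```python
def double_escape(regex: str) -> str:
    # Hypothetical helper: double every backslash so the regex can be
    # placed inside the full `$message matches "..."` syntax.
    return regex.replace("\\", "\\\\")

shorthand = r"\d+\.\d+\.\d+\.\d+"          # as written inside $"..."
full = double_escape(shorthand)
query = f'$message matches "{full}"'
print(query)  # $message matches "\\d+\\.\\d+\\.\\d+\\.\\d+"
```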
Characters
Character | Legend | Example | Sample Match |
\d | One digit from 0 to 9 | log_\\d\\d | log_25 |
\w | "Word character": ASCII letter, digit or underscore | \\w-\\w\\w\\w | A-b_1 |
\s | "Whitespace character": space, tab, newline, carriage return, vertical tab | a\\sb\\sc | a b c |
\D | One character that is not a digit as defined by \\d | \\D\\D\\D | ABC |
\W | One character that is not a word character as defined by \\w | \\W\\W\\W\\W\\W | *-+=) |
\S | One character that is not a whitespace character as defined by \\s | \\S\\S\\S\\S | Yoyo |
\u\X | Match specific or ranges of unicode characters. See chart | .*[\u00A8] | ‰pò†…¨2020-08-28T20:28:34.343-0500 |
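The shorthand classes can be checked quickly with Python's `re` module. This is only a sketch: the patterns are written with single backslashes because Python raw strings need no extra escaping, and `re.ASCII` is passed so that `\d`, `\w` and `\s` stay limited to ASCII as the legend describes. The pattern and sample pairs are based on the table.

```python
import re

# (pattern, sample text) pairs based on the table above.
samples = [
    (r"log_\d\d", "log_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# fullmatch requires the whole sample to be consumed by the pattern.
results = {p: bool(re.fullmatch(p, s, flags=re.ASCII)) for p, s in samples}
for pattern, ok in results.items():
    print(f"{pattern}: {ok}")
```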
Quantifiers
Quantifier | Legend | Example | Sample Match |
+ | One or more | Version \\w-\\w+ | Version A-b1_1 |
{3} | Exactly three times | \\D{3} | ABC |
{2,4} | Two to four times | \\d{2,4} | 156 |
{3,} | Three or more times | \\w{3,} | regex_tutorial |
* | Zero or more times | A*B*C* | AAACC |
? | Once or none | plurals? | plural |
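A short Python sketch can confirm the quantifier samples and show what happens at the bounds of `{2,4}`. The patterns use single backslashes because they are Python raw strings; the variable names are made up for this example.

```python
import re

# (pattern, sample text) pairs based on the quantifier table.
samples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "ABC"),
    (r"\d{2,4}", "156"),
    (r"\w{3,}", "regex_tutorial"),
    (r"A*B*C*", "AAACC"),
    (r"plurals?", "plural"),
]
quantifier_results = [re.fullmatch(p, s) is not None for p, s in samples]
print(quantifier_results)

# {2,4} rejects inputs outside its bounds when the whole string must match.
too_short = re.fullmatch(r"\d{2,4}", "1")
too_long = re.fullmatch(r"\d{2,4}", "12345")
print(too_short, too_long)  # None None
```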
More Characters
Character | Legend | Example | Sample Match |
. | Any character except line break | a.c | abc |
. | Any character except line break | .* | whatever, man. |
\. | A period (special character: needs to be escaped by a \) | a\.c | a.c |
\ | Escapes a special character | \\.\\*\\+\\? \\$\\^\/\\\ | .*+? $^/\ |
\ | Escapes a special character | \\[\\{\\(\\)\\}\\] | [{()}] |
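When a literal string has to be matched exactly, escaping each special character by hand is error-prone. In Python, `re.escape` does this for you; the sketch below shows the difference it makes. Note that it produces single-level escaping, so backslashes would still need to be doubled wherever double escaping is required.

```python
import re

literal = "a.c"
escaped = re.escape(literal)
print(escaped)  # a\.c

# Unescaped, the dot matches any character, so "abc" slips through.
unescaped_hits_abc = re.fullmatch(literal, "abc") is not None
# Escaped, only the literal text "a.c" matches.
escaped_hits_abc = re.fullmatch(escaped, "abc") is not None
escaped_hits_literal = re.fullmatch(escaped, "a.c") is not None

# Brackets, braces and parentheses are handled the same way.
brackets = "[{()}]"
brackets_ok = re.fullmatch(re.escape(brackets), brackets) is not None
print(unescaped_hits_abc, escaped_hits_abc, escaped_hits_literal, brackets_ok)
```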
Logic
Logic | Legend | Example | Sample Match |
| | Alternation / OR operand | 22|33 | 33 |
( … ) | Capturing group | A(nt|pple) | Apple (captures "pple") |
agent: \1 parser: $1 | Contents of Group 1 | parser - r(\\w)g$1x agent - r(\\w)g\\1x | regex |
agent: \2 parser: $2 | Contents of Group 2 | parser - r(\\w)g$1x2 agent - r(\\w)g\\1x2 | regex2 |
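The alternation, capturing group and back-reference rows can be tried out in Python, which uses the agent-style `\1` syntax (written with a single backslash in a raw string). This is an illustrative sketch; the parser-style `$1` form is not valid in Python and is not shown.

```python
import re

# Alternation: either branch may match.
alternation = re.findall(r"22|33", "11 22 33 44")
print(alternation)  # ['22', '33']

# Capturing group: group 1 holds whichever alternative matched.
m = re.fullmatch(r"A(nt|pple)", "Apple")
captured = m.group(1)
print(captured)  # pple

# Back reference: \1 must repeat exactly what group 1 captured.
backref = re.fullmatch(r"r(\w)g\1x", "regex")
backref_group = backref.group(1)
print(backref_group)  # e

# "regax" fails because the character after "g" is not the captured "e".
no_match = re.fullmatch(r"r(\w)g\1x", "regax")
print(no_match)  # None
```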
More White-Space
Character | Legend | Example | Sample Match |
\n | New line | stack trace\ntrace | stack trace |
More Quantifiers
Quantifier | Legend | Example | Sample Match |
+ | The + (one or more) is "greedy" | \d+ | 12345 |
? | Makes quantifiers "lazy" | \d+? | 1 in 12345 |
* | The * (zero or more) is "greedy" | A* | AAA |
? | Makes quantifiers "lazy" | A*? | empty in AAA |
{2,4} | Two to four times, "greedy" | \w{2,4} | abcd |
? | Makes quantifiers "lazy" | \w{2,4}? | ab in abcd |
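The greedy and lazy rows above can be reproduced side by side in Python. The sketch uses `re.search`, which returns the first match found, so the lazy forms stop as soon as their minimum is satisfied.

```python
import re

greedy_digits = re.search(r"\d+", "12345").group()
lazy_digits = re.search(r"\d+?", "12345").group()
print(greedy_digits, lazy_digits)  # 12345 1

greedy_a = re.search(r"A*", "AAA").group()
lazy_a = re.search(r"A*?", "AAA").group()   # empty string: zero repetitions suffice
print(repr(greedy_a), repr(lazy_a))  # 'AAA' ''

greedy_w = re.search(r"\w{2,4}", "abcd").group()
lazy_w = re.search(r"\w{2,4}?", "abcd").group()
print(greedy_w, lazy_w)  # abcd ab
```

Lazy quantifiers matter most when something follows them in the pattern, since they then expand only as far as needed for the rest of the pattern to match.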
Character Classes
Character | Legend | Example | Sample Match |
[ … ] | One of the characters in the brackets | [AEIOU] | One uppercase vowel |
[ … ] | One of the characters in the brackets | T[ao]p | Tap or Top |
- | Range indicator | [a-z] | One lowercase letter |
[x-y] | One of the characters in the range from x to y | [A-Z]+ | GREAT |
[ … ] | One of the characters in the brackets | [AB1-5w-z] | One of either: A,B,1,2,3,4,5,w,x,y,z |
[x-y] | One of the characters in the range from x to y | [ -~]+ | Characters in the printable section of the ASCII table. |
[^x] | One character that is not x | [^a-z]{3} | A1! |
[^x-y] | One of the characters not in the range from x to y | [^ -~]+ | Characters that are not in the printable section of the ASCII table. |
[\d\D] | One character that is a digit or a non-digit | [\\d\\D]+ | Any characters, including new lines, which the regular dot doesn't match |
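A few of the character classes above are easier to understand with concrete input. The Python sketch below contrasts the dot with `[\d\D]` on multi-line text and splits a string into printable and non-printable ASCII runs. The sample strings are invented for illustration.

```python
import re

text = "first line\nsecond line"

# The dot stops at the line break; [\d\D] matches any character at all.
dot_match = re.search(r".+", text).group()
any_match = re.search(r"[\d\D]+", text).group()
print(repr(dot_match))  # 'first line'
print(repr(any_match))  # 'first line\nsecond line'

# [ -~] spans space through tilde, the printable ASCII range.
mixed = "ok\x00\x07done"
printable = re.findall(r"[ -~]+", mixed)
non_printable = re.findall(r"[^ -~]+", mixed)
print(printable, non_printable)  # ['ok', 'done'] ['\x00\x07']

# A class may mix single characters and ranges.
custom = re.findall(r"[AB1-5w-z]", "A9Bq3x")
print(custom)  # ['A', 'B', '3', 'x']
```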
Anchors and Boundaries
Anchor | Legend | Example | Sample Match |
^ or "regex" in parser format | Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not") | ^start.*end$ or "start.*the end" | abc (line start) |
$ or "regex" in a parser format | End of string or end of line depending on multiline mode. Many engine-dependent subtleties. | .*? the end$ OR ".*the end" | this is the end |
\b | Word boundary: position where only one side is an ASCII letter, digit or underscore | Bob.*\bcat\b | Bob ate the cat |
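The behavior of the anchors and of `\b` can be sketched in Python. The sample strings are made up, and `re.MULTILINE` stands in for the multiline mode mentioned in the table; how multiline mode is selected in DataSet itself is not covered here.

```python
import re

# \b requires a word boundary on both sides of "cat".
hit = re.search(r"Bob.*\bcat\b", "Bob ate the cat")
miss = re.search(r"Bob.*\bcat\b", "Bob read the catalog")
print(hit is not None, miss is not None)  # True False

# ^ and $ anchor the pattern to the start and end of the string.
anchored = re.search(r"^start.*end$", "start to end")
not_anchored = re.search(r"^start.*end$", "restart to end")
print(anchored is not None, not_anchored is not None)  # True False

# In multiline mode, ^ also matches right after each line break.
lines = "one two\nthree four"
first_words_default = re.findall(r"^\w+", lines)
first_words_multiline = re.findall(r"^\w+", lines, flags=re.MULTILINE)
print(first_words_default, first_words_multiline)  # ['one'] ['one', 'three']
```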