DataSet Regex – DataSet Customer Portal

Introduction

DataSet uses regular expressions to match and extract patterns from your log data. Regular expressions that are compatible with the standard Java and Python libraries are supported, although some operations are not (more details below). The tips and best practices that follow cover the main conventions.

Tips

Conventions

DataSet uses java.util.regex as its regex library
The Scalyr Agent uses the 're' Python library
DataSet Regex is case insensitive, so [A-Z] matches the same characters as [a-z]
Group naming is not supported
In the Scalyr Agent redaction rules, \\1 and \\2 are the group back reference syntax. More information can be found here: https://app.scalyr.com/help/scalyr-agent#redaction
In parsers, $1 and $2 are the group back reference syntax. The $0 group is also supported by rewrite rules
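As a rough illustration of the agent-style back reference, the sketch below uses Python's re module directly. The sample log line and the redact_password helper are hypothetical and not part of the Scalyr Agent. In a Python raw string the reference is written \1; the doubled \\1 form listed above is what the agent redaction rules expect.

```python
import re

def redact_password(line: str) -> str:
    # Hypothetical helper: group 1 captures the "password=" key,
    # and \1 in the replacement writes that key back unchanged.
    return re.sub(r"(password=)\S+", r"\1REDACTED", line)

redacted = redact_password("user=bob password=hunter2 action=login")
print(redacted)
```

Only the value after the key is replaced; the rest of the line is left as it was.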

Testing Parsers

Test parsers with free text here.
- This page allows you to paste in log data and confirm that your parser works as expected.
- Video Explanation: https://youtu.be/uNOiu8CVnJU?t=72
Test parsers using your live data. Although this page looks similar to the "free text" page above, it allows you to debug the workings of your format statements. Access this editor by:
- Clicking the User Menu -> "Parsers" and choosing the "Edit" or "Create" button, or
- Selecting a log line in the "Search" page, clicking "Inspect Fields", then "Edit Parser"
- Video Explanation: https://youtu.be/uNOiu8CVnJU?t=161

Lookaheads / Lookbehinds / Lookarounds

Some regex features are restricted:
- Positive and negative lookaheads and lookbehinds are not supported at all, due to the performance issues associated with them.
- Video Explanation: https://www.youtube.com/watch?v=trDh8omSlvA&feature=youtu.be&t=169
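Many lookaround patterns can be rewritten with an ordinary capturing group. The sketch below, written against Python's re module with a made-up log line, shows one such rewrite; it is an illustration of the general idea rather than a DataSet-specific recipe.

```python
import re

line = "2020-08-28 user=alice status=200"

# Lookbehind form (restricted in DataSet): (?<=user=)\w+
# Lookaround-free form: match the prefix literally and capture what follows.
match = re.search(r"user=(\w+)", line)
user = match.group(1) if match else None
print(user)
```

The prefix becomes part of the overall match, so the value of interest is read from group 1 instead of from the whole match.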

Searches

When performing searches, $"regex" searches the message field, which contains the original log line in its entirety. For example, $"tomcat" matches all log lines with "tomcat" in the message field.
This is shorthand for $message matches "regex"
Double escaping is required everywhere except the $"regex" syntax in the search
In the shorthand format, the regex does not need to be escaped:
$"\d+\.\d+\.\d+\.\d+"
In the full syntax, the regex needs to be double escaped:
$message matches "\\d+\\.\\d+\\.\\d+\\.\\d+"
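The relationship between the two forms can be sketched in Python. The double_escape helper below is hypothetical (it is not a DataSet function); it simply doubles every backslash, which is the transformation the two examples above differ by.

```python
import re

def double_escape(regex: str) -> str:
    # Double every backslash so the pattern survives one level of string unescaping.
    return regex.replace("\\", "\\\\")

shorthand = r"\d+\.\d+\.\d+\.\d+"
full = double_escape(shorthand)

print('$"' + shorthand + '"')                 # shorthand search form
print('$message matches "' + full + '"')      # full search form
print(bool(re.search(shorthand, "client 10.0.0.1 connected")))
```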

Characters

Character	Legend	Example	Sample Match
\d	one digit from 0 to 9	log_\\d\\d	log_25
\w	"word character": ASCII letter, digit or underscore	\\w-\\w\\w\\w	A-b_1
\s	"whitespace character": space, tab, newline, carriage return, vertical tab	a\\sb\\sc	a b c
\D	One character that is not a digit as defined by \\d	\\D\\D\\D	ABC
\W	One character that is not a word character as defined by \\w	\\W\\W\\W\\W\\W	*-+=)
\S	One character that is not a whitespace character as defined by \\s	\\S\\S\\S\\S	Yoyo
\u\X	Match specific or ranges of unicode characters. See chart	.*[\u00A8]	‰pò†…¨2020-08-28T20:28:34.343-0500
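The shorthand classes in this table can be tried out locally. The sketch below uses Python's re module with single-backslash raw strings and the re.ASCII flag, so that \w and \d follow the ASCII-only definitions given in the legend. DataSet parsers run on java.util.regex, so treat this as an approximation rather than an exact reproduction.

```python
import re

# (pattern, sample text) pairs modelled on the table rows.
samples = [
    (r"log_\d\d", "log_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# fullmatch requires the whole sample to be consumed by the pattern.
results = {p: bool(re.fullmatch(p, text, re.ASCII)) for p, text in samples}
for pattern, ok in results.items():
    print(pattern, ok)
```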

Quantifiers

Quantifier	Legend	Example	Sample Match
+	One or more	Version \\w-\\w+	Version A-b1_1
{3}	Exactly three times	\\D{3}	ABC
{2,4}	Two to four times	\\d{2,4}	156
{3,}	Three or more times	\\w{3,}	regex_tutorial
*	Zero or more times	A*B*C*	AAACC
?	Once or none	plurals?	plural
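A quick way to check quantifier behaviour is to run the table rows through a regex engine. The following sketch uses Python's re module with single-backslash raw strings; the sample strings mirror the table, and the last line is an extra illustration of the upper bound in {2,4}.

```python
import re

checks = {
    r"Version \w-\w+": "Version A-b1_1",
    r"\D{3}": "ABC",
    r"\d{2,4}": "156",
    r"\w{3,}": "regex_tutorial",
    r"A*B*C*": "AAACC",
    r"plurals?": "plural",
}

# Every pattern should consume its sample completely.
all_match = all(re.fullmatch(p, t) is not None for p, t in checks.items())
print(all_match)

# {2,4} stops after four digits even when more are available.
bounded = re.search(r"\d{2,4}", "12345").group()
print(bounded)
```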

More Characters

Character	Legend	Example	Sample Match
.	Any character except line break	a.c	abc
.	Any character except line break	.*	whatever, man.
\.	A period (special character: needs to be escaped by a \)	a\.c	a.c
\	Escapes a special character	\\.\\*\\+\\? \\$\\^\\/\\\\	.*+? $^/\
\	Escapes a special character	\\[\\{\\(\\)\\}\\]	[{()}]

Logic

Logic	Legend	Example	Sample Match
|	Alternation / OR operand	22|33	33
( … )	Capturing group	A(nt|pple)	Apple (captures "pple")
agent: \1 parser: $1	Contents of Group 1	parser - r(\\w)g$1x agent - r(\\w)g\\1x	regex
agent: \2 parser: $2	Contents of Group 2	parser - (\\d\\d)\\+(\\d\\d)=$2\\+$1 agent - (\\d\\d)\\+(\\d\\d)=\\2\\+\\1	12+65=65+12
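Alternation, capturing groups and back references can be exercised with Python's re module, which uses the same \1 and \2 syntax as the agent column (written with a single backslash in a raw string). The input strings below are illustrative only.

```python
import re

# Alternation inside a capturing group: group 1 holds the branch that matched.
apple = re.fullmatch(r"A(nt|pple)", "Apple")
captured = apple.group(1)

# \1 must repeat exactly what group 1 captured ("e" in "regex").
backref = re.fullmatch(r"r(\w)g\1x", "regex")

# Two groups referenced in swapped order on the right-hand side.
swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=65+12")

print(captured, backref.group(1), swapped.groups())
```

A string such as "regax" does not match the second pattern, because the character after "g" differs from the one captured by group 1.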

More White-Space

Character	Legend	Example	Sample Match
\n	New line	stack trace\ntrace	"stack trace" followed by "trace" on the next line

More Quantifiers

Quantifier	Legend	Example	Sample Match
+	The + (one or more) is "greedy"	\d+	12345
?	Makes quantifiers "lazy"	\d+?	1 in 12345
*	The * (zero or more) is "greedy"	A*	AAA
?	Makes quantifiers "lazy"	A*?	empty in AAA
{2,4}	Two to four times, "greedy"	\w{2,4}	abcd
?	Makes quantifiers "lazy"	\w{2,4}?	ab in abcd
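The difference between greedy and lazy quantifiers is easiest to see side by side. This sketch reproduces the table rows with Python's re module and adds one hypothetical log-style string to show why the distinction matters when extracting quoted fields.

```python
import re

greedy_digits = re.search(r"\d+", "12345").group()      # takes as much as possible
lazy_digits = re.search(r"\d+?", "12345").group()       # takes as little as possible
lazy_star = re.search(r"A*?", "AAA").group()            # zero repetitions suffice
lazy_range = re.search(r"\w{2,4}?", "abcd").group()     # stops at the minimum of two

# Hypothetical log fragment with two quoted fields.
fragment = '"GET /a" "agent"'
greedy_quote = re.search(r'".*"', fragment).group()     # spans both fields
lazy_quote = re.search(r'".*?"', fragment).group()      # stops at the first closing quote

print(greedy_digits, lazy_digits, repr(lazy_star), lazy_range)
print(greedy_quote, lazy_quote)
```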

Character Classes

Character	Legend	Example	Sample Match
[ … ]	One of the characters in the brackets	[AEIOU]	One uppercase vowel
[ … ]	One of the characters in the brackets	T[ao]p	Tap or Top
-	Range indicator	[a-z]	One lowercase letter
[x-y]	One of the characters in the range from x to y	[A-Z]+	GREAT
[ … ]	One of the characters in the brackets	[AB1-5w-z]	One of either: A,B,1,2,3,4,5,w,x,y,z
[x-y]	One of the characters in the range from x to y	[ -~]+	Characters in the printable section of the ASCII table.
[^x]	One character that is not x	[^a-z]{3}	A1!
[^x-y]	One of the characters not in the range from x to y	[^ -~]+	Characters that are not in the printable section of the ASCII table.
[\d\D]	One character that is a digit or a non-digit	[\\d\\D]+	Any characters, including new lines, which the regular dot doesn't match
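Two of the less obvious rows, the printable-ASCII range and the [\d\D] idiom, are sketched below with Python's re module. The sample strings are made up for the illustration.

```python
import re

printable = re.compile(r"[ -~]+")       # space (0x20) through tilde (0x7E)
non_printable = re.compile(r"[^ -~]+")  # everything outside that range

line = "caf\u00e9 ok\t!"
printable_runs = printable.findall(line)
other_runs = non_printable.findall(line)
print(printable_runs, other_runs)

# The dot stops at a line break, while [\d\D] matches any character at all.
two_lines = "line1\nline2"
dot_match = re.match(r".+", two_lines).group()
any_match = re.match(r"[\d\D]+", two_lines).group()
print(repr(dot_match), repr(any_match))
```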

Anchors and Boundaries

Anchor	Legend	Example	Sample Match
^ or "regex" in a parser format	Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not")	^start.*end$ or "start.*the end"	start of the end (line start)
$ or "regex" in a parser format	End of string or end of line depending on multiline mode. Many engine-dependent subtleties.	.*? the end$ OR ".*the end"	this is the end
\b	Word boundary: position where only one side is an ASCII letter, digit or underscore	Bob.*\bcat\b	Bob ate the cat
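Anchor behaviour depends on the engine and its multiline setting, so it is worth testing. The sketch below shows how Python's re module treats ^, $ and \b; java.util.regex, which DataSet uses, may differ in edge cases, and the sample text is invented for the example.

```python
import re

text = "first line\nthis is the end"

# Without MULTILINE, ^ only matches at the very start of the string.
start_default = re.search(r"^this", text)
# With MULTILINE, ^ also matches right after each line break.
start_multiline = re.search(r"^this", text, re.MULTILINE)

# $ matches at the end of the string; the dot cannot cross the line break.
tail = re.search(r".*? the end$", text).group()

# \b requires a word/non-word transition on each side of "cat".
whole_word = re.search(r"Bob.*\bcat\b", "Bob ate the cat")
inside_word = re.search(r"Bob.*\bcat\b", "Bob ate the catfish")

print(start_default, bool(start_multiline), tail, bool(whole_word), inside_word)
```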