Grok Pattern Matching¶
The punch language provides the grok operator, which parses unstructured text into structured fields.
It is covered in our documentation and in many good online resources. Useful references are:
- Online Grok Debugger: an online Grok debugger
- $PUNCHPLAFORM_CONF_DIR/resources/punch/pattern: the standard patterns of your platform
- Grok introduction: an introductory guide
Warning
The Grok operator is regex based. It suffers from performance issues, in particular when a match fails. Check out the Dissect operator first.
Grok purpose¶
The Grok operator lets you parse arbitrary text and structure it. In short, it is a regular expression engine in which you can reuse patterns by storing them under names.
PunchPlatform is shipped with about 400 patterns by default. It is very simple to add your own. Here are a few examples.
# Basic Identifiers
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
As you can guess, these patterns make it much easier to keep your regular expressions under control.
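The way named patterns build on each other (USER is defined in terms of USERNAME) can be illustrated outside the punch language. The following is a minimal Python sketch, not the PunchPlatform implementation: the `PATTERNS` dictionary and the `expand` helper are hypothetical names introduced for illustration, and only the three simplest definitions above are copied in.

```python
import re

# A tiny pattern library copied from the definitions above.
PATTERNS = {
    "USERNAME": r"[a-zA-Z0-9._-]+",
    "USER": r"%{USERNAME}",
    "INT": r"(?:[+-]?(?:[0-9]+))",
}

# Matches a bare pattern reference such as %{USERNAME}.
REFERENCE = re.compile(r"%\{(\w+)\}")


def expand(expression: str) -> str:
    """Replace every %{NAME} reference by its definition, recursively."""
    def substitute(match):
        # A definition may itself contain references, so expand it too.
        return expand(PATTERNS[match.group(1)])
    return REFERENCE.sub(substitute, expression)


user_regex = expand("%{USER}")
print(user_regex)
print(bool(re.fullmatch(expand("%{USER}:%{INT}"), "alice:42")))
```

Here `%{USER}` resolves to the USERNAME character class, so a single change to USERNAME would propagate to every pattern that refers to it.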
Take a look at a file in $PUNCHPLAFORM_CONF_DIR/resources/punch/pattern to see what a Grok pattern file looks like. It is easy to understand.
Grok Basics¶
The syntax for a Grok pattern is %{SYNTAX:SEMANTIC}
If you have a chunk of log like this:
Apache2: 55.3.244.1 GET /index.html 15824 0.043
- The SYNTAX is the name of the pattern that will match your text. For example, "15824" will be matched by the NUMBER pattern and "55.3.244.1" will be matched by the IP pattern. The available SYNTAX names are listed in your pattern directory.
- The SEMANTIC is the identifier (the Tuple field) you give to the piece of text being matched.
For the above example, assuming you have a Tuple tmp, your Grok filter would look something like this:
Apache2: %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
To use this pattern in a punchlet, here is what you write:
// @test(encoding=json) {"logs":{"log": "Apache2: 55.3.244.1 GET /index.html 15824 0.043"}}
{
    Tuple tmp;
    // Each match is stored in the local tmp Tuple under the given field.
    if (grok("Apache2: %{IP:tmp:[client]} %{WORD:tmp:[method]} %{URIPATHPARAM:tmp:[request]} %{NUMBER:tmp:[bytes]} %{NUMBER:tmp:[duration]}")
            .on([logs][log])) {
        // It worked !
        print(tmp);
    } else {
        raise("not the log I expected");
    }
}
After the Grok operator runs, the tmp tuple (as printed by the snippet above) is:
{
"duration": "0.043",
"request": "/index.html",
"method": "GET",
"bytes": "15824",
"client": "55.3.244.1"
}
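To make the %{SYNTAX:SEMANTIC} mechanics concrete, here is a minimal Python sketch that rewrites each token into a named capture group and applies the result to the sample log line. It is an illustration only: the four regular expressions in `PATTERNS` are simplified stand-ins, not the definitions shipped with the platform, and `compile_grok` is a hypothetical helper.

```python
import re

# Simplified stand-ins for the platform patterns (assumptions, not the
# shipped definitions).
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "URIPATHPARAM": r"/\S*",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

# Matches a %{SYNTAX:SEMANTIC} token.
TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def compile_grok(expression: str) -> re.Pattern:
    """Turn each %{SYNTAX:SEMANTIC} token into a named capture group."""
    def to_group(match):
        syntax, semantic = match.group(1), match.group(2)
        return f"(?P<{semantic}>{PATTERNS[syntax]})"
    return re.compile(TOKEN.sub(to_group, expression))


grok = compile_grok(
    "Apache2: %{IP:client} %{WORD:method} %{URIPATHPARAM:request} "
    "%{NUMBER:bytes} %{NUMBER:duration}"
)
match = grok.fullmatch("Apache2: 55.3.244.1 GET /index.html 15824 0.043")
fields = match.groupdict() if match else None
print(fields)
```

As in the punchlet, every captured value is a string; converting "15824" or "0.043" to a number is a separate step.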
Grok and Tuples¶
When you use a Grok pattern, you may want fine control over where the resulting matches go: into the root Tuple or a local Tuple, at the top level or in inner fields.
Just as Grok patterns can be broken down into sub-patterns, matches can be structured into sub-tuples.
Consider the standard SYSLOGBASE Grok pattern. If you write:
// @test mailserver14 postfix/cleanup 21403
{
    grok("%{SYSLOGBASE:syslogbase}").on(root);
    print(root);
}
Here is what you get:
{
  "syslogbase": {
    "logsource": "mailserver14",
    "program": "postfix/cleanup",
    "pid": "21403"
  }
}
Example: parsing an IP address
If your logs contain IP addresses and you do not need to distinguish IPv4 from IPv6, use the IP pattern:
grok("^%{IP}$").on(tmp:[client])
How do you handle IPv4 and IPv6 separately? The IP pattern matches both IPV4 and IPV6. In the following example, only an IPv4 address is stored in document:[init][host][ip]; anything else is stored in document:[init][host][name].
{
    if (grok("^%{IPV4}$").on(tmp:[client])) {
        document:[init][host][ip] = tmp:[client];
    } else {
        document:[init][host][name] = tmp:[client];
    }
}
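The same branching logic can be sketched outside the punch language. The Python example below is an assumption-laden illustration: it swaps the IPV4 Grok pattern for the standard `ipaddress` module, so it may not accept exactly the same strings, and `route_client` is a hypothetical helper name.

```python
import ipaddress


def route_client(client: str) -> dict:
    """Store an IPv4 address under host.ip, anything else under host.name."""
    host = {}
    try:
        # Raises a ValueError subclass when client is not a valid IPv4 address.
        ipaddress.IPv4Address(client)
        host["ip"] = client
    except ValueError:
        host["name"] = client
    return {"init": {"host": host}}


print(route_client("55.3.244.1"))
print(route_client("2001:db8::1"))
print(route_client("mailserver14"))
```

An IPv6 address and a host name both end up under the name field, which mirrors the else branch of the punchlet.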
To summarize, the destination of a match is chosen as follows:
- %{PATTERN:field}: the result of the match goes to root:[field].
- %{PATTERN:root:[field]}: identical result.
- %{PATTERN:tmp:[field]}: the result of the match goes to the local Tuple field tmp:[field].
- %{PATTERN:tmp:/}: the result of the match goes to the local Tuple tmp.
- %{PATTERN:root:/}: the result of the match completely overwrites the root Tuple from the top.
Inner Destination¶
The punchplatform supports defining an inner destination inside a Grok pattern. Here is an example. Consider the following pattern:
HTTPD20_ERRORLOG \[%{HTTPDERROR_DATE:timestamp}\] \[%{LOGLEVEL:loglevel}\] (?:\[client %{IPORHOST:clientip}\] ){0,1}%{GREEDYDATA:message}
Focus on the clientip part, i.e.:
%{IPORHOST:clientip}
Say that you would prefer the matched value to be sent directly into a [client][ip] destination (rather than clientip). You can do that by simply defining your pattern as follows:
%{IPORHOST:client:ip}
With that additional ':' you will get:
{
  ...
  "client": {
    "ip": "mailserver14"
  }
}
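The effect of the extra ':' can be pictured as turning the semantic name into a path of nested keys. The following Python sketch only models that idea with plain dictionaries; `store` is a hypothetical helper and the IP value is made up for the example, so this is not how the platform itself stores Tuples.

```python
def store(target: dict, semantic: str, value: str) -> dict:
    """Store value under the nested path described by a colon-separated name."""
    *parents, leaf = semantic.split(":")
    node = target
    for key in parents:
        # Create the intermediate level if it does not exist yet.
        node = node.setdefault(key, {})
    node[leaf] = value
    return target


print(store({}, "clientip", "10.0.0.1"))
print(store({}, "client:ip", "10.0.0.1"))
```

With "clientip" the value lands in a single top-level field; with "client:ip" it lands one level down, under client.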
Oniguruma Format¶
The Oniguruma named-group syntax is supported as well. For simple matching use cases, it lets you avoid defining a dedicated Grok pattern. It is easier to understand with an example:
grok("(?<[first]>hello) (?<[second]>world)").on("hello world");
Generates:
{
"first": "hello",
"second": "world"
}
The idea is to wrap what you want to match between (?<target> and ), where target is the destination field. You can mix regular Grok patterns and Oniguruma named groups in the same expression.
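For readers more familiar with other regex engines, the same named-group idea can be sketched in Python. Note the assumption being made: Python's `re` module spells named groups `(?P<name>...)` rather than Oniguruma's `(?<name>...)`, and it knows nothing about punch destinations such as [first], so this only illustrates the capture mechanism.

```python
import re

# Two named groups, the Python equivalent of (?<first>...) and (?<second>...).
pattern = re.compile(r"(?P<first>hello) (?P<second>world)")

match = pattern.fullmatch("hello world")
captured = match.groupdict() if match else {}
print(captured)
```

The resulting dictionary has the same shape as the output shown above: one entry per named group.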