Grok Pattern Matching

The punch language provides a grok operator for extracting structured fields from text. It is covered both in this documentation and in many good online resources. Do not hesitate to refer to:

Online Grok Debugger : an online Grok debugger
$PUNCHPLATFORM_CONF_DIR/resources/punch/pattern : the standard patterns of your platform
Grok introduction : an excellent introductory guide

Warning

The Grok operator is regex based. It can perform poorly, in particular when a pattern fails to match. Check out the Dissect operator first.

Grok purpose

The Grok operator lets you parse arbitrary text and structure it. In short, it is a regular expression engine in which you can reuse patterns by storing them under names.

PunchPlatform ships with about 400 patterns by default, and it is simple to add your own. Here are a few examples.

# Basic Identifiers
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b

As you can guess, these patterns make it much easier to keep your regex usage under control.

Take a look at a file in $PUNCHPLATFORM_CONF_DIR/resources/punch/pattern to see what a Grok pattern file looks like. It is easy to understand.
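To make the reuse mechanism concrete, here is a minimal sketch in Python (not Punch) of how named patterns can be expanded into one plain regular expression. The PATTERNS dictionary and the expand function are hypothetical names introduced for illustration only; this is not how the Punch grok engine is actually implemented.

```python
import re

# Hypothetical mini pattern library, copied from the definitions above.
PATTERNS = {
    "USERNAME": r"[a-zA-Z0-9._-]+",
    "USER": r"%{USERNAME}",
    "INT": r"(?:[+-]?(?:[0-9]+))",
}

# Matches a bare pattern reference such as %{USER}.
REFERENCE = re.compile(r"%\{(\w+)\}")


def expand(pattern: str) -> str:
    """Replace every %{NAME} by its definition until none is left."""
    previous = None
    while previous != pattern:
        previous = pattern
        pattern = REFERENCE.sub(lambda m: PATTERNS[m.group(1)], pattern)
    return pattern


expanded = expand("%{USER} %{INT}")
print(expanded)
print(re.fullmatch(expanded, "jdoe -42") is not None)
```

USER is resolved through USERNAME in a second pass, which is the kind of layering that lets a pattern file build complex expressions from small ones.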

Grok Basics

The syntax for a Grok pattern is %{SYNTAX:SEMANTIC}

If you have a chunk of log like this:

Apache2: 55.3.244.1 GET /index.html 15824 0.043

The SYNTAX is the name of the pattern that will match your text. For example, "15824" will be matched by the NUMBER pattern and "55.3.244.1" will be matched by the IP pattern. The available syntaxes are listed in your pattern directory.
The SEMANTIC is the identifier (the Tuple field) you give to the piece of text being matched.

For the above example, assuming you have a Tuple tmp, your Grok filter would look something like this:

Apache2: %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

To use this pattern in a punchlet, here is what you write:

// @test(encoding=json) {"logs":{"log": "Apache2: 55.3.244.1 GET /index.html 15824 0.043"}}
{
    Tuple tmp;
    // Each match is stored in a field of the local Tuple tmp
    if (grok("Apache2: %{IP:tmp:[client]} %{WORD:tmp:[method]} %{URIPATHPARAM:tmp:[request]} %{NUMBER:tmp:[bytes]} %{NUMBER:tmp:[duration]}")
            .on([logs][log])) {
        // It worked !
        print(tmp);
    } else {
        raise("not the log I expected");
    }
}

After the Grok operator runs, the tmp tuple (as printed by the snippet above) is:

{
  "duration": "0.043",
  "request": "/index.html",
  "method": "GET",
  "bytes": "15824",
  "client": "55.3.244.1"
}
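For readers more familiar with plain regular expressions, the same extraction can be sketched in Python (not Punch) with named groups. The sub-expressions below are simplified stand-ins chosen for this illustration; they are assumptions and are far less thorough than the real IP, WORD, URIPATHPARAM and NUMBER patterns.

```python
import re

# Simplified stand-ins (assumptions) for IP, WORD, URIPATHPARAM and NUMBER.
LINE = re.compile(
    r"Apache2: "
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>\w+) "
    r"(?P<request>/\S*) "
    r"(?P<bytes>[+-]?\d+(?:\.\d+)?) "
    r"(?P<duration>[+-]?\d+(?:\.\d+)?)"
)

match = LINE.fullmatch("Apache2: 55.3.244.1 GET /index.html 15824 0.043")
# groupdict() plays the role of the tmp Tuple: one entry per named capture.
fields = match.groupdict() if match else None
print(fields)
```

As in the Grok result, every captured value is a string; nothing is converted to a number.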

Grok and Tuples

When you use a Grok pattern you may want fine control over where the resulting matches go: into the root Tuple or a local Tuple, at top level or in inner fields, and so on.

Just as Grok lets you compose patterns from smaller patterns, you can structure your matches by leveraging subtuples.

Consider the standard SYSLOGBASE Grok pattern. If you write:

// @test mailserver14 postfix/cleanup 21403
{
    grok("%{SYSLOGBASE:syslogbase}").on(root);
    print(root);
}

Here is what you get:

{
    "syslogbase": {
        "logsource": "mailserver14",
        "program": "postfix/cleanup",
        "pid": "21403"
    }
}

Example: parsing IP addresses

If your logs contain IP addresses and you do not need to differentiate IPv4 from IPv6, the IP pattern covers both:

grok("^%{IP}$").on(tmp:[client])

If the same field can contain either an IPv4 or an IPv6 address and you need to differentiate them, match against the more specific IPV4 (or IPV6) pattern, since the IP pattern accepts both.

In this example, only IPv4 addresses are added to document:[init][host][ip]; anything else is added to document:[init][host][name].

{
    if (grok("^%{IPV4}$").on(tmp:[client])) {
        document:[init][host][ip] = tmp:[client];
    } else {
        document:[init][host][name] = tmp:[client];
    }
}
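The routing decision above can also be sketched in Python (not Punch). This version relies on the standard ipaddress module rather than on the IPV4 Grok pattern, so edge cases may be judged differently; route_host is a hypothetical helper name used only for this illustration.

```python
import ipaddress


def route_host(client: str) -> dict:
    """Store client under host.ip if it is an IPv4 address, else under host.name."""
    document = {"init": {"host": {}}}
    try:
        ipaddress.IPv4Address(client)
    except ValueError:
        # Not IPv4: an IPv6 address or a host name ends up here.
        document["init"]["host"]["name"] = client
    else:
        document["init"]["host"]["ip"] = client
    return document


print(route_host("55.3.244.1"))
print(route_host("2001:db8::1"))
print(route_host("mailserver14"))
```

Note that, as in the punchlet, an IPv6 address lands in the name field because only the IPv4 case is tested.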

Here are the possible syntaxes to send your matches where you need them:

%{PATTERN:field} : the result of the match goes to root:[field].
%{PATTERN:root:[field]} : identical result.
%{PATTERN:tmp:[field]} : the result of the match goes to a local Tuple tmp:[field].
%{PATTERN:tmp:/} : the result of the match goes to a local Tuple tmp.
%{PATTERN:root:/} : the result of the match completely overwrites the root Tuple from the top.

Inner Destination

PunchPlatform supports defining an inner destination inside a Grok pattern. Here is an example. Consider the following pattern:

HTTPD20_ERRORLOG \[%{HTTPDERROR_DATE:timestamp}\] \[%{LOGLEVEL:loglevel}\] (?:\[client %{IPORHOST:clientip}\] ){0,1}%{GREEDYDATA:message}

Focus on the clientip part, i.e.:

%{IPORHOST:clientip}

Say you would prefer the [clientip] part to be sent directly into a [client][ip] destination (rather than clientip). You can do that by simply defining your pattern as follows:

%{IPORHOST:client:ip}

With that additional colon, you get:

{
  ...
  "client": {
    "ip": "mailserver14"
  }
}
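The effect of the extra colon can be pictured as a path into nested fields. Below is a minimal sketch in Python (not Punch); the put helper is a hypothetical name introduced for this illustration, and it only mimics the result shown above, not the platform's internal logic.

```python
def put(target: dict, destination: str, value: str) -> dict:
    """Store value under a colon-separated path, creating inner dicts as needed."""
    keys = destination.split(":")
    node = target
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return target


flat = put({}, "clientip", "mailserver14")
nested = put({}, "client:ip", "mailserver14")
print(flat)
print(nested)
```

A destination without a colon produces a single top-level field, while client:ip produces the nested client field with an inner ip field.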

Oniguruma Format

The Oniguruma named-group format is supported as well. For simple matching use cases, it lets you avoid defining a Grok pattern. It is easier to understand with an example:

grok("(?<[first]>hello) (?<[second]>world)").on("hello world");

Generates:

{
  "first": "hello",
  "second": "world"
}

The idea is to wrap what you want to match between (?<target> and ), where target is the destination field. You can mix regular Grok and Oniguruma patterns in the same Grok pattern.