Dissect Pattern Matching

Punch provides the dissect operator.

Rationale

This chapter contains extracts from the Elastic introductory blog post on Dissect.

Most users who want to extract structured fields from unstructured event data use the popular Grok filter. This filter uses regular expressions under the hood and is versatile enough to handle many forms of text. Over time, users tend to use Grok for all of their structured extraction needs. This, in many cases, is analogous to using a sledgehammer to crack a nut. Regular expressions are sledgehammers. They are extremely powerful, but they can be unforgiving -- a poorly built regex can slow down the Logstash pipeline when run on certain lines of text. The pipeline is designed to recover, but the overall performance becomes unpredictable, leading to unhappy users.

This problem was also identified early on in the punchplatform. The syslogheader operator was designed to avoid this kind of performance penalty. The dissect operator is the better, more generic and more powerful solution.

Is Dissect a replacement for Grok?

Yes and no. In some cases yes, but remember that Dissect is not as flexible as Grok. In cases where Dissect cannot entirely solve the problem, it can be used first to extract the more predictable fields and then the remaining irregular text is passed along to the Grok filter. As the reduced Grok patterns are more focused and are being applied to smaller texts, Grok's performance should be less variable.

Is Dissect a replacement for CSV?

Yes if performance is paramount. However, you may find the config syntax of the CSV filter more intuitive. In the near future, we will leverage some of the techniques from Dissect in the CSV filter to put its performance on a par with Dissect.

Example

The dissect operator works like a powerful split (or csv) operator. Say you receive this (header-only) log into [logs][log]:

66.249.66.207 - - [14/Dec/2016:19:31:37 -0500]

If you write:

dissect("%{clientip} %{ident} %{agent} [%{timestamp} %{+timestamp}]").on([logs][log]).into([logs][dissect])

You end up with:

{ "logs" :
    {
        "log" : "66.249.66.207 - - [14/Dec/2016:19:31:37 -0500]",
        "dissect" : {
            "clientip" : "66.249.66.207",
            "ident" : "-",
            "agent" : "-",
            "timestamp" : "14/Dec/2016:19:31:37 -0500"
        }
    }
}

Notice how you control the placement of your match using the .into() directive.

Dissect Basics

The dissect operator relies on a pattern (just like grok). Here is an example pattern:

"<%{priority}>%{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp} %{logsource} %{rest}"

And here is how you use it in a punchlet:

dissect("<%{priority}>%{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp} %{logsource} %{rest}")
  .on([logs][log])
  .into([logs][dissect]);

The dissect operator returns true if a match was achieved, and false otherwise. You can thus write:

if (!dissect(...).on([logs][log]).into([logs][dissect])) {
  // error handling
}

or, equivalently:

boolean success = dissect(...).on([logs][log]).into([logs][dissect]);
if (!success) {
    // error handling
}

Dissect Fields

Normal field type

This type simply adds the matched value to the event using the specified text as the name of the field, e.g. %{priority}.

Skip field type

The pattern format requires you to specify each delimiter by putting it between } and %{, meaning there will be times when you need to specify fields but you don't need their values in the event - they should be ignored or skipped. To do this you can use an empty field (e.g. %{}) or a named skip field, which is prefixed with a question mark (e.g. %{?priority}). You might choose to name your skip fields to make the final dissect pattern more readable.
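
As an illustration (a variation on the introductory example, not taken verbatim from the punch documentation), the following pattern keeps only the client IP and the timestamp, skipping the two dashes with named skip fields:

{
    [logs][log] = "66.249.66.207 - - [14/Dec/2016:19:31:37 -0500]";
    dissect("%{clientip} %{?ident} %{?agent} [%{timestamp} %{+timestamp}]").on([logs][log]).into([logs][dissect]);
}

Following the skip semantics just described, [logs][dissect] should only contain:

{
    "clientip" : "66.249.66.207",
    "timestamp" : "14/Dec/2016:19:31:37 -0500"
}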

Append field type

For the same reason - needing to specify all delimiters - you might need to rebuild a split value. Taking the syslog date as an example: if the source text looks like <1>Oct 26 20:21:22 alpha1 rest... you will need to specify fields for the month, day and time.

Maybe you want to keep these as separate fields in your events, but more likely you will want to rebuild them into one value. This is what an Append field does. For example, %{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp} will first put the month value in the syslog_timestamp field; the next +syslog_timestamp then appends the day, with its prior delimiter in between - in this case a space - resulting in "Oct 26". This is repeated for the last +syslog_timestamp, so the final value of the syslog_timestamp field will be "Oct 26 20:21:22".

Sometimes, you will want to rebuild the value in a different order - this is supported with an ordinal suffix, e.g. %{+syslog_timestamp/1}. There is more info and examples on this in the docs.
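
Putting it together, here is what the syslog pattern from the Dissect Basics section produces on the sample line above (an illustrative result, following the append semantics just described):

{
    [logs][log] = "<1>Oct 26 20:21:22 alpha1 rest...";
    dissect("<%{priority}>%{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp} %{logsource} %{rest}")
      .on([logs][log])
      .into([logs][dissect]);
}

[logs][dissect] then holds:

{
    "priority" : "1",
    "syslog_timestamp" : "Oct 26 20:21:22",
    "logsource" : "alpha1",
    "rest" : "rest..."
}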

Indirect field type

Use the value of one field as the key of another. This type is arguably the least useful one, and the syntax looks strange because it leverages Skip fields. For example, this punchlet

{
    [logs][log] = "foo bar";
    dissect("%{?a} %{&a}").on([logs][log]).into([dissect]);
}

will create a field with the key foo containing the value bar. This is one way to dynamically name some of your fields.

That is, the resulting document is:

{
  "dissect": {
    "foo": "bar"
  },
  "logs": {
    "log": "foo bar"
  }
}

Tips and tricks

My last match contains unwanted values

Let's take this log as an example:

[logs][log] = "#fun foo bar boo!";

If you try to extract the two words with the pattern dissect("#fun %{a} %{b} boo!"), it won't work, because b will equal "bar boo!".

Never forget that you are working with delimiters. With no %{ symbol after %{b}, there is no delimiter, so everything that remains is taken as part of the match. A quick way to solve this issue is to add a final skip field: dissect("#fun %{a} %{b} boo!%{}")
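
With that final skip field, boo! becomes a true delimiter and the empty %{} absorbs whatever follows it (here, nothing), so the match splits as intended (an illustrative result, not a verbatim extract from the punch documentation):

{
    [logs][log] = "#fun foo bar boo!";
    dissect("#fun %{a} %{b} boo!%{}").on([logs][log]).into([logs][dissect]);
}

[logs][dissect] then holds:

{
    "a" : "foo",
    "b" : "bar"
}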