# Dissect Pattern Matching
Punch provides the dissect operator. Do not hesitate to refer to:

- Logstash doc: the Logstash dissect manual.
- Elastic blog: a nice introductory blog post.
- Github: the documentation from the original Dissect repository.
## Rationale
This chapter contains extracts from the Elastic introductory blog.

Most users who want to extract structured fields from unstructured event data use the popular Grok filter. This filter uses regular expressions under the hood and is versatile enough to handle many forms of text. Over time, users tend to use Grok for all of their structured extraction needs. This, in many cases, is analogous to using a sledgehammer to crack a nut. Regular expressions are sledgehammers: extremely powerful, but unforgiving. A poorly built regex can slow down the Logstash pipeline when run on certain lines of text. The pipeline is designed to recover, but the overall performance becomes unpredictable, leading to unhappy users.
This problem was also identified early on in the punchplatform. The syslogheader operator was designed precisely to avoid this kind of performance penalty. The dissect operator is the better, more generic and more powerful solution.
## Is Dissect a replacement for Grok?
Yes and no. In some cases yes, but remember that Dissect is not as flexible as Grok. In cases where Dissect cannot entirely solve the problem, it can be used first to extract the more predictable fields; the remaining irregular text is then passed along to the Grok filter. As the reduced Grok patterns are more focused and applied to smaller texts, Grok's performance becomes less variable.
## Is Dissect a replacement for CSV?
Yes, if performance is paramount. However, you may find the config syntax of the CSV filter more intuitive. In the near future, we (the Elastic team) will leverage some of the techniques from Dissect in the CSV filter to put its performance on a par with Dissect.
## Example
The dissect operator works like a powerful split (or csv) operator. Say you receive this log (header only) into [logs][log]:
66.249.66.207 - - [14/Dec/2016:19:31:37 -0500]
If you write:
dissect("%{clientip} %{ident} %{agent} [%{timestamp} %{+timestamp}]").on([logs][log]).into([logs][dissect]);
You end up with:
{
  "logs" : {
    "log" : "66.249.66.207 - - [14/Dec/2016:19:31:37 -0500]",
    "dissect" : {
      "clientip" : "66.249.66.207",
      "ident" : "-",
      "agent" : "-",
      "timestamp" : "14/Dec/2016:19:31:37 -0500"
    }
  }
}
Notice how you control the placement of your match using the into() directive.
## Dissect Basics
The dissect operator relies on a pattern (just like Grok). Here is an example pattern:
"<%{priority}>%{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp} %{logsource} %{rest}"
And here is how you use it in a punchlet:
dissect("<%{priority}>%{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp} %{logsource} %{rest}")
.on([logs][log])
.into([logs][dissect]);
The dissect operator returns true if a match was achieved, false otherwise. You can thus write:
if (!dissect(...).on([logs][log]).into([logs][dissect])) {
// error handling
}
or:
boolean success = dissect(...).on([logs][log]).into([logs][dissect]);
if (!success) {
// error handling
}
## Dissect Fields
### Normal field type
This type simply adds the matched value to the event, using the specified text as the name of the field, e.g. %{priority}.
### Skip field type
The pattern format requires you to specify each delimiter by putting it between } and %{, which means there will be times when you need to specify fields whose values you do not need in the event: they should be ignored or skipped. To do this you can use an empty field (e.g. %{}) or a named skip field, prefixed with a question mark (e.g. %{?priority}). You might choose to name your skip fields to make the final dissect pattern more readable.
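As a sketch, reusing the apache-style log from the example above, named skip fields let you drop the ident and agent columns while keeping the pattern readable:

```
{
    // reusing the example log from above
    [logs][log] = "66.249.66.207 - - [14/Dec/2016:19:31:37 -0500]";
    dissect("%{clientip} %{?ident} %{?agent} [%{timestamp} %{+timestamp}]")
        .on([logs][log])
        .into([logs][dissect]);
    // [logs][dissect] should now only hold "clientip" and "timestamp":
    // { "clientip" : "66.249.66.207", "timestamp" : "14/Dec/2016:19:31:37 -0500" }
}
```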
### Append field type
For the same reason (needing to specify all delimiters), you might need to rebuild a split value. Taking the syslog date as an example, if the source text looks like this: <1>Oct 26 20:21:22 alpha1 rest..., you will need to specify fields for the month, day and time. Maybe you want these as separate fields in your events, but more likely you will want to rebuild them into one value. This is what an Append field does. For example, with %{syslog_timestamp} %{+syslog_timestamp} %{+syslog_timestamp}, the syslog_timestamp field will first hold the month value; the next +syslog_timestamp then appends the day, with its prior delimiter in between (in this case a space), resulting in "Oct 26". This is repeated for the last +syslog_timestamp, so the final value of the syslog_timestamp field will be "Oct 26 20:21:22".
Sometimes you will want to rebuild the value in a different order. This is supported with an ordinal suffix, e.g. %{+syslog_timestamp/1}. There is more info and examples on this in the docs.
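As a sketch of the ordinal suffix (the log and field name here are illustrative), the parts are appended in the order given by the /n suffix rather than left to right:

```
{
    [logs][log] = "Dec 14 2016";
    dissect("%{+date/2} %{+date/1} %{+date/3}").on([logs][log]).into([logs][dissect]);
    // the values are appended in ordinal order, so assuming the usual
    // dissect semantics, [logs][dissect][date] ends up as "14 Dec 2016"
}
```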
### Indirect field type
This type uses the value of one field as the key of another. It is arguably the least useful type, and the syntax looks strange because it leverages skip fields. For example, this punchlet
{
[logs][log] = "foo bar";
dissect("%{?a} %{&a}").on([logs][log]).into([dissect]);
}
will create a field with the key foo, containing the value bar. This is one way to dynamically name some of your fields, i.e.:
{
"dissect": {
"foo": "bar"
},
"logs": {
"log": "foo bar"
}
}
## Tips and tricks
### My last match contains unwanted values
Let's take this log as an example:
[logs][log] = "#fun foo bar boo!";
If you try to extract the two middle words with the pattern dissect("#fun %{a} %{b} boo!"), it won't work, because b will equal bar boo!.
Never forget that you are working with delimiters. Without a %{ after the b field, there is no closing delimiter, so everything up to the end of the line is taken as part of the match. A quick way to solve this issue is to add a final empty skip field:
dissect("#fun %{a} %{b} boo!%{}")
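Putting the fix together in a complete punchlet (a sketch of the expected behaviour):

```
{
    [logs][log] = "#fun foo bar boo!";
    // the final empty skip field introduces the " boo!" delimiter,
    // so the match of b stops right before it
    dissect("#fun %{a} %{b} boo!%{}").on([logs][log]).into([logs][dissect]);
    // [logs][dissect] should now be { "a" : "foo", "b" : "bar" }
}
```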