
Tutorial: Write a Log Parser

This chapter assumes you have gone through the punchlang chapter (at least a quick read). We will see how the punch language makes it easy to write a complete log parser.

Step 1: Get a log

The first thing to do is to understand the log you are dealing with. Here is an example (we took an Arkoon log):

Sep 21 21:24:03 fakehost Alerts: AKLOG - id=firewall time="2017-09-05 23:59:56" gmtime=1504648796 pri=2 fw=fakehost.com aktype=ALERT alert_type="Blocked by filter" user="bob" alert_level="Medium" alert_desc="UDP from 1.2.3.4:9987 to 1.2.3.5:33719 [default_rule]"

This log has several parts. You typically have a header with some timestamp. It can be a standard syslog header, or something else. In our example it is the syslog header Sep 21 21:24:03 fakehost.

Next you have vendor-specific information. Here we have a keyword, Alerts:, the Arkoon vendor tag AKLOG -, and keys with values.

Step 2: Think

The Punch language gives you several operators to extract parts of your log and store the interesting matches in your JSON document. In a nutshell, here is what you have:

  • kv : the key-value operator, to parse logs of the form [field1=value1 field2=value2 ... fieldN=valueN] (our case here),
  • csv : the CSV operator, to parse logs of the form [value1;value2;...;valueN],
  • date : to convert dates from and to arbitrary formats,
  • syslogHeader : to extract standard (RFC) syslog headers,
  • grok : the grok operator gives you a large set of ready-to-use smart regexes to extract arbitrary patterns,
  • dissect : the dissect operator is a sort of super-splitter, more efficient than the grok operator,
  • many plain (Java) String methods such as startsWith, endsWith, matches, etc.

Have a look at the existing parsers to see how the most common cases have been dealt with. When nothing simpler fits, the grok operator is the solution. It works using predefined regular expressions (defined in .grok files), ready to use. It is very powerful, but you pay a performance penalty. Always check if you can get the job done with dissect, kv or csv first.
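
All these operators share the same chaining style: select an input with on(...), route the matches with into(...), and test the boolean result. Here is a minimal sketch using kv, whose exact signature appears later in this chapter; the other operators follow the same on/into pattern but take their own arguments (check the operator reference):

{
  Tuple tmp;

  // Parse a key=value string and store the submatches under tmp:[kv].
  // The boolean result tells you whether the parsing succeeded.
  if (!kv().on("id=firewall pri=2 aktype=ALERT").into(tmp:[kv])) {
      raise("not a kv log");
  }

  // Prints something like: {"kv":{"id":"firewall","pri":"2","aktype":"ALERT"}}
  print(tmp);
}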

In our case (annotated below), you have:

  • the syslog header Sep 21 21:24:03 fakehost,
  • a custom log header Alerts: AKLOG -,
  • a key-value part.
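
Annotated on our example log, that gives:

Sep 21 21:24:03 fakehost    <- syslog header      (syslogHeader operator)
Alerts: AKLOG -             <- custom log header  (plain Java String methods)
id=firewall time="..." ...  <- key=value pairs    (kv operator)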

Step 3: Write the Punchlet

Naming Scheme

Punchlets are named using a well-defined scheme, and we suggest you stick to it. The punchlet name refers to the role it plays.

Here we will create:

$PUNCHPLATFORM_CONF_DIR/resources/punch/mytenant/arkoon_ngfw/parsing.punch

Where:

  • [mytenant] is the tenant's name,
  • [arkoon] is the producer's name,
  • [ngfw] is the technology's name,
  • [parsing] is the punchlet's job.

This scheme is used for all configuration and resource files including grok files, injector files and so on. For example:

$PUNCHPLATFORM_CONF_DIR/resources/punch/patterns/mytenant_arkoon_ngfw.grok
$PUNCHPLATFORM_CONF_DIR/resources/injector/mytenant/arkoon_ngfw_injector.json

When used in an LMC context, your punchlet receives the input logs under the [logs] stream. To you it is simply a JSON document, as illustrated next:

{
   "logs" : {
     "log" : "Sep 21 21:24:03 fakehost Alerts: AK... "
   }
} 

Remember that a JSON document is represented in Punch as a Tuple. Your punchlet will thus access that log using the punch root:[logs][log] instruction.
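
For example, the smallest punchlet that inspects its input simply prints that part of the root tuple:

{
  // Print the raw log line received under the [logs] stream.
  print([logs][log]);
}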

Most often, a parsing punchlet is divided into three parts:

1) Input check : checks that the input document (i.e. Tuple) is valid.
2) Syntax analysis : extracts and analyses the important fields from your log.
3) Field binding : makes sure all parts are stored under the fields ultimately expected by Elasticsearch.

Input check

Let us bootstrap our punchlet by retrieving the received log.

// @test Sep 21 21:24:03 fakehost Alerts: AKLOG - id=firewall time="2017-09-05 23:59:56" gmtime=1504648796 pri=2 fw=fakehost.com aktype=ALERT alert_type="Blocked by filter" user="bob" alert_level="Medium" alert_desc="UDP from 1.2.3.4:9987 to 1.2.3.5:33719 [default_rule]"
{
  [logs][log][message] = [logs][log]; // Saving the original message here
}

Hit Ctrl + B in Sublime Text to test this small stub.

Syntax analysis

Let us now cut our log little by little to extract the interesting parts. We will use a mix of punch operators and plain Java methods.

// @test Sep 21 21:24:03 fakehost Alerts: AKLOG - id=firewall time="2017-09-05 23:59:56" gmtime=1504648796 pri=2 fw=fakehost.com aktype=ALERT alert_type="Blocked by filter" user="bob" alert_level="Medium" alert_desc="UDP from 1.2.3.4:9987 to 1.2.3.5:33719 [default_rule]"
{
  [logs][log][message] = [logs][log];

  // We will use local variables. These are handy to store
  // intermediate results, without the burden of performing any
  // cleanup at punchlet return.
  Tuple tmp;

  // The syslogHeader operator parses and splits the header part from the rest.
  // Here we choose to put the two parts into tmp:[header] and tmp:[greedy].
  if (!syslogHeader().on([logs][log][message]).into(tmp:[header], tmp:[greedy])) {
      raise("does not start with a syslog header");
  }

  // We want to get rid of the 'Alerts: AKLOG - ' part. Here we use the
  // usual Java substring and indexOf String methods.
  // You can apply these to a punch Tuple directly.
  // I.e. writing
  //    tmp:[greedy].indexOf("AKLOG - ")
  // is equivalent to writing
  //    tmp:[greedy].asString().indexOf("AKLOG - ")
  String greedy = tmp:[greedy].substring(tmp:[greedy].asString().indexOf("AKLOG - ") + 8);

  // Last, we use the key-value operator. It nicely stores all the submatches
  // under the "kv" dictionary of our tmp Tuple.
  if (!kv().on(greedy).into(tmp:[kv])) {
      raise("not a kv log");
  }

  // For debugging only: print your results.
  print(tmp);
}

Hit Ctrl + B to see the result. You should see the following in the Sublime console.

{
  "header": {
    "host": {
      "name": "fakehost"
    },
    "alarm": {
      "sev": 0,
      "facility": 0
    },
    "ts": "2017-09-21T21:24:03.000+02:00"
  },
  "greedy": "Alerts: AKLOG - id=firewall time=",
  "kv": {
    "fw": "fakehost.com",
    "id": "firewall",
    "aktype": "ALERT",
    "pri": "2",
    "gmtime": "1504648796",
    "time": "2017-09-05 23:59:56",
    "alert_level": "Medium",
    "user": "bob",
    "alert_type": "Blocked by filter",
    "alert_desc": "UDP from 1.2.3.4:9987 to 1.2.3.5:33719 [default_rule]"
  }
}

Good, you have successfully cut your log.

Field binding

What we just got relies on the Arkoon naming convention (i.e. aktype, pri, alert_level, etc.). We must normalize the fields. If we succeed in normalizing the data, all logs, from whatever vendor, can be queried based on fields having the same semantics.

Punch & taxonomy normalisation are documented in the Event Normalization section. Here we will bind the following services:

  • into alarm.severity,
  • into obs.host.name
  • ...

Doing that is easy and compact with the punch language. It looks as follows.

{
  ...

  if (!kv().on(greedy).into(tmp:[kv])) {
    raise("not a kv log");
  }

  [logs][log][alarm][category]    = tmp:[kv][aktype];
  [logs][log][alarm][severity]    = tmp:[kv][pri];
  [logs][log][obs][host][name]    = tmp:[kv][fw];
  [logs][log][alarm][name]        = tmp:[kv][alert_type];
  [logs][log][init][usr][name]    = tmp:[kv][user];
  [logs][log][alarm][description] = tmp:[kv][alert_desc];

  ...
}

Put that at the end of your punchlet, and hit Ctrl + B again. Your punchlet is now complete. Congratulations!

First version

Our version of this punchlet is shown below. We simply removed the explanatory comments and added section comments to quickly spot the input/syntax/field-binding parts. We also replaced [logs][log] with the document: alias. This is all subjective coding style, but based on our experience it is the best way to keep a punchlet clean and easy to maintain.

// @test Sep 21 21:24:03 fakehost Alerts: AKLOG - id=firewall time="2017-09-05 23:59:56" gmtime=1504648796 pri=2 fw=fakehost.com aktype=ALERT alert_type="Blocked by filter" user="bob" alert_level="Medium" alert_desc="UDP from 1.2.3.4:9987 to 1.2.3.5:33719 [default_rule]"
{
    ///////////////////////////////////////////
    //  BLOCK : INPUT CHECK
    ///////////////////////////////////////////

    Tuple document = [logs][log];
    document:[message] = [logs][log];

    ///////////////////////////////////////////
    //  BLOCK : SYNTAX ANALYSIS
    ///////////////////////////////////////////

    Tuple tmp;
    if (!syslogHeader().on(document:[message]).into(tmp:[header], tmp:[greedy])) {
        raise("does not start with a syslog header");
    }
    String greedy = tmp:[greedy].substring(tmp:[greedy].indexOf("AKLOG - ") + 8);

    if (!kv().on(greedy).into(tmp:[kv])) {
        raise("not a kv log");
    }

    ///////////////////////////////////////////
    //  BLOCK : FIELD BINDING
    ///////////////////////////////////////////

    document:[alarm][category]    = tmp:[kv][aktype];
    document:[alarm][severity]    = tmp:[kv][pri];
    document:[obs][host][name]    = tmp:[kv][fw];
    document:[alarm][name]        = tmp:[kv][alert_type];
    document:[init][usr][name]    = tmp:[kv][user];
    document:[alarm][description] = tmp:[kv][alert_desc];
}

You have your parser. It produces the following normalized document.

{
  "logs": {
    "log": {
      "obs": {
        "host": {
          "name": "fakehost.com"
        }
      },
      "init": {
        "usr": {
          "name": "bob"
        }
      },
      "alarm": {
        "severity": "2",
        "name": "Blocked by filter",
        "description": "UDP from 1.2.3.4:9987 to 1.2.3.5:33719 [default_rule]",
        "category": "ALERT"
      },
      "message": "Sep 21 21:24:03 fakehost Alerts: AKLOG - id=firewall time="
    }
  }
}
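
Note that our input-check block does not actually reject foreign logs. If you want the punchlet to fail fast on non-Arkoon input, you could add a simple guard at the top. Here is a sketch using the plain Java indexOf method mentioned in Step 2:

{
    // Fail fast if the log does not carry the Arkoon vendor tag.
    if ([logs][log].asString().indexOf("AKLOG - ") < 0) {
        raise("not an Arkoon AKLOG log");
    }

    // ... rest of the punchlet, as shown above ...
}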

What to do next?

From there, we suggest you navigate to one of the following topics:

  • To integrate your parser into a channel, see channels;
  • To improve your parser, and dig further with punch programming, see punch_programming;
  • To become a true parser developer, see TutorialWriteProductionReadyParsers.