Filter Node

The Filter bolt provides simple yet efficient filtering capabilities with low CPU usage.

The Filter bolt offers IP blacklisting or whitelisting capabilities, as well as content-based filtering. A given Filter bolt can only be configured with one type of filtering.

Streams and Fields

Unless you use whitelisting, the filter bolt is a proxy bolt: it emits tuples on the same stream as the input stream. Make sure its storm_settings section is configured accordingly. Note that you can subscribe your filter bolt to several input streams.

Info

Make sure you understand the fundamental concepts of spouts, bolts, streams and fields.

Black Lists

Here is how you filter out tuples in which one of the fields contains an IP address that you wish to blacklist. You simply indicate the (Storm) field in which the IP address is received, typically the one set by a syslog spout.

{
 "type" : "filter_bolt",
 "settings" : {
   "blacklist" : {
     "field" : "remote_host",
     "list" : [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.1/32" ]
   }
 },
 "storm_settings" : {
   "component" :  "my_filter_bolt",
   "publish" : [ 
     { "stream" : "logs", "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"] } 
   ],
   "subscribe" : [ 
     { "component" : "some_spout", "stream" : "logs", "grouping": "localOrShuffle" } 
   ]
 }

}
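To make the matching rule concrete, here is a minimal Python sketch of what a CIDR blacklist check amounts to. It is not the bolt's implementation; the function names `is_blacklisted` and `filter_tuples` are hypothetical, and tuples are modelled as plain dictionaries for illustration.

```python
import ipaddress

# CIDR ranges copied from the "list" setting of the example above.
BLACKLIST = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.1/32")
]


def is_blacklisted(tuple_fields, field="remote_host"):
    """Return True if the IP address held in `field` falls in a blacklisted range."""
    address = ipaddress.ip_address(tuple_fields[field])
    return any(address in network for network in BLACKLIST)


def filter_tuples(tuples):
    """Yield only the tuples that are not blacklisted (hypothetical proxy behaviour)."""
    for fields in tuples:
        if not is_blacklisted(fields):
            yield fields
```

For instance, a tuple whose `remote_host` is `10.1.2.3` would be dropped, while one coming from `8.8.8.8` would be forwarded unchanged.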

Instead of setting the blacklist explicitly, you can refer to an external JSON file. For example:

{
  "type" : "filter_bolt",
  "settings" : {
    "blacklist" : {
      "field" : "remote_host",
      "list" : "filter/blacklist.json"
    }
  },
  ...
}

Make sure your JSON file is located at $PUNCHPLATFORM_CONF_DIR/resources/filter/blacklist.json.
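Before deploying such a file, it can help to check that every entry parses as a network range. The sketch below assumes the external file holds a plain JSON array of CIDR strings, mirroring the inline "list" value. That layout is an assumption, not something this page specifies, and `load_blacklist` is a hypothetical helper.

```python
import ipaddress
import json


def load_blacklist(path):
    """Load a blacklist file and parse each entry as a CIDR range.

    Assumes (hypothetically) that the file contains a JSON array of strings
    such as ["10.0.0.0/8", "192.168.0.0/16"]. Raises ValueError on bad input.
    """
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of CIDR strings")
    # ip_network() rejects malformed ranges and ranges with host bits set.
    return [ipaddress.ip_network(entry) for entry in entries]
```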

Additional settings for the EPS max metric:

{
 "type" : "filter_bolt",
 "settings" : {
     "metric_eps_calculation_period" : 500,
     "metric_eps_aggregation_period" : 30000,
     ...
 }
}

The EPS max metric computes an "instant EPS" (events per second) value every 'metric_eps_calculation_period' ms over a window of 'metric_eps_aggregation_period' ms, and publishes the maximum of these "instant EPS" values as a metric.
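The arithmetic behind this metric can be sketched as follows. This is an illustration under assumptions, not the bolt's code: the function `max_instant_eps` is hypothetical, and event timestamps are assumed to be given in milliseconds relative to the start of one aggregation window.

```python
def max_instant_eps(event_times_ms, calculation_period_ms=500,
                    aggregation_period_ms=30000):
    """Return the highest 'instant EPS' observed within one aggregation window."""
    # Number of calculation slots in the window, rounded up.
    slots = (aggregation_period_ms + calculation_period_ms - 1) // calculation_period_ms
    counts = [0] * slots
    for t in event_times_ms:
        if 0 <= t < aggregation_period_ms:
            counts[int(t // calculation_period_ms)] += 1
    # Scale the busiest slot's event count to events per second.
    return max(counts) * 1000 / calculation_period_ms
```

With the values shown above, a window is split into 60 slots of 500 ms each; if the busiest slot saw 10 events, the reported value would be 20 events per second.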

White Lists

The whitelist works the other way around: it only lets a tuple pass through the topology if its IP field matches an entry in your list.

{
  "type" : "filter_bolt",
  "settings" : {
    "whitelist" : {
      "field" : "remote_host",
      "list" : [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16" ]
    }
  },
  "storm_settings" : {
    ...
  }
}

It is often handy to dispatch the whitelisted traffic onto different Storm streams. This lets you, for instance, dispatch the traffic received from the "10.0.0.0/8" address range to the "checkpoint" stream. Here is an example:

{
  "type" : "filter_bolt",
  "settings" : {
    "whitelist" : {
      "field" : "remote_host",
      "list" : [
        { "address" : "10.0.0.0/8", "stream" : "checkpoint" },
        { "address" : "172.16.0.0/12", "stream" : "cisco" },
        { "address" : "192.168.0.0/16", "stream" : "handover" }
      ]
    }
  },
  "storm_settings" : {
    "publish" : [
      {
        "stream" : "checkpoint",
        "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"]
      },
      {
        "stream" : "cisco",
        "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"]
      },
      {
        "stream" : "handover",
        "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"]
      }
    ]
  }
}
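The dispatching rule described by this configuration can be pictured with the short Python sketch below. It is an illustration only: the `route` function is hypothetical, and the "first matching range wins" behaviour is an assumption, since this page does not say how overlapping ranges are resolved.

```python
import ipaddress

# (range, target stream) pairs copied from the whitelist example above.
WHITELIST = [
    (ipaddress.ip_network("10.0.0.0/8"), "checkpoint"),
    (ipaddress.ip_network("172.16.0.0/12"), "cisco"),
    (ipaddress.ip_network("192.168.0.0/16"), "handover"),
]


def route(tuple_fields, field="remote_host"):
    """Return the stream a tuple would be emitted on, or None if it is dropped."""
    address = ipaddress.ip_address(tuple_fields[field])
    for network, stream in WHITELIST:
        if address in network:
            return stream  # assumption: the first matching range wins
    return None  # address not whitelisted: the tuple is filtered out
```

Each stream returned by such a rule must also be declared in the "publish" section, as in the configuration above.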

Content Matching

Here is how you filter out logs that contain a given substring.

{
  "type" : "filter_bolt",
  "settings" : {
    "exclude_substring" : {
      "field" : "log",
      "substrings" : ["Business"],
      "case_sentitive" : true
    }
  }
}

Use include_substring instead if you want the opposite behaviour, that is, to keep only the logs containing one of the substrings.
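As a rough illustration of the two modes, the following Python sketch applies an exclude and an include substring filter to a list of log strings. The helper names are hypothetical, and the handling of case-insensitive matching (lowercasing both sides) is an assumption about how such an option would typically behave.

```python
def contains_any(text, substrings, case_sensitive=True):
    """Return True if `text` contains at least one of `substrings`."""
    if not case_sensitive:
        text = text.lower()
        substrings = [s.lower() for s in substrings]
    return any(s in text for s in substrings)


def exclude_substring(logs, substrings, case_sensitive=True):
    """Keep only the logs that contain none of the substrings."""
    return [log for log in logs if not contains_any(log, substrings, case_sensitive)]


def include_substring(logs, substrings, case_sensitive=True):
    """Keep only the logs that contain at least one of the substrings."""
    return [log for log in logs if contains_any(log, substrings, case_sensitive)]
```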

Regex Matching

You can use regexes as well.

Warning

Check the performance impact if you use a complex pattern. Be aware that evaluating a regex takes about 100 times more CPU than a simple substring (contains) check. Do not use regexes on high-traffic streams.

{
     "type" : "filter_bolt",
     "settings" : {
         "exclude_regex" : {
            "field" : "log", 
            "regex" : ".*(traffic|accept).*"
        }
     }
}

Use include_regex instead if you want the opposite behaviour, that is, to keep only the logs matching the regex.
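Given the CPU cost mentioned in the warning, it is worth checking whether a pattern can be replaced by plain substrings. The Python sketch below suggests that, on single-line logs, the pattern from the example selects the same logs as a check for the substrings "traffic" and "accept". The function names are hypothetical, and whether the bolt applies the regex as a full match is an assumption.

```python
import re

# Pattern copied from the exclude_regex example above.
PATTERN = re.compile(r".*(traffic|accept).*")
SUBSTRINGS = ("traffic", "accept")


def matches_regex(log):
    """Full-match the log against the pattern (assumed matching mode)."""
    return PATTERN.fullmatch(log) is not None


def matches_substrings(log):
    """Cheaper alternative: plain containment checks."""
    return any(s in log for s in SUBSTRINGS)
```

When both functions agree on your logs, the substring form is the cheaper option for high-traffic streams.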