Filter Node
The filter bolt provides simple yet efficient filtering capabilities with low CPU usage.
It offers IP blacklisting or whitelisting, as well as content-based filtering. A given filter bolt can be configured with only one type of filtering.
Streams and Fields
Unless you use whitelisting, the filter bolt is a proxy bolt: it emits tuples on the same stream as the one they arrived on. Make sure you properly configure its storm_settings section. Note that a filter bolt can subscribe to several input streams.
Black Lists
Here is how you filter out tuples in which one of the fields contains an IP address that you wish to blacklist. Simply indicate the (Storm) field in which the IP address is received, typically the one set by a syslog spout.
{
  "type" : "filter_bolt",
  "settings" : {
    "blacklist" : {
      "field" : "remote_host",
      "list" : [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.1/32" ]
    }
  },
  "storm_settings" : {
    "component" : "my_filter_bolt",
    "publish" : [
      { "stream" : "logs", "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"] }
    ],
    "subscribe" : [
      { "component" : "some_spout", "stream" : "logs", "grouping": "localOrShuffle" }
    ]
  }
}
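To make the matching rule concrete, here is a minimal Python sketch of what a CIDR blacklist check amounts to. It is an illustration only, not the bolt's actual implementation: the names `BLACKLIST`, `is_blacklisted` and `filter_tuples` are hypothetical, tuples are modeled as plain dictionaries, and the field is assumed to always hold a valid IP address.

```python
import ipaddress

# Same ranges as in the configuration example above.
BLACKLIST = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.1/32"]

# Parse the CIDR ranges once, not once per tuple.
NETWORKS = [ipaddress.ip_network(cidr) for cidr in BLACKLIST]


def is_blacklisted(tuple_fields, field="remote_host"):
    """Return True when the IP address held in `field` falls in a blacklisted range."""
    # Assumption: the field holds a valid IP address; ip_address raises ValueError otherwise.
    address = ipaddress.ip_address(tuple_fields[field])
    return any(address in network for network in NETWORKS)


def filter_tuples(tuples):
    """Keep only the tuples whose address is not blacklisted."""
    return [t for t in tuples if not is_blacklisted(t)]
```

With these ranges, a tuple whose remote_host is 10.1.2.3 would be dropped, while one coming from 8.8.8.8 would be forwarded unchanged.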
Instead of setting the blacklist explicitly, you can refer to an external JSON file. For example:
{
  "type" : "filter_bolt",
  "settings" : {
    "blacklist" : {
      "field" : "remote_host",
      "list" : "filter/blacklist.json"
    }
  },
  ...
}
Make sure your JSON file is located at $PUNCHPLATFORM_CONF_DIR/resources/filter/blacklist.json.
Additional settings for the EPS max metric:
{
  "type" : "filter_bolt",
  "settings" : {
    "metric_eps_calculation_period" : 500,
    "metric_eps_aggregation_period" : 30000,
    ...
  }
}
The EPS max metric computes an "instant EPS" every metric_eps_calculation_period milliseconds, over a window of metric_eps_aggregation_period milliseconds, and publishes the maximum of these instant EPS values as a metric.
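One way to read the two periods is sketched below in Python. This is an interpretation of the description above, not the bolt's code: the function name `max_instant_eps` is hypothetical, and the alignment of calculation slices on multiples of the calculation period is an assumption.

```python
from collections import Counter


def max_instant_eps(event_times_ms, calculation_period_ms=500):
    """Return the highest 'instant EPS' among the given event timestamps.

    `event_times_ms` holds the arrival times (in milliseconds) of the events
    received during one aggregation period.
    """
    if not event_times_ms:
        return 0.0
    # Count events per calculation slice (assumed aligned on multiples of the period).
    counts = Counter(t // calculation_period_ms for t in event_times_ms)
    # Convert the busiest slice's event count into events per second.
    return max(counts.values()) * 1000.0 / calculation_period_ms
```

For instance, three events falling in the same 500 ms slice give an instant EPS of 6 for that slice. With the settings shown above, one such maximum would be published every 30 seconds.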
White Lists
The whitelist works the other way around: it only lets a tuple pass through the topology if its IP field falls within your list.
{
  "type" : "filter_bolt",
  "settings" : {
    "whitelist" : {
      "field" : "remote_host",
      "list" : [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16" ]
    }
  },
  "storm_settings" : {
    ...
  }
}
It is handy to dispatch the whitelisted traffic onto different Storm streams. This lets you, typically, dispatch the traffic received from (say) the "10.0.0.0/8" address range to the "checkpoint" stream. Here is an example:
{
  "type" : "filter_bolt",
  "settings" : {
    "whitelist" : {
      "field" : "remote_host",
      "list" : [
        { "address" : "10.0.0.0/8", "stream" : "checkpoint" },
        { "address" : "172.16.0.0/12", "stream" : "cisco" },
        { "address" : "192.168.0.0/16", "stream" : "handover" }
      ]
    }
  },
  "storm_settings" : {
    "publish" : [
      {
        "stream" : "checkpoint",
        "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"]
      },
      {
        "stream" : "cisco",
        "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"]
      },
      {
        "stream" : "handover",
        "fields" : ["log", "local_host", "local_port", "remote_host", "remote_port", "local_uuid", "local_timestamp"]
      }
    ]
  }
}
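The dispatching logic of this configuration can be pictured with the following Python sketch. It is illustrative only: `select_stream` and `ROUTES` are hypothetical names, and the "first matching range wins" behavior is an assumption, not something this page specifies.

```python
import ipaddress

# Same address-to-stream mapping as in the configuration example above.
WHITELIST = [
    {"address": "10.0.0.0/8", "stream": "checkpoint"},
    {"address": "172.16.0.0/12", "stream": "cisco"},
    {"address": "192.168.0.0/16", "stream": "handover"},
]

# Parse each CIDR range once and pair it with its target stream.
ROUTES = [(ipaddress.ip_network(e["address"]), e["stream"]) for e in WHITELIST]


def select_stream(remote_host):
    """Return the target stream for a whitelisted address, or None to drop the tuple."""
    address = ipaddress.ip_address(remote_host)
    # Assumption: ranges are checked in order and the first match wins.
    for network, stream in ROUTES:
        if address in network:
            return stream
    return None
```

A tuple received from 172.20.0.1 would go to the "cisco" stream, and a tuple from an address outside all three ranges would not be emitted at all.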
Content Matching
Here is how you filter out (say) logs containing a given substring.
{
  "type" : "filter_bolt",
  "settings" : {
    "exclude_substring" : {
      "field" : "log",
      "substrings" : ["Business"],
      "case_sentitive" : true
    }
  }
}
You can use include_substring if you want to work the other way around.
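The difference between the exclude and include modes can be sketched in Python as follows. The helper names `contains_any` and `keep_tuple`, as well as their parameters, are hypothetical and only mirror the settings shown above. Lowercasing both sides is one simple way to get case-insensitive matching and is an assumption about how the bolt behaves.

```python
def contains_any(text, substrings, case_sensitive=True):
    """Return True when `text` contains at least one of `substrings`."""
    if not case_sensitive:
        # Compare lowercased copies on both sides.
        text = text.lower()
        substrings = [s.lower() for s in substrings]
    return any(s in text for s in substrings)


def keep_tuple(tuple_fields, field, substrings, case_sensitive=True, exclude=True):
    """Decide whether a tuple passes the filter.

    exclude=True  mimics exclude_substring: matching tuples are dropped.
    exclude=False mimics include_substring: only matching tuples are kept.
    """
    matched = contains_any(tuple_fields[field], substrings, case_sensitive)
    return not matched if exclude else matched
```

With a case-sensitive exclusion on "Business", a log reading "Business report" is dropped whereas "business report" still passes.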
Regex Matching
You can use regexes as well.
Warning
Check the performance if you include a complex pattern. Be aware that executing a regex takes about 100 times more CPU than a simple substring (contains) check. Do not use regexes on high traffic.
{
  "type" : "filter_bolt",
  "settings" : {
    "exclude_regex" : {
      "field" : "log",
      "regex" : ".*(traffic|accept).*"
    }
  }
}
You can use include_regex if you want to work the other way around.
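As an illustration, the regex variant can be sketched in Python along the same lines. The name `keep_tuple_regex` is hypothetical, and requiring the pattern to match the whole field value is an assumption suggested by the leading and trailing `.*` in the example pattern, not a documented behavior.

```python
import re

# Compile the pattern once; recompiling it for every tuple would waste CPU.
PATTERN = re.compile(r".*(traffic|accept).*")


def keep_tuple_regex(tuple_fields, field="log", exclude=True):
    """Decide whether a tuple passes the regex filter.

    exclude=True  mimics exclude_regex: matching tuples are dropped.
    exclude=False mimics include_regex: only matching tuples are kept.
    """
    # Assumption: the pattern must match the entire field value.
    matched = PATTERN.fullmatch(tuple_fields[field]) is not None
    return not matched if exclude else matched
```

When the pattern is only a list of alternative words, as here, a substring filter on "traffic" and "accept" expresses the same rule at a much lower CPU cost.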