File Spout

The file spout reads lines from one or several files and injects them into the topology. The carriage return character is used to delimit lines. The file spout supports multiline input, using a configuration similar to that of the syslog spout.

The file spout launches one dedicated thread per file to scan. If a file is closed, rotated, or recreated, the spout reopens it and keeps reading it.

You can set a directory path instead of a single file path, in which case all the files in that directory are continuously read. You can also set a pattern (the “path.regex” setting) to match the names of the files to scan, for example “.*\.log”. Finally, like the other spouts, the file spout can include the rate limiter.
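The selection of files described above can be sketched as follows. This is a minimal illustration, not the spout's actual implementation: the function name `select_files` is hypothetical, and the sketch assumes the pattern is a regular expression applied to the whole file name (not to the full path) and that sub-directories are ignored.

```python
import os
import re
import tempfile


def select_files(path, name_regex=None):
    """Return the sorted list of regular files to scan under `path`.

    Assumption: the pattern must match the whole file name; the real
    spout may apply it differently.
    """
    if os.path.isfile(path):
        return [path]
    pattern = re.compile(name_regex) if name_regex else None
    selected = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue  # skip sub-directories (assumption)
        if pattern is None or pattern.fullmatch(name):
            selected.append(full)
    return selected


def demo():
    """Create three files in a temporary directory and select the logs."""
    with tempfile.TemporaryDirectory() as directory:
        for name in ("a.log", "b.log", "notes.txt"):
            with open(os.path.join(directory, name), "w") as handle:
                handle.write("hello\n")
        selected = select_files(directory, r".*\.log")
        return [os.path.basename(p) for p in selected]


if __name__ == "__main__":
    print(demo())
```

Matching the file name rather than the full path keeps the pattern independent of where the directory is located.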

Here is a complete configuration example.

   {
     "type" : "file_spout",
     "spout_settings" : {
         "load_control" : "none",
         "multiline" : true,
         "multiline.regex" : "^(\t).*",
         "multiline.delimiter" : " ",
         "read_file_from_start" : true,
         "path" : "/var/log",
         "path.regex" : ".*\\.log"
     },
     "storm_settings" : { ... }
   }
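To give an idea of what the multiline settings are for, here is a small sketch of one plausible interpretation. It assumes that a line matching “multiline.regex” is a continuation of the previous line and is appended to it using “multiline.delimiter”; the exact semantics are those of the spout, not of this sketch. The function name `merge_multiline` is hypothetical.

```python
import re


def merge_multiline(lines, continuation_regex, delimiter):
    """Group physical lines into logical records.

    Assumption: a line matching `continuation_regex` continues the
    previous record and is appended to it with `delimiter`, after its
    surrounding whitespace has been stripped.
    """
    pattern = re.compile(continuation_regex)
    records = []
    for line in lines:
        if records and pattern.match(line):
            records[-1] = records[-1] + delimiter + line.strip()
        else:
            records.append(line)
    return records


if __name__ == "__main__":
    sample = [
        "Exception in main",
        "\tat Foo.bar(Foo.java:1)",
        "\tat Main.main(Main.java:2)",
        "next event",
    ]
    # Same regex and delimiter as in the configuration example.
    for record in merge_multiline(sample, "^(\t).*", " "):
        print(record)
```

With these settings, a stack trace whose continuation lines start with a tab would be emitted as a single record rather than as one record per line.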

Important

The file spout re-emits failed tuples, i.e. the lines that a downstream bolt has signaled as failed. This makes the file spout robust and capable of coping with downstream failures. However, if the file spout itself is restarted, it resumes reading the files either from the start or from the end. If you need more advanced and resilient data consumption, use the KafkaSpout instead.
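The replay behavior can be pictured with the toy model below. It is only a sketch of the general acknowledge/fail pattern, not the file spout's code: the class `ReplayingLineSource` and its method names are hypothetical, and message identifiers are simply line indexes.

```python
from collections import deque


class ReplayingLineSource:
    """Toy model of a spout that re-emits lines reported as failed."""

    def __init__(self, lines):
        self._to_emit = deque(enumerate(lines))  # (message id, line) pairs
        self._pending = {}  # emitted lines awaiting an ack or a fail

    def next_tuple(self):
        """Emit the next line, or return None when nothing is left."""
        if not self._to_emit:
            return None
        message_id, line = self._to_emit.popleft()
        self._pending[message_id] = line
        return message_id, line

    def ack(self, message_id):
        """Forget a line that was processed successfully downstream."""
        self._pending.pop(message_id, None)

    def fail(self, message_id):
        """Queue a failed line so that it is emitted again."""
        line = self._pending.pop(message_id, None)
        if line is not None:
            self._to_emit.append((message_id, line))


if __name__ == "__main__":
    source = ReplayingLineSource(["first line", "second line"])
    first = source.next_tuple()
    second = source.next_tuple()
    source.fail(first[0])   # downstream bolt reports a failure
    source.ack(second[0])   # downstream bolt reports a success
    print(source.next_tuple())  # the failed line comes back
```

In this model the pending lines live only in memory, which illustrates why a restart of the spout loses track of what was already emitted.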

Streams and Fields

The file spout emits two-field tuples into the topology. One field holds the line that was read, the other holds the path of the file. The former can be named arbitrarily; the latter must be named “path”. The logic of the file spout is illustrated next.

[Figure: FileSpoutContract.png]

Here is a configuration example that emits lines on the “logs” stream, using the fields “log” and “path”.

   {
       "type" : "file_spout",
       "spout_settings" : { ... },
       "storm_settings" : {
         "executors": 1,
         "component" : "a_file_spout",
         "publish" : [
           {
             "stream" : "logs",
             "fields" : [
               "log",
               "path"
             ]
           }
         ]
       }
   }
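As an illustration of the contract above, the following sketch builds the tuple emitted for one line from a “storm_settings” section. The function name `build_tuples` is hypothetical, and the sketch assumes that the field named “path” receives the file path while the other declared field receives the line.

```python
def build_tuples(storm_settings, line, file_path):
    """Return {stream: {field: value}} for one line read from `file_path`.

    Assumption: each published stream declares exactly two fields, one of
    which is named "path"; the other one carries the line.
    """
    emitted = {}
    for publication in storm_settings["publish"]:
        fields = publication["fields"]
        if len(fields) != 2 or "path" not in fields:
            raise ValueError("expected two fields, one of them named 'path'")
        emitted[publication["stream"]] = {
            name: (file_path if name == "path" else line) for name in fields
        }
    return emitted


if __name__ == "__main__":
    settings = {
        "executors": 1,
        "component": "a_file_spout",
        "publish": [{"stream": "logs", "fields": ["log", "path"]}],
    }
    print(build_tuples(settings, "some log line", "/var/log/app.log"))
```

A downstream bolt subscribing to the “logs” stream would then find the line under “log” and the originating file under “path”.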

Metrics

See File Spout Metrics