
FileSpout

The FileSpout reads the content of one or several files and injects it into the topology.

You can control the way the file spout delimits the input data. The simplest strategy is to emit each line of the input file as one value. You can however handle other patterns using the codec parameter explained below.

By default, the file spout launches one dedicated thread per file to scan. If a file is closed, rotated or recreated, it is continuously reopened and read.

You can set a directory path instead of a single file path, in which case all the files it contains will be continuously read. You can also wait for new files to be written to this directory; they will be automatically read. Last, you can set a pattern to match the names of the files to scan, for example *.log. Note finally that, just like other spouts, you can include the rate limiter.

Here is a complete configuration example to read files from a folder:

{
  "type" : "file_spout",
  "spout_settings" : {
    # the codec parameter tells the spout how to delimit the
    # file data ('line' is the default codec)
    "codec" : "line",

    # the path can be a file or a directory name
    "path" : "/var/log",

    # You can use regexes to select the files you want to read
    # from the folder path
    "path.regex" : ".*\\.log"
  },
  "storm_settings" : {
    ...
  }
}

Info

The file spout will reemit any failed tuples (i.e. reemit the lines of files that have been signaled as failed by a downstream bolt). This makes the file spout robust and capable of coping with downstream failures. However, if the file spout itself is restarted, it will restart reading the files from the start or from the end (depending on the tail and read_file_from_start options). If you need more advanced and resilient data consumption, use the KafkaSpout instead.

File Reading Strategies

Watcher on a directory

You can activate the "watcher" feature on a specific path. This way, all files (in the initial path or any new subfolder) and archives (.zip, .gz) created in it will be read automatically.

To configure this watcher, see the full description of the watcher and watcher.process_existing_files parameters. If you set the watcher.process_existing_files option to true, any file already existing in the directory (and its subdirectories) will be added to the list of files to watch. Otherwise, these files will never be processed, even if the tail option is active.

Note that, for every file, you can choose to activate or deactivate the tail and read_file_from_start options, but do not deactivate both (or nothing will ever happen).

For example, the configuration below acts like the Bash command tail -f /path/to/directory/*.log:

{
  "type": "file_spout",
  "spout_settings": {
    "codec": "line",
    "watcher": true,
    "watcher.process_existing_files": true,
    "tail": true,
    "read_file_from_start": false,
    "path": "/path/to/directory",
    "path.regex": ".*\\.log"
  },
  "storm_settings": {
    ...
  }
}

Zip Files

Zip files (.zip) can be read with the FileSpout. An archive will not be read if it contains a subfolder: only the files present at the root level of the archive will be processed.
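
For illustration only, here is a minimal sketch of a configuration reading zip archives from a folder; the path value is hypothetical:

{
  "type" : "file_spout",
  "spout_settings" : {
    # each archived file is read line by line
    "codec" : "line",

    # hypothetical folder containing the zip archives
    "path" : "/data/archives",

    # only consider zip files
    "path.regex" : ".*\\.zip"
  },
  "storm_settings" : {
    ...
  }
}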

Compressed Files

Compressed files (.gz) can be read with the FileSpout. Warning: only single compressed files are supported, not groups of files nor filesystem trees.
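
As a hedged sketch (the path value is hypothetical), the watcher configuration shown earlier can be reused to pick up single .gz files dropped into a directory:

{
  "type": "file_spout",
  "spout_settings": {
    "codec": "line",
    # watch the directory and read new .gz files as they appear
    "watcher": true,
    "watcher.process_existing_files": true,
    # hypothetical drop directory
    "path": "/data/incoming",
    "path.regex": ".*\\.gz"
  },
  "storm_settings": {
    ...
  }
}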

Delimiting the Input Data

By default, the FileSpout uses the carriage return character to delimit lines.

The file spout supports multiline records using a configuration similar to the syslog spout's.

The codec parameter can take the following values. By default it is line.

  • line:

    the file is read line by line. The carriage return character is expected as the line delimiter.

  • multiline:

    several lines are grouped together according to a multiline pattern. This is typically used to read Java stack traces and emit each of them as a single tuple field.

    Use the 'multiline.regex' or 'multiline.delimiter' option to set the new-line detection rule (see the sketch after this list).

  • json_array:

    With the json_array codec, the file is streamed continuously so as to delimit JSON array elements. Each element is emitted as a separate tuple.

  • xml:

    With the xml codec, the file is streamed continuously so as to delimit XML top-level elements. Each element is emitted as a separate tuple.
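
Coming back to the multiline codec, here is a hedged sketch of a possible configuration, assuming that multiline.regex identifies the lines starting a new record; the path and regex values are illustrative only:

{
  "type": "file_spout",
  "spout_settings": {
    "codec": "multiline",
    # assumption: a line starting with a date begins a new record,
    # so stack trace lines are grouped with the preceding log line
    "multiline.regex": "^\\d{4}-\\d{2}-\\d{2}.*",
    # hypothetical application log file
    "path": "/var/log/app.log"
  },
  "storm_settings": {
    ...
  }
}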

The xml and json codecs are very useful but only work on specific input data. Consider the following examples using the json_array codec. Let's say your file contains a single (possibly very big) JSON array:

[1, 2, 3]

Each element will be emitted: in this case, 1, 2 and 3.

Now, say that you have another structure with an enclosing object:

{"records":[ 1, 2, 3]}{"records":[4, 5, 6]}}

The json_array codec will again emit only the inner elements (1, 2, 3, 4, 5, 6).
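The xml codec behaves in the same spirit on XML top-level elements. As a hedged illustration, if the file contains:

<record>1</record><record>2</record><record>3</record>

then each <record> element would be emitted as a separate tuple.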

Streams And Fields

The file spout emits 2-field tuples in the topology. One field is the line that was read, the other is the path of the file. The former can be named arbitrarily, the latter must be named path.

Here is a configuration example that emits lines on the logs stream, using the log, path and last_modified fields:

{
  "type": "file_spout",
  "spout_settings": {
    ...
  },
  "storm_settings": {
    "component": "my_filespout",
    "publish": [
      {
        "stream": "logs",
        "fields": [
          "log",
          "path",
          "last_modified"
        ]
      }
    ]
  }
}

The FileSpout outputs its data through 3 special Storm fields. If you declare these fields, they will carry the following values:

  • log : the data read from the file
  • path : the absolute path name of the originating file
  • last_modified : a timestamp of the file creation or last update time.

Parameters

To get the complete list of available parameters, see the FileSpout java documentation.

Metrics

See the FileSpout metrics section.