File Input

Overview

The file_input node reads file content from your local file system or a distributed file system such as HDFS, and publishes it as a Spark dataset within your PML pipeline:

job: [
  {
    type: file_input
    component: input
    settings: {
      // Supported formats are:
      // text, csv, json, parquet, orc, jdbc, libsvm
      format: csv
      // The name of the file specified in the spark.files parameter below.
      // This node is mostly useful for developing simple PML jobs.
      file_name: AAPL.csv
    }
    publish: [
      {
        stream: data
      }
    ]
  }
],
spark_settings: {
  // Location of the input file. The path must be reachable
  // from every node where Spark runs.
  // You can also use a relative path such as './AAPL.csv', as long
  // as you launch your PML job in foreground mode from that directory.
  spark.files: /tmp/AAPL.csv
}
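
Conceptually, the node behaves like a plain Spark read of the configured format. The Scala sketch below is an illustrative equivalent of the job above, not the node's actual implementation; the local master setting and the header option are assumptions for the example:

import org.apache.spark.sql.{DataFrame, SparkSession}

object FileInputSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative local session; a real PML job runs on the cluster.
    val spark = SparkSession.builder()
      .appName("file-input-sketch")
      .master("local[*]")
      .getOrCreate()

    // Equivalent of format: csv, file_name: AAPL.csv and
    // spark.files: /tmp/AAPL.csv. The path must be readable
    // from every Spark node.
    val data: DataFrame = spark.read
      .format("csv")
      .option("header", "true") // assumption: the csv file has a header row
      .load("/tmp/AAPL.csv")

    // The node publishes this dataset on the "data" stream for
    // downstream nodes; here we simply preview it.
    data.show(5)

    spark.stop()
  }
}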

Configuration(s)

  • file_name: String

    Description: [Required] The name of the file specified within the spark.files parameter.

  • format: String

    Description: [Required] The codec used to read the file content [text, csv, json, parquet, orc, jdbc, libsvm].
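
For instance, to read a JSON file instead of a CSV, only the format and file name change. The file name events.json below is hypothetical:

job: [
  {
    type: file_input
    component: input
    settings: {
      format: json
      // Hypothetical file, distributed via spark.files below.
      file_name: events.json
    }
    publish: [
      {
        stream: data
      }
    ]
  }
],
spark_settings: {
  spark.files: /tmp/events.json
}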