Archive Reader

Overview

The archive_reader node extracts archived data from S3, Ceph, or a local filesystem, using metadata stored in Elasticsearch.

This node must follow an Elastic Batch Input node that provides the metadata describing the archived data.

Runtime Compatibility

  • PySpark: compatible
  • Spark: compatible

Filters

The Archive Reader reads one batch for every metadata document received as input. Filters and selections on dates, pools, tags, etc. must therefore be applied in the Elasticsearch query of the input node:

{
    type: elastic_batch_input
    component: input
    settings: {
        index: mytenant-archive-*
        cluster_name : es_search
        nodes: [
            localhost
        ]
    }
    publish: [
        {
            stream: metadata 
        }
    ]
}
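The input above returns every metadata document in the index. As a minimal sketch, restricting the read to a single pool can be done by adding a query to the same settings (the archive.pool field is the one used in the full example below; the pool value is a placeholder):

{
    type: elastic_batch_input
    component: input
    settings: {
        index: mytenant-archive-*
        cluster_name : es_search
        nodes: [
            localhost
        ]
        # Only metadata matching this query will be published,
        # so only the corresponding batches will be read.
        query: {
            term: {
                archive.pool: mytenant
            }
        }
    }
    publish: [
        {
            stream: metadata
        }
    ]
}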

Example

{
  version: 6.0
  type: punchline
  runtime: spark
  meta: {
    tenant: mytenant
    channel: archive_reader
    vendor: archive_reader
  }
  dag: [
    {
      type: elastic_batch_input
      component: input
      settings: {
        index: mytenant-archive-*
        es_cluster: es_search
        nodes: [
          localhost
        ]
        query: {
          bool: {
            must: [
              {
                term: {
                  archive.pool: mytenant
                }
              }
              {
                term: {
                  tags.topic: apache_httpd
                }
              }
              {
                range: {
                  batch.earliest_ts: {
                    gte: now-1h
                    lt: now
                  }
                }
              }
            ]
          }
        }
      }
      publish: [
        {
          stream: metadata
        }
      ]
    }
    {
      type: archive_reader
      component: reader
      settings: {
        device_address: file:///tmp/archive-logs/storage
      }
      subscribe: [
        {
          component: input
          stream: metadata
        }
      ]
      publish: [
        {
          stream: data
        }
      ]
    }
    {
      type: show
      component: show
      settings: {
        title: RESULT
        num_rows: 1
        truncate: false
      }
      subscribe: [
        {
          component: reader
          stream: data
        }
      ]
    }
  ]
  settings: {}
}

This example can be used to read files archived by the apache_httpd channel. It is composed of 3 elements:

  • An Elasticsearch batch input sending a query to Elasticsearch to fetch the metadata.
  • An Archive Reader reading files from the provided device, using the metadata information.
  • A Show node displaying the resulting Dataset. This Dataset should contain one row per line read from the files.

Parameters

  • device_address: String > Description: [Optional] Address of the device where the data is stored. Must start with file://, ceph_configuration://, http://, ...
  • user: String > Description: [Optional] The user used to access the Ceph cluster.
  • access-key: String > Description: [Optional] Access key for the MinIO cluster.
  • secret-key: String > Description: [Optional] Secret key for the MinIO cluster.
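As a sketch of how these parameters combine, the reader node from the example above could target a MinIO (S3-compatible) backend instead of the local filesystem; the endpoint address and credentials below are placeholders:

{
    type: archive_reader
    component: reader
    settings: {
        # S3-compatible endpoint (placeholder address and credentials)
        device_address: http://localhost:9000
        access-key: minioadmin
        secret-key: minioadmin
    }
    subscribe: [
        {
            component: input
            stream: metadata
        }
    ]
    publish: [
        {
            stream: data
        }
    ]
}

For a Ceph backend, device_address would instead start with ceph_configuration:// and the user parameter would carry the Ceph user.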