Archive Reader

Overview

The archive_reader node extracts archived data from S3, Ceph, or a filesystem, using metadata stored in Elasticsearch.

This node must follow an Elastic Batch Input node providing metadata about the archived data.

Runtime Compatibility

  • PySpark: compatible
  • Spark: compatible

Example

{
    type: punchline
    version: "6.0"
    runtime: spark
    tenant: default
    dag: 
    [
        {
            type: elastic_batch_input
            component: input
            settings: {
                index: mytenant-archive-*
                cluster_name: es_search
                nodes: [
                    localhost
                ]
                query: {
                    bool: {
                        must: [
                            {
                                term: {
                                    pool: mytenant
                                }
                            }
                        ]
                    }
                }
            }
            publish: [
                {
                    stream: metadata 
                }
            ]
        }
    ]
}
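
The DAG above only declares the metadata input. In a complete punchline, an archive_reader node follows it and subscribes to the metadata stream. The snippet below is a minimal illustrative sketch: the device_address value and the published stream name are assumptions, not values documented on this page.

{
    type: archive_reader
    component: reader
    settings: {
        # Illustrative only: address of the backend holding the archives.
        device_address: "file:///tmp/archives"
    }
    subscribe: [
        {
            component: input
            stream: metadata
        }
    ]
    publish: [
        {
            # Assumed output stream name for the extracted records.
            stream: data
        }
    ]
}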

Filters

The Archive Reader reads one batch for every metadata document received as input. Therefore, any filtering or selection on dates, pool, tags, etc. must be applied in the Elasticsearch query of the preceding Elastic Batch Input node.

{
    type: elastic_batch_input
    component: input
    settings: {
        index: mytenant-archive-*
        cluster_name: es_search
        nodes: [
            localhost
        ]
    }
    publish: [
        {
            stream: metadata 
        }
    ]
}
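
For example, to restrict the read to a single pool and a recent time window, combine a term filter with a range filter in the query. The sketch below is illustrative: the timestamp field name and the time bounds depend on how your archive metadata is indexed.

{
    type: elastic_batch_input
    component: input
    settings: {
        index: mytenant-archive-*
        cluster_name: es_search
        nodes: [
            localhost
        ]
        query: {
            bool: {
                must: [
                    {
                        term: {
                            pool: mytenant
                        }
                    }
                    {
                        # Assumed field name: adjust to your metadata mapping.
                        range: {
                            "@timestamp": {
                                gte: "now-7d"
                            }
                        }
                    }
                ]
            }
        }
    }
    publish: [
        {
            stream: metadata
        }
    ]
}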

Parameters

  • device_address: String > Description: [Optional] Address of the device where the data is stored. Must start with file://, ceph_configuration://, http://, etc.
  • user: String > Description: [Optional] The user used to access the Ceph cluster.
  • access-key: String > Description: [Optional] Access key for the MinIO cluster.
  • secret-key: String > Description: [Optional] Secret key for the MinIO cluster.
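
As an illustration, an archive_reader node targeting a MinIO (S3-compatible) backend might carry the following settings. All values below are placeholders, not defaults taken from this page.

{
    type: archive_reader
    component: reader
    settings: {
        # Placeholder endpoint and credentials for a MinIO cluster.
        device_address: "http://localhost:9000"
        access-key: myaccesskey
        secret-key: mysecretkey
    }
    subscribe: [
        {
            component: input
            stream: metadata
        }
    ]
}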