public class FetchCsvOperator extends FetchOperator<FetchCsvOperator>
Tuple resource = fetch("https://w3c.github.io/csvw/tests/test001.csv").asTuple();
This code fetches the test001.csv file from the W3C web server, caches it, and makes it available to your punchlet as a punch Tuple.
You can use the refresh option to make the operator regularly refresh your resource. Here is an example to reload the file every 60 seconds.
Tuple resource = fetchCsv("https://your.resource.server/resource.csv")
.refresh("0/60 * * * * ? *")
.asTuple();
This is handy to enrich your data dynamically. The FetchCsvOperator provides several ways to transform CSV files into the format you need, as explained below.
public class CsvFetcher extends Punchlet {
//
// This resource tuple will contain a periodically refreshed
// representation of our resource file.
//
Tuple resource;
public void activate() {
resource = fetchCsv()
.s3Endpoint("http://10.179.178.215:9000")
.s3Bucket("punch1")
.s3AccessKey("admin")
.s3Secret("password")
.s3Object("whois.csv")
.refresh("0/45 * * * * ? *")
.compactInPlace()
.delimiter(",")
.required()
.hashKey("domain")
.asTuple();
}
public void execute(Tuple root) {
Tuple creationDate = resource:[420highzup.com][creation_date];
if (creationDate) {
...
}
}
}
This is both elegant and efficient as the construction of the fetch operator will be executed only once at startup.
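The build-once, read-often pattern used by the punchlet above can be sketched in plain Java, independently of the punch library. The class RefreshingResource below is hypothetical and is not how the operator is implemented: it assumes a fixed refresh period instead of a cron expression, and it assumes that a failed reload should keep the previously loaded value.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical illustration, not part of the punch API.
public final class RefreshingResource<T> implements AutoCloseable {
    private final Supplier<T> loader;
    private final AtomicReference<T> current = new AtomicReference<>();
    private final ScheduledExecutorService scheduler;

    // Built once (think activate()); performs the initial load.
    public RefreshingResource(Supplier<T> loader) {
        this.loader = loader;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "resource-refresh");
            t.setDaemon(true); // do not keep the JVM alive for refreshes
            return t;
        });
        reload();
    }

    // Load a fresh copy and swap it in atomically.
    public void reload() {
        try {
            current.set(loader.get());
        } catch (RuntimeException e) {
            // Assumption of this sketch: keep serving the previous copy.
        }
    }

    // Schedule periodic reloads with a fixed period.
    public void every(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::reload, period, period, unit);
    }

    // Cheap read, safe to call for every record (think execute()).
    public T get() {
        return current.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        AtomicInteger version = new AtomicInteger();
        try (RefreshingResource<Integer> resource =
                 new RefreshingResource<>(version::incrementAndGet)) {
            resource.every(60, TimeUnit.SECONDS);
            System.out.println("loaded version " + resource.get());
        }
    }
}
```

Readers never block on a reload: they see either the old copy or the new one, which is why the lookup stays cheap on the per-record path.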
By default, CSV files are transformed into plain arrays. A first header line can be used to set the field names. For example, if you have the following CSV file content:
Surname,FamilyName
Homer,Simpson
Ned,Flanders
The tuple will look like:
[
{
"FamilyName": "Simpson",
"Surname": "Homer"
},
{
"FamilyName": "Flanders",
"Surname": "Ned"
}
]
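To make this default mapping concrete, here is a minimal plain-Java sketch that turns CSV text with a header line into a list of rows keyed by column name. The class CsvRows and its parse method are hypothetical and are not part of the punch API. The sketch splits naively on the delimiter and does not handle quoted fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the punch API.
public class CsvRows {

    // Parse CSV text whose first line is a header.
    // Naive split: quoted fields and escaped delimiters are not supported.
    public static List<Map<String, String>> parse(String csv, String delimiter) {
        String[] lines = csv.strip().split("\\R");
        String sep = Pattern.quote(delimiter);
        String[] header = lines[0].split(sep, -1);
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[i].split(sep, -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length && c < cells.length; c++) {
                row.put(header[c], cells[c]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "Surname,FamilyName\nHomer,Simpson\nNed,Flanders\n";
        for (Map<String, String> row : parse(csv, ",")) {
            System.out.println(row);
        }
    }
}
```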
You can instead specify one of the keys to be used as hash key, using the FetchOperator.hashKey method. This lets you generate a dictionary instead. For example:
fetch("https://w3c.github.io/csvw/tests/test001.csv").hashKey("Surname").asTuple();
Produces:
{
"Homer": {
"FamilyName": "Simpson"
},
"Ned": {
"FamilyName": "Flanders"
}
}
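The transformation from an array of rows to a dictionary can be sketched as follows. This is a plain-Java illustration with a hypothetical HashKeyIndex class, not the operator's implementation. It assumes that rows lacking the key column are skipped and that a later row overwrites an earlier one with the same key; the page does not state what the real operator does in those cases.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the punch API.
public class HashKeyIndex {

    // Re-key a list of rows by the value of one column.
    // The key column is removed from each row, as in the output shown above.
    public static Map<String, Map<String, String>> index(
            List<Map<String, String>> rows, String hashKey) {
        Map<String, Map<String, String>> dict = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            Map<String, String> rest = new LinkedHashMap<>(row);
            String key = rest.remove(hashKey);
            if (key != null) {
                dict.put(key, rest); // assumption: last row wins on duplicates
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("Surname", "Homer", "FamilyName", "Simpson"),
            Map.of("Surname", "Ned", "FamilyName", "Flanders"));
        Map<String, Map<String, String>> dict = index(rows, "Surname");
        System.out.println(dict.get("Homer")); // prints {FamilyName=Simpson}
    }
}
```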
Tuple resource = fetchCsv("https://your.resource.server/resource.csv")
.refresh("0/60 0/1 * 1/1 * ? *")
.columns("ip", "city", "zone")
.asTuple();
By default the resource is returned as a plain Tuple, which provides you with all the Tuple methods. Using such tuples to hold large hash tables is not memory efficient, though.
If your files are big (more than several tens of megabytes), you can use the compact option as illustrated next.
Tuple resource = fetchCsv("https://your.resource.server/resource.csv")
.refresh("0/60 0/1 * 1/1 * ? *")
.columns("ip", "city", "zone")
.hashKey("ip")
.compact()
.asTuple();
This option uses a more compact strategy to represent the resource in memory. The returned tuple provides you only
with the get(String key) method.
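To give an intuition for this trade-off, the sketch below stores a lookup table in two parallel sorted arrays and exposes only a get(String key) method. It is purely illustrative: the CompactLookup class is hypothetical, and this page does not describe the strategy the operator actually uses. The point is only that giving up general-purpose accessors makes a denser, read-only layout possible.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the punch compact representation.
public final class CompactLookup {
    private final String[] keys;   // sorted in natural String order
    private final String[] values; // values[i] belongs to keys[i]

    public CompactLookup(Map<String, String> source) {
        // TreeMap iterates keys and values in the same sorted order.
        TreeMap<String, String> sorted = new TreeMap<>(source);
        keys = sorted.keySet().toArray(new String[0]);
        values = sorted.values().toArray(new String[0]);
    }

    // The only accessor: binary search over the sorted keys.
    public String get(String key) {
        if (key == null) {
            return null;
        }
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }

    public static void main(String[] args) {
        CompactLookup cities = new CompactLookup(Map.of(
            "10.0.0.1", "Paris",
            "10.0.0.2", "Lyon"));
        System.out.println(cities.get("10.0.0.2")); // prints Lyon
    }
}
```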
- type: punchlet_node
  settings:
    punchlet_file_resources:
      standard/resources/taxonomy.csv
    punchlet:
      - ./mypunchlet.punch
As just illustrated you can include local CSV files. These will be available from within your punchlet using the same path.
Tuple taxonomy = fetchCsv().url("standard/resources/taxonomy.csv").asTuple();
Tuple enrichment = fetch("standard/resources/enrichment.json").asTuple();
Refer to the online punch node documentation for a complete explanation on how to deal with resources.
* * * ? * * | Every second |
0 * * ? * * | Every minute |
0 */2 * ? * * | Every even minute |
0 1/2 * ? * * | Every uneven minute |
0 */2 * ? * * | Every 2 minutes |
0 */3 * ? * * | Every 3 minutes |
0 */4 * ? * * | Every 4 minutes |
0 */5 * ? * * | Every 5 minutes |
0 */10 * ? * * | Every 10 minutes |
0 */15 * ? * * | Every 15 minutes |
0 */30 * ? * * | Every 30 minutes |
0 15,30,45 * ? * * | Every hour at minutes 15, 30 and 45 |
0 0 * ? * * | Every hour |
0 0 */2 ? * * | Every two hours |
0 0 0/2 ? * * | Every even hour |
0 0 1/2 ? * * | Every uneven hour |
0 0 */3 ? * * | Every three hours |
0 0 */4 ? * * | Every four hours |
0 0 */6 ? * * | Every six hours |
0 0 */8 ? * * | Every eight hours |
0 0 */12 ? * * | Every twelve hours |
0 0 0 * * ? | Every day at midnight - 12am |
0 0 1 * * ? | Every day at 1am |
0 0 6 * * ? | Every day at 6am |
0 0 12 * * ? | Every day at noon - 12pm |
0 0 12 ? * SUN | Every Sunday at noon |
0 0 12 ? * MON | Every Monday at noon |
0 0 12 ? * TUE | Every Tuesday at noon |
0 0 12 ? * WED | Every Wednesday at noon |
0 0 12 ? * THU | Every Thursday at noon |
0 0 12 ? * FRI | Every Friday at noon |
0 0 12 ? * SAT | Every Saturday at noon |
0 0 12 ? * MON-FRI | Every Weekday at noon |
0 0 12 ? * SUN,SAT | Every Saturday and Sunday at noon |
0 0 12 */7 * ? | Every 7 days at noon |
0 0 12 1 * ? | Every month on the 1st, at noon |
0 0 12 2 * ? | Every month on the 2nd, at noon |
0 0 12 15 * ? | Every month on the 15th, at noon |
0 0 12 1/2 * ? | Every 2 days starting on the 1st of the month, at noon |
0 0 12 1/4 * ? | Every 4 days starting on the 1st of the month, at noon |
0 0 12 L * ? | Every month on the last day of the month, at noon |
0 0 12 L-2 * ? | Every month on the second to last day of the month, at noon |
0 0 12 LW * ? | Every month on the last weekday, at noon |
0 0 12 ? * 1L | Every month on the last Sunday, at noon |
0 0 12 ? * 2L | Every month on the last Monday, at noon |
0 0 12 ? * 6L | Every month on the last Friday, at noon |
0 0 12 1W * ? | Every month on the nearest Weekday to the 1st of the month, at noon |
0 0 12 15W * ? | Every month on the nearest Weekday to the 15th of the month, at noon |
0 0 12 ? * 2#1 | Every month on the first Monday of the Month, at noon |
0 0 12 ? * 6#1 | Every month on the first Friday of the Month, at noon |
0 0 12 ? * 2#2 | Every month on the second Monday of the Month, at noon |
0 0 12 ? * 5#3 | Every month on the third Thursday of the Month, at noon - 12pm |
0 0 12 ? JAN * | Every day at noon in January only |
0 0 12 ? JUN * | Every day at noon in June only |
0 0 12 ? JAN,JUN * | Every day at noon in January and June |
0 0 12 ? DEC * | Every day at noon in December only |
0 0 12 ? JAN,FEB,MAR,APR * | Every day at noon in January, February, March and April |
0 0 12 ? 9-12 * | Every day at noon between September and December |
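The expressions in this table appear to follow the Quartz-style layout: seconds, minutes, hours, day of month, month, day of week, and an optional year. This is an assumption, since this page does not name the scheduler. Under that assumption, the small sketch below labels each field of an expression, which helps when reading or reviewing a refresh setting. The CronFields class is hypothetical and only checks the number of fields, not their values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, assuming Quartz-style field order.
public class CronFields {
    private static final String[] NAMES = {
        "seconds", "minutes", "hours",
        "day-of-month", "month", "day-of-week", "year"
    };

    // Split an expression on whitespace and name each field.
    // Accepts 6 fields, or 7 when the optional year is present.
    public static Map<String, String> label(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException(
                "expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        label("0 0 12 ? * MON-FRI")
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```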
compactionType, hashKey, logger, lowerCaseKeys, requiredResource, runtimeContext, s3Bucket, s3endpoint, s3KeyPath, s3Object, s3SecretPath, silent, url, uuid
Constructor and Description |
---|
FetchCsvOperator(RuntimeContext r): New style constructor. |
FetchCsvOperator(RuntimeContext r, String url): Create a fetch operator for a remote resource. |
Modifier and Type | Method and Description |
---|---|
Tuple | asTuple() |
FetchCsvOperator | columns(List<String> columnNames): Provide the expected column names as a list of strings. |
FetchCsvOperator | columns(String... columnNames): Provide the expected column names. |
FetchCsvOperator | delimiter(String delimiter): Set the CSV delimiter. |
FetchCsvOperator | generateFieldNames(): Call this to make the operator automatically generate field names. |
IResourceBuilder | getResourceBuilder(): Implemented by subclasses to return the adequate resource builder. |
FetchCsvOperator | inferTypes(): Auto-guess the field types from your CSV files. |
FetchCsvOperator | setContent(String csv) |
bestEffort, compact, compactDirect, compactInPlace, getResource, hashKey, loadAtStartup, lowerCaseKeys, refresh, required, s3AccessKey, s3AccessKeyPath, s3Bucket, s3Endpoint, s3Object, s3Secret, silent, url
public FetchCsvOperator(RuntimeContext r, String url)
r - the punchlet runtime context.
url - an http or file url.
public FetchCsvOperator(RuntimeContext r)
r - the punchlet runtime context.
public FetchCsvOperator inferTypes()
public FetchCsvOperator delimiter(String delimiter)
delimiter - a CSV delimiter. By default "," is used.
public FetchCsvOperator generateFieldNames()
You will get "field0", "field1", etc.
public FetchCsvOperator columns(String... columnNames)
columnNames - the column names
public FetchCsvOperator columns(List<String> columnNames)
columnNames - the column names
public Tuple asTuple()
public IResourceBuilder getResourceBuilder()
Implemented by subclasses of FetchOperator to return the adequate resource builder. The FetchCsvOperator returns a builder to construct a tuple from a CSV document, while the FetchJsonOperator returns builders to deal with JSON files. This is called only once at init time.
Specified by: getResourceBuilder in class FetchOperator<FetchCsvOperator>
public FetchCsvOperator setContent(String csv)
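This page does not spell out the rules inferTypes() applies. As a rough illustration of what guessing a field type from its text can look like, the sketch below tries boolean, then integer, then floating point, and falls back to the original string. The TypeGuess class and this ordering are assumptions made for the example, not the operator's documented behavior.

```java
// Hypothetical illustration of CSV cell type guessing.
public class TypeGuess {

    // Return a Boolean, Long, or Double when the text parses as one,
    // otherwise return the original string unchanged.
    public static Object infer(String raw) {
        String s = raw.trim();
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
            return Boolean.parseBoolean(s);
        }
        try {
            return Long.parseLong(s);
        } catch (NumberFormatException ignored) {
            // not an integer, try a floating point value next
        }
        try {
            // Note: parseDouble also accepts forms such as "NaN" or "1e3".
            return Double.parseDouble(s);
        } catch (NumberFormatException ignored) {
            // not a number either
        }
        return raw;
    }

    public static void main(String[] args) {
        for (String cell : new String[] {"42", "3.5", "true", "Homer"}) {
            Object value = infer(cell);
            System.out.println(cell + " -> " + value.getClass().getSimpleName());
        }
    }
}
```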
Copyright © 2023. All rights reserved.