entitopia - a Python tool for loading, customizing, and automating indexes and data loads into Elasticsearch
Elasticsearch is an awesome, extensible text search engine. It provides methods for loading data, customizing the data, applying analyzers, changing search weightings, and enriching data by merging subsets of multiple datasets. We can merge pieces of different datasets (indexes) into customized indexes to meet our data analysis needs.
This diagram shows a 3-step pipeline that represents data being loaded into two indexes (1,3) with an enrichment and resource manipulation step (2).
We want to do all of that in a repeatable and automatable fashion with some level of flexibility. The Python code lets us define pipelines that support multiple steps and customized operations.
Each step is driven by a config file that lists the phases to run for that step, along with other configuration information such as directories and log level:
{
  "steps": [
    {
      "name": "doctors-clinicians",
      "phases": [
        "index-create",
        "index-map",
        "index-populate"
      ]
    }
  ],
  "all_phases": [
    "index-create",
    "index-map",
    "enrichment-policies",
    "pipelines",
    "index-populate"
  ],
  "configurationDir": "configuration",
  "dataDir": "data",
  "logLevel": "INFO"
}
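To make the shape of that config concrete, here is a minimal sketch of how a runner might turn it into an ordered list of work. This is not entitopia's actual code: the function name `plan_phases` is hypothetical, and it assumes that `all_phases` defines both the set of valid phases and the order they run in.

```python
import json

# Same structure as the pipeline config above, inlined so the sketch runs on its own.
CONFIG_TEXT = """
{
  "steps": [
    {"name": "doctors-clinicians",
     "phases": ["index-create", "index-map", "index-populate"]}
  ],
  "all_phases": ["index-create", "index-map", "enrichment-policies",
                 "pipelines", "index-populate"]
}
"""


def plan_phases(config):
    """Return (step name, phase) pairs in the order they would run.

    Assumption: all_phases is the canonical ordering, so a step's phases
    are sorted by their position in that list.
    """
    order = {phase: i for i, phase in enumerate(config["all_phases"])}
    plan = []
    for step in config["steps"]:
        unknown = [p for p in step["phases"] if p not in order]
        if unknown:
            raise ValueError(f"step {step['name']!r} has unknown phases: {unknown}")
        for phase in sorted(step["phases"], key=order.__getitem__):
            plan.append((step["name"], phase))
    return plan


config = json.loads(CONFIG_TEXT)
plan = plan_phases(config)
for step_name, phase in plan:
    print(f"{step_name}: {phase}")
```

Rejecting unknown phase names up front means a typo in a step shows up before any index is touched.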
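Each phase has a configuration file that drives the associated phase processor. For example, a populate phase names the source CSV, the field to use as the document ID, and how many rows to load: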
{
  "alias": "doctors-clinicians-000001",
  "index": "doctors-clinicians-{now/d}-000001",
  "source": "DAC_NationalDownloadableFile.csv",
  "id_field": "NPI",
  "num_rows": 50000,
  "skip_rows": 0
}
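As a rough illustration of what a populate phase could do with those settings, the sketch below turns CSV rows into bulk-style actions. It is an assumption-laden example rather than entitopia's implementation: `build_actions` is a hypothetical helper, `skip_rows` is assumed to count data rows after the header, and the target index name is passed in explicitly instead of being resolved from the `index` pattern. The action dictionaries use the `_index`, `_id`, and `_source` keys that the Python client's `elasticsearch.helpers.bulk` accepts, but nothing here talks to a cluster.

```python
import csv
import io
from itertools import islice


def build_actions(rows, phase_config, target_index):
    """Yield one bulk-style action per CSV row selected by the phase config.

    Assumptions: skip_rows counts data rows (the header is already consumed
    by csv.DictReader) and num_rows caps how many rows are emitted.
    """
    start = phase_config.get("skip_rows", 0)
    count = phase_config.get("num_rows")
    stop = None if count is None else start + count
    id_field = phase_config["id_field"]
    for row in islice(rows, start, stop):
        yield {"_index": target_index, "_id": row[id_field], "_source": dict(row)}


# Tiny made-up CSV standing in for the real source file.
sample_csv = io.StringIO("NPI,name\n1001,Alice\n1002,Bob\n1003,Carol\n")
phase_config = {"id_field": "NPI", "num_rows": 2, "skip_rows": 1}

actions = list(build_actions(csv.DictReader(sample_csv), phase_config, "demo-index"))
for action in actions:
    print(action["_id"], action["_source"])
```

Because `build_actions` is a generator, a large file is streamed row by row rather than loaded into memory, which matters when `num_rows` is in the tens of thousands.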