entitopia - a Python tool for loading, customizing, and automating indexes and data loads into Elasticsearch

Elasticsearch is an awesome, extensible text search engine. It provides methods for loading data, customizing it, applying analyzers, changing search weightings, and enriching data by merging subsets of multiple datasets. We can merge pieces of different datasets (indexes) into customized indexes that meet our data analysis needs.

We want to do all of that in a repeatable, automatable fashion with some level of flexibility. The Python code lets us define pipelines that support multiple steps and customized operations.


This diagram shows a three-step pipeline in which data is loaded into two indexes (1, 3), with an enrichment and resource manipulation step (2) in between.

Each step is driven by a config file that describes the phase processors and other configuration information.

{
    "steps": [
        {
            "name": "doctors-clinicians",
            "phases": [
                "index-create",
                "index-map",
                "index-populate"
            ]
        }
    ],
    "all_phases": [
        "index-create",
        "index-map",
        "enrichment-policies",
        "pipelines",
        "index-populate"
    ],
    "configurationDir": "configuration",
    "dataDir": "data",
    "logLevel": "INFO"
}

Each phase has a configuration file that drives the associated phase processor:

{
    "alias": "doctors-clinicians-000001",
    "index": "doctors-clinicians-{now/d}-000001",
    "source": "DAC_NationalDownloadableFile.csv",
    "id_field": "NPI",
    "num_rows": 50000,
    "skip_rows": 0
}
