Almost PaaS Document Parsing with Tika and AWS Elastic Beanstalk

The Apache Tika project provides a library capable of parsing and extracting data and metadata from over 1000 file types.  Tika is available as a single jar file that can be included inside applications or as a deployable jar file that runs Tika as a standalone service.

This blog describes deploying the Tika jar as an auto-scale service in Amazon AWS Elastic Beanstalk.  I selected Elastic Beanstalk because it supports jar-based deployments without any real infrastructure configuration. Elastic Beanstalk auto-scale should take care of scaling up and down for the number of requests you get.

Tika parses documents and extracts their text completely in memory. Tika was deployed for this blog using EC2 t2.micro instances available in the AWS free tier. t2.micro VMs have 1GB of memory, which restricts the complexity and size of the documents you can parse. Size your instances for your largest documents.

Preconditions

  • An AWS account.
  • AWS access id and secret key.  This is most easily created in the AWS web console.  Amazon recommends using IAM credentials. I used the default account credentials since this was done mostly as a PoC. Remember to save any key id and key values that you need. They cannot be recovered once generated.
  • Read the Amazon Elastic Beanstalk command line instructions

Not Addressed

  • Using the Amazon Console to do web based deployments. 
    • This post uses the command line to demonstrate automation possibilities.
  • Limiting access to this service via AuthN/AuthZ
  • Provisioning limited scope IAM credentials for deployments
  • Load testing with something like JMeter to demonstrate autoscale

SSH

SSH must be installed and on your command line path. The Elastic Beanstalk command prompt expects ssh-keygen to be on your path.

I did this work on Microsoft Windows 10 so I needed the Windows tools.  Microsoft is now contributing to an OpenSSH distribution for Windows on GitHub. I installed the 64 bit version in c:\Program Files\OpenSSH-Win64. Linux / Mac folks can use their favorite tools.

Python and Pip

Python must be installed and on your command line path. Install Python and pip per the AWS Elastic Beanstalk CLI instructions. The page describes the Windows and Linux installation processes. Make sure to add the Python directories to your environment PATH variables.

Elastic Beanstalk CLI

Install the Elastic Beanstalk CLI after installing Python. You can find it on the same CLI web page.
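
If you want to confirm the path setup before going further, here is a minimal Python sketch. It assumes the command names ssh-keygen, python, pip and eb; on Linux or Mac your Python may be called python3, so adjust the list. The helper names are mine and are not part of any AWS tool.

```python
import shutil

# Commands this post expects on the PATH. Assumption: "eb" is the
# Elastic Beanstalk CLI entry point and Python is installed as "python".
REQUIRED_TOOLS = ["ssh", "ssh-keygen", "python", "pip", "eb"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


def report(tools=REQUIRED_TOOLS):
    """Print a short summary and return the list of missing tools."""
    missing = find_missing(tools)
    if missing:
        print("Not on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
    return missing
```

Calling report() from a fresh command prompt shows whether the PATH changes took effect.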

Creating an EB Environment and Deploying Tika

Terms

  • Working directory: The name of the directory your command prompt is sitting in. This is the directory where .elasticbeanstalk/config.yml is created.
    • I usually prefix mine with my company name to simplify uniqueness constraints later.
  • Application name:  This is the name of your directory by default.  It can be anything.  The application name is the root of the external URL and should be unique. 
    • I usually prefix mine with my company name to simplify uniqueness constraints later.
  • Environment Name: Applications can be deployed into different environments with different properties. 
    • This tends to default to <app_name>-dev for development environments.

Create working directory

  1. Create a working directory.  This will probably be the same as your app name. 
    1. I named mine fsi-tika-eb for FreemanSoft Tika ElasticBeanstalk. I'd probably pick something more like fsi-eb-tika if I built up a demo environment in the future.
  2. CD into the directory.  The .elasticbeanstalk/config.yml file will end up here
  3. Download the tika-server jar file and put it into this directory. I used version 1.11   

Initialize ElasticBeanstalk CLI environment

Initialize the eb environment and answer its questions.
  1. Run eb init
  2. Pick the data center your company uses or the nearest data center.
  3. Enter your AWS access id and secret key.  I used my test account credentials.  You should use your IAM credentials.
  4. Enter your application name. It may default to your directory name.
  5. Select Java as the platform.
  6. Use Java 7 or Java 8
  7. Let it create the necessary SSH credentials.
    1. This step fails if SSH is not on the path.
    2. Let it create an SSH key set with the default name if this is the first configured shell.  It selected aws-eb for my keyset name.

.elasticbeanstalk/config.yml contains the settings selected during eb init.

Create ElasticBeanstalk Execution environment and deploy

Normally the eb create command would try to run a command or deploy a directory.  We have a single all-encompassing Tika jar that we downloaded above.  This means we can set the default deployment artifact to the jar file name.
  1. Edit .elasticbeanstalk/config.yml
  2. Add a new section
        deploy:
          artifact: tika-server-1.11.jar
  3. Create a new Elastic Beanstalk environment and auto deploy the application.  We can accept the default options and reconfigure later, or we can configure the ELB/ALB port and machine size in a single command.
    1. Option: single command line
      1. Specify what would be the default values for an app initialized as fsi-eb-tika
      2. eb create fsi-eb-tika-dev --cname fsi-eb-tika-dev --elb-type application --enable-spot -it t2.micro --envvars PORT=9998 --tags mytag=myvalue
      3. Application will deploy
    2. Option: interactive command line specifying port and EC2 type
      1. Create and deploy the application
        eb create --instance_type t2.micro --envvars PORT=9998
      2. Accept any defaults offered
      3. Application will deploy
    3. Option: interactive command line for all options
      1. Create and deploy the application
        eb create
      2. Accept the defaults.
      3. Application will deploy.
      4. Set the port number after deployment. This causes the application to redeploy:
        eb setenv PORT=9998

You should end up with Tika deployed on a single t2.micro instance with auto-scale enabled up to 4 nodes.
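
Since the point of using the command line is automation, the eb create call can be wrapped in a script. This is only a sketch under assumptions: build_eb_create and run_eb_create are hypothetical helper names, only the --instance_type and --envvars flags shown above are used, eb must be on your path, and the script must run from the working directory that holds .elasticbeanstalk/config.yml.

```python
import subprocess


def build_eb_create(instance_type="t2.micro", port=9998, environment_name=None):
    """Return the eb create command as an argument list."""
    command = ["eb", "create"]
    if environment_name:
        # Optional environment name, passed positionally.
        command.append(environment_name)
    command += ["--instance_type", instance_type, "--envvars", "PORT=%d" % port]
    return command


def run_eb_create(**options):
    """Run eb create in the current directory; raises if eb exits non-zero."""
    return subprocess.run(build_eb_create(**options), check=True)
```

Printing " ".join(build_eb_create()) first is a cheap way to review the command before letting run_eb_create() create real AWS resources.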

Port Forwarding Notes

The AWS ELB/ALB listens on port 80 and assumes that the EB application is running on port 5000. Tika server listens on port 9998 by default, so we have to change one of the following.
  1. Change the ELB/ALB back-side target port.
  2. Change the Tika listener port.
It is easiest to just change the ELB/ALB back-side port.  This can be done by setting the PORT environment property, as in the eb create --envvars PORT=9998 and eb setenv PORT=9998 commands above.

Gateway Errors

You will get 502 gateway errors and the console will show the environment as unhealthy if the ALB/ELB target port and the Tika listener port don't match.

Verification

Basic Server Verification

Open a browser to test with a GET operation.
The default naming is http://<application_name>-<environment>.elasticbeanstalk.com
My demo URL was fsi-tika-eb-dev.elasticbeanstalk.com.
Run a GET against the Tika parser at http://<application_name>-<environment>.elasticbeanstalk.com/tika
A GET on this demo URL
http://fsi-tika-eb-dev.elasticbeanstalk.com/tika
resulted in
  • This is Tika Server. Please PUT
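
The same check can be scripted instead of using a browser. The sketch below assumes my demo URL (substitute your own environment URL) and uses helper names I made up for illustration. It treats a 502 as the port mismatch described under Gateway Errors.

```python
import urllib.error
import urllib.request

# Assumption: the demo environment URL from this post.
BASE_URL = "http://fsi-tika-eb-dev.elasticbeanstalk.com"


def describe_status(status):
    """Translate an HTTP status code into a short diagnosis."""
    if status == 200:
        return "Tika is reachable"
    if status == 502:
        return "502 gateway error: check that PORT matches the Tika listener port"
    return "unexpected HTTP status %d" % status


def check_server(base_url=BASE_URL, timeout=30):
    """GET <base_url>/tika and return (status, body_text)."""
    try:
        with urllib.request.urlopen(base_url + "/tika", timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # Non-2xx answers such as 502 are raised as HTTPError.
        return error.code, error.read().decode("utf-8", errors="replace")
```

Running status, body = check_server() and then print(describe_status(status), body) should show the same greeting the browser displayed when the ports line up.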

Parsing a Test Document

You can test the Tika server from any HTTP test tool like the Postman Chrome plugin. The Tika Server API is documented on the Apache Tika Wiki.

Example File Parsing
  • Select PUT as your HTTP method
  • Set the Body type to Binary
  • Select the file that will act as a body.
  • Select the MIME type you want back.
  • Use the /tika path, EX: http://fsi-tika-eb-dev.elasticbeanstalk.com/tika
  • Select an XLS file
  • Tell Tika you want HTML back with the header Accept: text/html
  • Submit the message.
  • You should get back an HTML table representation of the XLS file.

Example Data Type Detection
  • Select PUT as your HTTP method
  • Set the Body type to Binary
  • Select the file that will act as a body.
  • Select the MIME type you want back.
  • Use the /detect/stream path, EX: http://fsi-tika-eb-dev.elasticbeanstalk.com/detect/stream
  • Select a PNG file
  • Submit the message
  • You should get back the MIME type of the document you sent.
See the Apache Tika Wiki page for more information.
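
Both examples are the same PUT with a different path, so they are easy to script without an HTTP test tool. This is a minimal sketch, assuming my demo URL and helper names I invented for illustration. It sends the body as application/octet-stream to mirror the Binary body type in the steps above.

```python
import urllib.request

# Assumption: the demo environment URL from this post; substitute your own.
BASE_URL = "http://fsi-tika-eb-dev.elasticbeanstalk.com"


def build_put(path_on_server, body, accept=None, base_url=BASE_URL):
    """Build a PUT request that carries the raw document bytes as its body."""
    headers = {"Content-Type": "application/octet-stream"}
    if accept:
        # Ask the server for a specific response format, e.g. text/html.
        headers["Accept"] = accept
    return urllib.request.Request(
        base_url + path_on_server, data=body, method="PUT", headers=headers
    )


def send_file(file_path, path_on_server, accept=None, timeout=60):
    """PUT a local file to the Tika server and return the response text."""
    with open(file_path, "rb") as handle:
        request = build_put(path_on_server, handle.read(), accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_to_html(file_path):
    """Parse a document through /tika and ask for HTML back."""
    return send_file(file_path, "/tika", accept="text/html")


def detect_type(file_path):
    """Ask /detect/stream for the MIME type of a document."""
    return send_file(file_path, "/detect/stream").strip()
```

For example, parse_to_html("sample.xls") and detect_type("sample.png") correspond to the two walkthroughs; the file names here are placeholders for your own test documents.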

Viewing Logs

You can view any of the captured log files using the AWS console or the eb command prompt. Open up a command prompt in your application's directory.  Enter the following to see logs
  • eb logs
Deployment, access and application logs will be retrieved and displayed.

Additional Reading

Amazon publishes some free AWS Kindle books/booklets including the following
  • http://www.amazon.com/AWS-Elastic-Beanstalk-Developer-Guide-ebook/dp/B007Q4JFE0

Created Feb 15 2016
Minor Revisions April 2020
