Monday, February 15, 2016

Almost PaaS Document Parsing with Tika and AWS Elastic Beanstalk

The Apache Tika project provides a library capable of parsing and extracting data and metadata from over 1000 file types. Tika is available as a single jar file that can be included inside applications, or as a deployable jar file that runs Tika as a standalone service.

This blog describes deploying the Tika jar as an auto-scaling service in Amazon AWS Elastic Beanstalk. I selected Elastic Beanstalk because it supports jar-based deployments without any real infrastructure configuration. Elastic Beanstalk auto-scaling should take care of scaling up and down with the number of requests you get.

Tika parses documents and extracts their text completely in memory. For this blog, Tika was deployed on EC2 t2.micro instances, which are available in the AWS free tier. t2.micro VMs have 1 GB of memory, which restricts the complexity and size of the documents you can parse. Size your instances appropriately for your largest documents.


Prerequisites

  • An AWS account.
  • AWS access key ID and secret key. These are most easily created in the AWS web console. Amazon recommends using IAM credentials. I used the default account credentials since this was done mostly as a PoC. Remember to save any key ID and key values that you need. They cannot be recovered once generated.
  • Read the Amazon Elastic Beanstalk command line instructions.

Not Addressed

  • Using the Amazon Console to do web-based deployments. I decided to do this with command line tools to get a feel for automation possibilities.
  • Limiting access to this service with access controls.
  • IAM credentials.
  • Load testing with something like JMeter


SSH

SSH must be installed and on your command line path. The Elastic Beanstalk command prompt expects ssh-keygen to be on your path.

I did this work on Microsoft Windows 10, so I needed the Windows tools. Microsoft is now contributing to an OpenSSH distribution for Windows on GitHub. I installed the 64-bit version in c:\Program Files\OpenSSH-Win64. Linux / Mac folks can use their favorite tools.

Python and Pip

Python must be installed and on your command line path. Install Python and pip per the AWS Elastic Beanstalk CLI instructions. The page describes the Windows and Linux installation processes. Make sure to add the Python directories to your environment PATH variable.

Elastic Beanstalk CLI

Install the Elastic Beanstalk CLI after installing Python. You can find it on the same CLI web page.

Creating an EB Environment and Deploying Tika

  • Working directory: The name of the directory your command prompt is sitting in. This is the directory where .elasticbeanstalk/config.yml is created.
    • I usually prefix mine with my company name to simplify uniqueness constraints later.
  • Application name: This is the name of your directory by default. It can be anything. The application name is the root of the external URL and should be unique.
    • I usually prefix mine with my company name to simplify uniqueness constraints later.
  • Environment Name: Applications can be deployed into different environments with different properties. 
    • This tends to default to <app-name>-dev for development environments.
  1. Create a working directory.  This will probably be the same as your app name. 
    1. I named mine fsi-tika-eb for FreemanSoft Tika ElasticBeanstalk. I'd probably pick something more like fsi-eb-tika if I built up a demo environment in the future.
  2. CD into the directory.  The .elasticbeanstalk/config.yml file will end up here
  3. Download the tika-server jar file and put it into this directory. I used version 1.11   
  4. Initialize the eb command environment and answer its questions
    1. Run eb init
    2. Pick the data center (region) your company uses or the nearest data center.
    3. Enter your AWS access key ID and secret key. I used my test account credentials. You should use your IAM credentials.
    4. Enter your application name. It may default to your directory name.
    5. Select Java as the platform.
    6. Use Java 7 or Java 8
    7. Let it create the necessary SSH credentials.  
      1. This section fails if SSH is not on the path.
      2. Let it create an SSH key set with the default name if this is the first configured shell. It selected aws-eb for my key set name.
  5. The config.yml file contains the settings selected during eb init. Normally the eb create command would try to run a command or deploy a directory. We have a single all-encompassing Tika jar that we downloaded above. This means we can set the default deployment artifact to the jar file name.
    1. Edit .elasticbeanstalk/config.yml
    2. Add a new section
             deploy:
               artifact: tika-server-1.11.jar
  6. Create a new Elastic Beanstalk Environment and auto deploy the application.   We can choose the default options and reconfigure later or we can try and configure the load balancer port and machine size in a single command.
    1. Option: Single command line
      1. Create and deploy the application
        eb create --instance_type t2.micro --envvars PORT=9998
      2. Accept any defaults offered
    2. Option: Basic commands
      1. Create and deploy the application:
        eb create
      2. Accept the defaults.
      3. Set the port number.  This causes the application to redeploy:
        eb setenv PORT=9998
You should end up with Tika deployed on a single t2.micro instance, with auto-scaling enabled up to 4 nodes.
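
The config.yml edit from step 5 can also be scripted. The following is a minimal sketch, assuming the eb CLI reads the artifact path from a deploy section of config.yml and that the file does not have one yet. The helper names add_deploy_artifact and update_config_file are hypothetical, and the file is treated as plain text rather than parsed as YAML.

```python
from pathlib import Path


def add_deploy_artifact(config_text: str, jar_name: str) -> str:
    """Return the config text with a deploy/artifact section appended.

    Assumption: a top-level 'deploy:' key means the artifact is already
    configured, so the text is returned unchanged in that case.
    """
    has_deploy = any(line.startswith("deploy:")
                     for line in config_text.splitlines())
    if has_deploy:
        return config_text
    # Make sure the new section starts on its own line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + f"deploy:\n  artifact: {jar_name}\n"


def update_config_file(path: str = ".elasticbeanstalk/config.yml",
                       jar_name: str = "tika-server-1.11.jar") -> None:
    """Rewrite the eb CLI config file in place. Run this after eb init."""
    config_path = Path(path)
    updated = add_deploy_artifact(config_path.read_text(), jar_name)
    config_path.write_text(updated)
```

Calling update_config_file() from the working directory after eb init would add the section once. Running it a second time leaves the file unchanged.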

Load Balancer Notes

The AWS load balancer listens on port 80 and assumes that the EB application is running on port 5000. Tika listens on port 9998 by default, so we have to change either the load balancer's back-side port or the port Tika is listening on. It is easiest to change the load balancer's back-side port, since that can be done with just the PORT environment property.


Basic Server Verification

Open a browser and do a GET request against your application's URL. The eb status command shows the URL as the environment's CNAME.
Executing a GET against the Tika parser path, /tika, on my demo resulted in the message
This is Tika Server. Please PUT
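
The same check can be scripted instead of using a browser. This is a minimal sketch using only the Python standard library. The helper names tika_url and fetch_greeting are hypothetical, and the base URL is whatever address your own environment was given.

```python
import urllib.request


def tika_url(base_url: str, path: str = "/tika") -> str:
    """Join the environment's base URL and a Tika server path."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_greeting(base_url: str, timeout: float = 10.0) -> str:
    """GET the /tika path and return the response body as text."""
    with urllib.request.urlopen(tika_url(base_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage, with your own environment URL in place of the placeholder:
#   print(fetch_greeting("http://<your-environment-url>"))
```

If the port mapping is wrong, the call will most likely raise an HTTP or connection error rather than return the greeting.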

Parsing a Test Document

You can test the Tika server from any HTTP test tool, such as the Postman plugin for Chrome. The Tika Server API is documented on the Apache Tika Wiki.

Example File Parsing
  • Select PUT as your HTTP method
  • Set the Body type to Binary
  • Select the file that will act as a body.  
  • Select the MIME type you want back.
  • Use the /tika path.
  • Select an XLS file.
  • Tell Tika you want HTML back with the header Accept: text/html
  • Submit the message.
  • You should get back an HTML table representation of the XLS file.
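
The steps above can be sketched in Python with the standard library. The function names build_parse_request and parse_file are hypothetical. The explicit application/octet-stream content type is a choice made for this sketch, because urllib otherwise labels any request body as form-encoded.

```python
import urllib.request


def build_parse_request(base_url: str, document: bytes,
                        accept: str = "text/html") -> urllib.request.Request:
    """Build a PUT request that sends a document body to the /tika path."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        headers={
            # The MIME type we want back from Tika.
            "Accept": accept,
            # Generic type so urllib does not add its form-encoded default.
            "Content-Type": "application/octet-stream",
        },
    )


def parse_file(base_url: str, file_path: str, timeout: float = 60.0) -> str:
    """Upload a local file and return Tika's response body as text."""
    with open(file_path, "rb") as handle:
        request = build_parse_request(base_url, handle.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Passing accept="text/plain" instead should return plain text rather than HTML, assuming the server supports that output type for the document.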

Example Data Type Detection
  • Select PUT as your HTTP method
  • Set the Body type to Binary
  • Select the file that will act as a body.  
  • Select the MIME type you want back.
  • Use the /detect/stream path.
  • Select a PNG file
  • Submit the message
  • You should get back the MIME type of the document you sent.
See the Apache Tika Wiki page for more information.
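
Type detection can be scripted the same way. This is a minimal sketch, assuming the Tika server's detection endpoint is /detect/stream and that it answers with the MIME type as a short text body. The function names build_detect_request and detect_mime_type are hypothetical.

```python
import urllib.request


def build_detect_request(base_url: str, document: bytes) -> urllib.request.Request:
    """Build a PUT request that sends a document to the detection path."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/detect/stream",
        data=document,
        method="PUT",
        # Generic type so the server has to inspect the bytes themselves.
        headers={"Content-Type": "application/octet-stream"},
    )


def detect_mime_type(base_url: str, file_path: str, timeout: float = 30.0) -> str:
    """Upload a local file and return the MIME type reported by Tika."""
    with open(file_path, "rb") as handle:
        request = build_detect_request(base_url, handle.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("ascii", errors="replace").strip()
```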

Viewing Logs

You can view any of the captured log files using the eb command prompt. Open a command prompt in your application's directory and enter the following to see the logs:
eb logs
Deployment, access, and application logs will be retrieved and displayed.

Additional Reading

Amazon publishes some free AWS Kindle books/booklets.

Created Feb 15 2016
