Almost PaaS Document Parsing with Tika and AWS Elastic Beanstalk
The Apache Tika project provides a library capable of parsing and extracting data and metadata from over 1000 file types. Tika is available as a single jar file that can be included inside applications or as a deployable jar file that runs Tika as a standalone service.
This blog describes deploying the Tika jar as an auto-scale service in Amazon AWS Elastic Beanstalk. I selected Elastic Beanstalk because it supports jar-based deployments without any real infrastructure configuration. Elastic Beanstalk auto-scale should take care of scaling up and down for the number of requests you get.
Normally the eb create command would try to run a command or deploy a directory. The Tika server download is a single all-encompassing jar, so we can set the default deployment artifact to the jar file name.
Tika parses documents and extracts their text completely in memory. Tika was deployed for this blog using EC2 t2.micro instances available in the AWS free tier. t2.micro VMs have 1 GB of memory, which restricts the complexity and size of the documents you can parse. Size your instances appropriately for your largest documents.
Preconditions
- An AWS account.
- AWS access id and secret key. These are most easily created in the AWS web console. Amazon recommends using IAM credentials. I used the default account credentials since this was done mostly as a PoC. Remember to save any key id and key values that you need. They cannot be recovered once generated.
- Read the Amazon Elastic Beanstalk command line instructions.
Not Addressed
- Using the Amazon Console to do web based deployments. This blog uses the command line to demonstrate automation possibilities.
- Limiting access to this service via AuthN/AuthZ
- Provisioning limited scope IAM credentials for deployments
- Load testing with something like JMeter to demonstrate autoscale
SSH
SSH must be installed and on your command line path. The Elastic Beanstalk command prompt expects ssh-keygen to be on your path.
I did this work on Microsoft Windows 10, so I needed the Windows tools. Microsoft is now contributing to this OpenSSH distribution on GitHub. I installed the 64-bit version in c:\Program Files\OpenSSH-Win64. Linux / Mac folks can use their favorite tools.
Python and Pip
Python must be installed and on your command line path. Install Python and pip per the AWS Elastic Beanstalk CLI instructions. The page describes the Windows and Linux installation processes. Make sure to add the Python directories to your environment PATH variables.
Elastic Beanstalk CLI
Install the Elastic Beanstalk CLI after installing Python. You can find it in the same CLI web page.
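The prerequisites above can be checked before running eb init. This is a minimal sketch using only the Python standard library. The tool list and the missing_tools helper are my own names for illustration, and the executable names may differ on your platform (for example python3 instead of python).

```python
import shutil

# Tools the walkthrough expects on PATH. "eb" is the Elastic Beanstalk CLI.
# Adjust the names if your platform uses different ones (e.g. "python3").
REQUIRED_TOOLS = ["ssh-keygen", "python", "pip", "eb"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

An empty result means every tool was found. Anything listed needs to be installed or added to PATH before continuing.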
Creating an EB Environment and Deploying Tika
Terms
- Working directory: The name of the directory your command prompt is sitting in. This is the directory where .elasticbeanstalk/config.yml is created.
- I usually prefix mine with my company name to simplify uniqueness constraints later.
- Application name: This is the name of your directory by default. It can be anything. The application name is the root of the external URL and should be unique.
- I usually prefix mine with my company name to simplify uniqueness constraints later.
- Environment Name: Applications can be deployed into different environments with different properties.
- This tends to default to <app_name>-dev for development environments.
Create working directory
- Create a working directory. This will probably be the same as your app name.
- I named mine fsi-tika-eb for FreemanSoft Tika ElasticBeanstalk. I'd probably pick something more like fsi-eb-tika if I built up a demo environment in the future.
- CD into the directory. The .elasticbeanstalk/config.yml file will end up here.
- Download the tika-server jar file and put it into this directory. I used version 1.11.
Initialize ElasticBeanstalk CLI environment
Initialize the eb environment and answer its questions.
- Run eb init
- Pick the data center your company uses or the nearest data center.
- Enter the AWS access id and secret key. I used my test account credentials. You should use your IAM credentials.
- Enter your application name. It may default to your directory name.
- Select Java as the platform.
- Use Java 7 or Java 8
- Let it create the necessary SSH credentials.
- This section fails if SSH is not on the path.
- Let it create an SSH key set with the default name if this is the first configured shell. It selected aws-eb for my keyset name.
Create ElasticBeanstalk Execution environment and deploy
Edit .elasticbeanstalk/config.yml and add a new section:
deploy:
  artifact: tika-server-1.11.jar
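The config.yml edit can also be scripted, which fits the automation theme of this post. The sketch below appends the deploy section as plain text rather than parsing the YAML, so it assumes the file has no deploy section nested in an unusual way. The with_deploy_artifact helper is a hypothetical name, not part of the EB CLI.

```python
from pathlib import Path


def with_deploy_artifact(config_text, jar_name):
    """Return config_text with a deploy/artifact section appended.

    The text is returned unchanged if a top-level deploy section already exists.
    """
    if any(line.startswith("deploy:") for line in config_text.splitlines()):
        return config_text
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + "deploy:\n  artifact: " + jar_name + "\n"


if __name__ == "__main__":
    # Run from the working directory that eb init was run in.
    config_path = Path(".elasticbeanstalk") / "config.yml"
    if config_path.exists():
        updated = with_deploy_artifact(config_path.read_text(), "tika-server-1.11.jar")
        config_path.write_text(updated)
    else:
        print("No .elasticbeanstalk/config.yml here; run eb init first.")
```

Running it twice is harmless because the second run finds the existing deploy section and leaves the file alone.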
- Option: Single command line
- Specify what would be the default values for app initialized as fsi-eb-tika above
- eb create fsi-eb-tika-dev --cname fsi-eb-tika-dev --elb-type application --enable-spot -it t2.micro --envvars PORT=9998 --tags mytag=myvalue
- Application will deploy
- Option: interactive command line specifying port and EC2 type
- Create and deploy the application
eb create --instance_type t2.micro --envvars PORT=9998
- Accept any defaults offered
- Application will deploy
- Option: interactive command line for all options
- Create and deploy the application: eb create
- Accept the defaults.
- Application will deploy.
- Set the port number after deployed. This causes the application to redeploy:
eb setenv PORT=9998
You should end up with Tika deployed on a single t2.micro instance with auto-scale enabled up to 4 nodes.
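If you want to drive the create step from a script, the command can be assembled as an argument list first. This sketch only uses the --instance_type and --envvars flags shown in this post and passes the environment name as the positional argument. The eb_create_command helper and its defaults are illustrative assumptions.

```python
import shlex


def eb_create_command(env_name, instance_type="t2.micro", port=9998):
    """Build the argument list for an eb create call with instance type and port."""
    return [
        "eb", "create", env_name,
        "--instance_type", instance_type,
        "--envvars", "PORT=" + str(port),
    ]


if __name__ == "__main__":
    # Print rather than run, so nothing is provisioned by accident.
    # Pass the list to subprocess.run(...) once the command looks right.
    print(shlex.join(eb_create_command("fsi-eb-tika-dev")))
```

Keeping the command as a list avoids quoting problems when it is later handed to subprocess.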
Port Forwarding Notes
The AWS ELB/ALB listens on port 80 and assumes that the EB application is running on port 5000. We have to change one of the following.
- Change ELB/ALB back-side target port
- Change the Tika listener port.
Gateway Errors
You will get gateway 502 errors and the console will show unhealthy if the ALB/ELB target port and the Tika listener port don't match.
Verification
Basic Server Verification
Open a browser to test with a GET operation | |
---|---|
The default naming is | http://<application_name>-<environment>.elasticbeanstalk.com |
My demo url was | fsi-tika-eb.elasticbeanstalk.com. |
Run a GET against the Tika parser | http://<application_name>-<environment>.elasticbeanstalk.com/tika |
GET on this demo URL resulted in | http://fsi-tika-eb-dev.elasticbeanstalk.com/tika |
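The same GET check can be done without a browser. This is a rough sketch using the standard library. The tika_url and fetch_text helpers are hypothetical names, and the host pattern simply follows the demo URL in this post, so adjust it if your environment URL looks different.

```python
import urllib.request


def tika_url(cname, path="/tika"):
    """Build a URL of the form used in this post from an environment CNAME prefix."""
    return "http://" + cname + ".elasticbeanstalk.com" + path


def fetch_text(url, timeout=10):
    """GET the URL and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling fetch_text(tika_url("fsi-tika-eb-dev")) against a healthy environment should return a 200 status. A 502 from the load balancer surfaces as urllib.error.HTTPError, which usually points back to the port mismatch described under Gateway Errors.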
Parsing a Test Document
You can test the Tika server from any HTTP test tool like the Chrome POSTman plugin. The Tika Server API is documented on the Apache Tika Wiki.
Example File Parsing
- Select PUT as your HTTP method
- Set the Body type to Binary
- Select the file that will act as a body.
- Select the MIME type you want back.
- Use the /tika path, EX: http://fsi-tika-eb-dev.elasticbeanstalk.com/tika
- Select an XLS file
- Tell Tika you want HTML back with the header Accept: text/html
- Submit the message.
- You should get back an HTML table representation of the XLS file.
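The steps above can be sketched in code for anyone who prefers a script to an HTTP test tool. The helper names are my own. One assumption is labeled in the code: the body is sent as application/octet-stream so the server is left to work out the file type itself, because urllib would otherwise label the body as form data.

```python
import urllib.request


def build_parse_request(base_url, document_bytes, accept="text/html"):
    """Build (but do not send) a PUT request for the Tika /tika endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={
            # What we want back, e.g. text/html or text/plain.
            "Accept": accept,
            # Assumption: a generic binary type leaves detection to the server.
            "Content-Type": "application/octet-stream",
        },
    )


def parse_document(base_url, file_path, accept="text/html", timeout=60):
    """PUT the file at file_path to Tika and return the response body as text."""
    with open(file_path, "rb") as handle:
        request = build_parse_request(base_url, handle.read(), accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For the example in this post, the call would look like parse_document("http://fsi-tika-eb-dev.elasticbeanstalk.com", "some-file.xls"), where some-file.xls stands in for whatever spreadsheet you pick.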
Example Data Type Detection
- Select PUT as your HTTP method
- Set the Body type to Binary
- Select the file that will act as a body.
- Select the MIME type you want back.
- Use the /detect/stream path EX: http://fsi-tika-eb-dev.elasticbeanstalk.com/detect/stream
- Select a PNG file
- Submit the message
- You should get back the mime type of the document you sent.
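Type detection can be scripted the same way. This is a minimal standard-library sketch with hypothetical helper names. As an assumption, it sends the body as application/octet-stream so that no type hint is passed to the server.

```python
import urllib.request


def build_detect_request(base_url, document_bytes):
    """Build (but do not send) a PUT request for the Tika /detect/stream endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/detect/stream",
        data=document_bytes,
        method="PUT",
        # Assumption: a generic binary type avoids hinting the answer to the server.
        headers={"Content-Type": "application/octet-stream"},
    )


def detect_mime_type(base_url, file_path, timeout=60):
    """PUT the file at file_path to Tika and return the detected MIME type string."""
    with open(file_path, "rb") as handle:
        request = build_detect_request(base_url, handle.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

For a PNG file, the returned string should be the MIME type Tika detected for that file.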
See the Apache Tika Wiki page for more information.
Viewing Logs
You can view any of the captured log files using the AWS console or the eb command prompt. Open up a command prompt in your application's directory. Enter the following to see logs
- eb logs
Additional Reading
Amazon publishes some free AWS kindle books/booklets including the following
- http://www.amazon.com/AWS-Elastic-Beanstalk-Developer-Guide-ebook/dp/B007Q4JFE0
Created Feb 15 2016
Minor Revisions April 2020