Sunday, February 28, 2016

Received and sent messages in a single mailbox with MS Outlook for OSX

Microsoft Outlook for the Mac and PC behave differently when showing conversations in the Inbox. The PC shows received and sent messages. The Mac shows only the received messages.  There is no default way to show a threaded conversation on Mac Office 2016.

Microsoft Outlook for the Mac is integrated with OS X Spotlight search, so AppleScript and Spotlight can be used to create Outlook Smart Folders. Smart Folders are more like views into mailboxes than actual mailboxes. They are virtual folders that are created from the results of a search. This blog leverages Outlook's raw search capabilities that come from the OS X integration. You can find out more information about this integration on the Microsoft Answers web site. Portions of this blog came from this excellent blog posting.

Identify Mailboxes to be included in Smart Folder

Our conversation SmartFolder is made up of the contents of the Inbox and Sent mailboxes. We first need to identify the Microsoft Outlook folder identifiers for the two mailboxes.

  • Run Outlook
  • Highlight the mailbox we are going to include, Inbox or Sent. We want to get this mailbox's folder id.
  • Start a Mac Spotlight search
  • Type applescript and press Enter to bring up the AppleScript editor.
  • Enter the following in the AppleScript editor and run it
on run {}
tell application "Microsoft Outlook"
get selected folder
end tell
end run

  • It should return the results of the execution
    mail folder id 109 of application "Microsoft Outlook"
  • Do the same thing for the other mailbox.  
  • My mailbox numbers were 109 (Inbox) and 112 (Sent)
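
The lookup above can also be scripted. The following is a minimal sketch in Python, assuming macOS with the osascript command available and Outlook already running with the mailbox highlighted. The function names are illustrative and are not part of Outlook's tooling, and the parser assumes the reply has the same shape as the result shown above.

```python
import re
import subprocess

# Same AppleScript as above, passed to osascript one line per -e argument.
SCRIPT_LINES = [
    'tell application "Microsoft Outlook"',
    "get selected folder",
    "end tell",
]


def parse_folder_id(reply):
    """Extract the number from text like 'mail folder id 109 of application ...'."""
    match = re.search(r"folder id (\d+)", reply)
    if match is None:
        raise ValueError("no folder id found in: " + reply)
    return int(match.group(1))


def selected_folder_id():
    """Ask Outlook for the selected folder (macOS only, Outlook must be running)."""
    args = ["osascript"]
    for line in SCRIPT_LINES:
        args += ["-e", line]
    reply = subprocess.run(args, capture_output=True, text=True, check=True).stdout
    return parse_folder_id(reply)
```

Calling selected_folder_id() once with the Inbox highlighted and once with Sent highlighted would give the two numbers needed for the raw query.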

Create an integrated Threaded Conversation 

Build Smart Folder with a Raw Search

  • Click on the Search field in the upper right hand corner of the Outlook view.
    • This will enable the search tab and ribbon
  • Select the Search tab
  • Select "All Mailboxes" in the Ribbon Bar
  • Press the Advanced search button in the ribbon
  • Select Raw Query from the drop list. 
  • Enter the following query, replacing 109 and 112 with the mailbox numbers retrieved above.
    com_microsoft_outlook_folderID == 109 || com_microsoft_outlook_folderID == 112
  • Press Enter. The Smart Folder should populate with the combined content of the two folders. Conversations in this combined Inbox/Sent folder will include both inbound and outbound messages.
  • Press Save Search in the Search ribbon bar and enter the name of your new Smart Folder.
Your search bar should look something like the following


Smart folders are query result views and not real folders. You can use the standard Search functionality against a Smart Folder.  The system treats the additional Search terms as part of the Smart Folder's query and will ask you if you wish to change the folder query every time you move Outlook from the Smart Folder to a traditional folder.  You can tell Outlook to "not save" the changes.  Yeah, it is kind of annoying.
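
If more than two mailboxes need to be combined, the raw query for the conversation folder can be generated rather than typed. This is a small sketch in Python that only builds the query text used above. The function name is illustrative, and the result still has to be pasted into Outlook's Raw Query field by hand.

```python
def folder_query(folder_ids):
    """Join one folderID test per mailbox with the raw-query OR operator."""
    if not folder_ids:
        raise ValueError("at least one folder id is required")
    return " || ".join(
        "com_microsoft_outlook_folderID == {}".format(int(fid)) for fid in folder_ids
    )


# Prints the same query used above for Inbox (109) and Sent (112).
print(folder_query([109, 112]))
```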

Create an Unread Email Smart Folder

You can create a Smart Folder of just unread messages similar to the conversation folder described above.
  • Select the Inbox
  • Select Search
  • Select the Search Tab if it is not showing
  • Select Advanced Query in the Search Ribbon
  • Change the query type to Raw
  • Enter the following into the raw query area:
com_microsoft_outlook_unread != 0
  • Press Save Search and enter the name of the new Smart folder.

Additional Resources

Discussion on the mdls command and Mac / Outlook variables for raw queries can be found in this Apple discussions thread.
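
As a rough illustration of how mdls output could be mined for query variables, here is a minimal sketch in Python. It assumes macOS, that you already know the path of a file Outlook has indexed (the path is not given here), and that the attributes of interest appear as single-line name = value entries. The function names are illustrative.

```python
import subprocess


def outlook_attributes(mdls_output, prefix="com_microsoft_outlook_"):
    """Return {name: value} for single-line 'name = value' entries with the prefix."""
    attributes = {}
    for line in mdls_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip().startswith(prefix):
            attributes[name.strip()] = value.strip().strip('"')
    return attributes


def read_attributes(path):
    """Run mdls against one file (macOS only) and filter its output."""
    result = subprocess.run(["mdls", path], capture_output=True, text=True, check=True)
    return outlook_attributes(result.stdout)
```

Any attribute name returned this way is a candidate for use in a raw query, in the same way as com_microsoft_outlook_folderID and com_microsoft_outlook_unread above.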

Created 2016 Feb 02

Monday, February 15, 2016

Almost PaaS Document Parsing with Tika and AWS Elastic Beanstalk

The Apache Tika project provides a library capable of parsing and extracting data and metadata from over 1000 file types. Tika is available as a single jar file that can be included inside applications or as a deployable jar file that runs Tika as a standalone service.

This blog describes deploying the Tika jar as an auto-scale service in Amazon AWS Elastic Beanstalk. I selected Elastic Beanstalk because it supports jar-based deployments without any real infrastructure configuration. Elastic Beanstalk auto-scale should take care of scaling up and down for the number of requests you get.

Tika parses documents and extracts their text completely in memory. Tika was deployed for this blog using EC2 t2.micro instances available in the AWS free tier. t2.micro VMs have 1 GB of memory, which means that you are restricted in document complexity and size. You would size your instances appropriately for your largest documents.


Prerequisites

  • An AWS account.
  • AWS access id and secret key.  This is most easily created in the AWS web console.  Amazon recommends using IAM credentials. I used the default account credentials since this was done mostly as a PoC. Remember to save any key id and key values that you need. They cannot be recovered once generated.
  • Read the Amazon Elastic Beanstalk command line instructions

Not Addressed

  • Using the Amazon Console to do web based deployments. I decided to do this with command line tools to get a feel for automation possibilities.
  • Limiting access to this service (access controls)
  • IAM credentials.
  • Load testing with something like JMeter


SSH

SSH must be installed and on your command line path. The Elastic Beanstalk command prompt expects ssh-keygen to be on your path.

I did this work on Microsoft Windows 10, so I needed the Windows tools. Microsoft is now contributing to this SSH distribution on GitHub. I installed the 64-bit version in c:\Program Files\OpenSSH-Win64. Linux / Mac folks can use their favorite tools.

Python and Pip

Python must be installed and on your command line path. Install Python and pip per the AWS Elastic Beanstalk CLI instructions. The page describes the Windows and Linux installation processes. Make sure to add the Python directories to your environment PATH variable.

Elastic Beanstalk CLI

Install the Elastic Beanstalk CLI after installing Python. You can find it on the same CLI web page.

Creating an EB Environment and Deploying Tika

  • Working directory: The name of the directory your command prompt is sitting in. This is the directory where .elasticbeanstalk/config.yml is created.
    • I usually prefix mine with my company name to simplify uniqueness constraints later.
  • Application name: This is the name of your directory by default. It can be anything. The application name is the root of the external URL and should be unique.
    • I usually prefix mine with my company name to simplify uniqueness constraints later.
  • Environment Name: Applications can be deployed into different environments with different properties. 
    • This tends to default to <app_name>_dev for development environments.
  1. Create a working directory.  This will probably be the same as your app name. 
    1. I named mine fsi-tika-eb for FreemanSoft Tika ElasticBeanstalk. I'd probably pick something more like fsi-eb-tika if I built up a demo environment in the future.
  2. CD into the directory.  The .elasticbeanstalk/config.yml file will end up here
  3. Download the tika-server jar file and put it into this directory. I used version 1.11   
  4. Initialize the eb command environment and answer its questions
    1. Run eb init
    2. Pick the data center your company uses or the nearest data center.
    3. Enter the AWS access id and secret key. I used my test account credentials. You should use your IAM credentials.
    4. Enter your application name. It may default to your directory name.
    5. Select Java as the platform.
    6. Use Java 7 or Java 8
    7. Let it create the necessary SSH credentials.  
      1. This section fails if SSH is not on the path.
      2. Let it create a SSH key set with the default name if this is the first configured shell.  It selected aws-eb for my keyset name
  5. The config.yml file contains the settings selected during eb init. Normally the eb create command would try to run a command or deploy a directory. We have a single all-encompassing Tika jar that we downloaded above. This means we can set the default deployment artifact to the jar file name.
    1. Edit .elasticbeanstalk/config.yml
    2. Add a new section
             deploy:
               artifact: tika-server-1.11.jar
  6. Create a new Elastic Beanstalk Environment and auto deploy the application.   We can choose the default options and reconfigure later or we can try and configure the load balancer port and machine size in a single command.
    1. Option: Single command line
      1. Create and deploy the application
        eb create --instance_type t2.micro --envvars PORT=9998
      2. Accept any defaults offered
    2. Option: Basic commands
      1. Create and deploy the application: eb create
      2. Accept the defaults.
      3. Set the port number.  This causes the application to redeploy:
        eb setenv PORT=9998
You should end up with Tika deployed on a single t2.micro instance with auto-scale enabled up to 4 nodes.
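
Since the point of using the command line was to get a feel for automation, the single-command option could be wrapped in a script. This is a minimal sketch in Python, assuming the eb CLI is installed and on the path and that eb init has already been run in the working directory. The function names are illustrative, and eb may still prompt for an environment name and other defaults.

```python
import subprocess


def eb_create_args(instance_type="t2.micro", port=9998):
    """Build the single-command form of eb create described above."""
    return [
        "eb", "create",
        "--instance_type", instance_type,
        "--envvars", "PORT={}".format(port),
    ]


def create_environment(working_dir):
    """Run eb create in the directory that holds .elasticbeanstalk/config.yml."""
    subprocess.run(eb_create_args(), cwd=working_dir, check=True)
```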

Load Balancer Notes

The AWS Load Balancer listens on port 80 and assumes that the EB application is running on port 5000. We have to change the Load Balancer back side port or we have to change the port Tika is listening on. It is easiest to just change the load balancer back side port since that can be done with just the PORT system property.


Basic Server Verification

Open a browser and do a GET request against your application.  The default naming is 
My demo showed up on
Executing a GET against the Tika parser
 My demo showed up on
and resulted in the message
This is Tika Server. Please PUT 

Parsing a Test Document

You can test the Tika server from any HTTP test tool like the Chrome Postman plugin. The Tika Server API is documented on the Apache Tika Wiki.

Example File Parsing
  • Select PUT as your HTTP method
  • Set the Body type to Binary
  • Select the file that will act as a body.  
  • Select the MIME type you want back.
  • Use the /tika path, EX:
  • Select an XLS file
  • Tell Tika you want HTML back with the header Accept: text/html
  • Submit the message.
  • You should get back an HTML table representation of the XLS file.

Example Data Type Detection
  • Select PUT as your HTTP method
  • Set the Body type to Binary
  • Select the file that will act as a body.  
  • Select the MIME type you want back.
  • Use the /detect/stream path EX:
  • Select a PNG file
  • Submit the message
  • You should get back the MIME type of the document you sent.
See the Apache Tika Wiki page for more information.
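
The two examples above can also be driven from a script instead of an HTTP test tool. The following is a minimal sketch using only the Python standard library. The base URL is whatever address your Elastic Beanstalk environment was given, the function names are illustrative, and the detection path is written as /detect/stream, which is how the Tika Server documentation lists that resource. The body is sent as application/octet-stream so that Tika works out the type from the content itself.

```python
import urllib.request


def tika_put(base_url, path, body, accept=None):
    """Build a PUT request that carries the raw file bytes as the body."""
    headers = {"Content-Type": "application/octet-stream"}
    if accept:
        headers["Accept"] = accept
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=body, headers=headers, method="PUT"
    )


def parse_to_html(base_url, file_path):
    """PUT a file to /tika and return the HTML rendering as text."""
    with open(file_path, "rb") as handle:
        request = tika_put(base_url, "/tika", handle.read(), accept="text/html")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def detect_type(base_url, file_path):
    """PUT a file to /detect/stream and return the detected MIME type."""
    with open(file_path, "rb") as handle:
        request = tika_put(base_url, "/detect/stream", handle.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()
```

Keeping the request construction in tika_put separate from the network call makes it easy to check the method, path and headers without a running server.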

Viewing Logs

You can view any of the captured log files using the eb command prompt. Open up a command prompt in your application's directory.  Enter the following to see logs
eb logs
Deployment, access and application logs will be retrieved and displayed.

Additional Reading

Amazon publishes some free AWS kindle books/booklets including the following

Created Feb 15 2016

Thursday, February 11, 2016

Slice Splunk simpler and faster with better metadata

Splunk is a powerful event log indexing and search tool that lets you analyze large amounts of data. Event and log streams can be fed to the Splunk engine where they are scanned and indexed. Splunk supports full text search plus highly optimized searches against metadata and extracted data fields. Extracted fields are outside the scope of this missive.

Each log/event record consists of the log/event data itself and information about the log/event known as metadata.  For example, Splunk knows the originating host for each log/event.   Queries can efficiently filter by full or partial host names without having to specifically put the host name in every log message.

Message counts with metadata wildcards

One of the power features of metadata is that Splunk will provide a list of all metadata values and the number of matching messages as part of the result of any query. A Splunk query returns matching log/event records and the number of records in each bucket, like #records/hostname. A Splunk query against a wildcarded metadata field like hostname returns the number of records for each hostname matching that pattern.
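
The bucketing idea can be illustrated outside of Splunk. This is a plain Python sketch, not Splunk code: the records are made-up (host, message) pairs, and the function simply counts records per host for hosts matching a shell-style wildcard, which is the kind of per-value count described above.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical records: (host, message) pairs standing in for indexed events.
records = [
    ("RD0001", "started"),
    ("RD0001", "request served"),
    ("RD0002", "started"),
    ("WEB01", "started"),
]


def counts_by_host(events, pattern):
    """Count events per host for hosts matching a shell-style wildcard."""
    return Counter(host for host, _ in events if fnmatchcase(host, pattern))


# Two records for RD0001, one for RD0002; WEB01 does not match the pattern.
print(counts_by_host(records, "RD000*"))
```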

Some day there will be a screen shot right here.

Standard Metadata

All Splunk entries come with a few basic metadata fields.

| Name | Purpose | Sample Values |
| --- | --- | --- |
| index | The primary method for partitioning data. Organizations often route data to different indexes based on business unit, production status or sensitivity of the data. This is also the primary attribute for access control. Users are granted access to individual indexes. | |
| source | The originating source of a file. This is usually the file name in the case of file logs. Queries can pattern match against file paths or full names to narrow down other search criteria. This is useful when looking for a particular type of problem on one or more nodes. The source and sourcetype may be the same in the case of non-file resources. | |
| sourcetype | The type of log, used to drive parsing templates. This can be used as a broader filter to look at all or some subset of log files while filtering out system events. The source and sourcetype may be the same in the case of non-file resources. | |
| host | This is the hostname the message came from. Hostnames can be explicitly provided or provided as part of a pattern. This is useful when cluster nodes share similar hostnames or when looking at problems on a specific host. | host=RD000* |

The default metadata fields make it easy to filter down data without adding explicit values in each individual log/event record.  Using just the standard metadata causes teams to twist the source or sourcetype fields in unnatural ways.

Hacking standard field values

The standard fields do not provide enough axes upon which to partition logs or events. Organizations often use implicit standards to make it possible to filter out information based on environment or application.

  • Organizations move the location of log files based on the system environment putting production logs in /var/log/prod/myapp/foo.log and QC logs in /var/log/qc/myapp/foo.log.  Then they query by environment by pattern matching the file names.  This only works for log files and not system events or syslog.  
  • Organizations filter for applications or environments by host names counting on standard host naming conventions.  This can be cumbersome and may not work at all for PaaS style hosts created in cloud environments.

Both of these are hacks that can be avoided with the use of additional metadata via the _meta tag.

Recommended Metadata Additions

Custom metadata can be configured through the _meta tag in the Splunk Forwarder inputs.conf files. The values can be added in global or application configuration files. Custom values can be added at the top of inputs.conf to apply to every source, or on each individual source in the inputs.conf file.

| Name | Purpose | Sample Values |
| --- | --- | --- |
| environment | Software moves through different environments on its way to production. Log analysis for troubleshooting or statistics tends to occur at the environment level. This can be greatly simplified by binding logs to "well known" environment names. It is sometimes possible to filter queries against environments based on host names. This has a sort of "magic spell" feel where everyone has to know the magic host naming conventions. It becomes complicated when there are multiple environments of the same type. An organization may have multiple QA/QC environments with similarly named hosts. | |
| application | This is the overarching name for an application that may include multiple deployed components or tiers. All components (web, workflow, integration) share the same application value. This may be the official application name or abbreviation for many large companies. | |
| role | Each application component plays a part or has a role in the overall application. This can be a tier name or a specific function name. There is a 1->N relationship between application and role. | role::ui |
| instance | This value specifies the individual instance of a multi-host/multi-deployment component. Instance names may be redundant with hostnames in some situations. There is a 1->M relationship between role and instance. The instance value may be the host name, an auto-generated instance id (for PaaS style) or a software partition name in the case of a multi-node component. Note that this can be especially useful in autoscale environments where hostnames may be shared. | |
| runlevel | You may wish to create some grouping bucket one level up from environment. This could be something that groups all of a certain environment type, like QC that contains environments QC1 and QC2. Or it could be a prod/non-prod discriminator so that production logs can be easily isolated. This can be useful in the unfortunate situation where production and non-production logs share the same index. | runlevel::prod |
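
To show what these additions might look like in a forwarder configuration, here is a small Python sketch that renders an inputs.conf monitor stanza with a _meta line in the key::value form used in the sample values above. The log path and field values are hypothetical examples, and the function names are illustrative. A deployment script could write this text out for each monitored source.

```python
def meta_line(fields):
    """Format custom fields as a _meta setting: key::value pairs separated by spaces."""
    return "_meta = " + " ".join("{}::{}".format(k, v) for k, v in fields.items())


def monitor_stanza(path, fields):
    """Build an inputs.conf monitor stanza that tags one log file with custom metadata."""
    return "[monitor://{}]\n{}\n".format(path, meta_line(fields))


# Hypothetical QC log file tagged with the recommended fields.
print(monitor_stanza(
    "/var/log/qc/myapp/foo.log",
    {"environment": "qc1", "application": "myapp", "role": "ui", "runlevel": "nonprod"},
))
```

With fields like these in place, a query can filter on environment or application directly instead of pattern matching file paths or host names.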

Cloud Metadata

Cloud vendors often have additional metadata about their deployments or environment that can be extracted and configured into Splunk inputs.conf files.  Teams should consider modifying Splunk deployment automation scripts to pick up the values.  Examples include but are in no way limited to the following:

| Vendor | Value | Purpose |
| --- | --- | --- |
| Microsoft Azure | Cloud Service | This represents the load balancer or application pool name. It can be very useful when troubleshooting or creating performance graphs. Multiple application components can operate within a cloud service. This may align with application or may be a level in between application and component. |
| Amazon Web Services | AMI version | This is essentially the virtual machine template version. This can be useful when creating an AMI inventory. |

Created 11/Feb/2016