Thursday, February 11, 2016

Slice Splunk simpler and faster with better metadata

Splunk is a powerful event log indexing and search tool that lets you analyze large amounts of data. Event and log streams can be fed to the Splunk engine where they are scanned and indexed.  Splunk supports full text search plus highly optimized searches against metadata and extracted data fields.  Extracted fields are outside this scope of this missive.

Each log/event record consists of the log/event data itself and information about the log/event known as metadata.  For example, Splunk knows the originating host for each log/event.   Queries can efficiently filter by full or partial host names without having to specifically put the host name in every log message.

Message counts with metadata wildcards

One of the power features of metadata is that Splunk will provide a list of all metadata values and the number of matching messages as part of the result of any query.  A Splunk query returns matching log/event records and the the number of records in each bucket like #records/hostname.  A Splunk query against a wildcarded metadata filed like hostname returns the number of records for each hostname matching that pattern.

Some day there will be a screen shot right here.

Standard Metadata

All Splunk entries come with a few basic metadata fields.

NamePurposeSample Values
indexThe primary method for partitioning data. Organizations often route data to different indexes based on business unit , production status or sensitivity of the data.

This is also the primary attribute for access control.  Users are granted access to individual indexes.  
sourceThe originating source of a file.  This is usually the file name in the case of file logs. Queries can pattern match against file paths or full names to narrow down other search criteria. This is useful when looking for a particular type of problem on one or more nodes.

The source and sourcetype may be the same in the case of non-file resources.
sourcetypeThe type of log, used to drive parsing templates.  This can be used as a broader filter to look at all or some subset of log files while filtering out system events.

The source and sourcetype may be the same in the case of non-file resources.

hostThis is the hostname the message came from. Hostnames can be explicitly provided or provided as part of a pattern.  This is useful when cluster nodes share similar hostnames or when looking at problems on a specific host.hostname=RD000*

The default metadata fields make it easy to filter down data without adding explicit values in each individual log/event record.  Using just the standard metadata causes teams to twist the source or sourcetype fields in unnatural ways.

Hacking standard field values

The standard fields do not provide enough axis upon which to partition logs or events.  Organizations often use implicit standards to make it possible to filter out information based on environment or application.

  • Organizations move the location of log files based on the system environment putting production logs in /var/log/prod/myapp/foo.log and QC logs in /var/log/qc/myapp/foo.log.  Then they query by environment by pattern matching the file names.  This only works for log files and not system events or syslog.  
  • Organizations filter for applications or environments by host names counting on standard host naming conventions.  This can be cumbersome and may not work at all for PaaS style hosts created in cloud environments.

Both of these are hacks that can be avoided with the use of additional metadata via the _meta tag.

Recommended Metadata Additions

Custom metadata can be configured through the _meta tag in the Splunk Forwarder inputs.conf files. They can be added in global or application configuration files.  Custom values can be added at the top of input.conf to apply to every source or on each individual source in the inputs.conf file.

NamePurposeSample Values
environmentSoftware moves through different environments on its way to production.  Log analysis for troubleshooting or statistics tends to occur at the environment level.  This can be greatly simplified by binding logs to "well known" environment names.

It is sometimes possible to filter queries against environments based on host names.  This has a sort of "magic spell" feel where everyone has to know the magic host naming conventions.  It becomes complicated when there are multiple environments of of the same type.   An organization may have multiple QA/QC environments may similarly named hostnames. 
applicationThis is the overarching name for an application that may include multiple deployed components or tiers.  All components, web, workflow, integration share the same application value.

This may be the official application name or abbreviation for many large companies.   
roleEach application component plays a part or has a role in the overall application.  This can be a tier name or a specific function name. There is a 1->N relationship between application and role.role::ui
instanceThis value specifies the individual instance of a multi-host/multi-deployment component.  Instance names may be redundant for hostnames in some situations. There is a 1>M relationship between role and instance.

The instance value may be the host name, an auto-generated instance id (for PaaS style) or a software partition name in the case of a multi-node component.  Note: that this can be especially useful in autoscale environments where hostnames may be shared.
runlevelYou may wish to create some grouping bucket one level up from environment.  This could be something that groups all of a certain environment type like QC that contains environments QC1 and QC2.  Or it could be a prod/non-prod discriminator so that production logs can be easily isolated. This can be useful in the unfortunate situation where production and non production logs share the same index.runlevel::prod

Cloud Metadata

Cloud vendors often have additional metadata about their deployments or environment that can be extracted and configured into Splunk inputs.conf files.  Teams should consider modifying Splunk deployment automation scripts to pick up the values.  Examples include but are in no way limited to the following:

Microsoft AzureCloud Service  This represents the load balancer or application pool name.  It can be very useful when troubleshooting or creating performance graphs.

Multiple application components can operate within a cloud service.  This may align with application or may be a level in between application and component. 
Amazon Web ServicesAMR version  This is essentially the virtual machine template version. This can be useful when creating an AMR inventory.

Created 11/Feb/2016

No comments:

Post a Comment