DOxLON November 2016 - Data Democratisation Using Splunk

Data Democratisation Using Splunk Neil Roy Chowdhury [email protected]


  • Data Democratisation Using Splunk

    Neil Roy Chowdhury [email protected]

    mailto:[email protected]

  • About Me Splunking Since 2008

    Largest Splunk Implementation:

    3 TB/day

    1.2 PB Searchable

    900 Users

    Interests:

    Guitars

    And the occasional Uke

  • What is Splunk? A Google search for IT data?

    A log aggregation tool?

    A data visualisation tool?

    A data platform with app creation capabilities:

    Proprietary search language - SPL

    Correlation of structured and unstructured data sources

    Visualisation capabilities - out of the box, and modular

  • Getting Data In

    Unstructured Data Sources

    Structured Data Sources - JSON, CSV, XML

    Flow: Data Sources -> Forwarders / HEC -> Indexer

    Indexer pipeline: Line Breaking -> Timestamp Recognition -> Data Segmentation -> Persist to Disk

    An index is made up of buckets; each bucket stores keywords and the raw data
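The line breaking and timestamp recognition steps above are driven by per-sourcetype settings in props.conf; a minimal sketch, assuming a hypothetical web.access sourcetype with Apache-style timestamps:

```
# props.conf - index-time parsing for a hypothetical web.access sourcetype
[web.access]
# one event per line, no multi-line merging
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# timestamp follows the first "[", e.g. [10/Nov/2016:13:55:36 +0000]
TIME_PREFIX = \[
TIME_FORMAT = %d/%b/%Y:%H:%M:%S %z
MAX_TIMESTAMP_LOOKAHEAD = 30
```

Getting these right at onboarding time matters: a mis-parsed timestamp or bad line break is baked into the index at write time.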

  • Data Collection using Splunk Forwarder

    Splunk forwarder capabilities

    File based Inputs

    Database Inputs

    Scripted Inputs

    Forwarder Configurations deployed as modular add-ons
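Forwarder inputs are declared in inputs.conf and shipped inside an add-on; a sketch of a file-based input, with the path, index and sourcetype as placeholder values:

```
# inputs.conf - file-based input on a Universal Forwarder (hypothetical names)
[monitor:///var/log/myapp/access.log]
index = web
sourcetype = web.access
disabled = false
```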

  • Typical Splunk Search

    index= sourcetype=web.access checkout | stats avg(response_time) as "Average Response Time" by request

  • Searching Data

    Query index by keyword

    Load raw results into memory

    Apply data extractions, transformations and lookups

    Run Streaming Commands

    Indexers - Map

    Search Heads - Reduce

    Knowledge

    Objects

    Receive Results and Reduce

    Run Additional Commands

    Visualise, Report, Alert
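The map/reduce split above is visible in a single search: streaming commands such as eval run on the indexers (map), while transforming commands such as stats are reduced on the search head. A sketch, with index and field names assumed:

```
index=web sourcetype=web.access
| eval response_s = response_time / 1000
| stats avg(response_s) AS avg_response_s BY host
```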

  • So what about Knowledge Objects? Most Knowledge Objects are configurable from UI

    Common Types:

    Field Extractions - regex to extract fields

    Field Aliases - alias the name of a field

    Lookups - enrich events from flat files or the KV store

    Tags - Provides event grouping abstraction

    Eventtypes - Provides event categorisation

    Calculated Fields - Data manipulations
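Eventtypes and tags are themselves just configuration; a sketch of how the pair used later in this deck might be defined, with the search and names assumed:

```
# eventtypes.conf - categorise successful auth events (hypothetical search)
[auth_successful]
search = sourcetype=web.access uri_path="/checkout/auth/*" http_code=200

# tags.conf - group the eventtype under a "web" tag
[eventtype=auth_successful]
web = enabled
```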

  • Goal? Queries like:

    index= /checkout/auth/confirmation | rex | eval response_time_seconds = resp_time_milliseconds/1000 | where http_code == 200 | lookup db_locations customer_id OUTPUT location | stats avg(response_time_seconds) as avg_response_time by location

    Become:

    eventtype=auth_successful tag=web | stats avg(response_time_seconds) as average_response_time by location
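The short form works because the knowledge has been persisted against the sourcetype; a sketch of the calculated field and automatic lookup that would back it, with names assumed from the long query:

```
# props.conf - persist the eval and lookup from the long query (hypothetical names)
[web.access]
EVAL-response_time_seconds = resp_time_milliseconds / 1000
LOOKUP-customer_location = db_locations customer_id OUTPUT location
```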

  • Goal?

    Persisting Knowledge

  • Data Democratisation

    Sounds like the holy grail of data

    Idealistic?

  • Scenario Microservices Architecture

    Numerous Development Teams working under different service umbrellas

    Mix of legacy systems with modern services

    Dependence on vendor integrations

    Data can be sensitive

  • Typical Data Democratisation Issues Security - some data is sensitive yet valuable, but we'd like an open access model

    Knowledge Fragmentation - it's our data, let's make sure everyone knows what it means.

    Adoption - people need to like it. It shouldn't get in the way.

    Scalability

    Chargeback - it's not my data, why should I pay for it?

  • Security - Delegated Access Model Splunk search apps can serve as knowledge containers

    Knowledge Object ownership can be scoped local to the app or global to the entire system.

    Splunk Indexes are data containers.

    Data Access granted by index

    Assign an app per product or service umbrella

    Assign Data Owner
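Index-level access is granted per Splunk role in authorize.conf; a sketch assuming a hypothetical checkout team, role and index:

```
# authorize.conf - role mapped from a federated group (hypothetical names)
[role_checkout_team]
importRoles = user
srchIndexesAllowed = checkout
srchIndexesDefault = checkout
```

One role per product or service umbrella keeps the mapping from federated group to app and index permissions straightforward.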

  • Delegated Access Model

    Federated Group Splunk Role

    App Level Permissions

    Index Level Permissions

  • Splunk Security Must Have! Splunk Authentication is Poor

    No Password Policy

    No Centralised management for multiple search nodes

    Single Sign On - Splunk supports:

    Ping Identity

    Okta

    ADFS

    Azure AD

    LDAP

    Custom Auth

    Use an entitlement framework on top of single sign-on groups
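As an illustration, SAML single sign-on with group-to-role mapping lives in authentication.conf; a sketch with the IdP details as placeholders:

```
# authentication.conf - SAML SSO with group-to-role mapping (placeholder values)
[authentication]
authType = SAML
authSettings = saml_idp

[saml_idp]
entityId = splunk-prod
idpSSOUrl = https://idp.example.com/sso/saml

# Splunk role on the left, IdP group on the right
[roleMap_SAML]
role_checkout_team = checkout_team
```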

  • Combating Knowledge Fragmentation Semantic Logging:

    Logging for the sole purpose of analytics

    Rich datasets can be viewed in multiple dimensions

    Define Developer Guidelines:

    Ensure Correlation Identifiers are present in all events

    Precision Timestamps

    Incorporate Logging into SDLC

    Standardise Logging Formats

    Standardise Log content per service - e.g. BAM metrics

  • Combating Knowledge Fragmentation

    Reality - not all logs can be logged semantically, or at least not without significant refactoring.

    Splunk Solution - Data Models

  • Data Models

    Enables "go go gadget" schema on the fly

    Hierarchically structured search-time mapping of semantic knowledge.

    Accessed via Datasets tab in Splunk 6.5

  • Example: Splunk CIM Splunk Common Information Model (CIM)

    Collection of Data Models based on subject area

    Shared Semantic model

    Support consistent and normalised treatment of data

    Enables third party apps to be integrated to your data.

    Reference Tables:

    http://docs.splunk.com/Documentation/CIM/4.6.0/User/Howtousethesereferencetables

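Once data is mapped to CIM, searches can be written against the shared field names rather than per-source ones; a sketch against the CIM Web data model's root dataset:

```
| datamodel Web Web search
| stats count BY Web.status
```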

  • Pivot UI Developed to enable the creation of analytics off structured data

    models

    Supports:

    Tables

    Charts - Line,Scatter, Column, Bar, Bubble,Pie

    Single Value Visualisations

  • Performance Data Models can be accelerated, which:

    Decreases search optimisation effort

    Decreases dashboard optimisation effort

    Increases storage requirements

    Speed-ups of up to 1000x

    Speed is dependent on the cardinality of the data
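Accelerated data models are typically queried with tstats, which reads the acceleration summaries instead of the raw events; a sketch against the CIM Web data model:

```
| tstats avg(Web.response_time) AS avg_response_time
    FROM datamodel=Web
    WHERE Web.status=200
    BY Web.site
```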

  • Notable Splunk Apps on CIM

    Splunk Enterprise Security

    Splunk PCI Compliance

    Insight Engines - Search Splunk using Natural Language

  • Adoption Most users complain about backlogs in onboarding data

    Automating the onboarding process isn't as easy as it sounds. Data validation is key to deriving value.

    Universal Forwarder:

    Standardise Log Locations

    Standardise Time Stamps

    HTTP Event Collector:

    Send data directly from your application to Splunk

    Utilise Indexer Acknowledgement

    Notable implementations:

    Docker - Splunk Logging Driver
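An HEC event is a JSON payload POSTed to the collector endpoint with a token; a sketch of the request, with the host, token and event fields as placeholders:

```
POST https://splunk.example.com:8088/services/collector/event
Authorization: Splunk 00000000-0000-0000-0000-000000000000

{"time": 1480550400, "host": "web01", "sourcetype": "web.access",
 "event": {"uri_path": "/checkout", "http_code": 200, "response_time": 134}}
```

With indexer acknowledgement enabled, the client additionally sends a channel identifier with each request and polls the collector's ack endpoint to confirm the events were indexed, not just received.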

  • Newish Splunk Features Machine Learning Toolkit

    Comes with built-in assistants for supported algorithms

    Extend the available algorithms via Python scikit-learn

    ITSI

    Modular Visualisations

    New Custom Search Command Creation Capability

    TSIDX Reduction - Decrease Storage Costs

  • Crystal Ball

    Further integration into the Hadoop ecosystem