DOXLON November 2016 - Data Democratization Using Splunk

Data Democratisation Using Splunk Neil Roy Chowdhury [email protected]

Post on 11-Apr-2017






  • Data Democratisation Using Splunk

    Neil Roy Chowdhury [email protected]


  • About Me

    Splunking Since 2008

    Largest Splunk Implementation:

    3 TB/day

    1.2 PB Searchable

    900 Users



    And the occasional Uke

  • What is Splunk?

    Google Search for IT Data?

    Log aggregation Tool?

    Data Visualisation Tool?

    Data Platform with App Creation Capabilities

    Proprietary Search Language - SPL

    Correlation of Structured and Unstructured Data Sources

    Visualisation capabilities

    Out of the Box


  • Getting Data In

    Data Sources

    Unstructured Data Sources

    Structured Data Sources - JSON, CSV, XML

    Forwarders, HEC

    Indexer

    Line Breaking

    Timestamp Recognition

    Data Segmentation

    Persist to Disk

    Bucket - Keywords, Raw Data
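The three index-time steps named on this slide (line breaking, timestamp recognition, data segmentation) can be pictured with a toy pipeline. This is a minimal sketch for illustration only, not how Splunk implements them: the sample log format, the regular expressions and the function names are all assumptions.

```python
import re
from datetime import datetime

# Assumed sample format: every event starts with a "YYYY-MM-DD HH:MM:SS" timestamp.
RAW = (
    "2016-11-01 10:15:02 GET /checkout status=200\n"
    "2016-11-01 10:15:03 GET /basket status=500\n"
    "  stack trace line belonging to the previous event\n"
)

TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def break_lines(raw):
    """Line breaking: start a new event at each line that begins with a timestamp."""
    events = []
    for line in raw.splitlines():
        if TIMESTAMP.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line  # continuation of a multi-line event
    return events


def recognise_timestamp(event):
    """Timestamp recognition: parse the leading timestamp, if there is one."""
    match = TIMESTAMP.match(event)
    return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S") if match else None


def segment(event):
    """Data segmentation: split the event into lower-cased keywords."""
    return {token.lower() for token in re.split(r"[\s=/]+", event) if token}


events = break_lines(RAW)
# Each entry mirrors what the slide says is persisted: time, keywords, raw data.
index = [(recognise_timestamp(e), segment(e), e) for e in events]
```

The stack trace line is folded into the second event rather than becoming an event of its own, which is the point of getting line breaking right before anything is persisted.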

  • Data Collection using Splunk Forwarder

    Splunk forwarder capabilities

    File based Inputs

    Database Inputs

    Scripted Inputs

    Forwarder Configurations deployed as modular add-ons

  • Typical Splunk Search

    index = sourcetype=web.access checkout | stats avg(response_time) as "Average Response Time" by request

  • Searching Data

    Indexers - Map

    Query Index By Keyword

    Load Raw Results Returned in Memory

    Apply Data Extractions, Transformations and Lookups

    Run Streaming Commands

    Search Heads - Reduce

    Receive Results and Reduce

    Run Additional Commands

    Visualise, Report, Alert
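The map/reduce split on this slide can be sketched for an average grouped by request, as in the typical search shown earlier. This is a simplified illustration, not Splunk's code: the events, the two "indexers" and the function names are hypothetical. The key idea is that each indexer returns partial sums and counts rather than raw events, and the search head merges them.

```python
from collections import defaultdict

# Hypothetical events held on two indexers.
INDEXER_1 = [
    {"request": "/checkout", "response_time": 120},
    {"request": "/checkout", "response_time": 80},
    {"request": "/basket", "response_time": 40},
]
INDEXER_2 = [
    {"request": "/checkout", "response_time": 100},
    {"request": "/basket", "response_time": 60},
]


def map_on_indexer(events):
    """Map: return a partial [sum, count] per request instead of raw events."""
    partial = defaultdict(lambda: [0, 0])
    for event in events:
        bucket = partial[event["request"]]
        bucket[0] += event["response_time"]
        bucket[1] += 1
    return dict(partial)


def reduce_on_search_head(partials):
    """Reduce: merge the partial results and finish the average."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for request, (total, count) in partial.items():
            merged[request][0] += total
            merged[request][1] += count
    return {request: total / count for request, (total, count) in merged.items()}


averages = reduce_on_search_head(
    [map_on_indexer(INDEXER_1), map_on_indexer(INDEXER_2)]
)
```

Averaging the per-indexer averages would give the wrong answer when indexers hold different numbers of events, which is why the partial results carry a sum and a count.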

  • So what about Knowledge Objects?

    Most Knowledge Objects are configurable from UI

    Common Types:

    Field Extractions - regex to extract fields

    Field Aliases - give a field an alternative name

    Lookups - enrich events from flat files and the KV Store

    Tags - Provides event grouping abstraction

    Eventtypes - Provides event categorisation

    Calculated Fields - Data manipulations
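What these knowledge objects do to an event at search time can be sketched in a few lines. This is only an analogy in Python, not Splunk configuration: the raw event, the alias name `status` and the lookup contents are assumptions, while the field names `customer_id`, `http_code`, `resp_time_milliseconds`, `location` and the `auth_successful` / `web` labels are taken from the queries in this deck.

```python
import re

# Hypothetical raw event.
RAW_EVENT = (
    "customer_id=42 uri=/checkout/auth/confirmation "
    "http_code=200 resp_time_milliseconds=250"
)

# Field extraction: a regex that pulls key=value pairs out of the raw text.
FIELD_EXTRACTION = re.compile(r"(\w+)=(\S+)")
# Field alias: expose http_code under a second name (hypothetical alias).
FIELD_ALIASES = {"http_code": "status"}
# Lookup: a flat table keyed on customer_id (hypothetical contents).
DB_LOCATIONS = {"42": "London"}


def apply_knowledge(raw):
    fields = dict(FIELD_EXTRACTION.findall(raw))
    for original, alias in FIELD_ALIASES.items():
        if original in fields:
            fields[alias] = fields[original]
    # Calculated field: derived from an extracted field.
    fields["response_time_seconds"] = float(fields["resp_time_milliseconds"]) / 1000
    # Lookup: enrich the event with a location for the customer.
    fields["location"] = DB_LOCATIONS.get(fields["customer_id"])
    # Event type and tag: categorise and group events that match a condition.
    if fields["uri"].endswith("/auth/confirmation") and fields.get("status") == "200":
        fields["eventtype"] = "auth_successful"
        fields["tag"] = "web"
    return fields


event = apply_knowledge(RAW_EVENT)
```

Once this knowledge is stored centrally, a user can search by `eventtype` and `tag` without knowing the regex, the unit conversion or the lookup behind them.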

  • Goal? Queries like:


    index= /checkout/auth/confirmation | rex | eval response_time_seconds = resp_time_milliseconds/(1000) | where http_code == 200 | lookup db_locations customer_id OUTPUT location | stats avg(response_time_seconds) as avg_response_time by location

    eventtype=auth_successful tag=web | stats avg(response_time_seconds) as average_response_time by location

  • Goal?

    Persisting Knowledge

  • Data Democratisation

    Sounds like the holy grail of data


  • Scenario

    Microservices Architecture

    Numerous Development Teams working under different service umbrellas

    Mix of legacy systems with modern services

    Dependence on vendor integrations

    Data can be sensitive

  • Typical Data Democratisation Issues

    Security - Some data is sensitive yet valuable, but we'd like an open access model

    Knowledge Fragmentation - It's our data, let's make sure everyone knows what it means.

    Adoption - People need to like it. It shouldn't get in the way.

    Chargeback - It's not my data, why should I pay for it?

  • Security - Delegated Access Model

    Splunk Search Apps can serve as knowledge containers

    Knowledge Object ownership can be scoped locally to the app or globally to the entire system.

    Splunk Indexes are data containers.

    Data Access granted by index

    Assign an app per product or service umbrella

    Assign Data Owner

  • Delegated Access Model

    Federated Group

    Splunk Role

    App Level Permissions

    Index Level Permissions
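The chain on this slide (federated group, Splunk role, then app and index permissions) can be sketched as two lookups. This is a minimal model of the idea, not Splunk's authorisation code, and every group, role, app and index name below is hypothetical.

```python
# Hypothetical mapping from single sign-on groups to Splunk roles.
GROUP_TO_ROLE = {
    "sso-payments-dev": "payments_user",
    "sso-search-dev": "search_user",
}

# Hypothetical roles: one app per service umbrella, data access granted by index.
ROLE_PERMISSIONS = {
    "payments_user": {"apps": {"payments"}, "indexes": {"payments_prod"}},
    "search_user": {
        "apps": {"search_service"},
        "indexes": {"search_prod", "search_test"},
    },
}


def effective_access(groups):
    """Resolve a user's groups to the union of app and index permissions."""
    apps, indexes = set(), set()
    for group in groups:
        role = GROUP_TO_ROLE.get(group)
        if role is None:
            continue  # this group has no Splunk role mapped to it
        apps |= ROLE_PERMISSIONS[role]["apps"]
        indexes |= ROLE_PERMISSIONS[role]["indexes"]
    return {"apps": apps, "indexes": indexes}


def can_search(groups, index):
    """Data access is decided purely by index membership."""
    return index in effective_access(groups)["indexes"]
```

A data owner then only has to decide which groups map to the role for their app and index; nothing is granted to individual users.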

  • Splunk Security Must Have!

    Splunk Authentication is Poor

    No Password Policy

    No Centralised management for multiple search nodes

    Single Sign On - Splunk supports:

    Ping Identity



    Azure AD


    Custom Auth

    Use an Entitlement Framework on top of single sign on groups

  • Combating Knowledge Fragmentation

    Semantic Logging:

    Logging for the sole purpose of analytics

    Rich datasets can be viewed in multiple dimensions

    Define Developer Guidelines:

    Ensure Correlation Identifiers are present in all events

    Precision Timestamps

    Incorporate Logging into SDLC

    Standardise Logging Formats

    Standardise Log content per service - e.g. BAM metrics
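As an illustration of these guidelines, here is a minimal sketch of semantic logging with Python's standard `logging` module: one JSON object per event, a microsecond-precision UTC timestamp, and a correlation identifier on every record. The field names (`service`, `correlation_id`, `metrics`) and the sample values are assumptions, not a format prescribed by the talk.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {
            # Precision timestamp: UTC, microsecond resolution.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="microseconds"),
            "level": record.levelname,
            "service": getattr(record, "service", None),
            # Correlation identifier, expected on every event.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Standardised per-service metrics, if the caller supplied any.
        event.update(getattr(record, "metrics", {}))
        return json.dumps(event)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "order confirmed",
    extra={
        "service": "checkout",
        "correlation_id": str(uuid.uuid4()),
        "metrics": {"response_time_ms": 250},
    },
)
```

Because every event is already structured, fields such as `correlation_id` and `response_time_ms` need no regex extraction at search time.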

  • Combating Knowledge Fragmentation

    Reality - Not all applications can log semantically, or at least not without significant refactoring.

    Splunk Solution - Data Models

  • Data Models

    Enable go go gadget - Schema on the fly

    Hierarchically structured search-time mapping of semantic knowledge.

    Accessed via Datasets tab in Splunk 6.5
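The phrase "hierarchically structured search-time mapping" can be pictured as a tree of datasets, each adding a constraint on top of its parent's. The sketch below is an analogy only: the `Dataset` class, the dataset names and the constraints are hypothetical and do not mirror Splunk's data model format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dataset:
    name: str
    constraint: Callable[[dict], bool]
    parent: Optional["Dataset"] = None

    def matches(self, event: dict) -> bool:
        """A child dataset inherits every constraint of its ancestors."""
        if self.parent is not None and not self.parent.matches(event):
            return False
        return self.constraint(event)


# Hypothetical hierarchy: Web, then Web > Checkout, then Web > Checkout > Failed.
web = Dataset("Web", lambda e: e.get("sourcetype") == "web.access")
checkout = Dataset("Checkout", lambda e: "/checkout" in e.get("uri", ""), parent=web)
failed = Dataset("Failed", lambda e: int(e.get("http_code", 0)) >= 500, parent=checkout)

events = [
    {"sourcetype": "web.access", "uri": "/checkout/auth", "http_code": "200"},
    {"sourcetype": "web.access", "uri": "/checkout/auth", "http_code": "503"},
    {"sourcetype": "db.audit", "uri": "/checkout/auth", "http_code": "503"},
]

failed_checkouts = [e for e in events if failed.matches(e)]
```

Nothing is rewritten at index time here: the schema is applied when the events are read, which is what makes it possible to put structure on logs that were never written semantically.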

  • Example: Splunk CIM

    Splunk Common Information Model (CIM)

    Collection of Data Models based on subject area

    Shared Semantic model

    Support consistent and normalised treatment of data

    Enables third party apps to be integrated with your data.

    Reference Tables:

  • Pivot UI

    Developed to enable the creation of analytics off structured data

    Charts - Line, Scatter, Column, Bar, Bubble, Pie

    Single Value Visualisations

  • Performance

    Data Models can be accelerated, which leads to:

    Decreased Search Optimisation Effort

    Decreased Dashboard Optimisation Effort

    Increased Storage Requirements

    Speed-up of up to 1000x

    Speed-up is dependent on the cardinality of the data

  • Notable Splunk Apps on CIM

    Splunk Enterprise Security

    Splunk PCI Compliance

    Insight Engines - Search Splunk using Natural Language

  • Adoption

    Most users complain about backlogs when onboarding data

    Automating the onboarding process isn't as easy as it sounds. Data Validation is key to deriving value.

    Universal Forwarder:

    Standardise Log Locations

    Standardise Time Stamps

    HTTP Event Collector:

    Send data directly from your application to Splunk

    Utilise Indexer Acknowledgement

    Notable implementations:

    Docker - Splunk Logging Driver
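Sending an event straight from an application to the HTTP Event Collector can be sketched with the standard library alone. The host name and token below are placeholders, and the helper name is hypothetical; the `/services/collector/event` path, the `Authorization: Splunk <token>` header and the `X-Splunk-Request-Channel` header used with indexer acknowledgement follow Splunk's HEC documentation. The request is built but deliberately not sent.

```python
import json
import urllib.request
import uuid

# Placeholders: replace with a real host and token. 8088 is the default HEC port.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"


def build_hec_request(event, sourcetype, channel):
    """Build (but do not send) a POST request carrying one event."""
    payload = json.dumps({"event": event, "sourcetype": sourcetype}).encode("utf-8")
    return urllib.request.Request(
        HEC_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Splunk {HEC_TOKEN}",
            "Content-Type": "application/json",
            # A channel identifier is needed when indexer acknowledgement is enabled.
            "X-Splunk-Request-Channel": channel,
        },
    )


channel = str(uuid.uuid4())
request = build_hec_request(
    {"message": "order confirmed", "response_time_ms": 250}, "_json", channel
)
# urllib.request.urlopen(request) would send it. With acknowledgement enabled,
# the response includes an ackId that can be polled at /services/collector/ack
# on the same channel to confirm the event was indexed.
```

Reusing one channel per sender keeps acknowledgement polling simple, since every `ackId` is scoped to the channel it was issued on.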

  • Newish Splunk Features

    Machine Learning Toolkit

    Comes with built-in assistants for supported algorithms

    Extend the available algorithms - Python scikit-learn


    Modular Visualisations

    New Custom Search Command Creation Capability

    TSIDX Reduction - Decrease Storage Costs

  • Crystal Ball

    Further integration into the Hadoop ecosystem