Data Onboarding: Ingestion without the Indigestion
TRANSCRIPT
Copyright © 2014 Splunk Inc.
Jeff Meyers, Sales Engineer
• Major components involved in data indexing
• What happens to data within Splunk
• What the data pipeline is & how to influence it
• Shaping data understanding via props.conf
• Configuring data inputs via inputs.conf
• What goes where
• Heavy Forwarders vs. Universal Forwarders
• How to get your data into Splunk (mostly correctly)
~60 minutes from now...
• Systematic way to bring new data sources into Splunk
• Make sure that new data is instantly usable & has maximum value for users
• Goes hand-in-hand with the User Onboarding process (sold separately)
What is the Data Onboarding Process?
Machine Data > Business Value
[Slide graphic: "Index Untapped Data: Any Source, Type, Volume" – sources include online services, web services, servers, security devices, GPS location, storage, desktops, networks, packaged and custom applications, messaging, telecoms, online shopping carts, web clickstreams, databases, energy meters, call detail records, smartphones and devices, and RFID; deployed on-premises, in private cloud, or in public cloud. "Ask Any Question" across Application Delivery; Security, Compliance and Fraud; IT Operations; Business Analytics; and Industrial Data and the Internet of Things.]
Flavors of Machine Data
[Slide graphic: example flavors – Order Processing, Twitter, Care IVR, Middleware Error.]
Getting Data Into Splunk
Agent and Agent-less Approach for Flexibility
• Agent-less data input:
  - Mounted file systems (\\hostname\mount)
  - syslog over TCP/UDP from syslog-compatible hosts and network devices
  - WMI from Windows hosts: event logs, performance counters, Active Directory
  - Custom apps and scripted API connections (perf, shell code)
• Splunk Forwarder:
  - Local file monitoring on Unix, Linux and Windows hosts: log files, config files, dumps and trace files
  - Windows inputs on Windows hosts and virtual hosts: event logs, performance counters, registry monitoring, Active Directory monitoring
  - Scripted inputs: shell scripts, custom parsers, batch loading
Splunk Data Ingest
[Diagram: several UFs and an HF forwarding to an indexer (IDX), which is searched by a search head (SH). The HF, IDX and SH run Splunk Enterprise (with optional configs); the UFs run the Splunk Universal Forwarder.]
Summary: when it comes to "core" Splunk, there are two distinct products: the Splunk Universal Forwarder and Splunk Enterprise. "Everything else" – Indexer, Search Head, License Server, Deployment Server, Cluster Master, Deployer, Heavy Forwarder, etc. – are all instances of Splunk Enterprise with varying configs.
Data Pipeline (what the what?)
• Input processors: Monitor, FIFO, UDP, TCP, Scripted
• No events yet – just a stream of bytes
• Break the data stream into 64KB blocks
• Annotate the stream with metadata keys (host, source, sourcetype, index, etc.)
• Can happen on UF, HF or indexer
Inputs – where it all starts
• Check character set
• Break lines
• Process headers
• Can happen on HF or indexer
Parsing
• Merge lines for multi-line events
• Identify events (finally!)
• Extract timestamps
• Exclude events based on timestamp (MAX_DAYS_AGO, ...)
• Can happen on HF or indexer
Aggregation/Merging
• Do regex replacement (field extraction, punctuation extraction, event routing, host/source/sourcetype overrides)
• Annotate events with metadata keys (host, source, sourcetype, ...)
• Can happen on HF or indexer
Typing
• Output processors: TCP, syslog, HTTP
• indexAndForward
• Sign blocks
• Calculate license volume and throughput metrics
• Index
• [Write to disk] / [forward elsewhere] / ...
• Can happen on HF or indexer
Indexing
Data Pipeline: UF & Indexer
Data Pipeline: HF & Indexer
Data Pipeline: UF, IF & Indexer
UF vs. HF
A UF emits chunks of data – raw bytes with no event boundaries:
209.160.24.63 - - [23/Feb/2016:18:22:16] "GET /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7AD...
209.160.24.63 - - [23/Feb/2016:18:22:17] "GET /product.screen?productId=BS-AG-G09&JSESSION...
209.160.24.63 - - [23/Feb/2016:18:22:19] "POST /category.screen?categoryId=STRATEGY&JSESSI...
209.160.24.63 - - [23/Feb/2016:18:22:20] "GET /product.screen?productId=FS-SG-G03&JSESSION...
209.160.24.63 - - [23/Feb/2016:18:22:20] "POST /cart.do?action=addtocart&itemId=EST-21&pro...
209.160.24.63 - - [23/Feb/2016:18:22:21] "POST /cart.do?action=purchase&itemId=EST-21&JSES...
209.160.24.63 - - [23/Feb/2016:18:22:22] "POST /cart/success.do?JSESSIONID=SD0SL6FF7ADFF49...
209.160.24.63 - - [23/Feb/2016:18:22:21] "GET /cart.do?action=remove&itemId=EST-11&product...
209.160.24.63 - - [23/Feb/2016:18:22:22] "GET /oldlink?itemId=EST-14&JSESSIONID=SD0SL6FF7A...
112.111.162.4 - - [23/Feb/2016:18:26:36] "GET /product.screen?productId=WC-SH-G04&JSESSION...

An HF emits parsed events, each annotated with metadata:
209.160.24.63 - - [23/Feb/2016:18:22:16] "GET /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7AD...
sourcetype=access_combined, _time=1456251739, index=foo, host=bar, ...
209.160.24.63 - - [23/Feb/2016:18:22:17] "GET /product.screen?productId=BS-AG-G09&SSN=xxxyyyzzz...
sourcetype=access_combined, _time=1456251739, index=foo, host=bar, ...
Splunk Data Ingest
[Diagram repeated: UFs and an HF forwarding to IDX, searched by SH. The HF and IDX are parsing components; the UFs are not.]
Note: the data is parsed at the first component that has a parsing engine – and not again. This affects where you put certain props.conf and transforms.conf files (i.e., sometimes they go on the forwarder).
Data Onboarding Process (bringing it together)
• Identify the specific sourcetype(s) – onboard each separately
• Check for a pre-existing app/TA on splunk.com – don't reinvent the wheel!
• Gather info:
  - Where does this data originate/reside? How will Splunk collect it?
  - Which users/groups will need access to this data? Access controls?
  - Determine the indexing volume and data retention requirements
  - Will this data need to drive existing dashboards (ES, PCI, etc.)?
  - Who is the SME for this data?
• Map it out:
  - Get a "big enough" sample of the event data
  - Identify and map out fields
  - Assign sourcetype and TA names according to CIM conventions
On-boarding Process
• Dev:
  - Create (or use) an app
  - Props / inputs definition
  - Sourcetype definition
  - Use the data import wizard
  - Import, tweak, repeat
  - Oneshot
  - [hook up monitor]
On-boarding Process
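The dev-phase "oneshot, tweak, repeat" loop above can be sketched from the CLI. This is an illustrative sketch only: the sample file path, sourcetype and index names are invented, and `clean eventdata` wipes the index, so point it at a scratch index on a dev box only.

```shell
# Load a single sample file once into a scratch index (path/names are examples)
$SPLUNK_HOME/bin/splunk add oneshot /tmp/sample_fubar.log \
    -sourcetype fubar:log -index scratch

# Inspect event breaking and timestamps in Search, adjust props.conf, then
# wipe the scratch index and repeat (clean requires splunkd to be stopped):
$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk clean eventdata -index scratch -f
$SPLUNK_HOME/bin/splunk start
```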
• Test:
  - Deploy app
  - Oneshot
  - Validate
  - Hook up monitor
  - Validate
• Prod:
  - Deploy app
  - Validate
  - Monitor
• General:
  - Use apps for configs
  - Use TAs / add-ons from Splunk if possible
  - Use dev, test, prod (dev can be a laptop, test can be ephemeral)
  - UF when possible; HF only if filtering / transforming is required in foreign land
  - Unique sourcetype per event stream
  - Don't send data through Search Heads
  - Don't send data directly to Indexers
Good Hygiene
• inputs.conf:
  - Be as specific as possible
  - Set sourcetype, if possible
  - Don't let Splunk auto-sourcetype (no ...too_small)
  - Specify index if possible
• props.conf:
  - Set: TIME_PREFIX, TIME_FORMAT, MAX_TIMESTAMP_LOOKAHEAD
  - Optimally: SHOULD_LINEMERGE = false, LINE_BREAKER, TRUNCATE
Good Hygiene
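The hygiene points above can be sketched as a pair of stanzas. This is an illustrative sketch only: the file path, sourcetype, index name, and timestamp format are invented and would need to match your actual data.

```ini
# inputs.conf -- monitor one specific file; set sourcetype and index explicitly
[monitor:///var/log/fubar.log]
sourcetype = fubar:log
index = app_fubar

# props.conf -- explicit timestamp recognition and line breaking
[fubar:log]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 23
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
```

With SHOULD_LINEMERGE disabled and an explicit LINE_BREAKER, Splunk skips the expensive line-merging pass, which is why that combination is called out as optimal above.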
Data Onboarding Process (details)
Pre-‐Board
• The Common Information Model (CIM) defines relationships in the underlying data, while leaving the raw machine data intact
• A naming convention for fields, eventtypes & tags
• More advanced reporting and correlation requires that the data be normalized, categorized, and parsed
• CIM-compliant data sources can drive CIM-based dashboards (ES, PCI, others)
Tangent: What is the CIM and why should I care?
• Identify the necessary configs (inputs, props and transforms) to properly handle:
  - timestamp extraction, timezone, event breaking, sourcetype/host/source assignments
• Do events contain sensitive data (e.g., PII, PAN)? Create masking transforms if necessary
• Package all index-time configs into the TA
Build the index-time configs
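For the masking bullet above, one common index-time approach is a SEDCMD in props.conf, applied at the first parsing component (HF or indexer). A minimal sketch, assuming a hypothetical sourcetype and a 16-digit PAN appearing in the raw event:

```ini
# props.conf -- mask all but the last four digits of a 16-digit PAN
# before the event is written to disk (sourcetype name is hypothetical)
[fubar:log]
SEDCMD-mask_pan = s/\d{12}(\d{4})/xxxxxxxxxxxx\1/g
```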
• Assign sourcetype according to event format; events with a similar format should have the same sourcetype
• When do I need a separate index?
  - When the data volume will be very large, or when it will be searched exclusively a lot
  - When access to the data needs to be controlled
  - When the data requires a specific data retention policy
• Resist the temptation to create lots of indexes
Tangent: Best & Worst Practices
• Always specify a sourcetype and index
• Be as specific as possible: use /var/log/fubar.log, not /var/log/
• Arrange your monitored filesystems to minimize unnecessary monitored logfiles
• Use a scratch index while testing new inputs
Best & Worst Practices – [monitor]
• Look out for inadvertent, runaway monitor clauses
• Don't monitor thousands of files unnecessarily – that's the NSA's job
• From the CLI: splunk show monitor
• From your browser: https://your_splunkd:8089/services/admin/inputstatus/TailingProcessor:FileStatus
Best & Worst Practices – [monitor]
• Find & fix index-time problems BEFORE polluting your index
• A try-it-before-you-fry-it interface for figuring out:
  - Event breaking
  - Timestamp recognition
  - Timezone assignment
• Provides the necessary props.conf parameter settings
Another Tangent! Your friend, the Data Previewer
Data Onboarding Process, continued
• Identify "interesting" events which should be tagged with an existing CIM tag (http://docs.splunk.com/Documentation/CIM/latest/User/Alerts)
• Get a list of all current tags:
  | rest splunk_server=local /services/admin/tags
  | rename tag_name AS tag, field_name_value AS definition, eai:acl.app AS app
  | eval definition_and_app=definition . " (" . app . ")"
  | stats values(definition_and_app) AS "definitions (app)" by tag
  | sort +tag
• Get a list of all eventtypes (with associated tags):
  | rest splunk_server=local /services/admin/eventtypes
  | rename title AS eventtype, search AS definition, eai:acl.app AS app
  | table eventtype definition app tags
  | sort +eventtype
• Examine the current list of CIM tags. For each "interesting" event, identify which tags should be applied. A particular event may have multiple tags.
• Are there new tags which should be created, beyond those in the current CIM tag library? If so, add them to the CIM library
Build the search-time configs: eventtypes & tags
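Declaring an eventtype and its tags in the TA might look like this (a sketch; the eventtype name, search, and field values are hypothetical, and the tag should come from the CIM tag library where possible):

```ini
# eventtypes.conf -- define an eventtype matching the "interesting" events
[fubar_authentication]
search = sourcetype=fubar:log (action=login OR action=logout)

# tags.conf -- apply a CIM tag to that eventtype
[eventtype=fubar_authentication]
authentication = enabled
```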
• Extract "interesting" fields
  - If already in your CIM library, name or alias appropriately
  - If not already in your CIM library, name according to CIM conventions
• Add lookups for missing/desirable fields
  - Lookups may be required to supply CIM-compliant fields/field values (for example, to convert 'sev=42' to 'severity=medium')
  - Make the values more readable for humans
• Put everything into the TA package
Build the search-time configs: extractions & lookups
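The 'sev=42' example above would typically be wired up as an automatic search-time lookup in the TA (a sketch; the lookup name, CSV contents, and sourcetype are illustrative):

```ini
# transforms.conf -- declare the lookup table
[fubar_severity_lookup]
filename = fubar_severity.csv

# props.conf -- apply it automatically at search time for this sourcetype
[fubar:log]
LOOKUP-severity = fubar_severity_lookup sev OUTPUT severity

# lookups/fubar_severity.csv would contain, e.g.:
#   sev,severity
#   10,low
#   42,medium
#   90,high
```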
• Create data models. What will be interesting for end users?
• Document! (Especially the fields, eventtypes & tags)
• Test:
  - Does this data drive relevant existing dashboards correctly?
  - Do the data models work properly / produce correct results?
  - Is the TA packaged properly?
  - Check with the originating user/group: is it OK?
Keep Going
• Determine additional Splunk infrastructure required; can existing infrastructure & license support this?
• Will new forwarders be required? If so, initiate CR process(es)
• Will firewall changes be required? If so, initiate CR process(es)
• Will new Splunk roles be required? Create & map to AD roles
• Will new app contexts be required? Create app(s) as necessary
• Will new users be added? Create the accounts
Get ready to deploy
• Deploy new search heads & indexers as needed
• Install new forwarders as needed
• Deploy new app & TA to search heads & indexers
• Deploy new TA to relevant forwarders
Bring it!
• All sources reporting?
• Event breaking, timestamp, timezone, host, source, sourcetype?
• Field extractions, aliases, lookups?
• Eventtypes, tags?
• Data model(s)?
• User access?
• Confirm with the original requesting user/group: looks OK?
Test & Validate
Done!
• Bring new data sources in correctly the first time
• Reduce the amount of "bad" data in your indexes – and the time spent dealing with it
• Make the new data immediately useful to ALL users – not just the ones who originally requested it
• Allow the data to drive all sorts of dashboards without extra modifications
Gee, this seems like a lot of work…
• What Splunk can monitor:
  http://docs.splunk.com/Documentation/Splunk/latest/Data/WhatSplunkcanmonitor
• How data moves through Splunk:
  http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Datapipeline
• Components of the data pipeline:
  http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Componentsofadistributedenvironment
• Common Information Model app:
  https://splunkbase.splunk.com/app/1621
• Common Information Model docs:
  http://docs.splunk.com/Documentation/CIM/latest/User/Overview
• Where do I put configs:
  http://wiki.splunk.com/Where_do_I_configure_my_Splunk_settings
Reference