2013 11-07 lsr-dublin_m_hausenblas_when solr is best

USE CASE DIAGNOSIS: WHEN IS SOLR REALLY THE BEST TOOL?

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies
Twitter: @mhausenblas

Agenda

•  Solr in the Big Data ecosystem
•  Polyglot Persistence
•  Common (Big Data) use cases
•  A checklist
•  When not to use Solr …

[Figure: Big Data ecosystem overview. Recoverable labels: storage; processing; Apache Pig; Apache ZooKeeper]

Polyglot Persistence

$ tail -f some.log
$ nc localhost 80
$ ls -al

tool box one-size-fits-all

awk 'BEGIN { FS = "," } /2013-[[:digit:]]+-[[:digit:]]+/ { print $3 }' sample.csv

Polyglot Persistence: Backdrop

•  Michael Stonebraker and Ugur Çetintemel (2005): "One Size Fits All": An Idea Whose Time Has Come and Gone
•  Martin Fowler (2011): Polyglot Persistence [1]
•  Eric Brewer (2012): Ricon Keynote, Advancing Distributed Systems [2]

[1] http://martinfowler.com/bliki/PolyglotPersistence.html
[2] https://speakerdeck.com/eric_brewer/ricon-2012-keynote

Polyglot Persistence: Key Points

•  Use different datastores for different needs
•  Can apply within an application or across the enterprise
•  Encapsulating data access yields loosely coupled components
•  Find the sweet spot between dev/ops complexity and flexibility
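The point about encapsulating data access can be illustrated with a minimal sketch. Everything here is hypothetical: `ProductRepository` and the two in-memory classes are stand-ins invented for illustration (one for a key-value or document store, one for a search engine such as Solr), not an API from the talk or from any library.

```python
class InMemoryKeyValueStore:
    """Stand-in for a key-value/document store: look-ups by id."""

    def __init__(self):
        self._docs = {}

    def save(self, product_id, doc):
        self._docs[product_id] = doc

    def get(self, product_id):
        return self._docs[product_id]


class InMemorySearchIndex:
    """Stand-in for a search engine: keyword look-ups over text."""

    def __init__(self):
        self._postings = {}

    def index(self, product_id, text):
        # Naive whitespace tokenisation; a real engine does proper analysis.
        for token in text.lower().split():
            self._postings.setdefault(token, set()).add(product_id)

    def search(self, keyword):
        return sorted(self._postings.get(keyword.lower(), set()))


class ProductRepository:
    """Callers see one interface; each need is served by a different store."""

    def __init__(self, records, index):
        self._records = records
        self._index = index

    def save(self, product_id, doc):
        self._records.save(product_id, doc)          # system of record
        self._index.index(product_id, doc["title"])  # search copy

    def get(self, product_id):
        return self._records.get(product_id)

    def search(self, keyword):
        return self._index.search(keyword)
```

Because callers depend only on `ProductRepository`, either backend could be swapped without touching application code, which is the loose coupling the slide refers to.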

Common (Big Data) use cases

Where are we coming from?

•  Keyword search
•  Spellcheck & autosuggest
•  Ranking
•  Faceted search
•  Spatial search
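As a rough illustration of keyword search combined with faceting, the sketch below builds a request URL for Solr's standard `/select` handler using the common `q`, `rows`, `wt`, `facet` and `facet.field` parameters. The host, the core name `products` and the field names `brand` and `color` are assumptions made up for the example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, keywords, facet_fields=(), rows=10):
    """Build a Solr /select URL for a keyword query with optional facets."""
    params = [("q", keywords), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "/select?" + urlencode(params)


# Hypothetical core and fields, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/products",
    "running shoes",
    facet_fields=["brand", "color"],
)
```

The function only constructs the URL; sending it and parsing the response are left out so the sketch stays independent of any particular HTTP client.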

Use case: search-based recommendation

Search-based recommendation (credit card issuer)

•  Given
   –  customer purchase history
   –  merchant designations
   –  merchant special offers
•  Goal
   –  Improve existing recommender system
   –  Throughput is important

Analyze with MapReduce

[Diagram, recoverable labels: complete history; co-occurrence analysis (Mahout); item meta-data; Solr indexers (indexing); index shards]
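To make the co-occurrence step concrete, here is a small single-machine sketch. It uses raw pair counts, which is a deliberate simplification and not what the Mahout job in the diagram does; a production job would run distributed and would typically keep only statistically interesting pairs. The function name, the `top_n` cut-off and the sample merchants are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_indicators(histories, top_n=2):
    """Map each item to the items it most often shares a history with.

    histories: dict of user id -> list of item ids.
    Returns: dict of item id -> list of up to top_n indicator item ids.
    """
    counts = defaultdict(Counter)
    for items in histories.values():
        # Count every ordered pair of distinct items seen for one user.
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    indicators = {}
    for item, others in counts.items():
        # Highest count first; ties broken alphabetically for determinism.
        ranked = sorted(others.items(), key=lambda kv: (-kv[1], kv[0]))
        indicators[item] = [other for other, _ in ranked[:top_n]]
    return indicators


histories = {
    "u1": ["cafe", "bookstore", "cinema"],
    "u2": ["cafe", "bookstore"],
    "u3": ["cafe", "cinema", "gym"],
}
indicators = cooccurrence_indicators(histories)
```

Each item's indicator list could then be stored as an extra field next to the item meta-data when the documents are indexed.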

Deploy with search system

[Diagram, recoverable labels: user history; web tier; Solr search; item meta-data; index shards]
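At query time, the user's history becomes the search query. The sketch below turns a list of item ids into a query string in Lucene/Solr query syntax against a field assumed to hold each item's indicators. The field name `indicators`, the `max_terms` cap and the match-all fallback are assumptions for illustration, not details from the talk.

```python
def history_to_query(user_history, field="indicators", max_terms=20):
    """Build a query string that matches items whose indicator field
    contains any item from the user's history."""
    # De-duplicate while keeping order, then keep the last max_terms entries.
    terms = list(dict.fromkeys(user_history))[-max_terms:]
    if not terms:
        # No history: fall back to the match-all query.
        return "*:*"
    quoted = []
    for term in terms:
        # Escape backslashes and double quotes inside a quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return field + ":(" + " OR ".join(quoted) + ")"


query = history_to_query(["cafe", "cinema", "cafe"])
```

The search engine's ranking then does the work of a recommender: items whose indicators overlap most with the history score highest.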

Use case: log analysis

Log analysis

•  Given
   –  Receive 200,000+ log lines per second
•  Goal
   –  Want to do multi-field search
   –  Want log lines to be searchable within 30 seconds of arrival
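A quick back-of-the-envelope calculation shows what these two figures imply. It uses only the numbers from the slide (treating "200,000+" as exactly 200,000) and assumes a constant ingest rate, which real log traffic will not have.

```python
LINES_PER_SECOND = 200_000   # from the slide, taken as a lower bound
MAX_DELAY_SECONDS = 30       # a line must be searchable within this time
SECONDS_PER_DAY = 24 * 60 * 60

# Lines that may have arrived but are not yet searchable at any moment.
lines_in_flight = LINES_PER_SECOND * MAX_DELAY_SECONDS

# Lines to be indexed per day at a constant rate.
lines_per_day = LINES_PER_SECOND * SECONDS_PER_DAY

print(f"{lines_in_flight:,} lines in the 30-second window")
print(f"{lines_per_day:,} lines per day")
```

So roughly 6 million lines can be waiting to become searchable at any time, and more than 17 billion lines arrive per day.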

Data Ingestion and Indexing

[Diagram, recoverable labels: incoming data; Kafka; Solr indexers (text analysis); live index shard; older index shards; time-sharded Solr indexes; raw documents]
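The idea of time-sharded indexes can be sketched as a routing function that maps each log line's timestamp to a shard. The hourly granularity, the `logs_` naming scheme and the helper names are assumptions chosen for the example; the talk does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timezone


def shard_name(timestamp, prefix="logs"):
    """Name of the hourly shard a timestamp belongs to (UTC)."""
    utc = timestamp.astimezone(timezone.utc)
    return f"{prefix}_{utc:%Y%m%d_%H}"


def group_by_shard(events):
    """Group (timestamp, line) pairs into per-shard batches for indexing."""
    batches = defaultdict(list)
    for timestamp, line in events:
        batches[shard_name(timestamp)].append(line)
    return dict(batches)


events = [
    (datetime(2013, 11, 7, 14, 59, 58, tzinfo=timezone.utc), "GET /a 200"),
    (datetime(2013, 11, 7, 15, 0, 1, tzinfo=timezone.utc), "GET /b 404"),
    (datetime(2013, 11, 7, 15, 0, 2, tzinfo=timezone.utc), "GET /c 200"),
]
batches = group_by_shard(events)
```

With this layout only the shard for the current hour receives writes (the "live" shard); older shards are no longer written to.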

Real-time Search

[Diagram, recoverable labels: query; web tier; Solr search; live index shard; older index shards; raw documents]
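On the query side, time sharding means a search only needs to touch the shards that overlap the requested time range. The sketch below assumes hourly shards named `logs_YYYYMMDD_HH`, a naming scheme invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def shards_for_range(start, end, prefix="logs"):
    """List the hourly shard names overlapping [start, end], oldest first."""
    if end < start:
        raise ValueError("end must not be before start")
    end_utc = end.astimezone(timezone.utc)
    # Round the start down to the beginning of its hour.
    cursor = start.astimezone(timezone.utc).replace(
        minute=0, second=0, microsecond=0
    )
    names = []
    while cursor <= end_utc:
        names.append(f"{prefix}_{cursor:%Y%m%d_%H}")
        cursor += timedelta(hours=1)
    return names


shards = shards_for_range(
    datetime(2013, 11, 7, 22, 40, tzinfo=timezone.utc),
    datetime(2013, 11, 8, 0, 10, tzinfo=timezone.utc),
)
```

The resulting names would still have to be mapped to whatever addressing the deployment uses, for example the shard addresses passed in Solr's `shards` request parameter for distributed search.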

A checklist

Questions you may want to ask …

•  What is the volume of your data*? (A few GB? Up to PB?)
•  What are your query characteristics?
   –  full scans
   –  look-ups
   –  multiple passes over large parts
   –  continuous queries
•  What’s (more) important: throughput or latency?

*) Note: as long as Moore's law still holds, these figures obviously change on a yearly if not monthly basis.

Key qualifiers

•  You want an exploratory interface rather than aggregates in a dashboard
•  Data are sparse symbol sets like words or recommendation indicators
•  Small-ish return sets are OK, especially if facets are good enough
•  Near-real-time is good enough

When not to use Solr …

Red Flags

•  You need strong consistency?
•  JOINS, anyone?
•  Want (complex) transactions?
•  OLTP, streaming (but: near-real-time)
•  Graphs?
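The red flags can be kept as a small, explicit checklist in code. The sketch below is purely illustrative: the requirement keys and the wording of the messages are invented here, and the function only reports which red flags apply rather than deciding for or against Solr.

```python
# Requirement key -> red flag from the slide (keys are hypothetical).
RED_FLAGS = {
    "strong_consistency": "needs strong consistency",
    "joins": "relies on joins",
    "transactions": "needs (complex) transactions",
    "oltp_or_streaming": "OLTP or true streaming rather than near-real-time",
    "graphs": "graph-shaped workload",
}


def red_flags(requirements):
    """Return the red flags raised by a set of requirement keys."""
    unknown = set(requirements) - set(RED_FLAGS)
    if unknown:
        raise KeyError(f"unknown requirement keys: {sorted(unknown)}")
    # Keep the order of the slide, regardless of input order.
    return [text for key, text in RED_FLAGS.items() if key in requirements]


flags = red_flags({"graphs", "joins"})
```

An empty result does not mean Solr is the right tool, only that none of the listed red flags apply; the volume, query and latency questions from the checklist still need answering.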

remember: one size does not fit all. Take a tool belt approach!

MapR HQ San Jose, US

MapR UK

MapR SE & Benelux

MapR DACH

MapR Nordics

MapR Japan

MapR Hyderabad

MapR Korea

Let’s stay in touch …

•  Twitter: @mhausenblas, @MapR
•  We’re hiring!
