2013 11-07 lsr-dublin_m_hausenblas_when solr is best
DESCRIPTION
Presented by Michael Hausenblas, Chief Data Engineer, MapR Technologies.
This session will present an overview of common big data use cases in the form of a set of questions that can be used to determine what kind of problem you really have. From the answers to these questions, you can quickly find out which technologies are likely to be most productive, useful and easy to apply. This analysis will also allow you to discern cases where Solr is not a good fit, but where augmentation with other big data systems like HBase leads to feasible architectures. Conversely, you will see cases where Solr can be the hero by filling the gaps where big data systems alone are destined to fail.
TRANSCRIPT
USE CASE DIAGNOSIS: WHEN IS SOLR REALLY THE BEST TOOL? Michael Hausenblas Chief Data Engineer EMEA, MapR Technologies Twitter: @mhausenblas
• Solr in the Big Data ecosystem
• Polyglot Persistence
• Common (Big Data) use cases
• A checklist
• When not to use Solr …
Agenda
storage
processing
Apache Pig
Apache Zookeeper
Polyglot Persistence
$ tail -f some.log
$ nc localhost 80
$ ls -al
tool box one-size-fits-all
awk 'BEGIN { FS = "," } /2013-[[:digit:]]+-[[:digit:]]+/ { print $3 }' sample.csv
• Michael Stonebraker and Ugur Çetintemel (2005): "One Size Fits All": An Idea Whose Time Has Come and Gone
• Martin Fowler (2011): Polyglot Persistence [1]
• Eric Brewer (2012): Ricon Keynote, Advancing Distributed Systems [2]
[1] http://martinfowler.com/bliki/PolyglotPersistence.html
[2] https://speakerdeck.com/eric_brewer/ricon-2012-keynote
Polyglot Persistence—Backdrop
• Use different datastores for different needs
• Can apply within an application or cross-enterprise
• Encapsulating data access yields loosely coupled components
• Find sweet spot between dev/op complexity and flexibility
Polyglot Persistence—Key Points
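The point about encapsulating data access can be illustrated with a minimal sketch. Everything named here (`ProductStore`, the two in-memory stand-ins, `lookup`) is hypothetical and not taken from the talk; the in-memory classes only imitate a key-value store and a text index so the example stays runnable.

```python
from typing import Dict, List, Protocol


class ProductStore(Protocol):
    """Hypothetical data-access interface; callers depend only on this."""

    def find(self, keyword: str) -> List[str]: ...


class InMemoryKeyValueStore:
    """Stand-in for a key-value datastore: exact look-ups by key."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = rows

    def find(self, keyword: str) -> List[str]:
        return [self._rows[keyword]] if keyword in self._rows else []


class InMemoryTextIndex:
    """Stand-in for a search engine: matches whole words inside documents."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs

    def find(self, keyword: str) -> List[str]:
        needle = keyword.lower()
        return [d for d in self._docs if needle in d.lower().split()]


def lookup(store: ProductStore, keyword: str) -> List[str]:
    # The caller never learns which datastore answered the request,
    # so one backend can be swapped for another without touching it.
    return store.find(keyword)
```

Because `lookup` only sees the interface, a component could move from one datastore to another as needs change, which is one way to read the "loosely coupled components" bullet.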
Common (Big Data) use cases
• Keyword search
• Spellcheck & autosuggest
• Ranking
• Faceted search
• Spatial search
Where are we coming from?
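As a rough illustration of how several of these classic features combine in one request, the sketch below builds a Solr select URL with a keyword query, one facet field, and a spatial filter. The core name `products`, the field names `category` and `location`, and the helper `build_select_url` are assumptions made for the example; the request parameters (`q`, `facet`, `facet.field`, `fq` with `{!geofilt}`, `wt`) are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, keyword, facet_field, lat, lon, radius_km):
    """Build a Solr /select URL (hypothetical core and field names)."""
    params = [
        ("q", keyword),                      # keyword search
        ("facet", "true"),                   # enable faceting
        ("facet.field", facet_field),        # facet counts for this field
        # Spatial filter: keep documents within radius_km of the point.
        ("fq", f"{{!geofilt sfield=location pt={lat},{lon} d={radius_km}}}"),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr/products", "shoes", "category", 53.35, -6.26, 10
)
```

The sketch only constructs the URL; sending it requires a running Solr instance whose schema actually defines the assumed fields.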
Use case: search-based
recommendation
• Given
  – customer purchase history
  – merchant designations
  – merchant special offers
• Goal
  – Improve existing recommender system
  – Throughput important
Search-based recommendation (credit card issuer)
[Diagram: complete history; co-occurrence (Mahout); item metadata; Solr indexers; index shards]
Analyze with MapReduce
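The offline analysis step can be sketched in miniature. This is not the Mahout job from the slide: plain pair counts stand in for whatever scoring that job applies, the data is held in memory rather than processed with MapReduce, and the function name `cooccurrence_indicators` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_indicators(histories, top_n=2):
    """Map each item to the items it most often shares a history with.

    histories: dict of user id -> list of item ids (e.g. merchants).
    Returns: dict of item id -> list of up to top_n co-occurring item ids.
    """
    pair_counts = defaultdict(Counter)
    for items in histories.values():
        # Count each unordered pair once per user history.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1

    indicators = {}
    for item, counts in pair_counts.items():
        # Highest count first; ties broken alphabetically for stable output.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        indicators[item] = [other for other, _ in ranked[:top_n]]
    return indicators
```

The resulting indicator lists are the kind of data that could be indexed next to the item metadata, so that the search engine can later match them against a user's history.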
[Diagram: user history; Web tier; Solr search; item metadata; Solr indexers; index shards]
Deploy with search system
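One reading of the deployment step is that the user's history becomes the search query at request time. The sketch below builds such a query string; the field name `indicators`, the term limit, and the function `recommendation_query` are assumptions for illustration, not details confirmed by the slides.

```python
def _quote(term):
    # Escape backslashes and double quotes so the term is a valid phrase.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def recommendation_query(user_history, field="indicators", max_terms=20):
    """Turn a user's item history into an OR query on an indicator field."""
    # Drop duplicates while keeping order, then keep the last max_terms items.
    recent = list(dict.fromkeys(user_history))[-max_terms:]
    if not recent:
        raise ValueError("user history is empty; nothing to query with")
    terms = " OR ".join(_quote(item) for item in recent)
    return f"{field}:({terms})"
```

Items whose indicator lists overlap most with the history would then score highest, which lets ordinary search ranking do the recommending.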
Use case: log analysis
• Given
  – Receive 200,000+ log lines per second
• Goal
  – Want to do multi-field search
  – Want log lines to be searchable within 30 seconds of arrival
Log analysis
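A quick back-of-the-envelope calculation shows what these figures imply. It uses only the two numbers from the slide, treating 200,000 lines per second as a lower bound and 30 seconds as the maximum delay.

```python
LINES_PER_SECOND = 200_000   # ingest rate from the slide (a lower bound)
MAX_DELAY_SECONDS = 30       # maximum delay before a line is searchable

# Lines arriving per day at the stated rate.
lines_per_day = LINES_PER_SECOND * 24 * 60 * 60

# Lines that may be waiting to become searchable at any moment.
lines_in_flight = LINES_PER_SECOND * MAX_DELAY_SECONDS

print(f"{lines_per_day:,} lines per day")
print(f"{lines_in_flight:,} lines within the delay window")
```

That is more than 17 billion lines per day, with up to 6 million lines not yet searchable at any moment.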
Data Ingestion and Indexing
[Diagram: incoming data; Kafka; Solr indexers; text analysis; raw documents; live index shard; older index shards; time-sharded Solr indexes]
Real-time Search
[Diagram: query; Web tier; Solr search; Solr indexers; raw documents; live index shard; older index shards]
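With time-sharded indexes, a query only needs to touch the shards that cover its time range. The sketch below assumes hourly shards and a naming scheme of `logs_YYYYMMDDHH`; both are hypothetical choices made for the example, as are the function names.

```python
from datetime import datetime, timedelta, timezone


def shard_name(ts, prefix="logs"):
    """Name of the (assumed) hourly shard holding a timezone-aware timestamp."""
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y%m%d%H}"


def shards_for_range(start, end, prefix="logs"):
    """List the hourly shards a query over [start, end] has to search."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Round the start down to the beginning of its hour.
    cursor = start.astimezone(timezone.utc).replace(
        minute=0, second=0, microsecond=0
    )
    names = []
    while cursor <= end:
        names.append(shard_name(cursor, prefix))
        cursor += timedelta(hours=1)
    return names
```

In this scheme the most recent shard would play the role of the live index shard that receives new documents, while the earlier ones are the older, no longer changing shards.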
A checklist
• What is the volume of your data* (few GB? up to PB?)
• What are your query characteristics?
  – full scans
  – look-ups
  – multiple passes over large parts
  – continuous queries
• What’s (more) important: throughput or latency?
Questions you may want to ask …
*) Note: as long as Moore's law still holds, these figures obviously change on a yearly if not monthly basis.
• Want exploratory interface rather than aggregates in a dashboard
• Data are sparse symbol sets like words or recommendation indicators
• Small-ish return sets are OK, especially if facets are good enough
• Near-real-time is good enough
Key qualifiers
When not to use Solr …
• You need strong consistency?
• JOINS, anyone?
• Want (complex) transactions?
• OLTP, streaming (but: near-real-time)
• Graphs?
Red Flags
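The red flags can be turned into a small self-check. This is only a sketch of the checklist idea: the `Requirements` fields and the `red_flags` function are hypothetical names, and the flags simply mirror the bullets on the slide.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical description of what a project needs from its datastore."""
    strong_consistency: bool = False
    joins: bool = False
    complex_transactions: bool = False
    oltp_or_streaming: bool = False   # near-real-time alone is not a red flag
    graph_queries: bool = False


def red_flags(req):
    """Return the slide's red flags that apply to the given requirements."""
    labels = {
        "strong_consistency": "strong consistency",
        "joins": "joins",
        "complex_transactions": "complex transactions",
        "oltp_or_streaming": "OLTP or streaming",
        "graph_queries": "graph queries",
    }
    return [label for name, label in labels.items() if getattr(req, name)]
```

An empty result does not prove that Solr is the right tool; it only means that none of the listed warning signs apply.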
Remember: one size does not fit all; use a tool belt approach!
MapR HQ San Jose, US
MapR UK
MapR SE & Benelux
MapR DACH
MapR Nordics
MapR Japan
MapR Hyderabad
MapR Korea
• Twitter: @mhausenblas @MapR
• We’re hiring!
Let’s stay in touch …