2013 11-07 lsr-dublin_m_hausenblas_when solr is best


Upload: lucenerevolution

Post on 11-May-2015


Category:

Technology



DESCRIPTION

Presented by Michael Hausenblas, Chief Data Engineer, MapR Technologies. This session will present an overview of common big data use cases in the form of a set of questions that can be used to determine what kind of problem you really have. From the answers to these questions, you can quickly find out which technologies are likely to be most productive, useful and easy to apply. This analysis will also allow you to discern cases where Solr is not a good fit, but where augmentation with other big data systems like HBase leads to feasible architectures. Conversely, you will see cases where Solr can be the hero by filling gaps where big data systems alone are destined to fail.

TRANSCRIPT

Page 1: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best
Page 2: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

USE CASE DIAGNOSIS: WHEN IS SOLR REALLY THE BEST TOOL?

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies
Twitter: @mhausenblas

Page 3: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Agenda

•  Solr in the Big Data ecosystem
•  Polyglot Persistence
•  Common (Big Data) use cases
•  A checklist
•  When not to use Solr …

Page 4: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

[Diagram labels: storage; processing; Apache Pig; Apache Zookeeper]

Page 5: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Polyglot Persistence

Page 6: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

$ tail -f some.log
$ nc localhost 80

$ ls -al

tool box one-size-fits-all

awk 'BEGIN { FS = "," } /2013-[[:digit:]]+-[[:digit:]]+/ { print $3 }' sample.csv

Page 7: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Polyglot Persistence: Backdrop

•  Michael Stonebraker and Ugur Çetintemel (2005): "One Size Fits All": An Idea Whose Time Has Come and Gone
•  Martin Fowler (2011): Polyglot Persistence (1)
•  Eric Brewer (2012): Ricon Keynote, Advancing Distributed Systems (2)

1) http://martinfowler.com/bliki/PolyglotPersistence.html
2) https://speakerdeck.com/eric_brewer/ricon-2012-keynote

Page 8: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Polyglot Persistence: Key Points

•  Use different datastores for different needs
•  Can apply within an application or cross-enterprise
•  Encapsulating data access yields loosely coupled components
•  Find sweet spot between dev/op complexity and flexibility
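The point about encapsulating data access can be sketched as follows. This is a minimal illustration, not something shown in the talk: `KeyValueStore` and `SearchIndex` are hypothetical in-memory stand-ins for a lookup-oriented store (such as HBase) and a search engine (such as Solr), and `ProductCatalog` is an assumed facade that hides which store answers which request.

```python
from collections import defaultdict


class KeyValueStore:
    """Hypothetical in-memory stand-in for a lookup-oriented store."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows.get(key)


class SearchIndex:
    """Hypothetical in-memory stand-in for a keyword search engine."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, text):
        # Naive whitespace tokenization; a real engine does far more analysis.
        for token in text.lower().split():
            self._postings[token].add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))


class ProductCatalog:
    """Facade: callers never see which datastore serves which need."""

    def __init__(self, records, index):
        self._records = records
        self._index = index

    def add(self, product_id, name, price):
        self._records.put(product_id, {"name": name, "price": price})
        self._index.index(product_id, name)

    def find(self, term):
        # Search finds the ids, the key-value store supplies the records.
        return [self._records.get(pid) for pid in self._index.search(term)]


catalog = ProductCatalog(KeyValueStore(), SearchIndex())
catalog.add("p1", "red bicycle", 120)
catalog.add("p2", "red helmet", 35)
print(catalog.find("red"))
```

Because the application only talks to `ProductCatalog`, either stand-in could be swapped for a real datastore without touching the calling code, which is the loose coupling the slide refers to.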

Page 9: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Common (Big Data) use cases

Page 10: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Where are we coming from?

•  Keyword search
•  Spellcheck & autosuggest
•  Ranking
•  Faceted search
•  Spatial search

Page 11: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Use case: search-based recommendation

Page 12: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Search-based recommendation (credit card issuer)

•  Given
   –  customer purchase history
   –  merchant designations
   –  merchant special offers

•  Goal
   –  Improve existing recommender system
   –  Throughput important

Page 13: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Analyze with MapReduce

[Diagram labels: complete history; co-occurrence (Mahout); item meta-data; Solr indexing; Solr indexers; index shards]
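The co-occurrence step in the diagram can be illustrated with a small in-memory sketch. It uses plain pair counts rather than Mahout or MapReduce, and the sample histories, the `top_k` cut-off and the `indicators` field name are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_indicators(histories, top_k=2):
    """For each item, list the items that most often share a history with it."""
    counts = defaultdict(Counter)
    for items in histories:
        # Count every ordered pair of distinct items in one customer's history.
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    indicators = {}
    for item, others in counts.items():
        # Sort by descending count, then by name so ties are deterministic.
        ranked = sorted(others.items(), key=lambda kv: (-kv[1], kv[0]))
        indicators[item] = [other for other, _ in ranked[:top_k]]
    return indicators


histories = [
    ["coffee", "bakery", "books"],
    ["coffee", "bakery"],
    ["books", "cinema"],
]
indicators = cooccurrence_indicators(histories)

# Each item becomes one document; the indicator list is what gets indexed
# next to the item meta-data ("indicators" is a hypothetical field name).
documents = [{"id": item, "indicators": inds} for item, inds in sorted(indicators.items())]
for doc in documents:
    print(doc)
```

A production pipeline would typically replace the raw counts with a statistical test so that very popular items do not dominate every indicator list.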

Page 14: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Deploy with search system

[Diagram labels: user history; web tier; Solr search; item meta-data; Solr indexers; index shards]
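At query time, the user's recent history becomes the search query. The sketch below only builds the request parameters and does not contact a server; the `indicators` field, the `items` collection name and the local URL are assumptions, while `q` and `rows` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def recommendation_query(recent_items, rows=10):
    """Build Solr request parameters that match items whose indicators overlap the history."""
    if not recent_items:
        raise ValueError("need at least one item from the user's history")
    quoted = []
    for item in recent_items:
        # Escape backslashes and double quotes so each item is a safe phrase.
        escaped = item.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"{}"'.format(escaped))
    return {"q": "indicators:({})".format(" OR ".join(quoted)), "rows": rows}


params = recommendation_query(["coffee", "bakery"])
# Hypothetical endpoint: a local Solr with a collection named "items".
url = "http://localhost:8983/solr/items/select?" + urlencode(params)
print(params["q"])
print(url)
```

Ranking then does the recommending: items whose indicator field matches more of the user's history score higher, so the search engine's relevance ordering doubles as the recommendation order.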

Page 15: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Use case: log analysis

Page 16: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Log analysis

•  Given
   –  Receive 200,000+ log lines per second

•  Goal
   –  Want to do multi-field search
   –  Want log lines to be searchable less than 30 seconds after they arrive

Page 17: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Data Ingestion and Indexing

[Diagram labels: incoming data; Kafka; real-time; Solr indexer; live index shard; text analysis; Solr indexers; time-sharded Solr indexes; older index shards; raw documents]
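The idea of time-sharded indexes can be sketched as a routing function that assigns each incoming log line to a shard by its timestamp. The hourly shard span and the `logs_YYYYMMDD_HH` naming scheme are assumptions for illustration; the slides do not state how the shards were sized or named.

```python
from datetime import datetime, timezone


def shard_for(timestamp):
    """Map a log line's timestamp to the name of its (assumed hourly) index shard."""
    utc = timestamp.astimezone(timezone.utc)
    return utc.strftime("logs_%Y%m%d_%H")


def route(lines):
    """Group (timestamp, message) pairs into per-shard batches for indexing."""
    batches = {}
    for timestamp, message in lines:
        batches.setdefault(shard_for(timestamp), []).append(message)
    return batches


lines = [
    (datetime(2013, 11, 7, 9, 59, 58, tzinfo=timezone.utc), "GET /a 200"),
    (datetime(2013, 11, 7, 10, 0, 1, tzinfo=timezone.utc), "GET /b 404"),
    (datetime(2013, 11, 7, 10, 0, 2, tzinfo=timezone.utc), "GET /c 200"),
]
batches = route(lines)
for shard, messages in sorted(batches.items()):
    print(shard, messages)
```

With this layout only the newest shard receives writes, so it alone has to meet the 30-second freshness target, while older shards become read-only.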

Page 18: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Search

[Diagram labels: query; web tier; Solr search; live index shard; older index shards; raw documents]
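On the search side, a query that carries a time range only needs to touch the shards overlapping that range. The following sketch assumes hourly shards named `logs_YYYYMMDD_HH`, which is a hypothetical scheme and not one given in the slides.

```python
from datetime import datetime, timedelta, timezone


def shards_for_range(start, end):
    """Return the names of all (assumed hourly) shards overlapping [start, end]."""
    if end < start:
        raise ValueError("end must not precede start")
    # Round the start down to the beginning of its hour.
    cursor = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end_utc = end.astimezone(timezone.utc)
    names = []
    while cursor <= end_utc:
        names.append(cursor.strftime("logs_%Y%m%d_%H"))
        cursor += timedelta(hours=1)
    return names


start = datetime(2013, 11, 7, 9, 40, tzinfo=timezone.utc)
end = datetime(2013, 11, 7, 11, 5, tzinfo=timezone.utc)
selected = shards_for_range(start, end)
print(selected)
```

The web tier would send the query to the selected shards and merge the results; the newest name in the list corresponds to the live index shard, the rest to older ones.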

Page 19: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

A checklist

Page 20: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Questions you may want to ask …

•  What is the volume of your data* (few GB? up to PB?)

•  What are your query characteristics?
   –  full scans
   –  look-ups
   –  multiple passes over large parts
   –  continuous queries

•  What’s (more) important: throughput or latency?

*) Note: as long as Moore's law still holds, these figures obviously change on a yearly if not monthly basis.

Page 21: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Key qualifiers

•  Want exploratory interface rather than aggregates in a dashboard
•  Data are sparse symbol sets like words or recommendation indicators
•  Small-ish return sets are OK, especially if facets are good enough
•  Near-real-time is good enough

Page 22: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

When not to use Solr …

Page 23: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Red Flags

•  You need strong consistency?
•  JOINS, anyone?
•  Want (complex) transactions?
•  OLTP, streaming (but: near-real-time)
•  Graphs?

Remember: one size does not fit all. Take a tool belt approach!
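The key qualifiers and red flags from the slides can be written down as a simple checklist. This is only a sketch: the `UseCase` fields and the `diagnose` function are hypothetical names, and the code merely reports which items apply. How to weigh them against each other is a judgment the slides leave to the reader.

```python
from dataclasses import dataclass

# Labels follow the "Red Flags" and "Key qualifiers" slides.
RED_FLAGS = {
    "needs_strong_consistency": "strong consistency",
    "needs_joins": "joins",
    "needs_transactions": "(complex) transactions",
    "is_oltp_or_streaming": "OLTP or streaming",
    "is_graph_workload": "graphs",
}
QUALIFIERS = {
    "exploratory_interface": "exploratory interface rather than dashboard aggregates",
    "sparse_symbol_data": "sparse symbol sets (words, recommendation indicators)",
    "small_result_sets_ok": "small-ish return sets are OK",
    "near_real_time_ok": "near-real-time is good enough",
}


@dataclass
class UseCase:
    """Hypothetical description of a workload; every answer defaults to 'no'."""
    needs_strong_consistency: bool = False
    needs_joins: bool = False
    needs_transactions: bool = False
    is_oltp_or_streaming: bool = False
    is_graph_workload: bool = False
    exploratory_interface: bool = False
    sparse_symbol_data: bool = False
    small_result_sets_ok: bool = False
    near_real_time_ok: bool = False


def diagnose(use_case):
    """Report which red flags and which qualifiers apply to a use case."""
    return {
        "red_flags": [label for name, label in RED_FLAGS.items() if getattr(use_case, name)],
        "qualifiers": [label for name, label in QUALIFIERS.items() if getattr(use_case, name)],
    }


log_search = UseCase(exploratory_interface=True, sparse_symbol_data=True,
                     small_result_sets_ok=True, near_real_time_ok=True)
order_entry = UseCase(needs_transactions=True, needs_joins=True)
print(diagnose(log_search))
print(diagnose(order_entry))
```

A use case with qualifiers and no red flags, like the log search example, points toward Solr; one with several red flags suggests pairing Solr with, or replacing it by, another tool from the belt.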

Page 24: 2013 11-07 lsr-dublin_m_hausenblas_when solr is best

Let’s stay in touch …

MapR HQ San Jose, US
MapR UK
MapR SE & Benelux
MapR DACH
MapR Nordics
MapR Japan
MapR Hyderabad
MapR Korea

•  Twitter: @mhausenblas @MapR
•  We’re hiring!