Big Search 4 Big Data War Stories
DESCRIPTION
Some lessons that we learned in rolling out a search engine across a very big set of data.
TRANSCRIPT
![Page 1: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/1.jpg)
Big Search 4 Big Data
Enterprise Search Summit Europe 2013, London
Eric Pugh | [email protected] | @dep4b
1
![Page 2: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/2.jpg)
Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
2
![Page 3: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/3.jpg)
CO-AUTHOR
Working on 4.0!
3
![Page 4: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/4.jpg)
Telling some war stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
4
![Page 5: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/5.jpg)
What is Big Search?
5
![Page 6: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/6.jpg)
Background for Client X’s Project
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• 100’s of millions of documents to search
• Aggressive timeline.
• All the data must be searched per query.
• Limited selection of tools available.
• On Solr 3.x line
6
![Page 7: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/7.jpg)
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
7
![Page 8: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/8.jpg)
Boy meets Girl Story
Metadata
Content Files
Ingest Pipeline
Solr Solr Solr Solr
8
![Page 9: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/9.jpg)
Bash Rocks
9
![Page 10: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/10.jpg)
Bash Rocks
• Remote Solr stop/start scripts
• Remote Indexer stop/start scripts
• Performance Monitoring
• Content Extraction scripts (+Java)
• Ingestor Scripts (+Java)
• Artifact Deployment (CM)
10
![Page 11: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/11.jpg)
Lesson: Don’t get captured by your environment
11
![Page 12: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/12.jpg)
Make it easy to change approach
12
![Page 13: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/13.jpg)
Make it easy to change sharding
    IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
        "com.o19s.solr.ModShardIndexStrategy").newInstance();
    indexStrategy.configure(options);
    for (SolrInputDocument doc : docs) {
      indexStrategy.addDocument(doc);
    }
Lesson: Sharpen your axe
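The slide only shows the calling side, so here is a minimal sketch of what a mod-based strategy behind that interface might look like. Everything in it is an assumption for illustration: the interface shape, the `int shardCount` argument to `configure`, and the use of plain `String` ids in place of `SolrInputDocument`. In-memory lists stand in for the per-shard Solr servers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the IndexStrategy interface named on the slide. */
interface IndexStrategy {
    void configure(int shardCount);   // assumption: the real configure(options) takes richer options
    void addDocument(String docId);   // assumption: the real method takes a SolrInputDocument
}

/** Routes each document to a shard by hashing its id modulo the shard count. */
class ModShardIndexStrategy implements IndexStrategy {
    // One in-memory list per shard; a real strategy would hold one Solr client per shard.
    private final List<List<String>> shards = new ArrayList<>();

    @Override
    public void configure(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        shards.clear();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    /** floorMod keeps the index non-negative even when hashCode() is negative. */
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shards.size());
    }

    @Override
    public void addDocument(String docId) {
        shards.get(shardFor(docId)).add(docId);
    }

    /** Returns the ids routed to the given shard so far. */
    List<String> shard(int index) {
        return shards.get(index);
    }
}
```

Because the caller only knows the interface, trying a different sharding scheme means writing one new class and changing the class name passed to `Class.forName`.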
13
![Page 14: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/14.jpg)
Go Wide Quickly
14
![Page 15: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/15.jpg)
[Diagram: groups of shards (labels run up to shard12) first shown on ports :8983, :8984, and :8985 of search1.o19s.com, then spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com]
Lesson: Hardware is cheap, devs are expensive
15
![Page 16: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/16.jpg)
Why so many pipelines?
16
![Page 17: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/17.jpg)
Simple Pipeline
• Simple pipeline
• mv is atomic
Lesson: Simple Works
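The "mv is atomic" point can be sketched in Java as well as in Bash. This is an illustration, not the pipeline from the talk: the class name and the staging/ready directory layout are assumptions. The idea is that a file is written somewhere the next stage is not watching and then renamed into place, so the next stage never sees a half-written file. The rename is only atomic when both directories are on the same filesystem.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical hand-off helper: write in a staging directory, then rename into the ready directory. */
class AtomicHandoff {
    static Path publish(Path stagingDir, Path readyDir, String fileName, String content)
            throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(readyDir);

        // Write the full content where the downstream stage is not looking.
        Path tmp = Files.createTempFile(stagingDir, "partial-", ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));

        // ATOMIC_MOVE fails with AtomicMoveNotSupportedException rather than
        // silently falling back to copy-and-delete (for example across filesystems).
        Path target = readyDir.resolve(fileName);
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A downstream stage can then poll the ready directory and treat every file it finds as complete.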
17
![Page 18: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/18.jpg)
Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
• Map/Reduce
18
![Page 19: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/19.jpg)
Can you test your changes?
19
![Page 20: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/20.jpg)
JVM tuning is a black art
-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC
20
![Page 21: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/21.jpg)
21
![Page 22: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/22.jpg)
Run, don’t Walk
Lesson: Testing needs to be easy
22
![Page 23: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/23.jpg)
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
23
![Page 24: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/24.jpg)
Using Solr as key/value store
Metadata
Content Files
Ingest Pipeline
Solr Solr Solr Solr
Solr Key/Value Cache
24
![Page 25: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/25.jpg)
Using Solr as key/value store
• thousands of queries per second without real time get.
• how fast with real time get?
http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html
http://localhost:8983/solr/get?id=DOC45242&fl=entities,html
Lesson: Use what you have at hand
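The two URLs on this slide differ only in the handler and in how the id is passed. A small sketch of a helper that builds both forms, assuming a hypothetical class name and that the caller supplies the Solr base URL; it URL-encodes the parameter values, so the colon and comma appear percent-encoded rather than literally as on the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the two lookup URL styles shown on the slide. */
class KeyValueLookupUrls {
    /** Standard search handler: the id lookup is expressed as a query. */
    static String selectUrl(String solrBase, String id, String... fields) {
        return solrBase + "/select?q=" + encode("id:" + id)
                + "&fl=" + encode(String.join(",", fields));
    }

    /** Real time get handler: the id is passed directly as a parameter. */
    static String getUrl(String solrBase, String id, String... fields) {
        return solrBase + "/get?id=" + encode(id)
                + "&fl=" + encode(String.join(",", fields));
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform, so this should never happen.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `KeyValueLookupUrls.getUrl("http://localhost:8983/solr", "DOC45242", "entities", "html")` yields the `/get` form of the slide's second URL.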
25
![Page 26: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/26.jpg)
Don’t do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
• Don’t duplicate work
26
![Page 27: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/27.jpg)
Tika as a pipeline?
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with Behemoth project.
• Ingest multiple XML formats as well as CSV and EDI
27
![Page 28: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/28.jpg)
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
28
![Page 29: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/29.jpg)
Indexing is Easy and Quick
29
![Page 30: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/30.jpg)
CHEAP AND CHEERFUL
><
30
![Page 31: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/31.jpg)
NRT versus BigData
31
![Page 32: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/32.jpg)
The tension between scale and update rate
[Diagram: scale axis running from 10 million to 100’s of millions of documents, with a region labeled “Bad Place”]
32
![Page 33: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/33.jpg)
Grim Reaper
33
![Page 34: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/34.jpg)
Grim Reaper “Death of Mice”
Especially if you are on a cloud platform: providers build their servers on the cheapest commodity hardware.
Lesson: Embrace failure, don’t fear it
34
![Page 35: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/35.jpg)
Provisioning
• Chef/Puppet
• ZooKeeper
• Have you versioned everything to build an index over again?
Lesson: Automate Everything!
35
![Page 36: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/36.jpg)
TRADITIONAL ENVIRONMENT
36
![Page 37: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/37.jpg)
POOLED ENVIRONMENT
Lesson: Think Cloud
37
![Page 38: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/38.jpg)
Building a Patents Index
[Chart: machine count (axis 0 to 300) against index build time: 1 machine for 5 days, 5 machines for 3 days, 300 machines for 30 minutes]
What happens when we want to index 2 million patents in 30 minutes?
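To put a number on that question, here is a back-of-the-envelope sketch. It assumes the work splits evenly across machines and that the 300-machine figure from the chart is the cluster that handles the 2 million patents; the class and method names are made up for illustration.

```java
/** Rough throughput arithmetic for a fixed indexing deadline. */
class IndexingThroughput {
    /** Documents per second the whole cluster must sustain to finish in time. */
    static double requiredDocsPerSecond(long documents, long minutes) {
        return documents / (minutes * 60.0);
    }

    /** Share of that rate each machine carries, assuming an even split. */
    static double perMachine(double clusterDocsPerSecond, int machines) {
        return clusterDocsPerSecond / machines;
    }
}
```

With 2,000,000 documents and a 30 minute window, the cluster has to sustain roughly 1,111 documents per second, which is about 3.7 documents per second per machine across 300 machines.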
38
![Page 39: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/39.jpg)
Amazon AWS is Good but...
• EC2 is costly for your “base” load
• Issues of access to internal data
• Firewall and security
39
![Page 40: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/40.jpg)
Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Failover is sooo 90’s....
• Think shared nothing cluster!
Lesson: No!
40
![Page 41: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/41.jpg)
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
41
![Page 42: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/42.jpg)
One more thought...
42
![Page 43: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/43.jpg)
Measuring the impact of our algorithm changes is just getting harder with Big Data.
43
![Page 44: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/44.jpg)
www.solrpa.nl
Project SolrPanl
We need a motivated beta tester!
44
![Page 45: Big Search 4 Big Data War Stories](https://reader034.vdocuments.mx/reader034/viewer/2022042607/55509888b4c9058b208b47a7/html5/thumbnails/45.jpg)
Thank you!
Questions?
• @dep4b
• www.opensourceconnections.com
• slideshare.com/o19s
Nervous about speaking up? Ask me later!
45