The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics
DESCRIPTION
Presented by M.C. Srivas | MapR. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data: search, discovery and analytics need to be integrated. Creating and maintaining separate SOLR and Hadoop clusters is time consuming, error prone and difficult to keep in sync, yet most Hadoop installations do not integrate with SOLR within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data search, including how to: protect against silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and where GFS, BigTable and MapReduce were used extensively.
TRANSCRIPT
![Page 1: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/1.jpg)
1 ©MapR Technologies - Confidential
The Search Is Over: Integrating SOLR and Hadoop to Simplify Big Data Analytics
![Page 2: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/2.jpg)
Evolution of Search
Documents
•Models
•Feature Selection
User Interaction
•Clicks
•Ratings/Reviews
•Learning to Rank
•Social Graph
Queries
•Phrases
•NLP
Content Relationships
•Page Rank, etc.
•Organization
![Page 3: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/3.jpg)
Search, Discovery and Analytics
Search
Discovery Analytics
![Page 4: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/4.jpg)
Data is Growing Quickly
Data Volume Growing 44x
2010: 1.2 Zettabytes
2020: 35.2 Zettabytes
Data is Growing Faster than Moore’s Law
Business Analytics Requires a New Approach
Source: IDC Digital Universe Study, sponsored by EMC, May 2010; IDC Digital Universe Study 2011
![Page 5: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/5.jpg)
MapReduce: A Paradigm Shift
Distributed computing platform
– Large clusters
– Commodity hardware
Pioneered at Google
– Bigtable and Google File System
Commercially available as Hadoop
![Page 6: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/6.jpg)
Hadoop Explosion
![Page 7: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/7.jpg)
How does Map/Reduce work?
1. Map
– Spread data across servers based on key/value pairs
– Each node independently scans local data
2. Servers produce Map results
3. Reduce - combine/merge Map results
4. Process complete or Map a new function
Like shuffling multiple decks of playing cards
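The four steps above can be sketched in a few lines of single-process Python. This is only an illustration of the map, group, and reduce flow, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # Each chunk is scanned independently, as each node scans its local data.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))

counts = word_count(["solr and hadoop", "hadoop and search"])
```

In a real cluster each call to `map_phase` would run on the node that holds the chunk, and the grouped keys would be spread across several reducers.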
![Page 8: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/8.jpg)
The Cost of Enterprise Storage
SAN Storage
– $2 - $10/Gigabyte
– $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
NAS Filers
– $1 - $5/Gigabyte
– $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec
Local Storage
– $0.05/Gigabyte
– $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
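The capacity figures follow from dividing the budget by the per-gigabyte price. A quick check, assuming decimal units (1 Petabyte = 1,000,000 Gigabytes) and the low end of each price range; the helper name `petabytes_for_budget` is made up for this sketch.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB

def petabytes_for_budget(budget_dollars, dollars_per_gb):
    """Capacity in petabytes that a budget buys at a given price per gigabyte."""
    return budget_dollars / dollars_per_gb / GB_PER_PB

budget = 1_000_000
san_pb = petabytes_for_budget(budget, 2.00)    # low end of the SAN price range
nas_pb = petabytes_for_budget(budget, 1.00)    # low end of the NAS price range
local_pb = petabytes_for_budget(budget, 0.05)  # local storage price

for name, value in [("SAN", san_pb), ("NAS", nas_pb), ("Local", local_pb)]:
    print(f"{name}: {value:.1f} PB")
```

At those prices the three results round to 0.5, 1.0 and 20.0 Petabytes, which is where the slide's "$1M gets" capacities come from. The IOPS and throughput figures cannot be derived this way and are taken as given.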
![Page 9: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/9.jpg)
Deep Object Store
Billions and Billions of Files
For some use cases it’s not the storage capacity, it’s the number of objects
– Messages
– Attachments
– Images
– Recordings
Provides a deep storage pool that is analytic ready
– Store it until you need it
– Derive secondary value from analytic processing
Makes more sense to perform analytics on the data and send results over the network
![Page 10: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/10.jpg)
Problems with Integrating Solr with Hadoop
Simple to integrate with Hadoop as a data source
Difficult to integrate distributed search and scale
SolrCloud simplifies Sharding and Replication coordination
Integration limitations based on capabilities of large scale storage
– High availability
– Data protection
– Ease of Access
![Page 11: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/11.jpg)
Sharded text Indexing
Map
Reducer
Input documents
Local disk
Search Engine
Local disk
Clustered index storage
Assign documents to shards
Index text to local disk and then copy index to distributed file store
Copy to local disk typically required before index can be loaded
![Page 12: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/12.jpg)
Problems with Solr and Hadoop
Map
Reducer
Input documents
Local disk
Search Engine
Local disk
Clustered index storage
Failure of a reducer causes garbage to accumulate on the local disk
Failure of the search engine requires another download of the index from clustered storage
![Page 13: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/13.jpg)
Limitations of HDFS
[Diagram: NAS appliance, NameNode (A, B), and nine DataNodes]
HDFS is Append Only
Data Access is through the HDFS API
High Availability is a challenge
Single points of failure
Limited to 50-200 million files
Performance bottleneck
![Page 14: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/14.jpg)
Logs: Flume aggregates incoming events to Solr
– Requires a multi-step batch process
[Diagram: Hadoop Cluster and three Application Servers]
![Page 15: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/15.jpg)
What’s Required for SDA?
Ease of Data Access through Open Standards
Large Scale, Reliable Storage
Ease of Integration
– Management (REST)
– Security (LDAP, NIS, Linux PAM…)
– Analytics (NFS, ODBC, HDFS)
Search
Discovery Analytics
![Page 16: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/16.jpg)
Ease of Data Access
ENTERPRISE NFS Access
HDFS API
![Page 17: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/17.jpg)
Multiple Architectures Possible
Export to the world
– NFS gateway runs on selected gateway hosts
Local server
– NFS gateway runs on local host
– Enables local compression and checksumming
Export to self
– NFS gateway runs on all data nodes, mounted from localhost
![Page 18: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/18.jpg)
Data Access through Standard Protocols
NFS Server
NFS Server
NFS Server
NFS Server
NFS Client
![Page 19: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/19.jpg)
Client
NFS Server
NFS Access through a Local server
Application
Cluster Nodes
![Page 20: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/20.jpg)
Cluster Node
NFS Server
Universal export to self
Task
Cluster Nodes
![Page 21: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/21.jpg)
Cluster Node
NFS Server
Task
Cluster Node
NFS Server
Task
Cluster Node
NFS Server
Task
Nodes are identical
![Page 22: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/22.jpg)
Search Engine
Simplifies Solr Hadoop Integration
Map
Reducer
Input documents
Clustered index storage
Failure of a reducer is cleaned up by
map-reduce framework
Search engine reads mirrored index directly.
![Page 23: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/23.jpg)
How Does this Integration Happen?
Elegantly simple
Direct integration is a result of leveraging the existing architectures
Data in the Hadoop cluster is written to a Volume
Solr Crawler discovers content being entered into Hadoop
Accesses the data in the cluster through NFS
Builds Search Index
Users query Solr to find data directly in Hadoop
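The flow above (discover files, read them over NFS, build an index, answer queries) can be sketched with ordinary file operations, since an NFS-mounted cluster looks like a local directory tree. This is a toy stand-in, not the Solr crawler: the functions `crawl`, `build_index` and `search` are hypothetical, and the index is a plain in-memory inverted index rather than a Lucene index.

```python
import os
import re
from collections import defaultdict

def crawl(mount_point):
    """Yield the path of every file under the (assumed) NFS mount point."""
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)

def build_index(mount_point):
    """Toy inverted index: term -> set of file paths that contain the term."""
    index = defaultdict(set)
    for path in crawl(mount_point):
        # Plain file reads work because the cluster is exposed through NFS.
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for term in re.findall(r"\w+", handle.read().lower()):
                index[term].add(path)
    return index

def search(index, term):
    """Return the sorted paths of files that contain the term."""
    return sorted(index.get(term.lower(), ()))
```

Calling `build_index` on the directory where a cluster volume is mounted, then `search(index, "hadoop")`, returns the matching file paths, which users can open directly through the same mount.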
![Page 24: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/24.jpg)
Distributed Shard Indexing
Input → Map → Combine → Shuffle and sort → Reduce → Output
Input: doc1 doc2 doc3
Map output: shard#1,doc1 shard#2,doc2 shard#1,doc3 shard#3,doc4 shard#3,doc5 …
Shuffle and sort output: shard#1,[doc3,doc1] shard#2,[doc2] shard#3,[doc5] …
Reduce output: index/s1 index/s2 index/s3 …
![Page 25: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/25.jpg)
How Does this Work at Scale with Distributed Indices?
MapReduce jobs analyze distributed, disparate data in a cluster
In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.
Mapper assigns document to shard
– Shard is usually hash of document id
Reducer indexes all documents for a shard
– Indexes created on local disk
– On success, copy index to DFS
ZooKeeper is used to manage Solr instances
A large Solr Search is distributed across multiple shards
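The mapper and reducer roles described above can be sketched as follows. It is a single-process illustration under stated assumptions: MD5 stands in for "hash of document id", the reducer builds a toy inverted index instead of a Lucene index, and all function names are invented for the example.

```python
import hashlib
from collections import defaultdict

def shard_for(doc_id, num_shards):
    """Mapper side: stable hash of the document id, modulo the shard count."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def map_documents(docs, num_shards):
    """Emit one (shard, (doc_id, text)) pair per input document."""
    return [(shard_for(doc_id, num_shards), (doc_id, text)) for doc_id, text in docs]

def reduce_shard(doc_list):
    """Reducer side: index all documents of one shard (term -> doc ids)."""
    index = defaultdict(set)
    for doc_id, text in doc_list:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def build_shards(docs, num_shards):
    grouped = defaultdict(list)
    # This grouping stands in for the framework's shuffle and sort.
    for shard, doc in map_documents(docs, num_shards):
        grouped[shard].append(doc)
    return {shard: reduce_shard(doc_list) for shard, doc_list in grouped.items()}
```

Because the shard is a pure function of the document id, the input can be split into any number of chunks and every mapper still routes a given document to the same shard.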
![Page 26: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/26.jpg)
What about HA and Data Protection?
Reliable Compute
– Automated re-replication
– Self-healing from HW and SW failures
– Load balancing
– Rolling upgrades
– No lost jobs or data
– Five 9’s of uptime
Dependable Storage
– Business continuity with snapshots and mirrors
– Recover to a point in time
– End-to-end checksumming
– Strong consistency
– Mirror across sites to meet Recovery Time Objectives
Cluster Capabilities can Extend to Integrated Search and Discovery
![Page 27: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/27.jpg)
MapReduce failure to write the Index
Highly available JobTracker and TaskTracker ensure that failed jobs recover their state and run to completion
MapReduce will clean up partially written indexes
No administrator intervention required
![Page 28: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/28.jpg)
Solr Node Fails
Other Solr nodes start serving shards that were being served by failed node
![Page 29: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/29.jpg)
Node Containing the Index Fails
Data is already replicated across the cluster
ZooKeeper assigns the Solr instance on a node that holds the replica to serve the replicated shard
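The reassignment logic in both failure cases can be illustrated with a small function. This is a plain-Python sketch and does not use the ZooKeeper API; the function `reassign_shards`, the dictionaries it takes, and the "first surviving replica" policy are assumptions made for the example.

```python
def reassign_shards(assignments, replicas, failed_node):
    """Return new shard -> node assignments after failed_node goes away.

    assignments: dict mapping shard name -> node currently serving it
    replicas:    dict mapping shard name -> list of nodes holding a copy of its index
    """
    updated = {}
    for shard, node in assignments.items():
        if node != failed_node:
            updated[shard] = node  # unaffected shards keep their node
            continue
        survivors = [n for n in replicas[shard] if n != failed_node]
        if not survivors:
            raise RuntimeError(f"no surviving replica for {shard}")
        # Assumed policy: hand the shard to the first node that still holds a copy.
        updated[shard] = survivors[0]
    return updated
```

For example, with `assignments = {"s1": "n1", "s2": "n2"}` and `replicas = {"s1": ["n1", "n3"], "s2": ["n2", "n1"]}`, losing `n1` moves `s1` to `n3` and leaves `s2` where it was.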
![Page 30: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/30.jpg)
Additional High Availability and Replication
Snapshots are available
Administrator sets frequency at the Volume
Snapshots with automatic de-duplication
Saves space by sharing blocks
Redirect on write, fast with no performance or storage penalty
Zero performance loss on writing to original
Scheduled, or on-demand
Easy recovery with drag and drop
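Block sharing and redirect on write can be pictured with a toy volume. This is a generic illustration of the technique, not MapR's implementation: the `Volume` class and its methods are hypothetical, and real systems work on fixed-size disk blocks rather than Python byte strings.

```python
class Volume:
    """Toy volume with redirect-on-write snapshots."""

    def __init__(self):
        self.blocks = {}     # block id -> data, shared by live volume and snapshots
        self.live = {}       # logical offset -> block id for the live volume
        self.snapshots = {}  # snapshot name -> frozen copy of the block map
        self._next_id = 0

    def write(self, offset, data):
        # Redirect on write: new data lands in a fresh block, the old block is
        # left untouched so any snapshot that points at it stays valid.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[offset] = block_id

    def snapshot(self, name):
        # Only the block map is copied; no data blocks are duplicated.
        self.snapshots[name] = dict(self.live)

    def read(self, offset, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[offset]]
```

After writing two blocks, taking a snapshot, and overwriting one of them, the volume holds three data blocks in total: the unchanged block is shared between the snapshot and the live volume.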
![Page 31: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/31.jpg)
Mirroring Support in Hadoop Cluster
EC2
Business Continuity and Efficiency
Efficient design
Differential deltas are updated
Compressed and checksummed
Easy to manage
Scheduled or on-demand
WAN, Remote Seeding
Consistent point-in-time
[Diagram: Datacenter 1 and Datacenter 2 connected over the WAN, with clusters labeled Production and Research]
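The idea of shipping only compressed, checksummed differences to a mirror can be sketched as below. This is an assumption-laden illustration rather than the product's mirroring protocol: SHA-256 and zlib are stand-ins for whatever checksum and compression are actually used, and the function names are invented.

```python
import hashlib
import zlib

def checksums(blocks):
    """Checksum of every block, as the mirror would report them."""
    return [hashlib.sha256(block).hexdigest() for block in blocks]

def delta(source_blocks, mirror_checksums):
    """Return {block index: compressed block} for blocks that differ from the mirror."""
    changed = {}
    for i, block in enumerate(source_blocks):
        is_new = i >= len(mirror_checksums)
        if is_new or hashlib.sha256(block).hexdigest() != mirror_checksums[i]:
            changed[i] = zlib.compress(block)
    return changed

def apply_delta(mirror_blocks, changed):
    """Apply a delta on the mirror side (truncation is not handled in this sketch)."""
    result = list(mirror_blocks)
    for i in sorted(changed):
        block = zlib.decompress(changed[i])
        if i < len(result):
            result[i] = block
        else:
            result.append(block)
    return result
```

Only the blocks whose checksums differ cross the WAN, which is what makes repeated scheduled mirrors cheap after the first full copy.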
![Page 32: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/32.jpg)
Simplified NFS data flows for Distributed Search
Map
Reducer
Input documents
Search Engine
Mirrors
Search Engine
Mirroring allows exact placement of index data
Arbitrary levels of replication also possible
![Page 33: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/33.jpg)
Improving Search Relevancy
Requires a continuous Feedback Loop
– The quality of the search is influenced by the end-user selections
– Fully automated process that improves with use
– Does not require manual tags or classification
Search
Discovery Analytics
![Page 34: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/34.jpg)
Recommendations
Often referred to as collaborative filtering
Actors interact with items
– observe successful interaction
We want to suggest additional successful interactions
Observations inherently very sparse
![Page 35: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/35.jpg)
Examples
Customers buying books (Linden et al.)
Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.; Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s
![Page 36: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/36.jpg)
Examples
Query for Friends results in links to Seinfeld
Search for kittens, get results for baby otters
![Page 37: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/37.jpg)
Dyadic Structure
Functional
– Interaction: actor -> item*
Relational
– Interaction ⊆ Actors x Items
Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
Predict missing observations
![Page 38: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/38.jpg)
Fundamental Algorithmics
Co-occurrence: K = A’A
A is actors x items, K is items x items
Product has general shape of matrix
K tells us “users who interacted with x also interacted with y”
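A tiny worked example may help here. With A as an actors x items matrix of interaction counts, the item-by-item co-occurrence matrix is the product of A transposed with A. The sketch below uses plain Python lists; the helper names and the 3 x 3 sample matrix are made up for illustration.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]

def cooccurrence(a):
    """K = A'A: entry (x, y) sums, over actors, the product of their counts for x and y."""
    return matmul(transpose(a), a)

# Rows are actors, columns are items; 1 means the actor interacted with the item.
A = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
K = cooccurrence(A)
# K == [[3, 2, 1], [2, 2, 0], [1, 0, 1]]
```

Reading row 0 of K: all three actors touched item 0, two of them also touched item 1, and one also touched item 2. K is symmetric and items x items, as the slide states.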
![Page 39: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/39.jpg)
Why not Expand it?
Users enter queries (A)
– (actor = user, item=query)
Users view videos (B)
– (actor = user, item=video)
A’A gives query recommendation
– “did you mean to ask for”
B’B gives video recommendation
– “you might like these videos”
![Page 40: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/40.jpg)
The punch-line
B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
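To make the cross-product concrete: with A as users x queries and B as users x videos, B’A is a videos x queries matrix, and the column for a query scores each video by how many of the users who issued that query also watched it. The sketch below is a minimal illustration; the function names, the ranking rule and the sample data are assumptions, and no content or metadata is consulted.

```python
def cross_occurrence(b, a):
    """Compute B'A: rows are videos, columns are queries."""
    n_videos, n_queries = len(b[0]), len(a[0])
    result = [[0] * n_queries for _ in range(n_videos)]
    for user_videos, user_queries in zip(b, a):  # one row per user in both matrices
        for v, watched in enumerate(user_videos):
            if watched:
                for q, asked in enumerate(user_queries):
                    result[v][q] += watched * asked
    return result

def recommend(b, a, query, top_n=2):
    """Rank videos for a query by their B'A score, dropping zero scores."""
    scores = [row[query] for row in cross_occurrence(b, a)]
    ranked = sorted(range(len(scores)), key=lambda v: (-scores[v], v))
    return [v for v in ranked[:top_n] if scores[v] > 0]

A = [[1, 0], [1, 0], [0, 1]]           # 3 users x 2 queries
B = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]  # 3 users x 3 videos
```

With this data, `recommend(B, A, 0)` returns videos 0 and 1 (watched by the users who entered query 0), and `recommend(B, A, 1)` returns only video 2.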
![Page 41: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/41.jpg)
Real-life example
Query: “Paco de Lucia”
Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
![Page 42: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/42.jpg)
Real-life example
![Page 43: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/43.jpg)
The Search for Relevancy
Updating Search to Reflect Relevancy
– Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance
The power of this virtuous loop depends on frictionless data access, high availability, and performance
Search
Discovery Analytics
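The feedback loop can be sketched as a log aggregation followed by a re-ranking step. Everything specific here is an assumption for illustration: the tab-separated `query<TAB>doc` log format, the function names, and the simple "more clicks ranks higher" rule. At scale the counting step would be the MapReduce job.

```python
from collections import Counter

def click_counts(log_lines):
    """Count clicks per (query, doc) from tab-separated 'query<TAB>doc' lines."""
    counts = Counter()
    for line in log_lines:
        query, doc = line.rstrip("\n").split("\t")
        counts[(query, doc)] += 1
    return counts

def rerank(query, results, counts):
    """Stable sort: documents clicked more often for this query move up."""
    return sorted(results, key=lambda doc: -counts[(query, doc)])

log = ["guitar\tvideo7\n", "guitar\tvideo3\n", "guitar\tvideo7\n"]
counts = click_counts(log)
reranked = rerank("guitar", ["video1", "video3", "video7"], counts)
```

Each pass through the loop uses the latest logs, so the ordering improves with use and needs no manual tagging.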
![Page 44: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042715/559351481a28abb3028b45da/html5/thumbnails/44.jpg)