the search is over: integrating solr and hadoop in the same cluster to simplify big data analytics

©MapR Technologies - Confidential

The Search Is Over: Integrating SOLR and Hadoop to Simplify Big Data Analytics


Evolution of Search

Documents

•Models

•Feature Selection

User Interaction

•Clicks

•Ratings/Reviews

•Learning to Rank

•Social Graph

Queries

•Phrases

•NLP

Content Relationships

•PageRank, etc.

•Organization


Search, Discovery and Analytics

Search

Discovery Analytics


Data Volume Growing 44x

2010: 1.2 Zettabytes

2020: 35.2 Zettabytes

Data is Growing Quickly

Business Analytics Requires a New Approach

Source: IDC Digital Universe Study, sponsored by EMC, May 2010

IDC Digital Universe Study 2011

Data is Growing Faster than Moore’s Law


MapReduce: A Paradigm Shift

Distributed computing platform

– Large clusters

– Commodity hardware

Pioneered at Google

– Bigtable and Google File System

Commercially available as Hadoop

http://hadoop.apache.org/


Hadoop Explosion



How does Map/Reduce work?

1. Map

– Spread data across servers based on key/value pairs

– Each node independently scans local data

2. Servers produce Map results

3. Reduce - combine/merge Map results

4. Process complete or Map a new function

Like shuffling multiple decks of playing cards
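The four steps above can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce flow, assuming a word-count job; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this sketch and are not part of Hadoop's API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(chunks):
    # Each chunk stands in for the data that is local to one server.
    map_results = []
    for chunk in chunks:
        map_results.extend(map_phase(chunk))
    return reduce_phase(shuffle(map_results))


if __name__ == "__main__":
    print(run_job(["solr hadoop solr", "hadoop search"]))
```

In a real cluster the map calls run in parallel on the nodes that hold each chunk, and the framework performs the shuffle over the network.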


The Cost of Enterprise Storage

SAN Storage

– $2 - $10/Gigabyte

– $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec

NAS Filers

– $1 - $5/Gigabyte

– $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec

Local Storage

– $0.05/Gigabyte

– $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
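The capacity figures follow from dividing the budget by the price per gigabyte. A small sketch of that arithmetic, assuming decimal units (1 Petabyte = 1,000,000 Gigabytes) and the low end of each quoted price range; the helper name `petabytes_per_budget` is hypothetical.

```python
def petabytes_per_budget(budget_dollars, dollars_per_gigabyte):
    """Capacity in petabytes, using decimal units (1 PB = 1,000,000 GB)."""
    gigabytes = budget_dollars / dollars_per_gigabyte
    return gigabytes / 1_000_000


# Prices are the low end of each range quoted on the slide.
PRICES = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.05}

if __name__ == "__main__":
    for name, price in PRICES.items():
        capacity = petabytes_per_budget(1_000_000, price)
        print(f"{name}: {capacity:.1f} PB per $1M")
```

The IOPS and throughput figures depend on the hardware and cannot be derived the same way.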


Deep Object Store

Billions and Billions of Files

For some use cases it’s not the storage capacity, it’s the number of objects

– Messages

– Attachments

– Images

– Recordings

Provides a deep storage pool that is analytic ready

– Store it until you need it

– Derive secondary value from analytic processing

It makes more sense to run analytics where the data is stored and send only the results over the network



Problems with Integrating Solr with Hadoop

Simple to integrate with Hadoop as a data source

Difficult to integrate distributed search and scale

SolrCloud simplifies Sharding and Replication coordination

Integration limitations based on capabilities of large scale storage

– High availability

– Data protection

– Ease of Access


Sharded text Indexing

Map

Reducer

Input documents

Local disk

Search Engine

Local disk

Clustered index storage

Assign documents to shards

Index text to local disk and then copy index to distributed file store

Copy to local disk typically required before index can be loaded


Problems with Solr and Hadoop

Map

Reducer

Input documents

Local disk

Search Engine

Local disk

Failure of a reducer causes garbage to accumulate on the local disk

Failure of the search engine requires another download of the index from clustered storage.


Limitations of HDFS

NAS appliance

NameNode


DataNode DataNode DataNode



HDFS is Append Only

Data Access is through the HDFS API

High Availability is a challenge

Single points of failure

Limited to 50-200 million files

Performance bottleneck


Logs: Flume aggregates incoming events to Solr

– Requires a multi-step batch process

Hadoop Cluster

Application Server

Application Server

Application Server


What’s Required for SDA?

Ease of Data Access through Open Standards

Large Scale, Reliable Storage

Ease of Integration

– Management (REST)

– Security (LDAP, NIS, Linux PAM…)

– Analytics (NFS, ODBC, HDFS)

Search

Discovery Analytics


Ease of Data Access

ENTERPRISE NFS Access

HDFS API


Multiple Architectures Possible

Export to the world

– NFS gateway runs on selected gateway hosts

Local server

– NFS gateway runs on local host

– Enables local compression and checksumming

Export to self

– NFS gateway runs on all data nodes, mounted from localhost


Data Access through Standard Protocols

NFS Servers

NFS Client


NFS Access through a Local Server

Client: Application, NFS Server

Cluster Nodes


Universal export to self

Cluster Node: NFS Server, Task

Cluster Nodes


Cluster Node: NFS Server, Task

Nodes are identical


Simplifies Solr Hadoop Integration

Input documents

Map

Reducer

Search Engine

Failure of a reducer is cleaned up by the map-reduce framework

Search engine reads mirrored index directly.


How Does this Integration Happen?

Elegantly simple

Direct integration results from leveraging the Solr and Hadoop architectures

Data in the Hadoop cluster is written to a Volume

Solr Crawler discovers content being entered into Hadoop

Accesses the data in the cluster through NFS

Builds Search Index

Users query Solr to find data directly in Hadoop
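Because the cluster is reachable over NFS, a crawler can use ordinary file I/O instead of an HDFS client. The sketch below assumes the cluster volume is mounted at a hypothetical path (`/mnt/cluster/volume`) and that new content can be detected by modification time; `discover_files` and `read_document` are illustrative names, not part of Solr or of any crawler API.

```python
import os


def discover_files(mount_root, since_epoch=0.0):
    """Yield paths under mount_root modified after since_epoch (seconds)."""
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > since_epoch:
                yield path


def read_document(path):
    """Read one file with plain file I/O; no HDFS client is involved."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return {"id": path, "text": handle.read()}


if __name__ == "__main__":
    # Hypothetical mount point; substitute wherever the cluster is mounted.
    for path in discover_files("/mnt/cluster/volume"):
        doc = read_document(path)
        print(doc["id"], len(doc["text"]))
```

Each returned dictionary could then be handed to whatever indexing client is in use; that step is left out here because it depends on the Solr setup.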


Distributed Shard Indexing


Input -> Map -> Combine -> Shuffle and sort -> Reduce -> Output

Input: doc1 doc2 doc3

Map: shard#1,doc1 shard#2,doc2 shard#1,doc3 shard#3,doc4 shard#3,doc5 …

Shuffle and sort: shard#1,[doc3,doc1] shard#2,[doc2] shard#3,[doc5] …

Output: index/s1 index/s2 index/s3 …


How Does this Work at Scale with Distributed Indices?

MapReduce jobs analyze distributed, disparate data in a cluster

In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.

Mapper assigns document to shard

– Shard is usually hash of document id

Reducer indexes all documents for a shard

– Indexes created on local disk

– On success, copy index to DFS

ZooKeeper is used to manage Solr instances

A large Solr search is distributed across multiple shards
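The mapper and reducer roles described above can be sketched as follows. This is a toy, in-memory illustration: MD5 is used only as one example of a stable hash of the document id, and a dictionary of term to document ids stands in for the Lucene/Solr index a real reducer would write. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Mapper step: pick a shard from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def map_documents(docs, num_shards):
    """Emit (shard, (doc_id, text)) pairs, one per input document."""
    return [(shard_for(doc_id, num_shards), (doc_id, text))
            for doc_id, text in docs.items()]


def build_shard_indexes(pairs):
    """Reducer step: build one toy inverted index per shard.

    A real job would write a search index to disk here; a dict of
    term -> sorted doc ids stands in for it.
    """
    indexes = defaultdict(lambda: defaultdict(set))
    for shard, (doc_id, text) in pairs:
        for term in text.lower().split():
            indexes[shard][term].add(doc_id)
    return {shard: {term: sorted(ids) for term, ids in terms.items()}
            for shard, terms in indexes.items()}


def search(indexes, term):
    """Distributed query: ask every shard, then merge the hits."""
    hits = []
    for shard_index in indexes.values():
        hits.extend(shard_index.get(term.lower(), []))
    return sorted(hits)


if __name__ == "__main__":
    docs = {"doc1": "solr search", "doc2": "hadoop cluster", "doc3": "solr hadoop"}
    indexes = build_shard_indexes(map_documents(docs, num_shards=3))
    print(search(indexes, "solr"))
```

Hashing the document id keeps the assignment deterministic, so a re-run of a failed task sends each document to the same shard.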


What about HA and Data Protection?

Reliable Compute

– Automated re-replication

– Self-healing from HW and SW failures

– Load balancing

– Rolling upgrades

– No lost jobs or data

– 99.999% (five nines) uptime

Dependable Storage

– Business continuity with snapshots and mirrors

– Recover to a point in time

– End-to-end checksumming

– Strong consistency

– Mirror across sites to meet Recovery Time Objectives

Cluster Capabilities can Extend to Integrated Search and Discovery


MapReduce failure to write the Index

Highly available JobTracker and TaskTracker ensure that failed jobs recover with their state and run to completion

MapReduce will clean up partially written indexes

No administrator intervention required


Solr Node Fails

Other Solr nodes start serving shards that were being served by failed node


Node Containing the Index Fails

Data is already replicated across the cluster

ZooKeeper assigns the replicated shard to the Solr instance on a node that holds a replica


Additional High Availability and Replication

Snapshots are available

Administrator sets frequency at the Volume

Snapshots with automatic de-duplication

Saves space by sharing blocks

Redirect on write, fast with no performance or storage penalty

Zero performance loss on writing to original

Scheduled, or on-demand

Easy recovery with drag and drop


Mirroring Support in Hadoop Cluster

Business Continuity and Efficiency

Efficient design

– Differential deltas are updated

– Compressed and checksummed

Easy to manage

– Scheduled or on-demand

– WAN, Remote Seeding

– Consistent point-in-time

Diagram labels: Production, Research, Datacenter 1, Datacenter 2, WAN, EC2


Simplified NFS data flows for Distributed Search

Map

Reducer

Input documents

Search Engine

Mirrors

Search Engine

Mirroring allows exact placement of index data

Arbitrary levels of replication also possible


Improving Search Relevancy

Requires a continuous Feedback Loop

– The quality of the search is influenced by the end-user selections

– Fully automated process that improves with use

– Does not require manual tags or classification

Search

Discovery Analytics


Recommendations

Often referred to as collaborative filtering

Actors interact with items

– observe successful interaction

We want to suggest additional successful interactions

Observations inherently very sparse


Examples

Customers buying books (Linden et al.)

Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)

Internet radio listeners not skipping songs (Musicmatch)

Internet video watchers watching >30 s


Examples

Query for Friends results in links to Seinfeld

Search for kittens, get results for baby otters


Dyadic Structure

Functional

– Interaction: actor -> item*

Relational

– Interaction ⊆ Actors x Items

Matrix

– Rows indexed by actor, columns by item

– Value is count of interactions

Predict missing observations


Fundamental Algorithmics

Co-occurrence: K = A’A

A is actors x items, K is items x items

Product has general shape of matrix

K tells us “users who interacted with x also interacted with y”
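A small worked example may help. It takes K to be the product A’A, the same form the deck uses when it says A’A gives query recommendations. The interaction matrix below is made up, and plain Python lists are used so the sketch has no dependencies.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Multiply two list-of-lists matrices."""
    right_t = transpose(right)
    return [[sum(a * b for a, b in zip(row, col)) for col in right_t]
            for row in left]


def cooccurrence(a):
    """K = A'A, where A has one row per actor and one column per item."""
    return matmul(transpose(a), a)


# Hypothetical interactions: 3 actors (rows) x 3 items (columns).
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

if __name__ == "__main__":
    for row in cooccurrence(A):
        print(row)
```

With 0/1 entries, `K[i][j]` counts the actors who interacted with both item i and item j, and the diagonal counts the interactions per item. Here `K[0][1]` is 2 because the first two actors touched both of the first two items.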


Why not Expand it?

Users enter queries (A)

– (actor = user, item=query)

Users view videos (B)

– (actor = user, item=video)

A’A gives query recommendation

– “did you mean to ask for”

B’B gives video recommendation

– “you might like these videos”


The punch-line

B’A recommends videos in response to a query

– (isn’t that a search engine?)

– (not quite, it doesn’t look at content or meta-data)
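The cross-occurrence product can be sketched the same way. With A as users x queries and B as users x videos, B’A is videos x queries, so one column of it scores every video for a given query. The query names, video names, and interaction data below are invented for illustration.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Multiply two list-of-lists matrices."""
    right_t = transpose(right)
    return [[sum(x * y for x, y in zip(row, col)) for col in right_t]
            for row in left]


QUERIES = ["flamenco", "kittens"]
VIDEOS = ["guitar_solo", "cat_clip", "dance_show"]

# Rows are users. A[u][q] = 1 if user u entered query q.
A = [
    [1, 0],
    [1, 0],
    [0, 1],
]
# B[u][v] = 1 if user u viewed video v.
B = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]


def videos_for_query(query):
    """Rank videos by the query's column of B'A (videos x queries)."""
    cross = matmul(transpose(B), A)
    q = QUERIES.index(query)
    scored = [(cross[v][q], VIDEOS[v]) for v in range(len(VIDEOS))]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


if __name__ == "__main__":
    print(videos_for_query("flamenco"))
```

No video title or description is consulted: the ranking comes only from which users issued the query and which videos those same users watched.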


Real-life example

Query: “Paco de Lucia”

Conventional meta-data search results:

– “hombres del paco” times 400

– not much else

Recommendation based search:

– Flamenco guitar and dancers

– Spanish and classical guitar

– Van Halen doing a classical/flamenco riff




The Search for Relevancy

Updating Search to Reflect Relevancy

– Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance

The power of this virtuous loop depends on frictionless data access, high availability, and performance

Search

Discovery Analytics
