webinar: solr & fusion for big data

Solr & Fusion for Big Data

• Where does search fit in the big data landscape?
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion

The standard for enterprise search. 90% of Fortune 500 uses Solr.

Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) + visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
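As a rough illustration of the "facets, pivot facets, facets + stats" bullet, the sketch below assembles the request parameters for one Solr query that asks for all three. The field names (`level_s`, `host_s`, `app_s`, `response_ms_i`) and the `piv1` tag are hypothetical; the parameter names are standard Solr faceting and stats parameters, but check them against the reference guide for your Solr version.

```python
from urllib.parse import urlencode, parse_qs


def build_facet_query(q, facet_field, pivot_fields, stats_field):
    """Return a URL-encoded Solr query string requesting a field facet,
    a pivot facet, and per-pivot statistics."""
    params = [
        ("q", q),
        ("rows", "0"),                 # aggregations only, no documents
        ("facet", "true"),
        ("facet.field", facet_field),  # plain facet counts
        ("stats", "true"),
        # Tag the stats field so the pivot facet can refer to it.
        ("stats.field", "{!tag=piv1}" + stats_field),
        # Pivot facet that also computes the tagged stats per bucket.
        ("facet.pivot", "{!stats=piv1}" + ",".join(pivot_fields)),
    ]
    return urlencode(params)


# Hypothetical log fields, for illustration only.
query_string = build_facet_query(
    "level_s:ERROR", "host_s", ["host_s", "app_s"], "response_ms_i"
)
decoded = parse_qs(query_string)
print(query_string)
```

The resulting string would be appended to a collection's `/select` handler URL; the drill-down from an aggregate into the matching documents is then an ordinary filter query on the same fields.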

Common use case: log analysis
• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale

Time-based Partitioning Scheme

[Diagram labels: Fusion Log Analytics Dashboard; recent_logs (collection alias); logs_feb26 (daily collection); shards h00 ... h22, h23]

• Every daily collection has 24 shards (h00-h23), each covering a 1-hour block of log messages
• Add replicas to support higher query volume and fault tolerance
• Use a collection alias to make multiple collections look like a single collection; this minimizes the client layer's exposure to the partitioning strategy
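The naming scheme above (daily collections such as logs_feb26, hourly shards h00-h23, and a recent_logs alias) can be sketched as a few small helpers. This is only an illustration under assumptions: the function names are made up, timestamps are taken to be UTC, and the slides do not say how the shard name is handed to Solr at index time. `CREATEALIAS` is the Collections API action for defining an alias.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def collection_for(ts):
    """Daily collection name for a timestamp, e.g. logs_feb26."""
    return "logs_%s%02d" % (MONTHS[ts.month - 1], ts.day)


def shard_for(ts):
    """Hourly shard name for a timestamp, h00 through h23."""
    return "h%02d" % ts.hour


def create_alias_query(alias, collections):
    """Query string for a Collections API CREATEALIAS request that makes
    several daily collections searchable under one name."""
    return urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })


# Example: a log message received at 22:15 UTC on February 26.
ts = datetime(2016, 2, 26, 22, 15, tzinfo=timezone.utc)
target = (collection_for(ts), shard_for(ts))
alias_query = create_alias_query("recent_logs", ["logs_feb25", "logs_feb26"])
print(target, alias_query)
```

Because clients query only the alias, rolling the window forward is a matter of re-issuing the alias request with a new collection list; nothing in the client layer needs to know about the daily or hourly layout.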

Solr on HDFS
• Maturing solution, still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas share index (SOLR-6237)

Solr on HDFS

[Diagram labels: Solr shard1 / replica1; block cache; writes / reads; HDFS DataNode A, HDFS DataNode B, HDFS DataNode C; HDFS block replication; Solr replication]

Auto Add Replica

[Diagram labels: overseer; ZooKeeper; watches; block cache; writes / reads; HDFS DataNode A, HDFS DataNode B, HDFS DataNode C; HDFS block replication; Solr replication]

Indexing Strategies
• Many tools available!
• MapReduce indexer (Solr contrib)
• LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase
• Storm to Solr or Fusion (github.com/LucidWorks/storm-solr)
• Spark to Solr or Fusion (github.com/LucidWorks/spark-solr)
• Lucidworks Fusion Connectors

Any Data. Any Source.

Fusion Indexing Pipelines in MapReduce

[Diagram labels: HDFS; N map tasks (1 per block); Map Task (or reducer if needed); Fusion Pipeline; docs; CloudSolrClient; ZooKeeper; Solr]

• Get collection metadata from ZooKeeper (e.g. shard leader URL)
• Send updates to shard leaders in parallel
• 30+ index stages: field mapping, JavaScript, Tika parsing, NLP, regex, JDBC lookup
• Many common file formats supported: CSV, SequenceFile, grok, XML, warc
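The "get collection metadata, then send updates to shard leaders in parallel" flow can be sketched in a few lines. This is a simplified stand-in, not CloudSolrClient itself: the leader URLs are hypothetical, the cluster state is a hard-coded dict rather than a ZooKeeper read, the CRC32 hash is a placeholder for Solr's own document router, and `send_batch` only counts documents where a real client would issue an HTTP update.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard-leader URLs, standing in for cluster state
# that a real client would read from ZooKeeper.
SHARD_LEADERS = {
    "shard1": "http://solr1.example.com:8983/solr/logs_shard1_replica1",
    "shard2": "http://solr2.example.com:8983/solr/logs_shard2_replica1",
}


def shard_for_id(doc_id, shard_names):
    """Pick a shard for a document id (placeholder hash, not Solr's router)."""
    return shard_names[zlib.crc32(doc_id.encode("utf-8")) % len(shard_names)]


def group_by_leader(docs, leaders):
    """Bucket documents by the URL of the leader that owns their shard."""
    shard_names = sorted(leaders)
    batches = defaultdict(list)
    for doc in docs:
        batches[leaders[shard_for_id(doc["id"], shard_names)]].append(doc)
    return dict(batches)


def send_batch(leader_url, batch):
    """Placeholder for an update request to one leader; returns the count sent."""
    return leader_url, len(batch)


def index_parallel(docs, leaders):
    """Send one batch per shard leader concurrently; return docs sent per leader."""
    batches = group_by_leader(docs, leaders)
    with ThreadPoolExecutor(max_workers=max(1, len(batches))) as pool:
        futures = [pool.submit(send_batch, url, batch)
                   for url, batch in batches.items()]
        return dict(f.result() for f in futures)


docs = [{"id": "doc%d" % i} for i in range(100)]
sent = index_parallel(docs, SHARD_LEADERS)
print(sent)
```

Routing on the client side means each document travels directly to the node that will index it, instead of being forwarded from whichever node happened to receive the request.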

Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for Hadoop)

Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)

Lambda Architecture

• Search engine fits perfectly with lambda
• Use batch layer to build indexes instead of “views”
• Speed layer uses Spark Streaming to build near real-time index
• Aggregation collections for historical data

source: http://lambda-architecture.net/
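A toy model of the lambda split may help make the bullets concrete: a batch layer builds an index over historical events, a speed layer builds one over recent events, and a query combines the two. The sketch uses in-memory counters as a stand-in for Solr collections; the event fields, the cutoff, and the function names are all hypothetical.

```python
from collections import Counter


def build_index(events):
    """Count events per log level; a stand-in for building a search index."""
    return Counter(event["level"] for event in events)


def query(batch_index, speed_index, level):
    """Answer a query by combining the batch and speed layers."""
    return batch_index[level] + speed_index[level]


# Hypothetical events; "ts" is an arbitrary increasing timestamp.
events = [
    {"ts": 10, "level": "ERROR"},
    {"ts": 20, "level": "INFO"},
    {"ts": 90, "level": "INFO"},
    {"ts": 110, "level": "ERROR"},
    {"ts": 120, "level": "WARN"},
]
cutoff = 100  # everything before this has been absorbed by the last batch run

batch_index = build_index(e for e in events if e["ts"] < cutoff)   # batch layer
speed_index = build_index(e for e in events if e["ts"] >= cutoff)  # speed layer

error_count = query(batch_index, speed_index, "ERROR")
print(error_count)
```

In a search-backed version, the two counters would be two collections (or sets of collections) and the combining step could be a single query against an alias covering both.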

Spark

[Diagram: the Spark stack]
• Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
• engine: Spark Core (Execution Model, The Shuffle, Caching)
• cluster mgmt: Hadoop YARN, Mesos, Standalone
• HDFS
• shared memory: Tachyon
• languages: Scala, Java, Python, R

The most relevant results every single time.

Massive scale. Real-time. Secure.

Any data. Any source.

Lucidworks Is Search

Any questions?
• Try Fusion: http://lucidworks.com/products/fusion/download
• LinkedIn / Twitter / Solr JIRA: @thelabdude
