Webinar: Solr & Fusion for Big Data
TRANSCRIPT
![Page 1: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/1.jpg)
![Page 2: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/2.jpg)
Solr & Fusion for Big Data
• Where does search fit in the big data landscape?
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion
![Page 3: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/3.jpg)
The standard for enterprise search.
90% of the Fortune 500 uses Solr.
![Page 4: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/4.jpg)
Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) + visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
![Page 5: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/5.jpg)
Common use case: log analysis
• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale
![Page 6: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/6.jpg)
Time-based Partitioning Scheme
[Diagram elements: Fusion Log Analytics Dashboard; recent_logs (collection alias); daily collections logs_feb01 … logs_feb25, logs_feb26; shards h00 … h22, h23 shown inside the daily collections]
• Add replicas to support higher query volume & fault-tolerance
• Use a collection alias to make multiple collections look like a single collection; minimize exposure to the partitioning strategy in the client layer
• Every daily collection has 24 shards (h00-h23), each covering a 1-hour block of log messages
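The naming scheme on this slide can be sketched in a few lines of Python. This is only an illustration, not how Fusion does it: the function names, the `localhost:8983` base URL, and the year in the example timestamp are assumptions. The alias URL uses the Solr Collections API `CREATEALIAS` action; the request is built but not sent.

```python
from datetime import datetime
from urllib.parse import urlencode

# Month abbreviations are spelled out to avoid locale-dependent strftime output.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def daily_collection(ts):
    """Name of the daily collection for a timestamp, e.g. logs_feb26."""
    return f"logs_{MONTHS[ts.month - 1]}{ts.day:02d}"


def hourly_shard(ts):
    """Name of the 1-hour shard inside the daily collection (h00 to h23)."""
    return f"h{ts.hour:02d}"


def create_alias_url(solr_base, alias, collections):
    """Build (but do not send) a Collections API CREATEALIAS request URL."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Example timestamp; the year is an assumption, the slide only shows month and day.
ts = datetime(2016, 2, 26, 22, 15)
print(daily_collection(ts), hourly_shard(ts))  # logs_feb26 h22
print(create_alias_url("http://localhost:8983/solr", "recent_logs",
                       ["logs_feb26", "logs_feb25"]))
```

Because clients only ever address `recent_logs`, the set of daily collections behind the alias can change without touching client code.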
![Page 7: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/7.jpg)
Solr on HDFS
• Maturing solution, still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas share index (SOLR-6237)
![Page 8: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/8.jpg)
Solr on HDFS
[Diagram: two Solr nodes (shard1 / replica1 and shard1 / replica2), each with its own block cache, issue reads and writes against HDFS DataNodes A, B, and C. Two replication paths are labeled: Solr replication and HDFS block replication.]
![Page 9: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/9.jpg)
Auto Add Replica
[Diagram: Solr shard1 / replica1 and shard1 / replica2, each with a block cache, read from and write to HDFS DataNodes A, B, and C; Solr replication and HDFS block replication are labeled. An overseer and ZooKeeper are connected by a "watches" label, and a third node, Solr shard1 / replica3, has its own reads and writes.]
![Page 10: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/10.jpg)
Indexing Strategies
• Many tools available!
• MapReduce indexer (Solr contrib)
• LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase
• Storm to Solr or Fusion (github.com/LucidWorks/storm-solr)
• Spark to Solr or Fusion (github.com/LucidWorks/spark-solr)
• Lucidworks Fusion Connectors
![Page 11: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/11.jpg)
Any Data. Any Source.
![Page 12: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/12.jpg)
Fusion Indexing Pipelines in MapReduce
[Diagram elements: HDFS; … N map tasks (1 per block); Map Task (or reducer if needed) containing a Fusion Pipeline and CloudSolrClient; docs; ZooKeeper; Solr]
• Get collection metadata from ZooKeeper (e.g. shard leader URL)
• Send updates to shard leaders in parallel
• 30+ index stages: field mapping, JavaScript, Tika parsing, NLP, regex, JDBC lookup
• Many common file formats supported: CSV, SequenceFile, grok, XML, WARC
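The "look up shard leaders, then send updates in parallel" step on this slide can be sketched roughly as below. This is not CloudSolrClient's implementation. The leader URLs are hypothetical, `send_batch` is a placeholder for an HTTP update request, and the CRC32-based router is a stand-in for Solr's own document routing.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard leader URLs; a real client would read these from ZooKeeper.
LEADERS = [
    "http://solr1:8983/solr/logs_shard1_replica1",
    "http://solr2:8983/solr/logs_shard2_replica1",
]


def leader_for(doc_id, leaders):
    """Stand-in router: CRC32 of the id modulo the shard count (not Solr's router)."""
    return leaders[zlib.crc32(doc_id.encode("utf-8")) % len(leaders)]


def group_by_leader(docs, leaders):
    """Split documents into one batch per shard leader."""
    batches = defaultdict(list)
    for doc in docs:
        batches[leader_for(doc["id"], leaders)].append(doc)
    return dict(batches)


def send_batch(leader_url, batch):
    """Placeholder for an update request; reports how many docs it would send."""
    return leader_url, len(batch)


def index_in_parallel(docs, leaders, send=send_batch):
    """Send each leader's batch concurrently and collect per-leader doc counts."""
    batches = group_by_leader(docs, leaders)
    with ThreadPoolExecutor(max_workers=max(1, len(batches))) as pool:
        futures = [pool.submit(send, url, batch) for url, batch in batches.items()]
        return dict(future.result() for future in futures)


docs = [{"id": f"doc{i}", "message": "example log line"} for i in range(10)]
print(index_in_parallel(docs, LEADERS))
```

Sending directly to shard leaders avoids an extra forwarding hop inside the cluster, which is the reason the slide has the client fetch leader URLs first.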
![Page 13: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/13.jpg)
Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for Hadoop)
![Page 14: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/14.jpg)
Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
![Page 15: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/15.jpg)
Lambda Architecture
• Search engine fits perfectly with lambda
• Use batch layer to build indexes instead of “views”
• Speed layer uses Spark Streaming to build near real-time index
• Aggregation collections for historical data
source: http://lambda-architecture.net/
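As one way to picture an "aggregation collection for historical data", the sketch below rolls raw log events up into one count document per hour and log level. It assumes a batch job would index these documents into a separate collection. The field names (`timestamp`, `level`, `count`) and the id format are hypothetical and do not come from the webinar.

```python
from collections import Counter
from datetime import datetime


def hourly_rollup(events):
    """Count events per (hour, level) and emit one aggregate document per bucket."""
    counts = Counter()
    for event in events:
        # Truncate the timestamp to the start of its hour.
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        counts[(hour, event["level"])] += 1
    return [
        {
            "id": f"{hour:%Y%m%dT%H}_{level}",
            "hour": hour.isoformat(),
            "level": level,
            "count": n,
        }
        for (hour, level), n in sorted(counts.items())
    ]


events = [
    {"timestamp": datetime(2016, 2, 26, 10, 5), "level": "ERROR"},
    {"timestamp": datetime(2016, 2, 26, 10, 40), "level": "ERROR"},
    {"timestamp": datetime(2016, 2, 26, 10, 41), "level": "INFO"},
    {"timestamp": datetime(2016, 2, 26, 11, 2), "level": "ERROR"},
]
for doc in hourly_rollup(events):
    print(doc)
```

Queries over long time ranges can then hit the small rollup documents instead of scanning every raw log message.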
![Page 16: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/16.jpg)
Spark
[Diagram of the Spark stack]
• Libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
• Engine: Spark Core (execution model, the shuffle, caching)
• Cluster mgmt: Hadoop YARN, Mesos, Standalone
• HDFS
• Shared memory: Tachyon
• Languages: Scala, Java, Python, R
![Page 17: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/17.jpg)
The most relevant results every single time.
Massive scale. Real-time. Secure.
Any data. Any source.
![Page 18: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/18.jpg)
Lucidworks Is Search
![Page 19: Webinar: Solr & Fusion for Big Data](https://reader034.vdocuments.mx/reader034/viewer/2022050614/58aa47fb1a28ab4c348b65a5/html5/thumbnails/19.jpg)
Any questions?
• Try Fusion: http://lucidworks.com/products/fusion/download
• LinkedIn / Twitter / Solr JIRA: @thelabdude