SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr

DESCRIPTION

"Integrating Hadoop and Solr" - Yann Yu, Lucidworks

TRANSCRIPT

Page 1: SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
Page 2

Who am I?

Yann Yu, Systems Engineer @ Lucidworks

Page 3

Lucidworks is Search.

Technology, Retail, Financial Services, Industrial, Healthcare

Page 4

Why would you integrate Hadoop and Solr? (And how would you do that?)

Page 5

Hadoop:

• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals for extensibility

Solr:

• Open-source, Lucene-based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production environments at massive scale

Page 6

I have Hadoop, why do I need Solr?

• NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data

• Empower users of all technical abilities to interact with, and derive value from, big data through a natural-language search interface (no MapReduce, Pig, SQL, etc.)

• Preliminary data exploration and analysis

• Near real-time indexing and querying

• Thousands of simultaneous, parallel requests

• Share machine-learning insights created on Hadoop with a broad audience through an interactive medium

Hadoop excels at storing and working with large amounts of data, but has difficulty with frequent, random access to it (a query sketch follows below)
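As a concrete illustration of that front-end role, here is a minimal SolrJ sketch of an ad-hoc, full-text query with a facet. The endpoint, collection name (enterprise_docs), and field names are assumptions for the example, not something from the talk:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class AdHocSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr endpoint and collection name
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/enterprise_docs").build()) {
                // Full-text query; the "content" field is an assumption
                SolrQuery query = new SolrQuery("content:\"customer churn\"");
                query.setRows(10);
                query.addFacetField("department"); // assumed facet field
                QueryResponse rsp = solr.query(query);
                for (SolrDocument doc : rsp.getResults()) {
                    System.out.println(doc.getFieldValue("id"));
                }
            }
        }
    }

No MapReduce job is involved: the query returns in milliseconds, which is the point of putting Solr in front of Hadoop.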

Page 7

I have Solr, why do I need Hadoop?

• Least expensive storage solution on the market

• Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr

• Store Solr indexes and transaction logs within HDFS (a config sketch follows below)

• Augment Solr data by storing additional information in Hadoop for last-second retrieval

As Solr indexes grow in size, the size and number of the machines hosting Solr must also grow, increasing indexing time and complexity
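Storing indexes and transaction logs on HDFS is supported directly by Solr through its HdfsDirectoryFactory. A minimal solrconfig.xml sketch follows; the NameNode address and HDFS path are placeholders for a real cluster:

    <!-- solrconfig.xml: keep the index and transaction log on HDFS -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <!-- hdfs://namenode:8020/solr is a placeholder -->
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
    </directoryFactory>

    <indexConfig>
      <lockType>hdfs</lockType>
    </indexConfig>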

Page 8


So what does this actually look like?

Page 9

The enterprise storage situation today

Page 10

Enterprise data deployment: standard document storage and search

• Enterprise documents are stored in HDFS

• The Lucidworks HDFS connector processes the documents and sends them to SolrCloud

• Users make ad-hoc, full-text queries across the full content of all documents in Solr, and retrieve source files directly from HDFS as necessary

Page 11

Sink documents into HDFS

• Documents can be migrated from other file storage systems via Flume or other scripts (a sketch follows below)

• MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering)
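For the "other scripts" route, a few lines against the Hadoop FileSystem API are enough to sink a file into HDFS. This is a sketch; the NameNode address and paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SinkToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode
            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a local document into the HDFS directory Solr will index from
                fs.copyFromLocalFile(new Path("/data/incoming/report.pdf"),
                                     new Path("/enterprise/docs/report.pdf"));
            }
        }
    }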

Page 12

Index document contents into Solr

• The Lucidworks Hadoop connector parses content from files using many different tools (Tika, GrokIngest, CSV mapping, Pig, etc.)

• Content and data are added to fields in a Solr document

• The resulting document is sent to Solr for indexing (sketched below)
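The connector itself runs as MapReduce jobs over HDFS, but the per-document idea can be sketched with plain Tika and SolrJ. Everything below (paths, field names, collection name) is illustrative, not the connector's actual code:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ParseAndIndex {
        public static void main(String[] args) throws Exception {
            // 1. Parse the file with Tika: text goes to the handler, metadata to Metadata
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no size limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("/data/report.pdf"))) {
                parser.parse(in, handler, metadata);
            }

            // 2. Map extracted content into Solr fields (field names are assumptions)
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "report.pdf");
            doc.addField("content", handler.toString());

            // 3. Send the document to Solr for indexing
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/enterprise_docs").build()) {
                solr.add(doc);
                solr.commit();
            }
        }
    }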

Page 13

Enable users to search and access content

• Users are empowered with ad-hoc, full-text search in Solr

• Solr provides standard search tools such as autocomplete, more-like-this, spellchecking, and faceting (an example request follows below)

• Users only access HDFS as needed
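For a sense of what those tools look like at the API level, a single Solr request can combine several of them. This is a hypothetical request against an assumed enterprise_docs collection, and it assumes the spellcheck and highlighting components are configured:

    http://localhost:8983/solr/enterprise_docs/select
        ?q=content:"annual report"
        &facet=true
        &facet.field=department
        &spellcheck=true
        &spellcheck.collate=true
        &hl=true
        &hl.fl=content

(The request is wrapped here for readability; in practice it is a single URL.)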

Page 14

Log record search: high-volume indexing of many small records

• Machine-generated log records are sent to Flume

• Flume forwards the raw log records to Hadoop for archiving

• Flume simultaneously parses the data in each record into a Solr document, forwarding the resulting document to Solr

• Lucidworks SiLK exposes real-time statistics and analytics, as well as full-text search, to end users

Page 15

Flume archives data in HDFS

• Flume performs minimal work on log files and sends them directly into HDFS for archival

• Under optimal circumstances, the log files are sized to match the HDFS block size (see the sketch below)
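A sketch of what such a Flume agent definition might look like, with placeholder names and paths; hdfs.rollSize is set to roughly one 128 MB HDFS block so each archived file fills a block:

    # flume.conf (sketch): spool log files into HDFS for archival
    agent.sources = logs
    agent.channels = mem
    agent.sinks = archive

    agent.sources.logs.type = spooldir
    agent.sources.logs.spoolDir = /var/log/incoming
    agent.sources.logs.channels = mem

    agent.channels.mem.type = memory
    agent.channels.mem.capacity = 10000

    agent.sinks.archive.type = hdfs
    agent.sinks.archive.channel = mem
    agent.sinks.archive.hdfs.path = hdfs://namenode:8020/logs/%Y/%m/%d
    agent.sinks.archive.hdfs.fileType = DataStream
    agent.sinks.archive.hdfs.useLocalTimeStamp = true
    # Roll files at ~128 MB to match a typical HDFS block size
    agent.sinks.archive.hdfs.rollSize = 134217728
    agent.sinks.archive.hdfs.rollCount = 0
    agent.sinks.archive.hdfs.rollInterval = 0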

Page 16

Flume submits records to Solr

• Flume processes each record, extracting strings, ints, dates, times, and other information into Solr fields (a sketch follows below)

• Once the Solr document is created, it is submitted to Solr for indexing

• This happens in real time, allowing for near real-time search
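In Flume this parsing is typically done by interceptors or a morphline; the sketch below shows the same transformation in plain Java with SolrJ so the idea is visible. The record format, regex, and field names are invented for the example:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class LogToSolr {
        // Hypothetical record format: "2014-07-15T10:01:02 ERROR auth Login failed"
        private static final Pattern LOG =
                Pattern.compile("(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)");

        public static void main(String[] args) throws Exception {
            // ConcurrentUpdateSolrClient batches adds, suiting high-volume indexing
            try (ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                    "http://localhost:8983/solr/logs").withQueueSize(10000).build()) {
                Matcher m = LOG.matcher("2014-07-15T10:01:02 ERROR auth Login failed");
                if (m.matches()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("timestamp", m.group(1));
                    doc.addField("level", m.group(2));
                    doc.addField("service", m.group(3));
                    doc.addField("message", m.group(4));
                    solr.add(doc);
                }
                // Near real-time visibility would come from autoSoftCommit in solrconfig.xml
            }
        }
    }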

Page 17

Real-time analytics dashboard

• Lucidworks SiLK allows users to create simple dashboards through a GUI

• The Banana dashboard issues queries to Solr (such as the example below) and renders the returned data in tables, graphs, and other plots

• Users can also perform full-text search across the data, allowing filtering at extremely fine granularity
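A time-series panel in such a dashboard boils down to a Solr range facet over a timestamp field. A hypothetical example of the kind of request a panel might issue (collection and field names assumed, URL wrapped for readability):

    http://localhost:8983/solr/logs/select
        ?q=level:ERROR
        &facet=true
        &facet.range=timestamp
        &facet.range.start=NOW-1HOUR
        &facet.range.end=NOW
        &facet.range.gap=%2B1MINUTE

Solr returns one bucket per minute, which the dashboard renders as a histogram; adding a full-text query to q narrows every panel at once.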

Page 18
Page 19

End

Any questions?

Find me at: [email protected]

@yawnyou