Developing a Big Data search engine: Where we have gone, where we are going

Upload: mark-miller

Post on 14-Jul-2015

TRANSCRIPT

© Cloudera, Inc. All rights reserved.

Mark Miller

Developing a Big Data search engine: Where we have gone, where we are going.

I’m Mark Miller
I’m a Lucene junkie (2006)
I’m a Lucene committer (2008)
And a Solr committer (2009)
And the current Lucene PMC Chair (2014)
And a member of the ASF (2011)
I co-created SolrCloud (????)

A Quick Tour Through History

• First there was Lucene.

• It took a little while, but soon it was ‘good enough’ to replace most enterprise search engines. And faster. And more efficient.

• Lots of Search Engines built on Lucene (I made one!)

• Then there was Solr.

• And then there were others.

...

What Search Engines Matter?

• Lucene search engines lead the pack.

• How can you tell?

• I like to look at db-engines.com

• Also, plenty of anecdotal evidence that others are using Lucene for the core.

Two Lucene-based search engines in the top 15. No other search engines.

Lucene-based search engines dominate.

“It is hopeless to talk to both of you, you don't understand virtual memory.”
Uwe Schindler @thetaph1 @uwesays

What is the future of Search?

• More NoSQL

• More Realtime Analytics

• More System of Record

• More Scale

• Search will eat away at the stack.

• Search focuses on preprocessing and efficient in-memory data structures for fast responses.
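
As a rough illustration of that last point, here is a minimal inverted-index sketch in Python. It is not Lucene's implementation: the function names `build_index` and `search_all`, the whitespace tokenizer, and the sample documents are all assumptions made up for this example. The idea it shows is that the expensive work happens once at index time, so a query becomes a few dictionary lookups and set intersections.

```python
from collections import defaultdict


def build_index(docs):
    """Preprocess documents once into a map of term -> sorted doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample data, only for illustration.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Hadoop stores data at scale",
}
index = build_index(docs)
print(search_all(index, "lucene search"))
```

A real engine adds analysis, scoring, compression, and on-disk segments on top of this, but the lookup-instead-of-scan principle is the same.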

The Solr Start - Single node, then DIY distributed

• Solr started as a single node solution, followed by master->slave replication, followed by simple distributed search.

• This was ‘good enough’ for a long time.

• Classic ‘innovator’s dilemma’ problem.

• Scaling out was super important, but not as soon as some thought and sooner than others thought.
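
To make ‘simple distributed search’ a little more concrete, here is a minimal scatter-gather sketch in Python. It is not Solr's code. The assumptions are loud ones: each shard is a plain dictionary, the score is raw term frequency, and the names `search_shard` and `distributed_search` are hypothetical. It only shows the shape of the idea: ask every shard for its local top k, then merge those partial lists into a global top k.

```python
import heapq


def search_shard(shard, term, k):
    """Score one shard's documents by term frequency and return its top k."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k):
    """Query every shard, then merge the per-shard results into a global top k."""
    partial = []
    for shard in shards:
        # In a real cluster this would be a network call per shard.
        partial.extend(search_shard(shard, term.lower(), k))
    return [doc_id for _score, doc_id in heapq.nlargest(k, partial)]


# Hypothetical two-shard collection, only for illustration.
shards = [
    {"a1": "solr solr search", "a2": "hadoop"},
    {"b1": "solr", "b2": "solr solr solr cloud"},
]
print(distributed_search(shards, "solr", 2))
```

The DIY part is everything this sketch leaves out: deciding which documents go to which shard, keeping replicas in sync, and coping with a shard that does not answer.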

SolrCloud - Solr, ‘Clusterized’

SolrCloud Meets Hadoop

• First-class integrations with:
  • HDFS
  • MapReduce
  • Spark
  • Flume
  • HBase
  • etc.

Now it’s all about scale and correctness.

• The search features for the big data world are here and rapidly advancing.

• The next step is being able to handle Hadoop scale in the ‘general’ case.

• And to be able to handle that correctly ‘enough’ of the time.

“In my opinion the whole code is a bug by itself.”
Uwe Schindler @thetaph1 @uwesays

The Call Me Maybe Tests

• https://aphyr.com/tags/jepsen

• Some basic testing around how systems live up to their CAP promises. Heavy focus on partitions.

• Most systems fail pretty badly. ZooKeeper rocked it. SolrCloud did pretty darn well*.
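
For a feel of what such a test checks, here is a small Python sketch of the bookkeeping behind a partition test: write values while the network is being disrupted, remember which writes the system acknowledged, then read everything back once the cluster has healed. This is only inspired by the posts linked above and is not Jepsen's code; the function name `check_writes` and the labels in the report are my own assumptions.

```python
def check_writes(attempted, acknowledged, final_read):
    """Compare what clients were told with what the healed cluster contains."""
    attempted = set(attempted)
    acknowledged = set(acknowledged)
    final_read = set(final_read)
    return {
        # Acknowledged to the client but missing afterwards: the bad case.
        "lost": sorted(acknowledged - final_read),
        # Never acknowledged, yet present afterwards: allowed, but worth knowing.
        "recovered": sorted((final_read & attempted) - acknowledged),
        # Present afterwards although nobody ever tried to write them.
        "unexpected": sorted(final_read - attempted),
    }


# Hypothetical run: ten writes attempted, seven acknowledged.
report = check_writes(
    attempted=range(10),
    acknowledged=[0, 1, 2, 3, 4, 6, 7],
    final_read=[0, 1, 2, 4, 5, 6, 7],
)
print(report)
```

An empty "lost" list is the pass condition here, which is a narrow claim: it only says no acknowledged write disappeared in this particular run.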

Call Me … Maybe??

• Passing is really just a bare-minimum bar. It doesn’t at all mean your system is correct.

• Your system could be complete crap and still pass.

• In fact, in the general case, all the current best search engines are still flaky at scale.

Search at Scale is still Flaky?

• Yes, yes it is. Most systems at scale are still flaky. Most systems don’t deliver on their promises.

• How does search in particular get away with it?

• Users are already used to not treating it as the system of record.

• It’s easier to scale a specialized setup than a general one - the project has to scale for the general case, while massive users only scale their own specialized case.

• We want the project to easily scale generally - no expertise needed. You can already scale pretty large, but it takes a ‘vertical’ and expertise.

Search In Particular is HARD.

• The search engine is a many-faceted beast.

• There is a lot of surface area.

• And you want it all to work, all to work in real time, and all to integrate well together.

"Lucene is maybe the world's most tested open source project."
Uwe Schindler @thetaph1 #bbuzz 2014

Lucene Testing Framework

• Lucene regularly finds bugs in new Java releases.

• Seriously. Regularly.

• Many of those bugs are fixed and fixed quickly. Many are not.

• Randomized testing, reproducible master seeds.

• “Test Beasting” and seti@home type resource requirements.
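
The ‘randomized testing, reproducible master seeds’ idea can be sketched in a few lines of Python. This is not the Lucene test framework, which is Java and far more elaborate; the helper names (`generate_case`, `check`, `run_randomized`, `beast`) and the trivial property being checked are assumptions made up for illustration. What the sketch shows is that every random input derives from one master seed, so a failure report carrying that seed is enough to replay the exact run, and that ‘beasting’ here simply means repeating the whole run under many seeds.

```python
import random


def generate_case(rng):
    """Build one random input from the given generator."""
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]


def check(values):
    """Property under test: de-duplicating and sorting keeps every distinct value, in order."""
    result = sorted(set(values))
    increasing = all(a < b for a, b in zip(result, result[1:]))
    return increasing and set(result) == set(values)


def run_randomized(master_seed, iterations=100):
    """Run many random cases; return a reproduction record on the first failure."""
    master = random.Random(master_seed)
    for i in range(iterations):
        # Each case gets its own seed, derived deterministically from the master.
        case_seed = master.getrandbits(32)
        values = generate_case(random.Random(case_seed))
        if not check(values):
            return {"master_seed": master_seed, "iteration": i, "case_seed": case_seed}
    return None  # no failure found under this seed


def beast(seeds, iterations=100):
    """Repeat the full run under many master seeds and collect the failures."""
    failures = []
    for seed in seeds:
        failure = run_randomized(seed, iterations)
        if failure is not None:
            failures.append(failure)
    return failures


print(beast(range(20)))
```

Because the inputs depend only on the seed, rerunning `run_randomized` with a reported `master_seed` regenerates the same cases on any machine, which is what makes a rare failure debuggable.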

Lucene Testing Framework

• Code checkers and build enforcers galore, as well as test-level checkers and enforcers.

• Who is policing the policeman?

• You need a vibrant community that gives a damn.

“The stack trace is only impossible if you look at the code.”
Uwe Schindler @thetaph1 @uwesays

Testing is the Key and Answer

• Just because your tests don't normally fail doesn't mean they are great. You probably just don’t normally see the problems.

• Our test framework exposes the problems - quickly.

• This has pluses and minuses, but the pluses greatly outweigh the minuses!

More on Testing

• Integration and unit tests are equally important.

• Integration tests are a little more important.

• Testing, testing, and more testing is your best friend.

• Communities grow, communities change, and one or two people can’t hold the code together.

Regular Large Scale Testing will be a challenge!

[Figure: 1000 nodes with SolrCloud Radial View]

The race for scalable search is on!

• My approach will be to leverage Hadoop as much as possible!

• Many companies are focused on Solr - there will be many approaches!

• It’s still early in the game.

Leverage Hadoop?

• A distributed filesystem is a beautiful crutch to lean on!

• Loading data at scale by itself is not Solr’s strength.

• Hadoop will push Solr to its limits and beyond.

"As a good policeman I have all open source ‘guns’ for code checking available."
Uwe Schindler @thetaph1 @uwesays

https://code.google.com/p/forbidden-apis/

http://labs.carrotsearch.com/randomizedtesting.html

Thank you

Mark Miller @heismark