Building Enterprise Search Engines Using Open Source Technologies

www.anant.us | solutions@anant.us | 202.905.2818 | 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007

Large Scale Search with Open Source Technologies

Building Search Engines

What do we do?

Streamline, Organize & Unify Business Information

Agenda

• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala

Challenge – Why does this matter?

Diagram: business information scattered across many systems.

Information categories: Knowledge; Project Information; Client Service Information; Corporate Guides; Collaborative Documents; Assets & Files; Corporate Resources

Source systems: G Drive, G Drive Delta, G Sites (KB), DropBox, OwnCloud, Nutshell, Freshbooks, Workflowy, Evernote, Pocket, Leaves, AIC (WP), Anant (WP)

Platform: Appleseed Framework (Portal, Base, Search)

Search Engine – 30 Thousand Foot View

The search index is only as good as your processed data. If you put everything you find in your index, you are going to spend a lot of time telling people how to search.
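As a rough illustration of "process before you index", the sketch below cleans and filters raw text before anything reaches the index. The class name, the minimum-length rule, and the duplicate rule are all assumptions made for this example, not rules from the talk; real pipelines would apply whatever quality criteria fit the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-indexing step: normalize text, then drop noise and duplicates. */
final class IndexPrepSketch {

    /** Collapses runs of whitespace into single spaces and trims the ends. */
    static String normalize(String raw) {
        return raw == null ? "" : raw.replaceAll("\\s+", " ").trim();
    }

    /** Keeps documents that are at least minLength characters and not a case-insensitive duplicate. */
    static List<String> prepare(List<String> rawDocs, int minLength) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String raw : rawDocs) {
            String text = normalize(raw);
            if (text.length() < minLength) {
                continue; // too short to be worth searching for
            }
            if (!seen.add(text.toLowerCase(Locale.ROOT))) {
                continue; // already queued for indexing
            }
            kept.add(text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "  Quarterly   report \n", "quarterly report", "", "ok", "Client onboarding guide");
        System.out.println(prepare(raw, 5)); // prints [Quarterly report, Client onboarding guide]
    }
}
```

Only two of the five inputs survive, which is the point: fewer, cleaner documents in the index means less time explaining to people how to search.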

Lucene – More than meets the eye

WhoNext?

Think of it as a “NoSQL” database with great indexing, everywhere.
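To make "great indexing" concrete, here is a toy inverted index, the core data structure behind Lucene-style search. This is a minimal sketch in plain Java and does not use the Lucene API; the class name, the tokenization rule, and the AND-only search are assumptions chosen to keep the example small.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the ids of the documents that contain it. */
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits on non-alphanumeric characters, and records docId under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents containing every query term (AND semantics). */
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> hits = postings.getOrDefault(term, Collections.emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(hits);   // first term seeds the result
            } else {
                result.retainAll(hits);         // later terms narrow it
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Project guide for client onboarding");
        index.add(2, "Client invoice archive");
        index.add(3, "Onboarding checklist for new client projects");
        System.out.println(index.search("client onboarding")); // prints [1, 3]
    }
}
```

A lookup touches only the posting lists for the query terms, never the documents themselves, which is why this structure scales where a full scan does not.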

Cassandra – NoSQL With Structure

WhoNext?

Think of it as a “NoSQL” database that has structure. Using CQL, you can INSERT INTO and SELECT FROM tables, just not JOIN them.

Spark – Way Better MapReduce

WhoNext?

Think of it as MapReduce, if MapReduce had been written in Scala instead of Java and built around streams. It can also be up to 100 times faster than Hadoop MapReduce for in-memory workloads.
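The "MapReduce with streams" idea can be sketched with a word count written as a single pipeline. This uses plain Java streams on one machine, not the Spark API; it only illustrates the shape of the computation (map, group, reduce) that Spark distributes across a cluster. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Word count as a map / group / reduce pipeline over an in-memory list of lines. */
final class WordCountSketch {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // "map" phase: split every line into lowercase words
                .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\s+")))
                .filter(word -> !word.isEmpty())
                // "shuffle + reduce" phase: group identical words and count each group
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("open source search", "open source indexing");
        System.out.println(wordCount(lines)); // prints {indexing=1, open=2, search=1, source=2}
    }
}
```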

Configuring - Solr - 1/3

Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.

• Configuration - Schema
  – Data Types
  – Pre-Processing
  – Collection Definitions
  – Managed vs. Unmanaged

• Configuration - ZooKeeper
  – Synchronize Configurations
  – Distribute Shards
  – Manage Replicas
  – Elect Leaders

• Configuration - SolrConfig
  – Handlers
  – Components
  – Indexing Configurations
  – Memory / Cache
  – File System

• Lessons Learned
  – Try to use it out of the box
  – Try to configure your way
  – Make sure to upgrade
  – Not everything can be configured

Configuring - Solr - 2/3

• Before Docker
  – Set up ZooKeeper
    • Customize zoo.cfg
    • Set up ZooKeeper servers
  – Set up Solr distro
    • Download Solr
    • Clean up Solr
    • Customize schema.xml
    • Customize solrconfig.xml
    • Set up different Solr servers
  – Start the cloud
    • Custom start scripts

• Today w/ Docker

    docker run --name zookeeper \
      -p 127.0.0.1:2181:2181 \
      -p 127.0.0.1:2888:2888 \
      -p 127.0.0.1:3888:3888 \
      jplock/zookeeper

    docker run --link zookeeper:ZK -i \
      -p 127.0.0.1:8983:8983 \
      -t dockerimages/docker-solr \
      /bin/bash -c '\
      cd /opt/solr/example; \
      java -jar \
      -Dbootstrap_confdir=./solr/collection1/conf \
      -Dcollection.configName=myconf \
      -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT \
      -DnumShards=2 \
      start.jar';

https://hub.docker.com/r/dockerimages/docker-solr/

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production

Configuring - Solr - 3/3

• SolrConfig - Example
• Schema - Example

https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml

https://wiki.apache.org/solr/SchemaXml

SolrCloud / ZooKeeper

User Interface - Super Advanced

Customizing - Solr - 1/3

Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.

• Customization - Parsing
  – Need Specialized Syntax?
  – Java or Scala Based
  – Open Plugin Structure
  – Several Examples

• Customization - Highlighting
  – Need Special Coloring?
  – Specialized Syntax Aware
  – Open Plugin Structure
  – Several Examples

• Customization - Term Counts
  – Need Specific Information?
  – Create the Logic
  – Register the Component
  – Complicated Examples

• Lessons Learned
  – Major version upgrades = pain
  – Newer classes can be extended better
  – Long term investment

Customizing - Solr - 2/3

• Custom Component in Scala or Java
• Installing the Component

http://wiki.apache.org/solr/SolrPlugins

http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html

Customizing - Solr - 3/3

Creating a Custom Parser with Scala

Building a parser in Scala wasn’t my first choice, but creating it in Scala made me see how much better the language is.

• Why a Specialized Syntax?
  – Legacy Syntax
  – Boolean AND Proximity Queries
  – Specialized Fielded Expressions
  – Ranges / Classifications

• Why not ANTLR or JavaCC?
  – Old parser was in parboiled (v1)
  – parboiled2 was in Scala
  – No need to learn a separate syntax for creating syntax

• Lessons Learned
  – parboiled2 documentation = bad
  – Understand the syntax
  – Interactive REPL in Scala = good
  – Write tons of unit tests
  – Long term investment

• Customizing Solr with Scala
  – Found a good virtual mentor
  – Learned Scala (not for Spark)
  – Started from the ground up
  – Reduced from ~1k to 400 LOC
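To show what "Boolean AND proximity queries" demand of a parser, here is a small hand-written recursive-descent sketch. It is written in plain Java rather than parboiled2, and it is not the parser from the talk: the `W/n` proximity operator, the operator precedence (proximity binds tighter than AND, which binds tighter than OR), and the s-expression output are all assumptions made up for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical query syntax: terms, AND, OR, parentheses, and "a W/3 b" (a within 3 words of b). */
final class QueryParserSketch {
    private final List<String> tokens;
    private int pos = 0;

    private QueryParserSketch(String query) {
        // Pad parentheses so a whitespace split yields them as separate tokens.
        String padded = query.replace("(", " ( ").replace(")", " ) ").trim();
        this.tokens = padded.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(padded.split("\\s+"));
    }

    /** Parses the query and returns its tree as an s-expression string. */
    static String parse(String query) {
        QueryParserSketch p = new QueryParserSketch(query);
        String tree = p.orExpr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.peek());
        }
        return tree;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // orExpr := andExpr ("OR" andExpr)*
    private String orExpr() {
        String left = andExpr();
        while ("OR".equals(peek())) {
            pos++;
            left = "(OR " + left + " " + andExpr() + ")";
        }
        return left;
    }

    // andExpr := proximity ("AND" proximity)*
    private String andExpr() {
        String left = proximity();
        while ("AND".equals(peek())) {
            pos++;
            left = "(AND " + left + " " + proximity() + ")";
        }
        return left;
    }

    // proximity := primary ("W/n" primary)*
    private String proximity() {
        String left = primary();
        while (peek() != null && peek().matches("W/\\d+")) {
            String distance = tokens.get(pos++).substring(2);
            left = "(WITHIN " + distance + " " + left + " " + primary() + ")";
        }
        return left;
    }

    // primary := TERM | "(" orExpr ")"
    private String primary() {
        String token = peek();
        if (token == null) {
            throw new IllegalArgumentException("Unexpected end of query");
        }
        pos++;
        if (token.equals("(")) {
            String inner = orExpr();
            if (!")".equals(peek())) {
                throw new IllegalArgumentException("Missing closing parenthesis");
            }
            pos++;
            return inner;
        }
        if (token.equals(")") || token.equals("AND") || token.equals("OR") || token.matches("W/\\d+")) {
            throw new IllegalArgumentException("Expected a term but found: " + token);
        }
        return token.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints (AND (OR solr lucene) (WITHIN 3 index shard))
        System.out.println(parse("(solr OR lucene) AND index W/3 shard"));
    }
}
```

Each grammar rule becomes one method, which is roughly the structure a parser-combinator library such as parboiled2 lets you declare directly instead of writing by hand. The failure cases (dangling operators, unbalanced parentheses) are where the "write tons of unit tests" lesson applies.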

JavaCC vs. parboiled2 (Scala)

• JavaCC - SurroundQuery.jj
• Scala-based parboiled2

Questions & Contact

@anantcorp

facebook.com/anantCorp

linkedin.com/company/anant

rahul@anant.us

linkedin.com/in/xingh

Rahul Singh, CEO & Founder

Questions & Contact

• Brown Bag Session or Meetup?
• Modern Enterprise
• Mastering Services in the Service of Others
• Hybrid Agile Project Management
• Building Search Engines
• CICD / DevOps
• Connecting Internet Software

Streamlined Data: Integration / Data Pipelines

Organized Knowledge: Search / Data Warehouses

Unified Interfaces: Portals / Dashboards / Mobile
