Solr at AOL, presented by Sean Timm at SolrExchange DC

Post on 11-May-2015

TRANSCRIPT

Solr and Lucene @ AOL

SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING

1999

• Believe, Cher and Livin’ la Vida Loca, Ricky Martin

• The Matrix and The Phantom Menace

• Windows 98 Second Edition

• AltaVista, Northern Light, Yahoo, ODP, Inktomi

  – Google

• PPC text search ads invented 1998

  – Banner ads

A Brief History of Search @ AOL

• Acquired PLS in 1998

• AOL Search used ODP

• Site Search

• Local Search

• Built into AOL Server

• CPL

  – VSM then BM25

  – Phrase, numeric, date, text, and proximity boosting

  – Conflation classes (like synonyms)

Relevance

• Precision/recall

• “free alcohol” vs. “alcohol free”

• Lawyer versus Attorney

• Iron and ironic same stem (Porter)

• Beyonce vs. Beyoncé

• Eagles

  – Bird, sports teams, band, AMC Eagle

• F 15, F-15, F15

• FREAK

Figure: Relevant vs. Retrieved
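
The precision/recall trade-off on this slide can be made concrete with a small calculation. The following is a minimal sketch in Python, not taken from the talk; the document IDs and the function name `precision_recall` are made up for illustration. Precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: four relevant documents, three returned by the engine.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2).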

The Dawn of Solr

• Prohibitively expensive to continue CPL development

• Complicated deployment

• 2005: Investigating migration to Lucene

• 2006: CNET open sourced Solr

Contributions

• Local Lucene/Solr (superseded by SpatialSearch)

• Query Timeout

• Data Import Handler (DIH)

• Numerous smaller patches

• Committers: Noble Paul, Shalin Mangar, Patrick O’Leary

Contributing to Solr/Lucene

• Learn

  – Join the mailing lists

    • solr-user@lucene.apache.org

    • dev@lucene.apache.org

  – Read search and Solr related blogs

  – The #solr IRC channel on freenode

Contributing to Solr/Lucene

• Help others

  – Answer questions.

  – Improve documentation in the code, the wiki, or the website.

  – Make improvements to the Solr Admin UI.

Contributing to Solr/Lucene

• Confirm a bug

• Submit a patch for a reported bug or feature request

• Improve a patch

• Try out a patch and see if it works

Contributing to Solr/Lucene

• Submit your own tickets

  – Bug

  – Feature request

• Start with solr-user@lucene

• Discuss on dev@lucene

• Create Jira ticket, ideally with patches and unit tests

• Yonik’s Law of Patches:

  – A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.

Applications

• MapQuest (SpatialSearch)

• Mail

• AIM

• AOL Search

• Site Search

• News Search

• RUM

• Sarah Palin e-mails (admin)

• Demand

• Wikipedia article pattern detection

MapQuest Discover

Travel Blogs

MQ Local Search

Related Searches

Bipartite graph snippet

Related Searches Graph

“The Eagles”

The band

NFL

Boston College

Hotel California

Tribute

Related Searches

• Simple query

  – User

    • New York Library

  – Solr query

    • Lower case

    • Prefer exact match “new york library”

    • Use phrase slop to allow terms in same order and near each other, e.g., new york city public library

    • primeQuery:"new york library" OR "new york library"~3
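
The query construction described on this slide can be sketched in a few lines of Python. This is only an illustration of the steps listed (lower-casing, an exact phrase clause, and a sloppy phrase clause), not the code used at AOL. The function name `related_search_query` is hypothetical, and the field name and slop value are simply copied from the slide's example.

```python
def related_search_query(user_query, field="primeQuery", slop=3):
    """Build a Solr query string: exact phrase OR the same phrase with slop."""
    # Drop double quotes so user text cannot break out of the phrase,
    # then lower-case and collapse runs of whitespace.
    cleaned = user_query.replace('"', " ")
    phrase = " ".join(cleaned.lower().split())
    exact = f'{field}:"{phrase}"'   # prefer an exact phrase match
    sloppy = f'"{phrase}"~{slop}'   # same terms, allowed to sit near each other
    return f"{exact} OR {sloppy}"


print(related_search_query("New York Library"))
```

As on the slide, only the first clause carries the field prefix, so in standard Lucene query syntax the sloppy clause would be evaluated against the default field.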

Wikipedia Traffic Correlation Schema

<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />

<!-- Dynamic field definitions. If a field name is not found, dynamicFields
     will be used if the name matches any of the patterns.
     RESTRICTION: the glob-like pattern in the name attribute must have
     a "*" only at the start or the end.
     EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
     Longer patterns will be matched first. If equal size patterns
     both match, the first appearing in the schema will be used. -->

<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>

<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
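
The matching rule stated in the schema comment (a single "*" at the start or end, longer patterns first, ties broken by schema order) can be illustrated with a short Python sketch. This is a simplified model of the rule as the comment words it, not Solr's actual implementation, and the function name `match_dynamic_field` is made up for this example. It only considers dynamic patterns and ignores explicitly declared fields.

```python
def match_dynamic_field(name, patterns):
    """Return the dynamicField pattern chosen for `name`, or None.

    `patterns` is listed in schema order; each has "*" only at the
    start or the end, as the schema comment requires.
    """
    candidates = []
    for index, pattern in enumerate(patterns):
        if pattern.startswith("*"):
            matched = name.endswith(pattern[1:])
        elif pattern.endswith("*"):
            matched = name.startswith(pattern[:-1])
        else:
            matched = False
        if matched:
            # Sort key: longest pattern first, then earliest in the schema.
            candidates.append((-len(pattern), index, pattern))
    return min(candidates)[2] if candidates else None


schema_patterns = ["trend_*", "pvs_*"]
print(match_dynamic_field("trend_20110622", schema_patterns))
print(match_dynamic_field("pvs_20110622", schema_patterns))
```

With the two patterns from this schema, a per-day field such as `pvs_20110622` resolves to `pvs_*`, so each new date adds a field without a schema change.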

Temporal Traffic Correlation of Wikipedia Page Views

Sarah Palin E-mail Stats

• 13,177 documents

• 4 hours from receiving data to production install

• ~150 K requests per day at launch

• Now about 6-7 K requests per day

• Running on 3 VMs in two different data centers behind a NetScaler

Faceting and Clustering

Huffington Post Comments

• Solr 4

• Uses Solr Cloud

• Single shard

• ReplicationFactor 3

• Real-time

• 90 days of comments

• Tested up to 100 writes / second
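
A single-shard collection with a replication factor of 3, as listed above, could be requested through the SolrCloud Collections API. The Python sketch below only builds the request URL. The host, port, and collection name `comments` are assumptions for illustration, and the talk does not say how the collection was actually created.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=1, replication_factor=3):
    """Build a Collections API CREATE URL for a SolrCloud collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,                  # single shard, per the slide
        "replicationFactor": replication_factor,  # three copies of the shard
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name.
url = create_collection_url("http://localhost:8983/solr", "comments")
print(url)
```

Depending on the cluster setup, further parameters (for example the name of a config set stored in ZooKeeper) may also be needed.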

More HuffPost comments

• Used by editors and moderators

  – Topic investigation

  – Troll detection

• Config

  – Special features: search for emoticons, prefer exact match, date boosting

• Hack-a-thon comment clustering, timeline, and summarization

Solr Comments Architecture

Message Queue

MongoDB

MongoIngestor

Solr Ingestor

Solr Cloud

Uses SolrJ CloudSolrServer

Tools Server

JuLiA

Relevance in Solr

• “free alcohol” vs. “alcohol free”

  – Phrase queries and phrase slop

• Lawyer versus Attorney

  – SynonymFilterFactory

• Iron and ironic

  – Kstem, or Lemmatization via the SynonymFilterFactory instead of Snowball/Porter

Relevance in Solr

• Beyonce vs. Beyoncé

  – Various Folding Filters

• Eagles

  – Boost on other fields, such as popularity, publish date

  – Use related searches, facets, or clustering

• F 15, F-15, F15

  – WordDelimiterFilter
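
To show what folding and word-delimiter handling achieve, here is a rough Python emulation of the two normalizations. It is not how Solr's filters are implemented and does not reproduce their options. In Solr itself this work is done by analysis filters configured in the schema, such as ASCIIFoldingFilterFactory and WordDelimiterFilterFactory. The function names are hypothetical.

```python
import re
import unicodedata


def fold_accents(text):
    """Strip diacritics so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def catenate_parts(text):
    """Join alphanumeric runs and lower-case them: 'F-15' becomes 'f15'."""
    return "".join(re.findall(r"[A-Za-z0-9]+", text)).lower()


print(fold_accents("Beyoncé") == "Beyonce")
print({catenate_parts(v) for v in ["F 15", "F-15", "F15"]})
```

Both helpers map the variant spellings from the slide onto one form, which is the effect the index-time and query-time filters aim for.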

Bringing a New Search Project Online

• Understand the domain

• Ingest (sample) data

• Clean data

• Repeat

• Relevance testing

• Scale out

• Launch/Success
