Solr at AOL, Presented by Sean Timm at SolrExchange DC

Upload: lucidworks-archived

Post on 11-May-2015

Solr and Lucene @ AOL

SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING

1999

• Believe, Cher and Livin’ la Vida Loca, Ricky Martin

• The Matrix and The Phantom Menace

• Windows 98 Second Edition

• AltaVista, Northern Light, Yahoo, ODP, Inktomi

– Google

• PPC Text search ads invented 1998

– Banner ads

A Brief History of Search @ AOL

• Acquired PLS in 1998

• AOL Search used ODP

• Site Search

• Local Search

• Built into AOL Server

• CPL

– VSM then BM25

– Phrase, numeric, date, text, and proximity boosting

– Conflation classes (like synonyms)

Relevance

• Precision/recall

• “free alcohol” vs. “alcohol free”

• Lawyer versus Attorney

• Iron and ironic same stem (Porter)

• Beyonce vs. Beyoncé

• Eagles

– Bird, sports teams, band, AMC Eagle

• F 15, F-15, F15

• FREAK

[Figure labels: Relevant, Retrieved]
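
The precision/recall bullet on this slide can be made concrete with a small calculation over the relevant and retrieved sets. The sketch below is illustrative only: the helper name `precision_recall` and the document IDs are assumptions, not anything from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |relevant AND retrieved| / |retrieved|
    recall    = |relevant AND retrieved| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the sketch never divides by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
# precision=0.50 recall=0.67
```

Half of what was retrieved is relevant (precision), and two of the three relevant documents were found (recall). Most of the remaining bullets on this slide are cases where improving one of the two numbers tends to hurt the other.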

The Dawn of Solr

• Prohibitively expensive to continue CPL development

• Complicated deployment

• 2005: Investigating migration to Lucene

• 2006: CNET open sourced Solr

Contributions

• Local Lucene/Solr (superseded by SpatialSearch)

• Query Timeout

• Data Import Handler (DIH)

• Numerous smaller patches

• Committers: Noble Paul, Shalin Mangar, Patrick O’Leary

Contributing to Solr/Lucene

• Learn

–Join the mailing lists

    • [email protected]

    • [email protected]

–Read search and Solr related blogs

–The #solr IRC channel on freenode

Contributing to Solr/Lucene

• Help others

–Answer questions.

–Improve documentation in the code, the wiki, or the website.

–Make improvements to the Solr Admin UI.

Contributing to Solr/Lucene

• Confirm a bug

• Submit a patch for a reported bug or feature request

• Improve a patch

• Try out a patch and see if it works

Contributing to Solr/Lucene

• Submit your own tickets

– Bug

– Feature request

• Start with solr-user@lucene

• Discuss on dev@lucene

• Create Jira ticket, ideally with patches and unit tests

• Yonik’s Law of Patches:

– A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.

Applications

• MapQuest (SpatialSearch)

• Mail

• AIM

• AOL Search

• Site Search

• News Search

• RUM

• Sarah Palin e-mails (admin)

• Demand

• Wikipedia article pattern detection

MapQuest Discover

Travel Blogs

MQ Local Search

Related Searches

Bipartite graph snippet

Related Searches Graph

[Graph node labels: “The Eagles”, The band, NFL, Boston College, Hotel California, Tribute]

Related Searches

• Simple query

– User

    • New York Library

– Solr query

    • Lower case

    • Prefer exact match “new york library”

    • Use phrase slop to allow terms in same order and near each other, e.g., new york city public library

    • primeQuery:“new york library” OR “new york library”~3
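
The query-building steps on this slide (lower-case the input, include an exact phrase clause, add a sloppy phrase clause) can be sketched as a small helper. This is an assumption-laden illustration, not AOL's code: the function name `build_related_query` is hypothetical, and it groups both clauses under the field with parentheses, which differs slightly from the query as written on the slide.

```python
def build_related_query(user_query, field="primeQuery", slop=3):
    """Build an exact-phrase OR sloppy-phrase query string for one field."""
    # Lower-case and collapse whitespace, as in the "Lower case" step.
    phrase = " ".join(user_query.lower().split())
    # Escape characters that would break out of a quoted phrase.
    phrase = phrase.replace("\\", "\\\\").replace('"', '\\"')
    exact = f'"{phrase}"'
    sloppy = f'"{phrase}"~{slop}'
    # Parentheses apply the field name to both clauses.
    return f"{field}:({exact} OR {sloppy})"


print(build_related_query("New York Library"))
# primeQuery:("new york library" OR "new york library"~3)
```

An exact match satisfies both clauses, while a near match such as "new york city public library" satisfies only the sloppy one, which is one way of expressing the "prefer exact match" step.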

Wikipedia Traffic Correlation Schema

<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />

<!-- Dynamic field definitions. If a field name is not found, dynamicFields
     will be used if the name matches any of the patterns.
     RESTRICTION: the glob-like pattern in the name attribute must have
     a "*" only at the start or the end.
     EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
     Longer patterns will be matched first. if equal size patterns
     both match, the first appearing in the schema will be used. -->

<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>

<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
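
To show how the dated dynamic fields in this schema might be populated, here is a minimal sketch that builds one document as a plain dictionary. The helper name `build_doc`, the `title_norm` normalization, the trend encoding (-1, 0, 1), and the sample numbers are all assumptions for illustration; the slides do not say how these values were computed.

```python
from datetime import date


def build_doc(title, daily):
    """Build a Solr-style document dict from per-day stats.

    `daily` maps a date to a (page_views, trend) tuple; the trend
    encoding (-1, 0, 1) is an assumption made for this sketch.
    """
    doc = {
        "title": title,
        # Assumed normalization: lower case with underscores.
        "title_norm": title.lower().replace(" ", "_"),
        "total_pvs": sum(pvs for pvs, _ in daily.values()),
    }
    for day, (pvs, trend) in sorted(daily.items()):
        stamp = day.strftime("%Y%m%d")
        doc[f"pvs_{stamp}"] = pvs      # matches dynamicField pvs_*
        doc[f"trend_{stamp}"] = trend  # matches dynamicField trend_*
    return doc


doc = build_doc("Solar eclipse", {
    date(2011, 6, 21): (1200, 0),
    date(2011, 6, 22): (4800, 1),
})
print(doc["pvs_20110622"], doc["trend_20110622"], doc["total_pvs"])
# 4800 1 6000
```

Because the date is part of the field name, each new day adds two fields per document without any schema change.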

Temporal Traffic Correlation of Wikipedia Page Views

Sarah Palin E-mail Stats

• 13,177 documents

• 4 hours from receiving data to production install

• ~150 K requests per day at launch

• Now about 6-7 K requests per day

• Running on 3 VMs in two different data centers behind a NetScaler

Faceting and Clustering

Huffington Post Comments

• Solr 4

• Uses Solr Cloud

• Single shard

• ReplicationFactor 3

• Real-time

• 90 days of comments

• Tested up to 100 writes / second

More HuffPost comments

• Used by editors and moderators

– Topic investigation

– Troll detection

• Config

– Special features: search for emoticons, prefer exact match, date boosting

• Hack-a-thon comment clustering, timeline, and summarization

Solr Comments Architecture

Message Queue

MongoDB

Mongo Ingestor

Solr Ingestor

Solr Cloud

Uses SolrJ CloudSolrServer

Tools Server

JuLiA

Relevance in Solr

• “free alcohol” vs. “alcohol free”

– Phrase queries and phrase slop

• Lawyer versus Attorney

– SynonymFilterFactory

• Iron and ironic

– Kstem, or Lemmatization via the SynonymFilterFactory instead of Snowball/Porter
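
The lawyer/attorney case comes down to treating a group of terms as equivalent at analysis time. The sketch below illustrates that idea in plain Python; it is not how SynonymFilterFactory is implemented, and the helper names `load_synonyms` and `expand` are hypothetical. It assumes a comma-separated line lists terms that should all match one another.

```python
def load_synonyms(lines):
    """Parse comma-separated groups of equivalent terms into a lookup."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        terms = {t.strip().lower() for t in line.split(",") if t.strip()}
        group = sorted(terms)
        for term in group:
            table[term] = group
    return table


def expand(tokens, table):
    """Replace each token with its full synonym group (or itself)."""
    return [table.get(tok.lower(), [tok.lower()]) for tok in tokens]


table = load_synonyms(["# legal terms", "lawyer, attorney"])
print(expand(["Lawyer", "fees"], table))
# [['attorney', 'lawyer'], ['fees']]
```

The same lookup shape also covers the lemmatization suggestion on this slide: a group that lists only the inflected forms of "iron" keeps "ironic" out of that group, unlike a stemmer that conflates the two.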

Relevance in Solr

• Beyonce vs. Beyoncé

– Various Folding Filters

• Eagles

– Boost on other fields, such as popularity, publish date

– Use related searches, facets, or clustering

• F 15, F-15, F15

– WordDelimiterFilter
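
The folding and word-delimiter cases both normalize surface variants to a shared form before matching. The following is a rough Python analogue of that idea, not the behavior of Solr's actual filters, which operate on token streams and have many more options. The function names are hypothetical, and `catenate_parts` is meant for a single term such as a model number, not a whole sentence.

```python
import re
import unicodedata


def fold_accents(text):
    """Strip combining marks so 'Beyoncé' and 'Beyonce' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def catenate_parts(text):
    """Join letter and digit runs so 'F 15', 'F-15', 'F15' share one key."""
    return "".join(re.findall(r"[a-z0-9]+", fold_accents(text).lower()))


print(fold_accents("Beyoncé"))
# Beyonce
print({catenate_parts(v) for v in ["F 15", "F-15", "F15"]})
# {'f15'}
```

Applying the same normalization at both index time and query time is what lets any of the three spellings retrieve documents that use another.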

Bringing a New Search Project Online

• Understand the domain

• Ingest (sample) data

• Clean data

• Repeat

• Relevance testing

• Scale out

• Launch/Success