Solr at AOL, presented by Sean Timm at SolrExchange DC
TRANSCRIPT
Solr and Lucene @ AOL
SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
1999
• Believe, Cher and Livin’ la Vida Loca, Ricky Martin
• The Matrix and The Phantom Menace
• Windows 98 Second Edition
• AltaVista, Northern Light, Yahoo, ODP, Inktomi
– Google
• PPC Text search ads invented 1998
– Banner ads
A Brief History of Search @ AOL
• Acquired PLS in 1998
• AOL Search used ODP
• Site Search
• Local Search
• Built into AOL Server
• CPL
– VSM then BM25
– Phrase, numeric, date, text, and proximity boosting
– Conflation classes (like synonyms)
Relevance
• Precision/recall
• “free alcohol” vs. “alcohol free”
• Lawyer versus Attorney
• Iron and ironic same stem (Porter)
• Beyonce vs. Beyoncé
• Eagles
–Bird, sports teams, band, AMC Eagle
• F 15, F-15, F15
• FREAK
[Figure: relevant vs. retrieved documents]
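A minimal Python sketch of the precision/recall bullet, assuming documents are identified by hypothetical string IDs (`doc1` and so on are illustrative, not from the talk). Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that were retrieved.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for one query: four relevant documents, three returned.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc3", "doc5"}

p, r = precision_recall(relevant, retrieved)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three returned documents are relevant, and two of the four relevant documents were found, so tuning for one measure usually costs some of the other.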
The Dawn of Solr
• Prohibitively expensive to continue CPL development
• Complicated deployment
• 2005: Investigating migration to Lucene
• 2006: CNET open sourced Solr
Contributions
• Local Lucene/Solr (superseded by SpatialSearch)
• Query Timeout
• Data Import Handler (DIH)
• Numerous smaller patches
• Committers: Noble Paul, Shalin Mangar, Patrick O’Leary
Contributing to Solr/Lucene
• Learn
–Join the mailing lists
  • [email protected]
  • [email protected]
–Read search and Solr related blogs
–The #solr IRC channel on freenode
Contributing to Solr/Lucene
• Help others
–Answer questions.
–Improve documentation in the code, the wiki, or the website.
–Make improvements to the Solr Admin UI.
Contributing to Solr/Lucene
• Confirm a bug
• Submit a patch for a reported bug or feature request
• Improve a patch
• Try out a patch and see if it works
Contributing to Solr/Lucene
• Submit your own tickets
– Bug
– Feature request
• Start with solr-user@lucene
• Discuss on dev@lucene
• Create Jira ticket, ideally with patches and unit tests
• Yonik’s Law of Patches:
– A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.
Applications
• MapQuest (SpatialSearch)
• Mail
• AIM
• AOL Search
• Site Search
• News Search
• RUM
• Sarah Palin e-mails (admin)
• Demand
• Wikipedia article pattern detection
MapQuest Discover
Travel Blogs
MQ Local Search
Related Searches
Bipartite graph snippet
Related Searches Graph
[Figure: related-searches graph for “The Eagles”; labels: The band, NFL, Boston College, Hotel California, Tribute]
Related Searches
• Simple query
– User
  • New York Library
– Solr query
  • Lower case
  • Prefer exact match “new york library”
  • Use phrase slop to allow terms in same order and near each other, e.g., new york city public library
  • primeQuery:“new york library” OR “new york library”~3
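The query construction above could be sketched in Python roughly as follows. The field name `primeQuery` and the slop of 3 come from the slide; the function name, the quote escaping, and the `field:( ... )` grouping (used so both clauses target the same field) are assumptions for illustration, not the production code.

```python
def related_search_query(user_query, field="primeQuery", slop=3):
    """Build a Lucene-syntax query that matches the exact phrase or a sloppy one."""
    # Lower-case and collapse runs of whitespace.
    text = " ".join(user_query.lower().split())
    # Escape backslashes and double quotes so the phrase stays well-formed.
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    phrase = f'"{text}"'
    # An exact match satisfies both clauses, so it tends to score higher
    # than a document that only matches the sloppy phrase.
    return f"{field}:({phrase} OR {phrase}~{slop})"


print(related_search_query("New York Library"))
# prints: primeQuery:("new york library" OR "new york library"~3)
```

The resulting string would be passed as the `q` parameter of a Solr request.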
Wikipedia Traffic Correlation Schema
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
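To illustrate how the date-suffixed field names resolve against the dynamic field patterns, here is a small Python sketch of the matching rule described in the schema comment (a "*" only at the start or end, longer patterns first, schema order breaking ties). It is an approximation for illustration only, not Solr's implementation; `dated_field` and `match_dynamic_field` are hypothetical helper names.

```python
from datetime import date


def dated_field(prefix, day):
    """Build a field name such as 'pvs_20110622' from a prefix and a date."""
    return f"{prefix}_{day:%Y%m%d}"


def match_dynamic_field(name, patterns):
    """Return the first matching pattern, trying longer patterns first.

    sorted() is stable, so patterns of equal length keep their schema order.
    """
    for pattern in sorted(patterns, key=len, reverse=True):
        if pattern.startswith("*") and name.endswith(pattern[1:]):
            return pattern
        if pattern.endswith("*") and name.startswith(pattern[:-1]):
            return pattern
    return None


# Patterns in the order they appear in the schema above.
PATTERNS = ["trend_*", "pvs_*"]

name = dated_field("pvs", date(2011, 6, 22))
print(name, "->", match_dynamic_field(name, PATTERNS))
```

With this layout each new day adds a pair of fields per document without any schema change, at the cost of a growing number of distinct field names.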
Temporal Traffic Correlation of Wikipedia Page Views
Sarah Palin E-mail Stats
• 13,177 documents
• 4 hours from receiving data to production install
• ~150 K requests per day at launch
• Now about 6-7 K requests per day
• Running on 3 VMs in two different data centers behind a NetScaler
Faceting and Clustering
Huffington Post Comments
• Solr 4
• Uses Solr Cloud
• Single shard
• ReplicationFactor 3
• Real-time
• 90 days of comments
• Tested up to 100 writes / second
More HuffPost comments
• Used by editors and moderators
–Topic investigation
–Troll detection
• Config
–Special features: search for emoticons, prefer exact match, date boosting
• Hack-a-thon comment clustering, timeline, and summarization
Solr Comments Architecture
Message Queue
MongoDB
MongoIngestor
Solr Ingestor
Solr Cloud
Uses SolrJ CloudSolrServer
Tools Server
JuLiA
Relevance in Solr
• “free alcohol” vs. “alcohol free”
–Phrase queries and phrase slop
• Lawyer versus Attorney
–SynonymFilterFactory
• Iron and ironic
–KStem, or lemmatization via the SynonymFilterFactory, instead of Snowball/Porter
Relevance in Solr
• Beyonce vs. Beyoncé
–Various Folding Filters
• Eagles
–Boost on other fields, such as popularity, publish date
–Use related searches, facets, or clustering
• F 15, F-15, F15
–WordDelimiterFilter
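As a rough illustration of what folding and delimiter handling achieve, the Python sketch below normalizes the example terms so that variants compare equal. It only approximates the effect of Solr's folding filters and WordDelimiterFilter (which run inside the analysis chain and have many more options); the function names are hypothetical.

```python
import re
import unicodedata


def fold_accents(text):
    """Strip diacritics: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_delimiters(text):
    """Lower-case, fold accents, and remove everything except letters and digits."""
    return re.sub(r"[^0-9a-z]+", "", fold_accents(text).lower())


print(fold_accents("Beyonc\u00e9"))  # same spelling as the unaccented variant
print({collapse_delimiters(t) for t in ["F 15", "F-15", "F15"]})  # one normalized form
```

Applying the same normalization at index time and at query time is what lets all spellings of a term match one another.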
Bringing a New Search Project Online
• Understand the domain
• Ingest (sample) data
• Clean data
• Repeat
• Relevance testing
• Scale out
• Launch/Success