Ranking the Web with Spark
![Page 2: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/2.jpg)
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
![Page 3: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/3.jpg)
transparency
reproducibility
![Page 4: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/4.jpg)
![Page 5: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/5.jpg)
https://uidemo.commonsearch.org
![Page 7: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/7.jpg)
Ranking
![Page 8: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/8.jpg)
Disclaimer: IANASRE (I Am Not A Search Relevance Engineer)
![Page 9: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/9.jpg)
What's in a score
score = fn( doc, query, language, user, time )
![Page 10: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/10.jpg)
What's in a score
score = fn( doc, query )
![Page 11: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/11.jpg)
What's in a score
score = fn( static_score, dynamic_score ( query ))
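A minimal sketch of this decomposition, assuming a simple linear blend. The `final_score` name, the 0.3 weight, and the candidate tuples are illustrative assumptions, not the scoring function used by Common Search.

```python
def final_score(static_score: float, dynamic_score: float, static_weight: float = 0.3) -> float:
    """Blend a query-independent score with a query-dependent one.

    The linear form and the default weight are assumptions for illustration.
    """
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


# Hypothetical candidates: (url, static_score computed offline, dynamic_score for this query)
candidates = [
    ("a.example/page", 0.9, 0.2),
    ("b.example/page", 0.4, 0.8),
    ("c.example/page", 0.1, 0.5),
]

# Rank candidates by their blended score, best first.
ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2]), reverse=True)
```

The static part can be computed once per document at indexing time; only the dynamic part depends on the query.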
![Page 12: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/12.jpg)
Static score
![Page 13: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/13.jpg)
Static features
• Scopes:
• Page: URL depth, markup stats, ...
• Domain: Age, page count, blacklists, ...
• WebGraph: PageRank, ...
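As an illustration of page-scope static features, here is a small sketch that derives a few of them from a URL alone. The feature names and the `page_features` helper are hypothetical; only "URL depth" comes from the slide.

```python
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Number of non-empty path segments; a homepage has depth 0."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def page_features(url: str) -> dict:
    """Query-independent features that can be computed at indexing time."""
    parts = urlsplit(url)
    return {
        "url_depth": url_depth(url),
        "has_query_string": bool(parts.query),
        "is_https": parts.scheme == "https",
    }
```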
![Page 14: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/14.jpg)
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
http://infolab.stanford.edu/~backrub/google.html
Crawler
Indexer
Database
Searcher
Ranker
![Page 15: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/15.jpg)
Dynamic score
![Page 16: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/16.jpg)
Dynamic features
• Text match: TF-IDF, BM25, proximity, topic, ...
• Query-level: number of words, popularity, ...
• Usage: clicks, dwell time, reformulations, ...
• Time
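To make the text-match family concrete, here is a sketch of Okapi BM25 over pre-tokenized documents. The parameter defaults (`k1=1.2`, `b=0.75`) and the non-negative IDF variant are common choices, assumed here rather than taken from the talk. It expects a non-empty list of documents.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            f = term_freq[term]
            if f == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more occurrences to score as high.
            norm = f + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "spark ranks the web".split(),
    "the web is large".split(),
    "cooking with fire".split(),
]
scores = bm25_scores("spark web".split(), docs)
```

The first document matches both query terms, the second matches one, and the third matches none, so the scores decrease in that order.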
![Page 17: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/17.jpg)
Scoring function
![Page 18: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/18.jpg)
Users
Database: Elasticsearch
Indexer: Python, Spark
Data sources: Common Crawl, Alexa top 1M, ...
words, static score
query
top 10 docs, final scores
Offline
Online
Searcher: Go
![Page 19: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/19.jpg)
![Page 21: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/21.jpg)
Issues with this architecture
• Static & dynamic scoring are in different codebases
• No control over result diversity
• Hard to optimize
• Very dependent on Elasticsearch
![Page 22: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/22.jpg)
Rescoring
![Page 23: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/23.jpg)
Users
Database
Indexer
words, static score, features
query
Searcher
top 1k docs, features
Rescorer
final 10 docs
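The two-phase flow in this diagram could be sketched as follows. The `retrieve` and `rescore` callables, the toy index, and the feature weights are all hypothetical stand-ins for the Searcher and Rescorer boxes.

```python
def search_with_rescoring(query, retrieve, rescore, candidates=1000, page_size=10):
    """Cheap retrieval of many candidates, then a more expensive rescore of just those."""
    top_candidates = retrieve(query, candidates)
    rescored = sorted(top_candidates, key=lambda doc: rescore(query, doc), reverse=True)
    return rescored[:page_size]


# Toy index: each document carries precomputed features ("top 1k docs, features").
INDEX = [{"id": i, "text_match": (i * 7) % 10 / 10, "static_score": i / 100} for i in range(100)]


def retrieve(query, limit):
    # First phase: rank by a single cheap feature. The toy index ignores the query.
    return sorted(INDEX, key=lambda d: d["text_match"], reverse=True)[:limit]


def rescore(query, doc):
    # Second phase: combine every feature returned with the candidate.
    return 0.5 * doc["text_match"] + 0.5 * doc["static_score"]


results = search_with_rescoring("spark", retrieve, rescore, candidates=20, page_size=3)
```

Only the candidates returned by the first phase can appear in the final list, so each additional page of results requires either a larger candidate pool or another round trip.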
![Page 24: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/24.jpg)
Issues with rescoring
• Latency
• Pagination
• Harder to explain
![Page 25: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/25.jpg)
Learning to rank
![Page 26: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/26.jpg)
LTR Model
• Features
• Training dataset
• Evaluation: NDCG, ERR, ...
• Algorithms: AdaRank, ListNet, LambdaMART, ...
• Learning with Spark!
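As an example of the evaluation step, here is a short sketch of NDCG@k using the common `(2^rel - 1) / log2(position + 1)` gain and discount. The graded relevance labels are assumed to come from a training dataset of judged query/document pairs.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k results (rank is 0-based here)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists the most relevant documents first scores 1.0; pushing relevant documents down the list lowers the score.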
![Page 27: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/27.jpg)
The right questions
• What do users expect?
• What features?
• How to evaluate and fine-tune in the real world?
![Page 28: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/28.jpg)
PageRank with Spark
![Page 29: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/29.jpg)
![Page 30: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/30.jpg)
http://commoncrawl.org
![Page 31: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/31.jpg)
https://github.com/commonsearch/cosr-back
![Page 32: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/32.jpg)
Common Search Pipeline
Doc sources: Common Crawl, WARC files, URLs ...
Filter plugins
Document parsing
Output plugins
Data output: Database, file, HDFS, S3, ...
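One way to picture the stages above is a loop that passes each document through filters, a parser, and outputs. This is only a sketch: the `Document` class, `run_pipeline`, and the plugin functions are hypothetical and do not mirror the cosr-back plugin API.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Document:
    url: str
    html: str
    fields: dict = field(default_factory=dict)


def run_pipeline(sources, filters, parse, outputs):
    """Doc sources -> filter plugins -> document parsing -> output plugins."""
    for doc in sources:
        if not all(keep(doc) for keep in filters):
            continue  # dropped by a filter plugin
        parse(doc)
        for output in outputs:
            output(doc)


def only_example_org(doc):
    # Hypothetical filter plugin: keep a single host.
    return urlsplit(doc.url).hostname == "example.org"


def parse_title(doc):
    # Hypothetical parser: extract the <title> text with plain string search.
    start, end = doc.html.find("<title>"), doc.html.find("</title>")
    doc.fields["title"] = doc.html[start + len("<title>"):end] if 0 <= start < end else ""


collected = []  # stands in for a database, file, HDFS or S3 output
run_pipeline(
    sources=[
        Document("http://example.org/", "<title>Home</title>"),
        Document("http://spam.example.com/", "<title>Buy</title>"),
    ],
    filters=[only_example_org],
    parse=parse_title,
    outputs=[collected.append],
)
```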
![Page 33: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/33.jpg)
Most popular Wikipedia pages
![Page 34: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/34.jpg)
Dumping the web graph
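A sketch of what dumping graph edges from one page might look like, assuming a host-level graph where each edge is a `(source_host, target_host)` pair. The `LinkExtractor` class and `domain_edges` function are illustrative names, not taken from the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def domain_edges(page_url, html):
    """Return the set of (source_host, target_host) edges for external links."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    source = urlsplit(page_url).hostname
    edges = set()
    for link in parser.links:
        target = urlsplit(link).hostname
        if target and target != source:  # skip internal links and non-HTTP schemes
            edges.add((source, target))
    return edges
```

Returning a set deduplicates repeated links between the same pair of hosts before any ranking is computed.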
![Page 35: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/35.jpg)
Naive pyspark PageRank
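The slide's code is not part of this transcript. As a rough stand-in, here is the same power-iteration idea in plain Python (not Spark), with comments marking the steps that would roughly correspond to `flatMap` and `reduceByKey` in an RDD version. The damping factor of 0.85 and the handling of dangling nodes are assumptions.

```python
def pagerank(edges, iterations=20, damping=0.85):
    """Power iteration over (source, target) edges; returns {node: rank}."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        contributions = {n: 0.0 for n in nodes}
        dangling = 0.0
        for node, targets in out_links.items():
            if targets:
                # "flatMap": each node sends rank / out_degree to every target.
                share = ranks[node] / len(targets)
                for target in targets:
                    contributions[target] += share  # "reduceByKey": sum per target
            else:
                dangling += ranks[node]  # no out-links: spread this rank evenly
        ranks = {
            n: (1 - damping) / len(nodes)
            + damping * (contributions[n] + dangling / len(nodes))
            for n in nodes
        }
    return ranks


ranks = pagerank([("a", "b"), ("c", "b"), ("b", "a")], iterations=50)
```

In this toy graph, `b` receives links from both `a` and `c` and ends up with the highest rank, while `c`, which has no incoming links, keeps only the baseline `(1 - damping) / N`.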
![Page 36: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/36.jpg)
GraphFrames
![Page 37: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/37.jpg)
SparkSQL PageRank
![Page 38: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/38.jpg)
SparkSQL PageRank
https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
![Page 39: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/39.jpg)
Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py
![Page 40: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/40.jpg)
https://about.commonsearch.org/developer/get-started
![Page 41: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/41.jpg)
![Page 42: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/42.jpg)
Top 10
![Page 43: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/43.jpg)
![Page 44: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/44.jpg)
Spam
![Page 45: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/45.jpg)
![Page 46: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/46.jpg)
Spamdexing
• Keyword stuffing, hidden text
• Scraper sites, Mirrors
• Link farms
• Splogs, Comment spam
• Domaining
• Cloaking
• Bombing
![Page 47: Ranking the Web with Spark](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587cfe9b1a28ab1e7e8b5ebb/html5/thumbnails/47.jpg)
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
[email protected]
slack.commonsearch.org