The original vision of Nutch, 14 years later: Building an open source search engine
![Page 1: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/1.jpg)
The original vision of Nutch, 14 years later: Building an open source search engine
Apache Big Data Europe 2016
@sylvinus
![Page 2: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/2.jpg)
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
![Page 3: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/3.jpg)
"The original motivation for the Nutch project was to provide a transparent alternative to the growing power of a handful of private search services over most users’ view of the Web. However, as Nutch has been adopted with greater enthusiasm by smaller organizations, the Nutch Organization has de-emphasized operating a multi-billion-page index in the public interest."
CommerceNet Labs Technical Report, Nov 2004
![Page 4: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/4.jpg)
![Page 5: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/5.jpg)
again?
![Page 6: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/6.jpg)
![Page 7: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/7.jpg)
![Page 8: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/8.jpg)
![Page 9: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/9.jpg)
![Page 10: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/10.jpg)
transparency
reproducibility
![Page 11: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/11.jpg)
![Page 12: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/12.jpg)
https://uidemo.commonsearch.org
![Page 14: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/14.jpg)
Agenda
• Values & tech choices
• Search engine components
• Challenges
• Opportunities
![Page 15: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/15.jpg)
Values & tech choices
![Page 16: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/16.jpg)
![Page 17: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/17.jpg)
Radical transparency
• Open source (Apache License v2)
• Open data
• (Governance)
![Page 18: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/18.jpg)
Privacy
• Results can be tailored by language/country, but NOT by user/cookie/sessionid
• \o/ Cache everything!
• Tor service: http://comsearchl2zlnre.onion
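A minimal sketch of why the privacy rule makes caching easy: if only the query, language, and country may shape results, those three values can form the whole cache key. The function name `cache_key` and the normalization rules are assumptions for illustration, not Common Search code.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(query: str, lang: str, country: str) -> str:
    """Build a cache key from the only inputs allowed to influence results."""
    # Normalize so that trivially different requests share one cache entry.
    params = [
        ("q", " ".join(query.lower().split())),
        ("lang", lang.lower()),
        ("country", country.upper()),
    ]
    # Deliberately absent: cookies, user IDs, session IDs.
    return hashlib.sha256(urlencode(params).encode("utf-8")).hexdigest()
```

Because no per-user value enters the key, two visitors sending the same query in the same language and country can be served the same cached response.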
![Page 19: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/19.jpg)
Participation & Pragmatism
• Use high-level languages as much as possible (Python, Go)
• Embrace active communities (Spark, Elasticsearch)
• Use mainstream participation platforms, even if they are nonfree (GitHub, Slack)
![Page 20: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/20.jpg)
Search engines
![Page 21: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/21.jpg)
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
Searcher
Ranker
![Page 22: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/22.jpg)
Crawler
![Page 23: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/23.jpg)
http://commoncrawl.org
![Page 24: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/24.jpg)
Today at 3:30pm!
![Page 25: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/25.jpg)
http://scrapy.org
![Page 26: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/26.jpg)
http://github.com/cocrawler/cocrawler
![Page 27: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/27.jpg)
Indexer
![Page 28: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/28.jpg)
Specs
• HTML parsing & analysis
• Tokenization / NLP
• Static rankings
• Language detection
• I/O from crawls to databases
![Page 29: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/29.jpg)
![Page 30: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/30.jpg)
Common Search Pipeline
Doc sources: Common Crawl, WARC files, URLs, ...
Filter plugins
Document parsing
Output plugins
Data output: Database, file, HDFS, S3, ...
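The stages above can be sketched as a small plugin loop. This is a hypothetical illustration of the flow (sources, filters, parsing, outputs), not the actual Common Search pipeline: `Document`, `parse_document`, and `run_pipeline` are invented names, and the parsing step is a placeholder.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]
FilterPlugin = Callable[[Document], bool]   # return False to drop a document
OutputPlugin = Callable[[Document], None]   # write a parsed document somewhere


def parse_document(doc: Document) -> Document:
    """Placeholder parsing step: use the first line of the body as the title."""
    parsed = dict(doc)
    body = doc.get("body", "").strip()
    parsed["title"] = body.splitlines()[0] if body else ""
    return parsed


def run_pipeline(
    source: Iterable[Document],
    filters: List[FilterPlugin],
    outputs: List[OutputPlugin],
) -> int:
    """Push every document through filters, parsing, and all output plugins."""
    processed = 0
    for doc in source:
        # A document is kept only if every filter plugin accepts it.
        if not all(keep(doc) for keep in filters):
            continue
        parsed = parse_document(doc)
        for emit in outputs:
            emit(parsed)
        processed += 1
    return processed
```

In this sketch a source is any iterable of documents, so a WARC reader, a URL list, or a test fixture can be swapped in without touching the loop.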
![Page 31: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/31.jpg)
HTML parsers
• BeautifulSoup & friends
• lxml
• html5lib
• Gumbo!
![Page 32: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/32.jpg)
https://github.com/google/gumbo-parser
![Page 33: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/33.jpg)
Gumbocy
• Use Cython instead of ctypes
• Smaller API
• Tree traversal on the Cython side with basic boilerplate/visibility support
https://github.com/commonsearch/gumbocy
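To illustrate what tree traversal with boilerplate/visibility support means, here is a rough sketch using Python's standard `html.parser` in place of Gumbo. It does not use or imitate the Gumbocy API; the tag sets and the `visible_text` helper are assumptions chosen for the example.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: tags whose content is never shown to the user.
HIDDEN_TAGS = {"script", "style", "noscript", "template"}
# Assumption: tags that usually hold navigation or footer boilerplate.
BOILERPLATE_TAGS = {"nav", "footer", "aside"}
SKIPPED_TAGS = HIDDEN_TAGS | BOILERPLATE_TAGS


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes that sit outside hidden or boilerplate elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0          # > 0 while inside a skipped element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the indexable text of an HTML document as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Doing this filtering during traversal, rather than after building a full tree in Python, is the kind of work the slide describes moving to the Cython side.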
![Page 34: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/34.jpg)
https://github.com/commonsearch/urlparse4
![Page 35: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/35.jpg)
Database(s)
![Page 36: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/36.jpg)
![Page 37: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/37.jpg)
http://lucene.apache.org/
![Page 38: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/38.jpg)
![Page 39: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/39.jpg)
Ranker
![Page 40: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/40.jpg)
Ranking formula
rank = f( static_score , dynamic_score( query ) )
static_score: Alexa, DMOZ, Blacklists, PageRank, ...
dynamic_score: Elasticsearch & Lucene, TF-IDF, BM25, ...
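A small sketch of the formula above, assuming `f` is a weighted sum. The slide does not say how the two scores are combined, so `rank` and `static_weight` are illustrative choices. The dynamic part is a plain BM25 implementation with the Lucene-style IDF; in practice Elasticsearch would compute it.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(
    query_terms: List[str],
    docs: Dict[str, List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> Dict[str, float]:
    """Score each tokenized document against the query with BM25."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores


def rank(static_score: float, dynamic_score: float, static_weight: float = 0.3) -> float:
    """Assumed combination f: a weighted sum of static and query-dependent scores."""
    return static_weight * static_score + (1 - static_weight) * dynamic_score
```

The static score can be computed once per document at indexing time, while the dynamic score depends on the query and has to be computed at search time.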
![Page 41: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/41.jpg)
![Page 42: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/42.jpg)
https://about.commonsearch.org/developer/get-started
![Page 43: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/43.jpg)
Today @ 4:30pm ;-)
![Page 44: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/44.jpg)
Searcher / Frontend
![Page 45: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/45.jpg)
Specs
• Send user query to databases
• Search-as-you-type
• HTML & JSON endpoints
• High performance
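As a rough illustration of search-as-you-type behind a JSON endpoint, the sketch below answers prefix queries from a sorted in-memory list. The in-memory list stands in for the database, and `PrefixIndex` and `json_response` are hypothetical names, not the Common Search frontend.

```python
import bisect
import json
from typing import Iterable, List


class PrefixIndex:
    """Sorted list of phrases supporting fast prefix lookups."""

    def __init__(self, phrases: Iterable[str]) -> None:
        self._phrases = sorted({p.lower() for p in phrases})

    def suggest(self, prefix: str, limit: int = 5) -> List[str]:
        prefix = prefix.lower()
        if not prefix:
            return []
        # In a sorted list, all phrases sharing a prefix are contiguous.
        start = bisect.bisect_left(self._phrases, prefix)
        results: List[str] = []
        for phrase in self._phrases[start:]:
            if not phrase.startswith(prefix) or len(results) >= limit:
                break
            results.append(phrase)
        return results


def json_response(index: PrefixIndex, query: str) -> str:
    """Body a JSON endpoint could return for each keystroke."""
    return json.dumps({"query": query, "suggestions": index.suggest(query)})
```

The response depends only on the query string, so it can be cached aggressively.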
![Page 46: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/46.jpg)
![Page 48: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/48.jpg)
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Parser
Index
Searcher
Ranker
![Page 49: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/49.jpg)
Challenges
![Page 50: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/50.jpg)
Funding / Scale
• Frugalism
• Caching
• In-kind services
• Individual donations / Foundation grants
• General economic incentives
![Page 51: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/51.jpg)
Spam
• Email spam
• Wikipedia vandalism
• Algorithm complexity & scale
• Given enough eyeballs, all spam is shallow?
![Page 52: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/52.jpg)
Relevance
• Exhaustivity
• Rescoring
• Evaluation
• More at 4:30pm ;-)
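For the evaluation point, two standard offline metrics can be sketched in a few lines. The slides do not say which metrics Common Search uses; precision@k and nDCG@k are shown here only as common examples, with hypothetical function names.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """Discounted cumulative gain of the ranking, normalized by the ideal ordering."""

    def dcg(values: List[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Both functions need human relevance judgments as input, which ties back to the call for more open datasets.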
![Page 53: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/53.jpg)
More search dimensions
• Realtime search
• Local search
• Universal search
![Page 54: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/54.jpg)
Semantic search
• Wikidata
• YAGO
• Conversational / Voice search
![Page 55: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/55.jpg)
Outreach
• Easy onboarding & docs
• Making people care believe
![Page 56: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/56.jpg)
Opportunities
![Page 57: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/57.jpg)
Decentralization
• YaCy
• Extremely high technical & social cost!
• Transparency?
![Page 58: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/58.jpg)
Research
• More people should know how to build search engines
• Spam, Relevance, Large-scale data processing
• We need more open datasets!
![Page 59: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/59.jpg)
https://about.commonsearch.org/blog/
![Page 60: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/60.jpg)
Make the Web a better place!
• SEO
• Transparency
• Influence of money
• Public service
![Page 61: The original vision of Nutch, 14 years later: Building an open source search engine](https://reader031.vdocuments.mx/reader031/viewer/2022022202/587cfe9b1a28ab1e7e8b5ebd/html5/thumbnails/61.jpg)
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
slack.commonsearch.org