PyCon FR 2016 - Et si on recodait Google en Python ?
![Page 1: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/1.jpg)
What if we rewrote Google in Python? (Et si on recodait Google en Python ?)
PyCon-FR 2016
![Page 2: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/2.jpg)
![Page 3: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/3.jpg)
![Page 4: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/4.jpg)
![Page 5: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/5.jpg)
transparency
reproducibility
![Page 6: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/6.jpg)
![Page 7: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/7.jpg)
https://uidemo.commonsearch.org
![Page 9: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/9.jpg)
![Page 10: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/10.jpg)
Google's early Python code
https://www.quora.com/Why-did-Google-move-from-Python-to-C++-for-use-in-its-crawler
Python (1.2 IIRC) would occasionally just core dump while running the crawler. It was completely stock, no C++ modules compiled in or dynamically linked, just bog standard.
[...] no unit tests, and its "system tests" were minimal at best, absent at worst.
[...] there was originally some controversy about the switch. However, when the new C++ system was turned on and used fewer machines to crawl 5x faster with higher reliability, the practical question was settled.
Python was "abandoned" from the core search stack around 2000.
![Page 11: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/11.jpg)
What has changed since then?
• Stability & ecosystem
• High-performance libraries in C / Cython
• Bottlenecks have shifted
• PyPy?
![Page 12: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/12.jpg)
![Page 13: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/13.jpg)
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
http://infolab.stanford.edu/~backrub/google.html
Crawler
Parser
Index
Searcher
Ranker
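As a rough illustration of how these stages fit together, here is a toy, in-memory sketch. It is not how Common Search or Google implement any of them: the `PAGES` dictionary stands in for crawler output, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for crawler output: URL -> raw page text.
PAGES = {
    "http://a.example/": "python search engine",
    "http://b.example/": "search the web with python and cython",
    "http://c.example/": "cooking recipes",
}

def parse(text):
    """Parser stage: split text into lowercase terms."""
    return text.lower().split()

def build_index(pages):
    """Index stage: map each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in parse(text):
            index[term].add(url)
    return index

def search(index, query):
    """Searcher stage: URLs containing every query term.

    The 'ranker' here is a placeholder (alphabetical order).
    """
    terms = parse(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

results = search(build_index(PAGES), "python search")
```

A real engine replaces each function with a dedicated system (the following slides name candidates for each), but the data flow stays the same.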
![Page 14: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/14.jpg)
Crawler
![Page 15: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/15.jpg)
http://scrapy.org
![Page 16: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/16.jpg)
http://github.com/cocrawler/cocrawler
![Page 17: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/17.jpg)
![Page 18: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/18.jpg)
http://commoncrawl.org
![Page 19: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/19.jpg)
Parser
![Page 20: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/20.jpg)
HTML parsers
• BeautifulSoup & derivatives
• lxml
• html5lib
• Gumbo!
![Page 21: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/21.jpg)
https://github.com/google/gumbo-parser
![Page 22: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/22.jpg)
C extensions in Python
Memory managed by Python
Memory managed by the C extension
PyObject
ctypes
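A minimal sketch of the `ctypes` approach, assuming a POSIX system where the C library can be loaded. It calls the standard C function `strlen` on a Python `bytes` object. The helper name `c_strlen` is hypothetical and the example only illustrates the Python/C boundary, not the talk's actual code.

```python
import ctypes
import ctypes.util

# Assumption: POSIX. find_library may return None, in which case CDLL(None)
# exposes the symbols already loaded in the current process (libc included).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(data: bytes) -> int:
    """Pass a pointer to the bytes object's buffer to C; no copy is made."""
    return libc.strlen(data)

# The bytes object is allocated and freed by Python; C only reads it.
html = "<p>café</p>".encode("utf-8")
length = c_strlen(html)  # counts UTF-8 bytes, not characters

# A mutable C char array created through ctypes, filled by copying bytes in.
buf = ctypes.create_string_buffer(16)
ctypes.memmove(buf, html, len(html))
```

Every call through `ctypes` pays for argument conversion at run time, which is one reason a compiled extension can be preferable for hot loops.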
![Page 23: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/23.jpg)
![Page 24: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/24.jpg)
Cython!
• Do the bulk of the work in C
• Avoid data conversion as much as possible
• Easily generate a C extension for Python
![Page 25: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/25.jpg)
![Page 26: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/26.jpg)
![Page 27: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/27.jpg)
https://github.com/sylvinus/cython-simple-examples
![Page 28: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/28.jpg)
Gumbocy
• HTML passed to C as UTF-8, without conversion
• Tree traversal in Cython
• Handling of visibility & boilerplate
• Ignorable attributes & tags, ...
https://github.com/commonsearch/gumbocy
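To make the "visibility" and "ignorable tags" ideas concrete, here is a pure-Python sketch using the standard library's `html.parser` as a stand-in. It is not Gumbocy's API and not its algorithm: unlike Gumbocy it decodes the bytes to `str` first, and the `IGNORED_TAGS` set and the names `VisibleText` / `visible_words` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags whose content is never shown to the reader.
IGNORED_TAGS = {"script", "style", "head", "noscript"}

class VisibleText(HTMLParser):
    """Collect words that appear outside of ignored tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.ignored_depth = 0  # > 0 while inside at least one ignored tag
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self.ignored_depth > 0:
            self.ignored_depth -= 1

    def handle_data(self, data):
        if self.ignored_depth == 0:
            self.words.extend(data.split())

def visible_words(html_bytes):
    """Return the visible words of a UTF-8 encoded HTML document."""
    parser = VisibleText()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()
    return parser.words

page = (b"<html><head><title>T</title><style>p{}</style></head>"
        b"<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>")
words = visible_words(page)
```

The same traversal written in Cython over Gumbo's C tree avoids both the decoding step and the per-node Python callbacks, which is the point made on the slide.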
![Page 29: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/29.jpg)
https://github.com/commonsearch/urlparse4
![Page 30: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/30.jpg)
Other analyses
• Language detection: cld2
• Charset detection: cchardet + meta tags/headers
• Cleaning titles & metadata
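The charset step can be sketched as a cascade: trust the HTTP header, then a `<meta>` declaration, then fall back. The sketch below uses only the standard library. The function name `guess_charset`, the 4096-byte scan window, and the UTF-8 default are assumptions, and the statistical detection the slide attributes to cchardet would replace the final fallback.

```python
import re

# Matches charset=... inside a <meta> tag (both HTML5 and http-equiv forms).
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
# Matches charset=... inside a Content-Type header value.
HEADER_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)

def guess_charset(body, content_type=None, default="utf-8"):
    """Guess the encoding of an HTML document given as bytes."""
    # 1. The HTTP Content-Type header, when present, wins.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. Otherwise look for a <meta> declaration near the top of the page.
    match = META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Assumed fallback; a statistical detector would go here instead.
    return default

charset = guess_charset(b'<meta charset="windows-1252"><p>hi</p>')
```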
![Page 31: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/31.jpg)
Index
![Page 32: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/32.jpg)
https://pypi.python.org/pypi/Whoosh/
![Page 33: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/33.jpg)
http://lucene.apache.org/
![Page 34: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/34.jpg)
https://www.elastic.co
![Page 35: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/35.jpg)
Ranker
![Page 36: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/36.jpg)
Ranking formula
rank = f( static_score , dynamic_score( query ) )
static_score: Alexa, DMOZ, blacklists, PageRank, ...
dynamic_score: ElasticSearch & Lucene, TF-IDF, BM25
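The slide does not say what `f` is, so the following is only a sketch under assumptions: the dynamic part uses the BM25 per-term score in the form Lucene uses, and `f` is taken to be a hypothetical weighted sum. The default parameters `k1=1.2`, `b=0.75` and the `static_weight` value are illustrative, not taken from the talk.

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 score contribution of one query term for one document.

    tf: occurrences of the term in the document
    doc_freq: number of documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

def rank(static_score, dynamic_score, static_weight=0.3):
    """Hypothetical f: a weighted sum of query-independent and query-dependent scores."""
    return static_weight * static_score + (1 - static_weight) * dynamic_score

# A document of average length that mentions a rare term twice.
dynamic = bm25_term(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
final = rank(static_score=0.8, dynamic_score=dynamic)
```

The static score can be computed once per document at index time, while the dynamic score has to be evaluated for every query.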
![Page 37: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/37.jpg)
![Page 38: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/38.jpg)
https://about.commonsearch.org/developer/get-started
![Page 39: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/39.jpg)
Searcher
![Page 40: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/40.jpg)
Go version: https://github.com/commonsearch/cosr-front
https://github.com/commonsearch/cosr-back/blob/master/cosrlib/searcher.py
![Page 41: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/41.jpg)
Frontend
![Page 42: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/42.jpg)
https://uidemo.commonsearch.org
![Page 43: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/43.jpg)
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
http://infolab.stanford.edu/~backrub/google.html
Crawler
Parser
Index
Searcher
Ranker
![Page 44: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/44.jpg)
What's missing?
![Page 45: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/45.jpg)
Architecture
• 2-pass search (host clustering, result diversity)
• Continuous indexing
• Infoboxes
• Ads
• Verticals (images, videos, news, science, ...)
• ...
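The "2-pass search" item can be sketched as a second pass over an already ranked result list that limits how many results a single host may occupy near the top. This is one simple way to get result diversity, not necessarily the scheme the speaker has in mind. The function name `diversify` and the `max_per_host` parameter are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def diversify(ranked_urls, max_per_host=2):
    """Second pass: keep at most max_per_host results per host at the top.

    Extra results from an over-represented host are moved to the end,
    preserving their relative order.
    """
    seen = defaultdict(int)
    kept, demoted = [], []
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        seen[host] += 1
        if seen[host] <= max_per_host:
            kept.append(url)
        else:
            demoted.append(url)
    return kept + demoted

first_pass = [
    "http://a.com/1",
    "http://a.com/2",
    "http://a.com/3",
    "http://b.org/x",
]
second_pass = diversify(first_pass)
```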
![Page 46: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/46.jpg)
Even more fun
Spam / Relevance
Sustainability
Outreach
API
...
![Page 47: PyCon FR 2016 - Et si on recodait Google en Python ?](https://reader031.vdocuments.mx/reader031/viewer/2022020119/587cfe9b1a28ab1e7e8b5ebf/html5/thumbnails/47.jpg)
Interested?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
[email protected]
slack.commonsearch.org