PyData Berlin Meetup
TRANSCRIPT
![Page 1: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/1.jpg)
Helping travelers make better hotel choices 500 million times a month*
Steffen Wenz, CTO TrustYou
![Page 2: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/2.jpg)
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.
![Page 3: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/3.jpg)
✓ Excellent hotel!
![Page 4: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/4.jpg)
✓ Excellent hotel!
✓ Nice building “Clean, hip & modern, excellent facilities”
✓ Great view « Vue superbe »
![Page 5: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/5.jpg)
✓ Excellent hotel!*
✓ Nice building “Clean, hip & modern, excellent facilities”
✓ Great view « Vue superbe »
✓ Great for partying “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
![Page 6: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/6.jpg)
![Page 7: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/7.jpg)
![Page 8: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/8.jpg)
![Page 9: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/9.jpg)
TrustYou Architecture
Diagram components: Crawling, Semantic Analysis, DB, TrustYou Analytics, API, Kayak ...
200 million reqs/month
![Page 10: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/10.jpg)
Crawling
![Page 11: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/11.jpg)
Basic crawling setup
/find?q=Berlin
/find?q=Munich
/meetup/BerlinPyData
/meetup/BerlinCyclists
/find?q=Munich&page=2
/meetup/BerlinPolitics
/meetup/BerlinCyclists
/find?q=Munich&page=3
Seed URLs
Frontier
![Page 12: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/12.jpg)
… if only it were so easy
/find?q=Berlin
/find?q=Munich
/meetup/BerlinPyData
/meetup/BerlinCyclists
/find?q=Munich&page=2
/meetup/BerlinPolitics
/meetup/BerlinCyclists
/find?q=Munich&page=3
/find?q=Munich&page=99999999999...
...
facebok.com/meetup
…
Seed URLs
Frontier
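The seed/frontier loop from the two slides above can be sketched in a few lines of Python. This is only an illustration, not Scrapy's or TrustYou's implementation: the `crawl` function, the `max_pages` budget, and the in-memory `LINKS` table (which stands in for real HTTP fetches against a made-up `meetup.example` site) are all assumptions. It shows three guards a crawler needs: never queue a URL twice, stay on the allowed domain, and stop after a page budget so endless pagination cannot run forever.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "http://meetup.example/find?q=Berlin": [
        "http://meetup.example/meetup/BerlinPyData",
        "http://meetup.example/meetup/BerlinCyclists",
    ],
    "http://meetup.example/find?q=Munich": [
        "http://meetup.example/meetup/BerlinCyclists",  # duplicate link
        "http://meetup.example/find?q=Munich&page=2",
    ],
    "http://meetup.example/find?q=Munich&page=2": [
        "http://other.example/meetup",  # off-domain link
    ],
}


def crawl(seed_urls, allowed_domain, max_pages=100):
    """Breadth-first crawl; returns the URLs visited, in order."""
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    seen = set(seed_urls)        # every URL ever queued, for deduplication
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):  # a real crawler would fetch and parse here
            if urlparse(link).netloc != allowed_domain:
                continue  # do not leave the site
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


seeds = [
    "http://meetup.example/find?q=Berlin",
    "http://meetup.example/find?q=Munich",
]
pages = crawl(seeds, "meetup.example")
```

A page budget is a blunt tool; real crawlers usually combine it with depth limits and URL normalization.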
![Page 13: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/13.jpg)
Scrapy
● Build your own web crawlers
  ○ Extract data via CSS selectors, XPath, regexes …
  ○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
![Page 14: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/14.jpg)
Frontier
Seed URLs
Intro to Scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# MeetupItem is assumed to be a scrapy.Item subclass defined elsewhere in the project
class MySpider(CrawlSpider):
    name = "my_spider"
    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]
    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            r"^http://www\.meetup\.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # Extract data about meetup from HTML
        m = MeetupItem()
        yield m
![Page 15: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/15.jpg)
Try it out!
$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...
Full code on GitHub, dump of all Berlin meetups
(Note: Meetup also has an API …)
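The crawl output is one JSON object per line, and `members` arrives as a string. A minimal sketch of reading it back, assuming a helper named `load_meetups` (hypothetical, not part of the talk's code) and using three records copied from the output above:

```python
import json

# Sample records copied from the crawl output above.
JSONLINES = """\
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
"""


def load_meetups(text):
    """Parse one JSON object per non-empty line and convert members to int."""
    meetups = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        record["members"] = int(record["members"])  # the crawler emits strings
        meetups.append(record)
    return meetups


meetups = load_meetups(JSONLINES)
largest_first = sorted(meetups, key=lambda m: m["members"], reverse=True)
for m in largest_first:
    print(m["members"], m["name"])
```

Converting to `int` matters for sorting: as strings, "1000" would sort before "368".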
![Page 16: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/16.jpg)
Number of registered meetups
![Page 17: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/17.jpg)
Crawling at TrustYou scale
● 2 - 3 million new reviews/week
● Customers want alerts 8 - 24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites
![Page 18: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/18.jpg)
Crawling at TrustYou scale
● Custom framework very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … Though problem not 100% suitable for MapReduce
  ○ Nodes mostly waiting
  ○ Coordination/messaging between nodes required:
    ■ Distributed queue
    ■ Rate limiting
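To make the rate-limiting requirement concrete, here is a single-process sketch of a per-domain limiter. It is not TrustYou's implementation: the class name `DomainRateLimiter`, its methods, and the injectable `clock` are assumptions for illustration. In a cluster the `next_allowed` table would have to live in a store shared by all nodes, which is exactly the coordination the slide mentions.

```python
import time


class DomainRateLimiter:
    """Allow at most one request per `min_interval` seconds for each domain."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.next_allowed = {}  # domain -> earliest time of the next request

    def wait_time(self, domain):
        """Seconds the caller must still wait before requesting from `domain`."""
        earliest = self.next_allowed.get(domain, float("-inf"))
        return max(0.0, earliest - self.clock())

    def try_acquire(self, domain):
        """Reserve a request slot if one is free; return True on success."""
        now = self.clock()
        if now < self.next_allowed.get(domain, float("-inf")):
            return False
        self.next_allowed[domain] = now + self.min_interval
        return True


# Usage with a fake clock, so the example does not need to sleep.
fake_now = [100.0]
limiter = DomainRateLimiter(min_interval=2.0, clock=lambda: fake_now[0])
first = limiter.try_acquire("example.com")   # slot is free
second = limiter.try_acquire("example.com")  # too soon, rejected
other = limiter.try_acquire("example.org")   # domains are limited independently
fake_now[0] += 2.0
third = limiter.try_acquire("example.com")   # interval has passed
```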
![Page 19: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/19.jpg)
Textual Data
![Page 20: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/20.jpg)
Treating textual data
Stages shown: raw text, sentence splitting, tokenization, stopword filtering, stemming
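The stages named on this slide can be chained as plain functions. The sketch below uses only the standard library and deliberately naive rules: the stopword list is a tiny made-up sample, and `crude_stem` just strips a few suffixes, which is far simpler than a real stemmer such as Porter's. All function names are hypothetical.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "to", "for", "in", "we"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def remove_stopwords(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip a few common English suffixes; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text):
    """Run all stages and return one list of stems per sentence."""
    return [
        [crude_stem(t) for t in remove_stopwords(tokenize(s))]
        for s in split_sentences(text)
    ]


result = process("The rooms were cleaned daily. We enjoyed the view!")
```

Here `result` holds one token list per sentence, with "rooms" reduced to "room" and "cleaned" to "clean".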
![Page 21: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/21.jpg)
Tokenization
>>> import nltk
>>> raw = ("We are always looking for interesting talks, locations to "
...        "host meetups and enthusiastic volunteers. Please get in touch using "
...        "info@pydata.berlin.")
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',',
'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic',
'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@',
'pydata.berlin', '.']
![Page 22: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/22.jpg)
Grammars and Parsing
“great rooms”: JJ NN
“great hotel”: JJ NN
“rooms are terrible”: NN VB JJ
“hotel is terrible”: NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
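The two tag patterns on this slide (adjective + noun, and noun + verb + adjective) can be matched directly on tagger output. This is a simplified sketch, not the grammar-based approach the talk goes on to show; `extract_opinions` is a hypothetical helper, and the input is hand-tagged in the `(word, tag)` format that `nltk.pos_tag` returns so the example runs without NLTK.

```python
def extract_opinions(tagged):
    """Return (noun, adjective) pairs from a list of (word, tag) tuples.

    Recognizes two patterns: adjective + noun ("great rooms") and
    noun + verb + adjective ("rooms are terrible").
    """
    opinions = []
    i = 0
    while i < len(tagged):
        tags = [tag for _, tag in tagged[i:i + 3]]
        is_noun_verb_adj = (
            len(tags) == 3
            and tags[0].startswith("NN")
            and tags[1].startswith("VB")
            and tags[2].startswith("JJ")
        )
        is_adj_noun = (
            len(tags) >= 2
            and tags[0].startswith("JJ")
            and tags[1].startswith("NN")
        )
        if is_noun_verb_adj:
            opinions.append((tagged[i][0], tagged[i + 2][0]))
            i += 3
        elif is_adj_noun:
            opinions.append((tagged[i + 1][0], tagged[i][0]))
            i += 2
        else:
            i += 1
    return opinions


simple = extract_opinions([("hotel", "NN"), ("is", "VBZ"), ("terrible", "JJ")])
mixed = extract_opinions([
    ("great", "JJ"), ("rooms", "NNS"), (",", ","),
    ("staff", "NN"), ("was", "VBD"), ("rude", "JJ"),
])
```

Matching on tag prefixes lets `NNS` and `VBZ` count as noun and verb without listing every Penn Treebank variant.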
![Page 23: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/23.jpg)
Grammars and Parsing
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
...
(OPINION (JJ great) (NN rooms))
![Page 24: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/24.jpg)
WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('coded', wn.VERB)
'code'
>>> wn.synsets("python")
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
>>> wn.synset('python.n.01').hypernyms()
[Synset('boa.n.02')]
>>> # meh :/
![Page 25: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/25.jpg)
Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี”
● “خدمة جیدة”
● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: Scale out CPU
  ○ ~1B opinions in DB
● Python for ML & NLP libraries
![Page 26: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/26.jpg)
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
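The claim that words are similar when they occur in similar contexts can be checked on a toy corpus. Note the swap: Word2Vec learns dense vectors with a neural model, whereas this sketch uses raw co-occurrence counts and cosine similarity, which is a much simpler technique built on the same idea. The corpus and the function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up corpus: "cats" and "dogs" share contexts, "cars" does not.
SENTENCES = [
    "cats chase mice",
    "dogs chase mice",
    "cats eat food",
    "dogs eat food",
    "cars need fuel",
    "cars need roads",
]


def context_vectors(sentences, window=2):
    """Count, for each word, the words appearing within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(SENTENCES)
cats_dogs = cosine(vectors["cats"], vectors["dogs"])
cats_cars = cosine(vectors["cats"], vectors["cars"])
```

On this corpus `cats_dogs` comes out higher than `cats_cars`, because "cats" and "dogs" share every context word while "cats" and "cars" share none.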
![Page 27: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/27.jpg)
Fun with Word2Vec
>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> # (gensim 4.0 and later expose these query methods on m.wv rather than on m)
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
![Page 28: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/28.jpg)
ML @ TrustYou
● gensim doc2vec model to create hotel embedding
● Used - together with other features - for various classifiers
![Page 29: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/29.jpg)
Workflow Management & Scaling Up
![Page 30: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/30.jpg)
Luigi
● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
![Page 31: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/31.jpg)
Luigi tasks vs. Makefiles
import luigi

class MyTask(luigi.Task):
    def requires(self):
        return DependentTask()

    def output(self):
        return luigi.LocalTarget("data/my_task_output")

    def run(self):
        with self.output().open("w") as out:
            out.write("foo")

data/my_task_output: DependentTask
	run
	run
	run ...
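The Makefile comparison boils down to one rule: run a task only after its dependencies, and skip anything whose output already exists. A minimal sketch of that resolution logic follows. The `Task` class here is a made-up stand-in and not Luigi's `luigi.Task`; a boolean `done` flag replaces the check for an output file, and there is no cycle detection.

```python
class Task:
    """Minimal stand-in for a workflow task; not Luigi's actual Task class."""

    def __init__(self, name, requires=()):
        self.name = name
        self._requires = list(requires)
        self.done = False  # stands in for "the output file exists"

    def requires(self):
        return self._requires

    def complete(self):
        return self.done

    def run(self):
        self.done = True


def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(task.name)


crawl = Task("crawl")
clean = Task("clean", requires=[crawl])
train = Task("train", requires=[clean])
report = Task("report", requires=[clean, train])

log = []
build(report, log)  # runs crawl, clean, train, report in that order
build(report, log)  # everything is complete, so nothing runs again
```

The second `build` call is what makes resuming cheap: after a failure, only the tasks without output are run again.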
![Page 32: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/32.jpg)
Example: Wrap crawl in Luigi task
import os
import subprocess

import luigi

class CrawlTask(luigi.Task):
    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        # crawl into a temporary file, then rename it to the final output path
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city", "-a", "city={}".format(self.city),
                                 "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
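The task above crawls into a `_tmp` path and renames it at the end. The reason is that Luigi treats a task as complete once its output exists, so a half-written file left by a crashed crawl would otherwise be mistaken for a finished one. The same pattern in isolation, as a sketch with a hypothetical helper `write_atomically` and a throwaway temp directory:

```python
import os
import tempfile


def write_atomically(path, text):
    """Write `text` to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix="_tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial file so it is never mistaken for real output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "Berlin.jsonl")
write_atomically(target, '{"name": "PyData Berlin"}\n')
```

Keeping the temp file in the same directory as the target matters, because a rename across filesystems is no longer a single atomic step.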
![Page 33: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/33.jpg)
Luigi dependency graphs
![Page 34: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/34.jpg)
Hadoop!
● MapReduce: Programming model for distributed computation problems
● Express your algorithm as sequences of operations:
  a. Map: Do a linear pass over your data, emit (k, v)
  b. (Distributed sort)
  c. Reduce: Linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi
(Just go learn PySpark instead)
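The map, sort, reduce sequence can be imitated in a single process to see how the pieces fit. This is a teaching sketch with no Hadoop involved: `run_mapreduce` is a made-up driver, an in-memory sort stands in for the distributed sort, and word counting is just the customary example job.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce: sum all counts that share the same key."""
    yield key, sum(values)


def run_mapreduce(lines, mapper, reducer):
    """Single-process imitation of map -> sort -> reduce."""
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))  # stands in for the distributed sort
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


counts = run_mapreduce(
    ["great rooms", "great hotel", "rooms are terrible"],
    mapper,
    reducer,
)
```

Because the pairs are sorted by key, the reducer sees all values for one key together and never needs the whole data set in memory at once, which is the property that lets the real thing scale.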
![Page 35: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/35.jpg)
Luigi Hadoop integration
import luigi.contrib.hadoop
import luigi.contrib.hdfs

# Older Luigi releases exposed these classes as luigi.hadoop.JobTask and luigi.HdfsTarget.
class HadoopTask(luigi.contrib.hadoop.JobTask):
    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)
![Page 36: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/36.jpg)
Luigi Hadoop integration
1. Your input data is sitting in distributed file system (HDFS)
2. Luigi creates a .tar.gz, Hadoop moves your code on machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
![Page 37: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/37.jpg)
Beyond MapReduce
● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: More complete Python support
![Page 38: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/38.jpg)
Workflows at TrustYou
![Page 39: PyData Berlin Meetup](https://reader030.vdocuments.mx/reader030/viewer/2022021423/58a98e0e1a28ab412d8b63ab/html5/thumbnails/39.jpg)
Workflows at TrustYou