Hadoop and Pig at Twitter (OSCON 2010)
DESCRIPTION
A look into how Twitter uses Hadoop and Pig to build products and do analytics.
TRANSCRIPT
![Page 1: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/1.jpg)
Hadoop and Pig @Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Friday, July 23, 2010
![Page 2: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/2.jpg)
Agenda
‣ Hadoop Overview
‣ Pig: Rapid Learning Over Big Data
‣ Data-Driven Products
‣ Hadoop/Pig and Analytics
![Page 3: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/3.jpg)
My Background
‣ Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs of data
![Page 4: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/4.jpg)
Agenda
‣ Hadoop Overview
‣ Pig: Rapid Learning Over Big Data
‣ Data-Driven Products
‣ Hadoop/Pig and Analytics
![Page 5: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/5.jpg)
Data is Getting Big
‣ NYSE: 1 TB/day
‣ Facebook: 20+ TB compressed/day
‣ CERN/LHC: 40 TB/day (15 PB/year)
‣ And growth is accelerating
‣ Need multiple machines, horizontal scalability
![Page 6: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/6.jpg)
Hadoop
‣ Distributed file system (hard to store a PB)
‣ Fault-tolerant, handles replication, node failure, etc.
‣ MapReduce-based parallel computation (even harder to process a PB)
‣ Generic key-value based computation interface allows for wide applicability
![Page 7: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/7.jpg)
Hadoop
‣ Open source: top-level Apache project
‣ Scalable: Y! has a 4000-node cluster
‣ Powerful: sorted a TB of random integers in 62 seconds
‣ Easy Packaging: Cloudera RPMs, DEBs
![Page 8: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/8.jpg)
MapReduce Workflow
‣ Challenge: how many tweets per user, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=user_id, value=1
‣ Shuffle: sort by user_id
‣ Reduce: for each user_id, sum
‣ Output: user_id, tweet count
‣ With 2x machines, runs 2x faster
[Diagram: Inputs → Map (×7) → Shuffle/Sort → Reduce (×3) → Outputs]
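The steps on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop code: the function names, the `sample_tweets` table, and its field names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_tweet(row_key, tweet):
    """Map: for each input row, emit key=user_id, value=1."""
    yield tweet["user_id"], 1


def shuffle(pairs):
    """Shuffle/sort: group the emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(user_id, values):
    """Reduce: sum the values collected for one user_id."""
    return user_id, sum(values)


def count_tweets_per_user(tweets):
    """Run map, shuffle, and reduce in sequence over an in-memory table."""
    mapped = chain.from_iterable(map_tweet(k, v) for k, v in tweets.items())
    return [reduce_counts(k, vs) for k, vs in shuffle(mapped)]


# Hypothetical tweets table: key=row, value=tweet info.
sample_tweets = {
    "row1": {"user_id": 12, "text": "hello"},
    "row2": {"user_id": 7, "text": "pig"},
    "row3": {"user_id": 12, "text": "hadoop"},
}

print(count_tweets_per_user(sample_tweets))  # [(7, 1), (12, 2)]
```

In a real cluster the map calls run in parallel across machines and the shuffle moves data over the network; the sketch only shows the shape of the computation.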
![Page 15: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/15.jpg)
But...
‣ Analysis typically in Java
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins are lengthy, error-prone
‣ Hard to manage n-stage jobs
‣ Exploration requires compilation!
![Page 16: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/16.jpg)
Agenda
‣ Hadoop Overview
‣ Pig: Rapid Learning Over Big Data
‣ Data-Driven Products
‣ Hadoop/Pig and Analytics
![Page 17: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/17.jpg)
Enter Pig
‣ High-level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ Easier than SQL?
‣ Top-level Apache project
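As a rough analogy for "transformations on sets of records, one step at a time", here is a small Python sketch in which each step names an intermediate result. It is not Pig Latin and not the script from the talk; the `records` data and field names are hypothetical.

```python
from collections import Counter

# Hypothetical records standing in for a loaded data set.
records = [
    {"user_id": 1, "client": "web"},
    {"user_id": 2, "client": "mobile"},
    {"user_id": 1, "client": "web"},
    {"user_id": 3, "client": ""},
]

# Step 1: filter out records that have no client recorded.
with_client = [r for r in records if r["client"]]

# Step 2: project down to the one field we care about.
clients = [r["client"] for r in with_client]

# Step 3: group by client and count each group.
counts = Counter(clients)

# Step 4: order by count (descending), then by name for stable output.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(ranked)  # [('web', 2), ('mobile', 1)]
```

Each named intermediate can be inspected on its own, which is the property the slide is pointing at.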
![Page 18: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/18.jpg)
Why Pig?
‣ Because I bet you can read the following script.
![Page 19: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/19.jpg)
A Real Pig Script
![Page 20: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/20.jpg)
Now, just for fun...
‣ The same calculation in vanilla MapReduce
![Page 21: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/21.jpg)
No, seriously.
![Page 22: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/22.jpg)
Pig Democratizes Large-scale Data Analysis
‣ The Pig version is:
‣ 5% of the code
‣ 5% of the development time
‣ Within 25% of the execution time
‣ Readable, reusable
![Page 23: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/23.jpg)
One Thing I’ve Learned
‣ It’s easy to answer questions
‣ It’s hard to ask the right questions
‣ Value the system that promotes innovation and iteration
![Page 24: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/24.jpg)
Agenda
‣ Hadoop Overview
‣ Pig: Rapid Learning Over Big Data
‣ Data-Driven Products
‣ Hadoop/Pig and Analytics
![Page 25: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/25.jpg)
MySQL, MySQL, MySQL
‣ We all start there.
‣ But MySQL is not built for analysis.
‣ select count(*) from users? Maybe.
‣ select count(*) from tweets? Uh...
‣ Imagine joining them.
‣ And grouping.
‣ Then sorting.
![Page 26: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/26.jpg)
Non-Pig Hadoop at Twitter
‣ Data Sink via Scribe
‣ Distributed Grep
‣ A few performance-critical, simple jobs
‣ People Search
![Page 27: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/27.jpg)
People Search?
‣ First real product built with Hadoop
‣ “Find People”
‣ Old version: offline process on a single node
‣ New version: complex graph calculations, hit internal network services, custom indexing
‣ Faster, more reliable, more observable
![Page 28: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/28.jpg)
People Search
‣ Import user data into HBase
‣ Periodic MapReduce job reading from HBase
‣ Hits FlockDB, other internal services in mapper
‣ Custom partitioning
‣ Data sucked across to sharded, replicated, horizontally scalable, in-memory, low-latency Scala service
‣ Build a trie, do case folding/normalization, suggestions, etc
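The last step (trie, case folding/normalization, suggestions) can be illustrated with a minimal Python sketch. This is not Twitter's Scala service; the `Trie` class, the `normalize` helper, and the accent-stripping rule are assumptions chosen for the example.

```python
import unicodedata


def normalize(name):
    """Case-fold and strip combining accents so 'José' and 'jose' share a key."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class Trie:
    def __init__(self):
        self.children = {}  # normalized character -> child node
        self.names = []     # original names whose normalized form ends here

    def insert(self, name):
        node = self
        for ch in normalize(name):
            node = node.children.setdefault(ch, Trie())
        node.names.append(name)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored names whose normalized form starts with prefix."""
        node = self
        for ch in normalize(prefix):
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            # Push children in reverse order so they are visited alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]


people = Trie()
for name in ["Kevin Weil", "kevin", "José", "Kate"]:
    people.insert(name)

print(people.suggest("KEV"))   # ['kevin', 'Kevin Weil']
print(people.suggest("jose"))  # ['José']
```

Normalizing both the stored names and the query is what lets a lookup for "KEV" or "jose" match names typed with different case or accents.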
![Page 29: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/29.jpg)
Agenda
‣ Hadoop Overview
‣ Pig: Rapid Learning Over Big Data
‣ Data-Driven Products
‣ Hadoop/Pig and Analytics
![Page 30: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/30.jpg)
Order of Operations
‣ Counting
‣ Correlating
‣ Research/Algorithmic Learning
![Page 31: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/31.jpg)
Counting
‣ How many requests per day?
‣ What’s the average latency? 95th-percentile latency?
‣ What’s the response code distribution?
‣ How many searches per day? Unique users?
‣ What’s the geographic breakdown of requests?
‣ How many tweets? From what clients?
‣ How many signups? Profile completeness?
‣ How many SMS notifications did we send?
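Several of these counting questions reduce to simple aggregates over request logs. A minimal Python sketch follows, assuming a hypothetical log format with `latency_ms` and `status` fields and using the nearest-rank definition of a percentile (other definitions interpolate and can give slightly different values).

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100, computed with integers to avoid float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(requests):
    """Request count, average latency, 95th-percentile latency, status distribution."""
    latencies = [r["latency_ms"] for r in requests]
    return {
        "requests": len(requests),
        "avg_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "status_codes": Counter(r["status"] for r in requests),
    }


# Hypothetical sample: 20 requests, latencies 10..200 ms, the two slowest failed.
sample_requests = [
    {"latency_ms": 10 * i, "status": 200 if i <= 18 else 500}
    for i in range(1, 21)
]

summary = summarize(sample_requests)
print(summary["avg_latency_ms"], summary["p95_latency_ms"])  # 105.0 190
print(dict(summary["status_codes"]))  # {200: 18, 500: 2}
```

At log scale the same aggregates would be computed per group in a distributed job; the sketch only pins down what is being counted.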
![Page 32: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/32.jpg)
Correlating
‣ How does usage differ for mobile users?
‣ ... for desktop client users (Tweetdeck, etc)?
‣ Cohort analyses
‣ What services fail at the same time?
‣ What features get users hooked?
‣ What do successful users do often?
‣ How does tweet volume change over time?
![Page 33: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/33.jpg)
Research
‣ What can we infer from a user’s tweets?
‣ ... from the tweets of their followers? followees?
‣ What features tend to get a tweet retweeted?
‣ ... and what influences the retweet tree depth?
‣ Duplicate detection, language detection
‣ What graph structures lead to increased usage?
‣ Sentiment analysis, entity extraction
‣ User reputation
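To make "retweet tree depth" concrete, here is a small Python sketch that measures the depth of each retweet tree. It assumes a hypothetical `parent_of` mapping from a tweet ID to the ID of the tweet it retweeted (`None` for an original tweet); the talk does not describe how this data is actually stored.

```python
def root_and_depth(parent_of, tweet_id):
    """Follow retweet links up to the original tweet; return (root id, hops)."""
    depth = 0
    seen = {tweet_id}
    while parent_of.get(tweet_id) is not None:
        tweet_id = parent_of[tweet_id]
        if tweet_id in seen:
            raise ValueError("cycle in retweet links")
        seen.add(tweet_id)
        depth += 1
    return tweet_id, depth


def max_depth_per_root(parent_of):
    """For each original tweet, the depth of its deepest retweet chain."""
    deepest = {}
    for tweet_id in parent_of:
        root, depth = root_and_depth(parent_of, tweet_id)
        deepest[root] = max(deepest.get(root, 0), depth)
    return deepest


# Hypothetical data: b and d retweet a, c retweets b, x has no retweets.
sample_parents = {"a": None, "b": "a", "c": "b", "d": "a", "x": None}

print(max_depth_per_root(sample_parents))  # {'a': 2, 'x': 0}
```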
![Page 34: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/34.jpg)
If We Had More Time...
‣ HBase
‣ LZO compression and Hadoop
‣ Protocol buffers
‣ Our open source: hadoop-lzo, elephant-bird
‣ Analytics and Cassandra
![Page 35: Hadoop and pig at twitter (oscon 2010)](https://reader033.vdocuments.mx/reader033/viewer/2022051412/54c677684a7959a4368b4585/html5/thumbnails/35.jpg)
Questions?
Follow me at twitter.com/kevinweil