big data in real time at twitter
TRANSCRIPT
![Page 1: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/1.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 1/71
Big Data in Real-Timeat Twitter
by Nick Kallen (@nk)
![Page 2: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/2.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 2/71
Follow alonghttp://www.slideshare.net/nkallen/qcon
![Page 3: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/3.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 3/71
What is Real-Time Data?• On-line queries for a single web request
• Off-line computations with very low latency
• Latency and throughput are equally important
• Not talking about Hadoop and other high-latency,
Big Data tools
![Page 4: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/4.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 4/71
The four data problems• Tweets
• Timelines
• Social graphs
• Search indices
![Page 5: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/5.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 5/71
![Page 6: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/6.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 6/71
What is a Tweet?• 140 character message, plus some metadata
• Query patterns:
• by id
• by author
• (also @replies, but not discussed here)
• Row Storage
![Page 7: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/7.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 7/71
Find by primary key: 4376167936
![Page 8: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/8.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 8/71
Find all by user_id: 749863
![Page 9: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/9.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 9/71
Original Implementation
• Relational
• Single table, vertically scaled
• Master-Slave replication and Memcached for
read throughput.
id user_id text created_at
20 12 just setting up my twttr 2006-03-21 20:50:14
29 12 inviting coworkers 2006-03-21 21:02:56
34 16 Oh shit, I just twittered a little. 2006-03-21 21:08:09
![Page 10: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/10.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 10/71
![Page 11: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/11.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 11/71
Problems w/ solution• Disk space: did not want to support disk arrays larger
than 800GB
• At 2,954,291,678 tweets, disk was over 90% utilized.
![Page 12: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/12.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 12/71
PARTITION
![Page 13: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/13.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 13/71
Possible implementations
id user_id
20 ...
22 ...
24 ...
id user_id
21 ...
23 ...
25 ...
Partition 1 Partition 2
Partition by primary key
![Page 14: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/14.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 14/71
Possible implementations
id user_id
20 ...
22 ...
24 ...
id user_id
21 ...
23 ...
25 ...
Partition 1 Partition 2
Finding recent tweetsby user_id queries N
partitions
Partition by primary key
![Page 15: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/15.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 15/71
Possible implementations
id user_id
... 1
... 1
... 3
id user_id
21 2
23 2
25 2
Partition 1 Partition 2
Partition by user id
![Page 16: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/16.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 16/71
Possible implementations
id user_id
... 1
... 1
... 3
id user_id
21 2
23 2
25 2
Partition 1 Partition 2
Partition by user id
Finding a tweet by id
queries N partitions
![Page 17: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/17.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 17/71
Current Implementation
id user_id
24 ...
23 ...
id user_id22 ...
21 ...
Partition 1
Partition 2
Partition by time
![Page 18: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/18.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 18/71
Current Implementation
id user_id
24 ...
23 ...
id user_id22 ...
21 ...
Partition 1
Partition 2
Queries try eachpartition in order
until enough datais accumulated
Partition by time
![Page 19: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/19.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 19/71
LOCALITY
![Page 20: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/20.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 20/71
Low Latency
* Depends on the number of partitions searched
PK Lookup
Memcached
MySQL
1ms
<10ms*
![Page 21: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/21.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 21/71
Principles• Partition and index
• Exploit locality (in this case, temporal locality)
• New tweets are requested most frequently, so
usually only 1 partition is checked
![Page 22: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/22.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 22/71
Problems w/ solution• Write throughput
• Have encountered deadlocks in MySQL at crazy
tweet velocity
• Creating a new temporal shard is a manual process
and takes too long; it involves setting up a parallel
replication hierarchy. Our DBA hates us
![Page 23: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/23.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 23/71
![Page 24: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/24.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 24/71
The four data problems• Tweets
• Timelines
• Social graphs
• Search indices
![Page 25: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/25.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 25/71
![Page 26: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/26.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 26/71
What is a Timeline?• Sequence of tweet ids
• Query pattern: get by user_id
• Operations:
• append
• merge
• truncate
• High-velocity bounded vector• Space-based (in-place mutation)
![Page 27: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/27.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 27/71
Tweets from 3dif erent people
![Page 28: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/28.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 28/71
Original ImplementationSELECT * FROM tweets
WHERE user_id IN
(SELECT source_id
FROM followers
WHERE destination_id = ?)
ORDER BY created_at DESC
LIMIT 20
![Page 29: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/29.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 29/71
Original ImplementationSELECT * FROM tweets
WHERE user_id IN
(SELECT source_id
FROM followers
WHERE destination_id = ?)
ORDER BY created_at DESC
LIMIT 20
Crazy slow if you have lots of friends or indices can’t be
kept in RAM
![Page 30: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/30.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 30/71
OFF-LINEVS.
ONLINECOMPUTATION
![Page 31: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/31.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 31/71
Current Implementation
• Sequences stored in Memcached
• Fanout off-line, but has a low latency SLA
• Truncate at random intervals to ensure bounded
length
• On cache miss, merge user timelines
![Page 32: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/32.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 32/71
Throughput Statistics
date average tps peak tps fanout ratio deliveries
10/7/2008 30 120 175:1 21,000
4/15/2010 700 2,000 600:1 1,200,000
![Page 33: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/33.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 33/71
1.2mDeliveries per second
![Page 34: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/34.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 34/71
MEMORY HIERARCHY
![Page 35: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/35.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 35/71
Possible implementations• Fanout to disk
• Ridonculous number of IOPS required, even with
fancy buffering techniques • Cost of rebuilding data from other durable stores not
too expensive
• Fanout to memory
• Good if cardinality of corpus * bytes/datum not toomany GB
![Page 36: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/36.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 36/71
Low Latency
get append fanout
1ms 1ms <1s*
* Depends on the number of followers of the tweeter
![Page 37: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/37.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 37/71
Principles• Off-line vs. Online computation
• The answer to some problems can be pre-computed
if the amount of work is bounded and the querypattern is very limited
• Keep the memory hierarchy in mind
• The efficiency of a system includes the cost of
generating data from another source (such as abackup) times the probability of needing to
![Page 38: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/38.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 38/71
The four data problems• Tweets
• Timelines
• Social graphs• Search indices
![Page 39: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/39.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 39/71
![Page 40: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/40.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 40/71
What is a Social Graph?• List of who follows whom, who blocks whom, etc.
• Operations:
• Enumerate by time • Intersection, Union, Difference
• Inclusion
• Cardinality
• Mass-deletes for spam• Medium-velocity unbounded vectors
• Complex, predetermined queries
![Page 41: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/41.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 41/71
![Page 42: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/42.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 42/71
Temporal enumeration
![Page 43: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/43.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 43/71
Temporal enumerationInclusion
![Page 44: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/44.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 44/71
Temporal enumerationInclusion
Cardinality
![Page 45: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/45.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 45/71
![Page 46: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/46.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 46/71
Intersection: Deliver to people who
follow both @aplusk and@foursquare
![Page 47: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/47.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 47/71
Original Implementationsource_id destination_id
20 12
29 12
34 16
• Single table, vertically scaled
• Master-Slave replication
d
![Page 48: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/48.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 48/71
Original Implementationsource_id destination_id
20 12
29 12
34 16
Index
• Single table, vertically scaled
• Master-Slave replication
d d
![Page 49: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/49.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 49/71
Original Implementationsource_id destination_id
20 12
29 12
34 16
Index Index
• Single table, vertically scaled
• Master-Slave replication
![Page 50: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/50.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 50/71
Problems w/ solution• Write throughput
• Indices couldn’t be kept in RAM
![Page 51: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/51.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 51/71
Current solution
• Partitioned by user id
• Edges stored in “forward” and “backward” directions• Indexed by time
• Indexed by element (for set algebra)
• Denormalized cardinality
source_id destination_id updated_at x
20 12 20:50:14 x
20 13 20:51:3220 16
destination_id source_id updated_at x
12 20 20:50:14 x
12 32 20:51:3212 16
Forward Backward
![Page 52: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/52.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 52/71
Current solution
• Partitioned by user id
• Edges stored in “forward” and “backward” directions• Indexed by time
• Indexed by element (for set algebra)
• Denormalized cardinality
source_id destination_id updated_at x
20 12 20:50:14 x
20 13 20:51:3220 16
destination_id source_id updated_at x
12 20 20:50:14 x
12 32 20:51:3212 16
Forward Backward
Partitioned by user
![Page 53: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/53.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 53/71
Current solution
• Partitioned by user id
• Edges stored in “forward” and “backward” directions• Indexed by time
• Indexed by element (for set algebra)
• Denormalized cardinality
source_id destination_id updated_at x
20 12 20:50:14 x
20 13 20:51:3220 16
destination_id source_id updated_at x
12 20 20:50:14 x
12 32 20:51:3212 16
Forward Backward
Partitioned by user
Edges stored in both directions
![Page 54: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/54.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 54/71
Challenges• Data consistency in the presence of failures
• Write operations are idempotent: retry until success
• Last-Write Wins for edges • (with an ordering relation on State for time
conflicts)
• Other commutative strategies for mass-writes
![Page 55: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/55.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 55/71
Low Latency
cardinality iteration write ack write
materializeinclusion
1ms 100edges/ms* 1ms 16ms 1ms
* 2ms lower bound
![Page 56: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/56.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 56/71
Principles• It is not possible to pre-compute set algebra queries
• Simple distributed coordination techniques work
• Partition, replicate, index. Many efficiency and
scalability problems are solved the same way
![Page 57: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/57.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 57/71
The four data problems• Tweets
• Timelines
• Social graphs• Search indices
![Page 58: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/58.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 58/71
![Page 59: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/59.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 59/71
What is a Search Index?• “Find me all tweets with these words in it...”
• Posting list
• Boolean and/or queries• Complex, ad hoc queries
• Relevance is recency*
* Note: there is a non-real-time component to search, but it is not discussed here
![Page 60: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/60.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 60/71
Intersection of three posting lists
![Page 61: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/61.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 61/71
Original Implementationterm_id doc_id
20 12
20 86
34 16
• Single table, vertically scaled
• Master-Slave replication for read throughput
![Page 62: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/62.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 62/71
Problems w/ solution• Index could not be kept in memory
![Page 63: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/63.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 63/71
Current Implementationterm_id doc_id
24 ...
23 ...
term_id doc_id
22 ...
21 ...
Partition 1
Partition 2
• Partitioned by time
• Uses MySQL
• Uses delayed key-write
![Page 64: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/64.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 64/71
Problems w/ solution• Write throughput
• Queries for rare terms need to search many
partitions• Space efficiency/recall
• MySQL requires lots of memory
![Page 65: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/65.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 65/71
F l i
![Page 66: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/66.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 66/71
Future solution
• Document partitioning
• Time partitioning too
• Merge layer
• May use Lucene instead of MySQL
![Page 67: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/67.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 67/71
Principles• Partition so that work can be parallelized
• Temporal locality is not always enough
![Page 68: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/68.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 68/71
The four data problems• Tweets
• Timelines
• Social graphs• Search indices
![Page 69: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/69.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 69/71
Summary Statistics
reads/secondwrites/second
cardinality bytes/item durability
Tweets 100k 850 12b 300b durable
Timelines 80k 1.2m a lot 3.2k non
Graphs 100k 20k 13b 110 durable
Search 13k 21k† 315m‡ 1k durable
† tps * 25 postings‡ 75 partitions * 4.2m tweets
![Page 70: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/70.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 70/71
Principles
• All engineering solutions are transient• Nothing’s perfect but some solutions are good enough
for a while
• Scalability solutions aren’t magic. They involve
partitioning, indexing, and replication• All data for real-time queries MUST be in memory.
Disk is for writes only.
• Some problems can be solved with pre-computation,
but a lot can’t
• Exploit locality where possible
![Page 71: Big Data in Real Time at Twitter](https://reader030.vdocuments.mx/reader030/viewer/2022021213/577d26701a28ab4e1ea1367f/html5/thumbnails/71.jpg)
8/4/2019 Big Data in Real Time at Twitter
http://slidepdf.com/reader/full/big-data-in-real-time-at-twitter 71/71
Appendix