Harvesting the Power of Samza in LinkedIn's Feed
Posted on 15-Apr-2017
TRANSCRIPT
Harvesting the Power of Samza in News Feed
Providing fresh and relevant content to hundreds of millions of members
A Few Things Mentioned Here
Prerequisites
1. Samza
2. RocksDB (a key-value store)
3. SerDe (Serializer/Deserializer)
4. Kafka (a distributed messaging system)
5. Java
The Challenge
Relevant content is a great way to stay informed about your professional interests; fresh, relevant content is even better!
How do we keep track of what hundreds of millions of members
viewed on their News Feeds?
Tracking
News Feed is the Landing Page for Most Members
Scale
Source: investors.linkedin.com | 1 as of quarter end | 2 monthly average during the quarter
Client-Side Tracking
• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic
Server-Side Tracking
• Events that have more data about served feeds
• Rich payload
• Events end up in a Kafka topic
Improving Member Experience Using Samza (Overview)
1. Join input streams: a stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream.
2. Purge stale events: a custom TTL mechanism reaps stale events every n seconds.
3. Consume output stream: convert the rich data about impressions into machine-learning features used for ranking items in the News Feed.
Join
Overview
Diagram: Client Events → Process Client Events; Server Events → Process Server Events; both processors emit Output Events.
Client-Side Events Processor Overview
Is the ID in the server-side events store?
• Yes → match events → output to Kafka
• No → store (ID, const.)
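The decision flow above can be sketched in plain Java; here a HashSet stands in for each RocksDB-backed Samza store, a List stands in for the output Kafka stream, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Client-side events processor sketch: each ID from a client event is
// either matched against an already-buffered server-side ID and sent
// to the output, or buffered so a later server event can match it.
class ClientEventProcessor {
    final Set<String> serverSideStore;                   // IDs seen in server events
    final Set<String> clientSideStore = new HashSet<>(); // IDs awaiting a match
    final List<String> output = new ArrayList<>();       // stand-in for the Kafka output stream

    ClientEventProcessor(Set<String> serverSideStore) {
        this.serverSideStore = serverSideStore;
    }

    void process(String id) {
        if (serverSideStore.contains(id)) {
            output.add(id);          // Yes: match events, output to Kafka
        } else {
            clientSideStore.add(id); // No: store (ID, constant) for later matching
        }
    }
}
```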
Client-Side Events Processor Optimizations
• Initial capacity of the matches map (event → matched IDs) is determined by a metric (GC-friendly)
• Initial capacity of each value set is equal to |IDs|
• An empty byte array is used as a dummy value for IDs stored in RocksDB (it passes through the no-op byte-array SerDe), so the store acts as a set
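The empty-byte-array trick can be shown in isolation; a HashMap stands in for the RocksDB store, and the class name is made up for this sketch:

```java
import java.util.HashMap;
import java.util.Map;

// A key-value store used as a set: every key maps to the same empty
// byte array, which a no-op byte-array SerDe writes through untouched,
// so membership costs almost nothing beyond the key itself.
class IdSet {
    private static final byte[] PRESENT = new byte[0]; // shared dummy value
    private final Map<String, byte[]> store = new HashMap<>(); // stand-in for RocksDB

    void add(String id) { store.put(id, PRESENT); }
    boolean contains(String id) { return store.containsKey(id); }
    void remove(String id) { store.remove(id); }
}
```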
Server-Side Events Processor Overview
Is the ID in the client-side events store?
• Yes → match events → output to Kafka
• No → store (ID, event)
Event Anatomy
• Header (shared event data, e.g. member ID)
• List of payloads (one for each item)
• Each payload has a join key (ID)
Example payload join keys: 123, 456, 789, 012, 345, 678, 901, 234.
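The anatomy above can be sketched as a small Java class; the field and class names here are illustrative, not the actual LinkedIn schema:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a server-side event: a header of shared data plus a list
// of per-item payloads, each carrying the join key used by the
// stream-stream join. Names are illustrative.
class ServedFeedEvent {
    final String memberId;          // shared event data (header)
    final List<ItemPayload> items;  // one payload per served item

    ServedFeedEvent(String memberId, List<ItemPayload> items) {
        this.memberId = memberId;
        this.items = items;
    }

    static class ItemPayload {
        final String joinKey;  // ID used to match client- and server-side events
        ItemPayload(String joinKey) { this.joinKey = joinKey; }
    }
}
```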
Server-Side Events Storage
Diagram: every join key (ID: 123, 456, 789, 012, 345, 678, 901, 234) maps to the same value, UUID: 1A67343FE…83B, which in turn maps to the shared event data (e.g. member ID).

ManyKeysToOneValueStore<K, V>
• Space-efficient
• Insertion is transactional
• Rolling back a transaction is a best-effort operation
• Requires an additional lookup (but it's worth it)
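A minimal sketch of the idea, with HashMaps standing in for the two RocksDB stores; the real store makes insertion transactional, which is omitted here:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// ManyKeysToOneValueStore sketch: the shared event body is stored once
// under a generated UUID, and every join-key ID maps to that UUID.
// Reads pay one extra lookup (ID -> UUID -> value), but the rich
// payload is never duplicated per ID.
class ManyKeysToOneValueStore<K, V> {
    private final Map<K, String> keyToUuid = new HashMap<>();
    private final Map<String, V> uuidToValue = new HashMap<>();

    // Insert one shared value under many keys.
    void putAll(Iterable<K> keys, V value) {
        String uuid = UUID.randomUUID().toString();
        uuidToValue.put(uuid, value);
        for (K key : keys) {
            keyToUuid.put(key, uuid);
        }
    }

    // Two-step lookup: ID -> UUID, then UUID -> shared value.
    V get(K key) {
        String uuid = keyToUuid.get(key);
        return uuid == null ? null : uuidToValue.get(uuid);
    }
}
```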
Event Matching
Client-side event IDs: 789, 012, 345, 678, 901, 234, 123, 456
Server-side event A IDs: 111, 456, 906, 678, …, 901, 431, 746
Server-side event B IDs: 234, 012, 123, 100, …, 313, 345, 333
Output event A (matched IDs): 901, 456, 678
Output event B (matched IDs): 012, 123, 345, 234
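The matching step itself is a set intersection; a sketch using the example IDs above (a LinkedHashSet keeps the result's iteration order stable):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Matching sketch: the output event for a server-side event carries
// exactly those of its IDs that also appear in the client-side event.
class EventMatcher {
    static Set<String> match(Set<String> clientIds, Set<String> serverIds) {
        Set<String> matched = new LinkedHashSet<>(serverIds);
        matched.retainAll(clientIds); // set intersection
        return matched;
    }
}
```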
[SAMZA-647] Key-Value Store Contributions to Samza
• The access pattern is getAll(List<K>)
• RocksDB supports multiGet, which is faster than repeated get calls
• Added that support to Samza's KeyValueStore
• Perf test results confirm those of RocksDB (with caching disabled)
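SAMZA-647's contribution can be pictured with a simplified stand-in interface (this is not the actual Samza KeyValueStore signature): the default getAll below is the naive one-point-get-per-key behavior, and a RocksDB-backed implementation overrides it to answer the whole batch with a single multiGet:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified bulk-read store interface. The default getAll issues one
// point get per key; a RocksDB-backed store can override it with one
// multiGet call, which is what SAMZA-647 measured as faster.
interface BulkStore<K, V> {
    V get(K key);

    default Map<K, V> getAll(List<K> keys) {
        Map<K, V> result = new LinkedHashMap<>();
        for (K key : keys) {
            result.put(key, get(key));
        }
        return result;
    }
}
```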
TTL
Custom TTL Mechanism
• Records the timestamp of when an event was stored
• The “death row” store: the key is the timestamp and the value is an ID
• Because the key is a timestamp, collisions occur:
Linear Probing Timestamper
Generate timestamp, then:
• Bucket is free → put(timestamp, ID)
• Bucket is taken and attempts <= max → probe the next bucket
• Attempts > max → stop probing
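A sketch of the probing loop; a TreeMap stands in for the death-row store, and what happens once the attempt budget is exhausted is an assumption here (we simply give up, which is tolerable because TTL bookkeeping is not mission-critical):

```java
import java.util.TreeMap;

// Linear-probing timestamper sketch: when two events land in the same
// millisecond bucket, probe forward one bucket at a time until a free
// one is found or the attempt budget runs out.
class LinearProbingTimestamper {
    final TreeMap<Long, String> deathRow = new TreeMap<>(); // timestamp -> event ID
    final int maxAttempts;

    LinearProbingTimestamper(int maxAttempts) { this.maxAttempts = maxAttempts; }

    // Returns the bucket used, or -1 if probing gave up.
    long put(long timestamp, String id) {
        for (int attempts = 0; attempts <= maxAttempts; attempts++) {
            long bucket = timestamp + attempts;
            if (!deathRow.containsKey(bucket)) {
                deathRow.put(bucket, id);
                return bucket;
            }
        }
        return -1; // assumption: give up (the deck doesn't say what happens past max attempts)
    }
}
```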
TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window.
Keeping it simple and stupid works.
Reapers
Every n seconds:
Get death rows (t < now – TTL)
For each entry in death row:
Remove from core stores
Remove from death row
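The loop above can be sketched with a TreeMap standing in for the ordered death-row store (its headMap plays the role of the range query) and a HashMap standing in for the core stores:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Reaper sketch: every pass collects death-row entries older than
// now - TTL with an ordered range query, removes the matching events
// from the core stores, then clears the stale death-row entries.
class Reaper {
    final TreeMap<Long, String> deathRow = new TreeMap<>(); // timestamp -> event ID
    final Map<String, String> coreStore = new HashMap<>();  // event ID -> event

    // Returns how many stale events were reaped.
    int reap(long now, long ttlMillis) {
        Map<Long, String> stale = deathRow.headMap(now - ttlMillis); // keys < now - TTL
        int reaped = stale.size();
        for (String id : stale.values()) {
            coreStore.remove(id); // remove from core stores
        }
        stale.clear();            // remove from death row (headMap is a live view)
        return reaped;
    }
}
```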
Reaper Optimizations
• Keys (timestamps) are stored in order
• A range query (0, now – TTL) is much faster than a full scan (testing all values)
• Even though the TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)
Stats
[SAMZA-647] getAll is 23% Faster
RocksDB getAll vs. get Performance
Timestamp Collision Resolution Metrics
The Most Important Metric
Billions of messages handled by the job every day
Find out more:
©2015 LinkedIn Corporation. All Rights Reserved.
blog.linkedin.com linkedin.com/in/elgeish
mmahmoud@linkedin.com