Harvesting the Power of Samza in LinkedIn's Feed


Harvesting the Power of Samza in News Feed

Providing fresh and relevant content to hundreds of millions of members

 A Few Things Mentioned Here

Prerequisites

1. Samza
2. RocksDB (a key-value store)
3. SerDe (Serializer/Deserializer)
4. Kafka (a distributed messaging system)
5. Java


The Challenge

Relevant content is a great way to stay informed about your professional interests; fresh, relevant content is even better!

How do we keep track of what hundreds of millions of members viewed on their News Feeds?


Tracking

News Feed is the Landing Page for Most Members

Scale

[Scale chart omitted] Source: investors.linkedin.com; figures as of quarter end or monthly averages during the quarter

Client-Side Tracking

• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic

Server-Side Tracking

• Events that have more data about served feeds
• Rich payload
• Events end up in a Kafka topic

Improving Member Experience Using Samza (Overview)

1. Join input streams: a stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream (a config sketch follows this list).

2. Purge stale events: a custom TTL mechanism reaps stale events every n seconds.

3. Consume output stream: the rich data about impressions is converted into machine-learning features used for ranking items in the News Feed.
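A minimal sketch of how such a job could be wired in Samza configuration (the job, task-class, stream, and store names are invented for illustration; the property names are standard Samza):

    job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
    job.name=feed-impressions-join

    task.class=com.example.feed.JoinTask
    task.inputs=kafka.client-events,kafka.server-events
    # window() drives the reaper described later
    task.window.ms=10000

    systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory

    # RocksDB-backed stores that buffer each side of the join
    stores.client-events-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
    stores.client-events-store.key.serde=string
    stores.client-events-store.msg.serde=bytes

    serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
    serializers.registry.bytes.class=org.apache.samza.serializers.ByteSerdeFactory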


Part 1: Join

Overview


[Diagram] Client Events → Process Client Events; Server Events → Process Server Events; both processors emit Output Events.

Client-Side Events Processor Overview


Is the ID in the server-side events store?

• Yes → match events → output to Kafka
• No → store (ID, const.)
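A minimal sketch of this processor as a Samza StreamTask; the event shape, store names, and output stream are assumptions, and the server-side processor mirrors it with the stores swapped and the full event stored (the real job uses the ManyKeysToOneValueStore described later):

    import java.util.List;

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class ClientEventsProcessor implements StreamTask, InitableTask {
      // Hypothetical event type: the IDs of the items viewed on the client.
      public static class ClientEvent {
        public List<String> ids;
      }

      private static final SystemStream OUTPUT = new SystemStream("kafka", "joined-events");
      private static final byte[] DUMMY = new byte[0]; // see the optimizations slide

      private KeyValueStore<String, byte[]> clientStore;  // IDs viewed on the client
      private KeyValueStore<String, byte[]> serverStore;  // ID -> serialized server event

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        clientStore = (KeyValueStore<String, byte[]>) context.getStore("client-events-store");
        serverStore = (KeyValueStore<String, byte[]>) context.getStore("server-events-store");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
          TaskCoordinator coordinator) {
        ClientEvent event = (ClientEvent) envelope.getMessage();
        for (String id : event.ids) {
          byte[] serverEvent = serverStore.get(id);
          if (serverEvent != null) {
            // The ID is already in the server-side events store: match and emit.
            collector.send(new OutgoingMessageEnvelope(OUTPUT, id, serverEvent));
          } else {
            // Not seen yet: remember the ID until its server-side event arrives.
            clientStore.put(id, DUMMY);
          }
        }
      }
    }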

Client-Side Events Processor Optimizations


• Initial capacity of the matches map (event → matched IDs) is determined by a metric (GC-friendly)

• Initial capacity of each value set equals |IDs|

• An empty byte array is used as a dummy value for IDs stored in RocksDB (it passes through the NOP byte-array SerDe), making the store act as a set (sketch below)
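A tiny sketch of that set-emulation trick, assuming a KeyValueStore<String, byte[]> backed by the pass-through byte-array SerDe:

    // A set facade over a Samza KeyValueStore: an empty byte[] is the
    // dummy value, and the NOP SerDe writes it through unchanged.
    static final byte[] DUMMY = new byte[0];

    static void add(KeyValueStore<String, byte[]> set, String id) {
      set.put(id, DUMMY); // "add" the ID to the set
    }

    static boolean contains(KeyValueStore<String, byte[]> set, String id) {
      return set.get(id) != null; // membership check
    }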

Server-Side Events Processor Overview


Is the ID in the client-side events store?

• Yes → match events → output to Kafka
• No → store (ID, event)

Event Anatomy

• Header (shared event data, e.g. member ID)
• List of payloads (one for each item)
• Each payload has a join key (ID)

[Diagram: one event = shared event data (e.g. member ID) plus payloads with IDs 123, 456, 789, 012, 345, 678, 901, 234]
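A minimal sketch of this anatomy (class and field names are guesses, not the actual schema):

    import java.util.List;

    // One tracking event: a shared header plus a payload per feed item.
    public class TrackingEvent {
      public Header header;           // shared event data, e.g. member ID
      public List<Payload> payloads;  // one per item served or viewed

      public static class Header {
        public long memberId;
      }

      public static class Payload {
        public String id; // the join key
        // server-side events carry rich per-item data here
      }
    }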

Server-Side Events Storage


[Diagram] Naively, the shared event data would be duplicated for every payload ID. Instead, each ID (789, 012, 345, 678, 901, 234, 123, 456) is stored as a key mapping to the same UUID (1A67343FE…83B), and that UUID maps to the shared event data (e.g. member ID) exactly once.

Server-Side Events Storage: ManyKeysToOneValueStore<K, V>


• Space-efficient

• Insertion is transactional

• Rolling back a transaction is a best-effort thing

• Requires an additional lookup (but it's worth it)
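The deck names the abstraction but not its internals; a minimal sketch of how ManyKeysToOneValueStore<K, V> could be layered on two Samza stores, with best-effort rollback, is:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    import org.apache.samza.storage.kv.KeyValueStore;

    public class ManyKeysToOneValueStore<K, V> {
      private final KeyValueStore<K, String> keyToUuid;   // many keys -> one UUID
      private final KeyValueStore<String, V> uuidToValue; // UUID -> value, stored once

      public ManyKeysToOneValueStore(KeyValueStore<K, String> keyToUuid,
          KeyValueStore<String, V> uuidToValue) {
        this.keyToUuid = keyToUuid;
        this.uuidToValue = uuidToValue;
      }

      /** Stores the value once and points every key at it. */
      public void put(List<K> keys, V value) {
        String uuid = UUID.randomUUID().toString();
        List<K> written = new ArrayList<>(keys.size());
        try {
          uuidToValue.put(uuid, value);
          for (K key : keys) {
            keyToUuid.put(key, uuid);
            written.add(key);
          }
        } catch (RuntimeException e) {
          // Best-effort rollback: undo whatever was written before the failure.
          for (K key : written) {
            keyToUuid.delete(key);
          }
          uuidToValue.delete(uuid);
          throw e;
        }
      }

      /** The indirection costs an additional lookup, but saves space. */
      public V get(K key) {
        String uuid = keyToUuid.get(key);
        return uuid == null ? null : uuidToValue.get(uuid);
      }
    }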

Event Matching


Client-Side Event IDs: 789, 012, 345, 678, 901, 234, 123, 456

Server-Side Event A IDs: 111, 456, 906, 678, 901, 431, 746

Server-Side Event B IDs: 234, 012, 123, 100, 313, 345, 333

Output Event A (IDs in both the client event and server event A): 901, 456, 678

Output Event B (IDs in both the client event and server event B): 012, 123, 345, 234

 [SAMZA-647] Key-Value Store Contributions to Samza


• The access pattern is getAll(List<K>)

• RocksDB supports multiGet, which is faster than get

• Added that support to Samza's KeyValueStore (see the sketch below)

• Perf-test results confirm RocksDB's own numbers (with caching disabled)
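A sketch of the matching step built on that batched lookup; the helper name is illustrative, and getAll(List<K>) is the method SAMZA-647 added:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.samza.storage.kv.KeyValueStore;

    // One batched multiGet against RocksDB instead of |IDs| individual gets.
    static List<String> match(KeyValueStore<String, byte[]> clientIds, List<String> ids) {
      Map<String, byte[]> seen = clientIds.getAll(ids); // single batched lookup
      List<String> matched = new ArrayList<>(ids.size());
      for (String id : ids) {
        if (seen.get(id) != null) { // the ID was viewed on the client
          matched.add(id);
        }
      }
      return matched; // e.g. server event A above yields 901, 456, 678
    }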

Part 2: TTL

Custom TTL Mechanism

• Records the timestamp of when an event was stored

• The "death row" store: the key is the timestamp and the value is an ID

• Because the key is a timestamp, collisions occur:


Linear Probing Timestamper

[Flowchart] Generate a timestamp. If the bucket is free, put(timestamp, ID). If the bucket is taken, probe the next timestamp while attempts <= max; once attempts > max, put(timestamp, ID) anyway.
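A minimal sketch of that flow, assuming a KeyValueStore<Long, String> death-row store and a small max-attempts bound:

    // Linear probing: on a collision, try the next millisecond until a free
    // bucket is found or maxAttempts runs out (precision is not critical).
    static void recordDeathRow(KeyValueStore<Long, String> deathRow, String id,
        int maxAttempts) {
      long timestamp = System.currentTimeMillis();
      int attempts = 0;
      while (deathRow.get(timestamp) != null && attempts <= maxAttempts) {
        timestamp++; // probe the next bucket
        attempts++;
      }
      deathRow.put(timestamp, id); // if probing gave up, overwriting is acceptable
    }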


TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events simply get deleted in the next window

Keeping it simple and stupid works

Reapers

Every n seconds:
    Get death rows (t < now – TTL)
    For each entry in death row:
        Remove from core stores
        Remove from death row


Reaper Optimizations


• Keys (timestamps) are stored in order

• A range query (0, now – TTL) is much faster than a full scan (testing all values)

• Even though the TTL is on the order of minutes or hours, reaping stale events happens every 10 seconds because the window method is blocking (sketch below)
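A sketch of the reaper as a Samza WindowableTask window method, assuming fields deathRow (KeyValueStore<Long, String>), coreStore, and ttlMs on the task; Entry and KeyValueIterator are Samza's org.apache.samza.storage.kv types:

    // Runs every 10 seconds (task.window.ms=10000); range(0, now - TTL)
    // touches only stale keys because timestamps are stored in sorted order.
    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
      long cutoff = System.currentTimeMillis() - ttlMs;
      List<Entry<Long, String>> stale = new ArrayList<>();
      KeyValueIterator<Long, String> deathRows = deathRow.range(0L, cutoff);
      try {
        while (deathRows.hasNext()) {
          stale.add(deathRows.next());
        }
      } finally {
        deathRows.close(); // store iterators must be closed
      }
      for (Entry<Long, String> entry : stale) {
        coreStore.delete(entry.getValue()); // remove the stale event from the core store
        deathRow.delete(entry.getKey());    // then clear its death-row entry
      }
    }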

Stats


[SAMZA-647] getAll is 23% Faster

[Chart: RocksDB getAll vs. get performance]


Timestamp Collision Resolution Metrics


The Most Important Metric


Billions of messages handled by the job every day

Find out more:

©2015 LinkedIn Corporation. All Rights Reserved.

blog.linkedin.com | linkedin.com/in/elgeish | mmahmoud@linkedin.com
