optimizing big scio pipelines (in scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/g... · make...

33
Optimizing Big Scio Pipelines (in Scala!) Naheon Kim 6th March 2019

Upload: others

Post on 20-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Optimizing Big Scio Pipelines(in Scala!)

Naheon Kim6th March 2019

Page 2: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Who I am

‣ Data engineer at Spotify‣ Was backend engineer in Korea

Page 3: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

We lock privacy

‣ Padlock - Key management system‣ Right to be forgotten

○ Make private data forgotten by wiping keys out

ENCRYPT(keyA, userA)

Padlock

Give me userA’s key!

UserA doesn’t want to!!!

Okay DECRYPT(keyA, encrypted_userA)

Page 4: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Event processing at Spotify

EventDelivery System

Pseudonymization

Cloud Storage

Scio pseudonymization pipelinesfor every eventrun every hour

Page 5: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

We use Scio

+more cool things

Scala

Google Cloud Dataflow

Scio

‣ Scala API for Apache Beam andGoogle Cloud Dataflow

‣ Apache Beam○ Unified model for batch and streaming

processing○ Dataflow : cloud-driven

Page 6: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Scio gives...

‣ Elegant syntax‣ Stronger type safety‣ In-house optimization

Page 7: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

From Beam to Scio

Page 8: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Pseudonymize events

‣ Generalized Scio pipeline to pseudonymize any data set

‣ Join hourly event with hourly key dump

Event

Join Pseudony-mized data

All keys

Page 9: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Technical challenges

‣ > 400 event types‣ Various data size‣ Users are not active all the time

Page 10: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

small event optimization

Page 11: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Optimize small event

Join is still slow

‣ Key dump is still big○ 207M MAU (2018)

‣ Shuffling○ Journey to find the same key○ Network IO is expensive

I’m here,my friend!!

Where is mybuddy?

Page 12: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Side input

sc.broadcast(m)

An additional input that your DoFn can access each time it processes an element in the input PCollection

Beam copies side input to every worker

Page 13: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Hash join: idea combined

HashJoin(big_left, small_right)

B1

worker

B2

worker

B3

worker

S S S

distributeside inputcopy

Bn

S

join

Page 14: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Verify hash join

Page 15: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Experimentation

‣ Giant and tiny data set‣ Join them either by naive join or by hash join‣ Monitor and compare resources

50 MB

51 GB

>15GB*5

Page 16: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Experimentation pipelines

Page 17: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Expected result

‣ Hash join is faster‣ Can see different resource patterns

Page 18: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Expected result

‣ Hash join is faster : 30min > 10min‣ Can see different resource patterns : YES!

Page 19: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Task scheduling

Hash join Naive join

Reading small input Reading big inputnot started

Page 20: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Autoscale workers

Hash join Naive join

IO for small input

Page 21: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

CPU

Hash join Naive join

100%100%

Page 22: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Network traffic

Hash join

Received bytes at once

Naive join

In and out - do shuffle

received

received

sent

Page 23: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Disk

Hash join Naive join

write

read write

Page 24: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

hashJoin implementationbig_left.hashJoin(small_right)

1234567891011

// (every key-value from big_left, SideInputContext)

Page 25: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

W00t! Small events are happy (:

Let’s optimize big event

Page 26: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Optimize big event

‣ They never fit in memory‣ Turn the other side out : is it possible not to

shuffle key dump?

JOIN(key_dump, small_event)

JOIN(big_event, key_dump)

No shuffle

Page 27: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Key dump shortcut

Users are not active all the time

User1, Event1

User1, Event2

User1, Event3

User1, Event5

User1, Event4

User2, Event1

User2, Event2

User3, Event1

User4, Event1

User1 🔑🔑🔑 User2 🔑🔑🔑 User3 🔑🔑🔑 User4 🔑🔑🔑 User5 🔑🔑🔑 User6 🔑🔑🔑 User7 🔑🔑🔑 User8 🔑🔑🔑 User10 🔑🔑🔑 User11 🔑🔑🔑User12 🔑🔑🔑User13 🔑🔑🔑 User14 🔑🔑🔑 User15 🔑🔑🔑

Page 28: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Sparkey join

‣ Summarize active users from the most active event

‣ Pre-generate hotset called ‘Sparkey’○ You can think of it as disk cache

‣ Put Sparkey as side input and join

Page 29: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Let’s merge optimizations

HashJoin(key_dump, small_event)Event < 50MB

SparkeyJoin(keys_in_hotset, hotset)Users in hotset

SkewedJoin(key_dump, rest_event)

Yes

Yes

No

No Merge intothe final result

Page 30: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Let’s merge optimizations

HashJoin(key_dump, small_event)Event < 50MB

SparkeyJoin(keys_in_hotset, hotset)Users in hotset

SkewedJoin(key_dump, rest_event)

Yes

Yes

No

No Merge intothe final result

They are different join types.

Our codes are not messy…

Page 31: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Google is expensive.

We save a lot of money!

Page 32: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Lesson learned

‣ Spotify scale and technical challenges I’ve never seen before

‣ Don’t guess or imagine. Prove by making hands dirty and have fun

Page 33: Optimizing Big Scio Pipelines (in Scala!)t1.daumcdn.net/brunch/service/user/6cl5/file/G... · Make private data forgotten by wiping keys out ... Scio pseudonymization pipelines for

Thank you! : )