samza tech talk_2015 - strata
![Page 1: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/1.jpg)
Stream Processing @Scale in LinkedIn
Yi Pan
Data Infrastructure
Samza Team @LinkedIn
Databus
![Page 2: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/2.jpg)
Overview
• What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features
![Page 3: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/3.jpg)
Stream Processing
• What is stream processing?
  – Input: an unbounded sequence of events
    • E.g. web server logs, user activity tracking events, database changelogs, etc.
  – Latency: near real-time
    • From milliseconds to minutes, instead of hours to days
  – Output: an unbounded sequence of changes to the derived dataset
    • The derived dataset is usually the final or partial analytic result, stored either in another stream or in a serving data store
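To make the input/output definition concrete, here is a minimal sketch in plain Python (not Samza code). The function name `page_view_counts` and the `page` field are hypothetical; the point is only that each incoming event produces one change to a derived dataset, here a running count per page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def page_view_counts(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Consume events one at a time and emit one change record per event."""
    counts: Dict[str, int] = defaultdict(int)  # the derived dataset
    for event in events:  # with a real unbounded stream this loop never ends
        page = event["page"]
        counts[page] += 1
        yield page, counts[page]  # a change to the derived dataset


# A short, bounded list stands in for the unbounded input stream.
events = [{"page": "/home"}, {"page": "/jobs"}, {"page": "/home"}]
changes = list(page_view_counts(events))
print(changes)
```

The emitted changes could be written to another stream or applied to a serving store, which are the two destinations the slide names.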
![Page 4: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/4.jpg)
Stream Processing
[Figure: response latency spectrum. RPC: synchronous (0 ms). Stream processing: milliseconds to minutes. Far end of the axis: "Later. Possibly much later."]
![Page 5: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/5.jpg)
Stream Processing
• What are the application requirements?
  – Scalable, fast, stateful stream processing
  – What scale should we operate at?
    • Traffic Volume: 1.4 Trillion events/day
    • Intermediate State Size: multi TB / colo (*)
  – Why is it expensive to run stream processing at scale?
    • Intermediate data set needs to be stored to allow low latency processing
    • Large volume of data needs to be pulled and pushed via network
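For a sense of scale, the daily traffic figure can be converted into an average per-second rate. This is plain arithmetic on the number from the slide; it assumes the load is spread evenly over the day, so real peaks would be higher.

```python
EVENTS_PER_DAY = 1.4e12          # traffic volume quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Average rate, assuming a perfectly even distribution over the day.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events/s on average")
```

That works out to roughly 16 million events per second on average.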
![Page 6: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/6.jpg)
Overview
• What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features
![Page 7: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/7.jpg)
What’s Samza
• Samza is a distributed Turing machine
  – Single Task Samza Job is a stateful Turing machine
[Diagram labels: Input stream, Samza Task, Output stream, State, changelog, checkpoint]
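The diagram's pieces (input stream, task, state, changelog, checkpoint, output stream) can be illustrated with a small sketch. This is not the Samza API: `CountingTask`, `process`, and `commit` are hypothetical names, and plain Python lists stand in for the streams.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CountingTask:
    """Hypothetical single task that counts messages per key."""
    state: Dict[str, int] = field(default_factory=dict)
    changelog: List[Tuple[str, int]] = field(default_factory=list)
    output: List[Tuple[str, int]] = field(default_factory=list)
    last_offset: int = -1   # offset of the last message processed
    checkpoint: int = -1    # offset of the last message checkpointed

    def process(self, offset: int, key: str) -> None:
        new_value = self.state.get(key, 0) + 1
        self.state[key] = new_value              # update local state
        self.changelog.append((key, new_value))  # record the state change
        self.output.append((key, new_value))     # emit to the output stream
        self.last_offset = offset

    def commit(self) -> None:
        self.checkpoint = self.last_offset       # remember input position


task = CountingTask()
for offset, key in enumerate(["a", "b", "a"]):   # stand-in input stream
    task.process(offset, key)
task.commit()
print(task.state, task.checkpoint)
```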
![Page 8: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/8.jpg)
What’s Samza
– Scaling a Samza job: partition the streams
[Diagram: Input stream A, split into partition 0, partition 1, partition 2, partition 3, ..., partition n]
![Page 9: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/9.jpg)
What’s Samza
– Scaling a Samza job: partition the streams
[Diagram: Input stream B, split into partition 0, partition 1, partition 2, partition 3, ..., partition n]
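One common way to split a stream into partitions is to hash each message key. The slides do not say which partitioning function is used, so the CRC32-based `partition_for` below is an assumption made for illustration, as are the sample keys.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
stream_a = [("member-1", "view"), ("member-2", "click"), ("member-1", "click")]

partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
for key, value in stream_a:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

for partition_id in sorted(partitions):
    print(partition_id, partitions[partition_id])
```

Because all messages for a key go to one partition, per-key state can stay local to whichever task owns that partition.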
![Page 10: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/10.jpg)
What’s Samza
– Scaling a Samza job: replicating the state machine
[Diagram labels: Job, shared checkpoint]
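A rough sketch of the replication idea: the same task logic runs once per input partition, each replica keeps its own state, and all replicas record their input positions in one shared checkpoint. The class name `PartitionTask` and the dictionary layout of the checkpoint are assumptions; the slide shows only the labels.

```python
from collections import Counter
from typing import Dict, List


class PartitionTask:
    """Hypothetical task replica that owns exactly one input partition."""

    def __init__(self, partition: int) -> None:
        self.partition = partition
        self.counts: Counter = Counter()  # state local to this replica

    def process(self, key: str) -> None:
        self.counts[key] += 1


# partition id -> message keys, assumed to be partitioned upstream
input_partitions: Dict[int, List[str]] = {0: ["a", "a"], 1: ["b"], 2: ["c", "c", "c"]}

tasks = {p: PartitionTask(p) for p in input_partitions}
shared_checkpoint: Dict[int, int] = {}  # partition id -> last processed offset

for p, messages in input_partitions.items():
    for offset, key in enumerate(messages):
        tasks[p].process(key)
        shared_checkpoint[p] = offset

print(shared_checkpoint)
```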
![Page 11: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/11.jpg)
What’s Samza
• Samza Execution in Yarn
[Diagram labels: Deploy Samza job; Host 1, Host 2, Host 3; Application Master; three Samza containers]
![Page 12: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/12.jpg)
What’s Samza
• States in Samza
  – Checkpoints
    • Offsets per input stream partition
  – State Stores
    • In-memory or on-disk (RocksDB) derived data set
[Diagram labels: Host 1; Samza Task; Output stream partitions; State changelog partitions; checkpoint]
![Page 13: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/13.jpg)
What’s Samza
• States in Samza
  – Checkpoints and local state stores are backed by distributed logs
[Diagram labels: Host 1; Samza Task; Output stream partitions; State changelog partitions; checkpoint]
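Backing state with a log means a restarted task can rebuild its local store by replaying the changelog and then resume reading input from the checkpointed position. The sketch below assumes the changelog holds (key, latest value) pairs and that the checkpoint stores the offset of the last processed message; both are assumptions, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def restore_state(changelog: List[Tuple[str, int]]) -> Dict[str, int]:
    """Rebuild the local store by replaying the changelog in order."""
    state: Dict[str, int] = {}
    for key, value in changelog:
        state[key] = value  # later entries overwrite earlier ones
    return state


def resume_offset(checkpoint: int) -> int:
    """First input offset to read after a restart (assumed convention)."""
    return checkpoint + 1


changelog = [("a", 1), ("b", 1), ("a", 2)]  # stand-in changelog partition
state = restore_state(changelog)
next_offset = resume_offset(checkpoint=2)
print(state, next_offset)
```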
![Page 14: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/14.jpg)
Overview
• What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features
![Page 15: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/15.jpg)
Stream Processing @ LinkedIn
[Architecture diagram labels: WebServers, MonitorServers, Oracle, Espresso, Kafka, Databus, Tracking events, Metrics, changelog, Samza Jobs, bootstrap, Voldemort, Derived Data]
![Page 16: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/16.jpg)
Stream Processing @ LinkedIn
• Tracking aggregate/analysis (ACG)
![Page 17: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/17.jpg)
Stream Processing @ LinkedIn
• Content standardization with adjunct data set
[Diagram labels: Member Profile DB, Bootstrap Job, Databus, Kafka (three times), Content Standardization]
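The adjunct data set pattern can be read as a stream-table join: first load the profile data into a local store, then enrich each incoming event by looking up its member. The sketch below is an illustration under that reading, not LinkedIn's job. The field names (`member_id`, `title`, `text`), the function names, and the lowercase/trim "standardization" step are all hypothetical.

```python
from typing import Dict, Iterable, Iterator


def bootstrap(profile_changes: Iterable[dict]) -> Dict[str, dict]:
    """Load the adjunct data set into a local store, latest change wins."""
    store: Dict[str, dict] = {}
    for change in profile_changes:
        store[change["member_id"]] = change
    return store


def standardize(events: Iterable[dict], profiles: Dict[str, dict]) -> Iterator[dict]:
    """Join each event with the member's profile from the local store."""
    for event in events:
        profile = profiles.get(event["member_id"])
        if profile is None:
            continue  # no adjunct data for this member; skipped in this sketch
        yield {**event, "title": profile["title"].strip().lower()}


profile_changes = [
    {"member_id": "m1", "title": " Software Engineer "},
    {"member_id": "m1", "title": "Staff Engineer"},
]
events = [{"member_id": "m1", "text": "hello"}, {"member_id": "m2", "text": "hi"}]

result = list(standardize(events, bootstrap(profile_changes)))
print(result)
```

Keeping the profile data in a local store avoids a remote lookup per event, which ties back to the cost concerns about network traffic raised earlier in the talk.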
![Page 18: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/18.jpg)
Stream Processing @ LinkedIn
• Kafka Deployment
  – 1.1 Trillion messages / day
• Databus Deployment
  – 300 Billion messages / day
• Samza Deployment
  – multiple colos
  – 10+ Yarn clusters
  – 200+ nodes
  – 100+ Jobs in production
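As a quick consistency check, the Kafka and Databus figures add up to the 1.4 trillion events/day quoted earlier in the talk. Whether the earlier number was derived this way is an assumption; the code below only does the arithmetic.

```python
KAFKA_PER_DAY = 1_100_000_000_000    # 1.1 trillion messages / day
DATABUS_PER_DAY = 300_000_000_000    # 300 billion messages / day

total_per_day = KAFKA_PER_DAY + DATABUS_PER_DAY
kafka_share = KAFKA_PER_DAY / total_per_day

print(f"total: {total_per_day:,} messages/day")
print(f"Kafka share: {kafka_share:.0%}")
```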
![Page 19: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/19.jpg)
Overview
• What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features
![Page 20: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/20.jpg)
Upcoming Features
• New features
  – Local state store improvements
    • RocksDB TTL support
    • Fast recovery
  – Dynamic configuration
  – Easier deployment with standalone jobs
  – High-level query language for faster development
![Page 21: Samza tech talk_2015 - strata](https://reader036.vdocuments.mx/reader036/viewer/2022070509/589e21371a28ab605b8b69f9/html5/thumbnails/21.jpg)
Contact Us / Get Involved
• Open Source
  – Documentation: samza.apache.org
  – Mailing list: [email protected]
  – JIRA: https://issues.apache.org/jira/browse/SAMZA