data stream processing - amazon s3 · • high performance message store: kafka • ubiquity of...
Data Stream Processing
Can we finally forget the batches?
Who am I?
Dominik Wagenknecht
Senior Technology Architect
Accenture, Vienna / Austria
Dealing with data in many industries
Data needs to move!
A → B
To get data where it’s needed
• Monolith with one gargantuan DB: teams colliding, low agility, JOIN everything!
• Services with one DB per service: per team, high agility, data as needed
To get smarter
[Diagram: many systems and services feed a data warehouse, which delivers reporting, analytics, and insight]
• Every system does its job
• The data warehouse tells you what to do better
= to integrate services
Let’s do Batch!
(oldie but goldie)
So how does batch work?
Sources → Extract → Processing (Merge, Transform, Enrich) → Load → Target
Runs every night
Speed it up?!? Delta-batches…
Every hour → ½ hour → ¼ hour
Source → ETL → Target, repeated for each interval
Enjoy the fun when batches overlap
Batch is not enough
• It’s basically always too late*
• Bumpy load patterns cause shockwaves in the system
• Mostly at night
• Testing becomes painful
*Exceptions: on-purpose timings like interest rate calculations, monthly billing, etc.
You cannot go from batch to stream!
What does the stream look like?
Sources → Event-by-event processing → Target
(there is often some state here!)
Runs continuously
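As a rough illustration of event-by-event processing with state, here is a minimal Python sketch. The event shape `(account, amount)` and the function name `running_totals` are assumptions made for this example, not part of the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(events: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Process events one at a time and emit an updated total per key."""
    totals: Dict[str, float] = defaultdict(float)  # the "state" kept between events
    for key, amount in events:
        totals[key] += amount
        # Emit right away: the target sees the new value without waiting for a batch.
        yield key, totals[key]


# Hypothetical input: (account, amount) events arriving continuously.
incoming = [("acct-1", 10.0), ("acct-2", 5.0), ("acct-1", 2.5)]
results = list(running_totals(incoming))
```

In a real deployment the state would need to survive restarts; this sketch keeps it in memory only.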
But you can go from stream to batch
Sources → Event-by-event processing → Target
(there is often some state here!)
Runs continuously
Batch-style Extract and Load steps attach to the stream
Why now?
Why now?
• Message queues have existed forever
• It’s a cost and efficiency thing
What changed?
• LinkedIn / Netflix & friends
• Transaction guarantees of classic MQs not needed
• High-performance message store: Kafka
• Ubiquity of high-performance distributed stream processing: Storm, Samza, Kafka Streams, Spark Streaming, Heron, Flink, …
Performance & Transactional Guarantees
Source → Classic MQ → Target
Fully transactional: the MQ keeps track of all messages and transactions from all sources
Fully transactional: the MQ needs to track the state of every message
• re-available after timeout
• back-out queues on rollback, etc.
• look-ups by correlation ID, etc.
Enter the simple distributed log
Source → Log file(s) (oldest data … latest data) → Target
Writing a message just appends it to one of the log files. The message's index is essentially its position in the file.
Reading essentially means pulling from an index and then just reading forward.
Challenge: position tracking
• Kafka helps with that
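The append-and-read-forward idea can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's API; the class names `AppendOnlyLog` and `Reader` are invented for illustration.

```python
from typing import List, Tuple


class AppendOnlyLog:
    """Toy in-memory stand-in for one log file (partition)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, message: bytes) -> int:
        """Append a message and return its position (offset) in the log."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset: int, max_records: int = 100) -> List[Tuple[int, bytes]]:
        """Return up to max_records (offset, message) pairs, reading forward."""
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


class Reader:
    """Each reader tracks only its own position; the log keeps no per-message state."""

    def __init__(self, log: AppendOnlyLog) -> None:
        self._log = log
        self.position = 0

    def poll(self) -> List[bytes]:
        batch = self._log.read_from(self.position)
        if batch:
            self.position = batch[-1][0] + 1  # next offset to read
        return [message for _, message in batch]


log = AppendOnlyLog()
for payload in (b"a", b"b", b"c"):
    log.append(payload)

reporting = Reader(log)
billing = Reader(log)
first = reporting.poll()   # everything written so far
log.append(b"d")
second = reporting.poll()  # only what arrived since the last poll
catch_up = billing.poll()  # an independent reader starts from its own position
```

The only bookkeeping is one integer per reader, which is the position-tracking challenge named above.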
Consequences of the distributed log
We lose
• Full transaction support
• Lookups by correlation ID, etc.
We win
• Very high throughput
• No overfilled queues, so we can batch into it!
• Strict ordering per log file/partition*
• Multiple target systems can read independently
• Very simple testing
*which is quite useful given we’re replacing batch...
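To make "strict ordering per partition" concrete, here is a hedged Python sketch that routes events to partitions by key. The partition count, the CRC32 hash, and the function names are illustrative assumptions; a real broker or client library may partition differently.

```python
import zlib
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3  # assumed value for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events: List[Tuple[str, str]]) -> Dict[int, List[Tuple[str, str]]]:
    """Append each (key, value) event to the partition chosen by its key."""
    partitions: Dict[int, List[Tuple[str, str]]] = {p: [] for p in range(NUM_PARTITIONS)}
    for key, value in events:
        partitions[partition_for(key)].append((key, value))
    return partitions


# Hypothetical events: (customer key, change type).
events = [
    ("cust-1", "created"),
    ("cust-2", "created"),
    ("cust-1", "updated"),
    ("cust-1", "deleted"),
]
routed = route(events)
cust_1_history = [v for k, v in routed[partition_for("cust-1")] if k == "cust-1"]
```

All events for one key land in the same partition in arrival order, while different keys can be processed in parallel on other partitions.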
Summary
Technologies in play
Source → Event processing → Target
Source side
• DB: change-data-capture or batch export :-)
• System: data feed
Target side
• DB: just insert (with decent commit size)
• System: REST calls
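"Just insert with a decent commit size" might look like the following Python sketch using the standard-library `sqlite3` module. The table name `events`, its columns, and the commit size of 500 are assumptions chosen for the example.

```python
import sqlite3
from typing import Iterable, Tuple


def load_events(
    conn: sqlite3.Connection,
    events: Iterable[Tuple[int, str]],
    commit_size: int = 500,
) -> int:
    """Insert events one by one, committing every commit_size rows.

    Returns the number of commits performed.
    """
    commits = 0
    pending = 0
    for event_id, payload in events:
        conn.execute("INSERT INTO events (id, payload) VALUES (?, ?)", (event_id, payload))
        pending += 1
        if pending >= commit_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:  # flush the final partial group
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
commits = load_events(conn, ((i, f"event-{i}") for i in range(1200)), commit_size=500)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Committing per group rather than per row keeps the number of transactions low while the data still arrives continuously.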
What should you take away from this?
• Think differently: streaming-first
• Go idempotent to keep it simple
• Partition to go fast & ordered
• Establish governance & standards as you go*
*data formats, naming conventions, operations, …
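One possible reading of "go idempotent" is sketched below in Python: events carry an ID and an absolute value, so redelivery after a restart does not change the result. The event shape and the names `apply_events`, `balances`, and `seen` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str, int]  # (event_id, account, new_balance): hypothetical shape


def apply_events(store: Dict[str, int], seen: Set[str], events: Iterable[Event]) -> None:
    """Apply each event at most once, keyed by its event ID."""
    for event_id, account, new_balance in events:
        if event_id in seen:
            continue  # already applied, so a redelivered event is skipped
        store[account] = new_balance  # absolute value instead of a delta
        seen.add(event_id)


balances: Dict[str, int] = {}
seen: Set[str] = set()
batch = [("e1", "acct-1", 100), ("e2", "acct-1", 80)]

apply_events(balances, seen, batch)
apply_events(balances, seen, [batch[0]])  # simulate redelivery of an old event
```

Without the `seen` check, the redelivered `e1` would reset the balance to 100. With it, the reader can simply restart from an earlier position and replay.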
Thank you
Questions?