Event Stream Processing with Kafka and Samza
![Page 1: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/1.jpg)
Event Stream Processing with Kafka and Samza
Zach Cox - @zcox - [email protected]
Iowa Code Camp - 1 Nov 2014
![Page 2: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/2.jpg)
Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
![Page 3: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/3.jpg)
References
Kafka
  Kafka Documentation
  The Log: What every software engineer should know about real-time data's unifying abstraction
  Benchmarking Apache Kafka
Samza
  Samza Documentation
  Questioning the Lambda Architecture
  Moving faster with data streams: The rise of Samza at LinkedIn
  Why local state is a fundamental primitive in stream processing
  Real time insights into LinkedIn's performance using Apache Samza
![Page 4: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/4.jpg)
Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
![Page 5: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/5.jpg)
Event
Something happened
Record that fact so we can process it
![Page 6: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/6.jpg)
Event
Describes what happened
  Who did it?
  What did they do?
  What was the result?
Provides context
  When did it happen?
  Where did it happen?
  How did they do it?
  Why did they do it?
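The questions above map naturally onto a record type. The following is a minimal sketch of one possible shape, not a schema from the talk: `RecordedEvent`, its field names, and `EventExample` are hypothetical, and the sample values are borrowed from the pageview example in this deck.

```scala
import java.time.OffsetDateTime

// Hypothetical event shape: what happened, plus context about when and where.
final case class RecordedEvent(
  actor: String,                // who did it
  action: String,               // what they did
  result: Option[String],       // what the outcome was, if any
  time: OffsetDateTime,         // when it happened
  context: Map[String, String]  // where/how/why, as free-form key-value pairs
)

object EventExample {
  // A pageview expressed in this shape.
  val pageview: RecordedEvent = RecordedEvent(
    actor = "a2be9031-9465-4ecb-9302-9b962fa854ac",
    action = "viewed-page",
    result = Some("https://www.mycompany.com/page.html"),
    time = OffsetDateTime.parse("2014-10-14T10:49:24.438-05:00"),
    context = Map("ip" -> "65.121.142.238")
  )
}
```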
![Page 7: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/7.jpg)
Event Example: Pageview
User viewed web page
User
  ID: a2be9031-9465-4ecb-9302-9b962fa854ac
  IP: 65.121.142.238
  User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Web Page
  URL: https://www.mycompany.com/page.html
Context
  Time: 2014-10-14T10:49:24.438-05:00
![Page 8: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/8.jpg)
Event Example: Clickthrough
User clicked link
User
  ID: a2be9031-9465-4ecb-9302-9b962fa854ac
  IP: 65.121.142.238
  User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Link
  URL: https://www.mycompany.com/product.html
  Referer: https://www.othersite.com/foo.html
Context
  Time: 2014-10-14T10:49:24.438-05:00
![Page 9: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/9.jpg)
Event Example: User Update
User changed first name
User
  ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
  First name: Zach
Context
  Time: 2014-10-14T10:59:56.481-05:00
  IP: 65.121.142.238
![Page 10: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/10.jpg)
Event Example: User Update
User uploaded a new profile image
User
  ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
Profile Image
  URL: http://profile-images.s3.amazonaws.com/katy-perry.jpg
Context
  Time: 2014-10-14T10:59:56.481-05:00
  IP: 65.121.142.238
  Using: webcam
![Page 11: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/11.jpg)
Event Example: Tweet
User posted a tweet
User
  ID:
  Username: @zcox
  Name: Zach Cox
  Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr
Tweet
  ID: 527152511568719872
  URL: https://twitter.com/zcox/status/527152511568719872
  Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp
  Mentions: @apachekafka, @samzastream, @iowacodecamp
  URLs: http://iowacodecamp.com/session/list#66
Context
  Time: 2014-10-14T10:59:56.481-05:00
  Using: Twitter for Android
  Location: 41.7146365,-93.5914038
![Page 12: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/12.jpg)
Event Example: HTTP Request Latency
Some measured code took some time to execute
Code
  production.my-app.some-server.http.get-user-profile
Time to execute
  Min: 20 msec
  Max: 950 msec
  Average: 190 msec
  Median: 110 msec
  50%: 100 msec
  75%: 120 msec
  95%: 150 msec
  99%: 500 msec
Context
  Time: 2014-10-14T11:17:01.597-05:00
![Page 13: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/13.jpg)
Event Example: Runtime Exception
Some code threw a runtime exception
Some code
  Stack trace: [...]
Exception
  Message: HBase read timed out
Context
  Time: 2014-10-14T11:21:23.749-05:00
  Application: my-app
  Machine: some-server.my-company.com
![Page 14: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/14.jpg)
Event Example: Application Logging
Some code logged some information
[INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started
Message: Slf4jEventHandler started
Level: INFO
Time: 2014-10-14 11:25:44,750
Thread: sentry-akka.actor.default-dispatcher-2
Logger: akka.event.slf4j.Slf4jEventHandler
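Turning a raw log line into the fields listed on this slide might look like the sketch below. It assumes the bracketed `[level] [time] [thread] logger: message` layout of the sample line; `LogEvent` and `LogLineParser` are hypothetical names. The parser returns the logger name as abbreviated in the line (`a.e.s.Slf4jEventHandler`) and does not expand it to the full class name.

```scala
// Hypothetical structured form of one log line.
final case class LogEvent(
  level: String,
  time: String,
  thread: String,
  logger: String,
  message: String
)

object LogLineParser {
  // Assumed layout: [LEVEL] [timestamp] [thread] logger: message
  private val Line =
    """\[(\w+)\] \[([^\]]+)\] \[([^\]]+)\]\s*([^:]+): (.*)""".r

  // Returns None when the line does not follow the assumed layout.
  def parse(line: String): Option[LogEvent] = line match {
    case Line(level, time, thread, logger, message) =>
      Some(LogEvent(level, time, thread, logger.trim, message))
    case _ =>
      None
  }
}
```

Keeping the result as an `Option` lets a caller skip or dead-letter lines that do not match instead of failing the whole stream.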
![Page 15: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/15.jpg)
Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
![Page 16: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/16.jpg)
Unified Log
Events need to be sent somewhere
Events should be accessible to any program
Log provides a place for events to be sent and accessed
Kafka is a great log service
![Page 17: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/17.jpg)
Data Integration
![Page 18: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/18.jpg)
Data Integration
![Page 19: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/19.jpg)
Log
Sequence of records
Append-only
Ordered by time
Each record assigned unique sequential number
Records stored persistently on disk
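The properties listed above can be sketched with a small in-memory structure. This is only an illustration under stated assumptions: `AppendOnlyLog` is a hypothetical name, and it keeps records in memory, whereas a real log service stores them on disk.

```scala
import scala.collection.mutable.ArrayBuffer

// In-memory stand-in for a log; a real log service persists records to disk.
final class AppendOnlyLog[A] {
  private val records = ArrayBuffer.empty[A]

  // Append a record at the end and return its sequential offset (0, 1, 2, ...).
  def append(record: A): Long = {
    records += record
    (records.size - 1).toLong
  }

  // Read all records from the given offset onward, in append order.
  def readFrom(offset: Long): Vector[A] =
    records.drop(offset.toInt).toVector

  def size: Long = records.size.toLong
}
```

There is deliberately no update or delete method: the only way to change the log is to append to it.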
![Page 20: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/20.jpg)
Log Service
![Page 21: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/21.jpg)
Logs in Distributed Databases
![Page 22: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/22.jpg)
Traditional Cache
Cache misses
Cache invalidation
![Page 23: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/23.jpg)
Infrastructure as Distributed Database
Cache is now replicated from DB
![Page 24: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/24.jpg)
Infrastructure as Distributed Database
Cache can be in-process with web app
![Page 25: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/25.jpg)
Log for Event Streams
Simple to send events to
Broadcasts events to all consumers
Buffers events on disk: producers and consumers decoupled
Consumers can start reading at any offset
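A rough sketch of why a log decouples producers from consumers: each consumer keeps its own position, so every consumer sees every event and can start wherever it likes. `EventLog` and `OffsetConsumer` are hypothetical in-memory stand-ins, not Kafka client classes.

```scala
import scala.collection.mutable.ArrayBuffer

// In-memory stand-in for a single log; names here are hypothetical.
final class EventLog[A] {
  private val records = ArrayBuffer.empty[A]

  def append(record: A): Unit = {
    records += record
    ()
  }

  // All records at or after the given offset.
  def slice(from: Int): Vector[A] = records.drop(from).toVector
}

// Each consumer tracks its own position, so consumers never affect each other
// or the producer.
final class OffsetConsumer[A](log: EventLog[A], startOffset: Int = 0) {
  private var offset = startOffset

  // Return everything appended since the last poll and advance the position.
  def poll(): Vector[A] = {
    val batch = log.slice(offset)
    offset += batch.size
    batch
  }
}
```

Because the position lives with the consumer, a slow or restarted consumer simply resumes from its last offset without the producer noticing.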
![Page 26: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/26.jpg)
Kafka
Apache OSS, mainly from LinkedIn
Handles all the logs/event streams
High-throughput: millions of events/sec
High-volume: TBs - PBs of events
Low-latency: single-digit msec from producer to consumer
Scalable: topics are partitioned across cluster
Durable: topics are replicated across cluster
Available: auto failover
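To make "topics are partitioned" concrete, here is a simplified sketch of key-based partition assignment. It is an illustration only and is not Kafka's actual partitioner; `Partitioning` and `partitionFor` are hypothetical names. The point is that the same key always lands in the same partition, which preserves per-key ordering while spreading load.

```scala
object Partitioning {
  // Simplified key-based partitioning: hash the key, then take a non-negative
  // remainder so the result is always in [0, numPartitions).
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    java.lang.Math.floorMod(key.hashCode, numPartitions)
  }
}
```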
![Page 27: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/27.jpg)
Twitter Example
Receive messages via long-lived HTTP connection as JSON
Write messages to a Kafka topic
Twitter Streaming API
![Page 28: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/28.jpg)
Twitter Example
Twitter rate-limits clients
<1% sample, ~50-100 tweets/sec
400 keywords, ? tweets/sec
1 weird trick to get more tweets: multiple clients, same Kafka topic!
![Page 29: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/29.jpg)
Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
![Page 30: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/30.jpg)
Event Stream Processing
Turn events into valuable, actionable information
Process events as they happen, not later (batch)
Do all of this reliably, at scale
![Page 31: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/31.jpg)
Event Stream Processor
![Page 32: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/32.jpg)
Event Stream Processor: Input
![Page 33: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/33.jpg)
Event Stream Processor: Output
![Page 34: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/34.jpg)
Samza
Event stream processing framework
Apache OSS, mainly from LinkedIn
Simple Java API
Scalable: runs jobs in parallel across cluster
Reliable: fault-tolerance and durability built-in
Tools for stateful stream processing
![Page 35: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/35.jpg)
Samza Job
1) Class that extends StreamTask:

    import org.apache.samza.system.IncomingMessageEnvelope
    import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

    class MyTask extends StreamTask {
      override def process(
          envelope: IncomingMessageEnvelope,
          collector: MessageCollector,
          coordinator: TaskCoordinator): Unit = {
        // process message in envelope
      }
    }

2) my-task.properties config file:

    job.factory.class=org.apache.samza.job.local.ThreadJobFactory
    job.name=my-task
    task.class=com.banno.MyTask
    ...
![Page 36: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/36.jpg)
Stateless Processing
One event at a time
Take action using only that event
SELECT * FROM raw_messages WHERE message_type = 'status';
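The SQL above has a direct streaming counterpart: a predicate applied to each message on its own. A minimal sketch, assuming a hypothetical `RawMessage` type with a `messageType` field; it uses plain Scala iterators rather than the Samza API.

```scala
// Hypothetical message shape: a type tag plus the raw JSON payload.
final case class RawMessage(messageType: String, json: String)

object StatelessFilter {
  // Decide using only the message in hand; nothing is remembered between calls.
  def isStatus(message: RawMessage): Boolean =
    message.messageType == "status"

  // Lazily keep only status messages from an incoming stream.
  def statuses(messages: Iterator[RawMessage]): Iterator[RawMessage] =
    messages.filter(isStatus)
}
```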
![Page 37: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/37.jpg)
Samza Job: Separate Message Types
Many message types from Twitter
Samza job to separate into type-specific streams
Other jobs process specific message types
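Separating one mixed stream into type-specific streams amounts to choosing an output stream name per message. The sketch below assumes a hypothetical `TwitterMessage` type and a made-up `twitter-<type>` naming scheme; a real job would send each message to the chosen stream rather than group a batch in memory.

```scala
// Hypothetical message shape: a type tag plus an opaque payload.
final case class TwitterMessage(messageType: String, payload: String)

object MessageRouter {
  // Hypothetical naming scheme for type-specific output streams.
  def outputStream(message: TwitterMessage): String =
    s"twitter-${message.messageType}"

  // Group a batch of messages by the stream each would be written to.
  def route(messages: Seq[TwitterMessage]): Map[String, Seq[TwitterMessage]] =
    messages.groupBy(outputStream)
}
```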
![Page 38: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/38.jpg)
Stateful Stream Processing
One event at a time
Take action using that event and state
State = data built up from past events
Aggregation
Grouping
Joins
![Page 39: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/39.jpg)
Aggregation
State = aggregated values (e.g. count)
Incorporate each new event into that aggregation
Output aggregated values as events to new stream
What happens if job stops?
  Crash, deploy, ...
  Can't lose state!
  Samza handles this all for you
SELECT COUNT(*) FROM statuses;
![Page 40: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/40.jpg)
Samza Job: Total Status Count
Increment a counter on every status (tweet)
Periodically output current count
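The two steps above can be sketched as a pair of callbacks: one per event, one on a timer. `StatusCounter` is a hypothetical stand-in and not the Samza API. It keeps the count only in memory, so unlike the job described in this deck it would lose its state on a crash or deploy.

```scala
// Stand-in for a task with per-event and periodic callbacks.
// `emit` is a hypothetical hook that publishes the count to an output stream.
final class StatusCounter(emit: Long => Unit) {
  private var count = 0L

  // Called once per status event; the event body is not needed for a total.
  def process(status: String): Unit =
    count += 1

  // Called periodically (for example by a timer) to publish the running total.
  def window(): Unit =
    emit(count)
}
```

The count is never reset in `window()`, so each output is the all-time total rather than a per-interval count.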
![Page 41: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/41.jpg)
Grouping
State = some data per group
Two Samza jobs:
  Output statuses by user (map)
  Count statuses per user (reduce)
Output: (user, count)
Could use as input to job that sorts by count (most active users)
SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;
SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
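The two queries above can be mirrored on an in-memory batch to show the shape of the computation. This is a sketch with hypothetical names (`UserStatus`, `StatusGrouping`); a streaming job would update the per-user counts incrementally as each status arrives instead of regrouping a whole collection.

```scala
// Hypothetical status shape: only the user id matters for grouping.
final case class UserStatus(userId: String, text: String)

object StatusGrouping {
  // "Map" step: key each status by its user. "Reduce" step: count per key.
  def countsByUser(statuses: Seq[UserStatus]): Map[String, Int] =
    statuses.groupBy(_.userId).map { case (user, ss) => user -> ss.size }

  // Most active users first; ties are broken by user id so the order is stable.
  def topUsers(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (user, count) => (-count, user) }.take(n)
}
```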
![Page 42: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/42.jpg)
Joins
Samza job has multiple input streams
Stream-Stream join: ad impressions + ad clicks
Stream-Table join: page views + user zip code
Table-Table join: user data + user settings
Joins involving tables need DB changelog
SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
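A stream-table join like the query above can be sketched by replaying the table's changelog into local state and looking each status up against it. All names below (`UserChange`, `StatusPosted`, `StatusUserJoin`) are hypothetical, and the local table is a plain in-memory map rather than a Samza state store.

```scala
import scala.collection.mutable

// Changelog entries for the users table.
sealed trait UserChange
final case class UserUpserted(id: String, username: String) extends UserChange
final case class UserDeleted(id: String) extends UserChange

final case class StatusPosted(userId: String, text: String)

// Keeps a local copy of the users table by applying its changelog, then joins
// each incoming status against that copy.
final class StatusUserJoin {
  private val usernames = mutable.Map.empty[String, String]

  def onUserChange(change: UserChange): Unit = change match {
    case UserUpserted(id, username) =>
      usernames.update(id, username)
    case UserDeleted(id) =>
      usernames.remove(id)
      ()
  }

  // Returns (username, text), or None when the user is not known locally.
  def onStatus(status: StatusPosted): Option[(String, String)] =
    usernames.get(status.userId).map(name => (name, status.text))
}
```

Returning `None` for an unknown user mirrors the inner join in the SQL: statuses without a matching user row produce no output.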
![Page 43: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/43.jpg)
What else can we compute?
Tweets per sec/min/hour (recent, not for-all-time)
Enrich tweets with weather at current location
Most active users, locations, etc
Emojis: % of tweets that contain, top emojis
Hashtags: % of tweets that contain, top #hashtags
URLs: % of tweets that contain, top domains
Photo URLs: % of tweets that contain, top domains
Text analysis: sentiment, spam
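For the first item, counts per second, minute, or hour can be computed by bucketing event timestamps into fixed-size windows. The following is a minimal sketch over a batch of epoch-millisecond timestamps; `TumblingWindowCount` is a hypothetical name, and a streaming job would keep only the recent windows in state and discard older ones.

```scala
object TumblingWindowCount {
  // Assign each timestamp (epoch millis) to a fixed-size window and count the
  // events per window. Keys are window start times in epoch millis.
  def countPerWindow(timestampsMillis: Seq[Long], windowMillis: Long): Map[Long, Int] = {
    require(windowMillis > 0, "windowMillis must be positive")
    timestampsMillis
      .groupBy(t => java.lang.Math.floorDiv(t, windowMillis) * windowMillis)
      .map { case (windowStart, ts) => windowStart -> ts.size }
  }
}
```

Passing `60000L` as the window size gives tweets per minute; `1000L` and `3600000L` give per-second and per-hour counts.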
![Page 44: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/44.jpg)
Reprocessing
http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html
![Page 45: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/45.jpg)
Other Stream Processing Frameworks
Storm
Spark Streaming
Hadoop Streaming
Akka
Riemann
Esper
![Page 46: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/46.jpg)
Druid
Send it events
  Druid reads from Kafka topic
  That Kafka topic is a Samza output stream
Super fast time-series queries: aggregations, filters, top-n, etc
http://druid.io
![Page 47: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/47.jpg)
Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
![Page 48: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/48.jpg)
![Page 49: Event Stream Processing with Kafka and Samza](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a2099c1a28ab9b368b45aa/html5/thumbnails/49.jpg)
Let's chat!
Zach Cox
@[email protected] is hiring!