I Heart Log: Real-time Data and Apache Kafka
DESCRIPTION
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
TRANSCRIPT
![Page 1: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/1.jpg)
Real-time Data and Apache Kafka
Jay Kreps
I ♥ Log
![Page 2: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/2.jpg)
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
![Page 3: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/3.jpg)
Apache Kafka
![Page 4: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/4.jpg)
![Page 5: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/5.jpg)
A brief history of Kafka
![Page 6: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/6.jpg)
Three principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters not servers
![Page 7: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/7.jpg)
Characteristics
• Scalability of a filesystem
  – Hundreds of MB/sec/server throughput
  – Many TB per server
• Guarantees of a database
  – Messages strictly ordered
  – All data persistent
• Distributed by default
  – Replication
  – Partitioning model
![Page 8: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/8.jpg)
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Low-latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
![Page 9: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/9.jpg)
Open source
• Apache Software Foundation
• Very healthy usage outside LinkedIn
• Broad base of committers
• 30 clients in 15 languages
• Great ecosystem of supporting tools
![Page 10: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/10.jpg)
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
![Page 11: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/11.jpg)
Kafka is about logs
![Page 12: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/12.jpg)
What is a log?
![Page 13: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/13.jpg)
![Page 14: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/14.jpg)
![Page 15: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/15.jpg)
Partitioning
![Page 16: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/16.jpg)
Logs: pub/sub done right
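A minimal sketch of the idea behind the log and partitioning slides, assuming a toy in-memory model: records are appended to a partition chosen by hashing a key, and each subscriber tracks its own read offsets, so a slow reader never holds back a fast one. The names `PartitionedLog` and `Subscriber` and the sample records are hypothetical and are not Kafka's API.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always maps to the same partition,
        # so records for one key keep their relative order.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_records=100):
        # Reading never removes data; it only slices from an offset.
        return self.partitions[partition][offset:offset + max_records]


class Subscriber:
    """Each subscriber owns its position in every partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)

    def poll(self):
        records = []
        for p, offset in enumerate(self.offsets):
            batch = self.log.read(p, offset)
            self.offsets[p] += len(batch)
            records.extend(batch)
        return records


log = PartitionedLog(num_partitions=2)
for i in range(4):
    log.append("member-1", f"pageview-{i}")

fast, slow = Subscriber(log), Subscriber(log)
first = fast.poll()                  # fast reader consumes the first four records
log.append("member-1", "pageview-4")
```

Because reading only advances a per-subscriber offset, `slow` can start later and still see every record in order, while `fast` only receives what is new on its next poll.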
![Page 17: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/17.jpg)
Logs And Distributed Systems
![Page 18: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/18.jpg)
Example: A Fault-tolerant CEO Hash Table
![Page 19: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/19.jpg)
Operations Final State
![Page 20: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/20.jpg)
![Page 21: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/21.jpg)
State-machine Replication
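A rough sketch of state-machine replication applied to the hash table example, assuming a plain Python list stands in for the replicated log: every replica applies the same operations in the same order, so all replicas reach the same final state. The operation format, the helper names `apply` and `replay`, and the placeholder keys and values are hypothetical.

```python
def apply(state, op):
    """Deterministically apply one logged operation to a hash table."""
    kind, key, value = op
    if kind == "put":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)
    return state


def replay(log, upto=None):
    """Rebuild a replica's state by applying the log prefix in order."""
    state = {}
    for op in log[:upto]:
        apply(state, op)
    return state


# Hypothetical operations; the log is the single source of ordering.
log = [
    ("put", "company-a", "alice"),
    ("put", "company-b", "bob"),
    ("put", "company-a", "carol"),
    ("delete", "company-b", None),
]

replica_1 = replay(log)
replica_2 = replay(log)
lagging = replay(log, upto=2)  # a replica that has only seen two entries
```

A lagging or restarted replica catches up by applying the remaining log entries from the offset where it stopped, which is what makes the table fault-tolerant.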
![Page 22: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/22.jpg)
Primary-backup
![Page 23: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/23.jpg)
What use is a log?
![Page 24: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/24.jpg)
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
![Page 25: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/25.jpg)
Data Integration
![Page 26: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/26.jpg)
Types of Data
• Database data
  – Users, products, orders, etc.
• Events
  – Clicks, Impressions, Pageviews, etc.
• Application metrics
  – CPU usage, requests/sec
• Application logs
  – Service calls, errors
![Page 27: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/27.jpg)
Systems at LinkedIn
• Live Stores
  – Voldemort
  – Espresso
  – Graph
  – OLAP
  – Search
  – InGraphs
• Offline
  – Hadoop
  – Teradata
![Page 28: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/28.jpg)
Bad
![Page 29: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/29.jpg)
Good
![Page 30: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/30.jpg)
Example: User views job
![Page 31: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/31.jpg)
The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
![Page 32: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/32.jpg)
Stream Processing
![Page 33: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/33.jpg)
![Page 34: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/34.jpg)
Stream processing is a generalization of batch processing
![Page 35: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/35.jpg)
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL
![Page 36: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/36.jpg)
Stream Processing = Logs + Jobs
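One way to read "Logs + Jobs", sketched under simplifying assumptions: a job is a loop that reads an input log from its last checkpointed offset, transforms each record, and appends the results to an output log that other jobs can read in turn. The `Job` class, the event fields, and the `only_job_views` transform are hypothetical and are not the Samza API.

```python
class Job:
    """Toy stream job: input log -> transform -> output log."""

    def __init__(self, input_log, output_log, transform):
        self.input_log = input_log
        self.output_log = output_log
        self.transform = transform
        self.offset = 0  # checkpoint: next input position to process

    def run_once(self):
        # Process whatever has arrived since the last checkpoint.
        new_records = self.input_log[self.offset:]
        for record in new_records:
            for out in self.transform(record):
                self.output_log.append(out)
        self.offset += len(new_records)
        return len(new_records)


def only_job_views(event):
    # Emit zero or one output record per input event.
    if event["type"] == "job_view":
        yield {"member": event["member"], "job": event["job"]}


events = []     # input log (hypothetical activity events)
job_views = []  # output log, itself readable by downstream jobs

job = Job(events, job_views, only_job_views)

events.append({"type": "job_view", "member": 1, "job": 42})
events.append({"type": "search", "member": 1})
processed_first = job.run_once()

events.append({"type": "job_view", "member": 2, "job": 42})
processed_second = job.run_once()
```

Running `run_once` a single time over a complete, bounded input is the batch case; calling it repeatedly as records arrive is the streaming case, which is the sense in which the same job covers both.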
![Page 37: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/37.jpg)
Systems Can Help
![Page 38: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/38.jpg)
Samza Architecture
![Page 39: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/39.jpg)
Example: Top Articles By Company
![Page 40: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/40.jpg)
Log-centric Architecture
![Page 41: I Heart Log: Real-time Data and Apache Kafka](https://reader036.vdocuments.mx/reader036/viewer/2022081412/53f480488d7f72c80e8b4a74/html5/thumbnails/41.jpg)
Kafka: http://kafka.apache.org
Samza: http://samza.incubator.apache.org
Log Blog: http://linkd.in/199iMwY
Me: http://www.linkedin.com/in/jaykreps
@jaykreps