building linkedins real-time data pipeline jay kreps

55
Building LinkedIn’s Real-time Data Pipeline Jay Kreps

Upload: jovany-tidball

Post on 29-Mar-2015

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Building LinkedIns Real-time Data Pipeline Jay Kreps

Building LinkedIn’s Real-time Data Pipeline

Jay Kreps

Page 2: Building LinkedIns Real-time Data Pipeline Jay Kreps
Page 3: Building LinkedIns Real-time Data Pipeline Jay Kreps

What is a data pipeline?

Page 4: Building LinkedIns Real-time Data Pipeline Jay Kreps

What data is there?

• Database data• Activity data– Page Views, Ad Impressions, etc

• Messaging– JMS, AMQP, etc

• Application and System Metrics– Rrdtool, graphite, etc

• Logs– Syslog, log4j, etc

Page 5: Building LinkedIns Real-time Data Pipeline Jay Kreps
Page 6: Building LinkedIns Real-time Data Pipeline Jay Kreps

Data Systems at LinkedIn

• Search• Social Graph• Recommendations• Live Storage• Hadoop• Data Warehouse• Monitoring Systems• …

Page 7: Building LinkedIns Real-time Data Pipeline Jay Kreps

Problem: Data Integration

Page 8: Building LinkedIns Real-time Data Pipeline Jay Kreps

Point-to-Point Pipelines

Page 9: Building LinkedIns Real-time Data Pipeline Jay Kreps

Centralized Pipeline

Page 10: Building LinkedIns Real-time Data Pipeline Jay Kreps

How have companies solved this problem?

Page 11: Building LinkedIns Real-time Data Pipeline Jay Kreps

The Enterprise Data Warehouse

Page 12: Building LinkedIns Real-time Data Pipeline Jay Kreps

Problems

• Data warehouse is a batch system• Central team that cleans all data?• One person’s cleaning…• Relational mapping is non-trivial…

Page 13: Building LinkedIns Real-time Data Pipeline Jay Kreps

My Experience

Page 14: Building LinkedIns Real-time Data Pipeline Jay Kreps

LinkedIn’s Pipeline

Page 15: Building LinkedIns Real-time Data Pipeline Jay Kreps

LinkedIn Circa 2010

• Messaging: ActiveMQ• User Activity: In house log

aggregation• Logging: Splunk• Metrics: JMX => Zenoss• Database data: Databus, custom ETL

Page 16: Building LinkedIns Real-time Data Pipeline Jay Kreps

2010 User Activity Data Flow

Page 17: Building LinkedIns Real-time Data Pipeline Jay Kreps

Problems

• Fragility• Multi-hour delay• Coverage• Labor intensive• Slow• Does it work?

Page 18: Building LinkedIns Real-time Data Pipeline Jay Kreps

Four Ideas

1 Central commit log for all data2 Push data cleanliness upstream3 O(1) ETL4 Evidence-based correctness

Page 19: Building LinkedIns Real-time Data Pipeline Jay Kreps

Four Ideas

1 Central commit log for all data2 Push data cleanliness upstream3 O(1) ETL4 Evidence-based correctness

Page 20: Building LinkedIns Real-time Data Pipeline Jay Kreps

What kind of infrastructure is needed?

Page 21: Building LinkedIns Real-time Data Pipeline Jay Kreps

Very confused

• Messaging (JMS, AMQP, …)• Log aggregation• CEP, Streaming

Page 22: Building LinkedIns Real-time Data Pipeline Jay Kreps

First Attempt: Don’t reinvent the wheel!

Page 23: Building LinkedIns Real-time Data Pipeline Jay Kreps
Page 24: Building LinkedIns Real-time Data Pipeline Jay Kreps

Problems With Messaging Systems

• Persistence is an afterthought• Ad hoc distribution• Odd semantics• Featuritis

Page 25: Building LinkedIns Real-time Data Pipeline Jay Kreps

Second Attempt: Reinvent the wheel!

Page 26: Building LinkedIns Real-time Data Pipeline Jay Kreps

Idea: Central, Distributed Commit Log

Page 27: Building LinkedIns Real-time Data Pipeline Jay Kreps

What is a commit log?

Page 28: Building LinkedIns Real-time Data Pipeline Jay Kreps

Data Flow

Page 29: Building LinkedIns Real-time Data Pipeline Jay Kreps

Apache Kafka

Page 30: Building LinkedIns Real-time Data Pipeline Jay Kreps

Some Terminology

• Producers send messages to Brokers• Consumers read messages from

Brokers• Messages are sent to a Topic• Each topic is broken into one or more

ordered partitions of messages

Page 31: Building LinkedIns Real-time Data Pipeline Jay Kreps

APIs

• send(String topic, String key, Message message)

• Iterator<Message>

Page 32: Building LinkedIns Real-time Data Pipeline Jay Kreps

Distribution

Page 33: Building LinkedIns Real-time Data Pipeline Jay Kreps

Performance

• 50MB/sec writes• 110MB/sec reads

Page 34: Building LinkedIns Real-time Data Pipeline Jay Kreps

Performance

Page 35: Building LinkedIns Real-time Data Pipeline Jay Kreps

Performance Tricks

• Batching– Producer– Broker– Consumer

• Avoid large in-memory structures– Pagecache friendly

• Avoid data copying– sendfile

• Batch Compression

Page 36: Building LinkedIns Real-time Data Pipeline Jay Kreps

Kafka Replication

• In 0.8 release• Messages are highly available• No centralized master

Page 37: Building LinkedIns Real-time Data Pipeline Jay Kreps

Kafka Info

http://incubator.apache.org/kafka

Page 38: Building LinkedIns Real-time Data Pipeline Jay Kreps

Usage at LinkedIn

• 10 billion messages/day• Sustained peak: – 172,000 messages/second written– 950,000 messages/second read

• 367 topics• 40 real-time consumers• Many ad hoc consumers• 10k connections/colo• 9.5TB log retained• End-to-end delivery time: 10 seconds (avg)

Page 39: Building LinkedIns Real-time Data Pipeline Jay Kreps

Datacenters

Page 40: Building LinkedIns Real-time Data Pipeline Jay Kreps

Four Ideas

1 Central commit log for all data2 Push data cleanliness upstream3 O(1) ETL4 Evidence-based correctness

Page 41: Building LinkedIns Real-time Data Pipeline Jay Kreps

Problem

• Hundreds of message types• Thousands of fields• What do they all mean?• What happens when they change?

Page 42: Building LinkedIns Real-time Data Pipeline Jay Kreps

Make activity data part of the domain model

Page 43: Building LinkedIns Real-time Data Pipeline Jay Kreps

Schema free?

Page 44: Building LinkedIns Real-time Data Pipeline Jay Kreps

LOAD ‘student’ USING PigStorage() AS (name:chararray, age:int, gpa:float)

Page 45: Building LinkedIns Real-time Data Pipeline Jay Kreps

Schemas

• Structure can be exploited– Performance– Size

• Compatibility• Need a formal contract

Page 46: Building LinkedIns Real-time Data Pipeline Jay Kreps

Avro Schema

• Avro – data definition and schema• Central repository of all schemas• Reader always uses same schema as

writer• Programatic compatibility model

Page 47: Building LinkedIns Real-time Data Pipeline Jay Kreps

Workflow

1 Check in schema2 Code review3 Ship

Page 48: Building LinkedIns Real-time Data Pipeline Jay Kreps

Four Ideas

1 Central commit log for all data2 Push data cleanliness upstream3 O(1) ETL4 Evidence-based correctness

Page 49: Building LinkedIns Real-time Data Pipeline Jay Kreps

Hadoop Data Load

• Map/Reduce job does data load• One job loads all events• Hive registration done automatically• Schema changes handled

transparently• ~5 minute lag on average to HDFS

Page 50: Building LinkedIns Real-time Data Pipeline Jay Kreps

Four Ideas

1 Central commit log for all data2 Push data cleanliness upstream3 O(1) ETL4 Evidence-based correctness

Page 51: Building LinkedIns Real-time Data Pipeline Jay Kreps

Does it work?

Page 52: Building LinkedIns Real-time Data Pipeline Jay Kreps

All messages sent must be delivered to all consumers (quickly)

Page 53: Building LinkedIns Real-time Data Pipeline Jay Kreps

Audit Trail

• Each producer, broker, and consumer periodically reports how many messages it saw

• Reconcile these counts every few minutes

• Graph and alert

Page 54: Building LinkedIns Real-time Data Pipeline Jay Kreps

Audit Trail

Page 55: Building LinkedIns Real-time Data Pipeline Jay Kreps

Questions?