Post on 28-Jan-2018
1
Papers We Love:
Realtime Data Processing at Facebook
Gwen Shapira, Confluent Inc.
3
Published in 2016 (!)
4
What kind of paper is this?
5
This is NOT the one true architecture.
Please don’t cargo-cult this paper
6
A few real-time systems at Facebook
• Chorus – aggregate trends
• Realtime feedback for mobile app developers
• Page analytics – likes, engagement…
• Offload CPU-intensive dashboard queries
7
8
9
10
Looking for trending topics in 5 minute windows
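A minimal sketch of the windowing idea, not Facebook's implementation: events are bucketed into 5-minute tumbling windows by timestamp and the most frequent topics in each window are reported. The function name, the `(timestamp, topic)` event shape, and the sample topics are assumptions made for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as on the slide


def top_topics_per_window(events, k=2):
    """Count topics per 5-minute window and return the top k for each window.

    `events` is an iterable of (unix_timestamp_seconds, topic) pairs
    (a hypothetical event shape chosen for this sketch).
    """
    counts = defaultdict(Counter)
    for ts, topic in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
        counts[window_start][topic] += 1
    return {w: c.most_common(k) for w, c in sorted(counts.items())}


events = [
    (0, "worldcup"), (10, "worldcup"), (200, "oscars"),
    (310, "oscars"), (320, "oscars"), (599, "worldcup"),
]
result = top_topics_per_window(events)
print(result)
```

A production system would also need to handle late or out-of-order events, which this sketch ignores.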
11
The Tofu & Potatoes of the paper:
Design Decisions
12
/ Kafka Streams + exactly once
13
Decision #1 – Language Paradigm
• Declarative (SQL) – easy & limited
• Functional
• Procedural (C++, Java, Python) – most flexibility, control, performance. Longer dev cycle.
15
Decision #2: Data Transfer
• RPC (MillWheel, Flink, Spark Streaming)
  • All about speed
• Message-forwarding broker (Heron)
  • Applies back-pressure, multiplexing
• Persistent stream storage (Samza, Kafka’s Streams API)
  • Most reliable
  • Decouples processors
16
Decision #2: Data Transfer
17
Love Song to Scribe
Independent stream processing nodes
And storing inputs / outputs
Made everything great
18
Decision #3 – Processing Semantics
19
Decision #3 – Processing Semantics
Facebook Verdict: It depends on requirements
• Ranker writes to an idempotent system – at least once
• Scuba can tolerate lost data, but cannot handle duplicates – at most once
• …. Exactly once is REALLY HARD and requires transactions
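The difference between the first two options comes down to the order of two steps: emitting output and saving the input position. The sketch below simulates a crash between those steps to show the effect. All names (`run`, `checkpoint`, `crash_at`) are hypothetical and the in-memory lists stand in for real streams and sinks.

```python
def run(events, sink, checkpoint, semantics, crash_at=None):
    """Process events starting from the saved offset.

    semantics="at_least_once": emit output first, then save the offset.
    semantics="at_most_once":  save the offset first, then emit output.
    `crash_at` simulates a failure between the two steps for that offset.
    """
    for offset in range(checkpoint["offset"], len(events)):
        if semantics == "at_least_once":
            sink.append(events[offset])
            if offset == crash_at:
                return  # crashed before the offset was saved
            checkpoint["offset"] = offset + 1
        else:
            checkpoint["offset"] = offset + 1
            if offset == crash_at:
                return  # crashed before the output was emitted
            sink.append(events[offset])


def run_with_one_crash(semantics):
    events, sink, checkpoint = ["a", "b", "c"], [], {"offset": 0}
    run(events, sink, checkpoint, semantics, crash_at=1)  # fails on "b"
    run(events, sink, checkpoint, semantics)              # restart
    return sink


at_least = run_with_one_crash("at_least_once")
at_most = run_with_one_crash("at_most_once")
print(at_least)  # "b" appears twice: harmless if the sink is idempotent
print(at_most)   # "b" is missing: acceptable if the sink tolerates loss
```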
20
Don’t miss the side-note on side-effects
• Exactly once means writing output + offsets to a transactional system
• This takes time
• Why just wait when you can deserialize? And maybe do other stateless stuff?
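To make the first bullet concrete, here is a small sketch that uses SQLite from the Python standard library as a stand-in for "a transactional system". The output row and the next input position are written in one transaction, so a crash leaves either both or neither. The table names, column names, and the uppercase "processing" step are assumptions for illustration; the sketch does not show the overlap of commit latency with deserialization mentioned in the last bullet.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE output (pos INTEGER PRIMARY KEY, value TEXT)")
db.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY, next_pos INTEGER)")
db.execute("INSERT INTO progress VALUES (1, 0)")
db.commit()


def process(db, events, crash_at=None):
    """Resume from the stored position; write output and position atomically."""
    (start,) = db.execute("SELECT next_pos FROM progress").fetchone()
    for pos in range(start, len(events)):
        with db:  # one transaction: commit both writes, or roll back both
            db.execute("INSERT INTO output VALUES (?, ?)",
                       (pos, events[pos].upper()))
            if pos == crash_at:
                raise RuntimeError("simulated crash")
            db.execute("UPDATE progress SET next_pos = ?", (pos + 1,))


events = ["a", "b", "c"]
try:
    process(db, events, crash_at=1)
except RuntimeError:
    pass  # the half-finished transaction for "b" was rolled back
process(db, events)  # restart resumes from the stored position

rows = db.execute("SELECT value FROM output ORDER BY pos").fetchall()
print(rows)
```

Each event ends up in the output exactly once even though "b" was attempted twice, at the cost of one commit per event.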
21
Decision #4 – State Saving
• In-memory state with replication (Old VoltDB)
• Requires lots of hardware and network
• Local database (Samza, Kafka Streams API)
• Remote database (MillWheel)
• Upstream (i.e. replay everything on failure)
• Global consistent snapshot (Flink)
22
Decision #4 – State Saving
Facebook Verdict: It depends
Rhode Island Alaska
23
Best Part of the Paper – by far
How to efficiently work with state in remote DB?
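The slide does not spell out the answer, so the following is only one common option, sketched under assumptions: when the aggregation is associative (counts, sums), a processor can pre-aggregate a batch locally and send the partial result for the database to merge, instead of doing a read and a write per event. `RemoteStore` is a hypothetical in-memory stand-in that simply counts round trips.

```python
from collections import Counter


class RemoteStore:
    """Hypothetical stand-in for a remote database that can merge deltas."""

    def __init__(self):
        self.data = Counter()
        self.round_trips = 0

    def read(self, key):
        self.round_trips += 1
        return self.data[key]

    def write(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def merge(self, deltas):
        self.round_trips += 1
        self.data.update(deltas)  # Counter.update adds the counts


def read_modify_write(store, events):
    """One read and one write per event."""
    for key in events:
        store.write(key, store.read(key) + 1)


def batched_merge(store, events, batch_size=3):
    """Pre-aggregate each batch locally, then send one merge per batch."""
    for i in range(0, len(events), batch_size):
        store.merge(Counter(events[i:i + batch_size]))


events = ["a", "b", "a", "a", "c", "b"]
naive, merged = RemoteStore(), RemoteStore()
read_modify_write(naive, events)
batched_merge(merged, events)
print(naive.round_trips, merged.round_trips)
```

Both stores end with the same counts, but the merge version needs far fewer round trips for the same input.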
24
Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)
25
Decision #5 - Reprocessing
Facebook Verdict:
SQL runs everywhere
And binary generation FTW
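The underlying idea is that the same logic should run over both stored data and live data. The sketch below illustrates it in plain Python rather than SQL: one shared `update` function is driven by a batch runner and by a streaming runner. The event shape and all function names are assumptions for illustration and do not reflect Facebook's tooling.

```python
from functools import reduce


def update(counts, event):
    """Shared business logic: count likes per page. Returns a new dict."""
    if event["action"] != "like":
        return counts
    page = event["page"]
    return {**counts, page: counts.get(page, 0) + 1}


def run_batch(stored_events):
    """Batch mode: fold the whole stored data set at once."""
    return reduce(update, stored_events, {})


def run_stream(event_iter):
    """Stream mode: apply the same logic one event at a time."""
    counts = {}
    for event in event_iter:
        counts = update(counts, event)
        yield counts


events = [
    {"page": "p1", "action": "like"},
    {"page": "p2", "action": "view"},
    {"page": "p1", "action": "like"},
    {"page": "p2", "action": "like"},
]
batch_result = run_batch(events)
stream_results = list(run_stream(iter(events)))
print(batch_result, stream_results[-1])
```

Because both runners call the same function, reprocessing old data in batch gives the same answer the stream would have produced.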
26
Applications – Or a whirlwind tour of good patterns
One example:
27
Lessons Learned!
The biggest win is pipelines composed of independent processors
• Mixing multiple systems let us move fast
• High level abstractions let us improve implementation
• Ease of debugging – Independent nodes and ability to replay
• Ease of deployment – Puma as-a-service
• Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.
• In the future – auto-scale based on lag
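As a rough illustration of the last two bullets, lag can be measured as the gap between the newest position in each input partition and the position the processor has reached, and a scaling rule can be derived from it. The offset-based definition, the function names, and the thresholds below are all assumptions for this sketch, not details from the paper.

```python
import math


def total_lag(latest_offsets, committed_offsets):
    """Sum over partitions of (newest offset - offset processed so far)."""
    return sum(
        latest_offsets[p] - committed_offsets.get(p, 0)
        for p in latest_offsets
    )


def desired_workers(lag, target_lag_per_worker=1000, max_workers=16):
    """Hypothetical scaling rule: one worker per `target_lag_per_worker`."""
    needed = math.ceil(lag / target_lag_per_worker)
    return max(1, min(max_workers, needed))


latest = {0: 5000, 1: 4200}      # newest offset per partition
committed = {0: 4000, 1: 1700}   # offset the processor has reached
lag = total_lag(latest, committed)
workers = desired_workers(lag)
print(lag, workers)
```

A real policy would also smooth the signal over time to avoid scaling up and down on short spikes.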
28
Thank You!