![Page 1: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/1.jpg)
Google Confidential and Proprietary
Photon: fault-tolerant and scalable joining of continuous data streams
Manpreet ([email protected])
![Page 2: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/2.jpg)
Google Confidential and Proprietary
Agenda
● Problem and motivation● Systems challenges● Design● Production deployment● Future work
![Page 3: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/3.jpg)
Google Confidential and Proprietary
The Stats Loop
Ads Serving
Stats Engine
Logs Persistent Storage
Users on Google.com Advertisers + Publishers
queries, clicks adsaccounts, campaigns,budgets
events
reports, billing, …
Ads inventory, budgets, …
Continuously update stats
Continuously aggregate logs
Hourly/Daily/Monthly sums of
Clicks, Impressions, Cost… per …
Advertiser, Publisher, Keyword etc
![Page 4: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/4.jpg)
Google Confidential and Proprietary
Problem
● Join two continuous streams of events○ Based on a shared id○ Stored in the Google File System○ Replicated in multiple datacenters
● Motivation:○ Join click with query○ Logged by multiple servers at different times
■ Cannot put query inside the click
![Page 5: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/5.jpg)
Google Confidential and Proprietary
Systems challenges
● Exactly-once semantics○ Output used for billing / internal monitoring
● Reliability○ Fault-tolerance in the cloud○ Automatically handle single data-center disaster
● Scalability○ Millions of joins per minute
● Latency○ O(seconds)
![Page 6: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/6.jpg)
Google Confidential and Proprietary
Singly-homed system
● With patched-on failover tool○ Ran in production for several years
● Very high maintenance cost● Datacenters have downtimes:
○ planned○ unplanned○ random hiccups
● Cannot provide very high up-time SLA
![Page 7: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/7.jpg)
Google Confidential and Proprietary
How about running a MapReduce?
● Read both query logs and click logs● Mapper output key is query_id● Reducer outputs the joined event● Issues:
○ batch job○ high latency (setup cost, stragglers, etc)
![Page 8: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/8.jpg)
Google Confidential and Proprietary
Challenges of running in multiple datacenters
● Stateful system○ Need to ensure exactly-once semantics
● State needs to be replicated synchronously to multiple datacenters○ ==> Paxos (guarantees majority group members have all the updates)
![Page 9: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/9.jpg)
Google Confidential and Proprietary
High-level idea
Paxos-backed storage
Pipelinein DataCenter 2
DCDCB DCE
PaxosA
Pipelinein DataCenter 1
● Run the pipeline in multiple data-centers in parallel:○ Each pipeline processes every event
● Paxos-backed storage is shared across pipelines○ Used to dedup events (at-most-once semantics)○ Maintain persistent state at event-level
![Page 10: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/10.jpg)
Google Confidential and Proprietary
PaxosDB: key building block
● Key-value store built on top of raw Paxos● Runs paxos algorithm to guarantee consistent replication
○ Multi-row transactions with conditional updates○ Auto-elects new master
● Key challenge: scalability○ Need to join Millions of events/minute○ With cross-country replicas, less than 10 paxos transactions/sec
RPC server
PaxosDB library
RPC clientRPC request
DCDCB DCE
PaxosA
![Page 11: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/11.jpg)
Google Confidential and Proprietary
Architecture of a single IdRegistry server
![Page 12: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/12.jpg)
Google Confidential and Proprietary
IdRegistry made scalable: sharding
● Run paxos transactions in parallel for independent keys
RPC server
PaxosDB library
RPC client
RPC request1
DCDCB DCE
PaxosA
RPC server
PaxosDB library
DCDCB DCE
PaxosA
RPC request2
![Page 13: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/13.jpg)
Google Confidential and Proprietary
IdRegistry made scalable: server-side batching
● Batch multiple RPC requests into a single Paxos transaction
RPC server
PaxosDB library
RPC client
RPC request 1, 2, 3
DCDCB DCE
PaxosA
RPC server
PaxosDB library
DCDCB DCE
PaxosA
RPC request 4
Batch multiple RPC requests at server
![Page 14: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/14.jpg)
Google Confidential and Proprietary
IdRegistry made scalable: client-side batching
● Batch multiple client requests into a single RPC
RPC server
PaxosDB library
RPC client
RPC request 1, 2, 3
DCDCB DCE
PaxosA
RPC server
PaxosDB library
DCDCB DCE
PaxosA
RPC request 4
Batch multiple RPC requests at server
Batch multiple client requests into a single RPC
![Page 15: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/15.jpg)
Google Confidential and Proprietary
IdRegistry made scalable
● Sharding○ Paxos transactions happen in parallel for each shard○ Load-balance amongst shards:
■ Shard by hash(event_id) mod num_shards○ Built automated support for resharding
■ Attach timestamp to each key■ Use timestamp-based sharding■ GC old keys
● Server-side batching○ RPC thread adds input request to an in-memory queue○ Single paxos thread extracts multiple requests, performs a single
multi-row paxos transaction, sends rpc response
● Client-side batching○ Client batches multiple RPC requests into a single RPC request.
![Page 16: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/16.jpg)
Google Confidential and Proprietary
JoinerEventStore Paxos-based IdRegistry
Joined Click Logs
Query LogsLog Dispatcher
Click Logs
Retry
query_idquery
click
Photon architecture in a single data-center
Step1
Step2Step3
Step4
![Page 17: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/17.jpg)
Google Confidential and Proprietary
Photon architecture in multiple datacenters
Joiner
EventStore
Paxos-based IdRegistry
Joiner
Joined Click LogsJoined Click Logs
Query Logs
DataCenter 1 DataCenter 2
Log Dispatcher
Click Logs
Log Dispatcher
Click Logs
EventStore
Query Logs
Retry Retry
![Page 18: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/18.jpg)
Google Confidential and Proprietary
PaxosDB rpc semantics
● InsertKey:○ Returns false if the key already exists○ Else inserts the key
● What if the PaxosDB inserts the key but Joiner does not get the response in time?○ Observed 0.01% loss in production
● Write unique id for joiner along with the event_id:○ event_id is key○ Joiner id is value○ Handles Joiner rpc retries gracefully
● If Joiner crashes after writing to Paxos, run offline recovery
![Page 19: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/19.jpg)
Google Confidential and Proprietary
Reading input events continuously
● Events stored in Google File System● Periodically stat the directory:
○ identify new files○ check the growth update on existing files
● Keep track of <filename, next_read_offset>
![Page 20: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/20.jpg)
Google Confidential and Proprietary
EventStore
● Sequentially read the query logs and store a mapping from query_id to query○ In a distributed hash table (e.g. bigtable)
● CacheEventStore:○ Majority of clicks happen within few minutes of query○ Cache mapping from query_id ---> query for recent queries in RAM○ Use distributed Cache Servers (similar to MemCache)
● LogsEventStore○ Performance optimization since our query logs are sorted○ Fallback in case of cache miss○ Use binary search in sorted event stream in disks
![Page 21: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/21.jpg)
Google Confidential and Proprietary
Purgatory (in Log Dispatcher)
● What if a subset of query logs are delayed?● Log Dispatcher reads every event from the log, keeps retrying until
successful join○ Maintains state in peristent storage (Google File System)○ Ensures at-least-once semantics
● Exponential backoff in case of failure
![Page 22: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/22.jpg)
Google Confidential and Proprietary
Operational challenges
● Need 24x7 running system○ Low maintenance
● Volume of input traffic fluctuates by 2x within a day● Bursts of traffic in case of network issues● Predicting resource requirements● Auto-resize the jobs to handle traffic growth and spike● Reliable monitoring
![Page 23: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/23.jpg)
Google Confidential and Proprietary
Production deployment
● System running in production for several months○ O(700)-way sharded PaxosDB○ 5 PaxosDB replicas spread across East Coast, West Coast, Mid-West○ 2 Photon pipelines running in East Coast, West Coast○ Order of magnitude higher uptime SLA than singly-homed system○ Survived multiple planned / unplanned datacenter downtimes○ Much less noisy alerts
![Page 24: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/24.jpg)
Google Confidential and Proprietary
Photon withstanding a real datacenter disaster
![Page 25: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/25.jpg)
Google Confidential and Proprietary
End-to-end latency of Photon
![Page 26: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/26.jpg)
Google Confidential and Proprietary
Effectiveness of server-side batching in IdRegistry
![Page 27: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/27.jpg)
Google Confidential and Proprietary
IdRegistry dynamic time-based upsharding
![Page 28: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/28.jpg)
Google Confidential and Proprietary
Dispatcher in one datacenter catching up after downtime
![Page 29: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/29.jpg)
Google Confidential and Proprietary
Photon wasted joins minimized
![Page 30: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/30.jpg)
Google Confidential and Proprietary
Photon EventStore lookups in a single datacenter
![Page 31: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/31.jpg)
Google Confidential and Proprietary
How we addressed the systems challenges
● Exactly-once semantics○ PaxosDB ensures at-most-once○ Dispatcher retry ensures at-least-once
● Reliability○ PaxosDB needs only majority members to be up.○ Storing global state in PaxosDB allows run in multiple datacenters
● Scalability○ Reshard the number of PaxosDB servers○ All the other workers are stateless
● Latency○ Mostly RPC-based communication amongst jobs○ Most data transfers in RAM
![Page 32: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/32.jpg)
Google Confidential and Proprietary
Next BIG challenges
● Better resource utilization● Join multiple log sources
![Page 33: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/33.jpg)
Google Confidential and Proprietary
More cool problems we are solving in AdWords Backend
● Real-time logs processing○ Read 100T logs per day○ Compute stats over 200+ dimensions with low latency
● Petabyte scale database storage engine○ 32M rows updated per sec, 140T total rows○ 200K read queries/sec with latency of 90ms
● Efficiently backfill stats for the last N years○ O(50B) rows per day○ Too big to store in a database
● And many more cool projects...○ Join us and find out more!○ Manpreet Singh ([email protected])
![Page 34: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/34.jpg)
Google Application Process
1
2 Resume Review and Qualification
3 First Round Interviews - 2 Technical Phone Screens
4 Full-time Positions: 3-5 Onsite InterviewsInternships: 1 Host Matching Interview
5 Hiring Committee - Offer
Apply Now! Full-time: google.com/students/eng
Internships: google.com/students/intern
![Page 35: continuous data streams scalable joining of Photon: fault ...cloud.berkeley.edu/data/photon.pdf · Photon: fault-tolerant and scalable joining of continuous data streams Manpreet](https://reader033.vdocuments.mx/reader033/viewer/2022060508/5f2384960737f45b4b743cad/html5/thumbnails/35.jpg)
Google Confidential and Proprietary
Questions