
Photon: fault-tolerant and scalable joining of continuous data streams

Manpreet Singh (manpreet@google.com)


Agenda

● Problem and motivation
● Systems challenges
● Design
● Production deployment
● Future work


The Stats Loop

[Diagram: the stats loop. Users on Google.com issue queries and clicks; advertisers and publishers manage accounts, campaigns, and budgets. Ads Serving emits events to logs on persistent storage; the Stats Engine continuously aggregates those logs into hourly/daily/monthly sums of clicks, impressions, cost, etc. per advertiser, publisher, keyword, etc., which feed reports, billing, and updated ads inventory and budgets back into serving.]


Problem

● Join two continuous streams of events
  ○ Based on a shared id
  ○ Stored in the Google File System
  ○ Replicated in multiple datacenters
● Motivation:
  ○ Join a click with its query
  ○ The two are logged by multiple servers at different times
    ■ So the query cannot be embedded inside the click
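To make the problem concrete, here is a minimal sketch of joining a click stream to a query stream on query_id. This is hypothetical illustration code (Photon joins unbounded, GFS-resident streams, not in-memory lists):

    # Toy join of two event streams on a shared id.
    def join_streams(queries, clicks):
        """queries/clicks: iterables of dicts carrying a 'query_id' field."""
        query_by_id = {q["query_id"]: q for q in queries}
        joined = []
        for c in clicks:
            q = query_by_id.get(c["query_id"])
            if q is not None:              # the click's query may not have
                joined.append({**q, **c})  # arrived yet; Photon retries these
        return joined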


Systems challenges

● Exactly-once semantics
  ○ Output used for billing / internal monitoring
● Reliability
  ○ Fault tolerance in the cloud
  ○ Automatically handle a single-datacenter disaster
● Scalability
  ○ Millions of joins per minute
● Latency
  ○ O(seconds)


Singly-homed system

● With a patched-on failover tool
  ○ Ran in production for several years
● Very high maintenance cost
● Datacenters have downtimes:
  ○ planned
  ○ unplanned
  ○ random hiccups
● Cannot provide a very high uptime SLA


How about running a MapReduce?

● Read both query logs and click logs
● Mapper output key is query_id
● Reducer outputs the joined event
● Issues:
  ○ batch job
  ○ high latency (setup cost, stragglers, etc.)
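Sketched below (hypothetically, against the generic map/shuffle/reduce contract rather than any real MapReduce API) is what that batch join would look like, and why it cannot be low-latency: the whole job reruns over complete logs.

    from collections import defaultdict

    # Mappers tag each record and key it by query_id.
    def map_phase(queries, clicks):
        for q in queries:
            yield q["query_id"], ("query", q)
        for c in clicks:
            yield c["query_id"], ("click", c)

    # The shuffle groups all values sharing a key.
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reducers emit one joined event per (query, click) pair under a key.
    def reduce_phase(groups):
        for _, values in groups.items():
            qs = [v for tag, v in values if tag == "query"]
            cs = [v for tag, v in values if tag == "click"]
            for q in qs:
                for c in cs:
                    yield {**q, **c}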


Challenges of running in multiple datacenters

● Stateful system
  ○ Need to ensure exactly-once semantics
● State needs to be replicated synchronously to multiple datacenters
  ○ ==> Paxos (guarantees that a majority of group members have all the updates)


High-level idea

[Diagram: identical pipelines in DataCenter 1 and DataCenter 2, both reading from and writing to a shared Paxos-backed storage replicated across five datacenters (A-E).]

● Run the pipeline in multiple datacenters in parallel:
  ○ Each pipeline processes every event
● Paxos-backed storage is shared across the pipelines
  ○ Used to dedup events (at-most-once semantics)
  ○ Maintains persistent state at the event level
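A minimal sketch of the dedup idea: every pipeline attempts every event, and a shared registry's atomic insert decides which pipeline wins. (Hypothetical code; a thread lock stands in for Paxos-replicated storage.)

    import threading

    class SharedRegistry:
        """Stand-in for the Paxos-backed store shared by all pipelines."""
        def __init__(self):
            self._ids = set()
            self._lock = threading.Lock()   # models Paxos's atomic commit

        def insert_key(self, event_id):
            """Atomically insert event_id; False if already committed."""
            with self._lock:
                if event_id in self._ids:
                    return False
                self._ids.add(event_id)
                return True

    def pipeline(name, events, registry, output):
        for event in events:                 # each pipeline sees every event
            if registry.insert_key(event["event_id"]):
                output.append((name, event)) # but only one of them emits it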


PaxosDB: key building block

● Key-value store built on top of raw Paxos
● Runs the Paxos algorithm to guarantee consistent replication
  ○ Multi-row transactions with conditional updates
  ○ Auto-elects a new master
● Key challenge: scalability
  ○ Need to join millions of events per minute, i.e. tens of thousands of commits per second
  ○ With cross-country replicas, a single Paxos group manages fewer than 10 transactions per second, three or more orders of magnitude short of the target

[Diagram: an RPC client sends a request to an RPC server built on the PaxosDB library, whose replicas span five datacenters (A-E).]
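As illustration only (this is not PaxosDB's real interface), a multi-row transaction with conditional updates has roughly this shape:

    # Hypothetical conditional multi-row transaction over a key-value store.
    def transact(store, checks, writes):
        """Apply all writes atomically iff every check matches.
        store: dict; checks: {key: expected_value}; writes: {key: new_value}."""
        if any(store.get(k) != v for k, v in checks.items()):
            return False            # some precondition is stale: abort
        store.update(writes)        # in PaxosDB this commit is a Paxos round
        return True

    db = {"a": 1}
    assert transact(db, {"a": 1}, {"a": 2, "b": 9})   # commits
    assert not transact(db, {"a": 1}, {"a": 3})       # stale check, aborts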


Architecture of a single IdRegistry server


IdRegistry made scalable: sharding

● Run Paxos transactions in parallel for independent keys

[Diagram: two independent shards, each an RPC server with its own PaxosDB replica group across five datacenters; RPC request 1 is routed to one shard and RPC request 2 to the other.]


IdRegistry made scalable: server-side batching

● Batch multiple RPC requests into a single Paxos transaction

[Diagram: as above, but RPC requests 1, 2, and 3 arrive at one server and are committed together as a single Paxos transaction ("batch multiple RPC requests at server"), while request 4 goes to the other shard.]


IdRegistry made scalable: client-side batching

● Batch multiple client requests into a single RPC

[Diagram: as above, with the RPC client additionally packing multiple client requests into a single RPC before sending ("batch multiple client requests into a single RPC"), on top of the server-side batching.]


IdRegistry made scalable

● Sharding○ Paxos transactions happen in parallel for each shard○ Load-balance amongst shards:

■ Shard by hash(event_id) mod num_shards○ Built automated support for resharding

■ Attach timestamp to each key■ Use timestamp-based sharding■ GC old keys

● Server-side batching○ RPC thread adds input request to an in-memory queue○ Single paxos thread extracts multiple requests, performs a single

multi-row paxos transaction, sends rpc response

● Client-side batching○ Client batches multiple RPC requests into a single RPC request.
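Putting the techniques together, here is a hypothetical single-process sketch of a sharded IdRegistry front end with server-side batching; the names (ShardServer, insert_key) and the 64-request batch cap are invented for illustration:

    import queue, threading, time

    NUM_SHARDS = 4    # production used O(700) shards

    class ShardServer:
        def __init__(self):
            self.requests = queue.Queue()
            self.committed = set()
            threading.Thread(target=self._committer, daemon=True).start()

        def insert_key(self, event_id, reply):
            self.requests.put((event_id, reply))   # RPC thread: enqueue only

        def _committer(self):                      # single Paxos thread
            while True:
                batch = [self.requests.get()]      # block for the first request
                while len(batch) < 64:             # drain whatever piled up
                    try:
                        batch.append(self.requests.get_nowait())
                    except queue.Empty:
                        break
                time.sleep(0.05)                   # stands in for one slow,
                for event_id, reply in batch:      # cross-country Paxos commit
                    fresh = event_id not in self.committed
                    self.committed.add(event_id)
                    reply(fresh)                   # one commit, many answers

    shards = [ShardServer() for _ in range(NUM_SHARDS)]

    def insert(event_id, reply):                   # client-side shard routing
        shards[hash(event_id) % NUM_SHARDS].insert_key(event_id, reply)

The point of the single committer thread is amortization: if one Paxos round takes 100 ms, folding 64 requests into each round turns roughly 10 transactions/sec into hundreds of acknowledged requests/sec per shard, and sharding multiplies that again. Client-side batching does the same packing one hop earlier.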


Photon architecture in a single datacenter

[Diagram: the Log Dispatcher tails the click logs and sends each click to the Joiner (step 1); the Joiner uses the click's query_id to fetch the query from the EventStore, which is built from the query logs (step 2); the Joiner registers the event in the Paxos-based IdRegistry (step 3); and it writes the joined event to the joined click logs (step 4). Failed events go back to the Dispatcher for retry.]


Photon architecture in multiple datacenters

[Diagram: DataCenter 1 and DataCenter 2 each run a full pipeline (click logs → Log Dispatcher → Joiner, with its own query logs, EventStore, retry path, and joined click logs); both Joiners deduplicate through the single shared Paxos-based IdRegistry.]


PaxosDB RPC semantics

● InsertKey:
  ○ Returns false if the key already exists
  ○ Otherwise inserts the key
● What if PaxosDB inserts the key but the Joiner does not receive the response in time?
  ○ Observed 0.01% loss in production
● Fix: write a unique id for the joiner along with the event_id
  ○ event_id is the key
  ○ Joiner id is the value
  ○ Handles Joiner RPC retries gracefully
● If the Joiner crashes after writing to PaxosDB, run offline recovery
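A sketch of that retry-safe insert, with event_id as key and the joiner's unique id as value (hypothetical code, not the real RPC):

    # Storing which joiner owns an event makes retries idempotent: a joiner
    # that re-sends after a lost response recognizes its own earlier write.
    def insert_event(registry, event_id, joiner_id):
        """registry: dict event_id -> joiner_id. True iff this joiner owns it."""
        owner = registry.setdefault(event_id, joiner_id)  # atomic in PaxosDB
        return owner == joiner_id

    reg = {}
    assert insert_event(reg, "click42", "joiner-A")      # first write wins
    assert insert_event(reg, "click42", "joiner-A")      # retry: still ours
    assert not insert_event(reg, "click42", "joiner-B")  # genuine duplicate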


Reading input events continuously

● Events stored in the Google File System
● Periodically stat the directory:
  ○ identify new files
  ○ check for growth of existing files
● Keep track of <filename, next_read_offset>
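A sketch of that tailing loop (hypothetical; it uses the local filesystem where Photon reads GFS):

    import os, time

    def tail_directory(log_dir, handle_bytes, poll_secs=5):
        """Continuously deliver newly appended bytes from files in log_dir."""
        offsets = {}                          # filename -> next_read_offset
        while True:
            for name in sorted(os.listdir(log_dir)):
                path = os.path.join(log_dir, name)
                size = os.stat(path).st_size
                start = offsets.get(name, 0)  # 0 for newly discovered files
                if size > start:              # the file has grown
                    with open(path, "rb") as f:
                        f.seek(start)
                        handle_bytes(f.read(size - start))
                    offsets[name] = size
            time.sleep(poll_secs)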


EventStore

● Sequentially read the query logs and store a mapping from query_id to query
  ○ In a distributed hash table (e.g. Bigtable)
● CacheEventStore:
  ○ The majority of clicks happen within a few minutes of the query
  ○ Cache the query_id → query mapping for recent queries in RAM
  ○ Use distributed cache servers (similar to memcache)
● LogsEventStore:
  ○ Fallback in case of a cache miss
  ○ Binary search over the sorted event stream on disk
  ○ A performance optimization made possible by our query logs being sorted
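A sketch of the two-tier lookup, with the sorted on-disk log modeled as a Python list (hypothetical code):

    import bisect

    def lookup_query(query_id, cache, sorted_log):
        """cache: dict of recent queries (CacheEventStore).
        sorted_log: list of (query_id, query) sorted by id (LogsEventStore)."""
        q = cache.get(query_id)
        if q is not None:            # most clicks arrive within minutes of
            return q                 # their query, so this usually hits
        i = bisect.bisect_left(sorted_log, (query_id,))   # cache miss:
        if i < len(sorted_log) and sorted_log[i][0] == query_id:
            return sorted_log[i][1]  # found by binary search on disk
        return None                  # query not logged (yet)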


Purgatory (in Log Dispatcher)

● What if a subset of the query logs is delayed?
● The Log Dispatcher reads every event from the log and keeps retrying until the join succeeds
  ○ Maintains state in persistent storage (Google File System)
  ○ Ensures at-least-once semantics
● Exponential backoff in case of failure
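A sketch of that retry loop (hypothetical; attempt_join stands in for the dispatch-to-Joiner RPC):

    import time

    def dispatch_with_retry(event, attempt_join, base_secs=1.0, cap_secs=60.0):
        """Retry until the join succeeds: at-least-once semantics."""
        backoff = base_secs
        while not attempt_join(event):
            time.sleep(backoff)                   # exponential backoff, capped;
            backoff = min(backoff * 2, cap_secs)  # in Photon the retry state
                                                  # persists to GFS, so restarts resume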


Operational challenges

● Need a 24x7 running system
  ○ Low maintenance
● Volume of input traffic fluctuates by 2x within a day
● Bursts of traffic in case of network issues
● Predicting resource requirements
● Auto-resizing the jobs to handle traffic growth and spikes
● Reliable monitoring


Production deployment

● System running in production for several months
  ○ O(700)-way sharded PaxosDB
  ○ 5 PaxosDB replicas spread across the East Coast, West Coast, and Midwest
  ○ 2 Photon pipelines running on the East Coast and West Coast
  ○ An order of magnitude higher uptime SLA than the singly-homed system
  ○ Survived multiple planned / unplanned datacenter downtimes
  ○ Much less noisy alerts

Production graphs (charts not preserved in this transcript):

● Photon withstanding a real datacenter disaster
● End-to-end latency of Photon
● Effectiveness of server-side batching in IdRegistry
● IdRegistry dynamic time-based upsharding
● Dispatcher in one datacenter catching up after downtime
● Photon wasted joins minimized
● Photon EventStore lookups in a single datacenter

How we addressed the systems challenges

● Exactly-once semantics
  ○ PaxosDB ensures at-most-once
  ○ Dispatcher retries ensure at-least-once
● Reliability
  ○ PaxosDB needs only a majority of members to be up
  ○ Storing global state in PaxosDB allows running the pipeline in multiple datacenters
● Scalability
  ○ Reshard across more PaxosDB servers as load grows
  ○ All the other workers are stateless
● Latency
  ○ Mostly RPC-based communication among jobs
  ○ Most data transfers happen in RAM


Next BIG challenges

● Better resource utilization
● Join multiple log sources


More cool problems we are solving in AdWords Backend

● Real-time logs processing
  ○ Read 100T logs per day
  ○ Compute stats over 200+ dimensions with low latency
● Petabyte-scale database storage engine
  ○ 32M rows updated per second, 140T total rows
  ○ 200K read queries per second at 90 ms latency
● Efficiently backfill stats for the last N years
  ○ O(50B) rows per day
  ○ Too big to store in a database
● And many more cool projects...
  ○ Join us and find out more!
  ○ Manpreet Singh (manpreet@google.com)

Google Application Process

1.
2. Resume Review and Qualification
3. First Round Interviews: 2 Technical Phone Screens
4. Full-time Positions: 3-5 Onsite Interviews; Internships: 1 Host Matching Interview
5. Hiring Committee and Offer

Apply Now!
Full-time: google.com/students/eng
Internships: google.com/students/intern


Questions
