Google Cloud Platform: files.meetup.com/18404940/big data reference architecture - reza rokni.pdf
Post on 21-May-2020
Google Cloud Platform Reference Architecture (Streaming)
Reza Rokni
Introduction
Data ... (GBs)
... can be big (TBs)
... really, really big (PBs)! ... but at least it's always batch? (Tuesday, Wednesday, Thursday)
... well ... but at least it's on time ...
[Timeline: hourly marks from 1:00 to 14:00]
... it doesn't even have the courtesy to be on time!
[Timeline: hourly marks from 8:00 to 14:00, with 8:00 data appearing again at later points]
Google confidential │ Do not distribute
Processes: Let's process some data (Reference Architecture)
1,000,000's/sec
10/sec
Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
CloudML: Machine Learning
File
Cloud Storage: Object Store, Exports
Cloud Dataproc: Managed Spark and Hadoop
Processes: Let's process some data (Reference Architecture)
1,000,000's/sec
100/sec
Cloud Pub/Sub: Async Messaging
Cloud Storage: Object Store
Capture
• Globally redundant
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
• 10 MB message size
• 7 days of storage for unacknowledged messages
Cloud Pub/Sub
[Diagram: Publishers A, B, and C send Messages 1, 2, and 3 to Topics A, B, and C. Subscriber X receives Messages 1 and 2 through Subscriptions XA and XB; Subscribers Y and Z each receive Message 3 through Subscriptions YC and ZC. Publishers and subscribers talk to the service through the Cloud Pub/Sub API.]
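The fan-out shown in the Cloud Pub/Sub diagram (every subscription on a topic gets its own copy of each message) can be sketched in a few lines of plain Java. This is a toy in-memory model under stated assumptions, not the Cloud Pub/Sub client library: the class `PubSubSketch` and its methods `createSubscription`, `publish`, and `pull` are hypothetical names invented for illustration, and acknowledgement, expiration, and push delivery are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of topic/subscription fan-out; NOT the Cloud Pub/Sub client API. */
final class PubSubSketch {
    // topic name -> (subscription name -> queue of messages not yet pulled)
    private final Map<String, Map<String, List<String>>> topics = new LinkedHashMap<>();

    /** Attach a named subscription to a topic, creating the topic if needed. */
    void createSubscription(String topic, String subscription) {
        topics.computeIfAbsent(topic, t -> new LinkedHashMap<>())
              .putIfAbsent(subscription, new ArrayList<>());
    }

    /** Every subscription attached to the topic receives its own copy of the message. */
    void publish(String topic, String message) {
        for (List<String> queue : topics.getOrDefault(topic, Map.of()).values()) {
            queue.add(message);
        }
    }

    /** Pull delivery: return and clear whatever is queued for one subscription. */
    List<String> pull(String topic, String subscription) {
        List<String> queue = topics.getOrDefault(topic, Map.of()).get(subscription);
        if (queue == null) {
            return List.of();
        }
        List<String> delivered = new ArrayList<>(queue);
        queue.clear();
        return delivered;
    }

    public static void main(String[] args) {
        PubSubSketch ps = new PubSubSketch();
        // Same layout as the diagram: X subscribes to A and B, Y and Z both subscribe to C.
        ps.createSubscription("A", "XA");
        ps.createSubscription("B", "XB");
        ps.createSubscription("C", "YC");
        ps.createSubscription("C", "ZC");
        ps.publish("A", "Message 1");
        ps.publish("B", "Message 2");
        ps.publish("C", "Message 3");
        System.out.println("Subscriber X: " + ps.pull("A", "XA") + " " + ps.pull("B", "XB"));
        System.out.println("Subscriber Y: " + ps.pull("C", "YC"));
        System.out.println("Subscriber Z: " + ps.pull("C", "ZC"));
    }
}
```

Message 3 is delivered twice because Topic C has two subscriptions; a second pull on either subscription returns nothing until a new message is published.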
Processes: Let's process some data (Reference Architecture)
1,000,000's/sec
10/sec
Cloud Pub/Sub: Async Messaging
Cloud Dataflow: Parallel data processing
File
Cloud Storage: Object Store
Introduction: Google Cloud Dataflow (Apache Beam)
Apache Beam (incubating), Google Cloud Dataflow
Dataflow explained: FlumeJava combined with MillWheel (extra reading)
Dataflow explained: FlumeJava - the What, not the How
FlumeJava
TextIO.Read(MarketData)
ParDo(enrichData(bidsize, ask, bid, trade))
ParDo(filterData(bidsize > x))
BigQueryIO.Write
Code shown is pseudocode only
Dataflow explained: MillWheel - Framework for low-latency data processing
Processes: Optimizer fusion
[Diagram: two fusion cases, consumer-producer and sibling; in each case steps C and D are fused into a single step C+D]
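The two fusion cases in the diagram can be illustrated with ordinary Java functions. This is only a sketch of the idea, assuming each step is a per-element function: the real optimizer rewrites the pipeline graph, and the names `FusionSketch`, `fuseConsumerProducer`, and `fuseSiblings` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy illustration of step fusion; the real optimizer operates on the pipeline graph. */
final class FusionSketch {
    /** Consumer-producer fusion: D consumes C's output, so C+D runs as one step per element. */
    static <A, B, R> Function<A, R> fuseConsumerProducer(Function<A, B> c, Function<B, R> d) {
        return c.andThen(d);
    }

    /** Sibling fusion: C and D read the same input, so a single pass over it feeds both. */
    static <A, B, R> void fuseSiblings(List<A> input, Function<A, B> c, Function<A, R> d,
                                       List<B> outC, List<R> outD) {
        for (A element : input) {
            outC.add(c.apply(element));
            outD.add(d.apply(element));
        }
    }

    public static void main(String[] args) {
        Function<String, String> trim = s -> s.trim();
        Function<String, Integer> length = s -> s.length();
        // No intermediate collection is materialized between the two steps.
        System.out.println(fuseConsumerProducer(trim, length).apply("  click  "));

        Function<String, String> upper = s -> s.toUpperCase();
        List<String> uppers = new ArrayList<>();
        List<Integer> lengths = new ArrayList<>();
        // The input is read once even though two steps consume it.
        fuseSiblings(List.of("a", "bb"), upper, length, uppers, lengths);
        System.out.println(uppers + " " + lengths);
    }
}
```

In both cases the benefit is the same: fewer passes over the data and no intermediate results written out between C and D.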
Processes: Dynamic Worker Optimization
100 mins. vs. 65 mins.
Building a clickstream processing pipeline
● In this example we will
○ Read data from Pub/Sub
○ Window and aggregate the data
○ Do something programmatically with the data
[Pipeline diagram: Stream, Parse Message, BigQuery, Window, Count, Detect Anomaly, BigQuery]
STEP 1 - Transport
[Pipeline diagram: Batch Read, Parse Message, Clickstream, BigQuery]
Pipeline p = Pipeline.create();
// read the raw clickstream lines from Cloud Storage
PCollection<String> dataCollection = p.apply(TextIO.Read.from("gs://…"));
dataCollection.apply(new ParseMessage())
    // steps listed on the slide alongside ParseMessage:
    //   ParDo.of(new TokenizesMessage())
    //   ParDo.of(new CreateRecords())
    .apply(BigQueryIO.Write.to(...));
Code shown is pseudocode only
STEP 2 - Detect
[Pipeline diagram: Batch Read, Parse Message, Clickstream, BigQuery, Window, Count, Detect Anomaly, BigQuery]
Pipeline p = Pipeline.create();
p.begin()
    // group records into fixed 60-second windows
    .apply(Window.<Record>into(FixedWindows.of(Duration.standardSeconds(60))))
    .apply(ParDo.of(new CreateEventKey()))
    .apply(Count)
    .apply(ParDo.of(new DetectAnomaly()));
Code shown is pseudocode only
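To make the Window, Count, and Detect Anomaly steps of STEP 2 concrete, here is a minimal plain-Java sketch that does the same thing on an in-memory list, without the Dataflow SDK. The class `WindowCountSketch`, the sample page keys, and the threshold rule in `isAnomaly` are all assumptions made for illustration; the slides do not say how `DetectAnomaly` decides.

```java
import java.util.Map;
import java.util.TreeMap;

/** Stdlib-only illustration of fixed windowing plus per-key counting; not the Dataflow SDK. */
final class WindowCountSketch {
    static final long WINDOW_MILLIS = 60_000L; // 60-second fixed windows, as in STEP 2

    /** Start of the fixed window that contains the given event timestamp. */
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    /** Count events per (window start, event key); timestamps[i] belongs to keys[i]. */
    static Map<Long, Map<String, Long>> countPerWindow(long[] timestamps, String[] keys) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            counts.computeIfAbsent(windowStart(timestamps[i]), w -> new TreeMap<>())
                  .merge(keys[i], 1L, Long::sum);
        }
        return counts;
    }

    /** Hypothetical anomaly rule: a key is anomalous when its window count exceeds a threshold. */
    static boolean isAnomaly(long count, long threshold) {
        return count > threshold;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 59_999L, 60_000L, 61_000L, 62_000L};
        String[] keys = {"/home", "/home", "/home", "/home", "/cart"};
        Map<Long, Map<String, Long>> counts = countPerWindow(timestamps, keys);
        counts.forEach((window, perKey) -> perKey.forEach((key, count) ->
            System.out.println(window + " " + key + " " + count
                + (isAnomaly(count, 1) ? " ANOMALY" : ""))));
    }
}
```

The event at 59,999 ms and the event at 60,000 ms land in different windows, which is the boundary behaviour fixed windows are expected to have.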
STEP 3 - Stream
[Pipeline diagram: Stream, Parse Message, BigQuery, Window, Count, Detect Anomaly, BigQuery]
Pipeline p = Pipeline.create();
p.begin()
    // .apply(TextIO.Read.from("gs://…"))   batch source
    .apply(PubsubIO.Read.topic(...))         // streaming source
    .apply(PubsubIO.Write.topic(...))        // streaming sink
Code shown is pseudocode only
Data Processing Tradeoffs
Completeness (1 + 1 = 2), Latency, Cost ($$$)
Requirements: Billing Pipeline
[Chart: Completeness, Low Latency, and Low Cost, each rated from Not Important to Important]
Requirements: Live Cost Estimate Pipeline
[Chart: Completeness, Low Latency, and Low Cost, each rated from Not Important to Important]
Requirements: Abuse Detection Pipeline
[Chart: Completeness, Low Latency, and Low Cost, each rated from Not Important to Important]
Requirements: Abuse Detection Backfill Pipeline
[Chart: Completeness, Low Latency, and Low Cost, each rated from Not Important to Important]
Dataflow explained
Inherent issues when dealing with streams
Watermarks
Watermark triggers
PCollection<KV<String, Integer>> scores = input
    // emit each 2-minute window once, when the watermark passes its end
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
Code shown is pseudocode only
Approximate Triggers
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            // speculative results every minute before the watermark
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            // a corrected result for every late element
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
Code shown is pseudocode only
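The early, on-time, and late firings in the trigger pseudocode can be traced with a small stdlib-only simulation of a single 2-minute window. It is a rough sketch, not the Dataflow trigger implementation, and it rests on several assumptions: all names (`TriggerSketch`, `Arrival`, `run`) are hypothetical, panes accumulate rather than reset, the early timer starts at the first element after the previous firing, and timers are only checked when an element arrives.

```java
import java.util.ArrayList;
import java.util.List;

/** Single-window simulation of early / on-time / late firings; not the Dataflow SDK. */
final class TriggerSketch {
    static final long WINDOW_END = 120_000L;  // one 2-minute window [0, 120000)
    static final long EARLY_PERIOD = 60_000L; // early firing after 1 minute of processing time

    /** One element arriving at processing time procTime, with the watermark observed then. */
    static final class Arrival {
        final long procTime;
        final long watermark;
        final int value;

        Arrival(long procTime, long watermark, int value) {
            this.procTime = procTime;
            this.watermark = watermark;
            this.value = value;
        }
    }

    /** Returns the panes emitted for the window, in order, as "KIND sum=N" strings. */
    static List<String> run(List<Arrival> arrivals) {
        List<String> panes = new ArrayList<>();
        int sum = 0;               // accumulating mode: the sum is never reset
        boolean onTimeFired = false;
        long nextEarly = -1;       // -1 means no early timer is pending
        for (Arrival a : arrivals) {
            boolean late = a.watermark >= WINDOW_END;
            if (late && !onTimeFired) {
                // The watermark passed the window end before this element arrived.
                panes.add("ON_TIME sum=" + sum);
                onTimeFired = true;
            } else if (!late && nextEarly >= 0 && a.procTime >= nextEarly) {
                // The early timer expired before this element arrived.
                panes.add("EARLY sum=" + sum);
                nextEarly = -1;
            }
            sum += a.value;
            if (late) {
                panes.add("LATE sum=" + sum); // AtCount(1): fire for every late element
            } else if (nextEarly < 0) {
                nextEarly = a.procTime + EARLY_PERIOD;
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        List<Arrival> arrivals = List.of(
            new Arrival(0, 0, 5),
            new Arrival(30_000, 20_000, 7),
            new Arrival(70_000, 50_000, 3),
            new Arrival(140_000, 125_000, 4),
            new Arrival(150_000, 130_000, 1));
        run(arrivals).forEach(System.out::println);
    }
}
```

The early pane gives a low-latency but incomplete sum, the on-time pane is the watermark's best guess at completeness, and each late pane corrects it, which is the completeness, latency, and cost tradeoff the requirement slides describe.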
Requirements: Live Cost Estimate Pipeline
[Chart: Completeness, Low Latency, and Low Cost, each rated from Not Important to Important]
GCP Managed Service
[Diagram: User Code & SDK, Job Manager, Work Manager, and Monitoring UI, connected by arrows labeled "Deploy & Schedule" and "Progress & Logs"]
Processes: Let's process some data (Reference Architecture)
1,000,000's/sec
10/sec
Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
File
Cloud Storage: Object Store, Exports
Pipeline Consumers: BigQuery or Bigtable ... or both?
Massive Scale NoSQL: NoSQL Database Service
BigQuery: Analytics Engine
Processes: Let's process some data (Reference Architecture)
1,000,000's/sec
10/sec
Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
CloudML: Machine Learning
File
Cloud Storage: Object Store, Exports
Cloud Dataproc: Managed Spark and Hadoop
Machine Learning: CloudML - Data pre-processing stages
If machine learning is the new rocket ship ...
data is the fuel!
Processes: Let's process some data (CloudML - APIs)
Speech API, Vision API
Google confidential │ Do not distribute
It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done.
Terry Pratchett