Coping with IoT Data
On Google Cloud Platform
Jen Tong
Developer Advocate, Google Cloud Platform
@MimmingCodes
mimming.com
Agenda
● IoT Data Challenges
● A use case
● A recipe
● Demos
  ○ Simulate the IoT
  ○ Capture with Pub/Sub
  ○ Wrangle with Dataflow
  ○ Analyze with BigQuery
Confidential & Proprietary, Google Cloud Platform
About you
● Electrical engineers?
● Web developers?
● Data scientists?
● Mechanical engineers?
● Not engineers at all?
Photo credit: Matt Chan
Data
photo credit - taniwha on flickr
Photo credit: Matt Chan
Data
photo credit - wemake_cc on flickr
Data
Big data
Google Research Publications
Open Source Implementations
Bigtable
Flume
Dremel
Managed Cloud Versions
Bigtable -> Bigtable
Flume -> Dataflow
Dremel -> BigQuery
Coping with big data
Big data
Really big data
Tuesday, Wednesday, Thursday
Infinite data
[Figure: timeline with hourly ticks from 1:00 to 14:00]
Delayed data
[Figure: timeline from 8:00 to 14:00, with several events labeled 8:00 placed along it]
Batch Patterns: Creating Structured Data
MapReduce
Batch Patterns: Repetitive Runs
MapReduce
Tuesday, Wednesday, Thursday
Batch Patterns: Time Based Windows
MapReduce
Tuesday [11:00 - 12:00)
[12:00 - 13:00)
[13:00 - 14:00)
[14:00 - 15:00)
[15:00 - 16:00)
[16:00 - 17:00)
[18:00 - 19:00)
[19:00 - 20:00)
[21:00 - 22:00)
[22:00 - 23:00)
[23:00 - 0:00)
Batch Patterns: Sessions
MapReduce
[Figure: per-user sessions across Tuesday and Wednesday]
Users: Jose, Lisa, Ingo, Asha, Cheryl, Ari
Streaming Patterns: Element-wise transformations
[Figure: Processing Time axis, 8:00 to 14:00]
Streaming Patterns: Aggregating Time Based Windows
[Figure: Processing Time axis, 8:00 to 14:00]
Streaming Patterns: Event Time Based Windows
[Figure: Input and Output shown against Event Time and Processing Time axes, each from 10:00 to 15:00]
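One way to picture event-time windowing is the sketch below, written in plain Java rather than the Dataflow SDK. It assigns each event to an hourly window using the timestamp carried by the event, so a reading that arrives late is still counted in the hour when it happened. The class and method names are hypothetical, and minutes since midnight stand in for real timestamps.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the Dataflow SDK. All names are hypothetical.
class EventTimeWindows {
    static final int WINDOW_MINUTES = 60;

    // Start of the fixed window [start, start + 60) that contains the event time.
    // Assumes a non-negative number of minutes since midnight.
    static int windowStart(int eventMinute) {
        return eventMinute - (eventMinute % WINDOW_MINUTES);
    }

    // Counts events per window, keyed by window start.
    // The array is in arrival order; only the event time decides the window.
    static Map<Integer, Integer> countPerWindow(int[] eventMinutesInArrivalOrder) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int eventMinute : eventMinutesInArrivalOrder) {
            counts.merge(windowStart(eventMinute), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Events stamped 10:05, 10:40, 11:10, then a delayed event stamped 10:55.
        int[] arrivals = {605, 640, 670, 655};
        System.out.println(countPerWindow(arrivals));
    }
}
```

The delayed 10:55 event lands in the 10:00 window even though it arrived after an 11:10 event. A real streaming system also has to decide how long to wait for such stragglers, which this sketch leaves out.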
Streaming Patterns: Session Windows
[Figure: Input and Output shown against Event Time and Processing Time axes, each from 10:00 to 15:00]
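Session windows can be sketched in the same spirit. The plain Java below (not the Dataflow SDK) groups one user's event times into sessions, starting a new session whenever the gap since the previous event exceeds a threshold. The names and the 30-minute gap are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java, not the Dataflow SDK. All names are hypothetical.
class SessionWindows {
    // Groups one user's event times (minutes since midnight) into sessions.
    // A new session starts when the gap since the previous event exceeds gapMinutes.
    static List<List<Integer>> sessions(int[] eventMinutes, int gapMinutes) {
        int[] sorted = eventMinutes.clone();
        Arrays.sort(sorted); // arrival order may differ from event-time order
        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = null;
        for (int t : sorted) {
            if (current == null || t - current.get(current.size() - 1) > gapMinutes) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        // Events stamped 10:00, 13:00, 10:10, 10:25, 13:05, in arrival order.
        System.out.println(sessions(new int[]{600, 780, 610, 625, 785}, 30));
    }
}
```

With a 30-minute gap, the morning events form one session and the 13:00 events form another, regardless of arrival order.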
The use case
The game
Problem scope
● Intermittent connectivity
● Inconsistent data delivery timing
● Large, endless stream of data
● Multiple input and output streams
● Bursts of activity
● Integrate and synchronize multiple event streams
Solution requirements
● Keep up with the event streams
● Respond in real-time
● Scale up and down with demand
● Process data once
● Accommodate late-arriving data
● Detect anomalies
A recipe
The recipe
Data -> Pub/Sub -> Dataflow -> BigQuery
Cloud Pub/Sub
How Pub/Sub works
Topics -> Subscriptions -> Subscribers
Delivery to subscribers: push or pull
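The topic, subscription, and subscriber relationship can be sketched with a small in-memory toy. This is plain Java and does not use the Cloud Pub/Sub client; every name in it is hypothetical. It shows the one idea from the slide that matters most: each subscription on a topic receives its own copy of every published message, and a pull subscriber asks for messages when it is ready.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative only: an in-memory toy, not the Cloud Pub/Sub client.
// All names are hypothetical.
class ToyTopic {
    private final Map<String, Queue<String>> subscriptions = new LinkedHashMap<>();

    // Each subscription gets its own queue of pending messages.
    void subscribe(String subscriptionName) {
        subscriptions.put(subscriptionName, new ArrayDeque<String>());
    }

    // Publishing copies the message to every subscription on the topic.
    void publish(String message) {
        for (Queue<String> pending : subscriptions.values()) {
            pending.add(message);
        }
    }

    // Pull-style delivery: returns the next pending message, or null if there is none.
    String pull(String subscriptionName) {
        Queue<String> pending = subscriptions.get(subscriptionName);
        return pending == null ? null : pending.poll();
    }

    public static void main(String[] args) {
        ToyTopic sensors = new ToyTopic();
        sensors.subscribe("dashboard");
        sensors.subscribe("archive");
        sensors.publish("temp=21");
        System.out.println(sensors.pull("dashboard"));
        System.out.println(sensors.pull("archive"));
    }
}
```

Push delivery would invert the last step: the service calls the subscriber's endpoint instead of waiting to be asked.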
OSS alternative
Cloud Pub/Sub features
● Asynchronous messaging
● Many-to-many
● Push and pull
● At-least-once message delivery
● REST/JSON API
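At-least-once delivery means a subscriber may see the same message more than once, so handlers are usually written to tolerate redelivery. The sketch below is one simple way to do that in plain Java, assuming each message carries a unique ID. It does not call the Pub/Sub API, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: does not call the Pub/Sub API. All names are hypothetical.
class DedupingSubscriber {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    // Handles one delivery. Returns true if the payload was processed,
    // false if this message ID was already handled (a redelivery).
    boolean onMessage(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return false; // duplicate delivery: skip the work
        }
        processed.add(payload);
        return true;
    }

    List<String> processed() {
        return processed;
    }

    public static void main(String[] args) {
        DedupingSubscriber sub = new DedupingSubscriber();
        sub.onMessage("m1", "temp=21");
        sub.onMessage("m2", "temp=22");
        sub.onMessage("m1", "temp=21"); // redelivered
        System.out.println(sub.processed());
    }
}
```

The set of seen IDs grows without bound here; a longer-lived subscriber would need to expire old IDs or make the downstream write idempotent instead.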
Nonfunctional stuff
● Globally available
● Automatic scaling
● Replicated storage
● Encrypted on the wire and at rest
Demo time!
Pub/Sub injector
Cloud Dataflow
Dataflow Pipelines
Data sources -> Pipeline steps -> Destinations
OSS alternative
Features
● Unified model for streaming and batch analysis
● Once-and-only-once input element processing
● Autoscaling
● Toolkit of complex transforms
● Support for event-time stream processing
  ○ Handles late data
● Session windowing
● Real-time analytics, dashboards, and alerts
What it's good at
ETL
• Filtering
• Transformation
• Movement
Analysis
• Extract insights
• Batch
• Continuous
Demo time!
Start a pipeline
Google BigQuery
OSS alternative
BigQuery
● Scales flat to petabytes
● SQL dialect
● User defined functions
● REST, Web UI, ODBC
● 1 TB free each month
Off topic demo!
Count stuff
SELECT count(word)
FROM publicdata:samples.shakespeare
Words in Shakespeare
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150212_01]
Wikipedia hits over 1 hour
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month
Several years of Wikipedia data
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201105],
     [fh-bigquery:wikipedia.pagecounts_201106],
     [fh-bigquery:wikipedia.pagecounts_201107],
...
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
How about a RegExp
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))
Demo time!
BigQuery
Wrap up
Big data
Photo credit: Matt Chan
Data
photo credit - wemake_cc on flickr
Photo credit: Matt Chan
Data
photo credit - taniwha on flickr
Thank you!
Jen Tong
Developer Advocate, Google Cloud Platform
@MimmingCodes
Free trial: cloud.google.com
Slides: mimming.com/presos
Bonus stuff
Dataflow
Ozmg code
PCollections -- pipeline collections
● Collection of data in a pipeline
● Bounded or unbounded in size
{Seahawks, NFC, Champions, ...}
{..., “NFC Champions #Seahawks”, “Seahawks third #superbowl!”, ... “Je suis #12thMan”, “#GoHawks”, ...}
ParDo -- Parallel Do transformation
● Process each PCollection element independently using a user-provided DoFn
● Corresponds to both the map and reduce phases in Hadoop
{Seahawks, NFC, Champions, ...}
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
Key by initial letter
ParDo example
{Seahawks, NFC, Champions, ...}
Lowercase
PCollection<String> tweets = …;
tweets.apply(ParDo.of(
    new DoFn<String, String>() {
      // Called once per input element; emits the lowercased string.
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().toLowerCase());
      }
    }));
{seahawks, nfc, champions, ...}
GroupByKey
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
● Gathers all PCollection elements with the same key
● Shuffle phase in Hadoop
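The grouping step can be sketched without the SDK. The plain Java below keys each word by its first letter and gathers the values that share a key, mirroring the slide's example. It is not the Dataflow GroupByKey transform, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the Dataflow SDK. All names are hypothetical.
class GroupByInitial {
    // Keys each non-empty word by its first letter, then gathers values per key.
    static Map<Character, List<String>> groupByKey(String[] words) {
        Map<Character, List<String>> grouped = new TreeMap<>();
        for (String word : words) {
            char key = word.charAt(0); // the "key by initial letter" step
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] words = {"Seahawks", "Champions", "Seattle", "NFC"};
        System.out.println(groupByKey(words));
    }
}
```

On a cluster, the same result requires moving every value for a key to one worker, which is why the slide compares it to the shuffle phase.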
GroupByKey & Combine
● Compute the most common value for each key with GroupByKey and DoFn
● DoFn needs to see all of the elements
● A CombineFn is easier to optimize, because partial results can be combined before the shuffle
{KV<S, Seahawks>, KV<C, Champion>, KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champion, …}>}
Combine.groupedValues(TopFn)
{KV<S, Seahawks>, KV<N, NFC>, KV<C, Champion>}
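To see why a combiner is friendlier to the runtime than a function that needs every element at once, here is a plain Java sketch of "most common value" written in combiner shape: add an input to an accumulator, merge accumulators, extract the result. It only mimics that shape and is not the Dataflow SDK's CombineFn; the names and the tie-breaking rule are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics the shape of a combiner in plain Java.
// Not the Dataflow SDK's CombineFn; all names are hypothetical.
class MostCommonCombiner {
    // Adds one value to an accumulator of value -> count.
    static Map<String, Integer> addInput(Map<String, Integer> acc, String value) {
        acc.merge(value, 1, Integer::sum);
        return acc;
    }

    // Merges two partial accumulators, e.g. built on different workers.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<>(a);
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return merged;
    }

    // Picks the value with the highest count; ties go to the alphabetically first value.
    static String extractOutput(Map<String, Integer> acc) {
        String best = null;
        for (Map.Entry<String, Integer> e : acc.entrySet()) {
            if (best == null
                    || e.getValue() > acc.get(best)
                    || (e.getValue().equals(acc.get(best)) && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
            }
        }
        return best;
    }

    // Builds an accumulator from a list of values seen by one worker.
    static Map<String, Integer> accumulate(List<String> values) {
        Map<String, Integer> acc = new HashMap<>();
        for (String v : values) {
            addInput(acc, v);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Two workers each see part of the values for key S.
        Map<String, Integer> worker1 = accumulate(Arrays.asList("Seahawks", "Seattle"));
        Map<String, Integer> worker2 = accumulate(Arrays.asList("Seahawks"));
        System.out.println(extractOutput(merge(worker1, worker2)));
    }
}
```

Because accumulators can be built independently and merged later, each worker can shrink its data before anything is shuffled. A function that must see the whole group cannot be split up this way.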
Windows
● Divide or group elements of a PCollection into windows
  ○ Fixed Windows: hourly, daily, …
  ○ Sliding Windows
  ○ Sessions
● Required for GroupByKey transforms on an unbounded PCollection
Nighttime Mid-Day Nighttime
Composite PTransforms
● Build new PTransforms from existing transforms
● Some utilities are included in the SDK:
  ○ Count, RemoveDuplicates, Join, Min, Max, Sum…
● Define your own
  ○ DoSomething, DoSomethingElse...
● Why bother?
  ○ Code reuse
  ○ Easy to monitor
[Figure: Count, composed of Pair With Ones, GroupByKey, and Sum Values]
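The Count composite in the diagram can be sketched as three small steps wrapped in one method. This is plain Java rather than the Dataflow SDK, assuming the steps run in the order pair with ones, group by key, sum values. All names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the Dataflow SDK. All names are hypothetical.
class CountComposite {
    // Step 1: pair every element with the number 1.
    static List<Map.Entry<String, Integer>> pairWithOnes(List<String> elements) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String e : elements) {
            pairs.add(new SimpleEntry<String, Integer>(e, 1));
        }
        return pairs;
    }

    // Step 2: group the pairs by key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Step 3: sum the grouped values.
    static Map<String, Integer> sumValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            int total = 0;
            for (int v : g.getValue()) {
                total += v;
            }
            sums.put(g.getKey(), total);
        }
        return sums;
    }

    // The composite: callers see one step, built from the three above.
    static Map<String, Integer> count(List<String> elements) {
        return sumValues(groupByKey(pairWithOnes(elements)));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("nfc", "seahawks", "nfc")));
    }
}
```

Callers only use `count`, which is the code-reuse benefit the slide mentions; the inner steps stay visible for anyone who needs to inspect them.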
How BigQuery works
Qualities of a good RDBMS
● Inserts & locking
● Indexing
● Cache
● Query planning
Storing data
[Figure: Table -> Columns -> Disks]
Reading data: Life of a BigQuery
SELECT sum(requests) as sum
FROM (
  SELECT requests, title
  FROM [fh-bigquery:wikipedia.pagecounts_201501]
  WHERE (REGEXP_MATCH(title, '[Jj]en.+'))
)
Life of a BigQuery
[Figure: serving tree, top to bottom: Root Mixer, Mixers, Leaves, Storage]
Storage to Leaf: SELECT requests, title (5.4 Bil)
Leaf: WHERE (REGEXP_MATCH(title, '[Jj]en.+')) (5.8 Mil)
Mixer: SELECT sum(requests)
Root Mixer: SELECT sum(requests)
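The flow through the tree can be approximated with a toy model: leaves scan and filter their own shard and return a partial sum, and mixers add up whatever their children return. The plain Java below is a sketch of that idea only, not a description of BigQuery internals. All names are hypothetical, and treating REGEXP_MATCH as a partial match is an assumption.

```java
import java.util.regex.Pattern;

// Illustrative only: a toy model of the leaf/mixer tree, not BigQuery internals.
// All names are hypothetical.
class ServingTreeSketch {
    // Assumption: the SQL regular expression matches anywhere in the title.
    static final Pattern TITLE = Pattern.compile("[Jj]en.+");

    // Leaf: scans its shard of (title, requests) rows, filters, returns a partial sum.
    static long leaf(String[] titles, long[] requests) {
        long partial = 0;
        for (int i = 0; i < titles.length; i++) {
            if (TITLE.matcher(titles[i]).find()) {
                partial += requests[i];
            }
        }
        return partial;
    }

    // Mixer: adds up the partial sums from its children (leaves or other mixers).
    static long mixer(long... partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        long leaf1 = leaf(new String[]{"Jennifer", "Dinosaur"}, new long[]{10, 99});
        long leaf2 = leaf(new String[]{"jenga", "Jen"}, new long[]{5, 7});
        // The root mixer combines the results of two intermediate mixers.
        System.out.println(mixer(mixer(leaf1), mixer(leaf2)));
    }
}
```

Only small partial sums travel up the tree, which is the point of the diagram: the filtering and most of the arithmetic happen close to the data.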