Arctic15 keynote
REAL-TIME DATA PROCESSING AND ANALYSIS IN THE CLOUD
JERRY JALAVA - QVIK
Senior System Architect, Google Developer Expert, Authorised Trainer
[email protected] | @W_I
REFERENCE ARCHITECTURE
@W_I @QVIK
Ingest: Devices / Systems generating events → Message Queue
Processing: Data Processing
Storage: Time-series Database, Data Warehouse
Cloud Pub/Sub - A fully managed, global and scalable publish and subscribe service with guaranteed at-least-once message delivery
Cloud Dataflow - A fully managed, auto-scalable service for pipeline data processing in batch or streaming mode
BigQuery - A fully managed, petabyte scale, low-cost enterprise data warehouse for analytics
REFERENCE ARCHITECTURE (with products)
Ingest: Devices / Systems generating events → Cloud Pub/Sub
Processing: Data Processing → Cloud Dataflow
Storage: Time-series Database → Cloud Bigtable, Data Warehouse → BigQuery
DEMO
‣ Analyse "real-time" Taxi data from NYC
‣ >20,000 events/s incoming
‣ 3M Taxi rides (1 week of data)
‣ Get insights
‣ Live visualisation of the rides
‣ How do the taxi rides from airports compare to taxi rides overall
‣ Analyse archived data
DEMO ARCHITECTURE
Ingest: Taxis → Telemetry → Messaging (Cloud Pub/Sub)
Processing: Rides (Cloud Dataflow) → Aggregate → Messaging (Cloud Pub/Sub)
Display Data → Dashboard Application
MULTIPLE DATA PROCESSING REQUIREMENTS
‣ Correctness, completeness, reliability, scalability, and performance
‣ Continuous event processing
‣ Continuous result delivery
‣ Scalable ETL for continuous archival
‣ Analyst-ready big data sets
COUNT RIDES
Taxi Data: (Lat X, Lon Y) @1:00, (Lat X, Lon Y) @1:01, (Lat K, Lon M) @1:03, (Lat K, Lon M) @2:30

Window In Time:
{[1:00, 2:00) → (X, Y) @1:00, (X, Y) @1:01, (K, M) @1:03}
{[2:00, 3:00) → (K, M) @2:30}

Group In Space:
{(X, Y), [1:00, 2:00) → (X, Y) @1:00, (X, Y) @1:01}
{(K, M), [1:00, 2:00) → (K, M) @1:03}
{(K, M), [2:00, 3:00) → (K, M) @2:30}

Count (Output):
{(X, Y), [1:00, 2:00) → 2}
{(K, M), [1:00, 2:00) → 1}
{(K, M), [2:00, 3:00) → 1}
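The window-then-group-then-count steps above can be sketched in plain Java, with no Dataflow dependency. The `Event` and `Key` types, minute-based timestamps, and the one-hour fixed window are illustrative assumptions for the sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountRidesSketch {
    // A timestamped taxi position; minutesSinceMidnight stands in for a real timestamp.
    record Event(double lat, double lon, int minutesSinceMidnight) {}

    // Key: spatial grid cell plus the start of a fixed one-hour window.
    record Key(long latCell, long lonCell, int windowStartMinute) {}

    static Map<Key, Long> countRides(List<Event> events) {
        final double box = 0.001; // very approximately 100 m
        Map<Key, Long> counts = new HashMap<>();
        for (Event e : events) {
            int windowStart = (e.minutesSinceMidnight() / 60) * 60; // window in time
            long latCell = (long) Math.floor(e.lat() / box);        // group in space
            long lonCell = (long) Math.floor(e.lon() / box);
            counts.merge(new Key(latCell, lonCell, windowStart), 1L, Long::sum); // count
        }
        return counts;
    }
}
```

In the real pipeline, Dataflow performs the same three steps with `Window.into`, a keying `MapElements`, and `Count.perKey`, but distributed and over an unbounded stream.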
COUNT RIDES
Taxi Data → Window In Time → Group In Space → Count → Output

p.apply(PubsubIO.Read.topic(taxiInTopic))
 .apply("window 1s", Window.into(FixedWindows.of(Duration.standardSeconds(1))))
 .apply("condense rides", MapElements.via(new CondenseRides()))
 .apply("count similar", Count.perKey())
 .apply(PubsubIO.Write.topic(taxiOutTopic));
CondenseRides maps each ride point to a grid-cell key:

private static class CondenseRides
    extends SimpleFunction<TableRow, KV<LatLon, TableRow>> {
  @Override
  public KV<LatLon, TableRow> apply(TableRow t) {
    final float box = 0.001f; // very approximately 100m
    float lat = Float.parseFloat(t.get("latitude").toString());
    float lon = Float.parseFloat(t.get("longitude").toString());
    float roundedLat = (float) (Math.floor(lat / box) * box + box / 2);
    float roundedLon = (float) (Math.floor(lon / box) * box + box / 2);
    LatLon key = new LatLon(roundedLat, roundedLon);
    return KV.of(key, t);
  }
}
$ java com.google.codelabs.dataflow.CountRides \
    --streaming=true --project=arctic15-demo --sourceProject=arctic15-demo \
    --sourceTopic=taxifeed1 --sinkProject=arctic15-demo --runner=DataflowPipelineRunner \
    --zone=eu-west1-c --numWorkers=3 --stagingLocation=gs://arctic15-demo \
    --sinkTopic=visualisation-sink-1
AIRPORT RIDES
‣ Read the ride stream from Pub/Sub
‣ Key By Ride ID to group together ride points from the same ride
‣ Sessions allow us to create a window with all of a ride's points grouped together, and then GC the ride data once no more ride points show up for ten minutes
‣ Triggering delivers the contents of the ride window early and often:
‣ elementCountAtLeast(1) ensures that we get the first values after even a single element shows up
‣ Repeatedly.forever ensures we keep getting updates
‣ accumulatingFiredPanes ensures we get a full view of the data
‣ Every time our window is triggered, the Accumulator determines how the data points in the window are combined. AccumulatePoints() keeps the pickup location (necessary to know if the ride started at the airport) and the most recent value (to continuously emit updates about the ride)
‣ FilterAtAirport looks at the pickup point in the accumulator and compares it with Lat/Lon coordinates to determine if it's an airport pickup
‣ For writing output, we only care about the latest point from the accumulator (ExtractLatest)
‣ We write the resulting latest point to Pub/Sub

p.apply(PubsubIO.Read.topic(inputTopic))
 .apply("Key By Ride ID", MapElements.via(
     (TableRow ride) -> KV.of(ride.get("ride_id"), ride)))
 .apply(Window.into(Sessions.withGapDuration(TEN_MIN)))
 .apply(Window.triggering(Repeatedly.forever(
         AfterPane.elementCountAtLeast(1)))
     .accumulatingFiredPanes())
 .apply(Combine.perKey(new AccumulatePoints()))
 .apply(ParDo.of(new FilterAtAirport()))
 .apply(ParDo.of(new ExtractLatest()))
 .apply(PubsubIO.Write.topic(outputTopic));
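The accumulator-plus-filter idea (keep the pickup point, keep the latest point, then test the pickup against airport coordinates) can be sketched without the Dataflow Combine API. The `Point` type, timestamps, and the concrete bounding-box coordinates below are illustrative assumptions, not the demo's actual values:

```java
public class RideAccumulatorSketch {
    record Point(double lat, double lon, int timestamp) {}

    // Mirrors the AccumulatePoints idea: keep only the pickup (earliest) and
    // latest points seen for a ride, regardless of arrival order.
    static final class RideState {
        Point pickup; // needed to decide whether the ride started at the airport
        Point latest; // re-emitted on every trigger firing as the ride progresses

        void add(Point p) {
            if (pickup == null || p.timestamp() < pickup.timestamp()) pickup = p;
            if (latest == null || p.timestamp() >= latest.timestamp()) latest = p;
        }
    }

    // Stands in for FilterAtAirport's lat/lon comparison against a pickup zone.
    static boolean startedInBox(RideState s, double minLat, double maxLat,
                                double minLon, double maxLon) {
        return s.pickup.lat() >= minLat && s.pickup.lat() <= maxLat
            && s.pickup.lon() >= minLon && s.pickup.lon() <= maxLon;
    }
}
```

Because the pane is accumulating, every trigger firing sees all points so far, so the earliest point really is the pickup once it has arrived, even if points reach the pipeline out of order.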
UPDATED DEMO ARCHITECTURE
Ingest: Taxis → Telemetry → Messaging (Cloud Pub/Sub)
Processing: Rides (Cloud Dataflow) → Aggregate → Messaging (Cloud Pub/Sub) → Display Data → Dashboard Application
Insights: ETL Pipeline (Cloud Dataflow) → Archival-grade aggregates → Analytics (BigQuery) → Data Warehouse
● Create another ETL pipeline, Pub/Sub → BigQuery
● Composition: save output of regular taxi data and filtered airport data
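The composition idea (regular taxi output and the filtered airport output landing in one archival table) amounts to mapping each event to a common row shape. A minimal sketch, assuming hypothetical field names rather than the demo's actual BigQuery schema:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EtlSketch {
    // Field names here are illustrative assumptions; a real ETL pipeline
    // would match the archival BigQuery table's schema exactly.
    static Map<String, Object> toArchiveRow(String rideId, double lat, double lon,
                                            long timestampMillis, boolean airportRide) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("ride_id", rideId);
        row.put("latitude", lat);
        row.put("longitude", lon);
        row.put("timestamp_ms", timestampMillis);
        // Composition: rows from the regular feed and the filtered airport feed
        // share one table, distinguished by a flag rather than separate tables.
        row.put("airport_ride", airportRide);
        return row;
    }
}
```

Keeping both feeds in one table with a flag lets analysts compare airport rides to overall rides with a single GROUP BY, which is the question the demo set out to answer.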
APACHE BEAM
‣ In early 2016, Google announced their intention to move the Dataflow programming model and SDKs to the Apache Software Foundation
‣ Apache Beam is now a top-level project
SOME RESOURCES
‣ cloud.google.com/dataflow
‣ beam.apache.org
‣ codelabs.developers.google.com
‣ Big Data facts (forbes.com)