Download - Arctic15 keynote

Transcript

Senior System Architect, Google Developer Expert, Authorised Trainer

REAL-TIME DATA PROCESSING AND ANALYSIS IN THE CLOUD

JERRY JALAVA - QVIK

[email protected] | @W_I

WE PRODUCE MASSIVE AMOUNTS OF DATA

@W_I @QVIK

OVER 90% OF ALL THE DATA HAS BEEN GENERATED IN THE PAST FEW YEARS


BEING COMPETITIVE REQUIRES YOU TO BE ABLE TO ANALYSE THAT DATA FAST


BUILDING THIS KIND OF INFRASTRUCTURE IS EXPENSIVE

BUT,


THERE ARE MANY GREAT OPEN-SOURCE PROJECTS AVAILABLE


REFERENCE ARCHITECTURE


Ingest: Devices / Systems generating events → Message Queue

Processing: Data Processing

Storage: Time-series Database / Data Warehouse


Cloud Pub/Sub - A fully managed, global and scalable publish and subscribe service with guaranteed at-least-once message delivery

Cloud Dataflow - A fully managed, auto-scalable service for pipeline data processing in batch or streaming mode

BigQuery - A fully managed, petabyte scale, low-cost enterprise data warehouse for analytics
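The publish/subscribe decoupling that Cloud Pub/Sub provides as a managed service can be illustrated with a tiny in-process sketch (plain Java, no Google Cloud client library; `MiniPubSub` and its methods are invented for illustration only):

```java
import java.util.*;
import java.util.function.Consumer;

// Toy in-process publish/subscribe: publishers push to a named topic
// without knowing who consumes, and every subscriber sees each message.
// A real broker like Cloud Pub/Sub adds durability and retries until
// acknowledgement (hence "at-least-once" delivery).
public class MiniPubSub {
    private final Map<String, List<Consumer<String>>> topics = new HashMap<>();

    void subscribe(String topic, Consumer<String> subscriber) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscriber);
    }

    void publish(String topic, String message) {
        for (Consumer<String> s : topics.getOrDefault(topic, List.of())) {
            s.accept(message); // deliver to every subscriber of the topic
        }
    }

    public static void main(String[] args) {
        MiniPubSub bus = new MiniPubSub();
        List<String> received = new ArrayList<>();
        bus.subscribe("taxirides", received::add);
        bus.publish("taxirides", "ride_point_1");
        System.out.println(received);
    }
}
```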


REFERENCE ARCHITECTURE


Ingest: Devices / Systems generating events → Cloud Pub/Sub

Processing: Cloud Dataflow

Storage: Cloud Bigtable (time-series database) / BigQuery (data warehouse)

DEMO
‣ Analyse “real-time” Taxi data from NYC
‣ >20000 events/s incoming
‣ 3M Taxi rides (1 week of data)
‣ Get insights
‣ Live visualisation of the rides
‣ How do the taxi rides from airports compare to taxi rides overall
‣ Analyse archived data


DEMO ARCHITECTURE


Ingest: Taxis (Telemetry) → Messaging: Cloud Pub/Sub

Processing: Rides: Cloud Dataflow (Aggregate) → Messaging: Cloud Pub/Sub

Dashboard Application: Display Data


MULTIPLE DATA PROCESSING REQUIREMENTS
‣ Correctness, completeness, reliability, scalability, and performance
‣ Continuous event processing
‣ Continuous result delivery
‣ Scalable ETL for continuous archival
‣ Analyst-ready big data sets



COUNT RIDES Taxi Data

Output

(Lat X, Lon Y) @1:00, (Lat X, Lon Y) @1:01, (Lat K, Lon M) @1:03, (Lat K, Lon M) @2:30

Window In Time

{[1:00, 2:00) → (X, Y) @1:00, (X, Y) @1:01, (K, M) @1:03 } {[2:00, 3:00) → (K, M) @2:30}

Group In Space

{ (X, Y), [1:00, 2:00) → (X, Y) @1:00, (X, Y) @1:01 } { (K, M), [1:00, 2:00) → (K, M) @1:03 } { (K, M), [2:00, 3:00) → (K, M) @2:30 }

Count

{ (X, Y), [1:00, 2:00) → 2 } { (K, M), [1:00, 2:00) → 1 } { (K, M), [2:00, 3:00) → 1 }
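The Window → Group → Count steps above can be sketched in plain Java (no Dataflow SDK; event times are minutes since 0:00, and the string key format is invented for illustration):

```java
import java.util.*;

// Toy re-implementation of the slide's pipeline: assign each event to a
// fixed one-hour window, group by (location, window), and count.
public class CountRidesSketch {
    // each event: {lat, lon, minuteOfDay}
    static Map<String, Integer> countRides(List<double[]> events) {
        Map<String, Integer> counts = new TreeMap<>();
        for (double[] e : events) {
            int windowStart = (int) (e[2] / 60) * 60; // start of the hour window
            String key = "(" + e[0] + ", " + e[1] + "), ["
                    + windowStart + ", " + (windowStart + 60) + ")";
            counts.merge(key, 1, Integer::sum);       // Count.perKey analogue
        }
        return counts;
    }

    public static void main(String[] args) {
        List<double[]> events = Arrays.asList(
            new double[]{1, 2, 60},   // (X, Y) @1:00
            new double[]{1, 2, 61},   // (X, Y) @1:01
            new double[]{3, 4, 63},   // (K, M) @1:03
            new double[]{3, 4, 150}); // (K, M) @2:30
        System.out.println(countRides(events));
    }
}
```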


COUNT RIDES Taxi Data

Output

Window In Time

Group In Space

Count

p.apply(PubsubIO.Read.topic(taxiInTopic))
 .apply("window 1s", Window.into(FixedWindows.of(Duration.standardSeconds(1))))
 .apply("condense rides", MapElements.via(new CondenseRides()))
 .apply("count similar", Count.perKey())
 .apply(PubsubIO.Write.topic(taxiOutTopic));


COUNT RIDES

private static class CondenseRides extends SimpleFunction<TableRow, KV<LatLon, TableRow>> {
  @Override
  public KV<LatLon, TableRow> apply(TableRow t) {
    final float box = 0.001f; // very approximately 100m
    float lat = ((Number) t.get("latitude")).floatValue();
    float lon = ((Number) t.get("longitude")).floatValue();
    // snap each coordinate to the centre of its grid cell
    float roundedLat = (float) (Math.floor(lat / box) * box + box / 2);
    float roundedLon = (float) (Math.floor(lon / box) * box + box / 2);
    LatLon key = new LatLon(roundedLat, roundedLon);
    return KV.of(key, t);
  }
}
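The grid-snapping arithmetic in CondenseRides — floor(coord / box) * box + box / 2 — maps every coordinate inside a ~100 m cell to that cell's centre, so nearby ride points share a key and can be counted together. A minimal standalone sketch:

```java
// Standalone demonstration of the CondenseRides rounding arithmetic.
public class GridSnap {
    static final double BOX = 0.001; // very approximately 100 m of latitude

    // Snap a coordinate to the centre of its grid cell.
    static double snap(double coord) {
        return Math.floor(coord / BOX) * BOX + BOX / 2;
    }

    public static void main(String[] args) {
        // Both points fall in the cell [40.712, 40.713) and get the same key.
        System.out.println(snap(40.7123));
        System.out.println(snap(40.7129));
    }
}
```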


java com.google.codelabs.dataflow.CountRides \
  --streaming=true --project=arctic15-demo --sourceProject=arctic15-demo \
  --sourceTopic=taxifeed1 --sinkProject=arctic15-demo --runner=DataflowPipelineRunner \
  --zone=eu-west1-c --numWorkers=3 --stagingLocation=gs://arctic15-demo \
  --sinkTopic=visualisation-sink-1


10X Reduction In Messages per Second


GETTING INSIGHTS

HOW DO THE TAXI RIDES FROM AIRPORTS COMPARE TO OVERALL TAXI RIDES


AIRPORT RIDES

Read from PubSub

p.apply(PubsubIO.Read(inputTopic))
 .apply("Key By Ride ID", MapElements.via(
     (TableRow ride) -> KV.of(ride.get("ride_id"), ride)))
 .apply(Window.into(Sessions.withGapDuration(TEN_MIN)))
 .apply(Window.triggering(Repeatedly.forever(
         AfterPane.elementCountAtLeast(1)))
     .accumulatingFiredPanes())
 .apply(Combine.perKey(new AccumulatePoints()))
 .apply(ParDo.of(new FilterAtAirport()))
 .apply(ParDo.of(new ExtractLatest()))
 .apply(PubsubIO.Write(outputTopic));


AIRPORT RIDES

Key By Ride ID to group together ride points from the same ride


AIRPORT RIDES

Sessions allow us to create a window with all the same rides grouped together, and then GC the ride data once no more ride points show up for ten minutes
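The session-window behaviour can be sketched in plain Java: successive timestamps for one ride merge into the same session while the gap between them stays under the threshold, and a longer silence closes the session (the `sessions` helper is illustrative, not a Dataflow API; input is assumed sorted by time):

```java
import java.util.*;

// Toy session windowing: group a ride's event times (in minutes) into
// sessions separated by silences of at least gapMinutes.
public class SessionsSketch {
    static List<List<Integer>> sessions(List<Integer> times, int gapMinutes) {
        List<List<Integer>> out = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        Integer last = null;
        for (int t : times) {                       // assumes sorted input
            if (last != null && t - last >= gapMinutes) {
                out.add(current);                   // gap too long: close session
                current = new ArrayList<>();
            }
            current.add(t);
            last = t;
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        // Points at 0, 2, 5, then a 30-minute silence, then 35 and 36.
        System.out.println(sessions(Arrays.asList(0, 2, 5, 35, 36), 10));
    }
}
```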


AIRPORT RIDES

Triggering delivers the contents of the ride window early and often:

elementCountAtLeast(1) ensures that we get the first values after even a single element shows up

Repeatedly.forever ensures we keep getting updates

accumulatingFiredPanes ensures we get a full view of the data
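The effect of this trigger setup can be simulated in plain Java: with an element-count-1 trigger firing repeatedly in accumulating mode, every arriving element produces a pane containing everything seen so far (`firedPanes` is an illustrative helper, not part of the SDK):

```java
import java.util.*;

// Toy simulation of Repeatedly.forever(AfterPane.elementCountAtLeast(1))
// with accumulatingFiredPanes: each firing emits the full accumulated
// contents (discarding mode would emit only the new elements).
public class AccumulatingPanes {
    static List<List<String>> firedPanes(List<String> arrivals) {
        List<List<String>> panes = new ArrayList<>();
        List<String> accumulated = new ArrayList<>();
        for (String e : arrivals) {
            accumulated.add(e);
            panes.add(new ArrayList<>(accumulated)); // fire: full view so far
        }
        return panes;
    }

    public static void main(String[] args) {
        System.out.println(firedPanes(Arrays.asList("p1", "p2", "p3")));
    }
}
```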


AIRPORT RIDES

Every time our window is triggered, the Accumulator determines how the data points in the window are combined

AccumulatePoints():
- Keeps the pickup location, necessary to know if the ride started at the airport
- Keeps the most recent value, to continuously emit updates about the ride
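What such an accumulator keeps per ride can be sketched in plain Java (`RidePoint` and `Acc` are invented names for illustration, not the talk's actual implementation):

```java
// Sketch of an AccumulatePoints-style combiner state: it retains the
// earliest point (the pickup, needed for the airport check) and the
// most recent point (to emit live updates about the ride).
public class AccumulatePointsSketch {
    static class RidePoint {
        final double lat, lon;
        final int minute;
        RidePoint(double lat, double lon, int minute) {
            this.lat = lat; this.lon = lon; this.minute = minute;
        }
    }

    static class Acc {
        RidePoint pickup, latest;
        Acc add(RidePoint p) {
            if (pickup == null || p.minute < pickup.minute) pickup = p;  // earliest
            if (latest == null || p.minute >= latest.minute) latest = p; // newest
            return this;
        }
    }

    public static void main(String[] args) {
        Acc acc = new Acc()
            .add(new RidePoint(40.64, -73.78, 0))   // pickup
            .add(new RidePoint(40.70, -73.90, 5))
            .add(new RidePoint(40.75, -73.99, 12)); // latest update
        System.out.println(acc.pickup.minute + " -> " + acc.latest.minute);
    }
}
```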


AIRPORT RIDES

Look at the pickup point in the accumulator, and compare it with Lat/Lon coordinates to determine if it's an airport pickup
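A minimal sketch of such a FilterAtAirport-style check, assuming a rough bounding box around JFK (the coordinates below are approximate and purely illustrative):

```java
// Illustrative airport-pickup check: is the pickup point inside a
// rough lat/lon bounding box around JFK? (Box values are approximate.)
public class AirportFilter {
    static boolean pickedUpAtJfk(double lat, double lon) {
        return lat >= 40.62 && lat <= 40.67
            && lon >= -73.83 && lon <= -73.74;
    }

    public static void main(String[] args) {
        System.out.println(pickedUpAtJfk(40.645, -73.785)); // near the JFK terminals
        System.out.println(pickedUpAtJfk(40.758, -73.985)); // Midtown Manhattan
    }
}
```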


AIRPORT RIDES

For writing output, we only care about the latest point from the accumulator


AIRPORT RIDES

We write the resulting latest point to Pub/Sub


UPDATED DEMO ARCHITECTURE


Ingest: Taxis (Telemetry) → Messaging: Cloud Pub/Sub

Processing: Rides: Cloud Dataflow (Aggregate) → Messaging: Cloud Pub/Sub → Dashboard Application: Display Data

Insights: ETL Pipeline: Cloud Dataflow → Analytics: BigQuery (Data Warehouse, archival-grade aggregates)

● Create another ETL pipeline PubSub <-> BigQuery
● Composition: save output of regular taxi data and filtered airport data


RIDE DATA OVER TIME

APACHE BEAM
‣ In early 2016, Google announced their intention to move the Dataflow programming model and SDKs to the Apache Software Foundation
‣ Apache Beam is now a top-level project


SOME RESOURCES
‣ cloud.google.com/dataflow
‣ beam.apache.org
‣ codelabs.developers.google.com
‣ Big Data facts (forbes.com)


THANK YOU!

LET’S CREATE IT TOGETHER

[email protected] | @W_I

