Flexible Network Analytics in the Cloud Jon Dugan & Peter Murphy ESnet Software Engineering Group October 18, 2017 TechEx 2017, San Francisco



Introduction

● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned


Architecture


The Harsh Realities of Network Analytics

1. It’s a mess

2. Things change

3. There’s always more

4. It’s never really done

● Your data isn’t neat and tidy

● Time and money are limited

● More devices & more telemetry

● What you need today may not be what you need tomorrow.


Coping strategies

1. It’s a mess

2. Things change

3. There’s always more

4. It’s never really done

● Design knowing things won’t be tidy

● “What” not “How”

● Rely on the cloud for scaling

● Keep raw data to keep your options open


netbeam

Network Analytics in Google Cloud

Three Pillars

1. Real time analytics
   ○ Low latency, incomplete
2. Offline analytics
   ○ High latency, complete
3. Flexible data model
   ○ Changing needs? Recompute from raw data!

Secret sauce: Apache Beam


What is Apache Beam?

1. The Beam Programming Model

2. SDKs for writing Beam pipelines

3. Runners for existing distributed processing backends

○ Apache Apex

○ Apache Flink

○ Apache Spark

○ Google Cloud Dataflow

○ Local runner for testing

Slide courtesy of the Apache Beam Project

The Evolution of Apache Beam

[Figure: systems shown in the evolution diagram: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam]

Slide courtesy of the Apache Beam Project

Architecture Diagram

[Figure: architecture diagram. Components: SNMP collection system, Old SNMP system, avro, Apache Beam (Stream Processing), BigQuery (immutable), Apache Beam (Batch Processing), BigQuery (historical), Bigtable (realtime), API, Client]

[Figure: architecture diagram, now also labeled with Align/rates, Rollups (5m, 1h, 1d avg), and Percentiles]

● Google Pubsub
● Uses Python outside of Google Cloud to poll devices and write to Pubsub topic
● Code within Google Cloud subscribes to topic to process data


● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to Pubsub topic
● Raw data is written to BigQuery
● Real time transformed data (e.g. aligned data rates) written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)

● Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write 1 day of data to a single row (columns are the time of day, key is metric and day)
● Fast access to row by key, can serve data from here
● Store one year
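The one-row-per-day layout described above can be sketched with a plain dictionary standing in for a Bigtable table. This is only an illustration under stated assumptions: the key format, the column format, and the metric name `rtr1::xe-0/0/0::in` are hypothetical, and no real Bigtable client is used.

```python
from datetime import datetime, timezone


def row_key(metric: str, ts: datetime) -> str:
    """Hypothetical row key: metric name plus the day of the sample."""
    return f"{metric}::{ts:%Y-%m-%d}"


def column(ts: datetime) -> str:
    """Hypothetical column qualifier: the time of day of the sample."""
    return f"{ts:%H:%M:%S}"


def write_cell(table: dict, metric: str, ts: datetime, value: float) -> None:
    """Store one sample in an in-memory stand-in for a Bigtable table."""
    table.setdefault(row_key(metric, ts), {})[column(ts)] = value


table = {}
samples = [
    (datetime(2017, 8, 1, 0, 0, 30, tzinfo=timezone.utc), 1.5),
    (datetime(2017, 8, 1, 23, 59, 30, tzinfo=timezone.utc), 2.5),
    (datetime(2017, 8, 2, 0, 0, 30, tzinfo=timezone.utc), 3.5),
]
for ts, value in samples:
    write_cell(table, "rtr1::xe-0/0/0::in", ts, value)

# Serving one day of one metric is a single lookup by key.
one_day = table["rtr1::xe-0/0/0::in::2017-08-01"]
```

With this layout the two samples from August 1 land in one row and the August 2 sample starts a new row, so a day of data for a metric can be fetched with one keyed read.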



● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
● Also store historical data (7 years), imported via avro files


● Apache Beam / Google Dataflow
● Batch processing
● Run with cron job
● Recalculate Bigtable data each night from source of truth in BigQuery
● Process Bigtable rows into new rows of 5min, 1 hr and 1 day aggregations
● Additional pre-computed views e.g. percentiles for traffic distribution over a month
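The 5 minute, 1 hour and 1 day aggregations above are averages over fixed windows. A minimal sketch in plain Python, assuming points are `(unix_timestamp, value)` pairs and windows are aligned to the epoch; the function name `rollup` and the sample data are hypothetical, and the real job runs as a Beam batch pipeline rather than a loop like this.

```python
from collections import defaultdict


def rollup(points, window_seconds):
    """Average (unix_ts, value) points into fixed windows aligned to the epoch."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Floor each timestamp to the start of its window.
        buckets[ts - ts % window_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical rate samples: (timestamp in seconds, rate in Gbps).
raw = [(0, 1.0), (30, 3.0), (270, 2.0), (300, 4.0), (3600, 8.0)]

five_min = rollup(raw, 300)
one_hour = rollup(raw, 3600)
one_day = rollup(raw, 86400)
```

Each rollup here is computed from the raw points rather than from a finer rollup. That avoids averaging averages, which gives a different answer when windows hold different numbers of points.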


[Figure: architecture diagram with the API labeled Dataserver API (node.js)]

● API
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as ‘tiles’, each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC service
● Looking forward to grpc-web solution

Use case example: Historical Trends


[Figure: data flow for historical trends. Labels: Stream to BQ, SNMP collection system, Old SNMP system, avro, BigQuery (historical), BigQuery, Per-day Interface totals, Per-month totals, Bigtable, Dataserver API (node.js), Client]

Bigtable rows:

snmp-daily::2017-08::$interface
    Jan 1      Jan 2      ...   Dec 31
    1.8 Pb     1.9 Pb     ...   3.1 Pb

snmp-monthly-totals
    Jan 1991   Feb 1991   ...   Sep 2017
    28 Gb      29 Gb      ...   56 Pb

Use case: real time anomaly detection

[Figure: data flow for anomaly detection. Labels: Stream to BQ, SNMP collection system, BigQuery, Baseline generation, Anomaly detection, Bigtable, Dataserver API (node.js), Client]

Rows:

baseline::5m::avg::$interface
    Mon 12am   Mon 1am   Mon 2am   ...   Sun 11pm
    2.1        1.9       0.3       ...   0.5

anomaly::5m::avg
    iface-1   iface-2   ...   iface-n
    +0.1      +2.0      ...   -1.5

● Baseline generation: Generates avg for each interface over the past 3 months for that hour/day
● Anomaly detection: Compares baseline to real time values to generate current deviation from normal
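The two steps above, baseline generation and comparison, can be sketched in plain Python. This is an assumption-laden illustration, not the production pipeline: the function names are hypothetical, the caller is assumed to pass only the past three months of history, and the slot is taken to be (weekday, hour) with Monday as 0.

```python
from collections import defaultdict
from datetime import datetime, timezone


def slot(ts):
    """Hour-of-week slot: (weekday, hour), where Monday is weekday 0."""
    return (ts.weekday(), ts.hour)


def build_baseline(history):
    """Average historical values per (interface, hour-of-week slot)."""
    sums = defaultdict(lambda: [0.0, 0])
    for iface, ts, value in history:
        acc = sums[(iface, slot(ts))]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def deviation(baseline, iface, ts, value):
    """Current value minus the baseline for the matching slot."""
    return value - baseline[(iface, slot(ts))]


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


# Two earlier Mondays at 12am for one hypothetical interface.
history = [
    ("iface-1", utc(2017, 8, 7, 0), 2.0),
    ("iface-1", utc(2017, 8, 14, 0), 2.5),
]
baseline = build_baseline(history)

# A new Monday 12am reading, compared against the baseline for that slot.
current = deviation(baseline, "iface-1", utc(2017, 8, 21, 0), 4.25)
```

Here the Monday 12am baseline for `iface-1` is 2.25 and the new reading of 4.25 yields a deviation of +2.0, the same shape as the `anomaly::5m::avg` row shown on the slide.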


Use case example: Percentiles


[Figure: data flow for percentiles. Labels: Stream to Bigtable, SNMP collection system, Daily rollups (5m avg), Percentiles, Bigtable, Dataserver API (node.js), Client]

Bigtable rows:

rollup-month-5m::2017-08::$interface::in
    1        2        ...   8640
    6Gbps    5Gbps    ...   2Gbps

percentiles::2017-08::$interface::in
    1 pct      2 pct      ...   99 pct
    0.1 Gbps   0.3 Gbps   ...   22.1 Gbps
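Turning a month of 5 minute averages into a percentiles row could look like the sketch below. The talk does not say which percentile definition is used, so the nearest-rank method here is an assumption, and the function names and sample data are hypothetical.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile: value at rank ceil(pct/100 * n), 1-based."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def percentile_row(values):
    """Build the cells of a percentiles row: '1 pct' through '99 pct'."""
    ordered = sorted(values)
    return {f"{p} pct": percentile(ordered, p) for p in range(1, 100)}


# Hypothetical stand-in for one month of 5 minute averages, in Gbps.
rollup_row = [i / 10 for i in range(1, 201)]  # 0.1, 0.2, ..., 20.0

row = percentile_row(rollup_row)
```

Because the row is precomputed, a client asking for the traffic distribution of an interface over a month reads 99 cells instead of scanning every 5 minute value.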


Demo


Example: Computing Total Traffic

    # Python Beam SDK
    import apache_beam as beam
    from apache_beam.io import ReadFromText

    # FormatCSVDoFn, group_by_device_interface and compute_rate are not
    # shown on this slide; see the full code linked below.
    pipeline = beam.Pipeline('DirectRunner')

    (pipeline
     | 'read' >> ReadFromText('./example.csv')
     | 'csv' >> beam.ParDo(FormatCSVDoFn())
     | 'ifName key' >> beam.Map(group_by_device_interface)
     | 'group by iface' >> beam.GroupByKey()
     | 'compute rate' >> beam.FlatMap(compute_rate)
     | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
     | 'group by timestamp' >> beam.GroupByKey()
     | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
     | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
     | 'save' >> beam.io.WriteToText('./total_by_timestamp'))

    pipeline.run()

Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/
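For readers without Beam installed, the same computation (group by interface, derive rates, sum by timestamp) can be sketched with the standard library alone. The helper `rates_from_counters` is a hypothetical stand-in written for this sketch, not the authors' `compute_rate`, and it ignores counter wraps and resets; the device and interface names are made up.

```python
from collections import defaultdict


def rates_from_counters(samples):
    """Turn (timestamp, counter) samples for one interface into (timestamp, rate)."""
    ordered = sorted(samples)
    return [
        (t1, (c1 - c0) / (t1 - t0))
        for (t0, c0), (t1, c1) in zip(ordered, ordered[1:])
    ]


def total_by_timestamp(rows):
    """rows: (device, interface, timestamp, in_counter) tuples."""
    # Step 1: group samples by (device, interface).
    by_iface = defaultdict(list)
    for device, iface, ts, counter in rows:
        by_iface[(device, iface)].append((ts, counter))
    # Steps 2 and 3: compute per-interface rates, then sum them per timestamp.
    totals = defaultdict(float)
    for samples in by_iface.values():
        for ts, rate in rates_from_counters(samples):
            totals[ts] += rate
    return dict(sorted(totals.items()))


rows = [
    ("rtr1", "xe-0/0/0", 0, 0),
    ("rtr1", "xe-0/0/0", 30, 300),
    ("rtr1", "xe-0/0/0", 60, 900),
    ("rtr2", "xe-1/0/0", 0, 1000),
    ("rtr2", "xe-1/0/0", 30, 1600),
    ("rtr2", "xe-1/0/0", 60, 1900),
]
totals = total_by_timestamp(rows)
```

The Beam version expresses the same steps as named transforms, which is what lets a runner such as Dataflow distribute the grouping and summing across workers.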


Our Stack

● Apache Beam using Scio
● Google Cloud Platform
  ○ Dataflow
  ○ Bigtable
  ○ BigQuery
  ○ Pub/Sub
  ○ App Engine
● Languages
  ○ Scala
  ○ JavaScript / TypeScript
  ○ Python

[Logos: Cloud Dataflow, BigQuery, Cloud Bigtable, Cloud Endpoints, App Engine, Cloud Pub/Sub]

Current Status & Future Plans

Current

Alpha version for SNMP data:

● Ingest to BigQuery is working
● Migration of historical data is implemented. Awaiting final details before full conversion
● Streaming ingest to Bigtable still in progress
● Early version of utilization visualization
● Simple data server can provide data to clients, but gRPC API coming
● Interface timeseries charts functional

Future

More types of data:

● Flow data
● perfSONAR

Machine Learning

Anomaly Detection

“Mash up” various data sources

Why not InfluxDB, Elastic or ${FAVORITE_DB}

● We have a data processing problem, not a data storage problem per se.
  ○ Beam and the ecosystem around it give a huge amount of flexibility -- can try new ideas as they occur to us
  ○ Ability to move to different platform components
  ○ machine learning (TensorFlow and others)
● InfluxDB & Elastic
  ○ require care and feeding -- have to think about disks and machines, etc.
  ○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed but other benefits outweigh that.
  ○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier

Why the cloud? Why Google Cloud Platform?

Why the cloud?

● Focus on our problems not on infrastructure
● Scalability without needing to own lots of systems
● Managed services for databases and compute

Why Google Cloud?

● Apache Beam was Google Dataflow when we first encountered it
● More cohesive ecosystem than AWS in our experience

Lessons learned / Life in the cloud / Good & Bad

● This approach is not a silver bullet, but definitely makes many things easier
● Scaling is pretty sweet: we processed 4,005,271,066 points in 13 hours
● GCP Tech support could be better
● Despite early indications Python streaming support in Beam has been slow to appear. Python is a second class citizen. Fortunately Scio and Scala allow working with the Java SDK at a high level of abstraction.
● Scala is powerful but challenging at times
● Focus on developing your services, not on setting up machines to run them
  ○ Nice options for decomposing services (Endpoints/esp, load balancing, etc)
  ○ Service oriented
  ○ Battle tested software stacks

Thank you!

Peter Murphy <[email protected]>
Jon Dugan <[email protected]>

● MyESnet: https://my.es.net
● ESnet Open Source: http://software.es.net/
  ○ http://software.es.net/react-timeseries-charts/
  ○ http://software.es.net/pond/
  ○ http://software.es.net/react-network-diagrams/
● Scio: https://github.com/spotify/scio
● Beam: https://beam.apache.org