Page 1: Data analytics in computer networking

Data Analytics in Computer Networking

The Case for Exploratory Data Analysis

Stenio Fernandes

Carleton University / CIn-UFPE

March 2016

Page 2: Data analytics in computer networking

Outline

• Data Analysis - background
• EDA basics
• Applied EDA (Examples: WiFi simulated data)
• Q&A
• References

Page 3: Data analytics in computer networking

Data Analytics - Background

Page 4: Data analytics in computer networking

Data Science Pipeline

Elements of Reproducible Research
• Analytic Data
• Analytic Code
• Documentation
• Distribution

Report Writing for Data Science in R, Roger D. Peng, 2016

Page 5: Data analytics in computer networking

Epicycle of Analysis

1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results

The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016

Page 6: Data analytics in computer networking

• Descriptive: summarizes the measurements in a single data set without further interpretation
• Exploratory: searches for discoveries, trends, correlations, or relationships between multiple variables to generate ideas or hypotheses
• Inferential: quantifies whether an observed pattern will likely hold beyond the data set in hand
• Predictive: uses a subset of measurements (the features) to predict another measurement (the outcome)
• Causal: asks what happens to one measurement if you make another measurement change
• Deterministic: changing one measurement always and exclusively leads to a specific, deterministic behavior in another

The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015

Page 7: Data analytics in computer networking

EDA basics

Page 8: Data analytics in computer networking

Why use EDA - Summary

NIST
• Maximize insight into a data set
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop parsimonious models
• Determine optimal factor settings

JH University
• Show comparisons
• Show causality, mechanism, explanation
• Show multivariate data
• Integrate multiple modes of evidence
• Describe and document the evidence
• Content is king

Page 9: Data analytics in computer networking

Answer to initial questions

What is a typical value for a certain feature?

What is the uncertainty for a typical value of a feature?

What is a good distributional fit for a feature?

What is the percentile distribution?

Does modifying one variable have an effect on another variable?

Does a factor have an effect on performance metrics?

What are the most important factors?

What is the best function for relating a response variable to other variables?

What are the best settings for factors (i.e. levels)?

Can we separate signal from noise?

Can we extract any structure from multivariate data?

Does the data have outliers?
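
Several of the questions above (typical value, uncertainty, percentiles, outliers) can be answered with a few summary statistics. The following is a minimal sketch in standard-library Python, not the R/ggplot2 workflow used later in the slides. The `rates` values are made up for illustration, the `summarize` helper is hypothetical, and the 1.5 × IQR outlier fence is one common convention assumed here, not something the slides prescribe.

```python
import statistics

# Hypothetical throughput samples in Mbps (made-up values, illustration only).
rates = [18.2, 19.5, 20.1, 20.4, 21.0, 21.3, 22.8, 23.5, 24.1, 54.0]


def summarize(samples):
    """Return a few EDA summaries for one numeric feature."""
    # Quartiles answer "what is the percentile distribution?"
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: flag points beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(samples),      # one notion of a typical value
        "median": median,                      # a typical value robust to outliers
        "stdev": statistics.stdev(samples),    # spread around the mean
        "quartiles": (q1, median, q3),
        "outliers": [x for x in samples if x < low or x > high],
    }


summary = summarize(rates)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these sample values the single large observation pulls the mean above the median, which is the kind of discrepancy a quick summary is meant to surface before any modeling.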

Page 10: Data analytics in computer networking

EDA Graphs

Understand data properties

Find patterns in data

Suggest modeling strategies

Debug analyses

Page 11: Data analytics in computer networking

Applied EDA

Using R/ggplot2

(mpg dataset) -> fake wifi dataset

Page 12: Data analytics in computer networking

Practical Steps

Before performing any measurements or simulation
• Identify
  • Performance Metrics
  • Performance Factors and Levels
• Caution: sometimes you have to guess the ranges for the levels
  • Use an educated guess

Don’t run tons of simulations / experiments (as previously discussed)

Plot quick and dirty graphs
• No need for titles, labels
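
One way to act on the steps above is to write down the factors and levels first and count how many runs they imply before launching anything. The sketch below assumes a full factorial design; the factor names, levels, and number of replications are hypothetical placeholders, not values from the slides.

```python
from itertools import product
from math import prod

# Hypothetical factors and levels (illustrative names and values only).
factors = {
    "distance_m": [50, 100],
    "background_rate_mbps": [2.0, 4.0, 6.0],
    "user_type": ["a", "b"],
}
replications = 5  # assumed number of repeated runs per configuration

# Size of the full factorial design, computed before running anything.
n_configs = prod(len(levels) for levels in factors.values())
n_runs = n_configs * replications

# Enumerate every combination of levels as one dictionary per configuration.
design = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(f"{n_configs} configurations, {n_runs} runs in total")
print(design[0])
```

Adding one more level to any factor multiplies the run count, so a quick count like this helps keep the number of simulations or experiments small.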

Page 13: Data analytics in computer networking

Some examples of EDA Graphs - WiFi Data (simulated)

Features (Observation Variables)
• "Vendor" - factor / Levels: LinkSys, …
• "Model" - factor / Levels: GST200, …
• "Users_Max_Rate" - factor (background traffic) / Levels: 1.6, 1.8, …, 7.0 Mbps
• "Year" - factor / Levels: 1999, 2008
• "BER" - factor / Levels: 4, 5, 6, and 8
• "Type" - factor (type of user) / Levels: 4, f, r
• Rate - performance metric (Mbps)
• Distance - factor (distance from the AP) / Levels: 50, 100 m
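
A first exploratory check on data shaped like this is to compare the performance metric across the levels of one factor. The sketch below uses standard-library Python, not the R/ggplot2 code behind the slides. It reuses the column names listed above, but every row value is made up for illustration, and `mean_metric_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Toy rows that reuse the column names from the slide; every value is made up.
rows = [
    {"Vendor": "LinkSys", "Distance": 50, "Type": "f", "Rate": 29.0},
    {"Vendor": "LinkSys", "Distance": 50, "Type": "r", "Rate": 26.0},
    {"Vendor": "LinkSys", "Distance": 100, "Type": "f", "Rate": 21.0},
    {"Vendor": "LinkSys", "Distance": 100, "Type": "r", "Rate": 17.0},
    {"Vendor": "LinkSys", "Distance": 100, "Type": "4", "Rate": 16.0},
]


def mean_metric_by(rows, factor, metric="Rate"):
    """Group rows by one factor and average the metric within each level."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[metric])
    return {level: mean(values) for level, values in groups.items()}


rate_by_distance = mean_metric_by(rows, "Distance")
rate_by_type = mean_metric_by(rows, "Type")
print(rate_by_distance)
print(rate_by_type)
```

A table of per-level means like this is the numeric counterpart of a quick box plot or scatter plot of Rate against one factor.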

Page 14: Data analytics in computer networking
Page 15: Data analytics in computer networking
Page 16: Data analytics in computer networking
Page 17: Data analytics in computer networking
Page 18: Data analytics in computer networking
Page 19: Data analytics in computer networking
Page 20: Data analytics in computer networking
Page 21: Data analytics in computer networking
Page 22: Data analytics in computer networking
Page 23: Data analytics in computer networking
Page 24: Data analytics in computer networking
Page 25: Data analytics in computer networking
Page 26: Data analytics in computer networking

Q&A

Page 27: Data analytics in computer networking

References
• NIST/SEMATECH e-Handbook of Statistical Methods (Engineering Statistics Handbook, online)
• Report Writing for Data Science in R, Roger D. Peng, 2016
• The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016
• The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015