Data Analytics in Computer Networking
The Case for Exploratory Data Analysis
Stenio Fernandes
Carleton University / CIn-UFPE
March 2016
Outline
• Data Analysis - background
• EDA basics
• Applied EDA (Examples: WiFi simulated data)
• Q&A
• References
Data Analytics - Background
Data Science Pipeline
Elements of Reproducible Research
• Analytic Data
• Analytic Code
• Documentation
• Distribution
Report Writing for Data Science in R, Roger D. Peng, 2016
1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results
Epicycle of Analysis
The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016
• Descriptive: summarize the measurements in a single data set without further interpretation
• Exploratory: search for discoveries, trends, correlations, or relationships between multiple variables to generate ideas or hypotheses
• Inferential: quantify whether an observed pattern will likely hold beyond the data set in hand
• Predictive: use a subset of measurements (the features) to predict another measurement (the outcome)
• Causal: determine what happens to one measurement if you make another measurement change
• Deterministic: show that changing one measurement always and exclusively leads to a specific, deterministic behavior in another
The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015
EDA basics
Why use EDA - Summary
• Maximize insight into a data set
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop parsimonious models
• Determine optimal factor settings
NIST
• Show comparisons
• Show causality, mechanism, explanation
• Show multivariate data
• Integrate multiple modes of evidence
• Describe and document the evidence
• Content is king
JH University
Answer to initial questions
What is a typical value for a certain feature?
What is the uncertainty for a typical value of a feature?
What is a good distributional fit for a feature?
What is the percentile distribution?
Does modifying one variable have an effect on another variable?
Does a factor have an effect on performance metrics?
What are the most important factors?
What is the best function for relating a response variable to other variables?
What are the best settings for factors (i.e. levels)?
Can we separate signal from noise?
Can we extract any structure from multivariate data?
Does the data have outliers?
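Several of these questions (typical value, uncertainty, percentiles, outliers) can be answered with a few summary statistics. The sketch below is one possible way to do it, not the method used in the slides: it uses Python's standard library instead of R, the `rates` values are hypothetical throughput samples made up for illustration, and the outlier test is Tukey's 1.5 × IQR rule, chosen here as an assumption.

```python
import statistics

# Hypothetical throughput samples in Mbps (illustrative values, not from the slides).
rates = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1, 12.4]

def summarize(samples):
    """Return basic EDA summaries: typical value, spread, quartiles, outliers."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    # Tukey's rule (an assumption here): flag points beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(samples),
        "median": median,
        "stdev": statistics.stdev(samples),
        "quartiles": (q1, median, q3),
        "outliers": [x for x in samples if x < low or x > high],
    }

summary = summarize(rates)
for name, value in summary.items():
    print(name, value)
```

In this toy sample the single large value pulls the mean well above the median, which is the kind of discrepancy that should prompt a closer look before any modeling.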
EDA Graphs
Understand data properties
Find patterns in data
Suggest modeling strategies
Debug analyses
Applied EDA
Using R/ggplot2
(mpg dataset) -> fake wifi dataset
Practical Steps
Before performing any measurements or simulation
• Identify
  • Performance Metrics
  • Performance Factors and Levels
• Caution: sometimes you have to guess the ranges for the levels
  • Use an educated guess
Don’t run tons of simulations / experiments (as previously discussed)
Plot quick and dirty graphs
• No need for titles, labels
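To see why the choice of factors and levels matters before running anything, it can help to enumerate the experiment plan first. The following is a minimal sketch in Python (the slides themselves use R), assuming a full factorial design; the factor names and level values in `factors` are hypothetical placeholders loosely modeled on the WiFi example.

```python
from itertools import product

# Hypothetical factors and levels (illustrative guesses, not the slides' actual plan).
factors = {
    "distance_m": [50, 100],
    "users_max_rate_mbps": [1.6, 1.8, 2.0],
    "vendor": ["A", "B"],
}

def full_factorial(factors):
    """Enumerate every combination of factor levels as one run description."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

runs = full_factorial(factors)
print(len(runs))   # 2 * 3 * 2 = 12 runs, before any replication
print(runs[0])     # first planned run
```

The run count is the product of the number of levels per factor, so each added factor or level multiplies the cost. Counting runs up front is one way to avoid launching far more simulations than the question requires.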
Some examples of EDA Graphs - WiFi Data (simulated)
Features (Observation Variables)
• "Vendor": factor / levels: LinkSys, …
• "Model": factor / levels: GST200, …
• "Users_Max_Rate": factor (background traffic) / levels: 1.6, 1.8, …, 7.0 Mbps
• "Year": factor / levels: 1999, 2008
• "BER": factor / levels: 4, 5, 6, and 8
• "Type": factor (type of user) / levels: 4, f, r
• "Rate": performance metric (Mbps)
• "Distance": factor (distance from the AP) / levels: 50, 100 m
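A first exploratory pass over such a data set is often just the performance metric averaged per factor level. The sketch below illustrates this in Python rather than the R/ggplot2 used in the slides. It generates its own synthetic rows: the linear distance penalty and noise level in `simulate_rate` are assumptions invented for this example and do not reproduce the slides' data.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

def simulate_rate(distance_m):
    """Assumed toy model: rate drops linearly with distance, plus Gaussian noise."""
    return max(0.0, 6.0 - 0.02 * distance_m + random.gauss(0, 0.3))

# Synthetic observations using a subset of the features listed above.
rows = [
    {"Distance": d, "Type": t, "Rate": simulate_rate(d)}
    for d in (50, 100)
    for t in ("4", "f", "r")
    for _ in range(20)
]

def mean_by(rows, factor, metric="Rate"):
    """Average the metric for each level of the given factor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[metric])
    return {level: mean(values) for level, values in groups.items()}

rate_by_distance = mean_by(rows, "Distance")
for level, avg in sorted(rate_by_distance.items()):
    # Quick and dirty text bar: one '#' per 0.25 Mbps, no titles or labels.
    print(f"{level:>4} m  {'#' * int(avg / 0.25)}  {avg:.2f} Mbps")
```

Calling `mean_by(rows, "Type")` gives the same view for user type. Because the toy model ignores that factor, its group means should be close to one another, which shows how this view helps separate factors that matter from those that do not.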
Q&A
References
• NIST/SEMATECH e-Handbook of Statistical Methods (online)
• Report Writing for Data Science in R, Roger D. Peng, 2016
• The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016
• The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015