Hannah Aizenman - Get to Know Your Data
DESCRIPTION
A recent article in the New York Times estimates that data scientists spend somewhere between 50% and 80% of their time "collecting and preparing unruly digital data" before they ever get to the analysis. Data is often badly labeled, inconsistently sampled, incorrect in strange places, missing, or otherwise riddled with errors, leading to the "garbage in, garbage out" problem. While detecting the myriad ways in which the data is broken can be difficult, traditional visualization techniques, exploratory data analysis, and cluster analysis can help. This talk will discuss some of the typical methods for sanity checking small data sets: visualization, simple statistics, and some basic combinations of the two. It will then veer into some machine learning techniques for exploring the underlying structure of larger data sets, both to verify the occurrence of known patterns and to detect outliers that could be due to errors rather than the occurrence of something interesting.
TRANSCRIPT
Get To Know Your Data
Hannah Aizenman @story645
image via @Ted Underwood
Unprocessed Data
Missing Observations
Misused Technique
Start?
Research
Explore Attributes
Take Snapshots
Plot
Label
Rearrange
Higher D Data: Plot 1 Dim
Plot Another Dim (or 2)
Fix that Plot
Histogram
Min, Max, Mean, Median
Too Much Data
Multivariate Relationships
Multivariate Relationships With Classes
Known Patterns
Expected Values
Look For Structure
Incorporate Outside Knowledge
Weave it All Together
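
The "Missing Observations", "Min, Max, Mean, Median", and "Expected Values" slides can be illustrated with a minimal sketch. This is not code from the talk: the function names `summarize` and `flag_outliers`, the made-up `temperatures` readings, and the choice of a median-absolute-deviation rule with a 3.5 cutoff are all assumptions made for illustration, using only the Python standard library.

```python
import math
import statistics


def _present(values):
    """Keep only observed numeric values (drop None and NaN)."""
    return [v for v in values if v is not None and not math.isnan(v)]


def summarize(values):
    """Return basic sanity-check statistics for one numeric column."""
    present = _present(values)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The modified z-score is built on the median and the median absolute
    deviation (MAD), so a single extreme value does not distort the
    baseline the way it would distort a mean and standard deviation.
    The 3.5 cutoff is a common convention, assumed here.
    """
    present = _present(values)
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    if mad == 0:
        return []  # no spread around the median, nothing to compare against
    flagged = []
    for i, v in enumerate(values):
        if v is None or math.isnan(v):
            continue
        if abs(0.6745 * (v - med) / mad) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical readings: one missing value and one likely data-entry error.
temperatures = [21.5, 22.0, None, 21.8, 22.3, 215.0, 21.9]

print(summarize(temperatures))
print(flag_outliers(temperatures))
```

A flagged index is only a candidate. Whether the value is an error or something genuinely interesting still depends on outside knowledge of how the data was collected.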