Post on 07-Jan-2017
TOPOLOGICAL DATA ANALYSIS
HJ van Veen Data Science Nubank Brasil
"When a truth is necessary, the reason for it can be found by analysis, that is, by resolving it into simpler ideas and truths until the primary ones are reached." - Leibniz
Topology is the mathematical study of topological spaces.
Topology is interested in shapes.
More specifically: the concept of 'connectedness'.
TOPOLOGY III A topologist is someone who does not see the difference between a coffee mug and a donut.
Nothing at all takes place in the universe in which some rule of maximum or minimum does not appear. - Euler
Seven Bridges of Königsberg: devise a walk through the city that would cross each bridge once and only once.
HISTORY III Euler's big insights:
It doesn't matter where you start walking; it only matters which bridges you cross.
A similar solution should be found regardless of where you start your walk.
Only the connectedness of the bridges matters.
A solution should also apply to all other bridges that are connected in a similar fashion, no matter the distances between them.
We now call these graph walks Eulerian walks in Euler's honor.
Euler's first proven graph theory theorem:
'Euler walks' are possible in a connected graph if and only if exactly zero or two nodes have an odd number of edges.
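The odd-degree rule can be checked in a few lines of Python. This is a minimal sketch, not Euler's own formulation: the function name `has_euler_walk` and the land-mass labels A to D are assumptions made for illustration, and the graph is additionally required to be connected.

```python
from collections import defaultdict


def has_euler_walk(edges):
    """Return True if a walk using every edge exactly once exists.

    `edges` is a list of (node, node) pairs; repeated pairs model
    parallel bridges between the same two land masses.
    """
    if not edges:
        return True
    degree = defaultdict(int)
    neighbours = defaultdict(set)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        neighbours[a].add(b)
        neighbours[b].add(a)

    # The rule only applies when every node is reachable from every other.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if len(seen) != len(degree):
        return False

    # Count the nodes with an odd number of edges.
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)


# Königsberg as a multigraph: 4 land masses, 7 bridges (labels are hypothetical).
konigsberg = [
    ("A", "B"), ("A", "B"),  # two bridges from the island to one bank
    ("A", "C"), ("A", "C"),  # two bridges from the island to the other bank
    ("A", "D"), ("B", "D"), ("C", "D"),
]
triangle = [("A", "B"), ("B", "C"), ("C", "A")]

print(has_euler_walk(konigsberg))  # all four nodes have odd degree
print(has_euler_walk(triangle))    # every node has degree 2
```

Only the list of connections enters the computation; distances and positions never appear, which is the point of Euler's insight.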
TDA I TDA marries 300-year-old maths with modern data analysis.
Captures the shape of data
Compresses large datasets
Functions well in the presence of noise / missing variables
TDA II Capturing the shape of data
Traditional techniques like clustering or dimensionality reduction have trouble capturing this shape.
TDA III Invariance.
Euler showed that only connectedness matters. The size, position, or pose of an object doesn't change that object's topology.
TDA IV Compression.
Compressed representations use the order in data.
Only order can be compressed.
Random noise or slight variations are ignored.
Lossy compression retains the most important features.
"Now where there are no parts, there neither extension, nor shape, nor divisibility is possible. And these monads are the true atoms of nature and, in a word, the elements of things." - Leibniz
Mapper was created by Ayasdi Co-founder Gurjeet Singh during his PhD under Gunnar Carlsson.
Based on the idea of partial clustering of the data guided by a set of functions defined on the data.
MAPPER II Mapper was inspired by the Reeb Graph.
MAPPER III Map the data with a function and cover the function's range with overlapping intervals.
Cluster the points inside each interval.
When clusters share data points, draw an edge between them.
Color nodes by function value.
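The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the reference Mapper implementation: the names `mapper_1d` and `single_linkage`, the parameters `n_intervals`, `overlap`, and `eps`, and the use of a simple distance-threshold clustering are all choices made here for brevity.

```python
import math


def single_linkage(points, members, eps):
    """Group the point indices in `members` into clusters: two points end up
    in the same cluster if a chain of hops no longer than `eps` links them."""
    clusters, unvisited = [], set(members)
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_1d(points, lens, n_intervals=3, overlap=0.5, eps=0.5):
    """Toy Mapper with a one-dimensional lens function."""
    # Step 1: map the data and cover the range with overlapping intervals.
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0
    pad = width * overlap / 2
    nodes = []
    for k in range(n_intervals):
        a, b = lo + k * width - pad, lo + (k + 1) * width + pad
        members = [j for j, v in enumerate(values) if a <= v <= b]
        # Step 2: cluster the points that fall inside this interval.
        nodes.extend(single_linkage(points, members, eps))
    # Step 3: draw an edge when two clusters share data points.
    edges = {(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    # Step 4: color each node by the mean lens value of its points.
    colors = [sum(values[j] for j in node) / len(node) for node in nodes]
    return nodes, edges, colors


# Six points on a line, with the x coordinate as the lens.
points = [(0.0, 0.0), (0.4, 0.0), (0.8, 0.0),
          (1.2, 0.0), (1.6, 0.0), (2.0, 0.0)]
nodes, edges, colors = mapper_1d(points, lens=lambda p: p[0])
print(len(nodes), sorted(edges))
```

On this input the result is a chain of three nodes: neighbouring intervals overlap, so their clusters share a point and are joined by an edge. In practice the clustering step would usually be delegated to a library algorithm such as DBSCAN.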
MAPPER V
Distance_to_median(row)    x      y      z
1.5                        1.5    1.5    1.5
1.5                        -0.5   -0.5   -0.5
0                          1      1      1
0                          1      0.9    1.1
3                          2      2      2
3                          2.1    1.9    2
MAPPER VI In conclusion:
FUNCTIONS Raw features or point-cloud axis / coordinates
Statistics: Mean, Max, Skewness, etc.
Mathematics: L2-norm, Fourier Transform, etc.
Machine Learning: t-SNE, PCA, out-of-fold preds
Deep Learning: Layer activations, embeddings
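Any function that assigns a number to each row can serve as a lens. As a small sketch of the statistical and mathematical options listed above, the snippet below applies three of them to a few illustrative rows; the helper `apply_lenses` and the sample rows are assumptions made for this example.

```python
import math
import statistics

# Illustrative rows with three features each (x, y, z).
rows = [(1.5, 1.5, 1.5), (1.0, 0.9, 1.1), (2.1, 1.9, 2.0)]

# Each lens maps one row to a single number.
lenses = {
    "mean": statistics.fmean,
    "max": max,
    "l2_norm": lambda row: math.sqrt(sum(x * x for x in row)),
}


def apply_lenses(rows, lenses):
    """Return {lens name: [lens value for each row]}."""
    return {name: [f(row) for row in rows] for name, f in lenses.items()}


projected = apply_lenses(rows, lenses)
for name, values in projected.items():
    print(name, [round(v, 3) for v in values])
```

Each resulting list of values could then be covered with overlapping intervals in the Mapper construction.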
CLUSTER ALGOS DBSCAN / HDBSCAN:
Handles noise well.
No need to set number of clusters.
Creates visually nice simplicial complexes/graphs
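To make the first two properties concrete, here is a simplified pure-Python sketch of the DBSCAN idea. It is written for illustration only, and a real analysis would normally use a library implementation. The parameter names `eps` and `min_samples` and the sample points are assumptions for this example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is also a core point
    return labels


# Two tight groups plus one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0),
          (5, 5), (5, 5.1), (5.1, 5),
          (10, 0)]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)
```

The number of clusters is never passed in: it falls out of the density parameters, and the isolated point is labelled as noise instead of being forced into a cluster.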
SOME GENERAL USE CASES
Model and feature inspection
Computational Biology / Healthcare
COMPUTER VISION Demo
MODEL AND FEATURE INSPECTION
COMPUTATIONAL BIOLOGY Example
PERSISTENT HOMOLOGY Example
SOME FINANCE USE CASES
Accurate Interpretable Models
Exploration / Analysis
CUSTOMER SEGMENTATION Demo
TRANSACTIONAL FRAUD Example of spousal fraud
ACCURATE INTERPRETABLE MODELS
Create: global linear model
Color: Heatmap by ground truth and animate to out-of-fold model predictions
Identify: Low-accuracy subgraphs
Select: Features that are most important for those subgraphs
Create: Local linear models on the subgraphs
Stack: Decision Tree
Compare: Divide-and-Conquer and LIME
EXPLORATION / ANALYSIS Demo
FURTHER READING Google terms:
Ayasdi, Topological Data Analysis, Robert Ghrist, Gurjeet Singh, Gunnar Carlsson, Anthony Bak, Allison Gilmore, Simplicial Complex, Python Mapper.