Unsupervised Learning with Spark
Marko Velic PhD
Data Science Department
Styria Medijski Servisi d.o.o.
UNSUPERVISED LEARNING (WITH SPARK)
CONTENTS
Distances
• Euclidean
• Manhattan
• Mahalanobis
• Cosine Similarity
Clustering
• K-Means
• Example (Spark)
Examples from Styria practice (not Spark – for now)
10.03.2016
UNSUPERVISED LEARNING
Observations are not assigned to classes
The computer program is not ‘supervised’ throughout the learning process
Usually the task is to find ‘meaningful’ groups within the data
The decision is based on distances (or, conversely, similarities) among data points
DISTANCES
• To decide on the groups we have to introduce a similarity measure or, conversely, a distance measure
• Pythagorean theorem: Euclidean distance
• dist((2, -1), (-2, 2)) = √((2 - (-2))² + ((-1) - 2)²) = √((2 + 2)² + (-1 - 2)²) = √(4² + (-3)²) = √(16 + 9) = √25 = 5
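The worked example above can be checked with a few lines of Python. This is a minimal sketch, not code from the talk; the function name `euclidean` is chosen here for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((2, -1), (-2, 2)))  # 5.0
```

The same function works for any number of dimensions, which is what matters once the points are long feature vectors rather than 2-D coordinates.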
DISTANCES & APPROACHES
Source: http://en.wikipedia.org/wiki/Manhattan_distance
Manhattan/Cityblock/Taxicab
• dist((x, y), (a, b)) = |x - a| + |y - b|
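As a small illustration of the formula above, here is a sketch in Python generalized to any number of dimensions. The function name `manhattan` is an assumption made for this example.

```python
def manhattan(p, q):
    """Manhattan (city block, taxicab) distance between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((2, -1), (-2, 2)))  # 7
```

For the same pair of points used in the Euclidean example the result is 7 rather than 5, since the path follows the grid instead of the diagonal.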
Normalization!
Mahalanobis – considers variance
• “multidimensional z-score”
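To make the "multidimensional z-score" idea concrete, here is a minimal sketch for the 2-D case only, written in plain Python. The function name `mahalanobis_2d` and the example mean and covariance values are assumptions for illustration; in practice the mean and covariance would be estimated from the data.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from a distribution with the
    given mean and 2x2 covariance matrix cov = [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Quadratic form dx^T * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Hypothetical distribution: variance 4 along x (std 2), variance 1 along y.
cov = [[4.0, 0.0], [0.0, 1.0]]
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov))  # 1.0
print(mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov))  # 2.0
```

Both points are at Euclidean distance 2 from the mean, but the first lies one standard deviation away and the second two, which is exactly how a z-score would rank them.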
Cosine similarity
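Cosine similarity compares the direction of two vectors and ignores their length. A minimal sketch in Python follows; the function name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors: 1 = same direction,
    0 = orthogonal, -1 = opposite direction."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_p * norm_q)

print(cosine_similarity((1, 0), (0, 1)))  # 0.0
```

Because only the angle matters, (1, 2) and (2, 4) count as (almost exactly, up to floating-point rounding) identical, which is why this measure is popular for text vectors of different lengths.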
Autoencoders – ‘unsupervised’ neural nets
Not unsupervised, but based on distances
• ReliefF measure, k-NN classifier, etc.
K-MEANS
Simplified:
1. Randomly place the centroids
2. For each point, find the closest centroid
3. Put each centroid in the middle of its points
4. GOTO 2
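The four steps above can be sketched in plain Python. This is an illustrative single-machine version, not the Spark implementation used in the demo. The function name `kmeans`, the `max_iter` and `seed` parameters, the choice of initial centroids from the data points, and the stopping rule (stop when assignments no longer change) are assumptions made for this sketch.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Simplified k-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # 1. Randomly place centroids (assumption: pick k distinct data points).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # 2. For each point, find the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so further iterations would not help
        assignment = new_assignment
        # 3. Put each centroid in the middle (mean) of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # 4. GOTO 2
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

On these six points the first three and the last three end up in different clusters. Which cluster gets label 0 depends on the random initial placement. Note that `math.dist` requires Python 3.8 or newer.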
Image source: http://www.javabeat.net/2011/05/k-means-clustering-algorithms-in-mahout/
DEMO (SPARK!)
K-means clustering of photos (i.e. their vector representations)
Convolutional neural network as a supervised model and its outputs as features for unsupervised models
Vector representations after the pooling layers, after every convolutional layer (Caffe)
Clustering in Spark
SEMI-MANUAL CLUSTERING OF PHOTOS
Grouping photos based on visual features, Enes Deumić, Styria Data Science Team
NATURAL LANGUAGE PROCESSING
t-SNE concept visualization; vecernji.hr, Styria Data Science Team
AUTOMATIC (LEARNED) HIERARCHIES
Hierarchical clustering, Florijan Stamenković, Styria Data Science Team
CONCLUSION
Distances
• Euclidean
• Manhattan
• Mahalanobis
• Cosine Similarity
Clustering
• K-Means
We can nicely combine supervised and unsupervised features
SparkNet: Training Deep Networks in Spark http://arxiv.org/pdf/1511.06051v4.pdf
https://news.developer.nvidia.com/caffe-on-spark-for-deep-learning-from-yahoo/