unsupervised learning with spark

Marko Velic PhD Data Science Department Styria Medijski Servisi d.o.o. [email protected] UNSUPERVISED LEARNING (WITH SPARK)

Upload: marko-velic

Post on 20-Feb-2017


TRANSCRIPT

Marko Velic PhD

Data Science Department

Styria Medijski Servisi d.o.o.

[email protected]

UNSUPERVISED LEARNING(WITH SPARK)

CONTENTS

Distances

• Euclidean

• Manhattan

• Mahalanobis

• Cosine Similarity

Clustering

• K-Means

• Example (Spark)

Examples from Styria practice (not Spark – for now)

10.03.2016 2

MACHINE LEARNING


UNSUPERVISED LEARNING

Observations are not assigned to classes

The computer program is not ‘supervised’ throughout the learning process

Usually the task is to find ‘meaningful’ groups within the data

Decisions are made based on distances, i.e. similarities among data points


DISTANCES


• To decide on the groups we have to introduce a similarity measure or, conversely, a distance measure

• Pythagorean theorem: Euclidean distance

• dist((2, -1), (-2, 2)) = √((2 - (-2))² + ((-1) - 2)²) = √(4² + (-3)²) = √(16 + 9) = √25 = 5
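The worked example above can be checked with a few lines of code. This is a minimal sketch in plain Python; the function name `euclidean` is an illustrative choice, not something from the slides.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# The two points from the slide.
print(euclidean((2, -1), (-2, 2)))
```

The same function works for any number of dimensions, which matters later when the points are image feature vectors rather than 2-D coordinates.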

DISTANCES & APPROACHES


Source: http://en.wikipedia.org/wiki/Manhattan_distance

Manhattan/Cityblock/Taxicab

• dist((x, y), (a, b)) = |x - a| + |y - b|
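As a quick illustration of the formula, here is a small sketch in plain Python that generalizes it to any number of dimensions. The function name `manhattan` is hypothetical.

```python
def manhattan(p, q):
    """Manhattan (city-block, taxicab) distance between two points."""
    # Sum of absolute coordinate differences, one term per dimension.
    return sum(abs(a - b) for a, b in zip(p, q))


# Same pair of points as in the Euclidean example: |2 - (-2)| + |(-1) - 2|.
print(manhattan((2, -1), (-2, 2)))
```

For this pair the Manhattan distance is 4 + 3 = 7, compared with a Euclidean distance of 5.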

Normalization!
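Why normalization matters: a feature measured in thousands dominates any distance over a feature measured in units. One common option is min-max scaling, sketched below in plain Python. The slide does not say which normalization is meant, so this choice and the name `min_max_scale` are assumptions.

```python
def min_max_scale(rows):
    """Rescale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        # A constant column has span 0; map it to 0.0 to avoid dividing by zero.
        tuple((v - lo) / span if span else 0.0
              for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


# First column is in the thousands, second in single units.
print(min_max_scale([(1000, 1), (2000, 3), (3000, 2)]))
```

After scaling, both columns contribute on the same scale to Euclidean or Manhattan distances.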

Mahalanobis – considers variance

• “multidimensional z-score”
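For reference, the standard definition behind the "multidimensional z-score" remark, with the symbols assumed here: x is an observation, μ the mean vector, and Σ the covariance matrix of the data. In one dimension it reduces to the absolute z-score.

```latex
d_M(\mathbf{x}, \boldsymbol{\mu})
  = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}
\qquad \text{and, in one dimension,} \qquad
d_M(x, \mu) = \frac{|x - \mu|}{\sigma}
```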

Cosine similarity
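Cosine similarity compares the direction of two vectors and ignores their length. A minimal plain-Python sketch follows; the function name and the choice to raise an error for zero vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity((1, 0), (0, 1)))  # perpendicular vectors
print(cosine_similarity((1, 2), (2, 4)))  # same direction, different length
```

Perpendicular vectors score 0 and vectors pointing the same way score 1 (up to floating-point rounding), regardless of magnitude.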

Autoencoders – ‘unsupervised’ neural nets

Not unsupervised, but based on distances

• ReliefF measure, KNN classifier, etc.

K-MEANS


Simplified:

1. Randomly place the centroids

2. For each point, find the closest centroid

3. Move each centroid to the middle (mean) of its points

4. GOTO 2
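The four steps can be sketched in plain Python as follows. This is an illustrative single-machine version, not the Spark implementation; the function name, the `max_iter` and `seed` parameters, and the choice to initialize centroids from randomly picked data points are all assumptions.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Simplified k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: randomly place centroids (here: on k randomly chosen points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: for each point, find the index of the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 4: GOTO 2 (the next loop iteration).
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

On this toy data the two groups are far apart, so the loop settles on one centroid per group after a few iterations. `math.dist` requires Python 3.8 or newer.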

Image source: http://www.javabeat.net/2011/05/k-means-clustering-algorithms-in-mahout/

DEMO (SPARK!)

K-means clustering of photos (i.e. their vector representations)

A convolutional neural network serves as the supervised model, and its outputs are used as features for unsupervised models

Vector representations are taken after the pooling layers, after every convolutional layer (Caffe)

Clustering in Spark
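The demo code itself is not part of the transcript. As a rough sketch of what clustering feature vectors in Spark can look like, the snippet below uses the DataFrame-based `pyspark.ml` API (Spark 2.x and later). The helper name `cluster_vectors`, the app name, and the parameter defaults are hypothetical, and the original demo may well have used a different Spark API.

```python
def cluster_vectors(vectors, k, seed=42):
    """Cluster dense feature vectors with Spark ML KMeans (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded on a machine without PySpark.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("photo-kmeans-sketch").getOrCreate()
    # One row per photo; "features" is the default input column of KMeans.
    df = spark.createDataFrame(
        [(Vectors.dense(v),) for v in vectors], ["features"]
    )
    model = KMeans(k=k, seed=seed).fit(df)
    # transform() appends a "prediction" column holding the cluster index.
    labels = [row["prediction"] for row in model.transform(df).collect()]
    centers = [c.tolist() for c in model.clusterCenters()]
    return labels, centers
```

A call such as `cluster_vectors(photo_vectors, k=10)` would return one cluster index per photo plus the cluster centers, where `photo_vectors` stands in for the CNN-derived vectors described above.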

T-SNE CLUSTER VISUALIZATION


SEMI-MANUAL CLUSTERING OF PHOTOS

Grouping photos based on visual features, Enes Deumić, Styria Data Science Team


NATURAL LANGUAGE PROCESSING


t-SNE concept visualization; vecernji.hr, Styria Data Science Team

AUTOMATIC (LEARNED) HIERARCHIES


Hierarchical clustering, Florijan Stamenković, Styria Data Science Team

VISUAL SEARCH EXAMPLE


CONCLUSION

Distances

• Euclidean

• Manhattan

• Mahalanobis

• Cosine Similarity

Clustering

• K-Means

We can nicely combine supervised and unsupervised features

SparkNet: Training Deep Networks in Spark http://arxiv.org/pdf/1511.06051v4.pdf

https://news.developer.nvidia.com/caffe-on-spark-for-deep-learning-from-yahoo/


THANK YOU!
