k-means with bsp

K-Means Clustering with BSP Thomas Jungblut, Testberichte.de, 2012 Study assignment 4th semester, HWR Berlin

Upload: tjungblut

Post on 25-May-2015


Page 1: K-Means with BSP

K-Means Clustering with BSP Thomas Jungblut, Testberichte.de, 2012

Study assignment 4th semester, HWR Berlin

Page 2: K-Means with BSP

Content

• What is K-Means Clustering?
• What is BSP?
• K-Means with BSP

Page 3: K-Means with BSP

What is K-Means Clustering?


Page 4: K-Means with BSP

What is K-Means Clustering?

Page 9: K-Means with BSP

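What is K-Means Clustering?

• Unsupervised learning
• Huge number of input vectors
• k initial centers
• Two-step iterative algorithm
  • Assignment: each vector is assigned to its nearest center
  • Update: each center is moved to the mean of its assigned vectors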
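
The two steps, assignment and update, can be sketched as a plain sequential loop. This is a minimal illustration and not code from the talk: the function names, the Euclidean distance, the stop-when-nothing-moves rule and the sample points are all assumptions.

```python
import math

def nearest_center(vector, centers):
    """Return the index of the center closest to `vector` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))

def kmeans(vectors, centers, max_iterations=100):
    """Plain sequential k-means; `centers` holds the k initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: group every vector under its nearest center.
        clusters = [[] for _ in centers]
        for v in vectors:
            clusters[nearest_center(v, centers)].append(v)
        # Update step: move each center to the mean of its assigned vectors.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(xs) / len(members) for xs in zip(*members)])
            else:
                new_centers.append(center)  # an empty cluster keeps its old center
        if new_centers == centers:  # no center moved, so the loop has converged
            break
        centers = new_centers
    return centers

# Made-up sample data: two obvious groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
result = kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(result)
```

Every iteration touches every vector once, which is the part that becomes expensive for a huge number of input vectors.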

Page 10: K-Means with BSP

How do we parallelize K-Means?


Page 11: K-Means with BSP

What is BSP?

• BSP = Bulk Synchronous Parallel
• Paradigm for designing parallel algorithms
• Two basic operations
  • Send message
  • Barrier synchronization

Page 12: K-Means with BSP

What is BSP?

[Figure: processes P1, P2 and P3 in one superstep, made up of computation and communication and delimited by two barrier synchronizations (Sync).]

Page 13: K-Means with BSP

What is BSP?

• Messages are queued during the computation phase
• Between two barrier synchronizations, messages are exchanged in bulk
• Messages sent in one superstep are available in the next superstep
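
The message semantics above can be imitated with threads on one machine. This is only a sketch: `Peer`, `send`, `sync`, `inbox` and `run_bsp` are hypothetical names chosen for illustration and do not belong to any particular BSP framework.

```python
import threading

class Peer:
    """Hypothetical BSP peer: send() queues a message, sync() is the barrier."""

    def __init__(self, index, outboxes, lock, barrier):
        self.index = index
        self.inbox = []            # messages sent to us in the previous superstep
        self._outboxes = outboxes  # shared: one pending-message list per peer
        self._lock = lock
        self._barrier = barrier

    def send(self, target, message):
        # Queue only; the target cannot see the message before the next sync().
        with self._lock:
            self._outboxes[target].append(message)

    def sync(self):
        self._barrier.wait()       # every peer has finished computing and sending
        with self._lock:           # bulk delivery of everything queued for us
            self.inbox = self._outboxes[self.index]
            self._outboxes[self.index] = []
        self._barrier.wait()       # nobody sends again until all inboxes are filled


def run_bsp(num_peers, program):
    """Run `program(peer, num_peers)` on `num_peers` threads."""
    outboxes = [[] for _ in range(num_peers)]
    lock, barrier = threading.Lock(), threading.Barrier(num_peers)
    threads = [
        threading.Thread(target=program, args=(Peer(i, outboxes, lock, barrier), num_peers))
        for i in range(num_peers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


observed = {}

def program(peer, num_peers):
    for target in range(num_peers):        # superstep 1: message every peer
        peer.send(target, peer.index)
    before = list(peer.inbox)              # still empty: nothing delivered yet
    peer.sync()
    observed[peer.index] = (before, sorted(peer.inbox))  # superstep 2

run_bsp(3, program)
print(observed)
```

The second `wait()` in `sync` is a design choice of this sketch: without it, a fast peer could start the next superstep and queue a message that a slower peer would pick up one superstep too early.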

Page 14: K-Means with BSP

K-Means with BSP


Partition the dataset into equal-sized blocks
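
One possible way to cut the blocks, assuming the dataset fits in a list and that there is one block per task, as the summary slide later states. The `partition` helper is hypothetical; when the length is not divisible by the number of tasks, block sizes differ by at most one.

```python
def partition(vectors, num_tasks):
    """Split `vectors` into `num_tasks` contiguous blocks of near-equal size."""
    base, extra = divmod(len(vectors), num_tasks)
    blocks, start = [], 0
    for task in range(num_tasks):
        size = base + (1 if task < extra else 0)  # the first `extra` blocks get one more
        blocks.append(vectors[start:start + size])
        start += size
    return blocks

blocks = partition(list(range(10)), 4)
print(blocks)
```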

Page 15: K-Means with BSP

K-Means with BSP

• Put the centers into RAM on each process
• Iterate sequentially over the vectors on disk
• Sum the assigned vectors into a new temporary center object
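
A sketch of what one process does with its block on this slide. The names are assumptions, the block is an in-memory list standing in for the sequential scan over disk, and Euclidean distance is assumed for picking the nearest center.

```python
import math

def assign_block(block, centers):
    """Assignment phase of one task: per-center sum vector and count."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]  # the temporary center objects
    counts = [0] * len(centers)             # how often each one was summed
    for vector in block:                    # stands in for the sequential disk scan
        nearest = min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))
        sums[nearest] = [s + x for s, x in zip(sums[nearest], vector)]
        counts[nearest] += 1
    return sums, counts

# Made-up centers and block, for illustration only.
centers = [(0.0, 0.0), (10.0, 10.0)]
block = [(1.0, 1.0), (2.0, 0.0), (9.0, 11.0)]
sums, counts = assign_block(block, centers)
print(sums, counts)
```

Only the centers and the running sums stay in memory; the vectors themselves are read once and discarded.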

Page 16: K-Means with BSP

K-Means with BSP

[Figure: a copy of the centers on each process.]

Page 17: K-Means with BSP

K-Means with BSP

Centers and their sums

• Center 1: sum = 25, summed 5 times
• Center 2: sum = 50, summed 10 times
• Center 3: sum = 10, summed 5 times

Page 18: K-Means with BSP

K-Means with BSP

[Figure: four processes, each holding the centers and its local sum.]

Send the sum


Page 20: K-Means with BSP

K-Means with BSP

[Figure: the sums received from all processes are added up to a total sum; dividing by the total increments gives the means, which become the new centers.]

• The same calculation runs on every process
• Floating-point error can be corrected by synchronizing when it exceeds a given threshold
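
The update step, total sum divided by total increments, can be sketched as follows. The first message reuses the one-dimensional sums and counts from the earlier slide (25/5, 50/10, 10/5). The second message, the old centers and the function name are made up for illustration.

```python
def update_centers(messages, old_centers):
    """Update phase: merge all (sums, counts) messages and average them."""
    k, dims = len(old_centers), len(old_centers[0])
    total_sums = [[0.0] * dims for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in messages:
        for c in range(k):
            total_sums[c] = [t + s for t, s in zip(total_sums[c], sums[c])]
            total_counts[c] += counts[c]
    # A center that received no vectors at all keeps its old position.
    return [
        [t / total_counts[c] for t in total_sums[c]] if total_counts[c] else list(old_centers[c])
        for c in range(k)
    ]

task_a = ([[25.0], [50.0], [10.0]], [5, 10, 5])  # values from the slide
task_b = ([[35.0], [10.0], [20.0]], [5, 2, 5])   # hypothetical second task
new_centers = update_centers([task_a, task_b], old_centers=[[4.0], [5.0], [2.0]])
print(new_centers)
```

Because every process receives the same messages and runs the same arithmetic, no extra step is needed to distribute the new centers.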

Page 21: K-Means with BSP

K-Means with BSP

Assignment

Sync

Update


Page 22: K-Means with BSP

K-Means with BSP

• Partition the vectors into equal-sized blocks (# blocks = # tasks)
• Put the centers in RAM
• Assignment phase
  • Iterate over the vectors on disk sequentially
  • Sum up temporary centers with assigned vectors
  • Message all tasks with the sum and how often something was summed
• Update phase
  • Calculate the total sum over all received messages and average
  • Replace old centers with new centers and calculate convergence
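
The whole loop can be simulated in a single process to see how the phases chain together. This sketch makes several assumptions: tasks are simulated one after another instead of running in parallel, the message exchange is a plain list, and convergence is measured as the total distance the centers moved, since the slides do not say which measure was used.

```python
import math

def local_sums(block, centers):
    """Assignment phase of one task: per-center sum vector and count."""
    sums = [[0.0] * len(centers[0]) for _ in centers]
    counts = [0] * len(centers)
    for v in block:
        i = min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
        sums[i] = [s + x for s, x in zip(sums[i], v)]
        counts[i] += 1
    return sums, counts

def merged_means(messages, centers):
    """Update phase: total sum over all messages divided by the total count."""
    new_centers = []
    for c, old in enumerate(centers):
        count = sum(counts[c] for _, counts in messages)
        total = [sum(col) for col in zip(*(sums[c] for sums, _ in messages))]
        new_centers.append([t / count for t in total] if count else list(old))
    return new_centers

def bsp_kmeans(blocks, centers, tolerance=1e-9, max_supersteps=100):
    centers = [list(c) for c in centers]
    for superstep in range(1, max_supersteps + 1):
        # Assignment phase: every task scans only its own block.
        messages = [local_sums(block, centers) for block in blocks]
        # Sync: after the barrier every task holds all messages.
        # Update phase: identical on every task, so it is computed once here.
        new_centers = merged_means(messages, centers)
        moved = sum(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= tolerance:  # assumed convergence criterion
            break
    return centers, superstep

# Made-up data: two blocks, one per simulated task.
blocks = [[(1.0, 1.0), (9.0, 9.0)], [(1.0, 2.0), (10.0, 9.0)]]
final_centers, supersteps = bsp_kmeans(blocks, centers=[(0.0, 0.0), (5.0, 5.0)])
print(final_centers, supersteps)
```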

Page 23: K-Means with BSP

Benchmark

• 16 servers, 256 cores, 10G network
• 80 seconds!
• Possible starvation: add more servers

Page 24: K-Means with BSP

Benchmark

• Logarithmic scaling
• Much better than the linear scaling of MapReduce