K-Means Clustering with BSP
Thomas Jungblut, Testberichte.de, 2012
Study assignment 4th semester, HWR Berlin
Content
What is K-Means Clustering?
What is BSP?
K-Means with BSP
What is K-Means Clustering?
Unsupervised Learning
Huge number of input vectors
k initial centers
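Two-step iterative algorithm
• Assignment: assign each vector to its nearest center
• Update: move each center to the mean of its assigned vectors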
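The two steps can be sketched as a plain sequential loop. This is only a minimal illustration, not the implementation from the talk: the function names, the Euclidean distance, the stopping rule (centers no longer change) and the sample vectors are all assumptions.

```python
import math

def nearest_center(vector, centers):
    """Index of the center closest to `vector` (Euclidean distance, assumed)."""
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))

def kmeans(vectors, centers, max_iterations=100):
    """Plain sequential k-means; returns the final centers."""
    dim = len(vectors[0])
    for _ in range(max_iterations):
        # Assignment step: add every vector to the sum of its nearest center.
        sums = [[0.0] * dim for _ in centers]
        counts = [0] * len(centers)
        for v in vectors:
            i = nearest_center(v, centers)
            counts[i] += 1
            for d in range(dim):
                sums[i][d] += v[d]
        # Update step: move each center to the mean of its assigned vectors.
        # A center without any assigned vector keeps its old position.
        new_centers = [
            [s / counts[i] for s in sums[i]] if counts[i] else list(centers[i])
            for i in range(len(centers))
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return centers

# Hypothetical sample data: two obvious groups, k = 2 initial centers.
vectors = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centers = kmeans(vectors, centers=[[0.0, 0.0], [10.0, 10.0]])
```

With this sample data the loop stops after the second pass, once the centers no longer move.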
How do we parallelize K-Means?
What is BSP?
BSP = Bulk Synchronous Parallel
Paradigm to design parallel algorithms
Two basic operations
• Send message
• Barrier synchronization
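[Figure: processes P1, P2 and P3 run one superstep: a computation phase followed by a communication phase, enclosed by two barrier synchronizations (Sync)]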
During the computation phase, outgoing messages are only queued
Between two barrier synchronizations, the queued messages are exchanged in bulk
Messages sent in one superstep become available in the next superstep
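The message rules above can be illustrated with a toy, single-threaded model. It is not a real BSP runtime and not Hama's API; the class `BspSimulator` and its methods `send` and `sync` are hypothetical names chosen for this sketch.

```python
class BspSimulator:
    """Single-threaded toy model of BSP message passing (illustration only)."""

    def __init__(self, num_processes):
        # Messages queued during the current superstep, per receiver.
        self.outboxes = [[] for _ in range(num_processes)]
        # Messages readable during the current superstep, per receiver.
        self.inboxes = [[] for _ in range(num_processes)]

    def send(self, target, message):
        # Sending only queues the message; nothing is delivered yet.
        self.outboxes[target].append(message)

    def sync(self):
        # Barrier: deliver all queued messages in bulk, then start a new superstep.
        self.inboxes = self.outboxes
        self.outboxes = [[] for _ in self.inboxes]

bsp = BspSimulator(num_processes=3)

# Superstep 1: every process sends its id to its right-hand neighbour.
for pid in range(3):
    bsp.send((pid + 1) % 3, pid)

inbox_before_sync = list(bsp.inboxes[0])  # still empty: messages are only queued
bsp.sync()
inbox_after_sync = list(bsp.inboxes[0])   # now holds the message from process 2
```

A real runtime would run the processes concurrently and block each one at the barrier; the sketch only shows when messages become visible.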
K-Means with BSP
Partition the dataset into equal-sized blocks
Put the centers into RAM on each process
Iterate sequentially over the vectors on disk
Sum the assigned vectors into a new temporary center object
[Figure: every process holds its own copy of the centers]
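What a single process does with its block can be sketched as follows. This is a sketch under assumptions: the function name `assign_block` is hypothetical, an in-memory list stands in for the vectors on disk, and Euclidean distance is assumed.

```python
import math

def assign_block(block, centers):
    """Assignment phase for one process: per-center sums and counts for its block."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]   # the temporary center objects
    counts = [0] * len(centers)             # how often each one was summed
    for vector in block:  # stands in for a sequential scan over vectors on disk
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(vector, centers[i]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += vector[d]
    return sums, counts

# Hypothetical block and centers; the centers are small enough to keep in RAM.
centers = [(0.0, 0.0), (10.0, 10.0)]
block = [(1.0, 1.0), (2.0, 0.0), (9.0, 9.0)]
sums, counts = assign_block(block, centers)
```

Only the sums and counts have to leave the process afterwards, so the message size depends on the number of centers, not on the block size.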
Sums
• Center 1: sum = 25, summed 5 times
• Center 2: sum = 50, summed 10 times
• Center 3: sum = 10, summed 5 times
Send the sum
[Figure: every process holds its centers and its sum, and sends the sum to the other processes]
Add the received sums up to a total sum, divide by the total increments to get the means, and use the means as the new centers
• The same calculation runs on every process
• Floating-point error can be corrected by synchronizing once it exceeds a given threshold
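A minimal sketch of this update, using one-dimensional values for brevity. The function name `update_centers` and the split of the numbers across two processes are assumptions; the split is chosen so that the totals match the sums and counts quoted on the earlier slide (25 over 5, 50 over 10, 10 over 5).

```python
def update_centers(messages, old_centers):
    """Combine (sums, counts) messages from all processes into new centers."""
    k = len(old_centers)
    total_sums = [0.0] * k
    total_counts = [0] * k
    for sums, counts in messages:
        for i in range(k):
            total_sums[i] += sums[i]
            total_counts[i] += counts[i]
    # Divide by total increments; a center nobody was assigned to stays put.
    return [
        total_sums[i] / total_counts[i] if total_counts[i] else old_centers[i]
        for i in range(k)
    ]

# Hypothetical messages from two processes: (sums per center, counts per center).
messages = [
    ([10.0, 30.0, 4.0], [2, 6, 2]),
    ([15.0, 20.0, 6.0], [3, 4, 3]),
]
new_centers = update_centers(messages, old_centers=[4.0, 6.0, 1.0])
```

Every process receives the same messages and runs the same arithmetic, so no extra round of communication is needed to distribute the new centers.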
Each iteration: Assignment, Sync, Update
Partition vectors into equal-sized blocks (number of blocks = number of tasks)
Put centers in RAM
Assignment phase
• Iterate over the vectors on disk sequentially
• Sum up temporary centers with the assigned vectors
• Message all tasks with the sum and how often something was summed
Update phase
• Calculate the total sum over all received messages and average it
• Replace old centers with new centers and calculate convergence
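Put together, the whole procedure might look like the single-process simulation below. It is a sketch, not the KMeansBSP implementation: the name `kmeans_bsp`, the strided partitioning, the convergence measure (largest center movement) and the parameters `tolerance` and `max_supersteps` are all assumptions.

```python
import math

def kmeans_bsp(vectors, centers, num_tasks, tolerance=1e-9, max_supersteps=100):
    """Simulate the tasks one after another; returns the final centers."""
    # Partition into roughly equal-sized blocks, one block per task.
    blocks = [vectors[t::num_tasks] for t in range(num_tasks)]
    dim = len(centers[0])
    for _ in range(max_supersteps):
        # Assignment phase: each task scans its block and builds one message.
        messages = []
        for block in blocks:
            sums = [[0.0] * dim for _ in centers]
            counts = [0] * len(centers)
            for v in block:
                i = min(range(len(centers)),
                        key=lambda c: math.dist(v, centers[c]))
                counts[i] += 1
                sums[i] = [s + x for s, x in zip(sums[i], v)]
            messages.append((sums, counts))
        # (Barrier sync: after it, every task sees every message.)
        # Update phase: identical on every task, so it is computed once here.
        new_centers = []
        for i, old in enumerate(centers):
            total = [sum(m[0][i][d] for m in messages) for d in range(dim)]
            n = sum(m[1][i] for m in messages)
            new_centers.append([t / n for t in total] if n else list(old))
        # Convergence (assumed measure): how far the centers moved at most.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return centers

# Hypothetical sample data, split across two simulated tasks.
vectors = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
result = kmeans_bsp(vectors, [[0.0, 0.0], [10.0, 10.0]], num_tasks=2)
```

The result does not depend on how the vectors are split across tasks, because the update only uses the totals over all messages.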
Benchmark
16 servers, 256 cores, 10G network
80 seconds!
Possible starvation: add more servers
Logarithmic scaling
Much better than linear scaling of MapReduce
Misc
Implementation on GitHub
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/KMeansBSP.java
Will be committed to Hama's ML package soon
https://issues.apache.org/jira/browse/HAMA-547