scalable clustering using multiple gpus
DESCRIPTION
Scalable Clustering using Multiple GPUs. K Wasif Mohiuddin P J Narayanan Center for Visual Information Technology International Institute of Information Technology (IIIT) Hyderabad. Introduction. Classification of data desired for meaningful representation. - PowerPoint PPT PresentationTRANSCRIPT
IIIT
Hyd
erab
ad
Scalable Clustering using Multiple GPUs
K Wasif MohiuddinP J Narayanan
Center for Visual Information TechnologyInternational Institute of Information Technology (IIIT)
Hyderabad
IIIT
Hyd
erab
ad
Introduction• Classification of data desired for
meaningful representation.• Data in subset ideally shares common traits.• Unsupervised learning for finding hidden
structure.• Application in data mining, computer vision
with– Image Classification– Document Retrieval
• Simple K-Means algorithm December 20, 2011 HiPC - 2011 2
IIIT
Hyd
erab
ad
Need for High Performance Clustering
• Time Complexity of O(ndk+1 log n) where n- input vectors, d- dimension, k-centers
• A fast, efficient clustering implementation is needed to deal with large data, high dimensionality and large centers.
• In computer vision, 128-dim SIFT and 512-dim (GIST) are common. Features can run into several millions
• Bag of Words for Vocabulary generation using SIFT [Lowe IJCV04] vectors
December 20, 2011 HiPC - 2011 3
IIIT
Hyd
erab
ad
Challenges and Contributions
• Data: Storage of data format for quick and repeated access.
• Computational: O(ndk+1 log n) • Contributions: A complete GPU based
implementation with– Exploitation of intra-vector parallelism– Efficient Mean evaluation– Data Organization – Multi GPU framework
December 20, 2011 4HiPC - 2011
IIIT
Hyd
erab
ad
Related Work• General Improvements
– KD-trees [Moor et al, SIGKKD-1999– Triangle Inequality [Elkan, ICML-2003]
• Pre CUDA GPU Efforts Improvements– Fragment Shader [Hart et al, SIGGRAPH-
2004]
December 20, 2011 HiPC - 2011 5
IIIT
Hyd
erab
ad
Related Work (cont)• Recent GPU efforts
– Mean on CPU [Che et al, JPDC-2008] – Mean on CPU + GPU [Hong et al, WCCSIE-
2009] – GPU Miner [Ren et al, HP Labs-2009] – HPK-Means [Wu et al, UCHPC-2009] – Divide & Rule [Li et al, ICCIT-2010]
• Parallelism not exploited within data object.• Lacking efficiency in Mean evaluation on
GPU.• Proposed techniques are parameter dependant. December 20, 2011 HiPC - 2011 6
IIIT
Hyd
erab
ad
K-Means
• Objective Function ∑i∑j‖xi(j) -cj‖2
1≤ i ≤ n, 1≤ j ≤ k• Euclidean distance : L2 norm• Steps:
– Membership Evaluation– New Mean Evaluation– Convergence
December 20, 2011 HiPC - 2011 7
IIIT
Hyd
erab
ad
Algorithm
• K random centers are initially chosen from input.
• Partitions data into k clusters• Observation belongs to the cluster with the
nearest mean.• Re-evaluate the new centers & continue the
process till convergence is attained.
December 20, 2011 HiPC - 2011 8
IIIT
Hyd
erab
ad
K-Means on GPU
Membership Evaluation• Involves Distance and Minima evaluation.• Single thread per component of vector• Parallel computation done on ‘d’
components of input and center vectors stored in row major format.
• Log summation for distance evaluation.• For each input vector we traverse across all
centers stored in L2 cache.December 20, 2011 9HiPC - 2011
IIIT
Hyd
erab
ad
K-Means on GPU (Cont)
December 20, 2011 HiPC - 2011 10
Membership Evaluation• Data objects stored in
row major format• Provides coalesced
access• Distance evaluation
using shared memory.• Root finding avoided
IIIT
Hyd
erab
ad
K-Means on GPU (Cont)
• Mean Evaluation Issues
– Data rearrangement on CPU as per membership is time consuming.
– Concurrent writes– Random reads and writes– Non uniform distribution of labels for data
objects.
December 20, 2011 HiPC - 2011 11
IIIT
Hyd
erab
ad
Mean Evaluation on GPU
• Store labels and index in 64 bit records• Group data objects with same membership
using Splitsort operation. • We split using labels as key• Gather primitive used to rearrange input in
order of labels.• Sorted global index of input vectors is
generated.
December 20, 2011 12HiPC - 2011
Splitsort : Suryakant & Narayanan IIITH, TR 2009
IIIT
Hyd
erab
ad
Splitsort & Transpose Operation
December 20, 2011 HiPC - 2011 13
IIIT
Hyd
erab
ad
Mean Evaluation on GPU (cont)
• Row major storage of vectors enabled coalesced access.
• CUDPP segmented scan followed by compact operation for histogram count.
• Transpose operation before rearranging input vectors.
• Using segmented scan again we evaluated mean of rearranged vectors as per labels.
December 20, 2011 HiPC - 2011 14
IIIT
Hyd
erab
ad
Implementation Details
• Tesla – 2 vectors per block , 2 centers at a time– Centers accessed via shared memory
• Fermi – 2 vectors per block, 4 centers at a time– Centers accessed via global memory using L2
cache– More shared memory for distance evaluation
• Occupancy of 83% using 5136 KB of shared memory in case of Fermi.
December 20, 2011 HiPC - 2011 15
IIIT
Hyd
erab
ad
ISSUES
• Too many distance evaluations• Convergence highly dependent on cluster
centers chosen.• Prior seeding using K-Means++ can reduce
the number of iterations.• Parameters like dimension, cluster centers
affect the performance apart from the input size of the vectors.
December 20, 2011 HiPC - 2011 16
IIIT
Hyd
erab
ad
Limitations of GPU device
• Highly computational & memory consuming algorithms.
• Limited Global and Shared memory on a GPU device.
• Division of computational load if more than one device is available.
• Utilization of every resource available.• Scalability of the algorithm
December 20, 2011 HiPC - 2011 17
IIIT
Hyd
erab
ad
Multi GPU Approach• Partition input data into
chunks proportional to number of cores.
• Broadcast ‘k’ centers to all the nodes.
• Perform Membership & partial mean on each of the GPUs sent to their respective nodes.
• Nodes direct partial sums to Master node.
• New means evaluated by Master node for next iteration.
December 20, 2011 HiPC - 2011 18
IIIT
Hyd
erab
ad
Results
• Generated Gaussian SIFT vectors• Variation in parameters n, d, k• Performance on CPU(32 bit, 2.7 Ghz), Tesla
T10, GTX 480 tested up to nmax :4 Million, kmax : 8000 , dmax : 256
• MultiGPU (4xT10 + GTX 480) nmax : 32 Million, kmax : 8000, dmax : 256
• Comparison with previous GPU implementations.
December 20, 2011 19HiPC - 2011
IIIT
Hyd
erab
ad
Overall Results
N, K CPU GPU Tesla T10
GTX 480 4xT10
10K, 80 1.3 0.119 0.18 0.097
50K, 800 71.3 2.73 1.73 0.891
125K, 2K 463.6 14.18 7.71 2.47
250K, 4K 1320 38.5 27.7 7.45
1M, 8K 28936 268.6 170.6 68.5
Times of K-Means on CPU, GPUs in seconds for d=128.
December 20, 2011 20HiPC - 2011
IIIT
Hyd
erab
ad
Overall Performance
• Mean evaluation reduced to 6% of the total time for large input of high dimensional data.
• Multi GPU provided linear speedup• Speedup of up to 170 on GTX 480• 6 Million vectors of 128 dimension
clustered in just 136 sec per iteration.• Achieved up to twice increase in speedup
against the best GPU implementationDecember 20, 2011 HiPC - 2011 21
IIIT
Hyd
erab
ad
Performance vs ‘n’
December 20, 2011 HiPC - 2011 22
Linear performance for variation in n, with d=128 and k=4,000.
IIIT
Hyd
erab
ad
Performance vs ‘d’
December 20, 2011 HiPC - 2011 23
Performance for variation in d, with n=1M and k=8,000.
IIIT
Hyd
erab
ad
Performance vs ‘k’
December 20, 2011 HiPC - 2011 24
Linear performance for variation in k, with n=50k and d=128.
IIIT
Hyd
erab
ad
Comparison
N K D Li et al Wu et al Our K-Means
2 Million 400 8 1.23 4.53 1.27
4 Million 100 8 0.689 4.95 0.734
4 Million 400 8 2.26 9.03 2.4
51,200 32 64 1320 - 0.191
51,200 32 128 28936 - 0.282
Running time of K-Means in seconds on GTX 280.
December 20, 2011 25HiPC - 2011
IIIT
Hyd
erab
ad
Performance on GPUs
December 20, 2011 HiPC - 2011 26
Performance of 8600, Tesla, GTX 480 for d=128 and k=1,000.
IIIT
Hyd
erab
ad
Conclusions• Achieved a speed of over 170 on single NVIDIA Fermi
GPU.• Complete GPU based implementation.• High Performance for large ‘d’ due to processing of vector
in parallel.• Scalable in problem size n, d, k and number of cores.• Use of operations like Splitsort, Transpose for coalesced
memory access.• Overcame memory limitations using Multi GPU frame
work.• Code will be available at http://cvit.iiit.ac.in soonDecember 20, 2011 27HiPC - 2011
IIIT
Hyd
erab
ad
Thank You
Questions?