Data-Intensive Computing: From Clouds to GPUs
Gagan Agrawal
Motivation
• Growing need for analysis of large-scale data, both scientific and commercial
• Data-Intensive Supercomputing (DISC)
• Map-Reduce has received a lot of attention in the database, data mining, and high-performance computing communities

Motivation (2)
• Processor architecture trends:
  • Clock speeds are not increasing
  • Trend towards multi-core architectures
  • Accelerators like GPUs are very popular
  • Clusters of multi-cores / GPUs are becoming common
• Trend towards the cloud:
  • Use storage/computing/services from a provider (how many of you prefer gmail over a cse/osu account for e-mail?)
  • Utility model of computation
  • Need for high-level APIs and adaptation
My Research Group
Data-intensive theme at multiple levels:
• Parallel programming models
• Multi-cores and accelerators
• Adaptive middleware
• Scientific data management / workflows
• Deep web integration and analysis
Personnel
• Currently: 10 PhD students, 4 MS thesis students
• 7 graduated PhDs between 2005 and 2008
This Talk
• Parallel programming API for data-intensive computing
  • An alternate API and system for Google's Map-Reduce
  • An actual comparison
• Data-intensive computing on accelerators
  • Compilation for GPUs
• Overview of other topics
  • Scientific data management
  • Adaptive and streaming middleware
  • Deep web
Map-Reduce
• Simple API for (data-intensive) parallel programming
• The computation (illustrated below):
  • Apply map to each input data element, producing (key, value) pair(s)
  • Sort the pairs by key
  • Apply reduce to the set of values for each distinct key
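To make the structure concrete, here is a word-count version of map / sort-by-key / reduce as a minimal single-process C++ sketch (illustrative only; this is not Hadoop's API):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: one input data element -> (key, value) pairs
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.push_back({w, 1});
    return out;
}

int main() {
    std::vector<std::string> input = {"the quick fox", "the lazy dog"};
    // "sort by key": an ordered map groups all values under each distinct key
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : input)
        for (const auto& kv : map_fn(line))
            groups[kv.first].push_back(kv.second);
    // reduce: fold the value list of each distinct key
    for (const auto& g : groups) {
        int sum = 0;
        for (int v : g.second) sum += v;
        std::cout << g.first << " " << sum << "\n";
    }
    return 0;
}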
Map-Reduce Execution
Map-Reduce: Positives and Questions
• Positives:
  • Simple API, based on functional languages; very easy to learn
  • Support for fault tolerance, important for very large-scale clusters
• Questions:
  • Performance? Comparison with other approaches
  • Suitability for different classes of applications?
Classes of Data-Intensive Applications
• Many different types of applications
  • Data-center applications: data scans, sorting, indexing
  • More "compute-intensive" data-intensive applications: machine learning, data mining, NLP; Map-Reduce / Hadoop is being widely used for this class
  • Standard database operations: a SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
• What is Map-Reduce suitable for? What are the alternatives? Are MPI/OpenMP/Pthreads too low level?
Hadoop Implementation
HDFS
• Almost GFS, but with no file updates
• Cannot be directly mounted by an existing operating system
• Fault tolerance
• Components: Name Node, Job Tracker, Task Tracker
FREERIDE: Goals
• Framework for Rapid Implementation of Data mining Engines
• The ability to rapidly prototype a high-performance mining implementation:
  • Distributed-memory parallelization
  • Shared-memory parallelization
  • Ability to process disk-resident datasets
• Only modest modifications to a sequential implementation for all three
• Developed 2001-2003 at Ohio State
FREERIDE – Technical Basis
• Popular data mining algorithms share a common canonical loop
• Generalized reduction can be used as the basis for supporting a common API (a concrete instantiation follows below)
• Demonstrated for popular data mining and scientific data processing applications

while( ) {                      /* outer sequential loop */
  forall (data instances d) {   /* reduction loop */
    I = process(d)              /* which element(s) of the reduction object d updates */
    R(I) = R(I) op f(d)         /* op: associative, commutative operation */
  }
  ......
}
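As a concrete illustration (not FREERIDE's actual interface), here is the canonical loop instantiated for one k-means iteration in plain C++, with the reduction object holding per-cluster sums and counts:

#include <cfloat>
#include <cstddef>

constexpr int K = 1000;   // number of clusters (as in the experiments below)
constexpr int DIM = 3;    // dimensionality

// Reduction object R: per-cluster coordinate sums and point counts.
struct ReductionObject {
    double sum[K][DIM] = {};
    long   count[K] = {};
};

void kmeans_iteration(const double (*data)[DIM], std::size_t n,
                      const double (*centers)[DIM], ReductionObject& R) {
    for (std::size_t d = 0; d < n; ++d) {        // forall (data instances d)
        // process(d): find the closest cluster center
        int best = 0;
        double best_dist = DBL_MAX;
        for (int k = 0; k < K; ++k) {
            double dist = 0;
            for (int j = 0; j < DIM; ++j) {
                double diff = data[d][j] - centers[k][j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = k; }
        }
        // R(I) = R(I) op f(d): accumulate into the reduction object
        for (int j = 0; j < DIM; ++j) R.sum[best][j] += data[d][j];
        R.count[best] += 1;
    }
}

Since all updates flow through R via an associative, commutative op, per-thread or per-node copies of R can simply be merged at the end.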
Comparing Processing Structure
• Similar, but with subtle differences
Observations on Processing Structure
• Map-Reduce is based on a functional idea: it does not maintain state, which can lead to sorting overheads
• The FREERIDE API is based on a programmer-managed reduction object (see the sketch below): not as `clean', but it avoids sorting, helps shared-memory parallelization, and enables better fault recovery
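A hypothetical sketch of what a programmer-managed reduction-object API can look like (illustrative C++, not FREERIDE's exact interface):

#include <cstddef>
#include <vector>

// Hypothetical reduction-object interface in the FREERIDE style.
// The programmer owns the accumulator, so no (key, value) sorting occurs.
template <typename Elem>
class Reduction {
public:
    virtual void local_reduce(const Elem& e) = 0;      // R(I) = R(I) op f(e)
    virtual void combine(const Reduction& other) = 0;  // merge another copy of R
    virtual ~Reduction() = default;
};

// Example: a histogram as a reduction object.
class Histogram : public Reduction<int> {
public:
    explicit Histogram(std::size_t bins) : counts_(bins, 0) {}
    void local_reduce(const int& e) override {
        counts_[static_cast<std::size_t>(e) % counts_.size()] += 1;
    }
    void combine(const Reduction<int>& other) override {
        const auto& h = static_cast<const Histogram&>(other);
        for (std::size_t i = 0; i < counts_.size(); ++i) counts_[i] += h.counts_[i];
    }
private:
    std::vector<long> counts_;
};

Because updates address the accumulator directly, no (key, value) pairs are materialized or sorted; merging copies via combine() is all that distributed- or shared-memory parallelization requires.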
Experiment Design
• Tuning parameters in Hadoop: input split size, max number of concurrent map tasks per node, number of reduce tasks
• For the comparison, we used four applications
  • Data mining: k-means, kNN, Apriori
  • Simple data-scan application: wordcount
• Experiments on a multi-core cluster: 8 cores per node (8 map tasks)
Results – Data Mining
[Figure: k-means, varying # of nodes (4, 8, 16); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, K: 1000, Dim: 3]
Results – Data Mining (II)
[Figure: Apriori, varying # of nodes (4, 8, 16); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 900 MB, support level: 3%, confidence level: 9%]
Results – Data Mining (III)
[Figure: kNN, varying # of nodes (4, 8, 16); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, K: 1000, Dim: 3]
Results – Data-center-like Application
[Figure: wordcount, varying # of nodes (4, 8, 16); y-axis: total time (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB]
Scalability Comparison
[Figure: k-means, varying dataset size (800 MB, 1.6 GB, 3.2 GB, 6.4 GB, 12.8 GB); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. K: 100, Dim: 3, on 8 nodes]
Scalability – Word Count
[Figure: wordcount, varying dataset size (800 MB, 1.6 GB, 3.2 GB, 6.4 GB, 12.8 GB); y-axis: total time (sec); Hadoop vs. FREERIDE. On 8 nodes]
Overhead Breakdown
• Four components affect Hadoop's performance: initialization cost, I/O time, sorting/grouping/shuffling, and computation time
• What is the relative impact of each? An experiment with k-means
Analysis with K-means
[Figure: varying the number of k-means clusters k (50, 200, 1000); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, Dim: 3, on 16 nodes]
Analysis with K-means (II)
[Figure: varying the number of dimensions (3, 6, 48, 96, 192); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, K: 1000, on 16 nodes]
Observations
• Initialization costs and the limited I/O bandwidth of HDFS are significant in Hadoop
• Sorting is also an important limiting factor for Hadoop's performance
This Talk
• Parallel programming API for data-intensive computing
  • An alternate API and system for Google's Map-Reduce
  • An actual comparison
• Data-intensive computing on accelerators
  • Compilation for GPUs
• Overview of other topics
  • Scientific data management
  • Adaptive and streaming middleware
  • Deep web
Background - GPU Computing
• Many-core architectures/Accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming
CUDA Programming
• Significant improvement over the use of graphics libraries
• But (see the example below):
  • Need detailed knowledge of the GPU architecture and a new language
  • Must specify the grid configuration
  • Deal with memory allocation and movement
  • Explicit management of the memory hierarchy
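Even a trivial element-wise computation illustrates this: the programmer must allocate device memory, move data both ways, and choose the grid configuration by hand (a minimal CUDA sketch):

#include <cuda_runtime.h>

// Trivial kernel: scale an array in place.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Explicit allocation and host-to-device copy.
    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Explicit grid configuration: the programmer picks block and grid sizes.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, 2.0f, n);

    // Explicit device-to-host copy of the results.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    delete[] host;
    return 0;
}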
Parallel Data Mining
• Common structure of data mining applications (FREERIDE):

/* outer sequential loop */
while( ) {
  /* reduction loop */
  foreach (element e) {
    (i, val) = process(e);
    Reduc(i) = Reduc(i) op val;
  }
}
Porting on GPUs
• High-level parallelization is straightforward
• The details matter:
  • Data movement
  • Impact of thread count on reduction time
  • Use of shared memory
Architecture of the System
[Figure: system architecture]
• User input:
  • Variable information: the variables to be used in the reduction function, and the values of each variable or sizes of arrays
  • A sequential reduction function
  • Optional functions (initialization function, combination function, ...)
• Code analyzer (in LLVM): a variable analyzer and a code generator, which extract variable access patterns and combination operations
• Output: a host program (grid configuration and kernel invocation) plus kernel functions, compiled into an executable
Analysis of Sequential Code
• Get the access features of each variable (illustrated below)
• Determine the data to be replicated
• Get the operator for global combination
• Identify variables for shared memory
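For a k-means-style reduction function, the analysis might classify variables roughly as follows (hypothetical annotations for illustration; not the tool's actual output):

// Hypothetical classification of variables in a reduction function:
void kmeans_reduce(const float* point,   // read-only stream: kept in device memory
                   const float* centers, // read-only, shared by all threads:
                                         //   candidate for shared memory
                   float* reduc)         // updated with '+': replicated per thread,
                                         //   copies combined with '+' at the end
{
    /* ... reduction body ... */
}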
Memory Allocation and Copy
[Figure: per-thread copies of the reduction object laid out across threads T0 ... T63]
• Need a copy of the reduction object for each thread
• Copy the updates back to host memory after the kernel reduction function returns
Generating CUDA Code and C++/C Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
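Put together, the generated code might take roughly this shape: each thread updates a private copy of the reduction object, and a second kernel performs the global combination (an illustrative sketch, not the system's actual output):

#include <cuda_runtime.h>

#define THREADS 64
#define K 8   // size of the reduction object (illustrative)

// Kernel reduction: each thread updates its own copy of the
// reduction object, so no synchronization is needed per update.
__global__ void reduce_kernel(const float* data, int n, float* reduc) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float* my_copy = reduc + tid * K;   // this thread's private copy
    for (int i = tid; i < n; i += nthreads) {
        int idx = ((int)data[i]) % K;   // process(e): pick the accumulator
        my_copy[idx] += data[i];        // Reduc(i) = Reduc(i) op val
    }
}

// Global combination: fold all per-thread copies into copy 0.
__global__ void combine_kernel(float* reduc, int nthreads) {
    int k = threadIdx.x;                // one thread per accumulator
    if (k >= K) return;
    for (int t = 1; t < nthreads; ++t)
        reduc[k] += reduc[t * K + k];
}

int main() {
    const int n = 1 << 20, blocks = 4, nthreads = blocks * THREADS;
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d_data, *d_reduc;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_reduc, nthreads * K * sizeof(float));
    cudaMemcpy(d_data, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_reduc, 0, nthreads * K * sizeof(float));

    reduce_kernel<<<blocks, THREADS>>>(d_data, n, d_reduc);
    combine_kernel<<<1, K>>>(d_reduc, nthreads);

    float result[K];   // copy the combined reduction object back to the host
    cudaMemcpy(result, d_reduc, K * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_reduc); delete[] h;
    return 0;
}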
Optimizations
• Using shared memory (sketched below)
• Providing user-specified initialization functions and combination functions
• Specifying variables that are allocated only once
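When the per-thread copies are small, they can be kept in shared memory rather than device memory, cutting reduction time considerably. A sketch of the idea, with illustrative sizes (64 threads per block, 8 accumulators per thread):

// Per-block reduction copies kept in fast shared memory
// (assumes blockDim.x == 64; 64 * 8 floats = 2 KB per block).
__global__ void reduce_shared(const float* data, int n, float* out) {
    __shared__ float s_reduc[64 * 8];
    float* my_copy = s_reduc + threadIdx.x * 8;
    for (int k = 0; k < 8; ++k) my_copy[k] = 0.0f;   // initialization
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)
        my_copy[((int)data[i]) % 8] += data[i];
    __syncthreads();

    // Combine within the block; the host (or another kernel)
    // then combines the per-block partial results in `out`.
    if (threadIdx.x < 8) {
        float sum = 0.0f;
        for (int t = 0; t < 64; ++t) sum += s_reduc[t * 8 + threadIdx.x];
        out[blockIdx.x * 8 + threadIdx.x] = sum;
    }
}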
Applications
• K-means clustering
• EM clustering
• PCA
Experimental Results
[Figures: speedup of k-means, speedup of EM, speedup of PCA]
Deep Web Data Integration
• The emergence of the deep web
  • The deep web is huge
  • Different from the surface web
• Challenges for integration:
  • Not accessible through search engines
  • Inter-dependences among deep web sources
Our Contributions
• Structured query processing on the deep web
• Schema mining/matching
• Analysis of deep web data sources
Adaptive Middleware
• Enable time-critical event handling to achieve the maximum benefit, while satisfying the time/budget constraint
• Be compatible with grid and web services
• Enable easy deployment and management with minimum human intervention
• Be usable in a heterogeneous distributed environment
HASTE Middleware Design (ICAC 2008)
[Figure: HASTE middleware architecture]
• Application layer: autonomic service components (application services 1-5, each with an agent/controller). Inputs: application code, a configuration file, a benefit function, and the time-critical event
• Service layer: application deployment service; resource allocation service; resource monitoring service (CPU, memory, bandwidth); autonomic adaptation service (system model, estimator). Scheduling efficiency and value estimation connect allocation and adaptation
• OGSA infrastructure (Globus Toolkit 4.0)
Workflow Composition System
Summary
• Growing data is creating new challenges in HPC, grid, and cloud environments
• A number of topics are being addressed
• Many opportunities for involvement: 888 meets Thursdays 5:00 – 6:00, DL 280