Data-Intensive Computing: From Clouds to GPUs
Gagan Agrawal
Motivation
• Growing need for analysis of large-scale data, both scientific and commercial
• Data-Intensive Supercomputing (DISC)
• Map-Reduce has received a lot of attention in the database, data mining, and high-performance computing communities

Motivation (2)
• Processor architecture trends:
  • Clock speeds are not increasing
  • Trend towards multi-core architectures
  • Accelerators like GPUs are very popular
  • Clusters of multi-cores / GPUs are becoming common
• Trend towards the cloud:
  • Use storage/computing/services from a provider (how many of you prefer gmail over a cse/osu account for e-mail?)
  • Utility model of computation
  • Need for high-level APIs and adaptation
My Research Group
Data-intensive theme at multiple levels:
• Parallel programming models
• Multi-cores and accelerators
• Adaptive middleware
• Scientific data management / workflows
• Deep web integration and analysis
Personnel
• Currently: 10 PhD students, 4 MS thesis students
• 7 graduated PhDs between 2005 and 2008
This Talk
• Parallel programming API for data-intensive computing
  • An alternate API and system for Google's Map-Reduce
  • An actual comparison
• Data-intensive computing on accelerators
  • Compilation for GPUs
• Overview of other topics
  • Scientific data management
  • Adaptive and streaming middleware
  • Deep web
Map-Reduce
• Simple API for (data-intensive) parallel programming
• The computation (illustrated below):
  • Apply map to each input data element, producing (key, value) pair(s)
  • Sort the pairs by key
  • Apply reduce to the set of values for each distinct key
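To make the structure concrete, here is a word-count version of map / sort-by-key / reduce as a minimal single-process C++ sketch (illustrative only; this is not Hadoop's API):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: one input data element -> (key, value) pairs
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.push_back({w, 1});
    return out;
}

int main() {
    std::vector<std::string> input = {"the quick fox", "the lazy dog"};
    // "sort by key": an ordered map groups all values under each distinct key
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : input)
        for (const auto& kv : map_fn(line))
            groups[kv.first].push_back(kv.second);
    // reduce: fold the value list of each distinct key
    for (const auto& g : groups) {
        int sum = 0;
        for (int v : g.second) sum += v;
        std::cout << g.first << " " << sum << "\n";
    }
    return 0;
}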
Map-Reduce Execution
Map-Reduce: Positives and Questions
• Positives:
  • Simple API, based on functional languages; very easy to learn
  • Support for fault tolerance, important for very large-scale clusters
• Questions:
  • Performance? Comparison with other approaches
  • Suitability for different classes of applications?
Classes of Data-Intensive Applications
• Many different types of applications
  • Data-center applications: data scans, sorting, indexing
  • More "compute-intensive" data-intensive applications: machine learning, data mining, NLP; Map-Reduce / Hadoop is being widely used for this class
  • Standard database operations: a SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
• What is Map-Reduce suitable for? What are the alternatives? Are MPI/OpenMP/Pthreads too low level?
Hadoop Implementation
HDFS
• Almost GFS, but with no file updates
• Cannot be directly mounted by an existing operating system
• Fault tolerance
• Components: Name Node, Job Tracker, Task Tracker
FREERIDE: Goals
• Framework for Rapid Implementation of Data mining Engines
• The ability to rapidly prototype a high-performance mining implementation:
  • Distributed-memory parallelization
  • Shared-memory parallelization
  • Ability to process disk-resident datasets
• Only modest modifications to a sequential implementation for all three
• Developed 2001-2003 at Ohio State
FREERIDE – Technical Basis
• Popular data mining algorithms share a common canonical loop
• Generalized reduction can be used as the basis for supporting a common API (a concrete instantiation follows below)
• Demonstrated for popular data mining and scientific data processing applications

while( ) {                      /* outer sequential loop */
  forall (data instances d) {   /* reduction loop */
    I = process(d)              /* which element(s) of the reduction object d updates */
    R(I) = R(I) op f(d)         /* op: associative, commutative operation */
  }
  ......
}
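As a concrete illustration (not FREERIDE's actual interface), here is the canonical loop instantiated for one k-means iteration in plain C++, with the reduction object holding per-cluster sums and counts:

#include <cfloat>
#include <cstddef>

constexpr int K = 1000;   // number of clusters (as in the experiments below)
constexpr int DIM = 3;    // dimensionality

// Reduction object R: per-cluster coordinate sums and point counts.
struct ReductionObject {
    double sum[K][DIM] = {};
    long   count[K] = {};
};

void kmeans_iteration(const double (*data)[DIM], std::size_t n,
                      const double (*centers)[DIM], ReductionObject& R) {
    for (std::size_t d = 0; d < n; ++d) {        // forall (data instances d)
        // process(d): find the closest cluster center
        int best = 0;
        double best_dist = DBL_MAX;
        for (int k = 0; k < K; ++k) {
            double dist = 0;
            for (int j = 0; j < DIM; ++j) {
                double diff = data[d][j] - centers[k][j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = k; }
        }
        // R(I) = R(I) op f(d): accumulate into the reduction object
        for (int j = 0; j < DIM; ++j) R.sum[best][j] += data[d][j];
        R.count[best] += 1;
    }
}

Since all updates flow through R via an associative, commutative op, per-thread or per-node copies of R can simply be merged at the end.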
Comparing Processing Structure
• Similar, but with subtle differences
Observations on Processing Structure
• Map-Reduce is based on a functional idea: it does not maintain state, which can lead to sorting overheads
• The FREERIDE API is based on a programmer-managed reduction object (see the sketch below): not as `clean', but it avoids sorting, helps shared-memory parallelization, and enables better fault recovery
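A hypothetical sketch of what a programmer-managed reduction-object API can look like (illustrative C++, not FREERIDE's exact interface):

#include <cstddef>
#include <vector>

// Hypothetical reduction-object interface in the FREERIDE style.
// The programmer owns the accumulator, so no (key, value) sorting occurs.
template <typename Elem>
class Reduction {
public:
    virtual void local_reduce(const Elem& e) = 0;      // R(I) = R(I) op f(e)
    virtual void combine(const Reduction& other) = 0;  // merge another copy of R
    virtual ~Reduction() = default;
};

// Example: a histogram as a reduction object.
class Histogram : public Reduction<int> {
public:
    explicit Histogram(std::size_t bins) : counts_(bins, 0) {}
    void local_reduce(const int& e) override {
        counts_[static_cast<std::size_t>(e) % counts_.size()] += 1;
    }
    void combine(const Reduction<int>& other) override {
        const auto& h = static_cast<const Histogram&>(other);
        for (std::size_t i = 0; i < counts_.size(); ++i) counts_[i] += h.counts_[i];
    }
private:
    std::vector<long> counts_;
};

Because updates address the accumulator directly, no (key, value) pairs are materialized or sorted; merging copies via combine() is all that distributed- or shared-memory parallelization requires.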
Experiment Design
• Tuning parameters in Hadoop: input split size, max number of concurrent map tasks per node, number of reduce tasks
• For the comparison, we used four applications
  • Data mining: k-means, kNN, Apriori
  • Simple data-scan application: wordcount
• Experiments on a multi-core cluster: 8 cores per node (8 map tasks)
Results – Data Mining
[Figure: k-means, varying # of nodes (4, 8, 16); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, K: 1000, Dim: 3]
Results – Data Mining (II)
[Figure: Apriori, varying # of nodes (4, 8, 16); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 900 MB, support level: 3%, confidence level: 9%]
Results – Data Mining (III)
[Figure: kNN, varying # of nodes (4, 8, 16); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, K: 1000, Dim: 3]
Results – Data-center-like Application
[Figure: wordcount, varying # of nodes (4, 8, 16); y-axis: total time (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB]
Scalability Comparison
[Figure: k-means, varying dataset size (800 MB, 1.6 GB, 3.2 GB, 6.4 GB, 12.8 GB); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. K: 100, Dim: 3, on 8 nodes]
Scalability – Word Count
[Figure: wordcount, varying dataset size (800 MB, 1.6 GB, 3.2 GB, 6.4 GB, 12.8 GB); y-axis: total time (sec); Hadoop vs. FREERIDE. On 8 nodes]
Overhead Breakdown
• Four components affect Hadoop's performance: initialization cost, I/O time, sorting/grouping/shuffling, and computation time
• What is the relative impact of each? An experiment with k-means
Analysis with K-means
[Figure: varying the number of k-means clusters k (50, 200, 1000); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, Dim: 3, on 16 nodes]
Analysis with K-means (II)
[Figure: varying the number of dimensions (3, 6, 48, 96, 192); y-axis: avg. time per iteration (sec); Hadoop vs. FREERIDE. Dataset: 6.4 GB, K: 1000, on 16 nodes]
Observations
• Initialization costs and the limited I/O bandwidth of HDFS are significant in Hadoop
• Sorting is also an important limiting factor for Hadoop's performance
This Talk
• Parallel programming API for data-intensive computing
  • An alternate API and system for Google's Map-Reduce
  • An actual comparison
• Data-intensive computing on accelerators
  • Compilation for GPUs
• Overview of other topics
  • Scientific data management
  • Adaptive and streaming middleware
  • Deep web
Background - GPU Computing
• Many-core architectures/Accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming
CUDA Programming
• Significant improvement over the use of graphics libraries
• But (see the example below):
  • Need detailed knowledge of the GPU architecture and a new language
  • Must specify the grid configuration
  • Deal with memory allocation and movement
  • Explicit management of the memory hierarchy
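Even a trivial element-wise computation illustrates this: the programmer must allocate device memory, move data both ways, and choose the grid configuration by hand (a minimal CUDA sketch):

#include <cuda_runtime.h>

// Trivial kernel: scale an array in place.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Explicit allocation and host-to-device copy.
    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Explicit grid configuration: the programmer picks block and grid sizes.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, 2.0f, n);

    // Explicit device-to-host copy of the results.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    delete[] host;
    return 0;
}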
Parallel Data Mining
• Common structure of data mining applications (FREERIDE):

/* outer sequential loop */
while( ) {
  /* reduction loop */
  foreach (element e) {
    (i, val) = process(e);
    Reduc(i) = Reduc(i) op val;
  }
}
Porting on GPUs
• High-level parallelization is straightforward
• The details matter:
  • Data movement
  • Impact of thread count on reduction time
  • Use of shared memory
Architecture of the System
[Figure: system architecture]
• User input:
  • Variable information: the variables to be used in the reduction function, and the values of each variable or sizes of arrays
  • A sequential reduction function
  • Optional functions (initialization function, combination function, ...)
• Code analyzer (in LLVM): a variable analyzer and a code generator, which extract variable access patterns and combination operations
• Output: a host program (grid configuration and kernel invocation) plus kernel functions, compiled into an executable
Analysis of Sequential Code
• Get the access features of each variable (illustrated below)
• Determine the data to be replicated
• Get the operator for global combination
• Identify variables for shared memory
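For a k-means-style reduction function, the analysis might classify variables roughly as follows (hypothetical annotations for illustration; not the tool's actual output):

// Hypothetical classification of variables in a reduction function:
void kmeans_reduce(const float* point,   // read-only stream: kept in device memory
                   const float* centers, // read-only, shared by all threads:
                                         //   candidate for shared memory
                   float* reduc)         // updated with '+': replicated per thread,
                                         //   copies combined with '+' at the end
{
    /* ... reduction body ... */
}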
Memory Allocation and Copy
[Figure: per-thread copies of the reduction object laid out across threads T0 ... T63]
• Need a copy of the reduction object for each thread
• Copy the updates back to host memory after the kernel reduction function returns
Generating CUDA Code and C++/C Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
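Put together, the generated code might take roughly this shape: each thread updates a private copy of the reduction object, and a second kernel performs the global combination (an illustrative sketch, not the system's actual output):

#include <cuda_runtime.h>

#define THREADS 64
#define K 8   // size of the reduction object (illustrative)

// Kernel reduction: each thread updates its own copy of the
// reduction object, so no synchronization is needed per update.
__global__ void reduce_kernel(const float* data, int n, float* reduc) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float* my_copy = reduc + tid * K;   // this thread's private copy
    for (int i = tid; i < n; i += nthreads) {
        int idx = ((int)data[i]) % K;   // process(e): pick the accumulator
        my_copy[idx] += data[i];        // Reduc(i) = Reduc(i) op val
    }
}

// Global combination: fold all per-thread copies into copy 0.
__global__ void combine_kernel(float* reduc, int nthreads) {
    int k = threadIdx.x;                // one thread per accumulator
    if (k >= K) return;
    for (int t = 1; t < nthreads; ++t)
        reduc[k] += reduc[t * K + k];
}

int main() {
    const int n = 1 << 20, blocks = 4, nthreads = blocks * THREADS;
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d_data, *d_reduc;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_reduc, nthreads * K * sizeof(float));
    cudaMemcpy(d_data, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_reduc, 0, nthreads * K * sizeof(float));

    reduce_kernel<<<blocks, THREADS>>>(d_data, n, d_reduc);
    combine_kernel<<<1, K>>>(d_reduc, nthreads);

    float result[K];   // copy the combined reduction object back to the host
    cudaMemcpy(result, d_reduc, K * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_reduc); delete[] h;
    return 0;
}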
Optimizations
• Using shared memory (sketched below)
• Providing user-specified initialization functions and combination functions
• Specifying variables that are allocated only once
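When the per-thread copies are small, they can be kept in shared memory rather than device memory, cutting reduction time considerably. A sketch of the idea, with illustrative sizes (64 threads per block, 8 accumulators per thread):

// Per-block reduction copies kept in fast shared memory
// (assumes blockDim.x == 64; 64 * 8 floats = 2 KB per block).
__global__ void reduce_shared(const float* data, int n, float* out) {
    __shared__ float s_reduc[64 * 8];
    float* my_copy = s_reduc + threadIdx.x * 8;
    for (int k = 0; k < 8; ++k) my_copy[k] = 0.0f;   // initialization
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)
        my_copy[((int)data[i]) % 8] += data[i];
    __syncthreads();

    // Combine within the block; the host (or another kernel)
    // then combines the per-block partial results in `out`.
    if (threadIdx.x < 8) {
        float sum = 0.0f;
        for (int t = 0; t < 64; ++t) sum += s_reduc[t * 8 + threadIdx.x];
        out[blockIdx.x * 8 + threadIdx.x] = sum;
    }
}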
Applications
• K-means clustering
• EM clustering
• PCA
Experimental Results
[Figures: speedup of k-means, speedup of EM, speedup of PCA]
Deep Web Data Integration
• The emergence of the deep web
  • The deep web is huge
  • Different from the surface web
• Challenges for integration:
  • Not accessible through search engines
  • Inter-dependences among deep web sources
Our Contributions
• Structured query processing on the deep web
• Schema mining/matching
• Analysis of deep web data sources
Adaptive Middleware
• Enable time-critical event handling to achieve the maximum benefit, while satisfying the time/budget constraint
• Be compatible with grid and web services
• Enable easy deployment and management with minimum human intervention
• Be usable in a heterogeneous distributed environment
HASTE Middleware Design (ICAC 2008)
[Figure: HASTE middleware architecture]
• Application layer: autonomic service components (application services 1-5, each with an agent/controller). Inputs: application code, a configuration file, a benefit function, and the time-critical event
• Service layer: application deployment service; resource allocation service; resource monitoring service (CPU, memory, bandwidth); autonomic adaptation service (system model, estimator). Scheduling efficiency and value estimation connect allocation and adaptation
• OGSA infrastructure (Globus Toolkit 4.0)
Workflow Composition System
Summary
• Growing data is creating new challenges in HPC, grid, and cloud environments
• A number of topics are being addressed
• Many opportunities for involvement: 888 meets Thursdays 5:00 – 6:00, DL 280