TRANSCRIPT (2013-04-23)
Scaling Data Warehousing Applications using GPUs
Sudhakar Yalamanchili
School of Electrical and Computer Engineering, Georgia Institute of Technology
Atlanta, GA 30332
Sponsors: National Science Foundation, LogicBlox Inc., NVIDIA, Intel
Outline
• New Rules
  • Scaling and energy efficiency
  • Data movement costs
  • Thermal issues and processor physics
• Scaling Relational Database Performance with GPUs
  • Optimized primitives
  • Optimization of Data Movement
  • DRAM memory aggregation in clusters
Scaling Computing Performance
(Figure: Cray Titan, an example of heterogeneous computing at scale. Scaling is constrained by thermal limits, energy limits, and data movement costs.)
Moore’s Law
(Figure from wikipedia.org.)
• Performance scaled with number of transistors
• Dennard scaling: power scaled down with feature size (constant power density)
Goal: Sustain Performance Scaling
From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
Post-Dennard Architecture Performance Scaling

Perf [ops/s] = Power [W] × Efficiency [ops/joule]
(W. J. Dally, Keynote IITC 2012)

Power is limited by power delivery and cooling; the efficiency term is increasingly set by the cost of data movement. For an operation on three operands of 64 bits each:

Energy = #bits × distance (mm) × energy per bit-mm

Moving 1 bit of data 1 mm at 22 nm costs ~1 pJ.¹

¹ HiPEAC Roadmap 2012 (2012-9-hipeacvision.pdf)
You can hide latency but you cannot hide energy!
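To make the formula concrete, here is a small back-of-the-envelope sketch; the ~1 pJ/bit-mm figure is the slide's 22 nm number, and the 10 mm distance is an illustrative assumption, not a value from the talk.

```
#include <cstdio>

// Energy (pJ) to move `bits` bits a distance of `dist_mm` millimeters,
// at a given cost per bit-mm (~1 pJ at 22 nm per the HiPEAC figure above).
double data_movement_energy_pj(double bits, double dist_mm, double pj_per_bit_mm) {
    return bits * dist_mm * pj_per_bit_mm;
}

int main() {
    // Three 64-bit operands moved an assumed 10 mm across the die.
    double e = data_movement_energy_pj(3 * 64, 10.0, 1.0);
    std::printf("3 x 64 bits over 10 mm: ~%.0f pJ\n", e);  // ~1920 pJ
    return 0;
}
```

Under these assumptions, moving the operands costs on the order of nanojoules, which is why the slides treat data movement rather than arithmetic as the dominant cost.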
Scaling Performance: Cost of Data Movement
• Embedded platforms: goal of 1-100 GOps/W
• Big science, toward exascale: goal of 20 MW/exaflop
• Sustain performance scaling through massive concurrency
• Data movement becomes more expensive than computation
(Figure courtesy of Sandia National Labs: R. Murphy.)
Post-Dennard Architecture Performance Scaling (revisited)

Perf [ops/s] = Power [W] × Efficiency [ops/joule]
(W. J. Dally, Keynote IITC 2012)

Here the efficiency term is split into operator cost + data movement cost, again for three operands × 64 bits/operand:

Energy = #bits × distance (mm) × energy per bit-mm

Specialization → heterogeneity and asymmetry.
Scaling Performance: Simplify, Diversify & Multiply
(Examples: AMD Bulldozer core, ARM A7 core (arm.com), NVIDIA Fermi)
• Extracting single-thread performance costs energy (still important!)
  • Out-of-order execution
  • Branch prediction
  • Scheduling, etc.
• Multithreaded performance exploits parallelism
  • Simpler pipelines
  • Core scaling
Asymmetry vs. Heterogeneity
• Performance asymmetry (uniform ISA)
  • Multiple voltage and frequency islands
  • Different memory technologies (STT-RAM, PCM, Flash)
• Functional asymmetry (uniform ISA)
  • Complex cores and simple cores
  • Shared instruction set architecture (ISA): subset ISA, distinct microarchitectures
  • Fault-and-migrate model of operation¹
• Heterogeneity (multi-ISA)
  • Multi-ISA
  • Different microarchitectures
  • Memory & interconnect hierarchy

(Figure: tiled many-core layouts with memory controllers (MC).)

¹ Li, T., et al., "Operating system support for shared ISA asymmetric multi-core architectures," in WIOSCA, 2008.
The Challenge: The Memory System
(Images: Xeon Phi and Hybrid Memory Cube.)
• What should the memory hierarchy look like?
• Parallelism vs. locality tradeoffs
• Minimize data movement → Processor in Memory?
Thermal Capacity
• Exploit package physics
  • Temperature changes on the order of milliseconds
  • Workload behaviors change on the order of microseconds
• Impact on device behavior?

(Figure: instructions/cycle over time for a time-varying workload, against the package's thermal capacity; figures from psdgraphics.com and wikipedia.org.)

Power-Performance Management!
Summary: New Performance Scaling Rules
• Energy efficiency: scale performance by scaling energy efficiency → diversify → programming models?
• Parallelism: scale the number of cores rather than the performance of a single core → multiply → programming models
• Data movement: the energy cost of data movement is higher than the energy cost of computation → communication-centric design
• Physics capacity: scaling is limited by thermal/power capacity → power/thermal management
Outline
• New Rules
  • Scaling and energy efficiency
  • Data movement costs
  • Thermal issues and processor physics
• Scaling Relational Database Performance with GPUs
  • Optimized primitives
  • Optimization of Data Movement
  • DRAM memory aggregation in clusters
System Diversity
Hardware diversity is mainstream: Keeneland system (GPUs), Amazon EC2 GPU instances, Cray Titan (GPUs), mobile platforms (DSPs, GPUs).
System Model
(Figure: system model, spanning domain-specific languages, programming models, large graphs, compiler and run-time support, data movement optimizations, system abstractions (e.g., GAS, virtual DIMMs), cluster-wide hardware consolidation, and hardware customization.)
Databases: Not a Traditional Domain of GPUs
……
LargeQty(p) <-
Qty(q),
q > 1000.
……
Relational Computations Over Massive Data Sets
Data Warehousing Applications on GPUs
• The Opportunity
  • Significant potential data parallelism
  • If data fits in GPU memory, 2x to 27x speedups have been shown¹
• The Challenge
  • Need to process 1-50 TB of data²
  • 15-90% of the total time is spent moving data between CPU and GPU
  • Fine-grained computation
1 B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.
2 Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.
Red Fox: Goal and Status
• Goal
  • Build a compiler/runtime framework to accelerate DatalogLB queries using GPUs
  • Understand the good, the bad, and the ugly!
• Status
  • Capable of running all of the TPC-H queries on GPUs
  • Requires that data fit in GPU memory → move to Fusion (integrated CPU-GPU) parts
  • Focus to date: correctness and performance
  • Moving forward → performance and scale
Haicheng Wu
Domain Specific Compilation: Red Fox
(Figure: Red Fox compilation flow; joint with LogicBlox Inc.)
• Language front-end: DatalogLB queries → LogicBlox front-end (source-to-source optimization) → query plan
• Translation layer: Kernel Weaver and IR optimization → kernel IR
• Machine-neutral back-end: RA-to-PTX (nvcc + RA-Lib), drawing on the RA primitives and feeding the Red Fox runtime (Red Fox RT)
• Targeting accelerator clouds to meet the demands of data warehousing applications
• In-core databases
DatalogLB Query and Front-end
number(n) -> int32(n).
number(0).
// other number facts elided for brevity
next(n,m) -> int32(n), int32(m).
next(0,1).
// other next facts elided for brevity

even(n) -> int32(n).
even(0).
even(n) <- number(n), next(m,n), odd(m).

odd(n) -> int32(n).
odd(n) <- next(m,n), even(m).
Example DatalogLB Query
Recursive Definition
Front-end
BB1: COPY(pre_odd, odd){PTX}
     COPY(pre_even, even){PTX}
     JOIN_PARTITION(next, even){PTX}
     JOIN_COMPUTE(next, even){PTX}
     JOIN_GATHER(temp_odd){PTX}
     PROJECT(odd, temp_odd){PTX}

BB2: PROJECT(m_1, next){PTX}
     JOIN_PARTITION(number, m_1){PTX}
     JOIN_COMPUTE(number, m_1){PTX}
     JOIN_GATHER(temp_j_1){PTX}
     PROJECT(j_1, temp_j_1){PTX}
     JOIN_PARTITION(j_1, odd){PTX}
     JOIN_COMPUTE(j_1, odd){PTX}
     JOIN_GATHER(temp_even){PTX}
     PROJECT(even, temp_even){PTX}
BB3: pre_odd == odd?
BB4: pre_even == even?
BB5: HALT
(The Y/N edges loop back to BB1 until both fixed-point checks succeed, after which control reaches BB5 and halts.)
Example Harmony IR (CFG)
Research Thrusts
• I: Optimized implementations of primitives
  • Relational algebra
  • Data management within the GPU memory hierarchy
• II: Data movement optimizations
  • Between hosts and (local or remote) accelerators
  • Within an accelerator
• III: In-core processing
  • Cluster-wide memory aggregation techniques
  • Change the ratio of host memory size to accelerator memory size
Primitives
§ Map operators to GPU implementations
§ Data structure: weakly sorted arrays of densely packed tuples
§ Tuple fields can be integer, float, datetime, string, etc.

From RA library: PROJECT, PRODUCT, SELECT, JOIN
From Thrust library: SORT, UNIQUE, AGGREGATION, SET family, ...

(Figure: example tuple with fields id, price, and tax packed into 4-, 8-, and 16-byte fields, zero-padded and split into key and value parts.)
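As an illustrative sketch (not the Red Fox RA library code), a SELECT over densely packed key-value tuples can be expressed with Thrust's stream compaction; the tuple layout, field widths, and the q > 1000 predicate below are hypothetical.

```
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Hypothetical densely packed tuple: 4-byte key, 8-byte value.
struct Tuple {
    int      key;    // e.g., id
    long long value; // e.g., quantity
};

// Predicate corresponding to a rule body such as q > 1000.
struct GreaterThan1000 {
    __host__ __device__ bool operator()(const Tuple& t) const {
        return t.value > 1000;
    }
};

// SELECT: keep only the tuples satisfying the predicate.
thrust::device_vector<Tuple> select_gt_1000(const thrust::device_vector<Tuple>& in) {
    thrust::device_vector<Tuple> out(in.size());
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(), GreaterThan1000());
    out.resize(end - out.begin());
    return out;
}
```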
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.
RA Primitives Library: Multistage Algorithms
§ Hybrid multi-stage algorithm (partition, compute, gather) to trade off computational complexity against memory access efficiency
§ Strategy: increase core utilization until the computation becomes memory bound, then achieve near-peak utilization of the memory interface
Example of SELECT
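A minimal sketch of the staged structure, under simplifying assumptions (it is not the PPoPP algorithm cited above): a compute stage flags matching tuples, a scan turns the flags into packed output offsets, and a gather stage writes the matches.

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Stage "compute": each thread flags whether its tuple matches the predicate.
__global__ void flag_matches(const int* keys, int* flags, int n, int threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (keys[i] > threshold) ? 1 : 0;
}

// Stage "gather": scatter matching tuples to their scanned output positions.
__global__ void gather_matches(const int* keys, const int* flags,
                               const int* offsets, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = keys[i];
}

// SELECT keys > threshold over a packed column of keys; returns result count.
int staged_select(const thrust::device_vector<int>& keys,
                  thrust::device_vector<int>& out, int threshold) {
    int n = (int)keys.size();
    if (n == 0) { out.clear(); return 0; }

    thrust::device_vector<int> flags(n), offsets(n);
    int threads = 256, blocks = (n + threads - 1) / threads;

    flag_matches<<<blocks, threads>>>(thrust::raw_pointer_cast(keys.data()),
                                      thrust::raw_pointer_cast(flags.data()),
                                      n, threshold);

    // The scan converts per-tuple match flags into densely packed output offsets.
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());
    int count = offsets.back() + flags.back();

    out.resize(count);
    gather_matches<<<blocks, threads>>>(thrust::raw_pointer_cast(keys.data()),
                                        thrust::raw_pointer_cast(flags.data()),
                                        thrust::raw_pointer_cast(offsets.data()),
                                        thrust::raw_pointer_cast(out.data()), n);
    cudaDeviceSynchronize();
    return count;
}
```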
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.
RA Primitives Library: Example of JOIN
• Most complicated primitive, JOIN: 57%-72% of peak performance
• Most efficient primitives, PRODUCT, PROJECT, and SELECT: 86%-92% of peak performance
Measured on a Tesla C2050 with random integers as inputs.
Research Thrusts
• I: Optimized implementations of primitives
  • Relational algebra
  • Data management within the GPU memory hierarchy
• II: Data movement optimizations
  • Between hosts and (local or remote) accelerators
  • Within an accelerator
• III: In-core processing
  • Cluster-wide memory aggregation techniques
  • Change the ratio of host memory size to accelerator memory size
Data Movement in Kernel Execution
(Figure: data movement in kernel execution: ① input, ② execute, ③ result; ~250 GB/s; threads are grouped into a thread block or cooperative thread array (CTA).)
Kernel Fusion: A Data Movement Optimization
• Increase the granularity of kernel computation
• Reduce data movement throughout the memory hierarchy
• Inspired by loop fusion
• Compile-time automation; the input is an optimized query plan
Kernel Weaving and Fusion
Interweaving and fusing individual stages (CUDA kernels); registers or shared memory hold the temporary results.
Kernel Weaver: Major Benefits
• Reduce data footprint
  • Reduction in accesses to global memory
  • Access to common data across kernels improves temporal locality
  • Reduction in PCIe transfers
• Expand the optimization scope of the compiler
  • Data re-use
  • Increased textual scope for the optimizers

(Figure: kernel A (stages A1, A2, A3) and kernel B communicate through a temporary in memory; the fused kernel A,B runs all stages together and produces the result directly.)

* H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In MICRO, 2012.
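The benefit is easiest to see in a minimal sketch (hypothetical element-wise SELECT and arithmetic stages, not Red Fox's generated code): unfused, the intermediate result round-trips through global memory; fused, it stays in a register.

```
// Unfused: two kernels, the intermediate `tmp` lives in global memory.
__global__ void select_gt(const int* in, int* tmp, int n, int threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = (in[i] > threshold) ? in[i] : 0;
}
__global__ void scale(const int* tmp, int* out, int n, int factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] * factor;
}

// Fused: one kernel, the intermediate stays in a register, so the extra
// global-memory write and read (and any staging of tmp) disappear.
__global__ void select_gt_then_scale(const int* in, int* out, int n,
                                     int threshold, int factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int t = (in[i] > threshold) ? in[i] : 0;  // former tmp[i], now a register
        out[i] = t * factor;
    }
}
```

The trade-off is that a fused kernel holds more state per thread, which is the resource and occupancy effect quantified two slides below.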
Kernel Weaver: Micro-benchmarks
(Figure: speedup of fused vs. not fused for five operator combinations (a-e): 7.89x, 1.42x, 1.58x, 1.11x, and 2.45x.)
Average 2.89x speedup from fusing these operator combinations, measured on a Tesla C2070.
Resource Usage & Occupancy
Individual primitives:
  Primitive   PTX regs   Shared mem (bytes)   Occupancy (%)
  PROJECT     11         0                    100
  SELECT      22         3848                 88
  JOIN        47         13580                38
  +/-         10         0                    100
  Multiply    13         0                    100

After kernel fusion (micro-benchmarks a-e):
  Benchmark   PTX regs   Shared mem (bytes)   Occupancy (%)
  (a)         22         2308                 88
  (b)         55         23560                33
  (c)         62         23048                17
  (d)         30         4612                 67
  (e)         27         0                    75

• Kernel fusion may increase resource usage and thus decrease occupancy
• Retains the other benefits
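For intuition, a rough occupancy estimate for Fermi-class GPUs can be computed from per-thread registers and per-block shared memory; the SM limits below are the published compute capability 2.0 values, the 256-thread block size is a hypothetical choice, and the calculation ignores allocation granularity, so it is only approximate.

```
#include <cstdio>
#include <algorithm>

// Rough Fermi (compute capability 2.0) occupancy estimate.
// Ignores register/shared-memory allocation granularity.
float estimate_occupancy(int regs_per_thread, int smem_per_block,
                         int threads_per_block) {
    const int max_threads_per_sm = 1536;
    const int max_blocks_per_sm  = 8;
    const int regs_per_sm        = 32 * 1024;   // 32K 32-bit registers
    const int smem_per_sm        = 48 * 1024;   // 48 KB shared memory

    int by_threads = max_threads_per_sm / threads_per_block;
    int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem    = smem_per_block ? smem_per_sm / smem_per_block : max_blocks_per_sm;
    int blocks     = std::min(std::min(by_threads, by_regs),
                              std::min(by_smem, max_blocks_per_sm));

    return 100.0f * blocks * threads_per_block / max_threads_per_sm;
}

int main() {
    // Example: a JOIN-like kernel (47 regs/thread, 13580 B shared memory)
    // with a hypothetical 256-thread block.
    std::printf("estimated occupancy: %.0f%%\n", estimate_occupancy(47, 13580, 256));
    return 0;
}
```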
TPC-H Queries
• A popular decision-support benchmark suite
• 22 queries analyzing data from 6 large tables
• A Scale Factor (SF) parameter controls database size; Red Fox can run SF=1 for all 22 queries
• A GPU benchmark suite is being generated (Summer 2013)
Experimental Environment
CPU Xeon X5560 @ 2.80GHz
GPU 1 Tesla C2075 (6GB GDDR5 memory)
OS Ubuntu 10.04 Server
GCC 4.6.1
NVCC 4.2
Thrust 1.5.2
TPC-H Performance (SF = 1)
• All 22 queries take 67.40 seconds in total
• Compared with a MySQL implementation on a 4-node CPU cluster*, Red Fox is 59x faster on average
*Ngamsuriyaroj, Pornpattana, “Performance Evaluation of TPC-H Queries on MySQL Cluster.” WAINA 2010.
Example: Q22. Input size: 192 MB; operators: 92; CUDA kernels: 205.
(Figure: query plan for Q22.)
Where is the time spent?
(Figure: breakdown of execution time across project, select, product, join, diff, sort, unique, merge, agg, arith, conv, others, copy, and PCIe; the two largest shares are 38.94% and 48.82%.)
• Most of the time is spent in JOIN and SORT
• PCIe transfer time is less than 10%
• PROJECT is used most frequently, but takes less than 5% of the time
Future Improvements
• Optimized query plan
  • Reduce tuple size
  • Common operator reduction
  • Reorder operators
  • ...
• More RA implementations
  • Hash join
  • Radix sort
  • ...
• Pipeline the execution
• Expect 10x-100x speedup from the above techniques
• Increase the scale factor → Oncilla
Research Thrusts
• I: Optimized implementations of primitives
  • Relational algebra
  • Data management within the GPU memory hierarchy
• II: Data movement optimizations
  • Between hosts and (local or remote) accelerators
  • Within an accelerator
• III: In-core processing
  • Cluster-wide memory aggregation techniques
  • Change the ratio of host memory size to accelerator memory size
III. In-Core Processing
• Cluster-based memory aggregation
  • Hardware support for a global non-coherent physical address space
• Change the ratio of host memory to GPU memory
• Joint project with the University of Heidelberg
(Figure: four cluster nodes, each with a multi-core CPU (2-16 cores), ~128 GB of main memory, and a GPU (~2K cores) with ~6 GB of GPU memory.)
Oncilla: Fabrics for Accelerator Clouds
• Goal: efficient memory aggregation for accelerators in data centers
• Solution: use Global Address Spaces (GAS) and commodity fabrics (HT, QPI, PCIe, 10GE, IB)
• Support in-core databases using software from the Red Fox project
Jeff Young
Oncilla – TPC-H Microbenchmarks (Preliminary Results)
(Figure: preliminary results comparing using disk vs. using memory aggregation.)
EXTOLL Network Adapter and Fabric
• Provides RDMA transfers (RMA), MMIO-based put/get operations for GAS (SMFU), and support for efficient small messages (VELO)
• Current V6 prototype: 300 ns latency per hop, 24 Gbps bandwidth, very low overhead (64 B per packet) [1]
• ASIC projected to have a bandwidth of 8-12 GB/s

[1] H. Fröning, On Achieving High Message Rates, CCGRID 2013.
Courtesy, Prof. H. Fröning, the University of Heidelberg
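As a rough sketch of how a global, non-coherent address space can be used for memory aggregation (the gas_addr/gas_get names below are hypothetical placeholders, not the EXTOLL SMFU or Oncilla APIs): data in a remote node's exported DRAM is pulled into pinned local host memory with a one-sided get and then staged to the GPU.

```
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>
#include <cstddef>

// Hypothetical global-address-space handle and one-sided get (placeholders).
struct gas_addr { int node; uint64_t offset; };

// Stub for illustration only: a real implementation would issue a one-sided
// read over the fabric; here a local buffer stands in for remote DRAM.
static uint8_t* g_remote_stub = nullptr;
void gas_get(void* local_dst, gas_addr src, size_t bytes) {
    std::memcpy(local_dst, g_remote_stub + src.offset, bytes);
}

// Pull a partition of a remote relation into local pinned memory, then
// stage it to the GPU for the RA primitives to consume.
void stage_remote_partition(void* d_dst, gas_addr src, size_t bytes) {
    void* h_buf = nullptr;
    cudaMallocHost(&h_buf, bytes);              // pinned for fast PCIe copies
    gas_get(h_buf, src, bytes);                 // one-sided read over the fabric
    cudaMemcpy(d_dst, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(h_buf);
}
```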
Oncilla Infrastructure
• Two-node cluster prototypes
  • 12-16 GB of DRAM
  • NVIDIA C2070 GPUs
• EXTOLL cluster
  • Network adapters and fabric developed by the University of Heidelberg, Germany
  • AIC custom blades
  • Galibier Virtex 6 prototypes
• IB cluster based on KIDS
  • Mellanox QDR IB adapter
  • Dual-socket Intel Xeon X5660
(Closing figure: scaling rules spanning applications, system software, architecture, and technology.)

Thank You. Questions?