TRANSCRIPT (2013-04-23)
Scaling Data Warehousing Applications using GPUs
Sudhakar Yalamanchili
School of Electrical and Computer Engineering, Georgia Institute of Technology
Atlanta, GA 30332
Sponsors: National Science Foundation, LogicBlox Inc., NVIDIA, Intel
Outline
• New Rules
  • Scaling and energy efficiency
  • Data movement costs
  • Thermal issues and processor physics
• Scaling Relational Database Performance with GPUs
  • Optimized primitives
  • Optimization of Data Movement
  • DRAM memory aggregation in clusters
Scaling Computing Performance
(Figure: Cray Titan, an example of heterogeneous computing at scale. Scaling is constrained by thermal limits, energy limits, and data movement costs.)
Moore’s Law
(Figure from wikipedia.org.)
• Performance scaled with number of transistors
• Dennard scaling: power scaled down with feature size (constant power density)
Goal: Sustain Performance Scaling
From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
Post-Dennard Architecture Performance Scaling

Perf [ops/s] = Power [W] × Efficiency [ops/joule]
(W. J. Dally, Keynote IITC 2012)

Power is limited by power delivery and cooling; the efficiency term is increasingly set by the cost of data movement. For an operation on three operands of 64 bits each:

Energy = #bits × distance (mm) × energy per bit-mm

Moving 1 bit of data 1 mm at 22 nm costs ~1 pJ.¹

¹ HiPEAC Roadmap 2012 (2012-9-hipeacvision.pdf)
You can hide latency but you cannot hide energy!
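To make the formula concrete, here is a small back-of-the-envelope sketch; the ~1 pJ/bit-mm figure is the slide's 22 nm number, and the 10 mm distance is an illustrative assumption, not a value from the talk.

```
#include <cstdio>

// Energy (pJ) to move `bits` bits a distance of `dist_mm` millimeters,
// at a given cost per bit-mm (~1 pJ at 22 nm per the HiPEAC figure above).
double data_movement_energy_pj(double bits, double dist_mm, double pj_per_bit_mm) {
    return bits * dist_mm * pj_per_bit_mm;
}

int main() {
    // Three 64-bit operands moved an assumed 10 mm across the die.
    double e = data_movement_energy_pj(3 * 64, 10.0, 1.0);
    std::printf("3 x 64 bits over 10 mm: ~%.0f pJ\n", e);  // ~1920 pJ
    return 0;
}
```

Under these assumptions, moving the operands costs on the order of nanojoules, which is why the slides treat data movement rather than arithmetic as the dominant cost.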
Scaling Performance: Cost of Data Movement
• Embedded platforms: goal of 1-100 GOps/W
• Big science, toward exascale: goal of 20 MW/exaflop
• Sustain performance scaling through massive concurrency
• Data movement becomes more expensive than computation
(Figure courtesy of Sandia National Labs: R. Murphy.)
Post-Dennard Architecture Performance Scaling (revisited)

Perf [ops/s] = Power [W] × Efficiency [ops/joule]
(W. J. Dally, Keynote IITC 2012)

Here the efficiency term is split into operator cost + data movement cost, again for three operands × 64 bits/operand:

Energy = #bits × distance (mm) × energy per bit-mm

Specialization → heterogeneity and asymmetry.
Scaling Performance: Simplify, Diversify & Multiply
(Examples: AMD Bulldozer core, ARM A7 core (arm.com), NVIDIA Fermi)
• Extracting single-thread performance costs energy (still important!)
  • Out-of-order execution
  • Branch prediction
  • Scheduling, etc.
• Multithreaded performance exploits parallelism
  • Simpler pipelines
  • Core scaling
Asymmetry vs. Heterogeneity
• Performance asymmetry (uniform ISA)
  • Multiple voltage and frequency islands
  • Different memory technologies (STT-RAM, PCM, Flash)
• Functional asymmetry (uniform ISA)
  • Complex cores and simple cores
  • Shared instruction set architecture (ISA): subset ISA, distinct microarchitectures
  • Fault-and-migrate model of operation¹
• Heterogeneity (multi-ISA)
  • Multi-ISA
  • Different microarchitectures
  • Memory & interconnect hierarchy

(Figure: tiled many-core layouts with memory controllers (MC).)

¹ Li, T., et al., "Operating system support for shared ISA asymmetric multi-core architectures," in WIOSCA, 2008.
The Challenge: The Memory System
(Images: Xeon Phi and Hybrid Memory Cube.)
• What should the memory hierarchy look like?
• Parallelism vs. locality tradeoffs
• Minimize data movement → Processor in Memory?
Thermal Capacity
• Exploit package physics
  • Temperature changes on the order of milliseconds
  • Workload behaviors change on the order of microseconds
• Impact on device behavior?

(Figure: instructions/cycle over time for a time-varying workload, against the package's thermal capacity; figures from psdgraphics.com and wikipedia.org.)

Power-Performance Management!
Summary: New Performance Scaling Rules
• Energy efficiency: scale performance by scaling energy efficiency → diversify → programming models?
• Parallelism: scale the number of cores rather than the performance of a single core → multiply → programming models
• Data movement: the energy cost of data movement is higher than the energy cost of computation → communication-centric design
• Physics capacity: scaling is limited by thermal/power capacity → power/thermal management
Outline
• New Rules
  • Scaling and energy efficiency
  • Data movement costs
  • Thermal issues and processor physics
• Scaling Relational Database Performance with GPUs
  • Optimized primitives
  • Optimization of Data Movement
  • DRAM memory aggregation in clusters
System Diversity
Hardware diversity is mainstream: Keeneland system (GPUs), Amazon EC2 GPU instances, Cray Titan (GPUs), mobile platforms (DSPs, GPUs).
System Model
(Figure: system model, spanning domain-specific languages, programming models, large graphs, compiler and run-time support, data movement optimizations, system abstractions (e.g., GAS, virtual DIMMs), cluster-wide hardware consolidation, and hardware customization.)
Databases: Not a Traditional Domain of GPUs
……
LargeQty(p) <-
Qty(q),
q > 1000.
……
Relational Computations Over Massive Data Sets
Data Warehousing Applications on GPUs
• The Opportunity
  • Significant potential data parallelism
  • If data fits in GPU memory, 2x to 27x speedups have been shown¹
• The Challenge
  • Need to process 1-50 TB of data²
  • 15-90% of the total time is spent moving data between CPU and GPU
  • Fine-grained computation
1 B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.
2 Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.
Red Fox: Goal and Status
• Goal
  • Build a compiler/runtime framework to accelerate DatalogLB queries using GPUs
  • Understand the good, the bad, and the ugly!
• Status
  • Capable of running all of the TPC-H queries on GPUs
  • Requires that data fit in GPU memory → move to Fusion (integrated CPU-GPU) parts
  • Focus to date: correctness and performance
  • Moving forward → performance and scale
Haicheng Wu
Domain Specific Compilation: Red Fox
(Figure: Red Fox compilation flow; joint with LogicBlox Inc.)
• Language front-end: DatalogLB queries → LogicBlox front-end (source-to-source optimization) → query plan
• Translation layer: Kernel Weaver and IR optimization → kernel IR
• Machine-neutral back-end: RA-to-PTX (nvcc + RA-Lib), drawing on the RA primitives and feeding the Red Fox runtime (Red Fox RT)
• Targeting accelerator clouds to meet the demands of data warehousing applications
• In-core databases
DatalogLB Query and Front-end
number(n) -> int32(n).
number(0).
// other number facts elided for brevity
next(n,m) -> int32(n), int32(m).
next(0,1).
// other next facts elided for brevity

even(n) -> int32(n).
even(0).
even(n) <- number(n), next(m,n), odd(m).

odd(n) -> int32(n).
odd(n) <- next(m,n), even(m).
Example DatalogLB Query
Recursive Definition
Front-end
BB1: COPY(pre_odd, odd){PTX}
     COPY(pre_even, even){PTX}
     JOIN_PARTITION(next, even){PTX}
     JOIN_COMPUTE(next, even){PTX}
     JOIN_GATHER(temp_odd){PTX}
     PROJECT(odd, temp_odd){PTX}

BB2: PROJECT(m_1, next){PTX}
     JOIN_PARTITION(number, m_1){PTX}
     JOIN_COMPUTE(number, m_1){PTX}
     JOIN_GATHER(temp_j_1){PTX}
     PROJECT(j_1, temp_j_1){PTX}
     JOIN_PARTITION(j_1, odd){PTX}
     JOIN_COMPUTE(j_1, odd){PTX}
     JOIN_GATHER(temp_even){PTX}
     PROJECT(even, temp_even){PTX}
BB3: pre_odd == odd?
BB4: pre_even == even?
BB5: HALT
(The Y/N edges loop back to BB1 until both fixed-point checks succeed, after which control reaches BB5 and halts.)
Example Harmony IR (CFG)
Research Thrusts
• I: Optimized implementations of primitives
  • Relational algebra
  • Data management within the GPU memory hierarchy
• II: Data movement optimizations
  • Between hosts and (local or remote) accelerators
  • Within an accelerator
• III: In-core processing
  • Cluster-wide memory aggregation techniques
  • Change the ratio of host memory size to accelerator memory size
Primitives
§ Map operators to GPU implementations
§ Data structure: weakly sorted arrays of densely packed tuples
§ Tuple fields can be integer, float, datetime, string, etc.

From RA library: PROJECT, PRODUCT, SELECT, JOIN
From Thrust library: SORT, UNIQUE, AGGREGATION, SET family, ...

(Figure: example tuple with fields id, price, and tax packed into 4-, 8-, and 16-byte fields, zero-padded and split into key and value parts.)
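As an illustrative sketch (not the Red Fox RA library code), a SELECT over densely packed key-value tuples can be expressed with Thrust's stream compaction; the tuple layout, field widths, and the q > 1000 predicate below are hypothetical.

```
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Hypothetical densely packed tuple: 4-byte key, 8-byte value.
struct Tuple {
    int      key;    // e.g., id
    long long value; // e.g., quantity
};

// Predicate corresponding to a rule body such as q > 1000.
struct GreaterThan1000 {
    __host__ __device__ bool operator()(const Tuple& t) const {
        return t.value > 1000;
    }
};

// SELECT: keep only the tuples satisfying the predicate.
thrust::device_vector<Tuple> select_gt_1000(const thrust::device_vector<Tuple>& in) {
    thrust::device_vector<Tuple> out(in.size());
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(), GreaterThan1000());
    out.resize(end - out.begin());
    return out;
}
```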
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.
RA Primitives Library: Multistage Algorithms
§ Hybrid multi-stage algorithm (partition, compute, gather) to trade off computational complexity against memory access efficiency
§ Strategy: increase core utilization until the computation becomes memory bound, then achieve near-peak utilization of the memory interface
Example of SELECT
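A minimal sketch of the staged structure, under simplifying assumptions (it is not the PPoPP algorithm cited above): a compute stage flags matching tuples, a scan turns the flags into packed output offsets, and a gather stage writes the matches.

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Stage "compute": each thread flags whether its tuple matches the predicate.
__global__ void flag_matches(const int* keys, int* flags, int n, int threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (keys[i] > threshold) ? 1 : 0;
}

// Stage "gather": scatter matching tuples to their scanned output positions.
__global__ void gather_matches(const int* keys, const int* flags,
                               const int* offsets, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = keys[i];
}

// SELECT keys > threshold over a packed column of keys; returns result count.
int staged_select(const thrust::device_vector<int>& keys,
                  thrust::device_vector<int>& out, int threshold) {
    int n = (int)keys.size();
    if (n == 0) { out.clear(); return 0; }

    thrust::device_vector<int> flags(n), offsets(n);
    int threads = 256, blocks = (n + threads - 1) / threads;

    flag_matches<<<blocks, threads>>>(thrust::raw_pointer_cast(keys.data()),
                                      thrust::raw_pointer_cast(flags.data()),
                                      n, threshold);

    // The scan converts per-tuple match flags into densely packed output offsets.
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());
    int count = offsets.back() + flags.back();

    out.resize(count);
    gather_matches<<<blocks, threads>>>(thrust::raw_pointer_cast(keys.data()),
                                        thrust::raw_pointer_cast(flags.data()),
                                        thrust::raw_pointer_cast(offsets.data()),
                                        thrust::raw_pointer_cast(out.data()), n);
    cudaDeviceSynchronize();
    return count;
}
```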
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.
RA Primitives Library: Example of JOIN
• Most complicated primitive, JOIN: 57%-72% of peak performance
• Most efficient primitives, PRODUCT, PROJECT, and SELECT: 86%-92% of peak performance
Measured on a Tesla C2050 with random integers as inputs.
Research Thrusts
• I: Optimized implementations of primitives
  • Relational algebra
  • Data management within the GPU memory hierarchy
• II: Data movement optimizations
  • Between hosts and (local or remote) accelerators
  • Within an accelerator
• III: In-core processing
  • Cluster-wide memory aggregation techniques
  • Change the ratio of host memory size to accelerator memory size
Data Movement in Kernel Execution
(Figure: data movement in kernel execution: ① input, ② execute, ③ result; ~250 GB/s; threads are grouped into a thread block or cooperative thread array (CTA).)
Kernel Fusion: A Data Movement Optimization
• Increase the granularity of kernel computation
• Reduce data movement throughout the memory hierarchy
• Inspired by loop fusion
• Compile-time automation; the input is an optimized query plan
Kernel Weaving and Fusion
Interweaving and fusing individual stages (CUDA kernels); registers or shared memory hold the temporary results.
Kernel Weaver: Major Benefits
• Reduce data footprint
  • Reduction in accesses to global memory
  • Access to common data across kernels improves temporal locality
  • Reduction in PCIe transfers
• Expand the optimization scope of the compiler
  • Data re-use
  • Increased textual scope for the optimizers

(Figure: kernel A (stages A1, A2, A3) and kernel B communicate through a temporary in memory; the fused kernel A,B runs all stages together and produces the result directly.)

* H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In MICRO, 2012.
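The benefit is easiest to see in a minimal sketch (hypothetical element-wise SELECT and arithmetic stages, not Red Fox's generated code): unfused, the intermediate result round-trips through global memory; fused, it stays in a register.

```
// Unfused: two kernels, the intermediate `tmp` lives in global memory.
__global__ void select_gt(const int* in, int* tmp, int n, int threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = (in[i] > threshold) ? in[i] : 0;
}
__global__ void scale(const int* tmp, int* out, int n, int factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] * factor;
}

// Fused: one kernel, the intermediate stays in a register, so the extra
// global-memory write and read (and any staging of tmp) disappear.
__global__ void select_gt_then_scale(const int* in, int* out, int n,
                                     int threshold, int factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int t = (in[i] > threshold) ? in[i] : 0;  // former tmp[i], now a register
        out[i] = t * factor;
    }
}
```

The trade-off is that a fused kernel holds more state per thread, which is the resource and occupancy effect quantified two slides below.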
Kernel Weaver: Micro-benchmarks
(Figure: speedup of fused vs. not fused for five operator combinations (a-e): 7.89x, 1.42x, 1.58x, 1.11x, and 2.45x.)
Average 2.89x speedup from fusing these operator combinations, measured on a Tesla C2070.
Resource Usage & Occupancy
Individual primitives:
  Primitive   PTX regs   Shared mem (bytes)   Occupancy (%)
  PROJECT     11         0                    100
  SELECT      22         3848                 88
  JOIN        47         13580                38
  +/-         10         0                    100
  Multiply    13         0                    100

After kernel fusion (micro-benchmarks a-e):
  Benchmark   PTX regs   Shared mem (bytes)   Occupancy (%)
  (a)         22         2308                 88
  (b)         55         23560                33
  (c)         62         23048                17
  (d)         30         4612                 67
  (e)         27         0                    75

• Kernel fusion may increase resource usage and thus decrease occupancy
• Retains the other benefits
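For intuition, a rough occupancy estimate for Fermi-class GPUs can be computed from per-thread registers and per-block shared memory; the SM limits below are the published compute capability 2.0 values, the 256-thread block size is a hypothetical choice, and the calculation ignores allocation granularity, so it is only approximate.

```
#include <cstdio>
#include <algorithm>

// Rough Fermi (compute capability 2.0) occupancy estimate.
// Ignores register/shared-memory allocation granularity.
float estimate_occupancy(int regs_per_thread, int smem_per_block,
                         int threads_per_block) {
    const int max_threads_per_sm = 1536;
    const int max_blocks_per_sm  = 8;
    const int regs_per_sm        = 32 * 1024;   // 32K 32-bit registers
    const int smem_per_sm        = 48 * 1024;   // 48 KB shared memory

    int by_threads = max_threads_per_sm / threads_per_block;
    int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem    = smem_per_block ? smem_per_sm / smem_per_block : max_blocks_per_sm;
    int blocks     = std::min(std::min(by_threads, by_regs),
                              std::min(by_smem, max_blocks_per_sm));

    return 100.0f * blocks * threads_per_block / max_threads_per_sm;
}

int main() {
    // Example: a JOIN-like kernel (47 regs/thread, 13580 B shared memory)
    // with a hypothetical 256-thread block.
    std::printf("estimated occupancy: %.0f%%\n", estimate_occupancy(47, 13580, 256));
    return 0;
}
```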
TPC-H Queries
• A popular decision-support benchmark suite
• 22 queries analyzing data from 6 large tables
• A Scale Factor (SF) parameter controls database size; Red Fox can run SF=1 for all 22 queries
• A GPU benchmark suite is being generated (Summer 2013)
Experimental Environment
CPU Xeon X5560 @ 2.80GHz
GPU 1 Tesla C2075 (6GB GDDR5 memory)
OS Ubuntu 10.04 Server
GCC 4.6.1
NVCC 4.2
Thrust 1.5.2
TPC-H Performance (SF = 1)
• All 22 queries take 67.40 seconds in total
• Compared with a MySQL implementation on a 4-node CPU cluster*, Red Fox is 59x faster on average
*Ngamsuriyaroj, Pornpattana, “Performance Evaluation of TPC-H Queries on MySQL Cluster.” WAINA 2010.
Example: Q22. Input size: 192 MB; operators: 92; CUDA kernels: 205.
(Figure: query plan for Q22.)
Where is the time spent?
(Figure: breakdown of execution time across project, select, product, join, diff, sort, unique, merge, agg, arith, conv, others, copy, and PCIe; the two largest shares are 38.94% and 48.82%.)
• Most of the time is spent in JOIN and SORT
• PCIe transfer time is less than 10%
• PROJECT is used most frequently, but takes less than 5% of the time
Future Improvements
• Optimized query plan
  • Reduce tuple size
  • Common operator reduction
  • Reorder operators
  • ...
• More RA implementations
  • Hash join
  • Radix sort
  • ...
• Pipeline the execution
• Expect 10x-100x speedup from the above techniques
• Increase the scale factor → Oncilla
Research Thrusts
• I: Optimized implementations of primitives
  • Relational algebra
  • Data management within the GPU memory hierarchy
• II: Data movement optimizations
  • Between hosts and (local or remote) accelerators
  • Within an accelerator
• III: In-core processing
  • Cluster-wide memory aggregation techniques
  • Change the ratio of host memory size to accelerator memory size
III. In-Core Processing
• Cluster-based memory aggregation
  • Hardware support for a global non-coherent physical address space
• Change the ratio of host memory to GPU memory
• Joint project with the University of Heidelberg
(Figure: four cluster nodes, each with a multi-core CPU (2-16 cores), ~128 GB of main memory, and a GPU (~2K cores) with ~6 GB of GPU memory.)
Oncilla: Fabrics for Accelerator Clouds
• Goal: efficient memory aggregation for accelerators in data centers
• Solution: use Global Address Spaces (GAS) and commodity fabrics (HT, QPI, PCIe, 10GE, IB)
• Support in-core databases using software from the Red Fox project
Jeff Young
Oncilla – TPC-H Microbenchmarks (Preliminary Results)
(Figure: preliminary results comparing using disk vs. using memory aggregation.)
EXTOLL Network Adapter and Fabric
• Provides RDMA transfers (RMA), MMIO-based put/get operations for GAS (SMFU), and support for efficient small messages (VELO)
• Current V6 prototype: 300 ns latency per hop, 24 Gbps bandwidth, very low overhead (64 B per packet) [1]
• ASIC projected to have a bandwidth of 8-12 GB/s

[1] H. Fröning, On Achieving High Message Rates, CCGRID 2013.
Courtesy, Prof. H. Fröning, the University of Heidelberg
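As a rough sketch of how a global, non-coherent address space can be used for memory aggregation (the gas_addr/gas_get names below are hypothetical placeholders, not the EXTOLL SMFU or Oncilla APIs): data in a remote node's exported DRAM is pulled into pinned local host memory with a one-sided get and then staged to the GPU.

```
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>
#include <cstddef>

// Hypothetical global-address-space handle and one-sided get (placeholders).
struct gas_addr { int node; uint64_t offset; };

// Stub for illustration only: a real implementation would issue a one-sided
// read over the fabric; here a local buffer stands in for remote DRAM.
static uint8_t* g_remote_stub = nullptr;
void gas_get(void* local_dst, gas_addr src, size_t bytes) {
    std::memcpy(local_dst, g_remote_stub + src.offset, bytes);
}

// Pull a partition of a remote relation into local pinned memory, then
// stage it to the GPU for the RA primitives to consume.
void stage_remote_partition(void* d_dst, gas_addr src, size_t bytes) {
    void* h_buf = nullptr;
    cudaMallocHost(&h_buf, bytes);              // pinned for fast PCIe copies
    gas_get(h_buf, src, bytes);                 // one-sided read over the fabric
    cudaMemcpy(d_dst, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(h_buf);
}
```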
Oncilla Infrastructure
• Two-node cluster prototypes
  • 12-16 GB of DRAM
  • NVIDIA C2070 GPUs
• EXTOLL cluster
  • Network adapters and fabric developed by the University of Heidelberg, Germany
  • AIC custom blades
  • Galibier Virtex 6 prototypes
• IB cluster based on KIDS
  • Mellanox QDR IB adapter
  • Dual-socket Intel Xeon X5660
(Closing figure: scaling rules spanning applications, system software, architecture, and technology.)

Thank You. Questions?