October 2013, The University of Texas at Austin
Enrique S. Quintana-Ortí
Energy-Aware Linear Algebra
Concurrency and energy efficiency
Green500 vs Top500 (June 2013)

| Rank (Top500/Green500) | Site | Technology | MFLOPS/W | MW to 1 EXAFLOPS? |
|---|---|---|---|---|
| 1/32 | Tianhe-2, National University of Defense Technology | Intel Xeon E5 + Intel Xeon Phi | 1,901 | 408 |
| 467/1 | Eurora, CINECA | Intel Xeon E5 + NVIDIA K20 | 3,208 | 312 |

Most powerful reactor under construction in France: Flamanville (EDF, 2017, US $9 billion), 1,630 MWe
1 MW ≈ $1 Million/year!
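The "MW to EXAFLOPS?" figure is the target rate divided by the energy efficiency. A minimal sketch in Python, applied to the Eurora entry; the function names are illustrative, and the cost estimate simply reuses the rule of thumb of $1 million per MW per year:

```python
EXAFLOPS = 1e18  # floating-point operations per second in 1 EXAFLOPS

def megawatts_for_exaflops(mflops_per_watt):
    """Power (in MW) needed to sustain 1 EXAFLOPS at a given efficiency."""
    flops_per_watt = mflops_per_watt * 1e6
    watts = EXAFLOPS / flops_per_watt
    return watts / 1e6

def yearly_cost_musd(megawatts, musd_per_mw_year=1.0):
    """Yearly energy bill in millions of USD (rule of thumb: 1 MW ~ $1M/year)."""
    return megawatts * musd_per_mw_year

eurora_mw = megawatts_for_exaflops(3208)    # Eurora's efficiency in MFLOPS/W
print(round(eurora_mw))                     # 312
print(round(yearly_cost_musd(eurora_mw)))   # 312 (million USD per year)
```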
System ranked #1 in Green500
[Figure: energy efficiency (MFLOPS/W, axis from 0 to 3,500) of the system ranked #1 in the Green500; labeled systems: IBM BlueGene/Q, Intel Xeon Phi]
Goal: 20 MW for 1 EXAFLOPS by 2020
Maintaining the improvement rate of the last five years (×5): 40 MW by 2020!
Reduce energy consumption!
Energy costs over the lifetime of an HPC facility often exceed acquisition costs
Carbon dioxide is a hazard for health and the environment
Heat reduces hardware reliability
Personal view
Hardware features some power-saving mechanisms
Scientific apps. are in general energy-oblivious
Experimental setup
AMD
2 AMD Opteron 6128, 48GB
DVFS per core
Intel
2 Intel Xeon E5504, 32GB
DVFS per socket
C-states:
• C0: normal operation mode
• C1, C1E: disable core components (L1/L2 caches), clock signal, memory controller, …
Deeper C-states increase energy savings at the expense of recovery time
National Instruments NI 9205 + NI cDAQ-9178
1,000 Samples/s per channel
Only 12 V lines
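With samples taken at a fixed rate, energy can be approximated by integrating the power trace over time. A minimal sketch, assuming the acquired samples have already been converted to watts (that conversion is not covered here, and the sample values below are made up):

```python
def energy_joules(power_watts, sample_rate_hz=1000.0):
    """Trapezoidal integration of a uniformly sampled power trace."""
    if len(power_watts) < 2:
        return 0.0
    dt = 1.0 / sample_rate_hz
    total = 0.0
    for p0, p1 in zip(power_watts, power_watts[1:]):
        total += 0.5 * (p0 + p1) * dt
    return total

# Hypothetical trace: 2,001 samples (2 s at 1,000 samples/s) at a constant 100 W.
trace = [100.0] * 2001
print(energy_joules(trace))  # approximately 200 J
```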
Outline
Modeling power
Saving power in task-parallel applications
ILUPACK for multicore processors
CG for hybrid CPU-GPU platforms
Conclusions
Modeling Power
System power $P_Y$, in $P = P_Y + P_S + P_D$:
Due to off-chip components, e.g., RAM (only mainboard)
Estimated as idle power: $P_Y \approx P_I = 80.15$ W
Static power $P_S$, in $P = P_Y + P_S + P_D$:
$P_T^0(c) = a_0 + b_0 \cdot c = 168.59 + 9.12 \cdot c$ W
$P_S^0 \approx a_0 - P_Y = 88.44$ W
Dynamic power $P_D$, in $P = P_Y + P_S + P_D$:
$P_T^0(c) = a_0 + b_0 \cdot c = 168.59 + 9.12 \cdot c$ W
Busy-wait: $P_D^0 \approx b_0 \cdot c = 9.12 \cdot c$ W
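The three components can be put together in a few lines. This is a sketch only, using the fitted coefficients quoted above and assuming that c is the number of cores in busy-wait; the function names and the core count of 8 are illustrative:

```python
P_Y = 80.15   # system (off-chip) power, estimated as idle power, in W
A0 = 168.59   # intercept of the fitted total power at P-state 0, in W
B0 = 9.12     # slope per active core at P-state 0, in W

def static_power():
    """P_S ~ a0 - P_Y."""
    return A0 - P_Y

def dynamic_power(cores):
    """P_D ~ b0 * c for c busy-waiting cores (assumed meaning of c)."""
    return B0 * cores

def total_power(cores):
    """P = P_Y + P_S + P_D, which reduces to a0 + b0 * c."""
    return P_Y + static_power() + dynamic_power(cores)

print(round(static_power(), 2))   # 88.44
print(round(total_power(8), 2))   # 241.55
```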
An operation more challenging than busy-wait?
Task-parallel DLA on multicore and CPU-GPU
• Use average power
• Depends also on the number of active sockets!
Accommodate memory contention
Simple, yet accurate:
Dense factorizations (Cholesky, LU, QR)
Multicore processors
CPU-GPU platforms
ILUPACK on multicore processors
"Modeling power and energy consumption of dense matrix factorizations on multicore processors"
P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana. CCPE 2013
"Enhancing performance and energy consumption of runtime schedulers for dense linear algebra"
P. Alonso, M. F. Dolz, F. D. Igual, R. Mayo, E. S. Quintana. CCPE 2013 (submitted)
"Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems"
J. Aliaga, M. Barreda, M. F. Dolz, A. Martín, R. Mayo, E. S. Quintana. Cluster Computing 2013 (submitted)
ILUPACK on multicore
Incomplete LU Package (http://ilupack.tu-bs.de)
Iterative Krylov subspace methods
Multilevel ILU preconditioners for general/symmetric/Hermitian positive definite systems
Based on inverse ILUs with control over the growth of the inverse triangular factors
Especially competitive for linear systems from 3D PDEs
Task parallelism
Multi-threaded parallelism (real s.p.d. systems)
Leverage task parallelism
Dynamic scheduling via runtime (OpenMP)
Run-time in charge of scheduling
"Exploiting thread-level parallelism in the iterative solution of sparse linear systems"
J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana. Parallel Computing, 2011
Experimental setup
Sparse linear system benchmark
Laplacian equation $-\Delta u = f$ in a 3D unit cube $\Omega = [0,1]^3$
Linear system $Au = b$ with $A$ of size $n \times n$, $n = 252^3 \approx 16$ million unknowns and 111 million nonzero entries
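The problem size can be checked with a short calculation. The sketch below assumes a standard 7-point finite-difference stencil on a 252 × 252 × 252 grid with all nonzeros stored explicitly; the discretization is not stated above, but this assumption is consistent with the quoted counts:

```python
def laplacian_3d_size(points_per_dim):
    """Unknowns and nonzeros of a 7-point Laplacian on an m x m x m grid."""
    m = points_per_dim
    n = m ** 3
    # Each unknown couples to itself and up to six neighbours; each of the
    # six cube faces removes one neighbour for its m * m grid points.
    nnz = 7 * n - 6 * m * m
    return n, nnz

n, nnz = laplacian_3d_size(252)
print(n)    # 16003008
print(nnz)  # 111640032
```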
Leveraging P-states (AMD)
DVFS = P-states (see ACPI standard)
Moving to a higher P-state results in ↓power
↓Power = ↓Energy?
For a compute-bound operation, execution time is inversely proportional to the frequency $f_i$
In principle, for a memory-bound operation (ILUPACK), reducing $f_i$ should have a minor impact on performance
1st attempt: static (rather than dynamic) voltage-frequency scaling
Why?
• Combined effect of linear decrease of CPU performance and memory bandwidth!
• Decrease of $P_S^i$ (P0 → P2: −21.47%), decrease of $P_D^i$ (P0 → P3: −60.73%), but $P_Y^i$ does not change!
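Why lower power need not mean lower energy can be seen from a short calculation. This is a sketch under the assumption that execution time scales as $1/f_i$ (the compute-bound case); $T^i$, $E^i$ and $P^i$ denote time, energy and total power at P-state $i$ and are notation introduced here for illustration:

```latex
\begin{align*}
E^i &= P^i \, T^i = \left(P_Y + P_S^i + P_D^i\right) T^i,
\qquad T^i = T^0 \, \frac{f_0}{f_i} \\
\frac{E^i}{E^0} &= \frac{P_Y + P_S^i + P_D^i}{P_Y + P_S^0 + P_D^0} \cdot \frac{f_0}{f_i}
\end{align*}
```

Energy drops only if the relative reduction in total power outweighs the slowdown $f_0 / f_i$. Because $P_Y$ is unaffected by the P-state, the power ratio can never fall below $P_Y / P^0$, which limits what frequency scaling alone can gain.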
2nd attempt: DVFS during idle periods
Why?
Active polling for work…
Leveraging P- and C-states (AMD)
3rd attempt: DVFS and idle-wait
Savings of 6.92% of total energy
Negligible impact on execution time
…but take into account that
Idle time: 23.70%
Dynamic power: 39.32%
Upper bound of savings: 39.32% ∙ 0.2370 = 9.32%
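The bound and the share of it that was actually achieved can be reproduced directly. A small sketch using the figures from the product above, under the best-case assumption that all dynamic power is eliminated during idle periods; the variable names are illustrative:

```python
idle_fraction = 0.2370     # share of execution time spent idle
dynamic_share = 39.32      # dynamic power as a percentage of total power
measured_savings = 6.92    # measured energy savings, in percent

# Best case: dynamic power drops to zero whenever a core is idle.
upper_bound = dynamic_share * idle_fraction
achieved = 100 * measured_savings / upper_bound

print(round(upper_bound, 2))  # 9.32
print(round(achieved))        # 74 (percent of the upper bound)
```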
Leveraging P-states (Intel)
DVFS
Leveraging P- and C-states (Intel)
DVFS vs. DVFS+idle-wait
Average reduction: 9.5% for LU and 6.5% for Solve
The CG method on CPU-GPU
Leveraging P-states on CPU-GPU platforms?
Apply DVFS to the CPU while computation proceeds on the GPU?
Leveraging C-states on CPU-GPU platforms?
What is the CPU doing while computation proceeds on the GPU?
Experimental setup
Sandy:
Intel i7-3770K, 16GB
NVIDIA GeForce GTX480
Cases from two matrix collections
Basic implementation
CG: sparse matrix-vector product (SpMV) + CUBLAS
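To make the kernel structure concrete, here is a plain-Python sketch of unpreconditioned CG. It runs on the CPU and is not the implementation used in the talk; it only shows that each iteration consists of one SpMV plus a handful of dot products and vector updates, which are the kind of operations CUBLAS provides on the GPU. The CSR layout and all function names are illustrative choices:

```python
import math

def spmv(row_ptr, col_idx, values, x):
    """y = A x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

def cg(row_ptr, col_idx, values, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)                  # residual b - A x for the initial guess x = 0
    p = list(r)
    rho = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rho) <= tol:
            break
        q = spmv(row_ptr, col_idx, values, p)   # the only matrix operation
        alpha = rho / dot(p, q)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, q, r)
        rho_new = dot(r, r)
        p = axpy(rho_new / rho, p, r)           # p = r + beta * p
        rho = rho_new
    return x

# 2x2 SPD example: A = [[4, 1], [1, 3]], b = [1, 2]; exact solution [1/11, 7/11].
x = cg([0, 2, 4], [0, 1, 0, 1], [4.0, 1.0, 1.0, 3.0], [1.0, 2.0])
```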
Leveraging P-states:
• Basically all computation is performed on the GPU
• Apply static VFS to reduce power in the CPU!
Leveraging C-states:
• What is the CPU doing while computation proceeds on the GPU?
• CUDA offers polling (active-wait) vs. blocking (idle-wait) operation modes
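The difference between the two waiting styles can be illustrated without a GPU. The sketch below does not call CUDA (in the CUDA runtime the choice is made through `cudaSetDeviceFlags`, with `cudaDeviceScheduleSpin` or `cudaDeviceScheduleBlockingSync`); it simulates the device with a sleeping Python thread and measures how much host CPU time each style burns. The function name and the 0.2 s duration are arbitrary:

```python
import threading
import time

def wait_for_device(mode, work_seconds=0.2):
    """Wait for a simulated device kernel; return the host CPU time consumed."""
    done = threading.Event()

    def device_kernel():
        time.sleep(work_seconds)   # stands in for computation on the GPU
        done.set()

    worker = threading.Thread(target=device_kernel)
    cpu_start = time.process_time()
    worker.start()
    if mode == "polling":
        while not done.is_set():   # active-wait: the core stays busy
            pass
    else:
        done.wait()                # idle-wait: the thread sleeps until notified
    worker.join()
    return time.process_time() - cpu_start

polling_cpu = wait_for_device("polling")
blocking_cpu = wait_for_device("blocking")
print(f"polling: {polling_cpu:.3f} s CPU, blocking: {blocking_cpu:.3f} s CPU")
```

A core that sleeps while waiting is free to enter a C-state; a core that spins is not.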
Trading off energy for time: variations of CUDA blocking mode w.r.t. CUDA polling mode
Energy = Time ∙ Power
For AUDIKW_1:
• Time 3.6% ↑
• Power 29.16% ↓
• Energy 26.6% ↓
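The three figures are tied together by Energy = Time ∙ Power: a small increase in time combined with a large decrease in power gives a net decrease in energy. A quick check, with an illustrative function name:

```python
def relative_energy_change(time_change, power_change):
    """Relative change in energy given relative changes in time and power."""
    return (1.0 + time_change) * (1.0 + power_change) - 1.0

# AUDIKW_1, blocking w.r.t. polling: time +3.6%, power -29.16%.
change = relative_energy_change(0.036, -0.2916)
print(f"{100 * change:.1f}%")  # -26.6%
```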
Merged implementation
Can we attain the performance of polling and the energy advantage of blocking?
Requires a reformulation of CG (merge kernels)
Time vs. CPU energy
Maintain performance of polling…
…while leveraging energy-efficiency of C-states+idle-wait
Performance and energy consumption
Summary
“Do nothing, efficiently…” (V. Pallipadi, A. Belay)
or
“Doing nothing well” (D. E. Culler)
Acknowledgments
A. F. Martín
H. Anzt
J. I. Aliaga, M. Barreda, M. F. Dolz, R. Mayo,
J. Pérez, E. S. Quintana-Ortí
P. Alonso
EU FP7 318793 Project "EXA2GREEN. Energy-Aware Sustainable Computing on Future Technology - Paving the Road to Exascale Computing"
Project CICYT TIN2011-23283 "PA-HPC. Power-Aware High Performance Computing"