VEAL: Virtualized Execution Accelerator for Loops
Nate Clark (1), Amir Hormati (2), Scott Mahlke (2)
(1) Georgia Tech, (2) U. Michigan

Post on 18-Jan-2018


TRANSCRIPT

Slide 1: VEAL: Virtualized Execution Accelerator for Loops
  Nate Clark (1), Amir Hormati (2), Scott Mahlke (2)
  (1) Georgia Tech, (2) U. Michigan

Slide 2: How to get Efficiency?
  Microarchitecture changes
  Multi-/many-core
  Heterogeneity
  [Figure labels: STI Cell, Core2 Duo]

Slide 3: How is Heterogeneity Used?
  Control Statically Placed in Binary
  [Figure labels: Engineer/Compiler, GPP, Hetero., Program]

Slide 4: Problem With Static Control
  Not forward/backward compatible
  [Figure labels: CPU, Hetero., CPU, Hetero., Program, Engineer/Compiler]

Slide 5: Solution: Virtualization
  Abstract accelerator features
  Reexamine compiler algorithms
  Key: do the hard stuff offline
  [Figure labels: CPU, Hetero., Program, CPU, Hetero., Dyn Comp. (three times), Engineer/Compiler, Offline, Online]

Slide 6: This Paper
  Examines loops as heterogeneity target
    ASICs often implement loops
  Design a generalized loop accelerator
    Not covered in this talk
  Explore how to virtualize loop accelerators
    I.e., abstract the accelerator interface

Slide 7: Loop Accelerator Template
  [Figure]

Slide 8: Why More Efficient Than GPP?
  Simple control flow
  Decoupled memory accesses
  I-Cache unnecessary
  Customize execution resources for loops

Slide 9: Proposed Loop Accelerator
  1 CCA
  2 Int units
  16 regs
  Memory (4x)
  16 Input streams
  8 Output streams
  0.8 mm^2, 90 nm

Slide 10: Modulo Scheduling
  + High quality software pipelining technique
  + Simple control structure (low HW cost)
  - Can be slow, i.e., hard to do dynamically
  - Loops: no side exits, no while, if-convertible

Slide 11: Benchmark Execution Time
  [Figure]

Slide 12: Modulo Scheduling Basics
  [Figure labels, belonging to slide 12 or 13: Kernel, FU, C, CCA, Int, CCA, Int]

Slide 13: Modulo Scheduling Example
  Priority: 2, 4, 6 3,
  [Figure label: Time]
  1. CCA Mapping
  2. II Calculation
  3. Priority
  4. Scheduling
  5. Reg. assignment/communication

Slide 14: Measured Scheduling Overhead
  70% Priority, 19% CCA

Slide 15: Supporting Hybrid Compilation
  Loop:
    1 ld
    2 add
    3 sub
    and sub xor
    5 or
    6 or
    7 add
    8 str

  Loop:
    1 ld
    2 add
    3 sub
    4 brl CCA
    5 or
    6 or
    7 add
    8 str
  CCA:
    and
    sub
    xor
    ret

  Data:
  Loop: 1 ld 2 add 3 sub 4 brl CCA 5 or

Slide 16: Speedups
  [Figure]

Slide 17: Summary
  Virtualization key to heterogeneity
  VEAL speedup: 2.54
    2.63 w/o translation (i.e., not binary compatible)
    2.17 fully dynamic
  CCA and priority: 89% overhead
  mpeg2dec 2.1 vs

Slide 18: Thank you! Questions?
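
Supplementary note on the "II Calculation" step: the talk names this step of modulo scheduling without showing the arithmetic. The following is a minimal sketch, assuming the textbook lower bound MII = max(ResMII, RecMII) and not VEAL's actual implementation. The function names, the operation counts, and the mapping of the hybrid-compilation loop onto units (five integer ops on the 2 Int units, one CCA op, loads and stores handled by the memory streams) are hypothetical illustrations.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on the initiation interval (II).

    op_counts:   ops per loop iteration, keyed by unit kind.
    unit_counts: number of available units, keyed by unit kind.
    """
    return max(
        (ceil(n / unit_counts[kind]) for kind, n in op_counts.items() if n > 0),
        default=1,
    )


def rec_mii(recurrences):
    """Recurrence-constrained lower bound on II.

    Each recurrence is (total_latency, iteration_distance) for one
    dependence cycle in the loop body.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_ii(op_counts, unit_counts, recurrences=()):
    """Minimum II: the scheduler starts here and increases II on failure."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# Assumed resources, taken from the proposed accelerator: 1 CCA, 2 Int units.
units = {"cca": 1, "int": 2}
# Assumed op mix for the example loop (hypothetical mapping, see lead-in).
ops = {"cca": 1, "int": 5}

print(min_ii(ops, units))            # 3: ceil(5 / 2) on the Int units
print(min_ii(ops, units, [(4, 1)]))  # 4: a 4-cycle recurrence now dominates
```

In this sketch the Int units set the bound when no recurrence is present; a loop-carried dependence chain longer than three cycles per iteration would take over as the limiting factor.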