Deeper Look Into HSAIL And Its Runtime

Post on 21-Jun-2015





Technical overview of the HSAIL and HSAIL runtime from AFDS by Norm Rubin.


<ul><li> 1. HSAIL. Norm Rubin, Fellow. An introduction to the HSA Intermediate Language</li></ul> <p> 2. Disclaimer &amp; Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
OpenCL is a trademark of Apple Inc. used with permission by Khronos.
DirectX is a registered trademark of Microsoft Corporation.
© 2012 Advanced Micro Devices, Inc. All rights reserved.
2 | hsail AFDS | June 11, 2012

3.
WHAT IS SPLIT COMPILATION?
An app starts as a source program.
1) A high level compiler (HLC) generates HSAIL.
2) The HSAIL is shipped to the target machine.
3) A second compiler (a finalizer) turns HSAIL into ISA.
Unlike traditional compilers, where optimization is contained in one part or done twice, HSAIL allows optimization to be split into two parts.
The heavy lifting goes to the HLC; the quick finish goes to the finalizer.
HSAIL provides ways for an HLC and a finalizer to cooperate. For instance: HSAIL provides a fixed number of registers, and HSA implementations might support a different number. When the HLC spills registers, it can use special operations that let the finalizer know where to use extra registers.

4. SPLIT COMPILATION (MEANS THERE HAS TO BE WAYS TO PASS INFORMATION FROM HLC TO FINALIZER)
HLC (high level compiler):
Lots of time.
Info from source.
Lots of aggressive optimizations.
But limited (or no) knowledge of target.
Finalizer:
Very little time (we estimate that it will take close to linear time).
No info that is not in HSAIL (almost no back doors).
Cannot update regularly (must be close to bug free).
Simple optimizations only.
But knows the target.
Exactly how to split some optimizations is still an open problem.

5. WHY A VIRTUAL ISA - WHY NOT JUST TARGET THE REAL ISA?
ISA
Gains performance.
Better time to market (because hardware is finished faster).
Loses performance (cannot use every hardware trick).
No legacy boat anchor.
Real ISA means one vendor / one chip family.
Can fix hardware bugs in software.
Old and new code just works on old and new machines.
Allows hardware innovation under the table.
Features not in HSAIL are not exposed, and are hard to access.

6.
Development tools at HSAIL level
Today the need for a complete tool chain for each core, each with its own technology, switches etc., is a significant maintenance problem.
Debuggability, reproducibility. Because the same application needs to run on different pieces of hardware, current source code contains many conditional preprocessing directives. Programmers rely on compiler intrinsics and ad-hoc command line arguments to drive the optimization. This severely impacts code readability and productivity, and the application binary tested and debugged on a workstation is different from the one that eventually runs on the system.
Platform openness. Independent software vendors rarely have access to the tool chains needed to program the most powerful parts of the system, namely the DSPs and hardware accelerators. Virtualization can make the whole platform programmable, opening opportunities to third-party high-performance applications.
Performance through time to market. Because of the finalizer, last minute fixes can happen after the chip is finished. This means that the time to release a new part goes down. Less time per generation translates to better performance.

7. GOALS OF HSAIL
1. Can support all of C++ (open up the GPU to mass programming, not only for specialists)
2. Avoid constant change (do not change the spec every chip)
3. Support accurate IEEE floating point math
4. Target lots of different machines
5. Allow for packed operations, SSE and friends, bytes/shorts/ints/doubles etc.
6. Allow packed forms to save power
7. Make the model understandable
8. Make the finalizer fast (around linear time)
9. Make the finalizer simple (do not need monthly updates)
10. Less ambiguity in the spec (little undefined behavior)
11. Get good performance (little need to write in ISA)
12. Support all of OpenCL and C++ AMP
13. Can ship linkable libraries in HSAIL
14. Clean up all nits in AMDIL
15.
Allow the use of chip-specific acceleration when it is a good idea

8. HSAIL: LOTS OF NEW FEATURES
Lots of features not in OpenCL and C++ AMP.
Enough to implement C++ exceptions / heterogeneous compute.
Flat address space (work items on the GPU and agents on the CPU).
Because of hand-written HSAIL, these features can be exposed early.
Fine-grain barriers that work inside control flow, so you can implement producer-consumer models.
Lots of cross-wave operations, so you can quickly move data between lanes without loads and stores.
Spec is available on the web site.
The memory model shows how the CPU and GPU can cooperate.
Support for image operations.

9. PARALLELISM MODEL

10. WAVEFRONTS
Most developers will not care about wavefronts.
Similar to cache line sizes: experts can get good performance if they code to the cache line size.
Compiler has to avoid breaking the developer's model.
HSAIL formalizes the notion of wavefronts: you can tell which work item goes into which wavefront, and you can write producer-consumer parallelism between work groups.

11. AN EXAMPLE (IN OPENCL)
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);
    // Make sure we do not go out of bounds
    if (id &lt; n) {
        c[id] = a[id] + b[id];
    }
}

12.
VECTOR ADD A[0:N-1] = B[0:N-1] + C[0:N-1]
version 1:0:$small;
kernel &amp;__OpenCL_vec_add_kernel(
    kernarg_u32 %arg_a,
    kernarg_u32 %arg_b,
    kernarg_u32 %arg_c,
    kernarg_u32 %arg_n)
{
@__OpenCL_vec_add_kernel_entry:
// BB#0: // %entry
    ld_kernarg_u32 $s0, [%arg_n];
    workitemaid $s1, 0;
    cmp_lt_b1_u32 $c0, $s1, $s0;
    ld_kernarg_u32 $s0, [%arg_c];
    ld_kernarg_u32 $s2, [%arg_b];
    ld_kernarg_u32 $s3, [%arg_a];
    cbr $c0, @BB0_2;
    brn @BB0_1;
@BB0_1: // %if.end
    ret;
@BB0_2: // %if.then
    shl_u32 $s1, $s1, 2;
    add_u32 $s2, $s2, $s1;
    ld_global_f32 $s2, [$s2];
    add_u32 $s3, $s3, $s1;
    ld_global_f32 $s3, [$s3];
    add_f32 $s2, $s3, $s2;
    add_u32 $s0, $s0, $s1;
    st_global_f32 $s2, [$s0];
    brn @BB0_1;
};

13. MEMORY SEGMENTS
Memory is split into 7 segments: kernarg, global, arg, readonly, private, group, and spill.
There is a single flat address space with everything, but it is often advantageous to tell the finalizer which segment to use.
Load/store machine with registers.
Some segments are used for intent: spill indicates that the slot was used by the HLC for register spilling.

14. SEGMENTS
[Diagram: an NDRange contains work groups, each work group contains work items; group and private memory sit inside the flat address space, which is shared with the agent.]
Arg locations are in private.
Spill locations are in private.
Group is within flat.
Private is within flat.
Arg memory is within private.
Spill memory is within private.
PrivateRW is within private.
Kernarg is within global.
ReadOnly is within global.

15.
HSAIL FEATURES: REGISTERS AND TYPES
Four classes of registers, c/s/d/q: 1 bit, 32 bits, 64 bits, 128 bits.
Both binary (BRIG) and text format. The binary format is fully specified.
120 opcodes (Java bytecode has 200).
Types:
Brigs8, Brigs16, Brigs32, Brigs64,
Brigu8, Brigu16, Brigu32, Brigu64,
Brigf16, Brigf32, Brigf64,
Brigb1, Brigb8, Brigb16, Brigb32, Brigb64, Brigb128,
BrigROImg, BrigRWImg, BrigSamp,
Brigu8x4, Brigs8x4, Brigu8x8, Brigs8x8, Brigu8x16, Brigs8x16,
Brigu16x2, Brigs16x2, Brigf16x2, Brigu16x4, Brigs16x4, Brigf16x4, Brigu16x8, Brigs16x8, Brigf16x8,
Brigu32x2, Brigs32x2, Brigf32x2, Brigu32x4, Brigs32x4, Brigf32x4,
Brigu64x2, Brigs64x2, Brigf64x2

16. WHY DOES HSAIL LOOK THIS WAY?
An SIMT model (single instruction, multiple threads) claims that every work-item has a program counter, so branch instructions look pretty natural.
A vector machine model looks like SSE: one program counter and vector registers. This is like real AMD GPU hardware.
SIMT or vector?

17. PROS FOR SIMT
We want HSAIL to outlast one hardware generation (so at the very least the vector length and real types/number of registers should not get exposed).
Even with a vector model the finalizer will still have to map to the real vector length. We expected this to mean that a vector finalizer would not have a much simpler time.
We want to support lots of machines, including ones not built by AMD.
We can add cross-lane operations (like count) to the SIMT model, so the line between SIMT and vector is blurry.
We want to open up to 3rd party compilers and tools, all of which can support SIMT but few of which can support vector.
Work groups are a much more developer-friendly model than wavefronts.
Natural path for OpenCL/CUDA/C++ AMP.
Graphics is SIMT, so the pressure to make future hardware work well for SIMT is immense.

18.
PROS FOR VECTOR
Might get more performance, we estimated [...]

[...] X &gt;&gt; Y
What the memory system sees: the memory system must see X before Y.
Global visibility order.
This is transitive: if X &gt;&gt; Y and Y &gt;&gt; Z, then X &gt;&gt; Z.

32. RULES, SOMETIMES X SB Y =&gt; X &gt;&gt; Y
If X sb Y and both access the same address, then X &gt;&gt; Y.
Different address:
If there is a barrier or sync between X and Y, then X &gt;&gt; Y.
If X is an acquire (ld_acq, atomic_acq, atomicNoRet_acq, atomic_ar, atomicNoRet_ar), then X &gt;&gt; Y. This is one sided (Y cannot move before X).
The general rule is: use acquire and release when you want to force order.
Acquire and release may take extra time, but they give you sequential consistency.
Compilers can trade performance for simple cross work-item communication.

33. If Y is a release (st_rel, atomic_ar or atomicNoRet_ar), then X &gt;&gt; Y.
st_rel is another one-way fence.
Consider a critical region (acquire and release can be used to form critical sections):
ld_acq x
Assorted memory operations
st_rel y
No operations can move out, but operations can move in.

34. AN EXAMPLE: SB ORDER DOES NOT FORCE MEMORY ORDER
Work-item 0:
@h0: st_u32 1, [&amp;a]
@h1: ld_u32 $s0, [&amp;b]
Work-item 1:
@k0: st_u32 1, [&amp;b]
@k1: ld_u32 $s1, [&amp;a]
Initially, &amp;a and &amp;b = 0. $s0 = 0 and $s1 = 0 is allowed.
Constraints are added because readers have to follow writers: k1 (the reader) has to happen before h0 changes the value. There are also constraints caused by synchronization. h1 &gt;&gt; k1 &gt;&gt; h0 &gt;&gt; k0.
Even though h0 appears before h1 in sequenced-before order, there is no requirement that the operations appear in text order (sequenced-before order) to the memory system.

35.
EXAMPLE 2: REGISTER DEPENDENCE DOES NOT FORCE MEMORY ORDER
Work-item 0:
@h0: ld $s0, [&amp;a]
@h1: ld $s1, [$s0]
Work-item 1:
@j0: st 20, [100]
@j1: st_rel 100, [&amp;a]
Initially, &amp;a and the contents of location 100 = 0.
$s1 == 0 and $s0 == 100 is allowed.
If $s1 == 0 then h1 &gt;&gt; j0. If $s0 == 100 then j1 &gt;&gt; h0.
Because this seems to violate dependence order, it is useful to consider how this can come about.
Work-item 0 is allowed to prefetch load h1. One reason it might do this is that code before these operations reads address 96, and the implementation reads in large cache lines.
Later, work-item 0 reads the new value of &amp;a, which is 100. Then it reads the value of location 100, but because there is no synchronization, it can use the previously prefetched value of 0.

36. EXAMPLE 3
Work-item 0:
@h0: ld_acq $s0, [&amp;a]
@h1: ld $s1, [$s0]
Work-item 1:
@j0: st 20, [100]
@j1: st_rel 100, [&amp;a]
Initially, &amp;a and 100 = 0.
HSAIL does not allow $s1 == 0 and $s0 == 100.

37. QUESTIONS?</p>
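
The acquire/release pairing in Example 3 has a close analogue in C++11 atomics, which may be easier to experiment with than HSAIL. The following is a minimal sketch and is not taken from the slides: `Mailbox`, `producer` and `consumer` are hypothetical names, `slot` stands in for location 100, and the atomic pointer `a` stands in for `&a`. It assumes that C++ `memory_order_release` and `memory_order_acquire` behave like `st_rel` and `ld_acq` for this pattern.

```cpp
#include <atomic>

// Hypothetical stand-ins for the slide's memory locations.
struct Mailbox {
    int slot = 0;                  // plays the role of "location 100", initially 0
    std::atomic<int*> a{nullptr};  // plays the role of "&a", initially 0 (null here)
};

// Work-item 1: a plain store followed by a store-release that publishes the address.
inline void producer(Mailbox& m) {
    m.slot = 20;                                    // @j0: st 20, [100]
    m.a.store(&m.slot, std::memory_order_release);  // @j1: st_rel 100, [&a]
}

// Work-item 0: a load-acquire followed by a dependent plain load.
// Returns -1 when the address has not been published yet.
inline int consumer(Mailbox& m) {
    int* p = m.a.load(std::memory_order_acquire);   // @h0: ld_acq $s0, [&a]
    if (p == nullptr) {
        return -1;                                  // $s0 == 0: nothing to read yet
    }
    return *p;                                      // @h1: ld $s1, [$s0]
}
```

If `producer` and `consumer` run on two different threads, a consumer that observes the published pointer also observes the earlier plain store, so it reads 20 and never the stale 0. That matches the outcome Example 3 forbids ($s1 == 0 with $s0 == 100). Replacing the acquire load with a relaxed load gives the weaker situation of Example 2, in which this sketch would no longer be guaranteed to read 20.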