Improving GPU Performance via Improved SIMD Efficiency



Ahmad Lashgar
ECE Department, University of Tehran

Supervisors: Ahmad Khonsari, Amirali Baniasadi

November 12, 2012

Outline

- Introduction
- Background
  - Warping
  - Branch divergence
  - Memory divergence
- CROWN
  - Branch divergence management
  - Technique & results
- DWR
  - Memory access coalescing
  - Technique & results
- Conclusion and future work

Introduction

[Figure: Tianhe-1A supercomputer; Supermicro 6016GT-TF-FM209 node combining CPUs, GPUs, and DRAM.]

In this study we propose two techniques, CROWN and DWR, to improve GPU performance. The motivation is simple: many threads are left inactive by branch or memory divergence and could be re-activated to improve performance.

SIMT Core

- SIMT: Single-Instruction, Multiple-Thread
- The goal is throughput
- In-order pipeline, no speculation
- Deeply multithreaded for latency hiding
- 8- to 16-lane SIMD
- Typical core pipeline (figure)

(Speaker note: This is the architecture we model.)

Warping

- Thousands of threads are scheduled with zero overhead
- The full context of every thread is kept on-core
- Tens of threads are grouped into a warp
- A warp executes the same instruction in lock-step

(Speaker note: Define warp, warp scheduler, and SIMD width. Define the shared resources per SM: thread pool, shared memory, and register file. Every cycle a warp is selected for issue.)

Warping (continued)

- Opportunities
  - Reduce scheduling overhead
  - Improve utilization of the execution units (SIMD efficiency)
- Challenges
  - Memory divergence
  - Branch divergence

Memory Divergence

Threads of a warp may hit or miss independently in the L1 cache on the same access:

    J = A[S];   // L1 cache access
    L = K * J;

Because the warp is processed in lock-step, a single miss (e.g., hit/hit/miss/hit across T0-T3) stalls the entire warp until the missing data returns.
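To make the scenario concrete, here is a minimal CUDA sketch of the slide's two-line example (the kernel name, array types, and bounds check are our additions, not from the slides):

    // Illustration of memory divergence (hypothetical example).
    // Each thread loads A[S[tid]]; if the indices in S scatter across
    // cache lines, some lanes of a warp hit in L1 while others miss,
    // and the whole warp stalls until the slowest lane is serviced.
    __global__ void gather(const int *A, const int *S, int *L, int K, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            int J = A[S[tid]];   // data-dependent load: per-lane hit or miss
            L[tid] = K * J;      // cannot issue until the load completes
        }
    }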

Branch Divergence

A branch instruction can diverge to two different paths, dividing the warp into two groups: the threads with the taken outcome and the threads with the not-taken outcome.

    if (J == K) {
        C[tid] = A[tid] * B[tid];
    } else if (J > K) {
        C[tid] = 0;
    }

[Figure: execution timeline of a 4-thread warp; after the branch only a subset of T0-T3 is active on each path until the paths re-converge.]

(Speaker note: Define diverge, re-converge, and the re-convergence point.)

CROWN: Comprehensive Reincarnation-based Warping

- Motivation
  - Branch divergence management
  - Application behavior
- CROWN
  - Operation example
  - Design goal
- Experimental results

(Speaker note: The least-significant bits of a thread's warp ID determine the set the thread maps to in the re-convergence barriers.)

Branch Divergence Management

- Stack-Based Re-convergence (SBR) is used in NVIDIA GPUs
  - A per-warp stack keeps track of diverging paths and re-convergence points
  - It effectively manages nested divergence
- Challenges
  - SIMD efficiency (the target of DWF and LWM)
  - Diverging-path serialization
  - Re-convergence waiting

Stack-Based Re-convergence (SBR)

SBR 1) SIMD Efficiency

    A: if (k == 0) {
    B:     G[i] = 0;
       } else {
    C:     G[i] = 1;
       }
    D: E[i] += G[i];

[Figure: per-warp stack snapshots (PC, re-convergence PC, active mask) and SIMD utilization over time for warp W0: A with mask 1111, then B with mask 0110, then C with mask 1001, then D with mask 1111.]

SIMD efficiency drops due to idle lanes (down to 24%).
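For completeness, the A/B/C/D example can be written as a full CUDA kernel roughly as follows (the kernel signature and bounds check are our own assumptions; the slide shows only the body):

    // Hypothetical CUDA form of the slide's A/B/C/D example.
    // Threads within a warp whose k values differ take different paths
    // (B vs. C) and are serialized by the SIMT stack until they
    // re-converge at D.
    __global__ void abcd(const int *k, int *G, int *E, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (k[i] == 0) {      // A: branch
                G[i] = 0;         // B: taken path
            } else {
                G[i] = 1;         // C: not-taken path
            }
            E[i] += G[i];         // D: re-convergence point
        }
    }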

(In the stack snapshots, PC is the program counter of the entry and RPC is its re-convergence program counter. Speaker note: color on the CFG shows an active warp; gray shows an inactive, masked warp.)
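To clarify how such a per-warp stack evolves, here is a minimal host-side C++ sketch (our own model with assumed field names; it is not NVIDIA's or the thesis's exact hardware):

    #include <cstdint>
    #include <vector>

    // Sketch of a per-warp SIMT re-convergence stack. Each entry tracks
    // the path PC, its re-convergence PC, and the active-lane mask.
    struct StackEntry {
        uint32_t pc;     // next PC of this path
        uint32_t rpc;    // re-convergence PC
        uint32_t mask;   // active lanes on this path
    };

    struct WarpStack {
        std::vector<StackEntry> s;

        // On a divergent branch: the top entry becomes the re-convergence
        // entry, then the not-taken and taken paths are pushed.
        void diverge(uint32_t taken_pc, uint32_t taken_mask,
                     uint32_t nottaken_pc, uint32_t nottaken_mask,
                     uint32_t rpc) {
            s.back().pc   = rpc;                           // re-convergence entry
            s.back().mask = taken_mask | nottaken_mask;
            s.push_back({nottaken_pc, rpc, nottaken_mask});
            s.push_back({taken_pc,    rpc, taken_mask});   // taken path executes first
        }

        // When the active path reaches its RPC, pop it; execution resumes
        // from the entry below (the other path, or the re-converged warp).
        void maybe_reconverge(uint32_t next_pc) {
            if (!s.empty() && next_pc == s.back().rpc) s.pop_back();
        }
    };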

SBR 2) Diverging Path Serialization

[Figure: stack snapshot and SIMD utilization over time; while path B (mask 0110) executes, path C (mask 1001) waits on the stack.]

Threads are inactivated due to divergence (up to 13% of threads). This is the problem targeted by DWF.

SBR 3) Waiting at Re-convergence

[Figure: stack snapshot and SIMD utilization over time; threads that finished their path idle until path B (mask 0110) reaches the re-convergence point D.]

Threads are waiting at the re-convergence point (up to 46% of threads).

Application Behavior

[Figure: IPC, diverging-path serialization, SIMD efficiency, and re-convergence waiting across the benchmarks (MU, MP, and NQU; BFS; and the others).]

- Type A: little branch divergence; memory-bound
- Type B: high branch divergence; low thread-per-core parallelism
- Type C: high branch divergence; high thread-per-core parallelism

CROWN's Design Goal

Proposed to address the SBR challenges:
- SIMD efficiency: re-convergence and dynamic regrouping
- Diverging-path serialization: activate all threads
- Re-convergence waiting: dynamic regrouping; schedule at small-warp granularity (as wide as the SIMD width)

CROWN Example

[Figure: operation example for two warps W0 and W1 (threads T0-T3 and T4-T7). After the branch, threads on the same path are regrouped into new warps (W2: path C with T0, T5, T6, T3; W3: path B with T1, T2, T7; W4: path C with T4). Re-convergence barriers at D collect arriving threads, and once all threads of an original warp have arrived the barrier is reincarnated as a full warp (W5, W6). The scheduler interacts with the fetch stage, the commit stage, the second-level warp store, the lookup tables, and the re-convergence barriers.]

Methodology

- GPGPU-sim version 2.1.1b, configured to model a Tesla-like architecture
- Workloads from RODINIA, Parboil, CUDA SDK, GPGPU-sim, and a third-party sequence-alignment tool

Experimental Results

We report the three challenges (SIMD efficiency, diverging-path serialization, and re-convergence waiting) and throughput in terms of instructions per clock (IPC).

SIMD Efficiency

SIMD efficiency is only one of the three issues that affect performance.

[Figure: SIMD efficiency per benchmark, highlighting the Type B and Type C workloads.]

Diverging Path Serialization

Large warps may exacerbate this metric.

[Figure: diverging-path serialization per benchmark, highlighting the Type B workloads.]

Re-convergence Waiting

[Figure: re-convergence waiting per benchmark, highlighting the Type B and Type C workloads.]

IPC

CROWN improves performance by 14%, 12%, and 10% compared to SBR, DWF, and LWM, respectively.

[Figure: IPC per benchmark, highlighting the Type B and Type C workloads.]

Memory Access Coalescing

Common memory accesses of neighboring threads are coalesced into one transaction.

[Figure: three 4-thread warps issuing a mix of hits and misses; their accesses collapse into memory requests A-E.]

Coalescing Width

- The range of threads in a warp that are considered for memory access coalescing:
  - over a sub-warp
  - over a half-warp
  - over the entire warp
- When the coalescing width is the entire warp, the optimal warp size depends on the workload.

Warp Size

- Warp size is the number of threads in a warp.
- Why a small warp? (but not smaller than the SIMD width)
  - Less branch/memory divergence
  - Less synchronization overhead at every instruction
- Why a large warp?
  - Greater opportunity for memory access coalescing
- We study the impact of warp size on performance.

(Speaker note: Justify why shorter warps are not reasonable.)

Warp Size and Branch Divergence

The lower the warp size, the lower the branch divergence:

    if (J > K) {
        C[tid] = A[tid] * B[tid];
    } else {
        C[tid] = 0;
    }

[Figure: for threads T0-T7, 2-thread warps see no branch divergence while 4-thread warps diverge.]

Warp Size and Branch Divergence (continued)

[Figure: execution timelines under small and large warps; small warps save some idle cycles.]

Warp Size and Memory Divergence

[Figure: execution timelines under small and large warps; small warps improve memory-access latency hiding, since a hitting sub-warp need not stall for a missing one.]

Warp Size and Memory Access Coalescing

[Figure: the same access pattern generates 5 memory requests with small warps but only 2 with a large warp; wider coalescing reduces the number of memory accesses.]

Motivation

- Study of warp size
- Coalescing over the entire warp

(Speaker note: Short warps reduce thread synchronization for computations and reduce divergence; large warps improve coalescing.)
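As a hypothetical illustration of why access pattern and coalescing width matter (not from the slides), the two CUDA kernels below touch the same amount of data; the first generates few memory requests per warp, the second many:

    // With consecutive indices, the threads of a warp touch neighboring
    // words, so their loads coalesce into few memory requests.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = in[tid];      // neighbors access neighboring words
    }

    // With a large stride, each thread of a warp touches a different
    // cache line, so the loads cannot be merged and many requests issue.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        long idx = (long)tid * stride;
        if (idx < n)
            out[idx] = in[idx];      // neighbors access far-apart words
    }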

DWR: Dynamic Warp Resizing

Basic idea: large warps are only useful for the coalescing gain, so DWR works at small-warp granularity and synchronizes the small warps to execute memory accesses as one large warp. The instructions that execute faster using a large warp are called LATs. LATs can be detected with a static approach or a dynamic approach.

(Speaker note: Small warps, as wide as the SIMD width, are called sub-warps.)

Static Approach for Detecting LATs

Modify the ISA and add a new synchronization instruction. Original PTX:

    cvt.u64.s32  %rd1, %r3;
    ld.param.u64 %rd2, [__parm1];
    add.u64      %rd3, %rd2, %rd1;
    ld.global.s8 %r5, [%rd3+0];
    mov.u32      %r6, 0;
    setp.eq.s32  %p2, %r5, %r6;
    @%p2 bra     $Lt_0_5122;
    mov.s16      %rh2, 0;
    st.global.s8 [%rd3+0], %rh2;

PTX after instrumentation, with bar.synch_partner inserted before the global-memory instructions (the LATs):

    cvt.u64.s32  %rd1, %r3;
    ld.param.u64 %rd2, [__parm1];
    add.u64      %rd3, %rd2, %rd1;
    bar.synch_partner 0;
    ld.global.s8 %r5, [%rd3+0];
    mov.u32      %r6, 0;
    setp.eq.s32  %p2, %r5, %r6;
    @%p2 bra     $Lt_0_5122;
    mov.s16      %rh2, 0;
    bar.synch_partner 0;
    st.global.s8 [%rd3+0], %rh2;

(The PTX above is the BFS kernel. Speaker note: Discuss the static approach; the LAT is detected after its first decode and stored in a table for future detections.)

Micro-architecture

Configurable parameters:
- Number of sub-warps per SM (N)
- Number of large warps per SM (M)
- Number of entries in the set-associative ILT (K)
- N/M sub-warps are synchronized by the PST to execute a LAT

Structures:
- PST: Partner-Synch Table
- ILT: Ignore List Table
- SCO: Sub-warp Combiner

(Speaker note: The scheduler subdivides a warp into sub-warps and schedules threads at sub-warp granularity. The PST synchronizes sub-warps at the barrier just before the memory instruction. The SCO issues one large warp when the sub-warps are synchronized.)
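As a rough sketch of the bookkeeping the PST performs (our own illustration; the field names and the per-sub-warp bit mask are assumptions, not the thesis's exact tables):

    #include <cstdint>

    // Sketch of a Partner-Synch Table entry. Sub-warps of the same
    // original large warp that reach a LAT wait here until all partners
    // have arrived, after which the Sub-warp Combiner (SCO) can issue
    // them as one large warp with full coalescing width.
    struct PSTEntry {
        uint32_t large_warp_id;   // which large warp these sub-warps belong to
        uint32_t lat_pc;          // PC of the LAT they are waiting on
        uint32_t arrived_mask;    // bit per sub-warp that has arrived
        uint32_t expected_mask;   // all sub-warps of the large warp
    };

    // Returns true when the last partner arrives, i.e., the combined
    // large warp is ready to issue the LAT.
    bool pst_arrive(PSTEntry &e, uint32_t sub_warp_idx)
    {
        e.arrived_mask |= (1u << sub_warp_idx);
        return e.arrived_mask == e.expected_mask;
    }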

Experimental Results

- Methodology
  - DWR-X, where X is the largest warp size
  - We assume a 32-entry ILT
- Results
  - Coalescing rate
  - Idle cycles
  - Performance
- Sensitivity analysis
  - SIMD width
  - ILT size

Coalescing Rate

DWR-64 reaches 97% of the coalescing rate of 64 threads per warp and improves the coalescing rate of 8 threads per warp by 14%.

Idle Cycles

DWR-64 reduces idle cycles by 26%, 12%, 17%, and 25% compared to 8, 16, 32, and 64 threads per warp, respectively.

Performance

DWR-64 improves performance by 8%, 8%, 11%, and 18% compared to 8, 16, 32, and 64 threads per warp, respectively.

(Speaker note: Where warp size affects performance, DWR performs near the best warp size.)

Conclusion & Future Work

- The proposed mechanisms are based on scheduling short warps.
- When the coalescing width equals the SIMD width, CROWN improves performance by 14% over the conventional control-flow mechanism at the cost of 4.2% area overhead.
- When the coalescing width is the entire warp, DWR improves the performance of the baseline micro-architecture by 8% at the cost of less than 1% area overhead.
- Future work: evaluate the energy efficiency of the proposed mechanisms, full simulation, and exploiting locality.

References

[1] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, T. M. Aamodt. "Analyzing CUDA workloads using a detailed GPU simulator." In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 163-174.
[2] W. W. L. Fung, I. Sham, G. Yuan, T. M. Aamodt. "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow." In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007, pp. 407-420.
[3] W. W. L. Fung, T. M. Aamodt. "Thread Block Compaction for Efficient SIMT Control Flow." In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA-17), 2011, pp. 25-36.
[4] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, Y. N. Patt. "Improving GPU performance via large warps and two-level warp scheduling." In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 308-317.
[5] M. Rhu, M. Erez. "CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures." In Proceedings of the 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 61-71.
[6] N. Brunie, S. Collange, G. Diamos. "Simultaneous branch and warp interweaving for sustained GPU performance." In Proceedings of the 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 49-60.

Published

[1] A. Lashgar, A. Baniasadi. "Performance in GPU Architectures: Potentials and Distances." 9th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD), in conjunction with ISCA 2011.
[2] A. Lashgar, A. Baniasadi, A. Khonsari. "Dynamic Warp Resizing: Analysis and Benefits in High-Performance SIMT." In Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Poster Session, 2012.

Thank You! Any Questions?

Backup Slides

Methodology

- GPGPU-sim version 2.1.1b, configured to model a Tesla-like architecture
- Workloads from RODINIA [R], Parboil [P], CUDA SDK [C], GPGPU-sim [G], and third-party sequence alignment [T]

Abbr. | Name and Suite             | Grid Size                       | Block Size          | #Insn | CTA/SM
BFS   | BFS Graph [R]              | 16x(8)                          | 16x(512)            | 1.4M  | 1
BKP   | Back Propagation [R]       | 2x(1,64)                        | 2x(16,16)           | 2.9M  | 4
CP    | Coulomb Poten. [P]         | (8,32)                          | (16,8)              | 113M  | 8
DYN   | Dyn_Proc [R]               | 13x(35)                         | 13x(256)            | 64M   | 4
FWAL  | Fast Wal. Trans. [C]       | 6x(32) 3x(16) (128)             | 7x(256) 3x(512)     | 11M   | 2, 4
GAS   | Gaussian Elimin. [R]       | 48x(3,3)                        | 48x(16,16)          | 9M    | 1
HSPT  | Hotspot [R]                | (43,43)                         | (16,16)             | 76M   | 2
LPS   | Laplace 3D [G]             | (4,25)                          | (32,4)              | 81M   | 6
MP2   | MUMmer-GPU++ [T] big       | (196)                           | (256)               | 139M  | 2
MP    | MUMmer-GPU++ [T] small     | (1)                             | (256)               | 0.3M  | 1
SR1   | Speckle Reducing [R] big   | 4x(8,8)                         | 4x(16,16)           | 9.5M  | 2, 3
SR2   | Speckle Reducing [R] small | 4x(4,4)                         | 4x(16,16)           | 2.4M  | 1
MTM   | Matrix Multiply [C]        | (5,8)                           | (16,16)             | 2.4M  | 4
MU2   | MUMmer-GPU [R] big         | (196)                           | (256)               | 75M   | 4
MU    | MUMmer-GPU [R] small       | (1)                             | (100)               | 0.2M  | 1
NNC   | Nearest Neighbor [R]       | 4x(938)                         | 4x(16)              | 5.9M  | 8
NN    | Neural Network [G]         | (6,28) (25,28) (100,28) (10,28) | (13,13) (5,5) 2x(1) | 68M   | 5, 8
NQU   | N-Queen [G]                | (256)                           | (96)                | 1.2M  | 1
NW    | Needleman-Wun. [R]         | 2x(1) 2x(31) (32)               | 63x(16)             | 12M   | 2
RAY   | Ray Tracing [G]            | (16,32)                         | (16,8)              | 64M   | 3
SCN   | Scan [C]                   | (64)                            | (256)               | 3.6M  | 4

CROWN Backup

We define the SIMD efficiency as follows:
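The formula itself appears as an image in the original slide; a reconstruction consistent with the surrounding definitions (our rendering, using the deck's own symbols) would be:

    \[
      \text{SIMD efficiency} \;=\;
        \frac{\sum_{i=1}^{n\_issues} act\_thds[i]}{n\_issues \times \text{SIMD width}}
    \]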

Here n_issues is the number of cycles in which at least one SIMD lane is active, and act_thds[i] is the number of active lanes in the i-th such cycle.

Estimation of Hardware Overhead

- 2.5% estimated for the register file
- 1.7% estimated for the CROWN mechanism

Module          | #entries | #sets | Assoc. | Tag bits      | Row size (bits) | Tag area (mm2) | Data area (mm2)
Reconv. Barrier | 128      | 16    | 8      | 32+(8*(10-3)) | 32+(8*(10-3))+8 | 0.035          | 0.137
Second level    | 512      | 512   | 1      | -             | 32+(8*(10-3))+2 | -              | 0.123
Lookup Ready    | 8        | 1     | 8      | 32+(8*(10-3)) | 32+(8*(10-3))   | 0.020          | 0.091
Lookup Waiting  | 8        | 1     | 8      | 32+(8*(10-3)) | 32+(8*(10-3))   | 0.020          | 0.091
Total area: 0.5185 mm2

Multi-banked Register File [FungMICRO2007]

[Figure: multi-banked register file organization.]

Mechanism

- No branch divergence occurred: similar to SBR.
- Upon branch divergence:
  - Invalidate the diverged warp
  - Create two new warps, one per diverging path
  - Reserve a barrier to re-synchronize the diverged threads at the re-convergence point
- Reincarnation: once all threads hit the re-convergence barrier, the barrier is turned into an active warp.
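To make the mechanism more concrete, here is an illustrative C++ sketch of the divergence and reincarnation steps (the structure and function names are our own; the thesis's actual tables, described in the backup slides, are more elaborate):

    #include <cstdint>
    #include <vector>

    struct Warp { uint32_t pc; uint32_t mask; bool valid; };

    // A re-convergence barrier collects the threads of one original warp
    // as they reach the re-convergence PC.
    struct ReconvBarrier {
        uint32_t reconv_pc;
        uint32_t expected_mask;   // threads of the original warp
        uint32_t arrived_mask;    // threads that have reached reconv_pc
    };

    // On a divergent branch: invalidate the warp, create two new warps
    // (one per path), and reserve a re-convergence barrier.
    void on_divergence(Warp &w,
                       uint32_t taken_pc, uint32_t taken_mask,
                       uint32_t nottaken_pc, uint32_t reconv_pc,
                       std::vector<Warp> &ready,
                       std::vector<ReconvBarrier> &barriers)
    {
        uint32_t nottaken_mask = w.mask & ~taken_mask;
        w.valid = false;                                      // invalidate diverged warp
        ready.push_back({taken_pc,    taken_mask,    true});  // new warp on taken path
        ready.push_back({nottaken_pc, nottaken_mask, true});  // new warp on not-taken path
        barriers.push_back({reconv_pc, w.mask, 0u});          // reserve barrier
    }

    // Called when a thread subset reaches reconv_pc; returns true and a
    // "reincarnated" warp when the whole original warp has arrived.
    bool barrier_arrive(ReconvBarrier &b, uint32_t mask, Warp &out)
    {
        b.arrived_mask |= mask;
        if (b.arrived_mask != b.expected_mask) return false;
        out = {b.reconv_pc, b.expected_mask, true};           // reincarnation
        return true;
    }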

Operations Under State Machine

[Figure: state machine connecting the Second-Level store, the Ready and Waiting Lookups, the Issue pool, and the Re-convergence Barriers. Recently diverged warps wait, become ready as all their threads are reached, move to the Issue pool and then to the pipeline, and are swapped or evicted to the Second-Level store on commit or warp initialization.]

Microarchitecture

[Figure: CROWN microarchitecture with an 8-entry fully-associative Ready Lookup, an 8-entry fully-associative Waiting Lookup, a 128-entry 8-way set-associative Re-convergence Barrier store, a 512-entry Second-Level store, and an 8-entry Issue pool.]

(Speaker notes: Each thread is allowed to reserve one of a few entries (8-way here); otherwise the re-convergence is bypassed. The lookups are intended to improve SIMD efficiency. If the lookups are removed and the re-convergence barriers are large enough, the scheme is the same as SBR; if the re-convergence barriers are removed, it is DWF.)

Related Works

[Table: how related work addresses SIMD efficiency, diverging-path serialization, re-convergence waiting, and memory divergence. DWF: regrouping, activate all threads. DWS: heuristics. TBC: compaction, short warps. LWM: compaction. CAPRI: compaction, 2-bit saturating counter, short warps. CROWN: regrouping, activate all threads, short warps, short warps.]

PDOM

[Figure: per-warp PDOM stack snapshots and SIMD utilization for warps W0 and W1 running the A/B/C/D example; each warp serializes its own B and C paths.]

Dynamic regrouping of diverged threads on the same path increases utilization.

DWF

[Figure: warp-pool snapshots and SIMD utilization over time for the A/B/C/D example; diverged threads from W0 and W1 are dynamically regrouped into warps W2 and W3, exposing a merge possibility at D.]

(Speaker note: Notice that the warp pool needs to keep thread IDs instead of a mask vector; the warp colors show the potential of different thread placements.)

Large Warp Micro-architecture [NarasimanMICRO2011]

[Figure: large-warp micro-architecture from Narasiman et al., MICRO 2011.]

Operation Example - SBR

[Figure: status of concurrent threads over time under SBR (ready, inactive/masked, waiting at re-convergence, active, idle, terminated).]

Operation Example - CROWN

[Figure: the same example under CROWN.]

Branch Divergence Challenges

Branch divergence is a three-sided problem:
- SIMD efficiency
- Diverging-path serialization
- Re-convergence waiting

Understanding the Challenges

SIMD efficiency is the dominant challenge as long as there are enough parallel warps to hide the memory latency. Once there is not enough parallelism, the other two factors can impact performance significantly.

Branch Divergence Challenges (cont.)

[Figure: SIMD efficiency across the benchmarks.]

Branch Divergence Challenges (cont.)

[Figure: diverging-path serialization and re-convergence waiting across the benchmarks.]

(Speaker notes: These plots show the number of inactive/waiting threads during idle cycles; when idle cycles are significant, this large pool of inactive/waiting threads is an opportunity to improve performance.)

Design Space of CROWN

CROWN can be configured with different numbers of entries in:
- Fully-associative ready lookup: 4 or 8
- Fully-associative waiting lookup: 4 or 8
- Set-associative re-convergence barriers: 16, 32, 64, or 128 entries (fixed 8-way associativity)
- Second-level warps: we assume 256
- Issue pool size: we assume 8

Sensitivity to Ready Lookup Size

Up to 8% performance change.

Sensitivity to Waiting Lookup Size

Up to 10% performance reduction.

Sensitivity to Reconv. Barriers Size

Fewer re-convergence barriers move CROWN closer to DWF.

Compared to Previous Works

Configurations: 1024 threads per SM with 16-wide 8-stage SIMD; 1024 threads per SM with 8-wide 16-stage SIMD; 512 threads per SM with 8-wide 8-stage SIMD.

- 1024-thread, 16-wide, 8-stage SIMD: larger synchronization overhead
- 1024-thread, 8-wide, 16-stage SIMD: warps stay in the lookup for a shorter period of time
- 512-thread, 8-wide, 16-stage SIMD: the importance of concurrent threads is higher

DWR Backup

Sensitivity to SIMD Width

The DWR technique is much more effective under narrow SIMD.

Sensitivity to ILT Size

The ILT overflows in a few benchmarks.

[Table: number of LATs per benchmark and how many are ignored due to ILT overflow.]

Ignore List

Not all of the LATs are useful for coalescing:

Add the PC of such LATs to the ignore list table (ILT) to bypass future synchronization.

    (a)
    1: if (sub_warp_id == 0) {
    2:     regA = gmem[idxA];
    3: }
    4: regB = gmem[idxB];

    (b)
    1: if (sub_warp_id == 0) {
    2:     regA = gmem[idx];
    3: }
    4: __syncthreads();

Improving the Energy-Efficiency of Short Warps

Locality can be exploited to design an efficient pipeline front-end.

Streaming Multiprocessor (SM)

- Threads of the same thread block (CTA) communicate through fast shared memory
- They are synchronized through a fast synchronizer
- Each CTA is assigned to one SM

(Speaker note: Define warp, warp scheduler, and SIMD width; define the shared resources per SM: thread pool, shared memory, and register file; talk about memory access coalescing here.)

Introduction

- Why accelerators?
- SIMT accelerator overview
- Memory divergence
- Branch divergence
- Control-flow mechanisms
  - Post-dominator re-convergence (PDOM)
  - Dynamic warp formation

Why Accelerators?

- Heterogeneous systems achieve optimal performance per watt:
  - Superscalar, speculative, out-of-order processors for latency-intensive serial workloads
  - Multi-threaded, in-order SIMD processors for high-throughput parallel workloads
- 6 of the top 10 Top500.org supercomputers today employ accelerators [Dally2010]:
  - IBM Power BQC 16C 1.60 GHz (1st, 3rd, 8th, and 9th)
  - NVIDIA Tesla (6th and 7th)

Why Accelerators? (continued)

- GPUs are the most widely available accelerators
  - A class of general-purpose processors named SIMT
  - Integrated on the same die as the CPU (Sandy Bridge, etc.)
- Upcoming exa-scale computing demands an energy efficiency of 50 GFLOPS/W
  - A GPU achieves about 200 pJ/instruction; a CPU achieves about 2 nJ/instruction [Dally2010]