
Modeling and Minimizing Memory Contention in General-Purpose GPUs

Modelleren en minimaliseren van geheugencontentie in grafische processors voor algemeen gebruik

Lu Wang

Promotor: prof. dr. ir. L. Eeckhout
Proefschrift ingediend tot het behalen van de graad van

Doctor in de ingenieurswetenschappen: computerwetenschappen

Vakgroep Elektronica en Informatiesystemen
Voorzitter: prof. dr. ir. K. De Bosschere

Faculteit Ingenieurswetenschappen en Architectuur
Academiejaar


ISBN 978-94-6355-408-4
NUR 980, 987
Wettelijk depot: D/2020/10.500/85


Examination Committee

Prof. Filip De Turck, chair
Department of Information Technology,
Faculty of Engineering and Architecture
Ghent University

Prof. Koen De Bosschere, secretary
Department of Electronics and Information Systems,
Faculty of Engineering and Architecture
Ghent University

Prof. Lieven Eeckhout, supervisor
Department of Electronics and Information Systems,
Faculty of Engineering and Architecture
Ghent University

Prof. Jan Fostier
Department of Information Technology,
Faculty of Engineering and Architecture
Ghent University

Prof. Magnus Jahre
Department of Computer Science,
Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology

Dr. Cecilia Gonzalez Alvarez
Nokia Bell Labs Belgium

Prof. David R. Kaeli
Department of Electrical and Computer Engineering,
Northeastern University


Contents

Examination Committee i

Acknowledgements vii

Samenvatting ix

Summary xiii

List of Figures xvii

List of Abbreviations xxi

1 Introduction 1

1.1 GPU Architecture Trends . . . . . . . . . . . . . . . . . . . . . 1

1.2 GPU-Application Diversity . . . . . . . . . . . . . . . . . . . . 2

1.3 GPU Performance Modeling . . . . . . . . . . . . . . . . . . . . 3

1.4 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . 4

1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Background 9

2.1 Basics of GPU Architecture . . . . . . . . . . . . . . . . . . . . 9

2.1.1 GPU Thread Hierarchy . . . . . . . . . . . . . . . . . . 9

2.1.2 Streaming Multiprocessor . . . . . . . . . . . . . . . . . 11

2.1.3 Clustered GPU Architecture . . . . . . . . . . . . . . . 12

2.1.4 CTA Scheduling . . . . . . . . . . . . . . . . . . . . . . 13

2.2 Memory Divergence and Coalescing . . . . . . . . . . . . . . . . 13


2.3 GPU Performance Modeling . . . . . . . . . . . . . . . . . . . . 14

2.3.1 GPU Performance Analysis Tools . . . . . . . . . . . . . 14

2.3.2 Interval-Based Analytical Modeling . . . . . . . . . . . . 15

3 Intra-Cluster Coalescing 19

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 Motivation and Opportunity . . . . . . . . . . . . . . . . . . . . 20

3.2.1 NoC Bandwidth Bottleneck . . . . . . . . . . . . . . . . 21

3.2.2 Request Merging . . . . . . . . . . . . . . . . . . . . . . 21

3.2.3 Intra-Cluster Locality . . . . . . . . . . . . . . . . . . . 23

3.3 Intra-Cluster Coalescing (ICC) . . . . . . . . . . . . . . . . . . 26

3.3.1 ICC Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.2 Merge Table . . . . . . . . . . . . . . . . . . . . . . . . 27

3.3.3 Coalesced Cache . . . . . . . . . . . . . . . . . . . . . . 28

3.3.4 Cost Analysis . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.5.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5.2 Energy Efficiency . . . . . . . . . . . . . . . . . . . . . . 31

3.5.3 NoC Traffic . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5.4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . 33

3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 Distributed-Block CTA Scheduling 37

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 CTA Scheduling versus ICC . . . . . . . . . . . . . . . . . . . . 38

4.2.1 Scheduling Algorithms . . . . . . . . . . . . . . . . . . . 40

4.2.2 Performance Analysis . . . . . . . . . . . . . . . . . . . 41

4.3 Distributed-Block Scheduling . . . . . . . . . . . . . . . . . . . 42

4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.4.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . 43

4.4.2 Energy Efficiency . . . . . . . . . . . . . . . . . . . . . . 45

4.4.3 NoC Traffic . . . . . . . . . . . . . . . . . . . . . . . . . 45


4.4.4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . 48

4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 The GPU Memory Divergence Model 51

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 MDM Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.3 Modeling Memory Divergence . . . . . . . . . . . . . . . . . . . 56

5.3.1 Key Performance Characteristics . . . . . . . . . . . . . 56

5.3.2 Interval Analysis . . . . . . . . . . . . . . . . . . . . . . 60

5.3.3 The Memory Divergence Model (MDM) . . . . . . . . . 61

5.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 64

5.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.5.1 Model Accuracy . . . . . . . . . . . . . . . . . . . . . . 66

5.5.2 Sensitivity Analyses . . . . . . . . . . . . . . . . . . . . 68

5.6 Real Hardware Results . . . . . . . . . . . . . . . . . . . . . . . 71

5.7 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.7.1 Design Space Exploration . . . . . . . . . . . . . . . . . 72

5.7.2 DVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.7.3 Validating the Observations . . . . . . . . . . . . . . . . 76

5.7.4 Streaming L1 Cache . . . . . . . . . . . . . . . . . . . . 77

5.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6 Conclusion and Future Work 81

6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.2.1 Software/Hardware Coordinated CTA Scheduling . . . . 83

6.2.2 MSHR Management . . . . . . . . . . . . . . . . . . . . 84

6.2.3 Register File Organization . . . . . . . . . . . . . . . . . 85

6.2.4 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6.2.5 Extending MDM Model for Multi-Module GPUs . . . . 86

Bibliography 89


Acknowledgements

Time flies; I am still surprised that I have already finished my PhD. The past four years in Gent have truly been a memorable time in my life.

I express my deep gratitude toward my advisor, Professor Lieven Eeckhout, for his valuable guidance both in work and in life during my PhD. Without him, I could not have finished my PhD successfully. He taught me how to do research in a systematic way and how to write an academic paper. He also showed me how to present my work in a lecture. These lessons will always be at the foundation of my professional life. Lieven is a generous and patient professor who always enjoys discussing problems with his students. I still remember that he devoted a large amount of time to my first paper because of my poor English writing. He said he really enjoyed working with his students, and those words encouraged me to work harder in the following years. Meanwhile, his great passion for research inspired me to become a better researcher. It has been an honor to work with him and learn from him over the past four years.

Many thanks to the members of my examination committee for their valuable feedback and insightful questions, which helped me improve my thesis. I especially want to thank David and Magnus, since they helped me a lot during my PhD. Their collaboration and guidance contributed to several publications during my PhD. David is a true expert in the field of GPU architecture. He gave me many suggestions in my first project and much encouragement for my future research. I worked with Magnus on the MDM project for almost two and a half years. I was constantly questioning myself when I started working on this project, since performance modeling was really new to me. Fortunately, Magnus was willing to help me with any problem in this research, and he gave me the confidence to investigate this topic further. I also want to thank Koen for organizing the HiPEAC summer school. It is a great place to learn about the latest research in computer architecture and to have detailed discussions with experts in our area. Special thanks to Cecilia for serving on my committee and for mentoring me when I joined the research group.

I want to thank all my lab colleagues: Almutaz Adileh, Ajeya Naithani, Cecilia Gonzalez Alvarez, Shoaib Akram, Sander De Pestel, Sam Van den Steen, Kartik Lakshminarasimhan, Xia Zhao, Wenjie Liu, Shiqing Zhang, Yuxi Liu and Jennifer Sartor, for the discussions in our group meetings and for their valuable suggestions during my PhD. I especially want to thank Almutaz for his kindness. He was always patient and ready to help me with any problem. I also want to mention Xia and Yuxi, since they gave me a lot of help both in research and in life, especially during my first year. I really cherish my friendship with Wenjie and Shiqing; our lunch meetings were an unforgettable experience in Gent. Moreover, I would like to thank the department staff, especially Marnix, Vicky and Ronny, for their help with technical and practical issues. Marnix provided generous help in cases such as arranging flights for conferences.

I would like to thank all my friends in Gent: Xin Cheng, Luyuan Li, Lei Luo, Boxuan Gao, Sheng Yan, Yun Zhou, MeiZhu Li and Zhongjia Yu, among others. Without them, I would not have had such a splendid and happy life over the past four years. I will never forget our wonderful friendship. I also want to thank my previous Chinese advisor, Prof. Zhiying Wang. He gave me endless support in applying for the CSC scholarship to pursue my PhD abroad.

In the end, I want to thank my parents. They brought me into the world and gave me endless love. I am so lucky to live in such a happy family. They are open-minded and have supported every decision I have made.

Thanks again to my advisor, my collaborators, my lab members, my friends, and my family. This PhD is not only for me, but also for you. I will forever cherish all the enjoyable moments we shared. The unforgettable experience in this beautiful city will be a great fortune for my whole life.

Lu Wang
Gent, September 6, 2020


Samenvatting

Throughput processors such as Graphics Processing Units (GPUs) are important components in modern computer systems because of their ability to accelerate data-parallel applications. GPUs are therefore widely used to run a broad range of compute-intensive applications in, among others, scientific computing, machine learning, artificial intelligence, data analytics and medical imaging. To simplify GPU programming, new programming languages such as CUDA and OpenCL have been developed. A GPU application typically consists of a number of kernels, where each kernel consists of hundreds of thousands of threads that are in turn grouped into so-called Cooperative Thread Arrays (CTAs). A wide range of libraries and frameworks further simplify programming a GPU. As a result, GPUs play an increasingly important role for cloud providers such as Amazon, Google, IBM and Facebook.

At the same time, hardware designers keep improving and refining GPU designs to deliver ever higher compute capability and memory bandwidth. Nvidia's latest Volta GPU, for example, integrates 84 Streaming Multiprocessors (SMs) and provides 900 GB/s of memory bandwidth. This trend will continue through new technologies, such as GPUs built from multiple modules or chiplets. Developing and optimizing new GPU generations for an ever broader set of applications is highly challenging. Substantial innovations in infrastructure and methodology are therefore required to simulate and model future GPU generations. In addition, the GPU architecture itself must be optimized to execute emerging applications as fast and as efficiently as possible.

In this dissertation, we first focus on optimizing the interconnection network in modern clustered GPUs. Clustering groups several SMs into a cluster to reduce the pressure on the interconnection network, which makes the network easier to scale to GPUs with many tens of SMs. Clustering, however, leads to congestion at the network injection ports. In this work, we exploit the locality that exists among CTAs mapped to the same cluster. We propose Intra-Cluster Coalescing (ICC) and the Coalesced Cache (CC) to eliminate redundant requests within a cluster and thereby reduce the amount of traffic over the interconnection network. In addition, we propose Distributed-Block Scheduling (DBS) to exploit locality at the level of an individual SM as well as at the cluster level. Second, we observe in this dissertation that memory divergence is widespread, meaning that a single instruction leads to memory accesses to many different locations in memory. Memory divergence is a performance bottleneck for many new and important GPU applications, for example in machine learning and data analytics. Existing analytical performance models, however, fail to obtain accurate performance estimates for memory-divergent GPU applications. In this dissertation, we propose the Memory Divergence Model (MDM), which leads to a substantial improvement in both accuracy and evaluation speed.

Intra-cluster coalescing (ICC). The increasing performance of modern GPUs is enabled by integrating an ever larger number of SMs. To keep the interconnection network that connects the SMs to memory scalable, a clustered architecture has been introduced in modern GPUs in which several SMs are grouped into a cluster with a single network port per cluster. Sharing a network port, however, leads to congestion when multiple SMs want to inject packets into the network at the same time, which degrades performance. In this dissertation, we observe that on average 19% (and up to 48%) of the packets sent from a cluster are redundant; in other words, different SMs in a cluster access the same data within a short time window. We propose Intra-Cluster Coalescing (ICC) and the Coalesced Cache (CC) to reduce the pressure on the interconnection network. ICC merges requests to the same data element or cache line into a single request to reduce the number of requests. To extend the time window within which requests can be merged, we add the CC, which enables ICC to keep track of recently coalesced data elements (cache lines) and thereby further reduce the number of requests sent over the interconnection network. ICC and CC improve performance by 15% on average (and up to 69%). In addition, ICC/CC reduces energy consumption by 5.3% on average (and up to 16.7%), which leads to an improvement in Energy-Delay Product (EDP) of 12% on average (and up to 30%).

Distributed-block scheduling (DBS). We also show in this dissertation that there is an important interaction between ICC and CTA scheduling (the scheduling or mapping of CTAs onto SMs and clusters of SMs). We observe that ICC benefits from mapping neighboring CTAs to the same cluster so that the existing inter-CTA locality can be exploited. Based on this observation, we propose Distributed-Block Scheduling (DBS). In contrast to prior work, DBS exploits locality both within an SM and within a cluster. This two-level approach maximizes the opportunity to exploit locality: groups of consecutive CTAs are first mapped to clusters, after which pairs of consecutive CTAs are mapped together per SM. Locality is thereby exploited at the level of the L1 cache (i.e., the reuse distance between two consecutive accesses to the same cache line is shortened, which leads to a higher cache hit rate) and at the level of the Miss Status Handling Registers (MSHRs) (i.e., in case of an L1 miss, multiple accesses to the same cache line are merged). We show that DBS improves performance by 4% on average (and up to 16%), and reduces energy consumption and EDP by 1.2% and 5%, respectively. Moreover, we show that DBS is complementary to ICC/CC, leading to an average performance improvement of 16% (and up to 67%) and an energy reduction of 6% (and up to 18%).

Memory divergence model (MDM). Analytical performance models enable computer architects to explore enormous design spaces efficiently, many orders of magnitude faster than detailed simulation. High accuracy is of course required to reach correct conclusions. In this dissertation, we focus on memory-divergent GPU applications, which are increasingly widespread in, among others, machine learning and data analytics. Their lack of spatial locality leads to frequent blocking at the cache because an SM initiates substantially more memory accesses than the cache can support. As a result, the GPU is no longer able to hide main memory access latency through thread-level parallelism. In this dissertation, we propose the Memory Divergence Model (MDM), which models the key execution characteristics of memory-divergent GPU applications, including the serialization of memory accesses and the increased queueing delays in the interconnection network and main memory. We validate MDM against detailed simulation as well as real hardware, and report substantial improvements in (1) scope: MDM models both memory-divergent and non-memory-divergent GPU applications; (2) speed: MDM uses dynamic binary instrumentation and is thereby 6.1× faster than modeling based on functional simulation; and (3) accuracy: MDM's average error is limited to 13.9%, a substantial improvement over the state-of-the-art GPUMech model with an average error of 162%. We furthermore show that MDM is useful both for design space exploration and for evaluating dynamic voltage and frequency scaling.

We conclude that GPUs are widely used as accelerators for compute-intensive applications. In this dissertation, we investigate the challenges and opportunities of new GPU architectures and their applications. More specifically, we exploit the locality among CTAs to relieve the pressure on the interconnection network in clustered GPU architectures. To this end, we propose three complementary innovations: Intra-Cluster Coalescing, the Coalesced Cache and Distributed-Block Scheduling. These innovations eliminate redundant memory accesses, which leads to a substantial improvement in performance and energy efficiency. In addition, we show that memory-divergent applications behave fundamentally differently from more conventional non-memory-divergent applications. We propose the Memory Divergence Model to accurately model the performance of such applications analytically.


Summary

Throughput processors such as Graphics Processing Units (GPUs) are widely used to accelerate a wide range of emerging throughput-oriented applications, e.g., scientific computing, machine learning, artificial intelligence, data analytics, medical imaging, etc. To ease programming, bulk-synchronous parallel (BSP) programming models such as OpenCL and CUDA have been developed, in which a GPU application is typically divided into several kernels and each kernel consists of hundreds of thousands of threads grouped in Cooperative Thread Arrays (CTAs). A variety of libraries and frameworks have been developed on top of CUDA/OpenCL to further alleviate the burden on GPU programmers. As a result, this programmability has made GPU platforms a top choice for the cloud infrastructures of major companies such as Amazon, Google, IBM and Facebook.

At the same time, hardware designers devote considerable effort to scaling GPU performance through improvements in compute capability as well as memory bandwidth. For instance, the latest Nvidia Volta GPU integrates 84 SMs and delivers 900 GB/s peak memory bandwidth. This trend will continue thanks to new technologies such as multi-module GPUs. However, adapting to these new architectures and applications is challenging and touches many aspects of architecture design. For instance, infrastructure such as analytical models and simulators must be developed to enable in-depth performance analysis of new applications and architectures.

In this thesis, we first focus on reducing Network-on-Chip (NoC) pressure in modern-day clustered GPU architectures. In particular, a cluster structure is implemented to group several SMs together to address the NoC scalability problem. However, it exacerbates NoC congestion due to port sharing within a cluster. Fortunately, inter-CTA locality in many GPU-compute applications provides an opportunity to cope with this challenge. In response, we propose intra-cluster coalescing (ICC) and the coalesced cache (CC) unit to eliminate redundant requests among SMs in the same cluster and significantly reduce NoC traffic. In addition, we propose distributed-block CTA scheduling for clustered GPU architectures, which exploits locality at both the SM level and the cluster level. Second, we observe that memory divergence is prevalent and commonly becomes the performance bottleneck in emerging GPU applications such as machine learning and data analytics. Unfortunately, state-of-the-art GPU analytical models focus mainly on traditional non-memory-divergent applications and fail to capture the performance behavior of memory-divergent applications. In this thesis, we propose the memory divergence model (MDM), which provides significant prediction accuracy improvements at faster modeling speed.

Intra-cluster coalescing to reduce NoC pressure. GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this thesis, we observe that in many GPU-compute applications, an average of 19% (and up to 48%) of the requests within a cluster are redundant, which wastes limited NoC bandwidth. In response, we propose intra-cluster coalescing (ICC) and the coalesced cache (CC) to reduce NoC pressure in clustered GPUs. In particular, ICC coalesces outstanding misses among SMs in the same cluster when they access the same cache lines. To extend the time window for coalescing, the CC complements ICC by keeping track of recently coalesced cache lines; L1 misses that hit in the CC further reduce NoC traffic. Our experiments show that ICC along with CC leads to an average 15% (and up to 69%) performance improvement, while at the same time reducing system energy by 5.3% on average (and up to 16.7%) and energy-delay product by 12% on average (and up to 30%).

Distributed-block scheduling policy. By investigating inter-CTA locality, we demonstrate the significant interaction between ICC and the CTA scheduling policy. We find that ICC benefits more when the CTA scheduling policy maps neighboring CTAs to the same cluster to better exploit inter-CTA locality. Motivated by this observation, we propose distributed-block scheduling. In contrast to prior work, distributed-block scheduling exploits cache locality at both the cluster level and the SM level. This two-level approach maximizes the opportunities to exploit inter-CTA locality at the L1 cache within an SM as well as between SMs in the same cluster by first mapping a group of consecutive CTAs at the cluster level, and by subsequently mapping pairs of consecutive CTAs at the SM level. Inter-CTA locality is exploited to improve L1 cache performance (decreasing the reuse distance between accesses to the same memory location, thereby increasing the L1 cache hit rate) and to increase the coalescing opportunities in the L1 miss status handling registers (MSHRs). Through execution-driven GPU simulation, we find that distributed-block scheduling improves GPU performance by 4% (and up to 16%) while at the same time reducing system energy and EDP by 1.2% and 5%, respectively, compared to the state-of-the-art distributed scheduling policy. In addition, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% (and up to 67%) and reduce system energy by 6% (and up to 18%).

Memory Divergence Model. Analytical models enable architects to carry out early-stage design space exploration several orders of magnitude faster than cycle-accurate simulation by capturing performance-related behavior through a set of mathematical equations. However, this speed advantage is void if the conclusions obtained through the model are misleading due to model inaccuracies. In this work, we focus on analytically modeling the performance of emerging memory-divergent GPU-compute applications, which are common in domains such as machine learning and data analytics. The poor spatial locality of these applications leads to frequent L1 cache blocking, because the application issues significantly more concurrent cache misses than the cache can support, which cripples the GPU's ability to use Thread-Level Parallelism (TLP) to hide memory latencies. Motivated by this observation, we propose the GPU Memory Divergence Model (MDM), which faithfully captures the key performance characteristics of memory-divergent applications, including memory request batching and excessive NoC/DRAM queueing delays. We validate MDM against detailed simulation and real hardware, and report substantial improvements in (1) scope: the ability to model emerging memory-divergent applications in addition to traditional non-memory-divergent applications; (2) practicality: 6.1× faster model evaluation by computing model inputs using binary instrumentation as opposed to functional simulation; and (3) accuracy: a 13.9% average prediction error versus 162% for the state-of-the-art GPUMech model. In addition, the MDM model is useful for design space exploration as well as for evaluating dynamic voltage and frequency scaling (DVFS).

In summary, GPUs have emerged as high-performance computing accelerators in modern computer systems. In this thesis, we investigate both the challenges and the opportunities of new-generation GPU architectures and emerging GPU applications. More specifically, we make use of the inter-CTA locality in GPU-compute applications and propose intra-cluster coalescing (ICC), the coalesced cache (CC) and the distributed-block CTA scheduling policy to eliminate redundant memory requests and reduce NoC pressure in clustered GPUs. In addition, we observe the distinct performance behavior of emerging memory-divergent GPU applications and propose the MDM analytical GPU performance model to achieve fast design space exploration and accurate performance prediction.


List of Figures

2.1 GPU thread hierarchy: a GPU kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other. . . . . . . . . . . . . . . . . . . . . 10

2.2 SM architecture: each SM includes a large register file, several caches, and 32 CUDA cores executing threads in a SIMT manner. 10

2.3 Clustered GPU architecture: SMs within a cluster go through the NoC to access the L2 cache and main memory to serve L1 cache misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 CTA scheduling policies: A 2-level round-robin scheduling policy allocates neighboring CTAs to different clusters while a distributed scheduling policy maps neighboring CTAs to the same cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Interval analysis: An interval is defined as a sequence of instructions at the maximum issue rate followed by stall cycles. . . . . 15

2.6 Generating interval profiles through instruction dependence analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Quantifying the NoC bottleneck: Normalized IPC when varying the NoC and LLC frequency from 0.25× to 2×. NoC (and LLC) bandwidth is a severe performance bottleneck. . . . . . . . . . 21

3.2 Intra-cluster locality (fraction of redundant requests versus the total number of requests in a cluster) as a function of a past window of requests under the distributed scheduling policy [24]. A distinction is made between cache line sharing and data sharing. A substantial fraction of NoC requests are redundant because of intra-cluster locality due to cache line sharing or data sharing. . 24

3.3 Data sharing in LUD. L11 is reused for calculating submatrices U12 and U13 (reuse along rows), while U11 is reused for calculating submatrices L21 and L31 (reuse along columns). . . . . . . 25


3.4 The intra-cluster coalescing (ICC) unit merges L1 cache misses across SMs within a cluster. The coalesced cache (CC) keeps track of recently coalesced cache lines. . . . . . . . . . . . . . . 26

3.5 IPC improvement for intra-cluster coalescing (ICC) and the coalesced cache (CC) under distributed scheduling. ICC significantly improves performance by 9.7% on average; ICC along with CC yields an average 15% performance improvement. . . . 31

3.6 Energy consumption breakdown normalized to distributed scheduling (D). ICC along with CC reduces system energy by 5.3% on average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.7 EDP reduction for ICC and CC. ICC along with CC reduces system EDP by 16.7% on average. . . . . . . . . . . . . . . . . 32

3.8 NoC traffic (number of NoC read requests) reduction for ICC and CC. ICC along with CC reduces NoC traffic by 19.5% on average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.9 IPC improvement for ICC and CC as a function of the number of clusters while keeping total SM count constant at 60 SMs. ICC along with CC consistently improves performance across different cluster sizes and effective NoC port count per SM. . . 34

4.1 Illustrating the five CTA scheduling algorithms for a 10-CTA workload. We assume a GPU architecture with 2 clusters with 2 SMs each; we can allocate at most 2 CTAs per SM. The top row shows the initial mapping of CTAs to clusters and SMs; the bottom row shows the mapping of the next CTA to schedule after CTA 1 finishes its execution. . . . . . . . . . . . . . . . . 39

4.2 Normalized IPC for two-level round-robin, greedy-clustering, global round-robin and distributed CTA scheduling. Distributed scheduling outperforms the other three policies on average. . . . . . . . 41

4.3 Intra-cluster locality for the different CTA scheduling policies. CTA scheduling policies have a substantial impact on the exploitable intra-cluster locality, and distributed scheduling yields the highest opportunity. . . . . . . . . . . . . . . . . . . . . . . 42

4.4 IPC improvement for distributed-block scheduling versus distributed scheduling. Distributed-block scheduling improves performance by 4% on average (up to 16%). . . . . . . . . . . . . . 44

4.5 IPC improvement for distributed-block scheduling with ICC and CC versus distributed scheduling. Distributed-block scheduling with ICC and CC yields an average of 16% (up to 67%) performance improvement. . . . . . . . . . . . . . . . . . . . . . . . . 44

4.6 Energy consumption breakdown normalized to distributed scheduling (D). Distributed-block scheduling with ICC and CC reduces system energy by 6% on average. . . . . . . . . . . . . . . . . . 46


4.7 EDP reduction for distributed-block scheduling without and with ICC plus CC versus distributed scheduling. Distributed-block scheduling by itself reduces system EDP by 5%; together with ICC and CC it reduces system EDP by 19% on average. . . . . 47

4.8 NoC traffic (number of NoC read requests) reduction for distributed-block scheduling, ICC and CC versus distributed scheduling. Distributed-block scheduling by itself reduces NoC traffic by 6% on average; together with ICC and CC it reduces NoC traffic by 20% on average. . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.9 IPC improvement for distributed-block scheduling with ICC and CC versus distributed scheduling as a function of the number of clusters while keeping total SM count constant at 60 SMs. Distributed-block scheduling with ICC and CC consistently improves performance across different cluster sizes and effective NoC port count per SM. . . . . . . . . . . . . . . . . . . . . . . 48

5.1 The key components of MDM-based performance prediction. . 55

5.2 Cache behavior for the example kernel in Listing 5.1 with two different grid strides. . . . . . . . . . . . . . . . . . . . . . . . . 57

5.3 L1 miss latency breakdown for selected GPU compute applications. Delays due to insufficient MSHRs as well as queueing delays in the NoC and DRAM subsystem significantly affect the overall memory access latency of MD-applications, while NMD-applications are hardly affected. . . . . . . . . . . . . . . . . . . 58

5.4 Example explaining why MSHR utilization results in significantly different performance-related behavior for NMD and MD-applications. MD-applications put immense pressure on the L1 cache MSHRs and thereby severely limit the GPU's ability to use TLP to hide memory latencies. . . . . . . . . . . . . . . . . 59

5.5 IPC prediction error for our NMD and MD-benchmarks under different performance models. MDM significantly reduces the prediction error for the MD-applications. . . . . . . . . . . . . . 67

5.6 Prediction error as a function of NoC bandwidth for the MD-applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.7 Prediction error as a function of DRAM bandwidth for the MD-applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.8 Prediction error as a function of the number of MSHR entries for the MD-applications. . . . . . . . . . . . . . . . . . . . . . . 70

5.9 Prediction error as a function of SM count for the MD-applications. 70


5.10 Hardware validation: relative IPC prediction error for GPUMech and MDM compared to real hardware. MDM achieves high prediction accuracy against real hardware, with an average prediction error of 40% versus 164% for GPUMech (using binary instrumentation). . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.11 Normalized performance for two NMD-applications (BT and HS) as a function of SM count and DRAM bandwidth. Results are normalized to the simulation results at 28 SMs and 480 GB/s DRAM bandwidth. Both GPUMech and MDM capture the performance trend. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.12 Normalized performance for three MD-applications (CFD, BFS and PVC) as a function of SM count and DRAM bandwidth. All results are normalized to the simulation results at 28 SMs and 480 GB/s DRAM bandwidth. GPUMech not only leads to high prediction errors, it also over-predicts the performance speedup with more SMs and memory bandwidth, in contrast to MDM. . 74

5.13 Prediction error for predicting the relative performance difference at 1.4 GHz versus 2 GHz for the MD-applications for GPUMech, CRISP and MDM. The general-purpose MDM model achieves similar accuracy as the special-purpose CRISP. . . . . . 75

5.14 Normalized CPI as a function of NoC/DRAM bandwidth for two memory-divergent benchmarks (BFS and CFD). MDM accurately captures MSHR batching and NoC/DRAM queueing delays, in contrast to GPUMech. . . . . . . . . . . . . . . . . . . 76

5.15 Normalized CPI as a function of NoC/DRAM bandwidth for the non-memory-divergent BP benchmark. Both GPUMech and MDM capture the performance trend since the number of MSHRs is sufficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.16 Relative IPC prediction error with a streaming L1 cache. MDM improves accuracy compared to GPUMech because it models batching behavior caused by NoC saturation. . . . . . . . . . . 77


List of Abbreviations

ALU Arithmetic Logic Unit

API Application Programming Interface

BSP Bulk-Synchronous Parallel

CC Coalesced Cache

CPU Central Processing Unit

CTA Cooperative Thread Array

DPKI Divergent loads Per Kilo Instructions

DRAM Dynamic Random Access Memory

DVFS Dynamic Voltage and Frequency Scaling

EDP Energy-Delay Product

GB Gigabytes

GDDR Graphics Double Data Rate

GPU Graphics Processing Unit

GS Grid Stride

GTO Greedy Then Oldest

ICC Intra-Cluster Coalescing

ICL Intra-Cluster Locality

IPC Instructions Per Cycle

ISA Instruction-Set Architecture

KB Kilobytes

LD Load

L1 Level-1 Cache


L2 Level-2 Cache

LLC Last-Level Cache

LRR Loosely Round Robin

LRU Least-Recently Used

MC Memory Controller

MD Memory Divergent

MDM Memory Divergence Model

ML Machine Learning

MSHR Miss Status Handling Registers

NMD Non-Memory Divergent

NoC Network-on-Chip

PTX Parallel Thread Execution

RAW Read After Write

SASS Shader Assembler

SIMT Single Instruction Multiple Thread

SM Streaming Multiprocessor

ST Store

TB Thread Block

TLP Thread-Level Parallelism

VC Virtual Channel


Chapter 1

Introduction

This chapter introduces the dissertation’s key contributions.

1.1 GPU Architecture Trends

Throughput processors such as Graphics Processing Units (GPUs) are widely used to accelerate a broad range of emerging throughput-oriented applications, e.g., scientific computing, machine learning, artificial intelligence, data analytics and medical imaging [34, 58, 93]. To ease programming, bulk-synchronous parallel (BSP) programming models such as OpenCL and CUDA have been developed, in which a GPU application is typically divided into several kernels and each kernel consists of hundreds of thousands of threads. Using Nvidia's terminology¹, Streaming Multiprocessors (SMs) are featured in GPUs to run a massive number of parallel threads, grouped in Cooperative Thread Arrays (CTAs). Threads on each SM are executed at the granularity of a warp, which is essentially a collection of (usually 32) threads running in lockstep.

¹We use Nvidia's terminology without loss of generality. GPUs designed by other companies such as AMD and Intel use a similar organization.
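As a concrete, purely illustrative CUDA sketch of this hierarchy (none of the names or sizes are taken from the thesis), the kernel below is launched as a grid of CTAs of 256 threads each, i.e., 8 warps of 32 threads per CTA:

#include <cuda_runtime.h>

// Each thread computes one element; the hardware groups the 256 threads of a
// CTA into 8 warps of 32 threads that issue in lockstep on an SM.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // CTA index * CTA size + thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

void launchVecAdd(const float *d_a, const float *d_b, float *d_c, int n) {
    dim3 cta(256);                       // threads per CTA (8 warps of 32 threads)
    dim3 grid((n + cta.x - 1) / cta.x);  // enough CTAs to cover all n elements
    vecAdd<<<grid, cta>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();
}

The kernel is the unit launched by the host; the grid, CTA and warp levels map directly onto the hardware scheduling hierarchy described above.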

GPUs continue to boost the number of SMs with each generation to support demanding applications. Whereas the Nvidia Fermi GPU implemented 16 SMs, the latest Nvidia Pascal [13] and Volta GPUs [15] feature 60 and 84 SMs, respectively. Although transistor scaling has dramatically slowed down and limits this trend within a single-GPU system [16], the potential deployment of multi-module GPUs [24] and multi-socket NUMA-aware GPUs [86] will further increase the overall SM count. In contrast, bandwidth in the memory subsystem is less scalable and becomes the performance bottleneck, especially for memory-intensive applications. In order to bridge this gap, prior work has proposed various solutions to improve DRAM bandwidth [30, 31, 94]. For instance, recent GPUs adopt on-package stacked DRAM [13] to provide high bandwidth. These stacked memories, such as High Bandwidth Memory (HBM) [55], allow the processor and memory to communicate via short links within a package to improve performance and save energy. Other solutions include fine-grained DRAM (FGDRAM) [94], which partitions the DRAM die into many independent units, or subchannel DRAM architectures [31].

There are also architectural optimizations on the SM side with respect to register utilization [18, 56, 65, 69, 71] and warp scheduling [43, 57, 59, 90, 99, 100]. State-of-the-art GPU register files are generally over-provisioned and statically allocated to meet peak performance targets, while utilization remains low [18, 71, 88, 113]. Based on this observation, some techniques propose to manage register files dynamically based on register lifetime analysis to save energy [56, 65, 69]. Since the warp is the basic scheduling unit, warp scheduling policies have also been studied in depth; these studies either aim to avoid cache thrashing [57, 59, 99, 100, 102] or to mitigate control-flow divergence [43, 90].

Unfortunately, only a limited number of proposals focus on memory contention in the network-on-chip (NoC) [28, 66, 128] and the last-level cache (LLC) [126, 127], which warrants further exploration.

1.2 GPU-Application Diversity

Thanks to their high compute power, GPUs have become a popular choice for high-performance computing (HPC) systems [7], machine learning [17, 58, 72, 104] and data analytics applications in large-scale cloud installations and personal computing devices [20, 44]. According to recent reports [3, 7], many of the world's leading supercomputer systems use Nvidia Tesla accelerators, including Titan, Lomonosov and Piz Daint. Application diversity leads to different execution behavior across GPU workloads, which makes it challenging to design a GPU that suits these diverse application characteristics well. In this thesis, we observe two characteristics of GPU applications:

Inter-CTA data locality: A CTA is a group of threads that cooperate with each other by synchronizing their execution. Traditionally, intra-CTA, and especially intra-warp, locality is the most common and obvious form of data locality present in GPU-compute applications. To exploit this characteristic, a memory coalescing unit merges multiple memory accesses to the same cache line within the same warp before sending the request to the L1 cache [49]. In contrast, we observe a high degree of inter-CTA locality in many GPU-compute applications, as different CTAs access the same cache line or access the same read-only data.
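As a hypothetical illustration of inter-CTA locality (this kernel is not one of the thesis benchmarks): every CTA below processes one matrix row but reads the same read-only coefficient vector, so CTAs that run on the same cluster request identical cache lines shortly after one another.

// Hypothetical kernel illustrating inter-CTA (read-only) data locality:
// every CTA reads the same 'coeff' array, so CTAs scheduled to the same
// cluster miss on the same cache lines within a short time window.
__global__ void scaleRows(const float *in, const float *coeff,
                          float *out, int rows, int cols) {
    int row = blockIdx.x;                 // one CTA per matrix row
    if (row >= rows) return;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        out[row * cols + c] = in[row * cols + c] * coeff[c];  // shared read-only data
}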

Memory divergence: Several contemporary GPU applications differ from traditional GPU-compute workloads in that they put a much larger strain on the memory system. More specifically, they are memory-intensive and memory-divergent. These applications typically have strided or data-dependent access patterns which cause the accesses of the concurrently executing threads to be divergent, as loads from different threads access different cache lines. In this thesis, we observe that memory-divergent (MD) applications are prevalent across several benchmark suites. The poor spatial locality of these applications leads to frequent L1 cache blocking due to issuing more concurrent cache misses than the cache can support, which cripples the GPU's ability to use Thread-Level Parallelism (TLP) to hide memory access latencies.
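A hypothetical example of such a memory-divergent pattern (again, not a benchmark from the thesis) is a data-dependent gather: because the index is data-dependent, the 32 threads of a warp typically touch 32 distinct cache lines, so a single load instruction generates many concurrent L1 misses.

// Hypothetical memory-divergent kernel: the data-dependent index makes the
// threads of a warp load from unrelated cache lines, so one load instruction
// fans out into up to 32 L1 misses and quickly fills the MSHRs.
__global__ void gather(const int *idx, const float *table, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];  // divergent: neighboring threads hit distant addresses
}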

1.3 GPU Performance Modeling

Analyzing and optimizing a GPU architecture for the broad diversity of modern-day GPU-compute applications is challenging. Simulation is arguably the most commonly used evaluation tool as it enables detailed, even cycle-accurate, analysis. However, simulation is excruciatingly slow, and parameter sweeps commonly require thousands of CPU hours. In addition, simulators need to be modified to support new generations of GPU architectures and emerging GPU applications [63, 76, 98]. An alternative approach is modeling, which captures the key performance-related behavior of the architecture in a set of mathematical equations and is much faster to evaluate than simulation.

Modeling can be broadly classified into machine-learning (ML) based modeling versus analytical modeling. ML-based modeling [40, 81, 118] requires offline training to infer or learn a performance model. A major limitation of ML-based modeling is that a large number (typically thousands) of training examples are needed to infer a performance model. These training examples are obtained through detailed simulation, which leads to a substantial one-time cost. Moreover, extracting insight from an ML-based performance model is not straightforward, i.e., an ML-based model is a black box. Analytical modeling [26, 50, 51, 103, 115, 124], on the other hand, derives a performance model from a fundamental understanding of the underlying architecture, driven by first principles. Analytical models provide deep insight, i.e., the model is a white box, and the one-time cost is small once the model has been developed. The latter is extremely important when exploring large design spaces, making analytical models ideally suited for fast, early-stage architecture exploration. There are also special-purpose models for runtime optimization [36, 91] based on hardware performance counters; unfortunately, they are not suited for design space exploration.

1.4 Motivation

In this thesis, we target two dominant challenges related to modern GPU architectures and their emerging applications:

Challenge #1: Reducing NoC pressure in clustered GPUs. Graphics Processing Units (GPUs) are popular in modern computing systems because they provide high performance for a wide class of general-purpose applications. In particular, the SMs feature private L1 caches and are connected to the L2 cache and memory controllers (MCs) through a Network-on-Chip (NoC). The trend of an increasing number of SMs poses a scalability challenge for the NoC. Typically, a crossbar is deployed due to its low latency and high bandwidth [13]. However, a crossbar NoC faces scalability issues as hardware cost increases quadratically with increasing port count. To address the GPU NoC scalability challenge, a cluster structure is implemented in modern-day GPUs to group several SMs into a cluster. For example, Pascal supports 6 clusters, with each cluster consisting of 10 SMs [13]; Volta features 14 SMs per cluster for the same number of clusters [15]. By sharing NoC ports among SMs in a cluster, the total number of ports to the network is reduced, and so is the overall hardware cost of the crossbar NoC. Previous research has shown that NoC congestion is a severe GPU performance bottleneck for many memory-intensive applications [28, 66, 128]. Unfortunately, clustered GPUs further exacerbate this performance issue: by sharing ports among SMs in a cluster, congestion significantly increases as SMs in a cluster need to compete with each other for network bandwidth. This creates a new and critical performance challenge for the NoC in clustered GPU organizations. In this thesis, we focus on how to efficiently reduce the NoC pressure in a clustered GPU. Fortunately, inter-CTA locality in GPU applications provides a potential solution. However, several problems still need to be solved. First, how do we map inter-CTA locality to locality within a cluster? Second, how do we efficiently eliminate the redundant NoC traffic?
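As a back-of-the-envelope illustration of the quadratic crossbar cost argument above (assuming a symmetric n-port crossbar; the numbers are for intuition only and are not taken from the thesis):

\text{cost}_{\text{crossbar}} \propto n_{\text{ports}}^{2}
\quad\Rightarrow\quad
\frac{\text{cost}(60~\text{SM ports})}{\text{cost}(6~\text{cluster ports})} \approx \frac{60^{2}}{6^{2}} = 100.

Under this simple assumption, grouping 60 SMs into 6 clusters cuts the SM-side port count by 10× and the crossbar crosspoint cost by roughly two orders of magnitude, which is why clustering is attractive despite the congestion it introduces.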

Challenge #2: Modeling memory divergence. Quantitative evaluation is an essential part of the computer architect's toolbox. Analytical models are roughly two orders of magnitude faster than simulation, making them ideally suited for early-stage architectural exploration [51] and for helping programmers understand application performance characteristics [50, 124]. However, emerging GPU applications and their distinct characteristics have made GPU analytical modeling more complicated. In particular, memory divergence is prevalent among emerging GPU applications. Unfortunately, prior GPU analytical models incur large inaccuracies: the state-of-the-art GPUMech [51] model incurs an average performance error of 298% for a broad set of memory-divergent (MD) applications. The key problem stems from the distinct performance behavior of MD-applications compared to well-understood non-memory-divergent (NMD) applications. Modeling memory divergence is challenging. First, we need to identify the key performance-related behavior of MD-applications. Second, modeling this behavior through a set of equations is complex, which imposes a trade-off between simplicity and accuracy. Third, the modeling overhead should be kept low to ensure the model's practicality.

1.5 Thesis Contributions

This thesis provides three major contributions.

Contribution #1: Intra-Cluster Coalescing to Reduce NoC Pressure

We make the observation that many GPU-compute applications exhibit inter-CTA locality, as different CTAs access the same cache line or access the same read-only data. For clustered GPUs, this implies that memory requests from CTAs executing on the same cluster will access the same cache lines. According to our experimental results, we find that on average 19% (and up to 48%) of all L1 misses originating from a cluster indeed access the same cache lines. These memory requests are redundant and can be eliminated.

In response, we propose intra-cluster coalescing (ICC) to exploit coalescing opportunities across SMs within a cluster. ICC reduces GPU NoC pressure by coalescing memory requests from different SMs in a cluster to the same L2 cache line. In particular, ICC records the memory requests sent to the NoC by all SMs in a cluster, and when subsequent memory requests from other SMs in the cluster access the same cache lines as an outstanding request, ICC coalesces them. By doing so, ICC significantly reduces NoC traffic. To extend the opportunity for coalescing beyond the time window during which a memory request is outstanding, we complement ICC with a coalesced cache (CC) to keep track of recently coalesced cache lines. L1 cache misses trigger an access to the CC, which, in case of a hit, further reduces NoC traffic.
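The following behavioral sketch (host-side C++ under simplifying assumptions, such as a single merge point per cluster and unbounded table capacity; the actual ICC unit, merge table and coalesced cache are detailed in Chapter 3) captures the core idea: an L1 miss that matches a cache line already in flight from the same cluster is merged instead of injecting a new NoC request.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct MergeTable {
    // Cache-line address -> list of (SM id, warp id) waiting for that line.
    std::unordered_map<uint64_t, std::vector<std::pair<int, int>>> pending;

    // Returns true if a new NoC request must be sent, false if it was merged.
    bool onL1Miss(uint64_t lineAddr, int sm, int warp) {
        auto it = pending.find(lineAddr);
        if (it != pending.end()) {             // same line already in flight
            it->second.push_back({sm, warp});  // merge: no extra NoC packet
            return false;
        }
        pending[lineAddr] = {{sm, warp}};      // first requester: go to the NoC
        return true;
    }

    // When the line returns from L2/DRAM, forward it to all merged requesters.
    std::vector<std::pair<int, int>> onFill(uint64_t lineAddr) {
        auto waiters = std::move(pending[lineAddr]);
        pending.erase(lineAddr);
        return waiters;
    }
};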

Our experiments show that ICC along with CC leads to an average 15% (and up to 69%) performance improvement while at the same time reducing system energy by 5.3% on average (and up to 16.7%) and EDP by 12% on average (and up to 30%).

Contribution #2: Distributed-Block CTA Scheduling

Since intra-cluster locality stems from inter-CTA locality, we demonstrate the significant interaction between ICC and the CTA scheduling policy. We find that ICC benefits more when the CTA scheduling policy maps neighboring CTAs to the same cluster to better exploit inter-CTA locality. Motivated by this observation, we propose distributed-block scheduling. Distributed-block scheduling is a two-level locality-aware CTA scheduling policy that first evenly distributes consecutive CTAs across clusters, and subsequently schedules pairs of consecutive CTAs per SM to maximize L1 cache locality and L1 MSHR coalescing opportunities. In contrast to prior work in CTA scheduling, distributed-block scheduling exploits cache locality at both the cluster level and the SM level. Inter-CTA locality is exploited to improve L1 cache performance (decreasing the reuse distance between accesses to the same memory location, thereby increasing the L1 cache hit rate) and to increase the coalescing opportunities in the L1 miss status handling registers (MSHRs).
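A simplified, static sketch of the resulting CTA-to-SM mapping is shown below (a hypothetical helper under the assumption of a purely static assignment; the actual scheduler described in Chapter 4 assigns CTAs dynamically as SMs become available): blocks of consecutive CTAs go to the same cluster, and within a cluster pairs of consecutive CTAs go to the same SM.

struct Placement { int cluster; int sm; };

// Hypothetical static approximation of distributed-block scheduling:
// 'ctasPerSM' consecutive CTAs share an SM (pairs in the thesis), and
// 'smsPerCluster * ctasPerSM' consecutive CTAs share a cluster.
Placement mapCTA(int cta, int numClusters, int smsPerCluster, int ctasPerSM) {
    int ctasPerCluster = smsPerCluster * ctasPerSM;
    Placement p;
    p.cluster = (cta / ctasPerCluster) % numClusters;  // level 1: block of CTAs per cluster
    p.sm      = (cta % ctasPerCluster) / ctasPerSM;    // level 2: consecutive pair per SM
    return p;
}

For example, with 2 clusters, 2 SMs per cluster and 2 CTAs per SM, CTAs 0-3 map to cluster 0 (CTAs 0-1 on SM 0, CTAs 2-3 on SM 1) and CTAs 4-7 map to cluster 1, so neighboring CTAs share an L1 cache or, at least, a cluster port.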

Using execution-driven GPU simulation, we find that distributed-block scheduling improves GPU performance by 4% (and up to 16%), while at the same time reducing system energy and EDP by 1.2% and 5%, respectively, compared to the state-of-the-art distributed scheduling policy. In addition, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% (and up to 67%) and reduce system energy by 6% (and up to 18%).

These two contributions are published in:

L. Wang, X. Zhao, D. Kaeli, Z. Wang and L. Eeckhout. Intra-Cluster Coalescing to Reduce GPU NoC Pressure. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 990-999, May 2018.

An extended version is published in:

L. Wang, X. Zhao, D. Kaeli, Z. Wang and L. Eeckhout. Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure. In IEEE Transactions on Computers, Vol. 68, No. 7, pp. 1064-1076, July 2019.

Contribution #3: Memory Divergence Model (MDM)

In this work, we focus on analytically modeling the performance of emerging memory-divergent GPU-compute applications, which are common in domains such as machine learning and data analytics. The poor spatial locality of these applications leads to frequent L1 cache blocking because the application issues significantly more concurrent cache misses than the cache can support, which cripples the GPU's ability to use Thread-Level Parallelism (TLP) to hide memory latencies. Based on this observation, we propose the Memory Divergence Model (MDM), which faithfully models the batching behavior and NoC/DRAM queueing delays observed in MD-applications.

MDM significantly improves the performance prediction accuracy (by 16.5× on average) compared to the state-of-the-art GPUMech [51] for MD-applications. At the same time, MDM is equally accurate as GPUMech for NMD-applications. Across a set of MD and NMD-applications, we report an average prediction error of 13.9% for MDM compared to detailed simulation (versus 162% for GPUMech). Moreover, we demonstrate high accuracy across a broad design space in which we vary the number of MSHRs, NoC and DRAM bandwidth, as well as SM count. Furthermore, we validate MDM against real hardware, for which we rely on binary instrumentation to collect the model inputs as opposed to functional simulation as done in prior work. By doing so, we improve both model evaluation speed (by 6.1×) and accuracy (average prediction error of 40% for MDM versus 164% for GPUMech).

In addition, we perform three case studies to demonstrate the utility of the MDM model. First, we show that MDM is useful and accurate compared to detailed simulation for early design space exploration of GPU architectures when varying the number of SMs, MSHRs, and NoC and DRAM bandwidth. Second, we find that the MDM model is equally accurate as the special-purpose CRISP [91] model at predicting the performance impact of DVFS. Finally, we validate the observations underlying the MDM model by reporting CPI components.

Overall, the MDM model advances the state-of-the-art in GPU analytical performance modeling by expanding its scope, improving its practicality, and enhancing its accuracy.

This work is published in:

L. Wang, M. Jahre, A. Adileh, Z. Wang, and L. Eeckhout. Modeling Emerging Highly Memory-Divergent GPU Applications. In IEEE Computer Architecture Letters, Vol. 18, No. 2, pp. 95-98, June 2019.

An extended version of this work is published in:

L. Wang, M. Jahre, A. Adileh, and L. Eeckhout. MDM: The GPU Memory Divergence Model. In the 53rd International Symposium on Microarchitecture (MICRO), October 2020.

1.6 Thesis Organization

The remainder of the thesis is organized as follows.

Chapter 2 introduces the necessary background regarding GPU architecture and the state-of-the-art GPU performance analysis tools to better understand the following chapters.

Chapter 3 observes intra-cluster locality (ICL) in a broad range of GPU applications and exploits this characteristic to reduce NoC pressure in clustered GPUs by proposing intra-cluster coalescing (ICC) and the coalesced cache (CC) unit.

Chapter 4 analyzes the interaction between ICC and the CTA scheduling policy, and proposes distributed-block CTA scheduling, which captures data locality at both the SM and cluster levels.

Chapter 5 focuses on memory divergence in emerging GPU applications and proposes the MDM model, which advances the state-of-the-art GPUMech [51] model by expanding its scope, improving its accuracy and enhancing its practicality.

Finally, in Chapter 6, we conclude the thesis and discuss potential avenues for future work.


Chapter 2

Background

This chapter first provides an overview of GPU architecture fundamentals, including clustered GPU architectures and GPU hardware schedulers; these are relevant to Chapters 3 and 4. We then provide background for the MDM model presented in Chapter 5, including the concept of memory divergence, state-of-the-art GPU performance modeling, and the analytical interval-based performance modeling on which the MDM model builds.

2.1 Basics of GPU Architecture

2.1.1 GPU Thread Hierarchy

Using Nvidia's terminology, a GPU-compute application consists of kernels, grids, thread blocks (TBs) or cooperative thread arrays (CTAs), warps and threads, organized in a hierarchy (see Figure 2.1). A kernel is a parallel code region that runs on the GPU; each kernel launch executes a grid, which in turn consists of multiple CTAs. Each CTA is a batch of threads that can coordinate with each other through synchronization using a barrier instruction [62]. Threads in a CTA share a fast, on-chip scratchpad memory called shared memory. Since all synchronization primitives are encapsulated within a CTA, different CTAs can be executed in any order. This is an important feature that we will explore in this work to understand how the mapping of CTAs to clusters affects intra-cluster locality. CTAs are typically organized in a 1D, 2D or 3D structure1. In particular, the 3D variable (blockIdx.x, blockIdx.y, blockIdx.z) distinguishes different CTAs.

1 The most common case is a 2D structure, which uses the row and the column to index one CTA.
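As a minimal illustration of the hierarchy described above, the following CUDA sketch (not taken from the thesis benchmarks; the kernel name and launch shape are hypothetical) shows how a thread derives its CTA index and its global 2D position from the built-in blockIdx, blockDim and threadIdx variables.

__global__ void record_cta(int *cta_of, int width, int height)
{
    // Linearized 2D CTA index: row is blockIdx.y, column is blockIdx.x.
    int cta = blockIdx.y * gridDim.x + blockIdx.x;

    // Global 2D coordinates of this thread within the grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        cta_of[y * width + x] = cta;   // each thread records the CTA it belongs to
}

// Example launch: a 2D grid of 2D CTAs, e.g.,
//   record_cta<<<dim3(16, 16), dim3(16, 8)>>>(d_cta_of, 256, 128);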



Figure 2.1: GPU thread hierarchy: a GPU kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other.

Figure 2.2: SM architecture: each SM includes a large register file, several caches, and 32 CUDA cores executing threads in a SIMT manner.


Figure 2.3: Clustered GPU architecture: SMs within a cluster go through the NoC to access the L2 cache and main memory to serve L1 cache misses.

2.1.2 Streaming Multiprocessor

The Streaming Multiprocessor (SM) is the basic computation unit in a GPU. An SM executes up to thousands of threads, organized as warps, in a single-instruction multiple-thread (SIMT) manner. Each warp consists of a fixed number of threads, e.g., 32 threads. Different warps within a CTA can synchronize through a barrier and communicate through shared memory. Figure 2.2 shows a diagram of an SM [1]. Each SM includes thousands of registers that can be partitioned among threads of execution, several on-chip memory structures such as the scratchpad memory and the L1 cache, warp schedulers and execution cores. Warp schedulers can quickly switch contexts between threads and issue instructions from ready warps. There are different warp scheduling policies, such as loose round-robin (LRR) [99] and greedy-then-oldest (GTO) [99]. The LRR scheduling policy schedules warps in a round-robin way, while the GTO scheduling policy keeps scheduling the same warp until it stalls and then picks the oldest ready warp. A warp mainly stalls due to a RAW hazard (e.g., a long-latency memory access) or an LD/ST-unit stall (e.g., no free MSHR entries or memory congestion); the GPU's high thread-level parallelism (TLP) can usually hide such stalls efficiently. A CUDA core is the execution unit for floating-point and integer operations, while the LD/ST unit processes memory instructions.
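A minimal host-side sketch of the GTO selection rule just described is given below; the Warp structure, the readiness flag and the age field are simplifying assumptions for illustration and do not correspond to the simulator's internal data structures.

#include <vector>
#include <cstdint>

struct Warp {
    uint64_t age;    // lower value = older warp (assigned at launch)
    bool     ready;  // can issue this cycle (no RAW hazard or LD/ST stall)
};

// Greedy-then-oldest: stick with the warp that issued last cycle; once it
// stalls, fall back to the oldest ready warp. Returns -1 if none can issue.
int gto_select(const std::vector<Warp>& warps, int last_issued)
{
    if (last_issued >= 0 && warps[last_issued].ready)
        return last_issued;                       // greedy part

    int oldest = -1;
    for (int i = 0; i < (int)warps.size(); ++i)   // oldest-ready fallback
        if (warps[i].ready && (oldest < 0 || warps[i].age < warps[oldest].age))
            oldest = i;
    return oldest;
}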


Figure 2.4: CTA scheduling policies: a 2-level round-robin scheduling policy allocates neighboring CTAs to different clusters, while a distributed scheduling policy maps neighboring CTAs to the same cluster.

2.1.3 Clustered GPU Architecture

To address the GPU NoC scalability challenge, modern-day GPUs implement a cluster structure that groups several SMs into a cluster. For example, Nvidia Pascal [13] and Volta [15] GPUs feature 6 clusters, with each cluster consisting of 10 and 14 SMs, respectively. The baseline clustered GPU architecture in this thesis is shown in Figure 2.3: 12 clusters are connected via a crossbar NoC to 8 memory controllers (MCs). Each MC has an associated L2 cache bank for the memory partition that the MC serves, and has one network port. Each cluster consists of 5 SMs, so there are 60 SMs in total. Each SM has a private L1 data cache, a read-only texture cache, a constant cache and shared memory. An L1 cache miss triggers a request to be sent over the NoC to reach one of the L2 cache banks; in case of an L2 cache miss, the request proceeds to main memory. In our baseline architecture, we assume one NoC injection port buffer that is shared by all SMs in a cluster. (We will study the sensitivity of our proposed design to the number of clusters and the effective number of network ports per SM in the evaluation section of Chapter 3.) Each cluster has a response FIFO queue to hold incoming packets from the NoC; responses are directed to one of the SMs in the cluster according to the control information in the packet.


2.1.4 CTA Scheduling

Scheduling on a GPU is done in three steps. First, a kernel is launched on the GPU. In this thesis, we assume that only one kernel is active at a given time. Second, the CTA scheduler maps CTAs to the available SMs. In traditional GPUs (without a cluster structure), the scheduler usually follows a round-robin (RR) policy to balance the load among the different SMs. As shown in Figure 2.4(a), CTA 1 is allocated to SM #1, CTA 2 is allocated to SM #2, and so on. The maximum number of CTAs that can be scheduled per SM is determined by the SM's resources. Finally, the warp scheduler in each SM schedules warps (from one or more CTAs) to execute, which we model using the Greedy-Then-Oldest (GTO) policy [99].

CTA Scheduling for Clustered GPUs

CTA scheduling policies for clustered GPUs can be different. The default CTA scheduler follows a 2-level round-robin (RR) policy [80], which first schedules CTAs across clusters and then across SMs within a cluster. In particular, CTA 1 is allocated to SM #1 in cluster #1, CTA 2 is allocated to SM #3 in cluster #2, and so on (see Figure 2.4(b)). Once all clusters are assigned one CTA, the next iteration allocates a CTA to the second SM in each cluster, etc., until all SMs are assigned one CTA. For example, CTA 3 is allocated to SM #2 in cluster #1 and CTA 4 is allocated to SM #4 in cluster #2. If an SM has enough resources to execute more than one CTA, additional CTAs are assigned in a round-robin manner similar to the procedure just described. For example, CTA 5 is allocated to SM #1 in cluster #1, etc. By doing so, a two-level RR policy balances the load among clusters and SMs, so that all clusters and SMs have a similar number of CTAs to execute.

As proposed in MCM-GPU [24], a state-of-the-art distributed CTA scheduling policy, or distributed scheduling for short, evenly maps a block of neighboring CTAs to the same cluster to exploit the locality benefit. As shown in Figure 2.4(c), assuming there are 6 CTAs in total, CTAs 1 through 3 are assigned to cluster #1, and CTAs 4 through 6 are assigned to cluster #2. The CTAs mapped to the same cluster are then allocated in a round-robin way. For instance, CTAs 1 and 2 are allocated to SM #1 and SM #2 in cluster #1, respectively. We assume distributed scheduling as the baseline policy in Chapters 3 and 4.
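The two cluster-level mappings can be summarized by the following sketch (an assumed, simplified formulation rather than the simulator's implementation); CTA, cluster and SM indices are 0-based.

struct Mapping { int cluster; int sm; };   // sm is the SM slot within the cluster

// Two-level round-robin: rotate over clusters first, then over the SMs per cluster.
Mapping two_level_rr(int cta, int num_clusters, int sms_per_cluster)
{
    return { cta % num_clusters, (cta / num_clusters) % sms_per_cluster };
}

// Distributed scheduling: give each cluster a contiguous block of CTAs, then
// round-robin the block's CTAs over the cluster's SMs.
Mapping distributed(int cta, int total_ctas, int num_clusters, int sms_per_cluster)
{
    int block = (total_ctas + num_clusters - 1) / num_clusters;   // CTAs per cluster
    return { cta / block, (cta % block) % sms_per_cluster };
}

// With 6 CTAs, 2 clusters and 2 SMs per cluster, distributed() reproduces the
// assignment of Figure 2.4(c): CTAs 0-2 to cluster 0, CTAs 3-5 to cluster 1.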

2.2 Memory Divergence and Coalescing

Memory Access Coalescing Unit: In GPUs, threads within a warp execute instructions in lockstep. For a global memory instruction, the 32 threads of a warp together generate up to 32 individual memory accesses. The Memory Access Coalescing Unit (MACU) coalesces these accesses into a small number of cache line-sized requests before accessing the L1D cache, to reduce the number of requests and thus increase the effective bandwidth utilization. For memory requests with perfect spatial locality, the threads within a warp access 32 consecutive words and only one memory access to the L1D cache is generated. A memory-divergent access, however, generates several memory requests to the L1D cache after coalescing due to poor spatial locality. If the memory requests in a warp are divergent, the warp cannot make progress until all memory transactions are handled, which takes significantly longer than waiting for a single memory request. In this thesis, we define an application as memory-divergent if it features more than 10 Divergent loads Per Kilo Instructions (DPKI).
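The two illustrative CUDA kernels below (not taken from the benchmarks used in this thesis) contrast the access patterns discussed above. With 4-byte elements and 128-byte cache lines, a warp of 32 threads in the first kernel touches exactly one cache line per load, whereas in the second kernel every thread falls into a different line, so the MACU must emit up to 32 requests for a single load instruction.

__global__ void coalesced_load(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];          // consecutive 4 B words: one 128 B request per warp
}

__global__ void divergent_load(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * 32 < n)
        out[tid] = in[tid * 32];     // 128 B stride: a separate request per thread
}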

Miss Status Holding Registers: For misses in the L1D cache, the corresponding requests are sent to the next level of the memory hierarchy. In particular, the Miss Status Holding Registers (MSHRs) track in-flight memory requests and merge duplicate requests to the same cache line. After MSHR allocation, a memory request is buffered at the NoC port for transfer. An MSHR entry is released once its corresponding memory request returns and all accesses to that cache line are serviced. Memory-divergent requests tend to access several cache lines and may thus quickly exhaust the GPU core's MSHR entries. In Chapter 5, we further investigate the relationship between memory divergence and MSHR blocking.

2.3 GPU Performance Modeling

2.3.1 GPU Performance Analysis Tools

Over the years, Graphics Processing Units (GPUs) have been used as accelerators to perform general-purpose computations. Continuously evolving GPU architectures and emerging GPU applications [6, 29] have made architecture exploration and performance analysis more complicated. This drives the need for fast and accurate performance evaluation tools that cover a broad class of GPU applications and modern-day GPU architectures.

With the advent of GPU computing, GPU manufacturers have developed profiling and debugging tools such as Nvidia's Visual Profiler [12]. These tools use performance counters to profile various aspects of the program execution. They are easy to use and run at native hardware speed. Unfortunately, these production-quality profilers are not flexible enough for computer architects to conduct studies involving micro-architecture innovations and design space explorations.

As one solution, architects, system designers and application developers turn towards cycle-accurate simulators such as GPGPU-Sim [27] for accurate performance analysis. Simulators are flexible and allow architects to measure fine-grained details of execution. However, simulation is quite time-consuming, especially for the full execution of real-world applications. This forces researchers and architects to use trimmed-down input data sets or to intelligently choose sections of the full program [95] so that their experiments finish in a reasonable amount of time.


Figure 2.5: Interval analysis: an interval is defined as a sequence of instructions at the maximum issue rate followed by stall cycles.

Unfortunately, there is no guarantee that simulation-sized inputs are representative of real workloads [106]. In addition, state-of-the-art simulators need to be modified to model and support new-generation GPU architectures [13] and emerging GPU applications [6, 29].

An alternative approach is analytical modeling, which captures the key performance-related behavior of an architecture in a set of mathematical equations. Analytical models are much faster than simulation, making them ideally suited for early-stage architectural exploration [51] and for helping programmers understand application performance [50, 51]. Hong and Kim [50] propose a model that estimates performance based on concurrent computation as well as memory requests. Baghsorkhi et al. [26] propose a work-flow graph (WFG) based analytical model to predict performance. However, these models are built on static code analysis. GPUMech [51] addresses this problem by integrating a trace-driven functional simulator with an analytical model based on interval analysis, and therefore provides higher accuracy than traditional analytical models. Machine-learning-based models have also been proposed [40, 118]; unfortunately, they are black-box approaches, which makes it difficult to extract deep insight.

2.3.2 Interval-Based Analytical Modeling

Interval Analysis

The foundation of interval analysis was laid by Karkhanis et al. [60] and Eyerman et al. [41]. The basic idea is that a processor sustains its maximum issue rate unless disruptive miss events, such as cache misses, occur. Performance is then estimated by adding the stall cycles caused by these miss events to the time spent issuing instructions at the maximum rate. Figure 2.5 illustrates interval analysis: an interval is defined as a sequence of instructions issued at the maximum issue rate followed by stall cycles. Functional simulators are used to detect the stall events. Interval analysis was originally proposed for single-threaded workloads, and applying it to GPUs is not straightforward due to their highly parallel execution model [51].

Page 40: Modeling and Minimizing Memory Contention in General-Purpose … · Modeling and Minimizing Memory Contention in General-Purpose GPUs Modelleren en minimaliseren van geheugencontentie

16 CHAPTER 2. BACKGROUND

Generating Interval Profiles at the Warp Level

The starting point for applying interval analysis to model GPUs is to generate an interval profile for one warp. Equation 2.1 shows the interval profile of a warp: each interval records the number of instructions and the number of stall cycles.

interval_i = [#interval_insts_i, #stall_cycles_i],   i ∈ intervals    (2.1)

Latencies of compute instructions are fixed and based on the system's configuration. The latency of each memory instruction, on the other hand, is calculated based on the predicted cache miss rate obtained from a cache simulator as well as the LLC/DRAM access latency. For instance, if one (static) memory instruction hits the L1 cache 10% of the time, and hits the LLC 30% of the time (among all LLC accesses), its latency equals (1 − 0.1) × 0.3 × 120 + (1 − 0.1) × (1 − 0.3) × 220 = 171 cycles, assuming the access latencies for the LLC and DRAM equal 120 cycles and 220 cycles, respectively. In particular, since all warps execute the same memory instruction, the cache miss rate of one memory instruction (i.e., one PC or one static instruction) is calculated by counting the miss events of all its executions across all warps.
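The per-instruction latency estimate can be written as a one-line helper; the sketch below is illustrative (the function name and signature are not part of the model's implementation) and reproduces the 171-cycle example above.

// Expected latency of a static memory instruction, given its L1 hit rate, its
// LLC hit rate (measured over L1 misses), and the LLC/DRAM access latencies.
double expected_memory_latency(double l1_hit, double llc_hit,
                               double llc_lat, double dram_lat)
{
    return (1.0 - l1_hit) * (llc_hit * llc_lat + (1.0 - llc_hit) * dram_lat);
}

// expected_memory_latency(0.1, 0.3, 120, 220)
//   = 0.9 * 0.3 * 120 + 0.9 * 0.7 * 220 = 171 cycles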

Figure 2.6 illustrates how to generate intervals for a warp. The done cycle is equal to the issue cycle plus the instruction latency. Instruction 3 (i3) leads to stall cycles because instruction 5 (i5) depends on it. The other instructions can be issued every cycle because there are no dependences. We apply Equation 2.2 to determine the issue cycle of each instruction. An interval is formed when the issue cycle of the current instruction is not equal to the issue cycle of the previous instruction plus one, since this indicates that stall cycles have been incurred between the two instructions. As shown in Figure 2.6, i5 can only issue at cycle 123, which is equal to the done cycle of i3 (122) plus 1.

issue_cycle(inst_{k+1}) = max{ issue_cycle(inst_k) + 1, done_cycle(source(inst_{k+1})) + 1 }    (2.2)
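A compact sketch of this interval-profile construction is given below (the trace layout is an assumption for illustration): issue cycles follow Equation (2.2), and a new interval is closed whenever an instruction cannot issue in the cycle right after its predecessor.

#include <vector>
#include <cstdint>
#include <algorithm>

struct TraceInst {
    int      src;      // index of the producing instruction, or -1 if none
    uint64_t latency;  // fixed for compute, predicted for memory instructions
};

struct Interval { uint64_t insts; uint64_t stall_cycles; };

std::vector<Interval> build_intervals(const std::vector<TraceInst>& trace)
{
    std::vector<Interval> profile;
    std::vector<uint64_t> issue(trace.size()), done(trace.size());
    Interval cur{0, 0};

    for (size_t k = 0; k < trace.size(); ++k) {
        uint64_t earliest = (k == 0) ? 1 : issue[k - 1] + 1;         // issue-rate limit
        if (trace[k].src >= 0)
            earliest = std::max(earliest, done[trace[k].src] + 1);   // dependence, Eq. (2.2)
        issue[k] = earliest;
        done[k]  = issue[k] + trace[k].latency;

        if (k > 0 && issue[k] != issue[k - 1] + 1) {                 // stall: close the interval
            cur.stall_cycles = issue[k] - (issue[k - 1] + 1);
            profile.push_back(cur);
            cur = Interval{0, 0};
        }
        cur.insts++;
    }
    profile.push_back(cur);                                          // trailing interval
    return profile;
}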

Trace Collection

A functional simulator is often used to collect the trace for interval-based analytical performance modeling techniques. Applications are functionally executed to capture the dynamic basic block sequence of every warp as well as the addresses accessed by each thread. GPGPU-Sim [27] is widely used for functional execution. However, it operates at the intermediate PTX representation [4] and is time-consuming, especially for long-running GPU applications.

Traces can also be collected through instrumentation on real GPUs. In contrast to a functional simulator, instrumentation tools can profile kernels at native execution speed.


Figure 2.6: Generating interval profiles through instruction dependence analysis.

SASSI [106] is a compile-time instrumentation tool that operates directly at the native instruction level [5], leveraging Nvidia's production back-end compiler. SASSI enables software-based, selective instrumentation of GPU applications. However, compile-time instrumentation tools have several limitations: (1) they cannot operate on GPU driver code, and (2) they cannot target pre-compiled libraries because the source code is not available.

NVBit [114] is a fast, dynamic and portable binary instrumentation framework targeting Nvidia GPUs. By working directly at the native instruction level, NVBit can faithfully instrument applications that have been produced with NVCC, via JIT compilation of PTX [4], or through the inclusion of pre-compiled shared libraries such as cuDNN [17] and cuBLAS [8]; these optimized libraries are widely used in machine learning workloads [34, 58]. NVBit thus significantly improves upon the compile-time tool SASSI in terms of applicability. NVBit provides a rich set of high-level APIs which enable instruction inspection, callbacks to CUDA driver APIs, and injection of arbitrary CUDA functions into any application before kernel launch. NVBit supports basic-block instrumentation, multi-function injection at the same location, inspection of ISA-visible state, dynamic selection of instrumented or un-instrumented code, permanent modification of register state, correlation with source code, and instruction removal. It supports the Nvidia GPU architecture families Kepler [10], Maxwell [11], Pascal [13], and Volta [15]. In this thesis, we use NVBit to collect per-warp instruction and memory address traces (see Chapter 5).


Chapter 3

Intra-Cluster Coalescing

3.1 Introduction

To continuously increase the raw computational power of modern GPUs, the SM count keeps increasing. Whereas the Nvidia Fermi GPU implemented 16 SMs, the latest Nvidia Pascal [13] and Volta GPUs [15] feature 60 and 84 SMs, respectively. The SMs feature private L1 caches and are connected to the L2 cache and memory controllers (MCs) through a Network-on-Chip (NoC). With the large number of SMs we are observing today, designing a scalable NoC poses a challenge. Typically, a crossbar is deployed as the NoC in a GPU due to its low latency and high bandwidth [13]. However, a crossbar NoC faces scalability issues as its hardware cost increases quadratically with port count.

To address the GPU NoC scalability challenge, a cluster structure is implemented in modern-day GPUs to group several SMs into a cluster. For example, Pascal supports 6 clusters, with each cluster consisting of 10 SMs [13]; Volta features 14 SMs per cluster with the same number of clusters [15]. By sharing NoC ports among SMs in a cluster, the total number of ports to the network is reduced, and so is the overall hardware cost of the crossbar NoC.

Previous research has shown that NoC congestion is a severe GPU performance bottleneck for many memory-intensive applications [28, 66, 128]. Unfortunately, clustered GPUs further exacerbate this performance issue. By sharing ports among SMs in a cluster, congestion significantly increases as SMs need to compete with each other within a cluster for network bandwidth. This creates a new and critical performance challenge for the NoC in clustered GPU organizations.

In this chapter, we address the GPU NoC performance bottleneck by reducing NoC traffic, and more specifically by eliminating redundant NoC requests. We do this by coalescing L1 cache misses from different SMs within a cluster before sending them to the NoC.

Memory coalescing, or grouping memory accesses from different threads to the same cache line in a single memory request, is widely deployed in a GPU and is an effective technique to reduce NoC pressure. More specifically, intra-warp coalescing merges L1 cache accesses across threads within a warp [49]; WarpPool merges L1 accesses across warps within the same SM [68]; L1 Miss Status Handling Registers (MSHRs) merge L1 misses across warps within a single SM. However, to the best of our knowledge, no prior work coalesces L1 misses across SMs within a cluster.

We make the observation that many GPU-compute applications exhibit intra-cluster locality (ICL): memory requests from different SMs within the same cluster access the same cache lines. According to our experimental results, we find that on average 19% (and up to 48%) of all L1 misses originating from a cluster indeed access the same cache lines. These memory requests are redundant and can be eliminated.

In response, we propose intra-cluster coalescing (ICC) to exploit coalescing opportunities across SMs within a cluster. ICC reduces GPU NoC pressure by coalescing memory requests from different SMs in a cluster to the same L2 cache line. In particular, ICC records the memory requests sent to the NoC by all SMs in a cluster, and when subsequent memory requests from other SMs in the cluster access the same cache lines as an outstanding request, ICC coalesces them. By doing so, ICC significantly reduces NoC traffic. To extend the opportunity for coalescing beyond the time window during which a memory request is outstanding, we complement ICC with a coalesced cache (CC) to keep track of recently coalesced cache lines. Cache lines are added to the CC when a coalesced cache line with multiple requesters returns from the memory hierarchy. L1 cache misses trigger an access to the CC, which, in case of a hit, further reduces NoC traffic.

In summary, we make the following contributions:

• We observe that GPU-compute applications exhibit high degrees of inter-CTA locality. We analyze and categorize the sources of data sharing among CTAs.

• We propose intra-cluster coalescing (ICC) and the coalesced cache (CC) to track and coalesce L1 cache misses from different SMs in a cluster before sending them across the NoC.

• We comprehensively evaluate our newly proposed ICC scheme, and report an average 15% (and up to 69%) performance improvement over the state-of-the-art distributed scheduling policy; the ICC unit alone improves performance by 9.7% on average, and the CC further increases performance by 5.3%. The hardware cost of the ICC unit is limited to 276 bytes per cluster.

3.2 Motivation and Opportunity

We now further motivate the problem and describe the opportunity.


Figure 3.1: Quantifying the NoC bottleneck: normalized IPC when varying the NoC and LLC frequency from 0.25× to 2×. NoC (and LLC) bandwidth is a severe performance bottleneck.

3.2.1 NoC Bandwidth Bottleneck

We first demonstrate that the NoC indeed constitutes a performance bottleneck in a clustered GPU architecture. In particular, we study the relationship between performance and NoC bandwidth. Figure 3.1 quantifies performance as a function of NoC bandwidth. To ensure an overall balanced design, we vary the LLC bandwidth proportionally as we change the NoC bandwidth; this is done by scaling the clock frequency of the NoC and LLC subsystems by the same factor, and it provides a meaningful measure of how sensitive performance is to the available NoC (and LLC) bandwidth. (Further details about our experimental setup are given in Section 3.4.) We find that performance is sensitive to NoC and LLC bandwidth for most of the benchmarks. In particular, increasing NoC/LLC bandwidth by a factor of 1.5× leads to a substantial performance benefit; doubling the NoC/LLC bandwidth improves performance by 45% on average (and up to 78%), at which point the improvement saturates. This illustrates that NoC bandwidth indeed is a severe bottleneck. Limited NoC bandwidth leads to congestion within a cluster for memory requests that need to proceed through the NoC to reach the L2 cache and beyond.

3.2.2 Request Merging

GPU-compute applications exhibit various forms of locality in the memory hierarchy. Merging memory requests is widely deployed across the memory hierarchy in a GPU to increase the effective memory system throughput. Table 3.1 provides a comparison between existing techniques and our work.

Intra-warp locality, or different threads within the same warp accessing the same or neighboring memory locations, is the most common and obvious form of data locality present in GPU-compute applications.


Table 3.1: GPU coalescing techniques and their scope.

Technique                     Scope
Intra-warp coalescing [49]    Across threads in a warp
WarpPool [68]                 L1 accesses across warps in an SM
L1 MSHR [70]                  L1 misses across warps in an SM
Packet coalescing [67]        MC side
ICC (this work)               L1 misses across SMs in a cluster

To exploit this characteristic, a memory coalescing unit merges multiple memory accesses to the same cache line within the same warp before sending the request to the L1 cache [49]. In other words, intra-warp coalescing merges requests across threads within a warp. This is easily done as the different threads within a warp execute in SIMD lockstep.

For memory-divergent applications, where different threads in a warp request more than one cache line in a load or store instruction, the memory coalescing unit becomes a memory system throughput bottleneck because the different memory requests now need to be serialized. Kloosterman et al. [68] propose WarpPool, which merges memory requests across warps in an SM before accessing the L1 cache. By merging requests from different warps in an SM, they increase the effective L1 cache bandwidth. WarpPool does not address NoC congestion though: WarpPool reduces the number of requests to the L1 cache, but goes no further. SMs in the same cluster that access the same address, an address that presently is not in the L1 cache, still generate multiple NoC requests.

Miss Status Handling Registers (MSHRs) [70] are used at the L1 cache level to track outstanding L1 cache misses and merge multiple requests to the same cache line in the L2 cache and beyond. This avoids having to send redundant requests over the NoC to the next level in the cache hierarchy. Note that L1 MSHRs eliminate redundant NoC requests originating from a single SM. In other words, L1 cache MSHRs are limited in scope and coalesce L1 cache misses across warps within an SM. There may still be redundant NoC requests originating from different SMs within a single cluster, as we will demonstrate in this chapter.

Packet coalescing [67] groups memory requests from different SMs at the memory controller (MC) side. The MC then generates a single read reply and relies on a multicast NoC to transfer the reply packet to the requesting SMs. Packet coalescing does not reduce the number of L1 miss requests sent over the NoC.

To summarize, although intra-warp coalescing and WarpPool reduce the number of requests to the L1 cache, and although L1 MSHRs merge outstanding L1 cache misses, there is no coalescing or merging happening for accesses to the L2 cache. In other words, different SMs within the same cluster may issue multiple requests to the same or neighboring data elements, which leads to redundant NoC traffic. In this work, we eliminate redundant NoC traffic by coalescing L1 cache misses across SMs within a cluster before sending requests to the L2 cache. By doing so, we increase the effective NoC bandwidth.

3.2.3 Intra-Cluster Locality

In this section, we observe and exploit the notion of intra-cluster data locality in GPU-compute applications.

Quantifying Intra-Cluster Locality

To quantify the amount of intra-cluster locality, we define the notion of a redundant request. A data request is said to be redundant if it accesses a cache block that has been accessed by a previous request from the same cluster; the previous request needs to have happened recently, within a given time window prior to the current request. (We will vary this window size when we quantify intra-cluster locality.) We define Intra-Cluster Locality (ICL) as

ICL = (no. redundant requests) / (total no. data requests).    (3.1)

To quantify intra-cluster locality, we track all data requests in a cluster before they are injected into the NoC, i.e., after having accessed the L1 cache, so this includes all L1 misses. We then calculate the ratio of redundant requests versus the total number of data requests for different window sizes of past memory requests. We consider window sizes ranging from 500 to 2000 cycles. The reason for this wide range is that we observe L1 cache miss latencies of up to a couple thousand cycles for some of our benchmarks that suffer from severe NoC congestion.
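A minimal sketch of how Equation (3.1) can be computed from a per-cluster request trace is given below; the trace format and the hash-map bookkeeping are assumptions for illustration, not the instrumentation used in our experiments.

#include <vector>
#include <unordered_map>
#include <cstdint>

struct Request { uint64_t cycle; uint64_t line_addr; };   // post-L1, pre-NoC requests of one cluster

// A request is redundant if the same cache line was requested by this cluster
// within the past `window` cycles (Equation 3.1).
double intra_cluster_locality(const std::vector<Request>& reqs, uint64_t window)
{
    std::unordered_map<uint64_t, uint64_t> last_seen;     // line address -> cycle of last request
    uint64_t redundant = 0;

    for (const Request& r : reqs) {
        auto it = last_seen.find(r.line_addr);
        if (it != last_seen.end() && r.cycle - it->second <= window)
            ++redundant;
        last_seen[r.line_addr] = r.cycle;
    }
    return reqs.empty() ? 0.0 : (double)redundant / (double)reqs.size();
}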

Different applications exhibit different degrees of intra-cluster locality, see Figure 3.2. On average, for a window size of 2000 cycles, we observe that 19% of the memory requests are redundant. For HS and DCT, up to 48% and 45.4% of the requests are redundant at the cluster level, respectively. This result supports the hypothesis that it is possible to significantly reduce NoC traffic in clustered GPUs by coalescing memory requests within a cluster.

Inter-CTA Locality

It is interesting to investigate where intra-cluster locality comes from. Intra-cluster locality in fact stems from inter-CTA locality, because of data reuse among CTAs mapped to SM cores in the same cluster. We analyze all the benchmarks and identify two categories of inter-CTA locality: cache line sharing versus data sharing. Figure 3.2 quantifies their relative contribution.


Figure 3.2: Intra-cluster locality (fraction of redundant requests versus the total number of requests in a cluster) as a function of a past window of requests under the distributed scheduling policy [24]. A distinction is made between cache line sharing and data sharing. A substantial fraction of NoC requests are redundant because of intra-cluster locality due to cache line sharing or data sharing.

For a window size of 2000 cycles, we observe 19% intra-cluster locality, with 11% due to cache line sharing and 8% due to data sharing. We also note that intra-cluster locality increases with increasing window size.

(1) Inter-CTA locality due to cache line sharing. Inter-CTA locality may result from adjacent CTAs accessing neighboring data items in the same cache line, i.e., spatial locality. If one cache line is big enough to hold the data accessed by multiple CTAs, we may observe this form of inter-CTA locality. The number of threads within a CTA is typically a multiple of 32. It may be the case that all threads within a CTA access less than a cache line width of data, e.g., 32 or 64 threads in a CTA access 128 or fewer bytes. Hence, for a cache line of 128 bytes, this implies that different CTAs will access the same cache line, exhibiting inter-CTA locality through the same cache line. A couple of benchmarks feature cache line sharing predominantly, especially DCT and SRAD, see Figure 3.2.

(2) Inter-CTA locality due to data sharing. In many GPU-compute applications, we observe that different CTAs access the same (read-only) data, i.e., temporal locality. Data sharing may result from different reuse patterns depending on how the CTAs are organized.

We illustrate this using two benchmarks. Hotspot (HS), see Listing 3.1 for a code excerpt, is a benchmark that exhibits high intra-cluster locality. HS has its threads and CTAs organized in a 2D structure. Different threads in different CTAs access the same data through the power[] data structure. The computed index is a linear combination of the two-dimensional indices of the thread and the CTA. If this linear combination evaluates to the same value, different threads from different CTAs will access the same data, yielding inter-CTA locality.


Listing 3.1: Code excerpt from hotspot (HS). Different threads in different CTAs access the same data through the power[] data structure if the index evaluates to the same value.

int small_block_rows = BLOCK_SIZE - border_rows * 2;
int small_block_cols = BLOCK_SIZE - border_cols * 2;
int ty = small_block_rows * blockIdx.y + threadIdx.y - border_rows;
int tx = small_block_cols * blockIdx.x + threadIdx.x - border_cols;
int index = grid_cols * ty + tx;
if ((0 < ty) && (ty < grid_rows - 1) && (0 < tx) && (tx < grid_cols - 1))
    power_on_cuda[threadIdx.y][threadIdx.x] = power[index];

Figure 3.3: Data sharing in LUD for the blocked decomposition A = L × U. L11 is reused for calculating submatrices U12 and U13 (reuse along rows), while U11 is reused for calculating submatrices L21 and L31 (reuse along columns).

LUD is another example of a 2D application, see Figure 3.3, in which each submatrix Lij and Uij is processed by one CTA. One iteration (one instance of the kernel) calculates the decomposition of one row and column of the submatrices. For example, in the first iteration, submatrices Lj1 and U1i are computed: L11 is reused for calculating submatrices U12 and U13 (reuse along rows), while U11 is reused for calculating submatrices L21 and L31 (reuse along columns).

We note that data sharing may happen between CTAs that are relatively far apart from each other. For example, LUD features a 6×6 CTA organization in which CTAs in the same row and the same column access the same data; hence, CTAs that are multiples of 6 apart will access the same data, i.e., data sharing within a column. In contrast, inter-CTA locality due to cache line sharing is typically observed among adjacent CTAs.


Figure 3.4: The intra-cluster coalescing (ICC) unit merges L1 cache misses across SMs within a cluster. The coalesced cache (CC) keeps track of recently coalesced cache lines.

3.3 Intra-Cluster Coalescing (ICC)

Based on the notion of intra-cluster data locality in GPU-compute applications, we propose intra-cluster coalescing (ICC) to exploit inter-CTA locality across SMs within the same cluster. The key idea of ICC is to merge requests from different SMs in a cluster to the same L2 cache line before sending the request to the NoC. By doing so, ICC decreases the number of memory transactions over the network and reduces the contention on the network port shared by the SMs in a cluster.

3.3.1 ICC Unit

Figure 3.4 illustrates the overall architecture of the proposed intra-cluster coalescing unit. The central structure of the ICC unit is the merge table. Its goal is to track all memory requests coming from the SMs in the cluster before injecting them into the network. The merge table contains multiple entries. Each entry consists of three fields, namely an address field, the SM list and a valid bit. An entry is responsible for coalescing all memory requests to the same L2 cache line. The merge table is implemented as a fully-associative cache.

When an SM core wants to inject a memory request into the network, the ICC unit first searches the merge table using the request's address. If there already exists an entry for the requested cache line (a merge table hit), the ICC unit appends the ID of the requesting SM to the SM list. The memory request is not sent to the network, since there already is a request outstanding for that same L2 cache line. If, on the other hand, there is no entry allocated in the merge table for that cache line (a merge table miss), the ICC unit allocates a new entry (if the merge table has empty entries available) and then sends the memory request to the network; the SM sending this request is added to the SM list and the valid bit is set. If, under a merge table miss, all entries in the merge table are occupied, the memory request is injected into the NoC directly. The ICC unit only records read requests to the global address space; read requests to other address spaces bypass the merge table. Write requests also bypass the merge table in order not to complicate its design, since including writes would require storing entire cache lines and implementing write merging within a cache line.

When a cluster receives a reply packet from the network, the ICC unit uses the reply address to index the merge table. If there exists an entry for that address (a merge table hit), the ICC unit reads the corresponding SM list and broadcasts the memory reply to all SMs in the list. Next, the corresponding entry in the merge table is set to invalid, which means that the entry can be re-used for other memory requests. If the address cannot be found in the merge table (a merge table miss), the reply is delivered to the SM based on the destination stored in the reply packet.
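The following sketch summarizes the request-path and reply-path behavior of the merge table just described. It is an assumed, software-level illustration (not RTL): the fully-associative lookup is written as a linear scan, and the SM list is modeled as a bitmap with one bit per SM in the cluster.

#include <cstdint>

constexpr int MERGE_ENTRIES = 48;   // merge-table size used in this thesis

struct MergeEntry {
    uint64_t line_addr;
    uint8_t  sm_list;   // one bit per SM in the cluster (5 SMs in our setup)
    bool     valid;
};

struct MergeTable {
    MergeEntry e[MERGE_ENTRIES] = {};

    // Request path: returns true if the global read was coalesced with an
    // outstanding request (and hence must not be injected into the NoC).
    bool on_request(uint64_t line, int sm) {
        for (auto& x : e)
            if (x.valid && x.line_addr == line) { x.sm_list |= 1u << sm; return true; }
        for (auto& x : e)                                     // merge-table miss: allocate
            if (!x.valid) { x = {line, (uint8_t)(1u << sm), true}; return false; }
        return false;                                         // table full: bypass, inject anyway
    }

    // Reply path: returns the bitmap of SMs the reply must be broadcast to,
    // or 0 if the line was never coalesced (use the packet's destination instead).
    uint8_t on_reply(uint64_t line) {
        for (auto& x : e)
            if (x.valid && x.line_addr == line) { x.valid = false; return x.sm_list; }
        return 0;
    }
};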

ICC provides two performance benefits. First, by design, the total number of transactions sent to the network is reduced, which relieves the network bottleneck. Second, the average memory access latency is reduced for requests that hit in the merge table: a request that merges with an already outstanding request only sees the remaining access latency, which is (much) smaller than the latency of a newly initiated request.

3.3.2 Merge Table

The size of the merge table is likely to affect performance. The larger the size, the higher the opportunity to exploit intra-cluster locality. On the flip side, a large merge table also implies higher hardware cost and access latency; access latency is something to consider since it is on the critical path for every L1 cache miss.

The maximum possible size of the merge table is determined by the maximum number of in-flight memory requests. Memory read requests in each SM first access the L1 cache. In case of a miss, the memory request is sent to the next level of cache. In the L1 cache, the MSHRs track the in-flight L1 cache misses and merge duplicate requests accessing the same L2 cache line. The number of MSHR entries controls the number of memory requests that can be injected into the NoC, i.e., when all MSHR entries are occupied, L1 cache misses can no longer be serviced. Hence, the maximum size of the merge table is bounded by the number of SMs per cluster multiplied by the number of L1 MSHR entries per SM. This amounts to a maximum size of 5 × 32 = 160 entries for our clustered architecture.

Obviously, the size of the merge table can be set to a smaller value to reduce the hardware cost and/or access latency. This trade-off impacts our ability to coalesce memory requests across the NoC. We set the size of the merge table to 48 entries in our setup. We find that whereas a maximum-sized merge table can coalesce 14.5% of the L1 cache misses, a 48-entry merge table captures the vast majority of those by coalescing 12% of the L1 cache misses.

3.3.3 Coalesced Cache

A limitation of the ICC unit as just proposed is that it can only coalesce memory requests within a limited time window, namely while the initial memory request is outstanding. However, as quantified in Section 3.2.3, there exists significant inter-CTA locality within large time windows, beyond the latency of a memory request. In other words, there is a high probability that coalesced cache lines will be accessed again in the near future. We therefore extend the ICC unit with a coalesced cache to keep track of recently coalesced cache lines.

Figure 3.4 illustrates the architecture of the coalesced cache (CC). The CC is accessed upon an L1 cache miss. In case of a CC hit, i.e., the access is to a previously coalesced cache line, the cache line is simply returned, saving a request over the NoC to the next level of cache. In case of a CC miss, the request is inserted in the merge table as previously described. Cache lines are added to the CC upon their return from the memory hierarchy if there is more than a single requester, i.e., the cache line is only inserted in the CC if it is effectively a coalesced cache line, as indicated in the respective entry in the merge table. Because the shared data set tends to be small for many GPU applications and because the CC only contains previously coalesced cache lines, we find that a small CC (of 24 entries in our setup) is sufficient.
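A minimal sketch of the CC's bookkeeping is shown below. The structure and the LRU replacement policy are assumptions for illustration (the thesis does not prescribe a replacement policy), and only the tag store is modeled; the data array is omitted.

#include <cstdint>
#include <list>
#include <unordered_map>

struct CoalescedCache {
    static constexpr size_t ENTRIES = 24;          // CC size used in this thesis
    std::list<uint64_t> lru;                       // front = most recently used line
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map;

    // Probed on every L1 cache miss, before the merge table.
    bool lookup(uint64_t line) {
        auto it = map.find(line);
        if (it == map.end()) return false;
        lru.splice(lru.begin(), lru, it->second);  // move to MRU position
        return true;
    }

    // Called when a reply returns; num_requesters comes from the merge-table entry.
    void on_reply(uint64_t line, int num_requesters) {
        if (num_requesters < 2 || map.count(line)) return;   // insert coalesced lines only
        if (map.size() == ENTRIES) {               // evict the LRU line
            map.erase(lru.back());
            lru.pop_back();
        }
        lru.push_front(line);
        map[line] = lru.begin();
    }
};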

3.3.4 Cost Analysis

As mentioned before, we assume a 48-entry fully-associative merge table. For GPU-compute applications with a 48-bit address space [99] and a 128-byte cache line size, we need 41 bits to record the address of a cache line. We further assume 5 bits to record the SM list, i.e., the SMs waiting for that particular cache line to come back from the memory subsystem. The total hardware cost for the merge table amounts to 2,208 bits or 276 bytes. We further assume a 24-entry fully-associative coalesced cache with 41-bit tags and 128-byte cache lines, amounting to a total size of 3.2 KB. (We find that a larger number of entries in the coalesced cache improves performance, but we opt for 24 entries in the evaluation to balance performance and hardware cost.) We need a merge table and a coalesced cache for each cluster. We use CACTI 6.5 [89] to compute the access latency of the merge table and the coalesced cache.


Table 3.2: Simulated GPU configuration.

Parameter                   Value
Clock frequency             1.4 GHz
Number of clusters          12
Number of SMs per cluster   5
Number of MCs               8
Warp schedulers / SM        2 (GTO)
L1 cache / SM               48 KB, 128 B line, 4-way assoc., LRU, 32-entry MSHR
Shared memory / SM          64 KB
L2 unified cache            512 KB per MC, 128 B line, 8-way assoc., LRU, 32-entry MSHR
NoC topology                12 × 8 crossbar
NoC channel width           64 B
NoC bandwidth               716.8 GB/s
DRAM bandwidth              720 GB/s
GDDR5 DRAM                  1.4 GHz; tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6, tCCD=2, tWR=12

We find this access latency to be less than one cycle at 1.4 GHz, assuming a 40 nm chip technology.

3.4 Experimental Setup

The evaluation is done using the GPGPU-Sim 3.2.2 simulator [27]. Table 3.2 shows the simulated baseline GPU configuration. We assume a total of 12 clusters with each cluster containing 5 SMs; hence, there are 60 SMs in total. Each SM features a 48 KB L1 cache. The 12 clusters are connected through a crossbar NoC to 8 memory controllers with a 512 KB L2 cache per memory controller. (We will vary the number of clusters and the number of SMs per cluster in the evaluation.) We further assume Greedy-Then-Oldest (GTO) [99] as the warp scheduling policy within an SM. The merge table in the ICC unit and the coalesced cache are configured to hold up to 48 entries and 24 entries, respectively; we assume a one-cycle access latency to the merge table and the coalesced cache, which we account for in our simulations. We model the state-of-the-art distributed scheduling policy [24] as the baseline.1

1 We will demonstrate that distributed scheduling outperforms other state-of-the-art CTA scheduling policies in the next chapter.


Table 3.3: Benchmarks considered in this study.

Benchmark       Suite       Abbr.
hotspot         Rodinia     HS
b+trees         Rodinia     BT
backprop        Rodinia     BP
bfs             Rodinia     BFS
srad            Rodinia     SRAD
lud             Rodinia     LUD
2Dconv          Polybench   2DCONV
matrixmul       SDK         MM
neuralnetwork   GPGPUsim    NN
FDTD3d          SDK         FDTD
dct8×8          SDK         DCT

In particular, distributed scheduling issues CTAs uniformly across clusters. Our baseline includes intra-warp coalescing, in which memory requests are coalesced across threads within a warp before they are sent to the L1 cache [49]. We further assume 32 MSHR entries at both the L1 and L2 caches; the MSHRs at the L1 cache coalesce L1 misses within an SM.

When evaluating GPU energy consumption, we use GPUWattch [75], assuming a 40 nm chip technology. GPUWattch is modified and configured to model the same GPU configuration as the performance simulator. We account for the extra energy consumed in the merge table and the coalesced cache, although we find it to be negligible.

Table 3.3 lists the workloads used to evaluate our proposed solution, taken from the CUDA SDK [9], Rodinia [32] and PolyBench [45]; NN comes with GPGPU-Sim [27]. We choose a mix of high intra-cluster locality and low intra-cluster locality applications to properly evaluate the performance impact across a broad range of workloads.

3.5 Results

We now evaluate intra-cluster coalescing and the coalesced cache. This is done in a number of steps. We start by investigating the performance improvement and energy consumption reduction, after which we analyze the impact on NoC traffic. Finally, we provide a sensitivity analysis with respect to cluster size and the effective number of NoC ports per SM.


Figure 3.5: IPC improvement for intra-cluster coalescing (ICC) and the coalesced cache (CC) under distributed scheduling. ICC significantly improves performance by 9.7% on average; ICC along with CC yields an average 15% performance improvement.

3.5.1 Performance

Figure 3.5 reports the performance improvement for intra-cluster coalescing (ICC) and the coalesced cache (CC) under distributed scheduling. A couple of interesting observations can be drawn from these results.

ICC significantly increases performance under the distributed scheduling policy, by 9.7% on average. Several benchmarks experience a substantial performance improvement, i.e., LUD (33%), HS (30%), 2DCONV (16.8%) and DCT (21%). Generally speaking, benchmarks with high intra-cluster locality, see Figure 3.2, benefit more from intra-cluster coalescing. However, the correlation is not perfect. This is due to the fact that intra-cluster locality quantifies the redundancy in read requests only. Applications that have a relatively high fraction of writes versus reads, e.g., DCT, do not benefit as much as the intra-cluster locality metric would suggest (although the improvement is still significant).

Intra-cluster coalescing along with the coalesced cache (CC) yields a 15% performance improvement on average. The coalesced cache extends the opportunity to benefit from coalesced cache lines, which leads to an additional 5% average performance improvement beyond ICC. Some cache lines that cannot be serviced by the ICC unit hit in the CC, avoiding an additional access over the NoC. In particular, LUD and 2DCONV experience a substantial performance improvement of up to 69% and 46%, respectively.

3.5.2 Energy Efficiency

ICC reduces energy consumption by coalescing L1 misses, which reduces the number of requests sent over the NoC to the L2 cache. Figure 3.6 quantifies the impact on the overall system (GPU plus DRAM) energy consumption by providing a breakdown of where energy is consumed.


Figure 3.6: Energy consumption breakdown normalized to distributed scheduling (D), split into NoC, L1, L2, MC, SM and DRAM components, for D, D+ICC and D+ICC+CC. ICC along with CC reduces system energy by 5.3% on average.

Figure 3.7: EDP reduction for ICC and CC. ICC along with CC reduces system EDP by 16.7% on average.

ICC reduces energy consumption by 3.7% on average, and up to 7.8% for 2DCONV. ICC plus CC reduces energy consumption by 5.3% on average, and up to 16.7% for 2DCONV. The reduction in energy consumption comes from two sources: reduced NoC energy and reduced L2 energy. The NoC accounts for a significant fraction of total energy consumption, 25% on average and up to 44% for 2DCONV and BFS. ICC and CC reduce NoC energy by 13.5% on average and up to 26.3%. The reduction in L2 cache energy is also significant (7.4% on average). However, because of the smaller contribution of the L2 cache to total system energy compared to the NoC, the impact is relatively limited. Overall, we observe significant system energy savings for the benchmarks that benefit from exploiting intra-cluster locality, see for example 2DCONV (16.7%), LUD (12%), HS (7.7%), BT (5%) and FDTD (4%).

Figure 3.7 quantifies the energy-delay product (EDP), a well-established metric for energy efficiency. ICC improves EDP by 12% on average and up to 30%. ICC along with CC leads to an average EDP improvement of 16.7%, and up to 48% (LUD).


Figure 3.8: NoC traffic (number of NoC read requests) reduction for ICC and CC. ICC along with CC reduces NoC traffic by 19.5% on average.

3.5.3 NoC Traffic

To investigate where the performance improvements and energy savings are coming from, we now report the NoC traffic, which we quantify by counting the number of read requests through the NoC. Figure 3.8 reports the NoC traffic reduction.

The ICC unit decreases NoC traffic by 14% on average, and up to 45% (DCT), through coalescing within a cluster. Some benchmarks experience a significant NoC traffic reduction, including DCT (45%), HS (28%), LUD (23%), 2DCONV (14.3%) and FDTD (11%). These benchmarks are also the workloads experiencing the largest performance and energy improvements. Note though that the correlation is not perfect — NoC traffic reduction also depends on the fraction of read requests. Benchmarks with more write requests than reads do not experience an equally high reduction in NoC traffic.

The coalesced cache keeps track of coalesced cache lines upon eviction from the merge table. This prolongs the opportunity to exploit intra-cluster locality, reducing NoC traffic by an additional 5.5% on average. The coalesced cache decreases NoC traffic significantly for some benchmarks, including HS (14%), LUD (13%) and 2DCONV (14.7%). These benchmarks are also the workloads that are more sensitive to the time window, see Figure 3.2.

3.5.4 Sensitivity Analysis

Our baseline configuration assumed 12 clusters with 5 SMs each. We now vary the number of SMs per cluster and include configurations with 6, 10, 12 and 15 clusters. To keep the total number of SMs constant at 60, each cluster consists of 10, 6, 5 and 4 SMs, respectively. We assume one NoC port per cluster, hence the effective number of NoC ports per SM increases as we increase the number of clusters. We assume NoC bandwidth is constant across the different configurations — note that NoC bandwidth is bounded by the 8 memory controllers. (For the configuration with 6 clusters, we therefore increase the NoC frequency to 1.9 GHz to keep NoC bandwidth constant.)


Figure 3.9: IPC improvement for ICC and CC as a function of the number of clusters while keeping total SM count constant at 60 SMs. ICC along with CC consistently improves performance across different cluster sizes and effective NoC port counts per SM.

Finally, we also change the number of merge table entries according to the number of SMs in a cluster, i.e., we set the size of the merge table to 96, 60, 48 and 40 entries for 6, 10, 12 and 15 clusters, respectively. We (obviously) set the number of bits in the SM list field in the merge table to be the same as the number of SMs in a cluster. We assume a 24-entry coalescing cache per cluster.

Figure 3.9 reports the IPC improvement (percentage speedup) as a function of the number of clusters, comparing distributed scheduling with ICC and CC versus distributed scheduling. The key observation here is that ICC with CC is effective across the different GPU architecture configurations. Even with as few as 4 SMs per cluster sharing one NoC port (15 clusters in total), we still observe an average performance improvement of 14.4% (and up to 69%) with ICC and CC. The highest performance improvement is achieved for 10 clusters with 6 SMs each, where we note an average 16.6% performance improvement with ICC and CC.

3.6 Related Work

To the best of our knowledge, this is the first work to target coalescing memory requests across SMs within a cluster to tackle the NoC bottleneck in GPUs. We now discuss the most closely related work in inter-SM locality, GPU NoC optimization and memory access coalescing.

Exploiting inter-SM locality. Tarjan and Skadron [110] propose a central sharing tracker (ST) to exploit data sharing among SMs. They consider a GPU architecture that lacks an on-chip last-level cache (LLC). Through the ST, L1 misses are sent to other SMs to obtain the data from another L1 cache (if available) instead of accessing off-chip main memory. Li et al. [78] prioritize memory requests shared by multiple SMs at the DRAM controller side. None of this prior work considers inter-CTA locality as a potential solution for the GPU NoC bottleneck in clustered GPUs.

There are proposals exploiting inter-SM locality through communication between different SMs. Cooperating Cache [39] allows communication between multiple L1 caches by connecting the cores via a ring NoC. Mohamed et al. [52] leverage remote-core bandwidth to improve performance under a 2D-mesh-based GPU NoC. Unfortunately, these designs require the architecture to support inter-core communication with extra links.

GPU NoC optimization. Two recent approaches address the GPU NoC bottleneck by exploiting inter-SM locality. In particular, Zhao et al. [126] propose an inter-SM locality-aware LLC design that transforms few-to-many NoC traffic into many-to-many traffic to increase the effective network bandwidth utilization. Kim et al. [67] exploit packet coalescing to reduce data redundancy in GPUs. These two prior approaches rely on a mesh NoC. Although the latter work also exploits packet coalescing, it coalesces redundant replies at each MC. This only alleviates the MC bottleneck; the traffic caused by a multicast operation to transfer the data back to the requesting SMs is not addressed, which may lead to serialization delays in the NoC routers. None of this prior work considers intra-cluster locality to reduce GPU NoC pressure.

Bakhoda et al. [28] propose a checkerboard router to reduce the NoC cost while providing multiple input ports for the MCs to increase the injection rate. The bandwidth-efficient NoC design by Jang et al. [54] leverages asymmetric virtual channel (VC) partitions to assign more VCs to reply packets, which occupy a large portion of network traffic. Ziabari et al. [128] propose asymmetric NoCs in which the reply network features high network bandwidth. Zhao et al. [125] propose a ring-like NoC to provide high bandwidth for servicing reply packets in a cost-effective way. These previous proposals only focus on the NoC topology, but could be combined with intra-cluster coalescing to further improve their performance.

There are also research efforts on NoC optimization for CPU-GPU integrated heterogeneous systems [37, 87, 122]. This tightly integrated architecture leads to more efficient communication between CPU and GPU applications and better programmability. However, it causes severe contention in shared resources such as the NoC. Jia et al. [122] propose an asynchronous batch scheduling policy to provide fair bandwidth allocation in one shared network for CPU and GPU workloads. They also propose a CPU-workload priority scheme in the NoC router, considering the fact that CPU workloads are more sensitive to latency. BiNoCHS [87] reconfigures network resources to traffic patterns in a heterogeneous CPU/GPU system. Different from the crossbar network in GPUs, CMP systems usually adopt a mesh NoC. Moreover, multi-application interference and QoS problems in the NoC are out of our scope since we focus on single-kernel execution.

Memory access coalescing. Coalescing techniques for GPUs have been widely investigated, see for example [25, 49, 67, 68, 96]. Intra-warp coalescing is widely deployed in GPUs to group aligned memory accesses of different threads in a warp [49]. To coalesce memory accesses from different warps, WarpPool [68] merges requests between warps within an SM to increase the effective L1 cache bandwidth. Hodjat et al. propose register coalescing [25], which combines multiple register reads from the same instruction into a single physical register read, as long as these registers are stored in the same physical register entry, to reduce dynamic energy in the register file. However, prior work targets coalescing within an SM. In contrast, we exploit the potential of coalescing redundant memory accesses from different SMs within a cluster.

3.7 Summary

As the number of SMs in next-generation GPUs continues to increase, NoC congestion quickly becomes a key design challenge to scale performance. Clustered GPUs face a severe NoC bottleneck with increasing SM count. To mitigate network congestion, we propose intra-cluster coalescing (ICC) and the coalesced cache (CC) to exploit the intra-cluster locality observed in many GPU-compute applications. ICC coalesces memory requests from different SMs in a cluster to the same L2 cache line to reduce the overall number of requests and replies sent over the NoC. CC extends the coalescing opportunity by caching coalesced cache lines at the cluster level for future reference.

We find that on average 19% (and up to 48%) of all L1 misses originating from a cluster indeed access the same cache lines. ICC along with CC leads to an average 15% (and up to 69%) performance improvement while at the same time reducing system energy by 5.3% on average (and up to 16.7%) and the energy-delay product by 12% and up to 30% compared to state-of-the-art distributed scheduling. The overarching contribution of this work is the exploitation of inter-CTA locality, an inherent GPU-compute workload characteristic, to tackle the emerging NoC congestion bottleneck in clustered GPUs and to improve overall system performance by coalescing memory requests across SMs within a cluster.


Chapter 4

Distributed-Block CTA Scheduling

4.1 Introduction

In the previous chapter, we proposed the ICC and CC units to capture intra-cluster locality and reduce NoC pressure in clustered GPU architectures. However, how much locality can be exploited not only depends on the application but also on the mapping of CTAs to SMs, also called CTA scheduling. This motivates us to propose locality-aware CTA scheduling policies to further reduce NoC pressure. Unfortunately, most state-of-the-art CTA scheduling policies (e.g., two-level round-robin, global round-robin and greedy clustering) expose limited intra-cluster locality in order to maintain load balance. For instance, the two-level round-robin CTA scheduling policy typically maps consecutive CTAs that have high locality to different clusters and then SMs. Distributed scheduling, which was proposed in MCM-GPU [24], addresses this issue by uniformly distributing CTAs across clusters and yields the highest intra-cluster coalescing opportunity. Unfortunately, distributed scheduling is agnostic to data locality when allocating CTAs to SMs within the same cluster.

Motivated by this observation, we propose distributed-block scheduling. In contrast to prior work in CTA scheduling, distributed-block scheduling exploits cache locality at both the cluster level and the SM level. At the cluster level, consecutive CTAs are mapped to the same cluster, similar to distributed scheduling [24]. At the SM level, CTAs are allocated in groups of two to further exploit L1 cache locality, similar to block scheduling [73]. This two-level approach maximizes the opportunities to exploit inter-CTA locality among CTAs at the L1 cache within an SM by first mapping a group of consecutive CTAs at the cluster level, and by subsequently mapping pairs of consecutive CTAs at the SM level. Inter-CTA locality is exploited to improve L1 cache performance (decreasing the reuse distance between accesses to the same memory location, thereby increasing the L1 cache hit rate) and to increase the coalescing opportunities in the L1 miss status handling registers (MSHRs).

Although distributed-block scheduling reduces the number of redundant requests at the SM level, it does not tackle the redundant requests originating from different SMs within a cluster. We therefore combine the distributed-block scheduling policy with ICC and CC (proposed in Chapter 3) to exploit coalescing opportunities across SMs within a cluster.

Distributed-block scheduling, ICC and CC operate synergistically and significantly reduce NoC pressure, which improves performance while at the same time reducing energy consumption. We report that distributed-block scheduling by itself improves performance by 4% on average (up to 16%) and reduces system energy by 1.2% (up to 6.5%) over state-of-the-art distributed scheduling [24]. Distributed-block scheduling with ICC and CC improves performance by 16% on average (up to 67%) while simultaneously reducing system energy by 6% (up to 18%) and EDP by 19% (up to 47%).

In this chapter, we make the following contributions:

• We demonstrate the significant interaction between ICC and CTA scheduling, i.e., ICC benefits more when the CTA scheduling policy maps neighboring CTAs to the same cluster to better exploit inter-CTA locality.

• We propose distributed-block scheduling to exploit inter-CTA locality at the SM level by mapping groups of consecutive CTAs at the cluster level and then pairs of consecutive CTAs at the SM level.

• We study the complementarity and interaction between CTA scheduling, ICC and CC as a solution to reduce GPU NoC pressure.

• We report that distributed-block scheduling by itself improves performance by 4% on average (up to 16%) and reduces system energy by 1.2% (up to 6.5%) over state-of-the-art distributed scheduling. Distributed-block scheduling with ICC and CC improves performance by 16% on average (up to 67%) while simultaneously reducing system energy by 6% (up to 18%) and EDP by 19% (up to 47%).

4.2 CTA Scheduling versus ICC

Intra-cluster locality is not only a function of the algorithm or its implementation. It is also greatly affected by how CTAs are mapped to clusters. In order to illustrate this point, we consider four previously proposed CTA scheduling policies, which we illustrate using the example shown in Figure 4.1. The example assumes 10 CTAs in total; we further assume two clusters with two SMs per cluster; each SM can execute two CTAs.


Figure 4.1: Illustrating the five CTA scheduling algorithms (two-level round-robin, global round-robin, greedy-clustering, distributed and distributed-block) for a 10-CTA workload. We assume a GPU architecture with 2 clusters with 2 SMs each; we can allocate at most 2 CTAs per SM. The top row shows the initial mapping of CTAs to clusters and SMs; the bottom row shows the mapping of the next CTA to schedule after CTA 1 finishes its execution.


4.2.1 Scheduling Algorithms

Two-level round-robin first distributes CTAs across clusters; once all clusters have one CTA assigned, we assign CTAs across SMs within a cluster; finally, when all SMs in all clusters are assigned one CTA, we assign additional CTAs per SM — the assignment of additional CTAs is done the same way. This CTA scheduling algorithm has the advantage of distributing the CTAs uniformly across all clusters and SMs in the system.

Global round-robin, or one-level round-robin scheduling, first distributes CTAs across all SMs within a cluster and then across clusters, i.e., it assigns a CTA to the first SM and a second CTA to the second SM in the first cluster; once all SMs in the first cluster are assigned one CTA, we move to the second cluster, and so forth. Once all SMs in all clusters have one CTA assigned, we then assign additional CTAs to the SMs. The assignment of additional CTAs per SM is done in the same manner.

Greedy-clustering assigns as many CTAs as possible to the first cluster before proceeding to the next, i.e., the first CTA is assigned to the first SM and the second CTA is assigned to the second SM in the first cluster; once all SMs in the cluster have one CTA assigned, additional CTAs are assigned to the cluster until all SMs can take no more additional CTAs. It then moves to the next cluster. This greedy-clustering algorithm has the advantage of fully utilizing the allocated SMs and clusters. However, for kernels with a limited number of CTAs, this policy may lead to imbalanced execution, i.e., not all clusters are assigned the same workload. While this is not a concern for GPU-compute workloads that consist of a large number of CTAs, it may be problematic for others.

These three CTA scheduling policies share the common limitation that they expose limited intra-cluster locality. As mentioned before, inter-CTA locality typically occurs between neighboring CTAs. Compared to the other two policies, greedy-clustering may be advantageous because it assigns neighboring CTAs to the same cluster. The number of neighboring CTAs assigned to the same cluster under two-level round-robin and global round-robin scheduling is more limited. However, these three policies do not make any guarantees to exploit intra-cluster locality during the execution. In particular, when a CTA on an SM finishes execution, a new CTA needs to be launched, and this is done without considering the locality between the new CTA and the CTAs already executing on the cluster. That is, when CTA 1 finishes in the example shown in Figure 4.1, CTA 9 gets scheduled and assigned to the SM previously executing CTA 1. Unfortunately, there may be limited or no inter-CTA locality between CTA 9 and the CTAs already running on the same cluster.

Distributed scheduling, as proposed in MCM-GPU [24], addresses this issue by uniformly distributing CTAs across clusters, i.e., all clusters get the same number of CTAs assigned in a per-cluster pool of CTAs. In the example in Figure 4.1, there are 10 CTAs in total. Distributed scheduling first splits up the set of CTAs evenly across the two clusters, i.e., CTAs 1 through 5 are assigned to cluster #1, and CTAs 6 through 10 are assigned to cluster #2.


Figure 4.2: Normalized IPC for two-level round-robin, greedy-clustering, global round-robin and distributed CTA scheduling. Distributed scheduling outperforms the other three policies on average.

In the next step, it maps a block of neighboring CTAs to each cluster from the respective pools, i.e., CTAs 1 through 4 are mapped to cluster #1, and CTAs 6 through 9 are mapped to cluster #2. This is similar to greedy-clustering except that greedy-clustering does this from a global pool of CTAs whereas distributed scheduling considers a per-cluster pool of CTAs. The key difference with the other CTA scheduling policies appears when a CTA finishes its execution, e.g., CTA 1 at the bottom in Figure 4.1. As mentioned above, two-level round-robin, global round-robin and greedy-clustering scheduling select and assign the next CTA from the global CTA pool, i.e., CTA 9 is selected and mapped to the SM and cluster where CTA 1 just finished its execution, namely the first SM in cluster #1. Distributed scheduling on the other hand selects the next CTA from the cluster's CTA pool, i.e., CTA 5 is mapped to cluster #1. This is a major difference as it enables distributed scheduling to continuously exploit inter-CTA locality and assign neighboring CTAs to the same cluster during the entire execution.
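As an illustration of the key difference discussed above, the sketch below contrasts how a freed CTA slot is refilled from one global pool (as in two-level round-robin, global round-robin and greedy-clustering) versus from per-cluster pools (distributed scheduling). It is a simplified Python model with hypothetical function names, not the hardware CTA scheduler.

from collections import deque

def split_into_cluster_pools(num_ctas, num_clusters):
    """Distributed scheduling: evenly split consecutive CTAs over per-cluster pools."""
    per_cluster = num_ctas // num_clusters
    ctas = list(range(1, num_ctas + 1))
    return [deque(ctas[c * per_cluster:(c + 1) * per_cluster])
            for c in range(num_clusters)]

def refill_from_global_pool(global_pool, finished_cluster):
    # Two-level/global round-robin and greedy-clustering: the next CTA comes
    # from a single global pool, regardless of which cluster freed the slot,
    # so it may have no locality with the cluster's resident CTAs.
    return global_pool.popleft() if global_pool else None

def refill_from_cluster_pool(cluster_pools, finished_cluster):
    # Distributed scheduling: refill from the cluster's own pool, so
    # neighboring CTAs keep landing on the same cluster.
    pool = cluster_pools[finished_cluster]
    return pool.popleft() if pool else None

# Example mirroring Figure 4.1: 10 CTAs, 2 clusters (index 0 is cluster #1).
pools = split_into_cluster_pools(10, 2)    # cluster #1: CTAs 1-5, cluster #2: CTAs 6-10
for _ in range(4):
    pools[0].popleft()                     # CTAs 1-4 are resident on cluster #1
print(refill_from_cluster_pool(pools, 0))  # prints 5: CTA 5 refills cluster #1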

4.2.2 Performance Analysis

Figure 4.2 reports performance (IPC) normalized to two-level round-robin. We observe that distributed scheduling is the best performing scheduling policy. Significantly higher performance is achieved for FDTD (35%) and a couple of other benchmarks, including BT (12%), 2DCONV (10%), NN (9%) and LUD (6%). The reason for the higher performance is improved L1 cache locality. Distributed scheduling achieves lower performance than the other policies for SRAD (11%) and BP (4%) because of workload imbalance across SMs. On average, we report that distributed scheduling is the best performing policy, i.e., it achieves 6% higher performance on average compared to two-level round-robin scheduling. Because it is the best performing policy, we consider distributed scheduling as the default CTA scheduling policy in this thesis.¹

¹ Note this is also the baseline policy considered in the previous chapter.


Figure 4.3: Intra-cluster locality for the different CTA scheduling policies. CTA scheduling policies have a substantial impact on the exploitable intra-cluster locality, and distributed scheduling yields the highest opportunity.

The comparison becomes even more interesting as we consider intra-cluster locality, see Figure 4.3, which quantifies intra-cluster locality as previously defined in Section 3.2.3 for the different CTA scheduling policies with a time window of 2,000 cycles. Intra-cluster locality is the highest for distributed scheduling, i.e., we measure that on average 19% (and up to 48% for HS and 46% for DCT) of the requests within a cluster are redundant. The reason is that distributed scheduling maintains inter-CTA locality across consecutive CTAs, not only at the beginning of the execution but also during the execution as CTAs finish and new CTAs get launched. This makes distributed scheduling particularly amenable to intra-cluster coalescing, and in addition, reinforces the choice to use distributed scheduling as our baseline.

There are two possibilities to exploit the observed intra-cluster locality. The first approach is to coalesce requests to the same cache lines across SMs in a cluster, which is the approach taken by the newly proposed intra-cluster coalescing mechanism as discussed in the previous chapter. The second approach is based on the observation that a fraction of the locality across SMs within a cluster can be captured by changing the CTA scheduling policy. This motivates us to propose the distributed-block scheduling policy in the next section.

4.3 Distributed-Block Scheduling

Among the state-of-the-art CTA scheduling strategies, distributed scheduling clearly exposes the most intra-cluster locality. It does so by uniformly distributing CTAs across clusters. However, CTAs are allocated following a (default) round-robin strategy within a cluster. As a result, consecutive CTAs may be allocated to different SMs in a cluster, which may lead to reduced L1 cache performance and/or missed opportunities to coalesce requests at the L1 cache MSHRs. We therefore propose two-level distributed-block CTA scheduling, or distributed-block scheduling for short, to exploit coalescing opportunities at the cluster level and at the same time increase locality benefits at the L1 cache level. At the cluster level, we use distributed scheduling to maximize intra-cluster locality. At the SM level, we leverage block CTA scheduling (BCS) [73], which assigns a block of two CTAs to the same SM. The intuition is to increase the opportunity for exploiting data cache line locality across CTAs if those CTAs get allocated to the same SM at the same time. BCS delays the scheduling of CTAs to an SM until there are two CTA contexts available on the SM to simultaneously schedule two CTAs. This has two potential benefits: (i) the L1 cache hit rate improves because of improved locality, and (ii) there are more coalescing opportunities in the L1 cache MSHRs because the reuse distance between two accesses to the same memory location is reduced. In other words, inter-CTA locality gets exploited within a single SM, which leads to overall higher performance. Note though that delayed CTA scheduling may also incur some performance overhead, i.e., if an SM only has one CTA context left, no new CTA can be allocated until another CTA context becomes available, which leaves SM resources underutilized.

Distributed-block scheduling is illustrated in Figure 4.1. At the cluster level, distributed-block scheduling performs the same as distributed scheduling. We first split the CTAs evenly and assign the first 5 CTAs to cluster #1 and the next five CTAs to cluster #2. In the next step, rather than allocating CTAs using a default round-robin strategy, we allocate CTAs at a granularity of 2 to the different SMs in a cluster. CTAs 1 and 2 are allocated to the first SM in cluster 1; CTAs 3 and 4 are allocated to the second SM in cluster 1. By doing so, CTAs with higher inter-CTA locality are likely to be allocated to the same SM. Note that distributed-block scheduling follows a delayed mapping strategy once a CTA finishes its execution, as illustrated in the bottom part of Figure 4.1. New CTAs can only be allocated to an SM if at least two CTA contexts are available. Hence, CTA 5 can be allocated to SM #1 only when both CTAs 1 and 2 have finished their execution.
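The sketch below extends the previous example to distributed-block scheduling; it is again a simplified illustration with hypothetical names, showing the pairwise assignment of CTAs to SMs and the delayed refill once two CTA contexts are free.

from collections import deque

def assign_initial_pairs(cluster_pool, num_sms, ctas_per_sm=2):
    """Distributed-block: hand out CTAs to SMs in blocks of two."""
    sm_ctas = [[] for _ in range(num_sms)]
    for sm in range(num_sms):
        for _ in range(ctas_per_sm):
            if cluster_pool:
                sm_ctas[sm].append(cluster_pool.popleft())
    return sm_ctas

def refill_sm(cluster_pool, sm_ctas, sm, ctas_per_sm=2):
    # Delayed mapping: only refill when the SM has at least two free CTA
    # contexts, so a new pair of consecutive CTAs lands on the same SM.
    free_contexts = ctas_per_sm - len(sm_ctas[sm])
    if free_contexts < 2 or not cluster_pool:
        return []
    pair = [cluster_pool.popleft() for _ in range(min(2, len(cluster_pool)))]
    sm_ctas[sm].extend(pair)
    return pair

# Cluster #1 of the Figure 4.1 example: CTAs 1-5, two SMs, 2 CTA contexts per SM.
pool = deque([1, 2, 3, 4, 5])
sms = assign_initial_pairs(pool, num_sms=2)  # SM1: [1, 2], SM2: [3, 4]
sms[0].remove(1)                             # CTA 1 finishes
print(refill_sm(pool, sms, 0))               # [] -- wait until CTA 2 also finishes
sms[0].remove(2)                             # CTA 2 finishes
print(refill_sm(pool, sms, 0))               # [5] -- CTA 5 allocated to SM1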

4.4 Results

We now evaluate the distributed-block scheduling policy for clustered GPUs without and with our proposed ICC and CC units. This is done in a number of steps. We use the same experimental setup as in Chapter 3. We start by investigating the performance improvement and the reduction in energy consumption, after which we analyze the impact on NoC traffic. Finally, we provide a sensitivity analysis with respect to cluster size and the effective number of NoC ports per SM.

4.4.1 Performance

Figure 4.4 and Figure 4.5 report the performance improvement for distributed-block scheduling (without and with ICC plus CC) versus distributed scheduling (without ICC). A couple of interesting observations can be drawn from these results.


Figure 4.4: IPC improvement for distributed-block scheduling versus distributed scheduling. Distributed-block scheduling improves performance by 4% on average (up to 16%).

Figure 4.5: IPC improvement for distributed-block scheduling with ICC and CC versus distributed scheduling. Distributed-block scheduling with ICC and CC yields an average of 16% (up to 67%) performance improvement.

As shown in Figure 4.4, distributed-block scheduling outperforms distributed scheduling for all benchmarks, and by 4% on average. For applications that exhibit L1 cache locality such as FDTD, MM and 2DCONV, distributed-block scheduling improves performance by 15%, 10% and 7%, respectively. This is because part of the inter-CTA locality is captured at the L1 cache within an SM. However, for applications such as HS and DCT which exhibit high intra-cluster locality, distributed-block scheduling does not show any IPC improvement. In HS, the lack of improvement is due to the high L1 cache miss rate, which under the GTO warp scheduling policy leads to cache lines being replaced before being referenced again. Hence, even though two CTAs with cache-line-related locality are allocated to the same SM, this does not result in improved L1 cache performance. The reason is different for DCT: this benchmark contains more writes than reads. As a result, reducing the number of L1 read misses has limited impact on overall performance.


As shown in Figure 4.5, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% on average (and up to 67%) compared to state-of-the-art distributed scheduling. By exploiting locality at the cluster level, intra-cluster coalescing (ICC) and the coalesced cache (CC) lead to an additional performance improvement of 11% on average over distributed-block scheduling. Several benchmarks experience a substantial performance improvement when combining distributed-block scheduling with ICC and CC, see for example LUD (67%), HS (23%), DCT (18%) and 2DCONV (46.5%). Generally speaking, benchmarks with high intra-cluster locality, see Figure 3.2, benefit more from ICC.

We further note that distributed scheduling with ICC and CC yields similar performance benefits as distributed-block scheduling, with an average performance improvement of 15% (see Figure 4.5). However, distributed-block scheduling with ICC and CC is a more robust solution: it leads to substantial performance gains, larger than 10%, for seven of the benchmarks, whereas distributed scheduling with ICC and CC leads to similarly high performance gains for only five of the benchmarks.

4.4.2 Energy Efficiency

Figure 4.6 quantifies the impact on the overall system (GPU plus DRAM) energy consumption by providing a breakdown of where energy is consumed. We report that distributed-block scheduling by itself reduces energy consumption by 1.2% on average (up to 6.5%). Distributed-block scheduling with ICC and CC reduces system energy by 6% on average and up to 18% for 2DCONV. The reduction in energy consumption comes from two sources: reduced NoC energy and reduced L2 energy. The NoC accounts for a significant fraction of total energy consumption, 25% on average and up to 44% for 2DCONV and BFS. Distributed-block scheduling with ICC and CC reduces NoC energy by 16% on average and up to 30%. The reduction in L2 cache energy is also significant (10% on average). However, because of the smaller contribution of the L2 cache to total system energy compared to the NoC, the impact is relatively limited. Overall, we observe significant system energy savings for the benchmarks that benefit from exploiting intra-cluster locality, see for example 2DCONV (18%), LUD (11%), FDTD (8%), HS (6%), BT (5%) and MM (5%).

Figure 4.7 quantifies the energy-delay product (EDP), a well-established metric for energy efficiency. Distributed-block scheduling by itself improves EDP by 5% on average and up to 20% compared to distributed scheduling. Distributed-block scheduling with ICC and CC improves EDP by 19% on average, and up to 47% (LUD).

4.4.3 NoC Traffic

To investigate where the performance improvements and energy savings are coming from, we now report the NoC traffic, which we quantify by counting the number of read requests through the NoC.


Figure 4.6: Energy consumption breakdown normalized to distributed scheduling (D), split into NoC, L1, L2, MC, SM and DRAM components, for D, DB, D+ICC+CC and DB+ICC+CC. Distributed-block scheduling with ICC and CC reduces system energy by 6% on average.


Figure 4.7: EDP reduction for distributed-block scheduling without and with ICC plus CC versus distributed scheduling. Distributed-block scheduling by itself reduces system EDP by 5%; along with ICC and CC it reduces system EDP by 19% on average.

Figure 4.8: NoC traffic (number of NoC read requests) reduction for distributed-block scheduling, ICC and CC versus distributed scheduling. Distributed-block scheduling by itself reduces NoC traffic by 6% on average; along with ICC and CC it reduces NoC traffic by 20% on average.

Figure 4.8 reports the NoC traffic reduction.

Distributed-block scheduling reduces NoC traffic by 6% on average, as a result of coalescing L1 misses in the MSHRs within an SM. We observe a significant reduction in NoC traffic for a number of benchmarks, including FDTD (24%), DCT (15%), MM (10%) and 2DCONV (7%). These benchmarks also experience a significant performance improvement as previously reported in Figure 4.4. The only exception is DCT, given its high percentage of writes versus reads. Other benchmarks such as HS and BT do not benefit either, due to a high L1 cache miss rate (90% for HS) and limited locality among consecutive CTAs (as is the case for BT).


Figure 4.9: IPC improvement for distributed-block scheduling with ICC and CC versus distributed scheduling as a function of the number of clusters while keeping total SM count constant at 60 SMs. Distributed-block scheduling with ICC and CC consistently improves performance across different cluster sizes and effective NoC port counts per SM.

Some of the inter-CTA locality cannot be exploited through the L1 cache MSHRs within an SM, but can be exploited across SMs within a cluster through the ICC and CC units, further decreasing NoC traffic by 14% on average. Combining distributed-block scheduling with ICC and CC leads to an average reduction in NoC traffic of 20%. Some benchmarks experience a significant NoC traffic reduction, including DCT (47%), 2DCONV (33%), LUD (32%), FDTD (26%) and HS (19%). These benchmarks are also the workloads experiencing the largest performance and energy improvements. Note though that the correlation is not perfect — NoC traffic reduction also depends on the fraction of read requests. Benchmarks with more write requests than reads do not experience an equally high reduction in NoC traffic.

4.4.4 Sensitivity Analysis

We use the same configurations as in the previous chapter (see Section 3.5.4) to analyze the performance sensitivity to different cluster sizes. Figure 4.9 reports the IPC improvement (percentage speedup) as a function of the number of clusters, comparing distributed-block scheduling with ICC plus CC versus distributed scheduling. The key observation here is that distributed-block scheduling with ICC and CC is effective across the different GPU architecture configurations. Even with as few as 4 SMs per cluster sharing one NoC port (15 clusters in total), we still observe an average performance improvement of 16% (and up to 68%) from distributed-block scheduling with ICC and CC. The highest performance improvement is achieved for 10 clusters with 6 SMs each. We note an average 18% performance improvement for distributed-block scheduling with ICC and CC.


4.5 Related Work

We now discuss the related work on scheduling policies in GPUs.

Warp scheduling. To improve the performance of GPUs and their memory subsystems, various warp schedulers have been proposed to reduce memory latency or cache contention [59, 99, 100]. OWL [59] prioritizes warps from a set of CTAs to increase data locality and avoid bank conflicts. CCWS [99] uses a reactionary mechanism to control the number of warps sharing the cache when thrashing is detected. DAWS [100] uses cache footprint prediction to suspend some warps when warps lose locality. All these schemes aim at avoiding cache contention by optimizing a core's warp schedulers.

CTA scheduling. Several prior proposals exploit inter-CTA locality to improve CTA scheduling. In particular, Lee et al. [73] and Mao et al. [85] propose to dispatch groups of two consecutive CTAs onto the same SM to improve L1 cache performance. Unfortunately, this exploits locality between consecutive CTAs located in a row only. Chen et al. [33] propose a software-hardware cooperative design to exploit spatial locality among different CTAs located in different rows and columns. Li et al. [77] propose software techniques to schedule CTAs with potential reuse on the same SM to exploit inter-CTA locality on real GPU hardware. Wang et al. [116] propose a CTA scheduler specifically designed for dynamic parallelism execution models. They exploit data locality between parent and child CTAs to increase cache hit rates. Similar to warp throttling techniques, Kayiran et al. [62] propose a dynamic CTA scheduling policy to allocate fewer CTAs for applications suffering from intensive memory contention. However, none of these prior proposals explore CTA scheduling to improve intra-cluster coalescing opportunities.

4.6 Summary

In this chapter, we demonstrate the significant interaction between ICC and CTA scheduling policies. We also find that ICC benefits more when the CTA scheduling policy maps neighboring CTAs to the same cluster to better exploit inter-CTA locality. Based on this observation, we propose distributed-block scheduling, a two-level CTA scheduling policy that first evenly distributes consecutive CTAs across clusters, and subsequently schedules pairs of consecutive CTAs per SM to maximize L1 cache locality and L1 MSHR coalescing opportunities. Through execution-driven GPU simulation, we find that distributed-block scheduling improves GPU performance by 4% (and up to 16%) while at the same time reducing system energy and EDP by 1.2% and 5%, respectively, compared to the state-of-the-art distributed scheduling policy. In addition, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% (and up to 67%) and reduce system energy by 6% (and up to 18%).


Chapter 5

The GPU Memory Divergence Model

5.1 Introduction

Several contemporary GPU applications differ from traditional GPU-compute workloads as they put a much larger strain on the memory system. More specifically, they are memory-intensive and memory-divergent. These applications typically have strided or data-dependent access patterns which cause the accesses of the concurrently executing threads to be divergent, as loads from different threads access different cache lines. We refer to this application class as Memory Divergent (MD), in contrast to the more well-understood Non-Memory Divergent (NMD) applications.

Analytical GPU Performance Modeling. Analyzing and optimizing GPU architecture for the broad diversity of modern-day GPU-compute applications is challenging. Simulation is arguably the most commonly used evaluation tool as it enables detailed, even cycle-accurate, analysis. However, simulation is excruciatingly slow and parameter sweeps commonly require thousands of CPU hours. An alternative approach is modeling, which captures the key performance-related behavior of the architecture in a set of mathematical equations, which is much faster to evaluate than simulation.

Modeling can be broadly classified into machine-learning (ML) based modeling versus analytical modeling. ML-based modeling [40, 118] requires offline training to infer or learn a performance model. A major limitation of ML-based modeling is that a large number (typically thousands) of training examples are needed to infer a performance model. These training examples are obtained through detailed simulation, which leads to a substantial one-time cost. Moreover, extracting insight from an ML-based performance model is not straightforward, i.e., an ML-based model is a black box.


Table 5.1: MD-applications across widely used GPU benchmark suites. MD-applications are common.

Suite          Ref.    #MD-app.   #NMD-app.   Total
Rodinia        [32]    4          12          16
Tango          [6]     2          6           8
LonestarGPU    [29]    4          2           6
Polybench      [45]    4          8           12
Mars           [48]    6          0           6
Parboil        [107]   1          7           8
Total                  21         35          56

Analytical modeling [26, 50, 51, 103, 115, 124], on the other hand, derives a performance model from a fundamental understanding of the underlying architecture, driven by first principles. Analytical models provide deep insight, i.e., the model is a white box, and the one-time cost is small once the model has been developed. The latter is extremely important when exploring large design spaces, making analytical models ideally suited for fast, early-stage architecture exploration.

This work advances the state-of-the-art in analytical GPU performance modeling by expanding its scope, improving its practicality, and enhancing its accuracy. GPUMech [51] is the state-of-the-art analytical performance model for GPUs; however, it is highly inaccurate for memory-divergent applications, and in addition, it is impractical as it relies on functional simulation to collect the model inputs. In this chapter, we propose the MDM GPU performance model, which is highly accurate across the broad spectrum of MD- and NMD-applications, and in addition is practical by relying on binary instrumentation for collecting model inputs. Before diving into the specific contributions of this work, we first point out the prevalence of memory-divergent applications.

Memory-Divergent Applications are Prevalent. We use Nvidia's Visual Profiler tool [12] on an Nvidia GTX 1080 GPU to categorize all benchmarks across the Rodinia [32], Parboil [107], Polybench [45], Tango [6, 61], LonestarGPU [29], and Mars [48] benchmark suites (see Table 5.1). We define an application as memory-divergent if it features more than 10 Divergent loads Per Kilo Instructions (DPKI). Note that this classification is not particularly sensitive to the DPKI threshold: most NMD-applications have a DPKI at, or close to, 0 (maximum 3) while the average DPKI for the MD-applications is around 64 (minimum is 10). DPKI is an architecture-independent metric and identifies the applications where memory divergence significantly affects performance since it captures both memory intensity and degree of memory divergence in a single number. Overall, we find that 38% (i.e., 21 out of 56) of the benchmarks are memory divergent (i.e., have a DPKI larger than 10). Given the prevalence of MD-applications, it is clear that analytical models must capture their key performance-related behavior appropriately.
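As an illustration of this classification, the snippet below computes DPKI from two profiler-style counters and applies the 10-DPKI threshold; the function and variable names are hypothetical placeholders, not Visual Profiler metric names.

MD_THRESHOLD_DPKI = 10  # classification threshold used in this chapter

def dpki(divergent_loads, instructions):
    """Divergent loads Per Kilo Instructions (architecture-independent)."""
    return 1000.0 * divergent_loads / instructions

def classify(divergent_loads, instructions):
    return "MD" if dpki(divergent_loads, instructions) > MD_THRESHOLD_DPKI else "NMD"

# Hypothetical example: 6.4M divergent loads over 100M executed instructions
# gives DPKI = 64, i.e., a memory-divergent (MD) application.
print(dpki(6_400_000, 100_000_000), classify(6_400_000, 100_000_000))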

Memory divergence is challenging to model analytically and existing models are highly inaccurate: in particular, GPUMech [51] incurs an average performance error of 298% for a broad set of MD-applications, even though GPUMech is accurate for NMD-applications. The key problem is that prior work does not model the performance impact of Miss Status Holding Registers (MSHRs) [70], nor does it accurately account for Network-on-Chip (NoC) and DRAM queueing delays. More specifically, concurrent cache requests in an MD-application (mostly) map to different L1 cache blocks, which causes the L1 cache to run out of MSHRs. This cripples the ability of a GPU's Streaming Multiprocessors (SMs) to hide memory access latency through Thread-Level Parallelism (TLP), i.e., the SMs can no longer execute independent warps because the cache cannot accept further accesses until one of the current misses resolves. Furthermore, memory divergence incurs a flood of memory requests, which leads to severe congestion in the NoC and DRAM subsystems, which in turn leads to long queueing delays.

Contributions. In this work, we propose the Memory Divergence Model (MDM), the first analytical performance model for GPUs that accurately predicts the performance of MD-applications. MDM builds upon two key insights. First, L1 cache blocking causes the memory requests of concurrently executing warps to be processed in batches. The reason is that MD-applications have more concurrent misses than there are MSHRs, and the cache can maximally service as many concurrent misses as there are MSHRs. More specifically, a cache with N MSHRs will process M requests in roughly M/N batches because it will block after the first N misses. Second, MD-applications saturate the NoC and the memory system, which means that a memory request is queued behind all other concurrent requests. Because TLP-based latency hiding breaks down for MD-applications, the SMs are exposed to the complete memory access latency.
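A minimal sketch of the first insight, under the simplifying assumptions that every miss maps to a distinct cache line and that batches are fully serialized: with N MSHRs, M concurrent misses are serviced in roughly ceil(M/N) batches, and the exposed stall time grows with the number of batches. The function names and the first-order latency estimate are illustrative, not the MDM equations.

import math

def num_batches(concurrent_misses, num_mshrs):
    # A cache with N MSHRs tracks at most N outstanding misses, so M divergent
    # misses are serviced in roughly ceil(M / N) batches.
    return math.ceil(concurrent_misses / num_mshrs)

def exposed_memory_stall(concurrent_misses, num_mshrs, avg_memory_latency):
    # Simplified first-order estimate: each batch exposes one (queueing-inflated)
    # memory latency because TLP can no longer hide it.
    return num_batches(concurrent_misses, num_mshrs) * avg_memory_latency

# 32 warps each missing on one distinct line with 32 MSHRs -> 1 batch;
# a divergent pattern with 32 warps x 8 distinct lines each -> 256 misses
# -> 8 batches, i.e., roughly an 8x longer exposed stall.
print(num_batches(32, 32), num_batches(256, 32))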

MDM faithfully models both MSHR batching and NoC/DRAM queueing delays, which improves the performance prediction accuracy by 16.5× on average compared to the state-of-the-art GPUMech [51] for MD-applications. At the same time, MDM is equally accurate as GPUMech for NMD-applications. Across a set of MD- and NMD-applications, we report an average prediction error of 13.9% for MDM compared to detailed simulation (versus 162% for GPUMech). Moreover, we demonstrate high accuracy across a broad design space in which we vary the number of MSHRs, NoC and DRAM bandwidth, as well as SM count. Furthermore, we validate MDM against real hardware, for which we rely on binary instrumentation to collect the model inputs as opposed to functional simulation as done in prior work. By doing so, we improve both model evaluation speed (by 6.1×) and accuracy (average prediction error of 40% for MDM versus 164% for GPUMech).

To demonstrate the utility of MDM, we perform three case studies. First, we use MDM and GPUMech to explore the performance impact of changing the number of SMs and the DRAM bandwidth for an NMD- versus an MD-workload. Overall, MDM predicts the same performance trends as simulation, while GPUMech erroneously predicts that providing more SMs and DRAM bandwidth significantly improves performance for the MD-workload. In a second case study, we demonstrate that the general-purpose MDM is equally accurate as the special-purpose CRISP model [91], in contrast to GPUMech, for predicting the performance impact of DVFS — CRISP is a dedicated model which relies on yet-to-be-implemented hardware performance counters, in contrast to the general-purpose MDM model. Finally, in our last case study, we validate that MD-applications indeed suffer from MSHR batching and NoC/DRAM queueing, and that MDM accurately captures these performance effects.

In summary, we make the following contributions:

• We identify the key architectural mechanisms that determine the performance of memory-divergent GPU applications. Poor spatial locality leads to widespread L1 cache blocking due to a lack of MSHRs, which results in cache misses being processed in batches and the SMs being unable to leverage TLP to hide memory access latency and NoC/DRAM queueing delays.

• We propose the Memory Divergence Model (MDM), which faithfully models the batching behavior and NoC/DRAM queueing delays observed in MD-applications. MDM significantly improves performance prediction accuracy compared to GPUMech: we report an average prediction error of 13.9% for MDM versus 162% for GPUMech when compared to detailed simulation.

• We validate MDM against real hardware, showing substantially higher accuracy than GPUMech (40% versus 164% prediction error), while in addition being 6.1× faster by collecting model inputs through binary instrumentation as opposed to functional simulation.

• We show that MDM is useful and accurate compared to detailed simulation for early design space exploration of GPU architectures when varying the number of SMs, the number of MSHRs, and the amount of NoC and DRAM bandwidth. We further demonstrate that MDM is highly accurate for predicting performance degradation under DVFS, being similarly accurate as the special-purpose CRISP [91]. Finally, we demonstrate MDM's ability to reveal the mechanisms that underpin performance scaling trends.

5.2 MDM Overview

MDM is based on interval modeling [41, 60], which is an established approach for analytical CPU performance modeling. The key observations are that an application will have a certain steady-state performance in the absence of miss events (e.g., data cache misses), and that miss events are independent of each other. Therefore, performance can be predicted by estimating steady-state performance and subtracting the performance loss due to each miss event. Interval modeling was originally proposed for single-threaded CPU workloads, and applying it to GPUs is not straightforward due to their highly parallel execution model [51].
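As a first-order illustration of the interval-modeling idea (not the MDM equations developed later in this chapter), execution time can be sketched as a steady-state component plus an independent penalty per miss event:

def interval_model_cycles(instructions, steady_state_ipc, miss_events):
    """First-order interval model: base time plus independent miss penalties.

    miss_events is a list of (count, penalty_in_cycles) pairs, e.g.
    [(num_llc_misses, dram_latency)]. This illustrative simplification
    ignores the latency hiding, batching and queueing effects that the
    GPU models discussed in this chapter additionally account for.
    """
    base_cycles = instructions / steady_state_ipc
    penalty_cycles = sum(count * penalty for count, penalty in miss_events)
    return base_cycles + penalty_cycles

# Example: 1M instructions at a steady-state IPC of 2, plus 10K long-latency
# misses of 400 cycles each that are not hidden.
print(interval_model_cycles(1_000_000, 2.0, [(10_000, 400)]))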

Figure 5.1 provides a high-level overview of the MDM performance model.


Figure 5.1: The key components of MDM-based performance prediction. The pipeline consists of trace collection (through instrumentation or functional simulation), representative warp selection, cache simulation and the MDM model; trace collection and warp selection are architecture-independent one-time costs, while cache simulation and model evaluation are architecture-dependent recurring costs that take the architectural parameters (clock frequency, SM count, cache sizes, bandwidths, etc.) as input.

instrumentation on real hardware or through functional simulation (see 1 ).Traces are architecture-independent and therefore only need to be gatheredonce for each benchmark, i.e., trace collection is a one-time cost. Instrumenta-tion on real hardware is significantly faster than functional simulation1. Fur-thermore, binary instrumentation provides traces of native hardware instruc-tions while functional simulation captures instructions at the intermediate PTXrepresentation [4]. For these reasons, instrumentation on real hardware is gen-erally preferable. An important exception is when validating the performancemodel across design spaces, which can only be done through simulation —real hardware enables evaluating a single design point only, and moreover, thedi↵erence in instruction abstraction level introduces error.

We next use the instruction trace to create interval profiles for all warps in the application (see 2). Huang et al. [51] observe that the execution behaviors of the warps of a GPU-compute application are sufficiently similar so that a single representative warp can be used as input to the performance model. Following this observation, we select a representative warp based on its architecture-independent characteristics such as instruction mix and inter-instruction dependencies. Our analysis confirms that this approach is equally accurate for both MD- and NMD-applications, since memory divergence primarily affects the latency of the miss events.

In contrast to trace collection and representative warp selection, which incur a one-time cost per application, the next steps depend on both the application and the architecture, and hence need to be run once for each architecture-application pair; this is a recurring cost which is proportional to the number of architecture configurations to be explored. We first run the load and store instruction traces through a cache simulator to obtain the miss rates for all caches (see 3). We consider all warps in the cache model as the accesses of concurrently executed warps significantly affect miss rates (both constructively and destructively). Finally, we provide the intervals and the miss rates to our MDM performance model to predict overall application performance (IPC) for a particular architecture configuration (see 4).

¹ We extend the dynamic instrumentation tool NVBit [114] to capture per-warp instruction and memory address traces; see Section 5.4 for details regarding our experimental setup.
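
To make the division between one-time and recurring work concrete, the following host-side C++ sketch outlines how an MDM-style evaluation loop could be organized. All type and function names (WarpTrace, select_representative_warp, simulate_caches, mdm_predict_ipc, and the Arch struct) are illustrative placeholders, not part of an actual MDM implementation.

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Arch      { int num_sms = 28, num_mshrs = 128; double noc_bw = 1050, dram_bw = 480; }; // baseline values from Table 5.3
    struct WarpTrace { /* per-warp instruction and memory-address trace */ };
    struct Intervals { /* interval profile of the representative warp */ };
    struct MissRates { double l1 = 0.0, llc = 0.0; };

    // One-time, architecture-independent steps (stubs standing in for NVBit
    // instrumentation or functional simulation, and for representative warp selection).
    std::vector<WarpTrace> collect_traces(const std::string&)              { return {}; }
    Intervals select_representative_warp(const std::vector<WarpTrace>&)    { return {}; }

    // Recurring, architecture-dependent steps (stubs standing in for the cache
    // simulator and for the MDM equations of Section 5.3.3).
    MissRates simulate_caches(const std::vector<WarpTrace>&, const Arch&)      { return {}; }
    double mdm_predict_ipc(const Intervals&, const MissRates&, const Arch&)    { return 0.0; }

    double explore(const std::string& benchmark, const std::vector<Arch>& configs) {
        auto traces    = collect_traces(benchmark);           // one-time cost per benchmark
        auto intervals = select_representative_warp(traces);  // one-time cost per benchmark
        double best_ipc = 0.0;
        for (const Arch& a : configs) {                       // recurring cost per configuration
            MissRates mr = simulate_caches(traces, a);
            best_ipc = std::max(best_ipc, mdm_predict_ipc(intervals, mr, a));
        }
        return best_ipc;
    }

The point of the sketch is the loop structure: traces and the representative warp are reused across all configurations, while only the (cheap) cache simulation and model evaluation are repeated.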


Table 5.2: Evaluation time as a function of design space size for detailed simulation and MDM. MDM is orders of magnitude faster for large design spaces.

Architectural Configurations | Detailed Simulation | MDM One-time | MDM Recurrent
1                            | 9.8 days            | 3.6 hours    | 2 minutes
10                           | 3.3 months          | 3.6 hours    | 20 minutes
100                          | 2.7 years           | 3.6 hours    | 3.3 hours
1000                         | 27 years            | 3.6 hours    | 1.4 days

MDM dramatically reduces evaluation time compared to simulation and makes exploring large GPU design spaces practical, i.e., hours or a few days for MDM versus months or years for simulation. Table 5.2 compares MDM model evaluation time against simulation as a function of the number of architectural configurations in the design space. For each architectural configuration, we evaluate all 17 of our benchmarks (see Section 5.4 for details). The key takeaway is that MDM-based evaluation is orders of magnitude faster than detailed simulation. More specifically, MDM speeds up evaluation by 65× when only considering a single configuration (less than 4 hours versus almost ten days), which grows to 6371× when evaluating 1000 configurations (less than two days versus many years). Exploiting parallelism in a server farm speeds up simulation and MDM equally. The root cause of MDM's speed and scalability is that the recurring costs are small compared to the one-time costs, and MDM's one-time cost is much smaller compared to simulation.
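
As a back-of-the-envelope check on the reported speedups (using the rounded numbers from Table 5.2, so the results are approximate):

    Speedup(1)    \approx \frac{9.8 \text{ days}}{3.6 \text{ h} + 2 \text{ min}} \approx \frac{9.8 \text{ days}}{0.15 \text{ days}} \approx 65\times

    Speedup(1000) \approx \frac{1000 \times 9.8 \text{ days}}{3.6 \text{ h} + 1000 \times 2 \text{ min}} \approx \frac{9800 \text{ days}}{1.5 \text{ days}} \approx 6400\times

which is consistent with the 65× and 6371× figures quoted above.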

5.3 Modeling Memory Divergence

We now describe how MDM captures the key performance-related behavior of MD-kernels. We first analyze the key performance characteristics of MD-applications (Section 5.3.1), and we then use interval analysis to examine how this impacts performance (Section 5.3.2). The observations that come out of this analysis then lead to the derivation of the MDM performance model (Section 5.3.3).

5.3.1 Key Performance Characteristics

The basic unit of a GPU-compute application is a thread. Threads are organized into Thread Blocks (TBs), and the threads in a TB can execute sequentially or concurrently and communicate with each other. The threads are dynamically grouped into warps by the warp scheduler, and each warp executes the same instructions on different data elements. GPUs execute the instructions of all threads within a warp in lock-step across the cores of a Streaming Multiprocessor (SM).


Listing 5.1: Example kernel with strided access pattern.

    1: int ix = blockIdx.x*blockDim.x + threadIdx.x;
    2: __shared__ float output[blockDim.x][N];
    3: for (int i = 0; i < N; i++) {
    4:     int index = GS*ix + i;       // Compute index
    5:     float t = input[index];      // Load value
    6:     output[ix][i] = t*t;         // Square and store
    7: }

Figure 5.2: Cache behavior for the example kernel in Listing 5.1 with two different grid strides. (Diagram: with Grid Stride (GS) = 1, the 4-byte accesses of threads W1:T1 through W1:T32 fall within the same 128-byte cache line, and warps W2 and W3 map to cache lines i+1 and i+2; this is the Non-Memory-Divergent (NMD) case. With GS = 32, the accesses of W1:T1, W1:T2, W1:T3, ... each fall in a different cache line; this is the Memory-Divergent (MD) case.)

For loads, this means that each thread issues a load for a single data element. The per-thread requests within a warp are aggregated into cache requests by the coalescer. On a cache hit, the cache line is retrieved and provided to the SM's compute cores. On a miss, one or more MSHRs are allocated and a corresponding number of memory requests are sent to the lower levels of the memory hierarchy. If a warp cannot complete (e.g., due to an LLC miss), the warp scheduler will try to execute other warps to hide latencies (possibly from other TBs). The instructions within a warp are executed in program order, and the SM stalls when the pending instructions of all available warps are blocked.

Listing 5.1 shows a simple GPU kernel that we use to illustrate the key performance characteristics of MD and NMD applications. The example kernel squares the contents of an array and reorganizes it into a matrix. At run-time, each thread executes the instructions of the kernel on a subset of the application's data. The exact data elements are determined by the thread's position within the thread grid (line 1). For each iteration of the loop, the kernel computes the array index (line 4), loads the matrix value (line 5), and finally squares the value and writes it back to memory (line 6). The access pattern strides are determined by the constant Grid Stride (GS). In the following, we assume 32 threads per warp, 128-byte L1 cache lines, 4-byte floats, as well as that the input array is cache-aligned and much larger than the L1 cache. We use the notation W1:T1 to refer to thread 1 within warp 1.
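
To make the effect of GS concrete, the following host-side C++ sketch (our own illustration, not part of MDM) counts how many distinct 128-byte cache lines the 32 threads of one warp touch in a single loop iteration, which is the number of cache requests the coalescer must issue:

    #include <cstdio>
    #include <set>

    // Number of cache requests generated by one warp for loop iteration i,
    // given the access pattern input[GS*ix + i] of Listing 5.1.
    int requests_per_warp(int GS, int i, int warp_id) {
        const int WARP_SIZE = 32, LINE_BYTES = 128, ELEM_BYTES = 4;
        std::set<int> lines;                                   // distinct cache lines touched
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            int ix    = warp_id * WARP_SIZE + lane;            // global thread index
            long addr = (long)(GS * ix + i) * ELEM_BYTES;      // byte address of the load
            lines.insert((int)(addr / LINE_BYTES));
        }
        return (int)lines.size();
    }

    int main() {
        std::printf("GS=1 : %d request(s) per warp\n", requests_per_warp(1, 0, 0));   // prints 1  (NMD)
        std::printf("GS=32: %d request(s) per warp\n", requests_per_warp(32, 0, 0));  // prints 32 (MD)
        return 0;
    }

With GS = 1 the warp's loads coalesce into a single 128-byte request, whereas with GS = 32 every lane touches its own cache line, so one warp-level load turns into 32 cache requests.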

We now configure this example kernel as MD versus NMD by changing the GS parameter, which yields two key observations regarding the performance characteristics of MD- versus NMD-workloads. We will later build upon these two observations when formulating the MDM model.

Figure 5.3: L1 miss latency breakdown (in cycles) for selected GPU-compute applications, split into MSHR delay, LLC access latency, DRAM access latency, NoC queue delay and DRAM queue delay, for the NMD-applications (HS, BT, BP, FDTD, SRAD, RAY, 2DCONV, ST) and the MD-applications (CFD, BFS, PVR, SPMV, PVC, IIX, KMEANS, AN, RN). Delays due to insufficient MSHRs as well as queueing delays in the NoC and DRAM subsystem significantly affect the overall memory access latency of MD-applications, while NMD-applications are hardly affected.

MD-kernels exhibit poor spatial locality, which leads to widespread cache blocking due to lack of available MSHRs. Figure 5.2 illustrates this first observation. When GS equals 1, the loads of each warp go to the same cache block, and the coalescer is able to combine these loads into a single cache request per warp. If GS equals 32 on the other hand, the kernel becomes memory-divergent because the memory accesses of each thread within the warp go to different L1 cache blocks, and the coalescer can no longer combine the threads' cache requests. The poor spatial locality of MD-applications puts immense pressure on the L1 cache MSHRs, which therefore become the predominant architectural bottleneck.

Figure 5.3 adds further evidence to this observation by breaking down the L1 miss latency into the memory system units where the latency is incurred (see Section 5.4 for details regarding our methodology). Unsurprisingly perhaps, the overall memory latency is higher for MD-applications than for NMD-applications. Figure 5.3 shows that MD-applications spend a lot of time waiting for MSHRs to become available.

The poor spatial locality and high memory intensity of MD-applications cause widespread congestion in the NoC or DRAM subsystem. Figure 5.3 shows that MD-applications typically have much larger NoC or DRAM queueing delays than NMD-applications. To explain this behavior, we revisit Figure 5.2. When GS equals one, all words within the cache line are required by the SMs. Conversely, only a single word per cache line is required when GS equals 32. Put differently, the cache has to fetch 128 bytes to obtain the 4 bytes that are requested. This results in the NoC and DRAM being flooded with memory requests, which leads to significant queueing delays in the NoC and DRAM subsystems.
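
A quick calculation for the example quantifies this effect (our own back-of-the-envelope figure, not a measured number):

    \frac{\text{useful bytes}}{\text{fetched bytes}} = \frac{4 \text{ B}}{128 \text{ B}} \approx 3\%,

i.e., the MD configuration moves roughly 32× more data through the NoC and DRAM than the SMs actually consume, which is why the queues in both subsystems fill up.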


Figure 5.4: Example explaining why MSHR utilization results in significantly different performance-related behavior for NMD- and MD-applications. MD-applications put immense pressure on the L1 cache MSHRs and thereby severely limit the GPU's ability to use TLP to hide memory latencies. (Four panels: (a) MSHR utilization for an NMD-application: each warp's coalesced miss (blocks A, B and C for W1, W2 and W3) occupies a single MSHR, so the memory requests of all warps and threads can be executed in parallel. (b) NMD-application execution example (NMD-interval: the SM reaches its TLP limit). (c) MSHR utilization for an MD-application: there are only sufficient MSHRs to serve threads T1, T2 and T3 of warp W1; the remaining threads and warps stall. (d) MD-application execution example (MD-interval: the MSHRs become the key performance bottleneck; requests are serviced in batches #1, #2, ..., with further requests queued behind the ones currently being serviced).)


5.3.2 Interval Analysis

It is clear from the above analysis that an accurate performance model should model both the impact of MSHR blocking and NoC/DRAM congestion. We first describe how these phenomena affect performance through interval analysis before describing the model in great detail in the next section. We do so using our example kernel (see Figure 5.4), while considering a single interval, i.e., a couple of compute instructions followed by a long-latency load. Further, we assume 3 L1 cache MSHRs and we assume that the warp scheduler can only consider 3 concurrent warps. Otherwise, the assumptions are the same as in Section 5.3.1, including the assumption that all requests miss in the L1 cache.

A cache can sustain as many misses to different cache lines as there are MSHRs [70]. Each MSHR entry also tracks the destination (e.g., warp) for each request, and this parameter (commonly called the number of targets) determines the number of misses to the same cache line that the MSHR entry can sustain [53]. If the cache runs out of MSHR entries (or targets), it can no longer accept memory requests until an MSHR (or target) becomes available. A blocked cache quickly causes the SM to stall because load instructions cannot execute.
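
The following host-side C++ sketch (an illustrative model of the bookkeeping just described, not actual GPU hardware or simulator code) shows the admission check an L1 cache performs for an incoming miss: a miss to a new line needs a free MSHR entry, a miss to an already-pending line needs a free target slot, and otherwise the cache blocks.

    #include <unordered_map>

    // Simplified L1 MSHR file: maps a pending cache-line address to the number
    // of requests (targets) already merged into that entry.
    struct MshrFile {
        int max_entries;                        // e.g., 128 in the baseline GPU, 3 in Figure 5.4
        int max_targets;                        // misses to the same line one entry can absorb
        std::unordered_map<long, int> pending;  // line address -> allocated targets

        // Returns true if the miss is accepted; false means the cache blocks.
        bool accept_miss(long line_addr) {
            auto it = pending.find(line_addr);
            if (it != pending.end())                    // secondary miss: merge into existing entry
                return it->second < max_targets ? (++it->second, true) : false;
            if ((int)pending.size() >= max_entries)     // primary miss but no free MSHR
                return false;                           // cache blocks; the SM will soon stall
            pending[line_addr] = 1;                     // allocate a new MSHR entry
            return true;
        }

        void complete(long line_addr) { pending.erase(line_addr); }  // fill returns, entry freed
    };

With only three entries and the MD access pattern of Figure 5.4(c), accept_miss starts returning false after the first three misses of W1, which is exactly the blocking behavior the interval analysis below builds on.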

The example NMD-kernel (i.e., GS equals 1) uses the MSHRs efficiently since the coalescer combines the loads of the threads within a warp into a single cache request. Therefore, Figure 5.4(a) shows that each warp occupies a single MSHR. Figure 5.4(b) shows execution behavior over time for the SM and the MSHRs as well as the cache requests of each warp. The SM first executes W1, which stalls when its threads reach the load instruction. To hide memory latency, the warp scheduler decides to execute W2. This enables W2's threads to calculate their addresses and issue their respective loads. The scheduler continues in a similar manner with W3. At this point, all concurrent warps available to the scheduler are stalled on loads, causing the SM to stall. The L1 cache also blocks because all its MSHRs are occupied, but this does not affect TLP since the SM has reached its TLP limit. Further, the performance impact of the stall is limited because execution resumes when W1's memory request returns.

Figure 5.4(c) shows the MSHR utilization for the example MD-kernel (i.e., GS equals 32). In this case, the cache misses of threads T1, T2, and T3 in warp W1 occupy all available MSHRs, and Figure 5.4(d) shows that the widespread cache blocking makes TLP-based latency hiding ineffective. More specifically, all forward progress in program execution occurs in response to memory requests completing. Therefore, switching to other warps does not help since their load requests cannot enter the memory system as long as the cache is blocked. This phenomenon occurs when the number of concurrent memory requests exceeds the number of L1 MSHRs. It thus depends on the application's memory access intensity and access pattern, and their relationship with the number of MSHRs provided by the architecture. In other words, the application's characteristics and its interactions with the underlying architecture determine to what extent a kernel's memory divergence affects performance.


5.3.3 The Memory Divergence Model (MDM)

The above analysis illustrates that an accurate performance model for MD-applications needs to model the architectural effects of MSHR blocking (Observation #1), and in addition needs to capture the effects of NoC and DRAM congestion (Observation #2). We now explain how our MDM model captures these observations.

The starting point of MDM is the number of cycles it takes for an SM to execute the instructions of interval i within the representative warp without contention (i.e., C_i). We then add the predicted MSHR-related stall cycles (i.e., S_i^MSHR) and the predicted stall cycles due to queueing in the NoC and DRAM subsystems (i.e., S_i^NoC and S_i^DRAM) to C_i to predict the number of cycles an SM uses to execute interval i with contention; we denote the total predicted stall cycles of interval i by S_i. We can obtain per-interval IPC predictions by dividing the number of instructions in the interval by the number of cycles we predict it will take to execute them (i.e., IPC_i = #Instructions_i / [C_i + S_i]).

We predict the IPC of the entire warp by dividing the total number of instructions executed by the warp across all intervals by the total number of cycles required to execute all intervals. Then, we multiply by the number of warps concurrently executed on an SM (i.e., W) to predict the overall IPC_SM of an SM:

    IPC_{SM} = W \times \frac{\sum_{i=0}^{\#Intervals} \#Instructions_i}{\sum_{i=0}^{\#Intervals} \left( C_i + S_i^{MSHR} + S_i^{NoC} + S_i^{DRAM} \right)}.    (5.1)

We obtain the IPC for the entire GPU by multiplying with the number of SMs (i.e., IPC = #SMs × IPC_SM). MDM obtains C_i similarly to GPUMech [51], while we provide new approaches for predicting S_i^MSHR, S_i^NoC and S_i^DRAM. If the application consists of multiple kernels, we first obtain the instruction count and total cycles for each kernel. Then, we divide the sum of instruction counts across all kernels by the sum of cycles across all kernels to predict application IPC. The following sections explain how MDM predicts the stall cycles S_i per interval; we omit the subscript i in the discussion below to simplify the formulation.
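
As a concrete reading of Equation 5.1, the following host-side C++ sketch aggregates per-interval cycle components into an SM- and GPU-level IPC prediction. The Interval fields and function names are our own illustrative choices; the per-interval stall terms would come from the batching and contention models of Section 5.3.3.

    #include <vector>

    struct Interval {
        double instructions;  // #Instructions_i of the representative warp
        double c;             // C_i: cycles without contention
        double s_mshr;        // S_i^MSHR: MSHR batching stalls (Equation 5.6)
        double s_noc;         // S_i^NoC: NoC queueing stalls (Equation 5.9)
        double s_dram;        // S_i^DRAM: DRAM queueing stalls (DRAM analogue of Equation 5.9)
    };

    // Equation 5.1: per-SM IPC for W concurrently executing warps.
    double ipc_sm(const std::vector<Interval>& intervals, double W) {
        double instr = 0.0, cycles = 0.0;
        for (const Interval& iv : intervals) {
            instr  += iv.instructions;
            cycles += iv.c + iv.s_mshr + iv.s_noc + iv.s_dram;
        }
        return W * instr / cycles;
    }

    // Whole-GPU prediction: IPC = #SMs x IPC_SM.
    double ipc_gpu(const std::vector<Interval>& intervals, double W, int num_sms) {
        return num_sms * ipc_sm(intervals, W);
    }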

MDM classifies each interval within the application's representative warp as either an MD- or an NMD-interval. More specifically, memory divergence occurs when the number of concurrent read misses exceeds the number of L1 cache MSHRs:

    M^{Read} \times W > \#MSHRs.    (5.2)

In other words, Equation 5.2 is true if the application has the ability to make the MSHRs the key performance bottleneck within the current interval.

MDM’s Batching Model

Batching example: We now further analyze the performance of the MD- and NMD-kernels in Figure 5.4. To keep the analysis simple, we assume a constant NoC transfer latency and that all requests enter the same NoC queue and hit in the LLC. Note that these assumptions only apply to the example; MDM models the highly parallel memory system of contemporary GPUs.

For the NMD-kernel in Figure 5.4(b), there are sufficient MSHRs for the SM to issue the memory requests of all three warps concurrently and reach the TLP limit. For the MD-kernel in Figure 5.4(d) on the other hand, divergence results in only the first three requests of W1 being issued before the cache blocks. As each request completes, an MSHR becomes available and a new request is issued. For instance, the completion of request A enables issuing request D. The poor MSHR utilization of the memory-divergent warp results in it issuing its requests in batches, where each batch contains as many requests as there are MSHRs. More specifically, the requests of W1 are serviced over two batches since there are six memory requests and three MSHRs. W1 cannot execute its next instruction until the memory requests of all threads have completed.

Batching model: We now explain how MDM models batching behavior. We first predict the average number of concurrent L1 misses M:

    M = \min(M^{Read} \times W, \#MSHRs) + M^{Write} \times W.    (5.3)

Read misses allocate MSHR entries and are therefore bounded by the number of L1 MSHRs. In other words, the application will issue either (i) the number of read misses in the representative warp times the number of concurrently executed warps, or (ii) as many read misses as there are MSHRs. Since the L1 caches in our GPU models are write-through and no-allocate, write misses effectively bypass the L1 and are independent of the number of MSHRs.

To estimate the length of the batches, we start by determining the memory latency in the absence of contention:

    L^{NoContention} = L^{MinLLC} + LLCMissRate \times L^{MinDRAM}.    (5.4)

Here, L^MinLLC is the round-trip latency of an LLC hit without NoC contention. The round-trip latency through the DRAM system is L^MinDRAM (again, assuming no contention), but only LLC misses incur this latency. We then combine L^NoContention with the average stall cycles due to queueing in the NoC and DRAM subsystems (we will derive S^NoC and S^DRAM in Section 5.3.3):

    S^{Mem} = L^{NoContention} + S^{NoC} + S^{DRAM}.    (5.5)

S^Mem is the predicted number of stall cycles due to L1 misses, considering both NoC and DRAM contention. We then use S^Mem to predict the SM stall cycles due to MSHR contention:

    S^{MSHR} =
    \begin{cases}
      \left( \left\lceil \frac{M^{Read} \times W}{\#MSHRs} \right\rceil - 1 \right) \times S^{Mem}, & \text{if MD-interval} \\
      0, & \text{otherwise.}
    \end{cases}    (5.6)

For MD-intervals, Equation 5.6 computes the number of batches needed to issue the memory requests of all warps by dividing the total number of read misses by the number of MSHRs. The latency of the final batch is covered by the queueing model (see Section 5.3.3), so we need to subtract one from this quantity to avoid adding this latency twice. Then, we multiply by S^Mem to obtain the combined SM stall cycles of these batches. NMD-applications are by definition able to issue the requests of all warps in a single batch. Therefore, we set S^MSHR to zero for NMD-intervals.
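
The following host-side C++ sketch puts Equations 5.2, 5.3 and 5.6 together for a single interval. The variable names are ours; s_mem would in turn be computed from Equation 5.5 using the queueing terms derived below.

    #include <algorithm>
    #include <cmath>

    // Equation 5.2: does this interval qualify as memory-divergent?
    bool is_md_interval(double m_read, double W, int num_mshrs) {
        return m_read * W > num_mshrs;
    }

    // Equation 5.3: average number of concurrent L1 misses.
    double concurrent_misses(double m_read, double m_write, double W, int num_mshrs) {
        return std::min(m_read * W, (double)num_mshrs) + m_write * W;
    }

    // Equation 5.6: SM stall cycles due to MSHR batching for one interval.
    // s_mem is the per-miss stall time of Equation 5.5 (no-contention latency
    // plus the NoC and DRAM queueing stalls).
    double s_mshr(double m_read, double W, int num_mshrs, double s_mem) {
        if (!is_md_interval(m_read, W, num_mshrs))
            return 0.0;                                   // NMD-interval: one batch suffices
        double batches = std::ceil(m_read * W / num_mshrs);
        return (batches - 1.0) * s_mem;                   // final batch is covered by the queue model
    }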

MDM’s Memory Contention Model

Contention example: We now return to the example in Figure 5.4. Here, both the NMD- and the MD-interval saturate the memory system (see Figure 5.4(b) and 5.4(d), respectively), and this results in each request waiting for all requests that were pending at the time it issued. More specifically, both kernels sustain three memory requests in flight since the SM can consider three warps concurrently (NMD-kernel) and there are three MSHRs (MD-kernel). Therefore, each request has to wait for two other requests.

The relationship between SM stall cycles and memory latency is complex due to the highly parallel execution model of GPUs. To account for this overlap in NMD-applications, we heuristically assume that the SM stalls for half of the memory queueing latency. The half-latency heuristic works well because the SM hides latencies by exploiting TLP. The case where the memory latency is fully hidden is detected by the multithreading model (see Section 5.2). Conversely, the highly parallel execution model of GPUs means that a significant part of the latency is typically hidden. Therefore, assuming that the SM stalls for half of the latency is reasonable since this is the mid-point between the extremes (i.e., perfect versus no latency hiding). For MD-applications, the SM is not able to use TLP to hide memory latency due to a lack of MSHRs, and the application is exposed to the complete memory queueing latency.

Contention model: We now describe how MDM formalizes the above intuition. In a real GPU, memory contention occurs because the memory requests of all SMs queue up in the NoC and DRAM subsystems. The NoC and DRAM use a certain number of cycles to service each request. More specifically, the NoC service latency L^NoCService is a function of the cache block size, the clock frequency f and the NoC bandwidth B^NoC:

    L^{NoCService} = f \times \frac{BlockSize}{B^{NoC}}.    (5.7)

The DRAM service latency can be computed in a similar way. However, only the LLC misses access DRAM:

    L^{DRAMService} = f \times LLCMissRatio \times \frac{BlockSize}{B^{DRAM}}.    (5.8)

We obtain the LLC miss ratio from the information collected in the interval profile and adjust the service latencies to account for parallelism in the memory system. More specifically, we divide the average service latency by n to model an n-channel system.


We now use the service latency to predict the SM stall cycles caused by queueing latencies. The average queueing latency is determined by the average number of in-flight requests M times the average service latency. M is determined by application behavior, and we use Equation 5.3 to predict it. The service latency is an architectural parameter, which means that we can use the same model for both NoC and DRAM stalls by providing L^NoCService (L^DRAMService) as input to compute S^NoC (S^DRAM):

    S^{NoC} =
    \begin{cases}
      \#SMs \times M \times L^{NoCService}, & \text{if MD-interval and NoC saturated} \\
      0.5 \times \#SMs \times M \times L^{NoCService}, & \text{otherwise.}
    \end{cases}    (5.9)

Equation 5.9 formalizes the key observations of the above example. For NMD-intervals, the SM hides queueing latencies with TLP, and we assume that the SM stalls for half of the queueing latency. For MD-intervals, latency hiding breaks down and the kernel is exposed to the complete latency.

We have already established that a kernel issues more concurrent misses than there are L1 cache MSHRs in an MD-interval (see Equation 5.2). For this behavior to disable TLP-based latency hiding, the NoC must saturate. More specifically, the predicted NoC queue latency must be larger than the minimum round-trip DRAM access latency:

    L^{NoCService} \times \#MSHRs \times \#SMs > L^{MinLLC} + L^{MinDRAM}.    (5.10)

If the NoC queue does not saturate, memory requests that hit in the LLC will be serviced quickly. This unblocks the L1 cache and enables the SM to issue a new memory instruction. Without NoC saturation, this occurs frequently enough for TLP-based latency hiding to work also for MD-applications. That said, Equation 5.10 is typically only false in imbalanced architectures where the number of MSHRs is (very) small or the NoC bandwidth (extremely) high.
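
The contention model of Equations 5.7 through 5.10 can be summarized in a few lines of host-side C++ (again an illustrative sketch with our own names; latencies are in cycles, bandwidths in bytes per second, and the per-channel adjustment mentioned above is folded in by dividing by the channel count):

    struct MemParams {
        double freq;            // clock frequency f (Hz)
        double block_size;      // cache block size (bytes), e.g., 128
        double noc_bw;          // B^NoC (bytes/s)
        double dram_bw;         // B^DRAM (bytes/s)
        int    noc_channels;    // parallel NoC channels
        int    dram_channels;   // DRAM channels / memory controllers
        double l_min_llc;       // L^MinLLC (cycles)
        double l_min_dram;      // L^MinDRAM (cycles)
    };

    // Equations 5.7 and 5.8: per-request service latencies in cycles,
    // divided by the channel count to model an n-channel subsystem.
    double l_noc_service(const MemParams& p) {
        return p.freq * p.block_size / p.noc_bw / p.noc_channels;
    }
    double l_dram_service(const MemParams& p, double llc_miss_ratio) {
        return p.freq * llc_miss_ratio * p.block_size / p.dram_bw / p.dram_channels;
    }

    // Equation 5.10: is the NoC saturated?
    bool noc_saturated(const MemParams& p, int num_mshrs, int num_sms) {
        return l_noc_service(p) * num_mshrs * num_sms > p.l_min_llc + p.l_min_dram;
    }

    // Equation 5.9 (used for both S^NoC and S^DRAM by passing the matching
    // service latency): full queueing latency for MD-intervals under saturation,
    // half of it otherwise, with M the in-flight requests from Equation 5.3.
    double s_queue(double service_latency, double M, int num_sms,
                   bool md_interval, bool saturated) {
        double q = num_sms * M * service_latency;
        return (md_interval && saturated) ? q : 0.5 * q;
    }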

5.4 Experimental Setup

Simulation Setup. We use GPGPU-sim 3.2 [27], a cycle-accurate GPU simulator, to evaluate MDM's prediction accuracy. We choose the same baseline GPU architecture configuration as in [51] for a fair comparison against GPUMech, while scaling the number of SMs, NoC bandwidth and DRAM bandwidth to match a GPU architecture that is similar to Nvidia's Pascal GPU [13]; see Table 5.3.

Workloads. We select 17 applications, 8 NMD-applications and 9 MD-applications, from the main GPU benchmark suites, including Rodinia [32], SDK [9], Polybench [45], Parboil [107], MARS [48] and Tango [6]. Table 5.4 provides details on the selected benchmarks. We simulate the benchmarks to completion with the (largest) default input set. The inputs for AlexNet and ResNet [61] are pre-trained models from ImageNet [38].


Table 5.3: Simulator configuration.

Parameter              | Value
Clock frequency        | 1.4 GHz
Number of SMs          | 28
No. warps per SM       | 64
Warp size              | 32 threads
No. threads per SM     | 2048
Warp scheduling policy | GTO scheduling
SIMT width             | 32
L1 cache per SM        | 48 KB, 6-way, LRU, 128 MSHRs
Shared LLC             | 3 MB total, 24 × 128 KB banks, 8-way, LRU, 128 MSHRs, 120 cycles
NoC bandwidth          | 1050 GB/s
DRAM                   | 480 GB/s, 220 cycles, 24 memory controllers

Real Hardware Setup. For the experiments that rely on instrumentation, we extend the NVBit [114] binary instrumentation framework. We use an NVIDIA GeForce GTX 1080 [2] GPU to collect traces for each warp. As NVBit does not directly support this, we extend it to capture per-warp memory address and dynamic instruction traces.

Performance Models. The original GPUMech proposal includes a DRAM queueing model which assumes that each request waits for half the total number of requests on average. However, GPUMech does not model NoC queueing delay, nor does it account for the DRAM and NoC queueing delays when estimating the MSHR stall latencies. We enhance GPUMech minimally and call it GPUMech+: it models the NoC queueing delay similarly to its DRAM queueing model, and also accounts for the NoC and DRAM queueing delays when estimating the MSHR waiting time. MDM-Queue improves upon GPUMech+ by using MDM's NoC and DRAM queue model. MDM-MSHR improves upon GPUMech+ by using MDM's MSHR batching model. MDM is our final model and includes both the NoC and DRAM queue model from Section 5.3.3 and the MSHR batching model from Section 5.3.3. Table 5.5 summarizes the different models that we evaluate in the next section. This model breakdown enables us to independently evaluate MDM's queue model and MSHR model.

5.5 Simulation Results

We first evaluate MDM's prediction accuracy for our baseline configuration through simulation, and consider real hardware validation in the next section. The reason for considering simulation is to demonstrate MDM's accuracy across a broad design space, which can only be explored through simulation.

Table 5.4: Benchmarks.

Benchmark            | Source    | #Knl | Abbr.  | Type
Hotspot              | Rodinia   | 1    | HS     | NMD
B+trees              | Rodinia   | 2    | BT     | NMD
Back Propagation     | Rodinia   | 2    | BP     | NMD
FDTD3d               | SDK       | 1    | FDTD   | NMD
Srad                 | Rodinia   | 2    | SRAD   | NMD
Ray tracing          | GPGPUsim  | 1    | RAY    | NMD
2D Convolution       | Polybench | 1    | 2DCONV | NMD
Stencil              | Parboil   | 1    | ST     | NMD
CFD solver           | Rodinia   | 10   | CFD    | MD
Breadth-first search | Rodinia   | 24   | BFS    | MD
PageView Rank        | MARS      | 258  | PVR    | MD
PageView Count       | MARS      | 358  | PVC    | MD
Inverted Index       | MARS      | 158  | IIX    | MD
Sparse matrix mult.  | Parboil   | 1    | SPMV   | MD
Kmeans clustering    | Rodinia   | 1    | KMEANS | MD
AlexNet              | Tango     | 22   | AN     | MD
ResNet               | Tango     | 222  | RN     | MD

5.5.1 Model Accuracy

We evaluate MDM's accuracy by comparing the model's prediction against detailed cycle-accurate simulation. More specifically, we compute the absolute relative prediction error as follows:

    Error = \left| \frac{IPC_{model} - IPC_{simulation}}{IPC_{simulation}} \right|.    (5.11)

IPC_simulation is obtained through cycle-accurate simulation, and IPC_model is obtained through modeling.

Table 5.5: Evaluated performance models.

Scheme    | Memory/NoC Model    | MSHR Model
GPUMech   | GPUMech             | GPUMech
GPUMech+  | GPUMech with NoC    | GPUMech w/ queue
MDM-Queue | MDM (Section 5.3.3) | GPUMech w/ queue
MDM-MSHR  | GPUMech with NoC    | MDM (Section 5.3.3)
MDM       | MDM (Section 5.3.3) | MDM (Section 5.3.3)

Figure 5.5: IPC prediction error for our NMD- and MD-benchmarks under the different performance models (GPUMech, GPUMech+, MDM-Queue, MDM-MSHR and MDM), including the averages AVG-NMD, AVG-MD and AVG-ALL. MDM significantly reduces the prediction error for the MD-applications.

Figure 5.5 reports the prediction error for the NMD- and the MD-applications for our baseline configuration. GPUMech is largely inaccurate, especially for the MD-applications, with an average prediction error around 298% and as high as 750%. MDM improves prediction accuracy by 16.5× compared to GPUMech for the MD-applications: MDM reduces the prediction error to 18% on average, and at most 50%. Similar prediction accuracy is achieved by GPUMech and MDM for the NMD-applications (average prediction error around 9%). On average across all benchmarks, MDM achieves a prediction error of 13.9% versus 162% for GPUMech.

The alternative performance models, GPUMech+, MDM-Queue and MDM-MSHR, shed light on the relative importance of the different MDM model components. Although GPUMech+ improves accuracy significantly compared to GPUMech, it still incurs a high average prediction error of 131% for the MD-benchmarks. This shows that minor modifications to GPUMech are insufficient and that MD-applications need a fundamentally new modeling approach. MDM-Queue improves upon GPUMech+ by applying the saturation model described in Section 5.3.3 to memory-divergent intervals, thereby reducing the average prediction error to 63%. Similarly, MDM-MSHR improves upon GPUMech+ by applying the batching model of Section 5.3.3 to memory-divergent intervals, which reduces the average prediction error to 60.3%. Neither MDM-Queue nor MDM-MSHR is able to accurately predict MD-benchmark performance in isolation, indicating that modeling both queueing effects and MSHR behavior is critical to achieve high accuracy.

5.5.2 Sensitivity Analyses

The prediction accuracy numbers reported in the previous section considered the baseline GPU configuration. Although it strengthens our confidence to know that MDM is accurate in a specific design point, a computer architect typically cares more about the accuracy of a performance model across a design space. In other words, the model is only useful if it can accurately predict performance across a broad design space. This section evaluates MDM's accuracy across the design space while varying various important configuration parameters including the NoC bandwidth, DRAM bandwidth, number of MSHR entries, and SM count. We focus on the MD-applications in this section because model accuracy for MDM is similar to GPUMech for the NMD-applications, as previously shown for the baseline configuration; we observe similar results across the design space (not shown here because of space constraints).


Figure 5.6: Prediction error as a function of NoC bandwidth (525, 1050, 2100, 4200 and 8400 GB/s) for the MD-applications, for GPUMech, GPUMech+, MDM-Queue, MDM-MSHR and MDM.

Figure 5.7: Prediction error as a function of DRAM bandwidth (177, 320, 480, 720 and 980 GB/s) for the MD-applications, for GPUMech, GPUMech+, MDM-Queue, MDM-MSHR and MDM.

NoC Bandwidth. Figure 5.6 reports model accuracy as we vary NoC bandwidth. GPUMech does not model NoC bandwidth, hence GPUMech is highly sensitive to the available NoC bandwidth. At low NoC bandwidth, the NoC is a critical performance bottleneck, and GPUMech shows the highest performance prediction error. For GPUs with high NoC bandwidth, the NoC does not impact performance as significantly, which leads to a relatively low prediction error for GPUMech. GPUMech+ incorporates a basic NoC model which is improved upon by MDM-Queue. As a result, GPUMech+ and MDM-Queue are less sensitive to constrained NoC bandwidth configurations, yielding lower prediction errors. However, none of these models capture the impact of MSHRs. MDM-MSHR improves accuracy, especially at larger NoC bandwidths, where performance is less bound by the number of MSHRs. MDM significantly improves model accuracy at smaller NoC bandwidths, because it accounts for the impact MSHRs have on NoC bandwidth pressure. Overall, MDM is accurate across the range of NoC bandwidths.

DRAM Bandwidth. Figure 5.7 reports model accuracy across DRAM bandwidth configurations. GPUMech's prediction error increases with DRAM bandwidth: increasing DRAM bandwidth puts increased pressure on NoC bandwidth, which GPUMech does not model. GPUMech+ models the NoC queueing delay and the corresponding L1 miss stall cycles, which significantly decreases the prediction error. MDM-Queue and MDM-MSHR further improve accuracy through improved queueing and MSHR models, respectively. However, the prediction error still increases with DRAM bandwidth. MDM counters this trend by synergistically modeling saturation and batching behavior.

Figure 5.8: Prediction error as a function of the number of MSHR entries (32, 64, 128 and 256) for the MD-applications, for GPUMech, GPUMech+, MDM-Queue, MDM-MSHR and MDM.

Figure 5.9: Prediction error as a function of SM count (15, 20, 28, 60 and 80) for the MD-applications, for GPUMech, GPUMech+, MDM-Queue, MDM-MSHR and MDM.

MSHR Entries. Figure 5.8 shows model accuracy sensitivity to the number of MSHRs. GPUMech's prediction accuracy deteriorates with an increasing number of MSHR entries, ranging from 98% (32 MSHRs) to 317% (256 MSHRs). More MSHR entries lead to an increase in NoC and DRAM queueing delays because the system has to process more in-flight memory requests. GPUMech+ incorporates NoC and DRAM queueing delays when calculating the MSHR waiting time, which decreases the prediction error. However, the error is still high for a large number of MSHRs (e.g., 124% for 128 MSHRs and 133% for 256 MSHRs). MDM-Queue and MDM-MSHR significantly decrease the prediction error compared to GPUMech+ by using MDM's NoC/DRAM queue model and MSHR model, respectively. MDM achieves the highest accuracy of all models across the range of MSHRs.


SM Count. Figure 5.9 reports the prediction error as a function of SM count. In general, increasing the number of SMs increases NoC and DRAM contention, leading to longer L1 miss stall cycles. GPUMech significantly underestimates the L1 miss stall cycles, which leads to prediction errors ranging from 124% (15 SMs) to 450% (80 SMs); the increase in prediction error is a direct result of increased queueing delays which GPUMech does not model. In contrast, GPUMech+ (partially) accounts for NoC and DRAM queueing delays, which significantly decreases the prediction error. MDM-Queue and MDM-MSHR further improve accuracy by modeling memory saturation and batching behavior. By combining both model enhancements, MDM reduces the prediction error to less than 26% on average compared to detailed simulation across the different SM counts.

5.6 Real Hardware Results

We now move to real-hardware validation. As mentioned in Section 5.2, the model's input can be collected using either binary instrumentation on real hardware or functional simulation. Because architectural simulation in GPGPU-sim also operates at the PTX level, we used functional simulation in the previous section, for both MDM and GPUMech. Note that functional simulation is done at the intermediate PTX instruction level, and that GPUMech collects traces through functional simulation. In this work, we instead collect traces using binary instrumentation, a novel approach that is a better alternative, both in terms of accuracy and modeling speed, when comparing against real hardware, which executes native hardware instructions. We evaluate modeling speed and accuracy below.

Modeling Speed. Overall model evaluation time consists of trace collection, representative warp selection, cache simulation and computing the model's equations, as previously described in Section 5.2. Collecting traces using functional simulation takes up around 84% of the total model evaluation time. Because binary instrumentation using NVBit is around 612× faster than functional simulation using GPGPU-sim to collect traces, we achieve an overall model evaluation speedup of 6.1× through binary instrumentation.
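
This overall speedup is consistent with a simple Amdahl-style estimate (our own sanity check using the fractions just quoted):

    Speedup \approx \frac{1}{(1 - 0.84) + 0.84/612} \approx \frac{1}{0.161} \approx 6.2\times,

close to the measured 6.1×; the remaining 16% of the evaluation time (warp selection, cache simulation and evaluating the model equations) is unaffected by how the traces are collected.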

Accuracy. Figure 5.10 validates MDM's (and GPUMech's) model accuracy against real hardware. We consider binary instrumentation for MDM, and both instrumentation and functional simulation for GPUMech. There are two key observations to take away from this result. First, binary instrumentation improves the accuracy of GPUMech considerably compared to functional simulation, i.e., the average prediction error reduces from 260% to 164%. The reason is that profiling happens at the native instruction level using instrumentation instead of the intermediate PTX level using functional simulation. In spite of the improved accuracy, GPUMech still lacks high accuracy compared to real hardware. Second, MDM is significantly more accurate, with an average prediction error of 40% compared to real hardware. Not only is MDM accurate compared to simulation, as reported in the previous section, MDM is also shown to be accurate compared to real hardware.

Figure 5.10: Hardware validation: relative IPC prediction error for GPUMech and MDM compared to real hardware, per MD-benchmark, for GPUMech (functional simulation), GPUMech (instrumentation) and MDM (instrumentation). MDM achieves high prediction accuracy compared to real hardware, with an average prediction error of 40% compared to 164% for GPUMech (using binary instrumentation).

5.7 Case Studies

We now consider four case studies to illustrate the usefulness of the MDM performance model.

5.7.1 Design Space Exploration

The most obvious use case for the MDM model is to drive design space exploration experiments to speed up the design cycle. Note that we are not arguing to replace detailed cycle-accurate simulation. Instead, we are advocating the use of the MDM model as a complement to speed up design space exploration, i.e., a computer architect could use MDM to quickly explore the design space and identify a region of interest which can then be analyzed in more detail through cycle-accurate simulation.

Figures 5.11 and 5.12 report performance results for a typical design space exploration study in which we characterize performance as a function of SM count (X axis) and DRAM bandwidth (Y axis) for two NMD-applications (BT, HS) and three MD-applications (CFD, BFS, PVC), respectively. Performance numbers are reported for simulation, GPUMech and MDM; simulation is the golden reference and all results are normalized to simulation with 28 SMs and 480 GB/s memory bandwidth. GPUMech and MDM capture the trend well for NMD-applications. However, for MD-applications, GPUMech grossly overestimates performance as a function of both SM count and DRAM bandwidth. MDM shows that the performance improvement obtained by increasing SM count and/or DRAM bandwidth is small, which simulation confirms. GPUMech on the other hand suggests that there are significant performance gains to be obtained by increasing the number of SMs and DRAM bandwidth, which is a misleading conclusion. This reinforces that an accurate performance model such as MDM is needed to accurately predict performance trends for memory-divergent workloads.

Figure 5.11: Normalized performance for two NMD-applications ((a) BT and (b) HS) as a function of SM count (28, 60, 80) and DRAM bandwidth (480, 720, 980 GB/s), for simulation, MDM and GPUMech. Results are normalized to the simulation results at 28 SMs and 480 GB/s DRAM bandwidth. Both GPUMech and MDM capture the performance trend.

Figure 5.12: Normalized performance for three MD-applications ((a) CFD, (b) BFS and (c) PVC) as a function of SM count and DRAM bandwidth, for simulation, MDM and GPUMech. All results are normalized to the simulation results at 28 SMs and 480 GB/s DRAM bandwidth. GPUMech not only leads to high prediction errors, it also over-predicts the performance speedup with more SMs and memory bandwidth, in contrast to MDM.

5.7.2 DVFS

Dynamic Voltage and Frequency Scaling (DVFS) [82] is a widely used technique to improve energy efficiency. Reducing the operating voltage and frequency dramatically decreases power consumption (i.e., dynamic power consumption decreases cubically with voltage and frequency) while only linearly decreasing performance. Moreover, memory-bound applications might observe only a slight performance degradation. Hence, DVFS offers significant energy savings while incurring a (relatively) small loss in performance.

CRISP [91] is an online DVFS performance model for GPUs that predicts performance at a lower clock frequency based on statistics measured using special-purpose hardware performance counters at a nominal frequency. Based on the statistics measured at the nominal frequency, CRISP then predicts how stall cycles scale when lowering the clock frequency. Note that CRISP is a special-purpose model, i.e., it can only be used to predict the performance impact of DVFS. MDM and GPUMech on the other hand are general-purpose models that can be used to predict performance across a broader design space in which we change SM count, NoC/DRAM bandwidth, etc., as previously shown.

Figure 5.13: Prediction error for predicting the relative performance difference at 1.4 GHz versus 2 GHz for the MD-applications, for GPUMech, CRISP and MDM. The general-purpose MDM model achieves similar accuracy to the special-purpose CRISP.

The difference in scope between MDM and CRISP is important because the modeling problem for a special-purpose model is much simpler than for a general-purpose model. A special-purpose model such as CRISP measures the stall component at the nominal frequency and then predicts how this component scales at a lower frequency. In contrast, a general-purpose model needs to predict the various stall components at the nominal frequency and at the target frequencies to then predict how performance scales with clock frequency. Predicting how a stall component scales with frequency is much easier than predicting the absolute value of the stall component at different frequencies. In this sense, it is to be expected that CRISP is more accurate than MDM (and GPUMech) for predicting DVFS scaling trends.

Figure 5.13 reports the error for predicting the execution time at 1.4 GHz compared to 2 GHz. For MDM and GPUMech, this means we predict performance at both frequency points and then compute the performance difference. For CRISP, we first run at 2 GHz and then predict performance at 1.4 GHz. CRISP is the most accurate model (average prediction error of 3.7%), closely followed by MDM (average prediction error of 4.6%); GPUMech on the other hand leads to much higher inaccuracy (average prediction error of 28%) because it underestimates the memory stall component. We find that CRISP's relatively high error for AN is due to CRISP assuming that the memory access time (in seconds) is constant, whereas in reality it increases because of reduced overlapping with computation at the lower clock frequency. Overall, we conclude that the general-purpose MDM model is only slightly more inaccurate than the special-purpose CRISP model for predicting the performance impact of DVFS.

Figure 5.14: Normalized CPI as a function of NoC/DRAM bandwidth (0.75×, 1×, 1.5×, 2× the baseline) for two memory-divergent benchmarks ((a) BFS and (b) CFD), broken down into base, MSHR and queueing components for simulation, MDM and GPUMech. MDM accurately captures MSHR batching and NoC/DRAM queueing delays, in contrast to GPUMech.

5.7.3 Validating the Observations

In our third case study, we validate the observations that underpin the MDM model. Figure 5.14 reports CPI for two MD-applications (BFS and CFD) while varying NoC and DRAM bandwidth relative to our baseline configuration for simulation, MDM and GPUMech. All results are normalized to the simulation result at 0.75× the nominal bandwidth. Again, we observe that MDM accurately predicts performance compared to simulation, in contrast to GPUMech. Moreover, we note that the CPI component breakdown for MDM shows that MD-application performance is indeed sensitive to MSHR blocking (Observation #1) and NoC/DRAM queueing delay (Observation #2). GPUMech models some DRAM queueing effects, but it lacks an MSHR batching model as well as an accurate NoC/DRAM queueing model. In contrast, Figure 5.15 shows CPI for a representative NMD-application (BP). Both GPUMech and MDM capture the performance trend since the number of MSHRs is sufficient.

Figure 5.15: Normalized CPI as a function of NoC/DRAM bandwidth (0.75×, 1×, 1.5×, 2× the baseline) for the non-memory-divergent BP benchmark, broken down into base, MSHR and queueing components for simulation, MDM and GPUMech. Both GPUMech and MDM capture the performance trend since the number of MSHRs is sufficient.

Figure 5.16: Relative IPC prediction error with a streaming L1 cache for the MD-applications, for GPUMech and MDM. MDM improves accuracy compared to GPUMech because it models batching behavior caused by NoC saturation.

5.7.4 Streaming L1 Cache

In our final case study, we show that MDM is able to model the streaming L1 cache used in the Volta GPU [15, 64]. MDM performs equally well as for a conventional L1 cache and improves prediction accuracy by 8.66× compared to GPUMech (see Figure 5.16). The streaming cache is interesting because it adds a large number of MSHRs to each L1 cache (4096 MSHRs in this case study) and adopts an on-fill line allocation policy. This practically removes L1 blocking, but batching behavior still occurs because the queues in the NoC can only buffer a finite number of concurrent requests.


5.8 Related Work

Analytical GPU Performance Modeling. Several analytical models have been proposed to predict performance and identify performance bottlenecks. Unfortunately, none of this prior work targets memory-divergent applications. Hong and Kim [50] propose a model that estimates performance based on the number of concurrent memory requests and the concurrent computations done while one warp waits for memory. Baghsorkhi et al. [26] concurrently proposed a work-flow graph (WFG) based analytical model to predict performance. The WFG is an extension to control-flow graph analysis in which nodes represent instructions while arcs represent various latencies. Sim et al. [103] build a performance analysis framework that identifies performance bottlenecks for GPUs by extending the Hong and Kim model. Similarly, Zhang and Owens [124] use a microbenchmark-based approach to model the performance of the instruction pipeline, shared memory and global memory of a GPU. In general, all these models, as well as the state-of-the-art GPUMech [51], make several simplifying assumptions regarding the GPU's cache hierarchy. Not modeling MSHR batching and inaccurately modeling NoC/DRAM bandwidth queueing delays leads to significant prediction errors for MD-applications, as reported in this work. Volkov [115] studies GPU performance using synthetic benchmarks and confirms that several of these recently-proposed GPU models do not accurately capture the effects of memory bandwidth, non-coalesced accesses, and memory-intensive applications.

ML-Based GPU Performance Modeling. A separate line of research exploits machine learning (ML) for GPU performance modeling [21, 40, 81, 118]. These techniques are black-box approaches that do not reveal the impact of GPU components on performance nor help identify performance bottlenecks. Their effectiveness for design space exploration requires extensive training involving a wide range of applications and hardware configurations. Wu et al. [118] build GPU performance and power models by training on numerous runs across several hardware configurations, and by feeding relevant performance counter measurements to the trained model for power and performance prediction. Poise [40] is a technique to balance the conflicting effects of increased TLP and memory system congestion due to high TLP in GPUs. Poise trains a machine learning model offline on a set of benchmarks to learn the best warp scheduling decisions. At runtime, a prediction model selects the warp scheduling decision for unseen applications using the trained performance model. Ardalani et al. [21] propose XAPP, a machine-learning based model that uses a single-threaded CPU implementation to predict GPU performance.

Special-Purpose Models. Several runtime optimization techniques rely on special-purpose models [36, 91, 101], in contrast to MDM which is a general-purpose model suitable for broad design space exploration. For example, CRISP [91] predicts the performance impact of varying the operating frequency for GPUs. It accounts for artifacts that affect GPU performance more than CPUs (and are thus ignored in CPU frequency scaling work), such as store-related stalls and the high overlap between compute and memory operations. Dai et al. [36] propose a model to estimate the impact of cache misses and resource congestion as a function of the number of thread blocks. They use these models to devise mechanisms to bypass the cache to improve performance.

Simulation-Based Approaches. Detailed GPU architecture simulation [27, 109, 111] is time-consuming, and several research groups have proposed novel approaches for speeding up simulation. Wang et al. [117] propose a modeling approach that relies on source code profiling to extract warp execution behavior and an execution trace. They feed this information to a fast, high-abstraction trace-based simulator for performance estimation. Yu et al. [121] propose a framework for generating small synthetic workloads that are representative of long-running GPU workloads. Punniyamurthy et al. [97] propose a novel abstract GPU performance simulation approach that is based on a flexible separation of functional and timing models. In contrast, MDM is an analytical model and is therefore much faster than a simulator-based approach.

Power and Energy Modeling. Guerreiro et al. [46] propose a model to predict the power of a GPU given any change in operating voltage and frequency (V-F). They use numerous microbenchmarks to stress the various GPU components and measure their power profile at different V-F points. They build a model for power prediction using regression analysis over the collected power profiles. Model Predictive Control (MPC) [84] is a proactive technique for energy optimization. It allocates more energy to high-throughput kernels and uses low V-F settings for low-performance kernels. MPC relies on prediction models to determine the power and performance at different configurations.
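For context, such V-F power models are typically anchored on the familiar first-order CMOS decomposition; the form below is a textbook approximation included here for illustration only (it is an assumption, not the regression model of [46]):

    \[ P(V, f) \approx P_{\mathrm{static}}(V) + \alpha\, C_{\mathrm{eff}}\, V^{2} f \]

where $\alpha$ is the switching activity factor and $C_{\mathrm{eff}}$ the effective switched capacitance; a regression model such as the one in [46] can then be viewed as estimating such coefficients per GPU component from power profiles measured at different V-F points.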

Abe et al. [19] characterize the impact of V-F scaling on execution time and power consumption, and build statistical models to predict and optimize the power and performance of GPU-accelerated systems. Arora et al. [23] aim to exploit idle periods in CPU-GPU systems to save power. They propose models to predict idle duration, and power gating mechanisms to exploit sufficiently-long periods, considering dynamically varying break-even idle points. GRAPE [101] is a control system that maximizes energy savings given a specific performance requirement. It does so by coordinating core usage, warp action, core speed, and memory speed. In this work, we focus on performance modeling, with particular emphasis on MD-applications. We show that MDM accurately captures the performance trends with frequency scaling for MD-applications.

5.9 Summary

In this work, we have presented the GPU Memory Divergence Model (MDM), which accurately models the key performance-related behavior of emerging MD-applications while retaining high accuracy for NMD-applications. MDM faithfully models that the poor spatial locality and high memory intensity of MD-applications lead to frequent L1 cache blocking, which cripples the ability of the SMs to use TLP to hide memory latencies. Overall, MDM achieves high accuracy compared to detailed simulation (13.9% average prediction error versus 162% for the state-of-the-art GPUMech model) and on real hardware (40% prediction error versus 164%) for our MD-applications. We further demonstrate MDM's usefulness for driving design space exploration, DVFS scaling and analyzing performance bottlenecks. Overall, MDM significantly advances the state-of-the-art in analytical GPU performance modeling in terms of scope, practicality and accuracy.


Chapter 6

Conclusion and Future Work

It's easy to look back to weave a crisp story of success, but in reality I tried many, many things like this, most of which never panned out.

– The PhD Grind

This chapter summarizes the key contributions of this thesis and discusses potential avenues for future work.

6.1 Conclusion

In recent years, GPUs have emerged as the dominant parallel architecture to provide high-performance computing. However, the distinct execution characteristics of emerging GPU applications and new hardware features on modern-day GPU architectures pose unprecedented challenges for architecture researchers. More specifically, in order to construct a scalable crossbar network-on-chip (NoC) for GPUs with more SMs, a cluster structure is implemented in modern-day GPUs. Unfortunately, clustered GPUs face severe NoC congestion due to port sharing. In addition, emerging memory-divergent GPU applications pose new requirements for constructing a faithful analytical performance model for early design space exploration.

In this thesis, we make the following contributions:

Intra-cluster coalescing and coalesced cache: As the number of SMs on next-generation GPUs continues to increase, NoC congestion quickly becomes a key design challenge to scale performance. Clustered GPUs further exacerbate this performance issue due to port sharing among SMs in a cluster. To mitigate network congestion, we propose intra-cluster coalescing (ICC) and the coalesced cache (CC) to exploit the intra-cluster locality (ICL) observed in many GPU-compute applications. ICC coalesces memory requests from different SMs in a cluster to the same L2 cache line to reduce the overall number of requests and replies sent over the NoC. CC extends the opportunity for coalescing cache lines by caching them at the cluster level for future reference. We observe on average 19% (and up to 48%) redundant memory requests at the cluster level. Through ICC and CC, we report an average 15% (and up to 69%) performance improvement over a set of benchmarks with varying degrees of intra-cluster locality. At the same time, we achieve a 5.3% (and up to 16.7%) system energy reduction. The overarching contribution of this work is the exploitation of intra-cluster locality to tackle the emerging NoC congestion bottleneck in clustered GPUs, improving overall system performance by coalescing memory requests across SMs within a cluster.
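The following host-side C++ sketch illustrates the coalescing idea at the cluster level; it is a behavioral illustration only (the class and parameter names are invented here, and the 128-byte line size is an assumption), not the hardware design evaluated in this thesis:

    // Behavioral sketch of intra-cluster coalescing: outstanding requests from
    // the SMs of one cluster are tracked per L2 cache line; a request to a line
    // that is already in flight is merged instead of re-entering the NoC.
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct PendingLine {
        std::vector<int> requestingSMs;   // SMs in the cluster waiting for this line
    };

    class IntraClusterCoalescer {
        static constexpr uint64_t kLineBytes = 128;           // assumed L2 line size
        std::unordered_map<uint64_t, PendingLine> pending;    // keyed by line address
    public:
        // Returns true if a new NoC request must be sent, false if coalesced.
        bool access(int smId, uint64_t byteAddr) {
            uint64_t line = byteAddr / kLineBytes;
            auto it = pending.find(line);
            if (it != pending.end()) {                        // a peer SM already asked
                it->second.requestingSMs.push_back(smId);
                return false;                                 // coalesced: no NoC traffic
            }
            pending[line].requestingSMs.push_back(smId);
            return true;                                      // first request for this line
        }
        // On reply, the line is forwarded to all waiting SMs and the entry freed.
        std::vector<int> reply(uint64_t byteAddr) {
            uint64_t line = byteAddr / kLineBytes;
            std::vector<int> waiters = std::move(pending[line].requestingSMs);
            pending.erase(line);
            return waiters;
        }
    };

Only the first access to a line injects a request into the NoC; the peer SMs are served when the single reply returns, which is the source of the request and reply reduction reported above.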

Distributed-block CTA scheduling: In this work, we observe the significant interaction between CTA scheduling policies and ICL. This motivates us to exploit higher ICL through improved CTA scheduling. We propose distributed-block scheduling, which is a two-level CTA scheduling policy that first evenly distributes consecutive CTAs across clusters, and subsequently schedules pairs of consecutive CTAs per SM to maximize L1 cache locality and L1 MSHR coalescing opportunity. Through execution-driven GPU simulation, we find that distributed-block scheduling improves GPU performance by 4% (and up to 16%) while at the same time reducing system energy and EDP by 1.2% and 5%, respectively, compared to the state-of-the-art distributed scheduling policy. In addition, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% (and up to 67%) and reduce system energy by 6% (and up to 18%).
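As an illustration of the two-level assignment, the sketch below maps a linear CTA identifier to a (cluster, SM) pair; it reflects one plausible reading of the policy summarized above (the authoritative definition is given in the corresponding chapter), and the function name and parameters are illustrative:

    // Illustrative two-level placement: consecutive CTAs are split into equal
    // contiguous ranges across clusters; within a cluster, pairs of consecutive
    // CTAs land on the same SM to expose L1 locality and MSHR coalescing.
    struct Placement { int cluster; int sm; };

    Placement scheduleCTA(int ctaId, int totalCTAs, int numClusters, int smsPerCluster) {
        int ctasPerCluster = (totalCTAs + numClusters - 1) / numClusters; // even split
        int cluster = ctaId / ctasPerCluster;        // contiguous CTA range per cluster
        int local   = ctaId % ctasPerCluster;        // position within the cluster's range
        int sm      = (local / 2) % smsPerCluster;   // pairs of consecutive CTAs per SM
        return {cluster, sm};
    }

In this form, the pair size (two CTAs) and the even split are parameters that could be tuned per application.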

Memory divergence model: Analytical models enable architects to carry out early-stage design space exploration several orders of magnitude faster than cycle-accurate simulation by capturing first-order performance phenomena with a set of mathematical equations. However, this speed advantage is void if the conclusions obtained through the model are misleading due to model inaccuracies. Therefore, a practical analytical model needs to be sufficiently accurate to capture key performance trends across a broad range of applications and architectural configurations.

In this work, we have presented the GPU Memory Divergence Model (MDM), which accurately models the key performance-related behavior of emerging Memory-Divergent (MD) applications while retaining high accuracy for Non-Memory-Divergent (NMD) applications. MDM faithfully models that the poor spatial locality and high memory intensity of MD-applications lead to frequent L1 cache blocking, which cripples the ability of the SMs to use TLP to hide memory latencies. Overall, MDM achieves high accuracy compared to detailed simulation (13.9% average prediction error versus 162% for the state-of-the-art GPUMech model) and on real hardware (40% prediction error versus 164%) for our MD-applications. We further demonstrate MDM's usefulness for driving design space exploration, DVFS scaling and analyzing performance bottlenecks. Overall, MDM significantly advances the state-of-the-art in analytical GPU performance modeling in terms of scope, practicality and accuracy.

6.2 Future Work

The creative mind has a vast attic. That homework problem you did in college, that intriguing but seemingly pointless paper you spent a week deciphering as a postdoc, that offhand remark of a colleague, all are stored in hope chests somewhere up in a creative person's brain, often to be picked through and applied by the subconscious at the most unexpected moments.

– Richard Feynman

In this section, we discuss potential directions for future work related to the thesis.

6.2.1 Software/Hardware Coordinated CTA Scheduling

In this thesis, we observe significant intra-cluster locality for a number of GPU applications. We further conclude that locality is high among consecutive CTAs, and we propose the distributed-block CTA scheduling policy to exploit this property. We observe that inter-CTA locality patterns can differ across GPU applications, and using a one-size-fits-all CTA scheduling policy can be sub-optimal. For instance, for applications dealing with irregular data structures, e.g., graphs, trees, hashes or pointer lists, the inter-CTA locality highly depends on the data organization and how data is mapped in memory. In addition, some GPU applications exhibit specific inter-CTA locality patterns that are defined by the programmer (including but not limited to consecutive CTAs). Hence, a software/hardware co-design scheme could analyze the different inter-CTA locality patterns and select a corresponding CTA scheduling policy to exploit even more intra-cluster locality. Furthermore, the load balancing problem should be carefully considered, since there is significant variation across CTAs in irregular GPU applications.

One way to inspect an application's inter-CTA locality pattern is through profiling. Previous work [79, 123] has suggested using a lightweight inspector kernel before runtime to profile the local data accesses of certain graph processing applications (e.g., the first few layers of BFS), in order to predict the global data organization for optimizing the runtime access patterns. After profiling, we can utilize an off-line graph partitioning algorithm to map CTAs with high locality onto one cluster.
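One possible realization of this profiling-plus-partitioning flow is sketched below in host-side C++; all names and data structures are hypothetical, and a production version would use a proper graph partitioner rather than the greedy pass shown here:

    // Hypothetical sketch: group CTAs that touch many of the same cache lines
    // into the same cluster, subject to a simple per-cluster capacity limit.
    #include <cstdint>
    #include <set>
    #include <vector>

    using LineSet = std::set<uint64_t>;   // cache lines touched by one CTA (from profiling)

    std::vector<int> partitionCTAs(const std::vector<LineSet>& linesPerCTA, int numClusters) {
        int n = static_cast<int>(linesPerCTA.size());
        std::vector<int> clusterOf(n, -1);
        std::vector<int> load(numClusters, 0);
        int capacity = (n + numClusters - 1) / numClusters;   // balanced mapping

        for (int i = 0; i < n; ++i) {
            // Score each cluster by how many lines CTA i shares with CTAs already there.
            std::vector<long> score(numClusters, 0);
            for (int j = 0; j < i; ++j) {
                int c = clusterOf[j];
                for (uint64_t line : linesPerCTA[i])
                    if (linesPerCTA[j].count(line)) score[c]++;
            }
            // Pick the highest-scoring cluster that still has room.
            int best = 0; long bestScore = -1;
            for (int c = 0; c < numClusters; ++c)
                if (load[c] < capacity && score[c] > bestScore) { best = c; bestScore = score[c]; }
            clusterOf[i] = best;
            load[best]++;
        }
        return clusterOf;   // CTA-to-cluster mapping to feed the CTA scheduler
    }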


CTA scheduling becomes even more critical for the newly proposed multi-module GPUs [24], for which the inter-module bandwidth is limited.

6.2.2 MSHR Management

In this thesis, we identify and model the key performance bottleneck of memory-divergent applications. The poor spatial locality of these applications leads to frequent L1 cache blocking due to issuing a large number of concurrent cache misses. To address this problem, prior work has proposed new warp scheduling techniques to increase the cache hit rate by limiting the number of concurrent warps. Cache-Conscious Wavefront Scheduling (CCWS) [99] uses a reactionary mechanism to scale back the number of warps sharing the cache when thrashing is detected. Divergence-aware warp scheduling (DAWS) [100] makes proactive locality-aware scheduling decisions based on the predicted locality and memory divergence behavior. These proposals work well for memory-divergent applications with high intra-warp locality. Other proposals utilize cache bypassing to tackle this problem. Jia et al. [57] use resource-aware cache bypassing to bypass memory requests when they suffer from stalls in the cache. Xie et al. [119, 120] propose a compiler-level cache bypassing technique: the compiler analyzes the cache utilization of a program based on the developed metric, and then selects certain instructions to access or bypass the cache.

However, none of this prior work explores the opportunity of optimizing the MSHRs to reduce the number of stall cycles. The conventional MSHR design is static: it provides a fixed number of entries and a fixed number of slots per entry. However, the optimal MSHR sizing can be application-dependent. Memory-divergent applications with low intra-warp locality favor more MSHR entries to avoid quick depletion of the MSHRs. In contrast, memory-divergent applications with high intra-warp locality may not need a large number of MSHRs (and concurrent misses) in order to avoid cache thrashing. On the other hand, the number of outstanding misses that can be coalesced into each MSHR entry (i.e., the slots) may be insufficient for some GPU applications.
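The behavioral C++ sketch below captures this conventional organization; the entry and slot counts are illustrative defaults, not the configuration used elsewhere in this thesis:

    // Minimal sketch of a static MSHR file: each entry tracks one outstanding
    // cache line plus a fixed number of slots recording the accesses merged
    // into that miss. An access that finds no free entry or slot is rejected,
    // which models the L1-blocking condition that hurts MD-applications.
    #include <cstdint>
    #include <vector>

    struct MSHRSlot  { int warpId; int regDest; };                // one merged access
    struct MSHREntry { bool valid = false; uint64_t line = 0; std::vector<MSHRSlot> slots; };

    class MSHRFile {
        std::vector<MSHREntry> entries;
        size_t slotsPerEntry;
    public:
        explicit MSHRFile(size_t numEntries = 32, size_t numSlots = 8)
            : entries(numEntries), slotsPerEntry(numSlots) {}

        // Returns false when the access cannot be accepted (the SM must stall).
        bool accept(uint64_t line, int warpId, int regDest) {
            for (auto& e : entries)                               // try to merge first
                if (e.valid && e.line == line) {
                    if (e.slots.size() == slotsPerEntry) return false;   // no free slot
                    e.slots.push_back({warpId, regDest});
                    return true;
                }
            for (auto& e : entries)                               // otherwise allocate
                if (!e.valid) {
                    e.valid = true;
                    e.line = line;
                    e.slots.clear();
                    e.slots.push_back({warpId, regDest});
                    return true;
                }
            return false;                                         // no free entry
        }
    };

A dynamically configurable design could, for instance, trade entries for slots at kernel launch based on the observed intra-warp locality.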

State-of-the-art MSHRs handle concurrent cache misses in a first-come, first-served manner, which may cause performance degradation. For instance, a divergent memory request consuming MSHRs would quickly stall other cache misses (e.g., a coalesced cache miss) that could use the MSHRs more efficiently. An intelligent MSHR design should provide more control over memory traffic, such as restricting the number of concurrent misses to avoid congestion in the next cache level of the memory hierarchy.

Overall, the design of conventional MSHRs should be reconsidered in light of the diverse GPU-application characteristics and, especially, memory divergence. Possible solutions such as dynamically configurable MSHRs or request-aware MSHRs may provide better performance and need to be explored in depth.


6.2.3 Register File Organization

We have witnessed a sharp increase in GPU power consumption in recent years. The elevated level of GPU power consumption has a significant impact on their reliability, economic feasibility, architecture design, as well as performance scaling. On one side of the spectrum, GPUs are the main accelerators in large-scale datacenters and supercomputers, and the demand for power-efficient systems is expected to grow. For instance, the Summit supercomputer, equipped with IBM POWER9 CPUs and NVIDIA Volta GV100 GPUs, consumes 10 MW of power [7]. On the other side of the spectrum, for battery-operated devices such as smartphones, the need for saving energy increases with every generation of mobile GPUs [22, 112].

Current GPUs tend to over-provision compute resources, while utilization remains low, in order to meet worst-case performance requirements. A typical example is the register file, which a GPU provisions generously to enable massive thread-level parallelism. The register file is the largest SRAM structure on the die and one of the most power-hungry components of the GPU [75]. With the increase in the number of streaming multiprocessors and concurrent thread contexts, more registers are needed. Traditional GPUs assume static register file management: each architected register has a corresponding physical register allocated in the register file [56]. Once a register is allocated, it is not released until the cooperative thread array (CTA) completes its execution. This policy simplifies the register management hardware, but at the cost of a significant waste of power.

Register under-utilization has been observed in a lot of prior work. For instance, Jeon et al. [56] show that some registers have a short lifetime and hence propose a GPU register file virtualization scheme to share registers among warps; by under-provisioning the physical register file, both dynamic and static power can be reduced. Register under-utilization becomes even more challenging with emerging GPU applications such as graph analytics and machine learning. First, the frequent memory accesses stall the core and leave a large fraction of the registers inactive. Second, some machine learning applications such as deep neural networks can tolerate some precision loss, so that provisioning a smaller and more power-efficient register file becomes feasible. Hence, dynamic register management based on an application's characteristics may be another interesting topic to explore.
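The C++ sketch below illustrates the kind of bookkeeping such dynamic register management implies; it is written in the spirit of register file virtualization [56] but is not the mechanism of that paper, and all names are invented for illustration:

    // Hypothetical renaming table: architected registers are mapped to physical
    // registers on their first write and released after their last read, rather
    // than being held until the CTA completes.
    #include <map>
    #include <utility>
    #include <vector>

    class RegisterRenamer {
        std::vector<int> freeList;                            // free physical registers
        std::map<std::pair<int,int>, int> mapping;            // (warp, archReg) -> physReg
    public:
        explicit RegisterRenamer(int numPhysRegs) {
            for (int r = numPhysRegs - 1; r >= 0; --r) freeList.push_back(r);
        }
        // Allocate on first write; returns -1 if the physical file is exhausted
        // (the scheduler would then have to stall or spill).
        int write(int warp, int archReg) {
            auto key = std::make_pair(warp, archReg);
            auto it = mapping.find(key);
            if (it != mapping.end()) return it->second;       // already mapped
            if (freeList.empty()) return -1;
            int phys = freeList.back(); freeList.pop_back();
            mapping[key] = phys;
            return phys;
        }
        // Release after the last read; last-use information is assumed to come
        // from the compiler or a lifetime predictor.
        void lastRead(int warp, int archReg) {
            auto it = mapping.find(std::make_pair(warp, archReg));
            if (it == mapping.end()) return;
            freeList.push_back(it->second);
            mapping.erase(it);
        }
    };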

6.2.4 Reliability

New-generation GPUs feature an increasing number of SM cores and higher DRAM bandwidth to satisfy the parallelism requirements of GPU-compute applications. However, this scaling trend will come to an end if the increasingly complex reliability problems remain unsolved [83, 108].

Transient hardware errors from high-energy particle strikes (also known as soft errors) are of particular concern due to their risk of silent data corruption (SDC) [35, 105]. GPUs were originally designed for graphics and gaming applications, which are error-tolerant. However, as GPUs become more pervasive in safety-critical systems (e.g., autonomous driving) and high-performance computing (HPC) systems, designers must ensure that the computations that are offloaded to the GPU are resilient to transient errors. In fact, recent studies have found that GPUs can experience significantly higher fault rates than CPUs [42, 47].

One direction is error detection and recovery. State-of-the-art GPUs employ ECC or parity protection for major storage structures such as DRAM, the caches and the register file [92]. However, errors in a logic unit (e.g., an ALU) can still cause reliability problems (e.g., writing an incorrect data value to a register). Also, ECC is a heavy-handed solution because it addresses unpredictable errors at a high cost. There are also studies that utilize software/hardware duplication, which incurs high overhead [83, 108]. Fortunately, the multi-threading context provides an opportunity.

Prior work [74] has observed that the arithmetic difference between the registers of successive threads within one warp is often small, especially in the most significant bits (MSBs), which are also the most critical to output precision. This motivates detecting errors by comparing register values within the same warp. Shuffle instructions [14] in CUDA operate within a single warp, allowing the user to move register values from higher thread indices to lower thread indices by a given offset. This operation communicates a thread's registers within one warp and thus provides an opportunity to detect errors. Moreover, the same functionality could also be implemented by exploiting under-utilized hardware resources to decrease the run-time overhead.
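The CUDA kernel below sketches this idea with warp shuffle intrinsics; the compared value, the 8-bit MSB window, the threshold and the global fault flag are illustrative assumptions, not a validated detector:

    // Sketch of intra-warp soft-error checking via register exchange.
    __global__ void warpShuffleCheck(const int *data, int *faultFlag, int n) {
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;                     // lane id within the warp
        bool valid = idx < n;
        // Lanes that hold valid data participate in the exchange.
        unsigned mask = __ballot_sync(0xffffffffu, valid);
        if (!valid) return;

        int r = data[idx];                               // register value under check
        int partner  = lane ^ 1;                         // neighboring lane
        int neighbor = __shfl_xor_sync(mask, r, 1);      // fetch neighbor's register

        if ((mask >> partner) & 1u) {                    // neighbor holds valid data
            // Compare only the most significant bits, which the observation above
            // suggests are both similar across lanes and most critical to precision.
            int diff = (r >> 24) - (neighbor >> 24);
            if (diff < 0) diff = -diff;
            if (diff > 8)                                // illustrative threshold
                atomicExch(faultFlag, 1);                // report a suspected soft error
        }
    }

Because the shuffle is a register-to-register exchange within the warp, such a check adds no extra memory traffic; its cost is a few ALU operations per monitored register.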

In addition, the error-tolerance characteristics of emerging GPU applications have not yet been investigated in depth. How to leverage an application's characteristics to guide the chip's reliability design could therefore be explored further.

6.2.5 Extending the MDM Model to Multi-Module GPUs

Recently, the performance demands on GPUs have kept increasing, but single-GPU performance is reaching the end of its scaling due to manufacturing limitations. Multi-module GPUs are a potential solution, where the modules can be chiplets integrated in a single package on an interposer, or sockets connected via NVLink on a board [24, 86]. Multi-module GPUs form a Non-Uniform Memory Access (NUMA) GPU system in which each module enjoys high bandwidth to its local memory but suffers from limited bandwidth when fetching data from remote memory.

Extending the MDM model from single GPUs to multi-module GPUs is both challenging and interesting. Although a multi-module GPU provides higher compute capability by deploying more SM cores, the long-latency, low-bandwidth inter-module communication severely limits performance scalability, especially for memory-divergent applications with low spatial locality. The MDM model provides insight for application mapping as well as architectural configuration. However, some problems have yet to be solved to make the model applicable to multi-module GPUs. First, the limited inter-module bandwidth in multi-module GPUs will become a new bottleneck for MD-applications and should be faithfully modeled. Second, the increased number of cache levels (e.g., the L1.5 cache, a cache level between the L1 and L2 caches) and the memory interference between different GPU modules will further complicate the model.
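As one possible starting point (an assumption for discussion, not part of MDM as presented in this thesis), the NoC/DRAM queueing term could be split into a local and an inter-module component:

    \[ \Delta_{\mathrm{queue}} \approx \frac{f_{\mathrm{loc}}\, N_{\mathrm{mem}}\, S_{\mathrm{line}}}{BW_{\mathrm{local}}} + \frac{(1 - f_{\mathrm{loc}})\, N_{\mathrm{mem}}\, S_{\mathrm{line}}}{BW_{\mathrm{inter}}} \]

where $f_{\mathrm{loc}}$ is the fraction of memory requests served by the local module, $N_{\mathrm{mem}}$ the number of memory requests, $S_{\mathrm{line}}$ the cache-line size, and $BW_{\mathrm{local}}$ and $BW_{\mathrm{inter}}$ the local and inter-module bandwidths; modeling the additional cache levels and inter-module interference would require further terms.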

Hence, how to address these new challenges and build an accurate MDM model for multi-module GPUs deserves further exploration.


Bibliography

[1] NVIDIA's Next Generation CUDA Compute Architecture: Fermi. https://www.nvidia.com/content/PDF/fermi_white_papers.

[2] NVIDIA GeForce GTX1080. https://developer.nvidia.com/introducing-nvidia-geforce-gtx-1080.

[3] HPC Application Support for GPU Computing. https://www.nvidia.com/content/intersect-360-HPC-application-support.pdf.

[4] NVIDIA CUDA Parallel Thread Execution ISA. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.

[5] NVIDIA CUDA Binary Utilities. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html.

[6] Tango: A Deep Neural Network Benchmark Suite for Various Accelerators. https://gitlab.com/Tango-DNNbench/Tango.

[7] TOP500 Supercomputers. https://www.top500.org/.

[8] NVIDIA CUDA Accelerated Libraries for basic linear algebra subroutines. https://developer.nvidia.com/cublas.

[9] NVIDIA CUDA SDK Code Samples. https://developer.nvidia.com/cuda-downloads.

[10] NVIDIA's Next Generation CUDA Compute Architecture: Kepler. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf.

[11] NVIDIA's Next Generation CUDA Compute Architecture: Maxwell. https://developer.nvidia.com/maxwell-compute-architecture.

[12] Profiler User's Guide. https://docs.nvidia.com/cuda/profiler-users-guide/index.html.

[13] NVIDIA GP100 Pascal Architecture. https://www.nvidia.com/object/pascal-architecture-whitepaper.html.

[14] CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz4cZZhMRHt.

[15] NVIDIA Tesla V100 Volta Architecture. http://www.nvidia.com/object/volta-architecture-whitepaper.html.

[16] International Technology Roadmap for Semiconductors 2.0. http://www.itrs2.net/itrs-reports.html, 2015.

[17] NVIDIA GPU Accelerated Libraries for Computing. https://developer.nvidia.com/gpu-accelerated-libraries, 2019.

[18] M. Abdel-Majeed and M. Annavaram. Warped Register File: A Power Efficient Register File for GPGPUs. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 412–423, 2013.

[19] Y. Abe, H. Sasaki, S. Kato, K. Inoue, M. Edahiro, and M. Peres. Power and Performance Characterization and Modeling of GPU-Accelerated Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pages 113–122, 2014.

[20] Amazon. Amazon web services. https://aws.amazon.com/cn/ec2/.

[21] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu. Cross-Architecture Performance Prediction (XAPP) Using CPU Code to Predict GPU Performance. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 725–737, 2015.

[22] J. Arnau, J. Parcerisa, and P. Xekalakis. Boosting mobile GPU performance with a decoupled access/execute fragment processor. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 84–93, 2012.

[23] M. Arora, S. Manne, I. Paul, N. Jayasena, and D. M. Tullsen. Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU integrated systems. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 366–377, 2015.

[24] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa, A. Jaleel, C.-J. Wu, and D. Nellans. MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 320–332, 2017.

[25] H. Asghari Esfeden, F. Khorasani, H. Jeon, D. Wong, and N. Abu-Ghazaleh. CORF: Coalescing Operand Register File for GPUs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 701–714, 2019.

[26] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu. An adaptive performance modeling tool for GPU architectures. ACM SIGPLAN Notices, 45(5):105–114, 2010.

[27] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163–174, 2009.

[28] A. Bakhoda, J. Kim, and T. M. Aamodt. Throughput-Effective On-Chip Networks for Manycore Accelerators. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 421–432, 2010.

[29] M. Burtscher, R. Nasre, and K. Pingali. A Quantitative Study of Irregular Programs on GPUs. In Proceedings of the International Symposium on Workload Characterization (IISWC), pages 141–151, 2012.

[30] N. Chatterjee, M. O'Connor, G. H. Loh, N. Jayasena, and R. Balasubramonian. Managing DRAM Latency Divergence in Irregular GPGPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 128–139, 2014.

[31] N. Chatterjee, M. O'Connor, D. Lee, D. R. Johnson, S. W. Keckler, M. Rhu, and W. J. Dally. Architecting an Energy-Efficient DRAM System for GPUs. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 73–84, 2017.

[32] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the International Symposium on Workload Characterization (IISWC), pages 44–54, 2009.

[33] L. Chen, H. Cheng, P. Wang, and C. Yang. Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling. IEEE Computer Architecture Letters, 16(2):127–131, 2017.

[34] R. Collobert, C. Farabet, K. Kavukcuoglu, and S. Chintala. Torch. http://torch.ch/.

[35] C. Constantinescu. Trends and challenges in VLSI circuit reliability. IEEE Micro, 23(4):14–19, 2003.

[36] H. Dai, C. Li, H. Zhou, S. Gupta, C. Kartsaklis, and M. Mantor. A model-driven approach to warp/thread-block level GPU cache bypassing. In Proceedings of the Design Automation Conference (DAC), pages 1–6, 2016.

[37] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das. Application-aware prioritization mechanisms for on-chip networks. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 280–291, 2009.

[38] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[39] S. Dublish, V. Nagarajan, and N. Topham. Cooperative Caching for GPUs. ACM Transactions on Architecture and Code Optimization, 13(4), Dec. 2016.

[40] S. Dublish, V. Nagarajan, and N. Topham. Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs Using Machine Learning. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 492–505, 2019.

[41] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A mechanistic performance model for superscalar out-of-order processors. ACM Transactions on Computer Systems, 27(2):1–37, 2009.

[42] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications. IEEE Transactions on Parallel and Distributed Systems, 27(12):3397–3411, 2016.

[43] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 407–420, 2007.

[44] Google. Graphics Processing Unit (GPU) — Google Cloud. https://cloud.google.com/gpu/.

[45] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a High-Level Language Targeted to GPU Codes. In Innovative Parallel Computing (InPar), pages 1–10, 2012.

[46] J. Guerreiro, A. Ilic, N. Roma, and P. Tomas. GPGPU power modeling for multi-domain voltage-frequency scaling. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 789–800, 2018.

[47] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer. SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 249–258, 2017.

[48] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce Framework on Graphics Processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 260–269, 2008.

[49] J. Hestness, S. W. Keckler, and D. A. Wood. A Comparative Analysis of Microarchitecture Effects on CPU and GPU Memory System Behavior. In Proceedings of the International Symposium on Workload Characterization (IISWC), pages 150–160, 2014.

[50] S. Hong and H. Kim. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 152–163, 2009.

[51] J.-C. Huang, J. H. Lee, H. Kim, and H.-H. S. Lee. GPUMech: GPU performance modeling technique based on interval analysis. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 268–279, 2014.

[52] M. A. Ibrahim, H. Liu, O. Kayiran, and A. Jog. Analyzing and Leveraging Remote-Core Bandwidth for Enhanced Performance in GPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 258–271, 2019.

[53] M. Jahre and L. Natvig. A high performance adaptive miss handling architecture for chip multiprocessors. Transactions on High-Performance Embedded Architectures and Compilers IV, 6760, 2011.

[54] H. Jang, J. Kim, P. Gratz, K. H. Yum, and E. J. Kim. Bandwidth-efficient on-chip interconnect designs for GPGPUs. In Proceedings of the Design Automation Conference (DAC), pages 1–6, 2015.

[55] JEDEC. High Bandwidth Memory (HBM) DRAM. 2015.

[56] H. Jeon, G. S. Ravi, N. S. Kim, and M. Annavaram. GPU Register File Virtualization. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 420–432, 2015.

[57] W. Jia, K. A. Shaw, and M. Martonosi. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 272–283, 2014.

[58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the International Conference on Multimedia (ICMR), pages 675–678, 2014.

[59] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 395–406, 2013.

[60] T. S. Karkhanis and J. E. Smith. A first-order superscalar processor model. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 338–349, 2004.

[61] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and H. Jeon. Detailed Characterization of Deep Neural Networks on GPUs and FPGAs. In Proceedings of the 12th Workshop on General Purpose Processing Using GPUs (GPGPU), pages 12–21, 2019.

[62] O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 157–166, 2013.

[63] M. Khairy, A. Jain, T. M. Aamodt, and T. G. Rogers. Exploring Modern GPU Memory System Design Challenges through Accurate Modeling. CoRR, abs/1810.07269, 2018.

[64] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers. Accel-Sim: An extensible simulation framework for validated GPU modeling. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 473–486, 2020.

[65] F. Khorasani, H. Asghari Esfeden, A. Farmahini-Farahani, N. Jayasena, and V. Sarkar. RegMutex: Inter-Warp GPU Register Time-Sharing. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 816–828, 2018.

[66] H. Kim, J. Kim, W. Seo, Y. Cho, and S. Ryu. Providing Cost-Effective On-Chip Network Bandwidth in GPGPUs. In Proceedings of the International Conference on Computer Design (ICCD), pages 407–412, 2012.

[67] K. H. Kim, R. Boyapati, J. Huang, Y. Jin, K. H. Yum, and E. J. Kim. Packet Coalescing Exploiting Data Redundancy in GPGPU Architectures. In Proceedings of the International Conference on Supercomputing (ICS), pages 1–10, 2017.

[68] J. Kloosterman, J. Beaumont, M. Wollman, A. Sethia, R. Dreslinski, T. Mudge, and S. Mahlke. WarpPool: Sharing Requests with Inter-Warp Coalescing for Throughput Processors. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 433–444, 2015.

[69] J. Kloosterman, J. Beaumont, D. A. Jamshidi, J. Bailey, T. Mudge, and S. Mahlke. RegLess: Just-in-Time Operand Staging for GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 151–164, 2017.

[70] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 81–87, 1981.

[71] N. B. Lakshminarayana and H. Kim. Spare register aware prefetching for graph algorithms on GPUs. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 614–625, 2014.

[72] A. Lavin. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308, 2015.

[73] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 260–271, 2014.

[74] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram. Warped-Compression: Enabling power efficient GPUs through register compression. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 502–514, 2015.

[75] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 487–498, 2013.

[76] J. Lew, D. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt. Analyzing Machine Learning Workloads Using a Detailed GPU Simulator. CoRR, abs/1811.08933, 2018.

[77] A. Li, S. L. Song, W. Liu, X. Liu, A. Kumar, and H. Corporaal. Locality-Aware CTA Clustering for Modern GPUs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 297–311, 2017.

[78] D. Li and T. M. Aamodt. Inter-Core Locality Aware Memory Scheduling. IEEE Computer Architecture Letters, 15(1):25–28, Jan 2016.

[79] J. Liu, N. Hegde, and M. Kulkarni. Hybrid CPU-GPU Scheduling and Execution of Tree Traversals. ACM SIGPLAN Notices, 51(8), Feb. 2016.

[80] A. Lopes, F. Pratas, L. Sousa, and A. Ilic. Exploring GPU Performance, Power and Energy-Efficiency Bounds with Cache-Aware Roofline Modeling. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 259–268, 2017.

[81] K. Ma, X. Li, W. Chen, C. Zhang, and X. Wang. GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures. In Proceedings of the International Conference on Parallel Processing (ICPP), pages 48–57, 2012.

[82] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey. A voltage reduction technique for digital systems. In Proceedings of the International Conference on Solid-State Circuits (ISSCC), pages 238–239, 1990.

[83] A. Mahmoud, S. K. S. Hari, M. B. Sullivan, T. Tsai, and S. W. Keckler. Optimizing Software-Directed Instruction Replication for GPU Error Detection. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 1–12, 2018.

[84] A. Majumdar, L. Piga, I. Paul, J. L. Greathouse, W. Huang, and D. H. Albonesi. Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 613–624, 2017.

[85] M. Mao, J. Hu, Y. Chen, and H. Li. VWS: A Versatile Warp Scheduler for Exploring Diverse Cache Localities of GPGPU Applications. In Proceedings of the Design Automation Conference (DAC), pages 1–6, 2015.

[86] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez, and D. Nellans. Beyond the Socket: NUMA-Aware GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 123–135, 2017.

[87] A. Mirhosseini, M. Sadrosadati, B. Soltani, H. Sarbazi-Azad, and T. F. Wenisch. BiNoCHS: Bimodal Network-on-Chip for CPU-GPU Heterogeneous Systems. In Proceedings of the International Symposium on Networks-on-Chip (NOCS), pages 1–8, 2017.

[88] S. Mittal and J. S. Vetter. A Survey of Methods for Analyzing and Improving GPU Energy Efficiency. ACM Computing Surveys, 47(2), 2014.

[89] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A Tool to Model Large Caches. HP Laboratories, pages 22–31, 2009.

[90] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 308–317, 2011.

[91] R. Nath and D. Tullsen. The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 281–293, 2015.

[92] D. A. G. Oliveira, P. Rech, L. L. Pilla, P. O. A. Navaux, and L. Carro. GPGPUs ECC efficiency and efficacy. In Proceedings of the International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pages 209–215, 2014.

[93] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU Computing. Proceedings of the IEEE, 96(5):879–899, May 2008.

[94] M. O'Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 41–54, 2017.

[95] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for Accurate and Efficient Simulation. SIGMETRICS Performance Evaluation Review, 31(1):318–319, June 2003.

[96] B. Pichai, L. Hsu, and A. Bhattacharjee. Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 743–758, 2014.

[97] K. Punniyamurthy, B. Boroujerdian, and A. Gerstlauer. GATSim: Abstract timing simulation of GPUs. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 43–48, 2017.

[98] M. A. Raihan, N. Goli, and T. M. Aamodt. Modeling Deep Learning Accelerator Enabled GPUs. CoRR, abs/1811.08309, 2018.

[99] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-Conscious Wavefront Scheduling. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 72–83, 2012.

[100] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Divergence-aware Warp Scheduling. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 99–110, 2013.

[101] M. H. Santriaji and H. Hoffmann. GRAPE: Minimizing energy for GPU applications with performance requirements. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 1–13, 2016.

[102] A. Sethia, D. A. Jamshidi, and S. Mahlke. Mascar: Speeding up GPU warps by reducing memory pitstops. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 174–185, 2015.

[103] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. ACM SIGPLAN Notices, 47(8):11–22, 2012.

[104] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[105] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. Debardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing failures in exascale computing. International Journal of High Performance Computing Applications, 28(2):129–173, May 2014.

[106] M. Stephenson, S. K. S. Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans, M. O'Connor, and S. W. Keckler. Flexible software profiling of GPU architectures. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 185–197, 2015.

[107] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical report, Mar 2012.

[108] M. B. Sullivan, S. K. S. Hari, B. Zimmer, T. Tsai, and S. W. Keckler. SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 762–774, 2018.

[109] Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A. K. Ziabari, Z. Chen, R. Ubal, J. L. Abellan, J. Kim, A. Joshi, and D. Kaeli. MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 197–209, 2019.

[110] D. Tarjan and K. Skadron. The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pages 1–10, 2010.

[111] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2Sim: A simulation framework for CPU-GPU computing. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 335–344, 2012.

[112] J. M. Vatjus-Anttila, T. Koskela, and S. Hickey. Power Consumption Model of a Mobile GPU Based on Rendering Complexity. In Proceedings of the International Conference on Next Generation Mobile Apps, Services and Technologies (NGMAST), pages 210–215, 2013.

[113] N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog, P. B. Gibbons, and O. Mutlu. Zorua: A holistic approach to resource virtualization in GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 1–14, 2016.

[114] O. Villa, M. Stephenson, D. W. Nellans, and S. W. Keckler. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 372–383, 2019.

[115] V. Volkov. Understanding Latency Hiding on GPUs. PhD thesis, EECS Department, University of California, Berkeley, 2016. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html.

[116] J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili. LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 583–595, 2016.

[117] X. Wang, K. Huang, A. Knoll, and X. Qian. A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 506–518, 2019.

[118] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou. GPGPU performance and power estimation using machine learning. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 564–576, 2015.

[119] X. Xie, Y. Liang, G. Sun, and D. Chen. An efficient compiler framework for cache bypassing on GPUs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), pages 516–523, 2013.

[120] X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 76–88, 2015.

[121] Z. Yu, L. Eeckhout, N. Goswami, T. Li, L. K. John, H. Jin, C. Xu, and J. Wu. GPGPU-MiniBench: Accelerating GPGPU Micro-Architecture Simulation. IEEE Transactions on Computers, 64(11):3153–3166, 2015.

[122] J. Zhan, O. Kayıran, G. H. Loh, C. R. Das, and Y. Xie. OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 1–13, 2016.

[123] E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen. On-the-Fly Elimination of Dynamic Irregularities for GPU Computing. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 369–380, 2011.

[124] Y. Zhang and J. D. Owens. A Quantitative Performance Analysis Model for GPU Architectures. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 382–393, 2011.

[125] X. Zhao, S. Ma, C. Li, L. Eeckhout, and Z. Wang. A Heterogeneous Low-Cost and Low-Latency Ring-Chain Network for GPGPUs. In Proceedings of the International Conference on Computer Design (ICCD), pages 472–479, 2016.

[126] X. Zhao, Y. Liu, A. Adileh, and L. Eeckhout. LA-LLC: Inter-Core Locality-Aware Last-Level Cache to Exploit Many-to-Many Traffic in GPGPUs. IEEE Computer Architecture Letters, 16(1):42–45, Jan 2017.

[127] X. Zhao, A. Adileh, Z. Yu, Z. Wang, A. Jaleel, and L. Eeckhout. Adaptive Memory-Side Last-Level GPU Caching. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 411–423, 2019.

[128] A. K. Ziabari, J. L. Abellan, Y. Ma, A. Joshi, and D. Kaeli. Asymmetric NoC Architectures for GPU Systems. In Proceedings of the International Symposium on Networks-on-Chip (NOCS), pages 1–8, 2015.