
1/62

Influential Operating Systems Research
Large Scale Systems

Maksym Planeta

21.06.2016


2/62

Table of Contents

Introduction

Large Scale Operating System

  Cluster architecture
  SLURM

Challenges
  Scalability
  Power wall
  Performance debugging

Conclusion


3/62

Outline

- Focus on HPC

- What is a large-scale operating system?

- Important components of a supercomputer

- Challenges of supercomputer development


5/62

What's so special about an OS for LSS?

Google data center

Source: google.com

1. x86-based CPUs

2. 16 GB RAM (as of 2011)

3. Linux-based OS

4. COTS switches [8, 4]

Taurus

Source: tu-dresden.de

1. 12-core Intel CPUs

2. Two-socket motherboards

3. Linux-based OS

4. InfiniBand-based network [7]


6/62

So, what is an Operating System?

The job of the operating system is to

. . . provide user programs with a better, simpler, cleaner model of the computer and to handle managing all the resources . . .

[9]

Key aspects of an operating system:

1. Extended machine

2. Resource manager


7/62

Extended Machine

[Figure 1-2 from Tanenbaum [9]: application programs sit on top of the operating system's beautiful interface, while the operating system itself deals with the hardware's ugly interface.]

Source: [9]

Figure: Operating systems turn ugly hardware into beautiful abstractions


8/62

Extended Machine (contd.)

Large-Scale System OS vs Small-Scale OS:

- Usual node architecture ⇒ interacts with HW the same way

- Applications expect a UNIX-like OS ⇒ the same user abstractions

- From a bird's eye view, the extended machine is indifferent to scale.


9/62

Resource Manager

Large-Scale OS vs Small-Scale OS:

- Shared resources and local resources

- Bulk communication

- Resource allocation may happen on an inter-node basis


10/62

Large-Scale Operating System

1. Runs a multitude of individual computers as a single entity

2. Provides a global view of the system for user programs

3. Coordinates the resources of individual nodes


12/62

Node types and network overview

Source: hlrn.de

Figure: The HLRN-III Cray System [1]


13/62

Compute node types

Type         Nodes   Cores/Node   Mem/Core (MB)   Mem/Node (MB)
Haswell       1328       24            2583            62000
Haswell         84       24            5250           126000
Haswell         44       24           10583           254000
Haswell          4       56           36500          2044000
Sandy          228       16            1875            30000
Sandy           28       16            3875            62000
Sandy           14       16            7875           126000
Sandy            2       32           31875          1020000
Westmere       180       12            3875            46500
Tesla K20x      44       16            3000            48000
Tesla K80       64       16            2583            62000

Table: Taurus (Phase 2). Source: doc.zih.tu-dresden.de
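As a quick sanity check on the table (not part of the original slides), memory per node is roughly cores per node times memory per core, and the island sizes can be summed up. The numbers below are copied from the table; counting GPU-island host cores like ordinary CPU cores is an assumption of this sketch.

```python
# Aggregate size of Taurus Phase 2, derived from the table above.
# Assumption: GPU-island host cores are counted like ordinary CPU cores.
islands = [
    # (island,     nodes, cores/node, mem/core in MB)
    ("Haswell",     1328, 24,  2583),
    ("Haswell",       84, 24,  5250),
    ("Haswell",       44, 24, 10583),
    ("Haswell",        4, 56, 36500),
    ("Sandy",        228, 16,  1875),
    ("Sandy",         28, 16,  3875),
    ("Sandy",         14, 16,  7875),
    ("Sandy",          2, 32, 31875),
    ("Westmere",     180, 12,  3875),
    ("Tesla K20x",    44, 16,  3000),
    ("Tesla K80",     64, 16,  2583),
]

total_nodes = sum(nodes for _, nodes, _, _ in islands)
total_cores = sum(nodes * cores for _, nodes, cores, _ in islands)
total_mem_tb = sum(nodes * cores * mem for _, nodes, cores, mem in islands) / 1024**2
print(f"{total_nodes} nodes, {total_cores} cores, about {total_mem_tb:.0f} TB of RAM")
```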


14/62

How is it used?

- Start several instances of the program on several nodes

- Nodes are used exclusively by a single user

- The program instances know about each other

- Bulk-synchronous communication


15/62

From intention to action

- User decides to run a program

- User submits a job script

- ???

- Application runs


16/62

From intention to action

- User decides to run a program

- User submits a job script

- SLURM!

- Application runs


17/62

SLURM

Simple Linux Utility for Resource Management

1. Resource manager (also: PBS)

2. Job scheduler (also: MOAB)

3. Job launcher (also: Hydra)
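To make the "user submits a job script" step concrete, here is a minimal sketch (not from the slides) of generating a batch script and handing it to the resource manager. The #SBATCH directives, sbatch, and srun are standard SLURM; the job name, resource sizes, partition defaults, and ./my_app are hypothetical placeholders.

```python
# Minimal sketch: write a SLURM batch script and submit it with sbatch.
# Assumes SLURM is installed and 'sbatch' is on PATH; ./my_app is a placeholder.
import subprocess
import tempfile

job_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --nodes=2                # number of compute nodes
#SBATCH --ntasks-per-node=24     # one task per core on a 24-core node
#SBATCH --time=00:30:00          # wall clock time limit
#SBATCH --output=demo-%j.out     # %j expands to the job id

srun ./my_app                    # launch one instance per allocated task
"""

def submit(script_text: str) -> str:
    """Write the script to a temporary file and hand it to sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        path = f.name
    # sbatch prints e.g. "Submitted batch job 123456"
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit(job_script))
```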


18/62

SLURM architecture

Figure: SLURM architecture (Fig. 1 from the SLURM paper): one slurmctld controller daemon plus a slurmd daemon on every compute node. A job allocated in a partition may run one or more job steps on subsets of its allocated nodes.

slurmd is a multi-threaded daemon running on each compute node, comparable to a remote shell daemon: it reads the common SLURM configuration file and saved state, notifies the controller that it is active, waits for work, executes the work, returns status, and waits for more work. Because it initiates jobs for other users, it must run as root. It asynchronously exchanges node and job status with slurmctld.

Source: slurm.schedmd.com


19/62

Resource management (contd.)

The resource manager is responsible for enforcing resource constraints.

Typical constraints

1. CPU time quota

2. Wall clock time limit

3. File system quota

4. Node set limitations

5. Priority

6. QoS parameters


20/62

Resource management (contd.)

Submitting a new job

1. User issues a job submit command

2. The resource manager (RM) puts the job into a FIFO queue

3. When the resources for the first job in the FIFO are free, the RM allocates the nodes to the head of the queue

Not particularly efficient? Need a scheduler.
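As a toy illustration of why a plain FIFO is inefficient and what a scheduler adds, the sketch below compares FIFO allocation with a simple backfilling pass: later jobs may start early if they fit into currently idle nodes without delaying the estimated start of the queue head. This is not SLURM's actual algorithm (its backfill plugin is considerably more sophisticated), and the workload numbers are made up.

```python
# Toy comparison of plain FIFO scheduling vs. FIFO + simple backfilling
# on a cluster with a fixed number of nodes. Illustration only.
from collections import namedtuple

Job = namedtuple("Job", "name nodes runtime")   # runtime in arbitrary time units

def simulate(jobs, total_nodes, backfill=False):
    """Return {job name: finish time}; jobs stay in submission (FIFO) order."""
    queue = list(jobs)
    running = []              # (finish_time, nodes) pairs
    finish = {}
    t = 0
    while queue or running:
        # Release nodes of jobs that have completed by time t.
        running = [(ft, n) for ft, n in running if ft > t]
        free = total_nodes - sum(n for _, n in running)

        # Plain FIFO: only the head of the queue may start.
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            running.append((t + job.runtime, job.nodes))
            finish[job.name] = t + job.runtime
            free -= job.nodes

        if backfill and queue:
            # Estimate when the blocked head job will have enough free nodes.
            head, head_start, avail = queue[0], t, free
            for ft, n in sorted(running):
                if avail >= head.nodes:
                    break
                avail += n
                head_start = ft
            # Start later, smaller jobs now if they finish before that estimate.
            for job in list(queue[1:]):
                if job.nodes <= free and t + job.runtime <= head_start:
                    queue.remove(job)
                    running.append((t + job.runtime, job.nodes))
                    finish[job.name] = t + job.runtime
                    free -= job.nodes

        # Jump to the next completion time (assumes every job fits on the cluster).
        if running:
            t = min(ft for ft, _ in running)
    return finish

cluster_nodes = 8
workload = [Job("A", 6, 4), Job("B", 8, 2), Job("C", 2, 2), Job("D", 1, 1)]
print("FIFO:     ", simulate(workload, cluster_nodes))
print("Backfill: ", simulate(workload, cluster_nodes, backfill=True))
```

In this example the small jobs C and D finish far earlier under backfilling, while the wide head job B still starts at the same time in both runs.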


21/62

Job scheduling

The job scheduler (JS) reorders jobs in the queue to optimize runtime parameters.

All resources are allocated to a job simultaneously → gang scheduling

Possible goals:

- Maximize system utilization

- Minimize total job waiting time

- Give a job a better-fitting partition


22/62

SLURM as a scheduler

Make a quick decision at each of:

- Job submission

- Job completion

- Configuration change

A decision is based on:

- The current schedule

- Resource requirements

- User parameters (priority)

- Run time limit

- Scheduling policy


24/62

Outlook

How do you write a program that runs on millions of cores?

We have to cope with:

- Non-scalable algorithms

- Faulty hardware

- Unpredictable communication latency

- Power wall


25/62

Scalability

- Parallel computers run parallel programs

- The most famous model for describing the scalability of parallel programs is Amdahl's law:

  Speedup(f, S) = \frac{1}{(1 - f) + \frac{f}{S}}

  f  perfectly parallelizable fraction of the run time
  S  speedup of the parallel fraction

- Assumes a very simplistic computer model
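A quick numerical illustration of the formula (not from the slides): even a small serial fraction caps the achievable speedup.

```python
# Amdahl's law: a fraction f of the run time is perfectly parallelizable
# and is sped up by a factor S; the rest stays serial.
def amdahl_speedup(f: float, S: float) -> float:
    return 1.0 / ((1.0 - f) + f / S)

# With S = 1000, a 5% serial fraction (f = 0.95) already limits the
# overall speedup to roughly 20x.
for f in (0.5, 0.9, 0.95, 0.99):
    print(f"f = {f:.2f}: speedup = {amdahl_speedup(f, S=1000):.1f}")
```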


26/62

Amdahl's law in the Multicore Era [5]

- Nowadays all processors have multiple cores

- The number of cores ranges from two to thousands (GPUs)

- A processor designer has to choose the right granularity

- The number of cores does not translate linearly into performance

  A manycore processor is a device for turning a compute-bound problem into a memory-bound problem.

  — Katherine Yelick


27/62

Cost model for Multicore Chips

- A multicore contains at most n Base Core Equivalents (BCEs)

- n is limited by constraints on power dissipation, area, etc.

- The resources of multiple BCEs can be used to create bigger cores

- r BCEs can be combined into a rich core that is perf(r) times faster

- Always use richer cores when perf(r) > r

- There is a tradeoff when perf(r) < r

- Assume perf(r) = √r


28/62

Symmetric Multicore Chips

[Images from [5] (Hill and Marty, IEEE Computer, July 2008). Figure 1 there shows varieties of multicore chips: (a) a symmetric multicore with 16 one-BCE cores, (b) a symmetric multicore with four four-BCE cores, (c) an asymmetric multicore with one four-BCE core and 12 one-BCE cores. Figure 2 plots the speedup of symmetric, asymmetric, and dynamic multicore chips with n = 16 or n = 256 BCEs for f = 0.5 ... 0.999.]

Figure: Small symmetric cores

Figure: Big symmetric cores

Speedup_symmetric(f, n, r) = \frac{1}{\frac{1 - f}{perf(r)} + \frac{f \cdot r}{perf(r) \cdot n}}

n        total number of BCEs
r        BCEs per rich core
perf(r)  sequential performance of a rich core built from r BCEs
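A short numerical sketch of this model (not from the slides), using the perf(r) = √r assumption from the cost-model slide; the values of f and n are chosen to match the plotted curves in [5].

```python
# Hill/Marty symmetric-multicore speedup under the perf(r) = sqrt(r) assumption.
import math

def perf(r: float) -> float:
    return math.sqrt(r)          # Pollack's rule: r BCEs -> sqrt(r)x faster core

def speedup_symmetric(f: float, n: int, r: int) -> float:
    """n/r identical cores, each built from r BCEs."""
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

# f = 0.975, n = 256: the best core size is neither 1 nor n,
# illustrating the granularity tradeoff.
for r in (1, 4, 16, 64, 256):
    print(f"r = {r:3d}: speedup = {speedup_symmetric(0.975, 256, r):6.1f}")
```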


29/62

Symmetric Multicore Chips (contd.)

Figure: 16-BCE core (speedup of a symmetric chip with n = 16 BCEs vs. r, for f = 0.5 ... 0.999; Figure 2a of [5])

Figure: 256-BCE core (speedup of a symmetric chip with n = 256 BCEs vs. r; Figure 2b of [5])


30/62

Asymmetric Multicore Chips

[Image from [5]: an asymmetric multicore chip, i.e. one rich core built from r BCEs plus n − r one-BCE cores (cf. Figure 1c).]

Speedup_asymmetric(f, n, r) = \frac{1}{\frac{1 - f}{perf(r)} + \frac{f}{perf(r) + n - r}}
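The corresponding sketch for the asymmetric topology (one rich core of r BCEs plus n − r one-BCE cores); perf is repeated so the snippet stands alone, and the assumptions are the same as above.

```python
# Hill/Marty asymmetric-multicore speedup, same perf(r) = sqrt(r) assumption.
import math

def perf(r: float) -> float:
    return math.sqrt(r)

def speedup_asymmetric(f: float, n: int, r: int) -> float:
    """Sequential code runs on the rich core; parallel code uses the rich core
    plus the remaining n - r one-BCE cores."""
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))

# f = 0.975, n = 256: the rich core can be made much larger than in the
# symmetric case before the parallel phase starts to suffer.
for r in (1, 4, 16, 64, 256):
    print(f"r = {r:3d}: speedup = {speedup_asymmetric(0.975, 256, r):6.1f}")
```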


31/62

Asymmetric Multicore Chips (contd.)

Figure: 16-BCE core (speedup of an asymmetric chip with n = 16 BCEs vs. r, for f = 0.5 ... 0.999; Figure 2c of [5])

Figure: 256-BCE core (speedup of an asymmetric chip with n = 256 BCEs vs. r; Figure 2d of [5])


32/62

Dynamic Multicore Chips

[Image from [5], Figure 3: a dynamic multicore chip with 16 one-BCE cores. In sequential mode all BCEs are ganged together into one powerful core; in parallel mode they run as independent one-BCE cores.]

Speedup_dynamic(f, n, r) = \frac{1}{\frac{1 - f}{perf(r)} + \frac{f}{n}}
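And the dynamic topology, where r BCEs are fused for the sequential phase while all n BCEs run in parallel mode; again a self-contained sketch under the same assumptions.

```python
# Hill/Marty dynamic-multicore speedup, same perf(r) = sqrt(r) assumption.
import math

def perf(r: float) -> float:
    return math.sqrt(r)

def speedup_dynamic(f: float, n: int, r: int) -> float:
    """r BCEs fuse into one rich core for sequential code; all n BCEs are
    available as one-BCE cores for parallel code."""
    return 1.0 / ((1.0 - f) / perf(r) + f / n)

# Only the sequential term depends on r, so speedup grows monotonically with r:
# the best dynamic design gangs all n BCEs for the sequential phase.
for r in (1, 16, 256):
    print(f"r = {r:3d}: speedup = {speedup_dynamic(0.975, 256, r):6.1f}")
```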


33/62

Dynamic Multicore Chips (contd.)

Figure: 16-BCE core (speedup of a dynamic chip with n = 16 BCEs vs. r, for f = 0.5 ... 0.999; Figure 2e of [5])

Figure: 256-BCE core (speedup of a dynamic chip with n = 256 BCEs vs. r; Figure 2f of [5])


34/62

Are dynamic multicores realistic?

- Dynamic multicore chips are a great idea

- But once put in silicon, a circuit can't be changed

- FPGAs are not a real solution

- Thread-level speculation, helper threads

- Interestingly, there is another limitation: the power wall

- Will it make dynamic multicores feasible?


35/62

Dark Silicon and the End of Multicore Scaling [3]

Source: karlrupp.net


36/62

Scaling trends

- Single-thread processor performance no longer grows

- The reason is the limited CPU power dissipation

- We can put more transistors on a chip than we can power


37/62

Methodology

Figure 1: Overview of the models and the methodology (from [3]): a device scaling model (DevM), a core scaling model (CorM), and a multicore scaling model (CmpM) are combined to project multicore performance and chip utilization for CPU-like and GPU-like multicores in four topologies (symmetric, asymmetric, dynamic, and composed).


38/62

Combination of models

- Device scaling model (DevM): area, power, and frequency scaling factors at future technology nodes

- Core scaling model (CorM): power/performance and area/performance Pareto frontiers for a single core

- Multicore scaling model (CmpM): Pareto frontiers for applications running on any multicore architecture

- DevM × CorM: performance improvements come only at the cost of area or power

- CmpM × DevM × CorM: maximum multicore speedups for future technology nodes


39/62

Dynamic topology prediction

Figure: Dark silicon — Amdahl's law projections for the dynamic topology, the upper bound of all four topologies (Figure 4 of [3]). The panels show the optimal number of cores, the achievable speedup, and the percentage of dark silicon across technology nodes from 45 nm down to 8 nm, under both conservative and ITRS scaling, for parallel fractions from 0 to 1.


40/62

Manycore processors

- TaihuLight uses 260-core SW26010 processors [2]

- Intel Xeon Phi processors have 64 to 72 cores and 256 to 288 threads

- The current process technology is 14 nm


41/62

Do we experience dark silicon now?

- When the paper was published, "dark silicon" was not an obvious concept

- Two immediate implications:

  1. CPUs can't increase their frequency anymore
  2. CPUs can't run at the highest possible frequency all the time

- Nowadays dark silicon shows up in a more complicated fashion


44/62

Moore’s law in dark silicon era

45 32 22 16 11 8Technology Node (nm)

0

8

16

24

32

Spe

edup

Moore‘s LawTypical CmpMR Upper Bound

(a) Conservative Scaling

45 32 22 16 11 8Technology Node (nm)

0

8

16

24

32

Spe

edup

Moore‘s LawTypical CmpMR Upper Bound

(b) ITRS ScalingFigure 9: Speedup across process technology nodes over all orga-nizations and topologies with PARSEC benchmarks

7.5 SummaryFigure 9 summarizes all the speedup projections in a single scat-

ter plot. For every benchmark at each technology node, we plot theeight possible configurations, (CPU, GPU) × (symmetric, asym-metric, dynamic, composed). The solid curve indicates perfor-mance Moore’s Law or doubling performance with every technol-ogy node. As depicted, due to the power and parallelism limita-tions, a significant gap exists between what is achievable and whatis expected by Moore’s Law. Results for ITRS scaling are slightlybetter but not by much. With conservative scaling a speedup gapof at least 22× exists at the 8 nm technology node compared toMoore’s Law. Assuming ITRS scaling, the gap is at least 13× at8 nm.

7.6 LimitationsOur modeling includes certain limitations, which we argue do

not significantly change the results. To simplify the discussion, wedid not consider SMT support for the processors (cores) in the CPUmulticore organization. SMT support can improve the power effi-ciency of the cores for parallel workloads to some extent. We stud-ied 2-way, 4-way, and 8-way SMT with no area or energy penalty,and observed that speedup improves with 2-way SMT by 1.5× inthe best case and decreases as much as 0.6× in the worst case due toincreased cache contention; the range for 8-way SMT is 0.3-2.5×.

Our GPU methodology may over-estimate the GPU power budget, so we investigated the impact of 10%–50% improved energy efficiency for GPUs and found that total chip speedup and percentage of dark silicon were not impacted.

We ignore the power impact of "uncore" components such as the memory subsystem. There is consensus that the number of these components will increase and hence they will further eat into the power budget, reducing speedups.

We do not consider ARM or Tilera cores in this work because they are designed for different application domains and their SPECmark scores were not available for a meaningful comparison. For highly parallel applications, these lightweight cores may achieve higher speedups, but similar to the GPU case, they will likely be limited by bandwidth and available parallelism.

We acknowledge that we make a number of assumptions in this work to build a useful model. Questions may still linger on the model's accuracy and whether its assumptions contribute to the performance projections that fall well below the ideal 32×. First, in all instances, we selected parameter values that would be favorable towards performance. Second, our validation against real and simulated systems (Section 5.2) shows the model always under-predicts performance.

8. RELATED WORK

Hill and Marty applied Amdahl's Law to a range of multicore topologies, including symmetric, asymmetric, and dynamic multicore designs and conclude dynamic topologies are superior [15]. Their models used area as the primary constraint, using Pollack's rule (Performance ∝ √Area [6]) to estimate performance. Extensions have been developed for modeling 'uncore' components, such as the interconnection network and last-level cache [22], computing core configurations optimal for energy [9, 20], and leakage power [28]. These studies all model power as a function of area (neglecting frequency and voltage's direct effect on power), making power an area-dependent constraint.
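For orientation, the Hill and Marty model referenced above can be written down compactly (a sketch under the usual assumptions: n is the chip's area budget in base-core equivalents, r the area spent per core, f the parallel fraction, and Pollack's rule gives a core performance of \sqrt{r}):

\[ S_{\mathrm{sym}}(f,n,r) = \frac{1}{\dfrac{1-f}{\sqrt{r}} + \dfrac{f\,r}{\sqrt{r}\,n}}, \qquad S_{\mathrm{dyn}}(f,n,r) = \frac{1}{\dfrac{1-f}{\sqrt{r}} + \dfrac{f}{n}}. \]

These are area-only models; the study summarized here differs by decoupling the power constraint from area and by using measured core Pareto frontiers and real application behavior.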

Chakraborty considers device scaling alone and estimates a simultaneous activity factor for technology nodes down to 32 nm [8]. Hempstead et al. introduce a variant of Amdahl's Law to estimate the amount of specialization required to maintain 1.5× performance growth per year, assuming completely parallelizable code [14]. Using ITRS projections, Venkatesh et al. estimate technology-imposed utilization limits and motivate energy-efficient and application-specific core designs [27]. Chung et al. study unconventional cores including custom logic, FPGAs, or GPUs in heterogeneous single-chip design [10]. They rely on Pollack's rule for the area/performance and power/performance tradeoffs. Using ITRS projections, they report on the potential for unconventional cores, considering parallel kernels.

Azizi et al. derive the energy/performance Pareto frontiers for single-core architectures using statistical architectural models combined with circuit-level energy-performance tradeoff functions [2]. Our core model derives these curves using measured data for real processors. Esmaeilzadeh et al. perform a power/energy Pareto efficiency analysis at 45 nm using total chip power measurements in the context of a retrospective workload analysis [12]. In contrast to the total chip power measurements, we use only the power budget allocated to the cores to derive the Pareto frontiers and combine those with our device and chip-level models to study the future of multicore design and the implications of technology scaling.

Previous work largely abstracts away processor organization and application details. This study considers the implications of process technology scaling, decouples power/area constraints, and considers multicore organizations, microarchitectural features, and real applications and their behavior.

9. CONCLUSIONS

For decades, Dennard scaling permitted more transistors, faster transistors, and more energy efficient transistors with each new process node, justifying the enormous costs required to develop each new process node. Dennard scaling's failure led the industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads, permitting the economics of process scaling to hold. But as the benefits of multicore scaling begin to ebb, a new driver of transistor utility must be found, or the economics of process scaling will break and Moore's Law will end well before we hit final manufacturing limits. An essential question is how much more performance can be extracted from the multicore path in the near future.

This paper combined technology scaling models, performance models, and empirical results from parallel workloads to answer that question and estimate the remaining performance available from multicore scaling. Using PARSEC benchmarks and ITRS scaling projections, this study predicts best-case average speedup of 7.9 times between now and 2024 at 8 nm. That result translates into a 16% annual performance gain, for highly parallel workloads and …


Unpredictable hardware

1. Core frequency may change

2. Uncoordinated DVFS is stochastic or unpredictable

3. CPU speed varies across nodes over time!

4. Increasing gap between traditional computer models and reality


The Case of the Missing Supercomputer Performance [6]

TABLE 1: Performance analysis tools and techniques

Technique: measurement
Description: running full applications under various system configurations and measuring their performance
Purpose: determine how well the application actually performs

Technique: microbenchmarking
Description: measuring the performance of primitive components of an application
Purpose: provide insight into application performance

Technique: simulation
Description: running an application or benchmark on a software simulation instead of a physical system
Purpose: examine a series of "what if" scenarios, such as cluster configuration changes

Technique: analytical modeling
Description: devising a parameterized, mathematical model that represents the performance of an application in terms of the performance of processors, nodes, and networks
Purpose: rapidly predict the expected performance of an application on existing or hypothetical machines

performance on the full-sized ASCI Q. The model has been validated on many large-scale systems—including all ASCI systems—with a typical prediction error of less than 10% [10]. The HP ES45 AlphaServer nodes used in ASCI Q actually went through two major upgrades during installation: the PCI bus within the nodes was upgraded from 33 MHz to 66 MHz and the processor speed was upgraded from 1 GHz to 1.25 GHz. The SAGE model was used to provide an expected performance of the ASCI Q nodes in all of these configurations.

The performance of the first 4,096-processor segment of ASCI Q ("QA") was measured in September 2002 and the performance of the second 4,096-processor segment ("QB")—at the time, not physically connected to QA—was measured in November 2002. The results of these two sets of measurements are consistent with each other although they rapidly diverge from the performance predicted by the SAGE performance model, as shown in Figure 1 for weak-scaling (i.e., fixed per-node problem size) operation. At 4,096 processors, the time to process one cycle of SAGE was twice that predicted by the model. This was considered to be a "difference of opinion" between the model and the measurements. Without further analysis it would have been impossible to discern whether the performance model was inaccurate—although it has been validated on many other systems—or whether there was a problem with some aspect of ASCI Q's hardware or software configuration.

[Figure 1: Expected and measured SAGE performance: cycle time (s) vs. number of processors (0–4096); series: Sep-21-02, Nov-25-02, Model.]

MYSTERY #1

SAGE performs significantly worse on ASCI Q than was predicted by our performance model.

In order to identify why there was a difference between the measured and expected performance we performed a battery of tests on ASCI Q. A revealing result came from varying the number of processors per node used to run SAGE. Figure 2 shows the difference between the modeled and the measured performance when using 1, 2, 3, or all 4 processors per node. Note that a log scale is used on the x axis. It can be seen


Model is not that wrong

that the only significant difference occurs when using all four processors per node, thus giving confidence to the model being accurate.

[Figure 2: Difference between modeled and measured SAGE performance (s) when using 1, 2, 3, or 4 processors per node; x-axis: number of processors (log scale, 1–10,000).]

It is also interesting to note that, when using more than 256 nodes, the processing rate of SAGE was actually better when using three processors per node instead of the full four, as shown in Figure 3. Even though 25% fewer processors are used per node, the performance can actually be greater than when using all four processors per node. Furthermore, another crossover occurs at 512 nodes, after which two processors per node also outperform four processors per node.

Like Phillips et al. [17], we also analyzed application performance variability. Each computation cycle within SAGE was configured to perform a constant amount of work and could therefore be expected to take a constant amount of time to complete. We measured the cycle time of 1,000 cycles using 3,584 processors of one of the ASCI Q segments. The ensuing cycle times are shown in Figure 4(a) and a histogram of the variability is shown in Figure 4(b). It is interesting to note that the cycle time ranges from just over 0.7 s to over 3.0 s, indicating greater than a factor of 4 in variability.

[Figure 3: Effective SAGE processing rate (cell-updates/node/s) vs. number of nodes when using 1, 2, 3, or 4 processors per node.]

A profile of the cycle time when using all four processors per node, as shown in Figure 5, reveals a number of important characteristics in the execution of SAGE. The profile was obtained by separating out the time taken in each of the local boundary exchanges (get and put) and the collective-communication operations (allreduce, reduction, and broadcast) on the root processor. The overall cycle time, which includes computation time, is also shown in Figure 5. The time taken in the local boundary exchanges appears to plateau above 500 processors and corresponds exactly to the time predicted by the SAGE performance model. However, the time spent in allreduce and reduction increases with the number of processors and appears to account for the increase in overall cycle time with increasing processor count. It should be noted that the number and payload size in the allreduce operations was constant for all processor counts, and the relative difference between allreduce and reduction (and also broadcast) is due to the difference in their frequency of occurrence within a single cycle.

To summarize, our analysis of SAGE on ASCI Q led us to the following observations:

• There is a significant difference of opinion between the expected performance and that actually observed.

• The performance difference occurs only when using all four processors per node.

• There is a high variability in the performance from cycle to cycle.

• The performance deficit appears to originate from the collective operations, especially allreduce.


Counter-intuitive performance degradation (beyond 256 nodes, three processors per node outperform all four; see Figure 3)


Cycle-time measurements

[Figure 4: SAGE cycle-time measurements on 3,584 processors: (a) variability, measured cycle time (s) per cycle compared with the model; (b) histogram of cycle times (bins from 0.7 s to 2+ s).]

[Figure 5: Profile of SAGE's cycle time: time (s) vs. number of processors for cycle_time, get, put, allreduce, reduction, and broadcast.]

It is therefore natural to deduce that improving the performance of allreduce, especially when using four processors per node, ought to lead to an improvement in application performance. In Section 3 we test this hypothesis.

3 Identification of performance factors

In order to identify why application performance such as that observed on SAGE was not as good as expected, we undertook a number of performance studies. To simplify this process we concerned ourselves with the examination of smaller, individual operations that could be more systematically analyzed. Since it appeared that SAGE was most significantly affected by the performance of the allreduce collective operation, several attempts were made to improve the performance of collectives on the Quadrics network.

3.1 Optimizing the allreduce

Figure 6 shows the performance of the allreduce when executed on an increasing number of nodes. We can clearly see that a problem arises when using all four processors within a node. With up to three processors the allreduce is fully scalable and takes, on average, less than 300 µs. With four processors the latency surges to more than 3 ms. These measurements were obtained on the QB segment of ASCI Q.


Bad scaling


Looking at allreduce

[Figure 6: allreduce latency (ms) as a function of the number of nodes (0–1024) and processes per node (1–4).]

Because using all four processors per node results in unexpectedly poor performance we utilize four processors per node in the rest of our investigation. Figure 7 provides more clues to the source of the performance problem. It shows the performance of the allreduce and barrier in a synthetic parallel benchmark that alternately computes for either 0, 1, or 5 ms then performs either an allreduce or a barrier. In an ideal, scalable system we should see a logarithmic growth with the number of nodes and insensitivity to the computational granularity. Instead, what we see is that the completion time increases with both the number of nodes and the computational granularity. Figure 7 also shows that both allreduce and barrier exhibit similar performance. Given that the barrier is implemented using a simple hardware broadcast whose execution is almost instantaneous (only a few microseconds) and that it reproduces the same problem, we concentrate on a barrier benchmark later in this analysis.

[Figure 7: allreduce and barrier latency (ms) with varying amounts of intervening computation (no computation, 1 ms, and 5 ms granularity) vs. number of nodes (0–1024).]
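A minimal sketch of such a benchmark in MPI C (the structure, constants, and names are illustrative assumptions, not the code used in the study):

#include <mpi.h>
#include <stdio.h>

/* Busy-wait to emulate a computation phase of fixed length (the granularity). */
static void compute_for(double seconds) {
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds)
        ;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const double granularity = 0.001; /* 0, 1, or 5 ms in the experiments above */
    const int iterations = 1000;      /* assumed iteration count */
    double total = 0.0;

    for (int i = 0; i < iterations; i++) {
        compute_for(granularity);
        double t0 = MPI_Wtime();
        int in = 1, out;
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        /* alternatively: MPI_Barrier(MPI_COMM_WORLD); */
        total += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("mean collective latency: %.3f ms\n", 1000.0 * total / iterations);
    MPI_Finalize();
    return 0;
}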

We made several attempts to optimize the allreduce in the four-processor case and were able to substantially improve the performance. To do so, we used a different synchronization mechanism. In the existing implementation the processes in the reduce tree poll while waiting for incoming messages. By changing the synchronization mechanism from always polling to polling for a limited time (100 µs, determined empirically) and then blocking, we were able to improve the latency by a factor of 7.
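The polling-versus-blocking change can be sketched generically as follows (an illustration only, not the Quadrics/MPI implementation; the completion flag and semaphore are hypothetical hooks provided by the communication layer, while the 100 µs threshold is the empirical value quoted above):

#include <stdatomic.h>
#include <semaphore.h>
#include <time.h>

#define POLL_LIMIT_NS 100000.0  /* poll for at most 100 us before blocking */

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* 'done' is set and 'sem' is posted by the communication layer on message arrival. */
void wait_for_message(atomic_int *done, sem_t *sem) {
    double start = now_ns();
    while (!atomic_load(done)) {
        if (now_ns() - start > POLL_LIMIT_NS) {
            sem_wait(sem);  /* give up the processor instead of spinning */
            break;
        }
        /* keep polling: lowest latency while the message is imminent */
    }
}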

At 4,096 processors, SAGE spends over 51% of its time in allreduce. Therefore, a sevenfold speedup in allreduce ought to lead to a 78% performance gain in SAGE. In fact, although extensive testing was performed on the modified collectives, this resulted in only a marginal improvement in application performance.
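The expected 78% follows directly from Amdahl's Law: with a fraction p = 0.51 of the cycle time spent in allreduce and a 7× speedup of that fraction,

\[ S = \frac{1}{(1-p) + p/7} = \frac{1}{0.49 + 0.073} \approx 1.78, \]

i.e., a performance gain of roughly 78%. That this gain did not materialize is precisely what makes the observation a mystery.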

MYSTERY #2

Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.

We can therefore conclude that neither the MPI implementation nor the network are responsible for the performance problems. By process of elimination, we can infer that the source of the performance loss is in the nodes themselves. Technically, it is possible that the performance loss could be caused by the interaction of multiple factors. However, to keep our approach focused we must first investigate each potential source of performance loss individually.

3.2 Analyzing computational noise

Our intuition was that periodic system activities were interfering with application execution. This hypothesis follows from the observation that using all four processors per node results in lower performance than when using fewer processors. Figures 3 and 6 confirm this observation for both SAGE and allreduce performance. System activities can run without interfering with the application as long as there is a spare processor available in each node to absorb them. When there is no spare processor, a processor is temporarily taken from the application to handle the system activity. Doing so


Problem identified

I Allreduce degrades from ≤ 300 µs to 3 ms

I Another synchronization mechanism (polling briefly, then blocking) gave a 7× improvement

I 51% of time in allreduce ⇒ expected 78% improvement

I Not really!

I New assumption: system noise


Microbenchmarking

may introduce performance variability, which we refer to as "noise". Noise can explain why converting from strictly polling-based synchronization to synchronization that uses a combination of polling and blocking substantially improves performance in the allreduce, as observed in Section 3.1.

To determine if system noise is, in fact, the source of SAGE's performance variability, as well, we crafted a simple microbenchmark designed to expose the problems. The microbenchmark works as shown in Figure 8: each node performs a synthetic computation carefully calibrated to run for exactly 1,000 seconds in the absence of noise.

[Figure 8: Performance-variability microbenchmark (processes P1–P4 run the calibrated computation between the START and END markers on a time axis).]
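A sketch of this microbenchmark (illustrative names and calibration constant, not the original code): every process runs a fixed, purely local amount of work and reports how much longer than the nominal 1,000 s it actually took.

#include <mpi.h>
#include <stdio.h>

#define CALIBRATED_ITERS 1000000000ULL /* assumed: tuned so the loop takes 1,000 s when undisturbed */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    volatile double x = 1.0;
    double t0 = MPI_Wtime();
    for (unsigned long long i = 0; i < CALIBRATED_ITERS; i++)
        x = x * 1.0000001 + 1e-9;      /* pure computation: no I/O, messaging, or paging */
    double elapsed = MPI_Wtime() - t0;

    /* slowdown relative to the noiseless 1,000 s run time, in percent */
    printf("rank %d: slowdown %.2f%%\n", rank, (elapsed - 1000.0) / 1000.0 * 100.0);
    MPI_Finalize();
    return 0;
}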

The total normalized run time for the microbenchmark is shown in Figure 9 for all 4,096 processors in QB. Because of interference from noise the actual processing time can be longer and can vary from process to process. However, the measurements indicate that the slowdown experienced by each process is low, with a maximum value of 2.5%. As Section 2 showed a performance slowdown in SAGE of a factor of 2, a mere 2.5% slowdown in the performance-variability microbenchmark appears to contradict our hypothesis that noise is what is causing the high performance variability in SAGE.

MYSTERY #3

Although the "noise" hypothesis could explain SAGE's suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.

[Figure 9: Results of the performance-variability microbenchmark (slowdown in percent, up to 2.5%, for each of the 4,096 processes).]

Sticking to our assumption that noise is somehow responsible for SAGE's performance problems we refined our microbenchmark into the version shown in Figure 10. The new microbenchmark was intended to provide a finer level of detail into the measurements presented in Figure 9. In the new microbenchmark, each node performs 1 million iterations of a synthetic computation, with each iteration carefully calibrated to run for exactly 1 ms in the absence of noise, for an ideal total run time of 1,000 seconds. Using a small granularity, such as 1 ms, is important because many LANL codes exhibit such granularity between communication phases. During the purely computational phase there is no message exchange, I/O, or memory access. As a result, the run time of each iteration should always be 1 ms in a noiseless machine.

[Figure 10: The refined performance-variability microbenchmark (processes P1–P4 repeat a short computation quantum of granularity φ between START and END).]
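The refined microbenchmark can be sketched similarly (again illustrative; the calibration of the 1 ms quantum and the outlier threshold are assumptions): every iteration is timed individually, so noise shows up as delayed iterations instead of being averaged away.

#include <mpi.h>
#include <stdio.h>

#define ITERATIONS 1000000L
#define QUANTUM    0.001      /* nominal 1 ms of pure computation per iteration */

/* assumed: a loop calibrated so one call takes 1 ms on an idle processor */
static void compute_quantum(void) {
    volatile double x = 1.0;
    for (int i = 0; i < 200000; i++)
        x = x * 1.0000001 + 1e-9;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long delayed = 0;
    double lost = 0.0;
    for (long i = 0; i < ITERATIONS; i++) {
        double t0 = MPI_Wtime();
        compute_quantum();
        double dt = MPI_Wtime() - t0;
        if (dt > QUANTUM) {   /* iteration interrupted by system activity */
            delayed++;
            lost += dt - QUANTUM;
        }
    }
    printf("rank %d: %ld delayed iterations, %.3f s lost in total\n", rank, delayed, lost);
    MPI_Finalize();
    return 0;
}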

We ran the microbenchmark on all 4,096 processors of QB. However, the variability results were qualitatively identical to those shown in Figure 9. Our next step was to aggregate the four processor measurements taken on each node, the idea being that system activity can be scheduled arbitrarily on any of the processors in a node. Our hypothesis is that examining noise on a per-node basis may expose structure in what appears to be uncorrelated noise on a per-processor basis. Again, we ran 1 million iterations of the microbenchmark, each with a granularity of 1 ms. At the end of each iteration we measured the actual run time and for each iteration that took more than the …


Noise pattern

[Figure 11: Results of the performance-variability microbenchmark analyzed on a per-node basis (slowdown in percent vs. node number, 0–1024).]
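One way to perform the per-node aggregation today (a sketch using MPI-3's MPI_Comm_split_type, an API that postdates the original study; the function name is hypothetical): sum each process's measured slowdown across the processes that share a node, so that noise absorbed by any of a node's processors becomes visible.

#include <mpi.h>

/* Returns the total slowdown accumulated by all processes on the calling
 * process's node, given this process's own measured slowdown. */
double node_slowdown(double my_slowdown) {
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    double total = 0.0;
    MPI_Allreduce(&my_slowdown, &total, 1, MPI_DOUBLE, MPI_SUM, node_comm);

    MPI_Comm_free(&node_comm);
    return total;
}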


A closer look

[Figure 12: Slowdown per node within each 32-node cluster (nodes 0–31 of the compute nodes; nodes 0, 1, and 31 are annotated).]


Special nodes

Nodes grouped in clusters of 32 nodes:

node 0 – cluster manager

node 1 – quorum node

nodes 2–30 – no special service

node 31 – resource manager monitor


Elementary, my dear Watson

[Figure 14: Illustration of the impact of noise on synchronized computation: (a) uncoordinated noise vs. (b) coscheduled noise; legend: computation, noise, idle time, barrier.]

[Figure 15: Simulated vs. experimental barrier latency (ms) over 0–1024 nodes with progressive exclusion of various sources of noise in the system: measured, model, without node 0, without node 1, without node 31, without nodes 0, 1 and 31, without kernel noise.]

As an initial test of the efficacy of our optimizations we used a simple benchmark in which all nodes repeatedly compute for a fixed amount of time and then synchronize using a global barrier, whose latency is measured. Figure 16 shows the results for three types of computational granularity—0 ms (a simple sequence of barriers without any intervening computation), 1 ms, and 5 ms—and both with the noise-reducing optimizations, as described above, and without, as previously depicted in Figure 7.

We can see that with fine granularity (0 ms) the barrier is 13 times faster. The more realistic tests with 1 and 5 ms, which are closer to the actual granularity of LANL codes, show that the performance is more than doubled. This confirms our conjecture that performance variability is closely related to the noise in the nodes.

[Figure 16: Performance improvements obtained on the barrier-synchronization microbenchmark for different computational granularities: latency (ms) vs. nodes (128–1024), with and without the noise-reducing optimizations; improvements of 13× at 0 ms, 2.5× at 1 ms, and 2.2× at 5 ms.]

Figure 16 shows only that we were able to improve the performance of a microbenchmark. In Section 5 we discuss whether the same performance optimizations can improve the performance of applications, specifically SAGE.

5 SAGE: Optimized performance

Following from the removal of much of the noise induced by the operating system, the performance of SAGE was again analyzed. This was done in two situations, one at the end of January 2003 on a 1,024-node segment of ASCI Q, followed by the performance on the full-sized ASCI Q at the start of May 2003 (after the two individual 1,024-node segments had been connected together). The average cycle time obtained is shown in Figure 17. Note that the performance obtained in September and November 2002 is repeated …


Changes

I Remove daemons (envmod, lpd, insightd, etc.)

I Decrease monitoring frequency

I Relocate necessary daemons from nodes 1 and 2 to node 0


Final gain

[Figure 17: SAGE performance, expected and measured after noise removal: cycle time (s) vs. number of processors (0–8192); series: Sep-21-02, Nov-25-02, Jan-27-03, May-01-03, May-01-03 (min), Model.]


Table of Contents

Introduction

Large Scale Operating System

Cluster architecture
  SLURM

Challenges
  Scalability
  Power wall
  Performance debugging

Conclusion


Summary

I Supercomputer architecture

I Resource manager

I Parallel file system

I Exascale challenges:

  I Scalability
  I Dark silicon
  I Fault tolerance


Conclusion

Questions?

This work is licensed under a Creative Commons "Attribution-ShareAlike 3.0 Unported" license.


Wolfgang Baumann, Guido Laubender, Matthias Lauter, Alexander Reinefeld, Christian Schimmel, Thomas Steinke, Christian Tuma, and Stefan Wollny. HLRN-III at Zuse Institute Berlin. In Jeffrey S. Vetter, editor, Contemporary High Performance Computing: From Petascale toward Exascale, Volume Two, pages 81–114. 2015.

Jack Dongarra. Report on the Sunway TaihuLight system. Technical report, University of Tennessee, June 2016.

Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pages 365–376. IEEE, 2011.

John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 5th edition, 2011.


Mark D. Hill and Michael R. Marty. Amdahl's law in the multicore era. Computer, (7):33–38, 2008.

Fabrizio Petrini, Darren J. Kerbyson, and Scott Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Supercomputing, 2003 ACM/IEEE Conference, pages 55–55. IEEE, 2003.

Maik Schmidt. Bull HPC-Cluster (Taurus). https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/SystemTaurus. Accessed: 24-03-2016.

Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 183–197. ACM, 2015.

Andrew Tanenbaum. Modern Operating Systems. Pearson Education, Inc., 2009.