
Institut für Numerische Mathematik und Optimierung

Introduction to High Performance Computing and Optimization

Oliver Ernst

Audience: 1./3. CMS, 5./7./9. Mm, doctoral students
Wintersemester 2012/13


Contents

1. Introduction

2. Processor Architecture

3. Optimization of Serial Code
   3.1 Performance Measurement
   3.2 Optimization Guidelines
   3.3 Compiler-Aided Optimization
   3.4 Combine example

Oliver Ernst (INMO) HPC Wintersemester 2012/13 1


Contents

1. Introduction

2. Processor Architecture

3. Optimization of Serial Code

Oliver Ernst (INMO) HPC Wintersemester 2012/13 16

Processor Architecture: Von Neumann Architecture

John von Neumann (1903–1957), Hungarian-American mathematician and computer science pioneer.
“First Draft of a Report on the EDVAC” (1945): computer design based on a stored program, building on previous work by J. P. Eckert and J. W. Mauchly (U Pennsylvania ENIAC project) and earlier theoretical work by A. Turing.
Essentially all electronic digital computers are based on this model.
Von Neumann bottleneck: manipulation of data in memory only via traffic to the ALU; the width of this data path constrains all computing throughput. (John Backus, 1977)
Inherently sequential architecture.

[Diagram: Memory connected to the CPU (Control Unit and Arithmetic Logical Unit (ALU)), with Input and Output.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 17

Processor Architecture: Current Microprocessors

Extremely complex manufactured devices.
Feature size currently at 22 nm and decreasing.
Transistor count ≈ 1.4 billion on 160 mm².
Fortunately: it is enough to understand the basic schematic workings of modern microprocessors.

Intel Westmere die shot

Intel Ivy Bridge die labelling

Oliver Ernst (INMO) HPC Wintersemester 2012/13 18

Processor Architecture: Microprocessor block diagram

[Block diagram: memory interface to main memory; L2 unified cache; L1 data and L1 instruction caches; memory queue and INT/FP queues; INT and FP register files; functional units LD, ST, INT op, mask/shift, FP mult, FP add.]

Figure 1.2: Simplified block diagram of a typical cache-based microprocessor (one core). Other cores on the same chip or package (socket) can share resources like caches or the memory interface. The functional blocks and data paths most relevant to performance issues in scientific computing are highlighted.

and make up for only a very small fraction of the chip area. The rest consists of administrative logic that helps to feed those units with operands. CPU registers, which are generally divided into floating-point and integer (or “general purpose”) varieties, can hold operands to be accessed by instructions with no significant delay; in some architectures, all operands for arithmetic operations must reside in registers. Typical CPUs nowadays have between 16 and 128 user-visible registers of both kinds. Load (LD) and store (ST) units handle instructions that transfer data to and from registers. Instructions are sorted into several queues, waiting to be executed, probably not in the order they were issued (see below). Finally, caches hold data and instructions to be (re-)used soon. The major part of the chip area is usually occupied by caches.

A lot of additional logic, i.e., branch prediction, reorder buffers, data shortcuts, transaction queues, etc., that we cannot touch upon here is built into modern processors. Vendors provide extensive documentation about those details [V104, V105, V106]. During the last decade, multicore processors have superseded the traditional single-core designs. In a multicore chip, several processors (“cores”) execute code concurrently. They can share resources like memory interfaces or caches to varying degrees; see Section 1.4 for details.

1.2.1 Performance metrics and benchmarks

All the components of a CPU core can operate at some maximum speed called peak performance. Whether this limit can be reached with a specific application code depends on many factors and is one of the key topics of Chapter 3. Here we introduce some basic performance metrics that can quantify the “speed” of a CPU. Scientific computing tends to be quite centric to floating-point data, usually with “double precision” …

source: Hager & Wellein

Arithmetic units for floating point (FP) and integer (INT) operations.
CPU registers (FP and general-purpose).
Load (LD) and store (ST) units for transferring operands to/from registers.
Instructions sorted into queues.
Caches hold data and instructions.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 19

Processor Architecture: Performance

For scientific computing, performance is typically measured in floating point operations per second, i.e.,

  (# floating point operations) / runtime.

Unit: FLOPS, FLOP/s, Flop/sec, …

What constitutes a FLOP?
  IEEE double precision floating point (FP) number (64 bits).
  FP add or FP multiply.
  Division, square roots etc. take several cycles.

Peak performance is defined as

  [max # floating point operations per cycle] × clock rate [Hz] × # cores × # sockets × # nodes

Question: what is the peak performance of klio?

Oliver Ernst (INMO) HPC Wintersemester 2012/13 20

Processor Architecture: Example: Intel Xeon 5160 (Woodcrest, June 2006)

Architecture: Intel 64.
Microarchitecture: Core (successor to Netburst); first CPU with this microarchitecture, server/workstation version of the Intel Core 2 processor.
65 nm manufacturing process technology, socket LGA771.
Dual core, 2 sockets, total of 4 cores.
Clock frequency: 3 GHz.
Each core of Woodcrest can perform 4 Flops in each clock cycle.

Woodcrest die shot (source: Intel)

Peak performance:

4 Flops × 2 cores × 2 sockets × 3 GHz = 48 GFlops
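This product is easy to script; a minimal C sketch of the same arithmetic (the parameter values are those of the Woodcrest example above, and all variable names are illustrative):

#include <stdio.h>

int main(void) {
    /* peak = flops per cycle x clock rate x cores x sockets (x nodes) */
    double flops_per_cycle  = 4.0;    /* Woodcrest: 4 Flops per cycle per core */
    double clock_hz         = 3.0e9;  /* 3 GHz */
    int    cores_per_socket = 2, sockets = 2, nodes = 1;

    double peak = flops_per_cycle * clock_hz * cores_per_socket * sockets * nodes;
    printf("Peak performance: %.1f GFlops\n", peak / 1e9);  /* prints 48.0 */
    return 0;
}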

But: higher rates are possible using SIMD instructions (MMX, SSE, AVX). More later.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 21

Processor Architecture: Some definitions

Architecture: the instruction set of the CPU, also called instruction set architecture (ISA). The parts of a processor design that one needs to understand to write assembly code. Examples of ISAs:
  Intel Architectures (IA): IA32/x86, Intel 64/EM64T, IA64
  MIPS (SGI)
  POWER (IBM)
  SPARC (Sun)
  ARM (Acorn)

Microarchitecture: implementation of the ISA; invisible features such as caches, cache structure, CPU cycle time, details of the virtual memory system.

Process technology: the size of the physical features (such as transistors) that make up the processor. Roughly: smaller is better, due to lower power consumption and more chips per silicon wafer in production.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 22

Processor Architecture: Intel architecture/microarchitecture/process roadmap

Intel tick-tock schedule
Tick: die shrink, i.e., new process technology
Tock: new microarchitecture

Microarchitecture   Processor codename (server)   Process technology   Date introduced
Core                Woodcrest/Clovertown          65 nm                06/2006
                    Dunnington/Harpertown         45 nm                11/2007
Nehalem             Nehalem                       45 nm                11/2008
                    Westmere                      32 nm                01/2010
Sandy Bridge        Sandy Bridge                  32 nm                01/2011
                    Ivy Bridge                    22 nm                04/2012
Haswell             Haswell                       22 nm                03/2013 (?)
                    Broadwell                     14 nm
Skylake             Skylake                       14 nm
                    Skymont                       10 nm

Oliver Ernst (INMO) HPC Wintersemester 2012/13 23

Processor Architecture: Modern processor features

Pipelined instructions: separate complex instructions into simpler ones which are executed by different functional units in an overlapping fashion; increases throughput; an example of instruction level parallelism (ILP).
Superscalar architecture: multiple functional units operating concurrently.
SIMD instructions (Single Instruction, Multiple Data): one instruction operates on a vector of data simultaneously. (Examples: Intel's SSE, AMD's 3dNow!, Power/PowerPC AltiVec.)
Out-of-order execution: when instruction operands are not available in registers, execute the next possible instruction(s) to avoid idle time (eligible instructions are held in a reorder buffer).
Caches: small, fast, on-chip buffers for holding data which has recently been used (temporal locality) or is close (in memory) to data which has recently been used (spatial locality).
Simplified instruction sets: Reduced Instruction Set Computers (RISC, 1980s), in contrast with CISC; simple instructions executing in few clock cycles; allowed higher clock rates and freed up transistors; x86 processors translate to “µ-ops” on the fly.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 24

Processor Architecture: Pipelining: Example

Pipelining, it's natural. [D. Patterson, UC Berkeley]

Four students (Anoop, Brian, Christine & Djamal) doing laundry, one load each.
Washing takes 30 minutes.
Drying takes 40 minutes.
Folding takes 20 minutes.


Each complete laundry load, done sequentially, takes 90 minutes.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 25

Processor Architecture: Pipelining: Example

Sequential laundry

[Embedded figure (D. Patterson): timeline of sequential laundry from 6 PM to midnight; each of the four loads runs 30 + 40 + 20 minutes, back to back. Sequential laundry takes 6 hours for 4 loads.]

4 loads in 6 hours. How long would the pipelined laundry take?

Oliver Ernst (INMO) HPC Wintersemester 2012/13 26

Processor Architecture: Pipelining: Example

Pipelined laundry

[Embedded figure (D. Patterson): timeline of pipelined laundry starting at 6 PM; work starts as soon as possible, stage pattern 30, 40, 40, 40, 40, 20 minutes. Pipelined laundry takes 3.5 hours for 4 loads.]

Start each task as soon as a functional unit is available. Takes only 3.5 hours.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 27

Processor Architecture: Pipelining: Example

Pipelining lessons

Pipelining doesn't help the latency of individual tasks; it helps the throughput of the entire workload.
The pipeline is limited by its slowest stage.
Multiple stages operate concurrently.
Potential speedup = number of pipeline stages.
Unbalanced lengths of pipeline stages reduce the speedup.
Time to “fill” (start up) the pipeline and time to “drain” (wind down) it reduce the speedup, especially if there are few loads relative to the # stages.

[Embedded slide (D. Patterson, DAP.F96): the same pipelining lessons, shown next to the pipelined laundry timeline (6 PM to 9 PM, stages 30, 40, 40, 40, 40, 20 minutes).]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 28

Processor Architecture: Pipelining in computers

Pipelining is the basis of vector processors.
Greatest source of ILP today.
Invisible to the programmer.
Typical pipeline stages on a CPU (MIPS ISA):
  (1) Fetch instruction from memory
  (2) Read registers while decoding the instruction
  (3) Execute the operation or calculate an address
  (4) Access an operand in data memory
  (5) Write the result into a register
Helpful: machine instructions of equal length (x86 instructions vary from 1 to 17 bytes).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 29

Processor Architecture: Pipelining in computers

Pipelined floating point multiply (FPM)

[Figure: five pipeline stages (separate mantissa/exponent, multiply mantissas, add exponents, normalize result, insert sign) processing operands B(i), C(i) over cycles 1 … N+4, with wind-up and wind-down phases.]

Figure 1.5: Timeline for a simplified floating-point multiplication pipeline that executes A(:)=B(:)*C(:). One result is generated on each cycle after a four-cycle wind-up phase.

Moore's Law promises a steady growth in transistor count, but more complexity does not automatically translate into more efficiency: On the contrary, the more functional units are crammed into a CPU, the higher the probability that the “average” code will not be able to use them, because the number of independent instructions in a sequential instruction stream is limited. Moreover, a steady increase in clock frequencies is required to keep the single-core performance on par with Moore's Law. However, a faster clock boosts power dissipation, making idling transistors even more useless.

In search for a way out of this power-performance dilemma there have been some attempts to simplify processor designs by giving up some architectural complexity in favor of more straightforward ideas. Using the additional transistors for larger caches is one option, but again there is a limit beyond which a larger cache will not pay off any more in terms of performance. Multicore processors, i.e., several CPU cores on a single die or socket, are the solution chosen by all major manufacturers today. Section 1.4 below will shed some light on this development.

1.2.3 Pipelining

Pipelining in microprocessors serves the same purpose as assembly lines in manufacturing: Workers (functional units) do not have to know all details about the final product but can be highly skilled and specialized for a single task. Each worker executes the same chore over and over again on different objects, handing the half-finished product to the next worker in line. If it takes m different steps to finish the product, m products are continually worked on in different stages of completion. If all tasks are carefully tuned to take the same amount of time (the “time step”), all workers are continuously busy. At the end, one finished product per time step leaves the assembly line.

Complex operations like loading and storing data or performing floating-point arithmetic cannot be executed in a single cycle without excessive hardware requirements …

source: Hager & Wellein

Timeline for a simplified floating-point multiplication pipeline executing A(:)=B(:)*C(:). One result is generated on each cycle after a four-cycle “filling” phase.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 30

Processor Architecture: Pipelining hazards

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away; a washer/dryer combination).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock).
Control hazards: pipelining of branches and other instructions that change the program counter; an attempt to make a decision before the condition is evaluated (washing football uniforms and needing the proper detergent level: need to see after the dryer before the next load goes in); branch instructions.

Common solution: stall the pipeline until the hazard is resolved, inserting one or more “bubbles” into the pipeline.
Alternative solution to the control hazard problem: branch prediction.
Third solution: delayed decision.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 31

Processor Architecture: Pipeline parameters

m-stage pipeline processing N tasks:

Speedup:

  T_seq/T_pipe = N m / (N + m − 1) = m / (1 + (m − 1)/N)  →  m   (N → ∞).

Throughput:

  N/T_pipe = N / (N + m − 1) = 1 / (1 + (m − 1)/N)  →  1   (N → ∞).

Given m, how large must N be to obtain α results per cycle (α ∈ (0, 1])?

  α = 1 / (1 + (m − 1)/N_α)  ⇔  N_α = (m − 1) / (1/α − 1) = (m − 1) α / (1 − α).

Often quoted: N_1/2 = m − 1.
Typical for current microprocessors: m = 10–35.
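These formulas are easy to tabulate; a minimal C sketch using the slide's notation (m stages, N tasks, target rate alpha; the values below are arbitrary examples):

#include <stdio.h>

/* speedup of an m-stage pipeline over sequential execution for N tasks */
double speedup(int m, int N)    { return (double)N * m / (N + m - 1); }

/* results per cycle (throughput) for N tasks on an m-stage pipeline */
double throughput(int m, int N) { return (double)N / (N + m - 1); }

/* number of tasks needed to reach a fraction alpha of one result per cycle */
double n_alpha(int m, double alpha) { return (m - 1) * alpha / (1.0 - alpha); }

int main(void) {
    int m = 10;
    for (int N = 1; N <= 1000; N *= 10)
        printf("m=%d N=%4d  speedup=%5.2f  throughput=%.3f\n",
               m, N, speedup(m, N), throughput(m, N));
    printf("N needed for alpha=0.9: %.1f  (N_1/2 = m-1 = %d)\n",
           n_alpha(m, 0.9), m - 1);
    return 0;
}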

Oliver Ernst (INMO) HPC Wintersemester 2012/13 32

Processor Architecture: Pipeline parameters

[Plot: pipeline throughput N/T_pipe (0 to 1) versus number of tasks N (10^0 to 10^3, logarithmic axis) for m = 5, 10, 30, 100.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 33

Processor Architecture: Superscalar processors

Design features enabling the generation of multiple results per cycle:
Fetch and decode of multiple instructions in one cycle (currently 3–6).
Integer operations (arithmetic, addressing) done in multiple units for add, mult, shift, mask etc. (currently 2–6).
Floating-point operations on multiple units (add, mult); additionally, fused multiply-add (FMA) pipelines can perform a ← b + c ∗ d in one cycle.
Fast caches able to sustain enough loads/stores per cycle to feed these units.
Yet another form of ILP.
Needs support from out-of-order execution and the compiler; to get more than 2–3 instructions/cycle, assembly coding is often necessary.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 34

Processor Architecture: Vector extensions (SIMD)

Flynn's taxonomy: classification of computer architectures, Michael J. Flynn (1966)

                  Single Instruction   Multiple Instruction
  Single Data     SISD                 MISD
  Multiple Data   SIMD                 MIMD

SIMD first arose with vector supercomputers (CDC, Cray, Fujitsu).
Again in the first “massively parallel” supercomputers (Connection Machine).
Wide desktop deployment with x86 MMX extensions in 1996.
On current cache-based microprocessors: smaller scale; concurrent execution of arithmetic operations on wide registers holding, e.g., 2 DP or 4 SP floating-point operands (see the sketch below).
Carried to extremes by GPUs.
Note: sustained cache/memory bandwidth is necessary to feed the SIMD units.
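To make “one instruction, several operands” concrete, here is a minimal sketch using SSE2 intrinsics, which operate on 128-bit registers holding 2 DP operands; it assumes an x86 compiler providing <emmintrin.h> and is not taken from the lecture:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    double b[2] = {1.0, 2.0}, c[2] = {10.0, 20.0}, a[2];

    __m128d vb = _mm_loadu_pd(b);     /* load two doubles into one 128-bit register */
    __m128d vc = _mm_loadu_pd(c);
    __m128d va = _mm_add_pd(vb, vc);  /* one instruction adds both pairs at once */
    _mm_storeu_pd(a, va);

    printf("a = {%g, %g}\n", a[0], a[1]);  /* {11, 22} */
    return 0;
}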

Oliver Ernst (INMO) HPC Wintersemester 2012/13 35

Processor Architecture: Vector extensions x86

1978: Intel 8086 architecture introduced; 16 bit, dedicated registers.
1980: Intel 8087 floating-point coprocessor introduced; added ≈ 60 FP instructions. Stack in place of registers.
1997: Pentium and Pentium Pro architectures expanded with Multi Media Extensions (MMX). 57 new instructions, use the FP stack to accelerate multimedia/communication applications.
1999: Another 70 instructions, labeled Streaming SIMD Extensions (SSE); eight new separate 128-bit wide registers; new 32-bit SP data type.
2001: Yet another 144 instructions, SSE2; new 64-bit DP data type; compilers can choose between the stack and 8 SSE registers for FP.
2004: SSE3, 13 new instructions; complex arithmetic, video encoding, FP conversion etc.
2006: SSE4, 54 new instructions.
2008: Advanced Vector Extensions (AVX); expands SSE registers from 128 to 256 bits; redefines ≈ 250 instructions, adds 128 more.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 36

Processor Architecture: Memory Hierarchy

Ideal: unlimited amount of immediately accessible storage for data/instructions.
Reality: the memory bottleneck of the von Neumann computer.

Figure 2.2: Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access, is plotted over time. Note that the vertical axis must be on a logarithmic scale to record the size of the processor–DRAM performance gap. The memory baseline is 64 KB DRAM in 1980, with a 1.07 per year performance improvement in latency (see Figure 2.13 on page 99). The processor line assumes a 1.25 improvement per year until 1986, a 1.52 improvement until 2000, a 1.20 improvement between 2000 and 2005, and no change in processor performance (on a per-core basis) between 2005 and 2010; see Figure 1.1 in Chapter 1.

source: Patterson & Hennessy

DRAM gap: development of the average time between memory accesses of a single processor/core (top) and the latency of a DRAM access (bottom).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 37

Processor Architecture: Memory Hierarchy

The fact that this gap can be tolerated, that computers can give us the illusion of unlimited fast memory, is based on the locality of memory references:

temporal: data recently used will likely be used again soon.
spatial: data close (in address space) to data recently used will likely be used soon.

Examples:
Instructions in memory are accessed sequentially (except for branches), exhibiting spatial locality.
Loops access instructions repeatedly, exhibiting temporal locality.
Arrays are typically traversed sequentially (spatial locality). But: 2D and 3D grids must involve jumps in the linear address space.

Note: Locality of references is a property of software the programmer can (must) influence.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 38

Processor Architecture: Memory Hierarchy

Consequence: Computer system memory is organized in a hierarchy.

Typical memory hierarchy (smaller, faster, costlier per byte at the top; larger, slower, cheaper per byte at the bottom):

L0: registers (hold words retrieved from the L1 cache)
L1: on-chip L1 cache (SRAM; holds cache lines retrieved from the L2 cache)
L2: on-chip L2 cache (SRAM; holds cache lines retrieved from main memory)
L3: main memory (DRAM; holds disk blocks retrieved from local disks)
L4: local secondary storage (local disks; hold files retrieved from disks on remote network servers)
L5: remote secondary storage (tapes, distributed file systems, Web servers)

source: M. Püschel, ETH Zürich

Oliver Ernst (INMO) HPC Wintersemester 2012/13 39

Processor Architecture: Memory Hierarchy

Typical access times and sizes:

Level          Speed    Size
Register, L1   1 ns     KB
L2             10 ns    MB
Main memory    100 ns   GB
Disk           10 ms    TB
Tape           10 s     PB

Oliver Ernst (INMO) HPC Wintersemester 2012/13 40

Processor Architecture: Caches and their terminology

Caches: (one or more) levels of memory between processor and main memory.
unified caches: store both data and instructions (typically L2 or higher); L1 is split into instruction (L1I) and data (L1D) caches.
block or line: smallest unit of data which can be present/absent in the cache. All data transfers occur in multiples of this unit.
cache hit: requested data found in the cache, fast access.
cache miss: requested data not found in the cache, need to access lower memory levels, slower access.
hit rate: fraction of requests found in the cache.
miss rate: 1 − hit rate.
When the cache is full, the next load operation must evict a resident cache line.
Writing: when a cache line is modified, two strategies. Write-through caches immediately update the corresponding location in memory. Write-back caches only update the copy in the cache; memory is not updated until the cache line is about to be replaced. Both strategies can use a write buffer.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 41

Processor Architecture: Simple cache performance model

β: cache reuse ratio, i.e., fraction of loads/stores resulting in a cache hit due to a recent load/store of the surrounding cache line.
T_mem: access time (latency + bandwidth) to main memory.
T_c: access time for a cache hit.
τ := T_mem/T_c: relative performance penalty of a cache miss.

Average access time:

  T_av = β T_c + (1 − β) T_mem

Performance gain:

  G(τ, β) = T_mem/T_av = τ T_c / (β T_c + (1 − β) T_mem) = τ / (β + τ (1 − β))

[Plot: performance gain G(τ, β) versus reuse ratio β ∈ [0.4, 1] for τ = 5, 10, 50.]
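A minimal C sketch of this model (T_c and T_mem are illustrative values in arbitrary time units):

#include <stdio.h>

int main(void) {
    double Tc = 1.0, Tmem = 10.0;   /* cache-hit vs. main-memory access time */
    double tau = Tmem / Tc;         /* relative cache miss penalty */

    for (double beta = 0.4; beta <= 1.0 + 1e-9; beta += 0.1) {
        double Tav = beta * Tc + (1.0 - beta) * Tmem;    /* average access time */
        double G   = tau / (beta + tau * (1.0 - beta));  /* performance gain, same as Tmem/Tav */
        printf("beta=%.1f  Tav=%5.2f  G=%5.2f\n", beta, Tav, G);
    }
    return 0;
}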

Oliver Ernst (INMO) HPC Wintersemester 2012/13 42

Processor Architecture: Cache lines and spatial locality

Typical behavior of scientific codes: traverse a large data set (read, modify, write). No temporal locality.
By loading an entire cache line, memory latency is only encountered on the line load.
Subsequent requests to nearby memory locations are serviced by the cache.
For code exhibiting spatial locality, loading data cache line-wise increases the cache hit rate.
Example: streaming through memory sequentially with a cache line of length 16:

  hit rate γ = (# cache hits) / (# references) = 15/16 ≈ 0.94.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 43

Processor Architecture: Caches and memory/compute bound programs

We define the operational intensity I of a program/algorithm as

  I = n_op / n_trans := (# operations) / (amount of data transferred between cache and RAM)

Programs with high I are called compute bound, those with low I are called memory bound.

Trivial bound: n_trans ≥ n_io := size(input data) + size(output data), therefore

  I ≤ n_op / n_io

Examples:
  Vector add x ← x + y:  I ≤ n/(2n) = O(1).
  Matrix-matrix multiply (MMM) C ← C + AB:  I ≤ 2n³/(3n²) = O(n).
  Fast Fourier Transform (FFT) y = fft(x):  I ≤ (5n log₂ n)/(2n) = O(log n).
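The bounds can be evaluated for a concrete problem size; a minimal C sketch (n = 1024 is an arbitrary example):

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 1024.0;  /* problem size (arbitrary example) */

    /* upper bounds on operational intensity I <= n_op / n_io from the slide */
    double I_vadd = n / (2.0 * n);                   /* vector add: O(1) */
    double I_mmm  = 2.0 * n * n * n / (3.0 * n * n); /* matrix-matrix multiply: O(n) */
    double I_fft  = 5.0 * n * log2(n) / (2.0 * n);   /* FFT: O(log n) */

    printf("n=%g: I_vadd <= %.2f, I_mmm <= %.2f, I_fft <= %.2f\n",
           n, I_vadd, I_mmm, I_fft);
    return 0;
}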

Oliver Ernst (INMO) HPC Wintersemester 2012/13 44

Processor Architecture: Cache mapping

Where in the cache is a cache line placed?

fully associative cache: anywhere there's room. Expensive to build (search logic).
direct-mapped cache: unique location for each cache line. A line may be evicted if a new load maps to the same location, even though the cache is not full. Cache thrashing: when this happens in rapid succession.
E-way set-associative cache: anywhere in one of E possible positions within a uniquely assigned set. The set is chosen using s consecutive bits from the middle of an address.

More precisely: assume memory is addressed using m bits, yielding M = 2^m unique addresses. Partition the m address bits into (most to least significant)

  t tag bits, s set index bits, b block offset bits   (m = t + s + b).

The cache then has S = 2^s sets, each cache line containing B = 2^b words (Bytes) of data. With E cache lines per set, this organization is summarized by the notation (S, E, B, m).
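A minimal C sketch of the (S, E, B, m) address split; the choices b = 6 and s = 6 (64-byte lines, 64 sets) and the sample address are illustrative assumptions:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    unsigned b = 6;   /* block offset bits: B = 2^6 = 64-byte cache lines */
    unsigned s = 6;   /* set index bits:    S = 2^6 = 64 sets             */

    uint64_t addr   = 0x7ffe12345678ULL;                 /* some memory address   */
    uint64_t offset = addr & ((1ULL << b) - 1);          /* lowest b bits         */
    uint64_t set    = (addr >> b) & ((1ULL << s) - 1);   /* next s bits           */
    uint64_t tag    = addr >> (b + s);                   /* remaining t = m - s - b bits */

    printf("addr=0x%llx -> tag=0x%llx set=%llu offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}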

Oliver Ernst (INMO) HPC Wintersemester 2012/13 45

Processor Architecture: Cache mapping

Figure 1.10: In a direct-mapped cache, memory locations which lie a multiple of the cache size apart are mapped to the same cache line (shaded boxes).

The most straightforward simplification of this expensive scheme consists in a direct-mapped cache, which maps the full cache size repeatedly into memory (see Figure 1.10). Memory locations that lie a multiple of the cache size apart are always mapped to the same cache line, and the cache line that corresponds to some address can be obtained very quickly by masking out the most significant bits. Moreover, an algorithm to select which cache line to evict is pointless. No hardware and no clock cycles need to be spent for it.

The downside of a direct-mapped cache is that it is disposed toward cache thrashing, which means that cache lines are loaded into and evicted from the cache in rapid succession. This happens when an application uses many memory locations that get mapped to the same cache line. A simple example would be a “strided” vector triad code for DP data, which is obtained by modifying the inner loop as follows:

  do i=1,N,CACHE_SIZE_IN_BYTES/8
    A(i) = B(i) + C(i) * D(i)
  enddo

By using the cache size in units of DP words as a stride, successive loop iterations hit the same cache line so that every memory access generates a cache miss, even though a whole line is loaded every time. In principle there is plenty of room left in the cache, so this kind of situation is called a conflict miss. If the stride were equal to the line length there would still be some (albeit small) N for which the cache reuse is 100%. Here, the reuse fraction is exactly zero no matter how small N may be.

To keep administrative overhead low and still reduce the danger of conflict misses and cache thrashing, a set-associative cache is divided into m direct-mapped caches equal in size, so-called ways. The number of ways m is the number of different cache lines a memory address can be mapped to (see Figure 1.11 for an example of a two-way set-associative cache). On each memory access, the hardware merely has to determine which way the data resides in or, in the case of a miss, to which of the m possible cache lines it should be loaded.

Figure 1.11: In an m-way set-associative cache, memory locations which are located a multiple of 1/m-th of the cache size apart can be mapped to either of m cache lines (here shown for m = 2).

For each cache level the tradeoff between low latency and prevention of thrashing must be considered by processor designers. Innermost (L1) caches tend to be less set-associative than outer cache levels. Nowadays, set-associativity varies between two- and 48-way. Still, the effective cache size, i.e., the part of the cache that is actually useful for exploiting spatial and temporal locality in an application code, could be quite small, depending on the number of data streams, their strides and mutual offsets. See Chapter 3 for examples.

1.3.3 Prefetch

Although exploiting spatial locality by the introduction of cache lines improves cache efficiency a lot, there is still the problem of latency on the first miss. Figure 1.12 visualizes the situation for a simple vector norm kernel:

  do i=1,N
    S = S + A(i)*A(i)
  enddo

There is only one load stream in this code. Assuming a cache line length of four elements, three loads can be satisfied from cache before another miss occurs. The long latency leads to long phases of inactivity on the memory bus.

source: Hager & Wellein

Direct-mapped (left) and 2-way set-associative (right) caches. Shading indicates the mapping of lines in memory to their assigned locations in the cache.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 46

Processor Architecture: Cache miss taxonomy

Reasons for cache misses (the three Cs):
Compulsory: first access to this line; would occur even for a cache of infinite size.
Capacity: a previously resident cache line was evicted because the cache was full.
Conflict: a previously resident cache line was evicted because its set was full.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 47

Processor Architecture: Prefetching

Latency on first access (miss), example:

  for (i=0; i<N; i++) s = s + a[i]*a[i];

One load stream.
Assume a cache line size of 4 elements ⇒ 3 cache hits before the next miss.
Between these the memory bus is inactive.

[Figure: timing diagram over iterations 1–7 showing loads (LD), cache-miss latency phases, and "use data" phases.]

Figure 1.12: Timing diagram on the influence of cache misses and subsequent latency penalties for a vector norm loop. The penalty occurs on each new miss.

Making the lines very long will help, but will also slow down applications with erratic access patterns even more. As a compromise one has arrived at typical cache line lengths between 64 and 128 bytes (8–16 DP words). This is by far not big enough to get around latency, and streaming applications would suffer not only from insufficient bandwidth but also from low memory bus utilization. Assuming a typical commodity system with a memory latency of 50 ns and a bandwidth of 10 GBytes/sec, a single 128-byte cache line transfer takes 13 ns, so 80% of the potential bus bandwidth is unused. Latency has thus an even more severe impact on performance than bandwidth.

The latency problem can be solved in many cases, however, by prefetching. Prefetching supplies the cache with data ahead of the actual requirements of an application. The compiler can do this by interleaving special instructions with the software pipelined instruction stream that “touch” cache lines early enough to give the hardware time to load them into cache asynchronously (see Figure 1.13). This assumes there is the potential of asynchronous memory operations, a prerequisite that is to some extent true for current architectures. As an alternative, some processors feature a hardware prefetcher that can detect regular access patterns and tries to read ahead application data, keeping up the continuous data stream and hence serving the same purpose as prefetch instructions. Whichever strategy is used, it must be emphasized that prefetching requires resources that are limited by design. The memory subsystem must be able to sustain a certain number of outstanding prefetch operations, i.e., pending prefetch requests, or else the memory pipeline will stall and latency cannot be hidden completely. We can estimate the number of outstanding prefetches required for hiding the latency completely: If T_ℓ is the latency and B is the bandwidth, the transfer of a whole line of length L_c (in bytes) takes a time of

  T = T_ℓ + L_c/B.   (1.5)

One prefetch operation must be initiated per cache line transfer, and the number of cache lines that can be transferred during time T is the number of prefetches P that

source: Hager & Wellein

Cache misses, latency penalty and inactive memory bus.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 48

Processor Architecture: Prefetching

Prefetching: initiate the cache line load sufficiently ahead of its use.

Can be done by the compiler or by hardware.
Requires that the memory system can sustain sufficiently many outstanding prefetch operations.
Number of outstanding prefetches required to hide latency: the time to transfer a cache line of length L_c, given the latency of the memory system T_ℓ and bandwidth B, is

  T = T_ℓ + L_c/B.

One prefetch operation per cache line transfer. The number of prefetches needed is the number of cache lines which can be moved in time T:

  P = T/(L_c/B) = 1 + T_ℓ/(L_c/B).
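Plugging in the example numbers quoted from Hager & Wellein on the next slide (T_ℓ = 50 ns, B = 10 GBytes/sec, L_c = 128 bytes), a minimal C sketch:

#include <stdio.h>

int main(void) {
    double Tl = 50e-9;   /* memory latency T_l: 50 ns (example values from Hager & Wellein) */
    double B  = 10e9;    /* bandwidth: 10 GBytes/sec */
    double Lc = 128.0;   /* cache line length: 128 bytes */

    double T = Tl + Lc / B;    /* time to transfer one cache line */
    double P = T / (Lc / B);   /* = 1 + Tl/(Lc/B): outstanding prefetches needed */

    printf("line transfer time T = %.1f ns, outstanding prefetches P = %.1f\n",
           T * 1e9, P);   /* roughly 62.8 ns and P of about 5 */
    return 0;
}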

Oliver Ernst (INMO) HPC Wintersemester 2012/13 49

Processor Architecture: Prefetching

[Figure: timing diagram over iterations 1–9 with prefetch (PF) instructions issued ahead of the loads (LD), so that cache-miss latencies overlap with the "use data" phases.]

Figure 1.13: Computation and data transfer can be overlapped much better with prefetching. In this example, two outstanding prefetches are required to hide latency completely.

the processor must be able to sustain (see Figure 1.13):

  P = T/(L_c/B) = 1 + T_ℓ/(L_c/B).   (1.6)

As an example, for a cache line length of 128 bytes (16 DP words), B = 10 GBytes/sec and T_ℓ = 50 ns we get P ≈ 5 outstanding prefetches. If this requirement cannot be met, latency will not be hidden completely and the full memory bandwidth will not be utilized. On the other hand, an application that executes so many floating-point operations on the cache line data that they cannot be hidden behind the transfer will not be limited by bandwidth and put less strain on the memory subsystem (see Section 3.1 for appropriate performance models). In such a case, fewer outstanding prefetches will suffice.

Applications with heavy demands on bandwidth can overstrain the prefetch mechanism. A second processor core using a shared path to memory can sometimes provide for the missing prefetches, yielding a slight bandwidth boost (see Section 1.4 for more information on multicore design). In general, if streaming-style main memory access is unavoidable, a good programming guideline is to try to establish long continuous data streams.

Finally, a note of caution is in order. Figures 1.12 and 1.13 stress the role of prefetching for hiding latency, but the effects of bandwidth limitations are ignored. It should be clear that prefetching cannot enhance available memory bandwidth, although the transfer time for a single cache line is dominated by latency.

source: Hager & Wellein

Sufficiently early prefetching permits overlapping of data movement and computation, a form of latency hiding.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 50

Processor Architecture: Multicore processors

The laws of physics (essentially heat dissipation) have forced CPU manufacturers to use the increasing transistor count (Moore's law) for multiple processor cores per chip/die/package rather than for increased clock frequency. See [Hager & Wellein (p. 23)] for the physical explanation.
core = CPU = processor
Socket: physical package containing one or more cores.

source: Tom’s Hardware

AMD Opteron, Intel Dempsey & Intel Woodcrest packages.
Desktop PCs have 1 socket, servers 2–4 (klio: 2, mathmaster: 2, FG compute server: 2).
Putting all cores to work necessitates parallel programming.
More cores reduce the available memory bandwidth per core.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 51

Processor Architecture: Multicore processors: core and cache arrangements

Figure 1.15: Dual-core processor chip with separate L1, L2, and L3 caches (Intel “Montecito”). Each core constitutes its own cache group on all levels.

Figure 1.16: Quad-core processor chip, consisting of two dual-cores. Each dual-core has shared L2 and separate L1 caches (Intel “Harpertown”). There are two dual-core L2 groups.

Figure 1.17: Hexa-core processor chip with separate L1 caches, shared L2 caches for pairs of cores and a shared L3 cache for all cores (Intel “Dunnington”). L2 groups are dual-cores, and the L3 group is the whole chip.

Figure 1.18: Quad-core processor chip with separate L1 and L2 and a shared L3 cache (AMD “Shanghai” and Intel “Nehalem”). There are four single-core L2 groups, and the L3 group is the whole chip. A built-in memory interface allows to attach memory and other sockets directly without a chipset.

There are significant differences in how the cores on a chip or socket may be arranged:
The cores on one die can either have separate caches (Figure 1.15) or share certain levels (Figures 1.16–1.18); a group of cores that share a certain cache level is called a cache group. Sharing a cache enables communication between cores without reverting to main memory, reducing latency and improving bandwidth by about an order of magnitude; an adverse effect of sharing could be possible cache bandwidth bottlenecks.
Most recent multicore designs feature an integrated memory controller to which memory modules can be attached directly without separate logic (“chipset”). This reduces main memory latency and allows the addition of fast intersocket networks like HyperTransport or QuickPath (Figure 1.18).
There may exist fast data paths between caches to enable, e.g., efficient cache coherence communication.
The first important conclusion from the multicore transition is the absolute necessity to put those resources to efficient use by parallel programming instead of relying on single-core performance, which will at best stagnate over the years.

source: Hager & Wellein

dual-core, separate L1, L2 and L3 caches (Intel Montecito)
quad-core, separate L1 caches, L2 shared across 2 (Intel Harpertown)
hexa-core, separate L1, L2 shared across 2, L3 shared across 6 (Intel Dunnington)
quad-core, separate L1 and L2 caches, L3 shared; memory interface built in, allows adding memory / more sockets directly (Intel Nehalem, AMD Shanghai)

Oliver Ernst (INMO) HPC Wintersemester 2012/13 52

Processor Architecture: Multithreaded processors

Thread: stream of instructions from one process/program/task.
Modern processors have multiple functional units (execution units):
  FP add, FP mult
  integer units (shift, rotate, arithmetic)
  load/store
  branch
  vector
  dispatch/issue
At any given time, many/most of these will be idle:
  branch misprediction, the instruction pipeline must be flushed;
  waiting for a memory access to complete;
  the instruction mix only utilizes a small fraction of the functional units.
Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 53

Processor Architecture: Multithreaded processors

Intel (2002) 3.06 GHz Pentium 4/Xeon: “Hyper-Threading” (HT). Neutral name: multithreading.

Execute more than one thread simultaneously.
Some resources are replicated (registers, program counter) in order to duplicate the state of each active thread.
Some are not: ALUs, caches, queues, memory interface.
A single physical processor/core appears as two logical processors.
Requires OS/compiler support.
Some code may take advantage of SMT more than other code.
Thread switch is much faster than a process switch (context switch).
Fine-grained multithreading: switch threads after every instruction.
Coarse-grained multithreading: switch threads only after significant events (e.g., a cache miss).
Simultaneous multithreading (SMT): multiple instructions from independent threads issued in the same cycle (hardware handles dependencies among instructions).

Oliver Ernst (INMO) HPC Wintersemester 2012/13 54

Processor Architecture: Multithreaded processors

In the SMT case, thread-level parallelism and instruction-level parallelism are both exploited, with multiple threads using the issue slots in a single clock cycle. Ideally, the issue slot usage is limited by imbalances in the resource needs and resource availability over multiple threads. In practice, other factors can restrict how many slots are used. Although Figure 7.5 greatly simplifies the real operation of these processors, it does illustrate the potential performance advantages of multithreading in general and SMT in particular. For example, the recent Intel Nehalem multicore supports SMT with two threads to improve core utilization.

Let us conclude with three observations. First, from Chapter 1, we know that the power wall is forcing a design toward simpler and more power-efficient processors on a chip. It may well be that the under-utilized resources of out-of-order processors may be reduced, and so simpler forms of multithreading will be used. For example, the Sun UltraSPARC T2 (Niagara 2) microprocessor in Section 7.11 is an example of a return to simpler microarchitectures and hence the use of fine-grained multithreading.

[Figure: issue-slot diagrams for four threads A–D, each running alone on a superscalar processor, and together under coarse MT, fine MT, and SMT.]

FIGURE 7.5: How four threads use the issue slots of a superscalar processor in different approaches. The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together in three multithreading options. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to four different threads in the multithreading processors. The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss in throughput for coarse multithreading.

© Patterson, David A.; Hennessy, John L., Oct 13, 2011, Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface, Morgan Kaufmann, Burlington, ISBN: 9780080886138

source: Patterson & Hennessy

Oliver Ernst (INMO) HPC Wintersemester 2012/13 55

Processor Architecture: Multithreaded processors

Issues:
Single-thread performance is not improved; there may be a slight decrease due to MT overhead.
Well-optimized FP-heavy code tends to benefit less from MT.
Pressure on shared resources (caches).
Affinity.
Conservative maxim: run different processes on different physical cores unless certain code benefits from SMT.

Oliver Ernst (INMO) HPC Wintersemester 2012/13 56

Processor Architecture: Performance measurement: vector triad

  for (j=0; j<NITER; j++) {
    for (i=0; i<N; i++)
      a[i] = b[i] + c[i]*d[i];
  }

3 load streams, 1 store stream.
Outer loop to produce measurably long run times.
Omitted: tricks to prevent compiler optimizations.
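One possible self-contained, timed version is sketched below; the dummy() call is one way to keep the compiler from optimizing the loop away, and the POSIX timer and problem sizes are illustrative choices, not the code used for the measurements on the next slide:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void dummy(double *a) { /* defeat dead-code elimination; ideally lives in another file */
    if (a[0] < 0.0) printf("%f\n", a[0]);
}

int main(void) {
    int N = 10000, NITER = 100000;
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c), *d = malloc(N * sizeof *d);
    for (int i = 0; i < N; i++) { b[i] = c[i] = d[i] = 1.0; a[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < NITER; j++) {
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i]*d[i];   /* 2 flops per element */
        dummy(a);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 * (double)N * NITER;
    printf("N=%d: %.1f MFlops/s\n", N, flops / secs / 1e6);

    free(a); free(b); free(c); free(d);
    return 0;
}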

Oliver Ernst (INMO) HPC Wintersemester 2012/13 57

Processor Architecture: Performance measurement: vector triad

Vector triad timings on locally available systems:

[Plot: vector triad performance in MFlops/s (0–3000) versus vector length (10^1 to 10^7) for: Intel Woodcrest 5160 @ 3.0 GHz, Intel Clovertown E5335 @ 2.0 GHz, Intel Westmere X5670 @ 2.93 GHz, AMD Opteron 8380 @ 2.5 GHz, Intel Core i5 M460 @ 2.53 GHz.]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 58

Processor Architecture: Our Woodcrest 5160 system (klio)

CPU: Xeon 5160 @ 3 GHz, 2 × 2 cores
Memory: 8 × 2 GB
Caches: all write-back
  L1D: 32 KB, 8-way set associative, 64 byte line size
  L2: 4 MB, 16-way set associative, 64 byte line size
Memory bandwidth:
  to chipset: 21.3 GB/s
  to memory: 21.3 GB/s

[Diagram: two sockets; each socket holds two cores (C0, C1) with private L1I (32 KB) and L1D (32 KB) caches and a shared L2 (4 MB) per socket; the sockets connect via the Front Side Bus (up to 2 × 1333 MHz × 64 bit) to the chipset, which drives the memory channels (up to 4 × 667 MHz × 64 bit) to memory (4 channels, FB-DIMMs, DDR2-667).]

Oliver Ernst (INMO) HPC Wintersemester 2012/13 59

Processor Architecture: Performance measurement: vector triad

Data on the remaining computers in the vector triad measurements (to be completed):

Clovertown @ 2.0 GHz (4 cores): 32 KB L1 cache/core; 4 MB L2 cache shared by 2 cores; 1333 MHz Front Side Bus (21.3 GB/s bandwidth)
Westmere @ 2.93 GHz (6 cores): 32 KB L1 cache/core; 256 KB L2 cache/core; 12 MB L3 cache shared; memory bandwidth max. 32 GB/s
Opteron @ 2.5 GHz: 32 KB L1 cache/core; 128 KB L2 cache/core; 6144 KB shared L3 cache; system bus speed 1 GT/s; memory controller speed 2 GHz
Core i5 @ 2.53 GHz: 32 KB L1D cache/core; 32 KB L1I cache/core; 256 KB L2 cache/core; 3 MB shared L3 cache; memory type DDR3-800/1066; max. memory bandwidth 17.1 GB/s; DMI 2.5 GT/s

Oliver Ernst (INMO) HPC Wintersemester 2012/13 60

Processor Architecture: Performance measurement: vector triad, interpretation of results

Each loop iteration operates on 4 vector elements, each a double precision floating point number (8 Bytes), i.e., 32 Bytes of data movement.
An L1 cache of 32 KB (= 32 × 1024 Bytes) can hold at most 1024 sets of 4 such vector elements, i.e., all four vectors fit into the L1 cache for at most N = 1024. This explains the flop rate drop around N = 1024 for all Intel processors.
With the same reasoning, all four vectors fit into an L2 cache of 4 MB (= 4 × 1024 × 1024 Bytes) for at most N = 128 × 1024 = 131 072. This explains the flop rate drop around N = 10^5 for the Woodcrest and Clovertown processors.
For the Westmere CPU these events occur at N = 1024 (L1), N = 8192 (L2) and N = 393 216 (L3).
Cache bandwidth: Intel says 8.5 Bytes/cycle; 1 loop iteration (32 Bytes) thus needs 3.75 cycles; at 3 GHz (3 × 10^9 cycles/s) we should see 1.23 GFlops; we see roughly 1.4 GFlops.
Memory bandwidth: 10 664 MB/s, i.e., 333.25 M sets of 4 doubles per second; with 2 Flops for each set we should see 666.5 MFlops/s, but we see 200 MFlops/s.
Intel: 3.5 GB/s memory bandwidth for servers; this means 3.5 G / 32 = 0.109 G loop iterations per second; 2 Flops each gives 0.218 GFlops.
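The cache-size and bandwidth arithmetic above can be reproduced mechanically; a minimal C sketch using only the figures quoted on this slide:

#include <stdio.h>

int main(void) {
    double bytes_per_iter = 4 * 8.0;   /* 4 double-precision elements per iteration */

    /* largest N for which all four vectors fit into a cache of the given size */
    printf("L1 32 KB : N <= %.0f\n", 32.0 * 1024 / bytes_per_iter);          /* 1024   */
    printf("L2  4 MB : N <= %.0f\n", 4.0 * 1024 * 1024 / bytes_per_iter);    /* 131072 */

    /* expected flop rate from memory bandwidth: 2 flops per 32-byte iteration */
    double mem_bw = 10664e6;  /* bytes/s, as quoted on the slide */
    printf("memory-bound estimate: %.1f MFlops/s\n",
           2.0 * mem_bw / bytes_per_iter / 1e6);                             /* about 666.5 */
    return 0;
}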

Oliver Ernst (INMO) HPC Wintersemester 2012/13 61