Cache oriented implementation for numerical codes

Martin Schulz, IWRMM, University of Karlsruhe

As widely known, naively written numerical software may use only a small part of the possible performance of the underlying machine; it is much less known how to actually achieve it.

Therefore the characteristics and potential bottlenecks of modern computers are studied in detail with respect to numerical simulation software, with the emphasis on today's PC hardware running the Linux operating system. The expected performance proves to be limited by data access (with loads and stores weighted differently); therefore data reuse from the processor cache is crucial and is discussed from both theoretical and practical perspectives.

A basic finite volume scheme is chosen for the discussion of different memory access patterns, which are decisive for the overall performance of the code.

Introduction

Once a numerical scheme is chosen, the further processing seems straightforward. Most mathematicians stop here and move on to other interesting problems. At first sight, the implementation of a given numerical algorithm seems trivial, but there are indeed many issues to be considered before actually obtaining a reasonable code to run on a computer.

Simply counting floating point operations is no longer sufficient to create fast programs. Computers of today are highly complex systems of many different components with interactions that can hardly be investigated in all detail and completeness. Therefore high level programming languages employ an abstract and rather simplistic (from the hardware point of view) programming model.


It consists of “data”, “operations” and “control structures” to describe operations and their execution order, but it does not take system-specific issues such as execution times, memory bandwidths and delays, caches or memory access patterns into account.

In fact, numerical software may use only a small fraction of the possible performance of the underlying machine. It can be somewhat depressing to see an up-to-date computer delivering only the performance that was already possible 3-5 years ago.

Much more exciting is the perspective that, by taking care while coding, it is possible to achieve today the performance other (careless) programmers have to wait another 3-5 years for.

This paper focuses on the current high performing and still cost-efficient systems based on PC technology and their use for scientific computing. After a discussion of caching issues and a brief introduction to the 80x87 FPU, measurements are presented that allow a rule of thumb for realistically expectable performance figures of numerical codes. This paper also tries to give hints on how to actually achieve this performance, with data prefetch commands and data access patterns being the most important points.

1 Considerations on the underlying computing system

1.1 Memory demand

Since the processing power of PCs, workstations and supercomputers has come closer together nowadays, one main criterion for the architecture of choice is (besides the potential for vectorization or parallelization) the demand the numerical scheme places on the memory system. This demand splits mainly into the size of available memory and the system properties of bandwidth and latency.

To note some figures, a current PC is effectively limited to 768 MB (AMD 750), 1 GB (Intel BX) or 1.5 GB (VIA KT133) today. The current Linux kernel on 32-bit systems is able to support a maximum of 2 GB for a single process. There are some special designs that allow more memory (such as the Serverworks chipsets), but these are rare and will not get into the mainstream market as they do not allow a simple addressing scheme on 32-bit systems.

If more memory is needed, a way out is to use a 64-bit architecture, such as SUN Sparc (max. 4 GB on the E450), SGI MIPS, IBM Power3, or the DEC Alpha processor line (bought by Compaq, recently bought by Intel), or to move on to the upper range of supercomputers. You could as well wait another year or two for the upcoming Intel Itanium and AMD Sledgehammer 64-bit processors.

If the dataset of your problem fits into 1 GB of RAM, it seems natural to choose the currently most cost-efficient solution by using an Intel Pentium 2/3 or AMD Athlon based PC architecture. We will go on to investigate the further properties of these systems.


Since computer architecture is similar among different microprocessor lines, the basic concepts and techniques from section 3 can be applied to other machine types as well.

1.2 Cache organization

As is already well known, today's main processors (Central Processing Unit, CPU) can execute their operations much faster than the main memory system can effectively handle the corresponding data, see [9]. It has been observed that processors in general work on a rather small subset of the main memory at a time, so processor manufacturers have introduced the concept of so-called caches, which are small but very fast memories. Caches store recently manipulated data and provide it in case the processor needs it again. As the processor advances and moves on to work on other parts of the address space, this data eventually gets written back into main memory.

It is not sufficient for a cache to hold only a copy of the data. Information about its corresponding address in main memory and certain flags (noting whether the data is modified, exclusive to the CPU, shared or invalid, the so-called MESI state) need to be stored as well. Since this logic is expensive to implement in hardware, chip designers reduced the generality of the address mapping and introduced the notion of cache associativity. First of all, the cache does not store single bytes or words, but always complete cachelines, typically 32 or 64 bytes of contiguous data. The mapping of a memory address to its cacheline number is given by truncating the last 5 bits of the binary address representation:

cacheline number = address div 32

A cache is called fully associative if every cacheline can be stored at any location in the cache. Fully associative caches are rare today; more often, 4-way or 2-way associative caches can be found (details for Intel/AMD below), where each cacheline can be stored at only 4 or 2 places in the cache. This place is determined by the lower bits of the cacheline number (see figure 1):

cacheline position = cacheline number mod 512

These numbers assume a 64 kB 4-way associative L1 cache with cachelines of 32 bytes, as can be found on Pentium 2 processors. As a consequence, a single byte consisting of the address bits 4-11 is used to identify the possible cacheline locations for any given address. The 18 remaining high bits (from 12 on) have to be stored along with the cached data to identify the actual memory position of the cached data. Reducing the number of these bits saves some silicon, but leads to smaller cacheable areas, such as the one of the formerly well-sold Intel TX Pentium chipset.
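As a small illustration (a minimal sketch in C, not from the original text, assuming the 32-byte cachelines and 512 sets described above), the two mappings can be written as:

/* Sketch: map an address to its cacheline number and to its possible
   position (set) in the cache, for 32-byte cachelines and 512 sets
   (64 kB, 4-way) as described above. */
unsigned long cacheline_number(unsigned long address)
{
    return address / 32;            /* drop the offset within the cacheline */
}

unsigned long cacheline_position(unsigned long address)
{
    return (address / 32) % 512;    /* lower bits of the cacheline number */
}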

If a cacheline can only be stored at a single position, the cache is called direct mapped, which sounds much better than 1-way associative.


Figure 1: cache position (address bits 0-3: offset within the cacheline, bits 4-11: cacheline position, bits 12-31: tag)

This is particularly simple to implement, since the last bits of any memory address represent its sole possible location in the cache. SUN uses this concept for the Enterprise 450 server, combined with a large L2 cache.

Figure 2: schematic cache organization: a 4-way processor cache and main memory (small boxes represent cachelines, 32 B each)

The set of cachelines is partitioned into a number of subsets within which the cachelines compete for actual storage space in the cache. This concept supports contiguous memory access, but can lead to rather disastrous results if the memory is sparsely accessed with an unfavorable stride. Examples of this are given in figure 14 and in the “Elch Test” by Stefan Turek [10].

Processor cache details differ from system to system; here are some gathered specifications:

              level-1 cache                  level-2 cache
              size    assoc.  cacheline      size     assoc.  cacheline
Pentium 4     8 kB    4-way   64 B           256 kB   8-way   64 B
Pentium 2,3   32 kB   4-way   32 B           512 kB   4-way   32 B
Athlon        64 kB   2-way   64 B           256 kB   16-way  64 B
SUN E450      16 kB   direct  32 B           4096 kB  direct  64 B

As can be seen, these processors do not have a single cache, but a hierarchy of two caches. Not taken into account here are the further CPU caches such as the instruction cache, the write buffer, the translation look-aside buffer and others.


1.3 Bandwidth and Latency

As long as all data of a numerical model fits into the processor's cache, the processor can run at full internal speed. But if the dataset gets larger, the data has to be moved from the main memory to the CPU and back again. This is known to be much slower than the actual data processing itself [9, 10]. Consequently, the impact of a larger dataset is twofold: first, the amount of data gets larger; second, the processing of that data gets much slower.

Communication between the CPU and the memory system takes place via the front side bus (FSB), which is 64 bits wide for both Pentium 2/3 and Athlon; 8 bytes are called a quadword.

The FSB runs at speeds of 100 MHz (e.g. Intel BX, AMD 750) or 133 MHz (Via KT133, Via Apollo 133A); Intel Celeron processors ran at 66 MHz for a while. The theoretical bandwidth of 8 bytes per FSB cycle is about 800 MB/s or 1064 MB/s respectively. Due to the cache structure, whole cachelines are always transferred at a time (see section 1.2), therefore it is most effective to employ a dense (contiguous) memory scheme such that all transferred data might be used.

1.3.1 Prefetch

Both Intel and AMD introduced special prefetch operations in the command set of their processors to enable the programmer or compiler to indicate data which will be worked on in the near future. That way, the processor can request the corresponding cacheline from memory in advance in order to have the data available when it is needed. As a consequence, the latency of the main memory (the inherent delay before the data is actually provided to the CPU) is hidden behind doing useful work, and waiting for data is avoided. The disadvantage is that there are additional commands to execute and the prefetches need to be issued at appropriate places in the code. In case of misprediction of the next memory access, the effective data rate may go down because of the misguided and therefore useless data transfers. For details, see the discussion in the Intel optimization guide [5] and the AMD optimization guide [1].

Since the GNU compiler does not issue these commands by itself, I wrote some small macros to provide them by means of inline assembly statements. As shown in the later sections, it is possible to get a certain speedup by concise use of them, but be aware that time measurements are always necessary, since it is easily possible to actually slow down the code by using them inappropriately.

On a Pentium 2/3 or Athlon with MMX extension, the following macro can be used to prefetch the cacheline of the variable var into the L1 cache:

#define PREFETCH(var) asm ("prefetchnta (%0) \n\t" : : "r" (&var))

To fetch 4 cachelines or 4 · 32 = 128 = 0x80 bytes ahead, use the following:

#define PREFETCH4(var) asm ("prefetchnta 0x80(%0) \n\t" : : "r" (&var))
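A hypothetical usage sketch (the array x and length n are not from the paper): while traversing a long vector, the macro requests the cacheline 0x80 bytes (16 doubles) ahead of the current element:

double sum = 0.0;
for (long i = 0; i < n; i++) {
    PREFETCH4(x[i]);   /* requests the cacheline 128 bytes ahead of x[i] */
    sum += x[i];
}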


These prefetchnta (non-temporal access) commands instruct the processor to fetch the data into the L1 cache only, without eating up space in the L2 cache. That way, so-called cache pollution by data that is used only once can be avoided.

The kni_memcpy routine in the file /usr/src/linux/arch/i386/lib/simd.c of the Linux kernel source tree sheds some light on the business of memory prefetch and other processor-specific optimizations, where Doug Ledford writes:

/*

* Note: Intermixing the prefetch at *exactly* this point

* in time has been shown to be the fastest possible.

* Timing these prefetch instructions is a complete black

* art with nothing but trial and error showing the way.

* To that extent, this optimum version was found by using

* a userland version of this routine that we clocked for

* lots of runs. We then fiddled with ordering until we

* settled on our highest speed routines. So, the long

* and short of this is, don’t mess with instruction ordering

* here or suffer performance penalties you will.

*/

So the programmer has to decide how far he will go down the road...

1.3.2 Preload

Another common way to hide memory latencies is called preload. It is much like prefetch, but it actually loads the data into a register; since the other commands do not depend on that data, execution can go on thanks to the out-of-order execution features of modern processors.

A nice thing about preload is that it can be implemented by machine-independent inline assembly: in fact, an empty inline asm statement with an input suffices to make gcc load that input data into a register:

#define PRELOAD(var) asm volatile ("" : : "r" (var))
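A hypothetical usage sketch, reusing the data structures of section 2.6 (the guard for the last element is omitted for brevity): the next subscript is brought into a register early so that the out-of-order core can start the dependent address computation:

for (int n = 0; n < num; n++) {
    PRELOAD(boundary[n+1].left);   /* touch the next index early (reads one
                                      element past the end on the last turn) */
    cellvalue[boundary[n].left] -= boundary[n].flux;
}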

This preload can be done even on machines without a prefetch operation, but it has the disadvantage of blocking a register for other uses (the compiler has less space to store intermediate values) and eats up out-of-order capacity for other operations; it therefore did not yield any improvement in the measurements of sections 2.2-2.5. If you can, use prefetch instead of preload.

1.4 Virtual address translation

The above sketch of the working of the CPU cache is still too simple in that it talks only about “addresses”. In fact there are two kinds of addresses in use here: physical addresses and virtual addresses.


While in protected mode, the CPU provides each process (instance of a running program) with its own address space, completely separated from the others. These virtual addresses are translated into physical addresses by the CPU under the control of the operating system. This mechanism is called paging and has two main purposes: it supports swapping and memory protection.

Swapping is a technique to use the same physical memory for several processes by temporarily moving unused data to an external data storage such as a hard disk and back again when needed. This is done in pages of 4 kB size, so the main memory can be interpreted as a fully associative cache for the swap space using 4 kB cache lines.

If a program references a (virtual) address, the CPU looks that address up in the address translation table. If the corresponding page table entry has the present bit set, the sought-after page is located in memory and the page table entry provides the physical address of the memory page containing the referenced byte.

If the present bit is not set, it can mean two things: either the page is located on disk; this is called a page fault and forces the page on disk to be loaded into RAM again. Or the address is not valid for the process, and a segmentation fault occurs.

The entries of the address translation table are generated by the operating system and may even change (due to swapping) during the lifetime of a process. In fact it defines an injective mapping from virtual to physical addresses, which may have jumps at 4 kB boundaries but is monotone and contiguous inside the 4 kB pages. This issue is also known as page coloring. See also section 3.8.

The CPU does its caching based on the physical addresses, over which user space programs have no influence. The address mapping on a machine with enough memory seems fairly randomized after some uptime, resulting in difficulties to deliberately reproduce the “worst case scenarios” mentioned above. Best figures (in the sense of worst case scenarios) were obtained on freshly rebooted machines.

A randomized mapping alleviates the effect of cache thrashing, because fewer cachelines are then competing for the same space in the cache. It favors local (inside 4 kB pages) cache effects but renders the deliberate use of the full cache size and structure nearly impossible, as the predictability of mid-scale cache behavior gradually decreases.

1.4.1 TLB bottleneck

The lookup in the address translation table involves yet another type of cache, not discussed so far. Each used page table entry gets stored in the TLB (translation look-aside buffer) to avoid reloading the translation table from memory for a page used not too long ago. The TLB has 64 entries on the Pentium processor and has its worst impact when referencing only little data spread over a large number of 4 kB pages.


It is not of great importance for numerical simulations since we aim at contiguous memory access patterns anyway.

1.4.2 Quadword alignment

The AMD optimization guide [1] stresses the importance of correctly aligned memory access. To maintain the best possible speed, memory accesses should be quadword aligned; in other words, the address should be divisible by 8.

The GNU compiler issues .align directives to the underlying assembler by itself, and quadword alignment inside the 4 kB pages implies quadword alignment in terms of physical addresses, so this does not seem to be a great issue here.

1.4.3 Memory bank conflicts

The Intel Architecture Reference Manual [4] mentions memory bank conflicts as a further possible cause of memory access delays. Accesses to different pages of the same DRAM bank introduce a delay due to DRAM page opening and closing. Since the operating system handles the physical address allocation at run time, both programmers and compilers have little control over this effect.

2 Pentium processors from a scientific computing viewpoint

The early PC CPUs, such as the 8088 and 8086 and the later 80286 and 80386 processors, did not have any floating point hardware. An additional floating point coprocessor chip (8087, 80287 or 80387 respectively) was available and had to be plugged into a separate socket. It had separate registers and separate commands for floating point operations but needed to be controlled from the CPU. From the 80486 on (the 80486SX being an exception), the floating point unit (FPU) was integrated into the main processor, but the basic structure remained the same.

Later additions to the Pentium processor line include MMX, ISSE and SSE2, all aiming at parallel data processing for further speedups:

• MMX defines new operations to handle 64-bit packed integer data types (byte, word (2 bytes) and doubleword (4 bytes), signed and unsigned each). These operations work on eight new 64-bit wide so-called MMX registers.

As numerical simulation in scientific computing does not require heavy work in integer arithmetic, these additions are of little use for the applications discussed here, with one exception: the prefetch commands discussed in section 1.3.1 were introduced with MMX.


• ISSE (also known as KNI, Katmai New Instructions) introduced eight new registers (XMM) of 128 bit length and new operations on packed single precision floating point data. This allows executing the same operation on 4 single precision numbers at once, using the SIMD (single instruction multiple data) paradigm.

Since scientific numerical computations are usually done in double precision, this is of little use as well.

• SSE2 (Streaming SIMD Extensions 2) was the next enhancement by Intel, introduced with the Pentium 4. Further commands to use the XMM registers for 2 double precision numbers (instead of 4 single precision values with ISSE) and to work on them in parallel were added.

The new clflush command was introduced as well, which serves to tell the processor that a certain piece of data will not be used again in the near future by invalidating its cache line. This way, the cache can be kept clean of data used only once and can therefore hold more data for effective reuse. The boundary data in section 3.3 would be a good candidate for its application.

The GNU compiler currently only uses the 80x87 FPU when using double precision numbers. The gas (GNU assembler) does not yet support the SSE2 instruction set, not even for manual use. Therefore, only the 80x87 FPU is discussed here.

2.1 Structure of the Intel 80x87 FPU

The 80x87 FPU provides 8 registers of 80 bit length, each capable of holding a number in extended precision. These registers, named %st(0) through %st(7), are organized as a register stack. Operations always work on the top of the stack if not explicitly stated otherwise.

As an example, take the evaluation of the linked triad of section 2.4:

d_i = a_i + b * c_i

Consider the addresses of a_i, b, c_i and d_i to be stored in the general purpose registers %eax, %ebx, %ecx and %edx respectively. Then the above operation can be achieved by:

fldl  (%ebx)      # load b
label:
fldl  (%ecx)      # load c_i
fmul  %st(1)      # b * c_i
fldl  (%eax)      # load a_i
faddp             # a_i + b * c_i
fstpl (%edx)      # store d_i


First the two numbers b and c_i are pushed onto the FPU register stack by the fldl commands (see footnote 1). Then the fmul command multiplies them and stores the result in %st(0), replacing the value of c_i. The subsequent fldl command pushes a_i on top of the stack. faddp computes the sum, pops the stack and writes the result into %st(0) again. The fstpl command at the end takes this value, stores it to the memory location of d_i and pops the stack again.

At time 6, the value of b is on top of the stack and can be used in the next turn after incrementing the addresses by 8 and a jump to “label:”, skipping the first command.

        time 1   time 2   time 3    time 4    time 5          time 6
st(0)   b        c_i      b * c_i   a_i       a_i + b * c_i   b
st(1)            b        b         b * c_i   b
st(2)                               b
...
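A minimal sketch of the complete loop; the register choice for the element counter (%esi) and the loop bookkeeping are assumptions, not from the original text:

        fldl  (%ebx)      # load b once, outside the loop
label:
        fldl  (%ecx)      # load c_i
        fmul  %st(1)      # b * c_i
        fldl  (%eax)      # load a_i
        faddp             # a_i + b * c_i
        fstpl (%edx)      # store d_i and pop; b is on top again
        addl  $8, %eax    # advance a_i, c_i and d_i by one double
        addl  $8, %ecx
        addl  $8, %edx
        decl  %esi        # hypothetical element counter
        jnz   label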

For a more detailed discussion of the 80x87 FPU and its precise floating point arithmetic properties, see [8].

The CPU also provides an fxch command to rename the floating point registers. This is useful to issue further independent commands while the result of the previous operation is not yet available. The feature allows compilers to use several processing units in parallel and thus helps to minimize CPU cycles. This, however, works only for data from cache. In the context of memory-bandwidth-bound performance, the fxch operation does not help.

2.1.1 FPU performance

Some of the arithmetic floating point operations provided by the 80x87 are listed in the table below, along with measured CPU cycles on Pentium 2/3 and Athlon systems.

              description    Pentium 2/3   Athlon
fmul, fmulp   multiply       3             2
fadd, faddp   addition       1             2
fdiv          division       38            25
fsqrt         square root    65            32

It is stated in the literature that the newer 80x87 FPUs need two cycles for a floating point multiplication, and that the processor has two multiplication units that can work in parallel but cannot be started in the same cycle. Nevertheless, I measured 3 CPU cycles for consecutive multiplications on the Pentium 3. A more extensive list of floating point operations and their timing properties can be found in [2].

Footnote 1: The “l” suffix of the command indicates that double values (long) are to be loaded.


Note that the CPU cycles relate to the clock speed of the processor core, which is usually much higher than the clock speed of its interface to the system bus; the latter is referred to as FSB cycles in other parts of the paper. The ratio of FSB to CPU cycle times is somewhere in the range of 4 (Pentium 400, BX) to 10 (Athlon 1 GHz, AMD 750).

2.2 Measurements of memory transfers

To obtain information about the actually achievable fraction of the theoretical peak performance, I did some measurements with hand coded assembly loops accessing contiguous data in simple patterns. They can be found as inline assembly statements in the unified bandwidth estimation program bandwidth.cc, which can be downloaded from the website stated at the end of the paper. For a discussion of the assembly programming language, see [6, 7]. The use of gcc inline assembly is discussed in [3, 11]. A selection of these loops is discussed below.

Each of the loops works on 24 bytes per iteration, the space needed for 3 double precision numbers. While the first column of the tables below lists the respective machine type, the second column contains the operation frequency, i.e. the execution speed of the issued numerical operations. The next column then states the corresponding amount of transferred cacheline data. Note that the cacheline data is larger than the real data (i.e. the amount of data actually operated on) in the case where the data is not accessed contiguously. The last column then gives the respective FSB cycles necessary per quadword of transferred cacheline data.

2.2.1 Reading contiguous double precision data

A simulation of the data flow reading all components of a 3-vector is done by loading three double precision values and incrementing the address by 8 bytes afterwards in a loop. Since the data is accessed contiguously, all data that gets transferred is actually used. The caches are irrelevant here since all data is used only once and the underlying vector is too long to fit into cache.
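The actual loops are part of bandwidth.cc and are not reproduced here; a minimal sketch of such a read loop, assuming a 32-bit target, AT&T syntax and gcc inline assembly, might look like this (each loaded value is popped again immediately so that the FPU stack does not overflow):

void read_loop(double *data, long iterations)
{
    asm volatile(
        "1:                \n\t"
        "fldl  (%0)        \n\t"   /* load 1st double of the triple */
        "fstp  %%st(0)     \n\t"   /* pop again, value is discarded */
        "fldl  8(%0)       \n\t"
        "fstp  %%st(0)     \n\t"
        "fldl  16(%0)      \n\t"
        "fstp  %%st(0)     \n\t"
        "addl  $24, %0     \n\t"   /* advance by 3 quadwords (24 bytes) */
        "decl  %1          \n\t"
        "jnz   1b          \n\t"
        : "+r"(data), "+r"(iterations)
        :
        : "memory", "cc");
}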

Figure 3: load contiguous quadwords


We get the following results:

                              qw MHz   MB/s cacheline   FSB cycles
Laptop Pentium 333            46.1     369.2            2.2
Pentium2 400MHz+BX            54.5     436.3            1.8
Pentium2 600MHz+BX            63.8     510.6            1.6
Pentium 866+Via Apollo 133A   71.4     571.4            1.9
Athlon 1GHz+AMD750            56.6     452.8            1.8
Athlon 1GHz+Via KT133         68.1     545.4            1.5

2.2.2 Reading contiguous double precision data using prefetch

This can be done faster by using prefetch commands, which request the data before it is actually used (see section 1.3.1). At the beginning of the loop, a non-temporal prefetch command (prefetchnta) is inserted that addresses 4 cachelines ahead. Non-temporal data is the Intel nomenclature for data that is used only once and is therefore not worth being stored in the L2 cache.

Figure 4: load contiguous quadwords with prefetch

                              qw MHz   MB/s cacheline   FSB cycles
Laptop Pentium 333            43.4     347.8            2.3
Pentium2 400MHz+BX            54.5     436.3            1.8
Pentium2 600MHz+BX            75       600              1.3
Pentium 866+Via Apollo 133A   83.3     666.6            1.6
Athlon 1GHz+AMD750            83.3     666.6            1.2
Athlon 1GHz+Via KT133         120      960              1.1

Some of the chipsets come reasonably close to their theoretical peak, whereas the VIA Apollo is rather disappointing. As can be seen, the effect of the prefetch is much larger for the Athlon systems than for the Pentiums; though they are slower without prefetch, they are faster with it.

2.2.3 Reading sparse integer data

To see whether loading into the floating point registers is the bottleneck, the loop is modified to load only the first doubleword of each triple of quadwords into a general purpose integer register (32 bit), using the same prefetch.


Since the address stride of the loop is smaller than the length of a cacheline, all cachelines are loaded, even though only a small fraction of the data is actually loaded into a register.

Figure 5: load sparse doublewords with prefetch

                              dw MHz   MB/s cacheline   FSB cycles
Laptop Pentium 333            18.8     452.8            1.8
Pentium2 400MHz+BX            20       480              1.7
Pentium2 600MHz+BX            20.4     489.7            1.6
Pentium 866+Via Apollo 133A   30.3     727.2            1.5
Athlon 1GHz+AMD750            27.0     648.6            1.2
Athlon 1GHz+Via KT133         37.0     888.8            1.2

Due to the skipping of 20 of 24 bytes in each loop iteration, the real data rate is only 1/6 of the cacheline data rate. Also note that the cacheline transfer rate is smaller than before on some systems. The Via Apollo is better here, but still beaten by the Via KT133.

2.2.4 Writing contiguous double precision data

Up to now, we were loading data from memory. Storing data turns out to have completely different characteristics. The following table lists the performance of storing 3 double precision values per loop iteration from floating point registers to memory; in some sense this is the inverse of section 2.2.1.

Figure 6: store contiguous quadwords


                              qw MHz   MB/s cacheline   FSB cycles
Laptop Pentium 333            14.5     116.5            6.9
Pentium2 400MHz+BX            21.8     175.1            4.6
Pentium2 600MHz+BX            27.5     220.1            3.6
Pentium 866+Via Apollo 133A   23.4     187.5            5.7
Athlon 1GHz+AMD750            40       320              2.5
Athlon 1GHz+Via KT133         30.3     242.4            4.4

2.2.5 Writing contiguous double precision data using prefetch

To see the effect of prefetch on the store operations, the same prefetch command is inserted as in section 2.2.2. It can be seen that the prefetch 4 lines ahead improves only the performance of the slower Pentium. In general, there is only little effect.

Figure 7: store contiguous quadwords with prefetch

                              qw MHz   MB/s cacheline   FSB cycles
Laptop Pentium 333            14.7     117.6            6.8
Pentium2 400MHz+BX            25       200              4.0
Pentium2 600MHz+BX            27.2     218.1            3.6
Pentium 866+Via Apollo 133A   24       192              5.5
Athlon 1GHz+AMD750            39.4     315.7            2.5
Athlon 1GHz+Via KT133         30       240              4.4

2.2.6 Writing integer data using prefetch

To show that the difference of store versus load performance is not induced by the use of floating point registers, an alternative loop was timed. It contains 6 doubleword stores from a general purpose 32-bit register. The results vary only slightly from those of the previous section.

                              dw MHz   MB/s cacheline   FSB cycles/qw
Laptop Pentium 333            29.4     117.6            6.8
Pentium2 400MHz+BX            42.8     171.4            4.6
Pentium2 600MHz+BX            55.4     222.2            3.6
Pentium 866+Via Apollo 133A   48       192              5.5
Athlon 1GHz+AMD750            72.2     289.1            2.8
Athlon 1GHz+Via KT133         74.0     296.2            3.6


Figure 8: store contiguous doublewords with prefetch

2.3 Rule of thumb

The measurements of the previous sections show the possible store and load performance, obtained by straightforward assembler code without any compiler interference. Obviously, we cannot expect a “real program” to have higher load or store rates, so we should be happy if we come close to the figures measured above.

As a rule of thumb for the maximal expected performance on different machines, count both the quadword loads and stores (including unused cacheline data) and multiply them by the FSB cycle numbers from sections 2.2.2 and 2.2.5:

                              FSB cycles per qw load   FSB cycles per qw store
Laptop Pentium 333            2.3                      6.8
Pentium2 400MHz+BX            1.8                      4.5
Pentium2 600MHz+BX            1.5                      3.6
Pentium 866+Via Apollo 133A   1.6                      5.5
Athlon 1GHz+AMD750            1.2                      2.5
Athlon 1GHz+Via KT133         1.2                      4.4

The resulting sum is the number of FSB cycles a good implementation is expected to spend on memory transfers. If your code reaches this performance, you should be satisfied.
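For example, the linked triad of section 2.4 performs two quadword loads and one quadword store per evaluation; on the Athlon 1 GHz + AMD 750 the rule of thumb therefore predicts 2 · 1.2 + 1 · 2.5 = 4.9 FSB cycles per evaluation, and on the Laptop Pentium 333 it predicts 2 · 2.3 + 1 · 6.8 = 11.4. These are exactly the values listed in the “expected” column of the table in section 2.4.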

The kni_memcpy routine mentioned above (section 1.3.1) actually beats these numbers by about 30% on Pentium 2/3; it does not run on Athlon due to its use of the Katmai instruction set extension. So it is indeed possible to achieve higher throughput than stated by the above rule of thumb by taking explicit care about alignment, the use of special purpose registers, non-temporal stores that do not pollute the caches and so on. However, there seem to be no general rules on how to use these instruments, other than trial and error.

We can only speculate why the stores are so much slower than the loads. There are some possible reasons:

• Before writing to a cacheline, it has to be read into cache and gets written back afterwards. As a consequence, the cacheline gets transferred twice instead of once.


As any cacheline that gets written to is first fetched from main memory if necessary, an update operation is considered the same as a mere store for our purposes.

There seem to be no instructions for a userland program to notify the processor that a certain part of the memory is “write only” for some time. The memory type range registers (MTRRs) provide similar functionality but are targeted to be used by the operating system, not userspace programs.

• DRAM chip characteristics: writing may be slower than reading, since it has to be verified that the DRAM cell really has stored the correct value.

ECC RAM seems to be slower than non-ECC RAM for stores, not for loads.

• Another explanation could be the bus address snooping for the coherence test of dual-capable processors. This is necessary because the cached data must be exclusive to the CPU when writing to it (see MESI state, section 1.2).

• The fast loads may indicate that a “most significant word first” technique might be used, whereas a write operation is safely completed only after the whole cacheline has been transferred.

2.4 Linked triad

Schönauer [9] states the linked triad of the form (also known as the daxpy operation)

d_i = a_i + b * c_i

to be one of the most often used operations in scientific supercomputing and takes it as a benchmark for his classification of supercomputers. To verify the rule of thumb, I coded this operation in assembler as well. The basic operation is discussed as the example in section 2.1. To increase the performance, the loop is unrolled threefold and interspersed with prefetch commands for the three memory lines.

                              MHz eval   MFlop/s   FSB/eval   expected
Laptop Pentium 333            8.3        16.6      12.0       11.4
Pentium2 400MHz+BX            11.2       22.4      8.9        8.1
Pentium2 600MHz+BX            14.6       29.2      6.8        6.6
Pentium 866+Via Apollo 133A   17.8       35.6      7.5        8.7
Athlon 1GHz+AMD750            19.7       39.4      5.1        4.9
Athlon 1GHz+Via KT133         16.8       33.6      7.9        6.8

The table lists the evaluation frequency in its first column. Each evaluation consists of one multiplication and one addition, hence column two is double the first column; the loop contains two loads and one store for each index. The third column relates the evaluation frequency to the FSB. These numbers correlate fairly well with the expected values from the rule of thumb above (last column).


2.5 Compiler influences

Coding small things in assembler may sometimes be real fun, but implementing and debugging complex numerical algorithms is another issue. FORTRAN or C/C++ are the usual choices of programming language for that purpose. The code written in these higher level languages is used to automatically generate the assembly code, which is then further translated into machine code.

That process of automatic translation usually involves a certain loss in execution speed. In some cases, the compiler may even generate better code than an assembler programmer would. However, for problems limited by the speed of the data transfer from/to memory, the memory access structure is of crucial importance. Unfortunately, it is not well recognized by compilers, which are tuned to issue the commands such that as many processor parts as possible are kept busy. Usually, the memory bandwidth and latency are not taken into account; hence the automatically generated code is most effective on data that is already in cache.

The linked triad is used as a test example as above. The code for a threefold unrolled loop in C reads:

for(int i=0; i<3*num; i+=3)

{

y1[i ]=y2[i ]+fac*y3[i ];

y1[i+1]=y2[i+1]+fac*y3[i+1];

y1[i+2]=y2[i+2]+fac*y3[i+2];

}

To see what can be done using the compiler and additional manually inserted prefetch commands, a similar but enhanced loop is used:

for(int i=0; i<3*num; i+=3)

{

PREFETCH3(y1[i]);

y1[i ]=y2[i ]+fac*y3[i ];

PREFETCH3(y2[i]);

y1[i+1]=y2[i+1]+fac*y3[i+1];

PREFETCH3(y3[i]);

y1[i+2]=y2[i+2]+fac*y3[i+2];

}
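The PREFETCH3 macro is not defined in this excerpt; by analogy with the PREFETCH4 macro from section 1.3.1, a plausible sketch prefetching 3 cachelines (3 · 32 = 0x60 bytes) ahead would be:

#define PREFETCH3(var) asm ("prefetchnta 0x60(%0) \n\t" : : "r" (&var))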


                              gcc    gcc + prefetch
Laptop Pentium 333            7.6    7.8
Pentium2 400MHz+BX            9.9    10.3
Pentium2 600MHz+BX            10.5   12.0
Pentium 866+Via Apollo 133A   12.5   15.8
Athlon 1GHz+AMD750            14.4   19.3
Athlon 1GHz+Via KT133         13.4   16.3
SUN Enterprise 450            12.4   -

The table shows the evaluation frequency in MHz, to be compared to the first column of the table in section 2.4. As we have seen before, the Athlon systems profit more from the prefetch commands and nearly reach the assembler loop from section 2.4. For the Pentium systems, the gain from the additional prefetches is smaller.

For comparison, the gcc figure is also given for a SUN Enterprise 450. All experiments with different compiler options for the SUN Workshop compiler resulted in lower rates than the GNU compiler with the -O2 optimization option.

2.6 Indirect addressing

When working in finite volume contexts, topological neighborhoods come into play. Roughly speaking, each cell boundary needs references to its left and right neighbor cells. Consider a loop over all these cell boundaries, which computes all fluxes and updates the corresponding cell variables.

Suppose this neighbor information is stored together with some given flux in a data structure like this:

struct boundary_struct {

double flux;

int left;

int right;

};

Consider an array boundary of such structures and another array cellvalue of double precision values; note that each boundary_struct occupies 8 + 4 + 4 = 16 bytes, i.e. two quadwords. The following basic codes update the cell values by the flux given by boundary[n].flux. The third version is the “usual” form. For demonstration purposes, the second assignment is left out in the first and second versions. Versions two and four are enhanced by prefetch commands.

The first (reduced) version does only half of the work:

for( int n=0; n<num; n++) {

cellvalue[boundary[n].left ] -= boundary[n].flux;

};


As can be seen in the table near the end of this section, it is important for the subscript to be known early enough; otherwise, the execution will stall and wait for the value of boundary[n].left. This is accomplished by an additional prefetch command in the following:

for( int n=0; n<num; n++) {

PREFETCH4(boundary[n]);

cellvalue[boundary[n].left ] -= boundary[n].flux;

};

A naive implementation of the full cellvalue update would be

for( int n=0; n<num; n++) {

cellvalue[boundary[n].left ] -= boundary[n].flux;

cellvalue[boundary[n].right] += boundary[n].flux;

};

And with the same prefetch applied as before:

for( int n=0; n<num; n++) {

PREFETCH4(boundary[n]);

cellvalue[boundary[n].left ] -= boundary[n].flux;

cellvalue[boundary[n].right] += boundary[n].flux;

};

The following measurements were all done with the GNU compiler and high optimization (-O6).

The boundary struct is 2 quadwords wide, hence the reduced loops contain 2 quadword loads and 1 store per iteration, whereas the full loops contain 2 loads and 2 quadword stores.

In the table below, the measured FSB cycles are compared to the expected ones according to the rule of thumb (see section 2.3). Depending on the offset between the left and right cell numbers of the boundaries, one cellvalue may still be in cache by the time the CPU does the second update to that cell. If this is the case, there are only two quadword loads and one store going over the FSB. The two cases are listed separately.

                        reduced            from memory                from cache
                        1 st     1 st+p    2 st     2 st+p    exp.    2 st     2 st+p    exp.
Laptop Pentium          17.1     13.6      22.1     19.7      18.2    18.4     16.1      11.4
Pentium2 400MHz         13.1     9.1       16.4     13.8      12.6    15.0     11.4      8.1
Pentium2 600MHz         12.6     6.5       15.6     11.3      10.2    13.3     8.3       6.6
Pentium 866             13.9     8.3       19.6     14.0      14.2    14.7     9.5       8.7
Athlon 1GHz (AMD750)    8.55     5.25      12.1     7.9       7.4     11.0     5.7       4.9
Athlon 1GHz (KT133)     10.9     8.0       17.5     12.9      11.2    15.0     8.6       6.8

The table lists the number of necessary FSB cycles for data completely from memory and partially from cache.


Note that these cycles do not show the raw performance but represent the ratio between the time needed for the execution and the FSB cycle time (which differs from system to system). Therefore, these numbers show how well the code fits a particular machine, or vice versa.
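For example, for the full loop with all data from memory on the Athlon 1 GHz + AMD 750, the rule of thumb predicts 2 · 1.2 + 2 · 2.5 = 7.4 FSB cycles per iteration; if one cellvalue is still in cache, only one store goes over the FSB and the prediction drops to 2 · 1.2 + 1 · 2.5 = 4.9. These are the values in the “exp.” columns of the table above.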

If the cell offset is 1, the same cell is used twice in immediately following operations. This is called a store-to-load dependency: the data from the first operation gets stored and read in again immediately. Although this write only goes to cache, it forces the CPU to wait for the first operation and the write to complete. AMD has implemented a store-to-load optimization, therefore this effect is barely noticeable on the Athlon, whereas losses of about 10% can be measured on a Pentium system. When working with longer vectors, this effect will probably become less important, since several components have to be stored before the first one gets loaded again.

2.7 SMP versus serial program

As long as the memory subsystem forms the bottleneck for the computation, it will not be of any use to parallelize the program to run on an SMP (symmetric multi-processing) system such as the popular dual Pentium 2 systems.

The upcoming dual-Athlon systems based on the AMD 760 MP chipset may behave somewhat differently, since they use independent point-to-point connections between the processors and the interface to the chipset (the so-called northbridge), in contrast to the Intel systems, which use a shared bus for both processors.

However, in the light of the above results, it is not clear whether this will help much, since the memory subsystem needs to be strengthened along with the growing CPU power as well.

3 Consequences for the implementation of a numerical scheme

All of the above considered the processing of data from memory, not data from cache. This is particularly important for two reasons:

• Decent compilers have much knowledge about the characteristics of modern processors and go to great lengths, using dependency graphs and other tools, to reduce CPU cycles. This works well for data from cache; there is little gain to be expected from manual intervention here.

The same compilers, however, usually have very little knowledge about memory access characteristics and the access patterns used in the actual program in question. At this point, the programmer can use his knowledge about memory access patterns to help the compiler produce better performing code.


• If the dataset is large enough, it will not fit into the CPU cache anymore, so we inevitably end up in the data-from-memory situation anyway. However, not everything is lost. First, we have shown above how to effectively pull the data through the cache to keep the data rate high. Second, by modifications to the sequence order of the data (see section 3.3) as well as by choices for the control flow structure of the program (see section 3.5), we can influence how often the data is pulled through the cache, so the total amount of transferred data can be reduced.

Manipulation of the processing sequence is known in numerical linear algebra as strip-mining and cache blocking. The next sections discuss possibilities in a finite volume context.

3.1 Finite volume example

Finite volume discretizations are popular in the domain of computational fluid dynamics. They work by partitioning the computational domain into a large number of small so-called (computational) cells, on which the solution is approximated by simple functions such as (cellwise) constant or linear ones. The evolution in time of that numerical solution is modeled in terms of fluxes, which represent the boundary integral of the transport of certain quantities over the cell boundary. These quantities usually are physically conserved ones such as mass, momentum or energy.

As these quantities do not change in total (if they do, they do so by additional source terms), a cell wins what another cell loses. This is a highly desired property for physically conserved quantities.

Figure 9: finite volume basics: fluxes over the cell boundaries (west, east, north, south)


3.2 Striking the balance

To obtain the time evolution of the solution for given fluxes, it is necessary to strike a balance of in- and outflow for every cell, see figure 9. By iterating over all cells, every cell gets worked on once, whereas the cell boundaries get loaded twice. By looping over the cell boundaries, these get loaded once, but the cell data structures get loaded twice. There are roughly half as many cells as boundaries, so the first possibility seems favorable. However, it suffers from the following shortcomings:

• The data structure for boundaries is likely to be larger than the one for the cells.

• The cell-focused loop results in an unfavorable access pattern.

• The singlepass technique (see section 3.5) basically enforces the iteration by boundaries.

Therefore, the boundary-focused loop is discussed here.

3.3 Processing sequence

The next point to choose is the sequence in which to iterate over the boundaries, which implies a favorable sequence in which to store them. For the fluxes in west-east direction, it is best to work by rows (and store the data in that way as well):

for(j=0; j<nycells; j++)
  for(i=0; i<nxcells-1; i++) {
    boundary[nbound].left =i  +nxcells*j;
    boundary[nbound].right=i+1+nxcells*j;
    nbound++;
  };

So each west-east boundary and each cell gets loaded once, since the cell values can stay in cache for reuse in the next step. Having chosen such a layout, working on the south-north boundaries is harder now; the sequence of these is still to be chosen.

Working by columns would imply accessing the cell data in a noncontiguous manner, which is generally not to be recommended. You will have to make sure that the stride by which you access the data is such that you do not suffer from cache stumbling: if the distance of the cells associated with a boundary is a multiple of the length of the cache, all the cellvalue accesses compete for the same places in the cache. For direct mapped caches this is disastrous; for higher associative caches it is slightly better. Prefetches will further aggravate the situation as they introduce additional cachelines into the competition.


Figure 10: work by row

The separation of the selection and the processing of data, as proposed by Schönauer [9], does not look very promising either, since the data does not get reused many times, so the cost of data movement would outweigh the gain. Therefore a reasonable way is to process the south-north boundaries by rows as well:

Figure 11: work by row: south-north boundaries

for(j=0; j<nycells-1; j++)
  for(i=0; i<nxcells; i++) {
    boundary[nbound].left =i+nxcells*j;
    boundary[nbound].right=i+nxcells*(j+1);
    nbound++;
  };

The consequences for the data reuse from cache depend on the grid size. For a Pentium 3, the 4-way associative 512 kB cache may be interpreted as 4 “layers” of 128 kB length; that corresponds to roughly 5400 cells (128 kB / 24 bytes) if each cell has to store 3 double precision numbers. Therefore, if the rows of the grid contain fewer than 5400 cells, the processing of the next row should find the cell values from the previous run in cache, so effectively each cell value will get loaded once in this second step as well.

We encountered exactly this situation already in section 2.6, where the measurements for the data partially from cache approximated the reduced loop with only one store (though not perfectly).


If you use this scheme, choose the shorter edge of the grid to correspond to the rows.

Figure 12: work by blocks of rows: south-north boundaries

As the number of cells in a row grows, the cell values will drop out of cache before they can be used again and therefore need to be loaded twice for the south-north boundaries.

In this case, it is better to switch the strategy and to work on several rows at once (see figure 12). If we were working with physical addresses (see section 1.4), the optimal number of rows could be deduced by:

nrows = n_associative − sizeof(struct boundary) / sizeof(struct celldata) − 1

for(j=0; j<nycells-1; j+=nrows)
  for(i=0; i<nxcells; i++) {
    if (j+nrows>nycells-1) nrows=nycells-1-j;
    for(jj=j; jj<j+nrows; jj++) {
      boundary[nbound].left =i+nxcells*jj;
      boundary[nbound].right=i+nxcells*(jj+1);
      nbound++;
    };
  };

When calculating the expected performance, assume that the cell data from the beginning of a row has been pushed out of the cache by the time the CPU starts working on the next bunch of rows, and assume the absence of cache thrashing. In each sweep, the CPU will encounter (nrow+1) * nxcells stores while working on nrow * nxcells boundaries; this gives a ratio of

    (nrow + 1) * nxcells / (nrow * nxcells) = 1 + 1/nrow

stores per boundary. Due to the size of 16 bytes per boundary, there are two quadword loads per boundary as well.


This is still not ideal, since every cell still gets loaded at least twice, due to the two passes of first working on the west-east boundaries and then on the south-north boundaries. We can do better, with the previous paragraph leading the way: while we have the data in cache for the south-north work, we can intersperse the west-east work as well, thereby avoiding loading most of the cells twice (see figure 13).

Figure 13: work by blocks of rows

for(j=0; j<nycells-1; j+=nrows) {
  for(i=0; i<nxcells-1; i++) {
    if (j+nrows>nycells-1) nrows=nycells-1-j;
    for(jj=j; jj<j+nrows; jj++) {
      // west-east
      boundary[nbound].left =i  +nxcells*jj;
      boundary[nbound].right=i+1+nxcells*jj;
      nbound++;
      // south-north
      boundary[nbound].left =i+nxcells*jj;
      boundary[nbound].right=i+nxcells*(jj+1);
      nbound++;
    };
  };  // now i=nxcells-1
  for(jj=j; jj<j+nrows; jj++) {
    // south-north
    boundary[nbound].left =i+nxcells*jj;
    boundary[nbound].right=i+nxcells*(jj+1);
    nbound++;
  };
};  // now j=nycells-1
for(i=0; i<nxcells-1; i++) {
  // west-east
  boundary[nbound].left =i  +nxcells*j;
  boundary[nbound].right=i+1+nxcells*j;
  nbound++;
};

In fact, we get the data for the west-east boundaries for free, since it is in cache for the south-north boundaries anyway. In each sweep, the CPU will encounter (nrow+1) * nxcells stores (the same number as above) while working on the double number of 2 * nrow * nxcells boundaries; this gives a ratio of

    (nrow + 1) * nxcells / (2 * nrow * nxcells) = 1/2 + 1/(2 * nrow)

stores per boundary. While the number of stores remains constant, the number of loads doubles due to the double number of boundaries worked on. Therefore the expected cycles per boundary are determined by:

    (1/2 + 1/(2 * nrow)) stores + 2 loads                    (1)

In an ideal world, each boundary and each cell value would get loaded once in each run over the whole grid; this is exactly the asymptotic behavior of the above formula for large values of nrow.

3.4 Performance figures

Taking a 3000x3000 finite volume grid, a row of the grid consists of 24 kB of data for a scalar equation. This nicely fits into the cache of Pentium and Athlon processors, therefore the assumption about data from the previous row being pushed out of cache does not hold. Consequently, no speedup can be expected from varying the value of nrow.

The nrow technique seems, however, to work in favor of heavily loaded machines with long uptimes, where chances are that data from the previous row may be partially pushed out of the cache through occasional cacheline conflicts due to a randomized address translation. On freshly rebooted machines, nrow does not have a noteworthy effect.

The table lists the minimal obtained FSB cycles needed per boundary for small values of nrow, and compares them to the “ideal world” numbers of one store per cell and two loads per boundary. The sample implementation works fairly well on most machines. Note that the figures for specific values of nrow are not firmly reproducible on a day-by-day basis, due to the strong dependency on the address translation table.

                                 expected (ideal)   measured   loss
    Laptop Pentium 333                  8             11        37%
    Pentium2 400MHz+BX                  5.85           8.5      45%
    Pentium2 600MHz+BX                  4.8            6.5      35%
    Pentium 866+Via Apollo 133A         5.95           8.05     35%
    Athlon 1GHz+AMD750                  3.65           4.5      23%
    Athlon 1GHz+Via KT133               4.6            5.8      26%


When working on higher dimensional equations instead of scalars, the rows get larger in terms of occupied memory and may not fit into the cache anymore. This effect is mimicked by taking rows of 18000 cells, which corresponds to 6 components per cell. An 18000x18000 grid does not fit into PC memory anymore; such long rows are only possible by reducing the number of cells in the columns. For real computations it is then advisable to exchange the columns and rows as stated in section 3.6, hence this setup is for benchmarking only. On a freshly rebooted Athlon system, we get the performance depicted in figure 14.

Figure 14: performance for long rows (FSB cycles per boundary versus nrow, for rows of 16384 and 18000 cells, compared to the rule of thumb)

For a long row with 18000 cells, the asymptotic behavior of (1) is well reproduced and the expected performance is achieved fairly well. For a row length of 16384 = 2^14, however, a dramatic performance loss is induced by the effect of cache thrashing: several cachelines compete for the same cache positions. For small values of nrow, the L2 cache can limit the performance loss to a factor of less than 2, but as soon as nrow exceeds the 16-way associativity of the Athlon's L2 cache, the performance loss is about a factor of 4, which clearly should be avoided. For machines with longer uptime, the effect is less dramatic due to the perturbation of the address translation table.
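The arithmetic behind this effect can be sketched as follows, assuming (as a plausible model; the exact geometry is machine specific) an Athlon L2 cache of 256 kB organized as 16 ways of 16 kB, with the set index taken from the address bits below 16 kB. A row of 16384 cells of 8 bytes occupies exactly 128 kB, a multiple of 16 kB, so the corresponding cells of all rows map to the same cache sets and at most 16 of them can be held at once; every further row then evicts data that is still needed. A row of 18000 cells occupies 144000 bytes, which is not a multiple of 16 kB, so consecutive rows fall into different sets and can coexist in the cache.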


3.5 Singlepass versus multipass

Another technique to reduce the memory transfer is summarized as singlepass versus multipass. Practically speaking, multipass means computing the values of the flux in one loop first and then issuing a second loop to strike the balance. Singlepass means doing the flux computation in the balance-striking loop as well, which avoids writing the flux values to memory and loading them back into the CPU, thereby saving bandwidth and time.
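The following minimal sketch contrasts the two variants; it is not the actual solver code of this paper, and the Boundary layout, the numerical_flux function and the time step dt are hypothetical stand-ins for whatever the concrete scheme uses.

struct Boundary { int left, right; };

/* placeholder flux function, purely illustrative */
static double numerical_flux(double ul, double ur) { return 0.5 * (ul - ur); }

/* multipass: the fluxes are written to memory in a first loop and read back
   in a second one; one extra store and one extra load per boundary */
void update_multipass(double *cell, double *flux, const struct Boundary *b,
                      int nbound, double dt)
{
    for (int k = 0; k < nbound; k++)
        flux[k] = numerical_flux(cell[b[k].left], cell[b[k].right]);
    for (int k = 0; k < nbound; k++) {
        cell[b[k].left]  -= dt * flux[k];
        cell[b[k].right] += dt * flux[k];
    }
}

/* singlepass: each flux value stays in a register and is consumed at once,
   so the flux array and its memory traffic disappear completely */
void update_singlepass(double *cell, const struct Boundary *b,
                       int nbound, double dt)
{
    for (int k = 0; k < nbound; k++) {
        double f = numerical_flux(cell[b[k].left], cell[b[k].right]);
        cell[b[k].left]  -= dt * f;
        cell[b[k].right] += dt * f;
    }
}

The floating point work is identical in both variants; only the memory traffic for the flux array differs.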

Another example is the incorporation of the west-east and south-north boundaries into a single step at the end of section 3.3. The number of operations remains the same; due to interspersing the different steps, programming and control flow get more complex.

If the computation of the flux values cannot be done on the 80x87 floating point register stack alone but involves a substantial volume of further memory accesses, it may be advisable to reduce the number of rows worked on simultaneously by one, in order to reserve one cache layer for these memory accesses. It is probably best to leave nrow as a tuning parameter and to adapt it for best overall performance.

3.6 Grid size

The length of a row should not be a power of 2, to avoid cache thrashing by column-wise access to the grid values (see figure 14). It is generally better to choose the shorter edge of the grid as rows, to maximize the chances of finding data still in cache when working on the next row.

If you obey this rule, the caches of the Pentium and Athlon processors will indeed prove mostly large enough to hold one row's worth of data of 2d meshes. Considering unknowns of dimension n, the above scheme needs to store approximately 3n + 2 quadwords per cell. Using 1 GB of RAM, it is possible to work on about 10^8/(3n + 2) cells, which yields 10^4 ∗ √(3n + 2) quadwords, or √(3n + 2) ∗ 80 kB, per row under the unfavorable assumption of a quadratic mesh. For n = 3, this is about 265 kB of cell and boundary data combined, so it will fit into the 512 kB L2 cache of the Pentium.
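As a quick check of this arithmetic: for n = 3 a cell takes 3n + 2 = 11 quadwords (88 bytes), a quadratic mesh filling 1 GB has about 10^4/√11 ≈ 3000 cells per row, and 3000 ∗ 88 bytes ≈ 265 kB. For n = 6 the same estimate gives √20 ∗ 80 kB ≈ 360 kB per row, which no longer fits into the Athlon's 256 kB L2 cache; this is the situation mimicked by the long-row benchmark of section 3.4.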

Things are somewhat different for 3d meshes, where the processor caches may turn out to be too small to hold a whole grid layer. At that point, the nrow technique, applied to two space dimensions, will help to maintain the expected performance of the numerical code.

3.7 Structured versus unstructured grids

The unstructured programming style of the finite volume example did not reveal any important performance penalties compared to a structured approach. Since the neighborhood is implicitly given for structured approaches, they do not need to store the topology information as unstructured approaches have to. However, this saving of one quadword load per boundary (two integers) is paid for with the inflexibility of structured meshes.

In any case, great care has to be taken in what order the boundaries are fed into the finite volume solver, with section 3.3 showing the way. If your grid is composed of several macro-elements, it seems natural to apply the above sequencing scheme to each of them separately. For high resolution meshes, the less efficient data access at macro-element boundaries will vanish among the work within the macro-elements, due to the relatively small number of cells at these boundaries compared to the cells contained in the macro-elements.

A mesh refinement should not add new cells and boundaries at the end of the list, but insert them where they belong. Refining macro-element by macro-element seems the most natural and easiest way to do so.

3.8 Page coloring

There is - to my knowledge - no operating system support in the Linux kernel for any kind of page coloring in the address translation table.²

It would be nice to have the possibility of asking the operating system to reorganize the 4 kB memory pages so that the ones of a specific process are contiguous, and to report whether it was successful in doing so. With this possibility, one could exploit more specific characteristics and sizes of the CPU caches, whereas we currently have to live with some gradual degradation due to the randomization of the address translation.

Afterword

This paper is about how to get a well performing computer code for a given numerical scheme, but you should not forget that all these points do not change the asymptotic complexity of the underlying algorithm, only the constants involved.

Comparing two schemes with asymptotic complexities O(n) and O(n^2), there will always be a certain dataset size n beyond which the first will be faster, no matter how badly it is written. However, when comparing schemes with the same asymptotic behavior, the implementation quality becomes crucial.

Therefore, for best results, it is important to have both a good numerical scheme and a good implementation.

² get_dma_pages is only available to kernel modules, not to user space programs


Acknowledgment

The author would like to thank Prof. Rentrop, Prof. Schönauer and Prof. Ungerer for their speculations on the reason why the stores might be so much slower than the loads, Prof. Rüde for the suggestion to study memcopy routines, and Prof. Turek for the nice "Elch Test".

Download

The source code for the measurements in section 2.2 can be downloaded from

http://www.mathematik.uni-karlsruhe.de/~schulz/Preprint

(look for bandwidth.cc). It consists of a C++ source file with inline asm statements for the GNU C++ compiler on the Intel/AMD platform. This enables you to estimate the number of FSB cycles needed for the loads and stores and to apply the rule of thumb to your particular machine.
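If the download is not available, the basic idea can be sketched as follows; this is not the actual bandwidth.cc, and both the array size and the core-to-FSB clock ratio are assumptions that have to be adapted to the machine at hand.

#include <cstdio>
#include <stdint.h>

// read the processor's time stamp counter (x86, gcc inline asm)
static inline uint64_t rdtsc()
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main()
{
    const int n = 1 << 23;                 // 8M doubles = 64 MB, far beyond any cache
    double *a = new double[n];
    for (int i = 0; i < n; i++) a[i] = 1.0;

    uint64_t t0 = rdtsc();
    double s = 0.0;
    for (int i = 0; i < n; i++)            // one load per double
        s += a[i];
    uint64_t t1 = rdtsc();

    // assumed core-to-FSB clock ratio, e.g. 10 for a 1 GHz CPU on a 100 MHz bus
    const double core_per_fsb = 10.0;
    std::printf("sum = %g, FSB cycles per load: %.2f\n",
                s, (double)(t1 - t0) / core_per_fsb / (double)n);
    delete[] a;
    return 0;
}

Printing the sum keeps the compiler from removing the loop; timing an analogous loop that stores into the array gives the corresponding figure for writes.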


References

[1] AMD Athlon Processor x86 Code Optimization Guide;
    http://www.amd.com/products/cpg/athlon/techdocs/

[2] Agner Fog: How to optimize for the Pentium family of microprocessors;
    http://home.snafu.de/phpr/pentopt.html.gz#29

[3] GCC documentation;
    http://gcc.gnu.org/onlinedocs/gcc-3.0/gcc_5.html#SEC102

[4] IA32 Intel Architecture Software Developer's Manual;
    http://www.intel.com/design/PentiumIII/manuals/

[5] Intel Architecture Software Optimization Reference Manual;
    http://www.intel.com/design/PentiumIII/manuals/

[6] Oliver Müller: Assembler Referenz; Franzis Verlag, Poing, 2000

[7] Bob Neveln: Linux Assembly Language Programming; Prentice Hall, Upper Saddle River, 2000

[8] Michael L. Overton: Numerical Computing with IEEE Floating Point Arithmetic; SIAM, Philadelphia, 2001

[9] Willi Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers; ISBN 3-00-005484-7

[10] Stefan Turek: Konsequenzen eines numerischen 'Elch Tests' für Computersimulation;
     http://www.mathematik.uni-dortmund.de/htmldata1/featflow/ture/paper/elch.ps.gz

[11] Brennan Underwood;
     http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html


Contents

1 Considerations on the underlying computing system
  1.1 Memory demand
  1.2 Cache organization
  1.3 Bandwidth and Latency
      1.3.1 Prefetch
      1.3.2 Preload
  1.4 Virtual address translation
      1.4.1 TLB bottleneck
      1.4.2 Quadword alignment
      1.4.3 Memory bank conflicts

2 Pentium processors from scientific computing viewpoint
  2.1 Structure of the Intel 80x87 FPU
      2.1.1 FPU performance
  2.2 Measurements of memory transfers
      2.2.1 Reading contiguous double precision data
      2.2.2 Reading contiguous double precision data using prefetch
      2.2.3 Reading sparse integer data
      2.2.4 Writing contiguous double precision data
      2.2.5 Writing contiguous double precision data using prefetch
      2.2.6 Writing integer data using prefetch
  2.3 Rule of thumb
  2.4 Linked triad
  2.5 Compiler influences
  2.6 Indirect addressing
  2.7 SMP versus serial program

3 Consequences for the implementation of a numerical scheme
  3.1 Finite volume example
  3.2 Striking the balance
  3.3 Processing sequence
  3.4 Performance figures
  3.5 Singlepass versus multipass
  3.6 Grid size
  3.7 Structured versus unstructured grids
  3.8 Page coloring
