
Journal of Systems Architecture 54 (2008) 638–650

High-performance computing of $1/\sqrt{x_i}$ and $\exp(-x_i)$ for a vector of inputs $x_i$ on Alpha and IA-64 CPUs

Md. Haidar Sharif a,*, Achim Basermann b, Christian Seidel c, Axel Hunger d

a ISE, Computer Engineering, University of Duisburg-Essen, 47057 Duisburg, Germany
b NEC Europe Ltd., C&C Research Laboratories, 53757 Sankt Augustin, Germany
c Max Planck Institute of Colloids and Interfaces, 14424 Potsdam, Germany
d Technische Informatik, University of Duisburg-Essen, 47057 Duisburg, Germany

* Corresponding author. Fax: +49 203 379 4221. E-mail addresses: [email protected] (Md. Haidar Sharif), [email protected] (A. Basermann), [email protected] (C. Seidel), [email protected] (A. Hunger).

Received 22 March 2007; accepted 12 November 2007. Available online 23 November 2007. doi:10.1016/j.sysarc.2007.11.001

Abstract

As modern microprocessors have become more sophisticated, the performance of software on modern architectures has grown more and more difficult to dissect and prognosticate. The execution of a program nowadays entails the complex interaction of code, compiler, and processor micro-architecture. The built-in functions of math libraries and hardware to compute $1/\sqrt{x}$ or $\exp(-x)$ are often incapable of achieving the challenging performance of high-performance numerical computing. To meet this demand, the computation of $1/\sqrt{x_i}$ and $\exp(-x_i)$ for a vector of inputs $x_i$ has been optimized for the specific processors Alpha 21264 & 21364 and IA-64, and is significantly faster than optimized library routines. A detailed deliberation of how the processor micro-architecture as well as manual optimization techniques improve the computing performance is developed.
© 2007 Elsevier B.V. All rights reserved.

Keywords: $1/\sqrt{x_i}$; $\exp(-x_i)$; Alpha; IA-64; Loop unrolling; Software pipelining; In-order/out-of-order scheduling

1. Introduction

The computation of the functions $y = 1/\sqrt{x}$, $y = e^x$, or $y = 1/e^x$ is a typical and time-consuming task in numerical simulations. These functions can be computed efficiently and accurately by calling math library routines, which also deal gracefully with exceptional cases like underflow, overflow, or input arguments like $\pm\infty$. But standard routines are often incapable of achieving the demanding performance of high-performance computing. The idea of computing $1/\sqrt{x}$ quickly in software to speed up the computation of Coulombic potentials is not new [1]. Assume $N$ particles with charge $q_k$ and coordinates $(x_k, y_k, z_k)$; ignoring the constants $\varepsilon_0$ and $4\pi$, the mathematical formula for the potential $\phi_i$ of the $i$-th particle is $\phi_i = \sum_{k \neq i} \phi_{ik} = \sum_{k \neq i} q_i q_k / r_{ik}$, where $r_{ik} = \sqrt{(x_i - x_k)^2 + (y_i - y_k)^2 + (z_i - z_k)^2} = \sqrt{x}$ is the distance between particle $i$ and particle $k$ [2]. Since the potential $1/r_{ik}$ or $1/\sqrt{x}$ is singular at $r_{ik} = 0$, approximations for $1/r_{ik}$ are usually only valid for $r_{ik}$ large enough, i.e., $x > 0$. We can compute $y = 1/\sqrt{x}$ or $y = 1/e^x$ in two steps: first $z = \sqrt{x}$ or $z = e^x$ by calling a library routine, and then $y = 1/z$ by hardware division. This approach limits the performance, as hardware division is relatively slow and the efficiency of library routines is restricted, since they have to solve the scalar problem. But we may compute several $x$ independently using suitable codes and the processor micro-architecture. Our current development in constructing high-performance numerical computing for the specific processors Alpha 21264 & 21364 and IA-64 has been optimized for $1/\sqrt{x_i}$ and the exponentials $e^{-x_i}$ with a vector of input arguments $x_i$, which are processed somewhat in parallel. Of course the loss in accuracy has to be negligible compared to the standard math library routines. As a number can be represented in an unlimited number of ways, the optimized codes for $1/\sqrt{x_i}$ and $e^{-x_i}$ expect $x_i$ to be a normalized floating-point number. The codes also impose other restrictions: (i) none of the special floating-point values like NaN (Not a Number), NaTVal (Not a Thing Value), or $\pm\infty$ are allowed; (ii) the input arguments $x_i$ are considered as a vector whose length must be $n = 4 \times T$ where $T = 1, 2, 3, 4, \ldots, \lfloor N/4 \rfloor$.
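For concreteness, the two-step library computation just described can be written as a minimal C baseline. This sketch is ours, not the paper's code; the routine name rsqrt_builtin and the length check are illustrative. It is exactly the kind of routine the optimized codes are later measured against.

    #include <assert.h>
    #include <math.h>

    /* Two-step baseline described above: z = sqrt(x) via the library,
     * then y = 1/z by hardware division.  The length restriction
     * n = 4*T mirrors the requirement the optimized codes impose. */
    void rsqrt_builtin(const double *x, double *y, int n)
    {
        assert(n > 0 && n % 4 == 0);    /* n = 4*T, T = 1, 2, 3, ... */
        for (int i = 0; i < n; i++)
            y[i] = 1.0 / sqrt(x[i]);    /* library sqrt + hardware division */
    }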

An important feature of the micro-architecture of the Alpha 21264 & 21364 is that the floating-point add pipeline and the floating-point multiply pipeline are fully pipelined [3]. Full pipelining makes sense if new operations can be started without regard for what happened before [4]. The IA-64 is an architectural idea that was developed for vectorizable programs. The IA-64 architecture features a revolutionary 64-bit ISA (Instruction Set Architecture) which applies a new processor architecture technology called EPIC (Explicitly Parallel Instruction Computing). A key feature of the IA-64 architecture is IA-32 instruction set compatibility [5]. The IA-64 processor core is capable of up to six issues per clock, with up to three branches and two memory references [6]. However, compiler quality is critical for IA-64 performance. The Alpha 21264 & 21364 processors are capable of exploiting both static and dynamic instruction-level parallelism (ILP); an IA-64 processor, in contrast, is only able to exploit static ILP. If the compile-time predictions are correct, both the out-of-order Alpha 21264 & 21364 and the in-order IA-64 perform very well. When the compiler is wrong, however, the out-of-order Alpha 21264 & 21364 processors can adapt and continue to perform well enough, whereas the EPIC-based IA-64 runs slowly but correctly.

In all optimized implementations one add and one multiply are issued each cycle, and the result becomes available after the result latencies of both add and multiply. Manual optimization techniques such as loop unrolling and software pipelining have been used to compute $1/\sqrt{x_i}$ or $e^{-x_i}$ more efficiently (a small illustrative sketch follows below). Pipelining can overlap the execution of instructions when they are independent of one another. This potential overlap among instructions is called ILP, since the instructions can be evaluated in parallel [6]. The loop unrolling technique combines several iterations of a loop into one basic block. This is useful to reduce the cost of loop overhead for small loop bodies and to increase the potential for ILP [2]. The technique of software pipelining is a way to overlap the execution of different instances of the loop body in a systematic way. It primarily helps to hide latencies due to data dependences within a single iteration [2]. Loop unrolling and software pipelining have been combined by first unrolling the loop, thereby increasing the size of the loop body to some extent, and then software pipelining the resulting codes, which increases the potential for ILP a bit more.
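As a minimal illustration of the unrolling half of this combination (our sketch, not the paper's hand-scheduled code), the loop below computes four independent results per iteration so that their add and multiply latencies can overlap in the fully pipelined FP units; true software pipelining would additionally interleave phases of successive iterations by hand.

    #include <math.h>

    /* placeholder kernel; in the paper this body is the hand-scheduled
     * binomial or Goldschmidt computation */
    static inline double recip_sqrt(double v) { return 1.0 / sqrt(v); }

    /* 4x loop unrolling: the four computations per iteration carry no
     * dependences on one another, so compiler and CPU can keep the
     * floating-point add and multiply pipelines busy every cycle. */
    void rsqrt_vec4(const double *x, double *y, int n)   /* n = 4*T */
    {
        for (int i = 0; i < n; i += 4) {
            y[i]     = recip_sqrt(x[i]);
            y[i + 1] = recip_sqrt(x[i + 1]);
            y[i + 2] = recip_sqrt(x[i + 2]);
            y[i + 3] = recip_sqrt(x[i + 3]);
        }
    }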

Our work is an incremental improvement of the optimization result presented by Strebel [2]. Among other things, we have extended the investigations by optimizing $1/\sqrt{x_i}$ and $e^{-x_i}$ for Alpha 21264 & 21364 and IA-64 CPUs. Since the optimization strategies for $1/\sqrt{x_i}$ and $e^{-x_i}$ are very similar to each other [2], for the sake of an appropriate length of the paper we omit the presentation of the specific optimization steps for $e^{-x_i}$ but show the corresponding optimized computing times along with the results for $1/\sqrt{x_i}$.

2. Algorithms for $1/\sqrt{x}$ computation

In this section, we briefly present the binomial and Goldschmidt's algorithms, preceded by their argument reduction. The faster and less accurate algorithm is based on Goldschmidt's algorithm, while the accurate algorithm uses the binomial expansion. A detailed description of algorithms to compute $1/\sqrt{x}$ and $e^{-x}$ in general can be found in [2].

Floating-point numbers are stored as rows $\langle s|e|m \rangle$ corresponding to the value $x = (-1)^s \cdot 2^e \cdot \langle 1.m \rangle$, if we ignore the bias from $e = \text{exponent} - \text{bias}$ [2]. Because $1/\sqrt{x}$ is defined for $x > 0$, the sign bit is henceforth assumed to be $s = 0$. For $e = 2e' + e''$ with $e' = \lfloor e/2 \rfloor$ we may then write $\frac{1}{\sqrt{x}} = 2^{-e'} \cdot \frac{1}{\sqrt{x_m}}$ with $x_m = 2^{e''} \cdot \langle 1.m \rangle$, where $e'$ and $e''$ are integral. To further reduce the argument $x_m$ we find an approximation $x_t = 2^{e''} \cdot \langle 1.t|1|0 \ldots 0 \rangle \approx x_m = 2^{e''} \cdot \langle 1.m \rangle$, where $t$ contains the first few bits of $m$, so that $y_t = \frac{1}{\sqrt{x_t}}$ may be determined by table lookup with index $\langle e''|t \rangle$, which is $b$ bits wide. Assuming we have determined $e'$, $x_t$, $y_t = \frac{1}{\sqrt{x_t}}$, and $x_s = x_m - x_t$, we may write $\frac{1}{\sqrt{x}} = 2^{-e'} \cdot \frac{1}{\sqrt{x_t}} \cdot \frac{1}{\sqrt{x_m/x_t}} = 2^{-e'} \cdot y_t \cdot (x_m y_t^2)^{-\frac{1}{2}} = 2^{-e'} \cdot y_t \cdot (1 + x_s y_t^2)^{-\frac{1}{2}}$.

To compute $\frac{1}{\sqrt{x}}$ for $x \approx 1$ we may use the binomial series $(1+z)^a = \sum_{k=0}^{\infty} \binom{a}{k} z^k$ with $a = -1/2$, which converges absolutely for $|z| < 1$. For arbitrary $x$ we apply the argument reduction steps explained above to find $\frac{1}{\sqrt{x}} = 2^{-e'} \cdot y_t \cdot (1 + z_s)^{-\frac{1}{2}}$ for small $z_s = x_s y_t^2$. Instead of computing $y_s = (1+z_s)^{-\frac{1}{2}} \approx 1$, we compute $u_s = y_s - 1$ directly using $u_s = y_s - 1 = \sum_{k \geq 1} p_k z_s^k$, the initial coefficient being $p_0 = 1$. Furthermore, we need the value $\frac{1}{\sqrt{x_t}}$, which we find by table lookup, with some extra precision. Two simple options for representing a number with extra accuracy are the sum $\frac{1}{\sqrt{x_t}} = y_t + d_t$ or the product $\frac{1}{\sqrt{x_t}} = y_t(1 + \epsilon_t)$. We could use either one, but we choose the product representation for technical reasons. The refined formula may then be written $\frac{1}{\sqrt{x}} = 2^{-e'} \cdot y_t(1+\epsilon_t)(1+u_s) = 2^{-e'} \cdot (y_t + y_t(\epsilon_t + u_s))$, where we have neglected the small contribution $\epsilon_t u_s$ in the approximation $(1+\epsilon_t)(1+u_s) \approx 1 + \epsilon_t + u_s$.

A schematic representation of the code which calculates $1/\sqrt{x}$ using the binomial series for $x \approx 1$ is given below [2].

    $\langle s|e'|e''|t|\cdot \rangle = x$                → extract fields
    $x_t \cdot 2^{2e'} = 2^e \cdot \langle 1.t|1|0 \ldots 0 \rangle$
    $x_s \cdot 2^{2e'} = x - x_t \cdot 2^{2e'}$
    $y_t(1+\epsilon_t) = \frac{1}{\sqrt{x_t}}$            → use $\langle e''|t \rangle$ for table lookup
    $y_h = 2^{-e'} \cdot y_t$
    $z_s = x_s \cdot y_t^2 = (x_s \cdot 2^{2e'}) \cdot y_h^2$
    $u_s = -\frac{1}{2} z_s + \frac{3}{8} z_s^2 - \cdots$  → $1 + u_s = 1/\sqrt{1+z_s}$
    $y = y_h + y_h(\epsilon_t + u_s)$
    return $y$
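The "extract fields" step above works directly on the bit pattern of the argument. A small self-contained sketch of that step for IEEE 754 double precision is given below. It is our illustration, not the paper's code; the variable names follow the text's $s$, $e'$, $e''$ notation and the printing is incidental.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Extract <s|e|m> from a double and split the unbiased exponent into
     * e' = floor(e/2) and e'' = e - 2e', as required by the argument
     * reduction 1/sqrt(x) = 2^{-e'} / sqrt(x_m) with x_m = 2^{e''}<1.m>. */
    int main(void)
    {
        double x = 3.25;                 /* any normalized positive input */
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);  /* safe type punning */

        int s      = (int)(bits >> 63);                  /* sign (must be 0) */
        int e      = (int)((bits >> 52) & 0x7FF) - 1023; /* unbiased exponent */
        uint64_t m = bits & 0xFFFFFFFFFFFFFull;          /* 52 mantissa bits */

        int ep  = (e >= 0) ? e / 2 : (e - 1) / 2;        /* e' = floor(e/2) */
        int epp = e - 2 * ep;                            /* e'' in {0, 1}   */

        printf("s=%d  e=%d  e'=%d  e''=%d  m=0x%013llx\n",
               s, e, ep, epp, (unsigned long long)m);
        return 0;
    }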

The simple idea of Goldschmidt's algorithm is to find a sequence $x = \frac{x_0}{y_0^2} = \frac{x_1}{y_1^2} = \cdots = \frac{x_k}{y_k^2} = \cdots$ such that $x_k \to 1$. It is easy to see that $x = x_\infty / y_\infty^2$ and $x_\infty = 1$ imply that $y_\infty = 1/\sqrt{x}$. Assuming $x_k \to 1$, we write the quotient for the step $k \to k+1$ as $\frac{x_k}{y_k^2} = \frac{x_k r_k^2}{(y_k r_k)^2} =: \frac{x_{k+1}}{y_{k+1}^2}$, where $r_k \approx 1/\sqrt{x_k}$. Using $x_k = 1 + d_k$, a good approximation to $1/\sqrt{x_k}$ is given by $r_k = 1 - \frac{d_k}{2} = \frac{3}{2} - \frac{x_k}{2}$. With this choice of $r_k$, the step $x_k \to x_{k+1}$ may be written $x_{k+1} = x_k r_k^2 = (1+d_k)(1 - \frac{d_k}{2})^2 = 1 - \frac{3}{4} d_k^2 + \frac{1}{4} d_k^3 =: 1 + d_{k+1}$, i.e., $(x_k)$ converges quadratically to 1. To obtain $x_0 \approx 1$ for arbitrary $x$, the above argument reduction is applied. Thus, the computation of $1/\sqrt{x}$ for arbitrary $x$ is reduced to the interval $1 \leq x_m < 4$. The multiplication with $2^{-e'}$ is easily done by adding $-e'$ to the exponent field. To further reduce the argument $x_m$ we consider only the first $t$ bits of the mantissa $m$, i.e., $1/\sqrt{x} \approx y = 2^{-e'} \cdot y_t$ with $y_t = 1/\sqrt{x_t}$, where $x_t = 2^{e''} \langle 1.t|1|0 \ldots 0 \rangle$. The value $y_t = 1/\sqrt{x_t}$ can be determined by table lookup with index $\langle e''|t \rangle$, which is $t+1$ bits wide. Note that the maximum absolute error caused by this approximation is $2^{-(t+1)}$. Starting the iteration with $x_0 = x_m y_t^2$ and $y_0 = y_t$, it is easily seen that $|d_0| = |(x_m - x_t)/x_t| \leq 2^{-(t+1)}$. Using the approximation $d_{k+1} = \frac{3}{4} d_k^2$ and setting $t = 6$, the deviation from $x_\infty = 1$ after three iteration steps is given by $d_3 \approx 6 \times 10^{-18}$, which is assumed to be sufficiently small. A schematic representation of the code which calculates $1/\sqrt{x}$ in a three-step Goldschmidt's algorithm is given below [2].

    $\langle s|e'|e''|t|\cdot \rangle = x$     → extract fields
    $x_t = 2^{e''} \cdot \langle 1.t|1|0 \ldots 0 \rangle$
    $y_t = \frac{1}{\sqrt{x_t}}$               → by table lookup
    $y = 2^{-e'} \cdot y_t$
    $x = x \cdot y^2$                          → $x \approx 1$
    for $k = 1, 3$ do                          → three-step iteration
        $r = \frac{3}{2} - \frac{x}{2}$
        $y = y \cdot r$
        $x = x \cdot r^2$                      → with loop unrolling
    enddo
    return $y$
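A runnable sketch of this iteration in C follows. It is our illustration under simplifying assumptions: the table-lookup seed $y_t$ is replaced by a crude frexp-based linear guess (so $|d_0|$ is larger than with the paper's $t = 6$ table, and four steps are used instead of three); only the iteration structure itself is faithful to the pseudocode.

    #include <math.h>
    #include <stdio.h>

    /* Goldschmidt iteration for 1/sqrt(x): keep the invariant x_k / y_k^2
     * constant while driving x_k -> 1, so that y_k -> 1/sqrt(x_m). */
    static double goldschmidt_rsqrt(double x)
    {
        int e;
        double m = frexp(x, &e);       /* x = m * 2^e, m in [0.5, 1)     */
        if (e & 1) { m *= 2.0; e--; }  /* make e even, m in [0.5, 2)     */

        double y  = 2.0 / (1.0 + m);   /* crude seed for 1/sqrt(m); the  */
                                       /* paper uses a table lookup here */
        double xk = m * y * y;         /* x_0 = x_m * y_0^2, close to 1  */
        for (int k = 0; k < 4; k++) {  /* quadratic convergence          */
            double r = 1.5 - 0.5 * xk; /* r_k = 3/2 - x_k/2              */
            y  *= r;                   /* y_{k+1} = y_k * r_k            */
            xk *= r * r;               /* x_{k+1} = x_k * r_k^2          */
        }
        return ldexp(y, -e / 2);       /* apply the factor 2^{-e'}       */
    }

    int main(void)
    {
        for (double x = 0.5; x < 200.0; x *= 4.7)
            printf("x = %10.5f  approx = %.17g  libm = %.17g\n",
                   x, goldschmidt_rsqrt(x), 1.0 / sqrt(x));
        return 0;
    }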

3. Optimization for Alpha and IA-64 architectures

In this section, we will present the optimization techniques of the binomial algorithm to compute several $1/\sqrt{x}$ values in parallel for Alpha 21264 & 21364 and IA-64 CPUs.

3.1. Alpha 21164, 21264, & 21364 CPUs

The Alpha 21164 microprocessor EV5.X, which is actually the next-to-last generation of Alpha processors, is designed for in-order (static) scheduling [2,7]. Scheduling is the process of ordering instruction execution to obtain optimum performance. The integer unit contains two 64-bit integer execution pipelines, E0 and E1. Two integer instructions can be issued per cycle, one issue each for E0 and E1. The most important features of the floating-point unit are the 32 floating-point registers, each 64 bits wide, a floating-point multiply pipeline FM, and a floating-point add pipeline FA [2]. The four pipelines of the 21164 processor allow issues of different instruction classes. However, in addition to the restrictions due to available functional units there is another class of restrictions due to certain interdependencies, among them issue and result latencies, which may prevent the issuing of an instruction.

The Alpha 21264 microprocessor is a high-performance third-generation implementation of the Compaq Alpha architecture [3]. The Alpha 21364 microprocessor is the fourth generation of the Alpha microprocessor family. The Alpha 21364 is in many ways similar to the Alpha 21264 [8]; hence, for the Alpha 21364 we will show only its optimized computing times for $1/\sqrt{x_i}$ and $e^{-x_i}$, together with those of the Alpha 21264. The Alpha 21264 can issue four instructions in a single cycle, thereby minimizing the average cycles per instruction. In contrast to the Alpha 21164, the 21264 processor has been designed for out-of-order scheduling, where instructions can be executed out of program order. Out-of-order execution consists of: (i) dynamic scheduling, where the processor can reorder instructions to reduce processor idle cycles (stalls); (ii) register renaming, where the processor can rename registers to remove Write-after-Read (WAR) and Write-after-Write (WAW) hazards; (iii) branch prediction, where the processor can predict branch behavior depending on the past history of the same branch or previous branches. In dynamic scheduling the hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior. Dynamic scheduling offers several advantages [6]: (i) it enables handling some cases when dependencies are unknown at compile time (e.g., because they may involve a memory reference); (ii) it simplifies the compiler; (iii) it also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. The instruction fetch, issue, and retire unit (I-box) of the Alpha 21264 fetches instructions in program order, executes them out of order, and then retires them in order. The I-box retire logic maintains the architectural state of the machine by retiring an instruction only if all previous instructions have executed without generating exceptions or branch mispredictions [3]. Register rename maps eliminate register WAR and WAW data dependencies while preserving true Read-after-Write (RAW) data dependencies, in order to allow instructions to be dynamically rescheduled [3]. If an instruction causes a stall in the pipeline, then in the case of in-order scheduling (e.g., Alpha 21164) the following instructions have to wait until the result of the stalled instruction is available in the pipeline. The advantage of out-of-order scheduling is that it prevents such waiting cycles except in the case of RAW. Consequently, an out-of-order scheduled architecture eliminates WAR and WAW dependencies.

A common feature of the Alpha 21164, 21264, & 21364 processors is that floating-point add and multiply are fully pipelined, while floating-point divide, which is associated with the add pipeline, is not pipelined at all. The instruction latencies of these processors are identical. Since we are interested in optimizing floating-point computations, one may execute one add and one multiply at each cycle. Due to the latencies of both operations, the result is available only 4 cycles later. Given this behavior, a loop unrolling factor of 4 often appears to be the natural choice, i.e., it is a good strategy to compute $1/\sqrt{x_i}$ for 4 independent input arguments $x_i$, which proceed approximately in parallel.

3.1.1. Strebel's optimization of $1/\sqrt{x}$ for in-order scheduling

In 2000, Strebel developed codes which efficiently compute $1/\sqrt{x_i}$ for a vector of input arguments $x_i$ using both Goldschmidt's and the binomial algorithm [2].

To aid understanding, an issue table is shown here rather than the issue map of reference [2]. In Table 1 the computation of a single $1/\sqrt{x}$ value according to Goldschmidt's algorithm is represented by an issue table. Such a table does not contain all the information required to compute the result; it illustrates only the algorithm at run time by indicating at which clock cycle which instructions will be issued. Classical optimization techniques like loop unrolling and software pipelining have been used to obtain an efficient implementation. In the issue table, symbols correspond to multiplication, addition, branch, integer operation, shift, load, store, nop (no operation), and fnop, respectively. The fnop copies the stack top onto itself, thus padding the executable file and taking up processing time without having any effect on register or memory. The result latency gives the number of cycles required until the result is available. Applying a straightforward instruction count to Goldschmidt's algorithm, 10 floating-point multiply instructions are found to be necessary for each $1/\sqrt{x}$ computation. Ignoring other restrictions, the computation of a single $1/\sqrt{x}$ will take at least 10 cycles. To achieve this optimum the instructions must be scheduled in such a way that one floating-point multiply is executed in each cycle. Initially, Strebel found 54 cycles to be necessary for the computation with a single argument $x$, i.e., $n$ arguments require $n$ times the complete loop or $54n$ cycles overall. After loop unrolling the number of cycles could be reduced to $67 \times n/4 \approx 17n$. Upon software pipelining there remain about $40 \times n/4 = 10n$ cycles plus some overhead. So these simple code transformations alone promise to make the algorithm run faster by a factor of about 5 [2].

Table 1
Computation of a single $1/\sqrt{x}$ value, issue map of the sequential code [2]

Table 2
Computation of a single $1/\sqrt{x}$ value using the binomial algorithm

Strebel's codes run efficiently only on the Alpha 21164, but are not workable on processors which are significantly different from the Alpha 21164, not even on its successor, the Alpha 21264 [2]. Our approach to compute $1/\sqrt{x_i}$ and $e^{-x_i}$ efficiently on the Alpha 21264 and IA-64 CPUs is based on floating-point additions and multiplications similar to Strebel's implementation, but the scheduling of the instructions in the pipelines is different.

3.1.2. Optimization of $1/\sqrt{x}$ for out-of-order scheduling

Although the Alpha 21264 can fetch only 4 instructions in a single cycle, up to 6 instructions can be dynamically issued in the same cycle because there are actually 4 integer and 2 floating-point pipelines. Before being issued, the instructions enter the register renaming stage to eliminate any register dependencies. Despite its out-of-order execution capability, the Alpha 21264 provides a way to observe the execution results in an in-order manner, simply by restricting an instruction not to retire before all of its previous instructions. The retirement mechanism has storage to keep track of the internal registers used by all in-flight operations. These registers can be freed only when the instructions retire. The basic issue table for the calculation of a single $1/\sqrt{x}$ value using the binomial algorithm is shown in Table 2. Applying a straightforward instruction count to the binomial algorithm, among others, 10 floating-point add and 11 floating-point multiply instructions are found to be necessary for each $1/\sqrt{x}$ computation. Ignoring other restrictions, the computation of a single $1/\sqrt{x}$ will take at least 11 cycles. To achieve this optimum the instructions must be scheduled in such a way that one floating-point multiply and, where possible, one floating-point add are executed in each cycle. Since the hardware reorders instructions at run time, issue rules are not as important as for the 21164 processor. Static or dynamic order makes no difference in the issue table; only the group of 4 fetched instructions per cycle is important, because these instructions are fetched in program order, executed out of program order, and retired again in program order. Upon applying an appropriate instruction scheduling to the resulting codes and proper reordering of the instructions, the corresponding issue structure for dynamic scheduling consists of 60 cycles, as shown in Table 3. Thus, for $n$ $1/\sqrt{x_i}$ values one already obtains a relation of $56 \times n$ cycles to $60 \times n/4$, i.e., a speedup of $4 \times \frac{56}{60} \approx 3.73$. Table 3 elucidates $n/4$ times looping together with $4\times$ unrolling of $1/\sqrt{x}$ with proper reordering of the instructions to obtain optimum performance. After loop unrolling and scheduling, each iteration consists of a phase enclosing mainly integer operations, load, store, and shift instructions, and a posterior phase with floating-point operations. Such an issue structure is pertinent for software pipelining.

Table 3
$4\times$ unrolling and scheduling with $n/4$ times looping using the binomial algorithm

Table 4
This structure illustrates the pre-loop of $4\times$ unrolling and $3\times$ pipelining of instruction scheduling to the resulting optimized codes. The pre-loop is not optimized.

Table 5
This structure depicts $(n/4 - 2)$ times looping of $4\times$ unrolling and $3\times$ pipelining of instruction scheduling to the resulting optimized codes. The loop is highly optimized.

Table 6
These consecutive structures indicate the non-optimized post-loop of $4\times$ unrolling and $3\times$ pipelining of instruction scheduling to the resulting optimized codes. The start of the second structure must coincide with the end of the first structure.

Three such issue structures have been overlapped by software pipelining to minimize loop overhead and to increase the potential for ILP some more. The resulting optimized codes are delineated in Tables 4–6, comprising pre-loop, loop, and post-loop, successively. Initially, there were 56 cycles for $n$ iterations of the loop or $56n$ cycles overall. After $4\times$ unrolling and scheduling the cycle count reduces to $\frac{60}{4} n = 15n$. Using software pipelining, there are $\frac{44}{4} n = 11n$ cycles with some overhead. These simple optimization techniques make the algorithm run faster by a factor of $\frac{56n}{11n} \approx 5$.
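Summarizing the cycle accounting above in one line (our restatement of the paper's numbers):

$$56n \;\xrightarrow{\;4\times \text{ unrolling}\;}\; \tfrac{60}{4} n = 15n \;\xrightarrow{\;3\times \text{ pipelining}\;}\; \tfrac{44}{4} n = 11n, \qquad \text{speedup} \approx \tfrac{56}{11} \approx 5.$$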

3.2. IA-64 CPU

Now we will discuss some key features of the IA-64 CPU, followed by optimization techniques to compute $1/\sqrt{x_i}$ for the EPIC-based 1.0 GHz Itanium CPU.

3.2.1. Key features of IA-64

Intel® has chosen a conspicuously different direction than the Alpha 21264 & 21364 processors by introducing a new 64-bit instruction set architecture called IA-64, based on the EPIC design philosophy. EPIC can roughly be denoted as the second generation of VLIW. In EPIC, the compiler, not the processor, performs the parallelization of the instruction stream. In the philosophy of EPIC, the compiler decides which instructions can be executed together, puts them together in bundles, and executes these instruction bundles. Three instructions are grouped together into 128-bit sized and aligned containers called bundles [5,6,9], i.e., a bundle is a 128-bit long instruction word (LIW) containing three 41-bit IA-64 instructions along with a so-called 5-bit template that contains instruction grouping information. Bundled instructions are not required to be in their original program order, and they can even represent entirely different paths of a branch. Also, the compiler can mix dependent and independent instructions together in a bundle, because the template keeps track of which is which. During execution, architectural stops in the program indicate to the hardware that one or more instructions before the stop may have certain kinds of resource dependencies with one or more instructions after the stop [9]. IA-64 does not insert no-op instructions to fill slots in the bundles. Intel® is concentrating attention on a compiler-driven technology to increase ILP with the IA-64 processor, which only exploits static ILP. IA-64 defines a set of architectural extensions to permit compilers to identify more ILP. The IA-64 architecture was designed to overcome the performance limitations of traditional architectures and provide maximum scope for the future. To achieve this, IA-64 was designed with an array of innovative features to extract greater instruction-level parallelism, including speculation, predication, large register files, a register stack, an advanced branch architecture, and many others [9]. The Itanium processor is the first implementation of the IA-64 architecture [6].
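To make the bundle format concrete, here is a small sketch (our code, with hypothetical helper names) that packs a 5-bit template and three 41-bit instruction slots into the 5 + 3 × 41 = 128-bit container described above; the exact bit ordering is our assumption for illustration.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t lo, hi; } bundle_t;   /* 128-bit container */

    #define SLOT_MASK 0x1FFFFFFFFFFull              /* 41 bits */

    /* Assumed layout: bits 0-4 template, 5-45 slot 0,
     * 46-86 slot 1 (straddles the word boundary), 87-127 slot 2. */
    static bundle_t pack_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
    {
        bundle_t b;
        b.lo = (uint64_t)(tmpl & 0x1F)
             | (s0 & SLOT_MASK) << 5
             | (s1 & SLOT_MASK) << 46;      /* low 18 bits of slot 1  */
        b.hi = (s1 & SLOT_MASK) >> 18       /* high 23 bits of slot 1 */
             | (s2 & SLOT_MASK) << 23;
        return b;
    }

    int main(void)
    {
        /* dummy slot contents; real slots hold encoded IA-64 instructions */
        bundle_t b = pack_bundle(0x0E, 0x123456789ull, 0x0ABCDEF01ull, 0x1FEDCBA98ull);
        uint64_t s1 = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* unpack check */
        printf("bundle = %016llx%016llx, slot1 = %011llx\n",
               (unsigned long long)b.hi, (unsigned long long)b.lo,
               (unsigned long long)s1);
        return 0;
    }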

3.2.2. Optimization of $1/\sqrt{x}$

The Itanium processor core is capable of up to six issues per clock, with up to three branches and two memory references. The memory hierarchy consists of a three-level cache. The first level uses separate instruction and data caches; floating-point data are not placed in the first-level cache. The second and third levels are unified caches, with the third level being an off-chip cache placed in the same container as the Itanium die. There are 9 functional units in the Itanium processor: 2 I-units, 2 M-units, 3 B-units, and 2 F-units. All the functional units are pipelined [6]. Itanium has an instruction issue window that contains up to two bundles at any given time. With this window size, Itanium can issue up to 6 instructions in a clock cycle. In the worst case, if a bundle is split when it is issued, the hardware could see as few as 4 instructions: 1 from the first bundle to be executed and 3 from the second bundle. Instructions are allocated to functional units based on the bundle bits, ignoring the presence of no-ops or predicated instructions with untrue predicates [6]. Stalling the entire bundle leads to poor performance unless the instructions are carefully scheduled. Issue Tables 7–11 are represented by bundles and clock cycles. Each clock cycle consists of 2 bundles (six instructions); each clock cycle therefore issues 2 bundles simultaneously, one from the first bundle sequence and another from the corresponding second bundle sequence. These tables do not contain all the information required to compute the result; instead they illustrate only the codes at run time by indicating at which clock cycle which bundles will be issued. Table 7 represents $56 \times 2 = 112$ bundles and 56 clock cycles for the computation of a single $1/\sqrt{x}$ value according to the binomial algorithm. Templates 14 and 15 have been used for the simultaneous execution of multiply and add operations. The 24 possible template values and the instruction slots and stops for each format can in general be found in [6]. The process of sending instructions to functional units is called dispersal; the dispersal window can hold two bundles. Thus the Itanium can issue a maximum of six instructions per clock cycle [6]. Consequently, the 112 bundles can be executed in at best 56 clock cycles. Since each floating-point arithmetic operation in Itanium is designed with a result latency of 4 cycles, it is often appropriate to choose a loop unrolling factor of 4. From Table 7, 56 clock cycles are found to be necessary for the computation with a single argument $x$, i.e., $n$ arguments require $n$ times the complete loop or $56n$ clock cycles overall. Table 8 elucidates $n/4$ times looping together with $4\times$ unrolling of $1/\sqrt{x}$ with proper reordering of the instructions to obtain optimum performance. Such an issue structure is pertinent for software pipelining. In this issue structure only about 53% of the total instruction slots are filled with necessary and sufficient instructions. In addition, there are several Itanium-dependent restrictions that cause a bundle to be split and issue a stop. When the issue to a functional unit is blocked because the next instruction to be issued needs an already committed unit, the resulting bundle is split [6]. Therefore, it is a good idea to thwart too many potential splits by overlapping 3 such issue structures, resulting in a structure with $4\times$ unrolling and $3\times$ pipelining that minimizes loop overhead and also extends the potential for ILP to some extent. The resulting optimized codes are delineated in Tables 9–11, comprising pre-loop, loop, and post-loop, successively. After $4\times$ unrolling and scheduling the clock cycles become $\frac{58}{4} n \approx 15n$. Upon applying software pipelining, there are $\frac{44}{4} n = 11n$ cycles with some overhead. Consequently, these simple code transformations alone promise to result in a speedup of the algorithm of $56n/11n \approx 5$.

Table 7
Sequential binomial algorithm calculates a single $1/\sqrt{x}$

Table 8
$4\times$ unrolling and scheduling with $n/4$ times looping using the binomial algorithm

Table 9
Pre-loop of the optimized codes

Table 10
These structures depict $(n/4 - 2)$ times looping of $4\times$ unrolling and $3\times$ pipelining of instruction scheduling to the resulting optimized codes. The loop is highly optimized.

Table 11
Post-loop of the optimized codes. The second two structures, from left to right, are the continuation of the first two structures' 1st and 2nd bundle sequences.

In IA-64, 3 instructions are encoded in 128 bits, whereas in Alpha 3 instructions are encoded in only 96 bits (one instruction is 32 bits long). Hence, to represent similar code, the IA-64 code size increases by $(128/96 - 1) \approx 33.33\%$. The implementations presented in Tables 5 and 10 are highly optimized. There are 176 instruction slots available in the main loop for Alpha in Table 5, whereas 264 instruction slots are required to represent the main loop for IA-64 in Table 10. Compared with Alpha, IA-64 needs 88 more instruction slots, i.e., an $88/264 \approx 33.33\%$ increase in code size for IA-64 compared with Alpha.

4. Properties of the optimized algorithms theoretically and practically

Counting cycles based on the number of instructions can be misleading because it ignores memory latencies; it hence estimates an upper limit for the possible improvement due to the code transformation. It has been estimated that 11 and 15 cycles are required per $1/\sqrt{x}$ and per $e^{-x}$, respectively, for the optimized implementation, if any overhead is ignored. For the Alpha 21264 processor with a clock rate of 833 MHz, 11 cycles correspond to 13.2 ns as a lower limit for computing one $1/\sqrt{x}$. Analogously, 15 cycles correspond to 18 ns as a lower limit for computing one $e^{-x}$. For the IA-64 CPU with a clock rate of 1000 MHz, 11 cycles correspond to 11 ns as a lower limit for computing one $1/\sqrt{x}$. Similarly, 15 cycles correspond to 15 ns as a lower limit for computing one $e^{-x}$. Of course, the true execution time $t$ will be higher than 13.2 ns or 11 ns per $1/\sqrt{x}$ and 18 ns or 15 ns per $e^{-x}$ for several reasons: (i) loads from the argument vector or from the lookup table may miss the first-level cache [2]; (ii) at least once the branch will be mispredicted, and even correctly predicted taken branches involve some overhead; (iii) the cost of function call and return, including the code to save and restore register values, cannot be overlooked; (iv) though high performance has been achieved in the loop itself with software pipelining, the cost of the non-optimal pre-loop and post-loop sections cannot be neglected; (v) there are several Itanium-dependent restrictions that cause a bundle to be split and issue a stop.

Theoretically, the execution time $t$ required per element would be independent of the vector size $n$. However, the function call overhead alone implies that $t$ will decrease for increasing $n$, since the one-time overhead is now shared by $n$ elements. To analyze the efficiency of our implementation for varying vector length $n$, the same realistic approach as in [2] has been used for the execution time $t$ per element,

$t = t_0 + c_1 + \dfrac{c_2}{n}$,   (1)

where $t_0$ is the minimal time required to compute a certain task (Alpha 21264 with a clock rate of 833 MHz: $t_0 = 13.2$ ns and $t_0 = 18$ ns; IA-64 with a clock rate of 1000 MHz: $t_0 = 11$ ns and $t_0 = 15$ ns, for $1/\sqrt{x}$ and $e^{-x}$, respectively). The parameter $c_1$ is determined by the overhead which occurs in every iteration of the innermost loop, such as a taken-branch penalty, whereas the parameter $c_2$ measures the overhead which occurs once per function call or once per vector. Portions of the overhead $c_2$ are, for example, function call and return, register save and restore, and the penalty due to non-optimal pre-loop and post-loop codes. For large vectors $n \to \infty$, the overhead $c_2$ will be negligible and the execution time will be $t \to t_0 + c_1$. Not every piece of overhead can be unambiguously assigned to class $c_1$ or $c_2$; e.g., load misses due to table lookup can occur in every iteration for small $n$, but for larger $n$ table entries will be reused and the number of load misses will decrease.

To find the overheads $c_1$ and $c_2$ for $1/\sqrt{x}$ or $e^{-x}$, the execution time $t$ per $1/\sqrt{x}$ or per $e^{-x}$ for varying vector length $n$ has been measured as indicated in Table 12. The parameters $c_1$ and $c_2$ can be determined by fitting the data in Table 12 to Eq. (1) in the least squares sense, minimizing the relative errors between predicted and measured execution times. Upon fitting the data for the Alpha 21264, one obtains $c_1 = 0.7$ ns or about 0.5 cycles overhead per $1/\sqrt{x}$, and $c_2 = 115$ ns overhead per vector. In relation to the idealized execution time of 13.2 ns, the overall overhead drops to about $\frac{0.7 \times 100}{13.2}\% \approx 5.3\%$ for $n \to \infty$. Similarly, one estimates $c_1 = 3.9$ ns or approximately $3.9/1.2 \approx 3.25$ cycles overhead for each single $e^{-x}$, and $c_2 = 143$ ns or about 119.2 cycles overhead once for the complete vector of $e^{-x}$. Compared to the idealized execution time $t_0 = 18$ ns, the overall overhead drops to about $3.9/18 \approx 21.67\%$ for $n \to \infty$. Similarly, on fitting the data for IA-64 from Table 12 to the simple Eq. (1), it has been estimated that $c_1 = 2.7$ ns or about 2.7 cycles overhead per $1/\sqrt{x}$, and $c_2 = 313$ ns or about 313 cycles overhead per vector. In relation to the idealized execution time of 11 ns, the overall overhead drops to about $\frac{2.7 \times 100}{11}\% \approx 24.55\%$ for $n \to \infty$. Analogously, it can be estimated that $c_1 = 5.9$ ns or approximately 6 cycles overhead for each single $e^{-x}$, and $c_2 = 543$ ns or about 543 cycles overhead once for the complete vector of $e^{-x}$. Compared to the idealized execution time $t_0 = 15$ ns, the overall overhead drops to about $5.9/15 \approx 39.33\%$ for $n \to \infty$. Considering that the integer pipelines are highly utilized and that occasional first-level cache misses will lead to partial stalls, the resulting efficiency seems satisfactory in either case. However, the performance of the 833 MHz Alpha is better than that of the 1000 MHz IA-64 Itanium. This is directly due to the fact that IA-64's instruction scheduling and resource utilization are determined by the compiler at compile time and are not negotiated or altered by the processor at run time.

Table 12
The execution time $t$ in nanoseconds per $1/\sqrt{x}$ or per $e^{-x}$ without using compiler optimizing options, for varying vector length $n$, to find the overheads $c_1$ and $c_2$

    CPU                              n = 4   8      16     32     64     128    256    512
    Alpha 21264, t per 1/sqrt(x)     35.5   29.4   22.2   18.2   16.9   15.6   14.9   14.8
    Alpha 21264, t per exp(-x)       52.5   37.1   29.6   25.5   24.2   22.7   22.2   22.2
    IA-64 Itanium, t per 1/sqrt(x)   36.5   23.8   21.9   19.0   18.2   17.7   17.4   17.3
    IA-64 Itanium, t per exp(-x)     43.6   35.8   30.2   27.7   26.2   25.4   25.0   24.8

5. Results with the optimized routines for $1/\sqrt{x}$ and $e^{-x}$

In this section, we will discuss the performance of codesoptimized by means of various smart compilers.

5.1. Optimized results of $1/\sqrt{x}$ and $e^{-x}$ in different compilers

A smart compiler may be an adaptive compiler which gains feedback from the dynamics of program execution and applies this knowledge to adjust the compiled code so as to minimize execution latencies. The Intel® C++ compiler for Itanium-based applications, ecc, is available as a native compiler that runs on an Itanium processor system and as a cross compiler that runs on an IA-32 system. The Intel® C++ compiler for IA-32 based applications is named icc. Direct object generation is supported by the icc compiler, but not by the ecc compiler. The ecc compiler generates assembler files, then invokes the assembler to generate object files. Cross compilers are compilers that run on one specific computer but are able to produce object code for different types of computers. Cross compilers are used to generate software that can run on computers with a new architecture or on special-purpose devices that cannot host their own compilers. Native compilers produce executable code only for the system on which they are running. In general, Intel® compilers (e.g., ecc) generate faster code than GNU compilers (e.g., gcc). GNU compilers are available for compatibility reasons and as a backup in case Intel® compilers have problems with a particular code. Intel® compilers optimize very aggressively, which may cause problems with certain codes. The difference between gcc and ecc is the performance. The gcc is a free cross compiler that supports a lot of platforms and a handful of languages. The Fortran language is suited for numerical computations and the C language is used for system programming. The Fortran compiler is designed to make numerical computation easy, robust, and well-defined.

The execution time of the optimized codes may vary to a small extent. Consequently, the execution times of Table 13 have been computed as the geometric mean of 1000 measurements with the same optimized code for the given vector lengths. Table 13 as well as Figs. 1 and 2 trace the time $t$ in nanoseconds for executables generated by the gcc, cc, ecc, efc, and f95 compilers. The best results obtained by using the various optimizing options available for the different compilers are shown. Results for built-in functions are obtained by computing $1/\sqrt{x}$ stepwise as $y = \mathrm{sqrt}(x)$ and $z = 1/y$. Itanium with gcc has on average 97% of the performance of the Alpha 21264 with gcc. Similarly, Itanium with efc has on average 98% of the performance of the Alpha 21264 with f95; the resulting performance seems gratifying. The built-in functions of Itanium are on average 27% for gcc and 12% for efc better optimized than those of the Alpha 21264 with gcc and f95, respectively.

Table 13
The execution time $t$ in nanoseconds per $1/\sqrt{x}$ or per $e^{-x}$ computation for varying vector lengths $n$ using gcc, cc, f95, ecc, and efc following their optimizing options -O3 -funroll-loops, -O3 -funroll-loops -fast, -O4, -O2, and -O2, respectively. In each pair of rows, the first row belongs to $1/\sqrt{x}$ and the second to $e^{-x}$.

                      Alpha 21264 (833 MHz)     Alpha 21364 (1.15 GHz)    IA-64 (1.00 GHz)
    Vector length n   gcc     cc      f95       gcc     cc      f95       gcc     ecc     efc
    4                 37.22   35.34   40.11     27.12   25.72   29.19     37.33   36.11   39.03
                      52.49   50.00   56.08     38.74   38.74   41.01     43.43   43.55   48.19
    8                 30.01   29.34   32.90     21.83   21.32   23.91     27.93   23.79   31.11
                      37.08   35.50   40.24     27.08   27.08   29.16     36.96   35.79   39.03
    16                22.69   22.13   26.03     16.51   16.06   18.88     21.47   21.89   23.42
                      29.58   29.19   32.91     21.45   21.87   23.85     30.68   30.13   31.90
    32                18.33   18.19   20.40     13.31   13.21   14.79     18.84   18.93   19.82
                      25.46   25.43   26.24     18.48   18.42   19.03     27.67   27.69   28.25
    64                16.99   16.89   17.39     12.33   12.25   12.61     17.38   18.19   17.76
                      24.11   23.98   24.43     17.46   17.38   17.71     26.12   26.11   26.93
    128               15.72   15.59   16.15     11.42   11.34   11.72     16.56   17.69   16.77
                      22.58   22.71   22.85     16.37   16.49   16.60     25.34   25.35   25.98
    256               15.12   14.89   15.39     11.04   10.83   11.19     16.51   17.38   16.27
                      22.17   22.18   22.67     16.08   16.08   16.42     24.95   24.92   25.51
    512               14.83   14.72   15.01     10.79   10.68   10.89     16.30   17.27   16.03
                      22.16   22.11   22.62     16.07   16.06   16.38     24.72   24.73   25.28
    Built-in          142.99  87.77   57.95     103.99  63.80   47.13     87.12   86.19   53.06
                      64.58   68.74   57.94     46.87   49.99   47.51     54.86   51.69   49.21

Fig. 1. The graph depicts execution time for built-in functions for a specific vector length of $n = 32$ of the optimized codes with similar compilers and optimization options as given in Table 13 for $1/\sqrt{x}$ per vector element. (Axes: execution time $t$ [ns] versus compiler; bars for built-in and optimized codes on EV 264, EV 364, and IA-64.)

Fig. 2. The graph depicts execution time versus vector length with the gcc compiler only and its optimizing options used in Table 13 for $1/\sqrt{x}$ per vector element. (Axes: execution time $t$ [ns] versus vector length $n$; curves for Alpha 21264 833 MHz, Alpha 21364 1.15 GHz, and IA-64 1.00 GHz.)

In the following, we define speed-up as the ratio of the execution time of the built-in functions to the execution time of the optimized codes for varying vector lengths. Fig. 3 depicts the different speed-ups derived from Table 13 for $1/\sqrt{x}$ for a specific vector length $n = 32$. The well-optimizing Fortran compiler already reduces the run time of the built-in functions remarkably; thus, the speed-up attainable by the manually optimized $1/\sqrt{x_i}$ or $e^{-x_i}$ codes is restricted. Nevertheless, computing the optimized $1/\sqrt{x_i}$ or $e^{-x_i}$ for a large enough vector length $n$, with or without using compiler options for gcc, cc, ecc, efc, and f95, gives a satisfactory factor of execution time reduction compared with the time for the standard built-in math library routines.

Fig. 3. The graph depicts the speedup compared with the built-in functions versus compilers for a specific vector length $n = 32$ of Table 13 for $1/\sqrt{x}$ per vector element. (Axes: speed-up, from 0 to 12, versus compiler; bars for EV 264 (833 MHz), EV 364 (1.15 GHz), and IA-64 (1.00 GHz).)

5.2. Accuracy of the optimized codes with respect to built-in functions

Since most floating-point calculations have rounding errors anyway, does it matter if the basic arithmetic operations generate slightly more rounding errors than the minimum value [10]? The accuracy of the manually optimized codes highly depends on the accuracy of the algorithms as well as on the scheduling of instructions in the software pipelining. Since the output deviations with respect to the built-in functions are almost the same for $1/\sqrt{x_i}$ and $e^{-x_i}$, the accuracy of the optimized codes is only discussed for $1/\sqrt{x_i}$. Upon fixing the accuracy of the binomial algorithm at some standard level, the scheduling of instructions dominates the accuracy of the optimized codes. The accuracy of the optimized codes can be severely broken by the intervention of any improper scheduling. To show the accuracy of the optimized codes with respect to the built-in functions, the output of the optimized codes has been compared to the output of the built-in $1/\sqrt{x}$ functions with the same input argument. The numerical deviations of the output of the optimized codes from the output of the built-in functions are graphically shown in Fig. 4, using an arbitrary input argument limit $0 < x < 10$. Clearly, the numerical deviations are not always zero. However, the output numerical differences (see Fig. 4) are hardly noticeable, which is thoroughly acceptable in many applications. The main reason for the slightly larger numerical error is the argument reduction used in the binomial algorithm to minimize the additional cost for rounding. It is worth mentioning that the built-in functions may also not be absolutely free from rounding error. The output numerical deviations of the optimized codes are the same for gcc, cc, and f95 with the same compiler optimization options used in Table 13, and for the same values of the input arguments $x_i$ in any argument limit. Table 14 shows a snapshot of the $1/\sqrt{x_i}$ calculation for both the output of the optimized binomial algorithm [binocalculated] and the built-in functions [built-in], with $y = \sqrt{x}$, built-in $= 1/y$, and error $=$ (built-in) $-$ (binocalculated).

Fig. 4. Output numerical deviation of the optimized codes with respect to the built-in function. 1000 input argument $x$ values have been designated from a uniformly distributed variable in the given interval $0 < x < 10$. (The deviations lie within about $\pm 6 \times 10^{-16}$.)

Table 14
Input–output with different compilers and CPUs
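A sketch of this accuracy experiment is given below (our code, not the paper's harness): it draws 1000 arguments uniformly from $0 < x < 10$ and reports the largest deviation between a candidate routine and the two-step built-in computation. The stub rsqrt_candidate() trivially reproduces the built-in result and is meant to be replaced by the optimized routine under test.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* stub: replace with the optimized 1/sqrt(x) routine to be checked */
    static double rsqrt_candidate(double x) { return 1.0 / sqrt(x); }

    int main(void)
    {
        double max_err = 0.0;
        srand(12345);                        /* fixed seed, reproducible run */
        for (int i = 0; i < 1000; i++) {
            /* uniform x in (0, 10), endpoints excluded */
            double x = 10.0 * (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            double built_in = 1.0 / sqrt(x); /* y = sqrt(x), built-in = 1/y */
            double err = built_in - rsqrt_candidate(x);
            if (fabs(err) > max_err) max_err = fabs(err);
        }
        printf("max |(built-in) - (candidate)| = %.3e\n", max_err);
        return 0;
    }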

6. Conclusion

The paper addresses the specific problems of computing $1/\sqrt{x}$ and $e^{-x}$ and presents a detailed analysis of possible software optimizations to improve performance. The software optimizations considered are directly tied to the micro-architecture of the Alpha 21264 & 21364 and IA-64 processors, respectively. We show that a careful manual optimization tied closely to the specific processor architecture can provide significantly higher performance than the use of the standard math library and hardware division.

Though the codes for computing $1/\sqrt{x_i}$ and $e^{-x_i}$ for a vector of input arguments $x_i$ are specifically optimized for Alpha 21264 & 21364 and IA-64 CPUs, they can be compiled for any other processor with a few adaptations, or even without any, provided the processor conforms to the IEEE 754 floating-point standard.

Acknowledgment

Md. Haidar Sharif is grateful to the Department of Theory and Bio-Systems of the Max Planck Institute of Colloids and Interfaces, where the codes could be tested.

References

[1] A.H. Karp, Speeding up n-body calculations on machines without hardware square root, Scientific Programming 1 (1992) 133–140.

[2] R. Strebel, Pieces of software for the Coulombic m body problem, Diss. ETH No. 13504, 2000.

[3] Alpha 21264 Microprocessor Hardware Reference Manual, Revision 4.2, Compaq Computer Corporation, 1999.

[4] K. Dowd, High Performance Computing, O'Reilly & Associates, Inc., 1993.

[5] Intel® IA-64 Architecture Software Developer's Manual, vol. 2: IA-64 System Architecture, Revision 1.1, Intel Corporation, 2000.

[6] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, third ed., Morgan Kaufmann Publishers, 2003.

[7] Alpha Architecture Handbook, Version 4, Compaq Computer Corporation, 1998.

[8] Compiler Writer's Guide for the 21264/21364, Revision 2.0, Compaq Computer Corporation, 2002.

[9] Intel® IA-64 Architecture Software Developer's Manual, vol. 1: IA-64 Application Architecture, Revision 1.1, Intel Corporation, 2000.

[10] D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, March 1991.

Md. Haidar Sharif received the BSc in Electronics and Computer Science from the Jahangirnagar University (Bangladesh) in 2001 and the MSc in Computer Engineering from the University of Duisburg-Essen in 2006. He is currently a Ph.D. student at the Laboratoire d'Informatique Fondamentale de Lille, University of Science and Technology of Lille. His research interests include high performance computing, computer architectures and computer vision.

Achim Basermann, Principal Researcher, Team Leader, obtained a Ph.D. in Electrical Engineering from RWTH Aachen in 1995, followed by a postdoctoral position in Computer Science at Research Centre Jülich GmbH, Central Institute for Applied Mathematics. In 1997 he joined C&C Research Laboratories, NEC Europe Ltd., in Sankt Augustin, Germany. His current research is focused on parallel linear algebra algorithms, circuit simulation, distributed computing technology and future computer architectures.

Christian Seidel received the Ph.D. in Polymer Physics and a second degree in Solid State Theory from the Academy of Sciences (Berlin) in 1978 and 1985, respectively. In 1994 he got the Habilitation in Theoretical Chemistry from the Technical University Berlin and is currently working as group leader in the Department of Theory and Bio-Systems at the Max Planck Institute of Colloids and Interfaces Potsdam. His research interests include simulation of polymers in solution and at interfaces.

Axel Hunger got his Diplom-Ingenieur degree in Electrical Engineering in 1978 from Aachen University of Technology (RWTH Aachen), and his Dr.-Ing. degree (Ph.D.) in 1982 from the same university. From 1979 to 1987, he was head of a research group on test and simulation at RWTH. In 1988 he was appointed to a chair as professor for Computer Engineering at the Faculty of Electrical and Electronic Engineering of the Gerhard-Mercator-University Duisburg. During the last years, his activities changed to the field of computer networks, multimedia, and the development of curricula in engineering sciences, especially in view of internationalization.


Top Related