ECE 587 Advanced Computer Architecture I, Chapter 9: EPIC Architecture, Herbert G. Mayer, PSU, Status 7/18/2015

Upload: victoria-gilmore

Post on 28-Dec-2015


1

ECE 587 Advanced Computer Architecture I

Chapter 9: EPIC Architecture

Herbert G. Mayer, PSU

Status 7/18/2015

                                                                                                                 Itanium 2 processor

2

Syllabus

Introduction

Intel® Itanium® Architecture

Definitions

Data and Memory

Itanium Registers

Instruction Set Architecture (ISA)

Bibliography

3

Itanium 2 Processor

4

Itanium Processor Block Diagram

5

Introduction

The Itanium® processor is Intel’s first published, commercial 64-bit computer product, launched 2001, co-developed with HP

Published means: Smart Intel was diligently developing another 64-bit processor, the extended version of its ancient, trusted, ugly x86 architecture, just in case, as a secret backup and risk hedge

64-bit means that the logical address range spans 2^64 different memory bytes; and natural integer objects are 64 bits wide

The exact format of integer objects is described in section Data and Memory

During its development at Intel, the first generation of Itanium processors was code-named Merced

The family is now officially called IPF, for Itanium Processor Family, while early in its development it was referred to as IA-64, for Intel 64-bit architecture

6

Introduction

Intel’s Itanium architecture is radically different from the widely used IA-32 architecture

IA-32 should be referred to as x86 architecture, lest one incorrectly infer today that it is restricted to 32-bit addresses and integer types of 32-bit length

That limitation no longer exists since the introduction of 64-bit versions about ½ year after AMD’s extension of IA-32 to 64 bits; see also EM64T

Imagine how Intel felt when AMD, the company that had produced CPUs compatible with Intel’s chips, suddenly had a more advanced, attractive x86 CPU!

7

Intel® Itanium® Architecture

Interestingly, IA-32 object code is executable on Itanium processors

More interesting yet, even the Hewlett-Packard PA-RISC code is natively executable on this new 64-bit IPF processor

HP was Intel’s strategic partner in the definition, development, and cost sharing of the IPF

Be cautious about performance inferences: just because IA-32 object code is executable on IPF, do not deduce that such code executes on the IPF as fast as, or faster than, on an x86 processor

8

Intel® Itanium® Architecture

IPF is Intel’s and HP’s first instance of the novel EPIC architecture

EPIC stands for Explicitly Parallel Instruction Computing. It is Intel’s first launched 64-bit architecture; the second was launched later (1Q04), with EM64T, the first 64-bit version of the old x86 architecture

HP already had a 64-bit version with its Precision Architecture (PA) RISC processor at the time Itanium was launched

Explicit means that the assembly language programmer (or the smart compiler) bears the intellectual burden of taking advantage of the parallelism in the architecture

It is not the processor that automatically exploits the numerous, parallel computing modules; it needs to be told

9

Intel® Itanium® Architecture

As a consequence, compilers for IPF are highly complex

Complexity is not desirable, as that means more errors, decreased object code quality, something the promoter of a new architecture should avoid

On the other hand, the IPF has provided explicit architectural features that ease implementing highly optimizing compilers

A case in point is the architectural support for software-pipelined (SW PL) loops

Certain source constructs let the compiler emit SW PL loops that need no prologue and epilogue. This not only renders the object code more compact, but also faster

10

Intel® Itanium® Architecture

Parallel means an Itanium processor gains speed not solely via high clock rates, but via simultaneous execution of multiple operations in one clock cycle

Key concepts refined, or newly introduced, in IPF include: predication, branch prediction, branch elimination, conditional move, speculation, parallel comparisons, and a large register file

Itanium is only the first implementation of the new 64-bit Intel Architecture. Contrary to what you would expect, initially Itanium only implemented 44 physical of the 64 logical address bits

Initial product name was Merced

11

Intel® Itanium® Architecture

With 44 bits only, the total address range of the first Itanium HW was only about a millionth of the logical address range, but still about 4000 times larger than that of the earlier 32-bit architecture

In its second generation, 56 physical bits of the 64-bit logical address space were implemented in HW

Product name of that new version: Itanium® 2

Short-term, no severe limitations were expected with restricted 56-bit addresses

Still about 16 million times larger than 32-bit addressing space

Integer type operands are of course full 64 bits wide
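The ratios quoted on this slide are plain powers of two. A minimal C sketch to check them follows; `address_space_ratio` is a hypothetical helper name chosen for illustration, not part of any Itanium tool chain.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of two address-space sizes, 2^bits_a / 2^bits_b.
 * Only valid when bits_a >= bits_b and the difference is below 64. */
uint64_t address_space_ratio(unsigned bits_a, unsigned bits_b)
{
    assert(bits_a >= bits_b && bits_a - bits_b < 64);
    return (uint64_t)1 << (bits_a - bits_b);
}
```

For example, address_space_ratio(44, 32) is 4096 (the "4000 times" above), address_space_ratio(56, 32) is 16,777,216 (about 16 million), and address_space_ratio(64, 44) is 1,048,576, so 44 bits cover about one millionth of the 64-bit range.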

12

Intel® Itanium® Architecture

Unlike earlier parallel VLIW architectures, EPIC does not use a fixed width instruction encoding

Instead, operational functions can be combined to operate in parallel, from a single instruction to as many instructions as desired

What is critical in EPIC is that all code is written assuming parallel semantics within a group (to be explained later), and sequential semantics across groups

To be able to run in parallel, the machine is built with multiple execution modules that can all work at the same time

This allows a natural architecture migration from, say, 6 HW modules executing on today’s Itanium, to as many as can be crammed into a future silicon chip a few years from now

13

Intel® Itanium® Architecture

To illustrate, here is a sample taken from ref [1]. Consider 2 memory operands a and b to be swapped

    temp := a;
    a := b;
    b := temp;

The semicolon operator ‘;’ implies sequential semantics. On a machine with parallel semantics, it would be sufficient to write

    a := b,    // operand latching needed
    b := a;    // operand latching needed

with the comma operator ‘,’ implying parallel semantics, similar to syntactic conventions in the programming language Algol-68. This source snippet is just a generic example; NOT a sample of the Itanium assembly language
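The difference between the two semantics can be sketched in C, which itself only has sequential semantics. The second function merely emulates parallel assignment by latching (reading) all source operands before any destination is written; both function names are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential semantics: each statement sees the effect of the previous
 * one, so a temporary is required. */
void swap_sequential(int *a, int *b)
{
    assert(a != NULL && b != NULL);
    int temp = *a;
    *a = *b;
    *b = temp;
}

/* Emulated parallel semantics: all sources are latched first (read
 * phase), then all destinations are written (write phase). */
void swap_parallel_emulated(int *a, int *b)
{
    assert(a != NULL && b != NULL);
    int latched_a = *a;   /* read phase  */
    int latched_b = *b;
    *a = latched_b;       /* write phase */
    *b = latched_a;
}
```

On hardware with true parallel semantics the latching is done by the machine, so no explicit temporaries appear in the program text.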

14

Definitions

15

Definitions

Branch Elimination

Replacing object code that has conditional branches with code that has multiple execution paths, lacking branches

The second version, with branches eliminated, must be semantically equivalent to the original code with branches

Everything else being equal, the version without branches will execute faster
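As a source-level illustration only, the following C sketch contrasts a branching and a branch-free formulation of the same computation. The function names are hypothetical, and whether a given compiler actually emits or removes branches for either version depends on the compiler and target.

```c
#include <assert.h>

/* Version with a conditional branch in the source. */
int max_with_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free formulation: the comparison yields 0 or 1 and
 * selects one of the two operands arithmetically. */
int max_without_branch(int a, int b)
{
    int take_a = (a > b);                  /* 1 if a > b, else 0      */
    return take_a * a + (1 - take_a) * b;  /* exactly one term is kept */
}

/* The two versions must be semantically equivalent. */
void check_equivalent(int a, int b)
{
    assert(max_with_branch(a, b) == max_without_branch(a, b));
}
```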

16

Definitions

Bundle

Group of 3 instructions plus a template, that all fit into a 16-byte long, 16-byte aligned section of instruction memory on Itanium
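A sketch of how such a 16-byte bundle could be taken apart in C. The field widths used here (a 5-bit template in the lowest bits, followed by three 41-bit instruction slots) are not stated on the slide; they are taken from the published Itanium architecture and should be checked against the manual. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)   /* 41 one-bits */

/* 128-bit bundle held as two 64-bit halves:
 * lo = bits 63..0, hi = bits 127..64. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} bundle_t;

/* Template: bits 4..0. */
unsigned bundle_template(bundle_t b)
{
    return (unsigned)(b.lo & 0x1F);
}

/* Instruction slot 0, 1, or 2 (41 bits each). */
uint64_t bundle_slot(bundle_t b, unsigned slot)
{
    assert(slot < 3);
    switch (slot) {
    case 0:  /* bits 45..5 */
        return (b.lo >> 5) & SLOT_MASK;
    case 1:  /* bits 86..46, straddling both halves */
        return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    default: /* bits 127..87 */
        return (b.hi >> 23) & SLOT_MASK;
    }
}
```

The arithmetic 5 + 3 × 41 = 128 bits matches the 16-byte bundle size given above.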

17

Definitions

Conditional Move

Move instruction that transfers bits from source to destination, but only if an associated condition is true

Otherwise the instruction operates like a noop

Such a move can serve as a special case of branch elimination. For example, the C source construct:

    if ( a > 0 ) x = 99;          -- HL source program

could be mapped into the conditional move:

    cmov x, #99, a, #0, gt        -- hypothetical asm

which has no branches. Source operand #99 is moved into memory location x only if the > condition holds between operand a and integer literal 0
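The effect of the hypothetical cmov can be mimicked in C as a select. This is only a sketch: `cmov_gt` is an invented name, and unlike a real conditional move that behaves like a noop, this emulation always writes the destination (with its old value when the condition is false).

```c
#include <assert.h>
#include <stddef.h>

/* Emulates the hypothetical `cmov dst, src, lhs, rhs, gt`:
 * dst receives src only if lhs > rhs; otherwise dst keeps its value. */
void cmov_gt(int *dst, int src, int lhs, int rhs)
{
    assert(dst != NULL);
    *dst = (lhs > rhs) ? src : *dst;
}
```

With this helper, the source construct above corresponds to the call cmov_gt(&x, 99, a, 0).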

18

Definitions

Endian, Endianness

A convention that defines in which order the more significant bytes of a multi-byte data object are addressed

If the byte at the higher address holds the more significant part of the value, we call this little-endian

The other way around we call big-endian ordering

19

Definitions

EPIC

Explicitly Parallel Instruction Computing, with IPF being the first commercial architecture that implements EPIC

Note IPF’s ability to also execute old Intel x86 and old HP PA object code

20

Definitions

Epilogue

When the steady state of a software-pipelined loop completes, there may be operands yet to be used and operations yet to be computed that would not fit into the steady state

These last operands must be consumed, some even generated during the epilogue, and ultimately the pipeline must be drained

This is accomplished in the object code after the steady state, and that portion of code is called the epilogue

See also prologue

21

Definitions

Group

A sequence of instructions, each with an associated template and a defined stop

A group is composed of one bundle or more

The stop means that the hardware cannot start executing any subsequent group until the current group has completed

Syntax notation for stop in Itanium assembler is the double-semicolon ;;

22

Definitions

Parallel Comparison

A composite source program condition of the form:

    ( ( a > b ) && ( c <= d ) )

requires multiple steps to compute a boolean predicate

Generally, on a sequential architecture these multiple steps are combined via explicit instructions for anding and oring, or else the flow of control of execution selects a matching true label. All this takes time

The Itanium processor allows parallel evaluation of certain composite Boolean expressions in a single step

The result can be used as a predicate in subsequent instructions. Notice that such combined Boolean expressions must be side-effect free

Also this is not equivalent to C’s short-circuit evaluation of complex boolean expressions!

23

Definitions

Parallel Comparison, Cont’d

For example, another complex boolean expression

    ( fun( j, k ) && ( i < MAX ) )

cannot be mapped into a parallel EPIC comparison

The reason is that one operand is a function call fun( j, k ), with a possibly large number of parameters, which may have a side-effect on one of the other operands, for example “i”, which is yet to be compared

This type of boolean expression is mapped into sequential code
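Why side-effect freedom matters can be sketched in C. For the first condition, ( a > b ) && ( c <= d ), both comparisons may be evaluated independently and combined afterwards, because neither changes any state. The eager version below only emulates that idea in sequential C; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* C's short-circuit form: (c <= d) is evaluated only if (a > b) holds. */
bool condition_short_circuit(int a, int b, int c, int d)
{
    return (a > b) && (c <= d);
}

/* Eager form: both comparisons are evaluated unconditionally, in any
 * order, then combined. Legal only because neither has a side effect. */
bool condition_eager(int a, int b, int c, int d)
{
    bool p1 = (a > b);
    bool p2 = (c <= d);
    return p1 & p2;
}

/* For side-effect-free operands both forms deliver the same predicate. */
void check_same_result(int a, int b, int c, int d)
{
    assert(condition_short_circuit(a, b, c, d) == condition_eager(a, b, c, d));
}
```

If one operand were a call such as fun( j, k ) that modifies i, the eager form could observe a different value of i than the short-circuit form, which is why that expression has to stay sequential.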

24

Definitions

Predication

Is the association of a boolean condition with the execution of an instruction sequence. This allows the following:

Two instruction streams can be executed in parallel, clearly requiring multiple hardware modules; provided on EPIC

Both streams have a predicate associated with their operations. Only the stream with the true predicate is actually retired; the other will be aborted and ignored

Abort can happen as soon as the predicate is known. This means the computation of the predicate can proceed in parallel with the execution of the two code streams, but must complete by the time these 2 code streams wait to learn which one is the winner

An ISA with predication requires bits for the predicates to use, and which direction (true? or false?) to select

Also, the discarded code path must not contain any side-effect, such as a write to memory!
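The idea can be sketched in C as computing both streams and committing only the one whose predicate is true. This is an emulation in a sequential language, not Itanium code; `abs_diff_predicated` is a hypothetical name, and both streams are deliberately free of side effects, as required above.

```c
#include <assert.h>
#include <stdbool.h>

/* |a - b| written so that both "streams" are computed, and the
 * predicate p decides which result is committed. */
long long abs_diff_predicated(int a, int b)
{
    bool p = (a > b);                       /* predicate                */
    long long if_true  = (long long)a - b;  /* stream guarded by p      */
    long long if_false = (long long)b - a;  /* stream guarded by !p     */
    long long result = p ? if_true : if_false;  /* only one is retired */
    assert(result >= 0);
    return result;
}
```

On a predicated machine the two guarded computations and the comparison could be issued together; here they simply appear one after another.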

25

Definitions

Prologue

Before a software-pipelined loop body can be initiated, hardware resources (e.g. registers) must be initialized; we say the loop must be primed

This is accomplished in the object code before the steady state, called the Prologue

See also epilogue

26

Definitions

Register File

The IPF has a rich set of registers

This includes 128 general purpose registers (for integer operations), 128 floating-point, 64 predicate, 8 branch, and 128 so-called application registers

Also a variety of special purpose registers is visible; visible means accessible by the assembly language program

These include a user mask, stack marker (frame marker), ip, processor id, and performance monitoring registers

27

Definitions

Speculation

If it is suspected, but not sure, that operand o will be used in the future, and this operand is not readily available (not yet in a high-speed register), and it takes long, relative to instruction execution, to fetch o, a processor may initiate the fetch well before it is actually used

Advantage: by the time o is needed, it is already available without delay

Disadvantage: if the flow of control never reaches the place where o was thought to be needed, then the speculative fetch was superfluous

May still be meaningful, if a) no side-effects occurred that are harmful to program correctness, and b) the hardware resource required to fetch o was idle anyway; then no loss!

28

Definitions

Steady State

The software-pipelined object code executed repeatedly, after the Prologue has been initiated and before the Epilogue becomes active, is called the Steady State

Each iteration of the Steady State makes some progress toward multiple iterations of the original source loop

See also prologue and epilogue
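The three phases can be made concrete with a hand-pipelined loop in C. This is a minimal sketch with an invented two-stage split of the loop body; the function names and the example computation are assumptions for illustration, and a real compiler for IPF would do this on the object code, not in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Original source loop: out[i] = in[i] * 2 + 1 */
void scale_plain(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2 + 1;
}

/* Same computation, split into two stages and software-pipelined:
 *   stage 1: load in[i] and multiply
 *   stage 2: add 1 and store the result */
void scale_pipelined(const int *in, int *out, size_t n)
{
    if (n == 0)
        return;
    assert(in != NULL && out != NULL);

    /* Prologue: prime the pipeline with stage 1 of iteration 0. */
    int in_flight = in[0] * 2;

    /* Steady state: each pass finishes iteration i-1 and starts
     * iteration i; the two statements are independent of each other. */
    for (size_t i = 1; i < n; i++) {
        out[i - 1] = in_flight + 1;  /* stage 2 of iteration i-1 */
        in_flight = in[i] * 2;       /* stage 1 of iteration i   */
    }

    /* Epilogue: drain the pipeline by finishing the last iteration. */
    out[n - 1] = in_flight + 1;
}
```

Each pass through the steady state touches two iterations of the original loop, which is the "progress toward multiple iterations" described above.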

29

Definitions

Syllable

Is the instruction-only portion of a bundle

A bundle always holds 3 instructions plus a template, the template specifying additional necessary information about the instructions

The instruction alone, without the needed template information, is a syllable

30

Data & Memory

31

Data and Memory

Native data types of IPF resemble those of conventional 32-bit architectures, except for the longer 64-bit integer and unsigned formats

An extension over IA-32 object code is the IPF bundle

Data types include integer, unsigned, floating-point, and pointer

Integers are of different widths: byte, word, double-word, or quad-word precision

Length in bits as well as min and max values are listed below:

32

Data and Memory, Min Max

Type             Byte   Word   Double-word+   Quad-word+
Integer [bits]   8      16     32             64
Unsigned [bits]  8      16     32             64
Pointer [bits]   NA     NA     Comp. 32       64
Float [bits]     NA     NA     32, 64         64, 80

Type          Byte   Word      Double-word      Quad-word
Minint        -128   -32,768   -2,147,483,648   -9,223,372,036,854,775,808
Maxint        127    32,767    2,147,483,647    9,223,372,036,854,775,807
Minunsigned   0      0         0                0
Maxunsigned   255    65,535    4,294,967,295    18,446,744,073,709,551,615
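The limits in the table follow directly from the bit widths. A small C sketch that recomputes them, assuming two's complement signed integers; the three helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest value of a two's complement signed integer of `bits` width. */
int64_t min_signed(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? INT64_MIN : -((int64_t)1 << (bits - 1));
}

/* Largest value of a two's complement signed integer of `bits` width. */
int64_t max_signed(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? INT64_MAX : ((int64_t)1 << (bits - 1)) - 1;
}

/* Largest value of an unsigned integer of `bits` width. */
uint64_t max_unsigned(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : ((uint64_t)1 << bits) - 1;
}
```

The 64-bit cases are handled separately because shifting a 64-bit value by its full width (or into the sign bit) is not defined in C.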

33

Data and Memory

Negative numbers are represented in two’s complement format, with the sign-bit in the most-significant position

Floating-point data use the IEEE 754 standard

Bits representing integer values are numbered from 0 in the least significant position (rightmost position) to higher values. For example, the most significant bit in a double word is in position indexed 31

Maximum address on the first generation Itanium processor (Merced) was only 17,592,186,044,415 or 2^44-1. It grew in the second generation to 56 bits, and is now a full 64 bits long

34

Data and Memory

Bytes are stored in little-endian order by default

It is possible to programmatically select little- or big-endian order, by setting the be bit in the user mask, a special status register

The be bit (for big-endian) does not affect how instructions are stored or fetched from memory

Object code is always represented in little-endian order; programmer selected endianness only impacts data

In little-endian order, the least significant data byte is stored in the byte with the lowest address; conversely for big-endian order

35

Data and Memory

Data quad-word 0x1102030455060708 would be stored:

Data stored in 8 adjacent bytes in memory in little-endian order:

addr: 0   addr: 1   addr: 2   addr: 3   addr: 4   addr: 5   addr: 6   addr: 7
0x08      0x07      0x06      0x55      0x04      0x03      0x02      0x11

Same int value 0x1102030455060708 stored in 8-byte register:

byte7   byte6   byte5   byte4   byte3   byte2   byte1   byte0
0x11    0x02    0x03    0x04    0x55    0x06    0x07    0x08
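The memory layout shown here can be reproduced with a short C sketch that writes a 64-bit value byte by byte, independent of the endianness of the machine running it. `store_le64` and `store_be64` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian: least significant byte goes to the lowest address. */
void store_le64(uint8_t bytes[8], uint64_t value)
{
    assert(bytes != NULL);
    for (unsigned i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(value >> (8 * i));
}

/* Big-endian: most significant byte goes to the lowest address. */
void store_be64(uint8_t bytes[8], uint64_t value)
{
    assert(bytes != NULL);
    for (unsigned i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(value >> (8 * (7 - i)));
}
```

Calling store_le64 with 0x1102030455060708 places 0x08 at address offset 0 and 0x11 at offset 7, matching the little-endian row above; store_be64 produces the reverse order.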

36

Itanium Registers

The Itanium processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), and 128 application registers (AR)

In addition, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), user mask (UM), and instruction pointer register (IP)

GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide

PRs are 1 bit wide, while the UM holds 6 and the CFM 38 bits; depicted below:

37

Itanium Register File

GR:     gr0 … gr127, bits 63…0 each
FR:     fr0 … fr127, bits 63…0 each
PR:     pr0 … pr63, 1 bit each
BR:     br0 … br7, bits 63…0 each
ip:     bits 63…0
cfm:    bits 37…0
um:     User Mask, bits 5…0
CPUID:  cpuid0 … cpuidn, bits 63…0 each
PMD:    pmd0 … pmdm, bits 63…0 each

AR:     ar0 … ar127, with these named entries:
        ar0 … ar7   Kr0 … Kr7
        ar16        RSC
        ar17        BSP
        ar18        BSPSTO
        ar19        RNAT
        ar21        FCR
        ar30        FDR
        ar32        CCV
        ar36        UNAT
        ar40        FPSR
        ar44        ITC
        ar65        LC
        ar66        EC

38

Itanium Registers GR

The 128 GR registers are the common workhorses during computation

They contain integer values being computed that can also be used as the source and destination operands in move operations

It is possible to use these integer values as machine addresses, thus GRs can be used as pointers in load- and store-operations

All machine instructions can refer to these registers, for reading and writing values

39

Itanium Registers GR

In addition to the 64 data bits, each GR has an associated bit called the NAT, which stands for Not A Thing. NAT is 1, if the associated register has not been initialized with good data

NATs support speculation. For example, if a speculative load is issued but aborted before the value arrives in its destined GR, the NAT value can be set to record that fact

This enables integrity of the machine’s exception process

Certain instructions can manipulate individual bits or bit strings of the 64 bits in the various GRs; the GRs fall into 2 groups

The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values. However, GR0 is read-only, providing the constant 0, 64 bits long
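How a NAT bit lets a speculative load fail quietly can be modeled in C. This is only a rough software model under stated assumptions: a null pointer stands in for any load that could not deliver good data, and the type `gr_t` and both function names are invented for illustration; none of this is Itanium code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a general register: 64 data bits plus the NAT bit. */
typedef struct {
    uint64_t value;
    bool     nat;    /* true: register does not hold good data */
} gr_t;

/* Speculative load: never raises an error; a load that cannot
 * deliver good data just records that fact in the NAT bit. */
gr_t speculative_load(const uint64_t *addr)
{
    gr_t r = { 0, true };
    if (addr != NULL) {
        r.value = *addr;
        r.nat = false;
    }
    return r;
}

/* Check at the point of real use: returns false if the speculation
 * failed and a normal (non-speculative) load would be needed. */
bool check_speculation(gr_t r, uint64_t *out)
{
    assert(out != NULL);
    if (r.nat)
        return false;
    *out = r.value;
    return true;
}
```

The point of the design is that the cost of a failed speculation is deferred to the place where the value is actually needed, and is paid only if control flow gets there.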

40

Itanium Registers GR

The next 96, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. they work like a special-purpose top-of-stack cache

These stack registers are made available to SW by allocation of a register stack frame, which includes from 0 to 96 registers. All registers of this subset that are not part of the frame are inaccessible to general SW

The stack frame portion implemented via GRs is further partitioned into subsections, one meant to hold local registers, the other output registers, i.e. results of the function call

41

Itanium Predicate Registers (PR)

Execution of most IPF instructions can be predicated by one of the PRs

Value 1 in the PR means: the operation completes normally and its result is committed

Value 0 means: the result will not be posted (committed), even if it has been computed already, i.e. there will be no impact on the ARs of the machine

A rare exception, i.e. an instruction that cannot be predicated, is the loop operation

42

Itanium Predicate Registers (PR)

The PRs are also partitioned into 2 sections:

PR0 through PR15 are static PRs

The other 48 are so-called rotating PRs

PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 can be used to denote unconditional execution

The remaining 48 PRs are used to hold stage predicates, used during software pipelining
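Predicated execution as described above can be sketched as follows. This is a minimal Python model under stated assumptions: the function name `execute_predicated`, the list `pr` and the dictionary `regs` are invented for illustration and are not part of any Itanium tool.

```python
def execute_predicated(pr, qp, regs, dest, compute):
    """Commit compute() into regs[dest] only if predicate register qp holds 1."""
    result = compute()        # the result may be computed either way
    if pr[qp] == 1:
        regs[dest] = result   # predicate true: result is committed
    # predicate false: result is dropped, register state stays unchanged


# 16 static + 48 rotating predicate registers; PR0 always reads as 1
pr = [1] + [0] * 63
regs = {"r3": 20, "r4": 10, "r5": 0}

# Predicated by PR0: unconditional execution
execute_predicated(pr, 0, regs, "r5", lambda: regs["r4"] + regs["r3"])
after_p0 = regs["r5"]

# Predicated by PR1, which is 0 here: nothing is committed
execute_predicated(pr, 1, regs, "r5", lambda: 99)
after_p1 = regs["r5"]
print(after_p0, after_p1)
```

Both calls compute a value, but only the first one changes `r5`.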

43

Branch Registers (BR)

IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code. Hence the rightmost 4 bits of a bundle address will always be 0 due to alignment; they don’t need to be stored explicitly

Execution of an indirect branch requires an explicit operand

On the Itanium architecture this operand is a branch register; it holds the branch destination

The machine then loads the value of the referenced BR into the IP and execution continues from there

Executing branch-related instructions is about the only way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed

44

Current Frame Marker Register CFM

Note: the Frame Marker is often referred to as Stack Marker

Each function has a specific stack frame associated with it, which is created at function invocation; it is cleared at function return

If all the relevant data of a function’s stack frame fit, they are placed in the stack of general registers; else the overflowing data must reside in memory

Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active

45

Current Frame Marker Register CFM

Layout of the CFM register:

Bits    37 .. 32   31 .. 25   24 .. 18   17 .. 14   13 .. 7   6 .. 0
Field   rrb.pr     rrb.fr     rrb.gr     sor        sol       sof

Meaning of Bits in CFM:

Name     Bits      Field meaning
sof      0..6      Total size of stack frame
sol      7..13     Size of local part of stack frame, in words
sor      14..17    Size of rotating portion of stack frame. The number of rotating registers is 8 times the sor value
rrb.gr   18..24    Register rename base for GRs
rrb.fr   25..31    Register rename base for FRs
rrb.pr   32..37    Register rename base for PRs
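The bit fields of the CFM listed above can be extracted with shifts and masks. The following Python sketch uses only the bit positions given in the table; the names `CFM_FIELDS` and `decode_cfm`, and the sample value, are made up for illustration.

```python
# Field name -> (lowest bit, highest bit), as listed in the CFM table
CFM_FIELDS = {
    "sof": (0, 6),
    "sol": (7, 13),
    "sor": (14, 17),
    "rrb.gr": (18, 24),
    "rrb.fr": (25, 31),
    "rrb.pr": (32, 37),
}


def decode_cfm(cfm):
    """Split a CFM value into its named fields."""
    fields = {}
    for name, (low, high) in CFM_FIELDS.items():
        width = high - low + 1
        fields[name] = (cfm >> low) & ((1 << width) - 1)
    # The number of rotating registers is 8 times the sor value
    fields["rotating_registers"] = 8 * fields["sor"]
    return fields


# Sample value: frame of 12 registers, 8 of them local, sor = 1
sample = 12 | (8 << 7) | (1 << 14)
decoded = decode_cfm(sample)
print(decoded)
```

For the sample value the decoder reports sof = 12, sol = 8, sor = 1, hence 8 rotating registers, and all three rename bases 0.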

46

Application Registers (AR)

Application Registers – t.b.d.:

Register      Mnemonic     Description of register
ar0 – ar7     KR0 – KR7    Kernel registers 0 .. 7
ar8 – ar15                 Reserved
ar16          t.b.d.

47

Instruction Pointer (IP)

IPF instructions are fetched in units of bundles, which are chunks of 16 bytes, or 128 bits

Bundles are stored bundle-aligned

The IP can address 18,446,744,073,709,551,616 different bytes (but only at bundle addresses)

The rightmost 4 bits of the IP thus will always be zero, due to the bundle alignment
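The alignment arithmetic can be checked with a short Python sketch. The helper names `is_bundle_aligned`, `bundle_address` and `next_bundle` are hypothetical; only the 16-byte bundle size and the 64-bit address width come from the text above.

```python
BUNDLE_BYTES = 16            # one bundle = 16 bytes = 128 bits
ADDRESS_SPACE = 1 << 64      # number of addressable bytes


def is_bundle_aligned(address):
    """True if the rightmost 4 address bits are all zero."""
    return address & (BUNDLE_BYTES - 1) == 0


def bundle_address(address):
    """Address of the bundle that contains the given byte address."""
    return address & ~(BUNDLE_BYTES - 1)


def next_bundle(ip):
    """Sequential successor of a bundle address, wrapping at 2**64."""
    return (ip + BUNDLE_BYTES) % ADDRESS_SPACE


print(ADDRESS_SPACE, hex(bundle_address(0x400F)), hex(next_bundle(0x4000)))
```

Since the low 4 bits carry no information, only the upper 60 bits of a bundle address distinguish one bundle from another.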

48

Performance Monitor Data Register

These are architecture-provided resources that record the use of hardware modules

Their contents are read-only for SW

But contrary to the performance monitor registers on Intel Pentium architectures, they are user visible


49

Itanium ISA: Instruction Set Architecture

50

Instruction Set Architecture (ISA)

Parallelism and Dependences

Itanium instructions that are explicitly packaged in groups can execute in parallel

The assembly programmer or compiler may craft groups as large as desired; the performance consequence is: all operations embedded in a single group can be executed simultaneously, in parallel, saving time over the equivalent sequential execution

The physical silicon angle of this is: of all operations that could be executed in parallel, only those are actually performed in parallel for which there exist HW resources

E.g. on an Itanium® 2 implementation of IPF, there are 6 units available to operate in parallel

51

Instruction Set Architecture (ISA)

Parallelism and Dependences

If fewer actions are enclosed in a group than there are HW units, some HW will idle

If more actions are included in a group than the HW can execute at once, then all HW elements are active, yet some degree of possible parallelism is lost; future HW implementations may execute that same object code faster due to their higher degree of parallelism

Parallel execution is not feasible if dependences exist between instructions. On the IPF family, however, these dependences are not resolved by the machine

It is the human programmer or the optimizing compiler that explicitly tracks what can be done in parallel, and what must be done in sequence. The machine just runs it; the goal: TO BE FAST!

52

Instruction Set Architecture (ISA)

Parallelism and Dependences

If a result has to be computed first before it can be read somewhere else (memory or register), a true dependence exists, AKA data dependence; it is conventional to just say “dependence”

On Itanium we call this a RAW (Read after Write) dependence

If a result has to be read first before it can be re-computed, a false dependence is created, AKA anti-dependence

On Itanium this is named WAR (Write after Read) dependence

If a result has to be computed first before it can be computed again, assuming that an intermediate reference is possible, an output dependence is created

53

Instruction Set Architecture (ISA)

Parallelism and Dependences

Itanium calls this WAW (Write after Write) dependence

In all these cases, the prior operation has to complete before the dependent one can be started; e.g.:

ld8 r14 = [r3]        // load GR14 with 8 bytes addressed by GR3
add r15 = r14, r16    // integer sum into GR15, RAW dependence

The loading of an 8-byte value into (8-byte) register GR14 must complete first, before the addition of the 2 long integer values, held in GR14 and GR16, can be started

Note the assembler register names: r14, and not gr14
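The three kinds of dependence can be told apart mechanically from the destination and source registers of two instructions. The Python sketch below assumes a simplified representation in which each instruction is a pair (destination, set of sources); the function name `classify` and this representation are illustrative only and ignore memory operands.

```python
def classify(first, second):
    """Return the dependences of `second` on the earlier instruction `first`.

    Each instruction is a pair (destination register, set of source registers).
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 in sources2:
        found.add("RAW")   # second reads what first writes: true dependence
    if dest2 in sources1:
        found.add("WAR")   # second overwrites what first reads: anti-dependence
    if dest1 == dest2:
        found.add("WAW")   # both write the same register: output dependence
    return found


ld8 = ("r14", {"r3"})            # ld8 r14 = [r3]
add = ("r15", {"r14", "r16"})    # add r15 = r14, r16
print(classify(ld8, add))
```

For the pair `ld8` / `add` from the example, the only dependence found is RAW, via r14.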

54

Instruction Set Architecture (ISA)

Assembly Language Format

Format of an Itanium assembler instruction:

[(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]

In this meta-syntax, the brackets [ and ] mean that the bracketed portion of the instruction is optional

In assembly syntax, these bracket pairs [] express indirection

Be careful not to confuse the 2 different contexts!

Meaning of the various assembly language fields:

55

Instruction Set Architecture (ISA)

Syntax     Name                 Meaning
(pr)       Predicate register   Used to predicate execution; if its value is 0, the result is not committed; if it is 1 (true), the result is committed. pr0 is always 1, hence the associated instructions are executed unconditionally
mnemonic   Instruction          Name of the instruction, telling the assembler which operation to perform
comp       Completer            Further qualifies or completes the instruction specification. There may be multiple completers per instruction; not all instructions have a completer
dest       Destination          The destination of the specified instruction. Choices are: register or memory
src1       Source one           Source operand. Not all instructions require a source. Some instructions allow multiple sources. Sources may be immediate operands or registers. Memory can be a source via indirection (through a register)
src2       Source two           Ditto
src3       Source three         Ditto

56

Instruction Set Architecture (ISA)

Assembly Language Format

A sample assembly language instruction is shown next:

(p0) add r5 = r4, r3, 1 // (p0) can be skipped

This is an integer add instruction that sums up the integer values in GR4 and GR3, and also adds 1

It assigns the sum to register GR5. Since the predicate register used is PR0, which is always true, the commit of the sum to register GR5 is unconditional, just as if no predicate qualifier had been given

Predicate registers, when listed, are enclosed in parentheses

Not all instructions allow or need a completer. Typical completers are shown below. Some instructions allow multiple completers, notably the memory access instructions and branch instructions

57

Instruction Set Architecture (ISA)

Completer   Meaning
.a          For “advanced” load; check later if successful
.c          Check
.clr        If advanced load was not successful, clear the reg
.nc         No clear
.s          Speculative; e.g. for load; NOT allowed for store!
.many       t.b.d.
.few        t.b.d.
.excl       t.b.d.

Many more: .equ, .unc, etc.

58

Instruction Set Architecture (ISA)

Itanium Bundle Format

Executable code on Itanium comes in units of bundles. A bundle consists of 3 instructions, all grouped with an associated template

The template completes the instruction specification and, above all, can define a group boundary, AKA stop. A stop defines the boundary between one group and the next

If no stop is included in a template, this means that the bundle will be part of a larger group, consisting of more instructions in the next bundle

59

Instruction Set Architecture (ISA)

Itanium Bundle Format

Each instruction is 41 bits long; a template consumes 5 bits, and there is one template per bundle

With 3 instructions per bundle, the overall bundle length is 3 * 41 + 5 = 128 bits, perfectly fitting into 16 bytes; all bundles are bundle-aligned, easily accomplished due to the first bundle residing on a mod-16 memory boundary

From then on all will be aligned on 16-byte boundaries

With the memory bus being 128 bits wide (or wider on future IPF implementations) and bundles being bundle-aligned, fetching instruction memory is fast

60

Instruction Set Architecture (ISA)

Itanium Bundle Format

The general layout of a bundle is shown next, with bits ordered from 0 through 127, increasing right to left

The template serves as a means for the compiler to communicate additional information about the instructions, without which they would be ambiguous

One such key piece of information is the placement of an instruction group stop, in assembler ;;

127 .. 87       86 .. 46        45 .. 5         4 .. 0
instruction 2 | instruction 1 | instruction 0 | template
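The bundle layout above translates directly into shifts and masks on a 128-bit integer. The Python sketch below uses only the field widths from the layout (5-bit template in bits 0..4, three 41-bit instruction slots above it); the function names `pack_bundle` and `unpack_bundle` and the sample slot values are illustrative assumptions.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Build a 128-bit bundle from a template and instructions 0, 1, 2."""
    bundle = template & TEMPLATE_MASK
    for index, instruction in enumerate(slots):
        bundle |= (instruction & SLOT_MASK) << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [instruction 0, 1, 2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots


bundle = pack_bundle(0x11, [1, 2, 3])   # arbitrary sample contents
template, slots = unpack_bundle(bundle)
print(hex(bundle), template, slots)
```

Instruction 0 starts at bit 5, instruction 1 at bit 46 and instruction 2 at bit 87, which matches the bit ranges in the layout.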

61

Instruction Set Architecture (ISA)

Itanium Bundle Format

A group stop can occur after instruction 2, or 1, or 0, indicating that an earlier group must complete execution before another starts

An Itanium bundle allows at most 2 stops. Thus, if 3 are needed, then a no-operation must be packed into one of the instructions, to effectively create 2 physical groups, with the third being the NOOP, whose execution order does not matter

Compiler-generated code performs this work-around automatically

62

Instruction Set Architecture (ISA)

Itanium Bundle Format

The template specifies which types of instructions are assembled into slots 0, 1 and 2. IPF instructions are partitioned into the following 6 types:

Type    Meaning
A       ALU: integer or memory unit
I       Non-ALU: Integer unit
M       Memory unit
F       Floating-point unit
B       Branch unit
L + X   Extended unit, or Branch unit

63

Instruction Set Architecture (ISA)

Itanium Bundle Format

Providing such information to the processor in the template speeds up instruction decoding, and thus improves execution speed

A list with the Instruction Set Architecture (ISA) templates and embedded stops is shown next:

64

Instruction Set Architecture (ISA)

Template #   Type     Slot 0           Slot 1                Slot 2
 0 = 0x00    MII      Memory unit      Integer unit          Integer unit
 1 = 0x01    MII_     Memory unit      Integer unit          Integer unit ;;
 2 = 0x02    MI_I     Memory unit      Integer unit ;;       Integer unit
 3 = 0x03    MI_I_    Memory unit      Integer unit ;;       Integer unit ;;
 4 = 0x04    MLX      Memory unit      L unit?               Extended unit
 5 = 0x05    MLX_     Memory unit      L unit?               Extended unit ;;
 6 = 0x06    reserved
 7 = 0x07    reserved
 8 = 0x08    MMI      Memory unit      Memory unit           Integer unit
 9 = 0x09    MMI_     Memory unit      Memory unit           Integer unit ;;
10 = 0x0a    M_MI     Memory unit ;;   Memory unit           Integer unit
11 = 0x0b    M_MI_    Memory unit ;;   Memory unit           Integer unit ;;
12 = 0x0c    MFI      Memory unit      Floating-point unit   Integer unit
13 = 0x0d    MFI_     Memory unit      Floating-point unit   Integer unit ;;
14 = 0x0e    MMF      Memory unit      Memory unit           Floating-point unit
15 = 0x0f    MMF_     Memory unit      Memory unit           Floating-point unit ;;
16 = 0x10    MIB      Memory unit      Integer unit          Branch unit
17 = 0x11    MIB_     Memory unit      Integer unit          Branch unit ;;
18 = 0x12    MBB      Memory unit      Branch unit           Branch unit
19 = 0x13    MBB_     Memory unit      Branch unit           Branch unit ;;
20 = 0x14    reserved
21 = 0x15    reserved
22 = 0x16    BBB      Branch unit      Branch unit           Branch unit
23 = 0x17    BBB_     Branch unit      Branch unit           Branch unit ;;
24 = 0x18    MMB      Memory unit      Memory unit           Branch unit
25 = 0x19    MMB_     Memory unit      Memory unit           Branch unit ;;
26 = 0x1a    reserved
27 = 0x1b    reserved
28 = 0x1c    MFB      Memory unit      Floating-point unit   Branch unit
29 = 0x1d    MFB_     Memory unit      Floating-point unit   Branch unit ;;
30 = 0x1e    reserved
31 = 0x1f    reserved
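The template table above lends itself to a simple lookup. The Python sketch below copies the type names from the table, where an underscore marks a stop after the preceding slot; the names `TEMPLATES` and `describe_template` are invented for this illustration, and reserved encodings are simply absent from the dictionary.

```python
# Template encoding -> type name; "_" marks a stop after the preceding slot
TEMPLATES = {
    0x00: "MII", 0x01: "MII_", 0x02: "MI_I", 0x03: "MI_I_",
    0x04: "MLX", 0x05: "MLX_",
    0x08: "MMI", 0x09: "MMI_", 0x0A: "M_MI", 0x0B: "M_MI_",
    0x0C: "MFI", 0x0D: "MFI_", 0x0E: "MMF", 0x0F: "MMF_",
    0x10: "MIB", 0x11: "MIB_", 0x12: "MBB", 0x13: "MBB_",
    0x16: "BBB", 0x17: "BBB_", 0x18: "MMB", 0x19: "MMB_",
    0x1C: "MFB", 0x1D: "MFB_",
}


def describe_template(code):
    """Return (unit letters per slot, slot numbers followed by a stop).

    Returns None for a reserved encoding.
    """
    name = TEMPLATES.get(code)
    if name is None:
        return None
    units, stops = [], []
    for character in name:
        if character == "_":
            stops.append(len(units) - 1)   # stop follows the slot just seen
        else:
            units.append(character)
    return units, stops


print(describe_template(0x03), describe_template(0x0A), describe_template(0x06))
```

Template 0x03 (MI_I_) yields stops after slots 1 and 2, template 0x0a (M_MI) a single stop after slot 0, and 0x06 is reported as reserved.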

65

Instruction Set Architecture (ISA)

Itanium Bundle Format

The difference between the above templates 0x00 and 0x01, both being MII type operations, is: after instruction 2 in template 0x01 there is a stop, while in template 0x00 there is none

In other words, the next bundle after the one for template 0x00 will belong to the same group, and a higher degree of parallelism will be possible

66

Instruction Set Architecture (ISA)

Itanium Assembly Code

An instruction group is a sequence of 1 or more instructions delimited by a stop. The first instruction in a whole program is thought to be preceded by a stop

Similarly, the last instruction of a complete program is thought to be followed by a stop

All instructions placed into a single group can be executed in parallel. Whether or not they will be depends on the number of hardware resources available. In the initial Itanium implementation only 6 resources are available

In a later implementation, many more may be available, thus potentially speeding up execution of the same old Itanium code on a future generation

The ;; indicates to the assembler where one group ends and thus the next group starts
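How stops partition a program into instruction groups can be sketched in Python. This is a toy splitter, not an assembler: the function name `split_into_groups` is hypothetical, and it assumes one instruction per line with an optional trailing ;; and an optional // comment.

```python
def split_into_groups(lines):
    """Partition assembly lines into instruction groups at each ;; stop."""
    groups, current = [], []
    for line in lines:
        text = line.split("//")[0].strip()   # drop comments and whitespace
        if not text:
            continue
        has_stop = text.endswith(";;")
        if has_stop:
            text = text[:-2].strip()
        if text:
            current.append(text)
        if has_stop and current:
            groups.append(current)           # the stop closes the group
            current = []
    if current:
        groups.append(current)               # end of program acts as a stop
    return groups


program = [
    "ld8 r14 = [r3] ;;       // group 1",
    "add r15 = r14, r16      // group 2",
    "add r17 = r18, r19 ;;",
]
groups = split_into_groups(program)
print(groups)
```

The sample program yields two groups: the load alone, then the two additions, which have no mutual dependence and may therefore run in parallel.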

67

Instruction Set Architecture (ISA)

Itanium Assembly Code

Some assembly language instructions are shown next:

cmp.eq p1, p2 = r33, r34

This checks general purpose registers 33 and 34 for equality; if they are equal, predicate register 1 is set to true and predicate register 2 to false. Otherwise, since GR33 and GR34 are not equal, p1 is set to false and p2 to true. A more complicated case is:

(p3) cmp.eq.unc p1, p2 = r33, r34

This checks if predicate register 3 is true at the start. If so, the instruction acts like a regular IPF comparison of GR33 and GR34

Else, i.e. if p3 is false a priori, predicate registers 1 and 2 are both set to false
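The behavior of the two compare forms above can be modeled in Python. This is a sketch of the semantics as described in the text, not of the hardware: the function name `compare_equal` and its parameters are assumptions for illustration, and writes to PR0 are ignored because PR0 is read-only.

```python
def compare_equal(pr, p_true, p_false, a, b, qp=0, unc=False):
    """Model of an equality compare writing two predicate registers.

    qp is the qualifying predicate; unc selects the .unc completer.
    """
    if pr[qp]:
        results = {p_true: int(a == b), p_false: int(a != b)}
    elif unc:
        results = {p_true: 0, p_false: 0}   # .unc clears both targets
    else:
        results = {}                        # predicated off: no change
    for index, value in results.items():
        if index != 0:                      # PR0 always stays 1
            pr[index] = value


pr = [1] + [0] * 63
compare_equal(pr, 1, 2, 7, 7)                      # equal: p1 = 1, p2 = 0
plain = (pr[1], pr[2])

pr[1], pr[2], pr[3] = 1, 1, 0
compare_equal(pr, 1, 2, 7, 7, qp=3, unc=True)      # p3 false with .unc
unconditional = (pr[1], pr[2])
print(plain, unconditional)
```

Without the .unc completer, a false qualifying predicate leaves both target predicates untouched; with it, both are cleared.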

68

Bibliography

1. Triebel, Walter: “IA-64 Architecture for Software Developers”, Intel Press © 2000, 308 pages

2. http://www.intel.com/design/itanium2/manuals/25110901.pdf

3. http://h21007.www2.hp.com/portal/StaticDownload?attachment_ciid=c2d2e0aecd2b7110VgnVCM100000275d6e10RCRD&ciid=ce1fd701521c7110VgnVCM100000275d6e10RCRD

4. http://www.intel.com/design/itanium/downloads/245320.htm

5. http://www.intel.com/design/itanium/manuals/iiasdmanual.htm

6. http://download.intel.com/design/Itanium2/manuals/25111003.pdf