Microprocessors (A) From the 386 to the Pentium 4 1 Dr. Martin Land Hadassah College Spring 2004 Intel Processors from 386 to Pentium 4

Post on 04-Oct-2020

Intel Processors

from 386 to Pentium 4

386

Intel 80386 Microprocessor

Simplified 386 Microprocessor

[Figure: simplified 386 block diagram. Blocks: Bus Interface Unit (address, data, control), Paging Unit (physical address), Segmentation Unit (linear address, shadow registers), Instruction Prefetch, Instruction Decoder, Decode and Sequencing (microcode), ALU (status flags), Registers. Paths: effective address (offset), code stream (linear byte sequence from CS), code address, displacements, ALU (data) bus.]

• Prefetch loads instruction bytes whenever there are no data accesses.
• Decoder identifies instruction boundaries and sends displacements to Address Management.
• Decode/Sequence generates microcode for instruction execution.
• ALU sends Effective Address to Address Management for data access.
• Address Management handles segmentation and paging.
• Registers are updated in the last step.

Problems with Pipelining the 386

[Figure: the simplified 386 block diagram, repeated from the previous slide]

• No Internal Data Cache
  • All data accesses are external (slow)
  • Unified memory access causes structural hazard on data accesses
• Instruction dependencies
  • Load+ALU operations stall during load of data operands
  • Conditional branches read status flags set by ALU instructions
  • Load+ALU operations use register-based pointers (depend on previous write-backs)
  • Branches cause a flush of the Instruction Prefetch queue

486

Upgrade of 80386

New Features in 486:
• Pipelines 386 instruction execution
• Floating-Point Unit (FPU) integrated on-chip
• 8 or 16 KB L1 cache on chip (unified: holds both code and data)
• Support for external L2 cache
• Multiprocessor support
• Support for battery-operated notebook PCs

Pipeline Organization

Each pipeline stage executes in one clock cycle

[Figure: pipeline stages Instruction Fetch → Stage-1 Decode → Stage-2 Decode → Execute → Write Back, with Instruction Memory (address in, instruction out), Data Memory (address in, data out), and a forwarding path.]

Five Pipelined Stages

• Instruction Prefetch (PF)
• Stage-1 Decode (D1)
  • Instruction Identification
  • Identify source operands:
    • Identify Register source
    • Calculate Effective Address for Data Memory (cache) source
• Stage-2 Decode (D2)
  • Complete complex Effective Addresses
  • Generation of Microcode
• Execution (EX)
  • Integer ALU
  • FP ALU
  • Data memory writes
    • Write to fast memory buffer
    • Buffer updates cache
• Register Writeback (WB)

486 Internal Organization

[Figure: 486 block diagram. Blocks: Bus Interface Unit (BIU), Instruction Prefetch, Cache, Decoder, MMU, ALU, FPU.]

Intel Architecture Floating Point Unit (FPU)

• 8087 numeric processor
  • Separate 8086 integer CPU and 8087 FPU
• 387 DX and SX math coprocessors
  • Implement the final IEEE STD 754
  • Added new trigonometric instructions
• 486 processor FPU
  • On-chip equivalent of the Intel 387 DX math coprocessor
  • IEEE STD 754
• Pentium FPU
  • Completely redesigned FPU
  • Conformance to both the IEEE STD 754 and 854
  • Algorithms with three times the performance of 486
  • A shortcut in the divide hardware (the FDIV bug) cost Intel a lot of money to correct

Support for Battery Operated Notebook PC

• System Management Mode (SMM)
  • Special purpose interrupt
  • Address space for storing processor state
  • Transparent to OS and applications software
• Stop Clock States
  • Initiated by external signal (hardware control)
  • “Fast Wake-Up” Stop Grant state
    • Stops processor I/O operations
  • “Slow Wake-Up” Stop Clock state
    • CLK frequency → 0 MHz
• Auto Halt Power Down
  • Similar to Stop Clock
  • Initiated by HALT instruction (software control)
• Dynamic Local Power Management
  • Subsystems switch themselves off when not needed

Pentium

Superscalar Architecture

• Two integer instruction pipelines
  • “U” pipe can execute all integer instructions
  • “V” pipe can execute “simple integer” instructions
• Floating Point Unit integrated with integer pipelines
• Each pipeline can issue most instructions in one clock cycle
  • Instruction issue = entry into the instruction execution stage (after fetch and decode have completed)

Instruction Pairing

• Process of issuing two instructions in parallel
• When instructions are paired:
  • First instruction issued to the U-pipe
  • Next sequential instruction issued to V-pipe
• Pairing not possible if:
  • The instructions have dependencies
  • Either instruction is complex

[Figure: U-pipe and V-pipe timing, pipeline stages 1 to 5 in each pipe, for instructions I1 to I4.]

Pipeline Stages for Pentium (without MMX)

Stage  Function
PF     Prefetch
D1     Instruction Decode
D2     Address Generate
EX     Execution
WB     Write Back

[Figure: a PF stage fed from RAM, followed by D1, D2, EX, and WB stages drawn for both the U pipe and the V pipe.]

Pipeline stages are very similar to 486 stages (not identical)

Pentium Block Diagram

Integer Instruction Pairing Rules

Pairing: two instructions issued on the same clock cycle (one to U-pipe and one to V-pipe)

Pairing requires the following conditions:
1. Both instructions must be “simple”
2. No RAW or WAW register dependencies between instructions
3. Register dependencies include pointers and flags
4. Neither instruction contains both displacement and immediate
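The four conditions above can be written as a small predicate. This is a minimal sketch, not the Pentium's actual issue logic: the `Instr` record, its field names, and the sample instructions are hypothetical, and the caller is assumed to supply the "simple" classification and the read/write sets (with pointer registers and flags included, per rule 3).

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    text: str
    reads: set = field(default_factory=set)    # registers and flags read, pointer registers included
    writes: set = field(default_factory=set)   # registers and flags written
    simple: bool = True                        # rule 1 classification, supplied by the caller
    has_disp: bool = False
    has_imm: bool = False

def can_pair(u: Instr, v: Instr) -> bool:
    """Check the four listed conditions for issuing u to the U-pipe and v to the V-pipe."""
    if not (u.simple and v.simple):
        return False                           # rule 1
    if u.writes & v.reads:
        return False                           # rules 2 and 3: RAW (v reads what u writes)
    if u.writes & v.writes:
        return False                           # rules 2 and 3: WAW (both write the same place)
    if any(i.has_disp and i.has_imm for i in (u, v)):
        return False                           # rule 4
    return True

# Hypothetical instruction pairs
mov_eax = Instr("MOV EAX, 1", writes={"EAX"}, has_imm=True)
mov_ebx = Instr("MOV EBX, 2", writes={"EBX"}, has_imm=True)
load = Instr("MOV ECX, [EAX]", reads={"EAX"}, writes={"ECX"})

print(can_pair(mov_eax, mov_ebx))  # independent simple instructions
print(can_pair(mov_eax, load))     # RAW on the pointer register EAX
```

Treating pointers and flags as ordinary entries in the read and write sets keeps rule 3 from needing a separate check.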

Branch Prediction (1)

Branch Target Buffer (BTB) is a special cache that stores information about branch instructions:
• Source address (identifies particular branch instruction)
• Target address (“jump to” address)
• 2 History bits provide 4 states:
  (11) strongly taken
  (10) weakly taken
  (01) weakly not taken
  (00) strongly not taken

On a branch instruction, BTB makes predictions about branches:
• Branch Taken or Branch Not Taken (by high order history bit)
• Target address (if Taken)

• D1 decoder (Stage 2) reads prediction from the BTB
• Instructions are fetched according to prediction
• Branches that “miss” in BTB are treated as not taken

Branch Prediction (2)

• Branch Predictions are verified in EX or WB
• On first verification of a branch instruction:
  • If Not Taken, no BTB entry is made
  • If Taken, the BTB creates a new entry:
    • Instruction address of branch instruction
    • Branch target address
    • Prediction that branch is strongly taken

Branch Prediction (3)

On subsequent executions of the same branch instruction:
• When branch instruction enters D1,
  • D1 decoder reads the prediction from the BTB
  • On a Not Taken prediction, the next instruction in the Sequential Prefetch Buffer is sent to D1
  • On a Taken prediction, the Prediction Prefetch Buffer prefetches and sends instructions to D1
• When branch instruction enters EX,
  • The branch is verified as Taken or Not Taken
  • On correct prediction,
    • U-pipe and V-pipe continue
    • BTB entry is updated (history bits adjusted up or down)
  • On misprediction
    • Both pipelines are flushed
    • BTB entry is updated (history bits and branch target)
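The history-bit update can be sketched as a 2-bit counter. The slides only say the bits are "adjusted up or down"; the saturating increment and decrement below is an assumption, and the function names are hypothetical.

```python
def next_history(history: int, taken: bool) -> int:
    """Move the 2-bit history up on taken, down on not taken, saturating at 0b00 and 0b11."""
    return min(history + 1, 0b11) if taken else max(history - 1, 0b00)

def predict_taken(history: int) -> bool:
    """The high-order history bit gives the prediction."""
    return bool(history & 0b10)

NAMES = {0b11: "strongly taken", 0b10: "weakly taken",
         0b01: "weakly not taken", 0b00: "strongly not taken"}

history = 0b11            # a new BTB entry starts as strongly taken
trace = []                # (state before the branch, was the prediction correct?)
for taken in [True, False, False, True]:
    trace.append((NAMES[history], predict_taken(history) == taken))
    history = next_history(history, taken)

for state, correct in trace:
    print(f"{state:20s} {'correct' if correct else 'mispredicted'}")
```

With two history bits, a single not-taken outcome moves a strongly taken branch only to weakly taken, so the next prediction is still "taken".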

Branch Prediction (4)

For typical loops, branches are mispredicted:
• On the first run (BTB miss ⇒ mispredicted as not taken)
• On the last run (mispredicted as taken)

Example:

        MOV [EBP-02], 0001
FOO:    INC [EBP-02]
        CMP [EBP-02], 0400
        JLE FOO
NEXT:   ADD EAX, EBX

• Loop runs 400h = 1024 (decimal) times
• On first run of JLE FOO
  • BTB miss ⇒ mispredicted as not taken
  • 3 stall cycles for pipeline flush
  • BTB entry created for JLE FOO as strongly taken
• On next 1022 runs of JLE FOO
  • BTB correctly predicts as taken with no stall cycles
• On last run of JLE FOO
  • BTB hit ⇒ mispredicted as taken
  • 3 stall cycles for pipeline flush
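The loop example can be replayed with a short simulation to confirm the count of two mispredictions. This is a sketch under the rules stated on the slides (BTB miss predicts not taken, a new entry starts strongly taken, 3 stall cycles per flush); the saturating update and the function name `run_branch` are assumptions.

```python
STALL_CYCLES = 3  # flush penalty quoted on the slide

def run_branch(outcomes):
    """Replay one branch's taken/not-taken outcomes; return (mispredictions, stall cycles)."""
    history = None                      # None = no BTB entry yet (BTB miss)
    mispredicted = 0
    for taken in outcomes:
        predicted = history is not None and history >= 0b10
        if predicted != taken:
            mispredicted += 1
        if history is None:
            if taken:
                history = 0b11          # new entry: strongly taken
        elif taken:
            history = min(history + 1, 0b11)
        else:
            history = max(history - 1, 0b00)
    return mispredicted, mispredicted * STALL_CYCLES

# JLE FOO executes 0x400 times: taken 1023 times, then it falls through once.
outcomes = [True] * (0x400 - 1) + [False]
print(run_branch(outcomes))  # prints (2, 6)
```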

Branch Prediction (5)

Example with nested loops:

        MOV [EBP-02], +01
FOO:    MOV [EBP-04], +00
BAR:    INC [EBP-04]
        CMP [EBP-04], +03
        JLE BAR
        INC [EBP-02]
        CMP [EBP-02], +03
        JLE FOO
NEXT:   ADD EAX, EBX
        SUB EDX, ECX
        ADD EAX, EBX

• On first run, JLE BAR misses in BTB
  • Mispredicted as not taken
  • New BTB entry as strongly taken
  • 3 stall clocks
• On following runs, JLE BAR predicted as taken
  • Correctly predicted, until end of loop
  • No stall clocks
• At end of inner loop,
  • Flushed with 3 stall clocks
  • Marked weakly taken in BTB
• On next FOO loop, JLE BAR predicted as (weakly) taken
  • Correctly predicted, until end of loop
  • At end of inner loop,
    • Flushed with 3 stall clocks
    • Marked weakly taken in BTB
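The nested-loop case can be checked the same way by stepping through the two counters and keeping one BTB entry per branch. This is a sketch: the dictionary keyed by branch label stands in for the BTB's lookup by source address, and the saturating history update is an assumption.

```python
STALL_CLOCKS = 3

def simulate():
    btb = {}                                   # branch label -> 2-bit history
    stalls = {"JLE BAR": 0, "JLE FOO": 0}

    def branch(label, taken):
        history = btb.get(label)
        predicted = history is not None and history >= 0b10   # a miss predicts not taken
        if predicted != taken:
            stalls[label] += STALL_CLOCKS
        if history is None:
            if taken:
                btb[label] = 0b11              # new entry: strongly taken
        else:
            btb[label] = min(history + 1, 0b11) if taken else max(history - 1, 0b00)

    outer = 1                                  # MOV [EBP-02], +01
    while True:
        inner = 0                              # FOO: MOV [EBP-04], +00
        while True:
            inner += 1                         # BAR: INC [EBP-04]
            taken = inner <= 3                 # CMP [EBP-04], +03
            branch("JLE BAR", taken)           # JLE BAR
            if not taken:
                break
        outer += 1                             # INC [EBP-02]
        taken = outer <= 3                     # CMP [EBP-02], +03
        branch("JLE FOO", taken)               # JLE FOO
        if not taken:
            break
    return stalls, btb

stalls, btb = simulate()
print(stalls)
print({label: format(bits, "02b") for label, bits in btb.items()})
```

In this model JLE BAR pays two flushes on the first pass through the inner loop (BTB miss, then loop exit) and one flush on each later pass, and its entry is left as weakly taken (10) each time, as the slide describes.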

Integrated On-Chip Split Cache (1)

• Separate code and data caches integrated on-chip
• Each cache is 8 Kbytes in size
• 32-byte line (block) size
• 2-way set associative
• Each cache has a dedicated TLB
  • TLB = translation look-aside buffer, which caches linear address to physical address translations
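The cache parameters above fix how an address divides into tag, set index, and byte offset. A minimal sketch of that arithmetic follows, assuming a 32-bit address; the function name `split_address` and the sample address are hypothetical.

```python
CACHE_BYTES = 8 * 1024
LINE_BYTES = 32
WAYS = 2
ADDRESS_BITS = 32                               # assumption: 32-bit address

NUM_LINES = CACHE_BYTES // LINE_BYTES           # 256 lines in the cache
NUM_SETS = NUM_LINES // WAYS                    # 128 sets of 2 lines each
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 5 bits select a byte within a line
INDEX_BITS = NUM_SETS.bit_length() - 1          # 7 bits select a set
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS   # 20 bits remain for the tag

def split_address(addr: int):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)
```

With these sizes, the index and offset together cover 12 bits (4 KB), and the remaining 20 bits form the tag compared against the two lines in the selected set.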

Integrated On-Chip Split Cache (2)

• Data cache has two ports (one for each pipe)
• The cache tags are triple ported
  • Allow three simultaneous inquire cycles: u-pipe, v-pipe and I/O unit
• Code cache closely integrated with
  • Branch prediction hardware
  • Prefetch buffers
• Not all main memory must be (can be) cached
• Instructions can be fetched from code cache or directly from main memory

MMX

MMX™ Technology Programming Environment

• MMX = Multimedia Extensions
• Vector extensions to 32-bit Intel Architecture (IA-32)
• SIMD execution model
  • Single-instruction
  • Multiple-data
• MMX ALU integrated into Pentium pipeline
• MMX Instructions added to ISA
  • No new mode or operating system visible state
  • All existing software runs as before

Single Instruction, Multiple Data (SIMD) Execution Model

Similar to Very Long Instruction Word (VLIW) machine

Add Packed

PADDB: Add Packed Bytes with Wraparound

PADDSB: Add Packed Signed Bytes with Signed Saturation

PADDUSB: Add Packed Unsigned Bytes with Unsigned Saturation

[Figure: the eight byte lanes of the 64-bit operands are added lane by lane]

SRC    63..56  55..48  47..40  39..32  31..24  23..16  15..8  7..0
  +
DEST   63..56  55..48  47..40  39..32  31..24  23..16  15..8  7..0
  =
DEST   63..56  55..48  47..40  39..32  31..24  23..16  15..8  7..0
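The difference between wraparound and saturation is easiest to see on concrete byte values. The following sketch models the three packed-byte adds on plain Python integers holding 64-bit operands; the helper names are hypothetical and the code illustrates the arithmetic only, not the hardware.

```python
def _lanes(value):
    """Split a 64-bit value into 8 bytes; lane 0 holds bits 7..0."""
    return [(value >> (8 * i)) & 0xFF for i in range(8)]

def _pack(lanes):
    """Reassemble 8 byte lanes (two's complement for negatives) into a 64-bit value."""
    return sum((b & 0xFF) << (8 * i) for i, b in enumerate(lanes))

def _signed(b):
    return b - 256 if b >= 128 else b

def paddb(dest, src):
    """Wraparound: each lane keeps only the low 8 bits of the sum."""
    return _pack([(d + s) & 0xFF for d, s in zip(_lanes(dest), _lanes(src))])

def paddsb(dest, src):
    """Signed saturation: each lane is clamped to -128..127."""
    return _pack([max(-128, min(127, _signed(d) + _signed(s)))
                  for d, s in zip(_lanes(dest), _lanes(src))])

def paddusb(dest, src):
    """Unsigned saturation: each lane is clamped to 0..255."""
    return _pack([min(255, d + s) for d, s in zip(_lanes(dest), _lanes(src))])

dest, src = 0x7FFF01, 0x010101          # lanes (low to high): 01, FF, 7F plus 01, 01, 01
print(hex(paddb(dest, src)))            # 0x800002
print(hex(paddsb(dest, src)))           # 0x7f0002
print(hex(paddusb(dest, src)))          # 0x80ff02
```

Lane 1 (FF + 01) wraps to 00 under PADDB and under PADDSB (where it is -1 + 1 = 0), but clamps to FF under PADDUSB; lane 2 (7F + 01) becomes 80 except under PADDSB, where it clamps to 7F.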

Packed Multiply and Add (PMADDWD)

DEST[31..0] ← (DEST[15..0] × SRC[15..0]) + (DEST[31..16] × SRC[31..16])

DEST[63..32] ← (DEST[47..32] × SRC[47..32]) + (DEST[63..48] × SRC[63..48])
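The two formulas can be traced on small numbers with a sketch like the one below. It treats the 16-bit words as signed, which is how PMADDWD is defined but is not stated on the slide; the helper names are hypothetical.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed word."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pmaddwd(dest, src):
    """Multiply the four word pairs, then add adjacent products into two 32-bit results."""
    result = 0
    for half in range(2):                       # half 0 -> bits 31..0, half 1 -> bits 63..32
        total = 0
        for word in range(2):
            shift = 32 * half + 16 * word
            total += _s16(dest >> shift) * _s16(src >> shift)
        result |= (total & 0xFFFFFFFF) << (32 * half)
    return result

dest = 0x0004_0003_0002_0001                    # words (low to high): 1, 2, 3, 4
src = 0x0008_0007_0006_0005                     # words (low to high): 5, 6, 7, 8
print(hex(pmaddwd(dest, src)))                  # low: 1*5 + 2*6 = 17, high: 3*7 + 4*8 = 53
```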

P6 Architecture for Pentium II, III, 4

P6 Architecture

• New hardware architecture
• Supports IA-32 Instruction Set Architecture (ISA)
  • Instructions, registers, data types, addressing modes, etc.
  • From outside, P6 looks like any other IA-32 machine
• Internal operations in a RISC core machine
• First introduced in the Pentium Pro (1995)
• Architectural basis for Pentium II, III, and 4

Main P6 Architecture Features

• Internal RISC core machine
  • IA-32 instructions converted to RISC-type micro-ops
• Greater ILP than in Pentium
  • ILP = Instruction Level Parallelism
• Deeper branch prediction than in Pentium
  • Larger branch cache in BTB
• Out-of-order instruction execution
  • Instructions run through pipeline
    • In most convenient order
    • Not in the program listing order

P6 Architecture Subsystems

P6 Instruction Fetch

[Figure: fetch path. Labels: I/O Operations, Memory Access, Fetch, Instruction Cache Updates, Data Cache Updates.]

P6 Instruction Pool

Pooling: pool of micro-operations which can be executed in any convenient order

P6 Instruction Execution

[Figure: execution path. Labels: Find an Instruction Ready to Execute, Return Executed Instruction to Pool, Data Reads.]

P6 Retirement

[Figure: retirement path. Labels: Retire Finished Instructions in Original Program Order, Data Writes, Register Updates.]

P6 Subsystems

• Memory Subsystem
• Processing Units
  • Fetch/Decode Unit
  • Dispatch/Execute Unit
  • Retire Unit
  • Instruction Pool
• IA-32 Registers

Memory Subsystem

• System Bus
  • External computer memory bus
  • Connection to main RAM
  • 36-bit address bus (physical address space of 64 GBytes)
• L2 cache: unified, 256 KB
• Bus Interface Unit (BIU): controls L1 access to L2 and RAM
• L1 cache
  • 8 KB data
  • 8 KB instruction

IA-32 Register File

• IA-32 instruction set defines familiar registers
  • Standard register set since 386
• Using IA-32 registers is required for instruction set compatibility
• IA-32 registers are used as P6 source and destination operands
• Internal calculations use a larger RISC-type register set (not visible to programmer)

Processing Units

• Fetch/Decode Unit
  • Fetches IA-32 instructions
  • Converts each IA-32 instruction to one or more (RISC-type) micro-ops
  • Places independent micro-ops into Instruction Pool
• Dispatch/Execute Unit executes micro-ops
  • Identifies dependencies
  • Performs branch prediction
  • Chooses instructions which are ready to execute
  • Returns results to Instruction Pool
• Retire Unit
  • Converts micro-op results back to IA-32 format
  • Preserves original program order
  • Updates IA-32 registers

Micro-Operations (Micro-Ops)

• Independent RISC-like primitive instructions
• Triadic instructions
  • Two logical sources and one logical destination
  • “logical source” = not visible to programmer
• Each simple IA-32 instruction is converted into one micro-op (example: MOV AX,BX)
• Complex instructions are decoded into 2 to 4 micro-ops (example: MOV AX,[BX+SI+78])
• Very complex instructions are decoded into preprogrammed micro-op sequences (example: FSQRT)

Register Alias Table (RAT)

• Last stage in decoding process
• Aliases IA-32 register references to GP registers
• Adds status bits to micro-ops to aid scheduling
• Passes micro-ops to the Instruction Pool
• No instruction reordering yet
• Micro-op stream is a RISC-equivalent of the decoded IA-32 instruction stream

Dynamic Execution

• Micro-ops not executed in original program order
• Micro-ops are executed when ready
  • All source operands are available

Requires three conceptual ingredients:
• Deep Branch Prediction
• Dynamic Data Flow Analysis
• Speculative Execution

Deep Branch Prediction

• Extends Pentium branch prediction:
  • Predicts branches to several nested levels
  • Requires larger statistical record than Pentium
• Implemented in instruction fetch/decode unit
• Includes branches, calls, and interrupts

Dynamic Data Flow Analysis

• Monitors micro-ops
• Looks for data and register dependencies
• Locates any micro-ops ready for execution (Ready = all source operands are available)
• Enables out-of-order execution
• Keeps the execution units busy
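The "ready" rule can be illustrated with a toy scheduler that issues any micro-op in the pool whose sources are available. Everything in this sketch is hypothetical: the `MicroOp` record, the latencies, the limit of two execution units, and the sample pool. It also assumes each micro-op writes a distinct internal register, and it does not model in-order retirement.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    name: str
    dest: str
    sources: tuple
    latency: int = 1          # cycles until the result is available (assumed values)

def schedule(pool, available, units=2, max_cycles=1000):
    """Return {cycle: [names issued]}; a micro-op issues once all its sources are available."""
    ready_at = {reg: 0 for reg in available}    # register -> cycle its value becomes available
    waiting = list(pool)                        # kept in program order
    issued = {}
    for cycle in range(max_cycles):
        if not waiting:
            break
        ready = [op for op in waiting
                 if all(ready_at.get(s, max_cycles) <= cycle for s in op.sources)][:units]
        for op in ready:
            waiting.remove(op)
            ready_at[op.dest] = cycle + op.latency
        if ready:
            issued[cycle] = [op.name for op in ready]
    return issued

pool = [                                        # program order
    MicroOp("load", "r1", ("a",), latency=3),
    MicroOp("add", "r2", ("r1", "b")),
    MicroOp("mul", "r3", ("c", "d")),
    MicroOp("sub", "r4", ("r3", "b")),
]
print(schedule(pool, available={"a", "b", "c", "d"}))
```

In this run `mul` and `sub` issue ahead of `add`, which has to wait for the slow load, so the execution units stay busy while the program-order neighbor is stalled.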

Speculative Execution

• Execute instructions ahead of the program counter
• Execute instructions before “normal fetch” time
• Branch Prediction determines most likely instructions for execution
• Store results in temporary registers
• Some executed instructions will never be used
• Commit the result of each instruction
  • Only if the speculation is a correct prediction
  • In the original program order

Register Dependencies

• IA-32 has 8 “general purpose” registers
• Small register set can cause data hazard stalls

        MOV BX, [SI+1234]
        ADD BX, [BX]

• Decoding to micro-ops aliases IA-32 registers
  • 40 general purpose 32-bit registers in RISC core
  • Decoder assigns a RISC register to an IA-32 register
  • Can assign multiple GP registers to one IA-32 register
  • Can prevent dependencies
  • Handle integers and floating point data
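How aliasing prevents dependencies can be sketched with a table that maps each IA-32 register to the most recently assigned internal register. This is an illustration only: the names `rename` and `P0`, `P1`, ... are hypothetical, registers are never freed, and the four-instruction sequence is invented to show reuse of BX (it is not the pair on the slide, whose dependence through BX is a true one that renaming cannot remove).

```python
def rename(instrs, num_physical=40):
    """instrs: list of (dest, sources) using IA-32 register names.
    Returns the same list with every destination given a fresh internal register."""
    rat = {}                                    # IA-32 register -> current internal register
    free = iter(range(num_physical))            # simplistic free list, never recycled
    renamed = []
    for dest, sources in instrs:
        srcs = tuple(rat.get(s, s) for s in sources)   # read through the current aliases
        phys = f"P{next(free)}"
        rat[dest] = phys                        # later readers of dest see the new alias
        renamed.append((phys, srcs))
    return renamed

program = [
    ("BX", ("SI",)),          # MOV BX, [SI+1234]
    ("AX", ("AX", "BX")),     # ADD AX, BX
    ("BX", ("DI",)),          # MOV BX, [DI]   (reuses BX: WAW and WAR on the IA-32 name)
    ("CX", ("CX", "BX")),     # ADD CX, BX
]
for line in rename(program):
    print(line)
```

After renaming, the second load writes P2 instead of overwriting P0, so the last two instructions no longer depend on the first two and could execute in either order.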

Pentium II, III, 4

Pentium II

• Pentium Pro with MMX™ Technology
• Pipeline Sections Renamed
  • Fetch/Decode Unit → In-Order Issue Front-end
  • Dispatch/Execute Unit → Out-of-Order Core
  • Retire Unit → In-Order Retirement unit

Pentium III

• Maintains the P6 architecture
• Supports all IA-32 features up to Pentium II
• Pentium II with Streaming SIMD Extensions (SSE)
  • Floating Point version of MMX
  • Single Instruction Multiple Data (vector) FPU

Note:
• All RISC-type processors perform integer instructions very efficiently.
• As multimedia programming became more important in the 1990s, the measure of processor speed shifted to Floating Point efficiency.

Pentium 4

• Maintains the P6 architecture
• Supports all IA-32 features up to Pentium III
• Redesign of the P6 pipeline model
• NetBurst Micro-Architecture
  • Superpipelining
  • Deeper Branch Prediction
  • Front End Pipeline Cache Subsystem
  • Quad-Pumped I/O Bus

Superpipelining Technique

Divide each stage into 2 stages:
• Each stage does half the work
• Each stage requires half the time
• Double the clock rate (divide the clock cycle time by 2): τ → τ/2

Normal pipeline (5 stages, cycle time τ):

Clock  1   2   3   4   5   6   7
I1     S1  S2  S3  S4  S5
I2         S1  S2  S3  S4  S5
I3             S1  S2  S3  S4  S5

Superpipeline (10 stages, cycle time τ/2):

Clock  1   2   3   4   5   6   7   8   9   10  11  12
I1     S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
I2         S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
I3             S1  S2  S3  S4  S5  S6  S7  S8  S9  S10

\[ CPI^{ideal}_{normal} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 4}{IC} \xrightarrow{IC\ \text{large}} 1 \]

\[ \text{Run-Time}^{ideal}_{normal} = CPI^{ideal}_{normal} \times IC \times \tau \xrightarrow{IC\ \text{large}} IC \times \tau \]

\[ CPI^{ideal}_{super} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 9}{IC} \xrightarrow{IC\ \text{large}} 1 \]

\[ \text{Run-Time}^{ideal}_{super} = CPI^{ideal}_{super} \times IC \times \frac{\tau}{2} \xrightarrow{IC\ \text{large}} IC \times \frac{\tau}{2} = \frac{1}{2}\,\text{Run-Time}^{ideal}_{normal} \]
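The ideal run-time comparison can be checked numerically. The sketch below assumes a stall-free pipeline in which IC instructions take IC + (stages - 1) cycles, matching the 7-cycle and 12-cycle timing diagrams for three instructions; the function names are hypothetical.

```python
def cycles(ic: int, stages: int) -> int:
    """Cycles to finish ic instructions in an ideal pipeline: fill time plus one per instruction."""
    return ic + stages - 1

def run_time(ic: int, stages: int, tau: float) -> float:
    return cycles(ic, stages) * tau

def speedup(ic: int, tau: float = 1.0) -> float:
    """Run time of the 5-stage pipeline divided by that of the 10-stage pipeline at tau/2."""
    return run_time(ic, 5, tau) / run_time(ic, 10, tau / 2)

print(cycles(3, 5), cycles(3, 10))      # the two timing diagrams: 7 and 12 cycles
for ic in (3, 100, 1_000_000):
    print(ic, round(speedup(ic), 4))
```

For three instructions the speedup is only 7/6 because pipeline fill dominates; as IC grows the ratio approaches the ideal factor of 2.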


Superpipelining in Pentium 4

Rapid Execution Engine
• Higher clock speed

Hyper Pipelined Technology
• 20 stage pipeline (double the Pentium III pipeline length)
• Each stage does less processing work

Typical instruction requires same processing time
• Half the time in each stage
• Double the number of stages

Ideally, doubles number of instructions finished per second
• Finish one instruction per cycle
• Twice the cycles per second


Deeper Branch Prediction

Expanded Branch Target Buffer (BTB)
• 4 K-entries
• Was 256 in Pentium

Expanded Instruction Pool
• 126 instructions in various stages of execution
• Was 40 in Pentium Pro

Improved branch prediction algorithm


New Instruction Cache Subsystem

Called Front End Pipeline

IA-32 Instruction Cache is extended to 128 byte line-size
• Was 32 bytes in Pentium II/III

Caching for Micro-ops
• Trace = micro-op sequence for one IA-32 instruction
• A Trace Cache stores decoded micro-op traces
• Loops re-use cached traces
• Skips additional decode of same IA-32 instructions


Quad-Pumped I/O Bus

New organization of I/O bus

Bus cycles determined by 100 MHz clock
• Can make 4 transfers per bus cycle
• 4 transfers/cycle × 100 MHz = 400 M-Transfers per second

Data bus width of 8 bytes (64 bits)
• 8 bytes/transfer × 400 M-Transfers per second = 3200 MB/second = 3.2 GB/second
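The bandwidth arithmetic on this slide can be written as a one-line function. This is a small Python sketch; the function name and parameter names are illustrative, and the numbers are the ones given on the slide.

```python
def bus_bandwidth_bytes_per_s(bus_clock_hz, transfers_per_cycle, bus_width_bytes):
    """Peak bus bandwidth = clock rate x transfers per cycle x bytes per transfer."""
    return bus_clock_hz * transfers_per_cycle * bus_width_bytes


if __name__ == "__main__":
    # Slide values: 100 MHz bus clock, quad-pumped, 8-byte data bus.
    peak = bus_bandwidth_bytes_per_s(100_000_000, 4, 8)
    print(peak // 1_000_000, "MB/second")
```

This is a peak figure: it assumes every transfer slot carries data.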


Superpipelining

Rapid Execution Engine
• Higher clock speed
• ALU operations take ½ clock cycle

Hyper Pipelined Technology
• 20 stage pipeline (double the Pentium III pipeline length)
• Each stage does less processing work

Typical instruction requires same processing time
• Half the time in each stage
• Double the number of stages

Ideally, doubles number of instructions finished per second
• Finish one instruction per cycle
• Twice the cycles per second


Pentium 4 Performance Issues

Advertising
• Pentium 4 processors have very high clock speeds
• Range from 1.4 GHz to 4 GHz

Reality
• Higher clock speeds result from superpipelining
• Clock speed has a different meaning than in P II/III

How should we compare clock speeds?

Expectation
• 1.5 GHz processor is 50% faster than same processor at 1.0 GHz

Measurement
• 1.5 GHz Pentium 4 is 20% faster than 1.0 GHz Pentium III


Problems With Superpipelining

Not all operations can be divided into smaller stages
• PUSH/POP can easily be superpipelined: Split single PUSH stage into
  1. SP-- stage
  2. [SP] ← value stage
• IMUL/DIV/CMP may be harder to split

Some stalls depend on clock cycles and on real time
• Pipeline flush:
  Clock runs at twice the speed
  Superpipeline: Twice as many instructions in pipeline
  Twice as many wasted pipeline cycles were run and cancelled
  Pipeline flush penalty does not scale with clock frequency
• Cache penalties depend on reaction time of memory


CPIstall and Effective Clock Rate

Suppose that, on average, every 2nd instruction will stall for 1 cycle
• CPIstall (Pentium-4) ≈ 0.5 cycles/instruction (plus Pentium-III stalls)
• CPItotal (Pentium-4) ≈ 1.5 cycles/instruction (plus Pentium-III stalls)
• Effective Pentium-4 clock speed ≈ (Pentium-4 clock speed)/1.5
• Pentium-4 with 1.5 GHz clock has effective clock of a Pentium-III with 1.0 GHz clock

With clock rate R = 1/τ:

\[
\text{Run-Time} = CPI^{super}_{total} \times IC \times \tau
= CPI^{super}_{total} \times IC \times \frac{1}{R}
= \left(1 + CPI^{super}_{stall}\right) \times IC \times \frac{1}{R}
= IC \times \frac{1}{R_{effective}}
\]

\[
R_{effective} = \frac{R}{1 + CPI^{super}_{stall}} \quad \text{(Effective Clock Rate)}
\]
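The effective clock rate on this slide is the nominal clock rate divided by (1 + CPIstall). A minimal Python sketch of that calculation follows; the function name is illustrative, and the 0.5 stall value is the slide's own supposition rather than a measurement.

```python
def effective_clock_rate(clock_hz, cpi_stall):
    """Effective clock rate R_effective = R / (1 + CPI_stall)."""
    return clock_hz / (1.0 + cpi_stall)


if __name__ == "__main__":
    # Slide example: a 1.5 GHz clock with 0.5 extra stall cycles per instruction.
    r_eff = effective_clock_rate(1.5e9, 0.5)
    print(r_eff / 1e9, "GHz effective")
```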


Fair Comparison

Accounting for the different meanings of clock speed:

Compared to the 1.0 GHz Pentium-III,
• 1.5 GHz Pentium-4 is 20% faster on SPECint2000
• 1.5 GHz Pentium-4 is 75% faster on SPECfp2000

Speed-up is the result of the architectural enhancements
• A very reasonable performance improvement

Code compiled with Pentium-4 optimization is faster than older code


Intel Itanium

IA-64


Itanium Overview

Intel's 64-bit architectural plan

Goals of the Itanium architecture
• Support 64-bit addresses
• IA-32 backward compatibility
• Increase instruction level parallelism (ILP)
• Improve branch handling
• Reduce hardware burden using compile-time information
• Improve floating point performance

New Methodology: Explicitly Parallel Instruction Computing (EPIC)
• Compiler identifies instruction dependencies
• Compiler reschedules instructions for optimized execution
• Compiler groups instructions for parallel issue to Execution Units
• Hard work done once by compiler (not each time by hardware)


Operating Environments


Data Types

Pointers: 8 bytes

Integer: 1, 2, 4, and 8 bytes (byte, word, doubleword, quadword)

Floating Point: single, double and double-extended formats


New Features in Itanium Instruction Set

RISC-like syntax
• Load-Store architecture
• Uniform instruction length (41 bits)

Explicit instruction parallelism
• Compiler chooses instructions to run in parallel
• Compiler provides hints to the processor
• Predication replaces branching

More flexible use of registers
• 128 integer and 128 floating-point registers
• Register renaming replaces “spill and fill” on calls
• Register rotation allows parallelization of loops


Instruction Format

General Syntax:

[(qp)] mnemonic[.comp1][.comp2] dests = srcs

(qp)              qualifying predicate register
mnemonic          name identifying the instruction
[comp1][comp2]    completers indicate optional variations on basic mnemonic
dests, srcs       source operands are registers or immediates;
                  destination is typically a register


Instruction Format Examples

Simple Instruction
    add r1 = r2, r3
    r1 ← r2 + r3

Instruction with Immediate
    add r1 = r2, r3, 1
    r1 ← r2 + r3 + 1

Instruction with Completer
    cmp.eq p3 = r2, r4
    if (r2 eq r4) then p3 ← 1

Predicated Instruction
    (p4) add r1 = r2, r3
    if (p4=1) then {r1 ← r2 + r3}
    if (p4=0) then {NOP}


Predication Replaces Conditional Branches

Conditional execution of predicated instructions
• Example: if (p5) r1 = r2 + r3
  Executes ADD if p5 = 1
  Executes NOP if p5 = 0

Predicate registers
• 64 predicate registers: pr0 to pr63
• Set/Clear by compare instructions

Advantages over conditional branch
• Eliminate misprediction penalties
• Allow larger parallel instruction blocks (no dependencies)


Predication Example

High level code
    if (a > b) c = c + 1
    else d = d * e + f

Predicated code
    pT, pF = compare(a > b)
    if (pT) c = c + 1
    if (pF) d = d * e + f

Compare sets pT or pF
Compiler schedules the two if instructions in parallel

No conditional branch
• No misprediction penalty on either outcome
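The equivalence between the branching and predicated forms can be modelled in software. The Python sketch below is only a semantic model: both arms are computed, and each result is committed only when its predicate is true. The function names are illustrative, and Python's own conditional expression stands in for the hardware's predicated commit.

```python
def branching(a, b, c, d, e, f):
    """High-level form: only one arm runs, chosen by a conditional branch."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def predicated(a, b, c, d, e, f):
    """Predicated form: both arms are issued, each guarded by a predicate."""
    p_true = a > b          # compare sets pT ...
    p_false = not p_true    # ... and its complement pF
    c_new = c + 1           # (pT) c = c + 1
    d_new = d * e + f       # (pF) d = d * e + f
    c = c_new if p_true else c     # commit only when pT = 1
    d = d_new if p_false else d    # commit only when pF = 1
    return c, d


if __name__ == "__main__":
    for a, b in [(1, 2), (2, 1), (3, 3)]:
        assert branching(a, b, 10, 20, 3, 4) == predicated(a, b, 10, 20, 3, 4)
    print("both forms agree")
```

The two arms have no data dependence on each other, which is what lets the compiler place them in the same parallel block.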


Explicitly Parallel Instruction Computing (EPIC)

Very Long Instruction Word (VLIW) format
• Instruction Bundle: 3 instructions in a VLIW
• Instruction Group: 1 or more instruction bundles

Instruction Group
• No data dependencies among instructions
• May be executed in parallel (according to program logic)

At compile time, compiler
• Identifies data dependencies
• Forms Instruction Bundles
• Marks Instruction Groups
• Determines ordering of instruction execution


Instruction Bundles

Instruction Bundle
• Three Instructions and a Template Field
• 16 byte length: 3 × 41 bits + 5 bits = 128 bits
• Aligned at 16-byte boundaries in memory
• Contains no RAW or WAW dependencies

Template Field
• Maps each instruction to Execution Unit type
• Integer (I), Float (F), Memory (M), Branch (B)
• Processor executes the three instructions in parallel
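The 3 × 41 + 5 = 128 bit layout can be illustrated by packing three instruction slots and a template into one integer. This is a Python sketch under an assumption the slide does not state: the template occupies the five least-significant bits, followed by slots 0, 1 and 2. The function names and the slot values are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover (template, [slot0, slot1, slot2]) from a packed bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots


if __name__ == "__main__":
    packed = pack_bundle(0x10, [1, 2, 3])      # placeholder encodings
    print(len(packed.to_bytes(16, "little")))  # a bundle is 16 bytes
```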


Instruction Groups

Instruction Group
• Sequence of Instruction bundles
• Instructions without RAW or WAW dependencies
• At least one instruction
• Number of instructions is not limited

Instruction groups end at
• Branch instructions
• Cycle Breaks (;;): Inserted by compiler to indicate data hazards (dependencies)

Processor seeks to execute all bundles in a group in parallel


Instruction Groups: Example

    r1 = r2 + r3
    r4 = r5 + r6
    r7 = r8 + r9 ;;
    r10 = r4 + r11
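The placement of the cycle break in this example can be reproduced with a simple dependency scan. The Python sketch below is a simplification and not the actual compiler algorithm: it starts a new group whenever an instruction reads (RAW) or rewrites (WAW) a register already written in the current group. The tuple representation `(dest, sources)` and the function name are assumptions made for illustration.

```python
def split_into_groups(instructions):
    """Split (dest, sources) tuples into groups free of RAW/WAW hazards."""
    groups = []
    current = []
    written = set()   # registers written so far in the current group
    for dest, sources in instructions:
        raw = any(src in written for src in sources)
        waw = dest in written
        if raw or waw:
            groups.append(current)   # equivalent to emitting ';;'
            current = []
            written = set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),
        ("r7", ("r8", "r9")),
        ("r10", ("r4", "r11")),   # reads r4, written earlier in the same group
    ]
    for group in split_into_groups(program):
        print([dest for dest, _ in group])
```

On the slide's four instructions the scan yields two groups, with the break falling before `r10 = r4 + r11` because it reads `r4`.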


General Registers

General Registers
• 128 registers: GR0 through GR127
• 64-bit width
• NaT (Not a Thing) bit: Marks deferred speculative exceptions

Two Subsets
• GR0 to GR31: Static General Registers
  GR0 always holds zero as source
  Write to GR0 causes an Illegal Operation fault
• GR32 to GR127: Stacked General Registers
  Available to application program
  Acquire by allocating a Register Stack Frame
  Act as Local and Output Registers


“Spilling” and “Filling” Problem

In an IA-32 procedure call
• Calling procedure uses IA-32 registers
• Called procedure needs the same registers

Procedure call causes many memory accesses
• Calling procedure saves register values in memory
• Calling task passes parameters by pushing to stack
• Called task returns by register or by stack
• Calling task restores its previous register state


Stacked General Registers

Stacked General Registers
• Procedure Calls use Temporary Registers
• Avoids “spilling” and “filling”

Register Frame allocated to a nested procedure
• Allocate up to 96 registers from (GR32 … GR127)

Specify number of required registers for:
• Local: private registers for use by procedure
• Output: parameter/return passing


Implementation of Stacked General Registers

On procedure calls and returns
• Allocate temporary physical registers

Current Frame Marker (CFM)
• Stacked General Registers allocated to called procedure

Previous Frame Marker (PFM)
• Stacked General Registers allocated to calling procedure

Frame sizes
• sof: size of frame (local + output registers)
• sol: size of local

Implementation in hardware
• Invisible to application programs
• Rename temporary registers to standard register set

Called procedure always sees:
• 32 Static General Registers: GR0 to GR31
• Stacked General Registers: GR32, GR33, ... , GR(32 + sof − 1)
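The renaming of stacked registers can be pictured with a toy model. The Python sketch below is an illustration under simplifying assumptions, not the hardware mechanism: it assumes the called procedure's frame starts at the caller's first output register, it keeps previous frame markers in a plain list, and it ignores what happens when the 96 physical registers run out. The class and method names are hypothetical.

```python
NUM_STACKED = 96   # physical registers backing GR32..GR127


class RegisterStack:
    """Toy model of stacked-register renaming (no spilling to memory)."""

    def __init__(self):
        self.base = 0     # physical index that the name GR32 maps to
        self.sof = 0      # size of frame (local + output)
        self.sol = 0      # size of locals
        self.saved = []   # previous frame markers, one per active call

    def alloc(self, n_local, n_output):
        """Set the current frame size, as a procedure does on entry."""
        if n_local + n_output > NUM_STACKED:
            raise ValueError("frame exceeds 96 stacked registers")
        self.sol = n_local
        self.sof = n_local + n_output

    def call(self):
        """Caller's output registers become the callee's first registers."""
        self.saved.append((self.base, self.sof, self.sol))
        self.base = (self.base + self.sol) % NUM_STACKED
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        """Restore the caller's frame from the saved marker."""
        self.base, self.sof, self.sol = self.saved.pop()

    def physical(self, gr):
        """Map a visible name GR32..GR(32+sof-1) to a physical index."""
        if not 32 <= gr < 32 + self.sof:
            raise IndexError("register outside current frame")
        return (self.base + gr - 32) % NUM_STACKED


if __name__ == "__main__":
    rs = RegisterStack()
    rs.alloc(n_local=3, n_output=2)   # caller: GR32..GR34 local, GR35..GR36 output
    caller_out0 = rs.physical(35)
    rs.call()
    # The callee sees the caller's first output register under the name GR32.
    print(caller_out0 == rs.physical(32))
```

No values are copied on the call: only the mapping from names to physical registers changes, which is how the spill and fill traffic is avoided.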


Stacked General Registers: Example


Register Rotation

Modulo loop scheduling
• Execute loop iterations in parallel
• Loop iteration starts before previous iteration finishes
• Traditionally requires loop unrolling: Write repeated code instances, instead of writing a loop

Register Rotation
• Use multiple physical registers
• Rename multiple registers to same name
• Provide every iteration with its own set of registers
• Each instance of loop sees same register names
• Avoids unrolling
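Register rotation can be sketched as a register file whose name-to-register mapping shifts by one on every loop iteration. The Python model below is a simplified illustration: the class name, the `rrb` field (a rotating register base) and the direction of rotation are assumptions chosen so that a value written under logical name 0 in one iteration appears under logical name 1 in the next.

```python
class RotatingRegisters:
    """Toy rotating register file: names shift by one per loop iteration."""

    def __init__(self, size):
        self.regs = [0] * size
        self.rrb = 0   # rotating register base (assumed name)

    def _physical(self, logical):
        return (logical + self.rrb) % len(self.regs)

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def rotate(self):
        """Advance to the next iteration: logical n now names old n-1."""
        self.rrb = (self.rrb - 1) % len(self.regs)


if __name__ == "__main__":
    rr = RotatingRegisters(4)
    for i in range(3):
        rr.write(0, f"load#{i}")   # every iteration writes the same name
        if i > 0:
            # The previous iteration's value is still live under name 1.
            print(rr.read(1), "is available alongside", rr.read(0))
        rr.rotate()
```

Each iteration uses identical register names in its code, yet its values land in different physical registers, so overlapping iterations do not overwrite one another.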


Virtual Addressing by Region

IA-64 address space
• 3-bit Virtual Region Number (VRN)
• 61-bit address within Virtual Region (VR)
  2^61 byte address space in VR
  Divided into pages by the OS

3-bit Virtual Region Number (VRN)
• VRN is an index into a Region Register Table (RRT)
• RRT defines 8 Virtual Region Identifier (VRI) entries
• 24-bit VRI identifies one of 2^24 Virtual Regions

Total address space = 2^(64 − 3 + 24) = 2^85 bytes


Virtual Addressing


Virtual Addressing

64-bit address divided into 3 fields
• 3-bit VRN points to 1 of 8 Virtual Regions
• VR has 61-bit address (64 − 3 = 61), divided into pages by the OS

Virtual Region Number | Virtual Page Number | Page Offset
3 bits                | OS dependent        | OS dependent

Supported page sizes (VPN bits = 61 − Page Offset bits)

Page size    Page Offset    VPN
4 KB         12 bits        49 bits
8 KB         13 bits        48 bits
16 KB        14 bits        47 bits
64 KB        16 bits        45 bits
256 KB       18 bits        43 bits
1 MB         20 bits        41 bits
4 MB         22 bits        39 bits
16 MB        24 bits        37 bits
64 MB        26 bits        35 bits
256 MB       28 bits        33 bits
4 GB         32 bits        29 bits

Effective IA-64 address space is 85 bits: 64 − 3 + 24 = 85
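The three-field split of a 64-bit virtual address can be written out directly. The Python sketch below assumes the VRN occupies the top 3 bits and that the page offset width is log2 of the page size; the function name is illustrative and the example address is made up.

```python
REGION_BITS = 61   # address bits within one Virtual Region


def split_virtual_address(addr, page_size):
    """Split a 64-bit address into (VRN, VPN, page offset)."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page size must be a power of two")
    if not 0 <= addr < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    offset_bits = page_size.bit_length() - 1   # log2(page_size)
    vrn = addr >> REGION_BITS
    vpn = (addr & ((1 << REGION_BITS) - 1)) >> offset_bits
    offset = addr & (page_size - 1)
    return vrn, vpn, offset


if __name__ == "__main__":
    # Made-up address: region 5, virtual page 0x1234, offset 0xABC, 4 KB pages.
    addr = (5 << 61) | (0x1234 << 12) | 0xABC
    vrn, vpn, offset = split_virtual_address(addr, 4 * 1024)
    print(vrn, hex(vpn), hex(offset))
```

With a 4 KB page the offset takes 12 bits and the VPN the remaining 49; a larger page size moves bits from the VPN into the offset.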


Itanium 2

Itanium with enhancements:
• 6 integer execution units (up from 4)
• 2 Load + 2 Store units (up from 2 Load/Store)
• Move L3 cache onto silicon die (on chip)
• I/O clock is 400 MHz (up from 266 MHz)
• 128-bit data I/O (up from 64-bit)
• I/O rate of 6.4 GB/s (up from 2.1 GB/s)