Intel Pentium 4 Processor
TRANSCRIPT
Intel Pentium 4 Processor
Presented by Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)
Outline
Introduction (Zhijian)
– Willamette (11/2000)
Instruction Set Architecture (Zhijian)
Instruction Stream (Steve)
Data Stream (Zhijian)
What went wrong (Steve)
Pentium 4 revisions
– Northwood (1/2002)
– Xeon (Prestonia, ~2002)
– Prescott (2/2004)
Dual Core
– Smithfield
Introduction
Intel Pentium 4 processor
– Latest IA-32 processor equipped with a full set of IA-32 SIMD operations
First implementation of a new micro-architecture called "NetBurst" by Intel (11/2000)
IA-32
Intel architecture 32-bit (IA-32)
– 80386 instruction set (1985)
– CISC, 32-bit addresses
"Flat" memory model
Registers
– Eight 32-bit registers
– Eight FP stack registers
– Six segment registers
IA-32 (cont'd)
Addressing modes
– Register indirect (mem[reg])
– Base + displacement (mem[reg + const])
– Base + scaled index (mem[reg + (2^scale × index)])
– Base + scaled index + displacement (mem[reg + (2^scale × index) + displacement])
SIMD instruction sets
– MMX (Pentium II)
» Eight 64-bit MMX registers, integer ops only
– SSE (Streaming SIMD Extension, Pentium III)
» Eight 128-bit registers
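The scaled-index modes above reduce to simple arithmetic on the 2-bit scale field; a minimal sketch (the function name is my own, not an architectural term):

```python
def effective_address(base, index=0, scale=0, displacement=0):
    """Compute base + (2**scale * index) + displacement.

    scale is the 2-bit scale field (0-3), so the multiplier is
    1, 2, 4, or 8 -- matching element sizes from bytes up to
    64-bit values."""
    return base + (1 << scale) * index + displacement

# arr[3] for an array of 4-byte ints at address 0x1000:
# scale = 2 multiplies the index by 4
addr = effective_address(0x1000, index=3, scale=2)  # 0x100C
```

This is why C array indexing over 1-, 2-, 4-, and 8-byte elements maps onto a single addressing mode with no extra multiply instruction.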
Instruction Set Architecture
Pentium 4 ISA = Pentium III ISA + SSE2 (Streaming SIMD Extensions 2)
SSE2 is an architectural enhancement to the IA-32 architecture
SSE2
Extends MMX and the SSE extensions with 144 new instructions:
– 128-bit SIMD integer arithmetic operations
– 128-bit SIMD double-precision floating-point operations
– Enhanced cache and memory management operations
Comparison Between SSE and SSE2
Both support operations on 128-bit XMM registers
SSE only supports 4 packed single-precision floating-point values
SSE2 supports more:
– 2 packed double-precision floating-point values
– 16 packed byte integers
– 8 packed word integers
– 4 packed doubleword integers
– 2 packed quadword integers
– Double quadword
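The same 128 bits can be viewed under any of these packings; a stdlib-only sketch reinterpreting one 16-byte value with `struct` (the variable names are mine):

```python
import struct

# 16 raw bytes standing in for one 128-bit XMM register value
xmm = bytes(range(16))

as_bytes       = struct.unpack('<16b', xmm)  # 16 packed byte integers
as_words       = struct.unpack('<8h', xmm)   # 8 packed word (16-bit) integers
as_doublewords = struct.unpack('<4i', xmm)   # 4 packed doubleword integers
as_quadwords   = struct.unpack('<2q', xmm)   # 2 packed quadword integers
as_doubles     = struct.unpack('<2d', xmm)   # 2 packed double-precision floats
```

The register itself is untyped storage; the instruction used determines which of these interpretations applies.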
Packing
[Figure: one 128-bit XMM register (word = 2 bytes) shown partitioned as a double quadword, two 64-bit quadwords, and four 32-bit doublewords]
Hardware Support for SSE2
Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium III
Increased bandwidth in load/store for floating-point values
– Load and store are 128 bits wide
– One load plus one store can be completed between an XMM register and the L1 cache in one clock cycle
SSE2 Instructions (1)
Data movement
– Move data between XMM registers, and between XMM registers and memory
Double-precision floating-point operations
– Arithmetic instructions on both scalar and packed values
Logical instructions
– Perform logical operations on packed double-precision floating-point values
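The scalar vs. packed distinction above can be sketched lane-wise in pure Python; the function names mimic the ADDPD/ADDSD mnemonics but the code is an illustration, not Intel's semantics in full:

```python
def addpd(a, b):
    """Packed add: operates on both 64-bit lanes of the 128-bit operands."""
    return [a[0] + b[0], a[1] + b[1]]

def addsd(a, b):
    """Scalar add: only the low lane is computed; the high lane of the
    destination operand a passes through unchanged."""
    return [a[0] + b[0], a[1]]

x, y = [1.5, 2.5], [10.0, 20.0]
packed = addpd(x, y)   # both lanes added
scalar = addsd(x, y)   # only the low lane added
```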
SSE2 Instructions (2)
Compare instructions
– Compare packed and scalar double-precision floating-point values
Shuffle and unpack instructions
– Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
Conversion instructions
– Convert between doubleword integers and double-precision floating-point values, or between single-precision and double-precision floating-point values
SSE2 Instructions (3)
Packed single-precision floating-point instructions
– Convert between single-precision floating-point and doubleword integer operands
128-bit SIMD integer instructions
– Operations on integers contained in XMM registers
Cacheability control and instruction ordering
– More operations for caching of data when storing from XMM registers to memory, and additional control of instruction ordering on store operations
Conclusion
Pentium 4 is equipped with the full set of IA-32 SIMD technology; all existing software can run correctly on it.
AMD has decided to embrace and implement SSE and SSE2 in future CPUs
Instruction Stream
What's new?
– Added Trace Cache
– Improved branch predictor
Terminology
– µop – micro-op, an already-decoded RISC-like instruction
– Front end – instruction fetch and issue
Front End
Prefetches instructions that are likely to be executed
Fetches instructions that haven't been prefetched
Decodes instructions into µops
Generates µops for complex instructions or special-purpose code
Predicts branches
Prefetch
Three methods of prefetching:
– Instructions only – hardware
– Data only – software
– Code or data – hardware
Decoder
Single decoder that can operate at a maximum of 1 instruction per cycle
Receives instructions from the L2 cache 64 bits at a time
Some complex instructions must enlist the help of the microcode ROM
Trace Cache
Primary instruction cache in the NetBurst architecture
Stores decoded µops
~12K µop capacity
On a Trace Cache miss, instructions are fetched and decoded from the L2 cache
What is a Trace Cache?
Example code:
    I1  …
    I2  br r2, L1
    I3  …
    I4  …
    I5  …
L1: I6  …
    I7  …
Traditional instruction cache line (storage order): I1 I2 I3 I4
Trace cache line (follows the taken branch): I1 I2 I6 I7
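The contrast above can be sketched as a tiny simulation: a conventional cache line holds instructions in storage order, while a trace-cache line follows the predicted path. The helper below is hypothetical (4-instruction lines, a single predicted-taken branch):

```python
def build_line(program, start, predicted_taken):
    """Return 4 instruction names: sequential for an I-cache line,
    following predicted-taken branches for a trace-cache line."""
    line, pc = [], start
    while len(line) < 4:
        name, target = program[pc]
        line.append(name)
        # follow the branch target only when it is predicted taken
        pc = target if (target is not None and predicted_taken) else pc + 1
    return line

# (name, branch_target_or_None); I2 branches to I6 (index 5), per the slide
program = [("I1", None), ("I2", 5), ("I3", None), ("I4", None),
           ("I5", None), ("I6", None), ("I7", None)]

icache_line = build_line(program, 0, predicted_taken=False)  # I1 I2 I3 I4
trace_line  = build_line(program, 0, predicted_taken=True)   # I1 I2 I6 I7
```

When the prediction is correct, the trace line delivers useful µops with no fetch redirect and no re-decode at the taken branch.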
Pentium 4 Trace Cache
Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache
Removes
– Decoding costs on frequently decoded instructions
– Extra latency to decode instructions upon branch mispredictions
Microcode ROM
Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling
When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops
After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache
Branch Prediction
Predicts ALL near branches
– Includes conditional branches, unconditional calls and returns, and indirect branches
Does not predict far transfers
– Includes far calls, irets, and software interrupts
Branch Prediction
Dynamically predicts the direction and target of branches based on the PC, using the BTB
If no dynamic prediction is available, statically predicts
– Taken for backwards looping branches
– Not taken for forward branches
Traces are built across predicted branches to avoid branch penalties
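The static fallback reduces to a direction test on the branch target; a one-function sketch (the function name is mine):

```python
def static_predict(branch_pc, target_pc):
    """Static fallback rule: predict taken for backward branches
    (typically loop back-edges), not taken for forward branches."""
    return target_pc < branch_pc  # True = predict taken

# A loop's closing branch jumps backward -> predicted taken
assert static_predict(branch_pc=0x120, target_pc=0x100) is True
# A forward skip (e.g. over an error path) -> predicted not taken
assert static_predict(branch_pc=0x100, target_pc=0x140) is False
```

The heuristic works because loop back-edges are taken on every iteration but the last, while forward branches often guard rarely executed code.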
Branch Target Buffer
Uses a branch history table and a branch target buffer to predict
Updating occurs when the branch is retired
Return Address Stack
16 entries
Predicts return addresses for procedure calls
Allows branches and their targets to coexist in a single cache line
– Increases parallelism since decode bandwidth is not wasted
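A 16-entry return address stack can be modeled directly. This is a sketch only; real hardware handles overflow and misspeculation repair in more subtle ways than the simple oldest-entry drop shown here:

```python
class ReturnAddressStack:
    """LIFO predictor for return addresses: push on call, pop on return."""

    def __init__(self, depth=16):
        self.depth, self.stack = depth, []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:  # full: oldest entry is lost
            self.stack.pop(0)
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x400)   # call f -> will return to 0x400
ras.on_call(0x500)   # f calls g -> will return to 0x500
```

Nesting is exactly why a stack, rather than the BTB alone, predicts returns well: the same `ret` instruction returns to a different address on every call.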
Branch Hints
P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance
Take the form of prefixes to conditional branch instructions
Used only at trace build time; have no effect on already-built traces
Out-of-Order Execution
Designed to optimize performance by handling the most common operations in the most common context as fast as possible
126 µops can be in flight at once
– Up to 48 loads / 24 stores
Issue
Instructions are fetched and decoded by the translation engine
The translation engine builds instructions into sequences of µops
Stores µops in the trace cache
Trace cache can issue 3 µops per cycle
Execution
Can dispatch up to 6 µops per cycle
Exceeds trace cache and retirement µop bandwidth
– Allows for greater flexibility in issuing µops to different execution units
Double-pumped ALUs
ALU executes an operation on both the rising and falling edges of the clock cycle
Retirement
Can retire 3 µops per cycle
Precise exceptions
Reorder buffer to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
Register Renaming (2)
8-entry architectural register file
128-entry physical register file
2 RATs (Register Alias Tables)
– Frontend RAT and Retirement RAT
Data does not need to be copied between register files when the instruction retires
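The two-RAT scheme can be sketched as follows: the frontend RAT holds the speculative mapping from the 8 architectural registers to 128 physical registers, and retirement merely repoints the retirement RAT rather than copying a value. This is a simplified model (free-list recycling and misspeculation recovery omitted):

```python
ARCH_REGS, PHYS_REGS = 8, 128

frontend_rat = {r: r for r in range(ARCH_REGS)}   # speculative mapping
retire_rat   = {r: r for r in range(ARCH_REGS)}   # committed mapping
free_list    = list(range(ARCH_REGS, PHYS_REGS))  # unallocated physical regs

def rename_dest(arch_reg):
    """Allocate a fresh physical register for an instruction's destination."""
    phys = free_list.pop(0)
    frontend_rat[arch_reg] = phys
    return phys

def retire(arch_reg, phys):
    """On retirement, just repoint the retirement RAT -- no data copy."""
    retire_rat[arch_reg] = phys

p = rename_dest(3)   # e.g. a write to EBX gets a new physical register
retire(3, p)
```

Because commit is only a table update, the result written at execute time never moves, which is the point the slide makes.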
On-chip Caches
L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
Parameters:
– The caches are not inclusive, and a pseudo-LRU replacement algorithm is used
L1 Instruction Cache
Execution Trace Cache stores decoded instructions
Removes decoder latency from main execution loops
Integrates the path of program execution flow into a single line
L1 Data Cache
Nonblocking
– Supports up to 4 outstanding load misses
Load latency
– 2 clocks for integer
– 6 clocks for floating-point
1 load and 1 store per clock
Speculative loads
– Assume the access will hit the cache
– "Replay" the dependent instructions when a miss happens
L2 Cache
Load latency
– Net load access latency of 7 cycles
Nonblocking
Bandwidth
– One load and one store in one cycle
– New cache operations begin every 2 cycles
– 256-bit wide bus between L1 and L2
– 48 GB/s @ 1.5 GHz
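The 48 GB/s figure follows directly from the bus width and the clock; a back-of-the-envelope check:

```python
bus_width_bits = 256       # L1 <-> L2 bus width
clock_hz = 1.5e9           # 1.5 GHz core clock

bytes_per_transfer = bus_width_bits // 8           # 32 bytes per cycle
l2_bandwidth = bytes_per_transfer * clock_hz       # 48e9 bytes/s = 48 GB/s
```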
Data Prefetcher in L2 Cache
Hardware prefetcher monitors the reference patterns
Brings in cache lines automatically
Attempts to stay 256 bytes ahead of the current data access location
Prefetches for up to 8 simultaneous independent streams
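The stay-ahead behavior for a single stream can be sketched as follows. This assumes 128-byte lines for illustration; the real prefetcher's stream detection and heuristics are not described on the slide:

```python
LINE = 128    # assumed cache-line size for this sketch
AHEAD = 256   # the prefetcher tries to stay 256 bytes ahead

def prefetch_targets(access_addr):
    """Line addresses to fetch so coverage extends 256 B past the access."""
    first = (access_addr // LINE + 1) * LINE   # next line after the access
    last = access_addr + AHEAD                 # how far ahead to cover
    return list(range(first, last + 1, LINE))

# An access at address 0 pulls in the next two 128-byte lines
targets = prefetch_targets(0)
```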
Store and Load
Out-of-order store and load operations
– Stores are always in program order
48 loads and 24 stores can be in flight
Store buffers and load buffers are allocated at the allocation stage
– Total of 24 store buffers and 48 load buffers
Store
Store operations are divided into two parts:
– Store data
– Store address
Store data is dispatched to the fast ALU, which operates twice per cycle
Store address is dispatched to the store AGU each cycle
Store-to-Load Forwarding
Forwards data from a pending store buffer to a dependent load
Load stalls still happen when the bytes of the load operation are not exactly the same as the bytes in the pending store buffer
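The forwarding condition stated above can be sketched as a predicate; this is a simplification of the real matching logic, modeling only the exact-bytes rule the slide describes:

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """Forward only when the load reads exactly the bytes the store wrote;
    any mismatch in address or size forces a stall (per the rule above)."""
    return store_addr == load_addr and store_size == load_size

# 8-byte store at 0x100 followed by an 8-byte load of the same bytes: forwards
assert can_forward(0x100, 8, 0x100, 8)
# A narrower 4-byte load of only half the stored bytes: stalls instead
assert not can_forward(0x100, 8, 0x100, 4)
```

This is why code that writes a struct with wide stores and reads fields back with narrow loads can suffer forwarding stalls.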
System Bus
Delivers data at 3.2 GB/s
64-bit wide bus
Four data phases per clock cycle (quad-pumped)
100 MHz clocked system bus
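The 3.2 GB/s figure checks out from the numbers above:

```python
bus_width_bytes = 64 // 8     # 64-bit bus -> 8 bytes per data phase
phases_per_clock = 4          # quad-pumped: 4 transfers per bus clock
bus_clock_hz = 100e6          # 100 MHz bus clock

bus_bandwidth = bus_width_bytes * phases_per_clock * bus_clock_hz  # 3.2e9 B/s
```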
Conclusion
Reduced cache size vs. increased bandwidth and lower latency
No L3 Cache
Original plans called for a 1 MB cache
Intel's idea was to strap a separate memory chip, perhaps an SDRAM, onto the back of the processor to act as the L3
But that added another 100 pads to the processor, and would have also forced Intel to devise an expensive cartridge package to contain the processor and cache memory
Small L1 Cache
Only 8 KB!
– Doubled size of L2 cache to compensate
Compare with
– AMD Athlon – 128 KB
– Alpha 21264 – 64 KB
– PIII – 32 KB
– Itanium – 16 KB
Loses Consistently to AMD
In terms of performance, the Pentium 4 is as slow as or slower than existing Pentium III and AMD Athlon processors
In terms of price, an entry-level Pentium 4 sells for about double the cost of a similar Pentium III or AMD Athlon based system
The 1.5 GHz clock rate is more hype than substance
Northwood
1/2002
Differences from Willamette
– Socket 478
– 21-stage pipeline
– 512 KB L2 cache
– 2.0 GHz, 2.2 GHz clock frequency
– 0.13 µm fabrication process (130 nm)
» 55 million transistors