Intel Pentium 4 Processor
TRANSCRIPT
Intel Pentium 4 Processor
Presented by Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)
Outline
Introduction (Zhijian)
– Willamette (11/2000)
Instruction Set Architecture (Zhijian)
Instruction Stream (Steve)
Data Stream (Zhijian)
What went wrong (Steve)
Pentium 4 revisions
– Northwood (1/2002)
– Xeon (Prestonia, ~2002)
– Prescott (2/2004)
Dual Core
– Smithfield
Introduction
Intel Pentium 4 processor
– Latest IA-32 processor equipped with a full set of IA-32 SIMD operations
First implementation of a new micro-architecture called "NetBurst" by Intel (11/2000)
IA-32
Intel architecture 32-bit (IA-32)
– 80386 instruction set (1985)
– CISC, 32-bit addresses
"Flat" memory model
Registers
– Eight 32-bit registers
– Eight FP stack registers
– Six segment registers
IA-32 (cont'd)
Addressing modes
– Register indirect (mem[reg])
– Base + displacement (mem[reg + const])
– Base + scaled index (mem[reg + (2^scale × index)])
– Base + scaled index + displacement (mem[reg + (2^scale × index) + displacement])
SIMD instruction sets
– MMX (Pentium II)
» Eight 64-bit MMX registers, integer ops only
– SSE (Streaming SIMD Extension, Pentium III)
» Eight 128-bit registers
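The scaled-index modes above reduce to simple arithmetic on the 2-bit scale field; a minimal sketch (the function name is my own, not an architectural term):

```python
def effective_address(base, index=0, scale=0, displacement=0):
    """Compute base + (2**scale * index) + displacement.

    scale is the 2-bit scale field (0-3), so the multiplier is
    1, 2, 4, or 8 -- matching element sizes from bytes up to
    64-bit values."""
    return base + (1 << scale) * index + displacement

# arr[3] for an array of 4-byte ints at address 0x1000:
# scale = 2 multiplies the index by 4
addr = effective_address(0x1000, index=3, scale=2)  # 0x100C
```

This is why C array indexing over 1-, 2-, 4-, and 8-byte elements maps onto a single addressing mode with no extra multiply instruction.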
Instruction Set Architecture
Pentium 4 ISA = Pentium III ISA + SSE2 (Streaming SIMD Extensions 2)
SSE2 is an architectural enhancement to the IA-32 architecture
SSE2
Extends MMX and the SSE extensions with 144 new instructions:
– 128-bit SIMD integer arithmetic operations
– 128-bit SIMD double-precision floating-point operations
– Enhanced cache and memory management operations
Comparison Between SSE and SSE2
Both support operations on 128-bit XMM registers
SSE only supports 4 packed single-precision floating-point values
SSE2 supports more:
– 2 packed double-precision floating-point values
– 16 packed byte integers
– 8 packed word integers
– 4 packed doubleword integers
– 2 packed quadword integers
– Double quadword
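The same 128 bits can be viewed under any of these packings; a stdlib-only sketch reinterpreting one 16-byte value with `struct` (the variable names are mine):

```python
import struct

# 16 raw bytes standing in for one 128-bit XMM register value
xmm = bytes(range(16))

as_bytes       = struct.unpack('<16b', xmm)  # 16 packed byte integers
as_words       = struct.unpack('<8h', xmm)   # 8 packed word (16-bit) integers
as_doublewords = struct.unpack('<4i', xmm)   # 4 packed doubleword integers
as_quadwords   = struct.unpack('<2q', xmm)   # 2 packed quadword integers
as_doubles     = struct.unpack('<2d', xmm)   # 2 packed double-precision floats
```

The register itself is untyped storage; the instruction used determines which of these interpretations applies.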
Packing
[Figure: one 128-bit XMM register (word = 2 bytes) shown partitioned as a double quadword, two 64-bit quadwords, and four 32-bit doublewords]
Hardware Support for SSE2
Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium III
Increased bandwidth in load/store for floating-point values
– Load and store are 128 bits wide
– One load plus one store can be completed between an XMM register and the L1 cache in one clock cycle
SSE2 Instructions (1)
Data movement
– Move data between XMM registers, and between XMM registers and memory
Double-precision floating-point operations
– Arithmetic instructions on both scalar and packed values
Logical instructions
– Perform logical operations on packed double-precision floating-point values
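The scalar vs. packed distinction above can be sketched lane-wise in pure Python; the function names mimic the ADDPD/ADDSD mnemonics but the code is an illustration, not Intel's semantics in full:

```python
def addpd(a, b):
    """Packed add: operates on both 64-bit lanes of the 128-bit operands."""
    return [a[0] + b[0], a[1] + b[1]]

def addsd(a, b):
    """Scalar add: only the low lane is computed; the high lane of the
    destination operand a passes through unchanged."""
    return [a[0] + b[0], a[1]]

x, y = [1.5, 2.5], [10.0, 20.0]
packed = addpd(x, y)   # both lanes added
scalar = addsd(x, y)   # only the low lane added
```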
SSE2 Instructions (2)
Compare instructions
– Compare packed and scalar double-precision floating-point values
Shuffle and unpack instructions
– Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
Conversion instructions
– Convert between doubleword integers and double-precision floating-point values, or between single-precision and double-precision floating-point values
SSE2 Instructions (3)
Packed single-precision floating-point instructions
– Convert between single-precision floating-point and doubleword integer operands
128-bit SIMD integer instructions
– Operations on integers contained in XMM registers
Cacheability control and instruction ordering
– More operations for caching of data when storing from XMM registers to memory, and additional control of instruction ordering on store operations
Conclusion
Pentium 4 is equipped with the full set of IA-32 SIMD technology; all existing software can run correctly on it.
AMD has decided to embrace and implement SSE and SSE2 in future CPUs
Instruction Stream
What's new?
– Added Trace Cache
– Improved branch predictor
Terminology
– µop – micro-op, an already-decoded RISC-like instruction
– Front end – instruction fetch and issue
Front End
Prefetches instructions that are likely to be executed
Fetches instructions that haven't been prefetched
Decodes instructions into µops
Generates µops for complex instructions or special-purpose code
Predicts branches
Prefetch
Three methods of prefetching:
– Instructions only – hardware
– Data only – software
– Code or data – hardware
Decoder
Single decoder that can operate at a maximum of 1 instruction per cycle
Receives instructions from the L2 cache 64 bits at a time
Some complex instructions must enlist the help of the microcode ROM
Trace Cache
Primary instruction cache in the NetBurst architecture
Stores decoded µops
~12K µop capacity
On a Trace Cache miss, instructions are fetched and decoded from the L2 cache
What is a Trace Cache?
Example code:
    I1  …
    I2  br r2, L1
    I3  …
    I4  …
    I5  …
L1: I6  …
    I7  …
Traditional instruction cache line (storage order): I1 I2 I3 I4
Trace cache line (follows the taken branch): I1 I2 I6 I7
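The contrast above can be sketched as a tiny simulation: a conventional cache line holds instructions in storage order, while a trace-cache line follows the predicted path. The helper below is hypothetical (4-instruction lines, a single predicted-taken branch):

```python
def build_line(program, start, predicted_taken):
    """Return 4 instruction names: sequential for an I-cache line,
    following predicted-taken branches for a trace-cache line."""
    line, pc = [], start
    while len(line) < 4:
        name, target = program[pc]
        line.append(name)
        # follow the branch target only when it is predicted taken
        pc = target if (target is not None and predicted_taken) else pc + 1
    return line

# (name, branch_target_or_None); I2 branches to I6 (index 5), per the slide
program = [("I1", None), ("I2", 5), ("I3", None), ("I4", None),
           ("I5", None), ("I6", None), ("I7", None)]

icache_line = build_line(program, 0, predicted_taken=False)  # I1 I2 I3 I4
trace_line  = build_line(program, 0, predicted_taken=True)   # I1 I2 I6 I7
```

When the prediction is correct, the trace line delivers useful µops with no fetch redirect and no re-decode at the taken branch.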
Pentium 4 Trace Cache
Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache
Removes
– Decoding costs on frequently decoded instructions
– Extra latency to decode instructions upon branch mispredictions
Microcode ROM
Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling
When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops
After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache
Branch Prediction
Predicts ALL near branches
– Includes conditional branches, unconditional calls and returns, and indirect branches
Does not predict far transfers
– Includes far calls, irets, and software interrupts
Branch Prediction
Dynamically predicts the direction and target of branches based on the PC, using the BTB
If no dynamic prediction is available, statically predicts
– Taken for backwards looping branches
– Not taken for forward branches
Traces are built across predicted branches to avoid branch penalties
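The static fallback reduces to a direction test on the branch target; a one-function sketch (the function name is mine):

```python
def static_predict(branch_pc, target_pc):
    """Static fallback rule: predict taken for backward branches
    (typically loop back-edges), not taken for forward branches."""
    return target_pc < branch_pc  # True = predict taken

# A loop's closing branch jumps backward -> predicted taken
assert static_predict(branch_pc=0x120, target_pc=0x100) is True
# A forward skip (e.g. over an error path) -> predicted not taken
assert static_predict(branch_pc=0x100, target_pc=0x140) is False
```

The heuristic works because loop back-edges are taken on every iteration but the last, while forward branches often guard rarely executed code.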
Branch Target Buffer
Uses a branch history table and a branch target buffer to predict
Updating occurs when the branch is retired
Return Address Stack
16 entries
Predicts return addresses for procedure calls
Allows branches and their targets to coexist in a single cache line
– Increases parallelism since decode bandwidth is not wasted
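A 16-entry return address stack can be modeled directly. This is a sketch only; real hardware handles overflow and misspeculation repair in more subtle ways than the simple oldest-entry drop shown here:

```python
class ReturnAddressStack:
    """LIFO predictor for return addresses: push on call, pop on return."""

    def __init__(self, depth=16):
        self.depth, self.stack = depth, []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:  # full: oldest entry is lost
            self.stack.pop(0)
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x400)   # call f -> will return to 0x400
ras.on_call(0x500)   # f calls g -> will return to 0x500
```

Nesting is exactly why a stack, rather than the BTB alone, predicts returns well: the same `ret` instruction returns to a different address on every call.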
Branch Hints
P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance
Take the form of prefixes to conditional branch instructions
Used only at trace build time; have no effect on already-built traces
Out-of-Order Execution
Designed to optimize performance by handling the most common operations in the most common context as fast as possible
126 µops can be in flight at once
– Up to 48 loads / 24 stores
Issue
Instructions are fetched and decoded by the translation engine
The translation engine builds instructions into sequences of µops
Stores µops in the trace cache
Trace cache can issue 3 µops per cycle
Execution
Can dispatch up to 6 µops per cycle
Exceeds trace cache and retirement µop bandwidth
– Allows for greater flexibility in issuing µops to different execution units
Double-pumped ALUs
ALU executes an operation on both the rising and falling edges of the clock cycle
Retirement
Can retire 3 µops per cycle
Precise exceptions
Reorder buffer to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
Register Renaming (2)
8-entry architectural register file
128-entry physical register file
2 RATs (Register Alias Tables)
– Frontend RAT and Retirement RAT
Data does not need to be copied between register files when the instruction retires
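The two-RAT scheme can be sketched as follows: the frontend RAT holds the speculative mapping from the 8 architectural registers to 128 physical registers, and retirement merely repoints the retirement RAT rather than copying a value. This is a simplified model (free-list recycling and misspeculation recovery omitted):

```python
ARCH_REGS, PHYS_REGS = 8, 128

frontend_rat = {r: r for r in range(ARCH_REGS)}   # speculative mapping
retire_rat   = {r: r for r in range(ARCH_REGS)}   # committed mapping
free_list    = list(range(ARCH_REGS, PHYS_REGS))  # unallocated physical regs

def rename_dest(arch_reg):
    """Allocate a fresh physical register for an instruction's destination."""
    phys = free_list.pop(0)
    frontend_rat[arch_reg] = phys
    return phys

def retire(arch_reg, phys):
    """On retirement, just repoint the retirement RAT -- no data copy."""
    retire_rat[arch_reg] = phys

p = rename_dest(3)   # e.g. a write to EBX gets a new physical register
retire(3, p)
```

Because commit is only a table update, the result written at execute time never moves, which is the point the slide makes.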
On-chip Caches
L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
Parameters:
– The caches are not inclusive, and a pseudo-LRU replacement algorithm is used
L1 Instruction Cache
Execution Trace Cache stores decoded instructions
Removes decoder latency from main execution loops
Integrates the path of program execution flow into a single line
L1 Data Cache
Nonblocking
– Supports up to 4 outstanding load misses
Load latency
– 2 clocks for integer
– 6 clocks for floating-point
1 load and 1 store per clock
Speculative loads
– Assume the access will hit the cache
– "Replay" the dependent instructions when a miss happens
L2 Cache
Load latency
– Net load access latency of 7 cycles
Nonblocking
Bandwidth
– One load and one store in one cycle
– New cache operations begin every 2 cycles
– 256-bit wide bus between L1 and L2
– 48 GB/s @ 1.5 GHz
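The 48 GB/s figure follows directly from the bus width and the clock; a back-of-the-envelope check:

```python
bus_width_bits = 256       # L1 <-> L2 bus width
clock_hz = 1.5e9           # 1.5 GHz core clock

bytes_per_transfer = bus_width_bits // 8           # 32 bytes per cycle
l2_bandwidth = bytes_per_transfer * clock_hz       # 48e9 bytes/s = 48 GB/s
```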
Data Prefetcher in L2 Cache
Hardware prefetcher monitors the reference patterns
Brings in cache lines automatically
Attempts to stay 256 bytes ahead of the current data access location
Prefetches for up to 8 simultaneous independent streams
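The stay-ahead behavior for a single stream can be sketched as follows. This assumes 128-byte lines for illustration; the real prefetcher's stream detection and heuristics are not described on the slide:

```python
LINE = 128    # assumed cache-line size for this sketch
AHEAD = 256   # the prefetcher tries to stay 256 bytes ahead

def prefetch_targets(access_addr):
    """Line addresses to fetch so coverage extends 256 B past the access."""
    first = (access_addr // LINE + 1) * LINE   # next line after the access
    last = access_addr + AHEAD                 # how far ahead to cover
    return list(range(first, last + 1, LINE))

# An access at address 0 pulls in the next two 128-byte lines
targets = prefetch_targets(0)
```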
Store and Load
Out-of-order store and load operations
– Stores are always in program order
48 loads and 24 stores can be in flight
Store buffers and load buffers are allocated at the allocation stage
– Total of 24 store buffers and 48 load buffers
Store
Store operations are divided into two parts:
– Store data
– Store address
Store data is dispatched to the fast ALU, which operates twice per cycle
Store address is dispatched to the store AGU each cycle
Store-to-Load Forwarding
Forwards data from a pending store buffer to a dependent load
Load stalls still happen when the bytes of the load operation are not exactly the same as the bytes in the pending store buffer
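The forwarding condition stated above can be sketched as a predicate; this is a simplification of the real matching logic, modeling only the exact-bytes rule the slide describes:

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """Forward only when the load reads exactly the bytes the store wrote;
    any mismatch in address or size forces a stall (per the rule above)."""
    return store_addr == load_addr and store_size == load_size

# 8-byte store at 0x100 followed by an 8-byte load of the same bytes: forwards
assert can_forward(0x100, 8, 0x100, 8)
# A narrower 4-byte load of only half the stored bytes: stalls instead
assert not can_forward(0x100, 8, 0x100, 4)
```

This is why code that writes a struct with wide stores and reads fields back with narrow loads can suffer forwarding stalls.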
System Bus
Delivers data at 3.2 GB/s
64-bit wide bus
Four data phases per clock cycle (quad-pumped)
100 MHz clocked system bus
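The 3.2 GB/s figure checks out from the numbers above:

```python
bus_width_bytes = 64 // 8     # 64-bit bus -> 8 bytes per data phase
phases_per_clock = 4          # quad-pumped: 4 transfers per bus clock
bus_clock_hz = 100e6          # 100 MHz bus clock

bus_bandwidth = bus_width_bytes * phases_per_clock * bus_clock_hz  # 3.2e9 B/s
```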
Conclusion
Reduced cache size vs. increased bandwidth and lower latency
No L3 Cache
Original plans called for a 1 MB cache
Intel's idea was to strap a separate memory chip, perhaps an SDRAM, onto the back of the processor to act as the L3
But that added another 100 pads to the processor, and would have also forced Intel to devise an expensive cartridge package to contain the processor and cache memory
Small L1 Cache
Only 8 KB!
– Doubled size of L2 cache to compensate
Compare with
– AMD Athlon – 128 KB
– Alpha 21264 – 64 KB
– PIII – 32 KB
– Itanium – 16 KB
Loses Consistently to AMD
In terms of performance, the Pentium 4 is as slow as or slower than existing Pentium III and AMD Athlon processors
In terms of price, an entry-level Pentium 4 sells for about double the cost of a similar Pentium III or AMD Athlon based system
The 1.5 GHz clock rate is more hype than substance
Northwood
1/2002
Differences from Willamette
– Socket 478
– 21-stage pipeline
– 512 KB L2 cache
– 2.0 GHz, 2.2 GHz clock frequency
– 0.13 µm fabrication process (130 nm)
» 55 million transistors