intel sandy ntel sandy bridge architecture
TRANSCRIPT
-
7/27/2019 Intel sandy ntel sandy bridge architecture
1/54
1
IntroducingSandy Bridge
Sandy Bridge - Intel Next Generation Microarchitecture
Bob Valentine
Senior Principal Engineer
-
7/27/2019 Intel sandy ntel sandy bridge architecture
2/54
2
Sandy Bridge: Overview
Sandy Bridge - Intel Next Generation Microarchitecture
Integrated Memory Controller2ch DDR3
High BandwidthLast Level Cache
Next Generation ProcessorGraphics and Media
Next Generation Intel Turbo
Boost Technology
Intel Hyper-Threading
Technology4 Cores / 8 Threads2 Cores / 4 Threads
Integrates CPU, Graphics, MC,PCI Express* On Single Chip
Embedded Display Port
Substantial performanceimprovement
IntelAdvanced VectorExtension (IntelAVX)
High BW/low-latency modularcore/GFX interconnect
Discrete Graphics Support :1x16 or 2x8
2ch DDR3
x16PCIe
PECI Interface
To Embedded
Controller
NotebookDP Port
PCH
En e r g y Ef f i c i e n c y
St u n n i n g P e r f o r m a n ce St u n n i n g Pe r f o r m a n ce
-
7/27/2019 Intel sandy ntel sandy bridge architecture
3/54
3
Agenda
Innovation in the Processor core
System Agent, Ring Architecture andOther Innovations
Innovation in Power Management
-
7/27/2019 Intel sandy ntel sandy bridge architecture
4/54
4
Introduction toSandy Bridge
Processor Core
Microarchitecture
Sandy Bridge - Intel Next Generation Microarchitecture
2ch DDR3
x16
PCIe
PECI Interface
To Embedded
Controller
NotebookDP Port
Sandy Bridge
Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
5/54
5
Outline
Sandy Bridge Processor Core Summary
Core Major MicroarchitectureEnhancements
Core Architectural Enhancements
Processor Core Summary
Sandy Bridge - Intel Next Generation Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
6/54
6
Sandy Bridge Processor Core Summary
Build upon the successful Nehalem microarchitectureprocessor core Converged building block for mobile, desktop, and server
Add Cool microarchitecture enhancements Features that are better than linear performance/power
Add Really Cool microarchitecture enhancements
Features which gain performance while saving power
Extend the architecture for important new applications Floating Point and Throughput Intel Advanced Vector Extensions (Intel AVX) - Significant boost for selected
compute intensive applications
Security AES (Advanced Encryption Standard) throughput enhancements
Large Integer RSA speedups
OS/VMM and server related features State save/restore optimizations
Sandy Bridge - Intel Next Generation Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
7/54
7
ALU, SIMUL,
DIV, FP MUL
ALU, SIALU,
FP ADD
ALU, Branch,
FP Shuffle
Load Load
Store Address Store Address
Store
DataSix Execution PortsSix Execution Ports
Processor Core TutorialMicroarchitecture Block Diagram
32k L1 Instruction Cache
Scheduler
Port 0 Port 1 Port 5 Port 2 Port 3 Port 4
32k L1 Data Cache
48 bytes/cycle
Allocate/Rename/RetireZeroing IdiomsLoadBuffers
StoreBuffers
ReorderBuffers
L2 Data Cache (MLC)L2 Data Cache (MLC)Fill
Buffers
Pre decodeInstruction
QueueDecoders
1.5k uOP cache
DecodersDecodersDecoders
Branch Pred
In order
Out-of-order
Front EndFront End
(IA instructions(IA instructions Uops)Uops)
In Order Allocation, Rename, RetirementIn Order Allocation, Rename, Retirement
Out of OrderOut of Order UopUop SchedulingScheduling
DataData
CacheCache
UnitUnit
-
7/27/2019 Intel sandy ntel sandy bridge architecture
8/54
8
Front End Microarchitecture
Instruction Decode in Processor Core 32 Kilo-byte 8-way Associative ICache
4 Decoders, up to 4 instructions / cycle
Micro-Fusion
Bundle multiple instruction events into a single Uops
Macro-Fusion
Fuse instruction pairs into a complex Uop
Decode Pipeline supports 16 bytes per cycle
32k L1 Instruction Cache Pre decodeInstruction
Queue DecodersDecodersDecodersDecoders
Branch Prediction Unit
-
7/27/2019 Intel sandy ntel sandy bridge architecture
9/54
9
Decoded Uop Cache ~1.5 Kuops
New: Decoded Uop Cache
Add a Decoded Uop Cache An L0 Instruction Cache for Uops instead of
Instruction Bytes
~80% hit rate for most applications
Higher Instruction Bandwidth and Lower Latency
Decoded Uop Cache can represent 32-byte / cycle
More Cycles sustaining 4 instruction/cycle
Able to stitch across taken branches in the control flow
Branch Prediction Unit
32k L1 Instruction Cache Pre decodeInstruction
Queue DecodersDecodersDecodersDecoders
-
7/27/2019 Intel sandy ntel sandy bridge architecture
10/54
10
New Branch Prediction Unit
Do a Ground Up Rebuild of Branch Predictor Twice as many targets
Much more effective storage for history
Much longer history for data dependent behaviors
Decoded Uop Cache ~1.5 KuopsBranch Prediction Unit
32k L1 Instruction Cache Pre decodeInstruction
Queue DecodersDecodersDecodersDecoders
-
7/27/2019 Intel sandy ntel sandy bridge architecture
11/54
11
Really Cool features in the front end Decoded Uop Cache lets the normal front end sleep
Decode one time instead of many times
Branch-Mispredictions reduced substantially
The correct path is also the most efficient pathReally Cool Features
Save Power while Increasing PerformancePower is fungible
give it to other units in this core, or other units on die
Sandy Bridge - Intel Next Generation Microarchitecture
Decoded Uop Cache ~1.5 KuopsBranch Prediction Unit
32k L1 Instruction Cache Pre decodeInstruction
Queue DecodersDecodersDecodersDecoders
Zzzz
Sandy Bridge Front End
Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
12/54
12
Scheduler
Port 0 Port 1 Port 5 Port 2 Port 3 Port 4
Allocate/Rename/RetireZeroing IdiomsLoadBuffers
Store
Buffers
Reorder
Buffers
In order
Out-of-
order
In Order Allocation, Rename, Retirement
Out of Order Uop Scheduling
Out of Order Cluster
Challenge to the OoO Architects :
Increase the ILP while keeping the power available for Execution
Receives Uops from the Front End
Sends them to Execution Units when they are ready
Retires them in Program Order
Goal: Increase Performance by finding more Instruction LevelParallelism
Increasing Depth and Width of machine implies larger buffers More Data Storage, More Data Movement, More Power
-
7/27/2019 Intel sandy ntel sandy bridge architecture
13/54
13
Sandy Bridge Out-of-Order (OOO) Cluster
Method: Physical Reg File (PRF) instead of centralized Retirement Register File
Single copy of every data
No movement after calculation
Allow significant increase in buffer sizes
Dataflow window ~33% larger
Scheduler
Allocate/Rename/RetireZeroing IdiomsLoadBuffers
Store
Buffers
Reorder
Buffers
In order
Out-of-
order
FP/INT Vector PRF Int PRF
PRF is a Cool featurebetter than linear
performance/power
Key enabler for Intel
Advanced VectorExtensions (Intel AVX)
Sandy Bridge - Intel Next Generation Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
14/54
14
Execution Cluster
3 Execution Ports
Maximum throughput of 8 floating pointoperations* per cycle
Port 0 : packed SP multiply
Port 1 : packed SP add
Scheduler
Port 0 Port 1 Port 5
ALUALU ALU
JMP
FP Shuf
FP Bool
VI ADDVI MUL
VI Shuffle
DIV
VI Shuffle
FP ADD
Blend
BlendFP MUL
Challenge to the Execution UnitArchitects :
Double the Floating PointThroughput
in a cool manner
*FLOPS = Floating Point Operations / Second
-
7/27/2019 Intel sandy ntel sandy bridge architecture
15/54
15
Doubling the FLOPs in a Cool Manner
Intel Advanced Vector Extensions (Intel AVX)
Extend SSE FP instruction set to 256 bitsoperand size
Intel AVX extends all 16 XMM registers to 256bits
New, non-destructive source syntax
VADDPS ymm1, ymm2, ymm3
New Operations to enhance vectorization
Broadcasts
Masked load & store
YMM0
XMM0
128 bits256 bits (AVX)
-
7/27/2019 Intel sandy ntel sandy bridge architecture
16/54
16
Doubling the FLOPs in a cool manner
Extend SSE FP instruction set to 256 bits operand size
Intel AVX extends all 16 XMM registers to 256bits
New, non-destructive source syntax
VADDPS ymm1, ymm2, ymm3
New Operations to enhance vectorization
Broadcasts
Masked load & store
Intel Advanced Vector
Extensions (Intel
AVX)
YMM0
XMM0
128 bits
256 bits (AVX)
Intel AVX is a Cool Architecture
Vectors are a natural data-type for many applications
Wider vectors and non-destructive source specify morework with fewer instructions
Extending the existing state is area and power efficient
-
7/27/2019 Intel sandy ntel sandy bridge architecture
17/54
17
Execution Cluster A Look Inside
Scheduler seesmatrix:
3 ports to 3stacks of
execution unitsGeneral PurposeInteger
SIMD (Vector)Integer
SIMD FloatingPoint
The challenge is todouble the outputof one of these
stacks in a mannerthat is invisible tothe others
Port 0
Port 1
Port 5
ALU
ALU
ALU
JMP
FP Shuf
FP Bool
VI ADD
VI MUL
VI Shuffle
DIV
VI Shuffle
FP ADD
Blend
Blend
FP MUL
GPR SIMD INT SIMD FP
-
7/27/2019 Intel sandy ntel sandy bridge architecture
18/54
18
Execution Cluster
Solution:
Repurposeexistingdatapaths to
dual-use
SIMD integerand legacySIMD FP use
legacy stackstyle
Intel AVXutilizes both128-bit
executionstacks
Intel Advanced Vector Extensions (Intel AVX)
Port 0
Port 1
Port 5
ALU
ALU
ALU
JMP
FP Shuf
FP Bool
VI ADD
VI MUL
VI Shuffle DIV
VI Shuffle
FP ADD
Blend
Blend
FP MUL
FP ADD
FP Multiply
FP Blend
FP Shuffle
FP Boolean
FP Blend
-
7/27/2019 Intel sandy ntel sandy bridge architecture
19/54
19
Port 0
Port 1
Port 5
ALU
ALU
ALU
JMP
FP Shuf
FP Bool
VI ADD
VI MUL
VI Shuffle DIV
VI Shuffle
FP ADD
Blend
Blend
FP MUL
FP ADD
FP Multiply
FP Blend
FP Shuffle
FP Boolean
FP Blend
Execution Cluster
Solution:
Repurposeexistingdatapaths to
dual-use
SIMD integerand legacySIMD FP use
legacy stackstyle
Intel AVXutilizes both128-bit
executionstacks
Cool Implementation of Intel AVX256-bit Multiply + 256-bit ADD + 256-bit Load per clock
Double your FLOPs with great energy efficiency
Intel Advanced Vector Extensions (Intel AVX)
-
7/27/2019 Intel sandy ntel sandy bridge architecture
20/54
20
Memory Cluster
Memory Unit can service two memory requests per cycle 16 bytes load and 16 bytes store per cycle
Memory Control
32kx8-way L1 Data Cache
32 bytes/cycle
Load Store AddressStore
Data
256KB L2 Data Cache (MLC)
Challenge to the Memory Cluster Architects
Maintain the historic bytes/flop ratio of SSE for Intel AVX
and do so in a cool manner
Intel Advanced Vector Extensions (Intel AVX)
Fill
Buffers
Store
Buffers
-
7/27/2019 Intel sandy ntel sandy bridge architecture
21/54
21
Memory Cluster in Sandy Bridge
Solution : Dual-Use the existing connections Make load/store pipes symmetric
Memory Unit services three data accesses per cycle
2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
Internal sequencer deals with queued requests
Memory Control
32kx8-way L1 Data Cache
48 bytes/cycle
Store
Data
256KB L2 Data Cache (MLC)
Sandy Bridge - Intel Next Generation Microarchitecture
Fill
Buffers
Store
Buffers
-
7/27/2019 Intel sandy ntel sandy bridge architecture
22/54
22
Memory Cluster in Sandy Bridge
Solution : Dual-Use the existing connections Make load/store pipes symmetric
Memory Unit services three data accesses per cycle
2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
Internal sequencer deals with queued requests
Second Load Port is one of highest performance featuresRequired to keep Intel Advanced Vector Extensions (Intel AVX)
Instruction Set fed linear power/performance means its Cool
Sandy Bridge - Intel Next Generation Microarchitecture
Memory Control
32kx8-way L1 Data Cache
48 bytes/cycle
Store
Data
256KB L2 Data Cache (MLC) FillBuffers
Store
Buffers
-
7/27/2019 Intel sandy ntel sandy bridge architecture
23/54
23
Putting it together
Sandy Bridge Microarchitecture32k L1 Instruction Cache
Scheduler
Memory Control
Port 0 Port 1 Port 5 Port 2 Port 3 Port 4
32k L1 Data Cache
48 bytes/cycle
Allocate/Rename/RetireZeroing IdiomsLoadBuffers
Store
Buffers
Reorder
Buffers
Store
Data
L2 Data Cache (MLC) FillBuffers
Pre decodeInstruction
QueueDecodersDecoders
DecodersDecoders
In order
Out-of-
order
ALU
VI ADD
VI Shuffle
AVX FP ADD
ALU
JMP
AVX/FP Shuf
AVX/FP Bool
AVX FP Blend
ALU
VI MUL
VI Shuffle
DIV
AVX FP BlendAVX FP MUL
Sandy Bridge - Intel Next Generation Microarchitecture AVX= Intel Advanced Vector Extensions (Intel AVX)
-
7/27/2019 Intel sandy ntel sandy bridge architecture
24/54
24
Other Architectural Extensions
Cryptography Instruction ThroughputEnhancements
throughput for AES instructions introduced in Westmere
Large Number Arithmetic ThroughputEnhancements
ADC (Add with Carry) throughput doubled
Multiply (64-bit multiplicands with 128-bit product) ~25% speedup on existing RSA binaries!
State Save/Restore Enhancements
New state added in Intel Advanced Vector Extensions
(Intel AVX)
HW monitors features used by applications
Only saves/restores state that is used
-
7/27/2019 Intel sandy ntel sandy bridge architecture
25/54
25
Sandy Bridge Processor Core
Summary Build upon the successful Nehalem processor core Converged building block for mobile, desktop, and server
Cool and Really Cool features
Improve performance/power and performance/area
Extends the architecture for important new applications
Floating Point and Throughput Applications
Intel Advanced Vector Extensions (Intel AVX) - Significantboost for selected compute intensive apps
Security
AES (Advanced Encryption Standard) Instructions speedup
Large Integer RSA and SHA speedups
OS/VMM and server related features
State save/ restore optimizations
Sandy Bridge - Intel Next Generation Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
26/54
26
System Agent, RingArchitecture and
Other Innovations2ch DDR3
x16
PCIe
PECI Interface
To Embedded
Controller
Notebook
DP Port
Sandy Bridge
Microarchitecture
-
7/27/2019 Intel sandy ntel sandy bridge architecture
27/54
27
Integration: Optimization Opportunities
Dynamically redistribute power between Cores & Graphics
Tight power management control of all components, providingbetter granularity and deeper idle/sleep states
Three separate power/frequency domains: System Agent(Fixed), Cores+Ring, Graphics (Variable)
High BW Last Level Cache, shared among Cores and Graphics
Significant performance boost, saves memory bandwidth and power
Integrated Memory Controller and PCI Express* ports
Tightly integrated with Core/Graphics/LLC domain
Provides low latency & low power remove intermediate busses
Bandwidth is balanced across the whole machine, fromCore/Graphics all the way to Memory Controller
Modular uArch for optimal cost/power/performance
Derivative products done with minimal effort/time
-
7/27/2019 Intel sandy ntel sandy bridge architecture
28/54
28
Scalable Ring On-die Interconnect
Ring-based interconnect between Cores, Graphics, LastLevel Cache (LLC) and System Agent domain
Composed of4 rings
32 Byte Data ring, Requestring,Acknowledge
ring and Snoop ring Fully pipelined at core frequency/voltage:
bandwidth, latency and power scale with cores
Massive ring wire routing runs over the LLC
with no area impact Access on ring always picks the shortestpath minimize latency
Distributed arbitration, sophisticated ringprotocol to handle coherency, ordering, and
core interface
Scalable to servers with large number ofprocessors
Block Diagram Illustrative only. Number of processor cores will vary with different processor models based on the Sandy BridgeMicroarchitecture. Represents client processor implementation.
H i g h B an d w i d t h , Lo w La t e n c y , M o d u l a r
-
7/27/2019 Intel sandy ntel sandy bridge architecture
29/54
29
Cache Box
Interface block
Between Core/Graphics/Media and the Ring
Between Cache controller and the Ring
Implements the ring logic, arbitration, cache controller
Communicates with System Agent for LLC misses, externalsnoops, non-cacheable accesses
Full cache pipeline in each cache box Physical Addresses are hashed at the source
to prevent hot spots and increase bandwidth Maintains coherency and ordering for the
addresses that are mapped to it
LLC is fully inclusive with Core Valid Bits eliminates unnecessary snoops to cores
Runs at core voltage/frequency, scaleswith Cores
Block Diagram Illustrative only. Number of processor cores will vary with different processor models based on the Sandy BridgeMicroarchitecture. Represents client processor implementation.
D i s t r i b u t e d c o h e r e n c y & o r d e r i n g ; Sca l a b l e
B an d w i d t h , La t e n c y & P ow e r
-
7/27/2019 Intel sandy ntel sandy bridge architecture
30/54
30
-
7/27/2019 Intel sandy ntel sandy bridge architecture
31/54
31
-
7/27/2019 Intel sandy ntel sandy bridge architecture
32/54
32
-
7/27/2019 Intel sandy ntel sandy bridge architecture
33/54
33
-
7/27/2019 Intel sandy ntel sandy bridge architecture
34/54
34
-
7/27/2019 Intel sandy ntel sandy bridge architecture
35/54
35
-
7/27/2019 Intel sandy ntel sandy bridge architecture
36/54
36
-
7/27/2019 Intel sandy ntel sandy bridge architecture
37/54
37
-
7/27/2019 Intel sandy ntel sandy bridge architecture
38/54
38
-
7/27/2019 Intel sandy ntel sandy bridge architecture
39/54
39
-
7/27/2019 Intel sandy ntel sandy bridge architecture
40/54
40
-
7/27/2019 Intel sandy ntel sandy bridge architecture
41/54
41
-
7/27/2019 Intel sandy ntel sandy bridge architecture
42/54
42
-
7/27/2019 Intel sandy ntel sandy bridge architecture
43/54
43
-
7/27/2019 Intel sandy ntel sandy bridge architecture
44/54
44
S d id C Sh i
-
7/27/2019 Intel sandy ntel sandy bridge architecture
45/54
45
Sandy Bridge LLC Sharing
LLC shared among all Cores, Graphics and Media
Graphics driver controls which streams are cached/coherent
Any agent can access all data in the LLC, independent of whoallocated the line, after memory range checks
Controlled LLC way allocation mechanism to preventthrashing between Core/graphics
Multiple coherency domains
IA Domain (Fully coherent via cross-snoops) Graphic domain (Graphics virtual caches,
flushed to IA domain by graphics engine)
Non-Coherent domain (Display data, flushed tomemory by graphics engine)
Block Diagram Illustrative only. Number of processor cores will vary with different processor models based on the Sandy BridgeMicroarchitecture. Represents client processor implementation. Sandy Bridge - Intel Next Generation Microarchitecture
Mu ch h i g h e r Gr a p h i c s p e r f o r m a n c e ,
D RAM p o w e r s a v i n g s , m o r e D RAM BW
a v a i l a b l e f o r Co r e s
-
7/27/2019 Intel sandy ntel sandy bridge architecture
46/54
46
Lean and Mean System Agent
-
7/27/2019 Intel sandy ntel sandy bridge architecture
47/54
47
Lean and Mean System Agent
Contains PCI Express*, DMI, MemoryController, Display Engine
Contains Power Control Unit
Programmable uController, handles all power
management and reset functions in the chip Smart integration with the ring
Provides cores/Graphics /Media with high BW,low latency to DRAM/IO for best performance
Handles IO-to-cache coherency Separate voltage and frequency from
ring/cores, Display integration for betterbattery life
Extensive power and thermalmanagement for PCI Express* and DDR
Block Diagram Illustrative only. Number of processor cores will vary with different processor models based on the Sandy BridgeMicroarchitecture. Represents client processor implementation.
Sm a r t I / O I n t e g r a t i o n
-
7/27/2019 Intel sandy ntel sandy bridge architecture
48/54
48
Power and Thermal Management
Usage Scena io Responsi e Beha io
-
7/27/2019 Intel sandy ntel sandy bridge architecture
49/54
49Intel Top SecretIntel Restricted Secret RSNDA
Usage Scenario: Responsive Behavior
Interactive work benefits from Next Generation Intel
Turbo Boost Idle periods intermixed with user actions
User interactive actions
Image open and process
Example Photo editing
Open image
View
Process Rotate
Balance colors
Contrast
Corp
Zoom
Etc.
Innovative Concept:
-
7/27/2019 Intel sandy ntel sandy bridge architecture
50/54
50
pThermal Capacitance
Cl a s s i c M od e l
Steady-State Thermal Resistance
Design guide for steady state
Temperature
Time
Cl a s s ic m o d e l
r e s p o n s e
Cl a s s ic m o d e l
r e s p o n s e
Tem p e r a t u r e r i se s a s e n e r g y i s d e l i v e r e d t o t h e r m a l so l u t i o n
T h e r m a l s o lu t i o n r e s p o n s e i s c a l cu l a t e d a t r e a l - t i m e
Tempe
rature
Time
Mo r e r e a l i s t i c
r e sp o n s e t o p o w e r
c h a n g e s
M o r e r e a l i s t i c
r e sp o n s e t o p o w e r
c h a n g e s
N ew M o d e l
Steady-State Thermal ResistanceAND
Dynamic Thermal Capacitance
Next Generation Intel Turbo
-
7/27/2019 Intel sandy ntel sandy bridge architecture
51/54
51
Boost Benefit
Time
Power
Sleep or
Low power
Next GenTurbo Boost
TDP
C0
(Turbo)
After idle periods, the system
accumulates energy budget
and can accommodate high
power/performance for a few
seconds
In Steady State conditions the
power stabilizes on TDP
P>TDP:
Responsiv
eness
Su
stain
power
Buildup thermal budget
during idle periods
U se
a cc u m u l a t e d
e n e r g y b u d g e t
t o e n h a n c e u s e r
e x p e r i e n c e
-
7/27/2019 Intel sandy ntel sandy bridge architecture
52/54
52
Core and Graphic Power Budgeting
-
7/27/2019 Intel sandy ntel sandy bridge architecture
53/54
53
p g g Cores and Graphics integrated on the same die with
separate voltage/frequency controls; tight HW control
Full package power specifications available for sharing
Power budget can shift between Cores and Graphics
CorePower [W]
Graphics Power[W]
Total package power
Realistic concurrentmax power
Sum of max power
Heavy Graphicsworkload
Heavy CPUworkload
SpecificationCore Power
SpecificationGraphics Power
Sa n d y B r i d g e
N e x t Ge n Tu r b o
f o r sh o r t p e r i o d s
-
7/27/2019 Intel sandy ntel sandy bridge architecture
54/54
54
Summary32nm Next Generation
Microarchitecture
Processor Graphics
System Agent, Ring Architectureand Other Innovations
Performance and PowerEfficiency