design and implementation of sun's niagara2 processor
TRANSCRIPT
Sept 20, 2007
Design and implementationof Sun's
Niagara2 Processor
Umesh NawatheSenior Manager,Sun Microsystems
2
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
3
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
4
Memory BottleneckRelative Performance
10000
11990 1995 2005 1980
1000
100
10
1985 2000
2x Every 6 Years
2x Every 2 Years
Gap
CPU FrequencyDRAM Speeds
Date (Year)
5
Comparing Modern CPU Design Techniques
Single Thread
ILP CC MM CC MM CC MM
CC MM CC MM CC MM
Time
Time SavedTLP
CC MM
CC MM
CC MM
● ILP Offers Limited Headroom● TLP Provides Greater Performance Efficiency
Memory Latency
Compute
(No Parallelism)
HURRYUP ANDWAIT!
6
Throughput Computing Memory Latency
Compute Time
MC MC MC MC
MC MC MC MC
MC MC MC MC
MC MC MC MC
MC MC MC MC
MC MC MC MC
MC MC MC MC
MC MC MC MC
Thre
ads
Time
For a single thread:Memory is THE bottleneck for improving performance.
Commercial server workloads exhibit poor memory locality.Only a modest throughput speedup is possible by reducing compute time. Conventional single-thread processors optimized for ILP have low utilizations.
With many threads:It’s possible to find something to execute every cycle. Significant throughput speedups are possible.Processor utilization is much higher.
Single Thread
7
Microprocessor Power EvolutionPower Density (W/cm2) over Time
.13 04
.0 00
.10 00
.20 00
.30 00
.40 00
.50 00
.60 00
.70 00
.80 00
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08
Date (Year)
Po
wer
De
ns
ity
(W
/ s
q.
cm
)
intel 386
intel 486
intelpentiumintelpentium 2intelpentium 3intelpentium 4intelitaniumAlpha 21064
Alpha 21164
Alpha 21264
Sparc
SuperSparc
Sparc64
Mips
HP PA
Power PC
AMD K6
AMD K7
AMD x -86 64
UltraSparcVNiagara
Niagara2
N1
N2
8
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
9
Niagara2's Key features
• 2nd generation CMT (Chip Multi-Threading) processor optimized for Space, Power, and Performance (SWaP).
• 8 Sparc Cores, 4MB shared L2 cache; Supports concurrent execution of 64 threads.
• >2x UltraSparc T1's throughput performance and performance/Watt.
• >10x improvement in Floating Point throughput performance.• Integrates important SOC components on chip:
> Two 10G Ethernet (XAUI) ports on chip.> Advanced Cryptographic support at wire speed.
• On-chip PCI-Express, Ethernet, and FBDIMM memory interfaces are SerDes based; pin BW > 1Tb/s.
10
Niagara2 Block Diagram
SPC 0
Cro
ssba
r
L2$ bank 0
L2$ bank 1
L2$ bank 2
L2$ bank 3
L2$ bank 4
L2$ bank 5
L2$ bank 6
L2$ bank 7
JTAGTest Control
FBDIMM channels @ 4.8Gb/s/lane
SSI
Debug Port
SPC 2
SPC 3
SPC 4
SPC 5
SPC 6
SPC 7
FBDIMMController 0
Upto 8 off-chip DIMMs/channelEight banks of512 KByte L2$
NetworkInterface I/O
To/From L2$and memory
Switch
PCI-Express
Unit (XAUI)
SPC 1
2.5 Gb/s/lane3.125 Gb/s/laneClock Control
FBDIMMController 1
FBDIMMController 2
FBDIMMController 3
Key Point:Key Point: System-on-a-Chip, CMT architecture => lower # of system components, reduced complexity/power => higher system reliability.
11
Sparc Core (SPC) Architecture Features• Implementation of the 64-bit
SPARC V9 instruction set.• Each SPC has:
> Supports concurrent execution of 8 threads.> 1 load/store, 2 Integer execution units.> 1 Floating point and Graphics unit.> 8-way, 16 KB I$; 32 Byte line size.> 4-way, 8 KB D$; 16 Byte line size.> 64-entry fully associative ITLB.> 128-entry fully associative DTLB.> MMU supports 8K, 64K, 4M, 256M page
sizes; Hardware Tablewalk.> Advanced Cryptographic unit.
• Combined BW of 8 Cryptographic Units is sufficient for running the 10 Gb ethernet ports encrypted.
TLU IFU
EXU0 EXU1
FGU LSUSPU
Gasket
MMU/HWTW
SPC Block Diagram
12
SPC Architecture Features (Cont'd.)• 8-stage Integer Pipeline (Fetch, Cache, Pick, Decode,
Execute, Memory, Bypass, Writeback).> 3-cycle load-use latency.
• 12-stage FP and Graphics Pipeline (Fetch, Cache, Pick, Decode, Execute, FX1, FX2, FX3, FX4, FX5, FB, FW).> 6-cycle latency for dependent FP operations.> Longer pipeline for Divide/Sqrt.
• Upto 4 instructions fetched per cycle in the 'Fetch' stage.• Has 2 thread-groups (TGs); 'Pick' tries to find 2 instructions
to execute every cycle – one per TG.> Can lead to hazards (e.g. Loads picked from both TGs).
• 'Decode' stage resolves hazards that 'Pick' cannot.
13
SPC Architecture Features (Cont'd.)
• Integer/Ld/St pipeline shown.
F2
C6
IB0-3
P0
D2
E0
M3
B1
IB4-7
P5
D7
E6
M4
B7
W2 W6
M4
B1
W6
Thread Group 0
LSU
IFUThread Group 1 F = Fetch
C = CacheP = PickD = DecodeE = ExecuteM = MemoryB = BypassW = Writeback
Numbers 0 through 7 respresent the thread identifier.
Key Point:Key Point: Differentthreads
can occupydifferent pipeline
stages in a given
cycle withvery few
restrictions.
14
Niagara2 Die Micrograph • 8 SPARC cores, 8 threads/core.
• 4 MB L2, 8 banks, 16-way set associative.
• 16 KB I$ per Core.
• 8 KB D$ per Core.
• FP, Graphics, Crypto, units per Core.
• 4 dual-channel FBDIMM memory controllers @ 4.8 Gb/s.
• X8 PCI-Express @ 2.5 Gb/s.
• Two 10G Ethernet ports @ 3.125 Gb/s.
15
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
16
Physical Implementation Highlights
Technology
11
3 (SVT, HVT, LVT)
Frequency 1.4 Ghz @ 1.1VPower 84 W @ 1.1VDie Size 342 mm^2
503 Million
Package
# of pins
65 nm CMOS (from Texas Instruments)
Nominal Voltages
1.1 V (Core), 1.5V (Analog)
# of Metal LayersTransistor types
Transistor Count
Flip-Chip Glass Ceramic1831 total; 711 Signal I/O
17
Level2 Cache• 4-MB shared L2 Cache:
> 8 banks of 512 KB each.> 64 B line size; 16-way set associative.> Read 16 B per cycle per bank with 2-cycle latency.> Address hashing capability to distribute accesses across different sets.
• SEC DED ECC/parity protected.• Data from different ways/words interleaved to improve SER.• Tag arrays contain reverse-mapped directory:
> Maintains L1 I$ and D$ coherency across 8 SPCs.> Store L2 Index/Way bits instead of all the tag bits.
• Memory cell NWELL power separated out as a test hook:> Helps identify weak memory bits susceptible to read-disturb fails due to
PMOS NBTI effect.> Significantly improves DPPM/reliability.
18
Level2 Cache – Row Redundancy
• Redundancy implemented at 32-KB level.
• Spare rows for one array located in adjacent array.
• Adjacent array (which is normally not enabled) is enabled if 'incoming address' = 'defective row address'.
• Reduces X-decoder area by ~30 %.
19
Crossbar • Provides high-BW interface between 8 SPCs and 8 L2 cache banks/NCU.
• Consists of 2 blocks:> PCX (Processor to
Cache/NCU transfer): 8-i/p, 9-o/p mux.
> CPX (Cache/NCU to Processor transfer): 9-i/p, 8-o/p mux.
• PCX/CPX combined provide Rd/Wr BW of ~270 GB/s (Pin BW of ~400 GB/s).
• 4-stage pipeline: Request, Arbitration, Selection, Transmission.
• 2-deep queue for each source-destination pair to hold data transfer requests.
CORE7
NCU
Bank1
CORE1
CORE0 CORE1 CORE2 CORE3 CORE4 CORE5 CORE7CORE6
NCU Bank0 Bank1 Bank2 Bank3 Bank4 Bank5 Bank6 Bank7
Bank0
CORE0
CPX
PCX
20
ClockingREF 133/167/200 MHzCMP 1.4 GHzIO 350 MHzIO2X 700 MHzFSR.refclk 133/167/200 MHzFSR.bitclk 1.6/2.0/2.4 GHzFSR.byteclk 267/333/400 MHzDR 267/333/400 MHzPSR.refclk 100/125/250 MHzPSR.bitclk 1.25 GHzPSR.byteclk 250 MHzPCI-Ex 250 MHzESR.refclk 156 MHzESR.bitclk 1.56 GHzESR.byteclk 312.5 MHzMAC.1 312.5 MHzMAC.2 156 MHzMAC.3 125/25/2.5 MHz
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
CCX
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
NCU
PS
R
SIU
NIU
(RDP,
RTX,
TDS)
ES
R
MCU
MCU
MCU
MCU
FS
R
DMU
PE
UM
AC
Mesochronous Ratioed synchronous
Asynchronous
Asynchronous
350 MHz
1.4 GHz
400 MHz
700 MHz
Key Point:Key Point: Complex clocking; large # of clock domains; asynchronous domain crossings.
21
Clocking (Cont'd.)
• On-chip PLL generates Ratioed Synchronous Clocks (RSCs); Supported fractional divide ratios: 2 to 5.25 in 0.25 increments.
• Balanced use of H-Trees and Grids for RSCs to reduce power and meet clock-skew budgets.
• Periodic relationship of RSCs exploited to perform high BW skew-tolerant domain crossings.
• Clock Tree Synthesis used for Asynchronous Clocks; domain crossings handled using FIFOs and meta-stability hardened flip-flops.
• Cluster/L1 Headers support clock gating to save clock power.
22
Clocking (RSC domain crossings)• FCLK = Fast-Clock
SCLK = Slow-Clock
• Same 'Sync_en' signal for FCLK -> SCLK, and SCLK -> FCLK crossings.
44
33
Hold margin Setup margin
FCLK
SYNC_EN
SCLK
TXfast
TXfast,slow
RXslow
k(N/M)T (k+1)(N/M)T(k+1/2)(N/M)T
44
4433
3322
k(N/M)T (k+1)(N/M)T(k+1/2)(N/M)T
FCLK
SYNC_EN
SCLK
TXslow
TXslow,fast
RXfast
Hold marginSetup margin
BB CC DD
AA BB CC
AA BB
RXslow
TXslow
SYNC_EN
FCLK SCLK
TXfast,slow
EN
TXfast
FCLK SCLK
TXslow,fast
EN
RXfast
Key Point:Key Point: Equalizing setup and hold margins maximizes skew tolerance.
23
Niagara2's SerDes Interfaces
• All SerDes share a common micro-architecture.• Level-shifters enable extensive circuit reuse across the three
SerDes designs.• Total raw pin BW in excess of 1Tb/s.• Choice of FBDIMM (vs DDR2) memory architecture provides
~2x the memory BW at <0.5x the pin count.
FBDIMM PCI-Express Ethernet-XAUI
VSS VDD VDD
Link-rate (Gb/s) 4.8 2.5 3.125
14 * 8 8 4 * 2
10 * 8 8 4 * 2
Bandwidth (Gb/s) 921.6 40 50
Signalling Reference
# of North-bound (Rx) lanes# of South-bound (Tx) lanes
24
Niagara2's True Random Number Generator
• Consists of 3 entropy cells.
• Amplified n-well resistor thermal noise modulates VCO frequency; VCO o/p sampled by on-chip clock.
• LFSR accumulates entropy over a pre-set accumulation time.> Privileged software programs a timer with desired entropy accumulation time.> Timer blocks loads from LFSR before entropy accumulation time has elapsed.
25
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
26
Niagara2's System on Chip Methodology• Chip comprised of many
subsystems with different design styles and methodologies:> Custom Memories & Analog
Macros:> Full custom design and verification.
40% compiled memories.> Schematic/manual layout based.
> External IP:> SerDes full custom IP Macros.
> Complex Clusters:> DP/Control/Memory Macro.> Higher speed designs.
> ASIC designs:> PCI-Express and NIC functions.
> CPU:> Integration of component abstracts.> Custom pre-routes and autoroute
solution.> Propriety RC analysis and buffer
insertion methodology.
Key Point:Key Point: Chip Design Methodologies had to comprehend blocks with different
design styles and levels of abstraction.
27
Complex Design Flow
• Architectural pipeline reflected closely in the floorplanning of:
> Memory Macros.> Control Regions.> Datapath Regions.
• Early Design Phase: > Fully integrated SUN toolset allows fast turnaround.> Less accurate, but fast - allows quick iterations to
identify timing fixes involving RTL/floorplan changes.> Allows reaching route stability.
• Stable Design Phase:> More accurate, but not as fast, allows timing fixes
involving logic and physical changes; Allows logic to freeze.
• Final Design Phase:> More accurate, but longer time to complete; More
focus on physical closure then logic.• Freeze and ECO Design Phases:
> Allows preserving large portion of design from one iteration to next.
Key Point:Key Point: Design Flow different for different design phases.
28
Design For Manufacturability (DFM)• Single poly orientation (except I/O blocks).• Larger-than-minimum design rules:
> To minimize impact of poly/diffusion flaring.> Near stress-prone topologies to reduce chances of dislocations in Si-lattice.> Larger Metal overlap of via/contact where possible.
• Improved gate-CD control:> Dummy polys used for gate shielding.> Limited gate-poly pitches used; OPC algorithm tuned for them.
• OPC simulations of critical cell layouts to ensure sufficient manufacturing process margin.
• Extensive use of statistical simulations:> Reduces unnecessary design margin that could result from designing to FAB-
supplied corner models that often are non-physical.
• Redundant vias placed without area increase.• All custom ckts proven on testchips prior to 1st Si.
29
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
30
Power• CMT approach used to
optimize the design for performance/watt.
• Clock gating used at cluster and local clock-header level.
• 'GATE-BIAS' cells used to reduce leakage.> ~10 % increase in channel
length gives ~40 % leakage reduction.
• Interconnect W/S combinations optimized for power-delay product to reduce interconnect power.
Niagara2 Worst Case Power =84 W @ 1.1V, 1.4 GHz
Sparc Cores 31.6 %
IOs
13.2
% L2Dat 8.6 %
L2Tag 8.5 %
Leakage 21.1 %
SOC
6.1
%To
p Le
vel 4
.9 %
Mis
c U
nits
1.2
%
L2Buffer 2.5 %
Gclk, CCU 1.3 %Crossbar 1.0 %
31
Power management
• Software can turn threads on/off.
• 'Power Throttling' mode controls instruction issue rates to manage power consumption.
• On-chip thermal diodes monitor die temperature.> Helps ensure reliable operation in
case of cooling system failure.
• Memory Controllers enable DRAM power-down modes and/or control DRAM access rates to control memory power.
None Minimum Medium Maximum70
75
80
85
90
95
100
Effect of Throttling on Dynamic Power
High Work-load
Low Work-load
Degree of Throttling
% o
f 'N
o T
hrot
tling
' pow
er
32
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> Clocking Scheme> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
33
Design for Testability• Deterministic Test Mode (DTM) used to test core by
eliminating uncertainty of asynchronous domain crossings.• Dedicated 'Debug Port' observes on-chip signals.• 32 scan chains cover >99 % flops; enable ATPG/Scan testing.• All RAM/CAM arrays testable using MBIST and Macrotest.
> Direct Memory Observe (DMO) using Macrotest enables fast bit-mapping required for array repair.
• Path Delay/Transition Test technique enables speed testing of targeted critical paths.
• SerDes designs incorporate loopback capabilities for testing.• Architecture design enables use of <8 SPCs/L2 banks.
> Shortened debug cycle by making partially functional die usable.> Will increase overall yield by enabling partial-core products.
34
Mission Mode vs DTM
Mission Mode Operation
Deterministic Test Mode Operation
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
CCX
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2$
bank
NCU
PS
R
SIU
NIU
(RDP,
RTX,
TDS)
ES
R
MCU
MCU
MCU
MCU
FS
R
DMU
PE
UM
AC
Mesochronous Ratioed synchronous
Asynchronous
Asynchronous
2.5 Gb/s
3.125 Gb/s4.8 Gb/s
1.4 GHz
350 MHz400 MHz
non-deterministicDEBUG PORT
CCX
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
SP
AR
C
L2$
bank
L2$
bank
L2$
bank
L2$
bank
L2 $
bank
L2$
bank
L2$
bank
L2$
bank
NCU
PS
R
SIU
NIU
(RDP,
RTX,
TDS)
ES
R
MCU
MCU
MCU
MCU
FS
R
DMU
PE
UM
AC
SynchronousSynchronous Ratioed synchronous
1.0 Gb/s
1.2 Gb/s
100 Mb/s 100 Mb/s
1.4 GHz
100 MHz
non-deterministic
100 MHz
35
Outline• Why CMT Architecture ?• Key Features and Architecture Overview• Physical Implementation
> Key Statistics> On-chip L2 Caches> Crossbar> On-chip L2 Caches> SerDes interfaces> Cryptography Support> Physical Design Methodology
• Power and Power Management• DFT Features• Conclusions
36
Conclusions
• Sun's 2nd generation 8-core, 64-thread, CMT SPARC processor optimized for Space, Power, and Performance (SWaP) integrates all major system functions on chip.
• Doubles the throughput and throughput/watt compared to UltraSparcT1.
• Provides an order of magnitude improvement in floating point throughput compared to UltraSparcT1.
• Enables secure applications with advanced cryptographic support at wire speed.
• Enables new generation of power-efficient, fully-secure datacenters.
37
Acknowledgements
• Jim Ballard, Mahmudul Hassan, Tim Johnson, Rob Mains, Paresh Patel, Alan Smith for helping put together the presentation.
• Niagara2 design team and other teams inside SUN for the development of Niagara2.
• Texas Instruments for manufacturing Niagara2.
Sept 20, 2007
Thank You !
Umesh NawatheSenior Manager,Sun Microsystems