the implementation of a 2-core, multi-threaded itanium ... · – two dual-threaded high...

18
The Implementation of a 2-core, Multi-threaded Itanium® Family Processor Samuel Naffziger Blaine Stackhouse Tom Grutkowski Intel Corp. Fort Collins, CO

Upload: others

Post on 05-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

The Implementation of a 2-core, Multi-threaded

Itanium® Family Processor

Samuel NaffzigerBlaine StackhouseTom Grutkowski

Intel Corp. Fort Collins, CO

Page 2: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Outline

• Processor overview• Core improvements

– Cache hierarchy– Multi threading– Power reduction

• Handling 4 voltage and frequency domains• Power management and reduction• Conclusion

Page 3: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Itanium’s Progress• 90nm implementation optimized for server

workloads– Two dual-threaded high performance 64b EPIC

cores on a single chip– (very) Large on-chip caches totaling 26.5MB

• The same latencies as previous Itaniums (paper 26.8)

• 1.72 Billion transistors, 596mm2

• Legacy system bus at higher frequency (666MT/s)• High frequency operation representing at least a

20% increase over the 130nm implementation of the same core micro-architecture

• Power efficiency improvements that deliver a >2 fold compute capability increase at 23% lesspower consumption

Page 4: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

1.72 Billiontransistors

21.5 mm

27.72 mm

Business Critical Features1MB L2I

2-Way Multi-

Threading

FoxtonPower

Controller

2 X 12MB L3 Caches with

PellstonSoft Error Detectionand Correction

Dual Cores

Page 5: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Core Design Changes• Temporal Multi-threading

– Very low cost throughput performance boost

• New level of cache – 6 cycle 1MB L2I– Addresses largest CPI component

for transaction processing (Instruction misses)

– Frees 256KB L2D to be dedicated for data

• ECC add in L2T tags, and parity added to L1I TLB

• Parity in FP and Integer register files• Second Shift/merge unit for encryption

performance• Power reduction with support for

dynamic frequency and voltage

McKinley/Madison Core

16KB L1I

16KB L1D

256KB L2D

Montecito TMT Core

16KB L1I

16KB L1D

1MB L2I

256KB L2D

L3

L3

Page 6: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

������ ������������ ������ ������������

������ ������������ ������ ������������

��� ���� �

��� ���� �

������

�� ���

15 cycle penaltyOverlap memory latency stall of Thread 1 with execution of Thread 2

Stalled Cycles

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

SpecInt CPI SpecFP CPI TPC-C SingleThread CPI

TPC-C Dual-thread CPI

15-35% PerformanceGain

Temporal Multi-Threading

2% core area impact and 0% cycle time impact

Page 7: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Multi-Threading Implementation• 15% register file growth

with TMT using muxscheme– See Paper 20.5

• Key architectural state threaded, rest is flushed on a switch– 2X latch area for 1.3%

of total latches

12 read8 writeports

Register file impact

12 read10 write

ports

Thread selThread 0

Thread 1

BA

in

delay

outputdelay

input

Switch

pck

quietthread

activethread

Page 8: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Core Power Reduction

• To improve power efficiency, we focused the team on reducing power consumption for typical applications– Developed a vector

based tool to accurately measure switching and leakage power

– Achieved a 28% overall reduction from the ported core even with the new features.

Power vs. Unit

0.01000.02000.03000.04000.05000.06000.07000.08000.09000.0

10000.0dcufp

uidp ifr l2dl3t1

l2t1

l2i1

misc

fe-o

therpc bc itc

lru10

0dec

apclk

genclk

dist

repea

ter le

akag

e

gswitc

h

Pow

er (m

W)

HALTtpcc3specint2kspecfp2kdaxpy l1power virus

Leakage (avg.)25%

Logic switching

50%

Clock distribution

5%

Final clock buffer20%

Page 9: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Multiple Voltage and Frequency Domains

• To ensure minimal latency impact, a high gain resolver is needed to cross clock domains

Core 0,frequency tracks voltage, 40W

250

mW

FoxtonControl1GHz fixed frequency

Bus ArbiterFixed frequency, 2.5W

Bus

I/O

, fix

ed fr

eque

ncy,

8W

to

tal

12M

B L

3 ca

che,

as

ynch

rono

us

oper

atio

n, 2

.5W

12MB

L3 cache, asynchronous operation, 2.5W

Core 1,frequency tracks voltage, 40W

Failure probabilityProportional to ExpWhere Tw is the time window for resolution, Tr is the resolver time constant � 5ps

SmallDevices

f1

DevicesLarge

source clock

destination clock

To destination FIFO clock enable

• Determinism for debug / test provided with a fixed frequency mode (FFM) enabled with dynamic deskew

)-TwTr(

Page 10: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Multiple Voltage and Frequency Domains

• To ensure reliable crossing of voltage domains, with up to 400mV offsets, a robust level shifter is used– Vcache �� Vcore– Vcore ��Vfixed– Vfixed ��Vtt

• A tool was developed to identify all domain crossings and check for the presence of level shifters

Core 0,frequency tracks voltage, 40W

250

mW

FoxtonControl1GHz fixed frequency

Bus ArbiterFixed frequency, 2.5W

Bus

I/O

, fix

ed fr

eque

ncy,

8W

to

tal

12M

B L

3 ca

che,

as

ynch

rono

us

oper

atio

n, 2

.5W

12MB

L3 cache, asynchronous operation, 2.5W

Core 1,frequency tracks voltage, 40W

f1

In

Out

Vin

Vout Vout

Vout

Page 11: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Power Efficiency Improvements• A primary design goal of Montecito was the

improvement of power efficiency– Performance / Watt– Lower acquisition and operating costs, reduced form

factor, better compute density• One must measure what is being optimized

– Temperature measurements do not suffice – an environmentally dependent proxy for power

� An ammeter is required• To avoid wasted watts, have the chip

instantaneously optimize operating point for variations in {voltage, temperature, workload, silicon processing}�Requires self selection of voltage with dynamic

frequency in conjunction with temperature and current measurement

Page 12: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

• For further details, see paper 16.7

• Key enablers:– High accuracy ammeter– Dynamic voltage control– Fast frequency synthesis as a function of Vcc– Accurate thermal measurement

Power Management Loop

�������

������

��������������

����������������������������

����������

��������������������������������

�������������������� ���������������

��������������������

��� ��

������

��� ����� ��

������������

��� ��

������

��� ����� ��

������������

�����

��!

����� �����

��!��!

����������

������������������������������

Page 13: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Power Consumption Contour

0.40.5

0.60.7

0.80.9

11.6

1.82

2.22.4

0

20

40

60

80

100

120

140

160

180

200

Power(Watts)

Activity FactorGHz =

F(V)

Power Consumption ProfileOptimization point is for typical integerapplications which have .6X the switching power of the worst case� Amdahl’s law

Manufacturing test is accomplished by observing the self measured power, and the self-generated frequency for typical code at the power limit

Page 14: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Per-Part Self-Optimization

Power upperbound

Power spread due toprocess variation

0

10

20

30

40

50Power Distribution at Fixed V/F

90 100 110 120 130 1400

10

20

30

40

50

Power Range with Foxton

90 100 110 120 130 140

Resulting VddDistribution

0.9 1 1.1 1.20

2

4

6

8

10

12

14

16

18

20

Page 15: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

The clock systemgenerates frequency as a function of measured voltage (papers 16.1 &16.2)

FixedSupply

VariableSupply (80W out of 100W)

Bus Clock

I/Os

Foxton

Bus Logic

Core0

Core1

1/N

CVD

RAD

Gater

1/1

GaterCVD

RVD

DFD

DFD

SLCB CVD GaterPLL

DFD

SLCB CVD Gater

SLCB

DFD

CVD Gater

1/M

DFD

DFD

1/NRAD

SLCB

SLCB

Balanced Tree ClockDistribution

Phase Aligner

Page 16: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

0

50

100

150

200

250

300

350

Simple Port of Madison Montecito Final

Cache SwitchCache IgateCache IoffCore SwitchCore IgateCore IoffIO bias DCAP lkg

Frequency Gain @ 100W

Feature

3%Manage junction temperature to the minimum possible

5%Adapt Vcc to optimal value for each part

11%Optimize circuits for low switching and low leakage

7%Adapt frequency to Vcc to

operate at frequency(Vavg) vs. frequency(Vmin)

12%Manage to application power vs. max power

5%Cache power optimization (low V, asynch. etc.)

������� ��"���������#��$%

��&������'(�)���)�*+)�*$)�*,)���)�*+-$ &������'��.�

This improvement is for a typical application. High power floating point code may see up to a 10% reduction in frequency

Page 17: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Shmoo

Power Consumption is for all 4 voltage domains at ~40C Tj

57W63W

71W 77W

95W 104W110W

Page 18: The Implementation of a 2-core, Multi-threaded Itanium ... · – Two dual-threaded high performance 64b EPIC cores on a single chip – (very) Large on-chip caches totaling 26.5MB

Summary• Design builds on Itanium’s performance

strengths:– Improves leadership cache hierarchy– Improves instruction throughput further with dual-

core, multi-threading and new execution units

• Bolsters reliability– Parity in register files, improved ECC– Seamless adaptation to power or thermal overages

• Achieves leadership performance / Watt– Breakthrough power measurement and

management technology provides flexibility and efficiency