FPR/VR register overlay & SIMD overview · arith23.gforge.inria.fr/slides/schwarz.pdf · br 2 ls, 2...

© 2016 IBM Corporation

IBM Accelerators July 11, 2016

Eric Schwarz


Outline

Roadmaps of Z and Power

Arithmetic Feature Comparison

How to Get Performance without Frequency



z Systems - Processor Roadmap

[Chip floorplan figures, one per generation. Labels: Core 0 to Core 3, L2, CoP, MCU, GX, MC, IOs, L3_0 and L3_1 with their controllers, L3B]
z10 (2/2008)
Workload Consolidation and Integration Engine for CPU Intensive Workloads
• Decimal FP
• Infiniband
• 64-CP Image
• Large Pages
• Shared Memory

z196 (9/2010)
Top Tier Single Thread Performance, System Capacity
• Accelerator Integration
• Out of Order Execution
• Water Cooling
• PCIe I/O Fabric
• RAIM
• Enhanced Energy Management

zEC12 (8/2012)
Leadership Single Thread, Enhanced Throughput
• Improved out-of-order
• Transactional Memory
• Dynamic Optimization
• 2 GB page support
• Step Function in System Capacity

z13 (1/2015)
Leadership System Capacity and Performance
• Modularity & Scalability
• Dynamic SMT
• Supports two instruction threads
• SIMD
• PCIe attached accelerators (XML)
• Business Analytics Optimized


z13 Continues the CMOS Mainframe Heritage Begun in 1994

[Bar chart: clock frequency (MHz/GHz, axis 0 to 6000) by generation, z900 through z13]

2000 z900: 770 MHz; 189 nm SOI; 16 cores**; full 64-bit z/Architecture
2003 z990: 1.2 GHz; 130 nm SOI; 32 cores**; superscalar, modular SMP
2005 z9 EC: 1.7 GHz; 90 nm SOI; 54 cores**; system level scaling
2008 z10 EC: 4.4 GHz; 65 nm SOI; 64 cores**; high-freq core, 3-level cache; 902* (+50%), GHz +159%
2010 z196: 5.2 GHz; 45 nm SOI; 80 cores**; OOO core, eDRAM cache, RAIM memory, zBX integration; 1202* (+33%), GHz +18%
2012 zEC12: 5.5 GHz; 32 nm SOI; 101 cores**; OOO and eDRAM cache improvements, PCIe Flash, arch extensions for scaling; 1514* (+26%), GHz +6%
2015 z13: 5.0 GHz; 22 nm SOI; 141 cores**; SMT & SIMD, up to 10 TB of memory; 1695* (+12%), GHz -9%

* MIPS Tables are NOT adequate for making comparisons of z Systems processors. Additional capacity planning required.
** Number of PU cores for customer use

© 2013 International Business Machines Corporation

Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)

Memory
• Up to 230 GB/s sustained bandwidth

Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)

Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipe
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache

Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• Data Move / VM Mobility

Energy Management
• On-chip Power Management Micro-controller
• Integrated Per-core VRM
• Critical Path Monitors

Technology
• 22nm SOI, eDRAM, 15 ML, 650 mm²

[Chip floorplan: 12 cores, each with its own L2, and the L3 Cache & Chip Interconnect with 8M L3 regions. Other labels: Mem. Ctrl., Local SMP Links, Accelerators]

From HotChips 2013 Presentation


[Core floorplan: VSU, FXU, IFU, DFU, ISU, LSU]

Larger Caching Structures vs. POWER7
• 2x L1 data cache (64 KB)
• 2x outstanding data cache misses
• 4x translation cache

Wider Load/Store
• 32B → 64B L2 to L1 data bus
• 2x data cache to execution dataflow

Enhanced Prefetch
• Instruction speculation awareness
• Data prefetch depth awareness
• Adaptive bandwidth awareness
• Topology awareness

Execution Improvement vs. POWER7
• SMT4 → SMT8
• 8 dispatch
• 10 issue
• 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
• Larger issue queues (4 x 16-entry)
• Larger global completion, Load/Store reorder
• Improved branch prediction
• Improved unaligned storage access

Core Performance vs. POWER7
~1.6x Single Thread
~2x Max SMT

From HotChips 2013 Presentation

POWER7 Core Base
Execution units: 2 LS, 2 FX, 1 BR, 1 CR, 1 (FP, ALU, CX), 1 (FP, PM, DF)
Caches: 32k D$, 256k L2$, 32MB L3$

Enhanced POWER8 Core
Execution units: 2 LS, 2 LU, 2 FX, 2 (FP, ALU, PM, DF), 1 CR, 1 BR
Caches: 64k D$, 512k L2$, 96MB L3$, L4$ Mem Buf

POWER8 enhancements labeled in the figure:
• Improved Branch Prediction
• Crypto (AES, SHA) support
• Deeper Out-of-Order Processing
• More Execution Bandwidth
• Bigger Caches
• Wider dispatch & issue


z13 Instr / Execution Dataflow

[Dataflow diagram. Labeled blocks: I$; instr decode / crack / dispatch / map; Branch Q; IssQ side0, IssQ side1; GR 0, GR 1; LSU pipe 0, LSU pipe 1; D$; FXU 0a*, 0b, 1a, 1b; Vector0 / FPR0 register, Vector1 / FPR1 register; VFU0, VFU1; BFU0, BFU1; DFU0, DFU1; 128b string/int SIMD0, SIMD1; VBU0, VBU1]

Callouts:
• additional instruction flow for higher core throughput
• additional execution units for higher core throughput
• new arch registers / execution units to accelerate business analytics workloads

* FXa pipes execute reg writers and support b2b execution to itself. FXb pipes execute non-reg writers and non-relative branches (needs 3w AGEN).

Features of Recent IBM FPUs

Chip   Year  GHz / FO4 / node   BFU pipe  Cores per chip  InO/OOO  DFP add   BFU pipes DP-SP-VDP  Features added / problems
P5     2004  2.2 / 23 / 130nm   6         2               OOO      N/A       2 – 2 – 0
P6     2007  5.0 / 13 / 65nm    6         2               InO                2 – 4 – 2            13 FO4 inorder
P7     2010  4.1 / 20 / 45nm    6         8               OOO      +10       2 – 4 – 4            OOO, more VRs; DFU too far
P7+    2012  4.2 / 20 / 32nm    6         8               OOO      +10       2 – 8 – 4            2 SP per DP
P8     2014  4.1 / 20 / 22nm    6         12              OOO      +5        2 – 8 – 4            enh DFU + CAPI + SMT8
z9     2005  1.7 / 27 / 90nm    5         1               InO      firmware  1 – 1 – 0            software DFU
z10    2007  4.4 / 15 / 65nm    7         2               InO      16b-31w   1 – 1 – 0            15 FO4 inorder
z196   2010  5.2 / 16 / 45nm    8         4               OOO      12l-7t    1 – 1 – 0            OOO screams; More BFUs
zEC12  2012  5.5 / 16 / 32nm    8         6               OOO      12l-7t    1 – 1 – 0            Zoned to DFU; Need more Int registers
z13    2015  5.0 / 18 / 22nm    8         8               OOO      8l-1t     2 – 2 – 2            enh DFU, SIMD, VRs, SMT2

Chip   Year  GHz / FO4 / node   BFU pipe  Cores per chip  SIMD                                                Other
P5     2004  2.2 / 23 / 130nm   6         2               32 x 64b FPRs, scalar                               2w SMT
P6     2007  5.0 / 13 / 65nm    6         2               VMX, 32 x 128b VRs + FPRs                           2w SMT
P7     2010  4.1 / 20 / 45nm    6         8               VSU, 64 x 128b VRs, 16bit Vec Int MPY               4w SMT
P7+    2012  4.2 / 20 / 32nm    6         8               VSU + RNG                                           4w SMT
P8     2014  4.1 / 20 / 22nm    6         12              VSU + Crypto/AES + 32b VINT MPY                     8w SMT
z9     2005  1.7 / 27 / 90nm    5         1               16 x 64b FPRs, scalar                               COP-CMPR + CRYPTO
z10    2007  4.4 / 15 / 65nm    7         2               FPRs
z196   2010  5.2 / 16 / 45nm    8         4                                                                   OOE
zEC12  2012  5.5 / 16 / 32nm    8         6
z13    2015  5.0 / 18 / 22nm    8         8               32 x 128b VRs + VSU (Int/FP/String), 32b V INT MPY  COP, 2w SMT

Decimal Floating Point Unit Evolution in HDW since 2006

[Timeline figure. Implementation styles: iterative; partially pipelined; fully pipelined addition]

Power6 (2006): Inorder, 13 FO4
Power7 (2010): Out of Order, 20 FO4
z196 (2010): OOE, 2X FP perf
Power7+ (2012)
Power8 (2014): SRT R16
zEC12 (2012)
z13 (2015): SMT 2, SIMD
z10 (2007): 15 FO4
Z990 GA2 (2006)

Other labels in the figure: software; firmware; QP Binary in DFU; QP signed BCD addition in DFU; Fixpt divide in DFU

FPU Architecture Advances

Quad Precision Hex in hardware for more years than I’ve been at IBM

Quad Precision Binary since 1998

Integer FMA (56 x 56 + 112) in hardware since 2003

Z196 - 2010

- BFP new rounding mode (FPC bit 29)

Truncate and OR Inexactness

supports SP A <= SP B + DP C with 1 rounding error

- IEEE 754-2008 heterogeneous support

- DFP quantum exception

Tells software that is emulating a greater precision and range whether the hardware precision (16 or 34 digits, and exponent > 398 or 6176) is exceeded

0.5 mask, 1.5 flag, new DXC code

clamped or rounded

replaces use of Test Data Group for every operation

- Converts to/from Integer to BFP/DFP

zEC12 - 2012

- Converts to/from zoned Decimal to DFP
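The z196 rounding mode above ("Truncate and OR Inexactness", often called round-to-odd) can be illustrated with a small model. This is a sketch under stated assumptions, not the hardware algorithm: `round_sig` is a hypothetical helper that rounds an exact rational to a p-bit significand with an unbounded exponent, and the operands are chosen by hand so that SP B + DP C lands just above a single-precision halfway point. It shows why rounding the sum to 53 bits with round-to-nearest and then to 24 bits can give the wrong SP result, while truncating to 53 bits and ORing the inexact bit into the last place leaves only one rounding error.

```python
from fractions import Fraction

def round_sig(x, p, mode):
    """Round Fraction x to a p-bit significand (hypothetical helper, unbounded exponent).

    mode "nearest_even": round to nearest, ties to even.
    mode "odd": truncate, then OR any lost (inexact) bits into the last bit.
    """
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    a = abs(x)
    # Choose e so that 2**(p-1) <= a / 2**e < 2**p.
    e = a.numerator.bit_length() - a.denominator.bit_length() - p
    while a / Fraction(2) ** e >= 2 ** p:
        e += 1
    while a / Fraction(2) ** e < 2 ** (p - 1):
        e -= 1
    scaled = a / Fraction(2) ** e
    q = scaled.numerator // scaled.denominator  # truncated significand
    rem = scaled - q                            # discarded fraction, 0 <= rem < 1
    if mode == "nearest_even":
        if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and q & 1):
            q += 1
    elif mode == "odd":
        if rem != 0:
            q |= 1
    else:
        raise ValueError(mode)
    return sign * q * Fraction(2) ** e

# SP B = 1, DP C = 2**-24 + 2**-60: the exact sum is just above an SP halfway point.
b = Fraction(1)
c = Fraction(1, 2**24) + Fraction(1, 2**60)
exact = b + c

direct = round_sig(exact, 24, "nearest_even")                              # one rounding
double = round_sig(round_sig(exact, 53, "nearest_even"), 24, "nearest_even")
via_odd = round_sig(round_sig(exact, 53, "odd"), 24, "nearest_even")

print(direct == via_odd, direct == double)
```

In this model the round-to-odd path matches the directly rounded SP result, and the double round-to-nearest path does not.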

More Changes

Z13 - 2015

- SMT2 and double execution units

- Non-blocking and separate divide pipeline

- 4 X size of register file

- SIMD integer and string

- SIMD floating-point BFP DP

Move away from CC and branches

- 128 bit integer adds and beyond (support cin and cout)
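The "128 bit integer adds and beyond (support cin and cout)" item can be pictured as chaining 128-bit additions through a carry. The following is a minimal sketch in plain Python integers, not z13 instructions; `add128`, `add_wide`, `to_limbs`, and `from_limbs` are hypothetical names chosen for illustration. Each step takes a carry-in and produces a carry-out, so a wider addition needs no condition code or branch between limbs.

```python
MASK128 = (1 << 128) - 1

def add128(a, b, cin=0):
    """Add two 128-bit values plus a carry-in; return (sum mod 2**128, carry-out)."""
    total = a + b + cin
    return total & MASK128, total >> 128

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian 128-bit limbs."""
    return [(n >> (128 * i)) & MASK128 for i in range(count)]

def from_limbs(limbs):
    """Reassemble little-endian 128-bit limbs into one integer."""
    return sum(limb << (128 * i) for i, limb in enumerate(limbs))

def add_wide(x_limbs, y_limbs):
    """Add two equal-length limb lists, threading the carry from limb to limb."""
    carry = 0
    out = []
    for a, b in zip(x_limbs, y_limbs):
        s, carry = add128(a, b, carry)
        out.append(s)
    return out, carry

# 256-bit example: the carry ripples out of the low limb and out of the top.
x = (1 << 256) - 1
y = 1
limbs, cout = add_wide(to_limbs(x, 2), to_limbs(y, 2))
print(from_limbs(limbs), cout)
```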

Trends

Memory is getting relatively slower

Frequency constant

Power important – clock/power gating

More parallel hardware/software

- SMT and SIMD/Vector

- Specialized Accelerators

- FPGA

Lines will blur with GPUs

Need processing power where data stored

Future Systems

Core: multiple threads

GPU, FPGA

Probably on chip

OpenPower

Possibly both on chip and off chip

POWER8 CAPI

FPGA accelerator attachment

[Block diagram: a POWER8 chip attached to an FPGA. POWER8 labels: cores, PowerBus, CAPP, MC, Mem. Bus, SMP Bus. FPGA labels: accelerated application (data, control), Host Service Layer, xlate, caches, interrupts, fault handling]

CAPP: Coherently attached processor proxy

• Provides PowerBus Surrogate

• Directory of cache lines used by accelerator

•Communicates to accelerator over industry-standard PCIe electrical.

•Coherency with full Memory Range

•POWER Effective Address (EA) Translation

Host Service Layer

•Provides interface to user application layer

•Responds to snoop commands of interest from PowerBus

•Performs Address translation and table walks

Standard System Topology Preserved

•Accelerators do not consume a Processor Socket

•Accelerators are a System Option – Not a configuration

•Accelerators do not reduce System Memory

Performance

Not getting any faster (GHz)

Parallelism

Reduce Scalar Bottlenecks

- Scatter / Gather of memory elements

Need parallel cache fetch/store engines

Mostly Parallel Execution - Predicated Execution

Avoiding Branches

Easy Programming model
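The "Predicated Execution" and "Avoiding Branches" points can be sketched with a per-lane compare that produces a mask, followed by a bitwise select. This is an illustrative model using Python lists as vector lanes, assuming 32-bit unsigned lanes; `compare_gt`, `select`, and `vmax` are hypothetical names, not instructions from either architecture.

```python
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1  # all-ones value for one lane

def compare_gt(a, b):
    """Per-lane compare: all-ones mask where a > b, zero elsewhere."""
    return [-(x > y) & LANE_MASK for x, y in zip(a, b)]

def select(mask, a, b):
    """Per-lane select: take a where the mask is all ones, b where it is zero."""
    return [(x & m) | (y & ~m & LANE_MASK) for m, x, y in zip(mask, a, b)]

def vmax(a, b):
    """Lane-wise maximum built from compare and select, with no per-lane branch."""
    return select(compare_gt(a, b), a, b)

print(vmax([1, 7, 3, 9], [4, 2, 3, 8]))
```

Every lane executes the same operations; the mask decides which operand survives, so the data never steers control flow.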

Final Note: Reliability

With all the parallelism something will go wrong

Many parallel executions need to be checked (1.00001)^1000

- Duplication physical vs time redundancy

- Residue and Parity checking

Dynamic Sparing

Or need physical replacing with common FRU
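Residue and parity checking, listed above, can be shown with a small model. This is a sketch under assumptions: the check modulus of 3 and the names `residue`, `parity`, and `checked_multiply` are chosen for illustration and are not taken from the slides. The residue of a product is predicted from the operand residues and compared with the residue of the observed result; the `observed` parameter exists only to inject a fault.

```python
CHECK_MODULUS = 3  # assumed check modulus for this illustration

def residue(x):
    """Residue of x under the check modulus."""
    return x % CHECK_MODULUS

def parity(x):
    """Parity bit of a non-negative integer: 1 if it has an odd number of 1 bits."""
    return bin(x).count("1") & 1

def checked_multiply(a, b, observed=None):
    """Multiply a and b and check the result against the predicted residue.

    `observed` lets the caller substitute a corrupted result to model a fault.
    Returns (result, check_passed).
    """
    product = a * b if observed is None else observed
    predicted = (residue(a) * residue(b)) % CHECK_MODULUS
    return product, residue(product) == predicted

a, b = 123456789, 987654321
good, good_ok = checked_multiply(a, b)
# Flip one bit of the correct product to model a single-bit error.
bad, bad_ok = checked_multiply(a, b, observed=(a * b) ^ (1 << 17))
print(good_ok, bad_ok)
```

A single-bit flip changes the value by a power of two, which is never a multiple of 3, so this residue check catches it; the stored parity bit also changes.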
