“a next-generation many-core processor with reliability...

25
“A next-generation many-core processor with reliability fault tolerance and adaptive power reliability, fault tolerance and adaptive power management features optimized for embedded and high performance computing embedded and high performance computing applications.” Simon McIntosh-Smith, VP of Applications simon@clearspeed com simon@clearspeed.com

Upload: others

Post on 18-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

“A next-generation many-core processor with reliability fault tolerance and adaptive powerreliability, fault tolerance and adaptive power management features optimized for embedded and high performance computingembedded and high performance computing applications.”

Simon McIntosh-Smith, VP of Applications simon@clearspeed [email protected]

Page 2: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

ClearSpeed company introduction

• ClearSpeed Federal Systems based in San Jose, CA, founded in 2008

• ClearSpeed Technology HQ in Bristol, UK, founded in 2002

• Deliver the world’s most power efficient, high-performance processors

• Over 100 patents on the core technology• Solutions for defence, embedded systems and high

fperformance computing• Licensed IP to BAE Systems for use in future space-based

tsystems• Partnering with HP, SGI, IBM, and Sun

Page 3: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

ClearSpeed’s “Smart SIMD” MTAP processor core

• “Smart SIMD” Multi-Threaded Array Processing:MTAP Processing:– Multi-threading for asynchronous,

overlapped I/O with computeS l bl f P

Peripheral Network

M

System Network

– Scalable array of many Processor Elements (PEs)

– Coarse-grained data parallel

Data Cache

Mono Controller Instruc-

tion Cache

Control and

Debugg p

processing– Supports redundancy and resiliency

Poly Controller

• Programmed in an extended version of ANSI C called Cn:

PE 0

PE 1

PE n-1…

version of ANSI C called C :– Rich expressive semantics – Single “poly” data type modifier

Programmable I/O to DRAM System Network

g p y yp

Page 4: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Smart SIMD Processing Elements (PEs)

M lti l ti it PE• Multiple execution units per PE• Floating point adder 32 & 64-bit

IEEE 754}• Floating point multiplier• Fixed-point MAC 16x16 → 32+64

IEEE 754PEn

P M

ul

P A

dd

MA

C

ALU

}

• Integer ALU with shifter• Load/storeRegister File

128 B t

FP FP M A64 64 64

6464 PEn+1

PEn–1

• Fast inter-PE communication path• Closely coupled, ECC-protected SRAM

128 Bytes

PE SRAMe.g. 6 KBytes

for data• High bandwidth per PE DMA (PIO)

Memory address generator (PIO)

32

128

32

• Per PE address generators• Platform design enables PE variants

PIO Collection & Distribution128

Page 5: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

The ClearConnect™ “Network on Chip” system bus

• Scalable, System on Chip (SoC) platform interconnect

CC CC

platform interconnect• Used to connect together all the

major blocks on a chip:NODE NODE

major blocks on a chip:– Multiple MTAP smart SIMD cores– Multiple memory controllers– On-chip memories– System interfaces, e.g. PCI Express

SoCBLOCK

SoCBLOCK

• Unified memory architecture• Distributed arbitration

A B • Scalable bandwidths• Low power design

Page 6: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

ClearSpeed’s product range

• Processors– CSX700, CSX600

• Boards– e720, e710, e620, X620

• Systems– CATS-700 1TFLOP in 1U

• SoftwareC il– Compiler

– Libraries– Heterogeneous profilerHeterogeneous profiler– GDB-based debugger

Page 7: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

I l d d l MTAP

The CSX700 Processor

• Includes dual MTAP cores:– 96 GFLOPS peak (32 & 64-bit)– 48 GMACS peak (16x16 → 32+64)48 GMACS peak (16x16 → 32 64)– 10 power consumption– 250MHz clock speed

192 P i El t (2 96)– 192 Processing Elements (2x96)– 8 spare PEs for resiliency– ECC on all internal memories

• Dual integrated 64-bit DDR2 memory controllers with ECC

• Integrated PCI Express x16• CCBR chip-to-chip bridge port

2 128 KB t f hi SRAM• 2x128 KBytes of on-chip SRAM• IBM 90nm process• 266 million transistors• 266 million transistors

Page 8: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Reliability, fault tolerance & adaptive power management features

Reliability and fault toleranceReliability and fault tolerance• 8 spare PEs can be kept in reserve• Error Correcting Codes (ECC) on all on chip memories• Error Correcting Codes (ECC) on all on-chip memories• ECC with memory scrubber on all external memories• On die temperature sensors• On-die temperature sensors• Low power and modest clock speed design

P i d l t f l d d ti• Programming model supports graceful degradation

Ad ti tAdaptive power management• Ability for software to modify clock speed dynamically, even

while an application is runningwhile an application is running• On-die temperature sensors allow for thermal ceilings to be

set by softwareset by software• Can also throttle clock speed based on power consumption

Page 9: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 – Internal Architecture

x96

Page 10: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Powerful software development environment

• Version 3.11 launched in June 08Version 3.11 launched in June 08

• Binary compatible across all ClearSpeed products

• ANSI C-based optimising compiler – Cn

GDB b d d b• GDB-based debugger

• Profiling with heterogeneous, system-wide capabilities

• Libraries (BLAS, RNG, FFT, more…) & High level APIs (CSPX)

• ECLIPSE-based IDE

Page 11: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Eclipse IDE – CSX Debug Perspective

Standard Eclipse graphical debug interface for CSX gprocessor debugging.

CSX processor provides full p phardware debugging of application code.

Provides a seamless view of many processor cores in parallel with their associated state.

Allows full symbolic debug of th Cn lthe Cn language.

Enhanced views for CSX specific informationspecific information.

Page 12: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

ClearSpeed profiler for heterogeneous multi-processor systemsHOST/BOARD INTERACTION PROFILINGHOST CODE PROFILING

Advance™ Accelerator BoardHostH t Advance™ Accelerator Board

CSX 600

Pipeline

CSX 600

Pipeline

CPU(s)HostCPU(s)Host

CPU(s)

Advance Accelerator Board

HostCPU(s)

CSX

Pipeline

CSX

Pipeline

CSX PIPELINE PROFILING CSX SYSTEM PROFILING

Page 13: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Defense focus

P t h l ith i l i fProcessor technology with a signal-processing focus• Radar, Sonar

Imaging• Imaging• Target discrimination• Electronic warfareElectronic warfare• Communications intelligence

Engineered for optimal Size, Weight, and Power• Best performance per watt in the industry

Deployable across harsh environments• Air, land, and sea• Commercial, rugged, conduction cooled configurations• Embedded form factors

Page 14: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Embedded/Defense Concept XMC and VXS modulesXMC card• Performance

– 96 GFLOPS, 48 GMACS– 192 GBytes/s peak on-chip bandwidth

8 GBytes/s peak off chip– 8 GBytes/s peak off-chip• High bandwidth via PCI Express I/O

– Up to 2 GBytes/sec via PCIe x2,x4,x8– Up to 4 GBytes/sec via PCIe x16p y

• 15 watts typical, 25 watts max

Vita 41 (VME Switched Serial VXS) Module2 GB DDR2

Vita 41 (VME Switched Serial - VXS) Module• Performance

– 192 GFLOPS, 96 GMACS– 384 GBytes/s peak on-chip bandwidth

2 GB

PCIe x4

PCIe x4

384 GBytes/s peak on chip bandwidth– 16 GBytes/s peak off-chip

• High bandwidth – VME interface (P1 & P2) PCIe x4

DDR2 – ClearConnect CCBR for chip-to-chip– (2) 4x PCIe links off-board (Vita 41.4)

• 30 watts typical, 50 watts max

CONCEPTS, NOT COMMITTED PRODUCTS

Page 15: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

Benchmark details

• Power consumption figures are for a single• Power consumption figures are for a single CSX700 running at 0.9V and 42°C

• For 2D FFTs, timing and power consumption measurements include the corner turnmeasurements include the corner turn

• All data always starts and finishes in the off-chip DRAM

• FFTs were run simultaneously on both MTAPFFTs were run simultaneously on both MTAP cores of the CSX700

Page 16: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 1D FFTs per second

Page 17: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 1D FFT GFLOPS

Page 18: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 1D FFT core power

7.4

7.0

7.5

6.5

6.6

6.5

W)

5.95.7

5.5

6.0

re pow

er (W

5.24.95.0

CSX7

00 co

r

128

256

4.5

3 84 0

4.5512

1024

2048

3.6

3.8

3.5

4.0

50MHz 100MHz 150MHz 200MHz 250MHz50MHz 100MHz 150MHz 200MHz 250MHz

Core Clock Speed

Page 19: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 1D FFT core power GFLOPS per watt

2.6

3.0

2.1

2.42.5

W

1.91.7

2.0

ore GFLOPS/W

1.5

1.7

1.5

CSX7

00 co

128

256

1.21.1

1.0

512

1024

2048

0.7

0.5

50MHz 100MHz 150MHz 200MHz 250MHz

Core Clock Speed

Page 20: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 2D FFTs per second

Page 21: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 2D FFT GFLOPS

Page 22: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 convolutions per second

10 000 00010,000,000

128

256

512

1,362,6371,816,039

2,267,810

1,000,000ond

512

1024

2048

454,215

909,053

, ,

398,708

597,996797,284

995,686

348 092435,204

,000,000

ons pe

r seco

199,524

,

174,218

261,323348,092

114 468152,549

190,808

100,000

1D con

voluti

87,118

38,132

76,330

114,468

32 872

49,28265,734

82,093

1

16,455

32,872

10,000

50MH 100MH 150MH 200MH 250MH50MHz 100MHz 150MHz 200MHz 250MHz

Core Clock Speed

Page 23: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

25

CSX700 convolution GFLOPS

22.6

25

18.1 19.8

20

PS

13.6 15.915

ution GFLOP

9.111.9

10

1D con

volu

128

4.5

4 0

7.9

5

256

512

1024

20484.0

0

50MHz 100MHz 150MHz 200MHz 250MHz

2048

50MHz 100MHz 150MHz 200MHz 250MHz

Core Clock Speed

Page 24: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CATS-700 FFT performance

With 12 CSX700 processors the CATS-700 can achieve:• 4.2 million 1024 point 1D FFTs per second (215 GFLOPS)• 26,000 1024x1024 2D FFTs per second (194 GFLOPS)• 25 billion 32-bit inverse square roots• 14 billion 32-bit sines• 12 billion 32-bit exponentials• An aggregate STREAM benchmark of ~80 GBytes/s

Page 25: “A next-generation many-core processor with reliability ...simonm/publications/ClearSpeed_HPEC0… · 128 B t F F 64 64 64 64 64 PE n–1 n+1 • Fast inter-PE communication path

CSX700 Transcendental Function Performance

2.5E+09

32‐bit

64‐bit

2.0E+09

econ

d

1.5E+09

tion

s pe

r se

1.0E+09

er of ope

rat

5.0E+08

Num

be

0 0E+000.0E+00

cs_sinp cs_cosp cs_sincos cs_tanp cs_atanp cs_sqrtp cs_isqrtp cs_logp cs_expp