The Cray X1 Multiprocessor and Roadmap


Page 1: The Cray X1 Multiprocessor and Roadmap

Cray Inc.

The Cray X1 Multiprocessor and

Roadmap

Page 2: The Cray X1 Multiprocessor and Roadmap

ORNL Workshop at Fall Creek Falls, 10/27/03. Copyright 2003 Cray Inc.

Cray’s Computing Vision

[Roadmap figure: “Scalable High-Bandwidth Computing.” Two product lines converge through product integration: the vector line, X1 → X1E (2004) → ‘Black Widow’ (2006) → ‘Black Widow 2’, and the Red Storm (RS) line, ‘Strider X’ (2004) → ‘Strider 2’ (2005) → ‘Strider 3’ (2006), leading to ‘Cascade’ and sustained petaflops in 2010.]

Page 3: The Cray X1 Multiprocessor and Roadmap


Cray PVP

• Powerful vector processors

• Very high memory bandwidth

• Non-unit stride computation

• Special ISA features

• Modernized the ISA

T3E

• Extreme scalability

• Optimized communication

• Memory hierarchy

• Synchronization features

• Improved via vectors

Cray X1: extreme scalability with high-bandwidth vector processors

Page 4: The Cray X1 Multiprocessor and Roadmap


Cray X1 Instruction Set Architecture

New ISA:
– Much larger register set (32 × 64 vector, 64 + 64 scalar)
– All operations performed under mask
– 64- and 32-bit memory and IEEE arithmetic
– Integrated synchronization features

Advantages of a vector ISA:
– Compiler provides useful dependence information to the hardware
– Very high single-processor performance with low complexity

ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)

– Localized computation on the processor chip:
– Large register state with very regular access patterns
– Registers and functional units grouped into local clusters (pipes)
– An excellent fit with future IC technology

– Latency tolerance and pipelining to memory/network: very well suited for scalable systems

– Easy extensibility to the next-generation implementation

Page 5: The Cray X1 Multiprocessor and Roadmap


Not Your Father’s Vector Machine

• New instruction set
• New system architecture
• New processor microarchitecture

• “Classic” vector machines were programmed differently:
– Classic vector: optimize for loop length with little regard for locality
– Scalable micro: optimize for locality with little regard for loop length

• The Cray X1 is programmed like other parallel machines:
– Rewards locality: register, cache, local memory, remote memory
– Decoupled microarchitecture performs well on short loop nests
– (However, it does require vectorizable code)

Page 6: The Cray X1 Multiprocessor and Roadmap


[Node diagram: four multistream processors (P), each with caches ($), connected to sixteen M chips with local memory (mem) and to I/O ports (IO)]

• Four multistream processors (MSPs), each 12.8 Gflops

• High bandwidth local shared memory (128 Direct Rambus channels)

• 32 network links and four I/O links per node

51 Gflops, 200 GB/s

Cray X1 Node

Page 7: The Cray X1 Multiprocessor and Roadmap


Interconnection Network

• 16 parallel networks for bandwidth
• Global shared memory across the machine
• NUMA, scalable up to 1024 nodes

Page 8: The Cray X1 Multiprocessor and Roadmap


Network Topology (16 CPUs)

[Topology diagram: four nodes (node 0–3), each with four processors (P) and memory chips M0–M15; memory chip Mi on every node connects into network section i, giving 16 parallel sections (Section 0, Section 1, …, Section 15)]

Page 9: The Cray X1 Multiprocessor and Roadmap


Network Topology (128 CPUs)

[Topology diagram: node groups interconnected through routers (R), with bundles of 16 links between them]

Page 10: The Cray X1 Multiprocessor and Roadmap


Network Topology (512 CPUs)

Page 11: The Cray X1 Multiprocessor and Roadmap


Designed for Scalability

• Distributed shared memory (DSM) architecture:
– Low-latency load/store access to the entire machine (tens of TBs)

• Decoupled vector memory architecture for latency tolerance:
– Thousands of outstanding references, flexible addressing

• Very high performance network:
– High bandwidth, fine-grained transfers
– Same router as the Origin 3000, but 32 parallel copies of the network

• Architectural features for scalability:
– Remote address translation
– Global coherence protocol optimized for distributed memory
– Fast synchronization

• Parallel I/O scales with system size

T3E Heritage

Page 12: The Cray X1 Multiprocessor and Roadmap


Decoupled Microarchitecture

• Decoupled access/execute and decoupled scalar/vector

• Scalar unit runs ahead, doing addressing and control:
– Scalar and vector loads issued early
– Store addresses computed and saved for later use
– Operations queued and executed later when data arrives

• Hardware dynamically unrolls loops:
– Scalar unit goes on to issue the next loop before the current loop has completed
– Full scalar register renaming through 8-deep shadow registers
– Vector loads renamed through load buffers
– Special sync operations keep the pipeline full, even across barriers

This is key to making the system perform well on short vector-length (short-VL) code.

Page 13: The Cray X1 Multiprocessor and Roadmap


Address Translation

• High translation bandwidth: scalar + four vector translations per cycle per P

• Source translation using 256-entry TLBs with support for multiple page sizes

• Remote (hierarchical) translation:
– Allows each node to manage its own memory (eases memory management)
– TLB only needs to hold translations for one node, so it scales

[Address-layout figure. VA: 64-bit virtual address with memory-region bits 63–62 (useg, kseg, kphys), must-be-zero bits down to bit 48, and a virtual page number / page offset split whose boundary can fall anywhere from bit 16 to bit 32 (page sizes 64 KB to 4 GB). PA: 48-bit physical address split into a node field and an offset, covering main memory, MMRs, and I/O, with a maximum of 64 TB of physical memory.]

Page 14: The Cray X1 Multiprocessor and Roadmap


Cache Coherence

Global coherence, but only cache memory from the local node (8–32 GB):
– Supports SMP-style codes up to 4 MSPs (16 SSPs)
– References outside this domain are converted to non-allocate
– Scalable codes use explicit communication anyway
– Keeps the directory entry and protocol simple

Explicit cache allocation control:
– Per-instruction hint for vector references
– Per-page hint for scalar references
– Use non-allocating references to avoid cache pollution

Coherence directory stored on the M chips (rather than in DRAM):
– Low latency and very high bandwidth to support vectors

• Typical CC system: 1 directory update per processor per 100–200 ns
• Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred)

Page 15: The Cray X1 Multiprocessor and Roadmap

Cray Inc.

System Software

Page 16: The Cray X1 Multiprocessor and Roadmap


Cray X1 I/O Subsystem

[I/O diagram: each X1 node’s I chips drive I/O boards over SPC channels (2 × 1.2 GB/s); each I/O board carries four IOCAs feeding PCI-X buses (800 MB/s) populated with PCI-X cards; Fibre Channel (FC) cards run 200 MB/s FC-AL to RS200 storage arrays (SB/CB); IP over fiber (CPES) connects through a CNS to Gigabit Ethernet or HIPPI networks; file systems shown are XFS and XVM, plus ADIC SNFS]

Page 17: The Cray X1 Multiprocessor and Roadmap


Cray X1 UNICOS/mp

[Diagram: a 256-CPU Cray X1 presented as a single system image]

• A UNIX kernel executes on each application node (somewhat like the Chorus™ microkernel on UNICOS/mk); it provides SSI, scaling, and resiliency.
• Commands launch applications as on the T3E (mpprun).
• System service nodes provide file services, process management, and other basic UNIX functionality (like the /mk servers); user commands execute on the system service nodes.
• A global resource manager (Psched) schedules applications on Cray X1 nodes.

Page 18: The Cray X1 Multiprocessor and Roadmap


Programming Models

• Shared memory:
– OpenMP, pthreads
– Single node (51 Gflops, 8–32 GB)
– Fortran and C

• Distributed memory:
– MPI
– shmem(), UPC, Co-array Fortran

• Hierarchical models (DM over OpenMP)

• Both MSP and SSP execution modes supported for all programming models

Page 19: The Cray X1 Multiprocessor and Roadmap

Cray Inc.

Mechanical Design

Page 20: The Cray X1 Multiprocessor and Roadmap


Node Board

[Photo callouts: field-replaceable memory daughter cards; network interconnect; spray-cooling caps; air-cooling manifold; 17" × 22" PCB; CPU MCM (8 chips)]

Page 21: The Cray X1 Multiprocessor and Roadmap


Cray X1 Node Module

Page 22: The Cray X1 Multiprocessor and Roadmap


Cray X1 Chassis

Page 23: The Cray X1 Multiprocessor and Roadmap


64-Processor Cray X1 System (~820 Gflops)

Page 24: The Cray X1 Multiprocessor and Roadmap

Cray Inc.

Cray BlackWidow

The Next Generation HPC System From Cray Inc.

Page 25: The Cray X1 Multiprocessor and Roadmap


BlackWidow Highlights

• Designed as a follow-on to the Cray X1:
– Upward-compatible ISA
– FCS in 2006

• Significant improvement (well beyond the Moore’s Law rate) in:
– Single-thread scalar performance
– Price/performance

• Refine the X1 implementation:
– Continue to support DSM with 4-way SMP nodes
– Further improve sustained memory bandwidth
– Improve scalability up and down
– Improve reliability/fault tolerance
– Enhance the instruction set

• A few BlackWidow features:
– Single-chip vector microprocessor
– Hybrid electrical/optical interconnect
– Configurable memory capacity, memory bandwidth, and network bandwidth

Page 26: The Cray X1 Multiprocessor and Roadmap


Cascade Project

Cray Inc., Stanford, Cal Tech/JPL, Notre Dame

Page 27: The Cray X1 Multiprocessor and Roadmap


High Productivity Computing Systems

Goals: provide a new generation of economically viable high-productivity computing systems for the national security and industrial user community (2007–2010).

Impact:
– Performance (efficiency): speed up critical national security applications by a factor of 10X to 40X
– Productivity (time-to-solution)
– Portability (transparency): insulate research and operational application software from the system
– Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

Fill the critical technology and capability gap: from today (late-1980s HPC technology) to the future (quantum/bio computing).

Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.

HPCS Program Focus Areas

Page 28: The Cray X1 Multiprocessor and Roadmap


HPCS Planned Program, Phases 1–3

[Timeline figure, fiscal years 2002–2010: Phase 1 industry concept study (with concept reviews) leads through an industry BAA/RFP to Phase 2 R&D (industry-guided research, application analysis, performance assessment, requirements and metrics, technology assessments, system design review, PDR and early prototype, Phase 2 readiness reviews), then a Phase 3 readiness review gates Phase 3 full-scale development (planned): research prototypes and pilot systems, early software tools, early pilot platforms, and products. Supporting tracks include metrics and benchmarks and academia research platforms; critical program milestones end in HPCS capability or products.]

Page 29: The Cray X1 Multiprocessor and Roadmap


Technology Focus of Cascade

• System design:
– Network topology and router design
– Shared memory for low-latency, low-overhead communication

• Processor design:
– Exploring clustered vectors, multithreading, and streams
– Enhancing temporal locality via deeper memory hierarchies
– Improving latency tolerance and memory/communication concurrency

• Lightweight threads in memory:
– For exploiting spatial locality when there is little temporal locality
– Move the thread to the data instead of the data to the thread
– Exploring the use of PIMs (processors-in-memory)

• Programming environments and compilers:
– High-productivity languages
– Compiler support for heavyweight/lightweight thread decomposition
– Parallel programming performance analysis and debugging

• Operating systems:
– Scalability to tens of thousands of processors
– Robustness and fault tolerance

Lets us explore technologies we otherwise couldn’t: a three-year head start on the typical development cycle.

Page 30: The Cray X1 Multiprocessor and Roadmap


Cray’s Computing Vision

[Roadmap figure, repeated from slide 2: “Scalable High-Bandwidth Computing.” Two product lines converge through product integration: the vector line, X1 → X1E (2004) → ‘Black Widow’ (2006) → ‘Black Widow 2’, and the Red Storm (RS) line, ‘Strider X’ (2004) → ‘Strider 2’ (2005) → ‘Strider 3’ (2006), leading to ‘Cascade’ and sustained petaflops in 2010.]

Page 31: The Cray X1 Multiprocessor and Roadmap

Cray Inc.

File Name: BWOpsReview081103.ppt

Questions?