TRANSCRIPT
Computer Science and Engineering
Advanced Computer Architecture
CSE 8383
February 21, 2008
Session 6
Contents
Interconnection Networks (cont.): Static (cont.), Dynamic
Performance Evaluation: Grosch's Law, Moore's Law, Von Neumann's Bottleneck, Speedup, Amdahl's Law, the Gustafson-Barsis Law
Hypercubes
N = 2^d nodes, d dimensions (d = log2 N)
A cube of dimension d is made out of two cubes of dimension d-1
Symmetric
Degree, Diameter, Cost, Fault tolerance
Node labeling – number of bits
Hypercubes
[Figure: hypercubes of dimension d = 0, 1, 2, 3, nodes labeled with d-bit binary strings]
Hypercubes
[Figure: a hypercube of dimension d = 4, nodes labeled 0000–1111, with a source node S marked]
Hypercube of dimension d
N = 2^d, d = log2 N
Node degree = d
Number of bits to label a node = d
Diameter = d
Number of edges = N*d/2
Routing: Hamming distance!
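The routing rule can be sketched in code: the path length between two nodes equals the Hamming distance of their labels, and each hop flips one differing bit. A minimal sketch (the helper name is mine, not from the slides):

```python
def hypercube_route(src: int, dst: int) -> list[int]:
    """Greedy e-cube routing: flip differing address bits one at a
    time, lowest dimension first. Path length = Hamming distance."""
    path = [src]
    diff = src ^ dst           # bits where the two labels differ
    dim = 0
    while diff:
        if diff & 1:           # labels differ in this dimension
            src ^= (1 << dim)  # traverse the edge that flips this bit
            path.append(src)
        diff >>= 1
        dim += 1
    return path

# In a d = 3 cube, routing 000 -> 101 flips bits 0 and 2:
print([format(v, "03b") for v in hypercube_route(0b000, 0b101)])
# ['000', '001', '101']
```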
Subcubes and Cube Fragmentation
What is a subcube?
Shared environment
Fragmentation problem
Is it similar to something you know?
Cube Connected Cycles (CCC)
A k-cube has 2^k nodes
To form a k-CCC from a k-cube, replace each vertex of the k-cube with a ring of k nodes
A k-CCC has k * 2^k nodes
Degree = 3, diameter = 2k
Try it for the 3-cube
K-ary n-Cube
d = cube dimension
k = # nodes along each dimension
N = k^d
Wraparound links
Hypercube = binary d-cube
Torus = k-ary 2-cube
Analysis and performance metrics – static networks

Network       Degree (d)   Diameter (D)    Cost (# links)  Symmetry  Worst delay
CCN           N-1          1               N(N-1)/2        Yes       1
Linear Array  2            N-1             N-1             No        N
Binary Tree   3            2(log2 N - 1)   N-1             No        log2 N
n-cube        log2 N       log2 N          nN/2            Yes       log2 N
2D-Mesh       4            2(n-1)          2(N-n)          No        N
k-ary n-cube  2n           nk/2            nN              Yes       k * log2 N

(For the n-cube, n = log2 N; for the 2D mesh, n = sqrt(N).)
Dynamic IN
Bus Based IN
[Figure: bus-based INs – processors P sharing a single bus to global memory, and a variant where each processor attaches through a cache C]
Dynamic Interconnection Networks
Communication patterns are based on program demands
Connections are established on the fly during program execution
Multistage Interconnection Network (MIN) and Crossbar
Switch Modules
An A x B switch module has A inputs and B outputs
In practice, A = B = a power of 2
Each input is connected to one or more outputs (conflicts must be avoided)
One-to-one (permutation) and one-to-many connections are allowed
Binary Switch
2 x 2 switch
Legitimate states = 4
Permutation connections = 2
Legitimate Connections
Straight, Exchange, Upper-broadcast, Lower-broadcast
The different settings of the 2 x 2 SE
Group Work
General Case ??
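One way to attack the general case, assuming an n x n module in which each output independently selects one input (the counting then generalizes the 2 x 2 figures of 4 legitimate states and 2 permutation connections):

```python
from math import factorial

def switch_states(n: int) -> tuple[int, int]:
    # Each of the n outputs listens to exactly one of the n inputs,
    # so there are n**n legitimate states; the one-to-one settings
    # among them are the n! permutation connections.
    return n ** n, factorial(n)

print(switch_states(2))  # (4, 2) - matches the 2 x 2 binary switch
```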
Multistage Interconnection Networks
Stages of switches separated by ISC1, ISC2, …, ISCn
ISC: Inter-stage Connection Pattern
Perfect-Shuffle Routing Function
Given x = a_n a_(n-1) … a_2 a_1
P(x) = a_(n-1) … a_2 a_1 a_n  (one-bit left rotation)
X = 110001
P(X) = 100011
Perfect Shuffle Example
000 → 000
001 → 010
010 → 100
011 → 110
100 → 001
101 → 011
110 → 101
111 → 111
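The mapping above is a one-bit left rotation of the label, which can be sketched as (hypothetical helper name, not from the slides):

```python
def perfect_shuffle(x: int, n: int) -> int:
    """Left-rotate the n-bit label x by one position."""
    msb = (x >> (n - 1)) & 1
    return ((x << 1) & ((1 << n) - 1)) | msb

print(format(perfect_shuffle(0b110001, 6), "06b"))  # 100011
```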
Perfect-Shuffle
[Figure: the perfect-shuffle connection drawn between two columns of nodes labeled 000–111]
Exchange Routing Function
Given x = a_n a_(n-1) … a_2 a_1
E_i(x) = a_n a_(n-1) … a_i' … a_2 a_1  (a_i' = complement of bit a_i)
X = 0000000
E_3(X) = 0000100
Exchange E1
000 → 001
001 → 000
010 → 011
011 → 010
100 → 101
101 → 100
110 → 111
111 → 110
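The bit-complement rule can be sketched as (hypothetical helper name, not from the slides):

```python
def exchange(x: int, i: int) -> int:
    """E_i: complement bit a_i (bits numbered a_n ... a_1)."""
    return x ^ (1 << (i - 1))

print(format(exchange(0b0000000, 3), "07b"))  # 0000100
```

Applying E_i twice returns the original label, which is why the figure pairs nodes symmetrically.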
Exchange E1
[Figure: the exchange E1 connection drawn between two columns of nodes labeled 000–111]
Butterfly Routing Function
Given x = a_n a_(n-1) … a_2 a_1
B(x) = a_1 a_(n-1) … a_2 a_n  (swap the most and least significant bits)
X = 010001
B(X) = 110000
Butterfly Example
000 → 000
001 → 100
010 → 010
011 → 110
100 → 001
101 → 101
110 → 011
111 → 111
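The end-bit swap can be sketched as (hypothetical helper name, not from the slides):

```python
def butterfly(x: int, n: int) -> int:
    """B: swap the most and least significant bits a_n and a_1."""
    msb = (x >> (n - 1)) & 1
    lsb = x & 1
    if msb != lsb:
        x ^= (1 << (n - 1)) | 1  # flip both end bits
    return x

print(format(butterfly(0b010001, 6), "06b"))  # 110000
```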
Butterfly
[Figure: the butterfly connection drawn between two columns of nodes labeled 000–111]
Multi-stage network
[Figure: an 8 x 8 multistage network connecting inputs 000–111 to outputs 000–111]
MIN (cont.)
[Figure: an 8 x 8 Banyan network – 12 numbered switches in three stages, inputs and outputs labeled 000–111]
MIN Implementation
Control (X)
Source (S) Destination (D)
X = f(S,D)
Example
[Figure: a 2 x 2 switch with inputs A, B and outputs C, D; X = 0 gives the crossed setting, X = 1 the straight setting]
Consider this MIN
[Figure: an 8 x 8 MIN with sources S1–S8, destinations D1–D8, and three stages of 2 x 2 switches]
Example (Cont.)
Let the control variables be X1, X2, X3
Find the values of X1, X2, X3 to connect:
S1 → D6, S7 → D5, S4 → D1
The 3 connections
[Figure: the same 8 x 8 MIN showing the routes S1 → D6, S7 → D5, and S4 → D1 through the three stages]
Boolean Functions
X = x1, x2, x3
S = s1, s2, s3
D = d1, d2, d3
Find X = f(S, D)
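For one concrete MIN family, the omega (shuffle-exchange) network, f takes a famously simple form: the control bit at stage i is just the i-th destination bit, independent of S (destination-tag routing). A sketch under that assumption; the network drawn in these slides may wire its stages differently:

```python
def omega_route(src: int, dst: int, n: int = 3) -> list[int]:
    """Destination-tag routing in a 2^n x 2^n omega network: each
    stage perfect-shuffles the label, then the 2x2 switch forces the
    low bit to the next destination bit (MSB first)."""
    mask = (1 << n) - 1
    label, trace = src, [src]
    for i in range(n):
        label = ((label << 1) | (label >> (n - 1))) & mask  # shuffle
        d_i = (dst >> (n - 1 - i)) & 1
        label = (label & ~1) | d_i  # control bit = destination bit
        trace.append(label)
    return trace

# The message lands on dst after n stages, whatever src was:
print([format(v, "03b") for v in omega_route(0b001, 0b110)])
# ['001', '011', '111', '110']
```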
Crossbar Switch
[Figure: an 8 x 8 crossbar – processors P1–P8 on the rows, memory modules M1–M8 on the columns, a crosspoint switch at each intersection]
Analysis and performance metrics – dynamic networks

Network       Delay      Cost        Blocking  Degree of FT
Bus           O(N)       O(1)        Yes       0
Multiple-bus  O(mN)      O(m)        Yes       (m-1)
MIN           O(logN)    O(NlogN)    Yes       0
Crossbar      O(1)       O(N^2)      No        0
Performance Evaluations
Grosch’s Law (1960s)
“To sell a computer for twice as much, it must be four times as fast”
Vendors skip small speed improvements in favor of waiting for large ones
Buyers of expensive machines would wait for a twofold improvement in performance for the same price.
Moore’s Law
Gordon Moore (cofounder of Intel)
Processor performance would double every 18 months
This prediction has held for several decades
Unlikely that single-processor performance continues to increase indefinitely
Von Neumann's Bottleneck
Great mathematician of the 1940s and 1950s
Single control unit connecting a memory to a processing unit
Instructions and data are fetched one at a time from memory and fed to the processing unit
Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit
Past Trends in Parallel Architecture (inside the box)
Completely custom designed components (processors, memory, interconnects, I/O)
Long R&D time (2-3 years)
Expensive systems
Quickly becoming outdated
– Bankrupt companies!!
Current Trends in Parallel Architecture (outside the box) -- before multicore!!
Advances in commodity processors and network technology
A network of PCs and workstations connected via LAN or WAN forms a parallel system
Network computing
Competes favorably (cost/performance)
Utilizes unused cycles of systems sitting idle
Speedup
S = Speed(new) / Speed(old)
S = [Work/time(new)] / [Work/time(old)]
S = time(old) / time(new)
S = time(before improvement) / time(after improvement)
Speedup
Time (one CPU): T(1)
Time (n CPUs): T(n)
Speedup: S
S = T(1)/T(n)
Two Important Laws Influenced Parallel Computing
Argument Against Massively Parallel Processing. Gene Amdahl, 1967.
For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of multiplicity of computers in such a manner as to permit cooperative solution .. The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor… At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome.
What does that mean?
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
The unparallelizable part of the code severely limits the speedup.
Trip Analogy
[Figure: a trip from A to B – a 200-mile stretch where any vehicle may be used, plus a segment that must be walked (20 hours at 4 miles/hour)]
Walk 4 miles/hour; Bike 10 miles/hour; Car-1 50 miles/hour; Car-2 120 miles/hour; Car-3 600 miles/hour
Speedup Analysis
Walk  (4 miles/hour)    Time = 70 hours
Bike  (10 miles/hour)   Time = 40 hours     S = 1.8
Car-1 (50 miles/hour)   Time = 24 hours     S = 2.9
Car-2 (120 miles/hour)  Time = 21.67 hours  S = 3.2
Car-3 (600 miles/hour)  Time = 20.33 hours  S = 3.4
Amdahl's Law

S = T(1)/T(N)
T(N) = α T(1) + (1-α) T(1)/N
S = 1 / (α + (1-α)/N) = N / (αN + (1-α))

α: the fraction of the program that is naturally serial
(1-α): the fraction of the program that is naturally parallel
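The closed form can be evaluated directly. A minimal sketch (the function name is mine, not the slides'): with a 10% serial fraction, adding CPUs pushes the speedup toward, but never past, 1/α = 10.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's law: alpha = naturally serial fraction, n = # CPUs."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# 10% serial code caps the speedup near 1/alpha = 10:
for n in (4, 16, 1000):
    print(n, round(amdahl_speedup(0.10, n), 2))
# prints: 4 3.08 / 16 6.4 / 1000 9.91
```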
Amdahl's Law
[Figure: speedup (0–25) vs. % serial (10%–99%) for 4, 16, and 1000 CPUs]
Gustafson – Barsis Law (1988)
Gordon Bell Prize
Overcoming the conceptual barrier established by Amdahl's law
Scale the problem to the size of the parallel system – no fixed-size problem
α: the fraction of the program that is naturally serial

T(N) = 1
T(1) = α + (1-α)N
S = N - α(N-1)
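The scaled-speedup formula is linear in N, so the serial fraction no longer caps it. A sketch (hypothetical function name):

```python
def gustafson_speedup(alpha: float, n: int) -> float:
    """Gustafson-Barsis scaled speedup: S = N - alpha(N - 1)."""
    return n - alpha * (n - 1)

# With 10% serial code, 1000 CPUs still give a scaled speedup near 900:
print(round(gustafson_speedup(0.10, 1000), 1))  # 900.1
```

Contrast this with Amdahl's fixed-size bound of 1/α = 10 for the same serial fraction.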
Amdahl vs. Gustafson-Barsis
[Figure: speedup (0–100) vs. % serial (10%–99%) – the Gustafson-Barsis curve stays far above the Amdahl curve]
Data Parallelism – Scale up
Parallelism is in the data, not the control portion of the application
Problem size scales up to the size of the system
Data parallelism is to the 1990s what vector parallelism was to the 1970s
Supercomputer = data parallel
Problem
Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must the diameter of a round chip be so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
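One way to check the arithmetic, assuming a signal must traverse the full diameter between switchings (a sketch, not the official solution):

```python
# Signal speed from the problem statement: 300,000 km/s = 3e8 m/s
SIGNAL_SPEED = 3.0e8  # meters per second

def max_diameter(switches_per_second: float) -> float:
    """Largest chip diameter whose edge-to-edge signal time fits in
    one switching period (1 / switches_per_second)."""
    return SIGNAL_SPEED / switches_per_second

print(max_diameter(1e9))   # 0.3 m  (30 cm)
print(max_diameter(1e12))  # 0.0003 m  (0.3 mm)
```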