TRANSCRIPT
Computer Science and Engineering
Advanced Computer Architecture
CSE 8383
February 21, 2008
Session 6
Contents
Interconnection Networks (cont.): Static (cont.), Dynamic
Performance Evaluation: Grosch's Law, Moore's Law, Von Neumann's Bottleneck, Speedup, Amdahl's Law, the Gustafson-Barsis Law
Hypercubes
N = 2^d nodes, d dimensions (d = log2 N)
A cube of dimension d is made out of two cubes of dimension d-1
Symmetric
Degree, Diameter, Cost, Fault tolerance
Node labeling – number of bits
Hypercubes
[Figure: hypercubes of dimension d = 0, 1, 2, 3, nodes labeled with d-bit binary strings]
Hypercubes
[Figure: a hypercube of dimension d = 4, nodes labeled 0000–1111, with a source node S marked]
Hypercube of dimension d
N = 2^d, d = log2 N
Node degree = d
Number of bits to label a node = d
Diameter = d
Number of edges = N*d/2
Routing: Hamming distance!
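The routing rule can be sketched in code: the path length between two nodes equals the Hamming distance of their labels, and each hop flips one differing bit. A minimal sketch (the helper name is mine, not from the slides):

```python
def hypercube_route(src: int, dst: int) -> list[int]:
    """Greedy e-cube routing: flip differing address bits one at a
    time, lowest dimension first. Path length = Hamming distance."""
    path = [src]
    diff = src ^ dst           # bits where the two labels differ
    dim = 0
    while diff:
        if diff & 1:           # labels differ in this dimension
            src ^= (1 << dim)  # traverse the edge that flips this bit
            path.append(src)
        diff >>= 1
        dim += 1
    return path

# In a d = 3 cube, routing 000 -> 101 flips bits 0 and 2:
print([format(v, "03b") for v in hypercube_route(0b000, 0b101)])
# ['000', '001', '101']
```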
Subcubes and Cube Fragmentation
What is a subcube?
Shared environment
Fragmentation problem
Is it similar to something you know?
Cube Connected Cycles (CCC)
A k-cube has 2^k nodes
To form a k-CCC from a k-cube, replace each vertex of the k-cube with a ring of k nodes
A k-CCC has k * 2^k nodes
Degree = 3, diameter = 2k
Try it for the 3-cube
K-ary n-Cube
d = cube dimension
k = # nodes along each dimension
N = k^d
Wraparound links
Hypercube = binary d-cube
Torus = k-ary 2-cube
Analysis and performance metrics – static networks

Network       Degree (d)   Diameter (D)    Cost (# links)  Symmetry  Worst delay
CCN           N-1          1               N(N-1)/2        Yes       1
Linear Array  2            N-1             N-1             No        N
Binary Tree   3            2(log2 N - 1)   N-1             No        log2 N
n-cube        log2 N       log2 N          nN/2            Yes       log2 N
2D-Mesh       4            2(n-1)          2(N-n)          No        N
k-ary n-cube  2n           nk/2            nN              Yes       k * log2 N

(For the n-cube, n = log2 N; for the 2D mesh, n = sqrt(N).)
Dynamic IN
Bus Based IN
[Figure: bus-based INs – processors P sharing a single bus to global memory, and a variant where each processor attaches through a cache C]
Dynamic Interconnection Networks
Communication patterns are based on program demands
Connections are established on the fly during program execution
Multistage Interconnection Network (MIN) and Crossbar
Switch Modules
An A x B switch module has A inputs and B outputs
In practice, A = B = a power of 2
Each input is connected to one or more outputs (conflicts must be avoided)
One-to-one (permutation) and one-to-many connections are allowed
Binary Switch
2 x 2 switch
Legitimate states = 4
Permutation connections = 2
Legitimate Connections
Straight, Exchange, Upper-broadcast, Lower-broadcast
The different settings of the 2 x 2 SE
Group Work
General Case ??
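One way to attack the general case, assuming an n x n module in which each output independently selects one input (the counting then generalizes the 2 x 2 figures of 4 legitimate states and 2 permutation connections):

```python
from math import factorial

def switch_states(n: int) -> tuple[int, int]:
    # Each of the n outputs listens to exactly one of the n inputs,
    # so there are n**n legitimate states; the one-to-one settings
    # among them are the n! permutation connections.
    return n ** n, factorial(n)

print(switch_states(2))  # (4, 2) - matches the 2 x 2 binary switch
```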
Multistage Interconnection Networks
Stages of switches separated by ISC1, ISC2, …, ISCn
ISC: Inter-stage Connection Pattern
Perfect-Shuffle Routing Function
Given x = a_n a_(n-1) … a_2 a_1
P(x) = a_(n-1) … a_2 a_1 a_n  (one-bit left rotation)
X = 110001
P(X) = 100011
Perfect Shuffle Example
000 → 000
001 → 010
010 → 100
011 → 110
100 → 001
101 → 011
110 → 101
111 → 111
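The mapping above is a one-bit left rotation of the label, which can be sketched as (hypothetical helper name, not from the slides):

```python
def perfect_shuffle(x: int, n: int) -> int:
    """Left-rotate the n-bit label x by one position."""
    msb = (x >> (n - 1)) & 1
    return ((x << 1) & ((1 << n) - 1)) | msb

print(format(perfect_shuffle(0b110001, 6), "06b"))  # 100011
```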
Perfect-Shuffle
[Figure: the perfect-shuffle connection drawn between two columns of nodes labeled 000–111]
Exchange Routing Function
Given x = a_n a_(n-1) … a_2 a_1
E_i(x) = a_n a_(n-1) … a_i' … a_2 a_1  (a_i' = complement of bit a_i)
X = 0000000
E_3(X) = 0000100
Exchange E1
000 → 001
001 → 000
010 → 011
011 → 010
100 → 101
101 → 100
110 → 111
111 → 110
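The bit-complement rule can be sketched as (hypothetical helper name, not from the slides):

```python
def exchange(x: int, i: int) -> int:
    """E_i: complement bit a_i (bits numbered a_n ... a_1)."""
    return x ^ (1 << (i - 1))

print(format(exchange(0b0000000, 3), "07b"))  # 0000100
```

Applying E_i twice returns the original label, which is why the figure pairs nodes symmetrically.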
Exchange E1
[Figure: the exchange E1 connection drawn between two columns of nodes labeled 000–111]
Butterfly Routing Function
Given x = a_n a_(n-1) … a_2 a_1
B(x) = a_1 a_(n-1) … a_2 a_n  (swap the most and least significant bits)
X = 010001
B(X) = 110000
Butterfly Example
000 → 000
001 → 100
010 → 010
011 → 110
100 → 001
101 → 101
110 → 011
111 → 111
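The end-bit swap can be sketched as (hypothetical helper name, not from the slides):

```python
def butterfly(x: int, n: int) -> int:
    """B: swap the most and least significant bits a_n and a_1."""
    msb = (x >> (n - 1)) & 1
    lsb = x & 1
    if msb != lsb:
        x ^= (1 << (n - 1)) | 1  # flip both end bits
    return x

print(format(butterfly(0b010001, 6), "06b"))  # 110000
```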
Butterfly
[Figure: the butterfly connection drawn between two columns of nodes labeled 000–111]
Multi-stage network
[Figure: an 8 x 8 multistage network connecting inputs 000–111 to outputs 000–111]
MIN (cont.)
[Figure: an 8 x 8 Banyan network – 12 numbered switches in three stages, inputs and outputs labeled 000–111]
MIN Implementation
Control (X)
Source (S) Destination (D)
X = f(S,D)
Example
[Figure: a 2 x 2 switch with inputs A, B and outputs C, D; X = 0 gives the crossed setting, X = 1 the straight setting]
Consider this MIN
[Figure: an 8 x 8 MIN with sources S1–S8, destinations D1–D8, and three stages of 2 x 2 switches]
Example (Cont.)
Let the control variables be X1, X2, X3
Find the values of X1, X2, X3 to connect:
S1 → D6, S7 → D5, S4 → D1
The 3 connections
[Figure: the same 8 x 8 MIN showing the routes S1 → D6, S7 → D5, and S4 → D1 through the three stages]
Boolean Functions
X = x1, x2, x3
S = s1, s2, s3
D = d1, d2, d3
Find X = f(S, D)
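For one concrete MIN family, the omega (shuffle-exchange) network, f takes a famously simple form: the control bit at stage i is just the i-th destination bit, independent of S (destination-tag routing). A sketch under that assumption; the network drawn in these slides may wire its stages differently:

```python
def omega_route(src: int, dst: int, n: int = 3) -> list[int]:
    """Destination-tag routing in a 2^n x 2^n omega network: each
    stage perfect-shuffles the label, then the 2x2 switch forces the
    low bit to the next destination bit (MSB first)."""
    mask = (1 << n) - 1
    label, trace = src, [src]
    for i in range(n):
        label = ((label << 1) | (label >> (n - 1))) & mask  # shuffle
        d_i = (dst >> (n - 1 - i)) & 1
        label = (label & ~1) | d_i  # control bit = destination bit
        trace.append(label)
    return trace

# The message lands on dst after n stages, whatever src was:
print([format(v, "03b") for v in omega_route(0b001, 0b110)])
# ['001', '011', '111', '110']
```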
Crossbar Switch
[Figure: an 8 x 8 crossbar – processors P1–P8 on the rows, memory modules M1–M8 on the columns, a crosspoint switch at each intersection]
Analysis and performance metrics – dynamic networks

Network       Delay      Cost        Blocking  Degree of FT
Bus           O(N)       O(1)        Yes       0
Multiple-bus  O(mN)      O(m)        Yes       (m-1)
MIN           O(logN)    O(NlogN)    Yes       0
Crossbar      O(1)       O(N^2)      No        0
Performance Evaluations
Grosch’s Law (1960s)
“To sell a computer for twice as much, it must be four times as fast”
Vendors skip small speed improvements in favor of waiting for large ones
Buyers of expensive machines would wait for a twofold improvement in performance for the same price.
Moore’s Law
Gordon Moore (cofounder of Intel)
Processor performance would double every 18 months
This prediction has held for several decades
Unlikely that single-processor performance continues to increase indefinitely
Von Neumann's Bottleneck
Great mathematician of the 1940s and 1950s
Single control unit connecting a memory to a processing unit
Instructions and data are fetched one at a time from memory and fed to the processing unit
Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit
Past Trends in Parallel Architecture (inside the box)
Completely custom designed components (processors, memory, interconnects, I/O)
Long R&D time (2-3 years)
Expensive systems
Quickly becoming outdated
– Bankrupt companies!!
Current Trends in Parallel Architecture (outside the box) -- before multicore!!
Advances in commodity processors and network technology
A network of PCs and workstations connected via LAN or WAN forms a parallel system
Network computing
Competes favorably (cost/performance)
Utilizes unused cycles of systems sitting idle
Speedup
S = Speed(new) / Speed(old)
S = [Work/time(new)] / [Work/time(old)]
S = time(old) / time(new)
S = time(before improvement) / time(after improvement)
Speedup
Time (one CPU): T(1)
Time (n CPUs): T(n)
Speedup: S
S = T(1)/T(n)
Two Important Laws Influenced Parallel Computing
Argument Against Massively Parallel Processing. Gene Amdahl, 1967.
For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of multiplicity of computers in such a manner as to permit cooperative solution .. The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor… At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome.
What does that mean?
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
The unparallelizable part of the code severely limits the speedup.
Trip Analogy
[Figure: a trip from A to B – a 200-mile stretch where any vehicle may be used, plus a segment that must be walked (20 hours at 4 miles/hour)]
Walk 4 miles/hour; Bike 10 miles/hour; Car-1 50 miles/hour; Car-2 120 miles/hour; Car-3 600 miles/hour
Speedup Analysis
Walk  (4 miles/hour)    Time = 70 hours
Bike  (10 miles/hour)   Time = 40 hours     S = 1.8
Car-1 (50 miles/hour)   Time = 24 hours     S = 2.9
Car-2 (120 miles/hour)  Time = 21.67 hours  S = 3.2
Car-3 (600 miles/hour)  Time = 20.33 hours  S = 3.4
Amdahl's Law

S = T(1)/T(N)
T(N) = α T(1) + (1-α) T(1)/N
S = 1 / (α + (1-α)/N) = N / (αN + (1-α))

α: the fraction of the program that is naturally serial
(1-α): the fraction of the program that is naturally parallel
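The closed form can be evaluated directly. A minimal sketch (the function name is mine, not the slides'): with a 10% serial fraction, adding CPUs pushes the speedup toward, but never past, 1/α = 10.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's law: alpha = naturally serial fraction, n = # CPUs."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# 10% serial code caps the speedup near 1/alpha = 10:
for n in (4, 16, 1000):
    print(n, round(amdahl_speedup(0.10, n), 2))
# prints: 4 3.08 / 16 6.4 / 1000 9.91
```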
Amdahl's Law
[Figure: speedup (0–25) vs. % serial (10%–99%) for 4, 16, and 1000 CPUs]
Gustafson – Barsis Law (1988)
Gordon Bell Prize
Overcoming the conceptual barrier established by Amdahl's law
Scale the problem to the size of the parallel system – no fixed-size problem
α: the fraction of the program that is naturally serial

T(N) = 1
T(1) = α + (1-α)N
S = N - α(N-1)
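The scaled-speedup formula is linear in N, so the serial fraction no longer caps it. A sketch (hypothetical function name):

```python
def gustafson_speedup(alpha: float, n: int) -> float:
    """Gustafson-Barsis scaled speedup: S = N - alpha(N - 1)."""
    return n - alpha * (n - 1)

# With 10% serial code, 1000 CPUs still give a scaled speedup near 900:
print(round(gustafson_speedup(0.10, 1000), 1))  # 900.1
```

Contrast this with Amdahl's fixed-size bound of 1/α = 10 for the same serial fraction.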
Amdahl vs. Gustafson-Barsis
[Figure: speedup (0–100) vs. % serial (10%–99%) – the Gustafson-Barsis curve stays far above the Amdahl curve]
Data Parallelism – Scale up
Parallelism is in the data, not the control portion of the application
Problem size scales up to the size of the system
Data parallelism is to the 1990s what vector parallelism was to the 1970s
Supercomputer = data parallel
Problem
Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must the diameter of a round chip be so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
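One way to check the arithmetic, assuming a signal must traverse the full diameter between switchings (a sketch, not the official solution):

```python
# Signal speed from the problem statement: 300,000 km/s = 3e8 m/s
SIGNAL_SPEED = 3.0e8  # meters per second

def max_diameter(switches_per_second: float) -> float:
    """Largest chip diameter whose edge-to-edge signal time fits in
    one switching period (1 / switches_per_second)."""
    return SIGNAL_SPEED / switches_per_second

print(max_diameter(1e9))   # 0.3 m  (30 cm)
print(max_diameter(1e12))  # 0.0003 m  (0.3 mm)
```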