first semester examinations june 2000 advanced

FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)

1

The University of Western Australia

Department of Electrical and Electronic Engineering

FIRST SEMESTER EXAMINATIONS JUNE 2000 ADVANCED COMPUTER ARCHITECTURE (623.300) This paper contains: 10 questions; XX pages. Candidates should attempt ALL questions QUESTION 1 (10 Marks) Compare buses, crossbar switches, and multistage networks for building a multiprocessor system. Assume a word length of w bits and 2 × 2 switches are used in building the multiprocessor system with n processors and m shared-memory modules. Assume a word length of w bits and that 2 × 2 switches are used in building the multistage networks. To facilitate the comparison process, complete the following table.

Network Characteristics

Bus System Multistage Network Crossbar Switch

Minimum latency for unit data transfer

Bandwidth per processor

Wiring complexity Switching complexity Connectivity and routing capability

Remarks Assume n processors on the bus; bus width is w bits

n × n MIN using k × k switches with line width of w bits

Assume n × n crossbar with line width of w bits


2

Answer:

QUESTION 2 (15 Marks) Analyse the data dependencies among the following statements:

S1: /RDG�5�� / R1 ← 1024 / S2: /RDG�5��0�� / R2 ← Memory(10) / S3: $GG�5��5� / R1 ← (R1) + (R2) / S4: 6WRUH�0��5� / Memory(1024) ← (R1) / S5: 6WRUH�0��5�� / Memory(64) ← 1024 /

Note that (Ri) means that the content of register Ri and Memory(10) contains 64 initially.

a) Draw a dependence graph to show all the dependencies.

b) Are there any resource dependencies if only one copy of each functional unit is available in the CPU?

Answer: a)

Network Characteristics

Bus System Multistage Network Crossbar Switch

Minimum latency for unit data transfer

Constant O(logk n) Constant

Bandwidth per processor

O(w/n) to O(w) O(w) to O(nw) O(w) to O(nw)

Wiring complexity O(w) O(nwlogkn) O(n2w) Switching complexity O(n) O(nlogkn) O(n2) Connectivity and routing capability

Only one-to-one at a time

Some permutations and broadcast, if network unblocked

All permutations, one at a time

Remarks Assume n processors on the bus; bus width is w bits

n × n MIN using k × k switches with line width of w bits

Assume n × n crossbar with line width of w bits

S1

S4

S2

S3

S5


3

b) There are storage dependencies between instruction pairs (S2, S5) and (S4, S5). There is a resource dependence between S1 and S2 on the load unit, and another between S4 and S5 on the store unit. QUESTION 3 (8 Marks) Compare the PRAM models with physical models of real parallel computers in each of the following categories:

a) Which PRAM variant can best model SIMD machines and how? b) Repeat the question in (a) for shared-memory MIMD machines.

Answer:

a) Since the processing elements of a SIMD machine read and write data from different memory modules synchronously, no access conflicts should arise. Thus, any PRAM variant can be used to model SIMD machines.

b) The processors in a MIMD machine can read the same memory location

simultaneously. However, writing to the same memory location is prohibited. Thus, the CREW-PRAM can best model an MIMD machine.

QUESTION 4 (7 Marks)

Vectorizing compilers generally detect loops that can be executed on a pipelined vector computer. Are the vectorization algorithms used by vectorizing compilers suitable for MIMD machine parallelization? Justify your answer. Answer: Vectorization is a subset of MIMD parallel processing so the parallelization algorithms that detect vectorization all detect code that can be multiprocessed. However, there are several complicating factors: the synchronization is implicit in vectorization whereas it must normally be explicit with multiprocessing and it is frequently time consuming. Thus, the crossover vector length where the compiler will switch from scalar to vector processing will normally be greater on the MIMD machine. There are some vector operations that are more difficult on a MIMD machine, so some of the code generation techniques may be different. QUESTION 5 (16 Marks) Consider the following pipelined processor with four stages. This pipeline has a total evaluation time of six clock cycles. All successor stages must be used after each clock cycle.


4

a) Specify the reservation table for this pipeline with six columns and four rows? b) List the set of forbidden latencies between task initiations. c) Draw the state diagram which shows all possible latency cycles. d) List all greedy cycles from the state diagram. e) Determine the minimal average latency (MAL). Answer: a) Reservation table: b) Forbidden latency: 4. Collision vector; (1 0 0 0). c) State transition diagram shown below. d) Simple cycles: (1,5), (1,1,5), (1,1,1,5), (1,2,5), (1,2,3,5), (1,2,3,2,5), (1,2,3,2,1,5), (2,5), (2,1,5), (2,1,2,5), (2,1,2,3,5), (2,3,5), (3,5), (3,2,5), (3,2,1,5), (3,2,1,2,5), (5), (3,2,1,2), and (3). e) Greedy cycles: (1,1,1,5) and (1,2,3,2). f) MAL = (1+1+1+5)/4 = 2.

S1 S2 S3 S4

Input

Output

1 2 3 4 5 6 S1 X X S2 X X S3 X S4 X


5

QUESTION 6 (8 Marks) A uniprocessor computer can operate in either scalar or vector mode. In vector mode, computations can be performed nine times faster than in scalar mode. A certain benchmark program took time T to run on this computer. Further, it was found that 25% of T was attributed to the vector mode. In the remaining time, the machine operated in the scalar mode. a) Calculate the effective speedup under the above conditions as compared with the

condition when the vector mode is not used at all. Also calculate α, the percentage of code that has been vectorized in the above program.

b) Suppose we double the speed ratio between the vector mode and the scalar mode

by hardware improvements. Calculate the effective speedup that can be achieved. c) Suppose the same speedup obtained in part (b) must be obtained by compiler

improvements instead of hardware improvements. What would be the new vectorization ratio α that should be supported by the vectorizing compiler for the same benchmark program?

Answer: a) If the vector mode is not used at all, the execution time will be: 0.75T + 9 × 0.25T = 3T Therefore, effective speedup = 3T/T =3. Let the fraction of vectorized code be α.

Then α = 9 × 0.25T/3T = 0.75. b) Suppose that the speed ratio between the vector mode and the scalar mode is

doubled. The execution time becomes: 0.75T + 0.25T/2 = 0.875T The effective speedup is 3T/0.875T = 24/7 = 3.43. c) Suppose the speed for vector mode computation is still nine times as fast as that

for scalar mode. To maintain the effective speedup of 3.43, the vectorization ratio α must satisfy the following relation:

Solving the equation, we obtain α = ≈51 64 08/ . .

1

9

1

1

24

7α α+ − =


6

QUESTION 7 (15 Marks) On a Fujitsu VP2000, the vector processing unit is equipped with two load/store pipelines plus five functional pipelines as shown below. Consider the execution of the following compound vector function (CVF): � � $�,� %�,�×�&�,��'�,�×�(�,��)�,�×�*�,��

for I=1,2,.....,N. Initially, all vector operands are in memory, and the final vector result must be store in memory. Show the space-time diagram, for pipelined execution of the CVF. Note that, two vector loads can be carried out simultaneously on the two vector-access pipes. At the end of computation, one of the two access pipes is used for storing the A array. Answer:

Load B Load D Load F Store A

Load C Load E Load G

Mul B,C Mul D,E Mul F,G

BC+DE(BC+DE)+ FG

Time

Access 1

Access 2

Multiply

Add

�

6\VWHP�6WRUDJH�8QLW��66��

0DLQ6WRUDJH8QLW

&KDQQHO�

3URFHVV

9HFWRU�

5HJLVWHU�

0DVN�5HJLVWHU

/RDG�6WRUH�3LSHOLQH�

/RDG�6WRUH�3LSHOLQH�

0DVN�3LSHOLQH�

0DVN 3LSHOLQH

$/8�3LSHOLQH�

$/8�3LSHOLQH�

'LYLGH 3LSHOLQH

%XIIHU�6WRUDJH�6FDODU�

([HFXWLRQ�8QLW�

6FDODU�8QLWV�

9HFWRU 8QLW


7

QUESTION 8 (10 Marks) Given the following assembly language code. Exploit the maximum degree of parallelism among the 16 instructions, assuming no resource conflicts and multiple functional units are available simultaneously. For simplicity, no pipelining is assumed. All instructions take one machine cycle to execute. Ignore all other overhead.

�� /RDG�� 5��$�� 5�←0HP�$��

�� /RDG�� 5��%�� 5�←0HP�%��

�� 0XO�� 5��5��5�� 5�←�5��×�5�� /RDG�� 5��'�� 5�←0HP�'��

�� 0XO� � 5��5��5�� 5�←�5��×�5�� $GG� � 5��5��5�� 5�←�5��5��

�� 6WRUH� ;�5�� 0HP�;�←�5��

�� /RDG�� 5��&�� 5�←0HP�&��

�� 0XO� � 5��5��5�� 5�←�5��×�5�� /RDG�� 5��(�� 5�←0HP�(��

�� $GG� � 5��5��5�� 5��←�5��5��

�� 6WRUH� <�5�� 0HP�<�←�5��

�� $GG� � 5��5��5�� 5��←�5��5��

�� 6WRUH� 8�5�� 0HP�8�←�5��

�� 6XE� � 5��5��5�� 5��←�5��5��

�� 6WRUH� 9�5�� 0HP�9�←�5��

a) Draw a program graph with 16 nodes to show the flow relationships among the 16

instructions. b) Consider the use of a three-issue superscalar processor to execute this program

fragment in minimum time. The processor can issue one memory-access instruction (Load or Store but not both), one Add/Sub instruction, and one Mul (multiply) instruction per cycle. The Add unit, Load/Store unit, and Multiply unit can be used simultaneously if there is no data dependence.

Answer: (a)

10 8

11

12 13 15

16 14

R9 R7 R4 R1 R2

R5 R3 R8

R10 R6

Y R11 R12

X


8

(b) QUESTION 9 (8 Marks)

Explain the applicability and the restrictions involved in using Amdahl’s law and Gustafson’s law to estimate the speedup performance of an n-processor system compared with that of a single-processor system. Ignore all communication overheads. (please answer in one paragraph). Answer: Amdahl’s law is based on a fixed workload, where the problem size is fixed regardless of the machine size. Gustafson’s law is based on a scaled workload, where the problem size is increased with the machine size so that the solution time is the same for sequential and parallel executions. QUESTION 10 (5 Marks) Why are hypercube networks, which were very popular in first-generation multicomputers, being replaced by 2D or 3D meshes or tori in the second and third generations of multicomputers? Answer: In VLSI systems, networks with many dimensions require more and longer wires than low-dimensional networks. Thus, high-dimensional networks cost more and run more slowly than low-dimensional networks. Under the assumption of constant wire bisection, low-dimensional networks have wide channels, and high-dimensional networks have narrow channels. With the wormhole routing method, which is used by most of the second- and third-generation multicomputers, the wider channels provide a lower latency, less contention, and higher hot-spot throughput.

1

2

3

4

5

6

7

8

1 2

4

8

10

7 12

14

9 16

Mem. Acc. Add Mul

3

5

9 6

11

13

15

first semester examinations june 2000 advanced

Documents