first semester examinations june 2000 advanced
TRANSCRIPT
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
1
The University of Western Australia
Department of Electrical and Electronic Engineering
FIRST SEMESTER EXAMINATIONS JUNE 2000 ADVANCED COMPUTER ARCHITECTURE (623.300) This paper contains: 10 questions; XX pages. Candidates should attempt ALL questions QUESTION 1 (10 Marks) Compare buses, crossbar switches, and multistage networks for building a multiprocessor system. Assume a word length of w bits and 2 × 2 switches are used in building the multiprocessor system with n processors and m shared-memory modules. Assume a word length of w bits and that 2 × 2 switches are used in building the multistage networks. To facilitate the comparison process, complete the following table.
Network Characteristics
Bus System Multistage Network Crossbar Switch
Minimum latency for unit data transfer
Bandwidth per processor
Wiring complexity Switching complexity Connectivity and routing capability
Remarks Assume n processors on the bus; bus width is w bits
n × n MIN using k × k switches with line width of w bits
Assume n × n crossbar with line width of w bits
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
2
Answer:
QUESTION 2 (15 Marks) Analyse the data dependencies among the following statements:
S1: /RDG�5������� / R1 ← 1024 / S2: /RDG�5���0���� / R2 ← Memory(10) / S3: $GG�5���5� / R1 ← (R1) + (R2) / S4: 6WRUH�0��������5� / Memory(1024) ← (R1) / S5: 6WRUH�0��5��������� / Memory(64) ← 1024 /
Note that (Ri) means that the content of register Ri and Memory(10) contains 64 initially.
a) Draw a dependence graph to show all the dependencies.
b) Are there any resource dependencies if only one copy of each functional unit is available in the CPU?
Answer: a)
Network Characteristics
Bus System Multistage Network Crossbar Switch
Minimum latency for unit data transfer
Constant O(logk n) Constant
Bandwidth per processor
O(w/n) to O(w) O(w) to O(nw) O(w) to O(nw)
Wiring complexity O(w) O(nwlogkn) O(n2w) Switching complexity O(n) O(nlogkn) O(n2) Connectivity and routing capability
Only one-to-one at a time
Some permutations and broadcast, if network unblocked
All permutations, one at a time
Remarks Assume n processors on the bus; bus width is w bits
n × n MIN using k × k switches with line width of w bits
Assume n × n crossbar with line width of w bits
S1
S4
S2
S3
S5
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
3
b) There are storage dependencies between instruction pairs (S2, S5) and (S4, S5). There is a resource dependence between S1 and S2 on the load unit, and another between S4 and S5 on the store unit. QUESTION 3 (8 Marks) Compare the PRAM models with physical models of real parallel computers in each of the following categories:
a) Which PRAM variant can best model SIMD machines and how? b) Repeat the question in (a) for shared-memory MIMD machines.
Answer:
a) Since the processing elements of a SIMD machine read and write data from different memory modules synchronously, no access conflicts should arise. Thus, any PRAM variant can be used to model SIMD machines.
b) The processors in a MIMD machine can read the same memory location
simultaneously. However, writing to the same memory location is prohibited. Thus, the CREW-PRAM can best model an MIMD machine.
QUESTION 4 (7 Marks)
Vectorizing compilers generally detect loops that can be executed on a pipelined vector computer. Are the vectorization algorithms used by vectorizing compilers suitable for MIMD machine parallelization? Justify your answer. Answer: Vectorization is a subset of MIMD parallel processing so the parallelization algorithms that detect vectorization all detect code that can be multiprocessed. However, there are several complicating factors: the synchronization is implicit in vectorization whereas it must normally be explicit with multiprocessing and it is frequently time consuming. Thus, the crossover vector length where the compiler will switch from scalar to vector processing will normally be greater on the MIMD machine. There are some vector operations that are more difficult on a MIMD machine, so some of the code generation techniques may be different. QUESTION 5 (16 Marks) Consider the following pipelined processor with four stages. This pipeline has a total evaluation time of six clock cycles. All successor stages must be used after each clock cycle.
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
4
a) Specify the reservation table for this pipeline with six columns and four rows? b) List the set of forbidden latencies between task initiations. c) Draw the state diagram which shows all possible latency cycles. d) List all greedy cycles from the state diagram. e) Determine the minimal average latency (MAL). Answer: a) Reservation table: b) Forbidden latency: 4. Collision vector; (1 0 0 0). c) State transition diagram shown below. d) Simple cycles: (1,5), (1,1,5), (1,1,1,5), (1,2,5), (1,2,3,5), (1,2,3,2,5), (1,2,3,2,1,5), (2,5), (2,1,5), (2,1,2,5), (2,1,2,3,5), (2,3,5), (3,5), (3,2,5), (3,2,1,5), (3,2,1,2,5), (5), (3,2,1,2), and (3). e) Greedy cycles: (1,1,1,5) and (1,2,3,2). f) MAL = (1+1+1+5)/4 = 2.
S1 S2 S3 S4
Input
Output
1 2 3 4 5 6 S1 X X S2 X X S3 X S4 X
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
5
QUESTION 6 (8 Marks) A uniprocessor computer can operate in either scalar or vector mode. In vector mode, computations can be performed nine times faster than in scalar mode. A certain benchmark program took time T to run on this computer. Further, it was found that 25% of T was attributed to the vector mode. In the remaining time, the machine operated in the scalar mode. a) Calculate the effective speedup under the above conditions as compared with the
condition when the vector mode is not used at all. Also calculate α, the percentage of code that has been vectorized in the above program.
b) Suppose we double the speed ratio between the vector mode and the scalar mode
by hardware improvements. Calculate the effective speedup that can be achieved. c) Suppose the same speedup obtained in part (b) must be obtained by compiler
improvements instead of hardware improvements. What would be the new vectorization ratio α that should be supported by the vectorizing compiler for the same benchmark program?
Answer: a) If the vector mode is not used at all, the execution time will be: 0.75T + 9 × 0.25T = 3T Therefore, effective speedup = 3T/T =3. Let the fraction of vectorized code be α.
Then α = 9 × 0.25T/3T = 0.75. b) Suppose that the speed ratio between the vector mode and the scalar mode is
doubled. The execution time becomes: 0.75T + 0.25T/2 = 0.875T The effective speedup is 3T/0.875T = 24/7 = 3.43. c) Suppose the speed for vector mode computation is still nine times as fast as that
for scalar mode. To maintain the effective speedup of 3.43, the vectorization ratio α must satisfy the following relation:
Solving the equation, we obtain α = ≈51 64 08/ . .
1
9
1
1
24
7α α+ − =
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
6
QUESTION 7 (15 Marks) On a Fujitsu VP2000, the vector processing unit is equipped with two load/store pipelines plus five functional pipelines as shown below. Consider the execution of the following compound vector function (CVF): � � $�,� %�,��&�,���'�,��(�,���)�,��*�,���
for I=1,2,.....,N. Initially, all vector operands are in memory, and the final vector result must be store in memory. Show the space-time diagram, for pipelined execution of the CVF. Note that, two vector loads can be carried out simultaneously on the two vector-access pipes. At the end of computation, one of the two access pipes is used for storing the A array. Answer:
Load B Load D Load F Store A
Load C Load E Load G
Mul B,C Mul D,E Mul F,G
BC+DE(BC+DE)+ FG
Time
Access 1
Access 2
Multiply
Add
�
6\VWHP�6WRUDJH�8QLW��66��
0DLQ6WRUDJH8QLW
&KDQQHO�
3URFHVV
9HFWRU�
5HJLVWHU�
0DVN�5HJLVWHU
/RDG�6WRUH�3LSHOLQH�
/RDG�6WRUH�3LSHOLQH�
0DVN�3LSHOLQH�
0DVN 3LSHOLQH
$/8�3LSHOLQH�
$/8�3LSHOLQH�
'LYLGH 3LSHOLQH
%XIIHU�6WRUDJH�6FDODU�
([HFXWLRQ�8QLW�
6FDODU�8QLWV�
9HFWRU 8QLW
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
7
QUESTION 8 (10 Marks) Given the following assembly language code. Exploit the maximum degree of parallelism among the 16 instructions, assuming no resource conflicts and multiple functional units are available simultaneously. For simplicity, no pipelining is assumed. All instructions take one machine cycle to execute. Ignore all other overhead.
��� /RDG�� 5��$�� � �5�←0HP�$���
��� /RDG�� 5��%�� � �5�←0HP�%���
��� 0XO��� 5��5��5�� � �5�←�5��×�5������� /RDG�� 5��'�� � �5�←0HP�'���
��� 0XO� � 5��5��5�� � �5�←�5��×�5������� $GG� � 5��5��5�� � �5�←�5����5����
��� 6WRUH� ;�5��� � �0HP�;�←�5����
��� /RDG�� 5��&�� � �5�←0HP�&���
��� 0XO� � 5��5��5�� � �5�←�5��×�5�������� /RDG�� 5��(�� � �5�←0HP�(���
���� $GG� � 5���5��5��� �5��←�5����5����
���� 6WRUH� <�5��� � �0HP�<�←�5�����
���� $GG� � 5���5��5��� �5��←�5����5�����
���� 6WRUH� 8�5��� � �0HP�8�←�5�����
���� 6XE� � 5���5��5��� �5��←�5����5�����
���� 6WRUH� 9�5��� � �0HP�9�←�5�����
a) Draw a program graph with 16 nodes to show the flow relationships among the 16
instructions. b) Consider the use of a three-issue superscalar processor to execute this program
fragment in minimum time. The processor can issue one memory-access instruction (Load or Store but not both), one Add/Sub instruction, and one Mul (multiply) instruction per cycle. The Add unit, Load/Store unit, and Multiply unit can be used simultaneously if there is no data dependence.
Answer: (a)
10 8
11
12 13 15
16 14
R9 R7 R4 R1 R2
R5 R3 R8
R10 R6
Y R11 R12
X
FIRST SEMESTER EXAMINATIONS June 2000 ADVANCED COMPUTER ARCHITECTURE (623.305)
8
(b) QUESTION 9 (8 Marks)
Explain the applicability and the restrictions involved in using Amdahl’s law and Gustafson’s law to estimate the speedup performance of an n-processor system compared with that of a single-processor system. Ignore all communication overheads. (please answer in one paragraph). Answer: Amdahl’s law is based on a fixed workload, where the problem size is fixed regardless of the machine size. Gustafson’s law is based on a scaled workload, where the problem size is increased with the machine size so that the solution time is the same for sequential and parallel executions. QUESTION 10 (5 Marks) Why are hypercube networks, which were very popular in first-generation multicomputers, being replaced by 2D or 3D meshes or tori in the second and third generations of multicomputers? Answer: In VLSI systems, networks with many dimensions require more and longer wires than low-dimensional networks. Thus, high-dimensional networks cost more and run more slowly than low-dimensional networks. Under the assumption of constant wire bisection, low-dimensional networks have wide channels, and high-dimensional networks have narrow channels. With the wormhole routing method, which is used by most of the second- and third-generation multicomputers, the wider channels provide a lower latency, less contention, and higher hot-spot throughput.
1
2
3
4
5
6
7
8
1 2
4
8
10
7 12
14
9 16
Mem. Acc. Add Mul
3
5
9 6
11
13
15