8/7/2019 CT2 CA
http://slidepdf.com/reader/full/ct2-ca 1/12
Computer Architecture CT 2 Paper Solution
Q1. ULTRA SPARC IV Plus is a 4 way superscalar 14 stage pipeline. Find speedup. Obtain the
relation used. Comment on the specific requirement for cache. [1+2+1]
A1. Calculating the speedup - [1 mark] Given m = 4, k = 14:
S (m, 1) = S (4, 1) = T (1, 1) / T (4, 1) = (4*(N+14-1)) / (N+4*(14-1)) = (4*(N+13)) / (N+52)
As N → ∞, S (4, 1) → 4.
Obtaining the relation used to calculate the speedup - [2 marks] For a k-stage linear pipeline,
No. of stages in the pipeline = k
No. of tasks to be processed = N
No. of cycles used to fill up the pipeline = k
(this is also the time required to complete execution of the first task), therefore,
No. of remaining tasks = N-1, and
No. of cycles needed to complete the N-1 remaining tasks = N-1
Therefore, the time required by the scalar base machine,
T (1, 1) = Tk = k + (N-1) clock periods
Similarly, the time required by an m-issue superscalar machine,
T (m, 1) = k + (N - m)/m
= N/m + (k - 1)
where,
k = time required to execute the first 'm' instructions through the m pipelines simultaneously, and
(N - m)/m = time required to execute the remaining (N - m) instructions, 'm' per cycle, through the 'm' pipelines.
Specific requirement for cache - [1 mark]
• L1 Cache (per pipeline): 64 KB 4-way data; 32 KB 4-way instruction; 2 KB write, 2 KB prefetch
• L2 Cache: 16 MB external (exclusive access to 8 MB per pipeline)
Speedup of the superscalar machine over the base machine,
S (m, 1) = T (1, 1) / T (m, 1) = (k + N - 1) / (N/m + k - 1) = m(N + k - 1) / (N + m(k - 1))
As N → ∞, S (m, 1) → m.
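The relation above can be checked numerically; a minimal sketch in plain Python, using only the formulas derived above:

```python
# Numeric check of the speedup relation derived above:
# T(1,1) = k + N - 1, and T(m,1) = N/m + k - 1.
def speedup(m: int, k: int, n: int) -> float:
    t_base = k + n - 1          # scalar base machine
    t_super = n / m + k - 1     # m-issue superscalar machine
    return t_base / t_super

# UltraSPARC IV+ figures from the question: m = 4, k = 14
for n in (100, 10_000, 1_000_000):
    print(n, round(speedup(4, 14, n), 3))
```

As N grows the ratio approaches m = 4, matching the limit S (4, 1) → 4.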
Q2. A 64 KB cache uses 12 bit index. Number of index bits is proposed to be increased to 16
bits. Suggest the best option out of the two with justification. Support your answer with for and
against arguments. [4]
A2.
Calculating the block size - [0.25 marks] Given, size of cache = 64 KB, index bits = 12.
Size of index ≡ f (cache size, block size, set associativity):
2^index = cache size / (block size × set associativity)
2^12 = 64 KB / (block size × 1), (set associativity = 1 for direct mapping)
Block size = 16 Bytes
Similarly, for 16 index bits,
2^16 = 64 KB / (block size × 1)
Block size = 1 Byte
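The arithmetic above can be sketched in a few lines of Python (a check of the relation, nothing more):

```python
# 2**index_bits = cache_size / (block_size * set_associativity)
# => block_size = cache_size / (2**index_bits * set_associativity)
def block_size(cache_bytes: int, index_bits: int, assoc: int = 1) -> int:
    return cache_bytes // (2 ** index_bits * assoc)

print(block_size(64 * 1024, 12))  # 16-byte blocks with a 12-bit index
print(block_size(64 * 1024, 16))  # 1-byte blocks with a 16-bit index
```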
In favor of 12-bit index - [3 marks for any 4 points]
1. Comparing the above results shows that increasing the number of index bits decreases the block size. So the better option is the 12-bit index.
2. When the size of the block is decreased, the miss rate increases; because of spatial locality, as the block size increases, the hit ratio also increases.
3. Block size is an important criterion to be monitored when it comes to the tradeoff with cost, comparator size and various other issues. For the instruction cache, the miss rate falls steadily as block size increases. For the data cache this proportionality rule doesn't hold as strongly, and the miss rate is lower in the data cache than in the instruction cache.
4. Tag comparison with a 12-bit index will be cheaper than with a 16-bit index.
5. Cost of implementation with a 12-bit index will be less.
6. It will be less complex to implement.
7. Searching overhead for a word will be less.
8. Better for sequential access of memory.
Against 12-bit index - [0.75 marks]
9. As the graph depicts, when the size of the block increases significantly, the decrease in miss rate is less than the increase in miss penalty, since the whole block must be transferred to the cache, increasing the delay before the required data is available. So decreasing the index bits further can create a problem.
10. Special case problem:
Say we want to access words 0, 4, 8, 12, where 1 block = 4 words.
Each of these accesses falls in a different block, so all four accesses miss; and since the block is big, the miss penalty is also larger.
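The special case can be checked with a toy simulation (the helper name is mine; it only counts first-touch misses for the stated access pattern, ignoring capacity and conflict effects):

```python
def count_misses(word_accesses, words_per_block):
    # Toy model: a block is fetched on its first access and then stays cached.
    cached_blocks = set()
    misses = 0
    for w in word_accesses:
        b = w // words_per_block   # block holding word w
        if b not in cached_blocks:
            misses += 1
            cached_blocks.add(b)
    return misses

print(count_misses([0, 4, 8, 12], 4))  # each access touches a new block: 4 misses
print(count_misses([0, 1, 2, 3], 4))   # all in one block: 1 miss
```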
Q3. VLIW is not a main stream computer. Give two reasons in support of this statement. What
are the salient features of this machine? [1+2]
A3. Reasons why "VLIW is not a mainstream computer" - [1 mark for any 2 points]
• Lack of compatibility with conventional hardware and software.
• Dependence on trace-scheduling compilation and code compaction.
• VLIW instructions are longer than those of mainstream computers, since each specifies several independent operations.
• VLIW uses random parallelism among scalar operations instead of regular synchronous parallelism.
Salient features of a VLIW machine - [2 marks for any 4 points & diagram]
• Instruction word length: hundreds of bits.
• Use of (i) multiple functional units concurrently, and (ii) a common large register file shared by all functional units.
• Run-time scheduling and synchronization are eliminated, because instruction parallelism and data movement in a VLIW architecture are completely specified at compile time.
• A VLIW processor is an extreme of a superscalar processor in which all independent or unrelated operations are already synchronously compacted.
• The instruction parallelism embedded in the compacted code may require different latencies to be executed by different FUs, even though the instructions are issued at the same time.
• It uses random parallelism among scalar operations instead of regular synchronous parallelism (as used in a vectorized supercomputer or an SIMD computer).
[Figure: VLIW processor and instruction format]
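As a rough illustration of the instruction format, one wide word carries one independent operation per functional-unit slot. The slot names below are illustrative assumptions, not an actual VLIW encoding:

```python
from dataclasses import dataclass

@dataclass
class VLIWWord:
    # One independent operation per functional-unit slot (hypothetical slots).
    alu_op: str      # integer ALU slot
    fpu_op: str      # floating-point slot
    load_op: str     # memory-load slot
    branch_op: str   # branch slot

# All four operations are compacted at compile time and issue together.
word = VLIWWord("add r1,r2,r3", "fmul f0,f1,f2", "ld r4,0(r5)", "nop")
print(word)
```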
Q4. State various conditions to be satisfied for code compaction. Why is the trace scheduling
done in VLIW? [2]
A4. Conditions for code compaction - [1 mark] Code compaction is an attempt to tune the code automatically in order to make it occupy less space while maintaining all of its original functionality. Thus, the conditions are -
• It must reduce the space used.
• It must retain the original functionality (i.e. the dataflow of the program must not
change and the exception behavior must be preserved).
Trace scheduling in VLIW - [1 mark] Many compilers for first-generation ILP processors used a three-phase method to generate code. The phases were,
• Generate a sequential program; analyze each basic block in the sequential program for independent operations.
• Schedule independent operations within the same block in parallel if sufficient hardware resources are available.
• Move operations between blocks when possible.
This three phase approach fails to exploit much of the ILP available in the program for two
reasons,
• Oftentimes, operations in a basic block are dependent on each other; therefore sufficient ILP may not be available within a basic block.
• Arbitrary choices made while scheduling basic blocks make it difficult to move
operations between blocks.
Trace scheduling is a form of global code motion. In trace scheduling, a set of commonly
executed sequence of blocks is gathered together into a trace and the whole trace is scheduled
together. It works by converting a loop to long straight-line code sequence using loop unrolling
and static branch prediction. This process separates out "unlikely" code and adds handlers for
exits from trace. The goal is to have the most common case executed as a sequential set of
instructions without branches.
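The loop-unrolling part of that process can be illustrated with a toy example. Python here stands in for the compiler's output; the point is only that the unrolled version executes one loop-branch test per four elements instead of one per element:

```python
def sum_rolled(a):
    s = 0
    for x in a:            # one loop-branch test per element
        s += x
    return s

def sum_unrolled4(a):
    # Assumes len(a) is a multiple of 4, as unrolling typically
    # arranges via a cleanup loop (omitted here).
    s = 0
    for i in range(0, len(a), 4):   # one loop-branch test per 4 elements
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
    return s

data = list(range(16))
print(sum_rolled(data), sum_unrolled4(data))  # both print 120
```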
Q5. Explain set associative memory mapping. [3]
A5. Explanation - [0.5 marks] It is a compromise between the two extreme cache designs based on direct mapping and full associativity mapping. In a k-way associative search, the tag needs to be compared only with the k tags within the identified set. Since k is small in practice, the k-way associative search is much more economical than the full associativity.
In general, a block Bj can be mapped into any one of the available frames in a set Si, Bj Є Si, if j (modulo v) = i. A matched tag identifies the current block which resides in the frame.
Design trade-offs - [0.25 marks]
• The set size (associativity) k and the number of sets v are inversely related; m = v*k.
• For a fixed cache size there exists a tradeoff between the set size and the number of sets.
• Set-associative cache is used in most of the high performance computer systems.
Advantages - [0.25 marks]
1. The block replacement algorithm needs to consider only a few blocks in the same set. The replacement policy can be more economically implemented with limited choices, as compared with the fully associative cache.
2. The k-way associative search is easier to implement.
3. Many design tradeoffs can be considered (m = v*k) to yield a higher hit ratio in the cache.
Figure with explanation - [1 mark]
Example with figure - [1 mark]
Ex. 2-way mapping (set-associative search): the main memory has n = 16 blocks (B0-B15) and the cache holds m = 8 blocks, organized as v = 4 sets of two blocks each; the set field of the address is 2 bits.
Given n = 16, m = 8, v = 4 sets, w = 2; s, d = ?
2^d = v, so d = 2, and s = 4.
By j (modulo v) = i, blocks B0, B4, B8, B12 compete for set S0; B1, B5, B9, B13 for S1; and so on.
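The example's parameters can be sketched as follows (a direct transcription of the figure's numbers; the helper name is mine):

```python
n, m, v = 16, 8, 4        # main-memory blocks, cache frames, sets
k = m // v                # 2-way set associativity
d = (v - 1).bit_length()  # set-field bits: 2**d = v  ->  d = 2

def set_index(j: int) -> int:
    # Block B_j maps into set S_i with i = j mod v.
    return j % v

print(k, d)
print([j for j in range(n) if set_index(j) == 0])  # blocks competing for set 0
```

Any of the v = 4 blocks that share a set index can occupy either of that set's k = 2 frames, which is exactly the compromise between direct mapping and full associativity described above.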
Q6. Give formal statement of the task dispatching problem. [1]
A6. Formal statement - [1 mark] Given,
I. a vector task system V
II. a vector computer, with
III. m identical pipelines, and
IV. a deadline D,
Does there exist a parallel schedule f for V with Finish Time ω such that ω ≤ D?
It is a feasibility problem.
Desirable algorithm: Heuristic scheduling algorithm.
Q7. A sum of the first k components of a vector (k = 0 to n - 1), when the number of processors and the length (n) of the vector are both 8, is obtained in three steps. Will the number of steps increase if both are raised to 16? Justify your answer. [3]
A7.
Formulation - [0.25 marks] We have to find the sum S (k) of the first k components of vector A for each k varying from 0 to 15.
i.e., S (0) = A0;
S (1) = A1 + S (0);
S (2) = A2 + A1 + A0 = A2 + S (1);
or, S (k) = S (k-1) + Ak;
It is a recursive sum.
Relation used - [1 mark] S (k) depends on S (k-1), therefore the sum cannot be obtained directly in parallel. A0 to A15 can be accessed simultaneously from PEM0 to PEM15. The intermediate sums Sk calculated in one step are routed between PEs; at the same time, inactive PEs can be masked.
Therefore for n=16;
Store A0 in PE0, A1 in PE1, …, A15 in PE15.
The operation is,
[R 0 ← A0, R 1 ← A1, …., R 15 ← A15]
To get S (1) = A1 + A0 (= S (0)), route A0 to A1:
<R i+1> ← <R i> for i = 0 to 14.
Similarly, to obtain the other intermediate solutions,
<R i+2> ← <R i> for i = 0 to 13,
<R i+4> ← <R i> for i = 0 to 11,
<R i+8> ← <R i> for i = 0 to 7.
Thus,
<R i+1> ← Ai + Ai+1 (intermediate sum)
R 1 ← A0 + A1 = S (1)
R 2 ← A1 + A2 = S (2) – S (0)
R 3 ← A2 + A3 = S (3) – S (1)
.
R 15 ← A14 + A15 = S (15) – S (13)
In the same way it will be done for other steps.
S (k) = Σ_{i=0}^{k} Ai, for k = 0, 1, 2, ..., 15
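The routing steps above amount to a recursive-doubling prefix sum. A sequential sketch that mimics the SIMD steps and counts them (the function name is mine; each list pass models one route-and-add step across all PEs):

```python
def prefix_sum_steps(a):
    # Each pass models one routing step <R_{i+shift}> <- <R_i>
    # followed by an add; PEs with i < shift are "masked" (unchanged).
    r = list(a)
    n = len(r)
    steps = 0
    shift = 1
    while shift < n:
        r = [r[i] + (r[i - shift] if i >= shift else 0) for i in range(n)]
        shift *= 2
        steps += 1
    return r, steps

_, s8 = prefix_sum_steps(range(8))
sums16, s16 = prefix_sum_steps(range(16))
print(s8, s16)          # 3 4  -> one extra step when n doubles from 8 to 16
print(sums16[15])       # 120 = S(15), the full sum of 0..15
```

After step s, R_i holds the sum of the last 2^s elements ending at i, so log2(n) steps suffice: 3 for n = 8, 4 for n = 16.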
Calculation of number of steps diagrammatically - [1.5 marks]
Step 1 (<R i+1> ← <R i>): PEs not involved in routing: PE15. PEs inactive in addition: PE0. Mask-off: PE0 & PE15.
Step 2 (<R i+2> ← <R i>): PEs not involved in routing: PE14 & PE15. PEs inactive in addition: PE0 & PE1. Mask-off: PE0, PE1, PE14, PE15.
Step 3 (<R i+4> ← <R i>): PEs not involved in routing: PE12 to PE15. PEs inactive in addition: PE0 to PE3. Mask-off: PE0 to PE3, PE12 to PE15.
Step 4 (<R i+8> ← <R i>): PEs not involved in routing: PE8 to PE15. PEs inactive in addition: PE0 to PE7. Mask-off: PE0 to PE15.
So we can see the number of steps required is 4.
Calculation of number of steps using the formula - [0.25 marks]
The recursive-doubling sum over n elements needs r = log2(n) routing-and-add steps. For n = 16, r = log2(16) = 4, which is also the result we obtained diagrammatically. (For n = 8, r = 3, matching the three steps stated in the question.) So yes, raising both the number of processors and n to 16 increases the number of steps from 3 to 4.