8/7/2019 CT2 CA
http://slidepdf.com/reader/full/ct2-ca 1/12
Computer Architecture CT 2 Paper Solution
Q1. ULTRA SPARC IV Plus is a 4 way superscalar 14 stage pipeline. Find speedup. Obtain the
relation used. Comment on the specific requirement for cache. [1+2+1]
A1. Calculating the speedup - [1 mark] Given m = 4, k = 14:
S (m, 1) = S (4, 1) = T (1, 1) / T (4, 1) = (4*(N+14-1)) / (N+4*(14-1)) = (4*(N+13)) / (N+52)
As N → ∞, S (4, 1) → 4.
Obtaining the relation used to calculate the speedup - [2 marks] For a k-stage linear pipeline,
No. of stages in the pipeline = k
No. of tasks to be processed = N
No. of cycles used to fill up the pipeline = k
(this is also the time required to complete execution of the first task), therefore,
No. of remaining tasks = N-1, and
No. of cycles needed to complete the N-1 remaining tasks = N-1
Therefore, the time required by the scalar base machine,
T (1, 1) = Tk = k + (N-1) clock periods
Similarly, the time required by an m-issue superscalar machine,
T (m, 1) = k + (N - m)/m
= N/m + (k - 1)
where,
k = time required to execute the first 'm' instructions through the m pipelines simultaneously, and
(N - m)/m = time required to execute the remaining (N - m) instructions, 'm' per cycle, through the 'm' pipelines.
Specific requirement for cache - [1 mark]
• L1 Cache (per pipeline): 64 KB 4-way data; 32 KB 4-way instruction; 2 KB write, 2 KB prefetch
• L2 Cache: 16 MB external (exclusive access to 8 MB per pipeline)
Speedup of the superscalar machine over the base machine,
S (m, 1) = T (1, 1) / T (m, 1) = (k + N - 1) / (N/m + k - 1) = m(N + k - 1) / (N + m(k - 1))
As N → ∞, S (m, 1) → m.
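The relation above can be checked numerically; a minimal sketch in plain Python, using only the formulas derived above:

```python
# Numeric check of the speedup relation derived above:
# T(1,1) = k + N - 1, and T(m,1) = N/m + k - 1.
def speedup(m: int, k: int, n: int) -> float:
    t_base = k + n - 1          # scalar base machine
    t_super = n / m + k - 1     # m-issue superscalar machine
    return t_base / t_super

# UltraSPARC IV+ figures from the question: m = 4, k = 14
for n in (100, 10_000, 1_000_000):
    print(n, round(speedup(4, 14, n), 3))
```

As N grows the ratio approaches m = 4, matching the limit S (4, 1) → 4.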
Q2. A 64 KB cache uses 12 bit index. Number of index bits is proposed to be increased to 16
bits. Suggest the best option out of the two with justification. Support your answer with for and
against arguments. [4]
A2.
Calculating the block size - [0.25 marks] Given, size of cache = 64 KB, index bits = 12.
Size of index ≡ f (cache size, block size, set associativity):
2^index = cache size / (block size × set associativity)
2^12 = 64 KB / (block size × 1), (set associativity = 1 for direct mapping)
Block size = 16 Bytes
Similarly, for 16 index bits,
2^16 = 64 KB / (block size × 1)
Block size = 1 Byte
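The arithmetic above can be sketched in a few lines of Python (a check of the relation, nothing more):

```python
# 2**index_bits = cache_size / (block_size * set_associativity)
# => block_size = cache_size / (2**index_bits * set_associativity)
def block_size(cache_bytes: int, index_bits: int, assoc: int = 1) -> int:
    return cache_bytes // (2 ** index_bits * assoc)

print(block_size(64 * 1024, 12))  # 16-byte blocks with a 12-bit index
print(block_size(64 * 1024, 16))  # 1-byte blocks with a 16-bit index
```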
In favor of 12-bit index - [3 marks for any 4 points]
1. Comparing the above results shows that increasing the number of index bits decreases the block size. So the better option is the 12-bit index.
2. When the size of the block is decreased, the miss rate increases; because of spatial locality, as the block size increases, the hit ratio also increases.
3. Block size is an important criterion to be monitored when it comes to the tradeoff with cost, comparator size and various other issues. For the instruction cache, the miss rate falls steadily as block size increases. For the data cache this proportionality rule doesn't hold as strongly, and the miss rate is lower in the data cache than in the instruction cache.
4. Tag comparison with a 12-bit index will be cheaper than with a 16-bit index.
5. Cost of implementation with a 12-bit index will be less.
6. It will be less complex to implement.
7. Searching overhead for a word will be less.
8. Better for sequential access of memory.
Against 12-bit index - [0.75 marks]
9. As the graph depicts, when the size of the block increases significantly, the decrease in miss rate is less than the increase in miss penalty, since the whole block must be transferred to the cache, increasing the delay before the required data is available. So decreasing the index bits further can create a problem.
10. Special case problem:
Say we want to access words 0, 4, 8, 12, where 1 block = 4 words.
Each of these accesses falls in a different block, so all four accesses miss; and since the block is big, the miss penalty is also larger.
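The special case can be checked with a toy simulation (the helper name is mine; it only counts first-touch misses for the stated access pattern, ignoring capacity and conflict effects):

```python
def count_misses(word_accesses, words_per_block):
    # Toy model: a block is fetched on its first access and then stays cached.
    cached_blocks = set()
    misses = 0
    for w in word_accesses:
        b = w // words_per_block   # block holding word w
        if b not in cached_blocks:
            misses += 1
            cached_blocks.add(b)
    return misses

print(count_misses([0, 4, 8, 12], 4))  # each access touches a new block: 4 misses
print(count_misses([0, 1, 2, 3], 4))   # all in one block: 1 miss
```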
Q3. VLIW is not a main stream computer. Give two reasons in support of this statement. What
are the salient features of this machine? [1+2]
A3. Reasons why "VLIW is not a mainstream computer" - [1 mark for any 2 points]
• Lack of compatibility with conventional hardware and software.
• Dependence on trace-scheduling compilation and code compaction.
• VLIW instructions are longer than those of mainstream computers, since each specifies several independent operations.
• VLIW uses random parallelism among scalar operations instead of regular synchronous parallelism.
Salient features of a VLIW machine - [2 marks for any 4 points & diagram]
• Instruction word length: hundreds of bits.
• Use of (i) multiple functional units concurrently, and (ii) a common large register file shared by all functional units.
• Run-time scheduling and synchronization are eliminated, because instruction parallelism and data movement in a VLIW architecture are completely specified at compile time.
• A VLIW processor is an extreme of a superscalar processor in which all independent or unrelated operations are already synchronously compacted.
• The instruction parallelism embedded in the compacted code may require different latencies to be executed by different FUs, even though the instructions are issued at the same time.
• It uses random parallelism among scalar operations instead of regular synchronous parallelism (as used in a vectorized supercomputer or an SIMD computer).
[Figure: VLIW processor and instruction format]
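As a rough illustration of the instruction format, one wide word carries one independent operation per functional-unit slot. The slot names below are illustrative assumptions, not an actual VLIW encoding:

```python
from dataclasses import dataclass

@dataclass
class VLIWWord:
    # One independent operation per functional-unit slot (hypothetical slots).
    alu_op: str      # integer ALU slot
    fpu_op: str      # floating-point slot
    load_op: str     # memory-load slot
    branch_op: str   # branch slot

# All four operations are compacted at compile time and issue together.
word = VLIWWord("add r1,r2,r3", "fmul f0,f1,f2", "ld r4,0(r5)", "nop")
print(word)
```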
Q4. State various conditions to be satisfied for code compaction. Why is the trace scheduling
done in VLIW? [2]
A4. Conditions for code compaction - [1 mark] Code compaction is an attempt to tune the code automatically in order to make it occupy less space while maintaining all of its original functionality. Thus, the conditions are -
• It must reduce the space used.
• It must retain the original functionality (i.e. the dataflow of the program must not
change and the exception behavior must be preserved).
Trace scheduling in VLIW - [1 mark] Many compilers for first-generation ILP processors used a three-phase method to generate code. The phases were,
• Generate a sequential program; analyze each basic block in the sequential program for independent operations.
• Schedule independent operations within the same block in parallel if sufficient hardware resources are available.
• Move operations between blocks when possible.
This three phase approach fails to exploit much of the ILP available in the program for two
reasons,
• Oftentimes, operations in a basic block are dependent on each other; therefore sufficient ILP may not be available within a basic block.
• Arbitrary choices made while scheduling basic blocks make it difficult to move
operations between blocks.
Trace scheduling is a form of global code motion. In trace scheduling, a set of commonly
executed sequence of blocks is gathered together into a trace and the whole trace is scheduled
together. It works by converting a loop to long straight-line code sequence using loop unrolling
and static branch prediction. This process separates out "unlikely" code and adds handlers for
exits from trace. The goal is to have the most common case executed as a sequential set of
instructions without branches.
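The loop-unrolling part of that process can be illustrated with a toy example. Python here stands in for the compiler's output; the point is only that the unrolled version executes one loop-branch test per four elements instead of one per element:

```python
def sum_rolled(a):
    s = 0
    for x in a:            # one loop-branch test per element
        s += x
    return s

def sum_unrolled4(a):
    # Assumes len(a) is a multiple of 4, as unrolling typically
    # arranges via a cleanup loop (omitted here).
    s = 0
    for i in range(0, len(a), 4):   # one loop-branch test per 4 elements
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
    return s

data = list(range(16))
print(sum_rolled(data), sum_unrolled4(data))  # both print 120
```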
Q5. Explain set associative memory mapping. [3]
A5. Explanation - [0.5 marks] It is a compromise between the two extreme cache designs based on direct mapping and full associativity mapping. In a k-way associative search, the tag needs to be compared only with the k tags within the identified set. Since k is small in practice, the k-way associative search is much more economical than the full associativity.
In general, a block Bj can be mapped into any one of the available frames in a set Si, Bj Є Si, if j (modulo v) = i. A matched tag identifies the current block which resides in the frame.
Design trade-offs - [0.25 marks]
• The set size (associativity) k and the number of sets v are inversely related; m = v*k.
• For a fixed cache size there exists a tradeoff between the set size and the number of sets.
• Set-associative cache is used in most of the high performance computer systems.
Advantages - [0.25 marks]
1. The block replacement algorithm needs to consider only a few blocks in the same set. The replacement policy can be more economically implemented with limited choices, as compared with the fully associative cache.
2. The k-way associative search is easier to implement.
3. Many design tradeoffs can be considered (m = v*k) to yield a higher hit ratio in the cache.
Figure with explanation - [1 mark]
Example with figure - [1 mark]
Ex. 2-way mapping (set-associative search): the main memory has n = 16 blocks (B0-B15) and the cache holds m = 8 blocks, organized as v = 4 sets of two blocks each; the set field of the address is 2 bits.
Given n = 16, m = 8, v = 4 sets, w = 2; s, d = ?
2^d = v, so d = 2, and s = 4.
By j (modulo v) = i, blocks B0, B4, B8, B12 compete for set S0; B1, B5, B9, B13 for S1; and so on.
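The example's parameters can be sketched as follows (a direct transcription of the figure's numbers; the helper name is mine):

```python
n, m, v = 16, 8, 4        # main-memory blocks, cache frames, sets
k = m // v                # 2-way set associativity
d = (v - 1).bit_length()  # set-field bits: 2**d = v  ->  d = 2

def set_index(j: int) -> int:
    # Block B_j maps into set S_i with i = j mod v.
    return j % v

print(k, d)
print([j for j in range(n) if set_index(j) == 0])  # blocks competing for set 0
```

Any of the v = 4 blocks that share a set index can occupy either of that set's k = 2 frames, which is exactly the compromise between direct mapping and full associativity described above.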
Q6. Give formal statement of the task dispatching problem. [1]
A6. Formal statement - [1 mark] Given,
I. a vector task system V
II. a vector computer, with
III. m identical pipelines, and
IV. a deadline D,
Does there exist a parallel schedule f for V with Finish Time ω such that ω ≤ D?
It is a feasibility problem.
Desirable algorithm: Heuristic scheduling algorithm.
Q7. A sum of the first k components of a vector (k = 0 to n - 1), when the number of processors and the length (n) of the vector are both 8, is obtained in three steps. Will the number of steps increase if both are raised to 16? Justify your answer. [3]
A7.
Formulation - [0.25 marks] We have to find the sum S (k) of the first k components of vector A for each k varying from 0 to 15.
i.e., S (0) = A0;
S (1) = A1 + S (0);
S (2) = A2 + A1 + A0 = A2 + S (1);
or, S (k) = S (k-1) + Ak;
It is a recursive sum.
Relation used - [1 mark] S (k) depends on S (k-1), therefore the sum cannot be obtained directly in parallel. A0 to A15 can be accessed simultaneously from PEM0 to PEM15. The intermediate sums Sk calculated in one step are routed between PEs; at the same time, inactive PEs can be masked.
Therefore for n=16;
Store A0 in PE0, A1 in PE1, …, A15 in PE15.
The operation is,
[R 0 ← A0, R 1 ← A1, …., R 15 ← A15]
To get S (1) = A1 + A0 (= S (0)), route A0 to A1:
<R i+1> ← <R i> for i = 0 to 14.
Similarly, to obtain the other intermediate solutions,
<R i+2> ← <R i> for i = 0 to 13,
<R i+4> ← <R i> for i = 0 to 11,
<R i+8> ← <R i> for i = 0 to 7.
Thus,
<R i+1> ← Ai + Ai+1 (intermediate sum)
R 1 ← A0 + A1 = S (1)
R 2 ← A1 + A2 = S (2) – S (0)
R 3 ← A2 + A3 = S (3) – S (1)
.
R 15 ← A14 + A15 = S (15) – S (13)
In the same way it will be done for other steps.
S (k) = Σ_{i=0}^{k} Ai, for k = 0, 1, 2, ..., 15
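The routing steps above amount to a recursive-doubling prefix sum. A sequential sketch that mimics the SIMD steps and counts them (the function name is mine; each list pass models one route-and-add step across all PEs):

```python
def prefix_sum_steps(a):
    # Each pass models one routing step <R_{i+shift}> <- <R_i>
    # followed by an add; PEs with i < shift are "masked" (unchanged).
    r = list(a)
    n = len(r)
    steps = 0
    shift = 1
    while shift < n:
        r = [r[i] + (r[i - shift] if i >= shift else 0) for i in range(n)]
        shift *= 2
        steps += 1
    return r, steps

_, s8 = prefix_sum_steps(range(8))
sums16, s16 = prefix_sum_steps(range(16))
print(s8, s16)          # 3 4  -> one extra step when n doubles from 8 to 16
print(sums16[15])       # 120 = S(15), the full sum of 0..15
```

After step s, R_i holds the sum of the last 2^s elements ending at i, so log2(n) steps suffice: 3 for n = 8, 4 for n = 16.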
Calculation of number of steps diagrammatically - [1.5 marks]
Step 1 (<R i+1> ← <R i>): PEs not involved in routing: PE15. PEs inactive in addition: PE0. Mask-off: PE0 & PE15.
Step 2 (<R i+2> ← <R i>): PEs not involved in routing: PE14 & PE15. PEs inactive in addition: PE0 & PE1. Mask-off: PE0, PE1, PE14, PE15.
Step 3 (<R i+4> ← <R i>): PEs not involved in routing: PE12 to PE15. PEs inactive in addition: PE0 to PE3. Mask-off: PE0 to PE3, PE12 to PE15.
Step 4 (<R i+8> ← <R i>): PEs not involved in routing: PE8 to PE15. PEs inactive in addition: PE0 to PE7. Mask-off: PE0 to PE15.
So we can see the number of steps required is 4.
Calculation of number of steps using the formula - [0.25 marks]
The recursive-doubling sum over n elements needs r = log2(n) routing-and-add steps. For n = 16, r = log2(16) = 4, which is also the result we obtained diagrammatically. (For n = 8, r = 3, matching the three steps stated in the question.) So yes, raising both the number of processors and n to 16 increases the number of steps from 3 to 4.