Understanding the Tomasulo Algorithm Yichao Cheng Jul 23, 2013

Upload: onesuper

Post on 18-Dec-2014

DESCRIPTION

How the Tomasulo algorithm works, and why it works.

TRANSCRIPT

Page 1: Understanding Tomasulo Algorithm

Understanding the Tomasulo Algorithm

Yichao Cheng Jul 23, 2013

Page 2: Understanding Tomasulo Algorithm

Background

IBM System/360 Model 91

The FPU’s add/mul/div operations take 2/3/13 cycles

Can performance be improved by using multiple execution units?

[Figure: the Adder and Mul/div execution units]

Page 3: Understanding Tomasulo Algorithm

Major Contributions

Proposed three innovative mechanisms:

Common data bus (CDB)

Register tagging scheme

Reservation stations

which permit:

Out-of-order execution of independent instructions

while preserving the essential precedences in the instruction stream

Page 4: Understanding Tomasulo Algorithm

Doubt

When people talk about the Tomasulo algorithm, they talk about register renaming

However, this term cannot be found in the original paper

How could anyone invent a thing without noticing it?

Page 5: Understanding Tomasulo Algorithm

Architecture Overview

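[Figure: block diagram. The Instruction Unit and Storage sit outside the FPU; the FPU contains the FLOS (floating-point operation stack), the Decoder, the FLR (floating-point registers), the FLB (floating-point buffers), the SDB (store data buffers), and the Adder and Mul/div units]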

Page 6: Understanding Tomasulo Algorithm

From the FPU’s perspective

All instructions are ‘register-to-register’:

Register-to-register arithmetic

Storage-to-register arithmetic

Load

Store

The Instruction Unit (outside the FPU) is in charge of address generation and memory access.

Page 7: Understanding Tomasulo Algorithm

‘sink’ and ‘source’

Equivalent to destination and source

For example, AD R1, R2

R1 is both a sink and a source

[Figure: a value flows from source to sink]

Page 8: Understanding Tomasulo Algorithm

1. Reg-to-reg arithmetic: AD R1, R2

[Figure: data path for this case on the block diagram (FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul/div)]

Page 9: Understanding Tomasulo Algorithm

2. Storage-to-reg arithmetic: AD R1, FLB

[Figure: data path for this case on the block diagram (FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul/div)]

Page 10: Understanding Tomasulo Algorithm

3. Load: LD R1, FLB1

[Figure: data path for this case on the block diagram (FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul/div)]

Page 11: Understanding Tomasulo Algorithm

4. Store: STD R1, SDB1

[Figure: data path for this case on the block diagram (FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul/div)]

Page 12: Understanding Tomasulo Algorithm

Timing Sequence: 1. Reg-to-reg arithmetic

IU: Decode

EU: Decode, 2 operands to ALU, Execute, Write back to FLR

Page 13: Understanding Tomasulo Algorithm

2. Storage-to-reg arithmetic

IU: Decode, Addr Gen, Mem Read

EU: Decode, FLR to ALU, FLB to ALU, Execute, Write back to FLR

Page 14: Understanding Tomasulo Algorithm

3. Load

IU: Decode, Addr Gen, Mem Read

EU: Decode, FLR to ALU, FLB to ALU, Execute, Write back to FLR

Page 15: Understanding Tomasulo Algorithm

4. Store

IU: Decode, Addr Gen, Mem Write

EU: Decode, FLR to ALU, Execute, Write to SDB

Page 16: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram (Instruction Unit, FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul/div)]

Page 17: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram, with ‘Decode & Address generation’ marked at the Instruction Unit; annotations: addr, FLB1]

Page 18: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram with the Instruction Unit; annotations: LD R1, FLB1; addr; FLB1]

Page 19: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram; annotations: LD R1, FLB1; addr; FLB1]

Page 20: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram; annotations: LD R1, FLB1; OP; addr; FLB1]

Page 21: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram; annotations: LD R1, FLB1; OP; addr; FLB1]

Page 22: Understanding Tomasulo Algorithm

A Day in the Life of ‘LD R1, addr’

[Figure: the block diagram; annotations: LD R1, FLB1; addr; FLB1; R1]

Page 23: Understanding Tomasulo Algorithm

An Example of Dependence

LD F0, FLB1

MD F0, FLB2

What if we send them to different execution units at the same time to exploit parallelism?

[Figure: the Adder and Mul/div execution units]

Page 24: Understanding Tomasulo Algorithm

An Example of Dependence

LD F0, FLB1

MD F0, FLB2

The result (F0) would not reflect the effect of LD, because MD would use the old value of F0

[Figure: the Adder and Mul/div execution units]

Page 25: Understanding Tomasulo Algorithm

An Example of Dependence

LD F0, FLB1

MD F0, FLB2

This is called a true dependence, a.k.a. read after write (RAW)
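The RAW check on this pair can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Instr`, `decode` and `raw_hazard` are made up here, and the decoder assumes the two-operand convention from the ‘sink’ and ‘source’ slide (the first operand of an arithmetic instruction is both sink and source, while a load only writes its first operand).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str          # mnemonic, e.g. "LD", "AD", "MD"
    sink: str        # register written by the instruction
    sources: tuple   # registers or buffers read by the instruction


def decode(text):
    """Decode 'OP X, Y' (loads and arithmetic only; stores are not handled)."""
    op, rest = text.split(maxsplit=1)
    first, second = [part.strip() for part in rest.split(",")]
    if op == "LD":
        # A load overwrites its first operand without reading it.
        return Instr(op, first, (second,))
    # Arithmetic: the first operand is both a source and the sink.
    return Instr(op, first, (first, second))


def raw_hazard(earlier, later):
    """True if `later` reads the register that `earlier` writes."""
    return earlier.sink in later.sources


ld = decode("LD F0, FLB1")
md = decode("MD F0, FLB2")
print(raw_hazard(ld, md))  # MD reads F0, which LD writes
```

If both instructions went to different units at once, nothing in this check would be enforced, which is exactly the problem the slide describes.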

Page 26: Understanding Tomasulo Algorithm

A Simple Solution

‘busy’ bit scheme

[Figure: registers R0 to R3, each with a busy bit (B)]

LD R1, B

MD R1, A

R1: “I’m already the sink of some instruction”

MD: “I need your content”

Page 27: Understanding Tomasulo Algorithm

Performance Degrades...

When the code keeps using one register

E.g. MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

Overlap fails because the first AD depends on MD and stalls, although the following instructions do not depend on MD

The second AD is qualified to issue
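A rough Python model of the busy-bit decoder shows where the stall comes from. It is a sketch under assumptions, not the Model 91 logic: instructions are `(op, sink, source)` tuples, register names look like `F0`, anything else (`E`, `A`) is a storage operand, and the decoder stops at the first instruction that touches a busy register. All function names are hypothetical.

```python
def is_reg(name):
    """Assumed naming: floating-point registers are F0, F2, F4, ..."""
    return name.startswith("F") and name[1:].isdigit()


def needed_regs(instr):
    """Registers whose busy bit must be clear before decoding `instr`."""
    _op, sink, source = instr
    return {sink} | ({source} if is_reg(source) else set())


def in_order_decode(program):
    """One decode pass with no completions: (issued indices, stall index)."""
    busy = set()
    issued = []
    for index, instr in enumerate(program):
        if needed_regs(instr) & busy:
            return issued, index      # everything behind this one waits too
        busy.add(instr[1])            # the sink becomes busy
        issued.append(index)
    return issued, None


def operands_free(instr, busy):
    """True if `instr` touches no busy register."""
    return not (needed_regs(instr) & busy)


program = [("MD", "F0", "E"), ("AD", "F2", "F0"),
           ("AD", "F4", "A"), ("AD", "F2", "F4")]
issued, stalled_at = in_order_decode(program)
print(issued, stalled_at)                    # [0] 1
print(operands_free(program[2], {"F0"}))     # True, yet it is stuck behind index 1
```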

Page 28: Understanding Tomasulo Algorithm

Cause of the Problem

If one instruction gets stuck (due to a dependence), the following instructions cannot be decoded (even if they are qualified to issue)

Solution:

Decouple dependence maintenance from decoding

Look ahead at more instructions to find concurrency

Page 29: Understanding Tomasulo Algorithm

Dispatch and Issue Decoupling

MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

[Figure: at Decode, each instruction is checked (“Is that reg busy?”, “Can issue?”) before it goes to the Adder]

Page 30: Understanding Tomasulo Algorithm

Dispatch and Issue Decoupling

MD F0, E

AD F2, F0

AD F4, A

AD F2, F4

[Figure: Decode dispatches anyway; a dispatched instruction (here MD F0, E) then asks “Are my operands ready?” and “Can issue?”]
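The decoupling can be sketched as follows. This is a simplified Python model with assumed names (`dispatch`, `ready`): every instruction is dispatched regardless of its operands, each dispatched entry remembers which earlier instructions it is waiting for, and an entry may issue once all of them are done. Tags and the CDB are not modelled here.

```python
def is_reg(name):
    """Assumed naming: floating-point registers are F0, F2, F4, ..."""
    return name.startswith("F") and name[1:].isdigit()


def dispatch(program):
    """Dispatch every instruction; return, per entry, the set of
    indices of earlier instructions whose results it waits for."""
    last_writer = {}
    stations = []
    for index, (op, sink, source) in enumerate(program):
        reads = [] if op == "LD" else [sink]   # arithmetic also reads its sink
        if is_reg(source):
            reads.append(source)
        stations.append({last_writer[r] for r in reads if r in last_writer})
        last_writer[sink] = index
    return stations


def ready(stations, done=frozenset()):
    """Entries that have not run yet and whose producers have all finished."""
    return [i for i, waits in enumerate(stations)
            if i not in done and waits <= done]


program = [("MD", "F0", "E"), ("AD", "F2", "F0"),
           ("AD", "F4", "A"), ("AD", "F2", "F4")]
stations = dispatch(program)
print(ready(stations))            # [0, 2]: the second AD can issue alongside MD
print(ready(stations, {0, 2}))    # [1]
```

With the same four instructions that stalled under the busy-bit decoder, `AD F4, A` is now free to issue while `AD F2, F0` waits for `MD`.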

Page 31: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1 (F0 as sink)

AD F2, F0 (F0 as source)

[Figure: the FLB holding FLB1, the FLR holding F0, and the Adder and Mul/div units]

Assume the CDB has not been introduced yet

Page 32: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1 (dispatches to A1)

AD F2, F0

[Figure: LD F0, FLB1 sits in A1 at the Adder; F0 in the FLR is marked busy (B) with A1]

F0 is reserved for some instruction

Page 33: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1 (dispatches to A1)

AD F2, F0

[Figure: LD F0, FLB1 sits in A1 at the Adder; F0 in the FLR is marked busy (B) with A1]

Its content is calculated by A1

Page 34: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1

AD F2, F0

[Figure: LD F0, FLB1 sits in A1 at the Adder; F0 in the FLR is marked busy (B) with A1]

AD: “I need the value of F0, but it seems to be busy”

Page 35: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1

AD F2, F0 (dispatches to A2)

[Figure: A1 holds LD F0, FLB1 and A2 holds AD F2, F0; F0 in the FLR is marked busy (B) with A1]

AD: “Since A1 is the producer, just let it tell me”

Page 36: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1

AD F2, F0 (dispatches to A2)

[Figure: A1 holds LD F0, FLB1 and A2 now holds AD F2, A1; F0 in the FLR is marked busy (B) with A1]

AD: “Since A1 is the producer, just ask it for the value”

Page 37: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1 (executing)

AD F2, F0

[Figure: A1 holds LD F0, FLB1 and A2 holds AD F2, A1; F0 in the FLR is marked busy (B) with A1]

LD: “Operands are ready. Execute!”

Page 38: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1 (broadcasts its result)

AD F2, F0

[Figure: A1 holds LD F0, FLB1 and A2 holds AD F2, A1; F0 in the FLR is marked busy (B) with A1]

A1: “I’m A1. Who needs my result? Over.”

Page 39: Understanding Tomasulo Algorithm

An Example of True Dependence

LD F0, FLB1 (broadcasts its result)

AD F2, F0

[Figure: A1 holds LD F0, FLB1 and A2 holds AD F2, A1; F0 in the FLR is marked busy (B) with A1]

Consumers of A1: “I depend on A1!” “Me too!”
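The step from ‘AD F2, F0’ to ‘AD F2, A1’ amounts to replacing a busy source register with the name of its producer at dispatch time. A minimal Python sketch, assuming a hypothetical `producer_of` map from busy registers to the station that will produce them:

```python
def dispatch_operands(sources, producer_of):
    """Replace each busy source register with the tag of its producer.

    `producer_of` maps a busy register to the station that will write it;
    registers that are not busy are left as they are.
    """
    return [producer_of.get(src, src) for src in sources]


# F0 is busy and will be produced by A1, so AD F2, F0 becomes AD F2, A1.
print(dispatch_operands(["F2", "F0"], {"F0": "A1"}))  # ['F2', 'A1']
```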

Page 40: Understanding Tomasulo Algorithm

The Role of CDB

The Common Data Bus (CDB) is in charge of value forwarding

In the reg-to-reg model, a value is passed through a register (write, then read)

[Figure: F0 is written as a sink (producer)]

Page 41: Understanding Tomasulo Algorithm

The Role of CDB

The Common Data Bus (CDB) is in charge of value forwarding

In the reg-to-reg model, a value is passed through a register (write, then read)

[Figure: F0 is read as a source (consumer)]

Page 42: Understanding Tomasulo Algorithm

The Role of CDB

[Figure: reservation stations (Resv. S) for Add and for Mul, the FLB, the SDB, and the FLR]

Loads and stores do not need to go through the ALU

Dependence management is decoupled from execution, as expected

Page 43: Understanding Tomasulo Algorithm

The Role of CDB

Consumer: all units which may take a register as an operand

Producer: all units which can alter a register

[Figure: producers (P) on the CDB. Reservation stations for Add: P:3; reservation stations for Mul: P:2; FLB: P:6]

Page 44: Understanding Tomasulo Algorithm

The Role of CDB

Consumer: all units which may take a register as an operand

Producer: all units which can alter a register

[Figure: consumers (C) on the CDB. FLR: C:4; SDB: C:3; reservation stations for Mul: C:2*2; reservation stations for Add: C:3*2]
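The P and C counts on these two slides translate directly into hardware cost. The short Python calculation below is a sketch: the assignment of each count to a unit is read off the figure labels and is therefore an assumption, and real hardware may reserve an extra tag code to mean ‘no producer’.

```python
# Producers: units that can put a result on the CDB (P labels).
PRODUCERS = {"add_stations": 3, "mul_stations": 2, "flb": 6}

# Consumers: places that may wait on a tag (C labels); each reservation
# station has two operand slots, hence 2*2 and 3*2.
CONSUMER_SLOTS = {"flr": 4, "sdb": 3, "mul_stations": 2 * 2, "add_stations": 3 * 2}


def tag_bits(num_producers):
    """Smallest number of bits that gives every producer a distinct tag."""
    return max(1, (num_producers - 1).bit_length())


num_producers = sum(PRODUCERS.values())        # 3 + 2 + 6 = 11
num_comparators = sum(CONSUMER_SLOTS.values()) # 4 + 3 + 4 + 6 = 17

print(num_producers, tag_bits(num_producers), num_comparators)  # 11 4 17
```

So 11 producers need at least a 4-bit tag, and every one of the 17 consumer slots has to compare its stored tag against the bus.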

Page 45: Understanding Tomasulo Algorithm

The Implementation of CDB

A consumer recognizes its producer by a tag

Producers put <tag, value> on the bus in turn (each makes a request first)

If the tag matches, the consumer ingates the value

[Figure: six producers (P) above six consumers (C); the consumers hold tags tag, tag, tag, X, Y, Y; a producer makes a request (2 cycles)]

Page 46: Understanding Tomasulo Algorithm

The Implementation of CDB

A consumer recognizes its producer by a tag

Producers put <tag, value> on the bus in turn (each makes a request first)

If the tag matches, the consumer ingates the value

[Figure: six producers (P) above six consumers (C); the consumers hold tags tag, tag, tag, X, Y, Y; <Y, value> is on the bus]

Page 47: Understanding Tomasulo Algorithm

The Implementation of CDB

A consumer recognizes its producer by a tag

Producers put <tag, value> on the bus in turn (each makes a request first)

If the tag matches, the consumer ingates the value

[Figure: six producers (P) above six consumers (C); the consumers hold tags tag, tag, tag, X, Y, Y; a producer makes a request]

Page 48: Understanding Tomasulo Algorithm

The Implementation of CDB

A consumer recognizes its producer by a tag

Producers put <tag, value> on the bus in turn (each makes a request first)

If the tag matches, the consumer ingates the value

[Figure: six producers (P) above six consumers (C); the consumers hold tags tag, tag, tag, X, Y, Y; <X, value> is on the bus]
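The request, broadcast and tag-match steps can be sketched as a toy bus in Python. The class names are invented for this example, the first-come-first-served grant order is an assumption (the slides only say producers take turns), and the two-cycle request latency is not modelled.

```python
from collections import deque


class Consumer:
    def __init__(self, name, tag):
        self.name = name
        self.tag = tag        # tag of the producer this consumer waits for
        self.value = None

    def snoop(self, tag, value):
        """Ingate the value if the broadcast tag matches the stored tag."""
        if self.tag == tag:
            self.value = value
            self.tag = None   # no longer waiting


class CommonDataBus:
    def __init__(self, consumers):
        self.consumers = consumers
        self.requests = deque()

    def request(self, tag, value):
        """A producer asks for the bus; it is served in a later cycle."""
        self.requests.append((tag, value))

    def cycle(self):
        """Grant the oldest request and broadcast it; return its tag or None."""
        if not self.requests:
            return None
        tag, value = self.requests.popleft()
        for consumer in self.consumers:
            consumer.snoop(tag, value)
        return tag


# Three consumers as in the figure: one waits for X, two wait for Y.
cx, cy1, cy2 = Consumer("c4", "X"), Consumer("c5", "Y"), Consumer("c6", "Y")
bus = CommonDataBus([cx, cy1, cy2])
bus.request("Y", 1.5)
bus.request("X", 2.5)
bus.cycle()   # Y goes first: both Y consumers ingate 1.5
bus.cycle()   # then X: the X consumer ingates 2.5
print(cx.value, cy1.value, cy2.value)  # 2.5 1.5 1.5
```

One broadcast serves every waiting consumer at once, which is why a single producer can feed both a register and a reservation station.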

Page 49: Understanding Tomasulo Algorithm

The Principle behind the Scene

A tag is a pointer to the producer of a value required by the current instruction

The pointers make up the dependency information that is hidden by the reg-reg model (discussed later)

With this information, the order of execution can be resolved

The CDB enables ‘producer-consumer’ style data flow

Page 50: Understanding Tomasulo Algorithm

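An Example of False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: the FLB holding FLB1 and FLB2, the FLR holding F0, and the Adder and Mul/div units]

WAW: the two LDs both write F0. WAR: AD F2, F0 reads F0 before the second LD writes it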

Page 51: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1 (dispatches)

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: F0 in the FLR is marked busy (B) with tag FLB1]

Page 52: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1

AD F2, F0 (dispatches to A1)

LD F0, FLB2

AD F3, F0

[Figure: A1 holds AD F2, F0; F0 in the FLR is marked busy (B) with tag FLB1]

Page 53: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: A1 holds AD F2, F0; F0 in the FLR is marked busy (B) with tag FLB1]

Page 54: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2 (dispatches)

AD F3, F0

[Figure: A1 holds AD F2, F0; F0 in the FLR is marked busy (B), now with tag FLB2]

Page 55: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0 (dispatches to A2)

[Figure: A1 holds AD F2, F0 and A2 holds AD F3, F0; F0 in the FLR is marked busy (B) with tag FLB2]

Page 56: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: A1 holds AD F2, F0 and A2 holds AD F3, F0; F0 in the FLR is marked busy (B) with tag FLB2]

Keep tracking the source of the value instead of the register holding it

Page 57: Understanding Tomasulo Algorithm

An Example of False Dependence

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

[Figure: A1 holds AD F2, F0 and A2 holds AD F3, F0; F0 in the FLR is marked busy (B) with tag FLB2]

There is no need to rename a register (naming is just a way of referring to values)
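The four-instruction example can be replayed with a small Python model to see why the WAR and WAW hazards disappear. This is a sketch under assumptions: the classes and function names are invented, the initial register values and loaded data (10.0 and 20.0) are arbitrary, and the second load is made to return first on purpose.

```python
class Reg:
    def __init__(self, value):
        self.value = value
        self.tag = None            # None means the stored value is current


class Station:
    """An adder reservation station; each operand is a [value, tag] pair."""
    def __init__(self, name):
        self.name = name
        self.ops = []

    def ready(self):
        return all(tag is None for _value, tag in self.ops)


def dispatch_load(regs, sink, buffer_tag):
    regs[sink].tag = buffer_tag    # the sink now waits for this buffer


def dispatch_add(regs, station, sink, source):
    # Copy the value if it is current, otherwise copy the producer's tag.
    station.ops = [[regs[r].value, regs[r].tag] for r in (sink, source)]
    regs[sink].tag = station.name


def broadcast(regs, stations, tag, value):
    """Put <tag, value> on the bus; every matching consumer ingates it."""
    for reg in regs.values():
        if reg.tag == tag:
            reg.value, reg.tag = value, None
    for station in stations:
        for op in station.ops:
            if op[1] == tag:
                op[0], op[1] = value, None


def execute(regs, stations, station):
    assert station.ready()
    result = station.ops[0][0] + station.ops[1][0]
    station.ops = []
    broadcast(regs, stations, station.name, result)


def run():
    regs = {"F0": Reg(0.0), "F2": Reg(1.0), "F3": Reg(2.0)}
    a1, a2 = Station("A1"), Station("A2")
    stations = [a1, a2]
    dispatch_load(regs, "F0", "FLB1")        # LD F0, FLB1
    dispatch_add(regs, a1, "F2", "F0")       # AD F2, F0  (waits for FLB1)
    dispatch_load(regs, "F0", "FLB2")        # LD F0, FLB2
    dispatch_add(regs, a2, "F3", "F0")       # AD F3, F0  (waits for FLB2)
    broadcast(regs, stations, "FLB2", 20.0)  # the second load returns first
    execute(regs, stations, a2)
    broadcast(regs, stations, "FLB1", 10.0)  # F0 ignores it: its tag was FLB2
    execute(regs, stations, a1)
    return {name: reg.value for name, reg in regs.items()}


print(run())  # {'F0': 20.0, 'F2': 11.0, 'F3': 22.0}
```

Even with the loads completing out of order, F2 receives the first load's data and F0 ends up with the second load's data, because each consumer follows the tag of its producer rather than the register name.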

Page 58: Understanding Tomasulo Algorithm

Timing Sequence with Busy Bit

[Figure: pipeline timing chart for the four instructions below; stage labels: D, AG, FLB, T, EX, WB]

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

Page 59: Understanding Tomasulo Algorithm

Timing Sequence with Reservation Station

[Figure: pipeline timing chart for the four instructions below; stage labels: D, AG, FLB, T, EX, WB]

LD F0, FLB1

AD F2, F0

LD F0, FLB2

AD F3, F0

Page 60: Understanding Tomasulo Algorithm

The Side Effect of Register Machine

What are the differences between a circuit and a register machine?

Page 61: Understanding Tomasulo Algorithm

The Side Effect of Register Machine

What are the differences between a circuit and a register machine?

Register machine: general purpose, control-driven, implicit dependence via registers

Circuit: special purpose, data-driven, exposed dependence

...But registers are scarce

Page 62: Understanding Tomasulo Algorithm

Conclusion

The Tomasulo algorithm has nothing to do with register renaming

It resolves WAR and WAW by eliminating the side effect of using registers to pass values

With the Tomasulo algorithm, the execution of a program is driven by data flow, thus exploiting maximum concurrency