
Page 1: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

CS141-L7 - Tarun Soni, Summer '03

The Story so far:
• Instruction Set Architectures
• Performance issues
• ALUs
• Single Cycle CPU
• Multicycle CPU: datapath, control, exceptions
• Pipelining
• Memory systems
– Static/Dynamic RAM technologies
– Cache structures: direct mapped, associative
– Virtual memory

Page 2: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory systems

Page 3: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory Systems

[Block diagram: a computer comprises a processor (datapath + control), memory, input, and output.]

Page 4: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Technology Trends (from 1st lecture)

DRAM generations:

Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

        Capacity         Speed (latency)
Logic:  2x in 3 years    2x in 3 years
DRAM:   4x in 3 years    2x in 10 years
Disk:   4x in 3 years    2x in 10 years

Over this span, capacity grew ~1000:1 while speed improved only ~2:1.

Page 5: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Chart: performance vs. time, 1980-2000, log scale 1 to 1000. The CPU curve ("Moore's Law") grows ~60%/yr (2x/1.5 yr); the DRAM curve grows ~9%/yr (2x/10 yrs). The processor-memory performance gap grows ~50% per year.]

Page 6: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Today's Situation: Microprocessor

• Rely on caches to bridge the gap
• Microprocessor-DRAM performance gap
– time of a full cache miss, in instructions executed:
1st Alpha (7000):  340 ns / 5.0 ns =  68 clks x 2 or 136 instructions
2nd Alpha (8400):  266 ns / 3.3 ns =  80 clks x 4 or 320 instructions
3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6 or 648 instructions
– 1/2X latency x 3X clock rate x 3X instr/clock => 5X
• Processor speeds
– Intel Pentium III: 4 GHz = 0.25 ns per cycle
• Memory speeds
– Mac/PC/workstation DRAM: 50 ns
– Disks are even slower....
• How can we span this "access time" gap?
– 1 instruction fetch per instruction
– 15/100 instructions also do a data read or write (load or store)

Page 7: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Impact on Performance

• Suppose a processor executes at
– Clock rate = 200 MHz (5 ns per cycle)
– CPI = 1.1
– 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory ops get a 50-cycle miss penalty

• CPI = ideal CPI + average stalls per instruction
     = 1.1 (cyc) + ( 0.30 (data mops/ins) x 0.10 (miss/data mop) x 50 (cycle/miss) )
     = 1.1 cyc + 1.5 cyc = 2.6
• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!

[Pie chart: ideal CPI (1.1) 35%, data miss stalls (1.6) 49%, instruction miss stalls (0.5) 16%.]
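A quick back-of-the-envelope check of the CPI arithmetic above (the rates and penalties are the slide's example numbers, not measurements):

```python
# Stall CPI added by data-memory misses:
# (memory ops per instruction) x (miss rate) x (miss penalty)
base_cpi     = 1.1   # ideal CPI, perfect memory
ld_st_frac   = 0.30  # fraction of instructions that are loads/stores
miss_rate    = 0.10  # fraction of memory ops that miss
miss_penalty = 50    # cycles per miss

data_stall_cpi = ld_st_frac * miss_rate * miss_penalty   # 1.5
total_cpi      = base_cpi + data_stall_cpi               # 2.6

print(f"CPI = {total_cpi:.1f}")
print(f"stalled fraction = {data_stall_cpi / total_cpi:.0%}")  # ~58%
```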

Page 8: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory system - hierarchical

[Diagram: the processor (control + datapath) is backed by a chain of progressively more distant memories.
Speed: fastest (nearest the processor) to slowest.
Size: smallest to biggest.
Cost: highest to lowest.]

Page 9: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Why hierarchy works

• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.

[Plot: probability of reference vs. address space (0 to 2^n - 1), sharply peaked around a few regions.]

Page 10: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Locality

• Property of memory references in "typical programs"
• Tendency to favor a portion of their address space at any given time
• Temporal
– Tendency to reference locations recently referenced
• Spatial
– Tendency to reference locations "near" those recently referenced

[Diagram: a reference to address x at time t makes x likely to be referenced again soon, and addresses in a zone around x likely to be referenced.]
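As a concrete illustration of the two kinds of locality (a made-up sketch, not from the slides): the loop below re-references one variable every iteration (temporal locality) and walks an array in address order (spatial locality).

```python
data = list(range(1_000_000))

total = 0            # 'total' is re-referenced every iteration: temporal locality
for x in data:       # elements are visited in address order: spatial locality
    total += x       # neighboring elements likely share a cache block

print(total)
```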

Page 11: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory Hierarchy: How Does it Work?

• Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the processor
• Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the upper levels

[Diagram: blocks Blk X and Blk Y move between upper-level and lower-level memory as data travels to and from the processor.]

Page 12: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory Hierarchy: How Does it Work?

• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can't build large, fast memories.
• If it works, we get the illusion of SRAM access time with disk capacity.

SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte.
DRAM access times are 60-120 ns at a cost of $5 to $10 per MByte.
Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte.

Page 13: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of
RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level +
time to deliver the block to the processor
• Hit Time << Miss Penalty

[Diagram: as before, blocks Blk X (upper level) and Blk Y (lower level) moving to/from the processor.]

Page 14: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.

[Diagram: processor (registers, on-chip cache) -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (disk/tape).
Speed (ns): 1s -> 10s -> 100s -> 10,000,000s (10s ms) -> 10,000,000,000s (10s sec).
Size (bytes): 100s -> Ks -> Ms -> Gs -> Ts.]

Page 15: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Main Memory Background

• Performance of main memory:
– Latency: cache miss penalty
• Access time: time between the request and the word arriving
• Cycle time: time between requests
– Bandwidth: I/O & large block miss penalty (L2)
• Main memory is DRAM: Dynamic Random Access Memory
– Dynamic since it needs to be refreshed periodically (8 ms)
– Addresses divided into 2 halves (memory as a 2D matrix):
• RAS or Row Access Strobe
• CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit)
– Address not divided
• Size: DRAM/SRAM = 4-8x; cost & cycle time: SRAM/DRAM = 8-16x

Page 16: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Static RAM Cell

6-Transistor SRAM Cell

[Circuit diagram: cross-coupled inverters holding the stored bit (0/1), accessed via two pass transistors onto complementary bit lines (bit, bit-bar), gated by the word line (row select). A variant replaces the pull-ups with resistive loads to save area.]

• Write:
1. Drive bit lines (bit = 1, bit-bar = 0)
2. Select row
• Read:
1. Precharge bit and bit-bar to Vdd
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit-bar

Page 17: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Typical SRAM Organization: 16-word x 4-bit

[Diagram: a 16x4 array of SRAM cells. An address decoder (A0-A3) drives word lines Word 0 .. Word 15; each of the 4 bit columns has a write driver & precharger (inputs Din 0-3, controls WrEn and Precharge) and a sense amp producing Dout 0-3.]

Page 18: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Problems with SRAM

• Six transistors use up a lot of area
• Consider a "zero" stored in the cell:
– Transistor N1 will try to pull "bit" to 0
– Transistor P2 will try to pull "bit bar" to 1
• But bit lines are precharged high: are P1 and P2 necessary?

[Diagram: the 6-T cell storing a zero, with Select = 1, bit = 1, bit-bar = 0, showing which of N1/N2 and P1/P2 conduct.]

Page 19: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Logic Diagram of a Typical SRAM

• Write Enable is usually active low (WE_L)
• Din and Dout are combined (D) to save pins, with a new output-enable signal (OE_L):
– WE_L asserted, OE_L deasserted => D is the data input pin
– WE_L deasserted, OE_L asserted => D is the data output pin

[Logic symbol: 2^N words x M bits SRAM with N address lines (A), M data lines (D), WE_L, and OE_L.
Write timing: write address on A, data in on D, WE_L pulsed; data must meet the write setup time before and the write hold time after the WE_L edge.
Read timing: read address on A with OE_L asserted; D leaves high-Z and data out appears after the read access time.]

Page 20: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

1-Transistor Memory Cell (DRAM)

• Write:
– 1. Drive bit line
– 2. Select row
• Read:
– 1. Precharge bit line to Vdd
– 2. Select row
– 3. Cell and bit line share charges
• Very small voltage changes on the bit line
– 4. Sense (fancy sense amp)
• Can detect changes of ~1 million electrons
– 5. Write: restore the value (the read is destructive)
• Refresh
– 1. Just do a dummy read to every cell

[Diagram: one access transistor, gated by the row select line, connects a storage capacitor to the bit line.]

Page 21: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Classical DRAM Organization (square)

[Diagram: a square RAM cell array; a row decoder driven by the row address selects one word (row) line, and a column selector & I/O circuit driven by the column address picks the desired bit (data) lines. Each intersection represents a 1-T DRAM cell.]

• Row and column address together:
– select 1 bit at a time

Page 22: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Logic Diagram of a Typical DRAM

[Logic symbol: 256K x 8 DRAM with 9 multiplexed address pins (A), 8 data pins (D), and control pins RAS_L, CAS_L, WE_L, OE_L.]

• Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
• Din and Dout are combined (D):
– WE_L asserted (low), OE_L deasserted (high) => D serves as the data input pin
– WE_L deasserted (high), OE_L asserted (low) => D is the data output pin
• Row and column addresses share the same pins (A)
– RAS_L goes low: pins A are latched in as the row address
– CAS_L goes low: pins A are latched in as the column address
– RAS/CAS are edge-sensitive

Page 23: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU


Key DRAM Timing Parameters

• tRAC: minimum time from RAS line falling to the valid data output.

– Quoted as the speed of a DRAM

– A fast 4Mb DRAM tRAC = 60 ns

• tRC: minimum time from the start of one row access to the start of the next.

– tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns

• tCAC: minimum time from CAS line falling to valid data output.

– 15 ns for a 4Mbit DRAM with a tRAC of 60 ns

• tPC: minimum time from the start of one column access to the start of the next.

– 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

Page 24: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU


DRAM Performance

• A 60 ns (tRAC) DRAM can

– perform a row access only every 110 ns (tRC)

– perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).

• In practice, external address delays and turning around buses make it 40 to 50 ns

• These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.

– Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins…

– 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM
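A rough sanity check of what these parameters imply for peak bandwidth (using the slide's "60 ns" 4 Mb part; the 8-bit-wide interface is an assumption for illustration):

```python
# Timing parameters for a "60 ns" 4 Mb DRAM (from the slides), in ns
tRC = 110   # min time between row accesses
tPC = 35    # min time between column accesses (page mode)
width_bytes = 1  # assume a x8 part: 1 byte per access (illustrative)

# One access per tRC if every access opens a new row:
bw_row  = width_bytes / (tRC * 1e-9) / 1e6   # ~9 MB/s
# One access per tPC if we stay within an open row (page mode):
bw_page = width_bytes / (tPC * 1e-9) / 1e6   # ~29 MB/s

print(f"new row every access: {bw_row:.0f} MB/s")
print(f"page mode bursts:     {bw_page:.0f} MB/s")
```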

Page 25: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

DRAM Write Timing

[Timing diagram for the 256K x 8 DRAM: RAS_L falls with the row address on A; CAS_L falls with the column address; data in is presented on D around the WE_L edge; two writes are shown per DRAM write cycle time. Early write cycle: WE_L asserted before CAS_L. Late write cycle: WE_L asserted after CAS_L.]

• Every DRAM access begins at the assertion of RAS_L
• 2 ways to write: early or late relative to CAS

Page 26: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

DRAM Read Timing

[Timing diagram for the 256K x 8 DRAM: RAS_L falls with the row address on A; CAS_L falls with the column address; D leaves high-Z and data out appears after the read access time plus the output-enable delay. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L.]

• Every DRAM access begins at the assertion of RAS_L
• 2 ways to read: early or late relative to CAS

Page 27: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Cycle Time versus Access Time

• DRAM (read/write) cycle time >> DRAM (read/write) access time
– roughly 2:1; why?
• DRAM (read/write) cycle time:
– How frequently can you initiate an access?
– Analogy: a little kid can only ask his father for money on Saturday
• DRAM (read/write) access time:
– How quickly will you get what you want once you initiate an access?
– Analogy: as soon as he asks, his father will give him the money
• DRAM bandwidth limitation analogy:
– What happens if he runs out of money on Wednesday?

[Timeline: access time is the initial portion of each cycle; cycle time is the full interval before the next access can start.]

Page 28: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Increasing Bandwidth - Interleaving

[Diagram, access pattern without interleaving: the CPU starts the access for D1, waits the full memory cycle until D1 is available, then starts the access for D2.

Access pattern with 4-way interleaving: the CPU issues accesses to banks 0, 1, 2, 3 in successive cycles; by the time bank 3 has been started, bank 0 can be accessed again. Memory is split into four banks (Memory Bank 0 - Memory Bank 3).]

A sketch of how addresses typically map to banks appears below.
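One common word-interleaved mapping sends consecutive blocks to consecutive banks, so sequential accesses hit all banks round-robin (a generic sketch; the slide doesn't specify a particular mapping):

```python
NUM_BANKS  = 4
BLOCK_SIZE = 4  # bytes per access

def bank_of(addr: int) -> int:
    """Low-order interleaving: consecutive blocks land in consecutive banks."""
    return (addr // BLOCK_SIZE) % NUM_BANKS

# Sequential word accesses cycle through banks 0,1,2,3,0,1,...
for addr in range(0, 32, BLOCK_SIZE):
    print(f"addr {addr:2d} -> bank {bank_of(addr)}")
```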

Page 29: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Fewer DRAMs/System over Time

Minimum PC memory size vs. DRAM generation (entries are DRAM chips per system):

                     DRAM generation
              '86    '89    '92    '96    '99    '02
              1 Mb   4 Mb   16 Mb  64 Mb  256 Mb 1 Gb
  4 MB         32      8
  8 MB                16      4
 16 MB                        8      2
 32 MB                               4      1
 64 MB                               8      2
128 MB                                      4      1
256 MB                                      8      2

Memory per system grows @ 25%-30% / year; memory per DRAM grows @ 60% / year.

• Increasing DRAM capacity => fewer chips => harder to have banks

Page 30: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Page Mode DRAM: Motivation

• Regular DRAM organization:
– N rows x N columns x M bits
– read & write M bits at a time
– each M-bit access requires a RAS/CAS cycle
• Fast Page Mode DRAM
– N x M "register" to save a row

[Timing diagram: without page mode, the 1st and 2nd M-bit accesses each need a full RAS_L/CAS_L cycle with row and column addresses. Array diagram: N rows x N cols of DRAM; the row address selects a row, the column address picks the M-bit output.]

Page 31: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Fast Page Mode Operation

• Fast Page Mode DRAM
– N x M "SRAM" to save a row
• After a row is read into the register:
– only CAS is needed to access other M-bit blocks on that row
– RAS_L remains asserted while CAS_L is toggled

[Timing diagram: RAS_L falls once with the row address; CAS_L then toggles with successive column addresses for the 1st, 2nd, 3rd, and 4th M-bit accesses. Array diagram: the N x M "SRAM" row register sits between the DRAM array and the M-bit output.]
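A back-of-the-envelope payoff of page mode, reusing the 4 Mb part's timing from the earlier slides (a simplified model, assuming all four accesses fall in one open row):

```python
tRC, tRAC, tPC = 110, 60, 35   # ns, from the "60 ns" 4 Mb DRAM slide

accesses = 4
no_page_mode = accesses * tRC               # full RAS/CAS cycle each: 440 ns
page_mode    = tRAC + (accesses - 1) * tPC  # open the row once: 165 ns

print(f"without page mode: {no_page_mode} ns")
print(f"with page mode:    {page_mode} ns")
```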

Page 32: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

DRAMs over Time

                         DRAM generation
1st Gen. Sample          '84     '87     '90     '93     '96     '99
Memory Size              1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
Die Size (mm^2)          55      85      130     200     300     450
Memory Area (mm^2)       30      47      72      110     165     250
Memory Cell Area (µm^2)  28.84   11.1    4.26    1.64    0.61    0.23

Page 33: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Memory - Summary

• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
• DRAM is slow but cheap and dense:
– good choice for presenting the user with a BIG memory system
• SRAM is fast but expensive and not very dense:
– good choice for providing the user FAST access time

Page 34: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Cache Fundamentals

• cache hit -- an access where the data is found in the cache
• cache miss -- an access which isn't
• hit time -- time to access the cache
• miss penalty -- time to move data from the further level to the closer one, then to the CPU
• hit ratio -- percentage of time the data is found in the cache
• miss ratio -- (1 - hit ratio)
• cache block size or cache line size -- the amount of data that gets transferred on a cache miss
• instruction cache -- cache that only holds instructions
• data cache -- cache that only caches data
• unified cache -- cache that holds both

[Diagram: cpu <-> lowest-level cache <-> next-level memory/cache.]

Page 35: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Caching Issues

On a memory access -
• How do I know if this is a hit or a miss?

On a cache miss -
• where to put the new data?
• what data to throw out?
• how to remember what data this is?

[Diagram: an access goes from the cpu to the lowest-level cache; a miss goes on to the next-level memory/cache.]

Page 36: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

A simple cache: direct mapped

• A cache that can put a line of data in exactly one place is called direct-mapped.
• An index (low-order address bits) is used to determine which line an address might be found in; a tag identifies which address is actually there.

[Diagram: 4-entry tag/data array; each block holds one word, and each word in memory maps to exactly one cache location.]

Address string (decimal, binary):
4   00000100
8   00001000
12  00001100
4   00000100
8   00001000
20  00010100
4   00000100
8   00001000
20  00010100
24  00011000
12  00001100
8   00001000
4   00000100

A simple cache: similar to the branch prediction table in pipelining?

• Conflict misses are misses caused by:
– different memory locations mapped to the same cache index
• Solution 1: make the cache size bigger
• Solution 2: multiple entries for the same cache index
• Conflict misses = 0, by definition, for fully associative caches.

A small simulator for this cache and address string follows.
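A minimal direct-mapped simulator you can run against the slide's address string (4 entries, one-word blocks, matching the example; an illustrative sketch, not course-provided code):

```python
ENTRIES, BLOCK = 4, 4  # 4 lines, 4-byte (one-word) blocks

addrs = [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]

lines = [None] * ENTRIES  # each entry holds the tag of the cached block, or None
hits = 0
for a in addrs:
    block = a // BLOCK
    index = block % ENTRIES   # which line this address must use
    tag   = block // ENTRIES  # identifies which block occupies the line
    if lines[index] == tag:
        hits += 1
    else:
        lines[index] = tag    # miss: evict whatever was there

print(f"{hits} hits, {len(addrs) - hits} misses")
```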

Page 37: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Fully associative cache

• A cache that can put a line of data anywhere is called fully associative.
• The most popular replacement strategy is LRU (least recently used).
• How do you find the data? The tag identifies the address of the cached data.

[Diagram: 4-entry tag/data array; each block holds one word, and any block can hold any word. Same address string as on the previous slide.]

An LRU simulation of this cache follows.
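The same address string through a 4-entry fully associative LRU cache (again an illustrative sketch; Python's OrderedDict stands in for the LRU bookkeeping):

```python
from collections import OrderedDict

ENTRIES, BLOCK = 4, 4
addrs = [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]

lru = OrderedDict()  # keys are block numbers; order = recency (last = most recent)
hits = 0
for a in addrs:
    block = a // BLOCK
    if block in lru:
        hits += 1
        lru.move_to_end(block)        # refresh recency on a hit
    else:
        if len(lru) == ENTRIES:
            lru.popitem(last=False)   # evict the least recently used block
        lru[block] = True

print(f"{hits} hits, {len(addrs) - hits} misses")  # fewer misses than direct-mapped here
```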

Page 38: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Fully associative cache

• Fully associative cache:
– forget about the cache index
– compare the cache tags of all cache entries in parallel
– example: with block size = 32 B blocks, we need N 27-bit comparators
• By definition: conflict misses = 0 for a fully associative cache

[Diagram: a 32-bit address splits into a 27-bit cache tag and a 5-bit byte select (e.g. 0x01); every entry's valid bit and 27-bit tag are compared in parallel against the address tag, and the matching entry's cache data (Byte 0 .. Byte 31 per block) is selected.]

Page 39: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

An n-way set associative cache

• A cache that can put a line of data in exactly n places is called n-way set-associative.
• The cache lines that share the same index are a cache set.

[Diagram: two 4-entry tag/data arrays; each block holds one word, and each word in memory maps to one of a set of n cache lines. Same address string as before.]

Page 40: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

An n-way set associative cache

• N-way set associative: N entries for each cache index
– N direct-mapped caches operate in parallel
• Example: two-way set associative cache
– the cache index selects a "set" from the cache
– the two tags in the set are compared in parallel
– data is selected based on the tag result

[Diagram: the cache index addresses two parallel valid/tag/data arrays; each address-tag comparison feeds an OR gate producing Hit, and the compare results (Sel1, Sel0) drive a mux that picks the matching cache block.]

A generic n-way LRU simulator appears below.
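A small n-way set-associative LRU simulator (a sketch; set n_ways = 1 for direct-mapped, or n_ways = number of blocks for fully associative):

```python
from collections import OrderedDict

def simulate(addrs, n_blocks=4, n_ways=2, block=4):
    """Return (hits, misses) for an n-way set-associative LRU cache."""
    n_sets = n_blocks // n_ways
    sets = [OrderedDict() for _ in range(n_sets)]  # per-set LRU order
    hits = 0
    for a in addrs:
        blk = a // block
        s   = sets[blk % n_sets]   # the cache index selects the set
        tag = blk // n_sets
        if tag in s:
            hits += 1
            s.move_to_end(tag)
        else:
            if len(s) == n_ways:
                s.popitem(last=False)  # evict the LRU way within the set
            s[tag] = True
    return hits, len(addrs) - hits

addrs = [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]
for ways in (1, 2, 4):
    print(f"{ways}-way: {simulate(addrs, n_ways=ways)}")
```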

Page 41: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Direct vs. Set Associative Caches

• N-way set associative cache versus direct mapped cache:
– N comparators vs. 1
– extra MUX delay for the data
– data comes AFTER the hit/miss decision and set selection
• In a direct mapped cache, the cache block is available BEFORE hit/miss:
– possible to assume a hit and continue; recover later if it was a miss.

[Diagram: the same two-way set-associative datapath as on the previous slide.]

Page 42: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Longer cache blocks

• Large cache blocks take advantage of spatial locality.
• Too large a block size can waste cache space.
• Longer cache blocks require less tag space.

[Diagram: 4-entry tag/data array; each block holds two words, and each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches). Same address string as before.]

Page 43: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Increasing block size

[Plot: miss rate (0%-40%) vs. block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls with block size at first, then rises again for large blocks in small caches.]

• In general, larger block sizes take advantage of spatial locality, BUT:
– larger block size means larger miss penalty:
• it takes longer to fill up the block
– if block size is too big relative to cache size, miss rate will go up:
• too few cache blocks
• In general, average access time:
– = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
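Plugging numbers into the slide's average-access-time formula (the example values are made up for illustration):

```python
def amat(hit_time, miss_penalty, miss_rate):
    """Average memory access time, as defined on the slide."""
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

# e.g. 1-cycle hit, 50-cycle miss penalty, 5% miss rate
print(amat(hit_time=1, miss_penalty=50, miss_rate=0.05))  # 3.45 cycles
```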

Page 44: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Sources of Cache Misses

• Compulsory (cold start or process migration, first reference): first access to a block
– "cold" fact of life: not a whole lot you can do about it
– note: if you are going to run "billions" of instructions, compulsory misses are insignificant
• Conflict (collision):
– multiple memory locations mapped to the same cache location
– solution 1: increase cache size
– solution 2: increase associativity
• Capacity:
– cache cannot contain all blocks accessed by the program
– solution: increase cache size
• Invalidation: another process (e.g., I/O) updates memory

Page 45: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Sources of Cache Misses

                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Conflict Miss       High            Medium                  Zero
Capacity Miss       Low             Medium                  High
Invalidation Miss   Same            Same                    Same

Page 46: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Accessing a cache

1. Use index and tag to access the cache and determine hit/miss.
2. If hit, return the requested data.
3. If miss, select a cache block to be replaced, and access memory or the next lower cache (possibly stalling the processor):
- load the entire missed cache line into the cache
- return the requested data to the CPU (or the higher cache)
4. If the next lower memory is a cache, go to step 1 for that cache.

[Diagram: the five pipeline stages IF ID EX MEM WB, with the ICache feeding IF, registers and the ALU in between, and the DCache in MEM.]

Page 47: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Accessing a cache

• 64 KB cache, direct-mapped, 32-byte cache block size

[Diagram: the 32-bit address splits into a 16-bit tag (bits 31-16), an 11-bit index (bits 15-5), a 3-bit word offset, and a 2-bit byte offset. 64 KB / 32 bytes = 2K cache blocks/sets (rows 0..2047 of valid/tag/data); the indexed entry's tag is compared (=) against the address tag to produce hit/miss, and the word offset selects 32 bits of the 256-bit block.]
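The bit-width arithmetic generalizes; a helper like this (an illustrative sketch) reproduces the numbers on this slide and on the set-associative slides that follow:

```python
import math

def cache_fields(cache_bytes, block_bytes, ways=1, addr_bits=32):
    """Return (tag, index, offset) bit widths for a set-associative cache."""
    n_sets = cache_bytes // (block_bytes * ways)
    offset = int(math.log2(block_bytes))   # byte-in-block bits
    index  = int(math.log2(n_sets))        # set-selection bits
    tag    = addr_bits - index - offset
    return tag, index, offset

print(cache_fields(64 * 1024, 32))           # (16, 11, 5) - this slide
print(cache_fields(32 * 1024, 16, ways=2))   # (18, 10, 4) - next slide
print(cache_fields(96 * 1024, 32, ways=3))   # (17, 10, 5) - the Alpha L2 exercise
```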

Page 48: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Accessing a cache

• 32 KB cache, 2-way set-associative, 16-byte block size (cache lines)

[Diagram: the 32-bit address splits into an 18-bit tag, a 10-bit index, and a 4-bit offset (word + byte). 32 KB / 16 bytes / 2 = 1K cache sets (rows 0..1023); each set holds two valid/tag/data entries whose tags are compared (=) in parallel to produce hit/miss.]

Page 49: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Cache Alignment

• The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
• This results in:
– no overlap of cache lines
– easy mapping of addresses to cache lines (no additions)
– data at address X always being present in the same location in the cache block (at byte X mod blocksize) if it is there at all.
• Think of memory as organized into cache-line-sized pieces (because in reality, it is!).
• Recall the DRAM page mode architecture!!

[Diagram: a memory address splits into tag | index | block offset; memory is drawn as consecutive cache-line-sized chunks 0, 1, 2, ...]

Page 50: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Basic Memory Hierarchy questions

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

• Placement: cache structure determines the "where":
– fully associative, direct mapped, 2-way set associative
– S.A. mapping = block number modulo number of sets
• Identification: tag on each block
– no need to check index or block offset
• Replacement: easy for direct mapped
• Set associative or fully associative:
– random
– LRU (least recently used)

Page 51: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

What about writes?

• Stores must be handled differently than loads, because...
– they don't necessarily require the CPU to stall
– they change the content of cache/memory (creating memory consistency issues)
– they may require a load and a store to complete

Page 52: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

What about writes?

• On a store hit, write the new data to the cache. In a write-through cache, also write the data immediately to memory. In a write-back cache, mark the line as dirty.
• On a store miss, initiate a cache block load from memory in a write-allocate cache. Write directly to memory in a write-around cache.
• On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory.

• Keep memory and cache identical?
– write-through => all writes go to both cache and main memory
– write-back => writes go only to cache; modified cache lines are written back to memory when the line is replaced
• Make room in cache for a store miss?
– write-allocate => on a store miss, bring the written line into the cache
– write-around => on a store miss, ignore the cache

A sketch of these policy combinations in code follows.
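A condensed sketch of the write-policy decision tree (illustrative only; the function just enumerates which transfers each policy combination triggers, per the bullets above):

```python
def store_actions(hit, dirty_victim, *, write_through, write_allocate):
    """Which transfers happen on a store, per the slide's policy taxonomy."""
    acts = []
    if hit:
        acts.append("write data into cache line")
        acts.append("write to memory" if write_through else "mark line dirty")
    else:
        if write_allocate:
            if dirty_victim and not write_through:
                acts.append("write back dirty victim line")  # write-back only
            acts.append("load missed line into cache")
            acts += store_actions(True, False,               # then store as a hit
                                  write_through=write_through,
                                  write_allocate=write_allocate)
        else:
            acts.append("write directly to memory (write-around)")
    return acts

# e.g. write-back + write-allocate, store miss that evicts a dirty line:
print(store_actions(False, True, write_through=False, write_allocate=True))
```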

Page 53: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Cache Performance

TCPI = BCPI + MCPI
– where TCPI = total CPI;
– BCPI = base CPI, i.e. the CPI assuming perfect memory;
– MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.

MCPI = accesses/instruction * miss rate * miss penalty
– this assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
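The same model in a couple of lines (the example inputs reuse the numbers from the earlier Impact-on-Performance slide):

```python
def total_cpi(bcpi, accesses_per_instr, miss_rate, miss_penalty):
    """TCPI = BCPI + MCPI, with MCPI = accesses/instr * miss rate * miss penalty."""
    return bcpi + accesses_per_instr * miss_rate * miss_penalty

# 1.1 base CPI, 0.3 data accesses/instr, 10% miss rate, 50-cycle penalty
print(total_cpi(1.1, 0.3, 0.10, 50))  # 2.6
```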

Page 54: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Example -- DEC Alpha 21164 Caches

[Diagram: the 21164 CPU core connects to split instruction and data caches, which back onto a unified on-chip L2 cache and an off-chip L3 cache.]

• ICache and DCache -- 8 KB, DM, 32-byte lines
• L2 cache -- 96 KB, 3-way SA, 32-byte lines
• L3 cache -- 1 MB, DM, 32-byte lines

Page 55: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

DEC Alpha 21164's L2 cache: 96 KB, 3-way set associative, 32-byte lines, 64-bit addresses

memory address: tag | index | offset

How many "offset" bits? How many "index" bits? How many "tag" bits?
Draw the cache picture - how do you tell if it's a hit?

Page 56: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

• 8 KB cache, direct-mapped, 32-byte lines

[Diagram: the 32-bit address splits into a 19-bit tag, an 8-bit index, a 3-bit word offset, and a 2-bit byte offset. 8 KB / 32 bytes = 256 cache blocks/sets (rows 0..255 of valid/tag/data); the indexed tag is compared (=) against the address tag to give hit/miss.]

Page 57: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

• 96 KB cache, 3-way set-associative, 32-byte block size (cache lines)

• => each block holds 32 bytes
• => 3 bits (plus 2 byte bits) locate the right 32-bit word within a block
• => each "index" has 3 blocks (3-way): in effect, three places the data could be placed
• => 96 bytes per "row"
• => 96 KB / 96 B = 1K rows
• => index = 10 bits
• => tag = 32 - (10 + 3 + 2) = 17 bits

Page 58: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU


Pentium 4: Trace Cache and Instruction Decode

Remember Micro-Instructions??

Page 59: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Power4

[Diagram: each Power4 core has a 64 KB I-cache and a 32 KB D-cache (128-byte lines, 2-way set associative, write-through). Two cores share an on-chip L2 of 1.44 MB (128-byte lines, 8-way set-associative, write-back). The multi-chip module (processors P0..P7) shares an off-chip L3 of 128 MB (512-byte lines, 8-way set-associative, write-back).]

Page 60: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Real Caches

• Often DM at the highest level (closest to the CPU), associative further away
• Split I and D close to the processor, unified further away (for throughput rather than miss rate)
• Write-through and write-back both common, but never write-through all the way to memory
• 32-byte cache lines very common
• Non-blocking
– the processor doesn't stall on a miss, but only on the use of a miss (if even then)
– this means the cache must be able to handle multiple outstanding accesses

Page 61: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Recall: Levels of the Memory Hierarchy

Level           Capacity     Access Time    Cost                 Staging / Xfer Unit (managed by)
CPU Registers   100s bytes   < 10s ns                            instr. operands, 1-8 bytes (prog./compiler)
Cache           K bytes      10-100 ns      $.01-.001/bit        blocks, 8-128 bytes (cache cntl)
Main Memory     M bytes      100 ns - 1 us  $.01-.001            pages, 512-4K bytes (OS)
Disk            G bytes      ms             10^-3 - 10^-4 cents  files, Mbytes (user/operator)
Tape            infinite     sec-min        10^-6 cents

Upper levels are faster; lower levels are larger.

Page 62: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Virtual Memory

• Main memory can act as a cache for the secondary storage (disk)
• Advantages:
– illusion of having more physical memory
– program relocation
– protection

[Diagram: address translation maps virtual addresses either to physical addresses in main memory or to disk addresses.]

Page 63: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Virtual Memory

• ... is just caching, but uses different terminology:

cache        VM
block        page
cache miss   page fault
address      virtual address
index        physical address (sort of)

• What happens if another program in the processor uses the same addresses that yours does?
• What happens if your program uses addresses that don't exist in the machine?
• What happens to "holes" in the address space your program uses?

• So, virtual memory provides:
– performance (through the caching effect)
– protection
– ease of programming/compilation
– efficient use of memory

Page 64: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Virtual Memory

• ... is just a mapping function from virtual memory addresses to physical memory locations, which allows caching of virtual pages in physical memory.
• Why is it different?
• MUCH higher miss penalty (millions of cycles)! Therefore:
– large pages [the equivalent of a cache line] (4 KB to MBs)
– associative mapping of pages (typically fully associative)
– software handling of misses (but not hits!!)
– write-through is not an option, only write-back

Page 65: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Page Tables

• A page table maps virtual memory addresses to physical memory locations
– a page may be in main memory or on disk; a "valid bit" says which.

[Diagram: the virtual page number (high-order bits of the virtual address) indexes the page table.]

Page 66: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Page Tables

[Diagram: the virtual address splits into a virtual page number and a page offset. The page table register (a hardware register that tells the VM system where to find the page table) locates the table; the virtual page number indexes it to fetch a valid bit and a physical page number, which is concatenated with the page offset to form the physical address.]

• Why don't we need a "tag" field?
• How does the operating system switch to a new process???

A translation sketch in code follows.
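The translation step as code (a simplified sketch: single-level table, 4 KB pages, no permission bits; the page-table contents are toy data):

```python
PAGE_BITS = 12                     # 4 KB pages
PAGE_SIZE = 1 << PAGE_BITS

# page_table[vpn] = (valid, physical page number); normally found via the
# page table register, simplified here to a Python list
page_table = [(True, 7), (False, 0), (True, 3)]

def translate(va: int) -> int:
    vpn    = va >> PAGE_BITS          # virtual page number: high-order bits
    offset = va & (PAGE_SIZE - 1)     # page offset passes through unchanged
    valid, ppn = page_table[vpn]
    if not valid:
        raise RuntimeError("page fault: OS must fetch the page from disk")
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x0123)))   # vpn 0 -> ppn 7: 0x7123
print(hex(translate(0x2456)))   # vpn 2 -> ppn 3: 0x3456
```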

Page 67: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Virtual Address and a Cache

[Diagram: the CPU issues a VA; translation produces a PA, which accesses the cache; on a miss, the PA goes to main memory.]

It takes an extra memory access to translate a VA to a PA.

This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.

ASIDE: Why access the cache with a PA at all? VA caches have a problem!
– synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
– for an update: you must update all cache entries with the same physical address, or memory becomes inconsistent
– determining this requires significant hardware: essentially an associative lookup on the physical address tags to see if you have multiple hits; or
– a software-enforced alias boundary: the same low-order bits of VA & PA, up to the cache size

Page 68: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

TLBs

A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB.

[TLB entry fields: Virtual Address | Physical Address | Dirty | Ref | Valid | Access]

TLB access time is comparable to cache access time (much less than main memory access time).

Page 69: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Translation lookaside buffer (TLB): a hardware cache for the page table.
Typical size: 256 entries, 1- to 4-way set associative.
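A toy TLB in front of the page-table walk from the earlier sketch (fully associative with LRU, per the small-TLB discussion on the next slide; the sizes and page-table contents are illustrative):

```python
from collections import OrderedDict

PAGE_BITS = 12
TLB_ENTRIES = 4

page_table = {0: 7, 2: 3, 5: 9}        # vpn -> ppn for valid pages (toy data)
tlb = OrderedDict()                     # vpn -> ppn, kept in LRU order

def translate(va: int) -> int:
    vpn, offset = va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                      # TLB hit: no page-table access needed
        tlb.move_to_end(vpn)
        ppn = tlb[vpn]
    else:                               # TLB miss: walk the page table
        if vpn not in page_table:
            raise RuntimeError("page fault")
        ppn = page_table[vpn]
        if len(tlb) == TLB_ENTRIES:
            tlb.popitem(last=False)     # evict the least recently used entry
        tlb[vpn] = ppn
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x5ABC)))  # miss, then fill: 0x9ABC
print(hex(translate(0x5DEF)))  # hit on the same page: 0x9DEF
```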

Page 70: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Translation Look-Aside Buffers

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.

TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.

[Diagram, translation with a TLB: the CPU's VA goes to the TLB lookup; a hit yields the PA directly, while a miss falls back to the full translation; the PA then accesses the cache, and a cache miss goes to main memory. Indicative latencies: TLB lookup ~1/2 t, cache ~t, main memory ~20 t.]

Page 71: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Translation Look-Aside Buffers

Machines with TLBs go one step further to reduce the number of cycles per cache access: overlap the cache access with the TLB access.

This works because the high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.

Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.

This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.

Page 72: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU


All together!

Page 73: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU


• Virtual memory provides:

– protection

– sharing

– performance

– illusion of large main memory

• Virtual Memory requires twice as many memory accesses, so we cache page table entries in the TLB.

• Four things can go wrong on a memory access: cache miss, TLB miss (but page table in cache), TLB miss and page table not in cache, page fault.

Page 74: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Summary

• The Principle of Locality:
– programs are likely to access a relatively small portion of the address space at any instant of time
• Temporal locality: locality in time
• Spatial locality: locality in space
• Three major categories of cache misses:
– compulsory misses: sad facts of life; example: cold start misses
– conflict misses: increase cache size and/or associativity
(nightmare scenario: ping-pong effect!)
– capacity misses: increase cache size
• Cache design space:
– total size, block size, associativity
– replacement policy
– write-hit policy (write-through, write-back)
– write-miss policy

Page 75: Instruction Set Architectures   Performance issues   ALUs   Single Cycle CPU

Summary

• Caches, TLBs, and virtual memory are all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
• Page tables map virtual addresses to physical addresses
• TLBs are important for fast translation
• TLB misses are significant in processor performance (funny times, as most systems can't access all of the 2nd-level cache without TLB misses!)
• Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
– 1000X DRAM growth removed the controversy
• Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is more important than the memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops):
what does this mean to compilers, data structures, and algorithms?
