Understanding DRAM Architecture
R. Govindarajan Computer Science & Automation
Supercomputer Edn. & Res. Centre
Indian Institute of Science, Bangalore
Why Study Memory System?
• Memory Wall [McKee’94]
– CPU-Memory speed disparity
– 100’s of cycles for off-chip access
[Figure: processor-memory performance gap over time. Processor performance doubles every 1.5 years (Moore's Law); DRAM performance doubles every 10 years; the gap grows ~50% per year.]
Memory Hierarchy: Recap
[Figure: CPU with MMU, split L1 I-cache and L1 D-cache, a unified L2 cache, and main memory.]
Memory Hierarchy in Multicore
[Figure: four cores (C0-C3), each with a private L1 cache, sharing an L2 cache and a memory controller in front of memory. Typical access latencies: 1-2 cycles for L1, 10-15 cycles for L2, 100-300 cycles for memory.]
[Figure: a sixteen-core processor (Core 0 ... Core 15), each core with private L1I/L1D and L2 caches, sharing an L3 cache (LLSC). On an L3 hit the data is returned on chip; on a miss the memory controller forwards the request to off-chip main memory.]
Memory Bandwidth Demand for Multicores
• Memory Wall [McKee’94]
– CPU-Memory speed disparity
– 100’s of cycles for off-chip access
• Bandwidth Wall [ISCA’09]
– More cores, but limited off-chip bandwidth
– Core counts double every 18 months
– Pin count grows by only 10%
Off-chip accesses are expensive! Memory system performance is critical.
Big Picture of Memory
[Figure: the memory controller issues Control and Address signals and exchanges Data with a DIMM for read and write operations; a DIMM is organized into ranks, and each rank into devices.]
Overview of a DRAM Memory Bank
[Figure: a DRAM bank is a 2-D array of cells organized into rows and columns, with bank logic and a row buffer at the edge of the array.]
Basic DRAM Operations
• ACTIVATE: bring one row of data from the DRAM core into the row buffer
• READ/WRITE: perform read/write operations on the contents of the row buffer
• PRECHARGE: write the row-buffer contents back to the DRAM core (ACTIVATE discharges the cell capacitors) and return the cells to a neutral voltage
[Figure: command timelines for a stream of load requests. A row-buffer miss requires PRECHARGE, ACTIVATE, then READ; a row-buffer hit needs only a READ.]
Row-buffer hits are faster and consume less power, as the latency sketch below illustrates.
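A minimal sketch of the hit/miss latency difference. The per-command cycle counts are illustrative assumptions (the slides give no concrete timings); only the command sequences come from the slide above.

```python
# Assumed latency (in DRAM cycles) per command -- illustrative, not vendor timings.
T_PRE, T_ACT, T_RD = 15, 15, 15

def access_latency(row_buffer_hit: bool) -> int:
    """Row-buffer hit: READ only. Miss: PRECHARGE + ACTIVATE + READ."""
    if row_buffer_hit:
        return T_RD
    return T_PRE + T_ACT + T_RD

print("hit :", access_latency(True))   # 15 cycles
print("miss:", access_latency(False))  # 45 cycles -- 3x the hit latency
```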
DRAM Bank Operation
(Slide Source: Onur Mutlu, CMU)
• Access (Row 0, Column 0): the row buffer is empty, so the row decoder is given row address 0 and Row 0 is activated into the row buffer; the column decoder then selects column address 0 and the data is driven out.
• Access (Row 0, Column 1): the row buffer already holds Row 0, a row-buffer HIT. Only a column access (column address 1) is needed.
• Access (Row 1, Column 0): the row buffer holds Row 0, a row-buffer CONFLICT! Row 0 must first be precharged back into the array, then row address 1 is activated into the row buffer, and only then can column address 0 be read.
A minimal open-row bank model that replays this sequence is sketched below.
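To make the walkthrough concrete, here is a minimal sketch of one bank under an open-row policy. The `Bank` class and its layout are illustrative assumptions; only the command names (ACT, RD, PRE) and the empty/hit/conflict cases come from the slides.

```python
class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row: int, col: int) -> list[str]:
        """Return the DRAM command sequence needed to serve this access."""
        if self.open_row is None:            # row buffer empty
            cmds = [f"ACT row {row}", f"RD col {col}"]
        elif self.open_row == row:           # row-buffer hit
            cmds = [f"RD col {col}"]
        else:                                # row-buffer conflict
            cmds = [f"PRE row {self.open_row}", f"ACT row {row}", f"RD col {col}"]
        self.open_row = row
        return cmds

bank = Bank()
for row, col in [(0, 0), (0, 1), (1, 0)]:
    print((row, col), "->", bank.access(row, col))
# (0, 0) -> ['ACT row 0', 'RD col 0']                (empty row buffer)
# (0, 1) -> ['RD col 1']                              (hit)
# (1, 0) -> ['PRE row 0', 'ACT row 1', 'RD col 0']    (conflict)
```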
DRAM Command Summary
Slide Source: S. Rixner
DRAM Memory Controller
• Frontend
– Request/Response Buffers
– Memory mapping
– Arbiter
• Controller Backend
– Command generator
– DRAM timing constraints must be obeyed
Slide Source: S. Rixner
Bank-Level Parallelism in DRAM
[Figure: the memory controller drives Control, Address, and Data buses shared by multiple banks. Loads Ld A0, Ld B1, Ld B2, and Ld C1 arrive in the memory request queue; their PRE, ACT, and RD command sequences target different banks and are issued back-to-back, overlapping in time, and a later access to an already-open row (Ld C1) needs only an RD.]
Bank-level parallelism:
• Improves performance through parallelism across banks and row-buffer hits
• Hurts performance due to bank-to-bank switch delays
A rough timing sketch of the overlap appears below.
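A crude sketch of why bank-level parallelism helps: command sequences to different banks can pipeline over the shared buses, while requests to a single conflicting bank serialize. The timings and the one-command-per-slot bus model are illustrative assumptions.

```python
T_CMD = 15          # assumed cycles per PRE/ACT/RD command
SEQ = 3 * T_CMD     # full PRE+ACT+RD sequence for one row-buffer miss

def serialized(n_requests: int) -> int:
    """All requests conflict in the same bank: sequences cannot overlap."""
    return n_requests * SEQ

def overlapped(n_requests: int, n_banks: int) -> int:
    """Requests spread across banks: after the first full sequence, each
    extra request only occupies one command slot on the shared bus."""
    if n_banks < n_requests:
        return serialized(n_requests)
    return SEQ + (n_requests - 1) * T_CMD

print(serialized(4))     # 180 cycles on one bank
print(overlapped(4, 4))  # 90 cycles spread over four banks
```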
DRAM Refresh
• Capacitors leak and lose charge, so the charge must be periodically restored
• JEDEC spec: at normal temperature, cell retention time is 64 ms; at high (extended) temperature, retention time halves to 32 ms
• The memory controller issues refresh operations periodically
[Figure: timeline of normal accesses interleaved with a refresh operation.]
Refresh overhead example:
• Assume 4 GB of DRAM with 2 KB pages, organized as 16 banks
• 2M pages total, 128K pages per bank
• Refreshing a page takes 20 ns (ACTIVATE + PRECHARGE)
• Refreshing all pages in a bank takes 128K × 20 ns ≈ 2.6 ms
• 2.6 ms out of every 64 ms: roughly 4% overhead!
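The slide's arithmetic, checked in Python; all numbers are from the slide.

```python
GiB = 2**30
capacity       = 4 * GiB      # 4 GB DRAM
page_size      = 2 * 2**10    # 2 KB pages (rows)
n_banks        = 16
t_refresh_page = 20e-9        # 20 ns per page (ACTIVATE + PRECHARGE)
t_retention    = 64e-3        # 64 ms retention at normal temperature

pages_total    = capacity // page_size            # 2M pages
pages_per_bank = pages_total // n_banks           # 128K pages per bank
t_bank_refresh = pages_per_bank * t_refresh_page  # ~2.6 ms
overhead       = t_bank_refresh / t_retention     # ~4%

print(f"{pages_total:,} pages, {pages_per_bank:,} per bank")
print(f"refresh time per bank: {t_bank_refresh * 1e3:.2f} ms")
print(f"overhead: {overhead:.1%}")
```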
Memory Access Scheduling: FCFS
(Slide Source: Onur Mutlu, CMU)
Request buffer, oldest first: Row 0;Col 0, Row 0;Col 1, Row 1;Col 0, Row 0;Col 4 (Row 0;Col 5 arrives later).
FCFS services requests strictly in arrival order:
• Row 0;Col 0: Row 0 is activated into the row buffer and column 0 is read
• Row 0;Col 1: row-buffer hit, column access only
• Row 1;Col 0: row-buffer conflict; Row 0 is precharged, Row 1 activated, column 0 read
• Row 0;Col 4 and Row 0;Col 5: another conflict; Row 1 must be closed and Row 0 re-activated
FCFS pays two row conflicts on this sequence.
Memory Access Scheduling: FR-FCFS
• A row-conflict memory access takes significantly longer than a row-hit access
• Current controllers take advantage of the row buffer
• Commonly used scheduling policy: FR-FCFS [Rixner, ISCA’00]
(1) Row-hit (column) first: service row-hit memory accesses first
(2) Oldest first: then service older accesses first
• This scheduling policy aims to maximize DRAM throughput
(Slide Source: Onur Mutlu, CMU)
A scheduler sketch contrasting FCFS and FR-FCFS follows.
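A minimal sketch of the two policies picking from a request buffer. The `(arrival_order, row, col)` tuple encoding and the loop driver are illustrative assumptions; the two selection rules are the ones named on the slide.

```python
def pick_fcfs(buffer, open_row):
    """FCFS: always service the oldest request."""
    return min(buffer)                     # smallest arrival order

def pick_fr_fcfs(buffer, open_row):
    """FR-FCFS: service the oldest row-hit if any, else the oldest request."""
    hits = [r for r in buffer if r[1] == open_row]
    return min(hits) if hits else min(buffer)

for pick in (pick_fcfs, pick_fr_fcfs):
    buf = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 0, 4)]
    open_row, order = 0, []
    while buf:
        req = pick(buf, open_row)
        buf.remove(req)
        open_row = req[1]                  # the row is (re)opened by this access
        order.append(req)
    print(pick.__name__, order)
# pick_fcfs    [(0,0,0), (1,0,1), (2,1,0), (3,0,4)]  -- two row conflicts
# pick_fr_fcfs [(0,0,0), (1,0,1), (3,0,4), (2,1,0)]  -- one row conflict
```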
Memory Access Scheduling: FR-FCFS
(Slide Source: Onur Mutlu, CMU)
Same request buffer as before: Row 0;Col 0, Row 0;Col 1, Row 1;Col 0, Row 0;Col 4 (Row 0;Col 5 arrives later).
FR-FCFS reorders around the open row:
• Row 0;Col 0, then Row 0;Col 1: serviced as row-buffer hits
• Row 0;Col 4, then Row 0;Col 5: also row hits, so they bypass the older Row 1;Col 0 request
• Row 1;Col 0: serviced last; Row 0 is precharged and Row 1 activated
Only one row conflict is paid, at the cost of delaying the Row 1 request.
Emerging Memory Technology
• Non-volatile memory technologies
– Phase Change Memory (PCM), Magnetic RAM (MRAM), Resistive RAM (RRAM), Spin Torque Transfer RAM (STT-RAM), …
(Slide Source: Moinuddin Qureshi, Georgia Tech.)
Emerging Memory Technology
• Phase Change Memory
– Data stored by changing the phase of a special material
– Data read by detecting the material’s resistance
– The phase change material (chalcogenide glass) exists in two states:
1. Amorphous: high resistivity (reset state, or 0)
2. Crystalline: low resistivity (set state, or 1)
– Non-volatility and low idle power (no refresh)
– Expected to scale (to 9 nm), denser than DRAM, and can store multiple bits/cell
– Higher write latency and write energy
– Endurance issues (a cell dies after ~10^8 writes)
(Slide Source: Onur Mutlu, CMU)
DRAM-PCM Hybrid Memory
• How should PCM-based (main) memory be organized?
• Hybrid PCM+DRAM:
– How to partition/migrate data between PCM and DRAM?
– Is DRAM a cache for PCM, or part of main memory?
– How to design the hardware and software?
(Slide Source: Onur Mutlu, CMU)
PCM-based Main Memory
• How should PCM-based (main) memory be organized?
• Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:
– How to redesign the entire hierarchy (and cores) to overcome PCM shortcomings?
(Slide Source: Onur Mutlu, CMU)
Expanding the Multicore Memory Hierarchy
[Figure: the four-core hierarchy from before (C0-C3 with private L1 caches and a shared L2), now with a DRAM cache inserted between the L2 and the memory controller in front of memory.]
Stacked DRAM
• DRAM vertically stacked over the processor die
• Stacked DRAMs offer:
– High bandwidth
– Large capacity
– Same or slightly lower latency
[Figure: two organizations, 3-D stacked DRAM and 2.5-D stacked DRAM.]
• Stacked DRAM can be used as a cache or as part of memory
Multicore with DRAM Cache
[Figure: processor with stacked DRAM. Core 0 ... Core N, each with private L1I/L1D, share an L2 (LLSC). A vertically stacked DRAM cache sits behind the L2: a hit is served from the DRAM cache; a miss goes through the memory controller to off-chip main memory.]
Overview of Different Designs
• A main-memory physical address PA_M maps to a DRAM-cache address: PA_DRAM$ = Hash(PA_M) + Offset
Problems in Architecting Large Caches
• Organizing at cache-line granularity (64 B) reduces wasted space and wasted bandwidth
• Problem: a cache of hundreds of MB needs a tag store of tens of MB
– E.g., a 256 MB DRAM cache needs a ~20 MB tag store (5 bytes/line); see the arithmetic below
• But big blocks have their own issues:
– Wasted off-chip bandwidth
– Wasted cache space
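Reproducing the tag-store arithmetic in Python; the 5 bytes of tag/metadata per line is the slide's estimate.

```python
MiB = 2**20
cache_size   = 256 * MiB
line_size    = 64        # bytes per cache line
tag_per_line = 5         # bytes of tag + metadata per line (slide's estimate)

n_lines  = cache_size // line_size   # 4M lines
tag_size = n_lines * tag_per_line    # 20 MB

print(f"{n_lines:,} lines -> {tag_size / MiB:.0f} MB tag store")
```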
• Option 1: SRAM tags. Fast, but impractical (not enough transistors).
• Option 2: Tags in DRAM. The naive design has 2x latency (two accesses: tag, then data).
Stacked DRAM Caches
Tags-On-DRAM:
• Cache tags held in the DRAM itself
• Typically 64 B blocks
• Due to the overhead of tag access from DRAM, requires some form of predictor/cache in SRAM
• Several recent proposals (Loh-Hill, AlloyCache, ATCache, Bi-Modal)
Tags-On-SRAM:
• Cache tags held in SRAM
• SRAM is expensive and the storage overhead is large
• So typically uses larger block sizes (~1 KB) to reduce overhead
• Off-chip bandwidth and cache utilization are concerns
• Several recent proposals (FootPrintCache, CHOP)
Stacked DRAM Cache Organization
[Figure: the processor-with-stacked-DRAM picture, shown for two metadata placements. (a) Metadata (tags) in SRAM alongside the L2. (b) Metadata in the DRAM cache itself, with a small SRAM tag predictor (Tag-Pred) consulted before the DRAM-cache access. In both, a DRAM-cache hit avoids the memory controller and off-chip main memory.]
Overview of Different Designs
• For DRAM caches, it is critical to optimize first for latency, then for hit rate
[Figure: design points mapping a main-memory address PA_M to a DRAM-cache address via PA_DRAM$ = Hash(PA_M) + Offset.]
A minimal sketch of this style of mapping follows.
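A minimal sketch of the PA_DRAM$ = Hash(PA_M) + Offset mapping: the block-aligned physical address is hashed to a DRAM-cache set, and the offset locates the byte within the block. The specific hash (a simple modulo) and the set count are illustrative assumptions, not what any one design uses.

```python
BLOCK  = 64      # bytes per cache block
N_SETS = 4096    # assumed number of DRAM-cache sets

def dram_cache_location(pa_m: int) -> tuple[int, int]:
    """Map a main-memory physical address to (set index, block offset)."""
    block_addr = pa_m // BLOCK
    offset     = pa_m % BLOCK
    set_index  = block_addr % N_SETS   # the Hash(PA_M) step
    return set_index, offset

print(dram_cache_location(0x1234_5678))
```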
Overview of Bi-Modal Cache
• Tags-On-DRAM organization
• With 3 new organizational features:
1) Cache sets are bi-modal: they can hold a combination of big (512 B) and small (64 B) blocks, which improves hit rate and reduces off-chip bandwidth
2) Parallel tag and data accesses, which reduce hit latency
3) Eliminating most tag accesses via a small SRAM-based way locator, which also reduces hit latency
Supporting Bi-Modal Block Sizes
• Each set can hold some big (512 B) and some small (64 B) blocks
• Block size predictor:
– Blocks with high spatial reuse fetch 512 B
– Blocks with little spatial reuse fetch 64 B
[Figure: on a DRAM-cache miss, the predictor chooses "big" (fetch a 512 B block) or "small" (fetch a 64 B block); a DRAM cache set holds a mix, e.g. 512B 512B 512B 64 64 64 …]
A predictor sketch appears below.
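A minimal sketch of a block-size predictor: track how many distinct 64 B lines of each 512 B region were touched, and fetch big blocks for regions with high spatial reuse. The table structure and threshold are illustrative assumptions, not the Bi-Modal Cache's exact mechanism.

```python
from collections import defaultdict

BIG, SMALL = 512, 64
THRESHOLD = 4                     # assumed: >= 4 distinct lines => big block

lines_touched = defaultdict(set)  # region address -> set of line offsets seen

def record_access(addr: int) -> None:
    region = addr // BIG
    lines_touched[region].add((addr % BIG) // SMALL)

def predict_fetch_size(addr: int) -> int:
    region = addr // BIG
    return BIG if len(lines_touched[region]) >= THRESHOLD else SMALL

for a in (0x1000, 0x1040, 0x1080, 0x10C0):   # four lines in one 512 B region
    record_access(a)
print(predict_fetch_size(0x1000))  # 512 -- high spatial reuse
print(predict_fetch_size(0x9000))  # 64  -- region never seen before
```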
Parallel Tag and Data Accesses
[Figure: each DRAM-cache channel interleaves data banks (D) with a dedicated metadata bank (M), so a tag access and a data access target different banks and proceed in parallel. A page in the DRAM cache holds a run of tags (T T T T ...) followed by data. The metadata bank enjoys a high row-buffer hit rate.]
Eliminating a Majority of Tag Accesses using the Way Locator
A small 2-way set-associative SRAM structure; each entry specifies a tag and the associated way (DRAM column) where the data is stored. The address selects a set.

        MRU: Tag and Way    MRU-1: Tag and Way
Set 0   Tag a1, Way 3       Tag a2, Way 1
Set 1   Tag a3, Way 2       Tag a4, Way 0
Set 2   Tag a5, Way 0       Tag a6, Way 3
…
Set N   Tag am, Way x       Tag an, Way y
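A minimal sketch of the way locator: entries map a block tag to the DRAM-cache way (column) holding its data, so a locator hit skips the in-DRAM tag lookup. The set count, entry layout, and MRU replacement details are illustrative assumptions.

```python
N_SETS = 4

locator = [[] for _ in range(N_SETS)]   # each set: up to 2 (tag, way) entries

def lookup(set_idx: int, tag: int):
    """Return the way if the tag is in the locator, else None."""
    for i, (t, way) in enumerate(locator[set_idx]):
        if t == tag:
            entry = locator[set_idx].pop(i)
            locator[set_idx].insert(0, entry)   # promote to MRU position
            return way
    return None

def insert(set_idx: int, tag: int, way: int) -> None:
    locator[set_idx].insert(0, (tag, way))      # insert at MRU position
    del locator[set_idx][2:]                    # keep only MRU and MRU-1

insert(0, tag=0xA1, way=3)
insert(0, tag=0xA2, way=1)
print(lookup(0, 0xA1))   # 3    -- locator hit: go straight to way 3's data
print(lookup(0, 0xB0))   # None -- fall back to the in-DRAM tag access
```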
Putting Them Together
On a last-level cache miss:
• Consult the way locator. On a locator hit, access the DRAM cache data directly: the lowest-latency path.
• On a locator miss, access the DRAM-cache tag and data in parallel. If the tag matches, it is a cache hit: return the data.
• If the tag does not match, it is a cache miss: the block size predictor chooses big or small, and a 512 B or a 64 B block is fetched.
Hit Latency Improvement
• The way-locator hit is the common case: over 85% of accesses