CS211 Advanced Computer Architecture
L02 Review
Chundong Wang, September 9th, 2020
L01 Survey
Admin
What is covered by CA?
What is covered by CA?
• Instructions and micro-codes
• Instruction execution: pipeline, in-order or out-of-order, speculation, etc.
• Memory hierarchy: cache, main memory, disk, etc.
• Exceptions, interrupts, etc.
• I/O
• Single-threaded or multi-threaded execution
Pipeline: instruction-level parallelism
Cycle   1  2  3  4  5  6  7  8  9
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W
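With the classic five stages (F: fetch, D: decode, X: execute, M: memory, W: writeback) and no hazards, one instruction completes per cycle once the pipeline is full, so N instructions take N + 4 cycles: the five instructions above finish at cycle 9, versus 5 x 5 = 25 cycles without pipelining. The speedup approaches the pipeline depth as N grows.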
From single-core to multi-core
• Multi-core is not multi-threading
  • A single core can support multi-threading; multi-threading is older than multi-core
  • Intel introduced "hyper-threading" in 2002: virtually, one core becomes two
• The era of multi-core
  • Intel with the Core 2 Duo, AMD with the Athlon 64 X2, in 2005/2006
  • "From Single Core to Multi-Core: Preparing for a New Exponential", in ICCAD '06
• Multi-core
  • Replicate multiple cores on a single die
  • The operating system perceives each core as a separate processor
• Why multi-core?
  • It is difficult to push single-core clock frequencies even higher, i.e., the power wall
  • Multi-threaded applications demand more parallelism
• Problems that come with multi-core?
  • Cache coherence, scheduling, interconnect, etc.
Memory system
• Memory hierarchy
  • Registers, L1/L2/L3 caches, main memory, flash memory, hard disk
A typical memory hierarchy for a desktop:

Level   Registers    L1 Cache   L2 Cache   L3 Cache   Memory      Storage
Size    2000 bytes   64 KB      256 KB     8-32 MB    8-64 GB     256 GB-2 TB
Speed   300 ps       1 ns       3-10 ns    10-20 ns   50-100 ns   50-100 us
Memory system
• Memory hierarchy
  • Registers, L1/L2/L3 caches, main memory, flash memory, hard disk
• Locality: spatial locality and temporal locality (see the sketch after this slide)
  • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon
  • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
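A minimal C sketch of both kinds of locality (the array size is an arbitrary illustrative choice, not from the slides); C stores 2-D arrays in row-major order, so traversal order decides how much of each fetched cache block is actually used:

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: the inner loop walks consecutive addresses,
     * so every byte of each fetched cache block gets used.
     * `sum` itself shows temporal locality: it is re-referenced every iteration. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: column-major traversal strides N * 8 bytes
     * between accesses, touching a different cache block nearly every time. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```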
The Philosophy behind CPU Cache

• A programmer-invisible hardware mechanism
• Gives the illusion of the speed of the fastest memory with the size of the largest memory
• Works fine even if the programmer has no idea what a cache is
• However, performance-oriented programmers today sometimes "reverse engineer" the cache design to make their data structures match the cache
• And modern programming languages try to provide storage abstractions that offer flexibility while still caching well
• It does have limits: when you overwhelm the cache, your performance may drop off a cliff...
The involvement of cache

(Figure: the processor, with its control and datapath (PC, registers, ALU), now accesses memory through a cache. The processor-memory interface carries an address, read/write enable signals, write data, and read data; memory holds both program and data bytes and connects to input/output through the I/O-memory interfaces.)
CPU Cache
• A CPU cache is a small, fast memory (mostly SRAM) that acts as a buffer for the DRAM main memory
• Cache block: the unit of exchange between the cache and main memory
  • Also known as a cache line in state-of-the-art research papers
  • Each block has an address from the main memory
• Four questions related to CPU cache:
  1. Where can a block be placed in a cache?
  2. How is a block found if it is in the cache?
  3. Which block should be replaced on a cache miss?
  4. What happens on a write?
Where to place a block in a cache
• Direct mapped: each block maps to exactly one location
  • Location = (Block address) % (# of blocks in cache)
  • e.g., block 9527 in an 8-block cache goes to 9527 % 8 = 7
• Fully associative: a block can be placed anywhere in the cache
• Set associative: the cache is partitioned into multiple sets; a block can be placed anywhere within one set
  • Set = (Block address) % (# of sets)
  • e.g., with four sets of two ways each, block 9527 goes to set 9527 % 4 = 3

A direct mapped cache can be viewed as a set associative cache with N sets and one way. A fully associative cache can be viewed as a set associative cache with one set and N ways.
How to find a block in a cache

• Block offset: the byte address within the block
• Set index: selects the set the block is in
• Tag: identifies the block, by comparison against the blocks in the selected set
• Size of Index = log2(# of sets)
• Size of Tag = Address size - Size of Index - log2(# of bytes per block)

Address layout: | Tag | Set Index | Block Offset |. A C sketch of this decomposition follows.
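A minimal C sketch of the decomposition, assuming a hypothetical cache with 64-byte blocks and 128 sets (both parameters are illustrative, not from the slides):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64u   /* log2(64)  = 6 offset bits */
#define NUM_SETS    128u  /* log2(128) = 7 index bits  */

int main(void) {
    uint64_t addr = 0x7f3a92c4;  /* an arbitrary byte address */

    uint64_t offset = addr % BLOCK_BYTES;  /* byte within the block   */
    uint64_t block  = addr / BLOCK_BYTES;  /* the block address       */
    uint64_t index  = block % NUM_SETS;    /* which set to look in    */
    uint64_t tag    = block / NUM_SETS;    /* what to compare against */

    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```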
Which block should be the victim for replacement

• From time to time, the cache is almost always full
  • When a cache miss happens, a victim must be chosen and replaced
• Victim selection for a direct-mapped cache is straightforward
  • Why?
• For set- and fully-associative caches:
  • Random, e.g., ARM Cortex-A53 L1 cache
  • First in, first out (FIFO)
  • LRU (least recently used): quite complicated in today's CPUs
  • Pseudo-LRU, or approximated LRU, e.g., Intel Core i7 L1 cache
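As a minimal sketch of what true LRU must do per set (assuming hypothetical per-way last-use timestamps; hardware pseudo-LRU approximates this with only a handful of bits):

```c
#include <stdint.h>

#define WAYS 8  /* illustrative associativity */

/* Pick the way within a set whose block was used least recently.
 * last_used[w] is the (assumed) time of way w's most recent access,
 * which real hardware cannot afford to track exactly. */
static int lru_victim(const uint64_t last_used[WAYS]) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}
```

Tracking and comparing full timestamps per way is exactly the cost that pushes real designs toward tree-based pseudo-LRU, which needs roughly one bit per pair of ways.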
Write policy

• Write through
  • Data is written into both the cache and the lower-level memory
  • Pros: easy to implement, easy to maintain data coherency
  • Cons: slow, and may cause write stalls
• Write back
  • Data is written to the cache only, and to main memory only upon replacement
  • Pros: fast, and multiple writes are batched into one, saving power
  • Cons: more complex control; data coherency must be maintained across multi-level caches and memory
• Write allocate
  • If writing to memory not in the cache, fetch it first
  • Intel Core i7 L2 cache is write allocate
• No write allocate
  • Just write to memory without a fetch
  • Intel Core i7 L1 cache is no write allocate
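A pseudo-C sketch contrasting two common policy pairings; the helper functions (cache_lookup, cache_fill, mem_write) and the line format are assumptions made only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t tag; uint32_t data; bool valid, dirty; } line_t;

extern line_t *cache_lookup(uint64_t addr);  /* assumed: NULL on a miss */
extern line_t *cache_fill(uint64_t addr);    /* assumed: fetch block in */
extern void    mem_write(uint64_t addr, uint32_t data);

/* Write through + no write allocate (a common pairing). */
void write_through(uint64_t addr, uint32_t data) {
    line_t *line = cache_lookup(addr);
    if (line) line->data = data;  /* update the cache only on a hit     */
    mem_write(addr, data);        /* lower-level memory always updated  */
}

/* Write back + write allocate (a common pairing). */
void write_back(uint64_t addr, uint32_t data) {
    line_t *line = cache_lookup(addr);
    if (!line) line = cache_fill(addr);  /* miss: fetch the block first */
    line->data = data;
    line->dirty = true;  /* memory updated only when this line is evicted */
}
```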
Cache performance

• Cache hit and miss
  • The cache is transparent to the system and applications
  • All want high hit rates, right?
• Three types of cache miss
  • Compulsory miss: the first access to a block (cold start)
  • Capacity miss: the cache cannot contain all the blocks needed for executing a program
  • Conflict miss: multiple blocks compete for the same set (or the same block, if direct mapped)
• How to reduce miss rates? (see the worked example below)
  • Increase the cache size?
  • Increase the block size?
  • Increase the associativity (# of ways per set)?
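One standard way to reason about these knobs (the formula is textbook material; the numbers are illustrative) is average memory access time: AMAT = hit time + miss rate x miss penalty. With a 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty, AMAT = 1 + 0.05 x 100 = 6 ns; halving the miss rate to 2.5% cuts AMAT to 3.5 ns, while the same cache with a 10% miss rate would average 11 ns.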
Multi-level cache
(Figure: the processor's control and datapath (PC, registers, ALU) are backed by split L1 instruction and data caches, a unified L2 cache, and finally DRAM main memory, which holds both program and data bytes.)
Now an L3 cache is very common. How to manage multi-level caches will be detailed later in this course.
TA’s time for lab
Admin
Lab 0
• Submission deadline
  • 23:59:59 (UTC+8), Wednesday, 23rd September, 2020
• Check Blackboard or the course website to download it
• Submission format
  • A report about what you have done, what you observe, etc.
  • To be submitted to Blackboard
• A subset of students will be asked to explain their solutions in person
  • Check your email accounts after the submission deadline
Admin
Main Memory
Virtual memory
• The reasons for "virtual" memory
  • Helping programmers manage memory space
  • Protecting the system from applications, and applications from each other
  • Extending memory space with the introduction of disk
• Paging vs. segmentation
  • Pages have a uniform size, e.g., 4096 or 8192 bytes
  • Segments have variable sizes
  • Both come with pros and cons
Paged memory systems

• A processor-generated address can be split into: | page number | offset |
• A page table contains the physical address of the base of each page
• Page tables make it possible to store the pages of a program non-contiguously

(Figure: pages 0-3 of User-1's address space sit scattered in physical memory; User-1's page table maps each virtual page number to the physical page that holds it.)
Private address space per user

• Each user has a page table
• The page table contains an entry for each user page

(Figure: users 1, 2, and 3 each have a virtual page VA1; each user's own page table maps it to a different location in physical memory, which also holds OS pages and free frames.)
Linear page table

• A Page Table Entry (PTE) contains:
  • 1 bit to indicate whether the page exists
  • And either a PPN or a DPN:
    • PPN (physical page number) for a memory-resident page
    • DPN (disk page number) for a page on the disk
  • Status bits for protection and usage (read, write, exec)
• The OS sets the Page Table Base Register whenever the active user process changes

(Figure: the VPN of the virtual address indexes a linear page table located via the PT Base Register; each PTE holds either a PPN pointing to a data page in memory or a DPN for a page on disk, and the page offset is appended to reach the data word.)

The size of a linear page table is a problem. Assume a 64-bit address, 4 KB pages, and 8 B PTEs: that is 2^64 / 2^12 = 2^52 entries, i.e., 2^52 x 8 B = 2^55 B (32 PB) for a single page table.
Hierarchical page table: exploits the sparsity of virtual address space use

• The 32-bit virtual address is split into | p1: 10-bit L1 index (bits 31-22) | p2: 10-bit L2 index (bits 21-12) | offset (bits 11-0) |
• A processor register holds the root of the current page table
• p1 indexes the level-1 page table; its PTE points to one of the level-2 page tables, and p2 indexes that table to reach the data page
• A PTE may refer to a page in primary memory, a page in secondary memory, or a nonexistent page

(Figure: the level-1 page table fans out to level-2 page tables, which in turn point to data pages in physical memory; only the parts of the tree actually in use need to exist.)
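A minimal C sketch of the walk just described, for the 32-bit layout above; the PTE format (a present bit in bit 0, a physical address in the upper bits) and the fault signalling are simplifying assumptions:

```c
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Walk a two-level table: 10-bit p1, 10-bit p2, 12-bit offset.
 * Each level has 1024 4-byte entries. Returns 0 to signal a page
 * fault (purely illustrative; real handlers trap instead). */
uint32_t translate(const uint32_t *l1_table, uint32_t vaddr) {
    uint32_t p1     = (vaddr >> 22) & 0x3FFu;  /* bits 31..22 */
    uint32_t p2     = (vaddr >> 12) & 0x3FFu;  /* bits 21..12 */
    uint32_t offset =  vaddr        & 0xFFFu;  /* bits 11..0  */

    uint32_t l1_pte = l1_table[p1];
    if (!(l1_pte & PTE_PRESENT)) return 0;     /* L2 table absent */

    /* Assumed: the upper bits of an L1 PTE give the L2 table's base. */
    const uint32_t *l2_table =
        (const uint32_t *)(uintptr_t)(l1_pte & ~0xFFFu);

    uint32_t l2_pte = l2_table[p2];
    if (!(l2_pte & PTE_PRESENT)) return 0;     /* page fault */

    return (l2_pte & ~0xFFFu) | offset;        /* PPN ++ offset */
}
```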
Address Translation & Protection

• Every instruction and data access needs address translation and protection checks
• A good VM design needs to be fast (~ one cycle) and space efficient

(Figure: the virtual address, Virtual Page No. (VPN) + offset, goes through address translation and a protection check that consults kernel/user mode and read/write permissions; the result is the physical address, Physical Page No. (PPN) + offset, or an exception.)
Translation Lookaside Buffer (TLB)

• Address translation is very expensive! In a two-level page table, each reference becomes several memory accesses
• Solution: cache some translations in the TLB
  • TLB hit => single-cycle translation
  • TLB miss => page-table walk to refill the TLB

(Figure: the VPN of the virtual address is matched against the TLB tags; a hit returns the entry's PPN along with its valid (V), read (R), write (W), and dirty (D) bits, and the PPN is concatenated with the offset to form the physical address.)
TLB Designs

• Typically 32-128 entries, usually fully associative
  • Each entry maps a large page, hence there is less spatial locality across pages => it is more likely that two entries conflict
  • Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative
  • Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random or FIFO replacement policy
• Upon a context switch? New VM space! Flush the TLB...
• "TLB reach": the size of the largest virtual address space that can be simultaneously mapped by the TLB (worked example below)
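For instance (illustrative numbers): a fully associative 64-entry TLB with 4 KB pages reaches 64 x 4 KB = 256 KB of virtual address space; the same 64 entries mapping 2 MB large pages reach 64 x 2 MB = 128 MB, which is one reason large pages help TLB-bound workloads.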
VM-related events in pipeline
• Handling a TLB miss needs a hardware or software mechanism to refill the TLB
  • Usually done in hardware now
• Handling a page fault (e.g., the page is on disk) needs a precise trap so the software handler can easily resume after retrieving the page
• Handling a protection violation may abort the process

(Figure: a pipeline with PC -> instruction TLB -> instruction cache -> decode -> execute -> data TLB -> data cache -> writeback; a TLB miss, page fault, or protection violation can arise at both the instruction TLB and the data TLB.)
Disk and RAID
Magnetic Disk - a common I/O device

• A kind of computer memory
  • Information is stored by magnetizing ferrite material on the surface of a rotating disk
  • Similar to a tape recorder, except the data is digital rather than analog
• A type of non-volatile storage
  • Retains its value without power applied to the disk
• Magnetic disks: Hard Disk Drives (HDD) are faster, denser, and non-removable
• Purpose in computer systems (Hard Disk Drive):
  1. Working file system + long-term backup for files
  2. Secondary "backing store" for main memory: the large, inexpensive, slow level in the memory hierarchy (virtual memory)
Disk Device Terminology

• Several platters, with information recorded magnetically on both surfaces (usually)
• Bits are recorded in tracks, which in turn are divided into sectors (e.g., 512 bytes)
• The actuator moves the head (at the end of the arm) over the desired track ("seek"), waits for the sector to rotate under the head, and then reads or writes

(Figure: a platter with outer and inner tracks divided into sectors; the actuator's arm positions the head over a track.)
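As a worked example (the decomposition is standard; the numbers are illustrative): access time ≈ seek time + rotational latency + transfer time + controller overhead. At 7200 RPM one rotation takes 60 / 7200 s ≈ 8.3 ms, so the average rotational latency (half a rotation) is about 4.2 ms; with, say, a 6 ms average seek and a sub-millisecond sector transfer, a random access costs roughly 10 ms.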
RAID: Redundant Arrays of Independent (Inexpensive) Disks

• Files are "striped" across multiple disks
• Redundancy yields high data availability
  • Availability: service is still provided to the user, even if some components have failed
• Disks will still fail
  • Contents are reconstructed from data redundantly stored in the array
  => Capacity penalty to store redundant info
  => Bandwidth penalty to update redundant info
RAID 0: Striping

• RAID 0 provides no fault tolerance or redundancy
• Striping, or disk spanning
• High performance

(Figure: blocks A0-A3 striped across four disks, followed by A4-A7 on the next stripe.)
RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its "mirror(s)"
  • Very high availability can be achieved
• Bandwidth sacrifice on write:
  • Logical write = N physical writes
  • Reads may be optimized
• Most expensive solution: 100% capacity overhead
• RAID 10 (striped mirrors) and RAID 01 (mirrored stripes) are combinations of RAID 0 and RAID 1

(Figure: blocks A0 and A1, each duplicated on all four disks.)
RAID 3: Parity Disk

• A logical record is striped into physical records across the data disks, e.g., 10100011 on one disk and 11001101 on another
• The parity disk P contains the sum of the other disks per stripe, mod 2 ("parity")
• If a disk fails, subtract P from the sum of the other disks to find the missing information
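A minimal C sketch of this scheme, reusing the two bit patterns from the slide; for bits, sum mod 2 is XOR, so both computing parity and recovering a lost disk are single XORs:

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint8_t d0 = 0xA3;     /* 10100011, striped onto disk 0     */
    uint8_t d1 = 0xCD;     /* 11001101, striped onto disk 1     */
    uint8_t p  = d0 ^ d1;  /* parity disk: per-stripe sum mod 2 */

    /* Disk 0 fails: "subtract" the surviving disks from P. */
    uint8_t recovered = p ^ d1;
    assert(recovered == d0);
    return 0;
}
```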
RAID 4: High I/O Rate Parity

Insides of 5 disks, with increasing logical disk addresses running down the disk columns and each stripe's parity on the dedicated parity disk:

D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
...

Example: a small read touches only D0 and D5; a large write covers D12-D15 (a full stripe).
Inspiration for RAID 5

• RAID 4 works well for small reads
• Small writes (a write to one disk):
  • Option 1: read the other data disks, create the new sum, and write it to the parity disk
  • Option 2: since P holds the old sum, compare the old data to the new data and add the difference to P
• Small writes are limited by the parity disk: writes to D0 and D5 must both also write to the P disk

D0   D1   D2   D3   P
D4   D5   D6   D7   P
RAID 5: High I/O Rate Interleaved Parity

Independent writes are possible because of interleaved parity; logical disk addresses increase down the disk columns, and the parity block rotates across the disks:

D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...

Example: a write to D0 and a write to D5 use disks 0, 1, 3, and 4; since the two writes touch disjoint disks (D0 with its parity on disk 4, D5 with its parity on disk 3), they can proceed in parallel.
Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.

To replace D0 with new data D0' in the stripe (D0, D1, D2, D3, P):
1. Read the old data D0
2. Read the old parity P
3. Write the new data D0'
4. Write the new parity P' = D0 XOR D0' XOR P

(Figure: the new data is XORed with the old data, and that result is XORed with the old parity to produce the new parity.)
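The update rule is a one-line XOR identity; a sketch (since P = D0 ^ D1 ^ D2 ^ D3, substituting D0' only needs the old data and the old parity):

```c
#include <stdint.h>

/* New parity after a RAID-5 small write: P' = P ^ D_old ^ D_new.
 * The untouched data disks never have to be read. */
static uint8_t raid5_new_parity(uint8_t old_parity,
                                uint8_t old_data,
                                uint8_t new_data) {
    return (uint8_t)(old_parity ^ old_data ^ new_data);
}
```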
Warehouse Scale Computer
Google’s WSCs
(Photo: an example Google WSC, in Oregon.)
Containers in WSCs
(Photos: inside a WSC, and inside a single container.)
Warehouse-Scale Computers

• Datacenter
  • A collection of 10,000 to 100,000 servers
  • Networks connecting them together
• A single gigantic machine
• Very large applications (Internet services): search, email, video sharing, social networking
• Very high availability
• "...WSCs are no less worthy of the expertise of computer systems architects than any other class of machines" (Barroso and Hoelzle, 2009)
Anatomy of a Web Search
Anatomy of a Web Search (1/3)

• Google "chundong wang"
  • The request is directed to the "closest" Google WSC
  • A front-end load balancer directs the request to one of many clusters of servers within the WSC
  • Within the array, one of many Google Web Servers (GWS) is selected to handle the request and compose the response page
  • The GWS communicates with Index Servers to find documents that contain the search words "chundong" and "wang", using the location of the search as well as user information
  • A document list with associated relevance scores is returned
Anatomy of a Web Search (2/3)

• In parallel:
  • Ad system: check whether someone has bothered to advertise for "chundong wang"
  • Use docids (document IDs) to access the indexed documents and get snippets
• Compose the page
  • Result document extracts (with the keywords in context), ordered by relevance score
  • Sponsored links and advertisements where possible
Anatomy of a Web Search (3/3)
• Implementation strategy
  • Randomly distribute the entries
  • Make many copies of the data (a.k.a. "replicas")
  • Load balance requests across replicas
• Redundant copies of indices and documents
  • Break up search hot spots, e.g., "Tenet"
  • Increase opportunities for request-level parallelism
  • Make the system more tolerant of failures
Conclusion
• We have reviewed important topics of CA
  • Many are not covered yet
  • But they will be covered in greater depth throughout this course
• Next lecture
  • Microcode, instructions, ISA, ROP
Acknowledgements
• These slides contain materials developed and copyrighted by:
  • Prof. Krste Asanovic (UC Berkeley)
  • Prof. Xuehai Zhou (USTC)
  • Prof. Mikko Lipasti (UW-Madison)
  • Prof. Sören Schwertfeger (ShanghaiTech)
  • Prof. Kenji Kise (Tokyo Tech)
  • Prof. Jernej Barbic (USC)