CS211 Advanced Computer Architecture
L02 Review
Chundong Wang, September 9th, 2020
L01 Survey
Admin
What is covered by CA?
What is covered by CA?
• Instructions and micro-codes
• Instruction execution: pipeline, in-order or out-of-order, speculation, etc.
• Memory hierarchy: cache, main memory, disk, etc.
• Exceptions, interrupts, etc.
• I/O
• Single-threaded or multi-threaded execution
Pipeline: instruction-level parallelism
Cycle   1  2  3  4  5  6  7  8  9
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W
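With the classic five stages (F: fetch, D: decode, X: execute, M: memory, W: writeback) and no hazards, one instruction completes per cycle once the pipeline is full, so N instructions take N + 4 cycles: the five instructions above finish at cycle 9, versus 5 x 5 = 25 cycles without pipelining. The speedup approaches the pipeline depth as N grows.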
From single-core to multi-core
• Multi-core is not multi-threading
  • A single core can support multi-threading; multi-threading is older than multi-core
  • Intel introduced "hyper-threading" in 2002: virtually, one core becomes two
• The era of multi-core
  • Intel with the Core 2 Duo, AMD with the Athlon 64 X2, in 2005/2006
  • "From Single Core to Multi-Core: Preparing for a New Exponential", in ICCAD '06
• Multi-core
  • Replicate multiple cores on a single die
  • The operating system perceives each core as a separate processor
• Why multi-core?
  • It is difficult to push single-core clock frequencies even higher, i.e., the power wall
  • Multi-threaded applications demand more parallelism
• Problems that come with multi-core?
  • Cache coherence, scheduling, interconnect, etc.
Memory system
• Memory hierarchy
  • Registers, L1/L2/L3 caches, main memory, flash memory, hard disk
A typical memory hierarchy for a desktop:

Level   Registers    L1 Cache   L2 Cache   L3 Cache   Memory      Storage
Size    2000 bytes   64 KB      256 KB     8-32 MB    8-64 GB     256 GB-2 TB
Speed   300 ps       1 ns       3-10 ns    10-20 ns   50-100 ns   50-100 us
Memory system
• Memory hierarchy
  • Registers, L1/L2/L3 caches, main memory, flash memory, hard disk
• Locality: spatial locality and temporal locality (see the sketch after this slide)
  • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon
  • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
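A minimal C sketch of both kinds of locality (the array size is an arbitrary illustrative choice, not from the slides); C stores 2-D arrays in row-major order, so traversal order decides how much of each fetched cache block is actually used:

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: the inner loop walks consecutive addresses,
     * so every byte of each fetched cache block gets used.
     * `sum` itself shows temporal locality: it is re-referenced every iteration. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: column-major traversal strides N * 8 bytes
     * between accesses, touching a different cache block nearly every time. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```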
The Philosophy behind CPU Cache

• A programmer-invisible hardware mechanism
• Gives the illusion of the speed of the fastest memory with the size of the largest memory
• Works fine even if the programmer has no idea what a cache is
• However, performance-oriented programmers today sometimes "reverse engineer" the cache design to make their data structures match the cache
• And modern programming languages try to provide storage abstractions that offer flexibility while still caching well
• It does have limits: when you overwhelm the cache, your performance may drop off a cliff...
The involvement of cache

(Figure: the processor, with its control and datapath (PC, registers, ALU), now accesses memory through a cache. The processor-memory interface carries an address, read/write enable signals, write data, and read data; memory holds both program and data bytes and connects to input/output through the I/O-memory interfaces.)
CPU Cache
• A CPU cache is a small, fast memory (mostly SRAM) that acts as a buffer for the DRAM main memory
• Cache block: the unit of exchange between the cache and main memory
  • Also known as a cache line in state-of-the-art research papers
  • Each block has an address from the main memory
• Four questions related to CPU cache:
  1. Where can a block be placed in a cache?
  2. How is a block found if it is in the cache?
  3. Which block should be replaced on a cache miss?
  4. What happens on a write?
Where to place a block in a cache
• Direct mapped: each block maps to exactly one location
  • Location = (Block address) % (# of blocks in cache)
  • e.g., block 9527 in an 8-block cache goes to 9527 % 8 = 7
• Fully associative: a block can be placed anywhere in the cache
• Set associative: the cache is partitioned into multiple sets; a block can be placed anywhere within one set
  • Set = (Block address) % (# of sets)
  • e.g., with four sets of two ways each, block 9527 goes to set 9527 % 4 = 3

A direct mapped cache can be viewed as a set associative cache with N sets and one way. A fully associative cache can be viewed as a set associative cache with one set and N ways.
How to find a block in a cache

• Block offset: the byte address within the block
• Set index: selects the set the block is in
• Tag: identifies the block, by comparison against the blocks in the selected set
• Size of Index = log2(# of sets)
• Size of Tag = Address size - Size of Index - log2(# of bytes per block)

Address layout: | Tag | Set Index | Block Offset |. A C sketch of this decomposition follows.
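A minimal C sketch of the decomposition, assuming a hypothetical cache with 64-byte blocks and 128 sets (both parameters are illustrative, not from the slides):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64u   /* log2(64)  = 6 offset bits */
#define NUM_SETS    128u  /* log2(128) = 7 index bits  */

int main(void) {
    uint64_t addr = 0x7f3a92c4;  /* an arbitrary byte address */

    uint64_t offset = addr % BLOCK_BYTES;  /* byte within the block   */
    uint64_t block  = addr / BLOCK_BYTES;  /* the block address       */
    uint64_t index  = block % NUM_SETS;    /* which set to look in    */
    uint64_t tag    = block / NUM_SETS;    /* what to compare against */

    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```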
Which block should be the victim for replacement

• From time to time, the cache is almost always full
  • When a cache miss happens, a victim must be chosen and replaced
• Victim selection for a direct-mapped cache is straightforward
  • Why?
• For set- and fully-associative caches:
  • Random, e.g., ARM Cortex-A53 L1 cache
  • First in, first out (FIFO)
  • LRU (least recently used): quite complicated in today's CPUs
  • Pseudo-LRU, or approximated LRU, e.g., Intel Core i7 L1 cache
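As a minimal sketch of what true LRU must do per set (assuming hypothetical per-way last-use timestamps; hardware pseudo-LRU approximates this with only a handful of bits):

```c
#include <stdint.h>

#define WAYS 8  /* illustrative associativity */

/* Pick the way within a set whose block was used least recently.
 * last_used[w] is the (assumed) time of way w's most recent access,
 * which real hardware cannot afford to track exactly. */
static int lru_victim(const uint64_t last_used[WAYS]) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}
```

Tracking and comparing full timestamps per way is exactly the cost that pushes real designs toward tree-based pseudo-LRU, which needs roughly one bit per pair of ways.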
Write policy

• Write through
  • Data is written into both the cache and the lower-level memory
  • Pros: easy to implement, easy to maintain data coherency
  • Cons: slow, and may cause write stalls
• Write back
  • Data is written to the cache only, and to main memory only upon replacement
  • Pros: fast, and multiple writes are batched into one, saving power
  • Cons: more complex control; data coherency must be maintained across multi-level caches and memory
• Write allocate
  • If writing to memory not in the cache, fetch it first
  • Intel Core i7 L2 cache is write allocate
• No write allocate
  • Just write to memory without a fetch
  • Intel Core i7 L1 cache is no write allocate
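A pseudo-C sketch contrasting two common policy pairings; the helper functions (cache_lookup, cache_fill, mem_write) and the line format are assumptions made only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t tag; uint32_t data; bool valid, dirty; } line_t;

extern line_t *cache_lookup(uint64_t addr);  /* assumed: NULL on a miss */
extern line_t *cache_fill(uint64_t addr);    /* assumed: fetch block in */
extern void    mem_write(uint64_t addr, uint32_t data);

/* Write through + no write allocate (a common pairing). */
void write_through(uint64_t addr, uint32_t data) {
    line_t *line = cache_lookup(addr);
    if (line) line->data = data;  /* update the cache only on a hit     */
    mem_write(addr, data);        /* lower-level memory always updated  */
}

/* Write back + write allocate (a common pairing). */
void write_back(uint64_t addr, uint32_t data) {
    line_t *line = cache_lookup(addr);
    if (!line) line = cache_fill(addr);  /* miss: fetch the block first */
    line->data = data;
    line->dirty = true;  /* memory updated only when this line is evicted */
}
```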
Cache performance

• Cache hit and miss
  • The cache is transparent to the system and applications
  • All want high hit rates, right?
• Three types of cache miss
  • Compulsory miss: the first access to a block (cold start)
  • Capacity miss: the cache cannot contain all the blocks needed for executing a program
  • Conflict miss: multiple blocks compete for the same set (or the same block, if direct mapped)
• How to reduce miss rates? (see the worked example below)
  • Increase the cache size?
  • Increase the block size?
  • Increase the associativity (# of ways per set)?
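One standard way to reason about these knobs (the formula is textbook material; the numbers are illustrative) is average memory access time: AMAT = hit time + miss rate x miss penalty. With a 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty, AMAT = 1 + 0.05 x 100 = 6 ns; halving the miss rate to 2.5% cuts AMAT to 3.5 ns, while the same cache with a 10% miss rate would average 11 ns.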
Multi-level cache
(Figure: the processor's control and datapath (PC, registers, ALU) are backed by split L1 instruction and data caches, a unified L2 cache, and finally DRAM main memory, which holds both program and data bytes.)
Now an L3 cache is very common. How to manage multi-level caches will be detailed later in this course.
TA’s time for lab
Admin
Lab 0
• Submission deadline
  • 23:59:59 (UTC+8), Wednesday, 23rd September, 2020
• Check Blackboard or the course website to download it
• Submission format
  • A report about what you have done, what you observe, etc.
  • To be submitted to Blackboard
• A subset of students will be asked to explain their solutions in person
  • Check your email accounts after the submission deadline
Admin
Main Memory
Virtual memory
• The reasons for "virtual" memory
  • Helping programmers manage memory space
  • Protecting the system from applications, and applications from each other
  • Extending memory space with the introduction of disk
• Paging vs. segmentation
  • Pages have a uniform size, e.g., 4096 or 8192 bytes
  • Segments have variable sizes
  • Both come with pros and cons
Paged memory systems

• A processor-generated address can be split into: | page number | offset |
• A page table contains the physical address of the base of each page
• Page tables make it possible to store the pages of a program non-contiguously

(Figure: pages 0-3 of User-1's address space sit scattered in physical memory; User-1's page table maps each virtual page number to the physical page that holds it.)
Private address space per user

• Each user has a page table
• The page table contains an entry for each user page

(Figure: users 1, 2, and 3 each have a virtual page VA1; each user's own page table maps it to a different location in physical memory, which also holds OS pages and free frames.)
Linear page table

• A Page Table Entry (PTE) contains:
  • 1 bit to indicate whether the page exists
  • And either a PPN or a DPN:
    • PPN (physical page number) for a memory-resident page
    • DPN (disk page number) for a page on the disk
  • Status bits for protection and usage (read, write, exec)
• The OS sets the Page Table Base Register whenever the active user process changes

(Figure: the VPN of the virtual address indexes a linear page table located via the PT Base Register; each PTE holds either a PPN pointing to a data page in memory or a DPN for a page on disk, and the page offset is appended to reach the data word.)

The size of a linear page table is a problem. Assume a 64-bit address, 4 KB pages, and 8 B PTEs: that is 2^64 / 2^12 = 2^52 entries, i.e., 2^52 x 8 B = 2^55 B (32 PB) for a single page table.
Hierarchical page table: exploits the sparsity of virtual address space use

• The 32-bit virtual address is split into | p1: 10-bit L1 index (bits 31-22) | p2: 10-bit L2 index (bits 21-12) | offset (bits 11-0) |
• A processor register holds the root of the current page table
• p1 indexes the level-1 page table; its PTE points to one of the level-2 page tables, and p2 indexes that table to reach the data page
• A PTE may refer to a page in primary memory, a page in secondary memory, or a nonexistent page

(Figure: the level-1 page table fans out to level-2 page tables, which in turn point to data pages in physical memory; only the parts of the tree actually in use need to exist.)
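A minimal C sketch of the walk just described, for the 32-bit layout above; the PTE format (a present bit in bit 0, a physical address in the upper bits) and the fault signalling are simplifying assumptions:

```c
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Walk a two-level table: 10-bit p1, 10-bit p2, 12-bit offset.
 * Each level has 1024 4-byte entries. Returns 0 to signal a page
 * fault (purely illustrative; real handlers trap instead). */
uint32_t translate(const uint32_t *l1_table, uint32_t vaddr) {
    uint32_t p1     = (vaddr >> 22) & 0x3FFu;  /* bits 31..22 */
    uint32_t p2     = (vaddr >> 12) & 0x3FFu;  /* bits 21..12 */
    uint32_t offset =  vaddr        & 0xFFFu;  /* bits 11..0  */

    uint32_t l1_pte = l1_table[p1];
    if (!(l1_pte & PTE_PRESENT)) return 0;     /* L2 table absent */

    /* Assumed: the upper bits of an L1 PTE give the L2 table's base. */
    const uint32_t *l2_table =
        (const uint32_t *)(uintptr_t)(l1_pte & ~0xFFFu);

    uint32_t l2_pte = l2_table[p2];
    if (!(l2_pte & PTE_PRESENT)) return 0;     /* page fault */

    return (l2_pte & ~0xFFFu) | offset;        /* PPN ++ offset */
}
```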
Address Translation & Protection

• Every instruction and data access needs address translation and protection checks
• A good VM design needs to be fast (~ one cycle) and space efficient

(Figure: the virtual address, Virtual Page No. (VPN) + offset, goes through address translation and a protection check that consults kernel/user mode and read/write permissions; the result is the physical address, Physical Page No. (PPN) + offset, or an exception.)
Translation Lookaside Buffer (TLB)

• Address translation is very expensive! In a two-level page table, each reference becomes several memory accesses
• Solution: cache some translations in the TLB
  • TLB hit => single-cycle translation
  • TLB miss => page-table walk to refill the TLB

(Figure: the VPN of the virtual address is matched against the TLB tags; a hit returns the entry's PPN along with its valid (V), read (R), write (W), and dirty (D) bits, and the PPN is concatenated with the offset to form the physical address.)
TLB Designs

• Typically 32-128 entries, usually fully associative
  • Each entry maps a large page, hence there is less spatial locality across pages => it is more likely that two entries conflict
  • Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative
  • Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random or FIFO replacement policy
• Upon a context switch? New VM space! Flush the TLB...
• "TLB reach": the size of the largest virtual address space that can be simultaneously mapped by the TLB (worked example below)
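For instance (illustrative numbers): a fully associative 64-entry TLB with 4 KB pages reaches 64 x 4 KB = 256 KB of virtual address space; the same 64 entries mapping 2 MB large pages reach 64 x 2 MB = 128 MB, which is one reason large pages help TLB-bound workloads.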
VM-related events in pipeline
• Handling a TLB miss needs a hardware or software mechanism to refill the TLB
  • Usually done in hardware now
• Handling a page fault (e.g., the page is on disk) needs a precise trap so the software handler can easily resume after retrieving the page
• Handling a protection violation may abort the process

(Figure: a pipeline with PC -> instruction TLB -> instruction cache -> decode -> execute -> data TLB -> data cache -> writeback; a TLB miss, page fault, or protection violation can arise at both the instruction TLB and the data TLB.)
Disk and RAID
Magnetic Disk - a common I/O device

• A kind of computer memory
  • Information is stored by magnetizing ferrite material on the surface of a rotating disk
  • Similar to a tape recorder, except the data is digital rather than analog
• A type of non-volatile storage
  • Retains its value without power applied to the disk
• Magnetic disks: Hard Disk Drives (HDD) are faster, denser, and non-removable
• Purpose in computer systems (Hard Disk Drive):
  1. Working file system + long-term backup for files
  2. Secondary "backing store" for main memory: the large, inexpensive, slow level in the memory hierarchy (virtual memory)
Disk Device Terminology

• Several platters, with information recorded magnetically on both surfaces (usually)
• Bits are recorded in tracks, which in turn are divided into sectors (e.g., 512 bytes)
• The actuator moves the head (at the end of the arm) over the desired track ("seek"), waits for the sector to rotate under the head, and then reads or writes

(Figure: a platter with outer and inner tracks divided into sectors; the actuator's arm positions the head over a track.)
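As a worked example (the decomposition is standard; the numbers are illustrative): access time ≈ seek time + rotational latency + transfer time + controller overhead. At 7200 RPM one rotation takes 60 / 7200 s ≈ 8.3 ms, so the average rotational latency (half a rotation) is about 4.2 ms; with, say, a 6 ms average seek and a sub-millisecond sector transfer, a random access costs roughly 10 ms.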
RAID: Redundant Arrays of Independent (Inexpensive) Disks

• Files are "striped" across multiple disks
• Redundancy yields high data availability
  • Availability: service is still provided to the user, even if some components have failed
• Disks will still fail
  • Contents are reconstructed from data redundantly stored in the array
  => Capacity penalty to store redundant info
  => Bandwidth penalty to update redundant info
RAID 0: Striping

• RAID 0 provides no fault tolerance or redundancy
• Striping, or disk spanning
• High performance

(Figure: blocks A0-A3 striped across four disks, followed by A4-A7 on the next stripe.)
RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its "mirror(s)"
  • Very high availability can be achieved
• Bandwidth sacrifice on write:
  • Logical write = N physical writes
  • Reads may be optimized
• Most expensive solution: 100% capacity overhead
• RAID 10 (striped mirrors) and RAID 01 (mirrored stripes) are combinations of RAID 0 and RAID 1

(Figure: blocks A0 and A1, each duplicated on all four disks.)
RAID 3: Parity Disk

• A logical record is striped into physical records across the data disks, e.g., 10100011 on one disk and 11001101 on another
• The parity disk P contains the sum of the other disks per stripe, mod 2 ("parity")
• If a disk fails, subtract P from the sum of the other disks to find the missing information
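A minimal C sketch of this scheme, reusing the two bit patterns from the slide; for bits, sum mod 2 is XOR, so both computing parity and recovering a lost disk are single XORs:

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint8_t d0 = 0xA3;     /* 10100011, striped onto disk 0     */
    uint8_t d1 = 0xCD;     /* 11001101, striped onto disk 1     */
    uint8_t p  = d0 ^ d1;  /* parity disk: per-stripe sum mod 2 */

    /* Disk 0 fails: "subtract" the surviving disks from P. */
    uint8_t recovered = p ^ d1;
    assert(recovered == d0);
    return 0;
}
```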
RAID 4: High I/O Rate Parity

Insides of 5 disks, with increasing logical disk addresses running down the disk columns and each stripe's parity on the dedicated parity disk:

D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
...

Example: a small read touches only D0 and D5; a large write covers D12-D15 (a full stripe).
Inspiration for RAID 5

• RAID 4 works well for small reads
• Small writes (a write to one disk):
  • Option 1: read the other data disks, create the new sum, and write it to the parity disk
  • Option 2: since P holds the old sum, compare the old data to the new data and add the difference to P
• Small writes are limited by the parity disk: writes to D0 and D5 must both also write to the P disk

D0   D1   D2   D3   P
D4   D5   D6   D7   P
RAID 5: High I/O Rate Interleaved Parity

Independent writes are possible because of interleaved parity; logical disk addresses increase down the disk columns, and the parity block rotates across the disks:

D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...

Example: a write to D0 and a write to D5 use disks 0, 1, 3, and 4; since the two writes touch disjoint disks (D0 with its parity on disk 4, D5 with its parity on disk 3), they can proceed in parallel.
Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.

To replace D0 with new data D0' in the stripe (D0, D1, D2, D3, P):
1. Read the old data D0
2. Read the old parity P
3. Write the new data D0'
4. Write the new parity P' = D0 XOR D0' XOR P

(Figure: the new data is XORed with the old data, and that result is XORed with the old parity to produce the new parity.)
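The update rule is a one-line XOR identity; a sketch (since P = D0 ^ D1 ^ D2 ^ D3, substituting D0' only needs the old data and the old parity):

```c
#include <stdint.h>

/* New parity after a RAID-5 small write: P' = P ^ D_old ^ D_new.
 * The untouched data disks never have to be read. */
static uint8_t raid5_new_parity(uint8_t old_parity,
                                uint8_t old_data,
                                uint8_t new_data) {
    return (uint8_t)(old_parity ^ old_data ^ new_data);
}
```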
Warehouse Scale Computer
Google’s WSCs
(Photo: an example Google WSC, in Oregon.)
Containers in WSCs
(Photos: inside a WSC, and inside a single container.)
Warehouse-Scale Computers

• Datacenter
  • A collection of 10,000 to 100,000 servers
  • Networks connecting them together
• A single gigantic machine
• Very large applications (Internet services): search, email, video sharing, social networking
• Very high availability
• "...WSCs are no less worthy of the expertise of computer systems architects than any other class of machines" (Barroso and Hoelzle, 2009)
Anatomy of a Web Search
Anatomy of a Web Search (1/3)

• Google "chundong wang"
  • The request is directed to the "closest" Google WSC
  • A front-end load balancer directs the request to one of many clusters of servers within the WSC
  • Within the array, one of many Google Web Servers (GWS) is selected to handle the request and compose the response page
  • The GWS communicates with Index Servers to find documents that contain the search words "chundong" and "wang", using the location of the search as well as user information
  • A document list with associated relevance scores is returned
Anatomy of a Web Search (2/3)

• In parallel:
  • Ad system: check whether someone has bothered to advertise for "chundong wang"
  • Use docids (document IDs) to access the indexed documents and get snippets
• Compose the page
  • Result document extracts (with the keywords in context), ordered by relevance score
  • Sponsored links and advertisements where possible
Anatomy of a Web Search (3/3)
• Implementation strategy
  • Randomly distribute the entries
  • Make many copies of the data (a.k.a. "replicas")
  • Load balance requests across replicas
• Redundant copies of indices and documents
  • Break up search hot spots, e.g., "Tenet"
  • Increase opportunities for request-level parallelism
  • Make the system more tolerant of failures
Conclusion
• We have reviewed important topics of CA
  • Many are not covered yet
  • But they will be covered in greater depth throughout this course
• Next lecture
  • Microcode, instructions, ISA, ROP
Acknowledgements
• These slides contain materials developed and copyrighted by:
  • Prof. Krste Asanovic (UC Berkeley)
  • Prof. Xuehai Zhou (USTC)
  • Prof. Mikko Lipasti (UW-Madison)
  • Prof. Sören Schwertfeger (ShanghaiTech)
  • Prof. Kenji Kise (Tokyo Tech)
  • Prof. Jernej Barbic (USC)