

Prelim 3 Review
Hakim Weatherspoon
CS 3410, Spring 2012
Computer Science, Cornell University

Administrivia
- Pizza party: PA3 Games Night
  - Tomorrow, Friday, April 27th, 5:00-7:00pm
  - Location: Upson B17
- Prelim 3
  - Tonight, Thursday, April 26th, 7:30pm
  - Location: Olin 155
- PA4: final project out next week
  - Demos: May 14-16
  - Will not be able to use slip days

Goals for Today
Prelim 3 review:
- Caching, virtual memory, paging, TLBs
- Operating system, traps, exceptions
- Multicore and synchronization

Big Picture
[Figure: the five-stage pipeline - Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back - with the PC, register file, control, immediate extension, ALU, data memory (addr/din/dout), the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, jump/branch target computation (+4), a forward unit, and hazard detection]

Topics: caching, virtual memory, paging, TLBs; operating system, traps, exceptions; multicore and synchronization.

Memory Hierarchy and Caches

Memory Pyramid
- RegFile (100s of bytes): < 1 cycle access
- L1 Cache (several KB): 1-3 cycle access
- L2 Cache (up to ~32MB): 5-15 cycle access
- Memory (128MB to a few GB): 50-300 cycle access
- Disk (many GB to a few TB): 1,000,000+ cycle access

These are rough numbers; mileage may vary for the latest/greatest. Caches are usually made of SRAM (or eDRAM); L1, L2, and even L3 are all on-chip these days. L3 is becoming more common (eDRAM?).

Memory Hierarchy: Insight for Caches

If Mem[x] was accessed recently, then Mem[x] is likely to be accessed soon.
Exploit temporal locality: put recently accessed Mem[x] higher in the memory hierarchy, since it will likely be accessed again soon.

If Mem[x] was accessed recently, then nearby Mem[x ± ε] is likely to be accessed soon.
Exploit spatial locality: put the entire block containing Mem[x] and surrounding addresses higher in the memory hierarchy, since nearby addresses will likely be accessed.

Which data should be kept near the top of the pyramid? A: Data that will be used soon ("soon" relates to the frequency of loads).

Memory Hierarchy
[Worked example: a sequence of loads - LB $1 <- M[1], LB $2 <- M[5], LB $3 <- M[1], LB $3 <- M[4], LB $2 <- M[0] - traced against memory locations 0-15]

[Figure: a processor with registers $0-$3, a cache of 4 lines (2-word blocks, each line holding a valid bit, tag, and data), and memory]

Three Common Cache Designs
A given data block can be placed...
- in exactly one cache line: Direct Mapped
- in any cache line: Fully Associative
- in a small set of cache lines: Set Associative

Direct Mapped Cache
[Figure: the address splits into Tag | Index | Offset; the index selects a line, the tag is compared for a hit, and the offset selects a word within the block]

Fully Associative Cache
[Figure: the address splits into Tag | Offset; the tag is compared against every line in parallel, and the offset selects a word]

3-Way Set Associative Cache (each way = one line)
[Figure: the address splits into Tag | Index | Offset; the index selects a set, and the tags of all three ways are compared in parallel]

Cache Misses
Three types of misses:
- Cold (aka Compulsory): the line is being referenced for the first time
- Capacity: the line was evicted because the cache was not large enough
- Conflict: the line was evicted because of another access whose index conflicted
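To make the Tag | Index | Offset split concrete, here is a minimal C sketch (my own illustration, not from the slides) that decomposes an address for an assumed 4KB direct-mapped cache with 64-byte blocks:

    #include <stdint.h>
    #include <stdio.h>

    // Assumed example cache (not from the slides): 4KB direct-mapped,
    // 64-byte blocks => 64 lines, 6 offset bits, 6 index bits, rest tag.
    #define OFFSET_BITS 6
    #define INDEX_BITS  6

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }

A fully associative cache drops the index field entirely; a set-associative cache just uses fewer index bits.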

Writing with Caches

Eviction
Which cache line should be evicted from the cache to make room for a new line?
- Direct-mapped: no choice, must evict the line selected by the index
- Associative caches:
  - random: select one of the lines at random
  - round-robin: similar to random
  - FIFO: replace the oldest line
  - LRU: replace the line that has not been used for the longest time
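As a sketch of how LRU replacement might be realized (the per-line timestamp is one simple software model, not necessarily what hardware does):

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    typedef struct {
        int valid;
        uint32_t tag;
        uint64_t last_used;   // updated on every access to the line
    } line_t;

    // Pick the victim in one set: a free line if any, else the LRU line.
    static int choose_victim(line_t set[WAYS]) {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid)
                return i;                  // invalid line: no eviction needed
            if (set[i].last_used < set[victim].last_used)
                victim = i;                // least recently used so far
        }
        return victim;
    }

    int main(void) {
        line_t set[WAYS] = {{1,0xA,100}, {1,0xB,42}, {1,0xC,310}, {1,0xD,7}};
        printf("evict way %d\n", choose_victim(set));  // way 3 (last_used=7)
        return 0;
    }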

Cached Write Policies
Q: How to write data?
[Figure: CPU <-> Cache (SRAM) <-> Memory (DRAM), connected by addr/data lines]

If the data is already in the cache:
- No-Write: writes invalidate the cache and go directly to memory
- Write-Through: writes go to main memory and the cache
- Write-Back: the CPU writes only to the cache; the cache writes to main memory later (when the block is evicted)

What about Stores?
Where should you write the result of a store?
- If that memory location is in the cache?
  - Send it to the cache.
  - Should we also send it to memory right away? (write-through policy)
  - Or wait until we kick the block out? (write-back policy)
- If it is not in the cache?
  - Allocate the line, i.e., put it in the cache? (write-allocate policy)
  - Write it directly to memory without allocation? (no-write-allocate policy)
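A toy simulation of the store path under a write-back, write-allocate policy (a sketch; the one-line cache over a 256-byte "memory" is my simplification, just enough to show the mechanism):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint8_t memory[256];
    static struct {
        bool valid, dirty;
        uint32_t tag;            // block number currently cached
        uint8_t data[16];        // one 16-byte block
    } line;

    static void evict_and_refill(uint32_t block) {
        if (line.valid && line.dirty)                       // write-back:
            memcpy(&memory[line.tag * 16], line.data, 16);  // flush old block
        memcpy(line.data, &memory[block * 16], 16);         // fetch new block
        line.valid = true; line.dirty = false; line.tag = block;
    }

    // Store one byte: write-allocate on a miss, write-back on eviction.
    static void store_byte(uint32_t addr, uint8_t value) {
        uint32_t block = addr / 16;
        if (!line.valid || line.tag != block)
            evict_and_refill(block);   // write-allocate: bring the block in
        line.data[addr % 16] = value;  // update only the cache...
        line.dirty = true;             // ...memory is updated at eviction
    }

    int main(void) {
        store_byte(3, 42);    // miss: allocates block 0
        store_byte(20, 7);    // miss: evicts dirty block 0, loads block 1
        printf("memory[3] = %d\n", memory[3]);  // 42, flushed by eviction
        return 0;
    }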

Cache Performance
Consider the hit ratio (H) and miss ratio (M), where hit rate = 1 - miss rate. Average access time = H x AT_cache + M x AT_memory, with access times given in cycles and a cache:memory access-time ratio of 1:50.

- 90%:   0.90 + 0.10 x 50 = 5.9
- 95%:   0.95 + 0.05 x 50 = 0.95 + 2.5 = 3.45
- 99%:   0.99 + 0.01 x 50 = 1.49
- 99.9%: 0.999 + 0.001 x 50 = 0.999 + 0.05 = 1.049
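These numbers are easy to reproduce (a small sketch using the slide's 1-cycle cache and 50-cycle memory assumption):

    #include <stdio.h>

    int main(void) {
        // Access times from the slide's 1:50 ratio (in cycles).
        const double at_cache = 1.0, at_memory = 50.0;
        const double hit_rates[] = {0.90, 0.95, 0.99, 0.999};
        for (int i = 0; i < 4; i++) {
            double h = hit_rates[i], m = 1.0 - h;
            printf("hit rate %5.1f%% -> avg access time %.3f cycles\n",
                   h * 100.0, h * at_cache + m * at_memory);
        }
        return 0;
    }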

Cache Conscious Programming
Every access is a cache miss! (Unless the entire matrix can fit in the cache.)

    // NROW = 12, NCOL = 10
    int A[NROW][NCOL];

    for (col = 0; col < NCOL; col++)
        for (row = 0; row < NROW; row++)
            sum += A[row][col];

Cache Conscious Programming
With row-major traversal, misses drop with block size:
- Block size = 4: 75% hit rate
- Block size = 8: 87.5% hit rate
- Block size = 16: 93.75% hit rate
And you can easily prefetch to warm the cache.

    // NROW = 12, NCOL = 10
    int A[NROW][NCOL];

    for (row = 0; row < NROW; row++)
        for (col = 0; col < NCOL; col++)
            sum += A[row][col];
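Putting the two fragments together as a runnable comparison (the timing harness and the larger 4096x4096 matrix are my additions; exact speedups depend on the machine):

    #include <stdio.h>
    #include <time.h>

    #define NROW 4096
    #define NCOL 4096
    static int A[NROW][NCOL];

    // Time one full traversal and return the accumulated sum.
    static long traverse(int row_major, double *seconds) {
        long sum = 0;
        clock_t t0 = clock();
        if (row_major) {
            for (int row = 0; row < NROW; row++)
                for (int col = 0; col < NCOL; col++)
                    sum += A[row][col];    // consecutive addresses
        } else {
            for (int col = 0; col < NCOL; col++)
                for (int row = 0; row < NROW; row++)
                    sum += A[row][col];    // strides of NCOL ints
        }
        *seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;
        return sum;
    }

    int main(void) {
        double t;
        long s1 = traverse(1, &t);
        printf("row-major:    sum=%ld  %.3fs\n", s1, t);
        long s2 = traverse(0, &t);
        printf("column-major: sum=%ld  %.3fs\n", s2, t);
        return 0;
    }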

MMU, Virtual Memory, Paging, and TLBs

Multiple Processes
Q: What happens when another program is executed concurrently on another processor?

A: The addresses will conflict (even though the CPUs may take turns using the memory bus).
[Figure: two CPUs, each with its own Text/Data/Stack/Heap layout over addresses 0x0000-0xffff, sharing one physical memory]

Virtual Memory

Virtual Memory: A Solution for All Problems

Each process has its own virtual address space; the programmer can code as if they own all of memory.

On-the-fly at runtime, for each memory access:
- all access is indirect, through a virtual address
- translate the fake virtual address to a real physical address
- redirect the load/store to the physical address

Virtual Memory Advantages
- Easy relocation: the loader puts code anywhere in physical memory and creates virtual mappings to give the illusion of the correct layout
- Higher memory utilization: provides the illusion of contiguous memory and uses all of physical memory, even physical address 0x0
- Easy sharing: different mappings for different programs/cores, with different permission bits

Address Space
Programs load/store to virtual addresses; actual memory uses physical addresses. The Memory Management Unit (MMU) is responsible for translating on the fly. Essentially, it is just a big array of integers: paddr = PageTable[vaddr];
[Figure: two CPUs, each with its own MMU, map the same virtual layout (A, B, C / X, Y, Z) onto different physical locations; since each MMU has its own mappings, this gives easy relocation, higher memory utilization, and easy sharing]

Address Translation
Attempt #1: for any access to a virtual address:
- Calculate the virtual page number (VPN) and page offset
- Look up the physical page number at PageTable[vpn]
- Calculate the physical address as ppn:offset

[Figure: vaddr = virtual page number | page offset; the VPN is looked up in the PageTable to get the physical page number, which is concatenated with the unchanged page offset to form paddr]
With 4KB pages: 12 bits for the page offset, 20 bits for the VPN.

Beyond Flat Page Tables
Assume most of the PageTable is empty. How to translate addresses?
[Figure: multi-level PageTable - the PTBR points to a Page Directory; 10 bits of the vaddr select a PDEntry, the next 10 bits select a PTEntry in a Page Table, and the remaining bits select a word in the page (x86 does exactly this)]
Q: Benefits?
A1: Don't need 4MB of contiguous physical memory for the table.
A2: Don't need to allocate every PageTable, only those containing valid PTEs.
Q: Drawbacks?
A: Longer lookups.
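A minimal sketch of the two-level walk, using the slide's 10 + 10 + 12 bit split for 4KB pages (the in-memory table pool and valid-bit encoding are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    // Simulated two-level page table: 10-bit directory index, 10-bit table
    // index, 12-bit page offset. Entry encoding (assumed): valid bit in
    // bit 0, payload in the upper bits.
    static uint32_t page_tables[4][1024];  // small pool of page tables
    static uint32_t page_directory[1024];  // PDE = (pool index << 1) | valid

    static uint32_t translate(uint32_t vaddr) {
        uint32_t offset = vaddr & 0xFFF;            // low 12 bits
        uint32_t pt_idx = (vaddr >> 12) & 0x3FF;    // next 10 bits
        uint32_t pd_idx = vaddr >> 22;              // top 10 bits

        uint32_t pde = page_directory[pd_idx];
        if (!(pde & 1u)) return 0;                  // invalid: page fault
        uint32_t *pt = page_tables[pde >> 1];       // PDE selects a table
        uint32_t pte = pt[pt_idx];
        if (!(pte & 1u)) return 0;                  // invalid: page fault
        return ((pte >> 12) << 12) | offset;        // paddr = ppn:offset
    }

    int main(void) {
        page_directory[0] = (0u << 1) | 1u;   // PDE 0 -> page_tables[0]
        page_tables[0][1] = (7u << 12) | 1u;  // VPN 1 -> PPN 7
        printf("paddr = 0x%x\n", translate(0x00001234));  // prints 0x7234
        return 0;
    }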

Virtual Addressing with a Cache
It takes an extra memory access to translate a VA to a PA.
[Figure: CPU -> Translation -> Cache -> Main Memory; the VA is translated to a PA before the cache is accessed, with hits returning data and misses going to memory]
This makes memory (cache) accesses very expensive (if every access were really two accesses).

A TLB in the Memory Hierarchy
[Figure: CPU -> TLB Lookup -> Cache -> Main Memory; on a TLB hit the PA goes straight to the cache, on a TLB miss the full translation is performed]
- A TLB miss: if the page is not in main memory, then it's a true page fault, which takes 1,000,000s of cycles to service
- TLB misses are much more frequent than true page faults

Virtual vs. Physical Caches

[Figure: two configurations - a physically addressed cache (CPU -> MMU -> Cache (SRAM) -> Memory (DRAM)) and a virtually addressed cache (CPU -> Cache (SRAM) -> MMU -> Memory (DRAM)), each with addr/data lines]

Q: What happens on a context switch?
Q: What about virtual memory aliasing?
Q: So what's wrong with physically addressed caches?
A1: Physically addressed: nothing; virtually addressed: need to flush the cache.
A2: Physically addressed: nothing; virtually addressed: problems.

Indexing vs. Tagging
- Physically-Addressed Cache: slow - requires a TLB (and maybe PageTable) lookup first
- Virtually-Indexed, Virtually-Tagged Cache: fast - starts the TLB lookup before the cache lookup finishes; but PageTable changes (paging, context switch, etc.) need to purge stale cache lines (how?), and synonyms (two virtual mappings for one physical page) could end up in the cache twice (very bad!)
- Virtually-Indexed, Physically-Tagged Cache: ~fast - TLB lookup in parallel with the cache lookup; PageTable changes are no problem (the physical tag mismatches), and synonyms are handled by searching for and evicting lines with the same physical tag
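The reason the virtually-indexed, physically-tagged lookup can overlap the TLB is that the index bits come from the untranslated page offset, so they are identical in the virtual and physical address. A small checker (the 32KB / 8-way / 64-byte parameters are an assumed example, not from the slides):

    #include <stdio.h>

    // Bits needed to address n items (n a power of two).
    static int log2i(unsigned n) {
        int b = 0;
        while (n > 1) { n >>= 1; b++; }
        return b;
    }

    int main(void) {
        // Assumed example cache: 32KB, 8-way set associative, 64B blocks.
        unsigned size = 32 * 1024, ways = 8, block = 64;
        int offset_bits = log2i(block);
        int index_bits  = log2i(size / (ways * block));
        int page_offset_bits = 12;   // 4KB pages, as on the slides

        // VIPT works cleanly when index + block offset fit inside the page
        // offset: those bits are untranslated, so indexing needn't wait
        // for the TLB.
        if (offset_bits + index_bits <= page_offset_bits)
            printf("VIPT OK: index comes from untranslated bits\n");
        else
            printf("index uses translated bits: synonym problems possible\n");
        return 0;
    }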

Typical Cache Setup
[Figure: CPU with on-chip L1 Cache (SRAM) and TLB (SRAM), an L2 Cache (SRAM) reached through the MMU, and Memory (DRAM) behind it]
- Typical L1: on-chip, virtually addressed (virtually indexed), physically tagged
- Typical L2: on-chip, physically addressed
- Typical L3: on-chip

Hardware/Software Boundary
Virtual-to-physical address translation is assisted by hardware:
- The Translation Lookaside Buffer (TLB) caches the recent translations; TLB access time is part of the cache hit time, and an extra pipeline stage may be allotted for TLB access
- A TLB miss can be handled in software (a kernel handler) or in hardware
- Page table storage, fault detection, and updating: page faults result in (precise) interrupts that are then handled by the OS, and hardware must support (i.e., update appropriately) the Dirty and Reference bits (e.g., for ~LRU) in the Page Tables

Paging

Traps, Exceptions, and the Operating System

Operating System
Some things are not available to untrusted programs: exception registers, the HALT instruction, MMU instructions, talking to I/O devices, OS memory, ...
We need a trusted mediator: the Operating System (OS), which provides safe control transfer and data isolation.
[Figure: processes P1-P4 on top of the OS (VM, filesystem, net, drivers) with the MMU, disk, and eth hardware below; user/kernel/hardware boundaries separate the layers]

Terminology
- Trap: any kind of control transfer to the OS

- Syscall: a synchronous (planned), program-to-kernel transfer; the SYSCALL instruction in MIPS (various on x86)

- Exception: a synchronous, program-to-kernel transfer on exceptional events: divide by zero, page fault, page protection error, ...

- Interrupt: an asynchronous, device-initiated transfer, e.g. a network packet arrived, a keyboard event, timer ticks

(These are real mechanisms, but nobody agrees on the terms.)
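On a POSIX system, for example, a syscall is exactly such a planned program-to-kernel transfer; Linux's syscall() wrapper makes the transfer explicit (a sketch, Linux-specific):

    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        // write() normally wraps the kernel's write system call; invoking
        // it via syscall() makes the program-to-kernel transfer visible.
        syscall(SYS_write, 1, "trap into the kernel\n", 21);
        return 0;
    }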

Multicore and Synchronization
Multi-core is a reality... but how do we write multi-core-safe code?

Why Multicore?
Moore's law:
- A law about transistors (not speed)
- Smaller means faster transistors

- Power consumption grows with the number of transistors

Power Trends
[Figure: power trends in CMOS IC technology - supply voltage dropped from 5V to 1V while clock rates and power grew by roughly 1000x and 30x]

Uniprocessor Performance
[Figure: single-processor performance over time, now constrained by power, instruction-level parallelism, and memory latency]

Why Multicore?
Moore's law:
- A law about transistors
- Smaller means faster transistors

- Power consumption grows with the number of transistors

The Power Wall
- We can't reduce voltage further
- We can't remove more heat
How else can we improve performance?

Intel's Argument
[Figure: Intel's argument for multicore - one faster core versus two slower cores; the 1.2x and 1.6x labels compare the performance gained]

Amdahl's Law
A task has a serial part and a parallel part. As the number of processors increases, the time to execute the parallel part goes to zero, but the time to execute the serial part remains the same. The serial part eventually dominates, so you must parallelize ALL parts of the task.

Amdahl's Law
Consider an improvement E: a fraction F of the execution time is affected, and S is the speedup of that fraction. Then:

    overall speedup = 1 / ((1 - F) + F/S)

- F = 0.5, S = 2: 1/(0.5 + 0.25) = 1/0.75 = 1.333
- F = 0.9, S = 2: 1/(0.1 + 0.9/2) = 1/(0.1 + 0.45) = 1/0.55 = 1.8
- F = 0.99, S = 2: 1/(0.01 + 0.99/2) = 1/(0.01 + 0.495) = 1.98

- F = 0.5, S = 8: 1/(0.5 + 0.5/8) = 1.78, and as S grows this only approaches 2
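The worked cases above can be checked with a few lines of C:

    #include <stdio.h>

    // Amdahl's law: overall speedup when a fraction f of execution time
    // is sped up by a factor s.
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        printf("F=0.50, S=2: %.3f\n", amdahl(0.50, 2));  // 1.333
        printf("F=0.90, S=2: %.3f\n", amdahl(0.90, 2));  // 1.818
        printf("F=0.99, S=2: %.3f\n", amdahl(0.99, 2));  // 1.980
        printf("F=0.50, S=8: %.3f\n", amdahl(0.50, 8));  // 1.778
        return 0;
    }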

Multithreaded Processes

Shared Counters
- Usual result: works fine.
- Possible result: lost update!

[Trace: with hits = 0, threads T1 and T2 each execute "read hits (0)" then "hits = 0 + 1"; because both read before either writes, the final value is hits = 1 instead of 2]
An occasional, timing-dependent failure that is difficult to debug: a race condition.

Race Conditions
Def: a timing-dependent error involving shared state.
- Whether it happens depends on how threads are scheduled: who wins the races to the instructions that update state
- Races are intermittent and may occur rarely; timing-dependent means small changes can hide the bug
- A program is correct only if all possible schedules are safe; the number of possible schedule permutations is huge, so you need to imagine an adversary who switches contexts at the worst possible time
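A minimal pthreads program that exhibits the lost update (the million-iteration loop is my choice, to make the race likely; compile with -pthread and run it a few times):

    #include <pthread.h>
    #include <stdio.h>

    static volatile long hits = 0;   // shared counter, unprotected

    static void *worker(void *arg) {
        (void)arg;
        for (long i = 0; i < 1000000; i++)
            hits = hits + 1;         // read-modify-write: not atomic!
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        // Expected 2000000; races usually produce less.
        printf("hits = %ld\n", hits);
        return 0;
    }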

Critical Sections
The basic way to eliminate races is to use critical sections that only one thread can be in at a time; contending threads must wait to enter.

    CSEnter();
    // critical section
    CSExit();

Mutexes
Critical sections are typically associated with mutual exclusion locks (mutexes); only one thread can hold a given mutex at a time.
- Acquire (lock) the mutex on entry to the critical section, or block if another thread already holds it
- Release (unlock) the mutex on exit, allowing one waiting thread (if any) to acquire it and proceed

    pthread_mutex_init(m, NULL);
    ...
    pthread_mutex_lock(m);
    hits = hits + 1;
    pthread_mutex_unlock(m);

Protecting an Invariant

    // invariant: data is in buffer[first..last-1], protected by m
    pthread_mutex_t *m;
    char buffer[1000];
    int first = 0, last = 0;

    void put(char c) {
        pthread_mutex_lock(m);
        buffer[last] = c;
        last++;
        pthread_mutex_unlock(m);
    }

Rule of thumb: all updates that can affect the invariant become critical sections.

    char get() {
        pthread_mutex_lock(m);
        char c = buffer[first];
        first++;
        pthread_mutex_unlock(m);
        return c;
    }

X: But what if first == last? (The mutex alone doesn't stop get() from reading an empty buffer.)

See you tonight. Good luck!