Maximum Benefit from a Minimal HTM Owen Hofmann, Chris Rossbach, and Emmett Witchel The University of Texas at Austin


Post on 30-Dec-2015



Concurrency is here
- Core count ever increasing
  (figure labels: 2 core, 16 core, 80 core)
- Parallel programming is difficult
  - Synchronization perilous
  - Performance/complexity
- Many ideas for simpler parallelism
  - TM, Galois, MapReduce

Transactional memory
- Better performance from simple code
  - Change performance/complexity tradeoff
- Replace locking with memory transactions
  - Optimistically execute in parallel
  - Track read/write sets, roll back on conflict: $W_A \cap (R_B \cup W_B) \neq \emptyset$
  - Commit successful changes
- Parallelism offered by versioning and conflict detection

TM ain't easy
- TM must be fast
  - Lose benefits of concurrency
- TM must be unbounded
  - Keeping within size not easy programming model
- TM must be realizable
  - Implementing TM an important first step
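The read/write-set bookkeeping on the "Transactional memory" slide can be sketched as a toy model. This is an illustration only, not how any HTM implements it: the `Transaction` class and `conflicts` function are hypothetical names, and addresses are plain strings. It uses the usual conflict condition, namely that one transaction's write set intersects another's read or write set.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Toy transaction: records the addresses it reads and writes."""
    name: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

    def read(self, addr):
        self.read_set.add(addr)

    def write(self, addr):
        self.write_set.add(addr)


def conflicts(a, b):
    """True if a's writes overlap anything b has read or written."""
    return bool(a.write_set & (b.read_set | b.write_set))


tx_a = Transaction("A")
tx_b = Transaction("B")
tx_a.write("x")
tx_b.read("y")
no_conflict = conflicts(tx_a, tx_b)    # disjoint data: both may commit
tx_b.read("x")
has_conflict = conflicts(tx_a, tx_b)   # B read what A wrote: one must roll back
```

In this model two transactions that touch disjoint data never conflict, which is where the parallelism comes from.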

TM ain't easy
(Chart repeated across these slides: designs placed against three goals, Fast, Realizable, and Unbounded. Chart labels: Best-effort HTM, STM, Unbounded HTM, Hybrid TM.)

Best-effort HTM
- Version and detect conflicts with existing structures
  - Cache coherence, store buffer
- Simple modifications to processor
  - Very realizable (stay tuned for Sun Rock)
- Resource-limited
  - Cache size/associativity, store buffer size

STM
- Software endlessly flexible
  - Transaction size limited only by virtual memory
- Slow
  - Instrument most memory references

Unbounded HTM
- Versioning unbounded data in hardware is difficult
  - Unlikely to be implemented

Hybrid TM
- Tight marriage of hardware and software
  - Disadvantages of both?
Speaker note: Hybrid designs share data between software and hardware.

Back to basics
- Cache-based HTM
  - Speculative updates in L1
  - Augment cache line with transactional state
- Detect conflicts via cache coherence
  - Operations outside transactions can conflict
  - Asymmetric conflict
  - Detected and handled in strong isolation

Back to basics
- Transactions bounded by cache
  - Overflow because of size or associativity
  - Restart, return reason
  - Software finds another way
- Not all operations supported
  - Transactions cannot perform I/O
  - Software finds another way
Speaker note: In both these cases, it is necessary for software to work around the limitations of the HTM.

Maximum benefit
- Creative software and ISA makes best-effort unbounded
- TxLinux
  - Better performance from simpler synchronization
- Transaction ordering
  - Make best-effort unbounded
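The "Transactions bounded by cache" limitation (overflow because of size or associativity, then restart with a reason) can be illustrated with a small set-associative model. Everything here is an assumption made for the sketch: the geometry constants, the `TxCache` and `Overflow` names, and the reason strings are hypothetical and do not come from the talk.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 4     # deliberately tiny so overflow is easy to trigger (assumed)
WAYS = 2         # associativity (assumed)


class Overflow(Exception):
    """Raised when the transactional footprint no longer fits in the cache."""


class TxCache:
    def __init__(self):
        # One collection of transactional line numbers per cache set.
        self.sets = [set() for _ in range(NUM_SETS)]

    def access(self, addr):
        line = addr // LINE_SIZE
        lines = self.sets[line % NUM_SETS]
        if line in lines:
            return  # line is already tracked by the transaction
        if len(lines) == WAYS:
            total = sum(len(s) for s in self.sets)
            reason = "size" if total == NUM_SETS * WAYS else "associativity"
            raise Overflow(reason)
        lines.add(line)


def run(addresses):
    """Return 'commit', or the overflow reason, for a sequence of accesses."""
    cache = TxCache()
    try:
        for addr in addresses:
            cache.access(addr)
    except Overflow as exc:
        return exc.args[0]
    return "commit"
```

For example, `run([0, 256, 512])` touches three lines that map to the same set and overflows on associativity even though the cache is mostly empty, while nine consecutive lines overflow on size.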

Linux: HTM proving ground
- Large, complex application(s)
  - With different synchronization
- Jan. 2001: Linux 2.4
  - 5 types of synchronization
  - ~8,000 dynamic spinlocks
  - Heavy use of Big Kernel Lock
- Dec. 2003: Linux 2.6
  - 8 types of synchronization
  - ~640,000 dynamic spinlocks
  - Restricted Big Kernel Lock use

Linux: HTM proving ground
- Large, complex application
  - With evolutionary snapshots
- Linux 2.4
  - Simple, coarse synchronization
- Linux 2.6
  - Complex, fine-grained synchronization

Linux: HTM proving ground
- Modified Andrew Benchmark
(Graph not recoverable from the transcript.)

HTM can help 2.4
- Software must back up hardware
  - Use locks

- Cooperative transactional primitives
  - Replace locking function
  - Execute speculatively, concurrently in HTM
  - Tolerate overflow, I/O
  - Restart, (fairly) use locking if necessary

Locking functions:
    acquire_lock(lock)
    release_lock(lock)

Replaced by:
    cx_begin(lock)
    cx_end(lock)
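The behaviour listed for the cooperative transactional primitives (run speculatively, tolerate overflow and I/O, restart, fall back to locking) can be sketched as a toy model. This is not the TxLinux implementation: `CxLock`, `TxAbort`, the retry budget, and the reason strings are hypothetical, speculation is modeled by simply calling the critical section, and rollback of partial updates is not modeled. The fairness and the special lock-acquire instruction mentioned in the talk are also left out.

```python
import threading


class TxAbort(Exception):
    """Stands in for a hardware abort; 'reason' mimics a returned abort code."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


class CxLock:
    """Toy cooperative transactional lock (hypothetical API)."""
    MAX_RETRIES = 3  # assumed retry budget before giving up on speculation

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []  # records how each critical section finally ran

    def run(self, critical_section):
        """Try the section as a transaction; fall back to the lock if needed."""
        for _ in range(self.MAX_RETRIES):
            try:
                result = critical_section(True)    # speculative attempt
                self.log.append("transaction")
                return result
            except TxAbort as abort:
                if abort.reason == "io":
                    break  # retrying cannot help: I/O is never transactional
        with self._lock:                           # fall back to real locking
            result = critical_section(False)
            self.log.append("lock")
            return result


def bump(counter):
    def body(transactional):
        counter["n"] += 1
        return counter["n"]
    return body


def do_io(transactional):
    if transactional:
        raise TxAbort("io")   # aborts before any side effect happens
    return "io done"


lock = CxLock()
counter = {"n": 0}
first = lock.run(bump(counter))   # commits as a transaction
second = lock.run(do_io)          # aborts once, then runs under the lock
```

The caller's code is the same in both cases; only the synchronization underneath changes.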

Diagram builds for cx_begin(lock) / cx_end(lock):
- TX A and TX B run between cx_begin(lock) and cx_end(lock)
- TX B reaches do_IO()
- TX A continues; B becomes Locking B
Speaker note: Acquires the lock using special instruction, such that any transactions are allowed to commit first.

Adding HTM
- Spinlocks: good fit for best-effort transactions
  - Short, performance-critical synchronization
  - cxspinlocks (SOSP '07)

- 2.4 needs cooperative transactional mutexes
  - Must support blocking
  - Complicated interactions with BKL
  - cxmutex
  - Must modify wakeup behavior

Adding HTM, cont.
- Reorganize data structures
  - Linked lists
  - Shared counters
  - ~120 lines of code
- Atomic lock acquire
  - Record locks
  - Acquire in transaction
  - Commit changes
- Linux 2.4 -> TxLinux 2.4
  - Change synchronization, not use

Shared counter, before (TX B):
    cx_begin(lock)
    stat++
    cx_end(lock)

Shared counter, after (TX B):
    cx_begin(lock)
    stat[CPU]++
    cx_end(lock)

Speaker note: Statistics counter changes to per-cpu counter.
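Why the shared counter is reorganized can be shown with a small sketch: every increment of a single `stat` location lands in every transaction's write set, while per-CPU slots keep the write sets disjoint. The class names, `NUM_CPUS`, and the string "addresses" are hypothetical and exist only for the illustration. In a real kernel each slot would also need its own cache line, since a cache-based HTM detects conflicts per line.

```python
NUM_CPUS = 4  # assumed CPU count for the sketch


class SharedCounter:
    """One memory location: every increment writes the same address."""
    def __init__(self):
        self.value = 0

    def inc(self, cpu):
        self.value += 1
        return {"stat"}             # addresses written by this increment

    def read(self):
        return self.value


class PerCpuCounter:
    """One slot per CPU: increments on different CPUs touch different addresses."""
    def __init__(self):
        self.slots = [0] * NUM_CPUS

    def inc(self, cpu):
        self.slots[cpu] += 1
        return {("stat", cpu)}

    def read(self):
        return sum(self.slots)      # readers pay for the sum instead


def write_sets_overlap(counter):
    """Do increments from CPU 0 and CPU 1 write a common address?"""
    return bool(counter.inc(0) & counter.inc(1))
```

With the shared counter, two otherwise independent transactions conflict on `stat`; with the per-CPU layout they do not, at the cost of a slower read.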

Atomic lock acquire (TX B):
    cx_begin(lock)
    do_IO()
    cx_end(lock)

Diagram builds: TX B reaches do_IO(); acquire_locks(); Locking B runs do_IO().
Speaker note: Changed synchronization, not its use.

Evaluating TxLinux
- MAB: Modified Andrew Benchmark
- dpunish: Stress dcache synchronization
- find: Parallel find + grep
- config: Parallel software package configure
- pmake: Parallel make

Speaker note: Kernel is the benchmark.

Evaluation: MAB
- 2.4 wastes 63% kernel time synchronizing
Speaker note: With 32 processors executing this benchmark, unmodified Linux 2.4 wastes 63% of kernel time synchronizing. This is a graph of speedup over unmodified Linux 2.4 on 8 processors.
(Graph not recoverable from the transcript.)

Evaluation: dpunish
- 2.4 wastes 57% kernel time synchronizing

Evaluation: config
- 2.4 wastes 30% kernel time synchronizing

From kernel to user
- Best-effort HTM means simpler locking code
  - Good programming model for kernel
  - Fall back on locking when necessary
  - Still permits concurrency
- HTM promises transactions
  - Good model for user
  - Need software synchronization fallback
  - Don't want to expose to user
  - Want concurrency

Software, save me!
- HTM falls back on software transactions
  - Global lock
  - STM
- Concurrency
  - Conflict detection
    - HTM workset in cache
    - STM workset in memory
    - Global lock: no workset
  - Communicate between disjoint SW and HW
  - No shared data structures

Hardware, save me!
- HTM has strong isolation
  - Detect conflicts with software
  - Restart hardware transaction
  - Only if hardware already has value in read/write set
- Transaction ordering
  - Commit protocol for hardware
  - Wait for concurrent software TX
  - Resolve inconsistencies
  - Hardware/OS contains bad side effects

Transaction ordering
(Diagram labels: char* r, int idx, bad)
Invariant: idx is valid for r

Transaction A:
    begin_transaction()
    r[idx] = 0xFF
    end_transaction()

Transaction B:
    begin_transaction()
    r = new_array
    idx = new_idx
    end_transaction()

Inconsistent read causes bad data write.

Transaction ordering: A (HW), B (HW)
(Timeline diagram, several builds. The two code fragments are r=new_array; idx=new_idx and r[idx] = 0xFF.)
- Initially r = old_array, idx = old_idx; read and write sets are empty
- One transaction writes r = new_array (write set: r)
- The other transaction reads r (read set: r): Conflict!
- Restart

Transaction ordering: A (SW), B (HW)
(Timeline diagram, several builds. Here the software transaction performs r=new_array; idx=new_idx and the hardware transaction performs r[idx] = 0xFF.)
- Initially r = old_array, idx = old_idx
- Software writes r = new_array
  Speaker note: Because A is software, no hardware visible write-set to add to.
- Hardware reads r (read set: r): Conflict not detected
- Hardware reads idx = old_idx and writes new_array[old_idx] = 0xFF (read set: r, idx; write set: r[idx]): Oh, no!
- Hardware contains effects of B
  - Unless B commits
- Software writes idx = new_idx: Asymmetric conflict!
- B restarts

Software + hardware mechanisms
- Commit protocol
  - Hardware commit waits for any current software TX
  - Implemented as sequence lock
- Operating system
  - Inconsistent data can cause spurious fault
  - Resolve faults by TX restart
- Hardware
  - Even inconsistent TX must commit correctly
  - Pass commit protocol address to transaction_begin()
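The commit protocol bullet (hardware commit waits for any current software TX, implemented as a sequence lock) can be walked through on the r/idx example with a single-threaded toy model. This is only a sketch under assumptions: the `Memory` and `HwTx` classes are hypothetical, the sequence counter is taken to be odd while a software transaction runs, and strong isolation is approximated by comparing the values read against memory at commit time instead of reacting to coherence traffic.

```python
class Memory:
    def __init__(self):
        self.r = "old_array"
        self.idx = "old_idx"
        self.seq = 0        # even: no software TX running; odd: one is running
        self.writes = []    # committed (array, index) stores


class HwTx:
    """Hardware transaction running r[idx] = 0xFF speculatively."""
    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}
        self.pending = None

    def execute(self):
        self.read_set = {"r": self.mem.r, "idx": self.mem.idx}
        self.pending = (self.mem.r, self.mem.idx)   # buffered, not yet visible

    def conflicted(self):
        # Stand-in for strong isolation: a later write to r or idx aborts us.
        seen = (self.read_set["r"], self.read_set["idx"])
        return seen != (self.mem.r, self.mem.idx)

    def try_commit(self):
        if self.mem.seq % 2 == 1:
            return "wait"       # commit protocol: a software TX is still running
        if self.conflicted():
            self.pending = None
            return "restart"
        self.mem.writes.append(self.pending)
        return "commit"


mem = Memory()
mem.seq += 1                 # software TX begins (seq becomes odd)
mem.r = "new_array"          # software writes r
hw = HwTx(mem)
hw.execute()                 # hardware sees new r but old idx: inconsistent
early = hw.try_commit()      # must not commit while the software TX runs
mem.idx = "new_idx"          # software writes idx: asymmetric conflict
mem.seq += 1                 # software TX ends (seq becomes even)
late = hw.try_commit()       # the inconsistent attempt is discarded
hw.execute()
final = hw.try_commit()      # consistent r and idx, so the store commits
```

Without the wait, the first commit attempt would have published a store to new_array at old_idx, which is exactly the bad write the slides warn about.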

Transaction ordering
- Safe, concurrent hardware and software
- Evaluated on STAMP benchmarks
  - ssca2: graph kernels
  - vacation: reservation system (high-contention and low-contention)
  - yada: Delaunay mesh refinement
- Configurations compared
  - Best-effort + Single Global Lock STM (ordered)
  - Idealized HTM (free)
Speaker note: From unrelated systems.

Evaluation: ssca2
- No overflow: performance == ideal
Speaker note: There are two lines here, one on top.

Evaluation: vacation-low