TRANSACTIONAL COHERENCE AND CONSISTENCY: SIMPLIFYING PARALLEL HARDWARE AND SOFTWARE

Lance Hammond, Brian D. Carlstrom, Vicky Wong, Michael Chen, Christos Kozyrakis, and Kunle Olukotun
Stanford University

TCC simplifies parallel hardware and software design by eliminating the need for conventional cache coherence and consistency models and letting programmers parallelize a wide range of applications with a simple, lock-free transactional model.

With uniprocessor systems running into instruction-level parallelism (ILP) limits and fundamental VLSI constraints, parallel architectures provide a realistic path toward scalable performance by letting programmers exploit thread-level parallelism (TLP) in more explicitly distributed architectures. As a result, single-board and single-chip multiprocessors are becoming the norm for server and embedded computing, and are even starting to appear on desktop platforms. Nevertheless, the complexity of parallel application development and the continued difficulty of implementing efficient and correct parallel architectures have limited these architectures' potential.

Most existing parallel systems add hardware to give programmers an illusion of a single shared memory common to all processors. Programmers must divide the computation into parallel tasks, but all tasks work on a single data set resident in the shared memory. Unfortunately, the hardware required to support this model can be complex. To provide a coherent view of memory, the hardware must determine the location of the latest version of any particular memory address, recover the latest version of a cache line from anywhere within the system when a load from it occurs, and efficiently support interprocessor communication of numerous small, cache-line-sized data packets. It must do all this with minimal latency, too, because individual load and store instructions depend on each communication event. Further complicating matters is the problem of sequencing the various communication events constantly passing through the system at the fine-grained level of individual load and store instructions. For software synchronization routines to work, hardware designers must devise and correctly implement sets of rules known as memory consistency models. Over the years, these models have progressed from easy-to-understand but sometimes performance-limiting sequential consistency schemes to more modern schemes such as relaxed consistency.

The complex interaction of coherence, synchronization, and consistency makes the job of parallel programming difficult. Existing approaches require programmers to manage parallel concurrency directly by creating and explicitly synchronizing threads. The difficulty stems from the need to achieve the often conflicting goals of functional correctness and high performance. In particular, using a few coarse-grained locks can make it simpler to correctly sequence accesses to variables shared among parallel threads. On the other hand, having numerous fine-grained locks often allows higher performance by reducing the amount of time wasted by threads as they compete for access to the same variables, although the larger number of locks usually incurs more locking overhead.

Transactional coherence and consistency (TCC) simultaneously eases both parallel programming and parallel architecture design by relying on programmer-defined transactions as the basic unit of parallel work, communication, memory coherence, and memory consistency. Although several researchers have proposed using hardware transactions instead of locks or other parallel programming constructs individually,4-6 TCC unifies these ideas into an "all transactions, all the time" model that allows significant simplifications of both parallel hardware3 and software.2 As parallel architectures become increasingly prevalent and a wider variety of hardware designers and programmers must deal with their intricacies, this advantage will increase TCC's importance.

TCC hardware

Processors operating in a TCC-based multiprocessor execute speculative transactions in a continuous cycle (illustrated in Figure 1a) on multiprocessor hardware similar to that depicted in Figure 1b. A transaction is a sequence of instructions marked by software that is guaranteed to execute and complete only as an atomic unit. Each transaction produces a block of writes, which the processor buffers locally while the transaction executes, committing them to shared memory only as an atomic unit after the transaction completes. Once the transaction is complete, hardware must arbitrate system-wide for permission to commit the transaction's writes. After winning the arbitration, the processor can exploit high-bandwidth system interconnect to broadcast all of the transaction's writes to the rest of the system as one packet. This broadcast can make scaling TCC to numerous processors a challenge. Meanwhile, other processors' local caches snoop on the write packets to maintain system coherence. Snooping also lets the processors detect when they have used data that another processor has subsequently modified: a dependence violation. Combining all of the transaction's writes imparts latency tolerance because it reduces the number of interprocessor messages and arbitrations, and because flushing the writes is a one-way operation. The commit operation can also provide inherent synchronization for programmers and a greatly simplified consistency protocol for hardware engineers. Most significantly, this continual cycle of speculative buffering, broadcast, and (potential) violations lets us replace both conventional coherence and consistency protocols.

Figure 1. Transactional coherence and consistency (TCC) in hardware: the transaction cycle (a) and a diagram of sample TCC-enabled hardware (b).
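To make this cycle concrete, here is a minimal, single-threaded C sketch of the buffer-then-commit behavior under the semantics just described. It is not the authors' hardware design; the names WriteBuffer, tcc_store, and tcc_commit are hypothetical.

#include <stdio.h>

#define MEM_SIZE 8
#define BUF_SIZE 8

static int shared_mem[MEM_SIZE];   /* committed, globally visible state */

typedef struct {
    int addr[BUF_SIZE];            /* speculative write addresses */
    int value[BUF_SIZE];           /* speculative write values */
    int count;                     /* a full buffer would be the overflow case */
} WriteBuffer;                     /* per-processor write buffer */

/* Speculative store: buffered locally, invisible to other processors. */
static void tcc_store(WriteBuffer *wb, int addr, int value)
{
    wb->addr[wb->count] = addr;
    wb->value[wb->count] = value;
    wb->count++;
}

/* Commit: after (conceptually) winning system-wide arbitration, flush the
   whole buffer to shared memory as one atomic packet, then reset it so the
   next transaction can begin immediately. */
static void tcc_commit(WriteBuffer *wb)
{
    for (int i = 0; i < wb->count; i++)
        shared_mem[wb->addr[i]] = wb->value[i];
    wb->count = 0;
}

int main(void)
{
    WriteBuffer wb = { .count = 0 };

    tcc_store(&wb, 0, 42);                        /* speculative write */
    printf("before commit: %d\n", shared_mem[0]); /* still 0 */
    tcc_commit(&wb);                              /* atomic commit point */
    printf("after commit:  %d\n", shared_mem[0]); /* now 42 */
    return 0;
}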

Consistency protocols can be simplified because communication can occur only at occasional, programmer-defined commit points, instead of at each of the numerous loads and stores used by conventional models. This has significant implications for both hardware and software. For hardware designers, TCC simplifies the design by drastically reducing the number of latency-sensitive arbitration and synchronization events that must be sequenced by consistency-management logic in a typical multiprocessor system. To programmers or parallelizing compilers, explicit commit operations mean that software can orchestrate communication much more precisely than with conventional consistency models. Simplifying matters further for most programmers, imposing an order on the transaction commits and backing up uncommitted transactions if they have speculatively read data modified by other transactions effectively lets the TCC system provide an illusion of uniprocessor execution. As far as global memory and software are concerned, all memory references from a transaction that commits earlier effectively occurred before all of the memory references of a transaction that committed later. This is true even if their actual execution was interleaved in time, because all writes from a transaction become visible to other processors only at commit time.

This simple, pseudo-sequential consistency model allows for a simpler coherence model, too. During each transaction, stores are buffered and kept within the transaction's processor node to maintain transaction atomicity. Processor caches do not use conventional modified/exclusive/shared/invalid (MESI)-style protocols to maintain lines in shared or exclusive states at any point, so many processor nodes can legally hold the same line simultaneously in either an unmodified or speculatively modified form. At the end of each transaction, the processor's commit broadcast notifies all other processors about its changed state. During this process, the other processors perform conventional invalidation (if the commit packet contains only addresses) or update (if it contains addresses and data) to keep their cache state coherent. Simultaneously, they must determine whether they have read from any of the committed addresses. If they have, they must restart and reexecute their current transactions with the updated data. This protects against true data dependencies. At the same time, because later processors' transactions do not flush out any data to memory until their own turn to commit, data antidependencies are not an issue. Until a transaction commits, transactions that commit earlier do not see its effectively later results (avoiding write-after-read dependencies), and processors can freely overwrite previously modified data in a clearly sequenced manner (handling write-after-write dependencies legally).

TCC will work in a wide variety of multiprocessor hardware environments, including various chip multiprocessor configurations and small-scale multichip multiprocessors. Within these systems, individual processor cores and their local cache hierarchies need features that provide speculative buffering of memory references and commit arbitration control. Most significantly, a mechanism for gathering all modified cache lines from each transaction into a commit packet is required. This mechanism can be a write buffer completely separate from the caches or an address buffer that maintains a list of tags for lines containing data to be committed. The buffer must hold approximately 4 to 16 Kbytes worth of cache lines. If it fills during the execution of a long transaction, the processor must declare an overflow and stall until it obtains commit permission, when it can continue executing while writing its results directly to shared memory. Of course, this write-through behavior means that no other processors can commit while the transaction completes, to maintain atomicity. This effect can cause serious serialization if it occurs frequently, but is acceptable on occasion.

In the caches, all of the included lines must maintain the following information in some way:

• Read bits. Bits set on loads to indicate that a cache line (or a portion thereof) has been read speculatively during a transaction. A processor's coherence logic snoops the read bits while other processor nodes commit to determine when the processor has speculatively read any data too early. If it snoops and sees a write committed by another processor modifying an address cached locally with its read bit set, it must violate its current transaction and restart.

• Modified bits. Every cache line has at least one modified bit. Stores set the modified bit to indicate when any part of the line has been written speculatively. The processor uses these bits to simultaneously invalidate all speculatively written lines when it detects a violation.

Additional bits can improve performance, but are not essential for correct execution. A processor cannot flush cache lines with set read bits from its local cache hierarchy in mid-transaction; if it does, the processor must stall for an overflow. Set modified bits will cause similar overflow conditions if the write buffer holds only addresses.
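The following hedged sketch models the snooping logic these bits imply in plain C rather than hardware; the CacheLine structure and snoop_commit function are hypothetical. A commit packet that touches a line with its read bit set triggers a violation, after which all speculatively modified lines are discarded together.

#include <stdbool.h>
#include <stdio.h>

#define NUM_LINES 4

typedef struct {
    int  tag;          /* which memory line this cache entry holds */
    bool read_bit;     /* set by speculative loads in this transaction */
    bool modified_bit; /* set by speculative stores in this transaction */
    bool valid;
} CacheLine;

/* Snoop one commit packet (a list of committed line tags). Returns true if
   the current transaction must be violated and restarted. */
static bool snoop_commit(CacheLine cache[], int n,
                         const int committed_tags[], int num_tags)
{
    bool violated = false;
    for (int i = 0; i < n; i++) {
        if (!cache[i].valid)
            continue;
        for (int t = 0; t < num_tags; t++) {
            if (cache[i].tag != committed_tags[t])
                continue;
            cache[i].valid = false;   /* conventional invalidation */
            if (cache[i].read_bit)
                violated = true;      /* data was read too early */
        }
    }
    if (violated) {
        /* On violation, discard all speculatively written lines at once. */
        for (int i = 0; i < n; i++)
            if (cache[i].modified_bit)
                cache[i].valid = false;
    }
    return violated;
}

int main(void)
{
    CacheLine cache[NUM_LINES] = {
        { .tag = 7, .read_bit = true,  .modified_bit = false, .valid = true },
        { .tag = 9, .read_bit = false, .modified_bit = true,  .valid = true },
    };
    int packet[] = { 7 };   /* another processor committed line 7 */
    printf("violation: %s\n",
           snoop_commit(cache, NUM_LINES, packet, 1) ? "yes" : "no");
    return 0;
}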

TCC software

TCC parallelization requires only a few new programming constructs. It is simpler than parallelization with conventional threaded models because it needs fewer code transformations for typical parallelization efforts. In particular, it lets programmers make informed trade-offs between programmer effort and performance. In simplified form, programming with TCC is a three-step process:

1. Divide the program into transactions. To create a parallel program using TCC, a programmer coarsely divides the program into transactions that can run concurrently on different processors. In this respect, parallelizing for TCC is like conventional parallelization, which also requires finding and marking parallel code regions. However, with TCC the programmer does not need to guarantee that parallel regions are independent, as hardware will catch all dependence violations dynamically. Our interface lets programmers divide their program into parallel transactions using loop iterations and/or a forking mechanism.

2. Specify transaction order. The default transaction ordering is to have transactions commit results in the same order as they would in the original sequential program, because this guarantees that the program will execute correctly. However, if a programmer can verify that this commit order constraint is unnecessary, he or she can relax it completely or partially to improve performance. The interface also provides ways to specify the application's ordering constraints in useful ways.

3. Tune performance. After selecting and ordering transactions, the programmer can run the program in parallel. The TCC system can automatically provide feedback about where violations occur in the program, which can direct the programmer to perform further optimizations.

We describe the interface in C, but it can be readily adapted to any programming language (for example, we also used Java).

Loop-based parallelization

We introduce loop parallelization in the context of a simple sequential code segment that calculates a histogram of 1,000 integer percentages using an array of corresponding buckets:

/* input */
int *data = load_data();
int i, buckets[101];

for (i = 0; i < 1000; i++) {
    buckets[data[i]]++;
}

/* output */
print_buckets(buckets);

The compiler interprets this program as one large transaction, exposing no parallelism to the TCC hardware. We can parallelize the for loop, however, with the t_for keyword:

...

t_for (i = 0; i < 1000; i++) {

...

With this small change, we now have a parallel loop that is guaranteed to execute identically to the original sequential loop. A similar t_while keyword is also available for while loops. Each loop body iteration becomes a separate transaction that can execute in parallel but must commit in the original sequential order, such as the pattern in Figure 2. When two parallel iterations try to update the same histogram bucket simultaneously, the TCC hardware causes the later iteration to violate when the earlier iteration commits, forcing the later one to reexecute using updated data and preserving the original sequential semantics.

Figure 2. Sample schedule for transactional loops.

In contrast, a conventionally parallelized system would require an array of locks to protect the histogram bins, resulting in much more extensive changes:

int *data = load_data();
int i, buckets[101];

/* Define & initialize locks */
LOCK_TYPE bucketLock[101];
for (i = 0; i < 101; i++) {
    LOCK_INIT(bucketLock[i]);
}

for (i = 0; i < 1000; i++) {
    LOCK(bucketLock[data[i]]);
    buckets[data[i]]++;
    UNLOCK(bucketLock[data[i]]);
}

print_buckets(buckets);

If any of this locking code is omitted or buggy, the program might fail, and not necessarily in the same place every time, significantly complicating debugging. Debugging is especially difficult if the errors occur only during infrequent memory access patterns. The situation is potentially even trickier if an application must hold multiple locks simultaneously within a critical region, because programmers can easily write code with locking sequences that might deadlock in these cases.

Although sequential ordering is generally useful because it guarantees correct execution, in some cases, such as this histogram example, it is not actually required for correctness. In this example, the only dependencies among the loop transactions are through the histogram bin updates, which can be performed in any order. If programmers can verify that no dependencies are order-critical, or if there are simply no loop-carried dependencies, they can use the t_for_unordered and t_while_unordered keywords to allow the loop's transactions to commit in any order. Allowing unordered commits is most useful in more complex programs with dynamically varying transaction lengths, because it eliminates much of the time that processors spend waiting for commit permission between unbalanced transactions.
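Because the bucket increments commute, the histogram loop qualifies; a minimal sketch of the unordered form, reusing the variables from the earlier example:

t_for_unordered (i = 0; i < 1000; i++) {
    buckets[data[i]]++;
}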

Fork-based parallelization

Although the simple parallel loop API will work for many programs, some less-structured programs might need to generate transactions more flexibly. For these situations t_fork, a transactional fork similar to conventional thread-creation APIs, is useful:

void t_fork(void (*child_function_ptr)(void*),
            void *input_data,
            int child_sequence_num,
            int parent_phase_increment,
            int child_phase_increment);

/* ...which forks off a child function with this signature: */
void child_function(void *input_data);

This call forces the parent transaction to commit, and then creates two completely new, parallel transactions in its place. One (the new parent) continues executing the code immediately following the t_fork, while the other (the child) starts executing the function at child_function_ptr with input_data. Other input parameters control the ordering of forked transactions in relation to other transactions, as we discuss in the following section. We demonstrate this function with a parallelized form of a simple two-stage processor pipeline, which we simulate using the functions i_fetch for instruction fetch, increment_PC to select the next instruction, and execute to execute instructions. The child transaction executes each instruction while the new parent transaction fetches another:

/* Initial setup */
int PC = INITIAL_PC;
int opcode = i_fetch(PC);

/* Main loop */
while (opcode != END_CODE) {
    t_fork(execute, &opcode, 1, 1, 1);
    increment_PC(opcode, &PC);
    opcode = i_fetch(PC);
}

This example creates a sequence of overlapping transactions similar to those in Figure 3. The t_fork call gives enough flexibility to divide a program into transactions in virtually any way. It can even be used to build the t_for and t_while constructs, if necessary.

Figure 3. Simple fork example, where the parameters are sequence:phase.

Explicit transaction commit ordering

The simple ordered and unordered modes might not always suffice. For example, a programmer might desire partial ordering: executing unordered most of the time, but occasionally imposing some ordering. This is rare with transactional loops, but quite common with forked transactions. To handle these cases, we have developed a simple model for explicitly controlling transaction sequencing when necessary. However, because sequencing requirements often vary in small but critical ways from program to program, this is still an active area of development as we evaluate more applications with TCC.

We control transaction ordering by assigning two parameters to each transaction: sequence and phase. These two numbers control the ordering of transaction commits. Transactions with the same sequence number might need to commit in a programmer-defined order, whereas transactions in different sequences are always independent. We can use the child_sequence_num parameter within a t_fork call to produce a child transaction with a new sequence number. Within each sequence, the phase number indicates each transaction's relative age. TCC hardware will only commit transactions in the oldest active phase (lowest value) from within each active sequence. Using this notation, an ordered loop is simply a sequence of transactions with the phase number incremented by one each time, whereas an unordered loop uses transactions with the same phase number.

We can impose more arbitrary transaction phase ordering with the following calls:

void t_commit(int phase_increment);

void t_wait_for_sequence(int phase_increment,
                         int wait_for_sequence_num);

The t_commit routine implicitly commits the current transaction and then immediately starts another on the same processor with its phase number incremented by the phase_increment parameter. The most common phase_increment parameter is 0, which simply breaks a large transaction into two. However, using a phase_increment of 1 or more forces an explicit transaction commit ordering. This can be used in many ways, but the most common is to emulate a conventional barrier among all transactions in a sequence using transactional semantics. The similar t_wait_for_sequence call performs a t_commit and waits for all transactions in another sequence to complete. Programmers typically use this call to let a parent transaction sequence wait for a child sequence to complete, similar to a conventional thread join operation.
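To illustrate, here is a minimal, hedged sketch of the barrier idiom (not code from the original article): because TCC commits only the oldest active phase within a sequence, a t_commit(1) in every processor's transaction stream acts like a barrier. The helpers compute_local() and use_all_results() are hypothetical.

compute_local(my_id);     /* phase 0: runs in every transaction in the sequence */
t_commit(1);              /* barrier: commit phase-0 work, advance this stream to phase 1 */
use_all_results(my_id);   /* phase 1: cannot commit before every phase-0 transaction has */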


Performance evaluation

We used our TCC programming constructs to parallelize applications from various domains. We then used simulation to evaluate and tune their performance for large- and small-scale TCC systems.

Our execution-driven simulator models a processor executing at a fixed rate of one instruction per cycle while producing traces of all transactions in the program. We analyzed the traces on a parameterized TCC system simulator that includes 4 to 32 processors connected by a broadcast network. The TLP-oriented benefits of TCC parallelization are fairly orthogonal to speedup from superscalar ILP extraction techniques within individual processors, so our results can scale for systems with processors faster (or slower) than 1.0 instructions per cycle, until the commit broadcast bandwidth is saturated.

Table 1 lists the values of three key system parameters that describe three potential TCC configurations: ideal (infinite bandwidth, zero overhead), single-chip/CMP (high bandwidth, low overhead), and single-board/SMP (medium bandwidth, higher overhead).

Table 2 presents the applications we used in the study. These applications exhibit a diverse set of concurrency patterns, including dense loops (LUFactor), sparse loops (equake), and task parallelism (SPECjbb). We modified the C applications to use transactions manually.


Table 1. Key parameters in our simulations (all cycle values are in CPU cycles).

System                           Description                                  InterCPU bandwidth   Commit overhead   Violation delay
                                                                              (bytes per cycle)    (cycles)          (cycles)
Ideal                            "Perfect" TCC multiprocessor                 Infinite             0                 0
Chip multiprocessor (CMP)        Realistic multiprocessor on a single chip    16                   5                 0
Symmetric multiprocessor (SMP)   Realistic multiprocessor on a board          4                    25                20

Table 2. Characteristics of applications used for our analysis.

Source       Benchmark    Description                                   Source        Input                                  Lines of   Lines changed   Primary TCC
language                                                                                                                     code       (percent)       parallelization
Java         Assignment   Resource allocation solver                    jBYTEmark     51 × 51 array                          556        5.8             Loop: two ordered, nine unordered
Java         MolDyn       N-body code modeling particles                Java Grande   2,048 particles                        615        3.3             Loop: nine unordered
Java         LUFactor     Matrix factorization and triangular solver    jBYTEmark     101 × 101 matrix                       516        1.9             Loop: two ordered, four unordered
Java         RayTrace     3D ray tracer                                 Java Grande   150 × 150 pixel image                  1,233      4.9             Loop: nine unordered
Java         SPECjbb      Transaction processing server                 SPECjbb       230 iterations without randomization   27,249     1.3             Fork: five calls (one per task type)
C            art          Image recognition/neural network              SPEC2000 FP   Reference.1                            1,270      8.9             Loop: 11 chunked and unordered
C            equake       Seismic wave propagation simulation           SPEC2000 FP   Reference                              1,513      0.8             Loop: three unordered
C            tomcatv      Vectorized mesh generation                    SPEC95 FP     256 × 256 array                        346        2.0             Loop: seven unordered
C            MPEGdecode   Video bitstream decoding                      Mediabench    mei16v2.m2v                            9,834      4.6             Fork: one call

In contrast, we parallelized most Java applications automatically using the Jrpm dynamic compilation system,4 and then ran them on top of a version of the Kaffe Java virtual machine (http://kaffe.org) customized to use transactions instead of locks. The only exception was SPECjbb, which we manually parallelized by forking transactions for each warehouse task (such as orders and payments). Although most programmers assign separate warehouses to each processor to parallelize this benchmark, TCC let us parallelize within a single warehouse, a virtually impossible task with conventional techniques. In all cases, we only needed to modify a modest percentage of the original code to make it parallel.

Parallel performance tuning

After a program is divided into transactions, most problems will tend to be with performance, not correctness. To obtain good performance, programmers must optimize their transactions to balance a few competing goals:

• Maximize parallelism. Programmers must break as much of the program as possible into parallel transactions, which should be of roughly equal size to maintain good load balancing across processors.

• Minimize violations. To avoid the costly discarding of work after violations, programmers should avoid parallel transactions that communicate frequently. Keeping transactions reasonably small to minimize the amount of work lost when violations occur can also help.

• Minimize transaction overhead. On the other hand, programmers should generally avoid very small transactions because of the overhead associated with starting, ending, and committing transactions.

• Avoid buffer overflows. Buffer overflows can cause serialization by requiring processors to hold commit permission for a long time. Hence, programmers should stay away from transactions that read or write large amounts of state.

Figure 4a shows application speedups as we applied successive optimizations for CMP configurations with 4, 8, 16, and 32 processors, and Figure 4b breaks down execution time for the eight-processor case into "useful work," "waiting for commit," "losses to violations," and "idling" caused by too few parallel transactions for the available processors (usually the result of sequential code regions). Although the baseline results showed significant speedup, they often needed improvement. Where possible, we first optimized with unordered loops. However, because load balancing was rarely a problem in the applications we selected, as our low "waiting for commit" times show, this did not significantly improve performance.

We used results from initial application runs to guide our optimization efforts. These runs produced violation reports that summarized the load-store pairs and data addresses causing violations, prioritized by the amount of time lost. We then used gdb to determine exactly which variables and variable accesses caused the violations. This information is significantly more useful than the feedback available on current shared-memory systems, which tends to be mostly in terms of coherence protocol statistics. It can greatly increase programmer productivity by guiding programmers directly to the most critical communication. Violation reports can lead to a wide variety of optimizations, many adapted from traditional parallelization techniques.

One of the most common optimizations is the privatization of shared variables that cause unnecessary violations. For example, we improved SPECjbb by privatizing some shared temporary buffers. The most common type of privatization, however, was to privatize the loop-carried "sum" variable used as a reduction target within loops that are otherwise good targets for parallelization. In our statistics, the "sum" variables caused frequent violations. However, many of the reduction operations are associative, such as addition and multiplication, and can therefore be reordered safely. Programmers can expose parallelism by privatizing a "sum" variable within each processor and then combining the private copies only at loop termination.
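A hedged sketch of this transformation using the constructs introduced earlier; N, CHUNK, the array a, and the function f are illustrative assumptions, not names from the article:

/* Naive version: every iteration reads and updates the shared sum, so each
   commit violates all concurrently executing iterations. */
double sum = 0.0;
t_for (i = 0; i < N; i++) {
    sum += f(a[i]);
}

/* Privatized version: accumulate into a local variable and touch the shared
   sum once per transaction. Because addition is associative, the chunks can
   safely commit unordered. */
t_for_unordered (c = 0; c < N; c += CHUNK) {
    double local = 0.0;
    for (int i = c; i < c + CHUNK && i < N; i++)
        local += f(a[i]);
    sum += local;   /* one potentially violating access per chunk */
}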

Although transactions should usually be monolithic, it is sometimes more helpful to break large transactions into a transactional group of two or more smaller transactions that must execute sequentially on one processor. A programmer can use a t_commit(0) call to mark each breakpoint between the individual transactions. Because this operation splits the original transaction into two separate atomic regions, the programmer must not place t_commits in the middle of critical regions that must execute atomically. Despite this limitation, the technique can help solve three problems (see the sketch after this list):

• Inserting a t_commit takes a new rollback checkpoint, limiting the amount of work lost to a violation. This is helpful with SPECjbb, which has large transactions and frequent, unavoidable violations.

• It lets the processor commit and broadcast changes made by a transaction to other parallel processors early, when the original transaction would only have been partially completed.

• Because each commit flushes the processor's TCC write buffer, judicious t_commits can prevent or reduce buffer overflows.
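A hedged sketch of such a transactional group; stage_one() and stage_two() are hypothetical stand-ins for two halves of a large loop body that do not need to be atomic with respect to each other:

t_for_unordered (i = 0; i < N; i++) {
    stage_one(i);   /* fills much of the write buffer */
    t_commit(0);    /* checkpoint: commit early and empty the buffer */
    stage_two(i);   /* continues as a separate atomic region on the same processor */
}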

Loop adjustment techniques can also be critical when optimizing code for TCC. Loop unrolling, fusion, fission, and renesting are common parallelizing compiler tricks that can all prove helpful. Although the techniques are the same, the patterns used typically differ somewhat, with optimal transaction sizes being the usual goal. In addition, with loop nests a programmer must carefully choose the proper loop nesting level to use as the source of transactions, because our model does not support nested transactions. Outer loop bodies provide large, coarse-grained transactions, but these loop bodies can easily be too large, causing frequent buffer overflows. On the other hand, using inner loop bodies can make the transactions too small, incurring excessive startup and commit overheads. Meanwhile, critical loop-carried dependencies can occur at any level. For example, tomcatv's inner loops had many dependencies and were small, forcing us to use outer loops.


Figure 4. Evaluation results: improvements in CMP configuration speedup provided by the various optimizations for 4 to 32 processors (a), and processor utilization breakdowns for the eight-processor case (b).

Overall results

Figure 5 presents the best achieved speedups for three TCC system configurations (ideal, CMP, and SMP) using 4 to 32 processors. While results are generally quite good, combinations of unavoidable violations and regions with insufficient parallel transactions, which cause extra processors to idle, limit some applications. In addition, large sequential code regions limit Assignment and RayTrace.

Figure 5. Overall speedup obtained using different hardware configurations.

For most benchmarks, CMP performance closely tracks ideal performance for all processor counts. The CMP configuration is worse than the ideal only when insufficient commit bandwidth is available, which generally occurs only with 32 or more processors. This causes processors to stall while waiting for access to the commit network. Similarly, the SMP TCC configuration achieves very little benefit beyond four- to eight-processor configurations because of its significantly reduced global bandwidth. Of our applications, only Assignment, SPECjbb, and tomcatv used more processors to significant advantage. This result is still promising, however, because online server applications such as SPECjbb would most likely be the primary applications running on TCC systems larger than CMPs.

The most significant TCC hardware addition is speculative buffer support. The amount of state read and written by an average transaction must be small enough for local buffering. Figure 6 shows the buffer size needed to hold the state read (Figure 6a) or written (Figure 6b) by 10, 50, and 90 percent of each application's transactions, sorted by the size of the 90-percent limit: generally 6 to 12 Kbytes for read state and 4 to 8 Kbytes for write state. All applications have a few very large transactions that will overflow, but TCC hardware should be able to achieve good performance as long as this does not occur often.

Figure 6. State read (a) or written (b) by individual transactions, with a cache/buffer granularity of 64-byte lines, for 10, 50, and 90 percent of the iterations.

In the future, we plan to explore software and hardware techniques for scaling TCC to large-scale parallel systems, so that programmers can use a single programming paradigm on any parallel machine, from small CMPs to large SMPs. We will complement this work by investigating dynamic optimization techniques that will help automate the otherwise time-consuming parallel performance tuning of larger and more complex TCC programs. Ultimately, we see the combination of TCC hardware and highly automated TCC programming language support as forming a new programming paradigm that will provide a smooth transition from today's sequential programs to tomorrow's parallel software. This paradigm will be an important catalyst in transforming parallel processing from a niche activity to one that is accessible to the average software developer.

Acknowledgments

This work was supported by US National Science Foundation grant CCR-0220138 and DARPA PCA program grants F29601-01-2-0085 and F29601-03-2-0117.

References

1. R. Rajwar and J. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," Proc. 10th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 5-17.

2. M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. 20th Int'l Symp. Computer Architecture (ISCA 93), ACM Press, 1993, pp. 289-300.

3. J. Martinez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Parallel Applications," Proc. 10th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 18-29.

4. M.K. Chen and K. Olukotun, "The Jrpm System for Dynamically Parallelizing Java Programs," Proc. 30th Int'l Symp. Computer Architecture (ISCA 03), IEEE CS Press, 2003, pp. 434-445.

5. L. Hammond et al., "Programming with Transactional Coherence and Consistency (TCC)," Proc. 11th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 04), ACM Press, 2004, pp. 1-13.

6. L. Hammond et al., "Transactional Memory Coherence and Consistency," Proc. 31st Int'l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 102-113.

Brian D. Carlstrom is a PhD candidate in computer science at Stanford University. His research interests include transactional programming models and other aspects of modern programming language design and implementation. Carlstrom has a master's degree in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a member of the ACM.

Vicky Wong is a PhD candidate in computer science at Stanford University. Her research interests include parallel computer architectures and probabilistic reasoning algorithms. Wong has an MS in computer science from Stanford University.

Michael Chen performed this work while completing his PhD in electrical engineering at Stanford University. He is now a staff researcher in the Programming System Lab at Intel's Microprocessor Research Labs. His research interests include optimization techniques for compiling high-level languages to the Intel IXP network processors.

Christos Kozyrakis is an assistant professor of electrical engineering and computer science at Stanford University. His research interests include architectures, compilers, and programming models for parallel systems ranging from embedded devices to supercomputers. Kozyrakis has a PhD in computer science from the University of California, Berkeley.

Kunle Olukotun is an associate professor of electrical engineering and computer science at Stanford University. His research interests include computer architecture, parallel programming environments, and formal hardware verification. Olukotun has a PhD in computer science and engineering from the University of Michigan.

Direct questions and comments about this article to Kunle Olukotun, Stanford University, Gates Bldg. 3A-302, Stanford, CA 94305-9030; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
