

SINGLE-THREADED VS. MULTITHREADED: WHERE SHOULD WE FOCUS?

TO CONTINUE TO OFFER IMPROVEMENTS IN APPLICATION PERFORMANCE, SHOULD COMPUTER ARCHITECTURE RESEARCHERS AND CHIP MANUFACTURERS FOCUS ON IMPROVING SINGLE-THREADED OR MULTITHREADED PERFORMANCE? THIS PANEL, FROM THE 2007 WORKSHOP ON COMPUTER ARCHITECTURE RESEARCH DIRECTIONS, DISCUSSES THE RELEVANT ISSUES.

Moderator’s introduction: Joel Emer

Today, with the increasing popularity of multicore processors, one approach to optimizing the processor’s performance is to reduce the execution times of individual applications running on each core by designing and implementing more powerful cores. Another approach, which is the polar opposite of the first, optimizes the processor’s performance by running a larger number of applications (or threads of a single application) on a correspondingly larger number of cores, albeit simpler ones. The difference between these two approaches is that the former focuses on reducing the latency of individual applications or threads (it optimizes the processor’s single-threaded performance), whereas the latter focuses on reducing the latency of the applications’ threads taken as a group (it optimizes the processor’s multithreaded performance).

The obvious advantage of the single-threaded approach is that it minimizes the execution times of individual applications (especially preexisting or legacy applications), but potentially at the cost of longer design and verification times and lower power efficiency. By contrast, although the multithreaded approach may be easier to design and have higher power efficiency, its utility is limited to specific (highly parallel) applications; it is difficult to program for other, less-parallel applications.

Of course, the multiprocessor versus uniprocessor controversy is not a new issue. My earliest recollection of it is from back in the 1970s, when Intel introduced the 4004, the first microprocessor. I remember almost immediately seeing proposals that said that if we could just hook together a hundred or a thousand of these, we would have the best computer in the world, and it would solve all of our problems. I suspect that no one remembers the result of that research, as the long series of increasingly more powerful microprocessors is a testament to our success at (and the value of) improving uniprocessor performance. Yet, with each successive generation, the 4040, the 8080, all the way up to the present, we’ve seen exactly the same sort of proposal for ganging together lots of the latest-generation uniprocessors.

Joel Emer, Intel
Mark D. Hill, University of Wisconsin–Madison
Yale N. Patt, University of Texas at Austin
Joshua J. Yi, Freescale Semiconductor
Derek Chiou, University of Texas at Austin
Resit Sendag, University of Rhode Island

Published by the IEEE Computer Society, 0272-1732/07/$20.00 © 2007 IEEE

So the question for these panelists is, why is today’s generation different from any of the past generations? Clearly, it is not just that we cannot achieve gains in instructions per cycle (IPC) in direct proportion to the number of transistors used, because that’s never been true. If it had been true, we would have IPCs several orders of magnitude larger than those we have today. Our recent rule of thumb has been that processor performance improves as the square root of the number of transistors, and cache-miss rates likewise improve as the square root of the cache size. Yet, despite these sublinear architectural improvements, until recently, uniprocessors have been the preferred trajectory. Why is the uniprocessor trajectory insufficient for today? Are there no ideas that will bring even that sublinear architectural performance gain? Why aren’t the order-of-magnitude gains being promised by single-instruction, multiple-data (SIMD), vector, and streaming processors of interest? Is the complexity of current architectures a factor in the inability to push across the next architectural performance step? Or is there a feeling that, irrespective of architecture, we’ve crossed a complexity threshold beyond which we can’t build better processors in a timely fashion?
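Emer’s square-root rules of thumb can be put in concrete terms. The sketch below is illustrative only; the function names and sample numbers are mine, not the panel’s:

```python
# Illustrative sketch of the "square root" rules of thumb: single-thread
# performance grows roughly as sqrt(transistor budget), and cache-miss
# rate falls roughly as sqrt(cache size). Inputs are hypothetical.
import math

def perf_gain(transistor_ratio: float) -> float:
    """Relative performance from growing the transistor budget by a ratio."""
    return math.sqrt(transistor_ratio)

def miss_rate(base_miss_rate: float, cache_ratio: float) -> float:
    """Relative miss rate after growing the cache by a ratio."""
    return base_miss_rate / math.sqrt(cache_ratio)

# A 100x transistor budget yields only ~10x single-thread performance...
print(perf_gain(100.0))      # 10.0
# ...and quadrupling the cache merely halves the miss rate.
print(miss_rate(0.08, 4.0))  # 0.04
```

This is why Emer calls the gains sublinear: each doubling of resources buys well under a doubling of performance.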

Looking at multiprocessors, is using full (but simpler) processors as the building blocks the right granularity? Are multiprocessors really such a panacea of simplicity? How much complexity is going to be introduced in the interprocessor interconnect, in the shared cache hierarchy, and in processor support for mechanisms like transactional memory? Much of the challenge of past generations has been coping with the increasing disparity between processor speed and memory speed, or the limits of die-to-memory bandwidth. Does having a multiprocessor with its multiple contexts make this problem worse instead of better? And, even if multiprocessors really are a simpler alternative, what is the application domain over which they will provide a benefit? Will enough people be able to program them?

As I hope is clear, both approaches have advantages and disadvantages. It is, however, unclear which approach will be more heavily used in the future and which should be the major focus of our research. The goal of this panel was to discuss the merits of each approach and the trade-offs between the two.

The case for single-threaded optimization: Yale N. Patt

The purpose of a chip is to optimally execute the user’s desired single application. In other words, the reason for designing an expensive chip, rather than a network of simple chips, is to speed up the execution of individual programs. To the user, optimally executing a desired application means minimizing its execution time. Amdahl’s law says the maximum speedup in the execution time is limited by the time it takes to execute the slowest component of the program. Therefore, when we parallelize an application to run multiple threads across multiple cores, the maximum speedup is limited by the time that it takes to execute the serial part of the program, or the one that needs to run on a powerful single core.
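Amdahl’s law puts a hard ceiling on this. The sketch below (with a hypothetical 5 percent serial fraction, not a figure from the panel) shows how the speedup saturates at 1/s no matter how many cores are added:

```python
# Amdahl's law: speedup(N) = 1 / (s + (1 - s)/N), where s is the serial
# fraction of the program and N is the number of cores. The 5% serial
# fraction used below is a hypothetical figure chosen for illustration.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

s = 0.05
for n in (2, 16, 256, 65536):
    # Speedup climbs quickly at first, then flattens out.
    print(n, round(amdahl_speedup(s, n), 2))

# As N grows without bound, the speedup approaches 1/s:
print(round(1.0 / s, 1))  # 20.0
```

Even with effectively unlimited cores, a 5 percent serial fraction caps the speedup at 20x, which is Patt’s argument for keeping at least one powerful core to run the serial part fast.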

For example, suppose a computer architect designs a processor for a specific operating system that needs to periodically execute an old VAX instruction. In the extreme, assume that this instruction shows up only once every two weeks. In that case, because that instruction is not frequently executed, the computer architect does not allocate much hardware to execute that instruction. In the worst case, if it takes the processor three weeks to execute the instruction, then the program will never finish executing, because the processor did not run the slowest part of the program fast enough. Although this example is clearly exaggerated, it motivates the need to continue to focus on improving the single-threaded performance of processors.

One potential objection to continuing to focus heavily on single-threaded microarchitectural research is that there is insufficient parallelism at the instruction level to significantly improve processor performance, whereas there may be significantly more parallelism at the thread level that can be exploited instead. However, difficult problems still need to be solved, and the computer architecture community will not solve them if the best and brightest are steered away from those problems. Creativity and ingenuity can solve those problems. For example, in the RISC vs. CISC debate, CISC did not win merely because of legacy code. Rather, CISC won the debate because computer architects had the ingenuity to develop solutions such as out-of-order execution combined with in-order retirement, wide issue, better branch prediction, and so on. Indeed, Figure 1 shows the headroom remaining for a 16-wide issue processor that uses the best branch predictor and prefetcher available at the time the measurements were made [1].

About this article
Derek Chiou, Resit Sendag, and Josh Yi conceived and organized the 2007 CARD Workshop and transcribed, organized, and edited this article based on the panel discussion. Video and audio of this and the other panels can be found at http://www.ele.uri.edu/CARD/.

NOVEMBER–DECEMBER 2007

Take, for example, the creativity of computer architects with respect to branch prediction. Conventional wisdom in 1990 pretty well accepted as fact that instruction-level parallelism on integer codes would never be greater than 1.85 IPC. However, even simple two-level branch predictors could improve the processor’s performance beyond 1.85 IPC. And, today, computer architects such as Andre Seznec and Daniel Jimenez have proposed more sophisticated branch predictors that are better than the two-level branch predictor. There is still quite a bit more unrealized performance to be gotten from better branch prediction and better memory hierarchies (see Figure 1). The key point is that the computer architecture community should not avoid the difficult problems, but rather should harness its members’ creativity and ingenuity to propose solutions and solve those problems.
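A two-level predictor of the kind Patt describes can be sketched in a few lines. This is a greatly simplified illustration, not any shipping design; the history length, table size, and initial counter values are arbitrary choices:

```python
# Minimal sketch of a two-level adaptive branch predictor: a global
# branch history register indexes a pattern history table (PHT) of
# 2-bit saturating counters. All parameters here are illustrative.
class TwoLevelPredictor:
    def __init__(self, history_bits: int = 4):
        self.history_bits = history_bits
        self.history = 0                       # global history register
        self.pht = [1] * (1 << history_bits)   # start weakly not-taken

    def predict(self) -> bool:
        # Counter values 2 and 3 mean "predict taken".
        return self.pht[self.history] >= 2

    def update(self, taken: bool) -> None:
        # Train the saturating counter, then shift the outcome into history.
        ctr = self.pht[self.history]
        self.pht[self.history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A strictly alternating taken/not-taken branch defeats a last-taken
# predictor on every branch, but the two-level scheme learns the pattern
# after a short warm-up.
p = TwoLevelPredictor()
hits = 0
for taken in [True, False] * 50:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # nearly all of the 100 branches predicted correctly
```

The point of the example matches Patt’s: correlating on history recovers parallelism that the 1990 conventional wisdom assumed was out of reach.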

Another potential objection to the single-threaded focus is power consumption. Power consumption in a 10-billion-transistor processor is a significant problem. However, in a processor with 10 billion transistors, the processor could have several large functional units that remain powered off when not in use, but that are powered up via compiled code when necessary to carry out a needed piece of work specified by the algorithm. I use my “Refrigerator” analogy to explain this type of microarchitecture. William “Refrigerator” Perry was a huge defensive tackle for the 1985 Chicago Bears. While the Bears were on offense, Perry sat on the bench until they were on the one-yard line. Then, they would power him up to carry the ball for a touchdown. To me, the key question is, “How much functionality can we put into a processor that remains powered off until it is needed?”

I think the processor of the future will be what I call Niagara X, Pentium Y. The Niagara X part of the processor consists of many very lightweight cores for processing the embarrassingly parallel part of an algorithm. The Pentium Y part consists of very few heavyweight cores, or perhaps only one, with serious hybrid branch prediction, out-of-order execution, and so on, to handle the serial part of an algorithm to mitigate the effects of Amdahl’s law. To ensure that the processor’s performance does not suffer because of intrachip communication, a high-performance interconnect will need to connect the Pentium Y core to the Niagara X cores. Additionally, computer architects need to determine how transistors can minimize the performance implications of off-chip communication.

Figure 1. Potential of improving caches and branch prediction.

COMPUTER ARCHITECTURE DEBATE
IEEE MICRO

The case for multithreaded optimization: Mark D. Hill

For decades, increasing transistor budgets have allowed computer architects to improve processor performance by exploiting bit-level parallelism, instruction-level parallelism, and memory hierarchies. However, several “walls” will steer computer architects away from continuing to design increasingly more powerful single-threaded processors and toward designing less complex, multithreaded processors.

The first wall is power. Some current-generation processors require large, even absurd, amounts of power. For these processors, and to a lesser degree other lower-power processors, high power consumption makes it more difficult and expensive to cool the processor or forces the processor to operate at a higher temperature, which can lower the processor’s lifetime reliability. In both cases, higher power consumption either increases the operational cost of owning the processor, by increasing the costs of electricity and server room cooling, or decreases the battery life. The current situation is very similar to the one mainframe developers faced about 10 to 15 years ago with emitter-coupled logic. Although mainframe developers were able to switch to an alternative technology that was initially slower but eventually proved better, namely CMOS, the computer architecture community does not currently have a viable alternative technology that we can switch to. Thus, we must adopt a different approach to designing our processors.

The second wall is complexity. The basic issue with the complexity wall is that it is becoming too difficult, and taking too long, to design and verify next-generation processors, and to manufacture them at sufficiently high yield rates. Although some computer architects think that this is a significant wall, I do not, since I believe that creative people like Joel Emer and Yale Patt can create abstractions to manage the complexity.

The third, and final, wall is memory, which I think is a very significant problem. Currently, accessing main memory requires possibly hundreds of core cycles, which can be hundreds of times slower than a floating-point multiply. Consequently, accessing main memory wastes hundreds, or even thousands, of instruction execution opportunities. How much instruction-level or thread-level parallelism is available for the processor to exploit while waiting for a memory access to complete? And, even if there is a large amount of parallelism, the power and complexity walls limit how the computer architect can design a processor to exploit that parallelism.
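The arithmetic behind this memory-wall claim is easy to make concrete. The 300-cycle miss latency and 4-wide issue width below are assumed round numbers for illustration, not figures from the panel:

```python
# Back-of-the-envelope on the memory wall: every access that goes all
# the way to main memory stalls the core for hundreds of cycles, each
# of which could have issued several instructions.
def wasted_slots(miss_latency_cycles: int, issue_width: int) -> int:
    """Instruction-issue opportunities forfeited by one main-memory access."""
    return miss_latency_cycles * issue_width

# With a (hypothetical) 300-cycle miss on a 4-wide core, a single access
# forfeits over a thousand issue slots -- Hill's "hundreds, or even
# thousands, of instruction execution opportunities."
print(wasted_slots(300, 4))  # 1200
```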

Given these three walls, the computer architecture community needs to continue to design processors that exploit parallelism if we want to continue to improve processor performance. However, I do not believe that we can continue to exploit instruction-level parallelism with identical operations, that is, SIMD vectors. Rather, we need to exploit a higher-level parallelism in which we can do slightly different things along each parallel path. Note that this higher-level parallelism could be like the thread-level parallelism that has been spectacularly successful in narrow domains such as database management systems, Web servers, and scientific computing, but less successful for other application spaces. More specifically, instead of exploiting bit-level and instruction-level parallelism with wider, more powerful single-threaded processors, we need to use multicore processors to exploit higher-level parallelism.

My approach to this problem is to advocate the development of new hardware and software models (Figure 2). The architecture community needs to raise the level of abstraction at the software layer. Although most people naturally do not think in parallel, there are tremendous success stories, like the SQL language for relational databases and Google’s MapReduce. In the former case, most people can write declarative queries and let the underlying system optimize and create great parallelism because of the strong properties of the relational algebra. In the latter case, a limited set of programs allows the user to program even clusters relatively easily. However, these two examples are point solutions only; we need to develop these types of solutions more broadly.
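Hill’s MapReduce example can be shown in miniature. The toy word count below is a sequential sketch of the programming model, not Google’s implementation; a real runtime would execute the map step in parallel across a cluster:

```python
# Toy MapReduce-style word count. The programmer writes only a map
# function and a reduce function, each sequential-looking; the runtime
# (here a single-threaded stand-in) is free to parallelize the map step.
from collections import defaultdict

def map_fn(line: str):
    # Emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word: str, counts):
    # Combine all counts emitted for one word.
    return word, sum(counts)

def mapreduce(lines, map_fn, reduce_fn):
    groups = defaultdict(list)
    for line in lines:                  # embarrassingly parallel step
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["to be or not to be"], map_fn, reduce_fn))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

This illustrates Hill’s point: the parallelism lives in the framework, not in the programmer’s head.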

The key at the hardware level is to support heterogeneous operations, and not just strict SIMD. Additionally, the computer architecture community needs to add mechanisms that can improve the performance of software models; examples include transactional memory and speculative parallelism (for example, thread-level speculation). Adding these hardware features helps support the software model, which allows the programmer to continue to think sequentially.

Multicore architecture

Hill: [removing faux white beard he has been wearing in imitation of Patt; see Figure 3] Let me take this beard off. It’s making me think way too sequentially. So, Yale, is it correct that you think that we should have chips with lots of Niagaras and one beefier core? Are you saying that I’ve won the debate?

Figure 2. Toward a future hardware/software stack.

Figure 3. Panelists Yale Patt and Mark Hill and moderator Joel Emer (left to right). Hill removed his beard during the discussion, claiming it caused him to think too sequentially. Patt and Emer declined to remove theirs.


Patt: No.

Hill: You’re trying to say that’s the uniprocessor argument?

Patt: I’m saying that you need one or more fast uniprocessors on the chip. Without that, Amdahl’s law gets in the way. That is, yes, your Niagara thing will run the part of the problem that is embarrassingly parallel. But what about the thing that really needs a serious branch predictor or maybe a trace cache? I have no idea what people are going to come up with in the future. What I don’t want is for people to get in the “don’t-worry-about-it” mind-set.

Hill: Okay, I agree that we need to have multiple cores and that one of those cores should be much more powerful, or alternatively several of those cores should be much more powerful.

Emer: So now Yale has won the debate.

Hill: No, that’s multiple cores.

Patt: I’m not suggesting that having multiple cores is not important. What I am suggesting is that we shouldn’t forget the part of the problem that requires the heavy-duty single core. You know, you hear a lot of people say, “The time for that has passed.” There are so many opportunities to further improve single-core performance. Take the stuff they did at Universitat Politecnica de Catalunya (UPC) with virtual allocate, where, at rename time, they allocate a reorder buffer register but don’t actually use it until it is really needed. They save the power from the time they rename it until the time they actually use it: doing things virtually until you really need it, turning off hardware until you really need it… There’s a whole lot out there that in fact students in this room will work on, maybe, if they’re not told, “Forget it; the real action is in multiple cores.” The fact of the matter is, [smiling] the real action is in Web design.

Hill: So, in your view what is the rate of performance improvement that we can expect to get out of a single core?

Patt: The rate is zero, if we think negatively. Sammy Davis Jr. wrote an autobiography entitled Yes I Can. That’s what I would like to encourage people to think: “Yes I can,” as opposed to just giving up and saying, “You know, when it’s 10 billion transistors, I guess it won’t be 20 Pentium 4s; it will be 200 Pentium 4s.”

Hill: I guess I should remind you that I absolutely think that we can push uniprocessors forward. We can get improvements, but we’re not going to see the improvements that we’ve seen in the past. And there are markets out there that have grown used to this opium of doubling in performance. That is what we are going to lose going forward.

Emer: How much architectural improvement have we seen?

Patt: We’ve seen quite a bit. In fact, there was a 2000 Asia and South Pacific Design Automation Conference presentation by a guy at Compaq/DEC [Digital Equipment Corporation] named Bill Herrick, who said that between the [Alpha] EV-4 and EV-8 there was a factor of 55 improvement in performance [2]. That represented a timeframe of about 10 years. The contribution from technology, essentially in cycle time, was a factor of 7. Thus, the contribution from microarchitecture (roughly a factor of 8, since 55/7 ≈ 7.9) was more than the contribution from technology. EV-4 was the first 21064, and EV-8, had it been delivered, would have been the 21464.

Take branch predictors. The Alpha 21064 used a last-taken branch predictor. The Alpha 21264 used a hybrid two-level predictor. Or, in-order versus out-of-order execution: The first Alpha was in-order execution. The third one was out-of-order execution. Issue width: You know, the first Alpha was two-wide issue. The second one was four-wide issue. I am not saying that we should tell these students, “No multiple cores.” I agree, we need to teach thinking and programming in parallel. But we also need to expose them to algorithms and to all of the other levels down to the circuit level.

Audience member: [interjecting] Becauseit’s a bad idea. Abstraction is a good thing.


Patt: Abstraction’s a good thing? Abstraction is a good thing if you don’t care about the performance of the underlying entities. You know, many schools teach freshman programming in Java. So what’s a data structure? Who cares? My hero is Donald Knuth, who teaches data structures showing you how data is stored in memory. Knuth says that unless the programmer understands how the data structure is actually represented in memory, and how the algorithm actually processes that data structure, the programmer will write inefficient algorithms. I agree. Ask a current graduate in computer science, “Does it matter whether or not all of the data to be sorted can be in memory at the same time?” How many will say “Yes” and pick their sorting algorithm accordingly?
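Patt’s sorting question can be made concrete. In the sketch below, the “budget” is a hypothetical stand-in for available memory; when the data exceeds it, an external merge sort (sort fixed-size chunks, then k-way merge) is the appropriate choice rather than an in-memory sort over the whole array:

```python
# Illustrative sketch: choosing a sorting strategy based on whether the
# data fits in memory. "budget" stands in for physical RAM; a real
# external sort would spill sorted chunks to disk instead of lists.
import heapq

def memory_aware_sort(data, budget):
    if len(data) <= budget:
        return sorted(data)  # everything fits: ordinary in-memory sort
    # External merge sort: sort budget-sized chunks, then k-way merge
    # the sorted runs (heapq.merge streams them, never loading all at once).
    chunks = [sorted(data[i:i + budget]) for i in range(0, len(data), budget)]
    return list(heapq.merge(*chunks))

print(memory_aware_sort([5, 3, 8, 1, 9, 2, 7], budget=3))
# [1, 2, 3, 5, 7, 8, 9]
```

A programmer who never asks the “does it fit?” question will call the in-memory path regardless and thrash, which is exactly Knuth’s point as Patt relays it.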

Programming parallel machines

Audience member: I’d like to pick up on that point. There seem to be a lot of people now who argue that we won’t realize the performance benefits of multicore unless programmers explicitly manage data locality. And in some cases that’s not so hard if it naturally fits a stream or some other paradigm. But is this a realistic expectation? And what do we do?

Patt: I think there are two types of programmers: yo-yos and serious programmers. And if you really want performance, you don’t want yo-yos. So, yes, for every programmer who knows what he or she is doing, there are thousands who don’t. Object-oriented programming isolates the programmer from what’s going on, but don’t expect performance.

Hill: But I don’t think it’s as simple as that. I think that there are a lot of cases where people are solving very deep intellectual challenges in their problem domains. Insofar as we can automatically help them create the locality, I think it’s a good thing and we should do it.

Patt: Absolutely.

Hill: I am willing to give up some performance. I just don’t see people managing the details of locality. I think the tougher problem is parallelizing the work and coming up with the abstractions somewhere in the tool chain so that the work will be parallelized. I think that locality is also important, but it is not the hardest thing to do.

Patt: Yes, the developer needs tools to break down the work at large levels, but if you want performance, you’re going to need knowledge of what’s underneath.

Providing simpler programming models

Hill: I still think that even sequential consistency is too complicated for those programmers. You’re already reasoning with arbitrary threads and arbitrary nondeterminism and locks, so who wants to also reason about memory reordering and fences?

Patt: You know, I’m hopeful that Mark is right and that someday we will think parallel.

Hill: The key is enabling people to think at a higher level of abstraction. Take SQL: There is a lot of parallelism underneath, but people don’t think parallel when they write SQL.

Patt: So, somehow, magically, these intermediate layers will do it for you?

Emer: That’s the research that Mark advocates pursuing more aggressively, in lieu of the solutions for higher single-stream performance that you’re advocating.

Hill: In addition to! We’ve got to be two-handed economists.

Patt: And certainly there are people working on trying to take the sequential abstraction of the program and having the compiler automatically parallelize that code.

Hill: I don’t think that’s good enough. I mean, just taking C or Java and parallelizing it is not good enough. We need to go higher than that.

Patt: I agree with that. In fact, that’s Wen-mei [Hwu’s] point, that you should have the library available and the heavy-duty logic available, which I agree with also.


Checkpointed processors

Audience member: Any thoughts on checkpointed processors and how they change the trade-off of processor complexity and performance, and perhaps the functionality for multithreading?

Hill: I think checkpointed processors are a very interesting and viable technique that is going to make uniprocessors better, but there are also arguments that they can help multiprocessing. For example, you have a paper accepted to ISCA on bulk sequential consistency (SC) where the checkpointed processors help multiprocessing. I don’t see checkpointed processors as fundamentally tipping this debate, but they are a good idea.

Emer: So, you’re saying that there’s research that applies to both uniprocessor and multiprocessor domains, potentially.

Patt: That’s right.

Breaking down the barriers around architects

Audience member: In general, I agree with Mark [Hill], and I’d like to amplify his comments that good abstraction to parallelism using a parallel programming model is probably the most important point. How can architects enable that? In my view, it’s not going to happen with just architects working alone. It’s not going to happen if the other people are working alone. We really need to work together, and it’s very hard to do that. So, the question is, what can we really do to enable that synergy?

Hill: I completely agree. Doing it alone, we can make some progress. But to have the real progress, we have to work all the way up to theoreticians. Well, there are two ways we’re going to do this. One is if we can convince our colleagues to do the research. The other is if performance improvement falls flat on its face for a number of years; then they’ll start paying attention. I think we want to try to use the former route, because the latter route is not going to be pretty.

Emer: We’ve been in a situation for a number of years where we’ve been able to work behind an architectural boundary to make processors go faster, and software people can be largely oblivious to what we’re doing.

Hill: A very concrete example of this is that Microsoft had no interest in architecture research until recently. And suddenly it occurred to them that being ignorant of what’s happening on the other side of the interface is no longer a viable strategy.

Emer: They probably did get a lot of performance by being oblivious. That’s all the legacy code that we do have.

Audience member: Actually, a good analogy is the memory consistency model issue. The weaker models were exposed to the software community for the longest time, and they chose to ignore them. But in the last five or six years, there’s been this huge effort from the software community to get things fixed there. So, I do see a hope that the other people will band together with architects, but I think that something needs to be done proactively to enable that synergy to actually happen.

Patt: Yeah, so what [the audience member] has opened up is whether this should be a sociology debate rather than a hardware one, which I think is right. I think we are uniquely positioned where we are as architects because we’re the center. There are the software people up here, and the circuit designers down here, and we, if we’re doing our job, we see both sides. We are uniquely positioned to engage those people. Historically, you said that software people are oblivious, and yet they still get performance because we did our job so well. I think we can continue to do our job so well. I don’t think we’re going to run out of performance if we address Amdahl’s bottleneck. For a certain amount of time I think [the audience member] is right, that we need to be engaging people and problems on a number of levels. In fact, I would say, the number one problem today has nothing to do with performance, and that is security.

Hill: I just want to add one more comment. I think what’s really tough is to parallelize code. You may notice that at Illinois, too, a lot of software people did some parallel work in the 1980s and early 1990s. But the efforts petered out, and they left! And now there are very few parallel software experts around.

What if there were no power wall?

Audience member: Hypothetically, if we didn’t have a power wall, would we be able to continue scaling uniprocessor performance and thus avoid having to go after parallelism?

Hill: There’s still the memory wall. The power wall makes it worse because of the high cost in power for many forms of speculation.

Patt: If the biggest problem today is the memory wall, how do we make the code we want to run still perform well despite this bottleneck? Again, this is an argument for faster uniprocessors. There will always be issues with the memory wall, interconnect, and other areas that get in the way of single-threaded performance. The power wall has influenced our thinking and has also helped us cop out, because architects reason that putting a number of identical low-power cores on the same chip solves the power wall problem.

Hill: I agree that just doubling the number of cores is a cop-out.

Data parallelism?

Audience member: Mark [Hill] has seemed to downplay the potential for data parallelism at the hardware level. It is much simpler to exploit data parallelism from a workload than thread-level parallelism, especially when dealing with a large number of cores (about 1,000) on a chip.

Emer: This kind of parallelism is much more power-efficient than multiple-core parallelism, because one can save all the bookkeeping overhead.

Hill: We need to have data parallelism (not SIMD), but I do not believe that data parallelism is as simple as doing identical operations in the same place, due to conditionals, encryption/decryption, and so on. It’s not the same at the lowest level. I’m not convinced that data parallelism is sufficient at the lowest operation level, but it is needed in the programming model.

Vectors are a cool solution, but they haven’t seemed to generalize beyond a few classes of applications these past 35 years. The difference between vector processing and multicore processing is that we may have no alternative with the latter. The reason for this is that uniprocessors will get faster through creative efforts on the part of people like Yale [Patt], but not at a rate of 50 percent a year. We need not only additional cores, but also additional mechanisms to use them, and that is what computer architects have to invent. Simply stamping out the cores may be okay for server land, but not for clients.
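[Hill’s caveat about conditionals is the classic divergence problem: a lockstep data-parallel machine cannot let each lane branch independently, so it must evaluate both sides of a conditional for every lane and then select per-lane results with a mask, wasting work whenever lanes diverge. A minimal NumPy sketch of this mask-and-select pattern; the arrays and operations are illustrative, not from the panel.]

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -4.0])

# Scalar code with a conditional lets each element take one branch.
# A lockstep data-parallel version instead computes BOTH branches
# for every lane, then mask-selects per lane; the work on the
# not-taken branch of each lane is the divergence cost.
pos = np.sqrt(np.abs(x))          # "then" branch, computed for all lanes
neg = -np.abs(x)                  # "else" branch, also computed for all lanes
result = np.where(x >= 0, pos, neg)  # per-lane mask select
```

[With half the lanes on each side of the branch, both full branch computations run, which is why data parallelism at the lowest level is not simply "identical operations in the same place."]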

Patt: I’m not interested in server land, although that’s what made SMT (simultaneous multithreading). If you have lots of different jobs, why not just have simpler chips and let this job run here and that job run there?

Emer: What does SMT have to do with that?

Patt: SMT was a solution looking for a problem, and the problem it found was servers.

Hill: I completely disagree with Yale. Multiple threads are a way to hide memory latency.

Patt: Using SMT to hide memory latency is application dependent, and not always a solution to the memory wall problem. Thus, we come back to Amdahl’s law, which we need to continue to address.
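[The Amdahl’s law bound that the panelists keep returning to can be stated briefly: if a fraction $f$ of a program’s work is parallelizable across $n$ cores while the remaining $1 - f$ stays serial, the overall speedup is]

```latex
\mathrm{Speedup}(f, n) \;=\; \frac{1}{(1 - f) + \dfrac{f}{n}},
\qquad
\lim_{n \to \infty} \mathrm{Speedup}(f, n) \;=\; \frac{1}{1 - f}.
```

[Even with 90 percent of the work parallelized ($f = 0.9$), no number of cores can deliver more than a $10\times$ speedup; the serial fraction, and thus single-threaded performance, remains the bottleneck.]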

Emer: That’s what SMT allowed you to do!

The relationship between locality and parallelism

Audience member: You said that locality is less of a problem than parallelism. I would argue that localization is parallelism. If you don’t have independent, local tasks, then you can’t have parallelism; both are strongly intertwined. Looking at parallel applications, they did a good job except when parallelism and locality were in conflict, resulting in too much communication.

Hill: Multicore does change things. For example, multicore has much better on-chip, thread-level communication than was previously possible. But multicore chips do have a different type of locality, in that the union of the cores should not thrash the chip’s last-level cache. I would like programmers to manage locality, but it seems very difficult to do so.

Audience member: I agree that it’s hard, but if you don’t manage locality, it doesn’t work.

Emer: Yale, [the audience member] is supporting you.

Patt: Then I should just sit quietly.

Where should industry put its resources?

Audience member: The business side simply wants applications to run faster. Where should computer manufacturers put their resources? What should they produce?

Emer: This is the fundamental question: How do we allocate our resources as chip designers?

Patt: I would produce Niagara X, Pentium Y. I’d worry about lots of mickey-mouse cores and a few heavy-duty cores with a good interconnection network, so you don’t have to suffer when you move between cores. Some would argue whether the memory model on the chip is correct, but I think that’s the chip of the future.

Hill: I agree with Yale that a multiprocessor like his Niagara X, Pentium Y is a good idea. I would also put effort into hardware mechanisms, compiler algorithms, and programming-model changes that can make it easier to program these machines.

Patt: I agree with that. MICRO


Joel Emer is an Intel Fellow and director of microarchitecture research at Intel, where he leads the VSSAD group. He also teaches part-time at the Massachusetts Institute of Technology. His current research interests include performance-modeling frameworks, parallel and multithreaded processors, cache organization, processor pipeline organization, and processor reliability. Emer has a PhD in electrical engineering from the University of Illinois. He is an ACM Fellow and a Fellow of the IEEE.

Mark D. Hill is a professor in both the Computer Sciences and Electrical and Computer Engineering Departments at the University of Wisconsin–Madison, where he also coleads the Wisconsin Multifacet project with David Wood. His research interests include parallel computer system design, memory system design, computer simulation, and transactional memory. Hill has a PhD in computer science from the University of California, Berkeley. He is a Fellow of the IEEE and the ACM.

Yale N. Patt is the Ernest Cockrell Jr. Centennial Chair in Engineering at the University of Texas at Austin. His research interests focus on harnessing the expected benefits of future process technology to create more effective microarchitectures for future microprocessors. He is a Fellow of the IEEE and the ACM.

Joshua J. Yi is a performance analyst at Freescale Semiconductor in Austin, Texas. His research interests include high-performance computer architecture, simulation, low-power design, and reliable computing. Yi has a PhD in electrical engineering from the University of Minnesota, Minneapolis. He is a member of the IEEE and the IEEE Computer Society.

Derek Chiou is an assistant professor in the Electrical and Computer Engineering Department at the University of Texas at Austin. His research interests include computer system simulation, computer architecture, parallel computer architecture, and Internet router architecture. Chiou has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a senior member of the IEEE and a member of the ACM.

Resit Sendag is an assistant professor in the Department of Electrical and Computer Engineering at the University of Rhode Island, Kingston. His research interests include high-performance computer architecture, memory systems performance issues, and parallel computing. Sendag has a PhD in electrical and computer engineering from the University of Minnesota, Minneapolis. He is a member of the IEEE and the IEEE Computer Society.

Direct questions and comments about this article to Joel Emer, [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
