
Evaluating the Performance Potential of

Function Level Parallelism

David Yuste, Julio Sahuquillo, Salvador Petit, Pedro López, and José Duato

Dept. de Informática de Sistemas y Computadores (DISCA)

Universidad Politécnica de Valencia

[email protected],{jsahuqui, spetit, plopez, jduato}@disca.upv.es

Abstract

Because of technology advances, the current trend in processor architecture design focuses on placing multiple cores on a single chip instead of increasing the complexity of single-core processors. These upcoming processors are able to execute several threads in parallel, which makes them a suitable platform for the application of automatic parallelization techniques.

Most of the research efforts concerning automatic parallelization are, in essence, sophisticated translations of Instruction Level Parallelism (ILP) techniques working on loop parallelization, which apply speculation to both data and control prediction. Nevertheless, these techniques incur a high overhead due to the additional instructions required to detect dependences among iterations or to verify predictions.

To avoid these shortcomings, in this paper we devise a novel way to exploit Thread Level Parallelism (TLP) by means of a compiler-based technique, referred to as Function Level Parallelism (FLP), that looks for parallelism at the function level. This technique detects larger threads than existing techniques such as loop-level ones, which is a main concern when exploiting TLP. In addition, it also reduces the involved runtime overhead.

1 Introduction

Traditionally, superscalar processors have improved performance by using sophisticated techniques to exploit Instruction Level Parallelism (ILP). Most of these processors implement out-of-order execution, which allows independent instructions to be issued in the same cycle, thus overlapping their execution. ILP became popular in research and in industry thanks to technology advances, which provided more transistors on a single chip, thus allowing the design of more complex and powerful processors.

Nowadays, the evolution of the uniprocessor architecture is being constrained by technological restrictions, which have pushed microarchitectures in a new direction. Instead of focusing on improving a complex single-core processor, new approaches try to improve performance through multiple-core architectures. This approach pursues performance gains because several cores can work in parallel. At the same time, to reduce power and heat, each core is designed to be simpler than previous single-core architectures.

Current multi-core microprocessors have moved the interest of industry and research to TLP (Thread or Task Level Parallelism), which exploits the overlapped execution of instructions from different contexts, i.e., threads or tasks. In this way, programs with multiple explicit threads benefit from multiple-core processors, since each thread can be executed separately and concurrently with the others. Unfortunately, existing sequential programs cannot benefit in a straightforward way from these architectures, since they must be previously parallelized.

The term task has been widely used in the literature, but referring to different concepts (e.g., loop iterations). In this paper, a task is a sequence of program statements (including operations, control instructions and function calls) that maintain a dependence relationship among them. These statements can be either contiguous or not, and their relationship can be either explicit, as data dependences through call arguments and local variables, or more sophisticated, involving global and heap variables, library calls and calls through function pointers.

To exploit TLP, different tasks must be executed concurrently. To this end, tasks must be previously defined in an explicit way. Explicit tasks can be extracted by the programmer or by means of automatic techniques based on software (compilation), hardware or both. Regardless of the technique used to extract parallelism, the common goal is to provide tools to exploit the TLP contained in sequential programs when they are executed on multiprocessor systems. These tools must be scalable with the number of processors of the underlying architecture, and transparent to the developer, thus allowing existing sequential programs to be parallelized without rewriting them.

Nowadays, the major efforts on TLP extraction have concentrated on Loop Level Parallelism (LLP). Nevertheless, recent studies [1] show that the upper bound on the speedup achievable via this kind of parallelism is only about 1% (in Spec2006).

To improve these results, we propose a technique for automatic Function Level Parallelism (FLP) detection, that is, a compiler-oriented methodology that searches for TLP focusing on the function call as the elemental unit containing a task.

The remainder of this paper is organized as follows. Section 2 discusses the state of the art on automatic TLP detection and its disadvantages. The methodology that allows detecting dependences among function calls is explained in Section 3. The techniques that enable the proposed algorithm to detect variable aliasing are presented in Section 4. A brief evaluation of the potential of the applied techniques is discussed in Section 5. Finally, Section 6 presents some concluding remarks and future work.

2 Related work

Most of the research on thread level parallelization targets the exploitation of control-driven threads [2]. This approach consists of dividing the dynamic instruction stream into contiguous segments along control-flow boundaries. It can be easily implemented in hardware, but it is hard to guarantee the correctness of an overlapped execution of the extracted tasks. That is why most of these approaches make use of speculative techniques.

Speculative approaches such as [3, 4] work as follows. The execution starts with a non-speculative thread. When the execution reaches certain control instructions (i.e., a call to a function or the backward jump of a loop), a speculative thread is spawned. This thread begins at an instruction that is expected to be executed soon. Nevertheless, its input values may be unknown when it is spawned, so they must be predicted. The accuracy of data value predictors [5] limits the efficiency of these techniques because of the misprediction overhead.

The Mitosis compiler [6] improves data value prediction through a heuristic approach. Each speculative thread is preceded by a prolog that provides the predicted values for its input variables. This prolog contains the most probable execution paths of the original program that write the variables that are likely to be used in the speculative thread. The main drawback of this solution is that the overhead of the prolog can compromise its advantages when the detected threads are not large enough.

Another topic in TLP is speculative loop level parallelism [7, 8, 9]. It consists of the parallel execution of speculatively selected independent iterations of a loop. In order to reduce the impact of the overhead due to thread spawning and synchronization, contiguous iterations are packed together. However, this raises a new problem: the larger the number of iterations per thread, the greater the probability of misspeculation. Some studies [10, 11] work on scheduling strategies that try to dynamically adapt the number of iterations per thread. In spite of the accuracy of their speculation, these approaches are restricted to specific contexts, such as randomized incremental algorithms.

Although most of the current compiler parallelization efforts focus on the loop level, recent studies [1] are discouraging. They show that the upper bound on the speedup achievable via speculative loop level parallelism in Spec2006 is only about 1%. Therefore, new parallelization techniques must search for TLP beyond the loop level.

Finally, little work has been published regarding FLP. In [12] a framework to estimate the potential parallelism of programs is proposed. Under this framework, data dependences are detected by means of profiling. Such data dependences are used to build the interprocedural data flow graph and the data sharing graph. The former shows which functions read data from other functions. The latter shows how data is shared. Both graphs are used to select the most suitable parallel construct (e.g., Master-Slave, Workpile or Pipeline). Finally, this information is used to parallelize the program. The main drawback of this technique is that it requires profiling in order to detect data dependences. Thus, as profiling depends on particular input data, the detected dependences are optimistic and unsafe.

3 Methodology to exploit Function Level Parallelism

The body of a sufficiently complex function usually contains several function calls. Some of them can be independent of each other, that is, the task performed by one function is not influenced by the other one and vice versa. In these situations, the only reason they execute sequentially is to match the order in which they appear in the original program, an order imposed by imperative languages, which do not allow expressing that the execution order is irrelevant. Such function calls are characterized by the fact that their accessed variables do not conflict, that is, there is no data or control dependence among them. A parallelizing compiler that exploits function independence must perform a static analysis in order to discover which calls satisfy the previous conditions.

Detecting dependence relations among functions implies an intensive interprocedural analysis, which usually brings a high computational overhead. To shortcut this problem, the proposed technique implements a three-step analysis algorithm. It attempts to reduce the amount of work to be done and to reuse the analyzed chunks of the program in further stages of the algorithm. To this end, we first analyze each function individually, without considering the rest of the functions of the program. Then, we perform an interprocedural analysis across the call graph, resulting in a per-function characterization. Finally, this information is used to characterize each function call, refining the data collected in the previous stage for each call instance. These steps are explained in the following sections.

3.1 Characterizing a function

In a first step, we characterize each function from the point of view of its interactions with the program variables. That is, we identify its uses and definitions. In this stage, the analyzer does not care about previous definitions of global variables or about the data pointed to by references. This information will be provided during the last step of the algorithm.

The result of this characterization is a set of uses and a set of definitions per function, hereafter SU and SD respectively. Initially, these sets contain the variables used and defined by each statement of the function that can be relevant for the execution of other parts of the program (i.e., global variables and dereferenced pointers).

After applying this rule to the functions included in the code shown in Figure 1, the SU and the SD of function function1 contain {G2} and {G1} respectively, since the statement at line 6 uses the global variable G2 and the statement at line 7 defines G1. Note that the variable L1 is not in any set since it is a local variable. Regarding function2, the SU is {G1, G2}, while the SD is empty. Function function3 defines the variable *P1, which is a dereferenced pointer that could point to something outside the function, so it must belong to the SD of function3. Finally, the SU of function4 contains G3.


01 int G1, G2, G3, G4;
02
03 void function1()
04 {
05     int L1;
06     L1 = G2;
07     G1 = L1 + 1;
08 }
09
10 int function2()
11 {
12     int L1;
13     L1 = G1 + G2;
14     return L1;
15 }
16
17 void function3(int * P1)
18 {
19     int L1;
20     function1 ();
21     L1 = function2 ();
22     *P1 = L1;
23 }
24
25 void function4()
26 {
27     function3 (&G3);
28     G2 = G3;
29     if (G3)
30     {
31         function3 (&G4);
32         G2 = G4;
33     }
34 }

Figure 1: Sample code
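To make this first step concrete, the following sketch seeds the SU and SD of function1 from Figure 1. It is a minimal illustration under an assumed statement model and hypothetical helper names (VarSet, set_add, Stmt are not part of the actual analyzer); each statement is reduced to the relevant variable it reads or writes, and local variables such as L1 are simply not recorded.

/* Minimal sketch (assumed statement model, not the actual analyzer) of the
 * intraprocedural step: every statement contributes the global variables and
 * dereferenced pointers it reads to SU and the ones it writes to SD. */
#include <stdio.h>
#include <string.h>

#define MAX_VARS 16
typedef struct { const char *v[MAX_VARS]; int n; } VarSet;

static void set_add(VarSet *s, const char *x) {
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->v[i], x) == 0) return;   /* already present */
    if (s->n < MAX_VARS) s->v[s->n++] = x;
}

/* A statement reduced to the single relevant variable it uses/defines. */
typedef struct { const char *relevant_use, *relevant_def; } Stmt;

int main(void) {
    /* Body of function1 from Figure 1. */
    Stmt body[] = {
        { "G2", NULL },   /* line 6: L1 = G2;     uses the global G2     */
        { NULL, "G1" },   /* line 7: G1 = L1 + 1; defines the global G1  */
    };

    VarSet su = { { 0 }, 0 }, sd = { { 0 }, 0 };
    for (unsigned i = 0; i < sizeof body / sizeof body[0]; i++) {
        if (body[i].relevant_use) set_add(&su, body[i].relevant_use);
        if (body[i].relevant_def) set_add(&sd, body[i].relevant_def);
    }

    printf("SU(function1) = {%s}\n", su.v[0]);   /* {G2} */
    printf("SD(function1) = {%s}\n", sd.v[0]);   /* {G1} */
    return 0;
}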

3.1.1 Interprocedural function characterization

Note that the previous step ignores the calls to function1 and function2 in the body of function3. As these calls have side effects, the uses and definitions of both called functions must be taken into account in the characterization of function3. Therefore, the SU of function3 must contain G1, due to the call to function2, and G2, due to the calls to both function1 and function2. Regarding definitions, G1 must be in the SD of function3 due to the call to function1. In other words, uses and definitions of functions are propagated across the call graph. This propagation consists of joining the SU of the called function to the SU of the caller, and analogously for definitions.

The case of function4 is a bit more complex, because it involves calls with arguments. Every call argument is a use of the caller function, because its value must be read in order to be sent to the called function. Moreover, arguments passed by reference (pointers in a C-like language) can imply a use or a definition of the referenced (pointed-to) variable. The example function function3 contains *P1 in its SD. P1 is an argument of function3, so defining *P1 means defining whatever P1 points to in the caller function. Therefore, the call function3(&G3) defines G3, since the definition *P1 is contextualized to *&G3 (i.e., G3). Analogously, the call function3(&G4) defines G4. To sum up, a given argument can be passed by value or by reference. In the former case, the interprocedural analysis considers it a use of the caller function. In the latter case, each use or definition of a dereference of the argument in the called function is contextualized to the passed reference.
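The following sketch illustrates this propagation and contextualization step. It is a minimal illustration under assumed helper names (VarSet, propagate and the string-based variable naming are hypothetical, not the paper's implementation): the SU/SD sets of the callee are joined into those of its caller, and a use or definition of the dereferenced formal argument (*P1) is replaced by the variable whose address is passed at the call site.

/* Minimal sketch (assumed helper names, not the paper's implementation) of
 * how the SU/SD sets of a callee are joined into those of its caller. */
#include <stdio.h>
#include <string.h>

#define MAX_VARS 16
typedef struct { const char *v[MAX_VARS]; int n; } VarSet;

static int set_has(const VarSet *s, const char *x) {
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->v[i], x) == 0) return 1;
    return 0;
}
static void set_add(VarSet *s, const char *x) {
    if (!set_has(s, x) && s->n < MAX_VARS) s->v[s->n++] = x;
}

/* Join the callee's set into the caller's one, replacing the dereferenced
 * formal (deref_formal) by the actual reference passed at the call site. */
static void propagate(VarSet *caller, const VarSet *callee,
                      const char *deref_formal, const char *actual_ref) {
    for (int i = 0; i < callee->n; i++) {
        const char *name = callee->v[i];
        if (deref_formal && strcmp(name, deref_formal) == 0) name = actual_ref;
        set_add(caller, name);
    }
}

static void print_set(const char *label, const VarSet *s) {
    printf("%s = {", label);
    for (int i = 0; i < s->n; i++) printf("%s%s", i ? ", " : "", s->v[i]);
    printf("}\n");
}

int main(void) {
    /* SU/SD of function3 once function1 and function2 have been propagated
     * into it (values taken from the running example). */
    VarSet su_f3 = { { "G1", "G2" }, 2 };
    VarSet sd_f3 = { { "*P1", "G1" }, 2 };

    /* Contribution of the call function3(&G3) to the sets of function4:
     * *P1 is contextualized to G3. */
    VarSet su_f4 = { { 0 }, 0 }, sd_f4 = { { 0 }, 0 };
    propagate(&su_f4, &su_f3, "*P1", "G3");
    propagate(&sd_f4, &sd_f3, "*P1", "G3");

    print_set("uses added to SU(function4)", &su_f4);        /* {G1, G2} */
    print_set("definitions added to SD(function4)", &sd_f4); /* {G3, G1} */
    return 0;
}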

3.2 Characterizing a call

After the interprocedural analysis is concluded, the SU and the SD of each function tell the analysis which variables the function uses and defines. This previously collected information is used in this section to define the behaviour of each call instance.

To characterize a call, analogously to the previous analysis, two sets are obtained, namely the set of uses of the call (SUC) and the set of definitions of the call (SDC). Initially, each SUC and SDC contains the uses and definitions of the respective function, as obtained in the interprocedural analysis. According to this rule, the sets of the calls of our example are initialized as shown in Figure 2.

Source code                            SUC          SDC

17 void function3(int * P1)
18 {
19     int L1;
20     function1 ();                    {G2}         {G1}
21     L1 = function2 ();               {G1, G2}     {}
22     *P1 = L1;
23 }
24
25 void function4()
26 {
27     function3 (&G3);                 {G1, G2}     {G1, G3}
28     G2 = G3;
29     if (G3)
30     {
31         function3 (&G4);             {G1, G2}     {G1, G4}
32         G2 = G4;
33     }
34 }

Figure 2: Initialization of the sets of uses and definitions of calls

3.2.1 Procedural call characterization

The result of the previous initialization does not provide enough information to determine dependence relationships among calls, since the effects of each call in the body of the caller function still need to be tracked. Notice that the final goal of this work is to assert whether or not two calls can run concurrently. Therefore, the analysis must take care of the uses and definitions made by statements that depend on previous calls, as well as of control dependences.

Whenever an assignment statement uses some variable Vi defined by a predecessor call fi(), the variable assigned by the statement, namely Vj, is considered an indirect definition of fi(). Notice that the execution of such a statement depends on Vi, which is defined by fi(). Therefore, the definition of Vj depends indirectly on the execution of fi(). In order to track these definitions, we apply an algorithm that explores the control flow graph of each function and expands the SDC of each call with its indirect definitions.

The previous algorithm must know which is the last call that defined the variables used by each statement. This information is provided by the set of live definitions of each call. The set of live definitions of a call fi(), computed at a statement stmtk, contains the variables defined by fi() and not redefined by any other statement stmtl on any execution path from fi() to stmtk, where stmtl does not depend on fi().

Applying the algorithm to the calls of function4 (see Figure 3) we get the following. The initial set of live definitions of the call function3(&G3) contains G1 and G3 (i.e., the variables of the SDC of that call). The statement at line 28 uses G3, which is a live definition of function3(&G3), and defines G2. Thus, G2 joins the set of live definitions of function3(&G3). Moreover, G2 joins the SDC of the call function3(&G3), since it is an indirect definition. After the call function3(&G4), G1 no longer belongs to the set of live definitions of function3(&G3), since function3(&G4) defines G1, killing the previous definition on this execution path. Similarly, the next statement kills the previous definition of G2, and the new definition of G2 is an indirect definition of function3(&G4). Finally, at line 34 two execution paths join. The algorithm must conservatively assume that definitions on both incoming execution paths are live definitions after the join point. Therefore, the sets of live definitions of function3(&G3) from both incoming execution paths are merged. As a consequence, G2 is an indirect definition of both function3(&G3) and function3(&G4).
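The following sketch shows the core of this update on the straight-line statements of function4. It is a minimal illustration under an assumed statement model (one used and one defined variable per statement) and hypothetical helper names, not the actual analyzer: a statement that uses a live definition of a call turns its defined variable into an indirect definition of that call; otherwise it kills any previous live definition of that variable.

/* Minimal sketch (assumed statement model, not the actual analyzer) of
 * live-definition tracking for the call function3(&G3) and of how indirect
 * definitions join its SDC. */
#include <stdio.h>
#include <string.h>

#define MAX_VARS 16
typedef struct { const char *v[MAX_VARS]; int n; } VarSet;

static int has(const VarSet *s, const char *x) {
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->v[i], x) == 0) return 1;
    return 0;
}
static void add(VarSet *s, const char *x) {
    if (!has(s, x) && s->n < MAX_VARS) s->v[s->n++] = x;
}
static void kill_def(VarSet *s, const char *x) {
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->v[i], x) == 0) { s->v[i] = s->v[--s->n]; return; }
}

/* One assignment statement reduced to one used and one defined variable. */
typedef struct { const char *use, *def; } Stmt;

/* If the statement uses a live definition of the tracked call, its defined
 * variable is an indirect definition (it joins both SDC and the live set);
 * otherwise the statement kills any previous live definition it overwrites. */
static void step(VarSet *live, VarSet *sdc, Stmt s) {
    if (s.use && has(live, s.use)) { add(sdc, s.def); add(live, s.def); }
    else                           { kill_def(live, s.def); }
}

int main(void) {
    /* State just after the call function3(&G3): SDC = live defs = {G1, G3}. */
    VarSet sdc  = { { "G1", "G3" }, 2 };
    VarSet live = { { "G1", "G3" }, 2 };

    step(&live, &sdc, (Stmt){ "G3", "G2" }); /* line 28: G2 = G3 -> indirect def */
    kill_def(&live, "G1");                   /* line 31: function3(&G4) redefines G1 */
    step(&live, &sdc, (Stmt){ "G4", "G2" }); /* line 32: G2 = G4 kills this call's G2 */

    printf("SDC(function3(&G3)) = {%s, %s, %s}\n", sdc.v[0], sdc.v[1], sdc.v[2]);
    printf("live definitions at line 33 = {%s}\n", live.v[0]);
    return 0;
}

The printed values ({G1, G3, G2} and {G3}) match the entries computed for function3(&G3) in Figure 3.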

To homogenize further dependence resolution algorithms, control dependences are converted into data dependences as follows. A conditional branch can be taken or not depending on the values of the variables involved in the condition evaluation. Therefore, we can assert that the execution of the target basic block depends on these variables (i.e., a data dependence). As a consequence, the calls included in such a basic block also depend on them. To reflect these dependences, the proposed algorithm makes the control flow dependences explicit by joining the variables used in jump conditions to the SUC of the calls whose execution depends on those branches. These variables are referred to as artificial uses. In the example, the execution of the call function3(&G4) in function4 depends on the value of G3 (due to the conditional 'if' structure). Therefore, G3 joins the SUC of the call function3(&G4) as an artificial use.


Source code                     function3(&G3)      function3(&G4)
                                live definitions    live definitions

24
25 void function4()
26 {
27     function3 (&G3);
28     G2 = G3;                  {G1, G3}
29     if (G3)                   {G1, G3, G2*}
30     {
31         function3 (&G4);      {G1, G3, G2*}
32         G2 = G4;              {G3, G2*}           {G1, G4}
33     }                         {G3}                {G1, G4, G2*}
34 }                             {G1, G3, G2*}       {G1, G4, G2*}

Figure 3: Computing live definitions for the function3(&G3) and function3(&G4) calls at each statement of function4. Variables marked with an asterisk in the live definitions sets are indirect definitions


3.3 Dependence analysis

At this point of the analysis, we are able to identify dependences among calls and variables, but we cannot yet establish inter-call dependences, which are the final aim of the proposed technique. This section addresses how to tackle such dependences.

The key question to be addressed is how to determine whether the uses and the definitions of each pair of calls in the body of a function conflict. Given a pair of calls fi() and fj() in the body of some function g, where fi() precedes fj() in the CFG of g, there is a true data dependence (RAW) from fi() to fj() if the SDC of fi() intersects the SUC of fj(). On the other hand, if the definitions of fi() intersect the definitions of fj(), there is an output dependence (WAW) between them. Finally, there is an antidependence (WAR) between them if the uses of fi() intersect the definitions of fj().
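As a sketch of this test (with an assumed set representation, and using the SUC/SDC values of the two calls to function3 in function4, already extended with the indirect definition G2 and the artificial use G3), the classification reduces to three set intersections:

/* Minimal sketch (assumed set representation, not the actual compiler pass)
 * of the inter-call dependence test: fi() precedes fj(); there is a RAW
 * dependence if SDC(fi) intersects SUC(fj), a WAW dependence if the two SDCs
 * intersect, and a WAR dependence if SUC(fi) intersects SDC(fj). */
#include <stdio.h>
#include <string.h>

#define MAX_VARS 16
typedef struct { const char *v[MAX_VARS]; int n; } VarSet;

static int intersects(const VarSet *a, const VarSet *b) {
    for (int i = 0; i < a->n; i++)
        for (int j = 0; j < b->n; j++)
            if (strcmp(a->v[i], b->v[j]) == 0) return 1;
    return 0;
}

int main(void) {
    /* fi() = function3(&G3), fj() = function3(&G4), with the indirect
     * definition G2 and the artificial use G3 already included. */
    VarSet suc_i = { { "G1", "G2" }, 2 };
    VarSet sdc_i = { { "G1", "G3", "G2" }, 3 };
    VarSet suc_j = { { "G1", "G2", "G3" }, 3 };
    VarSet sdc_j = { { "G1", "G4", "G2" }, 3 };

    int raw = intersects(&sdc_i, &suc_j);
    int waw = intersects(&sdc_i, &sdc_j);
    int war = intersects(&suc_i, &sdc_j);

    if (!raw && !waw && !war)
        printf("independent: the two calls can run concurrently\n");
    else
        printf("dependent: RAW=%d WAW=%d WAR=%d\n", raw, waw, war);
    return 0;
}

For this particular pair all three intersections are non-empty (the calls conflict through G1 and through the artificial use G3), so they must be serialized.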

If the above intersections are empty, fi() and fj() are independent. Thus, both function calls can be executed concurrently without loss of correctness. Note that in the case of anti and output dependences, indirect definitions can be disregarded, since they occur in the body of the analyzed function and can be handled by the scheduling mechanism. System performance depends on how calls are scheduled and synchronized; a detailed analysis of these topics is out of the scope of this paper, so we assume a simple scheduling policy and an ideal synchronization mechanism.

Another key point in dependence analysis is the way in which variable uses and definitions are considered for the intersection analysis. This is especially important in the case of dereferenced pointers. Below, we discuss this topic.

4 Data aliasing algorithms

The aim of this section is to set the rules that allow identifying any program variable and comparing it with others in order to check whether they are aliases.

The simplest variables are the scalar ones, since they represent atomic data and their memory addresses are statically known. The unique alias of such a variable is the variable itself. Aggregate types are not atomic. They are subject to being partially read and updated, but since the number of fields of an aggregate is known at compile time, this analysis distinguishes the different fields. Therefore, the unique alias of a base.field structure is base.field. Another non-atomic data type is the array type, which differs from aggregates in the number of components and the way they are accessed. The array size can be statically unknown; moreover, the accessed item of the array can depend on variables whose values are unknown until runtime. A sufficient condition is to consider each use or definition of an array item as a use or an update of the whole array, which is the approach followed in this paper. Consequently, each array item is an alias of the array variable. Note that arrays are commonly accessed within loops, so the identification of the accessed item is usually tackled by loop level parallelism studies.

Finally, the most hazardous type is the pointer type. A pointer can point to any instance of the previous types. Such an instance can be a global variable, a heap variable or even a stack variable. Therefore, in order to preserve correctness, a technique is required that discovers the aliasing of each pointer whose dereference is used or defined. In this paper, we propose three approaches:

• Type Aliasing. This technique is the most conservative and the easiest to implement. The analysis classifies a pointer according to its type. For instance, two dereferences of a pointer to integer are represented by the same alias. When using this technique, assignments between pointers of different types are unsafe. In order to preserve correctness, the types of the pointers involved in an assignment join the same aliasing class, that is, their types are considered to alias (a small sketch of this rule is given after this list).

• Point-to sets Aliasing. More accurate aliasing can be detected through a reference analysis based on points-to data [13, 14, 15, 16]. This analysis tracks the flow of references within the program and builds a set of data pointed to by each pointer. An interprocedural and field-sensitive analysis is performed. In other words, references are not clobbered beyond the function scope, and references stored in different fields are handled separately as far as possible.

The implemented reference analysis is intrinsically flow insensitive. Nevertheless, it is easy to obtain some level of flow sensitivity by translating the program into static single assignment form (SSA) [17]. Under SSA, each variable definition creates a new version of the variable. In that way, different assignments to a pointer result in different pointers, so that each one of them points to different data. The implementation evaluated in this paper applies SSA at the procedural level.

• Runtime Approaches. Even when using flow sensitivity, the exact variables that pointers point to are unknown until runtime, due to alternative execution paths and the complexity of the call graph. Therefore, it can often be worthwhile to apply runtime checks on pointers to identify whether some functions can run concurrently. In object oriented programming, containers (i.e., lists) can hide a lot of parallelism. In such cases, runtime mechanisms can verify whether or not an object of a container is being used by some function. This can help to reduce the impact of the false aliasing that the container data representation introduces.
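As an illustration of the type aliasing rule of the first approach, the following union-find sketch (with hypothetical type identifiers; not the actual compiler pass) keeps one alias class per pointed-to type and merges two classes whenever an assignment between pointers of those types is found:

/* Minimal union-find sketch (hypothetical type identifiers, not the actual
 * compiler pass) of type aliasing: each pointer is represented by the alias
 * class of its pointed-to type, and an assignment between pointers of two
 * different types merges their classes, so that their dereferences are
 * conservatively considered aliases. */
#include <stdio.h>

enum { T_INT, T_FLOAT, T_NODE, NUM_TYPES };
static int parent[NUM_TYPES];

static int find_class(int t) {
    return parent[t] == t ? t : (parent[t] = find_class(parent[t]));
}
/* Called for every assignment between pointers of types a and b. */
static void merge_classes(int a, int b) { parent[find_class(a)] = find_class(b); }
static int may_alias(int a, int b) { return find_class(a) == find_class(b); }

int main(void) {
    for (int t = 0; t < NUM_TYPES; t++) parent[t] = t;

    /* int *p; float *q; ... q = (float *) p;  -> the two classes are merged. */
    merge_classes(T_INT, T_FLOAT);

    printf("*(int *) vs *(float *): %s\n", may_alias(T_INT, T_FLOAT) ? "may alias" : "no alias");
    printf("*(int *) vs *(node *):  %s\n", may_alias(T_INT, T_NODE)  ? "may alias" : "no alias");
    return 0;
}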

5 Performance evaluation

As mentioned above, this paper is aimed at providing a first tentative analysis of how much performance gain could be achieved by exploiting FLP. This section analyzes the amount of FLP achievable when applying the proposed techniques to analyze call dependences.

void main ()
{
    ...
    f1 ();
    ...
    while (...)      /* two iterations */
        f2 ();       /* of the loop    */
    ...
}

Figure 4: Example of code for simulation

To this end, a software framework has been developed. The main tool is a simulator which is fed by a trace obtained from the sequential execution of the evaluated program, and which simulates its concurrent execution after analyzing which function calls have been identified as independent.


Figure 5: Sequential execution and the parallelized approach under the optimistic and pessimistic models. a) Sequential function parts; b) Parallel function parts


Consider the example shown in Figure 5 (the source code is shown in Figure 4). During the sequential execution (see Figure 5a), the main() function executes the call f1() (taking 10ms) and later enters a loop with two iterations that call f2() (resulting in the dynamic call instances f21(), which takes 15ms, and f22(), which takes 20ms). Assume also that the dependence analysis of the program states that f2() depends on f1(), but that both calls to f2() are independent of each other.

The simulation starts with a single task executing the main() function. When f1() starts, it is executed as a part of main(). Since f2() depends on f1(), the call f21() is executed sequentially after f1(). In the same way, f22() is executed sequentially after f1(), but concurrently with f21(), as a different task. The result is a tree that represents the simulated execution of the calls of main(). In that tree, f21() and f22() are linked to f1(), and the longest path (measured in execution time) is f1() → f22(), with 30ms of execution time. That is the parallel execution time for function main().
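The following sketch reproduces this computation for the example (hypothetical structures and names; the actual tool is a trace-driven simulator, and f2_1/f2_2 stand for the two dynamic instances f21() and f22()): each call starts when the call it depends on finishes, and the parallel time of main() is the largest finish time.

/* Minimal sketch (assumed structures, not the trace-driven simulator itself)
 * of the parallel-time computation for the example of Figures 4 and 5: each
 * call starts when the call it depends on has finished, and the parallel
 * execution time of main() is the largest finish time, i.e., the longest
 * path of the call tree. */
#include <stdio.h>

#define NCALLS 3

typedef struct {
    const char *name;
    int         cost_ms;  /* sequential execution time of the call        */
    int         dep;      /* index of the call it depends on, -1 for none */
} Call;

int main(void) {
    /* f2_1() and f2_2() both depend on f1() but are independent of each
     * other, so they can run as separate concurrent tasks. */
    Call calls[NCALLS] = {
        { "f1",   10, -1 },
        { "f2_1", 15,  0 },
        { "f2_2", 20,  0 },
    };

    int finish[NCALLS], parallel_time = 0;
    for (int i = 0; i < NCALLS; i++) {        /* calls listed in program order */
        int start = (calls[i].dep >= 0) ? finish[calls[i].dep] : 0;
        finish[i] = start + calls[i].cost_ms;
        if (finish[i] > parallel_time) parallel_time = finish[i];
        printf("%-5s starts at %2d ms, finishes at %2d ms\n",
               calls[i].name, start, finish[i]);
    }

    /* Longest path f1() -> f2_2(): 30 ms (the sequential time would be 45 ms). */
    printf("parallel execution time of main(): %d ms\n", parallel_time);
    return 0;
}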

Note that each call can also invoke other functions. Therefore, the execution time of such calls is calculated by applying the above algorithm to each one.

To avoid an extremely high simulation time, we do not analyze the statements executed between calls. These statements can depend on previous calls and enlarge the time of some execution paths. In order to reflect the impact of these statements on the final parallel execution time, two different models are used to obtain the bounds that enclose their actual behavior (see Figure 5b). The optimistic bound considers that all these statements are outside the critical execution path (the longest one). The pessimistic bound considers that all of them contribute to enlarging the critical execution path.

We have evaluated benchmarks written in C and Fortran from the Spec CPU 2006 benchmark suite and some Mediabench benchmarks. Regarding performance, Figure 6a and Figure 6b show the speedup achieved when applying the type aliasing detection analysis and the points-to sets aliasing analysis, respectively. The speedup provided by exploiting FLP falls in between the values achieved by the optimistic and pessimistic approaches.



Figure 6: Speedup using the different analyses. a) Type aliasing analysis; b) Points-to sets aliasing analysis

Using type aliasing detection does not reduce the execution time of the benchmarks under the pessimistic model (i.e., no speedup is obtained). Thus, in order to improve results, a more accurate aliasing detection analysis is required. The reference analysis performed with the points-to sets aliasing technique increases the pessimistic speedup. The largest benefits appear in 437.leslie3d and mpeg2enc, where the execution time is reduced by factors of 3.3 and 2.1, respectively. Spec2006 benchmarks show the best performance improvements (e.g., 429.mcf, 482.sphinx3, 445.gobmk and 437.leslie3d) when applying this technique. Regarding the optimistic speedup, some benchmarks achieve speedups higher than 3 (e.g., 437.leslie3d with 4.656 and 482.sphinx3 with 3.140).

6 Conclusions and Future Work

This paper has introduced FLP as a novel approach to exploit thread level parallelism. FLP requires the compiler to perform a static analysis of the code to identify which functions can be concurrently launched. A main concern regarding FLP is pointer aliasing, and two techniques for pointer aliasing detection have been presented. To provide insight into the potential of this approach, two different techniques based on FLP have been evaluated.

Experimental results show that these techniques, for some applications, might provide performance improvements falling between 200% and 365%.

As for future work, we plan to analyze how runtime mechanisms could help to exploit FLP. A previous static analysis will allow us to identify the main variables constraining parallelism. In other words, we aim to identify variables, suitable for runtime analysis, that hinder the concurrent execution of a significant number of calls. This analysis could also be applied to programs written in object oriented languages, since we feel that these programming languages exhibit a high degree of intrinsic FLP because of object properties such as encapsulation.

Acknowledgments

This work was supported by CICYT under Grant TIN2006-15516-C04-01, by Consolider-Ingenio 2010 under Grant CSD2006-00046, by the Generalitat Valenciana under Grant GV06/326, and by the Universidad Politécnica de Valencia under the PAID-06-07-20080029 program.

References

[1] A. Kejariwal, X. Tian, M. Girkar, W. Li, S. Kozhukhov, U. Banerjee, A. Nicolau, A. V. Veidenbaum, and C. D. Polychronopoulos. Tight analysis of the performance potential of thread speculation using SPEC CPU 2006. pages 215–225, 2007.

[2] G. S. Sohi. Speculative multithreaded processors. Computer, 34(4):66–, 2001.

[3] P. Marcuello, J. Tubella, and A. González. Value prediction for speculative multithreaded architectures. 1999.

[4] J. T. Oplinger, D. L. Heine, and M. S. Lam. In search of speculative thread-level parallelism. In Parallel Architectures and Compilation Techniques, Proceedings, pages 303–313, 1999.

[5] Y. Sazeides and J. E. Smith. The predictability of data values. In Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), pages 248–258, 1997.

[6] C. García, C. Madriles, J. Sánchez, P. Marcuello, A. González, and D. M. Tullsen. Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices. SIGPLAN Not., 40(6):269–279, 2005.

[7] A. González-Escribano and D. R. Llanos. Speculative parallelization. Computer, 39(12):126–128, December 2006.

[8] M. Cintra and D. Llanos. Toward efficient and robust software speculative parallelization on multiprocessors. In PPoPP '03: Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 13–24, San Diego, California, USA, June 2003. ACM Press.

[9] F. Dang and L. Rauchwerger. The R-LRPD test: speculative parallelization of partially parallel loops. In Parallel and Distributed Processing Symposium, pages 20–29, 2002.

[10] D. R. Llanos, D. Orden, and B. Palop. Meseta: a new scheduling strategy for speculative parallelization of randomized incremental algorithms. In ICPP 2005 Workshops, 2005.

[11] D. R. Llanos, D. Orden, and B. Palop. New scheduling strategies for randomized incremental algorithms in the context of speculative parallelization. IEEE Transactions on Computers, 56(6):839–852, 2007.

[12] S. Rul, H. Vandierendonck, and K. Bosschere. Function level parallelism driven by data dependencies. SIGARCH Comput. Archit. News, 35(1):55–62, 2007.

[13] W. Landi and B. G. Ryder. A safe approximate algorithm for interprocedural aliasing. In PLDI '92: Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, pages 235–248, New York, NY, USA, 1992. ACM.

[14] D. Chase, M. Wegman, and F. K. Zadeck. Analysis of pointers and structures. SIGPLAN Not., 39(4):343–359, 2004.

[15] S. Horwitz, P. Pfeiffer, and T. Reps. Dependence analysis for pointer variables. In PLDI '89: Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, pages 28–40, New York, NY, USA, 1989. ACM.

[16] J. M. Barth. A practical interprocedural data flow analysis algorithm. Commun. ACM, 21(9):724–736, 1978.

[17] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, 1991.
