
First Steps Towards Optimal Oblique Tile Sizing

R. Andonov, P-Y. Calland, S. Niar (LAMIH/ROI, Université de Valenciennes, Valenciennes, France), S. Rajopadhye (IRISA, Rennes, France), N. Yanev (University of Sofia, Sofia, Bulgaria)

Abstract

Tiling is a common compiler optimization, used for adjusting the parallelism granularity or for enhancing locality. For the case of a 2-dimensional iteration space (domain) tiled with oblique tiles, we address the problem of determining the tile size that minimizes the total execution time on a distributed memory parallel machine. We restrict our attention to uniform dependency computations where one of the tile boundaries is nevertheless parallel to the domain boundaries, in the interest of good load balance. Our solution is obtained by formulating and solving a discrete nonlinear optimization problem. Experimental running times on an Intel Paragon and an IBM SP2 compare well with our analytical predictions.

1 Introduction

Iteration space tiling [Wol87, IT88] is a technique used by parallelizing compilers to improve data locality, and to increase the computation to communication ratio by varying the granularity of the computation. It may also be used for hand tuning the performance of parallel codes (see also [KCN90b, RS91, SD90, V.93]). On distributed memory machines it may be implemented as follows. A nested loop program executes in SPMD (Single Program Multiple Data) fashion, and the communication is performed by send/receive calls. A tile in the iteration space is a collection of iterations to be executed as a single unit with the following protocol: all the (non-local) data required for each tile is first obtained by appropriate communication calls. The body of the tile is executed next, and this is repeated iteratively. The code for the body contains no calls to any communication routine.
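The per-tile protocol described above can be sketched as a minimal simulation (the helper names `get_remote` and `compute` are hypothetical stand-ins for the machine's receive calls and the tile body, not actual SPMD code):

```python
# Sketch of the per-tile protocol: fetch all non-local data first, then run
# the tile body, which itself performs no communication.
def run_tiles(tiles, get_remote, compute):
    """Execute tiles in dependence order; one fetch + one body per tile."""
    results = {}
    for t in tiles:
        halo = get_remote(t)           # all non-local data obtained first
        results[t] = compute(t, halo)  # tile body: no communication inside
    return results

# Toy instantiation: each "tile" just combines its id with the fetched value.
out = run_tiles([0, 1, 2], get_remote=lambda t: 10 * t,
                compute=lambda t, h: t + h)
```

On a real machine the fetch would be a blocking receive and the results of each tile would be pushed to the successor processor by a matching send.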
Alternatively, the communication is performed through calls to libraries such as MPI or OpenMP.

The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: tile shape optimization [BDRR94], and tile size optimization [AR97, HKT94, KCN90a, PSCB94] (some authors also attempt to resolve both problems under some simplifying assumptions [RS91, SD90, ES96]). By its very nature, such a two-step approach may not be globally optimal, but is often used in order to make the problem tractable.

The tile size problem seeks, for a given tile shape, to choose the size (length along each dimension of the hyper-parallelepiped) so as to minimize the total execution time. In its most general formulation, this is a hard discrete non-linear optimization problem, and there is currently no solution. However, optimal solutions can be found analytically under certain restrictions.

All the recent pragmatic work on tile size optimization (i.e., where the theoretical predictions are backed by experimental evaluations on real machines) addresses the following class: (i) perfect loop nests with (hyper-)parallelepiped shaped iteration spaces (domains); (ii) uniform dependence vectors; and (iii) tiles parallel to the domain boundaries. In this paper we study oblique tiles, i.e., we relax the latter restriction. The common arguments given in the literature against oblique tiling are that (i) code generation is difficult, (ii) the code will need to test for boundary cases and hence is likely to be inefficient, and (iii) in most cases programs can always be tiled orthogonally (albeit in a degenerate manner, by not tiling certain loops). However, to the best of our knowledge, there has been no detailed study, analytic or experimental, to justify this.

In this paper, we present some preliminary results on oblique tiling of 2-dimensional domains. We allow only one of the tile boundaries to be oblique. A pragmatic reason for this is that the resulting tiles can then be allocated to processors so as to maintain a good load balance. Our main result is that we formulate and analytically solve a discrete non-linear optimization problem to determine the tile size parameters that minimize the total execution time of the tiled program on a parallel machine. Our solution is approximate, and we perform a number of experiments (on an Intel Paragon and an IBM SP2) that (i) clearly justify the approximation, (ii) validate the theoretical model and its predictions, and (iii) clearly demonstrate that oblique tiling yields significant performance gains, typically 30% on the Paragon and 25% on the SP2.

The remainder of this paper is organized as follows. We first illustrate (in Section 2), through a simple program (which we shall use as a running example), the steps involved in optimal oblique tiling.
Next, in Section 3, we formulate the tile sizing problem for this example, using the two-step approach of Andonov and Rajopadhye [AR97]. In Section 4, we describe the solution of the optimization problem, and illustrate it for our running example. In Section 5 we formulate the 2-dimensional oblique tile sizing problem for the general case. We also prove a result (known in the "folklore") that it is possible to obtain an orthogonal tiling (albeit degenerate) for any program that does not have cyclic dependencies. We present experimental results in Section 6, and conclude in Section 7.

2 An illustrative example

We consider the problem of computing the following recurrence over the n × m rectangular domain {(j, k) | 0 < j ≤ n, 0 < k ≤ m}:

  X[j, k] = f( X[j, k-1], X[j-1, k-1], X[j-1, k], X[j-1, k+1] )   (1)

This is a dynamic programming computation that occurs in path planning in robotics, for which Bitz and Kung [BK88] have presented a systolic array. Each element is updated with a weighted average of the values of 4 out of 8 of its neighbors (those that precede it in the lexicographic order). The dependence matrix (we use the convention that the origin is at the top left) is given by

  D = ( 0  1  1   1
        1  1  0  -1 )

The first step in our procedure is to choose the tile shape, and this is done essentially as follows. Since we have a 2-dimensional iteration space, the tiles are defined by 2 families of

parallel lines, and their normals can be specified by the rows of a 2 × 2 matrix, H. Many authors have investigated the problem of choosing H. A sufficient condition to ensure that the tiles do not have cyclic dependencies is that HD ≥ 0. Ramanujam and Sadayappan [RS91] have proposed that H be lower triangular with non-negative entries and unit diagonal, so

  H = ( 1  0
        a  1 )   with a ≥ 0.

The condition HD ≥ 0 then simplifies to a ≥ 1. They also propose that the unknown entries of H be chosen to minimize the sum of the elements of HD, i.e., 3a + 4, which is achieved at a = 1, and so the tiling is specified by

  H = ( 1  0
        1  1 )

Our optimal tiles are thus parallelograms (say, of size s × r) formed by the families j = const and j + k = const, i.e., horizontal and diagonal lines respectively. Figs. 1.a and 1.b illustrate two views of the tile shape, and Fig. 1.c shows how to determine the dependencies of the tile graph. Hence we will seek a systolic array implementation of the following recurrence:

  X[j, k] = f( X[j-1, k], X[j-1, k-1], X[j, k-1] )   (2)

The domain of this recurrence is the parallelogram of Fig. 1.b, "compressed" by the rational diagonal transformation

  Q = ( 1/s   0
        0    1/r )

However, since we compress horizontally by a factor r and vertically by a factor s, the domain boundary will now have a slope s/r. Observe that because the horizontal cuts are parallel to the boundaries of the original domain, there are exactly N = n/s rows of tiles. Also note that the s/r slope of the boundary produces some special tiles in the tile graph. Because there is no tile immediately below it, such a boundary tile has only a horizontal dependency leaving it, and this will turn out to be a crucial aspect affecting our problem formulation. We also see that each row has precisely M = m/r tiles, other than the boundary tiles.
It is also easy to verify that this is true in general, for arbitrary values of r and s: the rational diagonal transformation Q yields a parallelogram whose vertices are [0, 0], [0, m/r], [n/s, 0] and [n/s, m/r], and whose horizontal side is m/r.

2.1 Mapping the tile graph to a virtual systolic array: abstract modeling

We now map this recurrence to a systolic array. We see that the optimal schedule is given by λ = [1, 1]ᵀ, i.e., the family of 45° lines. As for the allocation function, there are many choices. However, if we impose the restriction that only nearest-neighbor connections are allowed, it is known [ZRW92] that there are no more than 4 possibilities. They correspond to projections along the directions [1, -1], [1, 1], [1, 0] and [0, 1], respectively. Of these, the first three are not chosen because (i) [1, -1] conflicts with the timing function; (ii) [1, 1] yields a non-unimodular transformation¹; and (iii) projection along [1, 0] yields an array whose processors are active for different lengths of time, which leads to load imbalance (especially since we envisage multiple passes). Hence the only remaining choice is [0, 1], which gives us a linear array of N PEs. Thus, our allocation function is a(j, k) = j. Since the tile graph is an N × M parallelogram, we can show by standard analysis that it requires N + M "time steps" on N virtual processors.

¹ If [j, k] and [j + 1, k + 1] are two successive computations performed by the same PE, and t(j, k) = j + k is the schedule, then each PE is active only on alternate cycles.
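The tile-shape argument of Section 2 can be checked mechanically. The following sketch verifies, for the Bitz-Kung dependence matrix, that H = [[1, 0], [a, 1]] satisfies HD ≥ 0 exactly when a ≥ 1, and that the element sum of HD is 3a + 4:

```python
# Bitz-Kung dependence matrix (origin at the top left).
D = [[0, 1, 1, 1],
     [1, 1, 0, -1]]

def HD(a):
    """Product H·D for H = [[1, 0], [a, 1]] (a 2x4 matrix)."""
    H = [[1, 0], [a, 1]]
    return [[sum(H[i][k] * D[k][j] for k in range(2)) for j in range(4)]
            for i in range(2)]

def valid(a):
    """True when every entry of H·D is non-negative (no cyclic tile deps)."""
    return all(x >= 0 for row in HD(a) for x in row)
```

For example, `valid(0)` is false (the entry a - 1 is negative), while `valid(1)` holds and gives element sum 7 = 3·1 + 4, confirming that a = 1 minimizes the sum subject to a ≥ 1.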


(a) The original iteration space of the Bitz-Kung algorithm, and the tiles implied by H = (1 0; 1 1).


(b) The same domain transformed by H, which makes the domain a parallelogram, and the tiles rectangular. Thus, each point z of the original domain is mapped to the tile z' = ⌊QHz⌋, where Q is a rational diagonal matrix, specified by (1/s, 1/r).

(c) Four adjacent tiles, showing the data dependencies that cross the tile boundaries, and the corresponding dependencies in the tile graph (after transformation by H).

Figure 1: Determining the dependencies of the Bitz-Kung tile graph

3 Performance modeling on a parallel machine

We now describe the performance of this algorithm, when implemented (in multiple passes, if necessary) on a fixed size ring of p processors. We will formulate our optimization problem using just two parameters, namely the processor latency, L, and the tile period, P, as defined below. Two tiles are said to be successive if one depends directly on the other. The tile period, P, is defined as the time elapsed (in the steady state) between corresponding instructions of any two successive tiles that are mapped to the same processor. The processor latency, L, is the time by which processor j is delayed with respect to processor j-1 (once the processor allocation a(j, k) = j has been done).

A straightforward counting argument now yields a formula for the running time of the program on a ring of p processors. The tile graph is an N × M parallelogram, where M = m/r and N = n/s, which corresponds to N + M "time steps" on N virtual processors. Hence, we will need N/p passes through the ring. Since a macro-row contains M tiles, the period P_row of a macro-row is given by

  P_row = M·P   (3)

A single pass involves communications between p processors (the last one is around the ring), and we define the latency L_pass of a complete pass as

  L_pass = p·L   (4)

Eqns. (3) and (4) give us two constraints that determine when the next pass can start: the first macro-row must be completely finished, and the last processor should have produced some results. Hence, the pass period of the ring is

  P_ring = max(P_row, L_pass)   (5)

In a single pass, the last macro-row can start only at (p-1)L, and hence the execution time of a pass is given by

  T_pass = (p-1)L + P_row   (6)

The entire program is executed in N/p passes, and hence the last pass can only start at (N/p - 1)P_ring. Thus, the total running time, T(r,s), is (N/p - 1)P_ring + T_pass. Considering the two cases of Eqn. (5) and simplifying, we have the following optimization problem:

  Minimize T(r,s) = { T1(r,s) = (NM/p)P + (p-1)L   if P_row ≥ L_pass
                    { T2(r,s) = M·P + (N-1)L       if P_row ≤ L_pass     (7)

  subject to 1 ≤ r ≤ m and 1 ≤ s ≤ n/p   (8)

3.1 Modeling the tile period and processor latency

Note that the above formulation was obtained without considering the exact nature of the final code. We will now determine P and L, based on standard assumptions about the low-level behavior of the architecture and program [GR88, Bo90]. Although the model is developed for a distributed memory parallel machine using send/receive system calls, it can easily be adapted to various models such as MPI, OpenMP, or even the BSP model (see [ARY98]).
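The simplification that leads to the two cases of Eqn. (7) can be checked numerically; the sketch below evaluates the unsimplified total time (N/p - 1)P_ring + T_pass and the closed forms T1 and T2 (the numeric values in the usage note are arbitrary samples, not machine parameters):

```python
# Verify the simplification of T = (N/p - 1)*P_ring + T_pass into the two
# cases T1 (P_row >= L_pass) and T2 (P_row <= L_pass) of Eqn. (7).
def total_time(N, M, P, L, p):
    P_row = M * P                   # Eqn. (3)
    L_pass = p * L                  # Eqn. (4)
    P_ring = max(P_row, L_pass)     # Eqn. (5)
    T_pass = (p - 1) * L + P_row    # Eqn. (6)
    return (N / p - 1) * P_ring + T_pass

def t1(N, M, P, L, p):              # region where P_row >= L_pass
    return (N * M / p) * P + (p - 1) * L

def t2(N, M, P, L, p):              # region where P_row <= L_pass
    return M * P + (N - 1) * L
```

For instance, with N = 8, M = 10, P = 5, p = 4, the choice L = 10 gives P_row ≥ L_pass and total_time agrees with t1; the choice L = 20 flips the inequality and total_time agrees with t2.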
At the machine level, the time to transfer r words between two processors is given by β_m + rβ_t, where β_m is the message startup time (including the constant overhead of headers, routing information, etc.) and 1/β_t is the bandwidth. Let β denote the overhead of the send (and receive) system call, and β_c the time (per byte) to copy from user to system memory. The message transmission time is then β + 2rβ_c + (β_m + rβ_t).

Note that the copy cost rβ_c is counted twice: once for the sender and once for the receiver. The β for the receiver can be omitted because the system calls occur on different processors, and when the sender and receiver are properly synchronized, the calls are simultaneous. Usually, β ≫ β_m (β involves a context switch). We assume that the time to execute a single instance of the loop body is α (we neglect the overhead of setting up the loop for each tile, as well as cache effects). The time to execute a tile body is then t_a = αrs. Clearly P is the time that the CPU is busy for each tile, i.e., two OS calls, two message copies, and the tile body. Hence

  P = 2β + 2β_c r + αrs   (9)

The latency between processors, L, is as follows (this can easily be justified by analyzing Fig. 2, which illustrates the case s > r):

  L = P + β_t r - β + (s/r)P   (10)

The term (s/r)P accounts for the ⌊s/r⌋ boundary tiles in each row of the tile graph in the case s ≥ r. For the case s < r, the formula is deduced from the average delay (over p - 1 processors), given by

  [ (p-1)(P + β_t r - β) + ⌊(p-1)s/r⌋·P ] / (p - 1)
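Eqns. (9) and (10) can be sketched directly as a pair of helper functions (the parameter values used in the usage note are illustrative, not measurements):

```python
# Cost model of Section 3.1: tile period P (Eqn. 9) and processor latency L
# (Eqn. 10, written for the case s > r).
def period(r, s, beta, beta_c, alpha):
    # two OS calls, two message copies, and the tile body
    return 2 * beta + 2 * beta_c * r + alpha * r * s

def latency(r, s, beta, beta_c, beta_t, alpha):
    # L = P + beta_t*r - beta + (s/r)*P   (the (s/r)P term pays for the
    # boundary tiles whose inputs come from full-period tiles upstream)
    P = period(r, s, beta, beta_c, alpha)
    return P + beta_t * r - beta + (s / r) * P
```

For example, with r = 10, s = 20, β = 40, β_c = β_t = 0.0015 and α = 1, the period is 280.03 and the latency is about 800.1 (in the same time unit as the parameters).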


Figure 2: The case s > r

Observe that because the call to receive is overlapped, we subtract one β factor, and the β_t r term accounts for the transmission time. Since all the tiles are identical, later tiles will all be in sync, and future receives are expected to return immediately (except for random fluctuations in the system). We emphasize that the additional term is s/r times the period of a normal (non-boundary) tile. This may seem odd at first glance, since the boundary tiles do not perform a send (in the s > r case), and one may expect them to run faster. Although this is true, the data they need is produced by a non-boundary tile on the preceding processor, and those tiles run with a period P! Of course, we have used a rational approximation (there are s boundary tiles for every r processors, not s/r tiles between every pair of successive processors), and have ignored the floor and ceiling

functions that are bound to come up when neither r nor s divides the other. For such cases, the details of the code generation also become very complicated. However, as shown by the computational results, our rational model suffices for performance modeling.

4 Solution of the optimization problem

We now return to the optimization problem (Eqns. 7-8), which is to minimize

  T(r,s) = { T1(r,s) = (NM/p)P + (p-1)L   if MP ≥ pL
           { T2(r,s) = M·P + (N-1)L       if MP ≤ pL

in the domain R: the integer points of the rectangle 1 ≤ r ≤ m, 1 ≤ s ≤ n/p, where P = 2β + 2β_c r + αrs. Note that at this level of abstraction this problem formulation is exactly the same as for the orthogonal tile case [AR97]. The crucial difference between the two is that for orthogonal tiling, L = P + β_t r - β, while in the Bitz-Kung algorithm, L = (1 + s/r)P + β_t r - β. Let us first discuss the common aspects of the two problems.

We see that the condition g(r,s) ≡ pL - MP is a non-linear constraint involving r, s, the architecture parameters, and p, the number of processors. It divides the feasible space into two regions, D1: g(r,s) ≤ 0 (called region I), and D2: g(r,s) ≥ 0 (region II). The first question we seek to resolve is that of determining which region contains the solution.

It is easy to see that the cost function is continuous across the boundary: if g(r,s) = 0, then T1(r,s) = T2(r,s). Since P_ring = max(MP, pL), it is easy to verify that in region I, T1(r,s) ≥ T2(r,s), and in region II, T1(r,s) ≤ T2(r,s). This leads to the following proposition.

Proposition 4.1. If the solution of the optimization problem that seeks to minimize T1(r,s) (respectively, T2(r,s)) over the entire region 1 ≤ r ≤ m, 1 ≤ s ≤ n/p satisfies MP ≥ pL (respectively, MP ≤ pL), then it is a solution of our problem.

Note that none of P, L, and M depends on p, and the effect of reducing p is to enlarge the area of region I.
On the other hand, as the problem size increases, we are more and more likely to be in region I, since MP increases linearly with the problem size while pL is independent of it. Also note that T2(r,s) is independent of p, and hence, if our optimal solution is in region II, we can obtain the same optimal running time with fewer processors. We can thus continue to reduce p until (say, with p0 processors) we are at the boundary between the two regions. Now, with p0 processors, the optimal tile size must be in region I (the solution for region II is on the boundary), and the corresponding running time is no worse than the optimal solution obtainable with p processors. An intuitive explanation is that in region II, the latency of a pass is larger than the period of a macro-row executing on the first processor, and hence the processors are idle between successive passes. This seems to be a waste of resources, and we do not expect the optimal solution to be here. The only exception is when we have only one pass (in this case, the pass latency is not a factor). It is easy to verify, by substituting s = n/p, i.e., N = p, in

Eqn. (7), that T1(r,s) = T2(r,s). Because of the nondifferentiability of the objective function, we use the relation

  min_R max{T1(r,s), T2(r,s)} = min{ min_{R∩D1} T1(r,s), min_{R∩D2} T2(r,s) }

and solve problem P1: min_{R∩D1} T1(r,s) and problem P2: min_{R∩D2} T2(r,s) separately. In other words, our general strategy consists of the following steps.

1. Solve the optimization problem corresponding to Eqns. (7-8) as two sub-problems (minimize T1(r,s) in region I, and T2(r,s) in region II). Compare the two solutions. If the better one is in region I, we are done.²

2. If the optimal solution is in region II, we know that the problem size is too small to make effective use of the available resources (the "overhead" of parallelism is too high). We therefore reduce p until we are back in region I, and then re-solve the problem for this reduced number of processors. This solution is guaranteed to be no worse than the optimal solution with p processors.

Observe that the final solution is not necessarily optimal in terms of the best possible utilization of the available p processors. In general, we should also treat p as a variable and seek to optimally choose all three of r, s, and p, but this would significantly complicate the problem. We also see that even though we may have to solve three separate (but related) non-linear optimization problems, the final solution is guaranteed to be in region I.

4.1 Solution for the Path Planning Recurrence

As we saw in Section 2, finding the optimal tile size passes through solving the optimization problem (7). For the purposes of this section we set p̄ = p - 1, G = sP/r and F = P - β + β_t r. We can then rewrite the non-linear constraint as g(r,s) ≡ p(F + G) - (m/s)G, and the timing functions T1(r,s) and T2(r,s) as follows:

  T1(r,s) = p̄F + (p̄ + mn/(ps²))G   and   T2(r,s) = (n/s - 1)F + ((n+m)/s - 1)G   (11)

Recall that T1(r,s) and T2(r,s) are defined over region I (D1: g(r,s) ≤ 0) and region II (D2: g(r,s) ≥ 0), respectively.
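For small instances, the minimizer of max{T1, T2} over R can also be found by brute force, which is a useful sanity check on the closed-form analysis of this section (a sketch; the parameter values in the usage note are placeholders, not measured machine constants):

```python
# Brute-force reference solver for Eqns. (7)-(8): minimize max(T1, T2)
# over the integer points of 1 <= r <= m, 1 <= s <= n/p.
def solve(m, n, p, beta, beta_c, beta_t, alpha):
    def cost(r, s):
        M, N = m / r, n / s
        P = 2 * beta + 2 * beta_c * r + alpha * r * s     # Eqn. (9)
        L = P + beta_t * r - beta + (s / r) * P           # Eqn. (10)
        T1 = (N * M / p) * P + (p - 1) * L                # region I
        T2 = M * P + (N - 1) * L                          # region II
        return max(T1, T2)
    return min(((r, s) for r in range(1, m + 1)
                for s in range(1, n // p + 1)), key=lambda t: cost(*t))
```

For example, `solve(20, 40, 2, 40.0, 0.0015, 0.0015, 1.0)` returns an integer pair inside the rectangle R; as noted below, convexity in r would let this search run in O(n/p) time rather than enumerating the whole grid.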
Then the optimal tile size is given by the point (r*, s*) in the rectangle R which minimizes the function max{T1, T2}. Computationally, this point could be found in O(n/p) time, simply by using the convexity of T1 and T2 in the variable r. Instead, our goal is to find a closed-form solution which, if not exact, is acceptably close to the minimizer (r*, s*).

4.2 Solving problem P2

In equation (11) we reasonably assume that n/s ≫ 1 and (m + n)/s ≫ 1. This simplifies the formulas significantly, and we will confirm the assumption at the optimum point. We can then drop

² For the single pass case, we consider the entire line segment s = n/p, 1 ≤ r ≤ m, as an "extended" region I.

the -1 terms from the expression for T2 in (11), which shifts the graph up and does not affect the optimal point. Now, using the necessary condition for optimality, ∇T2(r,s) = 0, we obtain

  ∂T2/∂s = -βn/s² - n(2β_c + β_t)r/s² + (m+n)α = 0
  ∂T2/∂r = n(2β_c + β_t)/s + nβ_t - 2(m+n)β/r² = 0

From this system we obtain the following equation for s:

  (m+n)α s² = βn + sqrt( n(2β_c + β_t)s · 2(m+n)βs/(2β_c + β_t + αs) )

The function 2(m+n)βs/(2β_c + β_t + αs) is increasing, from which the uniqueness of the optimal point of T2 in R²₊ follows. Hence the following is true.

Proposition 4.2. Let (r0, s0) be the minimal point of T2 in R²₊. If g(r0, s0) ≤ 0, then min_R max{T1(r,s), T2(r,s)} = min_{R∩D1} T1(r,s).

This proposition necessitates solving problem P2 only if g(r0, s0) > 0. In order to check this condition, and to find the optimal point as well, we turn again to the function 2(m+n)βs/(2β_c + β_t + αs). For any realistic configuration and any interesting instances this function is essentially constant and equal to 2(m+n)β/α. Thus for (r0, s0) we obtain

  s0 = sqrt( (n/(m+n))·(β/α) ) + ((2β_c + β_t)/α)·sqrt( 2n/(m+n) )·sqrt( β/α )
  r0 = sqrt( 2(m+n)/n )·sqrt( β/α )   (12)

Now we can simply check whether g(r0, s0) > 0. In that case the optimal solution of problem P2 is the closest integer point which minimizes T2. Another important consequence of having such a closed-form solution is that we can optimize even the number of processors, in the following sense. Let the optimal tile size given by (12) satisfy g(r0, s0) > 0. We can in this case obtain the same performance on a smaller number of processors p0 < p, where p0 is determined as follows. If (r0, s0) ∈ R then p0 = (m/s0)·G(r0,s0)/(F(r0,s0) + G(r0,s0)) (i.e., such that g(r0, s0) = 0). If (r0, s0) ∉ R (i.e., s0 > n/p), by decreasing p we can enlarge the feasible domain R∩D1 until the point (r0, s0) moves into it.
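The closed-form point of Eqn. (12) is cheap to evaluate; the sketch below uses the formula as transcribed above (the function name and the parameter values in the usage note are illustrative):

```python
import math

# Evaluate the closed-form minimizer (r0, s0) of T2, per Eqn. (12).
def p2_minimizer(m, n, beta, beta_c, beta_t, alpha):
    root = math.sqrt(beta / alpha)              # common factor sqrt(beta/alpha)
    r0 = math.sqrt(2 * (m + n) / n) * root
    s0 = (math.sqrt(n / (m + n)) * root
          + (2 * beta_c + beta_t) / alpha * math.sqrt(2 * n / (m + n)) * root)
    return r0, s0
```

One would then round (r0, s0) to the nearest integer point minimizing T2, and use it only when g(r0, s0) > 0, as described above.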
Using the observation that a quite reasonable approximation to the non-linear border g(r,s) = 0 is the line s + r = m/p, we see that the reduced number of processors is p0 = min{ n/s0, m/(s0 + r0) }.

4.3 Solving problem P1

Below, we outline the main steps of the analysis of problem P1: min_{R∩D1} T1(r,s).

i) Find the point (r*(v), s*(v)) which minimizes T1(r,s) over the hyperbola rs = v, for arbitrary fixed v. The coordinates of this point are expressed as functions of the single variable v;

ii) By substituting into T1(r,s), we obtain the minimal value f(v) of T1(r,s) over the hyperbola rs = v, as a function of v;

iii) The function f(v) is minimized in order to obtain the optimal (unconstrained) tile volume v*;

iv) The optimal point on rs = v* is found by the formulas in i) and, if necessary, is shifted along this hyperbola to meet the feasible set R∩D1.

The details of these steps are as follows.

i) We rewrite the function T1(r,s) of (11) conveniently as

  T1(r,s) = A/(rs) + B/s + Cr + Drs + Es/r + Ks + Ds² + const   (13)

where A = 2βmn/p, B = 2β_c mn/p, C = p̄(2β_c + β_t), D = p̄α, E = 2βp̄, K = 2β_c p̄, and consider its restriction to rs = v:

  T1(r,s) = A'/s + B'/r² + A/v + Dv + const   (14)

where A' = Cv + B, B' = Ev + Dv², v = rs.

Remark: The linear term Ks from (13) is dropped from the last expression. When β_c is small this omission does not influence the optimal point of (13), but it allows a more compact closed-form expression for the minimizer.

The minimizer (r*(v), s*(v)) of (14) over rs - v = 0 can be obtained by the method of Lagrange multipliers:

  r*(v) = (2B'v/A')^(1/3) = ( 2(2β + αv)v²pp̄ / (p̄p(2β_c + β_t)v + 2β_c mn) )^(1/3)
  s*(v) = (A'v²/(2B'))^(1/3) = ( (p̄p(2β_c + β_t)v² + 2β_c mn v) / (2(2β + αv)pp̄) )^(1/3)   (15)

ii) Now we use the obvious relation min_{R²₊} T1(r,s) = min_v min_{rs=v} T1(r,s), which is equivalent to minimizing f(v), v ≥ 1, where f(v) is obtained from (14) with (r,s) set to (r*(v), s*(v)), i.e.,

  f(v) = A/v + Dv + l·( (Cv + B)²(Ev + Dv²)/v² )^(1/3) + const,   l = (27/4)^(1/3)   (16)

iii) In order to estimate the zero of f'(v) analytically, instead of minimizing (16) we minimize its majorant g(v):

  g(v) = A/v + Dv + (2l/3)·(Cv + B)/v + (l/3)(Ev + D) + const   (17)

Here, we use the inequality

  ((Cv + B)/v)^(2/3) · (Ev + D)^(1/3) ≤ (2/3)·(Cv + B)/v + (1/3)·(Ev + D)   (18)

Now we obtain for the zero of g'(v):

  v* = argmin g(v) = sqrt( (3A + El)/(3D + 2lC) ) = sqrt( (6βmn + 2βp̄pl) / (p̄p(3α + 2l(2β_c + β_t))) )   (19)

which can be approximated as (l₀/p)·sqrt(mn), where l₀ = sqrt( 6β/(3α + 2l(2β_c + β_t)) ).

iv) Finally, the absolute minimum of the function T1(r,s) is attained at the point given in (15) with v = v*.

To obtain the solution of the constrained problem we need to check whether g(⌊r*(v*)⌋, ⌊s*(v*)⌋) ≤ 0 and, if not, to make the necessary adjustment described below.

Case 1 [the hyperbola rs = v* intersects the feasible set R∩D1]: In this case, if the point (r*(v*), s*(v*)) is outside, we shift it along the hyperbola to the crossing point of the hyperbola with the feasible set. Using the observation that the non-linear border of this set is the line s + r = m/p, we easily find r = m/p - s. Solving sr = s(m/p - s) = v* then yields the minimizer.

Case 2 [the hyperbola rs = v* lies outside the feasible set]: A simple criterion for this is (m/2p)² < v*. The point (m/2p, m/2p) maximizes sr over s + r = m/p. In this case the minimizer is (⌊m/2p⌋, ⌊m/2p⌋).

In order to check the accuracy of the result, we generated numerous instances with m and n up to 10⁸ and p up to m/3. The exact solution is computed by enumeration on s, using the convexity of the functions T1 and T2 in r. In all runs where the objective function T = max{T1, T2} attains its minimum at T1, the predicted minimal value exceeds the true one by less than 0.005 sec, due to the very good prediction of the optimal tile size.

5 Problem formulation for the general case

Our problem formulation described in Section 2 was very similar to that proposed by Andonov and Rajopadhye [AR97] for the case of orthogonal dependencies, except that we have extended it to oblique tiling. We now discuss the issues that arise in the more general case when the dependency cone is given by the matrix

  D = ( 0   1
        β  -α )

for some α, β > 0. If we let γ = α/β, it is easy to verify that

  H = ( 1  0
        γ  1 )

yields a valid tile shape.
Indeed, it can be verified that this is the preferred tile shape according to most of the criteria proposed in the literature. With this choice of H, we simply extend the approach that we used for the running example, outlined in Fig. 1. First, we use H itself as a reindexing, to transform the original dependence graph. This renders all the tiles rectangular, with their sides normal to the axes. When this new domain is "shrunk" in each dimension by a factor equal to the tile size (in that dimension), we obtain the nodes of the tile graph. Thus, a point z in the original domain D is mapped to the tile σ(z) = ⌊QHz⌋, where Q is the rational diagonal matrix diag(1/s₁, ..., 1/s_n). When the non-positive extremal ray of the dependency cone is [1, -γ]ᵀ, the slope of the boundary of H(D) is γ. This leads to a γs/r slope of the tile graph boundary. As a result, our optimization

problem is exactly as for the path planning example, except that (10) now has an additional factor γ. The results can then easily be carried through.

5.1 Existence of orthogonal tiling

In this section we show the existence of an orthogonal tiling (possibly degenerate) for each instance of a do-loop nest of depth k with uniform dependencies. Let D = [d1, d2, ..., ds] be the dependence matrix of the original iteration space; because the dependencies are uniform, the columns di are lexicographically positive, i.e., di >_L 0. Let P = [p1, p2, ..., pk] be a k × k matrix whose columns pi define the edges of some orthogonal tile, i.e., pi = [0, ..., ti, ..., 0]ᵀ with the i-th component an integer ti ≥ 1. If D̄ is the tile graph dependence matrix, we call such a tiling valid if xD̄ > 0 for some x ∈ Rᵏ. The relation between the tile graph dependence matrix D̄ and the original dependence matrix D is given by the following lemma.

Lemma 5.1. The columns of D̄ are given by the vectors ⟨P⁻¹di⟩, i = 1, ..., s, where the component-wise function ⟨·⟩ is the floor or the ceiling function.

Now we can prove the following proposition.

Proposition 5.1. If D is a uniform dependence matrix, then a valid orthogonal tiling exists.

Proof. Let P be a matrix which defines an orthogonal tiling with ti = 1, i = 1, ..., k-1, and tk = t > 1. Then from Lemma 5.1 it follows that the columns of the tile graph dependence matrix D̄ are lexicographically positive. We can prove the validity of this tiling if there is x ∈ Qᵏ such that xD̄ ≥ 1. By Gordan's lemma [MC79] this system is solvable if and only if the system D̄y = 0, y ≥ 0, y ≠ 0 is unsolvable. Now, by sorting the columns of D̄ in lexicographically decreasing order, the unsolvability of the latter system becomes obvious, which proves the assertion.

For the Bitz-Kung example this is the only (up to t) possible valid orthogonal tiling, but for a do-loop nest of greater depth one could try to find the least degenerate orthogonal tiling possible.
The proof of the proposition gives some hints to this end.

Let us now define the timing function for an orthogonal tiling with s = 1 for the Bitz-Kung example. For the period P we have the same formula as before. To find the latency L we proceed as follows. Because of the dependency (1, -1), the tile T_{j+1,k} cannot start before the execution of the tile T_{j,k+1}. If the computation of tile T_{j,k} starts at time t, then at time t + 2P the tile T_{j,k+1} is computed and a message of volume r + 1 is sent to processor P_{(j+1) mod p}. This message contains the r data from the tile T_{j,k} plus the first datum of the tile T_{j,k+1}. Therefore for P and L we obtain P = 2β + 2β_c r + αrs and L = 2P + β_t(r + 1) - β, respectively.

6 Experimental results

We now present results obtained on an Intel Paragon at IRISA and on an IBM SP2 at CINES³. The main characteristics of these machines are given in Table 1 below.

³ Centre Informatique National de l'Enseignement Supérieur, BP 7229, 34184 Montpellier Cedex 4, France

                               Intel Paragon    IBM SP2
  number of processors         50               207 (thin type)
  data cache per node          128 KB           16 KB
  memory per processor         56 MB            256 MB
  max peak processor speed     100 Mflops       500 Mflops
  point-to-point bandwidth     175 MB/sec       35 MB/sec
  β_t                          0.0015 µs        0.022 µs
  β_c                          0.0015 µs        0.11 µs
  β                            40 µs            45.0 µs

Table 1: Technical characteristics of the Intel Paragon and the IBM SP2

Due to space limitations we present the results for only a typical problem instance. It was chosen to be large enough (m = 5000 and n = 50000) to have a relatively significant execution time for p = 10 (in the range of 170 sec on the Intel Paragon). This instance is representative of the behavior of our code over a range of problem instances. Each data point presented is the average of ten executions of the same instance.

Our approach in this section is top-down, from the general to the concrete. First we illustrate the theoretical and the experimental functions over the whole domain, and then we localize more and more in the neighborhood of the optimum. Figure 3 depicts the global behavior (theoretical and experimental) of our code for a large range of values of s and r on the Intel Paragon. Over the entire domain we observe good agreement between theory and experiment. The results on the IBM SP2 are very similar and are omitted for brevity (although some specific data points will be seen later in Figure 4).

Next, Figure 4 compares the theoretical vs. the experimental curves for both orthogonal and oblique tiling. These curves are presented as functions of r while s is fixed to its theoretically optimal value (s = 12 for the Intel Paragon and s = 22 for the IBM SP2). As an aside, we observed that on the SP2 the parameter α remained the same for both orthogonal and oblique tiling, but on the Intel Paragon it was different (α = 9.6 µs for the orthogonal and α = 6.6 µs for the oblique). Analyzing the experimental curves, we first observe that oblique tiling behaves better than orthogonal tiling on both machines (170 vs. 250 sec for the Paragon, and 21 vs. 26 sec
for the SP2). The second clear advantage of oblique tiling is that its curve is relatively flat over a large interval of r. We therefore expect oblique tiling to be more "robust" (i.e., less sensitive) with respect to possible errors in the optimal value of r. This is not the case for orthogonal tiling, where small fluctuations in r can significantly worsen the running time (especially to the left of the true optimum, where the hyperbolic component of the running time dominates). Another drawback of orthogonal tiling in our example is that it restricts us to degenerate tile shapes, so the message size equals the size of the tile. This may lead to sub-optimal performance, as many application programmers discover when trying to tile relaxation-type codes. It is true that, because of the value of τ_t on parallel machines like the Paragon or the SP2, the communication time is not very sensitive to the

[Figure 3 plots: time (sec) vs. r (0 to 560) for s = 2, 12, 200, 600, 800; panel (a) experimental, panel (b) theoretical.]

Figure 3: Global behavior of the experimental (a) and the theoretical (b) curves of the oblique tiling on the Intel Paragon over a large range of values of s and r

[Figure 4 plots: time (sec) vs. r; curves: Orthogonal Experimental, Orthogonal Theoretical, Oblique Experimental, Oblique Theoretical.]

Figure 4: Theoretical vs. experimental and orthogonal vs. oblique curves on the Intel Paragon (a) and on the IBM SP2 (b)

message size. However, the message size is limited by the size of the send/receive buffer, and this must be managed carefully in order to avoid limiting the problem size. In view of the above discussion, although orthogonal tiling seems attractive due to its simplicity (both ease

of automatic code generation by a compiler, and the quality of the code), it may lead to poor performance.

[Figure 5 plots: time (sec) vs. r; curves: experimental, theoretical T1, theoretical T2.]

Figure 5: Zoom for the oblique tiling: theoretical vs. experimental curves as functions of r when s is fixed at its optimum; s = 12 for the Intel Paragon (a) and s = 22 for the IBM SP2 (b).

Finally, figure 5 is a zoom of the previous picture for the case of oblique tiling. The results of the comparison between theory and experiment on both machines are summarized in Table 2, where we use the following notation:

P: Intel Paragon;
S: IBM SP2;
α: arithmetic time (in μs);
x: the point (s, r) (tile size);
T(x): theoretical time at point x (in sec);
x*_ta: approximate theoretical optimum point;
x*_te: exact theoretical optimum point;
x*_e: experimentally observed optimum point;
Δ_t: T(x*_ta) - T(x*_te);
E(x): experimental running time at point x (in sec);
Δ_e: E(x*_ta) - E(x*_e).

Let us now discuss and analyze the above results. The proposed theoretical model agrees very well with the numerical experiments. Moreover, we notice that the approximate solutions are of extremely good quality (see the Δ_t column in the "theoretical" part of Table 2). Remember that the approximate theoretical minimum T(x*_ta) is computed in Θ(1) time as described in section 4.1, while the exact minimum T(x*_te) is computed in Θ(mn) time by traversing the entire domain.
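The contrast between the Θ(1) approximate optimum and the Θ(mn) exhaustive search can be illustrated on a deliberately simplified one-dimensional cost model (our own illustration, not the paper's actual timing function): a hyperbolic term a/r plus a linear term b·r, whose continuous minimizer is r* = sqrt(a/b). The constants below are arbitrary.

```python
import math

# Illustration only: a simplified cost with a hyperbolic and a linear
# component in r (the paper's actual T(s, r) is two-dimensional and more
# involved; a, b, c here are arbitrary constants).
a, b, c = 5.0e4, 0.5, 100.0

def T(r):
    return a / r + b * r + c

# Theta(1): the continuous minimizer of a/r + b*r, rounded to an integer
# tile size -- the analogue of the approximate optimum x*_ta.
r_approx = max(1, round(math.sqrt(a / b)))

# Theta(n): exhaustive traversal of the discrete domain -- the analogue of
# the exact optimum x*_te.
r_exact = min(range(1, 1001), key=T)

# The two coincide here; even when they differ, the cost gap is tiny,
# mirroring the small Delta_t column of Table 2.
gap = T(r_approx) - T(r_exact)
```

Note also the asymmetry of such a cost: it rises steeply to the left of the optimum (where the hyperbolic term dominates) and only slowly to the right, which is why underestimating the optimal r is more costly than overestimating it.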

     theoretical                                              experimental
     α     x*_ta      T(x*_ta)   x*_te      T(x*_te)   Δ_t     x*_e       E(x*_e)   E(x*_ta)   Δ_e
P    6.6   (12,481)   166.9572   (8,493)    166.9555   0.0017  (12,230)   170.1     172.4      2.3
S    0.9   (22,478)   22.6868    (15,486)   22.6857    0.0011  (22,40)    21.02     23.5       2.5

Table 2: Comparison summary of the theoretical vs. experimental results

Note that the impact of caches on the precision of the results is not completely clarified. Indeed, one may argue that by approximating α as a constant we are ignoring cache effects, and we hypothesize that this is the reason for the theoretical-experimental gap and the fluctuations around the point x*_e in figure 5 for the SP2 (note that this machine has a smaller cache than the Paragon). Moreover, the SP2 program uses MPI communication routines, while the Paragon program uses vendor-supplied communication routines; primitives in the vendor library have lower overhead than the MPI primitives. Nevertheless, we point out that (i) cache behavior can be modeled as fluctuations of the parameter α, and we have seen that the model is robust to such fluctuations, and (ii) one can easily use a second level of tiling (loop blocking) to obtain better locality in the tile body, which ensures that α remains relatively stable.

7 Conclusions

Oblique tiling has often been regarded as undesirable, since it poses difficulties for code generation and incurs control overhead. As a result, there has been no previous effort to address the problem of optimal tile size selection for this tiling strategy. In this paper we have taken the first steps towards this problem. We have restricted ourselves to 2-dimensional iteration spaces and have also required one of the tile boundaries to remain parallel to a domain boundary. As a result, the problem of determining the tile shape has a straightforward and natural solution (though we make no claim of global optimality). Given this tile shape, we formulated and resolved the problem of determining the tile size parameters so as to minimize the running time.
We presented an approximate analytic solution, and performed a number of experiments to (i) validate the model and its predictions, (ii) justify the approximation, and (iii) illustrate that oblique tiling yields significant (and predictable) performance gains. Open problems are to extend the work to arbitrary oblique tiles and to higher dimensions.

References

[AR97] R. Andonov and S. Rajopadhye. Optimal Orthogonal Tiling of 2-D Iterations. Journal of Parallel and Distributed Computing, 45:159-165, September 1997.

[ARY98] R. Andonov, S. Rajopadhye, and N. Yanev. Optimal Orthogonal Tiling. In Euro-Par'98 Parallel Processing, Lecture Notes in Computer Science 1470, pages 480-490. Springer, 1998. Southampton, UK, September 1-4.

[BDRR94] P. Boulet, A. Darte, T. Risset, and Y. Robert. (Pen)-Ultimate Tiling? Integration, the VLSI Journal, 17:33-51, 1994.

[BK88] F. Bitz and H. T. Kung. Path Planning on the WARP Computer: Using a Linear Systolic Array in Dynamic Programming. Int. J. Computer Math., 25:173-188, 1988.

[Bo90] S. H. Bokhari. Communication Overheads on the iPSC-860 Hypercube. Technical report, NASA ICASE, 1990.

[CDR98] P-Y. Calland, J. Dongarra, and Y. Robert. Tiling with Limited Resources. Concurrency: Practice and Experience, 1998.

[ES96] E. Hodzic and W. Shang. Optimal Size and Shape of Supernode Transformations. In International Conf. on Application Specific Array Processors, ASAP'96. IEEE Computer Society Press, August 1996.

[GR88] D. C. Grunwald and D. A. Reed. Networks for Parallel Processors: Measurements and Prognostications. In Third Conference on Hypercube Concurrent Computer Applications, pages 610-619, January 1988.

[HCF97] K. Hogstedt, L. Carter, and J. Ferrante. Determining the Idle Time of a Tiling. In Principles of Programming Languages, Paris, France, January 1997. ACM.

[HKT94] S. Hiranandani, K. Kennedy, and C. Tseng. Evaluating Compiler Optimizations for Fortran D. Journal of Parallel and Distributed Computing, 21:27-45, 1994.

[IT88] F. Irigoin and R. Triolet. Supernode Partitioning. In 15th ACM Symposium on Principles of Programming Languages, pages 319-328. ACM, January 1988.

[KCN90a] C-T. King, W-H. Chou, and L. Ni. Pipelined Data-Parallel Algorithms: Part I, Concept and Modelling. IEEE Transactions on Parallel and Distributed Systems, 1(4):470-485, October 1990.

[KCN90b] C-T. King, W-H. Chou, and L. Ni. Pipelined Data-Parallel Algorithms: Part II, Design. IEEE Transactions on Parallel and Distributed Systems, 1(4):486-499, October 1990.

[MC79] M. S. Bazaraa and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley and Sons, 1979.

[PSCB94] D. Palermo, E. Su, A. Chandy, and P. Banerjee.
Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994.

[RS91] J. Ramanujam and P. Sadayappan. Tiling Multidimensional Iteration Spaces for Non-Shared-Memory Machines. In Supercomputing '91, pages 111-120, 1991.

[SD90] R. Schreiber and J. Dongarra. Automatic Blocking of Nested Loops. Technical Report 38, RIACS, NASA Ames Research Center, August 1990.

[V.93] V. Van Dongen. Loop Parallelization on Distributed Memory Machines: Problem Statement. In Proceedings of EPPP, 1993.

[Wol87] M. Wolfe. Iteration Space Tiling for Memory Hierarchies. Parallel Processing for Scientific Computing (SIAM), pages 357-361, 1987.

[ZRW92] X. Zhong, S. V. Rajopadhye, and I. Wong. Systematic Generation of Linear Allocation Functions in Systolic Array Design. Journal of VLSI Signal Processing, 4(4):279-293, November 1992.
