
ARTICLE IN PRESS

Computers & Graphics 32 (2008) 669–681



journal homepage: www.elsevier.com/locate/cag

Technical Section

Area-efficient pixel rasterization and texture coordinate interpolation

Donghyun Kim a,*, Lee-Sup Kim b

a Qualcomm Inc., USA

b Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, Republic of Korea

Article info

Article history:

Received 19 October 2007

Received in revised form 3 August 2008

Accepted 20 August 2008

Keywords:

Rasterization

Scan conversion

Texture coordinate interpolation

0097-8493/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.

doi:10.1016/j.cag.2008.08.007

* Corresponding author at: Qualcomm Inc., 5775 Morehouse Drive, San Diego, CA 92121, USA. Tel.: +1 858 805 5993; Fax: +82 42 879 9860.

E-mail address: [email protected] (D. Kim).

Abstract

In this paper, new pixel rasterization and texture coordinate interpolation algorithms are presented to reduce silicon area. The proposed pixel rasterization, based on the characteristics of the edge function, reduces the gate count by 38.9% and 35.3% compared with the previous centerline and scanline algorithms, respectively. The proposed texture coordinate interpolation combines the benefits of division and midpoint iteration to reduce silicon area without performance loss in computing the fraction part of texture coordinates, which is required for texture filtering. The proposed texture coordinate interpolation architecture uses fewer gates than an architecture using dividers; the gate count reductions are 25.2% and 37.0% for 16- and 32-bit texture coordinates, respectively. The hardware feasibility of the proposed architecture is demonstrated by its implementation in a three-dimensional (3D) graphics SoC.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Three-dimensional (3D) graphics is now commonly used in mobile and consumer devices with the widespread use of liquid crystal displays. The major applications of 3D graphics, such as computer games and graphical user interfaces, need real-time interactivity. Therefore, the typical rendering algorithm based on rasterization, which maps 3D primitives into a raster format, is widely used because it is faster than other approaches such as ray tracing or image-based rendering. A 3D scene is typically described as triangles, each represented by three vertices. Rasterizers take a stream of triangles and transform them into corresponding pixels on the viewer's monitor [1–3]. Earlier mobile 3D graphics chipsets with limited area and power integrated only the rasterizer, which is essential in the 3D graphics pipeline [4,5], and the rasterizer still accounts for a large proportion of the silicon area in mobile 3D graphics accelerators [6–8].

An important part of rasterization is how to fill in all the pixels inside a given triangle. Various algorithms have been developed for moving from pixel to pixel to cover all pixels inside the triangle, and the most popular solution is the scanline algorithm [9]. It first searches for boundary pixels along the triangle edges and then fills the internal pixels row by row between the boundary pixels. This approach is intuitive and does not traverse outside the triangle, but separate hardware logic for edge traversal and span filling is typically required [4]. Instead of scanline edge search, several algorithms that traverse pixel by pixel using edge functions have been presented in [1]. The centerline algorithm, also based on edge functions, has been implemented in real hardware with the tiling method [2,10]. The centerline algorithm allows parallel pixel generation, but it needs to keep the traversal point inside the given triangle because the traversal direction is determined by intersection tests. There are other approaches, such as zigzag rasterization and Hilbert order rasterization, that increase memory coherency for the performance of texturing and visibility tests [11], but they are not well suited to hardware forward-differencing interpolation because their rasterized pixels are not always contiguous in the horizontal or vertical direction. Forward rasterization does not sample at the pixel center, but it is efficient for small primitives [12]. The performance of texturing and visibility tests can also be improved with pre-visibility tests in the rasterizer [13–15].

Another important part is that the interpolation of texture coordinates must be followed by a division by the depth of the pixel in order to avoid perspective foreshortening problems. Since this per-pixel division makes it difficult to achieve a high pixel-fill rate at small hardware cost, various studies have tried to eliminate the division in perspective-correct texturing. Some techniques approximate hyperbolic curves with linear and quadratic interpolations [16]. A hardware architecture for quadratic interpolation that uses only adders instead of dividers has also been presented [17]. However, these studies are based on approximations with possible errors, and the errors may cause conspicuous pixel defects in some particular cases such as large polygon rendering. Other approaches use midpoint algorithms [18–20], which are well known for tracing hyperbolic functions. The midpoint algorithm computes exact integer texture coordinates for point sampling using only several additions instead of a division, but point sampling does not provide good image quality. Usual filtering methods such as trilinear filtering require the fraction part of texture coordinates as filtering coefficients. Fixed-point representations including the fraction part can be scaled such that they take integer values only, as mentioned in [20], but the number of iterations then increases greatly.

In this paper, the proposed rasterization improves on previous work with two schemes: pixel rasterization and texture coordinate interpolation [23]. The proposed pixel rasterization scheme uses the characteristics of the edge function [1] in the pixel traversal, and it reduces silicon area compared with previous schemes [1,2]. The texture coordinate interpolation proposed in our earlier work [23] combines the merits of midpoint iteration for the integer part and pipelined digit-recurrence division for the fraction part. Since the precision of the fraction part is much lower than that of the integer part, the major hardware cost gain is obtained from the integer part iteration, and the throughput is sustained by the short pipeline of the fraction part division.

Section 2 describes the proposed pixel rasterization scheme, and Section 3 reviews the proposed texture coordinate interpolation [23]. After the hardware architecture and implementation of the two proposed techniques are described in Section 4, the analysis and discussion of the proposed pixel rasterization and texture coordinate interpolation are presented in Section 5. Section 6 summarizes our work.

2. Pixel rasterization

2.1. Previous pixel rasterization algorithms

The intuitive scanline traversal and its variants have been used in several hardware architectures [4,7]. A graphics engine [7] has been developed for mobile applications, and it has separate hardware logic for edge traversal and span filling. A hardware rasterizer [4] also has two separate edge-traversal blocks and a span-filling block. There is hardware redundancy in the edge-traversal blocks because their throughputs differ from that of the span-filling block. In addition, it is hard to generate multiple pixels in parallel with any degree of efficiency. Various algorithms have been developed to remove the hardware redundancy by traversing pixel by pixel. These algorithms are based on the edge function, which classifies a point in the plane as falling into one of three regions: the region to the left of the line direction, the region to the right of the line direction, or the line itself [1,3]. Let x and y denote the horizontal and vertical screen coordinates, respectively. The edge function E(x, y) is defined as follows:

$$E(x, y) = (x - x_0)\,\Delta y - (y - y_0)\,\Delta x, \qquad \Delta x = x_1 - x_0,\ \Delta y = y_1 - y_0 \tag{1}$$

Fig. 1. Conventional traversal algorithms.

There is a relationship between the sign of the edge function and the position of the point:

E(x, y) > 0 if (x, y) is to the right side of the edge direction

E(x, y) = 0 if (x, y) is on the edge

E(x, y) < 0 if (x, y) is to the left side of the edge direction

Assuming a triangle that consists of three edges in the clockwise direction, pixel validity, which indicates whether a pixel is inside the triangle, is determined by checking whether all the edge functions are positive. However, as the edge function guarantees only pixel validity, additional traversal algorithms are required to traverse all the pixels of an object. Several traversal algorithms based on the edge function have been presented, as depicted in Fig. 1. The centerline algorithm shows the smartest traversal among them because it avoids unnecessary traversal in the bounding box and repeated pixel generation. The centerline algorithm was implemented in a silicon chip, PixelVision [10].
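The edge function (1) and the pixel validity test can be illustrated with a minimal software sketch. It is not the authors' hardware; the function names are hypothetical, y is assumed to increase upward (so that a positive Δy means an upward edge), and on-edge points are counted as valid because the tie-breaking rule for shared edges is left out. The two step helpers show the incremental updates that follow directly from (1) when x or y changes by one.

```python
def edge_function(v0, v1, x, y):
    """E(x, y) of Eq. (1) for the directed edge v0 -> v1."""
    (x0, y0), (x1, y1) = v0, v1
    return (x - x0) * (y1 - y0) - (y - y0) * (x1 - x0)


def is_inside(tri, x, y):
    """Pixel validity for a clockwise triangle (assumption: on-edge counts as inside)."""
    return all(
        edge_function(tri[i], tri[(i + 1) % 3], x, y) >= 0 for i in range(3)
    )


def step_right(e, v0, v1):
    """E(x + 1, y) = E(x, y) + dy, which follows from Eq. (1)."""
    return e + (v1[1] - v0[1])


def step_up(e, v0, v1):
    """E(x, y + 1) = E(x, y) - dx, which follows from Eq. (1)."""
    return e - (v1[0] - v0[0])
```

Only one addition per edge is needed for each unit move, which is why adder arrays are sufficient to keep the edge function values up to date during traversal.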

The centerline algorithm is well described in [2]. Before deciding on the pixel traversal direction, the rasterizer must check which directions are possible. The centerline algorithm checks the intersection of the boundary of a pixel rectangle (or a pixel stamp rectangle) with the triangle edges. For the intersection test, there are four probe points for the edge functions at the four corners of the pixel. PixelVision samples the pixel value at the top-left corner, and hence three additional probe points are used. For commodity hardware, four additional probe points are required since OpenGL semantics stipulate pixel center sampling. Computing the intersection of a pixel boundary segment with the object consists of two steps. The first step determines whether one or both probe points at the ends of each pixel boundary segment are inside each edge segment. The second step tests whether the pixel boundary segment is inside the object bounding box. If a segment passes both tests, then it probably intersects the object.

The rasterization algorithm of PixelVision starts the traversal from the top-most vertex. Before the traversal moves down to the next horizontal line, the centerline algorithm sweeps all the pixels of the object on the current line. When the traversal starts on a horizontal line, the centerline algorithm checks whether the position to the right of the current pixel is valid. If it is valid, the stamp context, such as edge function values, color, and depth, is stored in an additional context register named RightSave. After that, the centerline algorithm traverses to the left. After all the pixels on the left side of the line are traversed, the pixel context in RightSave is restored to traverse to the right. While the traversal sweeps left or right, the algorithm also looks for valid down positions. The first valid down position is saved into an additional context, DownSave. After all the pixels in the line are swept, the algorithm restores DownSave and moves the traversal down. If a valid DownSave is not found during the horizontal sweep, the rasterization of the object is finished. Fig. 2 shows an example of the pixel generation order and saved contexts in rendering a triangle.

Fig. 2. The centerline algorithm. Octagons are DownSave positions, and circles are RightSave positions.

This algorithm requires two additional pixel contexts, DownSave and RightSave. A context includes not only edge function values but also all of the pixel data such as color, depth, and texture coordinates. Since this is a large amount of data, the number of contexts influences the silicon area of the rasterizer. Because an additional context save point is required per level of tiling, there exists a trade-off between silicon area and memory efficiency [2].

2.2. Proposed movement decision

The centerline algorithm used in PixelVision determines the rasterization direction by checking the validity of the adjacent pixels with the intersection test [2]. It cannot find the movement direction if a pixel is fully outside the object. This property demands two context save points, DownSave and RightSave, to keep the current pixel inside the object. The proposed method determines the movement direction of a pixel from the edge function characteristics alone, regardless of the pixel position.

An edge function of a point describes the position of the point relative to the edge. To determine whether a pixel is to the left or right of an edge in screen coordinates, it is necessary not only to compute the sign of the edge function value but also to examine the direction of the edge. We can distinguish the edge direction by examining the sign of the vertical difference $\Delta y$. If $\Delta y$ is positive, the edge points up; otherwise, the edge points down. Therefore, if a pixel is outside an edge, the horizontal direction in which to move to meet the edge is decided simply by examining the sign of $\Delta y$. Since $\Delta y$ is already required to update the edge function values for x-movement, the sign of $\Delta y$ is easily obtained. For a pixel P(x, y) and an edge with edge function $E_0$ and vertical difference $\Delta y_0$, a 2-bit traverse-code $T_0$ is introduced and defined as follows:

$$T_0 = 00 \quad \text{if } E_0(x, y) \ge 0 \tag{2}$$

$$T_0 = 01 \quad \text{if } E_0(x, y) < 0 \text{ and } \Delta y_0 \ge 0 \tag{3}$$

$$T_0 = 10 \quad \text{if } E_0(x, y) < 0 \text{ and } \Delta y_0 < 0 \tag{4}$$

Traverse-code 11 is undefined for a single edge. The traverse-code indicates the location of the pixel relative to the edge. Traverse-code 00 means that the pixel is inside the edge, that is, on the edge or in the half-plane on the clockwise side of the edge direction. If the traverse-code is 01, the pixel is outside the edge and the edge is located on the right side of the pixel. Traverse-code 10 implies that the pixel is outside the edge and the edge is located on the left side of the pixel. The traversal direction should be right when the traverse-code is 01 and left when the traverse-code is 10. When an edge is horizontal and the current traversal position is outside the edge, the traverse-code is defined as 01 even though the edge is not located on the right side of the pixel. In this case, the edge is never reached by moving right, but there will be a movement in the up direction caused by another edge of the triangle. When a pixel lies on the shared edge of two triangles, it should be drawn by only one triangle. The shared edge of two adjacent triangles has opposite directions in the two triangles according to the OpenGL specification, and the pixel is drawn according to the edge direction. In the pixel traversal, all the pixels on edges are visited once regardless of the edge direction, and they are filtered out according to the edge directions.

Now let us consider a triangle with three edges. Fig. 3 shows a triangle with three edges $E_0$, $E_1$, and $E_2$. Traverse-codes $T_1$ and $T_2$ are defined similarly for the other two edges $E_1$ and $E_2$. Let us define a 2-bit traverse-code $T_{tri}$ for a triangle as the bitwise OR of $T_0$, $T_1$, and $T_2$. This is a very simple computation, but the traverse-code $T_{tri}$ directly gives the traversal direction for a pixel. If $T_{tri}$ is 00, the pixel is inside the triangle (Region 0). If $T_{tri}$ is 10, the traversal must move left (Region 2). If $T_{tri}$ is 01, the traversal must move right to meet the triangle (Regions 1, 3, and 6). $T_{tri}$ = 11 implies that the traversal must move vertically to meet the triangle (Regions 4 and 5). With this characteristic, the pixel movement can be determined from anywhere on the screen. This useful characteristic is not available in the approach based on the intersection of object edges and stamp edge segments.
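The traverse-code definitions (2)–(4) and the bitwise OR that forms the triangle code can be sketched in a few lines. This is an illustrative software model under the assumption that y increases upward and the triangle is given in clockwise order; the names `traverse_code`, `triangle_code`, and `DIRECTION` are hypothetical and do not come from the paper.

```python
def traverse_code(v0, v1, x, y):
    """2-bit traverse-code of one directed edge v0 -> v1 at pixel (x, y)."""
    (x0, y0), (x1, y1) = v0, v1
    dx, dy = x1 - x0, y1 - y0
    e = (x - x0) * dy - (y - y0) * dx      # edge function, Eq. (1)
    if e >= 0:
        return 0b00                        # Eq. (2): inside or on the edge
    return 0b01 if dy >= 0 else 0b10       # Eqs. (3) and (4)


def triangle_code(tri, x, y):
    """Bitwise OR of the three per-edge traverse-codes."""
    code = 0
    for i in range(3):
        code |= traverse_code(tri[i], tri[(i + 1) % 3], x, y)
    return code


# Interpretation of the triangle code as a movement decision.
DIRECTION = {0b00: "inside", 0b01: "right", 0b10: "left", 0b11: "vertical"}
```

For the clockwise triangle (0, 0), (0, 4), (4, 0), a pixel at (-2, 1) yields code 01 (move right), a pixel at (6, 1) yields 10 (move left), and a pixel at (-1, 6) yields 11, since it is outside both an upward and a downward edge.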

2.3. Proposed pixel rasterization

The centerline algorithm uses additional probe points of edge functions at the four corners of the pixel for the intersection test. This requires additional sets of adders and increases the amount of data in a pixel context. The proposed traversal algorithm uses only the probe point at the center of the pixel rectangle, which is required for the pixel validity test. Fig. 4 shows the probe point positions for OpenGL semantics.

The proposed rasterization uses the direction characteristics of the edge function instead of intersection tests. It is similar to the previous centerline algorithm, but it requires only one context save point, RightSave, while the centerline algorithm requires two context save points, RightSave and DownSave. The centerline algorithm looks for the first valid down point for DownSave during horizontal sweeping, but this is not necessary in the proposed algorithm because it proceeds down to the next stamp line immediately after it finishes sweeping one stamp line horizontally.

Fig. 3. L and R for three edges by divided regions.

Fig. 4. (a) Additional probe points (diamonds) are required in the centerline algorithm. (b) The proposed algorithm removes redundant probe points at the corners.

We now describe the proposed pixel rasterization algorithm. This algorithm always starts at the top-most vertex and sweeps out an entire horizontal pixel line before moving down to the next pixel line. First, if the start pixel of the horizontal traversal is valid with $T_{tri}$ = 00, which means that the pixel is inside the triangle, there are probably other pixels to the left and right of the pixel. Therefore, both sides must be swept. To sweep both sides, the context of the current position is saved into RightSave for a later reload, and the traversal goes to the left until there are no valid pixels. After that, the pixel context is restored from RightSave, and the traversal sweeps the remaining right-side area of the pixel line. If the horizontal traversal starts outside the triangle, the direction of traversal is decided by the triangle traverse-code $T_{tri}$. If $T_{tri}$ is 10, the traversal goes left until $T_{tri}$ becomes 01 or 11. If $T_{tri}$ is 01, it goes right until $T_{tri}$ becomes 10 or 11. If the traversal starts where $T_{tri}$ is 11, it finishes this horizontal line. After sweeping one horizontal pixel line, the traversal goes down directly and repeats this process until there are no horizontal lines to draw. Fig. 5 depicts the cases of the start of pixel line traversal.
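The traversal described above can be modelled in software as follows. This is a rough sketch rather than the authors' FSM: it assumes y increases upward (so "down" is y − 1), pixel centers at integer coordinates, a clockwise triangle, and on-edge pixels counted as valid. The bounding-box clamp and the names `triangle_code`, `rasterize`, `sweep`, and `right_save` are assumptions added so that the loop always terminates.

```python
import math


def triangle_code(tri, x, y):
    """Bitwise OR of the per-edge traverse-codes, Eqs. (2)-(4)."""
    code = 0
    for i in range(3):
        (x0, y0), (x1, y1) = tri[i], tri[(i + 1) % 3]
        dx, dy = x1 - x0, y1 - y0
        if (x - x0) * dy - (y - y0) * dx < 0:
            code |= 0b01 if dy >= 0 else 0b10
    return code


def rasterize(tri):
    """Return the pixels of a clockwise triangle in traversal order."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x_lo, x_hi = math.floor(min(xs)), math.ceil(max(xs))  # safeguard (assumption)
    y_end = math.ceil(min(ys))
    top = max(tri, key=lambda v: v[1])                    # top-most vertex
    x, y = round(top[0]), math.floor(top[1])
    pixels = []

    def sweep(x, step):
        # Emit pixels while the triangle code stays 00; return the stop position.
        while x_lo <= x <= x_hi and triangle_code(tri, x, y) == 0b00:
            pixels.append((x, y))
            x += step
        return x

    while y >= y_end:
        x = min(max(x, x_lo), x_hi)          # keep the start inside the bounding box
        code = triangle_code(tri, x, y)
        if code == 0b00:
            right_save = x                   # save the context (RightSave)
            sweep(x, -1)                     # sweep left, including the start pixel
            x = sweep(right_save + 1, +1)    # restore and sweep right
        elif code != 0b11:
            step = 1 if code == 0b01 else -1
            while x_lo <= x <= x_hi and triangle_code(tri, x, y) == code:
                x += step                    # walk toward the triangle
            x = sweep(x, step)               # emit pixels if the triangle was reached
        # code 11: nothing to draw on this line
        y -= 1                               # go down directly
    return pixels
```

Only the single saved position `right_save` is needed per line, which mirrors the point that no DownSave context is required when the movement direction is available from any position.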

3. Division-free texture coordinate interpolation

3.1. Perspective correct texture mapping

Texture mapping maps an image, called a texture map, onto a surface to obtain a realistic look with simple geometry instead of precise geometry. The transformation of texture coordinates consists of two steps. The first is a transform from the 2D texture space to the homogeneous object space, and the second is a transform from the homogeneous object space to the 2D screen space [21]. For triangles, the homogeneous object coordinates and the texture coordinates are connected by a homogeneous linear transformation, and the homogeneous object coordinates and the 2D screen coordinates are related hyperbolically. The relation among the screen coordinates (x, y), the homogeneous object coordinates (x′, y′, w′), and the texture coordinates (u, v) is given by

$$(x, y) = \left(\frac{x'}{w'}, \frac{y'}{w'}\right), \qquad \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} K & L & M \\ N & P & Q \\ R & S & T \end{bmatrix} \begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} \tag{5}$$

Therefore, the texture coordinates (u, v) can be expressed in terms of the screen coordinates (x, y) as

$$(u, v) = \left(\frac{Kx + Ly + M}{Rx + Sy + T}, \frac{Nx + Py + Q}{Rx + Sy + T}\right) \tag{6}$$

As shown in (6), two divisions, or one reciprocal and two multiplications, are required to calculate the exact texture coordinates of one pixel.
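As a reference point for the division-free schemes that follow, relation (6) can be evaluated directly with exact rational arithmetic. The sketch below is only illustrative; the function name `texture_coords` and the coefficient values in the usage note are made up, and integer coefficients and coordinates are assumed.

```python
from fractions import Fraction


def texture_coords(coeffs, x, y):
    """Evaluate Eq. (6) exactly with one reciprocal and two multiplications."""
    (K, L, M), (N, P, Q), (R, S, T) = coeffs
    w = R * x + S * y + T              # shared denominator of u and v
    inv = Fraction(1, w)               # the single reciprocal
    return (K * x + L * y + M) * inv, (N * x + P * y + Q) * inv
```

With the hypothetical coefficients ((3, 0, 5), (0, 2, 1), (1, 0, 10)), the pixel (3, 2) maps to (14/13, 5/13). This per-pixel reciprocal is the operation that the midpoint iteration and the digit-by-digit fraction evaluation aim to avoid.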

3.2. Previous midpoint algorithm

The per-pixel division can be avoided by using the midpoint algorithm in perspective texture mapping, as proposed in [20]. The key idea of [20] is based on three characteristics of 3D graphics texture mapping. First, it is not necessary to know the exact value to infinite precision; an error smaller than one unit in the last position (ulp) is allowed for finite precision. Second, the order of pixel generation in a rasterizer is not arbitrary but sequential. Third, derivatives such as ∂u/∂x are less than two if the mipmap level does not change, where a mipmap is a pyramid of preprocessed textures [20,22].

Fig. 5. Four cases of the start points in horizontal traversals.

Fig. 6. Integer points in the acceptable region for a hyperbolic curve.

With these three characteristics, repeated additions enable us to trace hyperbolic curves within a given precision, as shown in Fig. 6. The ulp of the pixel coordinate x and of the texture coordinate u = f(x) is 1 in the example of Fig. 6. Since the exact value $f(x_i)$ for a given $x_i$ is quantized to the nearest value of limited precision, the nearest quantized value $u_i$ is acceptable for $f(x_i)$. The difference between $u_i$ and $f(x_i)$ must not exceed half an ulp, which is 0.5 in this case. This is formulated as follows:

$$\frac{Kx_i + Ly_i + M}{Rx_i + Sy_i + T} - 0.5 < u_i \le \frac{Kx_i + Ly_i + M}{Rx_i + Sy_i + T} + 0.5 \tag{7}$$

Two variables d and E are introduced to simplify the expression:

$$d(x, y) = 2(Rx + Sy + T) \tag{8}$$

$$E(x, y, u) = u\,d(x, y) - (R + 2K)x - (S + 2L)y - (T + 2M) \tag{9}$$

Assuming that d is positive, (7) simplifies to the following inequality by using (8) and (9):

$$-d(x_i, y_i) < E(x_i, y_i, u_i) \le 0 \tag{10}$$


Fig. 7. The acceptable region is changed by truncation.


A new variable A is introduced for simplicity. When x increases by 1, d and E change as follows:

$$A(u) = (2u - 1)R - 2K \tag{11}$$

$$d(x + 1, y) = d(x, y) + 2R \tag{12}$$

$$E(x + 1, y, u) = E(x, y, u) + A(u) \tag{13}$$

When u increases by 1, A and E can be updated linearly as follows:

$$E(x, y, u + 1) = E(x, y, u) + d(x, y) \tag{14}$$

$$A(u + 1) = A(u) + 2R \tag{15}$$

Therefore, when E and d change because x or y increases or decreases, E can be restored to satisfy inequality (10) by adjusting u with the iterative additions or subtractions in (14) and (15). Similar operations apply to movement in the y-direction; a variable B for y can be defined similarly to A.
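The update rules (11)–(15) can be checked with a small software sketch that traces the integer texture coordinate along one scanline using additions only. It is an illustration under stated assumptions, not the implementation of [20]: integer coefficients, a positive d over the traced range, and a start value obtained by one direct evaluation (a setup step assumed here). The function name `trace_integer_u` is hypothetical.

```python
def trace_integer_u(K, L, M, R, S, T, y, x_start, count):
    """Trace u_i satisfying inequality (10) for count pixels along +x."""
    x = x_start
    den = R * x + S * y + T
    num = K * x + L * y + M
    u = (2 * num + den) // (2 * den)     # setup: nearest value per Eq. (7)
    d = 2 * den                                                # Eq. (8)
    E = u * d - (R + 2 * K) * x - (S + 2 * L) * y - (T + 2 * M)  # Eq. (9)
    A = (2 * u - 1) * R - 2 * K                                # Eq. (11)
    result = [u]
    for _ in range(count - 1):
        E += A                           # Eq. (13): x -> x + 1
        d += 2 * R                       # Eq. (12)
        x += 1
        while E > 0:                     # u too large: step u down
            u -= 1
            E -= d                       # Eq. (14), reversed
            A -= 2 * R                   # Eq. (15), reversed
        while E <= -d:                   # u too small: step u up
            u += 1
            E += d                       # Eq. (14)
            A += 2 * R                   # Eq. (15)
        result.append(u)
    return result
```

For example, with the hypothetical coefficients K = 3, M = 5, R = 1, T = 10 and L = S = 0, the traced values agree with direct rounding of (3x + 5)/(x + 10) at every pixel, while the loop body uses only additions, subtractions, and comparisons.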

Restoring E to the range of (10) incurs multiple iterative additions of (14) and (15), depending on the partial derivatives of the hyperbolic curve. In general, this iteration of additions or subtractions has no upper bound, but the characteristics of the mipmap [22] limit derivatives such as ∂u/∂x to two, as mentioned previously. However, texture filtering methods such as trilinear filtering are usually used to increase the image quality of texturing, and they require the fraction part of the texture coordinate. In these cases, the ulp of u is much less than 1. For m-bit precision of the fraction part, the upper bound on the iterations of (14) and (15) becomes $2^{m+1}$. If 4 bits are used to represent the fraction of u, 32 iterations must be performed in the worst case. As with division, a large number of iterations results in a long latency in clock cycles or a large hardware cost for parallel additions of (14) and (15).

3.3. Proposed fraction part evaluation with midpoint algorithm

To solve this problem of the midpoint algorithm in texture filtering, we combine the benefits of midpoint iteration and division. The midpoint iteration method requires little hardware for the addition of several variables, but its iteration bound depends on the precision of the fraction part. In contrast, most dividers provide the same throughput regardless of their precision, but the hardware size grows as the pipeline becomes deeper. We observe that the precision of the fraction part is much lower than that of the integer part. Therefore, we obtain the major area gain from midpoint iteration on the integer part and remove the performance loss on the fraction part by using digit-recurrence division.

The proposed algorithm [23] separates the integer part iteration from the fraction part evaluation so that there is no dependency between the fraction part computations of the previous pixel $x_{i-1}$ and the current pixel $x_i$. The derivation is as follows.

We first move the acceptable region from rounding to truncation, as shown in Fig. 7. The maximum error increases from 1/2 ulp to 1 ulp, but truncation simplifies the computation of the fraction part in our method. In addition, the difference is negligible when the fraction part has sufficient precision. For nearest filtering, which chooses the nearest texel, only a 1-bit fraction has to be calculated to indicate which texel is the closest.

The following inequality formulates the acceptable region shown in Fig. 7 for m-bit precision of the texture coordinate fraction part:

$$\frac{Kx_i + Ly_i + M}{Rx_i + Sy_i + T} - \frac{1}{2^m} < u_i \le \frac{Kx_i + Ly_i + M}{Rx_i + Sy_i + T} \tag{16}$$

We introduce $E'$ and $A'$ in place of E and A in (9) and (11) so that $E'$ is independent of m, as follows. Note that the second and third relations in (12)–(14) remain valid with $E'$ and $A'$ substituted for E and A:

$$E'(x, y, u) = u\,d(x, y) - 2(Kx + Ly + M) \tag{17}$$

$$A'(u) = 2uR - 2K \tag{18}$$

Therefore, substituting $E'$ into (16) yields the following relation:

$$-\frac{d(x_i, y_i)}{2^m} < E'(x_i, y_i, u_i) \le 0 \tag{19}$$

Let the sequence k represent the binary digits of the fraction part of u, as shown in the following relation, and let $\lfloor u_i \rfloor$ denote the largest integer equal to or less than $u_i$:

$$u_i = \lfloor u_i \rfloor + \sum_{j=1}^{m} \frac{k_j}{2^j}, \qquad k_j \in \{0, 1\} \tag{20}$$

Naturally, $\lfloor u_i \rfloor$ satisfies (19) in the case of m = 0:

$$-d(x_i, y_i) < E'(x_i, y_i, \lfloor u_i \rfloor) \le 0 \tag{21}$$

Notice that (21) and (10) have the same form, so $\lfloor u_i \rfloor$ is evaluated in the same way as described in Section 3.2.

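We now show how the fraction part, the digit sequence k, is evaluated. Let $u|_n$ denote the largest number with n fraction bits that is equal to or less than u. Then the following relations follow easily:

$$u_i|_{n+1} = u_i|_n + \frac{k_{n+1}}{2^{n+1}} = \lfloor u_i \rfloor + \sum_{j=1}^{n+1} \frac{k_j}{2^j} \tag{22}$$

$$u_i|_0 = \lfloor u_i \rfloor \tag{23}$$

$$u_i|_m = u_i \tag{24}$$

$$E'(x_i, y_i, u_i|_{n+1}) = E'(x_i, y_i, u_i|_n) + \frac{k_{n+1}}{2^{n+1}}\,d(x_i, y_i) \tag{25}$$

$$-\frac{d(x_i, y_i)}{2^n} < E'(x_i, y_i, u_i|_n) \le 0 \tag{26}$$

The rule for evaluating the sequence k follows from (22)–(26):

$$-\frac{d(x_i, y_i)}{2^{n+1}} < E'(x_i, y_i, u_i|_n) \le 0 \;\Rightarrow\; k_{n+1} = 0 \tag{27}$$

$$-\frac{d(x_i, y_i)}{2^n} < E'(x_i, y_i, u_i|_n) \le -\frac{d(x_i, y_i)}{2^{n+1}} \;\Rightarrow\; k_{n+1} = 1 \tag{28}$$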
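The digit rule (27)–(28) can be sketched in software as a short loop that produces one fraction bit per step. To stay in integer arithmetic, the sketch doubles $E'$ at each step instead of halving d; this rescaling is an assumption made here for convenience and is algebraically equivalent to comparing $E'$ against $-d/2^{n+1}$. The function name `fraction_bits` is hypothetical, integer coefficients and a positive d are assumed, and `u_int` is expected to satisfy (21).

```python
def fraction_bits(K, L, M, R, S, T, x, y, u_int, m):
    """Return the fraction bits [k_1, ..., k_m] of the texture coordinate u."""
    d = 2 * (R * x + S * y + T)                   # Eq. (8)
    e = u_int * d - 2 * (K * x + L * y + M)       # Eq. (17) at floor(u), in (-d, 0]
    bits = []
    for _ in range(m):
        e *= 2                    # rescale so the threshold is always -d
        if e <= -d:               # Eq. (28): digit is 1
            bits.append(1)
            e += d                # Eq. (25) in rescaled form
        else:                     # Eq. (27): digit is 0
            bits.append(0)
    return bits
```

Each step depends only on the current pixel's $E'$ and d, so m such steps can be laid out as m pipeline stages. With the hypothetical coefficients K = 3, M = 5, R = 1, T = 10 and L = S = 0, the pixel x = 3 has f = 14/13, integer part 1, and the four fraction bits 0, 0, 0, 1.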

Fig. 8 depicts how the sequence k is obtained from $k_1$ to $k_m$ step by step. The number of these sequential evaluations is m for an m-bit fraction part of the texture coordinates, while the iteration bound of the previous midpoint algorithm is $2^{m+1}$. More importantly, these sequential evaluations do not depend on the computation of previous texture coordinates. For hardware implementation, the fraction part evaluation can be pipelined and performed simultaneously with the integer part iterations. The hardware resource needed to evaluate 1 bit of the fraction part is one adder for evaluating (28) and updating $E'$.

Fig. 8. Finding $k_1$ to $k_m$ sequentially. (a) Integer level: evaluation of $u_{i+1}|_0$ from $u_i|_0$. (b) To find $k_1$: evaluation of $u_{i+1}|_1$ from $u_{i+1}|_0$. (c) To find $k_2$: evaluation of $u_{i+1}|_2$ from $u_{i+1}|_1$. (d) To find $k_3$: evaluation of $u_{i+1}|_3$ from $u_{i+1}|_2$.

Fig. 9. State transition diagram for the proposed traversal algorithm.

Table 1. State description of FSM

State name   Description

IDLE         Idle state ready to rasterize a triangle.

START        Renders a first pixel (stamp). Save context into RightSave.

LEFT         Indicates traversal to left side.

RIGHT        Indicates traversal to right side.

DOWN         Indicates traversal to down side for the next pixel line. Save context into RightSave.

LEFT2        Indicates traversal to left side. After this state, the traversal restores RightSave and traverses right side.

RIGHT2       Restores RightSave.

4. Hardware architecture and implementation

4.1. Hardware architecture

The rasterizer architecture presented in this paper receives triangle data from the triangle setup engine [6,24] and then generates pixels with texture coordinates. The pixel rasterization proposed in Section 2 is implemented with only a simple FSM. The state transition diagram is illustrated in Fig. 9. There is only one major 2-bit input, the triangle traverse-code $T_{tri}$, and two control inputs, triangle_start and triangle_finish. Implementation-level control signals such as stall signals are omitted. Table 1 describes each state.

The proposed texture coordinate interpolation is adopted in a scalable pixel pipeline, named the Basic Rasterization Unit (BRU), as depicted in Fig. 10, which shows the data flow for only x and u for simplicity. The dimension of the screen and texture coordinates is scalable. For example, there are screen coordinates (x, y) and texture coordinates ($u_0$, $v_0$, $u_1$, $v_1$) for two-level multi-texturing in our SoC implementation [6]. Therefore, the internal variables ($E'$, $A'$, $B'$) are also extended into four variable sets: ($E'_{u0}$, $A'_{u0}$, $B'_{u0}$), ($E'_{v0}$, $A'_{v0}$, $B'_{v0}$), ($E'_{u1}$, $A'_{u1}$, $B'_{u1}$), and ($E'_{v1}$, $A'_{v1}$, $B'_{v1}$).

Fig. 10. The block diagram of BRU for 1D screen coordinate x and 1D texture coordinate u.

Fig. 11. The Integer Interpolator. The dashed rounding boxes imply the additional hardware for the 2×2 pixel stamp.

The Context Register contains the variables of the current pixel, such as position, color, depth, the integer part of the texture coordinates, and the internal variables defined in Section 3.3. There are two register sets, one for the current pixel context and one for RightSave. The Pixel Interpolator consists of adder arrays that interpolate the pixel contexts according to the rasterization traversal. It receives only two control signals indicating the traversal direction. In our implementation, Ef0, Ef1, and Ef2, the 32-bit edge function values of the three edges, are updated for every pixel to determine whether the pixel is inside the triangle. The 26-bit depth Z, the 10-bit fog factor f, and the 4×18-bit color channels (R, G, B, A) are also interpolated. The internal variables in (11)–(13) are updated in the same architecture. The pixel rasterizer, which consists of the Context Register and the Pixel Interpolator described above, produces all the attributes of one pixel per clock cycle.

The texture coordinate interpolator consists of the other parts of the BRU: the Integer Interpolator, the Mipmap Switching Detector, the Mipmap Switch, and at least one Fraction Evaluator. The Integer Interpolator finds the integer part of the texture coordinates by checking (10) and by updating (11)–(13). For 2D two-level multi-texture coordinates (u1, v1) and (u2, v2), four Integer Interpolator blocks are required in a BRU. Fig. 11 shows the architecture of the Integer Interpolator. The interpolator receives the negated value of E′ and determines whether E′ is positive or not. The sign signal, the MSB of −E′, drives the subtraction control ports of all adders. If E′ is positive, −E′ is negative, and the MSB of −E′ is high. Two parallel adders evaluate −E′+d and −E′+2d in the case of the single-pixel stamp, but Δx is two in the case of a 2×2 pixel stamp. Therefore, the iteration bound is doubled from 2 to 4, and four parallel adders are required to evaluate −E′+d, −E′+2d, −E′+3d, and −E′+4d.

Fig. 12. Cascading the Fraction Evaluators.

If E′ is negative, −E′ is positive, and the MSB of −E′ is low. In the case of a 2×2 pixel stamp, the four adders evaluate −E′−d, −E′−2d, −E′−3d, and −E′−4d. The Decision Logic examines the signs of the adder results and of the bypassed −E′, and selects the result that satisfies condition (10). It also generates the control signals of the MUX, which finally selects the add-term for the texture coordinate u. The internal variables A′ and B′ are updated in the same way.
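The selection performed by the parallel adders and the Decision Logic can be modelled roughly as follows. This sketch works on E′ directly rather than on −E′ as the hardware does, evaluates the candidates in a loop instead of in parallel, and uses the hypothetical name `select_restored`; it only illustrates which candidate is chosen, under the assumption that d is positive.

```python
def select_restored(e, d, max_steps=4):
    """Pick j with -d < e + j*d <= 0 and |j| <= max_steps.

    Returns (j, e + j*d); j is also the amount to add to the integer part of u,
    since E'(u + 1) = E'(u) + d by Eq. (14).
    """
    sign = -1 if e > 0 else 1              # positive E' is lowered, negative raised
    for j in range(max_steps + 1):         # j = 0 is the bypassed value
        candidate = e + sign * j * d
        if -d < candidate <= 0:            # acceptable range of Eq. (10)/(21)
            return sign * j, candidate
    raise ValueError("no candidate within the iteration bound")
```

For instance, with d = 20, an input of −50 is restored to −10 by adding 2d (so u increases by 2), and an input of 35 is restored to −5 by subtracting 2d (so u decreases by 2). A bound of 4 corresponds to the 2×2 stamp case, and a bound of 2 to the single-pixel stamp.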

The Mipmap Switching Detector determines whether the current mipmap level must be changed. When the mipmap level changes, the outputs of the Pixel Interpolator and the Integer Interpolator are not written to the Context Register; instead, the outputs of the Mipmap Switch are written. One cycle is lost when the mipmap level changes, but mipmap-level changes do not occur frequently. The Fraction Evaluator produces the fraction part of the texture coordinate by (28). A single Fraction Evaluator produces only one fraction bit, but pipelining multiple Fraction Evaluators yields multiple fraction bits per cycle, as shown in Fig. 12.

4.2. Hardware implementation

The proposed rasterization algorithms were implemented in a real 3D graphics system to test the feasibility of the proposed architecture [6]. The developed SoC integrates a RISC processor, a 3D graphics IP, and other peripheral blocks. The 2×2 rasterizer in the 3D graphics IP evaluates texture coordinates by the proposed method. All the internal variables related to texture coordinates are represented in 16-bit fixed-point representation. The developed rasterizer, including the pre-depth test block, consists of 290 k gates. It runs at 166 MHz and hence delivers 666 M pixels and 1.3 G texture coordinates per second. Fig. 13 shows the test board of the chip successfully rendering real-time images on the LCD.
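The quoted throughput figures follow from the clock rate and the stamp size. The sketch below reproduces the arithmetic, assuming four pixels per cycle for the 2×2 stamp and two texture coordinate sets per pixel (the two-level multi-texture case mentioned for the Integer Interpolator); the function name is hypothetical.

```python
def throughput_per_second(clock_hz: float, pixels_per_cycle: int, texture_sets: int):
    """Return (pixels per second, texture coordinate sets per second)."""
    pixels = clock_hz * pixels_per_cycle
    texture_coordinates = pixels * texture_sets
    return pixels, texture_coordinates


# 2x2 stamp (4 pixels per cycle), two texture coordinate sets per pixel.
pixels, texture_coordinates = throughput_per_second(166e6, 4, 2)
```

With exactly 166 MHz this gives 664 M pixels and about 1.33 G texture coordinates per second, so the quoted 666 M pixels is probably based on a clock slightly above the rounded 166 MHz; that reading is an inference, not something stated in the text.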

5. Analysis

5.1. Analysis setup

There are many kinds of rasterizer architectures according to target applications, and performance and silicon area vary widely. The proposed pixel rasterizer is compared with rasterizers based on bounding box, span filling, and centerline methods in terms of gate counts in Section 5.2. Each rasterizer for the test is implemented in Verilog-HDL and synthesized for a 166 MHz clock frequency in 0.13 μm CMOS technology. It is assumed that each edge function is 32-bit, each color channel is 18-bit, depth is 26-bit, and the fog factor is 10-bit. All the rasterizers are designed to traverse one pixel per clock cycle at peak performance, and the critical-path delays of all the rasterizers are almost the same. Therefore, the performance of each rasterizer depends entirely on pixel traversal efficiency, which is the ratio of valid pixels inside a given triangle to the traversed pixels. The pixel traversal efficiency is compared in Section 5.3.

The proposed texture interpolator is compared with the architecture using pipelined dividers in terms of gate counts in Section 5.4, and the performance comparison with general midpoint iteration in the calculation of the fractional part is described in Section 5.5.

Two kinds of pipelined dividers are used for the comparison in Section 5.4. The first is a radix-2 digit-recurrence divider, and the second adopts very-high radix algorithms using a lookup table. Regarding the very-high radix dividers, the divider based on Hung's algorithm [25] is used for 16-bit texture coordinate precision, and the divider using the modified Hung's algorithm with Newton–Raphson iteration [26] is used for 32-bit texture coordinates. The digit-recurrence divider for 16-bit texture coordinates consists of 4 pipeline stages and consumes 6773 NAND-equivalent gates. Hung's divider has 2 pipeline stages and 14,025 gates, but it processes two dividends with one common divisor. The 32-bit digit-recurrence divider consumes 23,486 gates in 5 pipeline stages, and the 32-bit very-high radix divider consumes 41,102 gates in 3 pipeline stages.

5.2. Area comparison of the proposed pixel rasterization

There are several kinds of traversal algorithms, including the conventional centerline algorithm. The bounding box scan is the simplest algorithm, and the algorithms for memory locality such as zig-zag scan or Hilbert-order scan [11] are also based on the bounding box scan. The concept of the scanline traversal algorithm looks intuitive, but the throughput balance of edge traversal and span filling is important. If the hardware resources of edge traversal are made equal to those of span filling so that the pixel throughput is one per clock, the required logic becomes twice that of edge function-based algorithms such as the centerline or the proposed algorithm.


Fig. 13. Real-time verification of the 3D graphics SoC adopting the proposed rasterizer.

Fig. 14. The gate count comparison of the single-pixel rasterizer without texture coordinate interpolation.


Fig. 14 shows the gate count comparison of the single-pixel rasterizer implemented without texture coordinate interpolation. Common logic includes the control logic and the registers that store derivatives used in interpolation, and pixel register counts all the flip-flops used to store pixel contexts. Pixel interpolator includes the movement decision logic and all adders used in pixel interpolation. The bounding box rasterizer is the smallest in terms of gate count because it has no backup register files and its movement decision logic is simple. However, the pixel throughput of this rasterizer is much less than one half, and this performance degradation is shown in Section 5.3. The centerline rasterizer has the highest gate count in this case because it has two sets of pixel backup registers and complex logic for the intersection test. The test result in Fig. 14 shows that the centerline rasterizer costs more gates than the scanline rasterizer due to the additional intersection test logic. The proposed rasterizer reduces the gate count in all three parts compared to the centerline rasterizer and the scanline rasterizer. The movement decision logic is much simpler than the intersection test logic in the centerline rasterizer, and the size of the interpolation adder set is almost half of that of the scanline rasterizer. The size of the register that stores the pixel context is almost two thirds of the pixel register in the centerline rasterizer. Common logic is also the smallest. The centerline rasterizer has additional storage for corner probe points, and the scanline rasterizer has additional pixel context fields in edge traversals. In the test implementation, the gate count of the proposed pixel rasterization architecture is 38.9% less than that of the centerline method, and 35.3% less than that of the scanline method.

5.3. Performance comparison of the proposed pixel rasterization

As all the rasterizers are implemented to traverse one pixel per clock cycle, their performance depends on how many of the generated pixels are valid. Pixel efficiency denotes the number of valid pixels over the number of traversed pixels, and Fig. 15 shows the pixel efficiency of the above rasterization methods for scenes with various average triangle sizes. While the scanline rasterizer always shows a pixel throughput of one because it first finds the correct edges, the pixel throughput of the centerline and proposed methods is lower when the average triangle size is smaller. In the centerline and proposed algorithms there are null pixels, which are traversed pixels outside a given triangle. The pixel throughput of the bounding box method is much less than that of the other methods.
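Pixel efficiency as defined above can be measured directly for the simplest traversal. The sketch below scans the bounding box of a triangle, tests each integer sample with three edge functions, and returns valid pixels over traversed pixels. The inclusive treatment of samples lying exactly on an edge and the integer sample positions are simplifying assumptions, and the function names are hypothetical.

```python
from typing import Tuple

Point = Tuple[int, int]


def edge(a: Point, b: Point, p: Point) -> int:
    """Edge function: its sign tells on which side of the line a->b the
    point p lies; it is zero when p is on the line."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def bounding_box_efficiency(v0: Point, v1: Point, v2: Point) -> float:
    """Valid pixels divided by traversed pixels for a bounding box scan."""
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    traversed = 0
    valid = 0
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            traversed += 1
            e = [edge(v0, v1, (x, y)), edge(v1, v2, (x, y)), edge(v2, v0, (x, y))]
            # Accept either winding order; samples on an edge count as inside.
            if all(v >= 0 for v in e) or all(v <= 0 for v in e):
                valid += 1
    return valid / traversed
```

For the right triangle (0, 0), (8, 0), (0, 8) the scan visits 81 samples, of which 45 are inside, giving about 0.56. For axis-aligned right triangles this ratio tends toward one half as the triangle grows, and thin or slanted triangles fall below it.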

Fig. 16 shows the performance per area of each rasterizer, taking into account the gate counts shown in Section 5.2. The performance efficiency per gate count is normalized to the scanline rasterizer in Fig. 16. The proposed and the centerline rasterizers show a similar performance tendency, but the proposed rasterizer improves the average performance per area by 78.2% compared to the centerline rasterizer. The bounding box rasterizer is the worst due to its low pixel efficiency. The proposed rasterizer architecture shows the best results over the whole simulation except when the average triangle size is less than 16. Considering that average triangle sizes in most game applications are much bigger than 16 [27], the proposed rasterizer architecture is shown to be better than the other rasterizers.
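A rough way to read the performance-per-area metric is pixel efficiency divided by gate count. The sketch below assumes exactly that definition, which is an interpretation rather than a formula given in the text, and uses it to isolate the contribution of the area saving alone.

```python
def perf_per_area(pixel_efficiency: float, gate_count: float) -> float:
    """Assumed metric: valid pixels per cycle per gate."""
    return pixel_efficiency / gate_count


def relative_gain(eff_a: float, gates_a: float, eff_b: float, gates_b: float) -> float:
    """Fractional improvement of design A over design B."""
    return perf_per_area(eff_a, gates_a) / perf_per_area(eff_b, gates_b) - 1.0


# Proposed vs. centerline with equal pixel efficiency and 38.9% fewer gates.
area_only_gain = relative_gain(1.0, 1.0 - 0.389, 1.0, 1.0)
```

Under this assumption the area saving by itself accounts for roughly a 64% gain. The reported average of 78.2% is higher, so the remainder would have to come from differences in pixel efficiency or from how the per-scene ratios are averaged; the text does not break the figure down.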

5.4. Comparison of the proposed texture coordinate interpolation and division

The proposed texture coordinate interpolator and pipelined dividers have the same performance, as both produce outputs every clock cycle, but their silicon area usages are different. Fig. 17 shows the gate count of the proposed texture coordinate interpolator compared to the interpolator architectures using dividers in the case of 16- and 32-bit texture coordinates, respectively. Each interpolator produces a set of 2D texture coordinates per clock cycle. All interpolators have different critical


Fig. 15. The pixel efficiency according to scenes with various average triangle sizes.

Fig. 16. The performance per area normalized to scanline rasterizer.


logic delays, but they are pipelined for the same clock frequency, 166 MHz. The additional pipelining registers are counted in the logic part (Fig. 17).

In the proposed texture coordinate interpolation architecture, additional storage is required for the intermediate terms d, E, A, and B. Therefore, the gate count of the register part in the proposed texture coordinate interpolator is 17.7% higher than in the divider-based interpolators in both cases. However, the gate count of the logic part is greatly reduced in the proposed architecture because it uses only adders instead of dividers. The logic part is reduced by 34.1% for 16-bit texture coordinates compared to the very-high radix divider. In the case of 32-bit texture coordinates, the reduction ratio of the logic gate count grows to 45.1% because divider size usually increases quadratically as the precision increases.

This gain in texture logic outweighs the penalty in texture registers, and overall the gate count is reduced, without performance loss, by 25.2% and 37.0% for 16- and 32-bit texture coordinates, respectively. If a rasterizer is designed to produce multiple sets of texture coordinates for multi-layer texturing, the area gain of the proposed texture coordinate interpolator architecture becomes dominant in the whole rasterizer system.
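How the register penalty and the logic gain combine into the overall saving can be checked with a weighted sum. The sketch below assumes that all three percentages are measured against the same divider-based baseline, which the text does not state explicitly, and solves for the share of baseline gates that would have to sit in registers for the numbers to agree. The function names are hypothetical.

```python
def total_change(register_share: float, register_change: float, logic_change: float) -> float:
    """Relative change in total gate count, given the baseline share of
    gates in registers and the relative change of each part."""
    return register_share * register_change + (1.0 - register_share) * logic_change


def implied_register_share(register_change: float, logic_change: float, total: float) -> float:
    """Solve share*register_change + (1 - share)*logic_change = total."""
    return (total - logic_change) / (register_change - logic_change)


share_16 = implied_register_share(0.177, -0.341, -0.252)
share_32 = implied_register_share(0.177, -0.451, -0.370)
```

Under that assumption, registers would make up roughly 17% of the 16-bit baseline and roughly 13% of the 32-bit baseline. These shares are inferred for illustration only and are not reported in the paper.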

5.5. Comparison of the proposed texture coordinate interpolation and midpoint algorithm

Fig. 18 shows example rendered images. Fig. 18a is an image rendered without texture filtering using the conventional midpoint algorithm, and Fig. 18b is an image rendered with trilinear texture filtering using the proposed algorithm. The rendering times of both scenes are the same, but Fig. 18a clearly shows many visual defects, which are pixels distorted by aliasing noise. Trilinear texture filtering alleviates aliasing noise, as shown in Fig. 18b. Although the precision of the fraction part in texture coordinates is only 4-bit, Fig. 18b shows much better image quality. The image quality is greatly improved


Fig. 17. Gate count comparison for 16- and 32-bit texture coordinate interpolator. (a) 16 bit, (b) 32 bit.

Fig. 18. Image quality comparison between nearest filtering without fraction part (a) and trilinear filtering with 4-bit fraction part (b).

Fig. 19. Performance comparison for midpoint iteration and proposed texture coordinate interpolation.


with the same performance by the proposed algorithm, and the additional hardware is only 4 adders used in 4 Fraction Evaluators.

The fractional part of texture coordinates can be produced by general midpoint iteration, but the iteration upper bound increases exponentially with the number of fractional bits. Fig. 19 shows the performance difference between the proposed technique and midpoint iteration for a 4-bit fractional part. The texture coordinate restorations (14) and (15) are checked in parallel for the two cases in which the texture coordinate difference is 1 ulp and 2 ulp. Therefore, the midpoint algorithm calculates integer texture coordinates every clock cycle only if the ulp is 1, whereas the proposed technique calculates full texture coordinates every clock cycle. The maximum texture coordinate difference is two because of mipmapping, and the iteration upper bound is up to 16 because the ulp is 1/16 for a 4-bit fractional part. The low performance of the previous midpoint technique shown in Fig. 19 is caused by the increased average iteration number, and it is 24% of full performance on average. The performance degradation, which is determined by the average iteration number, depends on the characteristics of each scene. For example, scene 19.a shows smaller performance degradation than the other scenes because magnified textures are used. There are no negative mipmap levels, so the texture coordinates vary only slightly.
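The iteration bound quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the bound equals the maximum coordinate difference expressed in ulps, divided by the number of ulp steps checked in parallel per cycle; this is one reading of the text, and the function names are hypothetical.

```python
def midpoint_iteration_bound(fraction_bits: int, max_difference: int = 2,
                             parallel_cases: int = 2) -> int:
    """Worst-case cycles for midpoint iteration to cover `max_difference`
    when `parallel_cases` ulp steps are checked per cycle."""
    ulp_steps = max_difference * (1 << fraction_bits)  # difference in ulps
    return -(-ulp_steps // parallel_cases)             # ceiling division


def relative_performance(average_iterations: float) -> float:
    """Throughput relative to one result per cycle."""
    return 1.0 / average_iterations
```

`midpoint_iteration_bound(4)` gives 16, matching the stated bound for a 4-bit fraction, and `midpoint_iteration_bound(0)` gives 1, matching the one-per-cycle integer case. Read the other way, the reported 24% average performance would correspond to a little over four iterations per coordinate on average.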

6. Conclusion

In this paper, 3D graphics rasterization algorithms that reduce hardware area are presented. The proposed pixel traversal algorithm is based on edge function characteristics instead of the intersection test of polygon edges and pixel stamp edge segments. It reduces not only the edge function probe points at the four corners of the pixel stamp, but also one context save point. The gate count of the proposed pixel rasterization architecture is 38.9% less than that of the centerline method, and 35.3% less than that of the scanline method.



The proposed texture coordinate interpolation benefits from the low cost of the midpoint algorithm and the high throughput of a pipelined divider in the case of texture filtering, which requires fractional texture coordinates. The hardware area cost of the proposed texture coordinate interpolation is 25.2% and 37.0% lower than the area cost of the architecture using dividers in the cases of 16- and 32-bit texture coordinates, respectively.

The rasterizer of the proposed architecture with four parallel pixel processing units is implemented in a 3D graphics SoC. The implemented rasterizer achieves a throughput of 666 M pixels and 1.3 G texture coordinates per second.

Acknowledgment

This work is supported in part by SAMSUNG Electronics, the university IT research center program, and the Consortium of Semiconductor Advanced Research through the SYSTEM IC 2010 project, Korea.

References

[1] Pineda J. A parallel algorithm for polygon rasterization. In: Proceeding of SIGGRAPH, 1988. p. 15–21.

[2] McCormack J, McNamara R. Tiled polygon traversal using half-plane edge functions. In: Proceeding of SIGGRAPH, 2000. p. 15–21.

[3] Lentz DJ, Kosmal DR, Poole GC. Polygon rasterization. US Patent 5,446,836, 1995.

[4] Lee J, Kim LS. SPARP: a single pass antialiased rasterization processor. Computers and Graphics 2000;4:233–43.

[5] Park YH, Han SH, Lee JH, Yoo HJ. A 7.1-GB/s low-power rendering engine in 2-D array-embedded memory logic CMOS for portable multimedia system. IEEE Journal of Solid-State Circuits 2001;32(6):944–55.

[6] Kim D, Chung K, Yu CH, Kim CH, Lee I, Bae J, et al. An SoC with 1.3 Gtexels/s 3-D graphics full pipeline for consumer applications. IEEE Journal of Solid-State Circuits 2006;41(1):71–84.

[7] Woo R, Choi S, Sohn JH, Song SJ, Yoo HJ. A 210 mW graphics LSI implementing full 3D pipeline with 264 Mtexels/s texturing for mobile multimedia applications. In: Proceedings of IEEE international solid-state circuits conference; 2003. p. 44–5.

[8] Akenine-Moller T, Strom J. Graphics for the masses: a hardware rasterization architecture for mobile phones. ACM Transactions on Graphics 2003;22(3):801–8.

[9] Wylie C, Romney GW, Evans DC, Erdahl A. Halftone perspective drawings by computer. In: Proceeding of AFIPS fall joint computer conference, 1967. p. 49.

[10] Kelleher B. PixelVision architecture. Technical note 1998-013, System Research Center, Compaq Computer Corporation, 1998, available at <http://www.research.digital.com/SRC/publications/src-tn.html>.

[11] McCool MD, Wales C, Moul K. Incremental and hierarchical Hilbert order edge equation polygon rasterization. In: Proceeding of SIGGRAPH/EUROGRAPHICS workshop on graphics hardware, 2001. p. 65–72.

[12] Popescu V, Rosen P. Forward rasterization. ACM Transactions on Graphics 2006;25(2):375–411.

[13] Yu CH, Kim LS. An adaptive spatial filter for early depth test. In: Proceeding of IEEE international symposium on circuits and systems, vol. 2, 2004. p. 137–40.

[14] Park WC, Lee KW, Kim IS, Han TD, Yang SB. An effective pixel rasterization pipeline architecture for 3D rendering processors. IEEE Transactions on Computers 2003;52(11):1501–8.

[15] Park WC, Lee KW, Kim IS, Han TD, Yang SB. A mid-texturing pixel rasterization pipeline architecture for 3D rendering processors. In: Proceeding of IEEE 13th international conference on application-specific systems, architectures and processors, 2002. p. 173–82.

[16] Demirer M, Grimdale RL. Approximation techniques for high performance texture mapping. Computers and Graphics 1996;20(4):483–90.

[17] Abbas A, Szirmay-Kalos L, Szijarto G, Horvath T, Foris T. Quadratic interpolation in hardware Phong shading and texture mapping. In: Proceeding of spring conference on computer graphics, 2001. p. 25–8.

[18] Blinn JF. Hyperbolic interpolation. IEEE Computer Graphics and Applications 1992:89–94.

[19] Pitteway M. Algorithms for drawing ellipses or hyperbolae with a digital plotter. Computer Journal 1967;10(3):282–9.

[20] Barenbrug B, Peters FJ, Overveld CWAM. Algorithms for division free perspective correct rendering. In: Proceeding of SIGGRAPH/EUROGRAPHICS workshop on graphics hardware, 2000. p. 7–13.

[21] Watt A. 3D computer graphics. 2nd ed. Boston: Addison-Wesley Publishing Company; 2000.

[22] Ewins J, Waller MD, White M, Lister PF. MIP-map level selection for texture mapping. IEEE Transaction on Visualization and Computer Graphics 1998;4(4):17–29.

[23] Kim D, Kim LS. Division-free rasterizer for perspective-correct texture filtering. In: Proceeding of IEEE international symposium on circuits and systems, vol. 2, 2004. p. 153–6.

[24] Chung K, Kim D, Kim LS. A 3-way SIMD engine for programmable triangle setup in embedded 3D graphics hardware. In: Proceeding of IEEE international symposium on circuits and systems, 2005. p. 4570–3.

[25] Hung P, Fahmy H, Mencer O, Flynn MJ. Fast division algorithm with a small lookup table. In: Proceeding of 33rd Asilomar conference on signals, systems, and computers, vol. 2, 1999. p. 1465–8.

[26] Jeong J, Park WC, Jeong W, Han TD, Lee MK. A cost-effective pipelined divider with a small lookup table. IEEE Transaction on Computers 2004;53(4):489–95.

[27] Roca J, Moya V, Gonzalez C, Solis C, Fernandez A, Espasa R. Workload characterization of 3D games. In: Proceeding of IEEE international symposium on workload characterization, 2006. p. 17–26.
