
StreamCg: A Stream-based Framework for Programmable Graphics Hardware

Alex Radeski
School of Computer Science and Software Engineering
University of Western Australia
35 Stirling Highway, Crawley, Western Australia, 6009

    [email protected]

    November 26, 2003

This report is submitted as partial fulfilment of the requirements for the Graduate Diploma Programme of the Department of Computer Science and Software Engineering, The University of Western Australia, 2003.



    Abstract

The thesis of this research states that a modern programmable Graphics Processing Unit (GPU) can be used for media processing tasks, such as signal and image processing, resulting in significant performance gains over traditional CPU-based systems. This is made possible by innovative research in the areas of programmable rendering pipelines, shader programming languages, and graphics hardware. The GPU uses data locality and parallelism to achieve these performance improvements. Recent advances in mainstream programmable GPU technology have led many to research the suitability of these devices for more generic programming tasks.

This work explores the difficulties associated with using a GPU for media processing and the performance gains made over traditional CPU implementations. StreamCg was developed as a C++ framework to simplify the development of media processing applications. The NVIDIA Cg language and runtime infrastructure is used to provide a high level programming environment for the GPU. StreamCg simplifies the development of GPU-based programs by providing a simple stream-based programming model that facilitates reuse, and encapsulates the complexity of the underlying OpenGL rendering system. As part of this research I also developed the EmuCg framework. EmuCg assists in the execution of NVIDIA Cg programs on a traditional CPU. This aids in the normally difficult or impossible task of debugging Cg programs. The remainder of this thesis discusses the evolution of programmable rendering pipelines and graphics hardware.

A Discrete Wavelet Transform (DWT) is implemented as a non-trivial example to assist in the performance analysis. The DWTs were designed, implemented and tested using StreamCg. Three different implementations were used, including a CPU-based algorithm, a GPU-based algorithm that was executed on a GPU, and the GPU-based algorithm executed on a CPU using EmuCg. The experimental test results show significant performance gains are made by GPU-based kernels when compared to CPU-based implementations. The StreamCg programming model encourages loose coupling that allows the arbitrary combination of CPU and GPU kernel implementations.

Keywords: stream processors, programmable hardware, GPU, Cg, shader languages, Discrete Wavelet Transform, stream kernel

CR Classification: C.1.2 [PROCESSOR ARCHITECTURES]: Multiple Data Stream Architectures (Multiprocessors), I.3.1 [COMPUTER GRAPHICS]: Hardware Architecture, I.3.6 [COMPUTER GRAPHICS]: Methodology and Techniques, I.3.7 [COMPUTER GRAPHICS]: Three-Dimensional Graphics and Realism


    Acknowledgements

I would like to thank Dr Karen Haines, my supervisor, for her enthusiasm, motivation, and guidance. This research would not have been possible if Karen had not provided the NVIDIA GeforceFX 5900 graphics card needed, for which I am sincerely grateful.

I am very grateful for the consideration given to my work/study situation by Dr Richard Thomas, the honours/fourth year coordinator.

I would not be able to judge a good scientific paper, let alone write one, if it wasn't for the excellent lessons taught by Professor Robyn Owens. I hope this paper is an example of how much I valued those classes.

I would like to thank all my work colleagues at ADI Limited for all your interest and encouragement. Also, thanks to my managers, who didn't watch the clock when I disappeared during the day to go to university.

Finally, I would like to thank my partner Alison, my family, and friends for all your love, support and understanding.


    Contents

1 Introduction
1.1 Background
1.1.1 Programmable Rendering Systems
1.1.2 Hardware-based Rendering Systems
1.1.3 Hardware-based Shader Languages
1.1.4 High Level Shader Languages
1.2 Generic Programming using GPUs

2 The StreamCg Framework
2.1 The EmuCg Framework

3 Testing
3.1 Test System Configuration
3.2 Method
3.3 Experimental Results

4 Discussion
4.1 Future Work

A Original Research Proposal
A.1 Background
A.2 Aim
A.3 Method
A.4 Requirements

B Cg Programming
B.1 The Cg Language
B.2 The Cg Runtime

C Cg Discrete Wavelet Transform Implementation
C.1 Forward DWT Cg Implementation
C.2 Inverse DWT Cg Implementation


    List of Figures

1.1 The Application interacts with the graphics pipeline using the Command system; this is typically done through a computer graphics API, such as OpenGL. The Command system supports the specification of the 3-dimensional scene, for example specifying geometry, textures, lighting, and cameras. The Geometry system handles the transformation, clipping, culling, texture coordinates, lighting, and primitive assembly operations. The Rasterisation system samples the geometry into colour fragments and performs colour interpolation. The Texture system performs texture transformation, projection, and filtering. The Fragment system performs alpha, stencil, and depth testing along with fog and blending to produce pixel colours. The Display system performs gamma correction and generates the output signal for the display.

1.2 A logical view of the render pipeline operations performed by the CPU and GPU. This figure shows that only the Application and Command stages of the pipeline are performed on the CPU. The remainder are executed by the GPU, therefore removing the intensive processing from the CPU. The GPU has its own local video RAM and can also access the main RAM via the Advanced Graphics Port (AGP). It is important to note that the AGP bus is the primary bottleneck when transmitting data to and from the GPU; the higher the transfer speed of the AGP bus, the better the throughput.

1.3 The deep execution pipelines are often executed using stream processors. A stream is essentially comprised of a sequence of kernels that process data in one direction. Kernels are limited to processing only the data passed down, or a limited number of high speed global registers. This is referred to as data locality.


1.4 The data flow of a Programmable Rendering Pipeline. This pipeline replaces parts of the fixed pipeline with the Vertex Program and Fragment Program stages.

2.1 The StreamCg inheritance hierarchy. The Kernel class can be used for CPU-based Kernel implementations; these Kernels will execute on the CPU. The CgKernelFP subclass provides the additional infrastructure to execute a Cg Fragment Program on the GPU. The EmuCgKernelFP subclass provides the emulation layer required to emulate a subset of the Cg language. EmuCg is discussed in more detail later in that section. These classes are extended and appropriate methods can be overridden to specialise the object's behaviour.

2.2 Kernel A is the data source, where the data may be obtained from numerous sources including local files, databases or over a network; Kernel D is the data sink, where the output data may be displayed on the screen, written to a file or transferred over a network. Kernels B and C are typical Kernels in so far as they receive data as input, process the data, and pass on the data downstream.

2.3 Illustration of the four modes of data transfer supported by StreamCg. It has four parts representing different source and target Kernel implementations. In part A, the data transfer remains in video RAM as both are GPU Kernels. In part B, the data remains in main RAM as both are CPU Kernels. In part C, the data is converted from a texture in video RAM into a pixel buffer in main RAM, and in part D the reverse occurs, where the data is converted into a texture in video RAM.

3.1 The DWT Stream Program is comprised of 4 kernels: an image loader, forward DWT, inverse DWT, and image viewer. The image loader reads PNG image files from disk. The images are already resident in memory before execution, therefore the load time from disk is not included. The forward DWT processes the data from the image loader, the output of which is fed into the inverse DWT.


3.2 DWTs are applied to 2-dimensional images in two passes, parts A and B. The first pass, the vertical pass, performs a high and low band filter on a row by row basis, resulting in a high and a low column. The second pass, the horizontal pass, again performs a high and low band filter on the data, however this time it is on a column by column basis. The result, seen in part C, is four quadrants with a mixture of high and low filtered data. The algorithm then recursively transforms the upper left quadrant to produce part D.

B.1 A logical view of a Fragment Program that illustrates the per-fragment input, per-fragment output and constant data. The per-fragment input changes with each fragment processed; however, the constant data is the same for all fragments.

B.2 An example main function for an FP30 Fragment Program, using direction modifiers and semantics.

B.3 When a Fragment Program is executed it processes each fragment on a per-row basis, from the min screen position to the max screen position. This means that at each fragment the Cg program knows nothing of its adjacent neighbours; this data locality is one of the features that enables the high performance parallelism of GPUs.


    List of Tables

3.1 The execution times are presented from fastest to slowest. The GPU shows the fastest times, which scale very well with data size increases. The raw GPU execution times follow an interesting pattern: the first time the DWT Stream Program is run, it takes approximately 230 milliseconds, while subsequent executions averaged in the four to seven millisecond range. The CPU has the next fastest times, which scale significantly worse than the GPU. Finally, the worst times are achieved using EmuCg on the CPU.


    Chapter 1

    Introduction

The data processing demands of many industries, such as medicine, mining, and entertainment, are not being easily met by traditional microprocessor systems alone, such as Intel-compatible processors [11] (referred to as CPUs from here on). This is due to the scalar processor architecture of the CPU and the significant increases in the volume of data. The data volume increases can be attributed to the demand for greater accuracy and improved realism. This research focuses on data that are typically used in signal processing, image processing and computer graphics, generally referred to as media processing.

Developments in the 3-dimensional computer graphics hardware industry have resulted in devices that utilise Graphics Processing Units (GPUs). The GPU offers unprecedented performance gains in mainstream computer graphics, primarily due to the employment of large-scale parallelism. Many leading modern GPUs offer increased programmability, thus enabling many more complex rendering techniques. This research looks at applying these programmable GPUs to more general media processing tasks. There are numerous difficulties that need to be overcome, including determining what parts of the GPU to utilise, what language to use to program the GPU, and how to transfer data between GPU programs.

These problems resulted in the development of StreamCg, a C++ software development framework. I developed StreamCg utilising a layered approach. The lowest layer wraps the complexity of the programming interface to the GPU; this includes wrapping OpenGL and the NVIDIA Cg runtime. The highest layer provides a simplified stream-oriented programming model for developing StreamCg programs. In addition, the supporting framework EmuCg was developed. EmuCg assists in the execution of NVIDIA Cg programs on a traditional CPU. This aids in the normally difficult or impossible task of debugging Cg programs.

I implemented a Discrete Wavelet Transform (DWT) as a non-trivial example to assist in the performance analysis. The DWTs were designed, implemented and tested using StreamCg. Three different implementations were used, including a CPU-based algorithm, a GPU-based algorithm that was executed on a GPU, and the GPU-based algorithm executed on a CPU using EmuCg. The experimental test results show that significant performance gains are made by GPU-based kernels when compared to CPU-based implementations. The StreamCg programming model encourages loose coupling that allows the arbitrary combination of CPU and GPU kernel implementations.

The significant contributions of this research are the development of the StreamCg framework, which allows the easier development of loosely coupled and highly reusable software kernels; the EmuCg framework, which enables the execution of Cg algorithms on a CPU from within StreamCg; and finally the development of a forward and inverse DWT algorithm using the Cg language.


    1.1 Background

The following section outlines the relevant background regarding the evolution of rendering systems that resulted in modern programmable graphics hardware. The first part of this section outlines programmable rendering systems in general; these were traditionally software-based renderers. The second part highlights how hardware-based rendering systems have become more programmable and have reached a point where they are flexible enough to be used for more generic tasks.

    1.1.1 Programmable Rendering Systems

For many years, off-line (non-realtime) rendering systems have supported programmable components. These programmable components are commonly referred to as shaders. Shaders offer a high degree of artistic control over the rendering process and have been employed by the movie industry to produce cinematic quality computer graphics for many years. Historically, programmable rendering systems have sacrificed performance for quality, resulting in slow rendering times. For example, in the late 1980s Pixar considered it reasonable to render a feature film in one year [7]. The reason for such long computation times is that very few quality compromises can be made when rendering cinematic quality effects such as ray tracing, anti-aliasing, procedural surface materials, transparency, reflection and refraction, shadows, lighting, and atmospheric effects [18].

    Shade Trees

The evolution of shader languages began with the seminal work done by Robert L. Cook [6] on Shade Trees. Shade Trees provide a foundation for flexible procedural shading techniques, with the aim of integrating discrete shading techniques into a unified rendering model. This allowed shading techniques to be combined in novel ways.

Shade Trees are defined as a tree structure where nodes define operations and leaves contain appearance parameters, such as color and surface normal. Nodes can be nested to produce composite operations, where the outputs of the children operations are the inputs of the parent operation. Ultimately, the root node of a Shade Tree provides the RGB pixel color for the current point. A Shade Tree is associated with one or more objects in the scene; this means each object has a specific shading program used to evaluate its pixels. This was revolutionary at the time, as many other rendering systems used fixed shading models for all objects in a scene. The use of Shade Trees enabled distinct surfaces to be shaded according to a lighting model that best synthesised that surface. For example, polished wood interacts with light very differently to brushed metal.
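To make the structure concrete, the following C++ sketch illustrates the Shade Tree idea: inner nodes are operations, leaves are appearance parameters, and evaluating the root yields the pixel colour. The class names are illustrative, not Cook's original formulation.

struct Color { float r, g, b; };

// A node in the tree; evaluating it yields a colour contribution.
class ShadeNode {
public:
    virtual ~ShadeNode() {}
    virtual Color evaluate() const = 0;
};

// Leaf: a constant appearance parameter, such as a surface colour.
class ColorLeaf : public ShadeNode {
public:
    explicit ColorLeaf(const Color& c) : colour(c) {}
    virtual Color evaluate() const { return colour; }
private:
    Color colour;
};

// Inner node: combines the outputs of its children; a component-wise
// multiply is the classic way to modulate surface colour by light colour.
class MultiplyNode : public ShadeNode {
public:
    MultiplyNode(ShadeNode* l, ShadeNode* r) : left(l), right(r) {}
    virtual Color evaluate() const {
        Color a = left->evaluate();
        Color b = right->evaluate();
        Color out = { a.r * b.r, a.g * b.g, a.b * b.b };
        return out;
    }
private:
    ShadeNode* left;
    ShadeNode* right;
};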

In addition to Shade Trees there are Light Trees and Atmosphere Trees. A Light Tree describes a specific type of light, such as a spotlight or point light. Each type of light source is accessed by a Shade Tree to perform the appropriate light shading calculations. Atmosphere Trees affect the light output from a Shade Tree by performing some computation to simulate atmospheric effects, such as fog. This transforms the light that reaches the virtual camera.

    An Image Synthesizer

Ken Perlin's [39] subsequent work extended the programming model of Shade Trees; his language was called the Pixel Stream Editing (PSE) language. The PSE language included conditions, loops, function definitions, arithmetic and logical operators, and mathematical functions. In the PSE language variables are either scalars or vectors, and operators can work on both types. For example, the expression a + b could be adding two scalars or two vectors.

Another important contribution of the PSE language is the famous Perlin Noise function. The noise function is used to create natural looking textures that are devoid of repeating patterns. Using noise as a base function, complex functions are used to produce very realistic effects such as water, crystal, fire, ice, marble, wood, metal and rock. These effects are examples of solid textures, which generate distinct features in 3-dimensional space. This was a major step forward for procedural texturing, which was previously only computed in 2D.
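As a sketch of how such effects are typically constructed, the following function sums successive octaves of a noise basis; noise3 stands in for a band-limited noise function such as Perlin noise, and the exact weighting is illustrative rather than Perlin's published formulation.

float noise3(float x, float y, float z);  // assumed noise basis function

// Sum octaves of noise: each octave doubles the frequency and halves the
// amplitude, adding progressively finer detail to the texture.
float turbulence(float x, float y, float z, int octaves) {
    float sum = 0.0f;
    float freq = 1.0f;
    for (int i = 0; i < octaves; ++i) {
        sum += noise3(x * freq, y * freq, z * freq) / freq;
        freq *= 2.0f;
    }
    return sum;
}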

    The RenderMan Shader Language

RenderMan [24] is one of the most popular shader languages in use today. RenderMan builds on the foundation laid by Shade Trees and PSE. The design goals of RenderMan were to develop a unified rendering system that supports shading models for global and local illumination, define the interface between the rendering system and shader programs, and provide a high level language that is expressive and easy to use.

The RenderMan shader language [41] defines a number of different shader types, each handled at a different stage in the rendering process. This is made possible by standardising the shader behaviour and defining an interface to the greater rendering system. These shader types include:

Surface shaders are attached to geometric primitives and compute the light reflected by a surface point as a function of the incoming light and the surface properties. These shaders synthesise how the light interacts with the distinct surface type, such as wood or metal.

Displacement shaders modify the position and normal vector of a point on the surface of an object to modify the visible features of the surface. For example, creating bump mapped effects.

Light shaders are attached to a point in space and compute the color of light that is emitted to an illuminated surface.

Volume shaders compute how light is modified when it travels through a volume, such as light refraction. Atmospheric effects can be created using a global volume that contains the entire scene.

Imager shaders perform operations on the rasterised pixels before they are displayed, using image processing techniques.

There is a strong correlation between the RenderMan shader types and the Shade Tree types mentioned previously. For example, surface and displacement shaders closely resemble the functionality of the base Shade Tree, and light and volume shaders perform similar operations to Light and Atmosphere Trees. The RenderMan system is far more flexible than the specialised graphics hardware discussed later in the chapter. Typically RenderMan systems utilise networked clusters of computers, called render farms, which distribute the work load to improve performance.

The shader language has a C-like syntax and supports a specialised set of types including floats, RGB colours, points, vectors, strings and arrays. The language also supports loops and conditions, which are also present in Perlin's PSE language. Shaders make use of a wide array of trigonometric and mathematical functions, as well as other special purpose functions, such as interpolation, noise and lighting functions. A shader is implemented as a named function with inputs and outputs. RenderMan defines a specific set of input and output parameters for each shader class. For example, the surface shader is required to output the Ci parameter, which represents the new surface color. One problem is that parameters tend to have cryptic names, so it is a good idea to keep the reference manual handy.

The RenderMan implementation referred to in [24] utilises a virtual Single Instruction Multiple Data (SIMD) [40] array architecture. The scene is split into regions, and each region has the associated shaders executed over it by the rendering system. Even though this was a software renderer, the SIMD architecture facilitates the utilisation of high speed parallel graphics hardware. This leads into the following discussion on the evolution of hardware-based rendering systems.

    1.1.2 Hardware-based Rendering Systems

The programming model of hardware-based rendering systems has evolved with the capabilities of the graphics hardware. These capabilities have progressed from a fixed function system to the programmable graphics hardware of today. Consumer level graphics hardware devices were not readily available until the mid-1990s. Up to this point most rendering was performed in software, except for some high-end specialised graphics systems. For interactive applications, with a frame rate above 20 Hz, this resulted in a trade-off between image quality and responsiveness. Even a current day two gigahertz CPU, such as a Pentium IV, cannot render very high quality computer graphics at interactive rates without help.

There are a number of reasons why the CPU cannot keep up with the demands of computer graphics [19]. Firstly, a CPU is primarily a scalar processor, as it essentially processes one instruction at a time on a single set of data. Secondly, a CPU has a single memory interface. This increases access latency as memory access contention increases. Thirdly, only a small portion of the entire chip is actually dedicated to Arithmetic Logic Units (ALUs). This means fewer instructions can be executed in one chip clock cycle. Most of the chip is actually dedicated to caching instructions and data. So as a general rule, CPUs are optimised for complex logic, not high bandwidth throughput. The following section outlines the evolution of how specialised graphics devices offload the intensive graphics processing from the main CPU.

One of the first commercially available high performance hardware accelerated graphics systems was developed by Silicon Graphics Incorporated (SGI) in the 1980s [3]. This system could render 100,000 polygons per second at a refresh rate of 10Hz; relatively speaking, this was a very high performance system. This system defined a polygon as 4 sided, 10x10 pixel, RGB, Gouraud shaded, z-buffered, clipped and screen projected.

The SGI system tightly coupled the main CPU and RAM with the graphics system into one complete unit. From a usability standpoint, this allowed the graphics system to integrate with a user's window-based desktop environment. The SGI graphics system provided the hardware acceleration of a fixed function pipeline and was comprised of the geometry [5], scan conversion and rasterisation subsystems. Subsequent work by SGI on hardware graphics systems enabled realtime texture mapping and anti-aliasing [1], and made further improvements on refresh rates and overall performance [36].

The DN10000VS system took an alternative approach to using a single specialised graphics device: holistic changes were made to the system as a whole to handle high-performance graphics [28]. One of the main goals of the system designers was to have most of the hardware usable most of the time. This could only be achieved through efficient load balancing and minimising latency system wide. Performance was primarily achieved through utilising four processors, with custom graphics instructions and improved hardware bus speeds.

In the mid-1990s, Lastra et al. [30] outlined the requirements for programmable graphics hardware that could achieve interactive frame rates. These requirements covered the programmability, memory layout, and computational power that formed part of the experimental graphics system called PixelFlow. This work built on the previous work done by Molnar et al. [35] and followed the achievements of early graphics systems [20, 21].

To achieve interactive frame rates PixelFlow employed large-scale parallelism. To highlight the scale required: during that time, existing commercial graphics systems required hundreds of processors for a fixed rendering pipeline [36]; a programmable pipeline would require many more. Parallel architectures provided greater performance gains over single pipeline architectures as data was processed simultaneously across an array of processors. Single pipeline architectures are constrained to the clock speed of the single processor; performance increases could only be achieved through advances in technology that improve clock speed.

PixelFlow used a SIMD array of 128 x 64 pixel processors and two general purpose RISC processors. The general purpose processors fed instructions into the SIMD array, where they were executed simultaneously. Parallelism was applied to a scene by subdividing the screen into 128 x 64 pixel regions, each of which was then processed at once.

The modern rendering pipeline evolved as a result of common patterns found in the processes used for rendering 3-dimensional scenes. This pipeline was devised to provide a flexible and simple programming model for the graphics hardware. A high level example of a modern rendering pipeline can be seen in Figure 1.1. This render pipeline is comprised of the Application, Command, Geometry, Rasterisation, Texture, Fragment and Display subsystems [2]. The Application interacts with the graphics pipeline using the Command system; this is typically done through a computer graphics API, such as OpenGL. The Command system supports the specification of the 3-dimensional scene, for example specifying geometry, textures, lighting, and cameras. The Geometry system handles the transformation, clipping, culling, texture coordinates, lighting, and primitive assembly operations. The Rasterisation system samples the geometry into colour fragments and performs colour interpolation. The Texture system performs texture transformation, projection, and filtering. The Fragment system performs alpha, stencil, and depth testing along with fog and blending to produce pixel colours. The Display system performs gamma correction and generates the output signal for the display.

Figure 1.1: The Application interacts with the graphics pipeline using the Command system; this is typically done through a computer graphics API, such as OpenGL. The Command system supports the specification of the 3-dimensional scene, for example specifying geometry, textures, lighting, and cameras. The Geometry system handles the transformation, clipping, culling, texture coordinates, lighting, and primitive assembly operations. The Rasterisation system samples the geometry into colour fragments and performs colour interpolation. The Texture system performs texture transformation, projection, and filtering. The Fragment system performs alpha, stencil, and depth testing along with fog and blending to produce pixel colours. The Display system performs gamma correction and generates the output signal for the display.

At the core of modern graphics hardware is the Graphics Processing Unit (GPU). The term GPU was first introduced in 1999 with the NVIDIA Geforce series of chips [17]. Other hardware vendors, such as ATI, 3D Labs and Matrox, also used the same or a similar term for their graphics processing chips. The first generations of GPU provided hardware acceleration of the fixed function pipeline mentioned earlier. Figure 1.2 provides a logical view of the render pipeline operations performed by the CPU and GPU. This figure shows that only the Application and Command stages of the pipeline are performed on the CPU. The rest is executed by the GPU, therefore removing the intensive processing from the CPU. The GPU has its own local video RAM and can also access the main RAM via the Advanced Graphics Port (AGP) [10]. The standard 1x AGP speed is approximately 267 megabytes (MB) per second throughput. The standard AGP speed at the time of this research is 4x AGP, which is about 1 gigabyte (GB) per second throughput. Newer generations of computers will have 8x AGP, which is approximately 2 gigabytes per second throughput. It is important to note that the AGP bus is the primary bottleneck when transmitting data to and from the video RAM accessed by the GPU.

There are a number of reasons why GPUs are better suited to computer graphics than CPUs [19]. Firstly, GPUs employ large-scale parallelism, resulting in fewer clocks per instruction. This is enabled through data locality, where processors essentially have exclusive access to their allocated data. Secondly, GPUs have multiple and wide memory interfaces, which means that the GPU is optimised for high throughput, thus reducing data access latency. Thirdly, GPUs have deep execution pipelines that help to amortise latencies over the entire processing time. The deep execution pipelines are often executed using stream processors. A stream, shown in Figure 1.3, is essentially comprised of a sequence of kernels that process data in one direction [25]. Kernels are limited to processing only the data passed down to them; this is referred to as data locality. The momentum behind stream processing is increasing, particularly in the area of media processing [27] [46].

Figure 1.2: A logical view of the render pipeline operations performed by the CPU and GPU. This figure shows that only the Application and Command stages of the pipeline are performed on the CPU. The remainder are executed by the GPU, therefore removing the intensive processing from the CPU. The GPU has its own local video RAM and can also access the main RAM via the Advanced Graphics Port (AGP). It is important to note that the AGP bus is the primary bottleneck when transmitting data to and from the GPU; the higher the transfer speed of the AGP bus, the better the throughput.

Figure 1.3: The deep execution pipelines are often executed using stream processors. A stream is essentially comprised of a sequence of kernels that process data in one direction. Kernels are limited to processing only the data passed down, or a limited number of high speed global registers. This is referred to as data locality.

    1.1.3 Hardware-based Shader Languages

A major challenge in developing a shader language for hardware is providing an easy-to-use programming model that also works within the cost and complexity constraints of manufacturing the hardware devices. The GPU programming model is developed as an extension to the fixed function render pipeline of the Microsoft Direct3D [34] and OpenGL [44] Application Programmer Interfaces (APIs).

Figure 1.4: The data flow of a Programmable Rendering Pipeline. This pipeline replaces parts of the fixed pipeline with the Vertex Program and Fragment Program stages.

The data flow of a Programmable Rendering Pipeline is illustrated in Figure 1.4. This pipeline replaces parts of the fixed pipeline with programmable processors.

    Vertex Programs

A Vertex Program, also known as a Vertex Shader, processes the properties of a single vertex as input to produce a single transformed vertex as output. This model was first supported in the mainstream by the NVIDIA Geforce3 GPU [32]. In the context of a fixed function pipeline, Vertex Programs are executed before view space clipping and screen space scaling. This is to ensure that the programs do not cause an invalid operation during the rasterisation process. Vertex Programs allow effects such as procedural animation, including interpolation and morphing, lens effects (such as the fish eye lens) and fog effects (such as elevation based fog).

The basic data type of a Vertex Program is a four-component 32-bit floating point vector (x,y,z,w). The inputs and outputs for the Vertex Program are named four-component registers; for example, COL0 is the diffuse output colour, and TEX0-7 are the 8 possible output texture coordinates for the vertex. The input registers are read-only and the output registers are write-only; this simplifies their hardware implementation. Recent generations of GPUs, such as the NVIDIA Geforce4, only support 16 input registers, 96 constant registers, 12 output registers, and a maximum of 128 instructions per shader. The input registers default to vertex attributes such as position, colour, normal and texture coordinates; however, the register contents can be overridden by the programmer. The constant registers are available for application specific values; these values are the same for each execution of the Vertex Program.

The Vertex Program uses a low level machine language with operations including move (MOV), multiply (MUL), distance (DST), minimum (MIN), four-component dot product (DP4) and others. With each new generation of graphics hardware these limitations are being eliminated; for example, the latest NVIDIA GeforceFX allows shaders to have up to 1024 instructions and many more registers.

    Fragment Programs

Fragment Programs, also known as Pixel Shaders, allow per-fragment operations during the rasterisation process. A fragment is essentially a pixel with associated metadata, such as color, texture coordinates and screen coordinates. Previously, per-pixel operations were achieved through register combiners [12] and texture shaders. These provided a limited set of operations that enabled simple arithmetic operations to be performed with RGBA colours.

The Fragment Program language is similar to the Vertex Program language. Fragment Programs supported by the NVIDIA Geforce4 only provided a limited subset of the operations supported by Vertex Programs and used reduced precision fixed point numbers instead of the 32-bit floating point numbers used in Vertex Programs. These limitations have been lifted in the GeforceFX, which provides a uniform set of operations and 32-bit floating point numbers across Vertex and Fragment Programs.

    1.1.4 High Level Shader Languages

Low level shader languages share many of the same problems as CPU machine languages. They are not easy to read or write, and are tightly coupled to the hardware. New shading languages are being designed to counter these problems; two hardware-based languages that are still in development are C for Graphics (Cg) [13] and the OpenGL Shader Language (GLslang) [26]. The remainder of this paper will focus on Cg; further details on programming with Cg can be found in Appendix B.

Shader language compilers are designed to produce machine language that is compliant with the target shader specification. The compiler can emulate missing functionality wherever possible. However, under certain circumstances the compiler may reject or ignore functions that are not supported by the graphics hardware.

These shader languages share similarities with the RenderMan shader language by design. These include a C-like syntax, support for high precision data types (such as 32-bit floating point numbers), noise and turbulence functions, and a rich set of mathematical functions. However, RenderMan provides more shader types, applied to more stages in the rendering pipeline. The current generation of high level shader languages are still limited to the capabilities of the programmable graphics hardware; only the Vertex Program or the Fragment Program can be used to customise the rendering pipeline of programmable graphics hardware.

    1.2 Generic Programming using GPUs

There is an ever increasing body of research into using GPUs for varying tasks. GPUs are currently used in computer graphics for accelerating radiosity calculations [8] and ray tracing [43]. However, the remainder of this section provides an overview of the work done with GPUs on non-computer-graphics research.

Moreland et al. [38] implement a Fast Fourier Transform (FFT) using a similar GPU to the one used in my research. The FFT implementation did not use Cg, but was instead implemented as a Fragment Program in machine language. The results of the research showed that a 512 by 512 image could be synthesised by conventional means, the FFT performed, the image filtered, and the inverse FFT applied, all in a time well under one second.

Moravanszky [37] discusses implementing dense matrix multiplication using Microsoft DirectX instead of OpenGL and Cg. This has the disadvantage of tying the implementation to the Windows platform. Moravanszky's research differs from my research in that he used an ATI Radeon GPU. Moravanszky's results showed significant performance gains for larger datasets that can absorb the cost of transferring the data to the GPU.

Krueger et al. [29] implement a comprehensive suite of linear algebra operations on a GPU, including vector arithmetic, matrix-vector products, sparse matrices, banded matrices and many more. Krueger et al. also used an ATI Radeon GPU, which did not provide as comprehensive floating-point number support as the NVIDIA GeforceFX used in my research. Higher performance can be achieved when less accurate data types are used. However, their research shows that even with consideration given to the lower precision, the performance gains were significant in comparison to the CPU-based implementation.

Much of the research discussed in the previous paragraphs is based on ad-hoc systems developed to test a narrow domain. McCool et al. [33] developed a metaprogramming system that enabled all code, including the shader machine language, to be specified in C++. The metaprogramming system would manage the underlying graphics API, in this case OpenGL, and also handle loading and executing the shader machine language on the GPU. Although the research primarily focuses on computer graphics examples, this kind of system could be used for more generic tasks.

Buck et al. [4] emphasise that a GPU is a form of stream processor. This led to the development of Brook, a new language that shares many similarities with Cg, but differs in that it provides greater support for generalised GPU programming. The Brook programming model is based on streams of kernels. The research done by Buck et al. bears some resemblance to my research. The main similarity lies in the motivation to create a programming model based on stream processors. The main difference is that my research tried to work within the limitations imposed by established technologies such as C++ and Cg, rather than creating a new language.


    Chapter 2

    The StreamCg Framework

The StreamCg framework is a C++ framework that simplifies the development of high-performance media processing applications. The goals of StreamCg are to provide a simple programming model and to hide the underlying graphics system complexity. Another driving factor when developing StreamCg was to leverage established technologies that are familiar to developers, primarily C++ and Cg.

The StreamCg programming model borrows heavily from the stream architecture of parallel computing. StreamCg is used to assemble Stream Programs, which are comprised of two fundamental building blocks: the Kernel and the Channel. A StreamCg Kernel encapsulates the logic to be performed on a given dataset. Kernels have named inputs and outputs; however, at present only one input and output channel are supported. A Channel is a connection between an output and an input, and is used to transfer data downstream to the next Kernel. Kernels can only access the data passed downstream or constant data specified before execution. This constraint is aligned with the model used by GPU shaders discussed previously.

The UML diagram shown in Figure 2.1 shows the inheritance hierarchy of the StreamCg framework. The Kernel class can be used for CPU-based Kernel implementations; these Kernels will execute on the CPU. The CgKernelFP subclass provides the additional infrastructure to execute a Cg Fragment Program on the GPU. The StreamCg framework currently only supports Cg Fragment Program Kernels because they allow the easiest access to the output data. The data is easily accessible as the Fragment Program outputs data directly to the frame buffer. In contrast, Vertex Programs produce output in the middle of the pipeline that is not easily accessible. Therefore, Vertex Programs are not widely used for generic programming tasks. The EmuCgKernelFP subclass provides the emulation layer required to emulate a subset of the Cg language. EmuCg is discussed in more detail later in this section. These classes are extended and appropriate methods can be overridden to specialise the object's behaviour.

Figure 2.1: The StreamCg inheritance hierarchy. The Kernel class can be used for CPU-based Kernel implementations; these Kernels will execute on the CPU. The CgKernelFP subclass provides the additional infrastructure to execute a Cg Fragment Program on the GPU. The EmuCgKernelFP subclass provides the emulation layer required to emulate a subset of the Cg language. EmuCg is discussed in more detail later in this section. These classes are extended and appropriate methods can be overridden to specialise the object's behaviour.

A Kernel is configured to transfer its output to a downstream Kernel using the writeTo(next:Kernel, inputName:String) method. This method is passed the next Kernel instance and the name of the input. StreamCg only supports transferring data to one input at a time; however, there may be other unused inputs specified by a Kernel. An example Stream Program is assembled in Figure 2.2. In this figure, Kernel A is the data source, where the data may be obtained from numerous sources including local files, databases or over a network; Kernel D is the data sink, where the output data may be displayed on the screen, written to a file, or transferred over a network. Kernels B and C are typical Kernels in so far as they receive data as input, process the data, and pass on the data downstream.

kernelA->writeTo(kernelB, "InputB");
kernelB->writeTo(kernelC, "InputC");
kernelC->writeTo(kernelD, "InputD");

Figure 2.2: Kernel A is the data source, where the data may be obtained from numerous sources including local files, databases or over a network; Kernel D is the data sink, where the output data may be displayed on the screen, written to a file or transferred over a network. Kernels B and C are typical Kernels in so far as they receive data as input, process the data, and pass on the data downstream.

The Kernel::execute() method is called on the root Kernel, which in turn executes the entire Stream Program one Kernel at a time. The execution sequence of a single Kernel is as follows: Kernel::initialise() performs user initialisation of values prior to processing; Kernel::process() performs the actual data processing, and may be repeated a predefined number of times; and finally Kernel::transfer() transfers the output data to the next Kernel. The behaviour of the Kernel::transfer() method is dependent on the type of Kernel. Kernels that execute on the CPU output their data to PixelBufferChannels, which store data in main RAM and transfer data downstream using Channel::write(buf:PixelBuffer). GPU-based Kernels output data to TextureBufferChannels, which store data in video RAM and transfer data using Channel::write(). Details of the significance of the two write methods are discussed later in this section.
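As an illustration of this execution sequence, the following is a minimal sketch of a CPU-based Kernel subclass. The base-class method signatures and the input accessor are assumptions; only the method names above are given by the framework description.

// Hypothetical CPU Kernel that inverts every float component it receives.
class InvertKernel : public Kernel {
public:
    // Called once before processing to set up any user state.
    virtual void initialise() { }

    // Performs the actual data processing on the upstream data.
    virtual void process() {
        PixelBuffer* buf = inputBuffer();        // assumed accessor name
        for (int i = 0; i < buf->size(); ++i)    // assumed PixelBuffer interface
            buf->set(i, 1.0f - buf->get(i));
    }

    // Kernel::transfer() is inherited: a CPU Kernel writes its output to a
    // PixelBufferChannel in main RAM.
};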

StreamCg Kernels are constrained to the types of data inputs and outputs provided by a Fragment Program. As outlined in Appendix B, a Fragment Program can take 2-dimensional textures as input, and its output is always written to the colour buffer of the current OpenGL render context [44]. By default OpenGL clamps the values of buffers and textures to the range 0.0 to 1.0. Under normal use this is sufficient, as the values typically represent RGBA colours. However, for more generic data processing this is a serious limitation. The solution is to use the NVIDIA OpenGL extension for Float Buffers [14]. This extension allows the creation of render contexts containing buffers and textures in video RAM that do not enforce any limitation on values beyond that of the float type.

The complexity of managing textures and buffers is encapsulated in the infrastructure of StreamCg. The Channel::write() method will copy the colour buffer contents of the current render context into either a texture (video RAM) or a pixel buffer (main RAM). The Channel::write(buf:PixelBuffer) method will copy the contents of a pixel buffer into either a texture or another pixel buffer. So we can see that, based on the configuration of the Stream Program Kernels, StreamCg will perform the appropriate conversion of channel data as required. Figure 2.3 illustrates the four modes of data transfer supported by StreamCg. It has four parts representing different source and target Kernel implementations. In part A, the data transfer remains in video RAM as both are GPU Kernels. In part B, the data remains in main RAM as both are CPU Kernels. In part C, the data is converted from a texture in video RAM into a pixel buffer in main RAM, and in part D the reverse occurs, where the data is converted into a texture in video RAM.

Figure 2.3: Illustration of the four modes of data transfer supported by StreamCg. It has four parts representing different source and target Kernel implementations. In part A, the data transfer remains in video RAM as both are GPU Kernels. In part B, the data remains in main RAM as both are CPU Kernels. In part C, the data is converted from a texture in video RAM into a pixel buffer in main RAM, and in part D the reverse occurs, where the data is converted into a texture in video RAM.

To ensure compatibility of data sizes when transferring the colour buffer to a texture and in reverse, StreamCg enforces a uniform data size for all Kernels in a Stream Program. The size limitations are influenced by the size constraints of textures in OpenGL (up to 4096 x 4096) and the available video RAM. The size is set as part of the OpenGL render context; in typical OpenGL applications this is the viewport size. Pure CPU Stream Programs are least impacted by these limitations as they do not use video RAM.

StreamCg provides the KernelContext class to manage the creation of Float Buffers. A KernelContext is specified on an individual Kernel basis; this allows multiple contexts to be utilised in a single Stream Program. However, context switching is demanding on resources. A CPU Kernel executes in the default context, or null context, which can be considered main RAM. GPU Kernels must execute within a KernelContext, otherwise the data values will be clamped.
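A hypothetical configuration sketch follows; the KernelContext constructor arguments and the setContext method are assumptions, since the text states only that a context is specified per Kernel.

// Share one Float Buffer context between the GPU Kernels so their values are
// not clamped, while avoiding expensive context switches between them.
KernelContext* context = new KernelContext(width, height);  // assumed constructor
cgForwardKernel->setContext(context);  // assumed method name
cgInverseKernel->setContext(context);
cpuKernel->setContext(NULL);           // CPU Kernels run in the default (null) context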


The Kernels of a Stream Program are executed over the entire data set from top to bottom, row by row. This execution pattern is the one supported by Fragment Programs, as illustrated by Figure B.3 in Appendix B. It is achieved by rendering a GL_QUAD that spans the entire display area, which results in the execution of the Fragment Program for every pixel in the colour buffer. Due to the data locality constraints imposed by the GPU, the row and column of the pixel being processed are not immediately available. StreamCg specifies texture coordinates for each corner of the rendered GL_QUAD. These are passed to the Fragment Program using the TEX0 semantic and serve as an indexing mechanism for calculations that need to know the row and column being processed. The CPU-based Kernels emulate this behaviour to ensure a consistent programming model.
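The CPU-side emulation of this pattern amounts to a nested loop over the pixels; the sketch below is illustrative (fragmentBody and output are hypothetical stand-ins), showing how the interpolated texture coordinate takes the place of an explicit row and column index.

for (int row = 0; row < height; ++row) {
    for (int col = 0; col < width; ++col) {
        // What the TEX0 texture coordinate interpolates to at this fragment:
        // the centre of the current pixel.
        float s = col + 0.5f;
        float t = row + 0.5f;
        output[row][col] = fragmentBody(s, t);  // hypothetical per-fragment body
    }
}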

The StreamCg underlying infrastructure comprises a number of layers: the OpenGL Object layer, the Cg Object layer and the StreamCg layer. The OpenGL Object layer wraps the OpenGL API into convenience objects, and the Cg Object layer wraps the underlying Cg API. These subsystems provide convenience classes essential to overcoming problems, discussed in the following paragraphs, encountered during the construction of StreamCg. Many problems were encountered when developing the StreamCg framework; the primary issue was that OpenGL has a C API. This is problematic because logical elements within OpenGL (textures, meshes etc) are object-like, but are not exposed as objects. So, it is easy to forget to set a property, or not realise that properties you did set are incompatible. For example, when using the NVIDIA Float Buffer extension, textures must have their internal format set to GL_FLOAT_RGBA_NV [14] instead of the standard GL_RGBA, which is easy to forget when developing a system for the first time.

Another significant development issue is that OpenGL and Cg are really completely separate systems with a loose coupling via the Cg Runtime system. Limited diagnostics and validation are performed, so it is important to perform manual validation regularly. An issue encountered during development relates to silent incompatibilities that cause things not to work. For example, the typical Cg type for a texture is sampler2D, which requires that the texture is specified in OpenGL using the GL_TEXTURE_2D type [44]. Using the NVIDIA Float Buffer extension requires the texture type to be GL_TEXTURE_RECTANGLE_NV. This is not compatible with the Cg sampler2D type, therefore samplerRECT [14] must be used instead.
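Putting those two requirements together, a float texture for use with the extension would be created along the following lines; this is a sketch using the OpenGL enumerants named above, with error handling omitted.

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_RECTANGLE_NV, tex);  // rectangle target required by the extension
// GL_FLOAT_RGBA_NV stores full-range floats instead of clamping to [0.0, 1.0].
glTexImage2D(GL_TEXTURE_RECTANGLE_NV, 0, GL_FLOAT_RGBA_NV,
             width, height, 0, GL_RGBA, GL_FLOAT, pixels);
// On the Cg side, this texture must be declared as samplerRECT, not sampler2D.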


    2.1 The EmuCg Framework

The EmuCg framework is a supporting framework that is primarily intended for debugging Cg programs to be used with StreamCg. Cg programs cannot be easily traced through step-by-step as they are executed on the GPU. At the time of this research there is no facility that allows tracing instructions on a GPU. EmuCg helps minimise the serious problems encountered when developing generic GPU programs. The Cg code can be relatively easily ported to the EmuCg framework, which allows it to be compiled with a C++ compiler, enabling the usage of common C++ debugging tools.

EmuCg emulates the Cg runtime and Cg language. The Cg runtime support is limited to the execution model of a Fragment Program that processes a single four-sided polygon that covers the viewport. This is currently the only execution model supported by the StreamCg framework and is a more limited version of what is possible using a general Fragment Program. Refer to Appendix B for more details on Cg and Fragment Programs. Comprehensive emulation of the Cg runtime would be a difficult task and would require reproducing much of the rendering pipeline behaviour externally.

Emulating the Cg language is a little easier, primarily due to the varying degrees of similarity between Cg, Java [22] and C/C++ [45]. The basic primitive types, such as float and int, are supported with little effort. EmuCg also supports the vector types float2, float3 and float4; the associated arithmetic operators, such as addition and division; and type conversions between the vector sizes. This is possible in C++ primarily through the use of classes and operator overloading.
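A minimal sketch of the approach: a C++ class with overloaded operators reproduces the component-wise semantics of a Cg vector type. EmuCg's actual types are assumed to be richer, with swizzles and conversions between vector sizes.

struct float4 {
    float x, y, z, w;

    float4(float x_ = 0, float y_ = 0, float z_ = 0, float w_ = 0)
        : x(x_), y(y_), z(z_), w(w_) {}

    // Component-wise addition, mirroring the Cg '+' operator.
    float4 operator+(const float4& o) const {
        return float4(x + o.x, y + o.y, z + o.z, w + o.w);
    }

    // Component-wise division by a scalar.
    float4 operator/(float s) const {
        return float4(x / s, y / s, z / s, w / s);
    }
};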


    Chapter 3

    Testing

    3.1 Test System Configuration

The hardware used comprised an Athlon Thunderbird 1.2GHz with 768MB main memory on an Asus A7V266 mainboard supporting 4x AGP. The graphics device was a GeforceFX Ultra 5900 with 256MB video memory. The mainboard was configured to give the graphics device an additional 64MB of main memory as AGP video RAM.

The operating system used was Redhat Linux Version 9.0 with Linux Kernel Version 2.4.22 [47]. The display system included XFree86 Version 4.3.0 with the NVIDIA Linux Driver Version 1.0-4496, using the internal NVAGP AGP driver.

The software was compiled with GCC Version 3.2.2, using the following software development libraries:

Cg Toolkit for Linux Version 1.1, which provided the Cg Runtime and language compiler [13];

libSDL Version 1.2.5, for OpenGL display configuration, input handling and the performance timer;

libPNG Version 1.2.2, for loading PNG image files;

and the GNU Wavelet Image Codec (GWIC) Version 0.1 [31], for an easy-to-follow DWT reference implementation.


Figure 3.1: The DWT Stream Program comprises four kernels: an image loader, forward DWT, inverse DWT, and image viewer. The image loader reads PNG image files from disk. The images are already resident in memory before execution, therefore the load time from disk is not included. The forward DWT processes the data from the image loader, the output of which is fed into the inverse DWT.

The source code was compiled using GCC 3.2.3 to produce an optimised binary executable, using the following configuration: -march=athlon-tbird -mmmx -m3dnow -Wall -O3 -pipe -fomit-frame-pointer.

    3.2 Method

The aim of this experiment is to compare execution times of CPU- and GPU-based stream programs implemented using StreamCg. The test stream program, illustrated in Figure 3.1, comprises four kernels: an image loader, forward DWT, inverse DWT, and image viewer. The image loader reads PNG image files from disk. The forward DWT processes the data from the image loader, the output of which is fed into the inverse DWT. This collection of stream kernels will be referred to as the DWT Stream Program. The expected output is that the image viewer will display an image identical to the input image. Only a working knowledge of DWTs was required for this research and only the salient aspects of DWTs are discussed in this paper; further details on DWTs can be found in the existing body of knowledge [23] [42].

Figure 3.2: DWTs are applied to 2-dimensional images in two passes, parts A and B. The first pass, the vertical pass, performs a high and low band filter on a row by row basis, resulting in a high and a low column. The second pass, the horizontal pass, again performs a high and low band filter on the data, this time on a column by column basis. The result, seen in part C, is four quadrants with a mixture of high and low filtered data. The algorithm then recursively transforms the upper left quadrant to produce part D.

The DWT algorithm implemented as part of this research is designed for use in image compression. The process of the DWT is illustrated in Figure 3.2. DWTs are applied to 2-dimensional images in two passes, parts A and B. The first pass, the vertical pass, performs a high and low band filter on a row by row basis, resulting in a high and a low column. The second pass, the horizontal pass, again performs a high and low band filter on the data, this time on a column by column basis. The result, seen in part C, is four quadrants with a mixture of high and low filtered data. The algorithm then recursively transforms the upper left quadrant to produce part D.
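
The recursive two-pass structure can be summarised in code. The following is a minimal sketch under simplifying assumptions (a square, power-of-two image processed in place); filterRowHighLow and filterColumnHighLow are hypothetical stand-ins for the high and low band filter steps, not functions from the StreamCg implementation.

// Hypothetical stand-ins for the per-row and per-column high/low band
// filter passes described above (definitions omitted).
void filterRowHighLow(float* image, int fullSize, int row, int subframeSize);
void filterColumnHighLow(float* image, int fullSize, int col, int subframeSize);

// Minimal sketch of the recursive two-pass forward DWT structure:
// filter every row, then every column, then recurse into the
// upper-left (low/low) quadrant.
void forwardDWT2D(float* image, int fullSize, int subframeSize)
{
    if (subframeSize < 2)
        return;

    for (int y = 0; y < subframeSize; ++y)
        filterRowHighLow(image, fullSize, y, subframeSize);

    for (int x = 0; x < subframeSize; ++x)
        filterColumnHighLow(image, fullSize, x, subframeSize);

    forwardDWT2D(image, fullSize, subframeSize / 2);
}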

Three versions of the DWT Stream Program were implemented: a CPU-based algorithm, implemented as the ForwardWaveletKernel and InverseWaveletKernel classes; a GPU-based algorithm running on a GPU, implemented as the CgForwardWaveletKernel and CgInverseWaveletKernel classes; and the GPU-based algorithm running on a CPU (using EmuCg), implemented as the EmuCgForwardWaveletKernel and EmuCgInverseWaveletKernel classes. The ForwardWaveletKernel and InverseWaveletKernel provide the base reference implementation used when developing the Cg DWT kernels, and a means of comparing the CPU-based version with the GPU-based version for both correctness and performance. The CPU-based reference implementation uses third party code from the GWIC DWT implementation; this code was used directly with only minor modifications.

The CgForwardWaveletKernel and CgInverseWaveletKernel kernels were developed to execute on the GPU; the Cg code can be seen in Appendix C. The algorithm processes logical rows, where a logical row may be a physical row or column depending on the orientation. This is required because Fragment Programs only process data on a row by row basis. The isH and isV parameters are used to set the logical row orientation to horizontal or vertical, respectively; if one value is set to 1.0 the other must be set to 0.0. An alternative approach could have been to have two different algorithms handle the horizontal and vertical processing, however this would result in a significant performance penalty caused by rapidly unloading and loading the Fragment Programs.
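
The thesis does not reproduce the host-side configuration here, but the orientation flags would be set through the Cg Runtime's parameter calls. The following is a sketch assuming the standard Cg 1.1 OpenGL runtime API; setPassOrientation is a hypothetical helper, not part of StreamCg itself.

#include <Cg/cg.h>
#include <Cg/cgGL.h>

// Hypothetical helper that configures a loaded Fragment Program for a
// horizontal or vertical DWT pass. Exactly one of isH/isV is set to
// 1.0, the other to 0.0.
void setPassOrientation(CGprogram program, bool horizontal)
{
    CGparameter isH = cgGetNamedParameter(program, "isH");
    CGparameter isV = cgGetNamedParameter(program, "isV");

    cgGLSetParameter1f(isH, horizontal ? 1.0f : 0.0f);
    cgGLSetParameter1f(isV, horizontal ? 0.0f : 1.0f);
}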

The Cg DWT algorithm is optimised for how the GPU handles branching. On a current GPU both paths of a boolean condition are executed. If the condition is true then the result of the true path is multiplied by 1.0 and the other by 0.0, and the two values are added together; the opposite occurs if the condition is false. The result is a weighted sum of both paths based on the condition. This is not immediately obvious to a developer, and is counter to how branching is performed on a CPU, where only the taken branch is executed. The algorithm is optimised by collocating the common parts of each branch, thus minimising the actual code for each condition.
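
The effect can be illustrated on the CPU side. The sketch below is not GPU code; it simply shows, in C++, the weighted-sum evaluation that the Fragment Program profile effectively performs for a branch (the two path expressions are arbitrary examples).

// Both paths are computed regardless of the condition; the condition
// only selects the weights used to blend the two results.
float branchAsWeightedSum(float x, float threshold)
{
    float cond = (x < threshold) ? 1.0f : 0.0f; // condition as a 0/1 weight

    float truePath  = x * 2.0f;  // evaluated even when the condition is false
    float falsePath = x * 0.5f;  // evaluated even when the condition is true

    return (truePath * cond) + (falsePath * (1.0f - cond));
}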

The EmuCgForwardWaveletKernel and EmuCgInverseWaveletKernel kernels were developed to assist in debugging the algorithm during development. This was achieved by prototyping the forward and inverse DWT algorithms and tracing through them with a debugger. This is not possible with Cg code executed on the GPU, so it is very difficult to develop complex algorithms without something like EmuCg.

The StreamCg framework enables any forward DWT implementation to be used with any other inverse DWT implementation. This is achieved by only changing the starting configuration of the kernels, as illustrated in Figure 2.2. It highlights the modularity of StreamCg kernels and the potential for kernel reuse, and is supported by the strict object contract between the kernels and the StreamCg framework that enforces a loose coupling.
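
To make the idea of the contract concrete, the following is a hypothetical sketch of such an interface; the real contract is the one defined by the StreamCg framework, and the names here are illustrative only.

// Illustrative kernel contract (hypothetical, not the actual StreamCg
// interface): because every kernel depends only on this abstraction,
// any forward DWT can be wired to any inverse DWT implementation.
class Kernel {
public:
    virtual ~Kernel() {}

    // Consume the upstream kernel's output and produce this kernel's
    // output; all kernels in an execution context share one data size.
    virtual void process(const float* input, float* output,
                         int width, int height) = 0;
};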

    3.3 Experimental Results

The results were obtained by executing the DWT Stream Program on 256 by 256, 512 by 512, and 1024 by 1024 data sizes. This research intended to explore the use of larger data sets, including 2048 by 2048 and 4096 by 4096; however, the graphics hardware could not support such large data sets as it ran out of video RAM.


Implementation   Data Size   Average Time (ms)
GPU              256x256     10
                 512x512     11
                 1024x1024   11
CPU              256x256     89
                 512x512     450
                 1024x1024   2783
CPU (EmuCg)      256x256     920
                 512x512     4191
                 1024x1024   18656

Table 3.1: The execution times are presented from fastest to slowest. The GPU shows the fastest times, which scale very well with data size increases. The raw GPU execution times follow an interesting pattern: the first time the DWT Stream Program is run, it takes approximately 230 milliseconds, while subsequent executions averaged in the four to seven millisecond range. The CPU has the next fastest times, which scale significantly worse than the GPU. Finally, the worst times are achieved using EmuCg on the CPU.


The DWT stream programs were executed 30 times to compute the average execution time. This experiment used the SDL_GetTicks() timer provided by libSDL. The SDL timer was configured to use the Read Time-Stamp Counter (RDTSC) machine code instruction supported by Intel Pentium compatible processors [9]. The time-stamp counter keeps an accurate count of every cycle that occurs on the processor.

The execution times, shown in Table 3.1, are recorded in milliseconds; timing starts when the first kernel writes the input data to the forward DWT kernel, and stops when the last kernel receives the inverse DWT data. Each run of the DWT Stream Program is executed as a single render frame using OpenGL. At the end of each frame the OpenGL glFinish() function is called to allow the graphics hardware to finalise any internal rendering processes. The time taken for this task is not included in the experiment times as it is a side-effect of using OpenGL. This research assumes that the timer stops immediately after the last kernel receives the data, and that subsequent processing required by OpenGL is not significant.
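
The measurement loop can be sketched as follows, under the assumptions just described; runStreamProgram() is a hypothetical stand-in for one frame of the DWT Stream Program.

#include <SDL/SDL.h>
#include <GL/gl.h>

void runStreamProgram(); // hypothetical: one frame of the DWT Stream Program

// Average the execution time over a number of runs (30 in the tests).
// glFinish() is called after the timer stops, so the finalisation time
// is deliberately excluded from the measurement.
Uint32 averageExecutionTime(int runs)
{
    Uint32 total = 0;
    for (int i = 0; i < runs; ++i) {
        Uint32 start = SDL_GetTicks(); // millisecond-resolution timer
        runStreamProgram();            // first kernel writes, last receives
        Uint32 stop = SDL_GetTicks();
        glFinish();                    // outside the timed region
        total += stop - start;
    }
    return total / runs;
}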


    Chapter 4

    Discussion

The introduction of high level shader languages, such as Cg, is encouraging the development of a wider range of applications that utilise the processing power of GPUs. The NVIDIA Cg toolkit provides a runtime system that works within OpenGL, and a language for programming GPUs. StreamCg utilises the Cg toolkit to provide a simple programming model, abstract the underlying implementation complexities, and enable the development of reusable software components.

Many limitations are imposed on StreamCg due to the strong dependency on OpenGL. A major problem is that OpenGL is not being used in the manner for which it was originally designed, in so far as it is not being used for computer graphics. This means that vendor extensions, such as the NVIDIA Float Buffer extension, need to be used to enable more generic programming. StreamCg is also limited to a fixed data size for all kernels within an execution context; this limitation is due to the use of textures and the colour buffer to store input and output data respectively.

    EmuCg assists in the execution of NVIDIA Cg programs on a traditionalCPU. This aids in the normally difficult or impossible task of debugging Cgprograms. EmuCg only supports the execution of Cg Fragment Programs ina limited manner. However, it provides the foundations for a more completeemulation of the Cg language.

The GPU DWT Stream Program shows the best performance. Not only are the execution times significantly faster than the CPU times, but they also scale very well, remaining almost constant as the data size increases.


The next best times are those of the CPU DWT Stream Program using the GNU Wavelet Image Codec (GWIC) code, of which the fastest time is still nine times worse than the slowest GPU time. This Stream Program does not scale well as the data sizes increase: the execution times are approximately four times slower with each step in size, and since each step also quadruples the number of pixels, this highlights a linear relationship between the data size and the execution time.

The worst times are recorded for the EmuCg DWT Stream Program; these are about ten times worse than the GWIC algorithm. This is due to emulating the runtime behaviour of a Fragment Program using EmuCg. In addition, the Cg DWT algorithm is designed to execute on a GPU and performs badly on a CPU, as it relies on parallel execution and high bandwidth memory throughput.

The summary of results above highlights that by using parallelism and high bandwidth memory throughput the GPU achieves significant performance gains over CPU-based programs. This is primarily because a CPU is a scalar processor and has a single, relatively narrow memory interface. However, current graphics hardware is limited by the size of video RAM, thus limiting the amount of data that can be processed. Also, OpenGL only supports textures, used as inputs in StreamCg, up to 4096 by 4096. As a result the GPU is not limited by processing power, but by the capacity to store data. The CPU has the inverse relationship: it is primarily bandwidth and processor limited, and limited less by storage capacity.

Better results could be achieved if the motherboard used supported 8x AGP, with about 2 GB per second throughput. This would minimise the initial spike seen in the GPU results, which is caused by uploading the texture to the graphics hardware. An increase in the size of the video RAM would also need to be accompanied by an increase in the AGP bus speed, necessary to transfer larger data sets to the graphics hardware in a timely manner.

    4.1 Future Work

There are numerous enhancements that could be made to the StreamCg framework. These include:

    Support for more Cg types, such as matrices.

    Port to MS Windows platform.


Support more GPU profiles other than FP30. This includes investigating the use of Vertex Programs.

Implement more image processing kernels, such as edge detection algorithms.

Improve the programming model to make it simpler and more flexible.

It is possible to further optimise the DWT stream program. Fine tuning the Cg program to use integers where possible would improve performance, as integer and float operations are performed in parallel.


    Appendix A

    Original Research Proposal

Title: High Performance Generic Programming using Programmable Graphics Hardware
Author: Aleksandar Radeski
Supervisor: Dr Karen Haines

    A.1 Background

In recent times the speed of modern computer processors (such as the AMD Athlon XP) has exceeded 2GHz. However, in the area of cinematic quality computer graphics the raw processing speed of a single processor is not enough to render these scenes at interactive rates (above 25Hz). This is due to the large volumes of data that need to be processed many hundreds of times a second.

The aim of modern graphics hardware is to render cinematic quality computer graphics at interactive rates. This is achieved by developing highly parallel hardware devices that efficiently process large volumes of data. Early generations of graphics hardware only supported a fixed function rendering pipeline. The fixed function rendering pipeline simplified both the hardware design and the programming of the hardware device, at the cost of flexibility.

Modern graphics hardware, such as the NVIDIA GeforceFX, supports the execution of user programs called shaders.


The two forms of shader are the vertex shader and the pixel shader. The vertex shader is executed per vertex with the purpose of transforming world coordinates into view-space coordinates prior to view frustum clipping. The pixel shader is executed per pixel during the rasterisation phase prior to screen coordinate clipping.

Earlier generations of shader languages resembled machine languages and were limited and difficult to use; for example, branching and looping operations were not supported. Current shader languages are becoming more powerful and less specialised. These C-like languages, such as NVIDIA Cg, support a wide range of types, and looping and branching operations. As these languages become more generalised, it has become apparent that they could be used for high performance processing of generic programming tasks.

    A.2 Aim

The goal of my research is to investigate modern programmable graphics hardware and shader languages, and how they can be utilised for high performance generic programming tasks.

    A.3 Method

The first step is to build an understanding of modern programmable graphics hardware. My research will not specifically focus on the physical design of the hardware, but the aim is to provide enough background to give a better understanding of the difficulties in programming such devices.

Following the hardware research I will investigate the origins of shader languages, focusing primarily on the current generation of C-like languages, such as Cg, and how they were influenced by early shader languages, such as RenderMan. To better understand shader languages I will prototype a number of simple, typical shader examples.

Once I have a good understanding of conventional shaders I will explore the possibilities of using programmable graphics hardware in more generic programming tasks. The example I will implement is a projection shader using Cg that will project data from one coordinate system into another. This technique can be used in Geographic Information Systems (GIS) that require data to be transformed from geographic (latitude, longitude, elevation) into Cartesian (x, y, z) coordinates, and back. The shader is expected to be accurate and fast. The goal is to devise an implementation that can project a complex scene consisting of terrain and entities at interactive rates. Other applications beyond GIS will also be explored.
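
For reference, the geographic-to-Cartesian conversion mentioned above can be sketched as follows; this uses a simplified spherical Earth model rather than the ellipsoidal models typical of production GIS, so the constant and names are illustrative assumptions.

#include <cmath>

// Simplified spherical Earth model (illustrative; real GIS systems use
// an ellipsoid such as WGS84). Latitude and longitude are in radians,
// elevation in metres.
const double EARTH_RADIUS_M = 6371000.0;

void geographicToCartesian(double lat, double lon, double elevation,
                           double& x, double& y, double& z)
{
    double r = EARTH_RADIUS_M + elevation;
    x = r * cos(lat) * cos(lon);
    y = r * cos(lat) * sin(lon);
    z = r * sin(lat);
}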

    A.4 Requirements

The application programming language used will be C++; the shader language used will be NVIDIA Cg. Software requirements include Linux, GCC 3.x, OpenGL and the NVIDIA Cg Toolkit.

Hardware requirements include a standard PC of around 2GHz with about 512MB of RAM and an NVIDIA GeforceFX graphics device.


    Appendix B

Cg Programming

The Cg Toolkit consists of two parts, the Cg Language and the Cg Runtime services [13]. The Cg language is used to implement GPU programs, the generic form of Vertex and Fragment Programs. The Cg Runtime services form part of the CPU-based host application and manage the execution context of a GPU program. They are used to specify the GPU program profile; perform loading, compiling and binding of GPU programs; and specify the input parameters. This discussion outlines many relevant aspects of the Cg Language, but does not try to cover the entire language.
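
As an illustration of these runtime steps, the following sketch shows a typical load-compile-bind sequence using the Cg 1.1 OpenGL runtime; the file name and error handling are assumptions, not taken from StreamCg.

#include <Cg/cg.h>
#include <Cg/cgGL.h>

// Typical Cg Runtime setup for a Fragment Program (sketch; "dwt.cg"
// is an illustrative file name).
CGprogram loadFragmentProgram(CGcontext context)
{
    // Compile the Cg source for the FP30 profile.
    CGprogram program = cgCreateProgramFromFile(
        context, CG_SOURCE, "dwt.cg", CG_PROFILE_FP30, "main", 0);

    // Load the compiled program into the driver/GPU.
    cgGLLoadProgram(program);
    return program;
}

void bindFragmentProgram(CGprogram program)
{
    // Enable the profile and bind the program to the rendering context;
    // only one Fragment Program can be bound at a time.
    cgGLEnableProfile(CG_PROFILE_FP30);
    cgGLBindProgram(program);
}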

    B.1 The Cg Language

The Cg language is designed to be general purpose and hardware-oriented. As a general purpose programming language, Cg may support features beyond those of the available hardware. Cg uses profiles to group supported capabilities for a specific GPU. The GPU profiles discussed in this paper are the OpenGL NVIDIA Vertex Program 2 profile (VP30) [16] and Fragment Program profile (FP30) [15]. A GPU profile is used by the Cg compiler to generate the GPU machine language. If a Cg program uses features not supported by the target profile, the Cg compiler will produce an error. The GPU profile defines the supported data types, flow control operations (such as conditions and loops) and other special purpose operations (such as lighting calculations). This research will not cover the VP30 profile in great detail as the primary discussion concerns the FP30 profile.

The Cg language is based on ANSI C and incorporates certain desirable features from C++ and Java. Cg does not support full ANSI C; there are a number of limitations imposed. For example, pointers are not supported, and arrays are therefore a first-class type; function overloading is supported; and variables may be defined anywhere before being used, not just at the beginning of the scope. The limitations of Cg are a direct result of the limitations imposed by current GPU hardware. The enhancements made to ANSI C are intended to provide specialised GPU support, such as swizzling, and to reduce the programming effort for certain tasks.

The supported data types include float, a 32-bit IEEE floating-point number; half, a 16-bit IEEE-like floating point number; int, a 32-bit integer; fixed, a 12-bit fixed-point number; bool, a boolean type; and sampler*, a texture object with six variants including sampler1D, sampler2D, samplerRECT and sampler3D. Cg also supports more complex types, such as float2, float3 and float4, which are two, three, and four component floating point vectors respectively. Although not used in this paper, Cg also supports matrix types, such as float4x4, which is the largest matrix supported. These complex types act very similarly to C++ classes with a broad range of overloaded arithmetic operators.

A Cg program defines a main function that specifies the parameters for varying inputs and outputs, and for constant data, referred to as uniform data. The value of an input parameter and the destination of an output parameter can be specified using semantic tags. Semantic tags relate to a specific GPU profile and are declared with the parameter name and type. An example of a main function using semantics is shown in Figure B.2. The input parameter incol uses the COLOR semantic, which implies that for each pixel this Fragment Program visits, the parameter receives the colour value. Some input and output semantics are required, otherwise the GPU program is invalid. If a parameter has no semantics defined, then it is up to the host application to specify the value of the parameter. Uniform data is data that is constant for the entire execution of the Cg program, and can be of any of the types discussed previously. Figure B.1 illustrates the inputs, outputs and constant data for a Fragment Program. The constant data is set using the Cg Runtime, discussed later.

Function parameters can also have a direction modifier specified: one of in, out or inout. Parameters are declared in by default if no direction is specified. An in parameter is passed by value; an out parameter is passed out when the function exits but has no initial value; and an inout parameter has a value on function entry, and if it is modified in the function the value is reflected outside the function. The inout direction modifier is meant to provide some of the functionality missing due to the lack of pointers in Cg.


Figure B.1: A logical view of a Fragment Program that illustrates the per-fragment input, per-fragment output and constant data. The per-fragment input changes with each fragment processed, whereas the constant data is the same for all fragments.

void main(in float3 incol : COLOR,
          in float4 winPos : WPOS,
          out float4 outcol : COLOR)
{
    // function body
}

Figure B.2: An example main function for a FP30 Fragment Program, using direction modifiers and semantics.


The Cg language supports flow control constructs, such as looping and branching, which are relatively difficult to implement in parallel systems due to latency and synchronisation issues. At present only the VP30 profile supports both looping and branching; the FP30 profile only supports branching.

    B.2 The Cg Runtime

The Cg Runtime provides Cg program compilation and inspection operations. It supports the pre-compilation of Cg programs into the target profile machine language, or on-demand compilation performed at runtime by the graphics driver. On-demand compilation has the advantage of enabling new driver implementations to generate more efficient machine language as GPU technology matures. By using on-demand compilation the Cg Runtime also supports inspecting the Cg program's interface, including the inputs, outputs and constant data.

Figure B.3: When a Fragment Program is executed it processes each fragment on a per-row basis, from the min screen position to the max screen position. This means that at each fragment the Cg program knows nothing of its adjacent neighbours; this data locality is one of the features that enables the high performance parallelism of GPUs.

The Cg Runtime also manages setting the value of constant input parameters and binding a Cg program to the current rendering context. Values of constant input parameters, of the various Cg types discussed earlier, are set via an appropriately typed function call. The value of an input parameter remains constant until it is changed or the state of the rendering context changes. Only one Vertex and Fragment Program can be bound to a rendering context at any given time.

When correctly configured, Cg programs are executed after the OpenGL glBegin() function call is made to draw a primitive. When a Fragment Program is executed it processes each fragment on a per-row basis, from the min screen position to the max screen position, as shown in Figure B.3. This means that at each fragment the Cg program knows nothing of its adjacent neighbours; this data locality is one of the features that enables the high performance parallelism of GPUs. If Cg programs could access neighbouring values, this would introduce significant performance penalties due to the synchronisation management of shared data.


    Appendix C

Cg Discrete Wavelet Transform Implementation

    C.1 Forward DWT Cg Implementation

    //parameters:

    //icol - required for Fragment Program but not used

    //tex0 - the execution position passed as texture coords

    //subframeSize - the size of the subframe of the data buffer to process

    //isH - use horizontal logical orientation

    //isV - use vertical logical orientation

//hSize - half the size of the subframe
//imageData - the input data as an RGBA texture

    //daub4High - the DWT high band co-efficients

    //daub4Low - the DWT low band co-efficients

    //ocol - the colour buffer output value

    void main(

    in float3 icol : COLOR,

    in float2 tex0 : TEX0,

uniform float subframeSize,

    uniform float isH,

    uniform float isV,

    uniform float hSize,

    uniform samplerRECT imageData,

    uniform float4 daub4High,


    uniform float4 daub4Low,

    out float4 ocol : COLOR)

    {

    //ensure we get a floored value to avoid interpolation issues

    float2 tex = floor(tex0);

    //copy default colour to fill-in unprocessed quadrants

    float3 val = texRECT(imageData, tex).xyz;

    //set logical row/col position based on orientation (H/V)

float rowPos = (tex.x * isH) + (tex.y * isV);
float colPos = (tex.y * isH) + (tex.x * isV);

    //set sample step based on orientation (H/V)

    float uStep = (2 * isH) + (1 * isV);

    float vStep = (1 * isH) + (2 * isV);

    float2 uv0;

    float2 uv1;

    float2 uv2;

    float2 uv3;

    float uvMax;

    float4 daub;

    bool doFilter = false;

    //process the low and high band halves

if((rowPos < hSize) && (colPos < subframeSize)) {

    //low band filter

    float u = (tex.x * uStep);

    float v = (tex.y * vStep);

    uv0 = float2(u, v);

    daub = daub4Low;

    doFilter = true;

    }

else if((rowPos >= hSize) &&
        (rowPos < subframeSize) && (colPos < subframeSize)) {

    //high band filter

    float u = (tex.x - (hSize * isH)) * uStep;

    float v = (tex.y - (hSize * isV)) * vStep;


    uv0 = float2(u, v);

    daub = daub4High;

    doFilter = true;

    }

    if(doFilter) {

    //compute the next 3 input uv coords

    //in the appropriate direction (H or V)

    uv1 = float2(uv0.x + isH, uv0.y + isV);

    uv2 = float2(uv1.x + isH, uv1.y + isV);

    uv3 = float2(uv2.x + isH, uv2.y + isV);

    //wrap values to within the subframeSize

    uv0.x = (uv0.x >= subframeSize ? (uv0.x - subframeSize) : uv0.x);

    uv0.y = (uv0.y >= subframeSize ? (uv0.y - subframeSize) : uv0.y);

uv1.x = (uv1.x >= subframeSize ? (uv1.x - subframeSize) : uv1.x);
uv1.y = (uv1.y >= subframeSize ? (uv1.y - subframeSize) : uv1.y);

    uv2.x = (uv2.x >= subframeSize ? (uv2.x - subframeSize) : uv2.x);

    uv2.y = (uv2.y >= subframeSize ? (uv2.y - subframeSize) : uv2.y);

    uv3.x = (uv3.x >= subframeSize ? (uv3.x - subframeSize) : uv3.x);

    uv3.y = (uv3.y >= subframeSize ? (uv3.y - subframeSize) : uv3.y);

    //fetch 4 input data elements

    float3 a = texRECT(imageData, uv0).xyz;

    float3 b = texRECT(imageData, uv1).xyz;

    float3 c = texRECT(imageData, uv2).xyz;

    float3 d = texRECT(imageData, uv3).xyz;

//perform the high or low band calculation

    val.x = (a.x * daub.x) + (b.x * daub.y) +

    (c.x * daub.z) + (d.x * daub.w);

    val.y = val.x;

    val.z = val.x;

    }

//if the current execution position is not within the subframe then the
//existing value at the current position is copied to the output without
//being processed.

    //set the output colour

    ocol = float4(val, 1);


    }

    C.2 Inverse DWT Cg Implementation

    //parameters:

    //icol - required for Fragment Program but not used

    //tex0 - the execution position passed as texture coords

    //subframeSize - the size of the subframe of the data buffer to process

    //isH - use horizontal logical orientation

    //isV - use vertical logical orientation

    //hSize - half the size of the subframe

    //imageData - the input data as an RGBA texture

//daub4High - the DWT high band co-efficients
//daub4Low - the DWT low band co-efficients

    //ocol - the colour buffer output value

    void main(in float3 icol : COLOR,

    in float2 tex0 : TEX0,

uniform float subframeSize,

    uniform float isH,

    uniform float isV,

    uniform float hSize,

    uniform samplerRECT imageData,

    uniform float4 daub4High,

uniform float4 daub4Low,
out float4 ocol : COLOR)

    {

    //ensure we get a floored value to avoid interpolation issues

    float2 tex = floor(tex0);

    //copy default colour to fill in unprocessed quadrants

    float3 val = texRECT(imageData, tex).xyz;

    //set logical row/col position based on orientation (H/V)

float rowPos = (tex.x * isH) + (tex.y * isV);
float colPos = (tex.y * isH) + (tex.x * isV);

    //set sample step based on orientation (H/V)

    float uStep = (0.5 * isH) + (1 * isV);


    float vStep = (1 * isH) + (0.5 * isV);

    float uLimit = (hSize * isH) + (subframeSize * isV);

    float vLimit = (subframeSize * isH) + (hSize * isV);

    if((rowPos < subframeSize) && (colPos < subframeSize)) {

    //determine if this is an odd position index

    float odd = fmod(rowPos, 2.0);

    //compute daub coefficients for even (0,2) or odd (1,3) case

    float4 daube = float4(daub4Low.x, daub4High.x,

    daub4Low.z, daub4High.z);

    float4 daubo = float4(daub4Low.y, daub4High.y,

    daub4Low.w, daub4High.w);

float4 daub = (daube * (1 - odd)) + (daubo * odd);
float ul = (tex.x - (1 * isH * odd * 0)) * uStep;

    float vl = (tex.y - (1 * isV * odd * 0)) * vStep;

    float uh = ul + (hSize * isH);

    float vh = vl + (hSize * isV);