Exploiting the Graphics Hardware to Solve Two Compute-Intensive Problems
![Page 1: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/1.jpg)
Exploiting the Graphics Hardware to Solve Two Compute-Intensive Problems
Sheetal Lahabar and P. J. Narayanan
Center for Visual Information Technology,
IIIT - Hyderabad
![Page 2: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/2.jpg)
General-Purpose Computation on GPUs
Why GPGPU?
Computational power
  Pentium 4: 12 GFLOPS, GTX 280: 1 TFLOPS
High performance growth: faster than Moore's law
  CPU: 1.4x, GPU: 1.7x to 2.3x every year
  Disparity in performance: CPU (caches and branch prediction), GPU (arithmetic intensity)
Flexible and precise
  Programmability
  High-level language support
Economics
  Gaming market
![Page 3: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/3.jpg)
The Problem: Difficult to Use
GPUs are designed for and driven by graphics
  Model is unusual and tied to graphics
  Environment is tightly constrained
Underlying architectures
  Inherently parallel
  Rapidly evolving
  Largely secret
Can't simply "port" code written for the CPU!
![Page 4: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/4.jpg)
Mapping Computations to GPU
Data-parallel processing
GPU architecture is ALU-heavy
Performance depends on arithmetic intensity = computation / bandwidth ratio
Hide memory latency with more computation
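A minimal sketch of the arithmetic-intensity ratio defined above, comparing a vector addition with a matrix multiplication. The function names, the 4-byte word size and the "each operand crosses the memory bus once" traffic model are illustrative assumptions, not figures from the slides.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def vector_add_intensity(n, word=4):
    # n additions; two vectors are read and one is written (3n words).
    return arithmetic_intensity(n, 3 * n * word)


def matmul_intensity(n, word=4):
    # 2*n^3 flops (multiply + add); two n x n matrices read, one written,
    # assuming ideal reuse so each matrix is transferred exactly once.
    return arithmetic_intensity(2 * n**3, 3 * n * n * word)


if __name__ == "__main__":
    print("vector add:", vector_add_intensity(1 << 20))
    print("matmul    :", matmul_intensity(1024))
```

Under these assumptions the intensity of vector addition is constant, while that of matrix multiplication grows linearly with n, which is why dense linear algebra can keep an ALU-heavy device busy.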
![Page 5: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/5.jpg)
GPU architecture
![Page 6: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/6.jpg)
Singular Value Decomposition on the GPU using CUDA
Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 09), 25-29 May 2009, Rome, Italy
![Page 7: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/7.jpg)
Problem Statement
SVD of an m x n matrix A, for m > n: A = U Σ V^T
U and V are orthogonal and Σ is a diagonal matrix
![Page 8: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/8.jpg)
Motivation
SVD has many applications in image processing, pattern recognition, etc.
High computational complexity
GPUs have high computing power
  Teraflop performance
Exploit the GPU for high performance
![Page 9: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/9.jpg)
Related Work
Ma et al. implemented two-sided rotation Jacobi on a 2-million-gate FPGA (2006)
Yamamoto et al. proposed a method on CSX600 (2007)
  Only for large rectangular matrices
Bobda et al. proposed an implementation on a distributed reconfigurable system (2001)
Zhang Shu et al. implemented one-sided Jacobi
  Works for small matrices
Bondhugula et al. proposed a hybrid implementation on GPU
  Using frame buffer objects
![Page 10: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/10.jpg)
Methods
SVD algorithms
  Golub-Reinsch (bidiagonalization and diagonalization)
  Hestenes algorithm (Jacobi)
Golub-Reinsch method
  Simple and compact
  Maps well to the GPU
  Popular in numerical libraries
![Page 11: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/11.jpg)
Golub Reinsch algorithm
Bidiagonalization: series of Householder transformations
Diagonalization: Implicitly Shifted QR iterations
![Page 12: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/12.jpg)
SVD
Overall algorithm
  B ← Q^T A P : bidiagonalization of A to B
  Σ ← X^T B Y : diagonalization of B to Σ
  U ← QX, V^T ← (PY)^T : compute orthogonal matrices U and V^T
Complexity: O(mn^2) for m > n
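The three steps above compose into the factorization A = U Σ V^T. As a short worked derivation, using only the orthogonality of Q, P, X and Y:

```latex
\begin{aligned}
B &= Q^{T} A P \;\Rightarrow\; A = Q B P^{T},\\
\Sigma &= X^{T} B Y \;\Rightarrow\; B = X \Sigma Y^{T},\\
A &= Q X \,\Sigma\, Y^{T} P^{T} = (QX)\,\Sigma\,(PY)^{T} = U \Sigma V^{T}.
\end{aligned}
```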
![Page 13: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/13.jpg)
Bidiagonalization
[Figure: matrices Q^T, A and P, with an "Identity matrix" label]
Simple bidiagonalization, ith update:
  A(i+1:m, i+1:n) = A(i+1:m, i+1:n) – u_i f(u_i, v_i) – f(v_i) v_i
  Q^T(i:m, 1:m) = Q^T(i:m, 1:m) – f(Q, u_i) u_i
  P(1:n, i:n) = P(1:n, i:n) – f(P, v_i) v_i
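The slide leaves the functions f(...) unspecified. As a hedged illustration of what one such update does, here is a textbook Householder reflection that zeroes a column below the diagonal. It is a plain-Python sketch of the left (column) reflection only, not the authors' blocked CUBLAS implementation; the names `householder_vector`, `apply_left` and `eliminate_column` are hypothetical.

```python
import math


def householder_vector(x):
    """Unit vector u with (I - 2 u u^T) x = (alpha, 0, ..., 0), |alpha| = ||x||."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0] * len(x)
    alpha = -math.copysign(norm, x[0])  # sign chosen to avoid cancellation
    w = list(x)
    w[0] -= alpha
    wnorm = math.sqrt(sum(v * v for v in w))
    return [v / wnorm for v in w]


def apply_left(A, u, row0, col0):
    """In place: A[row0:, col0:] -= 2 u (u^T A[row0:, col0:])."""
    m, n = len(A), len(A[0])
    for j in range(col0, n):
        dot = sum(u[i - row0] * A[i][j] for i in range(row0, m))
        for i in range(row0, m):
            A[i][j] -= 2.0 * u[i - row0] * dot


def eliminate_column(A, i):
    """Zero A[i+1:, i] with one reflection and update the trailing columns."""
    u = householder_vector([A[r][i] for r in range(i, len(A))])
    apply_left(A, u, i, i)
    return u


if __name__ == "__main__":
    A = [[3.0, 1.0], [4.0, 2.0]]
    eliminate_column(A, 0)
    print(A)
```

A full bidiagonalization alternates this column step with the analogous row reflection applied from the right, and accumulates the reflections into Q^T and P.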
![Page 14: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/14.jpg)
Contd…
Many reads and writes
Use block updates
  Divide matrix into n/L blocks
  Eliminate L rows and columns at once
  n/L block transformations
![Page 15: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/15.jpg)
Contd…
A block transformation, L = 3
[Figure: matrices Q^T, A and P, with a block of width L marked]
The ith block transformation updates the trailing A(iL+1:m, iL+1:n), Q(1:m, iL+1:m) and P^T(iL+1:n, 1:n)
Update using BLAS operations
![Page 16: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/16.jpg)
Contd…
Final bidiagonal matrix B = Q^T A P
Store L u_i's and v_i's
Additional space complexity O(mL)
Partial bidiagonalization computes only B
![Page 17: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/17.jpg)
Challenges
Iterative algorithm
Repeated data transfer
High precision requirements
Irregular data access
Matrix size affects performance
![Page 18: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/18.jpg)
Bidiagonalization on GPU
Block updates require Level 3 BLAS
  CUBLAS functions used, single precision
  High performance for smaller dimension
  Matrix dimensions are multiples of 32
Operations on data local to the GPU
  Expensive GPU-CPU transfers avoided
![Page 19: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/19.jpg)
Contd…
In-place bidiagonalization
Efficient GPU implementation
Bidiagonal matrix copied to the CPU
![Page 20: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/20.jpg)
Diagonalization
Implicitly shifted QR algorithm
[Figure: matrices X, B and Y^T, with an "Identity matrix" label; rows k1 to k2 marked for Iteration 1 and Iteration 2]
![Page 21: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/21.jpg)
Diagonalization
Apply implicitly shifted QR algorithm
In every iteration, until convergence
  Find matrix indexes k1 and k2
  Apply Givens rotations on B
  Store coefficient vectors (C1, S1) and (C2, S2) of length k2-k1
  Transform k2-k1+1 rows of Y^T using (C1, S1)
  Transform k2-k1+1 columns of X using (C2, S2)
![Page 22: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/22.jpg)
Contd…
Forward transformation on Y^T
[Figure: coefficient vectors C1 and S1 applied to successive row pairs of Y^T, for j = 0, 1, 2]
for (j = k1; j < k2; j++)
  YT(j,1:n) = f(YT(j,1:n), YT(j+1,1:n), C1(j-k1+1), S1(j-k1+1))
  YT(j+1,1:n) = g(YT(j,1:n), YT(j+1,1:n), C1(j-k1+1), S1(j-k1+1))
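The loop above does not define f and g. A minimal sketch, assuming they are the two halves of a standard plane (Givens) rotation, with 0-based indices; `apply_row_rotations` is a hypothetical name, and the sign convention is an assumption that may differ from the authors' code.

```python
def apply_row_rotations(YT, C1, S1, k1, k2):
    """Rotate row pairs (j, j+1) of YT for j = k1 .. k2-1, in place.

    Assumed forms: f = c*upper + s*lower and g = -s*upper + c*lower,
    both evaluated from the row values held before the update.
    """
    for j in range(k1, k2):
        c, s = C1[j - k1], S1[j - k1]
        upper, lower = YT[j], YT[j + 1]
        YT[j] = [c * a + s * b for a, b in zip(upper, lower)]
        YT[j + 1] = [-s * a + c * b for a, b in zip(upper, lower)]
    return YT


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for row in apply_row_rotations(identity, [0.6, 0.6], [0.8, 0.8], 0, 2):
        print(row)
```

Each column of YT is updated independently of the others, while row j+1 depends on the result for row j. That is the dependency pattern that lets a row be split across threads while the loop over rows stays sequential.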
![Page 23: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/23.jpg)
Diagonalization on GPU
Hybrid algorithm
  Givens rotations modify B on the CPU
  Transfer coefficient vectors to the GPU
Row transformations
  Transform k2-k1+1 rows of Y^T and X^T on the GPU
![Page 24: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/24.jpg)
Contd…
A row element depends on the next or previous row element
A row is divided into blocks
[Figure: an m x n matrix with rows k1 to k2 split into thread blocks B1 … Bk … Bn of width blockDim.x; thread index tx, ty = 0]
![Page 25: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/25.jpg)
Contd…
Kernel modifies k2-k1+1 rows
Kernel loops over k2-k1 rows
  Two rows in shared memory
Requires k2-k1+1 coefficient vectors
  Coefficient vectors copied to shared memory
Efficient division of rows
Each thread works independently
![Page 26: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/26.jpg)
Orthogonal matrices
CUBLAS matrix multiplication for U and V^T
Good performance even for small matrices
![Page 27: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/27.jpg)
Results
Intel 2.66 GHz dual-core CPU used
Speedup on NVIDIA GTX 280:
  3-8 over MKL LAPACK
  3-60 over MATLAB
![Page 28: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/28.jpg)
Contd…
CPU outperforms for smaller matrices
Speedup increases with matrix size
![Page 29: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/29.jpg)
Contd…
SVD timing for rectangular matrices (m = 8K)
Speedup increases with varying dimension
![Page 30: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/30.jpg)
Contd…
SVD of up to 14K x 14K on Tesla S1070 takes 76 minutes on GPU
10K x 10K SVD takes 4.5 hours on CPU, 25.6 minutes on GPU
![Page 31: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/31.jpg)
Contd…
Yamamoto et al. achieved a speedup of 4 on CSX600 for very large matrices
Bobda et al. report the time for a 10^6 x 10^6 matrix, which takes 17 hours
Bondhugula et al. report only the partial bidiagonalization time
![Page 32: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/32.jpg)
Timing for Partial Bidiagonalization
Speedup: 1.5-16.5 over Intel MKL
CPU outperforms for small matrices
Timing comparable to Bondhugula et al., e.g. 11 secs on GTX 280 compared to 19 secs on 7900

Time in secs

| Size | Bidiag. (GTX 280) | Partial Bidiag. (GTX 280) | Partial Bidiag. (Intel MKL) |
|---|---|---|---|
| 512 x 512 | 0.57 | 0.37 | 0.14 |
| 1K x 1K | 2.40 | 1.06 | 3.81 |
| 2K x 2K | 14.40 | 4.60 | 47.9 |
| 4K x 4K | 92.70 | 21.8 | 361.8 |
![Page 33: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/33.jpg)
Timing for Diagonalization
Speedup: 1.5-18 over Intel MKL
Maximum occupancy: 83%
Data coalescing achieved
Performance increases with matrix size
Performs well even for small matrices

Time in secs

| Size | Diag. (GTX 280) | Diag. (Intel MKL) |
|---|---|---|
| 512 x 512 | 0.38 | 0.54 |
| 2K x 2K | 5.14 | 49.1 |
| 4K x 4K | 20 | 354 |
| 8K x 2K | 8.2 | 100 |
![Page 34: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/34.jpg)
Limitations
Limited double precision support
  High performance penalty
Discrepancy due to reduced precision
[Figure: discrepancy plot for m = 3K, n = 3K]
![Page 35: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/35.jpg)
Contd…
Max singular value discrepancy = 0.013%
Average discrepancy < 0.00005%
Average discrepancy < 0.001% for U and V^T
Limited by device memory
![Page 36: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/36.jpg)
SVD on GPU using CUDA: Summary
SVD algorithm on GPU
  Exploits the GPU parallelism
  High performance achieved
Bidiagonalization using CUBLAS
Hybrid algorithm for diagonalization
Error due to low precision < 0.001%
SVD of very large matrices
![Page 37: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/37.jpg)
Ray Tracing Parametric Patches on GPU
![Page 38: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/38.jpg)
Problem Statement
Directly ray trace parametric patches
  Exact point of intersection
  High visual quality images
  Fewer artifacts
  Fast preprocessing
  Lower memory requirement
  Better rendering
![Page 39: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/39.jpg)
Motivation
Parametric patches
  Describe 3D geometrical figures
  Foundation of most CAD systems
Computationally expensive process
Graphics Processing Units (GPUs)
  High computational power, 1 TFLOPS
Exploit the graphics hardware
![Page 40: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/40.jpg)
Bezier patch
16 control points
Better continuity properties, compact
Difficult to render directly
Tessellated to polygons
Patch equation
  Q(u, v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T
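A minimal sketch of evaluating the patch equation above for one coordinate. It assumes P is a 4 x 4 matrix of power-basis coefficients (so any Bezier basis matrix is already folded into P); `eval_patch` is a hypothetical name.

```python
def eval_patch(P, u, v):
    """Evaluate Q(u, v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T for one coordinate."""
    U = [u**3, u**2, u, 1.0]
    V = [v**3, v**2, v, 1.0]
    return sum(U[i] * P[i][j] * V[j] for i in range(4) for j in range(4))


if __name__ == "__main__":
    # Coefficients chosen so that Q(u, v) = u + v.
    P = [[0.0] * 4 for _ in range(4)]
    P[2][3] = 1.0  # u * 1
    P[3][2] = 1.0  # 1 * v
    print(eval_patch(P, 0.5, 0.25))
```

A 3D surface point is obtained by calling this once per coordinate with the corresponding coefficient matrix.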
![Page 41: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/41.jpg)
Methods
Uniformly refine on the fly
  Expensive tests to avoid recursion
  Approximates to triangles
  Rendering artifacts
Find exact hit point of a ray with a patch
  High computational complexity
  Prone to numerical errors
![Page 42: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/42.jpg)
Related Work
Toth's algorithm (1985)
  Applies multivariate Newton iteration
  Dependent on calculation of interval extension; numerical errors
Manocha and Krishnan's method (1993)
  Algebraic pruning based approaches
  Eigenvalue formulation of the problem
  Does not map well to GPU
Kajiya's method (1982)
  Finds roots of an 18-degree polynomial
  Maps well to parallel architectures
![Page 43: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/43.jpg)
Kajiya’s algorithm
v: intersect a and b
u: gcd(a, b)
[Figure: ray R, planes l0 and l1, curves a and b, patch P]
![Page 44: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/44.jpg)
Advantages
Finds the exact point of intersection
Uses robust root finding procedure
No memory overhead required
Requires double precision arithmetic
Able to trace secondary rays
On the downside, computationally expensive
Suitable for parallel implementation
Can be implemented on GPU
![Page 45: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/45.jpg)
Overview of ray tracing algorithm
[Flowchart] Stages:
Create BVH (CPU)
Compute plane equations (GPU)
Traverse BVH for all pixels/rays (GPU)
Compute 18-degree polynomials (GPU)
Find the roots of the polynomials (GPU)
Compute the GCD of bicubic polynomials (GPU)
Accumulate shading data recursively and render
Spawn secondary rays (GPU)
Compute point and normal (GPU)
Groupings shown in the diagram: Preprocessing, Every frame, For all intersections
![Page 46: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/46.jpg)
Compute Plane Equations
M+N planes represent MxN rays
A thread computes a plane equation
  Use frustum corner information
Device occupancy: 100%
[Figure: eye and pixel]
![Page 47: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/47.jpg)
BVH traversal on the GPU
Create BVH, traverse depth first
Invoke traverse, scan, rearrange
Store Num_Intersect intersection data
Device occupancy: 100%
[Figure: worked example of the traverse, scan (Sum, Prefix_Sum) and rearrange steps for pixels (x, y); the rearranged output is pixel_x = 4 4 4 4 5 5 5 5, pixel_y = 5 5 5 6 5 5 6 6, patch_ID = 0 1 2 2 3 4 4 5]
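The traverse, scan and rearrange steps follow the usual count, prefix-sum and scatter pattern for compacting variable-length per-pixel results. A sequential sketch of that pattern, assuming the traversal result is available as a list of per-pixel patch hits; `compact_intersections` and its input layout are hypothetical, and a GPU version would run each phase as a parallel kernel.

```python
from itertools import accumulate


def compact_intersections(hits_per_pixel):
    """hits_per_pixel: list of ((x, y), [patch ids hit by that pixel's ray])."""
    # traverse: how many candidate patches each pixel produced
    counts = [len(patches) for _, patches in hits_per_pixel]
    # scan: exclusive prefix sum gives each pixel's output offset
    offsets = [0] + list(accumulate(counts))[:-1]
    total = sum(counts)  # Num_Intersect
    pixel_x, pixel_y, patch_id = [0] * total, [0] * total, [0] * total
    # rearrange: scatter every (pixel, patch) pair into flat arrays
    for ((x, y), patches), start in zip(hits_per_pixel, offsets):
        for k, pid in enumerate(patches):
            pixel_x[start + k] = x
            pixel_y[start + k] = y
            patch_id[start + k] = pid
    return pixel_x, pixel_y, patch_id


if __name__ == "__main__":
    hits = [((4, 5), [0, 1, 2]), ((4, 6), [2]), ((5, 5), [3, 4]), ((5, 6), [4, 5])]
    for name, arr in zip(("pixel_x", "pixel_y", "patch_ID"), compact_intersections(hits)):
        print(name, arr)
```

After compaction, every later kernel can assign one thread per entry of the flat arrays, regardless of how unevenly the hits were distributed over pixels.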
![Page 48: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/48.jpg)
Computing the 18 degree polynomial
Intersection of a and b
32 A and B coefficients
Evaluate R = [a b c; b d e; c e f] for v
bezout kernel
  grid = Num_Intersect/16, threads = 21*16
  6-6 degree, 6-12 degree, 3-18 degree
[Figure: thread block of 16 x 21 threads; threads active: 21*16, 13*16, 19*21]
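A sketch of expanding the determinant of the symmetric matrix R above when its entries are polynomials in v. It assumes each of a to f is a degree-6 polynomial stored lowest degree first, so the determinant has degree 18 (19 coefficients). The helper names are hypothetical and the code is sequential, unlike the bezout kernel.

```python
def poly_mul(p, q):
    """Product of two polynomials, coefficients lowest degree first."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_add(p, q, sign=1.0):
    """p + sign * q, padding the shorter polynomial with zeros."""
    n = max(len(p), len(q))
    p = list(p) + [0.0] * (n - len(p))
    q = list(q) + [0.0] * (n - len(q))
    return [x + sign * y for x, y in zip(p, q)]


def det_symmetric3(a, b, c, d, e, f):
    """det [a b c; b d e; c e f] = a(df - e^2) - b(bf - ce) + c(be - cd)."""
    m1 = poly_add(poly_mul(d, f), poly_mul(e, e), -1.0)
    m2 = poly_add(poly_mul(b, f), poly_mul(c, e), -1.0)
    m3 = poly_add(poly_mul(b, e), poly_mul(c, d), -1.0)
    partial = poly_add(poly_mul(a, m1), poly_mul(b, m2), -1.0)
    return poly_add(partial, poly_mul(c, m3))


if __name__ == "__main__":
    entry = [1.0] * 7  # an arbitrary degree-6 polynomial
    print(len(det_symmetric3(entry, entry, entry, entry, entry, entry)))
```

The intermediate 2 x 2 minors have degree 12 and the final products have degree 18, which is one way to read the degree counts listed for the kernel.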
![Page 49: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/49.jpg)
Contd…
Configuration uses resources well
Avoids uncoalesced reads and writes
  Row-major layout
Reduced divergence
Device occupancy: 69%
Performance limited by registers
![Page 50: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/50.jpg)
Finding the polynomial roots
18 roots using Laguerre's method
  Guarantees convergence
  Iterative and cubically convergent
A thread evaluates an intersection
  grid = Num_Intersect/64, threads = 64
Kernel invoked from the CPU
  while (i < 18)
    call <laguerre> kernel, finds ith root x_i
    call <deflate> kernel, deflates polynomial by x_i
  end
Iteration update: x_i = x_i – g(p(x), p'(x))
Each invocation finds a root in the block
Store real v count in d_countv
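A sequential sketch of the laguerre/deflate loop above for a single polynomial. It uses the textbook form of Laguerre's update, which also involves the second derivative p''(x). The function names, tolerances and starting guess are assumptions, and the GPU kernels process many intersections per launch rather than one.

```python
import cmath


def eval_with_derivs(coeffs, x):
    """Horner evaluation of p, p', p''; coefficients highest degree first."""
    p, dp, d2 = coeffs[0], 0.0, 0.0
    for c in coeffs[1:]:
        d2 = d2 * x + dp
        dp = dp * x + p
        p = p * x + c
    return p, dp, 2.0 * d2


def laguerre_root(coeffs, x0=0j, tol=1e-12, max_iter=100):
    """Find one (possibly complex) root with Laguerre's iteration."""
    n = len(coeffs) - 1
    x = complex(x0)
    for _ in range(max_iter):
        p, dp, ddp = eval_with_derivs(coeffs, x)
        if abs(p) < tol:
            break
        G = dp / p
        H = G * G - ddp / p
        root = cmath.sqrt((n - 1) * (n * H - G * G))
        # pick the denominator of larger magnitude for stability
        denom = G + root if abs(G + root) >= abs(G - root) else G - root
        if denom == 0:
            x += 1.0  # perturb and retry
            continue
        x -= n / denom
    return x


def deflate(coeffs, r):
    """Divide by (x - r) using synthetic division; the remainder is dropped."""
    out = [coeffs[0]]
    for c in coeffs[1:-1]:
        out.append(c + out[-1] * r)
    return out


def all_roots(coeffs):
    work = [complex(c) for c in coeffs]
    roots = []
    while len(work) > 1:  # one laguerre call plus one deflate call per root
        r = laguerre_root(work)
        roots.append(r)
        work = deflate(work, r)
    return roots


def real_roots(roots, eps=1e-9):
    """Keep roots whose imaginary part is negligible (the 'real v' candidates)."""
    return [r.real for r in roots if abs(r.imag) < eps]


if __name__ == "__main__":
    print(sorted(real_roots(all_roots([1, -6, 11, -6]))))
```

Deflating with approximate roots accumulates error for high degrees, so production code often polishes each root against the original polynomial; that step is omitted here.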
![Page 51: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/51.jpg)
Contd…
Splitting the kernel reduces register usage
Avoids uncoalesced reads and writes
  Row-major data layout
Device occupancy
  laguerre kernel: 25%, deflate kernel: 50%
Performance limited by
  Use of double registers
  Complex arithmetic
  Shared memory: repeated transfer of polynomial coefficients
![Page 52: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/52.jpg)
Compute GCD of bicubic polynomials
u = GCD(a, b)
  Euclidean algorithm
  Real v count from d_countv
A thread evaluates an intersection
  grid = Num_Intersect/64, threads = 64
[Figure: Num_Intersect intersections split into blocks bx; tx = 64, ty = 0]
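A minimal sketch of the Euclidean algorithm on univariate polynomials, as it might be applied to a and b once a real v has been substituted, leaving two polynomials in u. Coefficients are stored highest degree first; the zero tolerance `eps` and the function names are assumptions, and floating-point GCDs are sensitive to that tolerance.

```python
def _strip(p, eps=1e-9):
    """Drop leading coefficients that are numerically zero."""
    p = list(p)
    while p and abs(p[0]) < eps:
        p.pop(0)
    return p


def poly_rem(num, den, eps=1e-9):
    """Remainder of num / den by long division (den must be stripped)."""
    num = list(num)
    while len(num) >= len(den):
        factor = num[0] / den[0]
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num.pop(0)  # leading term is now zero
    return _strip(num, eps)


def poly_gcd(a, b, eps=1e-9):
    """Monic GCD of two nonzero polynomials via Euclid's algorithm."""
    a, b = _strip(a, eps), _strip(b, eps)
    while b:
        a, b = b, poly_rem(a, b, eps)
    return [c / a[0] for c in a]


if __name__ == "__main__":
    # (u - 1)(u - 2) and (u - 1)(u - 3) share the factor (u - 1).
    print(poly_gcd([1.0, -3.0, 2.0], [1.0, -4.0, 3.0]))
```

When the GCD is linear, u - u0, the common root u0 is the u parameter of the intersection for that v.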
![Page 53: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/53.jpg)
Contd…
Update d_countu for real (u, v) pair
Device occupancy: 25%
Performance limited by
  Double registers
  Shared memory
  A and B coefficients read repeatedly
![Page 54: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/54.jpg)
Compute (x,y,z) and normal n
Use parametric patch equation
Real (u, v) count from d_countu
A thread processes an intersection
  grid = Num_Intersect/64, threads = 64
Device occupancy: 25%
Performance limited by
  Double registers
  Shared memory
  Repeated patch data transfer
![Page 55: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/55.jpg)
Challenges
High computational complexity
Requires higher precision
Repeated data transfer from device to kernel
Irregular data access
Robust root finding algorithm
Complex arithmetic
High memory requirements
![Page 56: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/56.jpg)
Optimizations
Keep computations independent (one thread per pixel)
  Disadvantage: no coherence
Avoid unnecessary computations
  Using SAH (surface area heuristic) in building the BVH
Arrange data to reduce workload
![Page 57: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/57.jpg)
Secondary rays
Secondary (shadow and reflection) rays spawned
  Two orthogonal planes selected
  Find real point of intersection
  Shadow ray shadows the point of origin
Compute final color recursively
  Standard illumination equation
![Page 58: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/58.jpg)
Memory requirements and Bandwidth
Memory requirements
  64 doubles: patch coefficients
  Store plane equations (screen resolution)
  Per intersection of the ray (double): 480 bytes
    32 x 8 bytes: bicubic polynomials
    19 x 8 bytes: polynomial roots
    3 x 4 bytes: patch ID and pixel location
    60 bytes: additional flags
Memory bandwidth
  Patch coefficients read repeatedly in laguerre kernel
  Incurs a performance penalty
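The per-intersection figure can be checked by adding up the four items listed above. The short sketch below does that and estimates how many intersections fit in a given buffer; the `max_intersections` helper and the 256 MB budget in the usage line are illustrative assumptions, not values from the slides.

```python
# Per-intersection storage, in bytes, as itemized on the slide.
BYTES_PER_INTERSECTION = {
    "bicubic polynomials": 32 * 8,
    "polynomial roots": 19 * 8,
    "patch ID and pixel location": 3 * 4,
    "additional flags": 60,
}

TOTAL_BYTES = sum(BYTES_PER_INTERSECTION.values())


def max_intersections(budget_bytes):
    """Number of ray-patch intersections that fit in budget_bytes."""
    return budget_bytes // TOTAL_BYTES


if __name__ == "__main__":
    print("bytes per intersection:", TOTAL_BYTES)
    print("intersections in 256 MB:", max_intersections(256 * 1024 * 1024))
```

The sum is 256 + 152 + 12 + 60 = 480 bytes, matching the stated total.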
![Page 59: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/59.jpg)
Strengths
Facilitates direct ray tracing of dynamic patches
Divides into independent tasks
Low branch divergence and high memory access coherence
Time taken linear in the number of intersections
No additional overhead incurred for secondary rays
![Page 60: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/60.jpg)
Contd…
Predict performance based on scene complexity
Speed up by multiple GPUs
Reduction in the number of intersections boosts performance
![Page 61: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/61.jpg)
Limitations
Ray tracing performance
Memory usage
  Limits the number of intersections processed
  480 bytes per ray-patch intersection
Double precision performance
  Fewer GFLOPS
Limited shared memory
  Repeated data transfer
  Increases memory traffic and reduces performance
![Page 62: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/62.jpg)
Contd…
Batch processing solves the memory usage problem
GPUs now have improved double precision performance, up to 4x
Modern GPUs have more shared memory available
![Page 63: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/63.jpg)
Results: On GTX 280

| Model | No. of intersections | Patch/Ray | BVH traversal time (secs) | Polynomial formation time (secs) | Solve polynomial time (secs) | GCD, (x,y,z) and n computation time (secs) | Time per frame (secs) | Average time per intersection (microseconds) |
|---|---|---|---|---|---|---|---|---|
| Teapot-P | 54389 | 2.01 | 0.004 | 0.019 | 0.175 | 0.013 | 0.211 | 3.8 |
| Teapot-S | 29626 | 2.32 | 0.003 | 0.012 | 0.111 | 0.010 | 0.136 | 4.5 |
| Teapot-R | 41096 | 3.21 | 0.004 | 0.031 | 0.143 | 0.011 | 0.189 | 4.6 |
| Bigguy-P | 114048 | 3.23 | 0.007 | 0.043 | 0.352 | 0.015 | 0.417 | 3.6 |
| Bigguy-S | 114112 | 3.47 | 0.007 | 0.048 | 0.350 | 0.015 | 0.420 | 3.7 |
| Bigguy-R | 143040 | 4.34 | 0.008 | 0.104 | 0.480 | 0.022 | 0.614 | 4.3 |
| Killeroo-P | 127040 | 1.43 | 0.010 | 0.050 | 0.390 | 0.016 | 0.466 | 3.7 |
| Killeroo-S | 138240 | 1.72 | 0.011 | 0.061 | 0.420 | 0.016 | 0.508 | 3.7 |
| Killeroo-R | 146432 | 1.82 | 0.013 | 0.105 | 0.446 | 0.022 | 0.586 | 4 |
![Page 64: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/64.jpg)
Kernel split timing
Finding roots: on average 82%
BVH traversal takes negligible time
Constant percentage for primary and secondary rays
Device occupancy: 25-100%
Y axis: (Model, Ray Type) tuple
  Teapot (T), Bigguy (B), Killeroo (K)
  Primary (P), Shadow (S), Reflected (R)
![Page 65: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/65.jpg)
Preliminary results on Fermi

| Model | No. of intersections | Time per frame, Fermi 480 (secs) | Time per frame, GTX 280 (secs) | Avg. time per intersection, Fermi 480 (microsecs) | Avg. time per intersection, GTX 280 (microsecs) | Speedup |
|---|---|---|---|---|---|---|
| Teapot-P | 54389 | 0.071 | 0.211 | 1.3 | 3.8 | 2.94 |
| Teapot-S | 29626 | 0.041 | 0.136 | 1.38 | 4.5 | 3.28 |
| Teapot-R | 41096 | 0.057 | 0.189 | 1.38 | 4.6 | 3.30 |
| Bigguy-P | 114048 | 0.147 | 0.417 | 1.28 | 3.6 | 2.83 |
| Bigguy-S | 114112 | 0.148 | 0.420 | 1.29 | 3.7 | 2.83 |
| Bigguy-R | 143040 | 0.190 | 0.614 | 1.32 | 4.3 | 3.23 |
| Killeroo-P | 127040 | 0.164 | 0.466 | 1.29 | 3.7 | 2.83 |
| Killeroo-S | 138240 | 0.179 | 0.508 | 1.29 | 3.7 | 2.83 |
| Killeroo-R | 146432 | 0.195 | 0.586 | 1.33 | 4 | 2.99 |
![Page 66: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/66.jpg)
Average time per intersection
Per intersection
  3.7 μs on GTX 280
  1.4 μs on GTX 480
No overhead incurred for secondary rays
Predict performance
X axis: (Model, Ray Type) tuple
  Teapot (T), Bigguy (B), Killeroo (K)
  Primary (P), Shadow (S), Reflection (R)
![Page 67: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/67.jpg)
Comparison to CPU
First direct ray tracing implementation
Scales linearly with number of intersections
Near interactive rates
Outperforms the CPU:
  340x on GTX 280
  990x on GTX 480
Promises interactivity
The plot shows the speedup using GTX 280 over a MATLAB implementation on an AMD dual-core processor
![Page 68: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/68.jpg)
Teapot (32 patches) with reflection rays
Teapot (32 patches) with shadow and reflection rays
![Page 69: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/69.jpg)
Bigguy (3570 patches) with shadow rays
Killeroo (11532 patches) with shadow rays
![Page 70: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/70.jpg)
Multiple objects with shadow and reflection rays
![Page 71: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/71.jpg)
Ray tracing parametric patches on the GPU: Summary
Finds exact points of intersection
  Per-pixel shading using true normal
Renders highly accurate models
  Quality not affected on zooming
Able to trace secondary rays
Suitable for parallel and pipelined execution
Near interactive performance; speedup over CPU
![Page 72: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/72.jpg)
Contd…
Alternative to subdivision approaches
Suitable for multi-GPU implementation
Easily extended for other parametric models
![Page 73: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/73.jpg)
Future Work
SVD and ray tracing on multiple GPUs
Addressing large SVD
Use double precision for SVD
Adapt ray tracing to new-generation architectures (Fermi)
Extend ray tracing for dynamic models
![Page 74: Exploiting the Graphics Hardware to solve two compute intensive problems](https://reader035.vdocuments.mx/reader035/viewer/2022062517/56813ca6550346895da655ec/html5/thumbnails/74.jpg)
Thank you