Implementing 3D SPHARM Surfaces Registration on Cell Processor
Huian Li ([email protected]), Mi Yan ([email protected]), Robert Henschel (rhensche@indiana.edu), Li Shen (shenli@iupui.edu)
July 29, 2009
Contents
• SPHARM registration
• Matlab implementation
• Cell implementation
• Performance analysis
• Conclusion
SPHARM Surfaces
• Radial and stellar surfaces
• Simply connected, arbitrarily shaped
• Vision, graphics, imaging, bioinformatics
SPHARM Expansion
Area-preserving mapping: $(\theta, \phi) \to (x, y, z)$
SHREC
(a) template, (b) object, (c) after ICP, (d) after registration of parameterization
Calculation of coefficients
• After rotating the parameter net on the surface by Euler angles (α, β, γ), the new coefficients will be:

$$c_l^m(\alpha\beta\gamma) = \sum_{n=-l}^{l} D_{mn}^{l}(\alpha\beta\gamma)\, c_l^n$$

where

$$D_{mn}^{l}(\alpha\beta\gamma) = e^{-i\gamma n}\, d_{mn}^{l}(\beta)\, e^{-i\alpha m}$$

and

$$d_{mn}^{l}(\beta) = \sum_{t=\max(0,\,n-m)}^{\min(l+n,\,l-m)} (-1)^t\, \frac{\sqrt{(l+n)!\,(l-n)!\,(l+m)!\,(l-m)!}}{(l+n-t)!\,(l-m-t)!\,(t+m-n)!\,t!} \left(\cos\frac{\beta}{2}\right)^{2l+n-m-2t} \left(\sin\frac{\beta}{2}\right)^{2t+m-n}$$
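The rotation of coefficients can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab or Cell code: it assumes the standard Wigner small-d summation with the t range chosen so that every factorial argument is non-negative, it assumes scalar complex coefficients stored in a dictionary keyed by (l, m), and the function names `wigner_d` and `rotate_coefficients` are hypothetical. Exact integer factorials keep it simple but limit it to small degrees.

```python
import cmath
import math


def wigner_d(l, m, n, beta):
    """Small-d element d^l_{mn}(beta) by direct summation over t."""
    f = math.factorial
    norm = math.sqrt(f(l + n) * f(l - n) * f(l + m) * f(l - m))
    c, s = math.cos(beta / 2), math.sin(beta / 2)
    total = 0.0
    # Bounds keep every factorial argument below non-negative.
    for t in range(max(0, n - m), min(l + n, l - m) + 1):
        denom = f(l + n - t) * f(l - m - t) * f(t + m - n) * f(t)
        total += ((-1) ** t * norm / denom
                  * c ** (2 * l + n - m - 2 * t)
                  * s ** (2 * t + m - n))
    return total


def rotate_coefficients(coeffs, alpha, beta, gamma):
    """Return rotated coefficients; coeffs maps (l, m) -> complex (assumed layout)."""
    l_max = max(l for l, _ in coeffs)
    rotated = {}
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            acc = 0j
            for n in range(-l, l + 1):
                big_d = (cmath.exp(-1j * gamma * n)
                         * wigner_d(l, m, n, beta)
                         * cmath.exp(-1j * alpha * m))
                acc += big_d * coeffs[(l, n)]
            rotated[(l, m)] = acc
    return rotated
```

With all three angles set to zero the function returns the input unchanged, and for any angles the sum of squared magnitudes within each degree l is preserved, which makes a convenient sanity check.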
RMSD
• RMSD (Root Mean Square Distance): distance between two SPHARM models

$$\mathrm{RMSD} = \sqrt{\frac{1}{4\pi} \sum_{l=0}^{L_{\max}} \sum_{m=-l}^{l} \left\| c_{1,l}^{m} - c_{2,l}^{m} \right\|^{2}}$$

where $c_{1,l}^{m}$ and $c_{2,l}^{m}$ are coefficients of two SPHARM models
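As a rough illustration of the RMSD between two models, the sketch below sums squared coefficient differences over all (l, m) and divides by 4π before taking the square root. The data layout is an assumption made for the example: each coefficient is taken to be a triple of complex numbers (one per x, y, z component), stored in a dictionary keyed by (l, m). The function name `rmsd` is hypothetical.

```python
import math


def rmsd(coeffs1, coeffs2, l_max):
    """RMSD between two coefficient sets up to degree l_max.

    Assumed layout: coeffs[(l, m)] is a sequence of three complex numbers.
    """
    total = 0.0
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            a, b = coeffs1[(l, m)], coeffs2[(l, m)]
            # Squared norm of the difference of the two coefficient vectors.
            total += sum(abs(x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / (4 * math.pi))
```

Registration then amounts to searching for the Euler angles that minimize this value between the rotated object and the template.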
Matlab implementation
• A straightforward implementation in Matlab:
    for l = 0, Lmax
      for m = -l, l
        for n = -l, l
          for t = max(0, n-m), min(l+m, l-n)
            ... performing calculations ...
• One rotation for Lmax = 50 took 823 seconds on a 2 GHz quad-core Intel Xeon E5335
Cell B.E.
Cell implementation
• Domain decomposition:
    for l = 0, Lmax
      for m = -l, l
        for n = -l, l
          for t = max(0, n-m), min(l+m, l-n)
            ... calculations ...
• Decomposition along l leads to workload imbalance among SPUs
• Decomposition along m creates unnecessary data communication
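The imbalance from splitting along l can be made concrete with a small count. The sketch below ignores the innermost t loop and counts only the (m, n) pairs per degree, which is (2l + 1)^2. It then assigns contiguous blocks of l to each SPE. The contiguous-block assignment and the function names are assumptions chosen for illustration, not a description of what the authors tried.

```python
def pairs_per_degree(l):
    """Number of (m, n) pairs visited for a given degree l."""
    return (2 * l + 1) ** 2


def block_loads(l_max, num_spes):
    """Work per SPE when contiguous blocks of l are handed out (assumed scheme)."""
    degrees = list(range(l_max + 1))
    base, extra = divmod(len(degrees), num_spes)
    loads, start = [], 0
    for i in range(num_spes):
        size = base + (1 if i < extra else 0)
        block = degrees[start:start + size]
        loads.append(sum(pairs_per_degree(l) for l in block))
        start += size
    return loads
```

For example, `block_loads(50, 8)` gives the first SPE the degrees 0 through 6 and the last SPE the degrees 45 through 50, so the last SPE receives far more (m, n) pairs than the first.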
Cell implementation
• Loop fusion:
    for l = 0, Lmax
      for m = -l, l
        for n = -l, l
          for t = max(0, n-m), min(l+m, l-n)
            ... calculations ...
• Unique index for combined loop: f(l, m) = l^2 + m + l
• Workload for each SPE: (Lmax + 1)^2 / (total # of SPEs)
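The fused index and the per-SPE split can be sketched as follows. The index f(l, m) = l^2 + m + l runs contiguously from 0 to (Lmax + 1)^2 - 1, so it can be inverted with an integer square root and cut into nearly equal ranges. The helper names and the way the remainder is spread over the first few SPEs are assumptions for illustration.

```python
import math


def fused_index(l, m):
    """Map (l, m) with -l <= m <= l to a unique index k = l^2 + m + l."""
    return l * l + m + l


def unfuse_index(k):
    """Recover (l, m) from the fused index k."""
    l = math.isqrt(k)
    return l, k - l * l - l


def spe_ranges(l_max, num_spes):
    """Split the (l_max + 1)^2 fused indices into half-open ranges, one per SPE."""
    total = (l_max + 1) ** 2
    base, extra = divmod(total, num_spes)
    ranges, start = [], 0
    for i in range(num_spes):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each SPE would then loop over its own range of k, call `unfuse_index(k)` to recover (l, m), and run the inner n and t loops for that pair.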
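Cell implementation
• Lookup table T for factorial
• Transform exponentials & multiplications into multiplications & additions, respectively.

$$
\begin{aligned}
d_{mn}^{l}(\beta) &= \sum_{t=\max(0,\,n-m)}^{\min(l+n,\,l-m)} (-1)^t\, \frac{\sqrt{(l+n)!\,(l-n)!\,(l+m)!\,(l-m)!}}{(l+n-t)!\,(l-m-t)!\,(t+m-n)!\,t!} \left(\cos\frac{\beta}{2}\right)^{2l+n-m-2t} \left(\sin\frac{\beta}{2}\right)^{2t+m-n} \\
&= \sum_{t=\max(0,\,n-m)}^{\min(l+n,\,l-m)} (-1)^t \exp\Big( \tfrac{1}{2}\big(T(l+n) + T(l-n) + T(l+m) + T(l-m)\big) \\
&\qquad - T(l+n-t) - T(l-m-t) - T(t+m-n) - T(t) \\
&\qquad + (2l+n-m-2t)\,\log\cos\frac{\beta}{2} + (2t+m-n)\,\log\sin\frac{\beta}{2} \Big)
\end{aligned}
$$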
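A minimal sketch of the lookup-table idea follows. It assumes the table stores T(k) = log(k!), which is what the exponential form requires, and it assumes 0 < β < π so that both logarithms are defined. The function names are hypothetical, and the summation bounds are the ones that keep every table index non-negative.

```python
import math


def build_log_factorial_table(k_max):
    """T[k] = log(k!) for k = 0..k_max, built with one addition per entry."""
    table = [0.0] * (k_max + 1)
    for k in range(1, k_max + 1):
        table[k] = table[k - 1] + math.log(k)
    return table


def wigner_d_log(l, m, n, beta, T):
    """d^l_{mn}(beta) using table lookups and one exp per term (0 < beta < pi)."""
    log_c = math.log(math.cos(beta / 2))
    log_s = math.log(math.sin(beta / 2))
    # The square root of the factorial product becomes half a sum of logs.
    half = 0.5 * (T[l + n] + T[l - n] + T[l + m] + T[l - m])
    total = 0.0
    for t in range(max(0, n - m), min(l + n, l - m) + 1):
        exponent = (half
                    - T[l + n - t] - T[l - m - t] - T[t + m - n] - T[t]
                    + (2 * l + n - m - 2 * t) * log_c
                    + (2 * t + m - n) * log_s)
        total += (-1) ** t * math.exp(exponent)
    return total
```

The table needs entries up to 2 * Lmax, for example `T = build_log_factorial_table(100)` for Lmax = 50. Working with logarithms also avoids forming large factorials directly: 35! already exceeds the range of a single-precision float.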
Cell implementation
• Other optimizations specific to Cell:
  • Vectorization & data alignment
  • DMA data transfer between main memory & local store
  • SPU decrementer
Cell implementation
• Single precision vs. double precision: all data in single precision
Cell implementation
• Single precision vs. double precision: partial data in double precision
Cell implementation
• Single precision vs. double precision: all critical data in double precision
Performance analysis
[Figure: Performance of one rotation on Cell BE. x-axis: Number of SPEs (1, 2, 4, 8, 16); y-axis: Time (seconds).]
Performance analysis
[Figure: Performance of finding the shortest distance at Level 3 on Cell BE. x-axis: Number of SPEs (4, 8, 12, 16); y-axis: Time (seconds); series: GNU gcc, IBM xlc.]
Conclusion
• Performance increases dramatically on Cell due to its unique architecture and algorithm optimization.
• Care must be taken with data placement due to the limited local store.
• Care must also be taken with data transfer between local store and main memory.
The End
Questions?