RaVioli: A Parallel Video Processing Library with Auto Resolution Adjustability
Post on 01-Nov-2014
RaVioli: A Parallel Video Processing Library with Auto Resolution Adjustability
Hiroko SAKURAI†, Masaomi OHNO†, Shintaro OKADA‡, Tomoaki TSUMURA†, Hiroshi MATSUO†
† Nagoya Institute of Technology, Japan
‡ Toyota Motor Corp., Japan
IADIS International Conference APPLIED COMPUTING 2009
November 19 – 21, 2009, Rome, Italy
Background (1/2): Portability of Video Applications
• Real-time video processing applications
  – should run on a great variety of platforms
    • Cell phones
    • Cars
    • PCs
  – Principal goal of an application
    • Long battery life
    • High throughput
    • Good accuracy
Applied Computing 2009 2
We must rewrite a video processing program when porting it to another platform.
Background (2/2): Many-Core Era is Coming
• Multi/many-core processors have come into wide use
• Video processing applications
  – have various kinds of parallelism
    • Pixels in video frames have data parallelism
    • Multiple frames can be processed in parallel by pipelining
  – promise good performance on such parallel systems
Parallelizing programs is not so simple. It becomes much more important to improve compilers and libraries.
A Video Processing Library: RaVioli
• RaVioli provides:
  – Easy writeability of
    • pseudo real-time video processing
  – Interfaces for parallelization
    • Detecting data dependencies and formulating reductions
    • Balancing loads of pipeline stages
Outline
• Concept of RaVioli
  – RaVioli hides resolutions from programmers
  – Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
  – Automatic block decomposition
  – Pipelining interface with automatic load balance mechanism
• Evaluation results
Traditional Image Processing Program
• Image processing program written in traditional C
    int main(){
        // Input image
        int luma;
        for(int y=0;y<180;y++){
            for(int x=0;x<200;x++){
                // weighted sum of R, G, B gives the luma value
                luma = (int)( InImg[x][y].R*0.299
                             +InImg[x][y].G*0.587
                             +InImg[x][y].B*0.114);
                OutImg[x][y].R = luma;
                OutImg[x][y].G = luma;
                OutImg[x][y].B = luma;
            }
        }
    }
[Figure: InImg is converted into OutImg]
Image Processing Program with RaVioli
• Grayscale program using RaVioli
    // Component function: converts one pixel to grayscale
    RV_Pixel GrayScale(RV_Pixel Pix){
        int luma;
        luma=(int)( Pix.R()*0.299
                   +Pix.G()*0.587
                   +Pix.B()*0.114);
        return(Pix.setRGB(luma, luma, luma));
    }
    int main(){
        RV_Image InImg,OutImg;
        // Input image
        OutImg=InImg.procPix(GrayScale);   // higher-order method procPix
    }
[Figure: the higher-order method procPix applies the component function GrayScale to RV_Image InImg and produces RV_Image OutImg]
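The grayscale program above relies on a higher-order method that walks the image on the programmer's behalf. As a rough illustration of that idea only, the sketch below uses hypothetical stand-in types `Pixel` and `Image` (they are not RaVioli's `RV_Pixel` and `RV_Image`, whose internals the slides do not show). Because the loop bounds live inside `procPix`, the component function never mentions the resolution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pixel; not RaVioli's RV_Pixel.
struct Pixel {
    int r, g, b;
};

// Hypothetical stand-in for an image; not RaVioli's RV_Image.
class Image {
public:
    Image(std::size_t width, std::size_t height)
        : width_(width), height_(height), data_(width * height, Pixel{0, 0, 0}) {}

    Pixel& at(std::size_t x, std::size_t y) {
        assert(x < width_ && y < height_);
        return data_[y * width_ + x];
    }

    // Higher-order method: applies the component function to every pixel.
    template <typename F>
    Image procPix(F component) const {
        Image out(width_, height_);
        for (std::size_t i = 0; i < data_.size(); ++i) {
            out.data_[i] = component(data_[i]);
        }
        return out;
    }

private:
    std::size_t width_, height_;
    std::vector<Pixel> data_;
};

// Component function using the same luma weights as the slide.
Pixel grayScale(Pixel p) {
    int luma = static_cast<int>(p.r * 0.299 + p.g * 0.587 + p.b * 0.114);
    return Pixel{luma, luma, luma};
}
```

A caller would write `Image out = in.procPix(grayScale);`, which mirrors the shape of the RaVioli call without claiming to reproduce its implementation.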
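Video Processing Program with RaVioli
• Video processing program with RaVioli
[Figure: Grayscale. A higher-order method of the RV_Video object applies a frame-level component function to each RV_Image object, and a higher-order method of each RV_Image object applies a pixel-level component function to each pixel]
    RV_Image GrayScale(RV_Image img){
    }
    RV_Pixel GrayScale(RV_Pixel p){}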
Auto-Adjustment of Computation Load
• Spatial resolution (pixel rate)
  – Ss: Spatial stride
• Temporal resolution (frame rate)
  – St: Temporal stride
[Figure: raising the spatial stride from Ss=1 to Ss=2 cuts the load to 1/4; raising the temporal stride from St=1 to St=2 cuts it to 1/2]
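The 1/4 and 1/2 figures follow directly from how a stride skips samples. The sketch below is not RaVioli code; `pixelsVisited` and `framesVisited` are hypothetical helpers that only count how many pixels or frames a strided loop would touch.

```cpp
#include <cassert>
#include <cstddef>

// Counts the pixels visited in a width x height frame when only every
// spatialStride-th pixel is processed in both x and y.
std::size_t pixelsVisited(std::size_t width, std::size_t height,
                          std::size_t spatialStride) {
    assert(spatialStride >= 1);
    std::size_t count = 0;
    for (std::size_t y = 0; y < height; y += spatialStride) {
        for (std::size_t x = 0; x < width; x += spatialStride) {
            ++count;  // a real program would call the component function here
        }
    }
    return count;
}

// Counts the frames processed out of `frames` consecutive frames when only
// every temporalStride-th frame is taken.
std::size_t framesVisited(std::size_t frames, std::size_t temporalStride) {
    assert(temporalStride >= 1);
    std::size_t count = 0;
    for (std::size_t t = 0; t < frames; t += temporalStride) {
        ++count;
    }
    return count;
}
```

For a 200x180 frame, a spatial stride of 2 visits 100x90 pixels, a quarter of the original 36000; a temporal stride of 2 halves the number of frames processed.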
Priority Set
• Which stride should be increased?
  – Moving object detection: temporal resolution has priority
  – Pattern recognition: spatial resolution has priority
• (Spatial resolution, Temporal resolution) =
  – (7,3): keep spatial stride and temporal stride in the ratio of 3:7
  – (1,0): keep spatial stride at 1
We can specify resolution priorities with a priority set.
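The slides do not say how RaVioli decides which stride to raise, so the following is only a sketch under an assumed rule: increase whichever stride is lagging behind the target ratio Ss:St = (temporal priority):(spatial priority). The names `Strides` and `increaseOneStride` are hypothetical.

```cpp
#include <cassert>

// Current strides; 1 means full resolution.
struct Strides {
    int spatial;
    int temporal;
};

// Assumed rule (not taken from RaVioli): the target is
// spatial * spatialPriority == temporal * temporalPriority,
// so the stride on the smaller side of that comparison is increased.
Strides increaseOneStride(Strides s, int spatialPriority, int temporalPriority) {
    assert(spatialPriority >= 0 && temporalPriority >= 0);
    assert(spatialPriority + temporalPriority > 0);
    if (s.spatial * spatialPriority <= s.temporal * temporalPriority) {
        ++s.spatial;
    } else {
        ++s.temporal;
    }
    return s;
}
```

Under this rule a priority set of (1,0) never touches the spatial stride, and (7,3) drifts toward a 3:7 ratio of spatial to temporal stride, which matches the two cases listed on the slide.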
Detecting Overload
[Figure: the RV_Video class holds RV_Image instances in a ring buffer and passes them to the image processing program through a higher-order method]
• Frame interval < Processing time: Overloaded!
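The overload test on this slide is a single comparison between two durations. A minimal sketch using `std::chrono` follows; the helper names `frameInterval`, `isOverloaded`, and `measure` are hypothetical and are not part of RaVioli.

```cpp
#include <cassert>
#include <chrono>

using Micro = std::chrono::microseconds;

// Interval between frames arriving at `fps` frames per second.
Micro frameInterval(int fps) {
    assert(fps > 0);
    return Micro(1000000 / fps);
}

// Overloaded when processing one frame takes longer than the frame interval.
bool isOverloaded(Micro processingTime, Micro interval) {
    return processingTime > interval;
}

// Measures how long `work` takes with a monotonic clock.
template <typename Work>
Micro measure(Work work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<Micro>(end - start);
}
```

A processing loop could call `isOverloaded(measure(processFrame), frameInterval(30))` after each frame and raise a stride when it returns true; `processFrame` here stands for whatever per-frame work the application performs.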
Parallelization: Block Decomposition
Image processing with C/C++
    typedef unsigned char byte;
    struct RGB { byte R, G, B; };
    int main(){
        RGB  InImg[200][180];    // indexed as [x][y]
        byte OutImg[200][180];
        for( int y=0; y<180; y++ ){
            for( int x=0; x<200; x++ ){
                OutImg[x][y]=(int)( InImg[x][y].R*0.299
                                   +InImg[x][y].G*0.587
                                   +InImg[x][y].B*0.114);
            }
        }
    }
Image processing with RaVioli
    RV_Pix GrayScale(RV_Pix Pix){
        int Y;
        Y = (int)(Pix.R()*0.299+Pix.G()*0.587+Pix.B()*0.114);
        return( Pix.setRGB(Y, Y, Y) );
    }
    int main(){
        RV_Img InImg, OutImg;
        OutImg = InImg.procPix(GrayScale);
    }
Parallelization: Block Decomposition
Image processing with RaVioli
• The call
    OutImg = InImg.procPix(GrayScale);
  becomes
    OutImg = InImg.procPix(GrayScale, 4);
[Figure: InImg is divided into four blocks, processed by thread1, thread2, thread3, and thread4]
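How RaVioli partitions the image internally is not shown on the slides, so the sketch below is only one plausible reading of `procPix(GrayScale, 4)`: split the pixel range into contiguous blocks and give each block to its own thread. `procBlocks` is a hypothetical function built on `std::thread`, operating on a flat `std::vector` rather than on an `RV_Img`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Applies `component` to every element of `in`, splitting the index range
// into `numThreads` contiguous blocks, one per thread.
template <typename T, typename F>
std::vector<T> procBlocks(const std::vector<T>& in, F component,
                          std::size_t numThreads) {
    assert(numThreads >= 1);
    std::vector<T> out(in.size());
    std::vector<std::thread> workers;
    // Ceiling division so that the last block absorbs any remainder.
    const std::size_t blockSize = (in.size() + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(in.size(), t * blockSize);
        const std::size_t end = std::min(in.size(), begin + blockSize);
        // Each thread writes only to its own block, so no locking is needed.
        workers.emplace_back([&in, &out, component, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = component(in[i]);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return out;
}
```

This works without synchronization only because the component function touches one pixel at a time and has no shared state; a component function that updates a global variable needs a reduction step.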
Translator for Block Decomposition
[Figure: the translator parallelizes the program; the component function GrayScale is left unchanged and only the call in main is rewritten]
    Before:  OutImg = InImg.procPix(GrayScale);
    After:   OutImg = InImg.procPix(GrayScale, 4);
• Reduction operations may be required
For Reference: Example Code with OpenMP
• OpenMP
  – Standardized model of parallel programming for C/C++ and Fortran
    #define NUM_THREADS 4
    int i; int sum=0;
    #pragma omp parallel for num_threads(NUM_THREADS) reduction(+:sum)
    for(i=1;i<=256;i++) sum += i;
[Figure: the loop is divided among Process 1 to Process 4, which accumulate sum1 to sum4; the partial sums are then combined into sum]
• Reduction pragma: reduction(+:sum)
Reduction Operations Can Be Automatically Added
Original program:
    int sum = 0;
    void pixSum(RV_Pixel p){
        sum += 1;
    }
    int main(){
        RV_Image InputImg;
        // read image data in "InputImg"
        InputImg.procPix(pixSum);
    }
• The translator checks the update "sum += 1": associative law OK, commutative law OK, so it can be treated as a reduction operation
• "sum += 1;" is split into "_localsum += 1;" and "sum += _localsum;"
Translated fragments:
    __thread int _localsum = 0;

    void pixSum(RV_Pixel p){        // component function
        _localsum += 1;
    }

    void __pixSum(int threadNum){   // reduction operation
        mutex_lock(&Mutex);
        sum += _localsum;
        mutex_unlock(&Mutex);
    }

    InputImg.procPix(pixSum, 4);
    InputImg.reduction(__pixSum);
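The local-sum-then-merge pattern on this slide can be tried out in standard C++. The sketch below is not translator output: `countPixels` is a hypothetical function, and `std::thread` with `std::mutex` stands in for the `__thread` variable and the `mutex_lock` calls shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Counts `numPixels` pixels with `numThreads` threads. Each thread first
// accumulates into a private local sum and merges it into the shared sum
// once, under a mutex.
long countPixels(std::size_t numPixels, std::size_t numThreads) {
    assert(numThreads >= 1);
    long sum = 0;          // shared reduction variable
    std::mutex sumMutex;   // protects `sum`
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < numThreads; ++t) {
        // Thread t handles pixels t, t + numThreads, t + 2 * numThreads, ...
        workers.emplace_back([&sum, &sumMutex, numPixels, numThreads, t] {
            long localSum = 0;  // plays the role of _localsum on the slide
            for (std::size_t i = t; i < numPixels; i += numThreads) {
                localSum += 1;
            }
            std::lock_guard<std::mutex> lock(sumMutex);
            sum += localSum;    // reduction step
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return sum;
}
```

Because addition is associative and commutative, the order in which threads reach the merge step does not change the result, which is exactly the property the translator checks before applying this rewrite.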
Assisting Pipeline Implementation
• For building a pipeline
  – The whole process is split into several stages
  – Several threads are created and assigned to the stages
  – FIFOs must be implemented and managed for data transfer between stages
[Figure: stages binarize, edgedetect, and houghtrans run on thread1, thread2, and thread3 and are connected by FIFO1, FIFO2, and FIFO3]
Creating threads and FIFOs
• is not the essence of video processing
• is troublesome for programmers
Interface for Pipelining
    RV_Pipedata* GrayScale(RV_Pipedata* data){
        // Grayscale processing for a frame
        return data;
    }
    RV_Pipedata* Laplacian(RV_Pipedata* data){
        // Laplacian filter processing for a frame
        return data;
    }
    int main (){
        RV_Pipeline pipe;
        pipe.push(GrayScale);
        pipe.push(Laplacian);
        pipe.run();
        return 0;
    }
[Figure: push registers GrayScale and Laplacian as stages of RV_Pipeline pipe; run executes them on thread1 and thread2, connected by FIFO1 and FIFO2]
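To show the plumbing that such an interface hides, here is a minimal sketch of a thread-per-stage pipeline connected by blocking FIFOs. It is an illustration written for this transcript, not RaVioli's implementation: `Fifo` and `runPipeline` are hypothetical names, and plain `int` values stand in for `RV_Pipedata*` frames.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Blocking FIFO connecting two pipeline stages.
template <typename T>
class Fifo {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }
    // Signals that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item is available; returns false once the FIFO is
    // closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// Runs `stages` over `frames` with one thread per stage; FIFO s feeds stage s.
std::vector<int> runPipeline(const std::vector<std::function<int(int)>>& stages,
                             const std::vector<int>& frames) {
    std::vector<Fifo<int>> fifos(stages.size() + 1);
    std::vector<std::thread> workers;
    for (std::size_t s = 0; s < stages.size(); ++s) {
        workers.emplace_back([&stages, &fifos, s] {
            int frame;
            while (fifos[s].pop(frame)) fifos[s + 1].push(stages[s](frame));
            fifos[s + 1].close();  // propagate end-of-stream downstream
        });
    }
    for (int f : frames) fifos[0].push(f);
    fifos[0].close();
    std::vector<int> results;
    int frame;
    while (fifos.back().pop(frame)) results.push_back(frame);
    for (std::thread& w : workers) w.join();
    assert(results.size() == frames.size());
    return results;
}
```

Even this small version needs a queue class, end-of-stream signalling, and thread management, which is the boilerplate the three-line `push`/`push`/`run` interface is meant to spare the programmer.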
Load Imbalance between Stages
[Figure: stages A, B, and C run on thread1, thread2, and thread3 over frame1, frame2, and frame3; stage C falls behind stages A and B, and the pipeline stalls]
Automatic Load Balancing
[Figure: the threads are reassigned for frame1, frame2, and frame3; thread1 runs both stage A and stage B, while thread2 and thread3 both run stage C]
Evaluation: Resolution Adjustment
[Figure: three plots, one per priority set (spatial resolution : temporal resolution = 0:1, 1:0, and 3:7), showing frame rate (fps) and number of pixels (20k to 80k) over time (0 to 35 sec)]
Evaluation: Parallelization Functions
    OS                                  Solaris 10
    CPU                                 UltraSPARC T1
    Frequency                           1.0GHz
    Number of cores                     8
    Number of active threads per core   4
    Memory                              16GB
    Compiler                            Sun Studio 12 (Sun C++ 5.9)
    Compiler options                    -fast -m64 -xchip=ultraT1
    Thread library                      pthreads
Evaluation: Auto Block Decomposition
[Figure: speedup ratio (0 to 20) versus number of threads (0 to 32) for hough, pixAverage, laplacian, and voronoi]
Evaluation: Hough transform
[Figure: bar chart for 2, 4, 8, 16, and 32 threads, vertical axis from 0.00 to 1.00; legend: Reduction overhead, Reduction variable initialization, Reduction operations, hough]
Evaluation: Automatic load balancing
                                  w/o load balancing   w/ load balancing
    Pipeline status               [diagram]            [diagram]
    Image                         [image]              [image]
    Spatial resolution            51x51                170x170
    Spatial resolution stride     11                   4
    Temporal resolution stride    1                    1
Conclusion
• RaVioli
  – hides resolutions from programmers
    • pseudo real-time processing
  – has semi-automatic parallelization functions
    • semi-automatic block decomposition
    • load balancing mechanism between pipeline stages
• Our future work
  – implementing an automatic power-saving function in RaVioli
  – making RaVioli adaptive to various platforms such as Cell Broadband Engine
  – designing an easy-to-write language that cooperates with RaVioli
Automatic Load Balancing
[Figure: stages A, B, and C initially run on thread1, thread2, and thread3, with a Manager attached to the pipeline; the Manager records the loads A:1, B:1, C:4, after which thread1 runs stage B while thread2 and thread3 both run stage C]
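The Manager's loads A:1, B:1, C:4 suggest why stage C ends up with two threads. The slides do not describe the Manager's actual algorithm, so the following is only a back-of-the-envelope sketch: `idealThreadShare` is a hypothetical helper that divides the available threads in proportion to each stage's load.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Returns, for each stage, the fraction of `numThreads` it would receive if
// threads were divided strictly in proportion to the measured loads.
std::vector<double> idealThreadShare(const std::vector<double>& loads,
                                     int numThreads) {
    assert(numThreads >= 1);
    const double total = std::accumulate(loads.begin(), loads.end(), 0.0);
    assert(total > 0.0);
    std::vector<double> share;
    share.reserve(loads.size());
    for (double load : loads) {
        share.push_back(numThreads * load / total);
    }
    return share;
}
```

With loads 1, 1, 4 and three threads, the proportional shares are 0.5, 0.5, and 2, which is consistent with the figure: stages A and B can share one thread while stage C receives two.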