RaVioli: a parallel video processing library with auto resolution adjustability

Post on 01-Nov-2014

RaVioli: A Parallel Video Processing Library with Auto Resolution Adjustability
Hiroko SAKURAI†, Masaomi OHNO†, Shintaro OKADA‡, Tomoaki TSUMURA†, Hiroshi MATSUO†
† Nagoya Institute of Technology, Japan
‡ Toyota Motor Corp., Japan
IADIS International Conference APPLIED COMPUTING 2009
November 19 – 21, 2009
Rome, Italy

Background (1/2): Portability of Video Applications
• Real-time video processing applications
– should run on a great variety of platforms
  • Cell phones
  • Cars
  • PCs
– Principal goal of an application
  • Long battery life
  • High throughput
  • Good accuracy

Applied Computing 2009 2

We must rewrite a video processing program when porting it to another platform.

Background (2/2): Many-Core Era is Coming
• Multi/many-core processors have come into wide use
• Video processing applications
– have various parallelisms
  • Pixels in video frames have data parallelism
  • Multiple frames can be processed in parallel by pipelining
– promise good performance on such parallel systems

Parallelizing programs is not so simple. It becomes much more important to improve compilers and libraries.

A Video Processing Library: RaVioli
• RaVioli provides:
– Easy writeability of
  • pseudo real-time video processing
– Interfaces for parallelization
  • Detecting data dependencies and formulating reductions
  • Balancing loads of pipeline stages

Outline
• Concept of RaVioli
– RaVioli hides resolutions from programmers
– Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
– Automatic block decomposition
– Pipelining interface with automatic load balance mechanism
• Evaluation results

Traditional Image Processing Program
• Image processing program written in traditional C

int main(){
  // Input image
  int luma;
  for(int y=0;y<180;y++){
    for(int x=0;x<200;x++){
      luma = (int)( InImg[x][y].R*0.299
                   +InImg[x][y].G*0.587
                   +InImg[x][y].B*0.114);
      OutImg[x][y].R = luma;
      OutImg[x][y].G = luma;
      OutImg[x][y].B = luma;
    }
  }
}

(Figure: input image InImg and output image OutImg)

Image Processing Program with RaVioli
• Grayscale program using RaVioli

// Component function
RV_Pixel GrayScale(RV_Pixel Pix){
  int luma;
  luma=(int)( Pix.R()*0.299
             +Pix.G()*0.587
             +Pix.B()*0.114);
  return(Pix.setRGB(luma, luma, luma));
}
int main(){
  RV_Image InImg,OutImg;
  // Input image
  OutImg=InImg.procPix(GrayScale);  // procPix: higher-order method
}

(Figure: the higher-order method procPix applies the component function GrayScale to RV_Image InImg and yields RV_Image OutImg)

Video Processing Program with RaVioli
• Video processing program with RaVioli

(Figure labels: an RV_Video obj and an RV_Image obj, each with a higher-order method; the component functions passed to them are RV_Image GrayScale(RV_Image img){ } for frames and RV_Pixel GrayScale(RV_Pixel p){ } for pixels)

Outline
• Concept of RaVioli
– RaVioli hides resolutions from programmers
– Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
– Automatic block decomposition
– Pipelining interface with automatic load balance mechanism
• Evaluation results

Auto-Adjustment of Computation Load
• Spatial resolution (pixel rate)
– Ss: Spatial stride
• Temporal resolution (frame rate)
– St: Temporal stride

(Figure: raising Ss from 1 to 2 leaves 1/4 of the pixels; raising St from 1 to 2 leaves 1/2 of the frames)
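As a rough illustration of how the two strides reduce the load, the following C++ sketch counts how many pixels and frames are visited for given strides. The function names pixelsVisited and framesVisited are hypothetical and are not part of RaVioli; both strides are assumed to be at least 1.

```cpp
#include <cstddef>

// Hypothetical helper (not a RaVioli function): number of pixels visited
// when only every Ss-th pixel is processed in both x and y. Assumes Ss >= 1.
std::size_t pixelsVisited(std::size_t width, std::size_t height, std::size_t Ss) {
    std::size_t count = 0;
    for (std::size_t y = 0; y < height; y += Ss) {
        for (std::size_t x = 0; x < width; x += Ss) {
            ++count;  // a real program would process pixel (x, y) here
        }
    }
    return count;
}

// Hypothetical helper: number of frames visited when only every St-th
// frame is processed. Assumes St >= 1.
std::size_t framesVisited(std::size_t frames, std::size_t St) {
    std::size_t count = 0;
    for (std::size_t t = 0; t < frames; t += St) {
        ++count;  // a real program would process frame t here
    }
    return count;
}
```

For a 200x180 frame, pixelsVisited(200, 180, 2) visits 9000 of the 36000 pixels, which matches the 1/4 in the figure, and framesVisited(30, 2) visits 15 of 30 frames.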

Priority Set
• Which stride should be increased?
• (Spatial resolution, Temporal resolution) =
– (7,3): keep spatial stride and temporal stride in the ratio of "3:7"
– (1,0): keep spatial stride "1"
• Moving object detection: temporal resolution
• Pattern recognition: spatial resolution

We can specify resolution priorities by a priority set.
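One possible way to decide which stride to raise is sketched below. This is only an assumption about how such a rule could look, not RaVioli's actual algorithm: the struct Strides and the function raiseStride are hypothetical names. The rule keeps Ss:St close to the inverse of the priority ratio by comparing spatialPriority * Ss with temporalPriority * St.

```cpp
// Hypothetical types and names for illustration; not the RaVioli API.
struct Strides {
    int Ss;  // spatial stride
    int St;  // temporal stride
};

// Called once each time the load must be reduced. Raises the stride that
// moves Ss:St toward temporalPriority:spatialPriority, so (7,3) tends to
// Ss:St = 3:7 and (1,0) never raises the spatial stride.
Strides raiseStride(Strides s, int spatialPriority, int temporalPriority) {
    if (spatialPriority * s.Ss <= temporalPriority * s.St) {
        ++s.Ss;  // spatial resolution is currently "over-served": drop pixels
    } else {
        ++s.St;  // otherwise drop frames
    }
    return s;
}
```

Starting from Ss=1, St=1 with the priority set (7,3), three calls raise St twice and then Ss once, giving Ss=2, St=3. With (1,0) every call raises St, so the spatial stride stays at 1.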

Detecting Overload

(Figure: the RV_Video class keeps RV_Image instances in a ring buffer and passes them to the higher-order method of the image processing program. When frame interval < processing time, the program is overloaded.)
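The overload condition in the figure can be sketched in C++ with std::chrono. The names Millis, isOverloaded, and measure are hypothetical helpers chosen for this example and are not RaVioli functions; how RaVioli actually measures time is not stated on the slide.

```cpp
#include <chrono>

// Durations in milliseconds with a floating-point count.
using Millis = std::chrono::duration<double, std::milli>;

// Hypothetical check: a frame is overloaded when processing it takes
// longer than the interval at which frames arrive.
bool isOverloaded(Millis frameInterval, Millis processingTime) {
    return frameInterval < processingTime;
}

// Hypothetical helper: run a callable once and return the elapsed time.
template <class F>
Millis measure(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    return std::chrono::steady_clock::now() - start;
}
```

For a 30 fps source the frame interval is about 33.3 ms, so isOverloaded(Millis(33.3), measure(processFrame)) would report whether a hypothetical processFrame callable keeps up.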

Outline
• Concept of RaVioli
– RaVioli hides resolutions from programmers
– Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
– Automatic block decomposition
– Pipelining interface with automatic load balance mechanism
• Evaluation results of our work

Parallelization: Block Decomposition

Image processing with C/C++:

struct RGB { unsigned char R, G, B; };
int main(){
  RGB InImg[180][200];             // 180 rows (y) by 200 columns (x)
  unsigned char OutImg[180][200];
  // Input image
  for( int y=0; y<180; y++ ){
    for( int x=0; x<200; x++ ){
      OutImg[y][x]=(unsigned char)( InImg[y][x].R*0.299
                                   +InImg[y][x].G*0.587
                                   +InImg[y][x].B*0.114);
    }
  }
}

Image processing with RaVioli:

RV_Pix GrayScale(RV_Pix Pix){
  int Y;
  Y = (int)(Pix.R()*0.299+Pix.G()*0.587+Pix.B()*0.114);
  return( Pix.setRGB(Y, Y, Y) );
}
int main(){
  RV_Img InImg, OutImg;
  OutImg = InImg.procPix(GrayScale);
}

Parallelization: Block Decomposition

Image processing with RaVioli:

RV_Pix GrayScale(RV_Pix Pix){
  int Y;
  Y = (int)(Pix.R()*0.299+Pix.G()*0.587+Pix.B()*0.114);
  return( Pix.setRGB(Y, Y, Y) );
}
int main(){
  RV_Img InImg,OutImg;
  OutImg = InImg.procPix(GrayScale);
}

Passing the number of threads as a second argument divides InImg into blocks for thread1 to thread4:

  OutImg = InImg.procPix(GrayScale, 4);
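To make the idea of block decomposition concrete, here is a minimal standalone C++ sketch that splits an image into horizontal blocks and processes each block on its own std::thread. It is an illustration under stated assumptions, not RaVioli's implementation: Pixel, grayScale, and procPixBlocks are hypothetical names, the image is a flat row-major vector, and numThreads is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical pixel type for this sketch (not RV_Pixel).
struct Pixel { int r, g, b; };

// Component function: same luma weights as the GrayScale slide code.
Pixel grayScale(Pixel p) {
    int luma = static_cast<int>(p.r * 0.299 + p.g * 0.587 + p.b * 0.114);
    return Pixel{luma, luma, luma};
}

// Applies f to every pixel, giving each thread one block of rows.
std::vector<Pixel> procPixBlocks(const std::vector<Pixel>& in,
                                 std::size_t width, std::size_t height,
                                 Pixel (*f)(Pixel), std::size_t numThreads) {
    std::vector<Pixel> out(in.size());
    std::vector<std::thread> workers;
    // Round up so that all rows are covered even if height % numThreads != 0.
    std::size_t rowsPerBlock = (height + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t yBegin = std::min(height, t * rowsPerBlock);
        std::size_t yEnd = std::min(height, yBegin + rowsPerBlock);
        // Blocks do not overlap, so no locking is needed for out.
        workers.emplace_back([&in, &out, width, f, yBegin, yEnd] {
            for (std::size_t i = yBegin * width; i < yEnd * width; ++i) {
                out[i] = f(in[i]);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return out;
}
```

Because each pixel is computed independently, the blocks need no synchronization here; that changes as soon as the component function updates shared state, which is where reductions come in.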

Translator for Block Decomposition

Before translation:

RV_Pix GrayScale(RV_Pix Pix){
  int Y;
  Y = (int)(Pix.R()*0.299+Pix.G()*0.587+Pix.B()*0.114);
  return( Pix.setRGB(Y, Y, Y) );
}
int main(){
  RV_Img InImg,OutImg;
  OutImg = InImg.procPix(GrayScale);
}

After translation (the translator parallelizes the procPix call):

RV_Pix GrayScale(RV_Pix Pix){
  int Y;
  Y = (int)(Pix.R()*0.299+Pix.G()*0.587+Pix.B()*0.114);
  return( Pix.setRGB(Y, Y, Y) );
}
int main(){
  RV_Img InImg,OutImg;
  OutImg = InImg.procPix(GrayScale, 4);
}

• Reduction operations may be required

For Reference: Example Code with OpenMP
• OpenMP
– Standardized model of parallel programming for C/C++ and Fortran

#define NUM_THREADS 4
int i; int sum=0;
#pragma omp parallel for num_threads(NUM_THREADS) reduction(+:sum)
for(i=1;i<=256;i++)
  sum += i;

(Figure: Process 1 to Process 4 each run for( ... ) over a part of the range, accumulating into sum1 to sum4, which are then combined into sum)

Reduction clause: reduction(+:sum)

Reduction Operations Can Be Automatically Added

Original program:

int sum = 0;
void pixSum(RV_Pixel p){
  sum += 1;
}
int main(){
  RV_Image InputImg;
  // read image data in "InputImg"
  InputImg.procPix(pixSum);
}

The translator checks whether the update "sum += 1;" satisfies the associative law and the commutative law. Both hold, so it is treated as a reduction operation: each thread accumulates with "_localsum += 1;" and the partial results are merged with "sum += _localsum;".

Fragments after translation:

__thread int _localsum = 0;
// Component function
void pixSum(RV_Pixel p){
  _localsum += 1;
}
void __pixSum(int threadNum){
  mutex_lock(&Mutex);
  sum += _localsum;
  mutex_unlock(&Mutex);
}
InputImg.procPix(pixSum, 4);
InputImg.reduction(__pixSum);
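The same pattern can be written as a standalone C++ sketch with thread_local storage and std::mutex. This is an illustration of the technique (thread-local partial sums merged under a lock), not the code that the RaVioli translator emits; localSum, reduceLocalSum, and countPixels are hypothetical names, and numThreads is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

int sum = 0;                    // shared reduction variable
thread_local int localSum = 0;  // per-thread partial result
std::mutex sumMutex;            // protects sum during the merge

// Component function: touches only the thread-local copy, so it needs no lock.
void pixSum() { localSum += 1; }

// Reduction step: each thread merges its partial result exactly once.
void reduceLocalSum() {
    std::lock_guard<std::mutex> lock(sumMutex);
    sum += localSum;
}

// Hypothetical driver: counts numPixels pixels using numThreads threads.
int countPixels(std::size_t numPixels, std::size_t numThreads) {
    sum = 0;
    std::vector<std::thread> workers;
    std::size_t chunk = (numPixels + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t begin = std::min(numPixels, t * chunk);
        std::size_t end = std::min(numPixels, begin + chunk);
        workers.emplace_back([begin, end] {
            localSum = 0;  // reduction variable initialization
            for (std::size_t i = begin; i < end; ++i) pixSum();
            reduceLocalSum();
        });
    }
    for (std::thread& w : workers) w.join();
    return sum;
}
```

The lock is taken once per thread rather than once per pixel, which is why the rewrite is only valid when the update is associative and commutative: the order in which partial sums are merged is not fixed.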

Outline
• Concept of RaVioli
– RaVioli hides resolutions from programmers
– Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
– Automatic block decomposition
– Pipelining interface with automatic load balance mechanism
• Evaluation results of our work

Assisting Pipeline Implementation
• For building a pipeline
– The whole process is split into several stages
– Several threads are created and assigned to the stages
– FIFOs must be implemented and managed for data transfer between stages

(Figure: stages binarize, edgedetect, and houghtrans run on thread1, thread2, and thread3, connected by FIFO1, FIFO2, and FIFO3)

Creating threads and FIFOs
• is not the essence of video processing
• is troublesome for programmers

Interface for Pipelining

RV_Pipedata* GrayScale(RV_Pipedata* data){
  // Grayscale processing for a frame
  return data;
}
RV_Pipedata* Laplacian(RV_Pipedata* data){
  // Laplacian filter processing for a frame
  return data;
}
int main (){
  RV_Pipeline pipe;
  pipe.push(GrayScale);
  pipe.push(Laplacian);
  pipe.run();
  return 0;
}

(Figure: push registers GrayScale and Laplacian in RV_Pipeline pipe; run executes them on thread1 and thread2, connected by FIFO1 and FIFO2)
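To show what such an interface has to manage internally, the following C++ sketch builds a small pipeline with one thread per stage and a blocking FIFO between stages. It is a simplified stand-in written for this example and not RaVioli's implementation: Fifo and Pipeline are hypothetical classes, and frames are plain int values instead of RV_Pipedata*.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking FIFO; close() signals that no more data will come.
template <class T>
class Fifo {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> l(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Returns false once the FIFO is closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// Hypothetical pipeline: push() registers stages, run() feeds all frames.
class Pipeline {
public:
    using Stage = std::function<int(int)>;
    void push(Stage s) { stages_.push_back(std::move(s)); }

    std::vector<int> run(const std::vector<int>& frames) {
        std::vector<Fifo<int>> fifos(stages_.size() + 1);
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < stages_.size(); ++i) {
            // Stage i reads from fifos[i] and writes to fifos[i + 1].
            threads.emplace_back([this, &fifos, i] {
                int frame;
                while (fifos[i].pop(frame)) fifos[i + 1].push(stages_[i](frame));
                fifos[i + 1].close();
            });
        }
        for (int f : frames) fifos[0].push(f);
        fifos[0].close();
        std::vector<int> out;
        int frame;
        while (fifos.back().pop(frame)) out.push_back(frame);
        for (std::thread& t : threads) t.join();
        return out;
    }
private:
    std::vector<Stage> stages_;
};
```

With two stages registered, for example one that adds 1 and one that doubles, run({1, 2, 3}) returns the frames in input order because each stage is served by a single thread. A single slow stage would still throttle the whole pipeline, which is the load imbalance problem.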

Load Imbalance between Stages

(Figure: stages A, B, and C run on thread1, thread2, and thread3 and process frame1, frame2, and frame3 in turn. Stage C takes longer than A and B, so the pipeline stalls.)

Automatic Load Balancing

(Figure: stages A, B, and C initially run on thread1, thread2, and thread3. After load balancing, thread1 runs both A and B, while thread2 and thread3 both run C, processing frame1, frame2, and frame3.)

Outline
• Concept of RaVioli
– RaVioli hides resolutions from programmers
– Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
– Automatic parallelization with block decomposition
– Pipelining interface with automatic load balance mechanism
• Evaluation results of our work

Evaluation: Resolution Adjustment

(Figure: three plots, one per priority set (spatial resolution : temporal resolution = 0:1, 1:0, and 3:7), each showing frame rate (fps) and number of pixels (axis marked 20k to 80k) against time (sec, 0 to 35))

Evaluation: Parallelization Functions

OS                                  Solaris 10
CPU                                 UltraSPARC T1
Frequency                           1.0GHz
Number of cores                     8
Number of active threads per core   4
Memory                              16GB
Compiler                            Sun Studio 12 (Sun C++ 5.9)
Compiler options                    -fast -m64 -xchip=ultraT1
Thread library                      pthreads

Evaluation: Auto Block Decomposition

(Figure: speedup ratio (0 to 20) against number of threads (0 to 32) for hough, pixAverage, laplacian, and voronoi)

Evaluation: Hough transform

(Figure: bars for 2, 4, 8, 16, and 32 threads on a 0.00 to 1.00 scale; legend: reduction overhead, reduction variable initialization, reduction operations, hough)

Evaluation: Automatic load balancing

                             w/o load balancing   w/ load balancing
Spatial resolution           51x51                170x170
Spatial resolution stride    11                   4
Temporal resolution stride   1                    1

(Figure rows: pipeline status of stages A, B, and C, and the resulting image, for each configuration)

Conclusion
• RaVioli
– hides resolutions from programmers
  • pseudo real-time processing
– has semi-automatic parallelization functions
  • semi-automatic block decomposition
  • load balancing mechanism between pipeline stages
• Our future work
– implementing an automatic power-saving function in RaVioli
– making RaVioli adaptive to various platforms such as Cell Broadband Engine
– designing an easy-to-write language which cooperates with RaVioli

Automatic Load Balancing

(Figure: a Manager monitors stages A, B, and C on thread1, thread2, and thread3. With the loads recorded as A:1, B:1, C:4, the labels after reassignment show B on thread1 and C on both thread2 and thread3.)
