Parallel Image Processing
Lei Cao and Yan Wang
2013-04-19
Image size: 4753 × 3168 pixels (× 3 RGB channels)
LOMO effect: the image boundaries are darkened and contrast is added to the red channel.
Tilt-shift effect: selected regions are made out of focus by Gaussian filtering.
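As a minimal sketch of the LOMO effect described above, the snippet below raises the red-channel contrast and then darkens a pixel according to its distance from the image center. The function names, the quadratic falloff, the `strength` and `gain` values, and the mid-gray pivot of 128 are all illustrative assumptions; the slides do not give the formulas that were actually used.

```python
import math

def vignette_factor(x, y, width, height, strength=0.6):
    """Brightness multiplier: 1.0 at the image center, 1 - strength at the corners.

    The quadratic falloff and the default strength are assumptions.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    max_dist = math.hypot(cx, cy)
    dist = math.hypot(x - cx, y - cy)
    r = dist / max_dist if max_dist > 0 else 0.0
    return 1.0 - strength * r * r

def boost_red_contrast(red, gain=1.5):
    """Stretch a red value (0..255) away from mid-gray 128, then clamp (assumed curve)."""
    value = 128 + gain * (red - 128)
    return max(0, min(255, int(round(value))))

def lomo_pixel(rgb, x, y, width, height):
    """Apply the red-channel contrast first, then the vignette, to one RGB pixel."""
    r, g, b = rgb
    r = boost_red_contrast(r)
    f = vignette_factor(x, y, width, height)
    return tuple(max(0, min(255, int(round(c * f)))) for c in (r, g, b))

# A pixel at the center keeps its brightness; the same pixel in a corner is darkened.
center_pixel = lomo_pixel((200, 100, 50), 2, 2, 5, 5)
corner_pixel = lomo_pixel((200, 100, 50), 0, 0, 5, 5)
```

Each output pixel depends only on its own input value and position, which is why this stage is easy to split across threads or processes.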
Outline
Algorithm & Serial Code Profiling
OpenMP Parallelization
MPI Parallelization
Performance
Future Work
Algorithm
[Figure: Tilt-shift and LOMO]
Serial Program Profiling
Share of serial run time by stage (pie chart, values in legend order):
Read Image     4%
Tilt Shift     71%
Red Tune       8%
Vignette       15%
Write Image    2%
Others         0%
Tested on Purdue’s Hansen cluster (Dell compute nodes with four 12-core AMD Opteron 6176 processors). Compiler: intel/12.0.084. The image is read and written with the JPEG library.
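The serial profile gives a rough upper bound on the achievable speedup through Amdahl's law. The sketch below reads the shares in legend order and assumes that only the Tilt Shift, Red Tune and Vignette stages are parallelized while image reading and writing stay serial. The slides do not state that split, so it is an assumption, as are the names `amdahl_speedup` and `parallel_stages`.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only parallel_fraction of the run time scales with cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Shares of serial run time from the profiling slide, read in legend order.
serial_profile = {
    "Read Image": 0.04,
    "Tilt Shift": 0.71,
    "Red Tune": 0.08,
    "Vignette": 0.15,
    "Write Image": 0.02,
    "Others": 0.00,
}

# Assumption: only these three stages run in parallel.
parallel_stages = ("Tilt Shift", "Red Tune", "Vignette")
parallel_fraction = sum(serial_profile[s] for s in parallel_stages)

bound_8_cores = amdahl_speedup(parallel_fraction, 8)
print(f"parallel fraction = {parallel_fraction:.2f}, bound on 8 cores = {bound_8_cores:.2f}")
```

Under these assumptions the parallel fraction is 0.94 and the bound on 8 cores is about 5.6, so the serial image I/O already limits the speedup well below 8.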
LOMO
Example 3 × 3 pixel neighborhood p:
255 135   2
221 127  18
 94 100  33
Filter weights a:
a1 a2 a3
a4 a5 a6
a7 a8 a9
$$p(\text{center}) = \frac{\sum_i p(i)\, a(i)}{\sum_i a(i)}$$
Gaussian Filtering: Center Weighted Average
Image size: 4753 × 3168 (× 3 RGB)
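The center-weighted average can be worked through on the 3 × 3 neighborhood from the slide. The weights below (1-2-1 / 2-4-2 / 1-2-1) are a common discrete approximation of a Gaussian and are an assumption; the slides only name the weights a1 to a9 without giving values.

```python
def weighted_average(window, kernel):
    """Return sum_i p(i) * a(i) / sum_i a(i) for same-sized 2D lists."""
    numerator = sum(
        p * a
        for pixel_row, weight_row in zip(window, kernel)
        for p, a in zip(pixel_row, weight_row)
    )
    denominator = sum(a for weight_row in kernel for a in weight_row)
    return numerator / denominator

# Pixel neighborhood p taken from the slide.
window = [
    [255, 135, 2],
    [221, 127, 18],
    [94, 100, 33],
]

# Assumed weights a1..a9 (not given in the slides).
kernel = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]

new_center = weighted_average(window, kernel)
print(new_center)
```

With these assumed weights the weighted sum is 1840 and the weights add up to 16, so the center value 127 is replaced by 115.0.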
Gaussian Filter:
The 3 × 3 window of weights a1 ... a9 slides over the image. Successive window positions shown on the slides, each one row further down:
255 135   2
221 127  18
 94 100  33

221 127  18
 94 100  33
 32  93 105

 94 100  33
 32  93 105
 19  18  17
[Figure: the weights a1 ... a9 applied at further positions in the image]
Less computation vs. more computation
Load imbalance: dynamic scheduling in OpenMP
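The effect of load imbalance can be illustrated with a small simulation. It is a simplified model rather than a measurement: static scheduling is modeled as one contiguous block of rows per worker, like OpenMP `schedule(static)`, and dynamic scheduling as idle workers picking up the next chunk, like `schedule(dynamic, chunk)`. The row costs are hypothetical and only stand for blurred rows costing more than sharp ones.

```python
import heapq

def static_makespan(costs, workers):
    """Finish time when rows are split into contiguous, equal-sized blocks."""
    n = len(costs)
    block = -(-n // workers)  # ceiling division
    loads = [sum(costs[w * block:(w + 1) * block]) for w in range(workers)]
    return max(loads)

def dynamic_makespan(costs, workers, chunk=1):
    """Finish time when the earliest-free worker always takes the next chunk of rows."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for start in range(0, len(costs), chunk):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + sum(costs[start:start + chunk]))
    return max(finish_times)

# Hypothetical per-row costs: expensive rows at the top and bottom, cheap rows in between.
row_costs = [9] * 4 + [1] * 8 + [9] * 4

static_time = static_makespan(row_costs, workers=4)
dynamic_time = dynamic_makespan(row_costs, workers=4)
print(static_time, dynamic_time)
```

In this toy case two workers receive all the expensive rows under static scheduling, while dynamic scheduling spreads them evenly and reaches the ideal finish time of total work divided by the number of workers.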
OpenMP Parallelization
Dynamic scheduling
nowait
Loop coalescing: transform the 2D image matrix into a 1D array
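Loop coalescing can be sketched as follows: the nested row and column loops become a single loop over height × width indices, and the 2D position is recovered with a division and a remainder. The helper names are hypothetical, and the sketch runs serially; in the OpenMP version the single loop is what would be divided among threads.

```python
def coalesced_to_2d(k, width):
    """Map a flat index k back to (row, col) for a row-major image."""
    return divmod(k, width)

def process_coalesced(image, fn):
    """Apply fn to every pixel using one loop over a flattened copy of the image."""
    height, width = len(image), len(image[0])
    flat = [pixel for row in image for pixel in row]  # 2D matrix -> 1D array
    out = [fn(flat[k]) for k in range(height * width)]  # single coalesced loop
    # Rebuild the 2D layout from the 1D result.
    return [out[r * width:(r + 1) * width] for r in range(height)]

doubled = process_coalesced([[1, 2, 3], [4, 5, 6]], lambda v: v * 2)
row, col = coalesced_to_2d(7, 3)  # flat index 7 in an image 3 pixels wide
```

A single long loop gives the scheduler many more iterations to distribute than a loop over rows alone, which helps when the number of rows per thread is small or uneven.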
Performance
Share of run time by stage (pie charts, values in legend order):
Stage          Serial   8-core OpenMP
Read Image     4%       16%
Tilt Shift     71%      44%
Red Tune       8%       6%
Vignette       15%      26%
Write Image    2%       6%
Others         0%       2%
Hansen cluster. Serial compiler: intel/12.0.084. Parallel compiler: openmpi/1.4.4_intel-12.0.084; 8 cores are used.
Dynamic Scheduling
Share of run time by stage, 8 cores (pie charts, values in legend order):
Stage          Dynamic   Static
Read Image     17%       16%
Tilt Shift     39%       44%
Red Tune       7%        6%
Vignette       28%       26%
Write Image    7%        6%
Others         2%        2%
nowait
Share of run time by stage (pie charts, values in legend order):
Stage          w/ nowait   w/o nowait
Read Image     17%         16%
Tilt Shift     39%         44%
Red Tune       7%          6%
Vignette       28%         26%
Write Image    7%          6%
Others         2%          2%
Speedup
[Figure: bar chart (series label: Time; vertical axis 0 to 4.5) for four variants: Original, Dynamic w/o nowait, Nowait w/o dynamic, Dynamic & nowait. Bar annotations: +15%, +2%, +15%. 8 cores are used.]
Speedup of OpenMP and MPI
[Figure: speedup vs. number of processes (both axes 0 to 25) with three curves: Ideal, OpenMP, MPI]
OMP compiler: openmpi/1.4.4_intel-12.0.084; MPI compiler: mpich2-intel/1.4.4_intel-12.0.084
MPI vs. OpenMP
Share of run time by stage (pie charts, values in legend order):
Stage          8-core MPI   8-core OpenMP
Read Image     9%           16%
Tilt Shift     52%          44%
Red Tune       6%           6%
Vignette       25%          26%
Write Image    6%           6%
Others         2%           2%
MPI: each process creates a large array, which is inefficient.
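One possible way to reduce the per-process array, sketched here as an assumption and not as the decomposition the authors used, is to give each MPI rank only its own band of rows plus a few halo rows for the 3 × 3 filter. The helper `row_band` is hypothetical and only computes the index ranges. It makes no MPI calls and takes 3168 to be the number of image rows.

```python
def row_band(rank, size, height, halo=1):
    """Return ((start, stop), (lo, hi)) for one rank.

    (start, stop) is the band of rows the rank owns; (lo, hi) adds halo rows
    on each side, clipped to the image, so the filter can read its neighbors.
    """
    base, extra = divmod(height, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    lo = max(0, start - halo)
    hi = min(height, stop + halo)
    return (start, stop), (lo, hi)

# Assumed layout: 3168 image rows split across 8 ranks.
height, size = 3168, 8
bands = [row_band(rank, size, height) for rank in range(size)]
largest_local_rows = max(hi - lo for _, (lo, hi) in bands)
print(bands[0], bands[3], bands[7], largest_local_rows)
```

With this split each rank would hold at most 398 rows instead of all 3168, at the cost of exchanging or re-reading the halo rows.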
Future Work
Other image sizes (smaller or larger)
Find other aspects to improve the speedup
Try different chunk sizes in dynamic scheduling
Thanks!