motion estimation and motion compensated (video) coding
TRANSCRIPT
MOTION ESTIMATION AND
MOTION COMPENSATED
(VIDEO) CODING
1. Imaging space
2. Image sequence vs. Video
3. Interframe correlation and motion compensated (MC) coding
4. Three major techniques in motion analysis (MA) for video coding:
• Block matching
• Pel-recursion
• Optical flow
• Digital Image Processing:
1960s ⇒ active research field due to the advent of two events
• Digital Image Sequence Processing:
§ Early 1980s ⇒ active research area
§ More information available
§ More computation involved
§ More memory space required
§ Technically feasible (VLSI, DSP, etc.) ⇒ Image and video sequences:
Indispensable elements of our modern life
Some Applications:
• CD-ROM, VCD, DVD
• Digital video camera
• Digital TV
• Videophony, video-conferencing
• Video over internet, wireless
• Video-on-demand
• Interactive multimedia, databases, etc.
• Temporal image sequence: g(x, y, t)
• Stereo image (pair) sequence
• Consider a generalization:
A sensor – a rigid body in 3-D space
Translation (3 parameters), Rotation (2 parameters)
⇒ Many images of the scene
⇔ Infinitely many sensors, at all possible positions and orientations, at a moment take many images
Figure 10.1 Two sensors' positions: s⃗ = (0, 0, 0, 0, 0) and s⃗ = (x̃, ỹ, z̃, β, γ). (The first sensor has axes X, Y, Z and image plane coordinates x, y at origin 0; the second has axes X′, Y′, Z′ and image plane coordinates x′, y′ at origin 0′.)
⇒ A spatial image sequence g(x, y, s⃗)
s⃗: the sensor's position in 3-D space, s⃗ = (x̃, ỹ, z̃, β, γ)
• An imaging space:
§ A collection of all spatial image sequences (over the temporal domain): g(x, y, t, s⃗)
Alternatively,
§ A collection of all temporal image sequences (over the spatial domain)
• Meaning: Unification of all the image frames and image sequences [shu & shi 1991]
Figure 10.2 A hierarchical structure:
Top level: Imaging space g(x, y, t, s⃗)
Intermediate: Temporal image sequence g(x, y, t, s⃗ = a specific vector); Spatial image sequence g(x, y, t = a specific moment, s⃗)
Bottom level: Individual images g(x, y, t = a specific moment, s⃗ = a specific vector)
2. Image Sequence vs. Video
• Image frames and image sequences:
defined above
• Video: refers to a single video frame or to video sequences associated with the visible frequency band (visual information)
• Scope of image frames and sequences is wider.
e.g. Infrared image and image sequences: Outside the visible band: Not video
• Video is more pertinent to multimedia
engineering
• The two are equal when the visible band is concerned.
• Video sometimes ⇒ video sequences exclusively
• Image compression
⇒ still image compression
• Video compression
⇒ video sequence coding
3. Interframe Correlation and Motion Compensated Compression
• High interframe correlation exists in
video sequence:
♦ A video sequence in videophone service: fewer than 10% of pixels change their
intensity by more than 1% of the peak signal between frames [mounts 1969]
♦ A video sequence in TV broadcast:
(More motion is involved.) Even if there is no scene change between frames, the scene is still scanned constantly
• How to utilize the interframe correlation?
♦ Frame replenishment (FR) [mounts 1969]
§ Compare the present frame with the previous frame pixel-by-pixel
§ If the difference is large enough,
⇒ difference and position: coded & transmitted
§ Otherwise, repeat the pixel's gray scale at the receiver
§ More efficient than intraframe coding techniques such as DPCM
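The replenishment loop above can be sketched in a few lines (a toy 1-D illustration; the function and variable names are hypothetical and the threshold is arbitrary):

```python
# A minimal sketch of conditional (frame) replenishment on 1-D "frames"
# (lists of gray levels). Names and values are illustrative.

def replenish(prev_recon, curr, threshold):
    """Transmit only pixels whose change exceeds `threshold`;
    the receiver repeats the previous gray level elsewhere."""
    recon = list(prev_recon)
    transmitted = []                      # (position, new value) pairs to code
    for i, (old, new) in enumerate(zip(prev_recon, curr)):
        if abs(new - old) > threshold:
            transmitted.append((i, new))  # difference large enough: update
            recon[i] = new
    return recon, transmitted

prev = [10, 10, 10, 10]
curr = [10, 12, 90, 10]                   # only one pixel changes significantly
recon, sent = replenish(prev, curr, threshold=5)
print(sent)    # [(2, 90)]
print(recon)   # [10, 10, 90, 10]
```

The small change at position 1 falls below the threshold and is simply repeated at the receiver, which is exactly where the dirty window effect discussed below comes from.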
Interframe redundancy:
(a) Frame 24 (b) Frame 25
Figure 10.2 Two frames of the "Miss America" sequence.
Problem with frame replenishment:
Figure 10.3 Dirty window effect
Main drawback of the frame replenishment technique: it is difficult to handle frame sequences containing rapid changes.
When changes are more rapid, the number of pixels whose intensity values need to be updated increases. To maintain the transmission bit rate at a steady and proper level, the threshold has to be raised, so that many slow changes cannot show up at the receiver.
This poorer reconstruction at the receiver is somewhat analogous to viewing a scene through a dirty window, and is referred to as the dirty window effect. The result of one experiment on the dirty window effect is displayed in Figure 10.3.
From frame 22 to frame 25 of the “Miss America” sequence, there are 2166 pixels (less than 10% of the total pixels) which change their gray level values by more than 1% of the peak signal. When we only update the gray level values for 25% (randomly chosen) of these changing pixels, we can clearly see the dirty window effect.
When rapid scene changes exceed a certain level, buffer saturation will result, causing picture breakup [mounts, 1969]. Motion compensated coding, which is discussed below, has been proved to be able to provide better performance than the replenishment technique in situations with rapid changes.
♦ Motion estimation and compensation
§ Find displacement vector of a pixel or a set of pixels between frames
§ Via the displacement vector, predict its counterpart in the present frame
§ Prediction error, positions, and motion vectors: coded & transmitted
Figure 10.7 Block diagram of motion compensated coding: motion analysis → prediction and differentiation → encoding
• MC is much more efficient and complicated than FR
(MC ⇒ smaller dynamic range of prediction error) (both MC & FR are predictive coding) (need correspondence – “motion trajectory”)
Figure 10.6 Two consecutive frames of a video sequence
• The first motion compensated coding
algorithm [netravali & robbins 1979] achieved 22-50% lower bit rate than the frame replenishment algorithm
(a) frame at t_(n−1) (b) frame at t_n
• ⇒ High coding efficiency is obtained with
high computational complexity
• This strategy is: § Technically feasible (VLSI, DSP, etc.) § Economically desired
∵ cost of DSP drops much faster than that of transmission [dubois 1981]
§ Continue to be the case
• Motion compensated video coding ⇒ A major development in coding since 1980s [musmann 1985, zhang 1995, kunt 1995]
Motion Estimation (ME)
• Biological Vision Perspective
§ For understanding the human visual system (HVS)
§ Studied by biomedical scientists
§ Goal: to shed light on technology development (machine vision and others)
§ Measurement and interpretation: two steps [singh 91]
• Computer Vision Perspective
§ For interpretation of 3-D motion from image sequences
§ Requirements: accuracy and processing speed
§ A dominant subject in vision [thompson 1989]
§ Correspondence and optical flow: two approaches
§ Rigid vs. deformable motions
§ Intermediate-variable vs. direct methods
• Signal Processing Perspective
§ For MC coding (main purpose) & other MC video processing tasks
§ Ultimate goal:
Ø Higher coding efficiency, i.e., the same quality at a lower bit rate, or better quality at the same bit rate
Ø Quite different from the goal in Computer Vision
§ Real-time requirement ⇒ simple 2-D displacement model
4.1 Block Matching
§ Description: region-based, matching
§ Representative: [jain 1981], diagram
§ The most popular of the three techniques
Figure 11.1 Block matching
Matching Criteria • Instead of finding maximum similarity • An equivalent, more computationally
efficient way of block matching: finding minimum dissimilarity
• The dissimilarity (error, distortion, or distance) D(s, t) between two frames t_n and t_(n−1) is defined as:
(Figure 11.1: (a) the t_n frame, with an original p×q block at (x_0, y_0); (b) the t_(n−1) frame, with the search window, the correlation window, the best matching block at (x_i, y_i), and the displacement vector.)
D(s, t) = (1 / (p·q)) Σ_(j=1..p) Σ_(k=1..q) M( f_n(j, k), f_(n−1)(j + s, k + t) )

M(u, v): a dissimilarity metric between u and v.
• Several types of matching criteria:
§ mean-square error (MSE) [jain 1981]: M(u, v) = (u − v)²
§ sum of squared difference (SSD) [anandan 1987]
§ sum of squared error (SSE) [chan 1990]
§ mean absolute difference (MAD) [koga 1981]: M(u, v) = |u − v|
§ mean absolute error (MAE) [nogaki 1972].
• A study based on experimental works reported that the matching criterion does not significantly affect the search [srinivasan 1984]. The MAD is hence preferred due to its simplicity in implementation [musmann 1985].
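The dissimilarity D(s, t) under the MAD criterion can be sketched as follows (toy frames stored as nested lists; all names are illustrative):

```python
# Sketch of the block dissimilarity D(s, t) with the MAD criterion
# M(u, v) = |u - v|; frames are nested lists of gray levels.

def mad(frame_n, frame_prev, x0, y0, p, q, s, t):
    """MAD between the p x q block of frame n anchored at (x0, y0)
    and the block displaced by (s, t) in frame n-1."""
    total = 0
    for j in range(p):
        for k in range(q):
            total += abs(frame_n[x0 + j][y0 + k]
                         - frame_prev[x0 + j + s][y0 + k + t])
    return total / (p * q)

# A 2x2 block of 9s at (1, 1) in frame n sits at (2, 2) in frame n-1:
frame_n = [[0, 0, 0, 0],
           [0, 9, 9, 0],
           [0, 9, 9, 0],
           [0, 0, 0, 0]]
frame_prev = [[0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 9, 9],
              [0, 0, 9, 9]]
print(mad(frame_n, frame_prev, 1, 1, 2, 2, 1, 1))  # 0.0: (s, t) = (1, 1) matches exactly
print(mad(frame_n, frame_prev, 1, 1, 2, 2, 0, 0))  # 6.75
```

A search procedure then evaluates this dissimilarity over candidate displacements (s, t) and keeps the minimum.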
Searching Procedures
• Full search
§ Brute force in nature
§ Good accuracy
§ Large amount of computation
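A minimal full search sketch, assuming a MAD criterion and a ±R search range (the toy frames and names are illustrative):

```python
# Brute-force full search: evaluate MAD at every displacement in a
# +/-R range and keep the minimum. Names and frames are illustrative.

def mad(frame_n, frame_prev, x0, y0, p, q, s, t):
    total = 0
    for j in range(p):
        for k in range(q):
            total += abs(frame_n[x0 + j][y0 + k]
                         - frame_prev[x0 + j + s][y0 + k + t])
    return total / (p * q)

def full_search(frame_n, frame_prev, x0, y0, p, q, R):
    h, w = len(frame_prev), len(frame_prev[0])
    best = None
    for s in range(-R, R + 1):
        for t in range(-R, R + 1):
            # keep the displaced block inside frame n-1
            if 0 <= x0 + s and x0 + s + p <= h and 0 <= y0 + t and y0 + t + q <= w:
                d = mad(frame_n, frame_prev, x0, y0, p, q, s, t)
                if best is None or d < best[0]:
                    best = (d, (s, t))
    return best   # (minimum dissimilarity, displacement vector)

frame_n = [[0] * 6 for _ in range(6)]
frame_prev = [[0] * 6 for _ in range(6)]
for r in (1, 2):
    for c in (1, 2):
        frame_n[r][c] = 9          # block in frame n at (1, 1)
for r in (3, 4):
    for c in (3, 4):
        frame_prev[r][c] = 9       # same block at (3, 3) in frame n-1

best = full_search(frame_n, frame_prev, 1, 1, 2, 2, 2)
print(best)   # (0.0, (2, 2))
```

The (2R+1)² evaluations per block are what the fast procedures below try to avoid.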
• Several fast searching procedures
§ 2-D logarithm search [jain 1981]
§ Three-step search [koga 1981]
§ Conjugate direction search [srinivasan 1984]
Figure 11.1 (a) 2-D logarithm search procedure: Points at (j, k+2), (j+2, k+2), (j+2, k+4), and (j+1, k+4) are found to give the minimum dissimilarity in steps 1, 2, 3 and 4, respectively.
Figure 11.2 Three-step search procedure: Points (j+4, k-4), (j+4, k-6), and (j+5, k-7) give the minimum dissimilarity in steps 1, 2, and 3, respectively.
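The three-step procedure can be sketched as follows (step sizes 4, 2, 1 for a ±7 range; MAD criterion assumed; toy frames and names are illustrative):

```python
# Three-step search: test nine points around the current center with
# step sizes 4, 2, 1, re-centering on the best point after each step.

def mad(frame_n, frame_prev, x0, y0, p, q, s, t):
    total = 0
    for j in range(p):
        for k in range(q):
            total += abs(frame_n[x0 + j][y0 + k]
                         - frame_prev[x0 + j + s][y0 + k + t])
    return total / (p * q)

def three_step(frame_n, frame_prev, x0, y0, p, q):
    h, w = len(frame_prev), len(frame_prev[0])
    cs, ct = 0, 0                      # current center of the search
    for step in (4, 2, 1):             # nine points per step, halving the step
        best = None
        for ds in (-step, 0, step):
            for dt in (-step, 0, step):
                s, t = cs + ds, ct + dt
                if 0 <= x0 + s and x0 + s + p <= h and 0 <= y0 + t and y0 + t + q <= w:
                    d = mad(frame_n, frame_prev, x0, y0, p, q, s, t)
                    if best is None or d < best[0]:
                        best = (d, s, t)
        _, cs, ct = best               # re-center on the best point
    return cs, ct

frame_n = [[0] * 16 for _ in range(16)]
frame_prev = [[0] * 16 for _ in range(16)]
for r in (8, 9):
    for c in (8, 9):
        frame_n[r][c] = 9              # block in frame n at (8, 8)
for r in (13, 14):
    for c in (5, 6):
        frame_prev[r][c] = 9           # same block displaced by (5, -3)

print(three_step(frame_n, frame_prev, 8, 8, 2, 2))   # (5, -3)
```

Three steps of 9 points each (25 distinct evaluations) replace the 225 evaluations a ±7 full search would need.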
Thresholding Multiresolution Block Matching [shi and xia 1997]
Figure 11.3 A Gaussian pyramid structure (level 0 at the bottom has full resolution; resolution decreases as the level increases).
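Pyramid construction by repeated low-pass filtering and subsampling can be sketched as follows (a 2×2 box average stands in for the Gaussian kernel; this is a simplification, and all names are illustrative):

```python
# Sketch of pyramid construction by low-pass filtering and subsampling.
# A 2x2 box average is used here instead of a true Gaussian kernel.

def downsample(frame):
    h, w = len(frame), len(frame[0])
    return [[(frame[2 * r][2 * c] + frame[2 * r][2 * c + 1] +
              frame[2 * r + 1][2 * c] + frame[2 * r + 1][2 * c + 1]) / 4
             for c in range(w // 2)] for r in range(h // 2)]

def pyramid(frame, levels):
    """Level 0 = full resolution; each higher level halves the resolution."""
    levs = [frame]
    for _ in range(levels - 1):
        levs.append(downsample(levs[-1]))
    return levs

frames = pyramid([[1] * 8 for _ in range(8)], 3)
print([(len(f), len(f[0])) for f in frames])   # [(8, 8), (4, 4), (2, 2)]
```

Block matching then starts on the small top-level frames, where the search range is cheap, and refines downward only when the threshold test fails.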
(Figure 11.4: frames n−1 and n are repeatedly low-pass filtered and subsampled into pyramids; block matching is performed level by level, and blocks satisfying the threshold stop early (Y), otherwise (N) processing continues at the next level, finally yielding the motion field.)
Figure 11.4 Block diagram for a three-level threshold multiresolution block matching.
Figure 11.5 The thresholding process.
(Figure 11.5: at level L of pyramids n−1 and n, the motion vector of a block is estimated, the block and its estimated motion vector are projected, and the MAD of the block at level L is calculated.)
Table 11.1 Parameters used in experiments

                                     Low resolution level   Full resolution level
"Miss America"  Search range         3×3                    1×1
                Block size           4×4                    8×8
                Thresholding value   2                      None (not applicable)
"Train"         Search range         4×4                    1×1
                Block size           4×4                    8×8
                Thresholding value   3                      None (not applicable)
"Football"      Search range         4×4                    1×1
                Block size           4×4                    8×8
                Thresholding value   4                      None (not applicable)
Figure 11.6 The 20th frame of the “Train” sequence.
Figure 11.7 The 20th frame in the “Football” sequence.
Table 11.2 Experimental results (I)

                            PSNR    Error image entropy   Vector entropy   Blocks stopped at   Processing time
                            (dB)    (bits/pixel)          (bits/vector)    top level/Total     (no. of additions, 10^6)
"Miss America" sequence
Method 1 [tzoraras 1994]    38.91   3.311                 6.02             0/1280              10.02
New method (TH=2)           38.79   3.319                 5.65             487/1280            8.02
New method (TH=3)           38.43   3.340                 5.45             679/1280            6.17
"Train" sequence
Method 1 [tzoraras 1994]    27.37   4.692                 6.04             0/2560              22.58
New method (TH=3)           27.27   4.788                 5.65             1333/2560           18.68
"Football" sequence
Method 1 [tzoraras 1994]    24.26   5.379                 7.68             0/3840              30.06
New method (TH=4)           24.18   5.483                 7.58             1464/3840           25.90
New method (TH=3)           24.21   5.483                 7.57             1128/3840           27.10
4.2 Pel-recursion
§ Description:
Ø region-based, matching
Ø DFD: displaced frame difference
Ø minimization of a nonlinear function of the DFD
§ Representative: [netravali & robbins 1979], the 1st MC algorithm
§ The earliest of the three, but the least accurate (backward MA)
• In [netravali 1979], a dissimilarity measure called displaced frame difference (DFD) was defined as follows.
DFD(x, y; d_x, d_y) = f_n(x, y) − f_(n−1)(x − d_x, y − d_y)

n, n−1: two successive frames
x, y: coordinates in the image plane
d_x, d_y: the two components of the displacement vector d⃗
• Obviously, if no error in the estimation, then
DFD will be zero.
• A nonlinear function of the dissimilarity measure, DFD², was proposed in [netravali 1979].
• Displacement estimation is then converted into a minimization problem.
• A typical nonlinear programming problem
♦ In [netravali 1979] :
d⃗^(k+1) = d⃗^k − α · DFD(x, y; d⃗^k) · ∇_(x,y) f_(n−1)(x − d_x, y − d_y)

∇_(x,y): gradient operator with respect to x and y
α = 1/1024
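The steepest-descent recursion can be illustrated on a smooth 1-D signal (a deliberate simplification of the 2-D scheme; the signal, true displacement, and step size alpha below are illustrative, not from [netravali 1979]):

```python
# 1-D sketch of the pel-recursive update
#   d^(k+1) = d^k - alpha * DFD * gradient,
# i.e., steepest descent on DFD^2. All values are illustrative.

def f_prev(x):                         # frame n-1: a smooth intensity profile
    return (x - 5.0) ** 2

D_TRUE = 1.5                           # true displacement between the frames
def f_n(x):                            # frame n: the profile shifted by D_TRUE
    return f_prev(x - D_TRUE)

def dfd(x, d):                         # displaced frame difference
    return f_n(x) - f_prev(x - d)

def grad_prev(x, h=1e-4):              # spatial gradient of f_{n-1}
    return (f_prev(x + h) - f_prev(x - h)) / (2 * h)

def estimate(x, d=0.0, alpha=0.01, iters=500):
    for _ in range(iters):
        d -= alpha * dfd(x, d) * grad_prev(x - d)
    return d

print(round(estimate(3.0), 3))   # converges to 1.5
```

When the DFD is zero the update stops, which matches the observation above that an error-free estimate gives DFD = 0.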
♦ Inclusion of a neighborhood area
To make displacement estimation more robust, Netravali and Robbins assume the displacement vector is constant within a small neighborhood Ω.
d⃗^(k+1) = d⃗^k − (1/2) α Σ_(i∈Ω) w_i ∇_d⃗ DFD²(x_i, y_i; d⃗^k)

i: an index for the i-th pixel (x, y) in Ω
w_i: weights, with w_i ≥ 0 and Σ_(i∈Ω) w_i = 1
• The algorithm can be applied to a pixel once or iteratively applied several times for displacement estimation.
Then the algorithm moves to the next pixel.
The estimated displacement vector of a pixel can be used as an initial estimate for the next pixel.
• This recursion can be carried out horizontally, vertically, or temporally.
Figure 12.1 Three types of recursions: (a) from (x, y) to (x, y+1); (b) from (x, y) to (x+1, y); (c) from (x, y, t_(n−1)) to (x, y, t_n)

4.3 Optical Flow
§ Description: a velocity vector for each pixel
Ø correlation-based (similar to block matching)
Figure 13.1 Correlation-based approach to optical flow determination
Ø differential (gradient) method
Ø spatiotemporal energy-based
Ø phase-based
§ Representative: [horn and schunck 1981], the differential, gradient-based classical solution
(Figure 13.1: for the pixel in f(x, y, t) to which optical flow needs to be determined, a correlation window around (x, y) is matched within a search window in f(x, y, t−1); the best match gives the optical flow vector.)
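Gradient-based methods such as [horn and schunck 1981] rest on the brightness constancy constraint f_x·u + f_y·v + f_t = 0, where (u, v) is the flow vector. A sketch verifying the constraint on a translating linear ramp (all names and values are illustrative):

```python
# Brightness constancy check: f_x * u + f_y * v + f_t = 0 for a pattern
# translating at (u, v) = (1, 2) pixels/frame. Values are illustrative.

def f(x, y, t):                      # intensity: a ramp translating by (1, 2)
    return 3.0 * (x - 1.0 * t) + 4.0 * (y - 2.0 * t)

h = 1e-4                             # step for central-difference gradients
def fx(x, y, t): return (f(x + h, y, t) - f(x - h, y, t)) / (2 * h)
def fy(x, y, t): return (f(x, y + h, t) - f(x, y - h, t)) / (2 * h)
def ft(x, y, t): return (f(x, y, t + h) - f(x, y, t - h)) / (2 * h)

u, v = 1.0, 2.0
residual = fx(5, 5, 0) * u + fy(5, 5, 0) * v + ft(5, 5, 0)
print(abs(residual) < 1e-6)   # True: the constraint holds for this motion
```

One equation in two unknowns per pixel is under-determined, which is why Horn and Schunck add a smoothness term and solve for the whole flow field iteratively.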
§ Mainly used in computer vision
§ Less used in video coding
Ø though accurate, requires more side information & computation
Ø (recall the ultimate goal of MA in video coding)
⇒ Need some innovative work
5. Classification
• Region-based vs. gradient-based
Table 14.1 Region-based vs. Gradient-based

                 Block matching   Pel recursive   Optical flow (correlation-based method | gradient-based method)
Region-based     √                √               √ (correlation-based method)
Gradient-based                                    √ (gradient-based method)
• Forward vs. Backward Motion Estimation
Figure 14.1 Forward motion estimation and compensation. T: transform, Q: quantizer, FB: frame buffer, MCP: motion compensated predictor, ME: motion estimator, e: prediction error, f: input video frame, f_p: predicted video frame, f_r: reconstructed video frame, q: quantized transform coefficients, v: motion vector
(Diagram: the input video frame f minus the prediction f_p gives the prediction error e, which passes through T and Q to yield q; Q⁻¹ and T⁻¹ reconstruct the error, which is added back to f_p in the prediction loop.)
Figure 14.2 Backward motion estimation and compensation. T: transform, Q: quantizer, FB: frame buffer, MCP: motion compensated predictor, ME: motion estimator, e: prediction error, f: input video frame, f_p: predicted video frame, f_r1: reconstructed video frame, f_r2: reconstructed previous video frame, q: quantized transform coefficients
Summary
In general, block matching achieves the best performance in terms of higher estimation accuracy and less computational complexity.
It has been used in all international video coding standards, such as H.261, H.263, H.26L, MPEG-1, 2, and 4.
The pel recursive technique has the relatively lowest motion estimation accuracy, and is hence less used for motion estimation and motion compensation.
Optical flow achieves high motion estimation accuracy, but produces too many motion vectors to deal with. It is mainly used in the Computer Vision field.
One attempt [shi et al 1998]:
a. Flow vectors are highly correlated
b. They can be modeled by a first-order AR model
c. DCT can be applied to the flow vectors
d. Good results have been achieved:
i. The bit rate is comparable to that achieved by H.263
ii. The visual quality is better: no block artifacts
(a) (b)
(c)
Figure 13.2 (a) The 21st original frame of the “Miss America” sequence, (b) The reconstructed 21st frame with H.263, (c) The reconstructed 21st frame with the proposed technique.
6. References [aggarwal 1988] J. K. Aggarwal and N. Nandhakumar, ``On the computation of motion from sequences of images -- a review,’’ Proceedings of the IEEE, vol. 76, no. 8, pp. 917-935, 1988.
[barron 1994] J. L. Barron, D. J. Fleet and S. S. Beauchemin, “Systems and experiment performance of optical flow techniques,” International Journal of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[boroczky 1991] “Pel-recursive motion estimation for image coding,” Ph.D. dissertation, Delft University of Technology, The Netherlands, 1991.
[dubois 1981] E. Dubois, B. Prasada and M. S. Sabri, ``Image Sequence Coding,’’ chapter 3, in T. S. Huang, Ed., Image Sequence Analysis, Springer-Verlag, 1981
[dufaux 1995] F. Dufaux and F. Moscheni, “Motion estimation techniques for digital TV: A review and a new contribution,” Proceedings of the IEEE , vol. 83, no. 6, pp. 858-876, 1995.
[horn 1981] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[jain 1981] J. R. Jain and A. K. Jain, “Displacement measurement and its application in interframe image coding,” IEEE Transactions on Communications, vol. COM-29, no. 12, pp. 1799-1808, December 1981.
[koga 1981] T. Koga, K. Iinuma, A. Hirano, Y. Iijima and T. Ishiguro, “Motion-compensated interframe coding for video conferencing,” Proceedings of NTC'81, pp. G5.3.1-G5.3.5, New Orleans, LA, Dec. 1981.
[kunt 1995] M. Kunt Ed., Special Issue on Digital Television Part 1: Technologies, Proceedings of The IEEE, vol. 83, no. 6, June 1995.
[mounts 1969] F. W. Mounts, “A video encoding system with conditional picture-element replenishment,” The Bell System Technical Journal, vol. 48, no. 7, pp. 2545-2554, September 1969.
[musmann 1985] H. G. Musmann, P. Pirsch, and H. J. Grallert, “Advances in picture coding,’’ Proceedings of the IEEE, vol. 73, no. 4, pp. 523-548, 1985.
[netravali 1979] A. N. Netravali and J. D. Robbins, “Motion-compensated television coding: Part I,” The Bell System Technical Journal, vol. 58, no. 3, pp. 631-670, March 1979.
[srinivasan 1984] R. Srinivasan and K. R. Rao, “Predictive coding based on efficient motion estimation,” Proceedings of ICC, pp. 521-526, May 1984.
[shu 1991] C. Q. Shu and Y. Q. Shi, “On unified optical flow field,’’ Pattern Recognition, vol. 24, no. 6, pp. 579-586, 1991.
[thompson 1989] W. B. Thompson, “Introduction to special issue on visual motion,’’ IEEE Trans. on Pattern Analysis and Machine Intelligence , vol. 11, no. 5, pp. 449-450, 1989.
[zhang 1995] Y.-Q. Zhang, W. Li and M. L. Liou, Ed., Special Issue on Advances in Image and