MOTION ESTIMATION AND MOTION COMPENSATED (VIDEO) CODING
1. Imaging space
2. Image sequence vs. video
3. Interframe correlation and motion compensated (MC) coding
4. Three major techniques in motion analysis (MA) for video coding:
   • Block matching
   • Pel-recursion
   • Optical flow
1. Imaging Space
• Digital image processing: from the 1960s, an active research field, due to the advent of two events.
• Digital image sequence processing: from the early 1980s, an active research area.
  § More information available
  § More computation involved
  § More memory space required
  § Technically feasible (VLSI, DSP, etc.)
⇒ Image and video sequences: indispensable elements of our modern life.
Some applications:
  q CD-ROM, VCD, DVD
  q Digital video cameras
  q Digital TV
  q Video-phony, video-conferencing
  q Video over the Internet, wireless video
  q Video-on-demand
  q Interactive multimedia, databases, etc.
• Temporal image sequence: g(x, y, t)
• Stereo image (pair) sequence
• Consider a generalization: a sensor (a solid article) in 3-D space
  translation (3 parameters) + rotation (2 parameters)
  ⇒ many images of the scene
  ⇔ At a moment, infinitely many sensors at all possible positions and orientations take infinitely many images.
Figure 10.1 Two sensor positions in 3-D space: s = (0, 0, 0, 0, 0) at the origin, and s = (x̃, ỹ, z̃, β, γ).
⇒ A spatial image sequence: g(x, y, s)
  s: the sensor's position in 3-D space, s = (x̃, ỹ, z̃, β, γ)
• An imaging space: g(x, y, t, s)
  § a collection of all spatial image sequences (over the temporal domain)
  Alternatively,
  § a collection of all temporal image sequences (over the spatial domain)
• Meaning: unification of all image frames and image sequences [shu & shi 1991]
Figure 10.2 A hierarchical structure of the imaging space:
  Top level: the imaging space g(x, y, t, s)
  Intermediate level:
    temporal image sequence g(x, y, t, s = a specific vector)
    spatial image sequence g(x, y, t = a specific moment, s)
  Bottom level: individual images g(x, y, t = a specific moment, s = a specific vector)
2. Image Sequence vs. Video
• Image frames and image sequences: defined above.
• Video:
  § refers to a single video frame, or to video sequences,
  § associated with the visible frequency band (visual information).
• The scope of image frames and sequences is wider,
  e.g., infrared images and image sequences lie outside the visible band: not video.
• Video is more pertinent to multimedia engineering.
• The two are equal when only the visible band is concerned.
• "Video", sometimes ⇒ video sequences exclusively
• Image compression ⇒ still image compression
• Video compression ⇒ video sequence coding
3. Interframe Correlation and Motion Compensated Compression
• High interframe correlation exists in video sequences:
  ♦ In a videophone-service video sequence, fewer than 10% of the pixels change their intensity by more than 1% of the peak signal between frames [mounts 1969].
  ♦ In a TV-broadcast video sequence (more motion is involved), even if there is no scene change between frames, the scene is still scanned constantly.
• How to utilize the interframe correlation?
  ♦ Frame replenishment (FR) [mounts 1969]
    § Compare the present frame with the previous frame pixel by pixel.
    § If the difference is large enough ⇒ difference and position: coded and transmitted.
    § Otherwise, repeat the pixel's gray scale at the receiver.
    § More efficient than intraframe coding techniques such as DPCM.
Interframe redundancy:
Figure 10.3 Two frames of the "Miss America" sequence: (a) frame 24; (b) frame 25.
Problem with frame replenishment:
Figure 10.4 The dirty window effect
The main drawback of the frame replenishment technique is that it is difficult to handle frame sequences containing more rapid changes. When there are more rapid changes, the number of pixels whose intensity values need to be updated increases. To keep the transmission bit rate at a steady and proper level, the threshold has to be raised, so that many slow changes cannot show up at the receiver. The resulting poorer reconstruction at the receiver is somewhat analogous to viewing a scene through a dirty window, and is referred to as the dirty window effect. The result of one experiment on the dirty window effect is displayed in Figure 10.4. From frame 22 to frame 25 of the "Miss America" sequence, there are 2166 pixels (fewer than 10% of the total) whose gray level values change by more than 1% of the peak signal. When the gray level values of only 25% (randomly chosen) of these changing pixels are updated, the dirty window effect is clearly visible. When rapid scene changes exceed a certain level, buffer saturation results, causing picture breakup [mounts, 1969]. Motion compensated coding, discussed below, has been proved to provide better performance than the replenishment technique in situations with rapid changes.
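The replenishment scheme described above is easy to sketch in code. Below is a minimal, illustrative NumPy version (the threshold value, frame contents, and the update-list representation are assumptions made for this example, not Mounts' actual system):

```python
import numpy as np

def frame_replenishment(prev_recon, curr, threshold=8):
    """One step of frame replenishment (FR), as a sketch.

    Pixels whose difference from the previously reconstructed frame
    exceeds `threshold` are coded (position + new gray level); all
    other pixels simply repeat their previous gray level at the
    receiver. Returns the receiver's reconstruction and the updates.
    """
    diff = np.abs(curr.astype(int) - prev_recon.astype(int))
    changed = diff > threshold                  # pixels worth updating
    updates = [(int(r), int(c), int(curr[r, c]))
               for r, c in zip(*np.nonzero(changed))]
    recon = prev_recon.copy()                   # receiver repeats old values
    for r, c, v in updates:                     # ...and applies coded updates
        recon[r, c] = v
    return recon, updates

# Toy 4x4 frame in which exactly two pixels change significantly.
prev = np.full((4, 4), 100, dtype=np.uint8)
curr = prev.copy()
curr[1, 2] = 130
curr[3, 0] = 60
recon, updates = frame_replenishment(prev, curr, threshold=8)
print(len(updates))                  # 2: only the changed pixels are coded
print(np.array_equal(recon, curr))   # True: both changes exceed the threshold
```

Raising `threshold` reduces the number of coded pixels and hence the bit rate, which is exactly the mechanism behind the dirty window effect described above.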
♦ Motion estimation and compensation
  § Find the displacement vector of a pixel, or of a set of pixels, between frames.
  § Via the displacement vector, predict the counterpart in the present frame.
  § Prediction error, positions, and motion vectors: coded and transmitted.
Figure 10.5 Block diagram of motion compensated coding: motion analysis, then prediction and differentiation, then encoding.
• MC is much more efficient, and more complicated, than FR.
  (MC ⇒ a smaller dynamic range of the prediction error)
  (both MC and FR are predictive coding)
  (MC needs correspondence: a "motion trajectory")
Figure 10.6 Two consecutive frames, t_{n-1} and t_n, of a video sequence.
• The first motion compensated coding algorithm [netravali & robbins 1979] achieved a 22-50% lower bit rate than the frame replenishment algorithm.
• High coding efficiency is obtained at the cost of high computational complexity.
⇒ This strategy is:
  § technically feasible (VLSI, DSP, etc.)
  § economically desirable, since the cost of DSP drops much faster than that of transmission [dubois 1981]
  § and this continues to be the case.
• Motion compensated video coding: a major development in coding since the 1980s [musmann 1985, zhang 1995, kunt 1995]
⇒ Motion Estimation (ME)
• Biological Vision Perspective
§ For understanding the human visual system (HVS)
§ Studied by biomedical scientists
§ Goal: to shed light on technology development (machine vision and others)
§ Measurement and interpretation: two steps [singh 91]
• Computer Vision Perspective
  § For interpretation of 3-D motion from image sequences
  § Requirements: accuracy and processing speed
  § A dominant subject in vision [thompson 1989]
  § Correspondence and optical flow: two approaches
  § Rigid vs. deformable motions
  § Intermediate-variable vs. direct methods
• Signal Processing Perspective
§ For MC coding (the main purpose) and other MC video processing tasks
§ Ultimate goal:
  Ø higher coding efficiency, i.e., the same quality at a lower bit rate, or better quality at the same bit rate
  Ø quite different from the goal in computer vision
§ Real-time requirement ⇒ a simple 2-D displacement model
4.1 Block Matching
§ Description: region-based, matching
§ Representative: [jain 1981] (see the diagram below)
§ The most popular of the three techniques
Figure 11.1 Block matching: the original p×q block at (x0, y0) in the t_n frame is compared, within a search window in the t_{n-1} frame, with candidate blocks; the displacement vector points to the best-matching block.

Matching Criteria
• Instead of finding the maximum similarity,
• an equivalent, more computationally efficient formulation of block matching finds the minimum dissimilarity.
• The dissimilarity (error, distortion, or distance) between frames t_n and t_{n-1}, D(s, t), is defined as:
  D(s, t) = (1 / (p·q)) Σ_{j=1..p} Σ_{k=1..q} M( f_n(j, k), f_{n-1}(j+s, k+t) )

  M(u, v): a dissimilarity metric between u and v.
• Several types of matching criteria:
  § mean-square error (MSE) [jain 1981]: M(u, v) = (u - v)^2
  § sum of squared differences (SSD) [anandan 1987]
  § sum of squared errors (SSE) [chan 1990]
  § mean absolute difference (MAD) [koga 1981]: M(u, v) = |u - v|
  § mean absolute error (MAE) [nogaki 1972]
• A study based on experimental work reported that the matching criterion does not significantly affect the search [srinivasan 1984]. The MAD is hence preferred, due to its simplicity of implementation [musmann 1985].
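To make the definition concrete, here is a small, hedged sketch of exhaustive block matching under the MAD criterion (block size, search range, and the toy frames below are illustrative choices, not values from the text):

```python
import numpy as np

def mad(block_a, block_b):
    """Mean absolute difference: M(u, v) = |u - v|, averaged over the block."""
    return float(np.mean(np.abs(block_a.astype(int) - block_b.astype(int))))

def full_search(curr, prev, x0, y0, p=8, rng=4):
    """Exhaustive block matching: find (s, t) minimising D(s, t).

    `curr[y0:y0+p, x0:x0+p]` is the original block in frame t_n; every
    displacement within +/-rng in frame t_{n-1} is tried."""
    block = curr[y0:y0 + p, x0:x0 + p]
    best, best_st = None, (0, 0)
    for t in range(-rng, rng + 1):                # vertical displacement
        for s in range(-rng, rng + 1):            # horizontal displacement
            ys, xs = y0 + t, x0 + s
            if ys < 0 or xs < 0 or ys + p > prev.shape[0] or xs + p > prev.shape[1]:
                continue                          # candidate outside the frame
            d = mad(block, prev[ys:ys + p, xs:xs + p])
            if best is None or d < best:
                best, best_st = d, (s, t)
    return best_st, best

# Toy frames: the 8x8 block at (10, 9) in `curr` is an exact copy of
# the block at (8, 8) in `prev`, i.e. a displacement of (-2, -1).
gen = np.random.default_rng(0)
prev = gen.integers(0, 256, (32, 32)).astype(np.uint8)
curr = np.zeros_like(prev)
curr[9:17, 10:18] = prev[8:16, 8:16]
(s, t), d = full_search(curr, prev, 10, 9)
print((s, t), d)      # (-2, -1) with MAD 0: the exact match wins
```

Swapping `mad` for a squared-difference metric turns this into MSE/SSD matching; per [srinivasan 1984], the choice rarely changes the found vector.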
Searching Procedures
• Full search
  § Brute force in nature
  § Good accuracy
  § Large amount of computation
• Several fast searching procedures
  § 2-D logarithm search [jain 1981]
  § Three-step search [koga 1981]
  § Conjugate direction search [srinivasan 1984]
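Of the fast procedures listed above, the three-step search is easily sketched: test the centre and its eight neighbours at step sizes 4, 2, and 1, recentring on the best point each time. A minimal illustration with MAD as the criterion; the smooth toy image is an assumption chosen so that the error surface is well behaved:

```python
import numpy as np

def three_step_search(curr, prev, x0, y0, p=8):
    """Three-step search [koga 1981], sketched with MAD as the criterion.

    Step sizes 4, 2, 1: at each step, the centre and its eight
    neighbours at the current step size are tested, the best point
    becomes the new centre, and the step size is halved."""
    block = curr[y0:y0 + p, x0:x0 + p]
    cs, ct = 0, 0                                 # current centre displacement
    for step in (4, 2, 1):
        best, best_st = None, (cs, ct)
        for dt in (-step, 0, step):
            for ds in (-step, 0, step):
                s, t = cs + ds, ct + dt
                ys, xs = y0 + t, x0 + s
                if ys < 0 or xs < 0 or ys + p > prev.shape[0] or xs + p > prev.shape[1]:
                    continue                      # candidate outside the frame
                d = np.mean(np.abs(block - prev[ys:ys + p, xs:xs + p]))
                if best is None or d < best:
                    best, best_st = d, (s, t)
        cs, ct = best_st                          # recentre, then halve the step
    return cs, ct

# Toy frames: a smooth bump, so the MAD surface is unimodal; the 8x8
# block at (10, 9) in `curr` originates at (8, 8) in `prev`.
ys, xs = np.mgrid[0:32, 0:32]
prev = 255.0 * np.exp(-((xs - 12.0) ** 2 + (ys - 12.0) ** 2) / 40.0)
curr = np.zeros_like(prev)
curr[9:17, 10:18] = prev[8:16, 8:16]
print(three_step_search(curr, prev, 10, 9))       # (-2, -1)
```

Only 25 positions are tested instead of the 225 of a +/-7 full search; the price is possible convergence to a local minimum on less well-behaved data.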
Figure 11.2 (a) 2-D logarithm search procedure: the points at (j, k+2), (j+2, k+2), (j+2, k+4), and (j+1, k+4) are found to give the minimum dissimilarity in steps 1, 2, 3, and 4, respectively.
Figure 11.2 (b) Three-step search procedure: the points at (j+4, k-4), (j+4, k-6), and (j+5, k-7) give the minimum dissimilarity in steps 1, 2, and 3, respectively.

Thresholding Multiresolution Block Matching [shi and xia 97]
Figure 11.3 A Gaussian pyramid structure: levels 0 (full resolution) through 4; resolution decreases as the level increases.
Figure 11.4 Block diagram of three-level threshold multiresolution block matching: frames n and n-1 are each low-pass filtered and subsampled to form pyramids; block matching is performed level by level, and blocks whose matching error satisfies the threshold stop at that level, yielding the final motion field.
Figure 11.5 The thresholding process: the block and its estimated motion vector at pyramid level L are projected between pyramid n-1 and pyramid n, and the MAD of the block at level L is calculated for comparison with the threshold.
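A hedged two-level sketch of the threshold multiresolution idea (the original uses Gaussian low-pass filtering and more levels; here 2x2 averaging stands in for the filter, and the threshold value and helper names are assumptions for this example):

```python
import numpy as np

def downsample(img):
    """One pyramid level: 2x2 averaging (a stand-in for Gaussian
    low-pass filtering followed by subsampling)."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def mad(a, b):
    """Mean absolute difference between two equally sized blocks."""
    return float(np.mean(np.abs(a - b)))

def search(curr, prev, x0, y0, p, rng, cx=0, cy=0):
    """MAD full search of +/-rng around a centre displacement (cx, cy)."""
    block = curr[y0:y0 + p, x0:x0 + p]
    best, best_v = None, (cx, cy)
    for t in range(cy - rng, cy + rng + 1):
        for s in range(cx - rng, cx + rng + 1):
            ys, xs = y0 + t, x0 + s
            if 0 <= ys and 0 <= xs and ys + p <= prev.shape[0] and xs + p <= prev.shape[1]:
                d = mad(block, prev[ys:ys + p, xs:xs + p])
                if best is None or d < best:
                    best, best_v = d, (s, t)
    return best_v, best

def two_level_threshold_bm(curr, prev, x0, y0, p=8, threshold=2.0):
    """Two-level threshold multiresolution block matching (sketch).

    Match at half resolution first; if the doubled (projected) coarse
    vector already satisfies the threshold at full resolution, stop
    there, otherwise refine around the projected vector."""
    c_curr, c_prev = downsample(curr), downsample(prev)
    (s, t), _ = search(c_curr, c_prev, x0 // 2, y0 // 2, p // 2, rng=4)
    s, t = 2 * s, 2 * t                   # project the vector down one level
    d = mad(curr[y0:y0 + p, x0:x0 + p], prev[y0 + t:y0 + t + p, x0 + s:x0 + s + p])
    if d <= threshold:
        return (s, t), d, 'stopped at low resolution'
    return search(curr, prev, x0, y0, p, rng=1, cx=s, cy=t) + ('refined',)

# Toy frames: an 8x8 block displaced by (-2, -2), which the coarse
# level can represent exactly, so the search stops early.
ys, xs = np.mgrid[0:32, 0:32]
prev = 255.0 * np.exp(-((xs - 12.0) ** 2 + (ys - 12.0) ** 2) / 40.0)
curr = np.zeros_like(prev)
curr[10:18, 10:18] = prev[8:16, 8:16]
vec, d, status = two_level_threshold_bm(curr, prev, 10, 10)
print(vec, status)    # (-2, -2), found and accepted at the coarse level
```

Blocks that stop at the coarse level save the fine-level search entirely, which is where the addition counts in Table 11.2 are saved.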
Table 11.1 Parameters used in experiments

                        Low-resolution level    Full-resolution level
"Miss America"
  Search range          3×3                     1×1
  Block size            4×4                     8×8
  Thresholding value    2                       None (not applicable)
"Train"
  Search range          4×4                     1×1
  Block size            4×4                     8×8
  Thresholding value    3                       None (not applicable)
"Football"
  Search range          4×4                     1×1
  Block size            4×4                     8×8
  Thresholding value    4                       None (not applicable)
Figure 11.6 The 20th frame of the "Train" sequence.
Figure 11.7 The 20th frame of the "Football" sequence.
Table 11.2 Experimental results (I)
Columns: PSNR (dB) | error image entropy (bits/pixel) | vector entropy (bits/vector) | blocks stopped at top level / total blocks | processing time (no. of additions, 10^6)

"Miss America" sequence
  Method 1 [tzoraras 1994]:  38.91 | 3.311 | 6.02 | 0/1280    | 10.02
  New method (TH=2):         38.79 | 3.319 | 5.65 | 487/1280  | 8.02
  New method (TH=3):         38.43 | 3.340 | 5.45 | 679/1280  | 6.17
"Train" sequence
  Method 1 [tzoraras 1994]:  27.37 | 4.692 | 6.04 | 0/2560    | 22.58
  New method (TH=3):         27.27 | 4.788 | 5.65 | 1333/2560 | 18.68
"Football" sequence
  Method 1 [tzoraras 1994]:  24.26 | 5.379 | 7.68 | 0/3840    | 30.06
  New method (TH=4):         24.18 | 5.483 | 7.58 | 1464/3840 | 25.90
  New method (TH=3):         24.21 | 5.483 | 7.57 | 1128/3840 | 27.10
4.2 Pel-Recursion
§ Description:
  Ø pel-based, recursive
  Ø DFD: displaced frame difference
  Ø minimization of a nonlinear function of the DFD
§ Representative: [netravali & robbins 1979], the first MC coding algorithm (see figure)
§ The earliest of the three techniques, but the least accurate (backward MA)
• In [netravali 1979], a dissimilarity measure called the displaced frame difference (DFD) was defined as follows.
  DFD(x, y; d_x, d_y) = f_n(x, y) - f_{n-1}(x - d_x, y - d_y)

  n, n-1: two successive frames
  x, y: coordinates in the image plane
  d_x, d_y: the two components of the displacement vector d
• Obviously, if there is no error in the estimation, the DFD is zero.
• A nonlinear function of the dissimilarity measure, DFD^2, was proposed in [netravali 1979].
• Displacement estimation is thereby converted into a minimization problem:
• a typical nonlinear programming problem.
♦ In [netravali 1979]:

  d^{k+1} = d^k - α · DFD(x, y; d^k) · ∇_{x,y} f_{n-1}(x - d_x, y - d_y)

  ∇_{x,y}: the gradient operator with respect to x and y
  α = 1/1024
♦ Inclusion of a neighborhood area:
  To make displacement estimation more robust, Netravali and Robbins assume the displacement vector is constant within a small neighborhood Ω:
  d^{k+1} = d^k - (1/2) α ∇_d Σ_{i: (x,y) ∈ Ω} w_i · DFD^2(x, y; d^k)

  i: index of the i-th pixel (x, y) in Ω
  w_i: weights, w_i ≥ 0, Σ_{i ∈ Ω} w_i = 1
• The algorithm can be applied to a pixel once, or applied iteratively several times, for displacement estimation. It then moves to the next pixel; the estimated displacement vector of a pixel can be used as the initial estimate for the next pixel.
• This recursion can be carried out horizontally, vertically, or temporally.
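The recursion above can be sketched at a single pixel as follows. This is an illustrative steepest-descent version: nearest-pixel rounding replaces the interpolation needed for non-integer displacements, the neighborhood term is omitted, and the ramp image and its 1.5-pixel shift are invented for the example.

```python
import numpy as np

def pel_recursive(f_n, f_nm1, x, y, alpha=1.0 / 1024, iters=200):
    """Netravali-Robbins style pel-recursive estimation at one pixel.

    d^{k+1} = d^k - alpha * DFD(x, y; d^k) * grad f_{n-1}(x - dx, y - dy),
    with the displaced intensity and gradient read at the nearest pixel.
    """
    dx, dy = 0.0, 0.0
    h, w = f_nm1.shape
    for _ in range(iters):
        xi = min(max(int(round(x - dx)), 1), w - 2)   # clamp inside the frame
        yi = min(max(int(round(y - dy)), 1), h - 2)
        dfd = f_n[y, x] - f_nm1[yi, xi]               # displaced frame difference
        gx = (f_nm1[yi, xi + 1] - f_nm1[yi, xi - 1]) / 2.0   # central differences
        gy = (f_nm1[yi + 1, xi] - f_nm1[yi - 1, xi]) / 2.0
        dx -= alpha * dfd * gx                        # steepest-descent update
        dy -= alpha * dfd * gy
    return dx, dy

# Toy frames: a horizontal ramp shifted right by 1.5 pixels.
ys, xs = np.mgrid[0:32, 0:32]
f_prev = 8.0 * xs                     # frame n-1: gradient 8 in x everywhere
f_curr = 8.0 * (xs - 1.5)             # frame n: the ramp moved by (1.5, 0)
dx, dy = pel_recursive(f_curr, f_prev, x=10, y=10)
print(round(dx, 1), round(dy, 1))     # close to (1.5, 0.0)
```

Because of the rounding, the estimate oscillates within about 1/32 pixel of the true displacement; practical implementations interpolate the displaced intensity instead.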
Figure 12.1 Three types of recursion: (a) horizontal, from (x, y) to (x+1, y); (b) vertical, from (x, y) to (x, y+1); (c) temporal, from (x, y, t_{n-1}) to (x, y, t_n).
4.3 Optical Flow
§ Description: a velocity (optical flow) vector for each pixel
  Ø correlation-based (similar to block matching)

Figure 13.1 Correlation-based approach to optical flow determination: for the pixel at (x, y) in f(x, y, t), the best-matching correlation window within a search window in f(x, y, t-1) determines the optical flow vector.

  Ø differential (gradient) methods
  Ø spatiotemporal energy-based methods
  Ø phase-based methods
§ Representative: [horn and schunck 1981], the classical differential, gradient-based solution
§ Mainly used in computer vision
§ Less used in video coding
  Ø accurate, but requires more side information and computation
  Ø (recall the ultimate goal of MA in video coding)
⇒ Need some innovative work
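For completeness, the classical Horn-Schunck iteration can be sketched as follows; this is a simplified illustration (image gradients from one frame only, wrap-around neighbour averaging, and an arbitrary smoothness weight and iteration count), not the authors' exact discretization:

```python
import numpy as np

def horn_schunck(f1, f2, alpha2=100.0, iters=100):
    """Horn-Schunck optical flow, sketched.

    Iterates u <- u_avg - Ix * (Ix*u_avg + Iy*v_avg + It) / (alpha2 + Ix^2 + Iy^2)
    (likewise for v), where u_avg, v_avg are 4-neighbour averages and
    alpha2 weights the smoothness term."""
    Ix = np.gradient(f1, axis=1)          # spatial gradients of frame 1
    Iy = np.gradient(f1, axis=0)
    It = f2 - f1                          # temporal derivative
    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    for _ in range(iters):
        u_avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                 + np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0)
                 + np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        t = (Ix * u_avg + Iy * v_avg + It) / (alpha2 + Ix ** 2 + Iy ** 2)
        u = u_avg - Ix * t
        v = v_avg - Iy * t
    return u, v

# Toy frames: a Gaussian blob translated by one pixel in x; the flow
# around the blob should point roughly in the +x direction.
ys, xs = np.mgrid[0:32, 0:32]
f1 = 255.0 * np.exp(-((xs - 16.0) ** 2 + (ys - 16.0) ** 2) / 18.0)
f2 = 255.0 * np.exp(-((xs - 17.0) ** 2 + (ys - 16.0) ** 2) / 18.0)
u, v = horn_schunck(f1, f2)
print(round(float(u[12:21, 12:21].mean()), 2),
      round(float(v[12:21, 12:21].mean()), 2))
```

Note that the output is a dense field, one vector per pixel, which is exactly the "too many motion vectors" burden mentioned above for coding.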
5. Classification
• Region-based vs. gradient-based
Table 14.1 Region-based vs. gradient-based

                                    Region-based    Gradient-based
Block matching                      √
Pel recursive                                       √
Optical flow (correlation-based)    √
Optical flow (gradient-based)                       √
• Forward vs. Backward Motion Estimation
Figure 14.1 Forward motion estimation and compensation. T: transform; Q: quantizer; FB: frame buffer; MCP: motion compensated predictor; ME: motion estimator; e: prediction error; f: input video frame; f_p: predicted video frame; f_r: reconstructed video frame; q: quantized transform coefficients; v: motion vector.
Figure 14.2 Backward motion estimation and compensation. T: transform; Q: quantizer; FB: frame buffer; MCP: motion compensated predictor; ME: motion estimator; e: prediction error; f: input video frame; f_p: predicted video frame; f_r1: reconstructed video frame; f_r2: reconstructed previous video frame; q: quantized transform coefficients.

Summary
In general, block matching achieves the best overall performance in terms of estimation accuracy and computational complexity. It has been used in all international video coding standards, such as H.261, H.263, H.26L, MPEG-1, MPEG-2, and MPEG-4.
The pel-recursive technique has the lowest motion estimation accuracy of the three and is hence less used for motion estimation and compensation.
Optical flow achieves high motion estimation accuracy but yields too many motion vectors to code; it is mainly used in the computer vision field.
One attempt (shi et al, 1998):
a. Flow vectors are highly correlated.
b. They can be modeled by a first-order AR model.
c. The DCT can therefore be applied to the flow vectors.
d. Good results have been achieved:
   i. The bit rate is comparable to that achieved by H.263.
   ii. The visual quality is better: no blocking artifacts.
Figure 13.2 (a) The 21st original frame of the "Miss America" sequence; (b) the reconstructed 21st frame with H.263; (c) the reconstructed 21st frame with the proposed technique.
6. References
[aggarwal 1988] J. K. Aggarwal and N. Nandhakumar, "On the computation of motion from sequences of images: a review," Proceedings of the IEEE, vol. 76, no. 8, pp. 917-935, 1988.
[barron 1994] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[boroczky 1991] "Pel-recursive motion estimation for image coding," Ph.D. dissertation, Technical University of Delft, the Netherlands, 1991.
[dubois 1981] E. Dubois, B. Prasada, and M. S. Sabri, "Image sequence coding," chapter 3 in T. S. Huang, Ed., Image Sequence Analysis, Springer-Verlag, 1981.
[dufaux 1995] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: a review and a new contribution," Proceedings of the IEEE, vol. 83, no. 6, pp. 858-876, 1995.
[horn 1981] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[jain 1981] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Transactions on Communications, vol. COM-29, no. 12, pp. 1799-1808, December 1981.
[koga 1981] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion-compensated interframe coding for video conferencing," Proceedings of NTC '81, pp. G5.3.1-G5.3.5, New Orleans, LA, December 1981.
[kunt 1995] M. Kunt, Ed., Special Issue on Digital Television, Part 1: Technologies, Proceedings of the IEEE, vol. 83, no. 6, June 1995.
[mounts 1969] F. W. Mounts, "A video encoding system with conditional picture-element replenishment," The Bell System Technical Journal, vol. 48, no. 7, pp. 2545-2554, September 1969.
[musmann 1985] H. G. Musmann, P. Pirsch, and H.-J. Grallert, "Advances in picture coding," Proceedings of the IEEE, vol. 73, no. 4, pp. 523-548, 1985.
[netravali 1979] A. N. Netravali and J. D. Robbins, "Motion-compensated television coding: Part I," The Bell System Technical Journal, vol. 58, no. 3, pp. 631-670, March 1979.
[srinivasan 1984] R. Srinivasan and K. R. Rao, "Predictive coding based on efficient motion estimation," Proceedings of ICC, pp. 521-526, May 1984.
[shu 1991] C. Q. Shu and Y. Q. Shi, "On unified optical flow field," Pattern Recognition, vol. 24, no. 6, pp. 579-586, 1991.
[thompson 1989] W. B. Thompson, "Introduction to special issue on visual motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 5, pp. 449-450, 1989.
[zhang 1995] Y.-Q. Zhang, W. Li, and M. L. Liou, Eds., Special Issue on Advances in Image and Video Compression, Proceedings of the IEEE, vol. 83, no. 2, February 1995.