MOTION ESTIMATION AND MOTION COMPENSATED (VIDEO) CODING

1. Imaging space
2. Image sequence vs. Video
3. Interframe correlation and motion compensated (MC) coding
4. Three major techniques in motion analysis (MA) for video coding:
   • Block matching
   • Pel-recursion
   • Optical flow


• Digital Image Processing: 1960s ⇒ active research field due to the advent of two events
• Digital Image Sequence Processing: early 1980s ⇒ active research area
  § More information available
  § More computation involved
  § More memory space required
  § Technically feasible (VLSI, DSP, etc.)

Image and video sequences: indispensable elements of our modern life.

Some applications:
  § CD-ROM, VCD, DVD
  § Digital video camera
  § Digital TV
  § Video-phony, video-conferencing
  § Video over the internet, wireless video
  § Video-on-demand
  § Interactive multimedia, databases, etc.

• Temporal image sequence: g(x, y, t)
• Stereo image (pair) sequence
• Consider a generalization: a sensor, a solid article in 3-D space, with
  § Translation (3 parameters)
  § Rotation (2 parameters)
  ⇒ Many images of the scene: infinitely many sensors, at all possible positions and orientations, take many images at a given moment


Figure 10.1 Two sensors' positions, s = (0, 0, 0, 0, 0) at the origin of the (X, Y, Z) system and s = (x̃, ỹ, z̃, β, γ) at the origin of the (X', Y', Z') system.




• A spatial image sequence: g(x, y, s)
  s: the sensor's position in 3-D space, s = (x̃, ỹ, z̃, β, γ)
• An imaging space:
  § A collection of all spatial image sequences (over the temporal domain): g(x, y, t, s)
  § Alternatively, a collection of all temporal image sequences (over the spatial domain)
• Meaning: unification of all image frames and image sequences [shu & shi 1991]


Figure 10.2 A hierarchical structure of the imaging space:
  Top level: the imaging space, g(x, y, t, s)
  Intermediate level: a temporal image sequence, g(x, y, t, s = a specific vector); a spatial image sequence, g(x, y, t = a specific moment, s)
  Bottom level: individual images, g(x, y, t = a specific moment, s = a specific vector)


2. Image Sequence vs. Video
• Image frames and image sequences: defined above
• Video:
  § refers to a single video frame
  § or to video sequences
  § associated with the visible frequency band (visual information)
• The scope of image frames and sequences is wider, e.g., infrared images and image sequences lie outside the visible band and are therefore not video
• Video is more pertinent to multimedia engineering
• The two coincide when only the visible band is concerned
• Video sometimes refers to video sequences exclusively
• Image compression ⇒ still image compression
• Video compression ⇒ video sequence coding

3. Interframe Correlation and Motion Compensated Compression
• High interframe correlation exists in video sequences:
  ♦ A video sequence in videophone service: fewer than 10% of the pixels change their intensity by more than 1% of the peak signal between frames [mounts 1969]

  ♦ A video sequence in TV broadcast (more motion is involved): even if there is no scene change, the scene is still scanned constantly between frames
• How to utilize the interframe correlation?
  ♦ Frame replenishment (FR) [mounts 1969]
    § Compare successive frames pixel by pixel
    § If the difference is large enough ⇒ the difference and the pixel position are coded and transmitted
    § Otherwise, repeat the pixel's gray scale at the receiver
    § More efficient than intraframe coding techniques such as DPCM

Interframe redundancy:

Figure 10.2 Two frames of the "Miss America" sequence: (a) frame 24, (b) frame 25.

Problem with frame replenishment:

Figure 10.3 Dirty window effect


Main drawback of the frame replenishment technique: it is difficult to handle frame sequences containing rapid changes. When changes are more rapid, the number of pixels whose intensity values need to be updated increases. In order to keep the transmission bit rate at a steady and proper level, the threshold has to be raised, so that many slow changes cannot show up at the receiver. This poorer reconstruction at the receiver is somewhat analogous to viewing a scene through a dirty window and is referred to as the dirty window effect. The result of one experiment on the dirty window effect is displayed in Figure 10.5. From frame 22 to frame 25 of the "Miss America" sequence, there are 2166 pixels (less than 10% of the total pixels) that change their gray level values by more than 1% of the peak signal. When we only update the gray level values of 25% (randomly chosen) of these changing pixels, we can clearly see the dirty window effect. When rapid scene changes exceed a certain level, buffer saturation results, causing picture breakup [mounts 1969]. Motion compensated coding, discussed below, has been shown to provide better performance than the replenishment technique in situations with rapid changes.
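To make the replenishment decision concrete, here is a minimal sketch of conditional replenishment, assuming 8-bit grayscale frames held as NumPy arrays; the threshold handling and the (row, col, value) update list are illustrative placeholders, not the encoder of [mounts 1969].

```python
import numpy as np

def replenish(prev_recon, curr, threshold):
    """Conditional replenishment: update only pixels whose change exceeds a threshold.

    prev_recon : previously reconstructed frame (what the receiver already has)
    curr       : current input frame
    threshold  : intensity change (in gray levels) that triggers an update
    Returns the new reconstruction and the (row, col, value) updates that
    would have to be coded and transmitted.
    """
    diff = curr.astype(np.int16) - prev_recon.astype(np.int16)
    changed = np.abs(diff) > threshold          # addresses to be transmitted
    recon = prev_recon.copy()
    recon[changed] = curr[changed]              # elsewhere the receiver repeats old values
    rows, cols = np.nonzero(changed)
    updates = list(zip(rows.tolist(), cols.tolist(), curr[changed].tolist()))
    return recon, updates

# Raising the threshold keeps the number of updates (and the bit rate) down,
# but slow changes below the threshold never reach the receiver: the dirty
# window effect described above.
```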

♦ Motion estimation and compensation
  § Find the displacement vector of a pixel or a set of pixels between frames
  § Via the displacement vector, predict the counterpart in the present frame
  § Prediction error, positions, and motion vectors: coded and transmitted

Figure 10.7 Block diagram of motion compensated coding: motion analysis → prediction and differentiation → encoding.

• MC is much more efficient, and more complicated, than FR
  (MC ⇒ smaller dynamic range of the prediction error)
  (both MC and FR are predictive coding)
  (MC needs correspondence, a "motion trajectory")
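As a rough illustration of the "prediction and differentiation" step above, the following sketch builds a motion-compensated prediction and its residual, assuming one integer-pixel motion vector per non-overlapping block and frame dimensions divisible by the block size; the 16×16 block size is only an assumption.

```python
import numpy as np

def mc_predict(ref, motion_vectors, block=16):
    """Build a motion-compensated prediction of the current frame from a
    reference frame and one (dy, dx) displacement per block."""
    h, w = ref.shape
    assert h % block == 0 and w % block == 0     # assumed for simplicity
    pred = np.zeros_like(ref)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Clip so the displaced block stays inside the reference frame.
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            pred[by:by + block, bx:bx + block] = ref[sy:sy + block, sx:sx + block]
    return pred

def residual(curr, pred):
    """Prediction error to be coded; its dynamic range is smaller than the frame's."""
    return curr.astype(np.int16) - pred.astype(np.int16)
```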

Figure 10.6 Two consecutive frames of a video sequence: (a) frame at t_{n-1}, (b) frame at t_n.

• The first motion compensated coding algorithm [netravali & robbins 1979] achieved a 22-50% lower bit rate than the frame replenishment algorithm



High coding efficiency is obtained at the price of high computational complexity ⇒
• This strategy is:
  § Technically feasible (VLSI, DSP, etc.)
  § Economically desirable, since the cost of DSP drops much faster than that of transmission [dubois 1981]
  § This continues to be the case
• Motion compensated video coding ⇒ a major development in coding since the 1980s [musmann 1985, zhang 1995, kunt 1995]

4. Motion Estimation (ME)

• Biological Vision Perspective


  § For understanding the human visual system (HVS)
  § Studied by biomedical scientists
  § Goal: to shed light on technology development (machine vision and others)
  § Measurement and interpretation: two steps [singh 91]
• Computer Vision Perspective
  § For interpretation of 3-D motion from image sequences
  § Requirements: accuracy and processing speed
  § A dominant subject in vision [thompson 1989]
  § Correspondence and optical flow: two approaches
  § Rigid vs. deformable motions
  § Intermediate-variable vs. direct methods
• Signal Processing Perspective


  § For MC coding (the main purpose) and other MC video processing tasks
  § Ultimate goal:
    Ø Higher coding efficiency, i.e., the same quality at a lower bit rate, or better quality at the same bit rate
    Ø Quite different from the goal in computer vision
  § Real-time requirement ⇒ simple 2-D displacement model

4.1 Block Matching
§ Description: region-based, matching
§ Representative: [jain 1981], see the diagram below
§ The most popular of the three techniques

Figure 11.1 Block matching: (a) the t_n frame, showing the original block at (x, y); (b) the t_{n-1} frame, showing the search window, the correlation window, the best matching block, and the displacement vector.

Matching Criteria
• Instead of finding the maximum similarity, an equivalent and computationally more efficient way of block matching is finding the minimum dissimilarity.
• The dissimilarity (error, distortion, or distance) between two frames t_n and t_{n-1}, D(s, t), is defined as:


$$D(s, t) = \frac{1}{p \cdot q} \sum_{j=1}^{p} \sum_{k=1}^{q} M\big(f_n(j, k),\; f_{n-1}(j+s,\, k+t)\big)$$

M(u, v): a dissimilarity metric of u and v.

• Several types of matching criteria:
  § mean square error (MSE) [jain 1981]: M(u, v) = (u - v)²
  § sum of squared difference (SSD) [anandan 1987]
  § sum of squared error (SSE) [chan 1990]
  § mean absolute difference (MAD) [koga 1981]: M(u, v) = |u - v|
  § mean absolute error (MAE) [nogaki 1972]
• A study based on experimental work reported that the matching criterion does not significantly affect the search [srinivasan 1984]. The MAD is hence preferred, due to its simplicity of implementation [musmann 1985].


Searching Procedures
• Full search
  § Brute force in nature
  § Good accuracy
  § Large amount of computation
• Several fast searching procedures
  § 2-D logarithm search [jain 1981]
  § Three-step search [koga 1981]
  § Conjugate direction search [srinivasan 1984]
(A minimal full-search sketch using the MAD criterion is given below.)
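This is a minimal sketch of full search with the MAD criterion, assuming grayscale NumPy frames, integer-pixel displacements, and an illustrative block size and search range; it is a brute-force reference, not an optimized implementation.

```python
import numpy as np

def mad(a, b):
    """Mean absolute difference between two equal-sized blocks."""
    return np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16)))

def full_search(curr, ref, block=16, search=7):
    """Exhaustive block matching: for every block of the current frame, test
    every integer displacement in [-search, +search] in the reference frame
    and keep the one with minimum MAD."""
    h, w = curr.shape
    vectors = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cur_blk = curr[by:by + block, bx:bx + block]
            best = (np.inf, (0, 0))
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    sy, sx = by + dy, bx + dx
                    if sy < 0 or sx < 0 or sy + block > h or sx + block > w:
                        continue                  # candidate falls outside the frame
                    d = mad(cur_blk, ref[sy:sy + block, sx:sx + block])
                    if d < best[0]:
                        best = (d, (dy, dx))
            vectors[by // block, bx // block] = best[1]
    return vectors
```

Every block tests (2·search + 1)² candidates; the fast procedures listed above exist precisely to cut this count down.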


Figure 11.1 (a) 2-D logarithm search procedure: Points at (j, k+2), (j+2, k+2), (j+2, k+4), and (j+1, k+4) are found to give the minimum dissimilarity in steps 1, 2, 3 and 4, respectively.


Figure 11.2 Three-step search procedure: points (j+4, k-4), (j+4, k-6), and (j+5, k-7) give the minimum dissimilarity in steps 1, 2, and 3, respectively.
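As a companion to the figure above, here is a minimal sketch of the three-step search for a single block, assuming the MAD criterion, integer-pixel motion, and an initial step size of 4; tie-breaking and boundary handling are simplified.

```python
import numpy as np

def three_step_search(curr, ref, by, bx, block=16, step=4):
    """Three-step search for the block whose top-left corner is (by, bx).

    At each step the centre and its 8 neighbours at the current step size are
    tested with MAD; the best point becomes the new centre and the step size
    is halved (4 -> 2 -> 1)."""
    h, w = ref.shape
    cur_blk = curr[by:by + block, bx:bx + block].astype(np.int16)

    def mad_at(dy, dx):
        sy, sx = by + dy, bx + dx
        if sy < 0 or sx < 0 or sy + block > h or sx + block > w:
            return np.inf                       # candidate falls outside the frame
        return np.mean(np.abs(cur_blk - ref[sy:sy + block, sx:sx + block]))

    best_dy = best_dx = 0
    while step >= 1:
        candidates = [(best_dy + i * step, best_dx + j * step)
                      for i in (-1, 0, 1) for j in (-1, 0, 1)]
        best_dy, best_dx = min(candidates, key=lambda c: mad_at(*c))
        step //= 2
    return best_dy, best_dx
```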

Thresholding Multiresolution Block Matching [shi and xia 1997]

Figure 11.3 A Gaussian pyramid structure (level 0: full resolution; level 4: lowest resolution; the resolution decreases as the level increases).
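A minimal sketch of building such a pyramid by repeated low-pass filtering and 2:1 subsampling; the 3×3 kernel and the number of levels are illustrative assumptions, not the filter of the original work.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_pyramid(frame, levels=3):
    """Return [level 0 (full resolution), level 1, ...], each coarser level
    obtained by low-pass filtering and 2:1 subsampling of the finer one."""
    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=np.float64) / 16.0   # simple low-pass
    pyramid = [frame.astype(np.float64)]
    for _ in range(levels - 1):
        smoothed = convolve(pyramid[-1], kernel, mode='nearest')
        pyramid.append(smoothed[::2, ::2])                     # 2:1 subsampling
    return pyramid
```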

Figure 11.4 Block diagram for three-level threshold multiresolution block matching (frames n and n-1 are low-pass filtered and subsampled at each level; block matching is carried out at the coarsest level, and at each finer level a "satisfying threshold" test decides whether block matching must be repeated; the output is the motion field).

Figure 11.5 The thresholding process (projection of the block and its estimated motion vector to level L, followed by calculation of the MAD of the block at level L).
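A rough sketch of the per-level decision in this thresholding process, under the assumption that the motion vector estimated at the coarser level has already been projected (doubled) to the current level; the block size and threshold below are placeholders loosely patterned on Table 11.1, not the exact procedure of [shi and xia 1997].

```python
import numpy as np

def accept_projected_vector(curr_level, ref_level, by, bx, dy, dx,
                            block=8, threshold=3.0):
    """Compute the MAD of the block at (by, bx) under the projected motion
    vector (dy, dx); if it already satisfies the threshold, block matching at
    this finer level can be skipped for this block."""
    h, w = ref_level.shape
    sy = int(np.clip(by + dy, 0, h - block))
    sx = int(np.clip(bx + dx, 0, w - block))
    cur_blk = curr_level[by:by + block, bx:bx + block].astype(np.int16)
    ref_blk = ref_level[sy:sy + block, sx:sx + block].astype(np.int16)
    return np.mean(np.abs(cur_blk - ref_blk)) <= threshold

# Projection from level L+1 to level L simply doubles the vector:
# (dy_L, dx_L) = (2 * dy_{L+1}, 2 * dx_{L+1}).
```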


Table 11.1 Parameters used in experiments

                            Low resolution level    Full resolution level
  "Miss America"
    Search Range            3×3                     1×1
    Block Size              4×4                     8×8
    Thresholding Value      2                       None (not applicable)
  "Train"
    Search Range            4×4                     1×1
    Block Size              4×4                     8×8
    Thresholding Value      3                       None (not applicable)
  "Football"
    Search Range            4×4                     1×1
    Block Size              4×4                     8×8
    Thresholding Value      4                       None (not applicable)


Figure 11.6 The 20th frame of the "Train" sequence.

Figure 11.7 The 20th frame of the "Football" sequence.


Table 11.2 Experimental results (I)

                              PSNR    Error Image    Vector Entropy   Blocks stopped at   Processing Time
                              (dB)    Entropy        (bits/vector)    top level /         (No. of Additions,
                                      (bits/pixel)                    Total blocks        ×10^6)
  "Miss America" Sequence
    Method 1 [tzoraras 1994]  38.91   3.311          6.02             0/1280              10.02
    New Method (TH=2)         38.79   3.319          5.65             487/1280             8.02
    New Method (TH=3)         38.43   3.340          5.45             679/1280             6.17
  "Train" Sequence
    Method 1 [tzoraras 1994]  27.37   4.692          6.04             0/2560              22.58
    New Method (TH=3)         27.27   4.788          5.65             1333/2560           18.68
  "Football" Sequence
    Method 1 [tzoraras 1994]  24.26   5.379          7.68             0/3840              30.06
    New Method (TH=4)         24.18   5.483          7.58             1464/3840           25.90
    New Method (TH=3)         24.21   5.483          7.57             1128/3840           27.10

4.2 Pel-Recursion
§ Description:
  Ø recursive, pixel-by-pixel estimation
  Ø DFD: displaced frame difference
  Ø minimization of a nonlinear function of the DFD
§ Representative: [netravali & robbins 1979], the first MC algorithm (see figure)
§ The earliest of the three techniques, but the least accurate one (backward MA)

• In [netravali 1979], a dissimilarity measure called the displaced frame difference (DFD) was defined as follows.

$$\mathrm{DFD}(x, y; d_x, d_y) = f_n(x, y) - f_{n-1}(x - d_x,\; y - d_y)$$

n, n-1: two successive frames
x, y: coordinates in the image plane
d_x, d_y: the two components of the displacement vector d

• Obviously, if there is no error in the estimation, the DFD is zero.
• A nonlinear function of the dissimilarity measure, DFD², was proposed in [netravali 1979].
• Displacement estimation is then converted into a minimization problem: a typical nonlinear programming problem.
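The update formula quoted next can be motivated by one steepest-descent step on DFD²; the short derivation below is a standard one added here for clarity, not quoted from [netravali 1979].

$$\nabla_{\vec{d}}\,\mathrm{DFD}^2(x, y; \vec{d}) = 2\,\mathrm{DFD}(x, y; \vec{d})\,\nabla_{\vec{d}}\big[f_n(x, y) - f_{n-1}(x - d_x,\; y - d_y)\big] = 2\,\mathrm{DFD}(x, y; \vec{d})\,\nabla_{x,y} f_{n-1}(x - d_x,\; y - d_y)$$

so a steepest-descent step of size α/2 on DFD² gives

$$\vec{d}^{\,k+1} = \vec{d}^{\,k} - \alpha\,\mathrm{DFD}(x, y; \vec{d}^{\,k})\,\nabla_{x,y} f_{n-1}(x - d_x,\; y - d_y).$$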

♦ In [netravali 1979] :

28

$$\vec{d}^{\,k+1} = \vec{d}^{\,k} - \alpha\,\mathrm{DFD}(x, y; \vec{d}^{\,k})\,\nabla_{x,y} f_{n-1}(x - d_x,\; y - d_y)$$

∇_{x,y}: gradient operator with respect to x and y
α = 1/1024

♦ Inclusion of a neighborhood area
To make displacement estimation more robust, Netravali and Robbins assume that the displacement vector is constant within a small neighborhood Ω:

$$\vec{d}^{\,k+1} = \vec{d}^{\,k} - \frac{1}{2}\,\alpha\,\nabla_{\vec{d}} \sum_{i:\,(x, y) \in \Omega} w_i\,\mathrm{DFD}^2(x, y; \vec{d}^{\,k})$$

i: an index for the i-th pixel (x, y) in Ω
w_i: weights satisfying w_i ≥ 0 and Σ_{i∈Ω} w_i = 1

• The algorithm can be applied to a pixel once, or applied iteratively several times, for displacement estimation. The algorithm then moves to the next pixel; the estimated displacement vector of one pixel can be used as the initial estimate for the next pixel.
• This recursion can be carried out horizontally, vertically, or temporally (a minimal single-pixel sketch is given below).
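A minimal single-pixel sketch of the basic (no-neighborhood) recursion, assuming grayscale NumPy frames, central-difference spatial gradients, and crude rounding instead of interpolation at non-integer displacements; the original algorithm handles the displaced position more carefully.

```python
import numpy as np

def pel_recursive_step(f_n, f_nm1, x, y, d, alpha=1.0 / 1024.0):
    """One update d^{k+1} = d^k - alpha * DFD * grad f_{n-1}(x-dx, y-dy).

    f_n, f_nm1 : current and previous frames (2-D arrays, indexed [y, x])
    (x, y)     : pixel whose displacement is being estimated
    d          : current displacement estimate (dx, dy)
    """
    h, w = f_nm1.shape
    dx, dy = d
    # Rounded displaced position in the previous frame (interpolation omitted).
    xs = int(np.clip(round(x - dx), 1, w - 2))
    ys = int(np.clip(round(y - dy), 1, h - 2))
    dfd = float(f_n[y, x]) - float(f_nm1[ys, xs])
    # Central-difference spatial gradient of f_{n-1} at the displaced position.
    gx = (float(f_nm1[ys, xs + 1]) - float(f_nm1[ys, xs - 1])) / 2.0
    gy = (float(f_nm1[ys + 1, xs]) - float(f_nm1[ys - 1, xs])) / 2.0
    return (dx - alpha * dfd * gx, dy - alpha * dfd * gy)

# Scanning pixels in raster order and seeding each pixel with its neighbour's
# estimate realises the horizontal recursion of Figure 12.1(a).
```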


Figure 12.1 Three types of recursion: (a) horizontal, from (x, y) to (x+1, y); (b) vertical, from (x, y) to (x, y+1); (c) temporal, from (x, y, t_{n-1}) to (x, y, t_n).

4.3 Optical Flow


§ Description: a velocity vector for each pixel
  Ø correlation-based (similar to block matching)

Figure 13.1 Correlation-based approach to optical flow determination: in frame f(x, y, t), a correlation window is placed around the pixel for which the optical flow is to be determined; in frame f(x, y, t-1), the best-matching correlation window inside the search window defines the optical flow vector.

  Ø differential (gradient) method
  Ø spatiotemporal energy-based
  Ø phase-based
§ Representative: [horn and schunck 1981], the differential (gradient-based) classical solution
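A minimal sketch in the spirit of the differential method of [horn 1981], assuming float grayscale frames; the derivative kernels, the smoothness weight, and the iteration count are illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(f1, f2, alpha=10.0, n_iter=100):
    """Estimate a dense flow field (u, v) between two frames by iteratively
    minimizing the brightness-constancy error plus a smoothness term."""
    f1 = f1.astype(np.float64)
    f2 = f2.astype(np.float64)
    # Simple spatio-temporal derivative estimates over the two frames.
    kx = np.array([[-1, 1], [-1, 1]]) * 0.25
    ky = np.array([[-1, -1], [1, 1]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(f1, kx) + convolve(f2, kx)
    Iy = convolve(f1, ky) + convolve(f2, ky)
    It = convolve(f2, kt) - convolve(f1, kt)
    # Weighted 8-neighbourhood average used in the iterative update.
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6],
                    [1/12, 1/6, 1/12]])
    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v
```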

§ Mainly used in computer vision
§ Less used in video coding
  Ø though accurate, it requires more side information and more computation
  Ø (recall the ultimate goal of MA in video coding)
⇒ Need some innovative work

5. Classification • Region-based vs. gradient-based


Table 14.1 Region-based vs. gradient-based techniques

                     | Region-based               | Gradient-based
  Block matching     | x                          |
  Pel recursive      |                            | x
  Optical flow       | correlation-based method   | gradient-based method

• Forward vs. Backward Motion Estimation

Figure 14.1 Forward motion estimation and compensation. T: transform, Q: quantizer, FB: frame buffer, MCP: motion compensated predictor, ME: motion estimator, e: prediction error, f: input video frame, f_p: predicted video frame, f_r: reconstructed video frame, q: quantized transform coefficients, v: motion vector.

Figure 14.2 Backward motion estimation and compensation. T: transform, Q: quantizer, FB: frame buffer, MCP: motion compensated predictor, ME: motion estimator, e: prediction error, f: input video frame, f_p: predicted video frame, f_r1: reconstructed video frame, f_r2: reconstructed previous video frame, q: quantized transform coefficients.

Summary

In general, block matching achieves the best trade-off between estimation accuracy and computational complexity. It has been used in all international video coding standards, such as H.261, H.263, H.26L, MPEG-1, MPEG-2, and MPEG-4.

The pel-recursive technique has the lowest motion estimation accuracy of the three; hence it is less used for motion estimation and compensation.

Optical flow achieves high motion estimation accuracy but produces too many motion vectors to transmit; it is mainly used in the computer vision field. One attempt (shi et al, 1998), sketched roughly below:
a. Flow vectors are highly correlated.
b. They can be modeled by a first-order AR model.
c. The DCT can therefore be applied to the flow vectors.
d. Good results have been achieved:
   • The bit rate is comparable to that achieved by H.263.
   • The visual quality is better: no blocking artifacts.
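A rough sketch of step (c) only, assuming the flow field is stored as two arrays (u, v) whose dimensions are multiples of the block size; this is not the coder of (shi et al, 1998), just an illustration of applying a blockwise 2-D DCT to one flow component.

```python
import numpy as np
from scipy.fft import dctn

def dct_code_flow(component, block=8):
    """Blockwise 2-D DCT of one flow-field component (u or v).
    Assumes the field dimensions are multiples of the block size."""
    h, w = component.shape
    coeffs = np.zeros((h, w), dtype=np.float64)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            coeffs[by:by + block, bx:bx + block] = dctn(
                component[by:by + block, bx:bx + block], norm='ortho')
    return coeffs   # quantization and entropy coding of the coefficients would follow
```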


Figure 13.2 (a) The 21st original frame of the "Miss America" sequence; (b) the reconstructed 21st frame with H.263; (c) the reconstructed 21st frame with the proposed technique.

6. References

[aggarwal 1988] J. K. Aggarwal and N. Nandhakumar, "On the computation of motion from sequences of images: a review," Proceedings of the IEEE, vol. 76, no. 8, pp. 917-935, 1988.
[barron 1994] J. L. Barron, D. J. Fleet and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[boroczky 1991] "Pel-recursive motion estimation for image coding," Ph.D. dissertation, Delft University of Technology, The Netherlands, 1991.
[dubois 1981] E. Dubois, B. Prasada and M. S. Sabri, "Image sequence coding," Chapter 3 in T. S. Huang, Ed., Image Sequence Analysis, Springer-Verlag, 1981.
[dufaux 1995] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: A review and a new contribution," Proceedings of the IEEE, vol. 83, no. 6, pp. 858-876, 1995.
[horn 1981] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[jain 1981] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Transactions on Communications, vol. COM-29, no. 12, pp. 1799-1808, December 1981.
[koga 1981] T. Koga, K. Iinuma, A. Hirano, Y. Iijima and T. Ishiguro, "Motion-compensated interframe coding for video conferencing," Proceedings of NTC '81, pp. G5.3.1-G5.3.5, New Orleans, LA, December 1981.
[kunt 1995] M. Kunt, Ed., Special Issue on Digital Television Part 1: Technologies, Proceedings of the IEEE, vol. 83, no. 6, June 1995.
[mounts 1969] F. W. Mounts, "A video encoding system with conditional picture-element replenishment," The Bell System Technical Journal, vol. 48, no. 7, pp. 2545-2554, September 1969.
[musmann 1985] H. G. Musmann, P. Pirsch, and H. J. Grallert, "Advances in picture coding," Proceedings of the IEEE, vol. 73, no. 4, pp. 523-548, 1985.
[netravali 1979] A. N. Netravali and J. D. Robbins, "Motion-compensated television coding: Part I," The Bell System Technical Journal, vol. 58, no. 3, pp. 631-670, March 1979.
[srinivasan 1984] R. Srinivasan and K. R. Rao, "Predictive coding based on efficient motion estimation," Proceedings of ICC, pp. 521-526, May 1984.
[shu 1991] C. Q. Shu and Y. Q. Shi, "On unified optical flow field," Pattern Recognition, vol. 24, no. 6, pp. 579-586, 1991.
[thompson 1989] W. B. Thompson, "Introduction to special issue on visual motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 5, pp. 449-450, 1989.
[zhang 1995] Y.-Q. Zhang, W. Li and M. L. Liou, Eds., Special Issue on Advances in Image and Video Compression, Proceedings of the IEEE, vol. 83, 1995.