UNIVERSITY OF CALIFORNIA, IRVINE

Scalable Action Recognition in Continuous Video Streams

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY in Computer Science

by

Hamed Pirsiavash

Dissertation Committee:
Professor Deva Ramanan, Chair
Professor Ramesh Jain
Professor Charless C. Fowlkes
Professor Pietro Perona
Professor Padhraic Smyth

2012

© 2012 Hamed Pirsiavash

DEDICATION

To my wonderful parents and to my lovely wife


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGMENTS

CURRICULUM VITAE

ABSTRACT OF THE DISSERTATION

1 Introduction
  1.1 Motivation
  1.2 What makes this problem hard?
  1.3 Our approach
    1.3.1 High-level features based on objects and human pose
    1.3.2 Scalable appearance features
    1.3.3 Efficient temporal features
    1.3.4 Our grammar-based action models
    1.3.5 An Application: Detecting Activities of Daily Living in First-Person Camera Views
  1.4 Related work

2 Scalable Rigid Templates
  2.1 Introduction
  2.2 Related Work
  2.3 Model definition
    2.3.1 Learning
    2.3.2 Coordinate descent
  2.4 Motivation
    2.4.1 Regularization
    2.4.2 Efficiency
    2.4.3 Transfer
  2.5 Extensions
    2.5.1 Multilinear
    2.5.2 Bilinear multi-class SVMs
  2.6 Experiments
    2.6.1 Spatio-temporal pedestrian detection
    2.6.2 Human action classification
  2.7 Conclusion

3 Scalable Part Models
  3.1 Introduction
  3.2 Related work
  3.3 Steerable basis model
  3.4 Learning
  3.5 Steerability and separability
  3.6 Multi-category learning
  3.7 Computational savings
  3.8 Experiments
    3.8.1 Articulated human pose estimation
    3.8.2 Facial pose estimation / landmark localization
    3.8.3 Sharing across object categories
  3.9 Conclusion

4 Temporal Features
  4.1 Introduction
  4.2 Related Work
  4.3 Model
    4.3.1 Independent tracks
    4.3.2 Track interdependence
  4.4 MAP Inference
    4.4.1 Equivalence to network flow
  4.5 Finding min-cost flows
    4.5.1 Successive Shortest-paths
    4.5.2 Dynamic Programming Solution for K = 1
    4.5.3 Approximate DP solution for K > 1
    4.5.4 Approximate 2-pass DP solution for K > 1
    4.5.5 Caching
  4.6 Experimental Results
  4.7 Conclusion

5 Action Models
  5.1 Introduction
  5.2 Related Work
  5.3 Segmental Grammars
    5.3.1 Context-Free Grammars
    5.3.2 Segmental Context-Free Grammars
    5.3.3 Regular grammars
    5.3.4 Segmental regular grammars
  5.4 Model structure
  5.5 Learning
  5.6 Experiments
  5.7 Conclusion

6 An Application: Detecting Activities of Daily Living in First-Person Camera Views
  6.1 Introduction
  6.2 Related work
  6.3 Temporal pyramids
  6.4 Active object models
  6.5 Dataset
    6.5.1 Collection and size
    6.5.2 Annotation
    6.5.3 Characteristics
  6.6 Experimental results
    6.6.1 Action recognition results
  6.7 Conclusion

7 Conclusion and future work
  7.1 Future work

Bibliography

LIST OF FIGURES

1.1 Different applications of action recognition in video.
1.2 Action recognition, object detection, and scene understanding are highly related problems.
1.3 Multiple instances of humans with some partial occlusions in a video frame.
1.4 A partially occluded faucet in interaction.
1.5 Sample human poses that are difficult to detect due to large amounts of articulation.
1.6 Large variation in the appearance of a fridge in interaction.
1.7 The importance of temporal models: is this person catching the ball or throwing it?
1.8 Different sub-actions of the "making tea" action. Some actions exhibit a complex temporal structure.
1.9 Block diagram of our approach.
1.10 Sample low-level and high-level features.
2.1 Illustration of extracting appearance features from an image and then applying a linear model for human detection.
2.2 An image, extracted HOG features, and a sample learned template.
2.3 An illustration of applying a linear model in the matrix form.
2.4 Illustration of the bilinear factorization.
2.5 Precision-recall results of the linear model, PCA, and our bilinear model on the INRIA-motion database [25].
2.6 Class confusion matrices for our model and the baselines on the UCF dataset [111].
2.7 Bilinear model for transfer learning in multiclass classifiers.
2.8 Results of transfer learning on the UCF action recognition dataset with limited training data.
3.1 Our results on three diverse but well-benchmarked applications of part models.
3.2 A large vocabulary of parts and a steerable basis.
3.3 The general scoring function for part models and our steerable representation.
3.4 The original articulated human model from [145] and our steerable/separable reconstruction.
3.5 Performance comparison between the articulated pose model of [145] and our steerable/separable variant.
3.6 We plot PCP performance over iterations of coordinate descent.
3.7 The original face model from [151] and its reconstruction.
3.8 Accuracy vs. reduction in the model size for facial pose estimation and landmark localization on MultiPIE.
3.9 Evaluating landmark localization in the facial pose model.
3.10 The original car model and its steerable/separable reconstruction.
4.1 Our tracking results on the ETHMS dataset.
4.2 The intuition behind our optimal greedy algorithm.
4.3 The network model from [149] for three consecutive frames of video.
4.4 Illustration of the successive shortest path algorithm.
4.5 Our tracking results on the Caltech Pedestrian dataset.
4.6 Input and output detections of our tracking algorithm.
4.7 Cost vs. iteration number for all three algorithms on the Caltech dataset.
4.8 Detection rate versus FPPI on the Caltech dataset [27] and the ETHMS dataset [29].
4.9 Non-max-suppression at the detection-window level.
4.10 An illustration of applying non-max-suppression at the track level rather than the detection-window level.
5.1 A sample video parsed into actions and subactions.
5.2 The underlying grammar used in Fig. 5.1.
5.3 Context-free grammars (CFGs), segmental CFGs, and segmental regular grammars.
5.4 The CYK parsing algorithm and an illustrative example.
5.5 A sample SRG rule set used to generate the actual parse shown in Fig. 5.1.
5.6 Our codebook of poses, which we use to define a "bag-of-poses" data model.
5.7 Our Continuous Olympics action dataset contains video streams of 16 different actions collected from YouTube.
5.8 A sample parsed video containing "javelin throw" actions.
5.9 Sample frames of a continuous video of three "weight-lifting" actions with the detected poses.
5.10 Precision-recall curve for two baselines and our SRG parser.
6.1 Activities of daily living (ADL) captured by a wearable camera.
6.2 Recognizing actions using an object-centric bag model.
6.3 An illustration of the temporal pyramid model for action recognition.
6.4 Our dataset contains images of objects under different semantic states of use.
6.5 We visualize our passive and active stove models.
6.6 Bag model for action recognition constructed using active and passive object detections.
6.7 Average location of active vs. passive objects in our ADL dataset.
6.8 The average passive and active TV remote in our ADL dataset.
6.9 Variation in different kitchen scenes in our dataset.
6.10 Our manually-designed functional ADL taxonomy.
6.11 The taxonomy-based cost of mistaking one class for another.
6.12 Toothbrushes in our ADL dataset look different than those found in web-based collections.
6.13 Confusion matrix for temporal pyramid with vision-based active object detectors on pre-segmented videos.
6.14 Sample detected active and passive objects with their confidence values.

LIST OF TABLES

3.1 We compare articulated pose estimation results with [145], which reports state-of-the-art results on the PARSE benchmark [108].
3.2 We compare our results in facial detection, pose estimation, and landmark localization with the baseline in [151].
3.3 Average precision for different object categories in the PASCAL 2007 dataset.
4.1 Evaluating track label error as a function of the length of the allowable occlusion.
4.2 Our algorithm's performance compared to the previous state-of-the-art on the ETHMS dataset.
5.1 A taxonomy of different temporal models with their inference complexity.
5.2 Comparing our model with the baselines in terms of precision and running time.
6.1 Statistics for the duration of each action.
6.2 Average precision results for part-based object detectors evaluated on our ADL dataset.
6.3 Classification accuracy and taxonomy loss for action recognition using different representations.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Professor Deva Ramanan, for his constant support, encouragement, and insightful comments and advice throughout my Ph.D. studies. I consider myself lucky for having been his student.

Thanks to my co-advisor, Professor Ramesh Jain, for his kind support and insightful discussions and for broadening my view on recognizing and solving practical research problems.

Thanks to Professor Charless Fowlkes for the enjoyable and productive collaboration. Thanks to my other thesis committee members, Professor Pietro Perona and Professor Padhraic Smyth, for insightful comments and discussions.

Thanks to the faculty and staff members of the School of ICS at UC Irvine, especially Mark Cartnal, Cindy Kennedy, and Carolyn Simpson, for providing resources during my PhD studies.

Thanks to Professor Farbod Razzazi for all his support and encouragement. He was the first to introduce me to the field of computer vision.

Thanks to the UCI Vision group for many enlightening chats: Mohsen Hejrati, Chaitanya Desai, Dennis Park, Yi Yang, Sam Hallman, Xiangxin Zhu, Julian Yarkony, Ragib Morshed, Sholeh Forouzan, Golnaz Ghiasi, James Supancic, Bailey Kong, Shaofei Wang, Raul Diaz, Goutham Patnaik, Carl Vondrick, Kuang-Chi Liu, Yihang Bo, Tony Tran, and Levi Boyles.

Thanks to the UCI Experiential Systems group for great discussions: Vivek Singh, Mingyan Gao, YeSun Joung, Pinaki Sinha, Laleh Jalali, Ish Rishabh, Setareh Rafatirad, Arjun Satish, Siripen Pongpaichet, Bo Gong, and Bohyung Han.

Thanks to my friends and collaborators for all the insightful discussions: Ramin Mehran, Mohammad Mahoor, Ali Farhadi, Alireza Fathi, Mohammad Rastegari, Mohammad Amin Sadeghi, Roozbeh Mottaghi, Sobhan Naderi Parizi, Babak Saleh, Mohammad Norouzi, Neda Sabbaghpour, Payam Esfandiari, and many others.

Thanks to my awesome friends in Irvine for making my PhD studies enjoyable. It would have been difficult to do research without having fun with them.

Thanks to my brother, Mehdi, for all his encouragement. I owe a special thanks to my lovely wife, Saeedeh, for her love and support. And finally, thanks to my wonderful parents for everything.

CURRICULUM VITAE

Hamed Pirsiavash

EDUCATION

Doctor of Philosophy in Computer Science, 2012
University of California, Irvine (Irvine, California)

Master of Science in Electrical Engineering, 2006
Sharif University of Technology (Tehran, Iran)

Bachelor of Science in Electrical Engineering, 2003
Iran University of Science and Technology (Tehran, Iran)

High School Diploma in Math and Physics, 1999
Allameh Helli High School (Tehran, Iran)

ABSTRACT OF THE DISSERTATION

Scalable Action Recognition in Continuous Video Streams

By

Hamed Pirsiavash

Doctor of Philosophy in Computer Science

University of California, Irvine, 2012

Professor Deva Ramanan, Chair

Activity recognition in video has a variety of applications, including rehabilitation, surveillance, and video retrieval. It is relatively easy for a human to recognize actions in a video once he/she watches it. However, in many applications the videos are very long, e.g. in life-logging, and/or we need real-time detection, e.g. in human-computer interaction. This motivates us to build computer vision and artificial intelligence algorithms to recognize activities in video sequences automatically.

We address several challenges in activity recognition, including (1) computational scalability, (2) spatio-temporal feature extraction, (3) spatio-temporal models, and finally, (4) dataset development.

(1) Computational scalability: We develop "steerable" models that parsimoniously represent a large collection of templates with a small number of parameters. This results in local detectors scalable enough for a large number of frames and object/action categories.

(2) Spatio-temporal feature extraction: Spatio-temporal feature extraction is difficult for scenes with many moving objects that interact and occlude each other. We tackle this problem using the framework of multi-object tracking and developing linear-time, scalable graph-theoretic algorithms for inference.

(3) Spatio-temporal models: Actions exhibit complex temporal structure, such as sub-actions of variable durations and compositional orderings. Much research on action recognition ignores such structure and instead focuses on K-way classification of temporally pre-segmented video clips [105, 3]. We describe lightweight and efficient grammars that segment a continuous video stream into a hierarchical parse of multiple actions and sub-actions.

(4) Dataset development: Finally, in terms of evaluation, video benchmarks are relatively scarce compared to the abundance of image benchmarks. It appears difficult to collect (and annotate) large-scale, unscripted footage of people doing interesting things. We discuss one solution, introducing a new, large-scale benchmark for the problem of detecting activities of daily living (ADL) in first-person camera views.

Chapter 1 Introduction

1.1 Motivation

Applications: Action recognition in video has a variety of applications, including tele-rehabilitation, surveillance, and video retrieval. Several examples are shown in Fig. 1.1. It is relatively easy for a human to recognize actions in a video once he/she watches it. However, in many applications like life-logging and surveillance, the videos are very long and it is a very tedious task for humans to watch them all. Moreover, human vision is limited in perceiving fast changes, so humans cannot watch a video at many times real-time speed. In addition, some applications such as human-computer interaction need real-time activity detection, which cannot be done by humans. This motivates us to build computer vision and artificial intelligence algorithms to recognize actions in videos automatically.

Figure 1.1: Different applications of action recognition in video.

General object and scene understanding: Object detection and scene understanding in still images are two well-known problems in computer vision. Thanks to recent advances in technology, capturing video data has become very popular and less expensive. Hence, we should revisit these problems and solve them for video data to understand and exploit the temporal information inherent in the video. We believe action recognition, object detection, and scene understanding are highly related problems, and solving one introduces some context that helps in solving the other two. For instance, in Fig. 1.2, as soon as we detect that the person is brushing his teeth in the bathroom, we can easily tell that the person is holding a toothbrush, although the toothbrush has a very small appearance and is highly occluded.

Human vision (recognizing actions): We know that humans recognize actions in video easily, so from a scientific point of view, there should be a way for artificial intelligence systems to solve this problem, and the path to this solution may help us in understanding human vision better.

1.2 What makes this problem hard?

No canonical activities / datasets: Activity recognition is a classic task in computer vision, but is relatively less well-defined compared to related problems such as object recognition, for which large-scale, established benchmarks exist [30, 26]. We believe this is so for the following reasons: (1) It is difficult to define canonical categories of everyday behavior outside particular domains such as surveillance and sports analysis. (2) It is difficult to collect large-scale footage with rich intra-class variation. For example, unscripted surveillance footage tends to be repetitive, often dominated by scenes of people walking. Traditionally, the above limitations have been addressed by using actor-scripted video footage of posture-defined action categories such as "skipping" or "jumping" [115, 41]. Such categories may be artificial because they tend not to be functionally defined, a core aspect of human movement [8].

Figure 1.2: Action recognition, object detection, and scene understanding are highly related problems, and solving one introduces some context that helps in solving the other two. In this example, detecting the "brushing teeth" action based on the scene and human pose tells us that the person is holding a toothbrush in his hand. Note that the object is very small and highly occluded.

Interactions and partial occlusion: People rarely move in isolation, so there may be more than one person/object involved in an action. Hence, it is necessary to track multiple objects/humans simultaneously and detect their interactions. This leads to solving the partial occlusion problem, which is one of the main difficulties in building visual detectors. Fig. 1.3 shows multiple instances of humans with partial occlusions in a video frame, and Fig. 1.4 shows a faucet partially occluded due to interaction. The partial occlusion problem makes the object detection task very difficult.

Figure 1.3: Multiple instances of humans with some partial occlusions in a video frame.

Figure 1.4: A partially occluded faucet in interaction.

Figure 1.5: Sample human poses that are difficult to detect due to large amounts of articulation.

Difficult to extract reliable features outside of background subtraction: A large portion of the literature on activity recognition in surveillance videos uses some form of background subtraction to extract features on the moving regions of interest. This approach has been successful in many surveillance applications with stationary cameras. However, in many other applications, including those using wearable cameras, these methods fail since it is not easy to model the background. Traditionally, low-level features like space-time interest points [72] have been used in these cases, but they suffer from a lack of semantics. Moreover, they are not very reliable since they are extracted from a very small spatio-temporal patch of the video. High-level features like human poses and object-centric [78] features seem to be more promising compared to low-level features, but detecting these features reliably is still a challenge. For instance, humans are involved in many applications of activity recognition. Our brains detect humans and their poses very easily, but this task seems to be very difficult for computers. One reason is that the human body is very flexible and can accommodate a large amount of articulation. In addition, clothing introduces a large variation in appearance. Furthermore, in the case of human poses, it is not clear what features to use. One could use the location of hands, joints, or the relative location of hands and objects. Fig. 1.5 shows sample human poses that are difficult to detect due to large amounts of articulation.

Figure 1.6: Large variation in the appearance of a fridge in interaction.

Many human actions involve interactions with objects, such as brushing one's teeth or riding a bicycle. One might think that detecting such objects might provide useful features for recognizing such actions. Similar challenges exist in detecting objects. The large variation in viewpoint and appearance makes this problem very difficult. This is even more challenging in the case of objects in interaction. Interaction introduces partial occlusion and also large variations in appearance. Fig. 1.6 shows this variation in the appearance of a fridge in interaction. It is more important to detect objects being interacted with compared to the objects in the corner of a room, since the interaction tells us a lot about the activity taking place.

Difficulty of scaling up high-level feature extraction: Scalability as a function of the length of the video is a serious but often ignored issue. Also, some high-level features, like object-centric ones, require detecting many objects in video frames. It is difficult to scale up the state-of-the-art detectors to a large number of object categories and video frames. For instance, the recently released Mind's-Eye dataset [1] has more than 700,000 frames (6 hours of video footage) and is more than 100 times larger than the PASCAL dataset [30]. It took more than a week for us on our 50-node cluster to run the human pose estimation code [145] on this dataset.

Figure 1.7: Is this person catching the ball or throwing it? This example shows the importance of temporal structure in modeling human activities in video. Assuming that we can detect the human's pose and also the object, we can detect the existence of an interaction, but it is not possible to tell if the person is "throwing" the ball or "catching" it.

Complex temporal structures: Fig. 1.7 shows a human-object interaction. Assuming that we can detect the human's pose and also the object, we can detect the existence of an interaction, but it is not possible to tell if the person is throwing the ball or catching it. This example shows the importance of temporal structure in modeling human activities in video. Actions exhibit complex temporal structure, such as sub-actions of variable durations and compositional orderings. Much research on action recognition ignores such structure and instead focuses on K-way classification of temporally pre-segmented video clips [105, 3]. For example, "making tea" requires heating water on a stove, placing tea leaves in a cup, and pouring hot water into the same cup. Each subaction can vary in duration and (sometimes) temporal ordering. Modeling the temporal structure is difficult but can resolve many ambiguities in action recognition.


Figure 1.8: Different sub-actions of the "making tea" action. Some actions exhibit a complex temporal structure.

1.3 Our approach

Fig. 1.9 summarizes our approach for recognizing actions in video sequences. We first extract high-level features. We introduce scalable algorithms for this purpose in Chapters 2 and 3. Then, we extract temporal features by introducing a novel and fast multiple object tracking algorithm in Chapter 4. In Chapter 5, we introduce our grammar-based action models and show that they are capable of parsing a video into different actions using highlevel features. Finally, in Chapter 6, we introduce the problem of detecting activities of daily living in wearable camera views, as an application of action recognition. A detailed description of these components follows.

1.3.1 High-level features based on objects and human pose

Low-level features like space-time interest points (STIP) [72] have been used extensively in action recognition. These features are very easy to extract and very simple to use since they do not need to be adapted to a particular domain and application. However, there is not much semantics in these features. For instance, Fig. 1.10 shows a sample space-time interest point. These are simply space-time corners of pixel values, and it is very difficult to draw a semantic meaning for each interest point. On the other hand, in this work, we are interested in using high-level features like object-centric features [78] and human pose. Each of these feature points has a semantic meaning. An example is shown in Fig. 1.10. The disadvantages are that high-level features are domain specific and they are often very expensive to extract. For instance, object detection and human pose estimation are two of the most difficult problems in computer vision. Unfortunately, the state-of-the-art detectors are not quite scalable. Therefore, it is very expensive to apply them to long video sequences with a large number of object categories.

Figure 1.9: Block diagram of our approach. We first extract high-level features. We introduce scalable algorithms for this purpose in Chapters 2 and 3. Then, we extract temporal features by introducing a novel and fast multiple object tracking algorithm in Chapter 4. In Chapter 5, we introduce our grammar-based action models, which are capable of parsing a video into different actions/sub-actions using high-level features as the observation. Finally, in Chapter 6, we introduce the problem of detecting activities of daily living in wearable camera views as an application of action recognition.


Figure 1.10: Top: a low-level feature (a space-time interest point). Bottom: high-level features (object-centric features and human pose).

1.3.2 Scalable appearance features

In Chapters 2 and 3, we improve the scalability of state-of-the-art object detectors and human pose estimation algorithms by introducing our scalable representations. We describe a method for learning steerable deformable part models. Our models exploit the fact that part templates can be written as linear filter banks. We demonstrate that one can enforce steerability and separability during learning by applying rank constraints. These constraints are enforced with a coordinate descent learning algorithm, where each step can be solved with an off-the-shelf structured SVM solver. The resulting models are orders of magnitude smaller than their counterparts, greatly simplifying learning and reducing run-time computation. Limiting the degrees of freedom also reduces overfitting, which is useful for learning large part vocabularies from limited training data. We learn steerable variants of several state-of-the-art models for object detection, human pose estimation, and facial landmark estimation. Our steerable models are smaller, faster, and often improve performance.


1.3.3 Efficient temporal features

In extracting high-level features in a video sequence, the naive approach is to run all detectors on all frames independently and let the action model take care of false positives and errors by some form of averaging. However, we know that all frames belong to a continuous video, so all the features should be consistent in the time domain. The naive approach does not take advantage of this rich source of information. In Chapter 4, we exploit the temporal coherency of detected features by tracking multiple objects simultaneously throughout the whole video. We formulate the problem using a cost function that requires estimating the number of tracks, as well as their birth and death states. We show that the global solution can be obtained with a greedy algorithm that sequentially instantiates tracks using shortest path computations on a flow network. Greedy algorithms allow one to embed pre-processing steps, such as non-max suppression, within the tracking algorithm. Furthermore, we give a near-optimal algorithm based on dynamic programming which runs in time linear in the number of objects and linear in the sequence length. Our tracking algorithms are fast, simple, and scalable, allowing us to process dense input data. This results in state-of-the-art performance.
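To give a flavor of the kind of algorithm involved, the sketch below shows a toy dynamic program for the single-track case, assuming a hypothetical input format (per-frame detection lists with precomputed confidence scores) and a simple quadratic motion penalty; the actual cost function, flow-network formulation, and multi-track greedy procedure are developed in Chapter 4.

```python
# Toy sketch of a single-track dynamic program (not the full Chapter 4 model).
# Assumes frames[t] is a non-empty list of detections, each a dict with a
# 'score' (detector confidence) and an 'xy' position.

def motion_cost(prev_xy, cur_xy, sigma=20.0):
    """Toy penalty for large inter-frame jumps."""
    dx, dy = cur_xy[0] - prev_xy[0], cur_xy[1] - prev_xy[1]
    return (dx * dx + dy * dy) / (sigma * sigma)

def best_single_track(frames):
    """Returns (value, {frame index: detection index}) of the best track."""
    table = []  # table[t][i] = (best value of a track ending at detection i, backpointer)
    for t, dets in enumerate(frames):
        row = []
        for d in dets:
            best_val, best_j = d['score'], None  # option 1: start the track here
            if t > 0:
                for j, (v, _) in enumerate(table[t - 1]):  # option 2: extend a track
                    cand = v + d['score'] - motion_cost(frames[t - 1][j]['xy'], d['xy'])
                    if cand > best_val:
                        best_val, best_j = cand, j
            row.append((best_val, best_j))
        table.append(row)
    # pick the best end point anywhere in the video, then backtrack
    t = max(range(len(table)), key=lambda s: max(v for v, _ in table[s]))
    i = max(range(len(table[t])), key=lambda k: table[t][k][0])
    value, path = table[t][i][0], {}
    while i is not None:
        path[t] = i
        i, t = table[t][i][1], t - 1
    return value, path
```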

1.3.4 Our grammar-based action models

After extracting high-level features, we need an action model to classify a video into different actions. The bag-of-features model is a well-known approach. However, it does not model the temporal structure of the action. Actions exhibit complex temporal structure, such as sub-actions of variable durations and compositional orderings. Much research on action recognition ignores such structure and instead focuses on K-way classification of temporally pre-segmented video clips. In Chapter 5, we describe lightweight and efficient grammars that segment a continuous video stream into a hierarchical parse of multiple actions and subactions. We introduce online parsing algorithms that are as fast as a simple sliding window classifier, scale linearly with the length of the video, and use constant storage. We describe a simple but novel data model based on a "bag-of-articulated-poses", and train it as well as other grammar parameters using a structural SVM. We illustrate the effectiveness of our approach over common baselines on a new 2-million-frame dataset of continuous YouTube videos.

1.3.5 An Application: Detecting Activities of Daily Living in First-Person Camera Views

Finally, in Chapter 6, we focus on the problem of detecting activities of daily living (ADL) in first-person camera views as a realistic application of action recognition. We use a natural list of daily activities developed from the medical literature on rehabilitation. These activities are chosen so as to capture the representative movements a person must undergo to perform everyday functions, such as eating and maintaining personal hygiene. We have collected and annotated a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. ADLs differ from typical actions in that they can involve long-scale temporal structure (making tea can take a few minutes) and complex object interactions (a fridge looks different when its door is open). We develop novel representations including composite object models that exploit the fact that objects look different when being interacted with. We perform an extensive empirical evaluation and demonstrate that our novel representations produce a two-fold improvement over traditional approaches. Our analysis suggests that real-world ADL recognition is “all about the objects,” and in particular, “all about the objects being interacted with.”


1.4 Related work

Activity recognition: There is a rich history of activity recognition in the vision community; we refer the reader to the recent surveys of [41, 135, 105, 3] for a detailed summary. Most of the research in this area works on relatively simple situations like those with a stationary background or simple actions like "boxing" and "jogging" [115]. However, we focus on recognition in widely-varying, un-instrumented, "in-the-wild" environments [100].

High-level features: Traditionally, low-level features such as spatio-temporal corners play an important role in visual recognition [72, 80, 24]. In applying these features in the video domain for object detection or activity recognition, some are extended to work in the time domain [25, 116] to exploit the temporal information in the video footage. Although these features result in reasonable performance, they do not capture any semantics, so it is not easy to understand why the model works or how it can be used in other domains. Recently, due to advancements in detecting high-level features [35, 145], using high-level features in visual computing is emerging [78]. We also introduce our features and compare them with low-level features in activity recognition.

Temporal models: By far, the dominant approach to temporal modeling of actions is the Hidden Markov Model (HMM), spurred in part by its successful application to speech recognition [106]. Early approaches date back to finite state machine models [140, 90], while more recent work has explored discriminative variants such as CRFs [139]. [118, 131, 57, 127] explore semi-Markov models for action segmentation; our models are similar in complexity, but allow for compositional models of actions and subactions.

We will revisit the related work specific to each component of our system in the appropriate chapter.


Chapter 2 Scalable Rigid Templates

2.1 Introduction

We are interested in extracting high-level features based on objects and human pose. Recently, the computer vision community has made good progress in building these detectors for still images. However, most of these algorithms are not scalable to a large number of categories and frames. In this chapter, we will focus on building scalable rigid template models for the task of object/human detection [102]. Linear classifiers (i.e., w^T x > 0) are the basic building block of statistical prediction and detection. Though quite standard, they produce many competitive approaches for various prediction tasks. We focus here on the task of visual recognition in video - "does this spatio-temporal window contain an object"? In this domain, scanning-window templates trained with linear classification yield state-of-the-art performance on many benchmark datasets [24, 30, 25]. Fig. 2.1 illustrates extracting appearance features and applying a linear model. In this work, we are interested in making more scalable models. We do so using bilinear models to factorize the parameters.

Figure 2.1: Illustration of extracting appearance features from an image and then applying a linear model for human detection.

Bilinear models, introduced into the vision community by [130], provide an interesting generalization of linear models. Here, data points are modeled as the confluence of a pair of factors. Typical examples include digits affected by style and content factors or faces affected by pose and illumination factors. Conditioned on one factor, the model is linear in the other. More generally, one can define multilinear models [136] that are linear in one factor conditioned on the others.

While bilinear models are often motivated from the perspective of increasing the flexibility of a linear model, our motivation is reversed - we use them to reduce the number of parameters of a weight vector that is naturally represented as a matrix or tensor W. We reduce parameters by factorizing W into a product of low-rank factors. This parameter reduction can reduce over-fitting and improve run-time efficiency because fewer operations are needed to score an example. These are important considerations when training large-scale spatial or spatio-temporal templates. In our case, the state-of-the-art features we use to detect pedestrians are based on histograms of oriented gradient (HOG) features [24] or spatio-temporal generalizations [25], as shown in Fig. 2.2. The extracted feature set of both gradient and optical flow histograms is quite large, motivating the need for dimensionality reduction. In addition, by sharing factors across different classification problems, we introduce a novel formulation of transfer learning. We believe that transfer through shared factors is an important benefit of multilinear classifiers which can help ameliorate overfitting. This is crucial in scaling up object detectors to a large number of categories.

Figure 2.2: Many approaches for visual recognition employ linear classifiers on scanned windows. Here we illustrate windows processed into gradient-based features [24, 35]. We show an image window (left) and a visualization of the extracted HOG descriptor (middle), which itself is better represented as gradient features extracted from different orientation channels (right). Most learning formulations ignore this natural representation of visual data as matrices or tensors. Wolf et al. [141] show that one can produce more meaningful schemes for regularization and parameter reduction through low-rank approximations of a tensor model. Our contribution involves casting the resulting learning problem as a biconvex optimization. Such formulations can leverage off-the-shelf solvers in an efficient two-stage optimization. We also demonstrate that bilinear models have additional advantages for transfer learning and run-time efficiency.

Inspired by the success of bilinear models in data modeling, we introduce discriminative bilinear models for classification. We describe a method for training bilinear (multilinear) SVMs with biconvex (multiconvex) programs. A function f : X × Y → R is called biconvex if f(x, y) is convex in y for fixed x ∈ X and is convex in x for fixed y ∈ Y. Such functions are well-studied in the optimization literature [5, 48]. While not convex, they admit efficient coordinate descent algorithms that solve a convex program at each step. We show bilinear SVM classifiers can be optimized with an off-the-shelf linear SVM solver. This is advantageous because we can leverage large-scale, highly-tuned solvers (we use [43]) to learn bilinear classifiers with tens of thousands of features and hundreds of millions of examples.

We begin with a discussion of related work in Sec. 2.2. We then explicitly define our bilinear classifier in Sec. 2.3. We illustrate several applications and motivations for the bilinear framework in Sec. 2.4. In Sec. 2.5, we describe extensions to our model for the multilinear and multiclass case. We provide several experiments on visual recognition in the video domain in Sec. 2.6, significantly improving on the state-of-the-art system for finding people in video sequences [25] both in performance and speed. We also illustrate our approach on the task of action recognition, showing that transfer learning can ameliorate the small-sample problem that plagues current benchmark datasets [111, 115].

2.2 Related Work

Tenenbaum and Freeman [130] introduced bilinear models into the vision community to model data generated from multiple linear factors. Such methods have been extended to the multilinear setting, e.g. by [136], but such models were generally used as a factor analysis or density estimation technique. Recent work has explored extensions of tensor models to discriminant analysis [128, 144], while our work focuses on an efficient max-margin formulation of multilinear models.

There is also a body of related work on learning low-rank matrices from the collaborative filtering literature [123, 110, 79]. Such approaches typically define a convex objective by replacing the Tr(W^T W) regularization term in our objective (2.6) with the trace norm Tr(\sqrt{W^T W}). This can be seen as an alternate "soft" rank restriction on W that retains convexity, because the trace norm of a matrix is the sum of its singular values rather than the number of nonzero eigenvalues (the rank) [14]. Such a formulation would be interesting to pursue in our scenario, but as [110, 79] note, the resulting SDP is difficult to solve. Our approach, though non-convex, leverages existing SVM solvers in the inner loop of a coordinate descent optimization that enforces a hard low-rank condition.

Our bilinear-SVM formulation is closely related to the low-rank SVM formulation of [141]. Wolf et al. convincingly argue that many forms of visual data are better modeled as matrices rather than vectors - an important motivation for our work (see Fig. 2.2). They analyze the VC dimension of rank-constrained linear classifiers and demonstrate an iterative weighting algorithm for approximately solving an SVM problem in which the rank of W acts as a regularizer. They also outline an algorithm similar to the one we propose here which has a hard constraint on the rank, but they include an additional orthogonality constraint on the columns of the factors that compose W. This requires cycling through each column separately during the optimization, which is presumably slower and may introduce additional local minima. This in turn may explain why they did not present experimental results for their hard-rank formulation.

Our work also stands apart from Wolf et al. in our focus on multi-task learning, which dates back at least to the work of Caruana [17]. Our formulation is most similar to that of Ando and Zhang [6]. They describe a procedure for learning linear prediction models for multiple tasks with the assumption that all models share a component living in a common low-dimensional subspace. While this formulation allows for sharing, it does not reduce the number of model parameters as does our approach of sharing factors.

2.3 Model definition

Linear predictors are of the form

f_w(x) = w^T x.    (2.1)

Existing formulations of linear classification typically treat x as a vector. We argue that for many problems, particularly in visual recognition, x is more naturally represented as a matrix or tensor. For example, many state-of-the-art window-scanning approaches train a classifier defined over local feature vectors extracted over a spatial neighborhood. The Dalal and Triggs detector [24] is a particularly popular pedestrian detector where x is naturally represented as a concatenation of histogram of oriented gradient (HOG) feature vectors extracted from a spatial grid of n_y × n_x cells, where each local HOG descriptor is itself composed of n_f features. In this case, it is natural to represent an example x as a tensor X ∈ R^{n_y × n_x × n_f}. For ease of exposition, we develop the mathematics for a simpler matrix representation, fixing n_f = 1. This holds, for example, when learning templates defined on grayscale pixel values. We generalize (2.1) for a matrix X with

f_W(X) = Tr(W^T X),    (2.2)

where both X and W are n_y × n_x matrices. One advantage of the matrix representation is that it is more natural to regularize W and restrict the number of parameters. For example, one natural mechanism for reducing the degrees of freedom in a matrix is to reduce its rank.

Figure 2.3: An illustration of applying a linear model in the matrix form.

We show that one can obtain a biconvex objective function by enforcing a hard restriction on the rank. Specifically, we enforce the rank of W to be at most d ≤ \min(n_y, n_x). This restriction can be implemented by writing

W = W_y W_x^T, \quad where \quad W_y ∈ R^{n_y × d}, \; W_x ∈ R^{n_x × d}.    (2.3)

This allows us to write the final predictor explicitly as the following bilinear function:

f_{W_y, W_x}(X) = Tr(W_x W_y^T X) = Tr(W_y^T X W_x).    (2.4)

Fig. 2.3 illustrates applying the linear classifier in the matrix form, and Fig. 2.4 illustrates applying the factorized bilinear model. The latter shows that choosing a small d leads to a huge saving in the computation time.


Figure 2.4: An illustration of factorizing the parameter matrix into two smaller matrices and applying them in the bilinear form. Note that the two factors are applied one at a time, so choosing a small d leads to large savings in computation.
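As a quick numerical check of (2.2)-(2.4), the following snippet (illustration only, with arbitrary toy dimensions rather than those used in our experiments) verifies that the factored score Tr(W_y^T X W_x) equals the full score Tr(W^T X) for W = W_y W_x^T, and compares the number of stored parameters.

```python
import numpy as np

# Illustrative sizes only; not the dimensions used in the experiments.
ny, nx, d = 14, 6, 2
rng = np.random.default_rng(0)
Wy = rng.standard_normal((ny, d))   # left factor
Wx = rng.standard_normal((nx, d))   # right factor
X = rng.standard_normal((ny, nx))   # one feature "window"

W = Wy @ Wx.T                                # rank-d template, eq. (2.3)
score_full = np.trace(W.T @ X)               # f_W(X) = Tr(W^T X), eq. (2.2)
score_bilinear = np.trace(Wy.T @ X @ Wx)     # eq. (2.4), never forms W
assert np.isclose(score_full, score_bilinear)

# Parameter counts: the factored model stores (ny + nx) * d numbers
# instead of ny * nx.
print(ny * nx, (ny + nx) * d)   # 84 vs. 40 for these toy sizes
```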

2.3.1 Learning

Assume we are given a set of training data and label pairs {x_n, y_n}. We would like to learn a model with low error on the training data. One successful approach is a support vector machine (SVM). We can rewrite the linear SVM formulation for w and x_n with matrices W and X_n using the trace operator:

L(w) = \frac{1}{2} w^T w + C \sum_n \max(0, 1 - y_n w^T x_n).    (2.5)

L(W) = \frac{1}{2} Tr(W^T W) + C \sum_n \max(0, 1 - y_n Tr(W^T X_n)).    (2.6)

The above formulations are identical when w and x_n are the vectorized elements of matrices W and X_n. Note that (2.6) is convex. We wish to restrict the rank of W to be d. Plugging in W = W_y W_x^T, we obtain our biconvex objective function:

L(W_y, W_x) = \frac{1}{2} Tr(W_x W_y^T W_y W_x^T) + C \sum_n \max(0, 1 - y_n Tr(W_x W_y^T X_n)).    (2.7)

In the next section, we show that optimizing (2.7) over one matrix holding the other fixed is a convex program - specifically, a QP equivalent to a standard SVM. This makes (2.7) biconvex.

2.3.2 Coordinate descent

We can optimize (2.7) with a coordinate descent algorithm that solves for one set of parameters holding the other fixed. Each step in this descent is a convex optimization that can be solved with a standard SVM solver. Specifically, consider

\min_{W_y} L(W_y, W_x) = \frac{1}{2} Tr(W_y W_x^T W_x W_y^T) + C \sum_n \max(0, 1 - y_n Tr(W_y^T X_n W_x)).    (2.8)

The above optimization is convex in W_y but does not directly translate into the trace-based SVM formulation from (2.6). To do so, let us reparametrize W_y as \tilde{W}_y:

\min_{\tilde{W}_y} L(\tilde{W}_y, W_x) = \frac{1}{2} Tr(\tilde{W}_y^T \tilde{W}_y) + C \sum_n \max(0, 1 - y_n Tr(\tilde{W}_y^T \tilde{X}_n)),    (2.9)

where

\tilde{W}_y = W_y A^{1/2}, \qquad \tilde{X}_n = X_n W_x A^{-1/2}, \qquad A = W_x^T W_x.

One can see that (2.9) is structurally equivalent to (2.6) and hence (2.5). Hence it can be solved with a standard off-the-shelf SVM solver. Given a solution, we can recover the original parameters by W_y = \tilde{W}_y A^{-1/2}. Recall that A = W_x^T W_x is a matrix of size d × d that is in general invertible for small d. Using a similar derivation, one can show that \min_{W_x} L(W_y, W_x) is also equivalent to a standard convex SVM formulation.
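The sketch below illustrates this coordinate-descent recipe, using scikit-learn's LinearSVC as a stand-in for the off-the-shelf solver (our experiments use the solver of [43]); it simplifies several details, e.g. LinearSVC's default squared-hinge loss rather than the hinge loss of (2.5), no bias term, and random rather than PCA-based initialization.

```python
import numpy as np
from numpy.linalg import eigh
from sklearn.svm import LinearSVC

def inv_sqrt(A, eps=1e-8):
    """A^{-1/2} for a small symmetric positive-definite d x d matrix."""
    vals, vecs = eigh(A)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def update_Wy(Xs, ys, Wx, C):
    """One coordinate-descent step: solve for Wy with Wx fixed, eqs. (2.8)-(2.9).
    Xs: list of ny x nx data matrices; ys: +1/-1 labels."""
    A_isqrt = inv_sqrt(Wx.T @ Wx)
    # transformed examples X~_n = X_n Wx A^{-1/2}, vectorized for a linear SVM
    feats = np.stack([(X @ Wx @ A_isqrt).ravel() for X in Xs])
    svm = LinearSVC(C=C, fit_intercept=False)   # stand-in off-the-shelf solver
    svm.fit(feats, ys)
    Wy_tilde = svm.coef_.reshape(Xs[0].shape[0], Wx.shape[1])
    return Wy_tilde @ A_isqrt                   # undo the reparametrization

def train_bilinear_svm(Xs, ys, d, C=1.0, iters=5, seed=0):
    ny, nx = Xs[0].shape
    rng = np.random.default_rng(seed)
    Wy, Wx = rng.standard_normal((ny, d)), rng.standard_normal((nx, d))
    for _ in range(iters):
        Wy = update_Wy(Xs, ys, Wx, C)                 # fix Wx, solve for Wy
        Wx = update_Wy([X.T for X in Xs], ys, Wy, C)  # symmetric step for Wx
    return Wy, Wx
```

Each call to update_Wy is exactly the reparametrized problem (2.9); the step for W_x simply transposes the data matrices, since Tr(W_y^T X W_x) = Tr(W_x^T X^T W_y).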


2.4 Motivation

We outline here a number of motivations for the biconvex objective function defined above.

2.4.1 Regularization

Bilinear models allow a natural way of restricting the number of parameters in a linear model. From this perspective, they are similar to approaches that apply PCA for dimensionality reduction prior to learning. Felzenszwalb et al. [37] find that PCA can reduce the size of HOG features by a factor of 4 without a loss in performance. Image windows are naturally represented as a 3D tensor X ∈ R^{n_y × n_x × n_f}, where n_f is the dimensionality of a HOG feature. Let us "reshape" X into a 2D matrix X ∈ R^{n_{xy} × n_f} where n_{xy} = n_x n_y. We can restrict the rank of the corresponding model to d by defining W = W_{xy} W_f^T. W_{xy} ∈ R^{n_{xy} × d} is equivalent to a vectorized spatial template defined over d features at each spatial location, while W_f ∈ R^{n_f × d} defines a set of d basis vectors spanning R^{n_f}. This basis can be loosely interpreted as the PCA basis estimated in [37]. In our biconvex formulation, the basis vectors are not constrained to be orthogonal, but they are learned discriminatively and jointly with the template W_{xy}. We show in Sec. 2.6 that this often significantly outperforms PCA-based dimensionality reduction.

2.4.2 Efficiency

Scanning-window classifiers are often implemented using convolutions [24, 35]. For example, the product Tr(W^T X) can be computed for all image windows X with n_f convolutions. By restricting W to be W_{xy} W_f^T, we project features into a d-dimensional subspace spanned by W_f, and compute the final score with d convolutions. One can further improve efficiency by using the same d-dimensional feature space for a large number of different object templates - this is precisely the basis of our transfer approach in Sec. 2.4.3. This can result in significant savings in computation. For example, spatio-temporal templates for finding objects in video tend to have large n_f since multiple features are extracted from each time-slice. Consider a rank-1 restriction of W_x and W_y. This corresponds to a separable filter W_{xy}. Hence, our formulation can be used to learn separable scanning-window classifiers. Separable filters can be evaluated efficiently with two one-dimensional convolutions. This can result in significant savings because computing the score at the window is now O(n_x + n_y) rather than O(n_x n_y).
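A small sketch of the separable special case, assuming a single-channel feature map and scipy.signal's correlation routines (illustration only): a rank-1 template scores every window with two one-dimensional passes and produces exactly the same response map as the dense 2-D correlation.

```python
import numpy as np
from scipy.signal import correlate2d, correlate

rng = np.random.default_rng(0)
image = rng.standard_normal((60, 40))   # toy single-channel feature map
wy = rng.standard_normal(14)            # rank-1 (separable) template factors
wx = rng.standard_normal(6)
W = np.outer(wy, wx)                    # the equivalent dense 14 x 6 template

dense = correlate2d(image, W, mode='valid')              # O(nx * ny) per window
rows = np.apply_along_axis(lambda c: correlate(c, wy, mode='valid'), 0, image)
separable = np.apply_along_axis(lambda r: correlate(r, wx, mode='valid'), 1, rows)
assert np.allclose(dense, separable)    # same scores, O(nx + ny) per window
```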

2.4.3 Transfer

Assume we wish to train M predictors and are given {x_n^m, y_n^m} training data pairs for each prediction problem 1 ≤ m ≤ M. One can write all M learning problems with a single optimization:

L(W^1, \ldots, W^M) = \frac{1}{2} \sum_m Tr(W^{m\,T} W^m) + \sum_m C_m \sum_n \max(0, 1 - y_n^m Tr(W^{m\,T} X_n^m)).    (2.10)

As written, the problem above can be optimized over each W^m independently. We can introduce a rank constraint on W^m that induces a low-dimensional subspace projection of X_n^m. To transfer knowledge between the classification tasks, we require all tasks to use the same low-dimensional subspace projection by sharing the same feature matrix:

W^m = W_{xy}^m W_f^T.

Note that the leading dimension of W_{xy}^m can depend on m. This fact allows for X_n^m from different tasks to be of varying sizes. In our motivating application, we can learn a family of HOG templates of varying spatial dimension that share a common HOG feature subspace.

The coordinate descent algorithm from Sec. 2.3.2 naturally applies to the multi-task setting. Given a fixed W_f, it is straightforward to independently optimize W_{xy}^m by defining A = W_f^T W_f. Given a fixed set of W_{xy}^m, a single matrix W_f is learned for all classes by computing:

\min_{\tilde{W}_f} L(\tilde{W}_f, W_{xy}^1, \ldots, W_{xy}^M) = \frac{1}{2} Tr(\tilde{W}_f^T \tilde{W}_f) + \sum_m C_m \sum_n \max(0, 1 - y_n^m Tr(\tilde{W}_f^T \tilde{X}_n^m)),

where

\tilde{W}_f = W_f A^{1/2}, \qquad \tilde{X}_n^m = X_n^{m\,T} W_{xy}^m A^{-1/2}, \qquad A = W_{xy}^{m\,T} W_{xy}^m.

If all problems share the same slack penalty (C_m = C), the above can be optimized with an off-the-shelf SVM solver. In the general case, a minor modification is needed to allow for slack-rescaling [134]. In practice, n_f can be large for spatio-temporal features extracted from multiple temporal windows. The above formulation is convenient in that we can use data examples from many classification tasks to learn a good subspace for spatio-temporal features.
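A minimal sketch of one round of this alternation is given below, again with scikit-learn's LinearSVC standing in for the off-the-shelf solver; for brevity it omits the A^{1/2} rescaling (so the regularization is only approximate), uses a single shared C, and assumes a hypothetical `tasks` list holding the per-task training matrices and labels.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_linear(feats, labels, C):
    """Stand-in for the off-the-shelf solver (squared hinge, no bias)."""
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(feats, labels)
    return svm.coef_.ravel()

def multitask_step(tasks, Wf, C=1.0):
    """One round of the alternation in Sec. 2.4.3.
    tasks: list of (Xs, ys), each X an (nxy_m x nf) matrix; nxy_m may differ
    across tasks but nf and the shared basis Wf (nf x d) do not."""
    nf, d = Wf.shape
    # (a) fix Wf: fit each spatial template W_xy^m on projected features X Wf
    Wxy = []
    for Xs, ys in tasks:
        feats = np.stack([(X @ Wf).ravel() for X in Xs])
        Wxy.append(fit_linear(feats, ys, C).reshape(Xs[0].shape[0], d))
    # (b) fix all W_xy^m: refit the shared basis Wf on examples pooled from
    #     every task, transformed as X~ = X^T W_xy^m (A-rescaling omitted)
    feats, labels = [], []
    for (Xs, ys), W in zip(tasks, Wxy):
        feats += [(X.T @ W).ravel() for X in Xs]
        labels += list(ys)
    Wf = fit_linear(np.stack(feats), np.array(labels), C).reshape(nf, d)
    return Wxy, Wf
```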

2.5 Extensions

2.5.1 Multilinear

In many cases, a data point x is more naturally represented as a multidimensional matrix or a high-order tensor. For example, spatio-temporal templates are naturally represented as a 4th-order tensor capturing the width, height, temporal extent, and the feature dimension of a spatio-temporal window. For ease of exposition, let us assume the feature dimension is 1, and so we write a feature vector x as X ∈ R^{n_x × n_y × n_t}. We denote the element of a tensor X as x_{ijk}. Following [75], we define a scalar product of two tensors W and X as the sum of their element-wise products:

\langle W, X \rangle = \sum_{ijk} w_{ijk} x_{ijk}.    (2.11)

With the above definition, we can generalize our trace-based objective function (2.6) to higher-order tensors:

L(W) = \frac{1}{2} \langle W, W \rangle + C \sum_n \max(0, 1 - y_n \langle W, X_n \rangle).    (2.12)

We wish to impose a rank restriction on the tensor W. The notion of rank for tensors of order greater than two is subtle - for example, there are alternate approaches for defining a high-order SVD [136, 75, 69]. For our purposes, we follow [117] and define W as a rank-d tensor by writing it as a product of matrices W^y ∈ R^{n_y × d}, W^x ∈ R^{n_x × d}, W^t ∈ R^{n_t × d}:

w_{ijk} = \sum_{s=1}^{d} w_{is}^y w_{js}^x w_{ks}^t.    (2.13)

Combining (2.11) - (2.13), it is straightforward to show that L(W y , W x , W t ) is convex in one matrix given the others. This means our coordinate descent algorithm from Sec. 2.3.2 still applies. As an example, consider the case when d = 1. This rank restriction forces the spatio-temporal template W to be separable along the x, y, and t axes, allowing for window-scan scoring by three one-dimensional convolutions. This greatly increases runtime efficiency for spatio-temporal templates. A version of this formulation will be used in the next chapter in building scalable part models.
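The following snippet numerically checks (2.11)-(2.13) on toy dimensions (illustration only): the factored evaluation contracts one axis at a time and never forms the full rank-d tensor W.

```python
import numpy as np

ny, nx, nt, d = 5, 4, 3, 2            # toy sizes only
rng = np.random.default_rng(0)
Wy = rng.standard_normal((ny, d))
Wx = rng.standard_normal((nx, d))
Wt = rng.standard_normal((nt, d))
X = rng.standard_normal((ny, nx, nt))

# Full rank-d tensor template, eq. (2.13): w_ijk = sum_s Wy[i,s] Wx[j,s] Wt[k,s]
W = np.einsum('is,js,ks->ijk', Wy, Wx, Wt)
score_full = np.sum(W * X)                          # <W, X>, eq. (2.11)

# Factored evaluation: contract the factors directly against X.
score_factored = np.einsum('ijk,is,js,ks->', X, Wy, Wx, Wt)
assert np.isclose(score_full, score_factored)
```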


2.5.2 Bilinear multi-class SVMs

We outline here an extension of our formalism to multi-class SVMs [22]. Multi-class SVMs learn models that predict the category label yn given a data point xn . We follow the formulation in [22] where we learn a weight vector for each category, and at inference time, the best scoring model predicts the category:

y = \operatorname{argmax}_{i \in [1 \ldots n_c]} w_i^T x        (2.14)

where n_c is the number of categories, and w_i \in R^{n_x} is the weight vector for the i'th category. Given training data of the form \{x_n, y_n\}, the learning problem is:

L(w) = \frac{1}{2} w^T w + C \sum_n \max\big(0,\, \max_{y \neq y_n}(1 - (w_{y_n} - w_y)^T x_n)\big)        (2.15)

This penalizes the objective function to force the correct category model to score at least one unit better than the highest scoring wrong category model. One can concatenate all weight vectors and define the weight matrix W \in R^{n_x \times n_c}:

W = [w_1 \; \ldots \; w_{n_c}]

The corresponding X_{ny} \in R^{n_x \times n_c} will be a sparse matrix with x_n at the y_n'th column and -x_n at the y'th column. The objective function in matrix form will be:

L(W) = \frac{1}{2} Tr(W^T W) + C \sum_n \max\big(0,\, \max_{y \neq y_n}(1 - Tr(W^T X_{ny}))\big)        (2.16)

We can enforce W to be of rank d < \min(n_x, n_c) by defining W = W_x W_c^T where W_x \in R^{n_x \times d} and W_c \in R^{n_c \times d}. For example, one may expect template classifiers that classify n_c different human actions to reside in a d-dimensional subspace. The resulting biconvex objective function is

L(W_x, W_c) = \frac{1}{2} Tr(W_c W_x^T W_x W_c^T) + C \sum_n \max\big(0,\, \max_{y \neq y_n}(1 - Tr(W_c W_x^T X_{ny}))\big)        (2.17)

Using our previous arguments, it is straightforward to show that the above objective is biconvex and that each step of the coordinate descent algorithm reduces to a standard SVM problem. We will generalize bilinear multi-class SVMs to bilinear structural SVMs in the next chapter.
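To illustrate the effect of the rank restriction at test time, the following toy numpy sketch (all sizes and values are made up) compares full-rank scoring of (2.14) with the factored form: the example is projected once onto the d-dimensional subspace W_x, and each category is then scored with a d-dimensional weight vector from W_c.

```python
import numpy as np

nx_dim, nc, d = 7000, 12, 10                 # feature size, #categories, rank (illustrative)
rng = np.random.default_rng(0)
Wx = rng.standard_normal((nx_dim, d))        # shared feature subspace
Wc = rng.standard_normal((nc, d))            # one d-dimensional classifier per category
x = rng.standard_normal(nx_dim)              # a test example

# Full-rank scoring: build W (nx x nc) explicitly -> O(nx * nc) per example.
W = Wx @ Wc.T
scores_full = W.T @ x

# Rank-d scoring: O(nx * d + nc * d) per example; identical result.
scores_lowrank = Wc @ (Wx.T @ x)

assert np.allclose(scores_full, scores_lowrank)
print("predicted category:", int(np.argmax(scores_lowrank)))
```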

2.6 Experiments

We focus our experiments on the task of visual recognition using spatio-temporal templates. This problem domain has large feature sets obtained from histograms of gradients and histograms of optical flow computed from a frame pair. We illustrate our method on two challenging tasks using two benchmark datasets: detecting pedestrians in video sequences from the INRIA-Motion database [25] and classifying human actions in the UCF-Sports dataset [111].

We model features computed from frame pairs x as matrices X \in R^{n_{xy} \times n_f}, where n_{xy} = n_x n_y is the vectorized spatial template and n_f is the dimensionality of our combined gradient and flow feature space. We use the histogram of gradient and flow feature set from [25]. Our bilinear model learns a classifier of the form W_{xy} W_f^T where W_{xy} \in R^{n_{xy} \times d} and W_f \in R^{n_f \times d}. Typical values include n_y = 14, n_x = 6, n_f = 84, and d = 5 or 10.


2.6.1 Spatio-temporal pedestrian detection

Scoring a detector: Template classifiers are often scored using missed detections versus false-positives-per-window statistics. However, recent analysis suggests such measurements can be misleading [27]. We opt for the scoring criteria outlined by the widely-acknowledged PASCAL competition [30], which looks at average precision (AP) results obtained after running the detector on cluttered video sequences and suppressing overlapping detections.

Baseline: We compare with the linear spatio-temporal-template classifier from [25]. This seems to be a good baseline since it differs from our model only in using a full-rank linear classifier instead of a rank-restricted bilinear classifier. The static-image detector counterpart is a well-known state-of-the-art system for finding pedestrians [24]. Surprisingly, when scoring AP for person detection in the INRIA-motion dataset, we find the spatio-temporal model performed worse than the static-image model. This is corroborated by personal communication with the authors as well as Dalal's thesis [23]. We found that aggressive SVM cutting-plane optimization algorithms [43] were needed for the spatio-temporal model to outperform the spatial model. This suggests our linear baseline is the true state-of-the-art system for finding people in video sequences.

We also compare results with an additional rank-reduced baseline obtained by setting W_f to the basis returned by a PCA projection of the feature space from n_f to d dimensions. We use this PCA basis to initialize our coordinate descent algorithm when training our bilinear models. We show precision-recall curves in Fig. 2.5. We refer the reader to the caption for a detailed analysis, but our bilinear optimization seems to produce the state-of-the-art system for finding people in video sequences, while being an order-of-magnitude faster than previous approaches.


[Plot for Figure 2.5: precision/recall curves with Bilinear AP = 0.795, Baseline AP = 0.765, PCA AP = 0.698.]

Figure 2.5: Our results on the INRIA-motion database [25]. We evaluate results using average precision, using the well-established protocol outlined in [30]. The baseline curve is our implementation of the HOG+flow template from [25]. The size of the feature vector is over 7,000 dimensions. Using PCA to reduce the dimensionality by 10X results in a significant performance hit. Using our bilinear formulation with the same low-dimensional restriction, we obtain better performance than the original detector while being 10X faster. We show example detections on video clips on the right.

2.6.2 Human action classification

Action classification requires labeling a video sequence with one of n_c action labels. We do this by training n_c 1-vs-all action templates. Template detections from a video sequence are pooled together to output a final action label. We experimented with different voting schemes and found that a second-layer SVM classifier defined over the maximum score (over the entire video) for each template performed well. Our future plan is to integrate the video class directly into the training procedure using our bilinear structural SVM formulation.

Action recognition datasets tend to be quite small and limited. For example, up until recently, the norm consisted of scripted activities on controlled, simplistic backgrounds. We focus our results on the relatively new UCF Sports Action dataset, consisting of non-scripted sequences of cluttered sports videos. Unfortunately, there have been few published results on this dataset, and the initial work [111] uses a slightly different set of classes than those which are available online. The published average class confusion is 69.2%, obtained with leave-one-out cross validation. Using 2-fold cross validation (and hence significantly less training data), our bilinear template achieves a score of 64.8% (Fig. 2.6). Again, we see a large improvement over linear and PCA-based approaches. While not directly comparable, these results suggest our model is competitive with the state of the art.

Transfer: We use the UCF dataset to evaluate transfer-learning in Fig. 2.8. We consider a small-sample scenario when one has only two example video sequences of each action class. Under this scenario, we train one bilinear model in which the feature basis W_f is optimized independently for each action class, and another where the basis is shared across all classes. Fig. 2.7 illustrates learning a single shared W_f across categories. The independently-trained model tends to overfit to the training data for multiple values of C, the slack penalty from (2.6). The joint model clearly outperforms the independently-trained models.

[Confusion matrices for Figure 2.6 over the 12 UCF Sports classes (Dive-Side, Golf-Back, Golf-Front, Golf-Side, Kick-Front, Kick-Side, Ride-Horse, Run-Side, Skate-Front, Swing-Bench, Swing-Side, Walk-Front): Bilinear (.648), Linear (.518), PCA (.444).]

Figure 2.6: Our results on the UCF Sports Action dataset [111]. We show classification results obtained from 2-fold cross validation. Our bilinear model provides a strong improvement over both the linear and PCA baselines. We show class confusion matrices, where light values correspond to correct classification. We label each matrix with the average classification rate over all classes.


Figure 2.7: Transfer learning by sharing the W_f's across categories. Each column in the block diagram corresponds to learning a classifier (W_f^i and W_xy^i) for one category. The red box illustrates learning a unique shared W_f across categories jointly. This leads to better generalization and scalability in problems with a large number of categories.

[Table accompanying Figure 2.8 (UCF Sports Action dataset, 2 training videos per class):]

              Iter1   Iter2
Ind (C=.01)   .222    .289
Joint (C=.1)  .267    .356

[Panels: Walk-Iter1, Walk-Iter2, and their closeups.]

Figure 2.8: We show results for transfer learning on the UCF action recognition dataset with limited training data - 2 training videos for each of 12 action classes. In the top table row, we show results for independently learning a subspace for each action class. In the bottom table row, we show results for jointly learning a single subspace that is transferred across classes. In both cases, the regularization parameter C was set on held-out data. The jointly-trained model is able to leverage training data from across all classes to learn the feature space W_f, resulting in overall better performance. On the right, we show low-rank models W = W_xy W_f^T during iterations of the coordinate descent. Note that the head and shoulders of the model are blurred out in iteration 1, which uses PCA, but after the biconvex training procedure discriminatively updates the basis, the final model is sharper at the head and shoulders.

2.7 Conclusion

In this chapter, we introduced a generic framework for multilinear classifiers that are efficient to train with existing linear solvers. This lets us learn rigid templates for object/human detection in a scalable way. Multilinear classifiers exploit the natural matrix and/or tensor representation of spatio-temporal data. For example, this allows one to learn separable spatio-temporal templates for finding objects in video. Multilinear classifiers also allow for factors to be shared across classification tasks, providing a novel form of transfer learning. We will extend this approach in the next chapter to deformable models and then use it in extracting high-level features.


Chapter 3 Scalable Part Models

3.1 Introduction

The previous chapter described a scalable rigid template representation for the task of object detection. It is shown in the literature that part models outperform rigid templates in this task by a large margin [37]. Hence, in this chapter, we will generalize our representations to build steerable/separable part models as a novel scalable representation for object detection and human/face pose estimation. Another advantage of part models is that they can output labels beyond bounding boxes. For example, they can estimate the human pose, face direction, interaction, and semantically meaningful parts of objects. These structured outputs can in turn be used in constructing better high-level features [101]. Fig. 3.1 shows some examples of these outputs.

Figure 3.1: We show results of applying our steerable/separable representation to three diverse but well-benchmarked applications of articulated pose estimation (on PARSE), facial landmark estimation (on MultiPIE), and object detection (on PASCAL VOC 2007).

Part-based models provide a promising framework for capturing variation in the appearance of an object. They do so by reasoning about local appearance using part templates; by composing different parts together in shifted positions, one can model a variety of global appearances. However, one will likely need a large set of parts to model changes in viewpoint, deformation, scale, etc., across thousands of object categories. One challenge that lies ahead is the development of scalable representations that efficiently capture such large-scale part vocabularies.

Most current part models are implemented as templates defined on gradient features such as HOG [24]. Often these templates are trained using linear classifiers (such as an SVM), resulting in very high-dimensional learning problems. We argue that by conceptualizing these problems as one of learning spatial filters rather than high-dimensional parameter vectors, one can leverage the considerable body of work in image processing for developing efficient representations [44, 98, 82]. Typical approaches for reducing the size of a part vocabulary include vector quantization (e.g., visual words [120]). We show that one can also use linear subspace methods as an alternative form of compression. We represent a large vocabulary of parts as linear combinations of a small set of basis filters. One can use a small number of basis filters to "steer" across a variety of factors, including viewpoint, scales, and even semantic part types. We show that one can encode steerable and separable constraints during learning using the framework of rank-constrained classifiers [102, 69, 141]. We simultaneously learn a separable and steerable basis, together with the steering coefficients, using structural SVMs [134].

(a) Changes in part viewpoint; (b) Vocabulary of parts; (c) Steerable basis

Figure 3.2: One needs large part vocabularies to capture variations in appearance due to viewpoint (a), among other factors. We approximate a large vocabulary of thousands of part templates (b) as linear combinations of a small set of basis parts (c). We show how one can learn a steerable, separable basis together with the steering coefficients using rank-constrained structural SVMs. This reduces the number of model parameters by orders of magnitude, simplifying learning and increasing run-time speed.

Our steerable representations can reduce the number of model parameters by orders of magnitude, greatly simplifying both learning and run-time computation. Limiting the degrees of freedom reduces overfitting, which is useful for learning large part vocabularies from limited training data. Our steerable basis also represents a novel form of multi-task learning, in that the models learned for a set of object categories share information between them.


Since the current bottleneck of most part models is the computation of local filter scores, our steerable models improve run-time performance, often by an order of magnitude. This approaches recent speed ups obtained by cascaded [34] or coarse-to-fine [93] matching algorithms, but our performance gains come from an underlying change in the representation during learning, rather than an approximation that is made after a model is learned. For example, our efficiency gains still hold for high-recall detection, while other matching algorithms may slow down because one has to enumerate all candidate windows in an image. Because the efficiency gains are orthogonal, our steerable models can be combined with such matching algorithms to yield extremely lightweight and efficient detectors. We learn steerable part models for three state-of-the-art systems for object detection [37] , articulated pose estimation [145], and facial landmark estimation [151]. In all three cases, we are able to achieve a near-10X reduction in parameters with little or no loss in performance. All of our systems are based on mixtures of parts, and so can be tuned for large or small vocabularies. When we compare these tuned systems with a steerable model with an equivalent number of parameters, our steerable models always significantly dominate in performance.

3.2 Related work

Our representation can be seen as an approach to sharing parts. Parts are typically shared in a discrete fashion; for example, a single template for a wheel part may be shared across multiple view-based mixture models [132, 91] or within a compositional grammar of vehicles [150, 39]. In particular, [91] learns coefficients which calibrate parts that are shared across sub-category mixtures. However, because parts may look different under different viewpoints and compositions, we share a linear subspace rather than a fixed template, letting a small number of basis filters generate a large, near-continuous range of part appearances.

Our approach is inspired by classic work on low-level vision, including steerable filters [44], separable filters [2], and filter banks [2]. Our work is most similar to the framework of [82], who use an SVD to efficiently represent a large set of oriented and scaled filters. We follow a similar approach, but embed the rank restriction within the learning of the classifier, allowing us to discriminatively learn both a basis and reconstruction coefficients. Following the terminology of [82], we refer to these as a steerable basis and steering coefficients. Our approach for rank-constrained learning is based on a recent collection of papers [102, 69, 141]. We follow the formalism of [102], who formulate a rank-constrained linear model as a bilinear model. In our case, given a fixed basis, our model is linear in the steering coefficients, and vice versa. [102] point out that convex learning algorithms for linear classifiers (such as SVMs) can be generalized to biconvex learning algorithms that can be optimized with coordinate descent. Each step of the coordinate descent reduces to a standard SVM problem. In this work, we demonstrate such tools can be extended to simultaneously learn steerable and separable part filters using structural SVMs. Any recognition algorithm that relies on a large vocabulary of linearly-scored templates would benefit from our approach. Some popular examples include poselets [13], visual phrases [113], and latent part models [37] . We focus on three illustrative examples. We use part-based model of [37] to illustrate part steering across object categories and subcategories. We use the recent articulated model of [145] to illustrate steering across in-plane orientations, useful for modeling articulation. We use the multiview model of [151] to illustrate steering across out-of-plane orientations, useful for modeling viewpoint variation in faces. Based on standard benchmark evaluation, all methods represent the state-of-the-art methods in human pose estimation, face pose/landmark estimation, and object detection. Our work provides a 10-100X reduction in model size, considerably simplifying learning and increasing run-time efficiency.


3.3 Steerable basis model

We begin by describing a general form of linearly-parameterized part models with mixtures, amenable to our steerable representation. The part models described in [37, 145, 151] can all be seen as instances of this general form. Let l_i = (x_i, y_i) be the pixel location of part i, and let t_i be the mixture component of part i. These mixture components may span different in-plane orientations (as in [145]), out-of-plane orientations (as in [151]), or semantic sub-categories of an object class (as in [37]). Given an image I, we score a collection of part locations l = \{l_i\} and mixtures t = \{t_i\} as:

score(I, l, t) = \sum_{i=1}^{m} w_i^{t_i} \cdot \phi_a(I, l_i) + w_s \cdot \phi_s(l, t)        (3.1)

where φa (I, li ) is an appearance feature vector (e.g., HOG descriptor [24]) extracted from pixel location li in image I. The first term in (3.1) is an appearance model that computes the local score of placing filter wi , tuned for mixture ti , at location li . Linear shape: The second term is a shape “prior” that favors particular configurations of part locations and mixtures. For our purposes, we are agnostic as to form of this function so long as it is linearly parameterized and there exist tractable algorithms for computing the best scoring configuration argmaxl,t score(I, l, t). For example, [145] encodes springs between pairs of parts and co-occurrence priors between particular mixtures, where spring rest position, spring rigidity, and co-occurrence biases are all encoded linearly in ws . When pairwise relations are restricted to a tree, the best-matching configuration can be computed with dynamic programming. Usually, there are very few parameters in the prior term compared to the appearance terms. We will describe linear subspace methods for approximating the high-dimensional appearance terms.


Bilinear appearance: Assume that all part filters are identical in size, each requiring n_d parameters. If they are not, one can zero-pad templates to achieve this. We would like to write each filter as a linear combination of n_s basis filters:

w_i^{t_i} = \sum_{j=1}^{n_s} b_j s_{ij}^{t_i}        (3.2)

where b_j is a basis filter of size n_d and s_{ij}^{t_i} is the scalar steering coefficient specific to each part mixture w_i^{t_i}. Plugging (3.2) into (3.1), we can see that our scoring function is now bilinear; if we fix the basis b, it is linear in the coefficients s and vice versa. We now make this connection explicit by introducing matrix notation for (3.1) and defining z = (l, t):

score(I, z) = Tr(WaT Φa (I, z)) + wsT φs (z)

(3.3)

where Wa ∈ Rnd ×np is the filter bank (in column-wise order). Each column of Wa represents a single part filter with nd parameters. We write np for the total size of the part vocabulary, including all mixtures of all parts. Φa (I, z) ∈ Rnd ×np is a sparse matrix of local appearance features extracted from location l, with nonzero entries corresponding to the active mixtures t. Finally, Tr() is the standard trace operator, where we use the property that Tr(AT B) = A(:)T B(:) following Matlab notation. We can then write Wa as a product of a basis and steering coefficient matrix:

W_a = B S^T    where    B \in R^{n_d \times n_s}, \; S \in R^{n_p \times n_s}        (3.4)

where each column of B is a basis filter and each row of S specifies the steering coefficients for a particular part mixture. We see that our steerable model can be interpreted as a rank-n_s constraint on the filter bank W_a. Plugging (3.4) into (3.3), we get:

score(I, z) = Tr(S B^T \Phi_a(I, z)) + w_s^T \phi_s(z)        (3.5)

The new score is bilinear in its parameters (B, S, ws ); if we fix the steerable basis B, the score is linear in the coefficients S and shape parameters ws . If we fix S, the score is linear in B and ws . This suggests that we can generalize convex algorithms for learning parameters (such as SVMs) to biconvex algorithms that iterate learning one set of parameters holding the others fixed. Fig. 3.3 shows the bilinear factorization for part models.

Figure 3.3: Top row shows the general scoring function for part models. In the bottom row, each filter w_i is written as a linear combination of a set of steerable basis filters. The original scoring function is linear in w_i, so we can use linear SVMs in learning. By substituting the bottom row in the scoring function, for a fixed basis, one can pre-multiply features with the basis and see that the scoring function is also linear in the coefficients, so we can learn the coefficients using the same SVM formulation. The same holds if we fix the coefficients and learn the basis.
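The following toy numpy sketch (random values, illustrative sizes) verifies the identity behind Fig. 3.3 and (3.5): for a fixed basis B, pre-multiplying the appearance features by B^T turns the appearance score into a function that is linear in the steering coefficients S.

```python
import numpy as np

nd, np_parts, ns = 25 * 32, 138, 20          # filter size, vocabulary size, #basis filters
rng = np.random.default_rng(0)
B = rng.standard_normal((nd, ns))            # steerable basis (one column per basis filter)
S = rng.standard_normal((np_parts, ns))      # steering coefficients (one row per part mixture)

Wa = B @ S.T                                 # reconstructed filter bank (nd x np)
Phi = rng.standard_normal((nd, np_parts))    # appearance features (sparse in practice)

score_full = np.trace(Wa.T @ Phi)            # Tr(Wa^T Phi)
steered_feat = B.T @ Phi                     # (ns x np): features pre-multiplied by the basis
score_bilinear = np.sum(S * steered_feat.T)  # Tr(S B^T Phi), linear in S for fixed B

assert np.isclose(score_full, score_bilinear)
```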

3.4 Learning

We now describe a method for learning steerable models given supervised data, where part locations and mixture types, are known for all positive training examples. When these variables are not known, [37] describes a coordinate descent algorithm that iterates between (1) learning model parameters assuming such latent variables are known and (2) updating latent variables given the learned model parameters. For weakly supervised datasets, our learning algorithm can be applied within step (1). If all labels are given in a supervised 41

framework (as is the case for [145] and [151]), then one can directly apply the approach here. Given labeled positive examples \{I_n, z_n\} and negative examples \{I_n\}, we can write our max-margin learning formulation as:

L(w) = \frac{1}{2} w^T w + C \sum_n \max_{z \in Z_n}[0,\, 1 - y_n w^T \phi(I_n, z)]        (3.6)

Z_n = \{z_n\} \;\; \forall n \; s.t. \; y_n = 1, \qquad Z_n = \{unrestricted\} \;\; \forall n \; s.t. \; y_n = -1

where parameters and features in (3.1) are vectorized and concatenated to form vectors w and \phi(I_n, z). The first term is a standard regularizer. The second term is a cumulative loss over training examples. For each positive training example I_n, if the score of the given labels z_n is greater than 1, the loss evaluates to 0. Else the loss is set to the difference (the slack). For each negative training example, one searches all possible locations and mixture types z for the highest scoring configuration. If its score is less than -1, the loss evaluates to 0; else the loss is set to the difference. The above formulation is equivalent to the convex inner loop of a latent SVM [37]. It can also be written as a structural SVM, for which many excellent large-scale solvers exist [37, 134]. To optimize the above, we rewrite it as a quadratic objective with linear constraints, minimizing it with the dual-coordinate descent quadratic program (QP) solver of [145]. We can rewrite (3.6) in matrix form as:

L(W_a, w_s) = \frac{1}{2} Tr(W_a^T W_a) + \frac{1}{2} w_s^T w_s + C \sum_n \max_{z \in Z_n}\big[0,\, 1 - y_n \big(Tr(W_a^T \Phi_a(I, z)) + w_s^T \phi_s(z)\big)\big]        (3.7)

This means that any equation that can be written as (3.7) can be solved by an off-the-shelf structured SVM solver. By substituting a steerable basis B and coefficient matrix S for the filter bank Wa , we can write the objective function as L(B, S, ws ). With the following key substitutions,

Tr(W_a^T W_a) = Tr(S B^T B S^T)        (3.8)

Tr(W_a^T \Phi_a) = Tr(S B^T \Phi_a) = Tr(B^T \Phi_a S)        (3.9)

we can rewrite (3.7) as:

L(B, S, w_s) = \frac{1}{2} Tr(S B^T B S^T) + \frac{1}{2} w_s^T w_s + C \sum_n \max_{z \in Z_n}\big[0,\, 1 - y_n \big(Tr(B^T \Phi_a(I, z) S) + w_s^T \phi_s(z)\big)\big]        (3.10)

The above function is no longer convex in its arguments. However, by freezing the steering coefficients S, the above function can be written as a convex function:

L(\tilde{B}, w_s) = \frac{1}{2} Tr(\tilde{B}^T \tilde{B}) + \frac{1}{2} w_s^T w_s + C \sum_n \max_{z \in Z_n}\big[0,\, 1 - y_n \big(Tr(\tilde{B}^T \tilde{\Phi}_a(I_n, z)) + w_s^T \phi_s(z)\big)\big]        (3.11)

where \tilde{B} = B A^{1/2}, \quad \tilde{\Phi}_a = \Phi_a S A^{-1/2}, \quad A = S^T S

(3.11) is equivalent in structure to (3.7); hence it is convex and can be optimized with an off-the-shelf structured SVM solver. Given a solution, we can recover the final steerable basis B = \tilde{B} A^{-1/2}. Note that A = S^T S is an n_s \times n_s matrix that will in general be invertible for n_s \ll n_p (e.g., a small number of basis filters compared to a large part vocabulary). One can easily show a similar convex formulation for optimizing L(S, w_s) given a fixed steerable basis B. This makes the overall formulation from (3.10) biconvex in its arguments, amenable to coordinate descent algorithms for minimization [102]. Specifically, given


some initial steerable basis B^*, iterate the following steps using a structured SVM solver:

1. (S^*, w_s^*) = \operatorname{argmin}_{S, w_s} L(B^*, S, w_s)
2. (B^*, w_s^*) = \operatorname{argmin}_{B, w_s} L(B, S^*, w_s)

Initialization: In practice, to initialize B^*, we first independently learn a filter for each part with a standard linear SVM. This is typically inexpensive and parallelizable. We then apply a rank-n_s SVD to this set to estimate an initial B^*.

Latent alignment: A traditional difficulty with subspace methods is that of alignment; if patches are not aligned well, then low-rank approximations will tend to be very blurred. By iterating both over our steerable parameters (S, B, w_s) and latent configuration variables z, our learning algorithm can re-align parts to better match our steerable basis. Hence, even for fully-supervised datasets where part locations z are known, we allow for small latent translations that re-align parts as we learn a steerable basis.
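As a sketch of one half-step of this coordinate descent, the snippet below (Python/numpy) applies the change of variables from (3.11) assuming the steering coefficients S are fixed; solve_structural_svm is a stand-in for an off-the-shelf structural SVM solver and is not a real API, and the per-example feature matrices Phi_list are an assumed data layout.

```python
import numpy as np
from scipy.linalg import sqrtm

def basis_update(Phi_list, S, solve_structural_svm):
    """Phi_list: per-example appearance matrices (nd x np); S: (np x ns)."""
    A = S.T @ S                                      # (ns x ns), invertible for ns << np
    A_inv_half = np.linalg.inv(np.real(sqrtm(A)))
    # Transformed features Phi_tilde = Phi S A^{-1/2}  (nd x ns), as in (3.11).
    Phi_tilde = [Phi @ S @ A_inv_half for Phi in Phi_list]
    B_tilde = solve_structural_svm(Phi_tilde)        # standard convex problem in B_tilde
    return B_tilde @ A_inv_half                      # recover the steerable basis B
```

The S half-step is symmetric: fix B, pre-multiply features by B^T, and solve for (S, w_s) with the same solver.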

3.5 Steerability and separability

Thus far we have made no assumption on the form of each basis filter, beyond the fact that it contains nd parameters. We now augment our model to enforce the fact that each basis filter is separable. One can model each nd -length basis filter as a ny × nx × nf tensor, encoding a spatial neighborhood of ny × nx cells, with nf orientation features extracted from each cell. A fully-separable filter can be written as a rank-1 tensor, or a product of three one-dimensional vectors. For simplicity, we focus on separability in one dimension. To do so, let us reshape each basis filter bj from (3.2) into a nxy × nf matrix Bj that is

restricted to be low rank:

B_j = \sum_{k=1}^{n_k} c_{jk} f_{jk}^T    where    c_{jk} \in R^{n_{xy} \times 1}, \; f_{jk} \in R^{n_f \times 1}

where n_k = 1 corresponds to the fully separable case. We refer to c_{jk} as the spatial basis and f_{jk} as the feature basis. Combining this with (3.2), we can write each part filter as:

W_i^{t_i} = \sum_{j=1}^{n_s} \sum_{k=1}^{n_k} s_{ij}^{t_i} c_{jk} f_{jk}^T    where    W_i^{t_i} \in R^{n_{xy} \times n_f}

When plugging this expression back into (3.3), we see that the overall score function is now multilinear in its parameters. By fixing two sets of its parameters (say the feature basis and steering coefficients), it is simultaneously linear in the third (the spatial basis) and the spatial parameters ws . The resulting learning problem is multiconvex, amenable to coordinate descent where each step corresponds to solving a problem of the form from (3.11), derived by holding two parameters fixed and solving for the third. Again, this convex program can be solved with an off-the-shelf structural SVM solver. We omit the straightforward but cluttered equations for lack of space. One can combine the two approaches by learning a “shared basis” of separability. For example, one could force all basis filters Bj to share the same feature basis:

f_{jk} = f_k

One can then interpret f_k as vectors that span a generic feature basis used by all basis filters. We consider this form of separability in our experiments, as it considerably reduces the number of parameters even further.


3.6 Multi-category learning

Current category-level models are trained and detected independently for each object category [34]. This will clearly not scale to tens of thousands of categories. An open question is how to share structure across such models, both for purposes of increased regularization and computational savings. We show that our steerable framework provides one natural mechanism for sharing. Assume we wish to train M category-specific models, but wish to learn a shared steerable/separable vocabulary across all categories. Given positive and negative training examples for each category, we can write all M learning problems as one optimization:

L_{multi}(W_a^1, \ldots, W_a^M, w_s^1, \ldots, w_s^M) = \sum_{m=1}^{M} L(W_a^m, w_s^m)

where L(Wam , wsm ) is defined as (3.7). As written, the above problem can be optimized over each category-specific model independently. Instead, we share structure across categories by restricting all models to share the same basis filters:

W_a^m = B S_m^T

The previously described coordinate descent algorithm applies to this multi-category formulation as well. Given a fixed basis B, it is straightforward to optimize (Sm , wsm ) with M independent structural SVM problems. Given the learned steering coefficients Sm for each category, a single steerable basis is learned by solving a single structural SVM problem using all the training data. We omit the straightforward but cluttered equations for lack of space.


3.7 Computational savings

Given a learned model, at run-time, we want to maximize score(I, z) in (3.1) over all part positions l and mixture types t. As stated previously, we assume that the shape prior w_s \cdot \phi_s(l, t) factors into a tree model, amenable to dynamic programming. Many practitioners have observed that the bottleneck of the dynamic programming is the computation of the local part scores, for all parts in the vocabulary at all image locations [34, 93]. We analyze the computational savings afforded by steerable and separable representations.

Inference: Following our existing notation, assume we have a vocabulary of n_p parts of dimension n_{xy} \times n_f. Given an image with N cell locations, naive computation of the local score will take O(N n_{xy} n_f n_p). One can show that given a shared feature basis of n_k dimensions and n_s basis filters, the local score computation is dominated by the convolution of basis filters, requiring O(N n_{xy} n_k n_s). For typical parameter values, this results in a 10X to 100X reduction in the number of operations.

Learning: The above savings for inference also speed up learning, since one must run the inference algorithm to collect support vectors. Our steerable model also decreases storage costs by a similar amount. Support vectors will be 10X to 100X smaller since we need only optimize/store a small fraction of the appearance features during each step of coordinate descent. This is practically significant since many models require a large number of support vectors, approaching memory limits of typical hardware. Hence our factored representation allows us to learn significantly larger models.
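The inference saving can be seen in a few lines of numpy/scipy (single feature channel, illustrative sizes): because correlation is linear in the filter, the n_p part-score maps can be obtained from n_s basis-response maps followed by a matrix multiply with the steering coefficients.

```python
import numpy as np
from scipy.signal import correlate2d

H, W_img = 60, 80                                # image size (illustrative)
ny, nx = 5, 5                                    # spatial filter size
np_parts, ns = 138, 20                           # vocabulary size, #basis filters
rng = np.random.default_rng(0)

image = rng.standard_normal((H, W_img))
B = rng.standard_normal((ns, ny, nx))            # basis filters
S = rng.standard_normal((np_parts, ns))          # steering coefficients
parts = np.einsum('ps,syx->pyx', S, B)           # reconstructed part filters

# Naive: one correlation per part  -> O(N * ny*nx * np).
naive = np.stack([correlate2d(image, p, mode='valid') for p in parts])

# Steerable: ns correlations + a matrix multiply -> O(N * ny*nx * ns + N * np * ns).
basis_resp = np.stack([correlate2d(image, b, mode='valid') for b in B])
steered = np.einsum('ps,syx->pyx', S, basis_resp)

assert np.allclose(naive, steered)
```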


3.8 Experiments

We apply our method on three recognition tasks: articulated pose estimation, facial pose estimation/landmark estimation, and object detection (Fig. 3.1). We use standard benchmark datasets, evaluation protocols, and compare results with state-of-the-art methods (as reported on these datasets).

3.8.1 Articulated human pose estimation

The articulated part model of [145] is a pictorial structure defined on a part vocabulary of size np = 138, nxy = 25, nf = 32. These parts represent different in-plane orientations of articulated body parts, and so are natural candidates for a steerable representation. We train and test the model on the Image Parse dataset [108], following the standard evaluation protocol and criteria (PCP score). We visualize the original model from [145], and its reconstruction under our steerable representation in Fig. 3.4. We show a subset of the 138-part vocabulary and the learned steerable filters in Fig. 3.2. Fig. 3.5 compares our results with [145] for various parameter settings. Notably, the baseline model can be tuned for different number of parameters by varying the number of discrete orientations modeled at each part, from 1 to 6. We show that, given a fixed number of parameters, we always obtain a better model using steerable parts. We obtain similar (and even slightly better) PCP performance with a 15X reduction in the total number of parameters, and even obtain reasonable performance for a 100X reduction in model size. These translate to roughly similar improvements in run-time speed, given the singlethreaded public matlab/mex implementation of [145]. For example, per-image running time decreases from 6 seconds to almost 0.5 seconds for our ns = 20, nk = 8 model. We show part-specific localization accuracy in Table 3.1. Finally, we show performance as a


function of the number of iterations of coordinate descent in Fig. 3.6. We see that the initial steerable basis, though initialized with care (as described in Sec. 3.4) considerably improves after a full iteration of coordinate descent.

Method                   Reduction     # basis   Feature         Torso   Head   Upper   Lower   Upper   Lower   Total
                         in # params   ns        dimension nk                   legs    legs    arms    arms
138-part baseline [145]  1             -         -               96.6    93.2   82.0    74.1    72.9    47.3    74.2
52-part baseline [145]   2.7           -         -               95.6    92.2   82.2    72.4    68.5    36.3    70.7
Our Model                15.7          20        8               97.6    92.2   81.2    74.4    74.9    48.8    74.8
Our Model                22.6          20        4               97.1    90.2   82.0    75.4    72.2    42.2    73.1

Table 3.1: We compare articulated pose estimation results with [145], which reports stateof-the-art results on the PARSE benchmark [108]. We use the standard evaluation criteria of PCP. We compare results to the baseline model [145] tuned for different numbers of orientation mixtures. We report the results for different numbers of basis filters ns and different feature dimensions nk . We achieve almost the same total and per part results with over 10 times reduction in the number of parameters and running time. Our model even slightly outperforms the baseline for some parts.

3.8.2 Facial pose estimation / landmark localization

The view-based part model of [151] appears to be the current state-of-the-art system for pose estimation and landmark localization, as given by evaluation on the MultiPIE benchmark [52]. This model defines a mixtures-of-trees facial landmark model, where each mixture encodes an out-of-plane change in orientation. The best performance reported in [151] is given by a model which defines a separate part filter for each out-of-plane orientation, resulting in a large vocabulary of size np = 1050, nxy = 25, nf = 32.

Table 3.2 and Fig. 3.8 compare our results with [151] for various parameter settings. Notably, the baseline model can be tuned for different numbers of parameters by varying the number of out-of-plane orientations modeled at each part. Tuning both our model and the baseline for a 7X reduction in parameters, our model dominates in both pose estimation and landmark localization accuracy. Our model even slightly improves upon the state-of-the-art pose estimation reported in [151], and still produces reasonable performance for a 22X reduction in model size.

(a) Baseline (np = 138, nf = 32); (b) Our model (ns = 20, nk = 8)

Figure 3.4: We show the original articulated human model from [145] in (a) and our steerable/separable reconstruction in (b). Our model looks and performs similar, but is roughly 15X smaller and faster at inference time.

3.8.3 Sharing across object categories

We now present results for steerable variants of the popular deformable part model of [37]. We show results on the PASCAL VOC 2007 testset using standard evaluation criteria. Notably, we apply the multi-category steerable representation presented in Sec. 3.6. For simplicity, we only apply our steerable representation on the part filters and not the root filters, since the former are all equivalent in size. Across all categories, the size of the part vocabulary can be written as np = 480, nxy = 36, nf = 32. We explore a steerable model with ns = 60, nf = 32, since we found that a shared feature basis hurts in the multi-category scenario. Our models are 3X smaller and faster with near-equivalent performance (Table 3.3 and Fig. 3.10). Our parameter reduction would be greater if we also implemented steerable root filters; we are currently pursuing extensions for such variable-size filter banks.

[Plot for Figure 3.5: PCP vs. reduction in the number of parameters, comparing the baseline with our models for nk = 8 and nk = 4.]

Figure 3.5: We compare the performance of the articulated pose model of [145] with our steerable/separable variant for equivalent number of parameters (on the PARSE benchmark). One can reduce the number of parameters in [145] by limiting the number of oriented mixtures of each part (the black baseline), or by exploiting our steerable representation. We plot two curves obtained by varying the number of basis filters ns for different feature dimensions nk. Our model always dominates for an equivalent number of parameters, and even achieves state-of-the-art performance with a 15X reduction in parameters. We also achieve reasonable performance with 100X reduction in parameters.


[Plot for Figure 3.6: PCP vs. number of iterations of coordinate descent, comparing the 138-part baseline, the 52-part baseline, and our ns=20, nk=8 model.]

Figure 3.6: We plot PCP performance over iterations of coordinate descent. Iteration 0 corresponds to initializing both a steerable basis and feature subspace with an SVD of part filters trained independently with SVMs. We see a significant improvement in results in a single iteration, and even slightly outperform the baseline after 6 iterations.

Method                    Reduction     # basis   Subspace   Accuracy of exact   Localization
                          in # params   ns        dim nk     pose estimation     error (mse)
1050-part baseline [151]  1             -         -          91.4                0.0234
146-part baseline [151]   7.2           -         -          82.0                0.0256
99-shared baseline [151]  10.6          -         -          81.5                0.0281
Our Model                 7.2           93        8          91.6                0.0236
Our Model                 22.2          30        8          89.3                0.0247
Our Model                 24.3          30        4          89.9                0.0256

Table 3.2: We compare our results in facial detection, pose estimation, and landmark localization with the baseline in [151]. We compare against baseline models with different amounts of sharing. We show the results for different number of basis filters ns and different subspace dimensions nk . We achieve almost the same performance, in both pose estimation and landmark localization, with a 7-20X reduction in model size.


(a) Baseline (np = 1050, nf = 32); (b) Our model (ns = 93, nk = 8)

Figure 3.7: We show one of the views from the multi-view facial model from [151] in (a) and our steerable/separable reconstruction in (b). Tree-structured spatial constraints between parts are drawn as red lines. Our model looks and performs similar, but is roughly 7X smaller and faster at inference time.

[Plots for Figure 3.8: accuracy of pose estimation (left) and mean squared error in landmark localization (right) vs. reduction in the number of parameters, for the baseline and our models.]

Figure 3.8: Facial pose estimation and landmark localization on MultiPIE. We compare to two versions of the baseline model from [151], tuned for 1 mixture per part or 7 view-based mixtures per part. On the left, we plot pose estimation accuracy (higher is better). On the right, we plot landmark localization error (lower is better). For an equivalent number of parameters, our model greatly outperforms the baseline. We also closely match baseline performance with a 24X reduction in model size.


[Plot for Figure 3.9: fraction of testing faces vs. average localization error as a fraction of face size, comparing Baseline (1050-part), Baseline (146-shared), ns=93/nk=8, and ns=30/nk=8.]

Figure 3.9: We evaluate landmark localization accuracy with cumulative error plots. We plot the fraction of test images for which landmark localization error is below a given value. We compare against baseline models with vocabularies of 1050 and 146 parts. Our model gives roughly equivalent performance with a 7-20X reduction in model size.

(a) Baseline (np = 480); (b) Our model (ns = 60)

Figure 3.10: On the left, we show the result of our implementation of the car model from [37]. On the right, we show our learned model reconstructed from 60 steerable basis filters, shared across all 20 object categories. Our model looks and performs similar, but is 3X smaller and faster at run-time.

3.9 Conclusion

We described a method for learning steerable deformable part models, based on the observation that part templates can be written as linear filter banks. We show how one can leverage existing SVM solvers to learn steerable representations using rank constraints. We demonstrate impressive results on three diverse problems in recognition, showing improvements of up to 10X-100X in size and speed. When we compare our steerable models with baselines tuned for equivalent sizes, our models always dominate in performance, suggesting that they are more natural representations. Finally, we demonstrate that steerable structure can be shared across different object categories. This leads to highly scalable part models for detecting human pose and objects, which we use to extract our high-level features for action recognition. In the next chapter, we will improve the results of these detectors by enforcing temporal coherency.

Category       plane   bicycle  bird   boat   bottle  bus    car
voc-rel4[36]   29.6    57.3     10.1   17.1   25.2    47.8   55.0
Our Baseline   27.1    57.1     10.2   13.9   22.5    47.3   52.2
Our Model      29.7    56.6     10.2   15.3   23.1    48.7   53.8

Category       cat     chair    cow    table  dog     horse  mbike
voc-rel4[36]   18.4    21.6     24.7   23.3   11.2    57.6   46.5
Our Baseline   17.4    17.8     23.8   20.4   6.8     56.1   43.5
Our Model      15.7    19.9     22.2   20.5   4.3     56.0   46.0

Category       person  plant    sheep  sofa   train   tv     total
voc-rel4[36]   42.1    12.2     18.6   31.9   44.5    40.9   31.8
Our Baseline   42.3    12.0     18.5   32.5   39.0    39.7   30.0
Our Model      40.4    12.3     18.6   30.1   40.4    41.4   30.3

Table 3.3: Average precision for different object categories in the PASCAL 2007 dataset. The first row contains the results reported in the released code of [36] without any postprocessing. We reimplemented the code to allow for easier modification. Our reimplementation is shown in the second row. The third row is the steerable variant of our reimplementation, tuned for ns = 60 and a 3X reduction in the number of parameters. Our performance slightly increases while yielding a smaller and faster model.

Chapter 4 Temporal Features

4.1 Introduction

So far, we have introduced our scalable representations for extracting high-level features from each frame independently. The detectors are still far from ideal, so they produce a lot of errors and false alarms. Moreover, we know that frames are not independent from each other and the features exhibit a form of temporal coherency. A naive way of exploiting this temporal coherency would be collecting the independent results and letting the action model enforce coherency by some form of temporal averaging. A more sophisticated approach would be enforcing the coherency by tracking features in the time domain. Imagine enforcing a limited rate of change (velocity) in the locations of objects and the angles of human joints. This can remove a lot of mistakes and lead to better detection results. Moreover, tracking provides some valuable by-products like the trajectory. These outputs can be used in constructing better high-level features. For instance, we can extract trajectory features like the velocity from the whole track of an object or human pose. This may not be possible using single-frame detections. Or we can use non-max-suppression at the track level


instead of the frame level to remove duplicate detections more efficiently. In this chapter, we consider the problem of tracking a variable number of objects in a video sequence [103]. We approach this task as a “spatio-temporal grouping” problem, where all image regions must be labeled as background or as a detection belonging to a particular object track. From such a grouping perspective, one must explicitly estimate (a) the number of unique tracks and (b) the spatio-temporal extent, including the start/termination times, of each track (Fig. 4.1). Approaches to accomplishing the above tasks typically employ heuristics or expensive algorithms that scale exponentially in the number of objects and/or super-linearly in the length of the video. In this work, we outline a family of multi-object tracking algorithms that are:

1. Globally optimal (for common objective functions)
2. Locally greedy (and hence easy to implement)
3. Scale linearly in the number of objects and (quasi)linearly with video-length

Our contribution is grounded in a novel analysis of an integer linear program (ILP) formulation of multi-object tracking [64, 149, 10, 81, 7, 94]. Our work most closely follows the min-cost flow algorithm of [149]. We show that one can exploit the special structure of the tracking problem by using a greedy, successive shortest-path algorithm to reduce the best-previous running time of O(N^3 log^2 N) to O(KN log N), where K is the unknown, optimal number of unique tracks, and N is the length of the video sequence.

The intuition behind the greedy approach stems from this surprising fact (Fig. 4.2): the optimal interpretation of a video with k + 1 tracks can be derived by a local modification to the solution obtained for k tracks. Guided by this insight, we also introduce an approximate greedy algorithm whose running time scales linearly with sequence length (i.e., O(KN)), and is in practice several orders of magnitude faster with no observable loss in accuracy. Finally, our greedy algorithms allow for the embedding of various pre-processing or post-processing heuristics (such as non-maximum suppression) into the tracking algorithm, which can boost performance.

Figure 4.1: We treat the problem of multi-target tracking through a perspective of "spatio-temporal grouping", where both a large number of groups and their spatio-temporal extent (e.g., the number of objects and their track births and deaths) must be estimated. We show the output of an efficient, linear-time algorithm for solving this computational problem on the ETHMS dataset [29]. In this video clip our method returns hundreds of correct tracks, as evident by the overlaid track numbers.

Related Work

Classic formulations of multi-object tracking focus on the data association problem of matching instance labels with temporal observations [42, 16, 20, 61]. Many approaches assume manual initialization of tracks and/or a fixed, known number of objects [64]. However, for many real-world tracking problems, such information is not available. A more general spatio-temporal grouping framework is required in which these quantities are au-

58

tomatically estimated from video data. A popular approach to multi-object tracking is to run a low-level tracker to obtain “tracklets”, and then stitch together tracklets using various graph-based formalisms or greedy heuristics [70, 124, 77, 96, 7]. Such graph-based algorithms include flow-networks [149], linear-programming formulations [64], and matching algorithms [70]. One of the contributions of this work is to show that with a particular choice of low-level tracker, and a particular schedule of track instantiation, such an algorithm can be globally-optimal. We rely on an increasingly common ILP formulation of tracking [64, 149, 10, 81, 7, 94]. Such approaches restrict the set of possible object locations to a finite set of candidate windows on the pixel grid. Because standard linear programming (LP) relaxations do not scale well, many algorithms process a small set of candidates, with limited or no occlusion modeling. This can produce broken tracks, often requiring a second merging stage. Our scalable algorithm is able to process much larger problems and directly produces state-ofthe-art tracks. Our work relies heavily on the min-cost flow network introduced for temporal data association in [149]. We compare our results with the min-cost solver used in that work [47], and verified that our O(KN log N ) algorithm produces identical results, and that our approximate O(KN ) algorithm produces near-identical results when properly tuned. In concurrent work, Berclaz et al. describe a O(KN log N) algorithm for multi-object tracking in [11]. It is similar in many respects with some differences: Our graph representation has a pair of nodes for each detection. This allows us to explicitly model object dynamics through transition costs, and allows for a simpler flow-based analysis. In addition, our algorithm instantiates tracks in a greedy fashion, allowing for the integration of pre-processing steps (e.g., non-max-suppression) that improve accuracy. Finally, we also describe approximate O(KN) algorithms that perform near-identically in practice.

59

Figure 4.2: The intuition behind our optimal greedy algorithm. Assume that we are tracking the ’x’ location of multiple objects over time. On the left, we show the optimal estimate of 3 object trajectories. One may suppress all detections along the previously instanced tracks and instance a new track (On the middle top.) However, this is not the optimum solution since given the knowledge that an additional object is present, one may need to adjust the existing tracks. We show that one can do this with a shortest-path/min-flow computation that pushes flow from a source to a terminal (middle bottom). The solution can reverse flow along existing tracks to “cut and paste” segments, producing the optimal 4-track estimate (right). We further speed up this process by approximating such edits using fast dynamic programming algorithms.

4.3 Model

We define an objective function for multi-object tracking equivalent to that of [149]. The objective can be derived from a generative perspective by considering a Hidden Markov Model (HMM) whose state space is the set of true object locations at each frame, along with a prior that specifies likely state transitions (including births and deaths) and an observation likelihood that generates dense image features for all objects and the background.


4.3.1 Independent tracks

We write x for a vector-valued random variable that represents the location of a particular object, as given by a pixel position, scale, and frame number:

x = (p, \sigma, t), \quad x \in V        (4.1)

where V denotes the set of all space-time locations in a video.

Prior: We write a single track as an ordered set of state vectors T = (x_1, \ldots, x_N), ordered by increasing frame number. We write the collection of tracks as a set X = \{T_1, \ldots, T_K\}. We assume that tracks behave independently of each other, and that each follows a variable-length Markov model:

P(X) = \prod_{T \in X} P(T)    where    P(T) = P_s(x_1) \Big(\prod_{n=1}^{N-1} P(x_{n+1}|x_n)\Big) P_t(x_N)

The dynamic model P (xn+1 |xn ) encodes a smoothness prior for track location. We write Ps (x1 ) for the probability of a track starting at location x1 , and Pt (xN ) for the probability of a track transitioning into a termination state from location xN . If the probability of termination is low, the above prior will tend to favor longer, but fewer tracks so as to minimize the total number of terminations. If these probabilities are dependent on the spatial coordinate of x, they can model the fact that tracks tend to terminate near image borders or start near entry points such as doorways.

Likelihood: We write Y = \{y_i \,|\, i \in V\} for the set of feature vectors observed at all space-time locations in a video. For example, these could be the set of gradient histogram features


that are scored by a sliding-window object detector. We now describe a likelihood model for generating Y given the set of tracks X. We make two assumptions: 1) there exists a one-to-one mapping between a putative object state x and space-time location index i and 2) tracks do not overlap (T_k \cap T_l = \emptyset for k \neq l). Together, both imply that a location can be "claimed" by at most one track. We write y_x for the image features at location x; these features are generated from a foreground appearance model. Feature vectors for unclaimed windows are generated from a background model:

P(Y|X) = \prod_{T \in X} \prod_{x \in T} P_{fg}(y_x) \prod_{i \in V \setminus X} P_{bg}(y_i) = Z \prod_{T \in X} \prod_{x \in T} l(y_x)        (4.2)

where    l(y_x) = \frac{P_{fg}(y_x)}{P_{bg}(y_x)}    and    Z = \prod_i P_{bg}(y_i)

The likelihood is, up to a constant, only dependent on features of the windows which are part of the set of tracks. If we assume that the foreground and background likelihoods are Gaussian densities with the same covariance,

Pf g (yx ) = N (yx ; µf g , Σ) and Pbg (yx ) = N (yx ; µbg , Σ),

we can write the log-likelihood-ratio as a linear function (log l(yx ) = w · yx ), akin to a logistic regression model derived from a class-conditional Gaussian assumption. This model provides a generative motivation for the linear templates that we use as local detectors in our experiments.
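The claim can be checked numerically with a few lines of numpy/scipy (synthetic means and covariance); the thesis absorbs the constant bias term b into the linear template, which is kept explicit here.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dim = 8
mu_fg, mu_bg = rng.standard_normal(dim), rng.standard_normal(dim)
A_rand = rng.standard_normal((dim, dim))
Sigma = A_rand @ A_rand.T + dim * np.eye(dim)          # shared (equal) covariance

w = np.linalg.solve(Sigma, mu_fg - mu_bg)              # linear template
b = -0.5 * (mu_fg @ np.linalg.solve(Sigma, mu_fg)
            - mu_bg @ np.linalg.solve(Sigma, mu_bg))   # constant bias

y = rng.standard_normal(dim)
log_ratio = (multivariate_normal.logpdf(y, mu_fg, Sigma)
             - multivariate_normal.logpdf(y, mu_bg, Sigma))
assert np.isclose(log_ratio, w @ y + b)                # log-likelihood ratio is affine in y
```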


4.3.2 Track interdependence

The above model is reasonable when the tracks do not overlap or occlude each other. However, in practice we need to deal with both occlusion and non-maxima suppression.

Occlusion: To model occlusions, we allow tracks to be composed of state vectors from non-consecutive frames, e.g., we allow t_n and t_{n+1} to differ by up to k frames. The dynamic model P(x_{n+1}|x_n) for such k-frame skips captures the probability of observing the given k-frame occlusion.

Non-maxima suppression: When we consider a dense set of locations V, there will be multiple tracks which score well but correspond to the same object (e.g., a good track shifted by one pixel will also have a high probability match to the appearance model). A complete generative model could account for this by producing a cluster of image features around each true object location. Inference would "explain away" evidence and enforce exclusion. In practice, the typical solution is to apply non-max suppression (NMS) as a pre-process to prune the set of candidate locations V prior to multi-object tracking [96, 16, 64, 149]. In our experiments, we also utilize NMS to prune the set V and as a heuristic for explaining away evidence. However, we show that the NMS procedure can be naturally embedded within our iterative algorithm (rather than as a pre-process). By suppressing extra detections around each track as it is instanced, we allow for the possibility that the prior can override the observation term and select a window which is not a local maximum. This allows the NMS procedure to exploit temporal coherence. The recent work of [7] makes a similar argument and adds an explicit non-overlapping constraint to their ILP, which may sacrifice tractability. We demonstrate in Sec. 4.6 that our simple and fast approach produces state-of-the-art results.

4.4 MAP Inference

The maximum a posteriori (MAP) estimate of tracks given the collection of observed features is:

$$X^* = \arg\max_X P(X) P(Y|X) \qquad (4.3)$$
$$= \arg\max_X \prod_{T \in X} P(T) \prod_{x \in T} l(y_x) \qquad (4.4)$$
$$= \arg\max_X \sum_{T \in X} \Big[ \log P(T) + \sum_{x \in T} \log l(y_x) \Big] \qquad (4.5)$$

We drop the constant factor $Z$ and take the logarithm of the objective function to simplify the expression while preserving the MAP solution. The above can be re-written as an Integer Linear Program:

$$f^* = \arg\min_f C(f) \qquad (4.6)$$
$$\text{with} \quad C(f) = \sum_i c_i^s f_i^s + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_i c_i^t f_i^t \qquad (4.7)$$
$$\text{s.t.} \quad f_{ij}, f_i, f_i^s, f_i^t \in \{0, 1\} \quad \text{and} \quad f_i^s + \sum_j f_{ji} = f_i = f_i^t + \sum_j f_{ij} \qquad (4.8)$$

where $f_i$ is a binary indicator variable that is 1 when space-time location $i$ is included in some track. The auxiliary variables $f_{ij}$, along with the second constraint (4.8), ensure that at most one track claims location $i$, and that multiple tracks may not split or merge.


Figure 4.3: The network model from [149] for three consecutive frames of video. Each space-time location $i \in V$ is represented by a pair of nodes connected by a red edge. Possible transitions between locations are modeled by blue edges. To allow tracks to start and end at any spatio-temporal point in the video, each location $i$ is connected to both a start node $s$ and a termination node $t$. All edges are directed and have unit capacity. The costs are $c_i$ for red edges, $c_{ij}$ for blue edges, and $c_i^s$ and $c_i^t$ for black edges.

With a slight abuse of notation, let us write $x_i$ for the putative state corresponding to location $i$:

$$c_i^s = -\log P_s(x_i), \quad c_i^t = -\log P_t(x_i), \quad c_{ij} = -\log P(x_j|x_i), \quad c_i = -\log l(y_i) \qquad (4.9)$$

These costs encode the track start, termination, transition, and observation likelihoods, respectively. We define the edge set $E$ to span the set of permissible state transitions given by our dynamic model (Sec. 4.3.1).


4.4.1 Equivalence to network flow

To solve the above problem, we can relax the integer constraints in (4.8) to linear box constraints (e.g., $0 \le f_i \le 1$). This relaxation yields a unit-capacity network flow problem whose constraint matrix is totally unimodular, implying that optimal solutions to the relaxed problem will still be integral [4]. In particular, assume that we knew the number of tracks in a video to be $K$. Let $F_K$ denote the set of flow conservation and unit capacity constraints along with the additional constraints

$$F_K = \left\{ f : \; f_{ij}, f_i, f_i^s, f_i^t \in [0,1], \;\; f_i^s + \sum_j f_{ji} = f_i = f_i^t + \sum_j f_{ij}, \;\; \sum_i f_i^s = K, \;\; \sum_i f_i^t = K \right\}$$

Minimizing $C(f)$ subject to the constraints $F_K$ is an instance of a minimum-cost flow problem [4, 149]. Such problems are similar to max-flow problems (commonly used in vision for solving graph-cut problems [15]), except that edges in the flow network are labeled with a cost as well as a capacity. The cost of a flow is defined to be the sum, over all edges, of the cost of each edge multiplied by the flow through that edge. Finding the MAP estimate of $K$ tracks corresponds to finding a minimum-cost flow that pushes $K$ units of flow from the source to the sink. Figure 4.3 shows an example flow network constructed from the tracking problem. Each space-time location $i$, or equivalently putative object state $x_i$, corresponds to a pair of nodes $(u_i, v_i)$ connected by an edge of cost $c_i$. Each transition between successive windows is represented by an edge $(v_i, u_j)$ with cost $c_{ij}$. Finally, nodes $s$ and $t$ are introduced with edges $(s, u_i)$ corresponding to track starts and edges $(v_i, t)$ for terminations (with costs $c_i^s$ and $c_i^t$ respectively). All edges have unit capacity. Pushing $K$ units of flow from $s$ to $t$ yields a set of $K$ disjoint $st$-paths, each of which corresponds to one of the optimal tracks $T \in X$.

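As an illustration only (this is neither the solver of [149] nor our implementation), the following sketch builds the node-split network of Fig. 4.3 with networkx and solves it for a fixed $K$ with an off-the-shelf min-cost-flow routine. Scaling costs to integers is our own workaround, since the network-simplex routine expects integral data:

import networkx as nx

def track_min_cost_flow(detections, transitions, K, scale=1000):
    # detections: list of dicts with costs 'c' (observation), 'cs' (birth), 'ct' (death)
    # transitions: list of (i, j, c_ij) tuples for the allowed transitions in E
    G = nx.DiGraph()
    G.add_node('s', demand=-K)    # push K units of flow out of the source ...
    G.add_node('t', demand=K)     # ... and into the sink
    for i, d in enumerate(detections):
        u, v = ('u', i), ('v', i)
        G.add_edge(u, v, capacity=1, weight=int(scale * d['c']))     # observation edge c_i
        G.add_edge('s', u, capacity=1, weight=int(scale * d['cs']))  # birth edge c_i^s
        G.add_edge(v, 't', capacity=1, weight=int(scale * d['ct']))  # death edge c_i^t
    for (i, j, cij) in transitions:
        G.add_edge(('v', i), ('u', j), capacity=1, weight=int(scale * cij))  # transition c_ij
    return nx.min_cost_flow(G)    # dict of dicts giving the flow on every edge

Each unit of flow traces a disjoint s-t path through the detections it claims, which can be read off from the returned flow dictionary.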

4.5 Finding min-cost flows

Zhang et al. [149] describe how to solve the above optimization problem in $O(mn^2 \log n)$ time using a push-relabel method [47], where $n$ is the number of nodes (e.g., detection windows) in the network graph and $m$ is the number of edges. Assuming that $n$ and $m$ scale linearly with the number of frames $N$ (reasonable given a fixed number of detections per frame), the algorithm takes $O(N^3 \log N)$ to find $K$ tracks. Furthermore, the cost of the optimal solution, $\min_{f \in F_K} C(f)$, is convex in $K$ [149], so one can use a bisection search over $K$ (upper-bounded by the number of detections) to find the optimal number of tracks with a total running time of $O(N^3 \log^2 N)$. In the following, we show that one can solve the multi-object tracking problem in $O(KN \log N)$ by solving $K + 1$ shortest-path problems. This considerable reduction in complexity is due to two particular properties of the network in Fig. 4.3:

1. All edges are unit capacity.
2. The network is a directed acyclic graph (DAG).

The above conditions allow one to use dynamic programming (DP) algorithms to compute shortest paths. We describe a novel DP algorithm that is necessary to construct a globally-optimal $O(KN \log N)$ algorithm. We also show that DP produces the optimal solution for $K = 1$ in $O(N)$ and high-quality approximate solutions for $K > 1$ in $O(KN)$. We begin by describing the optimal $O(KN \log N)$ algorithm based on successive shortest paths (introduced in Fig. 4.2).


4.5.1 Successive Shortest-paths

We now describe a successive shortest path algorithm [4] for solving min-cost flow problems on DAG networks with unit-capacity links. Given a graph $G$ with an integral flow $f$, define the residual graph $G_r(f)$ to be the same as the original graph except that all edges used in the flow $f$ are reversed in direction and assigned the negative of their original cost. We initialize the algorithm by setting the flow $f$ to zero and then iterate the following two steps until no negative-cost path can be found:

1. Find the minimum-cost path $\gamma$ from $s$ to $t$ in $G_r(f)$.
2. If the total cost of the path $C(\gamma)$ is negative, update $f$ by pushing unit flow along $\gamma$.

Since each path has unit capacity, each iteration increases the total flow by 1 and decreases the objective by $C(\gamma)$. The algorithm terminates after $K + 1$ iterations, having found a minimum-cost flow; pushing any further flow from $s$ to $t$ would only increase the cost. We refer the reader to [4] for a proof of the correctness of the algorithm but give a brief outline. We say a flow $f$ is $F_K$-feasible if it satisfies the constraint set $F_K$. A necessary and sufficient condition for $f$ to be a minimum-cost flow of size $K$ is that it be $F_K$-feasible and that there does not exist a negative-cost directed cycle in $G_r(f)$. The successive shortest-paths algorithm above starts with an $F_0$-feasible flow and at each iteration $i$ yields a new flow which is $F_i$-feasible. Furthermore, each step of the algorithm modifies edges along a single path and can be shown not to introduce any negative-weight cycles. Fig. 4.4 shows example iterations of this algorithm and the resulting sequence of residual graphs. Note that the shortest path in the residual network may instance a new track and/or edit previous tracks by removing flow from them (by pushing flow through the reverse edges).

In each iteration, we need to find a shortest $st$-path. We would like to use Dijkstra’s algorithm to compute the shortest path in $O(N \log N)$, making the overall algorithm $O(KN \log N)$ where $K$ is the optimal number of tracks. Unfortunately, there are negative edge costs in our original network, precluding the direct application of Dijkstra’s algorithm. Fortunately, one can convert any min-cost flow network to an equivalent network with non-negative costs [4]. This conversion requires computing the shortest path from $s$ to every node in the original graph $G$. For general graphs with negative weights, this computation takes $O(N^2)$ using the Bellman-Ford algorithm [4]. For DAGs, one can use an $O(N)$ dynamic programming algorithm, which we describe below. The successive shortest path algorithm thus runs in $O(KN \log N)$ operations and returns the global minimum of our tracking problem (Equation 4.3).
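The reweighting itself is standard; as a minimal sketch (our own illustration, assuming dist[u] already holds the shortest-path distance from s to node u, e.g. as computed by the DAG dynamic program described below):

def reweight(edge_costs, dist):
    # Potential-based reduction: with potentials dist[.], every reweighted edge cost
    #   c'(u, v) = c(u, v) + dist[u] - dist[v]
    # is non-negative and shortest paths are preserved, so Dijkstra's algorithm can
    # be applied to the residual graph at each iteration.
    return {(u, v): c + dist[u] - dist[v] for (u, v), c in edge_costs.items()}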

4.5.2 Dynamic Programming Solution for K = 1

We now present an $O(N)$ dynamic programming (DP) algorithm for computing the shortest path from $s$ to every node. We will also show that this algorithm solves the min-cost flow problem for $K = 1$. Because each edge in the network has unit capacity, the minimum-cost unit flow must correspond to the shortest path from node $s$ to $t$. Because the original network graph is a DAG, one can construct a partial ordering of nodes and use DP to compute shortest paths by sweeping from the first to the last frame. This is similar to DP algorithms for tracking, but augmented to estimate both the birth and death time of a track. Assume that nodes are ordered in time, and let $\text{cost}(i)$ represent the minimum cost of a track passing through node $i$. We initialize $\text{cost}(i)$ for detections in the first frame to be $\text{cost}(i) = c_i + c_i^s$. We can then recursively compute the cost in successive frames as:

$$\text{cost}(i) = c_i + \min(\pi, c_i^s) \quad \text{where} \quad \pi = \min_{j \in N(i)} \big( c_{ij} + \text{cost}(j) \big) \qquad (4.10)$$


Figure 4.4: Illustration of the successive shortest path algorithm. (a) The tracking problem modeled as a graph as described in Fig. 4.3. The algorithm should send a given amount of flow from source node $s$ to the terminal $t$. (b) One unit of flow $f_1$ is passed through the shortest path (in red) from source to terminal. (c) The residual graph $G_r(f_1)$ produced by eliminating the shortest path and adding edges (in green) with unit capacity and negative cost in the opposite direction. (d) The shortest path found in the residual graph. In this example, this path uses previously added edges, pushing flow backwards and “editing” the previously instanced tracks. (e) Residual graph after passing two units of flow. At this point, no negative-cost paths exist, so the algorithm terminates and returns the two tracks highlighted in (f). Note that the algorithm ultimately splits the track instanced in the first step in order to produce the final optimal set of tracks. In this example only one split happens in an iteration; it is possible, though rare in practice, for a shortest path to use edges from two or more previously instanced tracks. Our one-pass dynamic programming algorithm cannot resolve any splitting since the residual graph has cycles; however, the 2-pass dynamic programming algorithm can resolve the situation where a new shortest path splits at most one previously instanced track.

where $N(i)$ is the set of detections from the previous $k$ frames that can transition to detection $i$. The cost of the optimal track ending at node $i$ is then $\text{cost}(i) + c_i^t$, and the overall shortest path is computed by taking a min over $i$. By caching the argmin values at each node, we can reconstruct the shortest path to each node in a single backward sweep.

4.5.3 Approximate DP solution for K > 1

We now propose a simple greedy algorithm to instance a variable, unknown number of disjoint, low-cost tracks. Start with the original network-flow graph:

1. Find the shortest path from s to t using DP.
2. If the cost of the path is negative, remove nodes on the path and repeat.

The above algorithm performs $K + 1$ iterations of DP to discover $K$ tracks – the last instanced track is ignored since it increases the overall cost. Its running time is $O(KN)$. At each iteration, we have obtained a feasible (but not necessarily minimum-cost) $k$-unit flow. The sub-optimality lies in the fact that the above algorithm cannot adjust any previously instanced tracks based on the demand to produce additional tracks. In successive stages, it operates on a subset of the original graph rather than the residual graph used in the successive shortest paths algorithm. Unfortunately, dynamic programming cannot be directly applied to the residual graph $G_r(f)$ since the residual graph is no longer a DAG (Fig. 4.4(c)).
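The sketch below (our Python illustration, not the MATLAB code used in our experiments) implements the recursion (4.10) and the greedy loop above. It assumes detections are sorted by frame, that prev[i] lists the detections allowed to transition into i, and that cij holds a cost for each such pair; these data structures are illustrative:

import math

def shortest_track(c, cs, ct, cij, prev, alive):
    # One forward DP sweep of (4.10): cost[i] = c_i + min(pi, c_i^s),
    # with pi = min_{j in N(i)} c_ij + cost[j]. Returns the best (cost, track).
    n = len(c)
    cost = [math.inf] * n
    parent = [None] * n
    for i in range(n):
        if not alive[i]:
            continue
        pi, best_prev = math.inf, None
        for j in prev[i]:
            if alive[j] and cij[(j, i)] + cost[j] < pi:
                pi, best_prev = cij[(j, i)] + cost[j], j
        if cs[i] <= pi:                   # cheaper to start a new track at i
            pi, best_prev = cs[i], None
        cost[i] = c[i] + pi
        parent[i] = best_prev
    live = [i for i in range(n) if alive[i]]
    if not live:
        return math.inf, []
    end = min(live, key=lambda i: cost[i] + ct[i])   # best termination node
    total, track, i = cost[end] + ct[end], [], end
    while i is not None:                  # backtrack through cached argmins
        track.append(i)
        i = parent[i]
    return total, track[::-1]

def greedy_tracks(c, cs, ct, cij, prev):
    # Greedy K>1 procedure of Sec. 4.5.3: instance negative-cost tracks one at a
    # time, removing their detections from the graph after each iteration.
    alive = [True] * len(c)
    tracks = []
    while True:
        total, track = shortest_track(c, cs, ct, cij, prev, alive)
        if not track or total >= 0:
            break
        tracks.append(track)
        for i in track:
            alive[i] = False
    return tracks

Here c[i] plays the role of the negative log-likelihood ratio, so confident detections have negative cost and the loop stops once no track lowers the objective.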

4.5.4 Approximate 2-pass DP solution for K > 1

We now describe a generalization of our DP-based algorithm from Sec. 4.5.3 that can also instance new tracks while performing “small” edits of previously instanced tracks. We observe that most of the time the shortest residual path does not make large edits on previous tracks. We use the same algorithm from Section 4.5.1, but perform an approximate shortest-path computation using a 2-pass DP algorithm rather than Dijkstra’s algorithm. We perform a forward pass of DP as in (4.10), but on $G_r(f)$ rather than $G$, with $\text{cost}(i)$ defined as the best forward-progressing path from the source to node $i$ (ignoring reversed edges). We then use the costs as initial values for a backward pass starting from the last frame, defining $N(i)$ to be the set of nodes connected through reverse edges to $i$. After this pass, $\text{cost}(i)$ is the cost of the best forward- and backward-progressing path ending at $i$. One could add additional passes, but we find experimentally that two passes are sufficient for good performance while saving $O(\log N)$ operations over Dijkstra’s approach.

4.5.5 Caching

Our DP algorithms repeatedly perform computations on a series of reduced or residual graphs. Many of these computations can be cached. Consider the DP computations required for the algorithm from Section 4.5.3. Once a track is instanced, the $\text{cost}(i)$ values for nodes whose shortest paths intersect that track are no longer valid, and it is only this small number of nodes that need to be re-evaluated in the next iteration. This set can be marked using the following fact: any paths that intersect at some node must share the same birth node. Each node can be labeled with its birth node by propagating a birth ID during message-passing in DP. We then only need to recompute $\text{cost}(i)$ for nodes that have the same birth node as a newly instanced track. In our experiments, this decreased computation time by three orders of magnitude.



Figure 4.5: We show the results of our algorithm, including estimated track births and deaths, on the Caltech Pedestrian dataset [27]. We show typical results on the ETHMS dataset in Fig. 4.1.

Figure 4.6: We show the results of our algorithm on ETHMS dataset. We show the input to our algorithm on the left and the resulting tracks on the right. We can reduce the number of false positives by enforcing temporal coherency.


4.6 Experimental Results

Datasets: Most benchmarks for multi-object tracking (e.g., PETS [146]) are designed for stationary cameras. We are interested in moving camera applications, and so use the Caltech Pedestrian dataset [27] and ETHMS dataset [29] to evaluate our algorithms. The Caltech dataset was captured by a camera installed on a moving car. It contains 71 videos of roughly 1800 frames each, captured at 30 frames per second. Since the testset contains held-out labels, we evaluate ourselves using all annotated pedestrians on the training set. The ETHMS dataset contains footage of a busy sidewalk as seen by a camera mounted on a child stroller. Although this dataset contains both left and right views to facilitate stereo, we use only the left view in our experiments. The dataset contains four videos of roughly 1000 frames each, captured at 14 fps. Both datasets include bounding box annotations for people, while Caltech also provides track IDs. We manually annotated IDs on a portion of ETHMS. In order to compare our results with previous work, we use the same ETHMS video sequence as [149] with 999 frames and ignore detections smaller than 24 pixels as they did. Setup: We ran an “out-of-the-box” pre-trained part-based HOG pedestrian detector [35] with a conservative NMS threshold, generating around 1000 detections per frame of each video. We set the log-likelihood ratio (the local cost ci ) of each detection to be the negative score of the linear detector (the distance from the decision boundary of an SVM). We use a bounded-velocity dynamic model: we define the transition cost cij to be 0, but only connect candidate windows across consecutive frames that spatially overlap. We set birth and death costs (csi , cti ) to be 10. We experimented with applying an additional NMS step within our greedy algorithm. We also experimented with occlusion modeling by adding transitions which skip over k frames, with k up to 10.


Scoring criteria: We use detection accuracy (as measured by detection rate and false positives per frame) as our primary evaluation criterion, as it allows us to compare with a wide body of related work on these datasets. To directly score tracker accuracy, various other criteria (such as track fragmentation, identity switching, etc.) have been proposed [121, 97, 149]. We also use track identity to evaluate our algorithms below. Approximation quality: We have described three different algorithms for solving the minimum-cost flow problem. Fig. 4.7 shows the flow cost, i.e., the objective function, versus iteration number for all three algorithms on the Caltech dataset. The DP algorithms follow the successive shortest path (SSP) algorithm for many iterations, but eventually it becomes necessary to “edit” a previously instanced track (as in Figure 4.4) and the greedy DP algorithm begins to make suboptimal choices. However, DP and SSP do not deviate much before reaching the minimum cost, and the 2-pass DP, which allows for a single edit, follows SSP quite closely. The inset of this figure shows a close-up of the cost near the minimum. Since the 2-pass algorithm can split at most one track in each iteration and it is very rare to see two splits in the same iteration, the cost achieved by the 2-pass DP algorithm is very close to the optimal one. Rather than scoring the cost function, we can directly compare algorithms using track accuracy. Fig. 4.8 shows detection rate versus FPPI for the baseline detector, DP, and SSP algorithms. These figures show that DP and SSP are similar in accuracy, with DP performing even better in some cases. We suspect the SSP algorithm produces (overly) short tracks because the 1st-order Markov model enforces a geometric distribution over track length. The approximate DP algorithm inadvertently produces longer tracks (that better match the ground-truth distribution of lengths) because previously instanced tracks are never cut or edited. We henceforth evaluate our one-pass DP algorithm in the subsequent experiments. We also present additional diagnostic experiments on the ETHMS data, since it contains on average more objects than Caltech.


(Figure 4.7 plots cost versus the number of tracks for the three algorithms; legend minima: DP −1263.6 at 444 tracks, 2-pass DP −1283.0 at 516, successive shortest path −1284.1 at 522.)

Figure 4.7: Cost vs. iteration number for all three algorithms on the Caltech dataset. The inset shows that our 2-pass DP algorithm produces tracks whose cost is close to optimum while being orders of magnitude faster.

Track identities: We evaluate track identities on the ETHMS dataset by using our tracker to compute track labels for ground-truth bounding boxes. This is equivalent to running our tracker on an ideal object detector with zero missed detections and false positives. Given a correspondence between estimated track labels and ground-truth track labels, the misclassification rate is the fraction of bounding boxes with incorrect labels. We compute the correspondence that minimizes this error by bipartite matching [70]. We found occlusion modeling to be crucial for maintaining track identities. Our algorithms can report tracks with $k$-frame occlusions by adding in transitions between space-time windows spaced $k$ frames apart. Our DP algorithm scales linearly with $k$, and so we can readily model long 10-frame occlusions (Table 4.1). This greatly increases the accuracy of track labels on this data because such occlusions are common when nearby people pass the camera, occluding people further away. This result implies that, given ideal local detectors, our tracking algorithm produces track identities with 90% accuracy.
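As a sketch of this evaluation (our illustration; the thesis does not specify an implementation), the minimizing correspondence can be found with the Hungarian algorithm:

import numpy as np
from scipy.optimize import linear_sum_assignment

def id_error(pred_ids, gt_ids):
    # pred_ids[i] and gt_ids[i] are the predicted and ground-truth track labels
    # of the i-th annotated bounding box.
    preds, gts = sorted(set(pred_ids)), sorted(set(gt_ids))
    counts = np.zeros((len(preds), len(gts)))
    for p, g in zip(pred_ids, gt_ids):
        counts[preds.index(p), gts.index(g)] += 1
    rows, cols = linear_sum_assignment(-counts)   # maximize correctly-labeled boxes
    return 1.0 - counts[rows, cols].sum() / len(pred_ids)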


Figure 4.8: Detection rate versus FPPI on the Caltech dataset [27] (top) and the ETHMS dataset [29] (bottom). We compare our approximate 1-pass DP algorithm with the optimal successive shortest path (SSP) algorithm and a HOG-detector baseline. DP performs as well as or even better than the shortest path algorithm, while being orders of magnitude faster. We also show that by suppressing overlapping detections after each track is instanced (DP+NMS), we can further improve performance.


Length of allowable occlusion    % of windows with ID errors
1                                14.69
5                                13.32
10                                9.39

Table 4.1: Evaluating track label error as a function of the length of the allowable occlusion. We show results for our DP algorithm applied to a portion of the ETHMS dataset given ideal detected windows. Our DP algorithm scales linearly with the length of allowable occlusions. By allowing for longer occlusions (common in this dataset), the % of windows with correct track labels significantly increases.

Figure 4.9: Non-max-suppression at the detection-window level.

Figure 4.10: An illustration of applying non-max-suppression at the track level rather than the detection-window level. After instancing a track, we suppress all detections in the space-time cube that overlap the instanced track.

NMS-within-the-loop: In Fig. 4.8, we use the ETHMS dataset to examine the effect of adding an NMS step within our iterative greedy algorithms. When applying [35]’s pedestrian detector, we use their default NMS algorithm as a pre-process to suppress detections that overlap other higher-scoring detections by some threshold. After instancing a track during the DP algorithm, we suppress remaining windows that overlap the instanced tracks using a lower threshold.

Algorithm                                              Detection rate    False positives per frame
[29]’s stereo algorithm                                47                1.5
[149]’s algorithm 1                                    68.3              0.85
[149]’s algorithm 2 with occlusion handling            70.4              0.97
[143]’s two-stage algorithm with occlusion handling    75.2              0.939
Our DP                                                 76.6              0.85
Our DP+NMS                                             79.8              0.85

Table 4.2: Our algorithm performance compared to the previous state-of-the-art on the ETHMS dataset. Please see the text for further discussion.

This suppression is more reliable than the initial one because tracked windows are more likely to be true positives. Fig. 4.9 shows non-max-suppression at the detection-window level, and Fig. 4.10 illustrates one iteration of applying non-max-suppression at the track level rather than the detection-window level. Our results outperform all previously published results on this data (Table 4.2). Fig. 4.6 shows the humans detected in the ETHMS dataset at the input and output of our tracker; it shows that we can reduce the number of false positives by enforcing temporal coherency. Figures 4.1 and 4.5 show the results of tracking on the ETHMS and Caltech datasets. Running time: For the 999-frame ETHMS dataset, MATLAB’s LP solver does not converge and the commercial min-cost-flow solver used in [149] takes 95 seconds, while our MATLAB DP code takes 0.5 seconds.

4.7 Conclusion

We have described a family of efficient, greedy but globally optimal algorithms for solving the problem of multi-object tracking, including estimating the number of objects and their track births and deaths. Our algorithms are based on a novel analysis of a min-cost flow framework for tracking. Our greedy algorithms allow us to embed pre-processing steps such as NMS within our tracking algorithm. Our scalable algorithms also allow us to process large input sequences and model long occlusions, producing state-of-the-art results on benchmark datasets. This provides an important component for extracting high-level features for action recognition: it improves the accuracy of detected appearance features by enforcing temporal coherency and also allows constructing temporal features based on the whole trajectory (e.g., object velocity). In the next chapter, we will introduce our action model based on these high-level features.


Chapter 5 Action Models

5.1 Introduction

Now that we can extract reliable high-level features scalably, we will focus on building action models to recognize actions in video sequences. Traditionally action recognition is approached as a classification problem on pre-segmented clips; this is evidenced by the multitude of action benchmarks consisting of such pre-segmented data. Given a “real world” continuous video stream, one must address the temporal segmentation problem in addition to predicting an action class label. We see this disconnect as analogous to the difference between image classification with centered objects (Caltech 101) and object detection in cluttered images (PASCAL VOC). In this work, we develop extremely-efficient compositional models of actions which report back complete parses of video streams. Our models are efficient in that they scale linearly with the length of a video stream - both in theory and practice, they are just as fast as a “scanning window” action classifier. Our models are compositional in that actions themselves are composed of subactions, together with relational constraints on what orderings and compositions are valid.


Figure 5.1: A sample video parsed to actions and subactions. Subactions are color-coded on the time-line while “white” represents the background term. Given a video and a grammar, our linear-time parser produces a valid sequence of actions and subactions. The underlying grammar is shown in Figures 5.2 and 5.5.

Figure 5.2: The underlying grammar used in Fig. 5.1. S is the sentence symbol; red, green, and blue correspond to different subactions and black corresponds to the background action. This grammar is also shown formally in Fig. 5.5.

Markov models: Past approaches to action recognition have been dominated by Hidden Markov Models (HMMs). Inference is linear in the length of the video, and so such models are convenient to work with. But it is difficult to directly reason about segment-level constraints, as well as long-range temporal interactions between segments. For example, making tea requires heating water on a stove, placing tea leaves in a cup, and pouring hot water into the same cup. Each subaction can vary in duration and (sometimes) temporal ordering. Semi-HMMs (SHMMs) are HMMs that generate variable-length segments, where inference can be quadratic in the length of the video. Such models can be used for segmentation, but can only capture first-order dependences between segment labels. This means they cannot model actions with subactions.


(Figure 5.3 panels, left to right: CFGs, Segmental CFGs, Segmental Regular Grammars.)

Figure 5.3: Context-free grammars (CFGs) use fixed-length terminals that model a single data element - in our case, detected poses in a frame. We show an example parse tree on the left. Regular grammars (RGs) are a restricted type of CFG formed by a “one-sided” grammar. We introduce segmental variants of CFGs and RGs (SCFGs and SRGs), shown in the middle and right, that generate variable-length terminals or segments. SRGs are particularly convenient for action recognition because they admit linear-time, constant-storage parsing algorithms, while still allowing for long-term hierarchical temporal structure.

Grammars: Context-free grammars (CFGs) are rather flexible language models that capture long-range dependencies through production rules; e.g., a sentence is derived from a noun and verb phrase. Though such models are popular in natural language processing, they are somewhat uncommon in vision. We believe this is due to the following reasons: (1) Space and time complexity is high, cubic in the length of the sequence. (2) They are traditionally used to reason about fixed-length tokens (low-level candidate detections), but we wish to reason about variable-length segments. (3) Most parsers (e.g., CYK) are not online, and so cannot provide partial solutions up to the current frame. (4) Parse trees are latent in that only terminal nodes are connected to data. This makes learning production rules difficult and somewhat unintuitive.

Contribution 1: segmental grammars. We propose segmental extensions of grammars for action recognition, which address many of the limitations above. Segmental CFGs are an extension of CFGs with production rules that generate sequences (or segments) of tokens. We show how to augment the CYK parsing algorithm to handle such additional production rules without any increase in complexity.

Segmental Regular Grammars (SRGs) are a special case of segmental CFGs with a restricted set of production rules. SRGs are derived from regular grammars, the simplest level of Chomsky’s hierarchy of languages [21, 59]. We argue that this is a natural grammar for actions because it captures long-range dependencies between actions and subactions, while admitting efficient inference. We define a parsing algorithm that is quadratic in the length of the video and naturally online, in that it computes partial solutions up to the current time $t$. If the maximum segment length is bounded, inference is linear in the length of a video and storage cost is constant. In practice, an SRG is no more expensive than a “scanning-window” action classifier.

Contribution 2: pose-based data model. We develop a simple but novel pose-based descriptor to illustrate our grammars on videos of challenging actions. We describe candidate temporal segments with a “bag-of-poses” descriptor defined on a codebook of poses. We learn different distributions of poses for different actions and subactions. Finally, we show how to learn such models, together with the parameters of our grammars, in a max-margin framework.

Contribution 3: action segmentation dataset. We have constructed a dataset of long temporal streams from YouTube, based on the original clips used in the Olympic Sports Dataset [88]. Our 20-hour video dataset contains continuous video streams with multiple types of actions with subaction structure. We will make this available to the community to spur further research.

5.2 Related Work

Action Recognition: We refer the reader to the recent surveys of [105, 3] for a complete treatment of the large body of related work. We focus on methods most closely related to our approach.


By far, the dominant approach to temporal modeling of actions is the HMM, spurred in part by its successful application to speech recognition. Early approaches date back to finite state machine models [140, 90], while more recent work has explored discriminative variants such as CRFs [139]. [118, 131, 57] explore semi-Markov models for action segmentation; our models are similar in complexity, but allow for compositional models of actions and subactions. [62, 86, 112, 119] use a stochastic CFG for gesture and event recognition, but only process a small set of pre-segmented events or primitive detections to keep computation manageable. Hence their CFGs generate fixed-length terminals. [88] explore temporal compositional action models by defining a relational part model in the time domain. Our work differs in that parts are replaced by segments that vary in temporal extent, and by the fact that our grammars can encode long-range constraints.

Sequential Models: There exists a large body of literature on extensions to HMMs in which states generate variable-length sequences with an arbitrary prior on the length of segments; these are sometimes called variable-length HMMs [38, 107], semi-Markov models [63, 114], or segmental HMMs [45]. Most related to us are hierarchical semi-Markov models and hierarchical HMMs (HHMMs) [40, 133, 90]. HHMMs are recursive HMMs where each hidden state is its own sequential probabilistic model, emitting a sequence of observations rather than a single terminal. [40, 133] both show that such models are $O(N^3)$, the same complexity as a CFG, though [87] describe linear-time inference algorithms obtained by expanding such models into an HMM with an exponentially large state space (exponential in the depth of the hierarchy).

5.3 Segmental Grammars

Grammars, and in particular CFGs, are well-studied topics covered in classic texts on formal languages [59] and natural languages [83]. We review the CYK parsing algorithm for CFGs as it forms the basis for our segmental extensions. We illustrate our segmental grammars in Fig. 5.3 and summarize their relationship with classic models in Table 5.1.

Fixed-length terminals:    HMM O(N)        RG O(N)         CFG O(N^3)
Segmental:                 SHMM O(N^2)*    SRG O(N^2)*     SCFG O(N^3)

Table 5.1: A taxonomy of different temporal models with their inference complexity. Semi-HMMs (SHMMs), also known as Segmental HMMs, generalize HMMs to allow for variable-length terminals or segments. In this work, we generalize regular grammars (RGs) and context-free grammars (CFGs) to allow for variable-length segments (SRGs and SCFGs). We use a * to denote that both SHMMs and SRGs admit linear-time inference if the maximum segment length is bounded.

5.3.1 Context-Free Grammars

A weighted CFG in Chomsky Normal Form (CNF) is specified by:

1. $V$ is a finite set of non-terminal symbols.
2. $\Sigma$ is a set of terminal symbols.
3. $R$ is a set of rules of the form $X \to Y Z$ or $X \to w$ for $X, Y, Z \in V$ and $w \in \Sigma$. Each rule $r \in R$ has an associated score $s(r, i, k, j)$ for instantiating it at boundary points $i, j$ with a transition point of $k$.

A general CFG contains rules of the form $A \to B$, where $A$ is any nonterminal and $B$ is any string of terminals and nonterminals. We write $n_V$ for the number of non-terminal symbols, and $n_R$ for the number of rules. One can show that any CFG can be written in CNF form by adding new rules with “dummy” nonterminals. Given a sentence $w_1, \ldots, w_N$, the CYK algorithm is an $O(n_R N^3)$ dynamic programming algorithm for computing the best-scoring parse [59, 83]. The algorithm computes a table of partial parses of size $O(n_V N^2)$.


CYK explicitly computes the best parse of each possible substring and each possible symbol label for that substring. The key quantity which is iteratively computed is $\pi[X, i, j]$, the score of the best parse of the substring starting at $i$, ending at $j$, and labeled as symbol $X \in V$. Initialize the “bottom” row of the table, which represents the best parse of each one-element-long substring:

$$\pi[X, i, i] = \max_{r \in \{X \to w\}} s(r, i) \quad \text{for} \quad i = 1 \ldots N \qquad (5.1)$$

We can now populate the “second” row of the table, which represents the optimal parses of 2 elements. For an $l$-long substring, we will look at all possible $l - 1$ splits and all possible $n_R$ production rules that could generate this sequence. Each one can be scored by looking at lower entries in the DP table, which have already been computed. We then take the max and update the entry for the current $l$-long sequence. We describe the algorithm below and illustrate it in Fig. 5.4.

1  for l = 2 : N do                    // iterate over substring length
2      for i = 1 : N - l + 1 do        // iterate over substring start
3          j = i + l - 1               // set j to substring end
4          π[X, i, j] = max over r ∈ {X → Y Z}, k ∈ {i, ..., j-1} of  s(r, i, k, j) + π[Y, i, k] + π[Z, k+1, j]
5      end
6  end

5.3.2 Segmental Context-Free Grammars

A CFG in CNF form only allows rules that generate a single terminal at a time. Consider a set of CNF rules augmented with those that allow nonterminals to generate a $k$-long segment of terminals:

$$X \to w_{1:k} \quad \text{where} \quad w_{1:k} = w_1 \ldots w_k \quad \text{for} \quad 0 < k \le N \qquad (5.2)$$

Figure 5.4: The CYK parsing algorithm, illustrated on the $N$-long string “Jeff trains geometry students” and a small grammar (of which a few of the $n_R$ rules are shown, e.g. S → N VP, VP → V N, N → N N, N → students | Jeff | geometry, V → trains). The figure visualizes the triangular dynamic programming table computed by a CYK parser: the parser iterates over substrings of length $l$ and start positions $i$, and for each of the $N^2$ substrings it searches over all rules (at most $n_R$) and split points (at most $N$) which could derive it. This makes the overall algorithm $O(n_R N^3)$. For each of the $N^2$ table entries, one must store the score of the $n_V$ possible symbols, making storage $O(n_V N^2)$.

Each of the above rules has an associated score $s(X \to w_{1:k}, j)$ for placing a $k$-element segment ending at $j$. We call such a CFG a segmental CFG (SCFG). SCFGs can be seen as a type of CFG where the number of rules scales with $N$, complicating inference. We need $N$ additional copies of each original terminal rule $X \to w$, allowing one to generate segments of length 1 to $N$. For each copy, we need an additional $\log N$ rules to convert them into CNF form (by recursively splitting each segmental rule using a dummy nonterminal). This means the number of rules scales as $O(N \log N)$, making the final CYK parser $O(n_R N^4 \log N)$, where $n_R$ refers to the number of rules in the original non-segmental CFG. Other algorithms such as the Earley parser [28] scale even more poorly due to the fact that the number of production rules in an SCFG grows with $N$. We show that, with a simple modification, the original CYK parser can accept rules of the form (5.2), keeping the algorithm at $O(n_R N^3)$ without any increase in complexity. Replace Line 4 of CYK with the following two lines:

max

r∈{X→Y Z} k∈{i...j−1}

s(r, i, k, j) + π[Y, i, k] + π[Z, k + 1, j]

max

r∈{X→w1:(j−i+1) }



 score, s(r, j)

(5.3)

(5.4)

For each table entry, we search over derivations as before in (5.3) (equivalent to Line 4 of CYK), but now we also search over segmental terminal derivations from the lowest layer (5.4). We only need check for (j − i + 1)-long terminal derivations, of which there exist at most nR . This means that one can parse a segmental CFG in O(nR N 3 ).

5.3.3

Regular grammars

For many problems (including action recognition), it is useful to define specialized cases of CFGs which are easier to solve. Chomsky, in his landmark work on the theory of languages [21], describes a special case of context-free-grammars known as finite-state grammars or regular grammars. These grammars are restricted to rules of the form:

X →Yw

(5.5)

X→w

In the next section, we show that such grammars can be parsed in O(nR N ). Such grammars can be written as first-order Markov models - hence Chomsky’s initial nomenclature of finite-state grammars [59]. The advantage of such a representation over a traditional Markov model is that one can encode long-range interactions with a small set of production rules, while a Markov model might need an (exponentially) large state space.

89

5.3.4

Segmental regular grammars

As above, we will augment the above rules to generate arbitrary length sequences of terminals:

X → Y w1:k

(5.6)

X → w1:k

Assume scores can be written as s(X → Y w1:k , j), where j is the end location of a substring. Assume that segments can be at most L tokens long. We show the above segmental regular grammar (SRG) can be parsed in O(nR N L), where nR is the number of rules in the underlying non-segmental grammar. The algorithm is similar to CYK except that we need maintain only the first column of table entries π[X, i, j] for a fixed i = 1. We write this as π[X, j], the score of the best parse from position 1 to j, given that the symbol at j is X. For notational simplicity, let us define an empty symbol {} which allows us to write both rules from (5.6) using a single form. Initialize π[{}, 1] = 0 and π[{}, 1] = −∞ for j > 1: 1

for j = 1 : N do π[X, j] =

2

max

k∈{1...L} r∈{X→Y w1:k }

π[Y, j − k] + s(r, j)

end Algorithm 1: O(N L) SRG Parser. Because the length of a segment L is upper-bounded

3

by N , the parser runs in O(N 2 ) in the worst case.

It will be helpful to think of j as indexing time. The inner loop searches for best derivation of X that transitions from a previous nonterminal Y , at some position j −k (after which the k remaining elements are segmented into X). At each iteration, the algorithm searches over L possible transition points and nR possible rules for transitioning at that point, making

90

overall computation O(nR N L). The algorithm is naturally online; at any point, we can compute the score of the best parse found until now with maxX π[X, j]. Storage for the algorithm is O(nV L) – independent of N – because we only need to store table entries for the past L timesteps. This makes it scalable to very long, or even streaming, data. Note that for L = 1, our SRG reduces to a regular RG, implying that our algorithm also parses RGs in O(nR N ). Parse tree: We can recover the optimal parse tree by storing the argmax pointer from Line 2 of the algorithm. Let R[X, j] point to the id of the selected rule r, which also implicitly encodes the transition offset k. We can represent a parse tree with a collection of (j, r) pairs obtained by backtracking through R, initializing with the highest-scoring symbol at the last frame. We encode a parse-tree as

P = {(jm , rm ) : m = 1 . . . M } where M refers to the amount of distinct segments in the parse, jm is the last frame of the mth segment, rm is the rule id whose right-hand-side instanced that segment. Note that the rule id itself encodes the length of the mth segment.

5.4

Model structure

Given a data sequence D and a candidate parse P = {(jm , rm )}, we can write its score under our weighted SRG grammar as

S(D, P ) =

M X

s(D, jm , rm )

(5.7)

m=1

91

S → z1:k

Segmental regular grammar: S → Sa1:k1 b1:k2 c1:k3 z1:k4 S → Sa1:k1 c1:k2 z1:k3

Equivalent CNF form: S → z1:k S → Az1:k S → Bz1:k I 1 → Sa1:k I 2 → I 1 b1:k A → I 2 c1:k B → I 1 c1:k Figure 5.5: A sample SRG rule set used to generate the actual parse shown in Fig. 5.1, where S corresponds to the start symbol, a, b, c correspond to a “yank”, “pause”, and “press” subaction, and z is the background. In the equivalent CNF form, A and B correspond to a completed “clean-and-jerk” and “snatch” action respectively, and I 1 and I 2 correspond to incomplete actions. where we have explicitly denoted the dependence of each segment’s score s(jm , rm ) on data D. To further describe our particular data model, let us define xm to be the terminal on the right-hand-side of rule rm , and km to be the length of that terminal segment sequence.

s(D, jm , rm ) = αxm · φ(D, jm − km + 1, jm ) + βrm · ψ(km ) + γrm

(5.8)

To describe the parameters of our model, we will use an illustrative example of a grammar defined for weightlifting actions, shown in Fig. 5.1 and defined in Fig. 5.5. Terminal symbols correspond to subactions (yank,pause,and press), while abstract symbols correspond to actions (clean-and-jerk and snatch). A “clean-and-jerk” is defined by a yank, pause, and press subaction, while a “snatch” is defined by a yank and press. Data model: The first term is a data term that extracts features from a segment, and scores them with a model tuned for terminal xm . We compute a bag-of-poses descriptor φ() for each segment, described in detail below. We can interpret αxm as a model tuned for particular subactions; for example, a xm =“yank” model will be tuned to respond to a particular distribution of poses. Note that α is symbol-specific xm rather than rule-specific rm ; this allows different rules to share data models for equivalent symbols. For example, rules capturing a “clean-and-jerk” and “snatch” action can share the same subaction data models. 92

Figure 5.6: Our codebook of poses, which we use to define a “bag-of-poses” data model. We show one level of our hierarchical vocabulary, as described further in the text. Temporal model: The second term is analogous to a temporal “prior” that favors particular segment lengths km for a sub-action. Our data is rule-specific rm and not symbol-specific xm . This allows a single “yank” sub-action to have different priors for its length depending   2 on the action in which it was instanced. Specifically, we define ψ(km ) = km km . This means that the parameters βrm can be interpreted as the rest position and rigidity of a spring that defines the ideal length of a segment. In our experimental results, we show that such temporal terms are vital to ensuring good segmentation accuracy. Furthermore, such constraints are difficult to encode by a standard HMM. Rule model: The last term γrm is a scalar “prior” or bias that favors the use of certain rules over others. This may encode the fact that “clean-and-jerk” actions are more likely to precede “snatch” actions, for example. Codebook of poses: To define the descriptor φ(), we use a bag-of-poses model that captures high-level pose features in a “soft manner”, as current pose estimation algorithms are still relatively noisy. We first build a codebook of poses by running an articulated pose estimation algorithm [145] on our training video clips. We select a fixed detector threshold such that 3-5 poses are typically found in a frame, some of which may be false positives. We represent each pose with a vector of scale-normalized joint-positions, and cluster these vectors to build a vocabulary of K typical poses (Fig. 5.6). Given a test video, we run the same pose estimation algorithm with identical settings, and quantize detections into the

93

closest-matching pose codeword. Given a sequence of frames in a video, we define φ() to be a K-element histogram of poses found in that sequence. Unlike traditional bag models that use low-level features, our codewords are high-level, semantically-meaningful poses. Our approach is similar to [55, 9], whom define a bag of pose-specific templates, but our model is more generalizable in that we use codebooks of pose configurations rather than pixel-appearances. Extensions: We make two extensions that produce small but noticeable improvements in performance. (1) We define a multi-resolution codebook by building a hierarchy of codebooks for different K (we use K = 100 and K = 400), concatenating together their descriptors. This is analogous to a pyramid-match kernel [50], which implicitly encodes an approximate correspondence between a model and an observation in feature space (in our case, pose-configuration space). (2) To evaluate such a descriptor, we use a chi-squared kernel. To ensure that it can be scored with a linear model, we use the linear embedding feature map of [137].

5.5

Learning

Our model from (5.8) is linear in the model parameters w = {α, β, γ}, allowing us to write the score of a parse as a linear function S(D, P ) = w · Φ(D, P ). Fully-supervised learning: Assume we are given training data of videos with ground-truth parses {Dn , PN } and a manually-specified set of production rules R. We wish to learn weights w for the data model α, temporal model β, and rule compositions γ. The weights should score correct parses highly and incorrect parses poorly; we do this by defining a

94

structured prediction learning function

arg min

w,ξn ≥0

s.t. ∀n, ∀Hn

X 1 w·w+C ξn 2 n

(5.9)

w · Φ(Dn , Pn ) − w · Φ(Dn , Hn ) ≥ loss(Pn , Hn ) − ξn

The above linear constraints state that, for each training video Dn , the true parse Pn should outscore a hypothesized parse Hn by some amount given by loss(Pn , Hn ). We use a Hamming loss, which simply counts up the number of frames with incorrect symbol labels. Thus parses that are quite different from the true parse Pn should score much worse. Finally, we allow these constraints to be loosely satisfied using slack variables. The above is equivalent to a structural SVM, for which many well-tooled solvers exist [65]. The main computational step in learning requires solving a “loss-augmented” inference problem, where for each training example, one finds the worst-offending parse

Hn∗ = max w · Φ(Dn , H) + loss(Pn , H) H

The Hamming loss can be absorbed into the data term of (5.8), implying that our efficient SRG parser can be used to find such an offensive parse. Latent learning of sub-actions: Specifying a full desired parse tree can be expensive or ambiguous. In our scenario, we are given video streams with action labels, but do not know what sub-actions should be labeled. In this setting, we can employ a latent structural SVM [147, 37] that iterates between (1) Given a model w, predicting a full parse (with subaction labels) for the entire training set and (2) Given a set of full parses, learning a model w with a structural SVM. Step (2) is equivalent to the supervised case above. Step (1) corresponds to a “loss-augmented” inference problem, where parses inconsistent with the given action

95

labels are given an infinite loss. This again can be implemented by tweaking the local data cost from (5.8) to only allow for latent sub-actions with valid action labels.

5.6

Experiments

Continuous Olympics Dataset: Most action video datasets are temporally pre-segmented, and so are not appropriate for action-parsing. Because of difficulties associated with capture, many are collected in highly controlled conditions which limits their applicability. The Olympics dataset originally introduced in [88] contains videos of 16 different sport actions collected from YouTube. The videos are mostly amateur and realistic, but are presegmented into distinct actions. Inspired by this dataset, we collected continuous videos of the same list of 16 actions from YouTube. Our Continuous Olympics dataset contains almost 20 hours or 2 million frames of such realistic video footage, almost 10 times larger than the Olympics dataset. Each video on average contains 8 instances of an action. We annotated this dataset in terms of the start and end frame numbers for each action. Hence, our dataset can be used in action detection or classification tasks. Fig. 5.7 illustrates a sample frame of each action. We use half of the dataset for training and the other half for testing. Qualitative results: Fig. 5.8 shows the result of running our SRG parser on a video of multiple “javelin throw” actions. Our algorithm does a reasonable job of parsing actions and (latently-learned) sub-actions. In training, we estimate the transition points between subactions latently, and at the test time, our parsing algorithm estimates all transition points. Our subaction model naturally captures somewhat meaningful subactions like “running”, “releasing”, and “throwing”. Fig. 5.9 shows multiple frames of a continuous video of three weightlifting actions. The annotation and also detection results are shown using color bars. Note that in some frames (the second row), the model mistakenly detects the person’s head 96

Figure 5.7: Our Continuous Olympics action dataset contains video streams of 16 different actions collected from YouTube. There is large variety in view point, scale, duration, and the number of action instances in each video. Actions include: “basketball layup”, “bowling”, “clean and jerk”, “discus throw”, “diving platform”, “diving springboard”, “hammer throw”, “high jump”, “javelin throw”, “long jump”, “pole vault”, “shot put”, “snatch”, “tennis serve”, “triple jump”, and “vault”. on the lift. This happens since the lift has a circular shape similar to a head. This results in wrong features. However, this type of mistake has a strong bias so the temporal model may be able to fix them. Basically, it can learn that in the weight-lifting action, it is possible to see the head in a different location on the lift. Quantitative evaluation: Given a candidate parsing of a sequence, we evaluate it in two ways. Per frame: We count the number of frames with matching candidates and ground truth labels, similar to the loss function in our structured learning framework. Action detection: Similar to PASCAL evaluation of object detection, we think of our parser as returning candidate action-sequence “detections” and evaluate the overlap between candi97

Figure 5.8: A sample video containing ‘javelin throw” actions. The bottom timeline shows ground truth actions in gray. The above one shows detected subactions in different colors, where black correspond to the background (no action) label. Our grammar requires three subactions to appear in sequence. Our subaction model, learned latently, captures meaningful structure such as “running” (red), “releasing” (blue), and “throwing” (green). The last two subactions are quite short and not always visible in the figure. The complete video is available online here: http://www.youtube.com/watch?v=ORQcuqW3gtQ dates and ground-truth segments. A detection is considered a true positive if the overlap (union/intersection) is more than a predefined threshold. We plot a precision-recall curve and use the average precision as an evaluation metric. Baselines: We use a three different baselines. We build a SVM classifier that predicts the best label as well as its confidence value for a given temporal window. Baseline1: To predict the label for the i’th frame, we apply the classifier to all possible temporal windows centered on the i’th frame, and choose the most confident resulting label. Baseline2: Similar to object detection, we run the classifier on all possible temporal windows (sliding window) and collect a pool of results. Then, we choose a set of non-overlapping windows by running a greedy non-maxima suppression procedure. Semi-Markov model: This is similar to a “one-level” grammar that reasons about action segments and their transitions, but no sub-actions. It outperforms our other baselines, but under-performs our final hierarchical grammar of actions and sub-actions. All baselines take O(N L) time to process a video which is similar to our grammar parsing algorithm. We also evaluated a HMM baseline, but we found the single-frame data model to be so weak that the results were vastly inferior to either of the above baselines. 98

Figure 5.9: Sample frames of a continuous video of three “weight-lifting” actions. The Bottom red time bar shows the frame number, the middle bar shows the ground truth (black is for the background and gray is for the action.), and the colored time bar shows the result of our model (red, green, and blue show three different sub-actions of the “weight-lifting” action learned latently.) Note that in some frames (the second row), the model mistakenly detects the person’s head on the lift. This happens since the lift has a circular shape similar to a head. This results in wrong features. However, this type of mistakes has a strong bias so the temporal model may be able to fix them by learning the bias. Basically, it can learn that in the weight-lifting action, it is likely to see the head in a different location (on the lift). The complete video is available online here: http://www.youtube.com/watch?v=1FR7U7r4G58 We summarize results in Table 5.2. Our model strongly outperforms all baselines. Our bag-of-poses data model outperforms spatio-temporal interest points. We see a dramatic 99

Algorithm                          | AP (10% overlap) | AP (30% overlap) | AP (50% overlap) | Per-frame precision | Timing (ms/frame)
Baseline1                          | --               | --               | --               | --                  | 0.781
Baseline2                          | 15.1%            | 11.3%            | 7.7%             | 56.3%               | 0.38
Semi-Markov model                  | 35.6%            | 26.9%            | 11.3%            | 59.1%               | 0.55
Our SRG parser                     | 42.1%            | 27.7%            | 19.4%            | 63.0%               | 0.77
Our SRG parser (STIP features)     | 36.2%            | 17.8%            | 7.81%            | 56.8%               | 0.63
Our SRG parser (no length prior)   | 21.0%            | 11.8%            | 0.2%             | 53.4%               | 0.77

Table 5.2: We use two different evaluations and compare our model to three different baselines. We report the results for different overlap thresholds. Our SRG parser outperforms all baselines. It also outperforms variants without a segment length prior and with STIP features instead of our pose codebook. We display timing results (ignoring the local feature computation). For reference, we also implemented a segmental-CYK parser, which produced identical results to our SRG parser, but is 1000X slower (2.3 secs/frame).

We see a dramatic decline when the temporal spring model is removed; the importance of such a term is corroborated by the poor performance of an HMM. Fig. 5.10 compares our model and the baselines in terms of precision-recall curves. We plot curves for two different overlap thresholds.

5.7 Conclusion

In this chapter, we introduced modeling actions using high-level features. We described segmental extensions of grammars for action parsing, introducing an extremely efficient, linear-time, online parsing algorithm as our primary contribution. We also described a simple but effective data model based on bags-of-poses. To illustrate our approach, we introduced a novel dataset of continuous actions and demonstrated encouraging performance over standard baseline approaches.


Figure 5.10: Precision-recall curves for two baselines and our SRG parser, shown at the 10% and 30% overlap thresholds. In all cases our model outperforms the baselines.


Chapter 6 An Application: Detecting Activities of Daily Living in First-person Camera Views

6.1 Introduction

Action recognition in video has a variety of applications. In this chapter, we focus on an application that is not well addressed in the computer vision community. We argue that detecting activities of daily living in wearable camera views [100] is a challenging problem with many realistic applications, and it can be considered a good benchmark for action recognition algorithms.

Figure 6.1: Activities of daily living (ADL) captured by a wearable camera.

Activity recognition is a classic task in computer vision, but it is relatively less well-defined compared to neighboring problems such as object recognition, for which large-scale, established benchmarks exist [30, 26]. We believe this is so for the following reasons: (1) It is difficult to define canonical categories of everyday behavior outside particular domains such as surveillance and sports analysis. (2) It is difficult to collect large-scale footage with rich intra-class variation. For example, unscripted surveillance footage tends to be repetitive, often dominated by scenes of people walking. Traditionally, the above limitations have been addressed by using actor-scripted video footage of posture-defined action categories such as "skipping" or "jumping" [115, 41]. Such categories may be artificial because they tend not to be functionally defined, a core aspect of human movement [8].

We focus on the problem of detecting activities of daily living (ADL) from first-person wearable cameras. This formulation addresses many of the limitations described above, in that we use a natural list of daily activities developed from the medical literature on rehabilitation. These activities are chosen so as to capture the representative movements a person must undergo to perform everyday functions, such as eating and maintaining personal hygiene. Wearable cameras also provide a practical advantage of ease of capture; we have amassed a diverse, 1 million-frame dataset of people performing natural, everyday activities in diverse home environments. We argue that ease of data collection is one important benefit of wearable cameras.

Application 1 (Tele-rehabilitation): The medical literature on nursing and motor rehabilitation [68, 18] describes a variety of clinical benchmarks used to evaluate everyday functional activities such as picking up a telephone, drinking from a mug, and turning on a light switch. We develop a taxonomy of everyday actions based on such medical evaluations (Fig. 6.10). These evaluations are currently done in the hospital, but a computer-vision system capable of analyzing such activities would revolutionize the rehabilitative process, allowing for long-term, at-home monitoring.

Application 2 (Life-logging): A growing trend in ubiquitous computing is that of continual logging of visual personal histories [46, 58]. Initial work has shown promise for memory enhancement for patients with memory loss [58]. However, as there has been limited algorithm development for processing and managing such massive records of daily activity, these systems currently suffer from behaving mostly as "write-only" memories. We believe the time is right for the vision community to consider such large-scale, "in the wild" activity recognition problems.

Novel representations: ADLs differ from typical actions in that they can involve long-scale temporal structure (making tea can take a few minutes) and complex object interactions (a fridge looks different when its door is open). We develop novel representations including (1) temporal pyramids, which generalize the well-known spatial pyramid to approximate temporal correspondence when scoring a model, and (2) composite object models that exploit the fact that objects look different when being interacted with.


Dataset: We introduce a fully-annotated dataset suitable for "egocentric" ADL recognition. Our dataset consists of 1 million frames of 10 hours of video, amassed from 20 people performing unscripted ADLs in 20 different homes. It has been annotated with activity labels, bounding-box tracks of all objects in view, and annotations of which objects are being interacted with. With respect to existing egocentric datasets, our dataset is notable for its size and its diversity of natural scenes. With respect to existing action datasets, it is notable for its content of unscripted, everyday activities collected in continuous (non-segmented) video streams. We use our dataset to perform a thorough investigation of state-of-the-art algorithms in both action and object recognition.

6.2 Related work

There has been a fair amount of work on everyday activity recognition in the ubiquitous computing community [99, 129, 92], much of it addressed from a "life-logging" perspective [12, 95]. Most approaches have ignored visual cues, and instead focused on alternate sensors such as RFID tags or accelerometers. This requires a fairly involved effort for instrumenting both the observer and the "outside world". One may argue that it is more practical to instrument only the observer; for example, wearable cameras are easy to deploy, innocuous, and increasingly common [58].

There is a rich history of activity recognition in the vision community; we refer the reader to the recent surveys of [41, 135] for a detailed summary. Classic datasets tend to consist of scripted actions [74, 148, 89], though recent work has looked at actions in televised and film footage [84]. Related work has also begun looking at the problem of everyday, at-home activity recognition. Though there exists a large body of work on recognizing actions from wearable cameras, most approaches demonstrate results in a single scene, such as a kitchen or office [126, 32, 125, 54], possibly outfitted with actors in mocap suits [122]. Such a setting may allow one to assume a priori knowledge of the particular objects in the scene, which allows for instance-level visual recognition [56] or RFID-tagged objects [129, 142]. We focus on recognition in widely-varying, un-instrumented, "in-the-wild" environments.

Low-level features such as motion [33, 54, 67, 109] and skin-based color models [85] likely play a large role in analyzing wearable camera footage. We experimented with such cues, but found static image cues (such as image-based object detectors) to be more stable, perhaps due to the unconstrained nature of our footage. Other researchers have examined unsupervised discovery of objects [66, 33] and actions [67] from wearable footage. We work with a list of semantically-driven actions and objects, as derived from the medical literature on ADL. Our temporal pyramid representation is inspired by a large body of work on multi-resolution models of video [60, 148]. Our model can be seen as a special case of a spatio-temporal pyramid [19]. However, we use interaction-based object models to determine spatial support rather than a spatial pyramid. Our model is also similar to the temporally-binned model of [73], but we use a weighted, multiscale pyramid to approximate a coarse-to-fine temporal correspondence. Our interaction-based object models are inspired by studies of human vision [71] and are related to visual phrases [113], which capture visual composites of humans and objects in interactions. Our performance gains stem from the ability to capture large changes in object appearance (an open versus closed fridge) as well as the inclusion of a human in the composite model.

6.3 Temporal pyramids

In this section, we develop several simple but novel models of daily activities based on object-centric representations. We write T for the set of frames to be analyzed using K object models. We use these models to compute a score for object i at a particular pixel location and scale p = (x, y, s) in frame t:

$$\mathrm{score}_i^t(p) \in [0, 1] \qquad (6.1)$$

We use the object models of [37], which are not calibrated to return scores in [0, 1]. One may calibrate the models to return probabilities [104]; instead, we simply divide the raw score of each object model by its maximum value found over the whole dataset. We then record the maximum value of each object model i in each frame t:

$$f_i^t = \max_p \; \mathrm{score}_i^t(p) \qquad (6.2)$$

"Bag of features" is a naive way of aggregating these features by averaging them over time. Fig. 6.2 illustrates recognizing actions using an object-centric bag model.

$$x_i^0 = \frac{1}{|T|} \sum_{t \in T} f_i^t \qquad (6.3)$$

The above representation ignores any temporal structure; we may want to encode, for example, that “making tea” requires first boiling water and then (minutes later) pouring it into a cup. Such long-scale temporal structure is difficult to encode using standard hidden Markov models. We develop a flexible model based on the spatial pyramid match kernel [76]. We represent features in a temporal pyramid, where the top level j = 0 is a histogram over the full temporal extent of a video clip (as in (6.3)), the next level is the concatenation of two histograms obtained by temporally segmenting the video into two halves, and so on.


Figure 6.2: Recognizing actions using an object-centric bag model. The bag model is constructed by detecting objects in a given video and is represented as a histogram in which each bin corresponds to one type of object. The resulting histogram is given to an SVM for classification.

We obtain a coarse-to-fine representation by concatenating all such histograms together:

$$x_i^{j,k} = \frac{2^{j-1}}{|T|} \sum_{t \in T^{j,k}} f_i^t, \qquad \forall k \in \{1 \ldots 2^j\} \qquad (6.4)$$

where T^{j,k} is the temporal extent of the k'th segment on the j'th level of the pyramid and x_i^{j,k} is the feature for the i'th object detector on that segment. The scale factors define an implicit correspondence based on the finest temporal resolution at which a model feature matches a video feature [50]. We use j ∈ {0, 1} levels. This allows us to encode long-scale temporal structure in a "soft" manner; one must touch a kettle at the beginning of a making-tea action, but the precise time may vary a bit. Fig. 6.3 illustrates the temporal pyramid model for action recognition. Both the temporal pyramid and the grammar-based models in Chapter 5 are capable of handling long-term temporal dependencies. The temporal pyramid can be seen as an approximate grammar that computes an approximate correspondence between a sequence and a model. However, the grammar model in Chapter 5 computes an explicit correspondence by finding the optimum sub-action segmentation.

Figure 6.3: An illustration of the temporal pyramid model for action recognition.

We use our models for activity recognition by learning linear SVM classifiers on features $x = \min\left(\left[x_1^0 \;\ldots\; x_i^{j,k} \;\ldots\; x_K^{L,2^L}\right]^T,\; 0.01\right)$ with the public SVM implementation of [31]. We found an elementwise minimum was useful to approximately "binarize" x, so that it softly encodes the presence or absence of object i (inspired by the clipping post-processing step in SIFT [80]). We experimented with various histogram kernels [137], but found a simple linear kernel defined on an L1-normalized feature to work well.
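To make the construction concrete, the following sketch builds the pyramid feature of (6.3) and (6.4) from per-frame detector scores that have already been max-pooled over locations as in (6.2). It is a minimal illustration under the assumptions stated in the comments, not our exact implementation:

```python
import numpy as np

def temporal_pyramid(frame_scores, levels=2, clip=0.01):
    """frame_scores: array of shape (num_frames, K) holding f_i^t, the per-frame
    maximum score of each of the K object detectors (Eq. 6.2).  Returns the
    concatenated, clipped, L1-normalized pyramid feature fed to a linear SVM."""
    T, K = frame_scores.shape
    feats = []
    for j in range(levels):
        # Eq. (6.3) for the top level (j = 0); Eq. (6.4) scales level j by 2^{j-1}/|T|.
        scale = (1.0 if j == 0 else 2.0 ** (j - 1)) / T
        for k in range(2 ** j):
            lo = (k * T) // (2 ** j)
            hi = ((k + 1) * T) // (2 ** j)
            feats.append(scale * frame_scores[lo:hi].sum(axis=0))
    x = np.concatenate(feats)
    x = np.minimum(x, clip)                 # elementwise min softly "binarizes" x
    return x / (np.abs(x).sum() + 1e-12)    # L1 normalization
```

With j ∈ {0, 1}, the resulting feature has 3K dimensions: one histogram over the whole clip plus one for each of its two halves.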


Figure 6.4: Our dataset (top row) contains images of objects under different semantic states-of-use (e.g., a microwave with open or closed door). These semantic states are typically not captured in web-based photo collections (bottom row). Our active/passive object models exploit such visual cues to determine which objects are being interacted with.

6.4 Active object models

Recognizing objects undergoing hand manipulations is a crucial aspect of wearable ADL recognition (see Fig. 6.4) [71]. Following recent work on human-object interaction models, one approach may be to detect objects and human body parts (hands) in frames, and then reason about their spatial relationship. However, this ignores the fact that objects may significantly change in appearance during interactions - an open fridge looks very different from a closed fridge.


Figure 6.5: We visualize our passive (a) and active (b) stove models.

Active models: Instead, we learn separate object detectors tuned for "active" objects being interacted with. Our approach is inspired by the recent "visual phrase" work of [113], which advocates detection of human-object composites rather than detectors built for each constituent object. In our case, we do not increase the spatial support of our detectors to explicitly include the human hand, but define an active object to be a visually disparate sub-category. We do this by training an additional object detector [37] using the subset of active training images for a particular object. We show an example for a "stove" object in Fig. 6.5. Fig. 6.6 illustrates the bag model constructed using active and passive object detections. In the histogram representation, there are separate bins for active and passive instances of objects.


Figure 6.6: Bag model for action recognition constructed using active and passive object detections. In the histogram representation, there are separate bins for active and passive instances of objects.

Figure 6.7: To visualize the average location of active vs. passive objects in our ADL dataset (shown for "pan", "tv", "mug/cup", and "dish"), we make a rectangular mask for each bounding box and average them all for passive (on left) and active (on right) instances of annotated objects. Active images tend to have larger bounding boxes at the center of the image, indicating that active objects tend to occur at large scales near the center of the field of view.

Figure 6.8: We show the average passive (left) and active (right) TV remote in our ADL dataset. The left image is blurred due to the variation in viewing angle in our data, while the right image is more structured due to less pose variation for remotes-in-hand. The right image also contains more red-hued skin pixels. We use such cues to build active object models.

Spatial reasoning: While many objects can appear in the field of view, active objects tend to be at a consistent scale and location convenient for hand manipulation. We analyze the spatial bias of passive versus active objects in Fig. 6.7. To exploit such cues, we augment our active object models to include the position and scale as additional features when detecting active objects:

$$\mathrm{score}_i^t(p) = w \cdot \begin{bmatrix} \mathrm{score}(p) & x & y & s & x^2 & y^2 & s^2 \end{bmatrix}^T$$

Because we use linearly-parameterized templates as object detectors [37], we simply add the above spatial features to the local appearance features when learning active object models. We found this produced a small but noticeable improvement in our final results. Skin detectors: Because active objects are manipulated by the hand, they tend to occur near skin pixels (as shown in Fig. 6.8). We experimented with adding a skin detector feature to the above linear model, but failed to see consistent improvements. We hypothesize this was due to large variations in illumination in our dataset. We augment the temporal pyramid feature from (6.4) to include K additional features corresponding to active objects, as defined in this section. We refer to this model as “AO”, for our object-centric model augmented with active objects. We refer to the original feature from (6.4) as “O”, for our object-centric model.
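As a rough sketch of the linear spatial rescoring introduced above (illustrative variable names only; the weights would be learned jointly with the appearance template in the SVM):

```python
import numpy as np

def spatial_feature(appearance_score, x, y, s):
    """Feature for the active-object rescoring: the raw template score plus
    first- and second-order position/scale terms, as in the equation above."""
    return np.array([appearance_score, x, y, s, x ** 2, y ** 2, s ** 2])

def active_object_score(w, appearance_score, x, y, s):
    """Linear score w . [score(p), x, y, s, x^2, y^2, s^2]."""
    return float(np.dot(w, spatial_feature(appearance_score, x, y, s)))
```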


Action name              | Mean length (secs) | Std. dev. of length
combing hair             | 26.50              | 9.00
make up                  | 108.00             | 85.44
brushing teeth           | 128.86             | 45.50
dental floss             | 92.00              | 23.58
washing hands/face       | 76.00              | 36.33
drying hands/face        | 26.67              | 13.06
laundry                  | 215.50             | 142.81
washing dishes           | 159.60             | 154.39
moving dishes            | 143.00             | 159.81
making tea               | 143.00             | 71.81
making coffee            | 85.33              | 54.45
drinking water/bottle    | 70.50              | 30.74
drinking water/tap       | 8.00               | 5.66
making cold food/snack   | 117.20             | 96.63
vacuuming                | 77.00              | 60.81
watching tv              | 189.60             | 98.74
using computer           | 105.60             | 32.94
using cell               | 18.67              | 9.45

Table 6.1: This table shows the statistics for the duration of each action. Some actions, like "using cell", are shorter in time than other actions, like "washing dishes". Many actions exhibit a rather large variability in duration, making action detection in continuous data difficult.

6.5 Dataset

In the subsequent description, we refer to our dataset as the ADL dataset.

6.5.1 Collection and size

To collect our dataset, we used a GoPro camera designed for wearable capture of athletes during sporting events. We found a chest mount easier than a helmet mount, both in terms of quality of data and ease of capture. The camera captures high-definition video (1280x960) at 30 frames per second with a 170-degree viewing angle. A large viewing angle is important in capturing this type of footage to reduce object and hand truncation.

We put together a list of 18 actions of daily activities and asked 20 people to perform them all in their own apartments. In order to collect realistic and varied data, we did not give users a detailed description of the actions, and instead gave them the list of actions in Table 6.1. Each capture session was roughly 30 minutes of unscripted morning activity. Our camera was equipped with sufficient battery and memory to allow for continuous capture. We collected more than 10 hours of first-person video footage, with more than a million frames. The total collection process took one month of acquiring subjects and arranging capture sessions.

6.5.2 Annotation

We annotated every second (30 frames) with dense annotations of the form shown in Fig. 6.1. We did so by assembling a team of 10 part-time annotators, working over a month-long span. The final dataset is annotated in terms of:

Action label: Our ADL dataset is temporally annotated in terms of 18 different actions. Table 6.1 shows the list of actions and the statistics of their durations.

Object bounding boxes: Our ADL dataset is annotated in terms of 42 different objects, some of which are listed in Table 6.2. We asked the annotators to draw a tight bounding box around each known object and then track it, adjusting the bounding box every 30 frames.

Object identity: Our dataset is annotated with individual tracks of objects. We do not use such annotations for our current analysis, but they may be useful for evaluating tracking algorithms. Note that there is a large amount of camera motion in this footage, so it can be considered a good benchmark for object detection and tracking algorithms.


Figure 6.9: We show different kitchen scenes in our dataset. Unlike many other manually constructed action datasets, ours exhibits a large variety of scenes and objects.

Human-object interaction: We denote objects that are being interacted with as "active". We set a binary attribute flag for active objects in our annotation interface. We show that this knowledge is very helpful for action recognition.


6.5.3 Characteristics

In this section, we point out various distinguishing characteristics of our dataset. We refer to the figure captions for detailed descriptions, but we summarize the main points here. Our dataset contains large variability in scenes (Fig. 6.9) and in object viewpoint and occlusion level (Fig. 6.4). In Fig. 6.7 and Fig. 6.8, we illustrate various biases (such as image location and skin color) which can be exploited to visually identify interacting objects.

Functional taxonomy: Many ADL actions are quite related to each other. We construct a functional taxonomy based on a bottom-up grouping of actions; at a high level, all ADL actions can be grouped into those based on personal hygiene, food, and entertainment. Fig. 6.10 shows the functional taxonomy we manually constructed. We can use this taxonomy in evaluating actions, meaning that we penalize less for mistakes between actions with similar functionalities. Following [51], we define the misclassification cost of two classes as the total distance to their closest common ancestor, divided by two. Fig. 6.11 illustrates this cost for all possible mistakes, with brighter color indicating larger cost. We find the interesting phenomenon that functionality correlates strongly with scene context; one both brushes their teeth and flosses in a bathroom. This suggests that approaches that rely on scene context might fare well under such a functional score.
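A small sketch of this taxonomy-derived cost, assuming the taxonomy is given as a child-to-parent map (names are illustrative):

```python
def taxonomy_cost(a, b, parent):
    """Cost of mistaking class `a` for class `b`: the total distance to their
    closest common ancestor in the taxonomy, divided by two."""
    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    path_a, path_b = path_to_root(a), path_to_root(b)
    common = next(n for n in path_a if n in path_b)
    return (path_a.index(common) + path_b.index(common)) / 2.0
```

For identical classes the cost is 0, for sibling classes it is 1, and it grows as the two classes only meet higher up in the taxonomy.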

6.6 Experimental results

We implemented and evaluated our object-centric action models on our ADL dataset.

Evaluation: We use leave-one-out cross-validation, where we ensure that footage of the same person does not appear in both training and test data. We use average precision to evaluate object detection accuracy (following the standard convention [30]). We use class confusion matrices to evaluate action classification, scoring both classification error and the taxonomy-derived loss shown in Fig. 6.11. We compute an overall classification rate by averaging the diagonal of this matrix, weighting all classes equally. Because we have 18 classes, chance performance corresponds to almost 5%.

Figure 6.10: Our manually-designed functional ADL taxonomy.

Co-occurring actions: Some actions can co-occur in our dataset. In many cases, it may be more natural to think of the shorter action as interrupting the longer one; for example, "watching TV" while waiting for water to boil while "making tea". Our annotations include such co-occurrences. For simplicity in our current evaluation, we assign only one label to each test frame, taking the shorter interrupting action when there is overlap.
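A minimal sketch of the two classification scores described under Evaluation above, computed from a class confusion matrix (illustrative only; the cost matrix could be built with the taxonomy_cost sketch from the previous section):

```python
import numpy as np

def classification_rate(confusion):
    """confusion[i, j] = number of test samples of true class i predicted as j.
    Rows are normalized so that every class is weighted equally, and the
    diagonal (per-class accuracy) is averaged."""
    rows = confusion / confusion.sum(axis=1, keepdims=True)
    return float(np.mean(np.diag(rows)))

def mean_taxonomy_loss(confusion, cost):
    """cost[i, j] = taxonomy-derived cost of predicting class j for true class i."""
    rows = confusion / confusion.sum(axis=1, keepdims=True)
    return float(np.mean((rows * cost).sum(axis=1)))
```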


Figure 6.11: The taxonomy-based cost of mistaking one class for another is the (average) distance to their closest common ancestor in Fig. 6.10. The rows and columns range over the 18 action classes plus a "doing nothing" label. Dark values correspond to a low cost; we pay less for mistaking "brushing teeth" for "flossing" than for "making tea".

Training: In training visual object detectors, we used the off-the-shelf part-based model for object detection of [37]. We use training data for 24 object categories, with roughly 1200 training instances (with bounding-box labels) per category. In Table 6.2, we compare results using different training datasets. We show that web-based collections (such as ImageNet) tend to contain iconic viewpoints not present in our ADL dataset (Fig. 6.12). Additionally, wearable video contains images of objects under different states-of-use (an open microwave or fridge, as in Fig. 6.4), which are also usually absent in online collections. When trained on data extracted from natural ADL footage, object detectors perform considerably better; for example, a faucet tap detector trained from ImageNet performs at 0.1% average precision, while a faucet tap model trained from ADL data performs at 40% average precision. This suggests that, for our application, it is crucial to train on data with a large variety of viewpoints and scales. We also find that there are certain objects in our labeled dataset which current detection systems cannot model; they yield zero percent performance. We think this is due to their small resolution and large geometric variation.

Figure 6.12: Toothbrushes in our ADL dataset (left) look different than those found in web-based collections such as ImageNet (right). The latter typically contains large, iconic views of objects, often displayed in isolation. Object detectors trained with the latter may not work well for our application, as we show in Table 6.2.

Object        | ADL          | ImageNet
tap           | 40.4 ± 24.3  | 0.1
soap liquid   | 32.5 ± 28.8  | 2.5
fridge        | 19.9 ± 12.6  | 0.4
microwave     | 43.1 ± 14.1  | 20.2
oven/stove    | 38.7 ± 22.3  | 0.1
bottle        | 21.0 ± 27.0  | 9.8
kettle        | 21.6 ± 24.2  | 0.1
mug/cup       | 23.5 ± 14.8  | 14.8
washer/dryer  | 47.6 ± 15.7  | 1.8
tv            | 69.0 ± 21.7  | 26.9

Table 6.2: Average precision results for part-based object detectors evaluated on our ADL dataset. We compare models trained on our ADL dataset versus ImageNet. Since the ADL-trained models are trained and evaluated across cross-validation splits, we report both the mean and standard deviation of average precision. The deviation is large because we have relatively few object instances in our dataset (people own a single tea kettle). Detectors trained on ImageNet perform poorly on our data because they fail to capture the large number of viewing angles and occlusion states present in our wearable data.


Pre-segmented clips:
Model  | Segment class. accuracy (pyramid) | Segment class. accuracy (bag) | Taxonomy loss (pyramid) | Taxonomy loss (bag)
STIP   | 22.8 | 16.5 | 1.8792 | 2.1092
O      | 32.7 | 24.7 | 1.4017 | 1.7129
AO     | 40.6 | 36.0 | 1.2501 | 1.4256
IO     | 55.8 | 49.3 | 0.9267 | 0.9947
IA+IO  | 77.0 | 76.8 | 0.4664 | 0.4851

Continuous video (sliding window):
Model  | Frame class. accuracy (pyramid) | Frame class. accuracy (bag) | Taxonomy loss (pyramid) | Taxonomy loss (bag)
STIP   | 15.6 | 12.9 | 2.1957 | 2.1997
O      | 23.8 | 17.4 | 1.5975 | 1.8123
AO     | 28.8 | 23.9 | 1.5057 | 1.6515
IO     | 43.5 | 36.6 | 1.1047 | 1.2859
IA+IO  | 60.7 | 53.7 | 0.79532 | 0.9551

Table 6.3: Classification accuracy and taxonomy loss for action recognition using different representations. We compare results using both pre-segmented and temporally-continuous video clips. Please see the text for a detailed discussion, but our active-object (AO) model doubles the performance of typical action models based on space-time interest points (STIP). We also show that idealized, perfect object detectors (IO), augmented with the knowledge of which objects are being interacted with (IA+IO), dramatically increase performance.


6.6.1 Action recognition results

Table 6.3 tabulates action classification accuracy for different versions of our system. We begin with a standard baseline: an SVM trained on a bag of quantized spatio-temporal interest points (STIPs) [138]. It performs fairly poorly, at 16.5% on classification of pre-segmented video clips. Adding our temporal pyramid model boosts performance to 22.8%, revealing the benefit of reasoning about temporal structure. Our bag-of-objects model (O) noticeably improves performance to 32.7%, which is further increased to 40.6% when augmented with the active-object model (AO). Our novel representations provide a factor of two improvement over contemporary approaches to action recognition.

To further analyze where future work should be focused, we evaluated our model with idealized perfect object detectors (IO), and augmented such idealized detectors with perfect knowledge of when objects are "active" (IA+IO). We do this by simply using the object and interaction annotations in our dataset. These dramatically increase performance to 77%, suggesting that for ADL recognition, "it's all about the objects", and in particular, "it's all about the objects being interacted with." We believe that accuracy is limited (even for the ideal case) due to genuine ambiguities in the data, as well as difficulties in annotation. Some actions, such as "washing dishes" and "drinking water from tap", involve interactions with the same objects (mug and tap). Some objects are small and often occluded (e.g., dental floss) and so are not fully annotated. But it is likely that an ideal detector for such objects would be difficult to build.

One compelling aspect of ADL footage is that it is naturally collected as a continuous video stream, requiring one to solve a temporal segmentation problem. This temporal continuity is rare for action datasets, which tend to consist of pre-segmented clips. For each evaluation scenario, we train 1-vs-rest SVM classifiers on pre-segmented clips and test them on either pre-segmented or continuous videos. In the continuous case, we apply the detector within a temporal sliding window of 10 seconds and assign the best label to its center frame. We also add a background class label to the set of possible outputs in the continuous case. We score a model by its frame classification accuracy. As perhaps expected, performance decreases with respect to the pre-segmented case, but it is still reasonable.

We have constructed confusion matrices for all entries in Table 6.3 and will release them with the dataset. We show the confusion matrix for our active-object (AO) model in Fig. 6.13. Interestingly, many actions are mistaken for functionally similar ones: "make up", "brushing teeth", and "dental floss" are all mistaken for each other (and are instances of personal hygiene). We believe this holds because much of the functional taxonomy in Fig. 6.10 is scene-based; people prepare food in a kitchen and maintain personal hygiene in a bathroom. Our bag-of-object representation acts as a coarse scene descriptor, and hence makes such functionally-reasonable mistakes.

Fig. 6.14 shows sample detected active and passive objects with their confidence values. There are many mistakes in the object detection results, but the temporal averaging in our temporal model can handle them and find the most consistent detections.
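As a final illustration, a minimal sketch of the continuous-video labeling protocol described above follows. The window classifier is assumed to be the 1-vs-rest SVMs over pyramid features described earlier, and the way the background label is produced below (a score threshold) is an assumption rather than a detail given in the text:

```python
import numpy as np

def label_continuous_video(num_frames, score_window, fps=30,
                           window_sec=10, bg_thresh=0.0):
    """score_window(start, end) -> vector of 1-vs-rest SVM scores for the clip
    [start, end).  Each frame receives the best label of the window centered on
    it; frames whose best score falls below bg_thresh get the background label (-1)."""
    half = (window_sec * fps) // 2
    labels = np.empty(num_frames, dtype=int)
    for t in range(num_frames):
        start, end = max(0, t - half), min(num_frames, t + half)
        scores = score_window(start, end)
        best = int(np.argmax(scores))
        labels[t] = best if scores[best] > bg_thresh else -1
    return labels
```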

6.7 Conclusion

After building all the components of our action recognition system, in this chapter we introduced an important application of activity recognition. We have presented a novel dataset, algorithms, and an empirical evaluation for the problem of detecting activities of daily living (ADL) in first-person camera views. We presented novel algorithms for exploiting temporal structure and interactive models of objects, both important for ADL recognition. To illustrate our algorithms, we have collected a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. We have annotated the dataset with activities, object tracks, hand positions, and interaction events. We have presented extensive experimental results demonstrating that our models greatly outperform existing approaches for wearable ADL recognition, and we also present a roadmap for future work on better models of objects and their interactions.


Figure 6.13: Confusion matrix for the temporal pyramid with vision-based active object detectors on pre-segmented videos; rows and columns range over the 18 action classes. Segment classification accuracy = 40.6%.


Figure 6.14: Sample detected active (in green) and passive (in red) objects with their confidence values. There are many mistakes in the results of object detection, but the temporal averaging in our temporal model can handle them and find the most consistent detections. A sample video is available online here: http://www.youtube.com/watch?v=tybmC0bS928


Chapter 7 Conclusion and future work

In this thesis, we introduced a novel activity recognition system. We built scalable algorithms for extracting high-level features in the spatial and temporal domains. Then, we introduced our fast, real-time, grammar-based action parser, capable of temporally segmenting long video sequences into actions and sub-actions. Finally, we introduced detecting activities of daily living in first-person camera views as an important and realistic application of activity recognition in video. We developed a large-scale egocentric ADL video dataset on which we evaluated our models.

7.1 Future work

We list some natural future directions for this thesis:

Sparse steerable models: As discussed, our steerable part models reduce the inference time by a large factor. In applying these models to large-scale problems with many categories, we may use a large set of basis filters. One natural extension would be to learn sparse steering coefficients, which can further improve the running time when the basis set is large. This can be done by adding a regularizer (e.g., an l1 norm) on the steering coefficients to the SVM formulation.

Structured learning in multiple object tracking: Our globally optimal algorithm for multiple object tracking is relatively fast. In this work, various model parameters such as the detection bias, transition costs, and birth-death costs were manually tuned and kept constant throughout space and time. However, in theory, they can be location- and time-dependent. For instance, in the case of a forward-moving camera (e.g., mounted on a car), objects tend to appear in the middle of the view at small sizes (far from the camera) and disappear at the boundaries of the view at large sizes (close to the camera). Interestingly, the objective function (the cost function in the graph problem) is linear in these parameters. Hence, one can formulate a structured SVM to learn these parameters for specific applications. Usually, inference is the bottleneck in structured learning problems; however, in this case, our globally optimal inference algorithm is relatively fast and efficient.

Action interruptions and co-occurring actions: In building our action models in Chapter 5, we assumed temporally non-overlapping actions. However, in realistic applications, actions can co-occur or interrupt each other. For instance, in making tea, we start with boiling water, but may engage in other unrelated activities, such as watching TV, as we wait for the water to boil. There are many instances of this type of interruption in our ADL dataset. One may extend our grammar models to handle this.

Integration of modules for detecting ADL: We introduced several components for action recognition, but did not integrate them all in the application of detecting ADLs. This integration is a task which may have its own challenges, such as defining the grammar rules.

More detailed interactions/occlusions and affordances: We showed the effectiveness of modeling active and passive objects separately. This suggests using more sophisticated approaches in building these representations. For instance, recently there has been growing interest in the community in detecting affordances of objects and environments [49, 53]. One can utilize these to construct more sophisticated high-level features for action recognition.

Assessing the quality of actions: In this thesis, we focused on the binary problem of detecting activities. We answer the question "Did this action happen at this time?" However, in many applications, the answer is not binary and it is important to assess the quality of the action. For instance, in detecting ADLs in medical applications, physicians are interested in knowing whether the person can move his arms properly or how well the person can pour water into a cup. Solving this problem is part of our long-term future work.


Bibliography [1] Mind’s eye video dataset. http://www.visint.org/datasets.html. [2] E. Adelson and J. Bergen. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2):284–299, 1985. [3] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43(3):16, 2011. [4] R. Ahuja, T. Magnati, and J. Orlin. Network flows: Theory, Algorithms, and Applications. Prentice Hall, 2008. [5] F. Al-Khayyal and J. Falk. Jointly constrained biconvex programming. Mathematics of Operations Research, pages 273–286, 1983. [6] R. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853, 2005. [7] A. Andriyenko and K. Schindler. Globally optimal multi-target tracking on a hexagonal lattice. In ECCV, 2010. [8] M. Argyle and B. Foss. The psychology of interpersonal behaviour. Penguin Books Middlesex. England, 1967. [9] S. Baysal, M. Kurt, and P. Duygulu. Recognizing human actions using key poses. In ICPR, 2010. [10] J. Berclaz, F. Fleuret, and P. Fua. Multiple object tracking using flow linear programming. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on, pages 1–8. IEEE, 2010. [11] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using kshortest paths optimization. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(9):1806–1819, 2011. [12] M. Blum, A. Pentland, and G. Troster. Insense: Interest-based life logging. Multimedia, IEEE, 13(4):40–48, 2006. [13] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, pages 1365–1372. IEEE, 2009. [14] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.


[15] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE PAMI, 2001. [16] Y. Cai, N. de Freitas, and J. Little. Robust visual tracking for multiple targets. Lecture Notes in Computer Science, 3954:107, 2006. [17] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. [18] A. Catz, M. Itzkovich, E. Agranov, H. Ring, and A. Tamir. SCIM-spinal cord independence measure: a new disability scale for patients with spinal cord lesions. Spinal Cord, 35(12):850–856, 1997. [19] J. Choi, W. Jeon, and S. Lee. Spatio-temporal pyramid matching for sports videos. In ACM ICMR, pages 291–297. ACM, 2008. [20] W. Choi and S. Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. ECCV 2010, pages 553–567, 2010. [21] N. Chomsky. Three models for the description of language. Information Theory, IRE Transactions on, 2(3):113–124, 1956. [22] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernelbased vector machines. The Journal of Machine Learning Research, 2:265–292, 2002. [23] N. Dalal. Finding People in Images and Video. PhD thesis, Institut National Polytechnique de Grenoble / INRIA Grenoble, July 2006. [24] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, 2005. [25] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. Lecture Notes in Computer Science, 3952:428, 2006. [26] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009. [27] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, June 2009. [28] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, 1970. [29] A. Ess, B. Leibe, and L. Van Gool. Depth and appearance for mobile scene analysis. In ICCV, 2007. [30] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html. [31] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. Liblinear: A library for large linear classification. JMLR, 9:1871–1874, 2008. [32] A. Fathi, A. Farhadi, and J. Rehg. Understanding egocentric activities. In ICCV, 2011. 130

[33] A. Fathi, X. Ren, and J. Rehg. Learning to recognize objects in egocentric activities. In CVPR, pages 3281–3288. IEEE, 2011. [34] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, pages 2241–2248. IEEE, 2010. [35] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. Computer Vision and Pattern Recognition, Anchorage, USA, June, 2008. [36] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://www.cs.brown.edu/ pff/latent-release4/. [37] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE PAMI, 2010. [38] J. Ferguson. Variable duration models for speech. In Symposium on the Application of hidden Markov models to Text and Speech, volume 1, pages 143–179, 1980. [39] S. Fidler, M. Boben, and A. Leonardis. Learning hierarchical compositional representations of object structure. S. Dickinson, A., Leonardis, B. Schiele, & M. Tarr (eds.), Object categorization computer and human perspectives, pages 196–215, 2009. [40] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden markov model: Analysis and applications. Machine learning, 32(1):41–62, 1998. [41] D. Forsyth, O. Arikan, L. Ikemoto, J. O’Brien, and D. Ramanan. Computational studies of human motion i: Tracking and animation. Foundations and Trends in Computer Graphics and Vision, 1(2/3):1–255, 2006. [42] T. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering, 8(3):173–184, 1983. [43] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th international conference on Machine learning, pages 320–327. ACM New York, NY, USA, 2008. [44] W. Freeman and E. Adelson. The design and use of steerable filters. IEEE TPAMI, 13(9):891–906, 1991. [45] M. Gales and S. Young. Segmental hidden markov models. In Third European Conference on Speech Communication and Technology, 1993. [46] J. Gemmell, G. Bell, and R. Lueder. MyLifeBits: a personal database for everything. Communications of the ACM, 49(1):88–95, 2006. [47] A. Goldberg. An efficient implementation of a scaling minimum-cost flow algorithm. Journal of Algorithms, 22(1):1–29, 1997. [48] J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.


[49] H. Grabner, J. Gall, and L. Van Gool. What makes a chair a chair? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1529– 1536. IEEE, 2011. [50] K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. The Journal of Machine Learning Research, 8:725–760, 2007. [51] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007. [52] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010. [53] A. Gupta, S. Satkin, A. Efros, and M. Hebert. From 3d scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1961–1968. IEEE, 2011. [54] M. Hanheide, N. Hofemann, and G. Sagerer. Action recognition in a wearable assistance system. In ICPR, volume 2, pages 1254–1258. IEEE, 2006. [55] K. Hatun and P. Duygulu. Pose sentences: A new representation for action recognition using sequence of pose words. In ICPR, 2008. [56] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit. Gradient response maps for real-time detection of texture-less objects. IEEE TPAMI, (99):1–1, 2011. [57] M. Hoai, Z. zhong Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In CVPR, 2011. [58] S. Hodges, L. Williams, E. Berry, S. Izadi, J. Srinivasan, A. Butler, G. Smyth, N. Kapur, and K. Wood. SenseCam: A retrospective memory aid. UbiComp 2006: Ubiquitous Computing, pages 177–193, 2006. [59] J. Hopcroft, R. Motwani, and J. Ullman. Introduction to automata theory, languages, and computation, volume 2. Addison-wesley Reading, MA, 1979. [60] M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu. Efficient representations of video sequences and their applications. Signal Processing: Image Communication, 8(4):327–351, 1996. [61] M. Isard and J. MacCormick. Bramble: A bayesian multiple-blob tracker. In ICCV, 2001. [62] Y. Ivanov and A. Bobick. Recognition of visual activities and interactions by stochastic parsing. PAMI, 22(8):852–872, 2000. [63] J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic Publishers, 1999. [64] H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. In IEEE CVPR, 2007. [65] T. Joachims, T. Finley, and C. Yu. Cutting plane training of structural SVMs. Machine Learning, 2009. [66] K. T. Kang H., Hebert M. Discovering object instances from scenes of daily living. In ICCV, 2011. 132

[67] K. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, pages 3241–3248. IEEE, 2011. [68] B. Kopp, A. Kunkel, H. Flor, T. Platz, U. Rose, K. Mauritz, K. Gresser, K. McCulloch, and E. Taub. The Arm Motor Ability Test: reliability, validity, and sensitivity to change of an instrument for assessing disabilities in activities of daily living. Archives of physical medicine and rehabilitation, 78(6):615–620, 1997. [69] I. Kotsia and I. Patras. Support tucker machines. In CVPR, 2011. [70] H. Kuhn, P. Haas, I. Ilyas, G. Lohman, and V. Markl. The Hungarian method for the assignment problem. Masthead, 23(3):151–210, 1993. [71] M. Land, N. Mennie, and J. Rusted. The roles of vision and eye movements in the control of activities of daily living. Perception, 28(11):1311–1328, 1999. [72] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005. [73] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Conference on Computer Vision and Pattern Recognition, 2008. [74] I. Laptev and P. Perez. Retrieving actions in movies. In International Conference on Computer Vision, 2007. [75] L. Lathauwer, B. Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl, 1995. [76] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169– 2178. Ieee, 2006. [77] B. Leibe, K. Schindler, and L. Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007. [78] L. Li, H. Su, E. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. Advances in Neural Information Processing Systems, 24, 2010. [79] N. Loeff and A. Farhadi. Scene Discovery by Matrix Factorization. In Proceedings of the 10th European Conference on Computer Vision: Part IV, pages 451–464. Springer-Verlag Berlin, Heidelberg, 2008. [80] D. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. Ieee, 1999. [81] Y. Ma, Q. Yu, and I. Cohen. Target tracking with incomplete detection. CVIU, 2009. [82] R. Manduchi, P. Perona, and D. Shy. Efficient deformable filter banks. Signal Processing, IEEE Transactions on, 46(4):1168–1173, 1998. [83] C. Manning, H. Sch¨utze, and MITCogNet. Foundations of statistical natural language processing, volume 999. MIT Press, 1999. 133

[84] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In Proc. IEEE Conf. Computer Vision & Pattern Recognition. Citeseer, 2009. [85] W. Mayol and D. Murray. Wearable hand activity recognition for event summarization. In International Symposium on Wearable Computers. IEEE, 2005. [86] D. Moore and I. Essa. Recognizing multitasked activities from video using stochastic context-free grammar. In National Conf on Artificial Intelligence, pages 770–776. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2002. [87] K. Murphy and M. Paskin. Linear-time inference in hierarchical hmms. Advances in Neural Information Processing Systems, 2:833–840, 2002. [88] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. ECCV, pages 392–405, 2010. [89] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai. A large-scale benchmark dataset for event recognition in surveillance video. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3153–3160. IEEE, 2011. [90] N. Oliver, A. Garg, and E. Horvitz. Layered representations for learning and inferring office activity from multiple sensory channels. CVIU, 96(2):163–180, 2004. [91] P. Ott and M. Everingham. Shared parts for deformable part-based models. In CVPR, pages 1513 –1520, june 2011. [92] D. Patterson, D. Fox, H. Kautz, and M. Philipose. Fine-grained activity recognition by aggregating abstract object usage. In Wearable Computers, 2005. Proceedings. Ninth IEEE International Symposium on, pages 44–51. IEEE, 2005. [93] M. Pedersoli, A. Vedaldi, and J. Gonz`alez. A coarse-to-fine approach for fast deformable object detection. In CVPR, 2011. [94] S. Pellegrini, A. Ess, and L. V. Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, 2010. [95] A. Pentland. Looking at people: Sensing for ubiquitous and wearable computing. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(1):107–119, 2002. [96] A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In IEEE CVPR, volume 1, 2006. [97] A. G. A. Perera, A. Hoogs, C. Srinivas, G. Brooksby, and W. Hu. Evaluation of algorithms for tracking multiple objects in video. In AIPR, page 35, 2006. [98] P. Perona. Deformable kernels for early vision. IEEE TPAMI, 17(5):488–499, 1995. [99] M. Philipose, K. Fishkin, M. Perkowitz, D. Patterson, D. Fox, H. Kautz, and D. Hahnel. Inferring activities from interactions with objects. IEEE Pervasive Computing, pages 50–57, 2004.


[100] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2847–2854. IEEE, 2012. [101] H. Pirsiavash and D. Ramanan. Steerable part models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3226–3233. IEEE, 2012. [102] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Bilinear classifiers for visual recognition. In Neural Information Processing Systems, volume 1. Citeseer, 2009. [103] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1201–1208. IEEE, 2011. [104] J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999. [105] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010. [106] L. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. [107] L. Rabiner and B. Juang. An introduction to hidden markov models. ASSP Magazine, IEEE, 3(1):4–16, 1986. [108] D. Ramanan. Learning to parse images of articulated bodies. Advances in Neural Information Processing Systems, 19:1129, 2007. [109] X. Ren and C. Gu. Figure-ground segmentation improves handled object recognition in egocentric video. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3137–3144. IEEE, 2010. [110] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In International Conference on Machine Learning, volume 22, page 713, 2005. [111] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008. [112] M. S. Ryoo and J. K. Aggarwal. Semantic representation and recognition of continued and recursive human activities. IJCV, 82(1):1–24, 2009. [113] M. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, pages 1745–1752. IEEE, 2011. [114] S. Sarawagi and W. Cohen. Semi-markov conditional random fields for information extraction. NIPS, 17:1185–1192, 2004. [115] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of th e17th International Conference on, volume 3, pages 32–36. IEEE, 2004.


[116] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th international conference on Multimedia, pages 357–360. ACM, 2007. [117] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In International Conference on Machine Learning, volume 22, page 793, 2005. [118] Q. Shi, L. Cheng, L. Wang, and A. Smola. Human action segmentation and recognition using discriminative semi-markov models. IJCV, pages 1–11, 2010. [119] Z. Si, M. Pei, B. Yao, and S. Zhu. Unsupervised learning of event and-or grammar and semantics from video. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 41–48. IEEE, 2011. [120] J. Sivic and A. Zisserman. Video google: Efficient visual search of videos. Toward Category-Level Object Recognition, pages 127–144, 2006. [121] K. Smith, D. Gatica-Perez, J. Odobez, and S. Ba. Evaluating multi-object tracking. In CVPR Workshops. IEEE, 2005. [122] E. Spriggs, F. De La Torre, and M. Hebert. Temporal segmentation and activity classification from first-person sensing. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, pages 17–24. IEEE, 2009. [123] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 17:1329–1336, 2005. [124] C. Stauffer. Estimating tracking sources and sinks. In Computer Vision and Pattern Recognition Workshop, 2003. CVPRW’03. Conference on, volume 4, pages 35–35. IEEE, 2003. [125] L. Sun, U. Klank, and M. Beetz. Eyewatchme3d hand and object tracking for inside out activity analysis. In IEEE Workshop on Egocentric Vision, pages 9–16. IEEE, 2009. [126] S. Sundaram and W. Cuevas. High level activity recognition using low resolution wearable vision. In IEEE Workshop on Egocentric Vision, pages 25–32. IEEE, 2009. [127] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1250–1257. IEEE, 2012. [128] D. Tao, X. Li, X. Wu, and S. Maybank. General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1700, 2007. [129] E. Tapia, S. Intille, and K. Larson. Activity recognition in the home using simple and ubiquitous sensors. Pervasive Computing, pages 158–175, 2004. [130] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.


[131] O. Thomas, P. Sunehag, G. Dror, S. Yun, S. Kim, M. Robards, A. Smola, D. Green, and P. Saunders. Wearable sensor activity analysis using semi-markov models with a grammar. Pervasive and Mobile Computing, 6(3):342–350, 2010. [132] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE TPAMI, 29(5):854–869, 2007. [133] T. T. Truyen, D. Q. Phung, H. H. Bui, and S. Venkatesh. Hierarchical semi-markov conditional random fields for recursive sequential data. In NIPS, pages 1657–1664, 2008. [134] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453, 2006. [135] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 18(11):1473–1488, 2008. [136] M. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. Lecture Notes in Computer Science, pages 447–460, 2002. [137] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, pages 3539–3546. IEEE, 2010. [138] H. Wang, M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatiotemporal features for action recognition. In BMVC, 2009. [139] S. Wang, A. Quattoni, L. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In CVPR, volume 2, pages 1521–1527. IEEE, 2006. [140] A. Wilson and A. Bobick. Parametric hidden markov models for gesture recognition. PAMI, 21(9):884–900, 1999. [141] L. Wolf, H. Jhuang, and T. Hazan. Modeling appearances with low-rank SVM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–6. Citeseer, 2007. [142] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. Rehg. A scalable approach to activity recognition based on object use. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007. [143] J. Xing, H. Ai, and S. Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In IEEE CVPR, June 2009. [144] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H. Zhang. Discriminant analysis with tensor representation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, page 526. Citeseer, 2005. [145] Y. Yang and D. Ramanan. Articulated pose estimation using flexible mixtures of parts. In CVPR 2011, 2011.


[146] D. Young and J. Ferryman. Pets metrics: On-line performance evaluation service. In Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pages 317–324, 2005. [147] C.-N. J. Yu and T. Joachims. Learning structural svms with latent variables. In International Conference on Machine Learning (ICML), 2009. [148] L. Zelnik-Manor and M. Irani. Event-based analysis of video. In CVPR, volume 2, 2001. [149] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008. [150] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille. Part and appearance sharing: Recursive compositional models for multi-view multi-object detection. Pattern Recognition, pages 1919–1926, 2010. [151] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark estimation in the wild. CVPR, 2012.
