
Page 1: Action recognition with improved trajectories

Heng Wang and Cordelia Schmid
LEAR Team, INRIA

Page 2: Action recognition in realistic videos

Challenges:
- Severe camera motion
- Variation in human appearance and pose
- Cluttered background and occlusion
- Viewpoint and illumination changes

Current state of the art:
- Local space-time features + bag-of-features model
- Dense trajectories perform best on a large variety of datasets (Wang et al., IJCV'13)

Page 3: Dense trajectories revisited

Three major steps:
- Dense sampling
- Feature tracking
- Trajectory-aligned descriptors
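
As a rough illustration of the tracking step, here is a minimal Python sketch assuming OpenCV's Farneback flow (the authors' released code uses its own flow implementation); the median filtering of the flow field before advancing each point follows the dense-trajectory tracker, while function and parameter names are illustrative.

```python
import cv2
import numpy as np

def track_points(prev_gray, gray, points):
    # Dense flow between two consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Median-filter each flow component to make tracking robust
    # to flow noise, as in the dense-trajectory tracking step.
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 5)
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 5)
    h, w = gray.shape
    xs = np.clip(np.round(points[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(points[:, 1]).astype(int), 0, h - 1)
    # Each point moves by the (smoothed) flow at its location.
    return points + np.stack([fx[ys, xs], fy[ys, xs]], axis=1)
```

Descriptors (HOG, HOF, MBH) are then computed in a space-time volume aligned with each trajectory.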

Page 4: Dense trajectories revisited

Advantages:
- Capture the intrinsic dynamic structures in video
- MBH is robust to camera motion

Disadvantages:
- Generate irrelevant trajectories in the background due to camera motion
- Motion descriptors (e.g., HOF, MBH) are corrupted by camera motion

Page 5: Improved dense trajectories

Contributions:
- Improve dense trajectories by explicit camera motion estimation
- Detect humans to remove outlier matches for homography estimation
- Remove trajectories caused by camera motion
- Stabilize the optical flow to eliminate camera motion

Page 6: Camera motion estimation

Find correspondences between two consecutive frames:
- Extract and match SURF features (robust to motion blur)
- Sample good-features-to-track interest points and take their optical-flow vectors as matches

Combining the SURF (green) and optical-flow (red) matches gives a more balanced spatial distribution.

Use RANSAC to estimate a homography from all feature matches.

[Figure: inlier matches of the homography]
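
A hedged sketch of this estimation step with OpenCV. ORB stands in for SURF here, since SURF lives in the non-free opencv-contrib module; everything else (descriptor matches plus flow-displaced corners, fed jointly to RANSAC) follows the recipe above, with illustrative parameter values.

```python
import cv2
import numpy as np

def estimate_homography(prev_gray, gray):
    # 1) Descriptor matches between the two frames
    #    (ORB as a stand-in for SURF).
    orb = cv2.ORB_create(nfeatures=1000)
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])

    # 2) Good-features-to-track corners displaced by dense flow,
    #    to balance the spatial distribution of correspondences.
    corners = cv2.goodFeaturesToTrack(prev_gray, 1000, 0.01, 10)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    pts = corners.reshape(-1, 2)
    moved = pts + flow[pts[:, 1].astype(int), pts[:, 0].astype(int)]
    src = np.vstack([src, pts])
    dst = np.vstack([dst, moved])

    # 3) Robust fit: RANSAC discards outliers (e.g., moving people).
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 1.0)
    return H, inliers
```

Combining the two sources balances textured regions, where descriptor matches concentrate, against uniform areas that the dense flow still covers.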

Page 7: Remove inconsistent matches due to humans

Human motion is not constrained by the camera motion and thus generates outlier matches.

Apply a human detector in each frame, and track the human bounding boxes forward and backward to join them together. Remove feature matches inside the human bounding boxes during homography estimation.

[Figure: inlier matches and warped flow, without and with HD]
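
A minimal sketch of the masking step, assuming the detector and tracker supply per-frame boxes; the helper name and box format are illustrative.

```python
import numpy as np

def mask_human_matches(src, dst, boxes):
    # Drop correspondences whose source point falls inside any
    # tracked human bounding box, so moving people cannot bias
    # the homography. boxes: list of (x1, y1, x2, y2);
    # src, dst: (N, 2) matched point arrays.
    keep = np.ones(len(src), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        inside = ((src[:, 0] >= x1) & (src[:, 0] <= x2) &
                  (src[:, 1] >= y1) & (src[:, 1] <= y2))
        keep &= ~inside
    return src[keep], dst[keep]
```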

Page 8: Warp optical flow

Warp the second of two consecutive frames with the homography and re-compute the optical flow.

For HOF, the warped flow removes irrelevant camera motion and thus encodes only the foreground motion.

MBH also improves, as the motion boundaries are enhanced.

[Figure: the two frames overlaid; the original optical flow; the warped version]
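
A minimal OpenCV sketch of the warping step, assuming H maps frame-t coordinates to frame-(t+1) coordinates as estimated above; parameter choices are illustrative.

```python
import cv2

def warped_flow(prev_gray, gray, H):
    # Warp frame t+1 back into frame t's coordinates.
    # WARP_INVERSE_MAP applies H directly as the dst-to-src mapping,
    # i.e., stabilized(x) = gray(H(x)).
    h, w = gray.shape
    stabilized = cv2.warpPerspective(gray, H, (w, h),
                                     flags=cv2.WARP_INVERSE_MAP)
    # Re-compute the flow against the stabilized frame; what
    # remains is (approximately) foreground motion only.
    return cv2.calcOpticalFlowFarneback(prev_gray, stabilized, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```

HOF computed on this warped flow then encodes only foreground motion, matching the claim above.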

Page 9: Remove background trajectories

Remove trajectories by thresholding the maximal magnitude of the stabilized motion vectors in the warped optical flow.

The method works well under various camera motions, such as pan, zoom, and tilt.

[Figure: removed trajectories (white) and foreground ones (green); successful examples and failure cases]

Failure cases are due to severe motion blur: the homography is not correctly estimated because the feature matches are unreliable.
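
A minimal sketch of the thresholding test, assuming the per-frame displacement vectors of a trajectory have already been sampled from the warped flow; the threshold value is illustrative, not the one used in the paper.

```python
import numpy as np

def is_background(warped_vectors, min_motion=1.0):
    # A trajectory is background if, in the camera-stabilized
    # (warped) flow, even its largest per-frame displacement stays
    # below the threshold. warped_vectors: (L, 2) displacements
    # sampled along the trajectory; min_motion in pixels.
    magnitudes = np.linalg.norm(warped_vectors, axis=1)
    return magnitudes.max() < min_motion
```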

Page 10: Demo of warped flow and trajectory removal

Removing background trajectories makes the feature representation focus more on human motion.

The warped optical flow eliminates background camera motion.

Page 11: Experimental setting

- "RootSIFT" normalization for each descriptor, then PCA to reduce its dimension by a factor of two
- Fisher vector encoding of each descriptor type separately, with the number of Gaussians set to K=256
- Power + L2 normalization for the FV; linear SVM with one-against-rest for multi-class classification
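
A sketch of this encoding setup using scikit-learn for the PCA and the GMM vocabulary; the random array merely stands in for real trajectory descriptors, and the Fisher-vector gradients themselves are omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def rootsift(desc):
    # "RootSIFT": L1-normalize each descriptor, then take the
    # element-wise signed square root.
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + 1e-12)
    return np.sign(desc) * np.sqrt(np.abs(desc))

def power_l2(fv, alpha=0.5):
    # Power + L2 normalization of a Fisher vector, applied before
    # the linear SVM.
    fv = np.sign(fv) * np.abs(fv) ** alpha
    return fv / (np.linalg.norm(fv) + 1e-12)

# Random placeholder standing in for descriptors of one type.
train_descriptors = np.random.rand(10000, 96).astype(np.float32)

# Per descriptor type: RootSIFT, PCA to half the dimension, then a
# K=256 diagonal-covariance GMM as the Fisher-vector vocabulary.
train = rootsift(train_descriptors)
pca = PCA(n_components=train.shape[1] // 2).fit(train)
gmm = GaussianMixture(n_components=256, covariance_type="diag").fit(
    pca.transform(train))
```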

Datasets:
- Hollywood2: 12 classes from 69 movies, report mAP
- HMDB51: 51 classes, report accuracy over three splits
- Olympic Sports: 16 sport actions, report mAP
- UCF50: 50 classes, report accuracy over 25 groups

Page 12: Evaluation of the intermediate steps

Baseline: DTF = "dense trajectory feature"
WarpFlow = "warp the optical flow"
RmTrack = "remove background trajectories"
ITF = "improved trajectory feature", combining WarpFlow and RmTrack

Results on HMDB51 using Fisher vectors:

          Trajectory  HOG    HOF    MBH    HOF+MBH  Combined
DTF       25.4%       38.4%  39.5%  49.1%  49.8%    52.2%
WarpFlow  31.0%       38.7%  48.5%  50.9%  53.5%    55.6%
RmTrack   26.9%       39.6%  41.6%  50.8%  51.0%    53.9%
ITF       32.4%       40.2%  48.9%  52.1%  54.7%    57.2%

Page 13: Evaluation of the intermediate steps

(Commentary on the same table as on page 12: results on HMDB51 using Fisher vectors.)

HOF and MBH are complementary, as they represent zero- and first-order motion information.
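
To make the zero- versus first-order distinction concrete, here is a hedged OpenCV sketch of the motion-boundary images that MBH is built on (the HOG-style spatio-temporal binning that follows is omitted).

```python
import cv2
import numpy as np

def motion_boundaries(flow):
    # MBH treats each flow component as an image and histograms the
    # orientations of its spatial gradients (first-order motion);
    # HOF histograms the flow vectors themselves (zero-order).
    # Constant, camera-induced motion has zero gradient, so it
    # cancels out in MBH.
    channels = []
    for c in range(2):  # horizontal and vertical flow components
        comp = np.ascontiguousarray(flow[..., c])
        gx = cv2.Sobel(comp, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(comp, cv2.CV_32F, 0, 1)
        channels.append(cv2.cartToPolar(gx, gy))  # (magnitude, angle)
    return channels  # [(MBHx mag, ang), (MBHy mag, ang)] for binning
```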

Both RmTrack and WarpFlow help, with WarpFlow contributing more; combining them (ITF) works best.

Both Trajectory and HOF are significantly improved; MBH is also better, as the motion boundaries are clearer; HOG does not change much.

Page 14: Impact of feature encoding on improved trajectories

We observe a similar improvement of ITF over DTF whether BOF or FV is used for feature encoding.

The improvement of FV over BOF varies across datasets, from 2% to 7%.

Standard bag of features: train a codebook of 4000 visual words with k-means for each descriptor type; RBF-kernel SVM for classification (see the sketch below).
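
A hedged sketch of this baseline encoding with scikit-learn; MiniBatchKMeans stands in for plain k-means for tractability, and the random array is a placeholder for real descriptors.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Random placeholder standing in for descriptors of one type.
train_descriptors = np.random.rand(50000, 96).astype(np.float32)

# 4000-word codebook learned on the training descriptors.
codebook = MiniBatchKMeans(n_clusters=4000, n_init=1).fit(train_descriptors)

def bof_histogram(video_descriptors):
    # Hard-assign each descriptor to its nearest visual word and
    # L1-normalize the resulting occurrence histogram.
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=4000).astype(np.float64)
    return hist / (hist.sum() + 1e-12)
```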

Comparison of DTF and ITF under the two feature encodings:

                 Bag of features      Fisher vector
                 DTF      ITF         DTF      ITF
Hollywood2       58.5%    62.2%       60.1%    64.3%
HMDB51           47.2%    52.1%       52.2%    57.2%
Olympic Sports   75.4%    83.3%       84.7%    91.1%
UCF50            84.8%    87.2%       88.6%    91.2%

Page 15: Impact of human detection and state of the art

HD stands for human detection. Human detection always helps; for Hollywood2 and HMDB51 the difference is more significant, as more humans are present.

Our method significantly outperforms the state of the art on all four datasets.

Source code: http://lear.inrialpes.fr/~wang/improved_trajectories

Dataset          State of the art        Without HD   With HD
Hollywood2       Jain CVPR'13: 62.5%     63.0%        64.3%
HMDB51           Jain CVPR'13: 52.1%     55.9%        57.2%
Olympic Sports   Jain CVPR'13: 83.2%     90.2%        91.1%
UCF50            Shi CVPR'13: 83.3%      90.5%        91.2%

Page 16: THUMOS'13 Action Recognition Challenge

We follow exactly the same framework: improved trajectory features + Fisher vector.

We do not apply human detection, as it is computationally expensive to run on large datasets.

Dataset: three train-test splits of UCF101.

We use spatio-temporal pyramids to embed structure information in the final representation; a sketch of the cell assignment follows.
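
A hedged sketch of the cell assignment behind the pyramid variants in the following tables, assuming T2 denotes two temporal halves and H3 three horizontal stripes; the helper is illustrative, not the authors' code.

```python
import numpy as np

def stp_cell(points, height, length, scheme):
    # Assign each feature's (x, y, t) position to a pyramid cell.
    # One Fisher vector is encoded per cell, and the per-cell
    # vectors are concatenated with the whole-video encoding.
    x, y, t = points[:, 0], points[:, 1], points[:, 2]
    if scheme == "T2":    # two temporal halves
        return (t >= length / 2).astype(int)
    if scheme == "H3":    # three horizontal stripes
        return np.minimum((3 * y / height).astype(int), 2)
    return np.zeros(len(points), dtype=int)  # "None": whole video
```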

Page 17: THUMOS'13 Action Recognition Challenge

We do not include the Trajectory descriptor, as combining it does not improve the final performance.

Descriptors    None    T2      H3      Combined
HOG            72.4%   72.8%   73.2%   74.6%
HOF            76.0%   76.1%   77.3%   78.3%
MBH            80.8%   81.1%   80.5%   82.1%
HOG+HOF        82.9%   82.7%   82.7%   83.9%
HOG+MBH        83.3%   83.3%   83.4%   84.4%
HOF+MBH        82.2%   82.2%   82.0%   83.3%
HOG+HOF+MBH    84.8%   84.8%   84.6%   85.9%

For a single descriptor: MBH > HOF > HOG. When combining two descriptors, HOG+MBH works best, as they are the most complementary.

Page 18: THUMOS'13 Action Recognition Challenge

(Commentary on the same table as on page 17.)

Spatio-temporal pyramids always help. The improvement is more significant for single descriptors.

Combining everything gives the best performance, 85.9%, which is the result we submitted to THUMOS.

Page 19: TRECVID'13 Multimedia Event Detection

Large-scale video classification: 4500 hours, over 100,000 videos. ITF is the best video descriptor and very fast to compute; our whole pipeline (ITF+FV) is only 10 times slower than real time.

For the visual channel, we combine ITF and SIFT.

Team         Full system   ASR     Audio   OCR     Visual
AXES         36.6%         1.0%    12.4%   1.1%    29.4%
CMU          36.3%         5.7%    16.1%   3.7%    28.4%
BBNVISER     32.2%         8.0%    15.1%   5.3%    23.4%
Sesame       25.7%         3.9%    5.6%    0.2%    23.2%
MediaMill    25.3%         ----    5.6%    ----    23.8%
NII          24.9%         ----    8.8%    ----    19.9%
SRIAURORA    24.2%         3.9%    9.6%    4.3%    20.4%
Genie        20.2%         4.3%    10.1%   ----    16.9%

Top performance on MED ad-hoc.