
Page 1

Daily Living Activities Recognition via Efficient High and Low Level Cues Combination and Fisher Kernel Representation

Negar Rostamzadeh (1), Gloria Zen (1), Ionut Mironica (2), Jasper Uijlings (1), Nicu Sebe (1)

(1) DISI, University of Trento, Trento, Italy
(2) LAPI, University Politehnica of Bucharest, Bucharest, Romania

Page 2

Outline

• Daily Living Action Recognition
• State-of-the-art
• Our approach
• Results
• Conclusion


Page 3


Action Recognition in videos

Answer phone or dial phone?

Difficulties in fine-grained activities:
1. Activities differ only slightly in motion and appearance.
2. The same task can be performed in different manners.

Page 4

Object-centric approaches - SoA

Object-centric approaches are based on tracking and trajectories.

Brendel et al., ICCV 2011 [5]; Han et al., CVPR 2004 [6]; de Campos et al., WACV 2011 [23]; Liu et al., CVPR 2009 [16]

Advantages
• Provide semantic/high-level information about the scene.

Limitations
• Handling occlusions in object interactions
• Broken and missed trajectories
• The curse of dimensionality


Page 5

Non-object-centric approaches - SoA

Bag-of-words approaches relying on low-level features.

Laptev et al., CVPR 2008 [1]; Willems et al., ECCV 2008 [2]; Hospedales et al., ICCV 2009 [3]; Zen et al., CVPR 2011 [4]; Wong et al., CVPR 2007 [15]; Chang et al., ICCV 2011 [17]; Gilbert et al., ICCV 2009 [19]; Zelniker et al., 2008 [20]; Gehrig et al., 2009 [21]; Mahbub et al., ICIEV 2012 [25]

Low-level features: foreground pixels, HoF, STIP, HoG


Advantages
• Robustness to noise & occlusions
• Computational efficiency

Limitations
1. Discard semantic & high-level information about the scene.
2. Discard relationships among spatio-temporal local features.


Page 6

Enhanced descriptors - SoA

Which body-part causes what motion?


Messing et al., ICCV 2009 [7]; Fathi et al., 2008 [8]; Zhang et al., 2012 [9]; Matikainen et al., 2010 [10]; Gaur et al., 2011 [11]; Savarese et al., 2008 [12]; Malgireddy et al., 2011 [14]; Kovashka et al., CVPR 2010 [18]; Shechtman et al., CVPR 2011 [24]


1. Relations between local features: pair-wise [10,11,12,18], local space or time neighborhoods [11,18], spatio-temporal phrases [9]

2. Combining different local features, such as local motion, appearance, and positions [14,24]

3. Enriching the combination of low-level features with high-level information: detecting and localizing faces [7], STIP volumes [8,9]

Page 7

Approach at a glance

Input video → body-part detector + low-level cues → fusing information to produce an enriched descriptor → accumulation over each video → feature representation (Fisher Kernel to model the temporal variation) → classifier → recognizing activities
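The stages above can be read as a small driver loop. Below is a minimal Python sketch under stated assumptions: the helper names (detect_body_parts, extract_low_level_cues, encode) and the descriptor sizes are hypothetical illustrations, not the paper's implementation; it only fixes the order of operations.

```python
# Minimal sketch of the pipeline above; all helper names and sizes are
# hypothetical, not the paper's actual implementation.
import numpy as np

def detect_body_parts(frame):
    # Stand-in for the enhanced body-pose estimator (high-level cue).
    return np.zeros(26)                # e.g., (x, y) for 13 body parts

def extract_low_level_cues(frame):
    # Stand-in for low-level features such as HoG/HoF around the person.
    return np.zeros(96)

def enriched_descriptor(frame):
    # Fuse high-level (pose) and low-level cues into a single vector.
    return np.concatenate([detect_body_parts(frame),
                           extract_low_level_cues(frame)])

def video_representation(frames, encode):
    # Accumulate per-frame descriptors over the whole video, then apply a
    # feature representation (e.g., a Fisher Kernel encoding) whose output
    # is handed to a classifier.
    descriptors = np.stack([enriched_descriptor(f) for f in frames])
    return encode(descriptors)
```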

Page 8

Body-pose estimation

What is the problem with an off-the-shelf detector?

Our solution: an enhanced pose estimator. We employ the already-trained, off-the-shelf classifier, but provide it with additional information from the new dataset.

[Example frames from the ADL and BUFFY datasets]


Page 9

Body-pose estimation - built on Yang and Ramanan (CVPR 2011, PAMI 2012) [29]


1. Model the body as a pictorial structure (Felzenszwalb, CVPR 2010).
2. Model the body as a tree.
3. Each possible body configuration has a score.

Score = local score (HoG appearance) + pair-wise score

Scores obtained by employing the off-the-shelf detector = S_initial
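For reference, the pictorial-structure score of [29] decomposes over the body tree (V, E); a compact rendering, simplified here by omitting the part-type mixture terms of the full model:

```latex
S_{\mathrm{initial}}(I, p) \;=\; \sum_{i \in V} w_i \cdot \phi_{\mathrm{HoG}}(I, p_i)
\;+\; \sum_{(i,j) \in E} w_{ij} \cdot \psi(p_i - p_j)
```

Here p_i is the location of part i, φ_HoG is the local HoG appearance feature (local score), and ψ is the relative-displacement feature between connected parts (pair-wise score).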



Page 10

Enhanced pose estimator

New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)

The weights w_fg and w_of set the relative importance of the foreground and optical flow scores.
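A minimal sketch of this re-scoring, assuming dense score maps over candidate part locations; the function name rescore, the default weight values, and the map sizes are illustrative, not from the paper:

```python
import numpy as np

def rescore(s_initial, s_fg, s_of, w_fg=0.5, w_of=0.5):
    # New score = S_initial plus weighted foreground and optical-flow terms.
    # The weight values here are placeholders; in practice they are tuned.
    return s_initial + w_fg * s_fg + w_of * s_of

# Toy example: score maps over a 60x80 grid of candidate part locations.
s_init = np.random.rand(60, 80)   # S_initial from the off-the-shelf detector
s_fg   = np.random.rand(60, 80)   # foreground (person-silhouette) score
s_of   = np.random.rand(60, 80)   # optical-flow (motion) score
best = np.unravel_index(np.argmax(rescore(s_init, s_fg, s_of)), s_init.shape)
```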

Page 11

Enhanced pose estimator

[Qualitative comparison of pose estimates: SoA foreground and SoA optical flow vs. our approach]


Page 12

Tuning the weights of the new score, New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score).

The enhanced pose estimator is then used to enrich the action recognition approach.

Page 13

Approach at a glance

Input video → body-part detector + low-level cues → fusing information to produce an enriched descriptor → accumulation over each video → feature representation (Fisher Kernel to model the temporal variation) → classifier → recognizing activities

Page 14

Fisher Kernel (FK) Theory

1. Introduced by Jaakkola and Haussler (NIPS 1999) [26] for protein detection.
2. Applied to web audio classification (Moreno, 2000).
3. Introduced in computer vision for image categorization by Perronnin et al. (CVPR 2007).


Fisher Kernel in the state-of-the-art: image categorization vs. video analysis
1. Modeling: spatial variation vs. temporal variation.
2. Visual documents: small image patches vs. frames of the video.
3. Initial feature vectors: SIFT vs. our novel descriptors for action recognition.
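As a concrete sketch of the video case: fit a generative model on per-frame descriptors (here a GMM, a common choice following Perronnin's image work), then represent each video by the gradient of the log-likelihood with respect to the model parameters. This is a minimal illustration, not the paper's exact configuration; using only the gradients with respect to the means, and the toy dimensions, are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(frame_descriptors, gmm):
    # Gradient with respect to the GMM means only -- a common simplification.
    X = np.atleast_2d(frame_descriptors)          # (T frames, D dims)
    q = gmm.predict_proba(X)                      # (T, K) soft assignments
    diff = X[:, None, :] - gmm.means_[None]       # (T, K, D) deviations
    # Posterior-weighted, variance-normalised deviations, summed over frames.
    g = (q[:, :, None] * diff / np.sqrt(gmm.covariances_[None])).sum(0)
    g /= (X.shape[0] * np.sqrt(gmm.weights_)[:, None])
    return g.ravel()                              # (K * D,) video signature

# Fit the generative model on descriptors pooled from training videos, then
# encode each video's sequence of frame descriptors as one Fisher vector.
train = np.random.rand(1000, 64)                  # toy stand-in descriptors
gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(train)
video_signature = fisher_vector(np.random.rand(120, 64), gmm)
```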

Page 15

Fisher Kernel (FK) Theory

- Combines the benefits of generative and discriminative approaches.

- Represents a signal as the gradient, with respect to the model parameters, of the log-likelihood under a learned generative model of that signal.
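In symbols, following [26]: a sample X with learned generative model p(X | λ) is mapped to its Fisher score, and two samples are compared through the Fisher information matrix F_λ:

```latex
G^X_\lambda = \nabla_\lambda \log p(X \mid \lambda), \qquad
K(X, Y) = (G^X_\lambda)^\top F_\lambda^{-1} \, G^Y_\lambda, \qquad
F_\lambda = \mathbb{E}_X\!\left[ G^X_\lambda (G^X_\lambda)^\top \right]
```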


Page 16

Results on the Rochester ADL dataset


Page 17

Conclusion


We proposed a novel descriptor that combines high-level semantic information and low-level cues.

We proposed an enhanced body-pose estimator.

We model the temporal variation with the Fisher Kernel representation.


Page 18

Thank you!

Page 19

References

1. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008, June). Learning realistic human actions from movies. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.

2. Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale–invariant spatio–temporal interest point detector. Computer Vision–ECCV 2008, 650–663.

3. Hospedales, T., Gong, S., & Xiang, T. (2009, September). A Markov clustering topic model for mining behaviour in video. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 1165–1172). IEEE.

4. Zen, G., & Ricci, E. (2011). Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.

5. Brendel, W., & Todorovic, S. (2011, November). Learning spatiotemporal graphs of human activities. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 778–785). IEEE.

Page 20

References

6. Han, M., Xu, W., Tao, H., & Gong, Y. (2004, June). An algorithm for multiple object trajectory tracking. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on (Vol. 1, pp. I–864). IEEE.

7. Messing, R., Pal, C., & Kautz, H. (2009, September). Activity recognition using the velocity histories of tracked keypoints. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 104–111). IEEE.

8. Fathi, A., & Mori, G. (2008, June). Action recognition by learning mid-level motion features. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.

9. Zhang, Y., Liu, X., Chang, M. C., Ge, W., & Chen, T. (2012). Spatio–Temporal phrases for activity recognition. Computer Vision–ECCV 2012, 707–721.

10. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. In European Conference on Computer Vision (ECCV), 2010 (pp. 508–521).

Page 21

References

11. Gaur, U., Zhu, Y., Song, B., & Roy-Chowdhury, A. (2011, November). A "string of feature graphs" model for recognition of complex activities in natural videos. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 2595–2602). IEEE.

12. Savarese, S., DelPozo, A., Niebles, J. C., & Fei–Fei, L. (2008, January). Spatial–Temporal correlatons for unsupervised action classification. In Motion and video Computing, 2008. WMVC 2008. IEEE Workshop on (pp. 1–8). IEEE.

13. Taralova, E., De la Torre, F., & Hebert, M. (2011, November). Source constrained clustering. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1927–1934). IEEE.

14. Malgireddy, M., Nwogu, I., & Govindaraju, V. (2011). A generative framework to investigate the underlying patterns in human activities. In International Conference on Computer Vision Workshops (ICCV Workshops), 2011.

15. Wong, S. F., Kim, T. K., & Cipolla, R. (2007, June). Learning motion categories using both semantic and structural information. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (pp. 1-6). IEEE.

Page 22

References

16. Liu, J., Luo, J., & Shah, M. (2009, June). Recognizing realistic actions from videos "in the wild". In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 1996–2003). IEEE.

17. Chang, M. C., Krahnstoever, N., & Ge, W. (2011, November). Probabilistic group-level motion analysis and scenario recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 747-754). IEEE.

18. Kovashka, A., & Grauman, K. (2010, June). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 2046–2053). IEEE.

19. Gilbert, A., Illingworth, J., & Bowden, R. (2009, September). Fast realistic multi-action recognition using mined dense spatio-temporal features. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 925-931). IEEE.

20. Zelniker, E., Gong, S., & Xiang, T. (2008). Global abnormal behaviour detection using a network of CCTV cameras. In The Eighth International Workshop on Visual Surveillance (VS2008).

Page 23

References

21. Gehrig, D., Kuehne, H., Woerner, A., & Schultz, T. (2009, December). HMM-based human motion recognition with optical flow data. In Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International Conference on (pp. 425–430). IEEE.

22. Sadanand, S., & Corso, J. J. (2012, June). Action bank: A high-level representation of activity in video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1234–1241). IEEE.

23. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, W., & Windridge, D. (2011, January). An evaluation of bags-of-words and spatio-temporal shapes for action recognition. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on (pp. 344-351). IEEE.

24. Shechtman, E., & Irani, M. (2011). Space-time behavior based correlation. In Computer Vision and Pattern Recognition (CVPR).

25. Mahbub, U., Imtiaz, H., Ahad, M., & Rahman, A. (2012, May). Motion clustering-based action recognition technique using optical flow. In Informatics, Electronics & Vision (ICIEV), 2012 International Conference on (pp. 919-924). IEEE.

Page 24

References

26. Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, 487–493.