learning realistic human actions from movies by abhinandh palicherla divya akuthota samish chandra...
![Page 1: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/1.jpg)
Learning realistic human actions from movies
By Abhinandh Palicherla, Divya Akuthota, Samish Chandra Kolli
![Page 2: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/2.jpg)
Introduction
Recognize natural human actions in diverse and realistic video settings.
Address the main limitation of existing work: the lack of realistic, annotated video datasets.
![Page 3: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/3.jpg)
Visual recognition has progressed from classifying toy objects towards recognizing classes of objects and scenes in natural images.
Existing datasets for human action recognition provide samples for only a few action classes.
![Page 4: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/4.jpg)
To address these limitations we implement:
• Automatic annotation of human actions
  • Manual annotation is difficult
• Video classification for action recognition
![Page 5: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/5.jpg)
Automatic annotation of human actions
• Alignment of actions in scripts and videos
• Text retrieval of human actions
• Video datasets for human actions
![Page 6: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/6.jpg)
Alignment of actions in scripts and videos
Subtitles:
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…

Movie script:
…
RICK
Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
…

Time interval transferred to the script: 01:20:17 to 01:20:23
• Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
• Subtitles (with time information) are available for most movies
• Time can be transferred to scripts by text alignment
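The time transfer described above can be sketched as a word-level alignment between subtitle text and script text. This is a minimal illustration, not the presenters' implementation: the use of `difflib.SequenceMatcher`, the function names `transfer_times` and `fill_descriptions`, and the rule for timing scene descriptions from neighboring dialogue are all assumptions.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase word tokens; apostrophes stay inside words."""
    return re.findall(r"[a-z']+", text.lower())

def transfer_times(subtitles, script_lines):
    """subtitles: list of (start, end, text). Returns (start, end) or None per script line."""
    sub_words, sub_owner = [], []
    for idx, (_, _, text) in enumerate(subtitles):
        for w in words(text):
            sub_words.append(w)
            sub_owner.append(idx)
    scr_words, scr_owner = [], []
    for idx, line in enumerate(script_lines):
        for w in words(line):
            scr_words.append(w)
            scr_owner.append(idx)

    # For every script line, collect the subtitles whose words were matched to it.
    matched = [[] for _ in script_lines]
    sm = SequenceMatcher(None, sub_words, scr_words, autojunk=False)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            matched[scr_owner[block.b + k]].append(sub_owner[block.a + k])

    return [(subtitles[min(s)][0], subtitles[max(s)][1]) if s else None
            for s in matched]

def fill_descriptions(times):
    """Assumed rule: an unmatched line spans from the previous matched start to the next matched end."""
    filled = list(times)
    for i, t in enumerate(times):
        if t is None:
            prev = next((times[j] for j in range(i - 1, -1, -1) if times[j]), None)
            nxt = next((times[j] for j in range(i + 1, len(times)) if times[j]), None)
            if prev and nxt:
                filled[i] = (prev[0], nxt[1])
    return filled

subtitles = [
    ("01:20:17,240", "01:20:20,437",
     "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598",
     "It wasn't my secret, Richard. Victor wanted it that way."),
    ("01:20:23,800", "01:20:26,189",
     "Not even our closest friends knew about our marriage."),
]
script = [
    "Why weren't you honest with me? Why did you keep your marriage a secret?",
    "Rick sits down with Ilsa.",
    "Oh, it wasn't my secret, Richard. Victor wanted it that way. "
    "Not even our closest friends knew about our marriage.",
]
times = transfer_times(subtitles, script)
filled = fill_descriptions(times)
```

Here the scene description "Rick sits down with Ilsa." matches no subtitle words, so it inherits a time interval from the dialogue around it; a tighter rule (for example, using only the first subtitle of the following speech) would narrow that interval.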
![Page 7: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/7.jpg)
Script alignment: Evaluation
Example of a “visual false positive”
A black car pulls up, two army officers get out.
• Annotate action samples in text
• Do automatic script-to-video alignment
• Check the correspondence of actions in scripts and movies
a: quality of subtitle-script matching
![Page 8: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/8.jpg)
Text Retrieval of human actions
• Large variation of action expressions in text. GetOutCar action:
  “… Will gets out of the Chevrolet. …”
  “… Erin exits her new truck …”
• Potential false positives: “… About to sit down, he freezes …”
• => Supervised text classification approach
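A supervised text classifier of this kind can be sketched with bag-of-words features and a perceptron. The slide does not name the classifier or the features, so the plain perceptron, the word and word-pair features, the helper names, and the two negative training sentences below are illustrative assumptions.

```python
import re
from collections import defaultdict

def features(sentence):
    """Set of words plus adjacent word pairs (assumed feature set)."""
    toks = re.findall(r"[a-z]+", sentence.lower())
    return set(toks) | {a + "_" + b for a, b in zip(toks, toks[1:])}

def train_perceptron(samples, epochs=20):
    """samples: list of (sentence, label) with label +1 (action) or -1 (other)."""
    w = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for sentence, label in samples:
            feats = features(sentence)
            score = bias + sum(w[f] for f in feats)
            if label * score <= 0:        # misclassified: move weights toward the label
                for f in feats:
                    w[f] += label
                bias += label
    return w, bias

def predict(model, sentence):
    w, bias = model
    score = bias + sum(w.get(f, 0.0) for f in features(sentence))
    return 1 if score > 0 else -1

# Tiny hand-made training set for the GetOutCar class (hypothetical data).
train = [
    ("Will gets out of the Chevrolet.", +1),
    ("Erin exits her new truck.", +1),
    ("About to sit down, he freezes.", -1),
    ("She walks out of the room.", -1),
]
model = train_perceptron(train)
```

With real script text, the training set would contain many manually labeled sentences per action class, and a regularized classifier would likely generalize better than this unregularized sketch.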
![Page 9: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/9.jpg)
Video Datasets for Human actions
[Table: dataset statistics; row groups labeled "12 movies" and "20 different movies"]
• Learn a vision-based classifier from the automatic training set
• Compare performance to the manual training set
![Page 10: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/10.jpg)
Video Classification for action recognition
![Page 11: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/11.jpg)
SPACE-TIME FEATURES
• Good performance for action recognition
• Compact, and provide tolerance to background clutter, occlusions and scale changes
![Page 12: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/12.jpg)
INTEREST POINT DETECTION
Harris operator - with a space-time extension.
We use multiple levels of spatio-temporal scales:
$\sigma = 2^{(1+i)/2}$, $i = 1, \dots, 6$
$\tau = 2^{j/2}$, $j = 1, 2$
I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
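As a quick check of the scale levels above, the following sketch enumerates every spatial and temporal scale combination. The variable names are illustrative, and it assumes every spatial scale is paired with every temporal scale.

```python
# Spatial scales sigma = 2^((1+i)/2) for i = 1..6, temporal scales tau = 2^(j/2) for j = 1, 2.
sigmas = [2 ** ((1 + i) / 2) for i in range(1, 7)]
taus = [2 ** (j / 2) for j in range(1, 3)]

# Assumed: detection runs at every (sigma, tau) pair.
scales = [(s, t) for s in sigmas for t in taus]

for s, t in scales:
    print(f"sigma={s:.2f}  tau={t:.2f}")
```

Under that assumption, interest points are detected at 6 x 2 = 12 scale combinations.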
![Page 13: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/13.jpg)
![Page 14: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/14.jpg)
DESCRIPTORS
• Compute histogram descriptors of the volume around each interest point.
• The volume size $(\Delta_x, \Delta_y, \Delta_t)$ is related to the detection scales by $\Delta_x, \Delta_y = 2k\sigma$, $\Delta_t = 2k\tau$.
• Each volume is divided into an $(n_x, n_y, n_t)$ grid of cuboids.
• We use $k = 9$, $n_x, n_y = 3$, $n_t = 2$.
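The relation between detection scale, volume size, and cuboid grid can be sketched as below. The function names are hypothetical, and the per-cuboid bin counts (4 for HoG, 5 for HoF) are assumptions, since the slides do not give them.

```python
def volume_size(sigma, tau, k=9):
    """Size of the descriptor volume: (2*k*sigma, 2*k*sigma, 2*k*tau)."""
    return (2 * k * sigma, 2 * k * sigma, 2 * k * tau)

def cuboid_index(offset, size, grid=(3, 3, 2)):
    """Map an (x, y, t) offset from the volume corner to its cuboid (ix, iy, it)."""
    return tuple(min(int(n * o / s), n - 1) for o, s, n in zip(offset, size, grid))

size = volume_size(sigma=2.0, tau=2.0)           # (36.0, 36.0, 36.0)
first = cuboid_index((0.0, 0.0, 0.0), size)      # corner cuboid
last_x = cuboid_index((35.0, 18.0, 18.0), size)  # right edge, middle row, second half in time

n_cuboids = 3 * 3 * 2
HOG_BINS, HOF_BINS = 4, 5                        # assumed bin counts per cuboid
hog_length = n_cuboids * HOG_BINS                # concatenated HoG descriptor length
hof_length = n_cuboids * HOF_BINS                # concatenated HoF descriptor length
```

With these assumed bin counts, concatenating the per-cuboid histograms gives a 72-dimensional HoG vector and a 90-dimensional HoF vector per interest point.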
![Page 15: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/15.jpg)
DESCRIPTORS (CONTD.)
For each cuboid, we calculate HoG and HoF (optic flow) descriptors
Very similar to SIFT descriptors, adapted to the third dimension.
![Page 16: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/16.jpg)
SPATIO-TEMPORAL BOF
• Construct a visual vocabulary using k-means, with k = 4000. (Just like what we do in hw3)
• Assign each feature to its closest vocabulary word.
• Compute a frequency histogram for the entire video, or for a subsequence defined by a spatio-temporal grid.
• If the video is divided into grid cells, concatenate the per-cell histograms and normalize.
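The word assignment and histogram steps can be sketched as follows. This is a toy illustration with a three-word, two-dimensional vocabulary rather than the 4000 words used in the slides. The vocabulary is assumed to come from a separate k-means step, and L1 normalization is an assumption.

```python
def nearest_word(feature, vocabulary):
    """Index of the vocabulary word with the smallest squared Euclidean distance."""
    return min(range(len(vocabulary)),
               key=lambda w: sum((f - c) ** 2 for f, c in zip(feature, vocabulary[w])))

def bof_histogram(features, vocabulary):
    """Count word occurrences over all features, then L1-normalize."""
    hist = [0.0] * len(vocabulary)
    for f in features:
        hist[nearest_word(f, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy vocabulary (cluster centers) and toy descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (4.0, 6.0), (5.2, 4.9)]
hist = bof_histogram(descriptors, vocabulary)
```

For a gridded video, the same function would be called once per grid cell on the features that fall inside that cell, and the resulting histograms concatenated.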
![Page 18: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/18.jpg)
GRIDS
We divide both the spatial and temporal dimensions.
• Spatial: 1x1, 2x2, 3x3, v1x3, h3x1, o2x2
• Temporal: t1, t2, t3, ot2
• 6 * 4 = 24 possible grid combinations!
• Descriptor + grid = channel.
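The channel bookkeeping above can be sketched by enumerating every descriptor and grid pairing. The lowercase descriptor labels and the `cell_of` helper are illustrative assumptions, and `cell_of` handles only regular, non-overlapping grids.

```python
from itertools import product

SPATIAL = ["1x1", "2x2", "3x3", "v1x3", "h3x1", "o2x2"]
TEMPORAL = ["t1", "t2", "t3", "ot2"]
DESCRIPTORS = ["hog", "hof"]

grids = list(product(SPATIAL, TEMPORAL))                        # 6 * 4 grid combinations
channels = [(d, s, t) for d in DESCRIPTORS for s, t in grids]   # descriptor + grid

def cell_of(x, y, t, nx, ny, nt):
    """Flat cell index for a position with coordinates normalized to [0, 1)."""
    ix, iy, it = int(x * nx), int(y * ny), int(t * nt)
    return (it * ny + iy) * nx + ix

# A feature in the right-hand, upper part of the frame, in the second half of the clip.
cell = cell_of(0.9, 0.1, 0.6, nx=2, ny=2, nt=2)
```

Overlapping grids such as o2x2 and ot2 would need a mapping in which one feature can contribute to more than one cell.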
![Page 19: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/19.jpg)
NON-LINEAR SVM
• Classification using a non-linear SVM
• Multi-channel Gaussian kernel
• V = vocabulary size, A = mean distance between training samples
• The best set of channels for a training set is found by a greedy approach.
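The kernel formula itself did not survive in this transcript. The sketch below assumes the chi-square form commonly used for multi-channel bag-of-features kernels: K(x, y) = exp(-sum over channels c of D_c(x, y) / A_c), where D_c is the chi-square distance between the channel histograms and A_c is the mean distance for channel c. The function names, and taking the mean over distinct sample pairs, are assumptions.

```python
import math

def chi2_distance(h1, h2):
    """Chi-square distance: 0.5 * sum over bins of (a - b)^2 / (a + b), skipping empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def mean_distance(histograms):
    """A_c: mean chi-square distance over all distinct pairs of training histograms."""
    pairs = [(i, j) for i in range(len(histograms)) for j in range(i + 1, len(histograms))]
    return sum(chi2_distance(histograms[i], histograms[j]) for i, j in pairs) / len(pairs)

def multichannel_kernel(x, y, mean_dist):
    """x, y: dict channel -> histogram; mean_dist: dict channel -> A_c."""
    return math.exp(-sum(chi2_distance(x[c], y[c]) / mean_dist[c] for c in x))

x = {"hog": [1.0, 0.0]}
y = {"hog": [0.0, 1.0]}
A = {"hog": mean_distance([x["hog"], y["hog"]])}   # 1.0 for this two-sample toy set
k_same = multichannel_kernel(x, x, A)
k_diff = multichannel_kernel(x, y, A)
```

A kernel matrix computed this way could then be passed to any SVM solver that accepts precomputed kernels.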
![Page 20: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/20.jpg)
WHAT CHANNELS TO USE?
• Channels may complement each other
• Greedy approach to pick the best combination
• Combining channels is more advantageous
Table: Classification performance of different channels and their combinations
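A greedy search over channel combinations can be sketched as below. The slides do not spell out the search procedure, so the add-or-remove rule and the stopping criterion are assumptions. `toy_score` and its numbers are made up for illustration and stand in for a validation accuracy.

```python
def greedy_select(channels, score):
    """Start from the empty set; apply the best single addition or removal until no gain."""
    selected = frozenset()
    best = score(selected)
    while True:
        candidates = [selected | {c} for c in channels if c not in selected]
        candidates += [selected - {c} for c in selected]
        if not candidates:
            break
        top = max(candidates, key=score)
        if score(top) <= best:
            break
        selected, best = top, score(top)
    return selected, best

# Hypothetical per-channel quality; not results from the presentation.
GAIN = {"hog_1x1_t1": 0.30, "hof_1x1_t1": 0.35, "hof_h3x1_t1": 0.33}

def toy_score(subset):
    """Made-up score: best channel, bonus for mixing HoG and HoF, small size penalty."""
    if not subset:
        return 0.0
    base = max(GAIN[c] for c in subset)
    bonus = 0.05 if len({c.split("_")[0] for c in subset}) == 2 else 0.0
    return base + bonus - 0.01 * len(subset)

selected, best = greedy_select(list(GAIN), toy_score)
```

In this toy setup the search picks the strongest single channel first and then adds a complementary descriptor, which mirrors the observation that combined channels can outperform individual ones.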
![Page 21: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/21.jpg)
EVALUATION OF SPATIO-TEMPORAL GRIDS
Figure: Number of occurrences for each channel component within the optimized channel combinations for the KTH action dataset and our manually labeled movie dataset
![Page 22: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/22.jpg)
RESULTS WITH THE KTH DATASET
Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented
![Page 23: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/23.jpg)
RESULTS WITH THE KTH DATASET
• 2391 sequences divided into the training/validation set (8+8 people) and test set (9 people)
• 10-fold cross-validation
Table: Confusion matrix for the KTH actions
![Page 24: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/24.jpg)
ROBUSTNESS TO NOISE IN THE TRAINING DATA
Up to p=0.2 the performance decreases insignificantly
At p=0.4 the performance decreases by around 10%
Figure: Performance of our video classification approach in the presence of wrong labels
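One way to simulate wrong labels for such a robustness experiment is to relabel a fixed fraction p of the training samples. The exact protocol is not given in the slides, so the relabeling rule, the function name `corrupt_labels`, and the example labels are assumptions.

```python
import random

def corrupt_labels(labels, p, classes, seed=0):
    """Replace a fraction p of the labels with a different, randomly chosen class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_wrong = round(p * len(labels))
    for i in rng.sample(range(len(labels)), n_wrong):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy

classes = ["walking", "jogging", "running"]
clean = ["walking"] * 5 + ["running"] * 5
noisy_02 = corrupt_labels(clean, 0.2, classes)
noisy_04 = corrupt_labels(clean, 0.4, classes)
changed_02 = sum(a != b for a, b in zip(clean, noisy_02))
changed_04 = sum(a != b for a, b in zip(clean, noisy_04))
```

Training the classifier on `noisy_02` and `noisy_04` and testing on clean labels would reproduce the kind of curve the figure describes.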
![Page 25: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/25.jpg)
ACTION RECOGNITION IN REAL-WORLD VIDEOS
Table: Average precision (AP) for each action class of our test set. Results are compared for clean (manually annotated) and automatic training data, and for a random classifier (chance)
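Average precision for one action class can be computed from classifier scores as sketched below. Several AP definitions exist, and the slides do not say which one was used. This version averages the precision at the rank of each positive sample. The function name and example scores are illustrative.

```python
def average_precision(scores, labels):
    """Mean of precision values taken at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Four test clips: classifier confidence and whether the clip shows the action.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)   # (1/1 + 2/3) / 2
```

A chance baseline can be estimated by shuffling the scores and averaging the resulting AP over many repetitions.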
![Page 26: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/26.jpg)
ACTION RECOGNITION IN REAL-WORLD VIDEOS
Figure: Example results for action classification trained on the automatically annotated data. We show the key frames for test movies with the highest confidence values for true/false positives and negatives
• The rapid getting up is typical for “GetOutCar”
• The false negatives are very difficult to recognize:
  • occluded handshake
  • hardly visible person getting out of the car
![Page 27: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/27.jpg)
CONCLUSIONS
Summary
• Automatic generation of realistic action samples
• Transfer of recent bag-of-features experience to videos
• Improved performance on the KTH benchmark
• Decent results for actions in real-world videos
Future direction
• Improving the script-video alignment
• Experimenting with space-time low-level features
• Internet-scale video search
![Page 28: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli](https://reader035.vdocuments.mx/reader035/viewer/2022081507/551c2fee550346b24f8b634a/html5/thumbnails/28.jpg)
THANK YOU