people.uwplatt.edupeople.uwplatt.edu/.../f12/ryan_wendel_human_activity_… · web viewthese...

Human Activity Analysis

Ryan Wendel
Computer Science

University of Wisconsin - [email protected]

October 30th, 2012

Abstract

The goal of Human Activity Analysis is to recognize ongoing activities from a sequence of image frames (video) and to classify the actions and interactions of the people in it. It is an important area of visual research in Computer Science, with applications that include computer interfaces, surveillance systems, and the monitoring of patients, babies, and elderly people, among many others. These applications draw on artificial intelligence, image processing, computer vision, and pattern recognition. In this report I will go over how human activity recognition works, how common human movement patterns are detected, and how these patterns are categorized.

Introduction

Human Activity Analysis is an important topic today because it drives the state-of-the-art visual research needed to keep improving techniques for monitoring individuals. Today it is used primarily in airport security, movie making, and other forms of monitoring. Its objective is to classify a video segment as a specific type of action through ongoing frame-by-frame analysis. An action is a sequence of body-part states [1]. Most of the video recognition is pulled from 3-D graphics engines.

There are a number of approaches to the different human activity recognition problems. This paper covers the approaches that have been developed for simple human actions and for high-level activities. These approaches fall into two primary categories: the Single-Layered approach and the Hierarchical approach.

The Single-Layered approach is the low-level solution. It can be split into two categories: the Space-Time approach and the Sequential approach. The Space-Time approach has three sub-categories: Space-Time volume, Space-Time trajectories, and Space-Time features. The Sequential approach has two sub-categories: Exemplar-Based and State-Based. These low-level components, such as tracking and body posture analysis, are essential for understanding human motion [1].

The second main approach, as stated above, is the Hierarchical approach. This approach uses the Statistical, Syntactic, and Description-based approaches to solve high-level human activity recognition [1].

Single-Layered Approach

The main objective of the single-layered approaches has been to analyze relatively simple (and short) sequential movements of humans, such as walking, jumping, and waving [1]. The Single-Layered approach recognizes activities from features pulled directly from the video data. Most single-layered approaches have adopted a sliding-window technique that classifies all possible subsequences. Single-layered approaches are most effective when a particular sequential pattern that describes an activity can be captured from training sequences [1]. The Single-Layered approach is broken into two classes according to how the video is modeled: the Space-Time approach and the Sequential approach.
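The sliding-window technique mentioned above can be sketched as follows. This is only a minimal illustration under my own assumptions: each frame is reduced to a single made-up "motion energy" number, and `toy_classifier` is a hypothetical stand-in for a real single-layered recognizer.

```python
from typing import Callable, List, Sequence, Tuple


def sliding_window_labels(
    frames: Sequence[float],
    window: int,
    stride: int,
    classify: Callable[[Sequence[float]], str],
) -> List[Tuple[int, int, str]]:
    """Classify every subsequence of `window` frames, moving `stride` frames at a time."""
    results = []
    for start in range(0, len(frames) - window + 1, stride):
        segment = frames[start:start + window]
        results.append((start, start + window, classify(segment)))
    return results


def toy_classifier(segment: Sequence[float]) -> str:
    """Hypothetical classifier: a high average motion energy means 'waving'."""
    mean_energy = sum(segment) / len(segment)
    return "waving" if mean_energy > 0.5 else "idle"


# Made-up per-frame motion energies for a six-frame video.
frames = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
labels = sliding_window_labels(frames, window=3, stride=1, classify=toy_classifier)
for start, end, label in labels:
    print(f"frames {start}..{end - 1}: {label}")
```

The window length and stride are tuning choices; a smaller stride checks more subsequences at a higher cost.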

Space-Time Approach

Space-Time approaches recognize human activities by looking at the space-time volumes of activity videos. They do so in three different ways: Space-Time volume, Space-Time trajectories, and Space-Time features. In the Space-Time approach, a video is modeled as a 3-D XYT volume (two spatial dimensions plus time), and an activity is recognized by comparing two such volumes [1]. Existing methods typically rely on a predefined spatial binning of the local descriptors to impose spatial information beyond a pure "bag-of-words" model [3].

Space-Time Volume

An example of a Space-Time volume is shown in Figure 1. Figure 1 shows a sequence of images captured from an activity video and stacked along the time axis to form a space-time volume. Recognition then uses Space-Time volume matching, which measures the similarity between two volumes such as those in Figure 1.

Fig. 1: Example XYT volumes constructed by concatenating (a) entire images and (b) foreground blob images obtained from a punching sequence. [1]
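To make the idea of volume matching concrete, here is a small sketch. It assumes each frame is already a binary foreground mask and that both volumes have the same size. The overlap score used here (intersection over union of foreground voxels) is my own simple choice for illustration, not the specific measure used in [1].

```python
from typing import List

Frame = List[List[int]]   # one binary foreground mask (Y rows by X columns)
Volume = List[Frame]      # frames stacked along the time axis (T x Y x X)


def stack_frames(frames: List[Frame]) -> Volume:
    """Concatenate 2-D masks along the time axis to form an XYT volume."""
    return [[row[:] for row in frame] for frame in frames]


def volume_overlap(a: Volume, b: Volume) -> float:
    """Intersection over union of the foreground voxels of two same-sized volumes."""
    intersection = 0
    union = 0
    for frame_a, frame_b in zip(a, b):
        for row_a, row_b in zip(frame_a, frame_b):
            for voxel_a, voxel_b in zip(row_a, row_b):
                if voxel_a and voxel_b:
                    intersection += 1
                if voxel_a or voxel_b:
                    union += 1
    return intersection / union if union else 1.0


# Tiny made-up volumes: two frames of 2 x 2 pixels each.
model = stack_frames([[[0, 1], [0, 0]], [[0, 1], [0, 1]]])
observed = stack_frames([[[0, 1], [0, 0]], [[0, 0], [0, 1]]])
score = volume_overlap(model, observed)
print(round(score, 3))
```

A score near 1 means the observed volume closely matches the stored model; a real system would first align the two volumes in space and time.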

Space-Time Trajectories

Another method used by the Space-Time approach is Space-Time trajectories. Space-Time trajectories use stick-figure modeling to extract the joint positions of a person in each frame of a video. Figure 2 shows the trajectories of a number of key joint positions plotted in 3-D space; trajectory shapes are compared to classify human actions [1]. The joint positions can be used to determine the actions taken by a person. The advantage of this approach is the ability to track and analyze human movements.

Fig. 2: An example of trajectories of human joint positions when performing the human action of walking [Sheikh et al. 2005] (© 2005 IEEE). Figure (a) shows trajectories in XYZ space, and (b) shows those in XYT space. [1]
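Comparing trajectory shapes can be sketched as a nearest-template search. The example below assumes the joint positions have already been extracted, that every trajectory has the same number of frames and joints, and that the two stored templates are made-up data. The mean Euclidean distance is a simple stand-in for the more careful shape comparisons described in [1].

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[List[Point]]  # one list of joint positions per frame


def trajectory_distance(a: Trajectory, b: Trajectory) -> float:
    """Mean Euclidean distance between corresponding joints over all frames."""
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(a, b):
        for joint_a, joint_b in zip(frame_a, frame_b):
            total += math.dist(joint_a, joint_b)
            count += 1
    return total / count


def classify_trajectory(query: Trajectory, templates: Dict[str, Trajectory]) -> str:
    """Return the label of the stored template closest to the query."""
    return min(templates, key=lambda label: trajectory_distance(query, templates[label]))


# Hypothetical templates: two joints tracked over two frames.
templates = {
    "walking": [[(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (2.0, 0.0)]],
    "jumping": [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]],
}
query = [[(0.0, 0.0), (1.0, 0.0)], [(0.9, 0.1), (2.0, 0.0)]]
action = classify_trajectory(query, templates)
print(action)
```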

Space-Time Features

The last approach is Space-Time features. Space-Time features use a predefined set of models to identify the action in a video. When a video is analyzed frame by frame, the features extracted from it are compared to the predefined set of models to determine what type of action is being performed. An example is shown below in Figure 3. Each frame has a set of interest points, shown as dots and otherwise known as features. These features are compared to those of the stored models, and when they match closely we can determine which type of action is being performed.

Fig. 3: The most discriminative space-time neighborhoods of local descriptors (denoted by circles) may depend on the activity category. [3]
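One common way to compare sets of local features is the "bag-of-words" model mentioned earlier. The sketch below is a plain bag-of-words illustration, not the neighborhood hierarchy of [3]. The codebook, the descriptors, and the model histograms are all made-up values, and histogram intersection is my own choice of similarity.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Descriptor = Tuple[float, float]


def quantize(descriptor: Descriptor, codebook: Sequence[Descriptor]) -> int:
    """Index of the codebook entry ('visual word') nearest to the descriptor."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def bag_of_words(descriptors: Sequence[Descriptor], codebook: Sequence[Descriptor]) -> List[float]:
    """Normalized histogram of visual words over all interest points in a video."""
    counts = Counter(quantize(d, codebook) for d in descriptors)
    return [counts[i] / len(descriptors) for i in range(len(codebook))]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity between two normalized histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical two-word codebook and four interest-point descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
histogram = bag_of_words(descriptors, codebook)

# Hypothetical stored models, one histogram per action.
models: Dict[str, List[float]] = {"waving": [0.2, 0.8], "walking": [0.9, 0.1]}
best = max(models, key=lambda label: histogram_intersection(histogram, models[label]))
print(histogram, best)
```

Because the histogram throws away where and when each feature occurred, it shows why methods such as [3] add spatial and temporal neighborhood information on top of it.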

Each Space-Time method has a disadvantage. With Space-Time volumes, it can be hard to differentiate between multiple people in the same scene. Space-Time trajectories require a strong low-level component that can estimate 3-D joint locations, and 3-D body-part detection and tracking is still an unsolved problem. Space-Time features are not suitable for modeling complex activities.

Sequential Approach

The second class in the single-layered approach is the Sequential approach. Whereas Space-Time approaches recognize human activities by looking at the space-time volumes of activity videos, a Sequential approach recognizes human actions by analyzing a sequence of features. It views an input video as a sequence of observations and determines what kind of activity is occurring from that sequence. Each frame in an action video is used to describe a particular body-part configuration [1]. After this information has been extracted, each frame is represented as a feature vector, and the vectors are compared to measure how close the observed sequence is to a stored model.

Exemplar-Based Approach

One way to carry out the Sequential approach is the Exemplar-Based approach. The Exemplar-Based approach stores sample executions of the actions that could be performed in a video. When a new video is examined, its feature vectors are compared against those of the stored samples. An example is shown below in Figure 4. The numbers on the top row represent the sequence of poses in one execution of an action, and the numbers on the bottom row represent the sequence of poses in a different execution of the same action. Even though the two sequences are performed at different rates, matching their poses shows that they represent the same action.

Fig. 4: An example matching between two “stretching a leg” sequences with different nonlinear execution rates. Each number represents a particular status (i.e., pose) of the person. [1]
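Matching two executions that run at different rates is typically done with dynamic programming. The sketch below uses dynamic time warping (DTW), which is my own choice of algorithm for illustration. It assumes each frame has already been reduced to an integer pose label like the numbers in Figure 4, and it uses a simple 0/1 cost between poses.

```python
from typing import Callable, Sequence


def dtw_distance(
    a: Sequence[int],
    b: Sequence[int],
    cost: Callable[[int, int], float] = lambda x, y: 0.0 if x == y else 1.0,
) -> float:
    """Dynamic time warping distance between two pose sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # table[i][j] = cheapest alignment of the first i poses of a with the first j poses of b
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            table[i][j] = step + min(
                table[i - 1][j],      # pose a[i-1] repeats against b[j-1]
                table[i][j - 1],      # pose b[j-1] repeats against a[i-1]
                table[i - 1][j - 1],  # both sequences advance together
            )
    return table[n][m]


# Made-up pose sequences: the same action performed at two different rates.
exemplar = [1, 2, 3, 4]
slow_execution = [1, 1, 2, 3, 3, 4]
reversed_action = [4, 3, 2, 1]

print(dtw_distance(exemplar, slow_execution))   # 0.0: same action, different rate
print(dtw_distance(exemplar, reversed_action))  # greater than 0: a different action
```

With real feature vectors instead of pose labels, the `cost` argument would be replaced by a distance between vectors.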

State Model-Based

The second way to carry out the Sequential approach is the State Model-Based approach. A state model introduces parameters to compute the likelihood of observed video sequences [2]. The state transitions of motion parameters are modeled using the continuous density hidden Markov model (HMM) [2]. The model assigns a probability to each sequence it can generate. Figure 5 shows a model in which each state corresponds to a pose and the sequence of states makes up an action [1]. A state model-based approach handles probabilistic analysis of an activity better, because moving to a new state is governed by a probability, but the exemplar-based approach is more flexible when comparing multiple sample sequences because of its dynamic programming algorithms [1].

Fig. 5: The model is one of the simplest cases among HMMs, which is designed to be strictly sequential. Each actor image in the figure represents a pose with the highest observation probability. [1]
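The likelihood that a state model assigns to an observed sequence can be computed with the standard forward algorithm. The sketch below is a simplification under my own assumptions: it uses discrete observation symbols rather than the continuous densities of [2], and the two-state, strictly sequential model and all of its probabilities are made up.

```python
from typing import Dict, List, Sequence


def forward_likelihood(
    observations: Sequence[str],
    start: List[float],
    transition: List[List[float]],
    emission: List[Dict[str, float]],
) -> float:
    """Probability of the observation sequence under a discrete HMM (forward algorithm)."""
    states = range(len(start))
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emission[s][observations[0]] for s in states]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[prev] * transition[prev][s] for prev in states) * emission[s][symbol]
            for s in states
        ]
    return sum(alpha)


# Hypothetical strictly sequential model: state 0 (arm stretching) then state 1 (arm withdrawing).
start = [1.0, 0.0]
transition = [[0.5, 0.5], [0.0, 1.0]]  # state 1 never returns to state 0
emission = [
    {"stretch": 0.9, "withdraw": 0.1},
    {"stretch": 0.1, "withdraw": 0.9},
]

in_order = forward_likelihood(["stretch", "stretch", "withdraw", "withdraw"], start, transition, emission)
out_of_order = forward_likelihood(["withdraw", "withdraw", "stretch", "stretch"], start, transition, emission)
print(in_order, out_of_order)
```

To recognize an action, one such model would be trained per action, and the model that gives the highest likelihood for the observed sequence would be chosen.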

In conclusion, among the low-level solutions a Sequential approach is able to handle and detect more complex activities, but the Space-Time approach handles simpler activities more efficiently. Both methods work from a sequence of images; they just process the images in different ways. The Sequential approach treats the video as a sequence of events, whereas the Space-Time approach compares two space-time volumes.

Hierarchical Approach

The Single-Layered approach is used to solve low-level problems, while the Hierarchical approach is used to solve high-level ones. It allows the recognition of high-level activities based on the recognition results of other, simpler activities. A hierarchical approach can recognize high-level activities with a more in-depth structure, requires significantly less data to recognize an activity than a single-layered approach, and makes it easier to incorporate human knowledge. The Hierarchical approach is carried out with three different methods: the Statistical, Syntactic, and Description-Based approaches.

Statistical Approach

Statistical approaches use the low-level State Model-Based approach to recognize activities. By stacking multiple layers of state-based models, you can recognize activities with sequential structures. The activities are described in terms of sub-events [1]. Concurrent sub-events must be represented in order to recognize high-level activities with a complex structure, but most previous approaches, including statistical approaches (using hidden Markov models (HMMs)) and syntactic approaches (using stochastic context-free grammars), were limited in recognizing activities with concurrent sub-events [4]. Using these high-level approaches allows high accuracy in detecting interactions between two persons. The Statistical approach is useful when the structure of the activity is sequential and when integrating dynamics, but it is not useful for complex temporal structures or deep hierarchical structures [1].

Fig. 5: An example hierarchical hidden Markov model (HHMM) for recognizing the activity of punching. The model is composed of two layers. In the lower layer, HMMs are used to recognize various atomic-level activities, such as stretching and withdrawing. The upper layer HMM treats recognition results of the lower layer HMMs as an input, recognizing punching when stretching and withdrawing occur in a sequence. [1]

Syntactic Approach

Syntactic approaches model human activities as multiple production rules generating a string of symbols, and adopt parsing techniques from the field of programming languages to recognize the activities from a given string [5]. In other words, a human activity is represented as a set of production rules that generates a string of actions. Syntactic approaches are able to probabilistically recognize hierarchical activities composed of sequential sub-events, but are inherently limited on activities composed of concurrent sub-events [5]. The Syntactic approach is useful with deep hierarchical structures and repetitive structures. However, it does not cope well with errors and uncertainty, so the system has to assume that the input it receives does not contain many errors.

Fig. 6: Fighting is defined as any number of consecutive punching actions which can be decomposed into stretching and withdrawal similar to Figure 5. [1]
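The grammar in Figure 6 can be sketched as a tiny parser. This is a deterministic simplification under my own assumptions: real syntactic approaches attach probabilities to the rules, the symbol names `stretch` and `withdraw` are my labels for the atomic actions, and I assume at least one punch is needed for fighting.

```python
from typing import List, Optional

# Assumed production rules, modeled on Figure 6:
#   Fighting -> Punch Fighting | Punch
#   Punch    -> "stretch" "withdraw"


def parse_punch(symbols: List[str], position: int) -> Optional[int]:
    """Try to read one Punch at `position`; return the next position or None."""
    if symbols[position:position + 2] == ["stretch", "withdraw"]:
        return position + 2
    return None


def is_fighting(symbols: List[str]) -> bool:
    """True if the whole string is one or more consecutive punches."""
    position = 0
    punches = 0
    while position < len(symbols):
        next_position = parse_punch(symbols, position)
        if next_position is None:
            return False  # a single unexpected symbol rejects the whole string
        position = next_position
        punches += 1
    return punches >= 1


print(is_fighting(["stretch", "withdraw", "stretch", "withdraw"]))  # True
print(is_fighting(["stretch", "stretch", "withdraw"]))              # False
```

The second call shows the weakness noted above: one wrongly detected atomic action is enough to make the parse fail.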

Description-Based Approach

The Description-Based approach incorporates humans' conceptual knowledge of the structure of human activities into the recognition process, by enabling the system to maintain formal programming language-like representations of human activities [4]. It targets the recognition of activities with complex spatio-temporal structures (structures that combine the spatial and temporal relationships among sub-events) and uses context-free grammars (CFGs) to represent the activities. The CFGs are used to recognize high-level activities. Figure 7 shows how a human activity is represented by decomposing it into multiple sub-events and by specifying their temporal, spatial, and logical relationships. A sub-event of one activity may itself be composed of multiple sub-events, capturing the hierarchical structure of human activities [4].

Fig. 7: (a) is a conceptual illustration describing the activity's temporal structure, whose sub-events are organized sequentially as well as concurrently. Following the CFG, we convert this into a formal representation as shown in (b). [1]
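A description of an activity in terms of temporal relationships between sub-events can be sketched with time intervals. The example below is only an illustration under my own assumptions: the `Interval` type, the three predicates, and the hypothetical "handshake" description are mine, not the representation language of [4].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Time span (in frames) during which a detected sub-event occurs."""
    start: int
    end: int


def before(a: Interval, b: Interval) -> bool:
    """Sub-event a finishes before sub-event b begins (sequential)."""
    return a.end < b.start


def concurrent(a: Interval, b: Interval) -> bool:
    """Sub-events a and b share at least some time (concurrent)."""
    return a.start < b.end and b.start < a.end


def is_handshake(stretch_1: Interval, stretch_2: Interval, withdraw: Interval) -> bool:
    """Hypothetical description: both persons stretch an arm at the same time,
    and the arms are withdrawn only after both stretches have finished."""
    return (
        concurrent(stretch_1, stretch_2)
        and before(stretch_1, withdraw)
        and before(stretch_2, withdraw)
    )


# Made-up detections of sub-events, given as frame intervals.
print(is_handshake(Interval(0, 5), Interval(2, 6), Interval(8, 10)))  # True
print(is_handshake(Interval(0, 2), Interval(5, 6), Interval(8, 10)))  # False
```

Unlike the sequential models earlier in the paper, this kind of description can state directly that two sub-events happen at the same time.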

Conclusion

In conclusion, this is a very large and expansive topic, and there is still a lot of research to be done in the field of activity analysis. Human Activity Analysis techniques have been studied over the past two decades, driven by the increase in security software among other things. Activity analysis has been constantly improving in the medical and security worlds, and the technology continues to improve and grow. Some of the big areas of growth are going to be in military operations. When we are able to further improve resolution at far distances, we will be able to better detect what sort of activities are being captured in images taken from aerial devices. Of course, there are many different applications that will surface from the improvement of activity analysis: intelligent driving, real-time surveillance, multiple cameras, continuous streams, and video searching. The future of human activity analysis research will be driven by applications [1].

There are some hurdles that will have to be overcome in the areas of real-time processing, activity context, and interactive learning. For high-level approaches, these hurdles correspond to limited processing power, activities involving interactions among humans, objects, and scenes, and learning by generating questions [1]. These are core areas in high-level approaches that can still use lots of improvement.

This paper primarily talked about the different techniques that are used to solve different situations. Single-Layered approaches recognize actions with Space-Time approaches and Sequential approaches, and Hierarchical approaches recognize activities with Statistical, Syntactic, and Description-Based approaches. I hope you have learned some of the techniques used for understanding how videos are analyzed and how some techniques are applied. Thank you for reading.

References

[1] J. K. Aggarwal and M. S. Ryoo (April 2011). Human activity analysis: A review. ACM Comput. Surv. 43 (3), Article 16, 43 pages. DOI=10.1145/1922649.1922653 http://doi.acm.org/10.1145/1922649.1922653

[2] Xinding Sun, Ching-Wei Chen, and B. S. Manjunath (2002). Probabilistic motion parameter models for human activity recognition. Pattern Recognition, 2002. Proceedings. 16th International Conference on. Vol. 1, pp. 443-446.

[3] Adriana Kovashka and Kristen Grauman (2010). Learning a Hierarchy of Discriminative Space-Time Neighborhood Features for Human Action Recognition. [ONLINE] Available at: http://www.cs.utexas.edu/~adriana/cvpr2010/. [Last Accessed 28 November 12].

[4] M. S. Ryoo (2008). Semantic representation and recognition of human activities. The University of Texas at Austin. ProQuest Dissertations and Theses, 205. Retrieved from http://search.proquest.com/docview/304474642?accountid=9253

[5] M. S. Ryoo and J. K. Aggarwal (April 2009). Semantic Representation and Recognition of Continued and Recursive Human Activities. International Journal of Computer Vision. 82, pp. 1-24.
