

SN Computer Science (2021) 2:151 https://doi.org/10.1007/s42979-021-00576-x


ORIGINAL RESEARCH

Sparse Deep LSTMs with Convolutional Attention for Human Action Recognition

Atefe Aghaei¹ · Ali Nazari¹ · Mohsen Ebrahimi Moghaddam¹

Received: 1 October 2020 / Accepted: 9 March 2021 / Published online: 19 March 2021 © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021

Abstract
Deep learning has recently achieved remarkable results in action recognition. In this paper, an architecture is proposed for action recognition consisting of a ResNet feature extractor, Conv-Attention-LSTM, BiLSTM, and fully connected layers. Furthermore, a sparse layer is added after each LSTM layer to overcome overfitting. In addition to RGB images, optical flow is used to incorporate motion information into our architecture. Due to the similarity of consecutive frames, video sequences are divided into equal parts, and frames of successive parts are used as consecutive frames to obtain the flow. Furthermore, to find the significant regions, a convolutional attention network is applied. The proposed method is evaluated on two popular datasets, UCF-101 and HMDB-51, and its accuracy on these datasets is 95.24 and 71.62, respectively. Based on the results achieved, overfitting is reduced by using a sparse layer instead of dropout. Moreover, a deep LSTM network leads to a higher recognition rate than a one-layer LSTM.

Keywords Action recognition · Deep learning · Deep LSTM · BiLSTM · Convolutional attention · Sparse layer · Optical flow

Introduction

With the increase of digital video cameras in everyday life, more and more video content is produced and shared on the internet or stored in large video datasets. Categorizing this rich video content based on actions is suitable for organizing videos. Detecting video-based activity is a challenging problem in machine vision and has attracted much attention from the computer science community. The primary purpose is to detect, identify, and analyze human activities, object–human interactions, and human–human interactions from video sequences [1].

Video-based action recognition labels human activities in video sequences and has become widespread in recent years in different fields of psychology, medicine, sports, and computer science, including surveillance and monitoring systems, detecting and analyzing behavior, and detecting movements in group activities [2]. Activity, according to its complexity, is generally divided into four groups: gestures (a movement of body parts, such as shaking hands), actions (a combination of several basic movements performed by a person, such as walking), interactions (an activity conducted by a person with an object or someone else, such as hugging or eating), and group activities (the combination of several activities of one or more individuals, the most complex group).

Sensor-based methods make up another group of human action recognition methods. In this group, in addition to RGB data, depth data are also captured by the sensor. Using depth sensors has many advantages over RGB in some applications, including object detection, background segmentation, and human action recognition [3]. Some sensor-based methods use Microsoft Kinect [3–5]. These methods have high accuracy and low time complexity, but due to their high cost, their use is limited to specific applications [6].

Deep learning is divided into three categories: generative or unsupervised tasks, supervised tasks, and deep hybrid networks [7]. In this study, the second category, supervised tasks, is considered.

* Mohsen Ebrahimi Moghaddam [email protected]

Atefe Aghaei [email protected]

Ali Nazari [email protected]

1 Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran


Deep learning approaches in Human Action Recognition (HR) are used to extract features and classify activities. Multi-layered networks automatically extract features [8]. The convolutional neural network (CNN) is one of the most frequently used deep neural networks [9]. It is a feed-forward neural network and has many image- and video-processing applications. This network has also been extended into other networks, such as 2DCNNs and 3DCNNs. 3DCNN has mostly been employed in 3-dimensional applications such as speech and video processing and spatiotemporal feature extraction. One of its significant drawbacks is its computational complexity, because the convolutions, filters, and pooling layers are 3D and increase the number of parameters.

RNNs are another type of deep neural network with a recurrent structure. This structure allows the network to model temporal dynamics; therefore, it can be useful for video classification [10]. A major problem of these networks is the vanishing or exploding gradient [11]. Long Short-Term Memory (LSTM) was introduced to deal with this problem [12]. Another RNN is the bidirectional LSTM, introduced in [13], which allows the network to use information from subsequent positions.

Although neural networks are very useful in many respects, they suffer from overfitting. Overfitting occurs when the error on the training data is decreasing (accuracy is increasing) while the error on the test data is increasing (accuracy is decreasing). Despite the many methods introduced, overfitting is still a challenging problem in complex networks, because the method applied to reduce overfitting should not degrade accuracy. In machine learning, regularization is usually employed to avoid overfitting. Several different regularization methods have been introduced; however, only some of them are effective in practice. In this paper, to overcome this problem, we prune the weights and connections using sparse layers between LSTM layers.

To recognize the movement of the actors, optical flow is added to the network. Optical flow plays a significant role in detecting activities because it provides information about the velocity of objects in videos. Actions happen slowly, normally, or quickly and have different speeds across consecutive frames. This motion information can be discriminative, useful in recognition, and encoded by convolutional neural networks as a motion representation [14]. Obtaining it from RGB channels, as in the RGB-based 3D CNN methods introduced in [15], is costly, since more convolutional layer parameters must be trained to learn the motion representation. This information can instead be provided in the feature extraction phase with efficient optical flow methods. Other advantages of optical flow are discussed in [16].

The contributions of this paper are as follows. A combination of ResNet-152, to extract features from RGB and optical flow images, and a two-layered LSTM, including one standard LSTM and one bidirectional LSTM, is used. Also, an attention layer is added to the model because it helps to produce features that are effective for classification. Convolutional attention LSTM, inspired by [17], is used in this paper because experimental results demonstrate its higher accuracy compared with soft attention. Finally, two sparse layers have been added between the LSTM layers to prune the LSTM weights and the outputs of each layer.

In the rest of the paper, “Related Works” introduces similar works that have been conducted for action recognition. In “The Proposed Algorithm”, first, an overview of RNNs and sparsity is given; then, the proposed algorithm is fully explained. In “Discussion and Results”, the evaluation method and the results on two popular datasets, UCF-101 and HMDB-51, are presented. In the last section, the conclusion is presented.

Related Works

Recent methods for action recognition are divided into two categories: traditional methods and deep learning methods. A brief description of each is presented in the following.

Traditional action recognition models, which are hand-crafted methods, include two main steps. The first step is to extract features from the given video sequences. The second step is to classify actions using the features extracted in the first step [18]. In the first step, hand-crafted spatial and temporal interest points are detected, and then descriptors are extracted. Methods used for detecting interest points include the Harris 3D detector [19] and the dense sampling detector [20]. Methods used to extract descriptors include the Cuboid descriptor [21], the HOG/HOF (Histogram of Oriented Gradients / Histogram of Optical Flow) descriptor [22], the HOG3D descriptor [23], and sparse spatiotemporal features with sparse representation [24]. For the second step, classification, various methods such as SVM and KNN have been used. Although these methods have achieved good results, due to environmental complexity, intra-class variance, and inter-class similarity, there are still large gaps to be closed before a robust video analysis system is reached.

In recent years, deep learning for classification, object detection, and natural language processing has attracted many researchers’ attention [25–28]. It is used in HR to extract features, to classify actions, and to enhance performance in video representation. The difference from hand-crafted methods is that the features are trainable and automatically extracted from raw data. One of the benefits of deep learning techniques over traditional methods is their ability to recognize high-level activities with very complex structures. One common deep neural network approach for HR is the convolutional neural network (CNN) [9].


Sharma et al. [29] used a CNN model on RGB data and achieved good results. Other CNNs have also been introduced and presented for HR [30, 31]. However, CNN networks generally extract features frame by frame independently instead of considering the interaction of temporal features. Therefore, 3DCNN was first introduced by [18]. The 3D convolutional neural network is an extension of the convolutional neural network to the spatiotemporal domain. In addition to the spatial features of each frame, it employs the temporal changes between consecutive frames, which leads to better results in video processing. This method has been considered by many researchers in activity recognition [18, 32, 33]. Other convolutional neural networks introduced in this field are two-stream neural networks and two-stream 3D convolutional neural networks [34, 35]. This architecture combines RGB video frames and optical flow.

3DCNN has high computational complexity and cost, and it also needs more data for training. Therefore, recurrent neural networks can be a reasonable substitute to model the differences between frames [36]. The recurrent structure of RNNs enables the network to retain the behavior of temporal transformations. Both LSTM and GRU can handle long-term dependencies and have a similar internal structure. Bi-LSTM and Bi-GRU can also keep long-term dependencies between later feature vectors and earlier ones. GRU has less computation time than LSTM because it consists of two gates (reset and update) whereas LSTM is composed of three gates (input, output, and forget). Empirically, GRU outperformed LSTM on some datasets and LSTM did on others [37, 38]. Weiss et al. [39] showed that LSTM with ReLU is stronger than GRU in the context of counting languages and k-counter machines. A lot of work has been conducted combining RNNs and CNNs for HR [40]. Ma et al. [41] proposed Temporal Segment LSTM (TS-LSTM), including both RNN and temporal CNN, for obtaining spatiotemporal information. Correlational Convolutional LSTM (C2LSTM) uses cross-convolution and correlation operators to extract both spatial and motion features [42]. Bidirectional LSTM is useful in extracting bidirectional temporal features of human activities [43]. Attention LSTM was introduced to concentrate on the important parts of the body [29]. A new approach named Attention-based Temporal Weighted CNN (ATW) was proposed in [44]; the model includes a temporal weighted CNN, which uses multi-stream inputs and visual attention models. Also, attention again, including two-stream LSTM, was proposed in [45]. Hierarchical Bi-LSTM includes separated spatiotemporal attention [36]. These papers show that the combination of LSTM and attention mechanisms results in high accuracy.

Deep learning methods, especially RNNs, have also achieved good results in sensor-based human activity recognition. For example, Saini et al. [4] used Bi-LSTM to monitor the interaction between two people using Kinect for a healthcare system and achieved good results. In [5], Kumar et al. used deep learning for gait recognition. They introduced a method based on 3DCNN and LSTM to train the network on one walking pattern and test it on others. They also used the Grey Wolf Optimizer evolutionary algorithm to increase the efficiency of the model. The authors of [3] used BLSTM to recognize continuous human activities. Using Kinect to capture 3D activity sequences, they first segmented the activities into two classes, sitting and standing, and then used BLSTM and a Hidden Markov Model to categorize consecutive activities performed from sitting to standing.

To avoid overfitting, dropout is a commonly considered method. It was popularized by Hassibi et al. and LeCun et al. [46, 47]. The authors of [48, 49] proposed dropout for the fully connected layers of a neural network as regularization, randomly setting a subset of activations to zero during training. Besides, the DropConnect method, an extension of dropout, was proposed by Wan et al. [50]; it sets to zero a subset of weights and connections. Despite the many applications of dropout in reducing overfitting, this method is not entirely satisfactory because pruning is done randomly. Therefore, a better way to reduce overfitting is sparse regularization methods, such as the l1 norm, l2 norm, or l1/l2 norm.

A lot of work has been proposed to reduce overfitting using sparse regularization. For example, a sparse random connection structure is introduced to prune connections in [51]. Another method, group lasso regularization, introduced by Yuan et al. [52], sets to zero all parameters of some groups of weights, while all parameters of the remaining groups are non-zero. Therefore, a structured sparse regularization is required [53]; the authors propose a structured sparse training method to regularize filters, channels, filter shapes, and layer depth. In addition to pruning weights, connections, and filters, the sparse regularization technique is used to prune neurons or hidden units. For example, Murray et al. [54] used structured sparse regularization to prune hidden units, and a neuron pruning technique is proposed in [55].

Sparse regularization has also been used in recurrent neural networks. For example, Narang et al. [56] proposed a method to prune the weights of recurrent neural networks in the initial iterations while keeping the accuracy close to that of the original network in the final iterations. Reducing the sizes of basic structures independently may cause the network to end up with invalid LSTM units. To solve this problem, Wen et al. [57] proposed intrinsic sparse structures (ISS) in LSTM. ISS can decrease all basic structures by removing a component while maintaining dimension consistency.


The Proposed Algorithm

We introduce our Convolutional Attention Sparse Deep LSTM (CASDLSTM) method, a multi-layered LSTM. It includes convolutional attention, LSTM, and bidirectional LSTM layers, with sparse layers between them to prune the weights and connections.

CASDLSTM on Action Recognition

Our input is a set of video sequences of human actions from one of C categories with labels $Y = \{y_1, y_2, \ldots, y_C\}$. All the categories contain some video clips X. Figure 1 shows the overview of CASDLSTM.

Each video clip X includes n consecutive frames; denoting each frame by f, each video clip is $X = \{f_1, f_2, \ldots, f_n\}$. Every frame is converted to one RGB image and enters the network as the first input (the blue box in Fig. 1). After that, a CNN model extracts its features.

According to Fig. 1, there is another input (the orange box), which is the optical flow (the difference between frames along the temporal dimension) [34]. In this paper, because two consecutive frames are very similar, the clip X is divided into p parts to obtain discriminative optical flow features. Each part contains m consecutive frames. Therefore, frames 1 to m are marked as part 1, frames m+1 to 2m are marked as part 2, and so on (Fig. 2):

$$P = \{p_1, p_2, \ldots, p_p\}, \qquad p_j = \{f_{(j-1)m+1}, \ldots, f_{jm}\}, \qquad p = \left\lfloor \frac{n}{m} \right\rfloor \tag{1}$$

Fig. 1 Dividing frames into p parts so that each part includes m frames

Fig. 2 The overview of CASDLSTM. Frame $f_i$ is from part j and $f_{i+m}$ is from part j+1. They are considered as two consecutive frames, and one optical flow image is obtained and enters the network. Also, $f_i$ enters the network as an RGB image. The network includes ResNet-152 for spatial feature extraction and Sparse Attention-LSTM besides Bi-LSTM for temporal feature extraction


Consequently, the i-th frames of consecutive parts are considered as consecutive frames. For example, the i-th frame of the j-th part and the i-th frame of part j+1 are regarded as two consecutive frames; in other words, $f_i$ and $f_{i+m}$ are two consecutive frames. Therefore, the flow will be:

$$F = \mathrm{Flow}(f_i, f_{i+m}) \tag{2}$$
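To make the partitioning and pairing concrete, the following is a minimal sketch of our own (not the authors' code). It assumes the frames are available as a list of single-channel NumPy arrays, that the opencv-contrib package provides the TV-L1 optical flow mentioned in the next paragraph, and that all function names are illustrative:

```python
import numpy as np
import cv2  # opencv-contrib-python is assumed for the TV-L1 implementation


def flow_pairs(frames, m):
    """Pair frame i of part j with frame i of part j+1, i.e. (f_i, f_{i+m})."""
    n = len(frames)
    p = n // m                      # p = floor(n / m), Eq. (1)
    usable = p * m                  # drop the incomplete tail part, if any
    return [(i, i + m) for i in range(usable - m)]


def tvl1_flows(frames, m):
    """Compute one TV-L1 optical flow image per (f_i, f_{i+m}) pair."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()  # TV-L1 optical flow [58]
    flows = []
    for i, k in flow_pairs(frames, m):
        # frames are expected as single-channel uint8 images of equal size
        flows.append(tvl1.calc(frames[i], frames[k], None))
    return flows
```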

The TV-L1 optical flow algorithm [58], which is based on total variation and the L1 norm, is applied to calculate the optical flow. Finally, the RGB and flow images enter the CNN, and the spatial and temporal features are extracted. The extracted features of the $x_i$-th video clip are stored as a high-dimensional matrix with the label $y_i$. These outputs enter the first LSTM layer; the output of the first LSTM layer then enters the sparse layer, which prunes the weights. Sparsification is described in detail in “Sparse Regularization”. In the next step, the output of the sparse layer enters the BiLSTM and again enters the next sparse layer. Ultimately, the feature map enters the fully connected layer. The final output is calculated by Softmax as follows:

$$y_p \sim \arg\max \ \mathrm{Softmax}(W l_f + b) \tag{3}$$

where $l_f$ is the output of the fully connected layer, W is the trained weight matrix, b is the bias, and $y_p$ is the predicted output label.
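A minimal Keras sketch of this head is given below; it reflects our own reading of the description: the “sparse layers” are modelled as fully connected ReLU layers whose kernels carry the sparse regularizer of “Sparse Regularization”, the convolutional attention step is omitted for brevity, and the 512/1024 layer sizes follow the hidden-unit counts reported later in “Proposed Method Evaluation”. All names and shapes are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 101   # e.g. UCF-101
TIME_STEPS = 16     # frames per clip fed to the recurrent part (assumed)
FEATURE_DIM = 2048  # pooled ResNet-152 feature size


def build_head(sparse_reg=None):
    """LSTM -> sparse layer -> BiLSTM -> sparse layer -> FC softmax, Eq. (3)."""
    inputs = layers.Input(shape=(TIME_STEPS, FEATURE_DIM))
    x = layers.LSTM(512, return_sequences=True)(inputs)
    x = layers.TimeDistributed(
        layers.Dense(1024, activation="relu", kernel_regularizer=sparse_reg))(x)
    x = layers.Bidirectional(layers.LSTM(512))(x)
    x = layers.Dense(1024, activation="relu", kernel_regularizer=sparse_reg)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # y_p, Eq. (3)
    return models.Model(inputs, outputs)
```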

The pre-trained ResNet-152 model is employed to extract features [59]. ResNet-152 is a residual learning framework with up to 152 layers. The input image size of ResNet-152 is 224 × 224; in other words, each frame of our input data is converted into a 224 × 224 image.
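As an illustration of this step, a pre-trained ResNet-152 from tf.keras could be used roughly as follows; this is a sketch under the assumption that the bundled ImageNet weights and global average pooling stand in for the feature extractor used in the paper:

```python
import tensorflow as tf

# ImageNet-pretrained ResNet-152 without the classification top;
# global average pooling yields one 2048-d vector per 224x224 frame.
backbone = tf.keras.applications.ResNet152(
    include_top=False, weights="imagenet", pooling="avg")
preprocess = tf.keras.applications.resnet.preprocess_input


def frame_features(frames_224):
    """frames_224: float tensor of shape (num_frames, 224, 224, 3)."""
    return backbone(preprocess(frames_224), training=False)
```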

Convolutional Attention

Attention is a beneficial mechanism in deep learning and generalizes pooling methods. It has been considered by several researchers in action recognition. In video and image processing, the attention layer focuses on some essential regions of the image or frames more than on other parts. The attention mechanism computes a score map as follows:

$$S_t = \tanh(W_{ha} h_{t-1} + W_{ia} X_t + b_a) \tag{4}$$

In this equation, W is the weight matrix, $h_{t-1}$ is the previous hidden state, and X is the feature map. Another attention, convolutional attention [17], utilizes the convolution operator instead of the inner product in (4):

$$S_t = \tanh(W_{ha} * h_{t-1} + W_{ia} * X_t + b_a) \tag{5}$$

In addition, $S_t$, instead of being a vector score map, is a 2D map (Fig. 3). Eventually, the attention map is computed by the softmax function:

$$A_t^{ij} = \frac{\exp(S_t^{ij})}{\sum_i \sum_j \exp(S_t^{ij})} \tag{6}$$

where $A_t$ is the attention map of state t. Next, the final feature map is simply acquired by an element-wise product between each element of the feature map and the corresponding element of the attention map (Fig. 3):

$$\hat{X}_t = A_t \odot X_t \tag{7}$$
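A minimal sketch of Eqs. (5)–(7) as a Keras layer is shown below; the kernel sizes and the way $h_{t-1}$ is provided are assumptions rather than the exact design of [17]:

```python
import tensorflow as tf
from tensorflow.keras import layers


class ConvAttention(layers.Layer):
    """Score map S_t = tanh(W_ha * h_{t-1} + W_ia * X_t + b_a), Eq. (5);
    attention map A_t = spatial softmax(S_t), Eq. (6);
    attended features A_t (element-wise) X_t, Eq. (7)."""

    def __init__(self, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        self.conv_h = layers.Conv2D(1, kernel_size, padding="same", use_bias=False)
        self.conv_x = layers.Conv2D(1, kernel_size, padding="same")  # carries b_a

    def call(self, x_t, h_prev):
        s_t = tf.tanh(self.conv_h(h_prev) + self.conv_x(x_t))           # Eq. (5)
        shape = tf.shape(s_t)
        a_t = tf.nn.softmax(tf.reshape(s_t, (shape[0], -1)), axis=-1)   # Eq. (6)
        a_t = tf.reshape(a_t, shape)
        return a_t * x_t                                                 # Eq. (7)
```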

Sparse Regularization

One solution to avoid overfitting is adding a sparse regularization term to the cost function. Therefore, the optimization of the weight matrix for each layer is given by Eqs. (8) and (9):

$$E(W) = E_D(W) + \lambda R(W) \tag{8}$$

$$E_D(W) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i, w_i)\big) \tag{9}$$


Fig. 3 A detailed demonstration of Conv-Att-LSTM, showing the use of convolutional attention and its role in finding effective feature maps


In Eqs. (8) and (9), W is the weight matrix, $E_D(W)$ is a suitable cost function, R(W) is the regularization term (the weight pruning function), and $\lambda$ is the hyper-parameter. The L1 and L2 norms are the conventional choices for sparse regularization and are presented in Eqs. (10) and (11), respectively:

$$R(W) = \|W\|_1 = \sum_{i=1}^{|w|} |w_i| \tag{10}$$

$$R(W) = \|W\|_2 = \sum_{i=1}^{|w|} w_i^2 \tag{11}$$

Since the L1 norm is not differentiable at zero (its left and right derivatives at zero are not equal), numerous methods have been introduced to make it differentiable. In this paper, an infinitesimal number is added to the L1 norm, as shown in Eq. (13). Also, the regularization term is a combination of the L1 and L2 norms, as presented in Eq. (12):

$$R(W) = \|W\|_1 + \|W\|_2 \tag{12}$$

$$\|W\|_1 = \sqrt{\sum_{i=1}^{|w|} (W_i)^2 + \epsilon} \tag{13}$$

Equation (12) consists of two terms. The first term is the L1 norm; however, to make it differentiable, we use the square root of the sum of the squares of all elements of the weight matrix plus $\epsilon$, as in Eq. (13), instead of the absolute values of the elements. The second term of the equation is the L2 norm regularization introduced in Eq. (11).
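A possible Keras regularizer implementing Eqs. (11)–(13) is sketched below. The $\epsilon$ value is an assumption, and the two rates default to 1e-6 and 1e-3, which is our reading of the “10e-6”/“10e-3” L1 and L2 rates reported later in “Proposed Method Evaluation”; the class name is ours:

```python
import tensorflow as tf
from tensorflow.keras import regularizers


class SmoothL1L2(regularizers.Regularizer):
    """R(W) = l1 * sqrt(sum(W^2) + eps) + l2 * sum(W^2), Eqs. (11)-(13)."""

    def __init__(self, l1=1e-6, l2=1e-3, eps=1e-8):
        self.l1, self.l2, self.eps = l1, l2, eps

    def __call__(self, w):
        sum_sq = tf.reduce_sum(tf.square(w))
        smooth_l1 = tf.sqrt(sum_sq + self.eps)        # differentiable surrogate, Eq. (13)
        return self.l1 * smooth_l1 + self.l2 * sum_sq  # combined term, Eq. (12)

    def get_config(self):
        return {"l1": self.l1, "l2": self.l2, "eps": self.eps}
```

Such a regularizer could then be passed as the kernel_regularizer of the sparse layers sketched earlier.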

Discussion and Results

CASDLSTM, a neural network for human action recognition combining ResNet-152, convolutional attention, and deep LSTMs, is proposed. To recognize temporal features, optical flow is used as well as RGB. To overcome the overfitting problem, sparse layers are used between the LSTM layers to prune the weights and connections. Since the sparsity does not prune randomly, it prunes the less significant weights and achieves higher accuracy. We examined the results on two real-world datasets, introduced in “Dataset”, to evaluate the proposed method’s accuracy. The results indicate that the proposed method achieves high performance on both datasets.


Isolated action recognition is considered in this paper, and due to this assumption, all video clips are divided into some parts. If continuous action recognition were to be performed, an action segmentation algorithm [60] would be required as a pre-processing step before the proposed algorithm, so as not to mix the information of separate actions with each other. It seems that with this pre-processing step, the proposed algorithm would perform well. In this section, the datasets are introduced, and the configuration of the training, validation, and test phases, and their results, are described in detail.

Dataset

It is important to choose an appropriate dataset to evaluate our proposed method. Choosing a dataset with adequate complexity helps us get close to real-world conditions. This paper evaluates the CASDLSTM method on two popular action recognition datasets and presents the results.

One of these two datasets is UCF-101 [61]. This dataset is one of the largest datasets for human activities, with 101 activity classes. These 101 classes include 13,320 clips and 27 h of video data. The clips are from the real world and have been uploaded to YouTube by different users. Videos include camera movements and background clutter. Soomro et al. [61] have divided the dataset into five general categories: human–object interaction, body movement only, human–human interaction, playing musical instruments, and sports. A few samples of each type are shown in Fig. 4. The UCF-101 dataset is an extension of the UCF-50 dataset, which includes 50 activities. The clips have a fixed frame rate of 25 FPS and a fixed resolution, and their average length is about 8 s.

Another complicated dataset is HMDB-51 [62]. The HMDB-51 dataset contains 51 activity classes; each class consists of at least 101 clips, for a total of 6766 clips from different sources. This dataset has attracted much attention from researchers in action recognition. Kuehne et al. [62] have classified the dataset into five groups: (1) general facial activities, like laughing; (2) facial activities using objects, like drinking; (3) general body movements, like walking; (4) body movements interacting with objects, like cycling; and (5) body movements with human–human interaction, like hugging. Some example frames from each group are depicted in Fig. 5. The HMDB-51 dataset includes additional information representing the camera viewpoint, the presence or absence of camera movement, video quality, and the number of actors. This information makes it possible to obtain more flexible results.

Proposed Method Evaluation

In this paper, tenfold cross-validation is used. The proposed model is run ten times on the two datasets, and each time, one part is considered the test set.


The average and standard deviation of the repeated results on the UCF-101 dataset are 95.24 and 1.5, respectively. Additionally, the average and standard deviation on the HMDB-51 dataset are 71.62 and 2.4, respectively.

Our method is developed using the Keras library in the TensorFlow framework [63]. The results are obtained on a workstation with a 4 GHz Intel Core i7 CPU, 16 GB RAM, and an NVIDIA GeForce GTX TITAN X GPU. The number of hidden units of each LSTM and of each sparse layer is 512 and 1024, respectively. The optimizer used in the model is RMSprop, introduced by Tieleman et al. [26], with a learning rate of 0.001 and a decay rate of 0.9. The error function is categorical cross-entropy. The ReLU activation function is used in all recurrent layers and sparse layers.
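Restated as a hedged Keras snippet (the mapping of the reported decay rate of 0.9 to RMSprop's rho parameter is our interpretation, and the model is assumed to be the head sketched in “CASDLSTM on Action Recognition”):

```python
import tensorflow as tf


def compile_model(model: tf.keras.Model) -> tf.keras.Model:
    # RMSprop with learning rate 0.001; the reported decay rate of 0.9 is mapped
    # here to RMSprop's rho parameter (an interpretation, not stated in the text).
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```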

Furthermore, the regularization hyper-parameter is significant. If it is too high, the second term of the error function will be more important than the first term, and the accuracy may decrease. If it is too small, the importance of the regularization term will be very low. Therefore, there should be a trade-off between the two terms. 10e-3 is chosen for the L2 norm rate and 10e-6 for the L1 norm rate. Feature extraction is an essential step in action recognition. In this paper, two types of input features are used: the RGB channels of frames and their optical flow. In Fig. 6, several images from the UCF-101 and HMDB-51 datasets are shown beside their optical flow.

Two scenarios are considered to evaluate the proposed method. In the first scenario, only the RGB data are entered into the proposed architecture, without the optical flow. The results for the ALSTM, ADLSTM, ASDLSTM, and CASDLSTM architectures are computed.

Fig. 4 Samples from the UCF-101 dataset. Row 1: Human–Object Interaction (Applying Eye Makeup, Writing On Board and Hula Hoop); Row 2: Playing Musical Instruments (Playing Daff, Playing Piano and Playing Sitar); Row 3: Only Body Movement (Baby Crawling, Pull Up and Handstand Pushup); Row 4: Sports (Floor Gymnastics, Biking and Basketball Dunk); Row 5: Human–Human Interaction (Haircut, Band Marching and Salsa Spin)

Fig. 5 Examples from the HMDB-51 dataset. Row 1: General Facial Actions (Laughing, Chewing and Talking); Row 2: Facial Actions with Object Manipulation (Smoking, Eating and Drinking); Row 3: General Body Movements (Climbing, Running and Diving); Row 4: Body Movements for Human Interaction (Shaking Hands, Hugging and Fencing); Row 5: Body Movements with Object Interaction (Golfing, Brushing Hair and Kicking Ball)


The fourth column of Table 1 reports the results on the UCF-101 dataset, and the fourth column of Table 2 reports the results on HMDB-51.

In the second scenario, the optical flow is considered in addition to the RGB channels; both RGB and optical flow images are entered into the network. The results of the ALSTM (attention plus LSTM), ADLSTM (attention plus deep LSTM), ASDLSTM (ADLSTM plus sparse layers), and CASDLSTM (convolutional attention plus deep LSTM plus sparse layers) methods for UCF-101 and HMDB-51 are given in the fifth column of Tables 1 and 2, respectively. By comparing the fourth and fifth columns, the significant effect of applying the optical flow on the results is noticeable.

If the tested networks do not have sparse layers, a dropout of 0.5 is applied to prevent overfitting. By comparing the methods in Tables 1 and 2, the effect of sparsity on the network is visible. As mentioned in the introduction, dropout prunes randomly, whereas sparse regularization is a structured method. According to the results, it is observed that the presence of sparsity in both networks (RGB and two-stream network) improves the results.

Moreover, using two-layered LSTM has enhanced the results because it can create a complex feature representation of the input clips; therefore, much more information is trained and processed. Also, due to the sparse layers between the LSTM layers, the model’s complexity does not cause overfitting. The results are shown in Tables 1 and 2. The effect of using optical flow on the UCF-101 and HMDB-51 datasets is visible in Fig. 7.

The loss of the sparse regularization and dropout methods on the UCF-101 dataset is shown in Fig. 8. The effect of sparsity on preventing overfitting is quite visible. Also, as shown in the graphs, dropout converges at iteration 9, whereas sparse regularization converges at iteration 4; sparse regularization thus converges faster than dropout. This quick convergence is likely due to the effective elimination of neurons compared with the randomized elimination of neurons in the dropout method.

Comparative Analysis

Finally, the proposed method and some state-of-the-art methods are compared in Table 3. It should be noted that the results of those methods are compared only with the results of our method in the second scenario (the combination of RGB and optical flow). By comparing with existing LSTM methods, it is noted that the proposed method has achieved remarkable results for the following reasons. First, using a combination of LSTM and Bi-LSTM allows the model to learn much information and the dependencies between frames. Second, the convolutional attention layer focuses on the more essential regions of frames; the results also show that it achieves higher performance than soft attention.

Fig. 6 A number of activities: (a) activities from the UCF-101 dataset as RGB, (b) those activities as optical flow, (c) activities from the HMDB-51 dataset as RGB, and (d) those activities as optical flow

Table 1 The results on UCF-101 using LSTM, Deep LSTM and Sparse Deep LSTM

Model      Pre-training  CNN         Results on RGB   Results on RGB + Flow
ALSTM      ImageNet      ResNet-152  90.9             91.77
ADLSTM     ImageNet      ResNet-152  91.11            94.21
ASDLSTM    ImageNet      ResNet-152  91.81            95.1
CASDLSTM   ImageNet      ResNet-152  92.6             95.24

Table 2 The results on HMDB-51 using ALSTM, Deep ALSTM, Sparse Deep ALSTM and Sparse Deep CALSTM

Model      Pre-training  CNN         Results on RGB   Results on RGB + Flow
ALSTM      ImageNet      ResNet-152  61.6             66.87
ADLSTM     ImageNet      ResNet-152  66.37            69.93
ASDLSTM    ImageNet      ResNet-152  68.12            70.9
CASDLSTM   ImageNet      ResNet-152  68.95            71.62


Third, sparse regularization prunes the less significant weights instead of the random pruning used in the dropout method. It is noticeable that we have achieved high accuracy, especially on the UCF-101 dataset. To show our method’s performance, we first compare our method with some state-of-the-art hand-crafted methods, and then with some deep learning methods.


Fig. 7 The comparison of ALSTM, Deep ALSTM, Sparse Deep ALSTM and Sparse Deep CALSTM using both RGB and RGB + Optical Flow input on (a) UCF-101 (b) HMDB-51 dataset

Fig. 8 Train and validation loss in UCF-101 (a) Using sparse layers, (b) Using dropout

Table 3 Comparison with current state-of-the-art methods on both datasets

Model                                          UCF-101   HMDB-51   KTH
Hand-crafted models
  IDT 2013 [64]                                85.90%    57.20%    –
  Fisher Vector 2017 [65]                      –         –         98.92
Deep learning models
  Composite LSTM model 2015 [40]               84.30%    44.00%    –
  Two-stream model (fusion by SVM) 2014 [34]   88.00%    59.40%    –
  Attention Again 2018 [10]                    87.70%    54.40%    –
  Video-LSTM 2018 [17]                         88.90%    56.40%    –
  Asymmetric 3D CNN (RGBF) 2019 [68]           87.70%    61.20%    –
  Intrinsic Sparse Structures 2018 [57]        93.22     59.97     –
  C2LSTM 2019 [42]                             92.8      61.3      –
  L2STM 2017 [69]                              93.6      66.2      –
  TS-LSTM 2019 [41]                            94.1      69        –
  ATW 2018 [44]                                94.6      70.5      –
  Bi-Hierarchical-LSTM 2019 [36]               94.8      71.9      –
  Our approach (CASDLSTM)                      95.24     71.62     98.41


The first method is IDT, which improved dense trajectories by considering camera motion [64]. The significant difference between its accuracy and that of our method is noticeable. The second state-of-the-art hand-crafted method is Reduced Fisher Vector Embedding [65], an extension of Bag of Visual Words; dense trajectories have also been used in that paper to extract features from frames. That paper achieved remarkable results on two simple action datasets, KTH [66] and Weizmann [67]. Although a comparison of the results on the KTH dataset shows that Fisher Vector Embedding achieved better accuracy than our method, according to the confusion matrix of the KTH dataset, 5 out of 6 classes obtain 100% accuracy. Also, our method’s performance is very high on complex real-world datasets. As mentioned in the introduction, LSTM is a suitable method for computing temporal features; therefore, it is evident from the results that our proposed method, which uses deep LSTM, achieves better accuracy than 3DCNN [68]. Yang et al. [68] get better results using a combination of RGB, flow, and traditional IDT; however, we compare against the results they achieve from RGB and flow.

Fig. 9 The confusion matrix of UCF-101 dataset


To exhibit the impact of the proposed method, further complications of the architecture, such as using a 3DCNN for feature extraction, are avoided; therefore, we employ a 2DCNN (ResNet-152) for feature extraction. If a 3DCNN were used for feature extraction, it would increase accuracy; nonetheless, we try to clarify the contribution of the two sparse-layered LSTMs with the attention mechanism and do not use a 3DCNN. The authors of [41, 69], and [36] obtained good results using LSTM, temporal Inception, and attention-based convolutional neural networks, respectively; however, we achieve higher results using the combination of convolutional attention and deep LSTM. Moreover, unlike ISS [57], we only use sparse layers between LSTM layers and do not use sparsity to decrease the size of the basic structures within LSTM units; therefore, as the results show, we achieve slightly higher accuracy. In the C2LSTM method [42], the authors used the convolution and correlation operators to capture the data’s spatial and motion structure. As shown in Table 3, the authors of [36] achieve slightly higher accuracy than our method on HMDB-51. In some iterations of the experiments, the results of the proposed method are better than [36]; however, we emphasize that we have performed the experiments at least 10 times, and in some runs they obtain higher accuracy, with approximately 2% standard deviation.

Error Analysis

In this section, the errors of the proposed method are analyzed. According to the results, the average error on the UCF-101 dataset is 4.76%. As seen in the confusion matrix of the UCF-101 dataset in Fig. 9, most of the classes achieve 100% accuracy, and the accuracy of the other classes is above 70%. The largest errors belong to the “HandstandPushup”, “Skijet”, “IceDancing”, and “FieldHockeyPenalty” classes, which are misclassified as “CuttingInKitchen”, “CliffDiving”, “HorseRiding”, and “NunChucks”, respectively. Also, approximately 10% errors occur in some classes, including “Drumming” and “Punch”. Figure 10, the confusion matrix of the KTH dataset, shows the high accuracy of the proposed method on all classes; only under 10% errors occur in the “Running” class, which is misclassified into the “Jogging” class. In the confusion matrix of the HMDB-51 dataset in Fig. 11, it can be seen that the proposed method’s error percentage on this dataset is higher than on the UCF-101 dataset. The largest errors in this dataset are related to the “Sword”, “Somersault”, “SwingBaseball”, “Smoke”, “Drink”, “RideHorse”, “Chew”, and “Situp” classes, which are misclassified into the “PushUp”, “Chew”, “PushUp”, “SwordExercise”, “Pick”, “SwingBaseball”, “PushUp”, and “Chew” classes, respectively. According to the confusion matrices, errors seem to occur more frequently in classes that are more similar. Using data augmentation might help to decrease some of these errors.

Conclusion

This paper proposes a neural network, CASDLSTM, for human action recognition. It applies ResNet-152 for feature extraction, fed with RGB and optical flow images. The extracted features enter the deep Convolutional Attention LSTM network to process the information and find the dependencies between frames. To obtain discriminative optical flow features, the video sequences are divided into parts, each containing a specified number of frames, and the corresponding frames of consecutive parts are treated as consecutive frames. To overcome the overfitting problem, we use sparse layers between the LSTM layers to prune the weights and connections. Since the sparsity does not prune randomly, it prunes the less significant weights and achieves higher accuracy. We examined the results on the UCF-101 and HMDB-51 datasets, which are very popular and used in recent studies, to evaluate the accuracy of the proposed method. The results indicate that the proposed method achieves high performance on both datasets. They reveal that using sparse layers instead of dropout between LSTM layers deals with overfitting, and that deep LSTM achieves better results than a one-layer LSTM.

Fig. 10 The confusion matrix of KTH dataset


Declarations

Conflict of interests Atefe Aghaei declares that she has no conflict of interest. Ali Nazari declares that he has no conflict of interest. Mohsen Ebrahimi Moghaddam declares that he has no conflict of interest.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

References

1. Subetha T, Chitrakala S (2016). A survey on human activity recognition from videos. In: 2016 International Conference on Information Communication and Embedded Systems (ICICES) (pp 1–7). IEEE.

2. Herath S, Harandi M, Porikli F. Going deeper into action recognition: a survey. Image Vis Comput. 2017;60:4–21.

Fig. 11 The confusion matrix of HMDB-51 dataset

Page 13: Spae Deep LSTM ih Convolional Attenion fo Hman Acion

SN Computer Science (2021) 2:151 Page 13 of 14 151

SN Computer Science

3. Saini R, Kumar P, Roy PP, Dogra DP. A novel framework of continuous human-activity recognition using Kinect. Neurocomputing. 2018;311:99–111.

4. Saini R, Kumar P, Kaur B, Roy PP, Dogra DP, Santosh KC. Kinect sensor-based interaction monitoring system using the BLSTM neural network in healthcare. Int J Mach Learn Cybernet. 2019;10(9):2529–40.

5. Kumar P, Mukherjee S, Saini R, Kaushik P, Roy PP, Dogra DP. Multimodal gait recognition with inertial sensor data and video using evolutionary algorithm. IEEE Trans Fuzzy Syst. 2018;27(5):956–65.

6. Zhang HB, Zhang YX, Zhong B, Lei Q, Yang L, Du JX, Chen DS. A comprehensive survey of vision-based human action recognition methods. Sensors. 2019;19(5):1005.

7. Deng L, Yu D (2014) Deep learning: methods and applications. Found Trends® Signal Process 7(3–4):197–387.

8. Wu D, Sharma N, Blumenstein M (2017) Recent advances in video-based human action recognition using deep learning: a review. In: 2017 International Joint Conference on Neural Networks (IJCNN) (pp 2865–2872). IEEE.

9. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

10. Yang H, Zhang J, Li S, Lei J, Chen S. Attend it again: Recurrent attention convolutional neural network for action recognition. Appl Sci. 2018;8(3):383.

11. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw. 1994;5(2):157–66.

12. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.

13. Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005;18(5–6):602–10.

14. Piergiovanni AJ, Ryoo MS (2019) Representation flow for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp 9945–9953).

15. Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 6299–6308).

16. Sevilla-Lara L, Liao Y, Güney F, Jampani V, Geiger A, Black MJ (2018) On the integration of optical flow and action recognition. In: German conference on pattern recognition (pp 281–297). Springer, Cham.

17. Li Z, Gavrilyuk K, Gavves E, Jain M, Snoek CG. Videolstm convolves, attends and flows for action recognition. Comput Vis Image Underst. 2018;166:41–50.

18. Ji S, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. 2012;35(1):221–31.

19. Laptev I. On space-time interest points. Int J Comput Vision. 2005;64(2–3):107–23.

20. Sicre R, Gevers T (2014) Dense sampling of features for image retrieval. In: 2014 IEEE International Conference on Image Processing (ICIP) (pp 3057–3061). IEEE.

21. Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: 2005 IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance (pp 65–72). IEEE.

22. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp 1–8). IEEE.

23. Klaser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3d-gradients.

24. Zhu J, Qi J, Kong X (2012) An improved method of action recognition based on sparse spatio-temporal features. In: International conference on artificial intelligence: methodology, systems, and applications (pp 240–245). Springer, Berlin, Heidelberg.

25. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, ..., Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp 1–9).

26. Tieleman T, Hinton G (2017) Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning. Technical Report.

27. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems (pp 91–99).

28. Ohn-Bar E, Trivedi MM. Multi-scale volumes for deep object detection and localization. Pattern Recogn. 2017;61:557–72.

29. Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv preprint arXiv: 1511.04119.

30. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: Towards good practices for deep action recognition. In: European conference on computer vision (pp 20–36). Springer, Cham.

31. Ijjina EP, Chalavadi KM. Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recogn. 2016;59:199–212.

32. Cheng C, Lv P, Su B (2018) Spatiotemporal pyramid pooling in 3d convolutional neural networks for action recognition. In: 2018 25th IEEE international conference on image processing (ICIP) (pp 3468–3472). IEEE.

33. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision (pp 4489–4497).

34. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems (pp 568–576).

35. Wang X, Gao L, Wang P, Sun X, Liu X. Two-stream 3-d convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans Multimedia. 2017;20(3):634–44.

36. Yang H, Zhang J, Li S, Luo T. Bi-direction hierarchical LSTM with spatial-temporal attention for action recognition. J Intell Fuzzy Syst. 2019;36(1):775–86.

37. Yang S, Yu X, Zhou Y (2020) LSTM and GRU Neural Network performance comparison study: taking yelp review dataset as an example. In: 2020 International workshop on electronic communication and artificial intelligence (IWECAI) (pp 98–101). IEEE.

38. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555.

39. Weiss G, Goldberg Y, Yahav E (2018) On the practical computa-tional power of finite precision RNNs for language recognition. arXiv preprint arXiv: 1805.04908.

40. Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning (pp 843–852).

41. Ma CY, Chen MH, Kira Z, AlRegib G. TS-LSTM and temporal-inception: exploiting spatiotemporal dynamics for activity recognition. Signal Process. 2019;71:76–87.

42. Majd M, Safabakhsh R (2019) Correlational Convolutional LSTM for human action recognition. Neurocomputing.

43. Ullah A, Ahmad J, Muhammad K, Sajjad M, Baik SW. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access. 2017;6:1155–66.

44. Zang J, Wang L, Liu Z, Zhang Q, Hua G, Zheng N (2018) Attention-based temporal weighted convolutional neural network for action recognition. In: IFIP international conference on artificial intelligence applications and innovations (pp 97–108). Springer, Cham.



45. Gammulle H, Denman S, Sridharan S, Fookes C (2017) Two stream LSTM: A deep fusion framework for human action recognition. In: 2017 IEEE winter conference on applications of computer vision (WACV) (pp 177–186). IEEE.

46. Hassibi B, Stork DG (1993) Second order derivatives for network pruning: Optimal brain surgeon. In: Advances in neural information processing systems (pp 164–171).

47. LeCun Y, Denker JS, Solla SA (1990) Optimal brain damage. In: Advances in neural information processing systems (pp 598–605).

48. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv: 1207.0580.

49. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

50. Wan L, Zeiler M, Zhang S, Le Cun Y, Fergus R (2013) Regularization of neural networks using dropconnect. In: International conference on machine learning (pp 1058–1066).

51. Changpinyo S, Sandler M, Zhmoginov A (2017) The power of sparsity in convolutional neural networks. arXiv preprint arXiv: 1702.06257.

52. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc. 2006;68(1):49–67.

53. Wen W, Wu C, Wang Y, Chen Y, Li H (2016) Learning structured sparsity in deep neural networks. In: Advances in neural information processing systems (pp 2074–2082).

54. Murray K, Chiang D (2015) Auto-sizing neural networks: With applications to n-gram language models. arXiv preprint arXiv: 1508.05051.

55. Srinivas S, Babu RV (2015) Data-free parameter pruning for deep neural networks. arXiv preprint arXiv: 1507.06149.

56. Narang S, Elsen E, Diamos G, Sengupta S (2017) Exploring sparsity in recurrent neural networks. arXiv preprint arXiv: 1704.05119.

57. Wen W, He Y, Rajbhandari S, Zhang M, Wang W, Liu F, ... Li H (2017). Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv: 1709.05027.

58. Zach C, Pock T, Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In: Joint pattern recognition symposium (pp 214–223). Springer, Berlin, Heidelberg.

59. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv: 1409.1556.

60. Kulkarni K, Evangelidis G, Cech J, Horaud R. Continuous action recognition based on sequence alignment. Int J Comput Vision. 2015;112(1):90–114.

61. Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv: 1212.0402.

62. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: 2011 International conference on computer vision (pp 2556–2563). IEEE.

63. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, ..., Ghemawat S (2016) Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv: 1603.04467.

64. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision (pp 3551–3558).

65. Dhar P, Alvarez JM, Roy PP (2017) Efficient framework for action recognition using reduced fisher vector encoding. In: Proceedings of international conference on computer vision and image processing (pp 343–354). Springer, Singapore.

66. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004. (Vol. 3, pp 32–36). IEEE.

67. Blank M, Gorelick L, Shechtman E, Irani M, Basri R (2005) Actions as space-time shapes. In: Tenth IEEE international conference on computer vision (ICCV'05) Volume 1 (Vol. 2, pp 1395–1402). IEEE.

68. Yang H, Yuan C, Li B, Du Y, Xing J, Hu W, Maybank SJ. Asymmetric 3d convolutional neural networks for action recognition. Pattern Recogn. 2019;85:1–12.

69. Sun L, Jia K, Chen K, Yeung DY, Shi BE, Savarese S (2017) Lattice long short-term memory for human action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (pp 2147–2156).

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.