International Journal on Digital Libraries (2006) 6(1): 30–38
DOI 10.1007/s00799-005-0119-y

REGULAR PAPER

Ankush Mittal · Sumit Gupta
Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee
E-mail: [email protected]

Automatic content-based retrieval and semantic classification of video content

Published online: 23 February 2006. © Springer-Verlag 2006

Abstract The problem of video classification can be viewed as discovering the signature patterns in the elemental features of a video class. In order to solve this problem, a large and diverse set of video features is proposed in this paper. The contributions of the paper further lie in dealing with the high dimensionality induced by the feature space and in presenting an algorithm based on two-phase grid searching for automatic parameter selection for the support vector machine (SVM). The framework is thus directed at bridging the gap between low-level features and semantic video classes. The experimental results and a comparison with state-of-the-art learning tools on more than 5000 video segments show the effectiveness of our approach.

Keywords Content-based retrieval · Multimedia classification · Support vector machines · Semantic labels · Radial basis function

1 Introduction

There has been tremendous growth in the amount of digital video content in recent years, and thus there is a great need for automatic tools to classify and retrieve video content. In recent years, research in multimedia content-based retrieval (CBR) has focused on the use of internal features of images and videos computed in an automated or semi-automated way. Automated analysis calculates statistics which can be approximately correlated to the content features. This is useful as it provides information without costly human interaction.

Recently, with the objective of providing semantic-level indices to the user of a CBR system, strategies involving learning a supervised model have been emerging in the field of CBR. When there are clearly identified categories, as well as large, domain-representative training data, learning can be effectively employed to construct a model of the domain. A model generally represents a strong spatial order within the individual images and/or a strong temporal order across a sequence. Many researchers have worked on semantic image classification and the organization of natural image databases into categories like indoor vs. outdoor [1, 2], city vs. landscape [3, 4], man-made vs. natural [2, 5], sunset vs. forest vs. mountain [6], and so on. On the other hand, the domain of semantic video classification has not been thoroughly explored.

Learning a pattern in video data is a more difficult problem than doing the same in image data, as temporal relationships between attributes are inherent in understanding video data. In addition, video classes are more complex than those that exist in image databases, for example indoor vs. outdoor. For extracting video information in our work, the traditional feature set used in image databases, consisting of color, texture, edges, etc., is enhanced with special features such as color structure, color layout, region shape, and group-of-frame color [7]. The group-of-frame color descriptor is designed to capture the color distribution over multiple frames.

In this paper, we present a learning framework where the construction of a high-level video index is visualized through the synthesis of its set of elemental features. This is done through the medium of support vector machines (SVMs). The support vector machines associate each set of data points in the multi-dimensional feature space with one of the classes during training. This association can later be used for evaluating the class to which a partition of the feature space belongs. Experiments on a large database and a comparison with other standard classification tools (neural networks, decision trees, and the K-nearest neighbor classifier) show the effectiveness of our approach.

Several content-based retrieval systems require the user to be knowledgeable in low-level implementation details in order to use them effectively. In such systems, the task of mapping low-level features to mental concepts and semantic labels rests on the user. The contribution of this paper lies in providing a mapping between semantic labels and


Fig. 1 An overview and steps in our CBR system

features derived from video data. The issues that we tackle in this paper are:

1. Employing innovative and meaningful features similar to some of the preeminent work in the field of MPEG-7.

2. The high-dimensional feature space inherent in the multimedia domain.

3. Automatic optimization of machine learning parameters. Though machine learning has been applied to several domains, no theoretical work exists to decide a priori upon the machine learning parameters. Therefore, a careful strategy is required to properly consider the factors involved in a particular problem.

An overview of our framework is presented in Fig. 1. The continuous video stream is partitioned into shots, which are passed to the feature extraction module. Out of several dimensions of feature values, meaningful dimensions are extracted using inter-feature correlation and feature-class correlation measures. In the training phase, the selected features passed to the SVM are used to adjust the kernel function parameters. In the testing phase, the best video class label is output by the SVM predictor.
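As an illustration of this flow, the following Python sketch wires a filter-style feature selector to an RBF-kernel SVM using scikit-learn. This is an implementation assumption (the paper does not specify its tooling), and the univariate F-score filter here merely stands in for the correlation-based selection of Sect. 5:

```python
# A minimal sketch of the training/testing flow in Fig. 1: descriptor rows
# per shot -> filter-based feature selection -> RBF-kernel SVM. All names
# and parameter values below are illustrative, not the paper's code.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def build_classifier(n_features: int = 96) -> Pipeline:
    """Filter-style selection followed by an RBF-kernel SVM."""
    return Pipeline([
        # Stand-in for the correlation-based filter of Sect. 5.
        ("select", SelectKBest(f_classif, k=n_features)),
        # (C, gamma) would come from the grid search of Sect. 3.2.
        ("svm", SVC(kernel="rbf", C=45.0, gamma=0.06076)),
    ])

# Usage with synthetic stand-in data (5000 shots, 593 descriptor dimensions):
X = np.random.rand(5000, 593)
y = np.random.randint(0, 10, size=5000)  # 10 video classes
clf = build_classifier().fit(X, y)
print(clf.predict(X[:5]))
```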

We will be using the following notations for a feature, a descriptor, and a dimension in this paper (consistent with MPEG-7 [8]). A feature is a perceptual attribute of the video that signifies something to a human observer, e.g., color, texture, shape, or motion. A descriptor is a numerical structure that describes a feature, e.g., average color or histogram color; a dimension is one of the dimensions of a multi-dimensional descriptor. For example, the descriptor “average color” might have three dimensions, one for each average component (red, green, and blue).

The paper is organized as follows. Section 2 presents a brief literature review on CBR systems and video indexing. Section 3 describes support vector machines and the automated parameter selection technique. Section 4 describes the process of feature extraction and discusses the features that are used in this framework. The feature selection and dimensionality reduction algorithm is presented in Sect. 5. Experimental results and their comparison with other classification tools are presented in Sect. 6. Conclusions and the scope of future work follow in Sect. 7.

2 Review and analysis of existing CBR techniques

Fischer et al. [9], in their CoP (content processing) project, made a pioneering effort in developing a semantic system which uses a broad strategy. They worked in the domain of {news cast, car race, tennis, commercials, animated cartoon}. They employ syntactic properties of a video like color statistics, cut detection, motion vectors, simple object segmentation, etc. Their classification module is comparable to a human expert who is asked about his/her evaluation of the closeness of a particular feature. Their strategy depends on assistance from a human knowledge base to distinguish the style profiles of the features. Shot-length-based computational features were also used by Truong et al. [10] for classifying videos into genres.

Ferman et al. [11] and Naphade et al. [12] have employed probabilistic frameworks to construct descriptors in terms of location, objects, and events. Vasconcelos et al. [13] have integrated shot length along with global motion activity to characterize the video stream with properties such as violence, sex, or profanity.

Yuan et al. [14] use decision trees to classify video into genres such as music, commercials, sports, etc. They employ simple color and shot-length-based features and report a precision rate of around 60%. A modified SVM framework was proposed by Lin et al. [15] for news video classification. Several other works have been based on using HMMs on a specific domain with a limited feature space [16, 17].


3 Support vector machines

This paper introduces a support vector machine (SVM) based video indexing system. SVMs were proposed by Vapnik and co-workers [18] as a very effective method for general-purpose supervised pattern recognition. The SVM approach is not only well founded theoretically, being based on extremely well-developed machine learning theory and statistical learning theory [19], but is also superior in practical applications. The SVM method has been successfully applied to isolated handwritten digit recognition [18], object recognition [20], text categorization [21], microarray data analysis [22], protein secondary structure prediction [23], etc.

3.1 Classification using support vector machines

When used for classification, SVMs separate a given set of binary-labeled data with a hyperplane that is maximally distant from them. Since most practical classification problems are non-linear (at least in video indexing), SVMs employ a technique of kernels that automatically realizes a non-linear mapping to a feature space. The hyperplane found by the SVM in the feature space corresponds to a non-linear boundary in the input space.

In their basic form, SVMs learn linear decision rules $h(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$, described by a weight vector $\mathbf{w}$ and a threshold $b$. Let the input be a sample of $n$ training examples, with the $j$th input point being $\mathbf{x}_j = (x^j_1, x^j_2, \ldots, x^j_n)$. Let this input point be labeled by the random variable $Y_j \in \{-1, +1\}$. For a linearly separable input, the SVM finds the hyperplane with maximum Euclidean distance to the closest training examples. This distance is called the margin $\delta$, as depicted in Fig. 2. For non-separable training sets, the amount of training error is measured using slack variables $\xi_j$, as shown in Fig. 2 for a two-class problem.

Fig. 2 The optimal separating hyperplane (OSH), support vectors $\alpha_i$, and the slack variables $\xi_i$

Computing hyperplanes is equivalent to solving the following primal optimization problem:

minimize

$$V(\mathbf{w}, b, \boldsymbol{\xi}) = \tfrac{1}{2}\,\mathbf{w} \cdot \mathbf{w} + C \sum_{j=1}^{n} \xi_j \qquad (1)$$

subject to

$$\forall^{n}_{j=1} : \; y_j \, [\mathbf{w} \cdot \mathbf{x}_j + b] \ge 1 - \xi_j \qquad (2)$$

$$\forall^{n}_{j=1} : \; \xi_j \ge 0 \qquad (3)$$

The second constraint requires that all the training examples are classified properly up to a slack $\xi_j$. Therefore, $\sum_{j=1}^{n} \xi_j$ is an upper bound on the number of training errors. The factor $C$ in Eq. (1) is a parameter that allows trading off training error versus model complexity. Note that the margin of the resulting hyperplane is $\delta = 1/\|\mathbf{w}\|$. The hyperplane that separates the positive from the negative examples and has maximal margin is called the maximal margin hyperplane, or the optimal separating hyperplane (OSH), as shown in Fig. 2. The hyperplanes that contain the training points with the minimal distance to the OSH are called the margin hyperplanes, and they form the boundary of the margin. They are represented as H1 and H2 in Fig. 2.

Instead of solving the above optimization problem directly, it is easier to solve its dual optimization problem:

maximize

$$W(\boldsymbol{\alpha}) = \sum_{j=1}^{n} \alpha_j - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j) \qquad (4)$$

subject to

$$\sum_{j=1}^{n} y_j \alpha_j = 0 \qquad (5)$$

$$\forall^{n}_{j=1} : \; 0 \le \alpha_j \le C \qquad (6)$$

A remarkable property of the dual optimization problem is that the $\alpha_j$ associated with training input $\mathbf{x}_j$ expresses the strength with which that point is embedded in the final decision function, and often only a small subset of points will be associated with non-zero $\alpha_j$. These points are called support vectors (SV), and they are the points that lie closest to the separating hyperplane, as shown in Fig. 2.

For solving the general case of linearly non-separable inputs, the SVM implements the following idea: it maps the input vector $\mathbf{x}$ into a high-dimensional feature space $\phi(\mathbf{x}) \in H$ and constructs an optimal separating hyperplane (OSH), which maximizes the margin, the distance between the hyperplane and the nearest data points of each class in the space $H$. Different mappings construct different SVMs. The mapping $\phi(\cdot)$ is performed by a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ which defines an inner product in the space $H$. The polynomial kernel function and the radial basis kernel function are two examples of kernels that can be employed.


3.2 Parameter and kernel selection

The performance of SVM classification is strongly related to the choice of the kernel function and the penalty parameter C. There is a large number of kernel functions available. In general, the radial basis function (RBF) is a reasonable first choice. The RBF kernel non-linearly maps samples into a higher-dimensional space, and can handle the case where the relation between class labels and attributes is non-linear. Furthermore, the linear kernel is a special case of the RBF kernel, as [24] shows that the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, γ). The RBF kernel can be described as

$$k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \, \|\mathbf{x} - \mathbf{z}\|^2) \qquad (7)$$
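Equation (7) is simple to vectorize; the following numpy sketch (illustrative only, not from the paper) computes the full kernel matrix between two sample sets:

```python
# RBF kernel of Eq. (7) for all pairs of rows in X and Z; a plain numpy
# sketch, not tied to any particular SVM implementation.
import numpy as np

def rbf_kernel(X: np.ndarray, Z: np.ndarray, gamma: float) -> np.ndarray:
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    # Squared Euclidean distances via ||x||^2 - 2 x.z + ||z||^2.
    sq = (np.sum(X**2, axis=1)[:, None]
          - 2.0 * X @ Z.T
          + np.sum(Z**2, axis=1)[None, :])
    return np.exp(-gamma * sq)
```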

Thus, while using the RBF kernel function, there are two parameters, C and γ, that need to be selected. Usually, these parameters are selected on a trial-and-error basis: the user performs SVM classification using different (C, γ) pairs and selects the one that outperforms the rest. While the choice of the RBF width γ is at least guided by a heuristic, no theoretical hint is available on how to choose the error weight C. In most works this is done by cross-validation, which amounts to trying a whole range of possible candidates and choosing the one with the best result. The goal is to select good values of (C, γ) so that the classifier performs well on test and unseen data. In k-fold cross-validation, the training set is first divided into k sets of approximately equal size. The classifier is trained using the data from k − 1 subsets and its accuracy is determined using the left-out set. The average accuracy of all the k classifiers is used as the overall accuracy of the classification system. Cross-validation is used to prevent the overfitting problem. Since it is not known a priori which (C, γ) pairs would result in the best classification performance, an automated parameter selection technique would be highly preferred.

For finding the optimum values of the parameters (C, γ) automatically, a grid search technique is used with cross-validation. Basically, pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. The grid search is performed in a hierarchical manner. Initially, the values of C and γ are grown exponentially. Keeping one of the parameters fixed, the other parameter is grown exponentially and the classification performance is evaluated using cross-validation. The two-dimensional grid of interest is raster-scanned and the region where the performance is best is selected. After identifying a better region on the grid, a finer grid search on that region can be conducted. The grid search can easily be parallelized, and although it is exhaustive, the computational time is not huge because there are just two search parameters.
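A minimal sketch of this two-phase search follows, assuming scikit-learn's SVC and cross_val_score for the k-fold accuracy (the paper's own implementation is not specified; the coarse ranges mirror the logarithmic search space described below):

```python
# Two-phase grid search: a coarse pass over exponentially spaced (C, gamma)
# values in log2 space, then a finer pass (step 0.1) around the best coarse
# cell. Ranges, step sizes, and the 5-fold split are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, log2C, log2g, folds=5):
    svm = SVC(kernel="rbf", C=2.0**log2C, gamma=2.0**log2g)
    return cross_val_score(svm, X, y, cv=folds).mean()

def two_phase_grid_search(X, y):
    # Phase 1: coarse raster scan of the (log2 C, log2 gamma) grid.
    coarse = [(c, g) for c in np.arange(2, 9) for g in np.arange(-6, -1)]
    best_c, best_g = max(coarse, key=lambda p: cv_accuracy(X, y, *p))
    # Phase 2: finer scan, step 0.1, restricted to the best coarse region.
    fine = [(c, g)
            for c in np.arange(best_c - 0.5, best_c + 0.6, 0.1)
            for g in np.arange(best_g - 0.5, best_g + 0.6, 0.1)]
    best_c, best_g = max(fine, key=lambda p: cv_accuracy(X, y, *p))
    return 2.0**best_c, 2.0**best_g
```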

In Fig. 3, the best cross-validation accuracy for the video indexing task with 96 features (see the results section for details of the videos) is achieved for C = 45 and γ = 0.06076; the logarithmic search space for these variables was from 2 to 8 and from −2 to −6. From Fig. 3, it is observed that the region with log2(C) between 5 and 6 and log2(γ) between −4 and −4.5 has the best cross-validation results, of the order of 83%. Thus, a finer search is conducted in this region, with the search space restricted to 5 to 6 and −4 to −4.5 as shown in Fig. 4, and the search steps reduced to 0.1 in both cases. In this particular case, the results improve by around 1% by conducting the finer search, which is a significant improvement in classification accuracy.

Fig. 3 Grid search at a coarse level

Fig. 4 Grid search at a finer level

3.3 Multiclass classification

Typically, SVMs solve a two-class problem: they separate the data into two classes. However, in video indexing, the video needs to be categorized into multiple classes. Several methods have been proposed in which a multi-class classifier is constructed by combining several binary classifiers. The one-against-rest and one-against-one approaches are popular. In the one-against-rest approach, the ith SVM is trained with all of the examples in the ith class carrying positive labels and all other examples carrying negative labels. Thus, for a k-class problem, k classifiers are constructed, and for each input sample the class corresponding to the classifier with the largest score is selected.

We use the “one-against-one” approach [25], in which k(k − 1)/2 classifiers are constructed and each one trains on data from two different classes. The first use of this strategy on SVMs was in [26]. In our system, we use a voting strategy: the result of each binary classifier is considered to be a vote, and finally the class label with the maximum number of votes is assigned to the data point. In case two classes have equal votes, the tie is resolved by assigning the class label that resulted from the classifier in which both these classes were present. The voting approach described above is also called the “Max Wins” strategy.
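The following sketch illustrates “Max Wins” voting over the k(k − 1)/2 pairwise classifiers, including the two-way tie-breaking rule described above; the use of scikit-learn's SVC for the binary classifiers is an assumption made for illustration:

```python
# "Max Wins" voting over one-against-one SVMs: train k(k-1)/2 pairwise
# classifiers, let each vote, and break two-way ties with the direct
# classifier for the tied pair. Illustrative sketch only.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_pairwise(X, y):
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return classes, models

def predict_max_wins(x, classes, models):
    votes = {c: 0 for c in classes}
    for pair, m in models.items():
        votes[m.predict(x.reshape(1, -1))[0]] += 1
    top = sorted(votes, key=votes.get, reverse=True)[:2]
    if votes[top[0]] == votes[top[1]]:
        # Two-way tie: defer to the direct pairwise classifier.
        pair = tuple(sorted(top))
        return models[pair].predict(x.reshape(1, -1))[0]
    return top[0]
```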

4 Feature extraction

Descriptors can be classified as global or local [27]. Global or coarse-grained feature extraction techniques transform the whole image into a functional representation in which minute details within individual portions of the multimedia are ignored. This offers low computational complexity at the cost of a high percentage of false matches. At another level of granularity are local descriptors, which take a fine-grained approach by analyzing the data in smaller segmented regions. Although working with local descriptors implies increasing the complexity of the feature extraction process and increasing the dimension of the descriptor space, local descriptors are nevertheless employed in our CBR system as they provide a more effective characterization of a class.

In order to give a complete understanding of the work, a discussion of the descriptors is presented in Table 1. Feature extraction was done for a domain comprising diverse video classes, selected as shown in Table 1. The sequences were recorded from TV using a VCR and grabbed in MPEG format. The size of the training database was 4000 sequences, each with frame dimension 352 × 288. The size of the test database was 1000 sequences, comprising an equal number of sequences for each class. The results presented throughout this paper are based on these descriptors and this video database. It can be seen that the descriptors come from highly varied sources, so a priori normalization might not work. The high dimensionality associated with these descriptors poses further challenges.

The details of these descriptors are as follows:

1. Region shape: The shape of an object may consist of either a single region or a set of regions, as well as some holes in the object. Since the region-based shape descriptor makes use of all the pixels constituting the shape, it can describe any shape. This descriptor employs a set of ART (Angular Radial Transform) coefficients. The ART $F_{nm}$ is a 2D complex transform of order $n$ and $m$ defined on a unit disk in polar coordinates:

$$F_{nm} = \langle V_{nm}(\rho, \theta), f(\rho, \theta) \rangle = \int_0^{2\pi} \!\! \int_0^1 V^{*}_{nm}(\rho, \theta)\, f(\rho, \theta)\, \rho \, d\rho \, d\theta, \qquad (8)$$

where $f(\rho, \theta)$ is an image function in polar coordinates and $V_{nm}$ is the ART basis function. The ART basis functions are separable along the angular and radial directions. Twelve angular and three radial functions are used, giving 36 dimensions to this descriptor.
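A numerical sketch of Eq. (8) on a discretized unit disk follows. The separable basis used here, $V_{nm} = A_m(\theta) R_n(\rho)$ with $A_m(\theta) = e^{jm\theta}/2\pi$ and $R_n(\rho) = 1$ for $n = 0$, $2\cos(\pi n \rho)$ otherwise, follows the commonly cited MPEG-7 ART definition; treat it as an assumption, not the paper's code:

```python
# Discrete approximation of the ART coefficients of Eq. (8). Summing over
# Cartesian pixels already accounts for the rho*drho*dtheta area element.
import numpy as np

def art_coefficients(img: np.ndarray, n_max: int = 3, m_max: int = 12):
    """Return the |F_nm| magnitudes (n_max * m_max = 36 dimensions)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Map pixel coordinates onto the unit disk centred on the image.
    x = (xx - w / 2.0) / (w / 2.0)
    y = (yy - h / 2.0) / (h / 2.0)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = rho <= 1.0
    coeffs = np.empty((n_max, m_max))
    for n in range(n_max):
        R = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
        for m in range(m_max):
            V = np.exp(1j * m * theta) / (2.0 * np.pi) * R
            # Inner product of the conjugate basis with the image function.
            coeffs[n, m] = np.abs(np.sum(np.conj(V[inside]) * img[inside]))
    return coeffs.ravel()
```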

2. Homogeneous texture: Many natural and man-made objects such as water, grass, a bed of flowers, etc., are distinguished by their texture. The homogeneous texture descriptor characterizes the region texture using the energy and energy deviation in a set of frequency channels. The energy $e_i$ of the $i$th descriptor channel is defined as the log-scaled sum of the square of Gabor-filtered Fourier transform coefficients of an image: $e_i = \log_{10}[1 + p_i]$, where

$$p_i = \int_{\omega = 0^+}^{1} \int_{\theta = 0^{\circ+}}^{360^\circ} [G_{P_{s,r}}(\omega, \theta) \cdot P(\omega, \theta)]^2 \qquad (9)$$

and $P(\omega, \theta)$ is the Fourier transform of an image represented in the polar frequency domain and $G_{P_{s,r}}(\omega, \theta)$ is the 2D Gabor function. The energy deviation $d_i$ of the $i$th descriptor channel is defined as the log-scaled standard deviation of the square of Gabor-filtered Fourier transform coefficients of an image: $d_i = \log_{10}[1 + q_i]$, where

$$q_i = \sqrt{\left\{ \left\langle \int_{\omega = 0^+}^{1} \int_{\theta = 0^{\circ+}}^{360^\circ} [G_{P_{s,r}}(\omega, \theta) \cdot P(\omega, \theta)]^2 \right\rangle - p_i \right\}^2} \qquad (10)$$

3. GoF color: The group-of-frames (GoF) color descriptor defines a structure required for representing the color descriptors of a collection of video frames by means of the scalable color descriptor. The individual histogram of a video frame is a 32-bin quantized HSV color histogram. The three sub-descriptors used are the average histogram, the median histogram, and the intersection histogram. The intersection histogram (int_histogram) is obtained by computing, for each bin $j$, the minimum value over all the $N$ frame histograms in the group:

$$\mathrm{int\_histogram}[j] = \min_i(\mathrm{Histogram\_value}_i[j]), \quad j = 1, \ldots, 32 \qquad (11)$$

where $\mathrm{Histogram\_value}_i[j]$ is the count of the $j$th bin of the color histogram of frame $i$.
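The three GoF sub-descriptors are straightforward once each frame's 32-bin HSV histogram is available; a plain numpy sketch (illustrative, with the per-frame quantization assumed to be done already):

```python
# GoF sub-descriptors over N per-frame histograms: average, median, and the
# bin-wise minimum of Eq. (11).
import numpy as np

def gof_color(H: np.ndarray):
    """H has shape (N, 32): one 32-bin HSV histogram per frame."""
    avg_histogram = H.mean(axis=0)
    median_histogram = np.median(H, axis=0)
    int_histogram = H.min(axis=0)   # Eq. (11): min over frames, per bin
    return avg_histogram, median_histogram, int_histogram
```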

4. Color layout: This descriptor specifies the spatial distribution of the color of a representative frame selected from the corresponding video segment. The descriptor can be applied to arbitrarily shaped regions. Six, three, and three DCT coefficients of the color components Y, Cb, and Cr, respectively, constitute the dimensions of this descriptor. Since the human visual system has much less dynamic range for spatial variation in color (the Cb and Cr channels) than for brightness (luminance, the Y channel), the Cb and Cr channels are allocated less bandwidth.

5. Color structure: This is a color descriptor that captures both the color content and the structure of the content via the use of a structuring element. A “raw” 256-bin HMMD color space histogram is accumulated directly from the image. The color structure descriptor containing 128 bins is computed by unifying the bins of the 256-bin descriptor. The final step of the extraction process is the non-uniform quantization of each bin amplitude to an 8-bit code value.

6. Edge components histogram: This descriptor represents the spatial distribution of five types of edges in local image regions. There are four directional edges and one non-directional edge in each local region, which is known as a sub-image. The sub-images are defined by dividing the image space into 16 non-overlapping blocks. For each sub-image, a local edge histogram with 5 bins is generated; thus there are in total 16 × 5 = 80 histogram bins.
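A sketch of this descriptor follows: the frame is split into 4 × 4 sub-images, and within each, every 2 × 2 pixel block is labeled with the edge type whose filter response is largest, provided it exceeds a threshold. The five filters below follow the commonly cited MPEG-7 edge histogram coefficients; treat the exact values, block size, and threshold as assumptions:

```python
# Edge-components histogram sketch: 16 sub-images x 5 edge-type bins
# (vertical, horizontal, 45-degree, 135-degree, non-directional) = 80 dims.
import numpy as np

FILTERS = np.array([
    [1, -1, 1, -1],            # vertical
    [1, 1, -1, -1],            # horizontal
    [2**0.5, 0, 0, -2**0.5],   # 45-degree diagonal
    [0, 2**0.5, -2**0.5, 0],   # 135-degree diagonal
    [2, -2, -2, 2],            # non-directional
])

def edge_histogram(img: np.ndarray, threshold: float = 11.0) -> np.ndarray:
    h, w = img.shape
    hist = np.zeros((4, 4, 5))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            # 2 x 2 block flattened row-major: (a00, a01, a10, a11).
            block = img[i:i+2, j:j+2].astype(float).ravel()
            resp = np.abs(FILTERS @ block)
            if resp.max() > threshold:
                hist[min(4 * i // h, 3), min(4 * j // w, 3), resp.argmax()] += 1
    return hist.ravel()
```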

7. Other descriptors: In these descriptors, the image is represented in the RGB color space. The descriptor average color has three dimensions containing the average values of the three components R, G, and B. Histogram color is the 32-bin quantized histogram for the three components R, G, and B. Other descriptors are based on a grid layout strategy of MPEG-7: each frame is split into a set of equally sized 4 × 4 rectangular regions and each region is described separately. Motion $\mathrm{Mot}^k_n$ is the number of pixels moved from frame $n-1$ to the next frame $n$, given as:

$$\mathrm{Mot}^k_n = \sum_i \sum_j \langle (f_n(i, j) - f_{n-1}(i, j)) > T_m \rangle, \qquad (12)$$

where $(i, j) \in k$th sub-image, $k = 1, \ldots, 16$, and $f_n(i, j)$ is the pixel value at $(i, j)$ coordinates of the $n$th frame. $T_m$ is a suitably chosen threshold. The predicate $\langle (f_n(i, j) - f_{n-1}(i, j)) > T_m \rangle$ is either 0 or 1. For calculating the intensity variation $IV^k_n$, the average of the intensity difference between frame $n-1$ and frame $n$ is taken:

$$IV^k_n = \frac{1}{\mathrm{size(sub\text{-}image)}} \sum_i \sum_j f_n(i, j) - f_{n-1}(i, j), \qquad (13)$$

where $(i, j) \in k$th sub-image, $k = 1, \ldots, 16$.

Feature density is calculated by finding the edges in a frame. Then, a percentage count of pixels with edge intensity greater than a threshold is taken for each sub-image.
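Equations (12) and (13) reduce to a thresholded count and a mean over each grid cell; a short sketch (the block boundaries and the threshold value are illustrative):

```python
# Per-sub-image motion (Eq. 12) and intensity variation (Eq. 13) over a
# 4 x 4 grid of sub-images.
import numpy as np

def motion_and_iv(prev: np.ndarray, cur: np.ndarray, t_m: float = 25.0):
    """Return two length-16 vectors (one entry per sub-image)."""
    diff = cur.astype(float) - prev.astype(float)
    h, w = diff.shape
    mot, iv = np.zeros(16), np.zeros(16)
    for k, (bi, bj) in enumerate((i, j) for i in range(4) for j in range(4)):
        sub = diff[bi * h // 4:(bi + 1) * h // 4,
                   bj * w // 4:(bj + 1) * w // 4]
        mot[k] = np.sum(sub > t_m)   # Eq. (12): predicate <diff > T_m>
        iv[k] = sub.mean()           # Eq. (13): average intensity change
    return mot, iv
```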

5 Dimensionality reduction

Multimedia classification is representative of a domain of tasks involving high dimensionality of the descriptor space and large dissimilarity between descriptors in their range and distribution. The number of features (or dimensions) affects the classifier's speed: including a large number of features can result in long training and classification times. If we use too many features, we may not be able to reliably estimate the correct values for the learning parameters. This can result in over-fitting, where the classifier models the idiosyncrasies of the training set rather than the class that the training set represents.

Rather than working with the entire descriptor set, a better approach is to begin with a large number of features and use statistical techniques to decide which features are relevant. A classifier can then be constructed from this reduced set of features. This task of deciding which features are relevant for a classification task is known as feature selection or dimensionality reduction. Dimensionality reduction can eliminate irrelevant and/or redundant dimensions of the descriptors. By using feature selection, classification algorithms can in general improve their predictive accuracy (as in [28]), shorten the learning period [29], and save on memory requirements and computation time.

For selecting a subset of features, techniques can be broadly classified into wrapper and filter methods. Wrappers utilize the learning machine of interest as a black box to score subsets of variables according to their predictive power. Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. Wrapper techniques are usually slower, however. Another compelling justification for the use of filters is that filtering can serve as a preprocessing step to reduce space dimensionality and overcome over-fitting, which is exactly what we desire.

In this paper, we employ a filter algorithm that ranks feature subsets according to a correlation-based heuristic evaluation function. The function tends to choose feature subsets that are highly correlated with the class and relatively uncorrelated with one another. We employ Pearson's correlation coefficient [30], given by

$$r_{sv} = \frac{k \, r_{si}}{\sqrt{k + k(k-1)\, r_{ii}}} \qquad (14)$$

where $r_{sv}$ is the merit of the feature subset under consideration containing $k$ features, $r_{si}$ is the average of the correlations between the features and the class, and $r_{ii}$ is the average inter-correlation between the features. In Eq. (14), the numerator indicates the predictive power of a feature set, and the denominator accounts for the redundancy in the feature space. This equation imposes a ranking on the feature subsets in the feature search space.

In conjunction with the above criterion, we employ a best-first greedy algorithm for selecting the features, as shown in Fig. 1. The best-first search starts with an empty set of features and generates all possible single-feature expansions. Next, the subset with the best performance is selected and is further expanded by adding features in the same manner. The subset expansion stops when adding a new feature does not improve the result. The search then continues from the next best unexpanded subset. To prevent the best-first algorithm from exploring the entire search space, a stopping criterion is employed. The best subset found is returned to the SVM predictor when the search terminates according to the above criterion.
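The merit of Eq. (14) combined with a best-first forward search can be sketched as follows; the expansion and stopping logic below is an illustrative simplification, not the paper's implementation:

```python
# Filter selection: rank subsets by the merit of Eq. (14) and expand them
# best-first, stopping after a fixed number of non-improving expansions.
import numpy as np

def merit(X: np.ndarray, y: np.ndarray, subset: tuple) -> float:
    k = len(subset)
    r_si = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    r_ii = (np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                     for a in subset for b in subset if a < b])
            if k > 1 else 0.0)
    return k * r_si / np.sqrt(k + k * (k - 1) * r_ii)   # Eq. (14)

def best_first_select(X, y, max_stale=5):
    n = X.shape[1]
    open_list = [((), 0.0)]                  # subsets awaiting expansion
    best, best_m, stale = (), 0.0, 0
    while open_list and stale < max_stale:   # stop after repeated dead ends
        subset, _ = open_list.pop(0)
        improved = False
        for f in range(n):
            if f in subset:
                continue
            cand = tuple(sorted(subset + (f,)))
            m = merit(X, y, cand)
            if m > best_m:
                best, best_m, improved, stale = cand, m, True, 0
                open_list.append((cand, m))
        if not improved:
            stale += 1
        open_list.sort(key=lambda s: -s[1])  # continue from next best subset
    return list(best)
```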


Table 1 Set of descriptors used and set of video classes

Descriptor            Dimension    Class no.   Class
Region shape          36           1           Basket
HomoTexture           62           2           MTV (music)
GoF color             96           3           Educational
Color layout          12           4           Tennis
Color structure       128          5           Soccer
Edge histogram        80           6           Swimming
Average color         3            7           News
Histogram color       32 × 3       8           Table tennis
Motion                16           9           Weather report
Intensity variation   16 × 3       10          Volleyball
Feature density       16

Total                 593

6 Experiments and results

This section describes experiments using real video sequences. Details of the experimental setup are first described, and a comparison in performance is then made with some standard classification tools. Some advantages of the present framework are highlighted in the ensuing discussion.

6.1 Experimental setup

A domain comprising diverse video classes was selected, as shown in Table 1. The ideas of performing association and classification in content-based classification are beginning to develop with the application of tools like neural networks (for example, see Doulamis et al. [31]), decision trees (see Demsar et al. [32]), and the K-nearest neighbor classifier (see Yang et al. [33]). These works have a different paradigm of operation from our CBR system in the sense that they do not envisage the autonomous development of high-level classes from the knowledge extraction process as we do. Here, we present a comparison of our approach with implementations of neural networks (ANN), the K-nearest neighbor classifier (KNN), and decision trees.

6.2 Comparison with other approaches

Some of the most well-known decision tree algorithms are C4.5 [34] and its improved successor C5.¹ We chose the C5 decision tree package for the purpose of comparison since it has many nice features, such as accurate and fast rule sets and fuzzy thresholding. For the neural networks, a feed-forward backpropagation network was used with two hidden layers consisting of 150 and 100 neurons. The training function used was resilient backpropagation (‘trainrp’ in MATLAB). It is one of the best learning functions for classification problems, is not sensitive to the fine settings of the training parameters, and converges faster than other functions [35]. The transfer functions employed were {tansig, tansig, tansig}, where tansig is the hyperbolic tangent sigmoid transfer function. A refinement of the KNN algorithm called the distance-weighted nearest neighbor algorithm [36] is used, in which the contribution of each of the k neighbors is weighted according to its distance to the query point, giving greater weight to closer neighbors.

¹ Downloaded from http://www.rulequest.com/see5-unix.html.

For the purpose of comparing performance, two sets of experiments were done under these conditions: (a) with varying size of the training data, and (b) with varying number of dimensions. The type (a) experiment evaluates the generalization properties of the classification approach in relation to non-linear input-output mapping, while the type (b) experiment demonstrates the effect of dimensionality on performance.

Table 2 illustrates the comparison of the percentage classification accuracy in the type (a) experiment. The number of dimensions kept for this experiment set was 593, and the size of the training data was varied by randomly selecting an equal percentage of sequences for each class. Our approach is denoted as ISVM (improved support vector machines) and the backpropagation networks as MLP. ISVM has the overall best performance, whereas KNN is the second best. It is interesting to note that the performance of ISVM is above 90% even with 30% of the training samples, when the dimension space is relatively sparse. The second-best performance of KNN could be attributed to its strategy of working in local regions as opposed to estimating parameters for the entire descriptor space. However, in KNN the boundaries are not as well defined as in ISVM, and therefore boundary points are misclassified.

Table 3 shows the classification accuracy of the various tools for varying numbers of dimensions (using the dimensionality reduction algorithm presented in Sect. 5). With only 10% of the dimensions, the best performance was that of KNN (90%). It is noteworthy that increasing the number of dimensions initially results in better performance for most of the tools; it appears that with very few dimensions there was less information for distinguishing between the classes. The distinction achieved with a large number of dimensions shows the effectiveness of local feature extraction over global extraction. On increasing the number of dimensions to more than 20%, ISVM gets sufficient meaningful dimensions and its performance is consistently more than 93%.

On the other hand, tools like neural networks suffer from the curse of dimensionality, and it has been suggested in the literature [35] that the only practical way to overcome it is to incorporate prior knowledge about the function, over and above the training data, which is known to be correct. However, this exercise is very difficult. This is also reflected in the computational times taken by the MLP. The training time for the MLP on a Pentium IV machine is approximately 5 h, while KNN takes 14 min, the C5 classifier requires 3 min, and ISVM takes approximately 7 min. The classification time for all the classifiers is negligible and almost the same.

The recall rate using the ISVM is 96.45% and the precision is 95.5%.


Table 2 Comparison of accuracy of various classification tools as a function of the percentage of samples in the training sequence

                    Accuracy for a given percentage of samples used for training
S. No.  Technique   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
1       MLP         40.8  57.6  53.4  55.5  48.6  52.8  41.2  54.3  53.2  48.2
2       KNN         78.6  81.2  82.4  83.8  86.1  89.3  89.1  89.4  88.6  89.8
3       C5          40.1  81.2  82.8  83.4  82.6  83.2  84.8  83.6  85.2  85.6
4       ISVM        78.5  86.6  91.5  93.7  95.5  94.9  95.6  95.8  95.6  95.4

The number of classes was 10 and the dimensions were kept fixed at 593.

Table 3 Performance with varying number of dimensions or features

                    Accuracy for a given percentage of dimensions used for training
S. No.  Technique   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
1       MLP         37.8  47.6  43.4  51.6  44.8  42.4  51.6  50.9  51.4  48.2
2       KNN         90.6  90.2  89.4  89.8  90.1  89.7  90.1  89.4  89.6  89.8
3       C5          81.1  85.2  84.8  85.4  84.6  84.8  85.1  85.2  85.4  85.6
4       ISVM        88.5  92.6  93.2  94.5  95.3  94.4  95.1  95.6  95.4  95.4

Table 4 Confusion matrix

The individual class errors can be seen from Table 4, which shows the confusion matrix for the ISVM. It can be seen that the educational class is the one best characterized by these features, and thus most of its shots are classified correctly. In fact, there are only a few settings (like those depicting a speaker, a screen, students, etc.) that are possible in educational videos. One significant observation is that most of the misclassified shots were assigned to class 2 (MTV), which has a large distribution in the feature space (especially with relation to color and motion), thus leading to misclassifications.

7 Conclusion and future work

Several works in CBR have contributed meaningful features that extract content information in some unique manner. In our work, an automatic framework is presented to map the features to semantic video class labels. Only relevant features, as chosen by the feature selection algorithm, are considered for classification. A two-phase grid searching algorithm optimizes the support vector machine parameters to give the best possible classification. The experiments testing the generality of our framework and the effect of dimensionality show promising results.

The framework presented in this paper is limited to classifying video shots; activities or emotions that span multiple video shots would require extending our framework to a temporal one. As the number of classes increases, it would also be interesting to automatically evaluate meaningful features for each class.

References

1. Szummer M., Picard R.W.: Indoor–outdoor image classification. In: International Workshop on Content-Based Access of Image and Video Databases, pp. 42–51 (1998)

2. Vailaya A., Jain A.K.: Incremental learning for Bayesian classification of images. In: International Conference on Image Processing, vol. 2, pp. 585–589 (1999)

3. Forsyth D., Malik J., Fleck M., Greenspan H., Leung T., Belongie S., Carson C., Bregler C.: Finding pictures of objects in large collections of images. In: International Workshop on Object Recognition for Computer Vision, pp. 335–361 (1996)

4. Gorkhani M.M., Picard R.W.: Texture orientation for sorting photos. In: International Conference on Pattern Recognition, pp. 459–464 (1994)

5. Torralba A.B., Oliva A.: Semantic organization of scenes using discriminant structural templates. In: International Conference on Computer Vision, vol. 2, pp. 1253–1258 (1999)

6. Vailaya A., Jain A.K.: Reject option for VQ-based Bayesian classification. In: International Conference on Pattern Recognition, pp. 2048–2051 (2000)

7. Ferman A.M., Krishnamachari S., Tekalp A.M.: Group-of-frame/picture histogram descriptors for multimedia applications. In: International Conference on Image Processing, pp. 1181–1184 (2000)


8. ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio: Overview of the MPEG-7 standard (version 4.0). International Organization for Standardization, October 2000

9. Fischer S., Lienhart R., Effelsberg W.: Automatic recognition of film genres. In: ACM Multimedia—Electronic Proceedings, San Francisco, CA, pp. 295–304 (1995)

10. Truong B.T., Venkatesh S., Dorai C.: Automatic genre identification for content-based video categorization. In: International Conference on Pattern Recognition, pp. 4230–4234 (2000)

11. Ferman A.M., Tekalp A.M.: Probabilistic analysis and extraction of video content. In: International Conference on Image Processing, vol. 2, pp. 91–95 (1999)

12. Naphade M.R., Kristjansson T., Frey B., Huang T.S.: Probabilistic multimedia objects (multijects): A novel approach to video indexing and retrieval in multimedia systems. In: International Conference on Image Processing, pp. 536–540 (1998)

13. Vasconcelos N., Lipman A.: Towards semantically meaningful feature spaces for the characterization of video content. In: International Conference on Image Processing, pp. 25–28 (1997)

14. Yuan Y., Song Q.-B., Shen J.-Y.: Automatic video classification using decision tree method. In: Proceedings of Machine Learning and Cybernetics, pp. 1153–1157 (2002)

15. Lin W.-H., Hauptmann A.: News video classification using SVM-based multimodal classifiers and combination strategies. In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 323–326 (2002)

16. Huang J., Liu Z., Wang Y., Chen Y., Wong E.K.: Integration of multimodal features for video scene classification based on HMM. In: IEEE Third Workshop on Multimedia Signal Processing, pp. 53–58 (1999)

17. Lu C., Drew M.S., James A.: Classification of summarized videos using hidden Markov models on compressed chromaticity signatures. In: Proceedings of the Ninth ACM International Conference on Multimedia, pp. 479–482 (2001)

18. Cortes C., Vapnik V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)

19. Vapnik V.: Statistical Learning Theory. Wiley, New York (1998)

20. Roobaert D., Van Hulle M.M.: View-based 3D object recognition with support vector machines. In: Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, Wisconsin, USA, pp. 77–84 (1999)

21. Drucker H., Shahrary B., Gibbon D.C.: Relevance feedback using support vector machines. In: Proceedings of the 18th International Conference on Machine Learning, pp. 122–129 (2001)

22. Brown M.P., Haussler D.: Knowledge-based analysis of microarray gene expression data by using support vector machines. In: Proceedings of the National Academy of Sciences USA, pp. 262–267 (2000)

23. Hua S., Sun Z.: A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. J. Mol. Biol. 397–407 (2001)

24. Keerthi S.S., Lin C.J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput. 1667–1689 (2003)

25. Knerr S., Personnaz L., Dreyfus G.: Single-layer learning revisited: A stepwise procedure for building and training a neural network. In: Neurocomputing: Algorithms, Architectures and Applications, NATO ASI. Springer-Verlag, Berlin (1990)

26. Kressel U.H.-G.: Pairwise classification and support vector machines. In: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA (1999)

27. Khatib W.A., Day Y.F., Ghafoor A., Berra P.B.: Semantic modeling and knowledge representation in multimedia databases. IEEE Trans. Knowl. Data Eng. 11, 64–80 (1999)

28. Aha D.W., Bankert R.L.: Feature selection for case-based classification of cloud types: An empirical comparison. In: Aha D.W. (ed.) Case-Based Reasoning, pp. 106–112. AAAI Press, Menlo Park, CA (1994)

29. Liu H., Setiono R.: Dimensionality reduction via discretization. Knowl. Based Syst. 9, 67–72 (1996)

30. Guyon I., Elisseeff A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

31. Doulamis N.D., Doulamis A.D., Kollias S.D.: A neural network approach to interactive content-based retrieval of video databases. In: International Conference on Image Processing, vol. 2, pp. 116–120 (1999)

32. Demsar J., Solina F.: Using machine learning for content-based image retrieving. In: International Conference on Pattern Recognition, vol. 3, pp. 138–142 (1996)

33. Yang Z., Kuo C.C.J.: A semantic classification and composite indexing approach to robust image retrieval. In: International Conference on Image Processing, vol. 1, pp. 134–138 (1999)

34. Quinlan J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Los Altos, CA (1993)

35. Haykin S.: Neural Networks: A Comprehensive Foundation, 2nd edn, pp. 178–210. Prentice Hall, NJ (1999)

36. Mitchell T.M.: Instance-based learning. In: Machine Learning, pp. 230–248. McGraw-Hill, New York (1997)