video scene detection using compact bag of visual word...

Research ArticleVideo Scene Detection Using Compact Bag ofVisual Word Models

Muhammad Haroon 1 Junaid Baber1 Ihsan Ullah1 Sher Muhammad Daudpota2

Maheen Bakhtyar1 and Varsha Devi3

1Department of Computer Science amp IT University of Balochistan Pakistan2Department of Computer Science Sukkur IBA University Pakistan3Department of Computer Science Sardar Bahadur KhanWomenrsquos University Pakistan

Correspondence should be addressed to Muhammad Haroon haroonsdbagmailcom

Received 18 May 2018 Revised 14 August 2018 Accepted 3 October 2018 Published 8 November 2018

Academic Editor Deepu Rajan

Copyright copy 2018 Muhammad Haroon et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

Video segmentation into shots is the first step for video indexing and searching Videos shots are mostly very small in durationand do not give meaningful insight of the visual contents However grouping of shots based on similar visual contents gives abetter understanding of the video scene grouping of similar shots is known as scene boundary detection or video segmentationinto scenes In this paper we propose a model for video segmentation into visual scenes using bag of visual word (BoVW) modelInitially the video is divided into the shots which are later represented by a set of key frames Key frames are further represented byBoVW feature vectors which are quite short and compact compared to classical BoVWmodel implementations Two variations ofBoVWmodel are used (1) classical BoVWmodel and (2)Vector of Linearly AggregatedDescriptors (VLAD)which is an extensionof classical BoVWmodel The similarity of the shots is computed by the distances between their key frames feature vectors withinthe sliding window of length L rather comparing each shot with very long lists of shots which has been previously practicedand the value of L is 4 Experiments on cinematic and drama videos show the effectiveness of our proposed framework TheBoVW is 25000-dimensional vector and VLAD is only 2048-dimensional vector in the proposed modelThe BoVW achieves 090segmentation accuracy whereas VLAD achieves 083

1 Introduction

The size of video databases is increasing exponentially due tothe emergence of cheap and fast Internet The indexing andretrieval of the videos are getting more difficult The expec-tation of users are high due to advanecment in technologiesThe giant video portals such as YouTube Dailymotion andGoogle are investing huge amount on efficient and smartindexing and retrieval so that their portals remain attractiveand addictive to the users

To process videos for indexing and searching the firsttask is to segment the videos into shots and extract represen-tative frames known as key frames from each shot Thesekey frames are later used for searching efficient indexingscene generation and video classification Main idea to selectkey frame is to reduce the computational cost as video is the

collection of frames which are stored in temporal order ieevery video uploaded on Youtube is 30 frames per secondor higher The more the frames per second the better thevisual effect Despite being very sophisticated hardware allthe frames cannot be processed in real time applications suchas event detection from CCTV streaming To process oneframe for the detection of possible objects it takes 05 to15 seconds to identify objects in the frame (cascade objectdetector is used to identify possible text boards in the frameusing Matlab)

In video scene segmentation the video is divided intoshots and similar shots are combined together to make thescenes Shots are uninterrupted and unceasing sequencesof video frames where there is no change in theme andcamera [1] Generally the video shots can be categorized intotwo types abrupt shots and gradual shots An abrupt shot

HindawiAdvances in MultimediaVolume 2018 Article ID 2564963 9 pageshttpsdoiorg10115520182564963

2 Advances in Multimedia

boundary is the sudden change in the scene such as changeof speaker during TV interviews whereas gradual shots takeseveral frames to change the shot such as fades and dissolvesIn videos many shots repeat in very short interval of time ifthose shots are combined then these collections of shots arecalled scenes For example if two actors are talking then thecamera keeps switching to both actors with very little changein background and in two-minute conversation of video clipthere are sometimes 25-30 shots Scene detection aka sceneboundary detection or video scene segmentation is the studyto merge similar or repeating shots into one clip or dividingthe videos into clips which are semantically or visually relatedor similar

Manual segmentation of videos for websites and DVDsis very time consuming and not feasible when dealing withlarge datasets Recently automatic video segmentation intoshots and scenes have gained wide attraction among industryand researchers [2ndash5]

In the proposed methodology videos are segmented intoabrupt shot boundaries which are further grouped on thebasis of the similarity to construct the scenes The proposedmethodology is inspired by the BoVW model for scenedetection [2] the abstract flow diagram is shown in Figure 1In bag of visual word model local key point descriptorswhich are extracted from the key frame of the shots arerepresented by the histograms of visual words These keyframes arematched based onbag of visual word histograms insliding window of lengthL [3] It has been shown that shotsmatched in sliding windows aremore efficient [2 3] ClassicalBoVWmodel andVLAD are usedwith compact vocabularieswithout compromising in accuracies

Rest of the paper is organized as follows Section 2presents the related work It is divided into three subsections(1) shot boundary detection (2) key frame extraction and (3)scene boundary detection Section 3 presents the proposedmethodology along with experimental protocols and finallySection 4 concludes the papers and discusses the future work

2 Related Work

In this section a brief literature review and state-of-the-artmethodologies are presented for all the main steps of thevideo segmentation which include shot boundary detectionkey frame extraction and scene boundary detection

21 Shot Boundary Detection In the problem of videoindexing and searching the first and foremost step is shotboundary detection Shot boundary has two types as men-tioned earlier abrupt and gradual shot boundaries Abruptshot boundary is the sudden change in the stream if thedissimilarity difference between the two consecutive framesis very large then either of the adjacent frames is consideredthe boundary whereas gradual shot boundary is the gradualchange in the video such as the effects like fade-in fade-outand dissolves

Let the 119865 = 1198911 1198912 119891119899 be frames of a video and 119891119894 issaid to be an abrupt shot boundary if and only if the differencebetween 119891119894 and 119891119894+1 is greater than a threshold 120591 In ourexperiments we are not taking the gradual shot detection into

Video Shot BoundaryDetection

Key frame Selection

FeatureExtractionQuantization

Similarity ofKey Frames in

Sliding Window

Grouping ofShots Scene Creation

Figure 1 Abstract flow diagram of proposed framework

consideration as the ratio of gradual shots in any cinematicvideo is too small more than 90 of shot boundaries areabrupt boundaries [2 3]

There is long list of methodologies first on pixel to pixeldifference between the consecutive video frames ie 119891119894 and119891119894 + 1 are used for segmentation of the video [6] In thistechnique if the sum of the pixels difference is greater thansome threshold then it was considered to be an abrupt shotboundary

Later on many other scientists worked on this problemand proposed a new technique in which pixel intensityhistograms were used in successive frames instead of the pixelto pixel difference to detect the abrupt shot boundaries [7 8]These techniques are good except that they are sensitive tomotion of objects and camera [9]

Moreover a latter approach [10] detects the shot bound-ary based on the mutual information and joint entropybetween the consecutive frames A sports dataset has beenused to detect the shot boundary This technique of jointentropy is useful if used for faded or gradual boundariesThe entropy is high for an extended time period duringfade-in because the visual intensity gradually increases andthe entropy is low during fade-out as the intensity slowlydecreases

Videos fragmentation by pixel variance in frames andpixel strength in histogram calculations has been presentedin [11 12] The frame indexing was used with rapid boundarywhen the amount of shot pixels between two frames isoverdone by some threshold

Chavez et al [13] proposed a different technique in whichthey used supervised learning with support vector machine(SVM) in order to separate the abrupt boundaries from thegradual boundaries In this technique authors calculated thedissimilarity vector which assimilate set of different featuresincluding Fourier-Mellinmoments Zernike moments andcolor histograms (RGB and HSV) to capture the informationlike illumination changes and rapid motion After then thisvector is used in SVM for detection of shot boundariesThe authors also used illumination changes for detecting thegradual shot boundaries

Advances in Multimedia 3

Furthermore [14] proposed a new technique of learningalgorithm which has three main steps

(1) Firstly frames which have smooth changes areremoved

(2) Secondly three types of feature differences areextracted The intensity difference difference in ver-tical and horizontal edge histograms and differencebetween the HSV color histograms are calculatedfrom the shot boundaries

(3) Lastly the authors detect the gradual boundariesfrom the video using a technique named as temporalmultiresolution analysis

Several other works use various kinds of methodologiesfor different kind of shots Such an example work [15] useddifferent techniques for abrupt shots and gradual shots byutilizing SIFT and SVM Their methodology also comprisesfew main steps which are given below

(1) In the first step they select the shot boundary framesfrom video using the difference between color his-tograms of two consecutive frames

(2) Then in the second step they extract the SIFT featuresfrom the frames selected as a shot boundary

(3) Lastly they use different approach for abrupt andgradual shot boundaries by using SIFT and SVMSIFT is considered to be the most efficient effectiveand massively used in state-of-the-art techniques

Although SIFT is considered to be the most widely usedfeature extraction technique it still has some downsides ascompared to SURF The SIFT feature has high dimensionalfeature vector ie 128-D whereas the SURF only has 64-Dvector SIFT is slow as compared to SURF due to complex andcomprehensive images Moreover Baber et al [3] proposed anew technique for shot boundary detection using two differ-ent feature extraction methods that are SURF and EntropyTheir research consists of different steps in which theydetect shot boundaries (abrupt and gradual) and differentiategradual boundaries from abrupt shot boundaries The stepsare as follows

(1) In the first step the fade boundaries are detected bythe analysis of entropy pattern during fade effects

(2) After the detection of fade shot boundaries the otherkind of shot boundary that is abrupt shot boundaryis detected by using the entropy difference betweentwo consecutive frames If the difference between twoconsecutive frames is higher than the threshold 120591then it is considered to be the abrupt shot boundary

(3) SURF is used for removing the false negative bound-aries

22 Key Frame Extraction Most of the researchers use theshot boundary detection as the important step to extract therepresentative key frames from the videos Representative keyframes in the videos are the particular frames which describethe whole content of the particular scene in the video Each

video may consist of one or more key frames based on thescenes or content in the video

Shot boundary detection is one of the most crucial stepsin our problem for finding representative key frames fromthe videos as scene detection is completely based on theserepresentative key frames Baber et al [5] used the entropydifferences between two consecutive frames for finding theshot boundaries If the contents of the two consecutive framessay 119891119894 and 119891119894+1 are different and their entropy difference isgreater than the specified threshold then 119891119894 is said to be ashot boundary and considered as a representative key frame

In our methodology we have first calculated the entropyof each video frame and then the difference between twoconsecutive frames is recorded The difference is greater thanthe the threshold 120591 is considered to be the representative keyframe Entropy is a statistical measure of randomness thatcan be used to characterize the texture of the input imageMathematically entropy is defined as

119864119909 = minussum119910

119867119910 log (119867119910) (1)

where119867 is the normalized histogram of the gray scale imagersquospixels intensity

23 Scene Detection At the first stage the video is segmentedin shots and semantically similar shots are merged to formscenes Scenes are categorized into various classes such asconversation indoor and outdoor scenes Many importantresearches have been published related to video segmentationinto scenes using different type of videos for example cin-ematic drama serials (indoor and outdoor) video lecturesand documentaries Although a lot of work has been reportedfor segmentation of video into scenes there is still a gap toaddress the challenge in cinematic videos Commonly thereare two types of features being extracted from the videos forsegmentation ie audio and visual We have focused on thevisual features in our research

Yeung et al [16] proposed a technique in which theauthors used the scene transition graph (STG) to segment thevideo The nodes in the graph are considered as a shot whichis based on the temporal relationship and visual similarityedges are describedThen the graph is divided into subgraphsand these subgraphs are considered as scenes which are basedon the color similarity of the shots

Rasheed et al [17] proposed an effective technique forscene detection in Hollywood and TV shows For featuresthey have used the motion color and length of the shotsIn the initial step they first cluster the shots by usingBackward Shot Coherence (BSC) Next by calculating thecolor similarity between shots they first detect potentialscene boundaries and after that they remove the false negativefrom potential scene boundaries by scene dynamics which isbased on motion and length of the shot

Many recent authors worked on video scene segmen-tation and proposed new technique for this problem inresearch Some researchers used multimodal fusion tech-nique of optimally grouped features using the dynamicprogramming scheme [18ndash20] Their methodology includesfew steps in which the first step was to divide the video into


shots and then using clustering technique they cluster theshots The authors in their paper [19] proposed a techniqueknown as intermediate fusion which uses all the informationfrom different modalities They considered this problem anoptimization problem and used it via dynamic programming[19] The authors have some previous research [18] in whichthey proposed a technique of dividing the video into scenesusing the sequential structure In this technique they decideda location for video segmentation and only inspected thepartitioning possibilities In this technique the video is rep-resented by set of features and each set is given by a distancemetric between them The segmentation purely depends oninput features and distance metric [18]

Furthermore a different technique was proposed inwhich they used spectral clustering technique with anautomatic selection on number of clusters and extractedthe normalized histogram of each shot Further they usedBhattacharyya distance and temporal distance as a distancemetric Authors in this paper said that clustering is notconsistent and adjacent shots belong to different clusters [20]

Sakarya et al [21] used a new technique of graph con-struction for the segmentation of video into scenesThey con-struct a graph weighting the temporal and spatial function ofsimilarity From this the dominant shots are detected and fortemporal consistency constraint they used the edges of thescene via mean and standard deviation of the shot positionThis process kept on going until all the video is allocated toscenes Lin et al [22] used the approach of color histogramsfor the shot boundary detection and then formed the scenebymerging the similar shots via identifying the local minimaand maxima to determine the scene transitions

Baraldi et al [4] used another approach for shots andscene detection from the videos using the color histogramsand clustering technique respectivelyThe authors first detectthe shots using the color histogram then the authors clus-tered the shots using the hierarchical K-means clusteringtechnique and created N clusters for N number of shotsEach shot is assigned a particular cluster and they find theleast dissimilar shots using the distance metric formula andmerged the two clusters with the least distance This processcontinues until and unless all the scenes are detected andvideo is completed

Chen et al [23] proposed a new approach for scene detec-tion from the H264 video sequences They define a scenechange factor which is used to reserve bits for each frameTheir methodology has reduced rate error and was foundbetter when compared with JVT-G012 algorithm The workof [24] proposed a novel technique for scene change detectionespecially for H264AVC encoded video sequences and theytake into consideration the design and performance evalua-tion of the systemThey further worked with a dynamic thre-shold which adapts and tracks different descriptors andincreased the accuracy of system by locating true scenes inthe videos

3 Proposed Methodology for Scene Detection

The proposed framework comprises shot boundary detec-tion key frame extraction local key point descriptors

4 6 8 10 12 14 16 18 20Entropy Difference ()

0

010

020

030

040

050

060

070

080

090

1

Prec

ision

Rec

allF

-sco

re

PrecisionRecallF-score

Figure 2 Sensitivity of 120591119860 on movie Pink Panther (2006)

extraction from key frames feature quantization and sceneboundary detection

31 Shot Boundary Detection Shot boundary detection isthe primary step for any kind of video operations Thereare number of frameworks for shot boundary detection Wehave used the technique for shot boundary detection basedon entropy differences [5 26] The entropy is computed foreach frame and differences between the adjacent frames arecomputed The frame 119891119894 is considered to be a shot boundaryparticularly abrupt shot boundary if the entropy differencebetween 119891119894 and 119891119894+1 is greater than the predefined threshold120591119860 [2 3 5] It can be returned as

B (119891119894) =

True if D (119891119894 119891119894+1) gt 120591119860False otherwise

(2)

B() decides either the given frame 119891119894 is shot boundary ornot and D computes the dissimilarity or difference betweenadjacent framesThe value 120591119860 gives better precision with poorrecall if it is high and better recall with poor precision if it islow as shown in Figure 2 During experiment the value of 120591119860is set experimentally which gives high F-score

32 Key Frame and Local Key Point Descriptors ExtractionLet S = 1199041 1199042 119904119899 be the set of all shot boundaries Oneor set of key frame(s) from each shot are selected There are anumber of possibilities to select representative frames akakey frames from each shot Since the entropies are alreadycomputed in shot boundary process so entropy based keyframe selection criteria are used [3]

For any given shot 119904119894 isin S the frame with maximumentropy is selected as key frame It has been shown exper-imentally that if the entropy is larger the contents in theframe are dense which represents the shots precisely The


shots are now represented by key frames and denoted byF = 1198911199041 1198911199042 119891119904119899 where 119891119904119894 denotes the key frame of shot119904119894

Two images can be matched if they are similar based onsome similarity criteria Similarity is computed between thefeatures of the images SIFT [27] is widely used as imagefeature for various applications of computer vision and videoprocessing For any given image key points are detected andthose key points are represented by some descriptors such asSIFT On average there are 2-3 thousand key points on singleimage whichmakes matching very expensive and exhaustiveas single image is represented by 2-3 thousand feature vectorsTomatch two images of size 800times600 each it takes 2 secondson commodity hardware on average If one image has to bematched with several hundreds or thousand images then it isnot practical to use SIFT or any rawdescriptors Quantizationis used to reduce the feature space

33 Quantization BoVW Model Bag of visual word modelis widely used for feature quantization Every key pointdescriptor 119909119895 sub R119889 is quantized into a finite number ofcentroids from 1 to 119896 where 119896 denotes the total number ofcentroids aka visual words denoted byV = V1 V2 V119896and each V119894 sub R119889 Let say a frame 119891 be represented bysome local key point descriptors119891119883 = 1199091 1199092 119909119898 where119909119894 sub R119889 In BoVWmodel a function G is defined as

G R119889 997891997888rarr [1 119896]

119909119894 997891997888rarr G (119909119894)(3)

G maps descriptor 119909119894 sub R119889 to an integer index Forgiven frame 119891 bag of visual word I = 1205831 1205832 120583119896 iscomputed 120583119894 indicates the number of times V119894 appeared inframe119891 andI is unit normalized at the endMostly 119896-meanor hierarchical 119896-mean clustering is applied and centroids(visual words) V are obtained The value of 119896 is keptvery large for image matching or retrieval applications thesuggested value of 119896 is 1 millionThe accuracy of quantizationmainly depends on the value of 119896 if the value is small thentwo different key point descriptors will be quantized to samevisual words which will decrease the distinctiveness or if thevalue is very large then two similar key point descriptorswhich are slightly distorted can be assigned different visualwords which will decrease the robustness [28]

In the case of the video segmentation the scenario isdifferent than the searching ormatching one imagewith set ofvery large database which have severe image transformationssuch as illumination scale viewpoint and scene capture atdifferent time In video segmentation image is matched withfew other images 4 to 7 in sliding window which containslightly different contents The each image in sliding windowis a key framewhich represents the shot an example of slidingwindow matching is shown in Figure 3

In proposed framework the value of 119896 is kept far smallerthan the value suggested in the literature [2] without compro-mising on the segmentation accuracy During experimentthe value of 119896 = 25000 gives approximately same accuracyas the value 500000 which is used in our previous work

[2] For the above-mentioned experiment the value of 119896 wasgradually increased from 5000 to 30000 by the factor of 1000and it was found that the value 119896 = 25000 gives approximatelysame accuracy as of our previous work [2]

34 Quantization VLAD Model VLAD is emerging quanti-zation framework for local key point descriptors [29] Insteadcomputing the histogram of visual words it computes thesum of the differences of residual descriptors with visualwords and concatenates into single vector of 119889 times 119896 Let G119881be VLAD quantization function [30]

G119881 R

119889 997891997888rarr V119895 isin V

119909119894 997891997888rarr G119881 (119909119894) = arg min

VisinV

1003817100381710038171003817119909119894 minus V10038171003817100381710038172 (4)

The VLAD is computed in three steps(1) offline visual words are obtained V(2) all the key point descriptors obtained from given

frame 119891119883 are quantized using (4)(3) VLAD is computed for given frame J119891 =

1198951 1198952 119895119896 where each 119895119902 is 119889-dimensional vectorobtained as follows

119895119902 = sum119883G119881(119883)=V119898

119883 minus V119898 (5)

J119891 is 119889 times 119896 dimensional feature In case of SIFT 119889 = 128and recommended value of 119896 isin 64 128 256 [29] As statedabove video segmentation does not require very large valuesof 119896 During experiments the value of 119896 for VLAD is 16 andJ using SIFT is 128 times 16 = 2048 dimensional J is unitnormalized at the end The vector is very compact withoutthe loss of accuracy as shown in experiments

35 Scene Boundary Detection Algorithm 1 is used to findthe scene boundaries [2] H denotes feature vectors for keyframes the feature vectors are either VLAD or BoVWvectorsexplained in the previous section The similarity between twokey frames is decided by dissimilarity function D which canbe computed as follows

D (119867119894 119867119895) =N

sum119902=1

min (ℎ119894119902 ℎ119895119902) (6)

Two key frames are treated as similar if their D() gt 120591119904 Thevalue of 120591119904 is the average of the minimum and the maximumsimilarities of the similar shots on a subset of the videos usedin the experiments The average of similarity score is widelyused as the value of 120591119904 In our experiment the average ofsimilarity scores gives low segmentation accuracy ie 0713

4 Experiments and Results

Cinematic and drama videos are used for scene boundarydetection list of movies and dramas is given in Table 1 F-score is used as performance metrics for scene boundarydetectionThere is no benchmark dataset Two strategies havebeen used to obtain the ground-truth first party and third


fsfs fs fs fs

fsfs fs fs

Figure 3 Example of key framesmatching in the sliding window of lengthL = 3 Each frame represents the shot and there are 9 consecutiveshots 1198911199041 1198911199049 Each key frame 119891119904119894 is matched with next three neighbors

Require H = 1198671 1198672 119867119899(1) 119860[1] larr997888 1(2) 119906 larr997888 1(3) index larr997888 2(4) for each 119867119894 isin H 119894 = [1 119899 minus 1] do(5) isSimilar larr997888 false(6) for 119895 = 119894 + 1 to 119894 + L do(7) if not Contains(119860 1 119895 + 1) and D(119867119894 119867119895) gt 119879119904 then(8) 119860[index] larr997888 119895(9) isSimilar larr997888true(10) end if(11) end for(12) if not isSimilar and(119894 ge 119860[index]) then(13) add (119906 119860[index]) toZ(14) 119906 larr997888 119860[index] + 1(15) index larr997888 index +1(16) end if(17) end for(18) Merge the short scenes(19) return Z

Algorithm 1 Scene detection algorithm

party ground-truth First party ground-truth is generated bythe authors and third party ground-truth is collected fromthe experts who have adequate knowledge of shots and sceneboundaries [2 3] To make ground-truth hinased third partyapproach is used in our experiments [3 5 26]

The accuracy of proposed system can be seen in Table 1Our dataset has two different groups with completely dif-ferent videos One group consists of cinematic movies withentirely different environment and challenging effects withcomplex motion of scenes On the other hand the secondgroup of data consists of indoor drama serials which areeasy to segment compared to cinematic movies becauseof their simple scene with no challenging effects that iswhy then length of the sliding window L is different forboth groups of dataset The sensitivity of L can be seen inFigure 4 [2] In cinematic videos the scenes are longer andshots are shorter In just few seconds there are sometimesmore than 20 shots due to different effects and actions Thevalue of L is marginally bigger compared on drama typesof videos Though single value can also be used for alltypes

Since the values of 119896 for VLAD and BoVW are shorterin proposed experiments compared to recommended valueswhich increase the efficiency for similarity computation thesimilarity computation by (6) or any other distance is at leastO(119899) where 119899 denotes the dimensionality of the feature Thecomputation of similarity is faster if the value of 119899 is shorteras shown in Figure 5 It can be seen that VLAD is faster thanBoVW because VLAD has shorter dimensions compared toBoVW The recommended value of 119896 for BoVW is 1000000as discussed in previous section whereas in our experimentsthe value of 119896 is 25000

5 Conclusion

Video segmentation is a primary step for video indexing andsearching Shot boundary detection divides the videos intosmall units These small units do not give meaningful insightof the video story or theme However grouping of similarshots give better insight of the video and this grouping can betreated as video scene and grouping of similar shots is calledscenes


Table 1 Performance of BoVW and VLAD on cinematic and drama videos

Movie name FPS Durationhhmmss

F-scoreScene boundary

by [3]Scene boundaryby BoVW [2]

Scene boundaryby [25] VLAD

L = 5The Pink Panther (2006) 24 013201 091 086 079 088Grown Ups (2010) 25 014055 089 085 077 082The Expendables (2010) 24 014329 085 082 072 081

L = 4I Dream of JeanniendashMyWild Eyed Master 25 002415 090 086 083 088

I Dream of JeanniendashMyMasterThe Rich Tycoon 25 002418 089 087 081 084

I Dream of JeanniendashMyMasterThe Doctor 25 002443 087 085 075 088

I Dream of JeanniendashTheMoving Finger 25 002432 085 083 079 081

Big BangTheory (Season 1)Episode 1 25 002233 087 086 079 084




3 4 5 6 7 8 9 1005

05506

06507

07508

08509

Length of sliding window

Fminussc

ore

Cinematic videosDrama videos

Figure 4 Sensitivity ofL on different types of videos

In this paper we propose framework which uses state-of-the-art searching techniques such as BoVW and VLADwhich is widely used for image and video retrieval for sceneboundary detection Images or video frames are representedby BoVW and VLAD which are very high dimensionalfeature vectors We experimentally show that in the fieldof scene boundary detection competitive accuracy can beachieved by keeping the dimensions of BoVW and VLADto very small The recommended dimensions for BoVWare 1 million in our experiments we just tuned it to be25000 The recommended dimensions of VLAD are 32768

0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104

Figure 5 Timing plot of query image matching with all the imagesin database VLAD always has less dimensions compared to theBoVWwhich makes VLAD faster than BoVW

in our experiments it is tuned to 2048 We exploit thesliding window for shot boundary detection In very smallsliding window the contents of the video shots do notchange drastically which helps to represent shots by reduceddimensions of BoVW and VLAD


Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

We are thankful to Shinrsquoichi Satoh from National Institute ofInformatics Japan Nitin Afzulpurkar fromAsian Institute ofTechnologyThailand andChadapornKeatmanee fromThai-Nichi Institute of Technology Thailand for their expertisethat greatly assisted this research

References

[1] S Lefevre and N Vincent ldquoEfficient and robust shot changedetectionrdquo Journal of Real-Time Image Processing vol 2 no 1pp 23ndash34 2007

[2] J Baber S Satoh N Afzulpurkar and C Keatmanee ldquoBag ofvisual wordsmodel for videos segmentation into scenesrdquo inPro-ceedings of the Fifth International Conference on Internet Multi-media Computing and Service pp 191ndash194 ACM 2013

[3] J Baber N Afzulpurkar and S Satoh ldquoA framework for videosegmentation using global and local featuresrdquo InternationalJournal of Pattern Recognition and Artificial Intelligence vol 27no 5 Article ID 1355007 2013

[4] L Baraldi C Grana and R Cucchiara ldquoShot and scene detec-tion via hierarchical clustering for re-using broadcast videordquoin Proceedings of the International Conference on ComputerAnalysis of Images and Patterns pp 801ndash811 Springer 2015

[5] J Baber NAfzulpurkar andM Bakhtyar ldquoVideo segmentationinto scenes using entropy and SURFrdquo in Proceedings of the 20117th International Conference on Emerging Technologies (ICETrsquo11) pp 1ndash6 IEEE 2011

[6] T Kikukawa and S Kawafuchi ldquoTransaction of the instituteof electronics development of an automatic summary editingsystem for the audio visual resourcesrdquo Information and commu-nication Engineers vol 75 no 2 pp 398ndash402 1992

[7] A Nagasaka and Y Tanaka Visual database systems II 1992[8] H Zhang A Kankanhalli and S W Smoliar ldquoAutomatic parti-

tioning of full-motion videordquo Multimedia Systems vol 1 no 1pp 10ndash28 1993

[9] I Koprinska and S Carrato ldquoTemporal video segmentation asurveyrdquo Signal Processing Image Communication vol 16 no 5pp 477ndash500 2001

[10] Z Cernekova I Pitas and C Nikou ldquoInformation theory-based shot cutfade detection and video summarizationrdquo IEEETransactions on Circuits and Systems for Video Technology vol16 no 1 pp 82ndash91 2006

[11] T Kikukawa and S Kawafuchi ldquoDevelopment of an automaticsummary editing system for the audio-visual resourcesrdquo Trans-actions on Electronics and Information J75-A pp 204ndash212 1992

[12] A Nagasaka ldquoAutomatic video indexing and full-video searchfor object appearancesrdquo in Proceedings of the IFIP 2nd WorkingConference on Visual Database Systems 1992

[13] G C Chavez F Precioso M Cord S Philipp-Foliguet andA D A Araujo ldquoShot boundary detection at trecvid 2006rdquo inProceedings of the TREC Video Retrieval Eval vol 15 2006

[14] X Ling O Yuanxin L Huan and X Zhang ldquoAMethod for FastShot Boundary Detection Based on SVMrdquo in Proceedings of the2008 Congress on Image and Signal Processing vol 2 pp 445ndash449 IEEE 2008

[15] J Li Y Ding Y Shi and W Li ldquoA divide-and-rule scheme forshot boundary detection based on SIFTrdquo International Journalof Digital Content Technology and Its Applications vol 4 no 3pp 202ndash214 2010

[16] M Yeung B-L Yeo and B Liu ldquoSegmentation of Video byClustering and Graph Analysisrdquo Computer Vision and ImageUnderstanding vol 71 no 1 pp 94ndash109 1998

[17] Z Rasheed andM Shah ldquoScene detection inHollywoodmoviesand TV showsrdquo in Proceedings of the 2003 IEEE ComputerSociety Conference on Computer Vision and Pattern Recognitionvol 2 pp 343ndash348 IEEE 2003

[18] D Rotman D Porat and G Ashour ldquoRobust and efficientvideo scene detection using optimal sequential groupingrdquo inProceedings of the 18th IEEE International Symposium on Multi-media ISM rsquo16 pp 275ndash280 IEEE 2016

[19] D Rotman D Porat and G Ashour ldquoRobust video scenedetection using multimodal fusion of optimally grouped fea-turesrdquo in Proceedings of the 19th IEEE International Workshopon Multimedia Signal Processing MMSP rsquo17 pp 1ndash6 2017

[20] L Baraldi C Grana and R Cucchiara ldquoAnalysis and re-useof videos in educational digital libraries with automatic scenedetectionrdquo in Proceedings of the Italian Research Conference onDigital Libraries pp 155ndash164 Springer 2015

[21] U Sakarya and Z Telatar ldquoVideo scene detection using dom-inant setsrdquo in Proceedings of the 2008 15th IEEE InternationalConference on Image Processing - ICIP rsquo08 pp 73ndash76 IEEE2008

[22] T Lin H Zhang and Q-Y Shi ldquoVideo scene extraction byforce competitionrdquo in Proceedings of the IEEE InternationalConference on Multimedia and Expo (ICME rsquo01) pp 753ndash7562001

[23] X Chen and F Lu ldquoAdaptive rate control algorithm for H264AVC considering scene changerdquo Mathematical Problems inEngineering vol 2013 Article ID 373689 6 pages 2013

[24] GRascioni S Spinsante andEGambi ldquoAnoptimized dynamicscene change detection algorithm for H264AVC encodedvideo sequencesrdquo International Journal of Digital MultimediaBroadcasting vol 2010 Article ID 864123 9 pages 2010

[25] Z Rasheed andM Shah ldquoScene detection inHollywoodmoviesand TV showsrdquo in Proceedings of the 2003 IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition2003

[26] J Baber N Afzulpurkar M N Dailey and M BakhtyarldquoShot boundary detection from videos using entropy and localdescriptorrdquo in Proceedings of the 2011 17th International Con-ference onDigital Signal Processing (DSP rsquo11) pp 1ndash6 IEEE 2011

[27] D G Lowe ldquoDistinctive image features from scale-invariantkeypointsrdquo International Journal of ComputerVision vol 60 no2 pp 91ndash110 2004

[28] J Baber M N Dailey S Satoh N Afzulpurkar and M Bakht-yar ldquoBIG-OH binarization of gradient orientation histogramsrdquoImage and Vision Computing vol 32 no 11 pp 940ndash953 2014

[29] H JegouMDouze C Schmid and P Perez ldquoAggregating localdescriptors into a compact image representationrdquo inProceedingsof the 2010 IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPR rsquo10) pp 3304ndash3311 2010


[30] J Delhumeau P-HGosselinH Jegou andP Perez ldquoRevisitingthe VLAD image representationrdquo inProceedings of the 21st ACMInternational Conference on Multimedia pp 653ndash656 2013

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018


Active and Passive Electronic Components

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of



Journal ofEngineeringVolume 2018

SensorsJournal of



RotatingMachinery


Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation


Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom


boundary is the sudden change in the scene such as changeof speaker during TV interviews whereas gradual shots takeseveral frames to change the shot such as fades and dissolvesIn videos many shots repeat in very short interval of time ifthose shots are combined then these collections of shots arecalled scenes For example if two actors are talking then thecamera keeps switching to both actors with very little changein background and in two-minute conversation of video clipthere are sometimes 25-30 shots Scene detection aka sceneboundary detection or video scene segmentation is the studyto merge similar or repeating shots into one clip or dividingthe videos into clips which are semantically or visually relatedor similar

Manual segmentation of videos for websites and DVDsis very time consuming and not feasible when dealing withlarge datasets Recently automatic video segmentation intoshots and scenes have gained wide attraction among industryand researchers [2ndash5]

In the proposed methodology videos are segmented intoabrupt shot boundaries which are further grouped on thebasis of the similarity to construct the scenes The proposedmethodology is inspired by the BoVW model for scenedetection [2] the abstract flow diagram is shown in Figure 1In bag of visual word model local key point descriptorswhich are extracted from the key frame of the shots arerepresented by the histograms of visual words These keyframes arematched based onbag of visual word histograms insliding window of lengthL [3] It has been shown that shotsmatched in sliding windows aremore efficient [2 3] ClassicalBoVWmodel andVLAD are usedwith compact vocabularieswithout compromising in accuracies

Rest of the paper is organized as follows Section 2presents the related work It is divided into three subsections(1) shot boundary detection (2) key frame extraction and (3)scene boundary detection Section 3 presents the proposedmethodology along with experimental protocols and finallySection 4 concludes the papers and discusses the future work

2 Related Work

In this section a brief literature review and state-of-the-artmethodologies are presented for all the main steps of thevideo segmentation which include shot boundary detectionkey frame extraction and scene boundary detection

21 Shot Boundary Detection In the problem of videoindexing and searching the first and foremost step is shotboundary detection Shot boundary has two types as men-tioned earlier abrupt and gradual shot boundaries Abruptshot boundary is the sudden change in the stream if thedissimilarity difference between the two consecutive framesis very large then either of the adjacent frames is consideredthe boundary whereas gradual shot boundary is the gradualchange in the video such as the effects like fade-in fade-outand dissolves

Let the 119865 = 1198911 1198912 119891119899 be frames of a video and 119891119894 issaid to be an abrupt shot boundary if and only if the differencebetween 119891119894 and 119891119894+1 is greater than a threshold 120591 In ourexperiments we are not taking the gradual shot detection into

Video Shot BoundaryDetection

Key frame Selection

FeatureExtractionQuantization

Similarity ofKey Frames in

Sliding Window

Grouping ofShots Scene Creation

Figure 1 Abstract flow diagram of proposed framework

consideration as the ratio of gradual shots in any cinematicvideo is too small more than 90 of shot boundaries areabrupt boundaries [2 3]

There is long list of methodologies first on pixel to pixeldifference between the consecutive video frames ie 119891119894 and119891119894 + 1 are used for segmentation of the video [6] In thistechnique if the sum of the pixels difference is greater thansome threshold then it was considered to be an abrupt shotboundary

Later on many other scientists worked on this problemand proposed a new technique in which pixel intensityhistograms were used in successive frames instead of the pixelto pixel difference to detect the abrupt shot boundaries [7 8]These techniques are good except that they are sensitive tomotion of objects and camera [9]

Moreover a latter approach [10] detects the shot bound-ary based on the mutual information and joint entropybetween the consecutive frames A sports dataset has beenused to detect the shot boundary This technique of jointentropy is useful if used for faded or gradual boundariesThe entropy is high for an extended time period duringfade-in because the visual intensity gradually increases andthe entropy is low during fade-out as the intensity slowlydecreases

Videos fragmentation by pixel variance in frames andpixel strength in histogram calculations has been presentedin [11 12] The frame indexing was used with rapid boundarywhen the amount of shot pixels between two frames isoverdone by some threshold

Chavez et al [13] proposed a different technique in whichthey used supervised learning with support vector machine(SVM) in order to separate the abrupt boundaries from thegradual boundaries In this technique authors calculated thedissimilarity vector which assimilate set of different featuresincluding Fourier-Mellinmoments Zernike moments andcolor histograms (RGB and HSV) to capture the informationlike illumination changes and rapid motion After then thisvector is used in SVM for detection of shot boundariesThe authors also used illumination changes for detecting thegradual shot boundaries


















119864119909 = minussum119910

119867119910 log (119867119910) (1)















0

010

020

030

040

050

060

070

080

090

1

Prec

ision

Rec

allF

-sco

re





B (119891119894) =


(2)








G R119889 997891997888rarr [1 119896]

119909119894 997891997888rarr G (119909119894)(3)






G119881 R

119889 997891997888rarr V119895 isin V

119909119894 997891997888rarr G119881 (119909119894) = arg min

VisinV

1003817100381710038171003817119909119894 minus V10038171003817100381710038172 (4)




119895119902 = sum119883G119881(119883)=V119898

119883 minus V119898 (5)



D (119867119894 119867119895) =N

sum119902=1

min (ℎ119894119902 ℎ119895119902) (6)





fsfs fs fs fs

fsfs fs fs







5 Conclusion

















3 4 5 6 7 8 9 1005

05506

06507

07508

08509


Fminussc

ore




0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104




Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



















119864119909 = minussum119910

119867119910 log (119867119910) (1)















0

010

020

030

040

050

060

070

080

090

1

Prec

ision

Rec

allF

-sco

re





B (119891119894) =


(2)








G R119889 997891997888rarr [1 119896]

119909119894 997891997888rarr G (119909119894)(3)






G119881 R

119889 997891997888rarr V119895 isin V

119909119894 997891997888rarr G119881 (119909119894) = arg min

VisinV

1003817100381710038171003817119909119894 minus V10038171003817100381710038172 (4)




119895119902 = sum119883G119881(119883)=V119898

119883 minus V119898 (5)



D (119867119894 119867119895) =N

sum119902=1

min (ℎ119894119902 ℎ119895119902) (6)





fsfs fs fs fs

fsfs fs fs







5 Conclusion

















3 4 5 6 7 8 9 1005

05506

06507

07508

08509


Fminussc

ore




0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104




Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia











0

010

020

030

040

050

060

070

080

090

1

Prec

ision

Rec

allF

-sco

re





B (119891119894) =


(2)








G R119889 997891997888rarr [1 119896]

119909119894 997891997888rarr G (119909119894)(3)






G119881 R

119889 997891997888rarr V119895 isin V

119909119894 997891997888rarr G119881 (119909119894) = arg min

VisinV

1003817100381710038171003817119909119894 minus V10038171003817100381710038172 (4)




119895119902 = sum119883G119881(119883)=V119898

119883 minus V119898 (5)



D (119867119894 119867119895) =N

sum119902=1

min (ℎ119894119902 ℎ119895119902) (6)





fsfs fs fs fs

fsfs fs fs







5 Conclusion

















3 4 5 6 7 8 9 1005

05506

06507

07508

08509


Fminussc

ore




0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104




Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia






G R119889 997891997888rarr [1 119896]

119909119894 997891997888rarr G (119909119894)(3)






G119881 R

119889 997891997888rarr V119895 isin V

119909119894 997891997888rarr G119881 (119909119894) = arg min

VisinV

1003817100381710038171003817119909119894 minus V10038171003817100381710038172 (4)




119895119902 = sum119883G119881(119883)=V119898

119883 minus V119898 (5)



D (119867119894 119867119895) =N

sum119902=1

min (ℎ119894119902 ℎ119895119902) (6)





fsfs fs fs fs

fsfs fs fs







5 Conclusion

















3 4 5 6 7 8 9 1005

05506

06507

07508

08509


Fminussc

ore




0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104




Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



fsfs fs fs fs

fsfs fs fs







5 Conclusion

















3 4 5 6 7 8 9 1005

05506

06507

07508

08509


Fminussc

ore




0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104




Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia

















3 4 5 6 7 8 9 1005

05506

06507

07508

08509


Fminussc

ore




0 2 4 6 8Data Size

0

05

1

15

2

25

3

35

Tim

e (se

c)

VLADBoVW

times104




Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



Data Availability




Acknowledgments


References


































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia






RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia


video scene detection using compact bag of visual word...

Documents