


Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

International Journal of Technology and Human Interaction, 1(2), 27-45, April-June 2005

Automated Video Segmentation for Lecture Videos:

A Linguistics-Based Approach

Ming Lin, University of Arizona, USA

Michael Chau, The University of Hong Kong, Hong Kong

Jinwei Cao, University of Arizona, USA

Jay F. Nunamaker Jr., University of Arizona, USA

ABSTRACT

Video, a rich information source, is commonly used for capturing and sharing knowledge in learning systems. However, the unstructured and linear features of video introduce difficulties for end users in accessing the knowledge captured in videos. To extract the knowledge structures hidden in a lengthy, multi-topic lecture video and thus make it easily accessible, we need to first segment the video into shorter clips by topic. Because of the high cost of manual segmentation, automated segmentation is highly desired. However, current automated video segmentation methods mainly rely on scene and shot change detection, which are not suitable for lecture videos with few scene/shot changes and unclear topic boundaries. In this article we investigate a new video segmentation approach with high performance on this special type of video: lecture videos. This approach uses natural language processing techniques such as noun phrase extraction, and utilizes lexical knowledge sources such as WordNet. Multiple linguistics-based segmentation features are used, including content-based features such as noun phrases and discourse-based features such as cue phrases. Our evaluation results indicate that the noun phrases feature is salient.

Keywords: computational linguistics; lecture video; multimedia application; video segmentation; virtual learning

INTRODUCTION

The quick development of technologies in the storage, distribution, and production of multimedia has created new sources of knowledge. Among these new knowledge sources, video is extremely useful for knowledge sharing and learning because of its great capability of carrying and transmitting “rich” information (Daft & Lengel, 1986). Nowadays videotaped lectures are more and more commonly provided in computer-based training systems, and they can create a virtual learning environment that simulates the real classroom learning environment. However, people often have difficulties in finding specific pieces of knowledge in video because of its unstructured and linear features. For instance, when students want to review a certain part of a videotaped lecture, they have to look through almost the entire video or even play back and forth several times to locate the right spot.

This paper appears in the journal International Journal of Technology and Human Interaction edited by Bernd Carsten Stahl. Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

701 E. Chocolate Avenue, Suite 200, Hershey PA 17033-1240, USA
Tel: 717/533-8845; Fax 717/533-8661; URL: http://www.idea-group.com

Multimedia information retrieval technologies, such as video search engines or video browsers, try to address this problem by analyzing, transforming, indexing, and presenting the knowledge captured in videos in a structured way. For instance, in online courses provided by Stanford University (http://scpd.stanford.edu/scpd/students/onlineClass.htm), a video of an instructor is synchronized with his/her PowerPoint (PPT) slides. Students can move forward or backward to a certain segment of the video by choosing the slide associated with that segment (see Figure 1). A similar but improved design was implemented in two multimedia-based learning systems that we developed previously: the Learning By Asking (LBA) system (Zhang, 2002), and its extension, the Agent99 Trainer system (see Figure 2) (Lin, Cao, Nunamaker, & Burgoon, 2003). In both training systems, each lecture video is manually segmented into short clips, and each clip is synchronized with a PPT slide as well as a text transcript of the speech in the clip. The clips are also indexed based on these text transcripts. Students can select a specific clip in the lecture by browsing a list of topics of the lecture or searching with keywords or questions. An experiment and a usability test showed that students found such structured and synchronized multimedia content, as well as the self-paced learner control enabled by the list of topics, helpful. The resulting training system is as effective as traditional classroom training (Lin et al., 2003).

However, to realize such a structured video lecture, there must be a critical preprocessing step: video segmentation. Without decomposing a lengthy, continuous video into short, discrete, and semantically interrelated video segments, the knowledge structure of the video cannot be extracted, and efficient navigating or searching within the video is impossible. However, performing video segmentation manually is very time consuming because it requires a human to watch the whole video and understand the content before starting to perform the segmentation. Therefore, in this article we focus on studying how to automatically segment lecture videos to facilitate knowledge management, sharing, and learning. We define the segmentation task as automatically segmenting videos into topically cohesive blocks by detecting topic boundaries. We focus on segmentation by topic because this type of segmentation allows each segment to be a coherent topic, which gives users the relevant context to understand the content in a virtual learning environment such as the LBA system.

Figure 1. Screenshot of the Stanford Online System

Although automated video segmentation has been researched for years, the existing segmentation methods are not suitable for lecture videos. The most commonly used video segmentation methods rely on algorithms for scene/shot change detection, such as those utilizing image cues based on color histograms (Wactlar, 1995) or progressive transition detection combining both motion and statistical analysis (Zhang & Smoliar, 1994). However, lecture videos usually have very few scene or shot changes. For instance, in many situations there is only a “talking instructor” in the video. Furthermore, topic boundaries in lecture videos are much more subtle and fuzzy because of the spontaneous speech of instructors. Therefore, a new method for segmenting lecture videos is highly desired.

Figure 2. Screenshot of the Agent99 Trainer

In this article we describe the development of such an automated segmentation algorithm. Videotaped lectures captured in a university are used as the test bed for evaluating our algorithm. The rest of this article is organized as follows. The next section reviews related research and identifies potentially applicable segmentation features and methods. We then propose our technical approach to the segmentation problem, and describe an evaluation study and discussion. Finally, we conclude our research and outline some future research directions.

LITERATURE REVIEW

The image cues that most video segmentation methods rely on are not available for lecture videos because such videos usually have very few scene/shot changes, and in most cases those scene/shot changes do not match topic changes. On the other hand, there is generally rich speech in those videos. This implies that the audio speech and the text transcription of the speech can be good information sources for topic segmentation. Thus, in this article we focus on topic segmentation using transcribed text. With the time stamps (extracted from automatic speech recognition software) that synchronize the video stream and the transcribed text (Blei & Moreno, 2001), the output of transcribed text segmentation can be mapped back to video segmentation. Therefore, our video segmentation problem can be addressed by segmenting transcribed spoken text.

Segmentation in the News Domain

Segmentation of transcribed spoken text has been studied for years (Allan, Carbonell, Doddington, Yamron, & Yang, 1998; Beeferman, Berger, & Lafferty, 1997). Work in this area has been largely motivated by the topic detection and tracking (TDT) initiative (Allan et al., 1998). The story segmentation task in TDT is defined as the task of segmenting the stream of data (transcribed speech) into topically cohesive stories. However, these efforts usually focus on the broadcast news domain, in which the formal presentation format and cue phrases can be exploited to improve segmentation accuracy. For instance, in CNN news stories, the phrase “This is Larry King…” normally implies the beginning or the ending of a story or topic. In contrast, the speech in lecture videos is typically unprofessional and spontaneous. Also, a large set of training data is required for the machine learning methods used in TDT. The Dragon approach (Yamron, Carp, Gillick, Lowe, & Van Mulbregt, 1999) treats a story as an instance of some underlying topics, models a text stream as an unlabeled sequence of those topics, and uses classic Hidden Markov Model (HMM) techniques as in speech recognition. The UMass approach (Ponte & Croft, 1997) makes use of the techniques of local context analysis and discourse-based HMMs. The CMU approach (Beeferman et al., 1997) explores both content and discourse features to train a probability distribution model that combines information from a language model with lexical features associated with story boundaries. However, a large set of training data is not available for lecture videos. Furthermore, the large variety of instructional styles among instructors makes the problem even more difficult.


Alternatively, without requiring a formal presentation format or training data, research in the area of domain-independent text segmentation provides possible methodologies to address this problem.

Domain-Independent Text Segmentation

Most existing work in domain-independent text segmentation has been derived from the lexical cohesion theory suggested by Halliday and Hasan (1976). They proposed that text segments with similar vocabulary are likely to be in one coherent topic segment. Thus, finding topic boundaries could be achieved by detecting topic transitions from vocabulary change. In this subsection, we review the literature by showing different segmentation features, the similarity measures used, and the methods of finding boundaries.

Researchers have used different segmentation features to detect cohesion. Term repetition is a dominant feature with different variants such as word stem repetition (Youmans, 1991; Hearst, 1994; Reynar, 1994), word n-grams or phrases (Reynar, 1998; Kan, Klavans, & McKeown, 1998), and word frequency (Reynar, 1999; Beeferman et al., 1997). The first use of words was also exploited by some researchers (Youmans, 1991; Reynar, 1999) because a large percentage of first-used words often accompanies topic shifts. Cohesion between semantically related words (e.g., synonyms, hyponyms, and collocational words) is captured using different knowledge sources such as a thesaurus (Morris & Hirst, 1991), a dictionary (Kozima & Furugori, 1993), or a large corpus (Ponte & Croft, 1997; Kaufmann, 1999). To measure the similarity between different text segments, research uses vector models (Hearst, 1994), graphic methods (Reynar, 1994; Choi, 2000; Salton, Singhal, Buckley, & Mitra, 1996), and statistical methods (Utiyama, 2000). Methods of finding topic boundaries include sliding windows (Hearst, 1994), lexical chains (Morris & Hirst, 1991; Kan et al., 1998), dynamic programming (Ponte & Croft, 1997; Heinonen, 1998), and agglomerative and divisive clustering (Yaari, 1997; Choi, 2000). We describe some representative research in more detail below. For a thorough review, refer to Reynar (1998).

Youmans (1991) designed a technique based on the first uses of word types, called the Vocabulary Management Profile. He pointed out that a large amount of first uses of words frequently followed topic boundaries. Kozima and Furugori (1993) devised a measure called the Lexical Cohesion Profile (LCP) based on spreading activation within a semantic network derived from an English dictionary. Segment boundaries can be detected by the valleys (minimum values) of the LCP. Hearst (1994) developed a technique called TextTiling that automatically divides long expository texts into multi-paragraph segments using the vector space model, which has been widely used in information retrieval (IR). Topic boundaries are placed where the similarity between neighboring blocks is low. Reynar (1994) described a method using an optimization algorithm based on word repetition and a graphic technique called dotplotting. In further studies, Reynar (1998) designed two algorithms for topic segmentation. The first is based solely on word frequency, represented by Katz’s G model (Katz, 1996). The second combines the first with other sources of evidence such as domain cues and content word bigrams, and incorporates these features into a maximum entropy model. The research of Choi (2000) is built on the work of Reynar


(1998). The primary distinction is that inter-sentence similarity is replaced by rank in local context, and boundaries are discovered by divisive clustering.

RESEARCH QUESTIONS

Unlike the above segmentation methods that focus on written text, segmentation of transcribed spoken text is more challenging because spoken text lacks typographic cues such as headers, paragraphs, punctuation, or capitalized letters. Moreover, compared to written text and news stories, the topic boundaries within lecture transcripts tend to be more subtle and fuzzy because of the unprofessional and spontaneous speech, and the large variety of instructional methods. Therefore, we need more resolving power for segmenting lecture transcripts.

With the advancement of computational linguistics and natural language processing (NLP) research, NLP techniques such as part-of-speech tagging and noun phrase extraction are becoming more mature and available for real-life usage. They are potentially useful for gaining more resolving power and improving segmentation accuracy because they provide a deeper structure and better understanding of the language and content in the transcript. We propose a linguistics-based approach that utilizes a variety of linguistics-based features and NLP techniques to facilitate automated segmentation. Part-of-speech tagging is used to distinguish between different word types (noun, verb, adjective). Noun phrase extraction can help the segmentation because noun phrases carry more content information than single words (Katz, 1996). Lexical knowledge such as WordNet is used because different words such as synonyms may be used to express the same concept in a text. The basic idea behind the proposed approach is that different linguistic units and features (e.g., different word types, larger units such as noun phrases, discourse markers, or cue phrases) carry different portions of content and structure information and therefore should be assigned different weights. More specifically, in this article we propose two research questions: (1) As the names of concepts and theories that appear frequently in lectures are usually noun phrases, are noun phrases more salient segmentation features, and can they be used to improve segmentation performance? (2) Intuitively, linguistic features modeling different characteristics of text (e.g., content based vs. discourse based) should complement each other; can the combination of multiple linguistic segmentation features then lead to gains in resolving power and thus improve segmentation performance?

PROPOSED APPROACH

To answer the two research questions above, we propose an algorithm called PowerSeg. The algorithm combines multiple linguistic segmentation features, including content-based features such as noun phrases and verbs, and discourse-based features such as pronouns and cue phrases. It also incorporates lexical knowledge from WordNet to improve accuracy. The algorithm utilizes an idea similar to the sliding window method in TextTiling (Hearst, 1994). We move a sliding window (e.g., W1, W2) of a certain size (e.g., six sentences) across the transcript by a certain interval (e.g., two sentences) (see Figure 1). We then compare the similarities between two neighboring windows of text. For instance, we compute the similarity between windows W1 and W2 (e.g., sentence numbers 14-19 vs. 20-25); then we move W1 and W2 by an interval of two sentences, and calculate the similarity between W1(2) and W2(2). We repeat this process until the sliding windows reach the end of the transcript. The places where similarities have a large variation are identified as potential topic boundaries. The basic idea here is that we view the task of topic-based segmentation as the detection of a shift from one topic to the next. In other words, the task is to detect where the use of one set of terms ends and another set begins (Halliday & Hasan, 1976). The remaining questions are how we calculate the similarities between two windows, how we represent the topic information of each sliding window, and finally how we identify the largest variations of similarities. We answer all these questions in the detailed algorithm description below. Basically the algorithm has three major steps: (1) preprocessing, (2) feature extraction, and (3) finding boundaries.
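The window-sliding loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the pluggable `similarity` argument are our own, and any window-to-window similarity measure can be supplied.

```python
def sliding_window_similarities(sentences, window=6, step=2, similarity=None):
    """Slide two adjacent windows across a transcript and score each gap.

    `sentences` is a list of token lists (one per sentence); `similarity`
    is any function taking two windows (each a list of token lists) and
    returning a float. Returns (gap_index, score) pairs, where gap_index
    is the index of the first sentence of the second window.
    """
    scores = []
    pos = 0
    while pos + 2 * window <= len(sentences):
        w1 = sentences[pos:pos + window]
        w2 = sentences[pos + window:pos + 2 * window]
        scores.append((pos + window, similarity(w1, w2)))
        pos += step
    return scores
```

Gaps where the score drops sharply relative to their neighbors become candidate topic boundaries, exactly as in the TextTiling-style scheme the text describes.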

The preprocessing step performs preparation work for the subsequent steps and is fairly standard. The algorithm takes the transcript text as input, and uses GATE (Cunningham, 2000) to handle tokenization, sentence splitting, and part-of-speech (POS) tagging. GATE is a widely used human language processing system developed at the University of Sheffield. GATE splits the text into simple tokens such as numbers, punctuation, and words; segments the text into sentences; and produces a part-of-speech tag as an annotation on each word or symbol (e.g., NN for nouns and VB for verbs). Further, Porter’s stemmer (Porter, 1980) is used for suffix stripping (e.g., “lays” becomes “lay”). Punctuation and uninformative words are removed using a stopword list. Based on the results from preprocessing, different features such as noun phrases are extracted to represent each sliding window and used for similarity comparison.
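A rough stand-in for this preprocessing pipeline is sketched below, assuming plain Python in place of GATE (which also does POS tagging, omitted here) and a naive suffix stripper instead of Porter's full stemmer; the stopword list is a tiny illustrative sample.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # tiny sample list

def preprocess(text):
    """Sentence splitting, tokenization, crude suffix stripping, and
    stopword removal -- a simplified analogue of the GATE + Porter
    pipeline described in the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sent in sentences:
        tokens = re.findall(r"[a-z]+", sent.lower())
        # naive stand-in for Porter's stemmer (e.g., "lays" -> "lay")
        stems = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
        processed.append([t for t in stems if t not in STOPWORDS])
    return processed
```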

Feature Extraction

Seven feature vectors are extracted: noun phrases (NP), verb classes (VC), word stems (WS), topic noun phrases (TNP), combined noun-verb features (NV), pronouns (PN), and cue phrases (CP). The first five features (NP, VC, WS, TNP, and NV) are content-based features, which carry lexical or syntactic meanings of the body of content. The last two features (PN and CP) are discourse-based features, which describe more about the properties of the small text body surrounding the topic boundaries.

We use noun phrases instead of a “bag of words” (single words) because noun phrases are usually more salient features and exhibit fewer sense ambiguities. Furthermore, most names of concepts and theories in a lecture are noun phrases. For instance, in the transcript of a lecture video about Web search engines (see Figure 1), topic 3, “Definition of Information Retrieval,” and topic 4, “Architecture of Information Retrieval,” share a lot of words such as “information” and “retrieval” (in bold face in Figure 1). It will be hard for algorithms using single-word features such as word repetition to distinguish between these two topics. However, it will be much easier to separate these two topics if we use noun phrases. For instance, “information retrieval” occurs several times in topic 3, but not in topic 4. We use the Arizona Noun Phraser (Tolle & Chen, 2000) to extract the noun phrases from the transcript.
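The advantage of phrase-level features can be illustrated with simple term-frequency vectors. The phrases below are hand-picked for illustration, not output of the Arizona Noun Phraser: two toy windows share the single words "information" and "retrieval" but no full noun phrase, so word-level similarity is high while phrase-level similarity is zero.

```python
from collections import Counter

def tf_vector(units):
    """Term-frequency vector over arbitrary units (words or noun phrases)."""
    return Counter(units)

def cosine(v1, v2):
    """Cosine similarity of two sparse term-frequency vectors."""
    num = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    den = (sum(c * c for c in v1.values()) ** 0.5) * \
          (sum(c * c for c in v2.values()) ** 0.5)
    return num / den if den else 0.0

# Two toy windows from different topics that reuse the same single words.
w_words_1 = ["definition", "information", "retrieval", "information", "retrieval"]
w_words_2 = ["architecture", "information", "retrieval", "information", "retrieval"]
w_nps_1 = ["definition of information retrieval", "information retrieval definition"]
w_nps_2 = ["information retrieval architecture", "system architecture"]

word_sim = cosine(tf_vector(w_words_1), tf_vector(w_words_2))  # high
np_sim = cosine(tf_vector(w_nps_1), tf_vector(w_nps_2))        # zero
```

The phrase-level vectors make the topic boundary between the two windows detectable, which is exactly the resolving power argued for above.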

Besides noun phrases, verbs also carry a lot of content information. Semantic verb classification has been used to characterize document type (Klavans & Kan, 1998) because verbs typically embody an event’s profile. Our intuition is that verb classification also represents topic information. After removing support verbs (e.g., is, have, get, go), which do not carry much content information, we use WordNet to build links between verbs to provide a verb-based semantic profile for each text window during the segmentation process. WordNet is a lexical knowledge resource in which words are organized into synonym sets (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990). These synonym sets, or synsets, are connected by semantic relationships such as hypernymy or antonymy. We use the synonymy and hypernymy relationships within two levels in WordNet. We only accept hypernymy relationships within two levels because of the flat nature of the verb hierarchy in WordNet (Klavans & Kan, 1998). More specifically, when two verbs from two text windows are compared, they are considered as having the same meaning (or being in the same verb class) if they are synonyms or hypernyms within two levels. Apart from nouns and verbs, other content words such as adjectives and adverbs are simply used in their stem forms (word stems, WS).
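The two-level synonymy/hypernymy check might look like the sketch below. A toy hand-built hierarchy stands in for WordNet's verb taxonomy (the real system would walk WordNet synsets); all verb entries here are illustrative.

```python
# Toy hypernym map standing in for WordNet's verb taxonomy:
# each verb maps to its direct hypernym.
HYPERNYM = {
    "sprint": "run",
    "jog": "run",
    "run": "move",
    "move": "act",
}
SYNONYMS = {"talk": {"speak"}, "speak": {"talk"}}

def ancestors(verb, levels=2):
    """Hypernyms of `verb` up to `levels` steps up the hierarchy."""
    found = set()
    current = verb
    for _ in range(levels):
        current = HYPERNYM.get(current)
        if current is None:
            break
        found.add(current)
    return found

def same_verb_class(v1, v2, levels=2):
    """Two verbs match if identical, synonymous, or sharing a hypernym
    within `levels` -- mirroring the two-level rule described above."""
    if v1 == v2 or v2 in SYNONYMS.get(v1, set()):
        return True
    a1 = ancestors(v1, levels) | {v1}
    a2 = ancestors(v2, levels) | {v2}
    return bool(a1 & a2)
```

With this rule, "sprint" and "jog" fall into one verb class via their shared hypernym "run", while "sprint" and "act" do not match because "act" lies three levels up, beyond the two-level limit.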

Other than those simple features (nouns, verbs, and word stems), we also have two complex features. The first one is topic terms, or more exactly, topic noun phrases. Topic terms are defined as those terms with co-occurrence larger than one (Katz, 1996). Topic terms usually hold more content information (such as “information retrieval” in Figure 1), which means they should carry more weight in our algorithm. The other complex feature is a combined feature of nouns and verbs. We extract the main noun and verb in each sentence according to the POS tags, with the expectation of capturing the complex relationship information of subject plus behavior.

Different from the above five content-based features, the two discourse-based features focus on the small text body surrounding the pseudo-boundaries proposed by the algorithm based on the five content-based features. We use a size of five words in our algorithm. In other words, we check the five words before and after each pseudo-boundary. If we find any pronoun (from a pronoun list) within the five-word window, we decrease the probability score of this pseudo-boundary being a true boundary. The reason is that pronouns usually substitute for nouns or noun phrases that appear within the same topic. Any occurrence of a cue phrase (from a cue phrase list) will increase the probability of the pseudo-boundary being a true boundary, because cue phrases usually indicate a change of discourse structure (e.g., the cue phrase “Let” at the beginning of topic 5, Figure 1). We use the general cue phrase list (see Table 1) and the pronoun list (see Table 2) from Reynar (1998).
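A sketch of this discourse-based adjustment follows. The ±0.5 weights and the abbreviated word lists are illustrative assumptions, not values from the paper.

```python
PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their"}
CUE_PHRASES = {"let", "now", "next", "first", "so", "ok", "finally"}

def adjust_boundary_score(score, words_before, words_after,
                          penalty=0.5, bonus=0.5):
    """Adjust a pseudo-boundary's score using the five words on each side.

    Pronouns near a boundary argue against it (their antecedents sit in
    the same topic), so they subtract `penalty`; cue phrases argue for
    it, so they add `bonus`.
    """
    context = [w.lower() for w in (words_before[-5:] + words_after[:5])]
    if any(w in PRONOUNS for w in context):
        score -= penalty
    if any(w in CUE_PHRASES for w in context):
        score += bonus
    return score
```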

After extracting the feature vectors, we need a measure to calculate the similarity between two neighboring text windows represented by the seven feature vectors.

Similarity Measure

The similarity between two neighboring text windows (w1 and w2) is calculated according to the cosine measure in the vector space model (Salton et al., 1996). Given two neighboring text windows, their similarity score is the sum of the normalized inner products of the seven feature vectors’ weights. The basic idea is that neighboring text windows with more overlapping features (e.g., noun phrases, verbs) will have higher similarity (see Equation 1).

Here, j represents the different feature types (1 to 7 here), and i ranges over all the specific feature weight values (e.g., noun phrases) in the text window. f_{j,i,w1} is the i-th feature weight value of the j-th feature vector in text window w1. We calculate f_{j,i,w1} based on term frequency (TF). j is the feature type and i is the specific word or noun phrase in the feature vector. S_j is the significance value of a specific feature type. The best way to calculate S_j is to use a language model or word model and utilize a large corpus. For instance, Reynar (1998) uses the G model and the Wall Street Journal corpus to calculate S_j (called word frequency in Reynar, 1998). However, without a large training corpus of lecture videos available, the significance values S_j are estimated based on human heuristics and hand tuning. We assume that the significances of the five content-based features are in the following order: S(TNP) > S(NV) > S(NP) > S(VC) > S(WS).

Finding the Boundaries

After the similarity between two neighboring windows for each interval is calculated, a similarity graph for all the intervals is drawn (see Figure 2). Intervals are certain numbered locations in the text transcript (e.g., 13, 15, 17…), similar to the concept of a “gap” in TextTiling (Hearst, 1994). The X-axis indicates intervals and

Equation 1:

Similarity(w_1, w_2) = \sum_j S_j \cdot \frac{\sum_i f_{j,i,w_1} \, f_{j,i,w_2}}{\sqrt{\sum_i f_{j,i,w_1}^2 \cdot \sum_i f_{j,i,w_2}^2}}
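A direct reading of Equation 1 in code (feature vectors as term-frequency dicts; the function and argument names are ours, and the placement of S_j as a per-feature-type weight follows the description of the significance values above):

```python
from math import sqrt

def window_similarity(features_w1, features_w2, significance):
    """Sum, over feature types j, of the significance-weighted cosine of
    the two windows' TF vectors.

    `features_w1[j]` / `features_w2[j]` map a feature type (e.g. "NP")
    to a dict of term -> frequency; `significance[j]` is S_j.
    """
    total = 0.0
    for j, s_j in significance.items():
        v1 = features_w1.get(j, {})
        v2 = features_w2.get(j, {})
        num = sum(v1[t] * v2.get(t, 0) for t in v1)
        den = sqrt(sum(c * c for c in v1.values()) *
                   sum(c * c for c in v2.values()))
        if den:
            total += s_j * num / den
    return total
```

Identical windows score S_j per feature type (1.0 with a single feature of weight 1), while windows sharing no terms score 0.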

Table 1. Cue phrases

actually, also, although, and, basically, because, but, essentially, except, finally, first, firstly, further, furthermore, generally, however, indeed, let, look, next, no, now, ok, or, otherwise, right, say, second, see, similarly, since, so, then, therefore, well, yes


the Y-axis indicates the similarity between neighboring windows at each corresponding interval (e.g., the interval at sentence number 17). The intervals with the largest depth values (deepest valleys) are identified as possible topic boundaries. The depth value is based on the distances from the valley to the peaks on both sides of the valley, which reveals how strongly the features of topics (e.g., the frequency of noun phrases occurring) change on both sides. For instance, the depth value of the interval at sentence number 27 is equal to (y3 - y2) + (y1 - y2) (see Figure 2). To decide how many boundaries the algorithm will assign, we use a cutoff function (m - sd), where m is the mean of all depth values and sd is the standard deviation. In other words, we draw boundaries only if the depth values are larger than (m - sd).
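The valley-depth selection with the (m - sd) cutoff can be sketched as follows (an illustrative reading of the rule above; the function name is ours):

```python
def find_boundaries(similarities):
    """Select topic boundaries from the inter-window similarity series.

    Each local valley gets a depth score: (left peak - valley) +
    (right peak - valley). Valleys whose depth exceeds (mean - standard
    deviation) of all depth values become boundaries.
    """
    depths = []
    for i in range(1, len(similarities) - 1):
        if similarities[i] <= similarities[i - 1] and similarities[i] <= similarities[i + 1]:
            # climb to the nearest peak on each side of the valley
            left = i
            while left > 0 and similarities[left - 1] >= similarities[left]:
                left -= 1
            right = i
            while right < len(similarities) - 1 and similarities[right + 1] >= similarities[right]:
                right += 1
            depth = (similarities[left] - similarities[i]) + \
                    (similarities[right] - similarities[i])
            depths.append((i, depth))
    if not depths:
        return []
    values = [d for _, d in depths]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [i for i, d in depths if d > mean - sd]
```

Deep valleys (strong vocabulary shifts on both sides) survive the cutoff; shallow dips are discarded.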

EVALUATION

To test our research questions that noun phrases are salient features and that the combination of features improves accuracy, we evaluated our algorithm with a subset of features. We chose five features (NP, TNP, WS, CP, PN) to conduct a preliminary experiment. These five features include salient features such as noun phrases (NP), and a combination of both content-based and discourse-based features: NP, TNP (topic noun phrases), and WS (word stems) as content-based features, and CP (cue phrases) and PN (pronouns) as discourse-based features. The performance of PowerSeg was compared to that of a baseline method and TextTiling (Hearst, 1994), one of the best text segmentation algorithms. For TextTiling we used a Java implementation from Choi (2000). We also developed a simple version of the baseline segmentation algorithm. The baseline algorithm randomly chose points (e.g., certain sentence numbers) to be topic boundaries. We hypothesized that:

H1: The PowerSeg algorithm with NP alone achieves higher performance than TextTiling and Baseline.

H2: The PowerSeg algorithm with NP+CP+PN achieves higher performance than PowerSeg with NP alone or WS alone.

Data Set and Performance Metrics

Since there was no available annotated corpus for lecture videos, we used the lecture videos in our e-learning system called Learning By Asking (LBA) (Zhang, 2002) as pilot data for evaluation. Because the task of transcribing lecture videos is very time consuming, we chose a small data set of three videos for our preliminary experiment. All three videos were chosen randomly and transcribed by human experts. The three videos were selected from two different courses and instructors. One video was from a lecture about the Internet and Web search engines, and the other two were from a database course. The three transcripts corresponding to the videos were used for evaluation purposes. The average length of the videos is around 28 minutes, and the average number of words in the transcripts is 1,859. We assumed that the segmentation results from the experts are perfect (100% accuracy). The performance measures of PowerSeg, TextTiling, and Baseline were calculated by comparing their output to the results from the experts.

Selecting an appropriate performance measure for our purpose is difficult. The metric suggested by Beeferman et al. (1997) is well accepted and has been adopted by TDT. It measures the probability that two sentences drawn at random


from a corpus are correctly classified as to whether they belong to the same story. However, this metric cannot fulfill our purpose because it requires some knowledge of the whole collection, and it is not clear how to combine the scores from probabilistic metrics when segmenting collections of texts in different files (Reynar, 1998). Instead, we chose precision, recall, and F-measure as our metrics. Precision and recall were chosen because they are well accepted and frequently used in the information retrieval and text segmentation literature (Hearst, 1994; Reynar, 1998). F-measure was chosen to overcome the tuning effects of precision and recall. Precision, recall, and F-measure are defined as shown in Equation 2.

No_of_Matched_Boundaries is the number of correctly identified or matched boundaries when compared to the actual boundaries identified by experts. No_of_Hypothesized_Boundaries is the number of boundaries proposed by the algorithm (e.g., PowerSeg). Besides exact match, we also used the concept of a fuzzy boundary, which means that hypothesized boundaries that are a few sentences (usually one) away from the actual boundaries are also considered correct. We used fuzzy boundaries because for lengthy lecture videos, one sentence away from the actual boundary is acceptable for most applications. It is only a very short time period when we map the transcript back to the video. For instance, the average time span of one sentence in our data set is only 12 seconds.
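The boundary matching and the metrics of Equation 2 can be sketched as follows. This is a minimal sketch; the function name and the greedy one-to-one matching strategy are our assumptions, not the paper's code:

```python
def evaluate(hypothesized, actual, tolerance=0):
    """Precision, recall, and F-measure for segment boundaries.

    A hypothesized boundary counts as matched if an unused actual
    boundary lies within `tolerance` sentences of it (tolerance=0 is
    exact match; tolerance=1 is the "fuzzy boundary" condition).
    """
    unmatched = list(actual)
    matched = 0
    for h in sorted(hypothesized):
        for a in unmatched:
            if abs(h - a) <= tolerance:
                unmatched.remove(a)   # each actual boundary matches once
                matched += 1
                break
    p = matched / len(hypothesized) if hypothesized else 0.0
    r = matched / len(actual) if actual else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, hypothesized boundaries at sentences 13, 28, and 40 against actual boundaries at 13, 27, and 41 all match under tolerance 1, but only sentence 13 matches exactly.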

Experiment and Results

We ran the three algorithms (Baseline, TextTiling, and PowerSeg) using the three transcripts and calculated the mean performance. The performance measures (precision, recall, and F-measure) were calculated under two conditions: exact match and fuzzy boundary. Under the fuzzy boundary condition, a hypothesized boundary that is one sentence away from the actual boundary is acceptable.

Hypothesis Testing. First, in order to test whether noun phrases are salient features (H1), we ran the PowerSeg algorithm with the NP feature only, TextTiling, and Baseline using our dataset. We found that even with NP only, PowerSeg improved the performance (F-measure) by more than 10% compared to both Baseline and TextTiling under the "fuzzy boundary" condition (see Table 3). Under the "exact match" condition, PowerSeg performed only 5% better than Baseline, although it was 15% better than TextTiling. Surprisingly, the TextTiling algorithm performed even worse than Baseline. It showed that the "bag of

Precision (P) = No_of_Matched_Boundaries / No_of_Hypothesized_Boundaries

Recall (R) = No_of_Matched_Boundaries / No_of_Actual_Boundaries

F-Measure = 2PR / (P + R)

Equation 2.


words" algorithms were not good at identifying exact topic boundaries, especially when the transcript is about sub-topics that share a lot of words (e.g., topics 13 and 14 in Figure 1). On the other hand, all algorithms performed better under the "fuzzy boundary" condition, as expected.

In order to evaluate the effectiveness of feature combination (H2), we ran four different versions of PowerSeg using four feature subsets: WS (word stems only), NP (noun phrases only), NP+TNP (noun phrases plus topic noun phrases), and NP+CP+PN (noun phrases, cue phrases, and pronouns) (see Table 4). We found that the combination of noun phrases, cue phrases, and pronouns performed better than noun phrases (NP) alone. This showed that the combination of multiple features, especially the combination of content-based and discourse-based features, improved segmentation performance (F-measure).

However, the improvement was very small, only around 2%. A possible reason is that the cue phrase and pronoun lists we used are too general given our small data set; those words may occur only rarely in it. To our surprise, the NP+TNP combination performed slightly worse than NP alone. One possible reason is that although we defined topic noun phrases as those noun phrases with frequency larger than one, our feature weighting method and similarity calculation were still based on term frequency. When we calculated the similarity between two text windows, TNPs already occupied a large percentage of the weight. From another perspective, it also showed that complementary features, such as content-based and discourse-based features, would improve performance, but not features with similar characteristics, such as noun phrases and topic noun phrases.

In summary, H1 has been supported, but H2 has not. In other words, the PowerSeg algorithm using noun phrases alone performed better than the Baseline and TextTiling methods. Referring to our first research question, we have shown that noun phrases are salient linguistic features that can greatly improve the performance of a video segmentation system. We suggest that noun phrasing can be useful for video indexing and other applications.

she, her, hers, herself, he, him, his, himself, they, their, them, theirs, themselves

Table 2. Pronouns

                 Exact Match                    Fuzzy (1)
Algorithm        Precision  Recall  F-Measure   Precision  Recall  F-Measure
Baseline         0.32       0.32    0.32        0.56       0.56    0.56
TextTiling       0.30       0.18    0.22        0.75       0.46    0.56
PowerSeg (NP)    0.41       0.35    0.37        0.77       0.67    0.70

Table 3. Comparison of algorithms


However, concerning the second research question, we found that the combination of different linguistic features did not further improve the performance of the algorithm. Besides the small test dataset and the general cue phrase and pronoun lists, another possible reason is that noun phrases are very important and already represent most of the topic and content information in the text. Therefore, the addition of other features does not provide more useful information to the algorithm.

DISCUSSION

Overall, the experiment results are promising. Our proposed PowerSeg algorithm achieved 0.70 in F-measure when noun phrases were used as the only feature and the fuzzy boundary was applied. As it has been shown that human agreement on segmentation is often only around 0.76 (precision: 0.81; recall: 0.71) (Hearst, 1994), our algorithm performed similarly by agreeing well with the segmentation generated by our human experts.

Because of the distinct characteristics of datasets and different performance measures (as described in our literature review and evaluation sections), it is hard to compare our segmentation results with those achieved in other domains, such as broadcast news segmentation. However, the segmentation of lecture videos is expected to be more difficult because of the lack of large training datasets and the large variety of instructional methods. For instance, the formal presentation format and cue phrases that methods in the news domain rely on heavily are not available for lecture videos. As previous research shows that the segmentation performance on the HUB-4 news broadcast data, measured by precision and recall, is only around 0.6

(Reynar, 1998), our algorithm achieves a promising performance. For further comparison, future research needs to be conducted to evaluate the performance of our algorithm using broadcast news data.

In the following, we discuss the practical implications of our research and also the limitations of the experiment.

PRACTICAL IMPLICATIONS

Our study has proposed a video segmentation algorithm that has significant implications for multimedia and e-learning applications. The proposed algorithm achieved a precision of 77% and a recall of 68% (when using a combination of noun phrases, cue phrases, and pronouns), which we believe is sufficient for practical applications, because even human experts do not totally agree with each other's segmentation results (Hearst, 1994). With the decreasing cost of disk storage and network bandwidth, and the increasing amount of digitized lecture videos available, the proposed algorithm can be applied to facilitate better organization of the videos. For example, in the LBA e-learning system discussed earlier, lecture videos can be segmented to support better retrieval and browsing by students. This automated approach will save a lot of the time and effort that human experts would need to manually segment the videos. The video segmentation technique also facilitates the classification of videos into topics. This could allow instructors to share their lectures more easily, for example, by sharing segments of their lecture videos on certain topics.

We also found that noun phrases are salient features and very useful for text-based video segmentation. This suggests that noun phrases can also be useful for other


video applications when transcripts are available. For example, noun phrases can be used for indexing, classification, or clustering of videos in e-learning or other video applications.

LIMITATIONS

There are several limitations to our study. First, although the evaluation results are encouraging, one should note that only three videos were used in our experiment. Caution needs to be taken when interpreting our findings. More evaluations on larger sets of data from different instructors and courses will be needed to increase the reliability and validity of our results. Second, the "transcript problem" needs to be addressed. The performance of the video segmentation algorithm depends greatly on the correctness of the transcripts. Currently, when transcripts of the videos are not available, they have to be created using speech recognition software, which often does not achieve high accuracy. Such transcripts have to be corrected manually in order to ensure better performance in video segmentation. As such, the video segmentation process cannot be fully automated when transcripts are not available. Third,

the method is currently designed and tested for English lectures only. Noun phrases, while salient in English, may not be as useful in other languages. Some components in our system are also language-specific (e.g., the speech recognition software). Customization of the system, tuning of the parameters, and further testing will be necessary if the algorithm is applied to videos in a language other than English.

CONCLUSION AND FUTURE DIRECTIONS

With the purpose of extracting the hidden knowledge structure of lecture videos, and making them searchable and usable for educators and students, we investigated an automated video segmentation approach with high performance, especially for videos having few shot changes and unclear topic boundaries. We explored how computational linguistics research and natural language processing techniques can facilitate automated segmentation. Our approach utilized salient linguistic segmentation features such as noun phrases, and combined content-based and discourse-based features to gain more resolving power. Our preliminary experiment results

                 Exact Match                    Fuzzy (1)
Features         Precision  Recall  F-Measure   Precision  Recall  F-Measure
Baseline         0.32       0.32    0.32        0.56       0.56    0.56
WS               0.30       0.18    0.22        0.75       0.46    0.56
NP               0.41       0.35    0.37        0.77       0.67    0.70
NP+TNP           0.39       0.32    0.34        0.73       0.60    0.65
NP+CP+PN         0.42       0.37    0.39        0.77       0.68    0.72

Table 4. Comparison of PowerSeg with different feature subsets


Topic 3: Definition of Information Retrieval
13. Information retrieval lays a foundation for building various Internet search engines.
==========
14. Dr Salton from Cornell University is one of the most famous researchers in the field of information retrieval.
15. He defined that an IR system is used to store items of information that need to be processed, searched, retrieved, and disseminated to various user populations.
16. Generally speaking, information retrieval is an effort to achieve accurate and speedy access to pieces of desired information in a huge collection of distributed information sources by a computer.
17. In the current information era, the volume of information grows dramatically.
18. We need to develop computer programs to automatically locate the desired information from a large collection of information sources.
19. The three key concepts here are accurate, speedy, and distributed.
20. First, the retrieval results must be accurate.
21. The retrieved information must be relevant to users' needs.
22. The retrieval process has to be quick enough.
23. Besides, the relevant information has to be collected from distributed sources.
24. Theoretically there is no constraint on the type and structure of the information items.
25. In practice, though, most large-scale IR systems are still mostly processing textual information.
26. If the information is particularly well structured, database management systems are used to store and access that information.
Topic 4: Architecture of Information Retrieval
27. It is a simplified architecture of a typical information retrieval system.
==========
28. We start with the input side.
29. The main problem is to obtain a representation of each document and query suitable for a computer to use.
30. Most computer-based retrieval systems store and use only the representation of a document or query.
31. The original text of a document is ignored once it has been processed for the purpose of generating its representation.
32. For example, a document or a query is often represented by a list of extracted keywords considered to be significant.
33. A football Web page might be represented by a list of keywords such as quarterback, offense, defense, linebacker, fumble, touch down, game, etc.
34. The processor of the retrieval system is concerned with the retrieval process.
35. The process involves performing the actual retrieval function by executing the search strategy and matching a query representation with document representations.
36. The best matched documents are considered relevant to the query and will be displayed to users as output.
37. When a retrieval system is online, it is possible for the user to change his query during one search session in the light of a sample retrieval.
38. It is hoped improving the subsequent retrieval run.
39. Such a procedure is commonly referred to as feedback.
Topic 5: Some Key Concepts of Information Retrieval
**********
40. Let us learn some key concepts in information retrieval.
41. First, a query is a list of individual words or a sentence that expresses users' interest.
==========
42. Keywords refer to the meaningful words or phrases in the query or documents.
43. A list of keywords is often used to represent the contents of a query and a document.
44. Document indexing is the process of identifying and extracting keywords from documents to generate an index.
45. These indexing terms will be used to match with the query.
…


Figure 3. Part of the transcript for a lecture video about information retrieval and Web search engines (each line starts with the sentence number; lines with "==========" are boundaries identified by the automated algorithm; lines starting with "Topic:" are actual boundaries identified by human experts)

demonstrated that the effectiveness of noun phrases as salient features and the methodology of combining multiple segmentation features to complement each other are promising.

We are currently in the process of implementing the full algorithm into the

LBA system. We believe that the automated segmentation algorithm is a critical part of the preprocessing and authoring tool for the LBA system, and appropriate segmentations can improve the effectiveness and efficiency of information browsing or searching (e.g., in terms of time cost) and


thus the effectiveness of student learning. A controlled experiment, therefore, is proposed to assess the efficiency and effectiveness of our automated segmentation approach in the LBA system as used by students. University students will be recruited to participate in the experiment. The learning performance of students using the LBA system with automated segmentation will be measured and compared to that of students who use the LBA system with manual segmentation and those who use the LBA system without segmentation. We hypothesize that the LBA system with automated segmentation will produce comparable learning performance to the LBA system with manual segmentation.

In the proposed experiment, subjects will be randomly assigned to three groups: one control group, where students learn via the LBA system without segmentation, and two treatment groups, where students learn via the LBA system with automated or manual segmentation. Students in each group will be required to finish a lecture in LBA and then complete an open-book exam (posttest) with multiple-choice questions on the knowledge they learned from the lecture. A pretest, an exam of equal difficulty to the posttest, will be given to the students before they start to learn. The differences between the posttest and the pretest, as well as the time for completing the posttest, will be used as measures of a student's learning performance. Using a methodology that we developed in our previous research (Cao, Crews, Nunamaker, Burgoon, & Lin, 2004), we will also test the usability of the LBA system, such as user indications of utility, ease of use, and naturalness, which will further demonstrate the value of the segmentation approach.

REFERENCES

Allan, J., Carbonell, J., Doddington, G., Yamron, J., & Yang, Y. (1998). Topic detection and tracking pilot study: Final report. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.

Agius, H.W. & Angelides, M.C. (1999). Developing knowledge-based intelligent

[Figure 4 data: similarity graph with X-axis "Sentence Number" (intervals 7-53) and Y-axis "Similarity" (0-0.5). Labeled points: (13, 0.2499), (19, y1 = 0.4005), (27, y2 = 0.0900), (37, y3 = 0.2126), (39, 0.2066), (41, 0.1864).]

Figure 4. Example of a similarity graph (dashed vertical lines indicate the boundaries proposed by the automated method (e.g., PowerSeg); solid vertical lines indicate the actual boundaries)


multimedia tutoring systems using semantic content-based modeling. Artificial Intelligence Review, 13, 55-83.

Baltes, C. (2001). The e-learning balancing act: Training and education with multimedia. IEEE Multimedia, 8(4), 16-19.

Beeferman, D., Berger, A., & Lafferty, J. (1997). Text segmentation using exponential models. Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (pp. 35-46).

Blei, D.M. & Moreno, P.J. (2001). Topic segmentation with an aspect hidden Markov model. Proceedings of the 24th International Conference on Research and Development in Information Retrieval (SIGIR 2001). New York: ACM.

Cao, J., Crews, J.M., Nunamaker, J.F. Jr., Burgoon, J.K., & Lin, M. (2004). User experience with Agent99 Trainer: A usability study. Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 2004), Big Island, Hawaii.

Choi, F. (2000). Advances in domain independent linear text segmentation. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Seattle, Washington.

Cunningham, H. (2000). Software architecture for language engineering. PhD thesis, University of Sheffield, UK.

Daft, R.L. & Lengel, R.H. (1986). Organizational information requirements, media richness and structural design. Management Science, 32(5), 554-571.

Halliday, M. & Hasan, R. (1976). Cohesion in English. Longman.

Hearst, M.A. (1994). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1), 33-64.

Heinonen, O. (1998). Optimal multi-paragraph text segmentation by dynamic programming. Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL98) (pp. 1484-1486).

Hepple, M. (2000). Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), Hong Kong.

Kan, M., Klavans, J.L., & McKeown, K.R. (1998). Linear segmentation and segment significance. Proceedings of the 6th Workshop on Very Large Corpora (pp. 197-205).

Katz, S.M. (1996). Distribution of content words and phrases in text and language modeling. Natural Language Engineering, 2(1), 15-59.

Kaufmann, S. (1999, June). Cohesion and collocation: Using context vectors in text segmentation. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (Student Session) (pp. 591-595), College Park, Maryland.

Klavans, J. & Kan, M.Y. (1998). Role of verbs in document analysis. Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL98) (pp. 680-686).

Kozima, H. & Furugori, T. (1993). Similarity between words computed by spreading activation on an English dictionary. Proceedings of the European Association for Computational Linguistics (pp. 232-239).

Lin, M., Crews, J.M., Cao, J., Nunamaker, J.F. Jr., & Burgoon, J.K. (2003). AGENT99 Trainer: Designing a Web-based multimedia training system for deception detection knowledge transfer. Proceedings of the 9th Americas Conference on Information Systems,


Tampa, Florida.

Miller, G., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1990). Introduction to WordNet: An online lexical database. International Journal of Lexicography (Special Issue), 3(4), 235-312.

Morris, J. & Hirst, G. (1991). Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17, 21-48.

Ponte, J.M. & Croft, W.B. (1997). Text segmentation by topic. Proceedings of the European Conference on Digital Libraries (pp. 113-125), Pisa, Italy.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

Reynar, J.C. (1994). An automatic method of finding topic boundaries. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Student Session) (pp. 331-333), Las Cruces, New Mexico.

Reynar, J.C. (1998). Topic segmentation: Algorithms and applications. PhD thesis, Computer and Information Science, University of Pennsylvania, USA.

Reynar, J.C. (1999). Statistical models for topic segmentation. Proceedings of the 37th Annual Meeting of the ACL (pp. 357-364).

Salton, G., Allan, J., & Buckley, C. (1993). Approaches to passage retrieval in full text information systems. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 49-58), Pittsburgh, Pennsylvania.

Salton, G., Singhal, A., Buckley, C., & Mitra, M. (1996). Automatic text decomposition using text segments and text themes. Proceedings of Hypertext '96 (pp. 53-65). New York: ACM Press.

Sean, J.A. (1997). Capitalising on interactive multimedia technologies in dynamic environments. Retrieved March 1, 2004, from http://crm.hct.ac.ae/021senn.html

Tolle, K. & Chen, H. (2000). Comparing noun phrasing techniques for use with medical digital library tools. Journal of the American Society for Information Science (Special Issue on Digital Libraries), 51(4), 352-370.

Utiyama, M. & Isahara, H. (2001). A statistical model for domain-independent text segmentation. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (pp. 491-498).

Wactlar, H.D. (2000). Informedia: Search and summarization in the video medium. Proceedings of the Imagina 2000 Conference, Monaco.

Yaari, Y. (1997). Segmentation of expository texts by hierarchical agglomerative clustering. Proceedings of Recent Advances in Natural Language Processing, Bulgaria.

Yamron, J.P., Carp, I., Gillick, L., Lowe, S., & Van Mulbregt, P. (1999). Topic tracking in a news stream. Proceedings of the DARPA Broadcast News Workshop.

Youmans, G. (1991). A new tool for discourse analysis: The vocabulary management profile. Language, 67(4), 763-789.

Zhang, D.S. (2002). Virtual mentor and media structuralization theory. PhD thesis, University of Arizona, USA.

Zhang, H.J. & Smoliar, S.W. (1994). Developing power tools for video indexing and retrieval. Proceedings of SPIE '94 Storage and Retrieval for Video Databases, San Jose, California.

Zhang, W. (1995). Multimedia, technology, education and learning. Technological innovations in literacy and social studies education. The University of Missouri-Columbia.


Ming Lin is currently a PhD candidate in the Management Information Systems Department at the University of Arizona. His current research interests include knowledge management, computer-assisted learning, information retrieval, and AI/expert systems. His current research focuses on developing multimedia systems and studying their impacts on supporting learning and knowledge management. He has also been an active researcher in several projects funded by NSF, the Ford Foundation, and DARPA. Prior to his time at the University of Arizona, he received his master's degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, and his bachelor's degree in Computer Science from Northern Jiaotong University, China.

Jinwei Cao is a PhD candidate in Management Information Systems at the University of Arizona. Her research focuses on technology-supported learning, with a special interest in developing theoretical models, design principles, and prototypes of interactive computer-based training systems. She is also experienced in conducting behavioral research experiments to evaluate learning systems. Ms. Cao has authored and co-authored many publications, and she presents regularly at professional conferences including AMCIS, HICSS, and ICIS. In 2000, Ms. Cao earned her Master of Science in Multimedia Technologies from the Beijing Broadcasting Institute, China.

Michael Chau is currently a research assistant professor in the School of Business at the University of Hong Kong. He received his PhD in Management Information Systems from the University of Arizona and a bachelor's degree in Computer Science (Information Systems) from the University of Hong Kong. When he was in the Artificial Intelligence Lab, he was an active researcher in several projects funded by NSF, NIH, NIJ, and DARPA. His current research interests include information retrieval, Web mining, knowledge management, intelligent agents, and security informatics.

Jay F. Nunamaker Jr. is regents professor of MIS, Computer Science, and Communication, as well as the director of the Center for Management of Information at the University of Arizona, Tucson. He was a faculty member at Purdue University prior to founding the MIS Department at the University of Arizona in 1974. During his tenure of 30 years, the department has become known for its expertise in collaboration technology and the technical aspects of MIS. He has 40 years of experience in examining, analyzing, designing, testing, evaluating, and developing information systems. In 2002, Dr. Nunamaker received the LEO Award for lifetime achievement from AIS. He earned his PhD in Systems Engineering and Operations Research from the Case Institute of Technology, his MS and BS degrees from the University of Pittsburgh, and a BS degree from Carnegie Mellon University. He has also been a registered professional engineer since 1965.