
Story Segmentation and Topic Classification of Broadcast News via a Topic-Based Segmental Model and a Genetic Algorithm

Chung-Hsien Wu, Senior Member, IEEE, and Chia-Hsin Hsieh

Abstract—This paper presents a two-stage approach to story segmentation and topic classification of broadcast news. The two-stage paradigm adopts a decision tree and a maximum entropy model to identify the potential story boundaries in the broadcast news within a sliding window. The problem of story segmentation is thus transformed into the determination of a boundary position sequence from the potential boundary regions. A genetic algorithm is then applied to determine the chromosome, which corresponds to the final boundary position sequence. A topic-based segmental model is proposed to define the fitness function applied in the genetic algorithm. Syllable- and word-based story segmentation schemes are adopted to evaluate the proposed approach. Experimental results indicate that a miss probability of 0.1587 and a false alarm probability of 0.0859 are achieved for story segmentation on the collected broadcast news corpus. On the TDT-3 Mandarin audio corpus, a miss probability of 0.1232 and a false alarm probability of 0.1298 are achieved. Moreover, an outside classification accuracy of 74.55% is obtained for topic classification on the collected broadcast news, while an inside classification accuracy of 88.82% is achieved on the TDT-2 Mandarin audio corpus.

Index Terms—Genetic algorithm (GA), segmental model, story segmentation, topic classification.

    I. INTRODUCTION

As the amount of audio/speech/video stream data increases dramatically, it is urgent to have an intelligent agent that automatically detects story boundaries and topics in audio/speech/video streams for further applications. For instance, a well-segmented and classified broadcast news stream is clearly a significant prerequisite for information retrieval, data mining, summarization, and management. Moreover, an explicit story document is more suitable than an entire audio document without story boundaries when browsing or retrieving a multimedia document.

The story segmentation problem discussed in this study can be simply defined as: given a stream of text articles or text

Manuscript received June 30, 2008; revised March 23, 2009. Current version published September 04, 2009. This work was supported by the National Science Council, Taiwan, under Contract 95-2221-E-006-181-MY3. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gerhard Rigoll.

C.-H. Wu is with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan (e-mail: [email protected]).

C.-H. Hsieh is with the Institute for Information Industry, Kaohsiung 806, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2009.2021304

transcriptions of a broadcast audio/speech/video stream produced by a speech recognizer, identify the locations where the event or topic changes. Previous studies have approached story segmentation in either a feature space or a model space. Feature-space approaches generally employ lexical, audio-visual, or prosodic information. Wong et al. [6] adopted lexical features to capture words or phrases that are commonly used at the beginning or end of a segment in a particular domain. Moreover, C99 [7] and TextTiling [8] analyzed lexical cohesive strength, and SeLeCT [9] utilized lexical chaining between textual units for text segmentation. In contrast to approaches based on text features, Hauptmann et al. [1] integrated closed-caption cues with the results of image and audio processing, while restricting content analysis to some specific TV programs. Shriberg et al. [32] and Tür et al. [33] presented prosody-based methods that obtained segmentation cues directly from the acoustic signal via pauses, speaker changes, pitch, and energy. Model-space studies, in turn, adopted topic transition models, exponential models, statistical models (hidden Markov models, HMMs), and hybrid models. Makhoul et al. [2] presented the topic transition concept for story segmentation. The CMU approach [5] adopted content-based features, lexical features, and an exponential model. Van Mulbregt et al. [3] presented an HMM-based scheme, which adopted a classical language model and a topic transition model for story segmentation. Utiyama et al. [26] presented a statistical model for domain-independent text segmentation. Furthermore, the UMass approach [4] adopted Local Context Analysis (LCA) methods, but some problems remained. First, the approach is somewhat slow, since it needs one database query per sentence for query expansion and concept retrieval. Second, when automatically recognizing speech data without manually tagged sentence boundaries, the performance of LCA concept retrieval may degrade if the sentence boundaries detected from ASR are inaccurate.

Regarding topic classification, several well-known approaches to topic or text classification have been studied. Cohen and Singer [10] introduced the decision tree approach [11] for text classification. Lewis [12] studied probabilistic classifiers. Li and Jain [13] adopted the Naïve Bayes classifier. Yang [14] discussed the simplest and fastest example-based classifiers, such as the k-NN classifier. Joachims [15] introduced a support vector machine scheme for text classification.

Although story segmentation and topic classification have been extensively studied, some problems still exist. First, most story segmentation approaches adopt the sliding window


strategy for lexical, audio-visual, or prosodic features. However, the sliding window strategy may incur local false alarms due to a small window size or, conversely, miss true boundaries with a large window size. Even if a well-tuned window size is given, the fixed-length window is still not robust across different types of data. Therefore, a global and efficient search strategy is required, since the search complexity is enormous for story segmentation. Several search strategies have been proposed in previous research, such as hill climbing. Nevertheless, these search strategies usually result in locally optimal solutions, since the initial settings affect the search result significantly. Genetic algorithms (GAs) [18]–[20] have been shown to be a good solution to search problems with high computational complexity, since they explore the search space in multiple directions. A GA can also avoid locally optimal solutions, owing to the random initial chromosomes and the genetic operators crossover and mutation. Since the story segmentation problem is a stream-type data segmentation problem with a very large search space for the optimal solution, a GA is appropriate for story boundary detection. Furthermore, some previous studies solved stream-type data segmentation problems for time series data and speech signals via evolutionary computation and GAs. Chung et al. [27] presented an evolutionary approach to pattern-based time series segmentation and obtained satisfactory performance. Moreover, Salcedo-Sanz et al. [28] adopted GAs to solve the offline speaker segmentation problem for speech signals. Second, the HMM-based approach models the topic transitions in broadcast news by state transition probabilities. However, unlike speech signals, topic transitions cannot be modeled easily, since they depend on the editorial style and on the number and importance of the events, which vary daily.

Therefore, story segmentation results from the HMM-based scheme are generally dominated by the observation probability of the HMM. However, topic information is useful for story segmentation, since most stories represent significant topics in broadcast news. It is thus beneficial to incorporate topic information with lexical features to solve the story segmentation and topic classification problems simultaneously.

This study presents a topic-based segmental model that fuses topic information with boundary word information. Since story segmentation is an NP-complete problem, GAs, a useful approach to NP-complete problems, are adopted in this study to overcome the aforementioned difficulties. When applying a GA to problem solving, a well-designed fitness function is critical. In this paper, a topic-based segmental model comprising a topic model and a boundary model is proposed to characterize the fitness function. The topic model uses topic information to represent the coherence of topics inside a story. The boundary model uses the word information around a boundary to represent the distinction between the two stories segmented by that boundary. Furthermore, in practice, the search complexity in story segmentation is $O(2^N)$, where $N$ denotes the number of words in the whole broadcast news program. A prefiltering procedure is beneficial to reduce the computational complexity. Therefore, a hybrid approach using a decision tree and maximum entropy is utilized for pre-filtering. In the two-stage story

segmentation paradigm, stage 1 adopts the decision tree and maximum entropy model to obtain the coarse story boundaries locally using lexical features. This stage yields a very low miss probability but a high false alarm probability, thus reducing the search complexity for further computation in the GA. In stage 2, the story segmentation problem is treated as an optimization problem, which can be handled by the GA. Story segmentation is thus transformed into the global determination of chromosomes (boundary position sequences) in the potential story boundary regions based on a defined fitness function derived from a topic-based segmental model, which jointly models the topic and boundary characteristics of a segment. The topic information compensates for the insufficiency of the lexical information (boundary characteristics). Finally, the determination of the story boundaries (chromosomes) that maximize the fitness function (topic-based segmental model) is achieved globally by the GA.

The rest of this paper is organized as follows. Section II provides the system description. Section III defines the LSA-based Naïve Bayes topic classifier. Section IV gives a brief background introduction to GAs. Section V describes in detail the topic-based segmental model and the genetic algorithm for story segmentation. Section VI presents experiments and the performance evaluation of the proposed method. Conclusions are given in Section VII.

    II. SYSTEM DESCRIPTION

Fig. 1 shows the overall architecture for story segmentation and topic classification. In the training phase, a boundary model and a topic model are constructed from a large text corpus. The news text corpus with tagged topics was collected automatically from a news web portal every day. First, the features and the parameters of the boundary model for the decision tree and maximum entropy model were extracted and trained from the text corpus. Then a latent semantic analysis (LSA)-based Naïve Bayes topic classifier was constructed as the topic model for topic classification in the topic-based segmental model.

In the segmentation phase, given a speech document input, a speech recognizer was first adopted to transcribe the broadcast news into a top-5 syllable lattice as well as the potential word sequences. The syllable-based lattice or the word-based sequences were used for topic classification and segmentation. After extracting the features for pre-segmentation, the decision tree and maximum entropy model were adopted to obtain a coarse story boundary sequence using a sliding window. Finally, the genetic algorithm, along with the topic-based segmental model, was adopted to refine the story boundaries in the speech documents based on the boundary model and topic model.

III. LSA-BASED NAÏVE BAYES TOPIC CLASSIFIER

Given a speech stream comprising several contiguous stories, a speech recognizer is adopted to transcribe the speech signal into a syllable or word sequence $W = \{w_1, w_2, \ldots, w_N\}$, where $K$ denotes the known number of story boundaries and $S = \{s_1, s_2, \ldots, s_K\}$ denotes the position sequence of the known story boundaries. Given the context of a story,


Fig. 1. System architecture for story segmentation and topic classification.

the posterior probability of the word sequence belonging to topic $T_j$ is estimated based on Bayes' theorem as

$$P(T_j \mid W) = \frac{P(W \mid T_j)\,P(T_j)}{\sum_{k=1}^{N_T} P(W \mid T_k)\,P(T_k)} \qquad (1)$$

where $N_T$ denotes the number of topics adopted in this study. To formulate the topic model $P(W \mid T_j)$, the well-known concept of entropy-weighted term frequency is exploited to obtain the topic information matrix, and LSA [16], [17] is then adopted to reduce the dimensionality of the feature space. First, a word-by-topic matrix is established from the news text training corpus. The topic information matrix $\mathbf{A}$ is calculated as follows:

$$\mathbf{A} = [a_{ij}]_{M \times N_T} \qquad (2)$$

where $M$ denotes the number of words and $N_T$ denotes the number of topics. A text training corpus grouped into $N_T$ topics is utilized to generate the topic information matrix $\mathbf{A}$. Each element $a_{ij}$ of the topic information matrix is defined as the similarity between word $w_i$ and topic $T_j$, estimated as follows:

$$a_{ij} = (1 - E_i)\,\frac{f_{ij}}{\sum_{m=1}^{M} f_{mj}} \qquad (3)$$

where $f_{ij}$ denotes the term frequency of word $w_i$ in topic $T_j$, and $\sum_{m=1}^{M} f_{mj}$ denotes the summation of the term frequencies of all the terms in topic $T_j$. $E_i$ denotes the entropy of word $w_i$ over all topics:

$$E_i = -\frac{1}{\log N_T} \sum_{j=1}^{N_T} \frac{f_{ij}}{t_i} \log \frac{f_{ij}}{t_i} \qquad (4)$$

where $t_i = \sum_{j=1}^{N_T} f_{ij}$ denotes the summation of the term frequency of term $w_i$ over all topics, and $0 \le E_i \le 1$. If $E_i$ approaches 1, then word $w_i$ has low topic discriminability; otherwise, $w_i$ has high topic discriminability.
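To make the weighting of (2)–(4) concrete, the following Python sketch builds the topic information matrix from a raw word-by-topic term-frequency table. This is an editorial illustration, not the authors' code; the function name and the dense-array representation are assumptions.

```python
import numpy as np

def topic_information_matrix(tf):
    """Entropy-weighted topic information matrix of (2)-(4).

    tf : (M, N_T) array, tf[i, j] = term frequency of word i in topic j.
    Returns A with a_ij = (1 - E_i) * f_ij / sum_m f_mj, where E_i is the
    normalized entropy of word i over all topics.
    """
    tf = np.asarray(tf, dtype=float)
    M, NT = tf.shape
    t_i = tf.sum(axis=1, keepdims=True)             # total frequency of each word
    p = np.divide(tf, t_i, out=np.zeros_like(tf), where=t_i > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    E = -plogp.sum(axis=1) / np.log(NT)             # normalized entropy, in [0, 1]
    col_sum = tf.sum(axis=0, keepdims=True)         # total term frequency per topic
    A = (1.0 - E)[:, None] * np.divide(tf, col_sum,
                                       out=np.zeros_like(tf),
                                       where=col_sum > 0)
    return A
```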

In addition, the topic information matrix $\mathbf{A}$ may contain noise information, such as words with very low topic discriminability. That is, the feature space contains redundant dimensionality. Therefore, LSA is introduced to reduce the dimensionality of the feature space by eliminating the noise information and extracting the latent semantic information between words and topics. LSA involves two steps: singular value decomposition (SVD) and feature dimensionality reduction. The SVD projection is performed by decomposing the word-by-topic matrix $\mathbf{A}$ into the product of three matrices $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$:

$$\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{T}, \quad \text{with } M \gg N_T \qquad (5)$$

where $\mathbf{U} \in \mathbb{R}^{M \times N_T}$ and $\mathbf{V} \in \mathbb{R}^{N_T \times N_T}$ have orthogonal columns, i.e., $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and $\mathbf{V}^T\mathbf{V} = \mathbf{I}$. $\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_1, \ldots, \sigma_{N_T})$ denotes the diagonal matrix with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{N_T} \ge 0$. By reducing the ranks of the matrices $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$ to $R$, a truncated matrix $\hat{\mathbf{A}}$ is derived as follows:

$$\hat{\mathbf{A}} = \mathbf{U}_R \boldsymbol{\Sigma}_R \mathbf{V}_R^{T} \qquad (6)$$

where $\hat{\mathbf{A}}$ denotes the least-squares rank-$R$ approximation of $\mathbf{A}$. The original feature space is transformed by multiplying with $\mathbf{U}_R^{T}$ to obtain a topic strength matrix $\mathbf{B}$ as follows:

$$\mathbf{B} = \mathbf{U}_R^{T} \hat{\mathbf{A}} = \boldsymbol{\Sigma}_R \mathbf{V}_R^{T} \qquad (7)$$

For topic classification, all the words in $W = \{w_1, w_2, \ldots, w_L\}$ are assumed to be independent. The word sequence is first transformed into a vector $\mathbf{w}$, whose elements are computed with the weighting of (3). The $M$-dimensional column vector $\mathbf{w}$ is transformed into an $R$-dimensional row vector $\bar{\mathbf{w}}$ by the following formula:

$$\bar{\mathbf{w}} = \mathbf{w}^{T}\,\mathbf{U}_R \qquad (8)$$

The vector space model (VSM), an algebraic model for representing text documents as vectors of index terms, has been successfully used in relevancy ranking of documents. In this paper, based on the topic strength matrix $\mathbf{B}$, the likelihood $P(W \mid T_j)$ of the topic model between word sequence $W$ and topic $T_j$ is


approximated using the vector space model (VSM) as follows:

$$P(W \mid T_j) \approx \operatorname{sim}(\bar{\mathbf{w}}, \mathbf{b}_j) = \frac{\bar{\mathbf{w}} \cdot \mathbf{b}_j}{\|\bar{\mathbf{w}}\| \, \|\mathbf{b}_j\|} \qquad (9)$$

where $\mathbf{b}_j$ denotes the $j$th column vector in $\mathbf{B}$. Finally, the posterior probability is derived from (1) and (9) as

$$P(T_j \mid W) = \frac{\operatorname{sim}(\bar{\mathbf{w}}, \mathbf{b}_j)\,P(T_j)}{\sum_{k=1}^{N_T} \operatorname{sim}(\bar{\mathbf{w}}, \mathbf{b}_k)\,P(T_k)} \qquad (10)$$

The topic of a word sequence is determined by the following equation:

$$\hat{T} = \arg\max_{T_j} P(T_j \mid W) \qquad (11)$$
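The whole classifier of (5)–(11) can be summarized in a few lines of linear algebra. The sketch below is ours, not the paper's: it assumes the reconstructed folding-in of (8) and defaults to a uniform topic prior.

```python
import numpy as np

def train_lsa(A, R):
    """Truncated SVD of the topic information matrix, as in (5)-(7)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_R = U[:, :R]                          # word-space projection
    B = np.diag(s[:R]) @ Vt[:R, :]          # topic strength matrix B = Sigma_R V_R^T, (7)
    return U_R, B

def classify(w, U_R, B, prior=None):
    """Topic decision of (10)-(11): cosine similarity times topic prior.
    w : (M,) document vector built with the weighting of (3)."""
    w_bar = w @ U_R                         # folding-in, as in (8)
    sims = (w_bar @ B) / (np.linalg.norm(w_bar) * np.linalg.norm(B, axis=0) + 1e-12)
    prior = np.full(B.shape[1], 1.0 / B.shape[1]) if prior is None else prior
    return int(np.argmax(sims * prior))     # arg max over topics, (11)
```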

IV. BACKGROUND TO GENETIC ALGORITHMS

Genetic algorithms (GAs) [18]–[20] are a technique that simulates genetic evolution to search for the solution to a problem, especially NP or NP-complete problems with large search spaces and high computational complexity. The story segmentation problem in this study is an NP-complete problem with $O(2^N)$ computational complexity; thus, a GA is suitable for searching for a solution to this problem. A GA uses multidirectional search to reduce the search complexity from $O(2^N)$ to approximately $O(N \cdot N_{pop} \cdot N_{gen})$, where $N$ denotes the number of words of the input word sequence, $N_{pop}$ denotes the population size, and $N_{gen}$ denotes the generation number. Although there are numerous optimization techniques, such as hill climbing and simulated annealing, the hill climbing algorithm easily incurs a locally optimal solution influenced by the initial setting. The simulated annealing algorithm utilizes the Gibbs-Boltzmann distribution to avoid the local optimum problem; however, since the best solution at each temperature decrement is obtained by repeatedly searching the same temperature region, the annealing algorithm suffers from low efficiency. In contrast to the above two approaches, a GA can avoid incurring a locally optimal solution based on two special operators, crossover and mutation, which search globally in multiple directions among a large population.

Fig. 2 illustrates an example of the implementation of the GA for story segmentation. In a GA, an initial population, i.e., a set of chromosomes, is created randomly. In this study, each chromosome maps to a possible story boundary sequence in each sub-segment generated by a rough story pre-segmenter. Then, two operators, crossover and mutation, are applied to those chromosomes based on two probabilities: the crossover probability and the mutation probability. The crossover operator interchanges parts of two chromosomes at a randomly selected position, and the mutation operator changes

Fig. 2. Flowchart of the genetic algorithm.

one bit in a chromosome to its complement. Following these operators, the offspring are generated and evaluated by a predefined fitness function. The fitness function plays a critical role in GAs, defining the fitness of a chromosome. A roulette wheel is then created based on the fitness of the offspring, and better chromosomes are selected from the offspring by the roulette wheel. With the roulette wheel, chromosomes with higher fitness values are more likely to be selected for the next generation. In addition, the best parent chromosomes are kept for the next generation. The above procedure continues until the termination condition is satisfied, such as when a certain number of generations has been produced or when the difference in average fitness values between two generations falls below a threshold. Finally, the chromosome with the highest fitness value is regarded as the best solution to the problem.
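As a concrete illustration of this loop, here is a minimal generic GA skeleton with roulette-wheel selection, one-point crossover, bit-flip mutation, and elitism. It sketches the procedure of Fig. 2 under the assumption of a non-negative fitness function; it is not the authors' implementation, and the default parameter values are placeholders.

```python
import random

def run_ga(fitness, chrom_len, n_pop=50, n_gen=200, p_cross=0.75, p_mut=0.05):
    """Generic GA loop: roulette-wheel selection, one-point crossover,
    bit-flip mutation, and elitism. Assumes fitness(c) >= 0 for all c."""
    pop = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(n_pop)]
    for _ in range(n_gen):
        scores = [fitness(c) for c in pop]
        best = pop[scores.index(max(scores))][:]      # elitism: keep best parent
        total = sum(scores) or 1.0

        def roulette():
            r, acc = random.uniform(0, total), 0.0
            for c, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return c
            return pop[-1]

        children = [best]
        while len(children) < n_pop:
            a, b = roulette()[:], roulette()[:]
            if random.random() < p_cross:             # one-point crossover
                cut = random.randrange(1, chrom_len)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for c in (a, b):
                for i in range(chrom_len):
                    if random.random() < p_mut:       # mutation: flip a bit
                        c[i] ^= 1
            children.extend([a, b])
        pop = children[:n_pop]
    scores = [fitness(c) for c in pop]
    return pop[scores.index(max(scores))]
```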

V. STORY SEGMENTATION USING A GENETIC ALGORITHM

In this study, the segmentation problem is transformed into an evolution problem, which is then solved by the genetic algorithm. The story segmentation problem can be mapped easily to a GA by mapping the story boundary positions to chromosomes and defining a fitness function to evaluate each possible story boundary position sequence (chromosome).

Fig. 3 shows the block diagram for story segmentation by the GA. The input is a syllable or word sequence generated from the speech recognizer. This study adopts a decision tree and a maximum entropy model over the word sequence and determines the potential boundaries locally. Each potential boundary is then relaxed to include its neighborhood for further processing. The story segmentation problem is transformed into discovering a chromosome (boundary position sequence) in the potential boundary regions using the genetic algorithm shown in Fig. 2. Fig. 4 shows an example of the problem mapping. The input word sequence is pre-segmented with three potential boundaries. Each potential region is expanded to cover ten words centered on its corresponding potential boundary. Therefore, a possible segmentation solution can be represented as


    Fig. 3. Flowchart of story segmentation using GA.

    Fig. 4. Diagram for segmentation problem mapping.

a three-digit sequence (3, 7, 9), i.e., a chromosome. A topic-based segmental model is thus proposed as the fitness function to model the story segmentation situation. The two-stage story segmentation approach is employed since the story segmentation problem is an NP-complete problem with computational complexity $O(2^N)$, where $N$ denotes the number of words in the whole broadcast program. The coarse boundary detection can reduce the computational complexity to $O(W^K)$, where $W$ denotes the expanded window size and $K$ denotes the number of coarse boundaries. If the miss rate of coarse boundary detection is low enough, the computational complexity can be significantly reduced, since $W \ll N$ and $K \ll N$, and therefore $W^K \ll 2^N$.
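A back-of-the-envelope computation shows the scale of this reduction. The numbers below (a 10,000-word program, $K = 20$ coarse boundaries, $W = 10$-word regions) are illustrative assumptions, not figures from the paper:

```python
import math

N, K, W = 10_000, 20, 10           # assumed values, for illustration only
full_space = N * math.log10(2)     # log10(2^N): exhaustive boundary search
coarse_space = K * math.log10(W)   # log10(W^K): search restricted to regions
print(f"log10(2^N) = {full_space:.0f}")    # ~3010, i.e., a 3011-digit count
print(f"log10(W^K) = {coarse_space:.0f}")  # 20, i.e., a 21-digit count
```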

A. Stage 1: Coarse Boundary Detection by Decision Tree and Maximum Entropy

This study adopts the decision tree [24] and the maximum entropy model [5], [24] to search for the coarse story boundaries locally within a sliding window. The aim of this stage is to reduce the search complexity of the GA while keeping a very low miss probability, at the cost of a high false alarm probability. The speech recognizer transcribes the speech into syllable and word sequences. For the syllable/word sequence, a sliding window $SW$ with a length of $L$ words is defined and shifts with a fixed step size in the segmentation procedure. The posterior probability $P(B \mid SW)$ is defined based on Bayes' theorem:

$$P(B \mid SW) = \frac{P(SW \mid B)\,P(B)}{\sum_{b \in \{0,1\}} P(SW \mid b)\,P(b)} \qquad (12)$$

where the binary random variable $B$ denotes whether the current window position is a story boundary, and $P(SW \mid B)$ denotes the boundary model. The prior probability $P(B)$ is assumed to be uniform and is set to 0.5.

Story boundary modeling requires a huge amount of training data to avoid the data sparseness problem. The boundary model $P(SW \mid B)$ is approximated by combining the scores of the decision tree and the maximum entropy model in the model domain:

$$P(SW \mid B) \approx S_{DT}(SW)^{\alpha}\; S_{ME}(SW)^{1-\alpha} \qquad (13)$$

where $\alpha$ denotes a fusion factor, and $S_{DT}$ and $S_{ME}$ denote the scores of the decision tree and the maximum entropy model, respectively. The segmentation procedure in Stage 1 was performed by Franz's method [24], a hybrid approach to story segmentation that fuses a decision tree and a maximum entropy model. In the training procedure, the training data are used to train a decision tree based on a set of binary features (the question set), with the score for deciding whether a position is a boundary estimated at each leaf node. In addition, the training data are also used to train the model parameters of the maximum entropy model, which comprises a log-linear model based on a set of binary features. Therefore, in the test phase, the score $S_{ME}$ of the maximum entropy model can be calculated from the feature set of the input data based on the trained model parameters. Similar to Franz's method, the boundary model adopts three features in the classification and regression tree (CART).

1) The duration of events (silence or pause) marked as non-speech in the ASR transcript.

2) The presence of words and word pairs strongly correlated with document boundaries.

3) The comparative distributions of nouns on either side of the proposed boundary.

Second, the maximum entropy model adopts three features:

1) The three features adopted by the decision tree model.

2) The $n$-grams obtained from the context to the left and right of the proposed boundary.


3) Structural and large-scale regularities in the broadcast news, including specific time slots for commercials and the speaking rate.

In the segmentation procedure, the posterior probability $P(B \mid SW)$ of each sliding window is estimated and depicted as a score curve across the entire text stream. A peak extraction procedure is adopted to extract the coarse story boundaries, defined as positions whose posterior probability is higher than a pre-tuned threshold $\varepsilon$. The threshold $\varepsilon$ was tuned on a tuning data set to achieve a target miss rate of less than 0.01 in the coarse boundary detection.
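A minimal sketch of this Stage-1 scoring and peak extraction, assuming the per-window decision tree and maximum entropy scores are already available as arrays (the function and argument names are ours):

```python
import numpy as np

def coarse_boundaries(scores_dt, scores_me, alpha=0.5, eps=0.1):
    """Fuse per-window DT and ME boundary scores as in (13) and keep
    local peaks above the threshold eps, as in the Stage-1 procedure.

    scores_dt, scores_me : arrays of scores, one per window position.
    """
    s = np.power(scores_dt, alpha) * np.power(scores_me, 1.0 - alpha)  # (13)
    peaks = []
    for i in range(1, len(s) - 1):
        if s[i] > eps and s[i] >= s[i - 1] and s[i] >= s[i + 1]:  # local peak
            peaks.append(i)
    return peaks
```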

B. Stage 2: Fine Boundary Detection

1) Topic-Based Segmental Model: Since a GA is adopted for fine boundary detection, a well-designed fitness function is needed to formulate the segmentation problem. The topic-based segmental model is employed to incorporate the topic and boundary word information in characterizing the fitness function for a global formulation of story segmentation. Therefore, the GA can work with the topic-based segmental model for fine boundary detection based on the coarse story boundaries detected in Stage 1.

Given a word sequence $W = \{w_1, w_2, \ldots, w_N\}$ comprising several stories with unknown story boundaries, let $K$ denote the number of story boundaries, $S = \{s_1, s_2, \ldots, s_K\}$ denote the story boundary position sequence, and $T = \{t_1, t_2, \ldots, t_K\}$ denote the topic index sequence corresponding to the stories. Assuming $S$ and $T$ are conditionally independent given $K$, the posterior probability of a boundary number $K$, a boundary position sequence $S$, and a topic index sequence $T$ for the given word sequence $W$ is estimated by the topic-based segmental model, based on Bayes' theorem, as

$$P(K, S, T \mid W) = \frac{P(W \mid K, S, T)\,P(S \mid K)\,P(T \mid K)\,P(K)}{P(W)} \qquad (14)$$

where the denominator $P(W)$ is constant for a given word sequence in the story segmentation procedure. In this framework, the most important components are the boundary model $P_B$ and the topic model $P_T$.

First, assuming that all stories are conditionally independent given $(K, S)$, the boundary model can be derived from the boundary word information as

$$P_B(W \mid K, S) = \prod_{k=1}^{K} P(B = 1 \mid W_k) \qquad (15)$$

where $W_k = \{w_{s_{k-1}+1}, \ldots, w_{s_k}\}$ denotes the word sequence of story $k$, and $s_0$ and $s_K$ correspond to $0$ and $N$, respectively. Additionally, this study makes a geometric mean approximation of $P(B = 1 \mid W_k)$ using the boundary models of boundaries $s_{k-1}$ and $s_k$, based on (15), as

$$P(B = 1 \mid W_k) \approx P(B \mid SW_{s_{k-1}})^{\gamma}\; P(B \mid SW_{s_k})^{1-\gamma} \qquad (16)$$

where $SW_s$ denotes the sliding window centered at position $s$, and $\gamma$ denotes a balancing factor that manages the effect of the start and end boundary models and is tuned on the development data. In (16), the preceding story boundary $s_{k-1}$ and the succeeding story boundary $s_k$ occur consecutively, and the probability $P(B = 1 \mid W_k)$ can be approximated by the two independent posterior probabilities $P(B \mid SW_{s_{k-1}})$ and $P(B \mid SW_{s_k})$ obtained from (13), using the geometric mean approximation.

Second, the topic model is defined as $P_T(W \mid K, S, T) = \prod_{k=1}^{K} P(W_k \mid t_k)$, where $P(W_k \mid t_k)$ can be formulated in the same way as in (9), under the assumption that all the stories are conditionally independent given $(K, S)$. In order to control the effect of the topic model and the boundary model, a geometric mean fusion is applied to the two models of (14) as follows:

$$P(W \mid K, S, T) \approx P_T(W \mid K, S, T)^{\delta}\; P_B(W \mid K, S)^{1-\delta} \qquad (17)$$

where the balancing factor $\delta$ is adopted to balance the effect between the topic and boundary models and is tuned on the development data. In (17), the geometric mean approximation is used to fuse the contributions of the topic model and the boundary model in order to deal with the different characteristics of different data/corpora.

In addition, the conditional probability $P(S \mid K)$ in (14) denotes the prior probability of the story boundary position sequence $S$ associated with $K$ story boundaries. Because the word sequence can be considered as a sequence of binary random variables in $\{0, 1\}$ indicating where story boundaries occur, the probability of the boundary position sequence in the word sequence could be characterized by a Poisson-family distribution. In this study, in order to reduce the number of model parameters in the topic-based segmental model, the probability $P(S \mid K)$ is simplified and assumed to follow a binomial distribution:

$$P(S \mid K) = p^{K}\,(1 - p)^{N - K} \qquad (18)$$

where $p$ denotes the probability that a word is observed as a story boundary. The probability $p$ is assumed to be uniform and is set to 0.5 under the condition of no prior knowledge. Moreover, the prior probability $P(K)$ denotes the probability of $K$ story boundaries in the data and is regarded as a constant in the absence of prior knowledge. The reason for these distribution settings of the prior models is to lower the effect of the priors and focus on the topic and boundary models. Similarly, the prior probability $P(T \mid K)$ is set to $(1/N_T)^K$, where $N_T$ denotes the number of topics defined in the topic classifier.


Finally, the posterior probability is expressed as follows:

$$P(K, S, T \mid W) \propto \left[\prod_{k=1}^{K} P(W_k \mid t_k)\right]^{\delta} \left[\prod_{k=1}^{K} P(B = 1 \mid W_k)\right]^{1-\delta} p^{K}(1-p)^{N-K} \left(\frac{1}{N_T}\right)^{K} \qquad (19)$$

In the segmentation paradigm, Stage 1 obtains a tentative number of story boundaries $\tilde{K}$ and a tentative story boundary position sequence $\tilde{S}$. This significantly reduces the search complexity of the GA. The GA and the topic-based segmental model are then adopted to search for a precise story boundary number $\hat{K}$ and boundary position sequence $\hat{S}$ that maximize $P(K, S, T \mid W)$.
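The fitness evaluation implied by (19) can be sketched in the log domain as follows. This is our reconstruction, not the authors' code: it assumes callables for the Stage-1 boundary posterior and the topic likelihood of (9), follows the convention that the final boundary position equals the document length, and uses illustrative default weights.

```python
import math

def fitness(boundaries, words, p_boundary, topic_lik,
            delta=0.6, gamma=0.6, p=0.5, n_topics=110):
    """Log-domain fitness of one candidate segmentation, following (19).

    boundaries : sorted boundary word indices s_1..s_K, with s_K == len(words).
    p_boundary : callable s -> P(B=1 | window centered at s), from Stage 1.
    topic_lik  : callable word-span -> max_t P(W_k | t), as in (9).
    """
    N, K = len(words), len(boundaries)
    log_f, prev = 0.0, 0
    for s in boundaries:
        story = words[prev:s]
        log_f += delta * math.log(max(topic_lik(story), 1e-300))    # topic model
        log_b = gamma * math.log(max(p_boundary(prev), 1e-300)) \
            + (1 - gamma) * math.log(max(p_boundary(s), 1e-300))    # (16)
        log_f += (1 - delta) * log_b                                # boundary model
        prev = s
    log_f += K * math.log(p) + (N - K) * math.log(1 - p)            # prior (18)
    log_f += K * math.log(1.0 / n_topics)                           # topic prior
    return log_f
```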

2) Searching Precise Boundary Positions Using the Genetic Algorithm: All tentative topic boundary positions $\tilde{s}_k$ formed in Stage 1 are relaxed forward and backward to form search regions $R_k$, $k = 1, \ldots, K$, each centered on $\tilde{s}_k$ with a length of $W$ words. A precise boundary in each search region is determined over the whole document by the genetic algorithm and the topic-based segmental model. In general, a GA adopts a sequence of numbers, generally called a chromosome or individual, to denote a solution set [18]. For problem mapping, this study transforms the local boundary position sequence within the search regions $R_1, \ldots, R_K$ into a number sequence (chromosome) $C = (c_1, c_2, \ldots, c_K)$, where $c_k \in \{1, \ldots, W\} \cup \{*\}$ and $*$ denotes a lack of story boundaries inside the search region. Overall, each number in a chromosome denotes the boundary position inside its search region. For example, Fig. 4 shows an input word sequence of length $N$ comprising several unknown story boundaries, which is pre-segmented to generate three potential boundaries. A search region with a length of ten words is then expanded around each potential boundary. Therefore, a story boundary solution can be represented as a chromosome. In this figure, the three-digit number 379 represents that the input word sequence has three fine story boundaries, which occur at the third word of the first search region, the seventh word of the second search region, and the ninth word of the third search region, respectively.
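A small sketch of this chromosome-to-position mapping, mirroring the Fig. 4 example (offsets are 1-based within each ten-word region; the names and the None convention for "no boundary" are ours):

```python
def decode(chromosome, region_centers, width=10):
    """Map chromosome digits to global word positions (Fig. 4 mapping).

    chromosome     : per-region offsets, e.g. (3, 7, 9); None means no boundary.
    region_centers : Stage-1 tentative boundary positions (word indices).
    """
    half = width // 2
    positions = []
    for offset, center in zip(chromosome, region_centers):
        if offset is not None:                        # skip regions with no boundary
            positions.append(center - half + offset - 1)  # 1-based offset in region
    return sorted(positions)

# e.g. decode((3, 7, 9), [120, 355, 600]) -> [117, 356, 603]
```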

In the GA, the design of the fitness function plays the most important role in discovering the solution. Based on the topic-based segmental model, the fitness function is defined as follows:

$$\text{Fitness}(C_i) = P(K_i, S_{C_i}, T \mid W) \qquad (20)$$

where $C_i$ denotes the $i$th chromosome (solution) within the search regions, $S_{C_i}$ denotes the boundary position sequence encoded by $C_i$, and $K_i$ denotes the corresponding number of boundaries; each chromosome represents one case of a boundary position sequence bounded by the search regions.

In the evolution process of the GA, chromosomes with better fitness denote better solutions and survive with higher probability from generation to generation. The fine story boundaries, with $\hat{K}$ boundaries and position sequence $\hat{S}$, are obtained after the GA operations (crossover and mutation) and the evaluation procedure shown in Fig. 2 are performed and the termination condition is satisfied (the fitness converges). Restated,

$$(\hat{K}, \hat{S}) = \arg\max_{C_i} \text{Fitness}(C_i) \qquad (21)$$

VI. EXPERIMENTAL RESULTS

This section first presents the experimental materials and evaluation metrics, followed by the performance of the LSA-based Naïve Bayes classifier. A parameter tuning procedure for the GA-based story segmenter is then described. Finally, the segmentation performance, a $t$-test, and an execution time and training time comparison between the proposed approach and the approaches of Makhoul [2], Franz [24], and the HMM-based method [3] are demonstrated on a test set using the tuned parameters.

    A. Training, Development, and Test Set

A news text corpus was collected for topic classifier training. The news text corpus was gathered daily from the United Daily News (UDN) website [21] over a period of four months (March–June 2002). The UDN Corporation is one of the biggest publishers of print media in Taiwan, and the news catalog on the UDN News website is almost complete. This corpus comprises 46,965 stories, or 23 million words, with 14 main topics (e.g., Domestic, World, Politics, Stock, Business, IT, Health, Science, Travel, Entertainment) and about 110 subtopics (each main topic has on average eight subtopics). This study adopted the 110 subtopics as the topic classification catalog.

Two audio corpora were adopted for story segmentation evaluation: the BCC (Broadcasting Corporation of China) corpus and the TDT (Topic Detection and Tracking) Mandarin corpus. Because contiguous stories in the TDT corpus may belong to the same topic, the contribution of the topic model to the story segmenter may decrease. The BCC corpus was utilized to perform an individual experiment to evaluate the effect of the topic model on the story segmenter, since in that corpus contiguous stories belong to different topics. The BCC audio corpus was collected from the Broadcasting Corporation of China News radio channel in the period corresponding to the UDN corpora. The audio data were recorded and digitized at a sampling rate of 16 kHz with 16-bit resolution. Twelve recordings (about 1 h of speech material) were manually transcribed and segmented into stories for tuning (six recordings) and testing (six recordings). Each recording includes five to seven short news stories produced by an anchor speaker, and each story contains about 10–30 sentences.

    The TDT audio corpus was the Mandarin audio corpus fromthe Linguistic Data Consortium (LDC), named Topic Detection


and Tracking Phase 2 and Phase 3 (TDT-2 and TDT-3). The TDT-2 and TDT-3 Mandarin audio corpora were recorded from Voice of America (VOA) news broadcasts. The TDT-2 VOA Mandarin audio data were collected during a five-month period (February–June 1998) and comprised 2265 stories with 70,172 words.

The TDT-3 VOA Mandarin audio data were collected during a three-month period (October–December 1998) and comprised 3371 stories with 331,494 words. The TDT-2 corpus comprised 100 topics defined by the LDC, with only 564 stories corresponding to those topics. Similarly, TDT-3 contained 100 topics, different from those in TDT-2, and only 1542 stories corresponded to the topics. Word transcriptions from the Dragon Mandarin speech recognizer were adopted in this experiment. Because the transcription result of the TDT corpus does not include syllable lattice data, the syllable-based approach was not applied to it. The TDT-2 corpus (564 stories) and one-third of the TDT-3 corpus (542 stories) were adopted as the training set for topic classification. One-third of the TDT-3 corpus (500 stories) was adopted as the development set for story segmentation, and the remaining part of the TDT-3 corpus (500 stories) was adopted as the test set for story segmentation. Additionally, the $t$-test was performed for significance testing by randomly dividing the test set into ten subsets.

In practice, since broadcast news audio data include many foreground and background events, the audio data first have to be segmented and classified into different segment types (e.g., anchor, report, commercial, and noise) based on acoustic features, using an audio segmentation and classification system [30]. This study adopted only the anchor segments as the potential story stream, because the word error rate in the transcription of reporter speech is too high. However, an anchor segment may include more than one story, and must be further segmented and identified by the story segmentation and topic classification approaches based on the context of the segment. Only the BCC corpora were processed using the audio segmentation and classification system. The TDT corpora have fewer speaker or background noise changes than the BCC corpora, and the Dragon system already handled the non-story audio segments, such as the music at the beginning of a show; hence the TDT corpora were not processed by the audio segmentation and classification system, and the standard word transcription was adopted for evaluation.

    B. Performance Evaluation of Story Segmentation

In this experiment, story segmentation accuracy was evaluated by the standard TDT metric with a 15-s tolerance window [23], [31]. The miss probability $P_{miss}$ and false alarm probability $P_{fa}$ were considered. Moreover, two joint costs, $C_{seg}$ and $(C_{seg})_{norm}$, were also adopted for evaluation:

$$C_{seg} = C_{miss}\,P_{miss}\,P_{target} + C_{fa}\,P_{fa}\,(1 - P_{target}) \qquad (22)$$

$$(C_{seg})_{norm} = \frac{C_{seg}}{\min\{C_{miss}\,P_{target},\; C_{fa}\,(1 - P_{target})\}} \qquad (23)$$

where $C_{miss}$ and $C_{fa}$ denote the costs of a miss and of a false alarm, respectively, and $P_{target}$ denotes the a priori target probability. $(C_{seg})_{norm}$ serves as the bottom-line representation of the performance of a story segmentation task. However, the value of $C_{seg}$ is also a function of the parameters $C_{miss}$, $C_{fa}$, and $P_{target}$, and it is normalized so that $(C_{seg})_{norm}$ can be no less than 1 without extracting information from the source data. In this experiment, the parameter settings were the same as those of Doddington [31] for the TDT-2 corpus and for the TDT-3 and BCC corpora. Furthermore, the execution and training time for each recording was also used to compare the computational complexity of each approach.
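For reference, (22) and (23) amount to the following few lines; the default cost and prior values here are placeholders, not the settings used in the paper:

```python
def tdt_seg_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.3):
    """Joint segmentation cost of (22) and its normalized form (23).
    Default cost/prior values are illustrative placeholders."""
    c_seg = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    c_norm = c_seg / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_seg, c_norm

# e.g., for the word-based BCC result: tdt_seg_cost(0.1587, 0.0859)
```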

    C. Audio News Transcription Using Speech Recognizer

For the BCC audio broadcast news, an HMM-based speaker-independent large-vocabulary continuous speech recognizer with a two-pass search strategy was adopted [22], [29] for broadcast news content transcription. The recognition models were trained on spoken documents from Voice of Taipei (VOT), UFO station (UFO), Police Radio Station (PRS), and Voice of Han (VOH) from December 1998 to July 1999 [25]. The 5465 spoken documents (about 14 h of speech data) were divided into two parts, 4/5 for training and the rest for testing. The error rates at the syllable and character levels were 22% and 25%, respectively. For the TDT-2/TDT-3 audio corpus, the word transcription result from the Dragon Mandarin speech recognizer was adopted in the experiment. For the TDT-2 corpus, the Dragon speech recognizer achieved error rates of 35.38% at the word level, 17.69% at the character level, and 13.00% at the syllable level; for the TDT-3 corpus, the error rates were 36.97% at the word level, 19.78% at the character level, and 15.06% at the syllable level.

Two approaches, syllable-based and word-based, were adopted to construct the content information from the speech recognizer for story segmentation. The aim of the syllable-based approach is to employ more syllable features for story segmentation, overcoming the high word error rate of speech recognition. The syllable segments range from 1-syllable to 4-syllable segments. For instance, the syllable sequence "cheng kung da hsueh" (National Cheng Kung University) can be converted into four syllable segments: "cheng", "cheng kung", "cheng kung da", and "cheng kung da hsueh". The top-five syllable sequences with the highest recognition scores were selected to generate the potential syllable segments for topic classification. The syllable-based method converts the speech signal into base syllable segments from the top-five syllable sequences. Similarly, a word-based approach was adopted to convert the syllable sequence from the speech recognizer into a word sequence. A bi-gram language model was adopted to obtain the best five sentences, as well as their corresponding potential word segments, in order to include more possible features from the top-five sentences and thereby tolerate ASR errors in topic classification. However, in the TDT-2/TDT-3 corpus, the transcription from the Dragon speech recognizer comprises only the top-one word sequence. The top-one word sequence was employed for the word-based experiment, and a syllable sequence was transformed directly from the top-one word sequence to perform the syllable-based experiment.
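The 1- to 4-syllable segment generation can be illustrated with a short sketch that reproduces the "cheng kung da hsueh" example above (it generates segments anchored at the first syllable; sliding the anchor across the whole lattice generalizes this):

```python
def syllable_segments(syllables, max_n=4):
    """Generate 1- to max_n-syllable segments anchored at the sequence start,
    mirroring the "cheng kung da hsueh" example in the text."""
    return [" ".join(syllables[:n])
            for n in range(1, min(max_n, len(syllables)) + 1)]

print(syllable_segments(["cheng", "kung", "da", "hsueh"]))
# ['cheng', 'cheng kung', 'cheng kung da', 'cheng kung da hsueh']
```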


Fig. 5. Percentage of the number of days with continued news subtopics for the entire news text corpus.

D. Experiments on Topic Classification

In this study, the topic classifier plays a very important role in story segmentation. Therefore, topic classification evaluation was performed with the newswire news. In the news text corpus, 40,255 stories were adopted for training, and the remaining 6710 stories were adopted for testing.

The aim of this experiment is to show the temporal effect on the topic classifier, which was trained with static and dynamic models. The dynamic model, trained on the most recent week of collected news text, was constructed to handle the short life cycle of broadcast news, and the static model was trained on all the data. Fig. 5 shows the percentage of the number of days with continued news subtopics for the entire news text corpus over a period of four months. The figure indicates that about 90% of news events occurred within the most recent week. Nevertheless, the temporal effect was not adopted in the TDT-2/TDT-3 topic classifier, since this phenomenon may not be significant in the VOA material.

In the training phase, a topic classifier was trained with the corpus collected in the most recent week. Additionally, another text corpus was collected to train the static model with all topics in our corpus. Table I shows the comparison of the classification accuracy of the dynamic and static models. In the test phase, the topic classifier accepted the text input and output a topic based on the dynamic model. For the inside test over the 40,255 training news stories, the system achieved a topic classification accuracy of 84.89%. For the outside test over 6710 news stories, the system achieved 74.55% topic classification accuracy. By comparison, the static model achieved precision rates of 66.79% for the inside test and 58.15% for the outside test. Overall, the dynamic model outperformed the static model, and the proposed topic classifier achieved a topic classification accuracy of 74.55%. The topic classification experiment was also applied to the TDT-2/TDT-3 Mandarin audio corpus. For the inside test over the training stories, the system achieved topic classification accuracies of 88.82% and 80.3% for TDT-2 and TDT-3, respectively.

TABLE I
TOPIC CLASSIFICATION ACCURACY BY DYNAMIC MODEL AND STATIC MODEL

    E. Parameter Tuning on Story Segmentation

To obtain the best parameter setting for performance evaluation, a tuning procedure was conducted on the development data set. Two parameter sets, for the pre-segmenter and the GA-based story segmenter, were handled as follows.

1) Parameter Tuning of the DT-ME-Based Pre-Segmenter: The fusion factor $\alpha$ and the threshold $\varepsilon$ of the hybrid pre-segmenter using the decision tree and maximum entropy were tuned first. The tuning range of $\alpha$ was from 0 to 1 with a step size of 0.25, and the tuning range of $\varepsilon$ was from 0 to 1 with a step size of 0.1. Since the aim of the pre-segmenter is to minimize the miss probability first, rather than the false alarm probability, the fusion factor $\alpha$ and threshold $\varepsilon$ were set to 0.5 and 0.1, respectively, yielding a miss probability below 0.01 and a false alarm probability of about 0.6 for both the BCC and TDT corpora.

2) Parameter Tuning of the Window Size: The window size was set to half the average story length in previous tuning. In this subsection, the window size was tuned over a range of 1/16, 1/8, 1/4, 1/2, and 1 times the average story length, based on the optimized $\alpha$ and $\varepsilon$. The tuning result indicates that the best performance was obtained with a window size of half the average story length. Conversely, the performance degraded when the window size was set to the full average story length, because the miss probability rose significantly: a large window size misses many story boundaries.

3) Parameter Tuning of the GA-Based Story Segmenter: Since the GA-based segmenter has many parameters, the tuning procedure was split into two subprocedures. Because the parameters in the GA mainly affect the convergence speed when the population and generation number are large, the first subprocedure fixed the GA parameters (population size $N_{pop}$, generation number $N_{gen}$, crossover probability $P_c$, and mutation probability $P_m$) empirically. The tuning ranges of the balancing factors $\gamma$ and $\delta$ were then set from 0 to 1 with a step size of 0.2. Figs. 6 and 7 show the tuning results. For the BCC corpus, the best settings of the balancing factors $\gamma$ and $\delta$ were both found to be 0.6. For the TDT corpus, the best settings of $\gamma$ and $\delta$ were 0.2 and 0.6, respectively.

In the second subprocedure, the balancing factors $\gamma$ and $\delta$ were fixed at their best settings, and the parameters of the GA were tuned on the development data set. The tuning ranges of the GA parameters were set as follows.

1) Population size $N_{pop}$: from 20 to 100 with a step size of 20.

2) Generation number $N_{gen}$: from 100 to 1000 with a step size of 100.

3) Crossover and mutation probabilities $P_c$, $P_m$: from 0 to 1 with a step size of 0.25.


Fig. 6. Tuning results of the GA-based segmenter on the BCC corpus, obtained by varying the balancing factors $\gamma$ and $\delta$.

Fig. 7. Tuning results of the GA-based segmenter on the TDT corpus, obtained by varying the balancing factors $\gamma$ and $\delta$.

Fig. 8 shows the tuning results on the TDT corpus, from which the parameter settings achieving the best fitness curve and convergence speed were selected. Moreover, the tuning results on the BCC corpus were similar to those on the TDT corpus, and the best parameter settings for the BCC corpus were chosen in the same way. The tuning result shows that the GA does not need a large population size or number of generations, because the pre-segmenter reduces the search complexity effectively.

    F. Experiments on Story Segmentation

All of the methods mentioned in the experiment setup were evaluated on the test sets from the BCC and TDT corpora under their respective best-tuned parameter settings. The execution time evaluation was performed on a Pentium 4 1.5-GHz PC, with time measured in minutes.

1) Experiments on the News Text Corpus: To demonstrate the upper bound of the GA-based story segmenter's performance, the text corpus was tested for story segmentation. News text streams comprising several news stories in concatenation are generally not easy to collect. Therefore, news stories were concatenated randomly to form long texts as the testing documents. The test set comprised 3500 long texts, each obtained by concatenating five news stories randomly chosen from the UDN news text. The evaluation results on this test corpus of 3500 long texts, using the same tuned parameters as for the BCC corpus, indicate a miss probability of 0.1197 and a false alarm probability of 0.1228. The result on the simulated data may be treated as a reference upper bound, since the data contain no speech recognition errors.
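A sketch of this test-set construction, assuming the stories are available as lists of words (the names and the seed are ours):

```python
import random

def make_test_documents(stories, n_docs=3500, per_doc=5, seed=0):
    """Simulate segmentation test data: concatenate randomly chosen stories
    and record the true boundary positions (in words)."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        chosen = rng.sample(stories, per_doc)        # stories = list of word lists
        words, boundaries, pos = [], [], 0
        for story in chosen:
            words.extend(story)
            pos += len(story)
            boundaries.append(pos)                   # boundary after each story
        docs.append((words, boundaries[:-1]))        # last position is document end
    return docs
```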

2) Experiments on the BCC Mandarin Broadcast News Corpus: The BCC broadcast news corpus was assessed with three methods: syllable-based story segmentation, word-based story segmentation, and Makhoul's method [2]. Table II shows the experimental results on the BCC broadcast news. A miss probability of 0.1587 and a false alarm probability of 0.0859 were observed for word-based story segmentation. Additionally, a miss probability of 0.1495 and a false alarm probability of 0.0871 were achieved for syllable-based story segmentation. For comparison, Makhoul's method produced a miss probability of 0.1671 and a false alarm probability of 0.1425. The proposed approach outperforms Makhoul's method, since the proposed approach can take care of the story boundaries of the entire program, rather than only two contiguous stories, using the GA. The use of boundary word information and the GA can refine the story boundaries globally, so the reduction in the false alarm probability is significant compared to Makhoul's approach.

In summary, the syllable-based approach performed slightly better than the word-based approach. Because the syllable-based approach considers the 1- to 4-syllable segments from the top-five syllable lattices, the segmenter can capture more correct syllable information than the word-based approach under a high character error rate.

Second, although Makhoul's method achieved a slightly lower miss probability than the others, it had a higher overall error probability than both the word- and syllable-based approaches. The $t$ value (5.98) exceeded the tabulated value of 3.36 ($\alpha = 0.01$, 8 degrees of freedom). This finding indicates that the difference between the proposed approach and Makhoul's method is highly significant.

3) Experiments on the TDT-3 Mandarin Audio Corpus: In this section, only the word-based approach was evaluated on the TDT-3 Mandarin audio corpus, because the transcription units of the Dragon speech recognizer are words, not syllables. Table III shows the experimental results on the TDT-3 Mandarin corpus. The proposed approach achieved a miss probability of 0.1232 and a false alarm probability of 0.1298, and thus outperformed Franz's approach and the HMM-based approach [3], which involves a classical language model and a topic transition model for story segmentation. In particular, the reduction in the false alarm probability is significant because the GA


Fig. 8. Tuning results of the GA-based segmenter obtained by varying the population size $N_{pop}$, the generation number $N_{gen}$, and the crossover and mutation probabilities $P_c$ and $P_m$.

TABLE II
EXPERIMENTAL RESULTS ON THE BCC BROADCAST NEWS CORPUS (WITH EXC_T GIVEN IN MINUTES AND TRA_T GIVEN IN HOURS)

TABLE III
EXPERIMENTAL RESULTS ON THE TDT-3 CORPUS (WITH EXC_T GIVEN IN MINUTES AND TRA_T GIVEN IN HOURS)

with the topic-based segmental model can take the whole word stream into consideration globally. Nevertheless, the reduction in the miss probability is small, since the contribution of the topic model is small; the boundary model has a higher weighting in the topic-based segmental model. The contribution of the topic model on the TDT corpus was found to be less than that on the BCC corpus. We believe that the broadcast news program flow of the VOA Mandarin corpus differs from that of the BCC corpus: VOA broadcast news generally reports on a special topic over a whole one-hour program. Therefore, many contiguous stories have the same topic, and the boundary model plays the more significant role. Additionally, a miss probability of 0.1215 and a false alarm probability of 0.1307 were achieved for syllable-based story segmentation. Although the syllable-based approach mitigates the high word error rate for story segmentation, its feature space is huge. In contrast to its effect in speech recognition, the performance improvement of the syllable-based approach for story segmentation is therefore not as large. Furthermore, the GA-based approaches, both word-based and syllable-based, need more execution time than the others, since word sequences are very long in a one-hour show.

VII. CONCLUSION

This study presented a two-stage approach to story segmentation and topic classification of Mandarin broadcast news using a topic-based segmental model and a genetic algorithm. In this approach, a speech recognizer is first adopted to generate the syllable- and word-based transcriptions from the audio broadcast news. A decision tree and maximum entropy model are then adopted to identify the potential story boundaries locally. The genetic algorithm is then employed to obtain the final story


boundaries globally. Moreover, a topic-based segmental model is defined as the fitness function of the genetic algorithm. Experimental results indicate that the GA-based approach outperforms the traditional methods for both the syllable- and word-based segmentation approaches.

    REFERENCES

[1] A. G. Hauptmann and M. J. Witbrock, "Story segmentation and detection of commercials in broadcast news video," in Proc. ADL-98 Advances in Digital Libraries Conf., Santa Barbara, CA, Apr. 1998, pp. 168–179.
[2] J. Makhoul, F. Kubala, T. Leek, L. Daben, N. Long, R. Schwartz, and A. Srivastava, "Speech and language technologies for audio indexing and retrieval," Proc. IEEE, vol. 88, no. 8, pp. 1338–1353, Aug. 2000.
[3] P. Van Mulbregt, I. Carp, L. Gillick, S. Lowe, and J. Yamron, "Text segmentation and topic tracking on broadcast news via a hidden Markov model approach," in Proc. ICSLP'98, Sydney, Australia, 1998, vol. I, pp. 333–336.
[4] J. M. Ponte and W. B. Croft, "Text segmentation by topic," in Proc. 1st Eur. Conf. Research and Advanced Technology for Digital Libraries (ECDL'97), Pisa, Italy, Sep. 1997, pp. 120–129.
[5] J. Lafferty, D. Beeferman, and A. Berger, "Statistical models for text segmentation," Mach. Learn., Special Issue on Natural Language Learning, vol. 34, no. 1-3, pp. 177–210, 1999.
[6] K. L. Wong, W. Lam, and J. Yen, "Interactive Chinese news event detection and tracking," in Proc. 2nd Asia Digital Library Conf., Taipei, Taiwan, Nov. 1999, pp. 30–43.
[7] J. S. Justeson and S. M. Katz, "Technical terminology: Some linguistic properties and an algorithm for identification in text," Natural Lang. Eng., vol. 1, no. 1, pp. 9–27, 1995.
[8] M. Hearst, "TextTiling: Segmenting text into multi-paragraph subtopic passages," Comput. Linguist., vol. 23, no. 1, pp. 33–64, 1997.
[9] N. Stokes, J. Carthy, and A. F. Smeaton, "A lexical cohesion based news story segmentation system," J. AI Commun., vol. 17, no. 1, pp. 3–12, Mar. 2004.
[10] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1996.
[11] D. D. Lewis and M. Ringuette, "A comparison of two learning algorithms for text categorization," in Proc. 3rd Annu. Symp. Document Analysis and Information Retrieval (SDAIR'94), Las Vegas, NV, Apr. 1994, pp. 81–93.
[12] D. D. Lewis, "Representation and learning in information retrieval," Ph.D. dissertation, Dept. of Comput. and Inf. Sci., Univ. of Massachusetts, Amherst, 1992.
[13] Y. H. Li and A. K. Jain, "Classification of text documents," Comput. J., vol. 41, no. 8, pp. 537–546, 1998.
[14] Y. Yang, "Expert network: Effective and efficient learning from human decisions in text categorization and retrieval," in Proc. SIGIR-94, 17th ACM Int. Conf. Research and Development in Information Retrieval, Dublin, Ireland, Jul. 1994, pp. 13–22.
[15] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proc. ECML-98, 10th Eur. Conf. Machine Learning, Chemnitz, Germany, Apr. 1998, pp. 137–142.
[16] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, "Indexing by latent semantic analysis," J. Amer. Soc. Inf. Sci., vol. 41, no. 6, pp. 391–407, 1990.
[17] J. R. Bellegarda, "Exploiting latent semantic information in statistical language modeling," Proc. IEEE, vol. 88, no. 8, pp. 1279–1296, Aug. 2000.
[18] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed. New York: Springer-Verlag, 1996.
[19] C. W. Ahn and R. S. Ramakrishna, "A genetic algorithm for shortest path routing problem and the sizing of populations," IEEE Trans. Evol. Comput., vol. 6, no. 6, pp. 566–579, Dec. 2002.
[20] E. Y. Kim, S. W. Hwang, S. H. Park, and H. J. Kim, "Spatiotemporal segmentation using genetic algorithms," Pattern Recognition, vol. 34, pp. 2063–2066, 2001.
[21] United Daily News Website. [Online]. Available: http://www.udn.com
[22] C. H. Wu and Y. J. Chen, "Multi-keyword spotting of telephone speech using a fuzzy search algorithm and keyword-driven two-level CBSM," Speech Commun., vol. 33, pp. 197–212, 2001.
[23] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang, "Topic detection and tracking pilot study: Final report," in Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, Feb. 1998, pp. 194–218.
[24] M. Franz, J. S. McCarley, S. Roukos, T. Ward, and W.-J. Zhu, "Segmentation and detection at IBM: Hybrid statistical models and two-tiered clustering," in Proc. TDT-3 Workshop, Feb. 2000, pp. 135–148.
[25] B. L. Chen, H. M. Wang, and L. S. Lee, "Retrieval of Mandarin broadcast news using spoken queries," in Proc. ICSLP'00, Oct. 2000, pp. 520–523.
[26] M. Utiyama and H. Isahara, "A statistical model for domain-independent text segmentation," in Proc. 39th Annu. Meeting Assoc. Comput. Linguist. (ACL'01), Toulouse, France, Jul. 2001, pp. 499–506.
[27] F. L. Chung, T. C. Fu, V. Ng, and R. W. P. Luk, "An evolutionary approach to pattern-based time series segmentation," IEEE Trans. Evol. Comput., vol. 8, no. 5, pp. 471–489, Oct. 2004.
[28] S. Salcedo-Sanz, A. Gallardo-Antolin, J. M. Leiva-Murillo, and C. Bousono-Calzon, "Offline speaker segmentation using genetic algorithms and mutual information," IEEE Trans. Evol. Comput., vol. 10, no. 2, pp. 175–186, Apr. 2006.
[29] C. H. Wu and G. L. Yan, "Speech act modeling and verification of spontaneous speech with disfluency in a spoken dialogue system," IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 330–344, May 2005.
[30] C. H. Wu and C. H. Hsieh, "Multiple change-point audio segmentation and classification using an MDL-based Gaussian model," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 647–657, Mar. 2006.
[31] G. Doddington, "The 1999 topic detection and tracking (TDT3) task definition and evaluation plan," 1999. [Online]. Available: http://www.nist.gov/speech/tdt3/doc/tdt3.eval.plan.99.v2.7.ps
[32] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tür, "Prosody-based automatic segmentation of speech into sentences and topics," Speech Commun., vol. 32, no. 1-2, pp. 127–154, 2000.
[33] G. Tür, A. Stolcke, D. Hakkani-Tür, and E. Shriberg, "Integrating prosodic and lexical cues for automatic topic segmentation," Comput. Linguist., vol. 27, no. 1, pp. 31–57, 2001.

Chung-Hsien Wu (SM'03) received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1981, and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1987 and 1991, respectively.

Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan. He became a Professor and Distinguished Professor in August 1997 and August 2004, respectively. From 1999 to 2002, he served as the Chairman of the Department. He also worked at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Cambridge, in summer 2003 as a Visiting Scientist. He was the Editor-in-Chief of the International Journal of Computational Linguistics and Chinese Language Processing from 2004 to 2008. He served as a Guest Editor of the ACM Transactions on Asian Language Information Processing and of the EURASIP Journal on Audio, Speech, and Music Processing in 2008–2009. He is currently on the Editorial Advisory Board of The Open Artificial Intelligence Journal and is the Subject Editor on Information Engineering of the Journal of the Chinese Institute of Engineers. His research interests include speech recognition, text-to-speech, and spoken language processing.

Dr. Wu is a member of the International Speech Communication Association (ISCA), the Association for Computational Linguistics (ACL), and the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). He serves as a Guest Editor for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING.

Chia-Hsin Hsieh received the B.S. degree in engineering science from National Cheng Kung University, Tainan, Taiwan, in 2000, and the Ph.D. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2007.

Currently, he is a Senior Engineer at the Institute for Information Industry, Kaohsiung, Taiwan. His research interests focus on audio-video signal processing, segmentation and content classification, robust noisy speech recognition, spoken document structural analysis, summarization and retrieval, and story segmentation and topic detection.