Automatic Key Term Extraction from Spoken Course Lectures Using Branching Entropy and Prosodic/Semantic Features
Yun-Nung (Vivian) Chen, Yu Huang, Sheng-Yi Kong, Lin-Shan Lee
National Taiwan University, Taiwan

TRANSCRIPT

Slide 1: Automatic Key Term Extraction from Spoken Course Lectures Using Branching Entropy and Prosodic/Semantic Features (title slide)

Hello, everybody. I am Vivian Chen, from National Taiwan University. Today I am going to present my work on automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features.

Slide 2: Introduction

Slide 3: Definition
- Key term: higher term frequency, core content
- Two types: keyword and key phrase
- Advantages: indexing and retrieval; the relations between key terms and segments of documents

First I will define what a key term is. A key term is a term that has higher term frequency and carries core content. There are two types of key terms. One of them is a phrase, which we call a key phrase; for example, "language model" is a key phrase. The other type is a single word, which we call a keyword, like "entropy". There are two advantages of key term extraction: key terms help with indexing and retrieval, and we can construct the relationships between key terms and segments of documents. Here's an example.

Slide 4: Introduction

We can show some key terms related to "acoustic model". If a key term and "acoustic model" co-occur in the same document, they are relevant, so we can show them to users.

Slide 5: Introduction
Key terms related to "acoustic model": language model, hmm, n gram, phone, hidden Markov model.

Then we can construct a key term graph to represent the relationships between these key terms, like this.

Slide 6: Introduction
Key term graph: hmm, acoustic model, language model, n gram, phone, hidden Markov model, bigram.
Target: extract key terms from course lectures.

Similarly, we can construct the relations between "language model" and other terms. Then we can show the whole graph to see the organization of the key terms.

Slide 7: Proposed Approach

Slide 8: Automatic Key Term Extraction
Flow chart: original spoken documents form an archive of spoken documents; the speech signal goes through ASR to produce ASR transcriptions; these feed branching entropy, then feature extraction, then learning methods (K-means exemplar, AdaBoost, neural network).

Here's the flow chart. We start with a large archive of spoken documents.

Slide 9: Automatic Key Term Extraction (same flow chart)

First, with a speech recognition system, we can get the ASR transcriptions.

Slide 10: Automatic Key Term Extraction (same flow chart)

The input of our work includes the transcriptions and the speech signals.

Slide 11: Automatic Key Term Extraction (Phrase Identification)
First, branching entropy is used to identify phrases.

The first part is phrase identification; we propose branching entropy for this part.

Slide 12: Automatic Key Term Extraction (Key Term Extraction)
Then learning methods extract key terms (e.g. "entropy", "acoustic model") using several features.

Then we enter the key term extraction part. We first extract features to represent each candidate term, and then use machine learning methods to decide the key terms.

Slide 13: Automatic Key Term Extraction (complete flow chart)

First we explain how to do phrase identification.

Slide 14: Branching Entropy
How to decide the boundary of a phrase? Example: "hidden Markov model", where "hidden" is almost always followed by the same word.

The target here is to decide the boundary of a phrase, but where is the boundary? We can observe some characteristics first: "hidden" is almost always followed by "Markov".

Slide 15: Branching Entropy
"hidden" is almost always followed by the same word; "hidden Markov" is almost always followed by the same word.

Similarly, "hidden Markov" is almost always followed by "model".

Slide 16: Branching Entropy
"hidden" is almost always followed by the same word, and "hidden Markov" is almost always followed by the same word, but "hidden Markov model" is followed by many different words, so the boundary is likely there. We define branching entropy to decide the possible boundary.

But at this position, we find that "hidden Markov model" is followed by many different words, so this position is more likely to be a boundary. We define branching entropy based on this idea.

Slide 17: Branching Entropy: Definition of Right Branching Entropy
Probability of a child x_i of X: p(x_i) = f(x_i) / f(X), where f is the corpus frequency.
Right branching entropy of X: H_r(X) = -Σ_i p(x_i) log p(x_i), the entropy over X's children.

First, I define the right branching entropy. Assume X is "hidden" and x_i is a child of X, e.g. "hidden Markov". We define p(x_i) as the frequency of x_i divided by the frequency of X, the probability of child x_i given X. We then define the right branching entropy H_r(X) as the entropy of X's children.

Slide 18: Branching Entropy: Decision of Right Boundary
Find the right boundary located between X and x_i where the right branching entropy becomes higher.

The idea is that the branching entropy at a boundary is higher, so we find the right boundary where the branching entropy is higher, as for "hidden Markov model".
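To make the definition concrete, here is a minimal sketch (not the authors' implementation, which uses a PAT tree) of right and left branching entropy over a toy tokenized corpus; the corpus, sentences, and function names are all invented for illustration.

```python
import math
from collections import defaultdict

def right_branching_entropy(corpus, pattern):
    """H_r(X) = -sum_i p(x_i) log p(x_i), with p(x_i) = f(x_i) / f(X),
    where the x_i are the words observed right after pattern X."""
    n = len(pattern)
    successors, freq_x = defaultdict(int), 0
    for sent in corpus:                          # each sentence is a list of words
        for i in range(len(sent) - n):
            if sent[i:i + n] == list(pattern):
                freq_x += 1
                successors[sent[i + n]] += 1
    if freq_x == 0:
        return 0.0
    return -sum(c / freq_x * math.log(c / freq_x) for c in successors.values())

def left_branching_entropy(corpus, pattern):
    """Same computation on the reversed corpus with the reversed pattern."""
    rev = [list(reversed(s)) for s in corpus]
    return right_branching_entropy(rev, tuple(reversed(pattern)))

corpus = [s.split() for s in [
    "the hidden markov model is a statistical model",
    "a hidden markov model can represent speech",
    "we train the hidden markov chain today",
]]
print(right_branching_entropy(corpus, ("hidden", "markov")))           # low: almost always 'model'
print(right_branching_entropy(corpus, ("hidden", "markov", "model")))  # higher: varied successors
print(left_branching_entropy(corpus, ("hidden", "markov", "model")))   # varied words before the phrase too
```

On this toy corpus the entropy rises once the full phrase "hidden markov model" is reached, which is exactly the signal used to place the right boundary.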

Slide 19: Branching Entropy (the same example)

Similarly, left branching entropy has the same property.

Slide 20: Branching Entropy

The branching entropy approach is to find the boundaries of key phrases, and the idea is that the entropy at a boundary is higher. There are four parts in this approach, and we present each part in detail.

Slide 21: Branching Entropy

Slide 22: Branching Entropy: Decision of Left Boundary
Find the left boundary located between X bar and x_i bar where the left branching entropy becomes higher; X bar is the reversed pattern, e.g. "model Markov hidden". A PAT tree is used for the implementation.

We also decide the left boundary by H_l(X bar), where X bar is the reversed pattern, like "model Markov hidden". Branching entropy can be implemented with a PAT tree.

Slide 23: Branching Entropy: Implementation in the PAT Tree
For each node X of the PAT tree we have the probability of each child x_i and the right branching entropy H_r(X). (Figure: a PAT tree rooted at "hidden" containing nodes such as "model", "chain", "state", "distribution", and "variable", with X = "hidden Markov", x_1 = "hidden Markov model", and x_2 = "hidden Markov chain".)

In the PAT tree, we can compute the right branching entropy for each node. To explain p(x_i) with an example: X is the node representing the phrase "hidden Markov", x_1 is X's child "hidden Markov model", and x_2 is X's other child "hidden Markov chain". The p(x_i) described earlier is computed at this node, and from it the right branching entropy of the node. We compute H_r(X) for every X in the PAT tree and H_l(X bar) for every X bar in the reversed PAT tree.

Slide 24: Automatic Key Term Extraction (Feature Extraction)
Extract prosodic, lexical, and semantic features for each candidate term.

With all candidate terms, including words and phrases, we can extract prosodic, lexical, and semantic features for each term.

Slide 25: Feature Extraction: Prosodic Features
For each candidate term, at its first occurrence:
Duration (I-IV): normalized duration (max, min, mean, range). A speaker tends to use a longer duration to emphasize key terms; four values are used for the duration of the term, and the duration of each phone is normalized by the average duration of that phone.

For each candidate term we compute prosodic features. First, we believe that a lecturer uses a longer duration to emphasize key terms. For the first occurrence of a candidate term, we compute the duration of each phone in the term and normalize it by the average duration of that phone. Only four values represent the term: the maximum, minimum, mean, and range over all phones in the term.
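As a rough illustration of the bookkeeping (not the authors' pipeline), the sketch below turns one term's phone durations into the four Duration features; the later pitch and energy features use the same four statistics over frame-level F0 and energy values. The phone labels, durations, and averages are made up.

```python
def four_stats(values):
    """Duration/Pitch/Energy (I-IV): max, min, mean, and range over a term."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": sum(values) / len(values),
        "range": max(values) - min(values),
    }

def normalized_durations(phone_durations, avg_phone_duration):
    """Normalize each phone's duration by the average duration of that phone."""
    return [d / avg_phone_duration[p] for p, d in phone_durations]

# Hypothetical first occurrence of the term "acoustic model": (phone, duration in seconds)
phones = [("ax", 0.06), ("k", 0.05), ("uw", 0.11), ("s", 0.09),
          ("t", 0.04), ("ih", 0.07), ("k", 0.06), ("m", 0.05),
          ("aa", 0.12), ("d", 0.05), ("ax", 0.08), ("l", 0.07)]
avg = {"ax": 0.07, "k": 0.05, "uw": 0.09, "s": 0.08, "t": 0.05,
       "ih": 0.06, "m": 0.06, "aa": 0.10, "d": 0.05, "l": 0.06}
print(four_stats(normalized_durations(phones, avg)))
```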

Slide 26: Feature Extraction: Prosodic Features
For each candidate term, at its first occurrence: higher pitch may carry significant information. (Feature table so far: Duration (I-IV): normalized duration (max, min, mean, range).)

We believe that higher pitch may represent important information, so we extract the pitch contour of the term.

Slide 27: Feature Extraction: Prosodic Features
Pitch (I-IV): F0 (max, min, mean, range).

The method is the same as for duration, but the segment unit becomes a single frame, and the same four values represent the features.

Slide 28: Feature Extraction: Prosodic Features
Higher energy emphasizes important information.

Similarly, we think that higher energy may indicate important information, so we extract the energy of each frame in a candidate term.

Slide 29: Feature Extraction: Prosodic Features
Complete prosodic feature table:
Duration (I-IV): normalized duration (max, min, mean, range)
Pitch (I-IV): F0 (max, min, mean, range)
Energy (I-IV): energy (max, min, mean, range)

The energy features are computed like the pitch features. This completes the first set of features.

Slide 30: Feature Extraction: Lexical Features

TF: term frequency
IDF: inverse document frequency
TFIDF: tf * idf
PoS: the PoS tag
Some well-known lexical features are used for each candidate term.

The second set of features is the lexical features. These are well-known features that may indicate the importance of a term, and we simply use them to represent each candidate term.
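A minimal sketch of the lexical features using only the standard library; the documents are placeholders, and the PoS tag is assumed to come from an external tagger.

```python
import math

docs = [
    "the acoustic model and the language model are trained".split(),
    "entropy measures the uncertainty of the language model".split(),
    "the viterbi algorithm decodes with the acoustic model".split(),
]

def lexical_features(term, docs, pos_tag):
    tf = sum(doc.count(term) for doc in docs)       # term frequency over the corpus
    df = sum(1 for doc in docs if term in doc)      # document frequency
    idf = math.log(len(docs) / df) if df else 0.0   # inverse document frequency
    return {"TF": tf, "IDF": idf, "TFIDF": tf * idf, "PoS": pos_tag}

print(lexical_features("entropy", docs, pos_tag="NN"))  # PoS tag supplied by a tagger in practice
```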

Slide 31: Feature Extraction: Semantic Features
Probabilistic latent semantic analysis (PLSA): latent topic probability. Key terms tend to focus on limited topics. (D_i: documents, T_k: latent topics, t_j: terms.)

The third set of features is the semantic features. The assumption is that key terms tend to focus on limited topics. We use PLSA to compute semantic features: we can obtain the probability of each latent topic given a candidate term.

Slide 32: Feature Extraction: Semantic Features
How to use the latent topic probability? Describe the probability distribution over topics.
LTP (I-III): latent topic probability (mean, variance, standard deviation)
(Figure: a flat topic distribution for a non-key term versus a peaked one for a key term.)

Here are the topic distributions of two terms. The first term has similar probability for most topics; it may be a function word, so this candidate is likely a non-key term. The other term focuses on fewer topics, so it is more likely to be a key term. To describe the distribution, we compute its mean, variance, and standard deviation as features of the term.
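A small sketch of the LTP features, assuming P(T_k | t) for a candidate term is already available from a trained PLSA model; the two toy distributions below are invented to mimic the flat versus peaked shapes on the slide.

```python
import statistics

def ltp_features(topic_probs):
    """LTP (I-III): mean, variance, and standard deviation of P(T_k | t)."""
    return {
        "LTP_mean": statistics.mean(topic_probs),
        "LTP_var": statistics.pvariance(topic_probs),
        "LTP_std": statistics.pstdev(topic_probs),
    }

flat_term   = [0.12, 0.13, 0.12, 0.13, 0.12, 0.13, 0.12, 0.13]  # function-word-like: spread out
peaked_term = [0.70, 0.10, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02]  # key-term-like: few topics
print(ltp_features(flat_term))
print(ltp_features(peaked_term))  # larger variance/std for the peaked (key-term-like) case
```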

Slide 33: Feature Extraction: Semantic Features
Latent topic significance (LTS): the within-topic to out-of-topic ratio, computed from the within-topic frequency and the out-of-topic frequency of a term; again, key terms tend to focus on limited topics.

Similarly, we can compute the latent topic significance, which is the within-topic to out-of-topic ratio: the significance of topic T_k for term t_i is the within-topic frequency divided by the out-of-topic frequency, where n(t_i, d_j) is the number of occurrences of t_i in document d_j.
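The equation itself is not reproduced here, so the sketch below implements one common form of the within-topic to out-of-topic ratio that matches the spoken description, with n(t_i, d_j) the count of the term in document d_j and P(T_k | d_j) taken from PLSA; treat the exact formula and the toy numbers as assumptions.

```python
def latent_topic_significance(term_counts, topic_given_doc):
    """LTS_{t_i}(T_k) ~ sum_j n(t_i, d_j) P(T_k|d_j)  /  sum_j n(t_i, d_j) (1 - P(T_k|d_j)),
    i.e. within-topic frequency over out-of-topic frequency (one reading of the slide)."""
    num_topics = len(next(iter(topic_given_doc.values())))
    scores = []
    for k in range(num_topics):
        within = sum(n * topic_given_doc[d][k] for d, n in term_counts.items())
        out = sum(n * (1.0 - topic_given_doc[d][k]) for d, n in term_counts.items())
        scores.append(within / out if out else float("inf"))
    return scores

# n(t_i, d_j): how often the candidate term occurs in each document (invented counts)
term_counts = {"d1": 4, "d2": 1, "d3": 0}
# P(T_k | d_j) for three latent topics per document (invented values)
topic_given_doc = {"d1": [0.8, 0.1, 0.1], "d2": [0.2, 0.7, 0.1], "d3": [0.1, 0.1, 0.8]}
print(latent_topic_significance(term_counts, topic_given_doc))
```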

Slide 34: Feature Extraction: Semantic Features
LTS (I-III): latent topic significance (mean, variance, standard deviation), computed from the within-topic and out-of-topic frequencies.

Because the significance of key terms also concentrates on limited topics, we compute the same three statistics over the LTS values to represent the distribution.

Slide 35: Feature Extraction: Semantic Features
Latent topic entropy (LTE), again under the assumption that key terms tend to focus on limited topics.

Then we can also compute the latent topic entropy, because this feature also describes the difference between the two kinds of distributions: a non-key term tends to have higher LTE, while a key term has lower LTE.

Slide 36: Feature Extraction: Semantic Features
Complete semantic feature table:
LTP (I-III): latent topic probability (mean, variance, standard deviation)
LTS (I-III): latent topic significance (mean, variance, standard deviation)
LTE: term entropy over latent topics
(Figure: higher LTE for the non-key term, lower LTE for the key term.)
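A matching sketch of latent topic entropy over P(T_k | t), reusing the two invented distributions from the LTP example; it shows the higher LTE for the flat (non-key-term-like) case and the lower LTE for the peaked (key-term-like) case.

```python
import math

def latent_topic_entropy(topic_probs):
    """LTE(t) = -sum_k P(T_k|t) log P(T_k|t); lower values suggest a key term."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

flat_term   = [0.12, 0.13, 0.12, 0.13, 0.12, 0.13, 0.12, 0.13]
peaked_term = [0.70, 0.10, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02]
print(latent_topic_entropy(flat_term))    # close to log(8): spread over many topics
print(latent_topic_entropy(peaked_term))  # much lower: concentrated on a few topics
```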

Slide 37: Automatic Key Term Extraction (Learning Methods)
Using unsupervised and supervised approaches to extract key terms.

Finally, we use machine learning methods to extract the key terms.

Slide 38: Learning Methods: Unsupervised Learning (K-means Exemplar)
- Transform each term into a vector in LTS (latent topic significance) space
- Run K-means
- Find the centroid of each cluster to be a key term
- The candidate terms in the same cluster are related to the key term; the key term can represent the topic; the terms in the same cluster focus on a single topic

The first method is unsupervised: K-means exemplar. We first transform each term into a vector in latent topic significance space. With these vectors, we run K-means. The terms in the same cluster focus on a single topic, so we extract the centroid of each cluster as a key term, because the terms in the same group are related to this key term.
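A sketch of the unsupervised step, assuming each candidate term has already been mapped to a vector in LTS space; scikit-learn's KMeans stands in for the clustering, the vectors are random placeholders, and the term nearest each centroid is returned as the exemplar.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_exemplars(terms, lts_vectors, n_key_terms):
    """Cluster terms in LTS space and return the term nearest each centroid."""
    km = KMeans(n_clusters=n_key_terms, n_init=10, random_state=0).fit(lts_vectors)
    exemplars = []
    for center in km.cluster_centers_:
        idx = int(np.argmin(np.linalg.norm(lts_vectors - center, axis=1)))
        exemplars.append(terms[idx])
    return exemplars

rng = np.random.default_rng(0)
terms = [f"term_{i}" for i in range(50)]   # placeholder candidate terms
lts_vectors = rng.random((50, 16))         # placeholder 16-topic LTS vectors
print(kmeans_exemplars(terms, lts_vectors, n_key_terms=5))
```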

Slide 39: Learning Methods: Supervised Learning
Adaptive Boosting and Neural Network: automatically adjust the weights of the features to produce a classifier.

We also use two supervised learning methods, adaptive boosting and a neural network, which automatically adjust the weights of the features to produce a classifier.
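For the supervised side, a hedged sketch with off-the-shelf scikit-learn classifiers standing in for the paper's AdaBoost and neural network; the feature vectors and labels are random placeholders rather than the lecture data, and none of the hyperparameters are the authors'.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.random((200, 20))                 # placeholder feature vectors per candidate term
y = (rng.random(200) > 0.8).astype(int)   # placeholder labels: 1 = key term

ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(ada.predict(X[:5]), mlp.predict(X[:5]))
```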

Slide 40: Experiments & Evaluation

Slide 41: Experiments: Corpus
- NTU lecture corpus: Mandarin Chinese with embedded English words
- Single speaker, 45.2 hours
- Example: a Mandarin sentence with the embedded English terms "solution" and "viterbi algorithm" ("Our solution is the Viterbi algorithm")

Then we run some experiments to evaluate our approach. The corpus is the NTU lecture corpus, which includes Mandarin Chinese with some embedded English words, as in the example sentence meaning "our solution is the Viterbi algorithm". The lectures are from a single speaker, and the corpus totals about 45 hours.

Slide 42: Experiments: ASR Accuracy

Language       Mandarin   English   Overall
Char Acc (%)   78.15      53.44     76.26
(Figure: acoustic models trained on out-of-domain corpora and adapted with some data from the target speaker, giving a bilingual AM with model adaptation; the LM is an adaptive trigram interpolation of a background out-of-domain corpus and an in-domain corpus.)

In the ASR system, we train two acoustic models, one for Chinese and one for English, use some data from the target speaker for adaptation, and finally obtain a bilingual acoustic model. The language model is a trigram interpolation of out-of-domain corpora and an in-domain corpus. Here is the ASR accuracy.

Slide 43: Experiments: Reference Key Terms
- Annotations from 61 students who have taken the course
- If the k-th annotator labeled N_k key terms, each of them was given a score of 1/N_k, and all other terms 0
- Terms are ranked by the sum of the scores given by all annotators
- The top N terms of the list are chosen, where N is the average N_k: N = 154 key terms, consisting of 59 key phrases and 95 keywords (a small scoring sketch follows)
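A small sketch of the reference-list construction described above; the student annotations are invented, and the 1/N_k weight is an assumption consistent with ranking by summed scores and averaging list lengths.

```python
from collections import defaultdict

def reference_key_terms(annotations):
    """Each annotator's key terms get a weight of 1/N_k; terms are ranked by the
    summed weight and the top N are kept, with N the average list length."""
    scores = defaultdict(float)
    for terms in annotations:                  # one list of key terms per annotator
        for t in terms:
            scores[t] += 1.0 / len(terms)
    n = round(sum(len(t) for t in annotations) / len(annotations))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

annotations = [                                # invented student annotations
    ["hidden markov model", "entropy", "viterbi algorithm"],
    ["entropy", "language model"],
    ["hidden markov model", "entropy", "language model", "n gram"],
]
print(reference_key_terms(annotations))
```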

To evaluate the results, we need a reference key term list. The reference key terms come from the annotations of students who have taken the course. We rank all terms and take the top N as key terms, where N is the average number of key terms labeled per student, namely 154. The reference list includes 59 key phrases and 95 keywords.

Slide 44: Experiments: Evaluation
- Unsupervised learning: set the number of extracted key terms to N
- Supervised learning: 3-fold cross validation

Finally, we evaluate the results. For unsupervised learning, we set the number of extracted key terms to N; supervised learning is evaluated with 3-fold cross validation.

Slide 45: Experiments: Feature Effectiveness
Neural network, keywords only, ASR transcriptions. Chart values (F-measure), in the order shown on the slide: 20.78, 42.86, 35.63 for the single feature sets (Pr: prosodic, Lx: lexical, Sm: semantic), then 48.15 and 56.55 for the combinations (Pr+Lx, Pr+Lx+Sm).
- Each set of these features alone gives an F1 between roughly 20% and 42%
- Prosodic features and lexical features are additive
- All three sets of features are useful

In this experiment we look at feature effectiveness. The results cover keywords only, using a neural network on ASR transcriptions. Rows (a), (b), and (c) give F1 measures from about 20% to 42%. Row (d) shows that prosodic and lexical features are additive, and row (e) shows that adding semantic features further improves performance, so all three feature sets are useful.

Slide 46: Experiments: Overall Performance (manual transcriptions, F-measure)
Baseline: conventional TFIDF scores without branching entropy (with stop word removal and PoS filtering): 23.38. With branching entropy: TFIDF 51.95, K-means exemplar 55.84, AdaBoost 62.39, neural network 67.31.
- Branching entropy performs well
- K-means exemplar outperforms TFIDF
- Supervised approaches are better than unsupervised approaches

(AB: AdaBoost, NN: Neural Network.) Finally, we show the overall performance for manual and ASR transcriptions. The baseline is conventional TFIDF without the phrases extracted by branching entropy. The better performance of the other methods shows that branching entropy is very useful, which supports our assumption that a term with higher branching entropy is more likely to be part of a key term. The best results come from supervised learning with a neural network, reaching F-measures of about 67 and 62.

Slide 47: Experiments: Overall Performance (manual / ASR transcriptions)
F-measure (manual / ASR): baseline TFIDF 23.38 / 20.78; TFIDF with branching entropy 51.95 / 43.51; K-means exemplar 55.84 / 52.60; AdaBoost 62.39 / 57.68; neural network 67.31 / 62.70.
- The performance on ASR transcriptions is slightly worse than on manual transcriptions but still reasonable
- Supervised learning with a neural network gives the best results

Slide 48: Conclusion

Slide 49: Conclusion
- We propose a new approach to extract key terms
- The performance can be improved by identifying phrases with branching entropy and by using prosodic, lexical, and semantic features together
- The results are encouraging

From the above experiments, we conclude that we have proposed a new approach to extract key terms effectively. The performance can be improved by two ideas: using branching entropy to extract key phrases, and using the three sets of features together.

Slide 50: Thanks for your attention! Q & A
We thank the reviewers for their valuable comments.
NTU Virtual Instructor: http://speech.ee.ntu.edu.tw/~RA/lecture
