12.0 Spoken Document Understanding and Organization
References:
1. “Spoken Document Understanding and Organization”, IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
2. “Speech-to-text and Speech-to-speech Summarization of Spontaneous Speech”, IEEE Transactions on Speech and Audio Processing, Dec. 2004
3. “Multi-layered Summarization of Spoken Document Archives by Information Extraction and Semantic Structuring”, Interspeech 2006, Pittsburgh, USA
Multi-media Content in the Future Network Era
• Integrating All Knowledge, Information and Services Globally
• Most Attractive Form of the Network Content will be in Multi-media, which usually Includes Speech Information
• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multi-media Content, thus Becomes the Key for Indexing, Retrieval and Browsing
Future Integrated Networks
• Real-time Information
  – weather, traffic
  – flight schedule
  – stock price
  – sports scores
• Electronic Commerce
  – virtual banking
  – on-line transactions
  – on-line investments
• Knowledge Archives
  – digital libraries
  – virtual museums
• Intelligent Working Environment
  – e-mail processors
  – intelligent agents
  – teleconferencing
  – distant learning
• Private Services
  – personal notebook
  – business databases
  – home appliances
  – network entertainments
Network Content Indexing/Retrieval/Browsing in the Future Era of Wireless Multi-media
[Figure: future networks link public information and services with private and personal services; users access them through voice input/output and text, via spoken dialogue, spoken document retrieval, text-based retrieval, and text-to-speech synthesis, over both voice information and text information]
• Multi-media Content Indexed/Retrieved/Browsed Based on the Speech Information
• User Instructions in either Text or Speech Form
• Network Access is Primarily Text-based today, but almost all Roles of Texts can be Replaced by Speech in the Future
Multi-media/Spoken Document Understanding and Organization (Ⅰ)
• Written Documents are Better Structured and Easier to Browse
  – in paragraphs with titles
  – easily shown on the screen
  – easily decided at a glance if it is what the user is looking for
• Multi-media/Spoken Documents are just Video/Audio Signals
  – not easy to show on the screen
  – the user can't go through each one from the beginning to the end during browsing
  – better approaches for understanding/organization of multi-media/spoken documents become necessary
Multi-media/Spoken Document Understanding and Organization (Ⅱ)
• Key Term/Named Entity Extraction from Multi-media/Spoken Documents
  – personal names, organization names, location names, event names
  – very often keywords in the multi-media/spoken documents
  – very often out-of-vocabulary (OOV) words, difficult for recognition
• Multi-media/Spoken Document Segmentation
  – automatically segmenting a multi-media/spoken document into short paragraphs, each with a central topic
• Information Extraction for Multi-media/Spoken Documents
  – extraction of key information such as who, when, where, what and how for the information described by multi-media/spoken documents
  – very often the relationships among the key terms/named entities
• Summarization for Multi-media/Spoken Documents
  – automatically generating a summary (in text or speech form) for each short paragraph
• Title Generation for Multi-media/Spoken Documents
  – automatically generating a title (in text or speech form) for each short paragraph
  – a very concise summary indicating the topic area
• Topic Analysis and Organization for Multi-media/Spoken Documents
  – analyzing the subject topics of the short paragraphs
  – clustering and organizing the subject topics of the short paragraphs into graphic structures giving the relationships among them for easier access
Integration Relationships among the Involved Technology Areas
[Figure: the involved technology areas build on one another – key term/named entity extraction from spoken documents feeds semantic analysis, which in turn supports information indexing, retrieval and browsing]
Key Term Extraction from Spoken Documents
Key Term Selection (1/2): Topic Entropy
[Figure: term distributions over topics – a term spread evenly across many topics carries less topical information; a term concentrated in a few topics carries more topical information]
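The topic-entropy criterion above can be sketched as follows; the function name and the toy distributions are illustrative assumptions, not taken from the original work.

```python
import math

def topic_entropy(topic_probs):
    """Entropy (in bits) of a term's distribution over latent topics.

    topic_probs: P(T_k | term) for each of the K topics (sums to 1).
    Low entropy: the term concentrates in a few topics and carries
    more topical information (a good key-term candidate).
    High entropy: the term spreads evenly and carries less.
    """
    return -sum(p * math.log2(p) for p in topic_probs if p > 0)

focused = topic_entropy([0.9, 0.05, 0.05])         # concentrated: low entropy
diffuse = topic_entropy([0.25, 0.25, 0.25, 0.25])  # uniform: 2.0 bits
```

Candidate key terms can then be ranked by ascending topic entropy, keeping the most topic-specific ones.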
Named Entity Extraction
• HMM-based Approaches
• Rule-based Approaches
• Special Approaches Used
– context information among different sentences in the same document properly considered
– matching with automatically retrieved relevant text news to identify out-of-vocabulary (OOV) words
– multi-layered Viterbi search to handle a long named entity composed of several named entities of different types
Named Entity Extraction
• Context Information Extracted
  – some named entities may not be easily identified from a single sentence, but can be extracted when information in several sentences is jointly considered
  – example: 遊戲橘子高階人事異動………對於遊戲橘子企圖跨足研發領域……遊戲橘子董事長表示…… (“Gamania (遊戲橘子) executive personnel changes… regarding Gamania's attempt to move into the R&D field… the chairman of Gamania stated…”), where the company name recurs across several sentences
• Named Entity Matching using a Retrieved Text News Corpus to Identify Some Out-of-Vocabulary (OOV) Words
  – example: 娜莉 颱風 重創 花蓮縣 壽豐鄉 (“Typhoon Nari (娜莉) devastated Shoufeng Township (壽豐鄉), Hualien County”), where the OOV words were first misrecognized as the similar-sounding in-vocabulary words 那裡 (“there”) and 受封 (“conferred”)
• Multi-layered Viterbi Search
  – handling the situation that a named entity may be the concatenation of several named entities of different types
  – example: 台北市中正紀念堂是一個熱門的旅遊景點 (“The Chiang Kai-shek Memorial Hall in Taipei City is a popular tourist attraction”), where 台北市中正紀念堂 concatenates the location name 台北市 (Taipei City) and the landmark name 中正紀念堂 (Chiang Kai-shek Memorial Hall)
[Figure: system components include a confidence measure with a threshold, and text news corpora retrieved via Google for named entity matching]
Spoken Document Segmentation
• Training Phase
• Segmentation Phase
  – dividing the word sequence into sentences (s1, s2, s3, …) by pause duration
  – Viterbi search over the Hidden Markov Model of topic clusters
  – a transition from a cluster Ci into a different cluster Cj is a proper segmentation point
[Figure: in the training phase, training corpora (text form, short paragraphs) are grouped by K-means clustering into L clusters, each with a topic and containing many short paragraphs; P(s|Cj) for a sentence s and each cluster Cj is estimated with N-gram probabilities, possibly modified by story length modeling and pause duration modeling. In the segmentation phase, the document d = s1, s2, s3, s4, s5, … is decoded over the clusters C1, C2, C3, … with emission probabilities P(s|C1), P(s|C2), P(s|C3), …]
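The segmentation-phase Viterbi search can be sketched as below, assuming the cluster emission probabilities P(s|Cj) and the cluster transition probabilities are already trained; all function names and the toy numbers are illustrative.

```python
import math

def segment(sentences, log_emit, log_trans):
    """Viterbi search over the HMM of topic clusters; a transition
    between two different clusters marks a segmentation point.

    sentences: s1..sN, already split by pause duration
    log_emit:  log_emit[c][i] = log P(sentence i | cluster c)
    log_trans: log_trans[p][c] = log P(cluster c | previous cluster p)
    Returns (best cluster sequence, topic-boundary indices).
    """
    L, N = len(log_emit), len(sentences)
    delta = [[-math.inf] * L for _ in range(N)]   # best log-score so far
    back = [[0] * L for _ in range(N)]            # back-pointers
    for c in range(L):                            # uniform initial weight
        delta[0][c] = log_emit[c][0]
    for i in range(1, N):
        for c in range(L):
            prev = max(range(L), key=lambda p: delta[i - 1][p] + log_trans[p][c])
            delta[i][c] = delta[i - 1][prev] + log_trans[prev][c] + log_emit[c][i]
            back[i][c] = prev
    path = [max(range(L), key=lambda c: delta[N - 1][c])]
    for i in range(N - 1, 0, -1):                 # trace back the best path
        path.append(back[i][path[-1]])
    path.reverse()
    boundaries = [i for i in range(1, N) if path[i] != path[i - 1]]
    return path, boundaries

# Toy run: two clusters; the first two sentences fit cluster 0, the last
# two fit cluster 1, and staying in the same cluster is cheap.
log_emit = [[0.0, 0.0, -5.0, -5.0], [-5.0, -5.0, 0.0, 0.0]]
log_trans = [[-0.1, -3.0], [-3.0, -0.1]]
path, boundaries = segment(["s1", "s2", "s3", "s4"], log_emit, log_trans)
# path == [0, 0, 1, 1]; the single topic boundary falls before s3
```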
Spoken Document Summarization
• Selecting Important Sentences to be Concatenated into a Summary
  – sentence scoring
  – given a summarization ratio
• Selected Sentences Collectively Represent Some Concepts Closest to those of the Complete Document
  – removing the concepts already mentioned previously
  – concepts presented smoothly
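A toy sketch of ratio-constrained sentence selection with redundancy removal follows; the word-overlap penalty here is a simple stand-in for the actual concept-based scoring, and all names are illustrative.

```python
def summarize(sentences, scores, ratio=0.3, redundancy_penalty=0.5):
    """Pick sentences up to a summarization ratio, penalizing candidates
    whose concepts (here simply words) are already covered, then emit
    them in original order so the summary reads smoothly."""
    budget = max(1, int(len(sentences) * ratio))
    selected, covered = [], set()
    candidates = set(range(len(sentences)))
    while len(selected) < budget and candidates:
        def adjusted(i):
            # discount a sentence by how much of it is already covered
            words = set(sentences[i].split())
            overlap = len(words & covered) / max(1, len(words))
            return scores[i] - redundancy_penalty * overlap
        best = max(candidates, key=adjusted)
        candidates.remove(best)
        selected.append(best)
        covered |= set(sentences[best].split())
    return [sentences[i] for i in sorted(selected)]

# The second-highest-scored sentence repeats the first one's words,
# so the less redundant third sentence is chosen instead.
summary = summarize(["a b c", "a b d", "x y z", "p q"],
                    [1.0, 0.9, 0.8, 0.1], ratio=0.5)
# summary == ["a b c", "x y z"]
```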
Title Generation for Spoken Documents (1/2)
• Training Phase
  – developing statistical relationships between words in the training documents and their human-generated titles
• Generation Phase (for new spoken documents)
  – transcribing into term sequences
  – identifying suitable terms, and using them to generate a readable title
Title Generation for Spoken Documents (2/2)
[Figure: training documents D = {dj, j = 1, 2, …, N} (text form) and their human-generated titles T = {tj, j = 1, 2, …, N} (text form) are used to train a term selection model, a term ordering model, and a title length model from the training corpus; a new spoken document di (speech form, i = 1, 2, …, M) is first automatically summarized, then a scored Viterbi algorithm applies the three models to the summary to produce the computer-generated title ti (text or speech form)]
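The interplay of the three models can be illustrated with a heavily simplified sketch: a term-selection model scores candidate terms from the summary, a term-ordering (bigram) model chains them, and a fixed length cap stands in for the title length model. Everything here (greedy ordering, the example scores) is an illustrative assumption, not the paper's scored Viterbi.

```python
def generate_title(summary_terms, select_score, bigram_logp, length=3):
    """Toy statistical title generation from a document summary."""
    # Term selection model: keep the top-scoring candidate terms
    chosen = sorted(set(summary_terms), key=select_score, reverse=True)[:length]
    # Term ordering model: greedily chain terms by bigram log-probability
    title = [max(chosen, key=select_score)]
    remaining = [t for t in chosen if t != title[0]]
    while remaining:
        nxt = max(remaining, key=lambda t: bigram_logp(title[-1], t))
        title.append(nxt)
        remaining.remove(nxt)
    return " ".join(title)

# Hypothetical model scores for a tiny example
sel = {"typhoon": 3.0, "nari": 2.5, "hits": 2.0, "the": 0.1, "a": 0.1}
bigram = {("typhoon", "nari"): -1.0, ("nari", "hits"): -1.0}
title = generate_title(["typhoon", "nari", "hits", "the", "a"],
                       sel.get, lambda u, v: bigram.get((u, v), -10.0))
# title == "typhoon nari hits"
```

A real system would search selection, ordering, and length jointly (e.g. with a Viterbi search over scored hypotheses) rather than greedily.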
Topic Analysis and Organization for Spoken Documents
• Example Approach: based on Probabilistic Latent Semantic Analysis (PLSA)
  – terms (words, syllable pairs, etc.) and documents analyzed by probabilities over a set of latent topics
  – trained by the EM algorithm
  – related documents don't have to share common sets of terms, and related terms don't have to co-exist in the same set of documents
• Broadcast News Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or as a Two-layer Map
  – news stories in the same cluster or in closely located clusters usually address related topics
  – clusters labeled by the terms with highest probabilities
  – easier to browse related news stories within a cluster or across nearby clusters
P(t_i | d_j) = Σ_{k=1}^{K} P(t_i | T_k) P(T_k | d_j)
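The PLSA decomposition computes a document's term probability by mixing topic-conditional term probabilities, which is a direct sum over the K latent topics. A minimal sketch (the parameter values are illustrative; in practice P(t|T) and P(T|d) are trained with EM):

```python
def plsa_term_prob(p_t_given_T, p_T_given_d):
    """P(t_i | d_j) = sum over k of P(t_i | T_k) * P(T_k | d_j).

    p_t_given_T: P(t_i | T_k) for k = 1..K, for one fixed term t_i
    p_T_given_d: P(T_k | d_j) for k = 1..K, for one fixed document d_j
    Two documents can give the same term a high probability through a
    shared latent topic even if their observed terms never overlap.
    """
    return sum(pt * pT for pt, pT in zip(p_t_given_T, p_T_given_d))

# K = 2 latent topics: the term is likely under topic 1, and the
# document is mostly about topic 1, so P(t|d) comes out high.
p = plsa_term_prob([0.5, 0.1], [0.8, 0.2])   # 0.5*0.8 + 0.1*0.2 = 0.42
```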
Query-based Local Semantic Structuring for Retrieved Spoken Documents
• A User's Query Produces many Retrieved Spoken Documents
  – difficult to display on the screen
• Better User/System Interaction
  – the system may provide better information about the semantic structure of the retrieved documents to the user
  – the user may then enter a more precise query to the system
[Figure: the user interacts through multi-modal dialogue; a query/instruction goes to the retrieval system, which searches the spoken document archive and returns the retrieved documents organized into a topic hierarchy]