12.0 Spoken Document Understanding and Organization
References:
1. “Spoken Document Understanding and Organization”, IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
2. “Speech-to-text and Speech-to-speech Summarization of Spontaneous Speech”, IEEE Transactions on Speech and Audio Processing, Dec. 2004
3. “Multi-layered Summarization of Spoken Document Archives by Information Extraction and Semantic Structuring”, Interspeech 2006, Pittsburgh, USA
Multi-media Content in the Future Network Era
• Integrating All Knowledge, Information and Services Globally
• Most Attractive Form of the Network Content will be in Multi-media, which usually Includes Speech Information
• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multi-media Content, thus Becomes the Key for Indexing, Retrieval and Browsing
Future Integrated Networks
• Real-time Information
  – weather, traffic
  – flight schedule
  – stock price
  – sports scores
• Electronic Commerce
  – virtual banking
  – on-line transactions
  – on-line investments
• Knowledge Archives
  – digital libraries
  – virtual museums
• Intelligent Working Environment
  – e-mail processors
  – intelligent agents
  – teleconferencing
  – distant learning
• Private Services
  – personal notebook
  – business databases
  – home appliances
  – network entertainments
Network Content Indexing/Retrieval/Browsing in the Future Era of Wireless Multi-media
[Figure: future networks link public information and services with private and personal services; users access them through voice input/output and text, via spoken dialogue, spoken document retrieval, text-based retrieval, and text-to-speech synthesis, over both voice information and text information]
• Multi-media Content Indexed/Retrieved/Browsed Based on the Speech Information
• User Instructions in either Text or Speech Form
• Network Access is Primarily Text-based today, but almost all Roles of Texts can be Replaced by Speech in the Future
Multi-media/Spoken Document Understanding and Organization (Ⅰ)
• Written Documents are Better Structured and Easier to Browse
  – in paragraphs with titles
  – easily shown on the screen
  – easily decided at a glance if it is what the user is looking for
• Multi-media/Spoken Documents are just Video/Audio Signals
  – not easy to show on the screen
  – the user can't go through each one from the beginning to the end during browsing
  – better approaches for understanding/organization of multi-media/spoken documents become necessary
Multi-media/Spoken Document Understanding and Organization (Ⅱ)
• Key Term/Named Entity Extraction from Multi-media/Spoken Documents
  – personal names, organization names, location names, event names
  – very often keywords in the multi-media/spoken documents
  – very often out-of-vocabulary (OOV) words, difficult for recognition
• Multi-media/Spoken Document Segmentation
  – automatically segmenting a multi-media/spoken document into short paragraphs, each with a central topic
• Information Extraction for Multi-media/Spoken Documents
  – extraction of key information such as who, when, where, what and how for the information described by multi-media/spoken documents
  – very often the relationships among the key terms/named entities
• Summarization for Multi-media/Spoken Documents
  – automatically generating a summary (in text or speech form) for each short paragraph
• Title Generation for Multi-media/Spoken Documents
  – automatically generating a title (in text or speech form) for each short paragraph
  – a very concise summary indicating the topic area
• Topic Analysis and Organization for Multi-media/Spoken Documents
  – analyzing the subject topics of the short paragraphs
  – clustering and organizing the subject topics of the short paragraphs into graphic structures giving the relationships among them for easier access
Integration Relationships among the Involved Technology Areas
[Figure: the involved technology areas build on one another – key term/named entity extraction from spoken documents feeds semantic analysis, which in turn supports information indexing, retrieval and browsing]
Key Term Extraction from Spoken Documents
Key Term Selection (1/2): Topic Entropy
[Figure: term distributions over topics – a term spread evenly across many topics carries less topical information; a term concentrated in a few topics carries more topical information]
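The topic-entropy criterion above can be sketched as follows; the function name and the toy distributions are illustrative assumptions, not taken from the original work.

```python
import math

def topic_entropy(topic_probs):
    """Entropy (in bits) of a term's distribution over latent topics.

    topic_probs: P(T_k | term) for each of the K topics (sums to 1).
    Low entropy: the term concentrates in a few topics and carries
    more topical information (a good key-term candidate).
    High entropy: the term spreads evenly and carries less.
    """
    return -sum(p * math.log2(p) for p in topic_probs if p > 0)

focused = topic_entropy([0.9, 0.05, 0.05])         # concentrated: low entropy
diffuse = topic_entropy([0.25, 0.25, 0.25, 0.25])  # uniform: 2.0 bits
```

Candidate key terms can then be ranked by ascending topic entropy, keeping the most topic-specific ones.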
Named Entity Extraction
• HMM-based Approaches
• Rule-based Approaches
• Special Approaches Used
– context information among different sentences in the same document properly considered
– matching with automatically retrieved relevant text news to identify out-of-vocabulary (OOV) words
– multi-layered Viterbi search to handle a long named entity composed of several named entities of different types
Named Entity Extraction
• Context Information Extracted
  – some named entities may not be easily identified from a single sentence, but can be extracted when information in several sentences is jointly considered
  – example: 遊戲橘子高階人事異動………對於遊戲橘子企圖跨足研發領域……遊戲橘子董事長表示…… (“Gamania (遊戲橘子) executive personnel changes… regarding Gamania's attempt to move into the R&D field… the chairman of Gamania stated…”), where the company name recurs across several sentences
• Named Entity Matching using a Retrieved Text News Corpus to Identify Some Out-of-Vocabulary (OOV) Words
  – example: 娜莉 颱風 重創 花蓮縣 壽豐鄉 (“Typhoon Nari (娜莉) devastated Shoufeng Township (壽豐鄉), Hualien County”), where the OOV words were first misrecognized as the similar-sounding in-vocabulary words 那裡 (“there”) and 受封 (“conferred”)
• Multi-layered Viterbi Search
  – handling the situation that a named entity may be the concatenation of several named entities of different types
  – example: 台北市中正紀念堂是一個熱門的旅遊景點 (“The Chiang Kai-shek Memorial Hall in Taipei City is a popular tourist attraction”), where 台北市中正紀念堂 concatenates the location name 台北市 (Taipei City) and the landmark name 中正紀念堂 (Chiang Kai-shek Memorial Hall)
[Figure: system components include a confidence measure with a threshold, and text news corpora retrieved via Google for named entity matching]
Spoken Document Segmentation
• Training Phase
• Segmentation Phase
  – dividing the word sequence into sentences (s1, s2, s3, …) by pause duration
  – Viterbi search over the Hidden Markov Model of topic clusters
  – a transition from a cluster Ci into a different cluster Cj is a proper segmentation point
[Figure: in the training phase, training corpora (text form, short paragraphs) are grouped by K-means clustering into L clusters, each with a topic and containing many short paragraphs; P(s|Cj) for a sentence s and each cluster Cj is estimated with N-gram probabilities, possibly modified by story length modeling and pause duration modeling. In the segmentation phase, the document d = s1, s2, s3, s4, s5, … is decoded over the clusters C1, C2, C3, … with emission probabilities P(s|C1), P(s|C2), P(s|C3), …]
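The segmentation-phase Viterbi search can be sketched as below, assuming the cluster emission probabilities P(s|Cj) and the cluster transition probabilities are already trained; all function names and the toy numbers are illustrative.

```python
import math

def segment(sentences, log_emit, log_trans):
    """Viterbi search over the HMM of topic clusters; a transition
    between two different clusters marks a segmentation point.

    sentences: s1..sN, already split by pause duration
    log_emit:  log_emit[c][i] = log P(sentence i | cluster c)
    log_trans: log_trans[p][c] = log P(cluster c | previous cluster p)
    Returns (best cluster sequence, topic-boundary indices).
    """
    L, N = len(log_emit), len(sentences)
    delta = [[-math.inf] * L for _ in range(N)]   # best log-score so far
    back = [[0] * L for _ in range(N)]            # back-pointers
    for c in range(L):                            # uniform initial weight
        delta[0][c] = log_emit[c][0]
    for i in range(1, N):
        for c in range(L):
            prev = max(range(L), key=lambda p: delta[i - 1][p] + log_trans[p][c])
            delta[i][c] = delta[i - 1][prev] + log_trans[prev][c] + log_emit[c][i]
            back[i][c] = prev
    path = [max(range(L), key=lambda c: delta[N - 1][c])]
    for i in range(N - 1, 0, -1):                 # trace back the best path
        path.append(back[i][path[-1]])
    path.reverse()
    boundaries = [i for i in range(1, N) if path[i] != path[i - 1]]
    return path, boundaries

# Toy run: two clusters; the first two sentences fit cluster 0, the last
# two fit cluster 1, and staying in the same cluster is cheap.
log_emit = [[0.0, 0.0, -5.0, -5.0], [-5.0, -5.0, 0.0, 0.0]]
log_trans = [[-0.1, -3.0], [-3.0, -0.1]]
path, boundaries = segment(["s1", "s2", "s3", "s4"], log_emit, log_trans)
# path == [0, 0, 1, 1]; the single topic boundary falls before s3
```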
Spoken Document Summarization
• Selecting Important Sentences to be Concatenated into a Summary
  – sentence scoring
  – given a summarization ratio
• Selected Sentences Collectively Represent Some Concepts Closest to those of the Complete Document
  – removing the concepts already mentioned previously
  – concepts presented smoothly
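A toy sketch of ratio-constrained sentence selection with redundancy removal follows; the word-overlap penalty here is a simple stand-in for the actual concept-based scoring, and all names are illustrative.

```python
def summarize(sentences, scores, ratio=0.3, redundancy_penalty=0.5):
    """Pick sentences up to a summarization ratio, penalizing candidates
    whose concepts (here simply words) are already covered, then emit
    them in original order so the summary reads smoothly."""
    budget = max(1, int(len(sentences) * ratio))
    selected, covered = [], set()
    candidates = set(range(len(sentences)))
    while len(selected) < budget and candidates:
        def adjusted(i):
            # discount a sentence by how much of it is already covered
            words = set(sentences[i].split())
            overlap = len(words & covered) / max(1, len(words))
            return scores[i] - redundancy_penalty * overlap
        best = max(candidates, key=adjusted)
        candidates.remove(best)
        selected.append(best)
        covered |= set(sentences[best].split())
    return [sentences[i] for i in sorted(selected)]

# The second-highest-scored sentence repeats the first one's words,
# so the less redundant third sentence is chosen instead.
summary = summarize(["a b c", "a b d", "x y z", "p q"],
                    [1.0, 0.9, 0.8, 0.1], ratio=0.5)
# summary == ["a b c", "x y z"]
```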
Title Generation for Spoken Documents (1/2)
• Training Phase
  – developing statistical relationships between words in the training documents and their human-generated titles
• Generation Phase (for new spoken documents)
  – transcribing into term sequences
  – identifying suitable terms, and using them to generate a readable title
Title Generation for Spoken Documents (2/2)
[Figure: training documents D = {dj, j = 1, 2, …, N} (text form) and their human-generated titles T = {tj, j = 1, 2, …, N} (text form) are used to train a term selection model, a term ordering model, and a title length model from the training corpus; a new spoken document di (speech form, i = 1, 2, …, M) is first automatically summarized, then a scored Viterbi algorithm applies the three models to the summary to produce the computer-generated title ti (text or speech form)]
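The interplay of the three models can be illustrated with a heavily simplified sketch: a term-selection model scores candidate terms from the summary, a term-ordering (bigram) model chains them, and a fixed length cap stands in for the title length model. Everything here (greedy ordering, the example scores) is an illustrative assumption, not the paper's scored Viterbi.

```python
def generate_title(summary_terms, select_score, bigram_logp, length=3):
    """Toy statistical title generation from a document summary."""
    # Term selection model: keep the top-scoring candidate terms
    chosen = sorted(set(summary_terms), key=select_score, reverse=True)[:length]
    # Term ordering model: greedily chain terms by bigram log-probability
    title = [max(chosen, key=select_score)]
    remaining = [t for t in chosen if t != title[0]]
    while remaining:
        nxt = max(remaining, key=lambda t: bigram_logp(title[-1], t))
        title.append(nxt)
        remaining.remove(nxt)
    return " ".join(title)

# Hypothetical model scores for a tiny example
sel = {"typhoon": 3.0, "nari": 2.5, "hits": 2.0, "the": 0.1, "a": 0.1}
bigram = {("typhoon", "nari"): -1.0, ("nari", "hits"): -1.0}
title = generate_title(["typhoon", "nari", "hits", "the", "a"],
                       sel.get, lambda u, v: bigram.get((u, v), -10.0))
# title == "typhoon nari hits"
```

A real system would search selection, ordering, and length jointly (e.g. with a Viterbi search over scored hypotheses) rather than greedily.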
Topic Analysis and Organization for Spoken Documents
• Example Approach: based on Probabilistic Latent Semantic Analysis (PLSA)
  – terms (words, syllable pairs, etc.) and documents analyzed by probabilities over a set of latent topics
  – trained by the EM algorithm
  – related documents don't have to share common sets of terms, and related terms don't have to co-exist in the same set of documents
• Broadcast News Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or as a Two-layer Map
  – news stories in the same cluster or in closely located clusters usually address related topics
  – clusters labeled by the terms with highest probabilities
  – easier to browse related news stories within a cluster or across nearby clusters
P(t_i | d_j) = Σ_{k=1}^{K} P(t_i | T_k) P(T_k | d_j)
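The PLSA decomposition computes a document's term probability by mixing topic-conditional term probabilities, which is a direct sum over the K latent topics. A minimal sketch (the parameter values are illustrative; in practice P(t|T) and P(T|d) are trained with EM):

```python
def plsa_term_prob(p_t_given_T, p_T_given_d):
    """P(t_i | d_j) = sum over k of P(t_i | T_k) * P(T_k | d_j).

    p_t_given_T: P(t_i | T_k) for k = 1..K, for one fixed term t_i
    p_T_given_d: P(T_k | d_j) for k = 1..K, for one fixed document d_j
    Two documents can give the same term a high probability through a
    shared latent topic even if their observed terms never overlap.
    """
    return sum(pt * pT for pt, pT in zip(p_t_given_T, p_T_given_d))

# K = 2 latent topics: the term is likely under topic 1, and the
# document is mostly about topic 1, so P(t|d) comes out high.
p = plsa_term_prob([0.5, 0.1], [0.8, 0.2])   # 0.5*0.8 + 0.1*0.2 = 0.42
```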
Query-based Local Semantic Structuring for Retrieved Spoken Documents
• A User's Query Produces many Retrieved Spoken Documents
  – difficult to display on the screen
• Better User/System Interaction
  – the system may provide better information about the semantic structure of the retrieved documents to the user
  – the user may then enter a more precise query to the system
[Figure: the user interacts through multi-modal dialogue; a query/instruction goes to the retrieval system, which searches the spoken document archive and returns the retrieved documents organized into a topic hierarchy]