2nd Progress Meeting for Sphinx 3.6 Development
Arthur Chan, David Huggins-Daines, Yitao Sun
Carnegie Mellon University
Jun 7, 2005
This Meeting (2nd Progress Meeting of 3.6)
Purpose of this meeting:
- A working progress report on various aspects of the development
- A briefing on embedded Sphinx 2 (by David)
- A briefing on Sphinx 3's "crazy branch" (by Arthur)
  - A branch in CVS
  - Includes several interesting features
  - Includes a number of mild changes
- Discussion before another check-in
Outline of This Talk
- Review of the 1st Progress Meeting
- Progress of the embedded version of Sphinx 2 (by Dave, 7-10 pages)
- Progress of Sphinx 3's crazy branch (15-20 pages)
  - Architecture diagram of Sphinx 3.6
  - Changes in search abstraction (7 pages)
  - Progress on search implementation (8 pages)
    - GMM computation
    - FSG mode, Word Switching Tree Search mode
  - Mild re-factoring (not "gentle" any more) (3 pages)
    - LM, S3.0 family of tools
- Hieroglyphs (1 page)
Review of the 1st Progress Meeting
Last time:
- Two separate layers were defined: low-level implementations of search, and possible abstractions of search
  - The abstraction was just introduced; its advantages were not yet apparent
- Implementation of Mode 5 was still under development (only 10% complete)
- libs3decoder had just been modularized into 8 sub-modules
Motivation for Re-architecting Sphinx 3.X
- Need for new search algorithms
  - New search algorithm development carries risk; we don't want to throw away the old one
  - Mere replacement could cause backward-compatibility problems
- Code has grown to a stage where some changes could be very hard
- Multiple programmers are now active at the same time
  - CVS conflicts could become frequent if variants are controlled by "if-else" structures
Architecture of Sphinx 3.X (X < 6)
- Batch sequential architecture (Shaw 96)
- Each executable customizes its own sub-routines

[Diagram: each executable (decode, livepretend, decode_anytopo, align, allphone) has its own command line, initialization (kb and kbcore), GMM computation (approx_cont_mgau, or gauden & senone methods 1-3), search, and process controller.]
Pros/Cons of the Batch Sequential Architecture
Pros:
- Great flexibility for individual programmers
- No shared assumptions; data structures are usually optimized for the application (align and allphone have their own optimizations)
- Code crafted for an individual application can be of high quality
Cons:
- Tremendous difficulty in maintenance: most changes need to be carried out 5-6 times
- Spreads the disease of code duplication: code with the same functionality was duplicated multiple times
- Scared a lot of programmers in the past; beginners tend to prefer a general architecture
Big Picture of the Software Architecture in Sphinx 3.6
- Layered and object-oriented, implemented in C
- Major high-level routines:
  - Initializer (kb.c and kbcore.c): a kind of clipboard for the other controllers
  - Process controller (corpus.c): governs the protocol for processing a sentence
  - Search abstraction routine (srch.c): governs how search is done; implemented as pipelines and filters with shared memory; each filter can be overridden, similar to what OO languages do
  - Command-line processor (cmd_ln_macro.c and cmd_ln.c): implemented as macros
Software Architecture Diagram of Sphinx 3.6

[Diagram: four layers, top to bottom]
- Applications: decode, livepretend, align, allphone, dag, astar, decode (anytopo), livedecode API, user-defined applications
- Controllers/Abstractions: search controller, process controller, search initializer, command-line processor
- Implementations: fast single-stream GMM computation; multi-stream GMM computation; Mode 0: align; Mode 1: allphone; Mode 2: FSG; Mode 3: anytopo; Mode 4: magic wheel; Mode 5: WST
- Libraries: dictionary, search, LM, AM, utility, feature, and miscellaneous libraries
Search Abstraction
- The search abstraction is implemented as objects
- Search operations are implemented as filters with shared memory
- Each filter is a kind of unique operation for search
- Ideally, each filter, or a set of filters, can be replaced

[Diagram: "search for one frame" as a pipeline of filters: select active CD senones -> compute approximate GMM scores (CI senones) -> compute detailed GMM scores (CD senones) -> compute detailed HMM scores (CD) -> propagate graph (phone level) -> rescore at word ends using high-level knowledge sources (e.g. the LM) -> propagate graph (word level)]
Different Ways to Implement a Search
1. Use the default implementation: just specify all the atomic search operations (ASOs) provided
2. Override "search_one_frame": only need to specify the GMM computation and how to "search_one_frame"
3. Override the whole mechanism: for people who dislike the default so much that they override how to "search"
Concrete Examples
- Mode 4 (magic wheel) and Mode 5 (WST) use the default implementation
- Mode 2 (FSG) overrides the "search_one_frame" implementation, but shares the GMM implementation
- Likely, Mode 0 (align), Mode 1 (allphone), and Mode 3 (flat-lexicon decoding) will also do the same
Future Work
- The re-factoring of align, allphone, and decode_anytopo is not yet completed
- The search abstraction needs to consider more flexible mechanisms:
  - Running the search backward (for backward search)
  - Approximate search in the first stage (for phoneme and word look-ahead)
  - (Optional) parallel and distributed decoding
- The command line and internal modules could still mismatch
  - Might learn from the mechanisms of Sphinx 2 and Sphinx 4
- Controlling how an utterance is processed can require 5 different files; a better control format?
- Does not yet fully anticipate the fixed-point front-end and GMM computation in Sphinx 2
GMM Computation
- decode can now use an SCHMM, specified by .semi. (implemented and tested by Dave)
- The GMM computation in align, allphone, decode, and livepretend is now common code
- The Sphinx 2 fixed-point version of the GMM computation is not yet incorporated; it looks very delicious.
Finite State Machine Search (Mode 2): Implementation
- Largely completed (70% complete)
- Recipe:
  - Search: function-pointer implementation adapted from the Sphinx 2 FSG_* family of routines
  - GMM computation: uses the Sphinx 3 GMM computation; already allows CIGMMS
Finite State Machine Search (Mode 2): Problems for the Users
- Not yet seriously tested; finding test cases is hard
- Still no way to write a grammar; Yitao's goal for Q3 and Q4 2005:
  - Either directly incorporate the CFG's score into the search
  - Or implement an approximate converter from CFG to FSM (HTK's method)
Finite State Machine Search (Mode 2): Other Problems
- Problems inherited from Sphinx 2 (copied from Ravi's slide):
  - No lextree implementation (What?)
  - Static allocation of all HMMs; not allocated "on demand" (Oh, no!)
  - FSG transitions represented by an NxN matrix (You can't be serious!!)
- Other wish list:
  - No histogram pruning (Houston, we've got a problem.)
  - No state-based implementation (Wilson! I am sorry!!)
    - We need it for unification of BW, alignment, allphone, and FSG search
Time Switching Tree Search (Mode 4)
- Name change: it was "lucky wheel"; it is now "magic wheel"
- In the last check-in, after test-full, results are exactly the same for 6 corpora; we could sleep.
- Future work: change the word-end triphone implementation from composite triphones to full triphones
Word Switching Tree Search (Mode 5)
- Can now run the Communicator task, with the same performance as Mode 4
- Major reasons why it does not yet approach decode_anytopo's results:
  - The bigram probability is not yet factored in (not an easy task; still considering how)
  - The triphone implementation is not yet exact
- 30% complete
Future Work on Mode 5
- N-gram look-ahead
- Full trigram tree implementation
- Phoneme and word look-ahead
- Share the full triphone implementation with Mode 4 in the future
Big Picture of All Search Implementations
- A finite state machine data structure could unify align, allphone, Baum-Welch, and FSG search
  - Time will show whether it is also applicable to tree search
- Search implementation has more short-term demand
  - Mode 5 will be our new flagship
  - By October, 3 out of 4 goals for Mode 5 should be completed
- Between the different searches, code should be shared as much as possible
Summary of Re-factorings
- Not gentle any more, but still mild
- Several useful things to know:
  - Language model routine revamping
  - The S3.0 family of tools
  - Overall status of the merge
LM Routine
Current capabilities:
- Reads both text-based and DMP-based LMs
- Allows switching of LMs
- Allows inter-conversion between the text and DMP formats of an LM
- Provides a single interface to all applications
Tool of the Month: lm_convert
- An "lm3g2dmp++": will be the application for future language model inter-conversion
- Other formats? CMULMTK's format?
S3.0 Family of Tools
- The architecture drives many changes in the code:
  - align, allphone, and decode_anytopo now use kbcore
  - The same version of the multi-stream GMM computation routine
  - Simplified search structure
  - The ctl_process mechanism (next step is to use the srch.c interface)
- All tools now share sets of common command-line macros
Code Merging
- Sphinx 3.0, Sphinx 3.X, and share are now unified
  - Alex: "It's time to fix the training algorithms!"
  - Ravi: "It's time to add full n-grams and full n-phones to the recognizer!!"
  - Dave: "It's time to work on pronunciation modeling!"
  - Yitao: "It's time to implement a CFG-based search!!"
  - Evandro: "It's time to do more regression tests!"
  - Alan: "Don't merge Sphinx with Festival!!"
- Next step: it's time to clean up SphinxTrain
- We will keep the pace to fewer than 4 tool check-ins per month
Hieroglyphs
- Halves of Chapters 3 and 5 are finished
  - Chapter 3, "Introduction to Speech Recognition"; missing: descriptions of DTW, HMM, and LM
  - Chapter 5, "Roadmap of Building a Speech Recognition System"; missing: how to evaluate the system, and how to train a system (Evandro's tutorial will be perfect)
- Still ~4 chapters (out of 12) of material to go before the 1st draft is written