LR College Paris 10th ECESS meeting
10th ECESS Meeting
College Language Resources
Paris, January 2008
1. Goal of meeting
2. Status members of College
3. Interests and acceptance of associated members and observers
4. Acceptance of College minutes of last meeting
5. College-Action List of 9th meeting
6. Status of partners (as in TA and in Maribor Pool-xls-data)
o Pronunciation lexica (Pool Lex1, Pool Lex2)
o Acoustic data for TTS voices (Pool Voice1, Pool Voice2)
o Text Corpora (Pool Text1, Pool Text2)
7. The actual state of LR specification
o Settling/finalization of the specification for Text Corpora (Pool Text1, Pool Text2)
o Settling/finalization of the specification for Acoustic data for TTS voices (minimal requirements - Pool Voice2)
8. Interest and further plans of partners.
9. Discussion. General issues.
o ECESS LR specification documents (public page)
o LSPs specifications (internal page)
o LR distribution (internal page)
o LR exchanging agreement
o Splitting LR
10. Discussion. Further directions of LR College
o Promotion of ECESS LR
o Extension of LR collection. New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech) depending on interests and needs of ECESS.
11. New Action List of College
1. Main Goals
• Status and further plans of partners
• Interests and acceptance of associated members
• Settling/finalization of the specification for POS tagging
• ECESS LR specification documents (public and internal page)
• Extension of LR collection
• Distribution of LR
2. Status members of LR College
AMU University of Poznan (Coordinator: Grażyna Demenko)
Siemens (Harald Höge)
Middle East Technical University, Ankara (Tolga Çiloğlu)
CAS (Jinhua Tao)
Uni Bonn (Stefan Breuer)
Uni Munich ( )
Associates and Observers:
Nokia (Imre Kiss)
Microsoft Portugal (Daniela Braga)
3. Interests and acceptance of associated members and observers
Uni Bielefeld (Dafydd Gibbon)
1) MBROLA diphone voice creation service for new languages
2) German lexicon (details to be specified)
3) An experimental child's voice with recordings and a report on the issues involved
4) Particular interest in multilingual resources and in under-documented languages
Other members/observers?
4. Acceptance of College minutes of last meeting
5. College-Action List of 9th meeting
• Settling/finalization of specifications for Text Corpora POS: PT1, PT2 Pool
• Settling/finalization of specifications for LR – voice database: non-standard PV2 Pool
• Lexicon: PL1, PL2 Pool final documentation – end of 2007 (internal ECESS pages)
6. Status of partners (as in TA and in Maribor Pool-xls-data)
Types of LR and related Pools
Pools for Pronunciation lexica
(1) PL1 Pool Lex1, according to LC-STAR specs
(2) (PL2) Pool Lex2, according to minimum requirements
Pools for Voices
(1) PV1 Pool Voice1, according to TC-STAR specs
(2) (PV2) Pool Voice2, according to minimum requirements
Pools for Text Corpora
(1) PT1 Pool Text1, according to ECESS specs
(2) (PT2) Pool Text2, according to minimum requirements
Pools Lex1 and Voice1
Pool Lex1: according to LC-STAR specs as described earlier (documents available from the ECESS website)
Pool Voice1: according to TC-STAR specs as described earlier (documents available from the ECESS website)
Pools Lex2 and Voice2
Specifications of minimum requirements and thresholds will be defined during the first period of ECESS (coordinated by Uni. Munich), preferably as a subset of TC/LC-STAR criteria.
Technical Annex
Ideal Language Resources
Pools: Pool Lex1, Pool Voice1, Pool Text1
Columns: Partner, Language, Amount-Sex-Language, Amount in Words
CAS: 1fCN
Uni Bonn: DE
Uni Munich:
Siemens: UK; 1mUK3
Uni Poznan: PL
Language Resources with Minimal Requirements
Pools: Pool Lex2, Pool Voice2, Pool Text2
Columns: Partner, Language, Amount-Sex-Language, Amount in Words
CAS: CN; CN 150K Syllables (plan)
Uni Bonn: 1fDE; DE
Uni Munich: DE; 2fDE,2mDE; 200KDE
Siemens: 100K UK
Uni Poznan: 1mPL
Uni. METU (Tr): 1mTr,1fTr; 10K Tr
Language codes: UK = UK-English, JP = Japanese, CN = Mandarin, DE = German, PL = Polish, EU = Basque, ES = Spanish, FI = Finnish, SI = Slovenian, Tr = Turkish, PT = Portuguese
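The Amount-Sex-Language entries in the tables (for example 2fDE,2mDE) can be read mechanically. The following is a minimal sketch, assuming each entry is a count, then f or m, then a two-letter language code from the list above; the function name and the regular expression are illustrative and not part of any ECESS specification.

```python
import re

# Language codes as listed on the slide.
LANGUAGE_CODES = {
    "UK": "UK-English", "JP": "Japanese", "CN": "Mandarin", "DE": "German",
    "PL": "Polish", "EU": "Basque", "ES": "Spanish", "FI": "Finnish",
    "SI": "Slovenian", "Tr": "Turkish", "PT": "Portuguese",
}

# Assumed reading of one entry: count, sex (f/m), two-letter language code.
VOICE_ENTRY = re.compile(r"(\d+)([fm])([A-Za-z]{2})")


def parse_voice_entries(cell):
    """Return (count, sex, language name) for each voice entry in a table cell."""
    voices = []
    for count, sex, code in VOICE_ENTRY.findall(cell):
        language = LANGUAGE_CODES.get(code, code)  # unknown codes pass through unchanged
        voices.append((int(count), "female" if sex == "f" else "male", language))
    return voices


print(parse_voice_entries("2fDE,2mDE"))
print(parse_voice_entries("1mTr,1fTr"))
```

Under this reading, 2fDE,2mDE stands for two female and two male German voices.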
PRESENT RESOURCES
• Siemens: UK lexicon (10/2007), UK baseline voice validated
• Nokia: LC-STAR Mandarin lexicon and TC-STAR Mandarin TTS database (1 male voice) for exchange in ECESS
• AMU: LC-STAR Polish lexicon
• UPC: Catalan, 2 sp 10h baseline voices
7. The actual state of LR specification
o Settling/finalization of the specification for Text Corpora (Pool Text1, Pool Text2)
o Settling/finalization of the specification for Acoustic data for TTS voices (minimal requirements - Pool Voice2)
Design Principles of the Acoustic Corpora
Size of corpus: 10 h speech per baseline speaker per language
The 'Baseline Text Corpus' is composed of the following corpora:
Transcribed speech: 45 000 words
Written text (novels and short stories with short sentences): 27 000 words
Selected phrases (frequent phrases, triphone sentences, mimic sentences): 18 000 words
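As a quick check of the figures above, the three parts add up to 90 000 words, which agrees with the 90K-word TC-STAR text corpus mentioned for POS tagging. The short sketch below only does this arithmetic; the dictionary keys are shorthand labels, not official names.

```python
# Word counts per sub-corpus, as given on the slide.
baseline_text_corpus = {
    "transcribed speech": 45_000,
    "written text": 27_000,
    "selected phrases": 18_000,
}

total_words = sum(baseline_text_corpus.values())

# Share of each sub-corpus in the whole baseline text corpus.
shares = {name: words / total_words for name, words in baseline_text_corpus.items()}

print(total_words)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The split works out to one half transcribed speech, three tenths written text and one fifth selected phrases.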
Minimal requirements acoustic data. Coordinated by University of Munich
LR College Maribor: 9th ECESS meeting
Text corpus specifications (for POS tagging)
Size of corpus:
Expected size of text data: 100K tokens minimum, 100% manually checked; the rest (500K-1M) can be done automatically
Domains:
Mandatory: 20K should come from spoken transliterations
Preferred: in line with the TC-STAR text corpora (in line with acoustic data creation)
TC-STAR text corpus as basis for POS tagging (90K words)
LC-STAR tag set, or comparable, but the tag set in the lexicon and in the tagged text corpus must match
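The requirements listed above could be checked automatically along the following lines. This is a sketch under stated assumptions: the CorpusSummary record and the tag names in the example are hypothetical, and "must match" is read here as "every tag used in the corpus also appears in the lexicon tag set", which is only one possible interpretation.

```python
from dataclasses import dataclass

MIN_MANUAL_TOKENS = 100_000  # minimum size, 100% manually checked
MIN_SPOKEN_TOKENS = 20_000   # mandatory amount from spoken material


@dataclass
class CorpusSummary:  # hypothetical record, not an ECESS format
    manually_checked_tokens: int
    spoken_tokens: int
    corpus_tags: set
    lexicon_tags: set


def check_text_corpus(summary):
    """Return a list of problems; an empty list means no violation was found."""
    problems = []
    if summary.manually_checked_tokens < MIN_MANUAL_TOKENS:
        problems.append("fewer than 100K manually checked tokens")
    if summary.spoken_tokens < MIN_SPOKEN_TOKENS:
        problems.append("fewer than 20K tokens from spoken material")
    # Assumed reading of "must match": corpus tags are a subset of lexicon tags.
    unmatched = summary.corpus_tags - summary.lexicon_tags
    if unmatched:
        problems.append("tags missing from lexicon tag set: " + ", ".join(sorted(unmatched)))
    return problems


# Illustrative values only; the tag names are made up.
example = CorpusSummary(120_000, 15_000, {"NOM", "VER", "ADJ"}, {"NOM", "VER"})
print(check_text_corpus(example))
```

Returning a list of problems rather than a single pass/fail flag keeps the result usable as input to a validation report.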
Discussion
POS tagging
• Size of text, domains
• Tokenization problems
• POS tagging sets
• Format of POS tagging
• Validation
8. Plans of Partners.
Ideal Language Resources
Pools: Pool Lex1, Pool Voice1, Pool Text1
Columns: Partner, Language, Amount-Sex-Language, Amount in Words
CAS: 1fCN
Uni Bonn: DE
Uni Munich:
Siemens: UK; 1mUK3
Uni Poznan: PL
Language Resources with Minimal Requirements
Pools: Pool Lex2, Pool Voice2, Pool Text2
Columns: Partner, Language, Amount-Sex-Language, Amount in Words
CAS: CN; CN 150K Syllables (plan)
Uni Bonn: 1fDE; DE
Uni Munich: DE; 2fDE,2mDE; 200KDE
Siemens: 100K UK
Uni Poznan: 1mPL
Uni. METU (Tr): 1mTr,1fTr; 10K Tr
Language codes: UK = UK-English, JP = Japanese, CN = Mandarin, DE = German, PL = Polish, EU = Basque, ES = Spanish, FI = Finnish, SI = Slovenian, Tr = Turkish, PT = Portuguese
9. Discussion. General issues.
o ECESS LR specification documents (public page)
o LSPs specifications (internal page)
o Splitting LR
o LR distribution (internal page)
o LR exchanging agreement
ECESS LR specification documents (public page, internal page)
The language-independent specification is public and should be accessible from the public ECESS web page.
The language-specific data (Language Specific Peculiarities, LSP; the LSP could be extended to contain all the 'contact information') is part of the LR dedicated to a pool. The LSPs have to be approved by the LR College. The LSPs are located on the internal web page of ECESS (College LR).
A new public 'ECESS' specification document (different LC-STAR and TC-STAR documents together, ECESS specification)
LR papers, publication
• Splitting LR
SIE suggests splitting the data in the lexicon pool into 'lexicon for common words' (which we will deliver for UK) and 'lexicon for proper names'.
Partners interested only in parts of the lexica could then choose what they want to deliver and exchange.
Advantage: some partners may only want to deliver/get certain parts of a particular language; production costs for the different parts are more comparable.
o LR distribution (internal page)
o LR exchanging agreement
LR agreement: within the college 'Tools', Uni. Maribor acts as a distributor of the tools needed for evaluation.
10. Discussion. Further directions of LR College
o Promotion of ECESS LR
o Extension of LR collection. New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech) depending on interests and needs of ECESS.
11. New Action List of College