LR College Paris 10th ECESS meeting

10th ECESS Meeting
College Language Resources
Paris, January 2008

1. Goal of meeting

2. Status members of College

3. Interests and acceptance of associated members and observers

4. Acceptance of College minutes of last meeting

5. College-Action List of 9th meeting

6. Status of partners (as in TA and in Maribor Pool-xls-data)
o Pronunciation lexica (Pool Lex1, Pool Lex2)
o Acoustic data for TTS voices (Pool Voice1, Pool Voice2)
o Text Corpora (Pool Text1, Pool Text2)

7. The current state of the LR specification
o Settling/finalization of the specification for Text Corpora (Pool Text1, Pool Text2)
o Settling/finalization of the specification for acoustic data for TTS voices (minimal requirements - Pool Voice2)

8. Interest and further plans of partners.

9. Discussion. General issues.

o ECESS LR specification documents (public page)
o LSPs specifications (internal page)
o LR distribution (internal page)
o LR exchanging agreement
o Splitting LR

10. Discussion. Further directions of LR College

o Promotion of ECESS LR
o Extension of LR collection. New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech) depending on interests and needs of ECESS.

11. New Action List of College

1. Main Goals

• Status and further plans of partners
• Interests and acceptance of associated members
• Settling/finalization of the specification for POS tagging
• ECESS LR specification documents (public and internal page)
• Extension of LR collection
• Distribution of LR

2. Status members of LR College

AMU University of Poznan (Coordinator: Grażyna Demenko)

Siemens (Harald Höge)

Middle East Technical University, Ankara (Tolga Çiloğlu)

CAS (Jinhua Tao)

Uni Bonn (Stefan Breuer)

Uni Munich ( )

Associates and Observers:
Nokia (Imre Kiss)
Microsoft Portugal (Daniela Braga)

3. Interests and acceptance of associated members and observers

Uni Bielefeld (Dafydd Gibbon)

1) MBROLA diphone voice creation service for new languages
2) German lexicon (details to be specified)
3) An experimental child's voice with recordings and report on issues involved
4) Particular interest in multilingual resources and in under-documented languages

Other members/observers?

4. Acceptance of College minutes of last meeting

5. College-Action List of 9th meeting

• Settling/finalization of specifications for Text Corpora POS: PT1, PT2 Pool
• Settling/finalization of specifications for LR – voice database: non-standard PV2 Pool
• Lexicon: PL1, PL2 Pool final documentation – end of 2007 (internal ECESS pages)

6. Status of partners (as in TA and in Maribor Pool-xls-data)

Types of LR and related Pools

Pools for Pronunciation lexica
(1) PL1 Pool Lex1, according to LC-STAR specs
(2) (PL2) Pool Lex2, according to minimum requirements

Pools for Voices
(1) PV1 Pool Voice1, according to TC-STAR specs
(2) (PV2) Pool Voice2, according to minimum requirements

Pools for Text Corpora
(1) PT1 Pool Text1, according to ECESS specs
(2) (PT2) Pool Text2, according to minimum requirements

Pools Lex1 and Voice1
Pool Lex1: According to LC-STAR specs as described earlier (documents available from the ECESS website)
Pool Voice1: According to TC-STAR specs as described earlier (documents available from the ECESS website)

Pools Lex2 and Voice2
Specifications of minimum requirements and thresholds will be defined during the first period of ECESS (coordinated by Uni. Munich), preferably as a subset of the TC/LC-STAR criteria.

Technical Annex

Ideal Language Resources
Columns: Partner; Pool Lex1 (Language); Pool Voice1 (Amount-Sex-Language); Pool Text1 (Amount in Words)

CAS: 1fCN
Uni Bonn: DE
Uni Munich:
Siemens: UK 1mUK3
Uni Poznan: PL

Language Resources with Minimal Requirements
Columns: Partner; Pool Lex2 (Language); Pool Voice2 (Amount-Sex-Language); Pool Text2 (Amount in Words)

CAS: CN CN 150K Syllables (plan)
Uni Bonn: 1fDE DE
Uni Munich: DE 2fDE,2mDE 200K DE
Siemens: 100K UK
Uni Poznan: 1mPL
Uni. METU (Tr): 1mTr,1fTr 10K Tr

Language Codes: UK = UK-English, JP = Japanese, CN = Mandarin, DE = German, PL = Polish, EU = Basque, ES = Spanish, FI = Finnish, SI = Slovenian, Tr = Turkish, PT = Portuguese

PRESENT RESOURCES

• Siemens: UK lexicon (10/2007), UK baseline voice validated
• Nokia: LC-STAR Mandarin lexicon and TC-STAR Mandarin TTS database (1 male voice) for exchange in ECESS
• AMU: LC-STAR Polish lexicon
• UPC: Catalan, 2 sp 10h baseline voices

7. The current state of the LR specification

o Settling/finalization of the specification for Text Corpora (Pool Text1, Pool Text2)

o Settling/finalization of the specification for acoustic data for TTS voices (minimal requirements - Pool Voice2)

Design Principles of the Acoustic Corpora

Size of corpus: 10 h speech per baseline speaker per language

The 'Baseline Text Corpus' is composed of the following corpora:
Transcribed speech: 45 000 words
Written text (novels and short stories with short sentences): 27 000 words
Selected phrases (frequent phrases, triphone sentences, mimic sentences): 18 000 words

Minimal requirements for acoustic data: coordinated by the University of Munich
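
As a quick arithmetic check on the composition above, the following minimal Python sketch sums the three sub-corpus sizes and reports each share. The dictionary keys and function names are illustrative assumptions, not part of any ECESS or TC-STAR tooling; only the word counts come from the slide. The total appears to agree with the 90K-word figure quoted for the TC-STAR text corpus elsewhere in these slides.

```python
# Word counts per sub-corpus, copied from the slide.
# The key names are hypothetical labels chosen for this sketch.
BASELINE_TEXT_CORPUS = {
    "transcribed_speech": 45_000,
    "written_text": 27_000,
    "selected_phrases": 18_000,
}


def total_words(parts):
    """Return the total number of words over all sub-corpora."""
    return sum(parts.values())


def shares(parts):
    """Return each sub-corpus size as a fraction of the total."""
    total = total_words(parts)
    return {name: count / total for name, count in parts.items()}


if __name__ == "__main__":
    print("total words:", total_words(BASELINE_TEXT_CORPUS))
    for name, share in shares(BASELINE_TEXT_CORPUS).items():
        print(f"{name}: {share:.0%}")
```
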

LR College Maribor: 9th ECESS meeting

Text corpus specifications (for POS tagging)

Size of corpus:
Expected size of text data: 100K tokens minimum, 100% manually checked; the rest (500K-1M) can be done automatically

Domains:
Mandatory: 20K should come from spoken transliterations
Preferred: in line with the TC-STAR text corpora (in line with acoustic data creation)

TC-STAR text corpus as basis for POS tagging (90K words)

LC-STAR tag set, or comparable, but the tag set in the lexicon and in the tagged text corpus must match
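
The requirements above could be checked mechanically before a corpus is submitted to a pool. The Python sketch below is one possible reading, under stated assumptions: the `CorpusSummary` type, its field names and `check_corpus` are hypothetical, the 20K spoken-domain tokens are counted separately from the manually checked total, and "must match" is interpreted as "every tag used in the corpus is defined in the lexicon tag set". The slides do not settle either interpretation.

```python
from dataclasses import dataclass
from typing import List, Set

# Thresholds taken from the slide; the constant names are hypothetical.
MIN_MANUAL_TOKENS = 100_000  # "100K tokens minimum, 100% manually checked"
MIN_SPOKEN_TOKENS = 20_000   # "20K ... from spoken transliterations"


@dataclass
class CorpusSummary:
    """Hypothetical summary of a POS-tagged text corpus."""
    manually_checked_tokens: int
    spoken_tokens: int
    corpus_tags: Set[str]
    lexicon_tags: Set[str]


def check_corpus(summary: CorpusSummary) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if summary.manually_checked_tokens < MIN_MANUAL_TOKENS:
        problems.append(
            f"only {summary.manually_checked_tokens} manually checked tokens, "
            f"need {MIN_MANUAL_TOKENS}"
        )
    if summary.spoken_tokens < MIN_SPOKEN_TOKENS:
        problems.append(
            f"only {summary.spoken_tokens} spoken-domain tokens, "
            f"need {MIN_SPOKEN_TOKENS}"
        )
    # Assumption: tags used in the corpus must all exist in the lexicon tag set.
    unknown = summary.corpus_tags - summary.lexicon_tags
    if unknown:
        problems.append("tags missing from lexicon: " + ", ".join(sorted(unknown)))
    return problems


if __name__ == "__main__":
    example = CorpusSummary(
        manually_checked_tokens=120_000,
        spoken_tokens=15_000,
        corpus_tags={"NOM", "VER", "ADJ", "INT"},
        lexicon_tags={"NOM", "VER", "ADJ"},
    )
    for problem in check_corpus(example):
        print(problem)
```

Returning a list of problems rather than a single boolean lets a validator report every shortfall at once. The tag names in the usage example are placeholders, not the LC-STAR tag set.
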

Discussion

POS tagging
• Size of text, domains
• Tokenization problems
• POS tagging sets
• Format of POS tagging
• Validation

8. Plans of Partners.

Ideal Language Resources
Columns: Partner; Pool Lex1 (Language); Pool Voice1 (Amount-Sex-Language); Pool Text1 (Amount in Words)

CAS: 1fCN
Uni Bonn: DE
Uni Munich:
Siemens: UK 1mUK3
Uni Poznan: PL

Language Resources with Minimal Requirements
Columns: Partner; Pool Lex2 (Language); Pool Voice2 (Amount-Sex-Language); Pool Text2 (Amount in Words)

CAS: CN CN 150K Syllables (plan)
Uni Bonn: 1fDE DE
Uni Munich: DE 2fDE,2mDE 200K DE
Siemens: 100K UK
Uni Poznan: 1mPL
Uni. METU (Tr): 1mTr,1fTr 10K Tr

Language Codes: UK = UK-English, JP = Japanese, CN = Mandarin, DE = German, PL = Polish, EU = Basque, ES = Spanish, FI = Finnish, SI = Slovenian, Tr = Turkish, PT = Portuguese

9. Discussion. General issues.

o ECESS LR specification documents (public page)

o LSPs specifications (internal page)

o Splitting LR

o LR distribution (internal page)

o LR exchanging agreement

ECESS LR specification documents (public page, internal page)

The language-independent specification is public and should be accessible from the public ECESS web page.

The language-specific data (Language Specific Peculiarities, LSP; the LSP could be extended to contain all the 'contact information') is part of the LR dedicated to a pool. The LSPs have to be approved by the LR College. The LSPs are located on the internal web page of ECESS (College LR).

A new public 'ECESS' specification document (the different LC-STAR and TC-STAR documents together, ECESS specification).

LR papers, publication

• Splitting LR

SIE suggests splitting the data in the lexicon pool into 'lexicon for common words' (which we will deliver for UK) and 'lexicon for proper names'.

Partners interested only in parts of the lexica could then choose what they want to deliver and exchange.

Advantage: some partners may only want to deliver/get certain parts of a particular language; production costs for the different parts are more comparable.

o LR distribution (internal page)
o LR exchanging agreement

LR agreement: within the college 'Tools', Uni. Maribor acts as a distributor of tools needed for evaluation.

• 10. Discussion. Further directions of LR College

o Promotion of ECESS LR
o Extension of LR collection. New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech) depending on interests and needs of ECESS.

• 11. New Action List of College
