Post on 14-Feb-2016

Page 1: Standards and Tools:  DOBES  and  CLARIN  Views -  resumé  after about 8 years -

Standards and Tools: DOBES and CLARIN Views

- resumé after about 8 years -

Peter Wittenburg, André Moreira

The Language Archive - Max Planck Institute
CLARIN European Research Infrastructure

Content

1. CLARIN vs. DOBES - differences?
2. Tools vs. Standards - differences?
3. Overall Comparison
4. TLA Team - Landscape and Strategy
5. Technology - Mainstream influences
6. Conclusions

DOBES vs. CLARIN

• DOBES is about the documentation of endangered languages (like many other comparable initiatives)
  • documentation teams are under time pressure
  • thus efficiency is required (transcription: 1-35, translation: 1-25)
  • can be facilitated by good tools

• documentation certainly is for this generation of researchers, speech communities, students, public, etc. (primary focus of DOBES and teams)

• documentation is also for future generations
  • documents are part of our cultural heritage
  • languages encode knowledge about nature and cultures
  • historical material helps us find our identity

• therefore DOBES has a short-term and a long-term challenge

DOBES vs. CLARIN

• CLARIN is about an interoperable + persistent infrastructure for LRT (language resources and technology)
  • landscape is fragmented and nothing fits together
  • thus researchers working on data can't be efficient (knowledge workers spend 40% of their time on finding resources, making things compatible, etc.)
  • can be facilitated by good standards and agreements

• infrastructure certainly is for this/next generation of researchers, students, "citizen scientists", etc.
  • enables "better" research if it is "data-driven"

• infrastructure is also for future generations
  • ensuring access to our research records
  • lots of data is highly endangered!
  • comparing "old" data with "new" data

• therefore CLARIN has a short-term and a long-term challenge

DOBES vs. CLARIN: interoperability

• DOBES
  • community of documenting field linguists
  • is interoperability an issue? well, I still don't know
    • interoperable with whom?
    • cross-corpus work based on data is still to come
    • of course some practical barriers (language)

• CLARIN
  • infrastructure covering "all" language resources & tools (named entity recognition relevant for everyone)
  • is interoperability an issue? YES - it's in the focus
    • otherwise always barriers to tackling relevant questions
    • otherwise data-driven research too expensive

• so there seems to be a clear difference in primary objectives

DOBES and CLARIN

                    | DOBES                                              | CLARIN
researcher focus    | "comprehensive" documentation                      | give seamless access to all relevant data
main characteristic | efficiency in annotating, lexicon creation, etc.   | efficiency in finding things and combining them
addressees          | communities, researchers, students, pupils, public | researchers, students, "citizen scientists"
short-term task     | give access now                                    | improve access now
long-term task      | preserve cultural heritage (second priority)       | ensure access in future (part of the concept)
interoperability    | not first priority                                 | first priority
Ulrike - Nicoletta  | "standard" a non-topic                             | "standard" a major topic

thus very much in common - but also some differences

Tools vs. Standards

• who dares to doubt that
  • tools determine our "productivity"
  • tools influence attractiveness of solutions
  • people are used to tools - who wants to learn new stuff?

• tools need to be egocentrically built
  • development is expensive (UI)
  • fast development cycles are necessary
  • SW management is very expensive and eats up person power
  • ~80% of all software developments fail
  • a lot of the SW developed will die quickly since there is not enough money to maintain it

• tools have a short lifecycle, on average about 10 years

[Figure: functionality plotted against time]

Tools vs. Standards

• who dares to doubt that

• standards live almost forever (de facto lifetime comparatively high)

• standards are in general not attractive for users, except for some XML "fans"

• standards should be hidden; only experts need to read all documents

• standards building involves some form of altruism (if big industry is not involved)
  • costs a lot of time and effort (ISO TC37/SC4 started 2002 at LREC)
  • risk of being quickly outdated
  • will a standard be accepted?

• implementing standards in tools can be expensive (moving target, complexity of standard, etc.)

Tools and Standards

                   | Tools               | Standards
lifetime           | comparatively short | comparatively long
attractiveness     | high                | low
creation costs     | high                | high
maintenance costs  | high                | low
short-term success | high                | low (requires time)
long-term "factor" | low                 | potentially high

thus tools are important for short-term success
standards are important for long-term success

all together

• for CLARIN no separation - symbiosis between short-term tool support and long-term interoperability facilitation

• for DOBES there seems to be a difference

       | Tools                                                                                 | Standards
CLARIN | relevant for short- and long-term development (stability, generic, standards-based)  | relevant for interoperability on short and long term
DOBES  | clear interest in short-term efficiency                                               | relevant only for those who focus on long-term aspects

Landscape for TLA Team

• being archivist and providing access to stored material in DOBES (+MPI)
• being in the core of CLARIN/EUDAT infrastructure development

• a few major questions:
  • how can we preserve bit streams and interpretability over a long period?
  • how can we give access to heterogeneous resources and also support resource creation and manipulation/enrichment?
    • have about 71 lexica (and many different annotation types): 61 in the archive, 10 active in LEXUS
      • created by different tools
      • using different structures
      • using different categories (lexical attributes)
  • how can we build "generic" tools and frameworks that can cope with heterogeneity - cannot build/maintain SW too specifically targeted?
  • how can we build SW in a scenario where there are so many smart developers out there?
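
The heterogeneity question above can be made concrete with a small sketch. It assumes two lexica whose entries use different local attribute names for the same categories and maps both onto shared category names. The lexicon names, field names, sample words, and the function `normalize_entry` are all hypothetical illustrations; they are not taken from LEXUS or from any category registry.

```python
from typing import Dict

# Hypothetical local-to-shared mappings for two lexica created with different tools.
# The shared names stand in for entries of a category registry; they are not
# real registry identifiers.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "lexicon_a": {"lx": "lemma", "ps": "partOfSpeech", "ge": "gloss"},
    "lexicon_b": {"headword": "lemma", "pos": "partOfSpeech", "meaning": "gloss"},
}


def normalize_entry(source: str, entry: Dict[str, str]) -> Dict[str, str]:
    """Rename local attributes to shared category names.

    Attributes without a mapping are kept under a 'local:' prefix so that
    no information is silently dropped.
    """
    mapping = MAPPINGS[source]
    normalized: Dict[str, str] = {}
    for key, value in entry.items():
        normalized[mapping.get(key, f"local:{key}")] = value
    return normalized


# Made-up sample entries, one per lexicon.
entry_a = normalize_entry("lexicon_a", {"lx": "wara", "ps": "n", "ge": "water"})
entry_b = normalize_entry("lexicon_b", {"headword": "wara", "pos": "n", "note": "check"})

print(entry_a)
print(entry_b)
```

A generic tool built this way only needs one mapping table per lexicon rather than lexicon-specific code, which is the kind of trade-off the questions above are pointing at.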

Strategy for TLA Team

• Rule 1: have a coherent archive of 34/75 TB
  • i.e. convert "everything" to stable formats with explicit syntax/encoding and check quality
  • otherwise long-term curation and access too expensive
  • costs for late curation and manual migration are extreme

• Rule 2: base tool development on open and "generic" formats
  • EAF for annotations turned out to be flexible enough over 10 years
  • LMF is a flexible model for lexicon structures
    • "LEGO" approach makes some people frightened
    • but flexibility not even sufficient for field linguists
    • yet no agreement on an exchange format - a disaster
  • ISOcat for registering semantics (is it generic enough?)

• Rule 3: provide converters and interfaces for major tools/formats
  • Toolbox, CLAN, Transcriber, PRAAT, other XML
  • time-consuming effort (cyclic flow almost impossible)
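
Rules 1 and 3 together suggest a converter that reads a tool-specific format and writes an explicitly encoded, well-formed file that is then checked. The following is a minimal sketch under stated assumptions: the input is a made-up sample in a backslash-marker layout similar to Toolbox exports, and the output element names (`lexicon`, `entry`, `field`) are purely illustrative, not LMF or any archive format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Made-up sample in a backslash-marker layout similar to Toolbox exports.
# Marker names (\lx, \ps, \ge) follow common usage; projects may use others.
SAMPLE = """\\lx wara
\\ps n
\\ge water

\\lx kapu
\\ps v
\\ge carry
"""


def parse_records(text: str) -> List[Dict[str, str]]:
    """Split marker-based text into records; each lx marker starts a new record."""
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything without a marker
        marker, _, value = line[1:].partition(" ")
        if marker == "lx" or not records:
            records.append({})
        records[-1][marker] = value.strip()
    return records


def to_xml(records: List[Dict[str, str]]) -> bytes:
    """Serialize records as UTF-8 XML with an explicit encoding declaration."""
    root = ET.Element("lexicon")  # element names are illustrative only
    for record in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in record.items():
            ET.SubElement(entry, "field", {"marker": marker}).text = value
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


records = parse_records(SAMPLE)
xml_bytes = to_xml(records)

# Quality check: the output must parse again as well-formed XML.
reparsed = ET.fromstring(xml_bytes)
print(len(reparsed.findall("entry")), "entries converted")
```

A one-way conversion like this drops anything the target structure has no slot for, which is one reason a cyclic flow back into the original tool is so hard to support.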

Is our Strategy Successful?

• very difficult to answer - what are the criteria?

• strategy allows us to be coherent with both DOBES and CLARIN
• strategy was broad enough to help establish TLA
• although
  • LMF turned out to be very expensive for us
    • much time investment to participate in x meetings
    • little understanding from NLP hardcore guys
    • can't even claim to be 100% compliant, or?
    • some years of instability of the model, thus changes of code
    • thus slowing down development
  • invent own interchange format for archiving purposes (RELISH ??)
    • modern lexica are complex objects with inclusions of objects (images, a/v fragments, internal and archived resources, etc.)

• finally, an approach based on flexible standards will pay off, but it takes more time

Technology (IT) Issues

• technology innovation is moving ahead with the web as driving force
• designs and tools need to be web-ready
  • visibility from everywhere
  • access from everywhere
  • collaboration support
  • annotation (incl. relation drawing) support (there are so many knowledgeable people around)

• web technology is subject to a high innovation rate
  • frequent re-design of components
  • what is the stable core to keep costs low and make code maintenance feasible?

Conclusions

• research communities naturally more interested in tools
• research infrastructure work needs to find a balance between short- and long-term aspects

• however, need to store data following general IT principles: explicit syntax, declared semantics, open formats

• need to build better tools to support standards and/or to convince companies to adopt standards
  • but tool building based on standards can be more expensive and time consuming

• RELISH is very good to compare TEI, LMF and LIFT
• RELISH is very good to compare ISOcat and GOLD
• we need a strategy for TLA to support one (or two) exchange formats and one needs to be based on a standard (data will go into the archive)