
Common Language Resources and Technology Infrastructure

www.clarin.eu

Language Resource and Technology Registry

Infrastructure

2010-01-20 Internal Version: 1

Editors: Daan Broeder, Dieter Van Uytvanck, Peter Wittenburg

Contributors: Patrick Duin, Matej Durco, Leif-Jöran Olsson, Oliver Schonefeld


WG2.5 Registry Infrastructure Building 2


Language Resource and Technology Registry Infrastructure

(version 1)

EC FP7 project no. 212230

Deliverable: D2R-5a - Deadline: 1.1.2010 Responsible: Peter Wittenburg

Contributing Partners: MPI, INL, OTA, RACAI, WROCUT, UPF, ELDA, ILSP, ILC, USFD, ULund, DFKI, CSC, UIL-OTS, ULeuven, AKSIS, ATILF, UTuebingen, HASRIL, CST, UTartu Contributing Members: ULeipzig, UMasaryk, CELTA, TILDE, Meertens, IDS, BBAW, UFrankfurt, DANS, SBGöteborg

© all rights reserved by MPI for Psycholinguistics on behalf of CLARIN


Scope of the Document

CLARIN References

CLARIN-2008-2 Persistent and unique Identifiers
CLARIN-2008-3 CLARIN Centres
CLARIN-2008-5 Metadata Infrastructure for Language Resources and Technology


Contents

1 Introduction
2 Design choices & Implementation
  2.1 Data Category registry
    2.1.1 Principles
    2.1.2 Data categories for metadata descriptions
  2.2 Component Metadata
    2.2.1 Model
    2.2.2 Workflow and software components
  2.3 Virtual Language Observatory
    2.3.1 The GIS perspective
    2.3.2 Future work
  2.4 Persistent Identifiers
  2.5 Future Work
3 Conclusions
  3.1 Complexity
  3.2 Curation
  3.3 Communication
4 Appendix A: Metadata data categories
5 Appendix B: EPIC Memorandum of Understanding
6 Appendix C: CMDI software components
  6.1 CMDI Toolkit
  6.2 CMDI component registry
  6.3 CMDI Metadata Repository / Service
  6.4 CMDI Metadata Editor
  6.5 CMDI Virtual Collection registry
7 References
8 Acronyms


1 Introduction

As already indicated in document CLARIN-2008-5, there is no single metadata schema that can satisfy the whole humanities and social sciences research community. Instead we proposed a modular, component- and data category-based approach, which should provide a high level of flexibility while maintaining semantic interoperability.

In this document we describe the design and implementation efforts that have been made to build an LRT registry infrastructure. Its main pillars are the data category registry, the Component Metadata Infrastructure (CMDI), the Virtual Language World and the Persistent Identifier (PID) services.

Please keep in mind while reading that this document does not describe a static final result; rather, it offers a view on a very dynamic prototyping process. Details may change, but the reader should nevertheless get a reasonably accurate overview of the direction in which the CLARIN registry infrastructure is heading.

2 Design choices & Implementation

2.1 Data Category registry

2.1.1 Principles

[ISOcat] is the Data Category Registry (DCR) for ISO TC 37. The mission of the DCR is to overcome problems associated with semantic interoperability among language-related data resources. The DCR provides carefully defined linguistic concepts together with relevant modeling constraints. Interoperability between different schemas can be achieved if, for example, they reference the same ISOcat entries. Pointing to an entry without further constraints indicates that the "concept" used is identical to the one found in ISOcat.

An increasing number of experts in the metadata field are shifting their focus away from schemas, since it is widely understood that schemas are important for correctly parsing and interpreting the content of an XML file, but not for achieving interoperability. For interoperability, the proper definition of linguistic concepts and their registration in machine-readable registries is of the greatest relevance. T. Baker (DC) summarized it as "there will be many schemas that hopefully will use registered semantics that can be referred to by using persistent identifiers". This is the reason why ISO TC37/SC4 saw it as its central activity to create a "data category registry" and to define a process, guided by domain experts, to integrate concept definitions.

The state of this work can be described in the following way: (1) The DCR data model has been stabilized. (2) The ISOcat distributed database system will become usable in January 2009. (3) Several colleagues have already created many concept definitions that are waiting to be processed by community experts. The Dublin Core and TEI concepts can also be referenced via the web, although one cannot yet speak of persistent identifiers in all cases.

ISO TC37/SC4 deliberately did not include relations in the DCR, since relations frequently depend on practical issues such as the purpose of a search. The idea is now that schemas reference such categories (concept definitions), that users create and register relations between such categories where necessary, and that search engines make use of the definitions and relations to find useful resources. The principles of such a data category relation store (also known as a relation registry) were discussed at a workshop in January 2010; see http://www.clarin.eu/node/2974 for more details.
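The interplay of shared category references and user-registered relations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the data category URIs (DC-9001 to DC-9004) and the function name interoperable_fields are invented, and a real relation registry would be a service rather than an in-memory set.

```python
# Hypothetical field-to-concept mappings for two metadata schemas.
# All field names and data category URIs below are invented for illustration.
BASE = "http://www.isocat.org/datcat/"
SCHEMA_A = {"lang": BASE + "DC-9001", "title": BASE + "DC-9002"}
SCHEMA_B = {"LanguageName": BASE + "DC-9001", "ResourceTitle": BASE + "DC-9004"}

# A toy relation registry: pairs of categories that users declared related.
RELATIONS = {(BASE + "DC-9002", BASE + "DC-9004")}


def interoperable_fields(schema_a, schema_b, relations=frozenset()):
    """Pair up fields whose concepts are identical or registered as related."""
    pairs = []
    for field_a, concept_a in schema_a.items():
        for field_b, concept_b in schema_b.items():
            same = concept_a == concept_b
            related = ((concept_a, concept_b) in relations
                       or (concept_b, concept_a) in relations)
            if same or related:
                pairs.append((field_a, field_b))
    return pairs


print(interoperable_fields(SCHEMA_A, SCHEMA_B))             # shared references only
print(interoperable_fields(SCHEMA_A, SCHEMA_B, RELATIONS))  # plus registered relations
```

Without relations only the fields that point to the same category are matched; adding the relation set lets a search engine also bridge fields whose categories differ but were declared related.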


For CLARIN this development means that when we want to build a new infrastructure we need to step away from a fixed schema. Instead we will rely on a vocabulary which is registered in open and accepted registries, and we will include existing schemas as possible components in order to remain upward compatible and to support living communities. Here we should primarily think of the existing OLAC, IMDI and ELDA schemas.

2.1.2 Data categories for metadata descriptions

A dedicated working group has determined a set of data categories that were considered suitable for the LRT metadata domain. Sources of inspiration for this list were:

- [Dublin Core] and [OLAC]
- [ENABLER]
- [IMDI]
- Talks with representatives from the speech technology research community
- The [TEI] header element

This process resulted in a list of 220 data categories, as listed in Appendix A. This list is also accessible via www.isocat.org, under the Thematic Domain Group and Profile Metadata.

After the (English) data categories were entered into the ISOcat registry, the entries were translated: a so-called language section was added for the following languages: Bulgarian, Catalan, Danish, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Polish, Portuguese and Spanish.

The intention is to provide translations into at least all official [EU languages]. In order to achieve this, all national representatives were asked to take responsibility for the translation into their official languages. The progress of this translation process is and will be monitored at http://www.clarin.eu/node/2947.

2.2 Component Metadata

2.2.1 Model

A metadata component bundles different related aspects or dimensions of a resource. Such a component is basically a collection of metadata fields. Each field refers via a URI to exactly one data category in the ISO DCR, thus indicating unambiguously how the content of the field in a metadata description should be semantically interpreted. Components can have a recursive structure: next to the atomic fields they may contain other components. They are expressed as XML files; an example component description can be found in Fig. 1(a), with its graphical representation in Fig. 1(b), where fields are marked with dotted lines and components with solid lines.

Any number of components can be combined with a header element into a metadata profile. A profile, also represented as an XML file, provides a blueprint for the specialized metadata schema. It can be converted into a W3C XML schema with an XSLT transformation; references to all occurring data categories are maintained. Finally, the XML files containing the metadata descriptions (instantiations) can be created, using the generated XML schema to check the formal correctness of the data. A fragment of such an example description is shown in Fig. 1(c). As each metadata description contains a link to its W3C schema, its validity can be checked and the data categories used in the description can be retrieved.

CLARIN will suggest a number of recommended components and profiles that will be made available in a component registry. These components and profiles are based on the decomposition of existing metadata sets such as OLAC, IMDI and DC. Users can also create and use their own components. The fundamental requirement for all components is that the metadata elements used refer explicitly to concepts registered in ISOcat or other trusted registries. If a user wants to include an element which is not yet registered, because she feels that it is necessary for the proper description of the resource, she needs to register the new concept at least in the so-called ISOcat "user space". The ISOcat process will then decide whether the new category will be integrated into the official part of the registry.
CLARIN will be strict and only accept categories that are registered in accepted registries, since otherwise no semantic interoperability can be established.

    <CMD_Component name="Actor">
      <CMD_Element name="firstName" ValueScheme="string"
                   ConceptLink="http://www.isocat.org/datcat/CMD-123"/>
      <CMD_Element name="lastName" ValueScheme="string"
                   ConceptLink="http://www.isocat.org/datcat/CMD-124"/>
      <CMD_Component name="ActorLanguage" id="ActorLanguage"
                     CardinalityMin="0" CardinalityMax="unbounded">
        <CMD_Element name="ActorLanguageName" ValueScheme="string"
                     ConceptLink="http://www.isocat.org/datcat/DC-1766"/>
      </CMD_Component>
    </CMD_Component>

Figure 1 (a). XML representation

Figure 1 (b). Schematic representation.
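The recursive structure of components can be walked with a few lines of Python. The sketch below uses a well-formed copy of the Actor component from Figure 1 (a); the function name list_fields and the slash-separated path notation are assumptions made for illustration and are not part of CMDI.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the Actor component from Figure 1 (a).
COMPONENT_XML = """
<CMD_Component name="Actor">
  <CMD_Element name="firstName" ValueScheme="string"
               ConceptLink="http://www.isocat.org/datcat/CMD-123"/>
  <CMD_Element name="lastName" ValueScheme="string"
               ConceptLink="http://www.isocat.org/datcat/CMD-124"/>
  <CMD_Component name="ActorLanguage" id="ActorLanguage"
                 CardinalityMin="0" CardinalityMax="unbounded">
    <CMD_Element name="ActorLanguageName" ValueScheme="string"
                 ConceptLink="http://www.isocat.org/datcat/DC-1766"/>
  </CMD_Component>
</CMD_Component>
"""


def list_fields(component, path=""):
    """Recursively yield (path, ConceptLink) for every field in a component."""
    here = f"{path}/{component.get('name')}"
    for child in component:
        if child.tag == "CMD_Element":
            # Atomic field: report it together with its data category.
            yield f"{here}/{child.get('name')}", child.get("ConceptLink")
        elif child.tag == "CMD_Component":
            # Nested component: descend one level.
            yield from list_fields(child, here)


root = ET.fromstring(COMPONENT_XML.strip())
fields = list(list_fields(root))
for field_path, concept in fields:
    print(field_path, "->", concept)
```

Each printed line pairs a field with the data category that fixes its meaning, which is the information a generated schema needs to preserve.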


    <Actor>
      <firstName>Foo</firstName>
      <lastName>Bar</lastName>
      <ActorLanguage>
        <ActorLanguageName>Kilivila</ActorLanguageName>
        <ActorLanguageName>French</ActorLanguageName>
      </ActorLanguage>
    </Actor>

Figure 1 (c). Actor instance in XML

Fig. 2 describes the interaction between the data category registries, the component registry, the component editor and the metadata editor. The design aims at promoting the reuse of existing fields and components, but also gives users the opportunity to create new ones within the well-defined ISO DCR process. Naturally, we anticipate the creation of different metadata profiles for different communities; e.g., sign-language researchers will need to describe a video signal differently than multimodality experts.

Figure 2. Interaction of registries and editors

Clearly, the principles behind this model need to be enforced by the accompanying end-user software: the component editor will check whether all elements used are indeed taken from ISOcat or another trusted registry, and it will interact with the component registry to store final component profiles and make them reusable.
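A minimal sketch of such a check is given below, assuming that trusted registries can be recognised by the prefix of the ConceptLink URI. The prefix list, the function name untrusted_elements and the draft Location component (including its DC-0000 placeholder category) are hypothetical; the actual component editor may apply different rules.

```python
import xml.etree.ElementTree as ET

# Assumption: a registry counts as trusted if the ConceptLink starts with one
# of these prefixes.
TRUSTED_PREFIXES = ("http://www.isocat.org/datcat/",)

# Hypothetical draft component with one valid and two problematic elements.
DRAFT = """<CMD_Component name="Location">
  <CMD_Element name="Country" ValueScheme="string"
               ConceptLink="http://www.isocat.org/datcat/DC-0000"/>
  <CMD_Element name="Nickname" ValueScheme="string"/>
  <CMD_Element name="Mood" ValueScheme="string"
               ConceptLink="http://example.org/my-concepts/mood"/>
</CMD_Component>"""


def untrusted_elements(component_xml, trusted=TRUSTED_PREFIXES):
    """Return the names of elements whose ConceptLink is missing or untrusted."""
    root = ET.fromstring(component_xml)
    problems = []
    # iter() also visits elements inside nested components.
    for element in root.iter("CMD_Element"):
        link = element.get("ConceptLink", "")
        if not link.startswith(trusted):
            problems.append(element.get("name"))
    return problems


print(untrusted_elements(DRAFT))  # ['Nickname', 'Mood']
```

An editor built along these lines would refuse to store the draft in the component registry until the two reported elements point to registered categories.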

2.2.2 Workflow and software components Using the “metadata component editor”, the user is presented a GUI where he can create an adequate metadata schema for a metadata description. The user does this by selecting metadata components from a so called "metadata component registry". These components all are supposed to describe a particular aspect of a resource. For instance an "ActorLanguage" component will describe all

WG2.5 Registry Infrastructure Building 9

Page 10: Language Resource and Technology Registry InfrastructureCommon Language Resources and Technology Infrastructure Language Resource and Technology Registry Infrastructure 2010-01-20

Common Language Resources and Technology Infrastructure

relevant aspects of a language available to an actor. Another component "Location" may describe relevant aspects of a region or town that a researcher finds useful to describe, as where e.g. a recording took place. Such components can be aggregated into a single metadata description. To overcome problems with semantic interoperability every metadata element in a component is

quired to have a reference to a concept registry, as do the components as a whole.

th the uge existing installed base of metadata descriptions such as those of IMDI, OLAC, TEI, etc.

ns) of the XML-schemas. But the schema can also be used to work with tools utside the CMDI.

gistry and relation registries to ake different components interoperable if their semantics overlap.

etadata descriptions, the ISOcat data category registry and the lation registry.

igure 3. The whole CLARIN metadata provider and consumer workflow.

Researcher

D-Editor

re Users can also create their own components either on the basis of aggregating existing components into new ones, or by defining completely new ones as long as the elements have references to an accepted concept registry. It was understood that we need to seed the initial component registry with components taken from the existing metadata sets so that we can achieve interoperability wih When the user has created a selection of components, such a selection can be saved for future use in the form of a metadata profile that can be exported as an XML-schema (different flavours are possible). Within the CMDI there will be a metadata editor available to create instantiations (metadata descriptioo The metadata search service & browsing is offered to users via (1) a GUI allowing users to create queries on the basis of the elements and vocabularies available in the repository and (2) a web service. The service can make use of the references to the concept rem Currently we are discussing the best way to provide the described functionality on the basis of the available information sources: mre

Catalog Search

GUI

F

M

MD Editing

OAI Data

Provider

MD / Comp.

Creation

MD Comp.

Registry

OAI ServiceProvider

Joint MD

System Repos.

SemanticMapping Service

Local MD

System Repos.

DCR

RelationRegistry

MD services

MD-Modeler

External Agents


2.3 Virtual Language Observatory

The most elaborated part of the Virtual Language World is the Virtual Language Observatory (VLO), which provides multiple views on metadata for linguistic data and software. At the time of writing, metadata has been collected from the CLARIN [LRT inventory], the [IMDI] archive (this includes the [DOBES] corpora of endangered languages), linguistic archives distributing their metadata within the Open Language Archives Community [OLAC], the [DFKI] software registry and a sample of the [ELRA] catalogue. These data sources are brought together by mapping their respective field descriptors to two metadata sets: one for describing resources, and one for describing tools. From these two metadata sets, a small number of field descriptors are chosen as facets, providing users with well-defined entry dimensions to all resources and tools via a faceted browser built with the [Flamenco] toolkit. The facets to which all of the metadata records are mapped are currently country, continent, corpus, language, organization, genre and subject, as illustrated in Figure 4.

Faceted search allows users to find resources in an intuitive manner: each facet selection reduces the set of resources or tools to those that fall into the selected categories and thus share those properties. In contrast to traditional hierarchical browsing, faceted browsing offers many different access paths to a resource or tool of interest. The VLO faceted browser displays all metadata (from the various content providers) in a uniform format; in addition, whenever possible, links were created that point to a resource or tool in its original context (e.g., all of the IMDI data in the VLO have backlinks to the IMDI metadata browser, allowing users to continue their search in a tree-structured way). Moreover, we have started to provide links from other metadata viewers (e.g., the geographical browser discussed below) to the faceted browser.
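The principle of faceted narrowing can be illustrated with a small Python sketch. It is not the Flamenco implementation; the records, their values and the function names select and facet_counts are invented for illustration.

```python
from collections import Counter

# Toy metadata records; all values are invented for illustration.
RECORDS = [
    {"language": "Kilivila", "country": "Papua New Guinea", "genre": "narrative"},
    {"language": "Kilivila", "country": "Papua New Guinea", "genre": "conversation"},
    {"language": "French", "country": "France", "genre": "narrative"},
    {"language": "Dutch", "country": "Netherlands", "genre": "conversation"},
]


def select(records, **chosen):
    """Keep only the records that match every chosen facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in chosen.items())]


def facet_counts(records, facet):
    """Count how many of the given records carry each value of a facet."""
    return Counter(r[facet] for r in records if facet in r)


# The counts shown next to each facet value shrink as selections are added.
print(facet_counts(RECORDS, "language"))
narratives = select(RECORDS, genre="narrative")
print(facet_counts(narratives, "language"))

# Facets can be chosen in any order and lead to the same result set.
by_genre_first = select(select(RECORDS, genre="narrative"), language="Kilivila")
by_language_first = select(select(RECORDS, language="Kilivila"), genre="narrative")
print(by_genre_first == by_language_first)
```

The last comparison shows why faceted browsing offers several access paths: the order in which a user picks facet values does not change the records that remain.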


Figure 4. The faceted search interface for language resources at the top level of the Virtual Language Observatory. The number after each facet value indicates the number of metadata records corresponding to that value.

2.3.1 The GIS perspective

We have also created a Google Earth overlay, combining geographic information with metadata-based information. This work is partially based on the Language-Sites collection [language-sites] and has been extended with links to typological information about the languages from the [WALS] database and to [DELAMAN] research centres. An important aspect is the interaction with the aforementioned faceted search: the user can reach a particular facet view by clicking on a point associated with a language. This aspect is illustrated in Figure 5.
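Google Earth overlays are written in KML, so a language marker that links back to the faceted browser could be generated roughly as follows. This is a sketch under assumptions: the base URL, the query parameter name, the approximate coordinates and the function names are hypothetical and do not describe the actual overlay.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

KML_NS = "http://www.opengis.net/kml/2.2"
FACET_BROWSER = "http://example.org/facets"  # hypothetical base URL


def language_placemark(name, lon, lat):
    """Build one KML Placemark whose description links to the facet view."""
    placemark = ET.Element(f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    link = f"{FACET_BROWSER}?{urlencode({'language': name})}"
    ET.SubElement(placemark, f"{{{KML_NS}}}description").text = link
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML orders coordinates as longitude,latitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return placemark


def build_overlay(languages):
    """Serialise (name, lon, lat) tuples into a KML document string."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lon, lat in languages:
        document.append(language_placemark(name, lon, lat))
    return ET.tostring(kml, encoding="unicode")


# Approximate, illustrative coordinates.
overlay = build_overlay([("Kilivila", 151.08, -8.54)])
print(overlay)
```

Clicking the resulting marker would show the link in its description, which is one simple way to connect a geographic point to a pre-selected facet.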


Figure 5. By clicking on a language marker (1) the user gets information about a language, together with links to the relevant entries in the faceted search interface (2), where one of the options is to see the resource metadata record in its original context (3), in this example the CLARIN LRT inventory.

2.3.2 Future work

The Virtual Language World portal is up and running, but there is potential for improving the site and its content. First and foremost, the metadata needs to be further curated. This includes not only the correction of obvious (typographic) errors, the removal of outdated or duplicate entries, the updating of metadata and the filling in of missing information, but also a convergence towards controlled vocabularies for descriptor values and the harmonization of language-dependent fields. One of the biggest challenges in this respect is the distributed nature of metadata storage (in particular with regard to the OLAC providers); here, curation needs to be coordinated and pursued at various sites at once.

The consistent use of persistent identifiers to address all resources and tools is high on the wish list, as is the use of standards for referring to entities such as persons, organizations, publications, etc. The electronic libraries community [DRIVER] and other communities are facing similar problems, and it would be beneficial to seek cooperation with them. The use of [CERIF] (Common European Research Information Format) may solve some of the problems at hand, but more work is needed to judge whether it is the best answer to all issues.

Our current approach to mapping the various metadata schemes to each other is ad hoc. CLARIN addresses the issue of semantic interoperability by introducing a component-based metadata framework [CMDI], which is tied to the ISO Data Category Registry [ISOCAT]. In future work, we would like to take up the CMDI avenue for bringing together the various distributed metadata for language resources and tools in a more systematic way. Having a well-thought-out infrastructure in place will realise VLW's vision: to provide the research community with a telescope to the wonderful universe of language resources and tools.

2.4 Persistent Identifiers

In document CLARIN-2008-2 we extensively sketched the CLARIN requirements for Persistent Identifiers for language resources. To provide the LRT community with a high-quality service for issuing and maintaining such PIDs, the [EPIC] initiative was launched. Although EPIC is, at the time of writing, in a one-year prototype phase, it offers a full REST-based web service for registering and managing handle PIDs, as documented at http://www.pidconsortium.eu/. For more details we refer to the EPIC website and to the Memorandum of Understanding included in Appendix B.
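Handle PIDs consist of a prefix and a suffix separated by a slash, and they can be resolved through the public Handle System proxy. The sketch below only shows this resolution side and says nothing about the EPIC registration API; the example handle and the function names are hypothetical.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net"  # public Handle System proxy


def split_handle(pid):
    """Split 'prefix/suffix' (optionally starting with 'hdl:') into its parts."""
    if pid.startswith("hdl:"):
        pid = pid[len("hdl:"):]
    prefix, separator, suffix = pid.partition("/")
    if not (prefix and separator and suffix):
        raise ValueError(f"not a handle: {pid!r}")
    return prefix, suffix


def resolver_url(pid, proxy=HANDLE_PROXY):
    """Build the URL under which the proxy resolves the given handle."""
    prefix, suffix = split_handle(pid)
    # Percent-encode characters in the suffix that are unsafe in a URL path.
    return f"{proxy}/{prefix}/{quote(suffix)}"


EXAMPLE_PID = "hdl:12345/example-resource-0001"  # hypothetical handle
print(resolver_url(EXAMPLE_PID))
```

Storing the handle rather than the resolved location in metadata descriptions keeps references stable when a resource moves.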

2.5 Future Work

During the creation of the CMDI XML toolkit it became clear that there is a substantial demand for partially controlled vocabularies: lists with suggested values for a certain metadata field, leaving the user the option to override these suggestions. In addition, the need surfaced for a system that can cope with long lists (i.e. thousands of elements) for controlled or partially controlled vocabularies. One could think of the ISO 639-3 language codes (about 7000 elements) or a non-exhaustive list of academic institutions in the EU. We will closely evaluate the [CATCH-plus] vocabulary repository (http://vocrep.q42.net/, available under a free software license) for this purpose. It uses a SKOS RDF store as a backend and is reported to cope well with large vocabularies (millions of entries).
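The behaviour of a partially controlled vocabulary can be sketched as follows: values are suggested from a sorted list, which keeps prefix lookups fast for long lists, but the user may still enter something else. The class name PartialVocabulary, its methods and the sample values are assumptions made for illustration; this is not the CATCH-plus repository or the CMDI toolkit.

```python
import bisect


class PartialVocabulary:
    """Suggested values for a metadata field that the user may override."""

    def __init__(self, suggested):
        # Map case-folded keys to the canonical spelling.
        self._by_key = {value.casefold(): value for value in suggested}
        # Sorted keys allow a binary search for the first matching prefix.
        self._sorted = sorted(self._by_key)

    def suggest(self, typed, limit=5):
        """Return up to `limit` suggestions starting with the typed text."""
        key = typed.casefold()
        start = bisect.bisect_left(self._sorted, key)
        matches = []
        for candidate in self._sorted[start:]:
            if not candidate.startswith(key) or len(matches) >= limit:
                break
            matches.append(self._by_key[candidate])
        return matches

    def accept(self, value):
        """Return (stored value, came_from_vocabulary); free values are kept."""
        canonical = self._by_key.get(value.casefold())
        if canonical is not None:
            return canonical, True
        return value, False


languages = PartialVocabulary(["Dutch", "Danish", "German", "Greek", "Kilivila"])
print(languages.suggest("d"))       # ['Danish', 'Dutch']
print(languages.accept("greek"))    # ('Greek', True)
print(languages.accept("Klingon"))  # ('Klingon', False)
```

Returning a flag with each accepted value lets later curation distinguish vocabulary entries from free-text overrides.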

As new components are created it can be expected that the list of metadata data categories (as listed in Appendix A) will need to be extended.

Quite a few linguistic resources are subject to specific codes of conduct and license agreements. It is the goal of CLARIN to harmonize the digital consent of the end user, resulting in a less intrusive experience (i.e. avoiding constantly recurring "do you agree" dialogs). This would require the integration of an external attribute authority (for details, see the CLARIN AAI documents) with the registry infrastructure. We are currently looking into this issue.

The granularity of metadata-described collections turns out to be an issue on which users need clear guidance. Creating such collection components and actively approaching users to explain their use is an important task.

3 Conclusions

3.1 Complexity

Designing and implementing a full-fledged metadata and registry infrastructure is a large and complex task. Different modules need to be set up to accommodate all these aspects. The wide variety of resources requires an adjustable metadata framework, powerful enough to describe all necessary details. Because different metadata schemes will inevitably be used, enough thought should go into ensuring semantic interoperability. Referring to registered data categories in the metadata (schema) can provide a solution. Realistically, one also needs to think about how to integrate the vast amount of legacy metadata into the new infrastructure. Conversion procedures come to mind as a possible solution.


3.2 Curation

The quality and consistency of the current metadata descriptions are often far from optimal. We noticed this especially when gathering information from several sources for the Virtual Language Observatory. Often only very sparse descriptive metadata exists. In other cases different names are used for similar concepts, descriptions contain typographic errors, or fields were misinterpreted.

Coping with these issues calls for a dual strategy. First, the existing metadata needs to be cleaned up. Without any doubt this is a very demanding task, as even the decision that something can be corrected automatically has to be taken after human inspection and evaluation. Then there are still those errors that can only be fixed manually. As annoying as such a curation operation might sound, it remains the only way to arrive at a consistent repository, which is required for a satisfying end-user experience.

Equally important is trying to prevent the mistakes mentioned above during the creation of new metadata. Although there is no silver bullet here, technology can help by offering high-quality controlled vocabularies, dynamic suggestions based on user input, references to data categories, etc. Repetitive actions should be avoided where possible, as they often lead to a decrease of attention and consequently to errors. On the other hand, the human experts entering the metadata should be made aware of the importance of the meticulousness with which they act. Doing something right from the start is always better than fixing it afterwards. Nevertheless, even with the most sophisticated software and highly accurate experts, one has to foresee the need for metadata curation in advance.
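Part of the curation work can be prepared automatically and then handed to a human for inspection. The sketch below groups spelling variants that differ only in case or whitespace, and proposes corrections for near-misses against a controlled vocabulary using Python's difflib. The sample values, the cutoff of 0.8 and the function names are assumptions made for illustration.

```python
import difflib
from collections import defaultdict


def normalise(value):
    """Collapse whitespace and ignore case when comparing values."""
    return " ".join(value.split()).casefold()


def variant_groups(values):
    """Group distinct spellings that normalise to the same string."""
    groups = defaultdict(set)
    for value in values:
        groups[normalise(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


def review_candidates(values, vocabulary, cutoff=0.8):
    """Propose vocabulary terms for unknown values; a human should confirm."""
    known = {normalise(term): term for term in vocabulary}
    proposals = {}
    for value in set(values):
        key = normalise(value)
        if key in known:
            continue  # already a vocabulary term, nothing to review
        match = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
        if match:
            proposals[value] = known[match[0]]
    return proposals


VALUES = ["Kilivila", "kilivila ", "Kilivlia", "French", "Frnech", "Dutch"]
VOCABULARY = ["Kilivila", "French", "Dutch"]

print(variant_groups(VALUES))
print(review_candidates(VALUES, VOCABULARY))
```

The proposals are deliberately returned rather than applied, in line with the point that automatic corrections should only follow human inspection and evaluation.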

3.3 Communication

With the introduction of new concepts (be it component metadata, data categories, PIDs, …) comes a very clear need to communicate with the future users and implementers. This cannot be addressed with workshops and reference documents alone. Some kind of user- and/or centre-directed consultancy seems to be necessary in order to make progress in an efficient way. However, the energy and resources needed to provide such consultancy should not be underestimated.


4 Appendix A: Metadata data categories id name type 2686 byte order closed 2505 Address open 2562 Annotation Format open 2462 Annotation Level Type open 2506 Annotation Mode open 2507 Annotation Stand-off closed 2508 Annotation Workflow open 2548 Anonymization Flag closed 2623 Anthropological Linguistics simple 2624 Applied Linguistics simple 2653 Audio simple 2453 Availability open 2595 Broadcasting simple 2563 Capture Method open 2464 Channel closed 2564 Character Encoding open 2565 Character Set open 2625 Cognitive Science simple 2509 Completion Year open 2626 Computational Linguistics simple 2566 Condition open 2454 Contact Full Name simple 2622 Controlled Environment simple 2661 Conversation simple 2510 Creation Date open 2511 Creation Tool open 2512 Creator Full Name open 2513 Creator Role open 2465 Delivery Format open 2514 Deployment Tool open 2515 Derivation Date open 2516 Derivation Mode open 2517 Derivation Tool open 2518 Derivation Type open 2519 Derivation Workflow open 2520 Description open 2466 Dialect open 2660 Dialogue simple 2608 Discourse simple 2627 Discourse Analysis simple 2656 Document simple 2467 Domain open 2468 Dominant Language open 2609 Drama simple 2657 Drawing simple 2567 Duration open


2616  Elicited  simple
2521  Email  open
2568  Environment  open
2469  Event Structure  closed
2569  Execution Location  open
2594  Experimental Setting  simple
2593  Face to Face  simple
2619  Family  simple
2455  Fax Number  open
2628  Forensic Linguistics  simple
2452  Free  simple
2522  Funder  open
2629  General Linguistics  simple
2470  Genre  closed
2471  Geographic Coverage  open
2523  Geographical Coordinates  open
2669  Good  simple
2524  Harvesting Date  open
2630  Historical Linguistics  simple
2631  History of Linguistics  simple
2598  Human-machine dialogue  simple
2655  Image  simple
2611  Instrumental Music  simple
2613  Interactive  simple
2476  Interactivity  closed
2525  Interannotator Agreement  open
2477  Involvement  closed
2632  Language Acquisition  simple
2633  Language Documentation  simple
2482  Language ID  constrained
2483  Language In  open
2484  Language Name  open
2485  Language Script  open
2526  Last Update  open
2456  Legal Status  open
2486  Lexical Unit  open
2634  Lexicography  simple
2487  Lexicon Type  open
2457  License  open
2527  Linguistic Subject  closed
2636  Linguistic Theories  simple
2635  Linguistics and Literature  simple
2601  Literature  simple
2528  Location Address  open
2531  Location Continent  closed
2532  Location Country  closed
2533  Location Region  open
2667  Low  simple
2488  Main Level Information  open
2637  Mathematical Linguistics  simple
2570  Media Type  closed


2458  Medium  open
2489  Meta Language  open
2541  Metadata Creation Date  open
2542  Metadata Creator  open
2543  Metadata Language  open
2571  Mime Type  closed
2490  Modalities  open
2659  Monologue  simple
2638  Morphology  simple
2639  Neurolinguistics  simple
2606  Newspaper Article  simple
2618  No-observer  simple
2617  Non-elicited  simple
2614  Non-interactive  simple
2668  Normal  simple
2662  Not a natural format  simple
2491  Number of Languages  open
2572  Operating System  open
2459  Organization  open
2534  Original Source  open
2599  Other  simple
2550  Participant Age  open
2551  Participant Birthdate  open
2552  Participant Code  open
2553  Participant Dominant Language  open
2554  Participant Education  open
2555  Participant Ethnic Group  open
2556  Participant Full Name  open
2557  Participant Name  open
2558  Participant Profession  open
2559  Participant Role  open
2560  Participant Sex  open
2573  Persistent Identifier  open
2610  Personal Notes  simple
2640  Philosophy of Language  simple
2641  Phonetics  simple
2642  Phonology  simple
2665  Planned  simple
2492  Planning Type  closed
2602  Poetry  simple
2604  Popular Fiction  simple
2643  Pragmatics  simple
2460  Price  open
2620  Private  simple
2535  Project Id  open
2536  Project Name  open
2537  Project Title  open
2644  Psycholinguistics  simple
2621  Public  simple
2538  Publication Date  open


2574  Quality  closed
2549  Relation Type  open
2575  Resolution  open
2544  Resource Name  open
2545  Resource Title  open
2605  Ritual/Religious Texts  simple
2576  Running Environment  open
2577  Sample Rate  open
2600  Secondary Document  simple
2578  Segmentation  open
2579  Segmentation Method  open
2645  Semantics  simple
2615  Semi-interactive  simple
2664  Semi-spontaneous  simple
2603  Singing  simple
2580  Size  open
2581  Size Per Language  open
2582  Size Per Representative Level  open
2583  Size Unit  open
2493  Social Context  closed
2561  Social Family Role  open
2646  Sociolinguistics  simple
2494  Source Language  closed
2663  Spontaneous  simple
2539  Start Year  open
2612  Stimuli  simple
2495  Structural Units  open
2496  Sub-level Information  open
2647  Syntax  simple
2607  TV/Radio Features  simple
2497  Tagset  open
2498  Tagset Language  constrained
2499  Target Language  closed
2500  Task  open
2596  Telephone  simple
2461  Telephone Number  open
2658  Text  simple
2648  Text and Corpus Linguistics  simple
2501  Theoretic Model  open
2502  Time Coverage  open
2503  Topic  open
2649  Translating and Interpreting  simple
2650  Typology  simple
2591  Unknown  simple
2592  Unspecified  simple
2540  Update Frequency  open
2546  Url  open
2584  Validation  closed
2585  Validation Level  open
2586  Validation Mode  open


2587  Validation Type  open
2547  Version  open
2670  Very Good  simple
2666  Very Low  simple
2654  Video  simple
2504  Vocabulary Size  open
2597  Wizard-of-oz  simple
2651  Writing Systems  simple
2689  audio file format  open
2687  big endian  simple
2684  bit resolution  constrained
2685  compression  open
2691  duration of effective speech  open
2690  duration of full database  open
2697  home/office  simple
2688  little endian  simple
2692  number of speakers  constrained
2700  public place  simple
2696  recording environment  closed
2693  recording platform hardware  open
2694  recording platform software  open
2698  studio  simple
2699  vehicle  simple


5 Appendix B: EPIC Memorandum of Understanding

Persistent Identifiers for Research Data in Europe

All research disciplines are confronted with an enormous increase in the number of data objects they deal with and in the complexity of the relationships among them. Objects are organized in virtual collections, which are increasingly defined by the needs of the analyzing researcher: individual objects or sets of objects are grouped together in almost arbitrary ways, with metadata used to store all information. Semantic weaving, which is often applied in the humanities and social sciences in particular, relates not only objects but also fragments of objects; in electronic publications, references to fragments of objects are used to prove claims; and interoperability between objects increasingly relies on references to commonly used objects. There are many more examples of the increasing relevance of references for documenting the results of research work, and often they are created by (semi-)automatic means.

Just as we are starting to understand that we need to improve our efforts to preserve digital research data, we need to recognize that references between data objects and their elements are part of the research data infrastructure and need to be preserved as well, so that data remain interpretable in the future.

References have long been known in the research world, for example to cite research results in publications. New, however, are the sheer mass of references and the granularity that we will have to manage. References occur at various levels and for many different purposes within and across digital archives. Thus, in contrast to former referencing systems, we are not speaking about hundreds of references to be managed, but about millions. Mostly, references will not be used through human inspection but simply as part of automatic procedures, i.e. highly performant and highly available resolving systems are required to offer satisfactory services.

Therefore, the Max Planck Society decided to set up a reliable system for creating, resolving and storing persistent identifiers for all its researchers. Therefore, research infrastructures such as CLARIN understand the necessity of providing a system that can be used by all their centres. Therefore, projects in the domain of digital cultural heritage see the need to provide a robust system for their institutions.

To meet these forthcoming requirements from various communities within the research and cultural heritage domain for a robust, performant and available service for registering and resolving persistent identifiers, an initiative has been started to share the burden of offering a persistent identifier system and thereby also to offer a higher chance of persistency. The participating institutions commit themselves to offering a joint and redundant system, the business model of which will be defined by the research communities. This service, based on the Handle System, will offer the required performance and robustness. It is not seen as competing with other offers, but as an additional one.

The participating institutions declare their willingness to work out an appropriate sustainable service, operating and business model which will extend the service already provided by the GWDG for the Max Planck Society. Interested communities will be invited to participate in the discussions about the principles of a shared and therefore highly available and highly persistent service. In the first year EPIC will work on a prototype solution for such a robust system with the intention of turning it into a full production service.


EPIC will take care, in discussions with CNRI, to find a proper basis for the smooth continuation of the Handle System and to establish the required independence. Other well-known institutions are welcome to participate in setting up and maintaining this shared persistent identifier system in Europe. The participating institutes will, together with other stakeholders, take part in and support the founding of an international governing board that guides the further operation and development of the Handle System. The purpose of this is to safeguard the investments of the scientific community in using the Handle System for research data.

EPIC Partners

The following institutions declare by this memorandum that they will start setting up and maintaining a joint service for registering, storing and resolving persistent identifiers based on handles.

GWDG - Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen
SARA - Reken- en Netwerkdiensten
CSC - IT Center for Science Ltd

Interested Communities

The following institutions and communities currently support this initiative and will offer the services to their members:

the Max Planck Gesellschaft
CLARIN Research Infrastructure Initiative
Niedersächsische Staats- und Universitätsbibliothek Göttingen (SUB)


6 Appendix C: CMDI software components

6.1 CMDI Toolkit

Until the full software environment for creating components, profiles and metadata instances is available, a so-called XML toolkit is offered (available at www.clarin.eu/toolkit). Using this toolkit in combination with an XML editor and standard tools such as an XSLT engine, users with some background in XML can (1) create metadata components, (2) group components into a metadata profile, (3) generate an XML schema out of a profile, and (4) use the XML schema to create valid metadata instances. This process is sketched in figure 6. This is a temporary solution, meant to be replaced by a full set of user-friendly editors and registries, as proposed in section 2.2.2.

Figure 6. Relation between components, profile, XML schema and metadata instance. [Diagram not reproduced: a metadata profile (“components à la carte”) is converted by XSLT into an XML schema (“grammar”), which is used with an XML editor and an XML validator to produce a metadata instance (the real resource description).]
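To make the data flow of figure 6 more concrete, the following minimal sketch mimics the steps profile → XML schema → instance check in Python. It is an illustration only: the real toolkit performs the schema generation with XSLT, and the profile layout, the names `PROFILE`, `profile_to_schema` and `missing_elements`, and the component and element names are all hypothetical. The final check is a weak stand-in for real XML schema validation.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical toy profile: component name -> element names it contains.
PROFILE = {
    "GeneralInfo": ["Name", "Title"],
    "Language": ["LanguageName"],
}


def _sequence(parent):
    """Attach xs:complexType/xs:sequence to parent and return the sequence."""
    complex_type = ET.SubElement(parent, f"{{{XS}}}complexType")
    return ET.SubElement(complex_type, f"{{{XS}}}sequence")


def profile_to_schema(profile):
    """Build a minimal xs:schema tree with one CMD root element."""
    schema = ET.Element(f"{{{XS}}}schema")
    root = ET.SubElement(schema, f"{{{XS}}}element", name="CMD")
    root_seq = _sequence(root)
    for component, elements in profile.items():
        comp = ET.SubElement(root_seq, f"{{{XS}}}element", name=component)
        comp_seq = _sequence(comp)
        for element in elements:
            ET.SubElement(comp_seq, f"{{{XS}}}element",
                          name=element, type="xs:string")
    return schema


def missing_elements(profile, instance):
    """List the component/element paths that the instance lacks."""
    return [
        f"{component}/{element}"
        for component, elements in profile.items()
        for element in elements
        if instance.find(f"./{component}/{element}") is None
    ]


schema = profile_to_schema(PROFILE)
declared = [e.get("name") for e in schema.iter(f"{{{XS}}}element")]

instance = ET.fromstring(
    "<CMD><GeneralInfo><Name>demo</Name><Title>Demo corpus</Title>"
    "</GeneralInfo><Language/></CMD>"
)
problems = missing_elements(PROFILE, instance)
```

In practice the generated schema would be handed to a validating XML editor, which is what guarantees that instances are valid; the `missing_elements` helper only shows where such a check fits in the pipeline.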


6.2 CMDI component registry

The CMDI component registry is the central point of storage for metadata profiles and components. The registry can be divided into a browser part and an editor part.

The browser aims to provide easy access to components and profiles so that they can be reused as much as possible. It is a user interface (web application) in which all registered components and profiles can be browsed and searched. Besides this user interface, the registry can be queried through a set of web services, which makes it possible for other tools, e.g. the CMDI Metadata Editor, to interact with the registry.

The editor aims to help users create and edit components and profiles.

The browsing functionality can be viewed at: http://catalog.clarin.eu/ds/ComponentRegistry

Currently a first version is available that allows people to browse the components and profiles already registered. While feedback on this version is being gathered, work continues on the editor part of the registry. Future work will include user login and access control, and the ability to register and edit components and profiles.

Figure 7. Screenshot of the Component Registry GUI


6.3 CMDI Metadata Repository / Service

The MD Repository and the MD Service are the two components on the consumer side of the CMD infrastructure: the Repository is the persistence component, and the MDService exposes the interface for accessing the metadata stored in the repository.

All metadata records that the resource providers offer for harvesting via OAI-PMH are collected and stored in the MD Repository. As these are all XML files, a native XML database (we employ [eXist]) seems a natural technology choice. It can smoothly handle large amounts of highly heterogeneous XML data and allows complex querying over the dataset in native XML query languages (XQuery, XPath 2.0). So far the database has been populated with a substantial test dataset (around 85,000 metadata records) of real data (mainly OLAC and IMDI), essentially the same dataset that is used for populating the Virtual Language Observatory. A deep analysis of the structure and content of this dataset is underway.

The MDService shall accept queries over metadata from the MetadataBrowser (and external applications), translate them into XPath queries and pass them to the Metadata Repository and/or to the Virtual Collection Registry (and possibly to other repositories in the future). Optionally it expands the queries using a separate Semantic Mapping service based on the Relation Registry. It then receives the results and passes them back to the requesting side. It is not yet decided where any format transformation of the returned metadata records (to HTML, or to XML formats other than native CMD) shall happen: in the MD Repository, in the MDService, or even on the client side. The primary functionality of the MDService is thus to accept a metadata query and return the matching metadata records.
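The query translation step can be pictured with a small sketch. It assumes, purely for illustration, that a query is a set of (element path, value) constraints and that records are held in memory; the function names `to_xpath` and `search`, the sample records and their element names are hypothetical, and Python's ElementTree (which supports only a subset of XPath) stands in for the XML database.

```python
import xml.etree.ElementTree as ET


def to_xpath(element_path: str, value: str) -> str:
    """Translate one (element path, value) constraint into an XPath string."""
    if "'" in value:
        # The XPath subset of ElementTree offers no way to escape quotes.
        raise ValueError("single quotes in values are not supported here")
    return f".//{element_path}[.='{value}']"


def search(records, constraints):
    """Return the records that satisfy every constraint (logical AND)."""
    xpaths = [to_xpath(path, value) for path, value in constraints.items()]
    return [
        record for record in records
        if all(record.find(xpath) is not None for xpath in xpaths)
    ]


# Hypothetical sample records, not real CMD instances.
RECORDS = [ET.fromstring(xml) for xml in (
    "<CMD><Resource><Title>Corpus A</Title>"
    "<Language><Name>Dutch</Name></Language></Resource></CMD>",
    "<CMD><Resource><Title>Corpus B</Title>"
    "<Language><Name>German</Name></Language></Resource></CMD>",
)]

hits = search(RECORDS, {"Language/Name": "Dutch"})
titles = [record.findtext(".//Title") for record in hits]
print(titles)
```

A production service would send the generated expression to the repository instead of filtering in memory, and would need proper escaping of user-supplied values.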
However, as we have to deal with a highly heterogeneous dataset with a multitude of schemas (hundreds to thousands), and new ones may even be added dynamically, the MDService/Repository also needs to provide meta-information about the metadata it stores and serves. That is, the MDService has to be able to report (via a separate exposed method) which CMD components and elements are used in the metadata records and with which values, together with frequency information. This enables the querying side to actually form a useful query.

A separate web user interface, the MDBrowser, will also be provided: a web application that allows human users to query CLARIN’s metadata. This web application will itself rely on the MDService’s interface, presenting a functionally equivalent human interaction layer on top of it. Naturally, it has to deliver advanced user interaction capabilities. A combination of catalogue and search functionality is foreseen as the base, extended with advanced techniques already demonstrated in the VLO, in particular faceted search and, in the long term, a GIS perspective.

Currently the specification of the MDService is almost finished, and prototyping work has started on both the MDRepository and the MDService. An important part of the specification is the proposed definition of the query language to be employed on the metadata query interface. It shall be based on SRU/CQL, the successor standard to Z39.50, most probably with certain extensions to meet the specific requirements of CLARIN; these should, however, fit within the extension mechanisms that SRU/CQL already provides as part of the standard. This adapted query language is also being discussed as a universal query language for querying content in the planned distributed/federated content search. This would make it possible to formulate a complex combined metadata/content search in a single query.
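As an illustration of the meta-information the MDService is expected to report (which elements occur with which values, and how often), the sketch below counts (element path, value) pairs over a few in-memory records. The function name `summarize`, the sample records and their element names are assumptions made for this example; the actual method and its output format are defined in the MDService specification, not here.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def summarize(records):
    """Count how often each (element path, text value) pair occurs."""
    counts = Counter()
    for root in records:
        stack = [(root, root.tag)]
        while stack:
            elem, path = stack.pop()
            text = (elem.text or "").strip()
            # Only leaf elements with text carry a metadata value.
            if len(elem) == 0 and text:
                counts[(path, text)] += 1
            for child in elem:
                stack.append((child, f"{path}/{child.tag}"))
    return counts


# Hypothetical sample records, not real CMD instances.
SAMPLE = [ET.fromstring(xml) for xml in (
    "<CMD><Language><Name>Dutch</Name></Language><Genre>Dialogue</Genre></CMD>",
    "<CMD><Language><Name>Dutch</Name></Language><Genre>Poetry</Genre></CMD>",
    "<CMD><Language><Name>German</Name></Language></CMD>",
)]

summary = summarize(SAMPLE)
for (path, value), count in summary.most_common():
    print(f"{path} = {value!r}: {count}")
```

Such counts are also the raw material for faceted search: each element path becomes a facet and each value a selectable option with its frequency.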


6.4 CMDI Metadata Editor

Recently the MPI team launched a first version of a new tool, ARBIL, which allows users to organize data and metadata. It also aims to make metadata creation a much more efficient process. The current version supports the standard IMDI set; support for the special profiles will be available soon. In the near future a new version of ARBIL will support the new CLARIN component metadata format and interact with a CLARIN metadata component and profile registry. ARBIL supports inspecting and editing metadata in a table view and speeds up metadata-related work compared with the old IMDI-Editor, which it will gradually replace. As all code is developed by MPI, it will be Open Source software. More information can be found at http://www.lat-mpi.eu/tools/arbil.

6.5 CMDI Virtual Collection registry

The CMDI virtual collection registry allows CLARIN users to create collections of resources that may be spread over different locations and to register them in the CLARIN metadata infrastructure. A persistent identifier is assigned to each virtual collection. For example, when publishing about their work, a researcher can create a virtual collection of resources and then refer to it in a publication, or such a collection can be used to provide the input to the distributed processing infrastructure of CLARIN. Another example is creating a virtual collection from the result set of a search in the repositories. Such collections are extensional, which means they are an explicit enumeration of resources. The virtual collection registry stores references to external metadata describing those resources, references to the resources themselves, or a copy of the external metadata.

However, a virtual collection may also be defined as an intensional collection. Such a collection provides instructions, in prose and/or in machine-readable form, on how to compile the collection. One possible use case is to build a virtual corpus of texts that are distributed over different locations, selected by metadata properties, and to use it as a sample tailored to a specific research question and basic population (see [Kupietz/Keibel]).

The virtual collection registry consists of a backend and a web-based user interface. The backend has a RESTful web service interface implemented in Java, which allows creating, retrieving, modifying and deleting virtual collections. Since a large number of collections is expected, they are stored in a relational database. The user interface provides easy, web-based access to the registry and connects to it through the web service. Other software could use the web service as well.

Currently, only extensional virtual collections are supported, and metadata harvesting for virtual collections is not implemented. Furthermore, the registry is not yet connected to a persistent identifier service. We plan to connect the registry to the GWDG handle service [GWDG-Handle] and to enhance the registry's features, e.g. by implementing intensional collections, allowing references between collections, and implementing interfaces for metadata harvesting in the CLARIN infrastructure. The structure of the metadata for virtual collections is still being discussed.
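The difference between an explicitly enumerated (extensional) collection and a rule-based (intensional) one, together with the four registry operations, can be sketched as follows. This is a minimal in-memory illustration in Python, not the Java backend described above: the class and method names, the `vc-N` identifiers (placeholders, not real persistent identifiers) and the sample catalogue are all hypothetical.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VirtualCollection:
    name: str
    # Extensional definition: explicit references to resources.
    members: List[str] = field(default_factory=list)
    # Rule-based definition: metadata properties a resource must match.
    rule: Optional[Dict[str, str]] = None

    def resolve(self, catalogue: Dict[str, Dict[str, str]]) -> List[str]:
        """Return the resource references that belong to this collection."""
        if self.rule is None:
            return list(self.members)
        return [
            ref for ref, metadata in catalogue.items()
            if all(metadata.get(k) == v for k, v in self.rule.items())
        ]


class Registry:
    """In-memory stand-in for the create/retrieve/modify/delete interface."""

    def __init__(self):
        self._items: Dict[str, VirtualCollection] = {}
        self._ids = itertools.count(1)

    def create(self, collection: VirtualCollection) -> str:
        collection_id = f"vc-{next(self._ids)}"  # placeholder, not a real PID
        self._items[collection_id] = collection
        return collection_id

    def retrieve(self, collection_id: str) -> VirtualCollection:
        return self._items[collection_id]

    def modify(self, collection_id: str, collection: VirtualCollection) -> None:
        if collection_id not in self._items:
            raise KeyError(collection_id)
        self._items[collection_id] = collection

    def delete(self, collection_id: str) -> None:
        del self._items[collection_id]


# Hypothetical catalogue: resource reference -> metadata properties.
CATALOGUE = {
    "res-1": {"language": "German", "genre": "Newspaper Article"},
    "res-2": {"language": "German", "genre": "Poetry"},
    "res-3": {"language": "Dutch", "genre": "Newspaper Article"},
}

registry = Registry()
cited = registry.create(VirtualCollection("paper-data", members=["res-1", "res-3"]))
sample = registry.create(VirtualCollection("german-texts", rule={"language": "German"}))

cited_members = registry.retrieve(cited).resolve(CATALOGUE)
sample_members = registry.retrieve(sample).resolve(CATALOGUE)
```

The enumerated collection stays fixed once created, whereas the rule-based one is recomputed against whatever the catalogue contains at resolution time, which is why the two kinds need different storage and citation strategies.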


7 References

[EU languages] http://ec.europa.eu/education/languages/languages-of-europe/doc135_en.htm

[Flamenco] Stoica, E. and Hearst, M. A. (2007). Automating creation of hierarchical faceted metadata structures. In Procs. of the Human Language Technology Conference. http://flamenco.berkeley.edu

[Kupietz/Keibel] Kupietz, Marc/Keibel, Holger (2009): The Mannheim German Reference Corpus (DeReKo) as a basis for empirical linguistic research. In: Working Papers in Corpus-based linguistics and Language Education, No. 3 (pp. 53-59). Tokyo: Tokyo University of Foreign Studies (TUFS). http://cblle.tufs.ac.jp/assets/files/publications/working_papers_03/section/053-059.pdf

[language-sites] Van Uytvanck, D., Dukers, A., Ringersma, J., & Trilsbeek, P. (2008). Language-sites: Accessing and presenting language resources via geographic information systems. In Proceedings of the 6th Int’l Conference on Language Resources and Evaluation (LREC 2008).

8 Acronyms

CATCH-plus  Continuous Access To Cultural Heritage  http://www.catchplus.nl/
CERIF  Common European Research Information Format  http://en.wikipedia.org/wiki/CERIF
CMDI  Component Metadata Infrastructure  http://www.clarin.eu/toolkit
DELAMAN  Digital Endangered Languages and Musics Archive Network  http://www.delaman.org/
DFKI  http://registry.dfki.de/
DOBES  Dokumentation Bedrohter Sprachen  http://www.mpi.nl/dobes
DRIVER  Digital Repository Infrastructure Vision for European Research  http://www.driver-repository.eu/
Dublin Core  http://dublincore.org/
ELRA  http://catalog.elra.info/
ENABLER  http://www.ilsp.gr/enabler/
EPIC  European Persistent Identifier Consortium  http://www.pidconsortium.eu/
eXist  http://exist-db.org
GWDG-handle  http://handle.gwdg.de/PIDservice/, http://handle.gwdg.de:8080/pidservice/index.html
IMDI  ISLE Meta Data Initiative  http://www.mpi.nl/imdi/
ISOcat  http://www.isocat.org
LRT inventory  http://www.clarin.eu/inventory
OLAC  Open Language Archives Community  http://www.language-archives.org/
SRU/CQL  Search/Retrieve via URL / Contextual Query Language  http://www.loc.gov/standards/sru/specs/cql.html
TEI  Text Encoding Initiative  http://www.tei-c.org/
WALS  World Atlas of Language Structures  http://wals.info/