LREC 2000 • Athens, Greece

An XML-based Encoding Standard for Language Corpora

Nancy Ide • Vassar College

Patrice Bonhomme • LORIA/CNRS

Laurent Romary • LORIA/CNRS

XCES

The Corpus Encoding Standard

• EAGLES standard

• encoding conventions for corpora used in language engineering research

• an SGML application

• TEI conformant

The CES defines...

• requirements for increasing levels of encoding

• a suite of DTDs for encoding basic document structure and linguistic annotation

• a corresponding data architecture for linguistic corpora

XCES

• instantiation of the CES DTDs in XML

• same tags, data architecture as the CES

• motivation: use in creating the American National Corpus (ANC)
  – Macleod, Ide, and Grishman, LREC 2000

XML provides more than SGML

• better linkage mechanisms

• XSLT for document access and transformation

• XML schemas

• provision for accessing all or part of multiple DTDs

Minimal XML conversion

• adaptation of DTDs
  – eliminate inclusion exceptions
  – make mixed-content models XML-compliant

• adaptation of the CES mechanism for inter-document reference
  – meet the specifications of XML pointer and linking mechanisms

Additional Adaptations

• validate the CES data architecture by ensuring conformance to other XML specifications
  – XSL Transformation Language
  – XQL

• exploit XML mechanisms for combining all or part of documents described by different DTDs

• instantiate the XCES DTDs using XML schemas

The CES/XCES Data Architecture

• remote markup, or "stand-off" model

• annotations maintained in separate documents that point back to the original

• yields a “hyper-document” composed of the original text and all annotations

• increasingly accepted as the appropriate architecture for language resources

Use of links in CES and XCES

• link corresponding segments of two or more aligned primary texts

• link annotation documents to a base document containing the primary text
  – e.g., morpho-syntactic information linked to the string of characters in the original text to which it applies
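
As a rough illustration of the stand-off idea (not the XCES format itself), the sketch below leaves the primary text untouched and keeps annotations as separate records that point back into it by character offset. The record layout, the field names, and the sample sentence are assumptions made for the example.

```python
from dataclasses import dataclass

# Primary text: never modified by annotation (sample sentence is assumed).
PRIMARY_TEXT = "It was a bright cold day"


@dataclass(frozen=True)
class Annotation:
    """A stand-off record pointing back into the primary text (hypothetical layout)."""
    start: int  # 0-based offset of the first character
    end: int    # offset one past the last character
    base: str   # lemma
    ctag: str   # corpus tag


# The annotation "document", stored apart from the text it describes.
ANNOTATIONS = [
    Annotation(0, 2, "it", "PPER3"),
    Annotation(3, 6, "be", "PAST3"),
]


def resolve(text, ann):
    """Follow the link from an annotation to the characters it applies to."""
    return text[ann.start:ann.end]


def hyper_document(text, annotations):
    """Combine text and annotations into (span, base, ctag) triples."""
    return [(resolve(text, a), a.base, a.ctag) for a in annotations]
```

Because the annotations only hold offsets, several independent annotation sets can point at the same primary text without touching it.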

XCES Requirements for Linking

• must be able to point to other documents

• must be able to point to tagged elements as well as locations within tagged elements
  – eliminate the need to tag every element that might be referenced
  – eliminate the need for IDs on every referenced element, as required in SGML

XML Path Language (XPath)

• concise notation for element localization in the document tree
  – /div/p[2]/s[3] - third sentence of the second paragraph in the top-level <div>
  – /descendant::p - all <p> elements

• functions for accessing characters within elements
  – substring(/p/s[2]/text(),10,12)
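
The path expressions above can be tried out with Python's standard library, with the caveat that `xml.etree.ElementTree` implements only a subset of XPath: paths are written relative to the root element, and `substring()` is not available. The sketch below therefore mimics `substring()` with a small helper (a hypothetical name, valid for integer arguments only); the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample document with two paragraphs.
DOC = """<div>
  <p><s>First paragraph, first sentence.</s></p>
  <p>
    <s>One.</s>
    <s>Two.</s>
    <s>The third sentence of the second paragraph.</s>
  </p>
</div>"""

root = ET.fromstring(DOC)  # root is the <div> element

# /div/p[2]/s[3]: written relative to the root <div> for ElementTree.
third = root.find("./p[2]/s[3]")

# /descendant::p: ElementTree spells the descendant axis as ".//".
all_p = root.findall(".//p")


def xpath_substring(text, start, length):
    """Mimic XPath substring() for integer arguments (1-based start, then a length)."""
    begin = max(start - 1, 0)
    end = max(start - 1 + length, 0)
    return text[begin:end]


word = xpath_substring(third.text, 5, 5)
```

Note that the second and third arguments of `substring()` are a 1-based start position and a length, not a start and end offset.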

XPointer

• extends XPath syntax to allow:
  – addressing points and ranges as well as nodes
  – locating information by string matching
  – use of addressing expressions in URI-references as fragment identifiers

XLink

• uni- or multi-directional links

• can specify how link is to be activated
  – by hand or automatically by the browser

• can specify what to do with the target fragment
  – replace it or insert it into the source document

Links to External Documents

• None in SGML

• HyTime/TEI invented "doc" attribute

• CES used "doc" with inheritance to avoid repetition of the attribute
  – not supported by SGML processors

• XML: XLink and xml:base attribute

Linking Mechanisms (A brief history)

HyTime/TEI:

<tok from="CHILD (1) (2) STRLOC (10)" to="CHILD (1) (2) STRLOC (22)" doc="doc.xml">

CES:

<tok from="1.2\10" to="1.2\22" doc="doc.xml">

XCES using XLink:

<tok xlink:href="http://www.loria.fr/doc.xml#xptr(substring(/p/s[2]/text(),10,12))">

Use of xml:base

<chunk xml:base="http://www.loria.fr/doc.xml#">
  <tok xlink:href="xptr(substring(/p/s[2]/text(), 10, 12))"/>
  <tok xlink:href="xptr(substring(/p/s[2]/text(), 24, 4))"/>
</chunk>
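
A minimal sketch of how a consumer might expand these links, assuming the convention the slide implies: the inherited `xml:base` value is simply prepended to each `xlink:href`. This is an assumption, not the behaviour of a conforming XML Base processor; strict RFC 3986 resolution (e.g. `urllib.parse.urljoin`) would discard the base's trailing `#` and treat the `xptr(...)` string as a path. The `xmlns:xlink` declaration and the function name are added here only so the fragment parses and runs.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XLINK_NS = "http://www.w3.org/1999/xlink"

# The chunk from the slide, plus an xlink namespace declaration (added for parsing).
CHUNK = """<chunk xmlns:xlink="http://www.w3.org/1999/xlink"
       xml:base="http://www.loria.fr/doc.xml#">
  <tok xlink:href="xptr(substring(/p/s[2]/text(), 10, 12))"/>
  <tok xlink:href="xptr(substring(/p/s[2]/text(), 24, 4))"/>
</chunk>"""


def full_hrefs(xml_text):
    """Prepend the chunk's xml:base to each token's href (assumed convention)."""
    chunk = ET.fromstring(xml_text)
    base = chunk.get("{%s}base" % XML_NS, "")
    return [base + tok.get("{%s}href" % XLINK_NS) for tok in chunk.iter("tok")]


for href in full_hrefs(CHUNK):
    print(href)
```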

XSLT

• a powerful tree-traversal language

• translate any XML document into another document in any form
  – HTML
  – XML
  – plain text
  – etc.

• most to offer for handling annotated corpora

XSLT Capabilities

• selection of elements or portions of element content using the XPath syntax

• rearrangement, transformation of extracted information (text content, element names, etc.) in the target document

• addition of information to the target document

A Simple Example

xcesAna document:

<?xml version="1.0"?>
<chunk type="BODY" lang="en"
       xml:base="http://www.cs.vassar.edu/~ME/Oen.xcesDoc#">
  <par xlink:href="xptr(//p[1])">
    <s xlink:href="xptr(//p/s[1])">
      <tok type="WORD"
           xlink:href="xptr(substring(//p/s[1]/text(),1,2))">
        <orth>It</orth>
        <disamb>
          <base>it</base>
          <msd>Pp3ns</msd>
          <ctag>PPER3</ctag>
        </disamb>
        <lex>
          <base>it</base>
          <msd>Pp3ns</msd>
          <ctag>PPER3</ctag>
        </lex>
      </tok>
      ...

XSLT document (creates HTML):

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Wrap the output in a minimal HTML page -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <!-- For each token in a paragraph, emit orth|base|ctag followed by a space -->
  <xsl:template match="//par">
    <xsl:for-each select=".//tok">
      <xsl:value-of select="orth"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="disamb/base"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="disamb/ctag"/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

Result

It|it|PPER3 was|be|PAST3 a|a|DINT bright|bright|ADJE cold|cold|ADJE day|day|NN…
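
For readers without an XSLT processor at hand, the same selection can be approximated in Python with the standard library. This is a sketch of an equivalent extraction, not the stylesheet itself: it skips the HTML wrapper, and the two-token input is a cut-down annotation document assembled from the values shown in the result line (the function name is hypothetical).

```python
import xml.etree.ElementTree as ET

# Cut-down annotation document; token values are taken from the result shown above.
ANA = """<chunk>
  <par>
    <s>
      <tok><orth>It</orth><disamb><base>it</base><ctag>PPER3</ctag></disamb></tok>
      <tok><orth>was</orth><disamb><base>be</base><ctag>PAST3</ctag></disamb></tok>
    </s>
  </par>
</chunk>"""


def tokens_to_line(xml_text):
    """Emit orth|base|ctag for every <tok>, separated by single spaces."""
    root = ET.fromstring(xml_text)
    parts = []
    for tok in root.iter("tok"):
        orth = tok.findtext("orth", default="")
        base = tok.findtext("disamb/base", default="")
        ctag = tok.findtext("disamb/ctag", default="")
        parts.append("|".join((orth, base, ctag)))
    return " ".join(parts)


print(tokens_to_line(ANA))
```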

Possibilities

• create new documents containing selected annotations

• transduce XML encoded documents to tool-internal formats

• generate a new document with all phonemes that appear in a certain context (or all the unique contexts of a certain phoneme), etc.

XML Schemas

• constrain and document the meaning, usage and relationships of the constituent parts of XML documents
  – datatypes
  – elements and their content
  – attributes and their values

• provide default values for attributes and elements

Impact for language resources

• provide means to define an abstract data model for a class of documents
  – e.g., data model for annotations and annotated objects
  – one of the most important tasks for corpus and tool creators

• provide for much tighter validation of document form and content

Capabilities

• different attribute declarations and/or content models can apply to elements with the same name in different contexts
  – allows for more tightly constrained content models than possible with DTDs
  – e.g., <name> in header and <name> in text likely have different content constraints

• define equivalence classes for groups of elements and/or attributes
  – may be used in the same ways as defined for a particular named element

• CES used parameter entities to make a class of phrase-level objects (for example)
  – a "kludge"

• constrain attribute or element values (or combinations) to be unique, e.g.,
  – only one entry in a computational lexicon can be defined with a given word form
  – only one paragraph can have an attribute indicating that it is the 23rd
  – only one disambiguated form is given for each token
  – only one correspondence for a given item in an alignment document

Useful for error detection and prevention
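
A schema processor would enforce such constraints declaratively; to make the idea concrete, the sketch below checks two of them by hand in Python. The attribute name `n` for the paragraph number, the sample document, and both function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: paragraph number 23 is used twice, and "It" has two <disamb>.
SAMPLE = """<body>
  <p n="22"/><p n="23"/><p n="23"/>
  <tok><orth>It</orth><disamb/><disamb/></tok>
  <tok><orth>was</orth><disamb/></tok>
</body>"""


def duplicate_values(xml_text, element, attribute):
    """Return the attribute values that occur on more than one <element>."""
    root = ET.fromstring(xml_text)
    values = [el.get(attribute) for el in root.iter(element)]
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)


def tokens_with_several_disamb(xml_text):
    """Return the <orth> text of tokens carrying more than one <disamb>."""
    root = ET.fromstring(xml_text)
    return [
        tok.findtext("orth", default="")
        for tok in root.iter("tok")
        if len(tok.findall("disamb")) > 1
    ]
```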

• establish dependencies based on element or attribute values, for example:
  – prevent nouns from being assigned a tense
  – specify that tokens with type attribute value PUNCT include only <orth> elements containing specific characters
  – specify annotation labels elsewhere, constrain element content to these values only
    • e.g., constrain the values of the <msd> element in an XCES annotation document to the EAGLES morpho-syntactic specifications

Another means for error control and validation
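
The value-dependent checks listed above can likewise be sketched procedurally. In the example below the set of allowed `<msd>` values is a one-item placeholder rather than the EAGLES specification, and the punctuation set and function name are also assumptions; a schema would state the same rules declaratively.

```python
import xml.etree.ElementTree as ET

ALLOWED_MSD = {"Pp3ns"}          # placeholder, NOT the real EAGLES inventory
PUNCT_CHARS = set(".,;:!?")      # assumed set of permitted punctuation characters


def check_token(tok):
    """Return a list of problems found in one <tok> element."""
    problems = []
    orth = tok.findtext("orth", default="")
    # Dependency on an attribute value: PUNCT tokens may only contain punctuation.
    if tok.get("type") == "PUNCT":
        if not orth or not set(orth) <= PUNCT_CHARS:
            problems.append("PUNCT token has non-punctuation orth: %r" % orth)
    # Element content restricted to labels defined elsewhere.
    for msd in tok.iter("msd"):
        if msd.text not in ALLOWED_MSD:
            problems.append("unknown msd value: %r" % msd.text)
    return problems


good = ET.fromstring('<tok type="WORD"><orth>It</orth><lex><msd>Pp3ns</msd></lex></tok>')
bad = ET.fromstring('<tok type="PUNCT"><orth>day</orth><lex><msd>XYZ</msd></lex></tok>')
```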

Why is XML a good thing?

• search, extraction, and transformation capabilities answer most current and foreseen needs for corpus-based language engineering

• means to fully implement the CES/XCES data architecture

• processing tools for XML recommendations are freely distributed
  – no need for costly and time-consuming tool development

XCES and its future

• CES and XCES have been developed for and by the language engineering community

• At present, cover
  – various features in written text
  – morpho-syntactic annotation
  – alignment information

• relatively stable and agreed-upon within the community

• coverage will continue to evolve

• currently working with different groups to implement encoding guidelines for
  – additional written text features
  – computational lexicons
  – discourse and dialogue
  – co-reference
  – speech and its various levels of annotation and representation
  – Asian character support

Information

http://www.cs.vassar.edu/CES

and

http://www.cs.vassar.edu/XCES

[email protected] or [email protected]@loria.fr

[email protected]