annotation : the scope er so steven said it was not a property of um annotated corpora verb phrase...


Upload: cornelius-morrison

Post on 03-Jan-2016


Page 1

Annotation : the scope

er so Steven said it was not a property of um annotated corpora

verb phrase

Anaphoric reference

noun phrase

named entity

passing truck

speech act

disfluency

intonation pattern

context

participant

Page 2

Some XML annotations

<phon addr="1:10">st</phon><phon addr="12:5">i</phon><phon addr="23:5">v</phon><phon addr="30:2">n</phon>

<w pos="NP1">Steven</w>

<person ident="SB01" gender="M"><birthDate>12/03/1956</birthDate>.... </person>

<name persKey="SB01">steven</name>

<u who="SB01" start="0:1">er so steven said it was <emph>not</emph> a property of annotated corpora</u>

Page 3

Transcribing speech

normalization issues

ease of reading vs accuracy

interpretation vs prosody

analogous to problems of handling digitized images

Page 4

The Spoken base tagset

components : <u> <event> <kinesic> <vocal> <pause> <shift>

contextual information in header <settingDesc> <particDesc>

facilities for synchronization and timing

Page 5

Features of speech

lexical <u>

non-lexical <vocal>

anthropophenic non-anthropophenic <kinesic>

communicative non-communicative <event>

transcribed events

Page 6

Utterances

Basic unit of discourse, corresponding to speaker turns

Optionally grouped into higher-level divisions (<div>s), e.g. to mark discourse function

Linked by who attribute to <person> description in header

Page 7

Vocals and events

Empty elements are used to mark paralinguistic phenomena

<u who="Jan">This is just delicious</u>
<event desc='telephone rings'/>
<u who="Kim">I'll get it</u>
<u who="Tom">I used to <vocal desc="cough"/> smoke a lot</u>
<u who="Bob"><vocal desc="sniff"/>He thinks he's tough</u>
<vocal who="Ann" desc="snorts"/>
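
Because the paralinguistic markers are empty elements, they can be read back in document order alongside the utterances. The following is a minimal Python sketch of that idea, not part of the original slides: the `<body>` wrapper, the `SAMPLE` string and the `timeline` function are assumptions made for illustration, and the `desc` attribute is used as on this slide.

```python
import xml.etree.ElementTree as ET

# Sample adapted from the slide; the <body> root is a hypothetical wrapper
# added only so that the fragment parses as one document.
SAMPLE = """<body>
<u who="Jan">This is just delicious</u>
<event desc="telephone rings"/>
<u who="Kim">I'll get it</u>
<u who="Tom">I used to <vocal desc="cough"/> smoke a lot</u>
<u who="Bob"><vocal desc="sniff"/>He thinks he's tough</u>
<vocal who="Ann" desc="snorts"/>
</body>"""


def timeline(xml_text):
    """Return (kind, speaker, description) tuples in document order."""
    rows = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "u":
            # itertext() collects the words on both sides of embedded markers
            words = " ".join("".join(el.itertext()).split())
            rows.append(("u", el.get("who"), words))
        elif el.tag in ("vocal", "event", "kinesic"):
            rows.append((el.tag, el.get("who"), el.get("desc")))
    return rows


for row in timeline(SAMPLE):
    print(row)
```

A `<vocal>` nested inside a `<u>` carries no `who` of its own here, so the sketch reports its speaker as `None`; inheriting the speaker from the enclosing utterance would be a natural refinement.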

Page 8

Voice quality and prosody

The <shift> element is used to mark changes in voice quality

Other prosodic features may be marked using specific kinds of <seg> or entity refs

<u who="LB"><shift feature="loud" new="f"/>Elizabeth</u>
<u who="EB">Yes</u>
<u who="LB"><shift/>Come and try this <pause/><shift feature="loud" new="ff"/>come on</u>

Page 9

Another example

<u who="MAR">you never <pause/> take this cat for show and tell <pause dur='5'/> meow meow</u>
<u who="ROS">yeah well I dont want to</u>
<event><desc>toy cat has bell in tail which continues to make a tinkling sound</desc></event>
<vocal who="MAR"><desc>miaows</desc></vocal>
<u who="ROS">because it is so old</u>
<u who="MAR">how <reg>about</reg> your cat <pause/> yours is new<kinesic><desc>shows Father the cat</desc></kinesic></u>
<u who="FAT" trans="pause">that<pause/> darling</u>
<u who="MAR"><s>no mine isnt old</s> <s>mine is just um a little dirty</s></u>

Page 10

Participant Description

<person xml:id="P1" sex='2' age='mid'> <p>Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme.</p> </person>

<person xml:id="P1" sex="2" age='mid'>
 <birth date='1950-01-12'>
  <date>12 January 1950</date>
  <name type="place">Shropshire, UK</name>
 </birth>
 <firstLang>English</firstLang>
 <langKnown>French</langKnown>
 <residence>Long term resident of Hull</residence>
 <education>University postgraduate</education>
 <occupation>Unknown</occupation>
 <socecstatus source="PEP" code="B2"/>
</person>

Page 11

Setting Description

eg from P2

<settingDesc>
 <setting who="#P1 #P2">
  <name type="city">Bedford</name>
  <name type="region">UK: South East</name>
  <date value="1989">early spring, 1989</date>
  <locale>rug of a suburban home</locale>
  <activity>playing</activity>
 </setting>
 <setting who="#P3">
  <name type="city">Bedford</name>
  <name type="region">UK: South East</name>
  <date value="1989">early spring, 1989</date>
  <locale>at the sink</locale>
  <activity>washing-up</activity>
 </setting>
 <setting who="#P4">
  <name type="place">London, UK</name>
  <time>unknown</time>
  <locale>broadcasting studio</locale>
  <activity>radio performance</activity>
 </setting>
</settingDesc>

Page 12

Timing

Pausing: use <pause> element

Duration: use dur attribute

Overlap: use trans attribute

Page 13

Overlap

A: Have you heard the
B: the election results?
A: its a disaster
B: its a miracle

<u xml:id="A1" who="A">Have you heard the</u>
<u xml:id="B1" who="B" trans="latching">the election results? </u>
<u xml:id="A2" who="A" trans="pause">its a disaster</u>
<u xml:id="B2" who="B" trans="overlap">its a miracle </u>

Page 14

Linking, segmentation, alignment

Provides generic segmentation elements

Provides extensive set of attributes for linkage, correspondence, synchronization, aggregation, alternation, etc.

Documents generic pointing mechanism

Page 15

Generic segmentation elements

<seg> for arbitrary (nesting) segmentation

<s> for end-to-end segmentation

use type attribute to subcategorise

<anchor> for points

Segmentation is the key to successful linking and analysis

Page 16

Clustering

(Difficulty (is being expressed) with ((the method) (to be used)))

<s>Difficulty <seg>is being expressed</seg> with <seg><seg>the method</seg> <seg>to be used</seg></seg></s>
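
The correspondence between the bracketed form and the nested <seg> markup can be checked mechanically. This is a small Python sketch, assuming the markup exactly as shown on the slide; the function name `bracket` is hypothetical.

```python
import xml.etree.ElementTree as ET

S = ("<s>Difficulty <seg>is being expressed</seg> with "
     "<seg><seg>the method</seg> <seg>to be used</seg></seg></s>")


def bracket(el):
    """Render an element and its nested children as a bracketed string."""
    parts = [el.text or ""]
    for child in el:
        parts.append(bracket(child))      # recurse into the nested segment
        parts.append(child.tail or "")    # text following the child
    return "(" + "".join(parts) + ")"


print(bracket(ET.fromstring(S)))
```

Each element, whether <s> or <seg>, contributes one pair of brackets, so the nesting depth of the markup and of the bracketing are the same.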

Page 17

discontinuous segments

fundamental problem

first segment, then link, using stand-off

“You put it,” Quill reminded him, “in the safe.”

<s xml:id="s1">"You put it,"</s> <s xml:id="s2">Quill reminded him,</s> <s xml:id="s3">"in the safe."</s>

Page 18

discontinuous segments

can also use PART attribute to indicate that segments are incomplete

“You put it,” Quill reminded him, “in the safe.”

<s xml:id="s1" next="#s3">"You put it,"</s> <s xml:id="s2">Quill reminded him,</s> <s xml:id="s3" prev="#s1">"in the safe."</s>

Page 19

discontinuous segments

“You put it,” Quill reminded him, “in the safe.”

<s xml:id="s1">"You put it,"</s> <s xml:id="s2">Quill reminded him,</s> <s xml:id="s3">"in the safe."</s>

<join targets="#s1 #s3" result="s"/>
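
One way to see what the stand-off <join> asserts is to resolve its pointers and reassemble the virtual element. The Python sketch below is illustrative only: the `<p>` wrapper, the `DOC` string and the `resolve_joins` function are assumptions, and the `targets` attribute name is taken from this slide.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <p> root is a hypothetical wrapper so that the fragment parses.
DOC = """<p>
<s xml:id="s1">"You put it,"</s>
<s xml:id="s2">Quill reminded him,</s>
<s xml:id="s3">"in the safe."</s>
<join targets="#s1 #s3" result="s"/>
</p>"""


def resolve_joins(xml_text):
    """Return (result type, joined text) for every <join> in the document."""
    root = ET.fromstring(xml_text)
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    joined = []
    for join in root.iter("join"):
        ids = [ref.lstrip("#") for ref in join.get("targets").split()]
        text = " ".join("".join(by_id[i].itertext()).strip() for i in ids)
        joined.append((join.get("result"), text))
    return joined


print(resolve_joins(DOC))
```

The original segments are left untouched; the joined sentence exists only as the result of following the pointers.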

Page 20

Translation pairs

<s xml:id="s1" corresp="#s2" xml:lang="EN">For a long time I used to go to bed early</s>
<s xml:id="s2" corresp="#s1" xml:lang="FR">Longtemps je me couchais de bonne heure</s>

and/or....

<correspGrp type="trans"><link targets="#s1 #s2"/></correspGrp>

Page 21

Synchronization

of whole elements

<u xml:id="A2" who="A" synch="#B2">its a disaster</u>
<u xml:id="B2" who="B">its a miracle</u>

of points in time

<u xml:id="A1" who="A">Have you heard <anchor xml:id="A0"/>the</u>
<u xml:id="B1" who="B" start="#A0"><anchor xml:id="B0"/>the election results? yes</u>
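
Synchronization by pointer only works if every pointer resolves, and an identifier typed with a letter O where a zero was meant is easy to miss by eye. The following Python sketch shows one possible check; the `<body>` wrapper, the names `GOOD`, `BAD` and `dangling_pointers`, and the selection of attributes in `POINTER_ATTRS` (those used on these slides) are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Pointer-bearing attributes that appear in these slides (an assumed subset)
POINTER_ATTRS = ("synch", "start", "corresp", "next", "prev")

# Hypothetical <body> wrapper around the slide's example
GOOD = ('<body><u xml:id="A1" who="A">Have you heard <anchor xml:id="A0"/>the</u>'
        '<u xml:id="B1" who="B" start="#A0">the election results? yes</u></body>')
# Same text, but the anchor is labelled with a letter O instead of a zero
BAD = GOOD.replace('xml:id="A0"', 'xml:id="AO"')


def dangling_pointers(xml_text):
    """List (element, attribute, pointer) for local pointers with no target."""
    root = ET.fromstring(xml_text)
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    missing = []
    for el in root.iter():
        for attr in POINTER_ATTRS:
            for ref in (el.get(attr) or "").split():
                if ref.startswith("#") and ref[1:] not in ids:
                    missing.append((el.tag, attr, ref))
    return missing


print(dangling_pointers(GOOD))
print(dangling_pointers(BAD))
```

Only local pointers (those beginning with `#`) are checked; pointers into other documents would need to be fetched and are ignored here.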

Page 22

XML semantics are limited

<s id="S1" head="V1"> <np id="N1">annotated corpora</np> <vp id="V1">rule</vp> <tq id="T1">okay</tq></s>

The containment relation is implicit, so we do not need to say

<vp id="V1" partOf="S1">rule</vp>

though we may wish to say

<vp id="V1" role="head">rule</vp>

Page 23

Analytic mechanisms

Specific kinds of segment for linguistic analyses

Why is there no tag for noun?

Specialized interpretive pointers (<span> and <spanGrp>)

The ana attribute and its possible targets

<interp> and <interpGrp>

feature systems <fs> and <fsd>

Page 24

Arbitrary characterizations

The <span> points into a stretch of a text and characterizes it in some way

Target may be anything you can reach by an XPath

<spanGrp resp="#LB" type="thematic">
 <span value="ships" from="#P1" to="#P2"/>
 <span value="shoes" from="#P4" to="#P8"/>
 <span value="sealing wax" from="http://www.somewhere.com/waxinit.xml#P45"/>
</spanGrp>

Page 25

<w ana="#VVD">annotated</w>

More detailed analysis

the ana attribute is of type IDREFS

what does VVD identify?

a prose description

an <interp> element

a feature structure

Page 26

using interp...

<w ana="#VVD">annotated</w>
<w ana="#NN2">corpora</w>

<interp xml:id="VVD" type="lexicalClass" value="verbPastTense"/>
<interp xml:id="NN2" type="lexicalClass" value="nounPlural"/>
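
Following the ana pointers from each word to its <interp> is a simple lookup. The Python sketch below shows this under stated assumptions: the `<text>` wrapper, the `DOC` string and the `tag_words` function are hypothetical, and each `ana` value is treated as a space-separated list of local `#id` pointers.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <text> root is a hypothetical wrapper so that the fragment parses.
DOC = """<text>
<w ana="#VVD">annotated</w>
<w ana="#NN2">corpora</w>
<interp xml:id="VVD" type="lexicalClass" value="verbPastTense"/>
<interp xml:id="NN2" type="lexicalClass" value="nounPlural"/>
</text>"""


def tag_words(xml_text):
    """Pair each word with the values of the interps its ana points at."""
    root = ET.fromstring(xml_text)
    interps = {el.get(XML_ID): el for el in root.iter("interp")}
    tagged = []
    for w in root.iter("w"):
        refs = w.get("ana", "").split()
        tagged.append((w.text, [interps[r[1:]].get("value") for r in refs]))
    return tagged


print(tag_words(DOC))
```

Because the analysis lives in the <interp> elements, changing the description of VVD in one place changes it for every word that points there.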

Page 27

hierarchic grouping of interps

nouns can be common or proper

nouns can be singular or plural

<interpGrp value="nominal">
 <interpGrp value="common">
  <interp value="singular"/>
  <interp value="plural"/>
 </interpGrp>
</interpGrp>

Page 28

for example...

<interp xml:id='VVD'>
 <desc>verb past tense</desc>
</interp>
<interp xml:id='NN2'>
 <desc>plural common noun</desc>
</interp>

<w ana='#VVD'>annotated</w>

<w ana='#NN2'>corpora</w>

Page 29

Encoding analyses

Linguistic Annotation Frameworks and standards

the philosopher's stone

Generic feature structure system: any analysis can be represented by bundles of named feature-value pairs

embedded within text or indirectly linked

Ancillary feature system declaration

Theoretically neutral (?) pragmatic solution to real world problem of intermachine communication

Page 30

Feature structures

a feature structure consists of a bundle of features

a feature has a name and a value

values may be binary switches, symbols, strings, feature structures, or operations on them

bundling may be constrained in various (not necessarily hierarchic) ways

Page 31

... or, in XML:

The <fs> element represents a (typed) feature structure, which contains...

One or more <f> elements, each of which has

a name

a value

Feature values may be

atomic: <binary> <string> <numeric> <symbol>

complex: <fs> <coll>

expressions: <vNot> <vAlt> <vColl> ... or <var>

Page 32

Using a feature structure...

<w ana='#NN2'>corpora</w>

<fs xml:id='NN2'>
 <f name='class'><symbol value='noun'/></f>
 <f name='number'><symbol value='plural'/></f>
 <f name='proper'><binary value="false"/></f>
</fs>
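
Since a feature structure is a bundle of named values, it maps naturally onto a dictionary. The Python sketch below illustrates that reading for the atomic and nested cases only; the function name `value_of` and the set of strings accepted as a true <binary> value are assumptions, not something the slides specify.

```python
import xml.etree.ElementTree as ET

FS = """<fs xml:id="NN2">
  <f name="class"><symbol value="noun"/></f>
  <f name="number"><symbol value="plural"/></f>
  <f name="proper"><binary value="false"/></f>
</fs>"""


def value_of(el):
    """Convert an <fs> or an atomic value element to a Python value."""
    if el.tag == "fs":
        # each <f> contributes one name/value pair; nested <fs> recurse
        return {f.get("name"): value_of(f[0]) for f in el.findall("f")}
    if el.tag == "binary":
        # assumed truthy spellings; the slides use both "yes" and "false"
        return el.get("value") in ("true", "yes", "1")
    if el.tag == "numeric":
        return float(el.get("value"))
    if el.tag == "string":
        return el.text or ""
    if el.tag == "symbol":
        return el.get("value")
    raise ValueError("unsupported value element: " + el.tag)


print(value_of(ET.fromstring(FS)))
```

Collections, alternations and the other feature expressions are deliberately left out; the sketch raises an error rather than guessing at them.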

Page 33

Features: simple values

binary, numeric, symbol or string

constraints may be declared in FSD

<fs type='word structure'>
 <f name='lemma'><string>goose</string></f>
 <f name='category'><symbol value='noun'/></f>
 <f name='barLevel'><numeric value='0'/></f>
 <f name='number'><symbol value='plural'/></f>
</fs>

lemma: goose,
category: noun,
number: plural,
bar level: 0

Page 34

Features: plus or minus

<fs type='phonetic segment'>
 <f name='segment'><binary value="yes"/></f>
 <f name='consonantal'><binary value="yes"/></f>
 <f name='vocalic'><binary value="no"/></f>
 <f name='nasal'><binary value="no"/></f>
 <!-- .... -->
 <f name='coronal'><binary value="yes"/></f>
 <f name='continuant'><binary value="yes"/></f>
 <f name='delayedRelease'><binary value="yes"/></f>
 <f name='strident'><binary value="yes"/></f>
</fs>

[segment +, consonantal +, vocalic -, nasal -, low -, high -, back -, round -, anterior +, coronal +, continuant +, delayed release +, strident +]

Page 35

Alternate values

<w ana="#VVD">annotated</w>

<fs xml:id="VVD" type="lexical">
 <f name="class">
  <vAlt mutExcl="Y">
   <symbol value="verb"/>
   <symbol value="adj"/>
  </vAlt>
 </f>
 ...
</fs>

Page 36

for example...

<fs>
 <f name="cat"><symbol value="verb"/></f>
 <f name="aux"><string value="avoir"/></f>
 <f name="mode"><symbol value="indicatif"/></f>
 <f name="tense"><symbol value="present"/></f>
 <f name="pers">
  <vAlt>
   <symbol value="1"/>
   <symbol value="3"/>
  </vAlt>
 </f>
 <f name="num"><symbol value="sing"/></f>
</fs>

“mange”
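
An alternation packs several readings into one structure: the form above is first or third person singular. A minimal Python sketch of unpacking a <vAlt> into its separate readings follows; it uses a shortened copy of the example, and the names `FS` and `readings` are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example
FS = """<fs>
  <f name="cat"><symbol value="verb"/></f>
  <f name="tense"><symbol value="present"/></f>
  <f name="pers"><vAlt><symbol value="1"/><symbol value="3"/></vAlt></f>
  <f name="num"><symbol value="sing"/></f>
</fs>"""


def readings(fs_text):
    """Expand every <vAlt> so that each result has one value per feature."""
    fs = ET.fromstring(fs_text)
    names, options = [], []
    for f in fs.findall("f"):
        value = f[0]
        # a <vAlt> offers each of its children; anything else is one choice
        choices = list(value) if value.tag == "vAlt" else [value]
        names.append(f.get("name"))
        options.append([c.get("value") for c in choices])
    return [dict(zip(names, combo)) for combo in itertools.product(*options)]


for reading in readings(FS):
    print(reading)
```

With several alternations in one structure the number of readings multiplies, which is one reason to keep them packed in the encoding.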

Page 37

Value libraries

Collections of re-usable feature-structure components, each with a unique key

May be referenced from an <fs> (using feats attribute) or an <f> (using fVal attribute)

NB effect is to transclude (embed a copy of) the referenced item

Not to be confused with....

Page 38

for example

<fLib type="agreement features">
 <f xml:id="p1" name="person"><symbol value="first"/></f>
 <f xml:id="p2" name="person"><symbol value="second"/></f>
 <!-- ... -->
 <f xml:id="ns" name="number"><symbol value="singular"/></f>
 <f xml:id="np" name="number"><symbol value="plural"/></f>
 <!-- ... -->
</fLib>

<fs feats="#p2 #ns"/><fs feats="#p2 #ns"/>
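
The transclusion effect of the feats attribute can be made explicit by copying the referenced features into the structure. The Python sketch below does this for local pointers only; the `<root>` wrapper and the names `DOC` and `transclude` are assumptions made for illustration.

```python
import copy
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <root> element is a hypothetical wrapper so that the fragment parses.
DOC = """<root>
<fLib type="agreement features">
  <f xml:id="p1" name="person"><symbol value="first"/></f>
  <f xml:id="p2" name="person"><symbol value="second"/></f>
  <f xml:id="ns" name="number"><symbol value="singular"/></f>
  <f xml:id="np" name="number"><symbol value="plural"/></f>
</fLib>
<fs feats="#p2 #ns"/>
</root>"""


def transclude(xml_text):
    """Replace each feats attribute with copies of the features it names."""
    root = ET.fromstring(xml_text)
    library = {f.get(XML_ID): f for f in root.iter("f") if f.get(XML_ID)}
    for fs in list(root.iter("fs")):
        for ref in fs.attrib.pop("feats", "").split():
            clone = copy.deepcopy(library[ref.lstrip("#")])
            clone.attrib.pop(XML_ID, None)  # the copy must not reuse the key
            fs.append(clone)
    return root


expanded = transclude(DOC).find("fs")
print([(f.get("name"), f[0].get("value")) for f in expanded])
```

The library entries are copied, not moved, so any number of structures can draw on the same component.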

Page 39

Structure sharing

Some <fs> are not trees but DAGs – nodes may have multiple parents

We represent this by labelling each re-entrancy point, using a <var> element

All <var>s with the same label are held to be the same node: any contents found are to be unified

Page 40

for example

<fs>
 <f name="nominal">
  <fs>
   <f name="nm-num"><var label="L1"><symbol value="singular"/></var></f>
   <!-- other nominal features -->
  </fs>
 </f>
 <f name="verbal">
  <fs>
   <f name="vb-num"><var label="L1"/></f>
  </fs>
  <!-- other verbal features -->
 </f>
</fs>
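
The effect of a shared label can be shown by resolving every labelled node to the content found under that label. The Python sketch below is a deliberately limited illustration, not real unification: it assumes that only one occurrence of each label carries content, and the names `FS` and `shared_values` are hypothetical.

```python
import xml.etree.ElementTree as ET

FS = """<fs>
  <f name="nominal"><fs>
    <f name="nm-num"><var label="L1"><symbol value="singular"/></var></f>
  </fs></f>
  <f name="verbal"><fs>
    <f name="vb-num"><var label="L1"/></f>
  </fs></f>
</fs>"""


def shared_values(fs_text):
    """Map each feature whose value is a labelled node to the shared value."""
    root = ET.fromstring(fs_text)
    # First pass: remember the content found under each label.
    bound = {}
    for var in root.iter("var"):
        if len(var):
            bound.setdefault(var.get("label"), var[0].get("value"))
    # Second pass: every feature pointing at that label gets the same value.
    result = {}
    for f in root.iter("f"):
        if len(f) and f[0].tag == "var":
            result[f.get("name")] = bound.get(f[0].get("label"))
    return result


print(shared_values(FS))
```

Here the noun's number and the verb's number are the same node, so stating the value once is enough to express agreement.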

Page 41

Collections and other multiples

The value of a feature may be an aggregate of atomic values organized as a set, list, or bag

We represent this as a <coll> with a distinguishing org attribute

The value of a feature may (more usually) be a feature structure

... or the value of a feature may be given by a feature expression

Page 42

For example

<fs>
 <f name="lexicalForm"><symbol value="auxquels"/></f>
 <f name="analyses">
  <coll org="list">
   <fs>
    <f name="cat"><symbol value="prep"/></f>
   </fs>
   <fs>
    <f name="cat"><symbol value="pronoun"/></f>
    <f name="kind"><symbol value="rel"/></f>
    <f name="num"><symbol value="pl"/></f>
    <f name="gender"><symbol value="masc"/></f>
   </fs>
  </coll>
 </f>
</fs>

Page 43

Feature expressions

We provide the following operators

Negation <vNot> i.e. complement

Alternation <vAlt>

“Flattening” collection <vColl>

We also provide a <default> element

... but some of these are not very useful in the absence of a feature system declaration

Page 44

Validation of Feature Structures

Constraints can be applied at three levels

in the XML schema (e.g. empty <f> is not allowed)

by supplying additional rules in an established XML constraint language (e.g. Schematron)

by defining a complete FSD or equivalent

Or, a given set of <fs> could be “de-abstracted” to form a structure for which a specific schema could be written

Essential to support “typing” and “sub-typing” of feature structures

Page 45

“de-abstractification”

A generic XML representation can be automatically converted to a specific one...

<fs type="ABC">
 <f name="xyz"><symbol value="zzz"/></f>
 <f name="foo"><numeric value="42"/></f>
</fs>

<ABC>
 <xyz>zzz</xyz>
 <foo>42</foo>
</ABC>

<!ELEMENT ABC (xyz,foo)>
<!ELEMENT xyz (#PCDATA)>
<!ELEMENT foo (#PCDATA)>
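
The conversion from the generic to the specific form is mechanical enough to sketch in a few lines of Python. This is one possible reading of the slide, assuming a flat structure whose type and feature names are all legal XML element names; the names `GENERIC` and `deabstract` are hypothetical.

```python
import xml.etree.ElementTree as ET

GENERIC = """<fs type="ABC">
  <f name="xyz"><symbol value="zzz"/></f>
  <f name="foo"><numeric value="42"/></f>
</fs>"""


def deabstract(fs_text):
    """Turn <fs type="T"><f name="n">...</f></fs> into <T><n>value</n></T>."""
    fs = ET.fromstring(fs_text)
    specific = ET.Element(fs.get("type"))        # the type becomes the tag
    for f in fs.findall("f"):
        child = ET.SubElement(specific, f.get("name"))
        value = f[0]
        # take the value attribute if present, else the element's text
        attr = value.get("value")
        child.text = attr if attr is not None else value.text
    return ET.tostring(specific, encoding="unicode")


print(deabstract(GENERIC))
```

The distinction between <symbol> and <numeric> is lost in the specific form, which is why a schema or DTD for the result, like the declarations above, has to carry the typing instead.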