TRANSCRIPT
Annotation : the scope
er so Steven said it was not a property of um annotated corpora
verb phrase
Anaphoric reference
noun phrase
named entity
passing truck
speech act
disfluency
intonation pattern
context
participant
Some XML annotations
<phon addr="1:10">st</phon><phon addr="12:5">i</phon><phon addr="23:5">v</phon><phon addr="30:2">n</phon>
<w pos="NP1">Steven</w>
<person ident="SB01" gender="M"><birthDate>12/03/1956</birthDate>.... </person>
<name persKey="SB01">steven</name>
<u who="SB01" start="0:1">er so steven said it was <emph>not</emph> a property of annotated corpora</u>
Transcribing speech
normalization issues
ease of reading vs accuracy
interpretation vs prosody
analogous to problems of handling digitized images
The Spoken base tagset
components : <u> <event> <kinesic> <vocal> <pause> <shift>
contextual information in header <settingDesc> <particDesc>
facilities for synchronization and timing
Features of speech
lexical <u>
non-lexical <vocal>
anthropophenic non-anthropophenic <kinesic>
communicative non-communicative <event>
transcribed events
Utterances
Basic unit of discourse, corresponding to speaker turns
Optionally grouped into higher-level divisions (<div>s), e.g. to mark discourse function
Linked by who attribute to <person> description in header
Vocals and events
Empty elements are used to mark paralinguistic phenomena
<u who="Jan">This is just delicious</u>
<event desc='telephone rings'/>
<u who="Kim">I'll get it</u>
<u who="Tom">I used to <vocal desc="cough"/> smoke a lot</u>
<u who="Bob"><vocal desc="sniff"/>He thinks he's tough</u>
<vocal who="Ann" desc="snorts"/>
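As a rough illustration of how the empty elements sit alongside utterances, the sketch below walks a fragment in document order and lists utterances, events, and vocals together. It uses Python's standard xml.etree.ElementTree; the `<div>` wrapper, the `SAMPLE` name, and the `timeline` function are assumptions made for this example, not part of the tagset.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the slide, wrapped in a hypothetical <div> root
# so that it parses as a single well-formed document.
SAMPLE = """<div>
  <u who="Jan">This is just delicious</u>
  <event desc="telephone rings"/>
  <u who="Kim">I'll get it</u>
  <u who="Tom">I used to <vocal desc="cough"/> smoke a lot</u>
</div>"""


def timeline(xml_text):
    """Return (tag, who, description-or-text) rows in document order."""
    rows = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "u":
            # itertext() gathers the words on both sides of embedded empty
            # elements; split/join normalizes the resulting whitespace.
            text = " ".join("".join(el.itertext()).split())
            rows.append(("u", el.get("who"), text))
        elif el.tag in ("event", "vocal"):
            rows.append((el.tag, el.get("who"), el.get("desc")))
    return rows


for row in timeline(SAMPLE):
    print(row)
```

Because the vocal is an empty element inside the utterance, the spoken words can be recovered without it, while the cough still keeps its place in the sequence.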
Voice quality and prosody
The <shift> element is used to mark changes in voice quality
Other prosodic features may be marked using specific kinds of <seg> or entity refs
<u who="LB"><shift feature="loud" new="f"/>Elizabeth</u>
<u who="EB">Yes</u>
<u who="LB"><shift/>Come and try this <pause/><shift feature="loud" new="ff"/>come on</u>
Another example
<u who="MAR">you never <pause/> take this cat for show and tell <pause dur='5'/> meow meow</u>
<u who="ROS">yeah well I dont want to</u>
<event><desc>toy cat has bell in tail which continues to make a tinkling sound</desc></event>
<vocal who="MAR"><desc>miaows</desc></vocal>
<u who="ROS">because it is so old</u>
<u who="MAR">how <reg>about</reg> your cat <pause/> yours is new<kinesic><desc>shows Father the cat</desc></kinesic></u>
<u who="FAT" trans="pause">that<pause/> darling</u>
<u who="MAR"><s>no mine isnt old</s> <s>mine is just um a little dirty</s></u>
Participant Description
<person xml:id="P1" sex='2' age='mid'>
 <p>Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme.</p>
</person>
<person xml:id="P1" sex="2" age='mid'>
 <birth date='1950-01-12'> <date>12 January 1950</date> <name type="place">Shropshire, UK</name></birth>
 <firstLang>English</firstLang>
 <langKnown>French</langKnown>
 <residence>Long term resident of Hull</residence>
 <education>University postgraduate</education>
 <occupation>Unknown</occupation>
 <socecStatus source="PEP" code="B2"/>
</person>
Setting Description
e.g. from P2
<settingDesc>
 <setting who="#P1 #P2"><name type="city">Bedford</name><name type="region">UK: South East</name><date value="1989">early spring, 1989</date><locale>rug of a suburban home</locale><activity>playing</activity></setting>
 <setting who="#P3"><name type="city">Bedford</name><name type="region">UK: South East</name><date value="1989">early spring, 1989</date><locale>at the sink</locale> <activity>washing-up</activity></setting>
 <setting who="#P4"><name type="place">London, UK</name> <time>unknown</time><locale>broadcasting studio</locale><activity>radio performance</activity></setting>
</settingDesc>
Timing
Pausing: use <pause> element
Duration: use dur attribute
Overlap: use trans attribute
Overlap
A: Have you heard the
B: the election results?
A: its a disaster
B: its a miracle
<u xml:id="A1" who="A">Have you heard the</u>
<u xml:id="B1" who="B" trans="latching">the election results? </u>
<u xml:id="A2" who="A" trans="pause">its a disaster</u>
<u xml:id="B2" who="B" trans="overlap">its a miracle </u>
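To make the role of the trans attribute concrete, here is a minimal sketch that pairs each utterance with the one before it and reports the transition between them. The `<div>` wrapper and the `transitions` function are hypothetical, and treating a missing trans as "smooth" is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The election example, wrapped in a hypothetical <div> root.
SAMPLE = """<div>
  <u xml:id="A1" who="A">Have you heard the</u>
  <u xml:id="B1" who="B" trans="latching">the election results?</u>
  <u xml:id="A2" who="A" trans="pause">its a disaster</u>
  <u xml:id="B2" who="B" trans="overlap">its a miracle</u>
</div>"""


def transitions(xml_text):
    """Return (previous id, current id, transition type) for each speaker change."""
    utterances = ET.fromstring(xml_text).findall("u")
    result = []
    for prev, cur in zip(utterances, utterances[1:]):
        # Assumption: an utterance without trans follows smoothly.
        result.append((prev.get(XML_ID), cur.get(XML_ID), cur.get("trans", "smooth")))
    return result


for step in transitions(SAMPLE):
    print(step)
```

The attribute describes how an utterance begins relative to its predecessor, which is why the first utterance contributes no row of its own.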
Linking, segmentation, alignment
Provides generic segmentation elements
Provides extensive set of attributes for linkage, correspondence, synchronization, aggregation, alternation, etc.
Documents generic pointing mechanism
Generic segmentation elements
<seg> for arbitrary (nesting) segmentation
<s> for end-to-end segmentation
use type attribute to subcategorise
<anchor> for points
Segmentation is the key to successful linking and analysis
Clustering
(Difficulty (is being expressed) with ((the method) (to be used)))
<s>Difficulty <seg>is being expressed</seg> with <seg><seg>the method</seg> <seg>to be used</seg></seg></s>
discontinuous segments
fundamental problem
first segment, then link, using stand-off
“You put it,” Quill reminded him, “in the safe.”
<s xml:id="s1">"You put it,"</s> <s xml:id="s2">Quill reminded him,</s> <s xml:id="s3">"in the safe."</s>
discontinuous segments
can also use PART attribute to indicate that segments are incomplete
“You put it,” Quill reminded him, “in the safe.”
<s xml:id="s1" next="#s3">"You put it,"</s> <s xml:id="s2">Quill reminded him,</s> <s xml:id="s3" prev="#s1">"in the safe."</s>
discontinuous segments
“You put it,” Quill reminded him, “in the safe.”
<s xml:id="s1">“You put it,”</s> <s xml:id="s2">Quill reminded him,</s> <s xml:id="s3">“in the safe.”</s>
<join targets="#s1 #s3" result="s"/>
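The stand-off approach ("first segment, then link") can be sketched as follows: the segments are indexed by identifier, and each `<join>` is resolved by following its pointers. The `<div>` wrapper and the `resolve_joins` function are assumptions of this example; the targets and result attributes are used as they appear on the slide.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Segments plus a stand-off <join>, wrapped in a hypothetical <div> root.
SAMPLE = """<div>
  <s xml:id="s1">"You put it,"</s>
  <s xml:id="s2">Quill reminded him,</s>
  <s xml:id="s3">"in the safe."</s>
  <join targets="#s1 #s3" result="s"/>
</div>"""


def resolve_joins(xml_text):
    """Return (result type, reassembled text) for every <join> element."""
    root = ET.fromstring(xml_text)
    # Index every identified element so that "#s1" style pointers can be followed.
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    joined = []
    for join in root.iter("join"):
        parts = [by_id[ref.lstrip("#")] for ref in join.get("targets").split()]
        text = " ".join("".join(part.itertext()).strip() for part in parts)
        joined.append((join.get("result"), text))
    return joined


print(resolve_joins(SAMPLE))
```

The text itself is left untouched: the virtual sentence exists only as the result of following the pointers.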
Translation pairs
<s xml:id="s1" corresp="#s2" xml:lang="EN">For a long time I used to go to bed early</s>
<s xml:id="s2" corresp="#s1" xml:lang="FR">Longtemps je me couchais de bonne heure</s>
<correspGrp type="trans"><link targets="#s1 #s2"/>
</correspGrp>
and/or....
Synchronization
of whole elements
of points in time
<u xml:id="A2" who="A" synch="#B2">its a disaster</u>
<u xml:id="B2" who="B">its a miracle</u>
<u xml:id="A1" who="A">Have you heard <anchor xml:id="A0"/>the</u>
<u xml:id="B1" who="B" start="#A0"><anchor xml:id="B0"/>the election results? yes</u>
XML semantics are limited
<s id="S1" head="V1"> <np id="N1">annotated corpora</np> <vp id="V1">rule</vp> <tq id="T1">okay</tq></s>
The containment relation is implicit, so we do not need to say
<vp id="V1" partOf="S1">rule</vp>
though we may wish to say
<vp id="V1" role="head">rule</vp>
Analytic mechanisms
Specific kinds of segment for linguistic analyses
Why is there no tag for noun?
Specialized interpretive pointers (<span> and <spanGrp>)
The ana attribute and its possible targets
<interp> and <interpGrp>
feature systems <fs> and <fsd>
Arbitrary characterizations
The <span> points into a stretch of a text and characterizes it in some way
Target may be anything you can reach by an XPath
<spanGrp resp="#LB" type="thematic">
 <span value="ships" from="#P1" to="#P2"/>
 <span value="shoes" from="#P4" to="#P8"/>
 <span value="sealing wax" from="http://www.somewhere.com/waxinit.xml#P45"/>
</spanGrp>
<w ana="#VVD">annotated</w>
More detailed analysis
the ana attribute is of type IDREFS
what does VVD identify?
a prose description
an <interp> element
a feature structure
using interp...
<w ana="#VVD">annotated</w><w ana="#NN2">corpora</w>
<interp xml:id="VVD" type="lexicalClass" value="verbPastTense"/><interp xml:id="NN2" type="lexicalClass" value="nounPlural"/>
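A minimal sketch of how the ana pointers might be followed to their `<interp>` targets is given below. The `<div>` wrapper and the `analyses` function are assumptions of this example; since ana can carry several pointers, each word is given a list of values.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Words and their interpretations, wrapped in a hypothetical <div> root.
SAMPLE = """<div>
  <w ana="#VVD">annotated</w>
  <w ana="#NN2">corpora</w>
  <interp xml:id="VVD" type="lexicalClass" value="verbPastTense"/>
  <interp xml:id="NN2" type="lexicalClass" value="nounPlural"/>
</div>"""


def analyses(xml_text):
    """Return (word, [interpretation values]) for every <w> element."""
    root = ET.fromstring(xml_text)
    interps = {i.get(XML_ID): i for i in root.iter("interp")}
    result = []
    for w in root.iter("w"):
        # ana may hold several space-separated pointers.
        refs = w.get("ana", "").split()
        values = [interps[ref.lstrip("#")].get("value") for ref in refs]
        result.append((w.text, values))
    return result


print(analyses(SAMPLE))
```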
hierarchic grouping of interps
nouns can be common or proper
nouns can be singular or plural
<interpGrp value="nominal"> <interpGrp value="common"> <interp value="singular"/> <interp value="plural"/> </interpGrp></interpGrp>
for example...
<interp xml:id='VVD'> <desc>verb past tense</desc></interp>
<interp xml:id='NN2'> <desc>plural common noun</desc></interp>
<w ana='#VVD'>annotated</w>
<w ana='#NN2'>corpora</w>
Encoding analyses
Linguistic Annotation Frameworks and standards
the philosopher's stone
Generic feature structure system: any analysis can be represented by bundles of named feature-value pairs
embedded within text or indirectly linked
Ancillary feature system declaration
Theoretically neutral (?) pragmatic solution to real world problem of intermachine communication
Feature structures
a feature structure consists of a bundle of features
a feature has a name and a value
values may be binary switches, symbols, strings, feature structures, or operations on them
bundling may be constrained in various (not necessarily hierarchic) ways
... or, in XML:
The <fs> element represents a (typed) feature structure, which contains...
One or more <f> elements, each of which has
a name
a value
Feature values may be
atomic: <binary> <string> <numeric> <symbol>
complex: <fs> <coll>
expressions: <vNot> <vAlt> <vColl> ... or <var>
Using a feature structure...
<w ana='#NN2'>corpora</w>
<fs xml:id='NN2'>
 <f name='class'> <symbol value='noun'/></f>
 <f name='number'> <symbol value='plural'/></f>
 <f name='proper'> <binary value="false"/></f>
</fs>
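Since a feature structure is a bundle of named feature-value pairs, it maps naturally onto a dictionary. The sketch below shows one possible conversion; the `fs_to_dict` function and the decision to read "true", "yes", and "1" as Boolean true are assumptions of this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<fs xml:id="NN2">
  <f name="class"><symbol value="noun"/></f>
  <f name="number"><symbol value="plural"/></f>
  <f name="proper"><binary value="false"/></f>
</fs>"""


def fs_to_dict(fs):
    """Convert an <fs> element into a dict of feature name to value."""
    out = {}
    for f in fs.findall("f"):
        value = f[0]  # the first child element carries the feature's value
        if value.tag == "fs":
            # A feature structure as value: convert recursively.
            out[f.get("name")] = fs_to_dict(value)
        elif value.tag == "binary":
            # Assumption: these spellings all count as true.
            out[f.get("name")] = value.get("value") in ("true", "yes", "1")
        else:
            # Atomic values: prefer the value attribute, else the text content.
            out[f.get("name")] = value.get("value", value.text)
    return out


print(fs_to_dict(ET.fromstring(SAMPLE)))
```

Collections and feature expressions are deliberately left out of this sketch.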
Features: simple values
binary, numeric, symbol or string
constraints may be declared in FSD
<fs type='word structure'>
 <f name='lemma'><string>goose</string></f>
 <f name='category'><symbol value='noun'/></f>
 <f name='barLevel'><numeric value='0'/></f>
 <f name='number'><symbol value='plural'/></f>
</fs>
lemma : goose,
category: noun,
number:plural
bar level: 0
Features: plus or minus
<fs type='phonetic segment'>
 <f name='segment'><binary value="yes"/></f>
 <f name='consonantal'><binary value="yes"/></f>
 <f name='vocalic'><binary value="no"/></f>
 <f name='nasal'><binary value="no"/></f>
 <!-- .... -->
 <f name='coronal'><binary value="yes"/></f>
 <f name='continuant'><binary value="yes"/></f>
 <f name='delayedRelease'><binary value="yes"/></f>
 <f name='strident'><binary value="yes"/></f>
</fs>
[segment +, consonantal +, vocalic -, nasal -, low -,
high -, back -, round -, anterior +, coronal +,
continuant +, delayed release +, strident +]
Alternate values
<w ana="#VVD">annotated</w>
<fs xml:id="VVD" type="lexical"> <f name="class"> <vAlt mutExcl="Y"> <symbol value="verb"/> <symbol value="adj"/> </vAlt> </f>...</fs>
for example...
<fs>
 <f name="cat"> <symbol value="verb"/></f>
 <f name="aux"> <string value="avoir"/></f>
 <f name="mode"> <symbol value="indicatif"/></f>
 <f name="tense"> <symbol value="present"/> </f>
 <f name="pers"> <vAlt> <symbol value="1"/> <symbol value="3"/> </vAlt> </f>
 <f name="num"> <symbol value="sing"/></f>
</fs>
“mange”
Value libraries
Collections of re-usable feature-structure components, each with a unique key
May be referenced from an <fs> (using feats attribute) or an <f> (using fVal attribute)
NB effect is to transclude (embed a copy of) the referenced item
Not to be confused with....
for example
<fLib type="agreement features">
 <f xml:id="p1" name="person"> <symbol value="first"/></f>
 <f xml:id="p2" name="person"> <symbol value="second"/></f>
 <!-- ... -->
 <f xml:id="ns" name="number"> <symbol value="singular"/></f>
 <f xml:id="np" name="number"> <symbol value="plural"/></f>
 <!-- ... -->
</fLib>
<fs feats="#p2 #ns"/>
<fs feats="#p2 #ns"/>
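The transclusion effect of the feats attribute can be approximated as below: each pointer is replaced by a copy of the library feature it names. The `<div>` wrapper and the `expand_feats` function are hypothetical, and dropping the identifier from each copy (to avoid duplicate xml:id values) is a choice made for this example only.

```python
import copy
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A small library and one <fs> that points into it (hypothetical <div> root).
SAMPLE = """<div>
  <fLib type="agreement features">
    <f xml:id="p2" name="person"><symbol value="second"/></f>
    <f xml:id="ns" name="number"><symbol value="singular"/></f>
  </fLib>
  <fs feats="#p2 #ns"/>
</div>"""


def expand_feats(xml_text):
    """Replace each feats pointer list with embedded copies of the features."""
    root = ET.fromstring(xml_text)
    library = {f.get(XML_ID): f for f in root.iter("f") if f.get(XML_ID)}
    # list() takes a snapshot, so the tree can be modified while looping.
    for fs in list(root.iter("fs")):
        for ref in fs.attrib.pop("feats", "").split():
            clone = copy.deepcopy(library[ref.lstrip("#")])
            clone.attrib.pop(XML_ID, None)  # avoid duplicate identifiers
            fs.append(clone)
    return root


expanded = expand_feats(SAMPLE).find("fs")
print([(f.get("name"), f[0].get("value")) for f in expanded])
```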
Structure sharing
Some <fs> are not trees but DAGs – nodes may have multiple parents
We represent this by labelling each re-entrancy point, using a <var> element
All <var>s with the same label are held to be the same node: any contents found are to be unified
for example
<fs>
 <f name="nominal">
  <fs>
   <f name="nm-num"> <var label="L1"> <symbol value="singular"/></var> </f>
   <!-- other nominal features -->
  </fs>
 </f>
 <f name="verbal">
  <fs>
   <f name="vb-num"><var label="L1"/></f>
  </fs>
  <!-- other verbal features -->
 </f>
</fs>
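The idea that all `<var>`s with the same label denote one node can be sketched in two passes: first record whatever content is attached to each label, then report that content for every feature that points at the label. This is a simplification under stated assumptions, not full unification: it expects at most one content-bearing `<var>` per label, and the `resolve_shared` function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<fs>
  <f name="nominal"><fs>
    <f name="nm-num"><var label="L1"><symbol value="singular"/></var></f>
  </fs></f>
  <f name="verbal"><fs>
    <f name="vb-num"><var label="L1"/></f>
  </fs></f>
</fs>"""


def resolve_shared(fs_root):
    """Return feature name to shared value for every feature holding a <var>."""
    # Pass 1: record the content attached to each label.
    bound = {}
    for var in fs_root.iter("var"):
        if len(var):  # this occurrence carries the node's content
            bound[var.get("label")] = var[0].get("value")
    # Pass 2: every feature pointing at a label receives that content.
    shared = {}
    for f in fs_root.iter("f"):
        var = f.find("var")  # direct child only
        if var is not None:
            shared[f.get("name")] = bound.get(var.get("label"))
    return shared


print(resolve_shared(ET.fromstring(SAMPLE)))
```

Both number features end up with the same value because they share a node, not because the value was written twice.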
Collections and other multiples
The value of a feature may be an aggregate of atomic values organized as a set, list, or bag
We represent this as a <coll> with a distinguishing org attribute
The value of a feature may (more usually) be a feature structure
... or the value of a feature may be given by a feature expression
For example
<fs>
 <f name="lexicalForm"> <symbol value="auxquels"/></f>
 <f name="analyses">
  <coll org="list">
   <fs> <f name="cat"><symbol value="prep"/></f> </fs>
   <fs>
    <f name="cat"><symbol value="pronoun"/></f>
    <f name="kind"><symbol value="rel"/></f>
    <f name="num"><symbol value="pl"/></f>
    <f name="gender"><symbol value="masc"/></f>
   </fs>
  </coll>
 </f>
</fs>
Feature expressions
We provide the following operators
Negation <vNot> i.e. complement
Alternation <vAlt>
“Flattening” collection <vColl>
We also provide a <default> element
... but some of these are not very useful in the absence of a feature system declaration
Validation of Feature Structures
Constraints can be applied at three levels
in the XML schema (e.g. empty <f> is not allowed)
by supplying additional rules in an established XML constraint language (e.g. Schematron)
by defining a complete FSD or equivalent
Or, a given set of <fs> could be “de-abstracted” to form a structure for which a specific schema could be written
Essential to support “typing” and “sub-typing” of feature structures
“de-abstractification”
A generic XML representation can be automatically converted to a specific one...
<fs type="ABC"> <f name="xyz"> <symbol value="zzz"/></f> <f name="foo"> <numeric value="42"/></f></fs>
<ABC> <xyz>zzz</xyz> <foo>42</foo></ABC>
<!ELEMENT ABC (xyz,foo)>
<!ELEMENT xyz (#PCDATA)>
<!ELEMENT foo (#PCDATA)>
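The conversion from the generic to the specific representation could be automated along these lines. The `deabstract` function is a hypothetical name, and the sketch assumes that every feature has a single atomic value carried in a value attribute, as in the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<fs type="ABC">
  <f name="xyz"><symbol value="zzz"/></f>
  <f name="foo"><numeric value="42"/></f>
</fs>"""


def deabstract(fs):
    """Turn a typed <fs> into an element named after its type."""
    specific = ET.Element(fs.get("type"))
    for f in fs.findall("f"):
        # Each feature name becomes a child element; its atomic value
        # becomes that element's text content.
        child = ET.SubElement(specific, f.get("name"))
        child.text = f[0].get("value")
    return specific


print(ET.tostring(deabstract(ET.fromstring(SAMPLE)), encoding="unicode"))
```

The resulting element names are what a specific schema or DTD, such as the declarations above, could then constrain.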