Post on 22-Dec-2015
Type Systems, Interoperability and Database Population
Eric Nyberg, CMU
Shilpa Arora, CMU
Lance Ramshaw, BBN
Outline
• Annotation sample analysis
  – emergent type systems
  – ongoing issues / clarification questions
• Data interoperability
• Database population
  – CMU’s Annotations DB
  – OntoNotes
  – Possible architecture for interoperability with UIMA annotators
  – Issues for Discussion
Task
• Analyze sample outputs from different annotation groups
• Formalize annotation type system (UML object model) for each sample
• Generate clarification questions
• Work toward a unified type system
• Work toward interoperability architecture
In progress, not finished
Not started
For each annotation sample:
• Overview of what we received
• Brief example annotation
• Type system analysis
• Issues / Questions
What's in the bin?
# Annotation Manual Samples Analysis Type System
1.1 CMU Belief Annotations x x x x
1.2 CMU Event Coreference Annotations x
2.1 Ed Hovy's Group - Noun Sense Annotation x x x
3.1 BBN Temporal Ordering Annotation x x x x
3.2 BBN Name Annotations x x x
3.3 BBN Coreference Annotation x x x
3.4 BBN (Complex) Coreference Annotation x x x x
4.1 UMBC Modality Annotation x x x x
5.1 Columbia Dialog Annotation x
CMU/Columbia Belief Annotation
• Annotation Manual:
  – Davis et al., “Annotating belief in Communication: Manual”
• Annotation Units: Propositions identified by PropBank and NomBank
CMU/Columbia Belief Annotation
• Three categories:
  – Committed belief: Belief expressed in utterance
    • Can be a proposition about present or future
    • E.g. (1) I know Mark and Sandra have eloped. (2) The sun will rise again. (Future)
  – Non-committed belief: Not a strong belief
    • Can be a proposition about present or future
    • E.g. (1) Mark and Sandra may have eloped. (2) John may return tomorrow.
  – Not applicable: Not a belief
    • E.g. (1) I wish Mark and Sandra would finally elope.
CMU/Columbia Belief Annotation
• Five Classes:
  – Committed Belief
  – Committed Belief Future
  – Non-Committed Belief
  – Non-Committed Belief Future
  – Not Applicable
CMU/Columbia Belief Annotation: Type System (1)
CMU/Columbia Belief Annotation: Type System (2)
CMU/Columbia Belief Annotation: Type System (3)
Follow up questions
• Extensions:
  – What extensions do we expect to the annotation scheme?
  – How can we best tailor the type system towards expected future changes?
• Requirements from application domain?
  – Do we have a set of requirements from the application side?
Ed Hovy’s group
• Annotations:
  – Annotated with OntoNotes for noun senses
  – 205 nouns, one file for each noun; the sense and its location in the source files are stored for each noun
• Sample annotations:
  – eng/AFGP-2002-600175-Trans.txt 427 4 [email protected] 3 Mon Dec 3 02:31:27 2007
  – eng/AFGP-2002-602187-Trans.txt 25 6 [email protected] 2 Mon Dec 3 02:31:27 2007
  – Noun="position", sense=3; file=AFGP-2002-600175-Trans.txt, position=“427 4”
  – Noun="position", sense=3; file=AFGP-2002-602187-Trans.txt, position=“25 6”
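The sample lines above look like whitespace-separated records, so a small parser can turn them into structured objects. The following is only a sketch under that assumption: the class name `NounSenseRecord`, the field names, and the placeholder token `LEMMA_FIELD` are all hypothetical (the third field is obscured in the transcript, so it is treated as opaque), and the meaning of the two position numbers is not stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class NounSenseRecord:
    path: str          # source document, e.g. eng/AFGP-2002-600175-Trans.txt
    position: tuple    # the two integers after the path (meaning not given on the slide)
    lemma_field: str   # opaque token; obscured in the transcript
    sense: int         # sense number chosen by the annotator
    timestamp: str     # remaining fields, kept as plain text


def parse_sense_line(line: str) -> NounSenseRecord:
    """Split one whitespace-separated annotation line into its fields."""
    parts = line.split()
    path, pos_a, pos_b, lemma_field, sense = parts[:5]
    return NounSenseRecord(
        path=path,
        position=(int(pos_a), int(pos_b)),
        lemma_field=lemma_field,
        sense=int(sense),
        timestamp=" ".join(parts[5:]),
    )


# LEMMA_FIELD stands in for the obscured third field of the original sample.
record = parse_sense_line(
    "eng/AFGP-2002-600175-Trans.txt 427 4 LEMMA_FIELD 3 Mon Dec 3 02:31:27 2007"
)
print(record.path, record.position, record.sense)
```

Keeping the obscured field as an uninterpreted string means the parser does not depend on guessing what it originally contained.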
Type System (Ed Hovy et al. Annotation)
BBN
1. BBN TTO-3 Temporal Ordering Annotation
2. BBN Name Annotations: named entities – org, date, per, etc.
3. BBN-Coref-Annotation: entity (with type), entity mentions, etc.
4. BBN-complex-coref-annotation
Temporal Relationship Assignment
Token      ID  TT   TP  TR
11/28       1  DS    2  A
Arrived     2  EP    0  B
yesterday   3  DS    2  C
told        4  SP    2  B
Visiting    5  EUN   4  A
left        6  EP    4  A
Return      7  EF    2  A
Monday      8  DS    7  C
is          9  BC    0  C
Return     10  EF    9  A
day        11  DU   10  C
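The table above can be read into records and grouped by the TP column. This sketch assumes, without confirmation from the slide, that TP holds the ID of another row (with 0 meaning no such row); the function names `parse_rows` and `group_by_tp` are hypothetical, and the TT and TR codes are kept as opaque strings.

```python
from collections import defaultdict

ROWS = """\
11/28 1 DS 2 A
Arrived 2 EP 0 B
yesterday 3 DS 2 C
told 4 SP 2 B
Visiting 5 EUN 4 A
left 6 EP 4 A
Return 7 EF 2 A
Monday 8 DS 7 C
is 9 BC 0 C
Return 10 EF 9 A
day 11 DU 10 C
"""


def parse_rows(text):
    """Turn each 'token ID TT TP TR' line into a dictionary."""
    rows = []
    for line in text.splitlines():
        token, row_id, tt, tp, tr = line.split()
        rows.append({"token": token, "id": int(row_id), "tt": tt, "tp": int(tp), "tr": tr})
    return rows


def group_by_tp(rows):
    """Map each TP value to the tokens of the rows that point at it (assumed parent link)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["tp"]].append(row["token"])
    return dict(groups)


rows = parse_rows(ROWS)
groups = group_by_tp(rows)
print(groups[2])
```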
Type System (BBN Temporal Ordering Annotation)
BBN Name Annotations (Type system)
BBN-complex-coref-annotation
Annotations:
• Relations between entities
  – Member
  – Member Base
  – Subset
  – Subset Size (future type system)
• Other annotations - Attributes of a mention
  – Reference type
  – Syntactic Context
Type System for BBN (Complex) coreference annotation
Type System for BBN (Complex) coreference annotation (contd…)
UMBC Modality Annotations
• TMR – Text Meaning Representation, or concepts annotated
• Main annotation – Modality. It has four main attributes: TYPE, VALUE, SCOPE and ATTRIBUTED-TO
• TMRs can be nested, i.e. attributes or relations can refer to other TMRs
UMBC Modality Annotations
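One way to picture nested TMRs and the modality attributes is as a pair of small record types. This is a sketch only: the class names, the Python types of the attributes, and the example values ("epistemic", 0.5, the concept names) are assumptions for illustration and are not taken from the UMBC samples.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TMR:
    concept: str
    # Attribute or relation name -> plain value or another (nested) TMR.
    attributes: dict = field(default_factory=dict)


@dataclass
class Modality:
    type: str                            # TYPE
    value: float                         # VALUE (numeric range is an assumption)
    scope: TMR                           # SCOPE: the TMR the modality applies to
    attributed_to: Optional[TMR] = None  # ATTRIBUTED-TO


def nesting_depth(tmr: TMR) -> int:
    """Return 1 for a flat TMR, plus one level for each layer of nested TMRs."""
    nested = [v for v in tmr.attributes.values() if isinstance(v, TMR)]
    return 1 + max((nesting_depth(v) for v in nested), default=0)


# Hypothetical content loosely based on "Mark and Sandra may have eloped."
couple = TMR("HUMAN-SET", {"members": "Mark, Sandra"})
elope = TMR("ELOPE", {"agent": couple})
modality = Modality(type="epistemic", value=0.5, scope=elope)
print(nesting_depth(modality.scope))
```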
Interoperability: Data
• Common data model
• Multiple implementations
  – based on the same underlying schema (formal object model)
  – meet different goals / requirements
• Implementation Criteria:
  – Support effective run-time annotation (e.g. UIMA type system)
  – Support effective user interface, query/update (e.g. OntoNotes)
  – Support on-the-fly schema extension (e.g. CMU’s AnnotationsDB)
Interoperability: Data [2]
• Formal object model is mapped to:
  – UIMA type system definition (create)
  – OntoNotes RDBMS schema (extend)
  – CMU’s Annotations DB (extend)
• Annotated data can be represented in any format that implements the formal model
• “Have your cake and eat it too”
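The idea of one formal model with several implementations can be sketched as a single type definition rendered into two targets. Everything here is hypothetical: `TypeDef`, the feature names, the SQL table layout, and the XML element names are invented for illustration and do not reproduce the actual UIMA type system descriptor format or the OntoNotes schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class TypeDef:
    name: str
    features: dict  # feature name -> "string" or "int"


SQL_TYPES = {"string": "TEXT", "int": "INTEGER"}


def to_sql_ddl(t: TypeDef) -> str:
    """Render the type as a relational table definition."""
    cols = ", ".join(f"{name} {SQL_TYPES[kind]}" for name, kind in t.features.items())
    return f"CREATE TABLE {t.name} (id INTEGER PRIMARY KEY, {cols});"


def to_type_xml(t: TypeDef) -> str:
    """Render the same type as a simplified XML type description."""
    root = ET.Element("typeDescription", {"name": t.name})
    for name, kind in t.features.items():
        ET.SubElement(root, "feature", {"name": name, "range": kind})
    return ET.tostring(root, encoding="unicode")


belief = TypeDef(
    "BeliefAnnotation",
    {"category": "string", "start_offset": "int", "end_offset": "int"},
)
print(to_sql_ddl(belief))
print(to_type_xml(belief))
```

Because both renderings are derived from the same `TypeDef`, a change to the model propagates to every implementation, which is the property the slide is after.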
CMU’s Annotations Database
• MySQL implementation
• Java APIs (SQL connection API and simple object access API)
• Fully integrated with UIMA
• Used on DTO and DARPA projects
• PRO: tag types can be extended at run time by the application (schema supports open-ended type definition)
• CON: interactive tools are currently limited
JAVELIN Project Briefing
AQUAINT Program
Annotations Database
In an interview with Defense News, Indian Defence Research and Development Organization (DRDO) scientists said India was launching a comprehensive plan to develop a wide range of modern nuclear missiles. Within two years, India would develop an intercontinental ballistic missile (ICBM), ...
<entity type=org offset=21 length=12 />
<entity type=org offset=35 length=59 />
<entity type=gpe offset=111 length=5 source=bbn ref=#INDIA />
<entity type=gpe offset=223 length=5 source=bbn ref=#INDIA />
<entity type=fac offset=231 length=41 />
[Schema diagram: document (datetime, docno, doctype), passage (text), tag (type, value, parent) and span (offset, length), connected by to-many (*) associations]
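The entity tags above are stand-off annotations: each one points into the passage text by character offset and length rather than wrapping the text. A minimal sketch of resolving such tags follows; the `Tag` class and `covered_text` method are hypothetical names, and only the first three tags are used because the transcript may not preserve the original whitespace of the second sentence.

```python
from dataclasses import dataclass

# First sentence of the passage shown on the slide.
TEXT = (
    "In an interview with Defense News, Indian Defence Research and "
    "Development Organization (DRDO) scientists said India was launching "
    "a comprehensive plan to develop a wide range of modern nuclear missiles."
)


@dataclass
class Tag:
    type: str
    offset: int   # character offset into the passage text
    length: int   # number of characters covered

    def covered_text(self, text: str) -> str:
        """Return the substring this stand-off tag points at."""
        return text[self.offset:self.offset + self.length]


TAGS = [Tag("org", 21, 12), Tag("org", 35, 59), Tag("gpe", 111, 5)]

for tag in TAGS:
    print(tag.type, repr(tag.covered_text(TEXT)))
```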
An Integrated Annotation DB in OntoNotes
Sameer Pradhan, Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel
http://www.bbn.com/NLP/OntoNotes
Goals
• Capture multiple layers of annotation and modeling
  – Syntax
  – Propositions
  – Word sense
  – Ontology
  – Coreference
  – Names
• Using an integrated relational database representation
  – Enforces consistency across the different annotations
  – Supports integrated models that can combine evidence from different layers
Unified Representation
• Provide a bare-bones representation independent of the individual semantics that can
  – Efficiently capture intra- and inter-layer semantics
  – Maintain component independence
  – Provide mechanism for flexible integration
  – Integrate information at the lowest level of granularity
A Relational Database
Unified Relational Representation
[Diagram: unified schema with components Corpus, Trees, Coreference, Names, Propositions and Senses]
Example: DB Representation of Syntax
• Treebank tokens (stored in the Token table) provide the common base
• The Tree table stores the recursive tree nodes, each with its span
• Subsidiary tables define the sets of function tags, phrase types, etc.
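The token-plus-tree layout described above can be sketched with an in-memory SQLite database. This is an illustration under assumptions: the table and column names (`token`, `tree`, `parent_id`, `start_token`, `end_token`) are hypothetical and are not the actual OntoNotes schema, and SQLite stands in for whatever RDBMS the project uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token (id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE tree (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES tree(id),   -- recursive link to the parent node
    tag TEXT,                                -- phrase type
    start_token INTEGER REFERENCES token(id),
    end_token INTEGER REFERENCES token(id)   -- inclusive span over tokens
);
""")

words = "major reductions and realignments of troops in central Europe".split()
conn.executemany("INSERT INTO token VALUES (?, ?)", list(enumerate(words)))

# Phrase-level nodes only; spans are token indices into the phrase above.
nodes = [
    (1, None, "NP", 0, 8),
    (2, 1, "NP", 0, 3),
    (3, 1, "PP", 4, 8),
    (4, 3, "NP", 5, 8),
    (5, 4, "NP", 5, 5),
    (6, 4, "PP", 6, 8),
    (7, 6, "NP", 7, 8),
]
conn.executemany("INSERT INTO tree VALUES (?, ?, ?, ?, ?)", nodes)


def node_text(connection, node_id):
    """Join the tokens covered by one tree node, in order."""
    rows = connection.execute(
        "SELECT t.word FROM token t JOIN tree n "
        "ON t.id BETWEEN n.start_token AND n.end_token "
        "WHERE n.id = ? ORDER BY t.id",
        (node_id,),
    ).fetchall()
    return " ".join(word for (word,) in rows)


print(node_text(conn, 6))
```

Because every node refers to tokens by index, other layers (names, propositions) can attach to the same token base without copying text.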
Advantages of an Integrated Representation
• Each layer translates into a common representation
• Clean, consistent layers
  – Resolve the inconsistencies and problems that this reveals
• Well-defined relationships
  – Database schema defines the merged structure efficiently
• Original representations available as predefined views
  – Treebank, PropBank, etc.
• SQL queries can extract examples based on multiple layers or define new views
• Python object-oriented API allows for programmatic access to tables and queries
Syntax Layer
Identifies meaningful phrases in the text
Lays out the structure of how they are related
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
[Figure (SYNTAX): parse tree rooted at S; the NP “major reductions and realignments of troops in central Europe” is expanded into an NP (JJ NNS CC NNS) and a PP (IN NP), whose NP contains NP (NNS) and PP (IN NP), ending in the NP “central Europe” (JJ NNP)]
Propositional Structure
Tells who did what to whom
For both verbs and nouns
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
[Figure: the parse tree for “... major reductions and realignments of troops in central Europe ...” with the argument labels ARG1, ARG2 and ARGM-LOC overlaid]
Predicate Frames
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
Frames marked in the sentence: aim.02 – Directed motion; reduce.01 – Make less
Predicate Frames: aim
• aim.01 – Plan
  – ARG0 – Aimer
  – ARG1 – Action
• aim.02 – Directed motion
  – ARG0 – Aimer
  – ARG1 – Thing in motion
  – ARG2 – Target
Predicate Frames: reduction
• reduce.01 – Make less
  – ARG0 – Agent
  – ARG1 – Thing falling
  – ARG2 – Amount fallen
  – ARG3 – Starting point
  – ARG4 – Ending point
Predicate frames define the meanings of the numbered arguments
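Since predicate frames give the numbered arguments their meaning, a lookup table keyed by frame identifier is a natural representation. The sketch below uses only the frames and role names listed on the slide; the dictionary layout and the `describe` helper are hypothetical and are not part of any OntoNotes or PropBank API.

```python
# Frame id -> gloss and numbered-argument meanings, as listed on the slide.
FRAMES = {
    "aim.01": {"gloss": "Plan", "roles": {"ARG0": "Aimer", "ARG1": "Action"}},
    "aim.02": {
        "gloss": "Directed motion",
        "roles": {"ARG0": "Aimer", "ARG1": "Thing in motion", "ARG2": "Target"},
    },
    "reduce.01": {
        "gloss": "Make less",
        "roles": {
            "ARG0": "Agent",
            "ARG1": "Thing falling",
            "ARG2": "Amount fallen",
            "ARG3": "Starting point",
            "ARG4": "Ending point",
        },
    },
}


def describe(frame_id: str, arg: str) -> str:
    """Explain what a numbered argument means within a given frame."""
    frame = FRAMES[frame_id]
    return f"{frame_id} ({frame['gloss']}): {arg} = {frame['roles'][arg]}"


print(describe("reduce.01", "ARG1"))
print(describe("aim.02", "ARG2"))
```

The same label, such as ARG1, resolves to different meanings depending on the frame, which is why the frame identifier has to travel with each proposition.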
Word Sense and Ontology
• Meanings of nouns and verbs are specified
• All the senses are annotatable at 90% inter-annotator agreement
• Catalog of possible meanings supplied in the sense inventory files
• Ontology links (currently being added) will capture similarities between related senses of different words
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
Word Sense: aim
1. Point or direct object, weapon, at something ...
2. Wish, purpose or intend to achieve something
Word Sense: register
1. Enter into an official record
2. Be aware of, enter into someone’s consciousness
3. Indicate a measurement
4. Show in one’s face
Senses marked in the sentence: aim, sense 2 (Wish, purpose or intend to achieve something); register, sense 1 (Enter into an official record)
Coreference
• Identifies different mentions of the same entity in text; in particular, links definite referring noun phrases and pronouns
• Two types are tagged: identity and attributive coreference
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
[Figure: coreference example; mention labels include “President Bush”, “He”, “conventional arms talk”, “Vienna talks – which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe”, “the Pentagon” and “Pentagon”; entity label e0]
Example of DB Query Function
What is the distribution of named entities that are ARG0s of the predicate “say”?

    from collections import defaultdict

    def count_arg0_name_entities(a_proposition_bank, a_cursor, a_tree_document, tree_id_of):
        # a_cursor: dict-style DB cursor (rows are indexed by column name).
        # The %s placeholders assume a MySQLdb-style driver.
        # tree_id_of: placeholder callable; the slide does not show how a_tree_id is obtained.
        ne_hash = defaultdict(int)  # name entity type -> frequency
        ne_query = "select * from name_entity where subtree_id = %s;"

        for a_proposition in a_proposition_bank:
            # Key step 1: keep only propositions whose predicate is "say".
            if a_proposition.lemma != "say":
                continue
            a_cursor.execute("select * from argument where proposition_id = %s;",
                             (a_proposition.id,))
            for a_argument_row in a_cursor.fetchall():
                # Key step 2: keep only ARG0 arguments.
                if a_argument_row["type"] != "ARG0":
                    continue
                a_cursor.execute("select * from argument_node where argument_id = %s;",
                                 (a_argument_row["id"],))
                for a_argument_node_row in a_cursor.fetchall():
                    a_node_id = a_argument_node_row["node_id"]

                    # Name entities attached to the argument node itself.
                    a_cursor.execute(ne_query, (a_node_id,))
                    for a_ne_row in a_cursor.fetchall():
                        ne_hash[a_ne_row["type"]] += 1

                    # Key step 3: name entities attached to each subtree of the node.
                    a_tree = a_tree_document.get_tree(tree_id_of(a_argument_node_row))
                    a_node = a_tree.get_subtree(a_node_id)
                    for a_child in a_node.subtrees():
                        a_cursor.execute(ne_query, (a_child.id,))
                        for a_ne_subtree_row in a_cursor.fetchall():
                            ne_hash[a_ne_subtree_row["type"]] += 1
        return ne_hash

Name Entity Frequency
Person 84
GPE 34
Organization 29
NORP 15
... ...
Conclusion
• Integrating the annotation layers using a relational schema
  – Improves consistency
  – Allows predictive features that combine evidence from multiple layers
• Easily accessible
  – Through Python API
  – SQL queries
Interoperability: Components
[Architecture diagram: a UIMA Analysis Engine (customer’s annotators) surrounded by collection readers and CAS consumers for each store: OntoNotes Collection Reader and OntoNotes CAS Consumer (OntoNotes), ADB Collection Reader and ADB CAS Consumer (AnnotationsDB), XCAS Collection Reader and XCAS CAS Consumer (XML), and File System Collection Reader (TXT). Key: existing UIMA wrapper, new UIMA wrapper, RDBMS storage, file storage]
A shared, formal type system allows multiple data formats to be combined effectively
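The reader / analysis engine / consumer arrangement can be sketched in plain Python. This is not the UIMA API: `Cas`, `text_reader`, `capitalized_word_annotator`, `ListConsumer` and `run_pipeline` are simplified, hypothetical stand-ins meant only to show how a shared annotation type lets any reader be paired with any consumer.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int


@dataclass
class Cas:
    """Simplified stand-in for a document plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)


def text_reader(texts):
    """Collection reader: wrap each plain-text document in a Cas."""
    for text in texts:
        yield Cas(text=text)


def capitalized_word_annotator(cas):
    """Toy analysis engine: annotate every capitalized word."""
    pos = 0
    for word in cas.text.split(" "):
        if word[:1].isupper():
            cas.annotations.append(Annotation("Capitalized", pos, pos + len(word)))
        pos += len(word) + 1


class ListConsumer:
    """Consumer: collects (type, covered text) rows, standing in for a DB or file writer."""

    def __init__(self):
        self.rows = []

    def consume(self, cas):
        for a in cas.annotations:
            self.rows.append((a.type, cas.text[a.begin:a.end]))


def run_pipeline(reader, engine, consumers):
    for cas in reader:
        engine(cas)
        for consumer in consumers:
            consumer.consume(cas)


store = ListConsumer()
run_pipeline(text_reader(["talks in Vienna"]), capitalized_word_annotator, [store])
print(store.rows)
```

Swapping `text_reader` for a database-backed reader, or adding a second consumer, would not require changes to the annotator, as long as all parts agree on the `Annotation` type.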
Issues for Discussion
• Persistence formats optimize for different concerns
  – RDBMS – relational querying, update
  – XCAS – fast deserialization of run-time objects
• Consider extending schema to hold XML serialization of document annotations