Supporting Annotation Layers for Natural Language Processing
Marti Hearst, Preslav Nakov, Ariel Schwartz, Brian Wolf, Rowena Luk
UC Berkeley
Stanford InfoSeminar, March 17, 2006
Supported by NSF DBI-0317510 and a gift from Genentech
UC Berkeley Biotext Project
Outline
• Motivation: NLP tasks
• System Description
  • Annotation architecture
  • Sample queries
• Database Design and Evaluation
• Related Work
• Future Work
Double Exponential Growth in Bioscience Journal Articles
From Hunter & Cohen, Molecular Cell 21, 2006
BioText Project Goals
• Provide flexible, intelligent access to information for use in biosciences applications.
• Focus on textual information from journal articles
  • Tightly integrated with other resources
    • Ontologies
    • Record-based databases
Project Team
• Project Leaders
  • PI: Marti Hearst
  • Co-PI: Adam Arkin
• Computational Linguistics and Databases
  • Preslav Nakov, Ariel Schwartz, Brian Wolf, Barbara Rosario (alum), Gaurav Bhalotia (alum)
• User Interface / IR
  • Rowena Luk, Dr. Emilia Stoica
• Bioscience
  • Janice Hamerja Dr. TingTing Zhang (alum)
BioText Architecture
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
Sample Sentence
“Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”
Motivation
• Most natural language processing (NLP) algorithms make use of the results of previous processing steps:
  • Tokenizer
  • Part-of-speech tagger
  • Phrase boundary recognizer
  • Syntactic parser
  • Semantic tagger
• No standard way to represent, store and retrieve text annotations efficiently.
• MEDLINE has close to 13 million abstracts. Full text has started to become available as well.
System overview
• A system for flexible querying of text that has been annotated with the results of NLP processing.
• Supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL.
• Designed to scale to very large corpora.
  • Most NLP annotation systems assume in-memory usage
  • We've evaluated indexing architectures
Text Annotation Framework
• Annotations are stored independently of text in an RDBMS.
• Declarative query language for annotation retrieval.
• Indexing structure designed for efficient query processing.
Key Contributions
• Support for hierarchical and overlapping layers of annotation.
• Querying multiple levels of annotations simultaneously.
• First to evaluate different physical database designs for NLP annotation architecture.
Layers of Annotations
• Each annotation represents an interval spanning a sequence of characters
  • Absolute start and end positions
• Each layer corresponds to a conceptually different kind of annotation
  • Protein, MeSH label, noun phrase
• Layers can be
  • Sequential
  • Overlapping
    • Two multiple-word concepts sharing a word
  • Hierarchical (two different ways)
    • Spanning, when the intervals are nested as in a parse tree, or
    • Ontologically, when the token itself is derived from a hierarchical ontology
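The sequential, overlapping and spanning relationships listed above reduce to simple tests on character intervals. The following is a minimal Python sketch, not the BioText implementation: the `Annotation` class, its field names, the half-open interval convention and the example offsets are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # illustrative layer name, e.g. "shallow_parse" or "mesh"
    start: int   # absolute start character position (inclusive)
    end: int     # absolute end character position (exclusive; an assumption)
    tag: str


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True if `inner` is nested inside `outer` (spanning hierarchy)."""
    return outer.start <= inner.start and inner.end <= outer.end


def follows(a: Annotation, b: Annotation) -> bool:
    """True if `b` starts at or after the end of `a` (sequential)."""
    return a.end <= b.start


# Hypothetical offsets within the string "white blood cell death".
np_chunk = Annotation("shallow_parse", 0, 22, "NP")
blood_cell = Annotation("mesh", 6, 16, "concept_1")   # "blood cell"
cell_death = Annotation("mesh", 12, 22, "concept_2")  # "cell death"

# Two multi-word concepts sharing the word "cell" overlap,
# and both are nested inside the enclosing noun phrase.
print(overlaps(blood_cell, cell_death), contains(np_chunk, blood_cell))
```

Keeping the tests on raw character offsets means that layers produced by different tools can be compared without agreeing on a common tokenization.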
Layers of Annotations (figure)
[Figure, repeated across several build slides: annotation layers aligned over one example sentence. Layers shown: Word, Part of Speech, Shallow Parse, Gene/protein, Ontology.]
Sentence:        Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
Word:            185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523
Part of Speech:  NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
Shallow Parse:   NP PP NP VP PP NP NP PP NP
Ontology:        D019254 D044465 D001769 D002477 D003643 D001773 D016923 D007962 D016158
Full parse, sentence and section layers are not shown.
Example: Query for Noun Compound Extraction
Goal: find noun phrases consisting ONLY of 3 nouns
  plastic water bottle
  blue water bottle
  big plastic water bottle
FROM
  [layer='shallow_parse' && tag_name='NP'
    ^ [layer='pos' && tag_name="noun"]
      [layer='pos' && tag_name="noun"]
      [layer='pos' && tag_name="noun"] $
  ] AS compound
SELECT compound.content
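The anchors in the query above force the three noun tokens to cover the whole noun phrase. A minimal Python sketch of that condition, assuming the POS tags of one NP are already available as a list; the function name and the hand-assigned tags for the three example phrases are illustrative, while the set of tags grouped as "noun" follows the tag-type table shown later in the deck.

```python
# Penn Treebank tags that the tag-type table groups under "noun".
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def is_three_noun_compound(np_pos_tags):
    """True if the noun phrase consists of exactly three tokens, all nouns."""
    return len(np_pos_tags) == 3 and all(tag in NOUN_TAGS for tag in np_pos_tags)


# Hand-assigned POS tags for the example phrases (an assumption).
examples = {
    "plastic water bottle": ["NN", "NN", "NN"],
    "blue water bottle": ["JJ", "NN", "NN"],
    "big plastic water bottle": ["JJ", "NN", "NN", "NN"],
}

matches = [phrase for phrase, tags in examples.items() if is_three_noun_compound(tags)]
print(matches)
```

Only the first phrase qualifies: the second starts with an adjective, and the third has a fourth token inside the NP.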
Query for Noun Compound Extraction (SQL wrapping)
SELECT LOWER(lql.content) AS compound, COUNT(*) AS freq
FROM (
  BEGIN_LQL
    [layer='shallow_parse' && tag_name='NP'
      ^ [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"] $
    ] AS compound
    SELECT compound.content AS content
  END_LQL
) AS lql
GROUP BY LOWER(lql.content)
ORDER BY freq DESC
Query for Noun Compound Extraction (using artificial layers)
Goal: find noun phrases which have EXACTLY two nouns at the end, but no nouns before those two.
  “big blue water bottle”
  “plastic water bottle”
FROM
  [layer='shallow_parse' && tag_name='NP'
    ^ ( { ALLOW GAPS }
        ![layer='pos' && tag_name="noun"]
        ( [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"] ) $
      ) $
  ] AS compound
SELECT compound.content
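The goal stated for this query can be paraphrased as a check over the POS tags of one noun phrase. The sketch below is one reading of that goal in Python, not a translation of the LQL semantics: the function name and example tags are hypothetical, and it assumes an NP made of just two nouns also qualifies, which the slide does not state.

```python
# Penn Treebank tags that the tag-type table groups under "noun".
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def has_exactly_two_final_nouns(np_pos_tags):
    """True if the last two tokens are nouns and no earlier token is a noun."""
    if len(np_pos_tags) < 2:
        return False
    *before, second_last, last = np_pos_tags
    return (
        second_last in NOUN_TAGS
        and last in NOUN_TAGS
        and not any(tag in NOUN_TAGS for tag in before)
    )


# Hand-assigned POS tags for the two example phrases (an assumption).
big_blue = ["JJ", "JJ", "NN", "NN"]   # "big blue water bottle"
plastic = ["NN", "NN", "NN"]          # "plastic water bottle"
print(has_exactly_two_final_nouns(big_blue), has_exactly_two_final_nouns(plastic))
```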
Example: Paraphrases
• Want to find phrases with certain variations: immunodeficiency virus(?es) in ?the human(?s)
  • immunodeficiency virus in humans
  • immunodeficiency viruses in humans
  • immunodeficiency virus in the human
  • immunodeficiency virus in a human
Query for Paraphrases (optional layers and disjunction)
[layer='sentence'
  [layer='pos' && tag_name="noun" && content = "immunodeficiency"]
  [layer='pos' && tag_name="noun" && content IN ("virus","viruses")]
  [layer='pos' && tag_name='IN'] AS prep
  ?[layer='pos' && tag_name='DT' && content IN ("the","a","an")]
  [layer='pos' && tag_name="noun" && content IN ("human", "humans")]
]
SELECT prep.content
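For comparison, the same variations can be approximated with a plain regular expression over raw text. This is a swapped-in technique, not how LQL works: a regex has no access to POS tags, so the sketch below fixes the preposition to "in", whereas the query above accepts any token tagged IN and returns it. The pattern and function name are illustrative.

```python
import re

# Optional plural "es", optional determiner, optional plural "s".
PATTERN = re.compile(
    r"\bimmunodeficiency\s+virus(?:es)?\s+in\s+(?:(?:the|an?)\s+)?humans?\b",
    re.IGNORECASE,
)


def find_paraphrases(text):
    """Return every substring of `text` that matches one of the variations."""
    return [m.group(0) for m in PATTERN.finditer(text)]


variants = [
    "immunodeficiency virus in humans",
    "immunodeficiency viruses in humans",
    "immunodeficiency virus in the human",
    "immunodeficiency virus in a human",
]
print([bool(find_paraphrases(v)) for v in variants])
```

The regex version cannot generalize over prepositions or determiners by part of speech, which is the gap the annotation layers are meant to close.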
Example: Protein-Protein Interactions
• Find all sentences that consist of an NP containing a gene, followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.
[Diagram: Sentence = protein, then Activate(d,ing) / Inhibit(ed,ing) / Bind(s,ing), then protein]
Query for Protein-Protein Interactions
SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ]
    SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
  END_LQL
) lql
GROUP BY p1_text, verb_content, p2_text
ORDER BY count(*) DESC
Protein-Protein Interactions: Sample Output
PROTEIN 1                INTERACTION VERB  PROTEIN 2                FREQUENCY
Ca2                      activates         protein kinase           312
Cln3                     activate          protein kinase           234
TAP                      binds             transcription factor     192
TNF                      activates         protein tyrosine kinase  133
serine/threonine kinase  binding           RhoA GTPase              132
Phospholamban            inhibits          ATPase                   114
PRL                      activated         transcription factor     108
Interleukin 2            activates         transcription factor     84
Prolactin                activates         transcription factor     84
AMPA                     activated         protein kinase           78
Nerve growth factor      activates         protein kinase           78
LPS                      inhibited         MHC class II             75
Heat shock protein       Binding           p59                      72
EPO                      activated         STAT5                    63
EGF                      activated         PP2A                     60
cis                      binds             Sp1                      50
Example: Chemical-Disease Interactions
• “A new approach to the respiratory problems of cystic fibrosis is dornase alpha, a mucolytic enzyme given by inhalation.”
• Goal: extract the relation that dornase alpha (potentially) prevents cystic fibrosis.
• MeSH C06.689 subtree contains pancreatic diseases
• MeSH supplementary concepts represent chemicals.
Query on Disease-Chemical Interactions
[layer='sentence' { NO ORDER, ALLOW GAPS }
  [layer='shallow_parse' && tag_name='NP'
    [layer='chemicals'] AS chemical $
  ]
  [layer='shallow_parse' && tag_name='NP'
    [layer='mesh' && tree_number BELOW 'C06.689%'] AS disease $
  ]
] AS sent
SELECT chemical.text, disease.text, sent.text
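The BELOW operator in this query selects MeSH terms located under a given tree number. The slides do not say how LQL evaluates it; the Python sketch below shows one plausible reading, prefix matching on dot-separated tree numbers. The function name, the `include_self` flag and the tree numbers used in the usage line are assumptions for illustration.

```python
def is_below(tree_number: str, ancestor: str, include_self: bool = True) -> bool:
    """Return True if `tree_number` lies in the subtree rooted at `ancestor`.

    A trailing '%' on `ancestor` (as written on the slide) is ignored.
    """
    ancestor = ancestor.rstrip("%")
    if tree_number == ancestor:
        return include_self
    if len(ancestor) == 1:
        # Top-level category letter such as "D": children look like "D12".
        return tree_number.startswith(ancestor)
    # Require a full path component so "C06.6891" is not below "C06.689".
    return tree_number.startswith(ancestor + ".")


# Illustrative tree numbers, not taken from the slides.
print(is_below("C06.689.202", "C06.689%"), is_below("C06.130", "C06.689%"))
```

Matching on whole path components rather than raw string prefixes avoids false positives between sibling nodes whose numbers share leading digits.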
Results: Chemical-Disease
Query Translation
Database Design & Evaluation
Database Design
• Evaluated 5 different logical and physical database designs.
• The basic model is similar to that of TIPSTER (Grishman, 1996). Each annotation is stored as a record in a relation.
• Architecture 1 contains the following columns:
  1. docid: document ID;
  2. section: title, abstract or body text;
  3. layer_id: a unique identifier of the annotation layer;
  4. start_char_pos: starting character position, relative to the particular section and docid;
  5. end_char_pos: end character position, relative to the particular section and docid;
  6. tag_type: a layer-specific token unique identifier.
• There is a separate table mapping token IDs to entities (the string in case of a word, the MeSH label(s) in case of a MeSH term, etc.)
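The Architecture 1 relation can be sketched directly as a table. The example below uses Python's built-in sqlite3 rather than the DB2 server used in the study; the table names, the section code 'b', the layer number for words, the token IDs and the character offsets are all assumptions for illustration, and the token table is simplified to a single layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        docid          INTEGER NOT NULL,
        section        TEXT    NOT NULL,  -- e.g. 'b' for body text (code assumed)
        layer_id       INTEGER NOT NULL,
        start_char_pos INTEGER NOT NULL,
        end_char_pos   INTEGER NOT NULL,
        tag_type       INTEGER NOT NULL   -- layer-specific token ID
    )
""")
# Separate mapping from token IDs to entities (simplified: word layer only).
conn.execute("CREATE TABLE token (tag_type INTEGER PRIMARY KEY, entity TEXT NOT NULL)")

conn.executemany("INSERT INTO token VALUES (?, ?)",
                 [(1, "Kinase"), (2, "inhibits"), (3, "RAG-1")])
# Word-layer annotations (layer 0 assumed) for "Kinase inhibits RAG-1."
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", [
    (3345, "b", 0, 0, 6, 1),
    (3345, "b", 0, 7, 15, 2),
    (3345, "b", 0, 16, 21, 3),
])

# Reconstruct the word sequence of one document section from the annotations.
words = [row[0] for row in conn.execute("""
    SELECT t.entity
    FROM annotation a JOIN token t ON t.tag_type = a.tag_type
    WHERE a.docid = 3345 AND a.section = 'b' AND a.layer_id = 0
    ORDER BY a.start_char_pos
""")]
print(words)
```

Because the text itself is never stored in the annotation relation, new layers can be added as new rows without touching existing ones.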
Database Design (cont.)
• Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering for each layer.
  • Simplifies some SQL queries, as there is no need for “NOT EXISTS” self joins, which are required under Architecture 1 in cases where tokens from the same layer must follow each other immediately.
• Architecture 3 adds sentence_id, which is the number of the current sentence, and redefines sequence_pos as relative to both layer_id and sentence_id.
  • Simplifies most queries, since they are often limited to the same sentence.
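The difference between the two adjacency tests can be made concrete. The sketch below, using Python's sqlite3 instead of DB2, finds pairs of immediately consecutive nouns in one layer twice: once with a NOT EXISTS self join over character positions (Architecture 1 style) and once with sequence_pos (Architecture 2 style). The table layout is reduced to the columns needed, tags are stored as text for readability, and the offsets are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        docid INTEGER, layer_id INTEGER,
        start_char_pos INTEGER, end_char_pos INTEGER,
        tag_type TEXT,        -- text tags for readability; the slides use integer IDs
        sequence_pos INTEGER  -- the column added by Architecture 2
    )
""")
# POS layer for "white blood cell death": JJ NN NN NN (hypothetical offsets).
conn.executemany("INSERT INTO annotation VALUES (1, 1, ?, ?, ?, ?)", [
    (0, 5, "JJ", 1), (6, 11, "NN", 2), (12, 16, "NN", 3), (17, 22, "NN", 4),
])

# Architecture 1 style: b follows a, and no annotation of the layer lies between.
ARCH1 = """
    SELECT a.start_char_pos, b.start_char_pos
    FROM annotation a JOIN annotation b
      ON b.docid = a.docid AND b.layer_id = a.layer_id
     AND b.start_char_pos >= a.end_char_pos
    WHERE a.tag_type = 'NN' AND b.tag_type = 'NN'
      AND NOT EXISTS (
          SELECT 1 FROM annotation c
          WHERE c.docid = a.docid AND c.layer_id = a.layer_id
            AND c.start_char_pos >= a.end_char_pos
            AND c.end_char_pos <= b.start_char_pos)
    ORDER BY a.start_char_pos
"""
# Architecture 2 style: adjacency is a simple equality on sequence_pos.
ARCH2 = """
    SELECT a.start_char_pos, b.start_char_pos
    FROM annotation a JOIN annotation b
      ON b.docid = a.docid AND b.layer_id = a.layer_id
     AND b.sequence_pos = a.sequence_pos + 1
    WHERE a.tag_type = 'NN' AND b.tag_type = 'NN'
    ORDER BY a.start_char_pos
"""
pairs_arch1 = conn.execute(ARCH1).fetchall()
pairs_arch2 = conn.execute(ARCH2).fetchall()
print(pairs_arch1 == pairs_arch2)
```

Both queries return the same pairs here, but the second avoids the correlated subquery entirely.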
Database Design (cont.)
• Architecture 4 merges the word and POS layers, and adds word_id, assuming a one-to-one correspondence between them.
  • Reduces the number of stored annotations and the number of joins in queries with both word and POS constraints.
• Architecture 5 replaces sequence_pos with first_word_pos and last_word_pos, which correspond to the sequence_pos of the first/last word covered by the annotation.
  • Requires all annotation boundaries to coincide with word boundaries.
  • Copes naturally with adjacency constraints between different layers.
  • Allows for a simpler indexing structure.
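With word positions on every annotation, adjacency between different layers becomes a single comparison. A minimal Python sketch under stated assumptions: `last_word_pos` is treated as inclusive (following the wording above), and the `Ann` type, tag names and positions for "Kinase inhibits RAG-1." are illustrative.

```python
from typing import NamedTuple


class Ann(NamedTuple):
    layer: str
    tag: str
    sentence: int
    first_word_pos: int
    last_word_pos: int  # inclusive (an assumption based on the slide wording)


def adjacent(a: Ann, b: Ann) -> bool:
    """True if `b` starts on the word right after `a` ends, in the same sentence."""
    return a.sentence == b.sentence and b.first_word_pos == a.last_word_pos + 1


anns = [
    Ann("shallow_parse", "NP", 1, 1, 1),  # Kinase
    Ann("shallow_parse", "VP", 1, 2, 2),  # inhibits
    Ann("shallow_parse", "NP", 1, 3, 3),  # RAG-1
    Ann("pos", "VB", 1, 2, 2),            # inhibits
    Ann("gene", "RAG-1", 1, 3, 3),
]

# A noun phrase, immediately followed by a verb, immediately followed by a gene:
# three layers, no character-offset arithmetic.
matches = [
    (chunk.tag, verb.tag, gene.tag)
    for chunk in anns if chunk.layer == "shallow_parse" and chunk.tag == "NP"
    for verb in anns if verb.layer == "pos" and adjacent(chunk, verb)
    for gene in anns if gene.layer == "gene" and adjacent(verb, gene)
]
print(matches)
```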
Data Layout for all 5 Architectures
Example: “Kinase inhibits RAG-1.”
[Table: one record per annotation of the example sentence in PMID 3345, section b (body), covering the word, POS, gene, MeSH and shallow parse layers. Columns by architecture:]
• Basic architecture: PMID, SECTION, LAYER_ID, START_CHAR_POS, END_CHAR_POS, TAG_TYPE
• Added, architecture 2: SEQUENCE_POS
• Added, architecture 3: SENTENCE
• Added, architecture 4: WORD_ID
• Added, architecture 5: FIRST_WORD_POS, LAST_WORD_POS
Indexing Structure
• Two types of composite indexes: forward and inverted.
  • An index lookup can be performed on any column combination that corresponds to an index prefix.
  • The forward indexes support lookup based on position in a given document.
  • The inverted indexes support lookup based on annotation values (i.e., tag type and word id).
• Most query plans involve both forward and inverted indexes
  • Join statistics would have been useful
• Detailed statistics are essential.
  • Standard statistics in DB2 are insufficient.
• Records are clustered on their primary key
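The prefix rule for composite indexes can be illustrated with sorted tuples standing in for index entries. This is a conceptual Python sketch, not how DB2 stores its indexes: the records, layer numbers and tag IDs are hypothetical, and the helper builds a temporary key list for clarity where a real B-tree would descend directly.

```python
from bisect import bisect_left, bisect_right

# (docid, section, layer_id, start_char_pos, end_char_pos, tag_type), hypothetical rows
records = [
    (3345, "b", 0, 0, 6, 101),
    (3345, "b", 0, 7, 15, 102),
    (3345, "b", 1, 0, 6, 27),
    (3345, "b", 1, 7, 15, 53),
    (3346, "b", 1, 0, 4, 27),
]

# Forward index: ordered by position in a document.
forward = sorted(records)
# Inverted index: ordered by annotation value (layer_id, tag_type) first.
inverted = sorted((r[2], r[5], r[0], r[1], r[3], r[4]) for r in records)


def prefix_lookup(index, prefix):
    """Return all entries whose leading columns equal `prefix`."""
    n = len(prefix)
    keys = [entry[:n] for entry in index]  # for clarity only
    return index[bisect_left(keys, prefix):bisect_right(keys, prefix)]


# "All layer-1 annotations in the body of document 3345" uses the forward index.
in_doc = prefix_lookup(forward, (3345, "b", 1))
# "All documents containing tag 27 in layer 1" uses the inverted index.
docs_with_tag = [entry[2] for entry in prefix_lookup(inverted, (1, 27))]
print(len(in_doc), docs_with_tag)
```

A lookup on (layer_id, tag_type) cannot use the forward ordering because those columns are not a prefix of it, which is why both index types appear in most query plans.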
Indexing Structure (cont.)
Type: F = forward, I = inverted
Architecture  Type  Columns
Arch 1-4      F     *DOCID +SECTION +LAYER_ID +START_CHAR_POS +END_CHAR_POS +TAG_TYPE
Arch 1-4      I     LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS
Arch 2        F     DOCID +SECTION +LAYER_ID +SEQUENCE_POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 2        I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SEQUENCE_POS +START_CHAR_POS +END_CHAR_POS
Arch 3-4      F     DOCID +SECTION +LAYER_ID +SENTENCE +SEQUENCE_POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 3-4      I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +SEQUENCE_POS +START_CHAR_POS +END_CHAR_POS
Arch 4        I     WORD_ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS +SENTENCE +SEQUENCE_POS
Arch 5        F     *DOCID +SECTION +LAYER_ID +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS +TAG_TYPE
Arch 5        I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS
Arch 5        I     WORD_ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS
Experimental Setup
• Annotated 13,504 MEDLINE abstracts
  • Stanford Lexicalized Parser (Klein and Manning, 2003) for sentence splitting, word tokenization, POS tagging and parsing.
  • We wrote a shallow parser and tools for gene and MeSH term recognition.
• This resulted in 10,910,243 records stored in an IBM DB2 Universal Database Server.
• Defined 4 workloads based on variants of queries.
Experimental Setup: 4 Workloads
(a) Protein-Protein Interaction (Blaschke et al., 1999)
[layer='sentence' {ALLOW GAPS}
  [layer='gene'] AS gene1
  [layer='pos' && tag_name="verb" && content="binds"] AS verb
  [layer='gene'] AS gene2
] SELECT gene1.content, verb.content, gene2.content
(b) Protein-Protein Interaction (Thomas et al., 2000)
[layer='sentence'
  [layer='shallow_parse' && tag_name="NP"] AS np1
  [layer='pos' && tag_name="verb" && content='binds'] AS verb
  [layer='pos' && tag_name="prep" && content='to']
  [layer='shallow_parse' && tag_name="NP"] AS np2
] SELECT np1.content, verb.content, np2.content
(c) Descent of Hierarchy (Rosario et al., 2002)
[layer='shallow_parse' && tag_name="NP"
  [layer='pos' && tag_name="noun" ^ [layer='mesh' && tree_number BELOW "G07.553"] AS m1 $ ]
  [layer='pos' && tag_name="noun" ^ [layer='mesh' && tree_number BELOW "D"] AS m2 $ ]
] SELECT m1.content, m2.content
  A01 A07: limb:vein, shoulder:artery
(d) Acronym-Meaning Extraction (Pustejovsky et al., 2001)
[layer='shallow_parse' && tag_name="NP"] AS np1
[layer='pos' && content='(']
[layer='shallow_parse' && tag_name="NP"] AS np2
[layer='pos' && content=')']
Results
Workload        (a)    (b)   (c)  (d)
#Queries        54     11    50   1
#Results/query  303.4  77.5  1.6  16,701
LQL lines       8      6     5    4

Workload  Architecture  SQL lines  # Joins  Time (sec)
(a)       1             37         6        3.98
(a)       2             37         6        4.35
(a)       3             34         6        3.59
(a)       4             29         5        1.69
(a)       5             29         5        1.94
(b)       1             91         12       3.88
(b)       2             77         11       5.68
(b)       3             75         11       5.41
(b)       4             65         9        3.85
(b)       5             50         7        3.55
(c)       1             45         7        17.9
(c)       2             38         6        23.42
(c)       3             38         6        21.49
(c)       4             39         6        30.07
(c)       5             41         6        4.06
(d)       1             59         7        1,879
(d)       2             50         7        1,700
(d)       3             53         7        2,182
(d)       4             53         7        1,682
(d)       5             35         4        1,582
Results
Space (MB)     Arch 1  Arch 2   Arch 3   Arch 4   Arch 5
Data Storage   168.5   168.5    168.5    132.5    136.5
Index Storage  617.0   1,397.0  1,441.0  1,182.0  673.5
Total Storage  785.5   1,565.5  1,609.5  1,314.5  810.0
• Architecture 5 performs well (if not best) on all query types, while the other architectures perform poorly on at least one query type.
• Storage requirement of Architecture 5 is comparable to that of Architecture 1
• Architecture 5 results in much simpler queries
• Conclusion: We recommend Architecture 5 in most cases, or Architecture 1 if an atomic annotation layer cannot be defined.
Scalability Analysis
• Combined workload of 3 query types
• Varying buffer pool sizes
Buffer Pool Size (MB)  Elapsed Time (ms)  Buffer Read Time (ms)
1000                   2300               1050
100                    2900               1670
10                     4600               3340
1                      8300               6250
• Suggests that query execution time grows sub-linearly as the memory size shrinks.
• We believe a similar ratio will be observed when increasing the database size and keeping the memory size fixed
• Parallel query execution can be enabled after partitioning the annotation on document_id
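The sub-linear relationship can be checked directly from the table. The short Python sketch below only recomputes ratios from the four measurements shown; the variable names are illustrative.

```python
# (buffer pool size in MB, elapsed time in ms), copied from the table above.
measurements = [(1000, 2300), (100, 2900), (10, 4600), (1, 8300)]

# Slowdown factor for each tenfold reduction in buffer pool size.
step_ratios = [
    slower / faster
    for (_, faster), (_, slower) in zip(measurements, measurements[1:])
]
# Slowdown over the full thousandfold reduction.
overall = measurements[-1][1] / measurements[0][1]

for (big, _), (small, _), ratio in zip(measurements, measurements[1:], step_ratios):
    print(f"{big} MB -> {small} MB: elapsed time x {ratio:.2f}")
print(f"1000 MB -> 1 MB: elapsed time x {overall:.2f}")
```

Each tenfold cut in memory costs well under a tenfold slowdown, and the full thousandfold cut costs less than a fourfold one.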
Study on a larger dataset
• Annotated 1.4 million MEDLINE abstracts
  • 10 million sentences
  • 320 million annotations
  • 70 GB total database size
Workload        (a)     (b)    (c)   (d)      Random (a, b, c)
#Queries        54      11     50    1        115
#Results/query  32,295  5,420  48    113,483  15,686
Time/query      0:50    55:44  1:35  3:33:57  6:26
Related Work
• Annotation graphs (AG): directed acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children. (Bird and Liberman, 2001)
• Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined for each pair. (Cassidy & Harrington, 2001)
• The Q4M query language for MATE: directed graph; constraints and ordering of the annotated components. Stored in XML. (McKelvie et al., 2001)
• TIQL: queries consist of manipulating intervals of text, indicated by XML tags; supports set operations. (Nenadic et al., 2002)
Annotation Graphs
Find arcs labeled as words, whose phonetic transcription starts with a “hv”:
  SELECT I WHERE X.[id:I].Y <- db/wrd X.[:hv].[]*.Y <- db/phn;
Emu
Find sentences of phonetic “A” followed by “p”, both dominated by an “S” syllable:
  [[Phonetic=A -> Phonetic=p] ^ Syllable=S]
Q4M (MATE system)
Find nouns followed by the word “lesser”:
  ($a word) ($b word); ($a pos ~ "NN") && ($a <> $b) && ($b # ~ "lesser")
TIQL (TIMS system)
Find sentences containing the noun phrase “COUP-TF II” and the verb “inhibit”:
  (<SENTENCE> <TERM nf='COUP TF II'>) <V lemma='inhibit'>
What about XQuery/XPath?
Main Advantages of LQL System
• Stand-off annotation
  • Flexible and modular
  • Multi-layered, including overlaps
• LQL – simple yet powerful
  • Support for hierarchies
  • Optimized for cross-layer queries
  • Much more expressive than standard text search engines
• Seamless integration with SQL and RDBMS
  • Easy integration with additional data sources
  • Simple parallelism
• Full text support
  • Caption search
  • Formatting-aware queries
  • Flexible support for document structure
On the Horizon
• Full text documents support
  • Really complex in bioscience text
  • Caption search
  • Formatting-aware annotation layers
  • Flexible support for document structure
• Query simplification
  • Shorthand syntax
  • GUI helper
Syntax-Helper Interface
Thank you!
biotext.berkeley.edu/lql
Overlap Example
Meta-data tables
BIOTEXT_ANNOTATION_LAYER
LAYER_ID LAYER_NAME OWNER LAST_UPDATED
1 pos hearst 6/12/2005
2 full_parse hearst 6/12/2005
3 shallow_parse hearst 6/12/2005
4 sentence hearst 6/12/2005
5 gene hearst 6/12/2005
6 mesh hearst 6/12/2005
7 chemicals hearst 6/12/2005
Meta-data tables
BIOTEXT_ANNOTATION_ATTRIBUTES
LAYER_ID  ATTRIBUTE        ATTRIBUTE_FIELD  TABLE_NAME                     ATTRIBUTE_ID   ATTRIBUTE_TEXT   DBL_QUOTE_ALIAS  TREE_TABLE                    TREE_DESC      TREE_NUM
-1        layer            layer_id         biotext_annotation_layers      layer_id       layer_name       layer            None                          None           None
-1        tag_name         tag_type         biotext_annotation_tag_types   tag_type_id    tag_name         tag_group        None                          None           None
-1        tag_group        tag_type         biotext_annotation_tag_types   tag_type_id    tag_group        tag_group        None                          None           None
1         content          word_id          biotext_annotation_word        word_id        word             content_lower    None                          None           None
1         content_lower    word_id          biotext_annotation_word        word_id        word_lower       content_lower    None                          None           None
5         name             tag_type         locuslink_aliases              locus_id       name             name             None                          None           None
6         tree_number      tag_type         biotext_annotation_mesh_tree   descriptor_ui  tree_number      tree_number      biotext_annotation_mesh_tree  descriptor_ui  tree_number
6         mesh_term        tag_type         biotext_annotation_mesh_terms  descriptor_ui  mesh_term        mesh_term_lower  biotext_annotation_mesh_tree  descriptor_ui  tree_number
6         mesh_term_lower  tag_type         biotext_annotation_mesh_terms  descriptor_ui  mesh_term_lower  mesh_term_lower  biotext_annotation_mesh_tree  descriptor_ui  tree_number
Meta-data tables
BIOTEXT_ANNOTATION_TAG_TYPES
    LAYER_ID  TAG_TYPE_ID  TAG_NAME  TAG_GROUP
21  2         1019         IN        IN
22  2         1020         INTJ      INTJ
23  2         1021         JJ        adjective
24  2         1022         JJR       adjective
25  2         1023         JJS       adjective
26  2         1025         LS        LS
27  2         1069         LST       LST
28  2         1026         MD        MD
29  2         1070         NAC       NAC
30  2         1027         NN        noun
31  2         1028         NNP       noun
32  2         1029         NNPS      noun
33  2         1030         NNS       noun
34  2         1031         NP        NP
35  2         1032         NX        NX
Meta-data tables
BIOTEXT_ANNOTATION_WORD
    WORD_ID  WORD                WORD_LOWER
1   1212952  BCl                 bcl
2   1212953  2,2'-disulfonic     2,2'-disulfonic
3   1212954  1762-1860           1762-1860
4   1212955  Premkumar           premkumar
5   1212956  329:265-285         329:265-285
6   1212957  EVPROC              evproc
7   1212958  fascinae            fascinae
8   1212959  fascines            fascines
9   1212960  Cox-Stuart          cox-stuart
10  1212961  epidydimo-orchitis  epidydimo-orchitis
11  1212962  10-20-min           10-20-min
12  1212963  0.05-10-ng/ml       0.05-10-ng/ml
13  1212964  1.016x              1.016x
14  1212965  Goldberg-Lindblom   goldberg-lindblom
15  1212966  Lundborg            lundborg
16  1212967  graft-loss          graft-loss
References
• Steven Bird and Mark Liberman. 2001. A formal framework for linguistic annotation. Speech Communication, 33(1–2):23–60.
• Steve Cassidy and Jonathan Harrington. 2001. Multi-level annotation in the Emu speech database management system. Speech Communication, 33(1–2):61–77.
• David McKelvie, Amy Isard, Andreas Mengel, Morten B. Moller, Michael Grosse and Marion Klein. 2001. The MATE workbench: an annotation tool for XML coded speech corpora. Speech Communication, 33(1–2):97–112.
• Goran Nenadic, Hideki Mima, Irena Spasic, Sophia Ananiadou and Jun-ichi Tsujii. 2002. Terminology-Driven Literature Mining and Knowledge Acquisition in Biomedicine. International Journal of Medical Informatics, 67:33–48.
• Ralph Grishman. 1996. Building an Architecture: a CAWG Saga. Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann.
• Steve Cassidy. 1999. Compiling Multi-tiered Speech Databases into the Relational Model: Experiments with the Emu System. 6th European Conference on Speech Communication and Technology (Eurospeech 99), 2127–2130, Budapest, Hungary.
• Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki Maeda. 2002. Models and Tools for Collaborative Annotation. Third International Conference on Language Resources and Evaluation, 2066–2073.
Acquiring Labeled Data using Citances
A discovery is made …
A paper is written …
That paper is cited …
and cited …
and cited …
… as the evidence for some fact(s) F.
Each of these in turn is cited for some fact(s) …
… until it is the case that all important facts in the field can be found in citation sentences alone!
Citances
• Nearly every statement in a bioscience journal article is backed up with a cite.
• It is quite common for papers to be cited 30-100 times.
• The text around the citation tends to state biological facts. (Call these citances.)
• Different citances will state the same facts in different ways …
• … so can we use these for creating models of language expressing semantic relations?