A Survey on Information Extraction from Documents Using Structures of Sentences
DESCRIPTION
A Survey on Information Extraction from Documents Using Structures of Sentences. Chikayama Taura Lab. M1 Mitsuharu Kurita. Introduction. Current search systems are based on two assumptions: users send words, not sentences, and the aim is to find documents that are related to the query words.
TRANSCRIPT
![Page 1: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/1.jpg)
A SURVEY ON INFORMATION EXTRACTION FROM DOCUMENTS USING STRUCTURES OF SENTENCES
Chikayama Taura Lab. M1 Mitsuharu Kurita
![Page 2: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/2.jpg)
INTRODUCTION
Current search systems are based on two assumptions:
  1. Users send words, not sentences
  2. The aim is to find documents that are related to the query words
We unconsciously learn to select words that are likely to appear near the target information
In some cases this clue does not work well
![Page 3: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/3.jpg)
INTRODUCTION
For more convenient access to information:
  Analysis of the details of the question, to identify the target information
  Analysis of the information in the retrieved documents, to find the requested information (Information Extraction)
![Page 4: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/4.jpg)
OUTLINE
Introduction
Overview of Information Extraction (IE)
IE with pattern matching
IE with sentence structures
  Frequent substructure
  Shortest path between 2 words
  Applying the kernel method for structured data
Conclusion
![Page 5: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/5.jpg)
INFORMATION EXTRACTION
What is Information Extraction?
  A task in natural language processing
  Addresses the extraction of information from texts, not the retrieval of documents
  Originated with an international conference named MUC
Message Understanding Conference (MUC)
  A competition in IE among research groups
  Set information extraction tasks for each conference between 1987 and 1997
![Page 6: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/6.jpg)
MUC COMPETITION
An example of a MUC task: the MUC-3 terrorism domain
  Input: news articles (some of which report terrorism events)
  Output: the instances involved in each incident
![Page 7: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/7.jpg)
MUC COMPETITION
Pattern matching or linguistic analysis
  At that time (1987-1997), advanced natural language processing was difficult to use
  Therefore, most competitors adopted pattern matching to find instances
![Page 9: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/9.jpg)
EXAMPLE OF PATTERN MATCHING
CIRCUS [92 Lehnert et al.]
  Each pattern consists of a “trigger word” and a “linguistic pattern”
Pattern: kidnap-passive
  Trigger: “kidnap”
  Linguistic pattern: “<subject> passive-verb”
  Variable: “target”
Example: “The mayor was kidnapped by terrorists.”
  1. “kidnap” activates the pattern
  2. “was kidnapped” is a passive verb phrase
  3. The subject “mayor” is the target
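The three steps of the kidnap-passive example can be sketched in a few lines of Python. This is only a toy illustration, not the CIRCUS implementation: the `Pattern` class, the `apply_pattern` function and the regular expression standing in for “<subject> passive-verb” are all assumptions made for this sketch, and a real system would rely on a syntactic analysis rather than a regex.

```python
import re
from dataclasses import dataclass


@dataclass
class Pattern:
    name: str      # e.g. "kidnap-passive"
    trigger: str   # word stem that activates the pattern
    variable: str  # slot that receives the matched subject


def apply_pattern(pattern, sentence):
    """Return {variable: subject} if the pattern fires, else None."""
    # Step 1: the trigger word must occur in the sentence.
    if pattern.trigger not in sentence.lower():
        return None
    # Step 2: hypothetical stand-in for "<subject> passive-verb":
    # a word, a form of "be", then a verb built on the trigger stem.
    passive = re.compile(
        r"(?:the|a|an)?\s*(?P<subject>\w+)\s+(?:was|were|is|are)\s+"
        + re.escape(pattern.trigger) + r"\w*",
        re.IGNORECASE,
    )
    match = passive.search(sentence)
    if match is None:
        return None
    # Step 3: the subject fills the pattern's variable.
    return {pattern.variable: match.group("subject")}


kidnap_passive = Pattern("kidnap-passive", "kidnap", "target")
result = apply_pattern(kidnap_passive, "The mayor was kidnapped by terrorists.")
```

An active sentence such as “Terrorists kidnapped the mayor.” contains the trigger but fails the passive check, so the sketch returns `None` for it.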
![Page 10: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/10.jpg)
PROBLEMS OF PATTERN MATCHING
Creating patterns takes a huge amount of time
  In many cases, they were written by hand
Patterns depend heavily on the target domain
  It is difficult to adapt them to a new task
This motivates the automatic construction of patterns
![Page 11: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/11.jpg)
THE EARLIEST AUTOMATIC PATTERN GENERATION
AutoSlog [93 Riloff et al.]
  Creates the patterns for CIRCUS automatically
  Training data: articles in which the target words are tagged
  Created 1237 patterns from 1500 tagged texts
  Only 450 of them were judged to be valid by a human
Example: “The mayor was kidnapped by terrorists.”
Pattern: kidnap-passive
  Trigger: “kidnap”
  Linguistic pattern: “<subject> passive-verb”
  Variable: “target”
![Page 12: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/12.jpg)
Recently it has become possible to use deeper linguistic analysis
Some studies address new IE tasks by combining these linguistic resources with machine learning approaches
![Page 14: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/14.jpg)
SENTENCE STRUCTURES
Dependency structure
  Describes modification relations between words
  One sentence forms a tree structure
Predicate-argument structure
  Describes the semantic relations between a predicate and its arguments
  One sentence forms a graph structure
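To make the “one sentence forms a tree” point concrete, a dependency structure can be stored as one head index per word. The sketch below is an assumption-laden illustration: the head indices for the example sentence are assigned by hand rather than produced by a parser, and `is_tree` and `children` are hypothetical helper names.

```python
# Each token stores the index of the word it modifies (its head).
# -1 marks the root. The heads are hand-assigned for illustration,
# not the output of a dependency parser.
tokens = ["The", "mayor", "was", "kidnapped", "by", "terrorists"]
heads = [1, 3, 3, -1, 3, 4]


def is_tree(heads):
    """True if the head indices form a single rooted tree (no cycles)."""
    roots = [i for i, h in enumerate(heads) if h == -1]
    if len(roots) != 1:
        return False
    for start in range(len(heads)):
        seen = set()
        node = start
        while heads[node] != -1:   # walk up towards the root
            if node in seen:
                return False       # revisiting a node means a cycle
            seen.add(node)
            node = heads[node]
    return True


def children(heads, i):
    """Indices of the words that modify word i."""
    return [j for j, h in enumerate(heads) if h == i]
```

A predicate-argument structure would not fit this one-head-per-word encoding in general, because a word may be an argument of several predicates; that is why it forms a graph rather than a tree.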
![Page 15: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/15.jpg)
DIFFICULTIES IN USING STRUCTURED DATA
Most machine learning algorithms handle data as feature vectors
It is difficult to express structured data (e.g. trees, graphs) as vectors
Ways to use sentence structures for IE
  Frequent substructures
  Shortest paths between 2 words
  Applying the kernel method for structured data
![Page 17: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/17.jpg)
IE WITH SUBGRAPH OF SENTENCE STRUCTURES
On-Demand Information Extraction (ODIE) [06 Sekine et al.]
  Creates extraction patterns on demand and extracts information with them
[Figure: system diagram with the components query, article database, relevant articles, dependency analyzer, dependency trees, frequent subtree mining, subtree patterns and table of information]
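The idea of mining frequent substructures from dependency trees can be sketched as follows. This is a deliberately reduced version and not the algorithm of the cited system: it counts only downward label paths instead of arbitrary subtrees, and the tree encoding, the names `downward_paths` and `frequent_paths`, the support threshold and the three toy trees (with company names and amounts already replaced by the classes COM and MNY) are all assumptions.

```python
from collections import Counter

# A tree is (label, [child trees]). The toy trees below are hypothetical.
TREES = [
    ("acquire", [("COM", []), ("COM", []), ("for", [("MNY", [])])]),
    ("acquire", [("COM", []), ("for", [("MNY", [])])]),
    ("say", [("COM", [])]),
]


def downward_paths(tree, max_len=3):
    """Yield label paths of 2..max_len nodes that start at any node."""
    def extend(node, prefix):
        label, kids = node
        path = prefix + (label,)
        if len(path) >= 2:
            yield path
        if len(path) < max_len:
            for kid in kids:
                yield from extend(kid, path)

    yield from extend(tree, ())
    for kid in tree[1]:
        yield from downward_paths(kid, max_len)


def frequent_paths(trees, min_support=2, max_len=3):
    """Keep the paths that occur in at least min_support trees."""
    support = Counter()
    for tree in trees:
        # set(): a path counts once per tree, however often it repeats
        for path in set(downward_paths(tree, max_len)):
            support[path] += 1
    return {p: n for p, n in support.items() if n >= min_support}


patterns = frequent_paths(TREES)
```

With these toy trees, the paths under “acquire” survive the threshold, while the single “say” tree contributes nothing frequent.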
![Page 18: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/18.jpg)
EXPERIMENTAL RESULTS
Generated patterns
  Patterns found for the query “merger and acquisition” (M&A):
    <COM1> <agree to buy> <COM2> <for MNY>
    <COM1> <will acquire> <COM2> <for MNY>
    <a MNY merger> <of COM1> <and COM2>
Extracted information
  For the query “acquire, acquisition, merger, buy, purchase”
![Page 19: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/19.jpg)
EXPERIMENTAL RESULTS
Very quick construction of patterns
  In MUC, participants were allowed one month
  ODIE takes only a few minutes to return the result
No training corpus is needed
  ODIE learns extraction patterns from the data
Information about recurring events can be extracted well
  Mergers and acquisitions
  Nobel prize winners
![Page 21: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/21.jpg)
IE WITH SHORTEST PATH BETWEEN WORDS
Extraction of interacting protein pairs [06 Yakushiji et al.]
  Extracts interacting protein pairs from biomedical articles
  Focuses on the shortest path between 2 protein names in the predicate-argument structure
  Discriminates with a Support Vector Machine (SVM)
Example: “Entity1 is interacted with a hydrophilic loop region of Entity2.”
[Figure: predicate-argument structure of the example, with the nodes be, entity1, interact, with, region, of, a, hydrophilic, loop and entity2]
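Finding the shortest path between the two protein names is an ordinary breadth-first search once the structure is viewed as an undirected graph. The sketch below assumes a hand-built edge list for the example sentence; the edges are an illustration and not the output of the parser used in the cited work, and `shortest_path` is a hypothetical helper name.

```python
from collections import deque

# Undirected view of a predicate-argument graph for the example sentence.
# The edges are hand-assigned for illustration (an assumption).
EDGES = [
    ("be", "entity1"), ("be", "interact"),
    ("interact", "entity1"), ("with", "interact"), ("with", "region"),
    ("a", "region"), ("hydrophilic", "region"), ("loop", "region"),
    ("of", "region"), ("of", "entity2"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, []).append(u)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # follow parents back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


path = shortest_path(EDGES, "entity1", "entity2")
```

Under these assumed edges the path drops the modifiers “a”, “hydrophilic” and “loop” as well as the auxiliary “be”, leaving only the words that connect the two entities.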
![Page 22: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/22.jpg)
PATTERN GENERATION
Variation of patterns
  The extracted patterns alone are not enough
  Patterns are divided into parts and recombined into new patterns
[Figure: patterns divided into Main, Prep and Entity parts; labels in the example: X, interact, with, protein, region, of, Y]
![Page 23: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/23.jpg)
PATTERN GENERATION
Validation of patterns
  Some of these patterns are inappropriate
  Each pattern is scored by its adequacy on the training data
[Figure: pattern scoring and the resulting feature vector]
TP: True Positive, FP: False Positive
![Page 24: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/24.jpg)
SUPPORT VECTOR MACHINE (SVM)
2-class linear classifier
Divides the data space with a hyperplane
Margin maximization
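The slide's three points correspond to the standard hard-margin formulation, written out here for reference. The symbols are assumptions introduced for this note: training points $\mathbf{x}_i$ with labels $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$ and offset $b$.

```latex
% Separating hyperplane and decision rule
\mathbf{w} \cdot \mathbf{x} + b = 0,
\qquad
f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)

% Margin maximization: the margin width is 2 / \lVert \mathbf{w} \rVert,
% so maximizing it is equivalent to
\min_{\mathbf{w},\, b} \ \frac{1}{2} \lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1
\quad (i = 1, \dots, n)
```

This form assumes linearly separable data; practical SVMs usually add slack variables to tolerate violations.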
![Page 25: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/25.jpg)
EXPERIMENTAL RESULTS
Learning: AImed corpus
  225 abstracts of biomedical papers
  Annotated with protein names and interactions
Extraction: MEDLINE
  14 million titles and 8 million abstracts
Extracted data
  7775 protein pairs
  64.0% precision
  83.8% recall
![Page 27: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/27.jpg)
IE WITH THE KERNEL METHOD ON SENTENCE STRUCTURES
Kernel method (e.g. SVM)
  Data are used only in the form of dot products
  If the dot product can be calculated directly, the vectors themselves need not be calculated
  Furthermore, other functions can be used in place of the dot product as long as they meet certain conditions
[Figure: diagram with the labels raw data, vector space, classifier and kernel function]
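The claim that the dot product can be computed without building the vectors is easy to check on a small case. The sketch below uses the quadratic kernel as an example chosen for this note (it is not taken from the slides); `phi`, `dot` and `quadratic_kernel` are hypothetical names.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map: all ordered pairwise products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


def quadratic_kernel(x, y):
    """Equals dot(phi(x), phi(y)) without ever building phi."""
    return dot(x, y) ** 2


x, y = [1, 2, 3], [4, 5, 6]
explicit = dot(phi(x), phi(y))     # works in a 9-dimensional space
implicit = quadratic_kernel(x, y)  # works in the original 3 dimensions
```

Both routes give the same number, but the explicit map grows quadratically with the input dimension while the kernel does not. For trees and graphs, the analogous step is to define a similarity function directly on the structures.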
![Page 28: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/28.jpg)
RELATION EXTRACTION
Relation Extraction with Tree Kernel [04 Culotta et al.]
  Classifies the relation between 2 entities
    5 entity types (person, organization, geo-political-entity, location, facility)
    5 major types of relations (at, near, part, role, social)
  Classifies the smallest subtree of the dependency tree that includes both entities
![Page 29: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/29.jpg)
TREE KERNEL
Represents the similarity between 2 tree-shaped data
Calculated as the sum of the similarities of nodes
Flowchart of the calculation:
  1. Start: enqueue the root node pair
  2. Dequeue a node pair
  3. Add the similarity
  4. Find all child node sequence pairs whose nodes have common main features
  5. Enqueue the child node pairs
  6. Is the queue empty? If no, go back to step 2; if yes, return the similarity and end
![Page 30: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/30.jpg)
CALCULATION OF TREE KERNEL
Features of nodes
  The similarity between nodes is defined as the number of common features (excluding the main features)
[Figure: table of node features, with the main features marked]
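The queue-based calculation and the node similarity described above can be sketched as follows. This is a simplified illustration and not the kernel of the cited paper: it pairs individual children whose main features agree instead of enumerating child node sequences, and the `Node` class, the feature names and the function names are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    main: str                          # main feature: must agree for a pair
    features: frozenset = frozenset()  # remaining features, compared by overlap
    children: list = field(default_factory=list)


def node_similarity(a, b):
    """Number of common features, main feature excluded."""
    return len(a.features & b.features)


def tree_kernel(root_a, root_b):
    """Sum node similarities over matched node pairs, top-down."""
    if root_a.main != root_b.main:
        return 0
    total = 0
    queue = deque([(root_a, root_b)])   # enqueue the root node pair
    while queue:                        # repeat until the queue is empty
        a, b = queue.popleft()          # dequeue a node pair
        total += node_similarity(a, b)  # add the similarity
        # Simplification: pair single children, not child sequences.
        for child_a in a.children:
            for child_b in b.children:
                if child_a.main == child_b.main:
                    queue.append((child_a, child_b))
    return total


# Hypothetical trees: the roots share "noun", the B children share "verb".
tree_1 = Node("A", frozenset({"noun", "person"}),
              [Node("B", frozenset({"verb"})), Node("C", frozenset({"adj"}))])
tree_2 = Node("A", frozenset({"noun", "org"}),
              [Node("B", frozenset({"verb"})), Node("D", frozenset({"adj"}))])
score = tree_kernel(tree_1, tree_2)
```

Children C and D carry the same feature but different main features, so they are never paired and add nothing to the score.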
![Page 31: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/31.jpg)
CALCULATION OF TREE KERNEL
[Figure: two trees, one with nodes A to E and one with nodes A’ to F’, and the matched node pairs]
X and X’ denote nodes whose main features are common
![Page 32: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/32.jpg)
EXPERIMENTAL RESULTS
Data set: ACE corpus
  800 annotated documents (gathered from newspapers and broadcasts)
  5 entity types (person, organization, geo-political-entity, location, facility)
  5 major types of relations (at, near, part, role, social)

| Kernel | Precision (%) | Recall (%) |
|---|---|---|
| Bag-of-words kernel | 47.0 | 10.0 |
| Tree kernel | 69.6 | 25.3 |
![Page 34: A Survey on Information Extraction from Documents Using Structures of Sentences](https://reader035.vdocuments.mx/reader035/viewer/2022070423/56816737550346895ddbea01/html5/thumbnails/34.jpg)
CONCLUSION
Overview of Information Extraction
  The aim of information extraction
  The recent move toward using deep linguistic resources
Ways to use sentence structures for IE
  Difficulties of using structured data in machine learning
  Three different approaches to exploiting them