SD study: Statistical Learning of Domain-Dependent Semantic Structure
TRANSCRIPT
Kyoshiro SUGIYAMA , AHC-Lab. , NAIST
7/14 SD study
Chapter 2
Statistical Learning of Domain-Dependent Semantic Structure
D1 Kyoshiro Sugiyama
Chapter 2
2.1 Semantic Information Structure based on Predicate-Argument Structure
2.1.1 Predicate-Argument Structure
2.2 Extraction of Domain-dependent P-A Patterns
2.2.1 Significance Score based on TF-IDF Measure
2.2.2 Significance Score based on Naïve Bayes Model
2.2.3 Clustering of Named Entities
2.2.4 Evaluation of P-A Significance Scores
2.3 Conclusion
Chapter 2 intro.
This chapter introduces a statistical learning method of domain knowledge based on the semantic structure of the domain corpus, which plays an important role in the proposed system.
The domain knowledge is based on predicate-argument (P-A) structure, one of the most fundamental information structures in natural language text.
The useful information structure depends on the domain. In order to automatically extract useful domain-dependent P-A structures, a statistical measure is introduced, resulting in completely unsupervised learning of semantic information structure given a domain corpus.
In three lines
To automatically extract domain knowledge based on domain-dependent useful P-A structure
A statistical measure is introduced
Unsupervised learning of semantic information structure
Predicate-Argument (P-A) Structures (述語項構造)
[Figure omitted: an example sentence with its arguments (項) and predicates (述語) marked.]
Problem setting
Not every P-A structure is useful.
Conventional methods extract useful information using hand-crafted templates.
This is so costly that it cannot be applied to a variety of domains.
In this chapter, two scoring methods are proposed to extract domain-dependent useful information patterns.
Useful information of a domain
Only a fraction of the patterns are useful, and which patterns are useful is domain-dependent.
e.g., beat: 打ち勝つ, acquire: 買収する
Baseball domain: [A beat B] important; [A hit B] important; [A sell B] not important; [A acquire B] not important
Business domain: [A beat B] not important; [A hit B] not important; [A sell B] important; [A acquire B] important
Automatic information extraction
Which P-A pair is important in a given domain?
Two significance measures are proposed.
From a corpus or websites for each domain (baseball, soccer, business, economy, …), scores are calculated automatically, e.g. for the baseball domain:
[A hit B] 0.9
[A beat B] 0.9
⋮
[A sell B] 0.2
[A acquire B] 0.1
Definition of TF-IDF measure
[Equation omitted.]
w_i: word, d: document, C(·): count function, α, β: smoothing factors
Here α = 1, β = 1.
Intuitive meaning of TF-IDF score
How often does w_i occur in this document?
How rare are the documents d that contain w_i?
Domain-specific and frequent words have high score.
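The intuition above can be sketched in code. The exact equation from the slide is not reproduced here, so this assumes a standard smoothed TF-IDF with α and β applied inside the IDF term; the example documents are illustrative only.

```python
from collections import Counter
import math

def tfidf(word, doc, docs, alpha=1.0, beta=1.0):
    """Smoothed TF-IDF score (a sketch; the exact smoothing form is assumed).

    tf captures how often `word` occurs in this document;
    idf captures how rare the documents containing `word` are.
    """
    counts = Counter(doc)
    tf = counts[word] / len(doc)                       # C(w_i, d) / |d|
    df = sum(1 for d in docs if word in d)             # documents containing w_i
    idf = math.log((len(docs) + alpha) / (df + beta))  # smoothed by alpha, beta
    return tf * idf

# Toy corpus: domain-specific, frequent words score higher.
docs = [["hit", "beat", "run"], ["sell", "acquire"], ["hit", "pitch"]]
score = tfidf("hit", docs[0], docs)
```

A word absent from the document gets a zero score, while a word that is frequent in the document but rare across documents gets the highest score.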
Definition of Naïve Bayes based score
[Equation omitted.]
γ: smoothing factor with the Dirichlet process prior
Intuitive meaning of Naïve Bayes based score
[Annotated equation omitted. The numerator is the count of word w_i in domain D; the denominator is the count of w_i in the whole document set. The smoothing term γ is weighted by the ratio of the number of words in domain D to the number of words in the whole documents, i.e. the probability that an unknown word belongs to domain D.]
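A minimal sketch of the score suggested by the annotations above. The exact slide equation is not reproduced, so this assumes the common smoothed estimate P(D|w) = (C(w, D) + γ·P(D)) / (C(w) + γ), where the prior P(D) is the fraction of all words that belong to domain D; the word lists are illustrative.

```python
from collections import Counter

def nb_score(word, domain_words, all_words, gamma=1.0):
    """Naive-Bayes-style domain significance P(D | w_i), a sketch.

    An unseen word falls back to the domain prior: with C(w, D) = C(w) = 0
    the score reduces to gamma * P(D) / gamma = P(D).
    """
    c_domain = Counter(domain_words)[word]  # count of w_i in domain D
    c_all = Counter(all_words)[word]        # count of w_i in whole documents
    prior = len(domain_words) / len(all_words)  # P(D): share of words in D
    return (c_domain + gamma * prior) / (c_all + gamma)

domain = ["hit", "beat", "hit", "pitch"]   # words observed in domain D
rest = ["sell", "acquire", "hit"]          # words outside the domain
p = nb_score("hit", domain, domain + rest)
```

A word used mostly inside the domain scores close to 1, a word used mostly outside it scores close to 0, and γ pulls rare words toward the domain prior.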
Problem with named entities
Named entities: names of persons, organizations, locations, …
Sparseness problem: mismatch between training and test sets.
NE classes are introduced for robust estimation.
Clustering of named entities
[Figure omitted: P-A patterns are represented as "Argument (Semantic_role) Predicate" triples, with named-entity arguments replaced by their NE class.]
Equations
[Equations omitted: the Naïve Bayes score and the probability of word occurrence, annotated as having the same form.]
Intuitive meaning of NE clustering
[Figure omitted. In the training set, "Toritani (agent) hit" and "Ichiro (agent) hit" each have sparse scores P(Toritani), P(Ichiro); summing them into "[Person] (agent) hit" gives a dense score. The test-set pattern "Matsui (agent) hit" mismatches the individual training patterns, but matches the clustered pattern "[Person] (agent) hit".]
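The clustering step itself is simple to sketch: replace each named entity by its NE class before counting patterns, so per-entity counts pool into one class-level count. The tiny NE lexicon below stands in for a real NE tagger and is purely illustrative.

```python
from collections import Counter

# Toy NE lexicon (a stand-in for an NE tagger's output).
NE_CLASS = {"Toritani": "[Person]", "Ichiro": "[Person]", "Matsui": "[Person]"}

def cluster_pattern(argument, role, predicate):
    """Map a P-A triple to its clustered form: NE arguments -> NE class."""
    return (NE_CLASS.get(argument, argument), role, predicate)

# Sparse per-entity training patterns pool into one dense class pattern.
training = [("Toritani", "agent", "hit"), ("Ichiro", "agent", "hit")]
counts = Counter(cluster_pattern(*p) for p in training)

# The unseen test entity now matches the clustered training pattern.
test = cluster_pattern("Matsui", "agent", "hit")
```

Without clustering, "Matsui (agent) hit" has a zero training count; after clustering it inherits the pooled count of "[Person] (agent) hit".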
Evaluation of significance score
Task: useful information extraction (for QA and information navigation)
Methods:
baseline: all P-A pairs are useful
TF-IDF: TFIDF(w_a) × TFIDF(w_s, w_p) > threshold
NB(PS+A): P(D|w_a) × P(D|w_s, w_p) > threshold
NB(PSA): P(D|w_a, w_s, w_p) > threshold
(with NE clustering)
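The thresholding step can be sketched as follows. The score tables below are made-up placeholders, not values from the thesis; only the decision rule, an NB(PS+A)-style product compared against a threshold, follows the method list above.

```python
# Illustrative component scores: P(D | w_a) and P(D | w_s, w_p).
scores_arg = {"Toritani": 0.9, "stock": 0.1}
scores_pred = {("agent", "hit"): 0.9, ("object", "sell"): 0.2}

def is_useful(w_a, w_s, w_p, threshold=0.5):
    """NB(PS+A)-style decision: P(D|w_a) * P(D|w_s, w_p) > threshold."""
    p_arg = scores_arg.get(w_a, 0.0)
    p_pred = scores_pred.get((w_s, w_p), 0.0)
    return p_arg * p_pred > threshold

# In the baseball domain, the batting pattern passes; the business one fails.
keep = is_useful("Toritani", "agent", "hit")
drop = is_useful("stock", "object", "sell")
```

Sweeping the threshold trades precision against recall, which is what the precision-recall curve later measures.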
Data sets
Training set:
Mainichi Newspaper corpus 2000-2008
Evaluation set (Dev 10%, Test 90%): articles from the Mainichi newspaper's website about professional baseball games played between April 21-23, 2010
Manually annotated P-A patterns are labeled "useful".
Result (precision, recall and F-measure)
[Results table omitted. The baseline gives exactly 100% recall but low precision; the proposed methods achieve high precision with some degradation in recall, and NB(PS+A) appears the best.]
Precision-recall curve
[Figure omitted: precision-recall curves of the methods, compared against the baseline.]
Consideration
PS+A is more robust to the data-sparseness problem than PSA.
Typical successes: "勝つ (have a win)", "登板する (come in to pitch)", etc.
Typical errors: "する (do)", "なる (become)": frequent and not domain-specific, but sometimes important verbs.
"日本一 (ニ格) 輝く (won the championship)": very important, but appears in other sports domains and is infrequent (about once a year).
Conclusion
The statistical learning of semantic structures is formulated by defining the significance score of domain-dependent P-A structures.
The score based on Naïve Bayes is introduced to select useful templates in a given domain automatically.
The experimental results show that high scores are given to important patterns in the domain.
The scoring method does not require any annotated data or thesaurus in the domain, so it can be applied to a variety of domains.