SD study: Statistical Learning of Domain-Dependent Semantic Structure
TRANSCRIPT
Kyoshiro SUGIYAMA , AHC-Lab. , NAIST
7/14 SD study
Chapter 2
Statistical Learning of Domain-Dependent Semantic Structure
D1 Kyoshiro Sugiyama
Chapter 2
2.1 Semantic Information Structure based on Predicate-Argument Structure
2.1.1 Predicate-Argument Structure
2.2 Extraction of Domain-dependent P-A Patterns
2.2.1 Significance Score based on TF-IDF Measure
2.2.2 Significance Score based on Naïve Bayes Model
2.2.3 Clustering of Named Entities
2.2.4 Evaluation of P-A Significance Scores
2.3 Conclusion
Chapter 2 intro.
This chapter introduces a statistical learning method of domain knowledge based on the semantic structure of the domain corpus, which plays an important role in the proposed system.
The domain knowledge is based on predicate-argument (P-A) structure, one of the most fundamental information structures in natural language text.
The useful information structure depends on the domain. In order to automatically extract useful domain-dependent P-A structures, a statistical measure is introduced, resulting in completely unsupervised learning of semantic information structure given a domain corpus.
In three lines
To automatically extract domain knowledge based on domain-dependent useful P-A structure
A statistical measure is introduced
Unsupervised learning of semantic information structure
Predicate-Argument (P-A) Structures (述語項構造)
[Figure omitted: an example sentence with its arguments (項) and predicates (述語) marked.]
Problem setting
Not every P-A structure is useful.
Conventional methods extract useful information using hand-crafted templates.
This is so costly that it cannot be applied to a variety of domains.
In this chapter, two scoring methods are proposed to extract domain-dependent useful information patterns.
Useful information of a domain
Only a fraction of the patterns are useful, and which patterns are useful is domain-dependent.
e.g., beat: 打ち勝つ, acquire: 買収する
Baseball domain: [A beat B] important; [A hit B] important; [A sell B] not important; [A acquire B] not important
Business domain: [A beat B] not important; [A hit B] not important; [A sell B] important; [A acquire B] important
Automatic information extraction
Which P-A pair is important in a given domain?
Two significance measures are proposed.
From a corpus or websites for each domain (baseball, soccer, business, economy, …), scores are calculated automatically, e.g. for the baseball domain:
[A hit B] 0.9
[A beat B] 0.9
⋮
[A sell B] 0.2
[A acquire B] 0.1
Definition of TF-IDF measure
[Equation omitted.]
w_i: word, d: document, C(·): count function, α, β: smoothing factors
Here α = 1, β = 1.
Intuitive meaning of TF-IDF score
How often does w_i occur in this document?
How rare are the documents d that contain w_i?
Domain-specific and frequent words have high score.
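The intuition above can be sketched in code. The exact equation from the slide is not reproduced here, so this assumes a standard smoothed TF-IDF with α and β applied inside the IDF term; the example documents are illustrative only.

```python
from collections import Counter
import math

def tfidf(word, doc, docs, alpha=1.0, beta=1.0):
    """Smoothed TF-IDF score (a sketch; the exact smoothing form is assumed).

    tf captures how often `word` occurs in this document;
    idf captures how rare the documents containing `word` are.
    """
    counts = Counter(doc)
    tf = counts[word] / len(doc)                       # C(w_i, d) / |d|
    df = sum(1 for d in docs if word in d)             # documents containing w_i
    idf = math.log((len(docs) + alpha) / (df + beta))  # smoothed by alpha, beta
    return tf * idf

# Toy corpus: domain-specific, frequent words score higher.
docs = [["hit", "beat", "run"], ["sell", "acquire"], ["hit", "pitch"]]
score = tfidf("hit", docs[0], docs)
```

A word absent from the document gets a zero score, while a word that is frequent in the document but rare across documents gets the highest score.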
Definition of Naïve Bayes based score
[Equation omitted.]
γ: smoothing factor with the Dirichlet process prior
Intuitive meaning of Naïve Bayes based score
[Annotated equation omitted. The numerator is the count of word w_i in domain D; the denominator is the count of w_i in the whole document set. The smoothing term γ is weighted by the ratio of the number of words in domain D to the number of words in the whole documents, i.e. the probability that an unknown word belongs to domain D.]
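A minimal sketch of the score suggested by the annotations above. The exact slide equation is not reproduced, so this assumes the common smoothed estimate P(D|w) = (C(w, D) + γ·P(D)) / (C(w) + γ), where the prior P(D) is the fraction of all words that belong to domain D; the word lists are illustrative.

```python
from collections import Counter

def nb_score(word, domain_words, all_words, gamma=1.0):
    """Naive-Bayes-style domain significance P(D | w_i), a sketch.

    An unseen word falls back to the domain prior: with C(w, D) = C(w) = 0
    the score reduces to gamma * P(D) / gamma = P(D).
    """
    c_domain = Counter(domain_words)[word]  # count of w_i in domain D
    c_all = Counter(all_words)[word]        # count of w_i in whole documents
    prior = len(domain_words) / len(all_words)  # P(D): share of words in D
    return (c_domain + gamma * prior) / (c_all + gamma)

domain = ["hit", "beat", "hit", "pitch"]   # words observed in domain D
rest = ["sell", "acquire", "hit"]          # words outside the domain
p = nb_score("hit", domain, domain + rest)
```

A word used mostly inside the domain scores close to 1, a word used mostly outside it scores close to 0, and γ pulls rare words toward the domain prior.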
Problem with named entities
Named entities: names of persons, organizations, locations, …
Sparseness problem: mismatch between training and test sets.
NE classes are introduced for robust estimation.
Clustering of named entities
[Figure omitted: P-A patterns are represented as "Argument (Semantic_role) Predicate" triples, with named-entity arguments replaced by their NE class.]
Equations
[Equations omitted: the Naïve Bayes score and the probability of word occurrence, annotated as having the same form.]
Intuitive meaning of NE clustering
[Figure omitted. In the training set, "Toritani (agent) hit" and "Ichiro (agent) hit" each have sparse scores P(Toritani), P(Ichiro); summing them into "[Person] (agent) hit" gives a dense score. The test-set pattern "Matsui (agent) hit" mismatches the individual training patterns, but matches the clustered pattern "[Person] (agent) hit".]
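The clustering step itself is simple to sketch: replace each named entity by its NE class before counting patterns, so per-entity counts pool into one class-level count. The tiny NE lexicon below stands in for a real NE tagger and is purely illustrative.

```python
from collections import Counter

# Toy NE lexicon (a stand-in for an NE tagger's output).
NE_CLASS = {"Toritani": "[Person]", "Ichiro": "[Person]", "Matsui": "[Person]"}

def cluster_pattern(argument, role, predicate):
    """Map a P-A triple to its clustered form: NE arguments -> NE class."""
    return (NE_CLASS.get(argument, argument), role, predicate)

# Sparse per-entity training patterns pool into one dense class pattern.
training = [("Toritani", "agent", "hit"), ("Ichiro", "agent", "hit")]
counts = Counter(cluster_pattern(*p) for p in training)

# The unseen test entity now matches the clustered training pattern.
test = cluster_pattern("Matsui", "agent", "hit")
```

Without clustering, "Matsui (agent) hit" has a zero training count; after clustering it inherits the pooled count of "[Person] (agent) hit".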
Evaluation of significance score
Task: useful information extraction (for QA and information navigation)
Methods:
baseline: all P-A pairs are useful
TF-IDF: TFIDF(w_a) × TFIDF(w_s, w_p) > threshold
NB(PS+A): P(D|w_a) × P(D|w_s, w_p) > threshold
NB(PSA): P(D|w_a, w_s, w_p) > threshold
(with NE clustering)
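The thresholding step can be sketched as follows. The score tables below are made-up placeholders, not values from the thesis; only the decision rule, an NB(PS+A)-style product compared against a threshold, follows the method list above.

```python
# Illustrative component scores: P(D | w_a) and P(D | w_s, w_p).
scores_arg = {"Toritani": 0.9, "stock": 0.1}
scores_pred = {("agent", "hit"): 0.9, ("object", "sell"): 0.2}

def is_useful(w_a, w_s, w_p, threshold=0.5):
    """NB(PS+A)-style decision: P(D|w_a) * P(D|w_s, w_p) > threshold."""
    p_arg = scores_arg.get(w_a, 0.0)
    p_pred = scores_pred.get((w_s, w_p), 0.0)
    return p_arg * p_pred > threshold

# In the baseball domain, the batting pattern passes; the business one fails.
keep = is_useful("Toritani", "agent", "hit")
drop = is_useful("stock", "object", "sell")
```

Sweeping the threshold trades precision against recall, which is what the precision-recall curve later measures.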
Data sets
Training set:
Mainichi Newspaper corpus 2000-2008
Evaluation set (Dev 10%, Test 90%): articles from the Mainichi newspaper's website about professional baseball games played between April 21-23, 2010
Manually annotated P-A patterns are labeled "useful".
Result (precision, recall and F-measure)
[Results table omitted. The baseline gives exactly 100% recall but low precision; the proposed methods achieve high precision with some degradation in recall, and NB(PS+A) appears the best.]
Precision-recall curve
[Figure omitted: precision-recall curves of the methods, compared against the baseline.]
Consideration
PS+A is more robust to the data-sparseness problem than PSA.
Typical successes: "勝つ (have a win)", "登板する (come in to pitch)", etc.
Typical errors: "する (do)", "なる (become)": frequent and not domain-specific, but sometimes important verbs.
"日本一 (ニ格) 輝く (won the championship)": very important, but appears in other sports domains and is infrequent (about once a year).
Conclusion
The statistical learning of semantic structures is formulated by defining the significance score of domain-dependent P-A structures.
The score based on Naïve Bayes is introduced to select useful templates in a given domain automatically.
The experimental results show that high scores are given to important patterns in the domain.
The scoring method does not require any annotated data or thesaurus in the domain, so it can be applied to a variety of domains.