
Page 1:

IN4080 – 2019 FALL
NATURAL LANGUAGE PROCESSING

Jan Tore Lønning

1

Page 2:

Lecture 8, 8 Oct

Information extraction, pipelines

2

Page 3:

Today

Sentence structure:

Constituents and phrases

Treebanks

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

Pipelines

3

Page 4:

Sentences have inner structure

So far:
Sentence: a sequence of words
Properties of words: morphology, tags, embeddings
Probabilities of sequences
Flat

But:
Sentences have inner structure
The structure determines whether the sentence is grammatical or not
The structure determines how to understand the sentence

4

Page 5:

Why syntax?

Some sequences of words are well-formed meaningful sentences.

Others are not:
Are meaningful of some sentences sequences well-formed words

It makes a difference:
A dog bit the man.
The man bit a dog.

BOW models don't capture this difference

5

Page 6:

Two ways to describe sentence structure

6

Phrase structure (focus of INF2820)
Dependency structure (focus of IN2110)

Page 7:

Constituents and phrases

Constituent: A group of words which functions as a unit in the sentence

See Wikipedia: Constituent for criteria of constituency

Phrase: A sequence of words which "belong together"

= constituent (for us)

In some theories a phrase is a constituent of more than one word

7

Examples (subject NP + V + object NP; the V and the object NP together form a VP):

NP: Mary / The small, cute dog / The dog from Baskerville / You
V: ate / saw / enjoyed
NP: the apple / the small, cute dog / the apple that Kim had stolen from the store / it

Page 8:

Phrases

Phrases can be classified into categories:

Noun Phrases, Verb Phrases, Prepositional Phrases, etc.

Phrases of the same category have similar distribution,

e.g. NPs can replace names

(but there are restrictions on case, number, person, gender agreement, etc.)

Phrases of the same category have similar structure, simplified:

NP (roughly): (DET) ADJ* N PP* (+ some alternatives, e.g. pronoun)

PP: PREP NP
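For illustration only (not from the slides), the simplified NP and PP patterns above can be written as a toy context-free grammar in NLTK; the word lists and the extra S and VP rules are my own assumptions, added just to make the sketch runnable:

import nltk

# Toy grammar following the simplified patterns above:
# NP -> (DET) ADJ* N (PP) or a pronoun, PP -> PREP NP
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> DET N | DET ADJ N | DET N PP | PRON
PP -> PREP NP
DET -> 'the' | 'a'
ADJ -> 'small' | 'cute'
N -> 'dog' | 'apple' | 'store'
PRON -> 'you' | 'it'
V -> 'saw' | 'ate'
PREP -> 'from' | 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog from the store saw you".split()):
    print(tree)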

8

Page 9:

Phrase structure

A sentence is hierarchically ordered into phrases

Various syntactic theories, models and NLP tools differ with respect to the actual trees:

Models based on X-bar theory prefer "deep trees": binary branching

The Penn treebank prefers shallow trees

9

Page 10:

A Penn treebank tree

10

Page 11:

Today

Sentence structure:

Constituents and phrases

Treebanks

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

Pipelines

11

Page 12:

Treebanks

A collection of analyzed sentences/trees

The Penn treebank is the best known

12

Page 13:

13

Treebanks

Treebanks are corpora in which each sentence has been paired with a parse tree (presumably the right one).

These are generally created

By first parsing the collection with an automatic parser

And then having human annotators correct each parse as necessary.

This requires detailed annotation guidelines that provide a POS tagset, a grammar and instructions for how to deal with particular grammatical constructions.

Page 14:

Treebank Grammars

Treebanks implicitly define a grammar for the language covered in the treebank.

Such grammars tend to be very flat, since they tend to avoid recursion.

To ease the annotators' burden

For example, the Penn Treebank has 4500 different rules for VPs. Among them...

Speech and Language Processing - Jurafsky and Martin

14

Page 15:

Different types of treebanks

Hand-made:
Human annotators assign trees.
The trees define a grammar: many rules
Penn uses flat trees

Parse bank:
Start with a grammar and a parser
Parse the sentences
A human annotator selects the best analysis among the candidates
May be used for training a parse ranker

October 7, 2019

15

Page 16:

Treebanks

There are free dependency treebanks available for many languages

The place to start these days: http://universaldependencies.org/

CoNLL formats:
One word per line, with a number of columns for various information
CoNLL-X and CoNLL-U use different POS tag sets
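For concreteness, here is a rough sketch (my own illustration, not from the slides) of reading one token line in the 10-column CoNLL-U format; the example line is made up:

# Each non-comment line in CoNLL-U has 10 tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
line = "2\tdog\tdog\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_"

fields = line.split("\t")
token = {
    "id": int(fields[0]),
    "form": fields[1],
    "lemma": fields[2],
    "upos": fields[3],       # universal POS tag
    "xpos": fields[4],       # language-specific POS tag
    "feats": fields[5],
    "head": int(fields[6]),  # 0 would mean the root of the dependency tree
    "deprel": fields[7],
}
print(token["form"], token["upos"], token["head"], token["deprel"])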

16

From Andrei's INF5830 slides

Page 17:

Today

Sentence structure:

Constituents and phrases

Treebanks

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

Pipelines

17

Page 18:

IE basics

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. (Wikipedia)

Bottom-up approach:
Start with unrestricted texts, and do the best you can
Select a particular domain and task

The approach was developed in particular by the Message Understanding Conferences (MUC) in the 1990s

18

Page 19:

Steps

(Some approaches do these steps in a different order – or simultaneously.)

From NLTK

19
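The standard NLTK-style pipeline behind these steps can be sketched as follows (a minimal sketch using NLTK's off-the-shelf components; the sample text is made up, and the relevant NLTK data packages must be downloaded first):

import nltk

text = "United Airlines said Friday it has increased fares by $6 per round trip."

# 1. sentence segmentation, 2. tokenization, 3. POS tagging, 4. entity chunking
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)   # shallow chunking of named entities
    print(tree)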

Page 20:

Some example systems

20

Stanford CoreNLP: http://corenlp.run/

SpaCy (Python): https://spacy.io/docs/api/

OpenNLP (Java): https://opennlp.apache.org/docs/

GATE (Java): https://gate.ac.uk/

UDPipe: http://ufal.mff.cuni.cz/udpipe

Online demo: http://lindat.mff.cuni.cz/services/udpipe/

Page 21:

Today

Dependency parsing:

Wrap-up

Evaluation

Treebanks and pipelines

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

21

Page 22:

Next steps

Chunk words together into phrases

22

Page 23:

NP-chunks

Exactly what is an NP-chunk?

It is an NP

But not all NPs are chunks

Flat structure: no NP-chunk is part of another NP chunk

Maximally large

Opposing restrictions

23

[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

Page 24:

Regular Expression Chunker

Input POS-tagged sentences

Use a regular expression over POS to identify NP-chunks

NLTK example:

It inserts parentheses

24

grammar = r"""NP: {<DT|PP\$>?<JJ>*<NN>}
              {<NNP>+}
"""
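A minimal runnable version of the NLTK example (the grammar and the tagged sentence are taken from the NLTK book, ch. 7):

import nltk

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
    {<NNP>+}                # sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)

sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))   # prints a tree with the NP chunks bracketed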

Page 25:

IOB-tags

B-NP: First word in NP

I-NP: Part of NP, not first word

O: Not part of NP (phrase)

Properties

One tag per token

Unambiguous

Does not insert anything in the text itself
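NLTK can convert between chunk trees and IOB tags; a small self-contained sketch (the sentence and grammar are made up for illustration):

import nltk
from nltk.chunk import tree2conlltags

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(sentence)

# One (word, POS, IOB) triple per token; nothing is inserted into the text itself
for word, pos, iob in tree2conlltags(tree):
    print(word, pos, iob)
# the DT B-NP, little JJ I-NP, dog NN I-NP, barked VBD O, ...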

25

Page 26:

Assigning IOB-tags

The process can be considered a form of tagging

POS-tagging: Word to POS-tag

IOB-tagging: POS-tag to IOB-tag

But one may also use additional features, e.g. the words themselves

Can use various types of classifiers

NLTK uses a MaxEnt Classifier (=LogReg, but the implementation is slow)

We can modify along the lines of mandatory assignment 2, using scikit-learn
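One way to set this up along the lines of mandatory assignment 2 (a sketch under my own assumptions, not the course solution): extract a feature dict per token and train a scikit-learn logistic regression classifier on the CoNLL-2000 chunking data.

import nltk
from nltk.corpus import conll2000          # requires nltk.download('conll2000')
from nltk.chunk import tree2conlltags
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tagged, i):
    # Features for token i: its POS tag, the word itself, and the previous POS tag
    word, pos = tagged[i]
    prevpos = tagged[i - 1][1] if i > 0 else "START"
    return {"pos": pos, "word": word.lower(), "prevpos": prevpos}

# A subset of the training data, to keep the sketch fast
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])[:2000]

X, y = [], []
for tree in train_sents:
    triples = tree2conlltags(tree)
    tagged = [(w, p) for w, p, _ in triples]
    for i, (_, _, iob) in enumerate(triples):
        X.append(token_features(tagged, i))
        y.append(iob)

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([token_features([("the", "DT"), ("dog", "NN")], 1)]))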

26

Page 27:

27

J&M, 3. ed.

Page 28:

Evaluating (IOB-)chunkers

cp = nltk.RegexpParser("")

test_sents = conll ('test', chunks=['NP'])

IOB Accuracy: 43.4%

Precision: 0.0%

Recall: 0.0%

F-Measure: 0.0%

What do we evaluate?
IOB-tags? or whole chunks?
Yields different results

For IOB-tags:
Baseline: majority class O, yields > 33%

Whole chunks:
Which chunks did we find?
Harder
Lower numbers

28

Page 29:

Evaluating (IOB-)chunkers

cp = nltk.RegexpParser("")

test_sents = conll('test', chunks=['NP'])

IOB Accuracy: 43.4%

Precision: 0.0%

Recall: 0.0%

F-Measure: 0.0%

cp = nltk.RegexpParser(r"NP: {<[CDJNP].*>+}")

IOB Accuracy: 87.7%

Precision: 70.6%

Recall: 67.8%

F-Measure: 69.2%
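For reference, a self-contained NLTK-book-style version of this evaluation, using the conll2000 corpus directly instead of the conll helper from the slides (a sketch; it needs the conll2000 data installed):

import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])

# Baseline: the empty grammar chunks nothing, so every token gets the tag O
cp = nltk.RegexpParser("")
print(cp.evaluate(test_sents))

# Crude grammar: chunk any run of tags starting with C, D, J, N or P
cp = nltk.RegexpParser(r"NP: {<[CDJNP].*>+}")
print(cp.evaluate(test_sents))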

29

Page 30:

Today

Sentence structure:

Constituents and phrases

Treebanks

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

Pipelines

30

Page 31:

Named entities

31

Named entity: anything you can refer to by a proper name

i.e. not all NPs (chunks):
high fuel prices

Maybe a longer NP than just a chunk:
Bank of America

Find the phrases
Classify them

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

Page 32:

Types of NE

The set of types varies between different systems

Which classes are useful depends on the application

32

Page 33:

Ambiguities

33

Page 34:

Gazetteer

Useful: lists of names, e.g. a gazetteer: a list of geographical names

But this does not remove all ambiguities, cf. the example

34

Page 35:

Representation (IOB)

35

Page 36:

Feature-based NER

Similar to tagging and chunking

You will need features from several layers

Features may include

Words, POS tags, chunk tags, graphical properties,
and more (see J&M, 3. ed.)

36

Page 37:

Feature-based NER algorithms

37

Greedy decoding

"Word-by word", decide for the first word, then for the second word, etc.

Can use various learners, e.g. Logistic regression (MaxEnt)

We can use our set-up for mandatory 2 with smaller adjustments

For shortcomings and better alternatives, cf. J&M, 3. ed., ch. 8:

Maximum Entropy Markov Models (MEMM)

Conditional random fields (the preferred approach until recently)

Page 38:

Neural NER

In recent years, neural architectures have shown the best results

J&M, 3. ed., ch. 17, sec. 17.1.3, not curriculum in IN4080

IN5550

38

Page 39:

Evaluation

Have we found the correct named entities?

Evaluate precision and recall as for chunking

For the correctly identified entities, have we labelled them correctly?

39

Page 40:

Today

Sentence structure:

Constituents and phrases

Treebanks

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

Pipelines

40

Page 41:

Goal

Extract the relations that exist between the (named) entities in the text

A fixed set of relations (normally)

Determined by the application:
Jeopardy
Preventing terrorist attacks
Detecting illnesses from medical records

41

• Born_in

• Date_of_birth

• Parent_of

• Author_of

• Winner_of

• Part_of

• Located_in

• Acquire

• Threaten

• Has_symptom

• Has_illness

Page 42:

Examples

42

Page 43:

Methods for relation extraction

43

1. Hand-written patterns

2. Machine Learning (Supervised classifiers)

3. Semi-supervised classifiers via bootstrapping

4. Semi-supervised classifiers via distant supervision

5. Unsupervised

Page 44:

1. Hand-written patterns

Example: acquisitions

[ORG] … (buy(s)|bought|acquire(s|d)) … [ORG]

Hand-write patterns like this

Properties:
High precision
Will only cover a small set of patterns
Low recall
Time consuming

(Also in NLTK, sec 7.6)
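A rough sketch of how such a pattern could be applied (my own illustration; it assumes the text has already been NER-tagged with inline [ORG ...] brackets as on the earlier slides):

import re

text = ("[ORG Google] said it bought [ORG YouTube] in 2006. "
        "[ORG IBM] has acquired [ORG AlchemyAPI].")

# [ORG X] ... buy/bought/acquire(s|d) ... [ORG Y]  =>  ACQUIRE(X, Y)
pattern = re.compile(
    r"\[ORG ([^\]]+)\][^.]*?\b(?:buys?|bought|acquires?|acquired)\b[^.]*?\[ORG ([^\]]+)\]"
)

for buyer, target in pattern.findall(text):
    print("ACQUIRE(%s, %s)" % (buyer, target))
# ACQUIRE(Google, YouTube)
# ACQUIRE(IBM, AlchemyAPI)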

44

Page 45:

Example

45

Page 46:

Methods for relation extraction

46

1. Hand-written patterns

2. Machine Learning (Supervised classifiers)

3. Semi-supervised classifiers via bootstrapping

4. Semi-supervised classifiers via distant supervision

5. Unsupervised

Page 47:

2. Supervised classifiers

47

A corpus

A fixed set of entities and relations

The sentences in the corpus are hand-annotated:

Entities

Relations between them

Split the corpus into parts for training and testing

Train a classifier:

Choose learner: Naive Bayes, Logistic regression (Max Ent), SVM, …

Select features

Page 48:

2. Supervised classifiers, contd.

48

Training:

Use pairs of entities within the same sentence with no relation between them as negative data

Classification

1. Find the named entities

2. For each pair of entities, determine whether there is a relation between them

3. If there is, label the relation

Page 49:

Examples of features

49

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
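For illustration (a sketch under my own assumptions; J&M describe richer feature sets), features for the candidate pair (American Airlines, Tim Wagner) in this sentence could be built like this:

# Hypothetical helper: build a feature dict for a candidate entity pair
def relation_features(tokens, e1_span, e2_span, e1_type, e2_type):
    between = tokens[e1_span[1]:e2_span[0]]        # words between the two mentions
    return {
        "e1_type": e1_type,                        # e.g. ORG
        "e2_type": e2_type,                        # e.g. PER
        "type_pair": e1_type + "-" + e2_type,
        "words_between": " ".join(between),
        "n_between": len(between),
        "first_word_between": between[0] if between else "",
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
# American Airlines = tokens[0:2], Tim Wagner = tokens[14:16]
print(relation_features(tokens, (0, 2), (14, 16), "ORG", "PER"))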

Page 50:

Properties

50

The bottleneck is the availability of training data

Hand-labelling data is time consuming

Mostly applied to restricted domains

Does not generalize well to other domains

Page 51:

Methods for relation extraction

51

1. Hand-written patterns

2. Machine Learning (Supervised classifiers)

3. Semi-supervised classifiers via bootstrapping

4. Semi-supervised classifiers via distant supervision

5. Unsupervised

Page 52:

3. Semi-supervised, bootstrapping

If we know a pattern for a relation, we can determine whether a pair stands in the relation

Conversely: If we know that a pair stands in a relationship, we can find patterns that describe the relation

52

Pairs:
IBM – AlchemyAPI
Google – YouTube
Facebook – WhatsApp

Patterns:
[ORG] … bought … [ORG]

Relation:
ACQUIRE

Page 53:

Example

53

(IBM, AlchemyAPI): ACQUIRE

Search for sentences containing IBM and AlchemyAPI

Results (web search, Google, among the first 10 results):

IBM's Watson makes intelligent acquisition of Denver-based AlchemyAPI (Denver Post)

IBM is buying machine-learning systems maker AlchemyAPI Inc. to bolster its Watson technology as competition heats up in the data analytics and artificial intelligence fields. (Bloomberg)

IBM has acquired computing services provider AlchemyAPI to broaden its portfolio of Watson-branded cognitive computing services. (ComputerWorld)

Page 54:

Example contd.

54

Extract patterns

IBM's Watson makes intelligent acquisition of Denver-based AlchemyAPI (Denver Post)

IBM is buying machine-learning systems maker AlchemyAPI Inc. to bolster its Watson technology as competition heats up in the data analytics and artificial intelligence fields. (Bloomberg)

IBM has acquired computing services provider AlchemyAPI to broaden its portfolio of Watson-branded cognitive computing services. (ComputerWorld)

Page 55:

Procedure

From the extracted sentences, we extract patterns:

… makes intelligent acquisition …
… is buying …
… has acquired …

Use these patterns to extract more pairs of entities that occur in these patterns

These pairs may again be used for extracting more patterns, etc.
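The bootstrapping loop itself can be sketched as follows (a deliberately crude, runnable sketch on a made-up four-sentence corpus; a real system would score and filter patterns and pairs to avoid semantic drift):

import re

corpus = [
    "IBM has acquired AlchemyAPI to broaden its Watson portfolio.",
    "Google bought YouTube in 2006.",
    "Facebook has acquired WhatsApp.",
    "Microsoft has acquired GitHub.",
]
seed_pairs = {("IBM", "AlchemyAPI"), ("Google", "YouTube")}

# 1. From sentences containing a known pair, take the words between them as a pattern
patterns = set()
for e1, e2 in seed_pairs:
    for sent in corpus:
        if e1 in sent and e2 in sent:
            patterns.add(sent.split(e1, 1)[1].split(e2, 1)[0].strip())

# 2. Use the patterns to harvest new pairs (very crude matching on capitalized names)
pairs = set(seed_pairs)
for pat in patterns:
    regex = re.compile(r"([A-Z]\w+) " + re.escape(pat) + r" ([A-Z]\w+)")
    for sent in corpus:
        pairs.update(regex.findall(sent))

print(patterns)   # e.g. {'has acquired', 'bought'}
print(pairs)      # now also contains ('Facebook', 'WhatsApp') and ('Microsoft', 'GitHub')
# The new pairs could then feed another round of pattern extraction.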

55

Page 56:

Bootstrapping

56

Page 57:

A little more

57

We could
either extract pattern templates and search for these
or extract features for classification and build a classifier

If we use patterns, we should generalize:
makes intelligent acquisition → (make(s)|made) JJ* acquisition

During the process we should evaluate before we extend:

Does the new pattern recognize other pairs we know stand in the relation?

Does the new pattern return pairs that are not in the relation? (Precision)

Page 58:

Methods for relation extraction

58

1. Hand-written patterns

2. Machine Learning (Supervised classifiers)

3. Semi-supervised classifiers via bootstrapping

4. Semi-supervised classifiers via distant supervision

5. Unsupervised

Page 59:

4. Distant supervision for RE

Combine:

A large external knowledge base, e.g. Wikipedia, WordNet

Large amounts of unlabeled text

Extract tuples that stand in a known relation from the knowledge base:

Many tuples

Follow the bootstrapping technique on the text

59

Page 60:

4. Distant supervision for RE

Properties:
Large data sets allow for fine-grained features and combinations of features

Evaluation

Requirement:
A large knowledge base

60

Page 61:

Methods for relation extraction

61

1. Hand-written patterns

2. Machine Learning (Supervised classifiers)

3. Semi-supervised classifiers via bootstrapping

4. Semi-supervised classifiers via distant supervision

5. Unsupervised

Page 62:

5. Unsupervised relation extraction

Open IE

Example:

1. Tag and chunk

2. Find all word sequences satisfying certain syntactic constraints, in particular containing a verb. These are taken to be the relations

3. For each such sequence, find the immediate non-vacuous NP to the left and to the right

4. Assign a confidence score

United has a hub in Chicago, which is the headquarters of United Continental Holdings.

r1: <United, has a hub in, Chicago>

r2: <Chicago, is the headquarters of, United Continental Holdings>
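A very crude sketch of steps 1–3 (my own illustration; the POS-tagged sentence is hard-coded, and real Open IE systems use much more careful constraints plus the confidence scoring of step 4):

# Toy Open-IE-style extraction over a POS-tagged sentence
tagged = [("United", "NNP"), ("has", "VBZ"), ("a", "DT"),
          ("hub", "NN"), ("in", "IN"), ("Chicago", "NNP")]

def rel_span(tagged, i):
    # A relation phrase starts at a verb and may extend over nouns, adjectives,
    # determiners and adverbs up to and including a preposition
    if not tagged[i][1].startswith("VB"):
        return None
    k = i + 1
    while k < len(tagged) and tagged[k][1][:2] in ("NN", "JJ", "DT", "RB"):
        k += 1
    return k + 1 if k < len(tagged) and tagged[k][1] == "IN" else i + 1

def np_left(tagged, i):
    j = i
    while j > 0 and tagged[j - 1][1][:2] in ("NN", "JJ", "DT", "PR"):
        j -= 1
    return " ".join(w for w, _ in tagged[j:i])

def np_right(tagged, i):
    j = i
    while j < len(tagged) and tagged[j][1][:2] in ("NN", "JJ", "DT", "PR"):
        j += 1
    return " ".join(w for w, _ in tagged[i:j])

triples = []
for i in range(len(tagged)):
    end = rel_span(tagged, i)
    if end is not None:
        arg1, arg2 = np_left(tagged, i), np_right(tagged, end)
        if arg1 and arg2:               # only keep non-vacuous arguments
            rel = " ".join(w for w, _ in tagged[i:end])
            triples.append((arg1, rel, arg2))

print(triples)   # [('United', 'has a hub in', 'Chicago')]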

62

Page 63:

Evaluating relation extraction

Supervised methods can be evaluated on each of the examples in a test set.

For the semi-supervised methods:
we don't have a test set
we can evaluate the precision of the returned examples manually

Beware the difference between:

Determining for a sentence whether an entity pair in the sentence is in a particular relation:
Recall and precision

Determining from a text (we may use several occurrences of the pair in the text to draw a conclusion):
Precision

63

We skip the confidence scoring

Page 64:

More fine-grained IE

Tokenization+tagging

Identifying the "actors"

Chunking

Named-entity recognition

Co-reference resolution

Relation detection

Event detection

Co-reference resolution of events

Temporal extraction

Template filling

64

So far / Possible refinements

Page 65:

Today

Sentence structure:

Constituents and phrases

Treebanks

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways

Pipelines

65

Page 66:

Some example systems

66

Stanford CoreNLP: http://corenlp.run/

SpaCy (Python): https://spacy.io/docs/api/

OpenNLP (Java): https://opennlp.apache.org/docs/

GATE (Java): https://gate.ac.uk/

UDPipe: http://ufal.mff.cuni.cz/udpipe

Online demo: http://lindat.mff.cuni.cz/services/udpipe/