NLP Data Structures
PU Tools, Ressourcen, Infrastruktur

Nils Reiter

January 7, 2021 (Winter term 2020/21)

Reiter NLP Data Structures 2021-01-07 1 / 24

Section 1

Recap

Exercise 7

https://github.com/idh-cologne-tools-ressourcen-infra/exercise-07

Section 2

NLP Data Structures

Data Structures

▶ Storing information digitally
▶ Atomic values: int, String, boolean, …
▶ More complex data: Combinations of multiple atomic values

Example (Address)

street       String
house        int
zip          int
city         String

Example (Music)

artist       String
title        String
release      int
tracks       ⟨ [ title String, length time, rating int ], … ⟩
rating       int
compilation  boolean
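The two examples can be written down as composite types. The following is a minimal Java sketch (records need Java 16 or later). The names `RecordDemo`, `Address`, `Track`, `Album`, and `totalLength` are assumptions made for illustration, as is the use of `java.time.Duration` for the `time` type; all sample values are made up.

```java
import java.time.Duration;
import java.util.List;

class RecordDemo {
    // Address: a combination of four atomic values.
    record Address(String street, int house, int zip, String city) {}

    // One entry of the "tracks" list in the music example.
    record Track(String title, Duration length, int rating) {}

    // Album: atomic values plus a list of nested Track records.
    record Album(String artist, String title, int release,
                 List<Track> tracks, int rating, boolean compilation) {}

    // Sums the track lengths to show access to the nested collection.
    static Duration totalLength(Album album) {
        Duration total = Duration.ZERO;
        for (Track t : album.tracks()) {
            total = total.plus(t.length());
        }
        return total;
    }

    public static void main(String[] args) {
        // All values below are invented sample data.
        Address address = new Address("Example Street", 12, 12345, "Example City");
        Album album = new Album("Some Artist", "Some Title", 2020,
                List.of(new Track("First", Duration.ofSeconds(200), 4),
                        new Track("Second", Duration.ofSeconds(100), 3)),
                4, false);
        System.out.println(address);
        System.out.println(totalLength(album).getSeconds());
    }
}
```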

Data Structures: Collections

▶ Storing individual objects is often clear
▶ Collections of values of the same type
    ▶ Impact on programming experience
    ▶ Impact on performance
▶ Commonly used collection data structures (e.g., in interviews)
    ▶ Arrays, stacks, queues, linked lists, trees, graphs, …
    ▶ Sometimes hidden and/or merged by programming libraries
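As one illustration of how libraries merge collection types: in the Java standard library, `java.util.ArrayDeque` can serve both as a stack and as a queue. The sketch below only shows the difference in access order; the class and method names (`CollectionsDemo`, `stackOrder`, `queueOrder`) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class CollectionsDemo {
    // Uses the deque as a stack: the last element pushed comes out first.
    static String stackOrder(List<String> items) {
        Deque<String> stack = new ArrayDeque<>();
        for (String s : items) stack.push(s);
        StringBuilder sb = new StringBuilder();
        while (!stack.isEmpty()) sb.append(stack.pop());
        return sb.toString();
    }

    // Uses the same class as a queue: the first element added comes out first.
    static String queueOrder(List<String> items) {
        Deque<String> queue = new ArrayDeque<>();
        for (String s : items) queue.addLast(s);
        StringBuilder sb = new StringBuilder();
        while (!queue.isEmpty()) sb.append(queue.pollFirst());
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c");
        System.out.println(stackOrder(items)); // cba
        System.out.println(queueOrder(items)); // abc
    }
}
```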

NLP Data

▶ Text and annotations
▶ Annotations: Metadata associated with specific locations in the text

Storing Text

▶ Texts can be stored as strings
▶ Beware: Unicode
▶ Limitations on length
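The Unicode warning matters as soon as annotations refer to character positions. A short Java sketch, assuming positions are counted in the `char` units of `java.lang.String` (UTF-16 code units): the same visible text can have different lengths, so offsets are only valid for the exact string they were computed on. The class name `UnicodeDemo` and its helper methods are hypothetical.

```java
import java.text.Normalizer;

class UnicodeDemo {
    // Number of UTF-16 code units; this is what String indices count.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // é as a single code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent
        String emoji = "\uD83D\uDE00";  // U+1F600, stored as a surrogate pair

        System.out.println(codeUnits(composed));    // 1
        System.out.println(codeUnits(decomposed));  // 2
        System.out.println(codeUnits(emoji));       // 2
        System.out.println(codePoints(emoji));      // 1

        // Normalizing to NFC before annotating makes both spellings of é identical.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));   // true
    }
}
```

On the length limitation: a Java `String` is indexed by `int`, so a single string cannot hold more than `Integer.MAX_VALUE` `char` units.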

NLP Tasks

▶ Tokenization, sentence splitting
▶ Morphological analysis
▶ Part-of-speech tagging, lemmatization
▶ Named entity recognition
▶ Phrase structure and dependency syntax
▶ Coreference chains

NLP Tasks: Tokenization

Example
Sentence: The dog is fed by Mr. Parker, who wears a yellow hat.
Tokens: [The] [dog] [is] [fed] [by] [Mr.] [Parker] [,] [who] [wears] [a] [yellow] [hat] [.]

▶ Split the text into ›meaningful‹ units (e.g., words)
▶ Two storage approaches

A list of strings / token objects

⟨ [ surface The ], [ surface dog ], … ⟩

A list of character positions (preferred)

text  »The dog is fed …«
toks  ⟨ [ beg 0, end 3 ], [ beg 4, end 7 ], … ⟩
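The preferred approach, tokens as character positions into a shared text, could look like the following Java sketch. It assumes an exclusive `end` position, which matches the values above (`beg 0`, `end 3` for ›The‹). The tokenizer is a toy that splits at spaces only, so unlike the example it would not separate ›,‹ or ›.‹ from the preceding word; the names `TokenSpanDemo`, `Span`, `coveredText`, and `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TokenSpanDemo {
    // A token is stored as a pair of character positions; end is exclusive.
    record Span(int begin, int end) {
        // Recovers the surface string from the shared text.
        String coveredText(String text) {
            return text.substring(begin, end);
        }
    }

    // Toy tokenizer: splits at spaces only (punctuation stays attached).
    static List<Span> tokenize(String text) {
        List<Span> tokens = new ArrayList<>();
        int begin = -1; // -1 means: currently not inside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || text.charAt(i) == ' ';
            if (!boundary && begin < 0) {
                begin = i;                       // a token starts here
            } else if (boundary && begin >= 0) {
                tokens.add(new Span(begin, i));  // the token ends before i
                begin = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The dog is fed";
        for (Span t : tokenize(text)) {
            System.out.println(t + " " + t.coveredText(text));
        }
    }
}
```

Because every token only points into the text, the original string, including whitespace, stays available to later processing steps.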

NLP Tasks: Sentence splitting

▶ Split a text into sentences
▶ Storing sentences as strings
    ▶ Easy to pass into the next component
    ▶ Difficult to access token-based information
▶ Storing sentences as character positions
    ▶ Easy to combine with token-based information
    ▶ Not complicated to extract the sentence string
        ▶ text.substring(begin, end)

text  »The dog is fed …«
toks  ⟨ [ beg 0, end 3 ], [ beg 4, end 7 ], … ⟩
sent  ⟨ [ beg 0, end 53 ], … ⟩
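When sentences and tokens are both stored as character positions, combining them reduces to comparing offsets. The Java sketch below illustrates this under some assumptions: `end` is exclusive, the second sentence »It barks.« is invented to have more than one sentence, only a few tokens are listed, and the names `SentenceSpanDemo`, `Span`, and `tokensIn` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSpanDemo {
    // Character span with exclusive end, used for tokens and sentences alike.
    record Span(int begin, int end) {}

    // Returns the tokens whose span lies completely inside the sentence span.
    static List<Span> tokensIn(Span sentence, List<Span> tokens) {
        List<Span> covered = new ArrayList<>();
        for (Span t : tokens) {
            if (t.begin() >= sentence.begin() && t.end() <= sentence.end()) {
                covered.add(t);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        String text = "The dog is fed by Mr. Parker, who wears a yellow hat. It barks.";
        List<Span> sentences = List.of(new Span(0, 53), new Span(54, 63));
        // A hand-picked subset of the tokens: The, dog, hat, It, barks
        List<Span> tokens = List.of(new Span(0, 3), new Span(4, 7),
                new Span(49, 52), new Span(54, 56), new Span(57, 62));

        for (Span s : sentences) {
            // The sentence string is recovered with substring(begin, end).
            System.out.println(text.substring(s.begin(), s.end()));
            for (Span t : tokensIn(s, tokens)) {
                System.out.println("  " + text.substring(t.begin(), t.end()));
            }
        }
    }
}
```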

NLP Tasks: Part-of-speech tagging and lemmatization

▶ PoS tagging: Assign a word category (noun, verb, …) to each token
▶ Lemmatization: Determine the base form of each word

⟨ [ surf The, pos DT, lem the ], [ surf dog, pos NN, lem dog ], … ⟩

txt  »The dog is fed …«
tok  ⟨ [ beg 0, end 3, pos DT, lem the ], [ beg 4, end 7, pos NN, lem dog ], … ⟩

▶ Straightforward with either storage approach (token objects or character positions)

Accessing Information

▶ Extract all nouns: Iterate over each token, collect nouns
▶ Alternative: PoS tagger produces a separate list
    ▶ [ nouns ⟨ [ surf dog ], [ surf hat ] ⟩ ]
    ▶ nouns ⟨ [ beg 4, end 7 ], [ beg 49, end 52 ] ⟩
▶ Often: Trade-off between storage and time
    ▶ I.e., storing more information makes access faster, but consumes more space (even if you only store pointers)
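The two options can be contrasted in a small Java sketch. It assumes Penn-Treebank-style tags and treats only `NN` and `NNS` as nouns, so that proper nouns such as ›Parker‹ are left out as in the noun list above. The names `NounIndexDemo`, `Token`, `isNoun`, `collectNouns`, and `buildNounIndex` are hypothetical, and only a few tokens of the sentence are listed.

```java
import java.util.ArrayList;
import java.util.List;

class NounIndexDemo {
    // Token stored as character positions plus a part-of-speech tag.
    record Token(int begin, int end, String pos) {}

    // Assumption: common nouns only (Penn-style tags NN and NNS).
    static boolean isNoun(Token t) {
        return t.pos().equals("NN") || t.pos().equals("NNS");
    }

    // Option 1: no extra storage, but every query scans all tokens.
    static List<Token> collectNouns(List<Token> tokens) {
        List<Token> nouns = new ArrayList<>();
        for (Token t : tokens) {
            if (isNoun(t)) nouns.add(t);
        }
        return nouns;
    }

    // Option 2: build a separate list once; it stores indices into the token list.
    static List<Integer> buildNounIndex(List<Token> tokens) {
        List<Integer> index = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (isNoun(tokens.get(i))) index.add(i);
        }
        return index;
    }

    public static void main(String[] args) {
        String text = "The dog is fed by Mr. Parker, who wears a yellow hat.";
        // A subset of the tokens: The, dog, is, hat
        List<Token> tokens = List.of(
                new Token(0, 3, "DT"), new Token(4, 7, "NN"),
                new Token(8, 10, "VBZ"), new Token(49, 52, "NN"));
        for (Token n : collectNouns(tokens)) {
            System.out.println(text.substring(n.begin(), n.end()));
        }
        System.out.println(buildNounIndex(tokens)); // indices into this subset
    }
}
```

The index in option 2 costs additional memory and must be kept in sync with the token list, which is the storage side of the trade-off.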

NLP Tasks: Named Entity Recognition

▶ Detection of proper names in texts
▶ Classification according to entity type
    ▶ Person, Organisation, Location, …
▶ Multi-word expressions: »Donald Trump«, »the European Union«, …
▶ Overlapping: »Die [[Oswald von [Wolkenstein]LOC]PER-Gesellschaft]ORG ist …« (»The Oswald von Wolkenstein Society is …«)

Data Structures

▶ Lists of token sequences:

ne  ⟨ [ named-entity: type ORG, tokens ⟨ 0, 1, 2, 3 ⟩ ],
      [ named-entity: type PER, tokens ⟨ 0, 1, 2 ⟩ ],
      [ named-entity: type LOC, tokens ⟨ 2 ⟩ ] ⟩
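A Java sketch of the list-of-token-sequences representation follows. The tokenization is an assumption: the indices only fit the example if counting starts at ›Oswald‹ and the compound is split into four tokens. The names `NamedEntityDemo`, `NamedEntity`, `covers`, and `surface` are hypothetical.

```java
import java.util.List;

class NamedEntityDemo {
    // A named entity points to tokens by their index in the token list.
    record NamedEntity(String type, List<Integer> tokens) {
        // True if every token of the other entity also belongs to this one.
        boolean covers(NamedEntity other) {
            return tokens.containsAll(other.tokens());
        }
    }

    // Joins the surface forms of the entity's tokens with spaces.
    static String surface(NamedEntity ne, List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i : ne.tokens()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(tokens.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed tokenization, with indices counted from "Oswald".
        List<String> tokens = List.of("Oswald", "von", "Wolkenstein", "-Gesellschaft");
        NamedEntity org = new NamedEntity("ORG", List.of(0, 1, 2, 3));
        NamedEntity per = new NamedEntity("PER", List.of(0, 1, 2));
        NamedEntity loc = new NamedEntity("LOC", List.of(2));

        System.out.println(surface(per, tokens)); // Oswald von Wolkenstein
        System.out.println(org.covers(per));      // true
        System.out.println(per.covers(loc));      // true
        System.out.println(loc.covers(per));      // false
    }
}
```

Since each entity keeps its own token list, nested and overlapping entities need no special treatment, which a single label per token could not express.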

NLP Tasks: Syntax

Phrase Structure Syntax (HPSG, LFG, Montague)

▶ Parse tree consists of phrases
▶ A phrase covers multiple words or other phrases
▶ Context-free grammar

Dependency Syntax

▶ Parse tree consists of words
▶ Edges in the tree represent dependency relations (instead of constituency)

Niu and Osborne (2019)

Trees: Formal Definition

Definition
1. A rooted tree is a connected, acyclic, undirected graph, with one designated node as root.
2. A tree is some payload data with a (possibly empty) list of children, which are also trees.

Figure: A tree with nodes 1 to 8. Node 1 is the root node (no parent), nodes 2 and 3 are its children, and nodes 4 to 8 are leaf nodes (no children).

Trees

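Listing: Top-down tree

    public class Tree {
        Object payload;   // data stored in this node
        Tree[] children;  // child nodes; empty for a leaf
    }

[ p 1, c ⟨ [ p 2, c ⟨ [ p 4 ], … ⟩ ], [ p 3, c ⟨ … ⟩ ] ⟩ ]

Listing: Bottom-up tree

    public class Tree {
        Object payload;  // data stored in this node
        Tree parent;     // null for the root node
    }

[ p 6, par [ p 3, par [ p 1, par null ] ] ]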
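Both directions can also be stored in the same node, at the cost of extra memory. The Java sketch below assumes such a combined node and builds an eight-node tree like the one in the figure; that 4 and 5 hang under 2 and that 6, 7, and 8 hang under 3 is partly an assumption (only 4 under 2 and 6 under 3 follow from the structures above). The names `TreeDemo`, `Node`, `addChild`, `countLeaves`, and `pathToRoot` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TreeDemo {
    // A node that stores both directions: children (top-down) and parent (bottom-up).
    static class Node {
        final int payload;
        final List<Node> children = new ArrayList<>();
        Node parent; // null for the root

        Node(int payload) {
            this.payload = payload;
        }

        // Creates a child node, links it in both directions, and returns it.
        Node addChild(int childPayload) {
            Node child = new Node(childPayload);
            child.parent = this;
            children.add(child);
            return child;
        }
    }

    // Top-down traversal: counts the nodes without children.
    static int countLeaves(Node node) {
        if (node.children.isEmpty()) return 1;
        int leaves = 0;
        for (Node c : node.children) leaves += countLeaves(c);
        return leaves;
    }

    // Bottom-up traversal: payloads on the path from a node up to the root.
    static List<Integer> pathToRoot(Node node) {
        List<Integer> path = new ArrayList<>();
        for (Node n = node; n != null; n = n.parent) path.add(n.payload);
        return path;
    }

    public static void main(String[] args) {
        Node root = new Node(1);
        Node n2 = root.addChild(2);
        Node n3 = root.addChild(3);
        n2.addChild(4);
        n2.addChild(5);               // assumed position of node 5
        Node n6 = n3.addChild(6);
        n3.addChild(7);               // assumed position of node 7
        n3.addChild(8);               // assumed position of node 8

        System.out.println(countLeaves(root)); // 5
        System.out.println(pathToRoot(n6));    // [6, 3, 1]
    }
}
```

With only `children`, finding a node's parent requires a search from the root; with only `parent`, listing a node's children requires a scan over all nodes.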

Phrase Structure Syntax

▶ Phrases: Separate entities in tree
▶ Phrases have properties (e.g., agreement in number)

S [num sg]
    NP [num sg]
        D   This
        N   tree
    VP [num sg]
        V [num sg]   is
        PP
            P   for
            N   illustration

A phrase:

[ cat S, agr [ num sg ], chi ⟨ … ⟩ ]
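A phrase with the features `cat`, `agr`, and `chi` could be modelled as in the following Java sketch. It is one possible encoding, not a prescribed one: agreement features are kept in a string map, leaves carry an assumed `token` index (0 for ›This‹ up to 4 for ›illustration‹), and the names `PhraseDemo`, `Phrase`, `leaf`, `node`, `coveredTokens`, and `example` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PhraseDemo {
    // A phrase node: category, agreement features, and child phrases.
    // Leaves carry the index of the token they cover; inner nodes use -1.
    record Phrase(String cat, Map<String, String> agr, List<Phrase> chi, int token) {
        static Phrase leaf(String cat, int token) {
            return new Phrase(cat, Map.of(), List.of(), token);
        }

        static Phrase node(String cat, Map<String, String> agr, Phrase... chi) {
            return new Phrase(cat, agr, List.of(chi), -1);
        }
    }

    // Collects the token indices covered by a phrase, left to right.
    static List<Integer> coveredTokens(Phrase p) {
        List<Integer> result = new ArrayList<>();
        if (p.chi().isEmpty()) {
            result.add(p.token());
        } else {
            for (Phrase child : p.chi()) result.addAll(coveredTokens(child));
        }
        return result;
    }

    // Builds the tree for "This tree is for illustration".
    static Phrase example() {
        Map<String, String> sg = Map.of("num", "sg");
        Phrase np = Phrase.node("NP", sg, Phrase.leaf("D", 0), Phrase.leaf("N", 1));
        Phrase pp = Phrase.node("PP", Map.of(), Phrase.leaf("P", 3), Phrase.leaf("N", 4));
        Phrase v = new Phrase("V", sg, List.of(), 2);
        Phrase vp = Phrase.node("VP", sg, v, pp);
        return Phrase.node("S", sg, np, vp);
    }

    public static void main(String[] args) {
        Phrase s = example();
        System.out.println(s.cat() + " " + s.agr());
        System.out.println(coveredTokens(s));             // [0, 1, 2, 3, 4]
        System.out.println(coveredTokens(s.chi().get(1))); // tokens of the VP
    }
}
```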

Dependency Syntax

▶ Links between tokens (sometimes typed)
▶ Each token is thus linked to its governing token
▶ »This tree is for illustration.«
    ▶ ›This‹ −det→ ›tree‹, ›tree‹ −nsubj→ ›is‹, ›for‹ −prep→ ›is‹, ›illustration‹ −pobj→ ›for‹, ›.‹ −punct→ ›is‹
▶ Storing is simple: Each token stores its governor and the relation type
    ▶ conceptually, a bottom-up tree

txt  »This tree is for illustration.«
tok  ⟨ 0 [ beg 0, end 4, gov 1, rel det ],
       1 [ beg 5, end 9, gov 2, rel nsubj ],
       2 [ beg 10, end 12, gov null, rel null ], … ⟩
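The bottom-up character of this storage shows in a short Java sketch: the root and a token's governor are found directly, while the dependents of a token require a scan. The sketch follows the listed arrows, so ›tree‹ is governed by ›is‹ (token 2); labelling the final period as `punct` is an assumption, as are the offsets of the tokens not spelled out above and the names `DependencyDemo`, `Token`, `root`, `dependentsOf`, and `example`.

```java
import java.util.ArrayList;
import java.util.List;

class DependencyDemo {
    // gov is the index of the governing token, or null for the root.
    record Token(int begin, int end, Integer gov, String rel) {}

    // Finds the root: the first token without a governor (-1 if there is none).
    static int root(List<Token> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).gov() == null) return i;
        }
        return -1;
    }

    // Dependents are not stored, so they are found by scanning all tokens.
    static List<Integer> dependentsOf(int head, List<Token> tokens) {
        List<Integer> deps = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Integer gov = tokens.get(i).gov();
            if (gov != null && gov == head) deps.add(i);
        }
        return deps;
    }

    // Tokens of "This tree is for illustration."
    static List<Token> example() {
        return List.of(
                new Token(0, 4, 1, "det"),       // This
                new Token(5, 9, 2, "nsubj"),     // tree
                new Token(10, 12, null, null),   // is (root)
                new Token(13, 16, 2, "prep"),    // for
                new Token(17, 29, 3, "pobj"),    // illustration
                new Token(29, 30, 2, "punct"));  // . (relation label assumed)
    }

    public static void main(String[] args) {
        List<Token> tokens = example();
        System.out.println(root(tokens));            // 2
        System.out.println(dependentsOf(2, tokens)); // [1, 3, 5]
    }
}
```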

Full Example

txt  »The dog is fed by Mr. Parker, who wears a yellow hat.«

tok  ⟨ 0 [ beg 0, end 3, pos DT, lem the, gov 1, rel det ],
       1 [ beg 4, end 7, pos NN, lem dog, gov 2, rel nsubj ], … ⟩

nns  ⟨ 1, 12 ⟩

nes  ⟨ [ type PER, tokens ⟨ 5, 6 ⟩ ] ⟩

phr  [ cat S, chi ⟨ [ cat NP, chi ⟨ 0, 1 ⟩ ], [ cat VP, chi ⟨ … ⟩ ] ⟩ ]

Summary

▶ NLP data structures are complex, because language is
▶ Annotations can be rooted in tokens or character positions
▶ Many linguistic structures can be represented as trees
    ▶ Trees are a special kind of graph

Section 3

Next Exercise

Exercise 8

Section 4

Appendix

References

Niu, Ruochen and Timothy Osborne (2019). »Chunks are components: A dependency grammar approach to the syntactic structure of Mandarin«. In: Lingua 224, pp. 60–83. ISSN: 0024-3841. DOI: https://doi.org/10.1016/j.lingua.2019.03.003. URL: http://www.sciencedirect.com/science/article/pii/S0024384118308350.
