nlp data structures
TRANSCRIPT
NLP Data StructuresPU Tools, Ressourcen, Infrastruktur
Nils Reiter,[email protected]
January 7, 2021(Winter term 2020/21)
Reiter NLP Data Structures 2021-01-07 1 / 24
Recap
Exercise 7
https://github.com/idh-cologne-tools-ressourcen-infra/exercise-07
Reiter NLP Data Structures 2021-01-07 3 / 24
NLP Data Structures
Data Structures▶ Storing information digitally▶ Atomic values: int, String, boolean, …▶ More complex data: Combinations of multiple atomic values
Example (Address)street Stringhouse intzip intcity String
Example (Music)
artist Stringtitle Stringrelease int
tracks⟨title String
length timerating int
, …⟩
rating intcompilation boolean
Reiter NLP Data Structures 2021-01-07 5 / 24
NLP Data Structures
Data Structures▶ Storing information digitally▶ Atomic values: int, String, boolean, …▶ More complex data: Combinations of multiple atomic values
Example (Address)street Stringhouse intzip intcity String
Example (Music)
artist Stringtitle Stringrelease int
tracks⟨title String
length timerating int
, …⟩
rating intcompilation boolean
Reiter NLP Data Structures 2021-01-07 5 / 24
NLP Data Structures
Data Structures▶ Storing information digitally▶ Atomic values: int, String, boolean, …▶ More complex data: Combinations of multiple atomic values
Example (Address)street Stringhouse intzip intcity String
Example (Music)
artist Stringtitle Stringrelease int
tracks⟨title String
length timerating int
, …⟩
rating intcompilation boolean
Reiter NLP Data Structures 2021-01-07 5 / 24
NLP Data Structures
Data StructuresCollections
▶ Storing individual objects is often clear▶ Collections of values of the same type
▶ Impact on programming experience▶ Impact on performance
▶ Commonly used collection data structures (e.g., in interviews)▶ Arrays, stacks, queues, linked lists, trees, graphs, …▶ Sometimes hidden and/or merged by programming libraries
Reiter NLP Data Structures 2021-01-07 6 / 24
NLP Data Structures
Data StructuresCollections
▶ Storing individual objects is often clear▶ Collections of values of the same type
▶ Impact on programming experience▶ Impact on performance
▶ Commonly used collection data structures (e.g., in interviews)▶ Arrays, stacks, queues, linked lists, trees, graphs, …▶ Sometimes hidden and/or merged by programming libraries
Reiter NLP Data Structures 2021-01-07 6 / 24
NLP Data Structures
NLP Data
▶ Text and annotations▶ Annotations: Meta data associated with specific locations in the text
Storing Text▶ Texts can be stored as strings▶ Beware: Unicode▶ Limitations on length
Reiter NLP Data Structures 2021-01-07 7 / 24
NLP Data Structures
NLP Tasks
▶ Tokenization, sentence splitting▶ Morphological analysis▶ Part of speech-tagging, lemmatisation▶ Named entity recognition▶ Phrase structure and dependency syntax▶ Coreference chains
Reiter NLP Data Structures 2021-01-07 8 / 24
NLP Data Structures
NLP TasksTokenization
ExampleSentence: The dog is fed by Mr. Parker, who wears a yellow hat.Tokens: [The] [dog] [is] [fed] [by] [Mr.] [Parker] [,] [who] [wears] [a] [yellow] [hat] [.]
▶ Split the text into ›meaningful‹ units (e.g., words)▶ Two storage approaches
A list of strings / token objects⟨[surface The
],[surface dog
], …
⟩A list of character positions(preferred)text »The dog is fed …«
toks⟨[
beg 0end 3
],[beg 4end 7
], …
⟩
Reiter NLP Data Structures 2021-01-07 9 / 24
NLP Data Structures
NLP TasksTokenization
ExampleSentence: The dog is fed by Mr. Parker, who wears a yellow hat.Tokens: [The] [dog] [is] [fed] [by] [Mr.] [Parker] [,] [who] [wears] [a] [yellow] [hat] [.]
▶ Split the text into ›meaningful‹ units (e.g., words)▶ Two storage approaches
A list of strings / token objects⟨[surface The
],[surface dog
], …
⟩
A list of character positions(preferred)text »The dog is fed …«
toks⟨[
beg 0end 3
],[beg 4end 7
], …
⟩
Reiter NLP Data Structures 2021-01-07 9 / 24
NLP Data Structures
NLP TasksTokenization
ExampleSentence: The dog is fed by Mr. Parker, who wears a yellow hat.Tokens: [The] [dog] [is] [fed] [by] [Mr.] [Parker] [,] [who] [wears] [a] [yellow] [hat] [.]
▶ Split the text into ›meaningful‹ units (e.g., words)▶ Two storage approaches
A list of strings / token objects⟨[surface The
],[surface dog
], …
⟩A list of character positions(preferred)text »The dog is fed …«
toks⟨[
beg 0end 3
],[beg 4end 7
], …
⟩Reiter NLP Data Structures 2021-01-07 9 / 24
NLP Data Structures
NLP TasksSentence splitting
▶ Split a text into sentences▶ Storing sentences as strings
▶ Easy to pass into the next component▶ Difficult to access token-based information
▶ Storing sentences as character positions▶ Easy to combine with token-based information▶ Not complicated to extract the sentence string
▶ text.substring(begin, end)
text »The dog is fed …«
toks⟨[
beg 0end 3
],[beg 4end 7
], …
⟩
sent⟨[
beg 0end 53
], …
⟩
Reiter NLP Data Structures 2021-01-07 10 / 24
NLP Data Structures
NLP TasksSentence splitting
▶ Split a text into sentences▶ Storing sentences as strings
▶ Easy to pass into the next component▶ Difficult to access token-based information
▶ Storing sentences as character positions▶ Easy to combine with token-based information▶ Not complicated to extract the sentence string
▶ text.substring(begin, end)
text »The dog is fed …«
toks⟨[
beg 0end 3
],[beg 4end 7
], …
⟩
sent⟨[
beg 0end 53
], …
⟩
Reiter NLP Data Structures 2021-01-07 10 / 24
NLP Data Structures
NLP TasksSentence splitting
▶ Split a text into sentences▶ Storing sentences as strings
▶ Easy to pass into the next component▶ Difficult to access token-based information
▶ Storing sentences as character positions▶ Easy to combine with token-based information▶ Not complicated to extract the sentence string
▶ text.substring(begin, end)
text »The dog is fed …«
toks⟨[
beg 0end 3
],[beg 4end 7
], …
⟩
sent⟨[
beg 0end 53
], …
⟩
Reiter NLP Data Structures 2021-01-07 10 / 24
NLP Data Structures
NLP TasksPart of speech-tagging and lemmatization
▶ PoS-tagging: Assign word categories (noun, verb, …) to each token▶ Lemmatization: Determine the base form of each word
⟨surf Thepos detlem the
,
surf dogpos NNlem dog
, …⟩
txt »The dog is fed …«
tok.⟨
beg 0end 3pos detlem the
,
beg 4end 7pos NNlem dog
, …⟩
▶ Straightforward in any case
Reiter NLP Data Structures 2021-01-07 11 / 24
NLP Data Structures
NLP TasksPart of speech-tagging and lemmatization
▶ PoS-tagging: Assign word categories (noun, verb, …) to each token▶ Lemmatization: Determine the base form of each word
⟨surf Thepos detlem the
,
surf dogpos NNlem dog
, …⟩
txt »The dog is fed …«
tok.⟨
beg 0end 3pos detlem the
,
beg 4end 7pos NNlem dog
, …⟩
▶ Straightforward in any case
Reiter NLP Data Structures 2021-01-07 11 / 24
NLP Data Structures
NLP TasksPart of speech-tagging and lemmatization
▶ PoS-tagging: Assign word categories (noun, verb, …) to each token▶ Lemmatization: Determine the base form of each word
⟨surf Thepos detlem the
,
surf dogpos NNlem dog
, …⟩
txt »The dog is fed …«
tok.⟨
beg 0end 3pos detlem the
,
beg 4end 7pos NNlem dog
, …⟩
▶ Straightforward in any case
Reiter NLP Data Structures 2021-01-07 11 / 24
NLP Data Structures
Accessing Information
▶ Extract all nouns: Iterate over each token, collect nouns
▶ Alternative: PoS-Tagger produces a separate list▶
[nouns
⟨[surf dog
],[surf hat
]⟩]▶
nouns⟨[
beg 4end 7
],[beg 49end 52
]⟩▶ Often: Trade-off between storage and time
▶ I.e., storing more information makes access faster, but consumes more space (even if youonly store pointers)
Reiter NLP Data Structures 2021-01-07 12 / 24
NLP Data Structures
Accessing Information
▶ Extract all nouns: Iterate over each token, collect nouns▶ Alternative: PoS-Tagger produces a separate list
▶[nouns
⟨[surf dog
],[surf hat
]⟩]▶
nouns⟨[
beg 4end 7
],[beg 49end 52
]⟩
▶ Often: Trade-off between storage and time▶ I.e., storing more information makes access faster, but consumes more space (even if you
only store pointers)
Reiter NLP Data Structures 2021-01-07 12 / 24
NLP Data Structures
Accessing Information
▶ Extract all nouns: Iterate over each token, collect nouns▶ Alternative: PoS-Tagger produces a separate list
▶[nouns
⟨[surf dog
],[surf hat
]⟩]▶
nouns⟨[
beg 4end 7
],[beg 49end 52
]⟩▶ Often: Trade-off between storage and time
▶ I.e., storing more information makes access faster, but consumes more space (even if youonly store pointers)
Reiter NLP Data Structures 2021-01-07 12 / 24
NLP Data Structures
NLP TasksNamed Entity Recognition
▶ Detection of proper names in texts▶ Classification according to entity type
▶ Person, Organisation, Location, …
▶ Multi-word expressions: »Donald Trump«, »the European Union«, …▶ Overlapping: »Die [[Oswald von [Wolkenstein]LOC]PER-Gesellschaft]ORG ist …«
Data Structures▶ Lists of token sequences:ne
⟨named-entitytype ORGtokens
⟨0 , 1 , 2 , 3
⟩,
named-entitytype PERtokens
⟨0 , 1 , 2
⟩,
named-entitytype LOCtokens
⟨2⟩⟩
Reiter NLP Data Structures 2021-01-07 13 / 24
NLP Data Structures
NLP TasksNamed Entity Recognition
▶ Detection of proper names in texts▶ Classification according to entity type
▶ Person, Organisation, Location, …▶ Multi-word expressions: »Donald Trump«, »the European Union«, …
▶ Overlapping: »Die [[Oswald von [Wolkenstein]LOC]PER-Gesellschaft]ORG ist …«Data Structures▶ Lists of token sequences:ne
⟨named-entitytype ORGtokens
⟨0 , 1 , 2 , 3
⟩,
named-entitytype PERtokens
⟨0 , 1 , 2
⟩,
named-entitytype LOCtokens
⟨2⟩⟩
Reiter NLP Data Structures 2021-01-07 13 / 24
NLP Data Structures
NLP TasksNamed Entity Recognition
▶ Detection of proper names in texts▶ Classification according to entity type
▶ Person, Organisation, Location, …▶ Multi-word expressions: »Donald Trump«, »the European Union«, …▶ Overlapping: »Die [[Oswald von [Wolkenstein]LOC]PER-Gesellschaft]ORG ist …«
Data Structures▶ Lists of token sequences:ne
⟨named-entitytype ORGtokens
⟨0 , 1 , 2 , 3
⟩,
named-entitytype PERtokens
⟨0 , 1 , 2
⟩,
named-entitytype LOCtokens
⟨2⟩⟩
Reiter NLP Data Structures 2021-01-07 13 / 24
NLP Data Structures
NLP TasksNamed Entity Recognition
▶ Detection of proper names in texts▶ Classification according to entity type
▶ Person, Organisation, Location, …▶ Multi-word expressions: »Donald Trump«, »the European Union«, …▶ Overlapping: »Die [[Oswald von [Wolkenstein]LOC]PER-Gesellschaft]ORG ist …«
Data Structures▶ Lists of token sequences:ne
⟨named-entitytype ORGtokens
⟨0 , 1 , 2 , 3
⟩,
named-entitytype PERtokens
⟨0 , 1 , 2
⟩,
named-entitytype LOCtokens
⟨2⟩⟩
Reiter NLP Data Structures 2021-01-07 13 / 24
NLP Data Structures
NLP TasksSyntax
Phrase Structure Syntax (HPSG, LFG, Montague)▶ Parse tree consists of phrases▶ A phrase covers multiple words or other phrases▶ Context-free grammar
Dependency Syntax▶ Parse tree consists of words▶ Edges in the tree represent dependency relation (instead of constituency)Niu and Osborne (2019)
Reiter NLP Data Structures 2021-01-07 14 / 24
NLP Data Structures
NLP TasksSyntax
Phrase Structure Syntax (HPSG, LFG, Montague)▶ Parse tree consists of phrases▶ A phrase covers multiple words or other phrases▶ Context-free grammar
Dependency Syntax▶ Parse tree consists of words▶ Edges in the tree represent dependency relation (instead of constituency)
Niu and Osborne (2019)
Reiter NLP Data Structures 2021-01-07 14 / 24
NLP Data Structures
NLP TasksSyntax
Phrase Structure Syntax (HPSG, LFG, Montague)▶ Parse tree consists of phrases▶ A phrase covers multiple words or other phrases▶ Context-free grammar
Dependency Syntax▶ Parse tree consists of words▶ Edges in the tree represent dependency relation (instead of constituency)Niu and Osborne (2019)
Reiter NLP Data Structures 2021-01-07 14 / 24
NLP Data Structures
TreesFormal Definition
Definition1. A rooted tree is a connected, acyclic, undirected graph, with one designated node as root.
2. A tree is some payload data with a (possibly empty) list of children, which are also trees.
1
2 3
4 5 6 7 8
root node (no parent)
leaf node (no children)
Figure: A tree
Reiter NLP Data Structures 2021-01-07 15 / 24
NLP Data Structures
TreesFormal Definition
Definition1. A rooted tree is a connected, acyclic, undirected graph, with one designated node as root.
2. A tree is some payload data with a (possibly empty) list of children, which are also trees.
1
2 3
4 5 6 7 8
root node (no parent)
leaf node (no children)
Figure: A tree
Reiter NLP Data Structures 2021-01-07 15 / 24
NLP Data Structures
TreesFormal Definition
Definition1. A rooted tree is a connected, acyclic, undirected graph, with one designated node as root.
2. A tree is some payload data with a (possibly empty) list of children, which are also trees.
1
2 3
4 5 6 7 8
root node (no parent)
leaf node (no children)
Figure: A tree
Reiter NLP Data Structures 2021-01-07 15 / 24
NLP Data Structures
TreesFormal Definition
Definition1. A rooted tree is a connected, acyclic, undirected graph, with one designated node as root.
2. A tree is some payload data with a (possibly empty) list of children, which are also trees.
1
2 3
4 5 6 7 8
root node (no parent)
leaf node (no children)
Figure: A tree
Reiter NLP Data Structures 2021-01-07 15 / 24
NLP Data Structures
Trees
Listing: Top-down tree1 public class Tree {2 Object payload;3 Tree[] children;4 }
p 1
c⟨[
p 2c
⟨[p 4
], …
⟩],[p 3c
⟨…⟩]⟩
Listing: Bottom-up tree1 public class Tree {2 Object payload;3 Tree parent;4 }
p 6
par
p 3par
[p 1par null
]
Reiter NLP Data Structures 2021-01-07 16 / 24
NLP Data Structures
Phrase Structure Syntax
▶ Phrases: Separate entities in tree
▶ Phrases have properties (e.g., agreement in number)
S
NP
D N
VP
V PP
P N
This tree is for illustration
A phrase:cat Sagr
[num sg
]chi
⟨…⟩
Reiter NLP Data Structures 2021-01-07 17 / 24
NLP Data Structures
Phrase Structure Syntax
▶ Phrases: Separate entities in tree▶ Phrases have properties (e.g., agreement in number)
S[num sg
]NP
[num sg
]D N
VP[num sg
]V
[num sg
]PP
P N
This tree is for illustration
A phrase:cat Sagr
[num sg
]chi
⟨…⟩
Reiter NLP Data Structures 2021-01-07 17 / 24
NLP Data Structures
Dependency Syntax
▶ Links between tokens (sometimes typed)▶ Each token is thus linked to its governing token▶ »This tree is for illustration.«
▶ ›This‹ det−−→ ›tree‹, ›tree‹ nsubj−−−→ ›is‹, ›for‹ prep−−→ ›is‹, ›illustration‹ pobj−−→ ›for‹, ›.‹ pobj−−→ ›is‹.
▶ Storing is simple: Each token stores information for the governor and the relation type▶ conceptually, a bottom-up tree
txt »This tree is for illustration.«
tok.⟨
0
beg 0end 4gov 1rel det
, 1
beg 5end 9gov 3rel nsubj
, 2
beg 10end 12gov nullrel null
, …⟩
Reiter NLP Data Structures 2021-01-07 18 / 24
NLP Data Structures
Dependency Syntax
▶ Links between tokens (sometimes typed)▶ Each token is thus linked to its governing token▶ »This tree is for illustration.«
▶ ›This‹ det−−→ ›tree‹, ›tree‹ nsubj−−−→ ›is‹, ›for‹ prep−−→ ›is‹, ›illustration‹ pobj−−→ ›for‹, ›.‹ pobj−−→ ›is‹.▶ Storing is simple: Each token stores information for the governor and the relation type
▶ conceptually, a bottom-up tree
txt »This tree is for illustration.«
tok.⟨
0
beg 0end 4gov 1rel det
, 1
beg 5end 9gov 3rel nsubj
, 2
beg 10end 12gov nullrel null
, …⟩
Reiter NLP Data Structures 2021-01-07 18 / 24
NLP Data Structures
Dependency Syntax
▶ Links between tokens (sometimes typed)▶ Each token is thus linked to its governing token▶ »This tree is for illustration.«
▶ ›This‹ det−−→ ›tree‹, ›tree‹ nsubj−−−→ ›is‹, ›for‹ prep−−→ ›is‹, ›illustration‹ pobj−−→ ›for‹, ›.‹ pobj−−→ ›is‹.▶ Storing is simple: Each token stores information for the governor and the relation type
▶ conceptually, a bottom-up treetxt »This tree is for illustration.«
tok.⟨
0
beg 0end 4gov 1rel det
, 1
beg 5end 9gov 3rel nsubj
, 2
beg 10end 12gov nullrel null
, …⟩
Reiter NLP Data Structures 2021-01-07 18 / 24
NLP Data Structures
Full Example
txt »The dog is fed by Mr. Parker, who wears a yellow hat.«
tok⟨
0
beg 0end 3pos DTlem thegov 1rel det
, 1
beg 4end 7pos NNlem doggov 2rel nsubj
, …⟩
nns⟨
1 , 12⟩
nes⟨[
type PERtokens
⟨5 , 6
⟩]⟩
phr
cat S
chi⟨[
cat NPchi
⟨0 , 1
⟩],[
cat VPchi
⟨…⟩]⟩
Reiter NLP Data Structures 2021-01-07 19 / 24
NLP Data Structures
Summary
▶ NLP data structures are complex, because language is▶ Annotations can be rooted in tokens or character positions▶ Many linguistic structures can be represented as trees
▶ Trees are a special kind of graph
Reiter NLP Data Structures 2021-01-07 20 / 24
Appendix
References I
Niu, Ruochen and Timothy Osborne (2019). »Chunks are components: A dependencygrammar approach to the syntactic structure of Mandarin«. In: Lingua 224, pp. 60 –83.issn: 0024-3841. doi: https://doi.org/10.1016/j.lingua.2019.03.003. url:http://www.sciencedirect.com/science/article/pii/S0024384118308350.
Reiter NLP Data Structures 2021-01-07 24 / 24