Tokenization - Definition
“Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.” - Wikipedia
spaCy tokenization: overview
• Input: unicode string
• Output: Doc object
• A Doc object is a sequence of Token objects; a Vocab instance is needed to create a Doc
• Vocab is a storage class for vocabulary and other data shared across a language
spaCy tokenization: overview
• Where possible, spaCy stores data in the shared vocabulary, the Vocab, so it can be reused across multiple documents
• To save memory, all strings are encoded as hash values
• The StringStore acts as a lookup table that works in both directions: string to hash and hash to string
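The two-way lookup can be illustrated with a toy re-implementation. The class below is an illustrative sketch under assumed simplifications, not spaCy's actual C-level StringStore; in spaCy itself, `nlp.vocab.strings` plays this role, mapping strings to 64-bit hashes and back.

```python
# A toy analogue of spaCy's StringStore (illustration only): each string is
# stored once under a stable 64-bit hash, and the table can be queried in
# both directions. spaCy uses a different hash function internally.
import hashlib

class ToyStringStore:
    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string):
        # Stable 64-bit integer derived from SHA-1 (stand-in for spaCy's hash)
        return int.from_bytes(hashlib.sha1(string.encode("utf8")).digest()[:8], "big")

    def add(self, string):
        h = self._hash(string)
        self._by_hash[h] = string
        return h

    def __getitem__(self, key):
        # string -> hash, or hash -> string
        if isinstance(key, str):
            return self._hash(key)
        return self._by_hash[key]

store = ToyStringStore()
coffee_hash = store.add("coffee")
```

Because the hash is computed from the string itself, the same string always maps to the same value, so it only ever needs to be stored once.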
spaCy tokenization: overview
• spaCy's models are statistical and every "decision" they make is a prediction
• This prediction is based on the examples the model has seen during training
spaCy tokenization: a simplified overview
• A string or a text is given as input
• The input is segmented into tokens
• The resulting Doc is a sequence of tokens and can be iterated:
for token in doc:
    print(token.text)
• The output is the sequence of token elements from the input
spaCy tokenization – The algorithm
• Tokenizer exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
• Prefix: Character(s) at the beginning, like $, (, “, ¿.
• Suffix: Character(s) at the end, like km, ), ”, !.
• Infix: Character(s) in between, like -, --, /, ….
spaCy tokenization – The algorithm
• Iterates over space-separated substrings
• Checks whether a rule for the substring is defined
• Otherwise, it tries to consume a prefix
• If it consumed a prefix, it checks for special cases (e.g. “Don’t”)
• If it didn't consume a prefix, tries to consume a suffix
• If it can't consume a prefix or suffix, looks for "infixes"
• Once it can't consume any more parts of the string, handles it as a single token
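The steps above can be sketched in pure Python. This is a simplified illustration with tiny, made-up rule tables (`PREFIXES`, `SUFFIXES`, `INFIX_RE`, `SPECIAL_CASES` are assumptions for the example), not spaCy's actual implementation, whose rules are far more extensive.

```python
# Simplified sketch of the tokenization loop described above.
import re

PREFIXES = ('"', '(', '$', '¿')
SUFFIXES = ('"', ')', '!', '?', ',', '.')
INFIX_RE = re.compile(r'(--|-|/)')
SPECIAL_CASES = {"Don't": ['Do', "n't"], "don't": ['do', "n't"]}

def tokenize(text):
    tokens = []
    for substring in text.split():                  # iterate over space-separated substrings
        suffixes = []
        while substring:
            if substring in SPECIAL_CASES:          # a rule for the substring is defined
                tokens.extend(SPECIAL_CASES[substring])
                substring = ''
            elif substring.startswith(PREFIXES):    # try to consume a prefix
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring.endswith(SUFFIXES):      # otherwise try to consume a suffix
                suffixes.insert(0, substring[-1])
                substring = substring[:-1]
            elif INFIX_RE.search(substring):        # then look for infixes
                match = INFIX_RE.search(substring)
                if match.start():
                    tokens.append(substring[:match.start()])
                tokens.append(match.group())
                substring = substring[match.end():]
            else:                                   # nothing left to consume: single token
                tokens.append(substring)
                substring = ''
        tokens.extend(suffixes)
    return tokens
```

Note that the `while` loop re-checks the special cases after each prefix is stripped, which is how `"Don't` can first lose its quote and then match the `Don't` rule.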
spaCy Tokenization
• Attributes
• Methods
• Properties
• Text Processing
spaCy – Basics
# Import the spaCy library
import spacy
# Load the English pipeline: tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_md')
# Process a string
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
# Process a text file (in Python 3, open it with an explicit encoding)
text = open('PATH', encoding='utf8').read()
spaCy tokenization – Custom rules
• Exception rule for the contracted verb form “gimme”:
from spacy.symbols import ORTH, LEMMA, POS
nlp = spacy.load('en_core_web_md')
doc = nlp(u'gimme that')
# add special case rule
special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
nlp.tokenizer.add_special_case(u'gimme', special_case)
Token Class – Attributes
# Token processing options
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)  # and more
Token Class - available attributes
text: The original word text.
lemma_: The base form of the word.
pos_: The simple part-of-speech tag.
tag_: The detailed part-of-speech tag.
dep_: Syntactic dependency, i.e. the relation between tokens.
shape_: The word shape (capitalisation, punctuation, digits).
is_alpha: Is the token an alphabetic character?
is_stop: Is the token part of a stop list, i.e. among the most common words of the language?
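The shape_ value can be approximated with a small helper. This is a rough re-implementation for illustration only; spaCy's actual shape function additionally truncates long runs of identical character classes.

```python
# Rough approximation of spaCy's token.shape_: uppercase letters become "X",
# lowercase letters "x", digits "d", and other characters are kept as-is.
def word_shape(word):
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append('d')
        elif ch.isalpha():
            shape.append('X' if ch.isupper() else 'x')
        else:
            shape.append(ch)
    return ''.join(shape)

print(word_shape('Apple'))  # Xxxxx
print(word_shape('U.K.'))   # X.X.
print(word_shape('$1'))     # $d
```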
More token attributes
• sentiment : A scalar value indicating the positivity or negativity of the token
• like_email: Does the token resemble an email address?
• like_num: Does the token resemble a number?
• like_url: Does the token resemble a URL?
• vocab: The vocab object of the parent Doc
• head: The syntactic parent, or "governor", of this token
• …
• Complete list available here: https://spacy.io/api/token#attributes
spaCy Token class methods – Length
doc = nlp(u'Give it back!')
token = doc[0]
assert len(token) == 4
spaCy Token class methods – Token.nbor
doc = nlp(u'Give it back!')
give_nbor = doc[0].nbor()
assert give_nbor.text == u'it'
spaCy Token class methods – Token.children
doc = nlp(u'Give it back! He pleaded.')
give_children = doc[0].children
for child in give_children:
    print(child.text)  # prints “it”, “back” and “!”
spaCy Token class methods – Token.lefts
doc = nlp(u'I like New York in Autumn.')
lefts = [t.text for t in doc[3].lefts]
assert lefts == [u'New']
spaCy Token class methods – Token.rights
doc = nlp(u'I like New York in Autumn.')
rights = [t.text for t in doc[3].rights]
assert rights == [u'in']
spaCy Token class methods – Token.similarity
# Vectors needed!
doc = nlp(u'apple and orange')
apple = doc[0]
orange = doc[2]
apple_oranges = apple.similarity(orange)
orange_apples = orange.similarity(apple)
assert apple_oranges == orange_apples
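The symmetry asserted above follows from how the score is computed: similarity is the cosine of the angle between the two word vectors, which does not depend on argument order. A minimal sketch of the formula (not spaCy's vectorized implementation):

```python
# Cosine similarity between two vectors: dot product divided by the product
# of the norms. Swapping the arguments leaves every term unchanged, which is
# why token.similarity() is symmetric.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Parallel vectors score 1.0, orthogonal vectors 0.0, and the score is undefined for a zero vector, which is why word vectors are needed for meaningful results.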
spaCy Token class properties – Token.is_sent_start
doc = nlp(u'Give it back! He pleaded.')
assert doc[4].is_sent_start
assert not doc[5].is_sent_start
spaCy Token class properties – Token.has_vector
doc = nlp(u'I like apples')
apples = doc[2]
assert apples.has_vector
How can we work with spaCy?
• Token analysis of an Amazon review
• Pseudocode example
• Results
How can we work with spaCy? (Pseudocode)
# Let’s read and decode our review file
amazon_review = read file and decode utf8

# Let’s define arrays of token types that we want to process. They will cover the entire text
token_lemma = array of token lemmas for all tokens in amazon_review
token_shape = …

# Let’s create a dataframe table
dataframe = (add token_lemma, token_shape under LEMMA, SHAPE column headings)
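One way to realize this pseudocode in pure Python is sketched below. Everything here is a stand-in: the review string replaces reading a real file, lowercasing replaces token.lemma_, and a toy shape function replaces token.shape_, since both really come from spaCy's Token objects. With pandas installed, the final dict could be passed straight to pandas.DataFrame.

```python
# Sketch of the pseudocode above, with stand-ins for file I/O and for
# spaCy's token.lemma_ / token.shape_ attributes.
def word_shape(word):
    return ''.join(
        'd' if ch.isdigit()
        else ('X' if ch.isupper() else 'x') if ch.isalpha()
        else ch
        for ch in word
    )

amazon_review = "Great Kindle , arrived in 2019 !"   # stand-in for the decoded file

table = {'TEXT': [], 'LEMMA': [], 'SHAPE': []}
for token in amazon_review.split():
    table['TEXT'].append(token)
    table['LEMMA'].append(token.lower())      # stand-in for token.lemma_
    table['SHAPE'].append(word_shape(token))  # stand-in for token.shape_
```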
Tokenization of an Amazon review - results