Lecture 5: Annotating Things & List Comprehensions
Methods in Computational Linguistics II
Queens College
Lecture 5: List Comprehensions
2
Split into words
• sent = "That isn't the problem, Bob."
• sent.split()
• vs.
• nltk.word_tokenize(sent)
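A minimal sketch of why the two calls differ, written for Python 3. Whitespace splitting leaves punctuation attached to words. The regular expression below is only a rough stand-in chosen for illustration, not NLTK's algorithm; nltk.word_tokenize needs NLTK and its tokenizer data installed, and it typically also splits contractions such as "isn't" into "is" and "n't".

```python
import re

sent = "That isn't the problem, Bob."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sent.split()
print(whitespace_tokens)
# ['That', "isn't", 'the', 'problem,', 'Bob.']

# Rough stand-in for a tokenizer (an assumption, NOT what NLTK does):
# keep runs of word characters and apostrophes together, and emit any
# other punctuation mark as its own token.
regex_tokens = re.findall(r"[\w']+|[^\w\s]", sent)
print(regex_tokens)
# ['That', "isn't", 'the', 'problem', ',', 'Bob', '.']
```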
3
List Comprehensions
• Compact way to process every item in a list.
[x for x in array]

dest = []
for x in array:
    dest.append(x)
4
Methods
• Functions and methods can be applied to the iteration variable, x.
• The values they return are stored in the resulting list.
[len(x) for x in array]

dest = []
for x in array:
    dest.append(len(x))
5
Conditionals
• Elements from the original list can be omitted from the resulting list, using conditional statements
[x for x in array if len(x) == 3]

dest = []
for x in array:
    if len(x) == 3:
        dest.append(x)
6
Building up
• These can be combined to build up complicated lists
[x.upper() for x in array if len(x) > 3 and x.startswith('t')]

dest = []
for x in array:
    if len(x) > 3 and x.startswith('t'):
        dest.append(x.upper())
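To check that the comprehension and the loop really build the same list, here is a small runnable sketch. The contents of array are made-up sample words, not data from the lecture.

```python
array = ["the", "three", "tiny", "cats", "toes"]  # hypothetical sample data

# Comprehension form: filter on the condition, then transform each kept item.
result = [x.upper() for x in array if len(x) > 3 and x.startswith("t")]

# Equivalent explicit loop.
dest = []
for x in array:
    if len(x) > 3 and x.startswith("t"):
        dest.append(x.upper())

print(result)  # ['THREE', 'TINY', 'TOES']
print(result == dest)  # True
```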
7
Lists Containing Lists
• Lists can contain lists
• [[a, 1], [b, 2], [d, 4]]
• ...or tuples
• [(a, 1), (b, 2), (d, 4)]
• [[d, d*d] for d in array if d < 4]
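A short sketch of the last bullet, with an assumed input list. Each element of the result is itself a two-item list (or tuple), which can be unpacked again when looping.

```python
array = [1, 2, 3, 4, 5]  # hypothetical input

# Each element of the result is a two-item list: [number, square].
square_lists = [[d, d * d] for d in array if d < 4]
print(square_lists)  # [[1, 1], [2, 4], [3, 9]]

# The same idea with tuples instead of inner lists.
square_tuples = [(d, d * d) for d in array if d < 4]

# Inner pairs can be unpacked directly in a for loop or comprehension.
squares_only = [sq for d, sq in square_tuples]
print(squares_only)  # [1, 4, 9]
```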
8
Using multiple lists
• Multiple lists can be processed in a single list comprehension. Each additional for clause is nested inside the previous one, so every x is combined with every y.
• [x*y for x in array1 for y in array2]
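A sketch with assumed sample lists, showing that two for clauses behave like nested loops and produce every combination. Pairing items position by position is a different operation and needs zip instead.

```python
array1 = [1, 2, 3]   # hypothetical sample data
array2 = [10, 100]

# Two for clauses: every x is multiplied by every y (3 * 2 = 6 results).
products = [x * y for x in array1 for y in array2]
print(products)  # [10, 100, 20, 200, 30, 300]

# Equivalent nested loops; the first for clause is the outer loop.
dest = []
for x in array1:
    for y in array2:
        dest.append(x * y)

# Position-by-position pairing uses zip rather than a second for clause.
pairwise = [x * y for x, y in zip(array1, [10, 100, 1000])]
print(pairwise)  # [10, 200, 3000]
```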
9
List Comprehension Exercises
Make a list of the first ten multiples of ten (10, 20, 30... 90, 100) using a list comprehension.
Make a list of the first ten cubes (1, 8, 27... 1000) using a list comprehension.
Store five names in a list. Make a second list that adds the phrase "is awesome!" to each name, using a list comprehension.
Write out the following code without using a list comprehension:
plus_thirteen = [number + 13 for number in range(1,11)]
Exercises from: http://introtopython.org/all_exercises_challenges.html#ex_ch_12
10
Lists within lists are often called 2-d arrays
• This is another way we store tables.
• Similar to nested dictionaries.
• a = [[0,1], [1,0]]
• a[1][1]
• a[0][0]
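A brief sketch of 2-d indexing on the list from the slide: the first index selects a row and the second selects an item within that row. The column extraction at the end is an extra illustration, not part of the slide.

```python
a = [[0, 1], [1, 0]]

# First index picks the row, second index picks the column within it.
print(a[1][1])  # 0
print(a[0][0])  # 0
print(a[0][1])  # 1

# A whole row is a single lookup; a column needs a comprehension.
first_row = a[0]
first_column = [row[0] for row in a]
print(first_row, first_column)  # [0, 1] [0, 1]
```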
11
Numpy & Arrays
• Numpy is a commonly used package for numerical calculations in Python.
• Its main object is a multidimensional array.
• A[1] List
• A[1][2] 'Rectangular' 2-d Matrix
• A[1][2][3] 'Cube/Prism' 3-d Matrix
• A[1][2][3][4] 4-d Matrix
• etc.
12
Numpy arrays
from numpy import *
a = array([1,2,3,4])
a = array([[1,2], [3,4]])

a.ndim     Number of dimensions
a.shape    Length of each dimension
a.size     Total number of elements
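A runnable sketch of the three attributes, assuming NumPy is installed. It uses the common `import numpy as np` convention rather than the star import shown on the slide. Note that a 2-d array is built from one nested list, not from two separate list arguments.

```python
import numpy as np

# One nested list gives a 2-d array with two rows and two columns.
a = np.array([[1, 2], [3, 4]])

print(a.ndim)   # 2       -> number of dimensions
print(a.shape)  # (2, 2)  -> length of each dimension
print(a.size)   # 4       -> total number of elements

# A flat list gives a 1-d array.
b = np.array([1, 2, 3, 4])
print(b.ndim, b.shape, b.size)  # 1 (4,) 4
```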
13
numpy array initialization
>>> zeros( (3,4) )
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
>>> ones( (2,3,4), dtype=int16 )
array([[[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]],
       [[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]]], dtype=int16)
>>> empty( (2,3) )
array([[ 3.73603959e-262, 6.02658058e-154, 6.55490914e-260],
       [ 5.30498948e-313, 3.14673309e-307, 1.00000000e+000]])
14
Content Types
• arrays are homogeneous (ndarray)
  – array([1, 3, 4], dtype=int16)
• lists are not homogeneous
  – ['abc', 123, [list1, list2]]
• dtype describes the "type" of object in the array
  – str, tuple, int, etc.
  – numpy.int16, numpy.int32, numpy.float64 etc.
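A small sketch of the homogeneity point, assuming NumPy is installed. Every element of an array shares one dtype, so mixing integers and floats makes NumPy convert all of them to a common type, while a plain list keeps each item's own type. The sample values are made up.

```python
import numpy as np

# An explicit dtype applies to every element of the array.
ints = np.array([1, 3, 4], dtype=np.int16)
print(ints.dtype)  # int16

# Mixed ints and floats are upcast to a single floating-point dtype.
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64

# A plain list keeps the original type of each item.
items = ["abc", 123, [1, 2]]
item_types = [type(item).__name__ for item in items]
print(item_types)  # ['str', 'int', 'list']
```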
15
zip
• Zip allows you to “zip” two lists together, creating a list of tuples
• names = ['Andrew', 'Beth', 'Charles']
• ages = [35, 34, 33]
• name_age = zip(names, ages)
  – [('Andrew', 35), ('Beth', 34), ('Charles', 33)]
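The slide shows Python 2 behaviour, where zip returns a list. A sketch of the same example for Python 3, where zip returns a lazy iterator that has to be wrapped in list() to see the tuples. The shorter ages list at the end is an added illustration.

```python
names = ["Andrew", "Beth", "Charles"]
ages = [35, 34, 33]

# In Python 3, wrap zip in list() to materialise the tuples.
name_age = list(zip(names, ages))
print(name_age)
# [('Andrew', 35), ('Beth', 34), ('Charles', 33)]

# zip stops at the end of the shortest input.
truncated = list(zip(names, [35, 34]))
print(truncated)  # [('Andrew', 35), ('Beth', 34)]
```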
16
foreach vs. indexed for loops
“More pythonic”
for n, a in zip(names, ages):
    print "%s -- %s" % (n, a)
vs.
for i in xrange(len(names)):
    print "%s -- %s" % (names[i], ages[i])
17
map
• map allows you to apply the same function to a list of objects.
a = ['1', '2', '4']
map(int, a)
18
map
Any function can be mapped over a list, but each element of the list needs to be a valid argument to that function.
def uppercase(s):
    return s.upper()

a = ['abc', 'def', 'ghi']
map(uppercase, a)
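A sketch of both map examples for Python 3, where map returns a lazy iterator and list() is needed to see the results. The failing case at the end is an added illustration of an element that is not a valid argument.

```python
def uppercase(s):
    return s.upper()

# map applies the function to each element; list() collects the results.
numbers = list(map(int, ["1", "2", "4"]))
print(numbers)  # [1, 2, 4]

shouted = list(map(uppercase, ["abc", "def", "ghi"]))
print(shouted)  # ['ABC', 'DEF', 'GHI']

# The same result written as a list comprehension.
shouted_lc = [uppercase(s) for s in ["abc", "def", "ghi"]]

# An element that the function cannot handle raises an error.
try:
    list(map(int, ["1", "two"]))
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```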
19
Functions as objects
• A function name can be assigned to a variable.
• map is an example of this, where the first argument to map is a function object.
a = [1, 3, 4]
len(a)
sum(a)
functions = [len, sum]
for fn in functions:
    print str(fn), fn(a)
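The slide's loop uses the Python 2 print statement. A Python 3 sketch of the same idea follows; the variable name count and the use of fn.__name__ for a readable label are additions for illustration.

```python
a = [1, 3, 4]

# A function can be bound to another name and called through it.
count = len
print(count(a))  # 3

# Functions can be stored in a list and called in a loop.
functions = [len, sum]
results = [(fn.__name__, fn(a)) for fn in functions]
for name, value in results:
    print(name, value)
# len 3
# sum 8
```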
20
lambda
• Lambda functions are single use functions that do not need to be ‘def’ed.
• Using the uppercase example again:
def uppercase(s):
    return s.upper()

a = ['abc', 'def', 'ghi']
map(uppercase, a)
21
lambda
• Lambda functions are single use functions that do not need to be ‘def’ed.
• These are "anonymous" functions
• Using the uppercase example again:
a = ['abc', 'def', 'ghi']
map(lambda s : s.upper(), a)
By design, the body of a lambda is a single expression; it cannot contain statements.
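A Python 3 sketch comparing the named function with the lambda. The sorted call is an extra illustration of a common place where a throwaway function is handy, and its word list is made up.

```python
def uppercase(s):
    return s.upper()

a = ["abc", "def", "ghi"]

# Named function and anonymous function give the same result.
with_def = list(map(uppercase, a))
with_lambda = list(map(lambda s: s.upper(), a))
print(with_lambda)  # ['ABC', 'DEF', 'GHI']

# Lambdas are often passed as a one-off key function, e.g. sort by length.
words = ["parse", "tag", "chunk"]  # hypothetical sample data
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['tag', 'parse', 'chunk']
```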
22
Aside: Glob
• Construct a list of all filenames matching a pattern.
from glob import glob
glob('*.txt')
glob('/Users/andrew/Documents/*/*.ppt')
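A self-contained sketch of glob that runs anywhere: it creates a few empty files in a temporary directory (the file names are made up) and then matches only the .txt ones. The result is sorted because glob does not promise any particular order.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty files with hypothetical names.
    for name in ["a.txt", "b.txt", "notes.ppt"]:
        open(os.path.join(tmp, name), "w").close()

    # glob returns the full paths that match the wildcard pattern.
    matches = glob(os.path.join(tmp, "*.txt"))
    txt_files = sorted(os.path.basename(path) for path in matches)

print(txt_files)  # ['a.txt', 'b.txt']
```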
23
Linguistic Annotation
• Text only takes us so far.
• People are reliable judges of linguistic behavior.
• We can model with machines, but for "gold-standard" truth, we ask people to make judgments about linguistic qualities.
24
Example Linguistic Annotations
• Sentence Boundaries
• Part of Speech Tags
• Phonetic Transcription
• Syntactic parse trees
• Speaker Identity
• Semantic Role
• Speech Act
• Document Topic
• Argument structure
• Word Sense
• many many many more
25
We need…
• Techniques to process these.
• Every corpus has its own format for linguistic annotation.
• so…we need to parse annotation formats.
26
Constructing a linguistic corpus
• Decisions that need to be made:
  – Why are you doing this?
  – What material will be collected?
  – How will it be collected?
    • Automatically?
    • Manually?
    • Found material vs. laboratory language?
  – What meta information will be stored?
  – What manual annotations are required?
    • How will each annotation be defined?
    • How many annotators will be used?
    • How will agreement be assessed?
    • How will disagreements be resolved?
  – How will the material be disseminated?
    • Is this covered by your IRB if the material is the result of a human subject protocol?
27
Part of Speech Tagging
• Task: Given a string of words, identify the parts of speech for each word.
28
Part of Speech tagging
• Surface level syntax.
• Primary operation
• Parsing
• Word Sense Disambiguation
• Semantic Role labeling
• Segmentation
• Discourse, Topic, Sentence
29
How is it done?
• Learn from Data.
• Annotated Data:
• Unlabeled Data:
30
Learn the association from Tag to Word
31
Limitations
• Unseen tokens• Uncommon interpretations• Long term dependencies
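One simple way to learn an association between words and tags from annotated data is to count how often each word carries each tag and always pick the most frequent one. The sketch below is such a baseline, not necessarily the method the lecture has in mind. The three training sentences and the fallback tag NN for unseen words are assumptions made for the example. It also shows two of the limitations listed above: an unseen token gets a guess, and an uncommon reading ("fast" as a noun) is tagged with the common one.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training text in word/TAG format.
training = [
    "the/DET dog/NN is/VB fast/JJ ./.",
    "the/DET fast/NN is/VB long/JJ ./.",
    "the/DET fast/JJ dog/NN is/VB fast/JJ ./.",
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for sentence in training:
    for token in sentence.split():
        word, tag = token.rsplit("/", 1)
        tag_counts[word.lower()][tag] += 1

def tag_word(word, default="NN"):
    """Return the most frequent tag seen for word, or default if unseen."""
    counts = tag_counts.get(word.lower())
    if not counts:
        return default  # unseen token: fall back to an assumed default tag
    return counts.most_common(1)[0][0]

print(tag_word("fast"))  # JJ (seen 3 times as JJ, once as NN)
print(tag_word("cat"))   # NN (never seen, so the default is used)

# "fast" is a noun here, but the baseline still chooses the common reading.
tags = [tag_word(w) for w in "the fast is long .".split()]
print(tags)  # ['DET', 'JJ', 'VB', 'JJ', '.']
```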
32
Format conversion exercise
The/DET Dog/NN is/VB fast/JJ ./.

<word ortho="The" pos="DET"></word>
<word ortho="Dog" pos="NN"></word>
<word ortho="is" pos="VB"></word>
<word ortho="fast" pos="JJ"></word>
<word ortho="." pos="."></word>

The dog is fast.
1, 3, DET
5, 7, NN
9, 10, VB
12, 15, JJ
16, 16, .
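One possible approach to the exercise, sketched in Python 3: parse the slash-delimited format into (word, tag) pairs, then emit the XML form and the standoff form. The variable names are my own, and the standoff step assumes 1-based, inclusive character offsets into the raw text, which is what the numbers on the slide appear to use. Words are located case-insensitively because the tagged line has "Dog" while the raw text has "dog".

```python
from xml.sax.saxutils import quoteattr

tagged = "The/DET Dog/NN is/VB fast/JJ ./."
raw = "The dog is fast."

# Split on the LAST slash so that the token "./." yields (".", ".").
pairs = [tuple(token.rsplit("/", 1)) for token in tagged.split()]

# XML format: quoteattr adds the quotes and escapes special characters.
xml_lines = [
    "<word ortho=%s pos=%s></word>" % (quoteattr(word), quoteattr(tag))
    for word, tag in pairs
]

# Standoff format: (start, end, tag) with 1-based inclusive offsets.
standoff = []
position = 0
for word, tag in pairs:
    start = raw.lower().find(word.lower(), position)  # 0-based start
    end = start + len(word)                           # 0-based, exclusive
    standoff.append((start + 1, end, tag))
    position = end

print("\n".join(xml_lines))
for start, end, tag in standoff:
    print("%d, %d, %s" % (start, end, tag))
```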
33
Parsing
• Generate a parse tree.
34
Parsing
• Generate a Parse Tree from:
  • The surface form (words) of the text
  • Part of Speech Tokens
35
Parsing Styles
36
Parsing styles
37
Context Free Grammars for Parsing
• S → VP
• S → NP VP
• NP → Det Nom
• Nom → Noun
• Nom → Adj Nom
• VP → Verb Nom
• Det → "A", "The"
• Noun → "I", "John", "Address"
• Verb → "Gave"
• Adj → "My", "Blue"
• Adv → "Quickly"
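To make the grammar concrete, here is a minimal sketch of a recognizer that checks whether a word sequence can be derived from S. It is a naive top-down search written for this small grammar, not a general or efficient parser, and it only terminates because none of the rules is left-recursive. The dictionary layout, the function name derives, and the lowercasing of words are choices made for the example.

```python
# Phrase-structure rules from the slide: left-hand side -> list of expansions.
RULES = {
    "S": [["VP"], ["NP", "VP"]],
    "NP": [["Det", "Nom"]],
    "Nom": [["Noun"], ["Adj", "Nom"]],
    "VP": [["Verb", "Nom"]],
}

# Lexical rules from the slide, lowercased for matching.
LEXICON = {
    "Det": {"a", "the"},
    "Noun": {"i", "john", "address"},
    "Verb": {"gave"},
    "Adj": {"my", "blue"},
    "Adv": {"quickly"},
}

def derives(symbols, words):
    """Return True if the symbol sequence can expand to exactly these words."""
    if not symbols:
        return not words  # success only if every word has been consumed
    first, rest = symbols[0], symbols[1:]
    if first in LEXICON:
        # A part-of-speech symbol must match the next word.
        return (bool(words) and words[0] in LEXICON[first]
                and derives(rest, words[1:]))
    # Otherwise try each expansion of the non-terminal in turn.
    return any(derives(expansion + rest, words) for expansion in RULES[first])

def grammatical(sentence):
    return derives(["S"], sentence.lower().split())

print(grammatical("Gave my address"))                   # True  (S -> VP)
print(grammatical("The blue address gave my address"))  # True  (S -> NP VP)
print(grammatical("John gave my address"))              # False (NP needs a Det)
```

The last call shows how rigid a hand-built grammar can be: a sentence that sounds fine is rejected because no rule lets an NP start with a noun.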
38
Limitations
• The grammar must be built by hand.
• Can't handle ungrammatical sentences.
• Can't resolve ambiguity.
39
Probabilistic Parsing
• Assign each transition a probability
• Find the parse with the greatest "likelihood"
• Build a table and count
  – How many times does each transition happen
• Structured learning.
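The "build a table and count" step can be sketched as relative-frequency estimation: count how often each rule is used, then divide by how often its left-hand side is expanded at all. The list of observed rule applications below is invented for illustration, as are the function and variable names. The scoring function multiplies only the rules passed to it; a full probabilistic parser would include every rule in the tree, lexical rules too.

```python
from collections import Counter

# Hypothetical rule applications read off a tiny hand-parsed corpus.
observed = [
    ("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("NP", "VP")),
    ("S", ("VP",)),
    ("Nom", ("Noun",)), ("Nom", ("Noun",)), ("Nom", ("Noun",)),
    ("Nom", ("Adj", "Nom")),
]

# Table of counts: how often each rule, and each left-hand side, occurs.
rule_counts = Counter(observed)
lhs_counts = Counter(lhs for lhs, _ in observed)

# P(rule) = count(rule) / count(all rules with the same left-hand side).
prob = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def parse_probability(rules_used):
    """Multiply the probabilities of the rules used in one candidate parse."""
    p = 1.0
    for rule in rules_used:
        p *= prob.get(rule, 0.0)  # a rule never observed gets probability 0
    return p

parse_a = [("S", ("NP", "VP")), ("Nom", ("Noun",)), ("Nom", ("Noun",))]
parse_b = [("S", ("VP",)), ("Nom", ("Adj", "Nom")), ("Nom", ("Noun",))]

print(prob[("S", ("NP", "VP"))])   # 0.75
print(parse_probability(parse_a))  # 0.421875
print(parse_probability(parse_b))  # 0.046875
```

With probabilities attached, ambiguity is resolved by preferring the candidate with the larger product, here parse_a.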
40
Segmentation
• Sentence Segmentation
• Topic Segmentation
• Speaker Segmentation
• Phrase Chunking
  – NP, VP, PP, SubClause, etc.