Lecture 5: Annotating Things & List Comprehensions

Methods in Computational Linguistics II

Queens College

Lecture 5: List Comprehensions

Split into words

• sent = "That isn't the problem, Bob."
• sent.split()
• vs.
• nltk.word_tokenize(sent)
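
To see why the choice matters, note that `str.split()` cuts only at whitespace, so punctuation stays attached to the neighbouring word. The sketch below contrasts it with a small regular-expression tokenizer. `simple_tokenize` is a hypothetical helper written for this illustration and is not NLTK's algorithm; `nltk.word_tokenize` typically goes further and also separates clitics such as "n't".

```python
import re

sent = "That isn't the problem, Bob."

# Whitespace splitting keeps punctuation attached to words.
whitespace_tokens = sent.split()
print(whitespace_tokens)   # ['That', "isn't", 'the', 'problem,', 'Bob.']

def simple_tokenize(text):
    """Hypothetical helper: a word (optionally with one internal apostrophe)
    or a single punctuation mark becomes one token."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

regex_tokens = simple_tokenize(sent)
print(regex_tokens)        # ['That', "isn't", 'the', 'problem', ',', 'Bob', '.']
```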

List Comprehensions

• Compact way to process every item in a list.

[x for x in array]

dest = []
for x in array:
    dest.append(x)

Methods

• Functions and methods can be applied to the iterating variable, x.

• Their value is stored in the resulting list.

[len(x) for x in array]

dest = []
for x in array:
    dest.append(len(x))

Conditionals

• Elements from the original list can be omitted from the resulting list, using conditional statements

[x for x in array if len(x) == 3]

dest = []
for x in array:
    if len(x) == 3:
        dest.append(x)

Building up

• These can be combined to build up complicated lists

[x.upper() for x in array if len(x) > 3 and x.startswith('t')]

dest = []
for x in array:
    if len(x) > 3 and x.startswith('t'):
        dest.append(x.upper())
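
A comprehension has three parts: a transformation, a source list, and an optional filter. A minimal sketch on a made-up word list shows that the loop and the comprehension produce the same result (the contents of `array` are invented for the illustration):

```python
array = ["the", "tiger", "ate", "ten", "tomatoes", "Today"]

# Loop version: test each item, then transform the ones that pass.
dest = []
for x in array:
    if len(x) > 3 and x.startswith("t"):
        dest.append(x.upper())

# Comprehension version: transformation, source, and filter in one expression.
result = [x.upper() for x in array if len(x) > 3 and x.startswith("t")]

print(result)           # ['TIGER', 'TOMATOES']
print(result == dest)   # True
```

Note that "Today" is dropped because `startswith` is case-sensitive.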

Lists Containing Lists

• Lists can contain lists
• [[a, 1], [b, 2], [d, 4]]
• ...or tuples
• [(a, 1), (b, 2), (d, 4)]
• [[d, d*d] for d in array if d < 4]
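
As a concrete run of the last bullet, here is a short sketch with an invented `array` of small integers, building both a list of lists and a list of tuples:

```python
array = [1, 2, 3, 4, 5]

# Each element of the result is itself a two-item list: [d, d squared].
pairs_as_lists = [[d, d * d] for d in array if d < 4]
print(pairs_as_lists)    # [[1, 1], [2, 4], [3, 9]]

# The same idea with tuples; the parentheses are required here.
pairs_as_tuples = [(d, d * d) for d in array if d < 4]
print(pairs_as_tuples)   # [(1, 1), (2, 4), (3, 9)]

# Inner items are reached with a second index.
print(pairs_as_lists[2][1])   # 9
```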

Using multiple lists

• Multiple lists can be processed in a single list comprehension. Each extra for clause nests inside the previous one, so every x is combined with every y.

• [x*y for x in array1 for y in array2]
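
The order of the for clauses matters: the first is the outer loop and the second is the inner loop. A small sketch with invented values for `array1` and `array2`, including the `zip` variant for walking two lists in step:

```python
array1 = [1, 2, 3]
array2 = [10, 100]

# Two for clauses: every x is paired with every y.
products = [x * y for x in array1 for y in array2]
print(products)   # [10, 100, 20, 200, 30, 300]

# Equivalent nested loops.
dest = []
for x in array1:
    for y in array2:
        dest.append(x * y)

# To walk two lists in step instead, pair the items up with zip.
in_step = [x * y for x, y in zip(array1, array2)]
print(in_step)    # [10, 200]
```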

List Comprehension Exercises

Make a list of the first ten multiples of ten (10, 20, 30... 90, 100) using a list comprehension.

Make a list of the first ten cubes (1, 8, 27... 1000) using a list comprehension.

Store five names in a list. Make a second list that adds the phrase "is awesome!" to each name, using a list comprehension.

Write out the following code without using a list comprehension:

plus_thirteen = [number + 13 for number in range(1,11)]

Exercises from: http://introtopython.org/all_exercises_challenges.html#ex_ch_12

Lists within lists are often called 2-d arrays

• This is another way we store tables.

• Similar to nested dictionaries.
• a = [[0,1], [1,0]]
• a[1][1]
• a[0][0]
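
The first index picks a row and the second picks a column within that row. A minimal sketch, where the 3 x 4 `table` is an invented example:

```python
# A 2 x 2 table stored as a list of rows.
a = [[0, 1],
     [1, 0]]

print(a[1][1])   # 0  (row 1, column 1)
print(a[0][1])   # 1  (row 0, column 1)

# Build a 3 x 4 table of zeros with a nested comprehension.
rows, cols = 3, 4
table = [[0 for _ in range(cols)] for _ in range(rows)]
table[2][3] = 7          # only the last cell of the last row changes
print(table)             # [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 7]]

# Read one column by taking the same index from every row.
last_column = [row[3] for row in table]
print(last_column)       # [0, 0, 7]
```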

Numpy & Arrays

• Numpy is a commonly used package for numerical calculations in Python.

• Its main object is a multidimensional array.

• A[1]            List
• A[1][2]         ‘Rectangular’ 2-d Matrix
• A[1][2][3]      ‘Cube/Prism’ 3-d Matrix
• A[1][2][3][4]   4-d Matrix
• etc.

Numpy arrays

from numpy import *
a = array([1, 2, 3, 4])
a = array([[1, 2], [3, 4]])   # a 2-d array takes one nested list, not two arguments

a.ndim     # Number of dimensions
a.shape    # Length of each dimension
a.size     # Total number of elements
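
A short sketch of what these three attributes report for a small 2 x 3 array. It assumes NumPy is installed and uses the common `import numpy as np` convention rather than the star import:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # 2       two dimensions: rows and columns
print(a.shape)   # (2, 3)  2 rows, 3 columns
print(a.size)    # 6       2 * 3 elements in total

# Indexing works like nested lists; a[row, col] is the NumPy shorthand.
print(a[1][2])   # 6
print(a[1, 2])   # 6
```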

numpy array initialization

>>> zeros( (3,4) )
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
>>> ones( (2,3,4), dtype=int16 )
array([[[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]],
       [[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]]], dtype=int16)
>>> empty( (2,3) )    # uninitialized memory: the values shown are arbitrary
array([[ 3.73603959e-262, 6.02658058e-154, 6.55490914e-260],
       [ 5.30498948e-313, 3.14673309e-307, 1.00000000e+000]])

Content Types

• Arrays are homogeneous (ndarray)
  – array([1, 3, 4], dtype=int16)

• Lists are not homogeneous
  – ['abc', 123, [list1, list2]]

• dtype describes the “type” of object in the array
  – str, tuple, int, etc.
  – numpy.int16, numpy.int32, numpy.float64, etc.
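
One consequence of homogeneity is that NumPy converts mixed numeric input to a single common type, while a list keeps each item's own type. A minimal sketch, assuming NumPy is installed:

```python
import numpy as np

ints = np.array([1, 3, 4], dtype=np.int16)
print(ints.dtype)        # int16

# Mixing ints and floats still yields one dtype: everything becomes float64.
mixed = np.array([1, 2.5, 4])
print(mixed.dtype)       # float64

# A plain list keeps each item's own type.
items = ["abc", 123, [1, 2]]
item_types = [type(item).__name__ for item in items]
print(item_types)        # ['str', 'int', 'list']
```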

zip

• zip allows you to “zip” two lists together, pairing items by position into tuples. In Python 3 it returns an iterator, so wrap it in list() to get a list of tuples.

• names = ['Andrew', 'Beth', 'Charles']
• ages = [35, 34, 33]
• name_age = list(zip(names, ages))

– [('Andrew', 35), ('Beth', 34), ('Charles', 33)]
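
Two details of `zip` are worth seeing in action: it stops at the end of the shorter input, and a list of pairs converts directly into a dictionary. A short Python 3 sketch using the same names and ages:

```python
names = ["Andrew", "Beth", "Charles"]
ages = [35, 34, 33]

# In Python 3, zip returns a lazy iterator; list() materializes the pairs.
name_age = list(zip(names, ages))
print(name_age)   # [('Andrew', 35), ('Beth', 34), ('Charles', 33)]

# zip stops at the end of the shorter input.
short = list(zip(names, [35, 34]))
print(short)      # [('Andrew', 35), ('Beth', 34)]

# A list of pairs converts directly into a dictionary.
age_of = dict(name_age)
print(age_of["Beth"])   # 34
```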

foreach vs. indexed for loops

“More pythonic”

for n, a in zip(names, ages):
    print("%s -- %s" % (n, a))

vs.

for i in range(len(names)):
    print("%s -- %s" % (names[i], ages[i]))

map

• map allows you to apply the same function to every item in a list.

a = ['1', '2', '4']
list(map(int, a))   # [1, 2, 4]; in Python 3, map returns an iterator

map

Any function can be ‘map’ed over a list, but each element of the list must be a valid argument to that function.

def uppercase(s):
    return s.upper()

a = ['abc', 'def', 'ghi']
list(map(uppercase, a))
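
A Python 3 sketch of the same example, showing the equivalent list comprehension and what happens when an element is not a valid argument. The list containing 'x' is invented to trigger the failure:

```python
def uppercase(s):
    return s.upper()

a = ["abc", "def", "ghi"]

# In Python 3, map returns a lazy iterator, so wrap it in list() to see the values.
mapped = list(map(uppercase, a))
print(mapped)                   # ['ABC', 'DEF', 'GHI']

# The same result as a list comprehension.
comprehended = [uppercase(s) for s in a]
print(comprehended == mapped)   # True

# Each element must be a valid argument: int() cannot parse 'x'.
failed = False
try:
    list(map(int, ["1", "2", "x"]))
except ValueError as err:
    failed = True
    print("map failed:", err)
```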

Functions as objects

• A function name can be assigned to a variable.
• map is an example of this, where the first argument to map is a function object.

a = [1, 3, 4]
len(a)
sum(a)
functions = [len, sum]
for fn in functions:
    print(str(fn), fn(a))
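
Because functions are ordinary values, they can also be stored in a dictionary and looked up by name. A small sketch, where the `summaries` table and its keys are invented for the illustration:

```python
a = [1, 3, 4]

# Functions stored under names, then looked up and called like any other value.
summaries = {"count": len, "total": sum, "largest": max}

report = {name: fn(a) for name, fn in summaries.items()}
print(report)       # {'count': 3, 'total': 8, 'largest': 4}

# A function can also be bound to a new name.
size_of = len
print(size_of(a))   # 3
```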

lambda

• Lambda functions are single use functions that do not need to be ‘def’ed.

• Using the uppercase example again:

def uppercase(s):
    return s.upper()

a = ['abc', 'def', 'ghi']
list(map(uppercase, a))

lambda

• These are “anonymous” functions
• Using the uppercase example again:

a = ['abc', 'def', 'ghi']
list(map(lambda s: s.upper(), a))

By design, the body of a lambda is limited to a single expression
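
Besides `map`, a common place for a throwaway function is the `key` argument of `sorted`. A brief sketch with an invented word list:

```python
words = ["tiger", "Bob", "ant", "zebra", "ox"]

# lambda as a one-off key function: sort by length (ties keep their original order).
by_length = sorted(words, key=lambda w: len(w))
print(by_length)   # ['ox', 'Bob', 'ant', 'tiger', 'zebra']

# Case-insensitive alphabetical order.
by_alpha = sorted(words, key=lambda w: w.lower())
print(by_alpha)    # ['ant', 'Bob', 'ox', 'tiger', 'zebra']

# The lambda body is one expression; its value is what the function returns.
shouted = list(map(lambda s: s.upper() + "!", ["abc", "def"]))
print(shouted)     # ['ABC!', 'DEF!']
```

Anything that needs statements (assignments, loops, several steps) is better written with def.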

Aside: Glob

• Construct a list of all filenames matching a pattern.

from glob import glob

glob('*.txt')
glob('/Users/andrew/Documents/*/*.ppt')
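
To try the patterns without depending on any particular directory layout, the sketch below creates a temporary directory with a few empty files (all file names are invented) and globs inside it:

```python
import os
import tempfile
from glob import glob

# Create a scratch directory with a few files so the patterns have something to match.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "notes"))
    for name in ["a.txt", "b.txt", "c.ppt", os.path.join("notes", "d.txt")]:
        open(os.path.join(root, name), "w").close()

    # '*' matches within one directory level; the order of results is not guaranteed, so sort.
    top_level = sorted(os.path.basename(p) for p in glob(os.path.join(root, "*.txt")))
    print(top_level)   # ['a.txt', 'b.txt']

    # '*/*.txt' looks exactly one directory down.
    one_down = [os.path.relpath(p, root) for p in glob(os.path.join(root, "*", "*.txt"))]
    print(one_down)    # one match: d.txt inside notes
```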

Linguistic Annotation

• Text only takes us so far.
• People are reliable judges of linguistic behavior.
• We can model with machines, but for “gold-standard” truth, we ask people to make judgments about linguistic qualities.

Example Linguistic Annotations

• Sentence Boundaries
• Part of Speech Tags
• Phonetic Transcription
• Syntactic parse trees
• Speaker Identity
• Semantic Role
• Speech Act
• Document Topic
• Argument structure
• Word Sense
• many many many more

We need…

• Techniques to process these.

• Every corpus has its own format for linguistic annotation.

• so…we need to parse annotation formats.

Constructing a linguistic corpus

• Decisions that need to be made:
  – Why are you doing this?
  – What material will be collected?
  – How will it be collected?
    • Automatically?
    • Manually?
    • Found material vs. laboratory language?
  – What meta information will be stored?
  – What manual annotations are required?
    • How will each annotation be defined?
    • How many annotators will be used?
    • How will agreement be assessed?
    • How will disagreements be resolved?
  – How will the material be disseminated?
    • Is this covered by your IRB if the material is the result of a human subject protocol?

Part of Speech Tagging

• Task: Given a string of words, identify the part of speech of each word.

Part of Speech tagging

• Surface level syntax.
• Primary operation
  – Parsing
  – Word Sense Disambiguation
  – Semantic Role labeling
  – Segmentation
    • Discourse, Topic, Sentence

How is it done?

• Learn from Data.
• Annotated Data:

• Unlabeled Data:

Learn the association from Tag to Word
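
One very simple way to learn such an association from annotated data is a most-frequent-tag baseline: count how often each tag occurs with each word, then always predict the commonest one. This is a sketch of that baseline only, not necessarily the model the lecture has in mind. The toy `tagged` data, the `tag_word` helper, and the "NN" fallback for unseen words are all assumptions made for the illustration.

```python
from collections import Counter, defaultdict

# Toy annotated data (made up for this sketch): (word, tag) pairs.
tagged = [
    ("the", "DET"), ("dog", "NN"), ("is", "VB"), ("fast", "JJ"),
    ("the", "DET"), ("fast", "NN"), ("is", "VB"), ("over", "JJ"),
    ("dogs", "NN"), ("fast", "JJ"),
]

# Count how often each tag was seen with each word.
tag_counts = defaultdict(Counter)
for word, tag in tagged:
    tag_counts[word][tag] += 1

def tag_word(word, default="NN"):
    """Return the tag seen most often with `word`, or `default` if the word is unseen."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default

sentence = ["the", "dog", "is", "fast"]
tagged_sentence = [(w, tag_word(w)) for w in sentence]
print(tagged_sentence)    # [('the', 'DET'), ('dog', 'NN'), ('is', 'VB'), ('fast', 'JJ')]
print(tag_word("cat"))    # NN  (unseen word, so the default is used)
```

The baseline always tags "fast" as JJ, even in "the fast is over", because it ignores context.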

Limitations

• Unseen tokens
• Uncommon interpretations
• Long term dependencies

Format conversion exercise

The/DET Dog/NN is/VB fast/JJ ./.

<word ortho="The" pos="DET"></word>
<word ortho="Dog" pos="NN"></word>
<word ortho="is" pos="VB"></word>
<word ortho="fast" pos="JJ"></word>
<word ortho="." pos="."></word>

The dog is fast.

1, 3, DET
5, 7, NN
9, 10, VB
12, 15, JJ
16, 16, .
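
One possible approach to the exercise, sketched under a few assumptions: the standoff offsets are 1-based and inclusive (which matches the numbers shown), tokens are located in the plain text left to right, and matching ignores case because the tagged line has "Dog" where the text has "dog". The helper names `parse_slash_tags`, `to_xml`, and `to_standoff` are hypothetical.

```python
from xml.sax.saxutils import quoteattr

def parse_slash_tags(line):
    """Split 'word/TAG' tokens; rsplit keeps any '/' inside the word itself."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]

def to_xml(pairs):
    """One <word> element per token, with attribute values escaped and quoted."""
    return [f"<word ortho={quoteattr(w)} pos={quoteattr(t)}></word>" for w, t in pairs]

def to_standoff(pairs, text):
    """1-based, inclusive character offsets of each token in `text`."""
    spans, pos = [], 0
    lowered = text.lower()   # case-insensitive match: 'Dog' vs. 'dog'
    for word, tag in pairs:
        start = lowered.find(word.lower(), pos)
        if start == -1:
            raise ValueError(f"token {word!r} not found in text")
        pos = start + len(word)
        spans.append((start + 1, pos, tag))
    return spans

pairs = parse_slash_tags("The/DET Dog/NN is/VB fast/JJ ./.")
xml_lines = to_xml(pairs)
spans = to_standoff(pairs, "The dog is fast.")

for element in xml_lines:
    print(element)
for start, end, tag in spans:
    print(f"{start}, {end}, {tag}")
```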

Parsing

• Generate a parse tree.

Parsing

• Generate a Parse Tree from:
  – The surface form (words) of the text
  – Part of Speech Tokens

Parsing Styles

Parsing styles

Context Free Grammars for Parsing

• S → VP
• S → NP VP
• NP → Det Nom
• Nom → Noun
• Nom → Adj Nom
• VP → Verb Nom
• Det → “A”, “The”
• Noun → “I”, “John”, “Address”
• Verb → “Gave”
• Adj → “My”, “Blue”
• Adv → “Quickly”
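
The grammar above can be written down as two dictionaries and checked with a small top-down recognizer. This is a minimal sketch, not a full parser: it only answers whether a sentence is covered, it lowercases the input, and it relies on this grammar having no left-recursive rules. The names `GRAMMAR`, `LEXICON`, `parse_ends`, and `recognizes` are invented for the illustration.

```python
# Phrase-structure rules: each left-hand side maps to its possible right-hand sides.
GRAMMAR = {
    "S": [["VP"], ["NP", "VP"]],
    "NP": [["Det", "Nom"]],
    "Nom": [["Noun"], ["Adj", "Nom"]],
    "VP": [["Verb", "Nom"]],
}

# Lexical rules: each part of speech maps to its words (lowercased).
LEXICON = {
    "Det": {"a", "the"},
    "Noun": {"i", "john", "address"},
    "Verb": {"gave"},
    "Adj": {"my", "blue"},
    "Adv": {"quickly"},
}

def parse_ends(symbol, tokens, start):
    """Return every position where `symbol` can end if it starts at `start`."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            return {start + 1}
        return set()
    ends = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for child in rhs:   # match the children left to right
            positions = {e for p in positions for e in parse_ends(child, tokens, p)}
        ends |= positions
    return ends

def recognizes(sentence):
    """True if the whole sentence can be derived from S."""
    tokens = sentence.lower().split()
    return len(tokens) in parse_ends("S", tokens, 0)

print(recognizes("Gave my blue address"))      # True:  S → VP
print(recognizes("A blue address gave John"))  # True:  S → NP VP
print(recognizes("John gave my address"))      # False: NP requires a Det
```

The last call shows how brittle a hand-built grammar is: a natural sentence is rejected because no rule lets an NP start with a bare noun.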

Limitations

• The grammar must be built by hand.
• Can’t handle ungrammatical sentences.
• Can’t resolve ambiguity.

Probabilistic Parsing

• Assign each transition a probability
• Find the parse with the greatest “likelihood”

• Build a table and count
  – How many times does each transition happen

• Structured learning.
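
The "build a table and count" step can be sketched as relative-frequency estimation: divide the count of each rule by the total count of its left-hand side, then score a parse as the product of the probabilities of the rules it uses. The rule counts below are invented toy data, and `parse_probability` is a hypothetical helper; a real system would read the rules off a treebank.

```python
from collections import Counter, defaultdict

# Rules read off a handful of hand-parsed sentences (toy counts, made up here).
observed_rules = [
    ("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("VP",)),
    ("Nom", ("Noun",)), ("Nom", ("Noun",)), ("Nom", ("Noun",)), ("Nom", ("Adj", "Nom")),
]

# Build the table: how many times does each left-hand side expand each way?
counts = defaultdict(Counter)
for lhs, rhs in observed_rules:
    counts[lhs][rhs] += 1

# Relative frequency: count of the rule divided by the count of its left-hand side.
probs = {
    lhs: {rhs: n / sum(expansions.values()) for rhs, n in expansions.items()}
    for lhs, expansions in counts.items()
}

def parse_probability(rules_used):
    """Score one parse as the product of the probabilities of the rules it uses."""
    p = 1.0
    for lhs, rhs in rules_used:
        p *= probs[lhs][rhs]
    return p

print(probs["S"][("NP", "VP")])   # 0.75
score = parse_probability([("S", ("VP",)), ("Nom", ("Adj", "Nom")), ("Nom", ("Noun",))])
print(score)                      # 0.046875  (0.25 * 0.25 * 0.75)
```

When a sentence has several parses, the one with the highest score is chosen.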

Segmentation

• Sentence Segmentation

• Topic Segmentation

• Speaker Segmentation

• Phrase Chunking
  – NP, VP, PP, SubClause, etc.