a taste of python - devdays toronto 2009

A Taste of Python. Presented by Jordan Baker, October 23, 2009, DevDays Toronto

DESCRIPTION

Explores Peter Norvig's spell corrector written in Python as an example of the language's elegance and readability

TRANSCRIPT

Page 1: A Taste of Python - Devdays Toronto 2009

A Taste of Python

Presented by Jordan Baker
October 23, 2009
DevDays Toronto

Page 2: A Taste of Python - Devdays Toronto 2009

About Me

• Open Source Developer

• Founder of Open Source Web Application and CMS service provider: Scryent - www.scryent.com

• Founder of Toronto Plone Users Group - www.torontoplone.ca

Page 3: A Taste of Python - Devdays Toronto 2009

Agenda

• About Python

• Show me your CODE

• A Spell Checker in 21 lines of code

• Why Python ROCKS

• Resources for further exploration

Page 4: A Taste of Python - Devdays Toronto 2009

About Python

http://www.flickr.com/photos/schoffer/196079076/

Page 5: A Taste of Python - Devdays Toronto 2009

About Python

• Gotta love a language named after Monty Python’s Flying Circus

• Used in more places than you might know

Page 6: A Taste of Python - Devdays Toronto 2009

Significant Whitespace

C-like

if(x == 2) {
    do_something();
}
do_something_else();

Python

if x == 2:
    do_something()
do_something_else()

Page 7: A Taste of Python - Devdays Toronto 2009

Significant Whitespace

• less code clutter

• eliminates many common syntax errors

• enforces proper code layout

• use an indentation-aware editor or IDE

• Get over it!

Page 8: A Taste of Python - Devdays Toronto 2009

Python is Interactive

Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Page 9: A Taste of Python - Devdays Toronto 2009

FIZZ BUZZ

1
2
FIZZ
4
BUZZ
...
14
FIZZ BUZZ

Page 10: A Taste of Python - Devdays Toronto 2009

FIZZ BUZZ

def fizzbuzz(n):
    # Start at 1 so the output begins 1, 2, Fizz rather than with a line for 0.
    for i in range(1, n + 1):
        if not i % 3:
            print "Fizz",
        if not i % 5:
            print "Buzz",
        if i % 3 and i % 5:
            print i,
        print

fizzbuzz(50)
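The slide's code relies on the Python 2 print statement, where a trailing comma suppresses the newline. As a rough Python 3 sketch of the same logic, the version below builds the lines as strings instead of printing piece by piece. The name `fizzbuzz_lines` is made up for this example and is not from the talk.

```python
def fizzbuzz_lines(n):
    """Return the FizzBuzz lines for 1..n as a list of strings."""
    lines = []
    for i in range(1, n + 1):
        parts = []
        if i % 3 == 0:
            parts.append("Fizz")
        if i % 5 == 0:
            parts.append("Buzz")
        # Fall back to the number itself when neither rule applies.
        lines.append(" ".join(parts) or str(i))
    return lines

print("\n".join(fizzbuzz_lines(15)))
```

Returning a list keeps the rule logic separate from the output, which also makes it easy to check.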

Page 12: A Taste of Python - Devdays Toronto 2009

FIZZ BUZZ (OO)

class FizzBuzzWriter(object):
    def __init__(self, limit):
        self.limit = limit

    def run(self):
        for n in range(1, self.limit + 1):
            self.write_number(n)

    def write_number(self, n):
        if not n % 3:
            print "Fizz",
        if not n % 5:
            print "Buzz",
        if n % 3 and n % 5:
            print n,
        print

fizzbuzz = FizzBuzzWriter(50)
fizzbuzz.run()

Page 13: A Taste of Python - Devdays Toronto 2009

A Spell Checker in 21 Lines of Code

• Written by Peter Norvig

• Duplicated in many languages

• Simple Spellchecking algorithm based on probability

• http://norvig.com/spell-correct.html

Page 14: A Taste of Python - Devdays Toronto 2009

The Approach

• Census by frequency

• Morph the word (werd)

• Insertions: waerd, wberd, werzd

• Deletions: wrd, wed, wer

• Transpositions: ewrd, wred, wedr

• Replacements: aerd, ward, wbrd, word, wzrd, werz

• Find the one with the highest frequency: were
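The approach above can be sketched by hand before looking at the real code. The morphs below are a few of the ones listed on the slide; the frequency table is invented for illustration and does not come from any real corpus.

```python
# Hypothetical word counts standing in for a census of a large text.
counts = {"were": 4300, "word": 300, "ward": 30, "wed": 5}

# A few of the morphs of "werd" listed on the slide.
morphs = ["waerd", "wrd", "wed", "wred", "ward", "word", "were", "werz"]

# Keep only the morphs that are real (counted) words ...
known = [m for m in morphs if m in counts]

# ... and pick the one seen most often.
best = max(known, key=counts.get)

print(known)  # ['wed', 'ward', 'word', 'were']
print(best)   # were
```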

Page 15: A Taste of Python - Devdays Toronto 2009

Norvig Spellchecker

import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Page 16: A Taste of Python - Devdays Toronto 2009

Regular Expressions

def words(text): return re.findall('[a-z]+', text.lower())

>>> words("The cat in the hat!")
['the', 'cat', 'in', 'the', 'hat']
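One consequence of the `[a-z]+` pattern is worth seeing directly: anything that is not a lowercase ASCII letter acts as a separator, so apostrophes split words and digits vanish. A small sketch (the sample sentence is made up):

```python
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

# The apostrophe splits "don't" in two, and "42" is dropped entirely.
print(words("Don't panic: 42 is the answer."))
# ['don', 't', 'panic', 'is', 'the', 'answer']
```

For a frequency census this is usually acceptable, since stray fragments like 't' only add a little noise to the counts.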

Page 17: A Taste of Python - Devdays Toronto 2009

Dictionaries

>>> d = {'cat':1}
>>> d
{'cat': 1}
>>> d['cat']
1

>>> d['cat'] += 1
>>> d
{'cat': 2}

>>> d['dog'] += 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'dog'
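The KeyError above can also be avoided with a plain dict by supplying a default at lookup time. A minimal sketch using `dict.get`, shown here as an alternative and not as what the talk's code does:

```python
d = {'cat': 1}

# dict.get returns the given default instead of raising KeyError.
d['dog'] = d.get('dog', 0) + 1
d['cat'] = d.get('cat', 0) + 1

print(d)  # {'cat': 2, 'dog': 1}
```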

Page 18: A Taste of Python - Devdays Toronto 2009

defaultdict

# Has a factory for missing keys
>>> d = collections.defaultdict(int)
>>> d['dog'] += 1
>>> d
{'dog': 1}

>>> int
<type 'int'>
>>> int()
0

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

>>> train(words("The cat in the hat!"))
{'cat': 1, 'the': 2, 'hat': 1, 'in': 1}
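For counting specifically, the standard library also offers `collections.Counter` (available from Python 2.7 and 3.1, so newer than the interpreter used in the talk). The sketch below swaps it in for the `train` function; this is a substitution for comparison, not the talk's method.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

model = collections.Counter(words("The cat in the hat!"))

print(model['the'])           # 2
print(model['cephalopod'])    # 0
print('cephalopod' in model)  # False: the lookup did not add the key
print(model.most_common(1))   # [('the', 2)]
```

One behavioral difference: looking up a missing key in a defaultdict inserts it with the default value, while a Counter returns 0 and leaves the mapping unchanged. That matters when membership tests such as `w in NWORDS` are used to decide whether a word is known.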

Page 19: A Taste of Python - Devdays Toronto 2009

Reading the File

>>> text = file('big.txt').read()
>>> NWORDS = train(words(text))
>>> NWORDS
{'nunnery': 3, 'presnya': 1, 'woods': 22, 'clotted': 1, 'spiders': 1,
'hanging': 42, 'disobeying': 2, 'scold': 3, 'originality': 6,
'grenadiers': 8, 'pigment': 16, 'appropriation': 6, 'strictest': 1,
'bringing': 48, 'revelers': 1, 'wooded': 8, 'wooden': 37,
'wednesday': 13, 'shows': 50, 'immunities': 3, 'guardsmen': 4,
'sooty': 1, 'inevitably': 32, 'clavicular': 9, 'sustaining': 5,
'consenting': 1, 'scraped': 21, 'errors': 16, 'semicircular': 1,
'cooking': 6, 'spiroch': 25, 'designing': 1, 'pawed': 1,
'succumb': 12, 'shocks': 1, 'crouch': 2, 'chins': 1, 'awistocwacy': 1,
'sunbeams': 1, 'perforations': 6, 'china': 43, 'affiliated': 4,
'chunk': 22, 'natured': 34, 'uplifting': 1, 'slaveholders': 2,
'climbed': 13, 'controversy': 33, 'natures': 2, 'climber': 1,
'lency': 2, 'joyousness': 1, 'reproaching': 3, 'insecurity': 1,
'abbreviations': 1, 'definiteness': 1, 'music': 56, 'therefore': 186,
'expeditionary': 3, 'primeval': 1, 'unpack': 1, 'circumstances': 107,
... (about 6500 more lines) ...

>>> NWORDS['the']
80030
>>> NWORDS['unusual']
32
>>> NWORDS['cephalopod']
0
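The `file()` builtin used here exists only in Python 2. A hedged Python 3 sketch of the same step uses `open()` inside a `with` block so the file is closed promptly. The helper name `load_counts`, the UTF-8 default, and the throwaway sample file are assumptions for this example; the talk itself reads big.txt.

```python
import collections
import os
import re
import tempfile

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(int)
    for f in features:
        model[f] += 1
    return model

def load_counts(path, encoding='utf-8'):
    # open() replaces Python 2's file(); the with-block closes the file.
    with open(path, encoding=encoding) as handle:
        return train(words(handle.read()))

# Demo on a throwaway file so the snippet runs without big.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.txt')
    with open(sample, 'w', encoding='utf-8') as handle:
        handle.write("The cat in the hat!")
    counts = load_counts(sample)

print(counts['the'])  # 2
```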

Page 20: A Taste of Python - Devdays Toronto 2009

Training the Probability Model

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

Page 21: A Taste of Python - Devdays Toronto 2009

List Comprehensions

# These two are equivalent:

result = []
for v in iter:
    if cond:
        result.append(expr)

[ expr for v in iter if cond ]

# You can nest loops also:

result = []
for v1 in iter1:
    for v2 in iter2:
        if cond:
            result.append(expr)

[ expr for v1 in iter1 for v2 in iter2 if cond ]
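The templates above use placeholder names (`expr`, `iter`, `cond`). A concrete sketch with made-up data shows that the loop form and the comprehension form produce the same list:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop form.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# Comprehension form: expression first, then the for clause, then the filter.
squares_comp = [n * n for n in numbers if n % 2 == 0]

# Nested loops read left to right, outermost loop first.
pairs = [(a, b) for a in 'ab' for b in 'xy' if a + b != 'ay']

print(squares_comp)  # [4, 16, 36]
print(pairs)         # [('a', 'x'), ('b', 'x'), ('b', 'y')]
```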

Page 22: A Taste of Python - Devdays Toronto 2009

String Slicing

>>> word = "spam"
>>> word[:1]
's'
>>> word[1:]
'pam'

>>> (word[:1], word[1:])
('s', 'pam')

>>> range(len(word) + 1)
[0, 1, 2, 3, 4]

>>> [(word[:i], word[i:]) for i in range(len(word) + 1)]
[('', 'spam'), ('s', 'pam'), ('sp', 'am'), ('spa', 'm'), ('spam', '')]

Page 23: A Taste of Python - Devdays Toronto 2009

Deletions

>>> word = "spam"
>>> s = [(word[:i], word[i:]) for i in range(len(word) + 1)]

>>> deletes = [a + b[1:] for a, b in s if b]

>>> deletes
['pam', 'sam', 'spm', 'spa']

>>> a, b = ('s', 'pam')
>>> a
's'
>>> b
'pam'

>>> bool('pam')
True
>>> bool('')
False

Page 24: A Taste of Python - Devdays Toronto 2009

Transpositions

For example: teh => the

>>> transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]

>>> transposes
['psam', 'sapm', 'spma']
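The "teh => the" case can be run directly. The sketch below applies the same comprehension to "teh" and also shows why `b[2:]` never fails: slicing past the end of a string yields an empty string instead of raising IndexError.

```python
word = "teh"
s = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# len(b) > 1 guarantees b[0] and b[1] exist; b[2:] is '' when b has
# exactly two characters, because out-of-range slices are empty.
transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]

print(transposes)   # ['eth', 'the']
print(repr("eh"[2:]))  # ''
```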

Page 25: A Taste of Python - Devdays Toronto 2009

Replacements

>>> alphabet = "abcdefghijklmnopqrstuvwxyz"

>>> replaces = [a + c + b[1:]  for a, b in s for c in alphabet if b]
>>> replaces
['apam', 'bpam', ..., 'zpam', 'saam', ..., 'szam', ..., 'spaz']

Page 26: A Taste of Python - Devdays Toronto 2009

Insertion

>>> alphabet = "abcdefghijklmnopqrstuvwxyz"

>>> inserts = [a + c + b  for a, b in s for c in alphabet]
>>> inserts
['aspam', ..., 'zspam', 'sapam', ..., 'szpam', 'spaam', ..., 'spamz']

Page 27: A Taste of Python - Devdays Toronto 2009

Find all Edits

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b  for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

>>> edits1("spam")
set(['sptm', 'skam', 'spzam', 'vspam', 'spamj', 'zpam', 'sbam',
'spham', 'snam', 'sjpam', 'spma', 'swam', 'spaem', 'tspam', 'spmm',
'slpam', 'upam', 'spaim', 'sppm', 'spnam', 'spem', 'sparm', 'spamr',
'lspam', 'sdpam', 'spams', 'spaml', 'spamm', 'spamn', 'spum',
'spamh', 'spami', 'spatm', 'spamk', 'spamd', ..., 'spcam', 'spamy'])

Page 28: A Taste of Python - Devdays Toronto 2009

Known Words

def known(words):
    """ Return the known words from `words`. """
    return set(w for w in words if w in NWORDS)

Page 29: A Taste of Python - Devdays Toronto 2009

Correct

def known(words):
    """ Return the known words from `words`. """
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

>>> bool(set([]))
False

>>> correct("computr")
'computer'

>>> correct("computor")
'computer'

>>> correct("computerr")
'computer'
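The `or` chain in `correct` works because an empty set is falsy, so each stage is used only when the previous one found nothing. The sketch below isolates that behavior with hypothetical counts and a hand-written list standing in for the output of `edits1`.

```python
# Hypothetical counts for illustration only.
NWORDS = {'computer': 50, 'compute': 20}

def known(words):
    return set(w for w in words if w in NWORDS)

# Stage 1: the word itself is not known, so this set is empty (falsy).
exact = known(['computr'])

# Stage 2: a hand-picked stand-in for edits1('computr').
near = known(['compute', 'computer', 'computrx'])

# `or` returns the first truthy operand, so `near` is chosen here.
candidates = exact or near or ['computr']

# max with key=NWORDS.get picks the candidate with the highest count.
print(sorted(candidates))               # ['compute', 'computer']
print(max(candidates, key=NWORDS.get))  # computer
```

Note the priority this gives: a known word is always returned unchanged, and any distance-1 candidate beats every distance-2 candidate regardless of frequency.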

Page 30: A Taste of Python - Devdays Toronto 2009

Edit Distance 2

def known_edits2(word):
    return set(
        e2
            for e1 in edits1(word)
                for e2 in edits1(e1)
                    if e2 in NWORDS
        )

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

>>> correct("conpuler")
'computer'
>>> correct("cmpuler")
'computer'
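To try the whole pipeline without big.txt, here is a rough Python 3 port that trains on a tiny made-up sentence. It is a sketch under stated assumptions: `CORPUS` is invented, and `collections.Counter` is swapped in for the talk's defaultdict-based `train`. The edit and correction functions follow the slides.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

# Toy corpus (made up); the talk trains on big.txt instead.
CORPUS = "the computer ran the compiler and the computer was fast"
NWORDS = collections.Counter(words(CORPUS))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return set(w for w in candidates if w in NWORDS)

def known_edits2(word):
    # Filter against NWORDS while generating, so the huge set of all
    # distance-2 strings is never stored.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

print(correct("computr"))   # computer (one edit away)
print(correct("conpuler"))  # computer (two edits away)
print(correct("zzzzzz"))    # zzzzzz (nothing within two edits)
```

In this toy corpus "conpuler" is two edits from both "computer" and "compiler"; "computer" wins only because it occurs twice and "compiler" once.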

Page 32: A Taste of Python - Devdays Toronto 2009

Comparing Python & Java Versions

• http://raelcunha.com/spell-correct.php

• 35 lines of Java

Page 33: A Taste of Python - Devdays Toronto 2009

import java.io.*;
import java.util.*;
import java.util.regex.*;

class Spelling {

    private final HashMap<String, Integer> nWords = new HashMap<String, Integer>();

    public Spelling(String file) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        Pattern p = Pattern.compile("\\w+");
        for(String temp = ""; temp != null; temp = in.readLine()){
            Matcher m = p.matcher(temp.toLowerCase());
            while(m.find()) nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
        }
        in.close();
    }

    private final ArrayList<String> edits(String word) {
        ArrayList<String> result = new ArrayList<String>();
        for(int i=0; i < word.length(); ++i) result.add(word.substring(0, i) + word.substring(i+1));
        for(int i=0; i < word.length()-1; ++i) result.add(word.substring(0, i) + word.substring(i+1, i+2) + word.substring(i, i+1) + word.substring(i+2));
        for(int i=0; i < word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i+1));
        for(int i=0; i <= word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));
        return result;
    }

    public final String correct(String word) {
        if(nWords.containsKey(word)) return word;
        ArrayList<String> list = edits(word);
        HashMap<Integer, String> candidates = new HashMap<Integer, String>();
        for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
        if(candidates.size() > 0) return candidates.get(Collections.max(candidates.keySet()));
        for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
        return candidates.size() > 0 ? candidates.get(Collections.max(candidates.keySet())) : word;
    }

    public static void main(String args[]) throws IOException {
        if(args.length > 0) System.out.println((new Spelling("big.txt")).correct(args[0]));
    }

}

Page 34: A Taste of Python - Devdays Toronto 2009

import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Page 35: A Taste of Python - Devdays Toronto 2009

IDE for Python

• IDEs for Python include:

• PyDev for Eclipse

• WingIDE

• IDLE for Windows/Linux/Mac

• and there are more

Page 36: A Taste of Python - Devdays Toronto 2009

Why Python ROCKS

• Elegant and readable language - “Executable Pseudocode”

• Standard Libraries - “Batteries Included”

• Very high-level datatypes

• Dynamically Typed

• It’s FUN!

Page 37: A Taste of Python - Devdays Toronto 2009

An Open Source Community

• Projects: Plone, Zope, Grok, BFG, Django, SciPy & NumPy, Google App Engine, PyGame

• PyCon

Page 38: A Taste of Python - Devdays Toronto 2009

Resources

• PyGTA

• Toronto Plone Users

• Toronto Django Users

• Stack Overflow

• Dive into Python

• Python Tutorial

Page 39: A Taste of Python - Devdays Toronto 2009

Thanks

• I’d love to hear your questions or comments on this presentation. Reach me at:

[email protected]

• http://twitter.com/hexsprite