a taste of python - devdays toronto 2009

A Taste of Python

Presented by Jordan Baker
October 23, 2009
DevDays Toronto

About Me

• Open Source Developer

• Founder of Open Source Web Application and CMS service provider: Scryent - www.scryent.com

• Founder of Toronto Plone Users Group - www.torontoplone.ca


Agenda

• About Python

• Show me your CODE

• A Spell Checker in 21 lines of code

• Why Python ROCKS

• Resources for further exploration

About Python

http://www.flickr.com/photos/schoffer/196079076/



About Python

• Gotta love a language named after Monty Python’s Flying Circus

• Used in more places than you might know

Significant Whitespace

C-like

if (x == 2) {
    do_something();
}
do_something_else();

Python

if x == 2:
    do_something()
do_something_else()

Significant Whitespace

• less code clutter

• eliminates many common syntax errors

• proper code layout

• use an indentation aware editor or IDE

• Get over it!
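To make the indentation rule concrete, here is a small sketch. It assumes Python 3 (the slides use Python 2.6), and the function name `describe` and the sample source string are made up for illustration. It shows that the indentation level alone decides which statements belong to the `if` block, and that a missing indent is rejected at compile time instead of being silently misread.

```python
def describe(x):
    """Return the labels collected for x; indentation decides block membership."""
    labels = []
    if x == 2:
        labels.append("inside if")   # indented: runs only when x == 2
    labels.append("after if")        # dedented: always runs
    return labels


# Hypothetical snippet whose body is not indented under its `if`.
# compile() refuses it before any code runs.
bad_source = "if x == 2:\nprint('oops')\n"
try:
    compile(bad_source, "<demo>", "exec")
    rejected = False
except IndentationError:
    rejected = True

print(describe(2))
print(describe(3))
print(rejected)
```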

Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Python is Interactive

FIZZ BUZZ

1
2
FIZZ
4
BUZZ
...
14
FIZZ BUZZ

def fizzbuzz(n):
    # Start at 1 so the output begins "1, 2, Fizz" rather than "Fizz Buzz" for 0
    for i in range(1, n + 1):
        if not i % 3:
            print "Fizz",
        if not i % 5:
            print "Buzz",
        if i % 3 and i % 5:
            print i,
        print

fizzbuzz(50)

FIZZ BUZZ

class FizzBuzzWriter(object):
    def __init__(self, limit):
        self.limit = limit

    def run(self):
        for n in range(1, self.limit + 1):
            self.write_number(n)

    def write_number(self, n):
        if not n % 3:
            print "Fizz",
        if not n % 5:
            print "Buzz",
        if n % 3 and n % 5:
            print n,
        print

fizzbuzz = FizzBuzzWriter(50)
fizzbuzz.run()

FIZZ BUZZ (OO)
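The Fizz Buzz slides rely on the Python 2 `print` statement with a trailing comma, which is a syntax error under Python 3. A minimal sketch of the same logic, assuming Python 3, builds each line as a string instead. The helper name `fizzbuzz_line` is hypothetical and not part of the talk.

```python
def fizzbuzz_line(n):
    """Return the Fizz Buzz text for a single number."""
    parts = []
    if n % 3 == 0:
        parts.append("Fizz")
    if n % 5 == 0:
        parts.append("Buzz")
    # Fall back to the number itself when neither rule matched.
    return " ".join(parts) or str(n)


def fizzbuzz(limit):
    """Return the lines for 1..limit inclusive."""
    return [fizzbuzz_line(n) for n in range(1, limit + 1)]


for line in fizzbuzz(15):
    print(line)
```

Returning strings rather than printing inside the loop keeps the rules easy to check from the interactive prompt.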

A Spell Checker in 21 Lines of Code

• Written by Peter Norvig

• Duplicated in many languages

• Simple Spellchecking algorithm based on probability

• http://norvig.com/spell-correct.html



The Approach

• Census by frequency

• Morph the word (werd)

• Insertions: waerd, wberd, werzd

• Deletions: wrd, wed, wer

• Transpositions: ewrd, wred, wedr

• Replacements: aerd, ward, wbrd, word, wzrd, werz

• Find the one with the highest frequency: were
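The steps above can be sketched in a few lines. This is an illustration under stated assumptions: it targets Python 3, the `COUNTS` table holds made-up frequencies (the talk derives real ones from big.txt), and `one_edit_variants` is a hypothetical helper that groups the morphs of "werd" by kind of edit.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical word counts; not taken from the talk's corpus.
COUNTS = {"were": 50, "word": 30, "ward": 5, "wed": 3}


def one_edit_variants(word):
    """Group every single-edit variant of `word` by the kind of edit."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        "insertions": {a + c + b for a, b in splits for c in ALPHABET},
        "deletions": {a + b[1:] for a, b in splits if b},
        "transpositions": {a + b[1] + b[0] + b[2:]
                           for a, b in splits if len(b) > 1},
        "replacements": {a + c + b[1:]
                         for a, b in splits if b for c in ALPHABET},
    }


variants = one_edit_variants("werd")
candidates = set().union(*variants.values())

# Keep only the morphs that appear in the census, then take the most frequent.
known = [w for w in candidates if w in COUNTS]
best = max(known, key=COUNTS.get)
print(sorted(known))
print(best)
```

With these made-up counts the winner is "were", matching the last bullet.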

import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Norvig Spellchecker


>>> words("The cat in the hat!")
['the', 'cat', 'in', 'the', 'hat']

Regular Expressions

>>> d = {'cat':1}
>>> d
{'cat': 1}
>>> d['cat']
1

>>> d['cat'] += 1
>>> d
{'cat': 2}

>>> d['dog'] += 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'dog'

Dictionaries

# Has a factory for missing keys
>>> d = collections.defaultdict(int)
>>> d['dog'] += 1
>>> d
{'dog': 1}

>>> int
<type 'int'>
>>> int()
0

>>> train(words("The cat in the hat!"))
{'cat': 1, 'the': 2, 'hat': 1, 'in': 1}

defaultdict
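One detail of the missing-key factory is easy to overlook: looking up an absent key by index stores it. The sketch below, assuming Python 3, shows that `in` and `.get()` leave the model untouched while indexing adds an entry with count 0. `collections.Counter` is shown only as a swapped-in alternative for comparison; the slides do not use it.

```python
import collections

model = collections.defaultdict(int)
for w in ["the", "cat", "in", "the", "hat"]:
    model[w] += 1

# Membership tests and .get() do not modify the model.
print("dog" in model)
print(model.get("dog", 0))
size_before = len(model)

# Indexing a missing key calls int() and stores the result as a new entry.
print(model["dog"])
print("dog" in model)
size_after = len(model)

# Counter returns 0 for a missing key without storing it.
counter = collections.Counter(["the", "cat", "in", "the", "hat"])
print(counter["dog"], "dog" in counter)
```

Since the spell checker decides whether a word is known with `w in NWORDS`, poking at the model with `NWORDS[word]` at the prompt can quietly turn an unknown word into a "known" one with count 0.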

>>> text = file('big.txt').read()
>>> NWORDS = train(words(text))
>>> NWORDS
{'nunnery': 3, 'presnya': 1, 'woods': 22, 'clotted': 1, 'spiders': 1, 'hanging': 42, 'disobeying': 2, 'scold': 3, 'originality': 6, 'grenadiers': 8, 'pigment': 16, 'appropriation': 6, 'strictest': 1, 'bringing': 48, 'revelers': 1, 'wooded': 8, 'wooden': 37, 'wednesday': 13, 'shows': 50, 'immunities': 3, 'guardsmen': 4, 'sooty': 1, 'inevitably': 32, 'clavicular': 9, 'sustaining': 5, 'consenting': 1, 'scraped': 21, 'errors': 16, 'semicircular': 1, 'cooking': 6, 'spiroch': 25, 'designing': 1, 'pawed': 1, 'succumb': 12, 'shocks': 1, 'crouch': 2, 'chins': 1, 'awistocwacy': 1, 'sunbeams': 1, 'perforations': 6, 'china': 43, 'affiliated': 4, 'chunk': 22, 'natured': 34, 'uplifting': 1, 'slaveholders': 2, 'climbed': 13, 'controversy': 33, 'natures': 2, 'climber': 1, 'lency': 2, 'joyousness': 1, 'reproaching': 3, 'insecurity': 1, 'abbreviations': 1, 'definiteness': 1, 'music': 56, 'therefore': 186, 'expeditionary': 3, 'primeval': 1, 'unpack': 1, 'circumstances': 107,
... (about 6500 more lines) ...

>>> NWORDS['the']
80030
>>> NWORDS['unusual']
32
>>> NWORDS['cephalopod']
0

Reading the File
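The `file()` builtin used on the slides exists only in Python 2. A minimal sketch of the same read-and-train step, assuming Python 3 and a UTF-8 encoded corpus (both assumptions, not stated in the talk), uses `open()` in a `with` block. The function name `load_model` and the small stand-in corpus are made up so the example runs without big.txt.

```python
import collections
import os
import re
import tempfile


def words(text):
    return re.findall("[a-z]+", text.lower())


def train(features):
    model = collections.defaultdict(int)
    for f in features:
        model[f] += 1
    return model


def load_model(path):
    """Read a text file and return its word counts."""
    # The with block closes the file even if reading fails.
    with open(path, encoding="utf-8") as handle:
        return train(words(handle.read()))


# Stand-in corpus written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write("The cat in the hat!\nThe end.\n")
    model = load_model(sample_path)

print(model["the"])
print(model["hat"])
```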





Training the Probability Model

# These two are equivalent:

result = []
for v in iter:
    if cond:
        result.append(expr)

[ expr for v in iter if cond ]

# You can nest loops also:

result = []
for v1 in iter1:
    for v2 in iter2:
        if cond:
            result.append(expr)

[ expr for v1 in iter1 for v2 in iter2 if cond ]

List Comprehensions
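The templates above use placeholders (`expr`, `iter`, `cond`). Here is the same equivalence with concrete values, as a small sketch assuming Python 3; the sample data is made up for illustration.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop form: squares of the even numbers.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# Comprehension form of the same computation.
squares_comp = [n * n for n in numbers if n % 2 == 0]

# Nested form: the for clauses appear in the same order as the nested loops.
pairs_loop = []
for letter in "ab":
    for digit in (1, 2):
        if not (letter == "a" and digit == 2):
            pairs_loop.append((letter, digit))

pairs_comp = [(letter, digit)
              for letter in "ab"
              for digit in (1, 2)
              if not (letter == "a" and digit == 2)]

print(squares_loop == squares_comp)
print(pairs_loop == pairs_comp)
print(pairs_comp)
```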

>>> word = "spam"
>>> word[:1]
's'
>>> word[1:]
'pam'

>>> (word[:1], word[1:])
('s', 'pam')

>>> range(len(word) + 1)
[0, 1, 2, 3, 4]

>>> [(word[:i], word[i:]) for i in range(len(word) + 1)]
[('', 'spam'), ('s', 'pam'), ('sp', 'am'), ('spa', 'm'), ('spam', '')]

String Slicing

>>> word = "spam"
>>> s = [(word[:i], word[i:]) for i in range(len(word) + 1)]

>>> deletes = [a + b[1:] for a, b in s if b]

>>> deletes
['pam', 'sam', 'spm', 'spa']

>>> a, b = ('s', 'pam')
>>> a
's'
>>> b
'pam'

>>> bool('pam')
True
>>> bool('')
False

Deletions

For example: teh => the

>>> transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]

>>> transposes
['psam', 'sapm', 'spma']

Transpositions

>>> alphabet = "abcdefghijklmnopqrstuvwxyz"

>>> replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
>>> replaces
['apam', 'bpam', ..., 'zpam', 'saam', ..., 'szam', ..., 'spaz']

Replacements

>>> alphabet = "abcdefghijklmnopqrstuvwxyz"

>>> inserts = [a + c + b for a, b in s for c in alphabet]
>>> inserts
['aspam', ..., 'zspam', 'sapam', ..., 'szpam', 'spaam', ..., 'spamz']

Insertion



>>> edits1("spam")
set(['sptm', 'skam', 'spzam', 'vspam', 'spamj', 'zpam', 'sbam',
'spham', 'snam', 'sjpam', 'spma', 'swam', 'spaem', 'tspam', 'spmm',
'slpam', 'upam', 'spaim', 'sppm', 'spnam', 'spem', 'sparm', 'spamr',
'lspam', 'sdpam', 'spams', 'spaml', 'spamm', 'spamn', 'spum',
'spamh', 'spami', 'spatm', 'spamk', 'spamd', ..., 'spcam', 'spamy'])

Find all Edits
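How big is that set? For a word of length n there are n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, which adds up to 54n + 25 strings before duplicates are removed. The sketch below, assuming Python 3, checks that arithmetic for "spam". The helper name `edit_lists` is hypothetical; it returns the four lists separately instead of merging them as `edits1` does.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edit_lists(word):
    """Return (deletes, transposes, replaces, inserts) for `word`."""
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in ALPHABET if b]
    inserts = [a + c + b for a, b in s for c in ALPHABET]
    return deletes, transposes, replaces, inserts


word = "spam"
deletes, transposes, replaces, inserts = edit_lists(word)
n = len(word)

total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
print(len(deletes), len(transposes), len(replaces), len(inserts))
print(total == 54 * n + 25)

# The set is smaller than the total because some edits coincide,
# e.g. inserting "p" before or after the existing "p" both give "sppam".
distinct = set(deletes + transposes + replaces + inserts)
print(len(distinct) < total)

# Replacing a letter with itself reproduces the original word.
print(word in distinct)
```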

def known(words):
    """ Return the known words from `words`. """
    return set(w for w in words if w in NWORDS)

Known Words

def known(words):
    """ Return the known words from `words`. """
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

>>> bool(set([]))
False

>>> correct("computr")
'computer'

>>> correct("computor")
'computer'

>>> correct("computerr")
'computer'

Correct
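`correct` leans on two small idioms: an empty set is false, so `or` returns the first non-empty candidate group, and `max(..., key=...)` ranks the candidates by frequency. A sketch of just those two pieces, assuming Python 3; the `COUNTS` table and the hand-written candidate sets are made up stand-ins for `NWORDS` and the `known(...)` calls.

```python
# Hypothetical counts standing in for NWORDS.
COUNTS = {"computer": 40, "compute": 12}

exact = set()                        # the misspelling itself is not a known word
one_edit = {"computer", "compute"}   # pretend result of known(edits1("computr"))
fallback = ["computr"]               # the word itself, used as a last resort

# `or` skips the empty set and stops at the first non-empty operand.
candidates = exact or one_edit or fallback
print(candidates == one_edit)

# key=COUNTS.get makes max() compare counts instead of the strings.
best = max(candidates, key=COUNTS.get)
print(best)

# When every set is empty, the single-item fallback list is chosen.
unchanged = max(set() or set() or ["zzzz"], key=COUNTS.get)
print(unchanged)
```

Because the groups are tried in order, a known word at edit distance 1 always wins over any number of more frequent words further away.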

def known_edits2(word):
    return set(
        e2 for e1 in edits1(word)
        for e2 in edits1(e1)
        if e2 in NWORDS
    )

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

>>> correct("conpuler")
'computer'
>>> correct("cmpuler")
'computer'

Edit Distance 2
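To try the two-edit fallback without big.txt, here is a self-contained sketch assuming Python 3. The tiny `COUNTS` table is a made-up model, so the results only illustrate the mechanism and say nothing about accuracy on real text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical miniature model standing in for NWORDS.
COUNTS = {"computer": 5, "the": 80}


def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in ALPHABET if b]
    inserts = [a + c + b for a, b in s for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    return {w for w in words if w in COUNTS}


def known_edits2(word):
    # Apply edits1 twice, keeping only results that are known words.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in COUNTS}


def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=COUNTS.get)


print(known(edits1("conpuler")))   # nothing known within one edit
print(correct("conpuler"))         # found at distance 2 (n->m, l->t)
print(correct("cmpuler"))          # found at distance 2 (insert o, l->t)
print(correct("qqqq"))             # nothing within two edits: returned as-is
```

Filtering with `if e2 in COUNTS` inside the comprehension keeps the result small even though hundreds of thousands of two-edit strings are generated along the way.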

Comparing Python & Java Versions

• http://raelcunha.com/spell-correct.php

• 35 lines of Java



import java.io.*;
import java.util.*;
import java.util.regex.*;

class Spelling {

    private final HashMap<String, Integer> nWords = new HashMap<String, Integer>();

    public Spelling(String file) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        Pattern p = Pattern.compile("\\w+");
        for(String temp = ""; temp != null; temp = in.readLine()){
            Matcher m = p.matcher(temp.toLowerCase());
            while(m.find()) nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
        }
        in.close();
    }

    private final ArrayList<String> edits(String word) {
        ArrayList<String> result = new ArrayList<String>();
        for(int i=0; i < word.length(); ++i) result.add(word.substring(0, i) + word.substring(i+1));
        for(int i=0; i < word.length()-1; ++i) result.add(word.substring(0, i) + word.substring(i+1, i+2) + word.substring(i, i+1) + word.substring(i+2));
        for(int i=0; i < word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i+1));
        for(int i=0; i <= word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));
        return result;
    }

    public final String correct(String word) {
        if(nWords.containsKey(word)) return word;
        ArrayList<String> list = edits(word);
        HashMap<Integer, String> candidates = new HashMap<Integer, String>();
        for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
        if(candidates.size() > 0) return candidates.get(Collections.max(candidates.keySet()));
        for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
        return candidates.size() > 0 ? candidates.get(Collections.max(candidates.keySet())) : word;
    }

    public static void main(String args[]) throws IOException {
        if(args.length > 0) System.out.println((new Spelling("big.txt")).correct(args[0]));
    }

}

IDE for Python

• IDEs for Python include:

• PyDev for Eclipse

• WingIDE

• IDLE for Windows/Linux/Mac

• and more

Why Python ROCKS

• Elegant and readable language - “Executable Pseudocode”

• Standard Libraries - “Batteries Included”

• Very high-level data types

• Dynamically Typed

• It’s FUN!

An Open Source Community

• Projects: Plone, Zope, Grok, BFG, Django, SciPy & NumPy, Google App Engine, PyGame

• PyCon

Resources

• PyGTA

• Toronto Plone Users

• Toronto Django Users

• Stack Overflow

• Dive into Python

• Python Tutorial

Thanks

• I’d love to hear your questions or comments on this presentation. Reach me at:

• [email protected]

• http://twitter.com/hexsprite

