
Python for Bioinformatics

Lecture 4: Dictionaries

Some comments

• You do not learn programming in the lecture, but at the keyboard
• Practice, practice, practice!
• Python is the coolest language ever!
• Getting stuck is completely normal, it happens to everyone. Really. You just have to know where to look:
  1. Try Google! Example: "python sort list"
  2. Read the Python manual or reference pages, or look into a Python book.
  3. Ask people!

Tabular data

• Suppose we have a file containing a table of Drosophila gene names and their Entrez Gene identifiers, one pair on each line:

    Cyp12a5           42293
    ken and barbie    37785
    cop               45837
    bor               53565
    hangover          32613

• Suppose this table is in a tab-separated text file called "genes.txt"

Reading a table of data

• We can split each line into a list with two elements using the split method:

    >>> line = "Cyp12a5\t42293"    # \t is the TAB character
    >>> line.split("\t")
    ['Cyp12a5', '42293']

• The opposite of split is join, which makes a string from a list of strings:

    >>> "\t".join(['Cyp12a5', '42293'])
    'Cyp12a5\t42293'
    >>> print "\t".join(['Cyp12a5', '42293'])
    Cyp12a5	42293

Reading a table of data

• Reading the file:

    geneNames = []
    geneIDs = []
    for line in open("genes.txt"):
        # strip the trailing newline, then split at the TAB character
        geneName, geneID = line.rstrip("\n").split("\t")
        geneNames.append(geneName)
        geneIDs.append(geneID)
    print "Gene names:", ", ".join(geneNames)
    print "Gene IDs:", ", ".join(geneIDs)

• Output:

    Gene names: Cyp12a5, ken and barbie, cop, bor, hangover
    Gene IDs: 42293, 37785, 45837, 53565, 32613

Finding an entry

• The following code assumes that we have already read in the table from the file:

    import sys

    geneToFind = sys.argv[1]
    print "Searching for gene", geneToFind

    for i in range(len(geneNames)):
        if geneNames[i] == geneToFind:
            print "Found gene:", geneNames[i]
            print "Gene ID:", geneIDs[i]
            sys.exit()

    print "Couldn't find gene"

• Example output, with sys.argv[1] = "cop":

    Searching for gene cop
    Found gene: cop
    Gene ID: 45837

Dictionaries

• Conveniently, Python provides a data type called a dictionary (also called a hash table) that does this kind of lookup for you
• A dictionary is a set of key–value pairs (such as our geneName–geneID table)
• Square brackets [] are used to index a dictionary:

    genes["cop"] = "45837"

Getting familiar with dictionaries

Creating an initial phone book:

    >>> tlf = {"Michael": 40062,
    ...        "Bingding": 40064, "Andreas": 40063}

Asking for all keys:

    >>> tlf.keys()
    ['Bingding', 'Andreas', 'Michael']

Asking for all values:

    >>> tlf.values()
    [40064, 40063, 40062]

Asking for a value, given a key:

    >>> tlf["Michael"]
    40062

Check if a key is in the dictionary:

    >>> "Lars" in tlf
    False

Inserting a single key–value pair:

    >>> tlf["Lars"] = 40070
    >>> tlf.has_key("Lars")    # now it's there
    True

Looping through the dictionary:

    >>> for name in tlf.keys():
    ...     print name, tlf[name]
    ...
    Lars 40070
    Bingding 40064
    Andreas 40063
    Michael 40062

Reading a tabular file into a dictionary

    import sys

    genes = {}    # creates an empty dictionary

    for line in open("genes.txt"):
        # strip the trailing newline so that it does not end up in the gene ID
        geneName, geneID = line.rstrip("\n").split("\t")
        genes[geneName] = geneID

    geneToFind = sys.argv[1]
    print "Gene:", geneToFind
    print "Gene ID:", genes[geneToFind]

...with sys.argv[1] = "cop" as before:

    Gene: cop
    Gene ID: 45837
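The dictionary lookup above raises a KeyError when the requested gene is not in the table. A minimal sketch of a more forgiving lookup follows; it assumes an in-memory list of lines standing in for genes.txt, and the helper name lookup is made up for illustration. It is written with print(...) so that it also runs under Python 3.

```python
# Hypothetical in-memory stand-in for the lines of genes.txt
lines = [
    "Cyp12a5\t42293\n",
    "ken and barbie\t37785\n",
    "cop\t45837\n",
]

genes = {}
for line in lines:
    gene_name, gene_id = line.rstrip("\n").split("\t")
    genes[gene_name] = gene_id


def lookup(genes, gene_to_find):
    """Return the gene ID, or None if the gene is not in the table."""
    # dict.get returns the default (None here) instead of raising KeyError
    return genes.get(gene_to_find)


print("cop -> %s" % lookup(genes, "cop"))
print("unknown gene -> %s" % lookup(genes, "no such gene"))
```

Returning None leaves it to the caller to decide how to report a missing gene, instead of stopping the program.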

Reading a FASTA file into a dictionary

• Dictionaries are an excellent choice for storing sequences:

    def read_fasta(filename):
        name = None
        name2seq = {}
        for line in open(filename):
            if line.startswith(">"):
                # "if name" only evaluates to False if it is still None
                # (when going over the first line)
                if name:
                    name2seq[name] = seq
                # new name is obtained from the line from the second letter on
                # (skipping the >), with the newline character removed
                name = line[1:].rstrip()
                seq = ""
            else:
                seq += line.rstrip()
        # set final entry, after loop
        name2seq[name] = seq
        return name2seq
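To try the parsing logic without a file on disk, the same idea can be sketched as a function that accepts any iterable of lines. The name parse_fasta and the example records are assumptions for illustration, not part of the lecture code.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a name -> sequence dictionary."""
    name2seq = {}
    name = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # header line: start a new, initially empty, entry
            name = line[1:]
            name2seq[name] = ""
        elif name is not None:
            # sequence line: append to the entry of the current header
            name2seq[name] += line
    return name2seq


# Hypothetical two-record example
example = [">CG11488\n", "acgt\n", "ttga\n", ">CG11604\n", "ggcc\n"]
name2seq = parse_fasta(example)
for name in sorted(name2seq):
    print("%s : %s" % (name, name2seq[name]))
```

Because each entry is created as soon as its header is seen, no special handling of the last record is needed after the loop, and empty input simply gives an empty dictionary.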

Reading a FASTA file the easy way

• BioPython (http://biopython.org) is a useful collection of Python code for computational molecular biology. Once you have installed it, you can use all its powerful functions.
• Reading from a FASTA file becomes quite easy:

    # Note: the Bio.Fasta module exists only in old Biopython releases;
    # newer releases provide Bio.SeqIO for the same task.
    from Bio import Fasta

    def read_fasta(filename):
        name2seq = {}
        iterator = Fasta.Iterator(open(filename), Fasta.RecordParser())

        for record in iterator:
            name2seq[record.title] = record.sequence

        return name2seq

Keys and values

• keys() returns the list of keys of the dictionary
  – e.g. names, in the name2seq dictionary
• values() returns the list of values
  – e.g. sequences, in the name2seq dictionary

    name2seq = read_fasta("C:/fly3utr.fa")

    print "Sequence names read: ", " ".join(name2seq.keys())
    print "Total length of sequences: ", len("".join(name2seq.values()))

Output:

    Sequence names read: CG11488 CG11604 CG11455
    Total length of sequences: 210

Remember: why modules are cool

• Additional functionality that is not part of the core language can be loaded from modules:

    >>> import math
    >>> math.pi
    3.1415926535897931
    >>> help(math)
    Help on built-in module math:
    …

• You can write your own modules and import them. Never copy and paste code you wrote. Re-use your code!
• People already wrote modules that you can download and use. Don't re-invent the wheel!

Formatted output of sequences

    def print_seq(name, seq, width=50):
        print ">" + name
        i = 0
        while i < len(seq):
            print seq[i:i+width]
            i += width

    print_seq("Tata-box1", "TA"*50)
    print_seq("Tata-box2", "TA"*50, 30)

Default values can be assigned to parameters. Parameters with defaults need to be placed rightmost. Here, the default width is set to 50, giving 50-column output.

Output:

    >Tata-box1
    TATATATATATATATATATATATATATATATATATATATATATATATATA
    TATATATATATATATATATATATATATATATATATATATATATATATATA
    >Tata-box2
    TATATATATATATATATATATATATATATA
    TATATATATATATATATATATATATATATA
    TATATATATATATATATATATATATATATA
    TATATATATA

Comparing files with sequence names

• Easy way to specify a subset of a given FASTA database
• Each line is the name of a sequence in a given database
• Two files with sequence names: what is the overlap, difference, and union?

    C:/fosn1.txt    C:/fosn2.txt
    CG1167          CG215
    CG685           CG1041
    CG1041          CG483
    CG1043          CG1167
                    CG1163

Common set operations

    C:/fosn1.txt    C:/fosn2.txt
    CG1167          CG215
    CG685           CG1041
    CG1041          CG483
    CG1043          CG1167
                    CG1163

    from sets import Set

    geneSet1 = Set([])
    geneSet2 = Set([])

    for line in open("C:/fosn1.txt"):
        geneSet1.add(line.rstrip())

    for line in open("C:/fosn2.txt"):
        geneSet2.add(line.rstrip())

    >>> geneSet1.intersection(geneSet2)
    Set(['CG1041', 'CG1167'])
    >>> geneSet2.difference(geneSet1)
    Set(['CG483', 'CG215', 'CG1163'])
    >>> geneSet1.union(geneSet2)
    Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685', 'CG1163', 'CG215'])

[Figure: Venn diagrams of two sets A and B illustrating difference, intersection, and union]

More set operations

• Since every element in a set occurs only once, sets can be used to reduce redundancy

    >>> from sets import Set
    >>> Set([1,2,3,1,3,3])
    Set([1, 2, 3])

• A is a superset of B when A fully contains B. Test: A.issuperset(B)
• A is a subset of B when A is fully contained in B. Test: A.issubset(B)

    >>> pqs = Set("1kim 1dan 1bob".split())
    >>> pdb = Set("1bob 3mad 1dan 2bad 1kim".split())
    >>> pqs.issubset(pdb)
    True
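The slides use the Set class from the sets module. From Python 2.4 on there is also a built-in set type with operator shortcuts, and the sets module is no longer available in Python 3. A short sketch with the built-in type follows; the gene names are typed in as lists here, an assumption made so that the example does not depend on the two text files.

```python
# Names as they appear in the two example files, typed in directly
file1_names = ["CG1167", "CG685", "CG1041", "CG1043"]
file2_names = ["CG215", "CG1041", "CG483", "CG1167", "CG1163"]

gene_set1 = set(file1_names)
gene_set2 = set(file2_names)

overlap = gene_set1 & gene_set2        # same as gene_set1.intersection(gene_set2)
only_in_2 = gene_set2 - gene_set1      # same as gene_set2.difference(gene_set1)
everything = gene_set1 | gene_set2     # same as gene_set1.union(gene_set2)

# sorted() gives a reproducible order for printing
print(sorted(overlap))
print(sorted(only_in_2))
print(sorted(everything))
print(overlap.issubset(gene_set1))
```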

The genetic code as a dictionary

    aa = {'ttt':'F', 'tct':'S', 'tat':'Y', 'tgt':'C',
          'ttc':'F', 'tcc':'S', 'tac':'Y', 'tgc':'C',
          'tta':'L', 'tca':'S', 'taa':'!', 'tga':'!',
          'ttg':'L', 'tcg':'S', 'tag':'!', 'tgg':'W',
          'ctt':'L', 'cct':'P', 'cat':'H', 'cgt':'R',
          'ctc':'L', 'ccc':'P', 'cac':'H', 'cgc':'R',
          'cta':'L', 'cca':'P', 'caa':'Q', 'cga':'R',
          'ctg':'L', 'ccg':'P', 'cag':'Q', 'cgg':'R',
          'att':'I', 'act':'T', 'aat':'N', 'agt':'S',
          'atc':'I', 'acc':'T', 'aac':'N', 'agc':'S',
          'ata':'I', 'aca':'T', 'aaa':'K', 'aga':'R',
          'atg':'M', 'acg':'T', 'aag':'K', 'agg':'R',
          'gtt':'V', 'gct':'A', 'gat':'D', 'ggt':'G',
          'gtc':'V', 'gcc':'A', 'gac':'D', 'ggc':'G',
          'gta':'V', 'gca':'A', 'gaa':'E', 'gga':'G',
          'gtg':'V', 'gcg':'A', 'gag':'E', 'ggg':'G' }

Translating: DNA to protein

    import sys

    def translate(dna):
        length = len(dna)
        if length % 3 != 0:
            print "Warning: Length is not a multiple of 3"
            sys.exit()
        amino_acids = []
        i = 0
        while i < length:
            codon = dna[i:i+3].lower()
            if codon not in aa:
                print "Codon %s is illegal" % codon
                sys.exit()
            amino_acids.append(aa[codon])
            i += 3
        return "".join(amino_acids)

    >>> translate("gatgacgaaagttgt")
    'DDESC'
    >>> translate("gatgacgaaagttgta")
    Warning: Length is not a multiple of 3
    … (SystemExit)
    >>> translate("gatgacgiaagttgt")
    Codon gia is illegal
    … (SystemExit)
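Calling sys.exit() inside translate ends the whole program on bad input. One possible variant, sketched here as an assumption rather than as the lecture's method, raises ValueError so that the caller can decide what to do. To keep the example self-contained, the codon table is rebuilt compactly from two strings; the names BASES and AMINO_ACIDS are made up for this sketch, and '!' marks a stop codon as in the dictionary shown earlier.

```python
import itertools

# Codons in the order t, c, a, g for each position; '!' marks a stop codon
BASES = "tcag"
AMINO_ACIDS = ("FFLLSSSSYY!!CC!W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
codons = ["".join(c) for c in itertools.product(BASES, repeat=3)]
aa = dict(zip(codons, AMINO_ACIDS))


def translate(dna):
    """Translate a DNA string into one-letter amino acid codes."""
    if len(dna) % 3 != 0:
        raise ValueError("Length is not a multiple of 3")
    amino_acids = []
    # step through the sequence one codon (three letters) at a time
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3].lower()
        if codon not in aa:
            raise ValueError("Codon %s is illegal" % codon)
        amino_acids.append(aa[codon])
    return "".join(amino_acids)


print(translate("gatgacgaaagttgt"))
try:
    translate("gatgacgiaagttgt")
except ValueError as error:
    print("Could not translate: %s" % error)
```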

Counting residue frequencies

    def count_residues(seq):
        freq = {}
        seq = seq.lower()
        for letter in seq:
            if letter in freq:
                freq[letter] += 1
            else:
                freq[letter] = 1
        return freq

    freq = count_residues("gatgacgaaagttgt")

    # display statistics
    for residue in freq.keys():
        print "%s : %s" % (residue, freq[residue])

Output:

    a : 5
    c : 1
    t : 4
    g : 5

Tricky question: Given the same sequence, does the output always look exactly the same?
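One way to make the printed statistics independent of the dictionary's internal key order (which should not be relied on, since it depends on the Python version and implementation) is to loop over the sorted keys. A small sketch, using dict.get as a shorter way of counting:

```python
def count_residues(seq):
    """Count how often each letter occurs in seq (case-insensitive)."""
    freq = {}
    for letter in seq.lower():
        # get() returns 0 for letters that have not been seen yet
        freq[letter] = freq.get(letter, 0) + 1
    return freq


freq = count_residues("gatgacgaaagttgt")

# sorted(freq) yields the keys in alphabetical order
report = ["%s : %s" % (residue, freq[residue]) for residue in sorted(freq)]
for line in report:
    print(line)
```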

Counting N-mer frequencies

    def count_nmers(seq, n):
        freq = {}
        seq = seq.lower()
        for i in range(len(seq) - n + 1):
            nmer = seq[i:i+n]
            if nmer in freq:
                freq[nmer] += 1    # increase counter
            else:
                freq[nmer] = 1     # first occurrence
        return freq

    freq = count_nmers("gatgacgaaagttgt", 2)

    # display statistics
    for residue in freq.keys():
        print "%s : %s" % (residue, freq[residue])

Output:

    cg : 1
    tt : 1
    ga : 3
    tg : 2
    gt : 2
    aa : 2
    ac : 1
    at : 1
    ag : 1

N-mer frequencies for a whole file

    from read_fasta import read_fasta

    def count_nmers(seq, n, freq):
        seq = seq.lower()
        for i in range(len(seq)-n+1):
            nmer = seq[i : i+n]
            if nmer in freq:
                freq[nmer] += 1
            else:
                freq[nmer] = 1
        return freq

    name2seq = read_fasta("C:/fly3utr.fa")
    freq = {}
    # count for each sequence
    for seq in name2seq.values():
        freq = count_nmers(seq, 2, freq)
    # display statistics
    for residue in freq.keys():
        print "%s : %s" % (residue, freq[residue])

Output:

    ct : 5     tc : 9     tt : 26    cg : 4
    ga : 11    tg : 12    gc : 2     gt : 17
    aa : 39    ac : 10    gg : 4     at : 17
    ca : 11    ag : 15    ta : 20    cc : 2

Note how we pass freq back into the count_nmers function, to get cumulative counts.

Here we reuse a function we wrote earlier by importing it. In "from read_fasta import read_fasta", the first name is the module (the filename without .py) and the second is the function name.
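The cumulative counting can also be sketched with collections.Counter from the standard library (available from Python 2.7 on), which treats missing keys as zero. The two example sequences and the optional freq parameter are assumptions for illustration; they stand in for the sequences read from the FASTA file.

```python
from collections import Counter


def count_nmers(seq, n, freq=None):
    """Add the n-mer counts of seq to freq (a Counter) and return it."""
    if freq is None:
        freq = Counter()
    seq = seq.lower()
    for i in range(len(seq) - n + 1):
        # a Counter returns 0 for unseen keys, so no if/else is needed
        freq[seq[i:i + n]] += 1
    return freq


# Hypothetical stand-in for name2seq.values()
sequences = ["gatgacgaaagttgt", "gattaca"]

freq = Counter()
for seq in sequences:
    count_nmers(seq, 2, freq)

for nmer in sorted(freq):
    print("%s : %s" % (nmer, freq[nmer]))
```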

Files and file handles

Opening a file:

    fh = open(filename)         # gives a file handle

Closing a file:

    fh.close()

Read one line:

    data = fh.readline()

Read everything:

    data = fh.read()

Open write-only:

    fh = open(filename, "w")    # writing will overwrite current content!

Open for append:

    fh = open(filename, "a")    # writing will be appended at the end of the file

Writing a line:

    fh.write(line)              # writes line as is, without adding a newline
    print >> fh, line           # alternative, adds a newline

Test if file exists:

    import os
    if os.path.exists(filename):
        print "filename exists!"
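A common companion to these calls is the with statement, which closes the file handle automatically when the block ends. The sketch below writes to a temporary directory; the file name example.txt is an assumption for illustration.

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory
filename = os.path.join(tempfile.mkdtemp(), "example.txt")

# "w" creates the file (or overwrites existing content)
with open(filename, "w") as fh:
    fh.write("cop\t45837\n")

# "a" appends at the end of the file
with open(filename, "a") as fh:
    fh.write("bor\t53565\n")

# read everything back, one list element per line
with open(filename) as fh:
    lines = fh.readlines()

print(lines)
print(os.path.exists(filename))
print(fh.closed)    # the handle was closed when the with block ended
```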

Database access from Python

    # use the database module with all the DB-relevant subroutines
    import MySQLdb
    # import the cursor class that returns rows as dictionaries
    from MySQLdb.cursors import DictCursor

    # connect to the database with access specification
    conn = MySQLdb.connect(db="scop",         # name of database
                           host="mydb",       # name of server
                           user="guest",      # username
                           passwd="guest")    # password
    # create a cursor that retrieves rows as dictionaries
    cursor = conn.cursor(DictCursor)

    # send an SQL query
    cursor.execute("SELECT * FROM cla LIMIT 10")
    # retrieve all rows as a tuple of dictionaries
    data = cursor.fetchall()

    # close connection
    conn.close()

• What are the tuples?
• What do the dictionaries contain? What are the keys?
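To explore these questions without access to a MySQL server, the same pattern can be tried with the sqlite3 module from the standard library. This is a swapped-in technique, not the lecture's setup: the in-memory database, the two column names and the rows are all invented placeholders, and only the table name cla is taken from the query above.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
# sqlite3.Row makes each row addressable by column name
conn.row_factory = sqlite3.Row
cursor = conn.cursor()

# Invented table layout and contents, for illustration only
cursor.execute("CREATE TABLE cla (sid TEXT, pdb_id TEXT)")
cursor.executemany("INSERT INTO cla VALUES (?, ?)",
                   [("sid_1", "pdb_1"), ("sid_2", "pdb_2")])

cursor.execute("SELECT * FROM cla LIMIT 10")
# convert each row into an ordinary dictionary: column name -> value
data = [dict(row) for row in cursor.fetchall()]
conn.close()

for row in data:
    print(sorted(row.items()))
```

In this sketch there is one dictionary per table row, and the keys of each dictionary are the column names of the query result.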


Python for Bioinformatics

Lecture 5: Advanced Programming Techniques

Local vs. global variables

• Global variables can be used everywhere

    def foo():
        print a

    # declare global variable
    a = 6
    print a
    foo()

  Output: 6, 6

• Function variables are local by default

    def foo():
        a = 3          # local variable a
        print a

    a = 6              # global variable a
    print a
    foo()
    print a            # the global variable a does not change

  Output: 6, 3, 6

• …unless you declare them to be global

    def foo():
        global a
        a = 3          # now the global variable a is changed
        print a

    a = 6
    print a
    foo()
    print a

  Output: 6, 3, 3

References in Python

• Lists, sets, dictionaries and all other changeable data types are referenced, i.e. when assigning a variable, no data is copied:

    >>> a = [1,2,3,4]
    >>> b = a
    >>> b[2] = 7
    >>> a
    [1, 2, 7, 4]

  Assigning b = a results in b pointing to the same list as a.

• "Real copies" with the copy module:

    >>> from copy import copy
    >>> b = copy(a)
    >>> b[2] = 3
    >>> a
    [1, 2, 7, 4]

  Copying results in having two different lists assigned to a and b.

• Don't worry about any referencing. Python is doing the job! But be aware when you want to copy objects.
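One detail worth knowing when copying nested structures such as lists of lists: copy makes a shallow copy, so the inner lists are still shared, while deepcopy (also in the copy module) duplicates them as well. A small sketch with made-up variable names:

```python
from copy import copy, deepcopy

matrix = [[1, 2], [3, 4]]

shallow = copy(matrix)      # new outer list, but the same inner lists
deep = deepcopy(matrix)     # new outer list and new inner lists

shallow[0][0] = 99          # also visible through matrix: the row is shared
deep[1][1] = -1             # only changes the deep copy

print(matrix)
print(shallow)
print(deep)
```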

Matrices

• Easy solution: lists of lists in core Python
• Access an element at position (i, j) in a list of lists: selecting from the i-th row the j-th element

    >>> m = [[1, 2], [3, 4]]
    >>> m[1]
    [3, 4]
    >>> m[1][1]
    4

• Disadvantages: operations such as addition and multiplication of lists would be slow, and would all need to be implemented
• Luckily, a big library is already available: numpy. Fast (since implemented in C), rich functionality

Matrices with numpy

• Faster, and with more calculations built in (reshaping, matrix operations), using the external package numpy
• Various matrix creation methods with numpy:
  – from a list of lists
  – filled with zeros
  – from a function
• Convenient access to multi-dimensional array elements

    >>> from numpy import *
    >>> m = [[1, 2], [3, 4]]
    >>> m1 = array(m)
    >>> m1
    array([[1, 2],
           [3, 4]])
    >>> m1.shape
    (2, 2)
    >>> zeros((3,5))         # filled with floating-point zeros by default
    array([[0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.]])
    >>> m2 = arange(8)       # same as array(range(8))
    >>> m2.shape = (2,4)
    >>> m2
    array([[0, 1, 2, 3],
           [4, 5, 6, 7]])
    >>> m2[1,2]
    6

Matrices with numpy

• You can select rows and columns...

    >>> m = arange(9)        # one-dim. array
    >>> m.shape = (3,3)
    >>> m
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> m[:,1]               # second column
    array([1, 4, 7])

• ...or even submatrices (same "slicing" as with lists)

    >>> m[1:,1:]
    array([[4, 5],
           [7, 8]])

• You can apply scalar operations to an array, such as
  – addition +
  – multiplication *
  – sine or cosine

    >>> m[1,:] + 1           # addition
    array([4, 5, 6])
    >>> m[0,:] * 5           # multiplication
    array([ 0,  5, 10])
    >>> sin(m1)              # m1 = array([[1, 2], [3, 4]])
    array([[ 0.84147098,  0.90929743],
           [ 0.14112001, -0.7568025 ]])

More math

• Remember the mean and standard deviation from Lecture 3? Reuse of existing packages makes life easier:

    >>> data = array([1, 5, 1, 12, 3, 4, 6])
    >>> data.mean()
    4.5714285714285712
    >>> data.std()
    3.4992710611188254

• Finding the maximum in a list now becomes:

    >>> data[argmax(data)]
    12

• numpy also provides functions for dot product, vector calculations etc.

    >>> dot(array([1,2,3]), array([1,2,3]))
    14
    >>> array([1,2,3]) + array([4,5,6])
    array([5, 7, 9])
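For comparison, the same mean and standard deviation can be sketched in plain Python. This is an illustrative re-implementation, not the Lecture 3 code; it divides by the number of values (the population standard deviation), which is also what numpy's std() does by default.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


def std(values):
    """Population standard deviation (divides by n, not n - 1)."""
    m = mean(values)
    variance = sum((x - m) ** 2 for x in values) / float(len(values))
    return math.sqrt(variance)


data = [1, 5, 1, 12, 3, 4, 6]
print("mean: %.4f" % mean(data))
print("std:  %.4f" % std(data))
print("max:  %d" % max(data))
```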

Longest Common Subsequence in Python

By Michael Schroeder, Biotec, 2004

Formally: Longest Common Subsequence (LCS)

What is the length s(V, W) of the longest common subsequence of two sequences V = v_1 … v_n and W = w_1 … w_m?

Find sequences of indices

\[ 1 \le i_1 < \dots < i_k \le n \quad \text{and} \quad 1 \le j_1 < \dots < j_k \le m \]

such that \( v_{i_t} = w_{j_t} \) for \( 1 \le t \le k \).

How? Dynamic programming: \( s_{i,0} = s_{0,j} = 0 \) for all \( 1 \le i \le n \) and \( 1 \le j \le m \), and

\[ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} \]

Then \( s(V, W) = s_{n,m} \) is the length of the LCS.


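Example LCS

The table has one column per position j = 0 … 6 of W = TGCATA and one row per position i = 0 … 7 of V = ATCTGAT:

    j       0  1  2  3  4  5  6
    W          T  G  C  A  T  A
    i  V
    0
    1  A
    2  T
    3  C
    4  T
    5  G
    6  A
    7  T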


Example LCS:

    j       0  1  2  3  4  5  6
    W          T  G  C  A  T  A
    i  V
    0       0  0  0  0  0  0  0
    1  A    0
    2  T    0
    3  C    0
    4  T    0
    5  G    0
    6  A    0
    7  T    0

Initialisation: \( s_{i,0} = s_{0,j} = 0 \) for all \( 1 \le i \le n \) and \( 1 \le j \le m \)


Example LCS:

    j       0  1  2  3  4  5  6
    W          T  G  C  A  T  A
    i  V
    0       0  0  0  0  0  0  0
    1  A    0  0  0  0  1  1  1
    2  T    0
    3  C    0
    4  T    0
    5  G    0
    6  A    0
    7  T    0

Computing each cell:

\[ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} \]


Example LCS:

    j       0  1  2  3  4  5  6
    W          T  G  C  A  T  A
    i  V
    0       0  0  0  0  0  0  0
    1  A    0  0  0  0  1  1  1
    2  T    0  1  1  1  1  2  2
    3  C    0  1  1  2  2  2  2
    4  T    0  1  1  2  2  3  3
    5  G    0  1  2  2  2  3  3
    6  A    0  1  2  2  3  3  4
    7  T    0  1  2  2  3  4  4

Computing each cell:

\[ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} \]


LCS Algorithm

    LCS(V, W)
        For i = 1 to n
            s_{i,0} = 0
        For j = 1 to m
            s_{0,j} = 0
        For i = 1 to n
            For j = 1 to m
                If v_i = w_j and s_{i-1,j-1} + 1 ≥ s_{i-1,j} and s_{i-1,j-1} + 1 ≥ s_{i,j-1} Then
                    s_{i,j} = s_{i-1,j-1} + 1
                    b_{i,j} = North West
                Else if s_{i-1,j} ≥ s_{i,j-1} Then
                    s_{i,j} = s_{i-1,j}
                    b_{i,j} = North
                Else
                    s_{i,j} = s_{i,j-1}
                    b_{i,j} = West
        Return s and b

Complexity: LCS has quadratic complexity: O(n m)
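The pseudocode returns the pointer table b along with the scores, but does not show how b is used. A minimal sketch in plain Python follows, under the assumption that b is meant for tracing back one longest common subsequence from cell (n, m). The traceback step and the short direction labels "NW", "N" and "W" are additions for illustration.

```python
def lcs(v, w):
    """Return (length, one longest common subsequence) of strings v and w."""
    n, m = len(v), len(w)
    # s[i][j]: LCS length of v[:i] and w[:j]; row 0 and column 0 stay 0
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # b[i][j]: direction the value of s[i][j] came from
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + 1
            if v[i - 1] == w[j - 1] and diag >= s[i - 1][j] and diag >= s[i][j - 1]:
                s[i][j], b[i][j] = diag, "NW"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "N"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "W"
    # traceback: follow the pointers from (n, m) until a border is reached
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if b[i][j] == "NW":
            letters.append(v[i - 1])    # matched letter belongs to the LCS
            i -= 1
            j -= 1
        elif b[i][j] == "N":
            i -= 1
        else:
            j -= 1
    return s[n][m], "".join(reversed(letters))


length, subsequence = lcs("ATCTGAT", "TGCATA")
print("length %d, for example %s" % (length, subsequence))
```

Several different subsequences can share the maximal length; which one the traceback reports depends on how ties between North and West are broken.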

LCS in Python

    from numpy import *

    seq1 = "ATCTGATC"
    seq2 = "TGCATA"
    len1 = len(seq1)
    len2 = len(seq2)

    def max3(a, b, c):
        return max(max(a, b), c)

    # create an integer array val of size (len1 + 1) times (len2 + 1)
    val = zeros((len1+1, len2+1), dtype=int)

    for i in range(1, len1+1):
        for j in range(1, len2+1):
            if seq1[i-1] == seq2[j-1]:
                val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1)
            else:
                val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1])
    print val
    lcs = val[len1, len2]
    print "The longest common subsequence of %s and %s is %d" % (seq1, seq2, lcs)

Longest Common Subsequence Output

Result of print val:

    [[0 0 0 0 0 0 0]
     [0 0 0 0 1 1 1]
     [0 1 1 1 1 2 2]
     [0 1 1 2 2 2 2]
     [0 1 1 2 2 3 3]
     [0 1 2 2 2 3 3]
     [0 1 2 2 3 3 4]
     [0 1 2 2 3 4 4]
     [0 1 2 3 3 4 4]]

Final result:

    The longest common subsequence of ATCTGATC and TGCATA is 4

Classes

• Define a class to store PDB residues. A residue has a name, a position in the sequence, and a list of atoms. An atom has a name and coordinates. Define two methods: add_residue and add_atom.

    class PDBStructure:
        def __init__(self):
            # list of residues, filled by add_residue
            self.residues = []

        def add_residue(self, name, posseq):
            residue = {'name': name, 'posseq': posseq, 'atoms': []}
            self.residues.append(residue)
            return residue

        def add_atom(self, residue, name, coord):
            atom = {'name': name, 'coord': coord}
            residue['atoms'].append(atom)
            return atom

Classes: Usage

    struct = PDBStructure()    # create an instance of a class

    residue = struct.add_residue(name="ILE", posseq=1)
    struct.add_atom(residue, name="N", coord=(23.46, -8.01, -15.26))
    struct.add_atom(residue, name="CZ", coord=(125.50, 4.50, -19.14))
    residue = struct.add_residue(name="LYS", posseq=2)
    struct.add_atom(residue, name="OE1", coord=(126.12, -1.78, -15.04))

    print struct.residues

Output (wrapped for readability):

    [{'name': 'ILE', 'posseq': 1, 'atoms':
         [{'name': 'N', 'coord': (23.46, -8.01, -15.26)},
          {'name': 'CZ', 'coord': (125.5, 4.5, -19.14)}]},
     {'name': 'LYS', 'posseq': 2, 'atoms':
         [{'name': 'OE1', 'coord': (126.12, -1.78, -15.04)}]}]
