
Post on 20-Dec-2015


Python for Bioinformatics

Lecture 4: Dictionaries


Some comments

• You do not learn programming in the lecture, but at the keyboard
• Practice, practice, practice!
• Python is the coolest language ever!
• Getting stuck is completely normal; it happens to everyone. Really. You just have to know where to look:
  1. Try Google! Example: "python sort list"
  2. Read the Python manual or reference pages. Or look into a Python book.
  3. Ask people!


Tabular data

• Suppose we have a file containing a table of Drosophila gene names and their Entrez Gene identifiers, one pair on each line:

    Cyp12a5           42293
    ken and barbie    37785
    cop               45837
    bor               53565
    hangover          32613

• Suppose this table is in a tab-separated text file called "genes.txt"


Reading a table of data

• We can split each line into a list with two elements using the split command:

    >>> line = "Cyp12a5\t42293"     # \t is the TAB character
    >>> line.split("\t")
    ['Cyp12a5', '42293']

• The opposite of split is join, which makes a string from a list of strings:

    >>> "\t".join(['Cyp12a5', '42293'])
    'Cyp12a5\t42293'
    >>> print "\t".join(['Cyp12a5', '42293'])
    Cyp12a5    42293


Reading a table of data

• Reading the file:

    geneNames = []
    geneIDs = []
    for line in open("genes.txt"):
        # remove the trailing newline, then split at the TAB character
        geneName, geneID = line.rstrip("\n").split("\t")
        geneNames.append(geneName)
        geneIDs.append(geneID)
    print "Gene names:", ", ".join(geneNames)
    print "Gene IDs:", ", ".join(geneIDs)

• Output:

    Gene names: Cyp12a5, ken and barbie, cop, bor, hangover
    Gene IDs: 42293, 37785, 45837, 53565, 32613


Finding an entry

• The following code assumes that we have already read in the table from the file:

    import sys

    geneToFind = sys.argv[1]
    print "Searching for gene", geneToFind

    for i in range(len(geneNames)):
        if geneNames[i] == geneToFind:
            print "Found gene:", geneNames[i]
            print "Gene ID:", geneIDs[i]
            sys.exit()

    print "Couldn't find gene"

• Example output, with sys.argv[1] = "cop":

    Searching for gene cop
    Found gene: cop
    Gene ID: 45837


Dictionaries

• Conveniently, Python provides a type of associative array called a dictionary (also called a hash table) that does this kind of lookup for you
• A dictionary is a set of key–value pairs (such as our geneName–geneID table)
• Square brackets [] are used to index a dictionary:

    genes["cop"] = "45837"


Getting familiar with dictionaries

Creating an initial phone book:

    >>> tlf = {"Michael" : 40062, \
    ...        "Bingding" : 40064, "Andreas": 40063 }

Asking for all keys:

    >>> tlf.keys()
    ['Bingding', 'Andreas', 'Michael']

Asking for all values:

    >>> tlf.values()
    [40064, 40063, 40062]

Asking for a value, given a key:

    >>> tlf["Michael"]
    40062

Check if a key is in the dictionary:

    >>> "Lars" in tlf
    False

Inserting a single key–value pair:

    >>> tlf["Lars"] = 40070
    >>> tlf.has_key("Lars")   # now it's there
    True

Looping through the dictionary:

    >>> for name in tlf.keys():
    ...     print name, tlf[name]
    ...
    Lars 40070
    Bingding 40064
    Andreas 40063
    Michael 40062


Reading a tabular file into a dictionary

    import sys

    genes = {}   # creates an empty dictionary

    for line in open("genes.txt"):
        # remove the trailing newline, then split at the TAB character
        geneName, geneID = line.rstrip("\n").split("\t")
        genes[geneName] = geneID

    geneToFind = sys.argv[1]
    print "Gene:", geneToFind
    print "Gene ID:", genes[geneToFind]

...with sys.argv[1] = "cop" as before:

    Gene: cop
    Gene ID: 45837
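A minimal sketch of the same lookup, written in Python 3 syntax (the slides use Python 2). The helper name read_gene_table and the in-memory sample text are assumptions made for illustration; the sample stands in for genes.txt. It also shows dict.get, which returns a fallback value instead of raising KeyError when a gene is missing.

```python
import io

def read_gene_table(lines):
    """Map gene name -> gene ID from tab-separated lines (hypothetical helper)."""
    genes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, gene_id = line.split("\t")
        genes[name] = gene_id
    return genes

# In-memory stand-in for genes.txt, using rows from the slides
sample = io.StringIO("Cyp12a5\t42293\nken and barbie\t37785\ncop\t45837\n")
genes = read_gene_table(sample)

print(genes["cop"])                          # direct lookup; KeyError if absent
print(genes.get("unknown", "no such gene"))  # lookup with a fallback value
```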


Reading a FASTA file into a dictionary

• Dictionaries are an excellent choice for storing sequences:

    def read_fasta(filename):
        name = None
        name2seq = {}
        for line in open(filename):
            if line.startswith(">"):
                # "if name" only evaluates to False if it is still None
                # (when going over the first line)
                if name:
                    name2seq[name] = seq
                # new name is obtained from the line from the second letter on
                # (skipping the >), with the newline character removed
                name = line[1:].rstrip()
                seq = ""
            else:
                seq += line.rstrip()
        # set final entry, after loop
        name2seq[name] = seq
        return name2seq
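The parsing logic can be tried without a file. The sketch below is a Python 3 variant (the slides use Python 2) that accepts any iterable of lines; the function name read_fasta_lines and the sample records are assumptions for illustration. Unlike the slide version, it collects sequence pieces in a list and tolerates empty input.

```python
def read_fasta_lines(lines):
    """Parse FASTA-formatted lines into a dict of name -> sequence."""
    name2seq = {}
    name = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                name2seq[name] = "".join(chunks)  # store the previous record
            name = line[1:]                       # skip the leading ">"
            chunks = []
        elif name is not None:
            chunks.append(line)                   # sequence may span lines
    if name is not None:
        name2seq[name] = "".join(chunks)          # store the final record
    return name2seq

# Hypothetical two-record input
records = read_fasta_lines([">seq1\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"])
print(records)
```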


Reading a FASTA file the easy way

• BioPython (http://biopython.org) is a useful collection of Python code for computational molecular biology. Once you have installed it, you can use all its powerful functions.
• Reading from a FASTA file becomes quite easy:

    # Note: Bio.Fasta is the interface of older Biopython releases;
    # newer releases read FASTA files through Bio.SeqIO instead.
    from Bio import Fasta

    def read_fasta(filename):
        name2seq = {}
        iterator = Fasta.Iterator(open(filename), Fasta.RecordParser())

        for record in iterator:
            name2seq[record.title] = record.sequence

        return name2seq


Keys and values

• keys() returns the list of keys of the dictionary
  – e.g. names, in the name2seq dictionary
• values() returns the list of values
  – e.g. sequences, in the name2seq dictionary

    name2seq = read_fasta("C:/fly3utr.fa")

    print "Sequence names read: ", " ".join(name2seq.keys())
    print "Total length of sequences: ", len("".join(name2seq.values()))

Output:

    Sequence names read:  CG11488 CG11604 CG11455
    Total length of sequences:  210


Remember: why modules are cool

• Additional functionality that is not part of the core language can be loaded from modules:

    >>> import math
    >>> math.pi
    3.1415926535897931
    >>> help(math)
    Help on built-in module math:
    …

• You can write your own modules and import them. Never copy and paste code you wrote. Re-use your code!
• People already wrote modules that you can download and use. Don't re-invent the wheel!


Formatted output of sequences

    def print_seq(name, seq, width=50):
        print ">" + name
        i = 0
        while i < len(seq):
            print seq[i:i+width]
            i += width

    print_seq("Tata-box1", "TA"*50)
    print_seq("Tata-box2", "TA"*50, 30)

Default values can be assigned to parameters. They need to be placed rightmost. Here, the default for width is set to 50, giving 50-column output.

Output:

    >Tata-box1
    TATATATATATATATATATATATATATATATATATATATATATATATATA
    TATATATATATATATATATATATATATATATATATATATATATATATATA
    >Tata-box2
    TATATATATATATATATATATATATATATA
    TATATATATATATATATATATATATATATA
    TATATATATATATATATATATATATATATA
    TATATATATA
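The wrapping behaviour and the default parameter can be checked with a variant that returns the lines instead of printing them. This is a sketch in Python 3 syntax (the slides use Python 2); the name format_seq is an assumption, not part of the lecture code.

```python
def format_seq(name, seq, width=50):
    """Return a FASTA-style header followed by seq cut into width-sized lines."""
    lines = [">" + name]
    for i in range(0, len(seq), width):
        lines.append(seq[i:i + width])
    return lines

default_lines = format_seq("Tata-box1", "TA" * 50)          # width defaults to 50
narrow_lines = format_seq("Tata-box2", "TA" * 50, width=30)  # explicit width

print("\n".join(narrow_lines))
```

Returning a list keeps the function easy to test; a caller that wants the slide's behaviour can simply print each line.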


Comparing files with sequence names

• Easy way to specify a subset of a given FASTA database
• Each line is the name of a sequence in a given database
• Two files with sequence names: What is the overlap, difference, and union?

    C:/fosn1.txt    C:/fosn2.txt
    CG1167          CG215
    CG685           CG1041
    CG1041          CG483
    CG1043          CG1167
                    CG1163


Common set operations

    from sets import Set

    geneSet1 = Set([])
    geneSet2 = Set([])

    for line in open("C:/fosn1.txt"):
        geneSet1.add(line.rstrip())

    for line in open("C:/fosn2.txt"):
        geneSet2.add(line.rstrip())

    >>> geneSet1.intersection(geneSet2)
    Set(['CG1041', 'CG1167'])
    >>> geneSet2.difference(geneSet1)
    Set(['CG483', 'CG215', 'CG1163'])
    >>> geneSet1.union(geneSet2)
    Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685', 'CG1163', 'CG215'])

[Figure: Venn diagrams of two sets A and B illustrating difference, intersection, and union]
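In later Python versions the same operations are available on the built-in set type, so the sets module is not needed. The sketch below uses Python 3 syntax and the sequence names from the two example files as literals; the variable names are assumptions for illustration.

```python
# Contents of the two example files, written out as set literals
gene_set1 = {"CG1167", "CG685", "CG1041", "CG1043"}
gene_set2 = {"CG215", "CG1041", "CG483", "CG1167", "CG1163"}

overlap = gene_set1 & gene_set2      # same as gene_set1.intersection(gene_set2)
only_in_2 = gene_set2 - gene_set1    # same as gene_set2.difference(gene_set1)
union = gene_set1 | gene_set2        # same as gene_set1.union(gene_set2)

# Sets are unordered, so sort before printing for a stable display
print(sorted(overlap))
print(sorted(only_in_2))
print(sorted(union))
```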


More set operations

• Since every element in a set occurs only once, sets can be used to reduce redundancy

    >>> from sets import Set
    >>> Set([1,2,3,1,3,3])
    Set([1, 2, 3])

• A is a superset of B when A fully contains B. Test: A.issuperset(B)
• A is a subset of B when A is fully contained in B. Test: A.issubset(B)

    >>> pqs = Set("1kim 1dan 1bob".split())
    >>> pdb = Set("1bob 3mad 1dan 2bad 1kim".split())
    >>> pqs.issubset(pdb)
    True


The genetic code as a dictionary

    aa = {'ttt':'F', 'tct':'S', 'tat':'Y', 'tgt':'C',
          'ttc':'F', 'tcc':'S', 'tac':'Y', 'tgc':'C',
          'tta':'L', 'tca':'S', 'taa':'!', 'tga':'!',
          'ttg':'L', 'tcg':'S', 'tag':'!', 'tgg':'W',
          'ctt':'L', 'cct':'P', 'cat':'H', 'cgt':'R',
          'ctc':'L', 'ccc':'P', 'cac':'H', 'cgc':'R',
          'cta':'L', 'cca':'P', 'caa':'Q', 'cga':'R',
          'ctg':'L', 'ccg':'P', 'cag':'Q', 'cgg':'R',
          'att':'I', 'act':'T', 'aat':'N', 'agt':'S',
          'atc':'I', 'acc':'T', 'aac':'N', 'agc':'S',
          'ata':'I', 'aca':'T', 'aaa':'K', 'aga':'R',
          'atg':'M', 'acg':'T', 'aag':'K', 'agg':'R',
          'gtt':'V', 'gct':'A', 'gat':'D', 'ggt':'G',
          'gtc':'V', 'gcc':'A', 'gac':'D', 'ggc':'G',
          'gta':'V', 'gca':'A', 'gaa':'E', 'gga':'G',
          'gtg':'V', 'gcg':'A', 'gag':'E', 'ggg':'G' }


Translating: DNA to protein

    import sys

    def translate(dna):
        length = len(dna)
        if length % 3 != 0:
            print "Warning: Length is not a multiple of 3"
            sys.exit()
        amino_acids = []
        i = 0
        while i < length:
            codon = dna[i:i+3].lower()
            if codon not in aa:
                print "Codon %s is illegal" % codon
                sys.exit()
            amino_acids.append(aa[codon])
            i += 3
        return "".join(amino_acids)

    >>> translate("gatgacgaaagttgt")
    'DDESC'
    >>> translate("gatgacgaaagttgta")
    Warning: Length is not a multiple of 3
    … (SystemExit)
    >>> translate("gatgacgiaagttgt")
    Codon gia is illegal
    … (SystemExit)
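A possible variation, sketched in Python 3 syntax (the slides use Python 2): instead of calling sys.exit(), the function raises ValueError so that the caller can decide how to react. This is a design alternative, not the lecture's version. The codon table is rebuilt compactly from the standard ordering of bases (t, c, a, g); the names BASES, AMINO, and CODON_TABLE are assumptions, and '!' marks stop codons as in the dictionary on the previous slide.

```python
from itertools import product

BASES = "tcag"
# Amino acids in the order ttt, ttc, tta, ttg, tct, ... ('!' = stop codon)
AMINO = ("FFLLSSSSYY!!CC!W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): acid
               for codon, acid in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon; raise ValueError on bad input."""
    dna = dna.lower()
    if len(dna) % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    protein = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError("illegal codon: " + codon)
        protein.append(CODON_TABLE[codon])
    return "".join(protein)

print(translate("gatgacgaaagttgt"))
```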


Counting residue frequencies

    def count_residues(seq):
        freq = {}
        seq = seq.lower()
        for letter in seq:
            if letter in freq:
                freq[letter] += 1
            else:
                freq[letter] = 1
        return freq

    freq = count_residues("gatgacgaaagttgt")

    # display statistics
    for residue in freq.keys():
        print "%s : %s" % (residue, freq[residue])

Output:

    a : 5
    c : 1
    t : 4
    g : 5

Tricky question: Given the same sequence, does the output always look exactly the same?
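A hint for the tricky question: the counts are always the same, but the order in which a Python 2 dictionary yields its keys is an implementation detail and should not be relied on. Sorting the keys makes the display reproducible. The sketch below uses Python 3 syntax and collections.Counter from the standard library, which is a swapped-in shortcut for the counting loop above.

```python
from collections import Counter

# Counter does the "increment or start at 1" bookkeeping for us
freq = Counter("gatgacgaaagttgt".lower())

# Sorting the keys gives the same display order on every run
for residue in sorted(freq):
    print("%s : %d" % (residue, freq[residue]))
```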


Counting N-mer frequencies

    def count_nmers(seq, n):
        freq = {}
        seq = seq.lower()
        for i in range(len(seq) - n + 1):
            nmer = seq[i:i+n]
            if nmer in freq:
                freq[nmer] += 1   # increase counter
            else:
                freq[nmer] = 1    # first occurrence
        return freq

    freq = count_nmers("gatgacgaaagttgt", 2)

    # display statistics
    for residue in freq.keys():
        print "%s : %s" % (residue, freq[residue])

Output:

    cg : 1
    tt : 1
    ga : 3
    tg : 2
    gt : 2
    aa : 2
    ac : 1
    at : 1
    ag : 1
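The sliding-window count can also be sketched with collections.Counter (Python 3 syntax; the slides use Python 2). The optional freq argument is an assumption added here so that counts can be accumulated over several sequences. A sequence of length 15 contains 15 - 2 + 1 = 14 overlapping 2-mers, which is a handy sanity check on the totals.

```python
from collections import Counter

def count_nmers(seq, n, freq=None):
    """Count overlapping n-mers in seq, optionally adding to existing counts."""
    if freq is None:
        freq = Counter()
    seq = seq.lower()
    # One window per start position: len(seq) - n + 1 windows in total
    freq.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    return freq

dimers = count_nmers("gatgacgaaagttgt", 2)
for nmer in sorted(dimers):
    print("%s : %d" % (nmer, dimers[nmer]))
```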


N-mer frequencies for a whole file

    # Here we reuse a function we wrote earlier by importing it.
    # The first read_fasta is the filename (without .py),
    # the second is the function name.
    from read_fasta import read_fasta

    def count_nmers(seq, n, freq):
        seq = seq.lower()
        for i in range(len(seq)-n+1):
            nmer = seq[i : i+n]
            if nmer in freq:
                freq[nmer] += 1
            else:
                freq[nmer] = 1
        return freq

    name2seq = read_fasta("C:/fly3utr.fa")
    freq = {}
    # count for each sequence
    # (note how we pass freq back into the count_nmers function,
    # to get cumulative counts)
    for seq in name2seq.values():
        freq = count_nmers(seq, 2, freq)
    # display statistics
    for residue in freq.keys():
        print "%s : %s" % (residue, freq[residue])

Output:

    ct : 5
    tc : 9
    tt : 26
    cg : 4
    ga : 11
    tg : 12
    gc : 2
    gt : 17
    aa : 39
    ac : 10
    gg : 4
    at : 17
    ca : 11
    ag : 15
    ta : 20
    cc : 2


Files and file handles

Opening a file:

    fh = open(filename)        # gives a file handle

Closing a file:

    fh.close()

Read one line:

    data = fh.readline()

Read everything:

    data = fh.read()

Open write-only:

    fh = open(filename, "w")   # writing will overwrite current content!

Open for append:

    fh = open(filename, "a")   # writing will be appended at the end of the file

Writing a line:

    fh.write(line)
    print >> fh, line          # alternative (print adds a newline)

Test if file exists:

    import os
    if os.path.exists(filename):
        print "filename exists!"
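The operations above can be combined into a short round trip. The sketch is in Python 3 syntax (the slides use Python 2) and uses the with statement, which closes the file handle automatically; the scratch file path is created with tempfile and is an assumption for illustration only.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as fh:          # "w": overwrite existing content
    fh.write("first line\n")         # write() does not add a newline itself

with open(path, "a") as fh:          # "a": append at the end of the file
    print("second line", file=fh)    # print() adds the newline

with open(path) as fh:               # default mode: read-only
    lines = fh.read().splitlines()

print(os.path.exists(path), lines)
```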


Database access from Python

    # use the database module with all the DB-relevant sub-routines
    import MySQLdb
    # import the class that enables data retrieval as dictionaries
    from MySQLdb.cursors import DictCursor

    # Connection to database with access specification
    conn = MySQLdb.connect(db="scop",       # name of database
                           host="mydb",     # name of server
                           user="guest",    # username
                           passwd="guest")  # password
    # create an access pointer that retrieves data as dictionaries
    cursor = conn.cursor(DictCursor)

    # send an SQL query
    cursor.execute("SELECT * FROM cla LIMIT 10")
    # retrieve all rows as a tuple of dictionaries
    data = cursor.fetchall()

    # close connection
    conn.close()

• What are the tuples?
• What do the dictionaries contain? What are the keys?
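To explore the two questions without a MySQL server, the sketch below swaps in the sqlite3 module from the Python standard library with an in-memory database (Python 3 syntax). It is not the MySQLdb API from the slide: the table layout, column names, and rows are invented for illustration. The shape of the result is analogous, though: one entry per row, and each row is a dictionary whose keys are the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.row_factory = sqlite3.Row       # rows become addressable by column name
cursor = conn.cursor()

# Hypothetical stand-in for the "cla" table; columns and values are made up
cursor.execute("CREATE TABLE cla (entry_id TEXT, fold TEXT)")
cursor.executemany("INSERT INTO cla VALUES (?, ?)",
                   [("entry_a", "fold_1"), ("entry_b", "fold_2")])

cursor.execute("SELECT * FROM cla LIMIT 10")
# Convert each row to a plain dictionary: column name -> value
data = tuple(dict(row) for row in cursor.fetchall())
conn.close()

print(len(data), data[0])
```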


Python for Bioinformatics

Lecture 5: Advanced Programming Techniques


Local vs. global variables

• Global variables can be used everywhere

    def foo():
        print a

    # declare global variable
    a = 6
    print a
    foo()

  Output:

    6
    6

• Function variables are local by default

    def foo():
        a = 3      # local variable a
        print a

    a = 6          # global variable a
    print a
    foo()
    print a        # the global variable a does not change

  Output:

    6
    3
    6

• …unless you declare them to be global

    def foo():
        global a
        a = 3
        print a

    a = 6
    print a
    foo()
    print a        # now global variable a is changed

  Output:

    6
    3
    3


References in Python

• Lists, sets, dictionaries and all other changeable data types are referenced, i.e. when assigning a variable, no data is copied:

    >>> a = [1,2,3,4]
    >>> b = a
    >>> b[2] = 7
    >>> a
    [1, 2, 7, 4]

  Assigning b = a results in b pointing to the same list as a.

  [Figure: a and b both pointing to the same list [1, 2, 3, 4]]

• "Real copies" with the copy module:

    >>> from copy import copy
    >>> b = copy(a)
    >>> b[2] = 3
    >>> a
    [1, 2, 7, 4]

  Copying results in having two different lists assigned to a and b.

• Don't worry about any referencing. Python is doing the job! But be aware when you want to copy objects.
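One case worth being aware of: copy makes a shallow copy, so for nested structures such as a list of lists the inner lists are still shared. The copy module also provides deepcopy for fully independent copies. A small sketch in Python 3 syntax, with variable names chosen for illustration:

```python
from copy import copy, deepcopy

matrix = [[1, 2], [3, 4]]
alias = matrix             # no copy at all: same object
shallow = copy(matrix)     # new outer list, but the inner lists are shared
deep = deepcopy(matrix)    # new outer list and new inner lists

shallow.append([5, 6])     # only changes the outer list of the shallow copy
shallow[0][0] = 99         # changes a shared inner list: visible in matrix too
deep[1][1] = -1            # fully independent: matrix is unaffected

print(matrix, alias is matrix)
print(shallow)
print(deep)
```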


Matrices

• Easy solution: Lists of lists in core Python
• Access an element at position (i, j) in a list of lists: selecting from the i-th row the j-th element

    >>> m = [[1, 2], [3, 4]]
    >>> m[1]
    [3, 4]
    >>> m[1][1]
    4

• Disadvantages: operations such as addition, multiplication of lists would be slow, and would all need to be implemented
• Luckily: a big library is already available: numpy. Fast (since implemented in C), rich functionality
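To give an idea of what "would all need to be implemented" means, here is a sketch of element-wise addition and zero-matrix creation on plain lists of lists (Python 3 syntax). The helper names zeros_matrix and add_matrices are assumptions for illustration. Note that each row is built separately, so that rows do not end up being references to one and the same list.

```python
def zeros_matrix(rows, cols):
    """Create a rows x cols matrix of zeros with independent row lists."""
    return [[0] * cols for _ in range(rows)]

def add_matrices(a, b):
    """Element-wise sum of two equally sized lists of lists."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

m = [[1, 2], [3, 4]]
total = add_matrices(m, m)

grid = zeros_matrix(2, 3)
grid[0][0] = 1             # only the first row changes

print(total)
print(grid)
```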


Matrices with numpy

• Faster, more calculations (reshaping, built-in matrix operations) with external package numpy
• Various matrix creation methods with numpy:
  – from list of lists
  – filled with zeros
  – from a function
• Convenient access of multi-dimensional array elements

    >>> from numpy import *
    >>> m = [[1, 2], [3, 4]]
    >>> m1 = array(m)
    >>> m1
    array([[1, 2],
           [3, 4]])
    >>> m1.shape
    (2, 2)
    >>> zeros((3,5))
    array([[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]])
    >>> m2 = arange(8)     # same as array(range(8))
    >>> m2.shape = (2,4)
    >>> m2
    array([[0, 1, 2, 3],
           [4, 5, 6, 7]])
    >>> m2[1,2]
    6


Matrices with numpy

• You can select rows and columns...

    >>> m = arange(9)      # one-dim. array
    >>> m.shape = (3,3)
    >>> m
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> m[:,1]             # second column
    array([1, 4, 7])

• ...or even submatrices (same "slicing" as with lists)

    >>> m[1:,1:]
    array([[4, 5],
           [7, 8]])

• You can apply scalar operations to an array, such as
  – addition +
  – multiplication *
  – sine or cosine

    >>> m[1,:] + 1         # addition
    array([4, 5, 6])
    >>> m[0,:] * 5         # multiplication
    array([ 0,  5, 10])
    >>> sin(m1)            # m1 is array([[1, 2], [3, 4]]) from the previous slide
    array([[ 0.84147098,  0.90929743],
           [ 0.14112001, -0.7568025 ]])


More math

• Remember the mean and standard deviation from Lecture 3? Reuse of existing packages makes life easier:

    >>> data = array([1, 5, 1, 12, 3, 4, 6])
    >>> data.mean()
    4.5714285714285712
    >>> data.std()
    3.4992710611188254

• Or finding the maximum in a list now becomes:

    >>> data[argmax(data)]
    12

• numpy also provides functions for dot product, vector calculations etc.

    >>> dot(array([1,2,3]), array([1,2,3]))
    14
    >>> array([1,2,3]) + array([4,5,6])
    array([5, 7, 9])


Longest Common Subsequence in Python


By Michael Schroeder, Biotec, 2004

Formally: Longest Common Subsequence (LCS)

What is the length $s(V,W)$ of the longest common subsequence of two sequences $V = v_1 \ldots v_n$ and $W = w_1 \ldots w_m$?

Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t} = w_{j_t}$ for $1 \le t \le k$.

How? Dynamic programming: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} \\
s_{i,j-1} \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$

Then $s(V,W) = s_{n,m}$ is the length of the LCS.


Example LCS

             0   1   2   3   4   5   6
                 T   G   C   A   T   A
    0
    1   A
    2   T
    3   C
    4   T
    5   G
    6   A
    7   T


Example LCS:

             0   1   2   3   4   5   6
                 T   G   C   A   T   A
    0        0   0   0   0   0   0   0
    1   A    0
    2   T    0
    3   C    0
    4   T    0
    5   G    0
    6   A    0
    7   T    0

Initialisation: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$


Example LCS:

             0   1   2   3   4   5   6
                 T   G   C   A   T   A
    0        0   0   0   0   0   0   0
    1   A    0   0   0   0   1   1   1
    2   T    0
    3   C    0
    4   T    0
    5   G    0
    6   A    0
    7   T    0

Computing each cell:

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} \\
s_{i,j-1} \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$


Example LCS:

             0   1   2   3   4   5   6
                 T   G   C   A   T   A
    0        0   0   0   0   0   0   0
    1   A    0   0   0   0   1   1   1
    2   T    0   1   1   1   1   2   2
    3   C    0   1   1   2   2   2   2
    4   T    0   1   1   2   2   3   3
    5   G    0   1   2   2   2   3   3
    6   A    0   1   2   2   3   3   4
    7   T    0   1   2   2   3   4   4

Computing each cell:

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} \\
s_{i,j-1} \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$


LCS Algorithm

    LCS(V,W)
        For i = 1 to n
            s_{i,0} = 0
        For j = 1 to m
            s_{0,j} = 0
        For i = 1 to n
            For j = 1 to m
                If v_i = w_j and s_{i-1,j-1} + 1 ≥ s_{i-1,j} and s_{i-1,j-1} + 1 ≥ s_{i,j-1} Then
                    s_{i,j} = s_{i-1,j-1} + 1
                    b_{i,j} = North West
                Else if s_{i-1,j} ≥ s_{i,j-1} Then
                    s_{i,j} = s_{i-1,j}
                    b_{i,j} = North
                Else
                    s_{i,j} = s_{i,j-1}
                    b_{i,j} = West
        Return s and b

Complexity: LCS has quadratic complexity: O(n m)
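The pseudocode can be sketched in plain Python 3 with lists of lists. The version below also follows the backtracking pointers b from the bottom-right cell to recover one longest common subsequence. That traceback step is an addition for illustration, as are the function name lcs and the direction labels "NW", "N", "W". When several subsequences of maximal length exist, which one is returned depends on the tie-breaking order.

```python
def lcs(v, w):
    """Return (length, one longest common subsequence) of strings v and w."""
    n, m = len(v), len(w)
    # s[i][j]: LCS length of v[:i] and w[:j]; row 0 and column 0 stay 0
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # -1 marks "no match", so the diagonal move can never win then
            diag = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            if diag >= s[i - 1][j] and diag >= s[i][j - 1]:
                s[i][j], b[i][j] = diag, "NW"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "N"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "W"
    # Traceback: every "NW" step contributes one matched character
    i, j, matched = n, m, []
    while i > 0 and j > 0:
        if b[i][j] == "NW":
            matched.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "N":
            i -= 1
        else:
            j -= 1
    return s[n][m], "".join(reversed(matched))

length, common = lcs("ATCTGAT", "TGCATA")
print(length, common)
```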


LCS in Python

    from numpy import *

    seq1 = "ATCTGATC"
    seq2 = "TGCATA"
    len1 = len(seq1)
    len2 = len(seq2)

    def max3(a,b,c):
        return max(max(a, b), c)

    # create an integer array val with (len1 + 1) rows and (len2 + 1) columns
    val = zeros((len1+1, len2+1), dtype=int)

    for i in range(1, len1+1):
        for j in range(1, len2+1):
            if seq1[i-1] == seq2[j-1]:
                val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1)
            else:
                val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1])
    print val
    lcs = val[len1, len2]
    print "The longest common subsequence of %s and %s is %d" % (seq1, seq2, lcs)


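Longest Common Subsequence Output

Result of print val (one row per letter of ATCTGATC, one column per letter of TGCATA, plus the initial row and column of zeros):

    [[0 0 0 0 0 0 0]
     [0 0 0 0 1 1 1]
     [0 1 1 1 1 2 2]
     [0 1 1 2 2 2 2]
     [0 1 1 2 2 3 3]
     [0 1 2 2 2 3 3]
     [0 1 2 2 3 3 4]
     [0 1 2 2 3 4 4]
     [0 1 2 3 3 4 4]]

Final result:

    The longest common subsequence of ATCTGATC and TGCATA is 4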


Classes

• Define a class to store PDB residues. A residue has: a name, a position in the sequence, and a list of atoms. An atom has a name and coordinates. Define two methods: add_residue and add_atom

    class PDBStructure:
        def __init__(self):
            # list of residue dictionaries, filled by add_residue
            self.residues = []

        def add_residue(self, name, posseq):
            residue = {'name': name, 'posseq': posseq, 'atoms': []}
            self.residues.append(residue)
            return residue

        def add_atom(self, residue, name, coord):
            atom = {'name': name, 'coord': coord}
            residue['atoms'].append(atom)
            return atom


Classes: Usage

    struct = PDBStructure()   # create an instance of a class

    residue = struct.add_residue(name="ILE", posseq=1)
    struct.add_atom(residue, name="N", coord=(23.46, -8.01, -15.26))
    struct.add_atom(residue, name="CZ", coord=(125.50, 4.50, -19.14))
    residue = struct.add_residue(name="LYS", posseq=2)
    struct.add_atom(residue, name="OE1", coord=(126.12, -1.78, -15.04))

    print struct.residues

Output:

    [{'name': 'ILE', 'posseq': 1, 'atoms':
         [{'name': 'N', 'coord': (23.46, -8.01, -15.26)},
          {'name': 'CZ', 'coord': (125.50, 4.50, -19.14)}]},
     {'name': 'LYS', 'posseq': 2, 'atoms':
         [{'name': 'OE1', 'coord': (126.12, -1.78, -15.04)}]}]
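A self-contained sketch of the class and its usage in Python 3 syntax (the slides use Python 2). It assumes the residue list is created in an __init__ method and stored in an attribute called residues; the atom_count method is a hypothetical extra that shows how a method can work on the stored data.

```python
class PDBStructure:
    """Stores residues as dictionaries, each with a list of atom dictionaries."""

    def __init__(self):
        self.residues = []          # filled by add_residue

    def add_residue(self, name, posseq):
        residue = {"name": name, "posseq": posseq, "atoms": []}
        self.residues.append(residue)
        return residue

    def add_atom(self, residue, name, coord):
        atom = {"name": name, "coord": coord}
        residue["atoms"].append(atom)
        return atom

    def atom_count(self):
        # Hypothetical helper: total number of atoms over all residues
        return sum(len(residue["atoms"]) for residue in self.residues)

struct = PDBStructure()
residue = struct.add_residue(name="ILE", posseq=1)
struct.add_atom(residue, name="N", coord=(23.46, -8.01, -15.26))
struct.add_atom(residue, name="CZ", coord=(125.50, 4.50, -19.14))
residue = struct.add_residue(name="LYS", posseq=2)
struct.add_atom(residue, name="OE1", coord=(126.12, -1.78, -15.04))

print(struct.atom_count())
print([r["name"] for r in struct.residues])
```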