2015 bioinformatics python_io_wim_vancriekinge

FBW 20-10-2015

Wim Van Criekinge

Bioinformatics.be

Overview

What is Python?
Why Python 4 Bioinformatics?
How to Python

IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)

Strings
Regular expressions

Python

• Programming languages are overrated
  – If you are going into bioinformatics you will probably learn/need several
  – If you know one, you know 90% of a second
• Choice does matter, but it matters far less than people think it does
• Why Python?
  – Lets you start writing useful programs asap
  – Built-in libraries, plus add-on packages such as BioPython
  – Free, runs on most platforms, widely used (also scientifically)
• Versus Perl?
  – Incredibly similar
  – Consistent syntax, indentation

Version 2.7 and 3.4 on athena.ugent.be

Where is the workspace ?

GitHub: Hosted GIT

• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
  – Accept invitation from Bioinformatics-I-2015
• URI:
  – https://github.ugent.be/Bioinformatics-I-2015/Python.git

http://github.ugent.be/

Run Install.py (is BioPython installed ?)

import pip
import sys
import platform
import webbrowser

print("Python " + platform.python_version() + " installed packages:")

# Note: pip.get_installed_distributions() was removed in pip 10,
# so this call only works with the older pip versions of the time.
installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version) for i in installed_packages])
print(*installed_packages_list, sep="\n")

Control Structures

if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break
continue
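
The control structures above can be tried on a short DNA string. This is only a sketch: the sequence and the variable names are made up for illustration and do not come from the course files.

```python
# Hypothetical example sequence, not taken from the course files.
sequence = "ATGCNNGGCTAA"

gc = 0
unknown = 0
for base in sequence:            # for: visit each character in turn
    if base == "N":
        unknown += 1
        continue                 # continue: skip ambiguous bases
    elif base in "GC":
        gc += 1
    else:
        pass                     # A or T: nothing to count

# while + break: position of the first TAA codon in reading frame 0
pos = 0
stop_at = -1                     # -1 means "no stop codon found"
while pos + 3 <= len(sequence):
    if sequence[pos:pos + 3] == "TAA":
        stop_at = pos
        break                    # break: leave the loop at the first hit
    pos += 3

print(gc, unknown, stop_at)
```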

Lists

• Flexible arrays, not Lisp-like linked lists

• a = [99, "bottles of beer", ["on", "the", "wall"]]

• Same operators as for strings
  • a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
  • a[0] = 98
  • a[1:2] = ["bottles", "of", "beer"]
    -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
  • del a[-1] # -> [98, "bottles", "of", "beer"]

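The list operations from this slide can be run as one small script. The second list b is a hypothetical addition, needed only to show a+b.

```python
a = [99, "bottles of beer", ["on", "the", "wall"]]
b = ["take", "one", "down"]      # hypothetical second list, only for a + b

joined = a + b                   # concatenation gives a new list
tripled_len = len(a * 3)         # repetition: 3 copies of the 3 items
first, last, rest = a[0], a[-1], a[1:]

a[0] = 98                                # item assignment
a[1:2] = ["bottles", "of", "beer"]       # slice assignment can change the length
after_slice = list(a)                    # copy, to inspect this intermediate state
del a[-1]                                # remove the nested list at the end
print(a)
```
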
Dictionaries

• Hash tables, "associative arrays"
  • d = {"duck": "eend", "water": "water"}
• Lookup:
  • d["duck"] -> "eend"
  • d["back"] # raises KeyError exception
• Delete, insert, overwrite:
  • del d["water"] # {"duck": "eend"}
  • d["back"] = "rug" # {"duck": "eend", "back": "rug"}
  • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}

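A minimal sketch of the same dictionary operations, including two common ways to deal with a missing key. The names missing and fallback are illustrative only.

```python
d = {"duck": "eend", "water": "water"}

# Looking up a missing key raises KeyError, which can be caught
try:
    d["back"]
    missing = False
except KeyError:
    missing = True

# .get() returns a default value instead of raising
fallback = d.get("back", "?")

del d["water"]          # delete
d["back"] = "rug"       # insert
d["duck"] = "duik"      # overwrite
print(d)
```
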
Regex.py

import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))

Find the answer in ultimate-sequence.txt?

>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTT

TTCGTGCTATGGAAGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAATGGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAATGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTTTAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTCCCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA

Question 2

AA1 = {'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','UCC':'S','UCA':'S','UCG':'S','UAU':'Y','UAC':'Y','UAA':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':'W','CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P','CCC':'P','CCA':'P','CCG':'P','CAU':'H','CAC':'H','CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R','AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R','GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G' }

Hint: Use Dictionaries

Hint 2: Translations

Python way:
tab = str.maketrans("ACGU", "UGCA")
sequence = sequence.translate(tab)[::-1]
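
Question 2 is not spelled out on the slide, so the following is only a guess at how the two hints could combine: a dictionary lookup per codon, and str.maketrans for the reverse complement. AA1_SUBSET is a small hypothetical subset of the AA1 table, and marking unknown codons with 'X' is an assumption.

```python
# Small subset of the AA1 codon table; a real run would use all 64 codons.
AA1_SUBSET = {'AUG': 'M', 'GCU': 'A', 'UUU': 'F', 'AAA': 'K', 'UAA': '*'}

def reverse_complement(rna):
    """Complement each base, then reverse the string."""
    tab = str.maketrans("ACGU", "UGCA")
    return rna.translate(tab)[::-1]

def translate(rna, table):
    """Translate codon by codon in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = table.get(rna[i:i + 3], 'X')   # 'X' for codons not in the table (assumption)
        if aa == '*':
            break
        protein.append(aa)
    return "".join(protein)

rna = "AUGGCUUUUAAAUAA"                     # hypothetical example sequence
protein = translate(rna, AA1_SUBSET)
revcomp = reverse_complement("AUGC")
print(protein, revcomp)
```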


Reading Files

name = open("filename")
– opens the given file for reading, and returns a file object

name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop

>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2\n456 Brad 4.0 11.6 6.5 2.7 12\n789 Jenn 8.0 8.0 8.0 8.0 7.5\n'
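
A minimal sketch of the three read methods. It first recreates hours.txt in a temporary directory so that it runs on its own. The with statement is not on the slide; it is used here because it closes the file automatically.

```python
import os
import tempfile

# Recreate the hours.txt example in a temporary directory (illustration only).
path = os.path.join(tempfile.mkdtemp(), "hours.txt")
with open(path, "w") as out:
    out.write("123 Susan 12.5 8.1 7.6 3.2\n"
              "456 Brad 4.0 11.6 6.5 2.7 12\n"
              "789 Jenn 8.0 8.0 8.0 8.0 7.5\n")

with open(path) as f:
    whole = f.read()             # entire contents as one string

with open(path) as f:
    first = f.readline()         # next line, including the trailing "\n"
    remaining = f.readlines()    # the rest of the file as a list of lines

print(repr(first), len(remaining))
```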


File Input Template

• A template for reading files in Python:

name = open("filename")
for line in name:
    statements

>>> input = open("hours.txt")
>>> for line in input:
...     print(line.strip()) # strip() removes \n

123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
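
The same template becomes useful once each line is split into fields. The sketch below assumes the columns are an id, a name, and then hours worked. The slide does not say this, so treat the column meaning as a guess. A list of strings stands in for the open file.

```python
# Stands in for an open file: iterating over a file yields lines like these.
lines = [
    "123 Susan 12.5 8.1 7.6 3.2\n",
    "456 Brad 4.0 11.6 6.5 2.7 12\n",
    "789 Jenn 8.0 8.0 8.0 8.0 7.5\n",
]

totals = {}
for line in lines:
    fields = line.split()        # split on whitespace; also drops the "\n"
    if not fields:
        continue                 # skip empty lines
    name = fields[1]             # assumed layout: id, name, then numbers
    totals[name] = sum(float(x) for x in fields[2:])

for name, hours in totals.items():
    print(name, round(hours, 1))
```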


Writing Files

name = open("filename", "w")
– opens file for write (deletes previous contents)
name = open("filename", "a")
– opens file for append (new data goes after previous data)

name.write(str) - writes the given string to the file
name.close() - saves file once writing is done

>>> out = open("output.txt", "w")
>>> out.write("Hello, world!\n")
>>> out.write("How are you?")
>>> out.close()

>>> open("output.txt").read()
'Hello, world!\nHow are you?'

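A small sketch contrasting "w" and "a". It writes to a temporary directory instead of the current one, and uses with so that close() is called automatically. Both choices are additions for illustration and are not from the slide.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w") as out:     # "w": start with an empty file
    out.write("Hello, world!\n")

with open(path, "a") as out:     # "a": keep what is there, add at the end
    out.write("How are you?")

with open(path) as f:
    content = f.read()
print(repr(content))
```
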
Question 3. Swiss-Knife.py

• Using a database as input! Parse the entire Swiss-Prot collection
  – How many entries are there?
  – Average protein length (in aa and MW)
  – Relative frequency of amino acids
• Compare to the ones used to construct the PAM scoring matrices from 1978 – 1991
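
One possible starting point, not the reference solution: a line-based reader for the Swiss-Prot flat-file layout, where ID lines name the entry, the sequence follows the SQ line, and // ends a record. The function name parse_swissprot and the two demo records are made up. The records are heavily shortened and are not real entries.

```python
import io

def parse_swissprot(handle):
    """Yield (entry_name, sequence) for each record in a Swiss-Prot flat file."""
    name, seq_parts, in_seq = None, [], False
    for line in handle:
        if line.startswith("ID "):
            name = line.split()[1]
        elif line.startswith("SQ "):
            in_seq = True                    # sequence lines follow the SQ line
        elif line.startswith("//"):
            yield name, "".join(seq_parts)   # end of record
            name, seq_parts, in_seq = None, [], False
        elif in_seq:
            seq_parts.append(line.replace(" ", "").strip())

# Two made-up, heavily shortened records (not real Swiss-Prot entries).
demo = io.StringIO(
    "ID   DEMO1_TEST   Reviewed;   12 AA.\n"
    "SQ   SEQUENCE   12 AA;  0 MW;  0 CRC64;\n"
    "     MKTAYIAKQR QI\n"
    "//\n"
    "ID   DEMO2_TEST   Reviewed;   8 AA.\n"
    "SQ   SEQUENCE   8 AA;  0 MW;  0 CRC64;\n"
    "     MARRAGLL\n"
    "//\n"
)

records = list(parse_swissprot(demo))
n_entries = len(records)
avg_length = sum(len(seq) for _, seq in records) / n_entries
print(n_entries, avg_length)
```

For the real data, the handle could come from gzip.open("uniprot_sprot.dat.gz", "rt"), which reads the compressed file line by line without unzipping it to disk first. The exact file name and location are assumptions.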

Question 3: Getting the database

Uniprot_sprot.dat.gz – 528 MB (on GitHub under Files)
Unzipped: 2.92 GB!

http://www.ebi.ac.uk/uniprot/download-center


Amino acid frequencies

     1978    1991
L    0.085   0.091
A    0.087   0.077
G    0.089   0.074
S    0.070   0.069
V    0.065   0.066
E    0.050   0.062
T    0.058   0.059
K    0.081   0.059
I    0.037   0.053
D    0.047   0.052
R    0.041   0.051
P    0.051   0.051
N    0.040   0.043
Q    0.038   0.041
F    0.040   0.040
Y    0.030   0.032
M    0.015   0.024
H    0.034   0.023
C    0.033   0.020
W    0.010   0.014

Second step: Frequencies of Occurrence
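
A sketch of the counting step using collections.Counter. The two toy sequences are hypothetical stand-ins for the full Swiss-Prot set, and ref_1991 holds only three of the 1991 values from the table.

```python
from collections import Counter

# Hypothetical toy sequences; the real input would be every Swiss-Prot sequence.
sequences = ["MKTAYIAKQRQI", "MARRAGLL"]

counts = Counter()
for seq in sequences:
    counts.update(seq)           # adds one count per residue in the string

total = sum(counts.values())
freqs = {aa: n / total for aa, n in counts.items()}

# A few 1991 reference values copied from the table.
ref_1991 = {"L": 0.091, "A": 0.077, "K": 0.059}
for aa, ref in ref_1991.items():
    print(aa, round(freqs.get(aa, 0.0), 3), ref)
```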

Extra Questions

• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the shortest sequence? Is there more than one record with that length?
• What is the identifier for the record with the longest sequence? Is there more than one record with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the description?
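
Once the records are in memory, most of these questions reduce to one-liners. The sketch assumes a dictionary mapping identifier to (description, sequence). That layout, the three demo records and the smaller numbers (length 8, first 5 residues) are all made up for illustration.

```python
# Hypothetical records: identifier -> (description, sequence); not real data.
records = {
    "DEMO1_TEST": ("Demo protein (KCIP-1)", "MKTAYIAKQRQI"),
    "DEMO2_TEST": ("Another demo protein", "MARRAGLL"),
    "DEMO3_TEST": ("Third demo protein", "MARRAKLL"),
}

# How many records have a sequence of a given length?
n_len_8 = sum(1 for _, seq in records.values() if len(seq) == 8)

# First residues of one named record
first_5 = records["DEMO1_TEST"][1][:5]

# Shortest sequence, and every identifier that shares that length
shortest = min(len(seq) for _, seq in records.values())
shortest_ids = sorted(rid for rid, (_, seq) in records.items() if len(seq) == shortest)

# Substring searches in the sequence and in the description
n_arra = sum(1 for _, seq in records.values() if "ARRA" in seq)
n_kcip = sum(1 for desc, _ in records.values() if "KCIP-1" in desc)

print(n_len_8, first_5, shortest_ids, n_arra, n_kcip)
```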

Question 4

• Program your own prosite parser !

• Download prosite pattern database (prosite.dat)

• Automatically generate >2000 search patterns, and search in sequence set from question 1
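
The core of such a parser is turning one PROSITE pattern into a regular expression. A minimal sketch, assuming the usual pattern syntax: elements separated by "-", x for any residue, [..] for alternatives, {..} for exclusions, (n) or (n,m) for repeats, < and > for the sequence ends, and a closing ".". The function name prosite_to_regex is hypothetical. Reading the patterns out of prosite.dat is left out, as are rarer cases such as ">" inside square brackets.

```python
import re

def prosite_to_regex(pattern):
    """Convert one PROSITE pattern string into a Python regular expression."""
    pattern = pattern.rstrip(".")
    regex = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # split off an optional repeat count such as (3) or (2,4)
        m = re.match(r"^(.*?)(?:\((\d+(?:,\d+)?)\))?$", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                      # x = any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {..} = anything except these
        if repeat:
            core += "{" + repeat + "}"
        regex.append(prefix + core + suffix)
    return "".join(regex)

# Example pattern in PROSITE syntax and a made-up sequence that contains it.
rx = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
sequence = "MK" + "CAAC" + "AAA" + "L" + "A" * 8 + "HAAAH" + "GG"
hit = re.search(rx, sequence)
print(rx, hit.start(), hit.end())
```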
