2015 bioinformatics python_io_wim_vancriekinge
FBW20-10-2015
Wim Van Criekinge
Bioinformatics.be
Overview
What is Python ?
Why Python 4 Bioinformatics ?
How to Python
IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)
Strings
Regular expressions
Python
• Programming languages are overrated
– If you are going into bioinformatics you will probably learn/need several
– If you know one you know 90% of a second
• Choice does matter but it matters far less than people think it does
• Why Python?
– Lets you start writing useful programs ASAP
– Built-in libraries, plus packages such as BioPython
– Free, most platforms, widely (scientifically) used
• Versus Perl?
– Incredibly similar
– Consistent syntax, indentation
Version 2.7 and 3.4 on athena.ugent.be
Where is the workspace ?
GitHub: Hosted GIT
• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
– Accept invitation from Bioinformatics-I-2015
• URI:
– https://github.ugent.be/Bioinformatics-I-2015/Python.git
Run Install.py (is BioPython installed ?)
import platform
from importlib import metadata  # standard library, Python 3.8+

# pip.get_installed_distributions() was removed in pip 10,
# so the installed packages are listed via importlib.metadata instead
print("Python " + platform.python_version() + " installed packages:")
installed_packages_list = sorted("%s==%s" % (d.metadata["Name"], d.version) for d in metadata.distributions())
print(*installed_packages_list, sep="\n")
Control Structures
if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break
continue
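As a small illustration of these control structures, the sketch below counts nucleotides in a made-up DNA string. The function name `count_bases`, the sample sequence and the use of "*" as an end marker are assumptions for illustration only.

```python
def count_bases(sequence):
    """Count A, C, G and T in a DNA string, stopping at the first '*'."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for base in sequence:            # for loop over the characters
        if base == "*":              # hypothetical end marker
            break                    # leave the loop early
        elif base not in counts:     # anything that is not A, C, G or T
            continue                 # skip to the next character
        else:
            counts[base] += 1
    return counts


counts = count_bases("ACGTNNACG*TTTT")
print(counts)

# while loop: advance a position until the first G is found
position = 0
sequence = "AACGT"
while position < len(sequence) and sequence[position] != "G":
    position += 1
print(position)
```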
Lists
• Flexible arrays, not Lisp-like linked lists
• a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
  -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]
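The same list operations are handy for sequence work. A minimal sketch, assuming a made-up RNA string that is cut into codons (the variable names are illustrative):

```python
sequence = "AUGGCUUUCUAA"   # made-up RNA fragment

# Build a list of codons by slicing every three characters
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

start = codons[0]       # first item
stop = codons[-1]       # last item
body = codons[1:-1]     # slice without the first and last codon

codons[1:2] = ["GCC", "GCA"]   # slice assignment: one item replaced by two
del codons[-1]                 # remove the last codon
print(codons)
```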
Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend"}
• d["back"] = "rug" # {"duck": "eend", "back": "rug"}
• d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
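Since a missing key raises KeyError, it can help to test for the key or to ask for a default. A short sketch on the same dictionary; the default value "?" is an arbitrary choice:

```python
d = {"duck": "eend", "water": "water"}

# Two ways to avoid the KeyError on a missing key
has_back = "back" in d            # membership test
fallback = d.get("back", "?")     # returns the default when the key is absent

del d["water"]
d["back"] = "rug"
d["duck"] = "duik"

# Iterate over key/value pairs in alphabetical key order
for english, dutch in sorted(d.items()):
    print(english, "->", dutch)
```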
Regex.py
import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

# finditer yields one match object per non-overlapping occurrence
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))
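The same `re.finditer` call works on nucleotide strings. A minimal sketch on a made-up sequence, showing a literal pattern and a quantifier; the sequence and variable names are assumptions:

```python
import re

text = "GGATGTTTTTCCATGAAAA"   # made-up sequence

# Literal pattern: start position of every ATG
atg_positions = [m.start() for m in re.finditer("ATG", text)]

# Quantifier: runs of at least four consecutive T's, as (start, end) pairs
t_runs = [(m.start(), m.end()) for m in re.finditer("T{4,}", text)]

print(atg_positions)
print(t_runs)
```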
Find the answer in ultimate-sequence.txt ?
>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTT
TTCGTGCTATGGAAGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAATGGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAATGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTTTAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTCCCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA
Question 2
AA1 = {'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','UCC':'S','UCA':'S','UCG':'S','UAU':'Y','UAC':'Y','UAA':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':'W','CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P','CCC':'P','CCA':'P','CCG':'P','CAU':'H','CAC':'H','CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R','AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R','GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G' }
Hint: Use Dictionaries
Hint 2: Translations
Python way:
tab = str.maketrans("ACGU","UGCA")
sequence = sequence.translate(tab)[::-1]
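Question 2 is not spelled out on the slide, so the following is only a sketch of how the two hints might combine: a dictionary lookup per codon, and `str.maketrans` for the reverse complement. `AA1_SUBSET` is a small excerpt of the AA1 table, and the function names and the "X" placeholder for unknown codons are assumptions.

```python
# Excerpt of the AA1 codon table, enough for this toy example
AA1_SUBSET = {"AUG": "M", "GCU": "A", "UUC": "F", "UAA": "*",
              "UUA": "L", "GAA": "E", "AGC": "S", "CAU": "H"}


def reverse_complement(rna):
    """Complement each base, then reverse the string."""
    tab = str.maketrans("ACGU", "UGCA")
    return rna.translate(tab)[::-1]


def translate(rna, table):
    """Translate an RNA string codon by codon, starting at position 0."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        protein.append(table.get(codon, "X"))   # "X" marks codons missing from the table
    return "".join(protein)


dna = "ATGGCTTTCTAA"            # made-up DNA fragment
rna = dna.replace("T", "U")     # the table is keyed on RNA codons

forward = translate(rna, AA1_SUBSET)
reverse = translate(reverse_complement(rna), AA1_SUBSET)
print(forward, reverse)
```

With the full AA1 dictionary in place of the excerpt, the same two functions cover any reading frame by slicing the input first, for example `rna[1:]` or `rna[2:]`.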
Reading Files
name = open("filename")
– opens the given file for reading, and returns a file object
name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop
>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2\n456 Brad 4.0 11.6 6.5 2.7 12\n789 Jenn 8.0 8.0 8.0 8.0 7.5\n'
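The difference between the three read methods is easiest to see side by side. A self-contained sketch that first writes a throwaway file; the file name and contents are made up:

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on hours.txt
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as out:
    out.write("first\nsecond\nthird\n")

with open(path) as f:
    whole = f.read()          # one string, newlines included

with open(path) as f:
    line1 = f.readline()      # first line only
    line2 = f.readline()      # the next call continues where the last stopped
    rest = f.read()           # whatever has not been read yet

with open(path) as f:
    lines = f.readlines()     # list with one string per line
```

The `with` statement closes the file automatically at the end of each block.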
File Input Template
• A template for reading files in Python:
name = open("filename")
for line in name:
    statements

>>> input = open("hours.txt")
>>> for line in input:
...     print(line.strip()) # strip() removes \n
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
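The template becomes useful once each line is split into fields. A minimal sketch, assuming the columns of hours.txt are an ID, a name and then a variable number of hours; the file is recreated in a temporary directory so the example runs on its own:

```python
import os
import tempfile

# Recreate hours.txt in a temporary directory (the path is illustrative)
path = os.path.join(tempfile.mkdtemp(), "hours.txt")
with open(path, "w") as out:
    out.write("123 Susan 12.5 8.1 7.6 3.2\n"
              "456 Brad 4.0 11.6 6.5 2.7 12\n"
              "789 Jenn 8.0 8.0 8.0 8.0 7.5\n")

totals = {}
with open(path) as handle:
    for line in handle:
        fields = line.split()                    # split on whitespace
        name = fields[1]                         # assumed: second column is the name
        hours = [float(x) for x in fields[2:]]   # assumed: the rest are hours
        totals[name] = sum(hours)

for name, total in sorted(totals.items()):
    print("%s worked %.1f hours" % (name, total))
```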
Writing Files
name = open("filename", "w")
name = open("filename", "a")
– opens file for write (deletes previous contents), or
– opens file for append (new data goes after previous data)
name.write(str) - writes the given string to the file
name.close() - saves file once writing is done
>>> out = open("output.txt", "w")
>>> out.write("Hello, world!\n")
>>> out.write("How are you?")
>>> out.close()
>>> open("output.txt").read()
'Hello, world!\nHow are you?'
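A variant of the same example using the `with` statement, which closes the file even if an error occurs, and showing the difference between "w" and "a". The temporary path is an assumption so that the sketch does not overwrite an existing output.txt:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")   # illustrative location

with open(path, "w") as out:      # "w" creates the file or empties an existing one
    out.write("Hello, world!\n")

with open(path, "a") as out:      # "a" keeps the contents and adds to the end
    out.write("How are you?")

with open(path) as f:
    contents = f.read()
print(repr(contents))
```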
Question 3. Swiss-Knife.py
• Using a database as input! Parse the entire Swiss-Prot collection
– How many entries are there ?
– Average protein length (in aa and MW)
– Relative frequency of amino acids
• Compare to the frequencies used to construct the PAM scoring matrices from 1978 – 1991
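One possible shape for such a parser, as a sketch rather than the course solution. It assumes the Swiss-Prot flat-file layout in which each record starts with an ID line, carries length and molecular weight on the SQ line, lists the sequence on the following indented lines and ends with "//". The two records in `SAMPLE` are invented, and `parse_swissprot` is a hypothetical helper name.

```python
import io
from collections import Counter

# Two invented records in the flat-file layout (not real Swiss-Prot entries)
SAMPLE = """\
ID   TEST1_HUMAN             Reviewed;          10 AA.
DE   RecName: Full=Hypothetical protein 1;
SQ   SEQUENCE   10 AA;  1111 MW;  0000000000000000 CRC64;
     MKTAYIAKQR
//
ID   TEST2_HUMAN             Reviewed;          14 AA.
DE   RecName: Full=Hypothetical protein 2;
SQ   SEQUENCE   14 AA;  1555 MW;  0000000000000000 CRC64;
     MAAAGLLKKW WWCC
//
"""


def parse_swissprot(handle):
    """Yield (identifier, molecular_weight, sequence) for each record."""
    identifier, weight, chunks, in_sequence = None, None, [], False
    for line in handle:
        if line.startswith("ID"):
            identifier = line.split()[1]
        elif line.startswith("SQ"):
            weight = int(line.split()[4])     # "SQ SEQUENCE n AA; w MW; ..."
            in_sequence = True
        elif line.startswith("//"):           # end of record
            yield identifier, weight, "".join(chunks)
            identifier, weight, chunks, in_sequence = None, None, [], False
        elif in_sequence:
            chunks.append(line.replace(" ", "").strip())


# Accumulate statistics record by record, without keeping all records in memory
n_entries = 0
total_length = 0
total_weight = 0
counts = Counter()
for identifier, weight, sequence in parse_swissprot(io.StringIO(SAMPLE)):
    n_entries += 1
    total_length += len(sequence)
    total_weight += weight
    counts.update(sequence)

mean_length = total_length / n_entries
mean_weight = total_weight / n_entries
frequencies = {aa: n / total_length for aa, n in counts.items()}
print(n_entries, mean_length, mean_weight)
```

For the real database the `io.StringIO(SAMPLE)` handle would presumably be replaced by something like `gzip.open("uniprot_sprot.dat.gz", "rt")`, which reads the compressed file as text without unzipping it first. BioPython's `Bio.SwissProt` module offers a ready-made parser as an alternative.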
Question 3: Getting the database
Uniprot_sprot.dat.gz – 528 MB (on GitHub under Files)
Unzipped 2.92 GB !
http://www.ebi.ac.uk/uniprot/download-center
Amino acid frequencies
   1978   1991
L  0.085  0.091
A  0.087  0.077
G  0.089  0.074
S  0.070  0.069
V  0.065  0.066
E  0.050  0.062
T  0.058  0.059
K  0.081  0.059
I  0.037  0.053
D  0.047  0.052
R  0.041  0.051
P  0.051  0.051
N  0.040  0.043
Q  0.038  0.041
F  0.040  0.040
Y  0.030  0.032
M  0.015  0.024
H  0.034  0.023
C  0.033  0.020
W  0.010  0.014
Second step: Frequencies of Occurrence
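A sketch of how measured frequencies could be compared with the published ones. `PAM_FREQUENCIES` holds the 1978 and 1991 values from the amino acid frequency table; the helper name `differences` is hypothetical, and the "observed" values are a stand-in (the 1978 column) because no measured result is given here.

```python
# residue: (1978 frequency, 1991 frequency), copied from the frequency table
PAM_FREQUENCIES = {
    "L": (0.085, 0.091), "A": (0.087, 0.077), "G": (0.089, 0.074),
    "S": (0.070, 0.069), "V": (0.065, 0.066), "E": (0.050, 0.062),
    "T": (0.058, 0.059), "K": (0.081, 0.059), "I": (0.037, 0.053),
    "D": (0.047, 0.052), "R": (0.041, 0.051), "P": (0.051, 0.051),
    "N": (0.040, 0.043), "Q": (0.038, 0.041), "F": (0.040, 0.040),
    "Y": (0.030, 0.032), "M": (0.015, 0.024), "H": (0.034, 0.023),
    "C": (0.033, 0.020), "W": (0.010, 0.014),
}


def differences(observed, reference_column=1):
    """Return (residue, observed - reference) pairs, largest absolute change first.

    reference_column 0 selects the 1978 values, 1 selects the 1991 values.
    """
    diffs = []
    for residue, reference in PAM_FREQUENCIES.items():
        delta = observed.get(residue, 0.0) - reference[reference_column]
        diffs.append((residue, delta))
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)


# Stand-in for measured frequencies: simply the 1978 column
observed = {residue: pair[0] for residue, pair in PAM_FREQUENCIES.items()}
ranking = differences(observed)

for residue, delta in ranking[:3]:
    print("%s %+.3f" % (residue, delta))
```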
Extra Questions
• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the shortest sequence? Is there more than one record with that length?
• What is the identifier for the record with the longest sequence? Is there more than one record with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the description?
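Once the records are parsed, most of these questions reduce to a few built-ins. A sketch on invented (identifier, description, sequence) tuples; the identifiers, descriptions and the short target length are placeholders, not Swiss-Prot data:

```python
# Invented records standing in for parsed Swiss-Prot entries
records = [
    ("AAA_TEST", "Hypothetical protein KCIP-1", "MARRAK"),
    ("BBB_TEST", "Hypothetical protein", "MK"),
    ("CCC_TEST", "Hypothetical protein", "MSTARRAGG"),
    ("DDD_TEST", "Hypothetical protein", "MA"),
]

target_length = 2   # the exercise asks for 260; 2 suits this toy data
n_target = sum(1 for _, _, seq in records if len(seq) == target_length)

# Shortest and longest: find the extreme length, then collect every identifier with it
shortest = min(len(seq) for _, _, seq in records)
longest = max(len(seq) for _, _, seq in records)
shortest_ids = [ident for ident, _, seq in records if len(seq) == shortest]
longest_ids = [ident for ident, _, seq in records if len(seq) == longest]

# Substring tests on sequence and description
n_arra = sum(1 for _, _, seq in records if "ARRA" in seq)
n_kcip = sum(1 for _, desc, _ in records if "KCIP-1" in desc)

# First residues of one named record (a slice never fails on short sequences)
first_residues = {ident: seq[:20] for ident, _, seq in records}["CCC_TEST"]
```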
Question 4
• Program your own prosite parser !
• Download prosite pattern database (prosite.dat)
• Automatically generate >2000 search patterns, and search in sequence set from question 1
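The core of such a parser is turning a PROSITE pattern into a regular expression. A minimal sketch, assuming the usual PROSITE conventions: "-" separates elements, x is any residue, [..] lists allowed residues, {..} lists forbidden residues, (n) or (n,m) gives a repeat count, "<" and ">" anchor the sequence ends, and a period closes the pattern. The function name `prosite_to_regex` and the test sequence are made up, and special cases such as ">" inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into Python regex syntax."""
    regex = pattern.strip().rstrip(".")                    # drop the closing period
    regex = regex.replace("-", "")                         # '-' only separates elements
    regex = regex.replace("x", ".")                        # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")     # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")      # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")      # sequence start / end
    return regex


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
spaced = prosite_to_regex("C-x(2,4)-C.")

sequence = "MKNGSAANPTQ"   # made-up test sequence
hits = [(m.start(), m.group()) for m in re.finditer(glyco, sequence)]
print(glyco, spaced, hits)
```

Note that `re.finditer` reports non-overlapping matches only. Reading the patterns out of prosite.dat (presumably from its PA lines) and looping over them would complete the exercise.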