2015 bioinformatics python_io_wim_vancriekinge


Upload: prof-wim-van-criekinge

Post on 15-Apr-2017


Category:

Education



TRANSCRIPT

Page 1: 2015 bioinformatics python_io_wim_vancriekinge
Page 2: 2015 bioinformatics python_io_wim_vancriekinge

FBW 20-10-2015

Wim Van Criekinge

Page 3: 2015 bioinformatics python_io_wim_vancriekinge

Bioinformatics.be

Page 4: 2015 bioinformatics python_io_wim_vancriekinge

Overview

What is Python?
Why Python 4 Bioinformatics?
How to Python

IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)

Strings
Regular expressions

Page 5: 2015 bioinformatics python_io_wim_vancriekinge

Python

• Programming languages are overrated
– If you are going into bioinformatics you will probably learn/need several
– If you know one you know 90% of a second
• Choice does matter but it matters far less than people think it does
• Why Python?
– Lets you start writing useful programs asap
– Built-in libraries, plus third-party packages such as BioPython
– Free, runs on most platforms, widely used in science
• Versus Perl?
– Incredibly similar
– Consistent syntax, indentation

Page 6: 2015 bioinformatics python_io_wim_vancriekinge

Version 2.7 and 3.4 on athena.ugent.be

Page 7: 2015 bioinformatics python_io_wim_vancriekinge

Where is the workspace ?

Page 8: 2015 bioinformatics python_io_wim_vancriekinge

GitHub: Hosted GIT

• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
– Accept invitation from Bioinformatics-I-2015
URI:
– https://github.ugent.be/Bioinformatics-I-2015/Python.git

Page 9: 2015 bioinformatics python_io_wim_vancriekinge

Run Install.py (is BioPython installed ?)

import platform
from importlib import metadata

# pip.get_installed_distributions() was removed in pip 10;
# importlib.metadata (Python 3.8+) lists installed packages instead.
print("Python " + platform.python_version() + " installed packages:")

installed_packages_list = sorted(
    "%s==%s" % (dist.metadata["Name"], dist.version)
    for dist in metadata.distributions()
)
print(*installed_packages_list, sep="\n")
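The slide title asks whether BioPython is installed. A minimal sketch of a direct check, assuming the package is importable under its usual module name "Bio"; the helper name is_installed is made up for this example:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec(module_name) is not None


# BioPython is imported as "Bio" (assumption stated above).
print("BioPython installed:", is_installed("Bio"))
```

If the check prints False, installing the package (for instance with pip) should make it available.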

Page 10: 2015 bioinformatics python_io_wim_vancriekinge

Control Structures

if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break
continue
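A small sketch that exercises each of these control structures on a DNA string. The function names and the example sequences are illustrative, not taken from the course material:

```python
def gc_content(sequence):
    """Fraction of G and C bases, using a for loop with if/elif/else."""
    if not sequence:
        return 0.0
    gc = 0
    for base in sequence:
        if base in "GC":
            gc += 1
        elif base in "AT":
            continue  # nothing to count, go on to the next base
        else:
            raise ValueError("unexpected base: %r" % base)
    return gc / len(sequence)


def first_stop_codon(sequence):
    """Index of the first in-frame stop codon, or -1, using while and break."""
    position = 0
    found = -1
    while position + 3 <= len(sequence):
        if sequence[position:position + 3] in ("TAA", "TAG", "TGA"):
            found = position
            break  # stop scanning at the first hit
        position += 3
    return found


print(gc_content("ATGCGC"))
print(first_stop_codon("ATGAAATAGCCC"))
```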

Page 11: 2015 bioinformatics python_io_wim_vancriekinge

Lists

• Flexible arrays, not Lisp-like linked lists

• a = [99, "bottles of beer", ["on", "the", "wall"]]

• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)

• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]

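The list operations from this slide can be run as one short script. The second list b is an extra name introduced only to demonstrate concatenation:

```python
a = [99, "bottles of beer", ["on", "the", "wall"]]
b = ["more"]  # extra list, only here to show a + b

# Same operators as for strings
print(a + b)
print(a * 2)
print(a[0], a[-1], a[1:], len(a))

# Item and slice assignment
a[0] = 98
a[1:2] = ["bottles", "of", "beer"]  # one item replaced by three
print(a)

# Deleting the last item removes the nested list
del a[-1]
print(a)
```
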
Page 12: 2015 bioinformatics python_io_wim_vancriekinge

Dictionaries

• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}

• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception

• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend"}
• d["back"] = "rug" # {"duck": "eend", "back": "rug"}
• d["duck"] = "duik" # {"duck": "duik", "back": "rug"}

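The dictionary operations from this slide as a runnable sketch. The use of get() with a default value is an addition, shown as one common way to avoid the KeyError:

```python
d = {"duck": "eend", "water": "water"}

# Lookup
print(d["duck"])
try:
    d["back"]
except KeyError as error:
    print("missing key:", error)

# get() returns a default instead of raising KeyError
print(d.get("back", "?"))

# Delete, insert, overwrite
del d["water"]
d["back"] = "rug"
d["duck"] = "duik"
print(d)
```
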
Page 13: 2015 bioinformatics python_io_wim_vancriekinge

Regex.py

import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

# finditer yields one match object per non-overlapping occurrence
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))

Page 14: 2015 bioinformatics python_io_wim_vancriekinge

Find the answer in ultimate-sequence.txt?

>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTT

TTCGTGCTATGGAAGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAATGGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAATGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTTTAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTCCCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA

Question 2

Page 15: 2015 bioinformatics python_io_wim_vancriekinge

AA1 = {'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','UCC':'S','UCA':'S','UCG':'S','UAU':'Y','UAC':'Y','UAA':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':'W','CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P','CCC':'P','CCA':'P','CCG':'P','CAU':'H','CAC':'H','CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R','AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R','GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G' }

Hint: Use Dictionaries
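A minimal sketch of how a codon dictionary can drive translation. To stay short, the table is generated from the standard genetic code in U, C, A, G order rather than typed out; it should hold the same 64 entries as AA1 above. The names CODON_TABLE and translate are illustrative, and marking unknown codons with "X" is an assumption:

```python
from itertools import product

BASES = "UCAG"
# One letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna):
    """Translate an RNA string codon by codon, starting at position 0."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        # Codons not in the table (e.g. containing N) become "X"
        protein.append(CODON_TABLE.get(rna[i:i + 3], "X"))
    return "".join(protein)


dna = "ATGGCCTAA"
print(translate(dna.replace("T", "U")))
```

A DNA sequence read from a FASTA file would first need its T converted to U, as in the last two lines.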

Page 16: 2015 bioinformatics python_io_wim_vancriekinge

Hint 2: Translations

Python way:
tab = str.maketrans("ACGU","UGCA")
sequence = sequence.translate(tab)[::-1]
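The same idea wrapped in a function, as a sketch. The DNA table and the function name reverse_complement are additions for illustration:

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")  # DNA variant, added here


def reverse_complement(sequence, table=RNA_COMPLEMENT):
    """Complement every base, then reverse the string with [::-1]."""
    return sequence.translate(table)[::-1]


print(reverse_complement("AUGC"))                  # RNA
print(reverse_complement("ATGC", DNA_COMPLEMENT))  # DNA
```

Applying the function twice should give back the original sequence, which makes a quick sanity check.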

Page 17: 2015 bioinformatics python_io_wim_vancriekinge


Reading Files

name = open("filename")
– opens the given file for reading, and returns a file object

name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop

>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2\n456 Brad 4.0 11.6 6.5 2.7 12\n789 Jenn 8.0 8.0 8.0 8.0 7.5\n'
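A sketch comparing read(), readline() and readlines() on the hours.txt data. To stay self-contained the file is first written to a temporary directory, and the with statement (not shown on the slide) is used so each file is closed automatically:

```python
import os
import tempfile

# Create a small hours.txt in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "hours.txt")
with open(path, "w") as handle:
    handle.write("123 Susan 12.5 8.1 7.6 3.2\n"
                 "456 Brad 4.0 11.6 6.5 2.7 12\n"
                 "789 Jenn 8.0 8.0 8.0 8.0 7.5\n")

with open(path) as handle:
    whole = handle.read()        # entire contents as one string

with open(path) as handle:
    first = handle.readline()    # next line, including the "\n"
    rest = handle.readlines()    # remaining lines as a list

print(repr(whole))
print(repr(first))
print(rest)
```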

Page 18: 2015 bioinformatics python_io_wim_vancriekinge


File Input Template

• A template for reading files in Python:

name = open("filename")
for line in name:
    statements

>>> input = open("hours.txt")
>>> for line in input:
...     print(line.strip()) # strip() removes \n

123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
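The template becomes useful once each line is split into fields. A sketch that totals the hours per person, assuming the columns are id, name, then hours; io.StringIO stands in for the open file so the example runs without hours.txt on disk:

```python
import io


def total_hours(lines):
    """Return {name: total hours} from lines like '123 Susan 12.5 8.1'."""
    totals = {}
    for line in lines:
        fields = line.split()   # split on whitespace, drops the "\n"
        if not fields:
            continue            # skip blank lines
        name = fields[1]
        totals[name] = sum(float(hours) for hours in fields[2:])
    return totals


sample = io.StringIO(
    "123 Susan 12.5 8.1 7.6 3.2\n"
    "456 Brad 4.0 11.6 6.5 2.7 12\n"
    "789 Jenn 8.0 8.0 8.0 8.0 7.5\n"
)
totals = total_hours(sample)
for name in totals:
    print("%s worked %.1f hours" % (name, totals[name]))
```

Passing open("hours.txt") instead of the StringIO object should work the same way, since both yield lines when iterated.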

Page 19: 2015 bioinformatics python_io_wim_vancriekinge


Writing Files

name = open("filename", "w")
name = open("filename", "a")
– opens file for write (deletes previous contents), or
– opens file for append (new data goes after previous data)

name.write(str) - writes the given string to the file
name.close() - saves file once writing is done

>>> out = open("output.txt", "w")
>>> out.write("Hello, world!\n")
>>> out.write("How are you?")
>>> out.close()

>>> open("output.txt").read()
'Hello, world!\nHow are you?'

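A sketch of write mode followed by append mode. The temporary directory and the with statement are additions; with closes the file on exit, so no explicit close() call is needed:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# "w" creates the file, or empties it if it already exists
with open(path, "w") as out:
    out.write("Hello, world!\n")

# "a" keeps the existing contents and adds to the end
with open(path, "a") as out:
    out.write("How are you?")

with open(path) as handle:
    content = handle.read()
print(repr(content))
```
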
Page 20: 2015 bioinformatics python_io_wim_vancriekinge

Question 3. Swiss-Knife.py

• Using a database as input! Parse the entire Swiss-Prot collection
– How many entries are there?
– Average protein length (in aa and MW)
– Relative frequency of amino acids

• Compare to the ones used to construct the PAM scoring matrices from 1978 – 1991
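One possible starting point for the parser, as a sketch only. It relies on three features of the Swiss-Prot flat file format: the ID line carries the entry name, sequence lines follow the SQ line, and "//" ends an entry. The two sample entries are invented for the example, and the function name parse_swissprot is hypothetical:

```python
import io

SAMPLE = """\
ID   TEST1_HUMAN             Reviewed;          12 AA.
DE   RecName: Full=Hypothetical test protein 1;
SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;
     MKTAYIAKQR QI
//
ID   TEST2_MOUSE             Reviewed;           8 AA.
DE   RecName: Full=Hypothetical test protein 2;
SQ   SEQUENCE   8 AA;  900 MW;  0000000000000000 CRC64;
     MARRAGLW
//
"""


def parse_swissprot(handle):
    """Yield (entry_name, sequence) for each entry in a flat file."""
    entry_name = None
    in_sequence = False
    residues = []
    for line in handle:
        if line.startswith("ID   "):
            entry_name = line.split()[1]
        elif line.startswith("SQ   "):
            in_sequence = True          # sequence lines come next
        elif line.startswith("//"):
            yield entry_name, "".join(residues)
            entry_name, in_sequence, residues = None, False, []
        elif in_sequence:
            # Sequence lines hold blocks of residues separated by spaces
            residues.append(line.replace(" ", "").strip())


entries = list(parse_swissprot(io.StringIO(SAMPLE)))
average = sum(len(seq) for _, seq in entries) / len(entries)
print("entries:", len(entries))
print("average length: %.1f aa" % average)
```

For the full collection the entries should be consumed one at a time instead of being collected in a list, because the unzipped file is several gigabytes. BioPython's Bio.SwissProt module offers a complete parser as an alternative.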

Page 21: 2015 bioinformatics python_io_wim_vancriekinge

Question 3: Getting the database

Uniprot_sprot.dat.gz – 528 MB (on GitHub under Files)
Unzipped 2.92 GB!

http://www.ebi.ac.uk/uniprot/download-center
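The file does not have to be unzipped first: Python's gzip module can read it line by line. A sketch that counts entries by their "//" terminator lines; the tiny demo file and the function name count_entries are made up so the example runs without the real download:

```python
import gzip
import os
import tempfile


def count_entries(path):
    """Count '//' terminator lines in a gzip-compressed flat file."""
    count = 0
    # "rt" opens the compressed file in text mode, one line at a time
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("//"):
                count += 1
    return count


# Tiny stand-in for the real database file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.dat.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("ID   A_TEST\n//\nID   B_TEST\n//\n")

print(count_entries(demo_path))
```

Because only one line is held in memory at a time, the same function should cope with the full 528 MB archive.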

Page 22: 2015 bioinformatics python_io_wim_vancriekinge

Amino acid frequencies

     1978    1991
L    0.085   0.091
A    0.087   0.077
G    0.089   0.074
S    0.070   0.069
V    0.065   0.066
E    0.050   0.062
T    0.058   0.059
K    0.081   0.059
I    0.037   0.053
D    0.047   0.052
R    0.041   0.051
P    0.051   0.051
N    0.040   0.043
Q    0.038   0.041
F    0.040   0.040
Y    0.030   0.032
M    0.015   0.024
H    0.034   0.023
C    0.033   0.020
W    0.010   0.014

Second step: Frequencies of Occurrence
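A sketch of the frequency calculation with collections.Counter, set next to the 1991 column of the table. The two input sequences are invented placeholders; real input would be the sequences parsed from Swiss-Prot:

```python
from collections import Counter

# 1991 column of the table above
FREQ_1991 = {
    "L": 0.091, "A": 0.077, "G": 0.074, "S": 0.069, "V": 0.066,
    "E": 0.062, "T": 0.059, "K": 0.059, "I": 0.053, "D": 0.052,
    "R": 0.051, "P": 0.051, "N": 0.043, "Q": 0.041, "F": 0.040,
    "Y": 0.032, "M": 0.024, "H": 0.023, "C": 0.020, "W": 0.014,
}


def relative_frequencies(sequences):
    """Return {amino acid: fraction of all residues} over all sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence)   # counts every character in the string
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


observed = relative_frequencies(["MKTAYIAKQRQI", "MARRAGLW"])
for aa in sorted(observed, key=observed.get, reverse=True):
    print("%s  observed %.3f  1991 %.3f"
          % (aa, observed[aa], FREQ_1991.get(aa, 0.0)))
```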

Page 23: 2015 bioinformatics python_io_wim_vancriekinge

Extra Questions

• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the shortest sequence? Is there more than one record with that length?
• What is the identifier for the record with the longest sequence? Is there more than one record with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the description?
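Once the records are parsed, each question reduces to a short expression. A sketch on three invented records stored as (identifier, description, sequence) tuples; the identifiers, descriptions and the target length of 8 are placeholders for the real data:

```python
records = [
    ("TEST1_HUMAN", "Hypothetical test protein 1", "MKTAYIAKQRQI"),
    ("TEST2_MOUSE", "Hypothetical test protein 2 (KCIP-1)", "MARRAGLW"),
    ("TEST3_YEAST", "Hypothetical test protein 3", "MSARRAKL"),
]

# How many records have a sequence of a given length?
target_length = 8
n_with_length = sum(1 for _, _, seq in records if len(seq) == target_length)

# First 20 residues of one record (slicing is safe on shorter sequences)
first_20 = next(seq[:20] for name, _, seq in records if name == "TEST1_HUMAN")

# Shortest sequence, and every record that shares that length
shortest = min(len(seq) for _, _, seq in records)
shortest_ids = [name for name, _, seq in records if len(seq) == shortest]

# Substring searches in the sequence and in the description
n_arra = sum(1 for _, _, seq in records if "ARRA" in seq)
n_kcip = sum(1 for _, desc, _ in records if "KCIP-1" in desc)

print(n_with_length, first_20, shortest_ids, n_arra, n_kcip)
```

The longest-sequence question follows the same pattern with max() in place of min().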

Page 24: 2015 bioinformatics python_io_wim_vancriekinge

Question 4

• Program your own prosite parser !

• Download prosite pattern database (prosite.dat)

• Automatically generate >2000 search patterns, and search in sequence set from question 1
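A sketch of the core step, converting one PROSITE pattern into a Python regular expression. It covers the common syntax elements (x, [..], {..}, repeat counts, the < and > anchors) and assumes the pattern has already been extracted from prosite.dat as a single string; the function name is hypothetical, and rarer cases such as anchors inside square brackets are not handled:

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern string to Python regex syntax."""
    regex = pattern.strip().rstrip(".")   # patterns end with a period
    regex = regex.replace("-", "")        # "-" only separates elements
    regex = regex.replace("x", ".")       # x = any amino acid
    # {ABC} = any residue except A, B or C; convert before the counts
    regex = regex.replace("{", "[^").replace("}", "]")
    # (n) or (n,m) = repeat counts
    regex = regex.replace("(", "{").replace(")", "}")
    # < and > anchor the pattern to the N- or C-terminus
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


# N-glycosylation site pattern
glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
print(glyco)

match = re.search(glyco, "MNGSAK")
print(match.start(), match.group())
```

The order of the two bracket replacements matters: curly braces must become negated character classes before parentheses are turned into curly braces.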