Programming in Python, Lecture 2: Sequences. Michael Schroeder, Sven Schreiber. Updates by Andreas Henschel. Slides derived from Ian Holmes, Department of Statistics, University of Oxford.


Programming in Python

Michael Schroeder, Sven Schreiber

Updates by Andreas Henschel

Lecture 2: Sequences

Slides derived from Ian Holmes, Department of Statistics, University of Oxford


Overview

• Types of sequences and their properties
  – Lists, Tuples, Strings, Range
• Building, accessing and modifying sequences
• List comprehensions
• File operations


Types and Properties of Sequences


Lists vs tuples

• Both are sequences (used to store collections of objects)
• Tuples are immutable, lists are mutable
• Lists are more flexible
• Tuples provide better performance
• Rule of thumb: lists for objects of a similar kind, tuples for objects of different kinds

Construction (syntax)

    l = [1, 2, 3, 4]
    l2 = ['Apple', 'Banana', 'Orange']

    t = ('sebastian', 'm', 28)
    t2 = ('motif', 'ATTCG', 'E44')

Accessing elements

    l[0]    # 1
    t[0]    # 'sebastian'

Adding/modifying elements

    l.append(3)
    l[1] = 5

    t.append(3)    # error: tuples are immutable!
    t[1] = 5       # error: tuples are immutable!

Concatenating

    l3 = l + [3, 2]
    t3 = t + ('phd', 'biotec')
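The difference between mutable lists and immutable tuples can be checked directly. The following is a minimal sketch reusing the slide's `l` and `t`; the `errors` list is only an illustration device added here to record which exception each failed tuple operation raises.

```python
l = [1, 2, 3, 4]
t = ('sebastian', 'm', 28)

l.append(3)   # lists can grow in place
l[1] = 5      # and their items can be reassigned

errors = []   # hypothetical helper: collects the names of raised exceptions
try:
    t[1] = 5              # tuples do not support item assignment
except TypeError as exc:
    errors.append(type(exc).__name__)
try:
    t.append(3)           # tuples have no append method
except AttributeError as exc:
    errors.append(type(exc).__name__)

t3 = t + ('phd', 'biotec')   # concatenation builds a new tuple; t itself is unchanged

print(l)        # [1, 5, 3, 4, 3]
print(errors)   # ['TypeError', 'AttributeError']
print(t3)       # ('sebastian', 'm', 28, 'phd', 'biotec')
```

Concatenation works for tuples because it never modifies the original object; it returns a new one.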


Range

• Used to provide sequences of consecutive integers
• Allows iteration with loops
• Numbers are not stored in memory, but generated only when needed (while looping)
• Saves time and memory with larger number sets

    for x in range(10000):
        print(x)

Output:

    0
    1
    2
    3
    ...
    9998
    9999

The end value (10000) is excluded!
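As a rough illustration of the memory point, the sketch below compares a range with the equivalent list. It assumes CPython, where `sys.getsizeof` reports the size of the container object itself; exact byte counts vary between versions, so only the comparison is shown. The three-argument form `range(start, stop, step)` is not on the slide but is part of the built-in.

```python
import sys

r = range(10000)
as_list = list(r)          # materialises all 10000 integers

print(len(r), r[0], r[-1])                         # 10000 0 9999
print(sys.getsizeof(r) < sys.getsizeof(as_list))   # True: the range stores only start, stop and step
print(list(range(2, 10, 3)))                       # [2, 5, 8]
```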


Working with Lists


Lists

A list is a collection of values/objects

    nucleotides = ['a', 'c', 'g', 't']
    print("Nucleotides: ", nucleotides)

Output:

    Nucleotides:  ['a', 'c', 'g', 't']

We can think of the above as a container with 4 entries:

    a          c          g          t
    element 0  element 1  element 2  element 3

The list is the collection of all four elements. Note that the element indices start at zero!


List literals

• There are several ways to create or obtain lists.

    a = [1, 2, 3, 4, 5]        # the most common form: a comma-separated list, delimited by square brackets
    print("a = ", a)
    b = ['a', 'c', 'g', 't']
    print("b = ", b)

    c = list(range(1, 6))
    print("c = ", c)
    d = "a c g t".split()
    print("d = ", d)

Output:

    a =  [1, 2, 3, 4, 5]
    b =  ['a', 'c', 'g', 't']
    c =  [1, 2, 3, 4, 5]
    d =  ['a', 'c', 'g', 't']


Accessing lists

To access list elements, use square brackets, e.g. x[0] means "element zero of list x"

• Remember, element indices start at zero!
• Negative indices refer to elements counting from the end, e.g. x[-1] means "last element of list x"

    x = ['a', 'c', 'g', 't']
    i = 2
    print(x[0], x[i], x[-1])

Output:

    a g t


List operations

• You can sort and reverse lists...

    x = ['a', 't', 'g', 'c']
    print("x =", x)
    x.sort()
    print("x =", x)
    x.reverse()
    print("x =", x)

Output:

    x = ['a', 't', 'g', 'c']
    x = ['a', 'c', 'g', 't']
    x = ['t', 'g', 'c', 'a']

• You can add, delete and count elements

    nums = [2, 2, 5, 2, 6]
    nums.append(8)
    print(nums)
    print(nums.count(2))
    nums.remove(5)
    print(nums)

Output:

    [2, 2, 5, 2, 6, 8]
    3
    [2, 2, 2, 6, 8]
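One detail worth keeping in mind with these methods: `sort()` and `reverse()` change the list in place and return `None`, whereas the built-in `sorted()` (not shown on the slide) returns a new list and leaves the original alone. A small sketch, with variable names chosen here for illustration:

```python
x = ['a', 't', 'g', 'c']
result = x.sort()              # sorts x in place; the return value is None

y = ['a', 't', 'g', 'c']
ordered = sorted(y)            # new sorted list; y keeps its original order
backwards = sorted(y, reverse=True)

nums = [2, 2, 5, 2, 6]
nums.remove(2)                 # removes only the first matching element

print(result)      # None
print(x)           # ['a', 'c', 'g', 't']
print(y, ordered)  # ['a', 't', 'g', 'c'] ['a', 'c', 'g', 't']
print(backwards)   # ['t', 'g', 'c', 'a']
print(nums)        # [2, 5, 2, 6]
```

Writing `x = x.sort()` is therefore a common mistake: it replaces the list with `None`.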


More list operations

    >>> x = [1, 0] * 2     # multiplying lists
    >>> x
    [1, 0, 1, 0]

    >>> x.pop()            # pop() obtains and removes the last element of a list
    0
    >>> x
    [1, 0, 1]

    >>> x += x             # concatenating lists with + or +=
    >>> x
    [1, 0, 1, 1, 0, 1]

    >>> x.index(0)         # index(..) searches for the first occurrence of an element
    1


Example: Reverse complementing DNA

A common operation due to double-helix symmetry of DNA

    # Start by making the string lower case again. This is generally good practice
    dna = "accACgttAGgtct".lower()

    # Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
    replaced = dna.replace("a", "_a") \
        .replace("t", "a").replace("_a", "t") \
        .replace("g", "_g").replace("c", "g") \
        .replace("_g", "c")

    # Convert to list and reverse
    replacedList = list(replaced)
    replacedList.reverse()

    # Convert back to string using join
    print("".join(replacedList))

Output:

    agacctaacgtggt
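The chain of replace calls needs the temporary "_a" and "_g" markers so that already swapped letters are not swapped back. An alternative sketch, not the slide's method, uses the standard string methods `str.maketrans` and `str.translate` to map all four letters in one pass, and a `[::-1]` slice to reverse. The function name `reverse_complement` is chosen here for illustration.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (lower case)."""
    # Translation table: a->t, c->g, g->c, t->a, applied character by character
    table = str.maketrans("acgt", "tgca")
    complemented = dna.lower().translate(table)
    return complemented[::-1]    # a slice with step -1 reverses the string


print(reverse_complement("accACgttAGgtct"))   # agacctaacgtggt
```

Because every character is looked up exactly once, no placeholder characters are needed.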


Taking a slice of a list

• The syntax x[i:j] returns a list containing elements i, i+1, …, j-1 of list x

    nucleotides = ['a', 'g', 'c', 't']
    print(nucleotides)
    print(nucleotides[0:2])    # nucleotides[:2] also works
    print(nucleotides[2:4])    # nucleotides[2:] also works
    print(nucleotides[-2:])    # takes last two elements
    print(nucleotides[::2])    # takes every second
    print(nucleotides[::-1])   # obtains reversed list

Output:

    ['a', 'g', 'c', 't']
    ['a', 'g']
    ['c', 't']
    ['c', 't']
    ['a', 'c']
    ['t', 'c', 'g', 'a']
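Slicing works the same way on the other sequence types, including strings. As a small sketch, the made-up sequence below is cut into codons (chunks of three letters) by combining slices with a stepped range; the example string is an assumption for illustration only.

```python
dna = "atggcgtaa"    # hypothetical example sequence

codons = []
for start in range(0, len(dna), 3):       # start = 0, 3, 6
    codons.append(dna[start:start + 3])   # slice of three characters

print(codons)        # ['atg', 'gcg', 'taa']
print(dna[::-1])     # aatgcggta
print(dna[100:])     # out-of-range slices are clipped, giving an empty string
```

Unlike indexing with a single out-of-range index, an out-of-range slice does not raise an error.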


Lists and Strings

• A string can be translated into a list of strings
  – Using the split method: string.split(separator)
• A list of strings can be translated into one string
  – Using the join method: separator.join(list)

    sentence = 'This is a complete sentence.'
    print(sentence.split())

Output:

    ['This', 'is', 'a', 'complete', 'sentence.']

    datarow = 'Apples,Bananas,Oranges'
    print(datarow.split(','))

Output:

    ['Apples', 'Bananas', 'Oranges']

    cities = ['Dresden', 'Munich', 'Hamburg', 'Cologne']
    print(' -> '.join(cities))

Output:

    Dresden -> Munich -> Hamburg -> Cologne
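A few properties of split and join are easy to verify in a short sketch: splitting and re-joining with the same separator restores the original string, split without an argument collapses runs of whitespace, and join accepts only strings, so numbers have to be converted first. The variable names are illustrative.

```python
datarow = 'Apples,Bananas,Oranges'
fields = datarow.split(',')
rebuilt = ','.join(fields)                 # round trip restores the original row

words = '  two   spaces '.split()          # no separator: split on runs of whitespace
pieces = 'a  b'.split(' ')                 # explicit separator: empty strings are kept

numbers = [1, 2, 3]
as_text = ' '.join(str(n) for n in numbers)   # join needs strings, so convert first

print(fields)    # ['Apples', 'Bananas', 'Oranges']
print(rebuilt)   # Apples,Bananas,Oranges
print(words)     # ['two', 'spaces']
print(pieces)    # ['a', '', 'b']
print(as_text)   # 1 2 3
```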


List Comprehensions


What are list comprehensions?

• Very concise way to build and transform lists
• Typically replaces a for loop and an if-construction
• Used very often in Python
• Syntax: [expr(var) for var in sequence if condition]

Verbose construction of list:

    newlist = []
    for x in range(1, 11):
        if x % 2:
            newlist.append(x**2)

Construction with list comprehension:

    newlist = [x**2 for x in range(1, 11) if x % 2]

Result (squares of all odd numbers between 1 and 10):

    [1, 9, 25, 49, 81]
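The two constructions can be compared directly. The sketch below also shows that the `if` part of a comprehension is optional and that any sequence, including a string, can be iterated; the DNA string and variable names are chosen for illustration.

```python
# Loop version and comprehension version of the same list
loop_version = []
for x in range(1, 11):
    if x % 2:                  # odd numbers give a non-zero remainder, which counts as True
        loop_version.append(x**2)

comprehension = [x**2 for x in range(1, 11) if x % 2]

# Without a condition: transform every element
upper = [n.upper() for n in ['a', 'c', 'g', 't']]

# Over a string: keep only the g and c characters
gc = [n for n in "accacgttaggtct" if n in "gc"]

print(loop_version == comprehension)   # True
print(upper)                           # ['A', 'C', 'G', 'T']
print(len(gc))                         # 7
```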


Examples: List comprehensions

    sentence = 'I like MySQL but not Python'
    print([(w.lower(), len(w)) for w in sentence.split()])

Output:

    [('i', 1), ('like', 4), ('mysql', 5), ('but', 3), ('not', 3), ('python', 6)]

Sum up all positive integers in a tuple:

    numbers = (1, 0, -1, 6, 3, -2, 3, 4)
    total = sum([x for x in numbers if x > 0])   # named total so the built-in sum is not overwritten
    print(total)

Output:

    17


File IO


Opening and reading a file

    f = open('myfile.txt', 'r')    # returns file handler; second argument is the file mode (r, w, a, ...)
    for line in f:                 # line is the loop variable: linewise iteration over file!
        if not line.startswith('#'):
            print(line, end='')    # each line already ends with a newline
    f.close()

myfile.txt:

    #Old number
    1234
    # New number
    5555
    # Test
    1

Output:

    1234
    5555
    1

Shorter and better form; the file is closed after the block!

    with open('myfile.txt', 'r') as f:
        for line in f:
            if not line.startswith('#'):
                print(line, end='')
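To try the pattern without preparing a file by hand, the sketch below first writes the example content to a temporary location and then reads it back. The use of `tempfile` and the `kept` list are assumptions made here for a self-contained demonstration; lines read from a file keep their trailing newline, so it is stripped before storing.

```python
import os
import tempfile

content = "#Old number\n1234\n# New number\n5555\n# Test\n1\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "myfile.txt")

    with open(path, "w") as f:       # mode 'w' creates the file and writes to it
        f.write(content)

    kept = []
    with open(path, "r") as f:       # mode 'r' reads; iteration yields one line at a time
        for line in f:
            if not line.startswith("#"):
                kept.append(line.rstrip("\n"))

print(kept)       # ['1234', '5555', '1']
print(f.closed)   # True: the with block closed the file
```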


Example: FASTA format

• A format for storing multiple named sequences
• This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488

fly3utr.txt:

    >CG11604
    TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
    >CG11455
    TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
    >CG11488
    TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT

The name of each sequence is preceded by a > symbol. NB sequences can span multiple lines.


Example: FASTA format

    with open('fly3utr.txt', 'r') as f:
        for line in f:
            if line.startswith('>'):
                print(line[1:], end='')    # drop the leading '>'; the line already ends with a newline

Output:

    CG11604
    CG11455
    CG11488

What if we want to show the length of each sequence record?


Example: FASTA format

    name = None
    length = None
    with open('fly3utr.txt', 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                # None -> False
                if name:
                    print(name, length)
                name = line[1:]
                length = 0
            else:
                length += len(line)
    print(name, length)

Output:

    CG11604 58
    CG11455 83
    CG11488 69
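The same bookkeeping can be wrapped in a function that returns the lengths instead of printing them. This is a sketch under stated assumptions: the function name `fasta_lengths` is hypothetical, it accepts any iterable of lines (an open file works), and the demonstration uses a small made-up record set held in `io.StringIO` rather than fly3utr.txt.

```python
import io


def fasta_lengths(lines):
    """Map each sequence name to its total length, summed over all of its lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            name = line[1:]              # new record: remember its name
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)   # sequence may span several lines
    return lengths


# Made-up example data; seq1 is split over two lines
example = io.StringIO(">seq1\nACGT\nAC\n>seq2\nTTT\n")
print(fasta_lengths(example))   # {'seq1': 6, 'seq2': 3}
```

Collecting the results in a dictionary avoids the extra print after the loop that the script version needs for the last record.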



Summary

• Strings, lists, tuples and ranges are all sequences
• Lists (usually for elements of the same type)
  – More flexible, more memory consumption
• Tuples (usually store elements of different types)
  – Immutable, less memory consumption
• Ranges for fast numeric iteration
  – Least memory consumption
• List comprehensions as a concise way to transform sequences
• Convert strings into lists and vice versa with split and join
• File handlers provide line-wise iteration