Introduction to Python for Biologists Part 1

Upload: jocelyn-singleton

Post on 12-Jan-2016



Introduction to Python for Biologists Part 1

This Lecture


Learning Objectives

• Install Python
• Data & Variables
• Strings
• String slicing
• String methods
• Lists
• List methods & list slicing
• Math
• Arrays


Why Do Biologists Need to Learn Programming?

http://archive.oreilly.com/pub/a/oreilly//news/perlbio_1001.html

http://www.nature.com/nbt/journal/v31/n10/box/nbt.2721_BX1.html

• Biology is becoming a data-driven field
  – New technology enables scientists to generate large data sets in semi-automated experiments.
  – Analysis of your own data is challenging
  – Automation saves time
  – Many interesting questions remain unanalyzed in huge amounts of publicly available data
  – Integration of new experimental results with public data is a challenging computational problem

• Scientists who can pursue innovative data analysis methods have an advantage over those limited to existing software (or those who require the assistance of other people with programming and data analysis skills)


Python*
• Is a programming language
• Free, open source
• Runs on all types of computers
• “User friendly and easy to learn”
• “Clean readable code”
• Very popular among bioinformaticians
• Good documentation available

https://wiki.python.org/moin/BeginnersGuide/Overview

• Powerful “object oriented” features
• Many add-on toolkits (“modules”) available for scientific computing, visualization, statistics, etc.

*Python is named after a 1970s British comedy TV show, not a large snake


[Figure: comic with the labels "Grad School" and "Python"]

Thanks to xkcd: https://xkcd.com/519/


Python.org


Online Tutorials
• You can’t learn an entire programming language from a couple of classroom lectures.
• There are many online tutorials for Python, which allow self-learning at your own pace
• We recommend:
  • Codecademy.com
  • TryPython.org
  • LearnPython.org
  • LearnPythontheHardWay.org/book
  • Software Carpentry
• For Biologists:
  • Python for Biologists
  • Rosalind Python Village (learn by solving problems)


Reading for this week:

• Python for Biologists, chapters 1-3

• The anatomy of successful computational biology software. Altschul S, Demchak B, Durbin R, Gentleman R, Krzywinski M, Li H, Nekrutenko A, Robinson J, Rasband W, Taylor J, Trapnell C. Nature Biotechnology 2013 Oct;31(10):894-7. DOI: 10.1038/nbt.2721


Install Python

• Assignment: Install Python on your computer
• Be sure to include the NumPy and SciPy modules
• One easy way to set up a GUI for Python (on Mac and Windows) is to download the free version of Anaconda: http://continuum.io/downloads

• Or you can run the command line version on Linux or in the Macintosh Terminal (on a Mac you will need Xcode, a free software developer's toolkit from Apple that is not installed by default in OS X)


Anaconda
• Your life (in this course) will probably be easier if you install the (free) Anaconda distribution, which includes numerical, scientific, statistical, and graphics modules.

http://continuum.io/downloads


Programming Concepts

All programming languages are built from the same basic elements:

• data
• operators
• flow control

These concepts are expressed in a specific syntax for each programming language
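To make the three elements concrete, here is a minimal sketch in Python 3 (the print() style is an assumption; the sequence and variable names are made up for illustration). It uses one piece of data, a few operators, and a loop with a condition as flow control.

```python
# Data: a string holding a short DNA sequence (illustrative value)
sequence = "ATGCGTA"

# Operators: a function call, a comparison, and later an addition
length = len(sequence)
is_long = length > 5

# Flow control: a loop that visits each letter, and a condition inside it
g_count = 0
for base in sequence:
    if base == "G":
        g_count = g_count + 1

print(length)
print(is_long)
print(g_count)
```

Loops and conditions are not covered in detail in this lecture; the point here is only to show how the three elements fit together in a few lines.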


Data types

• Basic:
  • Strings = 'GATCCATGCGAGACCCTTGA'
  • Numbers = 7, 123.455, 4.2e-14
  • Boolean = True, False

• Every data object has a type (try these examples on your own)

>>> type(1)
>>> type("GATCCT")
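As a small sketch of the same idea, the loop below asks Python for the type of each example value from this slide. It prints the type's short name (via `__name__`) so the output looks the same in Python 2 and Python 3; the list name `examples` is just for illustration.

```python
# Example values taken from the slide above
examples = ["GATCCT", 7, 123.455, 4.2e-14, True]

type_names = []
for value in examples:
    # type(value) returns the type object; __name__ is its short name
    type_names.append(type(value).__name__)

print(type_names)
```

Note that both 123.455 and 4.2e-14 are floating point numbers, while 7 is an integer.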


Variables

• A Variable is a named container for data (think of it as a box or a shelf that has a name)

• In Python, a variable can hold any type of data and does not need to be declared in advance

• The data in the variable can be changed at any time (and can change to a different type)

• Python variable names must start with a letter or an underscore, and can contain only letters, digits, and the underscore _ character.

• Variable names are case sensitive
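A short sketch of these rules, using made-up variable names: the same variable is reassigned to a value of a different type, and two names that differ only in case turn out to be separate variables.

```python
sample = "ATGCGTA"                  # sample holds a string
first_type = type(sample).__name__

sample = 467                        # the same name now holds an integer
second_type = type(sample).__name__

gene = "HBB"
Gene = "TP53"                       # different variable: names are case sensitive

print(first_type)
print(second_type)
print(gene + " " + Gene)
```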


Comments

• Comments are bits of text added by the programmer into the code that explain what is going on. They are not executed by the computer.

• Python uses the hash symbol # to mark a comment; anything on a line after the # is ignored

• Use lots of clear comments in your code: for a good grade, so others can understand your code, and so you can understand your own code from the past (days, weeks, years… ago).


Examples of Variables

A value is assigned to a variable by the = sign. The value to the right of the = is put into the variable name on the left.

my_DNA = "ATGCGTA"
gene_length = 467
Dog_Text = "my Dog has Fleas"  # spaces are part of a string
counter = 6
pi_short = 3.14

my_list = ['a', 'b', 'c', 'd']
HBB_human = "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"  # this string is one line that wraps on the screen


Strings
• Strings are text. Must always be in quotes.
• Can use single or double quotes, but the opening and closing quotes must match
• A string can contain space characters and also newline characters.
• Biology data involves a lot of strings: sequences, names (taxonomy, gene names), etc.
• A string is usually assigned to a variable

>>> my_string = "gi|45478711|ref|NC_005816.1|Yersinia pestis biovar Microtus"


String Methods

• In Python, data objects of type 'string' have built-in operators called 'methods'

• Methods use a 'dot' syntax as follows:

>>> my_DNA = "ATGCGTA"
>>> my_DNA.count("G")
2
>>> my_DNA.lower()
'atgcgta'
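As one possible use of count(), the sketch below estimates the GC fraction of the example sequence. The variable names are illustrative, and float() is used so the division gives a decimal result in both Python 2 and Python 3.

```python
my_DNA = "ATGCGTA"

# Count each base of interest with the count() string method
g = my_DNA.count("G")
c = my_DNA.count("C")

# Fraction of the sequence that is G or C
gc_fraction = (g + c) / float(len(my_DNA))

print(g)
print(c)
print(gc_fraction)
```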


String Concatenation
• Two strings can be joined with the + operator

c = 'cat'
h = 'hat'
print('cat' + 'hat')
ch = c + h
print(ch)
print(c + ' in the ' + h)

• Numbers must be converted to strings using the str() function before using the string concatenation operator

A = 5

print(A + c)  # note the error message (a TypeError)
print('We have' + ' ' + str(A) + ' ' + c + 's')


More String Methods
• upper() and lower() return a new string with the case changed. You usually need to put this value into a variable, because the original string is unchanged.

>>> my_DNA = "TATGCGTA"
>>> my_DNA.lower()
'tatgcgta'
>>> my_DNA
'TATGCGTA'

• len() gives the length of a string (it is a built-in function, not a method, so it does not use the dot syntax)
>>> len(my_DNA)
8


Find & Replace

• find() is another handy string method. (Note: It only works for exact matches)

>>> my_DNA = "ATGCGTA"
>>> my_DNA.find("GC")
2  # returns the position index of the first occurrence of the search string in the target

• replace() finds and replaces letters in a string
>>> my_DNA.replace('T', 'X')
'AXGCGXA'
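Two details are worth trying for yourself: find() returns -1 when the search string is not present, and replace() returns a new string rather than changing the original. The sketch below shows both, using replace() to turn the DNA example into an RNA-style string (the names `missing` and `my_RNA` are illustrative).

```python
my_DNA = "ATGCGTA"

found = my_DNA.find("GC")      # index of the first match
missing = my_DNA.find("TTT")   # -1 means "not found"

# replace() returns a new string; my_DNA itself is not modified
my_RNA = my_DNA.replace("T", "U")

print(found)
print(missing)
print(my_RNA)
print(my_DNA)
```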


Lists

• Lists contain a group of things, in square brackets, separated by commas

List1 = ['a', 'b', 'c', 'd']
List2 = ["XP_008199794", "PF03769", "gi|54037254"]
List_mix = ["fish", "hat", "box", 17, 4935.45, True]

• The elements of a list do not all have to be of the same type

• Lists are used for many tasks in Python that involve a lot of data.


List Elements
• The elements in a list are ordered. They can be accessed by their index number in the list.
• Python starts counting list elements at zero
• The list index is indicated by a number in square brackets following the name of the list
• List slicing uses this format: [begin:end:step]
• You can do fancy things with list slicing, but the interval rules take some study: the begin index is included, the end index is not, and negative indexes count back from the end of the list.

>>> my_list = ['G', 'A', 'hat', 'cat']
>>> my_list[1]
'A'

>>> my_list[1:3]
['A', 'hat']

>>> my_list[:-2]
['G', 'A']
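The slicing rules are easier to remember after seeing a few more cases side by side. This is a small sketch using the same list; the variable names on the left are only labels for each result.

```python
my_list = ['G', 'A', 'hat', 'cat']

first_two = my_list[0:2]       # begin index 0 is included, end index 2 is not
every_other = my_list[::2]     # a step of 2 takes every second element
last_item = my_list[-1]        # negative indexes count from the end
reversed_copy = my_list[::-1]  # a step of -1 walks the list backwards

print(first_two)
print(every_other)
print(last_item)
print(reversed_copy)
```

Slicing always returns a new list, so my_list itself is unchanged after these lines.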


List Methods
• You can assign a value to a specific position in a list:

>>> my_list = ['G', 'A', 'hat', 'cat']
>>> my_list[1] = "X"
>>> my_list
['G', 'X', 'hat', 'cat']

• List methods are functions built into the list data type. They use the 'dot' syntax just like string methods.

>>> my_list.count('G')
1

• list.append() is a commonly used list method. It adds its argument to the end of a list. It is frequently used to collect results as a program steps through a loop

>>> my_list.append('T')
>>> my_list
['G', 'X', 'hat', 'cat', 'T']
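Here is a minimal sketch of the "collect results in a loop" pattern mentioned above. The list of sequences is made up from examples used elsewhere in this lecture; loops themselves are covered later, so treat this as a preview.

```python
sequences = ["ATGCGTA", "GATCCT", "CGGTTAATAGGGACTCTC"]

# Start with an empty list, then append one result per pass through the loop
lengths = []
for seq in sequences:
    lengths.append(len(seq))

print(lengths)
```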


String Slicing

• Strings can be treated as a list of letters, and sliced with the same syntax as lists

>>> my_DNA = "ATGCGTA"
>>> my_DNA[1]
'T'
>>> my_DNA[1:4]
'TGC'
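One common use of string slicing in biology is pulling codons out of a coding sequence. The sketch below does this by hand for a short, made-up 9-base sequence, and also reverses the string with a step of -1.

```python
dna = "ATGCGTACC"   # illustrative sequence, 9 bases = 3 codons

codon1 = dna[0:3]   # bases 0, 1, 2
codon2 = dna[3:6]   # bases 3, 4, 5
codon3 = dna[6:9]   # bases 6, 7, 8

reverse = dna[::-1]  # the whole string, read backwards

print(codon1)
print(codon2)
print(codon3)
print(reverse)
```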


Split a string into a List

• Sometimes it is helpful to turn a string into a list of words or numbers. The split() method does this.

• By default, it splits on whitespace, but any character specified in the parentheses can be used as delimiter.

• This is useful when working with tab delimited or comma delimited (csv) data.
>>> names = "melanogaster,simulans,yakuba,ananassae"
>>> species = names.split(",")
>>> print(names[1] + ' ' + species[2])  # names is still a string, so names[1] is a single letter
e yakuba
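Sequence database headers are another place where split() is handy. As a sketch, the header string from the Strings slide can be split on the | character and then the description split again on whitespace (the names `fields`, `accession`, and `words` are illustrative).

```python
header = "gi|45478711|ref|NC_005816.1|Yersinia pestis biovar Microtus"

# Split on the pipe character to get the individual fields
fields = header.split("|")
accession = fields[3]
description = fields[4]

# With no argument, split() breaks the text on whitespace
words = description.split()

print(fields)
print(accession)
print(words)
```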


The list() function

• The list() function splits a string into a list of characters

>>> hi = "Hello world"
>>> list(hi)
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']


Join

• join() turns a list of strings into a single string. You can add a spacer such as a comma or space character. It has a backwards syntax, where the spacer is the thing being acted upon by the method:

>>> my_list = ['G', 'A', 'hat', 'cat']
>>> spacer = ':'
>>> newstring = spacer.join(my_list)
>>> newstring
'G:A:hat:cat'
>>> '#'.join(my_list)
'G#A#hat#cat'
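list() and join() work well as a pair: turn a string into a list, change one element, and join it back together. This is a small sketch with made-up names; joining on an empty string puts the letters back with no spacer.

```python
dna = "ATGCGTA"

bases = list(dna)         # a list of single letters
bases[0] = "N"            # list elements can be changed in place
masked = "".join(bases)   # empty spacer: glue the letters back together

# join() with a comma builds one line of comma delimited text
csv_line = ",".join(["melanogaster", "simulans", "yakuba"])

print(masked)
print(csv_line)
```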


String Slicing: Exercises (do these yourself in the Python shell)

>>> dna = 'CGGTTAATAGGGACTCTC'
>>> dna[0]
>>> dna[0:3]
>>> dna[-1]
>>> dna[-1:-3]
>>> dna[-3:-1]
>>> dna[0:5]
>>> dna[0:5:2]
>>> dna[0:5][::-1]
>>> dna[0:5][::-2]
>>> dna


Math

• Python can do simple math like a calculator.
• Type the following expressions into an interactive Python session (or the IDE editor), hit the enter/return key (or Run button) and observe the results:

2 + 2
6 - 3
8 / 3.0
9 * 3
6 ** 2


Math module
• Python does not load all of its standard library functions when you start it up
• You use the "import" command to add modules.
• Type "import math" to get more advanced mathematics functions. math.sqrt() is a function in the math module. Try this:

>>> import math
>>> math.sqrt(36)
6.0
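The math module has many other functions and constants. As a sketch, here are a few that come up often in data analysis; the p-value is a made-up number used only to show a log transform.

```python
import math

root = math.sqrt(36)

# A log transform of a (made-up) p-value, as often used for plotting
p_value = 1e-5
neg_log_p = -math.log10(p_value)

# math.pi is a constant; math.floor() rounds down to a whole number
circle_area = math.pi * 2 ** 2
whole_part = math.floor(circle_area)

print(root)
print(neg_log_p)
print(circle_area)
print(whole_part)
```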


Simple Navigation

• Doing some simple file system navigation in Python is unreasonably difficult (uses a module called os)

• Where am I?
>>> import os
>>> os.getcwd()
'C:\\Python27'

• What files are in this directory (folder)?
>>> os.listdir('.')

['at.py', 'hello.py', 'JASPAR-pfm_all.txt', 'JasparClient.py', 'MA0024.1.pfm', 'my_blast.xml', 'ros4.py', 'rosalind_ini5.txt', 'SRR020192.fastq', 'Test_100.fasta']

• Change directory
>>> os.chdir('/Users/stu/Python')
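A directory listing often contains more than you want. The sketch below combines os.listdir() with the endswith() string method to keep only files with a given ending. The function name list_files_with_suffix is hypothetical (it is not part of the os module), and the output depends on what is in your own current directory.

```python
import os


def list_files_with_suffix(directory, suffix):
    """Return a sorted list of names in directory that end with suffix."""
    matches = []
    for name in os.listdir(directory):
        if name.endswith(suffix):
            matches.append(name)
    return sorted(matches)


here = os.getcwd()   # the directory Python is currently working in
fasta_files = list_files_with_suffix(here, ".fasta")

print(here)
print(fasta_files)
```

If there are no matching files, the result is simply an empty list.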


NumPy and Arrays

• Arrays are like lists, but all of their elements have the same type (usually numbers), and they have dimensions.

• NumPy is a Python module that enables array operations.

Here is a simple one dimensional array of integers (just like a list):

>>> import numpy as np
>>> x = np.array([42,47,11], int)
>>> x
array([42, 47, 11])

Software Carpentry has a nice introduction to NumPy arrays: http://swcarpentry.github.io/python-novice-inflammation/01-numpy.html
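The main practical difference from a list is that arithmetic on an array applies to every element at once. A minimal sketch, assuming NumPy is installed (for example through Anaconda), with illustrative variable names:

```python
import numpy as np

counts_list = [42, 47, 11]
counts = np.array(counts_list)

# Multiplying a list repeats it; multiplying an array multiplies each element
repeated = counts_list * 2
doubled = counts * 2

# Arrays also have built-in summary methods
total = counts.sum()
average = counts.mean()

print(repeated)
print(doubled)
print(total)
print(average)
```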


2-Dimensional Array
• A two dimensional array is like a list of lists, but each row must have the same number of elements.

>>> x = np.array( ((11,12,13), (21,22,23), (31,32,33)) )
>>> print(x)
[[11 12 13]
 [21 22 23]
 [31 32 33]]

• Note the nested square brackets
• NumPy has no problem with 3, 4, or more dimensions, but it is annoying to represent as text.
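Elements of a two dimensional array are addressed with a row index and a column index, both counted from zero. The sketch below, assuming NumPy is installed, uses the same 3 x 3 array; the names on the left are only labels.

```python
import numpy as np

x = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])

dimensions = x.shape       # (number of rows, number of columns)
one_value = x[0, 2]        # row 0, column 2
second_row = x[1]          # a whole row
first_column = x[:, 0]     # every row, column 0

print(dimensions)
print(one_value)
print(second_row)
print(first_column)
```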


Matrix Math
• Matrices are 2-dimensional arrays.
• NumPy has linear algebra methods for operations on matrices. Element-wise operations such as addition and subtraction require two arrays of the same shape; matrix multiplication requires that the number of columns in the first matrix equal the number of rows in the second.

• Vector addition
• Matrix subtraction
• Matrix multiplication
• Scalar product (dot product)
• Cross product

>>> x = np.array([3,2])
>>> y = np.array([5,1])
>>> z = x + y
>>> z
array([8, 3])
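The other operations in the list can be tried the same way. This is a sketch, assuming NumPy is installed, that uses np.dot() and np.cross(); the small matrices a and b and the vectors u and v are made-up values chosen so the results are easy to check by hand.

```python
import numpy as np

x = np.array([3, 2])
y = np.array([5, 1])

# Scalar (dot) product of two vectors: 3*5 + 2*1
dot = np.dot(x, y)

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[0, 1],
              [1, 0]])

difference = a - b        # element-wise matrix subtraction
product = np.dot(a, b)    # matrix multiplication (rows of a times columns of b)

# Cross product of two 3-dimensional vectors
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
cross = np.cross(u, v)

print(dot)
print(difference)
print(product)
print(cross)
```

Note that a * b would multiply element by element, which is not the same as matrix multiplication.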


Assignment:
• Rosalind Python Village
  – All 6 problems (should take you 1-2 hours)
Rosalind Python Village: http://rosalind.info/problems/list-view/?location=python-village


Summary

• Install Python
• Data & Variables
• Strings
• String slicing
• String methods
• Lists
• List methods & list slicing
• Math
• Arrays