python programming for bioinformatics

Python Programming for Leading-Edge Bioinformatics Analysis. Hyungyong Kim, R&D Center, Insilicogen, Inc.

Upload: hyungyong-kim

Post on 23-Jun-2015


DESCRIPTION

Python programming for leading-edge bioinformatics analysis

TRANSCRIPT

Page 1: Python programming for Bioinformatics

Python Programming for Leading-Edge Bioinformatics Analysis

Hyungyong Kim

R&D Center, Insilicogen, Inc.

Page 2: Python programming for Bioinformatics

• Before the lecture, please download and install Python from the website below.
• http://www.python.org
• Download "Python 2.7.3 Windows Installer"
• After installing, add ;C:\Python27 to the PATH environment variable

Installing Python

Page 3: Python programming for Bioinformatics

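• Ajou University, Department of Biotechnology, class of '93
• Bioinformatics training course, 1st cohort (2000)
• Bioinformatics Co., Ltd.
• National Institute of Animal Science, bioinformatics office
• Soongsil University, Department of Bioinformatics
• Insilicogen, Inc.

• LabKM, KinMatch, Ontle
• DNA information search system for soldiers killed in the Korean War
• DNA information search system for finding missing children
• Integrated management system for livestock genetic resources

About the instructor

http://biohackers.net
http://yong27.biohackers.net
http://twitter.com/yong27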

Page 4: Python programming for Bioinformatics

• Quickly implement the ideas in your head
• Batteries included: use libraries that already exist
• Prototype → Product

Leading-edge bioinformatics analysis

Page 5: Python programming for Bioinformatics

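• Basic syntax
  – Python introduction
  – Data type
  – Control flow
  – Function, Module, Package
  – String formatting
• Object orientation and advanced features
  – Exception and Test
  – Class
  – Decorator, Iterator, Generator
  – Standard libraries

Course structure

• Basic exercises
  – Multiplication table function
  – Quadratic formula
  – Word frequency count
• Object-oriented exercise
  – Handling FASTA sequences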

Page 6: Python programming for Bioinformatics

INTRODUCTION

Page 7: Python programming for Bioinformatics

"In the future, technology will replace every task that is not creative."

Page 8: Python programming for Bioinformatics
Page 9: Python programming for Bioinformatics

• Information is a sequence (order)
• How can we manage it?
• A computer language does this by mimicking human language
• History
  – Machine language
  – Assembly
  – Compiled language (C,…)
  – Interpreter language (Java, Python,…)

Computer language

Page 10: Python programming for Bioinformatics

• Unix (Linux)
• MS-Windows
• Mac OS X

• Python runs everywhere

Operating System

Page 11: Python programming for Bioinformatics

• Created by Guido van Rossum, first released in 1991
• Free and open source
• Designed to be an easy language
• Object-oriented scripting
• Dynamic typing
• Interpreter
• Glue language

What is Python?

Page 12: Python programming for Bioinformatics

• Easy object orientation
• Easy to learn
• Prototyping
• Batteries included
• Portable
• Extensible
• Powerful internal data structures

Why Python?

Page 13: Python programming for Bioinformatics

Indentation

Page 14: Python programming for Bioinformatics

• Anaconda
• One of Google's three main languages
• NASA
• Biopython
• Django – Pinterest, Instagram

Python applications

Page 15: Python programming for Bioinformatics

• Web programming – Django, TurboGears
• Network programming – Twisted
• GUI – wxPython, PyQt
• Game – pygame
• Database – Oracle, MySQL, PostgreSQL, sqlite
• Scientific and numeric – NumPy, SciPy

Python applications

Page 16: Python programming for Bioinformatics

• 3.2.3
• 2.7.3
• 2.6
• 2.5
• 2.4…
• 1.5

Python version

Page 17: Python programming for Bioinformatics

• CPython (C)
• PyPy (Python)
• Jython (Java)
• Parrot (Perl)
• IronPython (.NET)

Python implementation

Page 18: Python programming for Bioinformatics

DATA TYPE

Page 19: Python programming for Bioinformatics

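• The first character must be "_" or a letter; digits are allowed from the second character on; names are case-sensitive
• Reserved words cannot be used (see import keyword)
• Avoid the names of built-in functions.
• "a = 1"
  – In Python, everything is an object
  – A name is bound to the object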

Variables

Page 20: Python programming for Bioinformatics

• Numeric types
  – Integer, Long, Float, Decimal, Complex
• Sequence types
  – String, Unicode, List, Tuple
• Collection types
  – Dictionary, Set
• Etc
  – Boolean

Python internal data types

Page 21: Python programming for Bioinformatics

• int, long, float, complex
• +, -, *, /, //, %, **
• Built-in functions: abs, divmod, pow
• For advanced calculation, use import math

Numeric types

Page 22: Python programming for Bioinformatics

• Integer and long literals
  – ~L : long
  – 0b~ : binary (bin) : 0b101 → 5
  – 0o~ : octal (oct) : 0o11 (also written 011 in Python 2) → 9
  – 0x~ : hexadecimal (hex) : 0xa → 10
• long is merged into int in Python 3

Integer and Long

Page 23: Python programming for Bioinformatics

• Literals: 3.14  10.  .001  1e100  3.14e-10  0e0
• Built-in function: round
• A float is not an exact number
• Use Decimal for exact decimal arithmetic, but it is slow

Float
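To make the point about inexact floats concrete, here is a minimal sketch comparing float and Decimal arithmetic. The variable names are illustrative, and print() is used so the snippet also runs on Python 3 (the slides themselves target Python 2.7).

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly,
# so their sum is not exactly 0.3.
float_sum = 0.1 + 0.2
float_is_exact = (float_sum == 0.3)

# Decimal built from strings keeps the decimal digits exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = (decimal_sum == Decimal("0.3"))

print(float_is_exact)    # False
print(decimal_is_exact)  # True
```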

Page 24: Python programming for Bioinformatics

• 3.14j 10.j 1e100j 3+4J

• Attributes: real, imag; method: conjugate()

Complex

Page 25: Python programming for Bioinformatics

• In Python, everything is an object
• An object has attributes. Use "." to access them
• Some attributes are methods. Use "()" to call them

• a = 27
• dir(a)
• help(a)
• a.real
• a.imag
• a.conjugate()

Method?

Page 26: Python programming for Bioinformatics

• Immutable : String, Unicode, Tuple
• Mutable : List

Sequence types

Page 27: Python programming for Bioinformatics

Sequence indexing/slicing

MyStr = "HELLO WORLD!"

Character:    H   E   L   L   O       W   O   R   L   D   !
Index:        0   1   2   3   4   5   6   7   8   9  10  11
Negative:                                    -4  -3  -2  -1

MyStr[1] → "E"
MyStr[6] → "W"
MyStr[-1] → "!"

MyStr[3:5] → "LO"
MyStr[8:-1] → "RLD"
MyStr[9:] → "LD!"
MyStr[:4] → "HELL"
MyStr[:] → "HELLO WORLD!"
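The indexing and slicing rules on this slide can be checked directly. This is a small sketch with illustrative variable names, written with print() so it also runs on Python 3.

```python
# Index and slice the example string from the slide.
my_str = "HELLO WORLD!"

second_char = my_str[1]    # indices start at 0
last_char = my_str[-1]     # negative indices count from the end
middle = my_str[3:5]       # start is included, stop is excluded
before_last = my_str[8:-1] # slices accept negative positions too
tail = my_str[9:]          # an omitted stop means "to the end"
whole_copy = my_str[:]     # omitting both gives the whole string

print(second_char, last_char, middle, before_last, tail)
```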

Page 28: Python programming for Bioinformatics

• A str is a byte string
• Methods : capitalize, center, count, endswith, startswith, find, index, join, strip, split, zfill
• The backslash is special
  – \n : ASCII linefeed
  – \t : ASCII horizontal tab
• String literals
  – r~ : raw string (the backslash does not escape)
  – u~ : unicode

String

Page 29: Python programming for Bioinformatics

• Immutable sequence of anything
• Use "( )"

Tuple

Page 30: Python programming for Bioinformatics

• Mutable sequence of anything, Use “[ ]”

List

Page 31: Python programming for Bioinformatics

• Unordered collection of distinct hashable objects
• Mutability
  – Mutable : set (add, remove)
  – Immutable : frozenset
• Methods : union(|), intersection(&), difference(-), symmetric_difference(^)

Set
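The set operators listed above can be tried on two small gene lists. This is only a sketch: the gene symbols are arbitrary example values, and the variable names are not from the slides.

```python
# Arbitrary example gene symbols for two hypothetical samples.
sample_a = set(["BRCA1", "TP53", "EGFR", "MYC"])
sample_b = set(["TP53", "MYC", "KRAS"])

in_either = sample_a | sample_b        # union
in_both = sample_a & sample_b          # intersection
only_in_a = sample_a - sample_b        # difference
in_exactly_one = sample_a ^ sample_b   # symmetric difference

sample_a.add("KRAS")                   # a set is mutable
frozen = frozenset(sample_b)           # a frozenset is immutable and hashable
lookup = {frozen: "sample_b genes"}    # so it can be used as a dict key

print(sorted(in_both))
```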

Page 32: Python programming for Bioinformatics
Page 33: Python programming for Bioinformatics

• A mapping object maps hashable values to arbitrary objects.
• Mutable
• Usage
  – d = { key : value }
  – d[key] = value
  – key in d
  – d.keys(), d.values(), d.items()
  – d.update(another_d)

Dictionary
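As a hedged illustration of the dictionary usages above, the sketch below builds a tiny codon table. Only four codons are included, and the names codon_table and pairs are illustrative.

```python
# A deliberately tiny codon -> amino acid table (not the full genetic code).
codon_table = {"ATG": "M", "TGG": "W"}

codon_table["TAA"] = "*"               # d[key] = value adds or replaces an entry
has_start = "ATG" in codon_table       # "in" tests the keys
codon_table.update({"GGC": "G"})       # merge another dictionary in

# items() gives (key, value) pairs; sort them for a stable order.
pairs = sorted(codon_table.items())
for codon, amino_acid in pairs:
    print(codon, amino_acid)
```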

Page 34: Python programming for Bioinformatics

• Boolean type
• True or False
• Every Python object has a boolean value

Bool

Page 35: Python programming for Bioinformatics

CONTROL FLOW

Page 36: Python programming for Bioinformatics

Scope

• Indentation, rather than braces {}, defines the current block

• When the indentation ends, that scope ends

• Use pass when there is nothing to do

Page 37: Python programming for Bioinformatics

• The condition is evaluated with bool() and the branch is chosen by True or False
• elif and else are optional

if statement

Page 38: Python programming for Bioinformatics

• Operators : >, <, >=, <=, !=, ==
• Strings compare in alphabetical order
• Sequences compare element by element, starting from the first
• Different types (Python 2 only): number < dict < list < string < tuple
• hasattr
• Membership is checked with "in"; for a dict it checks the keys by default
• Identity (same object) is checked with "is"
• None, 0, 0.0, "", [], (), {} are false

if statement

Page 39: Python programming for Bioinformatics

• not
• and, or
  – True and True → True
  – True and False → False
  – False and False → False
  – False or False → False
  – True or False → True
  – and/or return one of their operands, so they can be used as a one-line if statement
    • if a: return b else: return c  →  a and b or c
• all(), any()

Logical calculation
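One caveat about the "a and b or c" idiom is worth a sketch: it falls through to c whenever b itself is false. The function names below are illustrative; the conditional expression "b if a else c" has been available since Python 2.5.

```python
def pick_with_and_or(a, b, c):
    """The one-line idiom from the slide."""
    return a and b or c


def pick_with_conditional(a, b, c):
    """The conditional expression, which has no special case for a false b."""
    return b if a else c


# Both agree when b is a true value.
print(pick_with_and_or(True, "yes", "no"))        # yes
print(pick_with_conditional(True, "yes", "no"))   # yes

# They differ when b is false (0, "", [], None, ...).
print(pick_with_and_or(True, 0, "no"))            # no
print(pick_with_conditional(True, 0, "no"))       # 0
```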

Page 40: Python programming for Bioinformatics

• Iterates over the elements of a sequence

for statement

Page 41: Python programming for Bioinformatics

• For a dict
  – keys(), values(), items()
  – The default is keys()
• For a list
  – Iterate over a [:] copy when the loop modifies the list itself

for statement

Page 42: Python programming for Bioinformatics

• range([start,] stop[, step]) → list of integers

The range() function

Page 43: Python programming for Bioinformatics

Printing the multiplication table

    print 2, "*", 1, "=", 2 * 1
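The single print line above extends naturally to a whole table with range(). The following is a minimal sketch; the function name gugudan is hypothetical, and it returns strings so the same code runs on Python 2.7 and Python 3.

```python
def gugudan(n):
    """Return the lines of the n-times table, from n * 1 to n * 9."""
    # range(1, 10) yields 1..9 because the stop value is excluded.
    return ["%d * %d = %d" % (n, k, n * k) for k in range(1, 10)]


for line in gugudan(2):
    print(line)
```

Calling gugudan for each n in range(2, 10) would print the full table.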

Page 44: Python programming for Bioinformatics

• enumerate(sequence[, start=0])

The enumerate() function

Page 45: Python programming for Bioinformatics

• break : breaks out of the loop
• continue : skips to the next iteration of the loop
• else : runs when the loop ends without break

break and continue statement
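A short sketch of how break, continue and the loop's else clause interact, using a hypothetical helper that looks for the first stop codon in a list. The names STOP_CODONS and first_stop_codon are illustrative.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def first_stop_codon(codons):
    """Return the index of the first stop codon, or None if there is none."""
    for index, codon in enumerate(codons):
        if not codon:
            continue            # skip empty entries, go to the next iteration
        if codon in STOP_CODONS:
            found = index
            break               # leave the loop early; else is skipped
    else:
        found = None            # runs only when the loop was never broken
    return found


print(first_stop_codon(["ATG", "GGC", "TAA", "TGG"]))
print(first_stop_codon(["ATG", "", "GGC"]))
```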

Page 46: Python programming for Bioinformatics

• Repeats while an expression is true

• Can be used with break, continue, else

while statement

Page 47: Python programming for Bioinformatics

Fibonacci series

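• A man put a pair of rabbits in a place surrounded on all sides by a wall. If every pair bears a new pair each month, and each new pair can itself breed from its second month on, how many pairs of rabbits will have been produced from the original pair after one year?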
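A minimal sketch of the series behind the rabbit problem, using a while loop and tuple assignment. The function name and the choice to start from 1, 1 are assumptions; how the terms map onto months depends on how the first month is counted.

```python
def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting 1, 1."""
    series = []
    a, b = 1, 1
    while len(series) < count:
        series.append(a)
        a, b = b, a + b     # both right-hand values are computed before assigning
    return series


print(fibonacci(12))
```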

Page 48: Python programming for Bioinformatics

• L.sort(cmp=None, key=None, reverse=False)
• Built-in functions: sorted, reversed
• cmp(x, y) → -1, 0, 1
• key : a function that returns the value to compare by
• Dictionary sorting (by key, by value)

Sort
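Sorting a dictionary by key or by value can be sketched with sorted() and a key function. The word counts reuse the sample numbers from the word-frequency exercise; the variable names are illustrative. The key= form is used because it works on both Python 2.7 and Python 3 (cmp= was removed in Python 3).

```python
counts = {"the": 32, "of": 28, "boy": 17}

# Sorting the (key, value) pairs orders them by key first.
by_key = sorted(counts.items())

# A key function picks the value to compare; reverse=True puts the largest first.
by_value = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for word, count in by_value:
    print(count, word)
```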

Page 49: Python programming for Bioinformatics

• L = [k*k for k in range(10) if k > 0]

• The equivalent loop:

    L = []
    for k in range(10):
        if k > 0:
            L.append(k*k)

List comprehension

Page 50: Python programming for Bioinformatics

FUNCTION MODULE PACKAGE

Page 51: Python programming for Bioinformatics

Functions

Input: 2, 4

Body: 2 ** 4

Output: 16

Function name : pow

    def pow(a, b):
        result = a ** b
        return result

Page 52: Python programming for Bioinformatics

Quadratic equation

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

Example: 23x^2 + 43.2x + 34 = 0

Page 53: Python programming for Bioinformatics

    import cmath

    def quadratic_equation(a, b, c):
        # cmath.sqrt also handles a negative discriminant (complex roots)
        in_sqrt = cmath.sqrt(b*b - 4*a*c)
        x1 = (-b + in_sqrt) / (2.0*a)
        x2 = (-b - in_sqrt) / (2.0*a)
        return x1, x2

    print quadratic_equation(23, 43.2, 34)
    ((-0.9391304347+0.7722013312j), (-0.9391304347-0.7722013312j))

Quadratic equation

Page 54: Python programming for Bioinformatics

Defining Functions

Page 55: Python programming for Bioinformatics

Defining Functions

Page 56: Python programming for Bioinformatics

    h = 5

    def f():
        a = h + 10   # can refer to the global name but not change it
        print a

    def f():
        global h     # declare h global in order to rebind it
        h = h + 10
        print h

    print h

Globals and locals

Page 57: Python programming for Bioinformatics

Search names from inside to outside

    x = 2
    def F():
        x = 1
        def G():
            print x
        G()
    F()

Nested function

Page 58: Python programming for Bioinformatics

    def write_multiple_items(file, separator, *args):
        file.write(separator.join(args))

    >>> # Unpacking argument lists
    >>> range(3, 6)
    [3, 4, 5]
    >>> args = [3, 6]
    >>> range(*args)
    [3, 4, 5]

Arbitrary argument lists

Page 59: Python programming for Bioinformatics

Keyword arguments

Page 60: Python programming for Bioinformatics

    def cheeseshop(kind, *args, **kwargs):
        assert type(args) == tuple   # extra positional arguments arrive as a tuple
        assert type(kwargs) == dict

    >>> # Unpacking argument dict
    >>> mydict = {'client': 'michael', 'sketch': 'cap'}
    >>> mylist = ['b', 'c']
    >>> cheeseshop('a', 'b', 'c', client='michael', sketch='cap')
    >>> cheeseshop('a', *mylist, **mydict)

Arbitrary keyword argument

Page 61: Python programming for Bioinformatics

• {'one': 2, 'two': 3}
• dict(one=2, two=3)
• dict({'one': 2, 'two': 3})
• dict(zip(('one', 'two'), (2, 3)))
• dict([['two', 3], ['one', 2]])

Various ways for dict

Page 62: Python programming for Bioinformatics

• Lisp, Haskell
• A function is an object
• Recursion
• Expression evaluation instead of statements
• It can be used in control flow
• In Python: map, reduce, filter, lambda, list comprehension (, iterator, generator)

Python functional programming

Page 63: Python programming for Bioinformatics

    >>> def sqr(x): return x*x
    ...
    >>> def cube(x): return x*x*x
    ...
    >>> sqr
    <function sqr at …>
    >>> a = [sqr, cube]
    >>> a[0](2)
    4
    >>> def compose(f, g, x): return f(g(x))
    ...
    >>> compose(sqr, cube, 2)
    64

Python functional programming

Page 64: Python programming for Bioinformatics

    >>> f = lambda x: x+4
    >>> f(3)
    7
    >>> a = ['1', '2', '3']
    >>> map(int, a)
    [1, 2, 3]
    >>> reduce(lambda x, y: x+y, [1,2,3,4,5])   # ((((1+2)+3)+4)+5)
    15
    >>> filter(lambda x: x%2, range(10))
    [1, 3, 5, 7, 9]

lambda, map, reduce, filter

Page 65: Python programming for Bioinformatics

• A file containing Python definitions and statements.
• The file name is the module name plus the .py suffix
• If fib.py defines the functions fib and fib2:
  ◦ from fib import fib, fib2
  ◦ fib(3)
  or
  ◦ import fib
  ◦ fib.fib2(3)
  or
  ◦ from fib import *
  ◦ fib(3)

Module

Page 66: Python programming for Bioinformatics

• The file must be in the current directory or in a directory listed in the PYTHONPATH environment variable

• Or extend the search path at run time:

    import sys
    sys.path.append('/mydirectory')

import path

Page 67: Python programming for Bioinformatics

• Byte compilation for the virtual machine
  ◦ The import statement looks for the .pyc file first.
  ◦ If the .py file is newer, it is recompiled
  ◦ With the -O option (optimization), a .pyo file is written
• To reload a module, use reload(module)
• When a module is run as a program, __name__ is '__main__'; when it is imported, __name__ is the module name
• So, for code that should run only as a script (not on import), use

    if __name__ == "__main__":
        fib(3)

Module

Page 68: Python programming for Bioinformatics

• A way of structuring the Python module namespace by using "dotted module names"

Packages

Page 69: Python programming for Bioinformatics

    import sound.effects.echo
    sound.effects.echo()…

Or

    from sound.effects import echo
    echo()…

Or

    from sound.effects import echo as myecho
    myecho()…

Packages

Page 70: Python programming for Bioinformatics

STRING FORMATTING AND FILE IO

Page 71: Python programming for Bioinformatics

• str – converts to an informal (human-readable) string
• repr – converts to a formal string
• eval – evaluates a string (expression)
• exec – executes a string (statement)

Functions for string

    >>> s = 'Hello, world'
    >>> str(s)
    'Hello, world'
    >>> repr(s)
    "'Hello, world'"
    >>> str(0.1)
    '0.1'
    >>> repr(0.1)        # Python 2.6 and earlier; 2.7 prints '0.1'
    '0.10000000000000001'

    >>> s1 = repr([1,2,3])
    >>> s1
    '[1, 2, 3]'
    >>> eval(s1)
    [1, 2, 3]
    >>> a = 1
    >>> a = eval('a + 4')
    >>> a
    5
    >>> exec 'a = a + 4'
    >>> a
    9

Page 72: Python programming for Bioinformatics

String formatting (old style)

    >>> template = "My name is %s and I have %i won"
    >>> template % ("yong", 1000)
    'My name is yong and I have 1000 won'
    >>> template = "My name is %(name)s and I have %(money)i won"
    >>> template % {'name': 'yong', 'money': 1000}
    'My name is yong and I have 1000 won'
    >>> name = 'yong'
    >>> money = 1000
    >>> template % locals()
    'My name is yong and I have 1000 won'
    >>> import math
    >>> print 'The value of PI is approximately %5.3f.' % math.pi
    The value of PI is approximately 3.142.

Page 73: Python programming for Bioinformatics

• Similar to sprintf() in C

String formatting operation

Page 74: Python programming for Bioinformatics

String formatting (new style) 1

    >>> template = "My name is {0} and I have {1} won"
    >>> template.format("yong", 1000)
    'My name is yong and I have 1000 won'
    >>> template = "My name is {name} and I have {money} won"
    >>> template.format(name='yong', money=1000)
    'My name is yong and I have 1000 won'
    >>> print 'The story of {0}, {1}, and {other}.'.format('Bill', 'Manfred',
    ...                                                    other='Georg')
    The story of Bill, Manfred, and Georg.
    >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
    >>> for name, phone in table.items():
    ...     print '{0:10} ==> {1:10d}'.format(name, phone)
    ...
    Jack       ==>       4098
    Dcab       ==>       7678
    Sjoerd     ==>       4127

Page 75: Python programming for Bioinformatics

String formatting (new style) 2

    "First, thou shalt count to {0}"   # References first positional argument
    "My quest is {name}"               # References keyword argument 'name'
    "Weight in tons {0.weight}"        # 'weight' attribute of first positional arg
    "Units destroyed: {players[0]}"    # First element of keyword argument 'players'.

    "Harold's a clever {0!s}"          # Calls str() on the argument first
    "Bring out the holy {name!r}"      # Calls repr() on the argument first

    "A man with two {0:{1}}".format("noses", 10)

Page 76: Python programming for Bioinformatics

• mode : 'r', 'w', 'a', 'r+', 'b', 'rb', 'wb'
• Methods : read, readline, readlines, write

Reading and Writing Files

    >>> f = open('/tmp/workfile', 'w')
    >>> print f
    <open file '/tmp/workfile', mode 'w' at 80a0960>

    >>> f.read()
    'This is the entire file.\n'
    >>> f.read()
    ''
    >>> f.readline()
    'This is the first line of the file.\n'
    >>> f.readline()
    'Second line of the file\n'
    >>> f.readline()
    ''

    >>> f.readlines()
    ['This is the first line of the file.\n', 'Second line of the file\n']
    >>> for line in f:
    ...     print line,
    ...
    This is the first line of the file.
    Second line of the file
    >>> f.write(str(43))

Page 77: Python programming for Bioinformatics

• f.tell() : returns the current position
• f.seek(offset, from_what) : moves to the given position
  – from_what : 0 = start of file, 1 = current position, 2 = end of file

Position in the file

    >>> f = open('/tmp/workfile', 'r+')
    >>> f.write('0123456789abcdef')
    >>> f.seek(5)       # Go to the 6th byte in the file
    >>> f.read(1)
    '5'
    >>> f.seek(-3, 2)   # Go to the 3rd byte before the end
    >>> f.read(1)
    'd'

Page 78: Python programming for Bioinformatics

• f.close() : closes the file and frees any system resources it used
• The "with" keyword closes the file automatically

File close

    >>> f.close()
    >>> f.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    ValueError: I/O operation on closed file

    >>> with open('/tmp/workfile', 'r') as f:
    ...     read_data = f.read()
    >>> f.closed
    True

Page 79: Python programming for Bioinformatics

• stdin, stdout, stderr
• Command-line redirection and pipes: ">", "2>", "<", "|"

• Different from sys.argv

Standard IO

    def work(input_file, output_file):
        output_file.write("<")
        output_file.write(input_file.read())
        output_file.write(">")

    work(open('a.txt'), open('b.txt', 'w'))

    import sys
    work(sys.stdin, sys.stdout)
    -----
    python a.py < a.txt > b.txt

Page 80: Python programming for Bioinformatics

• A virtual file in memory

StringIO

    from cStringIO import StringIO

    handle = StringIO("""\
    > test fasta
    AGTCAGTCAGTCCCCC
    """)

    for line in handle:
        print line

Page 81: Python programming for Bioinformatics

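• Encoding a character set into codes
  – ASCII : 7-bit encoding
  – ISO-Latin1 (ISO-8859-1) : 8 bits, covers most Western European languages
  – Hangul Johab (combination type)
  – Hangul Wansung (precomposed type), EUC-KR, CP949 : 2 bytes
• Compatibility problems between encodings
• Unicode

Character encoding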

Page 82: Python programming for Bioinformatics

• A system for representing every character in the world
• Goal : replace all encoding schemes with Unicode
• Encodings
  – UTF-7
  – UTF-8 : variable length (1 to 4 bytes per character)
  – UTF-16

Unicode

Page 83: Python programming for Bioinformatics

• 'a' → str
• u'a' → unicode
• 'a' + u'bc' → u'abc'

• unicode ↔ str
• encode/decode
• Character set

Unicode in Python

Page 84: Python programming for Bioinformatics

• Python default encoding : ASCII → UTF-8
  – Used in unicode file IO
• Source code encoding : ASCII
  – # -*- coding:utf-8 -*-
• If Hangul text comes out garbled,
  – check the encoding of the stored data
  – check the display environment (terminal, editor, web browser, etc.)

Unicode in Python 2

Page 85: Python programming for Bioinformatics

Hangul examples

Requires hangul.py

    >>> import hangul
    >>> haveJongsung = lambda u: bool(hangul.split(u[-1])[-1])
    >>> haveJongsung(u'자음')
    True
    >>> haveJongsung(u'자')
    False

Page 86: Python programming for Bioinformatics

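Exercise

Write a program that counts how often each word occurs in an arbitrary text file and prints the words sorted from most to least frequent (exclude special characters, convert everything to lowercase)

    $ python word_frequency.py < input.txt
    32 the
    28 of
    17 boy
    …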

Page 87: Python programming for Bioinformatics

Exercise

Read the DNA sequence stored in an arbitrary FASTA file and compute its GC content (single FASTA format)

    $ python gc_content.py < input.fasta
    0.55

Page 88: Python programming for Bioinformatics

Exercise

Read the DNA sequence stored in an arbitrary FASTA file and print its reverse complement (single FASTA)

    $ python reverse_complement.py < input.fasta
    > seq1 reverse complement
    AGTCAAGGCCAAGTCCAAAGCAGCAGGAGCCAAGGT

Page 89: Python programming for Bioinformatics

EXCEPTION AND TEST

Page 90: Python programming for Bioinformatics

Exception

• Errors detected during execution

Page 91: Python programming for Bioinformatics

• Organized by class inheritance

• BaseException
  – SystemExit
  – KeyboardInterrupt
  – GeneratorExit
  – Exception

Built-in Exceptions

Page 92: Python programming for Bioinformatics

• By default, when an exception is raised, the program stops and shows an error message

• try/except/else/finally statement

Handling Exceptions

    try:
        statements…   #1
    except (exception types):
        statements…   #2  runs if #1 raises one of the listed exceptions
    else:
        statements…   #3  runs if #1 raises nothing
    finally:
        statements…   #4  always runs
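The order in which the four blocks run can be recorded explicitly. This is a minimal sketch; the function name safe_divide and the steps list are illustrative, not part of the slides.

```python
def safe_divide(a, b):
    """Divide a by b and record which blocks of the try statement ran."""
    steps = []
    try:
        result = a / b
    except ZeroDivisionError:
        steps.append("except")   # only when the division failed
        result = None
    else:
        steps.append("else")     # only when no exception was raised
    finally:
        steps.append("finally")  # in both cases
    return result, steps


print(safe_divide(6, 3))
print(safe_divide(1, 0))
```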

Page 93: Python programming for Bioinformatics

Handling Exceptions example

Page 94: Python programming for Bioinformatics

Handling Exceptions example

Page 95: Python programming for Bioinformatics

Handling Exceptions example

    def dosomething():
        a = 1/0

    try:
        dosomething()
    except ArithmeticError:
        print 'Exception occurred'

Page 96: Python programming for Bioinformatics

Raising Exceptions

• “raise” statement allows the programmer to force a specified exception to occur.

Page 97: Python programming for Bioinformatics

User-defined Exceptions

• Programs may name their own exceptions by creating a new exception class
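As a sketch of raising and catching a user-defined exception, the example below validates a DNA string. The class InvalidBaseError and the function check_dna are hypothetical names chosen for illustration.

```python
class InvalidBaseError(Exception):
    """Raised when a sequence contains a character other than A, C, G or T."""

    def __init__(self, base):
        Exception.__init__(self, "invalid DNA base: %r" % base)
        self.base = base


def check_dna(sequence):
    """Return True for a valid DNA string, raise InvalidBaseError otherwise."""
    for base in sequence.upper():
        if base not in "ACGT":
            raise InvalidBaseError(base)   # force the exception to occur
    return True


try:
    check_dna("AGTX")
except InvalidBaseError as error:
    caught = error.base
    print("caught: %s" % error)
```

Because InvalidBaseError inherits from Exception, an existing "except Exception" handler would also catch it.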

Page 98: Python programming for Bioinformatics

Assert statement

• Usually, used when debugging

    a = 30
    margin = 2 * 0.2
    assert margin > 10, 'not enough margin %s' % margin

is equivalent to

    if not margin > 10:
        raise AssertionError('not enough margin %s' % margin)

Page 99: Python programming for Bioinformatics

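• A traditional testing method in which the program is split into small units and each unit is tested

• Why is it needed?

• The idea of regression testing
  – Assume an artificial situation, have the test module compute a result for it, and check that the computed value equals the expected value.
  – Keep checking so that the program does not regress.
  – Check for regressions caused not only by program changes but also by changes in the platform or surrounding environment.

What is a unit test?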

bioxp

Page 100: Python programming for Bioinformatics

• Programming driven by tests
• Basic cycle
  – Write a test
  – Make it compile, run it to see it fail
  – Make it run
  – Remove duplication

Test Driven Development

bioxp
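One way to picture the cycle is a small unittest case written before the function it tests. This is a sketch under assumptions: gc_content and GcContentTest are hypothetical names, and the tests are run through a TextTestRunner so the snippet can be executed as a plain script.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA string."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / float(len(sequence))


class GcContentTest(unittest.TestCase):
    # Each test states an expected value for an artificial input.
    def test_half(self):
        self.assertEqual(gc_content("AGCT"), 0.5)

    def test_lower_case(self):
        self.assertEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence(self):
        self.assertEqual(gc_content(""), 0.0)


suite = unittest.TestLoader().loadTestsFromTestCase(GcContentTest)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```

In the test-first order, the three test methods would be written and seen to fail before gc_content is filled in.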

Page 101: Python programming for Bioinformatics

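• The tests contain a working description of how the real code is used. (The interface gets defined.)
• No separate testing step is needed.
• Every code change is checked against the existing tests, so integrated testing is maintained.
• Code that is easy to test is easy to maintain.
• You find out "quickly" when the program is wrong (or are more likely to). (Fail early, often)
• When implementing a feature, it leads you to think first about how it will be used. (Programming by intention)
• When the code has to be improved (or simply read) after a long time, it helps you get back into it quickly. (Documentation)

Advantages of TDD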

bioxp

Page 102: Python programming for Bioinformatics

CLASS

Page 103: Python programming for Bioinformatics

• Programming paradigms
  – Procedural programming
  – Functional programming
  – Object oriented programming
• Its origin is the modeling of cells
• Modeling of the real world → easy maintenance
• The keys : remove duplication, easy management

Object Oriented

Page 104: Python programming for Bioinformatics

Example calculate average

Name Korean English Math Science

smith 80 69 70 88

neo 92 66 80 72

trinity 82 73 91 90

oracle 80 42 100 92

Page 105: Python programming for Bioinformatics

Procedural example

    smith_korean = 80
    smith_english = 69
    smith_math = 70
    smith_science = 88

    neo_korean = 92
    neo_english = 66
    neo_math = 80
    neo_science = 72

    smith_average = (smith_korean + smith_english + smith_math + smith_science) / 4.0
    neo_average = (neo_korean + neo_english + neo_math + neo_science) / 4.0

Page 106: Python programming for Bioinformatics

Functional example

    def average(alist):
        return sum(alist) / float(len(alist))

    smith = {'korean': 80, 'english': 69, 'math': 70, 'science': 88}
    neo = {'korean': 92, 'english': 66, 'math': 80, 'science': 72}

    smith_average = average(smith.values())
    neo_average = average(neo.values())

Page 107: Python programming for Bioinformatics

Object oriented example

    class Score:
        def __init__(self, korean, english, math, science):
            self.korean = korean
            self.english = english
            self.math = math
            self.science = science
        def get_average(self):
            return (self.korean + self.english + self.math + self.science) / 4.0

    smith_score = Score(80, 69, 70, 88)
    smith_average = smith_score.get_average()
    neo_score = Score(92, 66, 80, 72)
    neo_average = neo_score.get_average()

Page 108: Python programming for Bioinformatics

• A collection of variables and functions
• It is a kind of namespace

    class Person:
        name = 'yong'
        gender = 'male'
        def get_age(self):
            return 27

    Person.name, Person().get_age()   # a method needs an instance to be called on

Class statement

Page 109: Python programming for Bioinformatics

smith_score = Score(80, 69, 70, 88)

object = Class()

Fish-bread mold (class) → fish bread (instance)

Instance(object) and Class

Page 110: Python programming for Bioinformatics

    class Person:
        count = 0                            # class variable
        def __init__(self, name, gender):    # constructor
            self.name = name                 # instance variables
            self.gender = gender
            Person.count += 1
        def set_age(self, age):              # method (instance function)
            self.age = age

    yong = Person('yong', 'male')            # object
    yong.set_age(27)                         # method call

Page 111: Python programming for Bioinformatics

• Constructor : the function that runs first when an object is created

Constructor, Destructor

    class Person:
        count = 0
        def __init__(self, name, gender):
            self.name = name
            self.gender = gender
            Person.count += 1
        def __del__(self):               # destructor
            Person.count -= 1

Page 112: Python programming for Bioinformatics

• __add__(self, other) : +
• __sub__(self, other) : -
• __mul__(self, other) : *
• __div__(self, other) : /
• __mod__(self, other) : %
• __and__(self, other) : &
• __or__(self, other) : |

Operator overloading

    class MyString:
        def __init__(self, str):
            self.str = str
        def __div__(self, sep):
            return self.str.split(sep)

    >>> m = MyString("abcdef")
    >>> print m / "b"
    ['a', 'cdef']

Page 113: Python programming for Bioinformatics

• Is-a relationship. “Man is a person”

Inheritance

    class Man(Person):
        def __init__(self, name):
            Person.__init__(self, name, 'male')

    class Woman(Person):
        def __init__(self, name):
            Person.__init__(self, name, 'female')

    yong = Man('yong')
    yong.gender, yong.set_age(27)

Page 114: Python programming for Bioinformatics

Multiple Inheritance

    class Singer:
        def song(self):
            print "Oh my love~"

    class ManSinger(Man, Singer):
        def __init__(self, name):
            Man.__init__(self, name)

    yong = ManSinger('yong')
    yong.song()

Page 115: Python programming for Bioinformatics

Subclassing

    class MyList(list):
        def __sub__(self, other):
            L = self[:]
            for x in other:
                if x in L:
                    L.remove(x)
            return L

    >>> L = MyList([1, 2, 3, 'spam', 4, 5])
    >>> L = L - ['spam']
    >>> print L
    [1, 2, 3, 4, 5]

Page 116: Python programming for Bioinformatics

Polymorphism

    class Animal:
        def cry(self):
            print '…'

    class Dog(Animal):
        def cry(self):
            print 'Woof woof'

    class Duck(Animal):
        def cry(self):
            print 'Quack quack'

    for each in (Animal(), Dog(), Duck()):
        each.cry()

Page 117: Python programming for Bioinformatics

• has-a relationship.

Composition

    class MySet(list):
        def union(self, A):
            result = self[:]
            for x in A:
                if x not in result:
                    result.append(x)
            return MySet(result)

    A = MySet([1, 2, 3])
    B = MySet([3, 4, 5])
    print A.union(B)

Page 118: Python programming for Bioinformatics

• Python does not support fully private names
• "from module import *" does not import names starting with "_"

Encapsulation

    class Encapsulation:
        z = 10
        __x = 1

    >>> Encapsulation.z
    >>> Encapsulation.__x                  # AttributeError: the name is mangled
    >>> Encapsulation._Encapsulation__x    # the mangled name is still reachable

Page 119: Python programming for Bioinformatics

DECORATOR ITERATOR GENERATOR

Page 120: Python programming for Bioinformatics

• A function that decorates another function

Decorator

    @A
    @B
    @C
    def f():
        ….

is equivalent to

    def f():
        ….
    f = A(B(C(f)))

    def mydecorator(function):
        def wrapper(*args, **kwargs):
            ## do something for decoration
            result = function(*args, **kwargs)
            return result
        return wrapper

    def require_int(function):
        def wrapper(arg):
            assert isinstance(arg, int)
            return function(arg)
        return wrapper

    @require_int
    def p1(arg):
        print arg

Page 121: Python programming for Bioinformatics

• A decorator can also take arguments.

Decorator

    def mydecorator(arg1, arg2):
        def _mydecorator(function):
            def __mydecorator(*args, **kwargs):
                # do something for decoration
                result = function(*args, **kwargs)
                return result
            return __mydecorator
        return _mydecorator

    @mydecorator(1, 2)
    def f(arg):
        ….

Page 122: Python programming for Bioinformatics

• A method that can be called directly on the class name without creating an instance

• Calling an instance method this way gives an unbound method error

Static method

    class D:
        def spam(x, y):    # no self
            print 'static method', x, y
        spam = staticmethod(spam)

    D.spam(1, 2)

    class D:
        @staticmethod
        def spam(x, y):
            print 'static method', x, y

Page 123: Python programming for Bioinformatics

• Whereas an ordinary method receives the instance object as its first argument (self), a class method receives the class object as its first argument.

Class method

    class C:
        def spam(cls, y):    # no self
            print 'class method', cls, y
        spam = classmethod(spam)

    >>> C.spam(5)
    class method __main__.C 5

    class C:
        @classmethod
        def spam(cls, y):
            print 'class method', cls, y

Page 124: Python programming for Bioinformatics

• Makes it convenient to define how a member value is read and written
• <Example> Store a value in degree, normalized to the range below 360 degrees

get/set property

    class D(object):
        def __init__(self):
            self.__degree = 0
        def get_degree(self):
            return self.__degree
        def set_degree(self, d):
            self.__degree = d % 360
        degree = property(get_degree, set_degree)

    d = D()
    d.degree = 10     # set_degree call
    print d.degree    # get_degree call

Page 125: Python programming for Bioinformatics

• For data that is accessed sequentially, where access by index is meaningless; memory efficient

• Created with the built-in iter() function; has a next() method
• Raises StopIteration when there is no more data to hand over

Iterator

    >>> I = iter([1,2,3])
    >>> I
    <iterator object at 0x1234556>
    >>> I.next()
    1
    >>> I.next()
    2
    >>> I.next()
    3

    >>> I.next()    # StopIteration when there is no more data
    Traceback (most recent call last):
      File "<pyshell#71>", line 1, in ?
        I.next()
    StopIteration

Page 126: Python programming for Bioinformatics

• iter(s) calls s.__iter__(), which returns an iterator object.

• The iterator object has a next() method.

Iterator on class

    class Seq:
        def __init__(self, fname):
            self.file = open(fname)
        def __iter__(self):
            return self
        def next(self):
            line = self.file.readline()
            if not line:
                raise StopIteration
            return line

    >>> s = Seq('readme.txt')
    >>> for line in s:
    ...     print line,

Page 127: Python programming for Bioinformatics

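• Influenced by the Icon language
• Conventional function calls: arguments and local variables are created and destroyed on the stack
• A generator is a function that can be resumed (from the point where it was suspended)
• Any function that contains yield is a generator function
• A kind of custom-made iterator

Generator

With yield (a generator that resumes after each value):

    def generate_int(n):
        while True:
            yield n
            n += 1

With return (an ordinary function that returns n once and ends):

    def generate_int(n):
        while True:
            return n
            n += 1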

Page 128: Python programming for Bioinformatics

• Fibonacci series

Generator example

    def fibonacci(a=1, b=1):
        while 1:
            yield a
            a, b = b, a+b

    t = fibonacci()    # t is an iterator
    for i in range(10):
        print t.next(),

Page 129: Python programming for Bioinformatics

• Building the set of odd numbers (with an iterator)

Generator example

    class Odds:
        def __init__(self, limit=None):
            self.data = -1
            self.limit = limit
        def __iter__(self):
            return self
        def next(self):
            self.data += 2
            if self.limit and self.limit <= self.data:
                raise StopIteration
            return self.data

    >>> for k in Odds(20):
    ...     print k,
    1 3 5 7 9 11 13 15 17 19

Page 130: Python programming for Bioinformatics

• Building the set of odd numbers (with a generator)

Generator example

    def odds(limit=None):
        k = 1
        while not limit or limit >= k:
            yield k
            k += 2

    >>> for k in odds(20):
    ...     print k,
    1 3 5 7 9 11 13 15 17 19

Page 131: Python programming for Bioinformatics

• List comprehension

    >>> [k for k in range(100) if k % 5 == 0]
    [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

• Generator expression

    >>> (k for k in range(100) if k % 5 == 0)
    <generator object at 0x40190e4c>

• Other examples

    >>> sum(x for x in range(1, 20) if x % 2)
    >>> ", ".join(x for x in ["abc", "def"] if x.startswith("a"))

Generator expression

Page 132: Python programming for Bioinformatics

• iter()
• xrange()
• dict.iteritems(), dict.iterkeys(), dict.itervalues()
• A file supports line-by-line iteration.
• reversed() does not accept an iterator.

Where iterators / generators are used

Page 133: Python programming for Bioinformatics

• http://docs.python.org/library/itertools.html
• Infinite iterators: count, cycle, repeat
• Finite iterators: chain, groupby, ifilter,…
• Combinatoric iterators: product, permutations, combinations

Itertools module

    >>> from itertools import groupby
    >>> text = "Hello world my world"
    >>> wd = dict((k, len(list(v)))
    ...           for k, v in groupby(sorted(text.split()), lambda x: x.upper()))

Page 134: Python programming for Bioinformatics

Exercise

Read the DNA sequences stored in an arbitrary FASTA file and print the reverse complement of each (multiple FASTA)

    $ python reverse_complement.py < input.fasta
    > seq1 reverse complement
    AGTCAAGGCCAAGTCCAAAGCAGCAGGAGCCAAGGT
    > seq2 reverse complement
    AGTCAAGGCCAAGTCCAAAGCAGCAGGAGCCAAGGT

Page 135: Python programming for Bioinformatics

STANDARD LIBRARIES

Page 136: Python programming for Bioinformatics

• Modules and packages that perform specific functions; import them to use them
• Python libraries
  – Built-in library : installed together with Python
    • math, StringIO, random, unittest, re, itertools, decimal
    • os, sys, subprocess, glob, pickle, csv, datetime, Tkinter, …
  – 3rd party library : must be installed separately
    • wxPython, PythonWin, numpy, scipy, matplotlib, Biopython, PIL, BeautifulSoup…

Library?

Page 137: Python programming for Bioinformatics

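• Miscellaneous operating system interfaces
• Use operating-system-dependent features in a consistent way
• os.name → 'posix', 'nt', 'mac', 'os2', 'ce', 'java', 'riscos'
• os.environ → dictionary of system environment variables
• os.chdir → change the current directory
• os.stat → file attributes
• os.walk → batch work over all files under a given directory
• os.fork → fork a process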

os

Page 138: Python programming for Bioinformatics

• System-specific parameters and functions
• Constants and functions related to the running system
• sys.argv → command-line arguments
• sys.getdefaultencoding() → default encoding
• sys.stdin, sys.stdout → standard input and output

sys

Page 139: Python programming for Bioinformatics

• Subprocess management
• Using other programs
  – Shell pipeline
  – Process spawn
• Popen, PIPE

Subprocess
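A minimal sketch of spawning another program with Popen and reading its output through a PIPE. To stay portable it assumes only that the current Python interpreter (sys.executable) can be launched as the child process; any other command line could be substituted.

```python
import subprocess
import sys

# Run a child Python process and capture what it writes to stdout.
command = [sys.executable, "-c", "print('hello from child')"]
process = subprocess.Popen(command, stdout=subprocess.PIPE)

# communicate() waits for the child to finish and returns (stdout, stderr).
output, _ = process.communicate()
text = output.decode("ascii").strip()

print(text)
print(process.returncode)
```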

Page 140: Python programming for Bioinformatics

• Unix style pathname pattern expansion
• Search for files in a directory using wildcards
  – ? : any single character
  – * : zero or more of any characters
  – [ ] : one of the characters listed between the brackets
  – - : a range, e.g. a-z

glob
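The wildcard patterns can be tried safely in a temporary directory. This is a sketch with made-up file names; the directory is created with tempfile and removed afterwards.

```python
import glob
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
try:
    # Create three empty example files.
    for name in ("a.fasta", "b.fasta", "notes.txt"):
        open(os.path.join(workdir, name), "w").close()

    # "*" matches any run of characters, "?" exactly one character.
    fasta_paths = glob.glob(os.path.join(workdir, "*.fasta"))
    fasta_names = sorted(os.path.basename(path) for path in fasta_paths)
    one_letter = glob.glob(os.path.join(workdir, "?.fasta"))
    a_only = glob.glob(os.path.join(workdir, "[a].*"))
finally:
    shutil.rmtree(workdir)

print(fasta_names)
```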

Page 141: Python programming for Bioinformatics

• Python object serialization
• Saving Python objects (to a file)
• dump() and load()

pickle
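A small round trip with dump() and load(). To avoid touching the file system, this sketch writes to an in-memory io.BytesIO buffer instead of a real file; the record contents are made up.

```python
import io
import pickle

record = {"id": "seq1", "sequence": "AGTC", "length": 4}

# dump() serializes the object into any binary file-like object.
buffer = io.BytesIO()
pickle.dump(record, buffer)

# load() reads it back; rewind the buffer first.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == record)
```

With a real file, the same calls would take a handle opened in 'wb' or 'rb' mode.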

Page 142: Python programming for Bioinformatics

• CSV File Reading and Writing
• A common way of handling tabular data
• CSV format : columns are separated by ","; data containing special characters is quoted with '"'
• reader() and writer()

csv
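A hedged sketch of writer() and reader() working together. It targets Python 3, where csv works on text, and uses an in-memory io.StringIO buffer in place of a file; the row contents are invented for illustration.

```python
import csv
import io

buffer = io.StringIO()

# writer() quotes a field automatically when it contains the delimiter.
writer = csv.writer(buffer)
writer.writerow(["name", "note"])
writer.writerow(["seq1", "short, test"])
raw_text = buffer.getvalue()

# reader() undoes the quoting and returns each row as a list of strings.
buffer.seek(0)
rows = list(csv.reader(buffer))

print(rows)
```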

Page 143: Python programming for Bioinformatics

• Basic date and time types
• Handling dates and times
• date, time, datetime, timedelta, tzinfo

datetime
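The main types can be seen together in a few lines. The dates below are arbitrary example values.

```python
from datetime import date, datetime, timedelta

start = date(2015, 6, 23)

# Adding a timedelta to a date gives another date.
week_later = start + timedelta(days=7)

# Subtracting two dates gives a timedelta.
elapsed = date(2015, 7, 1) - start

# strftime() formats a datetime as text.
stamp = datetime(2015, 6, 23, 9, 30).strftime("%Y-%m-%d %H:%M")

print(week_later, elapsed.days, stamp)
```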

Page 144: Python programming for Bioinformatics

• Python interface to Tcl/Tk
• Built-in GUI library, independent of the operating system

Tkinter

Page 145: Python programming for Bioinformatics

• Advanced GUI library for Python
• http://www.wxpython.org
• 3rd party GUI library, independent of the operating system
• Not to be confused with PythonWin

wxPython

Page 146: Python programming for Bioinformatics

• numpy : numerical analysis with matrices and vectors
• scipy : a library of scientific computing routines
• matplotlib : a charting library influenced by MATLAB

numpy , scipy, matplotlib

Page 147: Python programming for Bioinformatics

• Python Imaging Library
• Converting and editing graphic images

PIL

Page 148: Python programming for Bioinformatics

• A library for parsing HTML and XML documents
• Interprets even invalid markup sensibly

BeautifulSoup

Page 149: Python programming for Bioinformatics

• A library for bioinformatics
• Manages biological sequences, parses many file formats, and includes major analysis algorithms

BioPython

Page 150: Python programming for Bioinformatics

www.insilicogen.com
E-mail [email protected]
Tel 031-278-0061
Fax 031-278-0062