2015 bioinformatics python_introduction_wim_vancriekinge_vfinal

Upload: prof-wim-van-criekinge

Post on 11-Jan-2017


FBW29-09-2015

Wim Van Criekinge

Bioinformatics.be

Overview

• What is Python?
• Why Python for Bioinformatics?
• How to Python
  – IDE: Eclipse & PyDev / Athena
  – Code Sharing: Git(hub)
• Examples
  – Hello World
  – PIthon


What is Python ?

• Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.

• Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.

• Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

• When he began implementing Python, Guido van Rossum was also reading the published scripts from “Monty Python's Flying Circus”, a BBC comedy series from the 1970s. Van Rossum thought he needed a name that was short, unique, and slightly mysterious, so he decided to call the language Python.

Programming Language

• Formal notation for specifying computations
  – Syntax (usually specified by a context-free grammar)
  – Semantics for each syntactic construct
  – Practical implementation on a real or virtual machine
    • Compilation vs. interpretation
    • Efficiency vs. portability
• Assembly Languages
  – Invented by machine designers in the early 1950s
  – Reusable macros and subroutines

FORTRAN

• Procedural, imperative language
  – Still used in scientific computation
• Developed at IBM in the 1950s by John Backus (1924-2007)
  – Backus’s 1977 Turing award lecture made the case for functional programming
  – On FORTRAN: “We did not know what we wanted and how to do it. It just sort of grew. The first struggle was over what the language would look like. Then how to parse expressions – it was a big problem…”
• BNF: Backus-Naur form for defining context-free grammars

LISP

• Invented by John McCarthy (1927-2011, Turing award: 1971)
• Formal notation for lambda-calculus
• Pioneered many PL concepts
  – Automated memory management (garbage collection)
  – Dynamic typing
  – No distinction between code and data
• Still in use: ACL2, Scheme, …

“Anyone could learn Lisp in one day, except that if they already knew FORTRAN, it would take three days” - Marvin Minsky

PASCAL

• Designed by Niklaus Wirth
  – 1984 Turing Award
• Revised type system of Algol
  – Good data structure concepts
    • Records, variants, subranges
  – More restrictive than Algol 60/68
    • Procedure parameters cannot have procedure parameters
• Popular teaching language
• Simple one-pass compiler

C

• Bell Labs 1972 (Dennis Ritchie)
• Development closely related to UNIX
  – 1983 Turing Award to Thompson and Ritchie
• Compiles to native code
• 1973-1980: new features; compiler ported
  – unsigned, long, union, enums
• 1978: K&R C book published
• 1989: ANSI C standardization
  – Function prototypes as in C++
• 1999: ISO 9899:1999 also known as “C99”
  – Inline functions, C++-like decls, bools, variable arrays
• Concurrent C, Objective C, C*, C++, C#
• “Portable assembly language”
  – Early C++, Modula-3, Eiffel source-translated to C

JAVA

• Sun 1991-1995 (James Gosling)
  – Originally called Oak, intended for set top boxes
• Mixture of C and Modula-3
  – Unlike C++
    • No templates (generics), no multiple inheritance, no operator overloading
  – Like Modula-3 (developed at DEC SRC)
    • Explicit interfaces, single inheritance, exception handling, built-in threading model, references & automatic garbage collection (no explicit pointers!)
• “Generics” added later

Other Important Languages

• Algol-like
  – Modula, Oberon, Ada
• Functional
  – ISWIM, FP, SASL, Miranda, Haskell, LCF, ML, Caml, Ocaml, Scheme, Common LISP
• Object-oriented
  – Smalltalk, Objective-C, Eiffel, Modula-3, Self, C#, CLOS
• Logic programming
  – Prolog, Gödel, LDL, ACL2, Isabelle, HOL

… and more

• Data processing and databases
  – Cobol, SQL, 4GLs, XQuery
• Systems programming
  – PL/I, PL/M, BLISS
• Specialized applications
  – APL, Forth, Icon, Logo, SNOBOL4, GPSS, Visual Basic
• Concurrent, parallel, distributed
  – Concurrent Pascal, Concurrent C, C*, SR, Occam, Erlang, Obliq

… and more

• Programming tool “mini-languages”
  – awk, make, lex, yacc, autoconf …
• Command shells, scripting and “web” languages
  – sh, csh, tcsh, ksh, zsh, bash …
  – Perl, JavaScript, PHP, Python, Rexx, Ruby, Tcl, AppleScript, VBScript …
• Web application frameworks and technologies
  – ASP.NET, AJAX, Flash, Silverlight …
• Note: HTML/XML are markup languages, not programming languages, but they often embed executable scripts like Active Server Pages (ASPs) & Java Server Pages (JSPs)

What is scripting ?

• Wikipedia has an informative and detailed explanation, “A scripting language, script language or extension language is a programming language that allows control of one or more software applications. "Scripts" are distinct from the core code of the application, as they are usually written in a different language and are often created or at least modified by the end-user.[1] Scripts are often interpreted from source code or bytecode, whereas the applications they control are traditionally compiled to native machine code. Scripting languages are nearly always embedded in the applications they control.[2]

• The name "script" is derived from the written script of the performing arts, in which dialogue is set down to be spoken by human actors. Early script languages were often called batch languages or job control languages. Such early scripting languages were created to shorten the traditional edit-compile-link-run process”.

What’s Driving Their Evolution?

• Constant search for better ways to build software tools for solving computational problems
  – Many PLs are general purpose tools
  – Others are targeted at specific kinds of problems
    • For example, massively parallel computations or graphics
• Useful ideas evolve into language designs
  – Algol → Simula → Smalltalk → C with Classes → C++
• Often design is driven by expediency
  – Scripting languages: Perl, Tcl, Python, PHP, etc.
    • “PHP is a minor evil perpetrated by incompetent amateurs, whereas Perl is a great and insidious evil, perpetrated by skilled but perverted professionals.” - Jon Ribbens

What Do They Have in Common?

• Lexical structure and analysis
  – Tokens: keywords, operators, symbols, variables
  – Regular expressions and finite automata
• Syntactic structure and analysis
  – Parsing, context-free grammars
• Pragmatic issues
  – Scoping, block structure, local variables
  – Procedures, parameter passing, iteration, recursion
  – Type checking, data structures
• Semantics
  – What do programs mean and are they correct

Visual history of programming languages

http://cdn.oreillystatic.com/news/graphics/prog_lang_poster.pdf

The most valuable programming skills to have on a resume


Python

• Programming languages are overrated
  – If you are going into bioinformatics you will probably learn/need multiple
  – If you know one you know 90% of a second
• Choice does matter but it matters far less than people think it does
• Why Python?
  – Lets you start writing useful programs asap
  – Built-in libraries, plus add-on packages such as Biopython
  – Free, runs on most platforms, widely used (also in science)
• Versus Perl?
  – Incredibly similar
  – Consistent syntax, indentation

http://www.python.org

Should I use Python 2 or Python 3 for my development activity?

• Short version: Python 2.x is legacy, Python 3.x is the present and future of the language

• Python 3.0 was released in 2008. The final 2.x version 2.7 release came out in mid-2010, with a statement of extended support for this end-of-life release. The 2.x branch will see no new major releases after that. 3.x is under active development and has already seen over five years of stable releases, including version 3.3 in 2012 and 3.4 in 2014. This means that all recent standard library improvements, for example, are only available by default in Python 3.x.

• Guido van Rossum (the original creator of the Python language) decided to clean up Python 2.x properly, with less regard for backwards compatibility than is the case for new releases in the 2.x range. The most drastic improvement is the better Unicode support (with all text strings being Unicode by default) as well as saner bytes/Unicode separation.

• Besides, several aspects of the core language (such as print and exec being statements, integers using floor division) have been adjusted to be easier for newcomers to learn and to be more consistent with the rest of the language, and old cruft has been removed (for example, all classes are now new-style, "range()" returns a memory efficient iterable, not a list as in 2.x).

• The What's New in Python 3.0 document provides a good overview of the major language changes and likely sources of incompatibility with existing Python 2.x code. Nick Coghlan (one of the CPython core developers) has also created a relatively extensive FAQ regarding the transition.

• However, the broader Python ecosystem has amassed a significant amount of quality software over the years. The downside of breaking backwards compatibility in 3.x is that some of that software (especially in-house software in companies) still doesn't work on 3.x yet.
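A few of the changes listed above can be checked directly in an interpreter. This is a minimal sketch, assuming Python 3; the variable names are purely illustrative and not part of the course material.

```python
# Assumes Python 3; all names are illustrative only.

# print is a function, not a statement
print("Hello, world")

# "/" is true division; "//" is floor division
half = 1 / 2
floor_half = 1 // 2

# Text strings are Unicode by default; bytes are a separate type
text = "Gödel"
raw = text.encode("utf-8")        # explicit conversion from str to bytes
round_trip = raw.decode("utf-8")  # and back again

# range() is a lazy sequence, not a list
r = range(10**9)                  # no billion-element list is built
first_three = list(r[:3])

print(half, floor_half, len(text), len(raw), first_three)
```

The string has five characters but six bytes in UTF-8, which is the kind of difference the bytes/Unicode separation makes explicit.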

How to install ?

• On Windows you’ll need administrator rights

• Portable Python distribution ?

Takes 500 MB and >2 hours

Version 2.7 and 3.4 on http://athena.ugent.be

Interactive “Shell”

• Great for learning the language
• Great for experimenting with the library
• Great for testing your own modules
• Two variations: IDLE (GUI), python (command line)
• Type statements or expressions at prompt:

    >>> print "Hello, world"
    Hello, world
    >>> x = 12**2
    >>> x/2
    72
    >>> # this is a comment


IDE: Integrated Development Environment

• You can type scripts using Notepad(++)
• Better: PyCharm – available for free on most OS but you need to be administrator to install
• We will use Eclipse in combination with PyDev

What is Eclipse?

• Eclipse started as a proprietary IBM product (IBM VisualAge for Smalltalk/Java)
  – Embracing the open source model IBM opened the product up
• Open Source
  – It is a general purpose open platform that facilitates and encourages the development of third party plug-ins
• Best known as an Integrated Development Environment (IDE)
  – Provides tools for coding, building, running and debugging applications
• Originally designed for Java, now supports many other languages
  – Good support for C, C++
  – Python, PHP, Ruby, etc…

Prerequisites for Running Eclipse

• Eclipse is written in Java and will thus need an installed JRE or JDK in which to execute
  – JDK recommended

Selecting a Workspace

• In Eclipse, all of your code will live under a workspace
• A workspace is nothing more than a location where we will store our source code and where Eclipse will write out our preferences
• Eclipse allows you to have multiple workspaces – each tailored in its own way
• Choose a location where you want to store your files, then click OK

Eclipse IDE Components

• Menubars: Full drop down menus plus quick access to common functions
• Editor Pane: This is where we edit our source code
• Perspective Switcher: We can switch between various perspectives here
• Outline Pane: This contains a hierarchical view of a source file
• Package Explorer Pane: This is where our projects/files are listed
• Miscellaneous Pane: Various components can appear in this pane – typically this contains a console and a list of compiler problems
• Task List Pane: This contains a list of “tasks” to complete

PYTHON

PyDev: Python plug-in for Eclipse

• Syntax highlighting

• Debugger

• Code completion

• An extensive preference menu that can be used to edit the plug-in’s attributes and options.

Installation

The plug-in can be installed through Software Updates:

Setting Up

In Eclipse, go to: Window, Preferences, PyDev, Interpreter-Python, and click New.

Select the python.exe file in the Python directory, click OK and OK in the Preferences window again. Wait for the creating procedure to finish.

Create Python Project and File

Click on File, New, choose File, click on Python project folder, write the file name ending in a .py, and click Finish.

Go to File, New, Project, select PyDev, Python Project, click Next, write name, choose Python version, and click Finish.

Running Python

To run Python code click on Run, Run As, and select Python Run.

Let's try for “Hello World!” from athena.ugent.be

Where is the workspace ?

Make PyDev Project

Which Python interpreter is used … check Preferences or run version.py
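The contents of version.py are not shown on the slides. A minimal sketch of what such a script could look like, assuming it only needs to report which interpreter PyDev is running (the function name `describe_interpreter` is hypothetical):

```python
# Hypothetical stand-in for version.py; the course file itself may differ.
import sys


def describe_interpreter():
    """Return the version and location of the running Python interpreter."""
    version = "%d.%d.%d" % sys.version_info[:3]
    return "Python %s at %s" % (version, sys.executable)


if __name__ == "__main__":
    print(describe_interpreter())
```

Running it through Run, Run As, Python Run should print the same interpreter that is configured under Preferences.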

Create new file …

… Hello_world.py

Run Hello_world.py


git is an open source, distributed version control system designed for speed and efficiency

Git: A distributed version control system

• Version control (or revision control, or source control) is all about managing multiple versions of documents, programs, web sites, etc.
  – Almost all “real” projects use some kind of version control
  – Essential for team projects, but also very useful for individual projects
• Some well-known version control systems are CVS, Subversion, Mercurial, and Git
  – CVS and Subversion use a “central” repository; users “check out” files, work on them, and “check them in”
  – Mercurial and Git treat all repositories as equal
• Distributed systems like Mercurial and Git are newer and are gradually replacing centralized systems like CVS and Subversion

Why version control?

• For working by yourself:
  – Gives you a “time machine” for going back to earlier versions
  – Gives you great support for different versions (standalone, web app, etc.) of the same basic project
• For working with others:
  – Greatly simplifies concurrent work, merging changes
• For getting an internship or job:
  – Any company with a clue uses some kind of version control
  – Companies without a clue are bad places to work

Why Git?

• Git has many advantages over earlier systems such as CVS and Subversion
  – More efficient, better workflow, etc.
  – See the literature for an extensive list of reasons
  – Of course, there are always those who disagree
• It works from within Eclipse, also when started from Athena

(Almost) everything is local

No network needed for:

• Performing a diff
• Viewing file history
• Committing changes
• Merging branches
• Obtaining any other revision of a file
• Switching branches

GitHub: Hosted GIT

• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
  – Accept invitation from Bioinformatics-I-2015
  – URI: https://github.ugent.be/Bioinformatics-I-2015/Python.git


Typical workflow

Person A
• Set up project & repo
• Push code onto GitHub
• edit/commit
• edit/commit
• pull/push

Person B
• Clone code from GitHub
• edit/commit/push
• edit…
• edit… commit
• pull/push

This is just the flow; specific commands are on the following slides. It’s also possible to create your project first on GitHub, then clone (i.e., no git init).


URI (Uniform Resource Identifier):https://github.ugent.be/Bioinformatics-I-2015/Python.git



Hello_world.py
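The slide shows the script only as a screenshot. A minimal sketch of a comparable script, assuming Python 3 (the function `greeting` and its default argument are hypothetical, and the file in the course repository may differ):

```python
# Hypothetical sketch of Hello_world.py; not the course file itself.


def greeting(name="World"):
    """Build the greeting text."""
    return "Hello %s!" % name


if __name__ == "__main__":
    print(greeting())
```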


Variables

• No need to declare
• Need to assign (initialize)
  • use of uninitialized variable raises exception
• Not typed

    if friendly: greeting = "hello world"
    else: greeting = 12**2
    print greeting

• Everything is a "variable":
  • Even functions, classes, modules
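These points can be tried out in a short script. The following is a sketch assuming Python 3; `friendly`, `shout`, `speak` and `undefined_name` are made-up names for illustration.

```python
# Assumes Python 3; all names are illustrative.

friendly = True

# The same name can be bound to values of different types
if friendly:
    greeting = "hello world"
else:
    greeting = 12**2
print(greeting, type(greeting).__name__)

# Using a name before assigning it raises NameError
try:
    print(undefined_name)
except NameError as err:
    error_message = str(err)
    print("caught:", error_message)


# Functions are objects too and can be bound to other names
def shout(text):
    return text.upper() + "!"


speak = shout            # a second name for the same function object
loud = speak(greeting)
print(loud)
```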

Numbers

• The usual suspects
  • 12, 3.14, 0xFF, 0377, (-1+2)*3/4**5, abs(x), 0<x<=5
• C-style shifting & masking
  • 1<<16, x&0xff, x|1, ~x, x^y
• Integer division truncates :-(
  • 1/2 -> 0 # 1./2. -> 0.5, float(1)/2 -> 0.5
  • Fixed in Python 3, where 1/2 -> 0.5 and 1//2 -> 0
• Long (arbitrary precision), complex
  • 2L**100 -> 1267650600228229401496703205376L
    – In Python 2.2 and beyond, 2**100 does the same thing
  • 1j**2 -> (-1+0j)
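The slide shows Python 2 behaviour. For comparison, here is a sketch of similar expressions under Python 3 (an assumption, since the slide itself is written for 2.x), where `/` is true division and there is a single arbitrary-precision integer type. The variable names are illustrative.

```python
# Assumes Python 3.
true_div = 1 / 2          # true division (Python 2 truncates this to 0)
floor_div = 1 // 2        # explicit floor division
big = 2 ** 100            # arbitrary precision, no "L" suffix needed
octal = 0o377             # Python 3 octal literal; 0377 is a syntax error
masked = 0x1234 & 0xFF    # C-style masking keeps the low byte
shifted = 1 << 16         # C-style left shift
imag_sq = 1j ** 2         # complex arithmetic
x = 3
in_range = 0 < x <= 5     # chained comparison: (0 < x) and (x <= 5)

print(true_div, floor_div, big, octal, masked, shifted, imag_sq, in_range)
```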

Control Structures

    if condition:
        statements
    [elif condition:
        statements] ...
    else:
        statements

    while condition:
        statements

    for var in sequence:
        statements

    break
    continue
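To see these constructs working together, here is a small sketch on a DNA string, assuming Python 3. The functions `gc_content` and `first_stop_codon` and the example sequences are made up for illustration and do not come from the course material.

```python
# Assumes Python 3; function names and sequences are illustrative.


def gc_content(sequence):
    """Return the fraction of G and C among the A/C/G/T bases of a DNA string."""
    gc = 0
    counted = 0
    for base in sequence.upper():
        if base in "GC":
            gc += 1
        elif base in "AT":
            pass                # a valid base, but not G or C
        else:
            continue            # skip anything that is not A, C, G or T
        counted += 1
    return gc / counted if counted else 0.0


def first_stop_codon(sequence):
    """Return the index of the first in-frame stop codon, or -1 if none."""
    position = 0
    found = -1
    while position + 3 <= len(sequence):
        if sequence[position:position + 3] in ("TAA", "TAG", "TGA"):
            found = position
            break               # stop looking once a stop codon is seen
        position += 3
    return found


print(gc_content("ATGCGCNNTA"))
print(first_stop_codon("ATGGCCTAAGGG"))
```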

Example Function

    def gcd(a, b):
        "greatest common divisor"
        while a != 0:
            a, b = b%a, a    # parallel assignment
        return b

    >>> gcd.__doc__
    'greatest common divisor'
    >>> gcd(12, 20)
    4

How to generate random numbers

The standard random module implements a random number generator.

    import random
    print (random.random())

This prints a random floating point number in the range [0, 1) (that is, between 0 and 1, including 0.0 but always smaller than 1.0).

There are also many other specialized generators in this module, such as:
• randrange(a, b) chooses an integer in the range [a, b).
• uniform(a, b) chooses a floating point number in the range [a, b).
• normalvariate(mean, sdev) samples the normal (Gaussian) distribution.

Some higher-level functions operate on sequences directly, such as:
• choice(S) chooses a random element from a given sequence (the sequence must have a known length).
• shuffle(L) shuffles a list in-place, i.e. permutes it randomly.

There’s also a Random class you can instantiate to create independent multiple random number generators.
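The functions named above can be exercised in a few lines. This is a sketch assuming Python 3; the seed values and variable names are arbitrary choices made for the example.

```python
# Assumes Python 3; seed values and names are arbitrary.
import random

random.seed(2015)                       # a fixed seed makes the run repeatable

r = random.random()                     # float in [0, 1)
i = random.randrange(1, 7)              # integer in [1, 7), like a die roll
u = random.uniform(0.0, 10.0)           # float between 0.0 and 10.0
g = random.normalvariate(0.0, 1.0)      # Gaussian sample, mean 0 and sd 1

bases = ["A", "C", "G", "T"]
b = random.choice(bases)                # one random element of the list
deck = list(range(10))
random.shuffle(deck)                    # permutes the list in place

# Independent generators, each with its own state
own_rng = random.Random(42)
same_rng = random.Random(42)

print(r, i, u, g, b, deck)
```

Two `Random` instances created with the same seed produce the same sequence, which is useful when a result has to be reproducible.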

First program: PI-thon.py

• How good are the random numbers ?

• If they are good, you should be able to “measure” PI

Measure Pi with two random numbers …. many of them …

[Figure: axes x and y, side length 1]
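One way to carry out this measurement is a Monte Carlo estimate: draw many points (x, y) uniformly in the unit square and count the fraction that falls inside the quarter circle x² + y² ≤ 1. That fraction approaches π/4. The course's PI-thon.py is not reproduced here, so the following is a sketch under that assumption, using Python 3 and the hypothetical names `estimate_pi`, `n_points` and `seed`.

```python
# Sketch only; not the course's PI-thon.py. Assumes Python 3.
import math
import random


def estimate_pi(n_points, seed=None):
    """Estimate pi from n_points random points in the unit square."""
    rng = random.Random(seed)           # independent generator, optional seed
    inside = 0
    for _ in range(n_points):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:        # point lies inside the quarter circle
            inside += 1
    # (area of quarter circle) / (area of unit square) = pi / 4
    return 4.0 * inside / n_points


if __name__ == "__main__":
    for n in (100, 10000, 100000):
        estimate = estimate_pi(n, seed=1)
        print(n, estimate, abs(estimate - math.pi))
```

The statistical error shrinks roughly as 1/sqrt(n), so each extra decimal digit costs about a hundred times more points. An estimate that stays far from π as n grows would point to a problem with the random numbers or with the code.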

Python Videos

http://python.org/ - documentation, tutorials, beginners guide, core distribution, ...

Books include:
• Learning Python by Mark Lutz
• Python Essential Reference by David Beazley
• Python Cookbook, ed. by Martelli, Ravenscroft and Ascher