bioinformatics v2014 wim_vancriekinge

275
1 Preparation for the course on 17 th of december 2014 Get a functional perl interpreter on your computer Windows: http://www.activestate.com/activeperl/ downloads Macosx: http://learn.perl.org/installing/ osx.html You can check this by typing perl v Check to have functional JAVA (www.java.com ) and install WEKA (http://www.cs.waikato.ac.nz/ml/weka/downloading. html)

Upload: wvcrieki

Post on 02-Jul-2015

3.073 views

Category:

Education


1 download

DESCRIPTION

Bioinformatics

TRANSCRIPT

Page 1: Bioinformatics v2014 wim_vancriekinge

1

Preparation for the course on 17th of december 2014

• Get a functional perl interpreter on your computer

Windows:

http://www.activestate.com/activeperl/downloads

Macosx: http://learn.perl.org/installing/osx.html

You can check this by typing perl –v

• Check to have functional JAVA (www.java.com)

and install WEKA

(http://www.cs.waikato.ac.nz/ml/weka/downloading.

html)

Page 2: Bioinformatics v2014 wim_vancriekinge

Prof. Wim Van Criekinge

17th december 2014, 08:00-13:00

VUmc, Amsterdam

Bioinformatics

Page 3: Bioinformatics v2014 wim_vancriekinge

3

Outline

• Scripting

Perl (Bioperl/Python)examples spiders/bots

• Databases

Genome Browserexamples biomart, galaxy

• AI

Classification and clusteringexamples WEKA (R, Rapidminer)

Page 4: Bioinformatics v2014 wim_vancriekinge

Math

Informatics

Bioinformatics, a life science discipline …

(Molecular)

Biology

Page 5: Bioinformatics v2014 wim_vancriekinge

Math

Informatics

Bioinformatics, a life science discipline …

Theoretical Biology

Computational Biology

(Molecular)

Biology

Computer Science

Page 6: Bioinformatics v2014 wim_vancriekinge

Math

Informatics

Bioinformatics, a life science discipline …

Theoretical Biology

Computational Biology

(Molecular)

Biology

Computer Science

Bioinformatics

Page 7: Bioinformatics v2014 wim_vancriekinge

Math

Informatics

Bioinformatics, a life science discipline … management of expectations

Theoretical Biology

Computational Biology

(Molecular)

Biology

Computer Science

Bioinformatics

Interface Design

AI, Image Analysis

structure prediction (HTX)

Sequence Analysis

Expert Annotation

NP

Datamining

Page 8: Bioinformatics v2014 wim_vancriekinge

Math

Informatics

Bioinformatics, a life science discipline … management of expectations

Theoretical Biology

Computational Biology

(Molecular)

Biology

Computer Science

Bioinformatics

Discovery Informatics – Computational Genomics

Interface Design

AI, Image Analysis

structure prediction (HTX)

Sequence Analysis

Expert Annotation

NP

Datamining

Page 9: Bioinformatics v2014 wim_vancriekinge

9

Bioinformatics

Page 10: Bioinformatics v2014 wim_vancriekinge

10

Page 11: Bioinformatics v2014 wim_vancriekinge

11

• Perl is a High-level Scripting language

• Larry Wall created Perl in 1987 Practical Extraction (a)nd Reporting

Language

(or Pathologically Eclectic Rubbish Lister)

• Born from a system administration tool

• Faster than sh or csh

• Sslower than C

• No need for sed, awk, tr, wc, cut, …

• Perl is open and free

• http://conferences.oreillynet.com/eurooscon/

What is Perl ?

Page 12: Bioinformatics v2014 wim_vancriekinge

12

• Perl is available for most computing

platforms: all flavors of UNIX (Linux), MS-

DOS/Win32, Macintosh, VMS, OS/2, Amiga,

AS/400, Atari

• Perl is a computer language that is:

Interpreted, compiles at run-time (need for

perl.exe !)

Loosely “typed”

String/text oriented

Capable of using multiple syntax formats

• In Perl, “there’s more than one way to do it”

What is Perl ?

Page 13: Bioinformatics v2014 wim_vancriekinge

13

• Ease of use by novice programmers

• Flexible language: Fast software prototyping (quick and dirty creation of small analysis programs)

• Expressiveness. Compact code, Perl Poetry: @{$_[$#_]||[]}

• Glutility: Read disparate files and parse the relevant data into a new format

• Powerful pattern matching via “regular expressions”(Best Regular Expressions on Earth)

• With the advent of the WWW, Perl has become the language of choice to create Common Gateway Interface (CGI) scripts to handle form submissions and create compute severs on the WWW.

• Open Source – Free. Availability of Perl modules for Bioinformatics and Internet.

Why use Perl for bioinformatics ?

Page 14: Bioinformatics v2014 wim_vancriekinge

14

• Some tasks are still better done with other languages (heavy computations / graphics)C(++),C#, Fortran, Java (Pascal,Visual Basic)

• With perl you can write simple programs fast, but on the other hand it is also suitable for large and complex programs. (yet, it is not adequate for very large projects) Python

• Larry Wall: “For programmers, laziness is a virtue”

Why NOT use Perl for bioinformatics ?

Page 15: Bioinformatics v2014 wim_vancriekinge

15

• Sequence manipulation and analysis

• Parsing results of sequence analysis programs

(Blast, Genscan, Hmmer etc)

• Parsing database (eg Genbank) files

• Obtaining multiple database entries over the

internet

• …

What bioinformatics tasks are suited to Perl ?

Page 16: Bioinformatics v2014 wim_vancriekinge

16

• Perl Perl is available for various operating systems. To

download Perl and install it on your computer, have a look at the following resources:

www.perl.com (O'Reilly). Downloading Perl Software

ActiveState. ActivePerl for Windows, as well as for Linux and Solaris.

ActivePerl binary packages.

CPAN

• PHPTriad: bevat Apache/PHP en MySQL:

http://sourceforge.net/projects/phptriad

Perl installation

Page 17: Bioinformatics v2014 wim_vancriekinge

17

Check installation

• Command-line flags for perl

Perl – vGives the current version of Perl

Perl –eExecutes Perl statements from the comment line.

Perl –e “print 42;”

Perl –e “print \”Two\n\lines\n\”;”

Perl –weExecutes and print warnings

Perl –we “print ‘hello’;x++;”

Page 18: Bioinformatics v2014 wim_vancriekinge

18

• Syntax highlighting

• Run program (prompt for parameters)

• Show line numbers

• Clip-ons for web with perl syntax

• ….

TextPad

Page 19: Bioinformatics v2014 wim_vancriekinge

19

Customize textpad part 1: Create Document Class

Page 20: Bioinformatics v2014 wim_vancriekinge

20

• Document classes

Page 21: Bioinformatics v2014 wim_vancriekinge

21

Customize textpad part 2: Add Perl to “Tools Menu”

Page 22: Bioinformatics v2014 wim_vancriekinge

22

Unzip to textpad samples directory

Page 23: Bioinformatics v2014 wim_vancriekinge

23

• Perl is mostly a free format language: add spaces, tabs or new lines wherever you want.

• For clarity, it is recommended to write each statement in a separate line, and use indentation in nested structures.

• Comments: Anything from the # sign to the end of the line is a comment. (There are no multi-line comments).

• A perl program consists of all of the Perl statements of the file taken collectively as one big routine to execute.

General Remarks

Page 24: Bioinformatics v2014 wim_vancriekinge

24

Three Basic Data Types

•Scalars - $

•Arrays of scalars - @

•Associative arrays of

scalers or Hashes - %

Page 25: Bioinformatics v2014 wim_vancriekinge

2+2 = ?

$a = 2;

$b = 2;

$c = $a + $b;

$ - indicates a variable

; - ends every command

= - assigns a value to a variable

$c = 2 + 2;or

$c = 2 * 2;or

$c = 2 / 2;or

$c = 2 ^ 4;or 2^4 <-> 24 =16

$c = 1.35 * 2 - 3 / (0.12 + 1);or

Page 26: Bioinformatics v2014 wim_vancriekinge

Ok, $c is 4. How do we know it?

print “Hello \n”;

print command:

$c = 4;

print “$c”;

“ ” - bracket output expression

\n - print a end-of-the-line character

(equivalent to pressing ‘Enter’)

print “Hello everyone\n”;

print “Hello” . ” everyone” . “\n”;

Strings concatenation:

Expressions and strings together:

print “2 + 2 = “ . (2+2) . ”\n”;

expression

2 + 2 = 4

Page 27: Bioinformatics v2014 wim_vancriekinge

Loops and cycles (for statement):

# Output all the numbers from 1 to 100

for ($n=1; $n<=100; $n+=1) {

print “$n \n”;

}

1. Initialization:

for ( $n=1 ; ; ) { … }

2. Increment:

for ( ; ; $n+=1 ) { … }

3. Termination (do until the criteria is satisfied):

for ( ; $n<=100 ; ) { … }

4. Body of the loop - command inside curly brackets:

for ( ; ; ) { … }

Page 28: Bioinformatics v2014 wim_vancriekinge

FOR & IF -- all the even numbers from 1 to 100:

for ($n=1; $n<=100; $n+=1) {

if (($n % 2) == 0) {

print “$n”;

}

}

Note: $a % $b -- Modulus -- Remainder when $a is divided by $b

Page 29: Bioinformatics v2014 wim_vancriekinge

29

Two brief diversions (warnings & strict)

• Use warnings

• strict – forces you to ‘declare’ a variable the first time you use it.

usage: use strict; (somewhere near the top of your script)

• declare variables with ‘my’

usage: my $variable;

or: my $variable = ‘value’;

• my sets the ‘scope’ of the variable. Variable exists only within the current block of code

• use strict and my both help you to debug errors, and help prevent mistakes.

Page 30: Bioinformatics v2014 wim_vancriekinge

30

Text Processing Functions

The substr function

• Definition

• The substr function extracts a substring out of a string and returns it. The function receives 3 arguments: a string value, a position on the string (starting to count from 0) and a length.

Example:

• $a = "university";

• $k = substr ($a, 3, 5);

• $k is now "versi" $a remains unchanged.

• If length is omitted, everything to the end of the string is returned.

Page 31: Bioinformatics v2014 wim_vancriekinge

31

Random

$x = rand(1);

• srand The default seed for srand, which used to be time, has

been changed. Now it's a heady mix of difficult-to-predict system-dependent values, which should be sufficient for most everyday purposes. Previous to version 5.004, calling rand without first calling srandwould yield the same sequence of random numbers on most or all machines. Now, when perl sees that you're calling rand and haven't yet called srand, it calls srandwith the default seed. You should still call srandmanually if your code might ever be run on a pre-5.004 system, of course, or if you want a seed other than the default

Page 32: Bioinformatics v2014 wim_vancriekinge

32

Demo/Example

• Oefening hoe goed zijn de random

nummers ?

• Als ze goed zijn kan je er Pi mee

berekenen …

• Een goede random generator is belangrijk

voor goede randomsequenties die we

nadien kunnen gebruiken in simulaties

Page 33: Bioinformatics v2014 wim_vancriekinge

33

Bereken Pi aan de hand van twee random getallen

1

x

y

Page 34: Bioinformatics v2014 wim_vancriekinge

34

Introduction

Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1777. It involves dropping a needle on a lined sheet of paper and determining the probability of the needle crossing one of the lines on the page. The remarkable result is that the probability is directly related to the value of pi.

http://www.angelfire.com/wa/hurben/buff.html

In Postscript you send it too the printer … PS has no variables but “stacks”, you can mimick this in Perl by recursively loading and rewriting a subroutine

Page 35: Bioinformatics v2014 wim_vancriekinge

35–http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html

Page 36: Bioinformatics v2014 wim_vancriekinge

36

Programming

•Variables

•Flow control (if, regex …)

•Loops

• input/output

•Subroutines/object

Page 37: Bioinformatics v2014 wim_vancriekinge

37

What is a regular expression?

• A regular expression (regex) is simply a

way of describing text.

• Regular expressions are built up of small

units (atoms) which can represent the type

and number of characters in the text

• Regular expressions can be very broad

(describing everything), or very narrow

(describing only one pattern).

Page 38: Bioinformatics v2014 wim_vancriekinge

38

Page 39: Bioinformatics v2014 wim_vancriekinge

39

Regular Expression Review

• A regular expression (regex) is a way of

describing text.

• Regular expressions are built up of small units

(atoms) which can represent the type and

number of characters in the text

• You can group or quantify atoms to describe

your pattern

• Always use the bind operator (=~) to apply your

regular expression to a variable

Page 40: Bioinformatics v2014 wim_vancriekinge

40

Why would you use a regex?

• Often you wish to test a string for the

presence of a specific character, word, or

phrase

Examples

“Are there any letter characters in my string?”

“Is this a valid accession number?”

“Does my sequence contain a start codon (ATG)?”

Page 41: Bioinformatics v2014 wim_vancriekinge

41

Regular Expressions

Match to a sequence of characters

The EcoRI restriction enzyme cuts at the consensus sequence GAATTC.

To find out whether a sequence contains a restriction site for EcoR1, write;

if ($sequence =~ /GAATTC/) {

...

};

Page 42: Bioinformatics v2014 wim_vancriekinge

42

Regex-style

[m]/PATTERN/[g][i][o]

s/PATTERN/PATTERN/[g][i][e][o]

tr/PATTERNLIST/PATTERNLIST/[c][d][s]

Page 43: Bioinformatics v2014 wim_vancriekinge

43

Regular Expressions

Match to a character class

• Example

• The BstYI restriction enzyme cuts at the consensus sequence rGATCy, namely A or G in the first position, then GATC, and then T or C. To find out whether a sequence contains a restriction site for BstYI, write;

• if ($sequence =~ /[AG]GATC[TC]/) {...}; # This will match all of AGATCT, GGATCT, AGATCC, GGATCC.

Definition

• When a list of characters is enclosed in square brackets [], one and only one of these characters must be present at the corresponding position of the string in order for the pattern to match. You may specify a range of characters using a hyphen -.

• A caret ^ at the front of the list negates the character class.

Examples

• if ($string =~ /[AGTC]/) {...}; # matches any nucleotide

• if ($string =~ /[a-z]/) {...}; # matches any lowercase letter

• if ($string =~ /chromosome[1-6]/) {...}; # matches chromosome1, chromosome2 ... chromosome6

• if ($string =~ /[^xyzXYZ]/) {...}; # matches any character except x, X, y, Y, z, Z

Page 44: Bioinformatics v2014 wim_vancriekinge

44

Constructing a Regex

• Pattern starts and ends with a / /pattern/

if you want to match a /, you need to escape it\/ (backslash, forward slash)

you can change the delimiter to some other

character, but you probably won’t need tom|pattern|

• any ‘modifiers’ to the pattern go after the last /i : case insensitive /[a-z]/i

o : compile once

g : match in list context (global)

m or s : match over multiple lines

Page 45: Bioinformatics v2014 wim_vancriekinge

45

Looking for a pattern

• By default, a regular expression is applied to $_

(the default variable)

if (/a+/) {die} looks for one or more ‘a’ in $_

• If you want to look for the pattern in any other

variable, you must use the bind operator

if ($value =~ /a+/) {die}looks for one or more ‘a’ in $value

• The bind operator is in no way similar to the

‘=‘ sign!! = is assignment, =~ is bind.

if ($value = /[a-z]/) {die}Looks for one or more ‘a’ in $_, not $value!!!

Page 46: Bioinformatics v2014 wim_vancriekinge

46

Regular Expression Atoms

• An ‘atom’ is the smallest unit of a

regular expression.

• Character atoms0-9, a-Z match themselves

. (dot) matches everything

[atgcATGC] : A character class (group)

[a-z] : another character class, a through z

Page 47: Bioinformatics v2014 wim_vancriekinge

47

Quantifiers

• You can specify the number of times you want

to see an atom. Examples

• \d* : Zero or more times

• \d+ : One or more times

• \d{3} : Exactly three times

• \d{4,7} : At least four, and not more than seven

• \d{3,} : Three or more timesWe could rewrite /\d\d\d-\d\d\d\d/ as:

/\d{3}-\d{4}/

Page 48: Bioinformatics v2014 wim_vancriekinge

48

Anchors

• Anchors force a pattern match to a certain

location

• ^ : start matching at beginning of string

• $ : start matching at end of string

• \b : match at word boundary (between \w and \W)

• Example:

• /^\d\d\d-\d\d\d\d$/ : matches only valid phone numbers

Page 49: Bioinformatics v2014 wim_vancriekinge

49

Remembering Stuff

• Being able to match patterns is good, but

limited.

• We want to be able to keep portions of the

regular expression for later.

Example: $string = ‘phone: 353-7236’We want to keep the phone number only

Just figuring out that the string contains a phone number is

insufficient, we need to keep the number as well.

Page 50: Bioinformatics v2014 wim_vancriekinge

50

Memory Parentheses (pattern memory)

• Since we almost always want to keep portions

of the string we have matched, there is a

mechanism built into perl.

• Anything in parentheses within the regular

expression is kept in memory.

‘phone:353-7236’ =~ /^phone\:(.+)$/;Perl knows we want to keep everything that matches ‘.+’ in the above

pattern

Page 51: Bioinformatics v2014 wim_vancriekinge

51

Getting at pattern memory

• Perl stores the matches in a series of default

variables. The first parentheses set goes into

$1, second into $2, etc.

This is why we can’t name variables ${digit}

Memory variables are created only in the amounts

needed. If you have three sets of parentheses, you

have ($1,$2,$3).

Memory variables are created for each matched set of

parentheses. If you have one set contained within

another set, you get two variables (inner set gets

lowest number)

Memory variables are only valid in the current scope

Page 52: Bioinformatics v2014 wim_vancriekinge

52

Finding all instances of a match

• Use the ‘g’ modifier to the regular expression

@sites = $sequence =~ /(TATTA)/g;

think g for global

Returns a list of all the matches (in order), and

stores them in the array

If you have more than one pair of parentheses, your

array gets values in sets($1,$2,$3,$1,$2,$3...)

Page 53: Bioinformatics v2014 wim_vancriekinge

53

Perl is Greedy

• In addition to taking all your time, perl regular

expressions also try to match the largest

possible string which fits your pattern

/ga+t/ matches gat, gaat, gaaat

‘Doh! No doughnuts left!’ =~ /(d.+t)/$1 contains ‘doughnuts left’

• If this is not what you wanted to do, use the ‘?’

modifier

/(d.+?t)/ # match as few ‘.’s as you can and still

make the pattern work

Page 54: Bioinformatics v2014 wim_vancriekinge

54

Substitute function

• s/pattern1/pattern2/;

• Looks kind of like a regular expression

Patterns constructed the same way

• Inherited from previous languages, so it can be

a bit different.

Changes the variable it is bound to!

Page 55: Bioinformatics v2014 wim_vancriekinge

55

Page 56: Bioinformatics v2014 wim_vancriekinge

56

tr function

• translate or transliterate

• tr/characterlist1/characterlist2/;

• Even less like a regular expression than s

• substitutes characters in the first list with

characters in the second list

$string =~ tr/a/A/; # changes every ‘a’ to an ‘A’

No need for the g modifier when using tr.

Page 57: Bioinformatics v2014 wim_vancriekinge

57

Translations

Page 58: Bioinformatics v2014 wim_vancriekinge

58

Using tr

• Creating complimentary DNA sequence

$sequence =~ tr/atgc/TACG/;

• Sneaky Perl trick for the day

tr does two things.1. changes characters in the bound variable

2. Counts the number of times it does this

Super-fast character counter™$a_count = $sequence =~ tr/a/a/;

replaces an ‘a’ with an ‘a’ (no net change), and assigns the result

(number of substitutions) to $a_count

Page 59: Bioinformatics v2014 wim_vancriekinge

59

Regex-Related Special Variables

• Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match.

• @- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).

• $' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.

• All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match.

Page 60: Bioinformatics v2014 wim_vancriekinge

60

Voorbeeld

Which of following 4 sequences (seq1/2/3/4)

a) contains a “Galactokinase signature”

b) How many of them?

c) Where (hints:pos and $&) ?

http://us.expasy.org/prosite/

Page 61: Bioinformatics v2014 wim_vancriekinge

61

>SEQ1

MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR

>SEQ2

MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ

>SEQ3

MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL

>SEQ4

MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA

Page 62: Bioinformatics v2014 wim_vancriekinge

62

Arrays

Definitions

• A scalar variable contains a scalar value: one number or one string. A string might contain many words, but Perl regards it as one unit.

• An array variable contains a list of scalar data: a list of numbers or a list of strings or a mixed list of numbers and strings. The order of elements in the list matters.

Syntax

• Array variable names start with an @ sign.

• You may use in the same program a variable named $var and another variable named @var, and they will mean two different, unrelated things.

Example

• Assume we have a list of numbers which were obtained as a result of some measurement. We can store this list in an array variable as the following:

• @msr = (3, 2, 5, 9, 7, 13, 16);

Page 63: Bioinformatics v2014 wim_vancriekinge

63

The foreach construct

The foreach construct iterates over a list of scalar values (e.g. that are contained in an array) and executes a block of code for each of the values.

• Example: foreach $i (@some_array) {

statement_1;

statement_2;

statement_3; }

Each element in @some_array is aliased to the variable $i in turn, and the block of code inside the curly brackets {} is executed once for each element.

• The variable $i (or give it any other name you wish) is local to the foreach loop and regains its former value upon exiting of the loop.

• Remark $_

Page 64: Bioinformatics v2014 wim_vancriekinge

64

Examples for using the foreach construct - cont.

• Calculate sum of all array elements:

#!/usr/local/bin/perl

@msr = (3, 2, 5, 9, 7, 13, 16);

$sum = 0;

foreach $i (@msr) {

$sum += $i; }

print "sum is: $sum\n";

Page 65: Bioinformatics v2014 wim_vancriekinge

65

Accessing individual array elements

Individual array elements may be accessed by indicating their position in the list (their index).

Example:

@msr = (3, 2, 5, 9, 7, 13, 16);

index value 0 3 1 2 2 5 3 9 4 7 5 13 6 16

First element: $msr[0] (here has the value of 3),

Third element: $msr[2] (here has the value of 5),

and so on.

Page 66: Bioinformatics v2014 wim_vancriekinge

66

The sort function

The sort function receives a list of variables (or an array) and returns the sorted list.

@array2 = sort (@array1);

#!/usr/local/bin/perl

@countries = ("Israel", "Norway", "France", "Argentina");

@sorted_countries = sort ( @countries);

print "ORIG: @countries\n", "SORTED: @sorted_countries\n";

Output:

ORIG: Israel Norway France Argentina

SORTED: Argentina France Israel Norway

#!/usr/local/bin/perl

@numbers = (1 ,2, 4, 16, 18, 32, 64);

@sorted_num = sort (@numbers);

print "ORIG: @numbers \n", "SORTED: @sorted_num \n";

Output:

ORIG: 1 2 4 16 18 32 64

SORTED: 1 16 18 2 32 4 64

Note that sorting numbers does not happen numerically, but by the string values of each number.

Page 67: Bioinformatics v2014 wim_vancriekinge

67

The push and shift functions

The push function adds a variable or a list of variables to the end of a given array.

Example:

$a = 5;

$b = 7;

@array = ("David", "John", "Gadi");

push (@array, $a, $b);

# @array is now ("David", "John", "Gadi", 5, 7)

The shift function removes the first element of a given array and returns this element.

Example:

@array = ("David", "John", "Gadi");

$k = shift (@array);

# @array is now ("John", "Gadi"); # $k is now "David"

Note that after both the push and shift operations the given array @array is changed!

Page 68: Bioinformatics v2014 wim_vancriekinge

68

Perl Array review

• An array is designated with the ‘@’ sign

• An array is a list of individual elements

• Arrays are ordered

Your list stays in the same order that you created it, although you can add or subtract elements to the front or back of the list

• You access array elements by number, using the special syntax:

$array[1] returns the ‘1th’ element of the array (remember perl starts counting at zero)

• You can do anything with an array element that you can do with a scalar variable (addition, subtraction, printing … whatever)

Page 69: Bioinformatics v2014 wim_vancriekinge

69

Generate random sequence string

for($n=1;$n<=50;$n++)

{

@a = ("A","C","G","T");

$b=$a[rand(@a)];

$r.=$b;

}

print $r;

Page 70: Bioinformatics v2014 wim_vancriekinge

70

Text Processing Functions

The split function

• The split function splits a string to a list of substrings according to the positions of a given delimiter. The delimiter is written as a pattern enclosed by slashes: /PATTERN/. Examples:

• $string = "programming::course::for::bioinformatics";

• @list = split (/::/, $string);

• # @list is now ("programming", "course", "for", "bioinformatics") # $string remains unchanged.

• $string = "protein kinase C\t450 Kilodaltons\t120 Kilobases";

• @list = split (/\t/, $string); #\t indicates tab #

• @list is now ("protein kinase C", "450 Kilodaltons", "120 Kilobases")

Page 71: Bioinformatics v2014 wim_vancriekinge

71

Text Processing Functions

The join function

• The join function does the opposite of split. It receives a delimiter and a list of strings, and joins the strings into a single string, such that they are separated by the delimiter.

• Note that the delimiter is written inside quotes.

• Examples:

• @list = ("programming", "course", "for", "bioinformatics");

• $string = join ("::", @list);

• # $string is now "programming::course::for::bioinformatics"

• $name = "protein kinase C"; $mol_weight = "450 Kilodaltons"; $seq_length = "120 Kilobases";

• $string = join ("\t", $name, $mol_weight, $seq_length);

• # $string is now: # "protein kinase C\t450 Kilodaltons\t120 Kilobases"

Page 72: Bioinformatics v2014 wim_vancriekinge

72

When is an array not good enough?

• Sometimes you want to associate a given value

with another value. (name/value pairs)

(Rob => 353-7236, Matt => 353-7122,

Joe_anonymous => 555-1212)

(Acc#1 => sequence1, Acc#2 => sequence2, Acc#n =>

sequence-n)

• You could put this information into an array, but

it would be difficult to keep your names and

values together (what happens when you sort?

Yuck)

Page 73: Bioinformatics v2014 wim_vancriekinge

73

Problem solved: The associative array

• As the name suggests, an associative array allows you to link a name with a value

• In perl-speak: associative array = hash

‘hash’ is the preferred term, for various arcane reasons, including that it is easier to say.

• Consider an array: The elements (values) are each associated with a name – the index position. These index positions are numerical, sequential, and start at zero.

• A hash is similar to an array, but we get to name the index positions anything we want

Page 74: Bioinformatics v2014 wim_vancriekinge

74

The ‘structure’ of a Hash

• An array looks something like this:

@array =Index

Value

0 1 2

'val1' 'val2' 'val3'

Page 75: Bioinformatics v2014 wim_vancriekinge

75

The ‘structure’ of a Hash

• An array looks something like this:

• A hash looks something like this:

@array =Index

Value

0 1 2

'val1' 'val2' 'val3'

Rob Matt Joe_A

353-7236 353-7122 555-1212

Key (name)

Value%phone =

Page 76: Bioinformatics v2014 wim_vancriekinge

76

Creating a hash

• There are several methods for creating a hash. The most simple way – assign a list to a hash.

%hash = (‘rob’, 56, ‘joe’, 17, ‘jeff’, ‘green’);

• Perl is smart enough to know that since you are assigning a list to a hash, you meant to alternate keys and values.

%hash = (‘rob’ => 56 , ‘joe’ => 17, ‘jeff’ => ‘green’);

• The arrow (‘=>’) notation helps some people, and clarifies which keys go with which values. The perl interpreter sees ‘=>’ as a comma.

Page 77: Bioinformatics v2014 wim_vancriekinge

77

Getting at values

• You should expect by now that there is some

way to get at a value, given a key.

• You access a hash key like this:

$hash{‘key’}

• This should look somewhat familiar

$array[21] : refer to a value associated with a

specific index position in an array

$hash{key} : refer to a value associated with a

specific key in a hash

Page 78: Bioinformatics v2014 wim_vancriekinge

78

• There is more than one right way to do it. Unfortunately, there are also

many wrong ways.

1. Always check and make sure the output is correct and logical Consider what errors might occur, and take steps to ensure that you are

accounting for them.

2. Check to make sure you are using every variable you declare.Use Strict !

3. Always go back to a script once it is working and see if you can eliminate unnecessary steps.Concise code is good code.

You will learn more if you optimize your code.

Concise does not mean comment free. Please use as many comments as you think are necessary.

Sometimes you want to leave easy to understand code in, rather than short but difficult to understand tricks. Use your judgment.

Remember that in the future, you may wish to use or alter the code you wrote today. If you don’t understand it today, you won’t tomorrow.

Programming in general and Perl in particular

Page 79: Bioinformatics v2014 wim_vancriekinge

79

Develop your program in stages. Once part of it works, save the working version to another file (or use a source code control system like RCS) before continuing to improve it.

When running interactively, show the user signs of activity. There is no need to dump everything to the screen (unless requested to), but a few words or a number change every few minutes will show that your program is doing something.

Comment your script. Any information on what it is doing or why might be useful to you a few months later.

Decide on a coding convention and stick to it. For example,

for variable names, begin globals with a capital letter and privates (my) with a lower case letter

indent new control structures with (say) 2 spaces

line up closing braces, as in: if (....) { ... ... }

Add blank lines between sections to improve readibility

Programming in general and Perl in particular

Page 80: Bioinformatics v2014 wim_vancriekinge

80

CPAN

• CPAN: The Comprehensive Perl Archive

Network is available at www.cpan.org

and is a very large respository of Perl

modules for all kind of taks (including

bioperl)

Page 81: Bioinformatics v2014 wim_vancriekinge

81

What is BioPerl?

• An ‘open source’ project http://bio.perl.org or http://www.cpan.org

• A loose international collaboration of biologist/programmersNobody (that I know of) gets paid for this

• A collection of PERL modules and methods for doing a number of bioinformatics tasksThink of it as subroutines to do biology

• Consider it a ‘tool-box’There are a lot of nice tools in there, and (usually)

somebody else takes care of fixing parsers when they break

• BioPerl code is portable - if you give somebody a script, it will probably work on their system

Page 82: Bioinformatics v2014 wim_vancriekinge

82

Multi-line parsing

use strict;

use Bio::SeqIO;

my $filename="sw.txt";

my $sequence_object;

my $seqio = Bio::SeqIO -> new (

'-format' => 'swiss',

'-file' => $filename

);

while ($sequence_object = $seqio -> next_seq) {

my $sequentie = $sequence_object-> seq();

print $sequentie."\n";

}

Page 83: Bioinformatics v2014 wim_vancriekinge

83

Live.pl

#!e:\Perl\bin\perl.exe -w

# script for looping over genbank entries, printing out name

use Bio::DB::Genbank;

use Data::Dumper;

$gb = new Bio::DB::GenBank();

$sequence_object = $gb->get_Seq_by_id('MUSIGHBA1');

print Dumper ($sequence_object);

$seq1_id = $sequence_object->display_id();

$seq1_s = $sequence_object->seq();

print "seq1 display id is $seq1_id \n";

print "seq1 sequence is $seq1_s \n";

Page 84: Bioinformatics v2014 wim_vancriekinge

84

Bioperl 101: 2 ESSENTIAL TOOLS

Data::Dumper to find out what

class your in

Perl bptutorial (100 Bio::Seq) to

find the available methods for that

class

Page 85: Bioinformatics v2014 wim_vancriekinge

85

Outline

• Scripting

Perl (Bioperl/Python)examples spiders/bots

• Databases

Genome Browserexamples biomart, galaxy

• AI

Classification and clusteringexamples WEKA (R, Rapidminer)

Page 86: Bioinformatics v2014 wim_vancriekinge

86

Overview

• Bots and Spiders

The web

Bots

Spiders

Real world examples

Bioinformatics applications

Perl – LWP libraries

Google hacks

Advanced APIs

Fetch data from NCBI / Ensembl /

Page 87: Bioinformatics v2014 wim_vancriekinge

87

The web

• The WWW-part of the

Internet is based on

hyperlinks

• So if one started to

follow all hyperlinks, it

would be possible to

map almost the entire

WWW

• Everything you can do

as a human (clicking,

filling in forms,…) can be

done by machines

Page 88: Bioinformatics v2014 wim_vancriekinge

88

Bots

• Webbots (web robots, WWW robots, bots): software applications

that run automated tasks over the Internet

• Bots perform tasks that:

Are simple

Structurally repetitive

At a much higher rate than would be possible for a human

• Automated script fetches, analyses and files information from

web servers at many times the speed of a human

• Other uses:

Chatbots

IM / Skype / Wiki bots

Malicious bots and bot networks (Zombies)

Page 89: Bioinformatics v2014 wim_vancriekinge

89

Spiders

• Webspiders / Crawlers are programs

or automated scripts which browses

the World Wide Web in a methodical,

automated manner. It is one type of

bot

• The spider starts with a list of URLs

to visit, called the seeds

As the crawler visits these URLs,

it identifies all the hyperlinks in

the page

It adds them to the list of URLs to

visit, called the crawl frontier

URLs from the frontier are

recursively visited according to a

set of policies

• This process is called web crawling

or spidering: in most cases a mean

of providing up-to-date data

Page 90: Bioinformatics v2014 wim_vancriekinge

90

Spiders

Use of webcrawlers:

Mainly used to create a copy of all the visited pages for later

processing by a search engine that will index the downloaded

pages to provide fast searches

Automating maintenance tasks on a website, such as checking

links or validating HTML code

Can be used to gather specific types of information from Web

pages, such as harvesting e-mail addresses

Most common used crawler is probably the GoogleBot crawler

Crawls

Indexes (content + key content tags and attributes, such as Title

tags and ALT attributes)

Serves results: PageRank Technology

Page 91: Bioinformatics v2014 wim_vancriekinge

91

Spiders

Page 92: Bioinformatics v2014 wim_vancriekinge

92

Perl - LWP

LWP (also known as libwww-perl)

The World-Wide Web library for Perl

Set of Perl modules which provides a simple and

consistent application programming interface (API) to

the World-Wide Web

Free book: http://lwp.interglacial.com/

LWP for newbies

LWP::Simple (demo1)

Go to a URL, fetch data, ready to parse

Attention: HTML tags and regular expression

Page 93: Bioinformatics v2014 wim_vancriekinge

93

Perl - LWP

Some more advanced features

LWP::UserAgent (demo2 – show server access logs)

Fill in forms and parse results

Depending on content: follow hyperlinks to other pages

and parse these again,…

Bioinformatics examples

Use genome browser data (demo3) and sequences

Get gene aliases and symbols from GeneCards (demo4)

Page 94: Bioinformatics v2014 wim_vancriekinge

94

Google hacks

Why not make use of crawls, indexing and

serving technologies of others (e.g. Google)

Google allows automated queries: per account 1000

queries a day

Google uses Snippets: the short pieces of text you get

in the main search results

This is the result of its indexing and parsing algoritms

Demo5: LWP and Google combined and parsing the

results

Page 95: Bioinformatics v2014 wim_vancriekinge

95

Advanced APIs

An application programming interface (API) is a source

code interface that an operating system, library or service

provides to support requests made by computer programs

Language-dependent APIs

Language-independent APIs are written in a way they can

be called from several programming languages. This is a

desired feature for service style API which is not bound to

a particular process or system and is available as a

remote procedure call

Page 96: Bioinformatics v2014 wim_vancriekinge

96

Advanced APIs

Google example used Google API / SOAP

NCBI API

The NCBI Web service is a web program that enables

developers to access Entrez Utilities via the Simple Object

Access Protocol (SOAP)

Programmers may write software applications that access

the E-Utilities using any SOAP development tool

Main tools (demo6):

E-Search Searches and retrieves primary IDs and term

translations and optionally retains results for future use in

the user's environment

E-Fetch: Retrieves records in the requested format from a

list of one or more primary IDs

Ensembl API (demo7)

Page 97: Bioinformatics v2014 wim_vancriekinge

97

Fetch data from NCBI

A NCBI database, frequently used is PubMed

PubMed can be queried using E-Utils

Uses syntax as regular PubMed website

Get the data back in data formats as on the website

(XML, Plain Text)

Parse XML results and more advanced Text-mining

techniques

Demo8

Parse results and present them in an interface

(http://matrix.ugent.be/mate/methylome/result1.html)

Page 98: Bioinformatics v2014 wim_vancriekinge

98

Fetch data from NCBI

Example: PubMeth

Get data from NCBI PubMed

Get all genes and all aliases for human genes and their

annotations from Ensembl & GeneCards

Get all cancer types from cancer thesaurius

Parse PubMed results: find genes and aliases;

keywords

Keep variants in mind (Regexes are very useful)

Sort the PubMed abstracts and store found genes and

keywords in database; apply scoring scheme

Page 99: Bioinformatics v2014 wim_vancriekinge

99

Outline

• Scripting

Perl (Bioperl/Python)examples spiders/bots

• Databases

Genome Browserexamples biomart, galaxy

• AI

Classification and clusteringexamples WEKA (R, Rapidminer)

Page 100: Bioinformatics v2014 wim_vancriekinge

100

The three genome browsers

• There are three main browsers:

Ensembl

NCBI MapViewer

UCSC

• At first glance their main distinguishing features are:

MapViewer is arranged vertically.

Ensembl has multiple (22) different “Views”.

UCSC has a single “View” for (almost) everything.

Page 101: Bioinformatics v2014 wim_vancriekinge

101

MapViewer Home

http://www.ncbi.nlm.nih.gov/mapview/

Page 102: Bioinformatics v2014 wim_vancriekinge

102

MapViewer Master Map

Page 103: Bioinformatics v2014 wim_vancriekinge

103

Selecting tracks on MapViewer

Page 104: Bioinformatics v2014 wim_vancriekinge

104

MapViewer strengths

• Good coverage of plant and fungal genomes.

• Close integration with other NCBI tools and databases, such as Model Maker, trace archives or Celera assemblies.

• Vertical view enables convenient overview of regional gene descriptions.

• Discontiguous MEGABLAST is probably the most sensitive tool available for cross-species sequence queries.

• Ability to view multiple assemblies (e.g. Celera and reference) simultaneously.

Page 105: Bioinformatics v2014 wim_vancriekinge

105

MapViewer limitations

• Little cross-species conservation or alignment

data.

• Inability to upload custom annotations and data.

• Limited capability for batch data access.

• Limited support for automated database querying.

• Vertical view makes base-pair level annotation

cumbersome.

Page 106: Bioinformatics v2014 wim_vancriekinge

106106

UCSC Genome Browser

Page 107: Bioinformatics v2014 wim_vancriekinge

107107

http://genome.ucsc.edu/

Page 108: Bioinformatics v2014 wim_vancriekinge

108108

UCSC Genome Browser

Page 109: Bioinformatics v2014 wim_vancriekinge

109

Strengths of the UCSC Browser (I)

For this course I will be focusing primarily on the UCSC Browser for several reasons:

• Strong comparative genomics capabilities.

• Fast response

sequence searches performed with BLAT.

code is written in speed-optimized C.

Multiple indexing and non-normalized tables for fast database retrieval.

• (Essentially) single “view” from single base-pair to entire chromosome.

• Easiest interface for loading custom annotations.

Page 110: Bioinformatics v2014 wim_vancriekinge

110

UCSC Browser Strengths (II)

• Well suited for batch and automated querying of both gene and intergenic regions.

• Comprehensive: tends to have the most species, genes and annotations.

• Annotations frequently updated (Genbank/Refseqdaily / ESTs weekly).

• Able to find “similar” genes easily with GeneSorter.

• Rapid access to in situ images with VisiGene.

Page 111: Bioinformatics v2014 wim_vancriekinge

111

UCSC browser limitations

• Lack of “overview” mode can make it harder to see genomic context.

• Syntenic regions cannot be viewed simultaneously.

• Cross species sequence queries with BLAT are often insensitive.

• Comprehensiveness of database can make user interface intimidating.

• Code access for commercial users requires licensing.

Page 112: Bioinformatics v2014 wim_vancriekinge

112

Human, mouse,rat synteny in MapViewer

Page 113: Bioinformatics v2014 wim_vancriekinge

113113

Browser/Database Batch Querying

Page 114: Bioinformatics v2014 wim_vancriekinge

114

Batch querying overview

• Introduction / motivation

• UCSC table browser

• Custom tracks and frames

• Galaxy and direct SQL database querying

• A batch query example

• UCSC Database “gotchas”

• Batch querying on Ensembl

Page 115: Bioinformatics v2014 wim_vancriekinge

115

Why batch querying

• Interactive querying is difficult if you want to study numerous “interesting” genomic regions.

• Querying each region interactively is:

Tedious

Time-consuming

Error prone

Page 116: Bioinformatics v2014 wim_vancriekinge

116

Batch querying examples

• As an example, say you have found one hundred candidate polymorphisms and you want to know:

Are they in dbSNP?

Do they occur in any known ESTs?

Are the sites conserved in other vertebrates?

Are they near any ”LINE” repeat sequences?

Of course you could repeat the procedures described in Part II one hundred times but that would get “old” very fast…

Page 117: Bioinformatics v2014 wim_vancriekinge

117

Other examples

• Other examples include characterizing multiple:

Non-coding RNA candidates

ultra-conserved regions

introns hosting snoRNA genes

Page 118: Bioinformatics v2014 wim_vancriekinge

118

Browsers and databases

• Each of the genome browsers is built on top of multiple relational databases.

• Typically data for each genome assembly are stored in a separate database and auxiliary data, e.g. gene ontology (GO) data, are stored in yet other databases.

• These databases may have hundreds of tables, many with millions of entries.

Page 119: Bioinformatics v2014 wim_vancriekinge

119

The UCSC Table Browser

• For batch queries, you need to query the browser databases.

• The conventional way of querying a relational database is via “Structured Query Language” (SQL).

• However with the Table Browser, you can query the database without using SQL.

Page 120: Bioinformatics v2014 wim_vancriekinge

120

Browser Database Formats

Nevertheless, even with the Table Browser, you need some understanding of the underlying track, table and file formats.

Table formats describe how data is stored in the (relational) databases.

Track formats describe how the data is presented on the browser.

File formats describe how the data is stored in “flat files” in conventional computer files.

Finally, for understanding the underlying the computer code (as we will do in the last part of this tutorial) you will need to learn about the “C” structures which hold the data in the source code.

Page 121: Bioinformatics v2014 wim_vancriekinge

121

Main UCSC Data Formats

• GFF/GTF

• BED (Browser Extensible Data)

lists of genomic blocks

• PSL

RNA/DNA alignments

• .chain

pair-wise cross species alignments

• .maf

multiple genome alignments

• .wig

numerical data

Page 122: Bioinformatics v2014 wim_vancriekinge

122

Custom Tracks

• Custom tracks are essentially BED, PSL or GTF files

with formatting lines so they can be displayed on the

browser.

• A custom track file can contain multiple tracks, which

may be in different formats.

• Custom tracks are useful for:

Display of regions of interest on the browser.

Sharing custom data with others.

Input of multiple, arbitrary regions for annotation by the Table

Browser.

• Custom tracks can be made by the Table Browser, or

you can make them easily yourself.

Page 123: Bioinformatics v2014 wim_vancriekinge

123

Selecting custom track output

Page 124: Bioinformatics v2014 wim_vancriekinge

124124

Sending custom track to browser

Page 125: Bioinformatics v2014 wim_vancriekinge

125125

Adding a custom track

Page 126: Bioinformatics v2014 wim_vancriekinge

126

Adding a custom track (II)

Page 127: Bioinformatics v2014 wim_vancriekinge

127

Custom track example

browser position chr22:10000000-10020000

browser hide all

track name=clones description="Clones” visibility=3

color=0,128,0 useScore=1

chr22 10000000 10004000 cloneA 960

chr22 10002000 10006000 cloneB 200

chr22 10005000 10009000 cloneC 700

chr22 10006000 10010000 cloneD 600

chr22 10011000 10015000 cloneE 300

chr22 10012000 10017000 cloneF 100

Page 128: Bioinformatics v2014 wim_vancriekinge

128

Limitations of the table browser

• Can be difficult to create more complex queries.

• With hundreds of tables, finding the one(s) you want can be confusing.

• Getting intersections or unions of genomic regions is often a multi-step process and can be tedious or error prone.

• May be slower than direct SQL query.

• Not designed for fully automated operation.

Page 129: Bioinformatics v2014 wim_vancriekinge

129

Ensembl

Page 130: Bioinformatics v2014 wim_vancriekinge

130

Ensembl Home http://www.ensembl.org/

Page 131: Bioinformatics v2014 wim_vancriekinge

131

Ensembl ContigView

Page 132: Bioinformatics v2014 wim_vancriekinge

132

Ensembl ContigView

Page 133: Bioinformatics v2014 wim_vancriekinge

133

Detail and Basepair view

Page 134: Bioinformatics v2014 wim_vancriekinge

134

Changing tracks in Ensembl

Page 135: Bioinformatics v2014 wim_vancriekinge

135135

Ensembl strengths (I)

• Multiple view levels shows genomic context.

• Some annotations are more complete and/or are more clearly presented (e.g. snpView of multiple mouse strain data.)

• Possible to create query over more than one genome database at a time (with BioMart).

Page 136: Bioinformatics v2014 wim_vancriekinge

136

Ensembl snpView

Page 137: Bioinformatics v2014 wim_vancriekinge

137

Ensembl strengths (II)

• Batch and automated querying well supported and documented (especially for perl and java).

• API (programmer interface) is designed to be identical for all databases in a release.

• Ensembl tends to be more “community oriented” -using standard, widely used tools and data formats.

• All data and code are completely free to all.

Page 138: Bioinformatics v2014 wim_vancriekinge

138

Ensembl is “community oriented”

• Close alliances with Wormbase, Flybase, SGD

• “support for easy integration with third party data and/or programs” – BioMart

• Close integration with R/ Bioconductor software

• More use of community standard formats and programs, e.g. DAS, GFF/GTF, Bioperl

( Note: UCSC also supports GFF/GTF and is compatible with R/Bioconductor and DAS, but UCSC tends to use more “homegrown” formats, e.g. BED, PSL, and tools.)

Page 139: Bioinformatics v2014 wim_vancriekinge

139

Ensembl limitations

• Limited data quantifying cross-species sequence conservation.

• Batch queries for intergenic regions with BioMart are difficult.

• BioMart offers less complete access to database than UCSC Table Browser. (However, the user interface to BioMart is easier.)

Page 140: Bioinformatics v2014 wim_vancriekinge

140

BioMart

• BioMart - the Ensembl “Table browser”

• Similar to the Table Browser and Galaxy tools.

• Previous version was called EnsMart.

• Fewer tables can be accessed with BioMart than

with UCSC Table Browser. In particular, non-gene

oriented queries may be difficult.

• However, the user interface is simpler.

• Tight interface with Bioconductor project for

annotation of microarray genes.

Page 141: Bioinformatics v2014 wim_vancriekinge

141

The Galaxy Website

• Galaxy website: http://g2.bx.psu.edu

• Galaxy objective: Provide sequence and data manipulation tools (a la SRS or the UCSD Biology Workbench) that are capable of being applied to genomic data.

• The intent is to provide an easy interface to numerous analysis tools with varied output formats that can work on data from multiple browsers / databases.

Page 142: Bioinformatics v2014 wim_vancriekinge

142142

Page 143: Bioinformatics v2014 wim_vancriekinge

143

• Galaxy is a web interface to bioinformatics tools that deal with genome-scale data

• There is a public server with many pre-installed tools

• Many tools work with genomic intervals

• Other tools work with various types of tab delimited data formats, and some directly on DNA sequences

• It has excellent tools to access public data

• It can be installed on a local computer or set up as an institutional server

• Can access a standard or custom build on Amazon “Cloud”

• Any command line tool or web service can easily be wrapped into the Galaxy interface.

Demo: Galaxy Genomics Toolkit

Page 144: Bioinformatics v2014 wim_vancriekinge

144

Genome-Scale Data

• Bioinformatics work is challenging on

very large “genomics” data sets

sequencing, gene expression, variants,

ChIPseq

• Complex command line programs

• Genome Browsers

• New tools

Page 145: Bioinformatics v2014 wim_vancriekinge

145

The Galaxy Interface has 3 parts

List of Tools Central work panel

History =

data & results

Page 146: Bioinformatics v2014 wim_vancriekinge

146

Load Data from UCSC

Or upload from your computer

Page 147: Bioinformatics v2014 wim_vancriekinge

147

Demo: Galaxy Genomics Toolkit

• http://athos.ugent.be:8080: staat er een Galaxy instance.

• inloggen (als admin: [email protected], password: newnew)

• de cleanfq history heeft 2 paar fastq files en een ref fa en een ref gtf

Page 148: Bioinformatics v2014 wim_vancriekinge

148

Workflows

• Galaxy saves your data, and results in the

History

• The exact commands and parameters used with

each operation with each tool are also saved.

• These operations can be saved as a “Workflow”,

which can be reused, and shared with other

users.

Page 149: Bioinformatics v2014 wim_vancriekinge

149

• Galaxy has many public data sets and public workflows, which can be easily used in your projects (or a tutorial)

Page 150: Bioinformatics v2014 wim_vancriekinge

150

NGS tools

• Galaxy has recently been expanded with tools to

analyze Next-Gen Sequence data

• File format conversions

• Analysis methods specific to different sequencing

platforms (454, Illumina, SOLID)

• Analysis methods specific to different applications

(RNA-seq, ChIP-seq, mutation finding,

metagenomics, etc).

Page 151: Bioinformatics v2014 wim_vancriekinge

• NGS tools include file format conversion, mapping to reference genome, ChIPseq peak calling, RNA-seq gene expression, etc.

• NGS data analysis uses

large files – slow to upload

and slow to process on a

public server

Page 152: Bioinformatics v2014 wim_vancriekinge

152

A number of Groups have set up custom Galaxy servers with special tools

Page 153: Bioinformatics v2014 wim_vancriekinge

153

The SPARQLing future

Page 154: Bioinformatics v2014 wim_vancriekinge

154

Outline

• Scripting

Perl (Bioperl/Python)examples spiders/bots

• Databases

Genome Browserexamples biomart, galaxy

• AI

Classification and clusteringexamples WEKA (R, Rapidminer)

Page 155: Bioinformatics v2014 wim_vancriekinge

155

Wat is ‘intelligent’ ?

• Intelligentie = de mogelijkheid tot

leren en begrijpen, tot het oplossen

van problemen, tot het nemen van

beslissingen

Machine learning …

Page 156: Bioinformatics v2014 wim_vancriekinge

156

Turing test voor intelligentie

THE IMITATION GAME

Vrouw

Man/Machine

Ondervrager: Wie van

beide is de vrouw?

Page 157: Bioinformatics v2014 wim_vancriekinge

157

Wat is ‘artificieel’ ?

• Artificieel = kunstmatig = door de mensvervaardigd, niet van natuurlijkeoorsprong

• in de context van A.I.: machines, meestaleen digitale computer

• H. Simon: analogie mens-digitalecomputer

geheugen

uitvoeringseenheid

controle-eenheid

Page 158: Bioinformatics v2014 wim_vancriekinge

158

Data mining

• WAT? extraheren van kennis uit data

• Data indelen in drie groepen:

trainingsset

validatieset

testset

• Clustering/Classificatie

Page 159: Bioinformatics v2014 wim_vancriekinge

159

Clustering

• WAT? ‘unsupervised learning’ –

antwoord voor de trainingsdata niet

gekend

• Resultaat meestal als boomstructuur

• Belangrijke methode: hiërarchisch

clusteren opstellen van distance matrix

Page 160: Bioinformatics v2014 wim_vancriekinge

160

Cluster Analysis

• Unsupervised methods

• Descriptive modeling

Grouping of genes with “similar” expression profiles

Grouping of disease tissues, cell lines, or toxicants with “similar” effects on gene expression

• Clustering algorithms

Self-organizing maps

Hierarchical clustering

K-means clustering

SVD

Page 161: Bioinformatics v2014 wim_vancriekinge

161

Linkage in Hierarchical Clustering

• Single linkage:S(A,B) = mina minb d(a,b)

• Average linkage:A(A,B) = (∑a ∑b d(a,b)) / |A| |B|

• Complete linkage:C(A,B) = maxa maxb d(a,b)

• Centroid linkage:M(A,B) = d(mean(A),mean(B))

• Hausdorff linkage:h(A,B) = maxa minb d(a,b)

H(A,B) = max(h(A,B),h(B,A))

• Ward linkage:W(A,B) = (|A| |B| (M(A,B))2) / (|A|+|B|)

A

B

Page 162: Bioinformatics v2014 wim_vancriekinge

162

Hierarchical Clustering

3 clusters?

2 clusters?

Page 163: Bioinformatics v2014 wim_vancriekinge

163

Classificatie

• WAT? ‘supervised learning’ – antwoord

voor de trainingsdata is gekend

• Verschillende classificatiemethoden:

decision tree

neurale netwerken

support vector machines

Page 164: Bioinformatics v2014 wim_vancriekinge

164

Voorbeeld: tennis

Decision tree

Page 165: Bioinformatics v2014 wim_vancriekinge

165

Neurale netwerken

BOUW: Neuronen en verbindingen

TAAK:

verwerken van invoergegevens

machine learning

Page 166: Bioinformatics v2014 wim_vancriekinge

166

Support Vector Machines

Doorvoeren van een lineaire separatie in de data

door de dimensies aan te passen

Page 167: Bioinformatics v2014 wim_vancriekinge

167

Bio-informatica toepassingen

• Decision tree: zoeken naar DNA-sequenties

homoloog aan een gegeven DNA-sequentie

• Neurale netwerken: modelleren en analyseren

van genexpressiegegevens, voorspellen van de

inwerkingsplaatsen van proteasen

• Support Vector Machines: identificeren van

genen betrokken bij anti-kankermechanismen,

detecteren van homologie tussen eiwitten,

analyse van genexpressie

Page 168: Bioinformatics v2014 wim_vancriekinge

168

Bio-informatica toepassingen

• Hiërarchisch clusteren: opstellen van fylo-genetische bomen op basis van DNA-sequenties

• Genetische algoritmes: moleculaire herkenning, relatie tussen structuur en functie ophelderen, Multiple Sequence Alignment

• Expertsystemen: ontdekken van blessures, vroege detectie van afwijkingen aan de hartklep

• Fuzzy logic: primerdesign, voorspellen van de functie van een onbekend gen, expressie-analyse

Page 169: Bioinformatics v2014 wim_vancriekinge

169

Outline

Page 170: Bioinformatics v2014 wim_vancriekinge

170

Classification

OMS

classifier

C N

N N C C

N C

C

C C

C

N

N N

N

C: cancer, N: normal

Page 171: Bioinformatics v2014 wim_vancriekinge

171

Classification

OMS

classifier

R N

N N R R

N R

R

R R

R

N

N N

N

R: responder

N: non-responder

Page 172: Bioinformatics v2014 wim_vancriekinge

172

Outline

Page 174: Bioinformatics v2014 wim_vancriekinge

174

Why use methylation as a biomarker ?

• What is feature/biomarker ?

A characteristic that is objectively

measured and evaluated as an indicator

of normal biological processes,

pathogenic processes, or pharmacologic

responses to a therapeutic intervention

• Business/biological feature

selection/reduction

Of all possible (molecular and clinical)

features oncomethylome measures

methylation (in cancer/onco)

Page 175: Bioinformatics v2014 wim_vancriekinge

175

Outline

Page 176: Bioinformatics v2014 wim_vancriekinge

176

Data preparation and modelling

• Data preparation

Construct binary features « Methylated » from

PCR data (Ct and Temp)

• Modelling

Construct classifier (cancer vs normal) from

features « Methylated »

Page 178: Bioinformatics v2014 wim_vancriekinge

178

Construction of features « Methylated »

• Per gene: find boolean function

Methylated IFF:

Ct below upperbound AND

Temp above lowerbound

• Taking into account

All Ct and Temp measurementsMethylation Specific Quantitative PCR (QMSP) for

normals and cancers

Noise in QMPS measurementsAs observed per gene during Quality Control

Page 179: Bioinformatics v2014 wim_vancriekinge

179

Construction of features « Methylated »

Plot of all Ct and Temp measurements for a given gene

Temp

Ct

What about noise?

Page 180: Bioinformatics v2014 wim_vancriekinge

180

Noise

Noise: random error or variance in a

measured variable

Incorrect attribute values may due to

Quantity not correctly compared to calibration (e.g., ruler slips)

Inaccurate calibration device (e.g., ruler > 1m)

Precision (e.g., truncated to nearest mile or Ångstrom unit)

Data entry problems

Data transmission problems

Inconsistency in naming convention

Page 181: Bioinformatics v2014 wim_vancriekinge

181

Construction of features «Methylated» Taking into account noise

StDev 1.6 StDev 0.3

QC: StdDev of Ct and Tm in IVM

StD

ev

3.5

StD

ev

0.0

2

Inrobust assay Robust assay

Cancer

Normal

Cut-off

Page 182: Bioinformatics v2014 wim_vancriekinge

182

Construction of features « Methylated » Taking into account noise

Sharp cut-off

Blunt cut-off

Good Reproducibility Bad Reproducibility

Methylated Methylated

MethylatedMethylated

Page 183: Bioinformatics v2014 wim_vancriekinge

183

Construction of features « Methylated » Taking into account noise

Find most robust cut-off for each gene

Compute quality with increasing noise levels (0-2 times StdDev)

Stdev0 2

Qualit

y 1

Stdev0 2

Qualit

y 1

Inrobust Robust

Quality score based on binomial test

46 or more successes with 58 trials unlikely

When probability success = 80/179

Expected nr successes = 21

16 or more successes with 44 trials likely

when probability success = 77/175

Expected nr successes = 19

Page 184: Bioinformatics v2014 wim_vancriekinge

184

Construction of features « Methylated »

Methylated: inside red box

Page 185: Bioinformatics v2014 wim_vancriekinge

185

Construction of features « Methylated »C

an

ce

rN

orm

al

Ranked GenesMethylated Unmethylated

Page 186: Bioinformatics v2014 wim_vancriekinge

186

Data preparation and modelling

• Data preparation

Construct binary features « Methylated » from

PCR data (Ct and Temp)

• Modelling

Construct classifier (cancer vs normal) from

« Methylated » features

Page 187: Bioinformatics v2014 wim_vancriekinge

187

Selection of modelling technique

• In theory, many techniques applicable

Data type: boolean methylation table, discrete

classes

See other talks today

• But, additional requirements follow from

business understanding (more details below)

Feature selectionFinal test should be based on at most ~5 genes

Understandability

Both provide a direct competitive advantage

• Example of acceptable technique: decision

trees

Page 188: Bioinformatics v2014 wim_vancriekinge

188

Decision treesThe Weka tool

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}

@attribute temperature {hot, mild, cool}

@attribute humidity {high, normal}

@attribute windy {TRUE, FALSE}

@attribute play {yes, no}

@data

sunny,hot,high,FALSE,no

sunny,hot,high,TRUE,no

overcast,hot,high,FALSE,yes

rainy,mild,high,FALSE,yes

rainy,cool,normal,FALSE,yes

rainy,cool,normal,TRUE,no

overcast,cool,normal,TRUE,yes

sunny,mild,high,FALSE,no

sunny,cool,normal,FALSE,yes

rainy,mild,normal,FALSE,yes

sunny,mild,normal,TRUE,yes

overcast,mild,high,TRUE,yes

overcast,hot,normal,FALSE,yes

rainy,mild,high,TRUE,no

http://www.cs.waikato.ac.nz/ml/weka/

Page 189: Bioinformatics v2014 wim_vancriekinge

189

Decision treesAttribute selection

outlook temperature humidity windy play

sunny hot high FALSE no

sunny hot high TRUE no

overcast hot high FALSE yes

rainy mild high FALSE yes

rainy cool normal FALSE yes

rainy cool normal TRUE no

overcast cool normal TRUE yes

sunny mild high FALSE no

sunny cool normal FALSE yes

rainy mild normal FALSE yes

sunny mild normal TRUE yes

overcast mild high TRUE yes

overcast hot normal FALSE yes

rainy mild high TRUE no

play

don’t play

pno = 5/14

pyes = 9/14

maximal gain of information

maximal reduction of Entropy = - pyes log2 pyes - pno log2 pno

= - 9/14 log2 9/14 - 5/14 log2 5/14

= 0.94 bits

http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html

http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/

Page 190: Bioinformatics v2014 wim_vancriekinge

play

don’t play

amount of information required to specify class of an example given that it reaches node

0.94 bits

0.0 bits

* 4/14

0.97 bits

* 5/14

0.97 bits

* 5/14

0.98 bits

* 7/14

0.59 bits

* 7/14

0.92 bits

* 6/14

0.81 bits

* 4/14

0.81 bits

* 8/14

1.0 bits

* 4/14

1.0 bits

* 6/14

outlook

sunny overcast rainy

+

= 0.69 bits

gain: 0.25 bits

+

= 0.79 bits

+

= 0.91 bits

+

= 0.89 bits

gain: 0.15 bits gain: 0.03 bits gain: 0.05 bits

play don't play

sunny 2 3

overcast 4 0

rainy 3 2

humidity temperature windy

high normal hot mild cool false true

play don't play

hot 2 2

mild 4 2

cool 3 1

play don't play

high 3 4

normal 6 1

play don't play

FALSE 6 2

TRUE 3 3

Decision treesAttribute selection

Page 191: Bioinformatics v2014 wim_vancriekinge

play

don’t play

outlook

sunny overcast rainy

0.97 bits

0.0 bits

* 3/5

humidity temperature windy

high normal hot mild cool false true

+

= 0.0 bits

gain: 0.97 bits

+

= 0.40 bits

gain: 0.57 bits

+

= 0.95 bits

gain: 0.02 bits

0.0 bits

* 2/5

0.0 bits

* 2/5

1.0 bits

* 2/5

0.0 bits

* 1/5

0.92 bits

* 3/5

1.0 bits

* 2/5

outlook temperature humidity windy play

sunny hot high FALSE no

sunny hot high TRUE no

sunny mild high FALSE no

sunny cool normal FALSE yes

sunny mild normal TRUE yes

Decision treesAttribute selection

Page 192: Bioinformatics v2014 wim_vancriekinge

play

don’t playoutlook

sunny overcast rainy

humidity

high normal

outlook temperature humidity windy play

rainy mild high FALSE yes

rainy cool normal FALSE yes

rainy cool normal TRUE no

rainy mild normal FALSE yes

rainy mild high TRUE no

1.0 bits

*2/5

temperature windy

hot mild cool false true

+

= 0.95 bits

gain: 0.02 bits

+

= 0.95 bits

gain: 0.02 bits

+

= 0.0 bits

gain: 0.97 bits

humidity

high normal

0.92 bits

* 3/5

0.92 bits

* 3/5

1.0 bits

* 2/5

0.0 bits

* 3/5

0.0 bits

* 2/5

0.97 bits

Decision treesAttribute selection

Page 193: Bioinformatics v2014 wim_vancriekinge

193

play

don’t playoutlook

sunny overcast rainy

windy

false true

humidity

high normal

Decision treesfinal tree

Page 194: Bioinformatics v2014 wim_vancriekinge

194

• Initialize top node to all examples

• While impure leaves available

select next impure leave L

find splitting attribute A with maximal information gain

for each value of A add child to L

Decision treesBasic algorithm

Page 195: Bioinformatics v2014 wim_vancriekinge

195

Decision tree built from methylation table

Sensitivity: 80%

Specificity: 88%

Decision tree:

Test based on 12 genes

Leave-one-out experiment

To avoid overfitting

Page 196: Bioinformatics v2014 wim_vancriekinge

196

Outline

Page 197: Bioinformatics v2014 wim_vancriekinge

197

Evaluation and deployment

• Decide whether to use Classification results

Can we use 12 gene decision tree for classifying

new patients?

• Verification of all steps

Excercise. The above modelling procedure contains

a classical mistake: the test-sets used for cross-

validation (see leave-one-out) have actually been

used for training the model. How? (Weka is not to blame)

And how can we fix this?

• Check whether business goals have been met

No: test based on 12 genes not useful (max ~5)

Iteration required

Page 198: Bioinformatics v2014 wim_vancriekinge

198

Attempt to rebuild decision treewith at most ~5 genes

New Decision tree:

Test based on 4 genes

Minimal leaf size

Increased to 12

Sensitivity decreased from 80% to 64%

Specificity increased from 88% to 90%

Page 199: Bioinformatics v2014 wim_vancriekinge

199

Evaluation and deploymentThe impact of « cost »

• Market conditions, cost of goods &

royalty structure can limit the amount

of genes that can tested

Page 200: Bioinformatics v2014 wim_vancriekinge

200

Evaluation and deploymentThe importance of « understandability »

Page 201: Bioinformatics v2014 wim_vancriekinge

201

Evaluation and deploymentThe importance of « understandability »

Pre and postmarket requirements imposed for IVDMIA (510k etc)

Understandability (NO black boxes) is becoming an important asset

Page 202: Bioinformatics v2014 wim_vancriekinge

202

Outline

• Scripting

Perl (Bioperl/Python)examples spiders/bots

• Databases

Genome Browserexamples biomart, galaxy

• AI

Classification and clusteringexamples WEKA (R, Rapidminer)

Page 203: Bioinformatics v2014 wim_vancriekinge

203

WEKA:: Introduction

• A collection of open source ML

algorithms

pre-processing

classifiers

clustering

association rule

• Created by researchers at the

University of Waikato in New Zealand

• Java based

Page 204: Bioinformatics v2014 wim_vancriekinge

204

WEKA:: Installation

• Download software from http://www.cs.waikato.ac.nz/ml/weka/

If you are interested in modifying/extending weka there is a developer version that includes the source code

• Set the weka environment variable for java setenv WEKAHOME /usr/local/weka/weka-3-0-

2

setenv CLASSPATH

$WEKAHOME/weka.jar:$CLASSPATH

• Download some ML data from http://mlearn.ics.uci.edu/MLRepository.html

Page 205: Bioinformatics v2014 wim_vancriekinge

205

Page 206: Bioinformatics v2014 wim_vancriekinge

206

Main GUI

• Three graphical user interfaces

“The Explorer” (exploratory data analysis)

“The Experimenter” (experimental environment)

“The KnowledgeFlow” (new process model inspired interface)

Page 207: Bioinformatics v2014 wim_vancriekinge

20711/24/2014

Explorer: pre-processing the data

• Data can be imported from a file in

various formats: ARFF, CSV, C4.5,

binary

• Data can also be read from a URL or

from an SQL database (using JDBC)

• Pre-processing tools in WEKA are

called “filters”

• WEKA contains filters for:

Discretization, normalization, resampling,

attribute selection, transforming and

combining attributes, …

Page 208: Bioinformatics v2014 wim_vancriekinge

208

@relation heart-disease-simplified

@attribute age numeric

@attribute sex { female, male}

@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}

@attribute cholesterol numeric

@attribute exercise_induced_angina { no, yes}

@attribute class { present, not_present}

@data

63,male,typ_angina,233,no,not_present

67,male,asympt,286,yes,present

67,male,asympt,229,yes,present

38,female,non_anginal,?,no,not_present

...

WEKA only deals with “flat” files

Page 209: Bioinformatics v2014 wim_vancriekinge

20911/24/20142

0

9

@relation heart-disease-simplified

@attribute age numeric

@attribute sex { female, male}

@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}

@attribute cholesterol numeric

@attribute exercise_induced_angina { no, yes}

@attribute class { present, not_present}

@data

63,male,typ_angina,233,no,not_present

67,male,asympt,286,yes,present

67,male,asympt,229,yes,present

38,female,non_anginal,?,no,not_present

...

WEKA only deals with “flat” files

Page 210: Bioinformatics v2014 wim_vancriekinge

21011/24/2014University of Waikato2

1

0

Page 211: Bioinformatics v2014 wim_vancriekinge

21111/24/2014University of Waikato2

1

1

Page 212: Bioinformatics v2014 wim_vancriekinge

21211/24/2014University of Waikato2

1

2

Page 213: Bioinformatics v2014 wim_vancriekinge

21311/24/2014University of Waikato2

1

3

Page 214: Bioinformatics v2014 wim_vancriekinge

21411/24/2014University of Waikato2

1

4

Page 215: Bioinformatics v2014 wim_vancriekinge

21511/24/2014University of Waikato2

1

5

Page 216: Bioinformatics v2014 wim_vancriekinge

21611/24/2014University of Waikato2

1

6

Page 217: Bioinformatics v2014 wim_vancriekinge

21711/24/2014University of Waikato2

1

7

Page 218: Bioinformatics v2014 wim_vancriekinge

21811/24/2014University of Waikato2

1

8

Page 219: Bioinformatics v2014 wim_vancriekinge

21911/24/2014University of Waikato2

1

9

Page 220: Bioinformatics v2014 wim_vancriekinge

22011/24/2014University of Waikato2

2

0

Page 221: Bioinformatics v2014 wim_vancriekinge

22111/24/2014University of Waikato2

2

1

Page 222: Bioinformatics v2014 wim_vancriekinge

22211/24/2014University of Waikato2

2

2

Page 223: Bioinformatics v2014 wim_vancriekinge

22311/24/2014University of Waikato2

2

3

Page 224: Bioinformatics v2014 wim_vancriekinge

22411/24/2014University of Waikato2

2

4

Page 225: Bioinformatics v2014 wim_vancriekinge

22511/24/2014University of Waikato2

2

5

Page 226: Bioinformatics v2014 wim_vancriekinge

22611/24/2014University of Waikato2

2

6

Page 227: Bioinformatics v2014 wim_vancriekinge

22711/24/2014University of Waikato2

2

7

Page 228: Bioinformatics v2014 wim_vancriekinge

22811/24/2014University of Waikato2

2

8

Page 229: Bioinformatics v2014 wim_vancriekinge

22911/24/2014University of Waikato2

2

9

Page 230: Bioinformatics v2014 wim_vancriekinge

23011/24/2014University of Waikato2

3

0

Page 231: Bioinformatics v2014 wim_vancriekinge

231

Explorer: building “classifiers”

• Classifiers in WEKA are models for

predicting nominal or numeric

quantities

• Implemented learning schemes

include:

Decision trees and lists, instance-based

classifiers, support vector machines,

multi-layer perceptrons, logistic

regression, Bayes’ nets, …

Page 232: Bioinformatics v2014 wim_vancriekinge

232November 24, 20142

3

2

age income student credit_rating buys_computer

<=30 high no fair no

<=30 high no excellent no

31…40 high no fair yes

>40 medium no fair yes

>40 low yes fair yes

>40 low yes excellent no

31…40 low yes excellent yes

<=30 medium no fair no

<=30 low yes fair yes

>40 medium yes fair yes

<=30 medium yes excellent yes

31…40 medium no excellent yes

31…40 high yes fair yes

>40 medium no excellent no

This

follows an

example

of

Quinlan’s

ID3

(Playing

Tennis)

Decision Tree Induction: Training Dataset

Page 233: Bioinformatics v2014 wim_vancriekinge

233

age?

overcast

student? credit rating?

<=30 >40

no yes yes

yes

31..40

fairexcellentyesno

Output: A Decision Tree for “buys_computer”

Page 234: Bioinformatics v2014 wim_vancriekinge

23511/24/2014University of Waikato2

3

5

Page 235: Bioinformatics v2014 wim_vancriekinge

23611/24/2014University of Waikato2

3

6

Page 236: Bioinformatics v2014 wim_vancriekinge

23711/24/2014University of Waikato2

3

7

Page 237: Bioinformatics v2014 wim_vancriekinge

23811/24/2014University of Waikato2

3

8

Page 238: Bioinformatics v2014 wim_vancriekinge

23911/24/2014University of Waikato2

3

9

Page 239: Bioinformatics v2014 wim_vancriekinge

24011/24/2014University of Waikato2

4

0

Page 240: Bioinformatics v2014 wim_vancriekinge

24111/24/2014University of Waikato2

4

1

Page 241: Bioinformatics v2014 wim_vancriekinge

24211/24/2014University of Waikato2

4

2

Page 242: Bioinformatics v2014 wim_vancriekinge

24311/24/2014University of Waikato2

4

3

Page 243: Bioinformatics v2014 wim_vancriekinge

24411/24/2014University of Waikato2

4

4

Page 244: Bioinformatics v2014 wim_vancriekinge

24511/24/2014University of Waikato2

4

5

Page 245: Bioinformatics v2014 wim_vancriekinge

24611/24/2014University of Waikato2

4

6

Page 246: Bioinformatics v2014 wim_vancriekinge

24711/24/2014University of Waikato2

4

7

Page 247: Bioinformatics v2014 wim_vancriekinge

24811/24/2014University of Waikato2

4

8

Page 248: Bioinformatics v2014 wim_vancriekinge

24911/24/2014University of Waikato2

4

9

Page 249: Bioinformatics v2014 wim_vancriekinge

25011/24/2014University of Waikato2

5

0

Page 250: Bioinformatics v2014 wim_vancriekinge

25111/24/2014University of Waikato2

5

1

Page 251: Bioinformatics v2014 wim_vancriekinge

25211/24/2014University of Waikato2

5

2

Page 252: Bioinformatics v2014 wim_vancriekinge

25311/24/2014University of Waikato2

5

3

Page 253: Bioinformatics v2014 wim_vancriekinge

25411/24/2014University of Waikato2

5

4

Page 254: Bioinformatics v2014 wim_vancriekinge

25511/24/2014University of Waikato2

5

5

Page 255: Bioinformatics v2014 wim_vancriekinge

25611/24/2014University of Waikato2

5

6

Page 256: Bioinformatics v2014 wim_vancriekinge

259

Explorer: finding associations

• WEKA contains an implementation of

the Apriori algorithm for learning

association rules

Works only with discrete data

• Can identify statistical dependencies

between groups of attributes:

milk, butter bread, eggs (with

confidence 0.9 and support 2000)

• Apriori can compute all rules that

have a given minimum support and

exceed a given confidence

Page 257: Bioinformatics v2014 wim_vancriekinge

26011/24/20142

6

0

Explorer: data visualization

• Visualization very useful in practice:

e.g. helps to determine difficulty of the

learning problem

• WEKA can visualize single attributes

(1-d) and pairs of attributes (2-d)

To do: rotating 3-d visualizations (Xgobi-

style)

• Color-coded class values

• “Jitter” option to deal with nominal

attributes (and to detect “hidden” data

points)

• “Zoom-in” function

Page 258: Bioinformatics v2014 wim_vancriekinge

26111/24/2014University of Waikato2

6

1

Page 259: Bioinformatics v2014 wim_vancriekinge

26211/24/2014University of Waikato2

6

2

Page 260: Bioinformatics v2014 wim_vancriekinge

26311/24/2014University of Waikato2

6

3

Page 261: Bioinformatics v2014 wim_vancriekinge

26411/24/2014University of Waikato2

6

4

Page 262: Bioinformatics v2014 wim_vancriekinge

26511/24/2014University of Waikato2

6

5

Page 263: Bioinformatics v2014 wim_vancriekinge

26611/24/2014University of Waikato2

6

6

Page 264: Bioinformatics v2014 wim_vancriekinge

26711/24/2014University of Waikato2

6

7

Page 265: Bioinformatics v2014 wim_vancriekinge

26811/24/2014University of Waikato2

6

8

Page 266: Bioinformatics v2014 wim_vancriekinge

26911/24/2014University of Waikato2

6

9

Page 267: Bioinformatics v2014 wim_vancriekinge

27011/24/2014University of Waikato2

7

0

Page 268: Bioinformatics v2014 wim_vancriekinge

271

References and Resources

• References: WEKA website:

http://www.cs.waikato.ac.nz/~ml/weka/index.html

WEKA Tutorial:Machine Learning with WEKA: A presentation

demonstrating all graphical user interfaces (GUI) in Weka.

A presentation which explains how to use Weka for exploratory data mining.

WEKA Data Mining Book:Ian H. Witten and Eibe Frank, Data Mining:

Practical Machine Learning Tools and Techniques (Second Edition)

WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page

Others:Jiawei Han and Micheline Kamber, Data Mining:

Concepts and Techniques, 2nd ed.

Page 269: Bioinformatics v2014 wim_vancriekinge

272

Een praktisch voorbeeld…

Page 270: Bioinformatics v2014 wim_vancriekinge

273

Resultaat van clustering

Page 271: Bioinformatics v2014 wim_vancriekinge

274

Resultaat van classificatie

Verschillende technieken:

ZeroR

IBk

Decision Table – SMO – Multilayer Perceptron

Page 272: Bioinformatics v2014 wim_vancriekinge

275

RapidMiner:: Introduction

• A very comprehensive open-source software implementing tools for

intelligent data analysis, data mining, knowledge discovery, machine learning, predictive analytics, forecasting, and analytics in business intelligence (BI).

• Is implemented in Java and available under GPL among other licenses

• Available from http://rapid-i.com

Page 273: Bioinformatics v2014 wim_vancriekinge

276

RapidMiner:: Intro. Contd.

• Is similar in spirit to Weka’s

Knowledge flow

• Data mining processes/routines are

views as sequential operators

Knowledge discovery process are

modeled as operator chains/treesOperators define their expected inputs and delivered

outputs as well as their parameters

• Has over 400 data mining operators

Page 274: Bioinformatics v2014 wim_vancriekinge

277

RapidMiner:: Intro. Contd.

• Uses XML for describing operator

trees in the KD process

• Alternatively can be started through

the command line and passed the

XML process file

Page 275: Bioinformatics v2014 wim_vancriekinge

278

Outline

• Scripting

Perl (Bioperl/Python)examples spiders/bots

• Databases

Genome Browserexamples biomart, galaxy

• AI

Classification and clusteringexamples WEKA (R, Rapidminer)