
Classification by Machine Learning Approaches

Michael J. Kerner – kerner@cbs.dtu.dk

Center for Biological Sequence Analysis, Technical University of Denmark

Outline

• Introduction to Machine Learning

• Datasets, Features

• Feature Selection

• Machine Learning Approaches (Classifiers)

• Model Evaluation and Interpretation

• Examples, Exercise

Machine Learning – Data Driven Prediction

To learn: “to gain knowledge or understanding of or skill in by study, instruction, or experience”

(Merriam-Webster English Dictionary, 2005)

Machine Learning: learning the theory automatically from the data, through a process of inference, model fitting, or learning from examples:

Automated extraction of useful information from a body of data by building good probabilistic models.

Ideally suited for areas with lots of data in the absence of a general theory.

Why do we need Machine Learning?

• Some tasks cannot be defined well, except by examples (e.g. recognition of faces or people).

• Large amounts of data may have hidden relationships and correlations. Only automated approaches may be able to detect these.

• The amount of knowledge about a certain problem / task may be too large for explicit encoding by humans (e.g. in medical diagnostics).

• Environments change over time, and new knowledge is constantly being discovered. A continuous redesign of the systems “by hand” may be difficult.

The Machine Learning Approach

Input Data (e.g. gene expression profiles, …) → ML Classifier → Prediction: Yes / No

Machine Learning

• Learning Task:
  – What do we want to learn or predict?

• Data and assumptions:
  – What data do we have available?
  – What is their quality?
  – What can we assume about the given problem?

• Representation:
  – What is a suitable representation of the examples to be classified?

• Method and Estimation:
  – Are there possible hypotheses?
  – Can we adjust our predictions based on the given results?

• Evaluation:
  – How well does the method perform?
  – Might another approach/model perform better?

Learning Tasks

• Classification:
  – Prediction of an item's class.

• Forecasting:
  – Prediction of a parameter value.

• Characterization:
  – Find hypotheses that describe groups of items.

• Clustering:
  – Partitioning of the (unassigned) dataset into clusters with common properties (unsupervised learning).

Emergence of Large Datasets

Dataset examples:

• Image processing
• Spam email detection
• Text mining
• DNA microarray data
• Protein function
• Protein localization
• Protein-protein interaction
• …

Dataset Examples

Edible or poisonous?

Dataset Examples

mRNA Splicing

mRNA Splice Site Prediction

Protein Function Prediction: ProtFun

• Predict as many biologically relevant features as we can from the sequence

• Train artificial neural networks for each category

• Assign a probability for each category from the NN outputs

############## ProtFun 2.2 predictions ########

>KCNA1_HUMAN

# Functional category Prob Odds

Amino_acid_biosynthesis 0.042 1.893

Biosynthesis_of_cofactors 0.119 1.654

Cell_envelope 0.031 0.507

Cellular_processes 0.027 0.373

Central_intermediary_metabolism 0.046 0.731

Energy_metabolism 0.036 0.395

Fatty_acid_metabolism 0.019 1.485

Purines_and_pyrimidines 0.214 0.879

Regulatory_functions 0.013 0.083

Replication_and_transcription 0.019 0.073

Translation 0.129 2.925

Transport_and_binding =>0.717 1.748

# Enzyme/nonenzyme Prob Odds

Enzyme 0.231 0.807

Nonenzyme =>0.769 1.078

# Enzyme class Prob Odds

Oxidoreductase (EC 1.-.-.-) 0.040 0.193

Transferase (EC 2.-.-.-) 0.056 0.163

Hydrolase (EC 3.-.-.-) 0.062 0.195

Lyase (EC 4.-.-.-) 0.020 0.430

Isomerase (EC 5.-.-.-) 0.010 0.321

Ligase (EC 6.-.-.-) 0.017 0.326

# Gene Ontology category Prob Odds

Signal_transducer 0.061 0.284

Receptor 0.055 0.323

Hormone 0.001 0.206

Structural_protein 0.002 0.086

Transporter 0.469 4.299

Ion_channel 0.207 3.633

Voltage-gated_ion_channel =>0.280 12.736

Cation_channel 0.348 7.560

Transcription 0.163 1.270

Transcription_regulation 0.166 1.331

Stress_response 0.011 0.125

Immune_response 0.031 0.370

Growth_factor 0.005 0.372

Metal_ion_transport 0.159 0.345
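The ProtFun output above is plain text, so it is easy to post-process. Below is a minimal Python sketch (a hypothetical helper, not part of ProtFun itself) that pulls the (category, probability, odds) triples out of such a block and flags the "=>"-marked winning category in each section:

from typing import List, Tuple

def parse_protfun(text: str) -> List[Tuple[str, float, float, bool]]:
    """Extract (category, probability, odds, is_best) from ProtFun text output."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#' section headers and '>' sequence names.
        if not line or line.startswith("#") or line.startswith(">"):
            continue
        parts = line.rsplit(None, 2)   # category ... prob odds
        if len(parts) != 3:
            continue
        category, prob, odds = parts
        best = prob.startswith("=>")   # '=>' marks the predicted category
        try:
            results.append((category, float(prob.lstrip("=>")), float(odds), best))
        except ValueError:
            continue                   # not a category line after all
    return results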

Complexity of datasets:

• Many instances (examples)

• Instances with multiple features (properties / characteristics)

• Dependencies between the features (correlations)


Data Preprocessing

Instance selection:
  – Remove identical / inconsistent / incomplete instances (e.g. reduction of homologous genes, removal of wrongly annotated genes)

Feature transformation / selection:
  – Projection techniques (e.g. principal components analysis)
  – Compression techniques (e.g. minimum description length)
  – Feature selection techniques

Benefits of Feature Selection

• Attain good, and often even better, classification performance using a small subset of features
  – Less noise in the data

• Provide more cost-effective classifiers
  – Fewer features to take into account → smaller datasets → faster classifiers

• Identification of (biologically) relevant features for the given problem

Feature Selection

Filter approach:

  All Features → Feature Subset Selection → Learning Algorithm → Optimal Features

Wrapper approach:

  All Features → Feature Subset Search Algorithm ↔ Learning Algorithm
                 (the classifier's evaluation serves as the selection criterion)
               → Selected Features → Learning Algorithm → Optimal Features

Filter Approach

• Independent of the classification model
• A relevance measure is calculated for each feature
• Features with a value lower than a selected threshold t are removed

Example: feature-class entropy
• Measures the “uncertainty” about the class when observing feature i

  f1 f2 f3 f4 | class      f1 f2 f3 f4 | class
   1  0  1  1 |   1         1  0  0  0 |   0
   0  1  1  0 |   1         0  0  1  0 |   0
   1  0  1  0 |   1         1  1  0  1 |   0
   0  1  0  1 |   1         0  1  0  1 |   0
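A minimal sketch of this measure on the toy table above: the conditional entropy H(class | feature) left after observing each feature (lower = more informative). Python is used here purely for illustration:

from collections import Counter
from math import log2

# Toy dataset from the table above: (f1, f2, f3, f4, class) per row.
data = [
    (1, 0, 1, 1, 1), (0, 1, 1, 0, 1), (1, 0, 1, 0, 1), (0, 1, 0, 1, 1),
    (1, 0, 0, 0, 0), (0, 0, 1, 0, 0), (1, 1, 0, 1, 0), (0, 1, 0, 1, 0),
]

def conditional_entropy(rows, feature_idx):
    """H(class | feature): expected remaining uncertainty about the
    class after observing the value of the given feature."""
    n = len(rows)
    h = 0.0
    for value in {r[feature_idx] for r in rows}:
        classes = [r[-1] for r in rows if r[feature_idx] == value]
        for count in Counter(classes).values():
            p = count / len(classes)
            h -= (len(classes) / n) * p * log2(p)
    return h

for i in range(4):
    print(f"H(class | f{i + 1}) = {conditional_entropy(data, i):.3f}")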

Wrapper approach

• Specific to a classification algorithm
• The search for a good feature subset is guided by a search algorithm
• The algorithm uses the evaluation of the classifier as a guide to find good feature subsets
• Search algorithm examples: sequential forward or backward search, genetic algorithms

Sequential backward elimination (a sketch follows the search example below):
  – Starts with the set of all features
  – Iteratively discards the feature whose removal results in the best classification performance

Wrapper approach

Example search (numbers = classification performance of each subset):

Full feature set: f1,f2,f3,f4
  Drop one feature:  f2,f3,f4 → 0.7   f1,f3,f4 → 0.8   f1,f2,f4 → 0.1   f1,f2,f3 → 0.75
  From f1,f3,f4:     f3,f4 → 0.85     f1,f4 → 0.1      f1,f3 → 0.8
  From f3,f4:        f4 → 0.2         f3 → 0.7
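A minimal sketch of this greedy search. The evaluate callable stands in for actually training and testing a classifier on each subset; the hard-coded scores are the hypothetical ones from the search tree above:

def backward_elimination(features, evaluate, min_features=1):
    """Greedy sequential backward elimination (wrapper approach)."""
    current = list(features)
    while len(current) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [(evaluate([f for f in current if f != drop]), drop)
                      for drop in current]
        best_score, drop = max(candidates)   # keep the best-performing subset
        current.remove(drop)
        print(f"dropped {drop}: {current} scores {best_score}")
    return current

# Hypothetical subset scores taken from the search example above.
scores = {
    frozenset(["f2", "f3", "f4"]): 0.7,  frozenset(["f1", "f3", "f4"]): 0.8,
    frozenset(["f1", "f2", "f4"]): 0.1,  frozenset(["f1", "f2", "f3"]): 0.75,
    frozenset(["f3", "f4"]): 0.85,       frozenset(["f1", "f4"]): 0.1,
    frozenset(["f1", "f3"]): 0.8,
    frozenset(["f4"]): 0.2,              frozenset(["f3"]): 0.7,
}
backward_elimination(["f1", "f2", "f3", "f4"], lambda s: scores[frozenset(s)])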

Classification Methods

- Decision trees

- Hidden Markov Models (HMMs)

- Support vector machines

- Artificial Neural Networks

- Bayesian methods

- …

Decision Trees

• Simple, practical and easy to interpret
• Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes

Example Dataset: Shall we play golf?

Instance |       Attributes / Features           | Class
  day    | outlook   temperature humidity  windy | Play Golf?
   1     | sunny     hot         high      FALSE | no
   2     | sunny     hot         high      TRUE  | no
   3     | overcast  hot         high      FALSE | yes
   4     | rainy     mild        high      FALSE | yes
   5     | rainy     cool        normal    FALSE | yes
   6     | rainy     cool        normal    TRUE  | no
   7     | overcast  cool        normal    TRUE  | yes
   8     | sunny     mild        high      FALSE | no
   9     | sunny     cool        normal    FALSE | yes
  10     | rainy     mild        normal    FALSE | yes
  11     | sunny     mild        normal    TRUE  | yes
  12     | overcast  mild        high      TRUE  | yes
  13     | overcast  hot         normal    FALSE | yes
  14     | rainy     mild        high      TRUE  | no
 today   | sunny     cool        high      TRUE  | ?

Example: Shall we play golf today?

WEKA data file (arff format) :

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
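Outside of WEKA, such an ARFF file can also be read with SciPy, assuming SciPy and pandas are installed; the file name below is illustrative (save the block above first):

from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.symbolic.arff")
# Nominal values come back as bytes, so decode them to plain strings.
df = pd.DataFrame(data).apply(lambda col: col.str.decode("utf-8"))
print(meta)        # attribute names and their nominal value sets
print(df.head())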


Feature compositions

[Figure: the attribute values of the golf dataset arranged as a decision tree – outlook (sunny / overcast / rainy), temperature (hot / cool / mild), humidity (high / normal), windy (TRUE / FALSE) – with YES / NO class leaves]

Decision Trees

J48 pruned tree
------------------
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves: 5
Size of the tree: 8

(internal nodes = attributes / features, branches = attribute values, leaves = classes)
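For comparison, a minimal scikit-learn sketch trained on the same 14 instances. Note this is not J48: sklearn's DecisionTreeClassifier is a CART-style learner with binary splits over one-hot encoded inputs, so the printed tree will differ in shape from J48's multiway splits:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14 instances from the golf table above.
rows = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
df = pd.DataFrame(rows, columns=["outlook", "temperature", "humidity", "windy", "play"])
X = pd.get_dummies(df.drop(columns="play"))   # one-hot encode the nominal features
clf = DecisionTreeClassifier(criterion="entropy").fit(X, df["play"])
print(export_text(clf, feature_names=list(X.columns)))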

Artificial Neural Networks (ANNs)

[Figures: a single artificial neuron and a multi-layer neural network]

Overfitting

Overfitting: a classifier that performs well on the training examples, but poorly on new examples.

Training and testing on the same data will generally produce a classifier that looks good on that dataset but overfits heavily.

To avoid overfitting:
• Use separate training and testing data (see the sketch below)
• Use cross-validation
• Use the simplest model possible
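A minimal sketch of the first point, using scikit-learn on synthetic noisy data (both the dataset and parameters are illustrative): an unpruned decision tree scores near-perfectly on its own training set but noticeably worse on held-out examples:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y adds label noise that a deep tree will memorize.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier().fit(X_tr, y_tr)   # unpruned: grows until pure leaves
print("training accuracy:", clf.score(X_tr, y_tr))   # near 1.0 (memorized)
print("test accuracy:    ", clf.score(X_te, y_te))   # noticeably lower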

Performance Evaluation

Cross-Validation (10-fold):

Data → split into 10 equal parts
  – Training Set (9/10) → train the ML Classifier
  – Test Set (1/10) → Performance Evaluation
Repeat 10×, so that each part serves once as the test set.
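A minimal sketch of 10-fold cross-validation with scikit-learn; the synthetic data stands in for a real feature matrix such as the splice-site features used later:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Each fold trains on 9/10 of the data and is scored on the held-out 1/10.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"mean accuracy {scores.mean():.2f} +/- {scores.std():.2f}")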

Performance Evaluation

Confusion Matrix

TP True Positives

TN True Negatives

FP False Positives

FN False Negatives

                    Predicted Label
                 positive   negative
Known  positive     TP         FN
Label  negative     FP         TN

Performance Evaluation

• Precision (PPV) = TP / (TP + FP)
  – Percentage of positive predictions that are correct

• Recall / Sensitivity = TP / (TP + FN)
  – Percentage of positively labeled instances also predicted as positive

• Specificity = TN / (TN + FP)
  – Percentage of negatively labeled instances also predicted as negative

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
  – Percentage of correct predictions

• Correlation Coefficient
  cc = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
  -1 ≤ cc ≤ 1;  cc = 1: no FP or FN;  cc = 0: random;  cc = -1: only FP and FN
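A small sketch computing all five measures from raw confusion-matrix counts; the counts in the example call are made up for illustration:

from math import sqrt

def evaluation_measures(tp, tn, fp, fn):
    """The five evaluation measures above, from confusion-matrix counts."""
    return {
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "cc": (tp * tn - fp * fn)
              / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(evaluation_measures(tp=40, tn=45, fp=5, fn=10))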

ROC – Receiver Operating Characteristic

[Figure: ROC curve. x-axis: False Positive Rate (1 - Specificity) = FP / (FP + TN); y-axis: True Positive Rate (Sensitivity) = TP / (TP + FN)]

ROC – Receiver Operating Characteristic

[Figure: example ROC curves, Sensitivity vs. 1 - Specificity]
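A minimal sketch of how the points of such a curve are obtained: sort the instances by classifier score, sweep the decision threshold from high to low, and record (FPR, TPR) at each step. The scores and labels below are illustrative:

import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold over the scores and record the
    (false positive rate, true positive rate) after each step."""
    order = np.argsort(scores)[::-1]        # highest classifier score first
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels == 1) / (labels == 1).sum()  # TP / (TP + FN)
    fpr = np.cumsum(labels == 0) / (labels == 0).sum()  # FP / (FP + TN)
    return fpr, tpr

fpr, tpr = roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
print(list(zip(fpr, tpr)))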

Case Study - Splice Site Prediction


Splice site prediction:

Correctly identify the borders of introns and exons in genes (splice sites)

• Important for gene prediction

• Split up into 2 tasks:
  – Donor prediction (exon → intron)
  – Acceptor prediction (intron → exon)

Case Study - Splice Site Prediction

• Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence

  – Donor sites: GT

  – Acceptor sites: AG

• Classification problem:
  – Distinguish between true GT, AG and false GT, AG.

Case Study - Splice Site Prediction

Features:

• Position-dependent features
  – e.g. an A at position 1, a C at position 17, …

• Position-independent features
  – e.g. subsequence “TCG” occurs, “GAG” occurs, …

Example context (positions numbered 1, 2, 3, … across the window):

  atcgatcagtatcgat GT ctgagctatgag

Original Data – Human Acceptor Splice Sites

>HUMGLUT4B_3535
GGGCCCCTAGCGGAAGGAAAAAAATCATGGTTCCATGTGACATGCTGTGTCTTTGTGTCTGCCTGTTCAGGATGGGGAACCCCCTCAGCA
>HUMGLUT4B_3763
GAGGACAGGTGTCTCGGGGGTGGTGGAAAGGGGACGGTCTGCAGGAAATCTGTCCTCTGCTGTCCCCCAGGTGATTGAACAGAGCTACAA
>HUMGLUT4B_4028
TGGGGGAAACAGGAAGGGAGCCACTGCTGGGTGCCCTCACCCTCACAGCCTCACTCTGTCTGCCTGCCAGGAAAAGGGCCATGCTGGTCA
>HUMGLUT4B_4276
TGGGCTTTCAGATGGGAATGGACACCTGCCCTCAGCCCTCTCTTCTTCCCTCGCCCAGGGCTGACATCAGGGCTGGTGCCCATGTACGTG
>HUMGLUT4B_4507
ATATGGTGGGCTTCCAAGGTAAGGCAGAAGGGCTGAGTGACCTGCCTTCTTTCCCAACCTTCTCCCACAGGTGCTGGGCTTGGAGTCCCT
>HUMGLUT4B_4775
GCCTCCGCCTCATCTTGCTAGCACCTGGCTTCCTCTCAGGTCCCCTCAGGCCTGACCTTCCCTTCTCCAGGTCTGAAGCGCCTGACAGGC
>HUMGLUT4B_5125
CCAGCCTGTTGTGGCTGGAGTAGAGGAAGGGGCATTCCTGCCATCACTTCTTCTTCTCCCCCACCTCTAGGTTTTCTATTATTCGACCAG
>HUMGLUT4B_5378
CCTCACCCACGCGGCCCCTCCTACTTCCCGTGCCCAAAAGGCTGGGGTCAAGCTCCGACTCTCCCCGCAGGTGTTGTTGGTGGAGCGGGC
>HUMGLUT4B_5995
CTGAGTTGAGGGCAAGGGAAGATCAGAAAGGCCTCAACTGGATTCTCCACCCTCCCTGTCTGGCCCCTAGGAGCGAGTTCCAGCCATGAG
>HUMGLUT4B_6716
CTGGTTGCCTGAAACTACCCCTTCCCTCCCCACCTCACTCCGTCAACACCTCTTTCTCCACCTGTCCCAGGAGGCTATGGGGCCCTACGT
>HSRPS6G_1493
CTTTGTAGATGGCTCTACAATTACCTGTATAGATAGTTTCGTAAACTATTTCCCCCCTTTTAATCCTTAGCTGAACATCTCCTTCCCAGC
[...]

Arff Data File - WEKA

@RELATION splice-train

@ATTRIBUTE -68_A {0,1}
@ATTRIBUTE -68_T {0,1}
@ATTRIBUTE -68_C {0,1}
@ATTRIBUTE -68_G {0,1}
@ATTRIBUTE -67_A {0,1}
@ATTRIBUTE -67_T {0,1}
@ATTRIBUTE -67_C {0,1}
@ATTRIBUTE -67_G {0,1}
[...]
@ATTRIBUTE 20_A {0,1}
@ATTRIBUTE 20_T {0,1}
@ATTRIBUTE 20_C {0,1}
@ATTRIBUTE 20_G {0,1}
@ATTRIBUTE class {true,false}

@DATA
0,0,0,1,0,0,0,1, [...] ,1,0,0,0,true
0,0,0,1,1,0,0,0, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,0,0,0,1,true
[...]
1,0,0,0,0,1,0,0, [...] ,0,1,0,0,true
0,0,0,1,0,0,1,0, [...] ,0,0,1,0,true
0,0,1,0,0,0,1,0, [...] ,0,0,0,1,true
0,0,1,0,0,0,1,0, [...] ,0,0,1,0,true

The original sequence files in FASTA format have been converted to represent the four DNA bases in a binary fashion:

A: 1 0 0 0
T: 0 1 0 0
C: 0 0 1 0
G: 0 0 0 1

Case Study - Splice Site Prediction

• Local context of 88 nucleotides around the splice site

• 88 position-dependent features
• A = 1000, T = 0100, C = 0010, G = 0001

  → 352 binary features

• Reduce the dataset to contain fewer but relevant features:

  352 binary features → 15 binary features
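A minimal sketch of this encoding (the helper name is ours, not part of the course material): each base in the window contributes four 0/1 features, so an 88-base context yields 88 * 4 = 352 features.

# The binary encoding used above: each base becomes four 0/1 features.
ENCODING = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0),
            "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}

def encode_window(seq):
    """Turn a nucleotide window into position-dependent binary features."""
    features = []
    for base in seq.upper():
        features.extend(ENCODING[base])
    return features

print(encode_window("ATCG"))   # [1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1]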

Case Study – Splice Site Sequence Logos

Acceptor Sites: [sequence logo around the conserved AG dinucleotide]

Donor Sites: [sequence logo around the conserved GT dinucleotide]

(The numbered positions in the logos mark nucleotide positions relative to the splice site.)

Exercise:

• Building a prediction tool for human mRNA splice sites

• Feature selection for classification of splice sites

• Tool: The WEKA machine learning toolkit.

• Go to http://www.cbs.dtu.dk/~kerner/GeneDisc_Course_2007_MJK/ and follow the instructions

Acknowledgements

Slides and Exercises Adapted from and inspired by:

Søren Brunak

David Gilbert, Aik Choon Tan

Yvan Saeys
