Data Preprocessing

Unsupervised learning

Elena Sügis

elena.sugis@ut.ee

Machine learning, SPB2016

edu.bioinf.me/files/preprocessing_unsupervised.pdf

Questions we ask

Experiments

Data comes in different forms

Data ≠ Knowledge

Simple data analysis pipeline

Data → Black Magic → Result

high quality data → machine learning method → awesome result

Simple data analysis pipeline

Data → Black Magic → Result

poor quality data → machine learning method → not so awesome result

Data Preprocessing

Clean

Massage your data

80 %

[Workflow diagram: Import data, Summarize/plot raw data, Impute missing values, Normalize/Standardize, Handle outliers, Data analysis, Interpretation, validation]

Meet your data

Missing Values

Origins:

•  Malfunctioning measurement equipment
•  Very low intensity signal
•  Deleted due to inconsistency with other recorded data
•  Data removed/not entered by mistake

Missing Values

How to deal with them:

•  Filter out
•  Replace missing values by 0
•  Replace by the mean or median value
•  K nearest neighbor imputation (KNN imputation)
•  Expectation-Maximization (EM) based imputations
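The mean/median option can be sketched in a few lines. This is only an illustration, assuming missing values are encoded as NaN and that each feature is imputed separately; the function name `impute_column` is hypothetical.

```python
import math
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Replace NaN entries of one feature by the mean or median of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if math.isnan(v) else v for v in values]


nan = float("nan")
column = [1.0, nan, 3.0, 8.0]
print(impute_column(column, "mean"))    # the gap is filled with 4.0
print(impute_column(column, "median"))  # the gap is filled with 3.0
```

The median is less sensitive to extreme values than the mean, which matters when outliers have not been handled yet.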

k-nearest neighbors

KNN

•  We are given a gene expression matrix M
•  Let X = (x_1, x_2, …, x_i, …, x_n) be a vector in the matrix M with a missing value x_i at dimension i
•  Find vectors X_1, X_2, …, X_k in M such that they are the k closest vectors to X (with a chosen distance measure) among the vectors that do not have a missing value at dimension i
•  Replace the missing value x_i with the mean (or median) of X_1i, X_2i, …, X_ki, i.e., the mean (median) of the values at dimension i of the vectors X_1, X_2, …, X_k
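The KNN steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes missing values are NaN, uses Euclidean distance over the dimensions observed in both vectors (one possible "chosen distance measure"), and fills with the mean. The names `distance` and `knn_impute` are hypothetical.

```python
import math
from statistics import mean


def distance(a, b):
    """Euclidean distance over the dimensions observed in both vectors."""
    shared = [(p, q) for p, q in zip(a, b) if not (math.isnan(p) or math.isnan(q))]
    if not shared:
        return math.inf
    return math.sqrt(sum((p - q) ** 2 for p, q in shared))


def knn_impute(matrix, k=2):
    """Fill each NaN with the mean of that dimension over the k closest usable rows."""
    imputed = [row[:] for row in matrix]
    for r, row in enumerate(matrix):
        for i, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Only rows that have a value at dimension i can serve as neighbours.
            candidates = [other for s, other in enumerate(matrix)
                          if s != r and not math.isnan(other[i])]
            candidates.sort(key=lambda other: distance(row, other))
            neighbours = candidates[:k]
            if neighbours:
                imputed[r][i] = mean(other[i] for other in neighbours)
    return imputed


nan = float("nan")
M = [
    [1.0, 2.0, nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [9.0, 9.0, 50.0],
]
print(knn_impute(M, k=2)[0])  # the gap is filled from the two nearby rows, not the distant one
```

Distances are always computed on the original matrix, so values imputed earlier do not influence later ones.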

KNN

Healthy people Patients

Gene expression matrix

Imputed missing values

Healthy people Patients

Gene expression matrix

Technical vs Biological

Normalization & Standardization

Objective: adjust measurements so that they can be appropriately compared among samples

Key ideas:

•  Remove technological biases
•  Make samples comparable

Methods:

•  Z-scores (centering and scaling)
•  Logarithmization
•  Quantile normalization
•  Linear model based normalization

Z-scores

Centering a variable is subtracting the mean of the variable from each data point so that the new variable's mean is 0.

Scaling a variable is multiplying each data point by a constant in order to alter the range of the data.

$$z = \frac{x - \mu}{\sigma}$$

where µ is the mean of the population and σ is the standard deviation of the population.
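As a small worked example of the z-score formula, the sketch below centers and scales one variable. It assumes the population standard deviation (`statistics.pstdev`) is the intended σ; the function name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Center by the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, matching σ in the formula
    if sigma == 0:
        raise ValueError("all values are identical, so the z-score is undefined")
    return [(x - mu) / sigma for x in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
print(z_scores(data))
```

After the transformation the variable has mean 0 and standard deviation 1, so variables measured on different scales become comparable.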

Principal Component Analysis

Transforms the data by a linear projection onto a lower-dimensional space that preserves as much data variation as possible.

Objective: reduce dimensionality while preserving as much variance as possible

http://setosa.io/ev/principal-component-analysis/
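To make the idea of a variance-preserving projection concrete, here is a sketch that finds only the first principal component. It is an assumption-laden illustration: it uses power iteration on the sample covariance matrix, whereas in practice a library routine based on SVD would normally be used. The function name `first_principal_component` is hypothetical.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (direction, scores): the unit vector of maximal variance and the
    coordinates of the centered points projected onto it."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("no variance along the current direction")
        v = [x / norm for x in w]
    scores = [sum(row[j] * v[j] for j in range(d)) for row in centered]
    return v, scores


# Points lying on the line y = 2x: one direction explains all of the variance.
direction, scores = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(direction)  # proportional to (1, 2)
print(scores)     # 1-D coordinates of the four points along that direction
```

Here two dimensions are reduced to one without losing any variation, because the points already lie on a line.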

Visualize normalized data

Groups

Healthy

Patients group1

Patients group2

Visual inspection after normalization

Visual Inspection. PCA

Highlight groups

Patients

Healthy people

Arrrgh!!! Why aren’t you together?!?!

Visual Inspection. PCA

Color by experiment/dataset/day etc.

DAY1

DAY2

Batch Effects

Batch effects are technical sources of variation that have been added to the samples during handling. They are unrelated to the biological or scientific variables in a study.

Measurements are affected by:

•  Laboratory conditions
•  Reagent lots
•  Personnel differences

Major problem: batch effects might be correlated with an outcome of interest and lead to incorrect conclusions

Fighting The Batch Effects

Experimental design solutions:

•  Shorter experiment time
•  Equally distributed samples between multiple laboratories and across different processing times, etc.
•  Provide info about changes in personnel, reagents, storage and laboratories

Statistical solutions:

•  ComBat
•  SVA (Surrogate variable analysis, SVD + linear models)
•  PAMR (Mean-centering)
•  DWD (Distance-weighted discrimination based on SVM)
•  Ratio_G (Geometric ratio-based)

J.T. Leek, Nature Reviews Genetics 11, 733-739 (October 2010); Chao Chen, PLoS ONE, 2011

Outliers Detection

Interquartile range

outlier
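One common way to turn the interquartile range into an outlier rule is the boxplot convention: flag values more than 1.5 × IQR below the first quartile or above the third. The sketch below assumes that convention (the factor 1.5 and the "inclusive" quartile method are assumptions, not taken from the slide); the function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


measurements = [10, 12, 11, 13, 12, 11, 40]
print(iqr_outliers(measurements))  # only 40 falls outside the fences
```

Whether a flagged value is removed, capped, or kept is a separate decision that depends on whether it reflects a technical error or real biology.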

IF YOU TORTURE THE DATA LONG ENOUGH IT WILL CONFESS TO ANYTHING

Ronald Coase, Economist, Nobel Prize winner

What is cluster analysis?

Clustering is finding groups of objects such that the objects in a group are:

•  similar (or related) to the objects in the same group, and
•  different from (or unrelated to) the objects in other groups

Properties

•  Classes/labels for each instance are derived only from the data
•  For that reason, cluster analysis is referred to as unsupervised classification

Why cluster biological data?

•  Intuition building: finding hidden internal structure of the high-dimensional data
•  Hypothesis generation: finding and characterizing similar groups of objects in the data
•  Knowledge discovery in data, e.g. underlying rules, reoccurring patterns, topics, etc.
•  Summarizing / compressing large data
•  Data visualization

Intuition building

[Figure: clusters labelled cardiopulmonary/metabolic disorders, neurological diseases, sensory conditions, cerebral vascular accident, cancer]

http://bmcgeriatr.biomedcentral.com/articles/10.1186/1471-2318-11-45

Hypothesis generation

[Figure: clustered heatmap, color coded scaled fold change (FC) vs control, scale -3 to 3. Gene labels: EGF, ILK, RHOC, ACTN1, BCAR1, ITGB3, ACTN4, MYH9, CAV1, HGF, MET, DPP4, MYLK, PLD1, ITGA4, ITGB1, ROCK1, MMP14, RHOB, MMP2, CAPN1, PTPN1, SRC, PLCG1, RAC2, MYH10, BAIAP2, STAT3, RND3, MMP9, RAC1, RHOA, SH3PXD2A, CSF1, DIAPH1. Compound labels: SAHA, Trichostatin A, Valproic acid, Cyproconazole, PCB180, PCB170, CdCl2, As2O3, Triadimenol, Triadimefon, PCB153, Tubacin, MeHg (CH3HgCl), Rotenon, Pb-acetate, Mannitol, Thimerosal. A group of compounds is marked HDACi.]

Knowledge discovery in data

Ex. Underlying rules, reoccurring patterns, topics, etc.

Summarizing/compressing the data

Partitional vs Hierarchical

Partitional: each sample (point) is assigned to a unique cluster.

Hierarchical: creates a nested and hierarchical set of partitions/clusters.

Fuzzy vs Non-Fuzzy

Fuzzy: each object belongs to each cluster with some weight.

Non-fuzzy: each object belongs to exactly one cluster.

Hierarchical clustering

Hierarchical clustering is usually depicted as a dendrogram (tree)

•  Each subtree corresponds to a cluster
•  Height of branching shows distance

Algorithm for Agglomerative Hierarchical Clustering:

•  Join the two closest objects
•  Keep joining the closest pairs
•  After 10 steps we have 4 clusters left

Q: Which clusters do we merge next?

Several ways to measure distance between clusters:

•  Single linkage (MIN)
•  Complete linkage (MAX)
•  Average linkage
•  Weighted
•  Unweighted ...
•  Ward’s method

In this example and at this stage we have the same result as in partitional clustering.

In the final step the two remaining clusters are joined into a single cluster.
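The agglomerative procedure and three of the linkage choices can be sketched together. This is a naive illustration with Euclidean distance between points, written for readability rather than speed; the names `cluster_distance` and `agglomerate` are hypothetical, and Ward's method is not covered here.

```python
import math
from itertools import combinations


def cluster_distance(a, b, linkage="single"):
    """Distance between two clusters under the chosen linkage."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":      # MIN: closest pair of points
        return min(pair_distances)
    if linkage == "complete":    # MAX: farthest pair of points
        return max(pair_distances)
    if linkage == "average":     # mean over all pairs of points
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="single"):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, n_clusters=3, linkage="single"))
```

Stopping at a chosen number of clusters corresponds to cutting the dendrogram at one height; running until a single cluster remains yields the full tree.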

Examples of Hierarchical Clustering in Bioinformatics

•  Phylogeny
•  Gene expression clustering

K-means clustering

•  Partitional, non-fuzzy

•  Partitions the data into K clusters

•  K is given by the user

Algorithm:

•  Choose K initial centers for the clusters

•  Assign each object to its closest center

•  Recalculate cluster centers

•  Repeat until convergence
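The algorithm above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and initial centers supplied by the caller (how to pick them well is a separate question); the function name `k_means` is hypothetical.

```python
import math


def k_means(points, centers, max_iter=100):
    """Alternate assignment and center updates until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    groups = []
    for _ in range(max_iter):
        # Assign each object to its closest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recalculate each center as the mean of its group (an empty group keeps its center).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, groups


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = k_means(points, centers=[(1, 1), (8, 8)])
print(centers)
print(groups)
```

Different initial centers can lead to different final clusters, so the algorithm is often run several times from different starting points.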

[Figures: K-means (1) to (5)]

Elbow method

Estimate the number of clusters

K-means clustering summary

•  One of the fastest clustering algorithms
•  Therefore very widely used
•  Sensitive to the choice of initial centers
   •  many algorithms to choose initial centers cleverly
•  Assumes that the mean can be calculated
   •  can be used on vector data
   •  cannot be used on sequences (what is the mean of A and T?)

K-medoids clustering

•  The same as K-means, except that the center is required to be at an object
•  Medoid - an object which has minimal total distance to all other objects in its cluster
•  Can be used on more complex data, with any distance measure
•  Slower than K-means
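A sketch of K-medoids on sequences shows why it works where K-means does not: only pairwise distances are needed, never a mean. The code assumes the simple alternating variant (assign, then pick the best object in each cluster as the new medoid), which is one of several K-medoids algorithms, and uses a position-wise mismatch count as the distance. The names `k_medoids` and `mismatches` are hypothetical.

```python
def mismatches(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))


def k_medoids(objects, medoids, distance, max_iter=100):
    """Alternate between assigning objects to medoids and re-choosing each medoid."""
    medoids = list(medoids)
    groups = []
    for _ in range(max_iter):
        # Assign each object to its closest medoid.
        groups = [[] for _ in medoids]
        for obj in objects:
            nearest = min(range(len(medoids)), key=lambda m: distance(obj, medoids[m]))
            groups[nearest].append(obj)
        # New medoid: the member with minimal total distance to the rest of its cluster.
        new_medoids = [
            min(g, key=lambda cand: sum(distance(cand, other) for other in g)) if g else medoids[m]
            for m, g in enumerate(groups)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, groups


sequences = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
medoids, groups = k_medoids(sequences, ["AAAA", "GGGG"], mismatches)
print(medoids)
print(groups)
```

Re-choosing a medoid compares every member with every other member of its cluster, which is where the extra cost relative to K-means comes from.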

[Figures: K-medoids (1) to (9)]

Examples of K-means and K-medoids in Bioinformatics

•  Gene expression clustering
•  Sequence clustering

Distance measures

Distance of vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:

•  Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
•  Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
•  Correlation distance: $d(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient

Distance of sequences ACCTTG and TACCTG:

•  Hamming distance => 3
•  Levenshtein distance => 2 (alignment: .ACCTTG against TACC.TG)
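The five distances can be written out directly from their definitions. The sketch below is plain Python with hypothetical function names; the Levenshtein routine is the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution.

```python
import math
from statistics import mean


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """1 - r(x, y), where r is the Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)


def hamming(s, t):
    """Number of mismatching positions; defined for equal-length sequences only."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from s
                current[j - 1] + 1,          # insert b into s
                previous[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(hamming("ACCTTG", "TACCTG"))      # 3, as on the slide
print(levenshtein("ACCTTG", "TACCTG"))  # 2, as on the slide
```

Correlation distance ignores overall magnitude and compares only the shape of two profiles, which is why it is popular for gene expression data.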

Put it into words & Discover

Gene ontology

What the found genes are doing:

•  Molecular Function - elemental activity or task
•  Biological Process - broad objective or goal
•  Cellular Component - location or complex

Functional enrichment statistics

[Figure: genes with known function x compared with your gene list]

Does your gene list include more genes with function x than expected by random chance?

p =

g:Profiler toolset

http://biit.cs.ut.ee/gprofiler

J. Reimand, M. Kull, H. Peterson, J. Hansen, J. Vilo: g:Profiler - a web-based toolset for functional profiling of gene lists from large-scale experiments (2007) NAR 35 W193-W200

Jüri Reimand, Tambet Arak, Priit Adler, Liis Kolberg, Sulev Reisberg, Hedi Peterson, Jaak Vilo: g:Profiler - a web server for functional interpretation of gene lists (2016 update) Nucleic Acids Research 2016; doi: 10.1093/nar/gkw199

Reading the output

[Figure: g:Profiler output with the labels Statistics and Your genes, showing GO:0034660, ncRNA metabolic process, 475 genes; the values 50 and 10 also appear]

Functional annotations & Significance

The hypergeometric test measures the statistical significance of having drawn a sample consisting of a specific number of k successes out of n total draws from a population of size N containing K successes.
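The significance described above can be computed as an upper-tail hypergeometric probability. The sketch below assumes the one-sided question "k or more successes", which is a common choice for enrichment but is not spelled out on the slide; the numbers in the example are made up for illustration, and the function name `enrichment_p_value` is hypothetical.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from a population
    of size N that contains K successes."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Illustrative numbers only: 10 genes in total, 4 of them annotated with function x,
# a gene list of 3 genes, 2 of which carry the annotation.
print(enrichment_p_value(N=10, K=4, n=3, k=2))  # 40 / 120 = 0.333...
```

When many functional categories are tested for the same gene list, the resulting p-values need a multiple testing correction before they are interpreted.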

Cluster annotation

GOsummaries

https://www.bioconductor.org/packages/release/bioc/html/GOsummaries.html

Practice time!