Post on 18-Dec-2015

TRANSCRIPT

  • Slide 1
  • Introduction to Machine Learning. BMI/IBGP 730. Kun Huang, Department of Biomedical Informatics, The Ohio State University
  • Slide 2
  • Machine Learning: statistical learning, artificial intelligence, pattern recognition, data mining
  • Slide 3
  • Machine Learning: supervised, unsupervised, semi-supervised, regression
  • Slide 4
  • Clustering and Classification: preprocessing; distance measures; popular algorithms (not necessarily the best ones); more sophisticated ones; evaluation; data mining
  • Slide 5
  • - Clustering or classification? - Is training data available? - What domain-specific knowledge can be applied? - What preprocessing of the data is needed? - Log / data scale and numerical stability - Filtering / denoising - Nonlinear kernel - Feature selection (do I need to use all the data?) - Is the dimensionality of the data too high?
  • Slide 6
  • - Accuracy vs. generality - Overfitting - Model selection [Figure: prediction error vs. model complexity, with curves for the training sample and the testing sample; reproduced from Hastie et al.]
  • Slide 7
  • How do we process microarray data (clustering)? - Feature selection: genes, transformations of expression levels. - Genes discovered in the class comparison (t-test). Risk: missing genes. - Iterative approach: select genes under different p-value cutoffs, then select the set with good performance using cross-validation. - Principal components (pro and con). - Discriminant analysis (e.g., LDA).
  • Slide 8
  • - Dimensionality Reduction - Principal component analysis (PCA) - Singular value decomposition (SVD) - Karhunen-Loève transform (KLT) [Diagram labels: basis for P; SVD]
  • Slide 9
  • - Principal Component Analysis (PCA) - Other things to consider - Numerical balance/data normalization - Noisy direction - Continuous vs. discrete data - Principal components are orthogonal to each other; biological data, however, are not - Principal components are linear combinations of the original data - Prior knowledge is important - PCA is not clustering!
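As a minimal sketch of the PCA/SVD relationship named on these slides, the following Python computes principal components from the SVD of the centered data matrix. It assumes NumPy is available; the function name `pca_via_svd` and the random demo matrix are illustrative and not part of the lecture.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Project the rows of X (samples x features) onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # rows are orthonormal principal directions
    scores = Xc @ components.T                   # sample coordinates in component space
    explained = (s ** 2) / np.sum(s ** 2)        # fraction of variance per component
    return scores, components, explained[:n_components]

# Hypothetical demo data: 20 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
X_demo[:, 1] = 3.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1]
scores, components, explained = pca_via_svd(X_demo, 2)
```

Each row of `components` is a linear combination of the original features, and the rows are mutually orthogonal, which matches the two properties listed on the slide. Centering only (no scaling) is a choice made here; normalizing each feature first is the "numerical balance" consideration the slide mentions.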
  • Slide 10
  • Visualization of Microarray Data: Multidimensional scaling (MDS). - High-dimensional coordinates are unknown. - Distances between the points are known. - The distance may not be Euclidean, but the embedding maintains the distance in a Euclidean space. - Try different dimensions (from one to ???). - At each dimension, perform optimal embedding to minimize embedding error. - Plot embedding error (residue) vs. dimension. - Pick the knee point.
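The "try different dimensions and track the embedding error" procedure can be sketched as follows. This uses classical (Torgerson) MDS, which is one MDS variant; the slide does not say which one the lecture used. NumPy is assumed, and `classical_mds`, `embedding_error`, and the demo points are hypothetical names introduced for illustration.

```python
import numpy as np

def classical_mds(D, n_dims):
    """Embed points in n_dims dimensions from a matrix of pairwise distances D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]   # largest eigenvalues first
    lam = np.clip(eigvals[order], 0.0, None)     # drop negative (non-Euclidean) parts
    return eigvecs[:, order] * np.sqrt(lam)

def embedding_error(D, Y):
    """Relative mismatch between the given distances and the embedded distances."""
    D = np.asarray(D, dtype=float)
    diff = Y[:, None, :] - Y[None, :, :]
    D_hat = np.sqrt((diff ** 2).sum(axis=-1))
    return float(np.sqrt(((D - D_hat) ** 2).sum() / (D ** 2).sum()))

# Hypothetical demo: distances generated from points that truly live in 2-D.
rng = np.random.default_rng(0)
true_points = rng.normal(size=(6, 2))
D_demo = np.sqrt(((true_points[:, None, :] - true_points[None, :, :]) ** 2).sum(axis=-1))
errors = [embedding_error(D_demo, classical_mds(D_demo, d)) for d in (1, 2, 3)]
```

For this demo the error drops to essentially zero at two dimensions and stays there, so the knee of the error-vs-dimension plot is at 2.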
  • Slide 11
  • Visualization of Microarray Data Multidimensional scaling (MDS)
  • Slide 12
  • Distance Measure (Metric?) -What do you mean by similar? -Euclidean -Uncentered correlation -Pearson correlation
  • Slide 13
  • Distance Metric - Euclidean. 102123_at (Lip1): 1596.000, 2040.900, 1277.000, 4090.500, 1357.600, 1039.200, 1387.300, 3189.000, 1321.300, 2164.400, 868.600, 185.300, 266.400, 2527.800. 160552_at (Ap1s1): 4144.400, 3986.900, 3083.100, 6105.900, 3245.800, 4468.400, 7295.000, 5410.900, 3162.100, 4100.900, 4603.200, 6066.200, 5505.800, 5702.700. d_E(Lip1, Ap1s1) = 12883
  • Slide 14
  • Distance Metric - Pearson Correlation. 102123_at (Lip1): 1596.000, 2040.900, 1277.000, 4090.500, 1357.600, 1039.200, 1387.300, 3189.000, 1321.300, 2164.400, 868.600, 185.300, 266.400, 2527.800. 160552_at (Ap1s1): 4144.400, 3986.900, 3083.100, 6105.900, 3245.800, 4468.400, 7295.000, 5410.900, 3162.100, 4100.900, 4603.200, 6066.200, 5505.800, 5702.700. d_P(Lip1, Ap1s1) = 0.904
  • Slide 15
  • Distance Metric - Pearson Correlation [Figure panels: r = 1 and r = -1] Ranges from 1 to -1.
  • Slide 16
  • Distance Metric - Uncentered Correlation. 102123_at (Lip1): 1596.000, 2040.900, 1277.000, 4090.500, 1357.600, 1039.200, 1387.300, 3189.000, 1321.300, 2164.400, 868.600, 185.300, 266.400, 2527.800. 160552_at (Ap1s1): 4144.400, 3986.900, 3083.100, 6105.900, 3245.800, 4468.400, 7295.000, 5410.900, 3162.100, 4100.900, 4603.200, 6066.200, 5505.800, 5702.700. d_u(Lip1, Ap1s1) = 0.835 (about 33.4°)
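The three measures can be sketched in plain Python on the Lip1 and Ap1s1 profiles shown on the distance-metric slides. The 14 values per probe are read off the slide transcript; the function names are illustrative, and the formulas are the textbook definitions rather than anything the slides spell out.

```python
import math

# Expression profiles as they appear on the slides (14 arrays per probe).
lip1 = [1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
        3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8]
ap1s1 = [4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
         5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7]

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def uncentered_correlation(x, y):
    """Cosine of the angle between the two profiles (no mean subtraction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Uncentered correlation of the mean-subtracted profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_correlation([a - mean_x for a in x],
                                  [b - mean_y for b in y])

d_e = euclidean(lip1, ap1s1)   # rounds to 12883, the value on the Euclidean slide
u = uncentered_correlation(lip1, ap1s1)
r = pearson(lip1, ap1s1)
```

Writing Pearson as the uncentered correlation of mean-subtracted profiles makes the difference between the two explicit: Pearson discounts each gene's baseline expression, while the uncentered version treats the whole profile as signal.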
  • Slide 17
  • Distance Metric - Difference between Pearson correlation and uncentered correlation. 102123_at (Lip1): 1596.000, 2040.900, 1277.000, 4090.500, 1357.600, 1039.200, 1387.300, 3189.000, 1321.300, 2164.400, 868.600, 185.300, 266.400, 2527.800. 160552_at (Ap1s1): 4144.400, 3986.900, 3083.100, 6105.900, 3245.800, 4468.400, 7295.000, 5410.900, 3162.100, 4100.900, 4603.200, 6066.200, 5505.800, 5702.700. Pearson correlation: baseline expression possible. Uncentered correlation: all are considered signals.
  • Slide 18
  • Distance Metric -Difference between Euclidean and correlation
  • Slide 19
  • Distance Metric - PCC measures similarity; how can we transform it into a distance? - Use 1 - PCC. - Negative correlation may also mean close in a signaling pathway (use 1 - |PCC| or 1 - PCC^2).
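The three transformations on this slide differ only in how they treat negative correlation, which a short Python sketch makes concrete. The function names are illustrative.

```python
def distance_signed(pcc):
    """1 - PCC: anti-correlated profiles are the farthest apart (distance 2)."""
    return 1.0 - pcc

def distance_absolute(pcc):
    """1 - |PCC|: strongly anti-correlated profiles count as close."""
    return 1.0 - abs(pcc)

def distance_squared(pcc):
    """1 - PCC^2: like the absolute version, but flatter near |PCC| = 1."""
    return 1.0 - pcc * pcc

# A perfectly anti-correlated pair under each transformation.
anti = -1.0
distances_for_anti = (distance_signed(anti),
                      distance_absolute(anti),
                      distance_squared(anti))
```

With `anti = -1.0`, the signed version gives 2.0 while the other two give 0.0, so the choice decides whether a repressor and its target end up in the same cluster.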
  • Slide 20
  • Supervised Learning Perceptron neural networks
  • Slide 21
  • Supervised Learning Perceptron neural networks
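The slides only name the perceptron, so the following is a minimal sketch of the classic perceptron learning rule rather than the lecture's own formulation. Labels in {-1, +1}, the learning rate, and the toy AND-style data set are all assumptions made for illustration.

```python
def train_perceptron(samples, labels, epochs=100, learning_rate=1.0):
    """Learn weights w and bias b so that sign(w.x + b) matches labels in {-1, +1}."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:              # misclassified (or on the boundary)
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:                        # a full pass with no updates: done
            break
    return weights, bias

def predict(weights, bias, x):
    """Classify one sample with the learned linear boundary."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Hypothetical linearly separable data: logical AND of two binary inputs.
and_samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]
w, b = train_perceptron(and_samples, and_labels)
predictions = [predict(w, b, x) for x in and_samples]
```

The rule only converges when the classes are linearly separable; on non-separable data the loop simply stops after `epochs` passes.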
  • Slide 22
  • -Supervised Learning -Support vector machines (SVM) and Kernels -Only (binary) classifier, no data model
  • Slide 23
  • - Supervised Learning - Naïve Bayesian classifier - Bayes rule - Maximum a posteriori (MAP) [Equation labels: prior probability, conditional probability]
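The equation on this slide did not survive extraction. For orientation, the standard forms of Bayes rule, the MAP decision, and the naïve independence assumption are given below; the symbols (class C_k, feature vector x with d components) are notation chosen here, not necessarily the lecture's.

```latex
\[
P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{P(\mathbf{x})},
\qquad
\hat{C} = \arg\max_{k} \; P(\mathbf{x} \mid C_k)\, P(C_k),
\qquad
P(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} P(x_i \mid C_k).
\]
```

Here P(C_k) is the prior probability and P(x | C_k) the conditional probability; P(x) is the same for every class and can be dropped from the MAP comparison.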
  • Slide 24
  • - Dimensionality reduction: linear discriminant analysis (LDA) [Figure: scatter plot with labels A, B, and w; both axes ticked from 0.5 to 2.0; from S. Wu's website]
  • Slide 25
  • Linear Discriminant Analysis [Figure: scatter plot with labels A, B, and w; both axes ticked from 0.5 to 2.0; from S. Wu's website]
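A minimal sketch of a two-class Fisher discriminant direction w is given below. It assumes NumPy and uses the standard formula w proportional to S_w^{-1}(m_A - m_B); the two small point sets are made up for illustration and are not the data in the slide's figure.

```python
import numpy as np

def fisher_direction(X_a, X_b):
    """Unit vector w that best separates two classes in Fisher's sense."""
    X_a = np.asarray(X_a, dtype=float)
    X_b = np.asarray(X_b, dtype=float)
    mean_a, mean_b = X_a.mean(axis=0), X_b.mean(axis=0)
    # Within-class scatter: summed outer products of the centered samples.
    S_w = (X_a - mean_a).T @ (X_a - mean_a) + (X_b - mean_b).T @ (X_b - mean_b)
    w = np.linalg.solve(S_w, mean_a - mean_b)
    return w / np.linalg.norm(w)

# Hypothetical 2-D samples for two classes.
class_a = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 1.5]]
class_b = [[0.5, 0.6], [0.8, 0.4], [1.0, 1.0], [0.6, 0.9]]
w_lda = fisher_direction(class_a, class_b)
proj_a = np.asarray(class_a) @ w_lda   # 1-D coordinates of class A along w
proj_b = np.asarray(class_b) @ w_lda   # 1-D coordinates of class B along w
```

Projecting onto `w_lda` reduces each sample to one number. Unlike PCA, the direction is chosen using the class labels, which is why LDA appears under supervised learning.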
  • Slide 26
  • -Supervised Learning - Support vector machines (SVM) and Kernels -Kernel nonlinear mapping
  • Slide 27
  • How do we use microarray? - Profiling - Clustering: cluster to detect patient subgroups; cluster to detect gene clusters and regulatory networks
  • Slide 28
  • Slide 29
  • How do we process microarray data (clustering)? - Unsupervised Learning Hierarchical Clustering
  • Slide 30
  • How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Single linkage: The linking distance is the minimum distance between two clusters.
  • Slide 31
  • How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Complete linkage: The linking distance is the maximum distance between two clusters.
  • Slide 32
  • How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
  • Slide 33
  • How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering. Single linkage: prone to chaining and sensitive to noise. Complete linkage: tends to produce compact clusters. Average linkage: sensitive to distance metric.
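The three linkage rules differ only in how the distance between two clusters is reduced from the pairwise distances, as the following plain-Python sketch of agglomerative clustering shows. It is a naive implementation written for clarity, using Euclidean distance and made-up points; `agglomerate` is a hypothetical name, and `math.dist` needs Python 3.8 or later.

```python
import math

def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start with singletons

    def linkage_distance(a, b):
        pairwise = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairwise)                   # closest pair of members
        if linkage == "complete":
            return max(pairwise)                   # farthest pair of members
        if linkage == "average":
            return sum(pairwise) / len(pairwise)   # UPGMA: mean over all pairs
        raise ValueError("unknown linkage: " + linkage)

    merges = []                                    # (members, members, height)
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters, merges

# Hypothetical points: two tight pairs and one isolated point.
demo_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
results = {name: agglomerate(demo_points, 3, name)[0]
           for name in ("single", "complete", "average")}
first_merge = agglomerate(demo_points, 3, "single")[1][0]
```

The `merges` list records the height of each merge, which is the information a dendrogram draws. On well-separated data like this all three rules agree; they diverge on elongated or noisy clusters, where single linkage tends to chain.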
  • Slide 34
  • -Unsupervised Learning Hierarchical Clustering
  • Slide 35
  • Dendrograms. Distance: the height of each horizontal line represents the distance between the two groups it merges. Order: open-source R uses the convention that the tighter clusters are on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.
  • Slide 36
  • -Unsupervised Learning - K-means -Vector quantization -K-D trees -Need to try different K, sensitive to initialization
  • Slide 37
  • -Unsupervised Learning - K-means [cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep',20); [Annotations: 4 is K; 'corr' is the metric]
  • Slide 38
  • -Unsupervised Learning - K-means -The number of classes K needs to be specified -Does not always converge to the global optimum (it can stop at a local one) -Sensitive to initialization
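A minimal plain-Python sketch of k-means with random restarts illustrates why K and the initialization matter. It uses Euclidean distance, not the correlation distance of the MATLAB call on the earlier slide; the function signature, the seed, and the demo points are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, n_restarts=10, max_iter=100, seed=0):
    """Run Lloyd's algorithm from several random starts and keep the best result."""
    rng = random.Random(seed)

    def assign(centers):
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]

    best = None
    for _ in range(n_restarts):
        centers = rng.sample(points, k)            # initialize from k distinct data points
        for _ in range(max_iter):
            labels = assign(centers)
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:
                    new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
                else:
                    new_centers.append(centers[c])  # keep the center of an empty cluster
            if new_centers == centers:              # assignments have stabilized
                break
            centers = new_centers
        labels = assign(centers)
        cost = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
        if best is None or cost < best[0]:
            best = (cost, labels, centers)
    return best[1], best[2], best[0]

# Hypothetical data: two well-separated groups of three points.
demo = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, cost = kmeans(demo, 2)
```

Keeping the lowest-cost run over several restarts plays the same role as the `'rep', 20` argument in the MATLAB call: it reduces, but does not remove, the dependence on the starting centers.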
  • Slide 39
  • -Unsupervised Learning - K-means
  • Slide 40
  • -Unsupervised Learning -Self-organizing maps (SOM) -Neural-network-based method -Originally used as a method for visualizing (embedding) high-dimensional data -Also related to vector quantization -The idea is to map close data points to the same discrete level
  • Slide 41
  • -Issues -Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn't make sense) -Data structure is missing -Not robust to outliers and noise (D'haeseleer 2005, Nat. Biotechnol. 23(12):1499-501)
  • Slide 42
  • -Model-based clustering methods (Han) http://www.cs.umd.edu/~bhhan/research2.html Pan et al. Genome Biology 2002 3:research0009.1 doi:10.1186/gb-2002-3-2-research0009
  • Slide 43
  • -Structure-based clustering methods
  • Slide 44
  • Data mining is searching for knowledge in data. Also known as: knowledge mining from databases, knowledge extraction, data/pattern analysis, data dredging, Knowledge Discovery in Databases (KDD)
  • Slide 45
  • The process of discovery: interactive + iterative; scalable approaches
  • Slide 46
  • Popular Data Mining Techniques. Clustering: the most dominant technique in use for gene expression analysis in particular and bioinformatics in general; partitions data into groups by similarity. Classification: supervised version of clustering; the technique models class membership and can subsequently classify unseen data. Frequent pattern analysis: a method for identifying frequently recurring patterns (structural and transactional). Temporal/sequence analysis: model temporal data (wavelets, FFT, etc.). Statistical methods: regression, discriminant analysis.
  • Slide 47
  • Summary. A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. Other metrics include density, information entropy, statistical variance, and radius/diameter. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
  • Slide 48
  • Recommended Literature: 1. Bioinformatics: The Machine Learning Approach, by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001. 2. Data Mining: Concepts and Techniques, by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001. 3. Pattern Classification, by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001. 4. The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer-Verlag, 2001.