Post on 14-Feb-2017

CS598 Machine Learning in Computational Biology (Lecture 1: Introduction)

Professor Jian Peng
Teaching Assistant: Rongda Zhu

Introduction

Instructor:

• Jian Peng
  • My office location: 2118 SC
  • Office hour: Thursday, 3:15pm-4:45pm
  • Email: jianpeng@illinois.edu

• My own research: Computational Biology and Graphical Models

Teaching Assistant:

• Rongda Zhu, PhD student, Department of Computer Science (rzhu4@illinois.edu)

• Rongda’s research: Machine Learning and Probabilistic Inference

Course website: http://web.engr.illinois.edu/~jianpeng/teaching/CS598_Fall15/index.htm

Course Information

Schedule (tentative)

• Introductory lectures (Aug 25 to Sep 8)
  • Biology data analysis
  • Probabilistic models

• Student presentations (Sep 8 to Dec 3)
  • Research survey
  • Research article

• Course projects
  • Proposal presentation (Oct 6 & 8)
  • Final presentation (Dec 8 & 10)

Objectives

Introduction to computational biology

• Important problems in computational biology
• Machine learning techniques for data analysis
• Understand how methods work

Learning to do research

• Paper presentation
  • Ability to present key ideas to other people
  • Ability to ask critical questions

• Course project experience
  • Hands-on practice with real datasets
  • Propose and perform independent research
  • Active participation in the field

Prerequisites

Biology:

• Basic concepts in molecular biology
• Reference: Molecular Biology for Computer Scientists by Lawrence Hunter

Machine Learning:

• Probability and statistics
• Optimization
• Textbook: Pattern Recognition and Machine Learning by Christopher Bishop

Grading

• Class attendance: 10%

• Presentation: 30%

• Course Project: 60%
  • Proposal
  • Report
  • Presentation

Presentation

• Discuss papers you would like to present with me at least one week before your presentation

• Research survey (at least five papers)
  • Methodology: applications to different problems
  • Research problem: the state-of-the-art methods

• Research article (preferred)
  • Background: what is the problem? Why is it important?
  • Methodology: how does it work?
  • Results: what are the findings? Any conclusions?

• Open-ended Q & A and debate

Questions about the presentation?

Course Project

Computational techniques

• Novel machine learning algorithms
• Efficient algorithms that scale to large datasets
• New probabilistic models for biological data

Biological problems

• New biological findings
• Improvements over existing methods
• New computational biological problems

The goal is to have something publishable in a journal or presentable at a conference.

Course Project

• Proposal presentation (Oct 6 & 8)
  • Written proposal due by Oct 4
  • At least four pages
  • Discuss your project with me in September
  • 15-min presentation in class
  • I will also give you a list of potential projects if you don’t have one by Sep 20.

• Final presentation (Dec 8 & 10)
  • Report due by Dec 12
  • At least eight pages
  • 15-min oral presentation and poster

Course Project

• Team size
  • One or two
  • Make clear your contribution in the project report

• Implementation
  • Put your code/data on GitHub
  • Get your hands dirty and work on real-world datasets
  • Your contribution should be original

Questions about the course project?

Introduce yourself

Why is computational biology hard?

• High-dimensional

• Noisy

• Huge

• Sparse

Biological Data

Sequence data

• Protein/DNA sequence
  • Generative and discriminative models for sequences
  • Deep learning

Matrix data

• Gene expression
  • Dimensionality reduction and feature selection
  • Low-rank approximation
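As a rough illustration of low-rank approximation on matrix data, the sketch below truncates the SVD of a small matrix. The matrix is a synthetic stand-in for a genes-by-samples expression matrix (an assumption for illustration, not course data), and `low_rank_approx` is a hypothetical helper name.

```python
import numpy as np

def low_rank_approx(X, k):
    """Return the best rank-k approximation of X in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Synthetic "expression matrix": 6 genes x 4 samples, rank 2 plus small noise.
rng = np.random.default_rng(0)
gene_factors = rng.normal(size=(6, 2))
sample_factors = rng.normal(size=(2, 4))
X = gene_factors @ sample_factors + 0.01 * rng.normal(size=(6, 4))

X2 = low_rank_approx(X, k=2)
rel_err = np.linalg.norm(X - X2) / np.linalg.norm(X)
print(f"rank of approximation: {np.linalg.matrix_rank(X2)}")
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because the toy matrix is nearly rank 2 by construction, two components should recover almost all of it; real expression data would typically need a rank chosen by inspection of the singular values.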

Biological Data

Network data

• Molecular network
  • Random walk algorithms
  • Graphical models and approximate inference
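One common member of the random walk family is random walk with restart, sketched below on a toy graph. This is a minimal sketch under stated assumptions: the five-node path graph, the restart probability of 0.5, and the function name are all illustrative choices, not taken from the lecture.

```python
import numpy as np

def random_walk_with_restart(A, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * e_seed until convergence.

    A is a symmetric adjacency matrix with no isolated nodes;
    W is its column-normalized transition matrix.
    """
    A = np.asarray(A, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)  # each column sums to 1
    e = np.zeros(A.shape[0])
    e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy "molecular network": a path graph 0 - 1 - 2 - 3 - 4.
A = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
p = random_walk_with_restart(A, seed=0)
print(p)  # visiting probabilities, highest at the seed node
```

The resulting vector can be read as a proximity score of every node to the seed, which is why this kind of walk is often used to rank candidate genes near a known disease gene.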

Heterogeneous data

• Dimensionality reduction
• Probabilistic models for data integration
• Network-based data integration

Machine Learning

Supervised learning

• Prediction:
  • Classification: SVM, logistic regression, random forest
  • Structured output: CRF, structured SVM

• Feature finding:
  • Sparse learning: LASSO and elastic nets
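To make the link between sparse learning and feature finding concrete, here is a minimal LASSO sketch solved by proximal gradient descent (ISTA). The solver choice, the synthetic data, and the regularization strength `lam=0.1` are assumptions for illustration; in practice a library solver would normally be used.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    # Step size = 1 / Lipschitz constant of the gradient of the smooth term.
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic data: 100 samples, 20 features, only features 2 and 7 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[[2, 7]] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(w_hat) > 1e-6)
print("selected features:", selected)
```

The L1 penalty drives most coefficients exactly to zero, so the surviving indices act as the selected features; the nonzero estimates are shrunk slightly toward zero relative to the true weights.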

Unsupervised learning

• Dimensionality reduction and embedding:
  • Manifold learning: Isomap, LLE, t-SNE
  • Component analysis: PCA, ICA

• Probabilistic modeling:
  • Graphical models: HMM, Bayesian networks, RBM
  • Methodology: variational inference, sampling
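As one small example of inference in a graphical model, the sketch below computes the likelihood of a DNA string under a hidden Markov model with the forward algorithm. The two-state model and all of its probabilities are made up for illustration and are not from the lecture.

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Return P(obs) under an HMM using the forward algorithm.

    pi[i]   : initial probability of hidden state i
    A[i, j] : transition probability from state i to state j
    B[i, k] : probability that state i emits symbol k
    """
    alpha = pi * B[:, obs[0]]          # joint prob. of first symbol and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, then emit symbol o
    return alpha.sum()

# Hypothetical two-state model over DNA (A, C, G, T encoded as 0..3).
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.40, 0.10, 0.10, 0.40],   # "AT-rich" state
              [0.10, 0.40, 0.40, 0.10]])  # "GC-rich" state
encode = {"A": 0, "C": 1, "G": 2, "T": 3}

obs = [encode[c] for c in "GCGCAT"]
likelihood = hmm_forward(pi, A, B, obs)
print(f"P(GCGCAT) = {likelihood:.3e}")
```

The recursion sums over all hidden state paths in time linear in the sequence length, rather than enumerating them; for long sequences one would work in log space or rescale `alpha` to avoid underflow.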

TODO after this class

Please read “Molecular Biology for Computer Scientists” by Lawrence Hunter

Examples of my research projects

Protein sequence, structure and function

ACDEEEFGHIKL----MPQRSTVWY
ACDE--FGHIKLRMQP----STVWY

[Figure labels: sequence, structure, function]

Network analysis for disease modeling

[Figure labels: human disease network; network analysis; new disease biology (potential drug targets)]

Pharmacogenomics and cancer genomics

Figure from the DREAM challenge website

Integration of heterogeneous data

“Search” engine for drug discovery

[Figure: heterogeneous network with node types Drug, Protein, Disease, Side effect, Pathway, Cell type, and Mutation; edge labels include perturbation, association, membership, on/off, and interaction]

Diffusion Component Analysis

Network embedding

Variational inference

• Discriminance sampling for partition function estimation

• Combining variational inference and sampling approaches

[Figure labels: Restricted Boltzmann Machine, Deep Boltzmann Machine; sampling, classification, approximate inference]
