
Lecture 1 – Sep 27, 2011
CSE 527 Computational Biology, Fall 2011

Instructor: Su-In Lee
TA: Christopher Miles

Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022

Welcome to CSE 527: Computational Biology

Who is the instructor?

Prof. Su-In Lee
  Assistant Professor
  A joint faculty member: Computer Science & Engineering, Genome Sciences
  Office hours: Wednesday 1:30-2:30

Research interests
  Developing machine learning techniques applied to
    Computational Biology (genetics, systems biology)
    Predictive Medicine, Translational Medicine

Teaching assistant
  Christopher Miles (CSE PhD student)
  Office: TBA
  Office hours: Monday 1:30-2:30
  Email: [email protected]

What is the Coolest Thing a Computational/Mathematical Scientist Can Do?

  Curing cancer.
  Understanding how the blueprint of life (DNA) determines important traits (e.g. diseases).
  Predicting your disease susceptibilities based on your biological information, including DNA sequence.
  Predicting sudden changes in the condition of patients in the ICU (intensive care unit).
  Determining the order of A, G, C and T in my 3-billion-long DNA sequence.
  :

CSE 527 will provide you with basic concepts and ML/statistical techniques that you can use to realize these goals.

More and More of Biology is Becoming an Information Science

Biological information (data)
  DNA sequence information
  RNA levels of 30K genes
  Protein levels of 30K genes
  DNA molecule’s 3D structure
  :

[Figure: Cell, the basic unit of life. Genes (~30,000 in human) on the DNA give rise to RNA and then protein. Other labels: RNA degradation, gene regulation, gene interaction map, gene expression.]

DNA:     AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC
RNA:     AUGUGGAUUGUU   AUGCGCGUC   AUGUUACGCACCUAC   AUGAUUGAU
Protein: MWIV   MRV   MLRTY   MID

A cell’s biological state can be described by millions of numbers!
What biological discoveries we can make highly depends on the computational method we use to analyze the data.
Machine learning techniques provide very effective tools.

Outline
  Course logistics
  A zero-knowledge based introduction to biology
  Potential project topics

Goals of this course

Introduction to Computational Biology
  Basic concepts and scientific questions
  Basic biology for computational scientists
  In-depth coverage of ML techniques
  Current active areas of research

Useful machine learning (ML) algorithms
  Probabilistic graphical models, clustering, classification
  Learning techniques (MLE, EM)

Topics in CSE 527

Part 1: Basic ML algorithms
  Introduction to probabilistic models
  Bayesian networks, Hidden Markov models
  Representation and learning

Part 2: Topics in computational biology and areas of active research
  Genetics, systems biology, predictive medicine, sequence analysis
  Finding genetic factors for complex biological traits
  Inferring biological networks from data
  Comparative genomics
  DNA/RNA sequence analysis

Course responsibilities

Class participation and attendance (10%)
  Good answers to the questions asked in class
  Initiating a productive discussion

Homework assignments (40%)
  Four problem sets
  Due at beginning of class
  Up to 3 late days (24-hr period) for the quarter
  Collaboration allowed
    Teams of 2 or 3 students
    Individual writeups

Final project (50%)
  A group of up to two students

Project overview (1/2)

Topic
  Choose from the list of project topics on the course website, or come up with your own.
  Open-ended

Project deliverables
  Project proposal (due 10/19)
  Midterm report (due 11/16)
  Final report (due 12/14)
  Final presentations or poster session (12/7)

Project overview (2/2)

Final report
  Short report (up to 10 pages)
  Conference-style presentation
  Successful project reports can be submitted to computational biology / ML conferences (ISMB, RECOMB, NIPS, ICML)
  Or journals (PLoS journals, Nature journals, PNAS, Genome Research and so on)

Reading material

Lecture notes
  Mostly based on recent papers & old seminar papers

Biological background
  The Cell: A Molecular Approach, by Cooper
  Genetics: From Genes to Genomes, by Hartwell and more
  Principles of Population Genetics, by Hartl & Clark

Computational background
  Probabilistic Graphical Models, by Profs. Daphne Koller & Nir Friedman
  Prof. Andrew Ng’s machine learning lecture notes (cs229.stanford.edu)

No textbook required for the course

Class resources

Course website – cs.washington.edu/527
  Lecture notes, assignments, project topics
  Deadlines of assignments and projects

Mailing list
  [email protected]

Outline
  Course logistics
  A zero-knowledge based introduction to biology
    Prepared by George Asimenos (PhD student, Stanford) for CS262 Computational Genomics by Prof. Serafim Batzoglou (Stanford).
  Potential project topics

Cells: Building Blocks of Life
Keywords: cell, nucleus, cytoplasm, mitochondrion

[Figure: a cell. © 1997-2005 Coriell Institute for Medical Research]

Eukaryotes: plants, animals, humans
  DNA resides in the nucleus
  Contain other compartments for other specialized functions

Prokaryotes: bacteria
  Do not contain compartments
  Little recognizable substructure

DNA: “Blueprints” for a cell
  Genetic information is encoded in long strings of double-stranded DNA
  Deoxyribonucleic acid comes in only four flavors: Adenine, Cytosine, Guanine, Thymine

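Nucleotide
Keywords: deoxyribose, nucleotide, base, A, C, G, T, 3’, 5’

[Figure: chemical structure of a nucleotide (deoxyribose sugar and phosphate group), with links to the next nucleotide, to the previous nucleotide, and to the base; the 3’ and 5’ positions are marked.]

Bases: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)

Let’s write “AGACC”!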

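“AGACC” (backbone)

[Figure: the backbone for five nucleotides]

“AGACC” (DNA)
Keywords: deoxyribonucleic acid (DNA)

[Figure: the single strand AGACC, with its 5’ and 3’ ends marked]

DNA is double stranded
Keywords: strand, reverse complement

[Figure: two paired strands, each with its 5’ and 3’ ends marked]

DNA is always written 5’ to 3’
AGACC or GGTCT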
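The "AGACC or GGTCT" example can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original slides; the function name `reverse_complement` and the `COMPLEMENT` table are illustrative choices.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand of `seq`, written 5' to 3'."""
    # Reverse first (the strands run in opposite directions), then pair each base.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.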

DNA Packaging
Keywords: histone, nucleosome, chromatin, chromosome, centromere, telomere

[Figure: DNA packaging. Labels: DNA, H1, H2A, H2B, H3, H4, ~146 bp, nucleosome, chromatin, centromere, telomere]

The Genome
  The genome is the full set of hereditary information for an organism
  Humans bundle two copies of the genome into 46 chromosomes in every cell = 2 x (1-22 + X/Y)

Building an organism

[Figure: cell and its DNA]
  Every cell has the same sequence of DNA
  Subsets of the DNA sequence determine the identity and function of different cells

From DNA To Organism

[Figure: DNA, a question mark, and the organism]
Proteins do most of the work in biology, and are encoded by subsequences of DNA, known as genes.

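RNA
Keywords: ribonucleotide, U

[Figure: chemical structure of a ribonucleotide, which carries an OH group where the deoxyribonucleotide has an H, with links to the next ribonucleotide, to the previous ribonucleotide, and to the base; the 3’ and 5’ positions are marked.]

Bases: Adenine (A), Cytosine (C), Guanine (G), Uracil (U)
T → U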

Genes & Proteins
Keywords: gene, transcription, translation, protein

Double-stranded DNA
  5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
  3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’

(transcription)

Single-stranded RNA
  AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA

(translation)

protein

Gene Transcription
Keywords: promoter

[Figure: double-stranded DNA with the promoter marked]
  5’ G A T T A C A . . . 3’
  3’ C T A A T G T . . . 5’

Gene Transcription
Keywords: transcription factor, binding site, RNA polymerase

  Transcription factors: a type of protein that binds to DNA and helps initiate gene transcription.
  Transcription factor binding sites: short sequences of DNA (6-20 bp) recognized and bound by TFs.
  RNA polymerase binds a complex of TFs in the promoter.

Gene Transcription
  The two strands are separated

Gene Transcription
  An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template

Gene Transcription
  pre-mRNA  5’ G A U U A C A . . . 3’
  DNA       5’ G A T T A C A . . . 3’
            3’ C T A A T G T . . . 5’
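The GATTACA example can be reproduced in code. This is a minimal Python sketch, not from the original slides: the function names are illustrative, and it models only base pairing, ignoring promoters, transcription factors, and RNA processing.

```python
# Base pairing used when RNA is built against a DNA template (A pairs with U in RNA).
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_template(template_3_to_5):
    """Build the RNA (5'->3') by pairing each base of the template, read 3'->5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5.upper())


def transcribe_from_coding(coding_5_to_3):
    """Shortcut: the RNA equals the 5'->3' (coding) strand with T replaced by U."""
    return coding_5_to_3.upper().replace("T", "U")


coding = "GATTACA"    # 5'->3' strand from the slide
template = "CTAATGT"  # 3'->5' strand from the slide
print(transcribe_from_template(template))  # GAUUACA
print(transcribe_from_coding(coding))      # GAUUACA
```

Both routes give the same pre-mRNA, which is why sequences are usually reported using the 5’→3’ strand.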

RNA Processing
Keywords: 5’ cap, polyadenylation, exon, intron, splicing, UTR, mRNA

[Figure: RNA processing. Labels: 5’ cap, poly(A) tail, intron, exon, mRNA, 5’ UTR, 3’ UTR]

Gene Structure

[Figure: layout of a gene from 5’ to 3’. Labels: promoter, 5’ UTR, exons, introns, 3’ UTR, coding, non-coding]

How many? (Human Genome)
  Genes: ~20,000
  Exons per gene: ~8 on average (max: 148)
  Nucleotides per exon: 170 on average (max: 12k)
  Nucleotides per intron: 5,500 on average (max: 500k)
  Nucleotides per gene: 45k on average (max: 2.2M)
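A rough back-of-the-envelope calculation with these averages shows how little of a typical gene is exonic. The Python sketch below is an illustration, not part of the original slides. It assumes a gene with n exons has n - 1 introns and treats the averages as if they described one "typical" gene, which is only approximate because averages do not compose exactly.

```python
# Averages quoted on the slide.
exons_per_gene = 8
nt_per_exon = 170
nt_per_intron = 5_500
nt_per_gene = 45_000

# A gene with n exons has n - 1 introns between them.
exonic_nt = exons_per_gene * nt_per_exon             # 8 * 170
intronic_nt = (exons_per_gene - 1) * nt_per_intron   # 7 * 5,500

print(exonic_nt)                                   # 1360
print(intronic_nt)                                 # 38500
print(round(100 * exonic_nt / nt_per_gene, 1))     # 3.0 (percent of the gene)
```

Under these assumptions only about 3% of an average gene's length is exon sequence; the rest is mostly intron.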

From RNA to Protein
  Proteins are long strings of amino acids joined by peptide bonds
  Translation from RNA sequence to amino acid sequence is performed by ribosomes
  20 amino acids: 3 RNA letters are required to specify a single amino acid (2 letters give only 4 x 4 = 16 combinations; 3 letters give 64)

Amino acid
Keywords: amino acid

[Figure: chemical structure of an amino acid with side chain R]

There are 20 standard amino acids
  Alanine, Arginine, Asparagine, Aspartate, Cysteine,
  Glutamate, Glutamine, Glycine, Histidine, Isoleucine,
  Leucine, Lysine, Methionine, Phenylalanine, Proline,
  Serine, Threonine, Tryptophan, Tyrosine, Valine

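Proteins
Keywords: N-terminus, C-terminus

[Figure: an amino acid within a protein chain, linked to the previous and to the next amino acid. The chain runs from the N-terminus (start) to the C-terminus (end), following the mRNA from 5’ to 3’.]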

Translation
Keywords: ribosome, codon

The ribosome (a complex of protein and RNA) synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.

[Figure: ribosome on the mRNA, with the P site and A site marked]

The genetic code
  Mapping from a codon to an amino acid

Translation

  5’ . . . A U U   A U G   G C C   U G G   A C U   U G A . . . 3’
           UTR     Met     Ala     Trp     Thr     Stop

  A U G is the start codon; U G A is the stop codon.
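The codon-to-amino-acid mapping and the Met-Ala-Trp-Thr example can be sketched in Python. This is an illustration, not part of the original slides. It uses the standard genetic code with one-letter amino acid symbols, and it assumes translation starts at the first AUG and stops at the first in-frame stop codon; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    mrna = mrna.upper().replace(" ", "")
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("A U U A U G G C C U G G A C U U G A"))  # MAWT
```

Here M, A, W, T are the one-letter codes for Met, Ala, Trp, Thr. The leading AUU is skipped because it lies in the UTR, before the start codon.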

Translation

[Figure: tRNA molecules carrying amino acids (Met, Ala, Trp) along the mRNA]

  5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’

Errors?
Keywords: mutation

  What if the transcription / translation machinery makes mistakes?
  What is the effect of mutations in coding regions?

Reading Frames
Keywords: reading frame

  G C U U G U U U A C G A A U U A G

[Figure: the same sequence repeated, with a different grouping of bases into codons marked each time]
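The idea that one sequence can be grouped into codons in three ways can be sketched as follows. This is a minimal Python illustration, not part of the original slides. It only considers the three forward frames of the given strand and drops incomplete codons at the end; the function name `reading_frames` is illustrative.

```python
def reading_frames(rna):
    """Return the complete codons for each of the three forward reading frames."""
    rna = rna.upper().replace(" ", "")
    return [
        [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        for offset in range(3)
    ]


for offset, codons in enumerate(reading_frames("G C U U G U U U A C G A A U U A G")):
    print(offset, " ".join(codons))
# 0 GCU UGU UUA CGA AUU
# 1 CUU GUU UAC GAA UUA
# 2 UUG UUU ACG AAU UAG
```

Each frame would be translated into a different amino acid sequence, which is why the position of the start codon matters.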

Synonymous Mutation
Keywords: synonymous (silent) mutation, fourfold site

  Original:  G C U   U G U   U U A   C G A   A U U   A G
             Ala     Cys     Leu     Arg     Ile
  Mutated:   G C U   U G U   U U G   C G A   A U U   A G     (A → G)
             Ala     Cys     Leu     Arg     Ile

Missense Mutation
Keywords: missense mutation

  Original:  G C U   U G U   U U A   C G A   A U U   A G
             Ala     Cys     Leu     Arg     Ile
  Mutated:   G C U   U G G   U U A   C G A   A U U   A G     (U → G)
             Ala     Trp     Leu     Arg     Ile

Nonsense Mutation
Keywords: nonsense mutation

  Original:  G C U   U G U   U U A   C G A   A U U   A G
             Ala     Cys     Leu     Arg     Ile
  Mutated:   G C U   U G A   U U A   C G A   A U U   A G     (U → A)
             Ala     STOP

Frameshift
Keywords: frameshift

  Original:  G C U   U G U   U U A   C G A   A U U   A G
             Ala     Cys     Leu     Arg     Ile
  Mutated:   G C U   U G U   U A C   G A A   U U A   G       (one U deleted)
             Ala     Cys     Tyr     Glu     Leu
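The three kinds of single-base substitution can be told apart by comparing the codon before and after the change. The Python sketch below is an illustration, not part of the original slides. It assumes the standard genetic code, reading frame 0, and a 0-based position that falls inside a complete codon; `classify_substitution` is an illustrative name.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(rna, position, new_base):
    """Classify replacing the base at 0-based `position` (frame 0) with `new_base`."""
    rna = rna.upper().replace(" ", "")
    codon_start = position - position % 3
    old_codon = rna[codon_start:codon_start + 3]
    offset = position % 3
    new_codon = old_codon[:offset] + new_base.upper() + old_codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    if new_aa == old_aa:
        return "synonymous"   # same amino acid, protein unchanged
    if new_aa == "*":
        return "nonsense"     # premature stop codon
    return "missense"         # a different amino acid


seq = "GCUUGUUUACGAAUUAG"
print(classify_substitution(seq, 8, "G"))  # synonymous: UUA -> UUG, Leu -> Leu
print(classify_substitution(seq, 5, "G"))  # missense:   UGU -> UGG, Cys -> Trp
print(classify_substitution(seq, 5, "A"))  # nonsense:   UGU -> UGA, Cys -> STOP
```

A frameshift is not a substitution and is not covered by this sketch: inserting or deleting a base changes every codon downstream of it.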

Transcription and translation

[Illustration from Radboud University Nijmegen]

Let’s see how this happens!
  Transcription: http://www.youtube.com/watch?v=DA2t5N72mgw
  Translation: http://www.youtube.com/watch?v=WkI_Vbwn14g&feature=related

Gene Expression Regulation
Keywords: regulation, signal transduction

When should each gene be expressed?
  Regulate gene expression

Examples:
  Make more of gene A when substance X is present
  Stop making gene B once you have enough
  Make genes C1, C2, C3 simultaneously

Why? Every cell has the same DNA but each cell expresses different proteins.

Signal transduction: one signal converted to another
  Cascade has “master regulators” turning on many proteins, which in turn each turn on many proteins, ...

Gene Regulation

Gene expression is controlled at many levels
  DNA chromatin structure
  Transcription
  Post-transcriptional modification
  RNA transport
  Translation
  mRNA degradation
  Post-translational modification

Transcription regulation

Much gene regulation occurs at the level of transcription.

Primary players:
  Binding sites (BS) in cis-regulatory modules (CRMs)
  Transcription factor (TF) proteins
  RNA polymerase II

Primary mechanism:
  TFs link to BSs
  Complex of TFs forms
  Complex assists or inhibits formation of the RNA polymerase II machinery

Transcription Factor Binding Sites
  Short, degenerate DNA sequences recognized by particular TFs
  For complex organisms, cooperative binding of multiple TFs is required to initiate transcription

[Figure: binding sequence logo]
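A sequence logo summarizes how degenerate each position of a binding site is. The Python sketch below is an illustration, not part of the original slides. The aligned sites are made up, and the score is the plain "2 bits minus Shannon entropy" used for the letter-stack height in a logo, with no small-sample correction or background model.

```python
import math
from collections import Counter

# Hypothetical aligned binding sites for one TF (invented for illustration).
sites = ["GATTACA", "GATTACA", "GACTACA", "GATTATA"]


def position_frequencies(sites):
    """Per-position base frequencies across equally long, aligned sites."""
    columns = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        columns.append({base: counts[base] / len(sites) for base in "ACGT"})
    return columns


def information_content(column):
    """Bits of information at one position: 2 minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


for i, column in enumerate(position_frequencies(sites)):
    print(i, round(information_content(column), 2))
```

Positions where all sites agree score the full 2 bits, while degenerate positions (here the third and sixth) score lower and would appear as shorter stacks in the logo.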

Summary
  All hereditary information is encoded in double-stranded DNA
  Each cell in an organism has the same DNA
  DNA → RNA → protein
  Proteins have many diverse roles in the cell
  Gene regulation diversifies protein products within different cells

Outline
  Course logistics
  A zero-knowledge based introduction to biology
  Potential project topics

Which Drug Should Patient X Be Treated With?
Example project topic #1 (1/3)

  Say that a cancer patient X undergoes chemotherapy.
  There are >200 drugs patient X can be treated with.
  How do doctors choose which drug to use in chemotherapy treatment?

[Figure: Patient X, described by a few histologic features (follicular lymphoma, diffuse large B cell lymphoma), and a list of chemotherapy drugs]

Chemotherapy drugs
  5-Iodotubercidin
  Acrichine
  ARQ-197
  Arsenic trioxide
  AS101
  AS-703026
  AT-7519
  Axitinib
  Azacitidine
  :

How can we improve this?

Which Drug Should Patient X Be Treated With?
Example project topic #1 (2/3)

[Figure: the same Patient X and drug list, now with much more data available per patient]
  DNA sequence (…ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…)
  RNA levels of genes
  Protein levels of genes
  Epigenetics (methylation)
  A few histologic features

Doctors cannot handle millions of numbers! How about computers?

Let’s Build a Prediction Model
Example project topic #1 (3/3)

This is a pure machine learning problem!

[Figure: a drug sensitivity test on ~100 patients at UWMC gives sensitivity to 160 drugs (Drug 2, Drug 3, ..., Drug i, ..., Drug 160). The RNA levels of 30,000 genes (g1, g2, ..., g30,000) in cancer cells are the features used to predict drug sensitivity for Patient X.]

In collaboration with Tony Blau, Pam Becker, Ray Monnat, David Hawkins (Medicine)

  30,000 features! (feature selection)
  Prior knowledge on drugs’ targets
  Publicly available RNA level data
  >3000 patients
  Transfer learning, feature reconstruction

Goal: realizing personalized cancer treatment
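To make "30,000 features, ~100 patients, feature selection" concrete, here is a small Python sketch on synthetic data. It is not the method used in the project. It assumes a simple filter that ranks genes by absolute Pearson correlation with one drug's sensitivity, and every number in it (50 genes, a single causal gene, the noise level) is invented for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_genes(expression, sensitivity):
    """Rank gene indices by |correlation| with drug sensitivity, strongest first."""
    scores = []
    for g in range(len(expression[0])):
        column = [patient[g] for patient in expression]
        scores.append((abs(pearson(column, sensitivity)), g))
    return [g for _, g in sorted(scores, reverse=True)]


# Synthetic data: rows are patients, columns are genes (all values are made up).
random.seed(0)
n_patients, n_genes, causal_gene = 100, 50, 3
expression = [[random.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_patients)]
# Sensitivity depends on one gene only, plus a little measurement noise.
sensitivity = [2.0 * p[causal_gene] + random.gauss(0, 0.1) for p in expression]

print(rank_genes(expression, sensitivity)[:5])  # the causal gene should come first
```

With 30,000 real genes and only about 100 patients, many irrelevant genes would show sizeable correlations by chance, which is one reason the slide points to prior knowledge and additional public data.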

How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (1/2)

DNA sequence and trait (obesity) for N individuals (N instances)
  Individual 1     …ACTCGGTAGACCTAAATTCGGCCCGG…
  Individual 2     …ACCCGGTAGACCTTTATTCGGCCCGG…
  Individual 3     …ACCCGGTAGACCTTAATTCGGCCGGG…
  :
  Individual N-1   …ACCCGGTAGTCCTATATTCGGCCCGG…
  Individual N     …ACTCGGTAGTCCTATATTCGGCCGGG…

Standard approach
  Find a simple rule! (e.g. at one position: A → thin, T → fat)
  Failed to detect the DNA affecting many important traits.

[Figure: sequence variations s1, s2, ..., sp (p ≈ 10^6!) influence obesity through the cell, a complex system, together with environmental factors; individual effects are too weak to be detected. Causality?]

One of the most important research problems in this area is to develop new computational methods that can represent more complicated interaction between sequence variation and trait.
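The "simple rule" approach amounts to testing one sequence position at a time. The Python sketch below is an illustration, not part of the original slides. It uses the five sequences shown on the slide, but the trait values are hypothetical, and the score is a plain difference in mean trait between the two alleles rather than a proper statistical test.

```python
from collections import defaultdict

# Sequences from the slide; trait values (e.g. a body-mass measure) are hypothetical.
sequences = [
    "ACTCGGTAGACCTAAATTCGGCCCGG",
    "ACCCGGTAGACCTTTATTCGGCCCGG",
    "ACCCGGTAGACCTTAATTCGGCCGGG",
    "ACCCGGTAGTCCTATATTCGGCCCGG",
    "ACTCGGTAGTCCTATATTCGGCCGGG",
]
trait = [21.0, 22.0, 20.0, 31.0, 33.0]


def scan_positions(sequences, trait):
    """Score each two-allele position by the difference in mean trait between alleles."""
    results = []
    for pos in range(len(sequences[0])):
        groups = defaultdict(list)
        for seq, value in zip(sequences, trait):
            groups[seq[pos]].append(value)
        if len(groups) != 2:
            continue  # skip positions that do not vary or have more than two alleles
        (allele_a, values_a), (allele_b, values_b) = sorted(groups.items())
        diff = abs(sum(values_a) / len(values_a) - sum(values_b) / len(values_b))
        results.append((diff, pos, allele_a, allele_b))
    return sorted(results, reverse=True)


best = scan_positions(sequences, trait)[0]
print(best[1] + 1, best[2], best[3])  # 10 A T
```

With these made-up trait values the scan recovers the A/T position from the slide. With p ≈ 10^6 positions and weak effects, the same one-at-a-time scan has little power, which motivates the more expressive models mentioned above.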

How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (2/2)

Longitudinal study
  ~2000 subjects
  Sequence information: sequence variations s1, s2, s3, s4, ..., sP, with p ≈ 10^6! (feature selection)
  Phenotype data, from Year 0 to Year 25: cholesterol, fatty acid, glucose, insulin
  Environmental factors: age, sex, smoking status

[Figure: DNA sequences of the subjects linked to phenotype data at Year 0 and at Year 25]

  Age-specific genetic influence
  Structural learning

In collaboration with Alex Reiner (Epidemiology)

More project topics at the course website!

Questions?