rule-based data mining methods for classification problems in biomedical domains jinyan li limsoon...

81
Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Upload: allison-rosalyn-stevens

Post on 03-Jan-2016

224 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Rule-Based Data Mining Methods for

Classification Problems in Biomedical Domains

Jinyan LiLimsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 2: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Rule-Based Data Mining Methods for

Classification Problems in Biomedical Domains

Part 1: Background and

Introduction

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 3: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Outline

• Practical Introduction to Biology• Brief Introduction to Bioinformatics• Overview of Gene Expression Profiling• Overview of Proteomic Profiling• A motivating Biomedical Application• Brief Introduction to Some Data Sets

Page 4: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Practical Introduction to

Biology

Page 5: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

History

• 1866 Mendel discovered genetics

• 1869 DNA discovered

• 1944 Avery & McCarty demonstrated DNA as carrier of genetic info

• 1953 Watson & Crick deduced 3D struct of DNA

• 1960 Elucidation of genetic code, mapping DNA to protein

• 1970 Development of DNA sequencing techniques: sequence segmentation and electrophoresis

• 1980 Development of PCR: exploiting natural replication, amplify DNA samples so that they are enough for doing expt

• 1990 Human Genome Project

• 2002 Human genome published

• Now Understanding the detail mechanism of the cell

Page 6: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Body

• Our body consists of a number of organs• Each organ composes of a number of

tissues• Each tissue composes of cells of the

same type

Page 7: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Cell

• Performs two types of function– Chemical reactions necessary to maintain our

life– Pass info for maintaining life to next

generation

• In particular– Protein performs chemical reactions– DNA stores & passes info– RNA is intermediate between DNA & proteins

Page 8: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Protein

• A sequence composed from an alphabet of 20 amino acids– Length is usually 20 to

5000 amino acids – Average around 350

amino acids

• Folds into 3D shape, forming the building blocks & performing most of the chemical reactions within a cell

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 9: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Amino Acid

• Each amino acid consist of– Amino group– Carboxyl group– R group

Carboxyl group

Aminogroup

C

(the central carbon)R group

NH2

H

C C

R

OH

O

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 10: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Classification of Amino Acids

• Amino acids can be classified into 4 types.

• Positively charged (basic)– Arginine (Arg, R)– Histidine (His, H)– Lysine (Lys, K)

• Negatively charged (acidic)– Aspartic acid (Asp, D)– Glutamic acid (Glu, E)

Page 11: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Classification of Amino Acids• Polar (overall

uncharged, but uneven charge distribution. can form hydrogen bonds with water. they are called hydrophilic)– Asparagine (Asn, N)– Cysteine (Cys, C)– Glutamine (Gln, Q)– Glycine (Gly, G)– Serine (Ser, S)– Threonine (Thr, T)– Tyrosine (Tyr, Y)

• Nonpolar (overall uncharged and uniform charge distribution. cant form hydrogen bonds with water. they are called hydrophobic)– Alanine (Ala, A)– Isoleucine (Ile, I)– Leucine (Leu, L)– Methionine (Met, M)– Phenylalanine (Phe, F)– Proline (Pro, P)– Tryptophan (Trp, W)– Valine (Val, V)

Page 12: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

N

H

C C

R’

OH

O

NH2

H

C C

R

O

H

Peptide bond

NH2

H

C C

R

OH

O

NH2

H

C C

R’

OH

O

+

Protein & Polypeptide Chain

• Formed by joining amino acids via peptide bond• One end the amino group, called N-terminus• The other end is the carboxyl group, called C-

terminus

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 13: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

DNA

• Stores instruction needed by the cell to perform daily life function

• Consists of two strands interwoven together and form a double helix

• Each strand is a chain of some small molecules called nucleotides

Francis Crick shows James Watson the model of DNA in their room number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 14: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Base(Adenine)

DeoxyribosePhosphate

5`

4`3`

2`

1`

Nucleotide

• Consists of three parts:– Deoxyribose– Phosphate (bound to the 5’ carbon)– Base (bound to the 1’ carbon)

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 15: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

A C G T U

Classification of Nucleotides

• 5 diff nucleotides: adenine(A), cytosine(C), guanine(G), thymine(T), & uracil(U)

• A, G are purines. They have a 2-ring structure• C, T, U are pyrimidines. They have a 1-ring

structure• DNA only uses A, C, G, & T

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 16: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

A

T

10Å

G

C

10Å

Watson-Crick rules

• Complementary bases:– A with T (two hydrogen-bonds)– C with G (three hydrogen-bonds)

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 17: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

P P P P

5’3’

A C G T A

Orientation of a DNA

• One strand of DNA is generated by chaining together nucleotides, forming a phosphate-sugar backbone

• It has direction: from 5’ to 3’, because DNA always extends from 3’ end:– Upstream, from 5’ to 3’– Downstream, from 3’ to 5’

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 18: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Double Stranded DNA

• DNA is double stranded in a cell. The two strands are anti-parallel. One strand is reverse complement of the other

• The double strands are interwoven to form a double helix

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 19: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Locations of DNAs in a Cell?

• Two types of organisms– Prokaryotes (single-celled organisms with no nuclei. e.g.,

bacteria)

– Eukaryotes (organisms with single or multiple cells. their cells have nuclei. e.g., plant & animal)

• In Prokaryotes, DNA swims within the cell• In Eukaryotes, DNA locates within the

nucleus

Page 20: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Chromosome

• DNA is usually tightly wound around histone proteins and forms a chromosome

• The total info stored in all chromosomes constitutes a genome

• In most multi-cell organisms, every cell contains the same complete set of chromosomes – May have some small different due to

mutation

• Human genome has 3G base pairs, organized in 23 pairs of chromosomes

Page 21: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Gene

• A gene is a sequence of DNA that encodes a protein or an RNA molecule

• About 30,000 – 35,000 (protein-coding) genes in human genome

• For gene that encodes protein– In Prokaryotic genome, one gene corresponds

to one protein– In Eukaryotic genome, one gene can

corresponds to more than one protein because of the process “alternative splicing”

Page 22: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Complexity of Organism vs. Genome Size

• Human Genome: 3G base pairs

• Amoeba dubia (a single cell organism): 600G base pairs

Genome size has no relationship with the complexity of the organism

Page 23: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Number of Genes vs. Genome Size

• Prokaryotic genome (e.g., E. coli)– Number of base pairs: 5M– Number of genes: 4k– Average length of a gene:

1000 bp

• Eukaryotic genome (e.g., human)– Number of base pairs: 3G– Estimated number of

genes: 30k – 35k– Estimated average length

of a gene: 1000-2000 bp

• ~ 90% of E. coli genome are of coding regions.

• < 3% of human genome is believed to be coding regions

Genome size has no relationship with the number of genes!

Page 24: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Base(Adenine)

Ribose SugarPhosphate

5`

4`3`

2`

1`

RNA

• RNA has both the properties of DNA & protein– Similar to DNA, it can

store & transfer info– Similar to protein, it

can form complex 3D structure & perform some functions

• Nucleotide for RNA has of three parts:– Ribose Sugar (has an

extra OH group at 2’)– Phosphate (bound to 5’

carbon)– Base (bound to 1’

carbon)

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 25: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

RNA vs DNA

• RNA is single stranded• Nucleotides of RNA are similar to that of

DNA, except that have an extra OH at position 2’– Due to this extra OH, it can form more

hydrogen bonds than DNA– So RNA can form complex 3D structure

• RNA use the base U instead of T– U is chemically similar to T– In particular, U is also complementary to A

Page 26: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Mutation

• Sudden change of genome

• Basis of evolution• Cause of cancer• Can occur in DNA,

RNA, & Protein

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 27: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Central Dogma

• Gene expression consists of two steps– Transcription

DNA mRNA

– Translation mRNA Protein

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 28: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Transcription

• Synthesize mRNA from one strand of DNA– An enzyme RNA

polymerase temporarily separates double-stranded DNA

– It begins transcription at transcription start site

– A A, CC, GG, & TU

– Once RNA polymerase reaches transcription stop site, transcription stops

• Additional “steps” for Eukaryotes– Transcription produces

pre-mRNA that contains both introns & exons

– 5’ cap & poly-A tail are added to pre-mRNA

– RNA splicing removes introns & mRNA is made

– mRNA are transported out of nucleus

Page 29: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Translation

• Synthesize protein from mRNA

• Each amino acid is encoded by consecutive seq of 3 nucleotides, called a codon

• The decoding table from codon to amino acid is called genetic code

• 43=64 diff codons Codons are not 1-to-1

corr to 20 amino acids• All organisms use the

same decoding table• Recall that amino acids

can be classified into 4 groups. A single-base change in a codon is usually not sufficient to cause a codon to code for an amino acid in different group

Page 30: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Genetic Code• Start codon: ATG (code for M)• Stop codon: TAA, TAG, TGA

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 31: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Gene StructureCoding region

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Bajic

Page 32: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Ribosome

• Translation is handled by a molecular complex, ribosome, which consists of both proteins & ribosomal RNA (rRNA)

• Ribosome reads mRNA & the translation starts at a start codon (the translation start site)

• With help of tRNA, each codon is translated to an amino acid

• Translation stops once ribosome reads a stop codon (the translation stop site)

Page 33: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

tRNA

• 61 diff tRNAs, each corresponds to a non-termination codon

• Each tRNA folds to form a cloverleaf-shaped structure– One side holds an

anticodon– The other side holds

the appropriate amino acid

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 34: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Introns and exons

• Eukaryotic genes contain introns & exons– Introns are seq that

are ultimately spliced out of mRNA

– Introns normally satisfy GT-AG rule, viz. begin w/ GT & end w/ AG

– Each gene can have many introns & each intron can have thousands bases

• Introns can be very long

• An extreme example is a gene associated with cystic fibrosis in human:– Length of 24 introns

~1Mb– Length of exons ~1kb

Page 35: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Basic Biotechnology Tools

• Cutting & breaking DNA– Restriction Enzymes– Shortgun method

• Copying DNA– Cloning– Polymerase Chain Reaction (PCR)

• Measuring length of DNA– Gel Electrophoresis

Page 36: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Restriction Enzymes

• Recognize certain point, called restriction site, in DNA w/ particular pattern & break it. This process is called digestion

• In nature, restriction enzymes are used to break foreign DNA to avoid infection

• Example– EcoRI is the 1st

restriction enzyme discovered that cuts DNA wherever the sequence GAATTC is found

– Similar to most of the other restriction enzymes, GAATTC is a palindrome

• > 300 known restriction enzymes have been discovered

Page 37: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Shotgun Method

• Break DNA molecule into small pieces randomly like this:– Solution w/ a large amount of purified DNA– Apply high vibration to break each molecule

randomly into small fragments

Page 38: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Cloning by Plasmid Vector• Insert DNA X into plasmid

vector w/ antibiotic-resistant gene & a recombinant DNA molecule is formed

• Insert recombinants into host cells

• Grow host cells in presence of antibiotic– Only cells w/ antibiotic-

resistant gene can grow– When we duplicate host

cell, X is also duplicated

• Select cells w/ antibiotic-resistance genes

• Kill them & extract XCopyright © 2004 by Jinyan Li and Limsoon Wong

Page 39: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Polymerase Chain Reaction (PCR)

• Allows rapid replication of a selected region of a DNA w/o need for a living cell

• Inputs for PCR:– Two oligonucleotides

are synthesized, each complementary to the two ends of the region. They are used as primers.

– Thermostable DNA polymerase TaqI

• Repeats a cycle w/ 3 phases 25-30 times. Each cycle takes ~5min– Phase 1: separate double

stranded DNA by heat– Phase 2: cool; add

synthesis primers– Phase 3: add DNA

polymerase TaqI to catalyze 5’ to 3’ DNA synthesis

Selected region has been amplified exponentially

Page 40: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Gel Electrophoresis

• Used to separate a mixture of DNA fragments of different lengths

• Apply an electrical field to mixture of DNA

• Note that DNA is negative charged. Small molecules travel faster than large molecules

• Mixture is separated into bands, each containing DNA molecules of same length

Page 41: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Sequencing by Gel Electrophoresis

• An application of gel electrophoresis is to reconstruct DNA sequence of length 500-800 within a few hours– Generate all sequences ending with A– Separate sequences ending with A into diff

bands using gel electrophoresis Such info tells us positions of A’s in the DNA– Similarly for C, G, & T

Page 42: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

……TCAACATGT

Sequencing by Gel Electrophoresis: Reading the

Sequence• Four groups of fragments: A, C, G, & T• Fragments are placed in negative end• Fragments move to positive end• From relative distances of fragments,

reconstruct the sequence

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 43: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Hybridization

• Among thousands of DNA fragments, biologists routinely need to find a DNA fragment that contains a particular DNA subsequence

• This can be done based on hybridization– Suppose we need to

find DNA fragments that contain ACCGAT

– Make probes inversely complement to ACCGAT

– Mix probes w/ DNA fragments

– Due to hybridization rule (A=T, CG), DNA fragments containing ACCGAT will hybridize with probes

Page 44: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Brief Introduction to Bioinformatics

Page 45: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Some Bioinformatics Problems

• Biological Data Searching• Gene/Promoter finding• Cis-regulatory DNA• Gene/Protein Network• Protein/RNA Structure Prediction• Evolutionary Tree reconstruction• Infer Protein Function• Disease Diagnosis• Disease Prognosis• Disease Treatment Optimization, ...

Page 46: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Biological Data Searching

• Biological Data is increasing rapidly

• Biologists need to locate required info

• Difficulties:– Too much– Too heterogeneous– Too distributed– Too many errors– Due to mutation, need

approximate search

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 47: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Cis-Regulatory DNAs

• Cis-regulatory DNAs control whether genes should express or not

• Cis-regulatory may locate in promoter region, intron, or exon

• Finding and understanding cis-regulatory DNAs is one of the key problem in coming years

Image credit: US DOE

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 48: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Gene Networks

• Inside a cell is a complex system

• Expression of one gene depends on expression of another gene

• Such interactions can be represented using gene network

• Understanding such networks helps identify association betw genes & diseases

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 49: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Protein/RNA structure prediction

• Structure of Protein/RNA is essential to its functionality

• Important to have some ways to predict the structure of a protein/RNA given its sequence

• This problem is important & it is always considered as a “grand challenge” problem in bioinformatics

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Kolatkar

Page 50: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

189, 217, 247, 261

189, 217

189, 217, 261

Evolutionary Tree Reconstruction

• Protein/RNA/DNA mutates

• Evolutionary Tree studies evolutionary relationship among set of protein/RNA/DNAs

• Figures out origin of species

150000years ago

100000years ago

50000years ago

present

African Asian Papuan European

Root

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Sykes

Page 51: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Gene Expression Profiling

Page 52: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

What’s a Microarray?

• Idea of hybridization leads to DNA array tech

• In the past, “one gene in one experiment” Hard to get whole picture• DNA array is a technology that contains

large number of “DNA molecules” spotted on glass slides, nylon membranes, or silicon wafers

Measure expression of thousands of genes simultaneously

Page 53: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Affymetrix

Affymetrix GeneChip™ Array

Page 54: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules

exposed linkers become deprotected and are available for nucleotide coupling

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Affymetrix

Making GeneChip™ Array

Page 55: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Affymetrix

Gene Expression Measurement by GeneChip™

Array

Page 56: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

A Sample GeneChip™ File

Page 57: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Applications of DNA arrays

• Sequencing by hybridization– Promising alternative

to sequencing by gel electrophoresis

– May be able to reconstruct longer DNA sequences in shorter time

SNP discovery

• Profiling of gene expression– DNA arrays allow us to

monitor activities within a cell

– Each spot contains complement of a gene

– Due to hybridization, we can measure concentration of diff mRNAs within cell

disease diagnosis disease prognosis target discovery

Page 58: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Proteomic Profiling

Page 59: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

What is Proteomics?

• It is proteins that are directly involved in both normal and disease-associated biochemical processes

• A more complete understanding of disease may be gained by looking directly at the proteins present within a diseased cell or tissue

• Achieved thru the proteome & proteomics

• Proteomics is scientific discipline that detects proteins associated with a disease by means of their altered levels of expression betw control & disease states.

• Proteome research permits discovery of new protein markers for diagnostic purposes & of novel molecular targets for drug discovery

Page 60: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Expt Procedures in Proteomics

• Separation– electrophoresis (1-D, 2-D)

– chromatography (SEC, ion exchange, reversed phase)

• Digestion– chemical (BrCN)

– enzymatic (trypsin, Lys-C, Asp-C)

– reduction (Di-Thio-Threitol, b-Mercapto-Ethanol)

– alkylation (IodoAcAcid, IodoAcAmide, Vynil Pyridine)

• Sample clean-up– chromatography (rev phase)

– solid phase extraction (Zip Tip)

• MS ANALYSIS– protein identification

(peptide mass fingerprinting)

– peptide structural information (post source decay)

Copyright © 2004 by Jinyan Li and Limsoon Wong

Page 61: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Ciphergen Protein Chip® System

Image credit: Ciphergen

Page 62: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Protein Chip® Processes

SELDI-TOF-MS

Image credit: Ciphergen

Page 63: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Schematic of Protein Chip® Reader

Image credit: Ciphergen

Page 64: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Protein Chip® Array Surfaces

Image credit: Ciphergen

Page 65: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

A sample proteomic profile

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Petricoin

Page 66: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Applications of Proteomics

Disease diagnosis

Disease prognosis

Target discovery

Page 67: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

A Motivating Application:Optimizing

Treatment of Childhood ALL

Image credit: FEER

Page 68: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Childhood ALL

• Major subtypes are: T-ALL, E2A-PBX, TEL-AML, MLL genome rearrangements, Hyperdiploid>50, BCR-ABL

• Diff subtypes respond differently to same Tx

• Over-intensive Tx – Development of

secondary cancers– Reduction of IQ

• Under-intensiveTx – Relapse

• The subtypes look similar

• Conventional diagnosis– Immunophenotyping– Cytogenetics– Molecular diagnostics

• Unavailable in most ASEAN countries

Page 69: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Affymetrix

Single-Test Platform ofMicroarray & Machine

Learning

Page 70: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Impact

Conventional Tx:• intermediate intensity to everyone 10% suffers relapse 50% suffers side effects costs US$150m/yr

Our optimized Tx:• high intensity to 10%• intermediate intensity to 40%• low intensity to 50%• costs US$100m/yr

Copyright © 2004 by Jinyan Li and Limsoon Wong

•High cure rate of 80%• Less relapse

• Less side effects• Save US$51.6m/yr

Page 71: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Some Sample Data at

http://research.i2r.a-star.edu.sg/rp

Page 72: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Kent Ridge Biomedical Data Set Repository

• http://research.i2r.a-star.edu.sg/rp• Store high-dimensional biomedical data sets:

– gene expression data– protein profiling data– genomic sequence data

• All data are for classification purposes:– disease diagnosis– subtype classification– relapse study– genomic feature prediction– ....

Page 73: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Convenient File Formats

• Original raw data are of formats diff from c’mon ones used in machine learning softwares

• All original raw data files have been converted into plain text data files under the same schema, where every row in the new file is a comma-punctuated string, like a vector, representing a labelled data sample

• Re-formatted data files have extension .data; & feature names are saved in a separate file w/ extension .names. Equiv .arff format also provided

• Such transformed data files can be directly fed to c’mon machine learning s/w packages such as C5.0, MLC++, & WEKA

Page 74: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Breast Cancer Outcome Prediction Gene Expression

Data• Van't Veer et al., Nature

415:530-536, 2002• Training set contains 78

patient samples– 34 patients develop dist-

ance metastases in 5 yrs– 44 patients remain

healthy from the disease after initial diagnosis for >5 yrs

• Testing set contains 12 relapse & 7 non-relapse samples

• No of genes is 24481

Copyright © 2004 by Jinyan Li and Limsoon WongImage credit: Veer

Page 75: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

CNS Embryonal Tumour Outcome Prediction Gene

Expression Data• Pomeroy et al., Nature

415:436-442, 2002• Survivors are patients

who are alive after treatment, while the failures are those who succumb to their disease

• 60 patient samples– 21 are survivors – 39 are failures

• There are 7129 genes in the dataset

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Pomeroy

Page 76: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Ovarian Cancer Diagnosis Proteomic Profiling Data

• Petricoin et al., Lancet 359:572-577, 2002

• 253 serum samples of proteomic spectra generated by mass spec– 91 controls (Normal) – 162 ovarian cancers

• Each sample contains 15154 M/Z identities

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Petricoin

Page 77: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Lung Cancer Diagnosis Gene Expression Data

• Gordon et al., Cancer Res 62:4963-4967, 2002

• Lung Malignant pleural mesothelioma (MPM) vs adenocarcinoma (ADCA)

• 149 testing samples– 15 MPM – 134 ADCA

• 32 training samples– 16 MPM – 16 ADCA

• Each sample described by 12533 genes

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Gordon

Page 78: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Translation Initiation Site Prediction Data

• Liu & Wong, JBCB 1:139-168, 2003

• Used to find Translation Initiation Site (TIS), where translation from mRNA to protein initiates.

• 3312 seq in raw data– 3312 true ATG – 10063 false ATG

• 927 features

Copyright © 2004 by Jinyan Li and Limsoon WongImage credit: Liu

Page 79: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Childhood Acute Lymphoblastic Leukemia

Gene Expression Data• Yeoh et al., Cancer

Cell 1:133-143, 2002• Classifying 6

subtypes of childhood acute lymphoblastic leukemia

• 215 training samples• 112 testing samples• Each sample has

expression value of 12558 genes

Copyright © 2004 by Jinyan Li and Limsoon Wong

Image credit: Yeoh

Page 80: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Any Question?

Page 81: Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong

Acknowledgements

• Many slides presented in this part of the tutorial are derived from a course jointly taught at NUS SOC with Ken Sung