bioinformatics life sciences_2012

Introduction to bioinformatics and computational biology

Lab for Bioinformatics and computational genomics

10 “genome hackers”, mostly engineers (statistics)

42 scientists, technicians, geneticists, clinicians

>100 people: hardware engineers, mathematicians, molecular biologists

What is Bioinformatics ?

• Application of information technology to the storage, management and analysis of biological information (facilitated by the use of computers)
– Sequence analysis?
– Molecular modeling (HTX)?
– Phylogeny/evolution?
– Ecology and population studies?
– Medical informatics?
– Image analysis?
– Statistics? AI?
– Heavy-current or light-current engineering?

• Medicine (Pharma)
– Genome analysis allows the targeting of genetic diseases
– The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated
– Knowledge of protein structure facilitates drug design
– Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up

• The same techniques can be applied to crop (Agro) and livestock improvement (Animal Health)

Promises of genomics and bioinformatics

Bioinformatics, a life science discipline … management of expectations

[Diagram: Bioinformatics (Discovery Informatics, Computational Genomics) positioned between Informatics, Theoretical Biology, Computational Biology, (Molecular) Biology and Computer Science. Activities shown around it: Interface Design; AI, Image Analysis, structure prediction (HTX); Sequence Analysis; Expert Annotation; NP; Datamining.]

Time (years)

• Timeline: Margaret Dayhoff …

Happy Birthday …

PCR + dye termination

Suddenly, a flash of insight caused him to pull the car off the road and stop. He awakened his friend dozing in the passenger seat and excitedly explained to her that he had hit upon a solution - not to his original problem, but to one of even greater significance. Kary Mullis had just conceived of a simple method for producing virtually unlimited copies of a specific DNA sequence in a test tube - the polymerase chain reaction (PCR)

Nature: The Human Genome

Setting the stage …

Biological Research

Adapted from John McPherson, OICR

And this is just the beginning ….

Next Generation Sequencing is here

One additional insight ...

Read Length is Not As Important For Resequencing

[Figure: E. coli; x-axis: length of k-mer reads (bp), from 8 to 20. Credit: Jay Shendure]

ABI SOLID

Paired End Reads are Important!

Repetitive DNA: a single read maps to multiple positions

Unique DNA: a paired read maps uniquely (Read 1 and Read 2 are a known distance apart)

Single Molecule Sequencing

Helicos Biosciences Corp.

Microscope slide

Single DNA molecule

dNTP-Cy3

primer

Super-cooled TIRF microscope

Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh

Complete genomics

Next next generation sequencing

Third generation sequencing

Now sequencing

Pacific Biosciences: A Third Generation Sequencing Technology

Eid et al 2008

Nanopore Sequencing

Ultra-low-cost SINGLE molecule sequencing

Genome Size

DOGS: Database Of Genome Sizes

E. coli = 4.2 x 10^6
Yeast = 18 x 10^6
Arabidopsis = 80 x 10^6
C. elegans = 100 x 10^6
Drosophila = 180 x 10^6
Human/Rat/Mouse = 3000 x 10^6
Lily = 300 000 x 10^6

With ... : 99.9%
To primates: 99%

Anno 2012

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

Homology: Similarity attributed to descent from a common ancestor.

Definitions

RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K ++ + + GTW++MA+ L + A V T + +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
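The identity definition can be made concrete by counting identical columns in the RBP versus glycodelin alignment. The sketch below is illustrative only: the function name `percent_identity` and the choice to divide by the full alignment length (gap columns included) are assumptions, since the slide does not say which denominator it uses.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identical columns, alignment length, percent identity)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as identical when both rows carry the same residue;
    # gap characters never count as a match.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    length = len(aligned_a)
    return identical, length, 100.0 * identical / length


# Aligned rows copied from the slide (residues 26-84 of RBP, 23-81 of glycodelin).
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD-"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN"

identical, length, pct = percent_identity(rbp, glycodelin)
print(f"{identical}/{length} identical columns ({pct:.1f}%)")
```

Identity alone says nothing about homology: whether two sequences share a common ancestor is an inference drawn from such scores, not something the count itself establishes.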

Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

Paralogous: Homologous sequences within a single species that arose by gene duplication.

Definitions

speciation

duplication

• Simple identity, which scores only identical amino acids as a match.

• Genetic code changes, which scores the minimum number of nucleotide changes to change a codon for one amino acid into a codon for the other.

• Chemical similarity of amino acid side chains, which scores as a match two amino acids which have a similar side chain, such as hydrophobic, charged and polar amino acid groups.

• The Dayhoff percent accepted mutation (PAM) family of matrices, which scores amino acid pairs on the basis of the expected frequency of substitution of one amino acid for the other during protein evolution.

• The blocks substitution matrix (BLOSUM) amino acid substitution tables, which score amino acid pairs based on the frequency of amino acid substitutions in aligned sequence motifs called blocks, which are found in protein families.

Overview

BLOSUM (BLOck – SUM) scoring

Block = ungapped alignment, e.g. amino acids D, N, V, A

     a b c d e f
1    D D N A A V
2    D N A V D D
3    N N V A V V

S = 3 sequences, W = 6 aa, N = (W*S*(S-1))/2 = 18 pairs

A. Observed pairs

Relative frequency table (rows and columns D, N, A, V; surviving entries: .056, .056, .056)

Probability of obtaining a pair if randomly choosing pairs from block

B. Expected pairs

Residue frequencies in the block: D = 5/18, N = 4/18, A = 4/18, V = 5/18

P{Draw DN pair} = P{Draw D, then N, or draw N, then D}
P{Draw DN pair} = PD PN + PN PD = 2 * (5/18) * (4/18) = .123

eij: random relative frequency table (rows and columns D, N, A, V; surviving entries: .049, .123, .099)

Probability of obtaining a pair of each amino acid drawn independently from block

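C. Summary (A/B)

sij = log2 (gij / eij), where gij is the observed relative frequency (A) and eij the expected relative frequency (B)

(sij) is the basic BLOSUM score matrix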
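Steps A to C can be sketched in a few lines of Python. This is a minimal illustration, not the Henikoff implementation: it assumes the 18 residues on the slide form three sequences of width 6 read row by row, it uses no sequence clustering or weighting, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Assumption: the slide's block read row by row as 3 sequences of width 6.
block = ["DDNAAV", "DNAVDD", "NNVAVV"]


def pair_key(x, y):
    """Unordered residue pair, so that (N, D) and (D, N) share one entry."""
    return tuple(sorted((x, y)))


def observed_frequencies(block):
    """Step A: relative frequency of each residue pair within the columns."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[pair_key(x, y)] += 1
    total = sum(counts.values())  # W * S * (S - 1) / 2 pairs
    return {pair: n / total for pair, n in counts.items()}


def expected_frequencies(block):
    """Step B: pair frequency if residues were drawn independently."""
    residues = Counter("".join(block))
    total = sum(residues.values())
    p = {aa: n / total for aa, n in residues.items()}
    expected = {}
    for x in p:
        for y in p:
            if x <= y:
                expected[(x, y)] = p[x] * p[y] if x == y else 2 * p[x] * p[y]
    return expected


def log_odds(block):
    """Step C: s_ij = log2(observed / expected) for every observed pair."""
    obs = observed_frequencies(block)
    exp = expected_frequencies(block)
    # Pairs that never occur in the block are skipped (log of zero).
    return {pair: log2(obs[pair] / exp[pair]) for pair in obs}


observed = observed_frequencies(block)
expected = expected_frequencies(block)
scores = log_odds(block)
print(round(expected[("D", "N")], 3))  # the slide's DN example
```

The expected frequencies depend only on the residue counts (5 D, 4 N, 4 A, 5 V), so they match the slide whatever the row layout; the observed frequencies and the resulting scores do depend on the assumed layout.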

Notes:
• Observed pairs in blocks contain information about relationships at all levels of evolutionary distance simultaneously (cf. Dayhoff’s close relationships)
• The actual algorithm generates observed + expected pair distributions by accumulation over a set of approx. 2000 ungapped blocks of varying width (w) + depth (s)

• blosum30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90
• Transition frequencies observed directly by identifying blocks that are at least
– 45% identical (BLOSUM 45)
– 50% identical (BLOSUM 50)
– 62% identical (BLOSUM 62) etc.
• No extrapolation made
• High blosum: closely related sequences
• Low blosum: distant sequences
• blosum45 ≈ pam250
• blosum62 ≈ pam160
• blosum62 is the most popular matrix

The BLOSUM Series

Overview

• Church of the Flying Spaghetti Monster

• http://www.venganza.org/about/open-letter

– Henikoff and Henikoff have compared the BLOSUM matrices to PAM by evaluating how effectively the matrices can detect known members of a protein family from a database when searching with the ungapped local alignment program BLAST. They conclude that overall the BLOSUM 62 matrix is the most effective.

• However, all the substitution matrices investigated perform better than BLOSUM 62 for a proportion of the families. This suggests that no single matrix is the complete answer for all sequence comparisons.

• It is probably best to complement the BLOSUM 62 matrix with comparisons using PAM 250 and the Overington structurally derived matrices.

– It seems likely that as more protein three dimensional structures are determined, substitution tables derived from structure comparison will give the most reliable data.

Overview

Rat versus mouse RBP

Rat versus bacterial lipocalin

• Exhaustive …
– All combinations
• Algorithm
– Dynamic programming (much faster)
– Needleman-Wunsch for global alignments (Journal of Molecular Biology, 1970)
– Later adapted by Smith and Waterman for local alignment
• Heuristics

Alignments

A metric …

GACGGATTAG, GATCGGAATAG

GA-CGGATTAG
GATCGGAATAG

+1 (a match), -1 (a mismatch),-2 (gap)

9*1 + 1*(-1)+1*(-2) = 6
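The metric can be checked by scoring the gapped alignment column by column. A minimal sketch; the function name `score_alignment` is hypothetical, and the default values are the ones quoted on the slide (+1 match, -1 mismatch, -2 gap).

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two already-aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


score = score_alignment("GA-CGGATTAG", "GATCGGAATAG")
print(score)  # 9 matches, 1 mismatch, 1 gap -> 6
```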

Needleman-Wunsch-edu.pl

The Score Matrix
----------------

Seq1(j)       1   2   3   4   5   6   7   8   9  10
Seq2      *   C   K   H   V   F   C   R   V   C   I
(i)  *    0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
1    C   -1   1   0  -1  -2  -3  -4  -5  -6  -7  -8
2    K   -2   0   2   1   0  -1  -2  -3  -4  -5  -6
3    K   -3  -1   1   1   0  -1  -2  -3  -4  -5  -6
4    C   -4  -2   0   0   0  -1   0  -1  -2  -3  -4
5    F   -5  -3  -1  -1  -1   1   0  -1  -2  -3  -4
6    C   -6  -4  -2  -2  -2   0   2   1   0  -1  -2
7    K   -7  -5  -3  -3  -3  -1   1   1   0  -1  -2
8    C   -8  -6  -4  -4  -4  -2   0   0   0   1   0
9    V   -9  -7  -5  -5  -3  -3  -1  -1   1   0   0

A: diag_score = matrix(i-1,j-1) + MATCH if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise matrix(i-1,j-1) + MISMATCH

B: up_score = matrix(i-1,j) + GAP

C: left_score = matrix(i,j-1) + GAP

matrix(i,j) = max(A, B, C)

Seq1: CKHVFCRVCI
Seq2: CKKCFC-KCV
      ++--++--+-   score = 0
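The recurrences A, B and C translate directly into a small dynamic programming routine. The sketch below is written in Python rather than the Perl of Needleman-Wunsch-edu.pl and is not a port of that script; it assumes +1 match, -1 mismatch and -1 gap, the values implied by the first row and column of the score matrix. When several alignments tie, the traceback may return a different one from the slide, but the score is the same.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned seq1, aligned seq2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1  # i indexes seq2, j indexes seq1
    matrix = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        matrix[i][0] = i * gap

    def sub(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    # Fill: each cell is the best of diagonal (A), up (B) and left (C).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = matrix[i - 1][j - 1] + sub(i, j)
            up = matrix[i - 1][j] + gap
            left = matrix[i][j - 1] + gap
            matrix[i][j] = max(diag, up, left)

    # Traceback from the bottom-right cell to the origin.
    out1, out2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and matrix[i][j] == matrix[i - 1][j - 1] + sub(i, j):
            out1.append(seq1[j - 1])
            out2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            out1.append("-")
            out2.append(seq2[i - 1])
            i -= 1
        else:
            out1.append(seq1[j - 1])
            out2.append("-")
            j -= 1
    return matrix[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


score, aligned1, aligned2 = needleman_wunsch("CKHVFCRVCI", "CKKCFCKCV")
print(score)
print(aligned1)
print(aligned2)
```

Calling `needleman_wunsch("GACGGATTAG", "GATCGGAATAG", gap=-2)` reuses the same routine with the gap penalty of the earlier metric example.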

• Practicum: use similarity function in initialization step -> scoring tables

• Time Complexity

• Use random proteins to generate histogram of scores from aligned random sequences

Time complexity with needleman-wunsch.pl

Sequence Length (aa)    Execution Time (s)
10                      0
25                      0
50                      0
100                     1
500                     5
1000                    19
2500                    559
5000                    Memory could not be written
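The steep growth in the table follows from the size of the score matrix: aligning sequences of lengths n and m fills (n+1)(m+1) cells, so both time and memory grow quadratically. A small sketch that counts the cells for the lengths in the table; the function name `dp_cells` is hypothetical and the counts say nothing about the per-cell cost of the Perl script.

```python
def dp_cells(len1, len2):
    """Number of cells in a full Needleman-Wunsch score matrix."""
    return (len1 + 1) * (len2 + 1)


lengths = [10, 25, 50, 100, 500, 1000, 2500, 5000]
for n in lengths:
    # Two sequences of equal length n, as in the timing experiment.
    print(f"{n:>5} aa  {dp_cells(n, n):>10} cells")
```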

Average around -64!

-72 **
-70 *******
-68 ***************
-66 *************************
-64 ************************************************************
-60 ***********************
-58 ***************
-56 ********
-54 ****
-52 *

(Bins from -80 to -74 and from -50 to -38 are empty.)

If the sequences are similar, the path of the best alignment should be very close to the main diagonal.

Therefore, we may not need to fill the entire matrix, rather, we fill a narrow band of entries around the main diagonal.

An algorithm that fills in a band of width 2k+1 around the main diagonal.
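The band idea can be sketched as a variant of the global alignment fill that only visits cells with |i - j| <= k. This is an illustrative Python sketch, not code from the course: the function name `banded_nw_score` is hypothetical, the scoring values (+1, -1, -1) are assumed, and it returns only the score. The result equals the full-matrix score only when an optimal path stays inside the band.

```python
def banded_nw_score(seq1, seq2, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only a band of width 2k+1."""
    n, m = len(seq1), len(seq2)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")  # stands in for cells outside the band

    # Row 0 of the matrix, restricted to the band.
    prev = {j: j * gap for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            diag = prev.get(j - 1, neg_inf) + s
            up = prev.get(j, neg_inf) + gap
            left = cur.get(j - 1, neg_inf) + gap
            cur[j] = max(diag, up, left)
        prev = cur  # keep only the previous row in memory
    return prev[m]


print(banded_nw_score("CKHVFCRVCI", "CKKCFCKCV", k=1))
print(banded_nw_score("CKHVFCRVCI", "CKKCFCKCV", k=10))  # band covers the whole matrix
```

With two sequences of length n, the band holds roughly (2k+1) * n cells instead of n * n, which is the saving the slide refers to.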

Multiple Alignment Method

Phylogenetic methods may be used to solve crimes, test purity of products, and determine whether endangered species have been smuggled or mislabeled:
– Vogel, G. 1998. HIV strain analysis debuts in murder trial. Science 282(5390):851-853.
– Lau, D. T.-W., et al. 2001. Authentication of medicinal Dendrobium species by the internal transcribed spacer of ribosomal DNA. Planta Med 67:456-460.

Examples

– Epidemiologists use phylogenetic methods to understand the development of pandemics, patterns of disease transmission, and development of antimicrobial resistance or pathogenicity:
• Basler, C.F., et al. 2001. Sequence of the 1918 pandemic influenza virus nonstructural gene (NS) segment and characterization of recombinant viruses bearing the 1918 NS genes. PNAS 98(5):2746-2751.
• Ou, C.-Y., et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science 256(5060):1165-1171.
• Bacillus anthracis:

Examples

Tree Of Life

Modeling

Ramachandran / Phi-Psi Plot

Protein Architecture

• Finding a structural homologue
• Blast
– versus PDB database or PSI-blast (E<0.005)
– Domain coverage at least 60%
• Avoid gaps
– Choose few gaps and reasonable similarity scores over lots of gaps and high similarity scores

Modeling

Bootstrapping - an example

Ciliate SSUrDNA - parsimony bootstrap

Majority-rule consensus

Ochromonas (1)

Symbiodinium (2)

Prorocentrum (3)

Euplotes (8)

Tetrahymena (9)

Loxodes (4)

Tracheloraphis (5)

Spirostomum (6)

Gruberia (7)

Overview

Personalized Medicine,

Biomarkers …

… Molecular Profiling

First Generation Molecular Profiling

Next Generation Molecular Profiling

Next Generation Epigenetic Profiling

Concluding Remarks

Overview

Biomarkers …

Concluding Remarks

Personalized Medicine

• The use of diagnostic tests (aka biomarkers) to identify in advance which patients are likely to respond well to a therapy

• The benefits of this approach are to
– avoid adverse drug reactions
– improve efficacy
– adjust the dose to suit the patient
– differentiate a product in a competitive market
– meet future legal or regulatory requirements

• Potential uses of biomarkers
– Risk assessment
– Initial/early detection
– Prognosis
– Prediction/therapy selection
– Response assessment
– Monitoring for recurrence

Biomarker

First used in 1971 … An objective and « predictive » measure … at the molecular level … of normal and pathogenic processes and responses to therapeutic interventions

Characteristic that is objectively measured and evaluated as an indicator of normal biologic or pathogenic processes or pharmacologic response to a drug

A biomarker is valid if:
– It can be measured in a test system with well-established performance characteristics
– Evidence for its clinical significance has been established

Rationale 1: Why now? The regulatory path is becoming clearer

There is more at stake than efficient drug development. FDA « critical path initiative » Pharmacogenomics guideline

Biomarkers are the foundation of « evidence based medicine » - who should be treated, how and with what.

Without biomarkers, advances in targeted therapy will be limited and treatment will remain largely empirical. It is imperative that biomarker development be accelerated along with therapeutics.

Why now ?

First-generation and maturing second-generation molecular profiling methodologies make it possible to stratify clinical trial participants on a pharmacogenomic basis: include those most likely to benefit from the drug candidate and exclude those who likely will not.

Clinical trials should attain more specific results with smaller numbers of patients. Smaller numbers mean lower costs (by a factor of 2 to 10).

An additional benefit for trial participants and institutional review boards (IRBs) is that stratification, given the correct biomarker, may reduce or eliminate adverse events.

Molecular Profiling

The study of specific patterns (fingerprints) of proteins, DNA, and/or mRNA and how these patterns correlate with an individual's physical characteristics or symptoms of disease.

Generic Health advice (UNLESS)

• Exercise (Hypertrophic Cardiomyopathy)
• Drink your milk (MCM6, lactose intolerance)
• Eat your green beans (glucose-6-phosphate dehydrogenase deficiency)
• & your grains (HLA-DQ2, celiac disease)
• & your iron (HFE, hemochromatosis)
• Get more rest (HLA-DR2, narcolepsy)

EGFR based therapy in mCRC

Overview

Biomarkers …

Concluding Remarks

Before molecular profiling …

• Flow cytometry correlates surface markers, cell size and other parameters

• Circulating tumor cell assays (CTC’s) quantitate the number of tumor cells in the peripheral blood.

• Exosomes are 30-90 nm vesicles secreted by a wide range of mammalian cell types.

• Immunohistochemistry (IHC) measures protein expression, usually on the cell surface.

• Gene sequencing for mutation detection

• Microarray for mRNA message detection
• RT-PCR for gene expression
• FISH analysis for gene copy number
• Comparative Genomic Hybridization (CGH) for gene copy number

Basics of the “old” technology

• Clone the DNA.
• Generate a ladder of labeled (colored) molecules that differ by 1 nucleotide.
• Separate mixture on some matrix.
• Detect fluorochrome by laser.
• Interpret peaks as string of DNA.
• Strings are 500 to 1,000 letters long.
• 1 machine generates 57,000 nucleotides/run.
• Assemble all strings into a genome.

Genetic Variation Among People

0.1% difference among people

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG

Single nucleotide polymorphisms(SNPs)
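The two 19-nucleotide sequences on the slide differ at a single position, which is what a SNP looks like at the sequence level. A minimal sketch for locating such differences between two equal-length, already aligned sequences; the function name `find_snps` and the 1-based position convention are assumptions for illustration.

```python
def find_snps(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) per mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


person_1 = "GATTTAGATCGCGATAGAG"
person_2 = "GATTTAGATCTCGATAGAG"

snps = find_snps(person_1, person_2)
print(snps)  # one G/T difference
```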

The genome fits as an e-mail attachment

gene copy number

mRNA Expression Microarray

gene copy number

Overview

Biomarkers …

Concluding Remarks

Second Generation DNA profiling

• Exome sequencing (also known as targeted exome capture) is an efficient strategy to selectively sequence the coding regions of the genome to identify novel genes associated with rare and common disorders.

• 160K exons

Second Generation DNA profiling

Besides the 6000 protein-coding genes …

140 ribosomal RNA genes
275 transfer RNA genes
40 small nuclear RNA genes
>100 small nucleolar RNA genes

Function of RNA genes

pRNA in phi29 rotary packaging motor (Simpson et al. Nature 408:745-750, 2000)
Cartilage-hair hypoplasia mapped to an RNA (Ridanpää et al. Cell 104:195-203, 2001)
The human Prader-Willi critical region (Cavaille et al. PNAS 97:14035-7, 2000)

Second Generation RNA profiling

RNA genes can be hard to detect

UGAGGUAGUAGGUUGUAUAGU

C. elegans let-7; 21 nt (Pasquinelli et al. Nature 408:86-89, 2000)

Often small
Sometimes multicopy and redundant
Often not polyadenylated (not represented in ESTs)
Immune to frameshift and nonsense mutations
No open reading frame, no codon bias
Often evolving rapidly in primary sequence

Second Generation RNA profiling

ncRNAs in human genome

tRNA 600

18S rRNA 200

5.8S rRNA 200

28S rRNA 200

5S rRNA 200

snoRNA 300

miRNA 250

U1 40

U2 30

U4 30

U5 30

U6 20

U4atac 5

U6atac 5

U11 5

U12 5

SRP RNA 1

RNase P RNA 1

Telomerase RNA 1

RNase MRP 1

Y RNA 5

Vault 4

7SK RNA 1

Antisense RNAs 1000s?

Cis reg regions 100s?

Others ?

Mapping Structural Variation in Humans

- Thought to be common: 12% of the genome (Redon et al. 2006)

- Likely involved in phenotype variation and disease

- Until recently most methods for detection were low resolution (>50 kb)

>1 kb segments

Size Distribution of CNV in a Human Genome

Overview

Biomarkers …

Concluding Remarks

CONFIDENTIAL

Defining Epigenetics

Reversible changes in gene expression/function

Without changes in DNA sequence

Can be inherited from precursor cells

Allows integration of intrinsic with environmental signals (including diet)

Methylation I Epigenetics | Oncology | Biomarker

Genome

Gene Expression

Epigenome

Chromatin

Phenotype

I NEXT-GEN | PharmacoDX | CRC

Epigenetic Regulation: Post Translational Modifications to Histones and Base Changes in DNA

Epigenetic modifications of histones and DNA include:– Histone acetylation and methylation, and DNA methylation

Histone Acetylation

Histone Methylation

DNA Methylation

MGMT Biology: O6-Methylguanine Methyltransferase

Essential DNA Repair Enzyme

Removes alkyl groups from damaged guanine bases

Healthy individual:
- MGMT is an essential DNA repair enzyme
- Loss of MGMT activity makes individuals susceptible to DNA damage and prone to tumor development

Glioblastoma patient on alkylator chemotherapy:
- Patients with MGMT promoter methylation have longer PFS and OS with the use of alkylating agents as chemotherapy

MGMT Promoter Methylation Predicts Benefit from DNA-Alkylating Chemotherapy

Post-hoc subgroup analysis of a temozolomide clinical trial in primary glioblastoma patients shows benefit for patients with MGMT promoter methylation

[Figure: median overall survival with radiotherapy plus temozolomide: 21.7 months (methylated MGMT gene) versus 12.7 months (non-methylated MGMT gene); comparison arm: radiotherapy.]

Adapted from Hegi et al. NEJM 2005; 352(10):1036-8. Study with 207 patients

Genome-wide methylation by methylation sensitive restriction enzymes

Genome-wide methylation by probes

# samples

# markers

Genome-wide methylation …. by next generation sequencing

Discovery

Verification

Validation

MBD_Seq

DNA Sheared

Immobilized Methyl Binding Domain

Condensed Chromatin

DNA Sheared

Immobilized Methyl binding domain

Next Gen Sequencing
GA Illumina: 100 million reads

MBD_Seq

MBD_Seq: MGMT = dual core

# samples

# markers

MBD_Seq

Discovery

1-2 million methylation

Data integration: Correlation tracks

[Figure: two methylation versus expression panels, one with Corr = -1 and one with Corr = 1]
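A correlation track of this kind reduces, per locus, to a correlation coefficient between methylation and expression across samples. The sketch below assumes the Pearson coefficient, which the slide does not name explicitly, and uses made-up values chosen to reproduce the two extremes shown (Corr = -1 and Corr = 1); function and variable names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample values at one locus, for illustration only.
methylation = [0.1, 0.3, 0.5, 0.7, 0.9]
expression_down = [9.0, 7.0, 5.0, 3.0, 1.0]  # falls as methylation rises
expression_up = [1.0, 3.0, 5.0, 7.0, 9.0]    # rises with methylation

corr_down = pearson(methylation, expression_down)
corr_up = pearson(methylation, expression_up)
print(round(corr_down, 6), round(corr_up, 6))
```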

Correlation track in GBM @ MGMT

# samples

# markers

MBD_Seq

454_BT_Seq

Discovery

Verification

Validation

GCATCGTGACTTACGACTGATCGATGGATGCTAGCAT

unmethylated alleles: less methylation

methylated alleles: more methylation

Deep Sequencing

Deep MGMT: Heterogenic complexity

Overview

Biomarkers …

Concluding Remarks

Translational Medicine: An inconvenient truth

• 1% of genome codes for proteins, however more than 90% is transcribed

• Less than 10% of protein experimentally measured can be “explained” from the genome

• 1 genome? Structural variation
• > 200 epigenomes??

• Space/time continuum …

Epigenetic (meta)information = stem cells

Cellular programming

Cellular reprogramming

Epigenetically altered, self-renewing cancer stem cells

Tumor Development and Growth

Gene-specific epigenetic reprogramming

Cellular reprogramming

biobixwvcrieki

biobix.be
bioinformatics.be
