
Page 1

Computational Analysis of Genomes

Shinichi Morishita

The figures, photos and moving images with ‡ marks attached belong to their copyright holders. Reusing or reproducing them is prohibited unless permission is obtained directly from such copyright holders.

Page 2

Chromosomes, Chromatin Structure and Genomes

Ref: Annunziato, A. DNA packaging: Nucleosomes and chromatin. Nature Education 1(1) (2008)

Page 3

Sequencing a Genome

• Do the genome size and number of chromosomes characterize an organism?

• What could the sequenced genome be used for?

• How to sequence genomes?

• How to identify the gene coding regions?

• Recent revolutionary advances in genome sequencing equipment

• How to infer chromatin structure?

Page 4

Genome Size

Courtesy of Dr. T. Ryan Gregory http://www.genomesize.com/statistics.php

1 pg (10^-12 g) ≅ 1 billion base pairs (more accurately, 978 million base pairs)
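The conversion above can be turned into a small helper. This is a minimal sketch that uses only the 978 million base pairs per picogram figure from the slide; the function names and the sample mass of 3.5 pg are illustrative assumptions, not values from the lecture.

```python
# Conversion factor from the slide: 1 pg of DNA is about 978 million base pairs.
BP_PER_PG = 978e6


def picograms_to_base_pairs(mass_pg: float) -> float:
    """Convert a DNA mass in picograms to an approximate number of base pairs."""
    return mass_pg * BP_PER_PG


def base_pairs_to_picograms(n_bp: float) -> float:
    """Convert a number of base pairs to an approximate DNA mass in picograms."""
    return n_bp / BP_PER_PG


# Hypothetical example: a genome measured at 3.5 pg of DNA.
example_mass_pg = 3.5
print(f"{example_mass_pg} pg is roughly {picograms_to_base_pairs(example_mass_pg):.3g} bp")
```

Genome size databases usually report mass, so a helper like this is what links the mass axis of a size chart to sequence length.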

Page 5

Why Are Genomes So Different in Size?

Human genome

Figure removed due to copyright restrictions. Molecular Biology of the Cell - Fifth Edition, Garland Science (2008), Figure 5-75

Figure labels: retrovirus-like element; simple repeat sequence; repetitive sequence; unique sequence; overlapping fragments; intron; protein coding region; gene; nonrepetitive DNA sequence found in neither introns nor codons

Page 6

Human Chromosomes

U.S. National Library of Medicine http://ghr.nlm.nih.gov/handbook/illustrations/normalkaryotype

Page 7

Vertebrate Chromosome Counts

Figure: vertebrate phylogeny drawn against a time axis running from 500 to 0 million years ago. Paleozoic: Cambrian, Ordovician, Silurian, Devonian, Carboniferous, Permian. Mesozoic: Triassic, Jurassic, Cretaceous. Cenozoic: Paleogene, Neogene.

Clades labeled: Vertebrata, Agnatha, Gnathostomata, Chondrichthyes, Osteichthyes, Actinopterygii, Polypteriformes, Holostei, Teleostei, Sarcopterygii, Dipnoi, Tetrapoda, Amphibia, Amniota, Reptilia, Squamata, Crocodilia, Testudinata, Aves, Mammalia.

Species labeled: lamprey; shark, ray; bichir; sturgeon, gar, bowfin; zebrafish; killifish; green spotted puffer; tiger blowfish; coelacanth, lungfish; frog; lizard, snake; crocodile; turtle; chicken; dog; mouse, rat; human, chimpanzee.

Nakatani et al., 2007, Genome Res., 17, 1254-1265

Page 8

Chromosome Count Distribution

Vertical axis: No. of species. Horizontal axis: Chromosome count.

Figure removed due to copyright restrictions. Nakatani et al., 2007, Genome Res., 17, 1254-1265, Figure 6

Figure: distributions arranged along the vertebrate phylogeny and a time axis running from 500 to 0 million years ago (Paleozoic: Cambrian, Ordovician, Silurian, Devonian, Carboniferous, Permian; Mesozoic: Triassic, Jurassic, Cretaceous; Cenozoic: Paleogene, Neogene).

Clades labeled: Vertebrata, Agnatha, Gnathostomata, Chondrichthyes, Osteichthyes, Actinopterygii, Polypteriformes, Holostei, Teleostei, Sarcopterygii, Dipnoi, Tetrapoda, Amphibia, Amniota, Reptilia, Squamata, Crocodilia, Testudinata, Aves, Mammalia.

Page 9

(Small deer)

Figure removed due to copyright restrictions. Molecular Biology of the Cell - Fifth Edition, Garland Science (2008), Figure 4-14

Page 10

What Can the Genome Be Used For?

• It tells us whether a given gene is present.

• Chicken genome sequenced Dec 2004

• Do chickens have a poor sense of smell?

• Computational analysis finds 218 genes that could encode olfactory receptors.

• What happened to the genes for flight?

Page 11

How to Sequence the Genome?

The Sanger method allows reading of 500 to 800 bases . . .

Ref: Annunziato, A. DNA packaging: Nucleosomes and chromatin. Nature Education 1(1) (2008)

Applied Biosystems Japan Co., Ltd.‡

Page 12

A C G T

A jigsaw puzzle of millions or tens of millions of pieces, whose source image is unknown.

Page 13

Copy the genome

Use a high-velocity water jet to fragment the genome into random pieces

Collect fragments of around 2,000 base pairs (or another suitable length)

Page 14

Read 500 to 800 bases from each end of the collected fragments

Assemble read sequences into contiguous fragments

Example where it takes more than one method to assemble read sequences
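The assembly step above can be illustrated with a toy greedy overlap assembler. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats longer than the minimum overlap); it is not the assembler used in the lecture, and every function name here is hypothetical.

```python
import random


def sample_reads(genome, read_len, n_reads, rng):
    """Sample fixed-length reads from random positions, mimicking random fragmentation."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def prune(contigs):
    """Drop exact duplicates and any sequence contained inside another one."""
    unique = list(dict.fromkeys(contigs))
    return [c for c in unique if not any(c != o and c in o for o in unique)]


def greedy_assemble(reads, min_overlap):
    """Repeatedly merge the pair of sequences with the longest suffix-prefix overlap."""
    contigs = prune(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs = prune(rest + [merged])


# Three overlapping reads taken from the toy genome ATGGCGTACGTTAGC.
toy_reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(toy_reads, min_overlap=3))

# Random sampling, as in shotgun sequencing; coverage is not guaranteed to be complete.
print(greedy_assemble(sample_reads("ATGGCGTACGTTAGC", 8, 20, random.Random(0)), 3))
```

When the genome contains repeats longer than the reads, several merge orders are equally plausible, which is why a greedy strategy like this one breaks down on real genomes.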

Page 15

Assemble contiguous fragments

Assemble non-contiguous fragments

Page 16

Gene Coding Regions Within a Genome

Barry Shell, www.science.ca http://www.science.ca/scientists/scientistprofile.php?pID=19&pg=1

Page 17

Can Gene Coding Regions Be Predicted from a Genome Alone?

CCATA TATA

GGTAAGG

CAGG

ATG(Start codon)

TAA,TAG,TGA(Stop codon)

Protein coding region

AATAAA

Coding potential

• The frequency of codon usage is a bias specific to the organism.

• Coding regions have a periodicity of three nucleotides (one codon).

• The bias most commonly used is the frequency of six-nucleotide (two-codon) words.

• Hidden Markov model
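Two of the signals on this slide, the start and stop codons and the six-nucleotide frequency bias, can be sketched in a few lines. This is a simplified illustration: it scans only the forward strand, ignores introns and splice sites, and scores with a plain log-odds sum instead of a hidden Markov model. The function names, the pseudocount and the toy training sequences are assumptions made for the example.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for ATG...stop stretches on the forward strand.

    end is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def hexamer_counts(seqs):
    """Count in-frame six-nucleotide (two-codon) words in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts


def coding_potential(seq, coding_counts, background_counts, pseudo=1.0):
    """Sum of log-odds (coding vs background) over in-frame hexamers of seq."""
    n_words = 4 ** 6  # number of possible hexamers, used for pseudocount smoothing
    coding_total = sum(coding_counts.values()) + pseudo * n_words
    background_total = sum(background_counts.values()) + pseudo * n_words
    score = 0.0
    for i in range(0, len(seq) - 5, 3):
        word = seq[i:i + 6]
        p_coding = (coding_counts[word] + pseudo) / coding_total
        p_background = (background_counts[word] + pseudo) / background_total
        score += math.log(p_coding / p_background)
    return score


# Toy data only: real tables would be trained on known genes of the organism.
coding = hexamer_counts(["ATGAAAATGAAAATGAAA"])
background = hexamer_counts(["CCCCCCCCCCCCCCCCCC"])
print(find_orfs("CCATGAAATTTTAGGG"))
print(coding_potential("ATGAAA", coding, background))
```

A positive score means the hexamers in the candidate region look more like the coding training set than the background; gene finders combine such scores with splice-site and promoter signals.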

Page 18

Sequenced Vertebrate Genomes

Compare genomes, identify conserved regions, predict genes

Figure: vertebrate phylogeny drawn against a time axis running from 500 to 0 million years ago. Paleozoic: Cambrian, Ordovician, Silurian, Devonian, Carboniferous, Permian. Mesozoic: Triassic, Jurassic, Cretaceous. Cenozoic: Paleogene, Neogene.

Clades labeled: Vertebrata, Agnatha, Gnathostomata, Chondrichthyes, Osteichthyes, Actinopterygii, Polypteriformes, Holostei, Teleostei, Sarcopterygii, Dipnoi, Tetrapoda, Amphibia, Amniota, Reptilia, Squamata, Crocodilia, Testudinata, Aves, Mammalia.

Species labeled: lamprey; shark, ray; bichir; sturgeon, gar, bowfin; zebrafish; killifish; green spotted puffer; tiger blowfish; coelacanth, lungfish; frog; lizard, snake; crocodile; turtle; chicken; dog; mouse, rat; human, chimpanzee.

Nakatani et al., 2007, Genome Res., 17, 1254-1265

Page 19

Compare genomes, identify conserved regions, predict genes

Dubchak and Frazer, 2003, Genome Biology, 4, 122 http://genomebiology.com/2003/4/12/122
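One simple way to picture "identify conserved regions" is a sliding-window identity scan over two sequences that have already been aligned. The sketch below assumes the alignment is given as two equal-length strings with "-" for gaps; the window size, the threshold and the function name are illustrative assumptions and do not reproduce any specific tool from the cited review.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.8):
    """Return merged (start, end) column ranges whose window identity meets the threshold.

    Both inputs must be rows of the same pairwise alignment; end is exclusive.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where the two rows carry the same base, 0 for mismatches and gaps.
    matches = [int(x == y and x != "-") for x, y in zip(aligned_a, aligned_b)]
    regions = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Toy alignment: the first ten columns are identical, the rest differ.
row_a = "ACGTACGTAC" + "AAAAAAAAAA"
row_b = "ACGTACGTAC" + "CCCCCCCCCC"
print(conserved_windows(row_a, row_b, window=10, min_identity=1.0))
```

Regions that stay similar between distant species are candidates for genes or regulatory elements, which is how comparison complements the prediction from sequence signals alone.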

Page 20

Acquiring Gene Sequences

• Synthesize cDNA from mRNA

• Insert cDNA into vectors, propagate and store copies: cDNA library

• Areas where Japan has global prominence:
  Sumio Sugano (The University of Tokyo, Institute of Medical Science): Human, other
  Yoshihide Hayashizaki (Riken): Mouse

• Extreme difficulty of identifying all mRNA

Figure removed due to copyright restrictions. Molecular Biology of the Cell - Fifth Edition, Garland Science (2008), Figure 8-43

Page 21

Accelerating Genome Sequencing

                                     | Illumina GAIIx | ABI SOLiD 3 | Roche 454 FLX Titanium
Read length (nucleotides)            | 75 x 2 = 150   | 50          | 500
Reads (hundred million) / run        | 0.96~1.2       | 4           | 0.01
Days / run                           | 9.5            | 16          | 0.4 (10 hr)
Nucleotides per unit of time:
Nucleotides (hundred million) / day  | 15~19          | 12.5        | 12
Sample volume (µg)                   | 0.1~1          | 0.01~5      | 3~5
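The "nucleotides per day" row follows from the other rows: read length times reads per run, divided by days per run. The sketch below recomputes it from the figures in the table; the dictionary layout and function name are illustrative, and the 454 run time is taken as 10 hours.

```python
def nucleotides_per_day(read_length_nt, reads_per_run, days_per_run):
    """Throughput in nucleotides per day for one instrument."""
    return read_length_nt * reads_per_run / days_per_run


# (read length in nt, reads per run, days per run), taken from the table above.
platforms = {
    "Illumina GAIIx (0.96 hundred million reads)": (150, 0.96e8, 9.5),
    "Illumina GAIIx (1.2 hundred million reads)": (150, 1.2e8, 9.5),
    "ABI SOLiD 3": (50, 4e8, 16),
    "Roche 454 FLX Titanium": (500, 0.01e8, 10 / 24),
}

for name, (length, reads, days) in platforms.items():
    per_day = nucleotides_per_day(length, reads, days)
    print(f"{name}: {per_day / 1e8:.1f} hundred million nucleotides/day")
```

Despite very different read lengths and run times, the three instruments land at a similar daily output, so the choice between them depends mainly on the read length an application needs.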

Figure labels: 1.4x/year; 2.9x/year; partial collection of gene sequences

• Genome re-sequencing, transcription start sites, chromatin structures, DNA methylation, RNA sequencing ==> Illumina GA

• De novo sequencing of large genomes, full-length cDNA sequencing, alternative splicing ==> Roche 454

• Single-molecule prediction ==> Observations of early development

• Re-sequencing of the Watson genome (454, c.250 bp)

• Re-sequencing of an Asian human genome (Illumina, c.35 bp): mutations, insertions/deletions, inversions, etc.

• DNA methylation (Roche 454, 100-250 bp; Illumina, 36 bp after target capture)

• RNA sequencing (Illumina: 25~35 bp)

• Coverage of transcription start sites (Illumina/SOLiD: 25 bp)

• Chromatin structures (Illumina/SOLiD, 25 bp)

• Partial sequencing of Neanderthal genome

• Chromatin structures (c.100 bp read length)

• De novo sequencing of human and other large vertebrate genomes, full-length cDNA sequencing (500-800 bp read lengths)

Page 22

Parallelization of Computational Resources to Improve Performance of Next-Generation Sequencers

• Moore's Law: CPU performance (the number of transistors on a chip) doubles every 1.5 years.

• The performance of next-generation sequencers is outstripping Moore's Law, opening a gap of roughly tenfold over four years.

• Run ten times as many CPUs in parallel to maintain processing speeds.

• Bottleneck: simultaneous access to secondary storage devices

Figure labels: 1.4x/yr; 2.9x/yr; Moore's law: 1.6x/year; 10x difference; access parallelization, from N megabytes/sec to K x N megabytes/sec
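The growth figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the 2.9x/year label is the growth rate of sequencer output and that 1.6x/year is the Moore's Law rate it is compared against; the slide does not state this pairing explicitly, so treat it as an interpretation.

```python
# Doubling every 1.5 years corresponds to this yearly growth factor.
moore_per_year = 2 ** (1 / 1.5)

# Assumption: 2.9x/year is the sequencer growth rate shown on the chart.
sequencer_per_year = 2.9


def growth_gap(fast_rate, slow_rate, years):
    """Ratio between two exponentially growing quantities after a number of years."""
    return (fast_rate / slow_rate) ** years


print(f"Moore's Law per year: {moore_per_year:.2f}x")
print(f"Gap after 4 years: {growth_gap(sequencer_per_year, 1.6, 4):.1f}x")
```

Under these assumptions the yearly Moore's Law factor rounds to 1.6x, and the gap after four years comes out slightly above tenfold, consistent with the "10x difference" label.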

Page 23

HA8000 Cluster System at the U. Tokyo Information Technology Center

http://www.cc.u-tokyo.ac.jp/ha8000/

Nodes
• Theoretical peak performance: 147.2 GFLOPS
• Processors (cores): 4 (16)
• Main memory: 32 GB (936 nodes), 128 GB (16 nodes)
• Local disk capacity: 250 GB (incl. RAID 1 OS space)

Processors
• Processor (clock speed): AMD Opteron 8386 (2.3 GHz)
• Cache memory: L2 512 KB/core, L3 2 MB/processor
• Theoretical peak performance per core: 9.2 GFLOPS

Top-performing supercomputer in Japan

Global Top 500: 16th in Jun 2008, 27th in Nov 2008

the U. Tokyo Information Technology Center‡

System diagram labels: external router; login nodes (4 nodes); storage system; Type A computational node group (512 nodes); Type A computational node group (128 nodes); Type B computational node group (256 nodes); Type B computational node group (35 + 16 nodes)
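The per-core and per-node figures in the specification are related by simple multiplication, which the sketch below reproduces. The value of 4 floating-point operations per clock cycle per core is an assumption chosen because it reproduces the 9.2 GFLOPS per-core figure; the node count of 952 is the sum of the 936 and 16 nodes listed under main memory.

```python
CLOCK_GHZ = 2.3          # from the specification
FLOPS_PER_CYCLE = 4      # assumption: reproduces the 9.2 GFLOPS per-core figure
CORES_PER_NODE = 16      # 4 processors x 4 cores
NODES = 936 + 16         # nodes listed under main memory

per_core_gflops = CLOCK_GHZ * FLOPS_PER_CYCLE
per_node_gflops = per_core_gflops * CORES_PER_NODE
cluster_tflops = per_node_gflops * NODES / 1000

print(f"Per core: {per_core_gflops:.1f} GFLOPS")
print(f"Per node: {per_node_gflops:.1f} GFLOPS")
print(f"Whole cluster: {cluster_tflops:.1f} TFLOPS")
```

These are theoretical peaks; sequence analysis workloads are often limited by storage access instead, which is the bottleneck the previous slide points to.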

Page 24

Jun Wang (1976- )

Page 25

Comprehensive Description of Chromatin Structure

Jeremy M. Berg, 2006, Biochemistry 6th edition, W.H. Freeman & Co.

Figure removed due to copyright restrictions. Molecular Biology of the Cell - Fifth Edition, Garland Science (2008), Figure 4-72

Page 26

Can nucleosome core positions be predicted from a genome sequence alone?

Reprinted by permission from Macmillan Publishers Ltd: Segal et al., Nature 442(7104):772-8, copyright (2006)

Figure labels: (out of the minor groove); (inside of the minor groove); dinucleotide; DNA of nucleosome; histone core

Page 27

• A nucleosome core is present every 160-200 base pairs

• The human genome has 15 to 20 million nucleosome cores

• c.2,000 sequences/day in 2002 (ABI 3730)

• 10 million sequences/day attainable since 2007 (Illumina GA)
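The nucleosome count and the sequencing rates above can be connected with a short calculation. The sketch assumes a human genome of about 3 billion base pairs (a commonly quoted round figure, not stated on the slide) and asks how long it would take to obtain one read per nucleosome at each throughput; the function names are illustrative.

```python
HUMAN_GENOME_BP = 3.0e9  # assumption: round figure for the human genome


def nucleosome_count(genome_bp, spacing_bp):
    """Number of nucleosome cores if one occurs every spacing_bp base pairs."""
    return genome_bp / spacing_bp


def days_for_one_read_each(n_nucleosomes, sequences_per_day):
    """Days needed to produce one sequence per nucleosome at a given throughput."""
    return n_nucleosomes / sequences_per_day


low = nucleosome_count(HUMAN_GENOME_BP, 200)   # widest spacing, fewest nucleosomes
high = nucleosome_count(HUMAN_GENOME_BP, 160)  # tightest spacing, most nucleosomes
print(f"Nucleosome cores: {low / 1e6:.1f} to {high / 1e6:.2f} million")
print(f"At 2,000 sequences/day: {days_for_one_read_each(low, 2000):.0f} days")
print(f"At 10 million sequences/day: {days_for_one_read_each(low, 1e7):.1f} days")
```

The jump from thousands of days to a day or two is what makes genome-wide nucleosome mapping practical with the newer sequencers.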

Figure labels: linker DNA; histone core of nucleosome; 11 nm

Linker DNA is cleaved by micrococcal nuclease (a digestive enzyme)

Take only the pieces of DNA that are wound around the histone core

Read both ends with a sequencing machine

Page 28

In a population of cells, positions of nucleosome cores are unlikely to be stable.

Figure removed due to copyright restrictions. Molecular Biology of the Cell - Fifth Edition, Garland Science (2008), Figure 4-23 (part 2 of 2)

Page 29

Summary

• Genome size and chromosome count do not necessarily characterize an organism.

• A genome has many different uses.

• Repeated sequences complicate genome sequencing. Computational analysis is essential.

• Prediction, genome comparison and cDNA collection are used in conjunction to infer gene coding regions.

• The capability of genome sequencers has made tremendous strides in recent years.

• It is now possible to characterize chromatin structures.