Bioinformatics - Rutgers University (kyc/teaching/files/543-05/543-24.pdf)

KYC

Bioinformatics: Lecture 24

Definition of bioinformatics

Overview of the NCBI website

Accessing information about DNA and proteins

--Definition of an accession number

--Five ways to find information on proteins and DNA

Access to biomedical literature

Alignment and BLAST

Time of development

Body region, physiology, pharmacology, pathology


Gene/protein families

In silico experiments

Examples:
Retinol-binding protein 4 (RBP4): a member of the lipocalin family.
The Pol protein of HIV-1
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species

Aspartyl protease (PR)

Reverse transcriptase (RT)

Integrase (IN)


What is bioinformatics?

• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.


Top ten challenges for bioinformatics

[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula

Source: Ewan Birney, Chris Burge, Jim Fickett


Tool-users

Tool-makers

bioinformatics

public health informatics

medical informatics

infrastructure

databases, algorithms


Growth of GenBank

[Figure: base pairs of DNA (billions) and number of sequences (millions) in GenBank by year, 1982 to 2002. Updated 8-12-04: >40 billion base pairs.]


DNA -> RNA -> protein -> phenotype

DNA (genome): genomic DNA databases
RNA (transcriptome): cDNA, ESTs, UniGene
protein (proteome): protein sequence databases


There are three major public DNA databases

GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan

Species in GenBank

all species 128,941
viruses 6,137
bacteria 31,262
archaea 2,100
eukaryota 87,147

The most sequenced organisms in GenBank (b = billion base pairs)

Homo sapiens 10.7b
Mus musculus 6.5b
Rattus norvegicus 5.6b
Danio rerio 1.7b
Zea mays 1.4b
Oryza sativa 0.8b
Drosophila melanogaster 0.7b
Gallus gallus 0.5b
Arabidopsis thaliana 0.5b

www.ncbi.nlm.nih.gov


www.ncbi.nlm.nih.gov

PubMed is…

• National Library of Medicine's search service

• 12 million citations in MEDLINE

• links to participating online journals

• PubMed tutorial (via “Education” on side bar)

Entrez integrates…
• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes

Books is…
• searchable resource of on-line books


TaxBrowser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• Vector Alignment Search Tool (VAST)

OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU

BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day


Accessing information on molecular sequences

Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You can acquire information beginning with a query such as the name of a protein, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)
NP_006735 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record


Five ways to access DNA and protein sequences

[1] LocusLink with RefSeq
[2] UniGene
[3] Entrez
[4] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[5] ExPASy Sequence Retrieval System (separate from NCBI)

LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 15 organisms.

Unfortunately, LocusLink is slowly being retired in favor of Entrez Gene.

RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).


NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome      NC_######
Complete chromosome  NC_######
Genomic contig       NT_######
mRNA (DNA format)    NM_######  e.g. NM_006744
Protein              NP_######  e.g. NP_006735
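The prefix table above can be turned into a small lookup. The following is a minimal sketch, not an NCBI tool: the function name `classify_refseq` and the regular expression (two letters, an underscore, digits, and an optional `.version` suffix) are assumptions made for illustration, and only the four prefixes listed on the slide are covered.

```python
import re

# Prefixes and descriptions copied from the RefSeq format list on the slide.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Assumed shape: two-letter prefix, underscore, digits, optional ".version".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.\d+)?$")


def classify_refseq(accession):
    """Return the slide's description for a RefSeq-style accession, or None."""
    match = _REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None  # not in the PREFIX_digits form at all (e.g. X02775)
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as X02775 has no underscore-delimited prefix, so the sketch returns `None` for it rather than guessing.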


UniGene: unique genes via ESTs

• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

[2] UniGene

DNA -> RNA -> protein
RNA -> complementary DNA (cDNA) -> UniGene


Cluster sizes in UniGene

This is a gene with 10 ESTs associated; the cluster size is 10

This is a gene with 1 EST associated; the cluster size is 1


Cluster sizes in UniGene (human)

Cluster size      Number of clusters
1                 ≈ 8,100
2                 38,200
3-4               23,300
5-8               12,000
9-16              5,600
17-32             3,700

≈500-1000         1,050
≈2000-4000        100
≈8000-16,000      12
≈16,000-30,000    2


3. Entrez to access protein & DNA sequences

From the NCBI homepage, type “rbp4” and hit “Go”


By applying limits, there are now just two entries
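The slides run this search through the NCBI web page. As a rough sketch of how the same kind of query could be expressed in a script, the code below only builds a query URL for NCBI's E-utilities `esearch` endpoint; it sends no request. Using E-utilities at all is an assumption (the lecture does not mention it), the helper name `build_esearch_url` is hypothetical, and restricting by organism is just one illustrative limit, since the slide does not say which limits were applied.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint (an assumption here;
# the slides only describe the web interface).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="protein", organism=None):
    """Build (but do not send) an Entrez search URL for the given term."""
    if organism:
        # Illustrative limit: restrict hits to one organism.
        term = f"{term} AND {organism}[Organism]"
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})


print(build_esearch_url("rbp4"))
print(build_esearch_url("rbp4", organism="human"))
```

Narrowing the term is the scripted counterpart of applying limits in the browser: the broader query returns many records, the restricted one far fewer.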


FASTA format
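A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader written for illustration, not a replacement for an established parser; the function name `parse_fasta` and the header text in the example are made up, while the two peptide sequences are the oxytocin and vasopressin strings used later in this lecture.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>oxytocin (hypothetical header text)
CYIQN
CPLG
>vasopressin (hypothetical header text)
CYFQNCPRG
"""

for name, seq in parse_fasta(example):
    print(name, seq, len(seq))
```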


4. EBI and Ensembl

click human


enter RBP4


5. ExPASy Sequence Retrieval System


Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol


Searching for HIV-1 pol: following the “genome” link yields a manageable three results


PubMed at NCBI to find literature information

PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.

It has 12 million records dating back to 1966.


MeSH is the acronym for "Medical Subject Headings."

MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE.

The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.


lipocalin AND disease (60 results): 1 AND 2

lipocalin OR disease (1,650,000 results): 1 OR 2

lipocalin NOT disease (530 results): 1 NOT 2

[Venn diagrams: set 1 = lipocalin, set 2 = disease]
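The three Boolean operators behave like set operations on the records matched by each term. Here is a small sketch using Python sets; the record IDs are invented for the example and `boolean_search` is a hypothetical helper, not a PubMed function.

```python
def boolean_search(set1, set2):
    """Combine two sets of record IDs the way AND, OR and NOT combine queries."""
    return {
        "AND": set1 & set2,  # records matching both terms
        "OR": set1 | set2,   # records matching either term
        "NOT": set1 - set2,  # records matching term 1 but not term 2
    }


# Made-up record IDs standing in for search hits.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

hits = boolean_search(lipocalin, disease)
for operator, records in hits.items():
    print(operator, sorted(records))
```

Because AND and NOT split the first set into two non-overlapping parts, their counts add up to the size of that set. The slide's numbers follow the same rule: 60 + 530 = 590 records mention lipocalin at all.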


Pairwise sequence alignment

β-corticotropin (sheep)  ala gly glu asp asp glu
Corticotropin A (pig)    asp gly ala glu asp glu

Oxytocin     CYIQNCPLG
Vasopressin  CYFQNCPRG

Pairwise sequence alignment is the most fundamental operation of bioinformatics

• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
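For two sequences of equal length that need no gaps, the simplest comparison is to count identical positions. The sketch below does this for the oxytocin and vasopressin peptides shown on this slide; the function name `percent_identity` is chosen for illustration.

```python
def percent_identity(a, b):
    """Percent of positions with the same residue in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"

# The two peptides differ at positions 3 (I/F) and 8 (L/R): 7 of 9 are identical.
print(round(percent_identity(oxytocin, vasopressin), 1))
```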


retinol-binding protein (NP_006735)

β-lactoglobulin (P02754)

RBP and β-lactoglobulin are homologous proteins that share related three-dimensional structures


  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

Pairwise alignment of retinol-binding protein and β-lactoglobulin







Somewhat similar (one dot)

Very similar (two dots)

Identity (bar)







Internal gap

Terminal gap


Definitions

Similarity
The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

Identity
The extent to which two sequences are invariant.

Conservation
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
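The relationship "similarity = identity + conservation" can be made concrete by counting the two kinds of position separately. The sketch below is an illustration under a strong assumption: the residue groups in `GROUPS` are a hypothetical, hand-picked partition by rough physico-chemical type. Real alignment programs decide conservation from a scoring matrix, not from a fixed grouping like this.

```python
# Hypothetical grouping of amino acids by rough physico-chemical type.
# This is an assumption for illustration only.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}
GAP_CHARS = ".-~"


def compare_aligned(a, b):
    """Return identity, conservation and similarity as fractions of compared positions."""
    identical = conserved = compared = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # gap positions are not counted here
        compared += 1
        if x == y:
            identical += 1
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            conserved += 1  # different residue, same assumed group
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return {
        "identity": identical / compared,
        "conservation": conserved / compared,
        "similarity": (identical + conserved) / compared,
    }


# Toy example: V/V is identical; K/R, L/I and D/E fall in the same assumed groups.
print(compare_aligned("KVLD", "RVIE"))
```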


Gaps

• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
• In BLAST, it is rarely necessary to change gap values from the default.
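One common way to make the presence of a gap cost more than its length is an affine gap penalty: a large one-time charge for opening a gap plus a small charge per gapped position. The sketch below assumes that scheme; the slide does not name it, and the values -11 and -1 are illustrative placeholders rather than a statement of any program's defaults.

```python
def affine_gap_score(gap_length, gap_open=-11, gap_extend=-1):
    """Score for one gap: a one-time opening charge plus a per-position charge."""
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0
    return gap_open + gap_extend * gap_length


# One gap of four positions versus four separate gaps of one position each.
one_long_gap = affine_gap_score(4)
four_short_gaps = 4 * affine_gap_score(1)
print(one_long_gap, four_short_gaps)
```

With these placeholder values a single four-residue gap is penalized far less than four separate one-residue gaps, which matches the idea that one mutational event can insert or delete several residues at once.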


  1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP  48 human
  1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP  47 trout

 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED  98 human
 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD  97 trout

 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148 human
 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147 trout

149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199 human
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192 trout

Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)


Pairwise alignment
The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Homology
Similarity attributed to descent from a common ancestor.

RBP        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Orthologs
Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

Paralogs
Homologous sequences within a single species that arose by gene duplication.


Orthologs: members of a gene (protein) family in various organisms.

This tree shows 13 RBP orthologs.

[Phylogenetic tree; scale bar: 10 changes. Taxa: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human]


Paralogs: members of a gene (protein) family within a species.

This tree shows 9 human lipocalins.

[Phylogenetic tree; scale bar: 10 changes. Proteins: apolipoprotein D, retinol-binding protein 4, complement component 8, prostaglandin D2 synthase, neutrophil gelatinase-associated lipocalin, lipocalin 1, odorant-binding protein 2A, progestagen-associated endometrial protein, alpha-1 microglobulin/bikunin]

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html


Pairwise sequence alignment allows us to look back billions of years ago (BYA)

[Timeline figure, 4 to 0 BYA. Labels: Origin of life; Origin of eukaryotes; insects; Fungi/animal; Plant/animal; Earliest fossils; Eukaryote/archaea]


fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
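One simple thing to read off a multiple sequence alignment is which columns are identical in every sequence. The sketch below checks this for the first ten columns of the glyceraldehyde 3-phosphate dehydrogenase alignment above. The function name `conserved_columns` and the use of "_" to mark a non-conserved column are choices made for this example.

```python
def conserved_columns(rows, gap_chars=".-~"):
    """Return a string with the shared residue at fully conserved columns, '_' elsewhere."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    summary = []
    for column in zip(*rows):  # walk the alignment column by column
        first = column[0]
        same = first not in gap_chars and all(c == first for c in column)
        summary.append(first if same else "_")
    return "".join(summary)


# First ten columns of the alignment shown on this slide.
gapdh_start = {
    "fly":       "GAKKVIISAP",
    "human":     "GAKRVIISAP",
    "plant":     "GAKKVIISAP",
    "bacterium": "GAKKVVMTGP",
    "yeast":     "GAKKVVITAP",
    "archaeon":  "GADKVLISAP",
}

print(conserved_columns(list(gapdh_start.values())))
```

This is a strict test of identity across all six species; a column where the residues differ but stay chemically similar would still be marked "_" here.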


~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD... complement component 8

Multiple sequence alignment of human lipocalin paralogs


General approach to pairwise alignment

• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance
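The first five steps can be sketched with a small dynamic-programming routine. The code below computes a global alignment score in the style of the Needleman-Wunsch algorithm. The slides do not name a specific algorithm, so this choice is an assumption, as are the scoring values (+1 match, -1 mismatch, -2 per gap position) and the simple per-position gap cost. Estimating the probability that the score arose by chance, the last step in the list, is not covered.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a simple linear gap cost."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] paired with a gap
                curr[j - 1] + gap,   # b[j-1] paired with a gap
            ))
        prev = curr
    return prev[-1]


# Oxytocin versus vasopressin: 7 matches and 2 mismatches, no gaps needed.
print(global_alignment_score("CYIQNCPLG", "CYFQNCPRG"))
```

Only one row of the score table is kept at a time because the sketch reports the score alone. Recovering the alignment itself would require keeping the full table and tracing back through it.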


Calculation of an alignment score

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
