bioinformatics - rutgers universitykyc/teaching/files/543-05/543-24.pdf · •online mendelian...
KYC
Bioinformatics Lecture 24
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Alignment and BLAST
[Figure labels: time of development; body region, physiology, pharmacology, pathology]
Gene/protein families
In silico experiments
Examples:
Retinol-binding protein 4 (RBP4): a member of the lipocalin family.
The Pol protein of HIV-1
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
[Diagram labels: aspartyl protease (PR), reverse transcriptase (RT), integrase (IN)]
What is bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Top ten challenges for bioinformatics
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett
[Diagram labels: tool-users, tool-makers; bioinformatics, public health informatics, medical informatics; infrastructure: databases, algorithms]
Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) plotted by year, 1982 to 2002]
Updated 8-12-04: >40b base pairs
[Diagram labels: DNA, RNA, protein, phenotype; genome, transcriptome, proteome; genomic DNA databases; cDNA, ESTs, UniGene; protein sequence databases]
There are three major public DNA databases
GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan
Species in GenBank
all species 128,941
viruses 6,137
bacteria 31,262
archaea 2,100
eukaryota 87,147
The most sequenced organisms in GenBank
Homo sapiens 10.7b
Mus musculus 6.5b
Rattus norvegicus 5.6b
Danio rerio 1.7b
Zea mays 1.4b
Oryza sativa 0.8b
Drosophila melanogaster 0.7b
Gallus gallus 0.5b
Arabidopsis thaliana 0.5b
www.ncbi.nlm.nih.gov
www.ncbi.nlm.nih.gov
PubMed is…
• National Library of Medicine's search service
• 12 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Books is…
• searchable resource of on-line books
TaxBrowser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Structure site includes…
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day
Accessing information on molecular sequences
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You can acquire information beginning with a query such as the name of a protein, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
rs7079946   dbSNP (single nucleotide polymorphism)
N91759.1    An expressed sequence tag (1 of 170)
NM_006744   RefSeq DNA sequence (from a transcript)
NP_006735   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record
Five ways to access DNA and protein sequences
[1] LocusLink with RefSeq
[2] UniGene
[3] Entrez
[4] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[5] ExPASy Sequence Retrieval System (separate from NCBI)
LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 15 organisms.
Unfortunately, LocusLink is slowly being retired in favor of Entrez Gene.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_006744
Protein               NP_######   e.g. NP_006735
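The prefix convention listed above can be checked mechanically. The following is a minimal sketch, not an NCBI tool: the function name `describe_refseq` and the regular expression (two letters, an underscore, digits, an optional version suffix) are assumptions made for illustration, and only the four prefixes from the slide are covered.

```python
import re

# Prefix meanings copied from the RefSeq format list on the slide.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Assumed pattern: two letters, underscore, digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.\d+)?$")


def describe_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    match = REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None  # not in the PREFIX_digits form at all
    return REFSEQ_PREFIXES.get(match.group(1))  # None for unlisted prefixes


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, "->", describe_refseq(acc))
```

X02775 is a GenBank accession rather than a RefSeq one, so the sketch returns None for it.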
[2] UniGene
UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
[Diagram labels: DNA, RNA, complementary DNA (cDNA), protein, UniGene]
Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the cluster size is 10
This is a gene with 1 EST associated; the cluster size is 1
Cluster sizes in UniGene (human)
Cluster size       Number of clusters
1                  ≈ 8,100
2                  38,200
3-4                23,300
5-8                12,000
9-16               5,600
17-32              3,700
≈500-1000          1,050
≈2000-4000         100
≈8000-16,000       12
≈16,000-30,000     2
3. Entrez to access protein & DNA sequences
From the NCBI homepage, type “rbp4” and hit “Go”
By applying limits, there are now just two entries
FASTA format
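In FASTA format each record is a header line beginning with ">" followed by one or more lines of sequence. A minimal parsing sketch is given here; the function name `parse_fasta` and the header text in the example are made up for illustration, while the residues are the first 50 of the RBP sequence used in the alignment slides.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Illustrative record: the header is hypothetical, not a real database entry.
EXAMPLE = """>example_RBP4 first 50 residues (illustrative header)
MKWVWALLLLAAWAAAERDCRVSSF
RVKENFDKARFSGTWYAMAKKDPEG
"""

records = parse_fasta(EXAMPLE)
for name, seq in records:
    print(name, len(seq))
```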
4. EBI and Ensembl
click human
enter RBP4
5. ExPASy Sequence Retrieval System
Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Searching for HIV-1 pol: following the “genome” link yields a manageable three results
PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.
It has 12 million records dating back to 1966.
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE.
The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.
lipocalin AND disease (60 results)
lipocalin OR disease (1,650,000 results)
lipocalin NOT disease (530 results)
[Venn diagrams of sets 1 and 2: 1 AND 2; 1 OR 2; 1 NOT 2]
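The three Boolean operators behave like set operations on the citations matching each term. A small sketch with Python sets follows; the citation IDs are hypothetical numbers invented for the example, not real PubMed records.

```python
# Hypothetical citation IDs, made up for illustration only.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease          # lipocalin AND disease (intersection)
either = lipocalin | disease        # lipocalin OR disease (union)
only_first = lipocalin - disease    # lipocalin NOT disease (difference)

print("AND:", sorted(both))
print("OR: ", sorted(either))
print("NOT:", sorted(only_first))

# AND and NOT split the first set into two non-overlapping parts.
assert len(both) + len(only_first) == len(lipocalin)
```

By the same reasoning, and assuming the three searches on the slide were run against the same snapshot of the database, the counts imply roughly 60 + 530 = 590 citations matching lipocalin alone.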
Pairwise sequence alignment
β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu
Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
Pairwise sequence alignment is the most fundamental operation of bioinformatics
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
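For two peptides of equal length, such as oxytocin and vasopressin on this slide, the simplest comparison is to count identical positions with no gaps. The sketch below does only that; the function name `percent_identity` is chosen for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Count identical positions in two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches, 100.0 * matches / len(seq_a)


oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"

matches, pct = percent_identity(oxytocin, vasopressin)
print(f"{matches} of {len(oxytocin)} positions identical ({pct:.1f}%)")
```

The two peptides differ only at positions 3 (I versus F) and 8 (L versus R).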
retinol-binding protein (NP_006735)
β-lactoglobulin (P02754)
RBP and β-lactoglobulin are homologous proteins that share related three-dimensional structures
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein and β-lactoglobulin
Key to the alignment symbols: one dot (.) = somewhat similar; two dots (:) = very similar; bar (|) = identity
Gap types marked in the alignment: internal gap; terminal gap
Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
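The three definitions can be made concrete by classifying each aligned column as identical, conserved, or different. The sketch below is an illustration only: the residue groups are a simplified, hypothetical grouping by physico-chemical property (real tools use substitution matrices), and the two fragments are the ungapped stretch at positions 38 to 46 of the RBP and β-lactoglobulin alignment.

```python
# Simplified, hypothetical residue groups; not a published substitution matrix.
GROUPS = ["AGST", "ILMV", "FWY", "DE", "KRH", "NQ", "C", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def classify_columns(seq_a, seq_b):
    """Count identical, conserved, and different columns (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    counts = {"identical": 0, "conserved": 0, "different": 0}
    for a, b in zip(seq_a, seq_b):
        if a == b:
            counts["identical"] += 1
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            counts["conserved"] += 1  # same property group, different residue
        else:
            counts["different"] += 1
    return counts


rbp_fragment = "GTWYAMAKK"            # RBP, alignment positions 38-46
lactoglobulin_fragment = "GTWYSLAMA"  # beta-lactoglobulin, same columns

counts = classify_columns(rbp_fragment, lactoglobulin_fragment)
similar = counts["identical"] + counts["conserved"]  # identity plus conservation
print(counts, "similar columns:", similar)
```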
Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
• In BLAST, it is rarely necessary to change gap values from the default.
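One common way to weight the presence of a gap more heavily than its length is an affine penalty: a large cost for opening a gap plus a small cost per gapped position. The sketch below assumes that scheme; the values 10 and 1 are illustrative choices, not defaults taken from the slides, and the function names are hypothetical. The example row is the β-lactoglobulin line from the third block of the RBP alignment, which contains one gap of eight positions.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Score (negative) for one gap of the given length; values are illustrative."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return -(gap_open + gap_extend * length)


def gap_lengths(aligned_row, gap_char="."):
    """Lengths of each run of gap characters in one row of an alignment."""
    runs, current = [], 0
    for ch in aligned_row:
        if ch == gap_char:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


row = "IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC"
runs = gap_lengths(row)
one_long_gap = sum(affine_gap_penalty(n) for n in runs)
eight_short_gaps = 8 * affine_gap_penalty(1)
print(runs, one_long_gap, eight_short_gaps)
```

Under these assumed values, one gap of eight positions is penalized far less than eight separate single-position gaps, which matches the idea that a single mutational event can insert or delete several residues at once.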
1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48 :: || || || .||.||. .| :|||:.|:.| |||.||||| 1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47 . . . . . 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98 |||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || | 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97 . . . . . 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148 ||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| | 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147 . . . . . 149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199 |||:||| | || || |||| :..|:| .|| : | |:|: 148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192
Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)
Pairwise alignment: The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Homology: Similarity attributed to descent from a common ancestor.
RBP 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 +K++ +++ GTW++MA + L + A V T + +L+ W+ glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: Homologous sequences within a single species that arose by gene duplication.
Orthologs: members of a gene (protein) family in various organisms.
This tree shows 13 RBP orthologs.
[Tree labels: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human. Scale bar: 10 changes]
Paralogs: members of a gene (protein) family within a species.
This tree shows 9 human lipocalins.
[Tree labels: apolipoprotein D, retinol-binding protein 4, complement component 8, prostaglandin D2 synthase, neutrophil gelatinase-associated lipocalin, lipocalin 1, odorant-binding protein 2A, progestagen-associated endometrial protein, alpha-1 microglobulin/bikunin. Scale bar: 10 changes]
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Pairwise sequence alignment allows us to look back billions of years (BYA = billion years ago)
[Timeline from 4 to 0 BYA; labeled events: origin of life, origin of eukaryotes, insects, fungi/animal, plant/animal, earliest fossils, eukaryote/archaea]
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Multiple sequence alignment of human lipocalin paralogs
~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM  lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF  odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR  progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV  apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF  retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF  neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL  prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW  alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...  complement component 8
General approach to pairwise alignment
• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance
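The first five steps above can be sketched with a small global alignment routine. The slide does not name an algorithm; the sketch below uses Needleman-Wunsch-style dynamic programming as one classical choice. The scoring values (match +1, mismatch -1, gap -2 per position) and the function name `global_align` are assumptions for illustration, and the final step, estimating statistical significance, is not covered.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # seq_a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # seq_b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                top.append(seq_a[i - 1])
                bottom.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = global_align("CYIQNCPLG", "CYFQNCPRG")
print(total)
print(row_a)
print(row_b)
```

With these assumed scores, oxytocin and vasopressin align without gaps: seven matches and two mismatches. A local alignment would instead look for the best-scoring subsequences rather than aligning both sequences end to end.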
Calculation of an alignment score
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html