Sequence Analysis (I)
Yuh-Shan Jou (周玉山)
Institute of Biomedical Sciences, Academia Sinica
Bioinformatics
• Bioinformatics is the application of information technology to analyze, process, and manage biological data.
• Bioinformatics provides computational tools to facilitate the process of Data → Information → Knowledge → Discovery
Don't believe everything you see in a database, or even in GenBank! QC is the most important aspect and concern in bioinformatics!
Roadmap to Genomics
(Diagram; recoverable labels.)
• Human Genome Physical Maps
  1. Markers: EST (Expressed Sequence Tag), STS (Sequence Tagged Site), STR (Short Tandem Repeat)
  2. Genomic DNA contigs: cosmid contigs, YAC contigs
• Radiation Hybrid Mapping Panels
• Transcriptional Map of the Human Genome
  – Database of ESTs (dbEST)
  – cDNA sequencing
  – Full-length cDNAs
• Sequencing of the Human Genome
  1. BAC or PAC contigs
  2. Sequencing technologies
  – Shotgun sequencing
• Positional Cloning
• Positional Candidate Approaches
• Disease Markers for diagnosis
• *Diagnosis with GeneChips
  *1. Expression patterns
  *2. Expression profiles
  *3. Microarray of genes
• Functional Genomics
A Vision for the Future of Genome Research
Francis S. Collins (National Human Genome Research Institute, NIH, USA)
Nature 422:835 (2003)
International Sequence Database Collaboration
(Diagram.) Three partner databases exchange submissions and updates:
• GenBank (NCBI, NIH), accessed through Entrez
• EMBL (EBI), accessed through SRS
• DDBJ (CIB, NIG), accessed through getentry
www.ensembl.org
http://genome.ucsc.edu
Databases and Scientific Algorithms: Integration in Bioinformatics
(Diagram; data sources and their formats.)
• Medline (ASN.1)
• Entrez/NCBI (ASN.1)
• PDB (Oracle, 3D images)
• BLAST (FASTA)
• ClustalW (FASTA)
• KEGG (HTML text, binary images)
• OMIM (text file)
• Microarray data (RDBMS, Excel)
Web Access: www.ncbi.nlm.nih.gov
NCBI Web Traffic
(Plot: users per day; the annotation marks Christmas and New Year's Day.)
The Entrez System: Text Searches
Types of Databases
• Primary Databases
  – Original submissions by experimentalists
  – Content controlled by the submitter
  – Examples: GenBank, SNP, GEO
• Derivative Databases
  – Built from primary data
  – Content controlled by third party (NCBI)
  – Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ      49,675,750
Derivative
• RefSeq                        545,503
• Third Party Annotation          4,544
• PDB                             5,561
Total                        50,231,358
Entrez Protein: Derivative Databases
GenPept 3,950,968
RefSeq 1,348,072
Third Party Annotation 4,133
Swiss Prot 170,087
PIR 282,821
PRF 12,079
PDB 61,845
Total 5,830,005
BLAST nr total 2,336,522
GenBank Growth
(Plot: base pairs in billions and records in millions, each on a 0 to 50 axis, against date from Jun-82 to Jun-04.)
The Growth of GenBank
Release 148: 45.2 million records 49.4 billion nucleotides
Average doubling time ≈ 14 months*
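The doubling time can be turned into a rough projection. The sketch below assumes pure exponential growth at the 14-month doubling time quoted above, which is a simplification; the function name `projected_size` is illustrative and not part of any NCBI tool.

```python
def projected_size(current, months_ahead, doubling_months=14.0):
    """Project a quantity that doubles every `doubling_months` months."""
    return current * 2 ** (months_ahead / doubling_months)


# Release 148 figure from the slide: 49.4 billion nucleotides.
bases_now = 49.4e9
for months in (14, 28, 42):
    size = projected_size(bases_now, months)
    print(f"+{months} months: {size / 1e9:.1f} billion bases")
```

Under this assumption the database reaches twice its current size after 14 months and eight times its current size after 42 months.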
Organization of GenBank: Traditional Divisions
Records are divided into 17 divisions: 11 traditional, 6 bulk.
Traditional divisions:
• Direct submissions (Sequin and BankIt)
• Accurate
• Well characterized
PRI (28) Primate
PLN (13) Plant and Fungal
BCT (11) Bacterial and Archaeal
INV (7) Invertebrate
ROD (15) Rodent
VRL (4) Viral
VRT (7) Other Vertebrate
MAM (1) Mammalian
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
Entrez query: gbdiv_xxx[Properties]
Organization of GenBank: Bulk Divisions
Records are divided into 17 divisions: 11 traditional, 6 bulk.
Bulk divisions:
• Batch submission (email and FTP)
• Inaccurate
• Poorly characterized
EST (355) Expressed Sequence Tag
GSS (132) Genome Survey Sequence
HTG (62) High Throughput Genomic
STS (5) Sequence Tagged Site
HTC (6) High Throughput cDNA
PAT (17) Patent
Entrez query: gbdiv_xxx[Properties]
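The `gbdiv_xxx[Properties]` pattern can be generated from the division codes listed above. This is a minimal sketch: the helper names are illustrative, and the E-utilities `esearch.fcgi` URL is an assumption added here (it is not on the slides). The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC", "PAT"}


def division_term(code):
    """Return the Entrez term that restricts a search to one division."""
    code = code.upper()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    return f"gbdiv_{code.lower()}[Properties]"


def esearch_url(term, db="nucleotide"):
    """Build (but do not fetch) an E-utilities search URL; assumed endpoint."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?{urlencode({'db': db, 'term': term})}"


print(division_term("EST"))
print(esearch_url(division_term("EST")))
```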
File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
• GenBank/GenPept (useful for scientists)
• FASTA (the simplest format)
• ASN.1 & XML (useful for programmers)
A Traditional GenBank Record
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
Parts of the record: Header, Feature Table, Sequence
The Flatfile Format
The Header
Header: Locus Line
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
• Locus name: AY182241
• Length: 1931 bp
• Molecule type: mRNA
• Division: PLN
• Modification date: 04-MAY-2004
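The fields of a nucleotide LOCUS line can be pulled apart by splitting on whitespace. This is a minimal sketch for lines shaped like the example above; `Locus` and `parse_locus` are illustrative names, and real records have variants (for example protein records use "aa" and carry no molecule type) that this does not handle.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    molecule: str
    topology: str
    division: str
    date: str


def parse_locus(line):
    """Parse a nucleotide LOCUS line of the form shown on the slide."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("not a nucleotide LOCUS line in the expected layout")
    return Locus(fields[1], int(fields[2]), fields[4],
                 fields[5], fields[6], fields[7])


locus = parse_locus(
    "LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
print(locus)
```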
Header: Database Identifiers
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use
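The three identifiers can be separated programmatically. A minimal sketch, assuming a VERSION line laid out as on the slide; `parse_version_line` is an illustrative name.

```python
def parse_version_line(line):
    """Split a VERSION line into (accession, version number, GI or None)."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError("not a VERSION line")
    accession, _, version = fields[1].partition(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return accession, int(version), gi


print(parse_version_line("VERSION AY182241.2 GI:32265057"))
```

The accession stays the same across updates while the number after the dot increases whenever the sequence itself changes.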
Header: Organism
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
NCBI-controlled taxonomy
The Feature Table
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
• Coding sequence: CDS 54..1784, start (atg), stop (tag)
• Implied protein: the /translation qualifier
• GenPept identifiers: /protein_id and /db_xref
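The CDS coordinates can be checked against the sequence lines shown for this record. The sketch below handles only a simple `start..end` location (no joins or complements) and uses only the bases printed on the slides; the variable names are illustrative.

```python
def parse_simple_location(location):
    """Parse a 'start..end' feature location (1-based, inclusive)."""
    start, end = location.split("..")
    return int(start), int(end)


start, end = parse_simple_location("54..1784")
cds_length = end - start + 1      # bases in the coding sequence
codons = cds_length // 3          # includes the stop codon
protein_length = codons - 1       # residues in /translation

# First ORIGIN line of the record (bases 1-60).
head = ("ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat"
        .replace(" ", ""))
# Start of the ORIGIN line that begins at base 1741 (bases 1741-1790).
tail_offset = 1741
tail = ("ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca"
        .replace(" ", ""))

start_codon = head[start - 1:start + 2]
stop_codon = tail[end - 2 - tail_offset:end + 1 - tail_offset]
print(cds_length, codons, protein_length, start_codon, stop_codon)
```

The length is a multiple of three, the first codon is `atg`, and the last codon is `tag`, which agrees with the start and stop labels on the slide.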
The Sequence: 99.99% Accurate
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      [...]
     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
FASTA Format
>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL
Definition line fields: >gi number | Database identifier | Accession.Version | Locus name, followed by the description and [Organism]
Database identifiers:
gb   GenBank
emb  EMBL
dbj  DDBJ
ref  RefSeq
sp   SWISS-PROT
pdb  Protein Databank
pir  PIR
prf  PRF
tpg  TPA-GenBank
tpe  TPA-EMBL
tpj  TPA-DDBJ
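The definition line can be split into the fields named above. A minimal sketch, assuming the `gi|number|db|accession.version|locus` layout shown on the slide; `parse_defline` and `DB_TAGS` are illustrative names, and other defline styles are not handled.

```python
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "ref": "RefSeq",
    "sp": "SWISS-PROT", "pdb": "Protein Databank", "pir": "PIR",
    "prf": "PRF", "tpg": "TPA-GenBank", "tpe": "TPA-EMBL", "tpj": "TPA-DDBJ",
}


def parse_defline(defline):
    """Parse a '>gi|...|db|acc.ver|locus description [organism]' line."""
    if not defline.startswith(">gi|"):
        raise ValueError("expected a defline starting with '>gi|'")
    identifiers, _, description = defline[1:].partition(" ")
    parts = identifiers.split("|")
    info = {
        "gi": int(parts[1]),
        "database": DB_TAGS.get(parts[2], parts[2]),
        "accession_version": parts[3],
        "locus": parts[4] if len(parts) > 4 and parts[4] else None,
        "title": description.strip(),
        "organism": None,
    }
    # An organism name, when present, sits in trailing square brackets.
    if description.endswith("]") and "[" in description:
        title, _, organism = description.rpartition("[")
        info["title"] = title.strip()
        info["organism"] = organism[:-1]
    return info


record = parse_defline(">gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]")
print(record)
```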
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  class nuc-prot ,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds." ,
    source {
      org {
        taxname "Malus x domestica" ,
        common "cultivated apple" ,
        db {
          {
            db "taxon" ,
            tag id 3750 } } ,
        orgname {
          name binomial {
            genus "Malus" ,
            species "x domestica" } ,
          mod {
            {
              subtype cultivar ,
              subname "'Law Rome'" } ,
            {
              subtype old-name ,
              subname "Malus domestica" ,
              attrib "(10)cultivar='Law Rome'" } } ,
          lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus" ,
          gcode 1 ,,
Display formats: FASTA Nucleotide, FASTA Protein, GenPept, GenBank, ASN.1
Bulk Divisions
• Expressed Sequence Tag
  – 1st pass single read cDNA
• Genome Survey Sequence
  – 1st pass single read gDNA
• High Throughput Genomic
  – incomplete sequences of genomic clones
• Sequence Tagged Site
  – PCR-based mapping reagents
• Batch submission and HTG (email and FTP)
• Inaccurate
• Poorly characterized
EST Division: Expressed Sequence Tags
(Diagram.) Nucleus (30,000 genes) → RNA gene products → make cDNA library (80-100,000 unique cDNA clones in library) → isolate unique clones → sequence once from each end (5' and 3')
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
gbdiv_est[Properties]
ESTs in Entrez
Total      26 million records
Human      6.0 million
Mouse      4.3 million
Rat        0.7 million
Zebrafish  0.6 million
Wheat      0.6 million
Barley     0.3 million
Maize      0.4 million
Genome Sequencing - HTG, GSS, (WGS)
(Diagram.) Whole BAC insert (or genome) → shredding → cloning, isolating → sequencing → assembly
• Individual reads: GSS division or trace archive
• Draft sequence: HTG division
• Whole genome shotgun assemblies: traditional division
HTG Division: Rice Draft Sequences
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences move to traditional GenBank division
Whole Genome Shotgun Projects
• Traditional GenBank Divisions
• 200+ projects
  – Virus
  – Bacteria
  – Environmental sequences
  – Archaea
  – 51 Eukaryotes featuring:
    • Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
    • Pufferfish (2)
    • Honeybee, Anopheles, Fruit Flies (3), Silkworm
    • Nematode (C. briggsae)
    • Yeasts (8), Aspergillus (2)
    • Rice
Zebrafish: WGS
wgs_master[Properties]
Derivative Databases
• UniGene
• RefSeq
• TPA
Primary vs. Derivative Sequence Databases
(Diagram.)
• Primary: sequencing centers and labs submit to GenBank; updated ONLY by submitters
• Derivative: UniGene (built by algorithms), RefSeq (built by curators), and Genome Assembly; updated continually by NCBI
What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast based automated sequence clustering
• Now informed by genome hits (new)
• Nonredundant set of gene oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents
EST hits: Human mRNA
(Diagram: albumin mRNA with 5' EST hits and 3' EST hits.)
UniGene: Expressed Sequences
Expression Data
RefSeq Release 11 (May 13, 2005): available on the FTP site!
• Forming the “best representative” sequence
• Standardizing nomenclature and record structure
• Adding annotation (references, sequence features)
• Stable reference for gene identification, polymorphism discovery, and comparative analysis
• RefSeq Release 11 includes over 1,425,971 proteins and 2,928 organisms.
• The release is available by FTP at: ftp://ftp.ncbi.nih.gov/refseq/release/
• RefSeq number is still not fixed.
srcdb_refseq[Properties]
Curated RefSeq Records
RefSeq Nucleotide
LOCUS       ADSS  1368 bp  mRNA  linear  PRI 27-AUG-2002
DEFINITION  Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
ACCESSION   NM_001126
VERSION     NM_001126.1  GI:4557270
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from X66503.1.
            Summary: Adenylosuccinate synthetase catalyzes the first committed
            step in the conversion of IMP to AMP.
RefSeq Protein
LOCUS       ADSS  455 aa  linear  PRI 27-AUG-2002
DEFINITION  adenylosuccinate synthase; Adenylosuccinate synthetase
            (Ade(-)H-complementing) Homo sapiens .
ACCESSION   NP_001117
VERSION     NP_001117.1  GI:4557271
DBSOURCE    REFSEQ: accession NM_001126.1
X records: Genome Annotation & Inferred or Predicted
vs
N records: Provisional, Reviewed or Validated
RefSeq Accession Numbers
mRNAs and Proteins
NM_123456   Curated mRNA
NP_123456   Curated Protein
NR_123456   Curated non-coding RNA
XM_123456   Predicted mRNA
XP_123456   Predicted Protein
XR_123456   Predicted non-coding RNA
Gene Records
NG_123456   Reference Genomic Sequence
Chromosome
NC_123456   Microbial replicons, organelle genomes, human chromosomes
Assemblies
NT_123456   Contig
NW_123456   WGS Supercontig
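The two-letter prefix is enough to classify a RefSeq accession. The sketch below encodes only the prefixes listed on this slide; `classify_refseq` is an illustrative name, and prefixes not shown here return `None`.

```python
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome (microbial replicon, organelle genome, human chromosome)",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if unknown."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


for acc in ("NM_001126.1", "NP_001117.1", "XM_123456", "AY182241"):
    print(acc, "->", classify_refseq(acc))
```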
RefSeq Curation Processes
(Diagram; recoverable labels.)
• Curated genomic DNA (NC, NT, NW)
• Scanning....
• Model mRNA (XM) (XR)
• Model protein (XP)
• Curated mRNA (NM) (NR)
• Protein (NP)
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled genomic regions (contigs)
  – human genome
  – mouse genome
  – rat genome
• Chromosome records
  – human genome
  – microbial
  – organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
  – BankIt
  – Sequin
tpa[Properties]
TPA record: WGS Assembly
(Screenshot labels: CDS feature, TPA protein.)
Human Nucleotide Sequences
ISDC (GenBank/EMBL/DDBJ)   8,965,327
  PRI                        916,017  (WGS 601,855)
  EST                      6,003,916
  GSS                        905,645
  HTG                         18,364
  HTC                         49,373
  STS                        117,870
  PAT                        953,269
RefSeq                        35,934
TPA                              893
Total                      9,002,154
Other NCBI Databases
• dbSNP: nucleotide polymorphism
• GEO: Gene Expression Omnibus
  – microarray and other expression data
• Gene: gene records
  – unifies LocusLink and Microbial Genomes
• Structure: imported structures (PDB)
  – Cn3D viewer, NCBI curation
• CDD: conserved domain database
  – protein families (COGs)
  – single domains (PFAM, SMART, CD)
NCBI’s SNP Database
• Primary Database and Derivative (RefSNP)
• Single Nucleotide Polymorphism
• Repeat polymorphisms
• Insertion-Deletion Polymorphisms
• 24 Species
• Over 15 million submissions
Submitted SNP: Hemochromatosis SNP
RefSNP
• Non-redundant
• Computational analysis
• BLAST hits to genome, mRNA, protein and structure
Sequence Similarity Searching
Basic Local Alignment Search Tool (BLAST)
(Diagram: search tools by data type. Text: PubMed. Sequence: BLAST. Structure: VAST.)
Pairwise Alignment Summary
Global
• Best score for aligning the full length sequences
• Dynamic programming
• Algorithm: Needleman-Wunsch
• Table cells are allowed any score
Local
• Best score for aligning part of sequences
• Dynamic programming
• Algorithm: Smith-Waterman
• Table cells never score below zero
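The only difference between the two dynamic-programming tables is the floor at zero (and where the final score is read). The sketch below computes scores only, with a linear gap penalty and illustrative scoring values (match +1, mismatch -1, gap -2) that are assumptions, not values from the slides.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Score of the best global (Needleman-Wunsch) or local
    (Smith-Waterman) alignment of a and b with linear gap costs."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are charged, so borders go negative.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # local cells never drop below zero
            table[i][j] = score
            best = max(best, score)
    # Local: best cell anywhere. Global: bottom-right cell.
    return best if local else table[-1][-1]


seq1 = "TTTTACGTACGTCCCC"
seq2 = "GGACGTACGTAA"
print("global:", alignment_score(seq1, seq2, local=False))
print("local: ", alignment_score(seq1, seq2, local=True))
```

The two toy sequences share only the core `ACGTACGT`, so the local score reflects that core while the global score is dragged down by the unrelated flanks.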
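Global vs Local Alignment
(Diagram: in the global alignment Seq 1 and Seq 2 are aligned over their full lengths; in the local alignment only a part of Seq 1 is aligned to a part of Seq 2.)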
Global Alignment
Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
           +A + +    + DL F K D+L I+  T+   W+      GR G IP+NYV + + +++       PW+
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
           GK+ R  AE+ L     E G FLVR+S +   D +L V  +  V+HYRI + H     I     F  L
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
           L+ HY  +ADGLC  L  P             Y   W ++ + ++L++ IG G+FG+V  G +  N  VA
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
           VK +K   A    FLAEA +M +LRH  L+ L  V   ++  + IVTE M + +L+ +L+ RGR
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
            L++ S  V   M YLE  NF+HRDLAARN+L++     K++DFGL     KE      TG + P+KWTA
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
           PEA    +F+TKSDVWSFGILL EI +FGR+PYP +   +V+ +V+ GY+M  P GCP  +Y++M+ CW
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
            D   RP+F  L+ +LE +
Worm:  472 SDPDKRPTFETLQWKLEDL 492
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
      M S .. AA SG. . .A ... .
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
      1 20 40 60

      440 450
human REQLEHI--------KTHELHL
      . .:: . : ...
worm  QWKLEDLFNLDSSEYKEASINF
      500
Align program (Lipman and Pearson)
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
  – DNA vs DNA
  – DNA translation vs Protein
  – Protein vs Protein
  – Protein vs DNA translation
  – DNA translation vs DNA translation
• www, standalone, and network clients
What BLAST tells you
• BLAST reports surprising alignments
  – Different than chance
• Assumptions
  – Random sequences
  – Constant composition
• Conclusions
  – Surprising similarities imply evolutionary homology
Evolutionary homology: descent from a common ancestor. Does not always imply similar function.
BLAST/FASTA variants for different searches
Program          Query     Database   Comparison      Searching purpose
blastn/fasta     DNA       DNA        DNA level       homologous DNA
blastp/fasta     Protein   Protein    Protein level   homologous protein
blastx/fastx     DNA       Protein    Protein level   New genes from DNA
tblastn/tfasta   Protein   DNA        Protein level   New genes from peptide
tblastx/tfastx   DNA       DNA        Protein level   New genes from DNA
BLAST Web site: http://www.ncbi.nlm.nih.gov/BLAST
FASTA Web sites: http://www2.ebi.ac.uk/fasta3/ or http://www.fasta.genome.ad.jp/
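The table above is effectively a lookup from (query type, database type, comparison level) to a program. A minimal sketch of that lookup; `choose_blast_program` is an illustrative helper, not an NCBI function.

```python
BLAST_PROGRAMS = {
    # (query, database, comparison level): program
    ("DNA", "DNA", "DNA"): "blastn",
    ("Protein", "Protein", "Protein"): "blastp",
    ("DNA", "Protein", "Protein"): "blastx",
    ("Protein", "DNA", "Protein"): "tblastn",
    ("DNA", "DNA", "Protein"): "tblastx",
}


def choose_blast_program(query, database, level):
    """Pick the BLAST program for a query/database/comparison combination."""
    try:
        return BLAST_PROGRAMS[(query, database, level)]
    except KeyError:
        raise ValueError(
            f"no BLAST program compares {query} against {database} "
            f"at the {level} level") from None


print(choose_blast_program("DNA", "Protein", "Protein"))
```

DNA against DNA appears twice in the table, so the comparison level is needed to tell blastn from tblastx.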
BLASTN Databases
nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs     High-throughput genomic sequences (draft)
pat      Patented nucleotide sequences
mito     Mitochondrial sequences
vector   Vector subset of GenBank
month    GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom    Contigs and chromosomes from RefSeq
BLASTP Databases
nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot   SWISS-PROT
pat         Patented protein sequences
pdb         Protein Data Bank
month       GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
Make a lookup table of words.
11-mer words:
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT . . .
28-mer words:
GTACTGGACATGGACCCTACAGGAACGT
TGGACATGGACCCTACAGGAACGTATAC
CATGGACCCTACAGGAACGTATACGTAA . . .
WORD SIZE    Def.   Min.
blastn       11     7
megablast    28     12
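The word list is a sliding window over the query. The sketch below lists every overlapping word and records where each one starts; the function names are illustrative, and real BLAST implementations use more compact lookup structures.

```python
from collections import defaultdict


def words(sequence, w):
    """Return every overlapping word of length w, in query order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def lookup_table(sequence, w):
    """Map each word to the list of 0-based positions where it starts."""
    table = defaultdict(list)
    for position, word in enumerate(words(sequence, w)):
        table[word].append(position)
    return dict(table)


query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
print(words(query, 11)[:3])        # default blastn word size
print(len(words(query, 11)), "words of length 11")
print(words(query, 28)[0])         # default megablast word size
```

The same function produces protein words when called with a peptide and `w=3`.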
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Make a lookup table of words:
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
Word size = 3 (default). Word size can only be 2 or 3.
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa
(Diagram.)
• Nucleotide: ATCGCCATGCTTAATTGGGCTT, exact word match CATGCTTAATT (one match)
• Protein: GTQITVEDLFYNI, neighborhood words SEI and YYN (two matches)
BLAST Algorithm
(1) Break the query sequence into words of length W (W default = 11)
(2) Compare the word list to the database and identify exact matches
(3) For each word match, extend alignment in both directions
(4) Compute E-value
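Steps (1) to (3) can be sketched for nucleotides as a seed-and-extend loop. This is a toy version under stated assumptions: ungapped extension only, illustrative scores (match +1, mismatch -3), and a simple drop-off cutoff `x_drop` chosen for the example. The subject sequence is made up, and step (4), the E-value, is not computed here.

```python
def seed_hits(query, subject, w):
    """Steps 1-2: index the query words, then scan the subject for
    exact matches. Returns (query_pos, subject_pos) pairs."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend(query, subject, i, j, w, match=1, mismatch=-3, x_drop=5):
    """Step 3: ungapped extension of a word hit in both directions.
    Returns (score, query_start, query_end) with end exclusive."""

    def walk(qi, sj, step):
        score = best = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score > x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return w * match + left + right, i - left_len, i + w + right_len


query = "ATCGCCATGCTTAATTGGGCTT"        # nucleotide example from the slides
subject = "GGGCCATGCTTAATTGGTAC"        # hypothetical database sequence
hits = seed_hits(query, subject, 11)
print(hits)
print(extend(query, subject, 5, 4, 11))
```

The word `CATGCTTAATT` starts at query position 5 (0-based) and seeds an extension that grows to the 15-base shared segment `GCCATGCTTAATTGG`.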
An alignment that BLAST can’t find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Here there are no exact word matches longer than 6. For nucleotides there must be an exact match of at least 7.
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
            aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
H.sapiens:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
             +  +  V  +     V  L  G     Q  W  G  D  E  G
A.thaliana:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
            agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
• BLASTp finds three matching words
• BLASTn finds no match, because there are no 7 bp words
Protein searches are generally more sensitive than nucleotide searches.
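The ADSS fragments above can be translated and compared at both levels. The sketch uses the standard genetic code written in the usual TCAG order; the helper names are illustrative, and the identity counts cover only this 45-base fragment.

```python
from itertools import product

BASES = "tcag"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
# Standard genetic code: codons enumerated in t, c, a, g order.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string in frame 1, one letter per codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def longest_exact_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

human_protein = translate(human)
plant_protein = translate(plant)
identical = sum(x == y for x, y in zip(human_protein, plant_protein))

print(human_protein)
print(plant_protein)
print("identical residues:", identical, "of", len(human_protein))
print("longest exact DNA run:", longest_exact_run(human, plant))
```

Most residues are conserved, yet synonymous third-position changes break the DNA into runs shorter than the 7-base minimum word, which is why the protein-level search succeeds where blastn fails.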
The Flavors of BLAST
• Standard BLAST
  – traditional “contiguous” word hit
  – position independent scoring
  – nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
• Megablast
  – optimized for large batch searches
  – can use discontiguous words
• PSI-BLAST
  – constructs PSSMs automatically; uses as query
  – very sensitive protein search
• RPS BLAST
  – searches a database of PSSMs
  – tool for conserved domain searches
Megablast: NCBI’s Genome Annotator
• Long alignments for similar DNA sequences
• Concatenation of query sequences
• Faster than blastn
• Contiguous Megablast
– exact word match
– Word size 28
• Discontiguous Megablast
– initial word hit with mismatches
– cross-species comparison
MegaBLAST
AI217550
AI251192
AI254381
BE645079
C:\seq\hs.4.fsa
> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
Templates for Discontiguous Words
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
W = word size; # matches in template
t = template length (window size within which the word match is evaluated)
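A discontiguous word hit can be illustrated as follows: two windows of length t count as a hit when they agree at every '1' position of the template, whatever happens at the '0' positions. The function name and the test windows are hypothetical. Only the template string is taken from the slide (W = 11, t = 16, coding).

```python
def template_hit(a, b, template):
    """True when windows a and b agree at every '1' position of the template."""
    if not (len(a) == len(b) == len(template)):
        raise ValueError("windows and template must have equal length")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


CODING_W11_T16 = "1101101101101101"  # copied from the slide

window_a = "ACGTACGTACGTACGT"
# Hypothetical partner: identical to window_a except at the template's '0' positions.
window_b = "".join("N" if t == "0" else x for x, t in zip(window_a, CODING_W11_T16))

print(CODING_W11_T16.count("1"))                          # number of required matches (W)
print(template_hit(window_a, window_b, CODING_W11_T16))   # mismatches only at '0' positions
```

In the coding template the '0' positions fall on every third base, which matches the tolerance for third-codon-position changes that makes this mode useful for cross-species comparison.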
Scoring Systems - Nucleotides
    A   G   C   T
A  +1  -3  -3  -3
G  -3  +1  -3  -3
C  -3  -3  +1  -3
T  -3  -3  -3  +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10 (19 matches at +1; 2 mismatches and 1 gap position at -3 each)
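The raw score above can be reproduced with a short function. The +1/-3 values come from the identity matrix on the slide. Charging the single gap position -3 is an assumption inferred from the slide's arithmetic (19 - 9), not a statement about BLAST's default gap costs.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum column scores for two aligned strings of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # assumed flat cost per gap position
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
print(score)
```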
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
BLOSUM62
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Statistical behavior is not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
Gap costs: -(a+bk)
a = gap open penalty
b = gap extend penalty
k = number of residues in the gap
For example: A gap of 1 residue receives the score “-(a+b)”.
Scores
            V   D   S   –   C   Y
            V   E   T   L   C   F
BLOSUM62   +4  +2  +1 -12  +9  +3   = 7
PAM30      +7  +2   0 -10 +10  +2   = 11
Simply add the scores for each pair of aligned residues
and (as necessary) factor in the gaps!
Different matrices produce different scores!
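The two totals above can be recomputed from the gap cost formula -(a+bk). In this sketch the pair scores are the ones printed on the slide. The gap parameters (a=11, b=1 for BLOSUM62; a=9, b=1 for PAM30) are assumptions chosen because they reproduce the -12 and -10 shown for a one-residue gap. Function names are hypothetical.

```python
def gap_cost(k, a, b):
    """Affine gap score -(a + b*k) for a gap of k residues."""
    return -(a + b * k)


def score_alignment(top, bottom, matrix, a, b):
    """Add pair scores column by column; each run of '-' is charged once via gap_cost."""
    score = 0
    gap_len = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_len += 1
            continue
        if gap_len:
            score += gap_cost(gap_len, a, b)
            gap_len = 0
        score += matrix[x + y] if x + y in matrix else matrix[y + x]
    if gap_len:
        score += gap_cost(gap_len, a, b)
    return score


# Only the residue pairs needed for this example, with the values from the slide.
BLOSUM62_PAIRS = {"VV": 4, "DE": 2, "ST": 1, "CC": 9, "YF": 3}
PAM30_PAIRS = {"VV": 7, "DE": 2, "ST": 0, "CC": 10, "YF": 2}

blosum_total = score_alignment("VDS-CY", "VETLCF", BLOSUM62_PAIRS, a=11, b=1)
pam_total = score_alignment("VDS-CY", "VETLCF", PAM30_PAIRS, a=9, b=1)
print(blosum_total, pam_total)
```

For simplicity a run of gap characters is charged as one gap regardless of which sequence carries it, which is enough for this single-gap example.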
Matrix differences
• BLOSUM
– Built from local alignments
– Built from a vast amount of data
– Based on groups of related sequences counted as one
– Better for finding local alignments
– Lower BLOSUM series means more divergence
• PAM
– Built from global alignments
– Built from a small amount of data
– Based on minimum replacement or maximum parsimony
– Better for finding global alignments and remote homologs
– Higher PAM series means more divergence
Matrices - Rules of thumb
Need different levels of sensitivity?
– Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
– Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Plot: number of alignments (y-axis: Alignments) against score (x-axis: Score)]
(applies to ungapped alignments)
$E = K m n e^{-\lambda S}$        $E = m n 2^{-S'}$
$K$ = scale for search space
$\lambda$ = scale for scoring system
$S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
Expect Value
E = number of database hits you expect to find by chance
[Figure labels: size of database; your score; expected number of random hits]
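The relation between raw score, bit score and E can be tried numerically. This sketch assumes the usual Karlin-Altschul form, with λ as the scoring-system scale and K as the search-space scale. The values λ = 0.267 and K = 0.041 are assumed here (values commonly reported for gapped BLOSUM62 searches) and are not given in the slides. The lengths m and n in the example are made up.

```python
import math

LAMBDA = 0.267  # assumed scoring-system scale
K = 0.041       # assumed search-space scale


def bit_score(raw, lam=LAMBDA, k=K):
    """S' = (lambda*S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def evalue_from_raw(raw, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


# Raw scores that appear in parentheses in the BLAST outputs of this lecture.
for raw in (97, 593, 741):
    print(raw, round(bit_score(raw), 1))

# Hypothetical search space: query of 131 residues, database of 1e9 residues.
m, n = 131, 10**9
print(evalue_from_raw(97, m, n), evalue_from_bits(bit_score(97), m, n))
```

With these assumed parameters the raw scores 97, 593 and 741 round to the 42, 233 and 290 bits reported in the example outputs, and both forms of the formula give the same E.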
WWW BLAST
The BLAST homepage
Standard databases
Specialized Databases
BLAST Databases: Nucleic Acid
• nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs
• refseq_rna
• refseq_genomic
– NC_ RefSeqs
• dbest
– EST Division
• est_human, mouse, others
• htgs
– HTG division
• gss
– GSS division
• wgs
– whole genome shotgun
• env_nt
– environmental samples
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-W 7 word size
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
BLAST Databases: Non-redundant protein
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein
• PIR, Swiss-Prot, PRF
• PDB (sequences from structures)
pat protein patents
env_nr environmental samples
Advanced Options: Filter
all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]
Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]
Default setting
Hides low complexity for initial word hits only
Masks regions of query in lower case (pre-masked)
BLAST Formatting Page
BLAST Output: Graphic
mouse over
Sort by taxonomy
BLAST Output: Descriptions
link to entrez
Sorted by e values
3 × 10^-12
Default e value cutoff 10
Gene Linkout
TaxBLAST: Taxonomy Reports
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL Length = 615
Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)
Query   9 LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL 58
          L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
BLAST Output: Alignments
Identical match
positive score(conservative)
negative substitution
gap
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1) Length = 756
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)
Query:   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
Query:  61 GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395
Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406
low complexity sequence filtered
Neighbors: Precomputed BLAST
Nucleotide
Protein
Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
BLink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant
PSI-BLAST
Position-Specific Iterated BLAST
• Mining for protein domains
• Confirming relationships among related proteins
Position-Specific Scoring Matrix (PSSM)
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine is scored differently in these two positions.
Active site nucleophile
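The point about position-specific scoring can be made concrete with two rows of the PSSM. The values for positions 211 and 216 are copied from the matrix above. The dictionary layout and function name are just one hypothetical way to hold them.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM on the slide

# Two rows copied from the slide's PSSM (query residue S at both positions).
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at `position`; unlike BLOSUM62 it depends on the position."""
    return PSSM[position][COLUMNS.index(residue)]


for residue in "SAT":
    print(residue, pssm_score(211, residue), pssm_score(216, residue))
```

At position 211 alanine and threonine score nearly as well as serine, whereas at position 216 only serine gets a positive score, so the same residue is weighted very differently depending on where it sits.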
Position Specific Iterative BLAST: PSI-BLAST
Create your own PSSM:
Finding protein families based on your own sequence.
[Diagram labels: query, BLOSUM62, PSSM, Alignment, Alignment]
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
PSI-BLAST
e value cutoff for PSSM
RESULTS: Initial BLASTP
Same results as protein-protein BLAST
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Check to add to PSSM
Reverse Position Specific Iterative-BLAST (a.k.a. RPS-BLAST or CDD Search)
A sequence search of the Conserved Domain Database (CDD)
containing curated Position-Specific Scoring Matrices.
10 20 30 40 50 60
. . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . |
consensus  1 KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL 57
1FGI A     1 aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL 311
1BYG A     1 RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI 74
gi 125135  1 GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL 62
gi 125702  1 KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL 284
gi 1174437 1 KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL 325
PSSM Sources
Pfam   Sanger  7255
SMART  EMBL     663
COG    NCBI    4873
KOG    NCBI    4825
CD     NCBI     645
Reverse Position Specific Iterative-BLAST (a.k.a. RPS-BLAST or CD Search)
Query: sequence Database: PSSMs
P03958
Result: TyrKc
Questions:
• Searching for p53 protein homologs with annotation of CDD.
• Can you put codon 72 SNP into 3D protein structure?
Other Areas to Cover
• Genomic Data
• Annotation
• Common Domains prediction WWW
• Other Useful Genome Browsers