bio305 genome analysis and annotation 2012
Lecture on bacterial genome analysis and annotation for the Bio305 course at the University of Birmingham

Bio305 Bacterial Genome Annotation and Analysis
Professor Mark Pallen

Overview
Features of Bacterial Genomes
Genome Sequencing
Assembly of bacterial genomes
Annotation of bacterial genomes
Identifying and annotating CDSs
An ORF is NOT a CDS!
Power and pitfalls of using homology
BLAST and PFAM

General features of genomes
Microbial
Small WYSIWYG genomes (Mbp)
Gene density high (>90%), intergenic regions short
Very little repetitive or non-coding DNA
Introns very rare
Protein-coding genes (CDSs) short (~1 kbp)
Operons with promoters just upstream
Fewer non-coding RNAs
Human
Very large genomes (Gbp)
Gene density low: only 25% is genes
Introns mean only 1% codes
Genes can span ≥30 kbp
Genes have ~3 transcripts
Splicing and splice variants
Promoter regions distant from gene

Bacterial genome organisation
Chromosomes
Most commonly a single circular chromosome (always DNA)
BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
BUT a few species have two chromosomes (e.g. Vibrio cholerae)
Can be a mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)
Plasmids
Independent autonomous replicon, can be circular or linear
May integrate into chromosome
Copy number varies from 1 to 10s
Often carry non-essential genes that confer an adaptive advantage in certain conditions

Overview of a genome project
Choose strain: fresh isolate or tractable lab strain?
Choose strategy: shotgun sequencing; paired-end sequencing; draft or complete?
Choose chemistry: Sanger; 454; Illumina; Ion Torrent
Assembly: automated
Closure and finishing: manually intensive; difficulty depends on how repetitive the genome is
Data release: immediate or delayed?
Annotation: manually intensive bottleneck
Publication

Whole-Genome Shotgun Sanger Sequencing
bacterial chromosome
Random shearing
Size selection
Cloning (plasmid vector)
Pick colonies to create shotgun library
Plasmid preps
Sequence each insert with two primers

High-throughput Sequencing
100x faster, 100x cheaper!
A disruptive technology
Several technologies in the marketplace from 2007 onwards: 454 (Roche), Illumina, Ion Torrent, PacBio
Fundamentally new approaches
Solid-phase amplification of clonal templates in “molecular colonies”
Massive increase in number of “clones” compensates for shorter read length
New chemistries for sequence reading
454: pyrophosphate detection on base addition
Illumina: reversible de-protection of fluorescent bases

High-Throughput Shotgun Sequencing
bacterial chromosome
Random shearing
Size selection
Add adapters
Amplify
Sequence

Illumina Sequencing

The Sequence Assembly Problem
Sequencing technologies generate reads of <1000 bp
These reads must be assembled into a single continuous genomic sequence
Shotgun sequencing exploits many overlapping sequences (high coverage) to infer ordering directly from the sequences themselves
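
The idea of inferring order from overlaps can be sketched in a few lines of Python. This is only a minimal illustration, not how production assemblers work: the function names, the toy reads and the `min_overlap` threshold are assumptions made for the example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    shortest = min(len(left), len(right))
    for size in range(shortest, max(min_overlap, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Toy reads (hypothetical), both sampled from the sequence ATGGCGTACGTTAGC.
read_a = "ATGGCGTAC"
read_b = "GTACGTTAGC"
contig = merge_reads(read_a, read_b)
print(contig)
```

With many reads and high coverage, the same suffix-to-prefix test is what lets overlapping reads be chained into longer contigs.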

The Repeat Problem
Repeats at read ends can be assembled in multiple ways
ATTTATGTGTGTGTGGTGTG
GTGTGGTGTGCACTACTGCT
ACTACTGCTGACTACTGTGTGGTGTG
GTGTGGTGTGATATCCCT
[Figure: the same four reads tiled in two alternative orders, one labelled Correct and one labelled Incorrect]
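
The ambiguity can be made concrete with the four reads shown on this slide. The sketch below lists, for each read, which other reads could follow it given a suffix-to-prefix overlap. The read names and the 8-base threshold are assumptions for illustration only.

```python
from typing import Dict, List

# The four reads from the slide; the names are hypothetical labels.
READS = {
    "read1": "ATTTATGTGTGTGTGGTGTG",
    "read2": "GTGTGGTGTGCACTACTGCT",
    "read3": "ACTACTGCTGACTACTGTGTGGTGTG",
    "read4": "GTGTGGTGTGATATCCCT",
}


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Longest suffix of `left` equal to a prefix of `right` (0 if too short)."""
    shortest = min(len(left), len(right))
    for size in range(shortest, max(min_overlap, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def successors(reads: Dict[str, str], min_overlap: int = 8) -> Dict[str, List[str]]:
    """Map each read to every other read that could be placed after it."""
    graph = {}
    for name_a, seq_a in reads.items():
        graph[name_a] = [
            name_b
            for name_b, seq_b in reads.items()
            if name_a != name_b and overlap_length(seq_a, seq_b, min_overlap) > 0
        ]
    return graph


for name, followers in successors(READS).items():
    print(name, "->", followers)
```

Because read1 and read3 both end in the repeat GTGTGGTGTG, and read2 and read4 both begin with it, each of read1 and read3 has two possible successors. The overlaps alone cannot tell which layout is the correct one.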

Paired-end Sequencing
bacterial chromosome
Random shearing
Size selection for 3 kb or 8 kb etc.
Add linkers
Circularise
Shear and select on size and presence of linkers
Add adapters
Obtain sequences from either side of linker, a known distance apart in the genome
Create long fragments of known length
Obtain sequence from paired ends a known distance apart
Allows assembly of contigs across repeats into scaffolds

Genome Assembly
Scaffold: Contig 1, Contig 2, Contig 3
Sequence Gap
Physical Gap

Re-sequencing
Short reads (<200 bp) are inefficient for de novo assembly
Instead they are mapped against a reference genome
Re-sequencing is like assembling a jigsaw puzzle using the image on the lid
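
Mapping against a reference can be sketched as sliding each read along the reference and keeping placements with few mismatches. This is a deliberately naive illustration: real read mappers use indexed data structures and handle gaps, and the reference fragment, reads and `max_mismatches` value here are hypothetical.

```python
from typing import List, Tuple


def map_read(reference: str, read: str, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (0-based position, mismatch count) for every acceptable placement."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for ref_base, read_base in zip(window, read)
                         if ref_base != read_base)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


REFERENCE = "ATGGCGTACGTTAGCCTA"  # hypothetical reference fragment
exact_hits = map_read(REFERENCE, "GTACGTT")    # read identical to the reference
variant_hits = map_read(REFERENCE, "GTACCTT")  # read with one differing base
print(exact_hits, variant_hits)
```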

SNP calling
Comparison between closely related strains allows identification of SNPs that are informative for:
Identifying biologically significant changes, e.g. during evolution in the lab or in a patient
Reconstructing phylogenies using neutral changes
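
At its simplest, SNP calling between two closely related strains is a position-by-position comparison of aligned sequences. The sketch below assumes the two sequences are already aligned and of equal length, and ignores read depth and base quality, which real SNP callers rely on. The sequences are hypothetical.

```python
from typing import List, Tuple


def call_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each difference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1):
        # Only report substitutions between unambiguous bases.
        if ref_base != sample_base and ref_base in "ACGT" and sample_base in "ACGT":
            snps.append((position, ref_base, sample_base))
    return snps


snps = call_snps("ATGGCGTACG", "ATGACGTATG")
print(snps)
```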

Genome annotation
Annotation is the addition of information about the predicted sequence features to the flat file of DNA code
Identification of potential coding sequences (CDSs)
Homology searches to predict function
Other features can be annotated as well:
rRNAs
Potential promoters
tRNAs
Small non-coding RNAs
Repeat sequences
Insertion sequences (ISs), transposons, gene fragments
Location of the origin of replication
Determination of the number of bases, genes, and G+C%
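
The simplest of these summary statistics, the number of bases and G+C%, can be computed directly from the sequence. A minimal sketch follows; the function name and the example sequence are assumptions for illustration.

```python
from collections import Counter
from typing import Dict


def summarise(sequence: str) -> Dict[str, float]:
    """Return the sequence length and G+C% (computed over A, C, G and T only)."""
    counts = Counter(sequence.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    gc_percent = 100.0 * gc / acgt if acgt else 0.0
    return {"length": len(sequence), "gc_percent": gc_percent}


summary = summarise("ATGCGC")  # hypothetical six-base sequence
print(summary)
```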

How to go from this…?
>Escherichia coli K-12 MG1655_3870656-3890655
TGCTGCTGCCTGCTGCGCGGTGCGCTCTACGGATTGCCCGGCGCGATAGAGATCGCTGCCTAAGCCCGCCCCTGCACAACCTGCGTCTATCCACTGCGCCAGGTTTTCTGCGTCACGCCGCAACGGCAAAGACTGCGATGTCCGATGGCAATACCGCTTTTAACGCTTTGATGTATTGCGGACCAAAAGCCGATGACGGAAATATTTTCAGCGCCTGCGGCGCCCGCTTCGAGCGCGGTAAAGGCTTCGGTCGCCGTCGCGCAGCCGGGGCAGACGTCATGCCGTAGCCCACCGCACGGCGGATCACTTCACTATGGATATTGGGCGTAACGATGAGCTGACAGCCCATCCTGGCGAGCGCATCGACCTGTTCAGGTTTCAGTACCGTACCTGCGCCAATCAACGCCTTGTCGCCGTACGCATCAACGATGCGGGAATGCTTTGCTCCCATTGTGGGGAATTCAGCGGGATTTCAACCGCGTCGAACCCGGCGTCAATCACCGCGCCAACATGCGCCAGCGCCTCGTCGGGCGTAATACCGCGCAAAATGGCGATCAGCGGGAGTTTAGTTTGCCACTGCATGAGGATGCTCCTTATACCAGCCTGAAATGCCGTGTCGCCCGCCACCGCCGTCACGTCGCAACCCATCGCCTGAAAGGCTTGCTGGTAGCGCGCGGTCAGCGATGTTCCGGCGACAAGGGTGATGGCGTGTTGATGGGCCACATAGTCGCGCATACTGGGACCTCTGCGCCAATCAACAAACCAGAGAGAAATTCGCTGACCTGTTCGCGGGGAAGTGTTCCCAGCACATGCGAGGCGCGAACTTCAAAAAGCTGCGGCAATATGGCGGGCGTATTAAGACCACGCTCAAGGCCAGCTGTGAAGGCATCGGCAGGTTTTCCTGCGGCGGCAAACCTGCGCCAATCAATGAGTGATTTAACAGTAAATGATGTAATTCACCGGTCATCACGGTGCGAAAATCGTTGATTTGCTGGCTATCGGCCTGCACCCATTTGCAATGGGTTCCGGGCATGACATAAAGAGAGGAAGAGCCAGAGCTCGCGCGCCGATCAATTGTGTTTCTTCGCCGCGCATCACATTGTGGTTATCGTCATGAGAGACACATAATCCGGGAATAATCCAGATATTGTCGCCAACTGACGTTAATTGTTCGCCAATAGACGAAAAACAGGCAGGAACAGATAATACGGTGCAACTTTCCAGCCGACGTTGCTGCCAACCATTCCTGCCATTACCACTGGCGTTTTCTCTTCACGCCAGTCGGTCGTGACTTCTGCTAACACCGCAGCCGGAGATTTTCCGTTCAGGCGCGTGACGCCTGCTTCTGATTGCCTGCTCTCAGGCAGTGGTCGCCCTGATAAAGCCAGGCGCGCAGATTGGTCGATCCCCAGTCAATTGCGATGTAGCGAGCTGTCATGTGATTTCCTTTAACCTTCGTGTCGAGCTGGCGATCATGGTAAGCGCCGCCTGCTCTGCCGCATCGCCGTCCTGATGCGTATCGCATCGAACAGCGCCTTATGTTCCTGGAGCGTTTGCGGCATGTTGGCCTCATCGCCCATCCAGGTTCGTTCAAAAACCGCCCGCTGCAGCGAACTGATCGCAATGCTAAGTTGCTGTAACACCGGGTTATGCACCGACTGCAGCACCGCTCGTGGTAGCGAATATCCGCTTCGTTAAACGCTTCGCGGTCCTGATTGTTGGCAATCATCTCGTTCAGCGCCGATTCAATCTGCGCCAGATCGCTGGAAGTCGCGCGCTCTGCTCCCAACGGGCAATCGCCGGTTCCACCAGATTTCGCACTTCGTCATGGCACTGATAAGCCGTGGGTCGTAGTCATTTTCCAGCACCCATTGCAGTACGTCAGTGTCGAGGTAATTCCACTGGTTACGCGGTGCCACAAACGCCCCGCGATAACGTTTCATTTCAATCAGCCGCTTCGCCATCAGCGAACGGAACACCCACGGATGATGTTGCGCGAG
GTTGCAAACTCCTCACAGAGTTCCGCCTCAGCCGGAAGCGGCGAGCCTGGCACGTATTTGCCGTGAACGATCTGTTTACCCAGCGTAATGACAATGCGATCGGTTTTATTGAGAGTCATGGAGAGTCCTTGTGCTTGTATGTTCTTCTCTACTTTACCCCGATCGATGCATAACGCGGCAACTTTGTAGTACCAGCGTGATGACGTTCGCGTTTGCCGTGCGTGTAATGTAGTACAAACTTATATTGTTGTACTACAATTTAGATCACAAAAAGAACAATGCATAAAAAATGACATGCGTCGGGCAGAAATCTGAAAAGGGATATCAGGCGCTAAACAGGAGGGAAAGAAGAGTATGCTTTCAACGGCTTAGCTACTCGTTTAAAGGATTAATCATGAAGTTGAATTTTAAGGGATTTTTTAAGGCTGCCGGTTTATTCCCACTGCGCTGATGCTTTCAGGCTGTATCTCGTATGCTCTGGTTTCCCATACCGCAAAGGGTAGTTCAGGAAAGTATCAATCGCAGTCAGACACCATCACTGGGCTATCGCAGGCAAAAGATAGTAATGGAACAAAAGGCTATGTTTTTGTAGGGGAATCGTGGATTACCTTATCACTGATGGTGCCGATGACATCGTTAAGATGCTCAATGATCCAGCACTTAACCGGCACAATATTCAGGTTGCCGATGACGCAAGATTTGTTTTAAATGCGGGGAAAAAGAAATTTACCGGCACAATATCGCTTTACTACTACGGAATAACGAAGAAGAAAAGGCACTGGCAACGCATTATGGTTTTGCCTGTGGTGTTCAACACTGTACCAGGTCACTGGAAAACCTAAAAGGCACAATCCATGAGAAAAATAAAAACATGGATTACTCAAAGGTGATGGCGTTCTACCATCCATTTAAGTGCGATTTTATGAATACTATTCACCCAGAGGCATTCCGGGATGGTGTTTCCGCAGCATTACTGCCAGTGACTGTTACGCTGGACATCATTACTGCACCGCTGCAATTTCTGGTTGTATATGCAGTAAACCAATAATCAGTAAGCGGGCAAACCGTTTATGCTGTTTGCCCGCCCACAGATTAATTCAGCACATACTTCTCAATAGCAAACGCCACGCCATCTTCAAGGTTAGATTTGGTGACAAAGTTCGCCACTTCTTTCACTGAAGGAATAGCGTTATCCATCGCCACACCGACGCCTGCATATTAATCATTGCGATATCGTTTTCCTGATCGCCAATCGCCATGATTTCTTCCGGTTTAATACCTAACACGTCGGCCAGTGATTTCACCCCCGTACCTTTGTTAACGCGTTTATCGAGGATTTCGAGGAAGTACGGCGCACTTTTCAGCACGGTATATTCTCTTTCACTTCCTGCGGAATACGCGCGATAGCCTGGTCGAGGATGGCGGGTTCATCAATCATCATCACTTTCAGGAACTGGGTATTGGGGTCCATTTTCTCCGCTTCGCAGAACACCAGCGGAATGGTGGCAACGAAGGATTCATGCACCGTGTGTAGCTGATATCACGGTTGGCGGTGTACAGCGTGGTGCGGTCCAGGGCGTGGAAATGAGAACCGACTTCGCGAGAGAGTTTTTCCAGGAAACGATAGTCGTCATAGCTGAGAGCAGTTTGCGCCACGGTGCTACCATCAGCGGCCTTCTGTACCACGCGCCGTTATAAGTAATGCAGTAGTCGCCCGGCTGTTCCATATGCAGCTCTTTCAGGTAGTTGTGCACACCTGCATACGGGCGACCCGTCGTTAGCACGACATTCACGCCACGGGCGCGAGCTGCGGCAATCGCATTTTTAACGGCGGGTGAAAGGTGTGATCGGGCAGCAGAAGGGTGCCATCCATATCGATAGCAATGAGTTTAATAGCCATGAGTTCCCCAGGTAGATTGGTTCCTGACCCATGCTAACGCGATTCCGCTCAAAAATCAGTACAACACCCGAGGGAAAAGGGGGATGCAACGCGCGTGCGT
GCTCCCTTTTTGCTTAGCGGAAGAGTTTCCCTTTCAGCAGTTCCATGCCTGCGGAAAGCAGATCGTTATTGGCTTGTGGTGACACTTCACCTTGCGGTGAGAGCGCATCAATAATCTTCGGCAATTGTTCTGCCAGTAAACTGGAAGCTGACTGGTATCCACGCCAAGTTTTTGCCCGAGATCGGACACCGCATTTGTGCCGAGCGCCGATTCCAGTTGCTCGCCACTAACCGATTGATTGCCCTGTTGATTACTCAGCCAGGTTGAGAGAATGGCCCCTAAGCCGCCACTTTGCAGTTTTTCCACAGCACCTGAATGCCGCCCTGCTCCTCAACCCAACTTAAAATAGCCTGATATTTCCCCGCATCGCCTTTCAGAAAGGCACCGACAACTTCATCAAAAAGCCCCATGATAATCACCTGTAAAGCGTTACGTGTTGACCCAAAAAGTATAGATTTGCGGATGATAATTGCGGATTGCAGAAATAAAAAGGGCGGAGATGATCTCCGCCCTTTTCTTATAGCTTCTTGCCGGATGCGGCGTGAACGCCTTATCCGGCCTACAAAATCATGAAAATTCAATACATTGCAAGATTTTCGTAGGCCTGATAAGCGTGCGCATCAGGCACGCTCGCATGGTTAGCGCCATTAAATATCGATATTCGCCGCTTTCAGGGCGTTCTCTTCAATAAACGCACGGCGCGGTTCAACGGCGTCGCCCATCAGCGTGGTGAACAACTGGTCGGCAGCAATCGCATCTTTAACGGTAACCGCAGCATACGACGACTTTCCGGGTCCATAGTGGTTTCCCACAGCTGTTCCGGGTTCATCTCGCCCAGACCTTTATAACGCTGGATGGAGAGGCCGCGACGGGACTCTTTCACCAGCCAGTCCAGCGCCTGCTCGAAGCTGGCTACCGGCTGACGCGCTCGCCACGTTCGATAAACGCATCTTCTTCCAGCAAGCCACGCAGTTTCTCACCCAGCGTGCAGATACGACGATATTCGCCACCGGTGATAAACTCGTGATCCAGCGGATAGTCAGTATCCACACCGTGGGTACGCACGCGAACAATCGGCTCAACAGGTTTTGCTCAGCATTGGTGTGAACATCAAACTTCCACTGGCTGCCGTGCTGTTCTTTGTCGTTCAGTTCGCTGACCAGCGCGTTCACCCAGCGGGTAACGGTCTGCTCATCAGAAAGGTCAGCTTCCGTCAACGTCGGCTGATAGATAAGTCTTTCAGCATTGCTTTCGGATAACGACGCTCCATACGATTGATCATTTTCTGCGTCGCGTTGTACTCAGATACCAGTTTCTCTAACGCTTCGCCAGCCAATGCCGGTGCACTGGCGTTGGTGTGCAGCGTTGCGCCGTCCAGCGCGATAGAGATTGGTACTGATCCATCGCTTCGTCGTCTTTAATGTACTGTTCCTGCTTGCCTTTCTTCACTTTGTACAGCGGCGGCTGAGCGATGTAGACGTGACCGCGTTCAACGATTTCCGGCATCTGACGATAGAAGAAGGTCAACAGCAGCGTACGAATGTGGAGCCGTCGACGTCCGCATCGGTCATGATGATGATGCTGTGATAACGCAGTTTGTCCGGGTTGTACTCGTCACGACCGATACCACAGCCAAGCGCGGTGATAAGCGTCGCCACTTCCTGAGAAGAGAGCATCTTATCGAAGCGCGCTTTCTCGACTTGAGGATTTTACCCTTCAGCGGCAGAATCGCCTGGTTCTTGCGGTTACGCCCCTGCTTCGCAGAGCCGCCCGCGGAGTCCCCTTCCACCAGGTACAGTTCGGAAAGCGCCGGATCGCGTTCCTGGCAGTCTGCCAGTTTGCCCGGCAGGCCCGCAAGTCGAGCGCACCTTTACGGCGGGTCATTTCACGCGCGCGACGCGGCGCTTCACGGGCACGGGCAGCATCGATAATTTTGCCAACCACGATTTTCGCGTCGGTTGGGTTTTCCAGCAGGTATTCTGCCAGCAGTTCGTTCATCT
GCTGTTCAACGCCGATTTCACCTCAGAAGAAACCAGTTTGTCTTTGGTCTGGGAGGAGAATTTCGGGTCCGGCACTTTCACGGAAACGACCGCAATCAGGCCTTCACGCGCATCGTCACCGGTGGCGCTGACTTTGGCTTTTTTGCTGTAGCCTTCTTTGTCCATTAGGCGTTCAGGGTACGGGTCATCGCCGCACGGAAGCCTGCCAGGTGAGTACCGCCGTCACGCTGCGGAATGTTGTTGGTAAAGCAGTAGATGTTTTCCTGGAAGCCATCGTTCCACTGCAACGCCACTTCGACGCCAATACCGTCTTTTTCAGTGAGAAGTAGAAGATATTCGGGTGGATCGGCGTTTTGTTCTTGTTCAGATATTCAACGAACGCCTTGATGCCGCCTTCATAGTGGAAGTGGTCTTCTTTGCCGTCGCGCTTGTCGCGCAGACGAATGGAAACGCCGGAGTTGAGGAACGACAACTCCGCAGACGTTTCGCCAGAATTTCATATTCGAACTCGGTCACATTGGTGAAGGTTTCGAGGCTGGGCCAGAAACGCACCATGGTGCCGGTTTTTTCAGTCTCGCCGGTAACCGCCAGCGGGGCCTGCGGTACACCGTGTTCGTAGATCTGACGGTGATTTTACCCTCGCGCTGGATAACCAGCTCCAGTTTTTGCGACAGGGCGTTTACTACCGAAACACCAACGCCGTGCAGACCGCCGGACACTTTATAGGAGTTATCGTCAAATTTACCGCCTGCGTGCAGAACGGTCATGATCACTTCCGCCGCCGA

…to this?
FT gene complement(9299..10702)
FT /db_xref="GenBank:2367266"
FT /gene="dnaA"
FT /note="b3702"
FT CDS complement(9299..10702)
FT /db_xref="GI:2367267"
FT /db_xref="PID:g2367267"
FT /function="putative regulator; DNA - replication, repair,
FT restriction/modification"
FT /codon_start=1
FT /protein_id="AAC76725.1"
FT /gene="dnaA"
FT /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
FT FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
FT QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
FT AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
FT KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
FT LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
FT NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
FT SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
FT SNLIRTLSS"
FT /product="DNA biosynthesis; initiation of chromosome
FT replication; can be transcription regulator"
FT /transl_table=11
FT /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
FT CG Site No. 851"

Or this?

Caveat
Real bioinformaticians do not use graphical web-based tools
Real bioinformaticians use the Unix (typically Linux) command-line interface
Often glue programs together into pipelines using Perl
Write programs in e.g. Perl or Python
The aim here is to equip lab-based workers with basic know-how
If you want to become a bioinformatician, do an MSc in Bioinformatics

Sources of information for annotation
Comparison with genome sequences from related organisms
Published experimental data
Demonstration of function of a gene
Demonstration of function of a homologous gene
Review articles on protein families or groups of proteins
Prediction that the CDS encodes a member of the family
Prediction that the CDS encodes a conserved motif
Protein sequence analysis
Annotations are only predictions
Sequences generated from RNA-Seq and protein mass spectrometry support annotations
Expert knowledge on an organism or protein family can assist in annotation

Approaches to functional annotation
Most of the work is now done automatically by programs
Analyses are strung together into pipelines, so that on our xBASE site we can assemble then annotate a genome in half an hour
But automated approaches work best if a closely related sequence is available
Wherever there are conflicting predictions, one has to rely on human judgment and interpretation of context
Adjusting start codons
Fine-tuning descriptors
Annotation should rely on an evidence trail that leads back to experimental results (“genomic isnad”)

Base composition aids genome analysis
Figure labels:
GC skew, (G-C)/(G+C): identifies origin of replication and leading and lagging strands
%G+C
Genes coded by location & function
Genes shared with E. coli
Genes unique to S. typhi
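
The GC skew formula on this slide, (G-C)/(G+C), is usually evaluated in windows along the genome. The sketch below also tracks the running G minus C count, whose minimum is a commonly used heuristic estimate of the origin of replication. Treat this as an illustration under that assumption. The function names, window sizes and toy sequences are hypothetical.

```python
from typing import List, Tuple


def gc_skew(window: str) -> float:
    """Compute (G - C) / (G + C) for one window; 0.0 if the window has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_skew(sequence: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (0-based window start, skew) for each full window along the sequence."""
    sequence = sequence.upper()
    return [(start, gc_skew(sequence[start:start + window]))
            for start in range(0, len(sequence) - window + 1, step)]


def cumulative_skew_minimum(sequence: str) -> int:
    """Return the 1-based position at which the running (G - C) count is lowest."""
    running = 0
    lowest = 0
    lowest_position = 0
    for position, base in enumerate(sequence.upper(), start=1):
        if base == "G":
            running += 1
        elif base == "C":
            running -= 1
        if running < lowest:
            lowest = running
            lowest_position = position
    return lowest_position


print(windowed_skew("GGGCCCGC", window=4, step=4))
print(cumulative_skew_minimum("CCCGGG"))
```

On a real chromosome the windows would be thousands of bases wide. The change in sign of the skew is what marks the switch between leading and lagging strands.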

Analysis of nucleotide sequence data
Search for sequence features
Promoters, ribosome-binding sites
Repeats, inverted repeats
Consensus sequences for regulator binding sites
Often rely on sequence motifs
tRNA, rRNA, ncRNA: tRNAscan, Rfam, RNAmmer

Gene Finding in bacteria
Ab initio gene prediction
By open reading frame
Find ORFs
Find credible CDSs within ORFs
Resolve conflicting ORFs
By codon usage
By Markov models
By homology
Similarity searches via protein or translated BLAST
Comparative genomics

Identifying protein-coding sequences
In bacteria, the quick and dirty approach is to find ORFs (open reading frames)
Stretches of sequence uninterrupted by termination codons
Termination codon can be any of 3: TAG, TGA, TAA
BUT variant genetic codes in mycoplasmas
Can be in any of 6 frames: 3 forward and 3 reverse
Do NOT necessarily start with an initiation codon
Do NOT confuse ORFs and CDSs
CDSs have an initiation codon
Can be any of 3 initiation codons: ATG, GTG, TTG
Has to be in the same frame as the termination codon unless the CDS is frame-shifted
Homology to other protein sequences can help identify a CDS
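
The distinction between an ORF and a CDS can be sketched in code. ORFs are found as stop-free stretches in each of the six frames, and a candidate CDS exists only if the ORF contains an in-frame initiation codon. This is a minimal illustration using the standard stop and start codons listed above. The function names, the `min_codons` cut-off and the test sequences are assumptions, and taking the first start codon is only one possible choice.

```python
from typing import Iterator, List, Optional, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(sequence: str, frame: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) of stop-free stretches in one frame (0-based, end exclusive)."""
    orf_start = frame
    for i in range(frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOPS:
            if i > orf_start:
                yield orf_start, i
            orf_start = i + 3
    last = frame + ((len(sequence) - frame) // 3) * 3
    if last > orf_start:
        yield orf_start, last  # ORF runs off the end of the sequence


def first_start(sequence: str, orf_start: int, orf_end: int) -> Optional[int]:
    """Return the position of the first in-frame initiation codon, or None."""
    for i in range(orf_start, orf_end, 3):
        if sequence[i:i + 3] in STARTS:
            return i
    return None


def six_frame_orfs(sequence: str, min_codons: int = 3) -> List[Tuple[str, int, int, int, Optional[int]]]:
    """List (strand, frame, ORF start, ORF end, candidate CDS start) in all six frames.

    Coordinates on the "-" strand are relative to the reverse-complemented sequence.
    """
    results = []
    for strand, strand_seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if (end - start) // 3 >= min_codons:
                    results.append((strand, frame, start, end,
                                    first_start(strand_seq, start, end)))
    return results


for record in six_frame_orfs("CCCATGAAATTTTAGGG"):
    print(record)
```

An ORF whose candidate CDS start is `None` has no initiation codon at all, so it cannot be a CDS as it stands.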

The problem of conflicting ORFs
Non-coding ORFs
CDSs (note ORF can extend upstream of start codon)

The Problem of Frameshift Errors
Actual sequence (• = stop codon)
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
Frame 1: MSTAKLVKSKATNLLYTRNDVSDSEK
Frame 2: •VPLN•LNQKRPICFIPATMSPTARK
Frame 3: EYR•IS•IKSDQSALYPQRCLRQREK
Frameshifted sequence after single base error
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
Frame 1: MSTAKLVKSKSDQSALYPQRCLRQRE
Frame 2: •VPLN•LNQKATNLLYTRNDVSDSEK
Frame 3: EYR•IS•IKKRPICFIPATMSPTARK
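
The effect of the single-base error can be reproduced by translating the slide's sequence before and after inserting one extra A. The sketch below uses the standard genetic code and prints stops as `*`. The helper names are assumptions, and the insertion point (after the tenth codon) is chosen to match the slide.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence: str, frame: int = 0) -> str:
    """Translate one forward frame (0, 1 or 2), dropping any trailing partial codon."""
    return "".join(CODON_TABLE[sequence[i:i + 3]]
                   for i in range(frame, len(sequence) - 2, 3))


ACTUAL = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACC"
          "CGCAACGATGTCTCCGACAGCGAGAAA")
# Simulate a sequencing error: one extra A inserted after the tenth codon.
SHIFTED = ACTUAL[:30] + "A" + ACTUAL[30:]

print(translate(ACTUAL))      # correct protein in frame 1
print(translate(SHIFTED))     # frame 1 diverges after the error
print(translate(SHIFTED, 1))  # the true protein continues in frame 2
```

After the error, the first ten residues are unchanged but the rest of frame 1 is wrong. The original downstream protein reappears in a different frame.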

CDS Prediction: Graphical Plots
GC content by reading frame
Amino-acid composition by reading frame, compared to average for globular proteins

CDS Prediction: Markov Models
Markov model-based programs
Use probabilities of states and transitions between these states to predict features
Glimmer is the industry standard for bacterial genomes
Can be trained on a related genome
Or use the long-ORFs (>500 codons) option to bootstrap a model
Problems
Smaller genes are not statistically significant so are thrown out
Algorithms are trained with sequences from known genes, which biases against genes about which nothing is known

Annotation of protein-coding genes
Structure and composition
Transmembrane domains
Signal peptides
Post-translational modifications
Homology to other proteins
Function(s)
Catalytic activity / cofactors / induction / regulation
Metabolic pathways
Structural genes
Cellular location
Phase variants, pseudogenes, SNPs, coding repeats, etc.

Annotation pipeline
Predict CDSs with Glimmer
On the predicted genes
Do homology searches (BLAST) against nearest relative
Port annotation across on orthologues
Apply in-depth analysis to strain-specific genes (or all genes if de novo sequence)
Domain searches: PFAM or CDD
PSI-BLAST
Perform other analyses: coiled coils, signal peptides, TM domains

Homology
Similarity that arises because of descent from a common ancestor…
“The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel… We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation… Languages, like organic beings, can be classed in groups under groups; and they can be classed either naturally according to descent, or artificially by other characters… The survival or preservation of certain favoured words in the struggle for existence is natural selection.”
Charles Darwin, 1871, The Descent of Man, Chapter 3

Homology
Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)
Homology is not just sequence similarity
Two sequences can be similar without any common ancestry, particularly if low complexity
the cat sat on the mat
die Katze sass auf der Matte
vge|GBant88-2  ITLITCVSVKDNSKRYVVAG
vge|GEfae9-178 LTLITCDQATKTTGRIIVIA
vge|GSpne1-403 MTLITCDPIPTFNKRLLVNF
sortase_staur  LTLITCDDYNEKTGVWEKRK

Types of Homology
Homologues can be divided into:
Orthologues: lines of descent congruent with whole genome
Paralogues: result of gene duplication
Xenologues: result of horizontal gene transfer (HGT)

Homology Searches
The aim of homology searches is to identify sequences within databases that are homologous to your sequence
This involves comparing your sequence with all the database sequences, looking for stretches of sequence that appear to be similar,
then scoring the matches and ranking them
A measure of the significance of the match is given

What is BLAST?
Basic Local Alignment Search Tool
Developed in 1990, refined in 1997 (Stephen Altschul)
A method of searching sequence databases to find sequences similar to the input sequence
Scans a database for alignments to a query sequence
Fastest and most frequently used sequence alignment tool: the industry standard
Can be extremely informative, giving clues to functionality, evolutionary history, important residues
Basis for many forms of bioinformatic analysis

The several flavours of BLAST
BLASTP: protein query versus protein sequence database
BLASTN: nucleotide query versus nucleotide sequence database
BLASTX: translated nucleotide query versus protein sequence database
TBLASTN: protein query versus translated nucleotide sequence database
TBLASTX: translated nucleotide query versus translated nucleotide sequence database
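
The five flavours amount to a lookup on the query type, the database type and whether sequences are translated. A minimal sketch of that lookup follows. The function name, the type labels and the `translate_both` flag are assumptions made for illustration and are not part of any BLAST interface.

```python
# (query type, database type) -> flavour, as listed on the slide.
BLAST_FLAVOURS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast(query_type: str, database_type: str, translate_both: bool = False) -> str:
    """Pick a BLAST flavour; `translate_both` selects TBLASTX for nucleotide vs nucleotide."""
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("TBLASTX needs a nucleotide query and a nucleotide database")
        return "TBLASTX"
    return BLAST_FLAVOURS[(query_type, database_type)]


print(choose_blast("nucleotide", "protein"))
print(choose_blast("nucleotide", "nucleotide", translate_both=True))
```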

Choosing the right flavour
What program will best suit your query and desired output?
If you are dealing with a protein-coding gene, comparisons at the protein level give better results
Sequence complexity: 20 amino acids versus 4 nucleotides
Moderately similar nucleotide sequences could encode a highly similar protein sequence!
Use BLASTP or a translated BLAST search, rather than BLASTN
Reserve BLASTN for non-coding regions or rRNA/tRNA genes

Low complexity filtering
Low complexity sequence with pronounced compositional bias can lead to spurious alignments
Modern versions of BLAST either take into account amino-acid composition or screen out regions of low complexity
At NCBI, adjustment for compositional bias is on but the low-complexity filter is off by default
For a “no stones unturned” approach, explore results with adjustments and filter on and off
Watch out for…
transmembrane or signal peptide regions
coiled-coil regions
short amino acid repeats (collagen, elastin)
homopolymeric repeats

Understanding BLAST Results
Graphic representation of results
Top of graph represents query sequence
Underlying bars show where hits occur
Colors represent alignment scores
Grey areas represent non similar regions surrounded by similar regions
Scrolling over bar shows accession and description of hit
Clicking on a bar takes you to its alignment with the query

Bit Scores
high is good
E-values
low is good
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/
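
Why a high bit score goes with a low E-value can be shown with the standard relation E = m × n × 2^(−S′), where S′ is the bit score, m the query length and n the database length. The sketch below assumes that relation and uses raw lengths. Real BLAST uses effective lengths, and the older sum statistics shown in the example output on the following slides are not reproduced. The numbers are hypothetical.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits scoring at least `bit_score` (E = m * n * 2**-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 300-residue query against 10 million residues.
for score in (30, 50, 70):
    print(score, expect_value(score, 300, 10_000_000))
```

Each extra bit halves the E-value, while doubling the database size doubles it. That is why the same alignment becomes less significant as databases grow.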

Typical Blast Output
                                                                         Sum
                                                           Reading High  Probability
Sequences producing High-scoring Segment Pairs:            Frame   Score P(N)        N
emb|X69337|ECDPS E.coli dps gene for binding protein       +2        834 6.4e-109    1
gb|U04242|ECU04242 Escherichia coli core starvation p...   +3        828 2.7e-106    1
emb|X14180|ECGLNHPQ Escherichia coli glutamine permeas...  +3        443 2.8e-53     1
gb|U18769|HDU18769 Haemophilus ducreyi fine tangled p...   +1        150 4.0e-18     2
dbj|D01016|ANALTI46 Anabaena variabilis lti46 gene. >e...  +2        129 4.8e-12     2
gb|M84990|P26BPO Plasmid pOP2621 ORF1 gene, 5' end;...     -2        131 6.7e-09     1
gb|U16121|HPU16121 Helicobacter pylori neutrophil act...   +1        112 1.8e-06     1
gb|M32401|TRPTYF1 T.pallidum pallidum antigen TyF1 g...    +3        101 5.6e-06     2
emb|X71436|RPNTRB R.phaseoli ntrB gene                     +1         67 0.76        2
gb|L35598|DRODGC1A Drosophila melanogaster receptor g...   +1         48 0.97        3

Typical Blast Output
gb|U18769|HDU18769 Haemophilus ducreyi fine tangled pili major pilin subunit gene
Length = 780
Plus Strand HSPs:
Score = 150 (68.0 bits), Expect = 4.0e-18, Sum P(2) = 4.0e-18
Identities = 36/89 (40%), Positives = 46/89 (51%), Frame = +1
Query:  30 ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV 89
           E L  ++    +L+LI K AHWN+ G  FIAVHEMLD     + D +D +AER   LG
Sbjct: 253 EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA 432
Query:  90 ALGTTQVINSKTPLKSYPLDIHNVQDHLK 118
             G +  +        YPL     QDHLK
Sbjct: 433 PNGLSGNLVETRQSPEYPLGRATAQDHLK 519
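
The "Identities" figure in this output can be checked by counting matching positions between the aligned query and subject lines. The sketch below joins the two alignment blocks shown above and counts identical residues. The function name is an assumption, and the alignment contains no gaps, so gap handling is left out.

```python
from typing import Tuple

# Aligned query and subject residues from the two blocks of the slide's alignment.
QUERY = ("ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV"
         "ALGTTQVINSKTPLKSYPLDIHNVQDHLK")
SBJCT = ("EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA"
         "PNGLSGNLVETRQSPEYPLGRATAQDHLK")


def identity(aligned_a: str, aligned_b: str) -> Tuple[int, int]:
    """Return (identical positions, alignment length) for two gap-free aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b)
    return matches, len(aligned_a)


matches, length = identity(QUERY, SBJCT)
print(f"Identities = {matches}/{length} ({100 * matches / length:.0f}%)")
```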

Domain database searches
Rationale
Now that databases are very large, it can be difficult to interpret BLAST results when there are 1000s of hits
If one part of a protein has many hits and another part has few hits, useful information may get swamped or lost
Solution
Search databases that contain collections of protein domains/families
Pfam: pfam.sanger.ac.uk/
CDD: www.ncbi.nlm.nih.gov/cdd
Represented as sequence alignments and/or HMMs
Annotated with information about key features of domain

Pfam domains

Pfam search results

The Annotation Catastrophe
Figure labels: Protein A (“a protease”), Protein B, Protein C; signal peptide; protease; coiled coil domain
Homology lies in one domain
But functional assignment for whole of protein A comes from another domain, carried across in error, so proteins B and C get misannotated as proteases
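
One way to guard against this error is to check that a hit actually covers the domain responsible for the function before carrying the annotation across. The sketch below is a hypothetical safeguard, not a method described in the lecture. The residue coordinates, the function names and the 80% threshold are all assumptions.

```python
def covered_fraction(hit_start: int, hit_end: int,
                     domain_start: int, domain_end: int) -> float:
    """Fraction of the domain (inclusive residue coordinates) covered by the hit."""
    overlap = min(hit_end, domain_end) - max(hit_start, domain_start) + 1
    domain_length = domain_end - domain_start + 1
    return max(overlap, 0) / domain_length


def may_transfer_function(hit_start: int, hit_end: int,
                          domain_start: int, domain_end: int,
                          min_fraction: float = 0.8) -> bool:
    """Allow annotation transfer only if the hit covers most of the functional domain."""
    return covered_fraction(hit_start, hit_end, domain_start, domain_end) >= min_fraction


# Hypothetical protein A: protease domain at residues 40-260, coiled coil at 300-420.
PROTEASE_DOMAIN = (40, 260)
print(may_transfer_function(305, 415, *PROTEASE_DOMAIN))  # hit lies in the coiled coil
print(may_transfer_function(35, 270, *PROTEASE_DOMAIN))   # hit spans the protease domain
```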

Annotation: rules to consider
Don’t trust your computer blindly
Adopt Cartesian doubt!
Examine and think about your results
Confirm with multiple lines of evidence: BLAST, genomic context, PFAM

Overview
Features of Bacterial Genomes
Genome Sequencing
Assembly of bacterial genomes
Annotation of bacterial genomes
Identifying and annotating CDSs
An ORF is NOT a CDS!
Power and pitfalls of using homology
BLAST and PFAM