sequence & course material repository annotation (sequences & evidence) manuals (dna,...


Upload: jessica-jenkins

Post on 02-Jan-2016


Page 1: Sequence & course material repository  Annotation (sequences & evidence) Manuals (DNA, Subway, Apollo, JalView) Presentations

Sequence & course material repository

http://gfx.dnalc.org/files/evidence

• Annotation (sequences & evidence)
• Manuals (DNA Subway, Apollo, JalView)
• Presentations (.ppt files)
• Prospecting (sequences)
• Readings (Bioinformatics tools, splicing, etc.)
• Worksheets (Word docs, handouts, etc.)
• BCR-ABL (temporary; not course-related)

Page 2

Manifestations of a Code

Genes, genomes, bioinformatics and cyberspace – and the promise they hold for biology education

Page 3

Plants are amazing – and so are their genomes

• Largest flower (~1 m)
• Oldest plant (> 5000 years)
• Tallest organism (> 100 m)

Slide: ASPB, 2009

Page 4

What is a genome?

A GENOME is all of a living thing’s genetic material.

The genetic material is DNA (DeoxyriboNucleic Acid).

DNA, a double-helical molecule, is made up of four nucleotide “letters”: A, T, G, and C. In the double helix, A pairs with T and G pairs with C.

Slide: JGI, 2009

Page 5

What is sequencing?

Just as computer software is rendered in long strings of 0s and 1s, the GENOME or “software” of life is represented by a string of the four nucleotides, A, G, C, and T.

To understand the software of either a computer or a living organism, we must know the order, or sequence, of these informative bits.

Slide: JGI, 2009
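To make the 0s-and-1s analogy concrete, here is a minimal Python sketch. The 2-bit code per nucleotide (A=00, C=01, G=10, T=11) and the function names are illustrative assumptions, not something defined in the slides; any one-to-one assignment of bit pairs to letters would work equally well.

```python
# Illustrative assumption: one arbitrary 2-bit code per nucleotide.
BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
LETTERS = {bits: base for base, bits in BITS.items()}


def dna_to_bits(seq):
    """Encode a DNA string as a string of 0s and 1s (2 bits per nucleotide)."""
    return "".join(BITS[base] for base in seq.upper())


def bits_to_dna(bits):
    """Decode a bit string produced by dna_to_bits back into DNA letters."""
    return "".join(LETTERS[bits[i:i + 2]] for i in range(0, len(bits), 2))


encoded = dna_to_bits("GATTACA")
print(encoded)               # the sequence as bits
print(bits_to_dna(encoded))  # round trip back to letters
```

In both representations the information lies entirely in the order of the symbols, which is why the sequence has to be determined before anything else can be analyzed.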

Page 6

Exciting?>mouse_ear_cress_1080 GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTTCGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGACCTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGTTGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTTTCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGACTTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAAGCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCCCAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGGAAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTTGGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAATAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTTCTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAAAGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACACTTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTTTACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATATTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCATTCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACAATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAACAGACTATTGTTGTGCCTATCTTGTT
TAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTATGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCTCTTAGTGTTGTTCAATCTCCTCAATAGGTATGAAGTTACAATATCCTTATTATTTTGCAGGGACGCACTTGATGCACTCCAGCTAGTCAGATACTGCTGCAGGCGTATGCTAATGACCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGT

Page 7

Much better

Page 8

Annotation workflow

Diagram labels:
• Get DNA sequence
• Generate mathematical evidence
• Gather biological evidence
• Build gene models
• Browse in context
• Find gene families
• Analyze large amounts of data

Page 9

Walk or…

Page 10

…take DNA Subway

Page 11

Molecular biology and bioinformatics concepts

RepeatMasker
• Eukaryotic genomes contain large amounts of repetitive DNA.
• Transposons can be located anywhere.
• Transposons can mutate like any other DNA sequence.

FGenesH Gene Predictor
• Protein-coding information begins with a start codon, is followed by codons, and ends with a stop codon.
• Codons in mRNA (AUG, UAA,…) have sequence equivalents in DNA (ATG, TAA,…).
• Most eukaryotic introns have “canonical splice sites,” GT---AG (mRNA: GU---AG).
• Gene prediction programs search for patterns to predict genes and their structure.
• Different gene prediction programs may predict different genes and/or structures.

Multiple Gene Predictors
• The protein-coding sequence of an mRNA is flanked by untranslated regions (UTRs).
• UTRs carry information that determines mRNA half-life and serves regulatory purposes.
• Gene > mRNA > CDS.

BLAST Searches
• Gene or protein homologs share similarities due to common ancestry.
• Biological evidence is needed to curate gene models predicted by computers.
• mRNA transcripts and protein sequence data provide “hard” evidence for genes.
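The correspondence between mRNA codons and their DNA equivalents (AUG and ATG, UAA and TAA) can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the DNA is written as the coding (sense) strand, so that the only difference from the mRNA is T in place of U.

```python
def mrna_to_dna(mrna):
    """Return the coding-strand DNA equivalent of an mRNA string (U -> T)."""
    return mrna.upper().replace("U", "T")


def dna_to_mrna(coding_dna):
    """Return the mRNA equivalent of a coding-strand DNA string (T -> U)."""
    return coding_dna.upper().replace("T", "U")


START_CODON = "AUG"
STOP_CODONS = ["UAA", "UAG", "UGA"]

print(mrna_to_dna(START_CODON))                # DNA equivalent of the start codon
print([mrna_to_dna(c) for c in STOP_CODONS])   # DNA equivalents of the stop codons
```

The same substitution explains why the canonical intron boundaries are written GT---AG in DNA and GU---AG in mRNA.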

Page 12

How do we find genes?

• Search for them
• Look them up

Page 13

How do I get to this…

Page 14

From this…

>mouse_ear_cress_1080 GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTTCGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGACCTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGTTGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTTTCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGACTTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAAGCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCCCAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGGAAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTTGGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAATAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTTCTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAAAGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACACTTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTTTACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATATTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCATTCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACAATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAACAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAG
ATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTATGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCTCTTAGTGTTGTTCAATCTCCTCAATAGGTATGAAGTTACAATATCCTTATTATTTTGCAGGGACGCACTTGATGCACTCCAGCTAGTCAGATACTGCTGCAGGCGTATGCTAATGACCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGT

Page 15

Meaning?

Page 16

Mathematical Tools (Code; statistics)

Page 17

Comparative Tools (Database searches)

Page 18

What do we know about genes?

• Expressed (Transcribed)
  – Transcriptional start & termination sites (TXSS, TXTS)
  – Transcription artefacts (cDNA & ESTs)
• Regulated
  – Promoters (TATAAA)
  – Transcription Factor Binding Sites
  – CpG (Cytosine methylation)
• Meaningful (Translated)
  – 3n base pairs
  – Codon usage
  – Translational start & stop/termination codons (TLSS, TLTS)
  – Translation artefacts (proteins)
• Spliced
  – Splice sites (GT-AG)
• Derived (Homology: Paralogy/Orthology)
  – Search for known genes, proteins (BLAST)

Page 19

How might this knowledge help to find genes?

• Predict genes
  – Look for potential starts and stops.
  – Connect them into open reading frames (ORFs).
  – Filter for “correct” length & codon usage.
• Search databases
  – Known genes: UniGene
  – Known proteins: UniProt
• Use transcript evidence
  – cDNA
  – ESTs
  – proteins

Page 20

Operating computationally

• Go to the beginning of the sequence and start SCAN.
• If ATG is found, register a putative TLSS; then:
  – Move in steps of 3 nucleotides and count the steps (= COUNTS).
  – If a 3-step equals TAA, TAG, or TGA, register a putative TLTS.
  – If a TLTS is registered, evaluate COUNTS (= triplets).
• If COUNTS < minimum, discard; then go to just behind the ATG above and resume SCAN.
• If COUNTS > maximum, discard; then go to just behind the ATG above and resume SCAN.
• If minimum < COUNTS < maximum, record as GENE with TLSS and TLTS; then go to just behind the ATG above and resume SCAN.
• On arriving at the end of the sequence, stop SCAN.
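The scanning procedure on this slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the method of any particular gene predictor: the function name find_orfs and the default values for min_codons and max_codons are hypothetical (the slide gives no numbers), only the forward strand is scanned, and introns are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2, max_codons=1000):
    """Return (start, end, codon_count) for each accepted ATG...stop stretch.

    start/end are 0-based, end-exclusive positions; codon_count counts the
    triplets from the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    pos = 0                                   # start SCAN at the beginning
    while True:
        start = seq.find("ATG", pos)          # putative TLSS
        if start == -1:
            break                             # end of sequence: stop SCAN
        count = 0
        for i in range(start, len(seq) - 2, 3):   # move in steps of three
            if seq[i:i + 3] in STOP_CODONS:       # putative TLTS
                if min_codons < count < max_codons:
                    orfs.append((start, i + 3, count))
                break                         # too short or too long: discard
            count += 1
        pos = start + 3                       # resume just behind this ATG
    return orfs


example = "CCATGAAATTTTAGGG"
for start, end, count in find_orfs(example):
    print(start, end, count, example[start:end])
```

An ATG that never reaches an in-frame stop codon is silently dropped here; a real predictor would also have to handle the reverse strand, splicing, and codon usage, as the preceding slides point out.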

Page 21

Annotation workflow

Diagram labels:
• Get/Generate sequence
• Mathematical evidence
• Biological evidence
• Construct gene models
• Browse results
• Browse in context
• Find gene families
• Analyze large data sets

Page 22

Annotation Cheat Sheet

A. DNA Subway
• Open an existing project or generate a new one (Red square)
• Run RepeatMasker
• Generate evidence (Predictions, BLAST searches)
• Synthesize evidence into gene models (Apollo)
• Browse results locally and in context (Phytozome)
• Conduct functional analysis (link from Browser)
• Prospect for gene family (Yellow Line from Browser)

B. Apollo
• Select region that holds biological gene evidence
• Optimize work space and zoom to region (View tab)
• Expand all tiers (Tiers tab)
• Drag evidence item(s) onto workspace (mouse)
• Edit to match biological evidence (right-click item for tools)
• Record what was done in Annotation Info Editor
• Assess necessity to build alternative model(s)
• Upload model(s) to DNA Subway (File tab)

Page 23

Predictors (mathematical evidence)

• Utilize predominantly mathematical (statistical) methods.
• Search for patterns:
  – Some score starts, stops, splice sites (GenScan).
  – Some score nucleotides (Augustus, FGenesH).
• Few incorporate EST data and/or known genes/proteins.
• Require optimization for each new species (training).
• Accuracy:
  – False positives (scoring non-genes as genes): 5%-50%.
  – False negatives (missed genes): 5%-40%.
  – Weak at, or incapable of, determining first and last exons and UTRs.
• Specific for gene models (spliced genes, non-spliced genes).
• Specialty predictors (tRNA Scan, RepeatMasker).
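One way to read the accuracy figures is as set comparisons between predicted and known genes. The sketch below is only an illustration: the slide does not say how the percentages were computed, so the definitions used here (false positives as a share of all predictions, false negatives as a share of all known genes), the function name, and the gene identifiers are all assumptions.

```python
def prediction_error_rates(predicted, known):
    """Return (false_positive_rate, false_negative_rate) for two gene sets.

    Assumed definitions: false positives are predictions that match no known
    gene, as a fraction of all predictions; false negatives are known genes
    that were not predicted, as a fraction of all known genes.
    """
    predicted, known = set(predicted), set(known)
    false_positives = predicted - known
    false_negatives = known - predicted
    fp_rate = len(false_positives) / len(predicted) if predicted else 0.0
    fn_rate = len(false_negatives) / len(known) if known else 0.0
    return fp_rate, fn_rate


# Made-up identifiers, for illustration only.
predicted = {"geneA", "geneB", "geneC", "geneD", "not_a_gene"}
known = {"geneA", "geneB", "geneC", "geneD", "geneE", "geneF"}

fp_rate, fn_rate = prediction_error_rates(predicted, known)
print(f"false positives: {fp_rate:.0%}, false negatives: {fn_rate:.0%}")
```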

Page 24

Search tools (biological evidence)

• Search sequence databases:
  – Known genes
  – Known proteins
  – cDNAs & ESTs
• Utilize alignment methods (BLAST, BLAT).
• Reliability:
  – Good in determining gene locations and general gene structures.
  – Weak in exactly determining exon/intron borders.
  – Unlikely to correctly determine TXSS and TXTS.
  – Should be used with cDNA/EST from same species.

Page 25

mRNA Splicing

During RNA processing, internal segments are removed from the transcript and the remaining segments are spliced together.

Internal RNA segments that are removed are named introns; the spliced segments are defined as exons.

• Causes mRNA to be “missing” segments present in the DNA template and primary transcript.
• Most transcripts in eukaryotes are spliced.
• Erosion: 1-exon genes (no exons without introns).
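As a minimal sketch of what splicing does to a transcript, the Python function below removes introns from a pre-mRNA string and joins the remaining exons. The function name, the coordinate convention (0-based, end-exclusive), and the toy sequence are assumptions made for illustration; in a cell the intron positions are recognized by the splicing machinery rather than supplied as numbers.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as (start, end) 0-based end-exclusive intervals,
    and join the remaining segments (exons) in their original order."""
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])   # keep the exon before this intron
        pos = end                           # skip over the intron
    exons.append(pre_mrna[pos:])            # keep the last exon
    return "".join(exons)


# Toy transcript: exon 1, one intron (starts with GU, ends with AG), exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
mature = splice(pre_mrna, [(6, 18)])
print(mature)
```

The mature mRNA is shorter than the primary transcript, which is why aligning a cDNA back to genomic DNA reveals gaps at the intron positions.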

Page 26

Canonical splice sites

Figure labels: pre-mRNA; Exon, Intron, Exon; 5’ splice site; 3’ splice site. (Reddy, S.N. Annu. Rev. Plant Biol. 2007 58:267-94)

Of 1588 examined predicted splice sites in Arabidopsis, 1470 sites (93%) followed the canonical GT…AG consensus. (Plant (2004) 39, 877–885)
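The canonical GT…AG rule and the quoted percentage can be checked with a short Python sketch. The helper names and the three example introns are made up for illustration; only the counts 1470 and 1588 come from the slide.

```python
def is_canonical(intron_dna):
    """True if an intron (DNA letters) starts with GT and ends with AG."""
    s = intron_dna.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


def canonical_fraction(introns):
    """Fraction of the given introns that follow the GT...AG consensus."""
    return sum(is_canonical(i) for i in introns) / len(introns)


# Made-up introns: two canonical, one starting with GC instead of GT.
examples = ["GTAAGTTTTCAG", "GTATGCCTTTAG", "GCAAGTTTTCAG"]
print(canonical_fraction(examples))

# The figure quoted on the slide: 1470 of 1588 sites.
print(f"{1470 / 1588:.1%}")   # about 92.6%, which the slide rounds to 93%
```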

Page 27

Alternative Splicing

Multiple splice variants = multiple proteins from the same gene

Not a rare event!!!

- Alternative splice sites C’ and D’ lead to different splice variants
- JAZ10.3: premature stop codon in D exon, intact JAS domain
- JAZ10.4: truncated C exon, protein lacks JAS domain
- JAZ10 encoded by At5G13220
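How one gene can yield more than one protein can be illustrated with a toy example in Python. The exon sequences below are invented and are not the JAZ10 sequences; the sketch assumes the standard genetic code and models an alternative splice site that shortens the middle exon by one codon.

```python
from itertools import product

# Standard genetic code in compact form: codons in TCAG order, "*" = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence (DNA letters) until the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented exons for illustration only.
EXON_1 = "ATGGCT"
EXON_2_FULL = "GAAAAA"
EXON_2_SHORT = "GAA"      # alternative splice site truncates exon 2
EXON_3 = "TGGTAA"

variant_1 = EXON_1 + EXON_2_FULL + EXON_3
variant_2 = EXON_1 + EXON_2_SHORT + EXON_3

print(translate(variant_1))
print(translate(variant_2))
```

Both variants come from the same exons, yet the second protein lacks one amino acid; a truncation that shifted the reading frame would change everything downstream and could introduce a premature stop codon.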

Page 28

Example: Jasmonate signaling in Arabidopsis

- Jasmonate is a plant hormone; it affects cell division, growth, reproduction, and responses to insects, pathogens, and abiotic stress factors.

- Splice variants JAZ10.1, JAZ10.3, and JAZ10.4 of the jasmonate signaling repressor protein JAZ10 differ in susceptibility to degradation.

- Phenotypic consequences include male sterility and altered root growth.

Page 29

Example: Disease resistance in tobacco

- Nicotiana tabacum resistance gene N is involved in resistance to TMV.
- Alternative splicing is required to achieve resistance.
- Alternative transcripts: NS (short) and NL (long).
- NS encodes a full-length protein, NL a truncated protein.
- Splice variants produced by alternative splicing confer resistance (D).
- Splice variants produced by cDNAs do not confer resistance (A, B, C).

Page 30

Molecular biology and bioinformatics concepts

RepeatMasker
• Eukaryotic genomes contain large amounts of repetitive DNA.
• Transposons can be located anywhere.
• Transposons can mutate like any other DNA sequence.

FGenesH Gene Predictor
• Protein-coding information begins with a start codon, is followed by codons, and ends with a stop codon.
• Codons in mRNA (AUG, UAA,…) have sequence equivalents in DNA (ATG, TAA,…).
• Most eukaryotic introns have “canonical splice sites,” GT---AG (mRNA: GU---AG).
• Gene prediction programs search for patterns to predict genes and their structure.
• Different gene prediction programs may predict different genes and/or structures.

Multiple Gene Predictors
• The protein-coding sequence of an mRNA is flanked by untranslated regions (UTRs).
• UTRs carry information that determines mRNA half-life and serves regulatory purposes.
• Gene > mRNA > CDS.

BLAST Searches
• Gene or protein homologs share similarities due to common ancestry.
• Biological evidence is needed to curate gene models predicted by computers.
• mRNA transcripts and protein sequence data provide “hard” evidence for genes.

Page 31

…take DNA Subway