Lecture 4: Sequencing · bioinf.gen.tcd.ie/~faresm/page12/files/genomics... · 2015. 3. 17.
-
LECTURE 4 sequencing
-
Structure of DNA sequences
-
Know your 5’ and 3’
-
Bases 40-120 of a DNA sequencing read from an automated sequencer (Applied Biosystems, ABI377)
-
bases 40-50 … … bases 820-880
-
Sequencing DNA (Sanger method)
• Raw data consists of "reads" of ~700 bp
  – A read costs about €1 to obtain.
  – A read gives you information about the sequence downstream of a known starting point (primer binding site).
• To get longer stretches of sequence information, we overlap data from reads that begin at different starting points.
• How do we get different starting points?
  – (old) Restriction enzymes (subclone the cloned DNA fragment).
  – (newer) Primer walking (use custom primers).
  – (newest) Shotgun sequencing (just sequence lots of random clones).
-
Illumina Sequencing
a) Shear the DNA
b) Create sequencing library
c) Place the library on a solid flow cell
d) PCR
e) Reads using the machine
f) Sequencing by synthesis
-
>frag1.f LEN=117
CTGCTAGAAGAAATTCATTCCTAGTTAATGCGAGAGAAGCTAATGATAATTTAGAAAGATTAGAAAACCAATTTTATTAAATTAGTGGAGAGCAATAAGGATAATTCTCATCATATT
>frag1.r LEN=234
CGTTACAAAAGGCATCAAGAAACTCAGAGTTTATCTCAATTGATCATTACTTTGCAACTTTAAAGACCCTTCCATTTCGTTATATATACCCCATTATTACTATCAAAATGCTAGATGCCCAAACCCTCCTGGACGTTATTGTGAAAAGGCACGGAGCTGACGACGATTCGGACCATGTGCCAGCGTGTGAGACTTCCAACGATTATGACGGGCGCATGAATCTTCGTATTCTGT
>frag2.f LEN=241
CATAGGAACACGTATATTGCTTTTGCCCTTATTAGTTTCGGCTTCCGGGCACCCTTCGGGTCACGATCTGTACCCTCACGGTTTCGGAGTTGCTGCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGCCCGTAGCGCAGTGGAAGCTGGGGGTTGCCCAGCAACGTCTGTTCATGTATATAAGGCCCATGAGGTATGGAATGTGACATTCTGCCTCCACAATTGAAGGACCTAAATCG
>frag2.r LEN=97
TTTTTATTTATAAAACAACAAAAAAGGAGATATTAAAGTTGAACGAAAGAGATATATACCTTTCAAAAAGCCCCAATCCAGATTAATAATTTTGAAA
>frag3.f LEN=382
CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCAATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTATAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACTCTAAAGTTACGTTACGCTAAGTTACGTTACGCtaaagcctgaacttaggaacacatctaatattaccaagcacaggtaacatatatttagtaggaacataggaaacacattacaaatatataaggaatacaaaacngtatattaacgaacatcaaaattaggacttgccgttaaaggacgcatacttanatccaaggaacac
>frag3.r LEN=282
ATGAACAGACGTTGCTGGGCAACCCCCAGCTTCCACTGCGCTACGGGCTTTTCTGCACACATTGTAGGCACAGGTACCTTTGCAGCAACTCCGAAACCGTGAGGGTACAGATCGTGACCCGAAGGGTGCCCGGAAGCCGAAACTAATAAGGGCAAAAGCAATATACGTGTTCCTATGAAAAAATTAAGGCAAGGTGAACAGCAAGACGGGGAGCAATATTGCTGTATTGCAAGATAAAAAGCAATGTGAGTGGTGTTCCTGTATATAAGTATGCGTTCCTTT
>frag4.f LEN=46
CATATTTAATTGTAAAATATTTTGTCTAATATTAAATTTCAATTTA
>frag4.r LEN=360
GTGTTCCTGTATATAAGTATGCGTTCCTTTAACGGCATCCTAATTTGATGTTCGTAATATATGTTTGTATTCCTTATATATTTGTATGTGTTCCTATGTTCCTAATAAATATATGTTCCTGTGCTTGTAATATAGATGTGTTCCTAAGTTCAGGCTTAGCGTAACGTAACTTAGCGTAACGTAACTTAGAGTAACGTAACTTAGCGTAGTGTAACTAAGCGAAATCTAGGGCTAAAAGTTAGGCTATGGTTACGTCTGCGACTACGGCTACGGCTACGACGACGACGACTGCCACTGCCATTGCCACTGCCACTGCCATTACGG
>frag5.f LEN=105
AGTAATAAAAAATGACTCATCCTTCTAATGATTCCACATTATCAACGAAGTAATAGTAATTCTAGTCTACAAAAATATGAATACTCCACCAAAGTACGAGCAAGT
>frag5.r LEN=244
GGCTTCCGGGCACCCTTCCGGGTCGACGATCTGTACCCTCACGGTTTCGGTAGTTGCTGCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGACCCGTAGCGCAGTGGAAGCGTGGGGGTTCGGCCCACGCAACGTCTGTTCATGTATAATAACGGCCCCATGGAGGTAGTGGAATGTGACATTCTGCCTCCACAATTGAAGGAACCTAAATCCTTAACAAAAGGCATCAAGAAACTCAGAG
>frag6.f LEN=399
TTGTAGGCACAGGTACCTTTGCAGCAACTCCGAAACCGTGAGGGTACAGATCGTGACCCGAAGGGTGCCCGGAAGCCGAAACTAATAAGGGCAAAAGCAATATACGTGTTCCTATGAAAAAAATTAAAGGCAAGGTGAACAGCAAGACGGGGAGCAATATTGCTGTATTGCAAGATAAAAAGCAATGTGAGTGGTGTTCCTGTATATAAGTATGCGTTCCTTTAACGGCATCCTAATTTGATGTCCGTAATATATGTTTGTATTCCTTATATATTTGTATGTGTTCCTATGTTCCTAATAAATATATGTTCCTGTGCTTGTATTATAGATGTGTTCCTAAGTTCAGGCTTAGCGTAACGTAACTTAGCGTAACGTAACTTAGAGTAACGTAACTTAGCG
>frag6.r LEN=150
TAGATTTACATTTTTTCATATTTAATTGTAAAATATTTTGTCTAATATTAAATTTCAATTTATTTAAAAATACTATTTTGATCAGTAGCCACATAATAATCCACATGATTACTCATCACAATACTTAAAAATGCCTGATATTCCAGAACC
>frag7 LEN-485
ACAGCAATATTGCTCCCCGTCTTGCTGTTCACCTTGCCTTTAATTTTTTTCATAGGAACACGTATATTGCTTTTGCCCTTATTAGTTTCGGCTTCCGGGCACCCTTCGGGTCACGATCTGTACCCTCACGGTTTCGGAGTTGCTGCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGCCCGTAGCGCAGTGGAAGCTGGGGGTTGCCCAGCAACGTCTGTTCATGTATATAAGGCCCATGAGGTATGGAATGTGACATTCTGCCTCCACAATTGAAGGACCTAAATCGTTAACAAAAGGCATCAAGAAACTCAGAGTTTATCTCAATTGATCATTACTTTGACAACTTTAAAGACCCTTCCATTTCGTTATATATACCCATTATTACTAATCAAATGCTAGATCGACCCAAACCCTCCTGGAACGTTTAGTTAGTAGAAAAGCGAACGGAGCTGAACGACATTCGGACCATGTCGCCAGCGT
…
…
…
…
…

Sequence assembly: finding overlaps to merge sequence reads into longer pieces of sequence. Particularly important and difficult for whole-genome shotgun sequencing. The input to the assembler program is a file of sequence reads. Forward and reverse pairs.
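The core idea, finding where the end of one read matches the start of another, can be sketched in a few lines of Python. This is a toy illustration only: the function names are hypothetical, matches must be exact, and reverse-complemented reads are not considered. Real assemblers tolerate sequencing errors and compare reads in both orientations.

```python
def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    `min_len` is assumed to be at least 1.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left, right, min_len=3):
    """Merge two reads into one longer sequence if they overlap, else None."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    # Keep `left` whole and append the part of `right` beyond the overlap.
    return left + right[k:]


if __name__ == "__main__":
    read_a = "ACGTTGCA"
    read_b = "TGCAGGTT"
    print(longest_overlap(read_a, read_b))  # overlap is "TGCA"
    print(merge_reads(read_a, read_b))
```

Requiring a minimum overlap length guards against merging reads that share only a few bases by chance.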
-
Reads overlap
-
Output from assembler program
Some famous assemblers: Phrap; CAP4; Celera (TIGR) assembler; Arachne; CLC. Good assemblers make use of base-quality information (Phred scores) and mate-pair info.
Singletons (reads that don’t overlap others)
>frag1.f LEN=117 QL=1 QR=544
CTGCTAGAAGAAATTCATTCCTAGTTAATGCGAGAGAAGCTAATGATAATTTAGAAAGAT
TAGAAAACCAATTTTATTAAATTAGTGGAGAGCAATAAGGATAATTCTCATCATATT
>frag2.r LEN=97 QL=1 QR=97
TTTTTATTTATAAAACAACAAAAAAGGAGATATTAAAGTTGAACGAAAGAGATATATACC
TTTCAAAAAGCCCCAATCCAGATTAATAATTTTGAAA
>frag5.f LEN=105 QL=1 QR=462
AGTAATAAAAAATGACTCATCCTTCTAATGATTCCACATTATCAACGAAGTAATAGTAAT
TCTAGTCTACAAAAATATGAATACTCCACCAAAGTACGAGCAAGT

Contigs (assembled from overlapping reads)
>Contig1
CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCA
ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTA
TAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACT
CTAAGTTACGTTACGCTAAGTTACGTTACGCTAAGCCTGAACTTAGGAACACATCTATAT
TACAAGCACAGGAACATATATTTATTAGGAACATAGGAACACATACAAATATATAAGGAA
TACAAACATATATTACGGACATCAAATTAGGATGCCGTTAAAGGAACGCATACTTATATA
CAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCCCGTCTTG
CTGTTCACCTTGCCTTTAATTTTTTTCATAGGAACACGTATATTGCTTTTGCCCTTATTA
GTTTCGGCTTCCGGGCACCCTTCGGGTCACGATCTGTACCCTCACGGTTTCGGAGTTGCT
GCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGCCCGTAGCGCAGTGGAAGCTGGGG
GTTGCCCAGCAACGTCTGTTCATGTATATAAGGCCCATGAGGTATGGAATGTGACATTCT
GCCTCCACAATTGAAGGAACCTAAATCCTTAACAAAAGGCATCAAGAAACTCAGAGTTTA
TCTCAATTGATCATTACTTTGACAACTTTAAAGACCCTTCCATTTCGTTATATATACCCC
ATTATTACTAACAAAATGCTAGATCGACCCAAACCCTCCTGGACGTTATTGTGAAAAGGC
ACGGAGCTGACGACGATTCGGACCATGTGCCAGCGTGTGAGACTTCCAACGATTATGACG
GGCGCATGAATCTTCGTATTCTGT
>Contig2
TAGATTTACATTTTTTCATATTTAATTGTAAAATATTTTGTCTAATATTAAATTTCAATT
TATTTAAAAATACTATTTTGATCAGTAGCCACATAATAATCCACATGATTACTCATCACA
ATACTTAAAAATGCCTGATATTCCAGAACC

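The base-quality information that good assemblers use is expressed as Phred scores. A Phred score Q encodes the estimated probability P that a base call is wrong, with Q = -10 · log10(P), so Q = 20 means a 1 in 100 chance of error and Q = 30 means 1 in 1000. A minimal Python sketch of the conversion follows; the function names are illustrative and not taken from any assembler.

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability `p` (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

An assembler can use these probabilities to trust a high-quality base over a low-quality one when two overlapping reads disagree.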
-
Contigs
-
Sequence assembly: reads, contigs, scaffolds
-
DETAILED DISPLAY OF CONTIGS
******************* Contig 1 ********************
              .    :    .    :    .    :    .    :    .    :    .    :
frag3.f+  CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCA
frag4.r-                                       CCGTAATGGCAGTGGCAGTGGCA
          ____________________________________________________________ 60
consensus CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCA

              .    :    .    :    .    :    .    :    .    :    .    :
frag3.f+  ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTA
frag4.r-  ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGT-CGC-AGACGT-AACC--A
          ____________________________________________________________ 120
consensus ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTA

              .    :    .    :    .    :    .    :    .    :    .    :
frag3.f+  TAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACT
frag4.r-  TAGCCTAACTTTTA-GCCCTAGATTTCGCTTAGTTACACT-ACGCTAA-GTTACGTTACT
frag6.f-                                                 AAGTTACGTTACT
          ____________________________________________________________ 180
consensus TAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACT

              .    :    .    :    .    :    .    :    .    :    .    :
frag3.f+  CTAAAGTTACGTTACGCTAAGTTACGTTACGCTAAAGCCTGAACTTAGGAACACATCTAA
frag4.r-  CTAA-GTTACGTTACGCTAAGTTACGTTACGCTAA-GCCTGAACTTAGGAACACATCTA-
frag6.f-  CTAA-GTTACGTTACGCTAAGTTACGTTACGCTAA-GCCTGAACTTAGGAACACATCTA-
          ____________________________________________________________ 240
consensus CTAA-GTTACGTTACGCTAAGTTACGTTACGCTAA-GCCTGAACTTAGGAACACATCTA-

              .    :    .    :    .    :    .    :    .    :    .    :
frag3.f+  TATTACCAAGCACAGGTAACATATATTTAGTAGGAACATAGGAAACACATTACAAATATA
frag4.r-  TATTAC-AAGCACAGG-AACATATATTTATTAGGAACATAGGAA-CACAT-ACAAATATA
frag6.f-  TAATAC-AAGCACAGG-AACATATATTTATTAGGAACATAGGAA-CACAT-ACAAATATA
          ____________________________________________________________ 300
consensus TATTAC-AAGCACAGG-AACATATATTTATTAGGAACATAGGAA-CACAT-ACAAATATA

              .    :    .    :    .    :    .    :    .    :    .    :
frag3.f+  TAAGGAATACAAA
frag4.r-  TAAGGAATACAAA
frag6.f-  TAAGGAATACAAACATATATTACGGACATCAAATTAGGATGCCGTTAAAGGAACGCATAC
frag3.r-                                                AAAGGAACGCATAC
          ____________________________________________________________ 360
consensus TAAGGAATACAAACATATATTACGGACATCAAATTAGGATGCCGTTAAAGGAACGCATAC

              .    :    .    :    .    :    .    :    .    :    .    :
frag6.f-  TTATATACAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCC
frag3.r-  TTATATACAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCC
frag7+                                               ACAGCAATATTGCTCCC
          ____________________________________________________________ 420
consensus TTATATACAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCC
-
Handling sequences on a computer
Sequence files should be plain text files (like .txt) if you’re going to use them as input to programs.
Open/Edit them with a simple text editor like Notepad (PC) or TextEdit (Mac), not Microsoft Word.
To print sequences/alignments for humans to read, use Courier or Monaco font.
>myseq in Courier font (60 bases per line)
gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat
tccactgatttacctgcaataccggattcaattatattgaagtaccacatccctccgaat
ttacaaaagtacgagcaccaggatatgaatatgagtcgattttggcgttcgttcacaaaa

>myseq in Times New Roman font (60 bases per line)
gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat
Tccactgatttacctgcaataccggattcaattatattgaagtaccacatccctccgaat
Ttacaaaagtacgagcaccaggatatgaatatgagtcgattttggcgttcgttcacaaaa

FASTA format is a very common, simple format for sequences that will be read by a computer:
>Title (must be only 1 line, starting with a ‘>’ character).
Sequence begins on next line (any number of lines, any width, spaces/numbers are ignored)
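The FASTA rules above are simple enough to sketch as a small reader and writer in Python. The function names here are hypothetical, and the reader follows this lecture's description (spaces and numbers in sequence lines are dropped); not every program is that forgiving, so treat that behaviour as an assumption.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Keep letters plus '*' and '-'; drop spaces and position numbers.
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch in "*-"))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def format_fasta(title, seq, width=60):
    """Write one record as FASTA text with `width` bases per line."""
    lines = [">" + title]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = ">myseq\n1 gggag gtcc\n11 aaatc\n"
    print(parse_fasta(example))
    print(format_fasta("myseq", "acgt" * 40), end="")
```

Writing 60 bases per line matches the layout used for the sequences in this lecture.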
-
Websites for handling sequences
Sequence Manipulation Suite www.bioinformatics.org/sms2
- Reverse complement
>myseq
gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat
tccactgatttacctgcaataccggattcaattatattgaagtaccacatccctccgaat
ttacaaaagtacgagcaccaggatatgaatatgagtcgattttggcgttcgttcacaaaa

>myseq reverse complement
ttttgtgaacgaacgccaaaatcgactcatattcatatcctggtgctcgtacttttgtaa
attcggagggatgtggtacttcaatataattgaatccggtattgcaggtaaatcagtgga
atcggctttctgtatgcaagttttctccgtgtattcgatttggacctccccccccccccc
- Group DNA
- Translation map
- Range extractor
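The reverse-complement operation listed above is easy to reproduce locally. The following Python sketch is not the code behind the Sequence Manipulation Suite; it is a minimal stand-alone version that swaps A/T and C/G, preserves upper or lower case, and leaves any other character (such as n) unchanged.

```python
# Translation table mapping each base to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, written 5' to 3'."""
    # Complement every base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("gggaggtcc"))
```

Applying the function twice returns the original sequence, which is a quick sanity check on any reverse-complement tool.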
-
NCBI ORF finder www.ncbi.nlm.nih.gov/gorf
-
GenBank accession number AJ245745
"Endomyces fibuliger URA3 gene for orotidine-5'-phosphate decarboxylase"
URA3: sequence positions 1 to 3194; start codon at 1101; stop codon at 1898
-
6 reading frames: 6 ways that the same DNA sequence could potentially encode a protein

(NH2)     S  H  V  V  E  A  L  Y  L  V  C  G  E  R  G  F  F     (COOH) frame +1
(NH2)      H  M  W  W  K  L  S  T  *  C  A  G  N  E  A  S       (COOH) frame +2
(NH2)       T  C  G  G  S  S  L  P  S  V  R  G  T  R  L  L      (COOH) frame +3
(5’) 1   tcacatgtggtggaagctctctacctagtgtgcggggaacgaggcttcttc 51 (3’)
(3’)     agtgtacaccaccttcgagagatggatcacacgccccttgctccgaagaag    (5’)
(COOH)    *  M  H  H  F  S  E  V  *  H  A  P  F  S  A  E  E     (NH2) frame -1
(COOH)      C  T  T  S  A  R  *  R  T  H  P  S  R  P  K  K      (NH2) frame -2
(COOH)     V  H  P  P  L  E  R  G  L  T  R  P  V  L  S  R       (NH2) frame -3

(5’) 51  gaagaagcctcgttccccgcacactaggtagagagcttccaccacatgtga 1 (3’)
(NH2)     E  E  A  S  F  P  A  H  *  V  E  S  F  H  H  M  *     (COOH) frame -1
(NH2)      K  K  P  R  S  P  H  T  R  *  R  A  S  T  T  C       (COOH) frame -2
(NH2)       R  S  L  V  P  R  T  L  G  R  E  L  P  P  H  V      (COOH) frame -3

Reading frame – any of these hypothetical translation schemes.
Open reading frame (ORF) – the region between a putative start codon (ATG) and the next stop codon (TAG / TGA / TAA) downstream of it in the same frame.
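The six-frame translation shown above can be reproduced with a short Python sketch. The names are illustrative; the codon table is the standard ("universal") genetic code, built from the conventional T, C, A, G ordering, and incomplete or unrecognised codons are written as X (an assumption of this sketch).

```python
BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order (third base varies fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence, returned in upper case."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the first base; '*' marks a stop."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    dna = "tcacatgtggtggaagctctctacctagtgtgcggggaacgaggcttcttc"
    for name, protein in sorted(six_frames(dna).items()):
        print(name, protein)
```

Each minus frame is obtained by translating the reverse complement, so its protein is written in the NH2 to COOH direction, as in the lower half of the slide.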
-
TTT Phe F    TCT Ser S    TAT Tyr Y    TGT Cys C
TTC Phe F    TCC Ser S    TAC Tyr Y    TGC Cys C
TTA Leu L    TCA Ser S    TAA stop *   TGA stop *
TTG Leu L    TCG Ser S    TAG stop *   TGG Trp W

CTT Leu L    CCT Pro P    CAT His H    CGT Arg R
CTC Leu L    CCC Pro P    CAC His H    CGC Arg R
CTA Leu L    CCA Pro P    CAA Gln Q    CGA Arg R
CTG Leu L    CCG Pro P    CAG Gln Q    CGG Arg R

ATT Ile I    ACT Thr T    AAT Asn N    AGT Ser S
ATC Ile I    ACC Thr T    AAC Asn N    AGC Ser S
ATA Ile I    ACA Thr T    AAA Lys K    AGA Arg R
ATG Met M    ACG Thr T    AAG Lys K    AGG Arg R

GTT Val V    GCT Ala A    GAT Asp D    GGT Gly G
GTC Val V    GCC Ala A    GAC Asp D    GGC Gly G
GTA Val V    GCA Ala A    GAA Glu E    GGA Gly G
GTG Val V    GCG Ala A    GAG Glu E    GGG Gly G

The ‘universal’ genetic code
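The start codon (ATG) and the three stop codons in the table (TAA, TAG, TGA) are all that is needed to scan a sequence for open reading frames as defined earlier. The Python sketch below is a simplified illustration, not the NCBI ORF finder's algorithm: it searches the three forward frames only, and the function name, the `min_codons` parameter and the choice to report 1-based coordinates that include the stop codon are assumptions of this example.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Find ORFs in the three forward frames of `seq`.

    Returns a list of (start, end, frame) tuples. Coordinates are 1-based
    and inclusive; `end` is the last base of the stop codon. `min_codons`
    counts codons from ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None  # look for the next ATG in this frame
    return orfs


if __name__ == "__main__":
    dna = "tcacatgtggtggaagctctctacctagtgtgcggggaacgaggcttcttc"
    print(find_orfs(dna))
```

To cover all six reading frames, the same scan can be run on the reverse complement of the sequence. An ATG with no downstream stop codon in frame is not reported by this sketch.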