LECTURE 4 sequencing · bioinf.gen.tcd.ie/~faresm/page12/files/genomics... · 2015. 3. 17. ·...


  • LECTURE 4 sequencing

  • Structure of DNA sequences

  • Know your 5’ and 3’

  • Bases 40-120 of a DNA sequencing read from an automated sequencer (Applied Biosystems, ABI377)

  • Bases 40-50 … bases 820-880

  • Sequencing DNA (Sanger method)

    • Raw data consists of "reads" of ~700 bp.
      – A read costs about €1 to obtain.
      – A read gives you information about the sequence downstream of a known starting point (primer binding site).

    • To get longer stretches of sequence information, we overlap data from reads that begin at different starting points.

    • How do we get different starting points?
      – (old) Restriction enzymes (subclone the cloned DNA fragment).
      – (newer) Primer walking (use custom primers).
      – (newest) Shotgun sequencing (just sequence lots of random clones).

  • Illumina Sequencing

    a) Shear the DNA
    b) Create the sequencing library
    c) Place the library on a solid flow cell
    d) PCR
    e) Obtain reads using the machine
    f) Sequencing by synthesis

  • >frag1.f LEN=117
    CTGCTAGAAGAAATTCATTCCTAGTTAATGCGAGAGAAGCTAATGATAATTTAGAAAGATTAGAAAACCAATTTTATTAAATTAGTGGAGAGCAATAAGGATAATTCTCATCATATT
    >frag1.r LEN=234
    CGTTACAAAAGGCATCAAGAAACTCAGAGTTTATCTCAATTGATCATTACTTTGCAACTTTAAAGACCCTTCCATTTCGTTATATATACCCCATTATTACTATCAAAATGCTAGATGCCCAAACCCTCCTGGACGTTATTGTGAAAAGGCACGGAGCTGACGACGATTCGGACCATGTGCCAGCGTGTGAGACTTCCAACGATTATGACGGGCGCATGAATCTTCGTATTCTGT
    >frag2.f LEN=241
    CATAGGAACACGTATATTGCTTTTGCCCTTATTAGTTTCGGCTTCCGGGCACCCTTCGGGTCACGATCTGTACCCTCACGGTTTCGGAGTTGCTGCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGCCCGTAGCGCAGTGGAAGCTGGGGGTTGCCCAGCAACGTCTGTTCATGTATATAAGGCCCATGAGGTATGGAATGTGACATTCTGCCTCCACAATTGAAGGACCTAAATCG
    >frag2.r LEN=97
    TTTTTATTTATAAAACAACAAAAAAGGAGATATTAAAGTTGAACGAAAGAGATATATACCTTTCAAAAAGCCCCAATCCAGATTAATAATTTTGAAA
    >frag3.f LEN=382
    CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCAATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTATAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACTCTAAAGTTACGTTACGCTAAGTTACGTTACGCtaaagcctgaacttaggaacacatctaatattaccaagcacaggtaacatatatttagtaggaacataggaaacacattacaaatatataaggaatacaaaacngtatattaacgaacatcaaaattaggacttgccgttaaaggacgcatacttanatccaaggaacac
    >frag3.r LEN=282
    ATGAACAGACGTTGCTGGGCAACCCCCAGCTTCCACTGCGCTACGGGCTTTTCTGCACACATTGTAGGCACAGGTACCTTTGCAGCAACTCCGAAACCGTGAGGGTACAGATCGTGACCCGAAGGGTGCCCGGAAGCCGAAACTAATAAGGGCAAAAGCAATATACGTGTTCCTATGAAAAAATTAAGGCAAGGTGAACAGCAAGACGGGGAGCAATATTGCTGTATTGCAAGATAAAAAGCAATGTGAGTGGTGTTCCTGTATATAAGTATGCGTTCCTTT
    >frag4.f LEN=46
    CATATTTAATTGTAAAATATTTTGTCTAATATTAAATTTCAATTTA
    >frag4.r LEN=360
    GTGTTCCTGTATATAAGTATGCGTTCCTTTAACGGCATCCTAATTTGATGTTCGTAATATATGTTTGTATTCCTTATATATTTGTATGTGTTCCTATGTTCCTAATAAATATATGTTCCTGTGCTTGTAATATAGATGTGTTCCTAAGTTCAGGCTTAGCGTAACGTAACTTAGCGTAACGTAACTTAGAGTAACGTAACTTAGCGTAGTGTAACTAAGCGAAATCTAGGGCTAAAAGTTAGGCTATGGTTACGTCTGCGACTACGGCTACGGCTACGACGACGACGACTGCCACTGCCATTGCCACTGCCACTGCCATTACGG
    >frag5.f LEN=105
    AGTAATAAAAAATGACTCATCCTTCTAATGATTCCACATTATCAACGAAGTAATAGTAATTCTAGTCTACAAAAATATGAATACTCCACCAAAGTACGAGCAAGT
    >frag5.r LEN=244
    GGCTTCCGGGCACCCTTCCGGGTCGACGATCTGTACCCTCACGGTTTCGGTAGTTGCTGCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGACCCGTAGCGCAGTGGAAGCGTGGGGGTTCGGCCCACGCAACGTCTGTTCATGTATAATAACGGCCCCATGGAGGTAGTGGAATGTGACATTCTGCCTCCACAATTGAAGGAACCTAAATCCTTAACAAAAGGCATCAAGAAACTCAGAG
    >frag6.f LEN=399
    TTGTAGGCACAGGTACCTTTGCAGCAACTCCGAAACCGTGAGGGTACAGATCGTGACCCGAAGGGTGCCCGGAAGCCGAAACTAATAAGGGCAAAAGCAATATACGTGTTCCTATGAAAAAAATTAAAGGCAAGGTGAACAGCAAGACGGGGAGCAATATTGCTGTATTGCAAGATAAAAAGCAATGTGAGTGGTGTTCCTGTATATAAGTATGCGTTCCTTTAACGGCATCCTAATTTGATGTCCGTAATATATGTTTGTATTCCTTATATATTTGTATGTGTTCCTATGTTCCTAATAAATATATGTTCCTGTGCTTGTATTATAGATGTGTTCCTAAGTTCAGGCTTAGCGTAACGTAACTTAGCGTAACGTAACTTAGAGTAACGTAACTTAGCG
    >frag6.r LEN=150
    TAGATTTACATTTTTTCATATTTAATTGTAAAATATTTTGTCTAATATTAAATTTCAATTTATTTAAAAATACTATTTTGATCAGTAGCCACATAATAATCCACATGATTACTCATCACAATACTTAAAAATGCCTGATATTCCAGAACC
    >frag7 LEN=485
    ACAGCAATATTGCTCCCCGTCTTGCTGTTCACCTTGCCTTTAATTTTTTTCATAGGAACACGTATATTGCTTTTGCCCTTATTAGTTTCGGCTTCCGGGCACCCTTCGGGTCACGATCTGTACCCTCACGGTTTCGGAGTTGCTGCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGCCCGTAGCGCAGTGGAAGCTGGGGGTTGCCCAGCAACGTCTGTTCATGTATATAAGGCCCATGAGGTATGGAATGTGACATTCTGCCTCCACAATTGAAGGACCTAAATCGTTAACAAAAGGCATCAAGAAACTCAGAGTTTATCTCAATTGATCATTACTTTGACAACTTTAAAGACCCTTCCATTTCGTTATATATACCCATTATTACTAATCAAATGCTAGATCGACCCAAACCCTCCTGGAACGTTTAGTTAGTAGAAAAGCGAACGGAGCTGAACGACATTCGGACCATGTCGCCAGCGT
    …

    Sequence assembly: finding overlaps to merge sequence reads into longer pieces of sequence. Particularly important and difficult for whole-genome shotgun sequencing. The input to the assembler program is a file of sequence reads, in forward and reverse pairs.
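
As a rough illustration of the overlap step described above, the sketch below finds the longest exact suffix-prefix overlap between two reads and merges them. The function names (`overlap_length`, `merge_reads`), the `min_overlap` threshold and the two toy reads are assumptions made for this example. Real assemblers also tolerate sequencing errors, test the reverse-complement orientation and use quality scores, none of which is attempted here.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 5) -> int:
    """Length of the longest suffix of `left` that exactly equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 5) -> Optional[str]:
    """Merge two reads into one longer sequence, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep all of `left`, then append the part of `right` beyond the overlap.
    return left + right[size:]


# Toy reads (hypothetical, not taken from the slide data).
read_a = "ACGTTGCATGGCCTAGA"
read_b = "GGCCTAGATTTCAGG"
merged = merge_reads(read_a, read_b)
print(merged)
```

Trying the longest candidate overlap first keeps the code short; it is quadratic in read length, which is fine for a demonstration but not for a real data set.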

  • Reads overlap

  • Output from assembler program

    Some famous assemblers: Phrap; CAP4; Celera (TIGR) assembler; Arachne; CLC. Good assemblers make use of base-quality information (Phred scores) and mate-pair information.

    Contigs (assembled from overlapping reads)

    Singletons (reads that don’t overlap others)

    >frag1.f LEN=117 QL=1 QR=544
    CTGCTAGAAGAAATTCATTCCTAGTTAATGCGAGAGAAGCTAATGATAATTTAGAAAGAT
    TAGAAAACCAATTTTATTAAATTAGTGGAGAGCAATAAGGATAATTCTCATCATATT
    >frag2.r LEN=97 QL=1 QR=97
    TTTTTATTTATAAAACAACAAAAAAGGAGATATTAAAGTTGAACGAAAGAGATATATACC
    TTTCAAAAAGCCCCAATCCAGATTAATAATTTTGAAA
    >frag5.f LEN=105 QL=1 QR=462
    AGTAATAAAAAATGACTCATCCTTCTAATGATTCCACATTATCAACGAAGTAATAGTAAT
    TCTAGTCTACAAAAATATGAATACTCCACCAAAGTACGAGCAAGT

    >Contig1
    CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCA
    ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTA
    TAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACT
    CTAAGTTACGTTACGCTAAGTTACGTTACGCTAAGCCTGAACTTAGGAACACATCTATAT
    TACAAGCACAGGAACATATATTTATTAGGAACATAGGAACACATACAAATATATAAGGAA
    TACAAACATATATTACGGACATCAAATTAGGATGCCGTTAAAGGAACGCATACTTATATA
    CAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCCCGTCTTG
    CTGTTCACCTTGCCTTTAATTTTTTTCATAGGAACACGTATATTGCTTTTGCCCTTATTA
    GTTTCGGCTTCCGGGCACCCTTCGGGTCACGATCTGTACCCTCACGGTTTCGGAGTTGCT
    GCAAAGGTACCTGTGCCTACAATGTGTGCAGAAAAGCCCGTAGCGCAGTGGAAGCTGGGG
    GTTGCCCAGCAACGTCTGTTCATGTATATAAGGCCCATGAGGTATGGAATGTGACATTCT
    GCCTCCACAATTGAAGGAACCTAAATCCTTAACAAAAGGCATCAAGAAACTCAGAGTTTA
    TCTCAATTGATCATTACTTTGACAACTTTAAAGACCCTTCCATTTCGTTATATATACCCC
    ATTATTACTAACAAAATGCTAGATCGACCCAAACCCTCCTGGACGTTATTGTGAAAAGGC
    ACGGAGCTGACGACGATTCGGACCATGTGCCAGCGTGTGAGACTTCCAACGATTATGACG
    GGCGCATGAATCTTCGTATTCTGT
    >Contig2
    TAGATTTACATTTTTTCATATTTAATTGTAAAATATTTTGTCTAATATTAAATTTCAATT
    TATTTAAAAATACTATTTTGATCAGTAGCCACATAATAATCCACATGATTACTCATCACA
    ATACTTAAAAATGCCTGATATTCCAGAACC
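
The Phred scores mentioned above encode the estimated probability P that a base call is wrong as Q = -10 log10(P), so Q20 corresponds to a 1 in 100 chance of error and Q30 to 1 in 1000. A minimal sketch of the conversion in both directions (the function names are made up for this example):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a base call with Phred quality `q` is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability `p` (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```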

  • Contigs

  • Sequence assembly: reads, contigs, scaffolds

  • DETAILED DISPLAY OF CONTIGS
    ******************* Contig 1 ********************
                  .    :    .    :    .    :    .    :    .    :    .    :
    frag3.f+  CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCA
    frag4.r-                                       CCGTAATGGCAGTGGCAGTGGCA
              ____________________________________________________________ 60
    consensus CCGGGGTTTGTAGGTGTTTAAGCTTGCCCAGGAATAGCCGTAATGGCAGTGGCAGTGGCA

                  .    :    .    :    .    :    .    :    .    :    .    :
    frag3.f+  ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTA
    frag4.r-  ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGT-CGC-AGACGT-AACC--A
              ____________________________________________________________ 120
    consensus ATGGCAGTGGCAGTCGTCGTCGTCGTAGCCGTAGCCGTAGTTCGCTAGACGTTAACCCTA

                  .    :    .    :    .    :    .    :    .    :    .    :
    frag3.f+  TAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACT
    frag4.r-  TAGCCTAACTTTTA-GCCCTAGATTTCGCTTAGTTACACT-ACGCTAA-GTTACGTTACT
    frag6.f-                                                 AAGTTACGTTACT
              ____________________________________________________________ 180
    consensus TAGCCTAACTCTTAAGCCCTAGATTTCGCTTAGTTACACTTACGCTAAAGTTACGTTACT

                  .    :    .    :    .    :    .    :    .    :    .    :
    frag3.f+  CTAAAGTTACGTTACGCTAAGTTACGTTACGCTAAAGCCTGAACTTAGGAACACATCTAA
    frag4.r-  CTAA-GTTACGTTACGCTAAGTTACGTTACGCTAA-GCCTGAACTTAGGAACACATCTA-
    frag6.f-  CTAA-GTTACGTTACGCTAAGTTACGTTACGCTAA-GCCTGAACTTAGGAACACATCTA-
              ____________________________________________________________ 240
    consensus CTAA-GTTACGTTACGCTAAGTTACGTTACGCTAA-GCCTGAACTTAGGAACACATCTA-

                  .    :    .    :    .    :    .    :    .    :    .    :
    frag3.f+  TATTACCAAGCACAGGTAACATATATTTAGTAGGAACATAGGAAACACATTACAAATATA
    frag4.r-  TATTAC-AAGCACAGG-AACATATATTTATTAGGAACATAGGAA-CACAT-ACAAATATA
    frag6.f-  TAATAC-AAGCACAGG-AACATATATTTATTAGGAACATAGGAA-CACAT-ACAAATATA
              ____________________________________________________________ 300
    consensus TATTAC-AAGCACAGG-AACATATATTTATTAGGAACATAGGAA-CACAT-ACAAATATA

                  .    :    .    :    .    :    .    :    .    :    .    :
    frag3.f+  TAAGGAATACAAA
    frag4.r-  TAAGGAATACAAA
    frag6.f-  TAAGGAATACAAACATATATTACGGACATCAAATTAGGATGCCGTTAAAGGAACGCATAC
    frag3.r-                                                AAAGGAACGCATAC
              ____________________________________________________________ 360
    consensus TAAGGAATACAAACATATATTACGGACATCAAATTAGGATGCCGTTAAAGGAACGCATAC

                  .    :    .    :    .    :    .    :    .    :    .    :
    frag6.f-  TTATATACAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCC
    frag3.r-  TTATATACAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCC
    frag7+                                               ACAGCAATATTGCTCCC
              ____________________________________________________________ 420
    consensus TTATATACAGGAACACCACTCACATTGCTTTTTATCTTGCAATACAGCAATATTGCTCCC
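
The consensus line in a display like the one above can be approximated by a per-column majority vote over the aligned reads. The sketch below is an assumption about how such a vote might look, not the assembler's actual rule (real assemblers weight the votes by base quality). The function name `consensus` and the tie-break (prefer a base over a gap) are choices made for this example. The three rows are the first 21 columns of positions 241-300 in the display.

```python
from collections import Counter
from typing import Sequence


def consensus(rows: Sequence[str]) -> str:
    """Column-wise majority vote over aligned reads.

    '-' is an alignment gap and takes part in the vote; ' ' means the read
    does not cover that column and is ignored.
    """
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    result = []
    for column in zip(*padded):
        counts = Counter(symbol for symbol in column if symbol != " ")
        if not counts:
            result.append(" ")  # no read covers this column
            continue
        # Highest count wins; on a tie, a base is preferred over a gap.
        best = max(counts, key=lambda symbol: (counts[symbol], symbol != "-"))
        result.append(best)
    return "".join(result)


aligned_rows = [
    "TATTACCAAGCACAGGTAACA",  # frag3.f+
    "TATTAC-AAGCACAGG-AACA",  # frag4.r-
    "TAATAC-AAGCACAGG-AACA",  # frag6.f-
]
print(consensus(aligned_rows))
```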

  • Handling sequences on a computer

    Sequence files should be plain text files (like .txt) if you’re going to use them as input to programs.

    Open/Edit them with a simple text editor like Notepad (PC) or TextEdit (Mac), not Microsoft Word.

    To print sequences/alignments for humans to read, use Courier or Monaco font.

    >myseq in Courier font (60 bases per line)
    gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat
    tccactgatttacctgcaataccggattcaattatattgaagtaccacatccctccgaat
    ttacaaaagtacgagcaccaggatatgaatatgagtcgattttggcgttcgttcacaaaa

    >myseq in Times New Roman font (60 bases per line)
    gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat
    Tccactgatttacctgcaataccggattcaattatattgaagtaccacatccctccgaat
    Ttacaaaagtacgagcaccaggatatgaatatgagtcgattttggcgttcgttcacaaaa

    FASTA format is a very common, simple format for sequences that will be read by a computer:
    >Title (must be only 1 line, starting with a ‘>’ character).
    Sequence begins on the next line (any number of lines, any width; spaces/numbers are ignored).
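
The FASTA rules above are simple enough to parse by hand. The following is a minimal sketch, assuming the whole file fits in memory and that titles are unique (a repeated title would overwrite the earlier record). The function name `parse_fasta` is made up for this example. It keeps only letters from sequence lines, which drops the spaces and numbers the format allows but also any gap or stop symbols.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {title: sequence} dictionary."""
    records: Dict[str, str] = {}
    title: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A title line starts a new record.
            title = line[1:].strip()
            records[title] = ""
        elif title is not None:
            # Sequence line: keep letters only, ignoring spaces and numbers.
            records[title] += "".join(ch for ch in line if ch.isalpha())
    return records


example = """>myseq
1 gggggggggg gggaggtcca
21 aatcgaatac
"""
print(parse_fasta(example))
```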

  • Websites for handling sequences

    Sequence Manipulation Suite www.bioinformatics.org/sms2

    - Reverse complement

    >myseq
    gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat
    tccactgatttacctgcaataccggattcaattatattgaagtaccacatccctccgaat
    ttacaaaagtacgagcaccaggatatgaatatgagtcgattttggcgttcgttcacaaaa

    >myseq reverse complement
    ttttgtgaacgaacgccaaaatcgactcatattcatatcctggtgctcgtacttttgtaa
    attcggagggatgtggtacttcaatataattgaatccggtattgcaggtaaatcagtgga
    atcggctttctgtatgcaagttttctccgtgtattcgatttggacctccccccccccccc

    - Group DNA
    - Translation map
    - Range extractor
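
The reverse complement operation can also be done in a few lines of code: complement each base, then reverse the string. This is a minimal sketch, not the Sequence Manipulation Suite's own implementation. The function name `reverse_complement` is an assumption, and only A, C, G, T and N (upper or lower case) are handled; any other character passes through unchanged.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# First 60-base line of myseq from the example above.
first_line = "gggggggggggggaggtccaaatcgaatacacggagaaaacttgcatacagaaagccgat"
print(reverse_complement(first_line))
```

Because the whole sequence is reversed, the first line of `myseq` ends up as the last line of its reverse complement.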

  • NCBI ORF finder www.ncbi.nlm.nih.gov/gorf

  • GenBank accession number AJ245745

    "Endomyces fibuliger URA3 gene for orotidine-5'-phosphate decarboxylase"

    [Figure: map of the sequence, positions 1 to 3194, with URA3 running from the start codon at 1101 to the stop codon at 1898]

  • 6 reading frames: 6 ways that the same DNA sequence could potentially encode a protein

    (NH2)     S  H  V  V  E  A  L  Y  L  V  C  G  E  R  G  F  F  (COOH)  frame +1
    (NH2)      H  M  W  W  K  L  S  T  *  C  A  G  N  E  A  S    (COOH)  frame +2
    (NH2)       T  C  G  G  S  S  L  P  S  V  R  G  T  R  L  L   (COOH)  frame +3
    (5’)  1  tcacatgtggtggaagctctctacctagtgtgcggggaacgaggcttcttc 51 (3’)
    (3’)     agtgtacaccaccttcgagagatggatcacacgccccttgctccgaagaag    (5’)
    (COOH)    *  M  H  H  F  S  E  V  *  H  A  P  F  S  A  E  E  (NH2)   frame -1
    (COOH)      C  T  T  S  A  R  *  R  T  H  P  S  R  P  K  K   (NH2)   frame -2
    (COOH)     V  H  P  P  L  E  R  G  L  T  R  P  V  L  S  R    (NH2)   frame -3

    (5’) 51  gaagaagcctcgttccccgcacactaggtagagagcttccaccacatgtga  1 (3’)
    (NH2)     E  E  A  S  F  P  A  H  *  V  E  S  F  H  H  M  *  (COOH)  frame -1
    (NH2)      K  K  P  R  S  P  H  T  R  *  R  A  S  T  T  C    (COOH)  frame -2
    (NH2)       R  S  L  V  P  R  T  L  G  R  E  L  P  P  H  V   (COOH)  frame -3

    Reading frame – any of these hypothetical translation schemes.
    Open reading frame (ORF) – the region between a putative start codon (ATG) and the next stop codon (TAG / TGA / TAA) downstream of it in the same frame.
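
The ORF definition above translates directly into a scan: in each frame, remember the first ATG seen, then report the region when the next in-frame stop codon appears. The sketch below is an illustration under that definition only, not how NCBI ORF finder works internally. The function name `find_orfs` and the output convention (frame number, 0-based start, end exclusive, stop codon included) are assumptions. It scans one strand; pass the reverse complement to cover the minus frames.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str) -> List[Tuple[int, int, int]]:
    """Find ORFs in the three forward frames of `seq`.

    Returns (frame, start, end) tuples, where frame is 1, 2 or 3, start is the
    0-based index of the A in ATG, and end is one past the last base of the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current putative start codon, if any
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                orfs.append((frame + 1, start, pos + 3))
                start = None  # look for the next ATG after this stop
    return orfs


# The 51-base example sequence from the slide above.
forward = "tcacatgtggtggaagctctctacctagtgtgcggggaacgaggcttcttc"
for frame, start, end in find_orfs(forward):
    print(frame, start, end, forward[start:end])
```

On the slide's sequence this reports a single ORF in frame +2, the ATG-to-TAG stretch that reads M W W K L S T * in the frame +2 translation.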

  • TTT Phe F   TCT Ser S   TAT Tyr Y    TGT Cys C
    TTC Phe F   TCC Ser S   TAC Tyr Y    TGC Cys C
    TTA Leu L   TCA Ser S   TAA stop *   TGA stop *
    TTG Leu L   TCG Ser S   TAG stop *   TGG Trp W

    CTT Leu L   CCT Pro P   CAT His H    CGT Arg R
    CTC Leu L   CCC Pro P   CAC His H    CGC Arg R
    CTA Leu L   CCA Pro P   CAA Gln Q    CGA Arg R
    CTG Leu L   CCG Pro P   CAG Gln Q    CGG Arg R

    ATT Ile I   ACT Thr T   AAT Asn N    AGT Ser S
    ATC Ile I   ACC Thr T   AAC Asn N    AGC Ser S
    ATA Ile I   ACA Thr T   AAA Lys K    AGA Arg R
    ATG Met M   ACG Thr T   AAG Lys K    AGG Arg R

    GTT Val V   GCT Ala A   GAT Asp D    GGT Gly G
    GTC Val V   GCC Ala A   GAC Asp D    GGC Gly G
    GTA Val V   GCA Ala A   GAA Glu E    GGA Gly G
    GTG Val V   GCG Ala A   GAG Glu E    GGG Gly G

    The ‘universal’ genetic code
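
With the code table in hand, a six-frame translation is a small program. The sketch below builds the standard table from the conventional T, C, A, G codon ordering and translates all six frames of a sequence. The names `CODON_TABLE`, `translate` and `six_frames` are assumptions for this example, and unknown codons (for instance those containing `n`) are shown as `X`, which is a choice made here rather than a rule from the slides.

```python
from itertools import product
from typing import Dict

BASES = "TCAG"
# One-letter amino acids for the 64 codons in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate `seq` from its first base; a trailing partial codon is dropped."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # X marks a codon not in the table
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq: str) -> Dict[str, str]:
    """Translate the three forward frames and the three reverse-strand frames."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# The 51-base example sequence from the reading-frames slide.
example_seq = "tcacatgtggtggaagctctctacctagtgtgcggggaacgaggcttcttc"
for name, protein in sorted(six_frames(example_seq).items()):
    print(name, protein)
```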