ismb 2010 amitra - university of california,...
Post on 07-Feb-2018
Introduction to Next Generation Sequencing
1.5 Billion bases / day
100 ng DNA for library
Motivation
• Next generation sequencing (NGS) is rapidly becoming a method of choice for many whole genome studies, especially for identifying protein-DNA interactions (ChIP-Seq).
• NGS technology is relatively new and the properties of the sequencing background are not yet well understood.
• We need to identify the cause of the uneven distribution of sequenced DNA.
Motivation: Uneven Distribution of Sequenced DNA
Source: Various S. cerevisiae sequences from NCBI SRA. Plotted with the UCSC Genome Browser
Plots of Genomic and Input DNA of S. cerevisiae, Chromosome IV from different experiments
R Genomic DNA. K Input DNA Exp 1. B Input DNA Exp 2.
Motivation: Accuracy of Peak Calling?
“True peaks”, corresponding to the actual protein-DNA binding sites (in red), are difficult to distinguish from spurious peaks found in the background (in green).
PolII binding profile
Control / background profile
Example of ChIP-Seq Data: Lefrançois et al. (2009), Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing, BMC Genomics.
We anticipate that modeling the sequencing background as a one-point Poisson distribution, which lies at the core of almost all approaches to date, will lead to significant systematic errors in the interpretation of ChIP-Seq experiments.
Modeling of Sequence Background
• Global / Local Poisson Model
• Are sequence reads distributed along the chromosomes (global), or locally within sliding windows, according to a Poisson distribution?
• Formula for Poisson Distribution.
• Lambda = average sequence read density for the whole genome. [Total reads / mappable bases]
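For reference, the standard Poisson probability of observing k reads at a position with mean density lambda is written out below. The symbols k, N_reads and M_mappable are names introduced here for illustration; only the definition of lambda as total reads over mappable bases comes from the slide.

```latex
P(K = k) = \frac{\lambda^{k} e^{-\lambda}}{k!},
\qquad
\lambda = \frac{N_{\text{reads}}}{M_{\text{mappable}}}
```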
Modeling of Sequence Background
Method
• Identify 35 bp long mappable points in the genome of Saccharomyces cerevisiae (the length depends on the sequence read length).
• Input DNA reads were mapped to the genome.
• A Poisson model was simulated with the same number of reads.
• The Poisson model was compared with the actual background.
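The simulation step above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that "simulating a Poisson model" means placing read starts uniformly at random over the mappable positions, and the function names and toy sizes are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_uniform_reads(n_reads, n_mappable, seed=0):
    """Drop n_reads read starts uniformly at random on n_mappable positions."""
    rng = random.Random(seed)
    counts = [0] * n_mappable
    for _ in range(n_reads):
        counts[rng.randrange(n_mappable)] += 1
    return counts

def poisson_pmf(k, lam):
    """Standard Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def count_histogram(counts):
    """Fraction of positions holding 0, 1, 2, ... read starts."""
    hist = Counter(counts)
    n = len(counts)
    return {k: hist[k] / n for k in sorted(hist)}

# Toy sizes chosen for speed; these are not the yeast genome figures.
n_reads, n_mappable = 20_000, 100_000
simulated = simulate_uniform_reads(n_reads, n_mappable)
lam = n_reads / n_mappable          # total reads / mappable bases
observed = count_histogram(simulated)

# Print simulated fraction next to the Poisson expectation for each count.
for k, frac in observed.items():
    print(k, round(frac, 4), round(poisson_pmf(k, lam), 4))
```

Replacing `simulated` with per-position counts from real mapped input reads would give the comparison described in the last bullet.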
Method
• We compare the fifth-order moments of the simulation and the actual ChIP-Seq reads.
• The p-value turns out to be 2.36e-9.
• Conclusion: Sequence reads are not distributed over the Saccharomyces cerevisiae genome according to a Poisson distribution.
• Conclusion: With a 500 bp sliding window, the reads in ~80% of the windows are not distributed according to a Poisson distribution.
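One way a moment comparison of this kind could be set up is sketched below. The slides do not say how the p-value was derived, so the one-sided Monte Carlo test used here is an assumption, as are the function names and the toy "clumpy" data.

```python
import random

def central_moment(values, order):
    """Central moment of the given order for a list of counts."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def simulate_counts(n_reads, n_positions, rng):
    """Uniformly random read starts (the background assumed under the null)."""
    counts = [0] * n_positions
    for _ in range(n_reads):
        counts[rng.randrange(n_positions)] += 1
    return counts

def moment_p_value(observed_counts, order=5, n_sim=200, seed=0):
    """Empirical p-value: share of simulations whose moment is >= the observed one."""
    rng = random.Random(seed)
    n_reads = sum(observed_counts)
    n_positions = len(observed_counts)
    obs = central_moment(observed_counts, order)
    hits = sum(
        central_moment(simulate_counts(n_reads, n_positions, rng), order) >= obs
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)

# Hypothetical clumpy background: half of the reads pile up at three hotspots.
rng = random.Random(1)
clumpy = [0] * 2000
for _ in range(1000):
    pos = rng.choice([10, 500, 1500]) if rng.random() < 0.5 else rng.randrange(2000)
    clumpy[pos] += 1

p_clumpy = moment_p_value(clumpy)
print(p_clumpy)
```

The same function could be applied per 500 bp window to estimate the share of windows that deviate from the uniform model.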
Results
• The physical properties of DNA in solution change, allowing it to bend more at certain places.
• Bendability of DNA segments has been quantified by hydroxyl radical cleavage intensity.
• Prior to sequencing, DNA usually undergoes mechanical shearing by means of sonication.
• Some reads are more prone to being sequenced than others.
• Example: ‘TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAA’
‘TTTTTTTTTTTTTTTTTAATTGAACAATAGATGC’
DNA Bendability
http://dna.bu.edu/orchid/ (Tullius and Greenbaum)
Method
• We propose an alternative method of background characterization which is non-local and takes into account intrinsic properties of DNA that correlate with the density of reads.
• We compute the weighted average of bendability, as well as of GC content, at sequence start sites and use these as models to identify sites in the genome which may be preferentially sequenced in the given experiment.
• Using log-likelihood, we identify the model that best predicts sequencing bias.
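The weighted average at read start sites might be computed along these lines for GC content. This is a sketch under assumptions: the window size, the offsets, the weighting by read count, and all names and data are hypothetical. The same averaging could in principle be applied to a per-base bendability track instead of the GC fraction.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def mean_gc_at_offset(genome, read_starts, offset, window=10):
    """Read-weighted mean GC content of a window placed `offset` bp from each read start.

    read_starts maps a 0-based start position to the number of reads starting there.
    """
    total_weight = 0
    weighted_sum = 0.0
    for start, n_reads in read_starts.items():
        lo = start + offset
        hi = lo + window
        if lo < 0 or hi > len(genome):
            continue  # skip windows that run off the sequence
        weighted_sum += n_reads * gc_fraction(genome[lo:hi])
        total_weight += n_reads
    return weighted_sum / total_weight if total_weight else float("nan")

genome = "ATATATATATGCGCGCGCGCATATATATAT"   # toy sequence, not real yeast DNA
read_starts = {10: 3, 0: 1}                  # hypothetical read pile-ups
profile = {off: mean_gc_at_offset(genome, read_starts, off) for off in (-10, 0, 10)}
print(profile)
```

Plotting such a profile over a range of offsets gives a curve of average GC content around read start sites.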
Bendability at Sequence Read Locations
GC Content at Sequence Read Locations
Method and Conclusion
• Use log-likelihood to determine whether bendability or GC content is a better model than a random Poisson model.
• Formula for Likelihood Function
• We compute the sum of the logarithms of the likelihood values for three probability distributions, one each for bendability, GC content and the random model.
• We conclude that the bendability model serves as a useful predictor of bias in sequencing experiments within a +/- 100 bp offset from the beginning of the sequence; thereafter, GC content works better.
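A log-likelihood comparison of this kind could look roughly like the sketch below. The likelihood function from the slide is not available in the transcript, so the form used here is an assumption: each model assigns every position a read-start probability proportional to a weight, and the log-likelihood is the sum of log-probabilities over observed reads. The weights and counts are made-up toy values.

```python
import math

def log_likelihood(counts, weights):
    """Sum of c_i * log(p_i), where p_i = w_i / sum(w) is the model's read-start probability."""
    total = sum(weights)
    ll = 0.0
    for c, w in zip(counts, weights):
        if c:
            if w <= 0:
                return float("-inf")  # model forbids a position that has reads
            ll += c * math.log(w / total)
    return ll

counts = [0, 5, 1, 0, 4, 0]                 # toy read-start counts per position
uniform = [1.0] * len(counts)               # random (uniform) background model
feature = [0.2, 2.0, 0.5, 0.2, 1.8, 0.3]    # hypothetical bendability- or GC-derived weights

ll_uniform = log_likelihood(counts, uniform)
ll_feature = log_likelihood(counts, feature)
best = "feature" if ll_feature > ll_uniform else "uniform"
print(ll_uniform, ll_feature, best)
```

The model with the larger (less negative) log-likelihood explains the observed read starts better, which is the criterion the bullets describe.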
Results for Genomic DNA
The log-likelihood for the Poisson model was -1.517e+6.
Results for Input DNA Exp. 1
The log-likelihood for the Poisson model was -2.507e+6.
Results for Input DNA Exp. 2
The log-likelihood for the Poisson model was -5.945e+6.
• Next generation sequencing (NGS) is rapidly becoming a method of choice for many whole genome studies, especially for identifying protein-DNA interactions (ChIP-Seq). NGS technology is relatively new and the properties of the sequencing background are not yet well understood. It has been shown [1] that the sequencing background is highly repeatable and thus, in principle, can be modeled with high accuracy.
• We demonstrate that modeling the sequencing background as a one-point Poisson distribution, which lies at the core of almost all approaches to date, will lead to significant systematic errors in the interpretation of ChIP-Seq experiments. We propose an alternative method of background characterization which is non-local and takes into account intrinsic properties of DNA that correlate with the density of reads.
• By comparing ChIP-Seq backgrounds obtained under the same conditions but using different experimental protocols, we identify and computationally separate the bias introduced into the results at the different stages of the sample preparation and sequencing process. We discuss how this understanding can be used to improve detection of genuine protein-DNA interactions and provide software tools that implement our approach. Finally, we propose an algorithm based on a multiparametric model, which can model the sequencing background ab initio.
• [1] P. Lefrançois et al. (2009). “Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing.” BMC Genomics 10: 37.