analysis of barcode sequencing - kribbmedical-genome.kribb.re.kr/barseq/barcas_presentation.pdf ·...
Department of Functional Genomics, UST
Jihyeob Mun
2016.12.07
Analysis of barcode sequencing
Pooled library screen analysis
‘gene A’ is a target?
experience knowledge
Fail
Success
High-throughput
How?
Simplicity
Pooled library screen analysis
A pool
‘gene A’ targeted cell
‘gene B’ targeted cell
‘gene C’ targeted cell
‘gene D’ targeted cell
The pool is used for the analysis.
◎ Barcode sequence
What is barcode sequencing
- Barcodes are genome-integrated artificial sequences that mark biological materials, such as cells or genes, with unique sequences.
- The barcodes are sequenced and analyzed by barcode sequencing (barcode-seq).
- Library: a set of barcode sequences
- Barcode-seq is used in several genome-wide screening tools, including shRNAs, sgRNAs and barcoded yeast deletion strains.
Workflow : genome-wide shRNA screening
Workflow : barcoded yeast deletion strains
Limitation of barcode-seq data analysis
◎ Previously reported tools mostly focus on shRNA or sgRNA screening analysis.
◎ Error-free production of barcode libraries is still an important issue (for example, barcode errors and off-target problems).
Genome-wide functional analysis using the Barcode Sequence Alignment and Statistical Analysis (Barcas) tool
What is Barcas
- Barcas (Barcode sequence Alignment and Statistical analysis tool) is a specialized program for the analysis of multiplexed barcode sequencing (barcode-seq) data.
- Input: barcode-seq data (from shRNAs, sgRNAs and barcoded yeast deletion strains)
Analysis pipeline of barcode-seq data
Step 1: Data pre-processing
Step 2: QC of data
Step 3: Design experiment
Step 4: Statistical analysis
Three novel functions of Barcas
- Based on a trie data structure, Barcas supports imperfect matching with mismatches, position shifts and indels (insertions and deletions).
- Detection of barcode errors in the library.
- Checking similarity between barcodes in the library collection (barcode library QC).
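The first function can be illustrated with a small sketch of approximate matching on a trie. This is not the Barcas implementation (Barcas is written in Java, and its algorithm is not shown on the slides); it is a generic bounded edit-distance search that covers mismatches and indels but does not model position shifts. The names `TrieNode`, `build_trie` and `search` are hypothetical, and the library is a toy set of sequences.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.barcode: Optional[str] = None  # set on the last base of a library barcode


def build_trie(library: List[str]) -> TrieNode:
    root = TrieNode()
    for seq in library:
        node = root
        for base in seq:
            node = node.children.setdefault(base, TrieNode())
        node.barcode = seq
    return root


def search(root: TrieNode, query: str, max_edits: int) -> List[Tuple[int, str]]:
    """Return (edit distance, barcode) for every barcode within max_edits of query."""
    hits: List[Tuple[int, str]] = []
    first_row = list(range(len(query) + 1))  # distances from the empty trie prefix
    for base, child in root.children.items():
        _walk(child, base, query, first_row, max_edits, hits)
    return sorted(hits)


def _walk(node, base, query, prev_row, max_edits, hits) -> None:
    # row[i] = edit distance between the trie prefix ending here and query[:i]
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        row.append(min(
            row[i - 1] + 1,                            # extra base in the query
            prev_row[i] + 1,                           # base missing from the query
            prev_row[i - 1] + (query[i - 1] != base),  # match or mismatch
        ))
    if node.barcode is not None and row[-1] <= max_edits:
        hits.append((row[-1], node.barcode))
    if min(row) <= max_edits:  # prune branches that can no longer match
        for next_base, child in node.children.items():
            _walk(child, next_base, query, row, max_edits, hits)


# Toy library, not a real barcode collection
LIBRARY = ["AGCT", "TTAT", "TTAG", "TCAGT", "GCAG", "GCCAA", "CGCT"]
trie = build_trie(LIBRARY)
print(search(trie, "TTAG", 1))  # perfect match plus one single-mismatch neighbour
print(search(trie, "TCGT", 1))  # TCAGT with one base deleted
```

Because all barcodes share one trie, a single walk tests the query against the whole library, and branches are abandoned as soon as they exceed the allowed number of edits.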
Feature 1:
Trie data structure based imperfect matching
Previously reported tools for data preprocessing
Program | Mismatches | Shifts | Indels | Dynamic length | Backend tool | Ref
BiNGS!LS-seq | O | X | X | X | bowtie | Kim (2012) Methods Mol Bio
shALIGN | O | X | X | X | Perl script (or bowtie) | Sims (2011) Genome Bio
edgeR | O | O | X | X | edgeR | Dai (2014) F1000Res
Barcas | O | O | O | O | java | Mun (2016) BMC Bioinfo
[Figure: read layouts showing the MID, universal primer and barcode in different orders and positions; read positions marked at 1, 10, 13, 18, 25, 28, 33]
ex) The Cellecta library (shRNA): MID from 9-bp to 17-bp.
Dynamic sequence length
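A minimal sketch of why a fixed offset is not enough when the MID length varies: the barcode is located relative to the universal primer instead. It assumes the layout MID, universal primer, barcode and an exact primer match; the primer, MID and barcode sequences and the function name `split_read` are made up for illustration and are not taken from Barcas or the Cellecta library.

```python
from typing import Optional, Tuple

# Hypothetical sequences, for illustration only
PRIMER = "GGATCCGAATTC"
BARCODE = "GCTGGAGATCCTCAAAGT"  # 18 bp
MID_SHORT = "ACGTACGTA"         # 9 bp
MID_LONG = "ACGTACGTAACGTACGT"  # 17 bp


def split_read(read: str, primer: str, barcode_len: int) -> Optional[Tuple[str, str]]:
    """Return (MID, barcode), or None if the primer or a full barcode is absent."""
    start = read.find(primer)  # the primer position absorbs the variable MID length
    if start == -1:
        return None
    barcode_start = start + len(primer)
    barcode = read[barcode_start:barcode_start + barcode_len]
    if len(barcode) < barcode_len:
        return None
    return read[:start], barcode


for mid in (MID_SHORT, MID_LONG):
    read = mid + PRIMER + BARCODE + "AAA"
    print(split_read(read, PRIMER, len(BARCODE)))
```

Mutated primers would defeat the exact `find` used here, which is one reason imperfect matching is useful at this step.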
Trie data structure based imperfect matching
1:1 sequence matching processing
Algorithm: List based
Maximum time: N * M (N: read count, M: library sequence count)
1:M sequence matching processing
Algorithm: Trie based
Maximum time: N (N: read count)
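[Figure: a read is matched against the library sequences, first as a list and then as a trie built from the same sequences; in the list each element is a sequence, in the trie each node is a base]
Read: TTAG
Library sequences: AGCT, TTAT, TTAG, TCAGT, GCAG, GCCAA, CGCT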
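The difference between the two approaches can be sketched with the library sequences from the slide. This is a simplified exact-match illustration, not the Barcas code; `TrieNode`, `build_trie`, `match_read_trie` and `match_read_list` are hypothetical names.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.barcode: Optional[str] = None  # set where a library sequence ends


def build_trie(library: List[str]) -> TrieNode:
    root = TrieNode()
    for seq in library:
        node = root
        for base in seq:
            node = node.children.setdefault(base, TrieNode())
        node.barcode = seq
    return root


def match_read_list(library: List[str], read: str) -> Optional[str]:
    """List based: compare the read with each library sequence in turn (up to M checks)."""
    for seq in library:
        if read.startswith(seq):
            return seq
    return None


def match_read_trie(root: TrieNode, read: str) -> Optional[str]:
    """Trie based: one walk along the read, independent of the library size."""
    node = root
    for base in read:
        node = node.children.get(base)
        if node is None:
            return None  # no library sequence continues with this base
        if node.barcode is not None:
            return node.barcode
    return None


LIBRARY = ["AGCT", "TTAT", "TTAG", "TCAGT", "GCAG", "GCCAA", "CGCT"]
trie = build_trie(LIBRARY)
print(match_read_trie(trie, "TTAG"))     # TTAG
print(match_read_trie(trie, "GCCAATT"))  # GCCAA
print(match_read_trie(trie, "TTAC"))     # None
```

With the list, each read may be compared with all M library sequences; with the trie, each read costs at most one step per base, which is where the N * M versus N estimate comes from. If one library sequence were a prefix of another, this sketch would return the shorter one.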
Comparison of speed and mapping rate
- Option
- Result: Barcas is 1.7 times faster than bowtie and 13 times faster than edgeR. Owing to indel mapping, Barcas mapped at least 8-12% more reads than the other two programs.
- Data: 215 million reads were mapped to 4,832 heterozygous diploid deletion strains of S. pombe. 45-bp sequences were used as the barcode library.
Feature 2:
Detection of barcode errors
Methods of targeting regions (1/2)
○ Barcoded yeast deletion strains
○ shRNAs
○ sgRNAs
Homologous recombination site
When the artificial sequence targets an unexpected region, it is called an off-target.
Methods of targeting regions (2/2)
Original Design
| Correct sequence | Barcode error
True | high | low
Off-target | low | high
Solutions | Provided by statistical analysis | Not yet; imperfect matching is essential
Detection of barcode errors (1/4)
Eason et al (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains PNAS 101(30):11046-51
Smith et al (2009) Quantitative phenotyping via deep barcode sequencing Genome Res 19:1836-42
| U1 | UpTag | U2 | D2 | DnTag | D1
# correct by Smith | 4,242 | 4,369 | 4,045 | 4,207 | 4,320 | 3,867
% correct by Smith | 80.1% | 82.5% | 82.9% | 80.9% | 83.1% | 83.7%
# correct by Eason | 4,185 | 3,764 | 4,057 | 4,343 | 3,807 | 4,095
% correct by Eason | 79.1% | 71.1% | 83.2% | 83.5% | 73.2% | 88.7%
% Agreed | 86% | 84.4% | 89.2% | 92.6% | 85.1% | 92%
○ Barcoded yeast deletion strains
Detection of barcode errors (2/4)
Ziller,MJ. et al., Nature 2015, 518, 355-9.
- Library: 1,230 shRNA sequences of the TRC library.
- Data: Control samples in neuroepithelial (NE), early radial glial (ERG) and mid radial glial (MRG) cells.
- We found 25 (2.03%) erroneous barcodes (<= 2 mismatched bases or indels).
Detection of barcode errors (3/4)
Deletion Mismatch Insertion
Detection of barcode errors (4/4)
A simple method for distinguishing barcode errors (PM: perfect matching, IM: imperfect matching)
[Figure panels: Deletion, Mismatch, Insertion]
○ Dominant PM
Original: GCTGGAGATCCTCAAAGTCAT
Real: GCTGGAGATCCTCAAAGTCAT (= Original)
○ Barcode error
Original: GAATCTGCCACTCTCAGAATA
Real: AATCTGCCACTCTCAGAATA (≠ Original, IM1)
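One way to read the PM/IM distinction is as a comparison of read counts: if reads matching the designed barcode perfectly dominate, the barcode is as designed; if an imperfectly matching variant dominates, the barcode in the library probably differs from the design. The sketch below assumes that decision rule, which the slide does not spell out; the function name `classify_barcode` and all read counts are hypothetical.

```python
from typing import Dict, Tuple


def classify_barcode(designed: str, observed: Dict[str, int]) -> Tuple[str, str]:
    """Return (label, dominant sequence) from read counts per observed sequence.

    Assumed rule: the sequence with the most reads is taken as the real barcode.
    """
    if not observed:
        return ("no reads", designed)
    top_count = max(observed.values())
    if observed.get(designed, 0) >= top_count:
        return ("dominant PM", designed)  # perfect matches dominate
    top_seq = max(observed, key=lambda seq: observed[seq])
    return ("barcode error", top_seq)     # an imperfect match (IM) dominates


# Sequences from the slide; the counts are made up for illustration
ok = classify_barcode(
    "GCTGGAGATCCTCAAAGTCAT",
    {"GCTGGAGATCCTCAAAGTCAT": 950, "GCTGGAGATCCTCAAAGTCAA": 12},
)
err = classify_barcode(
    "GAATCTGCCACTCTCAGAATA",
    {"GAATCTGCCACTCTCAGAATA": 3, "AATCTGCCACTCTCAGAATA": 880},
)
print(ok)
print(err)
```

A production rule would likely require a minimum read depth and a clear margin between PM and IM counts before calling an error.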
Inclusion of barcode errors
○ Barcoded yeast deletion strains
○ shRNAs (or sgRNAs)
▷ include barcode errors
▷ filtering barcode errors
Why use imperfect matching in shRNAs?
• Increase mapped read counts
• Consider mutated primers (shifts)
• Provide additional information
Barcas supports an option for filtering barcode errors
Except for several libraries, ex) Cellecta library
Feature 3:
Checking similarity between original barcodes
Library reference QC (1/2)
- Barcode errors can potentially be generated during the production of many barcodes.
- If some barcodes are designed similarly and mutations or sequencing errors occur, then it is hard to distinguish errors from true differences.
- Thus, barcodes originally designed to be similar should be separated at the pooling step.
- For this purpose, Barcas reports sequence similarity between barcodes.
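A library check of this kind can be sketched as a pairwise scan that flags barcodes differing in only a few bases. The example counts substitutions only (Hamming distance) between equal-length barcodes, which is a simplification: it ignores indels and is not the Barcas comparison itself. `hamming` and `similar_pairs` are hypothetical names and the library is a toy set.

```python
from itertools import combinations
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(barcodes: List[str], max_mismatches: int = 1) -> List[Tuple[str, str, int]]:
    """Flag every pair of equal-length barcodes within max_mismatches substitutions."""
    flagged = []
    for a, b in combinations(barcodes, 2):
        if len(a) == len(b):
            distance = hamming(a, b)
            if distance <= max_mismatches:
                flagged.append((a, b, distance))
    return flagged


# Toy library, not a real barcode collection
LIBRARY = ["AGCT", "TTAT", "TTAG", "TCAGT", "GCAG", "GCCAA", "CGCT"]
for a, b, d in similar_pairs(LIBRARY, max_mismatches=1):
    print(f"{a} ~ {b}: {d} mismatch(es)")
```

The scan is quadratic in the number of barcodes, so for libraries with tens of thousands of barcodes an indexed search would be preferable to this direct loop.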
Library reference QC (2/2)
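Screen | Library | Date | Species | Module | Barcode length | Barcode count | Gene count | Reference | Static sequence comparison | Dynamic sequence comparison
shRNA | TRC | 05/Apr/11 | Human | - | 21-bp | 61,621 | 15,435 | http://www.broadinstitute.org/rnai/public/ | 790 (1.28 %) | 1,909 (3.10 %)
shRNA | Cellecta | 15/Feb/12 | Human | Module1 | 18-bp | 27,500 | 5,046 | https://www.cellecta.com/ | 0 (0 %) | 412 (1.5 %)
shRNA | Cellecta | 15/Feb/12 | Human | Module2 | 18-bp | 27,500 | 5,421 | https://www.cellecta.com/ | 0 (0 %) | 398 (1.45 %)
shRNA | Cellecta | 15/Feb/12 | Human | Module3 | 18-bp | 27,500 | 4,923 | https://www.cellecta.com/ | 0 (0 %) | 410 (1.49 %)
sgRNA | Yusa | - | Mouse | - | 19-bp | 87,437 | 19,149 | Koike et al., 2014 | 517 (0.59 %) | 3,944 (4.51 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library A | 20-bp | 63,950 | 21,669 | https://www.addgene.org/crispr/libraries/geckov2/ | 517 (0.81 %) | 538 (0.84 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library B | 20-bp | 56,869 | 19,834 | https://www.addgene.org/crispr/libraries/geckov2/ | 437 (0.77 %) | 441 (0.78 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library A | 20-bp | 65,959 | 22,486 | https://www.addgene.org/crispr/libraries/geckov2/ | 736 (1.12 %) | 755 (1.14 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library B | 20-bp | 61,139 | 21,263 | https://www.addgene.org/crispr/libraries/geckov2/ | 850 (1.39 %) | 860 (1.41 %)
Deletion mutant strains | Heterozygous diploid | - | Saccharomyces cerevisiae | - | 20-bp | 6,318/UP, 6,126/DN | 6,131 | http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html | 0 (0 %) | 0 (0 %)
Deletion mutant strains | Heterozygous diploid | - | Schizosaccharomyces pombe | - | 20-bp | 4,832/UP, 4,832/DN | 4,832 | Kim,D.U. et al, 2010 | 0 (0 %) | 0 (0 %)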
Conclusion
◎ Barcas is all-in-one software for barcode-seq data analysis, with a few new useful functions for data pre-processing and quality control of barcode libraries.
◎ Improvement points
Memory usage
- The trie data structure consumes more memory as sequences get longer, due to the recursive function.
- Barcas consumes much memory while making intermediate files (.seqmap) from fastq or fasta in the mapping step.
- For example, Barcas needs about 350 MB of memory to load the Yusa library (19-bp, 87,437 barcodes).
Statistical analysis
- Multiple-condition comparison (MAGeCK-VISPR)
- Utilization of metadata (HiTSelect)
Acknowledgement
Dr. Seon-Young Kim, Dr. Jong-Lyul Park and Jeong-Hwan Kim
Aging Research Center of KRIBB Dong-Uk Kim
Chungnam National University Dr. Kwang-Lae Hoe, Dr. 이숙정, Miyoung Nam, 이아름, etc.
Thank you for listening
Comparison of the AGCT sequence with the ACTA sequence
Static comparison vs. Dynamic comparison
Static comparison
Based on the same lengths between sequences
2 bases
Dynamic comparison
Based on the length of a specific sequence
1 base
Input sequence (read): AGCTACTA… (Barcode region followed by Other region)
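The two numbers on this slide can be reproduced with a short sketch, under one plausible reading of the figure: the static comparison takes the edit distance between ACTA and the equally long barcode region AGCT, while the dynamic comparison lets the compared stretch of the read extend into the following region and takes the best result. The function names are hypothetical and this is not the Barcas code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (mismatches, insertions and deletions each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, 1):
        cur = [i]
        for j, base_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                      # base of a not present in b
                cur[j - 1] + 1,                   # base of b not present in a
                prev[j - 1] + (base_a != base_b)  # match or mismatch
            ))
        prev = cur
    return prev[-1]


def static_distance(barcode: str, read: str) -> int:
    """Compare the barcode with a read stretch of exactly the same length."""
    return edit_distance(barcode, read[:len(barcode)])


def dynamic_distance(barcode: str, read: str) -> int:
    """Compare the barcode with every read prefix and keep the smallest distance."""
    return min(edit_distance(barcode, read[:end]) for end in range(len(read) + 1))


READ = "AGCTACTA"
print(static_distance("ACTA", READ))   # 2
print(dynamic_distance("ACTA", READ))  # 1
```

In the dynamic case, ACTA is one deletion away from the prefix AGCTA, because the base following the barcode region happens to complete the match; this is the situation the dynamic sequence comparison is meant to detect.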