research article error analysis of deep sequencing of

14
Hindawi Publishing Corporation Computational and Mathematical Methods in Medicine Volume 2013, Article ID 491612, 13 pages http://dx.doi.org/10.1155/2013/491612 Research Article Error Analysis of Deep Sequencing of Phage Libraries: Peptides Censored in Sequencing Wadim L. Matochko and Ratmir Derda Department of Chemistry and Alberta Glycomics Centre, University of Alberta, Edmonton, AB, Canada T6G 2G2 Correspondence should be addressed to Ratmir Derda; [email protected] Received 6 May 2013; Accepted 30 July 2013 Academic Editor: Yanxin Huang Copyright © 2013 W. L. Matochko and R. Derda. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Next-generation sequencing techniques empower selection of ligands from phage-display libraries because they can detect low abundant clones and quantify changes in the copy numbers of clones without excessive selection rounds. Identification of errors in deep sequencing data is the most critical step in this process because these techniques have error rates >1%. Mechanisms that yield errors in Illumina and other techniques have been proposed, but no reports to date describe error analysis in phage libraries. Our paper focuses on error analysis of 7-mer peptide libraries sequenced by Illumina method. Low theoretical complexity of this phage library, as compared to complexity of long genetic reads and genomes, allowed us to describe this library using convenient linear vector and operator framework. We describe a phage library as N ×1 frequency vector =‖n , where n is the copy number of the th sequence and N is the theoretical diversity, that is, the total number of all possible sequences. Any manipulation to the library is an operator acting on . Selection, amplification, or sequencing could be described as a product of a N × N matrix and a stochastic sampling operator (Sa). e latter is a random diagonal matrix that describes sampling of a library. In this paper, we focus on the properties of Sa and use them to define the sequencing operator (Seq). Sequencing without any bias and errors is Seq = Sa I N , where I N is a N × N unity matrix. Any bias in sequencing changes I N to a nonunity matrix. We identified a diagonal censorship matrix (CEN), which describes elimination or statistically significant downsampling, of specific reads during the sequencing process. 1. Introduction In vitro selection experiments—such as phage display [1, 2], RNA display, SELEX, and DNA aptamer selection [3, 4]— employ large libraries, from which 10 2 –10 6 active sequences are identified through iterative rounds of selection and amplification. With the recent emergence of deep sequencing, it became possible to extract a large amount of information from the libraries before and aſter selection [510]. Deep examination of the library is a promising technique for direct evaluation of binding capacities of all binding sequences from one panning experiment. Deep sequencing also allows the characterization of unwanted phenomena in selection, such as amplification bias [6, 11]. Analysis of 10 6 reads by deep sequencing gave rise to a large number of errors that were not present in the analysis based on the small number of sequences obtained using the Sanger method. Analysis of errors in information-rich datasets is a problem with over 50 years of history; correction of digital data made of bits or words is a topic of intense research in communication theory [12]. As phage display operates with limited digital sets, data analysis techniques from the communication theory could be applied to phage display. For example, Rodi and coworkers used a positional frequency matrix to calculate the informational content or Shannon entropy of each sequence [13]. is approach could be used to distinguish potential fast growing sequences from potential hits [14]. With the introduction of deep sequencing, the problem of error analysis in phage display becomes identical to a classical information theory problem: “reproducing at one point, either exactly or approximately, a message selected at another point” [15]. e “message” is the sequence information stored in the library. Sequencing process transmits this information and makes either stochas- tic or predictable errors. Understanding the sources of errors during sequencing could provide mechanisms for bypassing

Upload: others

Post on 23-Mar-2022

5 views

Category:

Documents


0 download

TRANSCRIPT

Hindawi Publishing CorporationComputational and Mathematical Methods in MedicineVolume 2013 Article ID 491612 13 pageshttpdxdoiorg1011552013491612

Research ArticleError Analysis of Deep Sequencing of Phage LibrariesPeptides Censored in Sequencing

Wadim L Matochko and Ratmir Derda

Department of Chemistry and Alberta Glycomics Centre University of Alberta Edmonton AB Canada T6G 2G2

Correspondence should be addressed to Ratmir Derda ratmirualbertaca

Received 6 May 2013 Accepted 30 July 2013

Academic Editor Yanxin Huang

Copyright copy 2013 W L Matochko and R Derda This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Next-generation sequencing techniques empower selection of ligands from phage-display libraries because they can detect lowabundant clones and quantify changes in the copy numbers of clones without excessive selection rounds Identification of errors indeep sequencing data is the most critical step in this process because these techniques have error rates gt1 Mechanisms that yielderrors in Illumina and other techniques have been proposed but no reports to date describe error analysis in phage libraries Ourpaper focuses on error analysis of 7-mer peptide libraries sequenced by Illumina method Low theoretical complexity of this phagelibrary as compared to complexity of long genetic reads and genomes allowed us to describe this library using convenient linearvector and operator framework We describe a phage library as Ntimes1 frequency vector 119899 = n

119894 where n

119894is the copy number of the

119894th sequence and N is the theoretical diversity that is the total number of all possible sequences Any manipulation to the library isan operator acting on 119899 Selection amplification or sequencing could be described as a product of a N timesNmatrix and a stochasticsampling operator (Sa) The latter is a random diagonal matrix that describes sampling of a library In this paper we focus on theproperties of Sa and use them to define the sequencing operator (Seq) Sequencingwithout any bias and errors is Seq = Sa IN whereIN is a N times N unity matrix Any bias in sequencing changes IN to a nonunity matrix We identified a diagonal censorship matrix(CEN) which describes elimination or statistically significant downsampling of specific reads during the sequencing process

1 Introduction

In vitro selection experimentsmdashsuch as phage display [1 2]RNA display SELEX and DNA aptamer selection [3 4]mdashemploy large libraries from which 102ndash106 active sequencesare identified through iterative rounds of selection andamplificationWith the recent emergence of deep sequencingit became possible to extract a large amount of informationfrom the libraries before and after selection [5ndash10] Deepexamination of the library is a promising technique for directevaluation of binding capacities of all binding sequences fromone panning experiment Deep sequencing also allows thecharacterization of unwanted phenomena in selection suchas amplification bias [6 11]

Analysis of 106 reads by deep sequencing gave rise to alarge number of errors that were not present in the analysisbased on the small number of sequences obtained usingthe Sanger method Analysis of errors in information-rich

datasets is a problem with over 50 years of history correctionof digital data made of bits or words is a topic of intenseresearch in communication theory [12] As phage displayoperates with limited digital sets data analysis techniquesfrom the communication theory could be applied to phagedisplay For example Rodi and coworkers used a positionalfrequency matrix to calculate the informational content orShannon entropy of each sequence [13] This approach couldbe used to distinguish potential fast growing sequencesfrom potential hits [14] With the introduction of deepsequencing the problem of error analysis in phage displaybecomes identical to a classical information theory problemldquoreproducing at one point either exactly or approximatelya message selected at another pointrdquo [15] The ldquomessagerdquo isthe sequence information stored in the library Sequencingprocess transmits this information andmakes either stochas-tic or predictable errors Understanding the sources of errorsduring sequencing could provide mechanisms for bypassing

2 Computational and Mathematical Methods in Medicine

them for correcting the errors and for maximizing theamount of useful information received from sequencing

There are over 10000 published literature reports thatcontain the terms ldquodeep sequencingrdquo or ldquonext generationsequencingrdquo or any of the trademark names such as ldquoIllu-minardquo (reference ISI database) Among these reports lessthan 10 published reports describe sequencing of phage-displayed libraries [5ndash7 9 10 16ndash19] Deep sequencing effortsin the literature are largely focused on genome assembly andmetagenomic analysesThe error analysis techniques tailoredfor genome assembly cannot be used directly for analysis ofphage libraries because the data output from phage librarysequencing is very different from the genome assembly Ingenome assembly genomic DNA is shredded into randomfragments and sequenced The genome is then assembledfrom these fragments in silico Although multiple fragmentscover each area of the genome the probability to observetwo identically shredded fragments is very small Two exactsequences thus could be considered amplification artifactsand removed by error analysis software On the contraryin phage-display sequencing the reads are exactly of thesame length Duplication of the same read is important forvalidation of the accuracy of this read Some researchers focusexclusively on reads that have been observed multiple timesand discard singleton reads as erroneous [5] Within eachlibrary the copy numbers of sequences range continuouslyby six or more orders of magnitude [5 6 9] Some phageclones are observed in the entire library only a few timesother clones could be present at copy number of 100000 persequencing run [5 6 9] Unlike multiple cells with identicalgenomes each screen is unique identical set of sequenceswith identical copy numbers cannot be obtained even if thescreen is repeated due to stochastic number of the screen thatcontains low copy number of binding clones [20]

Metagenomic analyses of microorganisms recoveredfrom environmental samples [21 22] also known as ldquomicro-biomerdquo [23] and ldquoviriomerdquo analyses [24] encountered similarproblems to those observed in phage library analysis theconcentration of species observed in a particular sampleis unequal [25] The abundance of species might range bya few orders of magnitude [26] It is possible that erroranalysis tools developed in the above areas could find usein phage display sequencing For example there are multiplepublished algorithms for removing errors from low copynumber reads to ascertain that low copy number sequencesare new species and not sequencing errors (eg see [27ndash29] and references within) Metagenomic analysis is usu-ally more complex than analysis of phage-display librariesFirst in metagenomics the bacterial or viral genes mustbe assembled from short reads de novo Second there isno simple relationship between phylogenetic classification ofldquospeciesrdquo and the observed DNA sequence Third the exactnumber of species in the environment is unknown On theother hand sequencing of phage-displayed peptide librarieshas none of these problems (i) it requires no assemblysteps because each sequence is covered by one read (ii) auniqueDNA sequence defines a unique ldquospeciesrdquo and (iii) thetheoretical complexity in synthetic libraries is known exactlyFor small libraries such as the library of 7-mer peptides

the complexity (20)7 is within the reach of next-generationsequencing We see phage-displayed peptide libraries as anideal model playground for the development of optimalerror analysis and error correction protocols It is possiblethat error analysis developed from phage libraries analysiscould then be used in other areas such as genomic andmetagenomic analyses

The errors in sequencing could be divided into ldquoanno-tatedrdquo and ldquoinvisiblerdquo The ldquoannotatedrdquo errors that originatefrom misincorporation of nucleotides are annotated usingPhred quality score [30]These annotated errors are removedduring the processing (see below) Examples of ldquoinvisiblerdquoerrors are sequence-specific frame shifts that lead to emer-gence of truncated reads during the Illumina sequencing [31]Invisible errors could also originate during the preparation ofthe libraries for sequencing Examples are removal of AT-richfragments during purification of dsDNA [32] and erroneousincorporation of nucleotides during PCR [33 34] Mutationshave the most significant impact on the observed diversity ofthe libraryThere are 63 ways to misspell a 21-mer-nucleotidesequence with a one-letter error (point mutation) The largedynamic range in concentrations of clones in the phagelibrary exacerbates the problem Clones that are present inhigh abundancemdash105 copies per readmdashare more prone toyield errors [6] For example we observed that random pointmutations convert several short sequencewith a copy numberof 105 to a library of sequences with copy numbers rangingfrom 1 to 102 [11] In attempt to unify error analysis into oneconvenient theoretical framework we generalized all errorsas follows All errors either lead to disappearance of particularsequence or its conversion to another sequence of the samelength Errors thus operate within a finite sequence spaceand it should be possible to use elementary linear algebra togeneralize most processes that lead to errors

2 Theoretical Description

See Table 1

21 Operator Description of the Phage-Display Library andSelection Process In our previous reports we described thephage library as a multiset or a set in which members canappear more than once [35] This description also simplifiesthe analysis of the errors in these libraries The multisetdescription represents a library with N theoretical membersas an ordered set of N sequences and N times 1 copy numbervector (119899) with positive integer copy numbers (Figure 1(a))Any manipulation of a phage librarymdashsuch as erroneousreading or selectionmdashchanges the numbers within the copynumber vector All manipulations to the multiset thus couldbe described by operators (Op) that convert vector 119899

1to

another vector 1198992as 1198992= Op 119899

1(Figure 1(c)) For an N times 1

vector the operator is NtimesNmatrix If elements are selected oreliminated independently of one another the NtimesNmatrix isdiagonal (Figure 1(d)) This approach is uniquely convenientfor libraries of short reads For example a library of 7-merscontains exactly 207 = 128 times 109 peptides and is described

Computational and Mathematical Methods in Medicine 3

Table 1 Symbols and definitions used in the theoretical description section

Symbols MeaningA a f m n k Unless specified otherwise normal font designates scalars119860 119886119873 119875 1n 13n Italic font designates vectors Different vectors can be distinguished by the left-superscript notationA a Abc Pan Sa Bold font designates operators or matrices (here all operators are matrices)1AfSa09Sa05Sa Operators can be distinguished by the left-superscript notation For sampling operator Sa this

notation specifies the sampling fraction of the Sa operatorA1 a2 A119894 a119895 Normal font with right subscript designates scalar values of the vectorA11 A21 Aij Aii Normal font with two right subscripts designates scalar values of the 2D matrix1003817100381710038171003817A1 A5

1003817100381710038171003817

Description of the scalar elements in the vector10038171003817100381710038171003817Aij Aii

10038171003817100381710038171003817

Description of the scalar elements in the matrixx isin [A B] Scalar x belongs to the inclusive scalar interval [A B] that is A le x le B119909 isin [119860119861] Vector x belongs to the ldquovector intervalrdquo [119860119861] that is for every element Ai le xi le Bi

ABC X Set where A B C X are the unique elements of the set

A(a) B(b) X(x) Multiset (2-tuple) where A B X are the unique elements and a b x are the scalars describing thecopy numbers of the A B X elements

IN Unity matrix of the Nth order that is N times Nmatrix 10038171003817100381710038171003817Aij10038171003817100381710038171003817 Aij = 120575ij (Kronecker delta)

completely using a 109-element vector This size is accessibleto the computational capacity of most desktop computers

In operator notation phage display can be described as

Sel = Pan Naive (1)

where Naive is the copy number vector for naıve librarySel is the copy number vector after panning and Pan isa panning operator In standard phage display the Panoperator is a complex product of all manipulation steps(binding amplification dilutions etc) If a screen uses noamplification and uses deep-sequencing [9 16] or large-scaleSanger sequencing [36 37] to analyze the enrichment itmight be possible to define the panning process as a simpleproduct of two operators as follows

Pan = fSa Ka (2)

Sel = fSa Ka Naive (3)

where Ka is a deterministic ldquoassociationrdquo operator whichcontains association constants for every phage clone presentin the library Description of such operator is beyond thescope of this paper and we recommend consulting otherreports that attempted to generalize the selection procedure[20] Another operator in (3) is a sampling operator ( fSa)which describes stochastic sampling of the library with msequences to yield a sublibrarywith flowastm-sequences where f isin[0 1] is a sampling fraction fSa operator has the followingproperties which emanate from physical properties of thesampling procedure

(I)

fSai 0 = 0 (sampling does not create new

sequences from nonexisting sequences) (4)

(II) fSa is a diagonal operator with diagonal scalar func-tions Sa

11Sa22sdot sdot sdot SaNN Sai(0) = 0

(III) In 119861 = fSa 119860 119861 is a vector of positive integersB119894ge 0 and sum(119861) = flowastsum(119860) Integer values

ensure that the observable values of the operator havephysical meaning The clone could be observed once(1) multiple times (2 3 etc) or not observed at all(0)

(IV) Sa is nondeterministic operator When applied to thesame vector the operator does not yield the sameresult but one of the possible vectors that satisfy rules(IndashIII) The majority of the solutions of the operatorhowever reside within a deterministic confidenceinterval fSa 119860 isin [ lo119862 hiC]

(V) As a consequence from (IV) operator Sa is nonlinearnoncommutative and nondistributive

(VI) Large sum of sampling operators with same f shouldldquoaverage outrdquo to yield IN unity matrix

(fSa1+

fSa2+

fSa3+ sdot sdot sdot

fSak)k

997888rarr flowastIN

as k 997888rarr infin

(5)

The Sa operator is simple to implement as a randomarray indexing function in any programming language (egsee Supplementary Schemes S1 and S2 available online athttpdxdoiorg1011552013491612) It might be possibleto express fSa analytically for any f as a diagonal matrix(Figure 1(d)) In this paper we use numerical treatment byan array sampling function because it is more convenient formultisets of general structure We tested the random index-ing implementation to show that the sampling algorithmyields a normal distribution for a large number of samples(Supplementary Figure S1) Despite the simplicity of fSa

4 Computational and Mathematical Methods in Medicine

Any phage library is a multiset defined by sequence set and copy number vector

Lib = [S n] middot middot middot middot middot middot

n1n2nN

N number of all possiblesequences in the libraryni copy number of theith sequence in the library

Lib = [S = sequence1sequence2 n = ]sequenceN

(a)

Total (sum) and unique sequences (uni) 11

1middot middot middotsum(n) = n middot |1| =

|1| =Nsumi=1

ni

uni(n) =Nsumi=1

1 minus 120575[ni] 0 n ne 0 delta120575[n] = 1 n = 0 Kronecker

(b)

middot middot middot

Any manipulation of library is an operator acting on the ni vector (copy number vector)Example sampling of the library

1n =

1n11n2

1nN

middot middot middot

2n =

2n12n2

2nN1Lib = [S 1n]

2Lib = [S 2n]

T1 total sequencesT2 total sequences

Sum(1n) = T1 Sum(2n) = sum(Sa 1n) = T2

2n = Sa 1

n

Sa) acts on 1n vectorSampling operator (

Some elementspresent in 1st

library are ldquogonerdquofrom the 2nd library

2ni = 0

1ni 0ne

(c)

rnrn

rn

0 00 0

0 0

Example sampling operator

Properties of Sa

middot middot middotmiddot middot middot

middot middot middot

Sa(n) = round T1

T2

N Irndn where N Irnd =

k

Operators are N times N matrices Random N times Ndiagonal matrix

(1) Sa(middot middot middot(Sa(Sa(n)) )middot middot middot) asymp Sa(n)

sum Sa(n) rarr infinn as k rarr(2) 1

k

((

(d)

amplificationSynthesis ofnucleotides

Theoretical library

Example each random

Ligation

Sequencing

Panning

Naive librarySynthetic library Selected (panned) library

General problem of selection

Sequencing

Multiset description of the phage display selection

transformationgrowth

nucleotide is present once

T = [S 1]

The ith sequence Sihas copy number syn ni

enriched sequences x i gt 1 depleted sequences x i lt 1

P and N cannot be measured They have to be approximated from the ldquoobservablesrdquo

From P and N find enrichment E = [S x] x e R+obs

P = [S panobsn]

P = [S pann]N = [S naive

n]

S = [S synn]

obsN = [S naive

obsn]

(e)

Figure 1 (a) Phage library can be described by multisets made of S = sequence set and 119899 = vector of copy numbers Any change to thelibrary can be described as functionoperator acting on the 119899 (b) Relevant functions are calculations of total sequences (sum) and uniquesequences (uni) (c) Any transformation of library to another library is an operator acting on 119899 Sampling of libraries to yield a sublibrary isthe most important operator (d) It can be described as N times N matrix Specifically Sa is a diagonal matrix of values derived from randomdistribution Rounding function is necessary to ensure the physical meaning of the sampling results Sa acting on the same vector yieldsone of many vectors that have the same number of total elements As a consequence Sa is nonlinear nondistributive and noncommutativeoperator Average of many Sa operators is a scalar (dilution factor) (e) Any screen of any library can be described as operators acting onthe copy number vectors of the naıve (or theoretical) library Copy number vectors cannot be observed directly They have to be measuredthrough sequencing As sequencing contains sampling process (Sa operator) the result of sequencing is nondeterministic Sequencing yieldsone of many possible observed copy number vectors none of which are equal to the real copy number vector

implementationmdashthe entire code is lt30 lines in MatLabmdashthe script allows rapid calculation of the results of fSa for amultiset of reasonable size (severalmillion sequences Figures4 and 5) on a desktop computer

We evaluated the behaviors of 05Sa for several multisetsThe probability to observe a specific solution is describedin Figure 3(b) Individual solutions can be represented aslines with nodes on 119883119884-plane where each node represents

one element of the multiset (Figures 3(d) and 3(e)) Themost probable solutions reside near the ldquoexpected solutionrdquo(represented as dotted line) and the probability to observea solution where many elements deviate from the probablesolution is low (Figure 3(e)) Graphical representation of thesolutions highlights that sampling could lead to deviationof the frequency of the individual elements of the multisetfor example Figure 3(e) describes gt2 fold deviation from

Computational and Mathematical Methods in Medicine 5

Library of phage

Sampling

Growthof phage

Isolationof DNA

PCRof DNA

Sequencing Analysis of

Sa Gr

Operator description of each step

Sa Is Pc Sa Se AnSa

of DNA sequencing

(observed)Accepted

sequencing data

realN = [S real

n]

obs N = [S obs n]

(a)

Relation between observed copy numbers and ldquorealrdquo copy numbers

Analysisoperator

obs n =f 5Sa An f 4Sa Seq f 3Sa PCR f 2Sa Is f1Sa Gr real n

Growth bias betoassumedandneglected(usually IN )Isolation bias betoassumedandneglected(usually IN )

PCR bias betoassumedandneglected(usually IN )Sequencing bias betoassumedandneglected(usually IN )

(b)

10

0

0 00 0

0 0

0011

0

0

0

Analysis operator is a matrix that could compensate for problems and censorships caused by Se Pc Is Gr operators

f5Sa An (f4Sa Seq f3Sa PCR f2Sa Is f1Sa Gr real ni) sim Sa real ni

Seq 1 is accepted as isSeq 2 is considered a typoand renamed to seq 3Seq 3 is accepted as isSeq N is deleted (possible error)

middot middot middot

middot middot middot

middot middot middot

An =

(c)

Figure 2 Operator description of the deep sequencing process (a) A library of phage must be processed before deep sequencing Each stepinvolves sampling which is either a deliberate partitioning of the sample or random loss of the sample Each sample preparation state could(and does) introduce bias in sequence abundance Each step thus is an operator chat changeing the 119899 vector (b) If we ignore bias duringpreparation operators could be approximated as unity vectors and sequencing could be represented as a product of sampling and analysisoperators (c) Analysis operator (An) is a binary decision matrix which describes what sequences are and are not considered as errorsDecisions such as removal of sequences or correction of sequences are the most important because they decide which ldquoobservedrdquo sequencesare considered ldquorealrdquo To make the analysis of the selection process meaningful the same An operator should be used in all analyses

the expected value for one of the elements Figure 3(f) showsthat the solution in which two elements deviate by gt2 foldis improbable This observation is a simple consequence ofthe multiplicity of the probabilities (large deviation from theaverage has probability 119901 and the probability to observe thisdeviation twice is 1199012)

Even in small multisets such as A(1) B(2) C(3) D(4)made of four unique and 10 total elements05Sa A(1) B(2) C(3) D(4) operation yields large numberof solutions with equal probability termed as redundantsolutions (eg solutions that have equal probability inFigure 3(b)) Redundancy depends on the structure of themultiset (Figure S2)This redundancy makes the calculationsof all probable solutions of Sa impractical For sets even with5-6 unique elements identification of all vectors 119861 whichsatisfy equation 119861 = fSa 119860 and reside within a 95 intervalrequires hundreds of thousands of iterations (Figure S2and S3) On the other hand calculation of the confidenceinterval of each element 119861i of the vector 119861 converges rapidlyA multiset A

1000 = A

1(1) A2(2) sdot sdot sdotA

1000(1000) with

1000 unique elements and 1 + 2 + 3 + sdot sdot sdot + 1000 = 500 500total elements is similar to an average deep sequencingdata set (Figure 4) Calculation of all probable solutions of05Sa A

1000 is beyond the capabilities of most computers

However the 999 confidence interval of all elements ofvector 119861 = 05Sa A

1000 can be calculated in sim2 minutes on

an average desktop computer The red dots in Figure 4 arelo119862119894and hi119862119894or the 999 high and low confidence interval

of all elements Bi (Figure 4)

The sampling operator is critical in phage display becausesampling of libraries occurs in every step of the selection andthe preparation of libraries for sequencing The stochasticnature of sampling operators makes two identical screensldquosimilar within a confidence intervalrdquo Solving (1) exactly isnot possible but it should be possible to estimate the solutionwithin a confidence interval Consider

Sel isin [ loKa Naive hiKa Naive] (6)

where loKa and hiKa are diagonal matrices of the upperand lower confidence intervals for the association constantsA simulation of the behavior of the Sa operator (Figures3 and S3) suggests that the relative sizes of the confidenceintervalsmight be impractically largewhen the copy numbersof sequences are lt10

Multiple sampling events of the Sa operator yield anormal distribution for each element of the vector (Figure 3)Fitting this normal distribution could yield a ldquotruerdquo valueof the process This process is identical to the extrapolationof the average from the normal distribution of noisy dataMultiple algorithms for such extrapolation exist for one- andmultidimensional stochastic processes [38 39] We believethat Sa behaves as a one-dimensional stochastic process andit might be possible to extrapolate the true value of thesampling from 7 to 10 repeated instances of Sa (ie thenumber of data sufficient to fit an 1D normal distribution)The necessary practical steps towards solving (3) or (10)are the following (i) Eliminate or account for any bias notrelated to binding (eg growth bias) (ii) Repeat the screen

6 Computational and Mathematical Methods in Medicine

Probability to observe specific solution

BB C

C

C

AD

DD

D B CCD

D[0 1 2 2]A B C D

[1 2 3 4]A B C D Sa

95

confi

denc

e int

erva

l

0 2 3 01 0 0 40 1 0 41 1 3 00 0 1 41 2 2 00 2 0 31 0 3 10 0 3 21 2 0 20 1 3 11 1 0 31 2 1 10 0 2 31 0 1 30 2 2 11 0 2 20 2 1 21 1 2 10 1 1 30 1 2 21 1 1 2

0 002 004 006 008 01 012 014

Copy number before

Cop

y nu

mbe

r afte

r

0

1

2

3

4

(a)

(c) (d)

(e)

(h)(g) (i)

(f)(b)

1 2 3 4

Probability to observe specific solution0 002 004 006 008 01 012 014

100000 trials

Sa50

1

6385

5000 trials2957

7

42

1 2 3 4 5 6 7 8012345678

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

One of the solutions

[1 2 3 4 5 6 7 8]A B C D E F G H

1 2 3 4 5 6 7 8

012345678

Cop

y nu

mbe

r afte

r the

sam

plin

g

Number of data per point

Copy number before the sampling

50

BB

CCCADD

DD H HH

G

HHH

HHGG G

GGG

EE

EE

E

E

FF

FF

FF

B

CAD DH

G

HHH

GG

E EE

FF

F

[1 1 1 2 3 3 3 4]A B C D E F G H

Sa 50

Most probable solution

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

[0 2 3 0]A B C D

[1 1 1 2]A B C D

[0 2 1 2]A B C D

the s

ampl

ing

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

the sampling

Copy number before the sampling

Copy number before the sampling

Copy number before the sampling

gt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

solutions

P = 014

P lt 0001

P = 007

after 5000 trials

Figure 3 (a) Testing the sampling operator implemented as random indexing function using a model multiset (b) In 100000 trials weobserved 22 unique solutions from which 14 resided in a 95 confidence interval Solutions with 0 and 1 copies of element A were foundat equal abundances (ldquoredundant solutionsrdquo) (c) Representation of the most probable solution as a line with 4 nodes ldquo119901rdquo is a probabilityto find the solution dotted line is an expected ldquoaverage solutionrdquo for 50 sampling (d) The 5th most probable solution (e) least probablesolution deviates the most form the average (f) combination of all solutions Red thick lines describe the most probable solutions thin bluelines describe the least probable solutions (g) Sampling of larger multisets yields more possible solutions (here 2957 in 5000 trials) (h)All solutions of the sampling represented as lines (i) Probability to observe a particular copy number after sampling While (h) is the mostaccurate representations of the confidence intervals the thin blue lines describe solutions outside the confidence interval this representation isimpractical due to large number of redundant solutions in largermultisets In (i) confidence interval could be extrapolated from distributionsof individual copy numbers (e) red dots are on or outside the confidence interval

several times (iii) Measure all copy numbers of all sequencesincluding zero values with high confidence Requirement (i)has been an ongoing effort in our group [11 40] and othergroups [13 41ndash43] for review see [11 44] Deep sequencing

makes it simple to satisfy requirement (ii) and obtainmultipleinstances of the same experiment For example we describedthe Illumina sequencing method that allows using barcodedprimers to sequence 18 unrelated experiments in one deep

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

2 Computational and Mathematical Methods in Medicine

them for correcting the errors and for maximizing theamount of useful information received from sequencing

There are over 10000 published literature reports thatcontain the terms ldquodeep sequencingrdquo or ldquonext generationsequencingrdquo or any of the trademark names such as ldquoIllu-minardquo (reference ISI database) Among these reports lessthan 10 published reports describe sequencing of phage-displayed libraries [5ndash7 9 10 16ndash19] Deep sequencing effortsin the literature are largely focused on genome assembly andmetagenomic analysesThe error analysis techniques tailoredfor genome assembly cannot be used directly for analysis ofphage libraries because the data output from phage librarysequencing is very different from the genome assembly Ingenome assembly genomic DNA is shredded into randomfragments and sequenced The genome is then assembledfrom these fragments in silico Although multiple fragmentscover each area of the genome the probability to observetwo identically shredded fragments is very small Two exactsequences thus could be considered amplification artifactsand removed by error analysis software On the contraryin phage-display sequencing the reads are exactly of thesame length Duplication of the same read is important forvalidation of the accuracy of this read Some researchers focusexclusively on reads that have been observed multiple timesand discard singleton reads as erroneous [5] Within eachlibrary the copy numbers of sequences range continuouslyby six or more orders of magnitude [5 6 9] Some phageclones are observed in the entire library only a few timesother clones could be present at copy number of 100000 persequencing run [5 6 9] Unlike multiple cells with identicalgenomes each screen is unique identical set of sequenceswith identical copy numbers cannot be obtained even if thescreen is repeated due to stochastic number of the screen thatcontains low copy number of binding clones [20]

Metagenomic analyses of microorganisms recoveredfrom environmental samples [21 22] also known as ldquomicro-biomerdquo [23] and ldquoviriomerdquo analyses [24] encountered similarproblems to those observed in phage library analysis theconcentration of species observed in a particular sampleis unequal [25] The abundance of species might range bya few orders of magnitude [26] It is possible that erroranalysis tools developed in the above areas could find usein phage display sequencing For example there are multiplepublished algorithms for removing errors from low copynumber reads to ascertain that low copy number sequencesare new species and not sequencing errors (eg see [27ndash29] and references within) Metagenomic analysis is usu-ally more complex than analysis of phage-display librariesFirst in metagenomics the bacterial or viral genes mustbe assembled from short reads de novo Second there isno simple relationship between phylogenetic classification ofldquospeciesrdquo and the observed DNA sequence Third the exactnumber of species in the environment is unknown On theother hand sequencing of phage-displayed peptide librarieshas none of these problems (i) it requires no assemblysteps because each sequence is covered by one read (ii) auniqueDNA sequence defines a unique ldquospeciesrdquo and (iii) thetheoretical complexity in synthetic libraries is known exactlyFor small libraries such as the library of 7-mer peptides

the complexity (20)7 is within the reach of next-generationsequencing We see phage-displayed peptide libraries as anideal model playground for the development of optimalerror analysis and error correction protocols It is possiblethat error analysis developed from phage libraries analysiscould then be used in other areas such as genomic andmetagenomic analyses

The errors in sequencing could be divided into ldquoanno-tatedrdquo and ldquoinvisiblerdquo The ldquoannotatedrdquo errors that originatefrom misincorporation of nucleotides are annotated usingPhred quality score [30]These annotated errors are removedduring the processing (see below) Examples of ldquoinvisiblerdquoerrors are sequence-specific frame shifts that lead to emer-gence of truncated reads during the Illumina sequencing [31]Invisible errors could also originate during the preparation ofthe libraries for sequencing Examples are removal of AT-richfragments during purification of dsDNA [32] and erroneousincorporation of nucleotides during PCR [33 34] Mutationshave the most significant impact on the observed diversity ofthe libraryThere are 63 ways to misspell a 21-mer-nucleotidesequence with a one-letter error (point mutation) The largedynamic range in concentrations of clones in the phagelibrary exacerbates the problem Clones that are present inhigh abundancemdash105 copies per readmdashare more prone toyield errors [6] For example we observed that random pointmutations convert several short sequencewith a copy numberof 105 to a library of sequences with copy numbers rangingfrom 1 to 102 [11] In attempt to unify error analysis into oneconvenient theoretical framework we generalized all errorsas follows All errors either lead to disappearance of particularsequence or its conversion to another sequence of the samelength Errors thus operate within a finite sequence spaceand it should be possible to use elementary linear algebra togeneralize most processes that lead to errors

2 Theoretical Description

See Table 1

21 Operator Description of the Phage-Display Library andSelection Process In our previous reports we described thephage library as a multiset or a set in which members canappear more than once [35] This description also simplifiesthe analysis of the errors in these libraries The multisetdescription represents a library with N theoretical membersas an ordered set of N sequences and N times 1 copy numbervector (119899) with positive integer copy numbers (Figure 1(a))Any manipulation of a phage librarymdashsuch as erroneousreading or selectionmdashchanges the numbers within the copynumber vector All manipulations to the multiset thus couldbe described by operators (Op) that convert vector 119899

1to

another vector 1198992as 1198992= Op 119899

1(Figure 1(c)) For an N times 1

vector the operator is NtimesNmatrix If elements are selected oreliminated independently of one another the NtimesNmatrix isdiagonal (Figure 1(d)) This approach is uniquely convenientfor libraries of short reads For example a library of 7-merscontains exactly 207 = 128 times 109 peptides and is described

Computational and Mathematical Methods in Medicine 3

Table 1 Symbols and definitions used in the theoretical description section

Symbols MeaningA a f m n k Unless specified otherwise normal font designates scalars119860 119886119873 119875 1n 13n Italic font designates vectors Different vectors can be distinguished by the left-superscript notationA a Abc Pan Sa Bold font designates operators or matrices (here all operators are matrices)1AfSa09Sa05Sa Operators can be distinguished by the left-superscript notation For sampling operator Sa this

notation specifies the sampling fraction of the Sa operatorA1 a2 A119894 a119895 Normal font with right subscript designates scalar values of the vectorA11 A21 Aij Aii Normal font with two right subscripts designates scalar values of the 2D matrix1003817100381710038171003817A1 A5

1003817100381710038171003817

Description of the scalar elements in the vector10038171003817100381710038171003817Aij Aii

10038171003817100381710038171003817

Description of the scalar elements in the matrixx isin [A B] Scalar x belongs to the inclusive scalar interval [A B] that is A le x le B119909 isin [119860119861] Vector x belongs to the ldquovector intervalrdquo [119860119861] that is for every element Ai le xi le Bi

ABC X Set where A B C X are the unique elements of the set

A(a) B(b) X(x) Multiset (2-tuple) where A B X are the unique elements and a b x are the scalars describing thecopy numbers of the A B X elements

IN Unity matrix of the Nth order that is N times Nmatrix 10038171003817100381710038171003817Aij10038171003817100381710038171003817 Aij = 120575ij (Kronecker delta)

completely using a 109-element vector This size is accessibleto the computational capacity of most desktop computers

In operator notation phage display can be described as

Sel = Pan Naive (1)

where Naive is the copy number vector for naıve librarySel is the copy number vector after panning and Pan isa panning operator In standard phage display the Panoperator is a complex product of all manipulation steps(binding amplification dilutions etc) If a screen uses noamplification and uses deep-sequencing [9 16] or large-scaleSanger sequencing [36 37] to analyze the enrichment itmight be possible to define the panning process as a simpleproduct of two operators as follows

Pan = fSa Ka (2)

Sel = fSa Ka Naive (3)

where Ka is a deterministic ldquoassociationrdquo operator whichcontains association constants for every phage clone presentin the library Description of such operator is beyond thescope of this paper and we recommend consulting otherreports that attempted to generalize the selection procedure[20] Another operator in (3) is a sampling operator ( fSa)which describes stochastic sampling of the library with msequences to yield a sublibrarywith flowastm-sequences where f isin[0 1] is a sampling fraction fSa operator has the followingproperties which emanate from physical properties of thesampling procedure

(I)

fSai 0 = 0 (sampling does not create new

sequences from nonexisting sequences) (4)

(II) fSa is a diagonal operator with diagonal scalar func-tions Sa

11Sa22sdot sdot sdot SaNN Sai(0) = 0

(III) In 119861 = fSa 119860 119861 is a vector of positive integersB119894ge 0 and sum(119861) = flowastsum(119860) Integer values

ensure that the observable values of the operator havephysical meaning The clone could be observed once(1) multiple times (2 3 etc) or not observed at all(0)

(IV) Sa is nondeterministic operator When applied to thesame vector the operator does not yield the sameresult but one of the possible vectors that satisfy rules(IndashIII) The majority of the solutions of the operatorhowever reside within a deterministic confidenceinterval fSa 119860 isin [ lo119862 hiC]

(V) As a consequence from (IV) operator Sa is nonlinearnoncommutative and nondistributive

(VI) Large sum of sampling operators with same f shouldldquoaverage outrdquo to yield IN unity matrix

(fSa1+

fSa2+

fSa3+ sdot sdot sdot

fSak)k

997888rarr flowastIN

as k 997888rarr infin

(5)

The Sa operator is simple to implement as a randomarray indexing function in any programming language (egsee Supplementary Schemes S1 and S2 available online athttpdxdoiorg1011552013491612) It might be possibleto express fSa analytically for any f as a diagonal matrix(Figure 1(d)) In this paper we use numerical treatment byan array sampling function because it is more convenient formultisets of general structure We tested the random index-ing implementation to show that the sampling algorithmyields a normal distribution for a large number of samples(Supplementary Figure S1) Despite the simplicity of fSa

4 Computational and Mathematical Methods in Medicine

Any phage library is a multiset defined by sequence set and copy number vector

Lib = [S n] middot middot middot middot middot middot

n1n2nN

N number of all possiblesequences in the libraryni copy number of theith sequence in the library

Lib = [S = sequence1sequence2 n = ]sequenceN

(a)

Total (sum) and unique sequences (uni) 11

1middot middot middotsum(n) = n middot |1| =

|1| =Nsumi=1

ni

uni(n) =Nsumi=1

1 minus 120575[ni] 0 n ne 0 delta120575[n] = 1 n = 0 Kronecker

(b)

middot middot middot

Any manipulation of library is an operator acting on the ni vector (copy number vector)Example sampling of the library

1n =

1n11n2

1nN

middot middot middot

2n =

2n12n2

2nN1Lib = [S 1n]

2Lib = [S 2n]

T1 total sequencesT2 total sequences

Sum(1n) = T1 Sum(2n) = sum(Sa 1n) = T2

2n = Sa 1

n

Sa) acts on 1n vectorSampling operator (

Some elementspresent in 1st

library are ldquogonerdquofrom the 2nd library

2ni = 0

1ni 0ne

(c)

rnrn

rn

0 00 0

0 0

Example sampling operator

Properties of Sa

middot middot middotmiddot middot middot

middot middot middot

Sa(n) = round T1

T2

N Irndn where N Irnd =

k

Operators are N times N matrices Random N times Ndiagonal matrix

(1) Sa(middot middot middot(Sa(Sa(n)) )middot middot middot) asymp Sa(n)

sum Sa(n) rarr infinn as k rarr(2) 1

k

((

(d)

amplificationSynthesis ofnucleotides

Theoretical library

Example each random

Ligation

Sequencing

Panning

Naive librarySynthetic library Selected (panned) library

General problem of selection

Sequencing

Multiset description of the phage display selection

transformationgrowth

nucleotide is present once

T = [S 1]

The ith sequence Sihas copy number syn ni

enriched sequences x i gt 1 depleted sequences x i lt 1

P and N cannot be measured They have to be approximated from the ldquoobservablesrdquo

From P and N find enrichment E = [S x] x e R+obs

P = [S panobsn]

P = [S pann]N = [S naive

n]

S = [S synn]

obsN = [S naive

obsn]

(e)

Figure 1 (a) Phage library can be described by multisets made of S = sequence set and 119899 = vector of copy numbers Any change to thelibrary can be described as functionoperator acting on the 119899 (b) Relevant functions are calculations of total sequences (sum) and uniquesequences (uni) (c) Any transformation of library to another library is an operator acting on 119899 Sampling of libraries to yield a sublibrary isthe most important operator (d) It can be described as N times N matrix Specifically Sa is a diagonal matrix of values derived from randomdistribution Rounding function is necessary to ensure the physical meaning of the sampling results Sa acting on the same vector yieldsone of many vectors that have the same number of total elements As a consequence Sa is nonlinear nondistributive and noncommutativeoperator Average of many Sa operators is a scalar (dilution factor) (e) Any screen of any library can be described as operators acting onthe copy number vectors of the naıve (or theoretical) library Copy number vectors cannot be observed directly They have to be measuredthrough sequencing As sequencing contains sampling process (Sa operator) the result of sequencing is nondeterministic Sequencing yieldsone of many possible observed copy number vectors none of which are equal to the real copy number vector

implementationmdashthe entire code is lt30 lines in MatLabmdashthe script allows rapid calculation of the results of fSa for amultiset of reasonable size (severalmillion sequences Figures4 and 5) on a desktop computer

We evaluated the behaviors of 05Sa for several multisetsThe probability to observe a specific solution is describedin Figure 3(b) Individual solutions can be represented aslines with nodes on 119883119884-plane where each node represents

one element of the multiset (Figures 3(d) and 3(e)) Themost probable solutions reside near the ldquoexpected solutionrdquo(represented as dotted line) and the probability to observea solution where many elements deviate from the probablesolution is low (Figure 3(e)) Graphical representation of thesolutions highlights that sampling could lead to deviationof the frequency of the individual elements of the multisetfor example Figure 3(e) describes gt2 fold deviation from

Computational and Mathematical Methods in Medicine 5

Library of phage

Sampling

Growthof phage

Isolationof DNA

PCRof DNA

Sequencing Analysis of

Sa Gr

Operator description of each step

Sa Is Pc Sa Se AnSa

of DNA sequencing

(observed)Accepted

sequencing data

realN = [S real

n]

obs N = [S obs n]

(a)

Relation between observed copy numbers and ldquorealrdquo copy numbers

Analysisoperator

obs n =f 5Sa An f 4Sa Seq f 3Sa PCR f 2Sa Is f1Sa Gr real n

Growth bias betoassumedandneglected(usually IN )Isolation bias betoassumedandneglected(usually IN )

PCR bias betoassumedandneglected(usually IN )Sequencing bias betoassumedandneglected(usually IN )

(b)

10

0

0 00 0

0 0

0011

0

0

0

Analysis operator is a matrix that could compensate for problems and censorships caused by Se Pc Is Gr operators

f5Sa An (f4Sa Seq f3Sa PCR f2Sa Is f1Sa Gr real ni) sim Sa real ni

Seq 1 is accepted as isSeq 2 is considered a typoand renamed to seq 3Seq 3 is accepted as isSeq N is deleted (possible error)

middot middot middot

middot middot middot

middot middot middot

An =

(c)

Figure 2 Operator description of the deep sequencing process (a) A library of phage must be processed before deep sequencing Each stepinvolves sampling which is either a deliberate partitioning of the sample or random loss of the sample Each sample preparation state could(and does) introduce bias in sequence abundance Each step thus is an operator chat changeing the 119899 vector (b) If we ignore bias duringpreparation operators could be approximated as unity vectors and sequencing could be represented as a product of sampling and analysisoperators (c) Analysis operator (An) is a binary decision matrix which describes what sequences are and are not considered as errorsDecisions such as removal of sequences or correction of sequences are the most important because they decide which ldquoobservedrdquo sequencesare considered ldquorealrdquo To make the analysis of the selection process meaningful the same An operator should be used in all analyses

the expected value for one of the elements Figure 3(f) showsthat the solution in which two elements deviate by gt2 foldis improbable This observation is a simple consequence ofthe multiplicity of the probabilities (large deviation from theaverage has probability 119901 and the probability to observe thisdeviation twice is 1199012)

Even in small multisets such as A(1) B(2) C(3) D(4)made of four unique and 10 total elements05Sa A(1) B(2) C(3) D(4) operation yields large numberof solutions with equal probability termed as redundantsolutions (eg solutions that have equal probability inFigure 3(b)) Redundancy depends on the structure of themultiset (Figure S2)This redundancy makes the calculationsof all probable solutions of Sa impractical For sets even with5-6 unique elements identification of all vectors 119861 whichsatisfy equation 119861 = fSa 119860 and reside within a 95 intervalrequires hundreds of thousands of iterations (Figure S2and S3) On the other hand calculation of the confidenceinterval of each element 119861i of the vector 119861 converges rapidlyA multiset A

1000 = A

1(1) A2(2) sdot sdot sdotA

1000(1000) with

1000 unique elements and 1 + 2 + 3 + sdot sdot sdot + 1000 = 500 500total elements is similar to an average deep sequencingdata set (Figure 4) Calculation of all probable solutions of05Sa A

1000 is beyond the capabilities of most computers

However the 999 confidence interval of all elements ofvector 119861 = 05Sa A

1000 can be calculated in sim2 minutes on

an average desktop computer The red dots in Figure 4 arelo119862119894and hi119862119894or the 999 high and low confidence interval

of all elements Bi (Figure 4)

The sampling operator is critical in phage display becausesampling of libraries occurs in every step of the selection andthe preparation of libraries for sequencing The stochasticnature of sampling operators makes two identical screensldquosimilar within a confidence intervalrdquo Solving (1) exactly isnot possible but it should be possible to estimate the solutionwithin a confidence interval Consider

Sel isin [ loKa Naive hiKa Naive] (6)

where loKa and hiKa are diagonal matrices of the upperand lower confidence intervals for the association constantsA simulation of the behavior of the Sa operator (Figures3 and S3) suggests that the relative sizes of the confidenceintervalsmight be impractically largewhen the copy numbersof sequences are lt10

Multiple sampling events of the Sa operator yield anormal distribution for each element of the vector (Figure 3)Fitting this normal distribution could yield a ldquotruerdquo valueof the process This process is identical to the extrapolationof the average from the normal distribution of noisy dataMultiple algorithms for such extrapolation exist for one- andmultidimensional stochastic processes [38 39] We believethat Sa behaves as a one-dimensional stochastic process andit might be possible to extrapolate the true value of thesampling from 7 to 10 repeated instances of Sa (ie thenumber of data sufficient to fit an 1D normal distribution)The necessary practical steps towards solving (3) or (10)are the following (i) Eliminate or account for any bias notrelated to binding (eg growth bias) (ii) Repeat the screen

6 Computational and Mathematical Methods in Medicine

Probability to observe specific solution

BB C

C

C

AD

DD

D B CCD

D[0 1 2 2]A B C D

[1 2 3 4]A B C D Sa

95

confi

denc

e int

erva

l

0 2 3 01 0 0 40 1 0 41 1 3 00 0 1 41 2 2 00 2 0 31 0 3 10 0 3 21 2 0 20 1 3 11 1 0 31 2 1 10 0 2 31 0 1 30 2 2 11 0 2 20 2 1 21 1 2 10 1 1 30 1 2 21 1 1 2

0 002 004 006 008 01 012 014

Copy number before

Cop

y nu

mbe

r afte

r

0

1

2

3

4

(a)

(c) (d)

(e)

(h)(g) (i)

(f)(b)

1 2 3 4

Probability to observe specific solution0 002 004 006 008 01 012 014

100000 trials

Sa50

1

6385

5000 trials2957

7

42

1 2 3 4 5 6 7 8012345678

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

One of the solutions

[1 2 3 4 5 6 7 8]A B C D E F G H

1 2 3 4 5 6 7 8

012345678

Cop

y nu

mbe

r afte

r the

sam

plin

g

Number of data per point

Copy number before the sampling

50

BB

CCCADD

DD H HH

G

HHH

HHGG G

GGG

EE

EE

E

E

FF

FF

FF

B

CAD DH

G

HHH

GG

E EE

FF

F

[1 1 1 2 3 3 3 4]A B C D E F G H

Sa 50

Most probable solution

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

[0 2 3 0]A B C D

[1 1 1 2]A B C D

[0 2 1 2]A B C D

the s

ampl

ing

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

the sampling

Copy number before the sampling

Copy number before the sampling

Copy number before the sampling

gt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

solutions

P = 014

P lt 0001

P = 007

after 5000 trials

Figure 3 (a) Testing the sampling operator implemented as random indexing function using a model multiset (b) In 100000 trials weobserved 22 unique solutions from which 14 resided in a 95 confidence interval Solutions with 0 and 1 copies of element A were foundat equal abundances (ldquoredundant solutionsrdquo) (c) Representation of the most probable solution as a line with 4 nodes ldquo119901rdquo is a probabilityto find the solution dotted line is an expected ldquoaverage solutionrdquo for 50 sampling (d) The 5th most probable solution (e) least probablesolution deviates the most form the average (f) combination of all solutions Red thick lines describe the most probable solutions thin bluelines describe the least probable solutions (g) Sampling of larger multisets yields more possible solutions (here 2957 in 5000 trials) (h)All solutions of the sampling represented as lines (i) Probability to observe a particular copy number after sampling While (h) is the mostaccurate representations of the confidence intervals the thin blue lines describe solutions outside the confidence interval this representation isimpractical due to large number of redundant solutions in largermultisets In (i) confidence interval could be extrapolated from distributionsof individual copy numbers (e) red dots are on or outside the confidence interval

several times (iii) Measure all copy numbers of all sequencesincluding zero values with high confidence Requirement (i)has been an ongoing effort in our group [11 40] and othergroups [13 41ndash43] for review see [11 44] Deep sequencing

makes it simple to satisfy requirement (ii) and obtainmultipleinstances of the same experiment For example we describedthe Illumina sequencing method that allows using barcodedprimers to sequence 18 unrelated experiments in one deep

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 3

Table 1 Symbols and definitions used in the theoretical description section

Symbols MeaningA a f m n k Unless specified otherwise normal font designates scalars119860 119886119873 119875 1n 13n Italic font designates vectors Different vectors can be distinguished by the left-superscript notationA a Abc Pan Sa Bold font designates operators or matrices (here all operators are matrices)1AfSa09Sa05Sa Operators can be distinguished by the left-superscript notation For sampling operator Sa this

notation specifies the sampling fraction of the Sa operatorA1 a2 A119894 a119895 Normal font with right subscript designates scalar values of the vectorA11 A21 Aij Aii Normal font with two right subscripts designates scalar values of the 2D matrix1003817100381710038171003817A1 A5

1003817100381710038171003817

Description of the scalar elements in the vector10038171003817100381710038171003817Aij Aii

10038171003817100381710038171003817

Description of the scalar elements in the matrixx isin [A B] Scalar x belongs to the inclusive scalar interval [A B] that is A le x le B119909 isin [119860119861] Vector x belongs to the ldquovector intervalrdquo [119860119861] that is for every element Ai le xi le Bi

ABC X Set where A B C X are the unique elements of the set

A(a) B(b) X(x) Multiset (2-tuple) where A B X are the unique elements and a b x are the scalars describing thecopy numbers of the A B X elements

IN Unity matrix of the Nth order that is N times Nmatrix 10038171003817100381710038171003817Aij10038171003817100381710038171003817 Aij = 120575ij (Kronecker delta)

completely using a 109-element vector This size is accessibleto the computational capacity of most desktop computers

In operator notation phage display can be described as

Sel = Pan Naive (1)

where Naive is the copy number vector for naıve librarySel is the copy number vector after panning and Pan isa panning operator In standard phage display the Panoperator is a complex product of all manipulation steps(binding amplification dilutions etc) If a screen uses noamplification and uses deep-sequencing [9 16] or large-scaleSanger sequencing [36 37] to analyze the enrichment itmight be possible to define the panning process as a simpleproduct of two operators as follows

Pan = fSa Ka (2)

Sel = fSa Ka Naive (3)

where Ka is a deterministic ldquoassociationrdquo operator whichcontains association constants for every phage clone presentin the library Description of such operator is beyond thescope of this paper and we recommend consulting otherreports that attempted to generalize the selection procedure[20] Another operator in (3) is a sampling operator ( fSa)which describes stochastic sampling of the library with msequences to yield a sublibrarywith flowastm-sequences where f isin[0 1] is a sampling fraction fSa operator has the followingproperties which emanate from physical properties of thesampling procedure

(I)

fSai 0 = 0 (sampling does not create new

sequences from nonexisting sequences) (4)

(II) fSa is a diagonal operator with diagonal scalar func-tions Sa

11Sa22sdot sdot sdot SaNN Sai(0) = 0

(III) In 119861 = fSa 119860 119861 is a vector of positive integersB119894ge 0 and sum(119861) = flowastsum(119860) Integer values

ensure that the observable values of the operator havephysical meaning The clone could be observed once(1) multiple times (2 3 etc) or not observed at all(0)

(IV) Sa is nondeterministic operator When applied to thesame vector the operator does not yield the sameresult but one of the possible vectors that satisfy rules(IndashIII) The majority of the solutions of the operatorhowever reside within a deterministic confidenceinterval fSa 119860 isin [ lo119862 hiC]

(V) As a consequence from (IV) operator Sa is nonlinearnoncommutative and nondistributive

(VI) Large sum of sampling operators with same f shouldldquoaverage outrdquo to yield IN unity matrix

(fSa1+

fSa2+

fSa3+ sdot sdot sdot

fSak)k

997888rarr flowastIN

as k 997888rarr infin

(5)

The Sa operator is simple to implement as a randomarray indexing function in any programming language (egsee Supplementary Schemes S1 and S2 available online athttpdxdoiorg1011552013491612) It might be possibleto express fSa analytically for any f as a diagonal matrix(Figure 1(d)) In this paper we use numerical treatment byan array sampling function because it is more convenient formultisets of general structure We tested the random index-ing implementation to show that the sampling algorithmyields a normal distribution for a large number of samples(Supplementary Figure S1) Despite the simplicity of fSa

4 Computational and Mathematical Methods in Medicine

Any phage library is a multiset defined by sequence set and copy number vector

Lib = [S n] middot middot middot middot middot middot

n1n2nN

N number of all possiblesequences in the libraryni copy number of theith sequence in the library

Lib = [S = sequence1sequence2 n = ]sequenceN

(a)

Total (sum) and unique sequences (uni) 11

1middot middot middotsum(n) = n middot |1| =

|1| =Nsumi=1

ni

uni(n) =Nsumi=1

1 minus 120575[ni] 0 n ne 0 delta120575[n] = 1 n = 0 Kronecker

(b)

middot middot middot

Any manipulation of library is an operator acting on the ni vector (copy number vector)Example sampling of the library

1n =

1n11n2

1nN

middot middot middot

2n =

2n12n2

2nN1Lib = [S 1n]

2Lib = [S 2n]

T1 total sequencesT2 total sequences

Sum(1n) = T1 Sum(2n) = sum(Sa 1n) = T2

2n = Sa 1

n

Sa) acts on 1n vectorSampling operator (

Some elementspresent in 1st

library are ldquogonerdquofrom the 2nd library

2ni = 0

1ni 0ne

(c)

rnrn

rn

0 00 0

0 0

Example sampling operator

Properties of Sa

middot middot middotmiddot middot middot

middot middot middot

Sa(n) = round T1

T2

N Irndn where N Irnd =

k

Operators are N times N matrices Random N times Ndiagonal matrix

(1) Sa(middot middot middot(Sa(Sa(n)) )middot middot middot) asymp Sa(n)

sum Sa(n) rarr infinn as k rarr(2) 1

k

((

(d)

amplificationSynthesis ofnucleotides

Theoretical library

Example each random

Ligation

Sequencing

Panning

Naive librarySynthetic library Selected (panned) library

General problem of selection

Sequencing

Multiset description of the phage display selection

transformationgrowth

nucleotide is present once

T = [S 1]

The ith sequence Sihas copy number syn ni

enriched sequences x i gt 1 depleted sequences x i lt 1

P and N cannot be measured They have to be approximated from the ldquoobservablesrdquo

From P and N find enrichment E = [S x] x e R+obs

P = [S panobsn]

P = [S pann]N = [S naive

n]

S = [S synn]

obsN = [S naive

obsn]

(e)

Figure 1 (a) Phage library can be described by multisets made of S = sequence set and 119899 = vector of copy numbers Any change to thelibrary can be described as functionoperator acting on the 119899 (b) Relevant functions are calculations of total sequences (sum) and uniquesequences (uni) (c) Any transformation of library to another library is an operator acting on 119899 Sampling of libraries to yield a sublibrary isthe most important operator (d) It can be described as N times N matrix Specifically Sa is a diagonal matrix of values derived from randomdistribution Rounding function is necessary to ensure the physical meaning of the sampling results Sa acting on the same vector yieldsone of many vectors that have the same number of total elements As a consequence Sa is nonlinear nondistributive and noncommutativeoperator Average of many Sa operators is a scalar (dilution factor) (e) Any screen of any library can be described as operators acting onthe copy number vectors of the naıve (or theoretical) library Copy number vectors cannot be observed directly They have to be measuredthrough sequencing As sequencing contains sampling process (Sa operator) the result of sequencing is nondeterministic Sequencing yieldsone of many possible observed copy number vectors none of which are equal to the real copy number vector

implementationmdashthe entire code is lt30 lines in MatLabmdashthe script allows rapid calculation of the results of fSa for amultiset of reasonable size (severalmillion sequences Figures4 and 5) on a desktop computer

We evaluated the behaviors of 05Sa for several multisetsThe probability to observe a specific solution is describedin Figure 3(b) Individual solutions can be represented aslines with nodes on 119883119884-plane where each node represents

one element of the multiset (Figures 3(d) and 3(e)) Themost probable solutions reside near the ldquoexpected solutionrdquo(represented as dotted line) and the probability to observea solution where many elements deviate from the probablesolution is low (Figure 3(e)) Graphical representation of thesolutions highlights that sampling could lead to deviationof the frequency of the individual elements of the multisetfor example Figure 3(e) describes gt2 fold deviation from

Computational and Mathematical Methods in Medicine 5

Library of phage

Sampling

Growthof phage

Isolationof DNA

PCRof DNA

Sequencing Analysis of

Sa Gr

Operator description of each step

Sa Is Pc Sa Se AnSa

of DNA sequencing

(observed)Accepted

sequencing data

realN = [S real

n]

obs N = [S obs n]

(a)

Relation between observed copy numbers and ldquorealrdquo copy numbers

Analysisoperator

obs n =f 5Sa An f 4Sa Seq f 3Sa PCR f 2Sa Is f1Sa Gr real n

Growth bias betoassumedandneglected(usually IN )Isolation bias betoassumedandneglected(usually IN )

PCR bias betoassumedandneglected(usually IN )Sequencing bias betoassumedandneglected(usually IN )

(b)

10

0

0 00 0

0 0

0011

0

0

0

Analysis operator is a matrix that could compensate for problems and censorships caused by Se Pc Is Gr operators

f5Sa An (f4Sa Seq f3Sa PCR f2Sa Is f1Sa Gr real ni) sim Sa real ni

Seq 1 is accepted as isSeq 2 is considered a typoand renamed to seq 3Seq 3 is accepted as isSeq N is deleted (possible error)

middot middot middot

middot middot middot

middot middot middot

An =

(c)

Figure 2 Operator description of the deep sequencing process (a) A library of phage must be processed before deep sequencing Each stepinvolves sampling which is either a deliberate partitioning of the sample or random loss of the sample Each sample preparation state could(and does) introduce bias in sequence abundance Each step thus is an operator chat changeing the 119899 vector (b) If we ignore bias duringpreparation operators could be approximated as unity vectors and sequencing could be represented as a product of sampling and analysisoperators (c) Analysis operator (An) is a binary decision matrix which describes what sequences are and are not considered as errorsDecisions such as removal of sequences or correction of sequences are the most important because they decide which ldquoobservedrdquo sequencesare considered ldquorealrdquo To make the analysis of the selection process meaningful the same An operator should be used in all analyses

the expected value for one of the elements Figure 3(f) showsthat the solution in which two elements deviate by gt2 foldis improbable This observation is a simple consequence ofthe multiplicity of the probabilities (large deviation from theaverage has probability 119901 and the probability to observe thisdeviation twice is 1199012)

Even in small multisets such as A(1) B(2) C(3) D(4)made of four unique and 10 total elements05Sa A(1) B(2) C(3) D(4) operation yields large numberof solutions with equal probability termed as redundantsolutions (eg solutions that have equal probability inFigure 3(b)) Redundancy depends on the structure of themultiset (Figure S2)This redundancy makes the calculationsof all probable solutions of Sa impractical For sets even with5-6 unique elements identification of all vectors 119861 whichsatisfy equation 119861 = fSa 119860 and reside within a 95 intervalrequires hundreds of thousands of iterations (Figure S2and S3) On the other hand calculation of the confidenceinterval of each element 119861i of the vector 119861 converges rapidlyA multiset A

1000 = A

1(1) A2(2) sdot sdot sdotA

1000(1000) with

1000 unique elements and 1 + 2 + 3 + sdot sdot sdot + 1000 = 500 500total elements is similar to an average deep sequencingdata set (Figure 4) Calculation of all probable solutions of05Sa A

1000 is beyond the capabilities of most computers

However the 999 confidence interval of all elements ofvector 119861 = 05Sa A

1000 can be calculated in sim2 minutes on

an average desktop computer The red dots in Figure 4 arelo119862119894and hi119862119894or the 999 high and low confidence interval

of all elements Bi (Figure 4)

The sampling operator is critical in phage display becausesampling of libraries occurs in every step of the selection andthe preparation of libraries for sequencing The stochasticnature of sampling operators makes two identical screensldquosimilar within a confidence intervalrdquo Solving (1) exactly isnot possible but it should be possible to estimate the solutionwithin a confidence interval Consider

Sel isin [ loKa Naive hiKa Naive] (6)

where loKa and hiKa are diagonal matrices of the upperand lower confidence intervals for the association constantsA simulation of the behavior of the Sa operator (Figures3 and S3) suggests that the relative sizes of the confidenceintervalsmight be impractically largewhen the copy numbersof sequences are lt10

Multiple sampling events of the Sa operator yield anormal distribution for each element of the vector (Figure 3)Fitting this normal distribution could yield a ldquotruerdquo valueof the process This process is identical to the extrapolationof the average from the normal distribution of noisy dataMultiple algorithms for such extrapolation exist for one- andmultidimensional stochastic processes [38 39] We believethat Sa behaves as a one-dimensional stochastic process andit might be possible to extrapolate the true value of thesampling from 7 to 10 repeated instances of Sa (ie thenumber of data sufficient to fit an 1D normal distribution)The necessary practical steps towards solving (3) or (10)are the following (i) Eliminate or account for any bias notrelated to binding (eg growth bias) (ii) Repeat the screen

6 Computational and Mathematical Methods in Medicine

Probability to observe specific solution

BB C

C

C

AD

DD

D B CCD

D[0 1 2 2]A B C D

[1 2 3 4]A B C D Sa

95

confi

denc

e int

erva

l

0 2 3 01 0 0 40 1 0 41 1 3 00 0 1 41 2 2 00 2 0 31 0 3 10 0 3 21 2 0 20 1 3 11 1 0 31 2 1 10 0 2 31 0 1 30 2 2 11 0 2 20 2 1 21 1 2 10 1 1 30 1 2 21 1 1 2

0 002 004 006 008 01 012 014

Copy number before

Cop

y nu

mbe

r afte

r

0

1

2

3

4

(a)

(c) (d)

(e)

(h)(g) (i)

(f)(b)

1 2 3 4

Probability to observe specific solution0 002 004 006 008 01 012 014

100000 trials

Sa50

1

6385

5000 trials2957

7

42

1 2 3 4 5 6 7 8012345678

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

One of the solutions

[1 2 3 4 5 6 7 8]A B C D E F G H

1 2 3 4 5 6 7 8

012345678

Cop

y nu

mbe

r afte

r the

sam

plin

g

Number of data per point

Copy number before the sampling

50

BB

CCCADD

DD H HH

G

HHH

HHGG G

GGG

EE

EE

E

E

FF

FF

FF

B

CAD DH

G

HHH

GG

E EE

FF

F

[1 1 1 2 3 3 3 4]A B C D E F G H

Sa 50

Most probable solution

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

[0 2 3 0]A B C D

[1 1 1 2]A B C D

[0 2 1 2]A B C D

the s

ampl

ing

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

the sampling

Copy number before the sampling

Copy number before the sampling

Copy number before the sampling

gt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

solutions

P = 014

P lt 0001

P = 007

after 5000 trials

Figure 3 (a) Testing the sampling operator implemented as random indexing function using a model multiset (b) In 100000 trials weobserved 22 unique solutions from which 14 resided in a 95 confidence interval Solutions with 0 and 1 copies of element A were foundat equal abundances (ldquoredundant solutionsrdquo) (c) Representation of the most probable solution as a line with 4 nodes ldquo119901rdquo is a probabilityto find the solution dotted line is an expected ldquoaverage solutionrdquo for 50 sampling (d) The 5th most probable solution (e) least probablesolution deviates the most form the average (f) combination of all solutions Red thick lines describe the most probable solutions thin bluelines describe the least probable solutions (g) Sampling of larger multisets yields more possible solutions (here 2957 in 5000 trials) (h)All solutions of the sampling represented as lines (i) Probability to observe a particular copy number after sampling While (h) is the mostaccurate representations of the confidence intervals the thin blue lines describe solutions outside the confidence interval this representation isimpractical due to large number of redundant solutions in largermultisets In (i) confidence interval could be extrapolated from distributionsof individual copy numbers (e) red dots are on or outside the confidence interval

several times (iii) Measure all copy numbers of all sequencesincluding zero values with high confidence Requirement (i)has been an ongoing effort in our group [11 40] and othergroups [13 41ndash43] for review see [11 44] Deep sequencing

makes it simple to satisfy requirement (ii) and obtainmultipleinstances of the same experiment For example we describedthe Illumina sequencing method that allows using barcodedprimers to sequence 18 unrelated experiments in one deep

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

4 Computational and Mathematical Methods in Medicine

Any phage library is a multiset defined by sequence set and copy number vector

Lib = [S n] middot middot middot middot middot middot

n1n2nN

N number of all possiblesequences in the libraryni copy number of theith sequence in the library

Lib = [S = sequence1sequence2 n = ]sequenceN

(a)

Total (sum) and unique sequences (uni) 11

1middot middot middotsum(n) = n middot |1| =

|1| =Nsumi=1

ni

uni(n) =Nsumi=1

1 minus 120575[ni] 0 n ne 0 delta120575[n] = 1 n = 0 Kronecker

(b)

middot middot middot

Any manipulation of library is an operator acting on the ni vector (copy number vector)Example sampling of the library

1n =

1n11n2

1nN

middot middot middot

2n =

2n12n2

2nN1Lib = [S 1n]

2Lib = [S 2n]

T1 total sequencesT2 total sequences

Sum(1n) = T1 Sum(2n) = sum(Sa 1n) = T2

2n = Sa 1

n

Sa) acts on 1n vectorSampling operator (

Some elementspresent in 1st

library are ldquogonerdquofrom the 2nd library

2ni = 0

1ni 0ne

(c)

rnrn

rn

0 00 0

0 0

Example sampling operator

Properties of Sa

middot middot middotmiddot middot middot

middot middot middot

Sa(n) = round T1

T2

N Irndn where N Irnd =

k

Operators are N times N matrices Random N times Ndiagonal matrix

(1) Sa(middot middot middot(Sa(Sa(n)) )middot middot middot) asymp Sa(n)

sum Sa(n) rarr infinn as k rarr(2) 1

k

((

(d)

amplificationSynthesis ofnucleotides

Theoretical library

Example each random

Ligation

Sequencing

Panning

Naive librarySynthetic library Selected (panned) library

General problem of selection

Sequencing

Multiset description of the phage display selection

transformationgrowth

nucleotide is present once

T = [S 1]

The ith sequence Sihas copy number syn ni

enriched sequences x i gt 1 depleted sequences x i lt 1

P and N cannot be measured They have to be approximated from the ldquoobservablesrdquo

From P and N find enrichment E = [S x] x e R+obs

P = [S panobsn]

P = [S pann]N = [S naive

n]

S = [S synn]

obsN = [S naive

obsn]

(e)

Figure 1 (a) Phage library can be described by multisets made of S = sequence set and 119899 = vector of copy numbers Any change to thelibrary can be described as functionoperator acting on the 119899 (b) Relevant functions are calculations of total sequences (sum) and uniquesequences (uni) (c) Any transformation of library to another library is an operator acting on 119899 Sampling of libraries to yield a sublibrary isthe most important operator (d) It can be described as N times N matrix Specifically Sa is a diagonal matrix of values derived from randomdistribution Rounding function is necessary to ensure the physical meaning of the sampling results Sa acting on the same vector yieldsone of many vectors that have the same number of total elements As a consequence Sa is nonlinear nondistributive and noncommutativeoperator Average of many Sa operators is a scalar (dilution factor) (e) Any screen of any library can be described as operators acting onthe copy number vectors of the naıve (or theoretical) library Copy number vectors cannot be observed directly They have to be measuredthrough sequencing As sequencing contains sampling process (Sa operator) the result of sequencing is nondeterministic Sequencing yieldsone of many possible observed copy number vectors none of which are equal to the real copy number vector

implementationmdashthe entire code is lt30 lines in MatLabmdashthe script allows rapid calculation of the results of fSa for amultiset of reasonable size (severalmillion sequences Figures4 and 5) on a desktop computer

We evaluated the behaviors of 05Sa for several multisetsThe probability to observe a specific solution is describedin Figure 3(b) Individual solutions can be represented aslines with nodes on 119883119884-plane where each node represents

one element of the multiset (Figures 3(d) and 3(e)) Themost probable solutions reside near the ldquoexpected solutionrdquo(represented as dotted line) and the probability to observea solution where many elements deviate from the probablesolution is low (Figure 3(e)) Graphical representation of thesolutions highlights that sampling could lead to deviationof the frequency of the individual elements of the multisetfor example Figure 3(e) describes gt2 fold deviation from

Computational and Mathematical Methods in Medicine 5

Library of phage

Sampling

Growthof phage

Isolationof DNA

PCRof DNA

Sequencing Analysis of

Sa Gr

Operator description of each step

Sa Is Pc Sa Se AnSa

of DNA sequencing

(observed)Accepted

sequencing data

realN = [S real

n]

obs N = [S obs n]

(a)

Relation between observed copy numbers and ldquorealrdquo copy numbers

Analysisoperator

obs n =f 5Sa An f 4Sa Seq f 3Sa PCR f 2Sa Is f1Sa Gr real n

Growth bias betoassumedandneglected(usually IN )Isolation bias betoassumedandneglected(usually IN )

PCR bias betoassumedandneglected(usually IN )Sequencing bias betoassumedandneglected(usually IN )

(b)

10

0

0 00 0

0 0

0011

0

0

0

Analysis operator is a matrix that could compensate for problems and censorships caused by Se Pc Is Gr operators

f5Sa An (f4Sa Seq f3Sa PCR f2Sa Is f1Sa Gr real ni) sim Sa real ni

Seq 1 is accepted as isSeq 2 is considered a typoand renamed to seq 3Seq 3 is accepted as isSeq N is deleted (possible error)

middot middot middot

middot middot middot

middot middot middot

An =

(c)

Figure 2 Operator description of the deep sequencing process (a) A library of phage must be processed before deep sequencing Each stepinvolves sampling which is either a deliberate partitioning of the sample or random loss of the sample Each sample preparation state could(and does) introduce bias in sequence abundance Each step thus is an operator chat changeing the 119899 vector (b) If we ignore bias duringpreparation operators could be approximated as unity vectors and sequencing could be represented as a product of sampling and analysisoperators (c) Analysis operator (An) is a binary decision matrix which describes what sequences are and are not considered as errorsDecisions such as removal of sequences or correction of sequences are the most important because they decide which ldquoobservedrdquo sequencesare considered ldquorealrdquo To make the analysis of the selection process meaningful the same An operator should be used in all analyses

the expected value for one of the elements Figure 3(f) showsthat the solution in which two elements deviate by gt2 foldis improbable This observation is a simple consequence ofthe multiplicity of the probabilities (large deviation from theaverage has probability 119901 and the probability to observe thisdeviation twice is 1199012)

Even in small multisets such as A(1) B(2) C(3) D(4)made of four unique and 10 total elements05Sa A(1) B(2) C(3) D(4) operation yields large numberof solutions with equal probability termed as redundantsolutions (eg solutions that have equal probability inFigure 3(b)) Redundancy depends on the structure of themultiset (Figure S2)This redundancy makes the calculationsof all probable solutions of Sa impractical For sets even with5-6 unique elements identification of all vectors 119861 whichsatisfy equation 119861 = fSa 119860 and reside within a 95 intervalrequires hundreds of thousands of iterations (Figure S2and S3) On the other hand calculation of the confidenceinterval of each element 119861i of the vector 119861 converges rapidlyA multiset A

1000 = A

1(1) A2(2) sdot sdot sdotA

1000(1000) with

1000 unique elements and 1 + 2 + 3 + sdot sdot sdot + 1000 = 500 500total elements is similar to an average deep sequencingdata set (Figure 4) Calculation of all probable solutions of05Sa A

1000 is beyond the capabilities of most computers

However the 999 confidence interval of all elements ofvector 119861 = 05Sa A

1000 can be calculated in sim2 minutes on

an average desktop computer The red dots in Figure 4 arelo119862119894and hi119862119894or the 999 high and low confidence interval

of all elements Bi (Figure 4)

The sampling operator is critical in phage display becausesampling of libraries occurs in every step of the selection andthe preparation of libraries for sequencing The stochasticnature of sampling operators makes two identical screensldquosimilar within a confidence intervalrdquo Solving (1) exactly isnot possible but it should be possible to estimate the solutionwithin a confidence interval Consider

Sel isin [ loKa Naive hiKa Naive] (6)

where loKa and hiKa are diagonal matrices of the upperand lower confidence intervals for the association constantsA simulation of the behavior of the Sa operator (Figures3 and S3) suggests that the relative sizes of the confidenceintervalsmight be impractically largewhen the copy numbersof sequences are lt10

Multiple sampling events of the Sa operator yield anormal distribution for each element of the vector (Figure 3)Fitting this normal distribution could yield a ldquotruerdquo valueof the process This process is identical to the extrapolationof the average from the normal distribution of noisy dataMultiple algorithms for such extrapolation exist for one- andmultidimensional stochastic processes [38 39] We believethat Sa behaves as a one-dimensional stochastic process andit might be possible to extrapolate the true value of thesampling from 7 to 10 repeated instances of Sa (ie thenumber of data sufficient to fit an 1D normal distribution)The necessary practical steps towards solving (3) or (10)are the following (i) Eliminate or account for any bias notrelated to binding (eg growth bias) (ii) Repeat the screen

6 Computational and Mathematical Methods in Medicine

Probability to observe specific solution

BB C

C

C

AD

DD

D B CCD

D[0 1 2 2]A B C D

[1 2 3 4]A B C D Sa

95

confi

denc

e int

erva

l

0 2 3 01 0 0 40 1 0 41 1 3 00 0 1 41 2 2 00 2 0 31 0 3 10 0 3 21 2 0 20 1 3 11 1 0 31 2 1 10 0 2 31 0 1 30 2 2 11 0 2 20 2 1 21 1 2 10 1 1 30 1 2 21 1 1 2

0 002 004 006 008 01 012 014

Copy number before

Cop

y nu

mbe

r afte

r

0

1

2

3

4

(a)

(c) (d)

(e)

(h)(g) (i)

(f)(b)

1 2 3 4

Probability to observe specific solution0 002 004 006 008 01 012 014

100000 trials

Sa50

1

6385

5000 trials2957

7

42

1 2 3 4 5 6 7 8012345678

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

One of the solutions

[1 2 3 4 5 6 7 8]A B C D E F G H

1 2 3 4 5 6 7 8

012345678

Cop

y nu

mbe

r afte

r the

sam

plin

g

Number of data per point

Copy number before the sampling

50

BB

CCCADD

DD H HH

G

HHH

HHGG G

GGG

EE

EE

E

E

FF

FF

FF

B

CAD DH

G

HHH

GG

E EE

FF

F

[1 1 1 2 3 3 3 4]A B C D E F G H

Sa 50

Most probable solution

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

[0 2 3 0]A B C D

[1 1 1 2]A B C D

[0 2 1 2]A B C D

the s

ampl

ing

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

the sampling

Copy number before the sampling

Copy number before the sampling

Copy number before the sampling

gt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

solutions

P = 014

P lt 0001

P = 007

after 5000 trials

Figure 3 (a) Testing the sampling operator implemented as random indexing function using a model multiset (b) In 100000 trials weobserved 22 unique solutions from which 14 resided in a 95 confidence interval Solutions with 0 and 1 copies of element A were foundat equal abundances (ldquoredundant solutionsrdquo) (c) Representation of the most probable solution as a line with 4 nodes ldquo119901rdquo is a probabilityto find the solution dotted line is an expected ldquoaverage solutionrdquo for 50 sampling (d) The 5th most probable solution (e) least probablesolution deviates the most form the average (f) combination of all solutions Red thick lines describe the most probable solutions thin bluelines describe the least probable solutions (g) Sampling of larger multisets yields more possible solutions (here 2957 in 5000 trials) (h)All solutions of the sampling represented as lines (i) Probability to observe a particular copy number after sampling While (h) is the mostaccurate representations of the confidence intervals the thin blue lines describe solutions outside the confidence interval this representation isimpractical due to large number of redundant solutions in largermultisets In (i) confidence interval could be extrapolated from distributionsof individual copy numbers (e) red dots are on or outside the confidence interval

several times (iii) Measure all copy numbers of all sequencesincluding zero values with high confidence Requirement (i)has been an ongoing effort in our group [11 40] and othergroups [13 41ndash43] for review see [11 44] Deep sequencing

makes it simple to satisfy requirement (ii) and obtainmultipleinstances of the same experiment For example we describedthe Illumina sequencing method that allows using barcodedprimers to sequence 18 unrelated experiments in one deep

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 5

Library of phage

Sampling

Growthof phage

Isolationof DNA

PCRof DNA

Sequencing Analysis of

Sa Gr

Operator description of each step

Sa Is Pc Sa Se AnSa

of DNA sequencing

(observed)Accepted

sequencing data

realN = [S real

n]

obs N = [S obs n]

(a)

Relation between observed copy numbers and ldquorealrdquo copy numbers

Analysisoperator

obs n =f 5Sa An f 4Sa Seq f 3Sa PCR f 2Sa Is f1Sa Gr real n

Growth bias betoassumedandneglected(usually IN )Isolation bias betoassumedandneglected(usually IN )

PCR bias betoassumedandneglected(usually IN )Sequencing bias betoassumedandneglected(usually IN )

(b)

10

0

0 00 0

0 0

0011

0

0

0

Analysis operator is a matrix that could compensate for problems and censorships caused by Se Pc Is Gr operators

f5Sa An (f4Sa Seq f3Sa PCR f2Sa Is f1Sa Gr real ni) sim Sa real ni

Seq 1 is accepted as isSeq 2 is considered a typoand renamed to seq 3Seq 3 is accepted as isSeq N is deleted (possible error)

middot middot middot

middot middot middot

middot middot middot

An =

(c)

Figure 2 Operator description of the deep sequencing process (a) A library of phage must be processed before deep sequencing Each stepinvolves sampling which is either a deliberate partitioning of the sample or random loss of the sample Each sample preparation state could(and does) introduce bias in sequence abundance Each step thus is an operator chat changeing the 119899 vector (b) If we ignore bias duringpreparation operators could be approximated as unity vectors and sequencing could be represented as a product of sampling and analysisoperators (c) Analysis operator (An) is a binary decision matrix which describes what sequences are and are not considered as errorsDecisions such as removal of sequences or correction of sequences are the most important because they decide which ldquoobservedrdquo sequencesare considered ldquorealrdquo To make the analysis of the selection process meaningful the same An operator should be used in all analyses

the expected value for one of the elements Figure 3(f) showsthat the solution in which two elements deviate by gt2 foldis improbable This observation is a simple consequence ofthe multiplicity of the probabilities (large deviation from theaverage has probability 119901 and the probability to observe thisdeviation twice is 1199012)

Even in small multisets such as A(1) B(2) C(3) D(4)made of four unique and 10 total elements05Sa A(1) B(2) C(3) D(4) operation yields large numberof solutions with equal probability termed as redundantsolutions (eg solutions that have equal probability inFigure 3(b)) Redundancy depends on the structure of themultiset (Figure S2)This redundancy makes the calculationsof all probable solutions of Sa impractical For sets even with5-6 unique elements identification of all vectors 119861 whichsatisfy equation 119861 = fSa 119860 and reside within a 95 intervalrequires hundreds of thousands of iterations (Figure S2and S3) On the other hand calculation of the confidenceinterval of each element 119861i of the vector 119861 converges rapidlyA multiset A

1000 = A

1(1) A2(2) sdot sdot sdotA

1000(1000) with

1000 unique elements and 1 + 2 + 3 + sdot sdot sdot + 1000 = 500 500total elements is similar to an average deep sequencingdata set (Figure 4) Calculation of all probable solutions of05Sa A

1000 is beyond the capabilities of most computers

However the 999 confidence interval of all elements ofvector 119861 = 05Sa A

1000 can be calculated in sim2 minutes on

an average desktop computer The red dots in Figure 4 arelo119862119894and hi119862119894or the 999 high and low confidence interval

of all elements Bi (Figure 4)

The sampling operator is critical in phage display becausesampling of libraries occurs in every step of the selection andthe preparation of libraries for sequencing The stochasticnature of sampling operators makes two identical screensldquosimilar within a confidence intervalrdquo Solving (1) exactly isnot possible but it should be possible to estimate the solutionwithin a confidence interval Consider

Sel isin [ loKa Naive hiKa Naive] (6)

where loKa and hiKa are diagonal matrices of the upperand lower confidence intervals for the association constantsA simulation of the behavior of the Sa operator (Figures3 and S3) suggests that the relative sizes of the confidenceintervalsmight be impractically largewhen the copy numbersof sequences are lt10

Multiple sampling events of the Sa operator yield anormal distribution for each element of the vector (Figure 3)Fitting this normal distribution could yield a ldquotruerdquo valueof the process This process is identical to the extrapolationof the average from the normal distribution of noisy dataMultiple algorithms for such extrapolation exist for one- andmultidimensional stochastic processes [38 39] We believethat Sa behaves as a one-dimensional stochastic process andit might be possible to extrapolate the true value of thesampling from 7 to 10 repeated instances of Sa (ie thenumber of data sufficient to fit an 1D normal distribution)The necessary practical steps towards solving (3) or (10)are the following (i) Eliminate or account for any bias notrelated to binding (eg growth bias) (ii) Repeat the screen

6 Computational and Mathematical Methods in Medicine

Probability to observe specific solution

BB C

C

C

AD

DD

D B CCD

D[0 1 2 2]A B C D

[1 2 3 4]A B C D Sa

95

confi

denc

e int

erva

l

0 2 3 01 0 0 40 1 0 41 1 3 00 0 1 41 2 2 00 2 0 31 0 3 10 0 3 21 2 0 20 1 3 11 1 0 31 2 1 10 0 2 31 0 1 30 2 2 11 0 2 20 2 1 21 1 2 10 1 1 30 1 2 21 1 1 2

0 002 004 006 008 01 012 014

Copy number before

Cop

y nu

mbe

r afte

r

0

1

2

3

4

(a)

(c) (d)

(e)

(h)(g) (i)

(f)(b)

1 2 3 4

Probability to observe specific solution0 002 004 006 008 01 012 014

100000 trials

Sa50

1

6385

5000 trials2957

7

42

1 2 3 4 5 6 7 8012345678

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

One of the solutions

[1 2 3 4 5 6 7 8]A B C D E F G H

1 2 3 4 5 6 7 8

012345678

Cop

y nu

mbe

r afte

r the

sam

plin

g

Number of data per point

Copy number before the sampling

50

BB

CCCADD

DD H HH

G

HHH

HHGG G

GGG

EE

EE

E

E

FF

FF

FF

B

CAD DH

G

HHH

GG

E EE

FF

F

[1 1 1 2 3 3 3 4]A B C D E F G H

Sa 50

Most probable solution

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

[0 2 3 0]A B C D

[1 1 1 2]A B C D

[0 2 1 2]A B C D

the s

ampl

ing

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

the sampling

Copy number before the sampling

Copy number before the sampling

Copy number before the sampling

gt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

solutions

P = 014

P lt 0001

P = 007

after 5000 trials

Figure 3 (a) Testing the sampling operator implemented as random indexing function using a model multiset (b) In 100000 trials weobserved 22 unique solutions from which 14 resided in a 95 confidence interval Solutions with 0 and 1 copies of element A were foundat equal abundances (ldquoredundant solutionsrdquo) (c) Representation of the most probable solution as a line with 4 nodes ldquo119901rdquo is a probabilityto find the solution dotted line is an expected ldquoaverage solutionrdquo for 50 sampling (d) The 5th most probable solution (e) least probablesolution deviates the most form the average (f) combination of all solutions Red thick lines describe the most probable solutions thin bluelines describe the least probable solutions (g) Sampling of larger multisets yields more possible solutions (here 2957 in 5000 trials) (h)All solutions of the sampling represented as lines (i) Probability to observe a particular copy number after sampling While (h) is the mostaccurate representations of the confidence intervals the thin blue lines describe solutions outside the confidence interval this representation isimpractical due to large number of redundant solutions in largermultisets In (i) confidence interval could be extrapolated from distributionsof individual copy numbers (e) red dots are on or outside the confidence interval

several times (iii) Measure all copy numbers of all sequencesincluding zero values with high confidence Requirement (i)has been an ongoing effort in our group [11 40] and othergroups [13 41ndash43] for review see [11 44] Deep sequencing

makes it simple to satisfy requirement (ii) and obtainmultipleinstances of the same experiment For example we describedthe Illumina sequencing method that allows using barcodedprimers to sequence 18 unrelated experiments in one deep

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

6 Computational and Mathematical Methods in Medicine

Probability to observe specific solution

BB C

C

C

AD

DD

D B CCD

D[0 1 2 2]A B C D

[1 2 3 4]A B C D Sa

95

confi

denc

e int

erva

l

0 2 3 01 0 0 40 1 0 41 1 3 00 0 1 41 2 2 00 2 0 31 0 3 10 0 3 21 2 0 20 1 3 11 1 0 31 2 1 10 0 2 31 0 1 30 2 2 11 0 2 20 2 1 21 1 2 10 1 1 30 1 2 21 1 1 2

0 002 004 006 008 01 012 014

Copy number before

Cop

y nu

mbe

r afte

r

0

1

2

3

4

(a)

(c) (d)

(e)

(h)(g) (i)

(f)(b)

1 2 3 4

Probability to observe specific solution0 002 004 006 008 01 012 014

100000 trials

Sa50

1

6385

5000 trials2957

7

42

1 2 3 4 5 6 7 8012345678

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

One of the solutions

[1 2 3 4 5 6 7 8]A B C D E F G H

1 2 3 4 5 6 7 8

012345678

Cop

y nu

mbe

r afte

r the

sam

plin

g

Number of data per point

Copy number before the sampling

50

BB

CCCADD

DD H HH

G

HHH

HHGG G

GGG

EE

EE

E

E

FF

FF

FF

B

CAD DH

G

HHH

GG

E EE

FF

F

[1 1 1 2 3 3 3 4]A B C D E F G H

Sa 50

Most probable solution

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

0

1

2

3

4

1 2 3 4

[0 2 3 0]A B C D

[1 1 1 2]A B C D

[0 2 1 2]A B C D

the s

ampl

ing

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

Cop

y nu

mbe

r afte

rth

e sam

plin

g

the sampling

Copy number before the sampling

Copy number before the sampling

Copy number before the sampling

gt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

solutions

P = 014

P lt 0001

P = 007

after 5000 trials

Figure 3 (a) Testing the sampling operator implemented as random indexing function using a model multiset (b) In 100000 trials weobserved 22 unique solutions from which 14 resided in a 95 confidence interval Solutions with 0 and 1 copies of element A were foundat equal abundances (ldquoredundant solutionsrdquo) (c) Representation of the most probable solution as a line with 4 nodes ldquo119901rdquo is a probabilityto find the solution dotted line is an expected ldquoaverage solutionrdquo for 50 sampling (d) The 5th most probable solution (e) least probablesolution deviates the most form the average (f) combination of all solutions Red thick lines describe the most probable solutions thin bluelines describe the least probable solutions (g) Sampling of larger multisets yields more possible solutions (here 2957 in 5000 trials) (h)All solutions of the sampling represented as lines (i) Probability to observe a particular copy number after sampling While (h) is the mostaccurate representations of the confidence intervals the thin blue lines describe solutions outside the confidence interval this representation isimpractical due to large number of redundant solutions in largermultisets In (i) confidence interval could be extrapolated from distributionsof individual copy numbers (e) red dots are on or outside the confidence interval

several times (iii) Measure all copy numbers of all sequencesincluding zero values with high confidence Requirement (i)has been an ongoing effort in our group [11 40] and othergroups [13 41ndash43] for review see [11 44] Deep sequencing

makes it simple to satisfy requirement (ii) and obtainmultipleinstances of the same experiment For example we describedthe Illumina sequencing method that allows using barcodedprimers to sequence 18 unrelated experiments in one deep

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 7

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

Number of data in one pointgt1000 copies100ndash1000 copies

10ndash100 copies1ndash10 copies

0 100 200 300 400 500 600 700 800 900 10000

100

200

300

400

500

600

700

800

900

1000

1 10 100 10001

10

100

1000

Copy number before the samplingC

opy

num

ber a

fter t

he sa

mpl

ing

A B C X Z

Copy number before the sampling

Cop

y nu

mbe

r afte

r the

sam

plin

g

1

1000

990

7

8

23

9 980

10Sa

456

50

5000 trials1000 unique elements

500000 total sequencesmiddot middot middot

middot middot middot

middot middot middot

middot middot middot[1 2 3 middot middot middot 999 1000]

Figure 4 (a) Testing the sampling operator using a large multiset made of 1000 unique elements with 1000 different copy numbers Imagesdescribe linear and log-scale representation of the confidence interval of the sampling operator Solutions beyond this interval were notobserved in 5000 trials Dotted line represents an overestimate of the 999 confidence interval (for details see Figure S4) Most probableoutcomes of the Sa operator have either zero or one unique sequence beyond this interval This line is used in subsequent sections (Figures 5and 6)We note that distributions of the copy numbers have well-defined shape according to central limit theorem it is a normal distributionWith enough replicas it should be possible to extrapolate the center of this distribution define the solutions explicitly and bypass thestochastic nature of the Sa operator

sequencing experiment [45] We recently scaled this effortto 50 primer sets and evaluated the performance replicas ofsimple selection procedures (in preparation)

The measurement of the copy numbers of sequencesis a separate problem that can be described using thesame sampling operators and bias operators that describehow the library is skewed by each preparation step Forexample isolation of DNA by gel purification disfavors AT-rich sequences whereas PCR favors sequence with withinspecific GC-content range [32]The real sequence abundancein any phage library ( real119899) hence has to be derived from theobserved sequence abundance ( obs119899) by solving this equation

obs119899 = (

f5Sa An) ( f4Sa Seq) ( f3Sa PCR)

times (f2Sa Is) ( f1Sa Gr) real119899

(7)

In this equation each operator in brackets describes a bias ata particular step fSa describes sampling at that step and f1ndashf5 describe the sampling fractions The bias in growth (Gr)isolation (Is) PCR amplification (PCR) and sequencing(Seq) could be related to the nucleotide sequences TheAn analysis operator is a matrix that describes retainingdiscarding or correcting the sequence (Figure 2(b)) An ideal

An operator could compensate for the biases introducedby another operator (Figure 2(c)) To define such operator(7) could be potentially solved using repeated sequencingof a well-defined model library In the next applied sectionwe examine the real deep sequencing data and identifyconditions under which these operators could be at leastpartially defined

22 Analysis or the Error Cutoff inDeep Sequencing Reads Allnext-generation sequencing techniques provide quality score(Phred Score) for every sequenced nucleotide In Illuminasequencing this score is related to the probability of thenucleotide being correct [46] In low throughput Sangersequencing the Phred score monotonously decreases withread length and the mechanisms that yield errors in capillaryelectrophoresis are well understood Common practice inSanger sequencing is to discard all reads after the firstnucleotide with a Phred score of 0 In next-generationsequencing the filtering of the reads is usuallymore stringentas follows

(A) Discard reads that have at least one read that has scorelower than ldquocutoffrdquo

(B) Discard reads that had cumulative Phred score lowerthan cutoff

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

8 Computational and Mathematical Methods in Medicine

C7C

Tags NKKN Bar Adapter Sequence Quality string for sequenceEnd of

read

C7C

C7CC7CC7CC7C

Ph7

Ph7Ph7P12

Ph7

TGGG

GGGC

TGTG

ATGG

TTGG

CGGC

CGGG

CGGG

CGTC

TGTG

ATGC

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

TATTCTCACTCT

CGA CCCF

CCCF

CCCF

CCCF

CCCF

CCF

CCCF

F

CCCF

CCBF

CCCF

FFF

FFF

FFF

FFF

FDF

FFA

FFF

=A

FFF

FFF

FFF

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

DFFHBFFFGEFH

FHHHHHJJJJJJ

ADDDHFEFIEHG

FHGGHHJJJJJJ

FHHHHHJJJJJJ

FHHHHHJJJJJJ

TTG

GAC

GAC

GTA

TTG

CGA

TTG

TCA

TCA

CGA

JJJJJJJJJJJJJJJHHHGIJJJJJgtFHJJIJH simsimsimsimsimsimsim

JJJJJJJJJJHJJIJHIIJJJJBFHAHIHHHF simsimsimsimsimsimsim

JJJJJJJJJJJJGGHIJJJJJJJIJJIJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJIJJJJJJJJJJJJJIJJJ simsimsimsimsimsimsim

JJJJJJJJJJJIIJJJJJJIJIJJIEHHHFHFD simsimsimsimsimsimsim

JJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJ simsimsimsimsimsimsim

JJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJH simsimsimsimsimsimsim

JJIJJIGIJIJEGIJGHFGIIJJFFGGIIGII simsimsimsimsimsimsim

JJJJJJJJJJJIJJJIGHGIHIIIGHGIJJJJH simsimsimsimsimsimsim

GAHGHFHGGDHEHBGGCEHIEGBFDHBDltA simsimsimsimsimsimsim

simsimsimsimsimsimsim

GCTTGTCCTCCGAAGGTGCATCTGCAGTGCGGT simsimsimsimsimsimsim

GCTTGTGCGGCGAATTCTTATCGGAATTGCGGT simsimsimsimsimsimsim

ACTCAGCCTCATCATACTCCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCGACGACTGTTCCAGCTTCGGGTGGAGGTTCG simsimsimsimsimsimsim

GCTTGTGATGCGCGTATGAATCAGCCTTGCGGT simsimsimsimsimsimsim

GCTTGTCAGGATAAGAATTCGCTTTTTTGCGGT simsimsimsimsimsimsim

GCTTGTAAGTCTCTTTCGAGTGCTCTGTGCGGT simsimsimsimsimsimsim

GCTTGTCCTCGTCTGTATCAGATGTCGTGCGGT simsimsimsimsimsimsim

GATGCGCATCGTACGTGTAATGGTGGAGGTTCG simsimsimsimsimsimsim

TGGCGTCATGATTCGTTTTAGACGGTTCAGCGT simsimsimsimsimsimsim

CGGGCTGCGTGTGCGAATCCTGGTGGAGGTTCG simsimsimsimsimsimsim

GEGIDFAEGIFCHC9FFAGHlowastDBAH

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

ltPERFgt

(a)

Sequencing resultsRemoval

All nucleotides

of errors

0L = [S 0n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph gt 30

Phred gt 30

(b)

Base call accuracy ()

Phred score cutoff Every nucleotide must have Phred score higher than

Mean 33-mer accuracy ()

1 5 8 13 20 30

210 680 840 950 990 999

00 00 01 18 717 967

950 975 987 992 995 996

(Base accuracy)33 ()

10cutoff10

33prodi=1

10Qi10

(c)

Cumulative base call accuracy or accuracy

100

10

1

01

001

0001

0000160 70 80 1009050403020100

100

10

01

1

001

0001

00001

of 33-mer sequence ()

Sequ

ence

s (

)

(d)

60

80

100

40

20

0

60

80

100

40

20

0gt90 gt95 gt98

Sequ

ence

s (

)

Sequ

ence

s (

)

Cumulative base call accuracy oraccuracy of 33-mer sequence ()

gt99

(e)

Figure 5 (a) Representative lines from the intermediate file from Illumina deep sequencing analysis (for more information seeERROR TAG data0001txt in Section 4 and our previous publication [6]) The reads have been parsed to identify adapters and barcodesEach read has been tagged according to the library type direction of the read and quality of the adapter regions We use this intermediatelibrary to identify reads that harbor erroneous nucleotides (b) Multiset view of the intermediate library The library contains subsets thathave low medium and high quality reads Error filtering of this intermediate library to eliminates any read with Phred score below 30 yieldsa high quality library of reads 30L (c) Mean accuracy of the reads in the library after error filtering ranges from 95 to 996 Even forvery low-quality cutoff Phred gt 1 the average read quality is 95 (d) Distribution of cumulative read accuracy in libraries processed usingdifferent cutoffs (e) Linear plot of the data presented in (d) with zoom in on the region with gt90 cumulative accuracy

(C) Use a combination of A and B (accept reads withminimal cutoff and minimal cumulative score)

Many of the error analyses in the area of deep sequencingare designed for genetic reads which have variable lengthsand unknown sequence throughout the whole read Analysisof the reads in a phage-display library is a simpler problembecause phage-derived constant adapter regions flank thevariable reads Identification of the adapter region is anecessary first step in the analysis Reads in which theadapters cannot be mapped cannot be used We designedalgorithms to recover reads even if adapters were hamperedby truncation deletion ormutation [6]We observed that the

reads flanked by the erroneous adapters had a significantlyhigher error rate than reads flanked by ldquoperfectrdquo adapters(unpublished) Example of mapping of flawed reads in Illu-mina sequencing is provided in ERROR TAG data0001txt(see Section 4) In the remaining sections we analyze thepopulation of the sequences preceded by a ldquoperfectrdquo adapterto identify possible sequence-specific biases

We analyzed a typical library sequenced by Illumina usingvarious cutoffs (Figure 5) We analyzed a 33-bp segment ofthe library that contained variable seven amino acids anda constant region and GGGS terminus A simple cutoffthat discards reads with Phred lt1 nucleotides yields library

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 9

termed 1119899 which had an average 95 accuracy of the33-nucleotide read Reads that do not contain Phred = 0nucleotide rarely contain multiple low-quality reads The 1119899library was bimodal 80 of the reads had overall accuracyof 99 very few reads with accuracy 5ndash90 and significantnumber of reads with accuracy of 1 (Figures 5(d) and 5(e))These observations suggest that reads can be divided into (i)reads free of errors and (ii) reads with multiple errors

An example of a more stringent cutoff is eliminationof reads with Phred lt13 nucleotides this process yielded alibrary 13119899 in which every nucleotide had gt95 confidenceThe number of total reads in 13119899 was 10 less than numberof reads in 1119899 that is sum( 13119899) = 09sum( 1119899) The observedaverage read accuracy of the read in the 13119899 library was992 Theoretically the 095 confidence cutoff in a 33-mer nucleotide could yield reads with accuracy as low as(095)

33= 18 In practice the probability to find reads with

multiple nucleotides of 95 accuracy was vanishingly smallSpecifically among 500000 reads the lowest observed cumu-lative accuracy was 77 Such a result for example could beobtained in a sequence that has 27 ldquoperfectrdquo nucleotides and5 nucleotides with a Phred = 13 score (1)27(095)5 = 077Applying the most stringent cutoff to eliminate all reads witha Phred lt 30 yielded a library 30119899 in which every nucleotidehad 999 confidence The average confidence of the readsimproved subtly from 992 to 996 The number of totalreads in 30119899 was 30 less than number of reads in 13119899 thatis sum( 30119899) = 07sum( 13119899) It was not clear whether suchcutoff is an improvement or a detriment for analysis In thenext section we examined how frequency of the members ofthe library changed upon application of each error cutoff

23 Example of Error Analysis Sequence-Specific Censorshipduring Phred Quality Cutoff If errors occur by randomchance they should be uniformly distributed in all sequencesRemoval of erroneous read in that case should be identicalto sampling of the library by fSa operator where f is thesampling fraction For example consider the removal ofPhred lt 13 nucleotides from an unfiltered library (processdenoted as 1119899 rarr 13

119899) From the experiments we knowthat sum( 13119899) = 09sum( 1119899) if errors were distributedin sequences at random the 1119899 and 13119899 vectors should berelated as

13119899 =09Sa ( 1119899) (8)

The solutions should reside within a confidence interval

13119899 isin [

lo119862

hi119862] (9)

If errors occur preferentially in specific reads the frequencyof these reads should occur beyond the confidence intervalof the 09Sa This process could be described by a diagonalmatrix Bias as

13119899 =09Sa (Bias ( 1119899)) (10)

The elements of the diagonal matrix Bias = B119894119894 could be

estimated as follows

13119899119894isin [

lo119862119894

hi119862119894] B

119894119894= 1 (11)

13119899119894lt

lo119862119894 B

119894119894=

13119899119894

(091119899119894)

(12)

Figure 6(c) describes the representative solution of the09Sa( 1119899) (green dots) and the confidence interval (bluelines) Supplementary Scheme S3 describes the script thatcalculated this interval frommultiset 1119899 described as a plaintext file PhD7-Amp-0Ftxt using 10000 iterative calculationsof 09Sa( 1119899) This calculation required sim2 hours on a desktopcomputer Confidence interval was estimated as the mini-mum and maximum copy number found after 10000 iter-ations In this approximation of the confidence interval forsequences with the copy number lt10 before sampling it wasimpossible to determine whether the sequence disappeareddue to random sampling or due to bias The values of Biasoperator cannot be defined for these sequences and it couldbe assumed to be 1 (see (11)) For copy number gt10 howeversequence-specific bias can be readily detected We observedthat the removal of Phred lt13 reads yielded a multiset inwhich a large number of sequences deviated beyond theconfidence interval (Figure 6(d)) Their sequences could bereadily extracted by comparing the vector 13119899with the vectorof the lower confidence intervals lo

119862 (see (12)) The solutionof theBias can be illustrated graphically (Figure 6(e)) Top 30censored sequences are listed in Table S1 the other sequencescan be found in the supplementary information (file PhD7-Amp-0F-13F-CENtxt)

We performed similar calculations for 1119899 rarr 30119899 and13119899 rarr

30119899 processes The latter process is the most

interesting because 13119899 library has all nucleotides withinacceptable confidence range (gt95) and the distribution ofcumulative quality suggested that errors on average do notcluster in one read (Figure 5) The 13119899 rarr 30

119899 conversioneliminated 30 of the reads and copy numbers of manysequences deviated significantly from the random samplingthese sequences are represented by green dots outside the blueconfidence interval in Figure 6(h) Top 30 sequences are listedin Table S2 The censorship is not only sequence-specific butalso position-specific In sequences that had been censoredduring the 13119899 rarr 30119899 process lower quality reads clusteredaround 3-4 specific nucleotides (supplementary informationFigure S5)

The mechanism that leads to the disappearance of cen-sored sequences is not currently clear We attempted toidentify common motifs in censored sequences using twoapproaches (i) clustering and principal component analysesbased on Jukes-Cantor distance between sequences and(ii) identification of motifs using multiple unique sequenceidentifier software (MUSI) [17] These approaches couldnot detect any property common to censored reads whichwould make them significantly different from the othernoncensored reads Still we hypothesize that the observed

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

10 Computational and Mathematical Methods in Medicine

Ph gt 30 Ph gt 30

Sequencing resultsRemovalof errors

30n = Bias Sa 0

n

All nucleotides

0L = [S

0n]

30L = [S 30n]

Phred gt 30

(a)

Sequencing results

Ph gt13

1T total sequences

13T total sequences

30T total sequences

30n = Bias Sa 13

n

13n = Bias Sa 1

n

30n = Bias Sa 1

n

13L = [S

13n]

0L = [S

0n]

1L = [S

1n]

30L = [S 30n]

Ph gt 30

Ph gt 13

Ph gt 0

Ph ge 1

Ph gt 30

(b)

38390

597122105

104

103

102

10

1

105

104

103

102

101

13n = Sa 1

n

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(c)

558480

77032

Inside the interval

Outsideintervalthe

of

105

104

103

102

10

1

105

104

103

102

101

13n = Bias 1

n

13n = Bias Sa 1

nC

opy

num

ber i

n Ph

redgt0

libra

ry

Copy number in Phred gt 30 library

(d)

1

01

001

Bias can be definedfor these sequences

Bias cannot be definedfor these sequences

105

104

103

102

101

Bias = 13n|I| |N||

1nminus1

Bias = |IN|

Ratio

of c

opy

num

ber i

nPh

redgt0

Phr

edgt13

Copy number in Phred gt 0 library

(e)

502655

132857

105

104

103

102

101

105

104

103

102

10

1

30n = Sa 1

n

Number of data points per pixelgt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

Copy number in Sa (Phred gt 0) library

(f)

438703

196809

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

Number of data points per pixel

30n = Bias Sa 1

n

30n = Bias 1

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt0

libra

ry

(g)

438703

119777

Inside the interval

Outside ofthe interval

105

104

103

102

10

1

105

104

103

102

101

30n = Bias 13

n

Number of data points per pixel

30n = Bias Sa 13

n

gt30000

gt10000

gt3000

gt1000 1ndash10

gt100

gt300

gt30

gt10

Copy number in Phred gt 30 library

Cop

y nu

mbe

r in

Phre

dgt13

libra

ry

(h)

Figure 6 (a) Operator and multiset description of the error filtering procedure Applying a Phred gt 30 cutoff to library filtered by Phredgt1cutoff ( 1n) yields a subpopulation of the library ( 30n) If errors are sequence-independent the 1n rarr 30n process should be identical torandom sampling ( 30n = Sa 1n) Any sequence-specific bias (Bias) should be detected as deviation from Sa 1n (b) Progressive samplingwith more stringent cutoff (c) Theoretical Sa 1n and theoretical 999 confidence interval (blue) (d) Observation of statistically significantdeviation from Sa operator dots beyond the blue line represent sequences prone to bias Red dots represent sequences that disappeared afterin 1n rarr 30n process or during Sa 1n sampling (e) Magnitude of the bias range from 5 to 100-fold (f) Bias in sampling of Phred gt 30 datafrom Phred gt 1 data ((f) is theory (g) is observed) (h) Bias upon sampling of Phred gt 30 data from Phred gt 13 Many sequences were lost inthis sampling and this loss was statistically significant beyond the 999 interval This result shows that some sequences have propensity toharbor low- and medium-quality reads Distribution of the errors is sequence specific

censorship represents sequence-specific errors which occurin every time such sequence passes though the Illuminaanalyzer For example the sequences listed in Tables S1 andS2 and supplementary files were censored in five independent

experiments which were pooled and processed simultane-ously in one Illumina run Analysis of other instances of Illu-mina sequencing performedby other groups could help prove(or disprove) that censorship is indeed sequence-specific

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 11

and experiment-independent Sequence-specific censorshipduring Illumina analysis has been described in other publi-cations [46] The observations presented above suggest thatreading of some sequences in phage libraries does not yield anaccurate copy number Even if these sequences were enricheddue to binding their apparent copy number in sequencingwould be decreased due to sequencing bias If the magnitudeof bias is known however such error could be corrected Weanticipate that other biases could be calculated for these andother libraries in similar fashion Their calculation extendsbeyond the scope of this paper and it will be performed inour next publication

3 Discussion

31 Significance and Transformative Potential of Library-Wide Error Correction In the Medicinal Chemistry fieldstructure-activity relationships (SAR) and pharmacophoresare built using both positive and negative observations Itis the negative results that bear the most significance inthese studies because they allow mapping of the range ofconditions under which particular structure no longer worksFor example SAR of an R group of a ligand might be builton the following observations A ligand binds to the targetwhen the R group in the specific position is methyl or ethylchanging R to iso-propyl and tert-butyl ablates the bindingThis concludes that the R group must be a small alkyl groupAn analogous situation is found in SARof peptide ligands themost important information from alanine scan mutagenesisis loss of function because it helps identifying the importantresidues Interestingly loss-of-binding conclusions are neverapplied to phage-display The phage-display field is driven bypositive results Most publications report and follow up onlyon sequences enriched in the screen and consider only largecopy numbers interesting All papers focus on sequences thatwere found Very few papers in phage display ask why othersequences were not found

One of the reasons why phage display is not used forSAR-type analysis is because negative observations in phagelibrary cannot be determined with high confidence From apractical point of view measuring zero with high confidencerequires the largest number of observation (the highest depthof sequencing) The payoff however is immense one screenwith ldquoconfident zerosrdquo could potentially yield SAR for everypossible substitution of every possible amino acid We referto this (theoretical) possibility as ldquoInstant SARrdquo and itscondensed theoretical form is described in (3) or (9) and(10) This paper demonstrates that the depth of sequencing isnot the only problem towards this goal Accurate estimate ofnegative results requires complete characterization of the ori-gins of errors in sequencing which yield false negative valuesby censoring certain sequencing Other types of censorshipsuch as growth bias should be characterized and eliminatedas well As the phage display field is currently focused onpositive results the need for optimal error corrections andrecovery of erroneous reads is low With the rise of SAR-type applications in phage display error correction will berecognized as the most significant barrier because it could

lead to improper assignment of low frequencies and negativeresults Improved error correction strategies could assigna lower confidence to the sequence instead of eliminatingthe errors and labeling them as confident zero Propermathematical framework possibly similar to the one used inthis paper could be then used to carry all confidence intervalsthrough calculations to yield reliable SAR-type data

We note that the framework described in this paper issuitable for the analysis of the selection from libraries inwhich the diversity of the libraries before and after selectioncould be covered entirely by deep sequencing With thecurrent depth of sequencing it corresponds to medium-scale libraries of sim106 randommembers and affinity-maturedlibraries that containsim106 pointmutationsWe are in the pro-cess of generating these medium-scale libraries and runningselection procedures that will allow us to apply and refine ourframework In the future as technical capabilities and depthof sequencing increase the process would be applicable tolarger libraries as well

4 Methods

41 Generation of Z-Bars and Other Visualization TechniquesSequencing of the libraries has been described in our previouspublications [6 47] All data visualization in this paper wasdone by MATLAB scripts raw lowasteps output from MATLABscripts subject to minor postprocessing in Adobe Illustratorto adjust fonts relative dimensions of plots Core scripts aredescribed in the supplementary information Other scriptsare available in our previous publication [47] Illumina filesused for the analysis can be found in the directory athttpwwwchemualbertacasimderdamathbiology the fileERROR TAG data0001txt is an example of error-taggedreads PhD7-Amp-xxFtxt is the library filtered with xx Phredcutoff (xx = 1 13 and 30) file PhD7-Amp-13F-30F-COtxtdescribes confidence intervals for Phred(13) to Phred(30)filtering process other files with lowast-COtxt extension describeconfidence intervals of other processes Supplementary Fig-ures S1ndashS4 and Schemes S1 and S2 describe MATLAB imple-mentation of the Sa operator

Acknowledgments

This work was supported by funds from the University ofAlberta and Alberta Glycomics Centre

References

[1] J K Scott and G P Smith ldquoSearching for peptide ligands withan epitope libraryrdquo Science vol 249 no 4967 pp 386ndash390 1990

[2] G P Smith and V A Petrenko ldquoPhage displayrdquo ChemicalReviews vol 97 no 2 pp 391ndash410 1997

[3] A D Ellington and J W Szostak ldquoIn vitro selection of RNAmolecules that bind specific ligandsrdquo Nature vol 346 no 6287pp 818ndash822 1990

[4] C Tuerk and L Gold ldquoSystemic evolution of ligands byexponential enrichment RNA ligands to bacteriophage T4DNApolymeraserdquo Science vol 249 no 4968 pp 505ndash510 1990

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

12 Computational and Mathematical Methods in Medicine

[5] E Dias-Neto D N Nunes R J Giordano et al ldquoNext-generation phage display integrating and comparing availablemolecular tools to enable costeffective high-throughput analy-sisrdquo PLoS ONE vol 4 no 12 Article ID e8338 2009

[6] W LMatochko K Chu B Jin SW Lee GMWhitesides andR Derda ldquoDeep sequencing analysis of phage libraries usingIllumina platformrdquoMethods vol 58 pp 47ndash55 2012

[7] A Ernst D Gfeller Z Kan et al ldquoCoevolution of PDZ domain-ligand interactions analyzed by high-throughput phage displayand deep sequencingrdquo Molecular BioSystems vol 6 no 10 pp1782ndash1790 2010

[8] G V Kupakuwana J E Crill M P McPike and P N BorerldquoAcyclic identification of aptamers for human alpha-thrombinusing over-represented libraries and deep sequencingrdquo PLoSONE vol 6 no 5 Article ID e19395 2011

[9] P A C rsquoT Hoen S M G Jirka B R Ten Broeke et alldquoPhage display screening without repetitious selection roundsrdquoAnalytical Biochemistry vol 421 no 2 pp 622ndash631 2012

[10] H Zhang A Torkamani T M Jones D I Ruiz J Ponsand R A Lerner ldquoPhenotype-information-phenotype cyclefor deconvolution of combinatorial antibody libraries selectedagainst complex systemsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 108 no 33 pp13456ndash13461 2011

[11] R Derda S K Y Tang S C Li S Ng W Matochko and M RJafari ldquoDiversity of phage-displayed libraries of peptides duringpanning and amplificationrdquo Molecules vol 16 no 2 pp 1776ndash1803 2011

[12] R W Hamming ldquoError detecting and error correcting codesrdquoBell System Technical Journal vol 29 pp 147ndash160 1950

[13] D J Rodi A S Soares and L Makowski ldquoQuantitativeassessment of peptide sequence diversity in M13 combinatorialpeptide phage display librariesrdquo Journal of Molecular Biologyvol 322 no 5 pp 1039ndash1052 2002

[14] L Makowski ldquoQuantitative analysis of peptide librariesrdquo inPhage Nanobiotechnology chapter 3 2011

[15] C E Shannon ldquoAmathematical theory of communicationrdquo BellSystem Technical Journal vol 27 pp 379ndash423 1948

[16] U Ravn F Gueneau L Baerlocher et al ldquoBy-passing in vitroscreening next generation sequencing technologies applied toantibody display and in silico candidate selectionrdquoNucleic AcidsResearch vol 38 no 21 2010

[17] T Kim M S Tyndel H Huang et al ldquoMUSI an integratedsystem for identifying multiple specificity from very largepeptide or nucleic acid data setsrdquo Nucleic Acids Research vol40 no 6 article e47 2012

[18] J A Weinstein N Jiang R A White III D S Fisher and S RQuake ldquoHigh-throughput sequencing of the zebrafish antibodyrepertoirerdquo Science vol 324 no 5928 pp 807ndash810 2009

[19] B J DeKosky G C Ippolito R P Deschner et al et al ldquoHigh-throughput sequencing of the paired human immunoglobulinheavy and light chain repertoirerdquo Nature Biotechnology vol 31pp 166ndash169 2013

[20] B Levitan ldquoStochastic modeling and optimization of phagedisplayrdquo Journal of Molecular Biology vol 277 no 4 pp 893ndash916 1998

[21] C S Riesenfeld P D Schloss and J Handelsman ldquoMetage-nomics genomic analysis of microbial communitiesrdquo AnnualReview of Genetics vol 38 pp 525ndash552 2004

[22] M L Sogin H G Morrison J A Huber et al ldquoMicrobialdiversity in the deep sea and the underexplored lsquorare biospherersquordquo

Proceedings of the National Academy of Sciences of the UnitedStates of America vol 103 no 32 pp 12115ndash12120 2006

[23] F Backhed R E Ley J L Sonnenburg D A Peterson and JI Gordon ldquoHost-bacterial mutualism in the human intestinerdquoScience vol 307 no 5717 pp 1915ndash1920 2005

[24] N Beerenwinkel and O Zagordi ldquoUltra-deep sequencing forthe analysis of viral populationsrdquo Current Opinion in Virologyvol 1 no 5 pp 413ndash418 2011

[25] J A Huber D B MarkWelch H G Morrison et al ldquoMicrobialpopulation structures in the deep marine biosphererdquo Sciencevol 318 no 5847 pp 97ndash100 2007

[26] A Wilm P P K Aw D Bertrand et al ldquoLoFreq a sequence-quality aware ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencingdatasetsrdquo Nucleic Acids Research vol 40 pp 11189ndash11201 2012

[27] P J Turnbaugh C Quince J J Faith et al ldquoOrganismalgenetic and transcriptional variation in the deeply sequencedgut microbiomes of identical twinsrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 107 no16 pp 7503ndash7508 2010

[28] CQuince A Lanzen T P Curtis et al ldquoAccurate determinationof microbial diversity from 454 pyrosequencing datardquo NatureMethods vol 6 no 9 pp 639ndash641 2009

[29] S J Watson M R A Welkers D P Depledge et al ldquoViralpopulation analysis and minority-variant detection using shortread next-generation sequencingrdquo Philosophical Transactions ofthe Royal Society B vol 368 no 1614 Article ID 20120205 2013

[30] D R Bentley S Balasubramanian H P Swerdlow et al et alldquoAccurate whole human genome sequencing using reversibleterminator chemistryrdquo Nature vol 456 pp 53ndash59 2008

[31] KNakamura TOshima TMorimoto et al ldquoSequence-specificerror profile of Illumina sequencersrdquo Nucleic Acids Researchvol 39 no 13 2011

[32] M A Quail I Kozarewa F Smith et al ldquoA large genomecenterrsquos improvements to the Illumina sequencing systemrdquoNature Methods vol 5 no 12 pp 1005ndash1010 2008

[33] O Zagordi R Klein M Daumer and N Beerenwinkel ldquoErrorcorrection of next-generation sequencing data and reliableestimation of HIV quasispeciesrdquoNucleic Acids Research vol 38no 21 pp 7400ndash7409 2010

[34] M W Schmitt S R Kennedy J J Salk E J Fox J B Hiattand L A Loeb ldquoDetection of ultra-rare mutations by next-generation sequencingrdquo Proceedings of the National Academyof Sciences of the United States of America vol 109 pp 14508ndash14513 2012

[35] A Syropoulos ldquoMathematics of multisetsrdquo in Proceedings of theWorkshop on Multiset Processing Multiset Processing Mathe-matical Computer Science and Molecular Computing Points ofView pp 347ndash358 Springer 2001

[36] W ArapM G KoloninM Trepel et al ldquoSteps towardmappingthe human vasculature by phage displayrdquo Nature Medicine vol8 no 2 pp 121ndash127 2002

[37] S Blond-Elguindi S E Cwirla W J Dower et al ldquoAffinitypanning of a library of peptides displayed on bacteriophagesreveals the binding specificity of BiPrdquoCell vol 75 no 4 pp 717ndash728 1993

[38] S Cox E Rosten J Monypenny et al ldquoBayesian localizationmicroscopy reveals nanoscale podosome dynamicsrdquo NatureMethods vol 9 no 2 pp 195ndash200 2012

[39] E Rosten G E Jones and S Cox ldquoImage plug-in for Bayesiananalysis of blinking and bleachingrdquoNature Methods vol 10 pp97ndash98 2013

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 13

[40] W L Matochko S Ng M R Jafari J Romaniuk S K Y Tangand R Derda ldquoUniform amplification of phage display librariesin monodisperse emulsionsrdquoMethods vol 58 pp 18ndash27 2012

[41] E A Peters P J Schatz S S Johnson and W J DowerldquoMembrane insertion defects caused by positive charges in theearly mature region of protein pIII of filamentous phage fd canbe corrected by prlA suppressorsrdquo Journal of Bacteriology vol176 no 14 pp 4296ndash4305 1994

[42] L A Brammer B Bolduc J L Kass K M Felice C J Norenand M F Hall ldquoA target-unrelated peptide in an M13 phagedisplay library traced to an advantageous mutation in the geneII ribosome-binding siterdquo Analytical Biochemistry vol 373 no1 pp 88ndash98 2008

[43] G A Kuzmicheva P K Jayanna I B Sorokulova and VA Petrenko ldquoDiversity and censoring of landscape phagelibrariesrdquo Protein Engineering Design and Selection vol 22 no1 pp 9ndash18 2009

[44] D R Wilson and B B Finlay ldquoPhage display applicationsinnovations and issues in phage and host biologyrdquo CanadianJournal of Microbiology vol 44 no 4 pp 313ndash329 1998

[45] R Derda S K Y Tang and G M Whitesides ldquoUniformamplification of phage with different growth characteristics inindividual compartments consisting ofmonodisperse dropletsrdquoAngewandte Chemie vol 49 no 31 pp 5301ndash5304 2010

[46] J C Dohm C Lottaz T Borodina and H HimmelbauerldquoSubstantial biases in ultra-short read data sets from high-throughput DNA sequencingrdquo Nucleic Acids Research vol 36no 16 article e105 2008

[47] W L Matochko S C Li S K Y Tang and R DerdaldquoProspective identification of parasitic sequences in phagedisplay screensrdquo Nucleic Acids Research 2013

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom