introduction to biocomputing - Åbo akademi · computing with dna (and other molecules) •...

Introduction to Biocomputing

Lecture 5: September 20, 2004

Computing with bio-molecules

http://www.abo.fi/~ipetre/biocomp/

September 21, 2004

Computing with DNA (and other molecules)

• Biomolecules: DNA, RNA, protein
• Bio-tools: construct, measure, multiply, manipulate molecules
• Use these tools for computing

Why molecular computing?

• Objective reasons: very small, very precise, very specific, very cheap, and very energy efficient
– Energy efficiency
  • On the scale of 10^19 ligations/J vs. a scale of 10^9 operations/J in electronic computers
  • On the edge of what is thermodynamically possible
– Huge density of stored information
  • 1 g of DNA can store as much data as more than one trillion CDs
– Massive parallelism
– Self-healing, self-sustained systems

Other reasons for molecular computing

• Physical boundaries for the performance of electronic computers
• Fast development of biotechnologies, genetics, and pharmaceutics
• (Theoretical) Understanding the essence of computation

Computing is Easy

1. A METHOD FOR STORING INFORMATION

2. A FEW SIMPLE OPERATIONS FOR ACTING ON INFORMATION

Demonstrating the potential of molecular computations

• Can DNA compute “everything”?
• Theoretical models, e.g., splicing systems
– Universality results: equivalence with Turing machines
– Consequence: in principle, any algorithm can be implemented using biomolecular tools
– How about in practice?

Demonstrating the potential of molecular computations

• Practical demonstrations
– Adleman’s experiment
– Satisfiability of logical formulas
– Cryptanalyzing DES
– Chess problems
– Tic-tac-toe
– Databases
– DNA-based logical circuits
– …

The beginning: Adleman’s experiment (1994)

• L.M. Adleman: Molecular computation of solutions to combinatorial problems. Science, 266, 1021-1024, 1994.

• Showed how DNA can be used to solve difficult math problems
• The problem of choice: the Hamiltonian Path Problem (HPP)

Hamiltonian paths

• Directed graphs: a set of nodes and edges (arrows) among them
• Hamiltonian path from a node s to a node e: start in the node s and follow the edges to arrive in node e, such that all the other nodes have been visited on the way exactly once
– Some graphs have Hamiltonian paths, others do not
• Hamiltonian path problem (HPP): for a given graph, decide if there exists a Hamiltonian path
– Special case of the traveling salesman problem (TSP)
• Example: Adleman’s graph
• Several algorithms are known, but they all have exponential complexity in the worst case
• HPP is NP-complete

Adleman’s graph

Adleman’s graph: one path from 0 to 4 visiting all other nodes

Adleman’s graph: a Hamiltonian path from 0 to 4

Adleman’s approach to HPP

• Input: a directed graph G with n nodes, and two designated nodes vin and vout
• 1. Randomly generate paths in G
• 2. Reject all paths that do not begin in vin or do not end in vout
• 3. Reject all paths that do not involve exactly n nodes
• 4. For each node v of G, reject all paths that do not pass through v
• Output: YES if any paths remain, NO otherwise
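The generate-and-filter steps above can be sketched in software. This is only an illustration under stated assumptions: the 5-node graph, the function names, the sample count, and the 10% early-stop probability are all hypothetical (this is not Adleman's graph), and a set of tuples stands in for the test tube. As in the wet experiment, a NO answer is only probabilistic, since random generation may miss a path.

```python
import random

# Hypothetical 5-node example graph (NOT Adleman's graph).
NODES = [0, 1, 2, 3, 4]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (2, 1), (1, 3), (3, 2)]

def random_path(nodes, edges, max_len, rng):
    """Start at a random node and follow random outgoing edges."""
    path = [rng.choice(nodes)]
    while len(path) < max_len:
        successors = [b for (a, b) in edges if a == path[-1]]
        if not successors or rng.random() < 0.1:  # dead end, or stop early
            break
        path.append(rng.choice(successors))
    return tuple(path)

def filter_paths(nodes, edges, v_in, v_out, samples=20000, seed=0):
    """Return the paths that survive Steps 1-4 (the 'tube' contents)."""
    rng = random.Random(seed)
    n = len(nodes)
    # Step 1: randomly generate paths in G.
    tube = {random_path(nodes, edges, n + 2, rng) for _ in range(samples)}
    # Step 2: keep only paths that begin in v_in and end in v_out.
    tube = {p for p in tube if p[0] == v_in and p[-1] == v_out}
    # Step 3: keep only paths with exactly n nodes.
    tube = {p for p in tube if len(p) == n}
    # Step 4: for each node v, keep only paths passing through v.
    for v in nodes:
        tube = {p for p in tube if v in p}
    return tube

remaining = filter_paths(NODES, EDGES, v_in=0, v_out=4)
print("YES" if remaining else "NO", sorted(remaining))
```

The filters in Steps 2 to 4 are cheap; all the cost sits in Step 1, which is exactly the part the DNA experiment delegates to massive parallelism.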

Main idea

• Exhaustive search through all possible paths in G
• Background engine: the massive parallelism of bio-operations

Experiment design: encoding the nodes

• A node: a 20-mer single strand of DNA
s2 = TATCGGATCGGTATATCCGA
s3 = GCTATTCGAGCTTAAAGCTA
s4 = GGCTAGGTACCAGCATGCTT

Experiment design: Encoding the edges

• Examples:
s2 = 5’-TATCGGATCG GTATATCCGA
s3 = 5’-GCTATTCGAG CTTAAAGCTA
s4 = 5’-GGCTAGGTAC CAGCATGCTT

e2→3 = 3’-CATATAGGCT CGATAAGCTC
e3→2 = 3’-GAATTTCGAT ATAGCCTAGC
e3→4 = 3’-GAATTTCGAT CCGATCCATG

• Watson-Crick morphism h: a mapping applied to strings over the alphabet {A,C,T,G}
h(A)=T, h(T)=A, h(C)=G, h(G)=C
• For a given string u, h(u) is the Watson-Crick complement of u (the single strands u and h(u) can form a perfect duplex DNA molecule)
• Note: h changes the orientation to 3’-5’
• Example: h(5’-CATTAG)=3’-GTAATC

• Each of the 7 node strands is written as two 10-mer halves: si = s’i s”i
• Edge from i to j: the 20-mer DNA strand h(s”i s’j)
(i=0: s0 instead of s”0; j=6: s6 instead of s’6)
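The edge encoding can be checked against the example strands s2, s3, s4 and the edge oligos e2→3, e3→2, e3→4 given in these slides. The sketch below is illustrative: the function names are mine, strands are plain Python strings, and the special handling of the first and last node (whole strand instead of a half) is deliberately left out.

```python
# Watson-Crick complement table from the slides: h(A)=T, h(T)=A, h(C)=G, h(G)=C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def h(strand):
    """Apply the Watson-Crick morphism letter by letter.

    The input is read 5'->3'; the result is to be read 3'->5'.
    """
    return "".join(COMPLEMENT[base] for base in strand)

def edge_oligo(s_i, s_j):
    """20-mer for the edge i -> j: h(second half of s_i + first half of s_j)."""
    return h(s_i[10:] + s_j[:10])

# Node strands copied from the slides.
s2 = "TATCGGATCGGTATATCCGA"
s3 = "GCTATTCGAGCTTAAAGCTA"
s4 = "GGCTAGGTACCAGCATGCTT"

print(edge_oligo(s2, s3))
print(edge_oligo(s3, s2))
print(edge_oligo(s3, s4))
```

Because each edge oligo is complementary to the tail of one node strand and the head of the next, it acts as a splint that holds the two node strands next to each other.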

Experiment design: encoding paths: double strands

• A strand encoding si anneals to a strand encoding ei→j, forming a double strand with sticky ends
• A strand encoding sj then anneals alongside, forming the path from i to j
• A strand corresponding to ej→k follows, then one corresponding to sk, and so on

Adleman’s solution to HPP

• Input: a directed graph G with n nodes, and two designated nodes vin and vout
• 1. Generate all paths in G
• 2. Reject all paths that do not begin in vin or do not end in vout
• 3. Reject all paths that do not involve exactly n nodes
• 4. For each node v of G, reject all paths that do not pass through v
• Output: YES if any paths remain, NO otherwise

Step 1: Generate (all) paths in the graph

• For each node i (except i=0,4) and each edge i→j: mix large quantities of si and ei→j in a single ligation reaction
• Result: DNA molecules encoding random paths
– With enough oligos, all paths are present with high probability
• Note: the 1994 experiment used a huge scale (much larger than needed): each oligo was present in 10^13 copies

Step 2: Start in s0 and end in s4

• Multiply the result of Step 1
• PCR with s0 and s4 as primers
• Result: amplify those paths beginning in node 0 and ending in node 4

Step 3: Exactly 7 nodes on the path

• Run the result of Step 2 through gel electrophoresis
• The 140 bp band (7 nodes × 20 bp) is excised and the DNA recovered
• Gel-purification and PCR
• Result: paths of 7 vertices from 0 to 4

Step 4: All nodes are on the path

• Denature the product of Step 3: single-stranded DNA
• Testing: test for molecules containing s0
• Repeat the testing for s1, …, s4
• Amplify by PCR and run on gel
• Result: molecules encoding Hamiltonian paths from 0 to 4

Discussion

• 7 days of work; the last step was the most time-consuming (one full day)
• The molecular algorithm used here is rather primitive and inefficient
• The steps can be described in an algorithmic way (bio-algorithm): easy to reason about

Scaling up the algorithm

• Quantity of oligos needed in the experiment: a difficult issue
• More edges: more oligos (linear growth)
• More vertices: more oligos (exponential growth)
• Errors: due to incorrect ligation, pseudo-paths may be formed; they are unlikely to survive Step 4
– One must be careful in the experiment design: choose oligos (encoding vertices) without “too many common nucleotides”
– Math: high Hamming distance
• Other errors: losing the Hamiltonian path in Step 4 and getting some non-Hamiltonian ones
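The "high Hamming distance" criterion can be made concrete with a small check over candidate node oligos. This is a minimal sketch, assuming aligned, equal-length strands; the function names are mine, and a real design would presumably also have to consider shifted alignments and complementary strands, which this does not do.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions at which two equal-length strands differ."""
    if len(u) != len(v):
        raise ValueError("strands must have the same length")
    return sum(a != b for a, b in zip(u, v))

def min_pairwise_distance(oligos):
    """Smallest Hamming distance over all pairs; larger is safer."""
    return min(hamming(u, v) for u, v in combinations(oligos, 2))

# Node strands copied from the slides.
oligos = [
    "TATCGGATCGGTATATCCGA",  # s2
    "GCTATTCGAGCTTAAAGCTA",  # s3
    "GGCTAGGTACCAGCATGCTT",  # s4
]
print(min_pairwise_distance(oligos))
```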

Other practical demonstrations

• In this lecture
– SAT
– DES
– Chess problems
• Others
– Tic-tac-toe
– Poker
– Logical gates
– Databases
– Arithmetic
– …

The SAT problem

• A logical formula u built from variables x1, x2, …, and the connectives ~, OR, & (negation, disjunction, conjunction)
– Example: u=(~x1 OR ~x2 OR x3) & (x2 OR x3) & (~x1 OR x3) & ~x3
• Truth-value assignment: f:{x1, x2, …} → {0,1}
• Given u, one can compute f(u)
• u is satisfiable if there is a truth-value assignment f that makes u TRUE: f(u)=1

Example

u=(~x1 OR ~x2 OR x3) & (x2 OR x3) & (~x1 OR x3) & ~x3

• f(x1)=FALSE, f(x2)=TRUE, f(x3)=FALSE
– f(u)=TRUE
• f(x1)=TRUE, f(x2)=TRUE, f(x3)=TRUE
– f(u)=FALSE
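The two evaluations above can be reproduced mechanically. In this sketch the representation is an assumption of mine: a formula is a list of clauses, and each clause is a list of signed integers, where 3 stands for x3 and -3 for ~x3.

```python
# u = (~x1 OR ~x2 OR x3) & (x2 OR x3) & (~x1 OR x3) & ~x3
U = [[-1, -2, 3], [2, 3], [-1, 3], [-3]]

def evaluate(cnf, assignment):
    """Return f(u): every clause must contain at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

first = {1: False, 2: True, 3: False}
second = {1: True, 2: True, 3: True}
print(evaluate(U, first), evaluate(U, second))
```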

The SAT problem

• Problem: for a given logical formula u, decide if u is satisfiable or not
– In other terms, decide if there exists a truth assignment that makes u TRUE
• Complexity: NP-complete problem
– No known algorithm is essentially better than exhaustive search through all 2^k possible truth assignments (k variables)
– Exponential-time complexity for the best known algorithms
• Simplifications (?)
– Conjunctive normal form:
  • Any logical formula can be written as a conjunction of clauses u1 & u2 & u3 & … & un
  • Each clause ui is a disjunction ti,1 OR ti,2 OR … OR ti,ki, with each t being a variable or the negation of a variable
  • Example: u=(~x1 OR ~x2 OR x3) & (x2 OR x3) & (~x1 OR x3) & ~x3
– 3-SAT
  • Each clause consists of exactly 3 variables or negations of variables
  • Example: u=(~x1 OR ~x2 OR x3) & (x2 OR x3 OR x4) & (~x1 OR x3 OR x5)
  • 3-SAT remains NP-complete

Bio-algorithm for SAT

• R. Lipton, 1995: Using DNA to solve NP-complete problems
• Idea
– Exhaustive search, made feasible by the massive parallelism of DNA strands
• Sketch of the algorithm
– Generate all possible truth assignments and reject those not satisfying the formula

Lipton’s approach to SAT

• Take advantage of Adleman’s idea: reduce SAT to a graph problem
• Major step forward: the initial “soup” is the same for all formulas with the same number of variables
– In Adleman’s solution, different graphs need different initial “soups”

Lipton’s graph

• Idea: looking for truth assignments is essentially the same as finding a path in a graph
• Example: u=(~x1 OR ~x2 OR x3) & (x2 OR x3) & (~x1 OR x3) & ~x3

[Figure: Lipton’s graph for the example; recoverable node labels: a1FALSE, a1TRUE, a2FALSE, a2TRUE, a3, v1, v2, vout]

Lipton’s graph

• Example: u=(~x1 OR ~x2 OR x3) & (x2 OR x3) & (~x1 OR x3) & ~x3
• Looking for a truth assignment is the same as looking for a path from vin to vout
• Picking a node from the top gives FALSE to the corresponding variable; picking a node from the bottom gives it the value TRUE

[Figure: Lipton’s graph; recoverable node labels: a1FALSE, a1TRUE, a2FALSE, a2TRUE, a3, v1, v2, vout]

Experiment design

• Vertices (nodes) encoded as 20-mer single-stranded DNA
• Edges: 20-mer single-stranded DNA designed as in Adleman’s experiment
• Lipton encoding: produce an initial “soup” of DNA that encodes all paths from vin to vout
– As in Adleman’s experiment
  • Mix all single strands
  • Allow them to hybridize
  • PCR with vin and vout as primers
  • Result: amplify those paths starting in vin and ending in vout

Experiment design

• Note:
– The graph is the same for all formulas with the same number of variables
– The initial soup is the same for all formulas over k variables
– On the other hand, the exact algorithm changes from formula to formula

The SAT problem

• Example: z=(x1 OR x2) & (~x1 OR ~x2)
• The associated graph

[Figure: the associated graph; recoverable node labels: a1FALSE, a1TRUE, a2FALSE, a2TRUE, v1, vout]

Bio-implementation

• Example: z=(x1 OR x2) & (~x1 OR ~x2)
• Initial test tube: all possible paths from vin to vout, i.e., all possible truth assignments
• Step 1: separate those strands which satisfy the first clause of the formula: (x1 OR x2)
– Either x1 or x2 must be assigned TRUE
– Separate from the initial soup those DNA strands containing the sequence a1TRUE
– Separate from the initial soup those DNA strands containing the sequence a2TRUE
– Mix the two test tubes: all the strands in the result satisfy the first clause
• Step 2: from the result of Step 1 (amplified by PCR) separate those strands satisfying the second clause (~x1 OR ~x2)
– Either x1 or x2 must be FALSE
– Separate those DNA strands containing the sequence a1FALSE
– Separate those DNA strands containing the sequence a2FALSE
– Mix the two test tubes: all the strands in the result satisfy both clauses
• Result: sequence the remaining DNA strands (if any) to get all possible truth assignments satisfying the original formula
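The separate-and-mix procedure can be simulated in software to see why it leaves exactly the satisfying assignments. This is a sketch under assumptions of mine: a tuple of Booleans stands for one strand (one path from vin to vout), a Python set stands for a test tube, and clauses use signed integers (2 for x2, -2 for ~x2). Errors and PCR amplification are not modeled.

```python
from itertools import product

def initial_soup(k):
    """All 2^k truth assignments; the same soup serves every formula over k variables."""
    return set(product([False, True], repeat=k))

def separate(tube, var, value):
    """Strands that carry the sequence for x_var = value (var is 1-based)."""
    return {s for s in tube if s[var - 1] == value}

def lipton_sat(cnf, k):
    """One step per clause: separate once per literal, then mix the tubes."""
    tube = initial_soup(k)
    for clause in cnf:
        mixed = set()
        for lit in clause:
            mixed |= separate(tube, abs(lit), lit > 0)
        tube = mixed
    return tube

# z = (x1 OR x2) & (~x1 OR ~x2)
Z = [[1, 2], [-1, -2]]
print(sorted(lipton_sat(Z, 2)))
```

The formula only determines which separations are run; the initial soup depends on k alone, which is the point made on the earlier slides.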

Complexity of the algorithm

• m clauses in the formula: m steps
• k variables: at most k merge and separate operations
• Problem: cope with the errors
– Get the result with high probability
• Recent advances
– Braich, Chelyapov, Johnson, Rothemund, Adleman (2002): a 3-SAT problem with 20 variables, i.e., exhaustive search through more than 1 million (2^20) possible solutions

• DES = Data Encryption Standard, 1977
• Secret key cryptography
• Encrypts a 64-bit message using a 56-bit key
• Breaking DES = finding the secret key, knowing the encryption of a certain text

Breaking DES

• Classical approaches:
– Differential cryptanalysis (several days on an electronic computer, needs a high number of pairs)
– Dedicated hardware (expensive, specific to DES, 7 hours)
– Internet-based (massive parallelism!)
• Bio-approach: it is very general (applicable to any encryption on 64 bits), 1 day of work (with some preprocessing)

DES algorithm

• Plain text: 64 bits; encrypted text: 64 bits; key: 56 bits, expanded to 64 bits
• Composed of 16 rounds
• Each round is based on
– XOR on 48 bits
– P-box: permutes the bits of the input
– S-box: maps 6 bits into 4 bits based on a given table

Plan of an attack on DES based on DNA

• Given a function f:{0,1}^m → {0,1}^n (e.g., the DES encryption), construct a solution Tf containing all pairs (k,f(k))
– In other terms, for a given text, encrypt that text using all possible keys
– Then compare with what the system encrypts (denote that encryption by E) and find the key it uses
• Separate those molecules in Tf containing E
– Result: molecules encoding (k,E), i.e., (key, encryption) pairs
• Sequence the first half of the molecule to find the key k
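The plan amounts to a precomputed table of (key, encryption) pairs followed by a lookup. The sketch below shows that structure on a toy scale. Everything concrete in it is hypothetical: `toy_encrypt` is an invented 8-bit stand-in and has nothing to do with the real DES rounds, and the key size is 8 bits instead of 56. Because the toy function may map several keys to the same output, the lookup returns a list of candidate keys.

```python
def toy_encrypt(key, message):
    """Hypothetical 8-bit stand-in for an encryption function; NOT DES."""
    x = (message ^ key) & 0xFF
    x = ((x << 3) | (x >> 5)) & 0xFF  # rotate left by 3 bits
    return (x + key) & 0xFF

def build_table(message, key_bits=8):
    """T_f: all pairs (k, f(k)) for one fixed plain text."""
    return [(k, toy_encrypt(k, message)) for k in range(2 ** key_bits)]

def find_keys(table, observed):
    """'Separate' the pairs whose second half equals E, then read off the keys."""
    return [k for (k, c) in table if c == observed]

M0 = 0x5A                      # the fixed plain text
table = build_table(M0)        # built once, reusable for any secret key
secret = 0xC3                  # unknown to the attacker
E = toy_encrypt(secret, M0)    # obtained by using the cryptosystem once
candidates = find_keys(table, E)
print(candidates)
```

The table depends only on the plain text, so the expensive step is done once and each later attack is a single lookup.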

Notes on the DES attack

• Tf depends only on the plain text M0. Denote Tf by DES(M0)
• DES(M0) contains 2^56 DNA strands: less than one liter of water!
• Having DES(M0), Eve can break many DES systems at very little cost (one day of work): generate (M0,E0) and find the key as above
– Eve must be able to use the cryptosystem to encrypt M0; she then compares the result with her DNA database to find the key

Constructing DES(M0)

• Encode all possible 56-bit strings into a DNA solution: less than one liter of water
– Done as in Adleman’s and Lipton’s experiments
– Design oligos for each bit, separated by some spacers
– Allow the oligos to hybridize with each other to form all possible 56-bit strings
• Implement the primitives of the DES circuit: XOR, S-box, P-box
• Apply the algorithm on all 56-bit strings in parallel
• Look for the encoded text E and read the key

Implementing XOR gates

• XOR = exclusive OR
– Definition:
a XOR b = 1 iff either a=1 or b=1, but not both
a XOR b = 0 iff a and b have the same value
• Easy to prove: x XOR y = (x OR y) & (~x OR ~y)
• Already implemented in SAT
– We know how to implement disjunctions and conjunctions, see Lipton’s experiment
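The identity x XOR y = (x OR y) & (~x OR ~y) has only four cases, so it can be verified by enumerating the truth table. A minimal sketch (the function names are mine):

```python
from itertools import product

def xor_via_cnf(x, y):
    """XOR written as a conjunction of two disjunctions."""
    return (x or y) and ((not x) or (not y))

def identity_holds():
    """Compare against the definition: true iff exactly one input is true."""
    return all(
        xor_via_cnf(x, y) == (x != y)
        for x, y in product([False, True], repeat=2)
    )

print(identity_holds())
```

Since the right-hand side is already in conjunctive normal form, an XOR gate needs nothing beyond the separate-and-mix operations used for SAT.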

Implementing S-boxes

• S-box: essentially a function f:{0,1}^6 → {0,1}^4
– The function f is known from the specifications of DES
• f has 16 possible values
• Implementation: separation
• Example:
– f(z)=0000 iff z=a or z=b
– Separate from the initial soup those strands containing the sequence a
– Separate from the initial soup those strands containing the sequence b
– Merge the two test tubes to get all strands giving value 0000 to the function f

Summary of the DES attack

• Construct the initial solution, based on annealing and separation
– Difficult: 4 months of work (these are old estimates, possibly less time nowadays!)
• Find the (plain_text, encrypted_text) pair
– One must be able to use the encryption system ONCE
• Apply separation and sequencing on the initial solution to find the key: less than 1 day of work in the lab
• Virtues
– General method: applicable to any 64-bit encryption
– The initial soup may be used to break any DES system
– Possibility to build such an initial soup and then sell it to anyone interested
• Note: technical errors are not taken into account

A chess problem

• Place knights on a chessboard so that no knight is threatening another
• Princeton University (2000): a bio-implementation based on RNA (instead of DNA as in the other experiments shown here)
– Authors: Faulhammer, Cukras, Lipton, Landweber
– The problem was considered for a 3×3 chessboard

A chess problem

• The problem can be reduced to SAT
– Associate to each square a variable that is true if and only if you place a knight on it
– Once you decide to place a knight on a square, the squares menaced by it must remain free
– Clearly, one may write a logical formula to describe the connections:
((~h & ~f) OR ~a) & ((~g & ~i) OR ~b) & ((~d & ~h) OR ~c) & ((~c & ~i) OR ~d) & ((~a & ~g) OR ~f)
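The formula above can be cross-checked against the chess rule itself by enumerating all 2^9 boards. The sketch assumes the squares are lettered row by row (a b c / d e f / g h i), which is the layout that makes the formula match knight moves; the function names are mine.

```python
from itertools import product

SQUARES = "abcdefghi"  # assumed layout: a b c / d e f / g h i

def slide_formula(v):
    """The formula from the slide; v maps a square letter to True if it holds a knight."""
    a, b, c, d, e, f, g, h, i = (v[s] for s in SQUARES)
    return (((not h and not f) or not a) and
            ((not g and not i) or not b) and
            ((not d and not h) or not c) and
            ((not c and not i) or not d) and
            ((not a and not g) or not f))

def no_attacks(v):
    """Geometric check: no two knights are a knight's move apart."""
    occupied = [divmod(idx, 3) for idx, s in enumerate(SQUARES) if v[s]]
    return not any(
        {abs(r1 - r2), abs(c1 - c2)} == {1, 2}
        for (r1, c1) in occupied
        for (r2, c2) in occupied
    )

def all_boards():
    for bits in product([False, True], repeat=9):
        yield dict(zip(SQUARES, bits))

solutions = [v for v in all_boards() if slide_formula(v)]
print(len(solutions))
```

The formula has no clauses for g, h, i because every attack involving those squares is already covered from the other end, and it never mentions e because a knight in the center of a 3×3 board attacks nothing.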

A chess problem

• Novelty of the implementation:
– Use RNA instead of DNA
– Destructive approach: destroy the unacceptable solutions, rather than separate the acceptable ones (more accurate, more likely to automate)

RNA implementation

• 3×3 chessboard: 10-bit strings (one bit for backup, the other nine for the nine squares of the board)
• Encode in RNA all possible 10-bit strings, as seen in the other experiments
• To destroy a string containing 1 on position a:
– If it contains 1 on position a, then the RNA molecule contains the specific sequence we have designed for that value
– Add the complementary DNA sequence that sticks to the targeted RNA sequence (!)
– Use the enzyme RNase H: it chews up RNA/DNA hybrids and leaves “normal” RNA

RNA implementation

• Example: satisfy the formula ~a OR (a & ~h & ~f)
• Divide the solution into two tubes
– Tube 1
  • Destroy the strings which make the first clause false: destroy the strings with 1 on position a
  • Add a DNA strand complementary to the RNA strand encoding a=1
  • Add the enzyme to destroy the hybrids
– Tube 2
  • Destroy the strings which make the second clause false: those with a=0, h=1, or f=1
  • Add 3 DNA strands complementary to the RNA for a=0, f=1, h=1
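The two-tube destructive step can be simulated to confirm that what survives is exactly the set of strings satisfying ~a OR (a & ~h & ~f). The sketch makes several simplifying assumptions of mine: strings have 9 bits (the backup bit is omitted), a set of tuples stands for a tube, and the two tubes are pooled at the end.

```python
from itertools import product

POSITIONS = "abcdefghi"

def all_strings():
    """The full library: every 9-bit string (backup bit omitted)."""
    return set(product([0, 1], repeat=9))

def destroy(tube, position, value):
    """RNase H step: remove every string with the given bit value at position."""
    idx = POSITIONS.index(position)
    return {s for s in tube if s[idx] != value}

def clause_holds(s):
    """Direct evaluation of ~a OR (a & ~h & ~f)."""
    a, f, h = (s[POSITIONS.index(p)] for p in "afh")
    return bool((not a) or (a and not h and not f))

library = all_strings()

# Tube 1: destroy strings with a=1, keeping those that satisfy ~a.
tube1 = destroy(library, "a", 1)

# Tube 2: destroy strings with a=0, h=1, or f=1, keeping a & ~h & ~f.
tube2 = library
for pos, val in [("a", 0), ("h", 1), ("f", 1)]:
    tube2 = destroy(tube2, pos, val)

survivors = tube1 | tube2
print(len(survivors))
```

Each added DNA strand removes every string containing its target, so three strands in one tube remove the union of the three unwanted sets.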

RNA implementation - results

• Out of the final solutions, 43 molecules were randomly chosen and sequenced
• 42 were correct, 1 was wrong: 97.7% success rate (42 of 43)
• Altogether, 126 knights were placed correctly, 1 wrongly

RNA computer plays chess

Molecular Computing – perspectives

• Over-optimism, over-pessimism
• What can we compute with DNA?
– A “killer” application is needed: a challenge for computer scientists
– Better algorithms than exhaustive search: same comment
– We need better biotech tools to control the molecules (do they exist already?): a challenge for biotech
– Cope with the errors: impact on the size of the solutions (in number of strands)
– How much can we compute: SAT up to 70-80 variables

• Positive side
– Applications to biotechnology: e.g., a SAT implementation used to execute Boolean queries on a “wet” database, based on some tags (IDs)
– Useful in specialized environments: e.g., extreme energy efficiency or extreme information density required
– Provide the means to control biochemical systems just like electronic computers provide the means to control electromechanical systems

• Bad news
– At this moment, we cannot control the molecules with the precision with which physicists and electrical engineers control electrons
– A breakthrough in biotechnology is needed: more automation, more precise techniques
– Example:
  • HPP may be solved nowadays on electronic computers for graphs with 13 500 nodes
  • Adleman’s approach scaled up for graphs with 200 nodes needs more DNA than the weight of the Universe

One last thought

• Adleman: “So here it is (the cell), the most amazing tool-chest you have ever seen. We know it is a great tool-chest, because it was used to build you and me. And even though we are very clumsy in our use of the tools right now, and even though molecular biology has made only a small portion of them available to us so far, we can already use them to build a computer. And if you can build a computer, then presumably many other exciting things can be built.

So, this is the challenge of molecular science: take the tools and build something great.”
