Computing with DNA - PDF Free Download

Computing With DNA Donald Beaver1 Pennsylvania State University

Abstract. We consider molecular models for computing and derive a DNA-based mechanism for solving intractable problems through massive parallelism. In principle, such methods might reduce the eort needed to solve otherwise dicult tasks, such as factoring large numbers, a computationally-intensive task whose intractability forms the basis for much of modern cryptography.

1 Introduction

Until recently, most computer scientists regarded computational devices as being essentially the same: according to a complexity-theoretic strengthening of the Church-Turing thesis, it seemed not only that computational devices are capable of computing the same class of results (recursive ones) but moreover they do so with roughly the same eciency (polynomial dierences in running time). Building on the work of others, Shor recently gave algorithms for factoring and discrete logarithm using a quantum computer, a type of probabilistic Turing machine whose state transitions follow quantum-mechanical laws [S94]. These transitions are such that, roughly speaking, an exponential number of paths are followed, but paths not corresponding to a solution tend to cancel each other out, leaving only those paths corresponding to a solution as ultimately observable results. The question arises whether other computational models might escape the \polynomial bonds" of Turing-like devices. Adleman suggested a positive answer [A94], showing how a series of molecular reactions could solve a 7-node Hamiltonian Path problem. The Hamiltonian Path problem is to nd a simple path in a graph from a given source to a given destination, visiting every node precisely once { or more simply, just to report whether such a path exists. Since Hamiltonian Path is an NP-complete problem and hence outside the currently known range of polynomial-time computational devices, a solution to this problem suggests that molecular computers might also escape polynomial bounds. We design a simple, polynomial-step molecular experiment to illustrate how the massive parallelism supplied by molecular-scale miniaturization might help solve intractable problems. As a benchmark, we discuss the application of our techniques to the well-known cryptographic problem of factoring large numbers and discuss whether molecular-scale computing is indeed a fundamentally new model, a threat to current cryptographic key sizes, or merely a pleasant curiosity (from the viewpoint of a cryptographer, at least). 1

317 Pond Lab, Penn State University, University Park, PA 16802, (814) 863-0147; [email protected]. Supported by NSF CCR-9210954.

2 Computing with DNA Evolution beat us to the punch millions of years ago with the ingenious translation (had it been human) of \CPU" as \DNA." DNA, or deoxyribonucleic acid, encodes our genes and much (if not all) of the operation of our cells, using a four-letter alphabet of bases: A (adenine), C (cytosine), G (guanine) and T (thymine). As a long string of characters (e.g. ACTTAGACCATG), a strand of DNA bears strong similarity to a Turing machine's tape. Unfortunately, many simple and desirable operations on such strings are nontrivial from a molecular and chemical standpoint. Fortunately, many other simple operations are feasible. DNA is normally double-stranded, containing two long strings twisted around each other in a helical form. The two strands are held together by speci c attractions between the bases: A to T, and C to G. As a result, the sequence of bases in one strand is complementary to the sequence in the other. For example, ATTGTC will line up with its Watson-Crick complement, TAACAG, to form a double-stranded molecule, A TT AT AG CT AC G .2 Life reproduces its DNA by copying, and the technical feat known as the polymerase chain reaction (PCR) duplicates this process on a massive scale. Double-stranded DNA can be split into single strands (\denatured") by heating. The single strands can be copied by using DNA polymerase to synthesize the missing complements, after suitable \primers" have been attached as a starting point. In a single such reaction, the number of copies of a strand can be doubled; with repetition, an exponential number of strands can be made. Other tools follow Nature's lead as well. For example, bacteria wage biological warfare using restriction enzymes (RE's), which can cut up foreign DNA containing certain patterns { a sort of anti-password. For example, Eco RI nds the sequence GCATATTATAGC and cuts it into two: GCTTAA AATTCG . Notice that Eco RI does not cut straight through: it leaves two short ends with single-stranded DNA ailing about. These bare, sticky ends can later realign with their original partner, or with some other complementary end oating about in the solution. (A nick in the DNA backbone remains, despite such annealing, but DNA ligase can be used to seal it.) Recombinant DNA technology is based on such operations: cut DNA and permit rough ends to seek out partners and recombine. The vast numbers of DNA molecules in a solution allow a myriad of combinations to occur, including those in which a new gene is inserted { or those in which a string representing a random pair of integers or perhaps a solution to an NP-complete problem is generated. 2

Every strand actually has structurally distinct ends, known as 5' and 3' ends. This imposes a natural orientation to the strands, and they always line up in opposite directions. In the example above, the second strand is described in \?-sense" to make the pairing clear; it would be written as GACAAT in +-sense. When one strand is written directly above another, e.g. GCATATTATACG, they are implicitly assumed to be given in opposite \senses."

Unfortunately, RE's seldom recognize sequences longer than 6 bases, meaning that only a small nite number of short patterns can be found and cut. Fortunately, DNA sequences (oligonucleotides) can be synthesized to order, permitting the recombination of bare ends to occur without depending on RE's to produce the ends in the rst place. The sequence of DNA in a given strand can be determined, although this is infeasible to do for all the molecules in a given solution (unless they are identical). Molecules can be separated according to length by placing them on an agarose gel and applying electrical current to draw them through; smaller molecules travel faster and end up in bands (according to length) further down along the gel. Thus, let us imagine we have the following primitives at hand: (1) duplication and ampli cation of a strand; (2) creation of desired polynomial-length strands; (3) large-scale recombination of strands with complementary sticky ends; and (4) separation by length.

3 NP Complete Problems If all you have is a hammer, everything looks like a nail. Since DNA is linear and the set of tools to manipulate it is limited, some problems are more suitable than others for computation using DNA chemistry. Granted, any string can be written in the 4-letter language of DNA. But how is one to take two base-4 strings of DNA, each representing a number, and then come up with another base-4 string representing their product? Such a goal is quite nontrivial, both abstractly (using only the given primitives) and chemically (at the lab bench). The Hamiltonian Path problem, on the other hand, is a natural candidate, since its solution is merely a line through a graph. No costs are associated with the edges or vertices, and no abstract logical or mathematical manipulation is needed. A candidate path is a solution if (1) it is of the proper length, and (2) it does not visit any vertex more than once. If we encode edges as DNA, we can use the massive parallelism of molecular interactions to generate many combinations of edges, and then select and amplify those which encode a path visiting each vertex exactly once. If no such path exists, no such molecule will be found. We sketch a solution derived without reference to [A94], converging on similar but distinct methods.

3.1 Generating Possible Paths Given graph G = (V; E ) with V = fv1; : : :; v g, source v1 and destination v , we will construct several short molecules representing the edges in E , each marked by one of the n ? 1 positions in which the edge may appear in a given path. In particular, we shall build a molecule e(i; j ; d) to represent edge (v ; v ) 2 E used as the d edge in a possible path. n

n

i

th

j

To build e(i; j ; d), we rst consider the vertices. Each vertex can occur in one of n positions on a path.3 We create n2 distinct sequences v(i; d) for 1 i; d n. The value d is essentially a step number, describing when vertex i has been added to the path. Thus, the occurrence of v(i; d) in a string denotes vertex v appearing at the d position. We let v (i; d) be the Watson-Crick complement of v(i; d). In principle, these various sequences need only be distinct. For practical reasons, some care is needed, since single strands can interact with themselves to form loops, or mildly with other strands that are not fully complementary, leading to unexpected eects when the encodings are actually translated into DNA strands. We must avoid accidental or weak pairing between any sequences other than v(i; d) and its complement, v (i; d). One way to prevent such interactions is to encode each item using a random sequence of moderate length. Issues analogous to the birthday paradox must be considered in choosing a length, and the results must be checked. Alternatively, codes or designs can be used to generate a set of strings with suciently many dierences between any pair. For example, if n = 4, we might make the following assignments:4 i v(i; 1) v(i; 2) v(i; 3) v(i; 4) v (i; 1) v (i; 2) v (i; 3) v(i; 4) 1 AACGA AACGC AACGG AACGT TTGCT TTGCG TTGCC TTGCA 2 CACGA CACGC CACGG CACGT GTGCT GTGCG GTGCC GTGCA 3 GACGA GACGC GACGG GACGT CTGCT CTGCG CTGCC CTGCA 4 TACGA TACGC TACGG TACGT ATGCT ATGCG ATGCC ATGCA The only pairs that are complementary are of the form v(i; d) and v(i; d). Molecule e(i; j ; d) is a simple assembly of two v-sequences, glued together by the arbitrary intervening sequence GGGGG CCCCC : v(i; d) GGGGG CCCCC v(j; d + 1) For example, using the encodings illustrated above, we have: i

th

e(2; 4; 3) =

GTGCC GGGGG CCCCC TACGT

e(1; 3; 1) = TTGCT GGGGG CCCCC GACGC Molecules e(i; j ; d) and e(j; k; d + 1) can anneal to form a larger molecule, since the single-stranded right end v(j ; d + 1) of e(i; j ; d) is complementary to the single-stranded left end v (j ; d + 1) of e(j; k; d + 1). For example, e(1; 3; 1) and e(3; 2; 2) may anneal to form: e(1; 3; 1)e(3; 2; 2) = TTGCT GGGGG CCCCC CTGCG GACGC GGGGG CCCCC CACGG 3 Actually, v1 and vn can occur only in the rst and last positions, respectively. We leave simple optimizations based on this fact to the reader, as they otherwise detract from the presentation. 4 We use length 5 for illustrative and typographical purposes only. Longer sequences should be used.

In general, e(i; j ; d) may anneal to any one of several candidates, namely those of the form e(j; k ; d + 1). Although we will not actually synthesize all of these candidates, the possibility of pairing with dierent edges permits various paths to be generated at random, depending on the random order in which the molecules collide. Let LP and RP be short random primer sequences for the PCR reaction. Molecules containing both of these sequences will be reproduced exponentially, dwar ng the populations of those containing only one or neither. De ne the following oligonucleotides L and R: 0

z

L

}|

{

z

R

}|

{

LP v(n; n) RP LP v(1; 1) RP We are now ready to start the experiment. For each edge (i; j ) 2 E and for each d with 1 d n ? 1, synthesize e(i; j ; d). Synthesize L and R. Mix these various fragments, allowing the ends to anneal and sealing nicks with ligase. Because the vertices are labeled by step numbers, the only resulting molecules that contain both LP and RP will be those assembled randomly from exactly n+1 fragments (two are L and R). Since they do contain L and R, such combinations must represent a path of length n ? 1 from v1 to v . Use PCR to amplify them. As a particular illustration, consider a four-vertex graph with V = f1; 2; 3; 4g and E = f(1; 2); (1; 3); (2; 2); (2; 4); (3; 2); (3; 4)g.We construct 20 oligonucleotides, namely L, R, and the following: e(1; 2; 1) e(1; 3; 1) e(2; 2; 1) e(2; 4; 1) e(3; 2; 1) e(3; 4; 1) e(1; 2; 2) e(1; 3; 2) e(2; 2; 2) e(2; 4; 2) e(3; 2; 2) e(3; 4; 2) e(1; 2; 3) e(1; 3; 3) e(2; 2; 3) e(2; 4; 3) e(3; 2; 3) e(3; 4; 3) After mixing, various combinations of up to ve molecules can form, including e(1; 3; 2) e(3; 4; 3) R L e(1; 3; 1) e(3; 2; 2) e(2; 4; 3) R L e(1; 2; 1) e(2; 2; 2) e(2; 4; 3) R L e(1; 2; 1) e(2; 4; 2) among others. Of all the possible combinations, there are only two molecules that will contain both L and R: n

L e(1; 3; 1) e(3; 2; 2) e(2; 4; 3) R representing Hamiltonian path (1; 3; 2; 4) L e(1; 2; 1) e(2; 2; 2) e(2; 4; 3) R representing non-Hamiltonian path (1; 2; 2; 4)

Both combinations will be ampli ed by PCR.

3.2 Selecting Hamiltonian Paths We now extract those molecules that encode a Hamiltonian path, namely those containing exactly one occurrence of each vertex. In n stages, one for each v , we i

eliminate those paths that have visited v twice, since repeating a single vertex necessitates skipping another. With our limited toolbox, this operation is not so trivial. In stage i, we lengthen every molecule by 20 bases for each time the path it represents visits v . This is achieved by eectively inserting a short fragment I (i; d) at every point that the path visits v . Let v (i; d) be the left half of v(i; d), and v (i; d) the right.5 The complements of v (i; d) and v (i; d) are v (i; d) and v (i; d), respectively. For a given i, we construct n molecules of the form:6 I (i; d) = v (i; d)ATATATATATATATATATAT v (i; d): The AT-sequence in the middle will tend to form a hairpin loop, since it is complementary to itself. The sticky ends, v (i; d) and v (i; d), remain exposed. Now, if I (i; d) comes in the neighborhood of a v(i; d) sequence, its v (i; d) portion can anneal to the left part of the v(i; d), and its v (i; d) portion can anneal to the right part. Thus, I (i; d) binds like an inchworm to occurrences of v(i; d). (A short palindrome is suggested to stabilize the unpaired single-strand portion as a loop and prevent accidental, inappropriate pairing, but this particular method is not necessary. Other llers that do not coincidentally anneal with other sequences can be used. In fact, this is a special case of a more general \text-insertion" method (including substitution and deletion) that may have broader implications for arbitrary site-directed mutagenesis. It also provides a basis for implementing general Turing machines, beyond the scope of the current exposition.) After priming and copying, one strand will represent the original sequence, while the other, new strand will be formed by lling in the gaps between the appearances of I (i; d) (for various values of d). If v is visited only once, then a single I (i; d) will anneal to the single occurrence of v(i; d), and the new strand will be 20 bases longer than the original. If v is never visited, then the new strand will be the same as the original. If v is visited more than once, then several I (i; d) fragments (having dierent values of d) will anneal to the original, one in each place that v is visited. The new strand will therefore include several I (i; d)'s and end up at least 40 bases longer than the original. After amplifying with PCR, the results are separated by length, and only those molecules with length 20 more than the original are retained. (Note that the insertion can be deleted again, by applying essentially the same method, using v (i; d) fragments instead of I (i; d), and selecting molecules that are 20 bases shorter.) After performing this process for each v , the only surviving molecules are those which encode a Hamiltonianpath. Thus, if any molecules survive, then such 5 The requirements on the encoding v(i; d) described in x3.1 must be strong enough i

i

i

L

L

R

R

L

R

L

R

L

R

L

R

i

i

i

i

i

that vL (i; d) not anneal signi cantly to any sequence other than vL (i; d) (and likewise, vR (i; d) should anneal only to vR (i; d)). 6 I (i; d) is written in ?-sense, so that if v(i; d) is written in +-sense, the desired parts will line up as explained.

a path exists, whereas if no molecules are detected, then no such path exists. This solves the NP-hard problem of detecting the existence of a Hamiltonian path. Let us continue the example from x3.1. Recall that two double-stranded molecules were ultimately produced in quantity: = L e(1; 3; 1) e(3; 2; 2) e(2; 4; 3) R; = L e(1; 2; 1) e(2; 2; 2) e(2; 4; 3) R It may help to consider their sequences more explicitly: Molecule Sequence on one strand LP v(1; 1) CCCCC v(3; 2) CCCCC v(2; 3) CCCCC v(4; 4) RP LP v(1; 1) CCCCC v(2; 2) CCCCC v(2; 3) CCCCC v(4; 4) RP In stage 1, I (1; 1) will anneal to v(1; 1) of , resulting in a new molecule that is 20 bases longer. The same occurs with . None of the other inserts { I (1; 2), I (1; 3), and I (1; 4) { anneal to or . The lengthened versions of and are retained. In stage 2, I (2; 3) will anneal to v(2; 3) of , resulting in a new molecule that is 20 bases longer and thus retained for the next stage. While I (2; 3) anneals to the v(2; 3) portion of , I (2; 2) also anneals to the v(2; 2) portion of . Thus will be increased by 40 bases, and none of its descendants will be retained. In stage 3, I (3; 2) will anneal to v(3; 2) of , resulting in a new molecule that is 20 bases longer and thus retained for the next stage. In stage 4, I (4; 4) will anneal to v(4; 4) of , resulting in a new molecule that is 20 bases longer and thus retained. Thus, molecule survives the selection phase and will be detected, signaling that a Hamiltonian path does indeed exist.

3.3 Finding a Hamiltonian Path Although one can reduce the question of nding a Hamiltonian path to answering several questions about whether a Hamiltonian path exists, we give straightforward methods to list one such path. Standard sequencing techniques work only for homogeneous solutions; they fail if the solution contains molecules with different sequences. Since more than one Hamiltonian path can exist, we cannot rely on sequencing to nd a given path, unless we dilute the solution until a single molecule remains, and then amplify it through PCR. While this somewhat chancy procedure may work, we sketch a direct and more tedious method below. We use several dierent phases, each time reducing the set of Hamiltonian paths until only one remains. In the rst phase, we determine the smallestnumbered vertex reached in one step from v1 , among all of the Hamiltonian paths. In separate experiments, one for each i (1 < i < n), radioactively-labelled v(i; 2) is mixed with denatured sample and anneals with paths containing v at position 2. After using polymerase to obtain double-stranded molecules, the results are separated on a gel (see x2). Using autoradiography, a band of radioactivity will be detected for the unbound, short v(i; 2) fragments, and, if there are i

paths containing v at position 2, a second band will be detected. Let i2 be the least-numbered experiment in which occurrences of v(i; 2) are detected. (These \new" primitives can be avoided by using the more laborious techniques of x3.2, of course.) Molecules containing v(i; 2) are then selected and ampli ed using the methods of x3.2; others are discarded. In the subsequent phases j (1 < j < n ? 1), the remaining population of Hamiltonian paths is similarly reduced, by selecting paths that pursue the smallest-numbered vertex v +1 reachable in one step from v . Ultimately, a single Hamiltonian path v1 ; v 2 ; v 3 ; : : :; v ?1 ; v is determined. i

ij

ij

i

i

in

n

3.4 Adleman's Solution In Adleman's experiment [A94], \step numbers" are not used. Instead, initial candidates of varying length are generated, and then separated by length to retain paths of length n ? 1. Our use of step numbers requires n-fold more building blocks to be constructed at the outset, a fairly prohibitive expense. On the other hand, longer molecules tend to be harder to measure precisely with current technology, and step numbers mitigate the less easily measured ineciency incurred by discarding material that happens to form paths of inappropriate length. Selection of Hamiltonian paths in [A94] uses a primitive not included here, in which molecules containing speci c sequences encoding v are \grabbed" and directly retained by strands with a complementary sequence attached to magnetic beads. Both methods require n repetitions of a selection process to check the presence or duplication of each vertex. i

4 Cryptography, Complexity, and Conclusions Rough calculations show that these methods for solving NP-complete problems are not likely to pose a threat to cryptography in the near future. Consider factoring a 1000-bit number, a task beyond the reach of current computers. Using any standard reduction to Hamiltonian Path, (cf. [CLR90], pp. 954-959), we obtain a graph with at least 106 nodes. It is probably safe to underestimate the number of possible paths at 2106 . Thus, on the order of 2106 10300 000 candidate molecules are needed. Vastly underestimating the size of DNA as that of H2O, we have: 103 g/l 6:02 1023molec/mol << 1030molec/l: 18:0g/mol Thus, far more than 10300000=1030 >> 10200000 liters of solution would be needed. Even a more ecient biological computer that only searches among 2500 possible factors would require more than 10120 liters, an amount that might have unsettled the biblical Noah. Complexity theorists would be happy to learn that the shortest-common superstring problem is \easily" solved by combinatorial molecular computers, since ;

even approximating the length of a shortest common superstring is a computationally dicult task (cf. [BJLTY94]). To determine whether a superstring of length k containing given strings s1 ; : : :; s exists, merely synthesize all possible sequences of length k using recombinant methods similar to those outlined above and then use either the magnetic-bead method of [A94] or the text insertion/deletion methods described above to select those sequences containing all the strings s1 ; : : :; s . The use of a DNA-based computer to solve the superstring problem is somewhat ironical, since the superstring problem is itself often motivated by the need to calculate large DNA sequences using data obtained through experimental sequencing of short fragments. Unfortunately, it is easy to overlook the fact that molecular methods will not scale well, whether they are applied to cryptographic problems or NP-hard ones. Unless and until a constructive proof that P=NP is found, the \solution" of an NP-complete problem is highly misleading. Even in the case of factoring, if every path in a superpolynomial computation is assigned a molecule, then a superpolynomial number of molecules will be needed. Merely tting these molecules into a polynomial volume is an impossible task, but a necessary one: if the molecules are going to have enough time to mix and react, then light must presumably be able to pass from one side of the test tube to the other within polynomial time.7 This impossible task does fortunately expose new avenues for further research, leaving open the question of whether black-hole algorithms for factoring exist. n

n

Acknowledgements The author wishes to express thanks to Rei Safavi-Naini, Pavel Pevzner, Mike Waterman, and Jennifer Long for helpful commentary and interest.

References [A94]

L. Adleman. \Molecular Computation of Solutions to Combinatorial Problems." Science 266 (November 1994), 1021{1024. [CLR90] T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms. New York: McGraw Hill, 1990. [BJLTY94] A. Blum, T. Jiang, M. Li, J. Tromp, M. Yannakakis. \Linear Approximation of Shortest Superstrings." JACM 41:4 (1994), 634{647. [S94] P. Shor. \Algorithms for Quantum Computation: Discrete Logarithms and Factoring." Proceedings of the 35th FOCS, IEEE, 1994, 124{134. [W77] J. Watson. Molecular Biology of the Gene. Reading, Massachusetts: W. A. Benjamin, Inc., 1977.

7

Indeed, the classical interactions of DNA molecules can be simulated by a Turing machine in polynomial time, subjecting them to the Church-Turing thesis and to our antithesis: massive parallelism is not the same as exponential time.

This article was processed using the LaTEX macro package with LLNCS style