
Molecules, Gates, Circuits, Computers

Seth Goldstein and Mihai Budiu

March 7, 2003

Contents

1 Introduction
2 The Switch
   2.1 Information
   2.2 Boolean Logic Gates
   2.3 Transistors
   2.4 Manufacturing and Fabrication
   2.5 Moore's Law
   2.6 The Future
   2.7 A New Regime
3 Processor Architecture
   3.1 Computer Architecture
   3.2 Instruction Set Architectures
   3.3 Processor Evolution
   3.4 Problems Faced by Computer Architecture
4 Reliability in Computer System Architecture
   4.1 Introduction
   4.2 Reliability
   4.3 Increasing Reliability
   4.4 Formal Verification
   4.5 Exposing Unreliability to Higher Layers
   4.6 Reliability and Electronic-Nanotechnology Computer Systems
5 Reconfigurable Hardware
   5.1 RH Circuits
   5.2 Circuit Structures
   5.3 Using RH
   5.4 Designing RH Circuits
   5.5 RH Advantages
   5.6 RH Disadvantages
   5.7 RH and Fault Tolerance
6 Molecular Circuit Elements
   6.1 Devices
   6.2 Wires
7 Fabrication
   7.1 Techniques
   7.2 Implications
   7.3 Scale Matching
   7.4 Hierarchical Assembly of a Nanocomputer
8 Circuits
   8.1 Diode-Resistor Logic
   8.2 RTD-Based Logic
   8.3 Molecular Latches
   8.4 Molecular Circuits
   8.5 Molecular Transistor Circuits
9 Molecular Architectures
   9.1 NanoFabrics
   9.2 Random Approach
   9.3 Quasi-Regular Approaches
   9.4 Deterministic Approach
   9.5 Architectural Constraints
10 Defect Tolerance
   10.1 Methodology
   10.2 Scaling with Fabric Size
11 Using Molecular Architectures
   11.1 Split-Phase Abstract Machine
12 Conclusions

1 Introduction

The last 40 years have witnessed persistent exponential growth in computer performance. This incredible rate of growth is a phenomenon known as Moore's law [Moo65]. Gordon Moore predicted in 1965 that the density of devices on a memory chip would double every 18 months. His prediction has proven true. Moore's law has also come to mean that processing power per dollar doubles every year. The most remarkable result of this success is that we have come to count on it. In fact, we expect our computers, video games, and cell phones to get smaller, lighter, more powerful, and cheaper every year.

Advances in semiconductor fabrication have brought smaller devices and wires, increased speed and chip density, greater reliability, and decreased cost per component. This remarkable success is based largely on complementary metal-oxide semiconductor (CMOS) integrated circuits. Improvements to CMOS technology, along with periodic but less predictable developments in computer architecture, have essentially cut the cost of processor performance in half every 18 months.

Today we can begin to see the limits of the growth predicted by Moore's law. The physics of deep-submicron CMOS devices, the costs of fabrication, and the diminishing returns in computer architecture all pose important challenges. The strengths of CMOS have been that its transistors and wires have almost ideal electrical characteristics and that the individual components are manufactured simultaneously with the end product. In other words, when a computer chip is under construction, its transistors and wires are manufactured on the final chip in their final location. This differentiates semiconductors from all other complex manufactured products and is responsible for their low cost. However, as the size of individual components shrinks below 100 nm, their behavior is no longer close to ideal and fabrication costs begin to soar. Further, computer architecture, which constrains the performance of silicon, is reaching limits of its own.

Chemically assembled electronic nanotechnology (CAEN) is a promising alternative to CMOS-based computing that has been under intense investigation for several years. CAEN uses self-alignment to construct electronic circuits out of nanometer-scale devices that take advantage of quantum-mechanical effects. In this chapter we explore how CAEN can be harnessed to create useful computational devices with more than 10^10 gate-equivalents per cm^2. The strategy we describe substitutes compilation time (which is inexpensive) for manufacturing precision (which is expensive). This can be achieved through a combination of reconfigurable computing, defect tolerance, architectural abstractions, and compiler technology. The result is a high-density, low-power substrate that will have lower fabrication costs than its CMOS counterparts.

Using electronic nanotechnology to build computers requires new ways of thinking about architecture and compilation. CAEN, unlike CMOS, is not practical for constructing complex aperiodic structures. Thus CAEN-friendly architectures are based on dense regular structures that can be programmed after fabrication to implement complex functions. For example, nanoBlocks [GB01],


nanoCells [HHP+01], cellular automata [LDK01], quantum cellular automata cells [NRK02], and small PLAs [DeH02] are all programmable structures that can be built out of nanoscale components. Each of these combines nanoBlocks (or similar structures) to form a computational fabric, such as a nanoFabric [GB01], that can be altered after manufacture.

Because CAEN-based devices have a higher defect density than CMOS devices, they require built-in defect tolerance. A natural method of handling defects is to design the nanoFabric (or similar device) for self-diagnosis and then implement the desired functionality by configuring around the defects. Reconfigurability is therefore integral to the operation of any nanoFabric. Fortunately, many of the nanoscale components used to build nanoFabrics are well suited for reconfigurable computing.

In reconfigurable computing, the functions of programmable logic elements and their connections to storage are changed during operation to create efficient, highly parallel processing kernels tailored for the application under execution. The network of processing elements is called a reconfigurable fabric. The data used to program the interconnect and processing elements are called a configuration. Examples of current reconfigurable fabrics include commercial field-programmable gate arrays, for example [Xil02, Alt02], and research prototypes such as Chimaera [YMHB00] and PipeRench [GSM+99]. As we show later, one advantage of nanoFabrics over CMOS-based reconfigurable fabrics is that the area overhead for supporting reconfiguration is virtually eliminated. This magnifies the benefits of reconfigurable computing, yielding computing devices that may outperform traditional ones by orders of magnitude in many metrics, such as computing elements per cm^2 and operations per watt.

In this chapter we take the reader from an informal definition of a bit to the architecture of nanoFabrics. We begin with an overview of some important features of any computing-device technology. We analyze the evidence in favor of digital (as opposed to analog) computation. We then discuss the benefits of using electrons as the basis of information exchange. Finally, we describe the features digital logic must have in order to manage the complexity inherent in computers with hundreds of millions of individual components.

After describing the fundamental implementation technology and its salient characteristics, we provide, in Section 3, an introduction to computer architecture and its recent history, including an explanation of how computers work and the methods by which designers create reliable systems. This overview explains what CAEN-based devices must do to compete with CMOS devices. We conclude the first half of the chapter, in Section 5, with a description of reconfigurable computing.

The second half of the chapter explores CAEN-based computing in greater depth. We begin, in Section 6, by describing the devices and fabrication methods available to the computer architect. These tools constrain the possible architectures that can be built economically. For example, one constraint around which to design an architecture is to disallow three-terminal devices. This simplifies fabrication but could reduce the efficiency of the resulting chip, because three-terminal devices appear essential in the construction of restoring logic families.

In Section 8.3 we describe some circuit families that could be supported with molecular electronics. In particular, we describe how a restoring logic family can be constructed from a molecular latch. The latch is built using only two-terminal devices and clever fabrication techniques. It provides some of the benefits of three-terminal devices without requiring a processing technique that can co-locate three different wires in space. In Section 9 we describe the general space of molecular architectures, and in Section 9.5 we describe some of the constraints imposed on any molecular architecture. In Section 9.1 we give, as a detailed example, a complete nanoFabric architecture.

While we focus on computers, nearly everything discussed applies to other electronic devices. In fact, a computer can be considered a generalization of all electronic circuits because it is a programmable device that can perform the task of any circuit, though perhaps more slowly or less efficiently. If we ignore the amount of power, time, and cost required, a programmable computer can offer the same functionality as any electronic circuit.


Thus, the ideas presented here are equally applicable to general-purpose computers, cell phone controllers, video game consoles, thermostats, radio receivers, and countless other devices.

2 The Switch

In this section we explore the basic building block of a digital computer. We begin with an abstract definition of information to show that the use of digital circuits, and more specifically binary digital circuits, is not arbitrary but rather makes such circuits more robust and less sensitive to error. We then describe the features that logical operations need in order to support the design and implementation of complex digital circuits. We go on to show how logic gates can be realized with transistors. Using the transistor as an example, we conclude the section with a discussion of the essential characteristics that a fundamental building block must have to support digital logic.

2.1 Information

In this section we discuss the two fundamental components of a computer: switches and wires. Switches are used to manipulate information, and wires are used to move information around. To meaningfully discuss the switches and wires that comprise a computer, we must establish a context for their use. Thus, we define what a computer is and what it needs to do. Computers are complex devices that serve a variety of functions, so there are numerous different, but essentially equivalent, definitions of computers. For our purposes we define a computer as a device that manipulates information. The reason for this will become clear.

Other sources also define a computer as an information-manipulation device. For example, Merriam-Webster defines a computer as "a programmable electronic device that can store, retrieve, and process data" [MW00]. The Encyclopaedia Britannica likewise defines a computer as "any of various automatic electronic devices that solve problems by processing data according to a prescribed sequence of instructions" [Enc02]. Both definitions are unnecessarily limited because they require the computer to be electronic; we show below that this is not a sine qua non of computers. However, both definitions agree that a computer is a device that "processes data" or, as we put it earlier, a device that manipulates information.

Our definition of a computer raises two additional questions: what is information, and what does it mean to manipulate information? Information, as an entity that can be measured, is a relatively modern concept introduced by Shannon in 1948 [Sha48]. To have information about a system is to be able to distinguish something about the system. For example, if I have a two-faced coin on a table, it can either be heads up or tails up. If I tell you the coin is heads up, then you can distinguish between the two possible states of the system. If the system were, instead, a room with a light and I told you the light was on, then you would again be able to distinguish between the two possible states of the system, light and dark. The amount of information in each of these systems is the same: heads/tails for the coin, on/off for the light.

Suppose instead that there are two coins on the table, a nickel and a penny to be precise. To distinguish the state of the entire system you would need to know whether both coins are heads up, whether both are tails up, whether the penny is heads up and the nickel tails up, or vice versa. In this system, there are four states. Similarly, if I have a light bulb that can be off or in one of three levels of brightness, then describing the state of the light bulb requires choosing between one of four states (off, dim, normal, bright). As the number of states in a system increases, the amount of information needed to describe it also increases.

The smallest amount of information is the amount needed to distinguish between one of two states. (If there is only one possible state of the system, then one cannot distinguish anything about it, and therefore there is no information in describing it. For example, when you ask a colleague how they are and they reply, "too busy," you get no information, since they are always too busy.) We call this amount of information a bit.¹ One bit of information is enough to

¹ Bit is shorthand for binary digit: a digit in base two (i.e., a one or a zero).


distinguish between one of two states: heads or tails, on or off, zero or one.

We now follow a common rule in computer engineering: whenever possible, build a system using the simplest possible set of complete components. By a "set of complete components," we mean a set of components from which all the desired functionality can be obtained. For example, in this case, we want to be able to manipulate any amount of information. Since the smallest amount of information is a bit, can we just manipulate bits? If so, then we have a very simple system indeed. In fact, strings of binary digits can be used as the form of information that will be manipulated in a computer.

Thus, to keep the basic devices in a computer as simple as possible we limit them to working on bits of information; that is, they operate on binary digits. If we need to distinguish between more than two states we can use strings of binary digits. Two bits of information are enough to distinguish between one of four states. In general, ⌈log2 N⌉ bits of information are enough to distinguish between one of N states. A string of n bits can be used to distinguish 1 out of 2^n numbers, apples, bank accounts, etc. In other words, a string of n bits can be used to represent any number from 0 to 2^n - 1. Of course, we can use numbers to represent the letters in the alphabet, and strings of numbers can then be used to represent words, etc.
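As a quick illustration (a sketch of ours, not from the original text; the function name is made up), the state-counting rule can be checked directly in Python:

    import math

    def bits_needed(num_states):
        # ceil(log2 N) bits suffice to distinguish one of N states
        return math.ceil(math.log2(num_states))

    assert bits_needed(2) == 1    # one coin: heads or tails
    assert bits_needed(4) == 2    # two coins, or the four-level light bulb
    assert bits_needed(26) == 5   # one of 26 letters

    # A string of n bits represents any number from 0 to 2**n - 1.
    n = 8
    print(2**n - 1)               # 255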

In a computer, everything is represented by strings of binary digits. The strings of binary digits can be interpreted as numbers, letters, dollars, or apples. In the end, everything represented in the computer is simply information. This is our first encounter with abstraction, arguably the central concept of computer science. Thus, whether the computer is manipulating your bank account, simulating the universe, or displaying a video image, it is just manipulating information. To make this manipulation as easy as possible, the information is represented as a string of binary digits.

Another reason for using binary digits (and not decimal digits, for example) is robustness. Informally, robustness is increased because it is easier to distinguish between two things (e.g., a light is either on or off) than between tens or hundreds of things (e.g., if the light is on a dimmer it may be one-third on or two-tenths on, etc.). Thus, we will require the basic elements of our computer only to be capable of manipulating binary information.

Information is an abstract notion, but computers are concrete entities that must manipulate concrete entities. In other words, the abstract notion of a bit must be implemented with a real-world bit. What attributes should this concrete bit have? It should always be in one of two states. When in one state it should stay there until actively, but easily, changed. It should be easy to distinguish between the two states of the bit. It should be inexpensive. Finally, the implementation of the bit should be separate from the meaning of the bit. This last requirement is another example of abstraction. It may be the fundamental abstraction of computer science and the reason that Moore's law is possible [Ben02]. This requirement means that the information carried by a bit should not depend on the information carrier. Whether a bit is implemented as the position of a relay (involving billions of atoms) or as the spin of a single electron, the information content is the same: it is either a "1" or a "0." In the context of Moore's law and electronic chips, it means that the devices on a chip can shrink and voltage levels can be scaled down, yet it is still possible to implement a bit (i.e., it is still possible to distinguish between one of two states: a logical one and a logical zero).

Many different information carriers have been proposed: mechanical rods [Dre86], magnetic moments [ES82], amino acids [RWB+96], proteins [WB02], voltage levels, etc. Voltage levels are the traditional means for representing information in computers; hence the definitions of computers include the word "electronic." Essentially, a digital computer uses two voltage levels to represent the two different values of a bit. One level, ground, represents a zero, and the other, Vdd, represents a one. The main reason voltage levels (which are just the potential energy levels of electrons) are so useful is that voltages can be changed quickly and with very little energy, because electrons are light and fast. However, it is hard to evaluate more thoroughly how effective an information carrier is until we also examine how we store and manipulate the information.


Figure 1: Some Boolean logic gates, their symbols, and how they can be implemented using a NAND gate. At the top of each column is the logical symbol for the operation; under that is its corresponding truth table; below that is an implementation of the truth table using only NAND gates.

2.2 Boolean Logic Gates

For the reasons noted above, computers manipulate information using operators that act on binary values. We call such operators Boolean logic gates. Boolean logic was first studied by George Boole, who showed that all arithmetic computations can be synthesized from a very small number of logical primitives, such as logical AND and NOT. These two primitives are a complete set, meaning that they are sufficient to implement all logical functions. Fig. 1 shows how some common logic gates can be implemented using just the logical NAND gate, which is an AND gate followed by a NOT gate.
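To make the completeness claim concrete, here is a small sketch of ours (mirroring the constructions of Fig. 1, though the figure's exact wiring may differ) that builds NOT, AND, and OR from NAND alone:

    def nand(a, b):
        return 0 if (a and b) else 1

    def not_(a):
        return nand(a, a)              # tie both NAND inputs together

    def and_(a, b):
        return not_(nand(a, b))        # a NAND followed by a NOT

    def or_(a, b):
        return nand(not_(a), not_(b))  # De Morgan: a OR b = NOT(NOT a AND NOT b)

    for a in (0, 1):
        for b in (0, 1):
            assert and_(a, b) == (a & b)
            assert or_(a, b) == (a | b)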

While it is possible to construct any logical function out of NAND gates, it is certainly tiresome to do so. For example, Fig. 2 shows the implementation of a one-bit half-adder using NAND gates. A half-adder adds two binary digits and produces the sum and carry of the addition as output. By comparing the truth table, or the gate implementations, we can see that a simpler representation is available: the carry function can be implemented with a two-input AND gate and the sum function with a two-input XOR gate. Using an AND and an XOR gate instead of a collection of NAND gates is a way of managing complexity. It is certainly easier to confirm that the implementation of the half-adder is correct by looking at two gates, the AND and the XOR, than by looking at nine.

Figure 2: An implementation of a half-adder using NAND gates.


This is another example of using the power of abstraction. In this case, we only need to know the behavior of the XOR gate to verify that the sum is implemented correctly. It is not necessary to know its implementation, since we know how it behaves.
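The half-adder comparison can be checked mechanically; this sketch (ours; the exact gate count in Fig. 2 may differ) builds both outputs from NAND gates alone, then verifies them against the two-gate AND/XOR view:

    def nand(a, b):
        return 0 if (a and b) else 1

    def half_adder_nand(a, b):
        t = nand(a, b)
        s = nand(nand(a, t), nand(b, t))   # XOR from four NANDs
        c = nand(t, t)                     # AND: the shared NAND's output inverted
        return s, c

    for a in (0, 1):
        for b in (0, 1):
            assert half_adder_nand(a, b) == (a ^ b, a & b)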

However, for abstraction to work, we must be able to ignore the implementation and focus only on the behavior of a gate. Furthermore, for abstraction to be truly useful, we need to be able to predict the behavior of the gate in many different and possibly complex circumstances. For example, Fig. 3 shows three different and functionally equivalent representations of a 1-bit full-adder. The ability to abstract is the key to managing complexity. Since modern computers require millions of gates, this is one requirement for the design and implementation of computers. This ability imposes certain requirements on the implementation of the actual logic functions and the information carriers.

The first requirement is that a logic gate needs to ensure that its output can be used as the input of another logic gate. The information produced must be recognizable as either a zero or a one by the input of the next gate. Furthermore, the wire along which the information is transmitted must not destroy the information. In a perfect world, each bit of information produced by a gate or carried by a wire would be either exactly at the level of a logic zero (ground, or zero volts in typical electronic systems) or a logic one (Vdd). In practice, various environmental factors combine to degrade the voltage levels, so that what should be at zero volts is instead close to zero, but not exactly zero. If we are to robustly perform calculations in the presence of the noise introduced by the environment, we must be able to recognize nearly-zero values as zero and nearly-one values as one. Further, we want to produce as output a value that is no worse, and hopefully better (i.e., closer to its perfect value), than the input. In other words, our elementary computing devices should be able to restore nearly perfect values back to perfect values. The more ambiguous the input is allowed to be before giving an incorrect answer, the more robust our system will be. If each gate acts to restore values closer to their ideal levels, we can compose many of them together into a complex system without concern over where in the circuit the gate is.

By examining the full-adder implementation it is also evident that the output of a logic gate may have to go to more than one input of another gate. For example, the ha-sum output (the output of the gate labeled "x") of the "a+b" half-adder feeds three NAND gates in the second half-adder. Furthermore, we do not know how the gates will be arranged physically, so we cannot know how long the wires between the gates are. This means that the gates must be able to drive many inputs (at least two in our example) and that the gates must drive long or short wires. The former notion is called fan-out. Any implementation of logical gates must have a fan-out of at least two (i.e., allow a gate to drive at least two other gates).

The final requirement we list here is that the inputs and outputs of a logic gate must be isolated from each other. This requirement ensures that the output of a logic gate is a function of its inputs and not the other way around. Isolation of inputs and outputs allows a circuit to be designed without concern about how the output will be used. If, on the other hand, the output could influence the input, then the designer would have to know, in detail, how the output might change; otherwise, a change at the output could alter the input, causing changes to ripple throughout the system. This is an essential feature if one intends to build circuits with more than one level of logic.

If the logic gates fulfill these three requirements (restoring logic, input/output (I/O) isolation, and fan-out), a computer designer can create new logical operations from previously designed operations without regard to how the previously designed operations were implemented. We gloss over the fact that the logic operations do not have infinite fan-out. In other words, there is a contract in the abstraction that says how many devices a particular operator can drive, and the designer must stay within that contract. Of course, if we have a fan-out of two NAND gates, then we can construct gates with any fan-out by building a tree of NAND gates, where each level of the tree can drive twice as many gates as the previous one (quantified in the sketch after this paragraph). An automatic tool may do this for a designer. Likewise, an automatic tool may analyze the contract for the logic gates and ensure that wires being driven by the gates


Figure 3:Several different representations of a full-adder.

never exceed the contracted length. Thus, the tools the designer uses help to support the abstraction. In fact, the complexity of modern computer design requires tools to support the abstractions.
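The tree-of-gates argument is easy to quantify. A hypothetical helper (ours) counts how many levels of fan-out-two gates a tool would insert to reach a given drive:

    def tree_levels(target_fanout, fanout_per_gate=2):
        # each level of the buffer tree multiplies the available drive
        levels, drive = 0, 1
        while drive < target_fanout:
            drive *= fanout_per_gate
            levels += 1
        return levels

    print(tree_levels(2))    # 1 level: a single gate already drives two
    print(tree_levels(20))   # 5 levels, since 2**5 = 32 >= 20

(A real tool would also account for the logical inversion a NAND introduces at each level, e.g., by using an even number of levels or inverting the tree's input.)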

When designing a family of logical operations one will want to ensure that they fulfill at least these requirements. However, one will probably not want to add too many other requirements to the contract. As we show later in Section 4, adding to the contract may make it easier for the user of the component, but often at a substantial cost. In other words, one cost of abstraction is that the contract can be more general than necessary. For example, if we required that logic gates support a fan-out of 20, then every instance of a logic gate would have to support such a fan-out, even if most of them only needed to drive one or two other gates. Another example of how abstraction can incur overhead is present in the design of the full-adder in Fig. 3. The version built from half-adders includes extra gates because the designer is not allowed to peek into the half-adder abstraction. The simplest example of this is that the inverter gates at the output of the carry-out logic and the input of the OR gate (labeled in the figure) cancel out and are not necessary. This is an example of how allowing optimization across an abstraction layer can provide benefits.

So far we have only described combinational logic: any change on the inputs flows through the circuit and appears at the outputs. However, even the simplest of circuits often need some kind of storage or synchronization.² When the storage elements are small and part of the circuit they are often called registers or latches. Registers allow information to be saved over time and also allow different computations to synchronize with each other.

2.3 Transistors

There have been many robust implementation vehicles for binary logic, but nothing has compared to the transistor in terms of its overall economy, reliability, speed, power usage, and all-around usefulness. The MOSFET, or metal-oxide-semiconductor field-effect transistor, was proposed in 1930 [Lil30], first demonstrated in 1960, and, in its current form, started being used regularly in the 1970s. Since the 1980s it has not had a serious rival for building complex computing devices [Key85].

Fig. 4 shows the circuit diagram and two graphs that characterize a typical N-type MOSFET. Each curve in the graph on the left indicates the amount of current that flows from the source to the drain, depending on the voltage across the source and drain, for a particular voltage applied at the gate. As the gate voltage increases, the amount of current allowed to flow increases. The device acts like a voltage-controlled switch. The graph on the right shows the output voltage (Vout) as a function of the input voltage (Vin).

² The vast majority of modern circuits use a clock signal, a signal which arrives at regular intervals, to synchronize the arrival or departure of signals from computation elements.


Figure 4: Field-effect transistor. The symbol for an N-type MOSFET is on the left. A schematic of how it is constructed is in the middle. On the lower right is a circuit used to plot the two curves above.

As long as the gate voltage is less than a threshold voltage, the transistor is "off." Then the middle region is reached and the transistor switches; finally, there is a region where the transistor is fully "on." Depending on the threshold voltage and the shape of the switching region, this device is (with respect to the requirements we enumerated above) an almost perfect building block for logical operations.

An ideal transistor would have its threshold voltage halfway between zero volts and the voltage level used to represent a one, Vdd. This would give us the most noise immunity. We also want the switching region to be as narrow as possible (i.e., the slope of the curve should be as close to infinite as possible). This slope determines the gain of the device. The gain of a device determines how much amplification the device applies to the input signal. The higher the gain, the faster the switching speeds and the larger the allowed fan-out.
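A toy transfer-curve model (ours; vdd and gain are illustrative parameters, not values from the chapter) shows how a steep switching region restores degraded signals:

    def inverter_out(v_in, vdd=1.0, gain=10.0):
        # idealized inverter: switches around vdd/2 with slope -gain,
        # clipped to the supply rails
        v = vdd / 2 - gain * (v_in - vdd / 2)
        return min(vdd, max(0.0, v))

    print(inverter_out(0.1))   # 1.0: a noisy "almost zero" is restored to Vdd
    print(inverter_out(0.9))   # 0.0: a noisy "almost one" is restored to ground

With gain = 10, any input at or below 0.45*vdd already yields a full-rail output; raising the gain narrows the ambiguous region further.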

What is not characterized by these curves is the isolation between input and output. We can see, by looking at how a FET is built (see Fig. 4), that isolation is inherent in the transistor. The gate of an ideal transistor is in effect one plate of a capacitor (i.e., it is not connected to the source or the drain). In practice, particularly as transistors get smaller, there is some "leakage" between the three terminals. This means that as transistors get smaller, they never really turn "off," and there is some connection between the inputs and the outputs.

Figure 5: A NOT gate constructed out of CMOS transistors.

MOSFETs come in two basic flavors: N-type and P-type. N-type transistors turn on as the gate voltage increases; P-type transistors turn off as the gate voltage increases. The complementary nature of these devices makes it easy to construct restoring logic operations. For example, Fig. 5 shows the transfer curve of a NOT gate built from a P-type and an N-type transistor. Notice that even when the input voltage is quite far from an ideal one or zero, the inverter outputs a nearly ideal complement of the input. In other words, the output is "restored" back to an ideal one or zero. The output is also clearly isolated from the input.

If we take a closer look at this device we also see that when it has fully switched, it uses no power. For example, if Vin is high, then the gates of the two transistors will be charged up. Once they are charged there is no more current flow to the gates. Furthermore, the N-FET is turned on and the P-FET is turned off. Thus, there is a very low-resistance connection between ground and Vout. If Vout is connected to another gate, then it will discharge that gate through this N-FET, after which there will be no more current flow. In general, logic built from complementary MOSFETs (CMOS) uses power only when the devices are switching. (As devices get smaller and Vdd gets lower this is changing, and static power consumption through leakage is becoming a bigger share of the power budget.) What should also be evident from the design of the inverter is that it will switch from high to low as easily as from low to high.

Finally, we can examine how one could create a memory cell. If we connect two inverters together such that the second one feeds back to the first one,


Figure 6: A pair of connected inverters which can store one bit (i.e., a memory cell).

Figure 7:Logic diagram for a D-type latch.

as shown in Fig. 6, we can create a memory cell. If the input to the pair is low, then the first inverter will output a high and the second will output a low, which will feed back to the first inverter. Thus, even if the original input is disconnected, the pair will maintain the low output indefinitely, or for as long as power is being supplied to Vdd. To change the state of the cell, enough current has to be supplied to the input to overcome the output from the second inverter. While these two signals are fighting, there will be current flow from the input to ground through the lower right transistor. There are, needless to say, better ways to form storage cells. For example, Fig. 7 shows an example of a D-type latch. These examples bring out an important requirement for building stateful circuits out of logic devices: power gain is required to build a memory cell from logic gates. If the device has no gain, then as the signal circulates through the gates it will degrade, eventually becoming noise.
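The feedback argument can be mimicked at the logic level; in this idealized sketch (ours), the bit circulates through the inverter pair indefinitely:

    def inverter(x):
        return 1 - x

    state = inverter(0)       # drive the input low once; first inverter outputs 1
    for _ in range(1000):     # disconnect the input and let the loop circulate
        state = inverter(inverter(state))
    print(state)              # still 1: the pair holds the bit

The idealization hides exactly the point made above: a real inverter pair holds the bit only because each stage has gain, re-amplifying the signal on every pass instead of letting it decay into noise.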

2.4 Manufacturing and Fabrication

In 1985, Keyes pointed out that the silicon-based transistor, especially as it is manufactured in very-large-scale integration (VLSI) chips, is an almost ideal computing device [Key85]. In his article he describes the important attributes of a "good" computer device. In addition to the technical aspects that we have outlined here (e.g., I/O isolation, restoring logic, fan-out, etc.), he goes on to describe how photolithographically produced CMOS also satisfies the manufacturing requirements of a good computer device.

Computer systems currently contain anywhere from thousands to hundreds of millions of transistors. To remain affordable, the individual transistors must be inexpensive to manufacture. For the past 30 years, the cost per transistor has been falling exponentially. Even more important is that the cost of connecting transistors together into entire circuits is also inexpensive and has fallen. The method of making chips, photolithography, is the key ingredient in the ever-decreasing cost of computers. The reason for this decrease is that photolithography combines manufacturing and fabrication.

A raw silicon wafer is converted into a chip with potentially millions of transistors through many steps. Here we describe a gross simplification of the process. After the initial wafer preparation, the structures on the chip are built up layer upon layer using a combination of photomasking and etching. Photomasking involves treating the surface of the chip, then shining ultraviolet (UV) light through a mask onto the treated surface. This cycle is called an exposure. The mask creates a pattern of UV light on the treated surface that alters the surface. Then, some form of etching takes place which affects the light-altered surfaces differently than those which were not altered. Three key features of this process are self-alignment of the transistor gates, in-place manufacture of the components, and integration of transistors and wires. All these processes are similar to photographic development; the most important aspect is that all (up to hundreds of millions of) components are created simultaneously.

There are three contacts that need to be made to a transistor: gate, source, and drain. One goal in constructing a transistor is to keep the overlap between the gate and the source (or drain) as small as possible. The current manufacturing of the MOSFET guarantees this because the gate (formed before the source and drain) is used to create the source and drain. The basic idea is that the outer boundary of the source and drain is defined by the location of the polysilicon that makes up the gate. This ensures that the source and


drain do not overlap with the gate. Equally important is the fact that the wires that connect to the gate (or source or drain) can also be manufactured at the same time (or potentially in a later exposure) as the transistor is being formed. Finally, notice that all the transistors are made at the same time. In other words, photolithography supports construction of all the devices in parallel. Current chip technologies require more than 20 separate exposures. The result is that the wires and transistors are all fabricated together. This all combines to make each transistor very inexpensive.

In addition to the fact that millions of components are constructed in parallel and that the components are fabricated into a complete circuit, lithography has the advantage that the entire process is very reliable. A chip with millions of transistors and millions of connections will most often have every single one working. Until recently, there has been very little variability between two transistors at different locations in the chip. What little variability was present was overcome by the high gain and use of restoring logic.

The photolithographically manufactured silicon-based transistor has been the dominant building block for digital electronics over the past 30 years. Alternative technologies have either failed to provide the necessary technical specifications (I/O isolation, restoring logic, gain, low power, high packing density, equal switching times) or the necessary economic advantages (inexpensive to manufacture, reliable, inexpensive to fabricate entire devices). Furthermore, the technology behind silicon-based transistors has been improving all the time, making it increasingly harder for an alternative technology to take hold.

2.5 Moore’s Law

The combination of an excellent switch and inexpensive robust manufacturing has led to constant technological advance since the introduction of the integrated circuit. For roughly three decades the minimum feature size that could reliably be created on a chip has been cut in half about every three years. This has allowed the transistors and wires that make up a computer to shrink in area by a factor of four over the same period, doubling device density every 18 months. The feature size reduction is due to continuous advances in manufacturing and quality control. The fact that we can use these smaller devices is due to the abstraction we presented at the start of this section: the bit. This relentless pace of improvement was first observed and predicted by Gordon Moore in 1965 [Moo65]. Since that time, the semiconductor industry's agenda has been set by this "law."
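Taking these rates at face value, the compounding is easy to check (a back-of-the-envelope sketch of ours):

    years = 30
    halvings = years / 3          # feature size halves about every three years
    growth = 4 ** halvings        # each halving packs ~4x more devices per area
    print(f"{growth:.3g}")        # ~1.05e+06: about a million-fold density gain

A million-fold gain over 30 years matches the trend in Fig. 8, where transistor counts climb from thousands to hundreds of millions.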

The result of this continuous advance is seen mostly in that the number of components that can be built on a chip grows exponentially with time (see Fig. 8), enabling the construction of more powerful and sophisticated circuits. As the number of components per chip increases, systems which used to require multiple chips now require only a single chip. This collocation dramatically shortens the communication lines between subsystems, boosting performance and decreasing system cost. For example, through the Intel 80386 microprocessor, the floating-point operations of Intel processors were carried out on a separate slave chip, called a floating-point coprocessor. The next generation, the 80486, integrated the two on a single die, achieving a performance boost. Power consumption per unit of performance also decreases with feature reduction, because smaller circuits can use lower supply voltages and thus dissipate less power when carrying out the same computational task.

Surprisingly, despite the exponential feature reduction, the total chip size and total dissipated power of microprocessors have kept increasing with time. The main reason is the integration of more and more functionality on the same die. Another reason is that microarchitects have used the available real estate to implement more computational and support structures for executing multiple operations simultaneously.

All of these effects arise because as chipmakers decrease feature size they also maintain yield. Yield is defined as the percentage of manufactured chips which are defect-free. The importance of defect control in integrated circuit manufacturing cannot be overemphasized: higher defect rates translate into higher unit costs. The cost of an integrated circuit


is proportional to the manufacturing cost, but inversely proportional to the yield. As we discuss more fully later, the nature of digital integrated circuits makes them very brittle: even one minute defect can make the whole circuit unusable. In consequence, a high yield is extremely important for keeping costs low. Because of the microscopic size of the electronic devices, a speck of dust or a tiny misalignment in the photolithographic masks can create irrecoverable defects. These requirements steeply drive up the cost of production plants: modern clean-room facilities, using extremely precise mechanical and optical equipment, cost several billion dollars.

Despite the microscopic size of the basic components, extreme care in manufacturing has kept the yield practically constant over the last decade. However, two factors make it unlikely that this trend will continue for long: (1) as the number of components in the same area grows exponentially, the probability that none is flawed decreases; and (2) as the size of the individual components decreases, they become increasingly sensitive to impurities (e.g., minute dust particles, radiation, etc.).
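Point (1) follows from a simple independence model; a hedged sketch of ours (the defect probability is invented for illustration):

    def yield_rate(components, p_defect):
        # probability that not a single component is flawed
        return (1.0 - p_defect) ** components

    p = 1e-9                          # hypothetical per-component defect rate
    for n in (1e6, 1e8, 1e10):
        print(f"{n:.0e} components -> yield {yield_rate(n, p):.3g}")
    # 1e+06 -> 0.999, 1e+08 -> 0.905, 1e+10 -> 4.54e-05

Since unit cost scales inversely with yield, the last line is the economic cliff: a per-device defect rate that is harmless at a million components is ruinous at ten billion.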

2.6 The Future

Impressive as these results have been, in the near future further increases in the performance of silicon-based, lithographically manufactured transistors will be difficult to achieve. Never before has there been so much doubt in the industry about how advances three generations in the future will be accomplished [Sem97, Semte]. There are several major reasons for this. First, the small linewidths necessary for next-generation lithography require the use of increasingly shorter-wavelength light, which introduces a host of problems that are currently being addressed [Pee00]. In addition, as the number of atoms that constitutes a device decreases, manufacturing variability of even a few atom widths can become significant and lead to defects.

More important, however, is the economic barrier to commercial nanometer-scale lithography. New fabrication facilities orders of magnitude more expensive than present ones will be needed to produce chips with the required densities while maintaining acceptably low defect rates. The increasing cost of chip masks, which must be manufactured to single-atom tolerances, precludes commercially viable development of new chips except for the highest-volume integrated circuits. It is entirely possible that further reduction of transistor size will be halted by economic rather than technological factors.

This economic downside is a direct consequence of the precision required for deep-submicron lithographic fabrication. Using lithography, construction of devices and their assembly into circuits occur at the same time. Keyes pointed out that this is a great advantage of silicon because mass fabrication of transistors is extremely inexpensive per transistor [Key85]. However, as Keyes also highlighted, it produces a set of constraints: each element in the system must be highly reliable, and individual devices on a chip cannot be tested, adjusted, or repaired after manufacture. These constraints force the design of a custom circuit to be tightly integrated with manufacture, since no additional complexity can be introduced into the circuit afterward. Lithography is perfectly tuned to deal with these constraints, but at small scales the precision required to obey them becomes the obstacle to further improvement. Any technology hoping to displace lithographically produced CMOS integrated circuits must overcome such obstacles without sacrificing performance or the low per-unit cost made possible by mass production.

2.7 A New Regime

As we have already suggested, we are entering a new regime, one where the transistor, and particularly the photolithographically produced transistor, may not be king. Since the attributes of a good computing device have not changed, we need to examine the assumptions on which the transistor's dominance is based. For example, one of the key assumptions made by Keyes in 1985 is that individual devices cannot be tested or altered after fabrication. This assumption is what leads to the need for high noise margins, high gain, reliable construction, and low manufacturing variability.


Figure 8: Moore's law describes the exponential increase of the number of transistors per unit area. This graph depicts the transistor counts of successive microprocessor generations of Intel's x86 family (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, Pentium II, Pentium III, Pentium 4), plotted on a log scale from about 10^3 to 10^8 transistors against year, 1970 to 2000.

Current chip manufacturing is based on the idea that the functionality of the chip is fixed at manufacturing time. In other words, the chip is designed, masks are made, and the chip is fabricated, tested, and then put in a package. If there is an error at any stage in the process, the final device is inoperable. This means that every transistor and wire in the fabricated chip needs to be highly reliable, as a single point of failure can cause the entire chip to be defective. An alternative would be to draw on the idea of general-purpose computers, allowing a chip to have its functionality altered after it is fabricated. In other words, we envision a chip as a piece of reconfigurable hardware (see Section 5).

Reconfigurable hardware reduces the need for reliable components: if a particular component on the hardware is faulty, we can simply avoid using it when the hardware's final functionality is determined. In other words, we configure around the faulty component, allowing chips with many defective parts to perform their desired functionality. In Section 5 we also show how such hardware can test itself, eliminating the assumption that individual components cannot be probed.
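A toy version of configuring around defects (ours; the fabric size and defect rate are invented for illustration):

    import random

    random.seed(1)
    FABRIC = 100                       # logical blocks on the fabric
    defective = {b for b in range(FABRIC) if random.random() < 0.10}

    # self-diagnosis produced `defective`; map the design onto what works
    usable = [b for b in range(FABRIC) if b not in defective]
    design_size = 60
    if len(usable) >= design_size:
        placement = usable[:design_size]
        print(f"avoided {len(defective)} bad blocks; "
              f"design placed on {design_size} of {len(usable)} good ones")

The point is that the defect map becomes just more configuration data: a chip that would be discarded under the fixed-function model ships anyway, with its flaws routed around.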

Relating this to the information-theory discussion at the start, we can ask when information is added to the system to create a working chip. The current method adds all the information at or before manufacturing time. Our method allows information to be added after the time of manufacture. This reduces the cost of manufacture substantially. It also increases flexibility and reduces design and testing costs.

If we are going to replace photolithography and the CMOS-based transistor, we need to keep in mind the qualities that any computer requires of its switches. Ideally we hope that they:

• support restoring logic

• isolate the inputs from the outputs

• provide gain

• allow a complete logic family to be constructed

• are inexpensive to manufacture

• use little power (particularly when not switching)


• pack densely

• are reliable

Furthermore, it should be inexpensive and reliable toconnect them up into a complete circuit.

Let us examine these "requirements" in turn. Restoring logic is necessary to support abstraction and the design of complex yet reliable systems. Similarly, input/output isolation is necessary to support the design of large systems. This requirement could, however, be too stringent. For example, imagine a system in which some elements are restoring and others are not. If the elements that restore logic levels appear frequently enough, then some of the switches need not restore logic levels; the same holds for isolating inputs and outputs. In other words, the granularity of the restoring-logic component could be raised from one to a few without sacrificing the ability to abstract. Similarly, as long as there is some element that can isolate parts of the circuit from each other, the designer can safely ignore the effect of the rest of the circuit on a localized structure.

Gain is required to build memory devices and to tolerate noise and manufacturing variability in the environment. If we can construct a system which has a memory element as an atomic element, then gain becomes less crucial (i.e., other fault-tolerance mechanisms can be used to overcome a lack of significant gain). We discuss fault tolerance in Section 4.

Earlier we described a system of complete Boolean logic; for example, an AND and a NOT gate form a complete system. Actually, this assumption is also too stringent. If we require that all inputs to the system appear in both their true and complemented forms (that is, instead of computing just f(a, b) we compute functions f(a, b, ¬a, ¬b) and ¬f(a, b, ¬a, ¬b)), then a complete system is one which includes only an AND and an OR gate.³ In other words, if we consider a system which includes inverters on the inputs to the system, the internal parts of the system do not need to have any inverters.

³ This can be shown using De Morgan's law: ¬a AND ¬b equals ¬(a OR b). ¬a should be read as "not a."
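This can be checked mechanically; the sketch below (ours) evaluates XOR, a function that normally needs inverters, using only AND and OR once the inputs arrive in both polarities:

    # inputs arrive in true and complemented form; the interior is inverter-free
    def xor_dual_rail(a, not_a, b, not_b):
        # a XOR b = (a AND not-b) OR (not-a AND b)
        return (a & not_b) | (not_a & b)

    for a in (0, 1):
        for b in (0, 1):
            assert xor_dual_rail(a, 1 - a, b, 1 - b) == (a ^ b)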

From this analysis it appears that what we absolutely need are devices that pack densely, are low power, and are inexpensive to manufacture. Also, the fabrication of entire devices should be inexpensive as well. Luckily, these requirements are met by chemically assembled electronic nanotechnology.

3 Processor Architecture

In this section we briefly present basic concepts of computer architecture, with an emphasis on processor architecture. We discuss the evolution of processors and identify some hard problems faced by future designers of computer systems. CAEN has the potential to solve some of these problems efficiently.

3.1 Computer Architecture

3.1.1 Computers as Universal Machines

Computers are universal machines because they can be "taught" (i.e., programmed) to implement any task we can describe algorithmically. Computers were not always this way: the first computers were just collections of functional units which were wired together manually to implement the desired computations. For example, if you wanted to subtract two numbers and then square the result, you would hook the wires fetching the numbers into a box which subtracts them, and you would wire the output of the subtracter to both inputs of a box doing multiplications, as in Fig. 9a. If you wanted a different computation, you had to unhook the cables connecting the units and plug them in a different way.

Eventually a great insight occurred: a computer can be made to carry out different computations each time without any modifications to the underlying hardware. This finding is attributed to John von Neumann, although the idea existed in abstract terms long before. Von Neumann realized that the way a computer should behave can be described by a program, which can be stored in the memory of the computer [vN45]. By changing the program, you change what the computer does.


[Figure 9a is a dataflow diagram: inputs a and b feed a subtracter whose output feeds both inputs of a multiplier, producing (a−b)*(a−b). Figure 9b is the program:]

fetch a's value in local storage r1
fetch b's value in local storage r2
r3 = r1 - r2
r4 = r3 * r3
store r4's value as output

Figure 9: (a) Hardware implementation of a circuit computing (a−b)*(a−b). (b) Software program performing the same computation.

Programs are sequences of simple instructions which direct the way data is routed through the computational units of the machine. For example, the previous computation would be implemented by a program having instructions to fetch the two data values from storage, feed them to a subtracting arithmetic unit, and feed the result to a multiplier, as in Fig. 9b.

Most modern computers are based on von Neumann's idea. Their instructions are stored in a memory, which is also used to store data and intermediate computation results. Computers thus consist of memories, storing data and programs; peripheral units, interfacing the computer to the outside world; and a central processing unit, the computer's brain, which manipulates the data according to the program instructions (see Fig. 10). In modern computers the central processing unit consists of one (or sometimes multiple) microprocessors.
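A minimal stored-program interpreter (ours; the instruction format is invented for illustration) makes the idea concrete: the program of Fig. 9b sits in the same memory as its data, and a fetch-execute loop carries it out:

    memory = {
        "a": 7, "b": 3,                       # data
        "prog": [("load",  "r1", "a"),        # the program of Fig. 9b
                 ("load",  "r2", "b"),
                 ("sub",   "r3", "r1", "r2"),
                 ("mul",   "r4", "r3", "r3"),
                 ("store", "out", "r4")],
    }

    regs = {}
    for inst in memory["prog"]:               # the fetch-execute loop
        op = inst[0]
        if op == "load":
            regs[inst[1]] = memory[inst[2]]
        elif op == "sub":
            regs[inst[1]] = regs[inst[2]] - regs[inst[3]]
        elif op == "mul":
            regs[inst[1]] = regs[inst[2]] * regs[inst[3]]
        elif op == "store":
            memory[inst[1]] = regs[inst[2]]

    print(memory["out"])                      # (7-3)*(7-3) = 16

Replacing the list under "prog" changes what the machine computes, with no change to the loop: that is von Neumann's insight in miniature.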

Von Neumann architectures have the tremendous advantage of flexibility, as the functionality of a computer can be changed just by loading a different program, without any hardware modifications. However, this flexibility results in a weakness, called the "von Neumann bottleneck" [Bac78]: there is inherently a sequential flow of information between memory and processor.

Figure 10: In von Neumann architectures, programs and data are objects of the same kind and are stored in memory. The program is composed of instructions which indicate how data is to be processed.


Instructions and data are brought essentially one by one from memory and processed by the processor. Computer architects have somewhat enlarged the width of the communication channel between memory and processor, but the bottleneck persists in today's computers and has become one of the major limiting factors of computer system performance.

3.2 Instruction Set Architectures

In 1964, the IBM System/360 introduced the idea of a family of computers that would all execute the same instructions. The instruction set constitutes an interface between the microprocessor hardware and software. This interface is called the instruction set architecture (ISA); it is a contract between the two worlds. The interface is similar to the electrical plug in our homes: the electricity can be produced in various ways (using coal, atomic energy, or renewable resources) and can be consumed in countless other ways. However, the interface delivering the electricity is precisely standardized, enabling independence between producer and consumer.

This independence also holds between the hardware and software worlds: for example, all Intel processors, from the Intel 8086 built in 1978 until today, have implemented practically the same ISA, called the x86 ISA.4

The ISA standardization is very important economically, because it decouples the software and hardware producers: as long as the ISA is precisely specified, different manufacturers can compete to produce hardware implementing it and software using it. For example, today Intel, Advanced Micro Devices, and Cyrix all produce implementations of the x86 ISA, while countless software producers create programs using it. The existence of an ISA makes hardware and software development independent processes.

4In truth, the ISA has changed slightly from processor to processor, but in a backward-compatible fashion, by always adding new features. This is much like adding a third prong to a wall outlet: it extends the functionality of the outlet but continues to allow you to use two-pronged devices.

In this section we describe briefly the components and implementation of a modern ISA.

3.2.1 Instructions

Modern ISAs differ from one another more in details than in general structure. We can distinguish several types of instructions: arithmetic and logic, memory access, control flow, and input/output.

Basic arithmetic and logical operations do not require further elaboration. Arithmetic is carried out on fixed-size integers, usually 32 or 64 bits long.

Another important type of operation is memory access. The memory of the computer is viewed as a large array of data words, each having a numeric address. Memory access operations allow the processor to compute memory addresses and retrieve the contents of the specified memory cells.

A special class of instructions, called control instructions, indicates to the processor which instruction to execute next. When executing these instructions, the processor can branch to one or another instruction, depending on the outcome of a previous computation.

Basic arithmetic operations, control instructions, and memory access are the only required ingredients for building a truly universal computational machine. Any computational task can be described as a sequence of these operations.

However, the computer must also carry out non-computational tasks, such as "talking" to the peripheral devices. The peripherals are accessed through a special class of input/output instructions, similar to memory access instructions; these instructions, however, write data to and read data from external devices rather than memory.

For reasons of efficiency, some common complex tasks have been encoded in single instructions; for example, modern ISAs also deal with:

• floating point computations, which are performed on finite-precision representations of real numbers


• special "media extension" instructions, used by algorithms manipulating digitally encoded media such as speech or images

• instructions aiding the implementation of the operating system, which offer help in creating the virtual machines protecting programs from one another

• exceptions and interrupts, which are events through which programs or peripheral devices signal to the CPU that something unpredicted has happened, requiring immediate attention and the interruption of the currently running computation

Instructions are represented in the computer memory by binary numbers, and programs are sequences of instructions. To help humans create and manipulate programs, a set of software tools helps translate programs written in textual form to and from these binary representations. One form of textual program representation is "assembly language." Assembly language instructions stand in one-to-one correspondence with the ISA operations: each machine instruction is represented by a corresponding textual description.

To illustrate, the program in Fig. 11, which computes the sum of the numbers between 1 and 10, is written in the x86 assembly language. At the end of each line is a comment, a human-readable description of the instruction and its intended action. This program features arithmetic and control operations. Data values are stored and manipulated in internal microprocessor registers; in this code fragment we see two registers called eax and edx. We also see a branch instruction, which uses the result of a previous comparison to decide whether to re-execute a code fragment. As long as register edx has a value less than or equal to 10, the jle instruction steers execution back to the add instruction. This happens exactly 9 times, causing the add to be executed 10 times.
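For readers more comfortable with a high-level language, the following fragment (our own illustrative Python transcription, not part of the original figure) performs the same computation:

    # Python transcription of the loop in Fig. 11.
    eax = 0               # mov 0,eax
    edx = 1               # mov 1,edx
    while edx <= 10:      # cmp edx,10 / jle L34
        eax += edx        # add edx,eax
        edx += 1          # inc edx
    print(eax)            # prints 55, the sum 1+2+...+10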

3.2.2 Microprocessor Operation

A microprocessor starts executing a program when the computer is started and stops only when the computer is shut down. The processor continuously repeats the same tasks [HP96], depicted in Fig. 12, which can be described as follows:

Fetch: Obtains the next instruction to execute from memory.

Decode: Determines the instruction's desired functionality: which data is to be operated upon, what operation is to be carried out, and where the results should be sent. Decoding generates signals which steer the data through the processor.

Read: Retrieves the data to be operated upon by the instruction, either from memory or from registers.

Execute: Performs the main action of the processor, which consists of generating a result value based on the combination of operands and the selected functionality. In this step "new" information is generated.

Writeback: Commits the result of the instruction. Once the result is known, it is sent to the destination indicated by the instruction: either a memory location or a register. The result overwrites the previous contents of the storage location.
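To make the cycle concrete, the sketch below interprets a toy three-operand instruction set (the register names and instruction format are hypothetical, invented purely for illustration; no real processor is this simple):

    # Minimal fetch-decode-read-execute-writeback loop for a toy ISA.
    regs = {"r1": 7, "r2": 3, "r3": 0, "r4": 0}
    program = [("sub", "r3", "r1", "r2"),   # r3 = r1 - r2
               ("mul", "r4", "r3", "r3")]   # r4 = r3 * r3
    ops = {"sub": lambda a, b: a - b, "mul": lambda a, b: a * b}

    pc = 0                              # program counter
    while pc < len(program):
        instr = program[pc]             # 1. fetch
        op, dst, s1, s2 = instr         # 2. decode
        a, b = regs[s1], regs[s2]       # 3. read operands
        result = ops[op](a, b)          # 4. execute
        regs[dst] = result              # 5. writeback
        pc += 1                         # address of next instruction
    print(regs["r4"])                   # (7-3)**2 == 16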

3.3 Processor Evolution

In this section we briefly overview the evolution of microarchitecture since 1971, when Intel introduced the first microprocessor. It is impossible to do justice to such a broad subject in this short space, so we simply identify the major paradigms used in building processors today.

The evolution of computer architecture has been driven by the insatiable need for more computing power. The truth is that we will never have enough computational power, because important classes of practical problems are very hard in a computational sense [GJ79]; that is, they require a very large amount of computation, which grows rapidly as we increase the amount of manipulated data (e.g., exponentially or even faster).


mov 0,eax     # register eax is initialized with zero
mov 1,edx     # loads the value 1 in register edx

L34:          # label used for branching
add edx,eax   # add contents of register edx to register eax
inc edx       # increment (add 1) contents of register edx
cmp edx,10    # compare contents of register edx with 10
jle L34       # if result is less or equal, jump to label L34

Figure 11: An assembly language program to calculate the sum of the first ten integers.

Figure 12: The phases of execution of an instruction: fetching the instruction from memory, decoding it, reading the operand data, executing the indicated operation, and writing the result.


Our discussion of microarchitecture evolution will therefore be centered on its impact on performance; most microarchitectural innovations were introduced precisely to increase system performance. The performance of a processor is measured by the number of instructions it can execute in a unit of time; it is thus a measure of the instruction throughput of the processor. The unit of measure is million instructions per second (MIPS) [HP96]. MIPS is an imprecise unit because instruction throughput depends on the executed instruction mix (i.e., a processor may be capable of performing many more additions than multiplications in a given time interval). Nevertheless, it is still informative.

3.3.1 Clock Speed Increase

One of the most publicized marketing wars of the last few years has been the "war of the megahertz." The clock of a processor is used to synchronize the computation of all the components on the die; its frequency is therefore a coarse measure of performance.5 Indeed, the performance increase derived from speeding up the clock is truly impressive: in the 31 years since the introduction of the first microprocessor, clock speed went from 740 kilohertz to 3 gigahertz, an increase of roughly 4000 times!

Clock speed increase is driven by miniaturization: as the components shrink, the distances traveled by the electrical signals in a clock cycle also decrease, allowing the use of clocks with shorter periods. Significantly, however, clock speed has already reached some fundamental limits. Given the propagation speed of the electromagnetic signal through the logic gates and wires, at current clock speeds the signal can barely cross from one side of the chip to the other. If clock speed increase continues at the same pace, in 10 years a clock cycle will permit the electrical signal to reach less than 1% of the whole chip surface [HMH01].

5There is no intrinsic number of cycles required for a computation: for example, on some processors a 32-bit multiplication is implemented in 4 clock cycles, while on others it takes 16.

3.3.2 Pipelining

Pipelining is the name used in computer architecture for the assembly-line method of manufacturing. When building cars on an assembly line, car frames are moved from worker to worker. Each worker is specialized in only one operation: adding doors, checking the engine, etc. More importantly, one car can have its doors mounted while another is having its engine checked; the two activities can thus be performed at the same time. While the time to make a single car is unchanged, the time between the completion of one car and the next is significantly shortened.

As mentioned in Section 3.2.2, the "work" to be carried out on each instruction usually consists of five phases: fetching, decoding, reading, executing, and writing.6 If we associate the hardware doing each operation with a worker and the instructions with cars, an implementation of the assembly line gives us a pipelined processor (see Fig. 13).

Pipelining is a straightforward concept, providing great benefits with relatively little effort. It was first investigated in the 1960s and has been universally used since the beginning of the 1980s. In pipelining, instructions enter and exit the pipeline in the order they should be executed. However, a "clumsy worker" at the beginning of the pipeline can reduce the performance of the whole system, because all later stages must wait for it. Unfortunately, while executing programs the instruction flow may be interrupted for a number of reasons. For example, when a control instruction (branch) is executed, the proper following instruction to execute is not known until the branch completes. When a branch enters the pipe, normally no other useful work can be performed until its result is known.
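The arithmetic behind the benefit is simple. Assuming an idealized five-stage pipeline with no stalls (a simplification; real pipelines suffer the interruptions described here), a quick computation shows the gain:

    # Idealized cycle counts for n instructions on a 5-stage machine.
    stages, n = 5, 1000
    unpipelined = stages * n        # one instruction occupies all stages alone
    pipelined = stages + (n - 1)    # fill the pipe, then finish one per cycle
    print(unpipelined, pipelined)   # 5000 vs. 1004: nearly a 5x speedup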

To address the events which prevent the pipeline from working at full capacity, microarchitects resort to very complicated techniques, some of which are described below.

6For some instructions, some phases may be missing, while for other instructions, some of the phases may be further decomposed into subphases.


Figure 13: Pipelined microprocessors have multiple instructions in flight, each in a different execution stage.

(a)  x = a - b          (b)  x = a * a
     r = x * x               y = b * b
                             r = x - y

Figure 14: (a) Two consecutive dependent instructions. (b) The first two instructions are independent.

3.3.3 Parallelism

It is generally accepted that computer programs contain many operations which are not interdependent.

To illustrate, let us consider two examples. If we want to compute $(a-b)^2$ (for some given values of $a$ and $b$), we need to perform a subtraction and a squaring, and the squaring cannot begin until we know the result of the subtraction. We say that these operations are dependent. However, when we want to compute $a^2 - b^2$, we can give each squaring to a different person to compute at the same time (as in Fig. 14).

Exploiting independent instructions by executing them simultaneously (as in Fig. 15) has been the most important trend in microprocessor architecture in the last 20 years.

x = a * a      y = b * b
       r = x - y

Figure 15: Some of the instructions in Fig. 14b can be executed in parallel because they are independent.

This type of execution exploits instruction-level parallelism (ILP) [RF93].

There are other types of parallelism which are exploited in computer systems: for instance, data width increase is a form of bit-level parallelism, where computations occur on more bits of a word at once. Another type of parallelism, exploited in systems having multiple microprocessors, is process-level parallelism,7 in which multiple programs are executed simultaneously, one on each processor.

In this section we briefly overview several techniques used for the exploitation of parallelism.

Word Width Increase Microprocessors can be characterized by the basic word size they operate on, measured in bits.

7A running program is called a “process.”


The Intel 4004 is a 4-bit microprocessor because an addition computes the sum of two 4-bit numbers in one operation.

A word has two important uses: it is the basic unit of data on which the processor operates, and it is also the basic value used to specify a memory address (when using indirect addressing). As a consequence, the size of the memory attached to a computer is limited by the word size. A processor with a word of 16 bits cannot express binary numbers larger than $2^{16}$, which puts a limit of 64 kilowords on the addressable memory.8 The latest processors supporting the x86 ISA are 32-bit processors. The quick increase in memory capacity makes 32 bits too limiting, capping the addressable memory at 4 gigabytes. Workstation microprocessors already have 64-bit words and there are proposals (most notably by Intel's rival, AMD) for extending the x86 ISA to 64 bits as well.

It has been empirically noted that miniaturization permits the growth of memory sizes by about 1 1/2 address bits every two years, keeping cost constant. At this rate, the jump from 32 to 64 bits should buy computer architects about 48 more years before 64-bit address spaces become insufficient. Widening the word size implicitly grows the computational power of the machine: adding two 64-bit numbers requires two 32-bit operations on a 32-bit processor, but takes a single operation on a 64-bit processor.
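To make the address-space limits concrete, a word of $w$ bits can name $2^w$ locations; the lines below (illustrative only) tabulate the limits mentioned above:

    # Addressable locations for common word widths.
    for w in (16, 32, 64):
        print(w, "bits ->", 2**w, "addressable locations")
    # 16 -> 65536 (64 kilowords); 32 -> ~4.3e9 (4 gigabytes of bytes);
    # 64 -> ~1.8e19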

Although memory capacities have grown at an exponential pace, the size of the problems we solve and the nature of current applications have always immediately taken advantage of the increased capacity. For example, today's computers manipulate not just text and graphics, but also sound and video, which are huge storage consumers.

Very Large Instruction Word To return to the auto manufacturing example, another natural way to speed up car production is to assign several workers at once to each car. Each worker is given an independent task, they all work in parallel, and before anyone starts on a new task, they wait for all the other workers to complete their tasks.

8Very old processors used tricks to express addresses as combinations of multiple words, to overcome the word size limitation.

As long as the workers are independent, and the tasks assigned are all fairly equal in the time they take, the presence of more available workers results in the production of more cars per hour. One can see that for the factory to run smoothly, the scheduling of tasks to workers is essential.

To apply this idea to processors, we just need multiple processing units (workers). However, because not all instructions can be processed in parallel due to dependencies, we must first group independent instructions together. When this grouping must be done by the compiler (because the multiple units all work in lock step), the processor is called a very long instruction word (VLIW) machine. The terminology indicates that the machine really executes superinstructions, which comprise several normal, independent instructions. VLIW machines are widely used in signal processing tasks; for example, most cellular telephones contain such microprocessors. Compilers for VLIW machines not only group independent instructions together but also choose the groupings based on the expected execution time of each instruction, planning ahead to best utilize the processing units. The compiler operation which selects the instruction order is called scheduling. Just as in the factory above, the scheduling done by the compiler is essential to an effective VLIW processor. In practice, VLIW machines are also pipelined, having multiple parallel pipelines.
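A toy version of such a scheduler can be sketched as follows (the instruction representation is hypothetical, every instruction is assumed to take one cycle, and the bundle width is unbounded; production schedulers must also respect issue-width and latency constraints):

    # Greedy list scheduling of instructions into VLIW bundles.
    # Each instruction: (destination register, set of source registers).
    instrs = [("x", {"a"}), ("y", {"b"}), ("r", {"x", "y"})]

    bundles = []    # each bundle holds instructions issued together
    ready_at = {}   # register -> index of the bundle that produces it
    for dst, srcs in instrs:
        # earliest bundle in which all sources are already available
        earliest = max((ready_at.get(s, -1) + 1 for s in srcs), default=0)
        if earliest == len(bundles):
            bundles.append([])
        bundles[earliest].append(dst)
        ready_at[dst] = earliest
    print(bundles)   # [['x', 'y'], ['r']]: x and y issue in parallel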

Superscalar Processors VLIW processors work very well as long as the compiler can predict how long each job will take; then the available workers can be assigned very efficiently to the available tasks. However, real life is always more complex, and in a processor unpredictable events may occur. Further, the duration of each instruction may not be easy to predict. In such cases the carefully constructed VLIW schedules become inefficient.

To overcome the scheduling difficulties introduced by unpredictable latencies, computer architects have proposed superscalar processors: these processors dynamically compute dependencies and schedule instructions as soon as all their operands are available.


Much of the work of the compiler's scheduler is moved into hardware in these machines.

Like VLIW machines, superscalar microprocessors have multiple processing units (workers). Unlike VLIW machines, each idle worker can begin a task as soon as there is something available to perform. The length of time each worker takes to process each instruction, or which worker will handle which operation, does not need to be, and in general cannot be, predicted in advance. Most desktop and workstation processors today are superscalar.

To implement dynamic instruction scheduling in hardware, processors use a series of very complex hardware structures. These structures keep track of the status of each instruction: which instructions have completed execution, which are waiting for data, which are currently executing, and which have not yet started. The instructions waiting for inputs monitor the instructions under execution. When the latter complete and write their results, the waiting instructions fetch the values and update their state correspondingly; once all of their inputs have become available, they are marked as ready for execution.
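A drastically simplified model of this wakeup-and-issue bookkeeping is shown below (dataflow style, assuming every instruction executes in one cycle and ignoring structural limits such as the number of functional units):

    # Toy out-of-order issue: an instruction fires as soon as all of
    # its source registers have been produced.
    instrs = [("x", {"a", "b"}), ("y", {"a"}), ("r", {"x", "y"})]
    ready = {"a", "b"}                 # values available at the start
    waiting, cycle = list(instrs), 0
    while waiting:
        issued = [i for i in waiting if i[1] <= ready]  # operands ready?
        for dst, _ in issued:
            ready.add(dst)             # results broadcast to waiters
        waiting = [i for i in waiting if i not in issued]
        cycle += 1
    print(cycle)                       # 2: x and y in parallel, then r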

Prediction and Speculation No matter how many computational resources we use, if there are many dependent instructions in a program, there is not much that can be done: we must wait for an instruction to complete its computation before giving the result to its dependents. Or do we?

Actually, we can do better. Instead of keeping the functional units idle, we can simply guess what the result of a long-latency instruction will be and start executing its dependents immediately, using the guessed value as input. When the instruction actually terminates, we check whether we guessed its output correctly. If the guess was correct, we have saved time, because we have already started the dependents. If the guess was incorrect, we must unfortunately discard the work of the dependents and start them again, this time using the correct value of the output. This strategy has been shown to be beneficial because correct guesses occur frequently enough to achieve important speed-ups.

Microarchitects have incorporated this technique in processors in multiple guises: guessing is officially called predicting, and executing ahead of time is called speculating. The most frequently used predictors guess which way branch instructions will go [MH86], so that execution can start early from the target instruction.
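A classic example is the two-bit saturating counter, which changes its prediction only after two consecutive surprises; the outcome trace below is invented purely for illustration:

    # Two-bit saturating counter branch predictor.
    # States 0,1 predict "not taken"; states 2,3 predict "taken".
    state, correct = 2, 0
    outcomes = [True, True, False, True, True, True]  # actual directions
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        # move the counter toward the actual outcome, saturating at 0 and 3
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    print(correct, "of", len(outcomes))               # 5 of 6 correct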

One way of using speculation to remove control dependencies is depicted in Fig. 16. In this example there is no prediction, because both possible branches of a conditional are executed.

While we noted earlier that there is not much to lose by guessing, speculation does have a downside: it requires very complicated hardware resources to track the predictions and to flush out the speculative results when predictions are wrong.

Speculation also interferes with other tasks of the processor: for example, instructions executed speculatively cannot simply complete until it is known whether they should have been executed in the first place. This raises many issues if these instructions trigger exceptional events (such as a division by zero): the treatment of these events must be postponed until the prediction has been validated.

Thread-Level Parallelism Instruction-level parallelism is not the only type of parallelism that can be exploited to speed up execution. What if we could decompose an application into multiple relatively independent activities which can be carried out in parallel? A typical example is a web server: it has to do some work for each request coming from a web client. The work for several clients can be done completely in parallel, because the clients' actions largely do not interfere with each other. This type of parallelism is called thread-level parallelism: one way we can view a threaded program is as being composed of a multitude of identical programs, all executing the same code, but manipulating different values.

The newest processors have architectural support for multithreading. Multithreading is used to cover the periods of inactivity caused by long-latency instructions: when a thread has to stall, for example because a long-latency instruction did not retrieve its data fast enough,


if (C()) A else B

Figure 16: (a) Program fragment, (b) normal execution, (c) speculative execution. In the speculative version, both targets of a conditional branch and the condition itself are executed simultaneously. This removes the requirement for serializing the execution of the branch and of the target code.

another thread can use the idle processor to run its own computation [TEL95].

The problem with multithreading is compilation. Although we can write compilers to extract ILP from programs, the quest for a compiler which automatically decomposes a single application into multiple threads is much farther from fruition. Moreover, manually writing programs with multiple threads is a very cumbersome task; it is difficult for humans to reason about multiple simultaneous activities that might interfere with each other.

Multiprocessors When computers have several different jobs to perform, they can benefit from the existence of multiple microprocessors. While multiprocessor systems are presently a high-end feature, with the pace of feature shrinkage we can expect that in the near future such systems will be common in ordinary desktop computers.

3.3.4 Caches

At the beginning of this section we mentioned the von Neumann bottleneck: the processor has to bring instructions and data from memory and to store computation results in memory. We further noted that memory accesses must inherently be done sequentially. To compound this bottleneck, the speed of memory relative to the processor has been decreasing at an exponential pace [HP96]. Processor and memory speeds have both grown exponentially with time, but while processors boast a respectable 55% increase per year, the time taken by a memory chip to service a request shrinks only about 10% each year. This means that, relative to the processor, memory is becoming slower and slower.

The memory commonly used in a computer system is of a type called dynamic random access memory (DRAM). DRAM is relatively cheap and compact, but slow (and relatively slower with each new generation). Computer architects have another type of memory at their disposal, static RAM (SRAM). SRAM is bulky and much more expensive than DRAM.


However, its performance increase keeps pace with the speed increase of the processor.

The first microprocessor generations were slow enough to use plain DRAM: their clock cycle allowed memory accesses to complete during the time of one processor operation. However, the disparity in speed increase soon led to a crossover point, which happened at the beginning of the 1980s; since then, processors have been faster than memory. Today, a memory access can take as many as 500 processor cycles, meaning that the processor wastes 499 cycles for every access to DRAM. Computer architects came up with a compromise solution which trades cost for speed by creating an intermediate static RAM layer between the processor and the main memory. This intermediate layer is called a cache.

The cache is used much like students use their short-term memory before an exam, when they cram and memorize the information required for the exam. The information is then discarded the following day, when they start cramming for the next test.

Whenever the processor needs data from memory, the data is automatically brought into the cache by some helper circuitry and served to the processor from there. This is not much benefit unless the processor asks for the same data item again soon: the second time, it will be brought directly from the cache, much faster than from memory (see Fig. 17a). This goes on until the cache gets full (which eventually happens, since it is much smaller than the main memory). Afterwards, when even more data is accessed, something must be evicted from the cache to make room for the new data.

Caches are beneficial when data access exhibits a lot of "reuse," or locality in computer terminology. Empirically it has been noted that most programs indeed have a high degree of locality and therefore benefit greatly from caches. However, some applications (of which media processing is an important class) exhibit little reuse, so they cannot really benefit from caches. For example, viewing a movie on the computer displays each frame and then never uses that data again.
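The benefit is usually summarized by the average memory access time: hit time + miss rate x miss penalty. With assumed, purely illustrative latencies, the sensitivity to locality is easy to see:

    # Average memory access time for a single cache level.
    hit_time, miss_penalty = 2, 500       # processor cycles (assumed)
    for miss_rate in (0.02, 0.10, 0.50):  # high locality ... low locality
        print(miss_rate, "->", hit_time + miss_rate * miss_penalty)
    # 0.02 -> 12 cycles; 0.10 -> 52; 0.50 -> 252:
    # with little locality, there is little benefit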

The importance of caches has grown steadily with time. As memory size grows and the speed disparity widens, the trend is to add yet another layer of cache between processor and memory, the "outer" layers (i.e., those closer to memory in Fig. 17b) being larger, cheaper, and slower. When miniaturization progresses enough, system integration moves the caches onto the same integrated circuit with the processor and inserts new cache layers outside. Modern workstation processors have three layers of cache, the first two residing on the processor die. Access times to main memory can be several hundred processor cycles, so performance in the absence of caches would be abysmal.

Creating more and larger caches is not the only evolutionary trend: newer caches are also substantially more sophisticated than the original ones. For example, modern caches can continue to serve requests even after some data has not been found (because it did not reside in the cache). The cache then communicates simultaneously with the processor and with the memory, maintaining queues of pending requests from either side and handling them in any order.

A complication arises when we build computers having multiple microprocessors, as is increasingly customary. Each processor will have its own caches, which can lead to confusion when one processor intends to change data: the changes should not be confined to the copy of the data in the local cache, but propagated to the remote processors as well, and all this should be done without incurring too much overhead. This is called cache coherence and is the main reason why computer systems with many processors are very hard to build. The commonly used cache coherence protocols simply cannot accommodate too many participants without incurring prohibitive performance penalties.

Caching is a technique used very frequently in computer system design, at all levels. For example, there are caches between the main memory and the even larger, slower, and cheaper disks. With the emergence of the Internet, caches are also used to maintain local copies of data brought from the network. There are many other examples of caching in use today.


Figure 17: (a) Caches offer the same operations as memory: reading and writing of data. However, they are faster and smaller in capacity, for two reasons: their small size allows electrical signals to traverse them faster, and they are kept small because they are more expensive per storage unit. (b) Modern systems feature multiple layers of cache; the ones closer to the processor are faster and smaller.

Conclusion Modern superscalar processors use all of the above techniques simultaneously to increase performance. They contain multiported register files (which allow multiple instructions to access registers at once), level 1 and level 2 caches for instructions and data, issue logic (which dynamically computes dependencies between instructions), forwarding logic (used for bypassing results between instructions simultaneously in the pipeline), branch prediction logic (which monitors the execution of the program and tries to guess the direction of upcoming branches, in order to pre-fetch the destination instructions), and many other structures that consume most of the area and energy while only supporting the computation.

For example, newer processors can execute multiple operations in parallel, while the old ones had only one arithmetic unit and could execute at most one instruction at a time. Most resources in a new processor are devoted to the cache; the complete removal of the cache would not change the functionality of the processor in any way, but it would adversely impact its execution speed.

Describing the way the processor computes was a relatively straightforward task for the 4004. However, it is an extremely complicated process for the Pentium 4. While the 4004 would basically carry out the five-step process described in Section 3.2.2, the Pentium 4 simultaneously reads multiple instructions, checks whether they depend on each other, tries to obtain the operands of all of them (some may need to be obtained from the cache, some may be in memory, some may be in registers, while others may be in the middle of the pipeline), dispatches them to multiple functional units whenever their inputs have arrived, checks the completion of each instruction, and launches other instructions even if the previous ones have not yet completed. In all, the Pentium 4 can have more than 100 instructions "in flight" at any one time.

3.4 Problems Faced by Computer Architecture

Despite the amazing progress of computer system performance, the horizon is darker for computer architects. The "easy" methods for performance enhancement have been exploited almost to their limits, and we are now left facing very hard problems. In this section we discuss briefly some of these problems. The last half of this chapter is devoted to describing a CAEN-based solution in detail.

We believe that change in computer architecture has historically been wrought by reversals of balance in the overall system structure.


For example, we have seen in Section 3.3.4 that memory performance grows more slowly than processor performance. At the point where the two performance curves intersected and processors overtook memories in speed, a change of balance occurred. This change warranted the introduction of caches. We will call such points crossover points.

Moore's law has been equated with a guaranteed stream of good news, bringing higher speeds and more hardware resources with each new hardware generation. However, if we extrapolate the technology graphs using the historical growth rate, we rapidly approach some important crossover points which spell trouble for conventional architectures.

3.4.1 Billion-Gate Devices: Complexity Becomes Unmanageable

According to Moore's law, the feature size of CMOS integrated circuits decreases exponentially with time: every three years designers have four times more transistors available in the same area. In addition to the miniaturization of the basic components, the historic trend has been toward a steady increase in silicon-die size. Current microprocessors already have 30 million transistors and, in a few years, will reach 1 billion. Nanotechnology promises to push this number even higher, to $10^{12}$ gates per square centimeter [CWB+99, GB01]. The wealth of resources has enabled designers to create more and more complex circuits, harnessing extra circuitry to exploit the parallelism in programs and to overcome the latency bottlenecks. The downside of such huge sizes is the enormous complexity of designing, testing, verifying, and debugging such circuits. Most processors today are shipped with bugs [Col].

3.4.2 Low Reliability and Yield

When feature sizes become on the order of a few molecules, minor imperfections in the manufacturing process, random thermal fluctuations, or even cosmic rays can invalidate a circuit temporarily or permanently. Although some modern circuits contain error detection and correction capabilities (see Section 4), these solutions are still not universally applied to all parts of processor design.

3.4.3 Skyrocketing Cost of CMOS Manufacturing

The cost of manufacturing an integrated circuit can be separated into two components: the cost of the manufacturing plant and the non-recurring engineering (NRE) cost, which is incurred on a per-design basis. The NRE cost includes testing and verification costs.

Moore's "second law" describes the exponential increase of the cost of a manufacturing plant with time. A state-of-the-art installation now costs more than 3 billion dollars. This cost comes mostly from the very precise mechanical and optical lithographic devices and masks. Another problem compounding cost is yield, as devices with very small features are more prone to defects.

3.4.4 Diminishing Returns in the Exploitation of ILP

Only a small fraction of the area of modern microprocessors is dedicated to actual computational units. Upon close scrutiny, the only parts of the microprocessor which do actual computation are the functional units; on the Pentium 4, for example, these make up less than 10% of the entire chip surface. All other structures on the chip only store and move information; they do no actual computation.

There is a tension in the design of the processor pipeline: on one hand, designers add enough resources to support program regions that can use them (with high ILP, unpredictable branches, etc.); on the other hand, most of the time these resources sit idle. For example, 4-wide issue processors can seldom sustain even 1.5 instructions per cycle [HP96].

Processors cannot exploit ILP beyond the limited issue window9 [CG01]. However, the complexity of superscalar processors increases quickly with the size of the issue window,

9The issue window is the set of consecutive instructions considered by the processor for parallel execution. A processor will not issue instructions outside of the issue window, even if they are independent.


with some hardware structures growing quadratically.

3.4.5 Power Consumption

The power consumed by a CPU cannot be dissipated thermally in an efficient way. Despite the fact that the supply voltage decreases with each chip generation, total power consumption is skyrocketing. Modern CPUs have reached more than 100 W, exacerbating the difficulty of cooling [Aza00] and of supplying power (especially for mobile devices). More than half of the power is consumed just by the clock signal, which spans the whole chip surface [Man95]. Power density, which directly impacts cooling efficiency, also grows with each generation.

3.4.6 Global Signals

The time between two clock ticks is no longer enough for a signal to cross the whole chip. A major problem engendered by the quickly increasing clock rate is that the electromagnetic signal does not have enough time to cross many levels of logic. Deeper pipelining is a partial solution, but it provides diminishing returns due to the overhead of the pipeline registers (the time for the information to cross the synchronizing latches starts to dominate the computation time) [KS86].

Moreover, many structures on a modern CPU require global signals or very long wires (e.g., the forwarding paths in the pipeline, the multi-ported register files, the instruction wake-up logic). Wires connecting different modules tend to remain relatively constant in absolute size from one architectural generation to the next [HMH01, AHKB00], so they will be unable to function at higher clock rates.

4 Reliability in Computer System Architecture

4.1 Introduction

In this section, we discuss computer system architecture from the standpoint of reliability. Our treatment will be neither comprehensive nor formal; we mostly use examples to show how reliability considerations influence the design of computer systems.

The main topic of reliability theory is the construction of reliable systems out of unreliable components. If the correct behavior of a system as a whole relies on the correct operation of all its components, the reliability of the system decreases exponentially with the number of components. Without special measures designed to increase the reliability of the ensemble, no complex system would ever work.
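This exponential decay is easy to quantify: if all $n$ components must work and each works independently with probability $r$, the system works with probability $r^n$. The numbers below are assumed, purely for illustration:

    # Reliability of a series system: all n components must work.
    r = 0.999999                 # per-component reliability (assumed)
    for n in (10**3, 10**6, 10**8):
        print(n, "components ->", r**n)
    # 1e3 -> ~0.999, 1e6 -> ~0.37, 1e8 -> ~4e-44:
    # without special measures, large systems are hopeless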

Computers are extremely complex systems: a modern microprocessor contains more than 30 million transistors, while a 16-megabyte memory has more than 100 million. Computers themselves are aggregated into even more complex constructions, such as computer networks. The Internet comprises more than 125 million computers.

The main tool in the design of complex systems is abstraction: high-level functionality is composed of low-level behaviors. The emerging high-level functionality is regarded as atomic for the construction of yet higher layers. This process is analogous to the use of theorems and lemmas in mathematical proofs: once a theorem is proven, it can be used wherever its conditions are satisfied.

Here we look at the interplay between layering and reliability. Layers can be designed which:

• enhance the reliability, offering the appearance of a reliable mechanism constructed of less reliable components. The main technique used for this purpose is redundancy in computation or in storage of information. This approach is most frequently used in computer system design. Section 4.3 is devoted to examples of this type.

• expose the unreliability to the upper layers, allowing them to address the imperfections. The advantage of such a scheme is that it may be less expensive to implement. The higher layers have more knowledge about the environment and can then implement only as much reliability as is required for dealing with the errors likely to occur.


The architecture of the Internet is based on this paradigm, as described in Section 4.5.

• partition a system into (mostly) independent parts, which operate in isolation from one another. The effect of partitioning is the isolation of faults (fault containment) in parts of the system, leaving the parts unaffected by the fault in working order. This paradigm is used in the design of operating systems and computer clusters.

The manufacturing of very large scale integrated (VLSI) circuits has achieved extraordinary reliability for the basic computer components. This enables computer engineers to treat most layers of a computer as perfect and to concentrate on synthesizing high-level functionality, without regard to reliability. Special classes of applications which require extremely high reliability (e.g., air-traffic control systems, nuclear power plant control) force system architects to introduce additional fault-tolerance-providing layers. In general, increasing reliability comes at a cost. Such cost should be weighed against the benefit provided by the more reliable computer system.

The use of electronic nanotechnology as a basic component of computer systems will have a huge impact on their design methodologies. One of the factors that will dominantly alter computer system design is the intrinsic unreliability of chemically assembled components. Nanodevices will provide a building block with totally different reliability properties than the ones traditionally used by computer engineers. This is due to the uncontrolled nature of self-assembly,10 along with the minute size of the nanodevices, which makes them very sensitive to thermodynamic fluctuations and high-energy particles. In this chapter, we argue that alternative means of increasing system reliability will be required for electronic nanotechnology.

4.2 Reliability

This section introduces some basic terminology.

10Relative to optical lithography, which is the technology used today for the construction of VLSI integrated circuits.

4.2.1 Definition

While this discussion is informal, we begin with a precise definition of the notion of reliability. We quote from [VSS87]:

The term "reliability" has a dual meaning in modern technical usage. In the broad sense, it refers to a wide range of issues relating to the design of large engineering systems that are required to work well for specified periods of time. It often includes descriptors such as quality and dependability and is interpreted as a qualitative measure of how a system matches the specifications and expectations of a user. In a narrower sense, the reliability is a measure denoting the probability of the operational success of an item under consideration. [...]

Definition The reliability, $R(t)$, of an item (a component or a system) is defined as the probability that, when operating under stated environmental conditions, it will perform its intended function adequately in the specified interval of time $[0, t]$.

Note the probabilistic nature of the definition of reliability. We should realize that there are no systems with perfect ($R(t) = 1$) reliability. A more appropriate notion is "good enough" systems, whose reliability is satisfactory for the intended usage and under specified budgetary constraints. We will see various methods that can be used to boost reliability. However, these methods are not free, and the systems built using them are reliable but not perfect. Ideally, a system should be no more reliable than needed, balancing cost with benefits.

4.2.2 Redundancy

The key ingredient for building reliable systems is redundancy. We can distinguish between two types of redundancy, spatial and temporal (see Fig. 18).

Spatial redundancy uses more devices than strictly necessary to implement certain functionality.


Figure 18: Spatial (left) and temporal (right) redundancy. Spatial redundancy replicates the computing element S, while temporal redundancy repeats the computation.

The additional devices are used to carry out redundant computations whose results are cross-checked. If the reliability of each part is above a minimum threshold, the overall system will be more reliable than any of its parts. Using even larger amounts of redundancy allows a system not only to detect failures, but also to mask them.

Temporal redundancy involves the repeated use of a device for the same computation, followed by a comparison of all the produced results.

4.2.3 Transient and Permanent Faults

According to their duration, we classify the faults of a system into two broad categories:

Transient faults occur when the system temporarily malfunctions but is not permanently damaged. In today's computer systems, the vast majority of faults are transient in nature [SS92].

Permanent faults happen at some point and never go away; included in this category are manufacturing defects. We explicitly allow the use of partially functional components (i.e., components having defects which affect only parts of their functionality) for building systems.

Temporal redundancy can only be used to cope with transient faults. Strong spatial redundancy, which enables error correction (masking), can be used to tolerate the effects of permanent faults. We can expect the number of both permanent and transient faults to increase in future systems, due to the miniaturization of the basic electronic components. Small components use reduced electric charges to represent logic values and are therefore more sensitive to thermodynamic fluctuations, alpha-particle radiation from the environment or even the chip package, and high-energy cosmic gamma rays.

4.2.4 Reliability Cost and Balanced Systems

When designing a complex system it is important to balance the reliability of the various parts. For example, if both memory and processor must be fault-free for a system to be functional, it is wasteful to use a memory with a much higher reliability than the processor: the system will break down frequently from processor failures, so the increased memory reliability does not contribute substantially to the reliability of the system.

A more complete picture should factor in the system maintenance cost: what is the cost of the system downtime required for replacing the failed parts? Using an overly reliable part increases system costs; using parts with high failure rates incurs excessive maintenance costs.


The context of use and the environment of the system dictate how reliable it should be. For example, in safety-critical applications the cost of downtime is very large, so it makes sense to invest in extremely reliable components.

4.3 Increasing Reliability

In this section, we illustrate several techniques used for building reliable computer systems. Systems that recognize the inevitability of faults, and are designed to cope with them, are called fault tolerant.

4.3.1 Voting and Triple Modular Redundancy

A simple but expensive approach to fault tolerance is the complete replication of the computing system. Duplication and comparison of the results of the two copies permit the discovery of faults that occur in just one of the computing elements.

Triplication together with a majority vote can be used to recover the correct result in the presence of a faulty component. Triple modular redundancy (TMR) was the first fault-tolerance scheme; it was introduced in 1956 by John von Neumann [vN56]. It is sometimes used in computing systems for which maintenance is impossible, such as on-board computers for space probes. If each component has a sufficiently high reliability (above 50%), the result of the majority vote is more reliable than each individual component. The voting subsystem, which selects the result computed by a majority of the components, can itself be a single point of failure; schemes have been devised which use replicated voters.

Voting can be naturally generalized to using $n$ identical computations in parallel and a majority vote. Another generalization, which is more robust against permanent faults, disregards the votes of components after they produce a wrong result.
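Quantitatively, if the three replicas fail independently, each working with probability $r$, and the voter itself is assumed perfect, the majority is correct with probability $r^3 + 3r^2(1-r)$:

    # Reliability of triple modular redundancy with a perfect voter.
    def tmr(r):
        # correct when all three replicas work, or exactly two do
        return r**3 + 3 * r**2 * (1 - r)

    for r in (0.6, 0.9, 0.99):
        print(r, "->", round(tmr(r), 6))
    # 0.6 -> 0.648, 0.9 -> 0.972, 0.99 -> 0.999702; the vote beats a
    # single component whenever r > 0.5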

4.3.2 Error-Detecting and Error-CorrectingCodes

Duplication can be used to detect wrong results; triple modular redundancy can recover from one faulty computational element; and voting methods permit operation after multiple faults. The cost of these schemes is considerable: duplication requires a complete replication of all critical computational elements, while TMR and voting have an effective resource utilization of less than 33%.

It is worthwhile to explore whether we can achieve the same degree of robustness using fewer resources. The pioneering work of Claude Shannon [Sha48] and Richard Hamming [Ham50] in the 1940s has shown that we can indeed do better.

Let us consider the duplication method of fault tolerance. To make matters concrete, we start with the premise that we want to store a piece of data in a safe way. We assume that the data is encoded in binary and that a data word has $k$ bits. If we just store two identical copies of the data (i.e., $2k$ total bits), what kind of robustness do we obtain? Let us assume that one bit is damaged, being flipped by noise. In this case, the values of the bit in the two copies will differ, so we will detect the fault. However, we have no way to discern which of the two values is the correct one. Duplication allows us to perform error detection, but not error correction. Moreover, some two-bit failures can pass undetected by this system, if they occur in the same bit position in both copies.

Intuitively, this happens because the bits in the representation are rather fragile: each stored bit encodes information coming from exactly one original bit. We can build more robust encoding schemes if we "mix" the input bits and use each stored bit to encode multiple input bits. In this case we may be able to detect or correct errors in a stored bit by extracting the lost information from other stored bits.

To obtain error resilience we need to add redundancy to the encoding. This is accomplished by encoding the $k$-bit data using $n$ bits, where $n > k$. The larger $n$ is, the more resilient the code can be. We call the $n$-bit encodings "code words." Note that not all $n$-bit words are code words.

There are many coding methods, each with different properties, suitable under different assumptions about the noise in the environment, the independence of the errors, and the complexity of the encoding and decoding algorithms.


The interested reader can learn more about this topic in [LC83].

In general, error-detecting methods are used when the information can be recovered from other sources. For example, in data transmission networks, the sent data is encoded using only error-detecting codes, because if an error occurs the information can be resent by the source. When there is only one copy of the data, it must be protected with error-correcting codes. Such codes are used, for example, in computer memories and disk storage systems.

Main Memory If you have ever needed to upgrade the memory of a PC, you have probably faced the dilemma of which kind of memory to buy. Three main types of memory chips are sold today: unprotected, parity-checked, and error-correcting code (ECC) memory. These types of memory differ greatly in reliability.

Unprotected memory uses one bit of storage for each bit of information. It offers no protection whatsoever against transient or permanent faults; it relies entirely on the reliability of the underlying hardware. However, main memory capacity has grown exponentially over time, and the amount of storage in a computer today makes faults a much more probable event now than 10 years ago.

Parity-checked memory uses a very simple method to detect one-bit errors.11 It works by storing, for each eight bits of data, a code word of nine bits. The ninth bit is set such that each code word has an even number of "1" bits, hence the term "parity." Whenever data is read from memory, the hardware automatically checks the parity. When the parity is not even, a fault has been detected and an exception is triggered, which the software handles appropriately. A common course of action is to terminate the program using that piece of data (because there is no way the data can be recovered) and to mark that particular memory region as faulty, preventing further usage.

11This method will also detect any involving an odd numberof bits.

region as faulty, which will prevent furtherusage. Parity checking is simple and can bedone quickly, so it does not slow down thememory operation.
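The parity computation itself is trivial; the sketch below (illustrative names, not a real memory controller) stores one parity bit per byte and detects, but cannot correct, a single-bit fault.

    def add_parity(byte):
        # byte: integer 0..255; the ninth bit makes the count of "1"s even.
        parity = bin(byte).count("1") % 2
        return byte, parity

    def check_parity(byte, parity):
        return (bin(byte).count("1") + parity) % 2 == 0

    byte, p = add_parity(0b10110010)
    assert check_parity(byte, p)       # stored data is consistent
    byte ^= 0b00000100                 # a single-bit fault
    print(check_parity(byte, p))       # False: detected, but not correctable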

ECC memory uses a more sophisticated encoding scheme that allows automatic correction of any 1-bit error occurring in a 64-bit word. For this purpose, the memory encodes 64 bits of data using a 72-bit code word. The overhead of the scheme is therefore the same as for parity-checked memory (9/8 = 72/64), but it is more robust: parity can only detect a single fault in every 8 bits, while ECC can actually correct a single fault and detect 2 faults in every 64 bits. On each memory access the read data is brought to the closest code word. The code word is next decoded to obtain the original information. These operations are substantially more complex than a simple parity check. Therefore, ECC memory incurs a slight performance penalty compared to the other two types of memory.
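The actual (72,64) code is too large for a sketch, but the classic Hamming(7,4) code below shows the same principle on 4 data bits: recomputing the parity checks yields a "syndrome" that points directly at the flipped bit, which can then be repaired.

    def hamming74_encode(d):                     # d = [d1, d2, d3, d4]
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7

    def hamming74_correct(c):
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]           # checks positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]           # checks positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]           # checks positions 4,5,6,7
        syndrome = s1 + 2 * s2 + 4 * s3          # 0 means "no error"
        if syndrome:
            c[syndrome - 1] ^= 1                 # flip the faulty bit back
        return [c[2], c[4], c[5], c[6]]          # recover d1..d4

    code = hamming74_encode([1, 0, 1, 1])
    code[4] ^= 1                                 # a single-bit fault
    print(hamming74_correct(code))               # [1, 0, 1, 1]: corrected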

Disks The most common support for persistent information storage is the disk. While disks come in many types (floppy disks, hard disks, removable disks, optical disks, compact disks, etc.), the facts presented in this section apply to most of them.

Disks use two different spatial redundancy techniques simultaneously. These are necessary due to the high storage capacity and to the harsh environmental conditions under which disks operate: unlike solid-state devices, disks feature moving parts, which are far more likely to exhibit failures. A close approximation of a disk system is a turntable, which has a rotating disk and a head that can choose a certain track on the disk through lateral "seek" movements.

Information on disks is stored in units called sectors. The size of a sector depends on the disk but is relatively large (compared to the storage unit for memories, which is a 72-bit word for PCs), on the order of a kilobyte. Data inside a sector cannot be changed piecemeal: if even a single bit is modified, the sector has to be rewritten entirely, because mechanical imprecision makes it impossible to position the read-write head precisely enough for a single bit.


Figure 19: Redundant encoding of information on a disk: both error-detecting codes and spare sectors are used. (Sector layout: servo, ID, sync, data, CRC.)

In addition to the stored data, a sector contains extra information to compensate for the imprecision of the mechanical parts and the high unreliability of the medium (see Fig. 19). For example, each sector has some servo information, which is used by software to check the head alignment and to synchronize the data transfer with sector boundaries. Traditionally each sector has an associated sector identifier, the usage of which will be described later. The data is stored following the servo and ID information, and is followed by an error-detecting code.12 The codes used for disks are usually from a class called cyclic redundancy checks (CRCs).13 Sectors are separated from one another by gaps, which give the head some freedom in setting the sector boundaries when a sector is rewritten. On rewrite, a sector may not start in the exact same spot as before, due to imprecision in positioning.

Some events can damage sectors: for example, cutting the power of a computer while it is still operating can cause a crash of the heads on the disk surface, which causes physical damage to the magnetic layer used to store information.

12 More precisely, the data plus the following information together form a code word of an error-detecting code.

13 A CRC summarizes the data using a checksum. The term "cyclic" comes from the property that the same checksum is obtained if the input data is permuted cyclically.

If this happens then, with very high probability, the checksum of the data residing on the damaged sector will not match.
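A sketch of the idea (Python's zlib CRC-32 stands in for the disk's dedicated CRC polynomial):

    import zlib

    def write_sector(data):
        return data, zlib.crc32(data)       # store the data with its checksum

    def read_sector(data, stored_crc):
        if zlib.crc32(data) != stored_crc:
            raise IOError("sector damaged: CRC mismatch")
        return data

    data, crc = write_sector(b"1 KB of user data ...")
    read_sector(data, crc)                  # an intact sector reads fine
    try:
        read_sector(b"1 KB of user dXta ...", crc)   # bits damaged on disk
    except IOError as e:
        print(e)                            # the fault is detected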

To keep disks operational even after damage occurs, and also to allow some slack in manufacturing quality, disks use a second type of spatial redundancy in the form of spare sectors. Regularly spaced on the disk surface are spare sectors that are unused during normal operation. When a sector is damaged, it is marked and replaced with a spare. The sector ID of the spare indicates which original it represents. Upon manufacture, disks are tested for surface defects, and for each disk a defect map is created and stored within the disk system (either on the disk or in a nonvolatile memory). The defect map is used when formatting the disk to replace the defective sectors with spares. The defect map grows during the disk's lifetime, as newly damaged sectors are added.
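The bookkeeping is simple; the following sketch (illustrative names and structure, not actual disk firmware) redirects accesses from damaged sectors to spares through a defect map.

    class Disk:
        def __init__(self, n_sectors, n_spares):
            self.sectors = {}                   # sector id -> data
            self.spares = list(range(n_sectors, n_sectors + n_spares))
            self.defect_map = {}                # damaged id -> spare id

        def _resolve(self, sector):
            return self.defect_map.get(sector, sector)

        def mark_defective(self, sector):
            # The defect map grows during the disk's lifetime.
            self.defect_map[sector] = self.spares.pop(0)

        def write(self, sector, data):
            self.sectors[self._resolve(sector)] = data

        def read(self, sector):
            return self.sectors[self._resolve(sector)]

    disk = Disk(n_sectors=100, n_spares=4)
    disk.mark_defective(17)               # e.g., found when formatting
    disk.write(17, b"payload")            # transparently lands on a spare
    print(disk.read(17))                  # b'payload'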

4.4 Formal Verification

Formal verification is an umbrella name for a variety of sophisticated techniques used for certifying the correctness of systems, mostly hardware but increasingly software as well. Formal verification is used for finding bugs, and in this sense it can be seen as a reliability-enhancing technique.

The crux of formal verification methods is specifying precisely the behavior of the system's components (in terms of mathematical formulas) and formally checking the properties of the overall system. These properties are precisely expressed using some form of mathematical logic and are proven by a sequence of derivations.

One particular technology which we believe will play a very important role in future computer systems is translation validation [PSS98]. Using this technique, a compiler can produce proofs that the result of the compilation process faithfully implements the source program. Because modern computer architectures are extremely complex, they rely more and more on compilers. Translation validation can therefore be used to certify an essential building block of computer system architecture.


4.5 Exposing Unreliability to Higher Layers

The first part of this section was dedicated to a discussion of techniques used to enhance the reliability exposed to the upper architectural layers. We noted that hardware is much more reliable than the software layers because it is virtually fault-free.

In this section we discuss the opposite approach, used in two very successful systems. These systems start from the assumption that the underlying layers are essentially unreliable and that unreliability must be exposed to the upper layers.

This seemingly paradoxical design can be defended by several arguments:

• The cost of providing reliability may simply be too high. Moreover, the cost grows with the number of abstraction layers. The overhead may be unacceptable for some classes of applications.

• The upper layers may use their own fault-tolerance scheme. The designer faces a trade-off between the probability of failure and the effectiveness of a fault-tolerance scheme. Each reliability scheme we present works under certain assumptions about the environment. For instance, the memory parity scheme in Section 4.3.2 stores one bit of parity for each byte. This scheme works well as long as there are not multiple errors within the same byte (i.e., the error probability of a bit is relatively low). A lower layer may forgo providing a reliable appearance for the layer above if the latter uses an error-correction scheme powerful enough to withstand the combined error probability.

• In some cases, the upper layers need only a simple form of reliability, and not a perfect underlying substrate. For example, some interactive applications, such as telephony, would rather have the data delivered on time than perfectly: a missing piece of data generates a glitch at the receiver, which is often tolerable to the human ear. By contrast, the late arrival of the complete data is of little value for creating the illusion of an interactive conversation.

4.5.1 Internet Protocol

The Internet is a computer network, designed initially to link military computer networks. One of the design goals of the Internet was to make it robust enough to withstand substantial damage without service loss. The Internet has evolved into a highly successful commercial network spanning all continents, with over 125 million active computers.

The Internet was created almost a century after the deployment of the telephone network, so one would expect the Internet to build upon the principles developed during the construction of its predecessor. However, particularly with respect to reliability, nothing could be further from the truth: the Internet architecture is almost the complete opposite of the telephone network.

In order to contrast the two networks, we first discuss some details of the architecture of the telephone network.

Telephone Network Architecture The telephone network was designed for exceptional reliability. The downtime of a telephone switching exchange must be under three minutes per year. Only under truly exceptional circumstances is an already-established conversation interrupted by a network failure. The phone network lets a conversation proceed only when it has secured all the resources required to transmit the voice signals between the parties in a timely manner. Strict standards dictate the length of time allocated for the connection negotiation phase, which establishes an end-to-end circuit through all the intermediary telephone exchanges. The failure to secure all necessary resources is signaled to the end user through a busy tone. Sophisticated capacity planning, based on detailed statistics about phone user behavior, is used to size the carrying and switching devices, which results in a very low probability of a busy tone due to insufficient network resources.

A crucial factor guaranteeing the quality of the phone connection is the allocation of all necessary resources before the call is connected. These resources are preallocated and reserved for the whole duration of the connection [Kes97]. Based on the number dialed, the path connecting the source and destination is calculated using carefully precomputed routing tables. Each exchange negotiates with the next one on the path using a sophisticated signaling protocol carried on special circuits, allocating bandwidth and switching capacity for the new call. After all point-to-point connections are successfully established, a ring tone is generated. Once the conversation begins, the voice signal is sampled and digitized at the first telephone exchange. Each bit of data has preallocated time slots in each of the trunks it will traverse. The data bits traverse all trunks in the order they were generated and arrive at the destination at the right time to enable the recreation of the analog voice signal. For each incoming bit, switching tables maintained in each exchange indicate the outgoing circuit and the precise time slot where the bit should be steered. Once a connection has been established, there is a very high probability that the data entering the network will emerge at the other end in the right order and with the same delay for each bit. On disconnection, the signaling protocol tears down the resources allocated during the call.

The Internet Architecture One of the aspects where the Internet philosophy fundamentally differs from the telephone network is reliability. Not only is there no guarantee on the latency of transit through the network, there is no guarantee that the data will not be lost or corrupted during transit. The end users of the Internet get a very weak service definition, which could be stated as: "You put data in the network tagged with some destination address. The network will try to deliver your data there."

The way data travels in the Internet is completely different from the telephone network: data is divided into packets that are injected into the network in the order they are generated. Each packet may travel a different route to the other end point. Some packets may be lost, some may be duplicated, and the receiver may receive the packets in the wrong order and perhaps even fragmented into smaller packets [Kes97].

The packets are moved by the Internet protocol (IP). IP works roughly as follows (see Fig. 20): when an intermediate computer (a router) receives a data packet, it reads the destination address contained within. Next, the router makes an informed guess about the direction the packet should travel (i.e., the directly connected neighbor closest to the final destination). The packet is then forwarded to that neighbor.

When a router receives data faster than it can send it on the output links, it buffers the packets in its internal memory for later processing. When the internal memory becomes full, the router simply starts to discard packets. Such dropped packets constitute the major source of performance degradation in the Internet.
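A sketch of this forwarding behavior, modeled on router X's table in Fig. 20 (the table contents and buffer size here are illustrative):

    from collections import deque

    class Router:
        def __init__(self, table, default_link, buffer_size):
            self.table = table                  # destination -> output link
            self.default_link = default_link    # "everything else"
            self.buffer = deque()
            self.buffer_size = buffer_size

        def receive(self, packet):
            if len(self.buffer) >= self.buffer_size:
                return "dropped"                # internal memory is full
            self.buffer.append(packet)
            return "queued"

        def forward(self):
            packet = self.buffer.popleft()
            link = self.table.get(packet["dst"], self.default_link)
            return "packet for %s sent on link %d" % (packet["dst"], link)

    x = Router(table={"A": 1, "B": 2}, default_link=3, buffer_size=2)
    for dst in ["A", "B", "C"]:
        print(x.receive({"dst": dst}))    # the third packet is dropped
    print(x.forward())                    # packet for A sent on link 1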

The structure of the Internet changes from day to day and from hour to hour, as new computers are connected, new users dial up, accidents disrupt the connectivity of some networks, and new communication links are installed or upgraded. Routers constantly exchange information about their local view of the network: each machine tells its neighbors what it currently knows about the topology. This constant exchange distributes the network topology information to all participating routers. Given this infrastructure, it is amazing that the Internet works at all and that information flows through it without corruption. How is this possible?

The answer comes from two key aspects of the Internet architecture: the core network is redundant and self-healing, while the upper layers are designed to tolerate information loss.

The core network achieves reliability using the following ingredients:

• Rich, uniform, and redundant core: The Internet "backbone" is not a simple linear structure; commonly, between any pair of nodes there are multiple disjoint communication paths. Moreover, there is no single point of failure whose destruction can disconnect the network. An important factor is that all nodes in the network are more or less functionally equivalent: there is no privileged node on whose health the whole system depends.


Figure 20: Left: In the Internet there are end nodes (which initiate the communication) and routers (which only move data). Right: Routers use routing tables to decide which interface to use for each destination address. This example shows router X's table.

• Statelessness: The routers in the Internet do not store information about the data traffic flowing through them. (In contrast, a telephone exchange stores much information about all phone connections that traverse it.) A router receives a packet, computes the best output link, and forwards the packet. The internal state of the router after forwarding is the same as before the arrival of the packet. The lack of stored state enables the network to withstand the crash of routers (i.e., a crash does not cause loss of information which cannot be recovered through other means).

• Self-testing and healing: The routing protocol constantly exchanges information about the topology of the network. In the event of network outages, this protocol quickly propagates information about backup routes, allowing communication to resume.

User applications use a robust communication protocol that makes use of temporal redundancy to tolerate errors in the underlying network. The transmission control protocol (TCP) is a communication protocol executed only by the end nodes participating in the communication. This protocol ensures that all packets are delivered in the order they were sent, without gaps or duplicates. TCP achieves this feat by using the following ingredients (sketched in code after the list):

• numbering the sent packets

• acknowledging packet reception

• using timers to detect packets which have received no acknowledgment for a long time

• retransmitting packets deemed lost in the network

Because acknowledgments are themselves sent as ordinary packets, they may be lost too. Lost acknowledgments cause the source, which retransmits packets that were not acknowledged, to inject duplicate packets into the network.
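The sketch below is not real TCP, just a minimal stop-and-wait protocol showing the four ingredients (the lossy function models a network that may drop both packets and acknowledgments):

    import random

    def lossy(x, loss_rate=0.3):
        # Models the unreliable network: returns x, or None if it was lost.
        return None if random.random() < loss_rate else x

    def send_reliably(data_items):
        delivered = []                               # the receiver's state
        for seq, data in enumerate(data_items):      # 1. number the packets
            acked = False
            while not acked:                         # 3. timer "expires" ->
                packet = lossy((seq, data))          # 4. (re)transmit
                if packet is not None and packet[0] == len(delivered):
                    delivered.append(packet[1])      # new, in-order data
                # A duplicate (packet[0] < len(delivered)) caused by a lost
                # acknowledgment is recognized by its number and re-acked.
                ack = lossy(seq) if packet is not None else None
                acked = (ack == seq)                 # 2. acknowledgment
        return delivered

    print(send_reliably(["a", "b", "c"]))            # ['a', 'b', 'c'], in order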

The whole Internet is built on the unreliable IP core: not only are data and acknowledgment packets sent unreliably, but the control communication between routers (describing network topology) and the network maintenance traffic also use the same unreliable forwarding mechanism.

Despite its apparently weak structure, the Internet is a formidable competitor to other specialized media distribution networks: radio, television, and telephone. The cost of voice transmission (telephone-style) over the Internet (dubbed "voice over IP" or "VoIP") is much lower than in telephone networks. Many major phone carriers are already investing heavily in using VoIP in parts of their networks.

Why is the Internet more successful despite its unreliability?

The answer lies, at least in part, in the fact that the Internet does not offer a costly service (e.g., fault-free, time-sensitive communication primitives). Instead, it offers a cheap but useful service (e.g., best-effort delivery of bits). The telephone network goes to great lengths to provide a reliable voice service but uses an extremely inflexible, specialized system and wastes a huge amount of resources. Basically, the reliability of the phone network comes at a high cost.

The Internet moves the fault-tolerance problem to a higher layer, from IP to TCP. TCP provides fault tolerance that is perfectly adequate for many applications. TCP is executed only by the end hosts and not by the routers that relay the information. The complicated processing entailed by TCP places no burden on the core network, which scales to global sizes.

Applications that do not need the reliable data delivery offered by TCP can forgo using it. For instance, Internet radio protocols use strong error-correcting codes and no retransmissions. Lost or misplaced packets are simply ignored. This is acceptable because the human end user tolerates the low reliability of the signal.

4.5.2 Teramac

In this section we briefly present Teramac, a computer system developed by researchers at Hewlett-Packard, which exhibits another unconventional approach to the problem of system reliability. This computer is built out of defective parts: more than 70% of the circuits composing it have some kind of fault. Despite this, the system works flawlessly. Many papers have described Teramac. One article which underlines the connections between this architecture and nanotechnology-based computing is [HKSW98].

It must be noted that Teramac deals only with permanent faults. Electronic nanotechnology circuits are very likely to exhibit a large number of such faults.

Teramac is built out of reconfigurable hardware devices, described extensively in Section 5. The reconfigurable hardware feature exploited by Teramac is the fact that the basic computational elements are essentially interchangeable.

Teramac implements circuits on a reconfigurable substrate. Much as disks use spare sectors to replace faulty ones (see Section 4.3.2), Teramac uses spare computational elements to replace the faulty ones. In Section 5, we describe the mechanism for accomplishing this, after we present the main features of reconfigurable hardware and describe how such devices are programmed.

One of the main contributions of the Teramac project is showing that unreliability in hardware can be exposed to, and addressed by, higher software layers without incurring a large overhead. This is a complete paradigm shift in computer system design, which will very likely have many applications in the future.

4.6 Reliability and Electronic-Nanotechnology Computer Systems

4.6.1 Summary of Reliability Considerations

In this chapter we surveyed many ways in which reliability considerations impact the construction of computer systems.

Computer systems are built from a series of abstract layers, which offer increasingly powerful abstractions. Each layer has different reliability properties and uses different reliability-enhancing techniques. In general, the view exposed by the hardware to the software is that of a practically perfect, fault-free substrate.

We have seen that reliability can be enhanced through the use of redundancy: spatial redundancy uses more hardware components to carry out supplementary computations, while temporal redundancy repeatedly computes results to ensure their robustness against transient faults.

While temporal redundancy is expected to reduce the overall system performance (expressed in number of computations per unit time), even spatial redundancy comes at a price: for example, we have seen that using powerful error-correcting codes for memories can cause performance degradation due to the complexity of the additional computation of the code.

One important message of this discussion is that reliability is a desirable feature, but it comes at a certain cost, which must be taken into consideration by the designer. The designer faces a trade-off between the cost of the reliability of each layer and the fault tolerance that upper layers can provide by themselves. The deep layering translates into a multiplicative effect of the overheads.

4.6.2 A New Computer System Architecture

The design of future computer systems based on CAEN will very likely be fundamentally constrained by the dramatically different reliability properties of the new medium. We can expect both very high permanent defect rates, from the nondeterministic manufacturing process, and a high rate of transient faults at the lowest layers, due to thermodynamic fluctuations and the minute size of the currents carrying information in CAEN circuits. (Note that the latter issue, transient errors, is not unique to CAEN, but will be common in any technology whose individual features measure a handful of nanometers.)

The use of CAEN devices will require a complete rethinking of the basic architecture of a computer system. In this section we sketch the features of a plausible proposal. We speculate that by diminishing the cost paid for reliability we can reduce dramatically, even by orders of magnitude, the cost of complete computing systems. We advocate:

• removing the illusion of perfect reliability from the hardware layer, exposing the imperfections of hardware to the software, and dealing with the imperfections on an as-needed basis

• using reconfigurable hardware circuits (see Section 5) as the basic building block, with configuration and reliability handled completely by software programs

• using defect mapping and spares for dealing with manufacturing defects

• using computation in encoded spaces (i.e., computing on data encoded using error-correcting codes) for transient and permanent fault tolerance

• using formal verification techniques, most notably translation validation, to ensure that compilers correctly translate programs into executables or configurations

• offloading most of the computation from the microprocessor and instead using a reconfigurable hardware substrate on which application-specific circuits are synthesized by compilers. Compilers will generate hardware configurations matching the application's needs and reliability requirements.


5 Reconfigurable Hardware

In Section 3 we briefly presented the main paradigms exploited in the architecture of modern computer systems. We have seen that fast-evolving technology constantly changes the balance between the computer system components, which in turn requires constant redesign and tuning. We have concluded that computer architects face several difficult problems, engendered by enormous design complexity, increasing power consumption, and the limitations of the materials and manufacturing processes.

In this section we describe a computational paradigm which has accumulated momentum during the last decade and may ultimately solve some of these problems. This paradigm is embodied in reconfigurable hardware (RH) circuits [CH99]. RH is an intermediate model of computation, with features between special-purpose hardware circuits and microprocessors. Its computational performance is close to hardware, yet its flexibility approaches that of a processor, because RH is programmable.

5.1 RH Circuits

As described in Section 2.2, ordinary digital circuits are composed of simple computational elements called logic gates, connected by wires. The logic gates are themselves assembled from transistors. Each logic gate performs computations on one-bit values.

RH devices have a similar structure. However, the logic gates of an RH device do not have a priori fixed functionality, and the wires do not connect specific pairs of gates but are laid out as a general network. Each gate is configurable, which means that it can be programmed to perform a particular one-bit computation. Likewise, the connectivity of the wires can be changed by programming switches that lie at the intersections of the wires.

Each logic gate and each switch has a small associated memory that is used to store the configuration. By loading different data into each memory, the gate functionalities and wire connections are changed (see Fig. 21a). Because the configuration can be changed repeatedly, these devices are called "reconfigurable." A second interconnection network is used to send configuration information to each programmable element.

RH circuits ordinarily have a regular structure and are composed of a simple tile replicated many times along the horizontal and vertical directions. This symmetry has an important role in some RH applications.

RH is as powerful as standard hardware, because any circuit that can be built in regular hardware can also be synthesized using RH. The price RH pays for its flexibility is relative inefficiency: the programmable gates and switches, and the associated configuration memories, are much larger than the custom gates and wires in a normal circuit.

Figure 22: Relative area occupied by the components of a reconfigurable hardware device (configuration RAM, wires and switches, computational gates). The actual computational devices take just a small fraction of the chip (around 1%).

The limiting resources in RH are not the gates but the wires, because it is not known beforehand which connections will be made. The chip must have enough wires for the worst case [DeH99]. Any particular implemented circuit ordinarily uses just a fraction of the available wires. To handle the peak demand, RH circuits are oversupplied with wires, which ultimately take most of the silicon resources: normally 90% of the chip area is devoted solely to wires and their configurable switches. Fig. 22 illustrates the on-chip area dedicated to the wiring, configuration memory, and actual computational devices on a typical RH chip.

As a consequence, a circuit implemented in RH is larger (in area) than the corresponding custom hardware circuit. Because RH integrated circuits have approximately the same absolute size limitations as ordinary digital circuits, large circuits do not fit into one RH chip. However, today's RH devices have capacities on the order of a few million gates, more than advanced microprocessors had a decade ago.

Besides the density handicap, RH devices also have a speed disadvantage when compared to custom hardware: electrical signals in custom hardware travel on simple wires between two gates; in RH, signals frequently have to cross multiple switches from a source to a destination. The switches increase the power consumption and decrease the propagation speed of the signal.

Figure 21: (a) Reconfigurable hardware contains programmable logic gates and an interconnection network. (b) The connections between the logic gates can be set up by configuring the switches in the interconnection network. Each switch is controlled by a one-bit memory. (c) Programmable logic gates can be implemented as look-up tables using small memories. In this example, the contents of the memory indicate that the gate performs the AND computation.

The overwhelming majority of digital circuits are synchronous (i.e., they use a clock signal to coordinate the action of the various computational units spread over the chip surface). This is also true of RH devices, although the peak clock frequency they can typically use effectively is approximately one-fifth to one-tenth of a microprocessor's clock. The main reason for this disparity is that the electrical signals must cross several switches on almost any path.

5.2 Circuit Structures

The implementation of RH circuits described above is just one possible method of implementing programmable logic. Here we describe three other approaches that could be used. The simplest of the methods is to use a single monolithic memory: a memory with 2^n locations (i.e., n address lines) of k-bit words can implement any k functions of n inputs. In Fig. 21c, a memory of four locations of one-bit words is shown; it implements one function of two inputs. If we increase the word width, we can add an additional truth table. If we increase the number of words, we can increase the number of inputs. One of the primary reasons RH devices, like field-programmable gate arrays (FPGAs), are composed of many small memories (instead of one large memory) is the exponential growth of the size of the memory in terms of the number of inputs to the function.

Figure 23: A schematic of a PLA which can implement a subset of all logical formulas of three inputs. The single line into each AND gate represents six wires coming from the three inputs and their complements. The AND gate to the left of the figure shows all six input lines going into one of the AND gates. The single line into each OR gate represents four wires, one from each of the AND gates. The OR gate to the right shows all four input lines for one of the outputs.
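A look-up-table gate is easy to model in software; in the sketch below, "configuring" the gate just means loading different memory contents (the AND configuration matches Fig. 21c):

    def make_lut(truth_table):
        # truth_table: list of 2**n output bits, indexed by the n input bits.
        def gate(*inputs):
            address = 0
            for bit in inputs:            # the inputs form the memory address
                address = (address << 1) | bit
            return truth_table[address]
        return gate

    AND = make_lut([0, 0, 0, 1])          # the configuration shown in Fig. 21c
    XOR = make_lut([0, 1, 1, 0])          # reconfigured: same "hardware", new gate
    print(AND(1, 1), XOR(1, 1))           # 1 0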

Figure 24: Example programming of a PLA to implement the sum output of a full adder. This example uses only one of the two output functions.

Historically, programmable logic devices were implemented more like monolithic memories than FPGAs. To avoid the scaling problems of a single memory, another common implementation approach is to use a programmable logic array (PLA). A PLA directly implements two-level logic functions. A logical function can be written in many different but logically equivalent ways. One of the most common is to represent a logical function in either a sum-of-products (SOP) form or a product-of-sums (POS) form. A sum-of-products form of a function is constructed by summing a set of product terms. Each product term is the AND of some of the inputs or their complements. Each sum term is the OR of some of the product terms. For example, the SOP representation of the sum output of the full adder in Fig. 3 is sum = A'B'C + A'BC' + AB'C' + ABC. The idea behind a PLA is to implement SOP directly, by having a programmable product plane that feeds into a programmable sum plane.
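This SOP form can be checked exhaustively against the full adder's sum output, which is the XOR of the three inputs (lowercase variable names here are illustrative):

    from itertools import product

    for a, b, c in product([0, 1], repeat=3):
        sop = ((1 - a) & (1 - b) & c) | ((1 - a) & b & (1 - c)) \
            | (a & (1 - b) & (1 - c)) | (a & b & c)
        assert sop == a ^ b ^ c            # the two forms agree on all inputs
    print("SOP form matches a XOR b XOR c")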

Fig. 23 shows an example PLA which can compute two different functions (one on each of the horizontal lines) of four product terms (computed on the vertical lines). The central figure is a common form of abbreviation for a PLA: it avoids drawing all the lines that enter each AND and OR gate. On either side we show a single example of the AND and OR gates in their unprogrammed form. Programming a PLA involves removing some of the connections into the AND or OR gates. The PLA shown in the figure can implement only a subset of all possible three-input functions. For only three inputs, the PLA is considerably more expensive to implement than a look-up table (in this case, an eight-word memory of two bits per word). However, PLAs can quite efficiently implement functions of 16 inputs, whereas such a memory would be highly inefficient.

Fig. 24 shows how the sum output of the full adder might be implemented in a PLA. We represent each wire that should remain connected to the AND or OR gate with a star at the intersection of the signal and the wire going into the gate.

While PLAs are more efficient than single memories, they still have significant overhead to support the programming of the AND and OR planes. Programmable array logic (PAL) was introduced to reduce this overhead. In a PAL, the inputs to the OR plane are fixed, so that only the AND plane can be programmed. This restricts the outputs to sums of a fixed set of product terms. This decrease in flexibility is offset by a large reduction in the area of the device.

5.3 Using RH

The RH industry had revenues of 2.6 billion dollars in 1999, and its growth is accelerating. The same trends that affect the architecture of computer systems are shaping the use of RH today. We are witnessing an important change in the role relegated to RH devices in computing systems.

5.3.1 Glue Logic

RH was originally developed as a flexible substitute for the "glue logic" circuits connecting the custom hardware building blocks in a digital system. RH was used for fast prototyping: new circuits could be designed, tested, and deployed without the need for a manufacturing cycle. Even today, glue logic is the most important segment of the RH market. The prototypical RH device of this kind is called a field-programmable gate array (FPGA). The architecture of FPGAs is profoundly influenced by their usage: because they are used mostly for implementing "random" control logic, FPGAs favor one-bit logic elements. This can be contrasted with microprocessors, which are designed to process numeric data and therefore feature a wide datapath processing wide data values as atomic units.


5.3.2 Computational RH Devices

Feature shrinkage enables the manufacturing of RH devices with more and more computational elements. When large enough, they can accommodate computational tasks usually relegated to microprocessors. Intense research has been conducted on this subject over the last ten years, leading to many proposals for integrating RH devices into computing systems [Har01]. RH devices have an enormous potential as computational elements. They can successfully complement, and even supplant, the microprocessor as the main computational core of a computing system [BMBG02]. In the remainder of this text we will address RH devices only as computational devices.

5.3.3 Computational Applications for RH

RH as a computational device has drawn attention due to some impressive early results [Hau98]. For some application areas, RH still holds performance records, occasionally even when compared to custom hardware solutions. Even when they do not hold the performance crown, RH devices are substantially cheaper to manufacture and program than other computing systems with large peak performance, which feature either custom hardware or a large number of processors.

In Section 5.5 we investigate the sources of the RH advantages. Here we list some of the successes of RH computational engines.

Computational applications exhibiting extreme performance on RH include: text search and match, including DNA string search; encryption and decryption; digital signal processing, such as filtering; and still and video image processing. These are all "streaming"-type applications, in which the same operation is applied repeatedly to vast quantities of data. Such applications exhibit substantial amounts of data parallelism (i.e., many pieces of data can be processed simultaneously). The speed-ups obtained using RH systems (compared to conventional computing systems) range from one to several orders of magnitude.

Recently, problems with more complex structures have also been substantially accelerated, such as graph searches and satisfiability of Boolean formulas.14

Performance increases of lesser magnitude have been reported even for less regular applications.

5.4 Designing RH Circuits

The tool chains used today for configuring RH devices reflect their inheritance from the random-logic domain: the RH programming tools are mostly borrowed from hardware design tools. These tools are substantially different in nature from regular software design tools because they have different optimization criteria: hardware design tools tend to have very long compile times, extremely complex optimization phases, and very stringent testing and verification constraints. Much of the effort of the hardware design tools is directed toward addressing the constraints of the physical world: examples include laying out wires with few crossings and working with the actual geometry of the devices and with electrical constraints (e.g., electrical wire resistance).

In contrast, software tools are more interactive: making small changes in software programs and rebuilding the whole executable in a matter of seconds is customary. Unlike hardware, software programs are frequently upgraded in the field.

The major steps in programming RH devices follow quite closely the design of digital circuits (see Fig. 25): the desired functionality is described using a program written in a hardware description language (HDL). Using this representation, the circuit can be simulated on computers for testing and debugging. Next, the program is processed by a series of sophisticated computer-aided design (CAD) tools, which first optimize the circuit, then place the logic gates on the two-dimensional surface, and finally route the wires connecting the gates (see also Fig. 26). In the case of RH devices, placement associates each logic gate in the circuit with one of the programmable logic gates of the RH device, while routing selects wire segments and configures the switches to connect the gates linked in the implemented circuit.

14 Satisfiability is the problem of deciding whether a given Boolean formula can be made "true" by some assignment to its variables. Many important practical problems can be reduced to satisfiability computations.

Figure 25: Compiling for RH devices is a complex process involving several complex optimization phases (HDL source, register-transfer level, netlist, technology mapping, optimization, floorplanning, placement, timing optimization, routing, and configuration generation). Floorplanning, placement, and routing must take into consideration the constraints of the chip geometry.

RH CAD tools output a stream of bits which constitutes the configuration of each gate and switch. At run-time this configuration is loaded onto the chip. After the configuration is loaded, the RH device starts behaving like the circuit whose functionality it implements. Using RH is therefore a two-stage process: loading the configuration and running the device. Because of the large number of gates and switches, the configuration of a chip tends to be very large. As a consequence, configuration loading is quite lengthy. Evaluating the performance of an RH device must always factor in this overhead, which is not present in either processors or custom hardware.

To shorten the configuration time, novel designs incorporate special features. As an example, some RH devices have local memories which can be used to cache several frequently used configurations [LCH00]. In this case, switching from one configuration to another can be achieved somewhat faster. Other designs can switch quickly between configurations; these are called "multicontext FPGAs" [DeH96]. Some other proposals allow only parts of the chip to be configured while other parts compute, allowing computation and configuration to overlap [GSM+99].

5.5 RH Advantages

In this section we discuss some of the advantages of RH devices over other hardware or software solutions. We also analyze the features of RH which enable it to exhibit very high computational performance. We defer a discussion of the reliability of RH to Section 5.7.

Compared to custom hardware solutions, RH-based implementations have a shorter design and fabrication time and substantially lower costs. The fabrication time and cost advantages result from the reuse of the same integrated circuit: there is no need to physically manufacture new chips. Development and debugging are expedited, as the discovery of bugs results in configuration patches, which are generated by a chain of software tools and require no physical manufacturing.

The principal advantage of RH over pure software solutions is increased execution speed, at least for some types of applications. In examining these applications, we realize that they exploit four main RH qualities:

• Parallelism: These applications feature a large amount of intrinsic parallelism (i.e., many operations can be performed simultaneously). RH can exploit a virtually unbounded amount of parallelism. A microprocessor is outfitted by its designers with a certain number of functional units, each of which can produce at most one fresh result in each clock cycle. The computational bandwidth of a processor is therefore bounded at the time of manufacture. Moreover, a processor can very rarely reach its peak performance, because the parallelism available in the program rarely has the exact same profile as the available units. This is discussed in Section 3.4.4.

In contrast, by using RH one can synthesize a virtually unbounded number of functional units.15 For example, if the program contains 10 independent additions, one can build a configuration featuring 10 parallel adders. Not only can we build highly parallel computational engines, we can do so after we know the application requirements, a luxury that a microarchitect cannot afford.

Two main types of parallelism are successfully exploited by RH devices: parallelism between independent operations (multiple data items can be processed simultaneously) and pipeline parallelism, described in Sections 3.3.2 and 3.3.3. As an example, consider a program processing a digital movie, for instance by adjusting the brightness and then thresholding (i.e., displaying in white everything above a certain intensity and the rest in black). In each movie frame the brightness of all pixels can be adjusted in parallel: this is parallelism between independent operations. Moreover, while one frame has its brightness adjusted, the previous frame can be thresholded by a different set of processing units: this is a form of pipeline parallelism, sketched below.

15 Due to Moore's law, resource constraints in RH become less important with each new technological generation and will cease to constitute a real obstacle in the near future.
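A sketch of that pipeline (toy frames and parameter values; in RH the two stages would be separate hardware operating concurrently, which Python generators only model):

    def adjust_brightness(frames, delta):
        for frame in frames:                       # stage 1
            yield [min(255, max(0, p + delta)) for p in frame]

    def threshold(frames, level):
        for frame in frames:                       # stage 2, fed by stage 1
            yield [255 if p > level else 0 for p in frame]

    movie = [[10, 200, 130], [90, 40, 250]]        # two toy "frames"
    pipeline = threshold(adjust_brightness(movie, -20), level=100)
    for frame in pipeline:
        print(frame)                               # [0, 255, 255] then [0, 0, 255]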

• No instruction issue: Unlike a microprocessor, which has to fetch new instructions continuously, the functionality of RH is hardwired within the circuit. The RH unit does not need to expend resources to fetch instructions from memory and decode them.

Dispensing with instruction issue has several benefits. First, the problem of control dependencies16 is eliminated or substantially reduced. Second, precious memory bandwidth is conserved because there is no need to bring instructions from memory (the memory is connected to the processor by a narrow bus, on which data and instructions contend for transmission).

• Unlimited internal bandwidth: A subtle difference between a processor and an RH device is in the way they handle intermediate computation results. Processors have an internal array of scratch-pad storage elements called registers. Intermediate computation results are stored in the registers, where they can be quickly accessed. When a "final" result is produced, it is stored in memory. Like the number of functional units, the number of registers is hardwired at manufacturing time. If the number of temporary values exceeds the number of registers, they must be spilled into memory, which is much slower to access. A scarce resource in processors is the number of ports into the register file. Data is read from or written to a register through a port; therefore, the number of ports determines the number of simultaneous read/write requests that can be honored by a register file. In other words, the bandwidth in and out of the register file is essentially bounded by a hard limit: the number of ports.

16 We say that instruction B is control dependent on A if A decides whether B is executed or not. For example, branch instructions decide the next instruction to execute; the targets of a branch are thus control dependent on the branch. Not knowing which way execution goes prevents the timely issuing of instructions, penalizing execution performance.


In contrast, in RH devices the datapath is custom-synthesized for each application. For instance, there is no need to use a single register file for all intermediate results. Because the data usage of the program is known, registers can be connected exactly to their producers and consumers, and nothing else. Moreover, an arbitrary number of registers can be created in the configuration, because their space overhead is quite small. In RH, the potentially unbounded register bandwidth accommodates the parallelism in the application under consideration, while the unbounded number of registers prevents costly memory spills.

• Out-of-order execution: While superscalar processors allow instructions to execute in orders different from the one indicated by the program, the opportunity to do so is actually quite limited:

– First, only instructions within a limited "issue window" can be reordered.

– To ensure correct execution, most superscalar processors constrain the instructions to also complete execution in program order.

In RH implementations neither of these constraints has to be obeyed: because the circuit is synthesized to match the application's dataflow, each operation can execute as soon as its inputs are available, without a fixed issue window or an in-order completion requirement.

Besides these major advantages of RH fabrics, there are three additional advantages which boost RH performance, although state-of-the-art compilation tools do not exploit them as effectively:

• Custom widths: The word size of a microprocessor is essentially fixed at manufacturing time. When computations on narrower or wider data values are desired, programmers resort to word-size computations, which basically waste most of the result. In contrast, RH devices allow complete flexibility in the width of computational units [BSWG00]. However, as we discuss in Section 5.6, there is a tension between the efficiency of configuration size and logic density on one side and the resource waste due to short data words on the other. For this reason, narrow computational units may not always be best in terms of performance.

• Custom operations: The functional units inside a microprocessor are restricted to a small set of general-purpose operations: simple integer arithmetic and logic, and bit manipulations. As media processing algorithms were standardized, processors started to incorporate special instructions designed especially to accelerate these programs [PWW97]. For example, virtually all modern processors have special instructions to aid the encoding and decoding of digital video sequences. However, new applications and algorithms arise continuously and require special-purpose operations. In RH one can easily synthesize efficient implementations of operations which, when expressed in classic ISAs, require many instructions. As an example, consider an operation that permutes the bits of a word in a fixed order. In a normal ISA this requires at least three instructions per bit, for extracting, shifting, and inserting the bit. However, in an RH with bit-level operations the permutation can be implemented just by wire connections, without any computation (see the sketch below).
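The software cost is easy to see in a sketch (an 8-bit reversal is used as the fixed permutation; in RH the same effect is just wiring):

    def permute(word, perm):
        # perm[i] = source bit position for output bit i (a fixed "wiring").
        result = 0
        for i, src in enumerate(perm):
            bit = (word >> src) & 1        # extract
            result |= bit << i             # shift + insert
        return result

    reverse = [7, 6, 5, 4, 3, 2, 1, 0]     # e.g., reverse the bits of a byte
    print(format(permute(0b11010010, reverse), "08b"))   # 01001011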

• Specialization: This is applicable when some of the data processed by the algorithm changes slowly. As an example, consider an encryption algorithm with a given key: once the key is known, the algorithm will process a lot of data. The algorithm can therefore be specialized for a key [TG99]: once the key is known, an encryption circuit built especially for that particular key can be generated, which has the potential to operate much faster than a general-purpose encryption circuit. However, such specialization must generally be done at run-time, when the data is known. This requires executing some of the configuration-generation process during program execution, which can be costly and may offset the advantages of specialization.
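A toy sketch of the software analogue, partial evaluation (the "cipher" here is a plain XOR substitution, not a real encryption algorithm): the key-dependent work is done once, and the returned function performs only the fast per-byte path. An RH version would instead synthesize a circuit hard-wired for the key.

    def make_encryptor(key):
        # Key-dependent precomputation, done once per key.
        folded = 0
        for k in key:
            folded ^= k
        table = bytes((b ^ folded) for b in range(256))

        def encrypt(data):
            return bytes(table[b] for b in data)   # fast per-byte path only
        return encrypt

    encrypt = make_encryptor(b"secret key")        # the specialization step
    print(encrypt(b"hello").hex())                 # reused for many data blocks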


5.6 RH Disadvantages

In this section we discuss the disadvantages of RH devices as compared to microprocessors. Some of the disadvantages are intrinsic to the nature of RH devices: for example, the need for the electrical signal to cross multiple switches on practically all point-to-point connections is a consequence of the generality of the routing fabric. Other disadvantages are "inherited" from the traditional usage of RH devices as glue logic circuits, for example, the cumbersome and slow programming methodology. These latter disadvantages may be overcome by new research results.

We list the disadvantages in order, beginning with the ones most likely to be easily overcome and ending with the ones that seem fundamental.

• Bit granularity: The traditional usage of FPGAs for implementing random control logic has economically favored architectures oriented toward one-bit computations. Synthesizing wide arithmetic units from one-bit elements results in slow and large functional units (e.g., a custom eight-bit adder is much more compact than eight one-bit adders hooked together). This fact reflects a trade-off between flexibility and performance. If RH is to be used to accelerate general-purpose computation, most likely the balance has to be shifted more toward word-oriented devices [GSM+99, Har01].

Closely related to the granularity of the computational elements is the flexibility of the routing resources. When RH is used for computation, the full generality of a fine-grained interconnect is unnecessary, since one often needs to switch only word-sized groups of data. The flexibility of the fine-grained interconnect increases overhead without increasing usefulness. Thus, to increase the efficiency of RH for general-purpose computation, the interconnect should be made less general.

• Special programming languages: The lineage of RH from random logic is most strongly reflected in the similarity of the tools used for programming both. Most notably, hardware description languages explicitly describe the activities that should be carried out in parallel. In contrast, software languages have a more sequential nature, describing the program as a linear sequence of actions.

However, if RH is to migrate more toward handling computation, the development environment used to program RH devices must also metamorphose. An active topic of current research is how to efficiently bridge the gap between software languages and programming tools on one side and the parallelism available in RH devices on the other.

• Long compilation time: The hardware-design tools used for programming RH devices tend to use a lot of time (hours and even days) in order to achieve high-quality results. This trade-off must be tipped the opposite way (i.e., lower quality but fast compilation) to bring compilation time in line with the expectations of software developers.

• Limited computational resources: The number of computational elements in an RH device is determined at the time of manufacture. Implementing programs that require more than the available amount of resources is simply not feasible using current technology. In contrast, processors store program instructions in memory, which is cheap and large, and whose capacity grows quickly with new generations. Moreover, practical mechanisms for virtualizing the main memory have been used for over three decades. Such mechanisms store the data on much larger disks and use the memory essentially as a large cache for this data. Using virtualization, even programs requiring more memory than is physically available can be executed, provided they have good locality. Nanotechnology promises to solve, or at least greatly ameliorate, the problem of limited computational resources: using molecular devices, several billion computational elements can be assembled in a square centimeter, many more than are currently used by today's most sophisticated circuits.


• Configuration size and configuration loading time: As we described previously, the size of the configuration describing the circuit functionality is substantial. Ratios of 100 bits of configuration for describing a 1-bit computation are not unheard of. Large configurations require large storage resources on-chip and, most importantly, require a long time for downloading the configuration from an external medium onto the chip.

Configuration size and time can be substantially reduced by increasing the computational granularity. For example, when using eight-bit computational units, instead of describing the computation of each one-bit functional unit separately, eight of them share the same configuration.

The impact of configuration loading time can be alleviated by implementing chips that allow incremental reconfiguration: only as much of the chip should be reconfigured as is necessary for implementing the desired functionality. Another practical solution is to completely separate the electrical interfaces for configuration and execution. For such chips, while some part of the chip is being configured, the rest can be used for active computations [GSM+99, GSB+00].

• Low density and speed: The flexibility of RH is offset most seriously by two factors: the reduced number of computational units that can be implemented per unit area, and the slowdown incurred by the electrical signal crossing multiple switching points. A logic gate in a processor or custom hardware device is one order of magnitude more compact than a programmable gate in an RH device. Moreover, substantial space is wasted on the communication and configuration interconnection networks.

Some CAEN proposals (e.g., [GB01]) solve the density problem by using computational elements and switches which can hold their own configuration, without the need for supplemental memory. The size overhead of such devices becomes insignificant. The speed overhead, however, is much more difficult to overcome—no effective solutions have yet been proposed.

Interestingly enough, the low speed of the electrical signals can be construed as an asset of RH devices in the following sense: RH programmers have always had to work with the reality of slow communication links and have had to factor the cost of communication into the balance of their systems. For classical hardware architects, wires were until recently perceived as being essentially free. However, the amazing advances in clock speeds have reached a limit where the propagation delay of wires is no longer insignificant [AHKB00]. Modern designs must accommodate the reality of multicycle latencies for propagating information on the same chip. Classical microarchitectural designs tend to be monolithic, and therefore break down when the latencies of the communication links become significant. Such architectures will require a major redesign to accommodate multiple unsynchronized clock signals (clock domains). RH devices have always faced this fundamental problem, which classical hardware has only postponed.

• No ISA: As described in Section 3.2, a processor is characterized by its instruction set architecture, that is, the set of operations it can carry out. The ISA is a contract between the hardware and the software world. The same ISA can be implemented in various ways, but the programs using it continue to work unchanged. This contract is extremely important for decomposing the architecture into clean layers.

To date, no concept equivalent to an ISA has been proposed for RH devices. The lack of an ISA also complicates the task of compiling high-level languages to such architectures: the ISA is a very powerful conceptual handle in the description of programs.

5.7 RH and Fault Tolerance

A subject where the qualities of RH devices differ considerably from either custom hardware or processors is reliability. The "deep" flexibility of RH


devices enables the construction of computer systems which can reconfigure "on the fly" to increase reliability or to manage emerging defects. In this section we discuss how the qualities of RH fabrics can be exploited to build robust systems.

5.7.1 Permanent Defect Tolerance

RH devices have a unique quality, which enables boosting the yield (see Section 2.5) almost for free: such chips can be used even when they have a substantial number of defects.

In an RH device practically all programmable gates are equivalent, because each can be configured to implement any functionality. Hence, if some portions of the chip are defective, their functionality can be assumed by other equivalent portions, by simply rerouting the connecting wires and changing the configuration of a few gates (see Fig. 26). Given a defect map of a chip, a compiler can use it during the place-and-route process to avoid defective zones and implement perfectly functional circuits using faulty RH. The success of this approach has been demonstrated by the Teramac project [HKSW98] (see below).
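
A toy sketch of the idea (not the actual Teramac tool chain): given a hypothetical defect map, logical gates are assigned only to working physical cells.

    # Toy defect-aware placement: logical gates are assigned only to
    # non-defective cells of a gate array. Grid size, defect map, and
    # netlist are all hypothetical.
    import itertools

    ROWS, COLS = 4, 4
    defects = {(0, 1), (2, 2)}                  # assumed defect map
    usable = [c for c in itertools.product(range(ROWS), range(COLS))
              if c not in defects]

    logical_gates = ["and0", "or0", "xor0", "reg0"]
    placement = dict(zip(logical_gates, usable))   # greedy: first free cells

    for gate, cell in placement.items():
        print(f"{gate} -> physical cell {cell}")
    assert not set(placement.values()) & defects, "placed onto a defect!"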

5.7.2 Self-Testing

Once the defective parts of a chip are known, they can be circumvented. This can occur, for instance, by using defect-aware place-and-route tools or by shifting parts of the configuration based on the regularity of the chip geometry.

We have not yet explained how defect positions are discovered. For defect discovery we need a testing mechanism which can circumscribe precisely the defective regions. This task is not a simple one, because there are few ways of probing a chip.

Testing would be trivial if we could test the individual components. In general, the testing strategy should satisfy the following constraints [MG02]:

• it should not require access to the individual components

• it should scale slowly with the number of defects

• it should scale with fabric size, so that testing does not become a bottleneck in the manufacturing process

The solution is to use the programmability of the chip to implement self-testing computations. These are simple computations that can be applied selectively to small parts of the chip. For example, by testing rows and columns one can precisely assign defects to the intersections of the rows and columns that give wrong results.
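
The following sketch illustrates the row/column idea under simplifying assumptions (a single test per line, and defects that always cause a failure): defects must lie at intersections of failing rows and failing columns.

    # Simplified row/column self-test: a line test fails iff it covers a
    # defect, so defects are confined to the intersections of failing
    # rows and failing columns. With multiple defects the intersection
    # set is a superset that further tests would have to refine.
    import itertools

    N = 8
    true_defects = {(1, 5), (6, 2)}          # unknown to the tester

    def test_line(cells):
        return all(c not in true_defects for c in cells)

    bad_rows = [r for r in range(N)
                if not test_line([(r, c) for c in range(N)])]
    bad_cols = [c for c in range(N)
                if not test_line([(r, c) for r in range(N)])]

    suspects = set(itertools.product(bad_rows, bad_cols))
    print("suspected defect sites:", sorted(suspects))
    assert true_defects <= suspects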

RH can be designed to ease self-testing by using built-in self-testing (BIST). Complex architectures have been devised which can carry out complex self-tests [SKCA96, SWHA98, WT97] without an external entity controlling reconfiguration.

An important advantage of BIST is that it allows the testing procedure to be carried out on many parts of the chip simultaneously, reducing the time needed to build the defect map. This is especially important for very large chips, with billions of components, where a sequential testing process would last an inordinate amount of time.

5.7.3 Teramac

The Teramac project [HKSW98] has demonstrated how effectively RH devices can provide fault tolerance. Teramac is built from RH devices—many of them faulty—interconnected by a very rich network. Although more than 70% of the RH devices contain some sort of defect, Teramac performs flawlessly.

The key idea behind making Teramac work is that reconfigurability allows one to find the defects and then to avoid them. Before Teramac can be used, it is first configured for self-diagnosis. The result of the diagnosis phase is a map of all the defects. Then, one implements a particular circuit by configuring around the defects.

Place-and-route tools for Teramac have been modified to take the defect map into consideration.



Figure 26: Place and route can avoid chip defects.


For the end-user the presence of defects is practically invisible: self-testing and construction of the defect map, together with automated defect-aware place-and-route, constitute a highly effective fault-tolerance method.

In some sense, Teramac introduces a new manufacturing paradigm, one which trades off complexity at manufacturing time against postfabrication programming. The reduction in manufacturing-time complexity makes RH fabrics a particularly attractive architecture for CAEN-based circuits, since directed self-assembly will most easily result in highly regular, homogeneous structures.

5.7.4 Test While Run

Postmanufacturing testing can be extended to allow for continuous testing of the device, even during normal operation. Such devices devote most of the chip area to the main computational task, while a fraction is used for self-testing, in the way described above.

Execution is periodically interrupted and the testing circuitry swaps places with some of the computation; wires are rerouted to connect the displaced circuits correctly, and then computation resumes [ASH+99]. When self-testing discovers faulty areas, these are marked as invalid and never used again.

This procedure allows extreme robustness against defects, even when they appear during system operation. As long as there is enough space to accommodate the computation and the testing, the chip can function even if new defects appear. The cost of this scheme is the area devoted to testing and the time spent on reconfiguration when new chip tiles begin self-testing.

5.7.5 Dynamically Adjustable Fault-Tolerance

The configurability of RH enables it to dynamically change the technique applied for increasing redundancy [SKG00]: for example, when an increasing number of faults is detected, the chip can be reconfigured to carry out a higher number of redundant computations. If transient faults are predominant, tests can be carried out using temporal redundancy. When noncritical computations are being carried out, no redundancy need be used at all.
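
As a hedged illustration of why the redundancy level might be adjusted at run time (this calculation is not from [SKG00]), the sketch below picks the smallest degree of modular redundancy whose majority vote meets a target reliability for an observed per-copy fault probability.

    # Smallest odd number of redundant copies whose majority vote meets
    # a reliability target, for a given per-copy fault probability p.
    from math import comb

    def vote_ok(n, p):
        """Probability that a strict majority of n copies is correct."""
        k = n // 2 + 1
        return sum(comb(n, i) * (1 - p) ** i * p ** (n - i)
                   for i in range(k, n + 1))

    def copies_needed(p, target=0.999999, max_n=15):
        for n in range(1, max_n + 1, 2):     # odd n gives a clean majority
            if vote_ok(n, p) >= target:
                return n
        return None

    for p in (1e-6, 1e-4, 1e-2):
        print(f"fault probability {p:g}: {copies_needed(p)} copies suffice")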

6 Molecular Circuit Elements

We turn now to an examination of the basic devices, wires, and fabrication methods available for constructing molecular electronic computing devices. Our primary focus is on chemically assembled electronic nanotechnology (CAEN).

CAEN takes advantage of chemical synthesis techniques to construct molecular-sized circuit elements, such as resistors, transistors, and diodes. Chemical synthesis can inexpensively produce enormous quantities (moles) of identical devices, or it can be used to grow devices in situ. The fabricated devices are only a few nanometers in size and exploit quantum mechanical properties to control voltage and current levels across the terminals of the device. These molecules are smaller than the physical feature-size limit that can be produced using silicon. Significantly, the operation of these devices requires only tiny amounts of current, which should result in computing devices with low power consumption.

CAEN devices will be small—a single switch for a random access memory (RAM) cell will be approximately 100 square nanometers,¹⁷ compared with more than 10⁵ nm² for a single laid-out transistor.¹⁸

To construct a simple gate or memory cell requires several transistors, separate P- and N-wells, etc., resulting in a density difference of approximately one million between CAEN devices and CMOS. Since very few electrons are required to switch these small devices, they also use less power.

¹⁷ We assume that the nanoscale wires are on 10 nm centers. This is not overly optimistic—single-walled nanotubes have diameters of 1.2 nm, and silicon nanowires of less than 2 nm have already been constructed.

¹⁸ Even in silicon-on-insulator, where no wells are needed, in a 70 nm process a transistor with a 4:1 ratio and no wires attached to the source, drain, or gate measures 210 nm × 280 nm. With just a minimal-sized wire attached to each, it measures 350 nm × 350 nm.
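
The arithmetic behind these density figures can be checked with a short script. The 350 nm × 350 nm transistor footprint comes from footnote 18; the per-cell transistor count is an assumption for illustration (wells and wiring, which push the gap toward the quoted 10⁶, are ignored).

    # Back-of-the-envelope density comparison.
    ram_switch_area = 100           # nm^2, molecular RAM switch (see text)
    transistor_area = 350 * 350     # nm^2, minimally wired transistor (fn. 18)

    print(f"per-device ratio: {transistor_area / ram_switch_area:,.0f}x")

    transistors_per_cell = 6        # assumed, e.g., a CMOS SRAM cell
    cmos_cell_area = transistors_per_cell * transistor_area
    print(f"cell-level ratio: {cmos_cell_area / ram_switch_area:,.0f}x")
    # Wells and wiring overhead, not modeled here, widen the gap further.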


In this text we largely limit our discussion to molecular devices that have I-V¹⁹ characteristics similar to those of their bulk counterparts, even when a different mechanism is used to achieve the effect. For example, while the mechanism of rectification in a silicon-based P-N junction diode is different from that in a molecular diode, they both have similar I-V curves [AR74]. There are two reasons why we choose to examine systems that can be built from nanoscale devices with bulk-semiconductor analogs: (1) we can apply our experience with standard circuits, and (2) we can model the system with standard tools such as SPICE [QPN+02].

Molecular-scale devices alone cannot overcome the constraints that will prove limiting for CMOS unless a suitable, low-cost manufacturing technique is devised to compose them into circuits. Since the devices are created separately from the circuit assembly process, localization and connection of the devices using a lithographic-like process will be very difficult. We must therefore seek other means. Connecting the nanoscale components one at a time is prohibitively expensive due to their small size and the large number of components to be connected. Instead, economic viability requires that circuits be created and connected through self-assembly and self-alignment.

6.1 Devices

While molecular devices of all types are covered in other sections of this encyclopedia (Sections X and Y), here we examine their effect on the kinds of computing devices that can be built. We examine devices that have direct analogs to bulk devices (e.g., diodes, resistors, transistors, resonant tunneling diodes), as well as devices that have no direct counterpart (e.g., molecular switches). While this is not an exhaustive list of molecular devices, it serves to illustrate the important properties of molecular-scale devices.

¹⁹ Devices are often characterized by the amount of current (I) they conduct based on the voltage (V) across their terminals. An I-V curve is a graph of the current versus the voltage.

Figure 27: Characteristic curve of a molecular negative differential resistor.

6.1.1 Diodes

A diode is the simplest nonlinear device. It conducts current in only one direction. In 1974, Aviram and Ratner published the first paper on a molecular device. They described a theory which predicted that molecules could behave like diodes. The theory is that molecular diode rectification occurs due to through-bond tunneling between the molecular orbitals of a single D-σ-A molecule. In such a system, electrons flow easily from the donor (D) toward the acceptor (A) through the covalent bond (σ), but not vice versa [AR74]. Since that time, molecules which show rectification have been constructed [Met99]. Additionally, researchers have developed molecular-scale diodes based on the Schottky effect. From a circuit design perspective, the two most important characteristics of these devices are their turn-on voltage (i.e., when they start conducting in the forward direction) and the amount of current that flows in the reverse direction. A perfect diode would start conducting in the forward direction immediately and would conduct no current in the reverse direction. As we shall see later (in Section 8.1), the lower the turn-on voltage, the more useful the device. Unlike silicon-based diodes, molecular diodes have turn-on voltages of less than 0.7 V and forward-to-reverse ratios of several thousand to one [ZDRJI97]. However, since diodes have no gain, they are not sufficient to build large circuits.

Another molecular device of interest is the resonant tunneling diode (RTD), which is often referred to as a negative differential resistor (NDR) in the molecular device literature.


Resonant tunneling diodes are two-terminal devices that have a region of negative resistance (see Fig. 27). The NDR effect is seen in a variety of molecules, most notably [CRRT99, CR00]. Solid-state RTD-based circuits have a long history [AS62], but were essentially abandoned when transistors became the workhorse of circuit design. The key feature of an RTD is its region of NDR, where the tunneling current falls as the voltage increases. The ratio of the current at the beginning of the NDR region to the current at the end is called the peak-to-valley ratio (PVR). An additional important characteristic of an RTD is the resistance of the device: lower resistance leads to faster switching times. While molecular circuits based on RTDs include basic logical operations [EL00] and latches [RG02, NF02], they alone cannot be used to construct large circuits. Although they can be connected in ways that introduce gain into a circuit, without additional devices they do not provide any I/O isolation. In Section 8.3, we discuss their use in building latches and describe how to introduce I/O isolation.

6.1.2 Transistors

As discussed in Section 2.3, transistors are sufficient to construct large circuits. There has been significant recent progress in developing molecular transistors, and transistor-like behavior has been reported in many different kinds of organic and inorganic molecules. In fact, Wind et al. [WAM+02] have reported a nanotube transistor with twice the current-carrying capability of silicon per unit width. One advantage of this device is that the gate is localized to the device (i.e., it does not use back-gating).²⁰ Therefore, more than one device can be present on the substrate. Many other impressive results have been reported on molecular transistors [BHND01, HDC+01a, DMAA01]. All of these devices exhibit gain of more than one, large current-carrying capability, and localized gates. Additionally, some of the reported work has developed both P-type and N-type transistors, allowing

²⁰ In back-gating, the gate that controls the device is the entire substrate on which the device is fabricated. Therefore, when back-gating is used, all the back-gated devices are controlled by a single gate—the substrate.

complementary logic to be developed. This last fact is important for building circuits with reduced power requirements. It would therefore seem that these devices would be perfect for building molecular electronic circuits. However, assembling them into a circuit is a significant remaining challenge, which we discuss below.

6.1.3 Switches

Molecular switches, unlike the previous components, are devices that have no single bulk analog. A switch is a device that can be programmed to be either highly resistive (open) or highly conductive (closed). Molecular switches have already been demonstrated [CWB+99, CWR+00]. We focus here on the pseudo-rotaxane developed at UCLA [CWB+99, BCKS94, MMDS97] (see Fig. 28). The pseudo-rotaxane is a two-terminal device which can be viewed as a rod surrounded by a ring. Four voltage points are used to operate this device: program-off, signal-low, signal-high, and program-on. The programming voltages cause the device to change state. The signal voltages, which have a lower absolute magnitude than the programming voltages, are used to operate the circuit. Programming is achieved using a high voltage differential applied to the two terminals, causing the ring to move to either the left or the right end of the rod, depending on the sign of the voltage. When the ring is at one end of the rod, the molecule behaves like a wire with relatively low resistance. When the ring is at the other end, the device behaves like an open switch (i.e., it has very high resistance). When placed in series with a diode, this behaves like a programmable diode: when the ring is at one end the ensemble is an open connection; when it is at the other end, the ensemble behaves like a diode. It is, of course, crucial that the signal voltages remain less than the programming voltages. Otherwise, the state of the device will be unintentionally changed.

There are two reasons why molecular switches are particularly well suited for reconfigurable computing. First, the state of a switch is stored in the switch itself. Second, the switch can be programmed using the signal wires (i.e., there is no need for additional programming wires).


Figure 28: A diagram of a pseudo-rotaxane in series with a diode. When the ring is at the right, the device has low resistance (i.e., it is a conductor); when it is at the left, it has high resistance (i.e., it is an open circuit). Adapted from Ref. [Sto] with permission of J. Stoddart, http://www.chem.ucla.edu/dept/Faculty/Stoddart/research/mv.htm.

For the rotaxane described above, the state is stored in the spatial location of the ring. Under operating conditions the ring position is stable. A CMOS-based reconfigurable device, on the other hand, requires a static RAM cell to control each pass transistor. When the molecular fabric is compared to a CMOS fabric using a six-transistor static RAM cell to control a pass transistor, we see that a reconfigurable fabric will be at least 10⁶ times as dense using molecular electronics. In addition, CMOS reconfigurable circuits require two sets of wires: one for addressing the configuration bit and one for the data signal.

Perhaps a more apt comparison is between molecular switches and floating-gate technology, which also stores the configuration information at the transistor itself. A floating gate can also be used to create a programmable logic array. Because it requires a high voltage for programming, a floating-gate transistor is slightly larger than a standard transistor. Its larger size makes it, at most, one order of magnitude denser than the RAM cell, still giving CAEN-based devices a density advantage of roughly 10⁵. Moreover, floating-gate transistors act as bidirectional devices and are therefore less useful than molecular switches, which can incorporate rectification, acting as diodes.

Another example of a configurable device is a configurable transistor. Fig. 29b shows a schematic of a configurable transistor formed at the intersection of two wires, similar in spirit to that reported in [RKJ+00]. Unlike the example shown in Fig. 29a, this device behaves like a transistor when the wires are in physical contact and like two nonintersecting conductors when the wires are detached. The configuration mechanism is not a conformational change in a molecule, but rather a conformational change in the wires. By putting an attractive electrostatic charge on the two wires, they can be made to come together; they then stick together due to van der Waals forces.²¹ They can be separated again by putting a repulsive electrostatic charge on the wires.

6.2 Wires

A single device, while interesting, is not useful unless assembled into a circuit. Connecting nanoscale devices together requires nanoscale wires. The primary requirement on a wire is that it quickly transmit (without significantly distorting) a signal from one device to another. Such wires exist and have been made from many different materials, including various metals [PRS+01], silicon [KWC+00], polymers [EL00], and carbon nanotubes [SQM+99]. In addition to the wire, there must be a means for connecting the wire to the device. The connection, or contact, between the wire and the device must also have low signal distortion.

Nanoscale wires can do more than simply conduct; they can also be active devices themselves. For example, carbon nanotubes with diode-like and transistor-like behavior [HDC+01b] and silicon nanowires with transistor-like behavior [CDHL00] have been reported. Another example of wires with active devices is metallic wires grown using electrochemical deposition. The fabrication process also allows different materials to be used in

²¹ The van der Waals force acts to pull atoms (or molecules) together. It is inversely proportional to the seventh power of the distance between the atoms. Clearly this force is only important on the nanometer scale.


Figure 29:Creating transistors between two wires: (a) fixed and (b) configurable transistors.

the growth of the wire. In addition to making wires with different metals [PRS+01], this process can also grow wires with inline molecules [KMM+01, MMK+02, KMM+02]. This technique allows molecules exhibiting rectification, switching, and transistor-like behavior to be incorporated into a wire. In Section 8.3, we show how this technique aids in the construction of a molecular latch. In the end, CAEN blurs the traditional distinction between wires and devices.

Evaluating nanoscale wires is very difficult because the wire/contact/device system is highly intertwined. In fact, some theorize that the nonlinear behavior of molecules is actually a combined effect of the molecule and the contact [Hip01]. However, we can at least characterize the system by the resistance of the wire and the coupling capacitance between neighboring wires and, potentially, the substrate on which the circuit is built. In the best case, the wire behaves like a one-dimensional quantum wire (i.e., it transports electrons ballistically, having a length-independent resistance). It is theorized that carbon nanotubes behave in this manner.

Their conductance (the inverse of resistance) is expressed in units of e²/h [Lan57],²² and a perfectly conducting nanotube would have a conductance of 4e²/h, or a resistance of approximately 6.5 kΩ, independent of its length. This roughly agrees with measurements made for carbon nanotubes by Soh et al. [SQM+99], who measured resistances for both the contacts and a nanotube as low as 20 kΩ. Using this resistance we can determine a rough lower bound on the RC²³ delay of a molecular

²² e is the charge on an electron; h is Planck's constant.

²³ RC is the product of a circuit's resistance (R) and its capacitance (C). It indicates the rate at which the circuit can change its value: after RC time units the circuit has charged (or discharged) to approximately 2/3 of its final value.

computing device, which in turn determines an upper bound on the speed of such a device. Even ignoring the quantum capacitance, the RC delay for 1-micron-long wires placed at a 10 nm pitch is 0.24 ps; at a 100 nm pitch it is 0.13 ps. The RC constant is relatively long for such short wires because of the high resistance of the nanotube.
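
These numbers are easy to sanity-check. In the sketch below, the quantum resistance and the 20 kΩ measured value come from the text; the per-wire coupling capacitances are assumptions chosen to reproduce the quoted delays.

    # Rough sanity check of the quoted resistance and RC figures.
    e = 1.602176634e-19     # electron charge, C
    h = 6.62607015e-34      # Planck constant, J*s

    r_quantum = h / (4 * e ** 2)    # ideal four-channel nanotube
    print(f"h/(4e^2) = {r_quantum / 1e3:.2f} kOhm")   # ~6.45 kOhm

    R = 20e3                        # measured wire + contact resistance
    for pitch_nm, C in ((10, 12e-18), (100, 6.5e-18)):  # assumed capacitances
        print(f"pitch {pitch_nm:3d} nm: RC = {R * C * 1e12:.2f} ps")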

Equally important as the electrical characteristics of the wires and the contacts is the method by which the wire is constructed and the contacts are made. In CMOS-based systems, the wires and connections are made using photolithography. This allows the wires, devices, and connections to be manufactured at the same time, in the same place. This will not be true of molecular systems. In molecular systems the devices will be made separately from the wires, and the two will then be connected together. The next task is to make the connections economically, yet with high reliability. The ability to achieve this will significantly influence the kinds of structures that can be built using molecular devices.

7 Fabrication

Molecular-scale devices cannot themselves overcome the constraints that will prove limiting for CMOS unless a suitable, low-cost manufacturing technique can be devised for composing them into circuits. For devices created separately from the circuit assembly process, localization and connection of the devices using a lithographic-like process will be very difficult. Other means must be sought. Connecting the nanoscale components one at a time is prohibitively expensive due to their



small size and the large number of components to be connected. Instead, economic viability requires that circuits be created and connected through self-assembly and self-alignment. Several groups have recently demonstrated CAEN devices that are self-assembled, or self-aligned, or both [CWB+99, MMDS97, RKJ+00, AN97]. Advances have also been made in creating wires out of single-wall carbon nanotubes and aligning them on a silicon substrate [TDD+97, PLP+99]. More recently, selective destruction has been used to create desired carbon nanotube structures [ACA01]. In addition, metallic nanowires have been fabricated²⁴ which can scale down to 5 nm and can include embedded devices or device coatings [MDR+99, MRM+00].

While other sections of this encyclopedia (see especially Sections X and Y) detail the many different assembly techniques, here we review some of the techniques in light of their impact on the circuits and architectures that can be created. Mass-assembly techniques range from top-down approaches, such as photolithography, to bottom-up approaches, such as self-assembled monolayers. For the reasons stated earlier, we limit ourselves to chemically assembled electronic nanotechnology (i.e., to bottom-up approaches).

7.1 Techniques

Bottom-up, scalable techniques include Langmuir-Blodgett films, flow-based alignment, nanoimprinting, self-assembled monolayers, and catalyzed growth. The common feature among all these techniques, with the possible exception of nanoimprinting,²⁵ is that they can only form simple structures (i.e., either random or very regular, but not complex aperiodic ones). An additional characteristic is that the resulting structures usually contain some defects (i.e., the results are not perfect). Finally, it is not possible to predetermine exactly where a particular element

²⁴ Unlike wires created in traditional processes (i.e., lithography), these wires are "grown" in nanopores.

²⁵ Nanoimprinting has been used successfully to create irregular structures. However, the process of creating masters, combined with direct contact printing, may limit the smallest achievable pitch to above 100 nm [XRPW99]. Further, it will be difficult to precisely align the master with the target.

Figure 30: Different defect scenarios using flow-basedassembly.

will be located in the structure. That is, one cannot deterministically assemble an array of different types of wires into a particular aperiodic pattern.

To date, the most complex structures made using scalable bottom-up techniques are two-dimensional meshes of wires.²⁶ For example, fluidic assembly was used to create a two-dimensional mesh of nanowires [HDW+01]. The nanowires are suspended in a fluid that flows down a channel. The nanowires align with the fluid flow and occasionally stick to the surface they flow over. By varying the flow rate and the duration of the flow, the researchers were able to vary the average spacing between wires and the density of wires. Further refinements can be made by patterning the underlying substrate with different molecular end-groups to which the nanowires show preferential binding. By performing the flow operation twice, first in one direction and then in the orthogonal direction, arrays of wires at right angles can be formed.

The above method points out both the possibilities and the limitations of bottom-up assembly. On the positive side, one can form one-dimensional arrays and two-dimensional meshes of nanowires (Fig. 30a and f). Negatives of this method include its imprecision and its lack of determinism. Wires will rarely all be

²⁶ An exception exists for the recent advances in DNA-based assembly (e.g., [WLWS98, Mir00, NPRG+02]). DNA-based assembly has been shown to create ordered structures with heterogeneous materials. However, DNA-based methods are even less developed than the methods we explore here.


equidistant from each other (Fig. 30b). Wires that should be parallel may intersect (see Fig. 30c) or be askew (Fig. 30d). The wires may be parallel but offset from each other (Fig. 30e). The resulting arrays may contain shorts (see Fig. 30c) or open connections (see Fig. 30g and h). Finally, the technique can never deterministically arrange the wires in the channel. Thus, it cannot produce a complex aperiodic arrangement of the wires.

7.2 Implications

The use of self-assembly as the dominant means of circuit assembly imposes the most severe limitations on nanoscale architectures: it will be difficult to create either precise alignment between components or deterministic aperiodic structures. Chemical self-assembly, as a stochastic process, will produce precise alignment of structures only rarely, and manipulation of single nanoscale structures to construct large-scale circuits is impractical at best. Furthermore, the methods used to assemble nanoscale components are most effective at creating random or, at best, crystal-like structures. These two facts have significant implications for the kinds of circuits and structures that can be realized at the time of fabrication.

The structural implications of bottom-up assembly are:

Connections by overlapping wires: Lack of precise alignment means that end-to-end connections between groups of nanoscale wires will be nearly impossible to achieve. If all connections between nanoscale wires occur only where the wires are orthogonal and overlap, we reduce the need to precisely align the wires.

Two-terminal devices are preferred: The easiest devices to incorporate into a circuit will be devices with two terminals (e.g., diodes, configurable switches, and molecular RTDs). In the case of wires that do more than conduct (e.g., carbon nanotube diodes), when two wires intersect they create an active device between them out of topological necessity.²⁷ If a molecule is required, it can be assembled onto one of the wires; then, where that wire and another wire intersect, an active device will be formed at the intersection. The device will consist of the first wire, the molecule attached to that wire, and the second wire where it touches the molecule on the first wire. Or, in the case of inline devices, one terminal of the molecule is assembled onto the end of the wire as it is grown, and then the growth of the wire is resumed, making the other connection to the molecule. In all three cases, there is no need for precise or end-to-end connections, because devices are incorporated into the circuit out of topological necessity. A harder problem is using three-terminal devices (e.g., molecular transistors). It will likely prove extremely difficult to attempt the connection of three wires to a molecular transistor en masse at the nanometer scale. One possible solution is to arrange for the intersection of the two wires to be the gate of the device. We saw an example of this earlier in Fig. 29.

Active components at cross-points: The most reliable way to make device connections will be to generate them at the intersection of two wires. Such devices can be either two-terminal devices, as described above, three-terminal devices (e.g., [HDC+01b]), or four-terminal devices. In the latter two cases care must be taken to ensure that enough heterogeneity can be introduced into the circuit to utilize the devices. Practical foreseeable nanocomputer designs will therefore have to rely on the placement of active components at the intersections of wires.

Meshes are the basic unit: The methods used to assemble wires, combined with the method of creating active devices, imply that the basic structural unit will be a two-dimensional mesh. More complicated structures will have to be created either by combining meshes together, or by

²⁷ By topological necessity we mean that the outcome is a direct result of the geometric arrangement. In this case, the intersection of two wires creates, simply by being a point of intersection, an active device.


cutting the wires in a mesh to break it up into subarrays.

Nanoscale-to-microscale connections will have to be sparse: A direct implication of the size difference between the nanoscale and the microscale is that connections between the two will have to be few. If there are many connections between the two worlds, then the density of the nanoscale components will be dictated by the density of the microscale. For example, using a demultiplexer to connect micronscale wires to nanoscale wires allows, at any particular instant, 1 of N nanoscale wires to be addressed using log₂ N micronscale wires. As N grows large, the nanoscale wires dominate the device, not the micronscale wires. We describe several different approaches for interfacing micronscale devices with nanoscale devices in Section 7.3.

The architectural implications of molecular electronics and self-assembly are:

Fine-grained reconfigurable: The most likely assembly processes are best at creating crystal-like structures (e.g., 2D meshes). Therefore, the resulting structures cannot directly implement a complex, aperiodic circuit. Creating useful aperiodic circuits will require that the device be configured after it is manufactured.

Defect tolerance: The stochastic process behind molecular self-assembly will inevitably give rise to defects in the manufactured structures. Instead of defect densities in the range of one part per billion (as one gets in silicon), we expect defect densities for CAEN to be as high as a few percent. The architecture will have to be designed to include spares which can be used in place of defective components. Another useful design criterion will be to ensure that when a set of components is assembled into a larger structure (i.e., wires into parallel wires), the individual components are interchangeable. For example, in a set of parallel wires, it should not matter a priori which working wire is used to implement a particular circuit.

Rather than attempting to eliminate all the defects with manufacturing techniques, we will also rely on postfabrication steps that allow the chip to work in spite of its defects. A natural method of handling the defects, first used in the Teramac [CAC+97] (see Sections 4.5.2 and 5.7.3), is to design the device to be reconfigurable and then exploit its reconfigurable nature. After fabrication, the device can be configured to test itself. The result of testing is a map of the device's defects. The defect map can then be used to configure the device to implement a particular function.

Locality: As devices and wires scale down to molecular dimensions, the wires become an increasingly important part of the total design. This is true not only of molecular computing but also of end-of-the-roadmap CMOS. Architectures with significant locality (and thus the ability to communicate most frequently over shorter wires) will have an advantage.

7.3 Scale Matching

As we mention above, not every nanoscale component can be connected to a micronscale component, or the density of the overall system would be dominated by the micronscale components. However, all the currently proposed architectures require some connections to the micronscale. The connections are used for a variety of purposes, including connections to CMOS transistors for signal restoration and connections to wires for circuit inputs, power, ground, and the clock for the nanoscale circuits. Additionally, all the reconfigurable architectures require programming signals, which must be routed to every reconfigurable switch. For arrays, this requires addressing individual cross-points in the array. Since the programming signals originate in the micronscale world, it would seem that we need to connect every nanoscale wire to a micronscale wire.

Consider the array of wires in Fig. 31. If we had to bring in programming signals on micronscale wires to all the nanoscale wires simultaneously, we could not escape the need to connect each of these wires to


Figure 31: An array of nanoscale wires with programmable molecules at each cross-point. To program a single cross-point, we ensure that the voltage difference between its row wire and its column wire is greater than the programming voltage. All the other wires are kept at zero. The result is that the cross-points that are not being programmed see voltage drops of no more than half the programming voltage.

a micronscale wire. Luckily, we only need to address a single wire in a row and a single column at a time. By raising a single row to half the programming voltage and lowering a single column by half the programming voltage, we create the proper voltage differential at a single cross-point. By repeating this for each cross-point we can configure the entire array.
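
A small model of this half-select scheme makes the voltage margins explicit (the array size and programming voltage below are arbitrary):

    # Half-select programming: drive the selected row to +Vp/2 and the
    # selected column to -Vp/2; hold every other wire at 0 V. Only the
    # selected cross-point sees the full Vp; all others see <= Vp/2.
    N, Vp = 4, 2.0
    sel_row, sel_col = 1, 2

    row_v = [Vp / 2 if r == sel_row else 0.0 for r in range(N)]
    col_v = [-Vp / 2 if c == sel_col else 0.0 for c in range(N)]

    for r in range(N):
        for c in range(N):
            drop = row_v[r] - col_v[c]      # voltage across cross-point (r, c)
            assert (r, c) == (sel_row, sel_col) or abs(drop) <= Vp / 2
        print(" ".join(f"{row_v[r] - col_v[c]:+.1f}" for c in range(N)))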

Since we only need to raise (or lower) a single wire (in each dimension) at a time, we can use a demultiplexer to select the nanoscale wire. A demultiplexer (demux) connects a single input line to one of N output lines based on an address input. If we encode the address in binary, then we can select one of N outputs with just log₂ N address lines. If the address lines and the input line are micronscale wires, then we can connect a single micronscale wire to one of N nanoscale wires using only 1 + log₂ N micronscale wires. Since the number of address lines grows with the log of the number of output lines, the ratio of nanoscale wires to micronscale wires grows as N/log₂ N. If this ratio is greater than the ratio of the micronscale to nanoscale pitches, then the demultiplexer will not adversely affect the pitch of the nanoscale wires. If the nanoscale array has 64 rows and columns, and the pitch is 10 nm, then the array will be 640 nm × 640 nm. This would require two 6:64 demuxes. If the micronscale pitch is 100 nm, then it would be perfectly pitch matched.

Figure 32: A possible implementation of a 3:8 demultiplexer. The broad lines are micronscale input and address lines. The thin lines are nanoscale lines. The connections between vertical nanoscale lines and horizontal nanoscale lines are not necessary to the operation of the demux.

Fig. 32 shows a possible implementation of a 1:8 demux. In this implementation each address line is implemented in both its true and complemented form. If an address line is driven high (a logical "1"), then the transistors connected to that address line are turned on, while the ones on the complemented line remain off. In this way, one of the eight vertical lines connects the input line (at the top of the figure) to the bottom of the selected wire. In the figure we have also connected the vertical lines to horizontal nanoscale lines; we are therefore able to connect the horizontal micronscale input line to one of the horizontal nanoscale lines at the bottom of the figure. The area introduced by the demux is roughly (2 Pm log₂ N) × (Pn N), where Pm is the micronscale pitch and Pn is the nanoscale pitch. As the number of nanoscale wires, N, grows large, the overhead for making the connection becomes small.
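
The following sketch works through this overhead. The area expression is the reconstruction given above (address lines in true and complemented form, crossing all N nanoscale wires), and the pitches are the figures used earlier in this section.

    # Demultiplexer overhead relative to the array it addresses.
    from math import log2

    def demux_stats(n_wires, pitch_nano_nm=10, pitch_micro_nm=100):
        addr_lines = int(log2(n_wires))
        micro_wires = 1 + addr_lines             # input line + address lines
        array_side = n_wires * pitch_nano_nm     # nm, one side of the array
        demux_area = (2 * addr_lines * pitch_micro_nm) * array_side
        return micro_wires, demux_area / array_side ** 2

    for n in (64, 1024, 65536):
        wires, overhead = demux_stats(n)
        print(f"N={n:6d}: {wires:2d} micronscale wires, "
              f"demux area / array area = {overhead:.3f}")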

The demultiplexer above requires precise alignment of the nanoscale and micronscale wires. As we have argued above, such alignment will be difficult to achieve. In [WK01], the authors describe a method


Figure 33: A demux formed using random connections between the micronscale wires and the nanoscale wires.

for constructing a demultiplexer which requires substantially less precision. In fact, as far as layout is concerned, it requires only that the nanoscale wires and the micronscale wires be laid out orthogonally. The connections between the wires are made using gold particles that are deposited randomly. A gold particle, when in contact with a nanoscale wire, creates a transistor at the point of contact, with the gold particle acting as the gate. Fig. 33 shows schematically what might result for a demux with eight outputs and six address lines. The gold particles are assembled onto the micronscale wires, resulting in a randomly distributed collection of particles. The micronscale wires act as the address lines of the demultiplexer. A mesh is then created so that the nanoscale wires being addressed overlay the gold particles. Each nanoscale wire will now connect, through the gold particles, to a subset of the micronscale wires.

If we can ensure that each nanoscale wire connects to a different set of micronscale wires, then the nanoscale wires can be individually addressed by the micronscale wires. Of course, we do not know a priori which nanoscale wire will be selected by any given address on the micronscale wires. We do not even know that only one wire will be selected at a

Figure 34: A representation of an optically addressed single-wire demux. This example shows a symmetric single-wire demux that can be programmed with three different frequencies of light.

time. This is less important than it seems, since either (1) we may not care which wire is selected as long as only one is selected, or (2) we can determine which one is selected postfabrication. As the number of address lines increases, the chance that each nanoscale wire has a distinct address increases. For example, roughly 5 log₂ N address lines suffice to give N nanoscale wires distinct addresses with 50% probability. Furthermore, the mapping of addresses to selected outputs can be determined in O(N log N) time [WK01].
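
The flavor of this result can be reproduced with a crude Monte Carlo experiment. The model below is a simplification (each wire simply receives a uniformly random k-bit address, ignoring the physics of particle deposition), so its constants differ from the analysis of [WK01], but it shows how quickly extra address bits make collisions unlikely:

    # Monte Carlo: probability that N random k-bit addresses are all
    # distinct, as a stand-in for "every nanoscale wire is individually
    # addressable."
    import random
    from math import log2

    def p_all_distinct(n_wires, k_bits, trials=2000):
        hits = 0
        for _ in range(trials):
            addrs = {random.getrandbits(k_bits) for _ in range(n_wires)}
            hits += (len(addrs) == n_wires)
        return hits / trials

    N = 64
    base = int(log2(N))
    for k in (base, 2 * base, 4 * base):
        print(f"{k:2d} address bits: "
              f"P(all distinct) ~ {p_all_distinct(N, k):.2f}")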

A demultiplexer essentially trades space for time. With direct connections between the nanoscale and the micronscale, all N nanoscale wires can be connected at once; however, the total area used for such connections scales with N. With a demultiplexer only one connection can be made at a time, reducing the bandwidth between the nanoscale and the micronscale by a factor of N, but reducing the area required by a factor of N/log₂ N.

Another alternative, which trades both more power and more time for less space than the demultiplexers above, is to use a wire whose conducting region can have its length configured in a circuit. We call such a wire a single-wire demux (SWD). An SWD is not as general as the demultiplexer, but it can be used for addressing configuration bits in a nanoscale array. It can be configured to address a subset of the wires to which it is orthogonal. Fig. 34 shows an example of an optically addressable SWD. At the cost of


Figure 35: How an SWD might be used to program the individual cross-points of an array.

some power, two SWDs can be used to individually address a single wire out of a set of parallel wires (see Fig. 35). As opposed to the demultiplexer above, this scheme uses more time and more power, and it requires light. However, it needs only a constant amount of area. For small sets of wires, it will take considerably less area than the demultiplexer, whose area scales with the log of the number of wires being addressed.

7.4 Hierarchical Assembly of a Nanocomputer

Bottom-up assembly implies, by definition, a hierarchical approach to assembling complete systems. Having described some of the basic fabrication primitives, we now describe a plausible process for constructing a molecular integrated circuit. The process, a hybrid of molecular self-assembly and lithographically produced aperiodic structures, is inexpensive yet creates structures out of nanoscale components. At the nanoscale, all subprocesses are self-directed. Only at the micronscale do we allow deterministic operations. The process is hierarchical, proceeding from basic components (e.g., wires and switches), through self-assembled arrays of components, to complete systems.

In the first fabrication step, wires of different types are constructed through chemical self-assembly in bulk. Some of these wires may be functionalized, either by coating their exterior or by including inline devices along the axis of the wire. The next step aligns groups of wires into rafts of parallel wires using flow techniques [HDW+01]. Two rafts can be combined to form a two-dimensional grid using the same flow techniques. The active devices are created wherever two appropriately functionalized wires intersect. By choosing the molecules that coat the wires appropriately, we can make the intersection points ohmic contacts, diodes, configurable switches, or other devices. No precise alignment is required to create the devices; they occur wherever the rafts of wires intersect.

The resulting grids will be on the order of a few microns in size. A separate process will create a silicon-based die using standard lithography. Note, however, that this lithographic process does not have to be "state-of-the-art," since the CMOS components are few and far apart. The circuits on this die will be used to support the nanoscale components. They might include wiring for power, ground, and clock lines; interface logic to communicate with the outside world; and support logic for the grids of devices. The previously created grids can then be aligned to the substrate using techniques such as flow-based alignment or electromagnetic alignment [SNJ+00].

At this point a crystal-like structure has been created, composed of simple meshes. In order for this device to perform useful logic it will have to be customized. A combination of reconfigurable switches and wire cutting can be used to customize the simple meshes into complex aperiodic circuit elements. Wire cutting allows a wire to be selectively broken into two pieces [WKH01]: a high voltage is applied at a cross-point, which over-oxidizes one of the wires. Due to the small diameter of the wire to be cut, the over-oxidation consumes the wire at the cross-point, resulting in two disconnected wires.

The resulting device is a reconfigurable fabric, as described in Section 5. Once the device is manufactured, it is tested using techniques described in Section 10. The result of the testing phase is a defect map that is used to avoid the defects in the device. Finally, when the device is used, the desired circuit is loaded onto the device through a configuration. The


end effect of reconfiguration is a customized, aperiodic circuit which can perform a useful function.

The important features of this scenario are that it is based on both fairly random self-assembly at the molecular level (the assembly of the rafts and the wires) and deterministic assembly at the lithographic scale (the placement of the rafts on the CMOS die). It is defect tolerant due to its reconfigurability. It has no end-to-end connections at the nanoscale: all wire connections are made at the intersections of wires.

8 Circuits

The easiest molecular electronic devices to use are those with only two terminals. Such devices can be incorporated into a molecular circuit at the intersection of two wires. However, designing circuits with only two-terminal devices is a challenge. Currently, there are no known two-terminal devices that in and of themselves provide I/O isolation, gain, or even inversion. Therefore, any circuit design methodology will have to create logic functions without these benefits or else find equivalent primitives through collections of two-terminal devices. If molecular circuits are to scale beyond toy designs, then at some point in the circuit isolation and gain will have to be introduced.

In this section we reintroduce an old logic design methodology for two-terminal devices: diode-resistor logic. Diode-resistor logic allows for the creation of AND and OR gates. The logic family is made complete (in the sense of Section 2.2) by providing, at the inputs to the circuit, both the inputs and their complements. We then describe a method for introducing gain and I/O isolation using molecular latches based on the NDR molecules described in Section 6.1.

8.1 Diode-Resistor Logic

Without transistors as the active elements, we are reduced to using an older form of circuit design, diode-resistor logic. Diode-resistor logic is a form of threshold logic that relies on the fact that diodes only allow current to flow in one direction.

Figure 36: Implementations of an AND gate and an OR gate in diode-resistor logic.

Fig. 36a shows an AND gate implemented with diode-resistor logic. We call the voltage drop across the diode Vdrop. If input A is grounded and B is set equal to Vdd, current will flow through the diode connected to A and there will be a voltage drop of (Vdd − Vdrop) across the resistor. Note that there will be no current flow through the diode connected to input B, since it is back-biased. If we set a logic "0" to be any voltage between ground and several times Vdrop, then the output will be a logic "0." Similarly, if B is low and A is high, then current will flow through B's diode and not A's; the output will still be "0." If both inputs are low, then the output will also be low. Only if both A and B are high will the output be high. In that case neither diode conducts and there is no voltage drop across the resistor.

Fig. 36b shows an OR gate. In this case, the resistor is a pull-down resistor connected to ground. The only way to make the output low is for both A and B to be low. If either A or B is high, then current will flow through a diode. This will cause the output to be Vdd − Vdrop.²⁸ If both A and B are low, then the diodes will both be back-biased, no current will flow, and the output will be low.

²⁸ Both the AND and OR circuits shown here illustrate the need to develop diodes with a very small Vdrop. The smaller Vdrop is, the closer a high output will be to an ideal logic "1" for the OR gate, and the closer a low output will be to an ideal logic "0" for the AND gate.
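
An idealized evaluation of these two gates shows the behavior described above; the supply voltage and diode drop are illustrative values, and the diodes are modeled as perfect except for a fixed forward drop.

    # Idealized diode-resistor gates (Fig. 36): diodes are perfect except
    # for a fixed forward drop VDROP.
    VDD, VDROP = 3.0, 0.3   # volts, illustrative only

    def and_gate(a, b):
        # Pull-up to VDD; a low input pulls the output down to that
        # input plus one diode drop.
        low = min(a, b)
        return low + VDROP if low < VDD - VDROP else VDD

    def or_gate(a, b):
        # Pull-down to ground; a high input pulls the output up to that
        # input minus one diode drop.
        high = max(a, b)
        return high - VDROP if high > VDROP else 0.0

    for a, b in ((0, 0), (0, VDD), (VDD, 0), (VDD, VDD)):
        print(f"A={a:.1f} B={b:.1f} -> AND={and_gate(a, b):.1f} "
              f"OR={or_gate(a, b):.1f}")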


There are several problems with this technology. First, the voltage drops across the diodes tend to reduce the separation between a logic one and a logic zero. For example, suppose Vdd = 5 Vdrop, a logic zero is any voltage less than 2 Vdrop, and a logic one is any voltage above 3 Vdrop. Suppose there are four AND gates connected in series, with the output of each gate connected to both A and B of the next gate. Then, if the input to the first gate is 0 V, the output from the final gate will be 4 Vdrop, which is interpreted as a logic one when it should be a logic zero. Increasing Vdd or decreasing Vdrop helps but does not solve the basic problem (i.e., that diode-resistor logic does not implement a restoring logic). A further problem is that there is no I/O isolation. We address both of these issues in the next section.
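
The erosion of logic levels can be traced stage by stage; the sketch below uses the thresholds from the example above (Vdd = 5 Vdrop; zero below 2 Vdrop, one above 3 Vdrop):

    # Level erosion through a chain of AND gates, each adding one diode
    # drop to a low input.
    VDROP = 0.3
    VDD = 5 * VDROP

    v = 0.0                      # a solid logic zero at the first input
    for stage in range(1, 5):
        # AND gate with both inputs tied to v
        v = v + VDROP if v < VDD - VDROP else VDD
        label = ("one" if v > 3 * VDROP else
                 "zero" if v < 2 * VDROP else "undefined")
        print(f"after stage {stage}: {v:.2f} V -> logic {label}")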

The third problem with diode-resistor logic is that it is not a complete logic family. That is, one cannot invert a signal using only diode-resistor logic. This flaw is not fatal, in that if the inputs to a logic function include both the signals and their complements, then one can compute both the outputs and their complements. This result is based on De Morgan's laws, which state that not(A AND B) = (not A) OR (not B) and not(A OR B) = (not A) AND (not B). Therefore, we only have to ensure that all inputs to the device from the outside world include their complements. We arrange this by inverting outside signals with CMOS as part of the external interface. Then whenever a logic function computes F, we also compute its complement.
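
This is the standard dual-rail trick; the short sketch below carries each signal as a (value, complement) pair, so inversion never needs an inverting gate:

    # Dual-rail logic: each signal X is the pair (X, not X). AND and OR
    # are computed on both rails via De Morgan's laws; NOT just swaps
    # the rails and needs no circuit at all.
    def AND(x, y):
        (xt, xf), (yt, yf) = x, y
        return (xt and yt, xf or yf)    # not(x and y) = (not x) or (not y)

    def OR(x, y):
        (xt, xf), (yt, yf) = x, y
        return (xt or yt, xf and yf)

    def NOT(x):
        xt, xf = x
        return (xf, xt)                 # free: swap the two rails

    T, F = (True, False), (False, True)
    print(NOT(AND(T, F)))               # NAND(1, 0) -> (True, False), i.e. 1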

Finally, this logic family consumes power even when the logic elements are not switching. The static power dissipation is on the order of Vdd²/R. To make this a viable technology the resistors will have to be very large, on the order of 100 MΩ. This will certainly affect the speed of the circuit, since the RC time constant is directly proportional to the resistance in the circuit. Fortunately, the capacitances of molecular-scale wires are small, on the order of a couple of tens of attofarads, due to their small geometries.
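
The orders of magnitude involved are easy to compute; the resistance and capacitance below are the figures quoted above, and the supply voltage is an assumption.

    # Static power vs. speed for a 100 MOhm pull resistor.
    VDD = 1.5        # volts, assumed
    R = 100e6        # ohms, pull resistor (from the text)
    C = 20e-18       # farads, molecular-scale wire capacitance (from the text)

    print(f"static power per gate: {VDD ** 2 / R * 1e9:.1f} nW")
    print(f"RC time constant:      {R * C * 1e9:.1f} ns")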

8.2 RTD-Based Logic

Resonant tunneling diodes (RTDs) can form the basis for a logic family. RTD-based logic is also built around the nonlinear characteristics of a diode; however, instead of a simple rectifying device, it relies on a diode with a region of negative differential resistance, as shown in Fig. 27 in Section 6.1. This logic family is based on bulk semiconductor tunnel diodes from the early 1960s (e.g., [AS62, Car63]).

Most of the work employing tunnel diodes uses them in conjunction with transistors [MSS+99, GPKR97, PG97]. Such circuits are not practical at the molecular level because they require complicated arrangements of transistors and tunnel diodes. If the transistors themselves were freely available, standard logic families would be sufficient for implementing circuits.

Work by Nackashi and Franzon [NF02] investigates using only NDRs29 and other two-terminal devices as the basis for a logic family. The nonlinearity of the NDR is used in conjunction with a network of resistors to create NAND and NOR gates, yielding a complete logic family. They found that circuits based on such devices are extremely sensitive to small variations in the NDR behavior (particularly the voltage at which the peak current flows) and in the values of the resistors. Furthermore, such logic families provide neither gain nor I/O isolation.

8.3 Molecular Latches

Neither diode-resistor logic nor RTD-based logic is scalable, for the reasons stated above. Therefore, if either family is going to be used to compute logic functions, there must be another kind of device that can introduce gain and provide I/O isolation. While it might be possible to use either molecular or CMOS transistors, here we explore a more fabrication-friendly alternative based only on two-terminal devices. The latch we describe is motivated by early work on tunnel diodes [AS62] and, more recently, on compound semiconductor RTDs [MSS+99]. The molecular latch provides the important properties of signal restoration, I/O isolation, and noise immunity.

29An NDR is distinguished from an RTD only in manufacture, and we will use these two acronyms interchangeably.

Figure 37: Load lines for the molecular latch. During a latching operation the bias voltage is lowered to $V_{mono}$ and then raised back to $V_{ref}$.

8.3.1 Molecular Latch Operation

The molecular latch is constructed from a pair of molecular RTDs (also called NDRs), described in Section 6. Fig. 27 shows an I-V curve of a representative RTD. The key feature of an RTD is a region of negative differential resistance, in which the tunneling current falls as the voltage increases. The ratio of the current at the beginning of the NDR region to the current at the end is called the peak-to-valley ratio (PVR). Molecular RTDs that operate at room temperature with PVRs of 10 or more have already been realized [CRRT99, CR00].

Fig. 37 shows the arrangement of the RTDs in a molecular latch and a load-line diagram for the molecular latch at a bias voltage $V_{ref}$. The state of the latch is determined by the voltage at the node between the two RTDs. The load-line diagram shows two stable states, $V_{low}$ and $V_{high}$, and a third metastable state. Small voltage fluctuations in the metastable state will push the circuit into one or the other of the stable states. The low state represents binary “0” and the high state binary “1.”
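
The two stable states can be found numerically as the intersections of the two RTD curves. The following sketch uses an assumed piecewise-linear I-V model, not the paper's SPICE model; all parameter values are illustrative.

```python
# Load-line sketch: find node voltages where the drive RTD current (node to
# ground) equals the load RTD current (V_ref to node). All values assumed.
import numpy as np

def rtd_iv(v, v_peak=0.3, v_valley=0.6, i_peak=1.0, pvr=10.0):
    """Piecewise-linear RTD I-V: rise to i_peak at v_peak, an NDR region
    falling to i_peak/pvr at v_valley, then rising again."""
    i_valley = i_peak / pvr
    return np.where(
        v < v_peak,
        i_peak * v / v_peak,
        np.where(
            v < v_valley,
            i_peak + (i_valley - i_peak) * (v - v_peak) / (v_valley - v_peak),
            i_valley + (v - v_valley) * i_peak,
        ),
    )

v_ref = 0.9
v = np.linspace(0.0, v_ref, 2000)        # candidate data-node voltages
diff = rtd_iv(v) - rtd_iv(v_ref - v)     # drive current minus load current
ops = v[np.nonzero(np.diff(np.sign(diff)))[0]]
print(ops)   # three operating points: V_low, the metastable point, V_high
```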

The state of the latch is changed by temporarily disrupting and restoring the equilibrium to one of the two bistable states. The bias voltage is temporarily lowered to $V_{mono}$, causing a shift of the load line to the left such that the circuit has only one stable state. The bias voltage is then returned to $V_{ref}$. During the evolution from the monostable to the bistable state (the monostable–bistable transition, or MBT), current introduced to the data node will flow through the drive RTD. If the total charge introduced by this current during the transition exceeds a threshold value, the circuit will switch to the high state; otherwise the circuit will switch to the low state. The amount of current necessary to force the latch into the high state is determined by the PVR of the RTD. For a well-matched RTD pair, this current can be very small [AS62]. A higher PVR requires more current to set the high state but provides greater stability against current variation. For the circuit to function correctly, the drive RTD must be more resistive than the load RTD (i.e., $R_{ser}$ in Fig. 38a must be larger for RTD$_{drive}$).
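
Abstractly, the MBT acts as a charge-threshold sampler; a toy model (with an assumed, normalized threshold) is:

```python
# Toy model of the monostable-bistable transition: the state latched when
# the bias returns to V_ref depends on the charge injected at the data node
# during the transition. Q_THRESH is a normalized stand-in for the threshold
# set by the RTD pair's PVR and matching (assumed value).
Q_THRESH = 0.5

def latch_after_mbt(injected_charge):
    return 1 if injected_charge > Q_THRESH else 0

print(latch_after_mbt(0.8), latch_after_mbt(0.1))   # -> 1 0
```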

RTD latch technology has already been exploited in constructing a GaAs FET logic family by Mathews et al. [MSS+99]. They describe a series of logic gates constructed from RTD latches, FETs, and saturated resistors that exhibit high switching speeds and low power dissipation. In CAEN, the RTD latches are used for a different purpose: they provide voltage buffering, by storing the state of a previous computation for use in a later computation, and signal restoration, to either the “0” or the “1” voltage, with each latch. All computation is performed by diode-resistor logic, and the latch restores the signal for a later computational stage. Work in the 1960s with tunnel diodes and threshold logic was hampered by high manufacturing variability [HB67]. The molecular RTDs discovered so far have higher PVRs than semiconductor-based devices, which, as we explain below, may alleviate some of these problems.

8.3.2 Molecular Latch Simulations: Restoration and Reliability

Since molecular-scale RTD latches have yet to be realized, we conducted simulations of their behavior using SPICE. We used a device model similar to that of Mathews et al. for each RTD [MSS+99]. The device model is depicted in Fig. 38a: the RTD is modeled as a series resistance $R_{ser}$, with a parallel capacitance $C_{RTD}$ and a voltage-dependent current source $I(V)$. This meshes well with the proposed construction of molecular-scale RTDs, which is accomplished by connecting two segments of molecular-scale metal wire with a number of identical RTD molecules (see Fig. 38b). Based on the estimated length and dielectric of the molecules and the diameter of the molecular wires, we estimate a capacitance of 2.7 aF for each RTD. We constructed model equations for the voltage-dependent current source to mimic the I-V curves reported by Reed and Tour at 190 K [CWR+00]. To explore the RTD parameter space, we also constructed models for several different “ideal” molecular RTDs in which the $V_{peak}$ and $V_{valley}$ locations and the PVR were varied.

Figure 38: (a) Device model for an RTD. (b) Possible construction technique for a molecular RTD as an inline device (not to scale).

Figure 39: (a) Simple molecular latch circuit used in testing. (b) Results of a single stability experiment. Note the voltage “gain” and the memory effect.

Fig. 39b shows the results of a simple RTD simulation using an idealized molecule as the basis of the RTD latch circuit shown in Fig. 39a. The clock signal is shaped to provide an MBT in every cycle and to maintain the bistable state for the remainder of the cycle. Note that the cycle time was deliberately made longer than required, to ensure equilibrium before the next cycle. The figure shows that the voltage is restored to the levels predicted by the RTD load line and the choice of $V_{ref}$. In contrast to the “ideal” molecule, the relatively high $V_{peak}$ of the Reed-Tour molecule would produce narrowly spaced “0” and “1” voltages for any reasonable choice of $V_{ref}$, significantly reducing the noise margins of any circuit using this kind of latch. The molecular RTD latch becomes more practical as $V_{peak}$ becomes lower: the closer $V_{peak}$ is to 0 V, the larger the separation of low and high signals that can be produced at a realistically achievable $V_{ref}$.

To determine the envelope of stability for the RTD, we simulated the behavior of this simple circuit while varying the “low” and “high” input voltages, the input current, and the relative sizes of the load and drive RTDs. In simulation, simple RTD latches in isolation appear to be stable over the range of variability likely to be encountered in their synthesis. Stability has proven a significant stumbling block to the application of semiconductor RTDs, as device variability often falls outside the range required for incorporation into circuits [HB67]. Molecular RTDs have an advantage in that all the molecules incorporated into an RTD are identical. In addition, molecular RTDs have far higher PVRs than conventional RTDs, which improves the stability of latches constructed from them [MSS+99].

8.3.3 Combinations of Latches: I/O Isolation

In molecular circuits, latches are used to buffer and condition the output of a diode-resistor combinational circuit so that it can serve as input to the next one. We therefore need to consider the interactions of multiple latches. The simplest circuit using more than one latch is a delay circuit, as shown in Fig. 40. Because the latch data node emits the correct voltage only when the latch clock is in the high state, a two-phase clocking scheme (as shown in Fig. 41) is required to propagate the signal through the circuit. Latches connected in series have alternating clock phases.
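
A toy model of this clocking discipline (our own illustration; the latch indexing and phase assignment are assumptions):

```python
# Two-phase delay line in the style of Figs. 40-41: latches at even positions
# are clocked by phi1, odd positions by phi2, so same-phase latches are never
# adjacent and a value advances one latch per clock phase.
def run(bits, n_latches=3):
    latches = [0] * n_latches
    trace = []
    for b in bits:
        for phase in (0, 1):                       # phi1 fires, then phi2
            for i in range(n_latches):
                if i % 2 == phase:
                    latches[i] = b if i == 0 else latches[i - 1]
        trace.append(list(latches))
    return trace

for t, state in enumerate(run([1, 0, 0, 1])):
    print(t, state)    # each input bit marches down the latch chain
```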

Figure 40: Simple 3-stage delay circuit using the molecular latch of Fig. 42.

Figure 41: Two-phase clocking scheme used for series latches.

It is clear that the lack of the I/O isolation normally provided by transistors is problematic for a latch consisting solely of an RTD pair. Without intervening devices, all data nodes must sit at the same voltage, and even with additional linear circuit elements, current is free to flow backward and set upstream latches. To provide the necessary isolation characteristics, we incorporate several additional devices into the latch, as shown in Fig. 42.

A molecular-scale diode is added to prevent current from a downstream latch from flowing back and setting an upstream latch. Without the diode, setting a downstream latch would prevent the immediately upstream latch from resetting properly. Molecular diodes with very low voltage drops and reverse current flows have already been realized [KMM+02]. The resistor R immediately following the latch data node forces enough current into RTD$_{drive}$ to allow it to switch high, while allowing enough current into the next latch to determine its value. Without the resistor, latches in series would act as a current divider, and insufficient current would flow through each RTD$_{drive}$ to set it to the “1” state.

Figure 42: The complete molecular latch.

To understand the rationale for the resistor to ground, consider the delay latch shown in Fig. 40 and the circuit state (In=Low, D1=Low, D2=High, D3=High) at the instant before the $\phi_2$ clock cycle. As the $\phi_2$ clock voltage is lowered to $V_{mono}$, the diodes in both D2 and D3 are reverse-biased, preventing any discharge of the capacitance of latch D2. Since discharge is necessary for the latch to enter the low state, we provide a path to ground through a large resistor. The value of this resistor must be large, to prevent current loss that would cause latch-setting failure; however, it should also be as small as possible, to minimize the latch reset time.

We simulated several simple circuits using the above latch configuration, including the delay circuit in Fig. 40 and the AND, OR, and XOR circuits shown in Fig. 43. The circuits calculate the correct values, taking into account the delay required to traverse the intervening latches. Circuits with more logic levels were also simulated successfully.

Figure 43: AND, OR, and XOR circuits using diode-resistor logic and intermediate molecular latches.

Figure 44: The lack of I/O isolation in the AND circuit provides a path from $V_{dd1}$ to ground.

Figure 45: Clocking scheme used for $V_{dd}$.

One additional complication resulting from the lack of I/O isolation was discovered for the AND circuit. Consider the circuit state (A=low, B=high, Y=high, Z=high) immediately before $\phi_1$ begins its cycle (Fig. 44). The diode of Z is reverse-biased, so the current from $V_{dd1}$ will flow through the AND diodes into the latches and to ground, preventing them from being reset. The latch cannot be protected by a diode, because doing so leaves no path to sink current from the pull-up voltage; without a path to ground, the pull-up would incorrectly set Z high. To counteract this, we introduce a clocking scheme for the pull-up voltage as well, as shown in Fig. 45. The pull-up voltage is temporarily brought to ground during the MBT of the preceding latch. This removes the forward influence and allows the latches to be set properly. Unfortunately, in a real circuit this will likely have deleterious consequences for the RC constant.

8.3.4 Latch Summary

In simulation, the molecular latch is capable of restoring voltage levels to the two stable values and of providing current input, as determined by the I-V curve of the underlying RTD. It appears to be stable against the variability brought about by manufacturing, as long as the drive RTD can reliably be made more resistive than the load RTD.30 Because of the relatively low current required to switch states, one molecular latch can drive several other latches (i.e., latches can have moderate fan-out). In combination with the clocking scheme described above, the latch also provides the necessary I/O isolation to ensure correct computation.

Some caveats are necessary, however. Because the RTD devices discovered so far are extremely resistive, the RC constant for circuits incorporating them is very long (on the order of hundreds of nanoseconds). Clearly, this will need improvement before these devices become practical. Also, the relative sizes of the various resistors must be carefully controlled during manufacture to ensure that they are properly matched; this will allow more devices to attain proper I/O isolation without adversely affecting the time constants of the circuit. (Note, however, that controlling the resistance promises to be easy compared with aperiodic placement of devices.) Lastly, the use of a clocked pull-up voltage may result in additional complications for the design of the CMOS support circuitry, though the similarity in waveforms between the latch clock and the $V_{dd}$ clock may alleviate some of these problems.

8.4 Molecular Circuits

CMOS designers have long enjoyed the benefits of using a physical device, the transistor, that is close to ideal. Furthermore, they have had the freedom to connect devices together in whatever topology they chose. With molecular electronics (and potentially with end-of-the-roadmap CMOS), this will not be true.

30The actual RTDs need not be manufactured differently. Instead, the resistance at the connection to the drive RTD can be increased, which can be accomplished quite easily.

Instead of a single device providing all the required features (switching logic, isolation, restoration of signals, and gain), different devices will have to be used, each of which may provide only a single one of these features. In this section we have proposed a combination of diode-resistor logic (for logic), a molecular latch (for restoration of signals and for memory), and a combination of diodes, resistors, and a clocking methodology (for isolation). There are other alternatives to this combination. For example, the bulk of the logic could use arrays of diode-resistor logic, state could be stored using molecular latches, and occasional transistors could be interspersed in the circuit for I/O isolation and gain. Whatever the solution, circuit designers using molecular electronics will, with high likelihood, have to obtain the behavior they want with a combination of devices. One might ask: if transistors are available for isolation and gain, why not use only transistors? The reason is that transistors are likely to be hard to incorporate in the circuit, for the reasons described earlier, so limiting their number will increase the overall density and reduce the overall cost of the final circuit.

Circuit designers currently use an abstraction of the transistor as the basic building block of logic. The abstraction combines three essential components: a logical switch, isolation of inputs from outputs, and signal restoration. To build memory devices, several transistors need to be combined. Examining molecular devices suggests that breaking the abstraction of an ideal transistor into its primitive components may provide intellectual leverage for constructing molecular circuits. We propose the SirM abstraction for molecular components. SirM stands for switch, isolator, restorer, and memory. The SirM device abstraction allows circuit designers to construct circuits from devices each of which may display only one of these four characteristics. Using the previous example, we might construct our logic from combinations of diodes and resistors (the switch component), molecular latches (the restorer and memory components), and a clocking methodology that provides isolation. Alternatively, the switch component might come from diodes and resistors, the isolator and restorer components from molecular transistors, and the memory component from latches.
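
As an illustration of the abstraction (the encoding and names here are ours, not the paper's), a device family can be checked for completeness by asking whether its members jointly cover all four SirM primitives:

```python
# SirM completeness check: a device family is complete if, together, its
# devices provide Switch, Isolator, Restorer, and Memory.
from enum import Flag, auto

class SirM(Flag):
    SWITCH = auto()
    ISOLATOR = auto()
    RESTORER = auto()
    MEMORY = auto()

ALL = SirM.SWITCH | SirM.ISOLATOR | SirM.RESTORER | SirM.MEMORY

def is_complete(family):
    covered = SirM(0)
    for capabilities in family.values():
        covered |= capabilities
    return covered == ALL

# The combination proposed in this section:
caen_family = {
    "diode-resistor logic": SirM.SWITCH,
    "molecular latch": SirM.RESTORER | SirM.MEMORY,
    "clocked pull-up / two-phase clocks": SirM.ISOLATOR,
}
print(is_complete(caen_family))    # True
```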

Figure 46: Example of a non-configurable P-type nanowire FET. The diagram on the left shows the crossing wires; the diagram on the right shows the circuit formed.

SirM recognizes that one device does not have to provide all the features necessary for a composable, scalable logic family. By identifying the necessary primitives (switch, isolator, restorer, and memory), circuit designers and device designers are given the freedom to optimize the types and combinations of devices in ways that they have not been able to before. One can now talk about complete device families, instead of single devices, that provide all the necessary components for a complete logic family. A complete device family is one that provides all the components of SirM and therefore can be used to construct a scalable, complete logic family.

8.5 Molecular Transistor Circuits

Nanowire FETs have been demonstrated which have the potential to form the isolator and restorer components of SirM, or potentially to be the core of a complete logic family. Unlike CMOS, this will not be a complementary logic family, so, like diode-resistor logic, it will burn power even when it is not switching. Fig. 46 shows a nanowire P-FET NOR gate made only with crossing wires. This device, if configurable, could be put into an array in which the inputs all run horizontally; each vertical wire could then be configured to compute a different function of the inputs. DeHon proposes an architecture (see Section 9.3) based on this configurable device [DeH02]. If the device is not configurable, then it can be used as the isolator or restorer component of a complete device family.

9 Molecular Architectures

There are three basic approaches to molecular architectures: random, quasi-regular, and deterministic. The random approach assumes only the most basic manufacturing primitives and requires only that molecules and wires be assembled in random patterns. At the other end of the spectrum, the deterministic approach assumes that complex, aperiodic structures can be built precisely; it requires precision similar to that used in lithographically based semiconductor manufacturing. The quasi-regular approach takes a middle road: it requires that two-dimensional cross-bars be manufacturable, but it does not require that the wires in the cross-bar be laid out in any particular order.

We begin this section with a detailed description of the nanoFabric architecture, which is an example of a quasi-regular molecular architecture. We then review other approaches in all three categories.

9.1 NanoFabrics

The nanoFabric is an example of a quasi-regular architecture designed to overcome the constraints associated with directed assembly of nanometer-scale components and to exploit the advantages of molecular electronics. Because the fabrication process is fundamentally nondeterministic and prone to introducing defects, the nanoFabric must be reconfigurable and amenable to self-testing. This allows us to discover the characteristics of each nanoFabric and then create circuits that avoid the defects and use only the working resources. For the foreseeable future, fabrication processes will produce only simple, regular geometries; therefore, the proposed nanoFabric is built out of simple, two-dimensional, homogeneous structures. Rather than fabricating complex circuits, we use the reconfigurability of the fabric to implement arbitrary functions postfabrication. The construction process is also parallel, and heterogeneity is introduced only at the lithographic scale. The nanoFabric can be configured (and reconfigured) to implement any circuit, like today's FPGAs; however, the nanoFabric has several orders of magnitude more resources.

Figure 47: Schematic of a nanoBlock.

The nanoFabric is a planar mesh of interconnected nanoBlocks. The nanoBlocks are logic blocks that can be programmed to implement a three-bit-input, three-bit-output Boolean function and its complement (see Fig. 47). NanoBlocks can also be used as switches to route signals. The nanoBlocks are organized into clusters (see Fig. 48). Within a cluster, the nanoBlocks are connected to their four nearest neighbors. Long wires, which may span many clusters (long-lines), are used to route signals between clusters; the nanoBlocks on the perimeter of the cluster are connected to the long-lines. This arrangement is similar to commercial FPGAs (allowing us to leverage current FPGA tools) and has been shown to be flexible enough to implement any circuit on the underlying fabric.

The nanoBlock design is dictated by fabrication constraints. Each side of the block can have inputs or outputs, but not both; therefore, the I/O arrangement in Fig. 48 is required. We have arranged the design so that all nanoscale wire-to-wire connections are made between two orthogonal wires, so that precise end-to-end alignment is not needed. Figures 49b and c show how the outputs of one nanoBlock connect to the inputs of another. We call the area where the input and output wires overlap a switch block. Notice that the outputs of the blocks face either south and east (SE) or north and west (NW). By arranging the blocks such that all the SE blocks run in one diagonal and the NW blocks run in the adjacent diagonal, we can map any circuit netlist onto the nanoFabric. Since the nanoBlocks themselves are larger than the minimum lithographic dimension, they can be positioned precisely at manufacturing time in the desired patterns using lithography.

Figure 49: Diagrams showing how the inputs and outputs of a nanoBlock connect to their neighbors.

Figure 48: The layout of the nanoFabric, with a partial blowup of a single cluster and some of the adjacent long-lines.

In addition to the intracluster routing, there are long-lines that run between the clusters to provide low-latency communication over longer distances. The nanowires in these tracks will be of varying lengths (e.g., one, two, four, and eight clusters long), allowing a signal to traverse one or more clusters without going through any switches. This layout is essentially that of an island-style FPGA [SBV92].

This general layout has been shown to be efficient and amenable to place-and-route tools [BR97]. Notice that all communication between nanoBlocks occurs at the nanoscale. The fact that we never need to go between nanoscale and CMOS components and back again increases the density of the nanoFabric and lowers its power requirements.

The arrangement of the clusters and the long-lines promotes scalability in several ways. First, as the number of components increases, we can increase the number of long-lines that run between the clusters; this supports the routability of netlists. Second, each cluster is designed to be configured in parallel, allowing configuration times to remain reasonable even for very large fabrics. The power requirements remain low because we use molecular devices for all aspects of circuit operation. Finally, because we assemble the nanoFabric hierarchically, we can exploit the parallel nature of chemical assembly.

9.1.1 NanoBlock

The nanoBlock is the fundamental unit of the nanoFabric. It is composed of three sections (see Fig. 47): (1) the molecular logic array, where the functionality of the block is located; (2) the latches, used for signal restoration and for signal latching in sequential circuit implementations; and (3) the I/O area, used to connect the nanoBlock to its neighbors through the switch block.

Figure 50: An AND gate implemented in the MLA of a nanoBlock.

Figure 51: A half-adder implemented in the MLA of a nanoBlock, and the equivalent circuit diagram for the computation of A XOR B = S. Only S, $\overline{S}$, C, and $\overline{C}$ are outputs of this circuit; the rest of the lines are intermediate results and are labeled only for reference.

The molecular logic array (MLA) portion of a nanoBlock is composed of two orthogonal sets of wires. At each intersection of two wires lies a configurable molecular switch. The switches, when configured to be “on,” act as diodes. Designing circuits for the MLA is significantly different from designing for a programmable logic array, which requires an OR plane and an AND plane. We have preliminary designs for a “standard cell” library using nanoBlocks (e.g., AND, OR, XOR, and ADDER).

Fig. 50 shows the implementation of an AND gate, while Fig. 51 shows the implementation of a half-adder. The top part of Fig. 51 is a schematic of the portion of the circuit used to generate the sum output. This circuit, which computes the XOR of A and B, is typical of diode-resistor logic. For example, if A is high and B is low, then their complements ($\overline{A}$ and $\overline{B}$) are low and high, respectively. As a result, diodes 1, 2, 5, and 6 will be reverse-biased and not conducting. Diode 8 will be forward-biased and will pull the line labeled “red” in the figure down close to a logic low. This makes diode 4 forward-biased. By manufacturing the resistors appropriately (i.e., resistors attached to $V_{dd}$ have smaller impedances than those attached to Gnd), most of the voltage drop occurs across R1, resulting in S being high. If A and B are both low, then diodes 2, 4, 5, and 7 are back-biased. This isolates S from $V_{dd}$ and makes it low.
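
The same computation can be mimicked with the ideal diode-gate sketch from Section 8.1; complements are supplied as inputs, as in the MLA, and all voltage values are assumed:

```python
# Dual-rail XOR in the ideal diode-gate model: S = (A AND notB) OR (notA AND B).
V_DD, V_DROP = 1.0, 0.2    # assumed values

def and_gate(a, b):
    return min(min(a, b) + V_DROP, V_DD)

def or_gate(a, b):
    return max(max(a, b) - V_DROP, 0.0)

def xor_sum(a, na, b, nb):
    return or_gate(and_gate(a, nb), and_gate(na, b))

for a in (0.0, V_DD):
    for b in (0.0, V_DD):
        s = xor_sum(a, V_DD - a, b, V_DD - b)
        print(f"A={a:.0f} B={b:.0f} -> S={s:.1f}")   # the "1" is degraded to 0.8 V,
                                                     # which is why latches restore it
```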

The MLA computes logic functions and routes signals using diode-resistor logic. The benefit of this scheme is that we can construct it by directed assembly; the drawback is that the signal is degraded every time it goes through a configurable switch. In order to restore signals to proper logic values without using CMOS gates, we will use the molecular latch described in Section 8.3.

The organization of the MLA supports the implementation of logic functions using either two-level logic or arbitrary logic combinations.31 If a two-level representation is used (for example, a product-of-sums representation), then the square shape of the MLA, and the fact that all the power connections are on one side and all the connections to ground are on another, will reduce the logic density that can be realized in a given area. This is offset by three factors. First, it will be easier to fabricate a symmetric MLA. Second, the extra freedom to place product terms or sum terms anywhere in the array increases defect tolerance. Third, as we will see below, reducing the number of different connections to the CMOS substrate increases the overall density of the nanoscale components.

Arbitrary logic representations use few of the cross-points on any wire. As the size of the MLA increases, the overhead of the micronscale connections decreases, but the density of useful connections per wire also decreases. The goal of the designer is to choose the size of the MLA so that the overall logic density per unit area is highest. In addition to optimizing the cross-points on each wire, the designer must keep in mind that a particular circuit cannot include an arbitrary number of diodes: each time a signal passes through a diode, the signal is degraded. Therefore, the number of diodes a signal can traverse is limited to the number it can pass through before a logical value on the wire can no longer be distinguished from its opposite value.
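
A crude version of this depth budget (threshold and drop values assumed):

```python
# Each diode traversal costs one V_drop of noise margin, so the usable chain
# length is the number of drops between a clean "0" and the 0/1 threshold.
V_DD_mV = 1000               # assumed supply, in millivolts
V_DROP_mV = 100              # assumed diode drop
threshold_mV = V_DD_mV // 2  # assumed midpoint decision threshold

max_diode_traversals = threshold_mV // V_DROP_mV
print(max_diode_traversals)  # -> 5 diode traversals before a latch must restore
```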

31Refer back to Section 5.2 for definitions of these terms.

Figure 52: An example layout of a nanoBlock and its associated CMOS support. The long thin rectangles represent nanowires; the thicker squares represent CMOS metal.

Figure 53: How nanoBlocks tile together.

The layout of the MLA and of the switch block makes rerouting easy in the presence of faults. By examining Fig. 51, one can see that a bad switch is easily avoided by swapping wires that only carry internal values. In fact, the rows can be moved anywhere within the block without affecting the circuit, which makes defect tolerance significantly easier than with CMOS.32 The number of possible ways to arrange the columns and rows in the MLA, combined with the configurable cross-bar implemented by the switch block, makes the entire design robust to defects in either the switch block or the MLA.

Notice that all the connections between the CMOS layer and the nanoBlock occur either between groups of wires or with a wire that is separated from all the other components. This improves the device density of the fabric. To achieve a specific functionality, the cross-points are configured either to be open connections or to be diodes.

Fig. 52 shows one possible way that the nanoscale components of the nanoBlock might be connected to a CMOS substrate. The filled-in squares represent connections to the underlying substrate; the rectangles represent nanoscale wires. The power ($V_{dd}$) and ground (Gnd) connections all go to sets of wires. The power connection is for the pull-up resistors in the MLA, the ground connection on the left is for the pull-down resistors in the MLA, and the two other connections are for the molecular latch. The other ends of the molecular latch wires connect to the clock pad ($\phi$). Finally, the two CMOS connections labeled “C” are for the configuration pad that connects to the single-wire demux. Notice that only a tiny fraction of the entire area is taken up by the MLA itself (the checkerboard area in the middle of the nanoBlock).

The arrangement shown in Fig. 52 allows multiple nanoBlocks to tile the plane without violating any of the process rules used in constructing the CMOS pads; that is, no pads of different types share any sides or corners. Fig. 53 shows how the nanoBlocks can be arranged into the diagonal patterns needed to form the switch blocks. The repeated pattern that can tessellate the plane is shown as the area surrounded by a thick black line covering eight nanoBlocks. The pattern covers a 20x20 region of CMOS line widths and contains eight nanoBlocks. Therefore, the area for one implemented nanoBlock is 50 square minimum line widths. This is at least an order of magnitude smaller than a half-adder implemented solely in CMOS.

32This is because nanowires have several orders of magnitude less resistance than switches. Therefore, the timing of a block is unaffected by small permutations in the layout.

9.1.2 Configuration

The nanoFabric uses run-time reconfiguration both for defect testing and to perform its intended function. Therefore, it is essential that the time to configure the fabric scale with the size of the device. Two factors contribute to the configuration time: the time it takes to download a configuration to the nanoFabric, and the time it takes to distribute the configuration bits to the different regions of the nanoFabric. Configuration decoders are required to serialize the configuration process in each nanoBlock (see Fig. 34 in Section 7.3). To reduce the CMOS overhead, we intend to configure only one nanoBlock per cluster at a time. However, the fabric has been designed so that the clusters can be programmed in parallel. A very conservative estimate is that we can simultaneously configure one nanoBlock in each of 1000 clusters.

A molecular switch is configured when the voltage across the device is increased beyond the normal operating range. Devices in the switch blocks can be configured directly by applying a voltage difference between the long intercluster lines. In order to achieve the densities presented above, it will also be necessary to develop a configuration approach for the switches in the MLA that is implemented with nanoscale components. In particular, a nanoscale decoder is required to address each intersection of the MLA independently. Instead of addressing each nanoscale wire separately in space, we address them separately in the time dimension. This slows down the configuration process but increases the device density.

Our preliminary calculations indicate that we can load the full nanoFabric, which comprises on the order of $10^{11}$ configuration bits at a density of roughly $10^{11}$ configuration bits/cm$^2$, in less than 1 second. This calculation is based on the realistic assumptions that, on average, fewer than 10% of the bits are set ON and that the configurations are highly compressible [HLS99]. It is also significant to note that it is not necessary to configure the full fabric for defect testing; instead, we will configure only the portions under test.

As the configuration is loaded onto the fabric, it will be used to configure the nanoBlocks. Using the configuration decoder, this will require roughly 300 cycles per nanoBlock, or less than 38K cycles per cluster. Therefore, the worst-case time to configure all the clusters at a very conservative 10 MHz is about 3 seconds.
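
An order-of-magnitude check of this estimate, using the parallelism and density assumptions stated in the text (the arithmetic below is ours):

```python
# Worst-case configuration time: one nanoBlock at a time per cluster, with
# 1000 clusters configured in parallel, at a conservative 10 MHz.
blocks_per_cm2 = 200e6
blocks_per_cluster = 128
cycles_per_cluster = 38_000          # ~300 cycles per nanoBlock
parallel_clusters = 1000
clock_hz = 10e6

clusters = blocks_per_cm2 / blocks_per_cluster
seconds = (clusters / parallel_clusters) * cycles_per_cluster / clock_hz
print(f"{seconds:.1f} s")            # a few seconds, the same order as the text
```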

9.1.3 Putting It All Together

The nanoFabric is a reconfigurable architecture built out of CMOS and CAEN technology. The support system (i.e., power, ground, clock, configuration wires, I/O, and basic control) is implemented in CMOS. On top of the CMOS we construct the nanoBlocks and long-lines out of chemically self-assembled nanoscale components.

Assuming a 100 nm CMOS process and 40 nm centers, with 128 blocks to a cluster and 30 long-lines per channel, our design should yield 200M blocks/cm$^2$, requiring on the order of $10^{11}$ configuration bits. (If, instead of molecular latches, transistors were used for signal restoration, then with 40 nm centers for the nanoscale wires the density is reduced by a factor of 10.)

SPICE simulations show that a nanoBlock configured to act as a half-adder can conservatively operate at 100 MHz. Preliminary calculations show that the fabric as a whole will have static and dynamic power dissipation at 100 MHz each on the order of watts.

The nanoFabric assumes a fairly simple fabrication process, but it still requires the parallel alignment of nanoscale wires and the ability to attach those wires to photolithographically created pads on a silicon substrate. It overcomes the lack of arbitrary patterning in the fabrication methods with postfabrication computation, which both aids defect tolerance and allows the desired functionality to be configured into the device.

Figure 54: A representation of a Nanocell. The links between particles are created implicitly by the proximity of the molecules. Reprinted with permission from [Hus02], S. M. Husband, Programming the Nanocell, a Random Array of Molecules, Ph.D. Thesis, Rice University, 2002.

9.2 Random Approach

The random class of architectures requires less precise fabrication methods than the nanoFabric. This class of architectures essentially requires only fabrication methods that can randomly deposit nanoscale particles on a silicon substrate. However, it requires even more postfabrication computation to construct useful computing devices.

Here we describe an example of this class of architecture proposed by a team at Rice and Yale. The basic component of this architecture is the Nanocell [Hus02, HHP+01], a small lithographically defined area filled with nanoparticles and molecules. The nanoparticles are assembled in the interior using a self-assembled monolayer (see Fig. 54). Molecules that behave like molecular switches and also exhibit NDR behavior are deposited on the nanoparticles. Along the perimeter of the nanocell are I/O pins which connect the nanoparticles in the interior of the cell with the rest of the chip, which is traditional CMOS created using lithographic techniques. The density of particles and deposited molecules ensures that there will be some connections between the particles themselves and the micronscale leads around the perimeter of the cell. The resulting ensemble forms a programmable electrical network.

Figure 55: A Nanocell programmed to implement a NAND gate in its upper-right corner. Reprinted with permission from [Hus02], S. M. Husband, Programming the Nanocell, a Random Array of Molecules, Ph.D. Thesis, Rice University, 2002.

Each nanocell will have a different arrangement of particles and, as a result, possibly different functionality. Before a nanocell can be used in a circuit, its possible set of functionalities must be determined, and the cell must then be programmed to implement the desired functionality. Husband [Hus02] proposes a genetic algorithm which can be used to find the functionality of each cell and to train the cell to implement one of its possible functions. The algorithm depends on being able to set the state of each molecule individually. Fig. 55 shows an example of a nanocell that was trained to implement a NAND gate. Once the cells are programmed, they need to be connected. While some nanocells have been shown in simulation to implement restoring logic, they do not provide I/O isolation. Therefore, this architecture requires that the interconnect in a circuit be implemented at least partially in CMOS. Alternatively, molecular latches can be used to restore signals and provide I/O isolation, as discussed in Section 8.3.

The nanocell is defect tolerant but results in rather poor logic density. It is by nature defect tolerant in that no predetermined functionality is assumed; rather, the functionality is discovered. However, whether the functionality needed to implement a useful circuit can be found in a reasonable amount of time is an open question. The density of a nanocell-based computer is determined by the size and functionality of the nanocell, the supporting interconnect, and any logic needed for signal restoration or I/O isolation. Even if the supporting logic takes no area and the interconnect is minimal, the density is low because the I/O pins of the nanocell are fabricated using lithography. If the lithographic pitch is $p$, then a nanocell with 20 I/O pins (five per side) occupies roughly $25p^2$ units of area, while a NAND gate implemented in CMOS requires only a few $p^2$. In other words, a nanocell implementing four NAND gates will be only slightly denser than CMOS. CMOS has the added advantage that no static power is consumed. If NMOS-style circuits are used, then the lithographic approach is significantly denser than the nanocell. If nanocells can implement more complicated functions (e.g., a half-adder), the nanocell may enjoy a slight density advantage.
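
The arithmetic behind this comparison, with the pin count from the text and the gate areas as our assumptions:

```python
# Nanocell vs. CMOS density, in units of the squared lithographic pitch p.
pins = 20
pins_per_side = pins // 4            # 5 lithographic-pitch pins per side
nanocell_area = pins_per_side ** 2   # at least 25 p^2, set by the perimeter
cmos_nand_area = 7                   # assumed: a CMOS NAND takes a few p^2

print(nanocell_area, "p^2 for the nanocell")
print(4 * cmos_nand_area, "p^2 for four CMOS NAND gates")  # comparable area
```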

The nanocell architecture is an example of a molecular electronic architecture that presupposes only the simplest assembly primitives. It is defect tolerant and reconfigurable. However, it requires significant programming effort to overcome the randomness inherent in its fabrication, and it requires significant micronscale functionality during circuit operation; in particular, it requires micronscale wires, and potentially micronscale devices, to pass signals from one nanocell to another.

9.3 Quasi-Regular Approaches

We have already seen an instance of a quasi-regular architecture, the nanoFabric. It assumes that the fabrication primitives can be used to create two-dimensional meshes of wires. It further assumes that some kind of active device can be made at the intersection of two wires and that the wires can be bonded to lithographically created contacts. It does not assume that the wires in a mesh can be deterministically ordered. In spite of these seemingly simple primitives, the nanoFabric and other architectures in this class are surprisingly powerful. Here we briefly discuss an additional example from this class.

DeHon [DeH02] proposes an architecture based around arrays of silicon nanowire FETs. The arrays are arranged so that all signals in a circuit can be transmitted over nanoscale wires. At least some of the active elements in the arrays are reconfigurable, which supports defect tolerance and the creation of circuits after the device is fabricated. The nanoscale components are surrounded by lithographically created wires and circuits that are used only to support reconfiguration and to supply global signals (e.g., clock, power, and ground).

Figure 56: A nanoscale array surrounded by four decoders and the micronscale wiring necessary to program it. Reprinted with permission from [DeH02], A. DeHon, in “Proceedings of the First Workshop on Non-Silicon Computation,” 2002.

Fig. 56 shows a possible fundamental building block of this architecture. Each nanoscale array is surrounded by a set of four decoders. The decoders connect the nanoscale wires to address lines that are micronscale; the address lines allow any individual nanowire to be selected. By selecting a particular row and column, the switch at the cross-point can be configured. The decoders can also be used to provide pull-up or pull-down resistors for either diode-resistor logic or NMOS-style transistor logic. The decoders allow $O(\log_2 N)$ micronscale wires to address $N$ nanoscale wires. The total area for an array is therefore approximately $(N p_{nano} + 2A p_{litho})^2$, where $N$ is the number of wires in each dimension of the nanoscale array, $A$ is the number of address lines used to address the wires in the nanoscale array ($A \ge \log_2 N$), $p_{nano}$ is the pitch of the nanoscale wires, and $p_{litho}$ is the lithographic pitch. In this arrangement, with perfect decoders (i.e., $A = \log_2 N$) and a 10:1 ratio between the lithographic pitch and the nanowire pitch, the support logic takes less than half the area when $N$ reaches several hundred. Either nanoimprinting or self-assembly (see Section 7.3) can be used to construct the decoders. If fault tolerance is needed, or if the self-assembly method described in [WK01] is used, then $A$ will be several times $\log_2 N$, requiring arrays of several thousand nanoscale wires before the overhead is less than half.
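
A numerical check of this crossover point, under the assumed form of the area formula (two decoders per dimension, perfect decoders, 10:1 pitch ratio):

```python
# Fraction of the array area devoted to decoder support vs. array size N.
import math

def support_fraction(n, pitch_ratio=10):
    a = math.ceil(math.log2(n))              # perfect decoder: A = log2(N)
    side = n + 2 * a * pitch_ratio           # in units of the nanowire pitch
    return 1.0 - (n / side) ** 2

for n in (64, 128, 256, 512, 1024):
    print(n, round(support_fraction(n), 2))  # drops below 0.5 near N ~ 500
```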

Figure 57: Arrangement of NOR-only planes. Reprinted with permission from [DeH02], A. DeHon, in “Proceedings of the First Workshop on Non-Silicon Computation,” 2002.

When we compare the density of this approach to that of the nanoFabric, it is important to look beyond the raw density of nanoscale cross-points. The nanoFabric also devotes approximately 50% of its area to support logic when $N$ is several hundred. However, the single-wire demux would probably not scale to so large an array. The more important issue is whether the large arrays can be effectively utilized when logic is mapped to them. If a nonrestoring logic is used, then very few of the cross-points in either array would be used, significantly reducing the effective density of the nanoscale portion of the architecture. Therefore, several factors need to be considered. First, does the micronscale-to-nanoscale addressing mechanism scale to the desired array size? Second, can the nanoscale array be effectively used to implement logical functions? Finally, will the array be sufficiently fault tolerant? We discuss this final issue in Section 10.

In order to implement logic in DeHon's architecture, the cross-points must be configurable. DeHon suggests two possible approaches. If the nanowire FETs are programmable, then the array shown in Fig. 56 can be replicated in groups of 16.

Fig. 57 shows the nanoscale portions of the arrays. The decoders used to program the configuration bits serve double duty, acting as either pull-up or pull-down resistors when the circuit is running. The black arrows show how the outputs from one set of NORs can be fed in as inputs to the orthogonal set of NOR wires. Take, for example, the second column from the right: it has pull-up resistors at the top, then four arrays running vertically, and finally a group of pull-down resistors at the bottom. A nanowire FET running from the pull-up to the pull-down implements a NOR gate. The inputs to the gate come from the top left or the top right; the outputs go either to the bottom right or the bottom left. Therefore, there is substantial flexibility in routing to and from each gate. Furthermore, all the connections between gates use nanoscale wires.

If the nanowire FETs are not configurable, then half the arrays can be replaced with configurable diodes that implement an OR plane. This can be combined with the nonconfigurable NOR gates to enable a PAL-like approach to creating logic: the OR gates will be programmable, while the NOR gates provide I/O isolation, inversion, and signal restoration.

This architecture, like the nanoFabric, assumes a moderate amount of precision and requires a moderate amount of postfabrication computation in order to construct useful computing devices. There are several other architectures in this class. In 1998, a team from HP and UCLA proposed a defect-tolerant architecture roughly based on the Teramac [HKSW98]; their overall approach was the inspiration behind the nanoFabric. A team at AFRL proposed an architecture that is a cross between the quasi-regular and the deterministic approaches: at the micronscale it is regular, but at the nanoscale it requires some arbitrary patterning [LDK01, LDK00].

9.4 Deterministic Approach

As one would expect, there are few proposed architectures that require complete deterministic control at the nanoscale. A team at Mitre made early proposals for multifunctional molecules. Their approach is to exploit the huge number of possible molecules and to design molecules that can act as AND gates or even half-adders [EL00].

A completely different approach is found in the literature on quantum cellular automata (QCA). QCAs move information around using charge instead of current [Por98, NK01, FRJ+02]. There are many advantages to using QCAs, particularly with respect to power. However, an individual cell in a QCA provides no isolation or gain. The designers of QCA systems address this problem by adding a system clock, which provides isolation and signal restoration. Finally, the arrangement of the cells in a QCA, with respect to each other and to the clock, will require great precision. It does not appear that QCA circuits offer any advantage in manufacturability over CMOS. Recently, however, there have been proposals for reconfigurable QCAs [NRK02] that could possibly reduce the amount of manufacturing precision required.

9.5 Architectural Constraints

In summary, all successful architectural approaches for molecular electronics must contend with atomic-scale effects. The small size of the individual components guarantees that there will be some variability, some defects, and some constraints on the kinds of patterns that can be economically created. In addition, interfacing the atomic-scale electronics to the outside world implies a size-matching problem. Finally, the small size also entails huge numbers of devices, which raises issues of scalability.

10 Defect Tolerance

The nanoFabric is defect tolerant because it is regular, highly configurable, fine-grained, and has a rich interconnect.33 The regularity allows us to choose where a particular function is implemented. The configurability allows us to pick which nanowires, nanoBlocks, or parts of a nanoBlock will implement a particular circuit. The fine-grained nature of the device, combined with the local nature of the interconnect, reduces the impact of a defect to only a small portion of the fabric (or even a small portion of a single nanoBlock). Finally, the rich interconnect allows us to choose among many paths in implementing a circuit. Therefore, with a defect map we can create working circuits on a defective fabric. Defect discovery relies on the fact that we can configure the nanoFabric to implement any circuit, which implies that we can configure the nanoFabric to test its own resources.

The key difficulty in testing the nanoFabric (or any FPGA) is that it is not possible to test the individual components in isolation. Researchers on the Teramac project [ACC+95] faced similar issues. They devised an algorithm that allowed the Teramac, in conjunction with an outside host, to test itself [CAC+97, CAC+96]. Despite the fact that over 75% of the chips in the Teramac have defects, the Teramac is still used today. The basis of the defect-mapping algorithm is to configure a set of devices to act as tester circuits. These circuits (e.g., linear-feedback shift registers) report a result which, if correct, indicates with high probability that the devices they are made from are fault-free.

In some sense, the Teramac introduces a new manufacturing paradigm, one which trades off complexity at manufacturing time for postfabrication programming. The reduction in manufacturing-time complexity makes reconfigurable fabrics a particularly attractive architecture for CAEN-based circuits, since directed self-assembly will most easily result in highly regular, homogeneous structures. We expect that the fabrication process for these fabrics will be followed by a testing phase, during which a defect map will be created and shipped with the fabric. The defect map will then be used by compilers to route around the defects.

33Defect tolerance through configuration also depends on short circuits being significantly less likely to occur than stuck-open faults. The synthesis techniques should be biased to increase the likelihood of a stuck-open fault, at the expense of potentially introducing more total faults.

The key problem is then locating the defects: once their locations are known, routing around them should be straightforward. In general, any methodology for locating these defects should not require direct access to individual components and should scale slowly with both defect density and fabric size.

We outline here an approach to defect mapping that extends the techniques used for the Teramac. While conceptually similar to these earlier techniques, the approach described here makes some key contributions that enable the mapping of the much higher defect rates expected in CAEN-based fabrics.

Modern DRAM and SRAM chips and FPGAs are able to tolerate some defects by having redundancy built into them: for instance, a row containing a defect might be replaced with a spare row after fabrication. With CAEN fabrics, this will not be possible, since it is unlikely that a row or column of any appreciable size will be defect-free. Moreover, CAEN-based devices are being projected as a replacement not just for memories but for logic as well, where simple row replacement will not work, since logic is less regular.

There is a large body of work in statistics and information theory on techniques for finding subsets of a population all of whose members satisfy a given property (in our case, this translates to finding defect-free devices from among a large number, some of which may be defective). Various flavors of this technique, called group testing, have been applied to a variety of problems [Dor43, SG59, Wol85, KBT96]. However, none has constraints as demanding as those of CAEN-based assembly.

Similar problems have been addressed in the domain of custom computing systems. For example, the PipeRench reconfigurable processor [SKG00], and more notably the Teramac custom computer [CAC+97, HKSW98], had notions of testing, defect mapping, and defect avoidance built into them. Assembly was followed by a testing phase in which the defects in the FPGAs were identified and mapped. Compilers for generating FPGA configurations then used this defect map to avoid the defects. The testing strategy we outline is similar to the one used for the Teramac. However, the problem we address is significantly harder, because the Teramac used CMOS devices, whose defect rates are much lower than those predicted for nanoFabrics.

An alternative approach to achieving defect tolerance would be to use techniques developed for fault-tolerant circuit design (e.g., [Pip90, Spi96]). Such circuit designs range from simple ones involving triple-mode redundancy to more complex circuits that perform the computation in an alternative, sparse code space, so that a certain number of errors in the output (up to half the minimum distance between any two code words) can be corrected. Such techniques work reliably only if the number of defects is below a certain hard threshold, which CAEN is likely to exceed. Furthermore, even the best fault-tolerance techniques available today require a significant amount of extra physical resources, and they also result in a non-negligible slowdown of the computation.
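
The simplest of these techniques, triple-mode redundancy, is just a majority vote over three copies of the computation; a minimal sketch:

```python
# Majority voter for triple-mode redundancy: the output is correct as long
# as at most one of the three replicated copies is faulty.
def vote(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

print(vote(1, 1, 0))   # -> 1: the single faulty copy is outvoted
```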

10.1 Methodology

One approach to defect detection in molecular-scale reconfigurable circuits consists of configuring the components on the fabric34 into test circuits, which can report on the presence or absence of defects in their constituent components. Each component is made a part of many different test circuits, and information about the error status of each of those circuits is collected. This information is used to deduce and confirm the exact locations of the defects.

As an example, consider the situation in Fig. 58. Five components are configured into one test circuit that computes a simple mathematical function. This function is such that defects in one or more circuit components would cause the answer to diverge from the correct value. Therefore, by comparing the circuit’s output with the correct answer, the presence or absence of any defects in the circuit components can be detected.

[34] We are deliberately leaving the meaning of “component” unspecified. It will depend on the final design of the fabric: a component may be one or more simple logic gates, or a look-up table implementing an arbitrary logic function. The on-fabric interconnect will also consist of “components” in this sense, since it too may be defective.


Figure 58: An example showing how a defective component is located using two different test circuit configurations. The components within one rectangular block are part of one test circuit.

In the first run, the components are configured vertically and test circuit 2 detects a defect. In the next run, the components are configured horizontally and test circuit 3 fails. Since no other errors are detected, we can conclude that the component at the intersection of these two circuits is defective, and all others are good.
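A minimal sketch of this cross-referencing idea, assuming an idealized fabric in which every row and every column can be configured as one pass/fail test circuit (the fabric model and function name are illustrative, not the actual nanoFabric test interface):

    # Toy model of Fig. 58: a test circuit fails iff it contains at
    # least one defective component; defects are (row, col) pairs.
    def locate(defects):
        failing_rows = {r for (r, c) in defects}
        failing_cols = {c for (r, c) in defects}
        # Candidate defects lie at intersections of failing circuits;
        # with at most one defect per row and per column the
        # diagnosis is exact.
        return {(r, c) for r in failing_rows for c in failing_cols}

    print(locate({(1, 2)}))          # exactly localized: {(1, 2)}
    print(locate({(0, 0), (1, 1)}))  # ambiguous: four candidates

The second call shows why pass/fail circuits alone break down as defect counts rise, which motivates the counting circuits discussed below.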

Since the tester will not have access to individual fabric components, the test circuits will be large, consisting of tens and perhaps even hundreds of components. With high defect rates, each circuit will on average have multiple defective components, which considerably complicates the simple picture presented in the example above. In particular, test circuits that give information only about the presence or absence of defects (such as the ones used above) will be useless: almost every test circuit will report the presence of defects. The key idea we use here is more powerful test circuits: circuits that return more information about the defects in their components, such as a count of the defects. An example would be a circuit that computes a mathematical function whose output will deviate from the correct value if any of the circuit’s components are defective. If the amount of this deviation depends deterministically on the number of defective components, then a comparison of the circuit’s output with the correct result tells us the number of defects present in the circuit.

Using such defect-counting circuits, Mishra and Goldstein [MG02] propose splitting the process of defect mapping into two phases: a probability-assignment phase and a defect-location phase. The probability-assignment phase attempts to separate the components in the fabric into two groups: those that are probably good and those that are probably bad. The former will have an expected defect density low enough that, in the defect-location phase, we can use circuits returning 0-1 information about the presence of defects to pinpoint them.

The first phase, that of probability assignment, works as follows:

1. The components are configured into test circuits in many different orientations (for example, vertically, horizontally, and diagonally), and defect counts for all the test circuits are noted.

2. Given these counts for all the circuits, simple Bayesian analysis is used to find the probability that any particular component is good or bad (see the sketch after this list).


3. The components with a low probability of being good are discarded, and this whole process is repeated. This is done a small number of times, so that as many of the defective components as possible are eliminated.
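A minimal sketch of what such a Bayesian update could look like for one component, assuming independent defects at density prior and exact (non-saturating) defect counters; the model and names are ours, not a specification of the analysis in [MG02]:

    from math import comb

    def binom_pmf(n, k, p):
        # Probability of exactly k defects among n independent components.
        return comb(n, k) * p**k * (1 - p)**(n - k) if 0 <= k <= n else 0.0

    def prob_bad(counts, size, prior=0.1):
        # counts: defect counts reported by the test circuits that
        # contain this component; size: components per test circuit.
        like_bad = like_good = 1.0
        for k in counts:
            like_bad *= binom_pmf(size - 1, k - 1, prior)  # others supply k-1
            like_good *= binom_pmf(size - 1, k, prior)     # others supply all k
        num = prior * like_bad
        return num / (num + (1 - prior) * like_good)

    # A component appearing in three circuits that report 3, 2, and 4
    # defects out of 20 components each:
    print(prob_bad([3, 2, 4], size=20))

Thresholding this posterior gives the probably-good/probably-bad split used above.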

At the end of this process, two groups of components are obtained: those with a high probability of being good and those with a high probability of being bad. The latter are discarded. In the former, the fraction of faulty components is expected to be at such a level that a significant number of defect-free circuits can be found. Therefore, the second phase, that of defect location, proceeds as follows:

1. Test circuits providing 0-1 defect information are run on the reduced set of components. If a circuit is found to be defect-free, all its components are marked as good. This is done again for a number of test-circuit orientations.

2. For a higher yield, some of the components discarded in the probability-assignment phase can be added back and step 1 repeated.

3. Finally, all the components marked as good are declared good, and the others are declared bad.
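The defect-location phase is easy to express in the same toy model; note how the code exhibits the one-sided error discussed in point 2 below (a circuit containing a bad component always fails, so a bad component can never be marked good). Again a hedged sketch, not the real tester interface:

    import random

    def defect_location(is_defective, circuits):
        # circuits: lists of component ids, each configured as one
        # pass/fail (0-1) test circuit over the reduced component set.
        marked_good = set()
        for circuit in circuits:
            if not any(is_defective[c] for c in circuit):  # circuit passes
                marked_good.update(circuit)                # members are good
        return marked_good

    comps = list(range(100))                        # a 10 x 10 toy fabric
    bad = {c: random.random() < 0.02 for c in comps}
    rows = [comps[i:i + 10] for i in range(0, 100, 10)]
    cols = [comps[i::10] for i in range(10)]
    good = defect_location(bad, rows + cols)
    assert not any(bad[c] for c in good)            # never a false "good"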

There are a number of points to note here:

1. The quality of the results will depend on the quality of the test circuits. In particular, we assume that our counter circuits can count defects only up to a certain threshold (this will likely be the case, for example, if error-correction techniques such as Reed-Solomon codes [RS60] are used for test circuit design). Therefore, a higher threshold will correspond to better-quality results.

2. A number of good components may be marked bad, particularly if a low-threshold test circuit is used. This is the waste of the procedure. Note, however, that it is never the case that a bad component is marked good.

3. With the testing strategy as described above, it will not be possible to completely determine the test circuits a priori, and some feedback-dependent configuration on the part of the tester will be required. The feedback will be limited to routing the predetermined test-circuit configurations around the discarded components.

Figure 59: Yields achieved by varying the defect densities and the number of defects our test circuits could count. The X axis represents the defect density of the fabric, the Y axis shows the yield achieved (in other words, the fraction of the fabric’s defect-free components that are identified as such), and each line represents a counter that can count defects up to a different threshold.


To test the effectiveness of this procedure and to measure the impact of the defect-counting threshold on the results, we ran a number of simulations; the results are presented in Fig. 59. From these results, it is apparent that it is possible to achieve high yields even with test circuits that can count only a small number of defects. For example, for defect densities below 10%, a test circuit that could count up to four errors achieved yields of over 80%. With more powerful test circuits, yields of over 95% are achievable.


Figure 60: A schematic representation of how testing will proceed in a wavelike manner through the fabric. The black area is tested and configured as a tester by the external tester. Each darker shaded area then tests and configures a lighter-shaded neighbor. For large fabrics, multiple such waves may grow out from different externally tested areas.

10.2 Scaling with Fabric Size

A short testing time is crucial to maintaining the low cost of CAEN fabrics, so it must be ensured that testing proceeds quickly and needs minimal support from an external tester. Our testing strategy currently requires time proportional to n for testing a fabric of n × n components if a test circuit of size n is used. To speed testing up even further, the reconfigurability of the fabric can be leveraged in the following ways:

1. Once a part of the fabric is tested and defect-mapped, it can be configured to act as a tester for the other parts, drastically reducing the time required on the external tester.

2. Once the tester is configured onto the fabric, there is nothing to prevent us from having multiple testers active simultaneously. In such a scenario, each tested area tests its adjacent ones, and the testing can proceed in a wave through the fabric (see Fig. 60).

Therefore, the reconfigurability of the fabric significantly reduces both the time on the external tester and the total testing time.
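A toy model of the wave of Fig. 60 makes the speedup concrete. Assume the fabric is divided into square sections and that, in each time unit, every already-tested section can test its untested neighbors (the grid model and unit-time assumption are ours, purely for illustration):

    from collections import deque

    def wave_rounds(side, seeds=((0, 0),)):
        # Breadth-first wave over a side x side grid of sections:
        # each round, every tested section tests its untested neighbors.
        tested, frontier, rounds = set(seeds), deque(seeds), 0
        while len(tested) < side * side:
            rounds += 1
            nxt = deque()
            for (x, y) in frontier:
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    n = (x + dx, y + dy)
                    if 0 <= n[0] < side and 0 <= n[1] < side and n not in tested:
                        tested.add(n)
                        nxt.append(n)
            frontier = nxt
        return rounds

    print(wave_rounds(8))                          # one corner wave: 14 rounds
    print(wave_rounds(8, seeds=((0, 0), (7, 7))))  # two waves: 7 rounds

The number of rounds grows with the side length of the fabric rather than with its area, and seeding several waves from different externally tested regions divides it further.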

Once a defect map has been generated, the fabric can be used to implement arbitrary circuits. The architecture of the nanoBlock supports full utilization of the device even in the presence of a significant number of defects. Because of the way we map logic to wires and switches, only about 20% of the switches will be in use at any one time. Since the internal lines in a nanoBlock are completely interchangeable, we should generally be able to arrange for the switches that need to be configured in the ON state to be on wires that avoid the defects.
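A sketch of why interchangeability makes this easy, assuming (as a simplification of the nanoBlock) that any logical signal may be carried by any defect-free internal line:

    def assign_lines(signals, defective_lines, n_lines):
        # Map each logical signal to some defect-free physical line;
        # with ~20% utilization, spare lines are normally plentiful.
        free = [l for l in range(n_lines) if l not in defective_lines]
        if len(signals) > len(free):
            raise ValueError("not enough defect-free lines")
        return dict(zip(signals, free))

    # Four signals on a block with 10 lines, lines 2 and 7 defective:
    print(assign_lines(["a", "b", "c", "d"], {2, 7}, 10))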

While the molecules are expected to be robust, new defects will inevitably appear over time. Finding these defects, however, will be significantly easier than the original defect mapping, because the density of unknown defects will be very low.

11 Using Molecular Architectures

There are two scenarios in which nanoFabrics can be used: as factory-programmable devices configured by the manufacturer to emulate a processor or other computing device, and as reconfigurable computing devices.

In a manufacturer-configured device, user applications treat the device as a fixed processor (or potentially as a small number of different processors). Processor designers will use traditional CAD tools to create designs using standard cell libraries. These designs will then be mapped to a particular chip, taking into account the chip’s defects. A finished product is therefore a nanoFabric chip and a ROM containing the configuration for that chip. In this mode, the configurability of the nanoFabric is used only to accommodate a defect-prone manufacturing process. While this provides the significant benefits of reduced cost and increased densities, it ignores much of the potential in a nanoFabric. Since defect tolerance requires that a nanoFabric be reconfigurable, why not exploit the reconfigurability to build application-specific processors?

Reconfigurable fabrics offer high performance and efficiency because they can implement hardware matched to each application. Further, the configurations are created at compile time, eliminating the need for complex control circuitry.


Research has already shown that the ability of the compiler to examine the entire application gives a reconfigurable device efficiency advantages because it can:

- exploit all of an application’s parallelism: task-based, data, instruction-level, pipeline, and bit-level,

- create customized function units,

- eliminate a significant amount of control circuitry,

- reduce memory bandwidth requirements,

- size function units to the application’s natural word size,

- use partial evaluation and constant propagation to reduce the complexity of operations.

However, this extra performance comes at the cost of significant work by the compiler. A conservative estimate of the number of configurable switches in a 1 cm² nanoFabric, including all the overhead for buffers, clock, power, etc., is on the order of 10^11. Even assuming that a compiler manipulates only standard cells, the complexity of mapping a circuit design to a nanoFabric will be huge, creating a compilation scalability problem. In particular, traditional approaches to place-and-route will not scale to fabrics with billions of wires and devices.

In order to exploit the advantages listed above, we propose a hierarchy of abstract machines that will hide complexity and provide intellectual leverage for compiler designers while preserving the advantages of reconfigurable fabrics. At the highest level is a split-phase abstract machine (SAM), which allows a program to be broken up into autonomous units. Each unit can be individually placed and routed, and the resulting netlist of pre-placed-and-routed units can then itself be placed. This hierarchical approach will allow the CAD tools to scale.
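A back-of-the-envelope sketch of why the hierarchy helps, assuming (purely for illustration) a placement heuristic whose cost grows quadratically in the number of objects placed:

    def flat_cost(n_cells):
        # Cost of placing every cell in one flat problem.
        return n_cells ** 2

    def hierarchical_cost(n_units, cells_per_unit):
        # Place each unit's cells independently, then place the
        # netlist of pre-placed-and-routed units.
        return n_units * cells_per_unit ** 2 + n_units ** 2

    print(flat_cost(1000 * 1000))         # 10**12 steps
    print(hierarchical_cost(1000, 1000))  # 10**9 + 10**6 steps

Splitting a million-cell design into a thousand thousand-cell units cuts the (modeled) work by roughly three orders of magnitude, at the price of the partitioning itself.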

11.1 Split-Phase Abstract Machine

The compilation process starts by partitioning the application into a collection of threads. Each thread is a sequence of instructions ending in a split-phase operation. An operation is deemed split-phase if it has an unpredictable latency; memory references and procedure calls, for example, are all split-phase operations. Each thread, similar in spirit to a threaded abstract machine (TAM) thread [CGSvE93], therefore communicates with other threads asynchronously using split-phase operations. This partitioning allows us to virtualize the hardware (by swapping threads in and out as necessary), allows the CAD tools to concentrate on mapping small isolated netlists, and provides all the mechanisms required to support thread-based parallelism.
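The split-phase style is easiest to see in software terms. The following toy scheduler is only an analogy (SAM threads are custom hardware, and these names are invented for illustration): a thread’s first phase ends by issuing an operation of unpredictable latency together with a continuation, and other threads run while the reply is outstanding:

    import collections

    ready = collections.deque()   # threads (callables) ready to run
    pending = []                  # [remaining_latency, continuation, value]

    def split_phase_read(memory, addr, latency, continuation):
        # Issue the read; the reply fires `continuation` when it arrives.
        pending.append([latency, continuation, memory[addr]])

    def scheduler():
        while ready or pending:
            for p in pending[:]:              # advance outstanding replies
                p[0] -= 1
                if p[0] == 0:
                    pending.remove(p)
                    ready.append(lambda k=p[1], v=p[2]: k(v))
            while ready:                      # run every ready thread
                ready.popleft()()

    memory = {0: 21}

    def thread_a():                           # thread A, first phase
        split_phase_read(memory, 0, latency=3, continuation=thread_a_resume)

    def thread_a_resume(value):               # thread A, second phase
        print("A: got", value * 2)

    def thread_b():
        print("B: runs while A waits")

    ready.extend([thread_a, thread_b])
    scheduler()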

Unlike a traditional thread model, where a thread is associated with a processor while executing, each SAM thread will be a custom “processor.” While it is possible for a thread to be complex and load “instructions” from its local store, the intention is that it remain fairly simple, implementing only a small piece of a procedure. This allows the threads to act either in parallel or as a series of sequential processes. It also reduces the number of timing constraints on the system, which is vital for increasing defect tolerance and decreasing compiler complexity.

The SAM model is a simplification of TAM. A SAM thread/processor is similar to a single-threaded codeblock in TAM, and memory operations in SAM are similar to memory operations in Split-C [CDG+93]. In a sense, SAM will implement (in reconfigurable hardware) active messages [vECGS92] for all interprocessor communications. This model is powerful enough to support multithreading. As the compiler technology becomes more mature, the inherently parallel nature of the model can be exploited.

While SAM can support parallel computation, a parallelizing compiler is not necessary. The performance of this model rests on the ability to create custom processors. A compiler could (and in its first incarnations will) construct machines in which only one processor is active at a time.

This also creates the opportunity to virtualize the entire system. Imagine each configuration of a processor as an ultrawide instruction that can be loaded as needed to virtualize the hardware. The bandwidth to load these instructions is available because CAEN devices can implement very dense customized memories.



The SAM model explicitly hides many important details. For example, it addresses neither dynamic routing of messages nor allocation of stacks to the threads. Once an application has been turned into a set of cooperating SAM threads, it is mapped to a more concrete architectural model that takes these issues into account. The mapping process will, when required, assign local stacks to threads, insert circuits to handle stack overflow, and create a network for routing messages whose addresses are computed at run time. Messages with addresses known at compile time will be routed directly as signals.

12 Conclusions

Recent advances in molecular electronics, combined with increased challenges in semiconductor manufacturing, create a new opportunity for computer architects: the opportunity to recreate computer architecture from the ground up. New device characteristics require us to rethink the basic abstraction of the transistor. New fabrication methods require us to rethink the basic circuit abstraction. The scale of the devices and wires allows us to rethink our basic approach to designing computing systems. On the one hand, the scale enables huge computing systems with billions of components. On the other hand, the scale forces us to rethink the meaning of a working system; it must be a reliable system made from unreliable components.

The main challenge facing designers today is in reexamining and redefining the core abstractions that have enabled us to successfully design and implement computers. One approach to computer architecture is to view it as a hierarchy of abstractions. The bottom layer of one possible hierarchy is the device, currently the transistor. Built on the device layer is the abstraction of the logic gate (e.g., a NAND gate). On top of this we have circuits, and then function units (e.g., adders and state machines). Above function units, we have the instruction set architecture, which defines the interface between the hardware and the software in a computing system. Above these layers are computing models, algorithms, and programming languages. A hierarchy of abstractions implies a contract between layers in the hierarchy. The contract allows one to change the internals of any particular layer without adversely affecting the entire system. It allows many people to work simultaneously on advancing the state of the art. The advances in computing power over the last few decades are an excellent example of the effectiveness of this hierarchy of abstractions: almost without exception, advances have been made without having to break the abstractions.

On the other hand, sometimes one must intentionally break the contracts, either by piercing the abstractions or by creating new ones. When done carefully, the results can be dramatic. Molecular computing may be a case where a reexamination of the layers of abstraction is required. For example, as we have shown earlier, the fundamental device abstraction may be too rigid for nanometer-scale devices. Decomposing the device abstraction into four components, a switch, an isolator, a restorer, and a memory (SirM), provides device designers and circuit designers alike with a new degree of freedom.

SirM is an example of the easiest kind of change to the abstraction hierarchy. It can be thought of as a new layer between the new physical devices and today’s abstraction of a transistor. In other words, a “transistor”-like contract can be established by combining the proper switch, isolator, and restorer devices. We are therefore able to maintain the current abstraction, providing backward compatibility for tools and ideas, while retaining the opportunity to pierce the transistor abstraction when necessary. This gives the design community an incremental path toward harnessing the underlying devices directly. Furthermore, it does not change any of the contracts with other layers in the hierarchy.

A more difficult change to the abstraction hierarchy is wrought by the level of manufacturing defects and transient faults expected to arise from using nanometer-scale devices. It is probably not reasonable to maintain the defect- and fault-free clause in the contract expected from today’s devices and circuits.


We have argued in this chapter that exposing the defects in the manufacturing process to the higher levels in the hierarchy may reduce the total cost of the system. It may also be significantly easier to mask defects and faults from the end user if each layer is expected to do some of the work. With respect to defect tolerance, a combined attack at the levels of circuits, architecture, and tools may be the only scalable approach. Reconfigurable architectures allow one to configure a hardware system to implement a particular circuit. By avoiding the faulty components, one may create reliable systems from unreliable components. This requires changes in architecture, circuit design, testing, and the tools used to create circuits from an engineer’s description.

One way to reduce the impact of a change in the abstraction hierarchy is to encapsulate it in a tool. For example, one can present the abstraction of a fault-free manufacturing process if the tools that convert circuit descriptions to circuit layouts implement all the changes necessary to achieve defect tolerance. By encapsulating the details in a tool, we obtain the intellectual leverage of hiding the implementation details while at the same time obtaining the advantage of piercing the abstraction layer.

We believe that successfully harnessing the power of molecular computing requires us to rethink several key abstractions: the transistor, the circuit, and the ISA. As we noted earlier, we feel that the abstraction of the transistor should be replaced by SirM. The abstraction that a user’s desired circuit is created at manufacturing time needs to be replaced by the ability to configure circuits at run time. Finally, the abstraction of an ISA needs to be replaced by a more flexible hardware/software interface implemented in the compiler.

In just the last couple of years, there has been a convergence in molecular-scale architectures toward reconfigurable platforms. In the nanoFabric, the main computing element is a molecular-based reconfigurable switch. We exploit the reconfigurable nature of the nanoFabric to provide defect tolerance and to support reconfigurable computing. Reconfigurable computing not only offers the promise of increased performance but also amortizes the cost of chip manufacturing across many users by allowing circuits to be configured after fabrication.

The molecular-based switch eliminates much of the overhead needed to support reconfiguration: the switch holds its own state and can be programmed without extra wires, making nanoFabrics ideal for reconfigurable computing.

The notion of an ISA has been very successful in hiding the details of a processor implementation from the user of the processor. However, it also limits the ability of a compiler to take full advantage of the processor. ISAs were created when computers were programmed by hand; they were therefore designed to be easily understandable by humans. Today, computers are programmed in high-level languages that compilers translate into the machine-level instructions that control the computer. There is therefore less need to make an ISA concise and easy to understand. We argue that harnessing the plentiful resources that will become available through molecular computing will require exposing more of the underlying computational structure to the compiler. In addition, by exposing more of the underlying architecture, it should be easier for the compiler to support increased levels of defect and fault tolerance.

In conclusion, we believe that computer architecture is presently faced with an incredible opportunity: the opportunity to harness the power of molecular computing to create computing systems of unprecedented performance per dollar. To realize this promise, we need to rethink the basic abstractions that comprise a computer system. Just as current systems benefited greatly from the underlying technology, the silicon-based transistor, new systems will benefit from the molecular reconfigurable switch. The switch can serve as the basis for a reconfigurable computing system that constructs customized circuits, tailored on the fly, for every application.

References

[ACA01] P. Avouris, P. Collins, and M. Arnold. Engineering carbon nanotubes and nanotube circuits using electrical breakdown. Science, 292(5517):706–709, 2001.


[ACC+95] R. Amerson, R. Carter, W. Culbertson, P. Kuekes, and G. Snider. Teramac–configurable custom computing. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 32–38, Napa, CA, April 1995.

[AHKB00] V. Agarwal, H. Hrishikesh, S. Keckler, and D. Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proceedings of the 27th International Symposium on Computer Architecture, pages 344–347, June 2000.

[Alt02] Altera Corporation. APEX device family. http://www.altera.com/html/products/apex.html, 2002.

[AN97] A. Nakajima et al. Room temperature operation of Si single-electron memory with self-aligned floating dot gate. Appl. Phys. Lett., 70(13):1742, 1997.

[AR74] A. Aviram and M. Ratner. Molecular rectifiers. Chemical Physics Letters, 29(2):277–283, Nov. 1974.

[AS62] I. Aleksander and R. Scarr. Tunnel devices as switching elements. J. British IRE, 23(43):177–192, 1962.

[ASH+99] M. Abramovici, C. Stroud, C. Hamilton, S. Wijesuriya, and V. Verma. Using roving STARs for on-line testing and diagnosis of FPGAs in fault-tolerant applications. In International Test Conference, pages 973–982, 1999.

[Aza00] K. Azar. The history of power dissipation. Electronics Cooling Magazine, 6(1):42–50, 2000.

[Bac78] J. Backus. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Communications of the ACM, 21(8):613–641, August 1978.

[BCKS94] R. Bissell, E. Cordova, A. Kaifer, and J. Stoddart. A chemically and electrochemically switchable molecular device. Nature, 369:133–137, May 12, 1994.

[Ben02] C. Bennett. Lecture at Carnegie Mellon, 2002.

[BHND01] A. Bachtold, P. Hadley, T. Nakanishi, and C. Dekker. Logic circuits with carbon nanotube transistors. Science, 294:1317–1320, November 2001.

[BMBG02] M. Budiu, M. Mishra, A. Bharambe, and S. C. Goldstein. Peer-to-peer hardware-software interfaces for reconfigurable fabrics. In Proceedings of the 2002 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2002.

[BR97] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. In Proceedings of the International Workshop on Field Programmable Logic and Applications, pages 213–222, August 1997.

[BSWG00] M. Budiu, M. Sakr, K. Walker, and S. C. Goldstein. BitValue inference: Detecting and exploiting narrow bitwidth computations. In Euro-Par 2000 Parallel Processing: 6th International Euro-Par Conference, volume 1900 of Lecture Notes in Computer Science. Springer Verlag, 2000.

[CAC+96] W. Culbertson, R. Amerson, R. Carter, P. Kuekes, and G. Snider. The Teramac custom computer: Extending the limits with defect tolerance. In Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 2–10, 1996.

[CAC+97] W. Culbertson, R. Amerson, R. Carter, P. Kuekes, and G. Snider. Defect tolerance on the Teramac custom computer. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pages 116–123, April 1997.

[Car63] J. Carroll. Tunnel-Diode and Semiconductor Circuits. McGraw-Hill, 1963.

[CDG+93] D. Culler et al. Parallel programming in Split-C. In Proceedings of the Supercomputing ’93 Conference, pages 262–273, Nov. 1993.

[CDHL00] Y. Cui, X. Duan, J. Hu, and C. Lieber. Doping and electrical transport in silicon nanowires. Journal of Physical Chemistry B, 104(22):5213–5216, 2000.

[CG01] R. Canal and A. González. Reducing the complexity of the issue logic. In Proceedings of the International Conference on Supercomputing (ICS-01), pages 312–320, June 2001.

[CGSvE93] D. Culler, S. Goldstein, K. Schauser, and T. v. Eicken. TAM — a compiler controlled threaded abstract machine. Journal of Parallel and Distributed Computing, 18:347–370, July 1993.

[CH99] K. Compton and S. Hauck. Configurable computing: A survey of systems and software. Technical report, Northwestern University, Dept. of ECE, 1999.

[Col] R. Collins. Dr. Dobb’s microprocessors resources: Intel secrets and bugs. http://www.x86.org/secrets/intelsecrets.htm.

[CR00] J. Chen and M. Reed. Molecular wires, switches and memories. In Proceedings of the International Conference on Molecular Electronics, Kailua-Kona, HI, Dec. 2000.

[CRRT99] J. Chen, M. A. Reed, A. M. Rawlett, and J. M. Tour. Observation of a large on-off ratio and negative differential resistance in an electronic molecular switch. Science, 286(5444):1550–1552, 1999.

[CWB+99] C. P. Collier et al. Electronically configurable molecular-based logic gates. Science, 285(5426):391–394, July 16, 1999.

[CWR+00] J. Chen et al. Room-temperature negative differential resistance in nanoscale molecular junctions. Applied Physics Letters, 77(8):1224–1226, 2000.

[DeH96] A. DeHon. Dynamically programmable arrays: A step toward increased computational density. In Proceedings of the 4th Canadian Workshop on Field-Programmable Devices, pages 47–54, May 1996.


[DeH99] A. DeHon. Balancing interconnect and computation in a reconfigurable computing array (or, why you don’t really want 100% LUT utilization). In Proceedings of the International Symposium on Field Programmable Gate Arrays, pages 125–134, February 1999.

[DeH02] A. DeHon. Array-based architecture for molecular electronics. In Proceedings of the First Workshop on Non-Silicon Computation (NSC-1), February 3, 2002.

[DMAA01] V. Derycke, R. Martel, J. Appenzeller, and P. Avouris. Carbon nanotube inter- and intramolecular logic gates. Nano Letters, 9(1):453–456, August 2001.

[Dor43] R. Dorfman. The detection of defective members of large populations. Annals of Mathematical Statistics, 14:436–440, Dec. 1943.

[Dre86] K. E. Drexler. Engines of Creation. Doubleday, New York, 1986.

[EL00] J. Ellenbogen and J. Love. Architectures for molecular electronic computers: 1. Logic structures and an adder designed from molecular electronic diodes. Proc. IEEE, 88(3):386–426, 2000.

[Enc02] New Encyclopaedia Britannica. Encyclopaedia Britannica, 2002.

[ES82] L. Esaki and E. Soncini, editors. Large Scale Integrated Circuits Technology, pages 728–746. Nijhoff, 1982.

[FRJ+02] S. E. Frost, A. F. Rodrigues, A. W. Janiszewski, R. T. Rausch, and P. M. Kogge. Memory in motion: A study of storage structures in QCA. In Proceedings of the First Workshop on Non-Silicon Computation, February 2002.

[GB01] S. C. Goldstein and M. Budiu. NanoFabrics: Spatial computing using molecular electronics. In Proceedings of the 28th International Symposium on Computer Architecture, pages 178–189, 2001.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[GPKR97] K. F. Goser, C. Pacha, A. Kanstein, and M. L. Rossmann. Aspects of systems and circuits for nanoelectronics. Proceedings of the IEEE, 85(4):558–573, April 1997.

[GSB+00] S. C. Goldstein et al. PipeRench: A reconfigurable architecture and compiler. IEEE Computer, 33(4):70–77, April 2000.

[GSM+99] S. C. Goldstein et al. PipeRench: a coprocessor for streaming multimedia acceleration. In Proceedings of the 26th International Symposium on Computer Architecture, pages 28–39, 1999.

[Ham50] R. W. Hamming. Error-detecting and error-correcting codes. Bell Systems Technical Journal, 29(2):147–160, 1950.

[Har01] R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE 2001), pages 642–649, Exhibit and Congress Center, Munich, Germany, March 2001.

[Hau98] S. Hauck. The roles of FPGAs in reprogrammable systems. Proceedings of the IEEE, 86(4):615–638, April 1998.

[HB67] D. Holmes and P. Baynton. A monolithic gallium arsenide tunnel diode construction. In A. Strickland, editor, Gallium Arsenide, number 3 in Conference Series, pages 236–240, London, 1967. Institute of Physics and Physical Society.

[HDC+01a] Y. Huang et al. Logic gates and computation from assembled nanowire building blocks. Science, 294(5545):1313–1317, November 9, 2001.

[HDC+01b] Y. Huang et al. Logic Gates and Computation from Assembled Nanowire Building Blocks. Science, 294(5545):1313–1317, 2001.

[HDW+01] Y. Huang, X. Duan, Q. Wei, and C. Lieber. Directed assembly of one-dimensional nanostructures into functional networks. Science, 291(5504):630–633, January 2001.

[HHP+01] S. Husband et al. The Nanocell Approach to a Molecular Computer. Moletronics PI Meeting, Santa Fe, NM, March 2001.

[Hip01] K. Hipps. It’s all about contacts. Science, 294(5542):536–537, October 19, 2001.

[HKSW98] J. R. Heath, P. J. Kuekes, G. S. Snider, and R. S. Williams. A defect-tolerant computer architecture: Opportunities for nanotechnology. Science, 280(5370), June 12, 1998.

[HLS99] S. Hauck, Z. Li, and E. Schwabe. Configuration compression for the Xilinx XC6200 FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(8):1107–1113, August 1999.

[HMH01] R. Ho, K. Mai, and M. Horowitz. The future of wires. Proceedings of the IEEE, 89(4):490–504, April 2001.

[HP96] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, 1996.

[Hus02] S. M. Husband. Programming the Nanocell, a Random Array of Molecules. PhD thesis, Rice University, April 2002.

[KBT96] E. Knill, W. J. Bruno, and D. C. Torney. Non-adaptive group testing in the presence of errors. Technical Report LAUR-95-2040, Los Alamos National Laboratory, Los Alamos, NM, September 1996.


[Kes97] S. Keshav. An Engineering Approach to Computer Networks. Addison-Wesley, 1997.

[Key85] R. W. Keyes. What makes a good computer device? Science, 230(4722):138–144, October 1985.

[KMM+01] N. I. Kovtyukhova et al. Layer-by-layer assembly of rectifying junctions in and on metal nanowires. J. Phys. Chem. B, 105:8762–8769, 2001.

[KMM+02] N. I. Kovtyukhova et al. Layer-by-layer self-assembly strategy for template synthesis of nanoscale devices. Mater. Sci. Eng. C, 19:255–262, 2002.

[KS86] S. R. Kunkel and J. E. Smith. Optimal pipelining in supercomputers. In Proceedings of the 13th International Symposium on Computer Architecture, pages 404–411, 1986.

[KWC+00] T. Kamins et al. Chemical vapor deposition of Si nanowires nucleated by TiSi2 islands on Si. Applied Physics Letters, 76(5):562–564, 2000.

[Lan57] R. Landauer. Spatial variation of currents and fields due to localized scatterers in metallic conduction. IBM Journal of Research and Development, 1(3), 1957.

[LC83] S. Lin and D. J. Costello. Error Control Coding: Fundamentals and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1983.

[LCH00] Z. Li, K. Compton, and S. Hauck. Configuration caching management techniques for reconfigurable computing. In IEEE Symposium on Field-Programmable Custom Computing Machines, pages 22–36, 2000.

[LDK00] J. Lyke, G. Donohoe, and S. Karna. Cellular automata-based reconfigurable systems as a transitional approach to gigascale electronic architectures. Journal of Spacecraft and Rockets, 39(4):489–494, July 2000.

[LDK01] J. Lyke, G. Donohoe, and S. Karna. Reconfigurable cellular array architectures for molecular electronics. Technical Report AFRL-VS-TR-2001-1039, AFRL, March 2001.

[Lil30] J. E. Lilienfeld. Method and apparatus for controlling electric currents. U.S. Patent 1,745,175, 1930.

[Man95] R. T. Maniwa. Global distribution: Clocks and power. Integrated System Design Magazine, 1995.

[MDR+99] B. Martin et al. Orthogonal self-assembly on colloidal gold-platinum nanorods. Advanced Materials, 11:1021–1025, 1999.

[Met99] R. Metzger. Electrical rectification by a molecule: The advent of unimolecular electronic devices. Acc. Chem. Res., 32:950–957, 1999.

[MG02] M. Mishra and S. C. Goldstein. Scalable defect tolerance for molecular electronics. In First Workshop on Non-Silicon Computing, Cambridge, MA, February 2002.

[MH86] S. McFarling and J. Hennessy. Reducing the cost of branches. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 396–403, June 1986.

[Mir00] C. Mirkin. Programming the Assembly of Two- and Three-Dimensional Architectures with DNA and Nanoscale Inorganic Building Blocks. Inorg. Chem., 39:2258–2272, 2000.

[MMDS97] M. V. Martinez-Diaz, N. Spencer, and J. Stoddart. The self-assembly of a switchable [2]rotaxane. Ang. Chem. Intl. Ed. Eng., 36:1904, 1997.

[MMK+02] J. K. N. Mbindyo et al. Template synthesis of metal nanowires containing monolayer molecular junctions. J. Am. Chem. Soc., 124:4020–4026, 2002.

[Moo65] G. E. Moore. Cramming more components onto integrated circuits. Electronics, April 1965.

[MRM+00] J. Mbindyo et al. DNA-Directed Assembly of Gold Nanowires on Complementary Surfaces. Advanced Materials, 13(4):249–254, 2000.

[MSS+99] R. Mathews et al. A New RTD-FET Logic Family. Proceedings of the IEEE, 87(4):596–605, 1999.

[MW00] Merriam-Webster. Webster’s Third New International Dictionary. Merriam-Webster, 2000.

[NF02] D. Nackashi and P. Franzon. Moletronics: A circuit design perspective. In Proceedings of the SPIE, volume 4236, 2002.

[NK01] M. T. Niemier and P. M. Kogge. Exploring and exploiting wire-level pipelining in emerging technologies. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 166–177. ACM Press, 2001.

[NPRG+02] S. R. Nicewarner-Pena, S. Raina, G. P. Goodrich, N. V. Fedoroff, and C. Keating. Hybridization and enzymatic extension of Au nanoparticle-bound oligonucleotides. J. Am. Chem. Soc., 124:7314–7323, 2002.

[NRK02] M. T. Niemier, A. F. Rodrigues, and P. M. Kogge. A potentially implementable FPGA for quantum dot cellular automata. In Proceedings of the First Workshop on Non-Silicon Computation, February 2002.

[Pee00] P. Peercy. The drive to miniaturization. Nature, 406(6799):1023–1026, 2000.

[PG97] C. Pacha and K. Goser. Design of arithmetic circuits using resonant tunneling diodes and threshold logic. In Proceedings of the 2nd Workshop on Innovative Circuits and Systems for Nanoelectronics, pages 83–93, Delft, NL, September 1997.


[Pip90] N. Pippenger. Developments in “The synthesis of reliable organisms from unreliable components”. Proceedings of Symposia in Pure Mathematics, 50:311–324, 1990.

[PLP+99] H. Park, A. Lim, J. Park, A. Alivisatos, and P. McEuen. Fabrication of metallic electrodes with nanometer separation by electromigration. Applied Physics Letters, 75(2):301–303, 1999.

[Por98] W. Porod. Towards nanoelectronics: Possible CNN implementations using nanoelectronic devices. In Fifth IEEE International Workshop on Cellular Neural Networks and their Applications, pages 20–25, London, England, April 1998.

[PRS+01] D. J. Pena et al. Electrochemical synthesis of multi-material nanowires as building blocks for functional nanostructures. MRS Symp. Proc., 636:D4.6.1–4.6.6, 2001.

[PSS98] A. Pnueli, M. Siegel, and E. Singerman. Translation validation. In Tools and Algorithms for the Construction and Analysis of Systems, TACAS ’98, volume 1384 of Lecture Notes in Computer Science, pages 151–166. Springer Verlag, 1998.

[PWW97] A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):24–38, 1997.

[QPN+02] T. Quarles, D. Pederson, R. Newton, A. Sangiovanni-Vincentelli, and C. Wayne. The SPICE page. http://bwrc.eecs.berkeley.edu/Classes/IcBook/SPICE/, 2002.

[RF93] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. Journal of Supercomputing, 7(1):9–50, January 1993.

[RG02] D. Rosewater and S. Goldstein. Digital logic using molecular electronics. In IEEE International Solid-State Circuits Conference, February 2002.

[RKJ+00] T. Rueckes et al. Carbon nanotube-based nonvolatile random access memory for molecular computing. Science, 289(5476):94–97, July 7, 2000.

[RS60] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics (J. SIAM), 8(2):300–304, 1960.

[RWB+96] S. Roweis et al. A sticker based model for DNA computation. In Proceedings of the 2nd Annual DIMACS Workshop on DNA Based Computers, DIMACS, 1996.

[SBV92] S. Brown, R. Francis, J. Rose, and Z. Vranesic. Field-Programmable Gate Arrays. Kluwer, 1992.

[Sem97] Semiconductor Industry Association. SIA Roadmap, 1997.

[Semte] Sematech. International Technology Roadmap for Semiconductors, 2000 Update.

[SG59] M. Sobel and P. A. Groll. Group testing to eliminate efficiently all defectives in a binomial sample. The Bell System Technical Journal, 28:1179–1224, September 1959.

[Sha48] C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27, 1948.

[SKCA96] C. Stroud, S. Konala, P. Chen, and M. Abramovici. Built-in self-test for programmable logic blocks in FPGAs (finally, a free lunch: BIST without overhead!). In Proceedings of the IEEE VLSI Test Symposium, pages 387–392, 1996.

[SKG00] S. K. Sinha, P. M. Kamarchik, and S. C. Goldstein. Tunable fault tolerance for runtime reconfigurable architectures. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 185–192, 2000.

[SNJ+00] P. A. Smith et al. Electric-field assisted assembly and alignment of metallic nanowires. Applied Physics Letters, 77(9):1399–1401, August 2000.

[Spi96] D. Spielman. Highly fault-tolerant parallel computation. In Proceedings of the 37th Annual IEEE Conference on Foundations of Computer Science, pages 154–163, 1996.

[SQM+99] Soh et al. Integrated nanotube circuits: Controlled growth and ohmic contacting of single-walled carbon nanotubes. Applied Physics Letters, 75(5), 1999.

[SS92] D. P. Siewiorek and R. S. Swarz. Reliable Computer Systems: Design and Evaluation. Digital Press, Bedford, MA, 1992.

[Sto] J. F. Stoddart. A [2]rotaxane pH controlled molecular switch. http://www.chem.ucla.edu/dept/Faculty/stoddart/research/mv.htm.

[SWHA98] C. Stroud, S. Wijesuriya, C. Hamilton, and M. Abramovici. Built-in self-test of FPGA interconnect. In Proceedings of the IEEE International Test Conference, pages 404–411, 1998.

[TDD+97] S. J. Tans et al. Individual single-wall carbon nanotubes as quantum wires. Nature, 386(6624):474–477, 1997.

[TEL95] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392–403, Santa Margherita Ligure, Italy, May 1995.

[TG99] R. R. Taylor and S. C. Goldstein. A high-performance flexible architecture for cryptography. In 1st International Workshop on Cryptographic Hardware and Embedded Systems (CHES), pages 231–245, 1999.

[vECGS92] T. v. Eicken, D. E. Culler, S. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer Architecture, pages 256–266, May 1992.


[vN45] J. v. Neumann. First draft of a report on the EDVAC. Contract No. W-670-ORD-492, Moore School of Electrical Engineering, Univ. of Penn., Philadelphia, June 1945. Reprinted (in part) in B. Randell, editor, Origins of Digital Computers: Selected Papers, Springer-Verlag, Berlin Heidelberg, 1982.

[vN56] J. v. Neumann. Probabilistic logics and synthesis of reliable organisms from unreliable components. In C. E. Shannon and J. McCarthy, editors, Automata Studies (Annals of Mathematical Studies), pages 43–98. Princeton University Press, 1956.

[VSS87] N. Viswanadham, V. Sarma, and M. Singh. Reliability of Computer and Control Systems, volume 8 of North-Holland Systems and Control Series. North Holland, 1987.

[WAM+02] S. Wind, J. Appenzeller, R. Martel, V. Derycke, and P. Avouris. Vertical scaling of carbon nanotube field-effect transistors using top gate electrodes. Applied Physics Letters, 80(20):3817–3819, 2002.

[WB02] R. Weiss and S. Basu. The device physics of cellular logic gates. In First Workshop on Non-Silicon Computing, February 2002.

[WK01] S. Williams and P. Kuekes. Demultiplexer for a Molecular Wire Crossbar Network. U.S. Patent 6,256,767, July 3, 2001.

[WKH01] S. Williams, P. Kuekes, and J. Heath. Molecular-Wire Crossbar Interconnect (MWCI) for Signal Routing and Communications. U.S. Patent 6,314,019, November 6, 2001.

[WLWS98] E. Winfree, F. Liu, L. Wenzler, and N. Seeman. Design and Self-Assembly of Two-Dimensional DNA Crystals. Nature, 394:539–544, 1998.

[Wol85] J. K. Wolf. Born again group testing: Multiaccess communications. IEEE Transactions on Information Theory, IT-31(2):185–191, March 1985.

[WT97] S.-J. Wang and T.-M. Tsai. Test and diagnosis of faulty logic blocks in FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 722–727, 1997.

[Xil02] Xilinx Corporation. Virtex series FPGAs. http://www.xilinx.com/products/virtex.htm, 2002.

[XRPW99] Y. Xia, J. Rogers, K. Paul, and G. Whitesides. Unconventional Methods for Fabricating and Patterning Nanostructures. Chem. Rev., 99:1823–1848, 1999.

[YMHB00] A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 225–235, June 2000.

[ZDRJI97] C. Zhou, M. R. Deshpande, M. A. Reed, L. Jones II, and J. M. Tour. Nanoscale metal/self-assembled monolayer/metal heterostructures. Appl. Phys. Lett., 71:611, 1997.
