
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

A Variation-Tolerant Replica-Based Reference-Generation Technique for Single-Ended Sensing in Wide Voltage-Range SRAMs

Viveka Konandur Rajanna, Student Member, IEEE, and Bharadwaj Amrutur, Senior Member, IEEE

Abstract— The most promising SRAM cells capable of operating over a wide range of supply voltages contain single-ended read ports. These systems require an external reference voltage that suitably scales to enable error-free operation of the memory, as the supply voltage is scaled. This paper presents a replica-based reference-generation technique for wide voltage range SRAMs. The proposed approach tracks the memory over the large range of supply voltages, and is tunable to extend functionality down to subthreshold voltages. In addition, a tunable delay-based timing-generation scheme is employed to enable memory functionality, in the presence of increased variation at subthreshold voltages. Configuration bits are set using a random-sampling-based Built-in Self-Test algorithm that significantly speeds up the tuning process. A 4-kb array, using the conventional 8T cell, implemented in the UMC 130-nm process, is demonstrated to function from 1.2 V down to 310 mV (at 1.3 MHz and 6.45 pJ/access). The memory consumes 0.115 pJ/bit/access at the energy optimum point of 400 mV.

Index Terms— Dynamic voltage scaling, internal reference generation, low-voltage SRAM, subthreshold memory, tunable delay line.

I. INTRODUCTION

THE demand for systems capable of enabling a wide spectrum of applications, ranging from streaming of high-definition videos to monitoring of critical biomedical signals, continues to increase. These applications require systems that can support variations in a workload of up to 500 times [1]. In order to extend battery life, it is important to minimize the power consumption in these applications. While some applications stretch the system performance even at nominal voltages, other applications need low operating frequencies, which is easily achievable at subthreshold voltages. Hence, these systems benefit greatly by having the capability to operate over a wide range of supply voltages, known as ultradynamic voltage scaling (U-DVS). This refers to systems capable of operating from nominal voltages down to subthreshold voltages [1]–[3].

Manuscript received December 29, 2014; revised April 3, 2015, May 15, 2015, June 26, 2015, and August 6, 2015; accepted August 7, 2015. This work was supported by the Department of Electronics and Information Technology, MCIT, Government of India.

The authors are with the Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012, India (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2015.2469596

The conventional static CMOS-based circuits and systems have been shown to function down to subthreshold voltages [4]–[6]. Furthermore, some modifications in circuit style allow functioning down to 62 mV [7]. However, enabling low voltage operation in memories, specifically SRAMs, has proved to be more challenging.

Several designs have been reported that use modifications in the SRAM cell architecture and/or assistance from peripheral circuitry to extend SRAM operation down to subthreshold voltages. SRAM cells containing additional transistors (as read-buffers) are used to decouple read and hold noise margins, thus allowing lower operating voltages [8]–[12]. Other cell modifications reported include upsizing transistors [13], the use of transistors with different threshold voltages [14], applying body-bias to modulate threshold voltage [15], capability to weaken feedback during write [11], [16], virtual/floating supply [15], virtual ground to reduce leakage on read bitline (BL) [9], and using a Schmitt-trigger-based cross-coupled inverter as the storage element [17]. Peripheral assist techniques, such as wordline (WL) [12], [18], [19] and BL [14] voltage modulation, virtual ground for read-sensing [10], and sense-amplifier (SA) tuning [9], [12], are used in conjunction with the above techniques to reduce the cell area at the cost of increased peripheral circuitry. However, some of the modifications that have been reported to extend the operation to lower supply voltages adversely affect the SRAM performance at higher voltages.

The most promising SRAM cells for U-DVS employ single-ended read ports [1]. Single-ended reads require either the BLs to have a nearly rail-to-rail swing [Fig. 1(a)] [10], [15] or an external reference voltage [Fig. 1(b)] [3], [9]. The BL fall-time and the BL swing for these two sensing options, as shown in Fig. 1, are compared in Fig. 2. It may be seen that the inverter-based sensing [Fig. 1(a)] is significantly slower, and causes larger swings on the BL at higher supply voltages. The effect of these large BL swings on power consumption can be reduced using hierarchical BLs. Fig. 2 also compares the performance of such a design with just 16 cells/BL [15]. All the three designs are implemented with a comparable macro area. While an inverter with lower cells/BL performs better at lower voltages, it is not as good as SAs at higher voltages. In addition, hierarchical BLs generally incur larger area overheads [20]–[22]. On the other hand, high-speed SAs require a reference voltage,

1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



Fig. 1. Single-ended read in U-DVS memories using (a) inverter, causing rail-to-rail swing of BL; and (b) SA (using a reference) for higher speed and lower power.

Fig. 2. Simulation results comparing (a) time taken and (b) BL swing (during a read operation) when using an SA, an inverter, and an inverter with shorter BLs (hierarchical BL with 16 cells per local BL) for sensing.

which is either generated externally [9] or from an internal digital-to-analog converter (DAC). Interestingly, there is no technique reported regarding the generation of reference voltage internally in U-DVS systems.

In this paper, we report a variation-tolerant reference-generation mechanism suitable for U-DVS systems, which tracks the BL voltages as the supply is scaled. The technique uses replica BLs to track process variations and other slow changes affecting the memory. The key contributions of this paper are: 1) generating a suitable reference voltage internally; 2) providing robustness against variation by the configuration of reference voltage using multiple replica cells and the selection of active replica cells; 3) further extending the operating range of the memory using tunable delay lines for timing generation; and 4) employing a random-sampling-based algorithm to significantly speed up the tuning processes necessary to configure the reference and

Fig. 3. Typical variation in BL characteristics due to local process variation between different SRAM cells in a chip.

timing-generation blocks. Combining the above techniques allows the memory to function effectively over a wide range of voltages without any external support.

The remainder of this paper is organized as follows. Section II presents the challenges of generating a reference voltage and timing signals in U-DVS SRAMs. Section III follows this with the proposed reference-generation technique involving the replica columns, the timing generator, and the associated tuning algorithm. Simulation results showing the effectiveness of the proposed design are presented in Section IV. Furthermore, the test chip's experimental setup and measured results are presented in Section V. This is followed by the discussion and conclusion in Sections VI and VII, respectively.

II. CHALLENGES IN U-DVS SRAM DESIGN

SRAM cells are read by first precharging the BLs and activating the appropriate WL, as shown in Fig. 3. Based on the data stored in the cell, the BL either remains high (BL1) or begins to discharge (BL0). Once a sufficient differential voltage develops, the SAs are enabled. The SA then compares the BL voltage with a reference voltage VREF (for single-ended reads), and estimates the data stored in the cell.

Effects, such as random dopant fluctuation and line-edge roughness, cause variation between the individual cells in an SRAM array. This is shown as spread in BL0 and BL1 transition waveforms in Fig. 3. The effect of supply scaling on this variation is shown in Fig. 4(a), which plots the time taken for the BL to fall to 90% of VDD and its coefficient of variation. It may be seen that, at lower voltages, both the delay and its variation increase exponentially. This is due to the exponential dependence of currents on the threshold voltage of transistors at these low supply voltages.

The offset of the SA is also affected by the increased variation at lower voltages, as shown in Fig. 4(b). At voltages below 0.35 V, the probability of failure increases sharply, because even the maximum differential voltage (VDD/2) may be insufficient to support the increased offset voltages of the SAs.

The earliest instant at which the SA may be enabled is when the difference in voltages between the slowest BL0 (or the fastest BL1) and VREF is greater than its offset voltage. On the other hand, enabling the SAs too late causes increased



Fig. 4. Simulated results showing the effect of supply scaling on (a) variation in BL fall-time, obtained using Monte Carlo simulations for local variation, postlayout for an 8T SRAM [23] cell array with 256 cells/BL and (b) offset voltage of an nMOS-input SA [24], [25], designed to have a maximum offset of 20 mV at 1.2 V, in the UMC 130-nm process.

BL swing, which adversely affects the memory read power and latency. Thus, margins must be added during design in order to accommodate these variations. Nonidealities in the timing-generation mechanism further add to this margin. We would hence like to minimize the sources of variation by: 1) having a robust reference-generation mechanism and 2) enabling the SA at the optimal time. Sections II-A and II-B illustrate these two challenges.

A. Sense-Amplifier Reference Voltage

Most U-DVS SRAM cells proposed [1], [3], [8], [23] employ a conventional inverter pair (as the storage element) and an additional read-buffer. An exception to this is the Schmitt-trigger-based cell [17], whose performance degrades at nominal voltages. Therefore, we have chosen a simple 8T SRAM cell (Fig. 1) [23] as the representative of the most promising cell designs for U-DVS. The use of a read-buffer implies that the cells only support single-ended read, since using two sets of read-buffers [26] (11T) would significantly increase the cell area. Single-ended sensing using a simple inverter [Fig. 1(a)] requires the BL to swing from almost rail-to-rail [10], [15], which is prohibitively expensive at nominal voltages, as mentioned earlier. Alternatively, the use of an SA requires a reference voltage, as shown in Fig. 1(b).

Fig. 5. Simulated maximum ΔVBL and ΔVBL available using the replica technique at different supply voltages (using 6σ variation).

A simple resistive divider may be used to internally generate the reference voltage, as a fixed ratio of the supply voltage. However, the required reference voltage does not scale as a fixed fraction of the supply voltage, as shown in Fig. 16. At higher voltages, the SA's inputs are closer to the supply, whereas at lower voltages, the inputs (BLs) are closer to ground, at the time of their activation [3]. One reported design [14] uses a pseudo-nMOS inverter (alongside each SA) connected to the BL to generate the reference voltage. However, this approach affects the access speed at higher voltages.

Another option for generating the reference voltage is to use an internal DAC. This, however, requires a controlling logic that monitors the memory supply voltage and generates a suitable reference using a preconfigured lookup table. Using an externally generated reference [3], [9] requires additional pins for sensing the memory conditions and for supplying the required reference voltage. In addition, these approaches do not track the memory with slow-varying changes such as temperature, bias temperature instability, and aging.

B. Timing Generation

While the conventional replica technique [27] for generating timing signals for SRAM works well at nominal voltages, its performance degrades in the presence of increased variation at lower voltages. Fig. 5 compares the maximum ΔVBL available at each supply voltage with ΔVBL obtained using the replica technique. ΔVBL initially increases sharply with time and reaches a maximum, before beginning to decrease slowly. Replica [27] and other nonprogrammable techniques for generating the timing perform poorly because the time at which the maximum ΔVBL occurs changes as the supply is reduced. This results in the degradation of ΔVBL (which causes reads to fail) at lower voltages, as shown in Fig. 5.

Various techniques have been reported for the generation of timing signals that employ either averaging or tuning to reduce the effect of variation. Increased averaging may be achieved by activating a greater number of cells on the replica BL, and then, using a timing multiplier circuit to increase the delay, such that it is sufficient for correctly sensing the BLs [28]. This technique is, however, limited by the quantization in the timing multiplier circuit, and offers no flexibility for



Fig. 6. Proposed schematic that equalizes charge on replica columns REFL and REFH, mimicking BL0 and BL1, respectively, to generate the required reference voltage.

tuning postfabrication. Another approach is to monitor all the BLs in the design and to generate the timing signal in steps using the order in which the BLs discharge [29]. Although this design provides extensive averaging, it requires ∼4% additional height of the memory macro (with 128 cells/BL), and its applicability over a wide range of voltages is not discussed.

Tunable delay lines offer best tracking with process variations [30], especially in the presence of extreme variation as seen at subthreshold voltages. They offer the flexibility of maximizing ΔVBL at each supply voltage. We use built-in self-test (BIST) infrastructure to tune these delay lines, as reported in [31]–[34], to track variation caused by manufacturing artifacts.

III. PROPOSED APPROACH

A. Reference Voltage-Generation Technique

Ideally, the reference-generation technique should generate a voltage that is midway between the slowest BL0 (least upper bound) and the fastest BL1 (greatest lower bound), as shown in Fig. 3

VREF = (VBL0(μ+6σ) + VBL1(μ−6σ))/2. (1)
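As an illustration of (1), the following minimal Python sketch estimates the ideal reference from sampled BL voltages taken at the SA-enable instant. It is not part of the reported design: the function name `ideal_vref`, the sample values, and the use of the population standard deviation are assumptions made for this example.

```python
import statistics


def ideal_vref(bl0_samples, bl1_samples, k=6.0):
    """Midpoint of (1): slowest BL0 (mu + k*sigma) and fastest BL1 (mu - k*sigma).

    bl0_samples / bl1_samples are BL voltages (in volts) sampled at the
    SA-enable instant, e.g. from Monte Carlo runs (hypothetical inputs).
    """
    # Slowest BL0: the discharging BL that is still highest (upper bound).
    bl0_worst = statistics.mean(bl0_samples) + k * statistics.pstdev(bl0_samples)
    # Fastest BL1: the nominally high BL that has leaked lowest (lower bound).
    bl1_worst = statistics.mean(bl1_samples) - k * statistics.pstdev(bl1_samples)
    return 0.5 * (bl0_worst + bl1_worst)


if __name__ == "__main__":
    # With no spread, VREF is simply the midpoint of the two BL voltages.
    print(ideal_vref([0.6, 0.6], [1.2, 1.2]))
```

A larger spread in BL0 moves the estimate upward, which reflects why the reference cannot be a fixed fraction of the supply.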

We propose to generate this using two replica columns, one representing each of BL0 (REFL) and BL1 (REFH), as shown in Fig. 6. The charge on these lines is then equalized to obtain a reference voltage in-between BL0 and BL1. Equalizing the voltages on REFL and REFH can, however, take significant amount of time, especially at lower supply voltages. Hence, the columns are shorted using switch S1, such that the columns REFL and REFH discharge together at the rate shown as Ideal VREF in Fig. 3.

The generated reference voltage must be distributed to each of the SAs, which increases the capacitance of the replica columns. This load is equally distributed on REFL and REFH, by connecting each of these lines to alternate SAs, as shown in Fig. 6 (labeled as even and odd SAs). However, the additional load causes REFL and REFH to systematically differ from BL0 and BL1, respectively. This is alleviated by enabling a configurable number of replica cells to discharge the reference lines.

Our proposed reference generator consists of two replica SRAM columns and two columns of AND gates. During a read operation, the cells on REFL and REFH are activated using an additional timing signal RWLREF. This signal is the regular RWL delayed by a replica path used to mimic delay through the address decoder. This ensures that, during a read operation, the cells on the replica columns are activated at the same time as regular memory array bits.

The cells on these replica columns are written similar to regular memory bits. Each of the two replica columns contains m cells that are connected to RWLREF by a column of AND gates. These m cells are written to contain a data of 1, as shown in Fig. 6. The number of cells, activated at a time, is controlled by setting the configuration bits X[m:1] and Y[m:1].

By activating exactly one cell on REFL and deactivating all the cells on REFH, the replica columns behave similar to BL0 and BL1, respectively, as explained earlier. However, as these columns have the additional capacitance of SAs, multiple cells may need to be activated to generate the ideal reference. We denote the number of active cells as N in this paper. As the two columns REFL and REFH are identical, activating two cells on REFL is equivalent to activating one cell each on REFL and REFH. The number of active cells is equally divided among the two columns to minimize any difference in their rates of discharge. The reference voltage may thus be varied by changing N, which is done using the control bits X[m:1] and Y[m:1]. It is to be noted that the value of m is determined during design, whereas N is tunable after fabrication.
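The mapping from N to the configuration bits can be sketched as follows. This is only an illustration: the assumption that X controls REFL and Y controls REFH, the choice to give REFL the extra cell when N is odd, and the choice to enable the lowest-indexed cells first are ours and are not stated in the text. The function name `replica_config` is hypothetical.

```python
def replica_config(n_active, m):
    """Split n_active replica cells as evenly as possible across two columns.

    Returns (x_bits, y_bits), each a list of m bits where list index 0
    corresponds to bit 1 of X[m:1] / Y[m:1].
    Assumptions (not from the paper): X gates the REFL cells, Y gates the
    REFH cells, and REFL receives the extra cell when n_active is odd.
    """
    if not 0 <= n_active <= 2 * m:
        raise ValueError("n_active must lie between 0 and 2*m")
    n_refl = (n_active + 1) // 2  # REFL share (rounded up)
    n_refh = n_active // 2        # REFH share (rounded down)
    x_bits = [1 if i < n_refl else 0 for i in range(m)]
    y_bits = [1 if i < n_refh else 0 for i in range(m)]
    return x_bits, y_bits


if __name__ == "__main__":
    # N = 1 reproduces the basic case: one cell on REFL, none on REFH.
    print(replica_config(1, 8))
    # N = 3 enables two cells on REFL and one on REFH.
    print(replica_config(3, 8))
```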

The proposed approach, however, causes some differences in modeling BL0 and BL1. The off-cells on BL1 have a higher drain-to-source voltage across their read access nMOS transistors (compared with the corresponding cells on REFH), resulting in a higher leakage current. This can result in up to 7%–11.5% higher VREF under worst case data patterns. In addition, the leakage of the active cell (with RWL high) on BL1 is not replicated on REFH as its contribution is negligible. However, in scaled technologies with higher leakage, this behavior can be easily modeled by storing the corresponding data in one of the 2m cells and by setting the corresponding X[m:1] or Y[m:1] bits to 1. We also do not initialize the content of 2(256 − m) cells (256 − m on each REFL and REFH), as this does not change the generated reference voltage significantly (<1.1%). In technologies with higher leakage, the content of these cells can in fact be used as a mechanism to fine-tune VREF.

The organization of these replica columns and other blocks in the implemented layout of a 4-Kb SRAM array is shown in Fig. 7. The memory, implemented in



Fig. 7. Organization, in layout, of the various blocks in the implemented memory.

the UMC 130-nm process, is organized as 256 rows by 16 columns. The RWLREF signal runs vertically with a load of 2m AND gates. The write WLs are routed normally, extending over the replica columns on each row. The RWL is, however, routed slightly differently. In the first 256 − m rows, the RWL drives an additional two AND gates, along each row (Fig. 6), whereas in the last m rows, the RWL connects only to the regular cells (no additional load). Bits X[0] and Y[0] are set to zero, ensuring that the 256 − m SRAM cells are not activated during a normal operation. The switches S1, S2, and S3 have been added to only provide debug and characterization capability with their state during the normal operation shown in Fig. 6. These switches are sized, such that the drop across them is insignificant.

The area penalty of this technique depends on the size of the memory array. Our implementation uses two additional columns of the SRAM cells per 16 regular columns, which results in a 4.5% increase in the overall area of the memory macro. The percentage increase is estimated to be 0.87% for a 32-Kb array and 0.45% for a 64-Kb array, each with the same 256 cells/BL and employing just one pair of replica columns.

B. Timing Generation Using Tunable Delay Lines

The timing generator is responsible for ensuring that sufficient differential voltage is available for the SAs, as discussed earlier in Section II. Using a tunable delay line allows the design to adapt the timing so that higher differential voltage is made available to meet the offset requirement of SAs. We have thus used a tunable delay line to generate the necessary timing signals for the SRAM array across the wide range of supply voltages of interest. Although tunable delay lines have been employed to counter the effects of variation [32], [33], their use as effective timing generators for dynamic voltage scaling is not reported to the best of our knowledge.

Fig. 8. Tunable delay line used to generate timing signals for SRAM.

Fig. 9. Schematics of the implemented (a) FDC and (b) CDC.

The designed tunable delay line (Fig. 8) consists of a fine delay block (FDB), a coarse delay block (CDB), two binary-to-thermometric encoders, and additional MUXes that provide the capability to bypass either of the delay blocks. The FDB is implemented using a series of 16 identical fine delay cells (FDC), as shown in Fig. 9(a). Each cell consists of a buffer with a switchable load capacitor CL. Controlling the switches (S0–S15) varies the capacitance at the intermediate nodes, thus controlling the delay of the block. A series of identically sized cells are chosen, over binary-weighted cells, to ensure a monotonic increase in delay with the input code. This simplifies the delay tuning algorithm. This design, however, causes the FDB to have a large inertial delay (delay at minimum code setting). MUXes have therefore been added with the capability to bypass this block if necessary.

The CDB implementation [Fig. 9(b)] controls the delay by varying the signal path based on the thermometric code [35]. A select bit (one for each of the 16 cells) determines whether the signal is propagated to the next cell or is routed back, at that cell, on the return path. This design allows multiple cells to be cascaded to obtain a large range of delays without affecting the inertial delay. However, the jitter at the output of this block is code-dependent, making it less suitable for other applications.

As both the FDB and the CDB accept thermometric codes to vary the delay, a binary-to-thermometric encoder is included (with each block) to reduce the number of configuration bits necessary to control the delay. Fig. 10 shows the



Fig. 10. Measured tunability of delay lines used in SRAM timing generator block at different supply voltages. It may be noted that the delay values for each of the curves are normalized to their respective value at code = 0.

TABLE I

MEASURED DELAY-LINE PARAMETERS AT DIFFERENT SUPPLY VOLTAGES

delays measured for various digital-code-word settings, at three supply voltages, on the test-chip fabricated in the UMC 130-nm technology. Accurate measurement of on-chip delays was achieved using subsampling and a delay measurement unit described elsewhere [36]. The 16 FDCs and coarse delay cells (CDCs) provide linearly increasing delay and, thus, the necessary timing range to operate the memory across the wide range of supply voltages. The step size and the linearity parameters of the delay line are summarized in Table I. The tunable delay line, as shown in Fig. 8, occupies 1.54% of the 4-Kb memory block area.
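The role of the binary-to-thermometric encoders, and the reason identical cells give a monotonic delay, can be illustrated with a short behavioral sketch. It models only the code conversion and an idealized linear delay; the inertial delay and step values are placeholder numbers chosen for the example, not the measured values of Table I, and the function names are hypothetical.

```python
def binary_to_thermometer(code, n_cells=16):
    """Convert a binary code into a thermometer code of n_cells bits.

    Bit i is 1 when cell i is switched in, so exactly `code` cells are active.
    """
    if not 0 <= code <= n_cells:
        raise ValueError("code must lie between 0 and n_cells")
    return [1 if i < code else 0 for i in range(n_cells)]


def line_delay(code, t_inertial=1.0, t_step=0.1, n_cells=16):
    """Idealized delay of a block built from identical cells.

    t_inertial (delay at code 0) and t_step (delay added per active cell)
    are placeholder values in arbitrary units, assumed for illustration.
    """
    active_cells = sum(binary_to_thermometer(code, n_cells))
    return t_inertial + t_step * active_cells


if __name__ == "__main__":
    print(binary_to_thermometer(3, 4))
    # Each code increment switches in exactly one more identical cell,
    # so the modeled delay increases at every step.
    print([round(line_delay(c), 2) for c in range(17)])
```

With binary-weighted cells, by contrast, a code increment can switch several cells at once, so mismatch between cells can break monotonicity; the thermometer code avoids this by construction.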

C. Tuning Algorithm

The configuration bits necessary to generate timing signals using the tunable delay line and the value of N used in the reference generator are determined using a tuning algorithm. Thus, this algorithm sets the absolute value of the reference voltage and the worst case margins for the SAs. These algorithms are commonly implemented using BIST infrastructure [31]–[33], and must be run before the memory can be used for the first time. The algorithms are iterative in nature, and can thus take a significant amount of time to converge to the final configuration-bit settings. Minimizing this time reduces the cost associated with tuning [37], thus allowing more frequent running of the tuning process as necessary. The algorithm also determines the effectiveness of the proposed reference generator in minimizing the BL swing and the access time at different supply voltages.

The proposed algorithm (Fig. 11) uses random-sampling-based tuning [30] to quickly determine the minimum WL

Fig. 11. Random-sampling-based algorithm used to tune the timing and reference generator for reads at a given supply voltage.

pulsewidth (tPW) and the N-value to be used for a given supply voltage. Faster tuning is achieved using random sampling, by first estimating the settings using a small subset of the memory array. If necessary, these are further tuned and verified for the entire memory. This significantly reduces the tuning time, especially for larger memories [30]. It may be noted that the SA enable signal is generated from the RWL pulse. Hence, setting tPW is equivalent to tuning the SA enable timing.

A checkered pattern is first written to the memory using a conservatively high value of write-timing. The read-timing is then set using the tuning algorithm shown in Fig. 11. Once the read-timings are set, the same WL pulsewidth is used for writes, as writing is known to take less time compared with reads.

The algorithm starts with a conservatively minimum value for N (NMIN) and a conservatively maximum value for tPW (tPW-MAX). These values are then tested against a randomly selected set of M rows, where M is determined by the confidence required in the estimation. A failure to sense BL0s at this stage indicates that VREF is lower than the desired value. As N is already set to a minimum value, the only way to increase VREF is by choosing a different set of active cells on the replica columns [38], as shown in Fig. 12. On the other hand, if BL1s are found to fail, VREF is decreased by increasing N. Following this, tPW is reduced iteratively (again using random sampling) to determine the lowest functioning value of tPW. Once the random-sampling-based tuning is complete, the entire memory is tested using the set values. tPW is then adjusted, if required, to ensure that the settings enable the entire memory to function correctly. It may be noted that the algorithm in Fig. 11 is simplified to exclude



Fig. 12. Sketch to illustrate the variation characteristics of BL0, BL1, and VREF, and options available for tuning.

Fig. 13. Variation of (a) time taken by tuning algorithm (in terms of number of full memory reads) and (b) tPW with various tuning algorithms. These simulation results are obtained for a 10-KB memory. The time taken by standard memory BIST (MBIST) algorithms is also shown. The error bars are too small to be seen.

the exit conditions of loops (on reaching the limits of various parameters) in the interest of clarity.
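The flow described above can be sketched in Python as follows. This is a simplified reading of the text, not a transcription of Fig. 11: the hook `read_row`, the integer `sel` that stands for "a different set of active replica cells", the list `tpw_settings`, and the exit conditions are all assumptions introduced for illustration.

```python
import random


def tune_reads(read_row, n_rows, m_sample, n_min, n_max, tpw_settings,
               max_reselect=4, seed=0):
    """Random-sampling-based read tuning (illustrative sketch).

    read_row(row, n, sel, tpw) -> (bl0_ok, bl1_ok) is a hypothetical test hook.
    tpw_settings lists pulsewidth settings from the largest to the smallest.
    Returns the chosen (n, sel, tpw).
    """
    rng = random.Random(seed)

    def passes(rows, n, sel, tpw):
        # (all BL0 reads pass, all BL1 reads pass) over the given rows.
        results = [read_row(r, n, sel, tpw) for r in rows]
        return all(r[0] for r in results), all(r[1] for r in results)

    def random_rows():
        return rng.sample(range(n_rows), m_sample)

    # Step 1: settle N and the active-cell selection at the largest tPW.
    n, sel = n_min, 0
    while True:
        bl0_ok, bl1_ok = passes(random_rows(), n, sel, tpw_settings[0])
        if bl0_ok and bl1_ok:
            break
        if not bl1_ok:
            n += 1        # BL1 fails: VREF too high, so increase N.
            if n > n_max:
                raise RuntimeError("N limit reached")
        else:
            sel += 1      # BL0 fails: VREF too low, so reselect active cells.
            if sel > max_reselect:
                raise RuntimeError("no usable set of replica cells found")

    # Step 2: shrink tPW while randomly sampled rows still read correctly.
    idx = 0
    while (idx + 1 < len(tpw_settings)
           and all(passes(random_rows(), n, sel, tpw_settings[idx + 1]))):
        idx += 1

    # Step 3: verify the whole memory and relax tPW if any row fails.
    while not all(passes(range(n_rows), n, sel, tpw_settings[idx])):
        if idx == 0:
            raise RuntimeError("memory fails even at the largest tPW")
        idx -= 1
    return n, sel, tpw_settings[idx]
```

In this sketch, a weak row that the random samples miss is caught by the final full-memory pass, which only needs to step tPW back up rather than repeat the whole search.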

The mean performance of the four variants of the tuning algorithm on 1000 instances of a 10-KB memory is shown in Fig. 13. Here, Conven. refers to the conventional tuning without random sampling [31]–[33], and R-fine refers to the random-sampling-based algorithm shown in Fig. 11, which significantly reduces the tuning time. The time required for tuning can be further reduced using the coarse-fine architecture of the tunable delay line (R-C-fine), which comes with a small penalty in tPW [Fig. 13(b)]. This is achieved using coarse steps in block A and fine steps in block B of Fig. 11. However, the R-fine and R-C-fine algorithms cause failures at 400 mV and below. This is alleviated by tuning the memory to obtain multiple pairs of N and tPW that function and by choosing the setting with lower tPW (represented as R-Multi.). While this increases the tuning time (as expected), it allows the memory to function down to 350 mV. Fig. 13(a) also shows that the time required for tuning at higher voltages is significantly lower in comparison to the standard MBIST algorithms [39], and is comparable at lower voltages. Multiple such MBIST algorithms are typically run on each instance of the memory. Hence, while the technique adds to the tuning time, the increase in the total tuning time is not significant. It may be noted that the tuning time is influenced by various other parameters such as the initial estimate and the step size of tPW. These values may be chosen appropriately to trade off between tPW and the tuning time.

The frequency of tuning is determined by factors such as the tracking required (or margins acceptable) for slow-varying changes, the delay steps implemented, and the storage space available for configuration settings. Tuning may either be done each time the memory supply is varied, or the settings may be determined once at each supply voltage and stored in a lookup table for later use. The number of configuration bits to be stored can be reduced by suitably dividing the voltage range of interest into smaller regions and by storing one set of values per region. This approach trades off the performance for fewer configuration bits, and can be especially useful in large memories that contain multiple instances.
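The region-based lookup table can be illustrated with a minimal sketch. The region boundaries and the stored tuples (N, fine code, coarse code) below are invented placeholder values, not measured settings, and the helper name `lookup_setting` is hypothetical.

```python
import bisect

# Upper edge (in volts) of each supply-voltage region, in ascending order.
# Placeholder values chosen only for illustration.
REGION_UPPER_EDGES = [0.4, 0.5, 0.8, 1.2]

# One stored configuration per region: (N, fine delay code, coarse delay code).
# Placeholder values, not taken from the measurements.
REGION_SETTINGS = [(3, 12, 9), (2, 8, 4), (1, 4, 1), (1, 1, 0)]


def lookup_setting(vdd, upper_edges=REGION_UPPER_EDGES, settings=REGION_SETTINGS):
    """Return the configuration stored for the region that contains vdd."""
    # bisect_left finds the first region whose upper edge is >= vdd.
    region = bisect.bisect_left(upper_edges, vdd)
    if region >= len(settings):
        raise ValueError("supply voltage above the characterized range")
    return settings[region]


if __name__ == "__main__":
    print(lookup_setting(0.45))  # falls in the (0.4 V, 0.5 V] region
```

Fewer, wider regions reduce the number of stored bits, at the cost of using settings tuned for the worst point in each region.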

IV. SIMULATION RESULTS

The proposed reference-generation scheme is evaluated using an SRAM array in the UMC 130-nm process with 256 cells/BL. The effect of local variation on BL0, BL1, and VREF (for N = 1, 2, and 3) at 1.2 and 0.45 V is shown in Fig. 14. The time axis in Fig. 14 begins from the time that the WLs are activated, and extends until the time at which ΔVBL is maximum. It may be seen that, while it is easy for the tuning algorithm to find a set of functioning settings at higher supplies, at lower voltages, the increased variation may require multiple rounds of reselection to converge on the final setting. The detailed simulated waveforms during a typical read operation at 0.4 V are shown in Fig. 15.

Fig. 16 shows the generated and ideal reference voltages at different supply voltages. Here, the ideal VREF is evaluated at the timing setting determined by the tuning algorithm. It may be seen that the proposed technique closely tracks the ideal VREF, as the supply is scaled from 1.2 V down to 350 mV. It may also be observed that the values of VREF (and the BLs) are closer to the supply at nominal voltages, while they are relatively closer to the ground at lower voltages, when the SAs are activated [3].

The proposed scheme also tracks the memory with global process variation and changes in temperature, as shown in Fig. 17. Fig. 17(a) and (b) plots the results for a 256-row by 16-column array, whereas Fig. 17(c) and (d) reports the tracking for a wider array with 256 rows by 128 columns.



Fig. 14. Simulated effect of local mismatch on BL0, BL1, and VREF (forN = 1, 2, and 3) at (a) 1.2 and (b) 0.45 V. Error bars: range from μ + 3σ toμ − 3σ . Fewer error bars are shown in (b) for clarity.

Fig. 15. Signal waveforms during a typical read operation at 400 mV.

In each case, only one pair of replica columns was used.These results were obtained using a tunable delay line togenerate the timing signals. The delay line and N were tunedat TT corner at −40 °C for each configuration, followingwhich the temperature and process corners were varied. Thisrepresents conservative results as tuning each chip wouldaccount for global process variation.

The proposed technique achieves good tracking with processand temperature due to the use of replica columns, whichare almost identical to regular BLs. The tracking degradesfor wider arrays. This is mainly due to the gate-dominatedcapacitance of the SAs compared with the drain-dominatedcapacitance of the SRAM cells. In addition, large

Fig. 16. Simulated results showing the tracking of the reference voltage,generated using the proposed technique, with the ideal reference as the supplyis scaled.

Fig. 17. Simulated effect of temperature and process corners on the per-centage error between the ideal and generated reference voltages at differentsupply voltages and aspect ratios. Timing signals were generated using atunable delay line that was tuned at TT, −40 °C.

Fig. 18. Die photograph of the fabricated chip in the UMC 130-nm process.

SRAM arrays will have systematic variation in transistor characteristics from one part of the array to another. Hence, in such cases, multiple replica columns may be employed for better matching.

V. EXPERIMENTAL SETUP AND MEASURED RESULTS

The proposed techniques were evaluated using a test chip fabricated in the UMC 130-nm mixed-mode/RF process. The chip (Fig. 18) implemented a 4-Kb memory organized as 256 rows by 16 columns. The conventional 8T SRAM cell (6T conventional + 2T read-buffer) was used with no additional cell modifications (Fig. 1). All transistors in



Fig. 19. Measurement setup showing the fabricated chip, field-programmable gate array board, and other interface equipment used for the characterization of chips.

the cell are minimum sized, except the two write access nMOS transistors (with their drains connected to WBL and WBL_bar), which are 1.5 times the minimum size, for increased writability at low voltages. The cell was designed using logic layout rules, and occupies 8.341 μm². The design implements the proposed reference-generation scheme (Fig. 6) with the capability to vary N from 0 to 8 independently on each of the reference replica columns (m = 8). Tunable delay lines have been implemented, with 16 steps each of fine delay and coarse delay, to generate the required timing signals for the SRAM array. The fabricated chips were characterized using the setup shown in Fig. 19.

The internally generated pulsewidths were measured using subsampling flip-flops and an externally generated subsampling clock (Fig. 19) [36]. The input clock frequency is 195.3 kHz, and a subsampling clock of 194.8 kHz was used. Hence, the subsampled signals shown are at a difference frequency of 500 Hz (195.3 − 194.8 kHz). This provides a delay amplification of

$$\frac{T + \Delta T}{\Delta T} = \frac{1/194.8\ \text{kHz}}{(1/194.8\ \text{kHz}) - (1/195.3\ \text{kHz})} \approx 390. \tag{2}$$
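The amplification factor in (2) can be checked numerically. The short Python sketch below is only an illustration: the function and variable names are assumptions introduced here, and T is taken to be the input clock period, with T + ΔT the subsampling clock period, as implied by the frequencies quoted above.

```python
def delay_amplification(f_in_hz, f_sub_hz):
    """Return (T + dT) / dT, where T = 1/f_in and T + dT = 1/f_sub.

    Hypothetical helper for illustration; not part of the measurement setup.
    """
    if f_sub_hz >= f_in_hz:
        raise ValueError("subsampling clock must be slower than the input clock")
    t_in = 1.0 / f_in_hz      # T: period of the input clock
    t_sub = 1.0 / f_sub_hz    # T + dT: period of the subsampling clock
    delta_t = t_sub - t_in    # dT: time step gained per subsampling cycle
    return t_sub / delta_t


F_IN_HZ = 195.3e3   # input clock frequency from the text
F_SUB_HZ = 194.8e3  # subsampling clock frequency from the text

beat_hz = F_IN_HZ - F_SUB_HZ                   # difference frequency
gain = delay_amplification(F_IN_HZ, F_SUB_HZ)  # delay amplification factor

print(f"difference frequency: {beat_hz:.1f} Hz")
print(f"delay amplification:  {gain:.1f}")
```

Algebraically, the ratio simplifies to f_in / (f_in − f_sub) = 195.3 / 0.5 ≈ 390.6, which agrees with the value of approximately 390 in (2). Under this reading, an on-chip pulsewidth would be recovered by dividing the observed subsampled pulsewidth by this factor.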

The overall performance of the memory with supply voltage scaling is shown in Fig. 20, which also shows the pulsewidths of the read WL (RWL) and the SA. The tuning algorithm shown in Fig. 11 was used to obtain the settings at each supply voltage. The memory functions from the nominal supply of 1.2 V down to 310 mV, using the internally generated reference voltage. The variation of energy per access with the supply voltage is shown in Fig. 21. The multiplicity factor (N) (also annotated in the graph) does not need tuning from 1.2 down to 0.5 V. However, for supply voltages of 400 mV and below, this had to be varied in order to generate the

Fig. 20. Measured maximum operating frequency of memory as the supply is scaled.

Fig. 21. Measured effect of supply voltage on energy per access, leakage power, and read power.

TABLE II

MEASURED MEMORY PERFORMANCE FOR VARIOUS COMBINATIONS OF READ SUPPLY AND MEMORY SUPPLY

proper reference voltage. Fig. 21 also shows the effect of supply voltage on leakage and read power. It may be observed that the energy optimum point occurs at 400 mV, with 0.115 pJ/bit/access.
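An energy optimum of this kind arises because switching energy falls with the supply, while leakage energy per access grows as the access time lengthens at low voltages. The Python sketch below illustrates this with a toy model that is an assumption made here, not the authors' model: switching energy is taken as proportional to VDD², and leakage energy per access as that same term scaled by an exponential in VDD, a form that is reasonable only near and below threshold. The constants `k_leak` and `n_vt` are arbitrary illustrative values and are not fitted to the measured data, so the location of the minimum it finds does not reproduce Fig. 21.

```python
import math


def relative_energy(vdd, k_leak=5000.0, n_vt=0.039):
    """Toy energy per access in arbitrary units (illustrative assumption).

    dynamic ~ vdd^2
    leakage ~ vdd^2 * k_leak * exp(-vdd / n_vt), which grows as vdd drops
              because the access time lengthens roughly exponentially.
    k_leak and n_vt are arbitrary values, not extracted from measurements.
    """
    dynamic = vdd ** 2
    leakage = dynamic * k_leak * math.exp(-vdd / n_vt)
    return dynamic + leakage


# Sweep the supply from 200 mV to 1.2 V in 10-mV steps.
supplies_v = [mv / 1000.0 for mv in range(200, 1201, 10)]
v_opt = min(supplies_v, key=relative_energy)

print(f"toy-model energy minimum at {v_opt * 1000:.0f} mV")
```

The point of the sketch is only that the minimum lies inside the swept range rather than at either end. Below it, the leakage term dominates, and above it, the switching term dominates.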

An independent read supply voltage was used to verify the performance of the reference-generation technique at voltages lower than 310 mV. With the memory operating at 350 mV, the read BL’s precharge voltage was lowered down to 190 mV, while continuing to use the internally generated reference to perform reads. Various combinations of read supply and memory voltages were evaluated. The corresponding memory performance is summarized in Table II. While scaling of the SRAM array supply is limited by the choice of SRAM cell used, the reference-generation technique continues to function down to 190 mV, making it suitable for use with other cell designs proposed in the literature.

It is to be noted that the maximum value of N used is 3, which implies that a value of m = 2 is sufficient. This also provides sufficient options for reselecting when the



TABLE III

COMPARISON OF THIS WORK WITH OTHER U-DVS DESIGNS

lower values of N are used at higher supply voltages. Using the higher values of N provides averaging, thus lowering the requirement for reselecting. In addition, fine-tunability is not used. Thus, only four configuration bits are required for the reference-generation technique. The delay-generation technique requires 4 (FDB) + 4 (CDB) = 8 bits, making a total of 12 bits that are necessary for the proposed design to operate over the entire range of supplies.
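The bit budget above can be illustrated with a small Python sketch that packs the three fields into a single 12-bit configuration word. The field widths come from the text. The class name, field names, and bit ordering are assumptions made here for illustration and do not describe the register layout on the fabricated chip.

```python
from dataclasses import dataclass

REF_BITS = 4  # reference-generation configuration bits (from the text)
FDB_BITS = 4  # fine delay bits (from the text)
CDB_BITS = 4  # coarse delay bits (from the text)
TOTAL_BITS = REF_BITS + FDB_BITS + CDB_BITS


@dataclass(frozen=True)
class SramConfig:
    """Hypothetical container for one tuning setting."""
    ref: int  # reference-generation setting
    fdb: int  # fine delay setting
    cdb: int  # coarse delay setting

    def pack(self) -> int:
        """Pack the fields as [ref | fdb | cdb]; the ordering is an assumption."""
        fields = (("ref", self.ref, REF_BITS),
                  ("fdb", self.fdb, FDB_BITS),
                  ("cdb", self.cdb, CDB_BITS))
        for name, value, width in fields:
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
        return (self.ref << (FDB_BITS + CDB_BITS)) | (self.fdb << CDB_BITS) | self.cdb

    @staticmethod
    def unpack(word: int) -> "SramConfig":
        """Inverse of pack() for a word of TOTAL_BITS bits."""
        if not 0 <= word < (1 << TOTAL_BITS):
            raise ValueError("configuration word out of range")
        cdb = word & ((1 << CDB_BITS) - 1)
        fdb = (word >> CDB_BITS) & ((1 << FDB_BITS) - 1)
        ref = word >> (FDB_BITS + CDB_BITS)
        return SramConfig(ref=ref, fdb=fdb, cdb=cdb)


example = SramConfig(ref=3, fdb=9, cdb=12)
word = example.pack()
print(TOTAL_BITS, 1 << TOTAL_BITS, word, SramConfig.unpack(word) == example)
```

Twelve bits give 2^12 = 4096 possible settings per supply voltage, which gives a sense of the space that the tuning algorithm has to search.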

Table III compares our work with other U-DVS implementations reported. The proposed design enables a higher frequency of operation at nominal voltages, due to the use of SAs with an internally generated reference. This significant speed advantage over other designs is maintained across the full range of supplies, with the exception of the design [14] implemented in a faster technology (65 nm). The energy and power numbers are comparable with other reported works, with the exception of the design [15] containing only 16 cells/BL.

Our proposed design operates at a higher frequency than other designs, from nominal voltage down to subthreshold voltages, making it suitable for a wide range of applications. In addition, the conventional 8T SRAM cell used requires no additional peripheral circuitry such as a virtual power/ground generator [15], a WL boosting mechanism [19], or a substrate bias generator. The present implementation, in contrast with the other reported designs, does not require external support, either in the form of a reference voltage or timing-generation circuitry, thus making it a more integrated solution.

VI. DISCUSSION

We found that the technique presented generates a nearly ideal reference voltage for single-ended sensing over a wide range of voltages. Although tuning is used to minimize margins during the design and push performance over a greater range of supply voltages, the technique can be applied without tuning. The simulation results show that the technique can be used without tuning, along with the conventional timing-generation technique [27], from 1.2 V down to 0.65 V.

The area penalty may be reduced using only one replica column (as both REFL and REFH are identical) and replica BLs that are shorter than regular BLs, at the expense of lower (coarser) tuning resolution. This loss in the resolution can then be compensated using fine-tunability, which can be achieved using appropriately sized pseudo-SRAM cells. Fine-tunability can also be used to further lower tPW at nominal voltages.

The speed and power advantage of SAs (over inverters) decreases as the supply is reduced, as shown in Fig. 2. In addition, the penalty of storing additional configuration bits is mainly contributed by the requirement to operate the SAs at lower voltages. Hence, it may be optimum to switch between using SAs at superthreshold voltages and inverters at subthreshold voltages.

VII. CONCLUSION

This paper presented a reference generator for U-DVS memories that tracks the memory over a wide range of



voltages, and is tunable to allow functioning down to subthreshold voltages. Replica columns are used to generate the reference voltage, which allows the technique to track slow changes such as temperature and aging. A few configurable cells in the replica column are found to be sufficient to cover the whole range of voltages of interest. The use of a tunable delay line to generate timing is shown to help in overcoming the effects of process variations. Effective tuning is achieved by the random-sampling-based algorithm that uses BIST hardware, which reduces the tuning time significantly for large SRAMs. A 4-Kb SRAM array has been designed and fabricated using the conventional 8T SRAM cell, in the UMC 130-nm technology, that achieves good performance from superthreshold to subthreshold voltages. Combining the proposed techniques is shown to allow the memory to function from 1.2 V down to 310 mV, and read down to 190 mV (using an independent supply), using an internally generated reference voltage and timing signals, thus requiring no external support.

REFERENCES

[1] A. P. Chandrakasan et al., “Technologies for ultradynamic voltage scaling,” Proc. IEEE, vol. 98, no. 2, pp. 191–214, Feb. 2010.

[2] B. H. Calhoun and A. Chandrakasan, “Ultra-dynamic voltage scaling using sub-threshold operation and local voltage dithering in 90 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, vol. 1. Feb. 2005, pp. 300–301.

[3] M. E. Sinangil, N. Verma, and A. P. Chandrakasan, “A reconfigurable 8T ultra-dynamic voltage scalable (U-DVS) SRAM in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 3163–3173, Nov. 2009.

[4] A. Wang and A. Chandrakasan, “A 180 mV FFT processor using subthreshold circuit techniques,” in IEEE ISSCC Dig. Tech. Papers, vol. 1. Feb. 2004, pp. 292–293.

[5] B. H. Calhoun, A. Wang, and A. Chandrakasan, “Modeling and sizing for minimum energy operation in subthreshold circuits,” IEEE J. Solid-State Circuits, vol. 40, no. 9, pp. 1778–1786, Sep. 2005.

[6] M. Alioto, “Ultra-low power VLSI circuit design demystified and explained: A tutorial,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 1, pp. 3–29, Jan. 2012.

[7] N. Lotze and Y. Manoli, “A 62 mV 0.13 μm CMOS standard-cell-based design technique using Schmitt-trigger logic,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 340–342.

[8] B. H. Calhoun and A. Chandrakasan, “A 256 kb sub-threshold SRAM in 65 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2006, pp. 2592–2601.

[9] N. Verma and A. P. Chandrakasan, “A 65 nm 8T sub-Vt SRAM employing sense-amplifier redundancy,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 328–329 and 606.

[10] T.-H. Kim, J. Liu, J. Keane, and C. H. Kim, “A 0.2 V, 480 kb subthreshold SRAM with 1 k cells per bitline for ultra-low-voltage computing,” IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 518–529, Feb. 2008.

[11] A. Teman, L. Pergament, O. Cohen, and A. Fish, “A 250 mV 8 kb 40 nm ultra-low power 9T supply feedback SRAM (SF-SRAM),” IEEE J. Solid-State Circuits, vol. 46, no. 11, pp. 2713–2726, Nov. 2011.

[12] Y. Sinangil and A. P. Chandrakasan, “A 128 kbit SRAM with an embedded energy monitoring circuit and sense-amplifier offset compensation using body biasing,” IEEE J. Solid-State Circuits, vol. 49, no. 11, pp. 2730–2739, Nov. 2014.

[13] A. Kawasumi et al., “A single-power-supply 0.7 V 1 GHz 45 nm SRAM with an asymmetrical unit-β-ratio memory cell,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2008, pp. 382–383 and 622.

[14] M.-F. Chang et al., “A sub-0.3 V area-efficient L-shaped 7T SRAM with read bitline swing expansion schemes based on boosted read-bitline, asymmetric-VTH read-port, and offset cell VDD biasing techniques,” IEEE J. Solid-State Circuits, vol. 48, no. 10, pp. 2558–2569, Oct. 2013.

[15] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, “A variation-tolerant sub-200 mV 6-T subthreshold SRAM,” IEEE J. Solid-State Circuits, vol. 43, no. 10, pp. 2338–2348, Oct. 2008.

[16] K. Takeda et al., “A read-static-noise-margin-free SRAM cell for low-Vdd and high-speed applications,” in IEEE ISSCC Dig. Tech. Papers, vol. 1. Feb. 2005, pp. 478–479 and 611.

[17] J. P. Kulkarni and K. Roy, “Ultralow-voltage process-variation-tolerant Schmitt-trigger-based SRAM design,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 319–332, Feb. 2012.

[18] H. Nho et al., “A 32 nm high-k metal gate SRAM with adaptive dynamic stability enhancement for low-voltage operation,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2010, pp. 346–347.

[19] J. Kulkarni, B. Geuskens, T. Karnik, M. Khellah, J. Tschanz, and V. De, “Capacitive-coupling wordline boosting with self-induced VCC collapse for write VMIN reduction in 22-nm 8T SRAM,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2012, pp. 234–236.

[20] B.-D. Yang and L.-S. Kim, “A low-power SRAM using hierarchical bit line and local sense amplifiers,” IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1366–1376, Jun. 2005.

[21] S. Ishikura et al., “A 45 nm 2-port 8T-SRAM using hierarchical replica bitline technique with immunity from simultaneous R/W access issues,” IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 938–945, Apr. 2008.

[22] Q. Li and T. T. Kim, “Analysis of SRAM hierarchical bitlines for optimal performance and variation tolerance,” in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2011, pp. 412–415.

[23] L. Chang et al., “Stable SRAM cell design for the 32 nm node and beyond,” in IEEE Symp. VLSI Technol., Dig. Tech. Papers, Jun. 2005, pp. 128–129.

[24] T. Kobayashi, K. Nogami, T. Shirotori, and Y. Fujimoto, “A current-controlled latch sense amplifier and a static power-saving input buffer for low-power architecture,” IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 523–527, Apr. 1993.

[25] T. W. Matthews and P. L. Heedley, “A simulation method for accurately determining DC and dynamic offsets in comparators,” in Proc. 48th Midwest Symp. Circuits Syst., vol. 2. Aug. 2005, pp. 1815–1818.

[26] S.-C. Luo and L.-Y. Chiou, “A sub-200-mV voltage-scalable SRAM with tolerance of access failure by self-activated bitline sensing,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 440–445, Jun. 2010.

[27] B. S. Amrutur and M. A. Horowitz, “A replica technique for wordline and sense control in low-power SRAM’s,” IEEE J. Solid-State Circuits, vol. 33, no. 8, pp. 1208–1219, Aug. 1998.

[28] Y. Niki et al., “A digitized replica bitline delay technique for random-variation-tolerant timing generation of SRAM sense amplifiers,” IEEE J. Solid-State Circuits, vol. 46, no. 11, pp. 2545–2551, Nov. 2011.

[29] A. Kawasumi et al., “A 47% access time reduction with a worst-case timing-generation scheme utilizing a statistical method for ultra low voltage SRAMs,” in Proc. Symp. VLSI Circuits (VLSIC), Jun. 2012, pp. 100–101.

[30] K. R. Viveka and B. Amrutur, “Digitally controlled variation tolerant timing generation technique for SRAM sense amplifiers,” in Proc. 5th Asia Symp. Quality Electron. Design (ASQED), Aug. 2013, pp. 233–239.

[31] C. J. Brennan et al., “BIST controlled variable sense amp timing for 90nm embedded SRAM,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Oct. 2004, pp. 345–348.

[32] Y.-C. Lai and S.-Y. Huang, “Robust SRAM design via BIST-assisted timing-tracking (BATT),” IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 642–649, Feb. 2009.

[33] M. H. Abu-Rahma, M. Anis, and S. S. Yoon, “Reducing SRAM power using fine-grained wordline pulsewidth control,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 3, pp. 356–364, Mar. 2010.

[34] A. Neale and M. Sachdev, “Digitally programmable SRAM timing for nano-scale technologies,” in Proc. 12th Int. Symp. Quality Electron. Design (ISQED), Mar. 2011, pp. 1–7.

[35] P. K. Das, “Precise on-chip clock skew measurement using sub-sampling and applications,” Ph.D. dissertation, Dept. Elect. Commun. Eng., Indian Inst. Sci., Bengaluru, India, 2012.

[36] B. Amrutur, P. K. Das, and R. Vasudevamurthy, “0.84 ps resolution clock skew measurement via subsampling,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 12, pp. 2267–2275, Dec. 2011.

[37] R. Rajsuman, “Design and test of large embedded memories: An overview,” IEEE Des. Test Comput., vol. 18, no. 3, pp. 16–27, May 2001.

[38] U. Arslan, M. P. McCartney, M. Bhargava, X. Li, K. Mai, and L. T. Pileggi, “Variation-tolerant SRAM sense-amplifier timing using configurable replica bitlines,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Sep. 2008, pp. 415–418.

[39] Cortex-A9 MBIST Controller Technical Reference Manual. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0414i/DDI0414I_cortex_a9_mbist_controller_r4p1_trm.pdf, accessed Aug. 27, 2015.



Viveka Konandur Rajanna (S’10) received the B.E. degree in electronics and communication engineering from the M. S. Ramaiah Institute of Technology, Bangalore, India, in 2005, and the M.Tech. degree in electronics design and technology from the Indian Institute of Science, Bangalore, in 2007, where he is currently pursuing the Ph.D. degree with the Department of Electrical Communication Engineering.

He was with Analog Devices Inc., Bangalore. His current research interests include custom digital circuit design for ultralow-power CMOS circuits and accurate on-chip delay generation and measurement for mitigating variability in deep submicrometer CMOS technologies.

Mr. Rajanna was a recipient of the Best Student Paper Award at the IEEE International Conference VLSI Design, Bangalore, in 2007.

Bharadwaj Amrutur (M’94–SM’13) received the B.Tech. degree in computer science and engineering from IIT Bombay, Mumbai, India, in 1990, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 1994 and 1999, respectively.

He was with Bell Labs, Murray Hill, NJ, USA, Agilent Labs, Palo Alto, CA, USA, and Green Field Networks, Sunnyvale, CA, USA. He is currently an Associate Professor with the Department of Electrical and Communication Engineering, Indian Institute of Science, Bangalore, India, where he is involved in VLSI circuits and systems.
