research presentation

Nirav A. Desai


mm-Wave Active Sensor: the BPSK spectrum can be seen on the spectrum analyzer

Nirav Desai



I assisted in these mm-wave MIMO experiments at UCSB



EE 5323: VLSI DESIGN 1 PROJECT
Course Instructor: Prof. Chris Kim

16-bit BRENT KUNG ADDER DESIGN in 45 nm CMOS
Nirav Desai
ID: 4280229

Department of Electrical and Computer Engineering
University of Minnesota



Brent Kung Adder Gate Level Diagram

1. Input Block with Pre Computation

[Figure: gate-level diagram of the input block computing Pi*Pi-1 and Gi + Pi*Gi-1, with output buffers to drive capacitive loads. Gate sizes along each labeled signal chain:]

Input Adder Chain 1: 1X, 1.224X, 1.097X, 3.883X, 10.1683X, 36X
Input Adder Chain 2: 1X, 1.562X, 1.553X, 3.043X
Input Adder Chain 3: 1X, 1.23X, 1.108X
Input Adder Chain 4: 1X, 1.274X, 1.034X, 2.943X, 10.8506X, 40X




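2. Intermediate Dot Product Blocks

[Figure: gate-level diagram of the intermediate dot product block computing Pi*Pi-1 and Gi + Pi*Gi-1, with Intermediate Adder Chain 1 and Intermediate Adder Chain 2 marked. Gate size labels: 1X (four gates), 1.72X, 6X, 4X, 16X, 16X.]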




3. Output Block for Post Computation

[Figure: output block with inputs Ci-1 and Pi and output Si. Gate size labels: 1.182X, 1.117X.]



Brent Kung Adder Transistor Level Design

XOR GATE




Inverter Design Optimization

• NMOS width = 90 nm
• PMOS / NMOS length = 50 nm
• Vdd = 1.1 V
• Current averaged over one period of 2 ns
• Optimal PMOS width = 165 nm
• βinverter = 165/90 = 1.833
• Sizing for NAND, NOR and XOR changed appropriately




1. Input Block with Pre Computation

Input Adder Block Chain 1

Gate Number   1        2          3        4          5          Load
Gate Name     BUFFER   INVERTER   NOR      INVERTER   NAND       LOAD
g value       1.000    1.000      1.646    1.000      1.352      36.000
f value       3.540    3.540      2.151    3.540      2.618648
b value       2.893    2.400      1.000    1.000      1.000      1.000
S value       1.000    1.224      1.097    3.883      10.16831   36.000
Stage G = 2.225, Stage F = 36.000, Stage B = 6.943, Stage H = 556.248, Gate h = 3.540

Input Adder Block Chain 2

Gate Number   1        2          3        4        Load
Gate Name     BUFFER   INVERTER   XOR      NAND     LOAD
g value       1.000    1.000      1.893    1.295    13.748
f value       4.518    4.518      2.386    3.488
b value       2.893    2.400      1.780    1.000    1.000
S value       1.000    1.562      1.553    3.043    13.748
Stage G = 2.451, Stage F = 13.748, Stage B = 12.359, Stage H = 416.510, Gate h = 4.518

Input Adder Block Chain 3

Gate Number   1        2          3        Load
Gate Name     BUFFER   INVERTER   NOR      LOAD
g value       1.000    1.000      1.646    3.941
f value       3.558    3.558      2.162
b value       2.893    2.400      1.000
S value       1.000    1.230      1.108    3.941
Stage G = 1.646, Stage F = 3.941, Stage B = 6.943, Stage H = 45.038, Gate h = 3.558

Input Adder Block Chain 4

Gate Number   1        2          3        4        5          Load
Gate Name     BUFFER   INVERTER   XOR      NAND     INVERTER   LOAD
g value       1.000    1.000      1.893    1.295    1.000      40.000
f value       3.686    3.686      1.947    2.847    3.686447
b value       2.893    2.400      1.000    1.000    1.000      1.000
S value       1.000    1.274      1.034    2.943    10.85056   40.000
Stage G = 2.451, Stage F = 40.000, Stage B = 6.943, Stage H = 680.832, Gate h = 3.686

3.94084

Logical Effort Design for Signal Chains labeled in previous slide #2
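The sizing arithmetic behind these tables can be sketched in a few lines of Python. This is an illustrative reconstruction and not the author's spreadsheet: the function name `size_chain` and the convention that each gate's input capacitance equals g times its S value are assumptions, chosen because they should reproduce the Chain 1 S values to within rounding.

```python
def size_chain(g, b, load):
    """Size a gate chain with the method of logical effort.

    g: logical effort of each gate in the chain
    b: branching effort at each gate's output
    load: load capacitance, in units of the first gate's input capacitance
    Returns (effort per gate h, list of scale factors S with S[0] = 1).
    Assumption: input capacitance of gate i is g[i] * S[i].
    """
    G = 1.0
    B = 1.0
    for gi, bi in zip(g, b):
        G *= gi
        B *= bi
    H = G * B * load              # path effort H = G * F * B
    h = H ** (1.0 / len(g))       # equal effort per gate
    cin = g[0] * 1.0              # first gate is fixed at S = 1
    sizes = [1.0]
    for i in range(len(g) - 1):
        f = h / g[i]              # the "f value" row: electrical fanout
        cin = cin * f / b[i]      # input capacitance of the next gate
        sizes.append(cin / g[i + 1])
    return h, sizes


# Input Adder Block Chain 1: BUFFER, INVERTER, NOR, INVERTER, NAND, load 36
h, sizes = size_chain(
    g=[1.0, 1.0, 1.646, 1.0, 1.352],
    b=[2.893, 2.4, 1.0, 1.0, 1.0],
    load=36.0,
)
print(round(h, 3), [round(s, 3) for s in sizes])
```

The same call with the g, b and load values of the other chains gives a quick cross-check of the remaining tables.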




2. Intermediate Dot Product Blocks

Logical Effort Design for Signal Chains labeled in previous slide #3

Intermediate Adder Block Chain 1

Gate Number   1          2        Load
Gate Name     INVERTER   NAND     LOAD
g value       1.000      1.352    1.000
f value       2.848      2.107    2.848
b value       1.000      1.000    1.000
S value       1.000      2.107    6.000
Stage G = 1.352, Stage F = 6.000, Stage B = 1.000, Stage H = 8.112, Gate h = 2.848

Intermediate Adder Block Chain 2

Gate Number   1        2        Load
Gate Name     BUFFER   NAND     LOAD
g value       1.000    1.352    2.848
f value       2.775    2.053
b value       2.000    1.000
S value       1.000    1.026
Stage G = 1.352, Stage F = 2.848, Stage B = 2.000, Stage H = 7.701, Gate h = 2.775



Brent Kung Adder Simulated Performance

Voltage (V)   Delay Max-C14 (ns)   Power Max (mW)   Power-Delay Product (pJ)
1.1           0.359                6.73             2.41
0.9           0.503                2.95             1.483
0.7           0.937                0.924            0.865

Simulations with maximally sized 1-stage buffers as determined by Logical Effort Design of individual chains

Voltage (V)   Delay Max-C14 (ns)   Power Max (mW)   Power-Delay Product (pJ)
1.1           0.403                5.186            2.089
0.9           0.569                2.277            1.295
0.7           1.069                0.692            0.739

Simulations with minimally sized 1-stage buffers

Without parasitic extraction and interconnect parasitics, buffering doesn't improve performance significantly.
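The power-delay product column follows directly from the other two columns, since ns times mW is pJ. The short Python check below recomputes it from the tabulated delay and power; the variable and function names are illustrative only.

```python
# (supply voltage in V, Delay Max-C14 in ns, Power Max in mW), copied from the slide
sized_buffers = [(1.1, 0.359, 6.73), (0.9, 0.503, 2.95), (0.7, 0.937, 0.924)]
minimum_buffers = [(1.1, 0.403, 5.186), (0.9, 0.569, 2.277), (0.7, 1.069, 0.692)]


def power_delay_product_pj(delay_ns, power_mw):
    """ns * mW = 1e-9 s * 1e-3 W = 1e-12 J, so the product is in picojoules."""
    return delay_ns * power_mw


for label, rows in (("sized", sized_buffers), ("minimum", minimum_buffers)):
    for vdd, delay_ns, power_mw in rows:
        pdp = power_delay_product_pj(delay_ns, power_mw)
        print(f"{label:8s} Vdd={vdd:.1f} V  PDP={pdp:.3f} pJ")
```

Lowering the supply from 1.1 V to 0.7 V cuts the power-delay product in both tables, at the cost of a much longer worst-case delay.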



Brent Kung Adder Worst Case Delay

Input Pattern: A: FFFF B: 0000 -> 0001

Dotted Lines show Carry Bits 15 and 14

Carry Bit 15 Carry Bit 14



Brent Kung Adder Layout

Input Block with Pre Computation

Input Inverters for Bit 0 and Bit 1

Output Buffers: PEX waveforms show larger size may be needed

[Layout labels: XOR, NAND, 10X]




XOR 1.553X




NAND 10.57X layout with interdigitated fingers to reduce parasitics




Intermediate Dot Product Generator

Output Buffers: PEX waveforms show larger size may be necessary here




Output Stage with Buffers




Full Layout: 49.5 µm × 48.6 µm



Future Design Modifications

• The design uses large buffers at the output of every stage to drive large capacitances.
• The buffers are not needed at nodes with low fanouts and can be eliminated.
• The buffers at input nodes currently cause more power consumption and add to the delay.
• Thus the overall performance can be improved with fewer buffers.



References:

Course Slides from Prof. Kia Bazargan’s Course on VLSI

A Taxonomy of Parallel Prefix Networks (David Harris), reference paper on course website

Digital Integrated Circuits by Jan Rabaey



SRAM DESIGN PROJECT PHASE 2

Nirav Desai
4280229

VLSI DESIGN 2: Prof. Kia Bazargan
Dept. of ECE

College of Science and Engineering
University of Minnesota, Twin Cities



SRAM CELL READ AND WRITE MARGIN FROM BUTTERFLY CURVE
• NMOS inverter = 110 nm, PMOS inverter = 220 nm, NMOS access = 90 nm
• NMOSinv/NMOSaccess = 1.2, PMOSinv/NMOSaccess = 2.4
• Cbitline = 0.747 fF for 512 cell array (interconnect parasitics from ASU PTM website)



SRAM CELL READ AND WRITE MARGIN FROM BUTTERFLY CURVE
• NMOS inverter = 150 nm, PMOS inverter = 555 nm, NMOS access = 180 nm
• NMOSinv/NMOSaccess = 1.2, PMOSinv/NMOSaccess = 3, Cbitline = 0.747 fF
• Curve shows SRAM cell is close to write failure.
• Bitline precharge to less than 1.1 V could be explored to increase SNM.




Simulation Setup

• M0, M1, M3, M4 form the cross-coupled inverter pair
• M5, M6 are access transistors
• C1, C2 are the bitline capacitances
• M7 is the precharge switch for the bitline (bit); V3 precharges the bitline to 0.8 V
• V6 precharges bitbar and writes a 0 to the cell

[Schematic node labels: V(write), V(ic), V(word), V(qbar), V(q), V(bitbar), V(bit)]




Timing Waveforms for Characterization

V(write) – Applied to source of M7 (precharge switch)

V(word) – Wordline Voltage

V(qbar)

V(q)

V(ic) – Enables the precharge switch M7

V(bitbar)

V(bit)

• V(write) precharges Cbit to 0.8 V via M7
• V(word) disables access transistors M5 and M6 during precharge
• V(qbar) and V(q) are used to generate the butterfly curves
• V(ic) enables M7 during precharge. It could be implemented as NOT(V(word))
• V(bitbar) precharges to 0.8 V, shows charge pumping when M7 turns off, and follows V(qbar) when the wordline is enabled
• V(bit) follows V(q) after the wordline is enabled
• V(bit) is precharged to Vdd by V6




PASS TRANSISTOR BASED TREE DESIGN

1:8 Row Decoder Tree

Similar Tree Decoder for 16 LSB Bits




TREE DECODER DESIGN



PASS TRANSISTOR BASED TREE DESIGN

[Figure: pass gate between IN and OUT with two CK gate inputs; W = 880, L = 50]

Identical sizing for NMOS and PMOS to minimize charge injection effects

• Delay drops by ~40ps/2 for every doubling of transistor widths
• Delay drop saturates at 89 ps around a width of 1000 nm
• Used W/L of 880/50 for final tree




TREE DECODER TIMING DIAGRAMS

The following waveforms were applied to the row and column selection inputs of the tree decoder





It takes one cycle for initializing the tree decoder after which we get clean pulses for each row output

LSB pulse is wider than MSB pulse in bottom figure to allow the tree to clear present state before next





The top waveform shows the matrix point output where the row and column select inputs are high. The output node discharges when the input goes low.




READ WRITE CIRCUIT ( Design by Bong Jin )

Sense Amplifier Write Driver

Precharge Circuit




READ WRITE CIRCUIT TEST SETUP

Bitline Capacitance estimate from ASU PTM Website

Cbit estimate for 512 rows

NMOS Switches to allow read without disabling write circuit

Single SRAM Cell for simulations




READ / WRITE TIMING WAVEFORMS

Precharge Pulse ( Active Low )

Data Meant to be written to cell

Write Enable Pulse

Read Enable Pulse

Output of Write Buffer

Disable output buffer ( tristate logic )

Bitline

Bitline Bar

Data Output

Data Out Bar




SRAM Cell Layout




2X2 SRAM Array Layout

[Layout labels: VDD, GND, GND, WORD 1, WORD 0, B0, B0BAR, B1, B1BAR]

This unit can be replicated in all directions without any changes. LVS check remaining.

Array size = 3.7975 µm × 2.4725 µm




References

Digital Integrated Circuits, Jan Rabaey, Anantha Chandrakasan, Borivoje Nikolic (SRAM Cell Design, Decoders, Read Write Circuits)

CMOS VLSI Design, Weste and Harris (Butterfly Curves)

CMOS Circuit Design, Layout and Simulation, Baker, Li, Boyce (Decoder Design)

Course slides of Prof. Kia Bazargan (Precharge Techniques, Decoders, SRAM Cell Design)




System Diagram for developing LMS Algorithm for Channel Estimation ( H(z) )

Errors e1 and e2 (e2 being the quantized error) could have the same convergence if the channel model H(z) is adapted using an LMS model

Next few slides show regular LMS and modified LMS Error Convergence

Adaptive DSP Course by Prof. Keshab Parhi



Error Convergence for regular LMS takes more time than the modified LMS




Modified LMS adapts all tap weights using different errors, computed using as many filter output estimates as the filter order. The assumption is that the optimum gradient direction for each tap weight is different and is given by the corresponding error.

Lattice predictors would be a more efficient way to do this than LMS, since each stage of a predictor is optimum for that order, unlike modified LMS, where each tap weight is adapted in a suboptimal manner.
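For reference, the regular LMS baseline that the modified scheme is compared against can be sketched as below. This covers only the standard single-error LMS update; the modified per-tap-error variant is not specified in enough detail on the slides to reproduce. The channel taps, step size `mu` and signal length are hypothetical values chosen for illustration.

```python
import random


def lms_identify(x, d, num_taps, mu):
    """Adapt FIR weights w so that the filter output tracks d (standard LMS)."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        # Most recent num_taps input samples, newest first (zero before start).
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))
        e = d[n] - y                                  # single shared error
        w = [wk + mu * e * uk for wk, uk in zip(w, u)]
        errors.append(e)
    return w, errors


random.seed(0)
channel = [0.8, -0.4, 0.2]        # hypothetical H(z) taps, illustration only
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
d = [sum(channel[k] * x[n - k] for k in range(len(channel)) if n - k >= 0)
     for n in range(len(x))]
w, errors = lms_identify(x, d, num_taps=3, mu=0.05)
print([round(wk, 3) for wk in w])
```

With a noise-free channel output the weights should settle close to the channel taps, and the `errors` list gives the convergence curve that the slides plot.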




EEG Spectral Estimates for Pre-Ictal, Ictal and Post-Ictal Signal Sequences




Spectral Estimation for a low pass filtered impulse sequence using different techniques




Correlograms provide best Spectral Estimates for Low Pass Filtered Impulse Trains




EE 5364 / CS 5204: Advanced Computer Architecture

Final Course Project on Design of a Branch Predictor

Prepared by:
Nirav Desai 4280229
Amanda Skinner 3749048

Course Instructor: Prof. Pen-Chung Yew

Department of ECE
University of Minnesota, Twin Cities


Nirav Desai 4280229 ECE, Amanda Skinner 3749048 CS

Why Branch Predictor?

• Branch predictors improve the flow of the instruction pipeline

• As branch predictor accuracy increases, cache misses decrease for both data and instruction caches


Why Branch Predictor?

• As branch predictor accuracy increases, cache misses go down

• Prefetching and increasing cache size decrease cache misses

Miss rate for the Mesa benchmark. Both the L1-data and L2 cache associativities were changed. [4]

Why Prefetching?



• LA-PC runs ahead of PC and keeps track of load and store instructions

• RPT keeps track of previous reference addresses and strides for load and store instructions

• L2 Cache prefetching can be done by storing spill over data and instructions from L1 Cache blocks.

• INTEL CORE 2 Duo uses RPT for L1 Cache Prefetching and Loop Counter Local Branch Predictor

Reference Prediction Table[1]
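The stride-tracking idea behind the RPT can be sketched as follows. This is a simplified illustration and not the full scheme of [1]: the class name, the single `confirmed` flag (standing in for the multi-state entries of the original design) and the example addresses are assumptions.

```python
class ReferencePredictionTable:
    """Simplified stride table: one entry per load/store instruction address."""

    def __init__(self):
        # instruction pc -> [last_address, stride, confirmed]
        self.entries = {}

    def access(self, pc, address):
        """Record a memory reference; return an address to prefetch, or None."""
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = [address, 0, False]
            return None
        last_address, stride, _ = entry
        new_stride = address - last_address
        # Prefetch only once the same nonzero stride has been seen twice in a row.
        confirmed = new_stride == stride and new_stride != 0
        self.entries[pc] = [address, new_stride, confirmed]
        return address + new_stride if confirmed else None


rpt = ReferencePredictionTable()
predictions = [rpt.access(pc=0x400100, address=a) for a in (1000, 1008, 1016, 1024)]
print(predictions)
```

A load walking an array with a constant stride starts producing prefetch addresses on its third access, which is the behaviour the look-ahead PC relies on.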



Design of Branch Predictor

• A loop counter would give high accuracy on matrix multiplication

• Track all registers for the loop counter, since different interleaved threads may use different registers

• A loop counter error would imply a dynamic update of registers based on non-local values

• Tag registers giving repeated conditional branch errors on the Branch Decision Table

• Use the O-GEHL predictor for all tagged branches

• Using the loop counter and duplicate ALU will allow indexing long histories with limited geometric length




Branch Decision Table

Column                     Entered by
Branch Address             LA-PC
Predicted Direction        Loop Counter or O-GEHL
Predicted Branch Target    Duplicate ALU
Actual Direction           PC
Actual Branch Target       PC
Counters Used C(i)(j)      O-GEHL
Tag                        O-GEHL

If prediction != actual decision:

• Prediction computed by Loop Counter? Yes: incorrect duplicate register values. Re-initialize the duplicate register stack and set LA-PC to PC. After 2 successive errors, make an entry in O-GEHL and tag the branch address in the Branch Decision Table to be used with O-GEHL.

• Prediction computed by O-GEHL? Yes: run the update equation on the counters listed in the table and set LA-PC to PC.



Loop Counter Branch Predictor

The loop counter looks at only the conditional branches: Op-Code = 4 (beq) OR Op-Code = 5 (bne). It can be extended to bgtz, blez.

Instruction fields: Op-Code: bits 31:26, rs: bits 25:21, rt: bits 20:16

[Flowchart]

1. Duplicate Register Flag == 0?
   Yes (first conditional branch): copy the register stack to the duplicate register stack (equivalent to initializing the duplicate register stack).
   No: the duplicate register stack is already initialized.

2. Set the register flag for rs and rt = 1. These registers will be tracked by the Duplicate ALU, which does addition and subtraction for all instructions having rs and rt with register flags set to 1.

3. Proceed to branch prediction computation:
   Op code == 4: rs == rt?
   Op code == 5: rs != rt?
   Yes: copy the offset from bits 15 to 0, sign extend the offset to bit 31 (total 32 bits), left shift by 2 (to get word address), and add to PC+4 to get the branch target address.
   No: increment LA-PC by 4.
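The field extraction and branch target arithmetic described for the loop counter can be sketched as below. The function names and the `regs` list (standing in for the duplicate register stack) are hypothetical; only the bit positions and the beq/bne op-codes come from the slide.

```python
def decode_branch(instruction):
    """Split a 32-bit MIPS I-type word into opcode, rs, rt and signed offset."""
    opcode = (instruction >> 26) & 0x3F   # bits 31:26
    rs = (instruction >> 21) & 0x1F       # bits 25:21
    rt = (instruction >> 16) & 0x1F       # bits 20:16
    offset = instruction & 0xFFFF         # bits 15:0
    if offset & 0x8000:                   # sign extend from bit 15
        offset -= 0x10000
    return opcode, rs, rt, offset


def predict_next_pc(pc, instruction, regs):
    """Return the next LA-PC for beq (opcode 4) or bne (opcode 5)."""
    opcode, rs, rt, offset = decode_branch(instruction)
    equal = regs[rs] == regs[rt]
    taken = (opcode == 4 and equal) or (opcode == 5 and not equal)
    if taken:
        # PC + 4 plus the sign-extended offset shifted left by 2
        return (pc + 4 + (offset << 2)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


# bne $1, $2, -3 at address 0x1000, with a loop counter still running in $1
regs = [0] * 32
regs[1] = 3
bne_word = (5 << 26) | (1 << 21) | (2 << 16) | 0xFFFD
print(hex(predict_next_pc(0x1000, bne_word, regs)))
```

Once the tracked register reaches the comparison value, the same call falls through to PC + 4, which is how the loop exit is predicted.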



O-GEHL Branch Predictor[2]

[Figure: ten counter tables. Table 1 holds C(1)(1) to C(1)(2), table 2 holds C(2)(1) to C(2)(4), table 3 holds C(3)(1) to C(3)(9), and so on up to table 10, which holds C(10)(1) to C(10)(266).]

History lengths go in geometric progression given by $L(i) = \alpha^{i-1} L(1) + \text{constant}$
Best series found from experiments: 2, 4, 9, 12, 18, 31, 54, 114, 145, 266

Dynamic history length fitting with variable α also possible.

$\text{Sum} = C(i)(j) + C(i+1)(k) + \dots + C(i+9)(l)$

• j, k, l, ... are incremented on every unconditional branch.
• j increments are modulo 2, k increments are modulo 4, l increments are modulo 266.
• Each C(i)(j) is a 4 bit saturating counter that counts -8 to 7.
• Counter update given by:
    if (p != out)
        if (branch == taken) c(i)(j)++
        if (branch != taken) c(i)(j)--
• Dynamic threshold (θ) fitting possible
• Threshold (θ) by default is 0.

Sum > θ then p = taken
Sum < θ then p = not taken
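The summation and update rule on this slide can be sketched in Python as below. It follows the simplified scheme described here (one counter selected per table by a running index) and is not the full predictor of [2], which indexes its tables with hashes of the branch address and global history. The class name is hypothetical, the indices are advanced on every update call, and a sum equal to θ is treated as not taken; both are assumptions.

```python
class SimplifiedOGEHL:
    """Sum of signed saturating counters, one counter selected per table."""

    def __init__(self, table_sizes=(2, 4, 9, 12, 18, 31, 54, 114, 145, 266),
                 threshold=0):
        self.tables = [[0] * size for size in table_sizes]
        self.indices = [0] * len(table_sizes)
        self.threshold = threshold

    def predict(self):
        total = sum(t[i] for t, i in zip(self.tables, self.indices))
        return total > self.threshold     # tie goes to not taken (assumption)

    def update(self, taken):
        """Apply the outcome of one branch and return what was predicted."""
        predicted = self.predict()
        if predicted != taken:
            for table, i in zip(self.tables, self.indices):
                if taken:
                    table[i] = min(table[i] + 1, 7)    # saturate at +7
                else:
                    table[i] = max(table[i] - 1, -8)   # saturate at -8
        # Advance each index modulo its own table size.
        self.indices = [(i + 1) % len(t)
                        for i, t in zip(self.indices, self.tables)]
        return predicted


predictor = SimplifiedOGEHL()
history = [predictor.update(True) for _ in range(600)]
print(history[0], history[-1])
```

Feeding a branch that is always taken shows the intended behaviour: the first predictions are wrong, the counters are pushed positive, and the sum then stays above the threshold.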



Duplicate ALU ( for MIPS )[3]

[Figure: LA-PC address, instruction, duplicate instruction queue and decode unit. Instruction fields: Op Code (bits 31-26), Reg 1 (bits 25-21), Reg 2 (bits 20-16), Reg 3 (bits 15-11).]

Compare Op-Code:
Op-Code == 4 OR 5 (beq, bne): use Loop Counter
Op-Code == 2 OR 3 (jump, jal): always take
Op-Code == 0 & FUNCT == 8 OR 9 (jr, jalr): always take

Branch target for jump (32 bits):
bits 31:28: 4 MSB bits of current PC+4
bits 27:2: jump target from instruction
bits 1:0: 00 (word addresses)

Branch target for branch (32 bits): current PC + 4 + bits 15:0 left shifted by 2 to give word addresses

Compare register flags for reg1, reg2, reg3. If register flags are set, do the computation for:
Op-Code: 0, bits(5:0) 32: add r1, r2, r3
Op-Code: 0, bits(5:0) 34: sub r1, r2, r3
Op-Code: 0, bits(5:0) 33: addu r1, r2, r3
Op-Code: 0, bits(5:0) 35: subu r1, r2, r3
Op-Code: 8: addi r1, constant
Op-Code: 9: addiu r1, constant

• Set LA-PC busy bit on instruction read
• When LA-PC updated by branch predictors, busy bit reset
• For arithmetic, reset busy bit after 2 cycles
• Instruction read when busy bit reset
• LA-PC different from that used in RPT

This branch predictor can be used on multi-threaded CPUs
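The op-code comparison and the jump target construction can be sketched as below. The function names and the returned labels are hypothetical; the op-code and FUNCT values and the bit layout of the target are the ones listed on the slide.

```python
def classify(instruction):
    """Return which mechanism handles a 32-bit MIPS instruction word."""
    opcode = (instruction >> 26) & 0x3F   # bits 31:26
    funct = instruction & 0x3F            # bits 5:0
    if opcode in (4, 5):
        return "loop_counter"             # beq, bne
    if opcode in (2, 3):
        return "always_taken"             # j, jal
    if opcode == 0 and funct in (8, 9):
        return "always_taken"             # jr, jalr
    return "not_a_branch"


def jump_target(pc, instruction):
    """Target of j/jal: top 4 bits of PC+4, the 26-bit field, then 00."""
    return ((pc + 4) & 0xF0000000) | ((instruction & 0x03FFFFFF) << 2)


j_word = (2 << 26) | 0x0100000            # j to word index 0x100000
print(classify(j_word), hex(jump_target(0x00400000, j_word)))
```

For jr and jalr the target comes from a register rather than from the instruction word, so `jump_target` applies only to op-codes 2 and 3.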



Test results on O-GEHL Branch Predictor[5]




References

1. Jean-Loup Baer, Tien-Fu Chen. An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty. Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195. Supercomputing '91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing.

2. Andre Seznec. The O-GEHL Branch Predictor. The 1st JILP Championship Branch Prediction Competition (CBP1), 2004. Available from www.jilp.org

3. David Patterson and John Hennessy. Computer Organisation and Design: The Hardware-Software Interface.

4. http://en.wikipedia.org/wiki/CPU_cache

5. Andre Seznec. Analysis of the Optimized GEHL Predictor. Available from: http://www.irisa.fr/caps/people/seznec/ISCA05.pdf



Research Ideas I am working on right now



Strained Silicon on SiGe Solar Cell

• Requires Chemical Vapor Deposition or MBE techniques for fabrication

• Tandem Solar Cell design gives a wide band of absorbable frequencies with different band gaps.

• Optimal thickness at quarter wavelength will give maximum absorption at designed frequency

• Back plate metal contacts and top plate fingered contacts

• Economically viable for charging battery packs in electric vehicles and for replacing LPG cooking gas cylinders.

• Long term viability for power generation feasible due to low operating costs and low distribution costs in a distributed model.

• Reference: Usami, N.; Takahashi, T.; Fujiwara, K.; Ujihara, T.; Sazaki, G.; Murakami, Y.; Nakajima, K. (Inst. for Mater. Res., Tohoku Univ., Sendai, Japan). Si/multicrystalline-SiGe heterostructure as a candidate for solar cells with high conversion efficiency. Conference Record of the Twenty-Ninth IEEE Photovoltaic Specialists Conference, 19-24 May 2002, pp. 247-249.



Rake Receiver with MDS Codes

• Rake receivers could be used to identify strongest multi path component from a received signal.

• This could be achieved by correlating the received signal with itself over different delays and finding the strongest delay component.

• This does not involve maximal ratio combining.

• It could be combined with MDS codes for wireless communications where given any d bits corrupted by channel noise or multi path effects, the signal could still be recovered uniquely.

• Reference: Lectures of Prof. Cutter on iTunesU under the course on Digital Communications 2 taught at MIT.

• Reference: W-CDMA Rake Receiver implementation in DSP: EE Times: Link: http://www.eetimes.com/electronics-news/4139933/W-CDMA-RAKE-Receiver-Comes-to-Life-in-DSP

• Reference: A Rake Receiver for Maximal Ratio Combining without Channel Estimation for UWB Communications: http://digitalcommons.unf.edu/cgi/viewcontent.cgi?article=1044&context=ojii_volumes
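The delay search described above, correlating the received signal with itself over different delays, can be sketched as below. The function name, the two-echo channel and the signal length are hypothetical values chosen for illustration; the lag found is the relative delay between multipath components.

```python
import random


def strongest_delay(received, max_delay):
    """Return the nonzero lag whose autocorrelation magnitude is largest."""
    best_lag, best_value = None, -1.0
    for lag in range(1, max_delay + 1):
        corr = sum(received[n] * received[n - lag]
                   for n in range(lag, len(received)))
        if abs(corr) > best_value:
            best_lag, best_value = lag, abs(corr)
    return best_lag


random.seed(1)
symbols = [random.choice((-1.0, 1.0)) for _ in range(2000)]
# Hypothetical channel: direct path plus echoes at 5 and 9 samples.
received = [symbols[n]
            + (0.6 * symbols[n - 5] if n >= 5 else 0.0)
            + (0.3 * symbols[n - 9] if n >= 9 else 0.0)
            for n in range(len(symbols))]
print(strongest_delay(received, max_delay=20))
```

Because only the single strongest lag is kept, no maximal ratio combining is involved, in line with the bullet above.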




Class S RF Power Amplifiers on GaN HEMTs

• Class S RF Power Amplifiers with fully differential H-Bridge topology could give a theoretical 100% efficiency.

• GaN HEMTs give the best high frequency switching characteristics.

• The 2 features could be combined to give a high efficiency RF power amplifier topology.

• Reference: Ph.D. Dissertation of Stephan Maroldt, University of Freiburg



Microprocessor Design

• The attached slides describe the design of a 16-bit Brent Kung adder and a 1024x16 asynchronous SRAM in 45 nm CMOS, along with the design of a branch predictor and cache prefetch unit for a MIPS microprocessor.

• These design ideas could be combined with other ideas for pipeline design, ALU design and interconnect circuit design to give a full physical layer design of a MIPS microprocessor in 45 nm CMOS.

• Various power reduction and clock gating techniques could be applied at a higher level of the hierarchy.



mm-wave MIMO OFDM

• mm-wave MIMO OFDM could be used for wireless backhaul networks due to its high capacity

• mm-wave MIMO systems could be extended to 2x2, 4x4, 8x8, etc topologies to exploit spatial diversity and get higher data rate.

• Reference: Sheldon, C.; Munkyo Seo; Torkildson, E.; Rodwell, M.; Madhow, U. (Dept. of Electr. & Comput. Eng., Univ. of California, Santa Barbara, CA, USA). 4 channel spatial multiplexing over a mm-wave line of sight link. IEEE MTT-S International Microwave Symposium Digest (MTT '09), 7-12 June 2009, pp. 389-392.



Routing algorithm to reduce congestion

• The routing algorithm to reduce congestion could be based on the idea of sparsity.

• High congestion nodes could be dropped from the network map till congestion on the node drops.

• The underlying packet streams would be using a flow control based routing protocol.

• Each node would store a map of the network which would be updated periodically using ping back messages.

• Could be applied to packet switched networks, traffic control and wireless sensor networks.
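The idea of dropping congested nodes from the network map can be sketched as a shortest-path search that skips them. This is a minimal illustration under stated assumptions: the function name, the hop-count metric, the congestion scale from 0 to 1 and the example network are all hypothetical, and the periodic ping-back updates of each node's map are not modelled.

```python
from collections import deque


def route(graph, source, target, congestion, limit):
    """Breadth-first shortest path that skips nodes whose congestion exceeds limit."""

    def usable(node):
        # Endpoints are always kept; other nodes are dropped while congested.
        return node in (source, target) or congestion.get(node, 0.0) <= limit

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous and usable(neighbor):
                previous[neighbor] = node
                queue.append(neighbor)
    return None   # no route once congested nodes are removed


network = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["D"], "D": []}
print(route(network, "A", "D", congestion={"B": 0.9}, limit=0.7))
print(route(network, "A", "D", congestion={"B": 0.2}, limit=0.7))
```

When congestion on B drops back under the limit, the shorter route through B is selected again, which corresponds to the node rejoining the map.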



Photonic Computers

• These could use multiplexer based logic gates.

• Photonic multiplexers have been widely researched and developed for optical communications.

• Phase detectors could be used to identify the phase and thus the value of the stored signal.

• These would use electronic charge storage and high speed electro-optic conversion.

• Reference: Prior research on this has been carried out in UCSB.
