
Post on 21-May-2020


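Nxfee Innovation (Semiconductor IP & VLSI IEEE Transaction & Product Development)
#45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, Pondicherry-4

Web: www.nxfee.com Email: [email protected] Ph: +91 9789443203, +91 9677783735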


VLSI19_LP01
Title: A 2.5-ps Bin Size and 6.7-ps Resolution FPGA Time-to-Digital Converter Based on Delay Wrapping and Averaging
Abstract: A high-resolution time-to-digital converter (TDC) implemented with field programmable gate array (FPGA) based on delay wrapping and averaging is presented. The fundamental idea is to pass a single clock through a series of delay elements to generate multiple reference clocks with different phases for input time quantization. Due to periodicity, those phases will be equivalently wrapped within one reference clock period to achieve the required fine resolution. In practice, a hybrid delay matrix is created to significantly reduce the required number of delay cells. Multiple TDC cores are constructed for parallel measurements and then exquisite routing control and averaging are applied to smooth out the large quantization errors caused by the inhomogeneity of the TDC delay lines for both linearity and single-shot precision enhancement. To reduce the impact of temperature sensitivity, a cancellation circuit is created to substantially reduce the offset and confine the output difference within 2 LSB for the same input interval over the full operation temperature range of FPGA. With such a fine resolution of 2.5 ps, the integral nonlinearity is measured to be from merely −2.98 to 3.23 LSB and the corresponding rms resolution is 4.99–6.72 ps. The proposed TDC is tested to be fully functional over 0 °C–50 °C ambient temperature range with extremely low resolution variation. Its performance is even superior to many full-custom-designed TDCs.
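To illustrate the delay-wrapping idea, the following Python sketch folds the phases of successively delayed clock edges into one reference period and quantizes an input time against them. The period, delay, and tap count are illustrative assumptions rather than values from the paper, and the function names are hypothetical. The hybrid delay matrix, multi-core averaging, and temperature cancellation are not modeled.

```python
def wrapped_phases(period_ps, delay_ps, n_taps):
    """Phase of each delayed clock edge, folded into one clock period and sorted."""
    return sorted((k * delay_ps) % period_ps for k in range(n_taps))


def quantize(t_ps, phases, period_ps):
    """Index of the bin (between consecutive wrapped edges) that contains t_ps."""
    t = t_ps % period_ps
    return sum(1 for p in phases if p <= t) - 1


# Assumed example values (not from the paper).
PERIOD_PS = 1000  # reference clock period
DELAY_PS = 375    # delay of one element, deliberately longer than one bin
N_TAPS = 8        # number of delayed reference clocks

phases = wrapped_phases(PERIOD_PS, DELAY_PS, N_TAPS)
# Width of each bin, including the one that wraps around the period boundary.
bin_sizes = [b - a for a, b in zip(phases, phases[1:] + [phases[0] + PERIOD_PS])]

print(phases)
print(bin_sizes)
print(quantize(300, phases, PERIOD_PS))
```

With these assumed values, a 375 ps delay element yields eight evenly spaced 125 ps bins after wrapping, so the bin size is finer than the delay of a single element.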

VLSI21_LP02
Title: Adaptive Multi-bit Crosstalk-Aware Error Control Coding Scheme for On-Chip Communication
Abstract: The presence of different noise sources and continuous increase in crosstalk in the deep submicrometer technology raised concerns for on-chip communication reliability, leading to the incorporation of crosstalk avoidance techniques in error control coding schemes. This brief proposes joint crosstalk avoidance with adaptive error control scheme to reduce the power consumption by providing appropriate communication resiliency based on runtime noise level. By switching between shielding and duplication as the crosstalk avoidance technique and between hybrid automatic repeat request and forward error correction as the error control policies, three modes of error resiliencies are provided. The results show that, in reduced mode, the scheme achieves up to 25.3% power savings at 3-mm wire length as compared to the original non-adaptive scheme at the cost of only 3.4% power overhead in high protection mode.

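VLSI22_LP03
Title: Coordinate Rotation-Based Low Complexity K-Means Clustering Architecture
Abstract: In this brief, we propose a low-complexity architectural implementation of the K-means-based clustering algorithm used widely in mobile health monitoring applications for unsupervised and supervised learning. The iterative nature of the algorithm computing the distance of each data point from a respective centroid for a successful cluster formation until convergence presents a significant challenge to map it onto a low-power architecture. This has been addressed by the use of a 2-D Coordinate Rotation Digital Computer-based low-complexity engine for computing the n-dimensional Euclidean distance involved during clustering. The proposed clustering engine was synthesized using the TSMC 130-nm technology library, and a place and route was performed following which the core area and power were estimated as 0.36 mm² and 9.21 mW at 100 MHz, respectively, making the design applicable for low-power real-time operations within a sensor node.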
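As a rough illustration of how a 2-D CORDIC stage can yield an n-dimensional Euclidean distance, the sketch below runs CORDIC in vectoring mode to obtain the magnitude of a 2-D vector and then folds the coordinate differences in one at a time. This is a floating-point emulation under assumed parameters (16 iterations, hypothetical function names), not the paper's architecture; hardware would use fixed-point shifts and adds.

```python
import math

ITERATIONS = 16  # assumed iteration count
# Accumulated CORDIC gain for the chosen number of micro-rotations.
GAIN = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(ITERATIONS))


def cordic_magnitude(x, y):
    """sqrt(x^2 + y^2) via vectoring-mode CORDIC (rotate the vector onto the x-axis)."""
    x, y = abs(x), abs(y)
    for i in range(ITERATIONS):
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        if y > 0:
            x, y = x + y * shift, y - x * shift
        else:
            x, y = x - y * shift, y + x * shift
    return x / GAIN  # remove the constant CORDIC gain


def euclidean_distance(point, centroid):
    """Fold each coordinate difference into a running 2-D magnitude."""
    acc = 0.0
    for p, c in zip(point, centroid):
        acc = cordic_magnitude(acc, p - c)
    return acc


print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))
```

The fold works because sqrt(sqrt(a² + b²)² + c²) equals sqrt(a² + b² + c²), so the same 2-D stage can be reused for every additional dimension.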


VLSI IEEE TRANSACTION - 2017

PROJECT TITLE FOR VLSI

LOW POWER





VLSI34_LP04
Title: Low-Power Scan-Based Built-In Self-Test Based on Weighted Pseudorandom Test Pattern Generation and Reseeding
Abstract: A new low-power (LP) scan-based built-in self-test (BIST) technique is proposed based on weighted pseudorandom test pattern generation and reseeding. A new LP scan architecture is proposed, which supports both pseudorandom testing and deterministic BIST. During the pseudorandom testing phase, an LP weighted random test pattern generation scheme is proposed by disabling a part of scan chains. During the deterministic BIST phase, the design-for-testability architecture is modified slightly while the linear-feedback shift register is kept short. In both the cases, only a small number of scan chains are activated in a single cycle. Sufficient experimental results are presented to demonstrate the performance of the proposed LP BIST approach.

VLSI40_LP05
Title: A Way-Filtering-Based Dynamic Logical–Associative Cache Architecture for Low-Energy Consumption
Abstract: Last-level caches (LLCs) help improve performance but suffer from energy overhead because of their large sizes. An effective solution to this problem is to selectively power down several cache ways, which, however, reduces cache associativity and performance and thus limits its effectiveness in reducing energy consumption. To overcome this limitation, we propose a new cache architecture that can logically increase cache associativity of way-powered-down LLCs. Our proposed scheme is designed to be dynamic in activating an appropriate number of cache ways in order to eliminate the need for static profiling to determine an energy-optimized cache configuration. The experimental results show that our proposed dynamic scheme reduces the energy consumption of LLCs by 34% and 40% on single- and dual-core systems, respectively, compared with the best performing conventional static cache configuration. The overall system energy consumption including CPU, L2 cache, and DRAM is reduced by 9.2% on quad-core systems.

VLSI41_LP06
Title: Resource-Efficient SRAM-based Ternary Content Addressable Memory
Abstract: Static random access memory (SRAM)-based ternary content addressable memory (TCAM) offers TCAM functionality by emulating it with SRAM. However, this emulation suffers from reduced memory efficiency while mapping the TCAM table on SRAM units. This is due to the limited capacity of the physical addresses in the SRAM unit. This brief offers a novel memory architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM functionality using optimal resources. The SRAM unit is divided into multiple virtual blocks to store the address information presented in the TCAM table. This approach virtually increases the overall address space of the SRAM unit, mapping a greater portion of the TCAM table in SRAM and increasing the overall emulated TCAM bits/SRAM at the cost of reduced throughput. A 72×28-bit REST consumes only one 36-kbit SRAM and a few distributed RAMs via implementation on a Xilinx Kintex-7 field-programmable gate array. It uses only 3.5% of the memory resources compared with a conventional SRAM-based TCAM (hybrid-partitioned TCAM).
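For readers unfamiliar with SRAM-emulated TCAM, the sketch below shows the general emulation idea in Python: the search key is split into chunks, each chunk addresses a table whose word is a bit vector of the rules that match that chunk, and the bit vectors are ANDed to find the matching rule. This is a generic illustration with hypothetical names and a toy rule set; it does not model REST's virtual blocks or its throughput trade-off.

```python
def build_tables(rules, key_bits, chunk_bits):
    """Build one lookup table per key chunk from ternary rules such as '10xx'."""
    tables = []
    for start in range(0, key_bits, chunk_bits):
        table = [0] * (1 << chunk_bits)
        for value in range(1 << chunk_bits):
            bits = format(value, f"0{chunk_bits}b")
            for rule_id, rule in enumerate(rules):
                pattern = rule[start:start + chunk_bits]
                # 'x' is a don't-care; other positions must match exactly.
                if all(p in ("x", b) for p, b in zip(pattern, bits)):
                    table[value] |= 1 << rule_id
        tables.append(table)
    return tables


def lookup(tables, key, chunk_bits):
    """Return the lowest-index matching rule for a binary key string, or None."""
    match = -1  # all ones: every rule is still a candidate
    for i, table in enumerate(tables):
        chunk = key[i * chunk_bits:(i + 1) * chunk_bits]
        match &= table[int(chunk, 2)]
    if match == 0:
        return None
    return (match & -match).bit_length() - 1  # priority-encode the lowest set bit


RULES = ["10xx", "1x01", "xxxx"]  # toy TCAM table (assumed for illustration)
TABLES = build_tables(RULES, key_bits=4, chunk_bits=2)

print(lookup(TABLES, "1001", 2))
print(lookup(TABLES, "1101", 2))
print(lookup(TABLES, "0000", 2))
```

Each table here plays the role of one SRAM block addressed by a slice of the key, which is why the memory cost grows with the number of chunk values rather than with the number of ternary patterns.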






VLSI42_LP07
Title: Write-Amount-Aware Management Policies for STT-RAM Caches
Abstract: Spin-transfer torque random access memory (STT-RAM) technology has emerged as one of the most promising memory technologies owing to its non-volatility, high density, and low-leakage power characteristics. However, STT-RAM has certain drawbacks such as high write energy consumption and limits to the number of write cycles. To enable the adoption of STT-RAM in the implementation of cache memories, new cache hierarchy management policies are required to overcome such drawbacks. In this brief, we evaluated several cache hierarchy management policies in the context of static random access memory L1 caches and an STT-RAM L2 cache. We found that a nonexclusive policy is superior to non-inclusive and exclusive policies in terms of energy consumption and endurance. We also propose a subblock-based management policy because the write energy consumption and endurance are proportional and inversely proportional to the amount of written data, respectively. A combination of the proposed policy with a nonexclusive policy reduces the L2 cache energy consumption by 33.3% (31.5%) and improves the lifetime by 56.3% (56.8%) in a single-core (quad-core) system.

VLSI45_LP08
Title: Fault Diagnosis Schemes for Low-Energy Block Cipher Midori Benchmarked on FPGA
Abstract: Achieving secure high-performance implementations for constrained applications such as implantable and wearable medical devices is a priority in efficient block ciphers. However, security of these algorithms is not guaranteed in the presence of malicious and natural faults. Recently, a new lightweight block cipher, Midori, has been proposed that optimizes the energy consumption besides having low latency and hardware complexity. In this paper, fault diagnosis schemes for variants of Midori are proposed. To the best of the authors’ knowledge, there has been no fault diagnosis scheme presented in the literature for Midori to date. The fault diagnosis schemes are provided for the nonlinear S-box layer and for the round structures with both 64-bit and 128-bit Midori symmetric key ciphers. The proposed schemes are benchmarked on a field programmable gate array and their error coverage is assessed with fault-injection simulations. These proposed error detection architectures make the implementations of this new low-energy lightweight block cipher more reliable.

VLSI50_LP09
Title: High-Throughput and Energy-Efficient Belief Propagation Polar Code Decoder
Abstract: Owing to their capacity-achieving performance and low encoding and decoding complexity, polar codes have received significant attention recently. Successive cancellation decoding (SCD) and belief propagation decoding (BPD) are two popular approaches for decoding polar codes. SCD, despite having less computational complexity when compared with BPD, suffers from long latency due to the serial nature of the SC algorithm. BPD, on the other hand, is parallel in nature and is more attractive for low-latency applications. However, due to the iterative nature of BPD, the required latency and energy dissipation increase linearly with the number of iterations. In this paper, we propose a novel scheme based on sub-factor graph freezing to reduce the average number of computations as well as the average number of iterations required by BPD, which directly translates into lower latency and energy dissipation. Simulation results show that the proposed scheme has no performance degradation and achieves significant reduction in computation complexity over the existing methods. Moreover, the hardware architecture for the proposed scheme is developed and compared with the state-of-the-art BPD implementations for (1024, 512) polar codes. A decoding throughput of 13.9 Gb/s is achieved along with a 60%–73% improvement in energy reduction and two times increase in hardware efficiency when compared with the existing BPD implementations.






VLSI60_LP10
Title: High-Speed Parallel LFSR Architectures Based on Improved State-Space Transformations
Abstract: Linear feedback shift register (LFSR) has been widely applied in BCH and CRC encoding. In order to increase the system throughput, the parallelization of LFSR is usually needed. Previously, a technique named state-space transformation was presented to reduce the complexity of parallel LFSR architectures. Exhaustive searches are performed to find good transformation matrix candidates. This brief proposes a new technique for construction of the transformation matrix together with a more efficient searching algorithm. The realization results indicate that the proposed architecture outperforms the prior arts, improving the hardware efficiency by around 35% and the corresponding searching algorithm finds the desirable transformation matrix much faster.
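The parallelization mentioned above can be sketched in Python as follows. Because a CRC-style LFSR is linear over GF(2), the state after p input bits is an XOR of fixed columns selected by the current state bits and the p input bits. The sketch builds those columns by simulating the serial LFSR on unit vectors and checks that the p-parallel update matches p serial steps. The polynomial, width, and parallelism are assumed example values and the function names are hypothetical; the state-space transformation and the matrix search proposed in the brief are not modeled.

```python
def serial_step(state, bit, poly, m):
    """One serial LFSR (CRC) step over GF(2); poly omits the x^m term."""
    feedback = ((state >> (m - 1)) & 1) ^ bit
    state = (state << 1) & ((1 << m) - 1)
    return state ^ poly if feedback else state


def run_serial(state, bits, poly, m):
    for bit in bits:
        state = serial_step(state, bit, poly, m)
    return state


def build_parallel(poly, m, p):
    """Columns of the p-step look-ahead matrices, obtained from unit vectors."""
    a_cols = [run_serial(1 << j, [0] * p, poly, m) for j in range(m)]
    b_cols = [run_serial(0, [int(i == k) for i in range(p)], poly, m)
              for k in range(p)]
    return a_cols, b_cols


def parallel_step(state, bits, a_cols, b_cols):
    """Consume p input bits in one step by XORing the selected columns."""
    nxt = 0
    for j, col in enumerate(a_cols):
        if (state >> j) & 1:
            nxt ^= col
    for k, col in enumerate(b_cols):
        if bits[k]:
            nxt ^= col
    return nxt


POLY, M, P = 0x07, 8, 4  # assumed example: CRC-8 (x^8 + x^2 + x + 1), 4 bits per step
A_COLS, B_COLS = build_parallel(POLY, M, P)

message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
state = 0
for i in range(0, len(message), P):
    state = parallel_step(state, message[i:i + P], A_COLS, B_COLS)

print(state == run_serial(0, message, POLY, M))
```

In hardware, each column corresponds to a fixed XOR network, so the cost of the parallel form depends on how dense these look-ahead matrices are, which is the quantity a state-space transformation aims to reduce.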

VLSI62_LP11
Title: Scalable Approach for Power Droop Reduction During Scan-Based Logic BIST
Abstract: The generation of significant power droop (PD) during at-speed test performed by Logic Built-In Self Test (LBIST) is a serious concern for modern ICs. In fact, the PD originated during test may delay signal transitions of the circuit under test (CUT): a [...] erroneously recognized as delay faults, with consequent erroneous generation of test fails and increase in yield loss. In this paper, we propose a novel scalable approach to reduce the PD during at-speed test of sequential circuits with scan-based LBIST using the launch[...] scheme. This is achieved by reducing the activity factor of the CUT, by proper modification of the test vectors generated by the LBIST of sequential ICs. Our scalable solution allows us to reduce PD to a value similar to that occurring during the CUT in field operation, without increasing the number of test vectors required to achieve target fault coverage (FC). We present a hardware implementation of our approach that requires limited area overhead. Finally, w[...] compared with recent alternative solutions providing a similar PD reduction, our approach enables a significant reduction of the number of test vectors (by more than 50%), thus the test time, to achieve a target FC.

VLSI61_LP12
Title: Stochastic Implementation and Analysis of Dynamical Systems Similar to the Logistic Map
Abstract: Stochastic computing (SC) is a digital computation approach that operates on random bit streams to perform complex tasks with much smaller hardware footprints compared with conventional binary radix approaches. SC works based on the assumption that input bit streams are independent random sequences of 1s and 0s. Previous SC efforts have avoided implementing functions that have feedback, because doing so has the potential for creating highly correlated inputs. We propose a number of solutions to overcome the challenges of implementing feedback in stochastic logic. We use a family of dynamical system functions that are similar to the well-known logistic map x→µx(1−x) as case studies. We show that complex behaviors, such as period doubling and chaos, do indeed occur in digital logic with only a few gates operating on a few 0s and 1s. Our energy consumption is between 21% and 31% of the conventional binary approach. In order to verify our design methodology, we have measured the mean switching rate between the basins of attraction of two coexisting fixed points and the peak width of the steady-state distribution of the output using a logistic[...] match well with our numerical experiments.
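The independence assumption can be made concrete with a small Python sketch. In unipolar stochastic computing a value in [0, 1] is the probability of a 1 in a bit stream, and an AND gate multiplies two independent streams. The sketch estimates x(1−x) with an AND gate and an inverter, once with two independent streams and once with the same stream reused, which is the correlated case that feedback tends to create. The stream length, seed, and function names are assumptions for illustration, and the scaling by µ is done in ordinary arithmetic for brevity; this is not the circuit from the paper.

```python
import random


def bitstream(p, n, rng):
    """Unipolar stochastic stream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def estimate(stream):
    """Recover the encoded value as the fraction of 1s."""
    return sum(stream) / len(stream)


def x_times_one_minus_x(x, n, rng, independent=True):
    """AND of a stream for x with the inverse of a second stream for x."""
    a = bitstream(x, n, rng)
    b = bitstream(x, n, rng) if independent else a  # reuse = fully correlated
    return estimate([u & (1 - v) for u, v in zip(a, b)])


def logistic_step(x, mu, n, rng):
    """One map iteration; multiplying by mu in binary is a simplification."""
    return min(1.0, mu * x_times_one_minus_x(x, n, rng))


rng = random.Random(1)  # fixed seed so the run is repeatable
good = x_times_one_minus_x(0.3, 100_000, rng)
bad = x_times_one_minus_x(0.3, 100_000, rng, independent=False)

print(good)  # close to 0.3 * 0.7
print(bad)   # a stream ANDed with its own inverse is always 0
```

The correlated case collapses to zero regardless of x, which shows why feeding an output stream straight back into the same logic needs decorrelation before the map can iterate meaningfully.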



rder to verify our design methodology, we have measured the mean switching rate between the basins of attraction of two coexisting fixed points and the peak width of the steadydistribution of the output using a logistic-map-like function as an exampmatch well with our numerical experiments.

HIGH SPEED DATA TRANSMISSION

VLSI13_HS01 Title: Efficient Designs of Multi-ported Memory on FPGA

Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multi-ported memory designs on field programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous xor-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multi-ported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%.

VLSI14_HS02 Title: High-Speed and Low-Latency ECC Processor Implementation Over GF(2^m) on FPGA

Abstract: In this paper, a novel high-speed elliptic curve cryptography (ECC) processor implementation for point multiplication (PM) on field-programmable gate array (FPGA) is proposed. A new segmented pipelined full-precision multiplier is used to reduce the latency, and the Lopez-Dahab Montgomery PM algorithm is modified for careful scheduling to avoid data dependency, resulting in a drastic reduction in the number of clock cycles (CCs) required. The proposed ECC architecture has been implemented on Xilinx FPGAs' Virtex4, Virtex5, and Virtex7 families. To the best of our knowledge, our single- and three-multiplier-based designs show the fastest performance to date when compared with reported works individually. Our one-multiplier-based ECC processor also achieves the highest reported speed together with the best reported area-time performance on Virtex4 (5.32 µs at 210 MHz), on Virtex5 (4.91 µs at 228 MHz), and on the more advanced Virtex7 (3.18 µs at 352 MHz). Finally, the proposed three-multiplier-based ECC implementation is the first work reporting the lowest number of CCs and the fastest ECC processor design on FPGA (450 CCs to get 2.83 µs on Virtex7).

VLSI26_HS03 Title: An On-Chip Monitoring Circuit for Signal-Integrity Analysis of 8-Gb/s Chip-to-Chip Interfaces With Source-Synchronous Clock

Abstract: This paper presents an on-chip monitoring circuit (OCMC) for analyzing the signal integrity of high speed signals for a chip-to-chip interface with a source synchronous clocking scheme. The proposed OCMC consists of a fractional-N phase-locked loop (PLL)-based frequency synthesizer, a high-bandwidth track-and-hold circuit, and a 10-bit analog-to-digital converter (ADC) to implement a subsampling scheme. The proposed fractional-N PLL-based frequency synthesizer improves the time jitter accumulated in a voltage controlled oscillator using a fractional frequency divider operated by an eight-phase clock. The bandwidth of the track-and-hold circuit is designed to be 6 GHz, using inductive peaking realized through a source follower. The OCMC samples 49 points over two unit intervals of a high-speed input signal when the frequency multiplication of the frequency synthesizer is 6.125/6. The 10-bit ADC uses the architecture of a pipelined successive approximation register ADC to reduce the power consumption and chip area. The proposed OCMC is implemented with 65-nm CMOS technology

and a 1.2 V supply. The 8-Gb/s chip-to-chip interface signal is reconstructed with time and voltage resolutions of 5.1 ps and 1.17 mV, respectively.
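The quoted 5.1 ps time resolution can be cross-checked against the other figures in this abstract. The sketch below is a consistency check only. It assumes the 49 sample points are evenly spaced across the two unit intervals of the 8-Gb/s signal, which the abstract does not state explicitly. All variable names are illustrative.

```python
from fractions import Fraction

DATA_RATE_GBPS = 8      # interface data rate from the abstract
POINTS = 49             # sample points quoted in the abstract
UNIT_INTERVALS = 2      # span covered by those points

# One unit interval at 8 Gb/s, in picoseconds.
ui_ps = 1000 / DATA_RATE_GBPS

# Assumption: samples are evenly spaced over the two unit intervals.
step_ps = UNIT_INTERVALS * ui_ps / POINTS

# The quoted multiplication factor 6.125/6 reduces to 49/48.
ratio = Fraction(6125, 6000)

print(f"UI = {ui_ps} ps, step = {step_ps:.2f} ps, ratio = {ratio}")
```

Under that assumption the spacing comes to roughly 5.1 ps, in line with the stated time resolution. The numerator of the reduced ratio also matches the 49 points.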

VLSI35_HS04 Title: A 2.4–3.6-GHz Wideband Sub-harmonically Injection-Locked PLL with Adaptive Injection Timing Alignment Technique

Abstract: This paper proposes a wideband sub-harmonically injection-locked PLL (SILPLL) with adaptive injection timing alignment technique. The SILPLL includes three main circuit blocks: one-oscillator-period constant-delay (OOPCD) divider, timing-adjusted phase detector (TPD), and pulse generator (PG). The proposed injection timing alignment technique can align the injection timing adaptively in a wide range of the output clock frequency using the two blocks (OOPCD and TPD) and a falling edge locking scheme of pulses. It can avoid the risk that SILPLL may lock to the wrong frequency or even fail to lock. The PG block is used for half-integral injection to relax the tradeoff between the phase noise of SILPLL and the output frequency resolution. The OOPCD circuit occupies a negligible area. After the injection timing alignment is finished, the OOPCD is powered off so that no extra power is consumed. The SILPLL is implemented in the 65-nm 1P9M CMOS process. It consumes 8.6 mW at 1.2 V supply and occupies an active core area of 1 × 0.6 mm². The measured output frequency range is 2.4–3.6 GHz with an output frequency resolution of 200 MHz and the phase noise is −127.6 dBc/Hz at an offset of 1 MHz from a carrier frequency of 3.4 GHz. The rms jitter integrated from 1 kHz to 30 MHz is less than 112 fs for all the covered frequency points. Under the supply voltage range from 1.1 to 1.3 V and the temperature range from −20 °C to 70 °C, the rms jitter variation of all the covered frequency points is less than 27 fs, which shows good robustness over environmental variation.

VLSI39_HS05 Title: Hardware-Efficient Built-In Redundancy Analysis for Memory With Various Spares

Abstract: Memory capacity continues to increase, and many semiconductor manufacturing companies are trying to stack memory dice for larger memory capacities. Therefore, built-in redundancy analysis (BIRA) is of utmost importance because the probability of fault occurrence increases with a larger memory capacity. A traditional spare structure that consists of simple rows and columns is somewhat inadequate for multiple memory blocks BIRA because the hardware overhead and spare allocation efficiency are degraded. The proposed BIRA uses various types of spares and can achieve a higher yield than a simple row and column spare structure. Herein, we propose a BIRA that can achieve an optimal repair rate using various spare types. The proposed analyzer can exhaustively search not only row and column spare types but also global and local spare types. In addition, this paper proposes a fault-storing content-addressable memory (CAM) structure. The proposed CAM is small and collects faults efficiently. The experimental results show a high repair rate with a small hardware overhead and a short analysis time.

VLSI43_HS06 Title: Fast Automatic Frequency Calibrator Using an Adaptive Frequency Search Algorithm

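Abstract: A new adaptive frequency search algorithm (A-FSA) is presented for a fast automatic frequency calibrator in wideband phase-locked loops (PLLs). The proposed A-FSA optimizes the number of clock counts for each frequency comparison cycle, depending on the difference between the target frequency and the PLL output frequency, as opposed to a binary frequency search algorithm (B-FSA), where the frequency search time per cycle is fixed. This eliminates unnecessary clocking times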

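during the frequency comparison process, and thus reduces the total PLL lock time. The additional circuitry needed for A-FSA is only a simple counter controller, thus minimizing hardware overhead. To verify the effectiveness of the proposed algorithm, two wideband PLLs are designed and simulated using a 65-nm CMOS technology: one with B-FSA, and the other with A-FSA. The latter achieves a lock time faster than the former by at least a factor of 2, even under worst case conditions.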
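A toy model can show the difference between a fixed and an adaptive counting window. It is not the paper's counter controller. The linear VCO band map, the reference frequency, the window limits and the window-doubling rule are all assumptions made for illustration, and every name below is hypothetical. The search stops counting as soon as the counter error exceeds one count, because beyond that point the sign of the frequency difference is unambiguous.

```python
F_REF = 20      # reference clock in MHz (hypothetical)
M_MAX = 64      # longest counting window in reference cycles (hypothetical)
BITS = 6        # width of the band-select code (hypothetical)
TARGET = 2337   # target frequency in MHz (hypothetical)

def vco_freq(code):
    """Hypothetical linear band map: 2000 MHz plus 10 MHz per code step."""
    return 2000 + 10 * code

def scaled_error(code, window):
    """Counter reading minus target count, scaled by F_REF to stay in integers."""
    count = (vco_freq(code) * window) // F_REF
    return count * F_REF - TARGET * window

def search(adaptive):
    """Successive-approximation band search; returns (code, reference cycles spent)."""
    code, cost = 0, 0
    for bit in reversed(range(BITS)):
        trial = code | (1 << bit)
        window = 1 if adaptive else M_MAX
        while True:
            cost += window
            err = scaled_error(trial, window)
            # More than one count of error means the sign can be trusted.
            if abs(err) > F_REF or window >= M_MAX:
                break
            window *= 2   # assumed rule: double the window when ambiguous
        if err <= 0:      # trial band is not above the target, keep the bit
            code = trial
    return code, cost

b_code, b_cost = search(adaptive=False)   # fixed window, B-FSA style
a_code, a_cost = search(adaptive=True)    # adaptive window, A-FSA style
print(b_code, b_cost, a_code, a_cost)
```

In this model both searches settle on the same band, and the adaptive one spends far fewer reference cycles because most comparisons are decided by a very short count. The doubling rule here can cost more than a fixed window on a single hard comparison, so the sketch does not reproduce the worst-case factor-of-2 claim in the abstract.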

VLSI49_HS07 Title: A High-Efficiency 6.78-MHz Full Active Rectifier with Adaptive Time Delay Control for Wireless Power Transmission

Abstract: This paper presents a full active rectifier consisting of GaN devices and a CMOS controller designed for wireless power transmission in high-power consumer devices. An adaptive time delay control circuit is developed to maximize the conduction interval of the GaN switch, which can significantly reduce the power loss caused by the forward voltage imposed by the diode. The proposed control algorithm also eliminates the reverse leakage current of the rectifier, and thus further improves its power transfer efficiency. The controller, implemented in a high-voltage 0.18-µm CMOS process, and the power stage, consisting of four GaN transistors, are assembled on the same printed circuit board (PCB). The proposed rectifier provides a maximum output current of 3 A at 5 V, with a 6.78-MHz ac input voltage.

VLSI24_HS08 Title: Scalable Device Array for Statistical Characterization of BTI-Related Parameters

Abstract: A device array circuit, scalable in terms of the number of transistors used, is proposed. The proposed array facilitates accurate and simultaneous bias voltage application to a large number of devices, making it suitable for the measurement-based statistical characterization of device degradation, known as bias temperature instability. Using the proposed array, the degradation measurement of thousands of transistors is made possible in a practical amount of time. The experimental results show that the defect-centric model can approximate the statistical variation in magnitudes of threshold voltage shifts (delta-VTH) and that the variance of delta-VTH bears an inverse relationship to the channel areas of transistors. The degradation variability under ac stress conditions is also presented for the first time.

AREA EFFICIENT/ TIMING & DELAY REDUCTION

VLSI04_AE01 Title: VLSI Design of 64bit × 64bit High Performance Multiplier with Redundant Binary Encoding

Abstract: For multiplier-dominated applications such as digital signal processing, wireless communications, and computer applications, high-speed multiplier design has always been a primary requisite. In this paper, a high-performance 16x16 bit redundant binary (RB) multiplier has been designed using a recently proposed redundant binary encoding approach to eliminate the error correcting word, together with a delay-efficient parallel prefix Ling adder for the final redundant binary to normal binary (RB-NB) conversion. Since redundant binary (RB) representation allows carry-free addition and adaptability, it has been used in the 16x16 bit high-performance RB multiplier design for summation of partial product terms. The multiplier design also removes a redundant partial product accumulation stage by eliminating the error correcting word, which improves the complexity and the critical path delay. The performance of the RB multiplier design is compared with a conventional RB modified Booth encoding multiplier (CRBMBE). The comparison is based on synthesis results obtained by synthesizing both multiplier architectures targeting a Xilinx FPGA in terms of area and delay analysis.


VLSI05_AE02 Title: A Method to Design Single Error Correction Codes with Fast Decoding for a Subset of Critical Bits

Abstract: Single error correction (SEC) codes are widely used to protect data stored in memories and registers. In some applications, such as networking, a few control bits are added to the data to facilitate their processing. For example, flags to mark the start or the end of a packet are widely used. Therefore, it is important to have SEC codes that protect both the data and the associated control bits. It is attractive for these codes to provide fast decoding of the control bits, as these are used to determine the processing of the data and are commonly on the critical timing path. In this brief, a method to extend SEC codes to support a few additional control bits is presented. The derived codes support fast decoding of the additional control bits and are therefore suitable for networking applications.
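For readers unfamiliar with SEC codes, a minimal sketch of the textbook Hamming(7,4) code is given here. It is not the extended code construction proposed in the brief, and it makes no attempt at fast decoding of selected bits; the function names are chosen for this example only.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as positions 1..7."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # inject a single-bit error
print(hamming74_decode(word))         # [1, 0, 1, 1]
```

In this plain code every data bit waits for the full syndrome before it can be corrected, which is the latency a fast-decoding scheme for control bits would aim to shorten.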

VLSI07_AE03 Title: ENFIRE: A Spatio-Temporal Fine-Grained Reconfigurable Hardware

Abstract: Field programmable gate arrays (FPGAs) are well-established as fine-grained reconfigurable computing platforms. However, FPGAs demonstrate poor scalability in advanced technology nodes due to the large negative impact of the elaborate programmable interconnects (PIs). The need for such vast PIs arises from two key factors: 1) fine-grained bit-level data manipulation in the configurable logic blocks and 2) the purely spatial computing model followed in the FPGAs. In this paper, we propose ENFIRE, a novel memory-based spatio-temporal framework designed to provide the flexibility of reconfigurable bit-level information processing while improving scalability and energy efficiency. Dense 2-D memory arrays serve as the main computing elements storing not only the data to be processed but also the functional behavior of the application mapped into lookup tables. Computing elements are spatially distributed, communicating as needed over a hierarchical bus interconnect, while the functions are evaluated temporally inside each computing element. A custom software framework facilitates application mapping to the framework. By leveraging both spatial and temporal computing, ENFIRE significantly reduces the interconnect overhead when compared with FPGA. Simulation results show an improvement of 7.6× in energy, 1.6× in energy efficiency, 1.1× in leakage, and 5.3× in unified energy efficiency, a metric that considers energy and area together, compared with comparable FPGA implementations.

VLSI08_AE04 Title: Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs

Abstract: Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07–4.05× for division, and 3.11–3.81× for square root, and use 90.7–94.6% less area than dedicated fused multiply–add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69–7.28× for addition/subtraction, 1.22–2.03× for multiplication, 14.4× for division, and 31.9× for square root, and use 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply–add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply–add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11–15.9× and use 38.2–95.3% less area than dedicated FMA hardware.
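To show why leading zero detection matters for software floating point, the sketch below normalizes a mantissa by counting its leading zeros and shifting once, instead of shifting bit by bit in a loop. The 32-bit width and the helper names are assumptions for this example; the semantics of the actual USL instructions are not given in the abstract.

```python
def count_leading_zeros(x, width=32):
    """Number of zero bits above the most significant 1 of a width-bit word."""
    if x == 0:
        return width
    return width - x.bit_length()

def normalize(mantissa, exponent, width=32):
    """Shift the mantissa left until its top bit is set and adjust the exponent."""
    if mantissa == 0:
        return 0, exponent
    shift = count_leading_zeros(mantissa, width)
    return (mantissa << shift) & ((1 << width) - 1), exponent - shift

m, e = normalize(0x00F00000, 10)
print(hex(m), e)   # 0xf0000000 2
```

A fixed-point core without such support would need a data-dependent loop for the same step, which is one place where a small hardware addition can raise software FP throughput.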

VLSI11_AE05 Title: Efficient Soft Cancelation Decoder Architectures for Polar Codes

Abstract: The flooding belief propagation (FO-BP) and the soft-cancelation (SCAN) algorithms are the two most popular soft-output BP algorithms for the decoding of capacity-achieving polar codes. The FO-BP algorithm has high throughput at the cost of performance degradation in high signal-to-noise ratio (SNR) region or with large block length. The SCAN algorithm has much better decoding performance while suffering from long decoding latency and low throughput. In this paper, an improved BP algorithm, named reduced complexity soft cancelation (RCSC) algorithm, is proposed. Compared with the SCAN algorithm, the number of memory entries required by the RCSC algorithm is reduced by more than 50% in general, while achieving comparable or even better (e.g., when block size N = 2^15) decoding performance. When block size is large (e.g., N ≥ 2^15), the proposed RCSC algorithm reduces the required memory entries by more than 23% compared with the state-of-the-art FO-BP algorithm. The numerical results show that the error performance improvement of the RCSC algorithm is more significant when the SNR increases. For a different tradeoff, a reduced latency soft-cancelation (RLSC) algorithm is proposed to reduce the decoding latency and increase the throughput of the RCSC algorithm while slightly sacrificing decoding performance. Finally, the optimized VLSI architectures are presented for the RCSC and RLSC algorithms, respectively. The synthesis results demonstrate the efficiency of the proposed algorithms and architectures.

VLSI12_AE06 Title: Low-Complexity Digit-Serial Multiplier Over GF(2^m) Based on Efficient Toeplitz Block Toeplitz Matrix–Vector Product Decomposition

Abstract: In this paper, we have shown that a regular Toeplitz matrix-vector product (TMVP) can be transformed into a Toeplitz block TMVP (TBTMVP) using a suitable permutation matrix. Based on the TBTMVP representation, we have proposed a new (a,b)-way TBTMVP decomposition algorithm for implementing a digit-serial multiplication. Moreover, it is shown that, based on iterative block recombination, we can improve the space complexity of the proposed TBTMVP decomposition. From the synthesis results, we have shown that the proposed TBTMVP-based multiplier involves less area, less area–delay product, and higher throughput compared with the existing digit-serial multipliers.
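The idea of decomposing a TMVP can be illustrated with the classic two-way split over GF(2), which replaces four half-size products with three. This is a simpler, well-known scheme and not the (a,b)-way TBTMVP decomposition proposed in the paper; all function names are chosen for this sketch, and the vector length is assumed to be even.

```python
def toeplitz(t, n):
    """n x n Toeplitz matrix over GF(2) built from its 2n-1 diagonal values t."""
    return [[t[i - j + n - 1] for j in range(n)] for i in range(n)]

def matvec(M, v):
    """Matrix-vector product over GF(2): AND for multiply, parity for sum."""
    return [sum(a & b for a, b in zip(row, v)) & 1 for row in M]

def madd(A, B):
    return [[a ^ b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def vadd(u, v):
    return [a ^ b for a, b in zip(u, v)]

def tmvp_two_way(T, v):
    """Compute T*v with three half-size products instead of four."""
    m = len(v) // 2
    A = [row[:m] for row in T[:m]]    # top-left block, equal to bottom-right
    B = [row[m:] for row in T[:m]]    # top-right block
    C = [row[:m] for row in T[m:]]    # bottom-left block
    v0, v1 = v[:m], v[m:]
    p0 = matvec(madd(A, B), v1)
    p1 = matvec(madd(A, C), v0)
    p2 = matvec(A, vadd(v0, v1))
    return vadd(p0, p2) + vadd(p1, p2)

T = toeplitz([1, 0, 1, 1, 0, 0, 1], 4)
print(tmvp_two_way(T, [1, 1, 0, 1]) == matvec(T, [1, 1, 0, 1]))   # True
```

The split works because the two diagonal blocks of a Toeplitz matrix are identical, so the shared product `p2` serves both halves of the result.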

VLSI15_AE07 Title: Hybrid LUT Multiplexer FPGA Logic Architectures

Abstract: Hybrid configurable logic block architectures for field-programmable gate arrays that contain a mixture of look-up tables and hardened multiplexers are evaluated toward the goal of higher logic density and area reduction. Multiple hybrid configurable logic block architectures, both non-fracturable and fracturable with varying MUX:LUT logic element ratios, are evaluated across two benchmark suites using a custom tool flow consisting of LegUp-HLS, Odin-II front-end synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing, and architecture exploration. Technology mapping optimizations that target the proposed architectures are also implemented within ABC. Experimentally, we show that for non-fracturable architectures, without any mapper optimizations, we naturally save up to ∼8% area post place and route; both accounting for complex logic block and routing area while maintaining mapping depth. With architecture-aware technology mapper optimizations in ABC, additional area is saved, post-place-and-route. For fracturable architectures, experiments show that only marginal gains are seen after place-and-route up to ∼2%.


VLSI16_AE08 Title: Sign-Magnitude Encoding for Efficient VLSI Realization of Decimal Multiplication

Abstract: Decimal X×Y multiplication is a complex operation, where intermediate partial products (IPPs) are commonly selected from a set of pre-computed radix-10 X multiples. Some works require only [0,5]×X via recoding digits of Y to one-hot representation of signed digits in [−5,5]. This reduces the selection logic at the cost of one extra IPP. Two's complement signed-digit (TCSD) encoding is often used to represent IPPs, where dynamic negation (via one XOR per bit of X multiples) is required for the recoded digits of Y in [−5,−1]. In this paper, despite generation of 17 IPPs, for 16-digit operands, we manage to start the partial product reduction (PPR) with 16 IPPs that enhance the VLSI regularity. Moreover, we save 75% of negating XORs via representing precomputed multiples by sign-magnitude signed-digit (SMSD) encoding. For the first-level PPR, we devise an efficient adder, with two SMSD input numbers, whose sum is represented with TCSD encoding. Thereafter, multilevel TCSD 2:1 reduction leads to two TCSD accumulated partial products, which collectively undergo a special early initiated conversion scheme to get at the final binary-coded decimal product. As such, a VLSI implementation of 16×16-digit parallel decimal multiplier is synthesized, where evaluations show some performance improvement over previous relevant designs.
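The digit recoding mentioned in the abstract can be sketched in software as follows. Digits of Y above 5 are replaced by a negative digit plus a carry into the next position, so only the multiples 0×X to 5×X need to be precomputed. This particular carry rule yields digits in [−4, 5], a subset of the [−5, 5] range named in the abstract; the function names are assumptions, and the SMSD/TCSD encodings and the reduction tree are not modelled.

```python
def recode_signed_digits(y):
    """Recode the decimal digits of y >= 0 into signed digits, least significant first."""
    digits = []
    carry = 0
    while y or carry:
        d = y % 10 + carry
        y //= 10
        if d > 5:
            d -= 10      # e.g. 7 becomes -3 with a carry into the next digit
            carry = 1
        else:
            carry = 0
        digits.append(d)
    return digits or [0]

def multiply_with_recoding(x, y):
    """Multiply using only the precomputed multiples 0*x .. 5*x."""
    multiples = [k * x for k in range(6)]
    total = 0
    for i, d in enumerate(recode_signed_digits(y)):
        ipp = multiples[abs(d)]
        total += (-ipp if d < 0 else ipp) * 10 ** i
    return total

print(recode_signed_digits(97))                      # [-3, 0, 1]
print(multiply_with_recoding(1234, 5678))            # 7006652
print(len(recode_signed_digits(9999999999999999)))   # 17
```

The last line shows the cost noted in the abstract: a 16-digit operand can recode into 17 signed digits, hence one extra IPP.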

VLSI20_AE09 Title: FPGA Realization of Low Register Systolic All-One-Polynomial Multipliers over GF(2^m) and Their Applications in Trinomial Multipliers

Abstract: Systolic all-one-polynomial (AOP) multipliers usually suffer from the problem of high register complexity, especially in field-programmable gate array (FPGA) platforms where the register resources are not that abundant. In this paper, we have shown that the AOP-based systolic multipliers can easily achieve low register-complexity implementations and the proposed architectures can be employed as computation cores to derive efficient implementations of systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting scheme in which the register complexity involved within existing AOP-based systolic multipliers is significantly reduced. We have found out that the modified AOP-based structure can be packed as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm that can fully employ the proposed AOP-based computation core. The proposed Montgomery algorithm employs a novel precomputed modular operation, and the systolic structures based on this algorithm fully inherit the advantages brought from the AOP-based core (low register complexity, low critical-path delay, and low latency) except some marginal hardware overhead brought by a precomputation unit. The proposed architectures are then implemented by Xilinx ISE 14.1 and it is shown that compared with the existing designs, the proposed designs achieve at least 61.8% and 47.6% less area-delay product and power delay product than the best of competing designs, respectively.
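AOP fields are attractive because reduction is almost free: modulo the AOP x^m + ... + x + 1, x^(m+1) equals 1, so multiplying by a power of x is a cyclic rotation. The bit-level model below illustrates that property only. It is not the systolic architecture or the data broadcasting scheme of the paper, the function name is an assumption, and m must be chosen so that the AOP is irreducible (for example m = 2, 4, 10, 12).

```python
def aop_multiply(a, b, m):
    """Multiply a and b in GF(2^m) defined by the AOP; bit i holds the coefficient of x^i."""
    n = m + 1
    full = (1 << n) - 1
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            # a * x^i modulo x^(m+1) + 1 is a cyclic left rotation by i bits
            acc ^= ((a << i) | (a >> (n - i))) & full
    # x^m = x^(m-1) + ... + x + 1 modulo the AOP: fold the top bit into the lower bits
    if (acc >> m) & 1:
        acc ^= full
    return acc

print(bin(aop_multiply(0b10, 0b10, 2)))   # 0b11, since x * x = x + 1 in GF(4)
```

Working modulo x^(m+1) + 1 first is valid because that polynomial is (x + 1) times the AOP, so a final fold of the single top bit completes the reduction.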

Complexity Transformed Encoder Architectures for Quasi-Cyclic NonCodes Over Subfields

-cyclic low-density parity-check (QC-LDPC) codes are adopted in many digital communication and storage systems. The encoding of these codes is traditionally done by multiplying the message vector with a generator matrix consisting of dense circulant submatrices. To reduce the encoder complexity, this paper introduces two schemes making use of finite Fourier transform. We focus on QC-LDPC codes whose circulant submatrices are of

) × (2r−1

) and the entries are elements of GF(2p), where p divides r, and hence,

) is a subfield of GF(2r). These cover a broad range of codes, and binary LDPC codes are a

special case. Making use of conjugacy constraints, low-complexity architectures are developed

NXFEE INNOVATION PRODUCT DEVELOPMENT COMPANY

Product Development) #45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, Pondicherry-4

, +91 9677783735.

Magnitude Encoding for Efficient VLSI Realization of Decimal Multiplication

multiplication is a complex operation, where intermediate partial products 10Xmultiples. Some works

hot representation of signed digits in[−5,5]. This reduces the selection logic at the cost of one extra IPP. Two’s complement signed-digit (TCSD) encoding is often used to represent IPPs, where dynamic negation (via one xor per bit of X

.In this paper, despite generation of 17 digit operands, we manage to start the partial product reduction (PPR) with 16 IPPs

that enhance the VLSI regularity. Moreover, we save 75% of negating xor's via representing digit (SMSD) encoding. For the first-level PPR,

we devise an efficient adder, with two SMSD input numbers, whose sum is represented with TCSD encoding. Thereafter, multilevel TCSD 2:1 reduction leads to two TCSD accumulated partial

ucts, which collectively undergo a special early initiated conversion scheme to get at the coded decimal product. As such, a VLSI implementation of 16×16-digit parallel

decimal multiplier is synthesized, where evaluations show some performance improvement over

Polynomial Multipliers over GF (2m)

usually suffer from the problem of high programmable gate array (FPGA) platforms where the

register resources are not that abundant. In this paper, we have shown that the AOP-based complexity implementations and the proposed

architectures can be employed as computation cores to derive efficient implementations of systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting

based systolic multipliers is based structure can be packed

as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm based computation core. The proposed Montgomery

algorithm employs a novel precomputed modular operation, and the systolic structures based on based core (low register

path delay, and low latency) except some marginal hardware overhead brought by a precomputation unit. The proposed architectures are then implemented by Xilinx

mpared with the existing designs, the proposed designs achieve at delay product and power delay product than the best of

Cyclic Non-binary LDPC

LDPC) codes are adopted in many digital communication and storage systems. The encoding of these codes is traditionally done by

ssage vector with a generator matrix consisting of dense circulant submatrices. To reduce the encoder complexity, this paper introduces two schemes making use

LDPC codes whose circulant submatrices are of ), where p divides r, and hence,

). These cover a broad range of codes, and binary LDPC codes are a tectures are developed

Nxfee Innovation (Semiconductor IP &#45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, Pondicherry

Web: www.nxfee.com Email:

for finite Fourier and inverse transforms over subfields in this paper. In addition, composite field arithmetic is exploited to eliminate the computations associated with message mapping and reduce the complexity of Fourier transformgenerator matrix consists of circulants of dimension 63×63 with GF(2encoders achieve 22% area reduction compared with the conventional encoders without sacrificing the throughput.

VLSI58_AE11 Title: Antiwear Leveling Design for SSDs With Hybrid ECC Capability

Abstract: With the joint considerations of reliability and performance, hybrid error correction code (ECC) becomes an option in the designs of solidleveling (WL) might result in the early performance degradation to SSDs, which is common with a limited number of P/E cycles, due to the efforts to delay the bitan anti-WL design is proposed to avoid suSSDs with hybrid ECC capability can be improved without sacrificing their reliability. The capability of the proposed design was evaluated by a series of experiments, for which it was shown that the propoup to 50% without affecting the endurance of the investigated SSDs, compared with traditional approaches.

VLSI33_AE12 Title: Energy-Efficient VLSI Realization of Binary64 Division with Redundant Number Systems

Abstract: VLSI realizations of digit-recurrence binary division usually use redundant representation of partial remainders and quotient digits. The former allows for fast carry-free computation of the next partial remainder, and the latter reduces the number of required divisor multiples. In studying the previous relevant works, we have noted that the binary carry save (CS) number system is prevalent in the representation of partial remainders, and redundant high radix representation of quotient digits is popular in order to reduce the cycle count. In this paper, we explore a design space containing four division architectures. These are based on binary CS or radix-16 signed digit (SD) representations of partial remainders. On the other hand, they use full or partial precomputation of divisor multiples. The latter uses a smaller multiplexer at the cost of two extra adders, where one of the operands is constant within all cycles. The quotient digits are represented by radix-16 [−9, 9] SDs. Our synthesis-based evaluation of VLSI realizations of the best previous relevant work and the four proposed designs shows reduced power and energy figures in the proposed designs at the cost of more silicon area and delay measures. However, our energy-delay product is 26%–35% less than that of the reference work.


Audio, Image and Video Processing

VLSI02_IM01 Title: A Dual-Clock VLSI Design of H.265 Sample Adaptive Offset Estimation for 8k Ultra-HD TV Encoding

Abstract: Sample adaptive offset (SAO) is a newly introduced in-loop filtering component in H.265/High Efficiency Video Coding (HEVC). While SAO contributes to a notable coding efficiency improvement, the estimation of SAO parameters dominates the complexity of in-loop filtering in HEVC encoding. This paper presents an efficient VLSI design for SAO estimation. Our design features a dual-clock architecture that processes statistics collection (SC) and parameter decision (PD), the two main functional blocks of SAO estimation, at high- and low-speed clocks, respectively. Such a strategy reduces the overall area by 56% by addressing the heterogeneous data flows of SC and PD. To further improve the area and power efficiency, algorithm-architecture co-optimizations are applied, including a coarse range selection (CRS) and an accumulator bit width reduction (ABR). CRS shrinks the range of fine processed bands for the band offset estimation. ABR further reduces the area by narrowing the accumulators of SC. They together achieve another 25% area reduction. The proposed VLSI design is capable of processing 8k encoding at 120 frames/s. It occupies 51k logic gates, only one-third of the circuit area of the state-of-the-art implementations.

VLSI03_IM02 Title: RoBA Multiplier: A Rounding-Based Approximate Multiplier for High-Speed yet Energy-Efficient Digital Signal Processing

Abstract: In this paper, we propose an approximate multiplier that is high speed yet energy efficient. The approach is to round the operands to the nearest exponent of two. This way the computationally intensive part of the multiplication is omitted, improving speed and energy consumption at the price of a small error. The proposed approach is applicable to both signed and unsigned multiplications. We propose three hardware implementations of the approximate multiplier, including one for the unsigned and two for the signed operations. The efficiency of the proposed multiplier is evaluated by comparing its performance with those of some approximate and accurate multipliers using different design parameters. In addition, the efficacy of the proposed approximate multiplier is studied in two image processing applications, i.e., image sharpening and smoothing.
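A minimal Python sketch of the rounding idea in the RoBA abstract above. The abstract does not give the arithmetic, so the decomposition used here is an assumption: it relies on the exact identity A×B = Ar×B + Br×A − Ar×Br + (Ar − A)(Br − B), where Ar and Br are the operands rounded to powers of two, and drops the last term. The function names and the tie-breaking rule (ties round up) are hypothetical, and only unsigned operands are handled.

```python
def round_to_power_of_two(x: int) -> int:
    """Round a positive integer to the nearest power of two (assumed: ties round up)."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    lower = 1 << (x.bit_length() - 1)  # largest power of two <= x
    upper = lower << 1                 # smallest power of two > x
    return lower if (x - lower) < (upper - x) else upper


def approx_multiply(a: int, b: int) -> int:
    """Approximate a*b for positive integers by dropping (ar - a) * (br - b)."""
    ar = round_to_power_of_two(a)
    br = round_to_power_of_two(b)
    # Every product below has a power-of-two factor, so in hardware each
    # one could be a shift rather than a full multiplication.
    return ar * b + br * a - ar * br


if __name__ == "__main__":
    a, b = 13, 7
    print(a * b, approx_multiply(a, b))
```

With operands 13 and 7 the rounded values are 16 and 8, so the sketch returns 88 against an exact product of 91; the difference is the dropped term (16 − 13) × (8 − 7) = 3.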

VLSI06_IM03 Title: Energy-Efficient Reduce-and-Rank Using Input-Adaptive Approximations

Abstract: Approximate computing is an emerging design paradigm that exploits the intrinsic ability of applications to produce acceptable outputs even when their computations are executed approximately. In this paper, we explore approximate computing for a key computation pattern, reduce-and-rank (RnR), which is prevalent in a wide range of workloads, including video processing, recognition, search, and data mining. An RnR kernel performs a reduction operation (e.g., distance computation, dot product, and L1-norm) between an input vector and each of a set of reference vectors, and ranks the reduction outputs to select the top reference vectors for the current input. We propose three complementary approximation strategies for the RnR computation pattern. The first is interleaved reduction-and-ranking, wherein the vector reductions are decomposed into multiple partial reductions and interleaved with the rank computation. Leveraging this transformation, we propose the use of intermediate reduction results and ranks to identify future computations that are likely to have a low impact on the output, and can, hence, be approximated. The second strategy, input similarity-based approximation, exploits the spatial or temporal correlation of inputs (e.g., pixels of an image or frames of a video) to identify computations that are amenable to approximation. The third strategy, reference vector reordering, rearranges the order in which the reference vectors are processed such that vectors that are relatively more critical in evaluating the correct output are processed at the beginning of RnR operation. The number of these critical reference vectors is usually small, which renders a substantial portion of the total computation amenable to approximation. These strategies address a key challenge in approximate computing (identification of which computations to approximate) and may be used to drive any approximation mechanism, such as computation skipping or precision scaling, to realize performance and energy improvements. A second key challenge in approximate computing is that the extent to which computations can be approximated varies significantly from application to application, and across inputs for even a single application. Hence, input-adaptive approximation, or the ability to automatically modulate the degree of approximation based on the nature of each individual input, is essential for obtaining optimal energy savings. In addition, to enable quality configurability in RnR kernels, we propose a kernel-level quality metric that correlates well to application-level quality, and identify key parameters that can be used to tune the proposed approximation strategies dynamically. We develop a runtime framework that modulates the identified parameters during the execution of RnR kernels to minimize their energy while meeting a given target quality. To evaluate the proposed concepts, we designed quality-configurable hardware implementations of six RnR-based applications from the recognition, mining, search, and video processing application domains in 45-nm technology.
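The reduce-and-rank pattern and the interleaving idea from the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the L1 distance is used as the reduction, and the pruning rule (after each partial reduction, keep only the `keep` best-ranked reference vectors and skip the rest) is a hypothetical stand-in for computation skipping, not the selection rule of the paper. All function and parameter names are made up for the example.

```python
import heapq
from typing import List, Optional, Sequence


def l1_partial(query: Sequence[int], ref: Sequence[int], start: int, stop: int) -> int:
    """Partial L1 reduction over the element range [start, stop)."""
    return sum(abs(q - r) for q, r in zip(query[start:stop], ref[start:stop]))


def rnr_exact(query: Sequence[int], refs: Sequence[Sequence[int]], k: int) -> List[int]:
    """Reduce (full L1 distance to every reference), then rank (k closest indices)."""
    dists = [l1_partial(query, ref, 0, len(query)) for ref in refs]
    return heapq.nsmallest(k, range(len(refs)), key=lambda i: dists[i])


def rnr_interleaved(query: Sequence[int], refs: Sequence[Sequence[int]], k: int,
                    chunk: int = 2, keep: Optional[int] = None) -> List[int]:
    """Interleave partial reductions with ranking and skip poorly ranked references."""
    keep = max(k, 2 * k if keep is None else keep)  # assumed pruning budget
    partial = [0] * len(refs)
    alive = list(range(len(refs)))
    for start in range(0, len(query), chunk):
        for i in alive:
            partial[i] += l1_partial(query, refs[i], start, start + chunk)
        # Rank on intermediate results; dropped references get no further work.
        alive = heapq.nsmallest(keep, alive, key=lambda i: partial[i])
    return heapq.nsmallest(k, alive, key=lambda i: partial[i])


if __name__ == "__main__":
    query = [1, 2, 3, 4]
    refs = [[1, 2, 3, 4], [10, 10, 10, 10], [1, 2, 3, 5], [0, 0, 0, 0]]
    print(rnr_exact(query, refs, k=2), rnr_interleaved(query, refs, k=2, keep=2))
```

With a generous `keep` the interleaved version returns the same ranking as the exact kernel; shrinking `keep` trades accuracy for fewer partial reductions, which is the kind of quality knob the abstract describes.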

VLSI23_IM04 Title: Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers

Abstract: In this paper, we propose four 4:2 compressors, which have the flexibility of switching between the exact and approximate operating modes. In the approximate mode, these dual-quality compressors provide higher speeds and lower power consumptions at the cost of lower accuracy. Each of these compressors has its own level of accuracy in the approximate mode as well as different delays and power dissipations in the approximate and exact modes. Using these compressors in the structures of parallel multipliers provides configurable multipliers whose accuracies (as well as their powers and speeds) may change dynamically during the runtime. The efficiencies of these compressors in a 32-bit Dadda multiplier are evaluated in a 45-nm standard CMOS technology by comparing their parameters with those of the state-of-the-art approximate multipliers. The results of comparison indicate, on average, 46% and 68% lower delay and power consumption in the approximate mode. Also, the effectiveness of these compressors is assessed in some image processing applications.
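To make the exact/approximate switch concrete, here is a small bit-level Python model of a 4:2 compressor. The exact mode is the textbook construction from two chained full adders, which satisfies x1 + x2 + x3 + x4 + cin = sum + 2·(carry + cout). The approximate mode is a made-up simplification chosen only for illustration (it ignores cin and never raises cout); it is not one of the four designs proposed in the paper, and all names are hypothetical.

```python
from itertools import product
from typing import Tuple


def full_adder(a: int, b: int, c: int) -> Tuple[int, int]:
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int,
                   approximate: bool = False) -> Tuple[int, int, int]:
    """Return (sum, carry, cout) of a 4:2 compressor in the selected mode."""
    if approximate:
        # Hypothetical approximate mode: cin is ignored and cout is tied to 0.
        return x1 ^ x2 ^ x3 ^ x4, (x1 & x2) | (x3 & x4), 0
    s1, cout = full_adder(x1, x2, x3)
    sum_bit, carry = full_adder(s1, x4, cin)
    return sum_bit, carry, cout


def value(sum_bit: int, carry: int, cout: int) -> int:
    """Numeric weight of the outputs: carry and cout both have weight 2."""
    return sum_bit + 2 * (carry + cout)


def count_mismatches(approximate: bool) -> int:
    """Count the input patterns (out of 32) whose output value is wrong."""
    return sum(
        1
        for bits in product((0, 1), repeat=5)
        if value(*compressor_4_2(*bits, approximate=approximate)) != sum(bits)
    )


if __name__ == "__main__":
    print("exact mismatches:", count_mismatches(False))
    print("approximate mismatches:", count_mismatches(True))
```

Counting mismatches over all 32 input patterns gives a simple accuracy figure per mode, which is the kind of per-compressor accuracy level the abstract refers to.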

VLSI47_IM05 Title: An FPGA-Based Hardware Accelerator for Traffic Sign Detection

Abstract: Traffic sign detection plays an important role in a number of practical applications, such as intelligent driver assistance and roadway inventory management. In order to process the large amount of data from either real-time videos or large off-line databases, a high-throughput traffic sign detection system is required. In this paper, we propose an FPGA-based hardware accelerator for traffic sign detection based on cascade classifiers. To maximize the throughput and power efficiency, we propose several novel ideas, including: 1) rearranged numerical operations; 2) shared image storage; 3) adaptive workload distribution; and 4) fast image block integration. The proposed design is evaluated on a Xilinx ZC706 board. When processing high-definition (1080p) video, it achieves the throughput of 126 frames/s and the energy efficiency of 0.041 J/frame.
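Cascade classifiers commonly evaluate their features from sums of pixels over rectangular blocks, and a standard software way to get such sums in constant time is the integral image (summed-area table). The Python sketch below shows that general technique only; the abstract does not say how the accelerator implements its "fast image block integration", so treating it as an integral image is an assumption, and the function names are hypothetical.

```python
from typing import List


def integral_image(img: List[List[int]]) -> List[List[int]]:
    """Summed-area table with an extra zero row and column at the top-left."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def block_sum(ii: List[List[int]], top: int, left: int, height: int, width: int) -> int:
    """Sum of the pixel block starting at (top, left), using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    table = integral_image(image)
    print(block_sum(table, 1, 1, 2, 2))  # block covering 5, 6, 8, 9
```

Once the table is built, every block sum costs four lookups regardless of block size, which is why this structure suits classifiers that test many windows per frame.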


VLSI63_IM06 Title: Soft Error Rate Reduction of Combinational Circuits Using Gate Sizing in the Presence of Process Variations

Abstract: Soft errors in combinational logic circuits are emerging as a significant reliability concern for nanoscale VLSI designs. This paper presents a novel sensitivity-based gate sizing methodology to reduce the soft error rate (SER) of combinational circuits in the presence of process variations. The proposed method is based on modeling the statistics of SER of the circuit gates as a random variable to formulate a statistical optimization problem. A backward traversing algorithm with capability for incremental analysis is developed for computing the distribution of circuit gates of SER random variables. We present a gate resizing algorithm in which the gates with the most contribution to the circuit SER are selected in a candidate set using a statistical ordering approach. The proposed algorithm trades off SER reduction and area overheads. The experimental results show that using the proposed methodology, the circuit statistical SER can be reduced by up to 56.4% compared with the 14.8% SER reduction of a circuit obtained using the worst case methodology at the expense of 10% area overhead under 10% process variation ratio. The results also show that the proposed method achieves about 40% more SER reduction compared with that obtained using closed-form analysis for statistical soft error rate (CASSER), the most recently published similar work, in the same experimental conditions. Comparing the runtime of the proposed optimization algorithm with the optimization based on CASSER, it is observed that the proposed method is two orders of magnitude faster than CASSER due to its incremental analysis property.

VLSI30_IM07 Title: Time-Encoded Values for Highly Efficient Stochastic Circuits

Abstract: Stochastic computing (SC) is a promising technique for applications that require low area overhead and fault tolerance, but can tolerate relatively high latency. In the SC paradigm, logical computation is performed on randomized bit streams. In prior work, streams were generated with linear feedback shift registers; these contributed heavily to the hardware cost and consumed a significant amount of power. This paper introduces a new approach for encoding signal values: computation is performed on analog periodic pulse signals. Exploiting pulse width modulation, time-encoded signals corresponding to specific values are generated by adjusting the frequency and duty cycles of pulse width modulated (PWM) signals. With this approach, the latency, area, and energy consumption are all greatly reduced. Experimental results on image processing applications show up to 99% performance speedup, 98% saving in energy dissipation, and 40% area reduction compared to prior stochastic approaches. Circuits synthesized with the proposed approach can work as fast and energy-efficiently as a conventional binary design while retaining the fault tolerance and low cost of conventional stochastic designs.
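The following Python sketch is a discrete-time toy model of the time-encoding idea, not the analog circuit of the paper. It assumes two PWM signals whose periods (in time steps) are coprime; by the Chinese remainder theorem every combination of phases then occurs exactly once over one common period, so the fraction of time the AND of the two signals is high equals the product of their duty cycles. The function names and the choice of an AND gate as the multiplier (the usual stochastic-computing multiplier for unipolar values) are assumptions made for the example.

```python
from fractions import Fraction
from math import gcd
from typing import List


def pwm(period: int, high_steps: int, n: int) -> List[int]:
    """n samples of a PWM signal that is high for the first high_steps of each period."""
    return [1 if (t % period) < high_steps else 0 for t in range(n)]


def multiply_duty_cycles(p1: int, k1: int, p2: int, k2: int) -> Fraction:
    """Fraction of time (PWM1 AND PWM2) is high over one common period."""
    if gcd(p1, p2) != 1:
        raise ValueError("this sketch assumes coprime periods")
    n = p1 * p2  # common period of the two signals
    a, b = pwm(p1, k1, n), pwm(p2, k2, n)
    return Fraction(sum(x & y for x, y in zip(a, b)), n)


if __name__ == "__main__":
    # Duty cycles 2/5 and 3/4 encode the values 0.4 and 0.75.
    print(multiply_duty_cycles(5, 2, 4, 3))
```

Unlike a random bit stream, the result here is exact after a single common period, which hints at why deterministic pulse signals can cut the latency of stochastic circuits.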

VLSI28_IM08 Title: Design of Power and Area Efficient Approximate Multipliers

Abstract: Approximate computing can decrease the design complexity with an increase in performance and power efficiency for error resilient applications. This brief deals with a new design approach for approximation of multipliers. The partial products of the multiplier are altered to introduce varying probability terms. Logic complexity of approximation is varied for the accumulation of altered partial products based on their probability. The proposed approximation is utilized in two variants of 16-bit multipliers. Synthesis results reveal that two proposed multipliers achieve power savings of 72% and 38%, respectively, compared to an exact multiplier. They have better precision when compared to existing approximate multipliers. Mean relative error figures are as low as 7.6% and 0.02% for the proposed approximate multipliers, which are better than the previous works. Performance of the proposed multipliers is evaluated

NXFEE INNOVATION SEMICONDUCTOR IP & PRODUCT DEVELOPM

Semiconductor IP & VLSI IEEE Transaction & Product Development)

#45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, PondicherryEmail: [email protected] Ph: +91 9789443203, +91 9677783735.

Soft Error Rate Reduction of Combinational Circuits Using Gate Sizing in the Presence of Process Variations

Soft errors in combinational logic circuits are emerging as a significant reliability concern for nano scale VLSI designs. This paper presents a novel sensitivitymethodology to reduce the soft error rate (SER) of combinational circuits in the presence ofprocess variations. The proposed method is based on modeling the statistics of SER of the circuit gates as a random variable to formulate a statistical optimization problem. A backward traversing algorithm with capability for incremental analysis is developed for computing the distribution of circuit gates of SER random variables. We present a gate resizing algorithm in which the gates with the most contribution to the circuit SER are selected in a candidate set using a statistical ordering approach. The proposed algorithm trades off SER reduction and area overheads. The experimental results show that using the proposed methodology, the circuit statistical SER can be reduced by up to 56.4% compared with the 14.8% SER reduction of a circuit obtained using thworst case methodology at the expense of 10% area overhead under 10% process variation ratio. The results also show that the proposed method achieves about 40% more SER reduction compared with that obtained using closed-form analysis for statistical soft(CASSER), the most recently published similar work, in the same experimental conditions. Comparing the runtime of the proposed optimization algorithm with the optimization based on CASSER, it is observed that the proposed method is two orders of magnitude faster than CASSER due to its incremental analysis property.

Time-Encoded Values for Highly Efficient Stochastic Circuits

Stochastic computing (SC) is a promising technique for applications that require low area overhead and fault tolerance, but can tolerate relatively high latency. In the SC paradigm, logical computation is performed on randomized bit streams. In prior work, streams were generated with linear feedback shift registers; these contributed heavily to the hardware cost and consumed a significant amount of power. This paper introduces a new approach for encoding signal values: computation is performed on analog periodic pulse signals. Exploiting pulse width modulation, time-encoded signals corresponding to specific values are generated by adjusting the frequency and duty cycles of pulse width modulated (PWM) signals. With this approach, the latency, area, and energy consumption are all greatly reduced. Experimental results on image processing applications show up to 99% performance speedup, 98% saving in energy dissipation, and 40% area reduction compared to prior stochastic approaches. Circuits synthesized with the proposed approach can work as fast and energy-efficiently as a conventional binary design while retaining the fault-tolerance and low-cost advantages of conventional stochastic designs.
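The idea of encoding a value in the duty cycle of a periodic pulse can be sketched in discrete time. The example below is an illustration under stated assumptions, not the paper's analog circuit: it samples two PWM waveforms with coprime periods and ANDs them, which is the usual stochastic-computing way to multiply two values in [0, 1]. The function names and the tick-based sampling are hypothetical.

```python
from math import gcd


def pwm_samples(high_ticks, period_ticks, length):
    """Sampled PWM waveform: 1 for the first `high_ticks` of every period."""
    return [1 if (t % period_ticks) < high_ticks else 0 for t in range(length)]


def pwm_and_multiply(high1, period1, high2, period2):
    """Multiply the duty cycles high1/period1 and high2/period2 with an AND gate.

    With coprime periods, every combination of phases occurs exactly once
    over period1 * period2 ticks, so the fraction of ones at the AND output
    equals the product of the two duty cycles.
    """
    if gcd(period1, period2) != 1:
        raise ValueError("periods must be coprime for an exact result")
    length = period1 * period2
    x = pwm_samples(high1, period1, length)
    y = pwm_samples(high2, period2, length)
    ones = sum(a & b for a, b in zip(x, y))
    return ones / length
```

For example, `pwm_and_multiply(3, 5, 2, 7)` evaluates duty cycles 3/5 and 2/7 and returns 6/35. No random number generator is involved, which is where the saving over LFSR-based stream generation comes from.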



VERIFICATION

VLSI31_VE01 Title: COMEDI: Combinatorial Election of Diagnostic Vectors From Detection Test Sets for Logic Circuits

Abstract: Although the modern automatic test pattern generation (ATPG) tools can efficiently produce near-optimal test sets with high fault-coverage for a circuit-under-test, a diagnostic test set (DTS), which is needed for fault localization, is much more challenging to construct. The DTS is used to analyze the responses of failing chips during manufacturing test for the purpose of identifying the root cause of observed errors. In this paper, a novel technique for selecting a powerful DTS for stuck-at faults from a pool of ATPG detection vectors is proposed. Unlike existing methods, this technique does not use any diagnostic test generation, circuit modification, or miter-based approach. It constructs a combinatorial cover of the pool to determine a test set with high diagnostic coverage (DC). Two variants of the covering algorithm are proposed based on this technique. The experimental results on several combinational and scan-based benchmark circuits demonstrate the effectiveness of our method in terms of the size of the DTS, DC, and CPU time.
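Selecting diagnostic vectors from a pool can be viewed as a covering problem over fault pairs. The sketch below is a generic greedy cover, not either of the paper's two covering variants, and it simplifies a test response to pass/fail: a test distinguishes two faults when it detects exactly one of them. Diagnostic coverage is taken here as the fraction of fault pairs distinguished; all names and the data layout are assumptions.

```python
from itertools import combinations


def distinguished_pairs(detected, faults):
    """Fault pairs told apart by one test: it detects exactly one of the two."""
    return {(f, g) for f, g in combinations(sorted(faults), 2)
            if (f in detected) != (g in detected)}


def greedy_diagnostic_set(detection, faults):
    """Greedily pick tests until no test distinguishes a new fault pair.

    `detection` maps a test name to the set of faults it detects.
    Returns (chosen tests, diagnostic coverage); needs at least two faults.
    """
    pairs_by_test = {t: distinguished_pairs(d, faults) for t, d in detection.items()}
    covered, chosen = set(), []
    while True:
        best = max(pairs_by_test,
                   key=lambda t: len(pairs_by_test[t] - covered),
                   default=None)
        if best is None or not (pairs_by_test[best] - covered):
            break
        chosen.append(best)
        covered |= pairs_by_test[best]
    total_pairs = len(faults) * (len(faults) - 1) // 2
    return chosen, len(covered) / total_pairs
```

With `detection = {"t1": {"f1", "f2"}, "t2": {"f1", "f3"}, "t3": {"f1"}}` and four faults `f1` to `f4`, two of the three tests are enough to distinguish every pair.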

VLSI44_VE02 Title: Reordering Tests for Efficient Fail Data Collection and Tester Time Reduction

Abstract: During fail data collection, a tester collects information that is useful for defect diagnosis. If fail data collection can be terminated early, the tester time as well as the volume of fail data will be reduced. Test reordering can enhance the ability to terminate the process early without affecting the quality of diagnosis. In this paper, test reordering targets logic defects based on information that is derived during defect diagnosis. The defect diagnosis procedure is enhanced to identify tests that are useful for defect diagnosis across a sample of faulty instances of a circuit. Tests that are determined to be useful for more faulty instances of a circuit are placed earlier in the test set based on the expectation that the same tests will be useful for other faulty instances of the circuit. The experimental results for logic defects in benchmark circuits support the effectiveness of this approach and indicate that test reordering helps to terminate fail data collection early without impacting the diagnosis quality.
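The ordering rule in the abstract (tests useful for more faulty instances go first) can be sketched as a simple count-and-sort. This is only an illustration: how a test is judged "useful" during diagnosis is not modelled, and the function name and input layout are hypothetical.

```python
from collections import Counter


def reorder_tests(original_order, useful_tests_per_instance):
    """Place tests that were useful for more faulty instances earlier.

    `useful_tests_per_instance` holds, for each diagnosed faulty instance,
    the collection of tests found useful for diagnosing it. Ties keep the
    original relative order because Python's sort is stable.
    """
    usefulness = Counter()
    for useful in useful_tests_per_instance:
        usefulness.update(set(useful))
    return sorted(original_order, key=lambda t: -usefulness[t])
```

For instance, with the original order `t1, t2, t3, t4` and a sample in which `t3` was useful three times, `t2` twice and `t4` once, the new order is `t3, t2, t4, t1`.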

NETWORKING

VLSI51_NOC01 Title: Multicast-Aware High-Performance Wireless Network-on-Chip Architectures

Abstract: Today's multiprocessor platforms employ the network-on-chip (NoC) architecture as the preferable communication backbone. Conventional NoCs are designed predominantly for unicast data exchanges. In such NoCs, the multicast traffic is generally handled by converting each multicast message to multiple unicast transmissions. Hence, applications dominated by multicast traffic experience high queuing latencies and significant performance penalties when running on systems designed with unicast-based NoC architectures. Various multicast mechanisms such as XY-tree multicast and path multicast have already been proposed to enhance the performance of the traditional wireline mesh NoC incorporating multicast traffic. However, even with such added features, the multihop nature of the wireline mesh NoC leads to high network latencies and thus limits the achievable system performance. In this paper, to sustain the high-bandwidth and high-throughput requirements of emerging applications, we propose the design of a wireless NoC (WiNoC) architecture incorporating necessary multicast support. By integrating congestion-aware multicast routing with network coding, the WiNoC is able to efficiently handle heavy multicast injections.
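The cost of converting a multicast into separate unicasts on a wireline mesh can be made concrete by counting link traversals. The sketch below assumes dimension-ordered XY routing on a 2-D mesh and models a tree-style multicast simply as the set of distinct links on those XY paths (a packet is replicated where paths diverge). It does not model the wireless links, congestion awareness or network coding of the proposed WiNoC; the function names are hypothetical.

```python
def xy_path_links(src, dst):
    """Directed links visited by dimension-ordered (X first, then Y) routing."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def multicast_link_traversals(src, dests):
    """Return (traversals as separate unicasts, traversals on a shared tree)."""
    paths = [xy_path_links(src, d) for d in dests]
    as_unicasts = sum(len(p) for p in paths)
    as_tree = len({link for p in paths for link in p})
    return as_unicasts, as_tree
```

Sending from router (0, 0) to (2, 0), (2, 1) and (2, 2) takes 9 link traversals as three unicasts but only 4 when the shared prefix is sent once, yet the farthest destination is still 4 hops away, which is the multihop latency the abstract points to.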


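VLSI - BACK END PROJECT - TANNER(nm) / HSPICE(nm) / DSCH3 - MICROWIND(um)

VLSI01_BE01 Title: Temporarily Fine-Grained Sleep Technique for Near- and Sub-threshold Parallel Architectures

Abstract: This paper presents a design approach for improving energy-efficiency and throughput of parallel architectures in near- and sub-threshold voltage circuits. The focus is to suppress leakage energy dissipation of the idle portions of circuits during active modes, which can allow us to wholly transform the throughput improvement from parallel architectures into energy savings via deep voltage scaling. We begin by investigating the efficacy of parallel and pipeline architectures in the near- and sub-threshold circuits. The investigation reveals that active energy dissipation largely undermines the ability of deep voltage scaling to transform excessive throughput into energy savings. Techniques, such as power-gating switches (PGSs), can mitigate active-leakage power dissipation; however, the overhead for entering and exiting sleep modes can offset the energy savings provided by sleep mode, particularly if sleep time is fine grained for suppressing active leakage. Therefore, in this paper, we propose a PGS design technique, inspired by the so-called zigzag super cutoff CMOS, in order to optimize the overheads of mode transitions of PGS in near- and sub-threshold circuits. The proposed technique enables circuits to be in sleep mode for as short as a single clock cycle with a negligible amount of energy and delay overheads. We apply our proposed design to parallel multiplier-based test circuits operating at near- and sub-threshold voltages. Simulations show a significant improvement in energy efficiency over baselines at the same throughput.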
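The remark that sleep-mode overhead can offset the savings when sleep intervals are short reduces to a break-even calculation. The sketch below uses an idealised first-order model in which sleeping removes all leakage and each sleep interval costs a fixed transition energy. Both function names and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
def break_even_sleep_time(leakage_power_w, transition_energy_j):
    """Shortest sleep interval (s) for which power gating saves energy.

    Assumes sleep removes all leakage and that entering plus exiting sleep
    costs a fixed `transition_energy_j`.
    """
    return transition_energy_j / leakage_power_w


def net_sleep_saving(leakage_power_w, transition_energy_j, sleep_time_s):
    """Energy saved (J) by sleeping for `sleep_time_s`; negative means a loss."""
    return leakage_power_w * sleep_time_s - transition_energy_j
```

With an assumed 1 µW of leakage and 2 pJ of transition energy, the break-even interval is 2 µs, so sleeping for a single 1 µs clock cycle would lose energy. Lowering the transition overhead, as the abstract sets out to do, is what makes single-cycle sleep worthwhile.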

VLSI10_BE02 Title: Low-Power Design for a Digit-Serial Polynomial Basis Finite Field Multiplier Using Factoring Technique

Abstract: In CMOS-based application-specific integrated circuit (ASIC) designs, total power consumption is dominated by dynamic power, where dynamic power consists of two major components, namely, switching power and internal power. In this paper, we present a low-power design for a digit-serial finite field multiplier in GF(2^m). In the proposed design, a factoring technique is used to minimize switching power. To the best of our knowledge, factoring method has not been reported in the literature being used in the design of a finite field multiplier at an architectural level. Logic gate substitution is also utilized to reduce internal power. Our proposed design along with several existing similar works have been realized for GF(2^233) on ASIC platform, and a comparison is made between them. The synthesis results show that the proposed multiplier design consumes at least 27.8% lower total power than any previous work in comparison.
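A digit-serial polynomial basis multiplier consumes one operand a few bits (one digit) per step. The functional sketch below shows that dataflow in software, most significant digit first, with reduction after every step. It does not model the factoring technique or gate substitution from the abstract, and the abstract does not state the reduction polynomial, so the usage example uses the small AES field GF(2^8) with polynomial 0x11B instead of GF(2^233). All function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2m_reduce(value, m, poly):
    """Reduce `value` modulo `poly`, a degree-m polynomial with bit m set."""
    while value.bit_length() > m:
        shift = value.bit_length() - 1 - m
        value ^= poly << shift
    return value


def gf2m_mul_digit_serial(a, b, m, poly, digit_bits=4):
    """Multiply a and b in GF(2^m), consuming `digit_bits` bits of b per step."""
    num_digits = -(-m // digit_bits)  # ceiling division
    mask = (1 << digit_bits) - 1
    acc = 0
    for i in reversed(range(num_digits)):
        digit = (b >> (i * digit_bits)) & mask
        # Horner step: shift the accumulator by one digit, add a * digit, reduce.
        acc = gf2m_reduce((acc << digit_bits) ^ clmul(a, digit), m, poly)
    return acc
```

`gf2m_mul_digit_serial(0x57, 0x83, 8, 0x11B)` returns `0xC1`, matching the worked example in the AES specification. The digit size trades the number of steps against the logic needed per step, and the result is the same for any digit size.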

VLSI17_BE03 Title: Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator

Abstract: The need for ultralow-power, area efficient, and high speed analog-to-digital converters is pushing toward the use of dynamic regenerative comparators to maximize speed and power efficiency. In this paper, an analysis on the delay of the dynamic comparators will be presented and analytical expressions are derived. From the analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic comparator is proposed, where the circuit of a conventional double-tail comparator is modified for low-power and fast operation even in small supply voltages. Without complicating the design and by adding few transistors, the positive feedback during the regeneration is strengthened, which results in remarkably reduced delay time. Post layout simulation results in a 0.18-µm CMOS technology confirm the analysis results. It is shown that in the proposed dynamic comparator both the power consumption and delay time are significantly reduced.


VLSI27_BE04 Title: 10T SRAM Using Half-VDD Precharge and Row-Wise Dynamically Powered Read Port for Low Switching Power and Ultralow RBL Leakage

Abstract: We present, in this paper, a new 10T static random access memory cell having single ended decoupled read-bitline (RBL) with a 4T read port for low power operation and leakage reduction. The RBL is precharged at half the cell's supply voltage, and is allowed to charge and discharge according to the stored data bit. An inverter, driven by the complementary data node (QB), connects the RBL to the virtual power rails through a transmission gate during the read operation. RBL increases toward the VDD level for a read-1, and discharges toward the ground level for a read-0. Virtual power rails have the same value of the RBL precharging level during the write and the hold mode, and are connected to true supply levels only during the read operation. Dynamic control of virtual rails substantially reduces the RBL leakage. The proposed 10T cell in a commercial 65 nm technology is 2.47× the size of 6T with β = 2, provides 2.3× read static noise margin, and reduces the read power dissipation by 50% compared with that of 6T. The value of RBL leakage is reduced by more than 3 orders of magnitude and (ION/IOFF) is greatly improved compared with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and competitive performance is achieved.

VLSI54_BE05 Title: Delay Analysis for Current Mode Threshold Logic Gate Designs

Abstract: Current mode is a popular CMOS-based implementation of threshold logic functions, where the gate delay depends on the sensor size. This paper presents a new implementation of current mode threshold functions for improved gate delay and switching energy. An analytical method is also proposed in order to identify quickly the sensor size that minimizes the gate delay. Simulation results on different gates implemented using the optimum sensor size indicate that the proposed current mode implementation method consistently outperforms the existing implementations in delay as well as switching energy.
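For readers unfamiliar with threshold logic, the function that such a gate computes is easy to state in code: the output is 1 when the weighted sum of the inputs reaches a threshold. The sketch below models only that Boolean behaviour. The current-mode circuit, its sensor sizing and its delay are not modelled, and the function names are hypothetical.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Return 1 when the weighted sum of binary inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def truth_table(weights, threshold):
    """Enumerate the gate output for every input combination."""
    n = len(weights)
    return {bits: threshold_gate(bits, weights, threshold)
            for bits in product((0, 1), repeat=n)}
```

A 3-input majority gate is `truth_table([1, 1, 1], 2)`. With weights `[1, 1]`, a threshold of 2 gives AND and a threshold of 1 gives OR, so a single gate structure covers several conventional functions.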

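VLSI55_BE06 Title: Area and Energy-Efficient Complementary Dual-Modular Redundancy Dynamic Memory for Space Applications

Abstract: The limited size and power budgets of space-bound systems often contradict the requirements for reliable circuit operation within high-radiation environments. In this paper, we propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The proposed complementary dual-modular redundancy (CDMR) memory is based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability. By adding simple, low-overhead parity, an error-correction capability is added to the memory architecture for robust soft-error protection. The proposed memory was implemented in a 65-nm CMOS technology, displaying as much as a 3.5× smaller silicon footprint than other radiation-hardened bit cells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions across the entire operating region.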
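How complementary storage plus a single parity bit can turn detection into correction can be sketched at the bit level. The model below is an assumption, not the paper's circuit: each bit is stored as a (true, complement) pair, an upset is assumed to leave the two nodes of one pair equal, which flags that bit as bad, and one even-parity bit per word is then enough to recover it. The function names are hypothetical.

```python
def cdmr_write(bits):
    """Store each bit as a (true, complement) pair plus one even-parity bit."""
    cells = [(b, 1 - b) for b in bits]
    parity = sum(bits) % 2
    return cells, parity


def cdmr_read(cells, parity):
    """Read a word, repairing at most one bit flagged by equal node values."""
    values, suspects = [], []
    for i, (true_node, comp_node) in enumerate(cells):
        if true_node != comp_node:
            values.append(true_node)      # pair is still complementary
        else:
            values.append(None)           # pair is inconsistent: error located
            suspects.append(i)
    if len(suspects) > 1:
        raise ValueError("more than one flagged bit: parity cannot correct this word")
    if suspects:
        others = sum(v for v in values if v is not None) % 2
        values[suspects[0]] = parity ^ others
    return values
```

Because the complementary pair already locates the faulty bit, parity only has to supply its value. This is why a single low-overhead parity bit can suffice where a conventional memory would need a full error-correcting code.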

NXFEE INNOVATION SEMICONDUCTOR IP & PRODUCT DEVELOPM

Semiconductor IP & VLSI IEEE Transaction & Product Development)

#45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, PondicherryEmail: [email protected] Ph: +91 9789443203, +91 9677783735.

10T SRAM Using Half-VDD Precharge and Row-Wise Dynamically Powered Read Port for Low Switching Power and Ultralow RBL Leakage

We present, in this paper, a new 10T static random access memory cell having single ended decoupled read-bitline (RBL) with a 4T read port for low power operation and leakage reduction. The RBL is precharged at half the cell’s supply voltage, and is allowed to chardischarge according to the stored data bit. An inverter, driven by the complementary data node (QB), connects the RBL to the virtual power rails through a transmission gate during the read operation. RBL increases toward the VDD level for a read-1, and discharges toward the ground

0. Virtual power rails have the same value of the RBL precharging level during the write and the hold mode, and are connected to true supply levels only during the read operation. Dynamic control of virtual rails substantially reduces the RBL leakage. The proposed 10T cell in a commercial 65 nm technology is 2.47×the size of 6T with β=2, provides 2.3margin, and reduces the read power dissipation by 50% than that of 6T. The value of RBL leais reduced by more than 3 orders of magnitude and (ION/IOFF) is greatly improved compared with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and competitive performance is achieved.

is for Current Mode Threshold Logic Gate Designs

Current mode is a popular CMOS-based implementation of threshold logic functions, where the gate delay depends on the sensor size. This paper presents a new implementation of current mode threshold functions for improved gate delay and switching energy. An analytical method is also proposed in order to identify quickly the sensor size that minimizes the gate delay. Simulation results on different gates implemented using the optimum sensor size ithat the proposed current mode implementation method outperforms consistently the existing implementations in delay as well as switching energy.

Area and Energy-Efficient Complementary Dual-Modular Redundancy Dynamic Memory Space Applications

The limited size and power budgets of space-bound systems often contradict the requirements for reliable circuit operation within high-radiation environments. In this paper, we propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The proposed complementary dual-modular redundancy (CDMR) memory is based on a fourtransistor dynamic memory core that internally stores complementary data values to provide an

bit error detection capability. By adding simple, low-overhead parity, an errorcorrection capability is added to the memory architecture for robust softproposed memory was implemented in a 65-nm CMOS technology, displaying as much as a

con footprint than other radiation-hardened bit cells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions across the entire operating region.

NXFEE INNOVATION PRODUCT DEVELOPMENT COMPANY

Product Development) #45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, Pondicherry-4

, +91 9677783735.

Wise Dynamically Powered Read Port for

t, in this paper, a new 10T static random access memory cell having single bitline (RBL) with a 4T read port for low power operation and leakage

reduction. The RBL is precharged at half the cell’s supply voltage, and is allowed to charge and discharge according to the stored data bit. An inverter, driven by the complementary data node (QB), connects the RBL to the virtual power rails through a transmission gate during the read

and discharges toward the ground 0. Virtual power rails have the same value of the RBL precharging level during the

write and the hold mode, and are connected to true supply levels only during the read operation. rails substantially reduces the RBL leakage. The proposed 10T cell in a

β=2, provides 2.3×read static noise margin, and reduces the read power dissipation by 50% than that of 6T. The value of RBL leakage is reduced by more than 3 orders of magnitude and (ION/IOFF) is greatly improved compared with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and

based implementation of threshold logic functions, where the gate delay depends on the sensor size. This paper presents a new implementation of

hold functions for improved gate delay and switching energy. An analytical method is also proposed in order to identify quickly the sensor size that minimizes the gate delay. Simulation results on different gates implemented using the optimum sensor size indicate that the proposed current mode implementation method outperforms consistently the existing

Modular Redundancy Dynamic Memory

bound systems often contradict the radiation environments. In this paper, we

olerant embedded memory yet to be presented. The modular redundancy (CDMR) memory is based on a four-

transistor dynamic memory core that internally stores complementary data values to provide an overhead parity, an error-

correction capability is added to the memory architecture for robust soft-error protection. The nm CMOS technology, displaying as much as a

cells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions


VLSI56_BE07 Title: Probability-Driven Multi-bit Flip-Flop Integration With Clock Gating

Abstract: Data-driven clock gated (DDCG) and multi-bit flip-flops (MBFFs) are two low-power design techniques that are usually treated separately. Combining these techniques into a single grouping algorithm and design flow enables further power savings. We study MBFF multiplicity and its synergy with FF data-to-clock toggling probabilities. A probabilistic model is implemented to maximize the expected energy savings by grouping FFs in increasing order of their data-to-clock toggling probabilities. We present a front-end design flow, guided by physical layout considerations for a 65-nm 32-bit MIPS and a 28-nm industrial network processor. It is shown to achieve the power savings of 23% and 17%, respectively, compared with designs with ordinary FFs. About half of the savings was due to integrating the DDCG into the MBFFs.

VLSI59_BE08 Title: A High-Speed and Power-Efficient Voltage Level Shifter for Dual-Supply Applications

Abstract: This brief presents a fast and power-efficient voltage level shifting circuit capable of converting extremely low levels of input voltages into high output voltage levels. The efficiency of the proposed circuit is due to the fact that not only the strength of the pull-up device is significantly reduced when the pull-down device is pulling down the output node, but the strength of the pull-down device is also increased using a low-power auxiliary circuit. Post layout simulation results of the proposed circuit in a 0.18-µm technology demonstrate a total energy per transition of 157 fJ, a static power dissipation of 0.3 nW, and a propagation delay of 30 ns for input frequency of 1 MHz, low supply voltage level of VDDL = 0.4 V, and high supply voltage level of VDDH = 1.8 V.

VLSI32_BE09 Title: A 0.1–2-GHz Quadrature Correction Loop for Digital Multiphase Clock Generation Circuits in 130-nm CMOS

Abstract: A 100-MHz–2-GHz closed-loop analog in-phase/quadrature correction circuit for digital clocks is presented. The proposed circuit consists of a phase-locked loop-type architecture for quadrature error correction. The circuit corrects the phase error to within 1.5° up to 1 GHz and to within 3° at 2 GHz. It consumes 5.4 mA from a 1.2 V supply at 2 GHz. The circuit was designed in UMC 0.13-µm mixed-mode CMOS with an active area of 102 µm × 95 µm. The impact of duty cycle distortion has been analyzed. High-frequency quadrature measurement related issues have been discussed. The proposed circuit was used in two different applications for which the functionality has been verified.

VLSI09_BE10 Title: Conditional-Boosting Flip-Flop for Near-Threshold Voltage Application

Abstract: A conditional-boosting flip-flop is proposed for ultra-low voltage application where the supply voltage is scaled down to the near-threshold region. The proposed flip-flop adopts voltage boosting to provide low latency with reduced performance variability in the near threshold voltage region. It also adopts conditional capture to minimize the switching power consumption by eliminating redundant boosting operations. Experimental results in a 65-nm CMOS process indicated that the proposed flip-flop provided up to 72% lower latency with 75% less performance variability due to process variation, and up to 67% improved energy-delay product at 25% switching activity compared with conventional pre-charged differential flip-flops.
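The effect of conditional capture in the conditional-boosting flip-flop can be shown with a purely behavioural count. The sketch assumes that a boosting operation is needed only on clock edges where the sampled input differs from the stored output, and compares that with boosting on every edge. The function name and the example input pattern are hypothetical, and no circuit-level behaviour is modelled.

```python
def count_boosts(data, q0=0):
    """Count boost events with and without conditional capture (behavioural assumption)."""
    q = q0
    conditional = 0
    for d in data:
        if d != q:       # input differs from stored value: capture and boost
            conditional += 1
            q = d
        # otherwise the boost would be redundant and is skipped
    unconditional = len(data)  # a boost on every clock edge
    return conditional, unconditional


# Eight clock edges with two output changes, i.e. 25% switching activity.
cond, uncond = count_boosts([0, 0, 1, 1, 1, 0, 0, 0])
print(cond, uncond, cond / uncond)
```

At the 25% switching activity used in this example, three out of four boosting operations would be redundant under this assumption, which is where the switching-power saving comes from.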


VLSI36_BE11 Title: An All-MOSFET Sub-1-V Voltage Reference With a −51-dB PSR up to 60 MHz

Abstract: This paper presents a voltage reference (VR) with a power supply rejection (PSR) better than 50 dB for frequencies of up to 60 MHz, and uses MOSFETs in strong inversion. Another innovation is a compact MOSFET low-pass filter, which was developed along with a feedback technique for a wide-bandwidth PSR not achieved in previous works. The proposed all-MOSFET VR was fabricated using a standard 0.18-µm CMOS process.

VLSI46_BE12 Title: A 65-nm CMOS Constant Current Source with Reduced PVT Variation

Abstract: This paper presents a new nanometer-based low-power constant current reference that attains a small value in the total process–voltage–temperature variation. The circuit architecture is based on the embodiment of a process-tolerant bias current circuit and a scaled process-tracking bias voltage source for the dedicated temperature-compensated voltage-to-current conversion in a pre-regulator loop. Fabricated in a UMC 65-nm CMOS process, it consumes 7.18 µW with a 1.4 V supply. The measured results indicate that the current reference achieves an average temperature coefficient of 119 ppm/°C over 12 samples in a temperature range from −30 °C to 90 °C without any calibration. Besides, a low line sensitivity of 180 ppm/V is obtained. This paper offers a better sensitivity figure of merit with respect to the reported representative counterparts.

VLSI57_BE13 Title: A Fault Tolerance Technique for Combinational Circuits Based on Selective-Transistor Redundancy

Abstract: With fabrication technology reaching nano-levels, systems are becoming more prone to manufacturing defects with higher susceptibility to soft errors. This paper is focused on designing combinational circuits for soft error tolerance with minimal area overhead. The idea is based on analyzing random pattern testability of faults in a circuit and protecting sensitive transistors, whose soft error detection probability is relatively high, until desired circuit reliability is achieved or a given area overhead constraint is met. Transistors are protected based on duplicating and sizing a subset of transistors necessary for providing the protection. In addition to that, a novel gate-level reliability evaluation technique is proposed that provides similar results to reliability evaluation at the transistor level (using SPICE) with the orders of magnitude reduction in CPU time. LGSynth’91 benchmark circuits are used to evaluate the proposed algorithm. Simulation results show that the proposed algorithm achieves better reliability than other transistor sizing-based techniques and the triple modular redundancy technique with significantly lower area overhead for 130-nm process technology at a ground level.

VLSI52_BE14 Title: Preweighted Linearized VCO Analog-to-Digital Converter

Abstract: A linearization technique of voltage-to-frequency characteristics of voltage-controlled oscillator (VCO) analog-to-digital converters (ADCs) is presented. In contrast to previous works, the proposed technique is an open-loop calibration-free configuration, so it can operate at higher frequencies. It is also independent of the delay element structure, so it can be applied to various VCO ADC topologies. The analog input signal is first mapped through a preweighted resistor network in which every delay element experiences a different version of the input and produces the corresponding delay. As a result, the proposed approach suppresses the impact of V/F nonlinearity on the ADC performance by expanding a linear region of the transfer curve over the full rail-to-rail input. This technique shows substantial improvement results by keeping nonlinearity within ±0.5% over the full input scale (dBFS) and achieves a peak signal-to-noise and distortion ratio (SNDR) of 75.7 and 60.4 dB for input of −8 and 0 dBFS, respectively.
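Why giving every delay element a different version of the input can widen the linear region of the VCO ADC may be easier to see in a toy numerical model. The sketch below is not the paper's circuit: it assumes a tanh-shaped voltage-to-frequency curve per element and models the "different versions" as fixed input offsets, both of which are illustrative assumptions, as are all names and values. It compares how much the slope of the summed response drops between the centre and the edge of the input range.

```python
import math


def element_response(v):
    """Toy saturating voltage-to-frequency curve of one delay element (assumed tanh)."""
    return math.tanh(v)


def summed_response(v, offsets):
    """Total response when element k sees the input shifted by offsets[k] (assumed weighting)."""
    return sum(element_response(v - o) for o in offsets)


def slope(f, v, h=1e-4):
    """Central-difference estimate of df/dv."""
    return (f(v + h) - f(v - h)) / (2 * h)


def flatness(offsets, v_edge=1.5):
    """Slope at v_edge divided by slope at 0; a value of 1.0 would be perfectly linear."""
    f = lambda v: summed_response(v, offsets)
    return slope(f, v_edge) / slope(f, 0.0)


uniform = flatness([0.0, 0.0, 0.0, 0.0])        # every element sees the same input
preweighted = flatness([-1.5, -0.5, 0.5, 1.5])  # every element sees a shifted input
print(round(uniform, 3), round(preweighted, 3))
```

In this model the staggered elements saturate at different input levels, so the summed curve keeps a much more constant slope toward the edge of the range than four identical elements do.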


VLSI53_BE15 Title: A 100-mA, 99.11% Current Efficiency, 2-mVpp Ripple Digitally Controlled LDO with Active Ripple Suppression

Abstract: Digital low-dropout (DLDO) regulators are gaining attention due to their design scalability for distributed multiple voltage domain applications required in state-of-the-art system on-chips. Due to the discrete nature of the output current and the discrete-time control loop, the steady-state response of the DLDO has inherent output voltage ripple. A hybrid DLDO (HD-LDO) with fast response and stable operation across a wide load range while reducing the output voltage ripple is proposed. In the HD-LDO, a DLDO and a low current analog ripple cancelation amplifier (RCA) work in parallel. The output dc of the RCA is sensed by a 2-bit analog-to-digital converter, and the digitized linear stage current is fed into the DLDO as an error signal. During load transients, a gear-shift controller enables fast transient response using dynamic load estimation. The DLDO suppresses the output dc of the RCA within its current resolution. With this arrangement, a majority of the dc load current is provided by the DLDO and the RCA supplies ripple cancelation current. The HD-LDO is designed and fabricated in a 180-nm CMOS technology, and occupies 0.697 mm² of the die area. The HD-LDO operates with an input voltage range of 1.43–2.0 V and an output voltage range of 1.0–1.57 V. At 100-mA load current, the HD-LDO achieves a current peak efficiency of 99.11% and a settling time of 15 clock periods with a 0.5-MHz clock for a current switching between 10 and 90 mA. The RCA suppresses fundamental, second, and third harmonics of the switching frequency by 13.7, 13.3, and 14.1 dB, respectively.

VLSI18_BE16 Title: Sense Amplifier Half-Buffer (SAHB): A Low-Power High-Performance Asynchronous Logic QDI Cell Template

Abstract: We propose a novel asynchronous logic (async) quasi-delay-insensitive (QDI) sense-amplifier half-buffer (SAHB) cell design approach, with emphases on high operational robustness, high speed, and low power dissipation. There are five key features of our proposed SAHB. First, the SAHB cell embodies the async QDI 4-phase (4φ) signaling protocol to accommodate process–voltage–temperature variations. Second, the sense amplifier (SA) block in SAHB cells embodies a cross-coupled latch with a positive feedback mechanism to speed up the output evaluation. Third, the evaluation block in the SAHB comprises both nMOS pull-up and pull-down networks with minimum transistor sizing to reduce the parasitic capacitance. Fourth, both the evaluation block and SA block are tightly coupled to reduce redundant internal switching nodes. Fifth, the SAHB cell is designed in CMOS static logic and hence appropriate for full range dynamic voltage scaling operation for VDD ranging from nominal voltage (1 V) to subthreshold voltage (∼0.3 V). When six library cells embodying our proposed SAHB are compared with those embodying the conventional async QDI pre-charged half buffer (PCHB) approach, the proposed SAHB cells collectively feature simultaneous ∼64% lower power, ∼21% faster, and ∼6% smaller IC area; the PCHB cell is inappropriate for subthreshold operation. A prototype 64-bit Kogge–Stone pipeline adder based on the SAHB approach (at 65 nm CMOS) is designed. For a 1-GHz throughput and at nominal VDD, the design based on the SAHB approach simultaneously features ∼56% lower energy and ∼24% lower transistor count advantages than its PCHB counterpart. When benchmarked against the ubiquitous synchronous logic counterpart, our SAHB dissipates ∼39% lower energy at the 1-GHz throughput.
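The 4-phase signaling protocol that the SAHB cell embodies can be sketched at the behavioural level. The example assumes the common dual-rail, return-to-zero form of the protocol: a bit is sent as a valid code on two rails, the receiver raises an acknowledge, the sender returns to the all-zero spacer, and the receiver lowers the acknowledge. The function names are hypothetical and nothing about the SAHB transistor-level design is modelled.

```python
SPACER = (0, 0)


def dual_rail(bit):
    """Encode one bit on two wires as (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)


def four_phase_transfer(bits):
    """Return the (rails, ack) trace for sending bits one at a time."""
    trace = []
    for b in bits:
        trace.append((dual_rail(b), 0))  # phase 1: sender drives valid data
        trace.append((dual_rail(b), 1))  # phase 2: receiver acknowledges
        trace.append((SPACER, 1))        # phase 3: sender returns to the spacer
        trace.append((SPACER, 0))        # phase 4: receiver releases the acknowledge
    return trace


def receive(trace):
    """Recover the bits by latching the rails whenever ack rises on valid data."""
    out = []
    prev_ack = 0
    for rails, ack in trace:
        if ack == 1 and prev_ack == 0 and rails != SPACER:
            out.append(1 if rails == (1, 0) else 0)
        prev_ack = ack
    return out


trace = four_phase_transfer([1, 0, 1])
print(receive(trace))
```

Because every transfer is bracketed by the spacer and an acknowledge, correctness depends on the order of events rather than on their timing, which is the property that lets such logic tolerate process, voltage, and temperature variation.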


VLSI38_BE17 Title: On Micro-architectural Mechanisms for Cache Wear out Reduction

Abstract: Hot carrier injection (HCI) and bias temperature instability (BTI) are two of the main deleterious effects that increase a transistor’s threshold voltage over the lifetime of a microprocessor. This voltage degradation causes slower transistor switching and eventually can result in faulty operation. HCI manifests itself when transistors switch from logic “0” to “1” and vice versa, whereas BTI is the result of a transistor maintaining the same logic value for an extended period of time. These failure mechanisms are especially acute in those transistors used to implement the SRAM cells of first-level (L1) caches, which are frequently accessed, so they are critical to performance, and they are continuously aging. This paper focuses on microarchitectural solutions to reduce transistor aging effects induced by both HCI and BTI in the data array of L1 data caches. First, we show that the majority of cell flips are concentrated in a small number of specific bits within each data word. In addition, we also build upon the previous studies, showing that logic “0” is the most frequently written value in a cache by identifying which cells hold a given logic value for a significant amount of time. Based on these observations, this paper introduces a number of architectural techniques that spread the number of flips evenly across memory cells and reduce the amount of time that logic “0” values are stored in the cells by switching OFF specific data bytes. Experimental results show that the threshold voltage degradation savings range from 21.8% to 44.3% depending on the application.

VLSI29_BE18 Title: Energy-Efficient TCAM Search Engine Design Using Priority-Decision in Memory Technology

Abstract: Ternary content-addressable memory (TCAM)-based search engines generally need a priority encoder (PE) to select the highest priority match entry for resolving the multiple match problem due to the don’t care (X) features of TCAM. In contemporary network security, TCAM-based search engines are widely used in regular expression matching across multiple packets to protect against attacks, such as by viruses and spam. However, the use of PE results in increased energy consumption for pattern updates and search operations. Instead of using PEs to determine the match, our solution is a three-phase search operation that utilizes the length information of the matched patterns to decide the longest pattern match data. This paper proposes a promising memory technology called priority-decision in memory (PDM), which eliminates the need for PEs and removes restrictions on ordering, implying that patterns can be stored in an arbitrary order without sorting their lengths. Moreover, we present a sequential input-state (SIS) scheme to disable the mass of redundant search operations in state segments on the basis of an analysis distribution of hex signatures in a virus database. Experimental results demonstrate that the PDM-based technology can improve update energy consumption of nonvolatile TCAM (nvTCAM) search engines by 36%–67%, because most of the energy in these search engines is used to reorder. By adopting the SIS-based method to avoid unnecessary search operations in a TCAM array, the search energy reduction is around 64% of nvTCAM search engines.

NXFEE INNOVATION PRODUCT DEVELOPMENT COMPANY

Product Development) #45, Vivekananda street, DhevankandappaMudaliarnagar, Nainarmandapam, Pondicherry-4

, +91 9677783735.

Hot carrier injection (HCI) and bias temperature instability (BTI) are two of the main deleterious effects that increase a transistor’s threshold voltage over the lifetime of a microprocessor. This voltage degradation causes slower transistor switching and eventually can result in faulty operation. HCI manifests itself when transistors switch from logic “0” to “1” and

stor maintaining the same logic value for an extended period of time. These failure mechanisms are especially acute in those transistors used

level (L1) caches, which are frequently accessed, so they are formance, and they are continuously aging. This paper focuses on micro

architectural solutions to reduce transistor aging effects induced by both HCI and BTI in the data array of L1 data caches. First, we show that the majority of cell flips are concentrated in a small number of specific bits within each data word. In addition, we also build upon the previous studies, showing that logic “0” is the most frequently written value in a cache by identifying

amount of time. Based on these observations, this paper introduces a number of architectural techniques that spread the number of flips evenly across memory cells and reduce the amount of time that logic “0” values are stored in the

specific data bytes. Experimental results show that the threshold voltage degradation savings range from 21.8% to 44.3% depending on the application.

Decision in Memory

based search engines generally need a priority encoder (PE) to select the highest priority match entry for resolving the multiple match

contemporary network security, TCAM-based search engines are widely used in regular expression matching across multiple packets to protect against attacks, such as by viruses and spam. However, the use of PE results in increased

tern updates and search operations. Instead of using PEs to phase search operation that utilizes the length

information of the matched patterns to decide the longest pattern match data. This paper decision in memory (PDM), which

eliminates the need for PEs and removes restrictions on ordering, implying that patterns can be stored in an arbitrary order without sorting their lengTHP. Moreover, we present a sequential

state (SIS) scheme to disable the mass of redundant search operations in state segments on the basis of an analysis distribution of hex signatures in a virus database. Experimental results

update energy consumption of 67%, because most of the energy in these based method to avoid unnecessary search

rgy reductionis around 64% of nvTCAM search
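The idea of choosing the longest match from stored length information, rather than from entry order, can be illustrated in software. The following is only a minimal sketch under stated assumptions: the names `ternary_match`, `specified_bits`, and `longest_match` are hypothetical, the pattern length is taken to be the number of non-X symbols, and the loop stands in for the hardware three-phase search, whose circuit details the abstract does not give.

```python
from typing import List, Optional


def ternary_match(pattern: str, key: str) -> bool:
    """Return True if every pattern symbol equals the key bit or is a don't care (X)."""
    return len(pattern) == len(key) and all(p in ("X", k) for p, k in zip(pattern, key))


def specified_bits(pattern: str) -> int:
    """Assumed length metric: the number of symbols that are not don't care."""
    return sum(1 for p in pattern if p != "X")


def longest_match(patterns: List[str], key: str) -> Optional[int]:
    """Pick the matching entry with the greatest length, independent of storage order."""
    best: Optional[int] = None
    for index, pattern in enumerate(patterns):
        if not ternary_match(pattern, key):
            continue
        if best is None or specified_bits(pattern) > specified_bits(patterns[best]):
            best = index
    return best


# Entries are deliberately stored unsorted: no reordering is needed on update.
table = ["1XXX", "101X", "10XX"]
print(longest_match(table, "1011"))
print(longest_match(table, "1000"))
print(longest_match(table, "0000"))
```

Because the decision depends only on the length stored with each entry, a new pattern can be appended anywhere in the table, which is the property the abstract credits for the update energy savings.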


VLSI37_BE19 Title: A 92-dB DR, 24.3-mW, 1.25-MHz BW Sigma–Delta Modulator Using Dynamically Biased Op Amp Sharing

Abstract: A 2–2 cascaded switched-capacitor sigma-delta modulator is presented for design of low-voltage, low-power, broadband analog-to-digital conversion. To reduce power dissipation in both analog and digital circuits and ensure low-voltage operation, a half-sample delayed-input feedforward architecture is employed in combination with 4-bit quantization, which results in reduced integrator output swings and relaxed timing constraint in the feedback path. The integrator power is further reduced by sharing an op amp in the two integrators in each stage and periodically changing the op amp bias condition between a high-current and a low-current mode using a fast low-power high-precision charge pump circuit. Implemented in a 0.18-μm CMOS technology, the experimental prototype achieves a 92-dB dynamic range, a 91-dB peak signal-to-noise ratio, and an 84-dB peak signal-to-noise plus distortion ratio, respectively, for a signal bandwidth of 1.25 MHz. Operated at a 40-MHz sampling rate, the modulator dissipates 24.3 mW from a 1 V supply.

VLSI25_BE20 Title: A 0.45 V 147–375 nW ECG Compression Processor With Wavelet Shrinkage and Adaptive Temporal Decimation Architectures

Abstract: This paper presents a real-time electrocardiogram (ECG) data compression processor with improved energy efficiency while maintaining high accuracy and real-time operation. Wavelet shrinkage is exploited to filter the noise and achieve sparse ECG signal representation. Adaptive temporal decimation is proposed to achieve configurable processing to adaptively reduce the data amount and computational activities for further power reduction. Modified Huffman and run-length wavelet source coding (MHRLC) is also designed to represent wavelet coefficients with optimized average code length and reduced memory requirement. Fabricated in 0.18-µm CMOS, the ECG processor is implemented with customized near-threshold digital logics for minimum energy operation. The prototype was fully validated with the MIT-BIH Arrhythmia database.
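How shrinkage produces a sparse representation that run-length coding can then exploit may be easier to see in a small software example. This is only a rough sketch, not the processor's datapath: the one-level Haar transform, the soft-threshold rule, the fixed threshold value, and the (zero-run, value) output format are all assumptions chosen for illustration, and the function names are hypothetical. The abstract does not specify the wavelet, the threshold selection, or the MHRLC code format.

```python
from typing import List, Optional, Tuple


def haar_level(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of an (assumed) Haar transform: pairwise averages and differences."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in pairs]
    return approx, detail


def soft_threshold(coeffs: List[float], threshold: float) -> List[float]:
    """Shrink each coefficient toward zero; small (noise-like) values become exactly 0."""
    shrunk = []
    for c in coeffs:
        magnitude = abs(c) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if c > 0 else -magnitude)
    return shrunk


def run_length_zeros(coeffs: List[float]) -> List[Tuple[int, Optional[float]]]:
    """Encode as (number of preceding zeros, nonzero value); trailing zeros use None."""
    encoded: List[Tuple[int, Optional[float]]] = []
    zeros = 0
    for c in coeffs:
        if c == 0:
            zeros += 1
        else:
            encoded.append((zeros, c))
            zeros = 0
    if zeros:
        encoded.append((zeros, None))
    return encoded


samples = [4.0, 4.0, 5.0, 5.0, 10.0, 2.0, 3.0, 3.0]  # made-up values, not ECG data
approx, detail = haar_level(samples)
sparse = soft_threshold(detail, 1.0)
print(run_length_zeros(sparse))
```

The more coefficients the shrinkage step sets to zero, the fewer symbols the run-length stage emits, which is the link between noise filtering and compression that the abstract describes.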


Filename: NXFEE_VLSI_2017_TRANSACTION - Upto Issue 6 - Ver1 Directory: C:\Users\Nxfee\Documents Template: C:\Users\Nxfee\AppData\Roaming\Microsoft\Templates\Normal.dotm Title: Subject: Author: Nxfee Server Keywords: Comments: Creation Date: 30-Apr-16 1:13:00 PM Change Number: 237 Last Saved On: 06-Jul-17 7:28:00 PM Last Saved By: Nxfee Total Editing Time: 2,951 Minutes Last Printed On: 06-Jul-17 7:31:00 PM As of Last Complete Printing Number of Pages: 24 Number of Words: 11,935 (approx.) Number of Characters: 68,032 (approx.)