Transcript

EEw382V ADVANCED PROJECTS I (ECD)

SUMMER 2007

MAYANK GUPTA

“TOPIC 4 – CIRCUITS & DATAPATH”

INTRODUCTION

At its very simplest, a microprocessor can be thought of as an assembly of distinct functional

units: an input/output, a memory, a datapath that consists of arithmetic & logic (ALU) and

control [1]. The datapath performs the actual data processing and is directly critical to the

performance and throughput of the microprocessor. It consists of a pipelined flow of fixed

and floating point execution units that perform addition, multiplication, logical operations

etc; register files (a type of memory) and other components such as multiplexers, memory

and wires, buses and buffers that make up the interconnections. The modern approach to

microprocessor design strives for higher computational throughput but at lower active/static

power and smaller area. This approach must manifest itself in the design of the basic units of

the microprocessor as well. This means that circuit designers and architects must leverage

existing technologies and inventive circuit topologies to strive for the most energy/area

efficient performance in the combinational and sequential logic that makes up the

microprocessor, especially the datapath [2].

A lot of research has gone into innovative circuits and architectures to create power efficient

datapaths and memories. Designers must balance high performance with leakage, noise,

wiring density and clock generation complexities at the circuit and transistor level; develop

new architectures to target the right applications and implement underlying functional blocks

efficiently. This also compounds challenges with design for test and manufacturability,

designing for process variations in sub-micron technology, scalability, ease of design

interoperability, extensions to the instruction set for media type applications and design

verification [2]. The design must also champion an easy design flow which requires the right

set of simulation and verification tools. The designs reviewed in this paper offer a good

window into the challenges in designing the components of the datapath and clock; and the

complications with silicon implementation.

RESEARCH OVERVIEW

I have reviewed four papers in the field of circuits and datapath from the IEEE Journal of

Solid State Circuits. A list of these papers can be found in the references section of this

paper. Two papers come from designers of the multicore CELL processor; one paper focuses

on the general circuit design elements, styles and philosophy of the entire CELL processor

while the other paper focuses on the floating point unit of the synergistic processing elements

of the CELL [3, 4]. The third paper comes from Intel and focuses on the integer execution

unit of the Pentium 4 designed for 65nm process [5]. I have grouped these three papers under

the topic of ‘Datapath Circuits’ and provided a background section, a section on the features

of their design and a critique on each paper. Some of the questions I have posed while

reviewing these papers are:

1. Does it present the tradeoffs with respect to power, area and performance?

2. Does it address technology or process issues with sub-micron technology?

3. What are the constraints on the design styles?

4. Do the design choices appear to be scalable?

5. Does it address testability and design for manufacturing?

6. Is the design approach application specific or general purpose?

7. Does it address and distinguish between architectural and circuit design choices?

The final paper comes from researchers at the University of Toronto and is a tutorial and an

extensive survey of CAM architectures [6]. This paper is analyzed separately under the ‘High

Speed Memory’ section with a background, a section on the features of CAM memory and a

critique. Some questions I have posed when reviewing this paper are:

1. Does it address power efficiency?

2. Does is address technology and process implementation and scaling issues?

3. Does it present all the tradeoffs in the competing architectures?

4. Does it address any specialized circuit topologies in addition to static CMOS?

5. Does it provide real silicon experimental data?

6. Does it provide adequate information on design for test and manufacturing?

DATAPATH CIRCUITS

2

There are various distinct circuit elements that make up a typical datapath. Each circuit

element must be designed and implemented with power efficiency, scalability and robustness

in mind. The latches and flip flops are the basic elements of pipelined flow and registers.

These require an over-reaching design cycle that must allow sufficient room for macro level

optimization and tradeoffs as well. This calls for interesting dynamic and static circuit

topologies. The SRAM and array circuitry that make up the various levels of cache and

register files also require careful considerations during design. A robust and balanced

clocking scheme is required to drive the various circuit topologies as well. The performance

and power requirements of the execution units of the datapath depends on the strengths of the

underlying circuit elements and topologies, however the decisions and tradeoffs in the

architectural design are equally important. Architects can take an application specific

approach based on the most likely format of the data to be processed or a more general

approach in their design. The execution units are also usually the hot-spots of a chip and limit

overall active power and frequency of the chip.

The first set of papers deal with circuit design and datapaths of the CELL. The CELL was

developed jointly by the Sony, Toshiba and IBM alliance starting in 2001 [7]. The CELL

Broadband Engine combines a general-purpose POWER architecture based core with

powerful co-processing elements that can perform multimedia and vector processing. The

architectural design and first implementation was carried out jointly by the STI alliance on

90nm technology followed by a 65nm implementation [7]. The CELL was designed to be a

multi-core powerhouse that combines the capabilities of a general purpose CPU and a GPU

into one die. The design emphasizes power efficiency, bandwidth and complex code

execution to allow peak throughput computation [8]. The CELL consists of nine co-

processors: a POWER based main processor (Power Processing Element) and eight co-

processors (Synergistic Processing Elements) connected through a circular bus called the

Element Interconnect Bus [8]. The PPE runs the OS and controls the SPE. Various

innovative circuit elements in the datapath, clocking systems and the arrays of the CELL

have been designed to support power efficiency, high bandwidth of code execution,

scalability and technology challenges. These innovative circuit elements are the subject of the

first paper. The SPE is designed for vectorized floating point calculation as is the case in

media and graphics applications [8]. As a result, the datapath of the SPE has been architected

3

to support the unique stream of Single Instruction Multiple Data (SIMD) type instructions

seen in these types of applications [8]. The second paper details the floating point execution

unit that supports this type of instruction execution.

The Intel paper details the integer execution unit of the 65nm Pentium 4 microprocessor. The

Pentium is the brand name for x86 and now x86-64 based single core CISC processors made

by Intel. The Pentium 4 features a deeper instruction pipeline than previous Pentium versions

designed to scale to very high frequencies known as the NetBurst architecture which brought

about obvious power limitations [9]. It also featured x86 multimedia extensions and support

for virtualization. The Pentium 4 went through numerous core designs and various

technology iterations. The 65nm Pentium 4 released in 2006 is based on the 90nm ‘Prescott’

core was the end of the line for the high power Pentium 4 architecture [10]. The 65nm

implementation essentially strived to bring a degree of power efficiency to the Prescott core

and demanded some innovative circuit design techniques. The integer execution unit for the

65nm Pentium 4 is a good example of the circuit redesign involved in implementing the

datapaths of an older technology into a new technology and is detailed in the third paper.

CELL circuit design [3]

This paper focuses on the circuit design techniques and design verification methodology that

went into designing the circuit components of the CELL processor: the PPE, the EIB, units

for thermal and power management, test, debug and the functional units of the SPE. The goal

for the designers was to integrate high performance components without violating any

thermal or package constraints in 90nm SOI technology. This required a simple yet robust

design style as well as dynamic and highly customized circuits for maximum performance,

yield and speed. The challenge was to address these tradeoffs and to pick the right

implementation for the technology.

Background

The CELL architecture was designed to exploit streaming vectorized loads and thread level

parallelism seen in media type application software where the PPE runs the OS and co-

ordinates the flow of data and threads through the SPE. The CELL has its own custom

instruction set architecture with some elements of VLIW computing and is based on the

POWER architecture with additional support for its multi-core execution. The CELL is a

4

push by its inventors to create a processing powerhouse designed for the media and digital

entertainment content consumer market. At the same time, the CELL also features powerful

SIMD, vectorized co-processors that are adept at floating point calculations suitable for super

computing type needs. Currently, the most visible implementation of the CELL processor is

in the Sony Playstation 3 game console. There could also be possible applications of CELL

based systems in the embedded or blade server market suited for medical imaging purposes.

The competitors for such a product would be multi-core offerings from IBM such as the

Xenon which also features a SIMD like implementation. The x86 offerings from Intel and

AMD that penetrate every level of the microprocessor market are also competitors. The main

constraint of the chip is the actual code development for extracting high throughput potential

from the CELL. As a result, the success of the chip requires a strong adoption by software

developers that write applications for the chip. The architecture also forces the CELL to be

cost effective only in a segment of the market dominated by media application software and

therefore unpractical for consumer PC or desktop computer. The dependence of the SPE on

the PPE is another constraint as yield can depend solely on PPE functionality.

Features and Technology

The circuit design of the CELL features an aggressive latch, flip-flop, clock generation and

array design methodology along with robust electrical verification to enable 4GHz operation

in hardware. The design style ranges from simple static CMOS to pulsed, dynamic and

domino circuits for speed critical paths. One of the many design challenges was to design

local base clock generators (LBC) to drive the master-slave flip flops. A single phase global

clock grid drives numerous LBCs to provide gain for local clock nets and functionality for

test and local clock gating. The constraints were to minimize latencies, power and area. The

LBC was designed to drive a 64bit register through a three stage buffer. The LBC drives a

launch clock (lclk) with three fan-out-of-four (FO4) delays from base clock as well as a data

capture clock (dclk) with longer delay to provide clock overlap at the cycle boundary to

minimize races. Clock waveform integrity was ensured by a post layout electrical design rule

checker. The LBCs are also designed to be clock gates through test and hold signals. They

also support scan clocks for test since scan signals are distributed throughout the chip using

simple flip-flops. The test and hold signals can also define regions of half-frequency

operation which mitigates the need for a separate low frequency clock.

5

In addition to basic transmission gate flip-flops, designers made use of a variety of special

purpose flip-flops and latches to meet power and frequency targets. One important design

was a high speed multiplexer-latch with nine inputs that accepts and drives static signals. The

multiplexer (MUX) is a dynamic NOR followed by a set-reset latch. It is driven by an LBC

through a delay chain to force dynamic gate inputs after a fixed time interval to minimize

hold time. One branch of the MUX is dedicated to fast-scan testing. The addition of a scan-

port was found to have minor impact on area and delay for the latches. The area, power, hold

time, latency and delay tradeoffs for four different types of latches/flip-flops were

characterized, included in the design library and chosen effectively. Pulsed latches had a

power advantage over flip-flops due to reduction in clock power, but flip-flops were best for

hold time considering process, voltage and temperature variations. Longer hold times and

tougher implementation rules for the pulsed clock limited their usage.

To maximize design re-use the circuit design flow was partitioned hierarchically. There was

also re-use within the block or macro. A macro scales from a few transistors to a few

thousand transistors and were designed in three different styles: fully automated, semi-

custom and full custom. The main control logic was designed using fully automated

techniques due to its regular structure and constant re-work. The random logic macros (RLM)

of the control logic were partitioned into a few thousand gates using standard cells. Libraries

made use of multi-Vt cells. All timing and electrical analysis was carried out at transistor

level for all RLMs. A lot of effort went into the synthesis algorithms for the flip-flops and the

clock buffers as well. Some macros were built using a semi-custom flow allowing quick

tuning iterations. The designers constructed a schematic using basic gates from a standard

cell library which was then optimized to tune device widths using an in-house tuner without

adding buffers or logic. The 24bit adder was constructed using this methodology and reduced

path delay by 30% and area by 25%. A full custom design methodology was used for

specialized circuitry where pockets of domino logic had to be inserted surrounding static

logic to provide speedup with minimal power consumption increase. This was used to create

the footer-less domino circuits to produce fast carry input to the CAM speeding it up by 20%.

6

The SPE circuit design received special attention as they are the limiters of performance and

power for the CELL. The design philosophy was to use static CMOS as much as possible

with dynamic circuits used in only the most area, power and timing critical paths. All

dynamic circuits were self-contained in a macro with static gating. Some examples of macros

using dynamic circuits are multi-port registers files, data forwarding macros, floating point

unit MUX and the dynamic programmable logic arrays (DLPA). The general purpose register

files (GPR) were designed to operate in three cycles: one cycle for address pre-decoding, one

cycle for final decoding and array access and the last cycle for dataflow distribution. The

GPR was designed to support eight read/write operations in a cycle. The design uses a two-

stage domino read and a static write. The wire spacing and width was determined through

Spice simulations and implemented using custom developed routing tools to optimize signal

integrity and distribution delays. The DLPAs were generated using a custom designed

program for schematics and layout. It was implemented using dynamic footed NOR gates

followed by a strobe circuit which also incorporates scan testing. The DLPAs are also used

for clock gate control signal generation. The dependence checking and the data-forwarding of

the SPE are carried out by the dependence check macro (DCM), forwarding macro (FM) and

several DLPAs. The DCM compares a new instruction with instructions in the pipeline and

the DLPA is used to determine final dependencies. Based on these results the data in the DM

is forwarded to the right dynamic MUX-latch. The DCM is highly customized and was

implemented using static and transmission gates. Each slice of the FM consists of 16 32bit

registers and a 16-way MUX implemented with dynamic eight-way NOR gates followed by a

latch. Special attention was paid to the physical implementation to avoid cross talks in the

MUX.

The goal of the SRAM array design was to support minimal cycle time, power and area for

maximum yield, scalability, robustness and multiple manufacturing lines. The CELL consists

of a 512KB L2 cache and eight 256KB local stores (LS) in the SPEs for a total of 2MB

memory. The arrays of the CELL are assigned to four functional groups: SPE, PPE, L2 and

support. The SPE and PPE arrays work on core clock while L2 and support arrays work at

half-frequency. The arrays are pipelined to support high frequency with one cycle latency

between wordline select and dataout. Read operations are single ended and require bitline

pre-charge high. The 6T SRAM cell was designed to minimize wire lengths and gate

7

alignment. The LS bitcell uses a 0.99 square micron cell with 66 cells per bitline and a high-

Vt, noise tolerant sense-amp read scheme, while the remaining cells use a 1.06 square micron

cell with a ripple domino sense write and 16 cells per bitline to reduce stress during reads.

Array functionality over all process, clock variability, voltage and temperature corners had to

be ensured through device level corner analysis with special attention paid to timing and

stability. Heavy statistical analysis and electrical verification was used to ensure cell

operation with margins for writability, stability, yield, timing and reliability. The designers

strived for minimum area, uniform printability with added redundancy in word and bit

direction to address defects and tail-end distribution writability and stability fails. A scan

based memory BIST was implemented as well which has the ability to test multiple arrays at

once.

To ensure design and electrical integrity a rigorous system of design checks was

implemented alongside the design analysis methodology. All digital macros had to pass a

unified set to electrical and topological checks which focus on clock integrity, latch and flip-

flop usage, dynamic circuit usage as well as use of transmission and static gates. All designs

also passed EM and IR drop checks to ensure robust power supply. Physical design checks to

improve yield without impacting area and to ensure easy integration of the macros was also

adopted. Transistor level timing, noise and power analysis was performed using identical

flows that comprehend SOI effects like floating body etc. All timing checks were done at

macro level and transistor level consistently.

Critique

The paper goes a great job of covering all the design styles and circuit elements that went

into the CELL microprocessor. The designers adopted a robust verification and electrical

check methodology to ensure a sound design across all electrical parameters. There was good

information on the various types of latch/flip-flop design as well as the pros and cons of each.

The tradeoffs made in the design were apparent. For example, the designers chose to

minimize area in the caches at the cost of higher BIST redundancy and testability. It wasn’t

clear what debug and test features were part of the BIST engine, but it should be strong

enough to account for SOI hysteresis effects on cache cells and read circuitry. There was

little information on architectural gating and ability to switch off cores. It wasn’t clear if the

8

CELL supports any type of gating of the SPE cores to combat defect and yield issues.

Another tradeoff was the need to use dynamic and domino circuit to speed up critical paths

usually at a minor cost of power, wiring and additional design verification. The design uses a

mixture of domino and static logic; the decision to encapsulate domino logic with static

buffers at the macro boundaries is a good practice to ensure timing and signal integrity. The

design of the local clock generator is also a good choice as it ensures a robust timing scheme

and scalability. The local clock generators are packed with test, debug and gating features as

well as support for domino circuitry. From a silicon implementation point of view, it wasn’t

clear how the post-layout and electrical checking comprehends multi-Vt transistors or multi-

voltage supplies. The paper could also have used some information on the circuit topologies

of the analog components like the PLL. It is likely that the usage of more aggressive and

dynamic circuit design styles complicated and increased the design checks, which may have

strained design resources; but it exemplifies the importance of a robust design as process

variations in technology rise. Overall, this is an informative paper on the latest design styles

for a sub-micron SOI technology design implementation.

SPE CELL floating point unit design [4]

The focus of this paper is the architecture and circuit design of the pipelined floating point

unit (FPU) of the SPE of the CELL. The SPE is designed to accelerate the processing of real-

time media and data streaming applications. As a result, the SPE FPU is designed for single

precision floating point SIMD instructions such as multiply-adds. The design goals were to

use the right circuit components and topologies detailed in the previous paper and architect

an efficient pipelined execution unit that meets all the specifications of delay and power

consumption. The paper goes into details of the design challenges and the building blocks of

the FPU.

Background

The SPE of the CELL is the workhorse of the system that performs the floating point

calculations required in media applications. As a result, it is architected to sustain high

bandwidth of streaming data and therefore requires wide and fast execution units. This places

a burden on the designers to create specialized custom circuits with minimal wiring and

clocking overhead. Since the SPE has been tuned to execute floating point instructions, it

9

might not show the best in-class performance for more general purpose workloads. The

design is also dependent on the PPE and the code base to keep the hungry cores of the SPEs

fed in order to maximize performance for power. The applications, customers, competitors

and constraints of the CELL architecture have already been mentioned in the previous

background section.

Features and Technology

The SPE contains 128 128bit wide registers. The FPU sources these registers to perform very

fast single and double precision arithmetic. A fixed-point unit also carries out 32bit integer

arithmetic and logical operations. The single precision is a four-way SIMD design which

supports media extension like SSE, MMX and VMX for vector computing. The FPU is

implemented in 90nm SOI technology with 769K transistors to meet a 5.6GHz, 1.4V

specification that delivers 44.8 G-flops of performance and burns 1.4W at 4GHz. The

designers have made an effort to optimize the architecture for target applications, co-design

logic circuits with integration, and carefully balance pipeline stages for minimum delay and

maximum throughput. The SPE instructions process 128 bit operands divided into four 32bit

word slices. Each 32bit slice supports 32bit single precision instructions, 16bit integer

multiply-add instructions and can convert instructions between both formats. Operands are

fetched from the register file (RF) into the operand latches of the FPU and the results of the

FPU can be sent back to the operand latches of the FPU or the forwarding unit (FW) from

where they can be distributed to other units like the LS or the fixed-point unit.

The designers laid out the main design challenges for the FPU in the paper. The FPU

implementation had to meet an 11 FO4 stage delay to balance performance and power. The

latency of the FPU was designed to be six cycles for single precision instructions. This

required optimization at all design layers: architecture, logic, circuits and layout. Most of the

logic was done in CMOS static gates with dynamic gate used only in timing critical areas. To

reduce latch insertion delays, three different types of latches (type C, D and E) were

designed, simulated and characterized with multiple levels of driver sizes and local clock

buffers to minimize delay. The Type C latch uses a 9:1 multiplexer which includes a scan

input, supports static inputs and outputs and uses dynamic circuits. The Type D latch is a

transmission gate flip-flop that also includes scan inputs. The Type E latch is a static pulsed

10

latch that reduces insertion delays by replacing inverters with NAND2 logic with latch

functionality and time borrowing capabilities. The FPU width was required to be aligned to

the data flow stack of the SPE which is 46 bits wide and consists of 32bit data and a 15bit

clock bay area which occupies the center area of two adjacent slices. The 46bit FPU data

flow is split between a 9bit exponent and a 37bit fraction with most fraction macros folded to

reduce their width. The FPU had to be placed such that the main high frequency clock grid

covers the FPU. The LBCs of each macro are located to the side of the latch bit cells. The

aligner was placed between the adder and the multiplier. To optimize for target media

applications i.e. single precision multiply-adds, the architects deviated from IEEE floating

point standards by only supporting truncation rounding and no-trapping exception handling

but not the representation of infinity or NaN. However, they did add software support for

missing IEEE single-precision features.

The designers also went into detail about the main building blocks of the FPU: formatter,

multiplier, aligner, fraction adder, leading zero anticipator (LZA), normalizer and result

multiplexer. The formatter pre-processes operands of the integer and converts instructions to

go through the FPU with an extra cycle penalty. The formatter also sign extends the floating-

point mantissa with a zero sign and integer operands by a 2bit sign to perform the

multiplication. The multiplier is a 25bit circuit that uses radix-4 booth encoding to cut down

the number of partial products. It supports 2s complement multiplication and uses hot-1

encoding to deal with sign bit extension. The multiplier makes use of static adders which

consist of transmission gate XOR and AOI logic to deal with partial products. The static

implementation reduces wiring associated with timing and allows more optimization. The

multiplier is a two cycle design with a total block delay of 22 FO4. The staging latches use

Type E latches which allow time borrowing. Due to the 37bit width restriction, the tree had to

be compressed and partial product locations had to be shifted. The aligner shifts the addend

based on exponent differences to align with the product. The aligner has two critical paths

both of which starts with the selection of the exponents. The first critical path computes the

shift amount and decodes it into select signals which control the 4:1 MUX that performs the

actual shifting. The 4:1 MUX consists of four-input transmission gate multiplexers. To

reduce this delay the aligner uses a sum-addressed scheme which performs adding and

decoding in parallel. The second critical path occurs when the aligner performs shift amount

11

saturation by checking the shift amount for overflow and underflow to control the last stage

of the shift. The fraction adder is a sign magnitude adder which receives the product from the

multiplier in a carry-save format and the aligned addend from the aligner. The fraction adder

consists of an incrementer, an adder and sticky logic. This adder uses a Kogge-Stone parallel

prefix adder with a binary carry look-ahead structure. The LZA is a pipelined structure that

determines the number of leading zeros in the fraction adder result. The LZA behavior

depends on whether it receives the result from the adder of the fraction adder or the

incrementer. In the first case, the LZA produces propagate, generate and kill signals for each

bit position to detect an edge vector which determines the number of leading zeros. The

normalizer performs normalization shift of the output from the LZA in four stages. The

fraction/exponent rounder adjusts exponents and detects adjustments. It makes use of a

dynamic MUX-latch and Type C latches.

To reduce active power in the SPE designers make extensive use of clock gating. All

registers have a clock enable at the LBC and only pipeline stages with valid instructions are

activated by the controller. Circuit blocks are clocked based on instruction type and operand

values. For example, integer instructions make the LZA and normalizer idle while an add

instruction bypasses the multiplier. The control logic that controls these signals is built from

standard cell library and is fully automated. The design also makes use of a 32bit bypass bus

in the FPU to minimize the delay of instructions that don’t require the multiplier or the

aligner.

Critique

This paper presents a very detailed look into the architecture of the FPU. The designers

presented all the constraints on the architecture of the FPU effectively to illuminate the trade-

offs in their choices. The trade-offs were quite obvious, the designers were striving for

maximum throughput for minimal area and power. One of the stand-out features of the FPU

architecture is the narrow data-paths of the multiplier and aligner. Although there is a minor

overhead required in folding and compression of the operands, it aligns to the SPE data flow

width which helps throughput. The paper mainly focuses on the architectural choices of the

FPU with minimal look at the underlying circuit topologies and with even less information on

silicon implementation. Some information on any SOI related issues with the FPU

12

implementation would have also been interesting. The paper presents good information on

the critical paths of each block in the FPU but could have offered more information on how

they used multi-Vt cells if any or other ways to speed up all the critical paths. It would have

been interesting to learn more about the architectural design choices for the register files and

forwarding mechanisms as well and how they interact with the execution units. The paper

mentions deviations from IEEE standards, but offers no information on any performance hit

with the software routines with implementing the missing features. Would have including

these missing features been as costly? The design approach seem to meet all the

specifications of the FPU, however it isn’t clear if their solutions will be scalable to deeper

sub-micron technologies or more cores on a die. Overall, this is a good paper on the

architecture of floating point execution units for media applications but could have used

some more circuit and transistor level optimization information.

65nm Pentium 4 integer unit [5]

The Pentium 4 integer execution unit was designed as a very high speed segmented 64bit

ALU to enable 9GHz operation while lowering active and static power. The main challenges

came from re-designing the original low voltage swing integer execution unit in 90nm

technology for operation at 65nm with a 2x frequency fast clock (fclk). The designers go into

details of the domino and static circuit topologies as well as the use of tools and

methodologies to lower design complexity and reduce development cost and time. The

designers claim a 42% reduction in normalized static power.

Background

The 65nm Pentium 4 chip was the final revision of the Pentium 4 and was meant to be a

technology shrink into 65nm with minimal additional features. The 64bit 65nm Pentium 4

works in single core and dual core (two die in a single package) configuration with 2MB L2.

The Pentium 4 core was originally designed to scale to extremely high frequencies but hit a

power ceiling due to technology scaling. This is most likely the reason for the 9GHz design

specification on its functional units like the integer execution unit. Intel went into production

with the 65nm Pentium 4 under the 6xx series name [10]. The constraints on this chip were

obviously the very poor thermal and power ratings compared to the K8 based Athlon branded

x86 architectures from AMD. The 65nm did not appear to offer any improvements over the

13

Prescott based 90nm Pentium 4 core either [10]. The 65nm Pentium 4 was eventually

superseded by the Pentium D and Intel Core branded processors which favored more energy

efficient multi-core architectures over raw clock speeds. The Pentium 4 was designed to fit

into the consumer and media system solutions market in addition to the tradition desktop and

low end server markets that can exploit its virtualization capabilities.

Features and Technology

This iteration of the Pentium 4 was designed for a 65nm strained silicon CMOS technology

which offers 15% increase in drive current. It uses a NiSi layer on the gates and drain to

lower capacitance, 1.2nm gate oxide and eight layers of copper interconnect with low K

dielectric and 35nm gate length transistors. With this technology scaling in mind, the integer

execution unit of the Pentium 4 was redesigned to use a 2x frequency fast clock to enable

single cycle latency on critical bypass loops, domino circuitry to replace low swing voltage

technology used in 90nm and new architectures for the ALU. The integer register file (IRF)

and the address generation unit (AGU) designed for 9GHz operation at 1.3V also reduce

power over the 90nm integer execution unit. The designers provide excellent details on the

integer unit and the 2x clocking scheme. The AGU, ALU, IRF, bypass cache, write-back

buses, flag logic and single cycle latency operations (adder, shifter, rotator and logic

operator) of the integer execution unit operates on the 2x frequency boundary. Longer latency

operations like multiply are implemented outside the integer execution unit on the main

processor clock boundary. The integer unit also includes a fast store forwarding buffer

(FSFB) to reduce store to load latency. The 64bit data path is implemented as two discrete

32bit paths to speed up latencies of the 32bit ALU operators. Write-back buses provide

source operand data for the ALU and AGU from the register file, data cache and FSFB. The

write-back buses which are outputs of the ALU/AGU also write back results to the register

file. The IRF can sustain 12 reads and 6 writes per processor clock cycle, the AGU can

perform one load and store address and 32bit ALU per processor cycle and the 32bit ALU

loop latency is also one processor clock cycle.

The ALU and the AGU are implemented with a full-swing two phase 2x frequency domino

circuitry to maintain signal integrity. However, time borrowing, races, misaligned inputs and

clock duty variation shrinks the effective pulse-width for domino evaluate and pre-charge and

14

complicates domino circuit design. The failures caused due to incomplete reset/pre-charge

and pulse evaporation are called signal triangulation failures. The two phase 2x domino

scheme has been implemented to minimize the risk of these signal triangulation failures in

several ways. Set dominant latches (SDL) are used strategically to convert half clock period

width domino signal to full clock period width. N-skewed CMOS gates are used to stretch the

domino evaluate window at the cost of noise sensitivity as opposed to a mechanism that uses

single PMOS for pre-charge and a complex NMOS tree for the evaluate. The effective pulse

width for each transition of a node is thoroughly characterized during verification simulation

to ensure triangulation fails do-not limit the frequency of the integer unit. The designers

however, allowed some power race violations as clock edge adjustments for the 2x clock

would have caused triangulation failures. The 2x frequency clock is designed as a single

phase clock generated by a local pulse generator to minimize clock uncertainty overhead. The

pulse generator is dual edge triggered to produce 2 fclk in one processor clock cycle. The

high phase of the fclk is independent of the processor clock by virtue of a self-timed reset

loop but can be made stretchable to facilitate speed path debug on the fclk. In silicon

validation, an appropriate setting is chosen to provide the best yield. The variation of width,

Vt and length of the transistors of pulse generator was used to simulate the timing

uncertainly.

The write-back bus was designed to be timing critical and required re-design for the 65nm

implementation since interconnect delay does not scale as well as gate delay. In addition to

improvements to the write-back buses, to mitigate issues from duty cycle variation, all

receiving low voltage swing (LVS) input multiplexers were replaced with static and domino

based multiplexers; dual rail logic in the AGU and L1 data array was converted to single rail

to provide additional metal resources to the write-back buses. On the other hand, the ALU

adder input multiplexer was implemented as a symmetric dual rail system to streamline high

frequency timing interface between the input multiplexer sequence and the dual rail domino

adder. The ALU adder input multiplexer uses a segmented 5:1 multiplexer which consists of

two smaller multiplexer-latches merged through a domino gate. This achieved denser layout

and reduced area overhead and delay.

15

The integer unit consists of two 64bit ALUs, each implemented as a 32bit data block with

one fclk latency for communication between the upper and lower units. The ALUs execute

add, subtract, logic and rotate operations. The adder is based on a sparse radix-2 carry merge

tree that generates every fourth carry. This architecture speeds up the critical path by moving

the carry merge logic to a non-critical side path. This adder also reduces propagate-generate

fanout by up to 50%, wiring complexity by 80% and delay by 20% compared to a Kogge-

Stone implementation. The ALUs require two fclk cycles to execute an ADD/SUB

instruction. In the second fclk cycle a pair of 2:1 multiplexer selects the appropriate

conditional sum or logic rotator result. The adder also supports 8bit and 16bit modes of

operation by incorporating carry-kill circuits in the carry-merge tree without impacting

performance. The designers found that the domino and static gates used in the ALU tree

scales better at lower voltages with 5% reduction in power but at the cost of 5% more area

when compared to a simple shrink from 90nm. The AGU computes linear addresses for

cache access from source operands. These operands are merged using a 4:2 compressor

followed by a sparse tree completion adder to give full linear address. The sparse tree is

divided into two 16bit segments. The lower 16bits is implemented in dual rail domino

circuits in order to meet the tight latency constraints of the L1 data cache. The ALU

shifter/rotator performs 8bit, 16bit and 32bit shift and rotate operations as well as byte swaps.

Two 32bit circuits are cascaded to perform 64bit shift left operations. To support 2x

frequency operation, the rotate/shift is broken into three stages with reduced fanout. This also

lowers the wiring complexity but at the cost of 20% area penalty.

The IRF is a 144 entry array which supports six reads and three writes per fclk. The main

goals were to meet speed, bandwidth and area concerns. To reduce long wire delays, the

designers broke up global bit lines to reduce RC delay. To minimize pulse evaporation, local

bitlines are terminated by SDL. To meet the wire speed constraints, the write was pipelined;

this alleviated the wire data distribution overhead across all bitcells at high frequency. A

bypass network is used to detect read/write collisions which required 18 CAMs for each

cycle of overlap. The decoder was carefully balanced using special latches to minimize area

needed for address decode transistors and wiring to distribute final decoded address across

the entire array. The 90nm decoder scaled well into 65nm technology; there was a 35%

smaller area footprint and lower read/write latencies but at a higher power density. The

16

designers found the integer unit to have the higher power and thermal density on the CPU.

This required heavy circuit level power optimization in the aforementioned components of

the integer unit. These optimizations included replacing LVS dual rail in the AGU with a

single rail, replacing LVS carry select adder with a sparse tree adder, replacing LVS logic

execution unit with a single rail static logic execution unit, replacing the LVS AGU adders

with a parallel sparse adder and replacing the LVS rotator with a single rail domino rotator

design. The entire AGU, ALU, rotator, adder and logic execution was done in single cycle

gating and high leakage devices were minimized. All these changes resulted in a 3.8W

reduction in power which translates to a 200MHz upside for the 65nm processor. At 1.3V

and 70 degrees the power consumption was found to be 10.3W.

Critique

With the hindsight of Pentium 4’s demise in the face of thermal and power challenges, it is

little easier to critique the design choices in this paper. It appeared the integer unit required

some complex and specialized re-design to fit into a competitive 65nm thermal budget. This

required a re-design of most blocks which resulted in either an area or power hit. It was clear

that the architecture of the Pentium 4 functional units was not scalable with technology. The

main design approach seems to favor replacing LVS circuits with domino circuits to gain

performance and combat process variations and noise. The incorporation of dual-rail domino

logic suggests additional wiring and area hits to gain additional performance and

complications with clocking and pre-charge. The paper claims huge power gains with their

implementation but this is compared to a straight-forward shrink of their 90nm into 65nm

which would have un-realistic. A better metric of comparison would be some kind of relative

power consumption of the integer unit compared to the rest of the system for a given

technology. The paper provides good information on all components of the integer execution

unit including register files and the 2x clocking scheme. The obvious drawback of having to

use a 2x clocking scheme is higher power dissipation. The paper claims that in spite of a

mixed domino/static circuit design style the complexity of the design flow was reduced, this

speaks to the refinement of the design flow which is important as designs get complex. The

designers explain the architectural tradeoffs involved in designing various components of the

unit including the AGU, ALU and shifter/rotator as well the challenges of actual circuit

implementation. The goals were to lower latency and minimal area and power cost, and the

17

9GHz specification was reached. The paper could have used some more information on the

test and debug features that went into the flip-flop and latches used in their design.

HIGH SPEED MEMORY

In addition to the various levels of cached SRAM memory, there are other memory

architectures that are part of the datapath and must serve single-cycle, high speed lookup

applications. As opposed to random memory access of standard computer memory where an

address is supplied to procure a data word, certain applications require a memory that can

search through the entire array and return a storage address for a given word. A common

example is the lookup table function of an internet router. Such a content-addressable

memory (CAM) architecture is also used in Translation Lookaside Buffers, cache controllers

and data compression hardware in addition to network applications [11]. Clearly, the CAM

also plays an important role in the datapath of various computer architectures. The design and

implementation of the CAM requires careful consideration of competing aspects of area,

density, power and speed. This is the subject of the final paper.

CAM Tutorial [6]

This paper is an extensive tutorial and survey of current CAM architectures and the

challenges and tradeoffs in the design of each component of the CAM. The goal of the

authors is to explain the working of the CAM with special emphasis on power saving

approaches. The authors provide some good models and metrics for measuring power and

delay and discuss the tradeoffs of various architectures. The authors approach the topic of

CAM design from two levels – architectural and circuit. The paper also presents some data

on CAM efficiency for various architectures in 180nm CMOS technology.

Background

This paper is a general survey on CAM architecture which means it can apply to any

microprocessor design. The most common applications of a CAM might be in network

processors that run in internet routers. Networking involves lookup-table matching for packet

forwarding which requires a fast array lookup scheme to read out output address ports

associated with an IP packet. The CAM offers the best solution in lookup and search

functions out of any hardware or software approach so it appears that unless a radically new

18

hardware comes along the CAM architecture will be ubiquitous. The single cycle-throughput

of a CAM makes it very attractive to architects. Another quality that makes a CAM suitable

to network processors is its ability to comprehend priority matching and encoding in

hardware itself. The constraints on the design are the area, power and latency concerns that

come with scaling the size of the CAM. From a hardware perspective, there are also issues

with transistor and wire density as well technology challenges. Today, the largest single chip

CAM implementations run in the 18Mb range.

Features and Technology

A CAM compares input search data against a table of stored data and returns the address of

the matching data in a single cycle. The input search-word is broadcast on the search-lines

(SL) to the table of stored CAM words. A CAM word can range from 36bits to 144bits and a

typical CAM table can range from a few hundred to 32K entries. Each stored word has a

match-line (ML) that indicated a hit or a miss which is fed into an encoder to generate a

binary match location corresponding to the ML. In case of multiple matches a priority

encoder can be used. The output of the encoder is used to read out the correct address from

an SRAM. Thus the matching word of a CAM becomes a pointer into an SRAM. The authors

develop a model of the basic CAM cell. A CAM bitcell has two basic functions: bit storage

and bit comparison. CAM bitcells are arranged horizontally to make up a CAM word with a

ML corresponding to each word which is fed into ML sense amplifiers (MLSA). Each CAM

bitcell in a column is attached to a differential SL pair. CAM operation takes place in the

following stages all happening in one clock cycle: first the search-word is loaded into data

register, then the MLs are pre-charged high, then the search-word is broadcast through the

SL, then each CAM bit compares a stored bit with the bit on the SL, then CAM words with at

least one mismatch will discharge the ML while words with all matching bits keep the ML

charged high and finally the MLSA detects whether an ML has a matching or miss condition

and maps the ML to an encoded binary address in an SRAM.

The authors present two types of CAM bitcells: NOR and NAND cells, each using cross

coupled inverters which make up an SRAM. They are named for the logical structure that

they resemble when arranged in a CAM word. The bit comparison is logically equivalent to

an XOR of the stored bit and the search bit. The NOR cell uses four minimum sized

19

transistors to implement a pull down path from ML in the event of a mismatch. The NAND

cell implements three minimum sized transistors where one transistor is a pass transistor

implementation of the XNOR function. The NOR cell provides full rail voltage swing while

the NAND provides reduced logic due to the NMOS pass transistor implementation. There

are variants of both the NAND and the NOR cells depending on the number of transistors

which pose tradeoffs in terms of transistor density and voltage swing. It is also possible to

implement ternary logic cells which can store a value representing ‘0’ and ‘1’ and causes a

match regardless of the input bit. Such an application works in network routing and priority

encoding schemes. To make a ternary cell, a second SRAM can be added to the NOR cell

where each bit connects to its own independent pull-down path. The authors point out several

modifications to the ternary cell using PMOS pull downs and complementing logic for the SL

and the ML. PMOS transistors offer a more compact layout which lowers wiring capacitance,

however they are slower. A NAND cell can be made ternary by adding storage for a mask bit

which can override matching regardless of bit value.

The paper goes into details about the ML structures and sensing schemes as well. The NOR

ML is formed by connecting NOR cells in parallel. A NOR search cycle involves three steps:

SL pre-charge low to disconnect ML from the ground, ML pre-charge high by turning on an

ML pre-charge transistor and SL being driven to trigger ML evaluation. The slowest

discharge path for the ML is through two series transistors to ground which is still faster than

a NAND cell. For this reason NOR based CAMs are more prevalent. A NAND ML is formed

by cascading several NAND cells. A PMOS pre-charge transistor pre-charges the ML high

and during evaluation, an NMOS ML evaluation transistor turns on to creates a series path (8

to 16 transistors) to ground for a CAM word with matching bits. In case of a mismatch a

single NAND cell can break the chain to ground. The MLSA detects differences between low

and high voltage. An important feature of the NAND ML is that a miss stops signal

propagation and there is no more power consumption. The drawbacks are the quadratic delay

dependence on the number of NAND cells and the noise margin of the NMOS pass

transistors. The NAND ML also has potential for charge sharing across the NAND cells

which may cause incorrect voltage on the ML; however this can be avoided by pre-charging

intermediate nodes in the cells which increases power consumption.

20

There are several ML sensing schemes with varying tradeoffs in power, delay, control and

area that can be used in CAM architectures. The authors propose a simple model to determine

power consumption of various ML sensing schemes of a NOR cell. The ML can be modeled

as a simple capacitor to model ML wiring capacitance, diffusion capacitance of the NOR cell,

pre-charge transistors and the MLSA; and a pull down resistor weighted by the number of

mismatched bits. The model assumes that charge sharing is avoided by connecting the CAM

storage bits to the top transistors of the pull down path according to figure x in the paper. To

control pre-charge and evaluation clocking, a replica ML is generated for worst case timing

and is used to generate timing signals. The dynamic power consumption for w miss MLs can

be given by the following equation, where f is the frequency of search operations:

[6]

To reduce power and increase speed a low swing voltage scheme may be used. This modifies

the power equation to

[6]

The authors reference a paper [12] that utilizes a tank capacitor with every ML that uses

charge sharing to pre-charge the ML. In case of a mismatch the ML discharges to ground and

in case of a hit it remains pre-charged. A current-race scheme removes the SL pre-charge

phase, mitigates charge sharing and saves some delay and power. The ML is pre-charged low

and concurrently drives the SL; evaluation happens by connecting the ML to a current

source. If there is a miss, then the ML is driven high. The power consumption for this scheme

is given by

[6]

The authors also point out a selective pre-charge scheme that allocates power to the ML non-

uniformly. The match operation happens on the first few bits of the word before activating

the remaining bits. The authors claim that this can save 88% dynamic power in a 144bit

CAM word. The two drawbacks are that non-uniform data distribution might eliminate

21

overall power savings and to maintain speed the initial matching might draw higher power

per bit. The selective pre-charge scheme can implemented using a mixed NAND/NOR ML

structure [43]. A pipeline scheme is a selective pre-charge scheme that divides the ML into

multiple segments in a pipelined fashion where a match in one segment drives searching in

the following segments serially. The drawbacks are increased latency, area and power

overhead although hierarchal SL bitlines can save some power. Finally, the authors present a

current saving scheme that is a data dependent ML sensing scheme that improves on the

current sensing scheme. This scheme adds a current control block which allocates different

amounts of currents based on a match or a miss, but requires ML to the pre-charged low.

Since most evaluations result in a miss this saves 50% power compared to current race

scheme. The authors summarize the various power and energy numbers for the different ML

schemes for a ternary 152 by 72 CAM block in 180nm CMOS technology in the paper.

The next topic was SL driving schemes as they apply to three cases: ML pre-charge high, ML

pre-charge low and a pipelined ML with hierarchal SL. The conventional approach applies to

when the ML is pre-charged high. The SL is driven by a cascade of inverters, first to their

pre-charge level and then to their data value. The power consumption due to SL can be given

by the following equation, where denotes the capacitance of the SL, and n is is the total

number of SL pairs.

[6]

Power consumption of the drivers can add an additional 25%. If the SL pre-charge is

eliminated as seen in the case of the current race scheme then this can secure an additional

50% power reduction. The hierarchal SL schemes built on top of the pipelined ML scheme

removes the need to drive every SL. Since only the global SL are active , the local SL might

be turned off most of the time. However this adds to implementation complexity and lower

noise immunity. The power consumption for hierarchal SL can be given by the following

equation, where is the activity rate of the local SL.

22

This requires that power dissipation by the global SL be sufficiently smaller than the local SL

which is the case if wiring capacitance is small compared to the transistor capacitance. A

small-swing scheme on the global SL can further reduce power but requires an amplifier to

convert signals to full-swing for the local SL. The authors summarize the various power and

energy numbers for the different SL schemes for a ternary 1024 by 144 CAM block in 180nm

CMOS technology in the paper.

Finally, the authors look at architectural techniques to lower power consumptions. Three

architectures have been proposed: bank selection, pre-computation and dense encoding. The

bank selection technique utilized an architecture where only a subset of CAM is active on a

given cycle. This saves area and power as the CAM is portioned by virtue of two or more

extra data select bits. The comparison circuitry can be shared as well. Total power savings

amount to 75%, however the major drawback are the additional circuitry to invoke multiple

banks, problem of bank overflow and complexity of algorithms to partition the banks and

balance the data. The pre-computation approach applies to binary CAM. In this method,

some extra information bits in a search word can be used to for an initial search thereby

saving power from a full search each time. This also saves area by simplifying comparison

circuitry for the second search. The third approach involves encoding or mapping stored data

in a ternary CAM to reduce the size of the CAM required for an application. Reducing the

number of entries can reduce power consumption. The authors refer to some papers [13, 14]

which propose some mapping algorithms to make use of unused ternary states to store

additional combinations of matched words. However, this requires changes to the ML

architecture. The authors conclude their paper by noting that CAM power consumption is

dominated by dynamic power and there is opportunity for research here. Some of the

guidelines to keep in mind when implementing CAMs are to minimize transistors and wire

capacitance, reduce voltage swing on the ML and reduce supply voltage. Reducing peak

power consumption is also an interesting challenge for future CAM design. New memory

technologies like ZRAM and MRAM will also pose challenges when architecting CAM.

Critique

This paper is probably one of the best available tutorials on memory architectures. It provides

excellent information on various CAM architectures with special emphasis on the tradeoffs

23

between them. The authors cover design styles and tradeoffs for every component of the

CAM from both a circuit and an architectural level. It gives designers the right knowledge to

pick the best implementation for their designs. However, the authors leave out some

potentially interesting information on the associated SRAM that goes with a CAM in a look-

up table scheme. There might be some research opportunities in implementing both the CAM

and SRAM bitcell together to save on area, wiring and peripheral circuitry. The authors chose

not to focus on the actual reading and writing of the CAM cells for the sake of simplicity, but

they could have devoted a small section on the circuit design challenges especially for

writing into CAM cells. The authors could have also addressed future challenges with scaling

the size of the CAM with respect to sub-micron technology issues. One of the most

interesting architectures was the partitioning and data encoding algorithms. There might be

opportunity for research in algorithms that favor application specific dynamic partitioning.

It would also be interesting to see how multi-voltage supplies and dynamic circuits can be

incorporated in the CAM architectures to further improve performance. I believe there is

some room for improvement in the working of the worst-case replica ML as well. The replica

ML governs timing based on worst case matching or miss conditions; this might be wasteful

as a sizable number of searches most likely do not require such worst case boundaries.

Perhaps some form of adjustable ML replica scheme that is aware of the application and the

running average times for matching and miss on the ML can be used to better tune CAM

timing. Overall, I think this paper does a great job in highlighting opportunities for future

research in CAM architectures in addition to a very complete introduction to CAM.

CONCLUSION

In conclusion, the papers on circuits and datapath were probably the most detailed and

technical out of all the topics. The papers stress the important of architecture in addition to

circuit level design and optimization for the datapaths and the high speed memories. A good

architecture must be scalable and must remain competitive in performance and power for

several generations of technology. A poor architecture from a power, performance or area

perspective cannot offer much optimism for improvement through technology process or

circuit improvements alone. In addition to architecture, careful design at the logical, circuit

and layout level is also important. Being able to identify and learn more about the target

applications can also help architects from the very beginning of a design cycle. As more

24

dynamic and domino circuit topologies find their way into static CMOS designs the issue of

local clock generation to support these topologies becomes important as well. Special care

must be taken to maintain electrical integrity on a local and global scale of a circuit with

timing included for all circuit topologies. Sub-micron technology poses well known

challenges with variation and higher leakage and must be accounted for by both architects

and circuit designers. Features for test, debug and yield in addition to the design verification,

synthesis and simulation tools are equally important. New, innovative, extensible and

specialized datapath architectures to address the current technology and performance

efficiency challenges will surely be interesting research topics to watch out for.

REFERENCES

[1] C. Hamacher et al, Computer Organization, 5th ed. New York: McGraw-Hill, 2002.

[2] N. Weste et al, CMOS VLSI Design: a circuits and systems perspective, 3rd ed. New York: Addison-Wesley, 2005.

[3] J. Warnock et al, “Circuit Design Techniques for a First-Generation Cell Broadband Engine Pocessor,” in Solid-State Circuits, IEEE Journal of, Volume 41, Issue 8, Aug. 2006 Page(s):1692 – 1706.

[4] H. Oh et al, “A Fully Pipelined Single-Precision Floating-Point Unit in the Synergistic Processor Element of a CELL Processor,” in Solid-State Circuits, IEEE Journal of, Volume 41, Issue 4, Apr. 2006 Page(s):759 – 771.

[5] S. Wijeratne et al, “A 9-GHz 65-nm Intel Pentium 4 Processor Integer Execution Unit,” in Solid-State Circuits, IEEE Journal of, Volume 42, Issue 1, Jan. 2007 Page(s):26 – 37.

25

[6] K. Pagiamtzis et al, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” in Solid-State Circuits, IEEE Journal of, Volume 41, Issue 3, Mar. 2006 Page(s):712 – 727.

[7] “Cell Broadband Engine resource center” IBM developerWorks. Retrieved 1 Jul.2007 <http://www.ibm.com/developerworks/power/cell/documents.html>.

[8] B. Flachs et al., “The microarchitecture of the synergistic processor for a cell processor,” in Solid-State Circuits, IEEE Journal of, Volume 41, Issue 1, Jan. 2006 Page(s):63 – 70.

[9] “Intel NetBurst Architecture” Intel Software Network. Retrieved 5 Aug. 2007. <http://www.intel.com/cd/ids/developer/asmo-na/eng/44004.htm>.

[10] V. Baranov, “Intel Pentium 4 641 (Cedar Mill) – 65nm process technology advancing” Digital-Daily.com. Retrieved 5 Aug. 2007. <http://www.digital-daily.com/cpu/intel_pentium_4_641/>.

[11] K. Pagiamtzis, “Introduction to Content-Addressable Memory (CAM)” Kostas Pagiamtzis, University of Toronto. Retrieved 5 Aug. 2007. <http://www.pagiamtzis.com/cam/camintro.html

[12] G. Kasai et al, “200 MHz/200 MSPS 3.2W at 1.5V Vdd, 9.4 Mbits ternary CAM with new charge injection match detect circuits and bank selection scheme,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 2003, Page(s):387 – 390.

[13] S. Hanzwa et al, “A dynamic CAM based on a one-hot-spot block code-for million-entry lookup,” in Symp. VLSI Circuits Dig. Tech. Papers, 2004, Page(s):382 – 385.

[14] “A large-scale and low-power CAM architecture featuring a one-hot-spot block code for IP-address lookup in a network router,” in IEEE J. Solid-State circuits, Volume 40, Issue 42, Apr. 2005, Page(s):853 – 861.

26


Top Related