
  • 8/7/2019 Sanjay(High Speed Dsp Architectures)


    High Performance DSP Architectures

    CHAPTER 1

    EVOLUTION OF DSP PROCESSORS

    INTRODUCTION

    Digital Signal Processing is carried out by mathematical operations. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing

    tasks. These devices have seen tremendous growth in the last decade, finding use in

    everything from cellular telephones to advanced scientific instruments. In fact, hardware

    engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use

    "DSP" to mean Digital Signal Processing.

    DSP has become a key component in many consumer, communications, medical, and

    industrial products. These products use a variety of hardware approaches to implement

    DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate

    arrays (FPGAs) to custom integrated circuits (ICs). Programmable DSP processors, a

    class of microprocessors optimized for DSP, are a popular solution for several reasons.

    In comparison to fixed-function solutions, they have the advantage of potentially being

    reprogrammed in the field, allowing product upgrades or fixes. They are often more cost-

    effective than custom hardware, particularly for low-volume applications, where the

    development cost of ICs may be prohibitive. DSP processors often have an advantage in

    terms of speed, cost, and energy efficiency.

    DSP ALGORITHMS MOULD DSP ARCHITECTURES

    From the outset, DSP processor architectures have been moulded by DSP algorithms. For

    nearly every feature found in a DSP processor, there are associated DSP algorithms whose

    computation is in some way eased by inclusion of this feature. Therefore, perhaps the best

    way to understand the evolution of DSP architectures is to examine typical DSP

    algorithms and identify how their computational requirements have influenced the

    architectures of DSP processors.

    FAST MULTIPLIERS

    The FIR filter is mathematically expressed in terms of a vector of input data and a vector of filter coefficients. For each tap of the filter, a data sample is multiplied by a filter coefficient, and the result is added to a running sum over all of the taps. Hence, the main component of the FIR filter algorithm is a dot product: multiply and add, multiply and add.

    These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of

    the most common operations performed in signal processing; convolution, IIR filtering, and Fourier transforms all involve heavy use of multiply-accumulate operations.

    Originally, microprocessors implemented multiplications by a series of shift and add

    operations, each of which consumed one or more clock cycles. As might be expected, faster

    multiplication hardware yields faster performance in many DSP algorithms, and for this

    reason all modern DSP processors include at least one dedicated single-cycle multiplier or combined multiply-accumulate (MAC) unit.

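    The multiply-accumulate kernel described above can be sketched in C. This is an illustrative sketch only: the function name `fir_sample`, the operand widths, and the newest-first ordering of samples are assumptions, not details from the text.

```c
#include <stdint.h>

/* One output sample of an N-tap FIR filter: a dot product of the
 * most recent input samples and the filter coefficients. Each loop
 * iteration is one multiply-accumulate (MAC); a DSP's dedicated MAC
 * unit executes one such iteration per clock cycle. */
int32_t fir_sample(const int16_t *x, const int16_t *h, int ntaps)
{
    int32_t acc = 0;                      /* running sum (accumulator) */
    for (int k = 0; k < ntaps; k++)
        acc += (int32_t)x[k] * h[k];      /* multiply and add */
    return acc;
}
```

    Note that the accumulator is wider (32 bits) than the 16-bit operands, anticipating the guard-bit discussion in the Data Format section.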

    Department Of Electronics & Communication Engineering, GEC Thrissur. 1


    DATA FORMAT

    DSP applications typically must pay careful attention to numeric fidelity. Since numeric

    fidelity is far more easily maintained using a floating point format, it may seem surprising

    that most DSP processors use a fixed-point format. DSP processors tend to use the shortest

    data word that will provide adequate accuracy in their target applications. Most fixed-point

    DSP processors use 16-bit data words, because that data word width is sufficient for many DSP applications. A few fixed-point DSP processors use 20, 24, or even 32 bits to enable

    better accuracy in applications that are difficult to implement well with 16-bit data, such as

    high-fidelity audio processing.

    To ensure adequate signal quality while using fixed-point data, DSP processors typically

    include specialized hardware to help programmers maintain numeric fidelity throughout a

    series of computations. For example, most DSP processors include one or more

    accumulator registers to hold the results of summing several multiplication products.

    Accumulator registers are typically wider than other registers; they often provide extra bits,

    called guard bits, to extend the range of values that can be represented and thus avoid

    overflow. In addition, DSP processors usually include good support for saturation

    arithmetic, rounding, and shifting, all of which are useful for maintaining numeric fidelity.
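    As a rough illustration of why saturation support matters, the following C sketch clamps a wide accumulator back into a 16-bit word. The name `sat16` is hypothetical; a real DSP performs this clamping in hardware.

```c
#include <stdint.h>

/* Saturating narrowing of a wide accumulator to a 16-bit result.
 * Guard bits let the intermediate sum grow beyond the 16-bit range
 * without overflowing; saturation then clamps the final value to the
 * nearest representable extreme instead of letting it wrap around. */
int16_t sat16(int32_t acc)
{
    if (acc > INT16_MAX) return INT16_MAX;   /* clamp positive overflow */
    if (acc < INT16_MIN) return INT16_MIN;   /* clamp negative overflow */
    return (int16_t)acc;
}
```

    Wrap-around would turn a large positive sum into a large negative sample, an audible artifact in audio processing; saturation limits the error to a gentle clipping.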

    ZERO-OVERHEAD LOOPING

    DSP algorithms typically spend the vast majority of their processing time in relatively

    small sections of software that are executed repeatedly; i.e., in loops. Hence, most DSP

    processors provide special support for efficient looping. Often, a special loop or repeat

    instruction is provided which allows the programmer to implement a for-next loop without

    expending any clock cycles for updating and testing the loop counter or branching back to

    the top of the loop. This feature is often referred to as zero-overhead looping.

    STREAMLINED I/O

    Finally, to allow low-cost, high-performance input and output, most DSP processors

    incorporate one or more specialized serial or parallel I/O interfaces, and streamlined I/O

    handling mechanisms, such as low-overhead interrupts and direct memory access (DMA),

    to allow data transfers to proceed with little or no intervention from the processor's

    computational units.

    SPECIALIZED INSTRUCTION SETS

    DSP processor instruction sets have traditionally been designed with two goals in mind.

    The first is to make maximum use of the processor's underlying hardware, thus increasing

    efficiency. The second goal is to minimize the amount of memory space required to store

    DSP programs, since DSP applications are often quite cost-sensitive and the cost of

    memory contributes substantially to overall chip and/or system cost. To accomplish the

    first goal, conventional DSP processor instruction sets generally allow the programmer to

    specify several parallel operations in a single instruction, typically including one or two

    data fetches from memory in parallel with the main arithmetic operation. With the second

    goal in mind, instructions are kept short by restricting which registers can be used with

    which operations, and restricting which operations can be combined in an instruction.


    CHAPTER 2

    TRADITIONAL SOLUTIONS FOR REAL TIME PROCESSING

    DSP architecture designs have traditionally focused on meeting real-time constraints. Advanced signal processing algorithms, such as those in base station receivers, present difficulties to the designer due to their complexity, higher data rates, and the desire for more channels per hardware module. A key constraint from the

    manufacturing point of view is attaining a high channel density.

    Traditionally, real-time architecture designs employ a mix of DSPs, Co-processors,

    FPGAs, ASICs and Application Specific Standard Parts (ASSPs) for meeting real-time

    requirements in high performance applications. The chip rate processing is handled by the

    ASSP, ASIC or FPGA while the DSPs handle the symbol rate processing and use co-

    processors for decoding. The DSP can also implement parts of the MAC layers and control

    protocols or can be assisted by a RISC processor.

    However, dynamic variations in the system workload, such as variations in the number of

    users in wireless base-stations, will require a dynamic re-partitioning of the algorithms

    which may not be possible to implement in traditional FPGAs and ASICs in real-time.

    LIMITATIONS OF SINGLE PROCESSOR DSP ARCHITECTURES

    Single-processor DSPs can have only a limited number of arithmetic units and cannot directly extend their architectures to hundreds of arithmetic units. This is because, as the number of arithmetic

    units increases in an architecture, the size of the register files and the port interconnections

    start dominating the architecture.

    PROGRAMMABLE MULTIPROCESSOR DSP ARCHITECTURES

    Multiprocessor architectures can be classified into Single Instruction Multiple Data (SIMD)

    and Multiple Instruction Multiple Data (MIMD) architectures. Data-parallel DSPs exploit

    data parallelism, instruction-level parallelism and sub-word parallelism. Alternate levels of

    parallelism such as thread level parallelism exist and can be considered after this

    architecture space has been fully studied and explored.


    MULTI-CHIP MIMD PROCESSORS

    Each processor in a loosely coupled system has a set of I/O devices and a large local

    memory. Processors communicate by exchanging messages using some form of message-

    transfer system. Loosely coupled systems are efficient when interaction between tasks is minimal. The tradeoffs of this processor design have been the increase in programming

    complexity and the need for high I/O bandwidth and inter-processor support. Such MIMD solutions are also difficult to scale with the number of processors, e.g. the TI C4xx.

    (Figure: register file explosion in traditional DSPs with centralized register files.)

    The disadvantages of the multi-chip MIMD model and architectures are the following:

    1. Load-balancing for such MIMD architectures is not straightforward, similar to heterogeneous systems. This makes it difficult to partition algorithms on this architecture model, especially when the workload changes dynamically.

    2. The loosely coupled model is not scalable with the number of processors due to

    interconnection and I/O bandwidth issues.

    3. I/O impacts the real-time performance and power consumption of the architecture.

    4. Design of a compiler for a MIMD model on a loosely coupled architecture is difficult

    and the burden is left to the programmer to decide on the algorithm partitioning on

    the multiprocessor.

    SINGLE-CHIP MIMD PROCESSORS

    Single-chip MIMD processors can be classified into three categories: single-threaded chip multiprocessors (CMPs), multi-threaded multiprocessors (MTs) and clustered VLIW

    architectures . A CMP integrates two or more complete processors on a single chip.

    Therefore, every unit of a processor is duplicated and used independently of its copies.


    In contrast, a multi-threaded processor interleaves the execution of instructions of various

    threads of control in the same pipeline. Therefore, multiple program counters are available

    in the fetch unit and multiple contexts are stored in multiple registers on the chip. Multi-

    threading increases instruction level parallelism in the arithmetic units by providing access

    to more than a single independent instruction stream. The programmer is responsible for scheduling the threads of the application.

    Clustered VLIW architectures solve the register explosion problem by employing clusters of functional units and register files.

    Clustering improves cycle time in two ways: by reducing the distance the signals have to

    travel within a cycle and by reducing the load on the bus. Clustering is beneficial for

    applications which have limited inter-cluster communication. However, compiling for clustered VLIW architectures can be difficult, since the compiler must schedule across the clusters while minimizing inter-cluster operations and their latency.

    Although single chip MIMD architectures eliminate the I/O bottleneck between multiple

    processors, the load balancing and architecture scaling issues still remain. The availability

    of data parallelism in signal processing applications is not utilized efficiently in MIMD

    architectures.

    SIMD ARRAY PROCESSORS

    SIMD processing refers to architectures in which identical processors execute the same instruction but work on different sets of data in parallel. An SIMD array processor is a processor design targeted towards the implementation of arrays or matrices. Various interconnection methodologies are used for array processors, such as linear arrays (vectors), rings, stars, trees, meshes, systolic arrays and hypercubes; examples include the Illiac-IV and the Burroughs Scientific Processor (BSP). Although vector processors have been the most popular version of array processors, mesh-based processors are still being used in scientific computing.

    SIMD VECTOR PROCESSORS

    Data parallelism allows vector processors to approach the performance and power efficiency of custom designs, while simultaneously providing the flexibility of a programmable processor. Vector machines were the first attempt at building supercomputers, starting with the Cray-1. These processors executed vector instructions, such as vector adds and multiplications, out of a vector register file. The number of memory banks is equal to the number of processors, so that all processors can access memory in parallel.
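    The banking scheme just described can be sketched as a simple mapping: with as many banks as processors, consecutive elements of a unit-stride vector land in consecutive banks, so every processor can hit a different bank in the same cycle. The helper `bank_of` is a hypothetical illustration, not from the text.

```c
/* Interleaved memory banking: element i of a unit-stride vector is
 * stored in bank (i mod nbanks). With nbanks equal to the number of
 * processors, a unit-stride vector access touches each bank exactly
 * once, so all processors can access memory in parallel. */
static inline int bank_of(int index, int nbanks)
{
    return index % nbanks;
}
```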


    DATA-PARALLEL DSPS

    Data-parallel DSPs are architectures that exploit data-level parallelism. Stream processors are state-of-the-art

    programmable architectures aimed at media processing applications. Stream processors

    enhance data-parallel DSPs by providing a bandwidth hierarchy for data flow in signal

    processing applications that enables support for hundreds of arithmetic units in the data-

    parallel DSP.

    PIPELINING MULTIPLE PROCESSORS

    An alternate method to attain high data rates is to provide multiple processors that are

    pipelined. Such processors would be able to take advantage of the streaming flow of data

    through the system. The disadvantages of such a design are that the architecture would

    need to be carefully designed to match the system throughput and is not flexible enough to

    adapt to changes in system workload. Also, such a pipelined system would be difficult to

    program and would suffer from I/O bottlenecks unless implemented as an SoC. However, this is the only way to provide the desired system performance if the amount of parallelism that can be exploited does not meet the system requirements.


    CHAPTER 3

    CURRENT DSP LANDSCAPE

    CONVENTIONAL DSP PROCESSORS

    The performance and price range among DSP processors is very wide. In the low-cost,

    low-performance range are the industry workhorses, which are based on conventional DSP

    architecture. They issue and execute one instruction per clock cycle, and use the complex,

    multi-operation type of instructions described earlier. These processors typically include a

    single multiplier or MAC unit and an ALU, but few additional execution units, if any.

    Included in this group are Analog Devices ADSP-21xx family, Texas Instruments

    TMS320C2xx family, and Motorola's DSP560xx family. These processors generally

    operate at around 20-50 MHz, and provide good DSP performance while maintaining very

    modest power consumption and memory usage. Midrange DSP processors achieve higher

    performance than the low-cost DSPs described above through a combination of increased clock speeds and somewhat more sophisticated architectures.

    ENHANCED CONVENTIONAL DSP PROCESSORS

    DSP processor architects improved performance by extending conventional DSP

    architectures by adding parallel execution units, typically a second multiplier and adder.

    These hardware enhancements are combined with an extended instruction set that takes

    advantage of the additional hardware by allowing more operations to be encoded in a single instruction and executed in parallel. We refer to this type of processor as an enhanced-

    conventional DSP processor, because it is based on the conventional DSP processor

    architectural style rather than being an entirely new approach. With this increased

    parallelism, enhanced-conventional DSP processors can execute significantly more work

    per clock cycle; for example, two MACs per cycle instead of one.

    Enhanced-conventional DSP processors typically have wider data buses to allow them to

    retrieve more data words per clock cycle in order to keep the additional execution units fed.

    They may also use wider instruction words to accommodate specification of additional

    parallel operations within a single instruction.

    MULTI-ISSUE ARCHITECTURES

    With the goals of achieving high performance and creating an architecture that lends itself

    to the use of compilers, some newer DSP processors use a multi-issue approach.


    In contrast to conventional and enhanced-conventional processors, multi-issue processors

    use very simple instructions that typically encode a single operation. These processors

    achieve a high level of parallelism by issuing and executing instructions in parallel groups

    rather than one at a time. Using simple instructions simplifies instruction decoding and

    execution, allowing multi-issue processors to execute at higher clock rates than

    conventional or enhanced-conventional DSP processors, e.g. the TMS320C62xx. The two classes of architectures that execute multiple instructions in parallel are referred to as

    VLIW and Superscalar. These architectures are quite similar, differing mainly in how

    instructions are grouped for parallel execution.

    VLIW and superscalar architectures provide many execution units each of which executes

    its own instruction. VLIW DSP processors typically issue a maximum of between four and

    eight instructions per clock cycle, which are fetched and issued as part of one long super-

    instruction, hence the name Very Long Instruction Word. Superscalar processors typically

    issue and execute fewer instructions per cycle, usually between two and four. In a VLIW

    architecture, the assembly language programmer specifies which instructions will be

    executed in parallel. Hence, instructions are grouped at the time the program is assembled,

    and the grouping does not change during program execution. Superscalar processors, in

    contrast, contain specialized hardware that determines which instructions will be executed

    in parallel based on data dependencies and resource contention, shifting the burden of

    scheduling parallel instructions from the programmer to the processor. The processor may

    group the same set of instructions differently at different times in the program's execution;

    for example, it may group instructions one way the first time it executes a loop, then group

    them differently for subsequent iterations. The difference in the way these two types of

    architectures schedule instructions for parallel execution is important in the context of

    using them in real-time DSP applications. Because superscalar processors dynamically

    schedule parallel operations, it may be difficult for the programmer to predict exactly how long a given segment of software will take to execute. The execution time may vary based

    on the particular data accessed, whether the processor is executing a loop for the first time

    or the third, or whether it has just finished processing an interrupt, for example. Dynamic

    features also complicate software optimization. As a rule, DSP processors have

    traditionally avoided dynamic features for just these reasons; this may be why there is

    currently only one example of a commercially available superscalar DSP processor.

    In VLIW architectures, a wide instruction word may be required in order to specify

    information about which functional unit will execute the instruction. Wider instructions

    allow the use of larger, more uniform register sets, which in turn enables higher

    performance. There are disadvantages, however, to using wide, simple instructions. Since each VLIW instruction is simpler than a conventional DSP processor instruction, VLIW

    processors tend to require many more instructions to perform a given task. Combined with

    the fact that the instruction words are typically wider than those found on conventional

    DSP processors, this characteristic results in relatively high program memory usage. High

    program memory usage, in turn, may result in higher chip or system cost because of the

    need for additional ROM or RAM.

    VLIW processors typically use either wide buses or a large number of buses to access data

    memory and keep the multiple execution units fed with data. The architectures of VLIW

    DSP processors are in some ways more like those of general-purpose processors than like

    those of the highly specialized conventional DSP architectures.


    VLIW and superscalar processors often suffer from high energy consumption relative to conventional DSP processors; in general, multi-issue processors are designed with an emphasis on increased speed rather than energy efficiency. These processors often have

    more execution units active in parallel than conventional DSP processors, and they require

    wide on-chip buses and memory banks to accommodate multiple parallel instructions and

    to keep the multiple execution units supplied with data, all of which contribute to increased

    energy consumption.

    Because they often have high memory usage and energy consumption, VLIW

    and superscalar processors have mainly targeted applications which have very demanding

    computational requirements but are not very sensitive to cost or energy efficiency. For example, a VLIW processor might be used in a cellular base station, but not in a portable

    cellular phone.

    On DSP processors with SIMD capabilities, the underlying hardware that supports SIMD

    operations varies widely. Analog Devices, for example, modified their basic conventional

    floating-point DSP architecture, the ADSP-2106x, by adding a second set of execution

    units that exactly duplicate the original set. The augmented architecture can issue a single

    instruction and execute it in parallel in both sets of execution units using different data, effectively doubling performance in some algorithms.

    In contrast, instead of having multiple sets of the same execution units, some DSP

    processors can logically split their execution units into multiple sub-units that process

    narrower operands. These processors treat operands in long registers as multiple short

    operands. Perhaps the most extensive SIMD capabilities we have seen in a DSP processor

    to date are found in Analog Devices' TigerSHARC processor. TigerSHARC is a VLIW

    architecture, and combines the two types of SIMD: one instruction can control execution of

    the processor's two sets of execution units, and this instruction can specify a split-

    execution-unit operation that will be executed in each set. Using this hierarchical SIMD

    capability, TigerSHARC can execute eight 16-bit multiplications per cycle. SIMD is only effective in algorithms that can process data in parallel; for algorithms that are inherently serial, SIMD is generally not of use.
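    The split-execution-unit style of SIMD can be mimicked in C by packing two 16-bit lanes into one 32-bit word. The function `add2x16` is an illustrative sketch under that assumption; a real DSP performs this as a single hardware instruction.

```c
#include <stdint.h>

/* Sub-word SIMD sketch: treat one 32-bit word as two independent
 * 16-bit lanes and add the lanes pairwise in one operation, the way
 * a split ALU processes two narrow operands at once. Carries are not
 * allowed to propagate from the low lane into the high lane. */
uint32_t add2x16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;           /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* high lane */
    return hi | lo;
}
```

    Suppressing the inter-lane carry is the essential difference from an ordinary 32-bit add: each 16-bit lane wraps (or, with extra logic, saturates) independently.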


    CHAPTER 4

    DIVERGING ARCHITECTURES

    Up until recently, DSP processor designs were improved primarily by incremental

    enhancements; new DSPs tended to maintain a close resemblance to their predecessors. In the last couple of years, however, DSP architectures have become much more interesting,

    with a number of vendors announcing new architectures that are completely different from

    preceding generations.

    HIGH-PERFORMANCE DSPS

    Processor designers who want higher DSP performance than can be squeezed out of

    traditional architectures have come up with a variety of performance-boosting strategies.

    The main idea is that if you want to improve performance beyond the increase afforded by

    faster clock speeds, you need to increase the amount of useful work that gets done every clock cycle. This is accomplished by increasing the number of operations that are

    performed in parallel, which can be implemented in two main ways: by increasing the

    number of operations performed by each instruction, or by increasing the number of

    instructions that are executed in every instruction cycle.

    INCREASING THE WORK PERFORMED BY EACH INSTRUCTION

    Traditionally, DSP processors have used complex, compound instructions that allow the

    programmer to encode multiple operations in a single instruction. In addition, DSP

    processors traditionally issue and execute only one instruction per instruction cycle. This

    single-issue, complex-instruction approach allows DSP processors to achieve very strong DSP performance without requiring a large amount of program memory.

    One method of increasing the amount of work performed by each instruction while

    maintaining the basics of the traditional DSP architecture and instruction set described

    above is to augment the data path with extra execution units. We refer to processors that follow this approach as ``enhanced conventional DSPs''; their basic architecture is similar

    to previous generations of DSPs, but has been enhanced by adding execution units.

    Lucent Technologies' DSP16000 architecture is based on that of the earlier DSP1600, but Lucent added a second multiplier, an adder, and a bit manipulation unit. To support more parallel operations and keep the processor from starving for data, Lucent also increased the data bus widths to 32 bits. The net result is a processor that is able to sustain a throughput

    of two multiply-accumulates per instruction cycle.

    EXECUTING MULTIPLE INSTRUCTIONS / CYCLE

    A few designers have opted for a more RISC-like instruction set coupled with an

    architecture that supports execution of multiple instructions in every instruction cycle, e.g. the

    TMS320C62xx family. In TI's version, the processor fetches a 256-bit instruction

    ``packet,'' parses the packet into eight 32-bit instructions, and routes them to its eight

    independent execution units.
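    The packet-parsing step can be sketched in C as a split of a 256-bit fetch packet into eight 32-bit words. This is purely illustrative: the real C62xx does this in its fetch hardware, and the byte order and the function name `parse_packet` are assumptions.

```c
#include <stdint.h>

/* Split a 256-bit fetch packet (32 bytes) into its eight 32-bit
 * instruction words, as the C62xx fetch unit does before routing
 * them to the eight independent execution units. */
void parse_packet(const uint8_t packet[32], uint32_t insn[8])
{
    for (int i = 0; i < 8; i++) {
        /* little-endian assembly of one 32-bit instruction word */
        insn[i] = (uint32_t)packet[4*i]
                | (uint32_t)packet[4*i + 1] << 8
                | (uint32_t)packet[4*i + 2] << 16
                | (uint32_t)packet[4*i + 3] << 24;
    }
}
```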

    VLIW processors typically suffer from high program memory requirements and high

    power consumption. Like VLIW processors, superscalar processors issue and execute

    multiple instructions in parallel. Unlike VLIW processors, in which the programmer

    explicitly specifies which instructions will be executed in parallel, superscalar processors


    use dynamic instruction scheduling to determine ``on the fly'' which instructions will be

    executed concurrently based on the processor's available resources, on data dependencies,

    and on a variety of other factors. Superscalar architectures have long been used in high-

    performance general-purpose processors such as the Pentium and PowerPC.

    CIRCULAR BUFFERING

    In off-line processing, the entire input signal resides in the computer at the same time. The

    key point is that all of the information is simultaneously available to the processing

    program. This is common in scientific research and engineering, but not in consumer

    products. Off-line processing is the realm of personal computers and mainframes.

    In real-time processing, the output signal is produced at the same time that the input signal

    is being acquired. To calculate the output FIR sample, we must have access to a certain

    number of the most recent samples from the input. When a new sample is acquired, it

    replaces the oldest sample in the array, and the pointer is moved one address ahead.

    Circular buffers are efficient because only one value needs to be changed when a new

    sample is acquired.

    Four parameters are needed to manage a circular buffer. First, there must be a pointer that

    indicates the start of the circular buffer in memory. Second, there must be a pointer

    indicating the end of the array, or a variable that holds its length. Third, the step size of

    the memory addressing must be specified. These three values define the size and

    configuration of the circular buffer, and will not change during the program operation. The

    fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired. In other words, there must be program logic that controls how this fourth value is updated based on the first three values.
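    A minimal C sketch of the scheme just described follows; the names `circbuf_t` and `circbuf_push`, and the fixed step size of one, are assumptions for illustration.

```c
#include <stdint.h>

/* Circular delay line for real-time FIR filtering. Three of the four
 * parameters from the text are fixed at set-up: the buffer start
 * (buf), its length (len), and the step size (1 here). The fourth,
 * the index of the most recent sample, moves on every acquisition. */
typedef struct {
    int16_t *buf;    /* start of the circular buffer in memory */
    int      len;    /* length of the array (defines its end)   */
    int      newest; /* index of the most recent sample         */
} circbuf_t;

/* Overwrite the oldest sample with the new one and advance the
 * pointer, wrapping at the end of the array. Only one memory value
 * changes per acquired sample. */
void circbuf_push(circbuf_t *cb, int16_t sample)
{
    cb->newest = (cb->newest + 1) % cb->len;  /* wrap around the end */
    cb->buf[cb->newest] = sample;             /* replace the oldest  */
}
```

    On a real DSP the wrap-around is typically handled by dedicated modulo (circular) addressing hardware rather than by the `%` operator, so the update costs no extra cycles.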

    DSP/MICROCONTROLLER HYBRIDS

    Many applications require a mixture of control-oriented software and DSP software. A

    prime example is the digital cellular phone, which must implement both supervisory tasks

    and voice-processing tasks. In general, microcontrollers provide good performance in

    controller tasks and poor performance in DSP tasks; dedicated DSP processors have the

    opposite characteristics. Hence, until recently, combination controller/signal processing

    applications were typically implemented using two separate processors: a microcontroller

    and a DSP.


    In the past couple of years, however, a number of microcontroller vendors have begun to

    offer DSP-enhanced versions of their microcontrollers as an alternative to the dual-

    processor solution.

    Using a single processor to implement both types of software is attractive, because it can

    potentially:

    simplify the design task

    save board space

    reduce total power consumption

    reduce overall system cost

    Microcontroller vendors like Hitachi offer a DSP-enhanced version of their SH-2

    microcontroller. This version is called the SH-DSP, and adds a complete 16-bit fixed-point

    DSP data path to the SH-2. In contrast, ARM took a different approach and developed a

    DSP co-processor, ``Piccolo,'' that is meant to be used as an add-on to their ARM7

    microcontroller and each has its own instruction set and processes its own instruction

    stream. It is therefore possible for the two processors to operate in parallel with the caveat

    that Piccolo relies on the ARM7 to perform all data transfers.

    RECONFIGURABLE ARCHITECTURES

    Reconfigurable architectures are defined as programmable architectures that change their hardware or interconnections dynamically, providing flexibility along with benefits in execution time due to the reconfiguration, as opposed to merely turning off units to conserve power. There have been various approaches to providing and using this reconfigurability in programmable architectures. The first is the FPGA+ approach, which adds a number of high-level configurable functional blocks to a general-purpose device to optimize it for a specific purpose such as wireless. The second is to develop a reconfigurable system around a programmable ASSP. The third is based on a parallel array of processors on a single die, connected by a reconfigurable fabric. These kinds of architectures are still in the early stages of their evolution.


    CHAPTER 5
    NOVEL DSP ARCHITECTURES

    "POST-HARVARD" TECHNOLOGY

    After remaining unchanged for more than a decade, DSP architectures have started to

    evolve. They are even trying to encompass control operations. Conventional DSP

    architecture typically uses Harvard-style architecture, with separate data and instruction

    buses. Their main processing elements are a multiplier, an arithmetic logic unit (ALU), and

    an accumulation register, allowing creation of a multiply-accumulate (MAC) unit that

    accepts two operands. Depending on the processor, the operands may be 16-, 24-, 32-, or

    48-bit words in either fixed-point or floating-point format. Whatever the word width, these

    conventional DSPs offer fixed-width instructions, executing one instruction per clock

    cycle.

    Figure: The conventional DSP architecture uses separate data and instruction buses and features fixed-width instructions, executing one instruction per clock cycle.

    The instructions themselves can be fairly complex. A single instruction may embody two

    data moves, a MAC operation, and two pointer updates. These complex instructions help

    the conventional DSP offer a high degree of code density when performing repeated

    mathematical operations on arrays of numbers. As control devices, however, they leave

    something to be desired. The fixed-width instructions are inefficient when tasked with

    performing simple counter increments as part of a control loop, for instance. Even if the

    counter is only going as high as 10, the processor needs to use the full word width for the

    values. Conventional DSPs are also weak at bit-level data manipulation beyond bit shifting.
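    To make the contrast concrete, the work packed into one such complex instruction (two data moves, a MAC operation, and two pointer updates) can be unrolled into its constituent steps in Python. This is an illustrative sketch, not an actual DSP mnemonic or instruction encoding:

```python
def mac_step(state, coeffs, samples):
    """One conventional-DSP 'MAC with dual fetch and pointer update', unrolled.

    state holds the accumulator and the two address pointers. On a real DSP,
    all five steps below complete in a single complex instruction.
    """
    c = coeffs[state["cptr"]]        # data move 1: fetch coefficient
    x = samples[state["xptr"]]       # data move 2: fetch sample
    state["acc"] += c * x            # multiply-accumulate
    state["cptr"] += 1               # pointer update 1
    state["xptr"] += 1               # pointer update 2
    return state

# 3-tap FIR dot product: 1*4 + 2*5 + 3*6 = 32
state = {"acc": 0, "cptr": 0, "xptr": 0}
for _ in range(3):
    state = mac_step(state, [1, 2, 3], [4, 5, 6])
```

    The counter increment in the loop above is exactly the kind of simple control work for which the fixed-width DSP instruction is overkill.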

    Still, because of their number-crunching proficiency, conventional DSPs soon gained

    popularity in communications and media applications. The communications devices, including modem and telephony processors, needed the computational power for echo

    canceling, voice coding, and filtering. Media applications, including digital audio, video,

    and imaging, needed computational power for compression and filtering along with

    program flexibility to track evolving standards. DSPs also found a home in disk-drive and

    other servo-motor-control applications.

    ENHANCED DSPS EMERGE

    As semiconductor process technology evolved, conventional DSPs began to acquire a

    number of on-chip peripherals such as local memory, I/O ports, timers, and DMA

    controllers. Their basic architecture, however, didn't change for more than a decade. Eventually, though, the relative weakness in bit-level manipulation began to catch up with

    conventional DSPs, as did the incessant demand for greater performance.


    One common feature of these enhanced DSPs is the presence of a second MAC, which

    allows for some parallelism in computation. In many cases, this parallelism extends to

    other elements in the DSP, allowing the device to perform single-instruction, multiple-data

    (SIMD) operations. Often this is accomplished with data packing, which allows registers,

    data paths, and the like to handle two half-word operands each clock cycle. Along with

    data packing, many enhanced DSPs allow the instructions themselves to use fractional word widths, which allows multiple instructions to launch simultaneously.
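    Data packing can be modeled in a few lines of Python (an illustrative sketch; real SIMD units do this in hardware, often with optional saturation rather than the wrap-around shown here):

```python
MASK16 = 0xFFFF

def pack(hi, lo):
    """Pack two 16-bit half-words into one 32-bit register value."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def simd_add16(a, b):
    """Add two packed registers as independent 16-bit lanes (wrap-around)."""
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return pack(hi, lo)

r0 = pack(1000, 2000)
r1 = pack(3000, 4000)
r2 = simd_add16(r0, r1)   # two additions per "clock cycle": lanes 4000 and 6000
```

    The key point is that one 32-bit register and one adder pass produce two 16-bit results, which is exactly how packed-data SIMD doubles throughput on half-word operands.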

    The enhanced DSPs also tend to incorporate features that speed execution of algorithms in

    a specific application space as well as add special-purpose peripherals and memory. The

    exact nature of the specialization varies with the application an enhanced DSP targets,

    which makes direct comparisons difficult. Many include hardware accelerators for

    frequently-used operations as well as provide specialized addressing modes and augmented

    instruction sets that target the application space. The augmented instruction sets may

    include both special DSP instructions and RISC-like instructions for improved control

    operation.

    Consider, for instance, the Blackfin DSP family from Analog Devices. This family targets

    voice, video, and data communications signal processing along with control operations.

    The core includes dual 16-bit MACs, dual 40-bit arithmetic logic units (ALUs), a 40-bit

    barrel shifter, and quad 8-bit ALUs for video operations. Because the architecture allows

    data packing, the 40-bit ALUs can handle two 40-bit numbers or four 16-bit numbers. In

    addition, a control unit handles sequencing of instructions so that a mix of 16-bit control

    and 32-bit DSP instructions can pack for simultaneous execution. Data can be in 8-, 16-, or

    32-bit format.

    Figure: Analog Devices Blackfin DSP architecture handles multi-width data words and can simultaneously execute 16-bit control and 32-bit DSP instructions.

    The core also includes two data address generators (DAGs) to simplify both DSP and

    control operations. DSP addressing operations include circular buffering, for matrix

    operations, and bit-reversal, for unscrambling FFT results. Control operations include auto-

    increment, auto-decrement, and base-plus-immediate-offset addressing modes not found in

    conventional DSPs.
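    Bit-reversed addressing, which the address generators provide in hardware, can be modeled in software (an illustrative sketch of the effect, not the hardware mechanism):

```python
def bit_reverse(index, bits):
    """Reverse the low 'bits' bits of an index, as a bit-reversal address
    generator would when stepping through FFT output."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)
        index >>= 1
    return out

def unscramble(data):
    """Reorder FFT output from bit-reversed order back to natural order."""
    bits = (len(data) - 1).bit_length()
    return [data[bit_reverse(i, bits)] for i in range(len(data))]

# For an 8-point FFT, index 1 (0b001) maps to 4 (0b100), 3 (0b011) to 6 (0b110), etc.
```

    A DSP with bit-reversed addressing performs this reordering for free as a side effect of its pointer updates, rather than in a separate pass like this one.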


    INSTRUCTION SETS TARGET APPLICATIONS

    The instruction set of the Blackfin core includes both general DSP instructions and RISC-

    like control instructions. In addition, the core has complex instructions geared toward the

    needs of the intended applications. For Huffman coding, used in communications algorithms, there is a "Field Deposit/Extract" command. For the Discrete Cosine

    Transform, used in imaging and video, an IEEE 1180 rounding operation is available.

    Video compression algorithms can take advantage of the "Sum Absolute Difference"

    instruction.
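    The operation behind the "Sum Absolute Difference" instruction is simple to state in software (an illustrative sketch; the hardware computes it over packed pixel data in a single cycle):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two pixel blocks, the metric
    a 'Sum Absolute Difference' instruction accelerates."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

# Motion estimation in video compression picks the candidate block
# with the smallest SAD against the reference block.
reference = [10, 20, 30, 40]
candidates = [[11, 19, 33, 37], [90, 90, 90, 90]]
best = min(candidates, key=lambda c: sad(reference, c))
```

    Because motion estimation evaluates this metric millions of times per frame, collapsing it into one instruction is a large win for video compression.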

    These specialty instructions are one way that the Blackfin family targets applications. The

    other way is the peripheral mix each family member offers. The ADSP-21532, for

    example, aims at low-cost consumer multimedia applications by including peripherals

    supporting surround-sound and video-specific operating modes. The ADSP-21535 goes

    after high-performance communications applications with USB and PCI interfaces as well

    as substantial amounts of on-chip SRAM.

    The range and variety of variations within the Blackfin family as well as the nature of its

    specialized instructions mirror the diversity of enhanced conventional DSPs, available from

    companies such as Cirrus Logic, Motorola, and Texas Instruments. But for all the

    enhancements, these DSPs follow basically the same programming model as the

    conventional device.

    Other DSP architectures have emerged that follow a different programming model. In

    search of the highest performance levels, these architectures allow the DSP to launch

    multiple instructions at the same time for parallel execution. While these approaches result in greater code execution speed, they also make software more difficult to optimize. They

    require careful instruction ordering to avoid needing simultaneous access to the same data.

    They also need to avoid attempting simultaneous execution of instructions where one

    instruction depends on the results of the other for its operands. Not all DSP application

    software has a structure suitable for multiple-launch execution, but when it does, these

    DSPs offer the highest performance.

    PARALLELISM ARISES

    Two different forms of multiple-launch DSPs have arisen: very long instruction word

    (VLIW) and superscalar architectures. Both have multiple execution units configured to

    operate in parallel and use RISC-like instruction sets. The instructions of a VLIW

    architecture are explicitly parallel, being composed of several sub-instructions that control

    different resources. The superscalar architectures, on the other hand, load instructions in

    bulk, then use hardware run-time scheduling to identify instructions that can run in parallel

    and map them to the execution units.
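    The hazard checking that separates legal from illegal instruction groupings can be illustrated with a toy scheduler. This is a deliberately simplified sketch; real VLIW compilers and superscalar hardware also model functional-unit availability, latencies, and more:

```python
def schedule(instrs, width=4):
    """Toy in-order grouping of instructions into parallel issue packets.

    Each instruction is (dest, sources). An instruction may join the current
    packet only if no instruction already in the packet writes one of its
    sources (a read-after-write hazard) or its own destination.
    """
    packets, current = [], []
    for dest, srcs in instrs:
        written = {d for d, _ in current}
        if len(current) == width or written & (set(srcs) | {dest}):
            packets.append(current)     # hazard or full packet: start a new one
            current = []
        current.append((dest, srcs))
    if current:
        packets.append(current)
    return packets

# r3 depends on r2, so it cannot share a packet with the instruction that writes r2.
prog = [("r2", ["r0", "r1"]),
        ("r4", ["r0", "r5"]),   # independent: shares a packet with the first
        ("r3", ["r2", "r0"])]   # depends on r2: forced into the next packet
```

    A VLIW compiler performs this analysis statically and encodes the packets in the instruction word; a superscalar processor repeats it at run time in hardware.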

    Of the multi-launch architectures, VLIW designs are the most common. Devices from

    Adelante Technologies, Equator Technologies, Siroyan, and Texas Instruments fall into

    this category, although they vary considerably with the type and number of parallel

    execution units they offer. The TI TMS320C64xx processors, for instance, have eight

    execution units that can handle both 8- and 16-bit SIMD operations. The Siroyan OneDSP, on the other hand, is scalable from two to 32 clusters, each with several execution units.

    The Adelante Saturn DSP core as shown in the following figure demonstrates the essence

    of the VLIW approach. It uses multiple data buses in a dual-Harvard configuration to


    deliver data and 96-bit wide instructions to an array of execution units simultaneously.

    These units include two multipliers (MPY), four 16-bit ALUs that can combine to form

    two 20-bit ALUs, a barrel shifter with saturation logic (SST/BRS), program (PCU) and

    loop (LCU) controllers, address controllers (ACU), and an ability for design teams to add

    application-specific execution units (AXU) to speed processing.

    Figure: Adelante's Saturn DSP core handles VLIW instructions that can comprise several sub-instructions that control different resources. The core also handles application-specific execution units (AXUs) to accelerate processing.

    The Saturn core uses a unique approach to get around one of the problems the wide word

    widths of VLIW architectures cause. Accessing external memory is a challenge for these

    DSPs, because of their need to work with buses that can be as wide as 128 bits. The Saturn

    core uses 16-bit program memory, which it maps into the 96-bit instruction word it uses

    internally. Adelante developed this mapping after analyzing millions of lines of code for

    common applications. However, the core also allows developers to create their own

    application-specific instructions that map into the VLIW.

    SUPERSCALAR DSPS

    While the 16-bit external instruction width of the Saturn processor is unusual for VLIW

    architectures, it is typical for superscalar architectures. These devices pull in several

    instructions at a time and dynamically map them to the execution units. Internally the effect

    is much the same as a VLIW architecture in that execution units are operating in parallel.

    But from the software development viewpoint the approach reduces programming

    complexity. With hardware handling the sequencing and arranging of instructions, the

    developer is free to work with the more manageable short instructions.

    The figure below shows the structure of a sample superscalar DSP, the LSI Logic ZSP600. Because it is a core, its memory interface isn't constrained, making it look like a VLIW architecture. But the

    presence of the instruction-sequencing unit (ISU) and the pipeline control unit betray its

    superscalar nature. The ZSP600 fetches eight instructions at a time, and can execute as

    many as six, using its four MAC and two ALU execution units simultaneously. Data

    packing allows the units to perform 16- or 32-bit operations. The architecture also allows

    for the addition of coprocessors to speed specific DSP functions.


    Figure: Superscalar DSPs, such as LSI Logic's ZSP600, fetch several instructions simultaneously and dynamically map these instructions to the execution units.

    This ability to add coprocessors is becoming a common feature of high-performance DSP

    cores. In many cases the core's creators have also created coprocessors for functions such

    as DES (data encryption standard) and Viterbi coding. If a pre-designed coprocessor isn't

    available, however, creating your own can be a major design challenge.

    A recently-introduced DSP architecture, the PulseDSP from Systolix, might make the task

    easier. Similar to an FPGA, the PulseDSP offers a massively parallel, repetitive structure. It is designed as a systolic array, which means that all data transfers occur synchronously on a

    clock edge. Each processing element in the array has selectable I/O paths, local data

    memory, and an ALU. Both the I/O and the ALU are programmable, and the array has a

    programming bus running through it. The combination makes the array reprogrammable,

    either statically or dynamically. The array structure is intended to handle low-complexity

    but high-speed processing tasks using 16- to 64-bit arithmetic, which makes it suitable as a

    coprocessor.
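    The synchronous data flow of a systolic array can be modeled in a few lines. This toy FIR sketch captures only the clock-edge shifting; the PulseDSP's programmable I/O paths and per-element ALUs are omitted:

```python
def systolic_fir(coeffs, samples):
    """Toy systolic-array FIR: each processing element holds one coefficient.

    On every clock edge, samples shift one element along the array while each
    element multiplies its coefficient by the sample it holds; all transfers
    happen synchronously, as in a systolic array.
    """
    regs = [0] * len(coeffs)       # sample registers inside the elements
    out = []
    for x in samples:
        regs = [x] + regs[:-1]     # clock edge: data shifts through the array
        out.append(sum(c * r for c, r in zip(coeffs, regs)))
    return out

# y[k] = sum_i coeffs[i] * samples[k-i], with zeros before the stream starts
```

    Because every element does one multiply per clock, throughput scales with array length: exactly the "low-complexity but high-speed" profile described above.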

    Figure: Systolix's PulseDSP is a systolic array that can run as a coprocessor or as a standalone unit for applications such as filters and FFTs. The array is programmable,


    with each processing element having its own selectable I/O paths, local data memory, and an ALU.

    The array can also be used as a stand-alone processor for some types of algorithms, such as

    filters and FFTs. One of the commercial implementations of the array, in fact, is to provide

    filtering in an Analog Devices data acquisition part, the AD7725. The device combines the

    PulseDSP with a sigma-delta A/D converter to provide post-processing of the acquired

    data. The DSP array implements various filter algorithms.

    Innovations such as the PulseDSP as well as the proliferation within the other DSP

    architectures are a strong indicator of how important these once-arcane processors have

    become. In many applications, especially communications, they share the spotlight with the

    RISC processor. The DSP handles the data and the RISC handles the protocols. There are

    problems with the two-processor approach, of course, including increased cost and

    software development complexity. One reason many DSPs are adding RISC-like

    instructions to their set is to be able to edge out the other processor in such applications.

    The same thing is happening with some RISC processors. Extensible cores, such as the

    Tensilica Xtensa and the ARC International ARCtangent, are offering DSP enhancements

    so that communications applications need only one processor. These enhancements follow

    the architecture of the conventional DSP, but merge the DSP functions into the instruction

    set of the RISC core.

    The ARCtangent demonstrates how the two get blended. The DSP instruction decode and

    processing elements both connect with the rest of the core, allowing them to use the core's

    resources as well as their own. The extensions have full access to registers and operate in

    the same instruction stream as the RISC core. ARC's DSP offerings include MACs in

    varying widths, saturation arithmetic, and X-Y memory for DSP data. The extensions also

    support DSP addressing modes such as bit-reversal.

    Figure: The ARCtangent core from ARC International blends DSP functionality into a RISC processor. Both DSP instruction-decode and processing elements connect with the rest of the core, allowing these elements to use the core's resources as well as their own.

    These extended RISC processors, enhanced conventional DSPs, and high-performance

    architectures have all proliferated in the last few years, a sure sign of the importance DSPs

    have acquired. Furthermore, that proliferation is likely to continue. With process

    technology allowing integration of multiple peripherals with DSP cores and instruction sets

    extending to match application needs, DSPs are headed the way of the microcontroller.

    From obscure, specialized parts, they are evolving to become a fundamental building block

    for virtually any system.


    CHAPTER 6
    ARCHITECTURE OF LATEST DSP PROCESSORS

    TEXAS INSTRUMENTS TMS320C67xx FAMILY

    OVERVIEW

    The TMS320C67xx family comprises the highest-performance floating-point DSPs from Texas Instruments. It is based on the advanced VelociTI very-long-instruction-word (VLIW) architecture, which allows it to execute up to eight RISC-like instructions per clock cycle, making this DSP an excellent choice for multichannel and multifunction applications. It adds support for floating-point arithmetic and 64-bit data. It delivers up to 1 giga floating-point operations per second (GFLOPS) at a clock rate of 167 MHz, uses a 1.8-volt core supply, and executes up to 334 million MACs per second at 167 MHz. The TMS320C67xx's two data paths extend hardware support for 64-bit data and IEEE-754 32-bit single-precision and 64-bit double-precision floating-point arithmetic. Each data

    path includes a set of four execution units, a general-purpose register file, and paths for

    moving data between memory and registers.

    The four execution units in each data path comprise two ALUs, a multiplier, and an

    adder/subtractor which is used for address generation. The ALUs support both integer and

    floating point operations, and the multipliers can perform both 16x16-bit and 32x32-bit

    integer multiplies and 32-bit and 64-bit floating point multiplies. The two register files each

    contain sixteen 32-bit general-purpose registers. These registers can be used for storing

    addresses or data. To support 64-bit floating point arithmetic, pairs of adjacent registers can

    be used to hold 64-bit data.

    The C6701 DSP possesses the operational flexibility of high-speed controllers and the

    numerical capability of array processors. This processor has 32 general-purpose registers of

    32-bit word length and eight highly independent functional units. The eight functional units

    provide four floating-/fixed-point ALUs, two fixed-point ALUs, and two floating-/fixed-

    point multipliers. Program memory consists of a 64K-byte block that is user-configurable

    as cache or memory-mapped program space. Data memory consists of two 32K-byte

    blocks of RAM. The peripheral set includes two multichannel buffered serial ports

    (McBSPs), two general-purpose timers, a host-port interface (HPI), and a glueless external memory interface (EMIF) capable of interfacing to SDRAM or SBSRAM and

    asynchronous peripherals.

    The large on-chip memory system of the TMS320C67xx implements a modified

    Harvard architecture, providing separate address spaces for program and data memory.

    Program memory has a 32-bit address bus and a 256-bit data bus. Each of the two data

    paths is connected to data memory by a 32-bit address bus and two 32-bit data buses. Since

    there are two 32-bit data buses for each data path, the TMS320C67xx can load two 64-bit

    words per instruction cycle. TMS320C6701 has 64 Kbytes of 32-bit on-chip program RAM

    and 64 Kbytes of 16-bit on-chip data RAM.

    The TMS320C6701 has one external memory interface, which provides a 23-bit address

    bus and a 32-bit data bus. These buses are multiplexed between program and data memory

    accesses. Addressing modes supported include register-direct, register-indirect, indexed

    register-indirect, and modulo addressing. Immediate data is also supported.


    The TMS320C67xx does not support hardware looping, and hence all loops must be

    implemented in software. However, the parallel architecture of the processor allows the

    implementation of software loops with virtually no overhead.

    The peripherals on the TMS320C6701 include a host port, four-channel DMA controller,

    two TDM-capable buffered serial ports, and two 32-bit timers.

    CPU ARCHITECTURE


    CPU DESCRIPTION

    Fetch packets are always 256 bits wide; however, the execute packets can vary in size. The variable-length execute packets are a key memory-saving feature, distinguishing the C67x

    CPU from other VLIW architectures.

    The CPU features two sets of functional units. Each set contains four units and a register

    file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units

    .D2, .M2, .S2, and .L2. The two register files contain 16 32-bit registers each for the total

    of 32 general-purpose registers. The two sets of functional units, along with two register

    files, compose sides A and B of the CPU.

    The four functional units on each side of the CPU can freely share the 16 registers

    belonging to that side. Additionally, each side features a single data bus connected to all registers on the other side, by which the two sets of functional units can access data from

    the register files on opposite sides.

    In addition to the C62x DSP fixed-point instructions, six of the eight functional units (.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64 bits per CPU side for a total of 128 bits per cycle.

    Another key feature of the C67x CPU is the load/store architecture, where all instructions

    operate on registers. Two sets of data-addressing units (.D1 and .D2) are responsible for all

    data transfers between the register files and the memory. The data address driven by the .D units allows data addresses generated from one register file to be used to load or store data

    to or from the other register file. The C67x CPU supports a variety of indirect-addressing

    modes using either linear- or circular-addressing modes with 5- or 15-bit offsets. All

    instructions are conditional, and most can access any one of the 32 registers. Some

    registers, however, are singled out to support specific addressing or to hold the condition

    for conditional instructions. The two .M functional units are dedicated multipliers.

    The two .S and .L functional units perform a general set of arithmetic, logical, and branch

    functions with results available every clock cycle. The processing flow begins when a 256-

    bit-wide instruction fetch packet is fetched from a program memory. The 32-bit

    instructions destined for the individual functional units are linked together by "1" bits in the least significant bit (LSB) position of the instructions. The instructions that are chained together for simultaneous execution compose an execute packet. A "0" in the

    LSB of an instruction breaks the chain, effectively placing the instructions that follow it in

    the next execute packet. If an execute packet crosses the fetch-packet boundary (256 bits

    wide), the assembler places it in the next fetch packet, while the remainder of the current

    fetch packet is padded with NOP instructions. The number of execute packets within a

    fetch packet can vary from one to eight. Execute packets are dispatched to their respective

    functional units at the rate of one per clock cycle and the next 256-bit fetch packet is not

    fetched until all the execute packets from the current fetch packet have been dispatched.

    After decoding, the instructions simultaneously drive all active functional units for a maximum execution rate of eight instructions every clock cycle. While most results are

    stored in 32-bit registers, they can be subsequently moved to memory as bytes or half-

    words as well. All load and store instructions are byte-, half-word-, or word-addressable.
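    The p-bit chaining rule described above can be modeled directly (an illustrative sketch; the instruction words below are arbitrary values chosen only for their LSBs):

```python
def split_execute_packets(fetch_packet):
    """Split a C67x-style fetch packet into execute packets using the LSB p-bit.

    fetch_packet is a list of eight 32-bit instruction words. A '1' in the LSB
    chains an instruction to the next one; a '0' ends the execute packet.
    """
    assert len(fetch_packet) == 8          # fetch packets are 256 bits: 8 x 32
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:                  # p-bit clear: the chain is broken
            packets.append(current)
            current = []
    if current:                            # trailing chain continues past the boundary
        packets.append(current)
    return packets

# LSB=1 chains instructions together; LSB=0 terminates an execute packet.
fp = [0x11, 0x21, 0x30, 0x41, 0x50, 0x60, 0x71, 0x80]
eps = split_execute_packets(fp)
```

    Each resulting group is dispatched to the functional units in one clock cycle, so a fetch packet can yield anywhere from one to eight execute packets.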


    ANALOG DEVICES ADSP-21XX FAMILY

    OVERVIEW

    The ADSP-21xx is the first single chip DSP processor family from Analog Devices. The

    family consists of a large number of processors based on a common 16-bit fixed-point architecture core with a 24-bit instruction word. Each processor combines the core DSP architecture (computation units, data address generators, and program sequencer) with

    differentiating features such as on-chip program and data memory RAM, a programmable

    timer, and one or two serial ports.

    The fastest members of the family operate at 75 MIPS at 2.5 volts, 52 MIPS at 3.3 volts,

    and 40 MIPS at 5.0 volts. Analog Devices has recently announced the ADSP-219x series,

    which offers projected speeds of up to 300 MIPS, as well as architectural enhancements.

    ADSP-21xx processors are targeted at modem, audio, PC multimedia, and digital cellular

    applications.

    Fabricated in a high speed, submicron, double-layer metal CMOS process, the highest-

    performance ADSP-21xx processors operate at 25 MHz with a 40 ns instruction cycle time.

    Every instruction can execute in a single cycle. Fabrication in CMOS results in low power

    dissipation. The ADSP-2100 Family's flexible architecture and comprehensive instruction

    set support a high degree of parallelism.

    The ADSP-21xx data path consists of three separate arithmetic execution units: an arithmetic/logic unit (ALU), a multiplier/accumulator (MAC), and a barrel shifter. Each

    unit is capable of single-cycle execution, but only one of these units can be active during a

    single instruction cycle. The ALU operates on 16-bit data. In addition to the usual ALU

    operations, the ALU provides increment/decrement, absolute value, and add-with-carry

    functions. ALU results are saturated upon overflow if the appropriate configuration bit is

    set by the programmer. The MAC unit includes a 16x16->32-bit multiplier, four input

    registers, a feedback register, a 40-bit adder, and a single 40-bit result register/accumulator

    providing eight guard bits. Besides signed operands, the multiplier can operate on

    unsigned/unsigned or on signed/unsigned operands, thus supporting multi-precision

    arithmetic. The barrel shifter shifts 16-bit inputs from an input register or from the

    ALU/MAC/barrel shifter result registers into a 32-bit result register. Logical and arithmetic shifts are supported left or right up to 32 bits. The barrel shifter also supports block

    floating-point arithmetic with block exponent detect (which determines a maximum

    exponent of a block of data), single-word exponent detect, normalize, and exponent adjust

    instructions.
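    Block exponent detect can be modeled as follows (an illustrative sketch of the operation's effect on two's-complement words, not ADI's hardware implementation):

```python
def redundant_sign_bits(x, wordlen=16):
    """Number of leading bits identical to the sign bit (beyond the sign bit
    itself) in a two's-complement word of the given length."""
    mag = x if x >= 0 else ~x          # ~x has the same leading-bit count for negatives
    return (wordlen - 1) - mag.bit_length()

def block_exponent(block, wordlen=16):
    """Common shift count for the block, limited by its largest-magnitude value,
    as block exponent detect determines."""
    return min(redundant_sign_bits(x, wordlen) for x in block)

def normalize_block(block, wordlen=16):
    """Scale every word up by the shared block exponent (block floating point)."""
    e = block_exponent(block, wordlen)
    return [x << e for x in block], e

# The largest value (100) limits how far the whole block can be scaled up.
scaled, e = normalize_block([3, -5, 100, 7])
```

    Keeping one shared exponent per block preserves most of the dynamic-range benefit of floating point while the data path itself stays fixed-point.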

    ADSP-21xx processors use a modified Harvard architecture with separate memory spaces

    and on-chip bus sets for program and data. All processors in the ADSP-21xx family

    include on-chip program RAM or ROM and on-chip data RAM.

    On-chip program memory can be used for both instructions and data, and it can be accessed via a 14-bit address bus and a 24-bit data bus. On-chip program memory is dual-

    ported to allow the processor to fetch both a data operand and the next instruction in a

    single instruction cycle. The on-chip data memory can be accessed via a 14-bit address bus

    and a 16-bit data bus. One access to the on-chip data memory can be performed in a single


    instruction cycle. Three memory accesses (one instruction and two data operands) can be

    performed in one instruction cycle.

    Both of the on-chip memory spaces can be extended off-chip. All ADSP-21xx processors have one external memory interface, providing a 14-bit address bus and a 24-bit data bus.

    This external interface is multiplexed between program and data memory accesses.

    The ADSP-21xx supports register-direct, memory-direct and register-indirect addressing

    modes. Immediate data is also supported. The ADSP-21xx provides zero-overhead

    program looping through its DO instruction. Any length sequence of instructions can be

    contained in a hardware loop, and up to 16,384 repetitions are supported.

    ARCHITECTURE OVERVIEW

    The processors contain three independent computational units: the ALU, the

    multiplier/accumulator (MAC), and the shifter. The ALU performs a standard set of

    arithmetic and logic operations; division primitives are also supported. The MAC performs

    single-cycle multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and derive-exponent

    operations. The shifter can be used to efficiently implement numeric format control

    including multiword floating-point representations. The internal result (R) bus directly

    connects the computational units so that the output of any unit may be used as the input of

    any unit on the next cycle. A powerful program sequencer and two dedicated data address

    generators ensure efficient use of these computational units. The sequencer supports

    conditional jumps, subroutine calls, and returns in a single cycle. With internal loop

counters and loop stacks, the ADSP-21xx executes looped code with zero overhead; no

    explicit jump instructions are required to maintain the loop. Two data address generators

    (DAGs) provide addresses for simultaneous dual operand fetches (from data memory and

program memory). Each DAG maintains and updates four address pointers. Whenever a pointer is used to access data (indirect addressing), it is post-modified by the value of one

    of four modify registers. A length value may be associated with each pointer to implement

    automatic modulo addressing for circular buffers. The circular buffering feature is also


used by the serial ports for automatic data transfers to on-chip memory. Efficient data

transfer is achieved with the use of five internal buses: the Program Memory Address

(PMA) Bus, the Program Memory Data (PMD) Bus, the Data Memory Address (DMA) Bus,

the Data Memory Data (DMD) Bus, and the Result (R) Bus.

    The two address buses (PMA, DMA) share a single external address bus, allowing memory

    to be expanded off-chip, and the two data buses (PMD, DMD) share a single external data

    bus. The BMS, DMS, and PMS signals indicate which memory space is using the external

    buses. Program memory can store both instructions and data, permitting the ADSP-21xx to

    fetch two operands in a single cycle, one from program memory and one from data

    memory. The processor can fetch an operand from on-chip program memory and the next

    instruction in the same cycle. The memory interface supports slow memories and

memory-mapped peripherals with programmable wait state generation. External devices can

gain control of the processor's buses with the use of the bus request/grant signals.

    One bus grant execution mode (GO Mode) allows the ADSP-21xx to continue running

    from internal memory. A second execution mode requires the processor to halt while buses

    are granted. Each ADSP-21xx processor can respond to several different interrupts. There

    can be up to three external interrupts, configured as edge- or level-sensitive. Internal

    interrupts can be generated by the timer, serial ports, and, on the ADSP-2111, the host

    interface port. There is also a master RESET signal. Booting circuitry provides for loading

    on-chip program memory automatically from byte-wide external memory. After reset, three

    wait states are automatically generated. This allows, for example, a 60 ns ADSP-2101 to

    use a 200 ns EPROM as external boot memory. Multiple programs can be selected and

    loaded from the EPROM with no additional hardware. The data receive and transmit pins

on SPORT1 (Serial Port 1) can be alternatively configured as a general-purpose input flag and output flag. You can use these pins for event signalling to and from an external device.

    A programmable interval timer can generate periodic interrupts. A 16-bit count register

(TCOUNT) is decremented every n cycles, where n - 1 is the scaling value stored in an 8-bit


    register (TSCALE). When the value of the count register reaches zero, an interrupt is

    generated and the count register is reloaded from a 16-bit period register (TPERIOD).
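The resulting interrupt interval can be sketched arithmetically. The helper below is illustrative only: it assumes, as the description implies, that TCOUNT is decremented once every TSCALE + 1 cycles and that one full period spans TPERIOD + 1 decrements.

```c
#include <stdint.h>

/* Sketch of the ADSP-21xx interval timer arithmetic.  Assumption (not
 * taken from the datasheet text above): TCOUNT decrements once every
 * TSCALE + 1 processor cycles, and an interrupt period spans
 * TPERIOD + 1 decrements. */
uint32_t timer_period_cycles(uint16_t tperiod, uint8_t tscale)
{
    return ((uint32_t)tperiod + 1) * ((uint32_t)tscale + 1);
}
```

Under this assumption, TPERIOD = 999 and TSCALE = 9 would give an interrupt every 10,000 processor cycles.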

    BLACKFIN PROCESSOR

    Blackfin Processors are a new breed of embedded media processor. Based on the Micro

    Signal Architecture (MSA) jointly developed with Intel Corporation, Blackfin Processors

    combine a 32-bit RISC-like instruction set and dual 16-bit multiply accumulate (MAC)

    signal processing functionality with the ease-of-use attributes found in general-purpose

    microcontrollers. This combination of processing attributes enables Blackfin Processors to

perform equally well in both signal processing and control processing applications, in many

cases eliminating the need for separate heterogeneous processors.

This processor family also offers industry-leading power consumption, as

low as 0.15 mW/MMAC at 0.8 V. This combination of high performance and low power is

    essential in meeting the needs of today's and future signal processing applications including

    broadband wireless, audio/video capable Internet appliances, and mobile communications.

    HIGH PERFORMANCE SIGNAL PROCESSING

The core architecture employs a fully interlocked instruction pipeline, multiple parallel

computational blocks, efficient DMA capability, and instruction set enhancements

designed to accelerate video processing.

    FULLY INTERLOCKED INSTRUCTION PIPELINE

    All Blackfin Processors utilize a multi-stage fully interlocked pipeline that guarantees code

    is executed as you would expect and that all data hazards are hidden from the programmer.

    This type of pipeline guarantees result accuracy by stalling when necessary to achieve

    proper results.

    HIGHLY PARALLEL COMPUTATIONAL BLOCKS

The basis of the Blackfin Processor architecture is the Data Arithmetic Unit that includes two 16-bit Multiplier Accumulators (MACs), two 40-bit Arithmetic Logic Units (ALUs),

    four 8-bit video ALUs, and a single 40-bit barrel shifter. Each MAC can perform a 16-bit

    by 16-bit multiply on four independent data operands every cycle. The 40-bit ALUs can

    accumulate either two 40-bit numbers or four 16-bit numbers. With this architecture, 8-,

    16- and 32-bit data word sizes can be processed natively for maximum efficiency.
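As a concrete illustration of the dual 16-bit MACs, an FIR inner loop can be unrolled so that each iteration feeds four independent 16-bit operands to two accumulators, mirroring what the two hardware MACs consume in a single cycle. This is a plain-C sketch; the function name is mine, not a vendor API.

```c
#include <stdint.h>

/* Illustrative dual-MAC FIR dot product: each loop iteration performs
 * two independent 16x16 multiply-accumulates (MAC0 and MAC1), the
 * per-cycle workload of the two Blackfin MAC units. */
int32_t fir_dual_mac(const int16_t *x, const int16_t *h, int ntaps)
{
    int32_t acc0 = 0, acc1 = 0;             /* model the two accumulators */
    for (int i = 0; i + 1 < ntaps; i += 2) {
        acc0 += (int32_t)x[i]     * h[i];       /* MAC0 */
        acc1 += (int32_t)x[i + 1] * h[i + 1];   /* MAC1 */
    }
    if (ntaps & 1)                          /* odd tap count: one last MAC */
        acc0 += (int32_t)x[ntaps - 1] * h[ntaps - 1];
    return acc0 + acc1;
}
```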

    Two Data Address Generators (DAGs) are complex load/store units designed to generate

    addresses to support sophisticated DSP filtering operations. For DSP addressing, bit-

    reversed addressing and circular buffering is supported. The DAGs also include two loop

    counters for nested zero overhead looping and hardware support for on-the-fly saturation

    and clipping.

    HIGH BANDWIDTH DMA CAPABILITY


    All Blackfin Processors have multiple, independent DMA controllers that support

    automated data transfers with minimal overhead from the processor core. DMA transfers

    can occur between the internal memories and any of the many DMA-capable peripherals.

    VIDEO INSTRUCTIONS

    In addition to native support for 8-bit data, the word size common to many pixel processing

algorithms, the Blackfin Processor architecture includes instructions specifically defined to

enhance performance in video processing applications, such as those used in video

compression algorithms.

For control tasks, the Blackfin offers efficiency comparable to RISC control processors. Its

features include a hierarchical memory architecture, superior code density, and a variety of

microcontroller-style peripherals including a watchdog timer, a real-time clock, and an

    integrated SDRAM controller. The L1 memory is connected directly to the processor core,

    runs at full system clock speed, and offers maximum system performance for time critical

    algorithm segments. The L2 memory is a larger, bulk memory storage block that offers

slightly reduced performance but is still faster than off-chip memory.

    The L1 memory structure has been implemented to provide the performance needed for

    signal processing while offering the programming ease found in general purpose

    microcontrollers. By supporting both SRAM and cache programming models, system

    designers can allocate critical DSP data sets that require high bandwidth and low latency

    into SRAM, while maintaining the simple programming model of the data cache for

    operating system (OS) and microcontroller code.

    The Memory Management Unit provides for a memory protection format that can support a

    full OS Kernel. The OS Kernel runs in Supervisor mode and partitions blocks of memory

    and other system resources for the actual application software to run in User mode. This is

    a unique and powerful feature not present on traditional DSPs.

    SUPERIOR CODE DENSITY

    The Blackfin Processor architecture supports multi-length instruction encoding. Very

    frequently used control-type instructions are encoded as compact 16-bit words, with more

    mathematically intensive DSP instructions encoded as 32-bit values.

    DYNAMIC POWER MANAGEMENT

Blackfin Processors employ multiple power-saving techniques. They are based on a gated

    clock core design that selectively powers down functional units on an instruction-by-

    instruction basis. They also support multiple power-down modes for periods where little or

    no CPU activity is required. Lastly, and probably most importantly, Blackfin Processors

support a dynamic power management scheme whereby the operating frequency and


    voltage can be tailored to meet the performance requirements of the algorithm currently

    being executed.
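The benefit of scaling frequency and voltage together follows from the dynamic-power relation for CMOS logic, P = C * V^2 * f. A quick sketch:

```c
/* Dynamic CMOS power: P = C_eff * V^2 * f.  Halving the clock alone
 * halves power, but lowering the supply voltage as well (possible once
 * the clock is slower) multiplies the savings, since power scales with
 * the square of voltage. */
double dyn_power(double c_eff, double volts, double freq)
{
    return c_eff * volts * volts * freq;
}
```

For example, running at half frequency and half voltage consumes one eighth of the original dynamic power, not one half.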

    BLACKFIN PROCESSOR CORE BASICS

    The Blackfin Processor core is a load-store architecture consisting of a Data Arithmetic

    Unit, an Address Arithmetic Unit, and a sequencer unit. Blackfin Processors combine a

high-performance, dual-MAC DSP architecture with the programming ease of a RISC

MCU into a single instruction set architecture.

    GENERAL PURPOSE REGISTER FILES

    The Blackfin Processor core includes an 8-entry by 32-bit data register file for general use

    by the computational units. Supported data types include 8-, 16-, or 32-bit signed or

    unsigned integer and 16- or 32-bit signed fractional. In every clock cycle, this multiported

register file supports two 32-bit reads and two 32-bit writes. It can also be accessed as a

    16-entry by 16-bit data register file.

    The address register file provides a general purpose addressing mechanism in addition to

    supporting circular buffering and stack maintenance. This register file consists of 8 entries

    and includes a frame pointer and a stack pointer. The frame pointer is useful for subroutine

    parameter passing, while the stack pointer is useful for storing the return address from

    subroutine calls.

    DATA ARITHMETIC UNIT

    It contains:

    Two 16-bit MACs

    Two 40-bit ALUs

    Four 8-bit video ALUs

    Single barrel shifter


    All computational resources can process 8-, 16-, or 32-bit operands from the data register

file (R0 through R7). Each register can be accessed as a full 32-bit register or as its 16-bit

high or low half.

In a single clock cycle, the dual data paths can read and write up to two 32-bit values.

    However, since the high and low halves of the R0 through R7 registers are individually

addressable (Rx, Rx.H, or Rx.L), each computational block can choose from either two 32-bit

input values or four 16-bit input values with no restrictions on input data. The results of the computation can be written back into the register file as either a 32-bit entity or as the

    high or low 16-bit half of the register. Additionally, the method of accumulation can vary

between data paths.

    Both accumulators are 40 bits in length, providing 8 bits of extended precision. Similar to

    the general purpose registers, both accumulators can be accessed in 16-, 32-, or 40-bit

    increments. The Blackfin architecture also supports a combined add/subtract instruction

    that can generate two 16-, 32-, or 40-bit results or four 16-bit results. In the case where four

    16-bit results are desired, the high and low half results can be interchanged. This is a very

    powerful capability and significantly improves, for instance, the FFT benchmark results.
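The FFT payoff comes from the butterfly kernel, where each stage ends in paired sum/difference operations. The sketch below is my own generic C (not vendor code, no saturation, simplified Q15 scaling); the final four statements are the pattern the combined add/subtract instruction evaluates in a single cycle.

```c
#include <stdint.h>

typedef struct { int16_t re, im; } cplx16;

/* Radix-2 DIT butterfly in Q15 arithmetic: t = b * w, then
 * a' = a + t and b' = a - t.  The two sum/difference pairs at the end
 * map onto the combined add/subtract instruction. */
void butterfly(cplx16 *a, cplx16 *b, cplx16 w)
{
    int16_t tr = (int16_t)(((int32_t)b->re * w.re - (int32_t)b->im * w.im) >> 15);
    int16_t ti = (int16_t)(((int32_t)b->re * w.im + (int32_t)b->im * w.re) >> 15);
    int16_t ar = a->re, ai = a->im;
    a->re = ar + tr;  b->re = ar - tr;   /* combined add/subtract #1 */
    a->im = ai + ti;  b->im = ai - ti;   /* combined add/subtract #2 */
}
```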

    ADDRESS ARITHMETIC UNIT

    Two data address generators (DAGs) provide addresses for simultaneous dual operand

    fetches from memory. The DAGs share a register file that contains four sets of 32-bit index

(I), length (L), base (B), and modify (M) registers. There are also eight additional 32-bit

address registers (P0 through P5, the frame pointer, and the stack pointer) that can be used as

    pointers for general indexing of variables and stack locations.

    The four sets of I, L, B, and M registers are useful for implementing circular buffering.

Used together, each set of index, length, and base registers can implement a unique circular buffer in internal or external memory. The Blackfin architecture also supports a variety of

addressing modes, including indirect, auto-increment and decrement, indexed, and bit-reversed.

Finally, all address registers are 32 bits in length, supporting the full 4-Gbyte

    address range of the Blackfin Processor architecture.
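The I/L/B/M mechanism can be modeled in a few lines: the index register supplies the address for the current access and is then post-modified by M, wrapping back into the buffer [B, B + L). Register and function names here are illustrative.

```c
#include <stdint.h>

/* Model of DAG post-modified circular addressing: *I supplies this
 * access's address, then advances by M and wraps into [B, B + L).
 * Assumes |M| < L, as single-step hardware wrap logic does. */
uint32_t circ_postmod(uint32_t *I, uint32_t B, uint32_t L, int32_t M)
{
    uint32_t addr = *I;                      /* address used this access */
    int64_t next = (int64_t)*I + M;
    if (next >= (int64_t)B + L) next -= L;   /* stepped past the end     */
    else if (next < (int64_t)B) next += L;   /* stepped before the base  */
    *I = (uint32_t)next;
    return addr;
}
```

Stepping through an 8-word buffer at B = 100 with M = 4, a pointer at 106 returns address 106 and wraps to 102 with no explicit pointer test in the inner loop.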

    PROGRAM SEQUENCER UNIT

    The program sequencer controls the flow of instruction execution and supports conditional

    jumps and subroutine calls, as well as nested zero-overhead looping. A multistage fully

    interlocked pipeline guarantees code is executed as expected and that all data hazards are

hidden from the programmer. This type of pipeline guarantees result accuracy by stalling when necessary to achieve proper results.

    The Blackfin architecture supports 16- and 32-bit instruction lengths in addition to limited

    multi-issue 64-bit instruction packets. This ensures maximum code density by encoding the

    most frequently used control instructions as compact 16-bit words and the more

    challenging math operations as 32-bit double words.


    LSI LOGIC ZSP600-QUAD MAC SUPERSCALAR CORE

OVERVIEW

The ZSP600 is a quad-MAC superscalar DSP core that addresses the high-performance data

throughput and signal processing requirements of emerging communications

platforms. The ZSP600 supports up to six-instructions-per-cycle DSP performance at a peak 300 MHz clock rate. It includes quad MAC and quad ALU computational resources, a high-performance

    load/store memory architecture, and dedicated co-processor interfaces, combined with

    state-of-the-art power reduction techniques. These attributes make the ZSP600 core an

    ideal solution for a variety of embedded DSP algorithms, including those required for

    wireless infrastructure, mobile (3G), IAD/home gateway, central office, and

access/network applications. ZSP600 instruction parallelism is supported by user-transparent

instruction grouping and pipeline control to deliver superscalar DSP

performance while programming with a RISC instruction set.

    The ZSP600 is a fully synthesizable, single-phase, clocked architecture, with all core I/Os

    registered for ease-of-process migration and design flexibility. The ZSP600 provides

    extensive computational resources, including four 16-bit multipliers/MACs, dual 40-bit

ALUs, and dual 16-bit ALUs, all capable of supporting 16- and 32-bit operations. The

    ZSP600 can perform four independent 16x16 MUL/MAC operations into four 16-bit or

    two 40-bit results, two 32x32-bit MUL/MACs into a 32-bit result, or two Viterbi (add-

    compare-select) results per cycle. The ZSP600 is based upon a high-bandwidth memory

architecture with a separate eight-instruction-per-cycle prefetch and dual 64-bit data

interfaces over a 24-bit address space. The instruction memory architecture allows multi-instruction-per-cycle prefetch to an integrated instruction cache. The data memory architecture incorporates dual independent 64-bit load/store units, with dedicated address generation,

    allowing up to eight 16-bit word or four 32-bit word load/store operations per cycle. The

    ZSP600 integrates a bi-directional co-processor interface to support hardware acceleration.

    The memory subsystem (MSS) is decoupled from the DSP operations to provide increased

flexibility in support of different memory schemes. The core also includes instruction set

enhancements to the RISC architecture for improved broadband and wireless application

support.
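The Viterbi add-compare-select kernel mentioned above, which the ZSP600 can resolve twice per cycle, reduces to a handful of operations per trellis state. This is generic C, not ZSP intrinsics.

```c
#include <stdint.h>

/* One Viterbi add-compare-select (ACS) step: add branch metrics to the
 * two predecessor path metrics, keep the smaller candidate as the
 * survivor, and record which path won in the decision bit. */
int32_t acs(int32_t m0, int32_t bm0, int32_t m1, int32_t bm1, int *decision)
{
    int32_t p0 = m0 + bm0;
    int32_t p1 = m1 + bm1;
    *decision = (p1 < p0);          /* 1 if path 1 survives */
    return *decision ? p1 : p0;
}
```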


    A WORD ON SUPERSCALAR DSP

    A superscalar architecture simply implies that the architecture is responsible for resolving

    the operand and resource hazards and that it has the resources to achieve an instruction

throughput that is greater than one instruction per clock. Logic dedicated to pipeline control

    is kept to a minimum by enforcing in-order execution and by isolating the control to a

    single stage at the head of the pipeline. This stage issues sequential groups of instructions

    that have no data dependencies or other resource conflicts. Once a group of instructions has

    been issued, they advance through the pipeline in lock step.
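The grouping rule can be sketched as a forward scan that closes the issue group at the first read-after-write hazard or execution-unit conflict. This toy model is mine and greatly simplifies the real ISU logic.

```c
/* Toy in-order superscalar grouping (an illustration, not the actual
 * ZSP600 logic): scan from the head of the window and close the group
 * at the first instruction that reads a register written earlier in
 * the same group (RAW hazard) or that needs an execution unit the
 * group has already claimed. */
typedef struct { int unit; int dst; int src1, src2; } insn; /* -1 = unused */

int group_size(const insn *w, int n, int nunits)
{
    int unit_busy[8] = {0};   /* one flag per execution-unit class   */
    int written[64]  = {0};   /* registers written earlier in group  */
    int max = n < nunits ? n : nunits;
    for (int i = 0; i < max; i++) {
        if (unit_busy[w[i].unit]) return i;                    /* resource conflict */
        if ((w[i].src1 >= 0 && written[w[i].src1]) ||
            (w[i].src2 >= 0 && written[w[i].src2])) return i;  /* RAW hazard */
        unit_busy[w[i].unit] = 1;
        if (w[i].dst >= 0) written[w[i].dst] = 1;
    }
    return max;
}
```

Isolating this check at one pipeline stage, as the text describes, is what keeps the rest of the pipeline free of interlock logic.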

A VLIW machine, in contrast, does not employ hardware instruction scheduling or pipeline protection.

Instructions in a VLIW pipeline are statically issued, and it is the programmer's

responsibility to prevent data hazards and resource conflicts. Superscalar architectures also

    facilitate software compatibility not only between implementations of the same

architecture, but also from one generation of the architecture to the next, thus increasing

software lifetime.

    ARCHITECTURE OVERVIEW

    The G2 architecture is scalable in terms of arithmetic resources, data bandwidth, and

pipeline capacity. This scalable nature allows the architecture to support multiple implementations that target different application spaces.

    All address and data I/O communication across the core boundary are registered. This

feature is highly desirable from an SoC system designer's point of view for a number of

    reasons, one being the removal of timing budget ambiguities between system logic and the

    core.

The prefetch unit (PFU) sits at the head of the instruction pipeline. The ZSP600 can prefetch

    eight 16-bit words per cycle. It is responsible for maximizing the probability that the

    instruction cache has the data required by the instruction sequencing unit (ISU) for any

    given fetch cycle. The prefetch unit performs limited decoding to identify code

    discontinuities and to apply static branch prediction when necessary. The ISU is

    responsible for instruction fetch and decode, instruction grouping, and instruction issue.

    Instruction grouping refers to the pipeline stage in which operand dependencies are


    resolved. The ISU issues groups of in-order instructions that will not cause any operand

    conflicts. This is the only unit (and only stage in the execution pipeline) that enforces

    pipeline protection. Isolating the pipeline protection logic in this manner simplifies pipeline

    control logic significantly.

    The ZSP600 ISU can issue up to six instructions per cycle, one to each of the six primary

    datapaths: two address generation units (AGUs), two arithmetic logic units (ALUs), and

two multiply/accumulate/arithmetic units (MAUs) that are capable of performing up to four MAC operations per cycle. The pipeline control unit (PCU) stages the control associated with

    each of the primary data paths and the bypass logic. The PCU is also responsible for

    managing interrupt control, the co-processor interface, the debug interface, and the on-core

timers. The bypass unit (BYP) handles all data forwarding between execution units.

    PIPELINE

The G2 architecture has an eight-stage pipeline. The existing architecture uses a data prefetch mechanism, called data linking, to efficiently sustain the required data

bandwidth for its dual MAC. All pipeline protection and resource allocation is performed

    during the grouping stage. Instruction groups are issued by the grouping stage and advance

    in lock step down the remainder of the pipeline.

    Data address generation is performed in the AG stage. This stage is also responsible for

    enforcing the boundaries of the circular buffers. A load or store that straddles a boundary of

    the circular buffer is split by the AGU into two sequential accesses. Stages M0 and M1 are

    allocated for data memory loads. They are optimized for systems using synchronous RAM.

    M0 is allocated for address decode and M1 for data access and return. Load and store

    requests are registered and issued to the memory subsystem in M0. The memory interface

is stallable. If the MSS determines that it cannot return the requested data during M1, it stalls the core until the data is ready.

    ARITHMETIC RESOURCES


By adding two AGUs, along with dedicated address registers, the arithmetic throughput of

    G2 demonstrates an immediate improvement. The two AGUs allow the core to issue any

    combination of two loads or stores per cycle. The data size of the load/store is

implementation specific. Each data port in the ZSP600 is 64 bits wide, allowing a total of

128 bits (eight 16-bit words) of data to be loaded per cycle. The AGUs have dedicated hardware to

    support four circular buffers and reverse-carry addressing. The circular buffer support has

    been enhanced in functionality to support load/store operations with positive and negative

offsets and signed indexes. Circular buffer logic also applies to address arithmetic and has no alignment restrictions.

    REGISTER RESOURCES

    With the 32-bit address registers, the architecture allows implementations of the core to

    remain flexible in defining the physical linear address space. The actual address register

    remains a 32-bit register to ensure pointer sizes remain the same from one implementation

    to the next. This also allows the address registers to be used as temporary registers for the

GPRs. Dedicated address registers simplify the instruction decoder and issue logic, which can

now identify address-related operations and assign the datapath resources appropriately.

    The primary operand resource of the AGUs is the address register file, allowing the

    general-purpose register file to be physically optimized for data moving to and from the

ALUs and MAUs. The current generation defines two 32-bit registers and another 16-bit

register whose low and high bytes correspond to the upper bytes of the two accumulators,

respectively, thus yielding 40-bit accumulators. A guard byte is now available for each

of the eight extended 32-bit registers of the GPRs. Accumulators are also recognized in the

    programming model by providing associated instruction set support for 40-bit arithmetic

    and data loads and stores.
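The split accumulator storage described above can be pictured as follows: the low 32 bits live in a GPR, and the guard bits live in one byte of the shared 16-bit register. The helper name is hypothetical, used only to make the layout concrete.

```c
#include <stdint.h>

/* Assemble a 40-bit accumulator value from its 32-bit GPR half and its
 * signed 8-bit guard byte (bits 39:32), sign-extending into 64 bits.
 * Hypothetical helper, not part of the real ISA. */
int64_t acc40(uint32_t low32, int8_t guard)
{
    return (int64_t)(((uint64_t)(int64_t)guard << 32) | low32);
}
```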

    INSTRUCTION SET ENHANCEMENTS

    A powerful enhancement to the new architecture is the ability to conditionally execute

    instructions. The programming model for G2 allows programmers to define packets of

    instructions that are predicated on a specified condition. The programmer then defines a

    bracketed set of up to eight instructions that will be predicated in the execution pipeline