memory architecturesoc.yonsei.ac.kr/class/material/dsp/memoryarchitecture.pdf · 2017-03-06 · dsp...

DSP VLSI Design

Memory Architecture

Byungin Moon

Yonsei University


Outline

Memory architectures
  Von Neumann architecture and Harvard architecture
  Fast memories and multiported memories

Features for reducing memory access requirements
  Program caches
  Modulo addressing and algorithmic approaches

Wait states
ROM
External memory interfaces

Multiprocessor support
Dynamic memory (DRAM)
Direct memory access (DMA)


Need for High-Speed Memory Architecture

A powerful data path is, at best, only part of a high-performance processor
  The processor also needs the ability to move large amounts of data to and from memory quickly

The organization of memory and its interconnection with the processor's data path are critical factors; together they are called the memory architecture

Example: FIR filter
  DSP processor data paths are designed to perform a multiply-accumulate operation in one cycle
    Only one cycle per filter tap
    Sustaining this rate requires multiple memory accesses per cycle


[Figure: FIR Filter as an Example Needing the Capability of Multiple Memory Accesses per Cycle]


Attention!

The textbook DSPFundamentals claims that one filter tap needs four memory accesses, on the assumption that a circular buffer is not used
  Fetch the multiply-accumulate instruction
  Read the appropriate data value from the delay line
  Read the appropriate coefficient value
  Write the data value to the next location in the delay line, to shift data through the delay line
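The four accesses per tap can be made concrete with a small software model. This is only a sketch: the function name `fir_with_shift`, the access categories, and the extra spill slot at the end of the delay line are assumptions made for illustration, not part of the textbook's description.

```python
def fir_with_shift(delay, coeffs, x):
    """Compute one FIR output while shifting the delay line in memory.

    delay has len(coeffs) + 1 slots: delay[k] is the input from k samples
    ago, and the last slot only receives the value that falls off the end.
    Returns (y, accesses), where accesses counts memory accesses by kind.
    """
    accesses = {"instruction": 0, "data_read": 0, "coeff_read": 0, "data_write": 0}
    delay[0] = x                           # store the new input sample
    acc = 0
    # Go from the oldest tap to the newest so that each value is used
    # before the shift overwrites it.
    for k in range(len(coeffs) - 1, -1, -1):
        accesses["instruction"] += 1       # fetch the multiply-accumulate instruction
        d = delay[k]
        accesses["data_read"] += 1         # read the data value
        h = coeffs[k]
        accesses["coeff_read"] += 1        # read the coefficient
        acc += h * d
        delay[k + 1] = d
        accesses["data_write"] += 1        # shift the data value along the delay line
    return acc, accesses


coeffs = [1, 2, 3]
delay = [0] * (len(coeffs) + 1)
outputs = [fir_with_shift(delay, coeffs, x)[0] for x in [1, 0, 0, 0]]
print(outputs)                             # impulse response of the filter

_, accesses = fir_with_shift([0] * 4, coeffs, 5)
print(sum(accesses.values()) // len(coeffs), "memory accesses per tap")
```

Feeding an impulse through the model returns the coefficients in order, and the tally comes to four accesses for every tap.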


Von Neumann Architecture (Revisited)

A single set of address and data lines
  Both program instructions and data are stored in a single memory

Common among non-DSP processors
  However, most current 32-bit microprocessors have a Harvard architecture

Strength
  Low cost

Weakness
  The processor can make only one memory access during each instruction cycle
    It takes at least four cycles to complete a multiply-accumulate for one tap of an FIR filter


[Figure: Von Neumann Architecture]


Harvard Architecture (Revisited)

Two independent memory banks and two independent sets of buses (address and data)
  Original
    One bank for instructions, one bank for data
  Modified
    One bank for instructions and data, one bank for data

Two memory accesses per instruction cycle
  The four memory accesses required for one tap of an FIR filter complete in two instruction cycles
    ADSP-21xx and AT&T DSP16xx (writes take two cycles)

More powerful derivatives
  One program bank and two data banks (X and Y)
    DSP Group Pine and Oak, Zilog Z893xx, SGS-Thomson D950-CORE, and Motorola DSP5600x, DSP56xxx, and DSP96002
    The shortfall of one access per cycle (four needed, three available) can be overcome by modulo addressing


[Figure: Harvard Architecture]


Fast Memories

On-chip memories that support multiple sequential accesses per instruction cycle over a single set of buses
  Yield better performance when combined with a Harvard architecture
    A Harvard architecture with two banks of fast memory can complete four memory accesses per instruction cycle

AT&T DSP32xx
  Von Neumann architecture with multiple-access memories
  Can complete four sequential accesses to on-chip memory in a single instruction cycle

Zoran's ZR3800x
  Harvard architecture with multiple-access memory
  One single-access program memory bank with one dual-access data memory bank
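The cycle counts quoted for the different architectures all follow from one ceiling division. The sketch below assumes that accesses can always be scheduled at the peak rate (no bank conflicts); the dictionary labels and the function name `cycles_per_tap` are illustrative only.

```python
from math import ceil

ACCESSES_PER_TAP = 4   # instruction fetch, data read, coefficient read, data write

# Peak memory accesses per instruction cycle for the architectures discussed.
ARCHITECTURES = {
    "Von Neumann (single memory)": 1,
    "Harvard (two banks)": 2,
    "Harvard derivative (one program bank, two data banks)": 3,
    "Harvard with two banks of fast memory": 4,
}


def cycles_per_tap(accesses_per_cycle, accesses_needed=ACCESSES_PER_TAP):
    """Instruction cycles per FIR tap when memory bandwidth is the only limit."""
    return ceil(accesses_needed / accesses_per_cycle)


for name, rate in ARCHITECTURES.items():
    print(f"{name}: {cycles_per_tap(rate)} cycle(s) per tap")

# With modulo addressing the delay-line write disappears, so only three
# accesses remain and three banks are enough for one cycle per tap.
print("Three banks with modulo addressing:", cycles_per_tap(3, accesses_needed=3))
```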


Multiported Memories

Memories that allow multiple concurrent memory accesses over two or more independent sets of buses
  The most common type is dual-ported, but triple- and even quadruple-ported memories are sometimes used
  No need to arrange data among multiple, independent memory banks
  High cost
    Doubling the number of ports doubles the area

Some DSP processors combine a modified Harvard architecture with multiported memories
  Motorola DSP561xx


[Figure: Harvard Architecture with Dual-Ported Data Memory]


Considerations for Off-Chip Memories

Multiple off-chip memory banks
  Although the memory banks can usually be extended off-chip, multiple off-chip memory accesses cannot proceed in parallel, because a second set of external memory buses is usually omitted to keep processor cost down

Fast off-chip memories
  Off-chip delays may make it impractical to obtain two or more sequential memory accesses per instruction cycle, unless the instruction rate is relatively slow

Multiported off-chip memories
  Impractical for reasons similar to those for multiple banks
    More I/O pins
    Larger, more expensive package and larger die size


Specialized Memory Writes

A specialized mechanism that allows a write to data memory to proceed in parallel with an instruction read and a data read
  Can be used to shift data along the delay line in an FIR filter implementation

AT&T DSP16xx (Harvard architecture)
  Cannot provide both a data memory write and a data memory read in fewer than three instruction cycles
  However, under certain circumstances, an operand register value can be written to one memory location and the register then loaded with a value from another memory location

TI's fixed-point DSPs
  A value in memory can be loaded into the operand register and also copied to the next higher location in memory


Features for Reducing Memory Access Requirements

Special features designed to reduce the number of memory accesses required to perform certain kinds of operations
  Achieve performance equal to that of processors that provide more memory bandwidth
  Reduce processor cost
  May increase execution time or software development time

Such features
  Program caches
  Modulo addressing
  Algorithmic approaches


Program Caches

Small memory within the processor core that is used for storing program instructions
  Eliminates the need to access program memory when fetching certain instructions
  Frees a memory access to be used for a data read or write
  Speeds operation by avoiding delays associated with slow external (off-chip) program memory

Comparison with the caches of general-purpose microprocessors
  Much smaller and simpler
  Used only for program instructions, not for data
    Accommodating data involves a mechanism for updating both the cache and external memory, adding significantly to the complexity of the cache hardware

Single-instruction repeat buffer
  The simplest type of DSP processor cache
  A one-word instruction cache that is used with a special repeat instruction
  A single instruction that is to be executed multiple times (by a repeat instruction) is loaded into the buffer upon its first execution; immediately subsequent executions of the same instruction fetch it from the buffer
  TI TMS320C2x and TMS320C5x
  Does not help for algorithms in which a block of multiple instructions must be executed repeatedly as a group
    Extending the repeat buffer concept to accommodate more than one program instruction gives the multiword repeat buffer
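A rough model shows how much program-memory traffic a repeat buffer removes. It assumes, as a simplification, that a block either fits the buffer completely or gains nothing from it; `program_fetches` and its parameters are hypothetical names.

```python
def program_fetches(block_len, repeat_count, buffer_words):
    """Program-memory fetches needed to run a block repeat_count times.

    Simplified model: if the whole block fits in the repeat buffer, only
    the first pass fetches from program memory; otherwise every pass does.
    """
    if block_len <= buffer_words:
        return block_len
    return block_len * repeat_count


# One multiply-accumulate repeated 64 times with a one-word repeat buffer.
print(program_fetches(block_len=1, repeat_count=64, buffer_words=1))

# A four-instruction loop gains nothing from a one-word buffer ...
print(program_fetches(block_len=4, repeat_count=64, buffer_words=1))

# ... but fits easily in a 16-entry multiword repeat buffer.
print(program_fetches(block_len=4, repeat_count=64, buffer_words=16))
```

Every fetch avoided is a program-memory access that becomes available for a data read or write.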


Program Caches

Multiword repeat buffer
  AT&T DSP16xx
    The 16-entry repeat buffer is loaded when the programmer uses the repeat instruction to specify a block of 16 or fewer instructions to be repeated
  Useful for algorithms that contain loops consisting of a modest number of instructions (quite common in DSP)

Single-sector instruction cache
  A cache that stores some number of the most recently executed instructions
  Used to access a single, contiguous region of program memory
    When a change in program control flow accesses a location that is not already contained in the cache, the previous contents of the cache are invalidated and cannot be used
    But if program control flow jumps to the address of an instruction held in the cache, the instruction is fetched from the cache
  Zoran ZR3800x



Multiple-sector instruction cache
  A cache with multiple instruction sectors
  TI TMS320C3x (also has a single-instruction repeat buffer)
    Contains two sectors of 32 words each
    Each sector is used to store instructions from an independent 32-word region of program memory
    On a cache miss
      If the external address falls within one of the two current sectors, the fetched word is stored in the cache
      If the external address falls outside both sectors, one of the sectors is discarded (LRU algorithm) and a new sector is started

A variation in the Analog Devices ADSP-210xx cache
  Two-bank Harvard architecture
  The cache is loaded with instructions whose execution causes contention for program memory access
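The two-sector behavior described for the TMS320C3x can be sketched as a small simulation. The sector size and count come from the slide; aligning sectors to 32-word boundaries and the class name `SectorCache` are assumptions of this sketch, not documented details of the device.

```python
SECTOR_WORDS = 32
NUM_SECTORS = 2


class SectorCache:
    """Two-sector instruction cache with least-recently-used replacement."""

    def __init__(self):
        # Maps sector base address -> set of word offsets already loaded.
        # The dict is kept ordered from least to most recently used.
        self.sectors = {}
        self.hits = 0
        self.word_misses = 0      # address inside a current sector, word absent
        self.sector_misses = 0    # address outside both current sectors

    def fetch(self, addr):
        offset = addr % SECTOR_WORDS
        base = addr - offset
        if base in self.sectors:
            words = self.sectors.pop(base)     # re-inserted below as most recent
            if offset in words:
                self.hits += 1
            else:
                self.word_misses += 1
                words.add(offset)              # store the fetched word
            self.sectors[base] = words
        else:
            self.sector_misses += 1
            if len(self.sectors) == NUM_SECTORS:
                lru_base = next(iter(self.sectors))
                del self.sectors[lru_base]     # discard the least recently used sector
            self.sectors[base] = {offset}


cache = SectorCache()
for _ in range(3):                 # a ten-instruction loop executed three times
    for addr in range(10):
        cache.fetch(addr)
print(cache.hits, cache.word_misses, cache.sector_misses)

cache.fetch(100)                   # starts a second sector
cache.fetch(200)                   # outside both sectors: the LRU sector is replaced
print(sorted(cache.sectors))
```

In this run only the first pass through the loop misses; the later jump to address 200 evicts the sector that held the loop, because it was the least recently used.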


Features of Some DSP Caches

A measure of manual control over cache mechanisms
  Locking the contents of the cache at some execution point
  Disabling the cache altogether
  Allows the programmer to obtain better performance than would be achieved with the built-in cache management
  Helps software developers ensure that their code will meet critical real-time constraints

Motorola DSP96002
  Has an internal 1 Kword by 32-bit memory
    Configured either as an instruction cache or as program memory
    When the cache is enabled
      Organized into eight 128-word sectors, each of which can be individually locked or unlocked

The Motorola DSP563xx family includes a similar dual cache/memory construct


Modulo Addressing and Algorithmic Approaches

Modulo addressing
  Enables a processor to implement a delay line without having to move the data values in memory
  Data values are written to one memory location and remain there until they are no longer used
  The effect of data shifting along the delay line is simulated by manipulating memory pointers using modulo arithmetic
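In software, modulo addressing corresponds to a circular buffer. The sketch below is a plain Python model rather than a description of any particular processor's address generator; the class name `CircularFIR` and its fields are illustrative.

```python
class CircularFIR:
    """FIR filter whose delay line is a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buf = [0] * len(self.coeffs)
        self.newest = 0          # index of the most recent sample
        self.data_writes = 0     # delay-line writes performed so far

    def step(self, x):
        n = len(self.buf)
        # Move the pointer instead of the data: the slot holding the
        # oldest sample is reused for the new one.
        self.newest = (self.newest - 1) % n
        self.buf[self.newest] = x
        self.data_writes += 1
        acc = 0
        for k, h in enumerate(self.coeffs):
            acc += h * self.buf[(self.newest + k) % n]   # sample from k steps ago
        return acc


fir = CircularFIR([1, 2, 3])
print([fir.step(x) for x in [1, 0, 0, 0]])   # impulse response
print(fir.data_writes)                        # one write per sample, not one per tap
```

Each input sample is written once and stays in place, so the per-tap write to the delay line is no longer needed.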

Algorithmic approaches
  Algorithms that exploit data locality to reduce the number of memory accesses needed
  For example, instead of computing output samples one at a time, compute two output samples at a time
  Cost: larger code size and longer software development time


[Figure: Reducing Memory Access Requirements (Example from a Block FIR Filter)]
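One way to read the block FIR idea is that each coefficient and each data value fetched from memory is reused for two outputs. The sketch below counts only data and coefficient reads and assumes the reused data value stays in a register between iterations; it is not necessarily the exact scheme shown in the original figure, and the function names are hypothetical.

```python
def fir_one_output(x, h, n):
    """y[n] computed directly: two memory reads per multiply-accumulate."""
    acc, reads = 0, 0
    for k in range(len(h)):
        acc += h[k] * x[n - k]
        reads += 2                  # one coefficient read, one data read
    return acc, reads


def fir_two_outputs(x, h, n):
    """y[n] and y[n+1] computed together, reusing each value read."""
    x_prev = x[n + 1]               # register copy of x[n + 1 - k]
    reads = 1
    acc0 = acc1 = 0
    for k in range(len(h)):
        coeff = h[k]                # read once, used for both outputs
        x_cur = x[n - k]            # read once, reused in the next iteration
        reads += 2
        acc0 += coeff * x_cur       # contributes to y[n]
        acc1 += coeff * x_prev      # contributes to y[n + 1]
        x_prev = x_cur
    return acc0, acc1, reads


x = [1, 2, 3, 4, 5, 6]
h = [1, 2, 3]
y3, reads3 = fir_one_output(x, h, 3)
y4, reads4 = fir_one_output(x, h, 4)
b3, b4, block_reads = fir_two_outputs(x, h, 3)
print((y3, y4), reads3 + reads4)    # two separate passes
print((b3, b4), block_reads)        # one combined pass
```

Both versions produce the same outputs, while the combined pass needs roughly half as many reads; the price is a longer and more intricate inner loop.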


Wait States (Due to Memory Accesses)

States in which the processor cannot execute its program because it is waiting for access to memory

Conflict (contention)
  Occurs when the processor attempts to make multiple simultaneous accesses to a memory that cannot accommodate them
    For example, in pipelined DSPs, when one instruction attempts to access program memory for its data while another instruction is being fetched
  Almost all processors recognize conflicts and automatically insert the conflict wait states needed
    Exception (AT&T DSP16xx family)
      Attempting to fetch words from both external program memory and external data memory in one instruction cycle results in a correct program word fetch, but the fetched data is invalid


Wait States (Due to Memory Accesses)

Slow memory
  Off-chip memory too slow to support a complete memory access within one processor instruction cycle
  The processor is configured to insert programmed wait states during external memory accesses
    Wait states are configured by the programmer
    Some processors can be programmed to use different numbers of programmed wait states

Cases where it is not possible to predict in advance precisely how many wait states will be required to access external memory
  Bus sharing, DRAM (due to refresh), and I/O
  Need externally requested wait states
    TI TMS320C5x READY pin
      Signals the processor that it must wait before continuing with an external memory access

Length of wait states
  From one quarter of an instruction cycle (AT&T DSP32C) to a full instruction cycle (as on most processors)
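How many programmed wait states a slow memory needs can be estimated with a ceiling division. This is a simplified model, assuming that an access with zero wait states must finish within one instruction cycle and that each wait state extends the access by a fixed amount; real devices specify setup and hold margins that the model ignores. The function name and the timing values are illustrative.

```python
def wait_states_needed(access_ps, cycle_ps, wait_state_ps):
    """Smallest number of wait states that covers the memory access time.

    All times are in picoseconds so that the arithmetic stays in integers.
    """
    extra = max(0, access_ps - cycle_ps)
    return -(-extra // wait_state_ps)       # ceiling division


CYCLE = 25_000     # 25 ns instruction cycle (illustrative value)
ACCESS = 60_000    # 60 ns external memory access time (illustrative value)

full = wait_states_needed(ACCESS, CYCLE, CYCLE)
quarter = wait_states_needed(ACCESS, CYCLE, CYCLE // 4)

print(full, CYCLE + full * CYCLE)                  # full-cycle wait states
print(quarter, CYCLE + quarter * (CYCLE // 4))     # quarter-cycle wait states
```

With finer-grained wait states the total access time can sit closer to what the memory actually requires.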


ROM

On-chip ROM
  On-chip read-only memory that stores the application program and constant data
    For low-cost embedded applications such as consumer electronics and telecommunication equipment
  Types of internal memory
    Versions with internal RAM
      For prototyping and low-volume production
    Versions with factory-programmed ROM
      For large-volume production
    Versions with one-time-programmable ROM (PROM)
      For prototyping or low- to medium-volume production

External ROM
  For applications requiring more ROM than is provided on-chip
  Multiple ROM chips are used (to match the width of the program word)
  A single byte-wide external ROM may be used when
    The processor supports an external bus narrower than the program word
    The processor can construct instruction words by concatenating bytes
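Building instruction words from a byte-wide ROM amounts to concatenating consecutive bytes. The sketch assumes 24-bit instruction words stored most significant byte first; both the word width and the byte order are assumptions for illustration and vary between processors.

```python
def words_from_byte_rom(rom, bytes_per_word=3):
    """Concatenate consecutive ROM bytes into instruction words."""
    if len(rom) % bytes_per_word:
        raise ValueError("ROM image length is not a whole number of words")
    return [
        int.from_bytes(rom[i:i + bytes_per_word], "big")   # first byte is most significant
        for i in range(0, len(rom), bytes_per_word)
    ]


rom_image = bytes([0x12, 0x34, 0x56, 0xAB, 0xCD, 0xEF])
for word in words_from_byte_rom(rom_image):
    print(f"{word:06X}")
```

Each instruction word then costs several external accesses, so this arrangement trades fetch speed for a cheaper memory system.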


External Memory Interfaces

Three main features by which external memory interfaces are differentiated
  Number of memory ports
  Sophistication and flexibility
  Timing requirements

Number of memory ports
  Most DSP processors have a single external memory port, even though they have multiple independent on-chip memory banks
    The external memory port is used to extend any of the internal memory banks off-chip
  Some DSP processors provide multiple off-chip memory ports
    Higher cost
    ADSP-21020, TMS320C30, TMS320C40, and DSP96002


[Figure: Example External Memory Interface with a Single Memory Port]



Sophistication and flexibility
  Some interfaces are relatively simple and straightforward, with only a handful of control pins
  Others are much more complex, providing the flexibility to interface with a wider range of external memory devices and buses without special interfacing hardware
  Such features
    The flexibility and granularity of programmable wait states, the inclusion of a wait pin, bus request and bus grant pins (for multiprocessor support), and support for (page-mode) DRAM



Timing requirements
  Timing specifications can vary significantly among processors
  Timing specifications affect system cost (due to memory cost) and hardware design flexibility

Timing specifications in practical processors (from the lecturer's experience)
  External memory space is divided into several sections
  Each section has its own memory type
  Timing is variable (programmed by setting a specific register field) and is specified independently for each section


Support for External Memory Accesses

The textbook DSPFundamentals presents four types of support
  Manual caching
  Multiprocessor support in external memory interfaces
  Support for DRAM
  Direct memory access (DMA)

Manual caching
  Improves performance by explicitly copying sections of program code from slower or more congested memory to faster or less congested memory
    Useful when the processor has no program cache
  For example, a section of often-used program code stored in off-chip ROM is copied to faster on-chip RAM, either at system start-up or when that particular section is needed
    In practice, the system and application programs of embedded systems are stored in external off-chip ROM


Multiprocessor Support in External Memory Interfaces

Specific features in external memory interfaces that simplify the design and enhance the performance of multiprocessor systems

Provision of two external memory ports
  One port for a local memory, the other for a shared memory
  Motorola DSP96002

Mechanism for bus arbitration on the shared bus
  Negotiates control of the bus and prevents unauthorized accesses by the processors sharing the bus
  There are significant differences in sophistication and flexibility
    In some cases, arbitration is achieved simply by connecting together the appropriate pins of the processors
    In other cases, extra software or external bus arbitration hardware may be required


Examples of Shared Bus/Memory

Motorola DSP5600x
  Two pins can be configured to act as bus request and bus grant
    When an external bus arbiter wants a particular DSP processor to relinquish the shared bus, it asserts that processor's bus request input
    The processor then completes any external access in progress and relinquishes the bus, acknowledging with bus grant
  The DSP processor can continue to execute its program as long as no access to the shared bus is required
    If an access to the shared bus is required, the processor waits until the bus request signal has been deasserted
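The request/grant behavior can be modeled cycle by cycle. In this sketch the processor grants the bus in the same cycle that the request arrives (any access in progress is assumed to finish within that cycle), and each instruction either needs the shared bus or does not; the class and function names are hypothetical and do not correspond to DSP5600x registers or pins.

```python
class SharedBusProcessor:
    """Processor that gives up the shared bus while bus request is asserted."""

    def __init__(self):
        self.bus_request = False   # input driven by the external arbiter
        self.bus_grant = False     # output acknowledging that the bus is released
        self.executed = 0
        self.stalled = 0

    def cycle(self, needs_shared_bus):
        """Run one cycle; return True if the current instruction completed."""
        self.bus_grant = self.bus_request
        if needs_shared_bus and self.bus_grant:
            self.stalled += 1      # wait until bus request is deasserted
            return False
        self.executed += 1         # internal work continues regardless
        return True


def run(program, request_cycles):
    """program: list of booleans, True where the instruction uses the shared bus."""
    cpu = SharedBusProcessor()
    pc = t = 0
    while pc < len(program):
        cpu.bus_request = t in request_cycles
        if cpu.cycle(program[pc]):
            pc += 1
        t += 1
    return cpu, t


cpu, total = run([False, True, False, True], request_cycles={1, 2, 3})
print(cpu.executed, cpu.stalled, total)

cpu, total = run([False] * 4, request_cycles={0, 1, 2, 3})
print(cpu.executed, cpu.stalled, total)
```

The second run never stalls: a program that stays in on-chip memory is unaffected by another device holding the shared bus.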



TI TMS320C5x
  Provides the equivalent of bus request and bus grant signals (called HOLD* and HOLDA*)
  The processor allows an external device to access its on-chip memory
    The external device first asserts the HOLD* input
    The processor to be accessed responds by asserting HOLDA*
    The external device asserts BR* (indicating that it wishes to access the on-chip memory)
    The processor responds by asserting IAQ*
    The external device then reads and writes the processor's on-chip memory by driving the processor's address, data, and read/write lines
    When finished, the external device deasserts HOLD* and BR*
  Interprocessor data can move without an external shared memory



Bus locking
  Simplifies the use of shared (lock) variables in shared memory
  Allows a processor to read a shared variable from memory, modify it, and write the new value back to memory without intervening stores to that variable by another processor
    Called atomic test-and-set or atomic load-store
  Prevents multiple processes from executing a critical section of program code simultaneously
  TI TMS320C3x and TMS320C4x
    Provide special instructions (such as SWAP) and hardware support for bus locking (lock signals)
    Called interlocked operations by Texas Instruments

Alternative to bus locking
  Load linked and store conditional
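Why the read-modify-write sequence must be atomic can be seen in a deterministic simulation of two processors sharing a bus. The scheduling model below (one bus slot per read or write, with a stalled processor retrying later) is an assumption made for illustration and is not how any specific device arbitrates.

```python
def run_increments(schedule, use_bus_lock):
    """Each processor reads a shared counter, adds one, and writes it back.

    schedule lists which processor gets each bus slot; every processor
    must appear exactly twice (one slot for its read, one for its write).
    """
    counter = 0            # the shared variable in shared memory
    lock_owner = None      # processor currently asserting the bus lock
    operand = {}           # per-processor register holding the value read
    pending = list(schedule)
    while pending:
        cpu = pending.pop(0)
        if use_bus_lock and lock_owner not in (None, cpu):
            pending.append(cpu)        # bus is locked: retry in a later slot
            continue
        if cpu not in operand:         # first slot: read the shared variable
            operand[cpu] = counter
            if use_bus_lock:
                lock_owner = cpu       # hold the bus until the write completes
        else:                          # second slot: write the new value back
            counter = operand[cpu] + 1
            lock_owner = None
    return counter


interleaved = ["P0", "P1", "P0", "P1"]
print(run_increments(interleaved, use_bus_lock=False))   # one update is lost
print(run_increments(interleaved, use_bus_lock=True))    # both updates survive
```

Without the lock, the second processor reads the counter before the first has written it back, so the two increments collapse into one.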


Examples of Shared Bus/Memory

ADSP-2106x
  On-chip bus arbitration logic that allows direct interconnection of up to six ADSP-2106x devices
    No need for special software or external hardware
  Allows one DSP processor on a shared bus to access another processor's on-chip memory
    Like the TI TMS320C5x family

Communication ports
  Intended for interprocessor communication between DSPs of the same type
  Analog Devices ADSP-2106x
    Six four-bit comm ports
  TI TMS320C4x
    Six (TMS320C40) or four (TMS320C44) eight-bit comm ports


Support for Dynamic Memory (Obsolete Subject)

On-chip RAM
  SRAM or ROM

External memory
  In most cases, SRAM

SRAM compared with DRAM
  Faster
  No refresh
  Simpler interface
  Higher cost and larger area

Increasing interest in using DRAM in DSP systems
  Faster DRAM types
    Page mode, static column mode, and nibble mode DRAM


Support for Dynamic Memory (Obsolete Subject)

Support for faster types of memory
  Memory page boundary detection capabilities
    When the processor detects that a memory access has crossed a page boundary, it asserts a special output pin
    An external DRAM controller uses this pin to control the memory and to signal that the processor must insert wait states
    Motorola DSP96002, Analog Devices ADSP-2100x, and TI TMS320C3x and TMS320C4x
  Internal DRAM controller
    Motorola DSP56004 and DSP56007, for page-mode DRAM

Current support for memory
  On-chip single/dual-rate SDRAM controller
  On-chip flash memory controller
  On-chip flash memory


Direct Memory Access (DMA)

A technique whereby data can be transferred to or from the processor's memory without involvement of the processor itself
  From an I/O device to memory or vice versa, and memory-to-memory

Need for a separate DMA controller (bus master)
  Or the I/O device controller must act as a bus master
  The processor sets the control information in the DMA controller
    Start memory address, number of data words to be transferred, transfer direction, and source or destination peripheral

External DMA controller
  Uses the bus request/grant mechanism
  Data transfers between off-chip I/O and memory
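The control information that the processor hands to a DMA controller can be pictured as a small record. The sketch below models a peripheral as a simple FIFO and uses hypothetical field names; real controllers expose this information as device-specific registers.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DmaControl:
    start_address: int     # first memory address of the transfer
    word_count: int        # number of data words to be transferred
    to_memory: bool        # transfer direction
    peripheral: deque      # source or destination peripheral, modeled as a FIFO


def run_dma(memory, ctrl):
    """Move ctrl.word_count words without any work by the processor model."""
    for offset in range(ctrl.word_count):
        addr = ctrl.start_address + offset
        if ctrl.to_memory:
            memory[addr] = ctrl.peripheral.popleft()
        else:
            ctrl.peripheral.append(memory[addr])


memory = [0] * 8
adc = deque([11, 22, 33])          # input peripheral with three samples waiting
run_dma(memory, DmaControl(start_address=2, word_count=3, to_memory=True, peripheral=adc))
print(memory)

dac = deque()                      # output peripheral
run_dma(memory, DmaControl(start_address=2, word_count=2, to_memory=False, peripheral=dac))
print(list(dac))
```

After the processor has filled in the record, the transfer loop runs entirely inside the controller model.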


Direct Memory Access (DMA)

On-chip DMA controller
  Accesses on-/off-chip I/O and memory
  Problem of memory contention with the DSP core
    In some cases, memory bandwidth may be large enough to allow DMA transfers to occur in parallel with normal program execution
    For example, the TI TMS320C4x provides on-chip memory and on-chip DMA address and data buses

Multiple channels
  Manage multiple DMA transfers in parallel
    The DMA controller has a separate set of control registers for each channel
    Do not support simultaneous transfers (I think)
  TMS320C4x (six channels), ADSP-2106x (ten), DSP96002 (two), for memory-memory or memory-peripheral transfers

Cycle-stealing
  Because of limited memory bandwidth, the currently executing instruction is forced to wait one cycle during a DMA transfer
  DSP3210 and ADSP-21xx
