Memory Architecture · soc.yonsei.ac.kr/class/material/dsp/memoryarchitecture.pdf · 2017-03-06 · dsp...
DSP VLSI Design
Memory Architecture
Byungin Moon
Yonsei University
1YONSEI UNIVERSITYDSP VLSI Design

Outline
  Memory architectures
    Von Neumann architecture and Harvard architecture
    Fast memories and multiported memories
  Features for reducing memory access requirements
    Program caches
    Modulo addressing and algorithmic approaches
  Wait states
  ROM
  External memory interfaces
    Multiprocessor support
    Dynamic memory (DRAM)
    Direct memory access (DMA)

Need for High-Speed Memory Architecture
  A powerful data path is, at best, only part of a high-performance processor
    The processor also requires the ability to move large amounts of data to and from memory quickly
      The organization of memory and its interconnection with the processor’s data path are critical factors; this is called the memory architecture
  Example: FIR filter
    DSP processor data paths are designed to perform a multiply-accumulate operation in one cycle
      Only one cycle per filter tap
    This requires multiple memory accesses per cycle

FIR Filter as an Example Needing Capabilities of Multiple Memory Accesses per Cycle

Attention !!!
  The textbook DSPFundamentals claims that the number of memory accesses needed for one tap is four
    Fetch the multiply-accumulate instruction
    Read the appropriate data value from the delay line
    Read the appropriate coefficient value
    Write the data value to the next location in the delay line to shift data through the delay line
  This count assumes that a circular buffer is not used
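
The four accesses per tap listed above can be counted in a small simulation. This is only a sketch: the CountingMemory class, the function name, and the sample values are hypothetical, and the delay line is given one spare slot so that every tap performs its shifting write.

```python
class CountingMemory:
    """A word-addressed memory that counts every read and write."""

    def __init__(self, words):
        self.words = list(words)
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1
        return self.words[addr]

    def write(self, addr, value):
        self.accesses += 1
        self.words[addr] = value


def fir_output_shifting(delay, coeffs, num_taps):
    """Compute one FIR output while shifting the delay line (no circular buffer).

    Taps are processed from the oldest sample to the newest so that each
    sample can be copied to the next location after it has been used.
    """
    instruction_fetches = 0
    acc = 0
    for k in range(num_taps - 1, -1, -1):
        instruction_fetches += 1      # 1. fetch the multiply-accumulate instruction
        x = delay.read(k)             # 2. read the data value from the delay line
        h = coeffs.read(k)            # 3. read the coefficient value
        acc += h * x
        delay.write(k + 1, x)         # 4. write the data value to the next location
    return acc, instruction_fetches


coeffs = CountingMemory([1, 2, 3])
delay = CountingMemory([4, 5, 6, 0])  # newest sample first, plus one spare slot
y, fetches = fir_output_shifting(delay, coeffs, 3)
total = fetches + delay.accesses + coeffs.accesses
print(y, total / 3)                   # prints 32 4.0
```

After the call, the delay line holds [4, 4, 5, 6]: every sample has moved one location, and slot 0 is ready to receive the next input sample.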

Von Neumann Architecture (Revisited)
  A single set of address and data lines
    Both program instructions and data are stored in a single memory
  Common among non-DSP processors
    However, most current 32-bit microprocessors have a Harvard architecture
  Strength
    Low cost
  Weakness
    The processor can make only one access to memory during each instruction cycle
      It takes at least four cycles to complete a multiply-accumulate for one tap of an FIR filter

Von Neumann Architecture

Harvard Architecture (Revisited)
  Two independent memory banks and two independent sets of buses (addresses and data)
    Original
      One bank for instructions, one bank for data
    Modified
      One bank for instructions and data, one bank for data
  Two memory accesses per instruction cycle
    Complete the four memory accesses required for one tap of an FIR filter in two instruction cycles
    ADSP-21xx and AT&T DSP16xx (writes take two cycles)
  More powerful derivatives
    One program bank and two data banks (X and Y)
    DSP Group Pine and Oak, Zilog Z893xx, SGS-Thomson D950-CORE, and Motorola DSP5600x, DSP56xxx, and DSP96002
    A shortage of one bank (four minus three) can be overcome by modulo addressing
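
The cycle counts quoted for the different architectures follow from simple arithmetic. The sketch below assumes an idealized model in which the accesses of one tap can be packed perfectly into the available memory accesses per cycle, with no conflicts or multi-cycle writes; the function name is hypothetical.

```python
import math


def cycles_per_tap(accesses_per_tap, accesses_per_cycle):
    """Idealized instruction cycles needed for one FIR tap."""
    return math.ceil(accesses_per_tap / accesses_per_cycle)


von_neumann = cycles_per_tap(4, 1)          # one access per cycle
harvard = cycles_per_tap(4, 2)              # two banks, two accesses per cycle
three_banks = cycles_per_tap(4, 3)          # one program bank and two data banks
three_banks_modulo = cycles_per_tap(3, 3)   # modulo addressing removes the shifting write

print(von_neumann, harvard, three_banks, three_banks_modulo)  # prints 4 2 2 1
```

In this model, three banks alone do not reach one cycle per tap; removing the delay-line write through modulo addressing is what closes the gap.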

Harvard Architecture

Fast Memories
  On-chip memories that support multiple, sequential accesses per instruction cycle over a single set of buses
    Yield better performance when combined with a Harvard architecture
  Harvard architecture with two banks of fast memory
    Can complete four memory accesses per instruction cycle
  AT&T DSP32xx
    Von Neumann with multiple-access memories
    Can complete four sequential accesses to on-chip memory in a single instruction cycle
  Zoran’s ZR3800x
    Harvard architecture with multiple-access memory
    One single-access program memory bank with one dual-access data memory bank

Multiported Memories
  Memories that allow multiple concurrent memory accesses over two or more independent sets of buses
    The most common type is dual-ported, but triple- and even quadruple-ported memories are sometimes used
    No need to arrange data among multiple, independent memory banks
    High cost
      Doubling the number of ports doubles the area
  Some DSP processors combine a modified Harvard architecture with the use of multiported memories
    Motorola DSP561xx

Harvard Architecture with Dual-Ported Data Memory

Consideration for Off-Chip Memories
  Multiple off-chip memory banks
    Although the memory banks can usually be extended off-chip, multiple off-chip memory accesses cannot proceed in parallel (due to the lack of a second set of external memory buses, which in turn is due to processor cost)
  Fast off-chip memories
    Off-chip delays may make it impractical to obtain two or more sequential memory accesses per instruction cycle, unless the instruction rate is relatively slow
  Multiported off-chip memories
    Impractical for reasons similar to those for multiple banks
    More I/O pins
      Larger, more expensive package and larger die size

Specialized Memory Writes
  A specialized mechanism to allow a write to data memory to proceed in parallel with an instruction read and a data read
    Can be used to shift data along the delay line in an FIR filter implementation
  AT&T DSP16xx (Harvard architecture)
    Cannot provide both a data memory write and a data memory read in fewer than three instruction cycles
    However, under certain circumstances, an operand register value can be written to one memory location and the register then loaded with a value from another memory location
  TI’s fixed-point DSPs
    A value in memory can be loaded into the operand register and also copied to the next higher location in memory

Features for Reducing Memory Access Requirements
  Special features designed to reduce the number of accesses required to perform certain kinds of operations
    Achieve performance equal to that of other processors that provide more memory bandwidth
    Reduce processor cost
    May increase execution time or software development time
  Such features
    Program caches
    Modulo addressing
    Algorithmic approaches

Program Caches
  Small memory within the processor core that is used for storing program instructions
    Eliminates the need to access program memory when fetching certain instructions
    Frees a memory access to be used for a data read or write
    Speeds operation by avoiding delays associated with slow external (off-chip) program memory
  Comparison with the caches of general-purpose microprocessors
    Much smaller and simpler
    Used only for program instructions, not for data
      Accommodating data involves a mechanism for updating both the cache and external memory, adding significantly to the complexity of the cache hardware

Program Caches
  Single-instruction repeat buffer
    The simplest type of DSP processor cache
    A one-word instruction cache that is used with a special repeat instruction
    A single instruction that is to be executed multiple times (by a repeat instruction) is loaded into the buffer upon its first execution; immediately subsequent executions of the same instruction fetch it from the cache
    TI TMS320C2x and TMS320C5x
    Does not help for algorithms in which a block of multiple instructions must be executed repeatedly as a group
      Extend the repeat buffer concept to accommodate more than one program instruction -> multiword repeat buffer

Program Caches
  Multiword repeat buffer
    AT&T DSP16xx
      The 16-entry repeat buffer is loaded when the programmer specifies a block of code of 16 or fewer instructions to be repeated using the repeat instruction
    Useful for algorithms that contain loops consisting of a modest number of instructions (quite common in DSP)
  Single-sector instruction cache
    A cache that stores some number of the most recent instructions that have been executed
    Used to access a single, contiguous region of program memory
      When a program control flow change accesses a location that is not already contained in the cache, the previous contents of the cache are invalidated and cannot be used
      But if the program control flow jumps to the address of an instruction in the cache, the cache is used for instructions
    Zoran ZR3800x

Program Caches
  Multiple-sector instruction cache
    A cache with multiple instruction sectors
    TI TMS320C3x (also has a single-instruction repeat buffer)
      Contains two sectors of 32 words each
      Each sector is used to store instructions from an independent 32-word region of program memory
      On a cache miss
        If the external address falls within one of the two current sectors, the word is stored in the cache
        If the external address falls outside the two sectors, one of the sectors is discarded and a new sector is made (LRU algorithm)
    A variation in the Analog Devices ADSP-210xx cache
      Two-bank Harvard architecture
      The cache is loaded with instructions whose execution causes contention for program memory access
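
The two-sector behavior described for the TMS320C3x can be modeled in a few lines. This is a simplified sketch, not the actual hardware: it assumes sectors are aligned to 32-word boundaries and tracks only which words are present; the class and method names are hypothetical.

```python
SECTOR_WORDS = 32


class TwoSectorCache:
    """Sketch of a multiple-sector instruction cache with LRU sector replacement."""

    def __init__(self, num_sectors=2):
        self.num_sectors = num_sectors
        self.sectors = []   # (sector_base, cached_offsets); most recently used last
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Fetch one instruction word; return True on a cache hit."""
        offset = addr % SECTOR_WORDS
        base = addr - offset
        for i, (sector_base, present) in enumerate(self.sectors):
            if sector_base == base:
                self.sectors.append(self.sectors.pop(i))   # mark as most recently used
                if offset in present:
                    self.hits += 1
                    return True
                present.add(offset)    # miss inside a current sector: store the word
                self.misses += 1
                return False
        if len(self.sectors) == self.num_sectors:
            self.sectors.pop(0)        # miss outside both sectors: discard the LRU sector
        self.sectors.append((base, {offset}))
        self.misses += 1
        return False


cache = TwoSectorCache()
for _ in range(10):                    # a loop of 8 instructions executed 10 times
    for addr in range(100, 108):
        cache.fetch(addr)
print(cache.hits, cache.misses)        # prints 72 8
```

Only the first pass through the loop goes to program memory; the remaining nine passes are served from the cache, which frees those memory accesses for data.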

Features of Some DSP Caches
  Measure of manual control over cache mechanisms
    Locking the contents of the cache at some execution point
    Disabling the cache altogether
    Allows the programmer to obtain better performance than would be achieved with the built-in cache management
    Helps software developers ensure that their code will meet critical real-time constraints
  Motorola DSP96002
    Has an internal 1 Kword by 32-bit memory
      Configured either as instruction cache or as program memory
    When the cache is enabled
      Organized into eight 128-word sectors, each of which can be individually locked or unlocked
  Motorola DSP563xx family includes a similar dual cache/memory construct

Modulo Addressing and Algorithmic Approaches
  Modulo addressing
    Enables a processor to implement a delay line without having to move the data values in memory
    Data values are written to one memory location and remain there until they are no longer used
    The effect of data shifting along the delay line is simulated by manipulating memory pointers using modulo arithmetic
  Algorithmic approaches
    Algorithms that exploit data locality to reduce the number of memory accesses needed
    For example, instead of computing output samples one at a time, compute two output samples at a time
    Larger code size, longer software development time
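
Modulo addressing can be illustrated with a software circular buffer. The sketch below assumes a buffer whose length equals the number of taps; the class and function names are hypothetical. On a DSP the wrap-around is done by the address generation hardware, whereas here it is an explicit modulo operation.

```python
class CircularDelayLine:
    """Delay line in which samples stay in place and only a pointer moves."""

    def __init__(self, num_taps):
        self.buf = [0] * num_taps
        self.newest = 0

    def push(self, sample):
        # Step the pointer back with wrap-around and overwrite the oldest sample.
        self.newest = (self.newest - 1) % len(self.buf)
        self.buf[self.newest] = sample

    def tap(self, k):
        # Sample delayed by k: k locations after the newest, modulo the buffer size.
        return self.buf[(self.newest + k) % len(self.buf)]


def fir(delay, coeffs):
    return sum(h * delay.tap(k) for k, h in enumerate(coeffs))


delay = CircularDelayLine(3)
outputs = []
for sample in [4, 5, 6, 7]:
    delay.push(sample)                 # one write per input sample, no shifting
    outputs.append(fir(delay, [1, 2, 3]))
print(outputs)                         # prints [4, 13, 28, 34]
```

Each input sample costs a single write, so the per-tap write that shifts data through the delay line disappears.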

Reducing Memory Access Requirements (Example from a Block FIR Filter)
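
The idea of computing two output samples at a time can be sketched as follows. This is an illustration under assumptions, not the figure from the slide: only data and coefficient reads are counted, values held in local variables stand in for registers, and the function names are hypothetical.

```python
def fir_one_at_a_time(x, h, n):
    """Return (y[n], memory reads) using the direct sum over all taps."""
    acc, reads = 0, 0
    for k in range(len(h)):
        acc += h[k] * x[n - k]     # one coefficient read and one data read
        reads += 2
    return acc, reads


def fir_two_at_a_time(x, h, n):
    """Return (y[n], y[n+1], memory reads), reusing each fetched value twice."""
    acc0 = acc1 = 0
    x_newer = x[n + 1]             # kept in a register between taps
    reads = 1
    for k in range(len(h)):
        hk = h[k]                  # coefficient read, used for both outputs
        x_older = x[n - k]         # data read, used now and again at the next tap
        reads += 2
        acc1 += hk * x_newer       # y[n+1] needs x[n+1-k]
        acc0 += hk * x_older       # y[n] needs x[n-k]
        x_newer = x_older
    return acc0, acc1, reads


x = [1, 2, 3, 4, 5, 6]
h = [1, 2, 3]
y3, reads3 = fir_one_at_a_time(x, h, 3)
y4, reads4 = fir_one_at_a_time(x, h, 4)
block = fir_two_at_a_time(x, h, 3)
print(y3, y4, reads3 + reads4)     # prints 16 22 12
print(block)                       # prints (16, 22, 7)
```

For N taps the direct form needs 4N reads for two outputs, while the block form needs 2N + 1, at the price of a longer inner loop.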

Wait States (Due to Memory Accesses)
  States in which the processor cannot execute its program because it is waiting for access to memory
  Conflict (contention)
    Occurs when the processor attempts to make multiple simultaneous accesses to a memory that cannot accommodate multiple accesses
      For example, in pipelined DSPs, when one instruction attempts to access the program memory for its data while another instruction is being fetched at the same time
    Almost all processors recognize the conflict and automatically insert the conflict wait states needed
    Exceptions (AT&T DSP16xx family)
      Attempting to fetch words from both external program and data memory in one instruction cycle results in a correct program word fetch, but the fetched data is invalid

Wait States (Due to Memory Accesses)
  Slow memory
    Off-chip memory that is too slow to support a complete memory access within one processor instruction cycle
    The processor is configured to insert programmed wait states during external memory accesses
    Wait states are configured by the programmer
      Some processors can be programmed to use different numbers of programmed wait states
  Cases where it is not possible to predict in advance precisely how many wait states will be required to access external memory
    Bus sharing, DRAM (due to refresh), and I/O
    Need externally requested wait states
    TI TMS320C5x READY pin
      Signals the processor that it must wait before continuing with an external memory access
  Length of wait states
    From one quarter of an instruction cycle (AT&T DSP32C) to a full instruction cycle (as on most processors)
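
The number of programmed wait states for a slow memory can be estimated with a short calculation. The sketch below assumes that an access without wait states must fit within one instruction cycle and that each wait state extends the access by a fixed fraction of a cycle; the function name and the timing values are hypothetical.

```python
import math
from fractions import Fraction


def programmed_wait_states(access_time_ns, cycle_time_ns, wait_length=Fraction(1)):
    """Smallest number of wait states that covers the memory access time.

    wait_length is the length of one wait state in instruction cycles,
    for example Fraction(1, 4) for quarter-cycle wait states.
    """
    extra = Fraction(access_time_ns) - Fraction(cycle_time_ns)
    if extra <= 0:
        return 0                   # memory is fast enough, no wait states needed
    return math.ceil(extra / (wait_length * cycle_time_ns))


full = programmed_wait_states(70, 25)                      # full-cycle wait states
quarter = programmed_wait_states(70, 25, Fraction(1, 4))   # quarter-cycle wait states
print(full, quarter)               # prints 2 8
```

With full-cycle wait states the 70 ns access occupies 75 ns; with quarter-cycle wait states it occupies exactly 75 ns as well in this example, but finer granularity generally wastes less time.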

ROM
  On-chip ROM
    On-chip read-only memory to store the application program and constant data
      For low-cost, embedded applications like consumer electronics and telecommunication equipment
  Types of internal memory
    Versions with internal RAM
      Prototyping and low-volume production
    Versions with factory-programmed ROM
      Large-volume production
    Versions with one-time-programmable ROM (PROM)
      Prototyping or low- or medium-volume production
  External ROM
    For applications requiring more ROM than is provided on-chip
    Multiple ROM chips are used (to match the width of the program word)
    A single byte-wide external ROM may be used when
      The processor supports an external bus narrower than the program word
      The processor can construct instruction words by concatenating bytes
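
Constructing instruction words from a byte-wide ROM amounts to concatenating consecutive bytes. The sketch below assumes a 24-bit program word stored most significant byte first; the word width, the byte order, and the function name are assumptions for illustration only.

```python
def fetch_word(rom, word_addr, bytes_per_word=3):
    """Build one instruction word from consecutive bytes of a byte-wide ROM."""
    start = word_addr * bytes_per_word
    chunk = rom[start:start + bytes_per_word]
    if len(chunk) != bytes_per_word:
        raise IndexError("word address outside ROM")
    return int.from_bytes(chunk, "big")    # concatenate, most significant byte first


rom = bytes.fromhex("0a0b0c112233")
word0 = fetch_word(rom, 0)
word1 = fetch_word(rom, 1)
print(hex(word0), hex(word1))              # prints 0xa0b0c 0x112233
```

Each instruction word costs three byte-wide accesses in this model, which is why this arrangement is usually paired with copying code to faster memory.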

External Memory Interfaces
  Three main features by which external memory interfaces are differentiated
    Number of memory ports
    Sophistication and flexibility
    Timing requirements
  Number of memory ports
    Most DSP processors have a single external memory port, even though they have multiple independent on-chip memory banks
      The external memory port is used to extend any of the internal memory banks off-chip
    Some DSP processors provide multiple off-chip memory ports
      Higher cost
      ADSP-21020, TMS320C30, TMS320C40, and DSP96002

Example External Memory Interface with a Single Memory Port

External Memory Interfaces
  Sophistication and flexibility
    Some are relatively simple and straightforward, with only a handful of control pins
    Others are much more complex, providing the flexibility to interface with a wider range of external memory devices and buses without special interfacing hardware
    Such features
      The flexibility and granularity of programmable wait states, the inclusion of a wait pin, bus request and bus grant pins (for multiprocessor support), and support for (page-mode) DRAM

External Memory Interfaces
  Timing requirements
    Timing specifications can vary significantly among processors
    Timing specifications affect system cost (due to memory cost) and hardware design flexibility
  Timing specifications in practical processors (from lecturer’s experience)
    External memory space is divided into several sections
    Each section has its own memory type
    Timing is variable (programmed by setting a specific register field) and is specified independently for each section
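
Per-section timing can be pictured as a small table that maps address ranges to wait-state settings. The section names, address ranges, and wait-state values below are entirely hypothetical and do not describe any particular processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySection:
    name: str
    start: int          # first address of the section
    end: int            # last address of the section (inclusive)
    wait_states: int    # value that would be programmed into the section's register field


SECTIONS = [
    MemorySection("boot ROM", 0x0000, 0x3FFF, 3),
    MemorySection("SRAM", 0x4000, 0x7FFF, 0),
    MemorySection("peripheral I/O", 0x8000, 0xFFFF, 5),
]


def access_cycles(addr, sections=SECTIONS):
    """Cycles for one external access: one base cycle plus the section's wait states."""
    for section in sections:
        if section.start <= addr <= section.end:
            return 1 + section.wait_states
    raise ValueError(f"address {addr:#x} is not mapped")


print(access_cycles(0x0010), access_cycles(0x4000), access_cycles(0xFFFF))  # prints 4 1 6
```

Because each section carries its own setting, a slow ROM does not force wait states onto accesses to fast SRAM.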

Support for External Memory Accesses
  The textbook DSPFundamentals presents four types of support
    Manual caching
    Multiprocessor support in external memory interfaces
    Support for DRAM
    Direct memory access (DMA)
  Manual caching
    Improves performance by explicitly copying sections of program code from slower or more congested memory to faster or less congested memory
      Useful when processors have no program caches
    For example, a section of often-used program code stored in off-chip ROM is copied to faster on-chip RAM, either at system start-up or when that particular section is needed
      In practice, the system/application programs of embedded systems are stored in an external off-chip ROM

Multiprocessor Support in External Memory Interfaces
  Specific features in external memory interfaces to simplify the design and enhance the performance of multiprocessor systems
  Provision of two external memory ports
    One port for a local memory, the other for a shared memory
    Motorola DSP96002
  Mechanism for bus arbitration on the shared bus
    Negotiates control of the bus and prevents unauthorized accesses by processors sharing the bus
    There are significant differences in sophistication and flexibility
      In some cases, arbitration is achieved simply by connecting together the appropriate pins of the processors
      In other cases, extra software or external bus arbitration hardware may be required

Examples of Shared Bus/Memory
  Motorola DSP5600x
    Two pins can be configured to act as bus request and bus grant
      When an external bus arbitrator wants a particular DSP processor to relinquish the shared bus, it asserts that processor’s bus request input
      The processor then completes any external access in progress and relinquishes the bus, acknowledging with bus grant
    The DSP processor can continue to execute its program as long as no access to the shared bus is required
      If an access to the shared bus is required, the processor waits until the bus request signal has been deasserted
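
The bus request and bus grant behavior described above can be sketched as a cycle-by-cycle model. This is a simplification under assumptions: the processor relinquishes the bus at an instruction-cycle boundary, every instruction takes one cycle, and the class and function names are hypothetical.

```python
class SharedBusProcessor:
    """Sketch of a processor that gives up the shared bus on request."""

    def __init__(self):
        self.bus_request = False   # input driven by the external bus arbitrator
        self.bus_grant = False     # output: True once the bus has been relinquished
        self.stalled_cycles = 0

    def step(self, needs_shared_bus):
        """Advance one instruction cycle; return True if the instruction completed."""
        self.bus_grant = self.bus_request
        if needs_shared_bus and self.bus_grant:
            self.stalled_cycles += 1   # wait until bus request is deasserted
            return False
        return True                    # internal work continues without the bus


def run(program, request_cycles):
    """Run a list of instructions (True = needs the shared bus); return (cycles, stalls)."""
    cpu = SharedBusProcessor()
    pc = cycle = 0
    while pc < len(program):
        cpu.bus_request = cycle in request_cycles
        if cpu.step(program[pc]):
            pc += 1
        cycle += 1
    return cycle, cpu.stalled_cycles


result = run([False, True, False, True, False], request_cycles={2, 3, 4})
print(result)                      # prints (7, 2)
```

The instruction in cycle 2 does not need the shared bus and completes normally; only the shared-bus access that follows is held up until the request is deasserted.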

Examples of Shared Bus/Memory
  TI TMS320C5x
    Provides the equivalent of bus request and bus grant signals (called HOLD* and HOLDA*)
    The processor allows an external device to access its on-chip memory
      The external device first asserts the HOLD* input
      The processor to be accessed responds by asserting HOLDA*
      The external device asserts BR* (indicating that it wishes to access the on-chip memory)
      The processor responds by asserting IAQ*
      The external device then reads and writes the processor’s on-chip memory by driving the processor’s address, data, and read/write lines
      When finished, the external device deasserts HOLD* and BR*
    Interprocessor data can move without an external shared memory

Examples of Shared Bus/Memory
  Bus locking
    Simplifies the use of shared (lock) variables in shared memory
    Allows a processor to read a shared variable from memory, modify it, and write the new value back to memory without intervening stores to that variable by another processor
    Called atomic test-and-set or atomic load-store
    Prevents multiple processes from executing a critical section of program code simultaneously
    TI TMS320C3x and TMS320C4x
      Provide special instructions (such as SWAP) and hardware support for bus locking (lock signals)
      Called interlocked operations by Texas Instruments
  Alternative to bus locking
    Load linked and store conditional
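
The effect of bus locking can be imitated in software with threads standing in for processors. In this sketch a threading.Lock plays the role of the hardware lock signal, and the hypothetical locked_swap method plays the role of an interlocked read-modify-write; it is an analogy, not a model of any specific instruction.

```python
import threading
import time


class SharedMemory:
    def __init__(self):
        self.words = {}
        self._bus_lock = threading.Lock()   # stands in for the hardware lock signal

    def locked_swap(self, addr, new_value):
        """Read the old value and write the new one as one indivisible operation."""
        with self._bus_lock:
            old = self.words.get(addr, 0)
            self.words[addr] = new_value
            return old


LOCK_ADDR, COUNTER_ADDR = 0, 1


def worker(mem, iterations):
    for _ in range(iterations):
        while mem.locked_swap(LOCK_ADDR, 1) == 1:   # test-and-set: spin until acquired
            time.sleep(0)                           # let another thread run
        # Critical section: only one worker at a time updates the counter.
        mem.words[COUNTER_ADDR] = mem.words.get(COUNTER_ADDR, 0) + 1
        mem.locked_swap(LOCK_ADDR, 0)               # release the lock variable


def run_workers(num_threads=4, iterations=200):
    mem = SharedMemory()
    threads = [threading.Thread(target=worker, args=(mem, iterations))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.words[COUNTER_ADDR]


final_count = run_workers()
print(final_count)                 # prints 800
```

Because the read and the write of the lock variable cannot be separated, two workers can never both see the lock as free.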

Examples of Shared Bus/Memory
  ADSP-2106x
    On-chip bus arbitration logic that allows direct interconnection of up to six ADSP-2106x devices
      No need for special software or external hardware
    Allows one DSP processor on a shared bus to access another processor’s on-chip memory
      Like the TI TMS320C5x family
  Communication ports
    Intended for interprocessor communication between DSPs of the same type
    Analog Devices ADSP-2106x
      Six four-bit comm ports
    TI TMS320C4x
      Six (TMS320C40) or four (TMS320C44) eight-bit comm ports

Support for Dynamic Memory (Obsolete Subject)
  On-chip RAM
    SRAM or ROM
  External memory
    In most cases, SRAM
  SRAM compared with DRAM
    Faster
    No refresh
    Simpler interface
    Higher cost and larger area
  Increasing interest in using DRAM in DSP systems
    Faster DRAM types
      Page mode, static column mode, and nibble mode DRAM

Support for Dynamic Memory (Obsolete Subject)
  Support for faster types of memory
    Memory page boundary detection capabilities
      When the processor detects that a memory access has crossed a page boundary, it asserts a special output pin
      An external DRAM controller uses this pin to control memory and to signal that the processor must insert wait states
      Motorola DSP96002, Analog Devices ADSP-2100x, and TI TMS320C3x and TMS320C4x
    Internal DRAM controller
      Motorola DSP56004 and DSP56007, for page-mode DRAM
  Current support for memory
    On-chip single/dual-rate SDRAM controller
    On-chip flash memory controller
    On-chip flash memory
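
Page boundary detection comes down to comparing the page number of each access with that of the previous one. The sketch below assumes a page size of 256 words and treats the very first access as a page change; both choices, and the function name, are assumptions for illustration.

```python
def page_boundary_flags(addresses, page_words=256):
    """For each access, report whether it falls in a different page than the previous one."""
    flags = []
    current_page = None
    for addr in addresses:
        page = addr // page_words
        flags.append(page != current_page)   # True models asserting the output pin
        current_page = page
    return flags


flags = page_boundary_flags([0, 1, 255, 256, 257, 0])
print(flags)    # prints [True, False, False, True, False, True]
```

Accesses that stay within a page can use the fast page-mode cycle; only the flagged accesses need the longer cycle and therefore wait states.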

Direct Memory Access (DMA)
  A technique whereby data can be transferred to or from the processor’s memory without involvement of the processor itself
    From I/O device to memory or vice versa, and memory-to-memory
  Need for a separate DMA controller (bus master)
    Or the I/O device controller must act as a bus master
  The processor sets the control information in the DMA controller
    Start memory address, the number of data words to be transferred, transfer direction, source or destination peripheral
  External DMA controller
    Uses the bus request/grant mechanism
    Data transfers between off-chip I/O and memory
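
The control information that the processor hands to the DMA controller can be pictured as a small descriptor. In this sketch the field names are hypothetical, the peripheral is modeled as a plain list used as a FIFO, and the transfer runs to completion in one call rather than interleaving with program execution.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    start_address: int   # first memory address of the transfer
    word_count: int      # number of data words to be transferred
    to_memory: bool      # direction: True = peripheral to memory, False = memory to peripheral


def run_dma(channel, memory, peripheral):
    """Move word_count words between memory and the peripheral FIFO."""
    for i in range(channel.word_count):
        addr = channel.start_address + i
        if channel.to_memory:
            memory[addr] = peripheral.pop(0)
        else:
            peripheral.append(memory[addr])


memory = [0] * 8
adc_fifo = [7, 8, 9]
run_dma(DmaChannel(start_address=2, word_count=3, to_memory=True), memory, adc_fifo)

dac_fifo = []
run_dma(DmaChannel(start_address=2, word_count=2, to_memory=False), memory, dac_fifo)
print(memory, dac_fifo)   # prints [0, 0, 7, 8, 9, 0, 0, 0] [7, 8]
```

Once the descriptor is written, the processor takes no further part in the transfer, which is the point of DMA.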

Direct Memory Access (DMA)
  On-chip DMA controller
    Accesses on-/off-chip I/O and memory
    Problem of memory contention with the DSP core
      In some cases, memory bandwidth may be large enough to allow DMA transfers to occur in parallel with normal program execution
      For example, TI TMS320C4x provides on-chip memory and on-chip DMA address and data buses
  Multiple channels
    Manage multiple DMA transfers in parallel
      The DMA controller has a separate set of control registers for each channel
      Do not support simultaneous transfers (I think)
    TMS320C4x (six channels), ADSP-2106x (ten), DSP96002 (two), for memory-memory or memory-peripheral transfers
  Cycle-stealing
    Because of limited memory bandwidth, the currently executing instruction is forced to wait one cycle during a DMA transfer
    DSP3210 and ADSP-21xx