
Journal of VLSI Signal Processing, 11, 213-228 (1995) © 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Custom DSP Design of a GSM Speech Coder

V. ÖWALL, P. ANDREANI, L. BRANGE, P. NILSSON, A. WASS AND M. TORKELSON Department of Applied Electronics, Lund University, Ole Römers väg 3, 221 00 Lund, Sweden

Received February 12, 1993; Revised June 16, 1995

Abstract. The GSM speech coder for digital mobile telephones has been designed on a custom DSP using an environment for development of arbitrary processor architectures. A netlist of the speech coder has been generated from a high level description in the presented design environment. Improvement of the performance, both by modifying the high level description and by hand optimization of the microcode, is discussed. The advantages of a user interactive system, where stepwise refinement leads to a competitive result, are thus demonstrated. Simulation has been performed at algorithm, microcode, and transistor level as a consistency check. To avoid dependency on a specific vendor, the design environment allows implementation in various technologies, and a netlist for a Plessey gate array has been generated as an example.

1 Introduction

GSM is the Pan-European digital mobile cellular telephone system currently in use in several European countries and is the result of a multinational development in Europe for mobile telephone communication [1, 2]. In comparison to earlier analog systems the extensive use of digital signal processing provides higher capacity and enables good speech quality even under poor transmission conditions. Additionally, better security using digital encryption and support for various data services can be supplied. This paper uses the GSM speech coder, which is a clock cycle consuming part of the complete GSM system, as an application example in order to evaluate the developed design environment. The reasons for choosing the GSM speech coder are: its industrial importance; a complex, highly specified algorithm with extensive simulation data, test vectors, and simulation software; and the existence of several implementations on various processors which can be used for comparison.

The amount of digital signal processing required in the GSM system calls for a high level of system integration onto few VLSI processors considering power consumption, performance, area, and price. These aspects enhance the importance of investigating ASIC solutions. The presented design environment is a combination of independent programs which together facilitate the design of true Algorithm Specific Digital Signal Processors (ASDSPs), i.e., ASDSPs without a predefined processor core, a limited selection of functional units, a specific instruction set, etc. An algorithm driven architecture design procedure is possible due to the fact that it is easy to design a new processor or to modify a previously developed architecture to suit new demands. A customized architecture enables an increase in performance or reduction of some critical parameters such as power consumption or silicon area. The design environment is developed to be open, both regarding inclusion of new software and the final implementation technique of a design, in order to be able to adapt to technological advances.

Section 2 gives a brief introduction to the GSM speech coder. A description of the GSM system can be found in [1, 2]. The design environment is described in Section 3 and the realization of the GSM speech coder is presented in Section 4. Optimization of the design to achieve a better result is applied at different stages in the design procedure. A discussion of ways to reduce the clock cycle count of the algorithm is given in Section 5. Section 6 presents the results of the design procedure and in Section 7 our conclusions are drawn.

2 The GSM Speech Coder

This section will give a brief introduction to the GSM speech coder. A comprehensive description of the speech coder algorithm is not a topic of this paper and the interested reader is directed to other literature [3, 4]. In Recommendation GSM 06.10 [3] the detailed mapping between input and output of the GSM speech coder is specified. The speech coding is the conversion of 20 ms speech frames containing 160 speech samples in 13 bit uniform PCM format into encoded blocks of 260 bits. This corresponds to a sampling rate of 8000 samples of speech per second and an output bit rate of 13 kbits/s. The coding scheme is the so-called Regular Pulse Excitation - Long Term Prediction - Linear Predictive Coder, referred to as RPE-LTP. A block diagram of the RPE-LTP encoder is given in Fig. 1. The speech decoder includes the same structure as the feedback loop of the RPE-LTP encoder and can therefore be implemented on the same ASIC at little extra cost.

Fig. 1. Simplified block diagram of the RPE-LTP encoder. Signals: (1) Short term residual, (2) Long term residual, (3) Short term residual estimate, (4) Reconstructed short term residual, (5) Quantized long term residual. Outputs to the radio subsystem: reflection coefficients (36 bits/20 ms), RPE parameters (47 bits/5 ms), LTP parameters (9 bits/5 ms).
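The frame figures quoted above can be cross-checked with a little arithmetic. The per-parameter bit counts are taken from the labels of Fig. 1 and the division into four sub-frames from Section 2.1; the symbols f_s, N_bits, and R are introduced here only for this check.

```latex
\begin{align*}
f_s &= \frac{160\ \text{samples}}{20\ \text{ms}} = 8000\ \text{samples/s},\\
N_\text{bits} &= 36 + 4\,(47 + 9) = 260\ \text{bits per frame},\\
R &= \frac{260\ \text{bits}}{20\ \text{ms}} = 13\ \text{kbit/s}.
\end{align*}
```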

2.1 Coding Scheme

In the RPE-LTP encoder one speech frame, 160 samples, is first pre-processed to produce an offset-free signal. These samples are then analyzed to determine coefficients for the short term analysis filter, which is then used to filter the same 160 samples resulting in the short term residual signal (1 in Fig. 1). This filtering can be seen as a digital imitation of the human vocal tract, and at this point no data reduction has been performed [2]. This coding has a very short memory of approximately 1 ms. However, the human voice has correlations where the sound recurs over a longer time interval, which is used to reduce the amount of data. This reduction is performed by the long term prediction function. For the following operations one frame is divided into 4 sub-frames. The long term prediction analysis is performed on the basis of the current sub-frame and a stored sequence of 120 previously reconstructed short term residual samples. The LTP is determined by computing the cross-correlation between the sub-frame and the stored sequence. 40 long term residual samples (2 in Fig. 1) are obtained when 40 estimates of the short term residual signal (3 in Fig. 1) are subtracted from the short term residual signal itself. These samples are fed to the RPE analysis which performs the basic compression of the algorithm. In addition to being sent to the radio subsystem the RPE parameters are fed to a local RPE decoding and reconstruction module which produces a block of 40 samples of quantized long term residual signal (5 in Fig. 1). These are added to the previous block of short term residual estimates to obtain a reconstructed version of the current short term residual (4 in Fig. 1).
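The two sample-wise steps in the feedback loop described above can be sketched in C as follows. This is a minimal illustration only: the array names d, dpp, e, ep, and dp_new, the helper names, and the use of 16-bit saturating arithmetic are assumptions made for the sketch, and the bit-exact definitions are those of the recommendation [3].

```c
#include <stdint.h>

#define SUBFRAME_LEN 40

/* 16-bit addition clipped to [-32768, 32767] (assumed word length). */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* 16-bit subtraction clipped to [-32768, 32767]. */
int16_t sat_sub16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a - (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Signal (2) = signal (1) - signal (3):
 * long term residual e from short term residual d and its estimate dpp. */
void long_term_residual(const int16_t d[SUBFRAME_LEN],
                        const int16_t dpp[SUBFRAME_LEN],
                        int16_t e[SUBFRAME_LEN])
{
    for (int k = 0; k < SUBFRAME_LEN; k++)
        e[k] = sat_sub16(d[k], dpp[k]);
}

/* Signal (4) = signal (5) + signal (3):
 * reconstructed short term residual from the quantized long term
 * residual ep and the short term residual estimate dpp. */
void reconstruct_short_term_residual(const int16_t ep[SUBFRAME_LEN],
                                     const int16_t dpp[SUBFRAME_LEN],
                                     int16_t dp_new[SUBFRAME_LEN])
{
    for (int k = 0; k < SUBFRAME_LEN; k++)
        dp_new[k] = sat_add16(ep[k], dpp[k]);
}
```

The reconstructed block is what would be appended to the stored sequence of 120 samples used by the long term prediction analysis of the next sub-frame.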

2.2 Specification

The detailed mapping of the GSM speech coder, down to bit level, may simplify the verification of compliance with the recommendation [3]. However, this form of specification prohibits full utilization of the advantages of a custom DSP since hardware requirements are more or less defined prior to the design procedure [5]. Micro operations are specified at bit level and the algorithm is given in a pseudo high level programming code which corresponds to around 1000 lines of C-code. The specification of the long addition is given here as an example of a bit true micro operation.

Micro operation: L_add(var1, var2) 32 bits addition of two 32 bits variables (var1 + var2) with overflow control and saturation; the result is set at 2147483647 when overflow occurs and at -2147483648 when underflow occurs.
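As an illustration, the long addition specified above could be written in C roughly as follows. This is a sketch, not the reference code of the recommendation; forming the sum in 64 bits is an implementation choice made here so that the overflow test itself cannot overflow.

```c
#include <stdint.h>

/* 32-bit addition with overflow control and saturation. */
int32_t L_add(int32_t var1, int32_t var2)
{
    int64_t sum = (int64_t)var1 + (int64_t)var2;

    if (sum > INT32_MAX) return INT32_MAX;   /* 2147483647 on overflow   */
    if (sum < INT32_MIN) return INT32_MIN;   /* -2147483648 on underflow */
    return (int32_t)sum;
}
```

In hardware the same behavior is obtained from an adder with overflow detection and a multiplexer selecting the saturation constants, which is why such bit-level definitions largely fix the datapath in advance.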

The code section "Search for the maximum cross-correlation and coding of the LTP lag" is used to illustrate the pseudo high level programming code. This part is used in Section 5 to describe the optimization process of the microcode.

Pseudo Code

    FOR lambda = 40 to 120
        L_result = 0
        FOR k = 0 to 39
            L_temp = L_mult(wt[k], dp[k-lambda])
            L_result = L_add(L_temp, L_result)
        NEXT k
        IF (L_result > L_max) THEN
            Nc = lambda
            L_max = L_result
    NEXT lambda
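The lag search above maps almost line for line onto C. The sketch below makes several assumptions that are not stated in the pseudo code: L_max starts at 0 and Nc at 40, L_mult is a 16 x 16 bit product with a one bit left shift, and the negative indices dp[k-lambda] are stored in a 120-element history array dp_hist with the oldest sample first. The function name ltp_lag_search is likewise hypothetical.

```c
#include <stdint.h>

/* Saturating 32-bit addition, as in the L_add micro operation. */
int32_t L_add(int32_t var1, int32_t var2)
{
    int64_t sum = (int64_t)var1 + (int64_t)var2;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Assumed definition: product with a one bit left shift; the single
 * case that would not fit in 32 bits is saturated. */
int32_t L_mult(int16_t var1, int16_t var2)
{
    if (var1 == INT16_MIN && var2 == INT16_MIN) return INT32_MAX;
    return (int32_t)var1 * var2 * 2;
}

/* Returns the lag Nc maximizing the cross-correlation between the
 * sub-frame wt[0..39] and the stored sequence. dp[-j] of the pseudo
 * code is held in dp_hist[120 - j], j = 1..120. */
int ltp_lag_search(const int16_t wt[40], const int16_t dp_hist[120],
                   int32_t *L_max_out)
{
    int32_t L_max = 0;   /* assumed initial value */
    int Nc = 40;         /* assumed initial value */

    for (int lambda = 40; lambda <= 120; lambda++) {
        int32_t L_result = 0;
        for (int k = 0; k < 40; k++) {
            /* k - lambda lies in [-120, -1], so the index is in [0, 119] */
            int32_t L_temp = L_mult(wt[k], dp_hist[120 + k - lambda]);
            L_result = L_add(L_temp, L_result);
        }
        if (L_result > L_max) {
            Nc = lambda;
            L_max = L_result;
        }
    }
    if (L_max_out) *L_max_out = L_max;
    return Nc;
}
```

The inner loop executes 81 x 40 = 3240 multiply-accumulate operations per sub-frame, which is why this code section is a natural target for optimization.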

3 Design Environment

When ASICs are used the time for development is often an important criterion. Therefore, to make ASICs a competitive alternative to general purpose signal processors, it is important to have access to efficient CAD tools. Otherwise the design process will be too time consuming and other solutions are preferred. In order to allow the designer to explore different solutions these tools must give the designer rapid feedback on hardware and software properties of the chosen architecture. However, it must also be possible to refine the result to a competitive solution. Another important issue is to have an open design environment where the final design can be transferred between different implementation techniques.

The presented design environment is a combination of independent software: commercial software, university software, and software developed at the department. However, the designer community does not want a mixed bag of incompatible tools [6], and from the designer's point of view the system can be seen as an integrated environment. To be able to adapt to future changes in format, cell library, application area, etc. the intention has been to make a flexible environment. Therefore, the independent programs communicate through intermediary formats, which enable new software to be included with relative ease, and only a few basic cells are used in the software defined hardware.

The presented design environment allows the implementation of arbitrary algorithms on fully customized processor architectures and a netlist is generated from a high level description. No predefined processor cores are used, instead the designer has complete control of processor architecture and is able to make trade-offs between speed, area, and power consumption. As much as these factors depend on the processor architecture they depend on the chosen implementation technique, i.e. gate array or full custom, the cell library, technology, etc. Multiple outputs are available in order not to tie the design to a specific implementation technique, cell library, CAD system or silicon vendor.

Bit true simulations can be performed at any stage of the design hierarchy. Hence, consistency checks between all development stages are easily performed. Final simulations and processor evaluation are performed in the chosen system of implementation. If a well characterized cell library is used together with a reliable simulator, the functionality and performance of the fabricated chip are guaranteed to coincide with the simulation. This guarantee, which exists in the used Plessey system [7], makes fabrication unnecessary if the objective is to evaluate the design.

Several chips have been designed within the design environment including a chip for the Least Mean Square algorithm (LMS), a chip for correction of quadrature modulators [8], a radio channel simulator [9], and an image convolution processor [10]. Related design systems are the Lager system [11, 12] and the more automated Cathedral-II [13].

3.1 Design Procedure

An iterative top-down approach is used in a design, from C, via assembly code to the controller synthesis and the final implementation in a chosen cell library. However, each tool can be used independently and the design process entered at a suitable level. To take advantage of the designer's knowledge, user interaction is possible at every stage in the design process. A general overview of the design environment is shown in Fig. 2 where user interaction is displayed by the numbered loops.

3.2 Scheduling

For scheduling the RL-compiler, developed at UC Berkeley, has been used. The RL-compiler generates microcode for a processor architecture from a high level description using a combination of greedy scheduling and lazy data routing, for details see [14, 15]. The RL-compiler is a user-retargetable compiler without constraints on the user defined processor architecture. To be able to map onto a processor the RL-compiler uses a machine description (md in Fig. 2) of the actual processor architecture. Specification of the algorithm is done in the RL-language which is loosely based on a subset of C. An RL program cannot be compiled by a standard C compiler, but to allow simulation the RL-compiler can translate the source code into standard C with calls to a run-time arithmetic library chosen by the user. To allow consistency checks with lower level simulations a finite precision library for bit-true simulation is supplied.

Fig. 2. Design environment. (Blocks shown: algorithm declaration, machine description (md), compiler, microcode simulator, controller synthesizer, datapath compiler, and netlist, with outputs for Plessey, EDIF, VHDL, Viewlogic, and TdE. The numbered loops 1, 2, and 3 mark user interaction.)

To write the RL source code, define the machine description, and compile the RL program is an iterative procedure (loop 1 in Fig. 2). The iterations include both changes in the RL source code and development of the hardware by improving the machine description, i.e. adding memories, functional units, or interconnections. In order to minimize the microcode and to reduce the size of register banks, allocated for intermediate data storage by the compiler, it is important for the designer to be familiar with the scheduling procedure in the RL-compiler. Code written without hardware considerations often produces a large overhead in the clock cycle count. Some aspects of this problem are discussed in Section 5.

3.3 Microcode Simulation

In most cases a combination of high level scheduling and hand coding is suitable. A microprogram is generated by the RL compiler and critical parts are optimized by hand (loop 2 in Fig. 2). For designs of moderate complexity, or if the efficiency of the microcode is crucial, it is possible to start the design at the microcode level by coding the complete algorithm by hand. Hand optimization or hand coding makes the use of a microcode simulator essential in order to achieve a correct result [16]. When a datapath architecture is specified in the design environment it yields a user defined micro instruction set. The fact that micro instructions are user defined to suit the algorithm simplifies hand coding.

The microcode simulator produces a C description of the microcode to be compiled with general purpose user interface routines. In this way a compiled microcode simulator is generated for each microcode under test. This technique has two advantages: it decreases simulation times and it allows the user to describe functions appearing in the machine description directly in C. Both interactive simulation for debugging purposes and noninteractive simulation for efficient data generation can be performed.

The microcode simulator is an important tool in the process of optimizing the microcode. Statistics of the simulation are produced, including most called subroutines which are the natural candidates for optimization by hand, since they contribute most to the total cycle count.

3.4 Datapath Compilation

The DataPath Compiler (DPC) [17] is a general tool for assembling hardware and it generates netlists of datapath modules from structural descriptions, mapped onto a cell library. Thus, the DPC creates an architecture able to execute the micro instruction set required to perform the algorithm. In addition to the netlist, behavioral descriptions (bd in Fig. 2) which comprise the micro instruction set used in the microprogram are generated. As well as defining the micro instruction set, these behavioral descriptions also state the control signals for each micro instruction and the default levels for the control signals, and give routing information for connecting datapath modules and the controller. This information is used in the subsequent controller synthesis phase.

DPC descriptions of multiplier structures are generated by a separate program. Multipliers use Booth's algorithm [18] and are generated with different performance characteristics depending on the application: high speed, low power, or bit serial.

3.5 Controller Synthesis

Controller synthesis is a crucial part of the design of control flow dominated processors since the manual design of a controller requires substantial effort. A controller synthesizer, COMA [19, 20], has been developed at the Department of Applied Electronics while other examples are the control unit tool of the Lager system [21, 22] and the control unit synthesizer CGE in Cathedral-II [23]. COMA synthesizes a complete controller with memories, address processors, and interconnection specification from the microprogram and the behavioral descriptions. The size and the speed of the controller depend both on the implementation technique of the control logic [24], i.e. PLA, random logic, etc., and the structure of the microprogram. Therefore, COMA has a range of controller architectures to choose from, suitable for different applications, ranging from a simple FSM controller to a decomposed hierarchical controller structure, Fig. 3. The hierarchical controller structure is dependent on the application and the number of levels can range from a single microcode memory to any number of sequencing levels, Fig. 3. Depending on the microprogram and user defined architectural decisions a corresponding controller architecture is synthesized. A controller assigns control words to all declared datapath modules according to the microprogram. Equal control signals are merged and static control signals are connected to ground or the power supply. In large designs the number of control signals tends to grow rapidly while the increase in the number of used combinations, i.e. micro instructions, is more limited. To reduce the number of output signals from the microcode memory local decoders can be used [25, 26]. Programming facilities supported by COMA are: subroutines with variable passing, conditional statements, loops, memory declarations, address calculation in different addressing modes, etc. For loop control and conditional statements the controller responds to external signals and signals generated in datapath modules and address processors.

Memory declarations, RAMs and ROMs, and address calculations are typical parts of a microprogram. Therefore, COMA generates memory descriptions and Address Processing Units (APUs) of various complexity. Depending on the speed required for address calculations, memories can either share address processors or address processors executed in parallel can be used. If an application requires a large amount of data storage, facilities for having large memories off-chip are supported. COMA supports several addressing modes such as direct, indirect, auto-increment, etc. [27], and the set of modes used in the microprogram determines the complexity of the address processor. When only direct addressing is used the address processor consists of an address logic block executed in parallel with the microcode memory, Fig. 4. Otherwise address processor cores of different complexity and size are generated; the complexity depends on the addressing modes used while the size depends on the number of pointers used. If only a few constants are used those are stored in core multiplexers, otherwise separate address logic is added to the APU. Depending on speed requirements and the structure of the address processing core, an optional pipeline register can be implemented between the address processor and the referenced memories, see Fig. 4.

Fig. 3. Simplified block scheme of hierarchical controller structure.

Fig. 4. Address calculation.

Finite state machines, microcode memory, and data memories are specified in truth tables allowing the designer to try different solutions for implementations such as ROM, PLA, or random logic. Various applications will lead to different choices depending on memory size, technology, speed, and area considerations. In large designs microcode and address logic tend to become large. Therefore, in order to avoid large modules a user driven partitioning scheme is facilitated by COMA. Smaller modules are easier to optimize, synthesize, and to place on an ASIC/gate array.

If the result of the controller synthesis is not satisfactory the microcode must be changed, either by changing the microcode by hand (loop 2 in Fig. 2) or by changing the algorithm specification and/or the machine description and redoing the scheduling (loop 3 in Fig. 2).

3.6 Netlist Generation

Finally the DPC assembles datapath and controller modules into a complete description of the processor. Several outputs are available to allow the use of various cell libraries and CAD systems for the final layout generation such as: EDIF, Plessey Classic [7], Viewlogic [28], and a cell library developed at the Department of Applied Electronics [29]. Additionally a VHDL output is under development. A new cell library for an existing output format is easily added by a cell library declaration while a new format requires an extension of software in the design environment.

4 Speech Coder Realization

The GSM speech coder algorithm [3] is a bit-true specification leaving little room for creativity in the architecture exploration phase [5]. Advantages of a custom DSP such as non-standard architecture and word length optimization are not possible to utilize with this type of specification. The computational parts of the processor are defined by the micro operations stated in the recommendation, i.e. long addition, short addition, multiplication with rounding, left/right shift etc. Ways to improve the result compared to a standard DSP are the use of intermediate data storage in datapath registers, software pipelining, and increased performance of address processors, especially parallel address calculation. A block diagram of the developed processor architecture is given in Fig. 5.

4.1 Datapath

When an application specific DSP is designed, the processor architecture is tailored to the algorithm in order to get a smooth dataflow through the processor. The designed datapath architecture for the GSM speech coder is similar to a standard DSP architecture following the detailed algorithm specification, Fig. 6. The MIPS-intensive parts of the algorithm contain basically multiply and add operations. In order to achieve a result with a low clock cycle count a fast multiply-add path must be available in the datapath architecture. Furthermore, it is important to be able to supply the datapath with data at the same pace as data is processed. Therefore, there are two parallel busses, parallel RAMs, and parallel APUs. Other operations are not as critical and the corresponding hardware is added in parallel to the multiply-add datapath. Physically, the datapath is split into several smaller modules.

Fig. 5. Block diagram of the processor architecture. (Blocks shown: two APUs, loop control, datapath, controller, and external connections.)

Fig. 6. Datapath architecture. (Blocks shown: registers (REG), a 16 x 16 multiplier (MULT), and a shifter (SHIFT).)

According to the specification the datapath for the speech coder uses two different word lengths, 16 and 32 bits. The multiplier is 16 x 16 bits which generates a 32 bit product in one clock cycle and is designed for regular multiplications, multiplications with rounding, and multiplications with a left shift. Addition and subtractions with saturation are performed in a module handling both 16 and 32 bit operations. The shifter is 32 bits and can shift 16 positions in both directions in one clock cycle.
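The three multiplier modes mentioned above can be modelled in C as shown below. The sketch assumes that they correspond to a fractional multiplication keeping the upper bits of the product (mult), the same operation with rounding (mult_r), and a 32 bit product with a one bit left shift (L_mult); these definitions and the helper asr32 are assumptions for illustration, and the binding definitions are those of the recommendation [3].

```c
#include <stdint.h>

/* Arithmetic shift right that does not rely on the implementation-
 * defined behavior of >> on negative values. */
int32_t asr32(int32_t x, int n)
{
    return (x >= 0) ? (x >> n) : ~((~x) >> n);
}

/* Regular multiplication: 16 x 16 bit product scaled back to 16 bits. */
int16_t mult(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN) return INT16_MAX; /* saturate */
    return (int16_t)asr32((int32_t)a * b, 15);
}

/* Multiplication with rounding: half an LSB is added before scaling. */
int16_t mult_r(int16_t a, int16_t b)
{
    int32_t r = asr32((int32_t)a * b + 16384, 15);
    return (r > INT16_MAX) ? INT16_MAX : (int16_t)r;
}

/* Multiplication with a left shift: full 32 bit result. */
int32_t L_mult(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN) return INT32_MAX; /* saturate */
    return (int32_t)a * b * 2;
}
```

All three modes share the same 16 x 16 bit product, so in hardware they differ only in the rounding constant and in which bits of the product are selected, which is consistent with a single multiplier module serving all of them.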

4.2 Controller

The controller synthesized for the GSM speech coder consists of two hierarchical levels and a Decision Finite State Machine (DFSM), Fig. 7. The microprogram is divided into subroutines which are stored in a microcode memory and are executed sequentially controlled by the Program Counter (PC). Subroutine addresses are calculated in a Finite State Machine (FSM), consisting of subroutine address logic and a feedback register. To be able to execute the same subroutine at different parts of the microprogram, the subroutine address itself does not define the state of the FSM but is complemented with internal state variables. At the end of a subroutine a new subroutine address is triggered from the finite state machine register and the PC is reset. The controller is pipelined to avoid clock cycle overhead at subroutine transitions.

The DFSM states are set by signals from either datapath modules, address processors, or I/O-units and are used in conditional statements and for loop control. Evaluation signals from the microcode memory change the state of the DFSM, which is needed in order to store states depending on temporary changes in input signals to the DFSM. States in the DFSM can affect the choice of micro instructions in the microcode memory and the computation of subroutine addresses in the finite state machine, and can also act directly as control signals to the datapath.

Due to the pipelining of the controller the use of a DFSM causes latency between the change of a status signal and the point where this change can affect the generation of a new subroutine address or the control signals to the datapath. The latency of the controller structure of Fig. 8 has as a consequence that evaluation of a state cannot be performed in the last two cycles of a subroutine if it affects the next subroutine transition. If the scheduling process cannot be performed in such a way that states are evaluated earlier, "no operation" cycles (NOPs) have to be added in the microcode. To avoid this, the latency can be reduced by decreasing the pipelining of the controller, i.e. by removing the status register. However, this would result in a longer delay path between registers, i.e. the status signal generation in the datapath, the evaluation in the DFSM logic, and the subroutine address calculation have to be performed during one clock cycle.

Fig. 7. Controller architecture. (Blocks shown: subroutine address logic with state register, Decision Finite State Machine for case and loop control, and microcode memory with instruction register. Inputs: sign from APU and external signals. Outputs: control signals to the datapath.)

If the delay of the state generation is too long it is possible to pipeline the controller further by using the output of the DFSM feedback register instead of the output of the DFSM logic, hence decreasing the delay by increasing the latency. COMA and the RL-compiler allow different pipelining strategies to be explored and the controller latency is affected by setting different parameters. There are further possibilities in the state handling of COMA which are only used in hand coding since they are not supported by the RL-compiler. When the state controls the microcode memory directly, choosing between different control words, the latency is reduced by one cycle. If direct action is performed depending on a status signal and a state does not have to be stored for later use it is possible to let status signals affect the subroutine address calculation and the microcode memory directly. This scheme will reduce the size of the DFSM and the latency of the controller. In the GSM speech coder all state handling is performed in the way scheduled by the RL-compiler with states in the DFSM affecting the subroutine address logic and states directly controlling the datapath.

Fig. 8. Latency in control flow operations. (The figure marks paths with a latency of one cycle between the datapath status signals, the DFSM logic and states, the subroutine address logic, the microcode memory, and the control signals.)

For the Plessey Classic cell library the logic blocks have been mapped on random logic using the Synopsys system [30] and the additional circuitry has been assembled from the cell library using the described DPC. The number of control signals is large (105) suggesting an advantage in using local decoders which is the case if the microcode memory is implemented as a PLA. However, when Synopsys is used for random logic generation the solution without local decoders results in a lower gate count. Also the use of the partitioning scheme resulted in a larger gate count when the logic synthesis is handled by the Synopsys software. In the design process only the gate count and not place and route effects have been considered. Place and route might change the preferred structure regarding the use of local decoders or microcode partitioning.

4.3 Address Processor

In order to calculate the memory addresses for the parallel RAMs in one clock cycle, parallel address processors are needed. Consequently, the developed GSM speech coder architecture contains two address processors to facilitate the required calculation capacity. The address processing follows the scheme of Fig. 4 where the address processing cores generated by COMA have a basic architecture according to Fig. 9. The external address input to the processor core of Fig. 9 has not been used in the GSM speech coder design but is shown for completeness. The external address input is used when an address is calculated in a user defined datapath module or taken from an I/O port. Several addressing modes such as direct, indirect, auto-decrement, and auto-increment, together with constants for loop control, are used in the speech coder. Therefore, address calculations are performed in address processor cores together with an address memory. In the GSM speech coder an address is either calculated in the address processor core or, when the direct addressing mode is used, taken from the address logic. Table 1 shows the possible addressing modes to be used in a microprogram for COMA. The number of used addressing modes determines the complexity of the generated address processor.
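The addressing modes of Table 1 can be summarized as a small behavioral model in C. The enumeration, the function name effective_address, and the calling convention are hypothetical and only illustrate the order of address use and pointer update in each mode; they do not describe the hardware generated by COMA.

```c
typedef enum {
    DIRECT,        /* Mem[N]                     */
    INDIRECT,      /* Mem[Reg]                   */
    DISPLACEMENT,  /* Mem[N + Reg]               */
    INDEXED,       /* Mem[Reg1 + Reg2]           */
    POST_INC,      /* Mem[Reg]; Reg = Reg + N    */
    PRE_INC,       /* Reg = Reg + N; Mem[Reg]    */
    POST_DEC,      /* Mem[Reg]; Reg = Reg - N    */
    PRE_DEC        /* Reg = Reg - N; Mem[Reg]    */
} addr_mode_t;

/* Returns the memory address for one access and applies the pointer
 * update of the auto-increment/decrement modes to *reg.
 * n is the constant N; reg2 is only used in the indexed mode. */
int effective_address(addr_mode_t mode, int n, int *reg, int reg2)
{
    int addr;

    switch (mode) {
    case DIRECT:       return n;
    case INDIRECT:     return *reg;
    case DISPLACEMENT: return n + *reg;
    case INDEXED:      return *reg + reg2;
    case POST_INC:     addr = *reg; *reg += n; return addr;
    case PRE_INC:      *reg += n; return *reg;
    case POST_DEC:     addr = *reg; *reg -= n; return addr;
    case PRE_DEC:      *reg -= n; return *reg;
    }
    return -1; /* unknown mode */
}
```

Only the direct mode needs no adder or pointer register, which matches the observation that the set of modes actually used determines how complex the generated address processor has to be.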

The size of register banks in the address processor core is determined by the scheduler and depends heavily on the high level code used as input to the RL-compiler. When variables are declared in the RL-code storage space in register banks is allocated. Therefore, to keep the size of register banks down variables have to be reused as frequently as possible. Modifications of the RL-code in this sense reduce the size of register banks significantly.

Fig. 9. Architecture of address processing core. (Blocks shown: external address input, address memory, multiplexers (MUX), registers REG A and REG B, an ADD/SUB unit, and the memory address output.)

Table 1. Addressing modes with syntax examples.

Addressing mode          Syntax
Direct or absolute       Mem[N]
Indirect                 Mem[Reg]
Displacement             Mem[N + Reg]
Indexed                  Mem[Reg1 + Reg2]
Auto-increment, post     Mem[Reg]; Reg = Reg + N
Auto-increment, pre      Reg = Reg + N; Mem[Reg]
Auto-decrement, post     Mem[Reg]; Reg = Reg - N
Auto-decrement, pre      Reg = Reg - N; Mem[Reg]

N is an arbitrary integer number.

The sign bit in the address processor is used as loop control in the controller architecture. The number of iterations is stored in a register and is decremented until the sign bit is set. This signal affects the state of the DFSM and breaks the loop, see Figs. 7 and 9. In the initially designed architecture the iteration loop control was performed this way, without a separate loop counter. However, the next section will show how the performance can be improved by implementing a specially assigned loop counter.
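The sign-bit loop control described above can be sketched in C as a counter that is decremented until it becomes negative. The function name `run_sign_bit_loop` and the preload value of N - 1 are assumptions made for this illustration; the paper does not state the exact preload.

```c
#include <stdint.h>

/* Counted loop terminated by the sign bit of a decremented register.
 * Returns how many times the loop body was executed.
 * Assumption: the register is preloaded with iterations - 1. */
int run_sign_bit_loop(int16_t iterations)
{
    int16_t counter = (int16_t)(iterations - 1);
    int body_runs = 0;

    while (counter >= 0) {   /* sign bit clear: stay in the loop */
        body_runs++;         /* placeholder for the loop body */
        counter--;           /* sign bit becomes set after the last pass */
    }
    return body_runs;
}
```

No comparison against a limit is needed in this scheme; the loop exit depends only on one bit of the counter register.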

5 Code Optimization

This section will discuss some of the optimization techniques used in the process of designing the GSM speech coder, both how to rewrite the RL-code to reduce the clock cycle count in the generated microcode and how this code can be improved further by hand coding. First a simplified code segment is used to present the optimization techniques followed by a discussion on how these techniques are used and the achieved result in the most critical segment of the speech coder code. The code section is executed on a multiply-accumulate datapath which is given in Fig. 10 without overall bus interconnection together with the simplified RL-code. The complete datapath is shown in Fig. 6. To compare different microcodes the number of clock cycles per multiply-accumulate operation is used; this is referred to as cc/MAC.

    for (i=0; i<40; i++)
        res = Ladd(Lmult(wt[i], itpa), res);

Fig. 10. Multiply accumulate datapath with initial RL-code. (Diagram labels: wt[i], MULT, REG; stages marked Cycle 1, Cycle 2, and Cycle 3.)
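For readers who want to run the loop of Fig. 10 on a workstation, the following C sketch gives one possible meaning to `Ladd` and `Lmult`. The semantics are an assumption modelled on the fixed-point primitives of the GSM 06.10 recommendation (a 16 x 16 bit product with a one-bit left shift, and a saturating 32-bit addition); the RL-code in the paper does not define them, and the wrapper `mac40` is hypothetical.

```c
#include <stdint.h>

/* Assumed semantics: 16x16 -> 32-bit product, shifted left by one bit. */
int32_t Lmult(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;                 /* only overflowing case: saturate */
    return (int32_t)a * (int32_t)b * 2;
}

/* Assumed semantics: 32-bit addition with saturation. */
int32_t Ladd(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* The loop of Fig. 10: 40 multiply-accumulate operations. */
int32_t mac40(const int16_t wt[40], int16_t itpa)
{
    int32_t res = 0;
    for (int i = 0; i < 40; i++)
        res = Ladd(Lmult(wt[i], itpa), res);
    return res;
}
```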

Two important problems will be discussed: intermediate data storage in datapath registers and address calculation for memories. Some parts of the clock cycle overhead can be seen as shortcomings of the scheduling process of the RL-compiler. However, it is very hard for a compiler to find the specific solutions the designer had in mind when designing the processor architecture. A further aspect, besides reducing the clock cycle count, is to keep the size of the microcode and address memories at a minimum; this topic is briefly addressed.

5.1 Intermediate Data Storage

To achieve a high utilization of computational parts it is important that data can be supplied and stored without clock cycle overhead. If the result of an operation is to be used in a following computation, it should be stored without clock cycle overhead close to the receiving computational unit. Here close means without clock cycle overhead when the data is fetched. It is often advantageous to use an intermediate register and not a memory. Memory storage requires address calculation both when the value is stored and when it is fetched. In some cases the RL-compiler will store the value both in a register and in a memory. The unnecessary address calculation might increase the number of clock cycles required for the operation.

In the RL-compiler the problem can be solved by assigning variables to specific registers. If the variable res in the code of Fig. 10 is not assigned to a specific register, the result from the addition will be stored both in the feedback register of the adder and in a RAM, Fig. 6. In this example the storage in the RAM has no significance until the loop is completed, but the address calculation will result in an extra clock cycle. The use of a temporary variable tmp, assigned to the feedback register of the adder, will eliminate the RAM storage at every iteration of the loop and save one clock cycle per iteration. Instead the value is stored in the RAM at the completion of the loop, res = tmp. The modified RL-code thus is

    for (i=0; i<40; i++)
        tmp = Ladd(Lmult(wt[i], itpa), tmp);
    res = tmp;
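The effect of the register assignment can be mimicked in C by counting memory stores. The sketch below is an illustration only: `ram_model`, `ram_store`, and the two function names are hypothetical, and plain 32-bit arithmetic replaces `Ladd` and `Lmult` for brevity. The first variant writes the running sum to the modelled RAM in every iteration, the second keeps it in `tmp` and stores it once.

```c
#include <stdint.h>

/* Hypothetical RAM cell that counts stores; each store stands for the
 * address calculation that costs an extra clock cycle in the text. */
typedef struct {
    int32_t cell;
    int writes;
} ram_model;

void ram_store(ram_model *m, int32_t v) { m->cell = v; m->writes++; }

/* res kept in the adder feedback register and copied to RAM every iteration. */
int32_t acc_store_each_iteration(ram_model *m, const int16_t *wt, int n, int16_t itpa)
{
    int32_t res = 0;
    for (int i = 0; i < n; i++) {
        res += (int32_t)wt[i] * itpa;
        ram_store(m, res);            /* redundant until the loop completes */
    }
    return res;
}

/* tmp kept in the feedback register only; one store after the loop (res = tmp). */
int32_t acc_store_once(ram_model *m, const int16_t *wt, int n, int16_t itpa)
{
    int32_t tmp = 0;
    for (int i = 0; i < n; i++)
        tmp += (int32_t)wt[i] * itpa;
    ram_store(m, tmp);
    return tmp;
}
```

For the 40-iteration loop both variants return the same sum, but the first performs 40 stores and the second a single one.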

5.2 Loop Unrolling and Software Pipelining

Address calculation optimization is illustrated using the same code section as above, Fig. 10. The scheduled microcode is given in Fig. 12a. To calculate the address &wt[i] (& is C-syntax for address) and to increment the index, i++, the RL-compiler schedules two clock cycles for the address processor, Fig. 11a. One cycle is used to calculate the address by adding the memory offset &wt[0] and the index i, and another cycle to increment the index, i = i + 1. This is referred to as the auto-incremental addressing mode. A comparison to detect the completion of the loop adds another cycle. The multiply operation is performed simultaneously with the index increment, followed by the accumulation executed in parallel with the comparison, Fig. 12a. This adds up to three cycles for each iteration of the loop. However, the pipelined structure of the controller requires that two extra NOPs are added to the microcode, Fig. 8. One of these NOPs can be removed by reducing the pipelining of the controller. If a special loop counter is assigned and two separate variables are used, one to control the loop and another as memory index, the increment of the loop variable can be performed in the first cycle and the comparison in the second, and accordingly one NOP can be removed.


Fig. 11. Performance of address calculation. (a) Auto-incremental addressing mode, scheduled. (b) Direct and indirect addressing modes. (c) Auto-incremental addressing mode, hand coded.

Fig. 12. Increasing instruction-level parallelism.

In the scheduled microcode one iteration of the loop is completed before the next iteration is initiated, Fig. 12a. Since there is no parallelism in the datapath in one iteration and the iterations of the loop are performed sequentially, the possibility to use the computational parts in parallel is lost. By loop unrolling the instruction-level parallelism is increased and multiple instructions are executed in one clock cycle [27], i.e. one multiply-accumulate operation does not have to be completed before the next is initiated. Loop unrolling is done by simply replicating the loop body multiple times. To fully utilize the multiply-accumulate datapath and achieve the result of Fig. 12b, the address calculation must be performed in one cycle.

The scheduled microcode requires two clock cycles when the auto-incremental addressing mode is used, and consequently the RL-code must be changed. Hence, the code is rewritten using the direct addressing mode, wt[constant], in the totally unrolled loop and the indirect addressing mode, wt[i + constant], in the partially unrolled loops, Fig. 11b. If the loop is not totally unrolled, the indices are incremented at the end of the loop. Loop unrolling will increase the size of the microcode memory by replicating the code, and the size of the address memory by making a reference to it for each address calculation.
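A partially unrolled loop of this kind can be sketched in C as follows. The example is illustrative only: it unrolls four times rather than the factor used later in the paper, the function names are hypothetical, and the `MAC` macro uses plain 32-bit arithmetic instead of the saturating `Ladd` and `Lmult`. The point shown is the wt[i + constant] indexing with a single index update at the end of the loop body.

```c
#include <stdint.h>

/* Plain multiply-accumulate; saturation is omitted in this sketch. */
#define MAC(acc, a, b) ((acc) + (int32_t)(a) * (int32_t)(b))

/* Reference: rolled loop, index incremented in every iteration. */
int32_t mac_rolled(const int16_t wt[40], int16_t itpa)
{
    int32_t tmp = 0;
    for (int i = 0; i < 40; i++)
        tmp = MAC(tmp, wt[i], itpa);
    return tmp;
}

/* Unrolled four times: addresses are wt[i + constant] and the index
 * is incremented once, at the end of the loop body. */
int32_t mac_unrolled4(const int16_t wt[40], int16_t itpa)
{
    int32_t tmp = 0;
    int i = 0;
    for (int j = 0; j < 10; j++) {
        tmp = MAC(tmp, wt[i],     itpa);
        tmp = MAC(tmp, wt[i + 1], itpa);
        tmp = MAC(tmp, wt[i + 2], itpa);
        tmp = MAC(tmp, wt[i + 3], itpa);
        i = i + 4;
    }
    return tmp;
}
```

Both functions compute the same sum; the unrolled version trades four copies of the statement for a quarter of the index updates and loop tests.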

Another technique for increasing the instruction-level parallelism is software pipelining. Software pipelining can be viewed as symbolic loop unrolling and will decrease the size of the microcode memory compared with the loop unrolling technique [27]. By software pipelining the loop is reorganized such that each iteration performs instructions from different iterations of the original loop, Fig. 12b. In the designed GSM speech coder software pipelining has been performed by hand coding. By hand coding, the clock cycle count of the address calculation in the auto-incremental addressing mode is reduced, and the latency of the controller is by-passed by a modification of the break statement of the loop. Address calculation is performed by storing the address &wt[i] - 1, initialized to &wt[0] - 1, and adding one for each iteration of the loop, Fig. 11c. To break the loop we compare the loop counter to 39 instead of 40, knowing that the loop will be executed one more time due to the latency of the controller.
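The reorganization performed by software pipelining can be sketched functionally in C. This is a model under assumptions, not the hand-written microcode: the function name `mac_pipelined` is hypothetical, saturation is omitted, and the index `idx` starting at -1 stands in for the stored address &wt[i] - 1 that is incremented before each access. The multiply of one iteration is paired with the accumulation of the previous one, with one step to fill the pipe and one to empty it.

```c
#include <stdint.h>

int32_t mac_pipelined(const int16_t wt[40], int16_t itpa)
{
    int idx = -1;        /* models the address &wt[i] - 1, pre-incremented on use */
    int32_t acc = 0;
    int32_t prod;

    /* Fill the pipe: first multiply, nothing to accumulate yet. */
    prod = (int32_t)wt[++idx] * itpa;

    /* Steady state: accumulate the previous product while forming the
     * next one; on the datapath both operations share one clock cycle. */
    while (idx < 39) {
        acc += prod;
        prod = (int32_t)wt[++idx] * itpa;
    }

    /* Empty the pipe: accumulate the last product. */
    acc += prod;
    return acc;
}
```

The three parts of the function correspond to the fill, loop, and empty phases; sequential C cannot express the cycle-level overlap, so only the ordering of operations is modelled.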

5.3 Optimization Results

To illustrate the process of optimization the code segment Search for the maximum cross-correlation and coding of the LTP lag given in Section 2 is used. The initial RL-code is

    for (k=0; k<4; k++)
        for (l=40; l<120; l++) {
            res = 0;
            sive = 80 - l;
            for (i=0; i<40; i++)
                res = Ladd(Lmult(wt[i], itpa[sive++]), res);
        }

This code segment is the most clock cycle consuming part of the speech coder and the multiply-accumulate statement is executed 12 800 times. A total reduction of more than 60 k cycles is gained between the initially scheduled and the hand coded microcode by optimization of this code segment alone. The clock cycle count for one completion of the inner loop will be reduced from 240 clock cycles, 6 cc/MAC, to 42 clock cycles, 1.05 cc/MAC. The complete datapath is shown in Fig. 6 and the multiply-accumulate section, without overall bus interconnection, is given in Fig. 10.

The index sive is dependent on an outer loop and the offset between the indices sive and i is not the same between different executions of the inner loop. Thus, rearranging the storage space in the memories does not make it possible to use the same absolute memory address, and the parallel APUs discussed in the previous section will be needed, see Fig. 9. The RL-compiler initially schedules one iteration of the inner loop to 6 clock cycles. One clock cycle can be attributed to the storage of the variable res in a memory; application of the intermediate data storage technique removes this clock cycle. The previously discussed latency of the DFSM will introduce two NOPs at the end of each iteration. By a specially assigned loop counter the increment of the loop variable can be performed during the first cycle and the comparison during the second, thus removing one of the NOPs and resulting in 4 cc/MAC.
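The step from 6 to 4 cc/MAC can be written down as a small cycle budget. The split into three core cycles, one memory store cycle, and two NOPs is a reading of the text (three cycles per iteration from Section 5.2, plus the store of res and the two NOPs); the struct and function names are hypothetical.

```c
/* Per-iteration cycle budget of the inner loop, as read from the text. */
typedef struct {
    int core;       /* address calculation, increment+multiply, compare+accumulate */
    int ram_store;  /* extra cycle for storing res in a memory */
    int nops;       /* NOPs caused by the latency of the DFSM */
} iter_cost;

int cycles_per_mac(iter_cost c)
{
    return c.core + c.ram_store + c.nops;
}

/* Cycles for one completion of the inner loop with the given number of MACs. */
int inner_loop_cycles(iter_cost c, int macs)
{
    return cycles_per_mac(c) * macs;
}
```

With { 3, 1, 2 } the budget gives the initial 6 cc/MAC and 240 cycles for 40 MACs; { 3, 0, 2 } models the register-assigned accumulator, and { 3, 0, 1 } the additional loop counter at 4 cc/MAC.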

By unrolling the loop ten times and not incrementing memory indices until the end of an iteration of the loop a result of 1.7 cc/MAC was achieved. The cycle count is higher than expected compared to explorations performed on examples with only one incremented memory reference. This clock cycle overhead is due to an inability of the RL-compiler to use the parallelism of the APUs in this example. The RL-code of the ten times unrolled loop with an assigned loop counter, variable j, is

    for (i=0, j=0; j<4; j++) {
        tmp = Ladd(Lmult(wt[i], itpa[sive]), tmp);
        tmp = Ladd(Lmult(wt[i+1], itpa[sive+1]), tmp);
        ...
        tmp = Ladd(Lmult(wt[i+9], itpa[sive+9]), tmp);
        i = i + 10;
        sive = sive + 10;
    }

By completely unrolling the loop a result of 1.2 cc/MAC is achieved, which is close to the optimal of 1.05 cc/MAC. The optimal result is achieved by hand coding using the software pipelining technique, including filling and emptying of the pipe. Since the loop is executed 320 times in a frame, the reduction from 1.2 cc/MAC to 1.05 cc/MAC will result in a gain of 1920 clock cycles per frame. The code of the completely unrolled loop is

    tmp = Ladd(Lmult(wt[0], itpa[sive]), tmp);
    tmp = Ladd(Lmult(wt[1], itpa[sive+1]), tmp);
    ...
    tmp = Ladd(Lmult(wt[39], itpa[sive+39]), tmp);

A disadvantage with loop unrolling is that the size of the microcode memory is increased, in this case from 5 lines of microcode in the initial scheduling to 48 lines in the totally unrolled loop. The direct memory addressing has the same disadvantage of increasing the address memory: the direct addresses are stored in the address memory and hence its size will increase. By software pipelining the optimal result is achieved in 3 lines of microcode: one to fill the pipe, one for the multiply-accumulate operation, which is put in a loop, and one to empty the pipe.
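The trade-off between microcode size and cycle count can be tabulated with the figures quoted in this section. The sketch below is only bookkeeping: the struct and function names are hypothetical, and cc/MAC is stored in hundredths so that the arithmetic stays in integers.

```c
/* One implementation variant of the inner loop, using figures from the text. */
typedef struct {
    const char *name;
    int microcode_lines;
    int cc_per_mac_x100;   /* clock cycles per MAC, times 100 */
} variant;

const variant variants[] = {
    { "initially scheduled",  5, 600 },
    { "completely unrolled", 48, 120 },
    { "software pipelined",   3, 105 },
};

/* The inner loop runs 4 * 80 = 320 times per frame with 40 MACs each. */
enum { INNER_LOOP_RUNS = 4 * 80, MACS_PER_RUN = 40 };

long cycles_per_frame(const variant *v)
{
    return (long)INNER_LOOP_RUNS * MACS_PER_RUN * v->cc_per_mac_x100 / 100;
}
```

Evaluated for the three variants this gives 76 800, 15 360, and 13 440 cycles per frame, which reproduces both the gain of 1920 cycles between the unrolled and the pipelined code and the total reduction of more than 60 k cycles.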

6 Results

The scheduled microcode depends heavily on how the algorithm description is written. Thus, by iteratively changing, adjusting and trimming the high level code, the total clock cycle count of the algorithm was cut down from 250 k to 90 k. Further progress required substitution of critical pieces of scheduled microcode by handwritten segments. Hand coding of 15 out of 123 subroutines resulted in a reduction of the cycle count to 53 k. A fully hand coded design would reduce the clock cycle count further and reduce the size of the microcode memory, but this has not been performed. In comparison, a fully hand coded standard processor implementation results in 68 kcycles, which is considered to be the minimum for the specified algorithm on a standard processor architecture.

To avoid acoustic echoes in a conversation the delay constraint is stricter than the real time requirement. One frame of speech is 20 ms but the GSM specification requires an 8 ms processing time of the speech coder. 53 kcycles during 8 ms results in a required clock speed of less than 7 MHz. This relatively low clock frequency enables the use of a slow, and thus, low power cell library or the use of a lower supply voltage. Both alternatives result in a lower power consumption.
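The clock frequency figure follows directly from the cycle count and the processing time budget. A minimal sketch of the arithmetic, with a hypothetical function name and the budget expressed in microseconds to stay in integer arithmetic:

```c
#include <stdint.h>

/* Minimum clock frequency in Hz needed to execute `cycles` clock cycles
 * within `budget_us` microseconds (rounded up). */
int64_t required_clock_hz(int64_t cycles, int64_t budget_us)
{
    return (cycles * 1000000 + budget_us - 1) / budget_us;
}
```

For 53 000 cycles in 8 ms the function returns 6 625 000 Hz, i.e. less than 7 MHz as stated above. Applying the same 8 ms budget to the 90 k cycles reached before hand coding would give 11.25 MHz; this second figure is an illustration and is not quoted in the paper.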

An implementation on a Plessey gate array requires 21 000 gates with an additional 0.75 k 16 bit RAM. ROMs and PLAs are mapped on random logic using the Synopsys system [30]. The speech coder design does not occupy all the gates on a large gate array so additional signal processing can be integrated on the same ASIC.

7 Conclusion

The GSM speech coder has been mapped on a gate array cell library. The programming procedure is similar to coding the algorithm on a general purpose DSP, with the main difference of using a dedicated instruction set suited for the algorithm. Furthermore, the resulting design is in a netlist format instead of a standard component. This makes it easier to integrate the processor with other parts of a system and to add dedicated I/O circuitry. In order to facilitate mapping to several cell libraries, only a few general purpose cells are used throughout the design. Having the DSP in a netlist format and using a limited number of cells offers the freedom to choose cell library, technology, and vendor late in the design procedure. Thus, transferring a design between different implementations becomes possible.

Depending on clock frequency requirements a suitable cell library can be used. In the GSM speech coder application the number of instructions per speech frame is lower than, but of the same order of magnitude as, that of a standard DSP implementation. However, the implementation technique can be derived from the application requirements and a low power cell library can be used due to the relatively low clock frequency. In other applications the use of an efficient processor architecture can make it possible to reduce the clock frequency even more and thus the power consumption.

The possibilities to affect the synthesis process at all stages are proven to be important: C and microcode optimization, datapath modification, controller partitioning, etc. A system has to assist the designer to achieve a result but not force the usage of a predefined structure or solution.

The resulting processor architecture for the GSM speech coder algorithm is similar to that of a general purpose DSP. The number of micro instruction cycles it takes to process the algorithm is less but of the same order of magnitude. This is expected since the hardware is defined in the algorithm specification. The design process has shown that a good result can be achieved with an automated method and stepwise refinement. An important conclusion is that the possibility to design custom DSPs has to be considered already in the algorithm development phase in order to fully utilize the potential of an ASIC.

References

1. M. Mouly and M.-B. Pautet, The GSM System for Mobile Communications, published by the authors, 49 rue Louise Bruneau, F-91120 Palaiseau, France, 1992.

2. S.M. Redl, M.K. Weber, and M.W. Oliphant, An Introduction to GSM, Mobile Communications series, Artech House, 1995.

3. "Recommendation GSM 06.10 GSM full rate speech transcoding," April 15, 1989. Version: 3.01.02.

4. L. Hanzo and J. Stefanov, "The Pan European Digital Cellular Mobile Radio System--known as GSM," in Mobile Radio Communications, Raymond Steele (Ed.), Ch. 8. Pentech Press, 1992.

5. D. Weinsziehr, H. Ebert, G. Mahlich, J. Preissner, H. Sahm, J.M. Schuck, H. Bauer, K. Hellwig, and D. Lorenz, "KISS-16V2: A One-Chip ASIC DSP Solution for GSM," IEEE J. of Solid-State Circuits, Vol. 27, pp. 1057-1066, July 1992.

6. P.G. Paulin, C. Liem, T.C. May, and S. Sutarwala, "DSP Design Tool Requirements for Embedded Systems: A Telecommunication Industrial Perspective," Journal of VLSI Signal Processing, Vol. 9, pp. 23-47, 1995.

7. Plessey Semiconductors, CMOS Semi-Custom, CLA70000 ASIC Handbook, 1992.

8. M. Faulkner, T. Mattsson, and W. Yates, "Adaptive Linearisation Using Pre-Distortion," in Proc. of 40th IEEE Vehicular Technology Conference, 1990.

9. A. Wass, B. Ekelund, and M. Torkelson, "A Silicon Implementation of a GMSK Modulated Two Ray Radio Channel Simulator," in International Symposium on Signals, Systems, and Electronics, 1989.

10. V. Owall, M. Torkelson, and P. Egelberg, "A Parallel 2 Gops/s Image Convolution Processor with Low I/O Bandwidth," IEEE ASIC'95 Conference and Exhibit, 1995.

11. R.W. Brodersen (Ed.), Anatomy of a Silicon Compiler, Kluwer Academic Publishers, 1992.

12. C.B. Shung, R. Jain, K. Rimey, E. Wang, M.B. Srivastava, B.C. Richards, E. Lettang, S.K. Azim, L. Thon, P.N. Hilfinger, J.M. Rabaey, and R.W. Brodersen, "An Integrated CAD System for Algorithm-Specific IC Design," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. CAD-10, pp. 447-463, April 1991.

13. J.M. Rabaey, H. De Man, J. Vanhoof, G. Goossens, and F. Catthoor, "CATHEDRAL-II: A Synthesis System for Multiprocessor DSP Systems," in Silicon Compilation, Daniel D. Gajski (Ed.), Ch. 8. Addison-Wesley, 1988.

14. K.E. Rimey, "A Compiler for Application-Specific Signal Processors," Ph.D. Thesis, University of California at Berkeley, Sept. 1989.

15. L.E. Thon, K. Rimey, and L. Svensson, "From C to Silicon," in reference [11], Chapter 17.

16. P. Andreani, "An Environment for Application Specific Digital Signal Processor Synthesis," Technical report, Dept. of Applied Electronics, Lund University, Sweden, May 1993.

17. L. Brange and M. Torkelson, "A Basic CAD-tool for modulo generation," in Proc. of European Solid State Circuits Conference, 1989.

18. L.P. Rubinfield, "A Proof of the Modified Booth Algorithm for Multiplication," IEEE Transactions on Computers, pp. 1014-1015, October 1975.

19. V. Owall and M. Torkelson, "Controller Synthesis for Application Specific Digital Signal Processors," in Proc. of ASIC'91 Conference and Exhibit, 1991.

20. V. Owall, "Synthesis of Controllers from a Range of Controller Architectures," Ph.D. Thesis, Dept. of Applied Electronics, Lund University, Sweden, December 1994.

21. K. Azim, "Application of Silicon Compilation Techniques to a Robot Controller," Ph.D. Thesis, University of California at Berkeley, September 1988.

22. S.K. Azim, C.-S. Shung, and R.W. Brodersen, "Automatic generation of a Custom Digital Signal Processor for an Adaptive Robot Arm Controller," in Proc. of ICASSP'88, 1988.

23. J. Zegers, P. Six, J.M. Rabaey, and H. De Man, "CGE: Automatic Generation of Controllers in the CATHEDRAL-II Silicon Compiler," in Proc. of The European Conference on Design Automation, 1990.

24. L. Gerbaux and G. Saucier, "Automatic synthesis of large Moore sequencers," in Proc. of The European Conference on Design Automation, 1992.

25. F. Catthoor, J.M. Rabaey, G. Goossens, J.L. van Meerbergen, R. Jain, H.J. De Man, and J. Vandewalle, "Architectural Strategies for an Application-Specific Synchronous Multiprocessor Environment," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, pp. 265-284, February 1988.

26. S. Dasgupta, "The Organization of Microprogram Stores," Computing Surveys, Vol. 11, No. 1, pp. 39-65, March 1979.

27. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990.

28. Viewlogic Systems, Inc., Workview, 1989.

29. P. Nilsson and M. Torkelson, "A CMOS VLSI Cell Library for Digital Signal Processors," in Proc. of 14th Nordic Semiconductor Meeting, 1990.

30. Synopsys, Version 3.0, 1992.

Viktor Owall received the M.S.E.E., the Tekn. Licentiat degree and the Ph.D. from Lund University, Lund, Sweden in 1988, 1991 and 1994 respectively. Since May 1988 he has been with the Department of Applied Electronics, Lund University, in the digital signal processing group. He is working mainly in the field of automated DSP design, especially controller synthesis, controller architectures, and control flow dominated applications. Mr. Owall is currently with the Electrical Engineering Department at UCLA on a postdoc grant from the Swedish Research Council for Engineering Sciences. vikt @tde.lth.se

Pietro Andreani received the M.S.E.E. from Pisa University, Pisa, Italy, in 1988, and the Tekn. Licentiat degree from Lund University, Lund, Sweden, in 1993. He joined the digital signal processing group at the Dept. of Applied Electronics, Lund University, in December 1989. After working at the Dept. of Applied Electronics, Pisa University, as a CMOS IC designer during 1994, he has rejoined the Dept. of Applied Electronics in Lund, where analog and mixed-mode CMOS IC's are currently his main research field.

Anders Wass received the M.S.E.E. and Tekn. Lic. degrees in electrical engineering from Lund University, Lund, Sweden, in 1987 and 1992 respectively. He joined the digital signal processing group at the Dept. of Applied Electronics, Lund University, in January 1988 as a researcher focusing on ASICs for digital mobile communication. From 1990 to 1992 he did part time research at University of California at Berkeley concerning simulation of digital communication systems. His major interests are in the areas of ASICs for digital communications and simulation of digital communication systems. Currently he is working at Ericsson Components in Stockholm, Sweden.

Lars Brange received the M.S.E.E. in 1985 and the Tekn. Licentiat degree in 1989, both from the Dept. of Applied Electronics at Lund University, Lund, Sweden. He was a member of the digital signal processing group at this Dept. from 1985 until 1992. His main work was in the field of interactive VLSI design systems for automatic generation of datapath structures targeted for DSPs. Currently Mr. Brange is with Aurum Scandinavia AB.

Mats Torkelson received the M.S.E.E. degree in electrical engineering in 1980 at ETH Zürich, and the Tekn. Lic. degree and the Ph.D. from Lund University, Lund, Sweden in 1985 and 1990 respectively. He has worked with AD/DA converters for professional audio tape recorders at Willi Studer AG, Switzerland and with maritime X-band radar collision avoidance systems at Lund University. During the period 1984 to 1986 he was part time at the University of California, Berkeley. Mr. Torkelson heads the digital signal processing group at the Dept. of Applied Electronics, Lund University, which he initiated in 1986. Since 1994 he has worked part time with Ericsson Radio Systems, Stockholm, Sweden. His current interests are mobile communication, algorithm implementation, and amplifier design.

Peter Nilsson received the M.S.E.E. and the Tekn. Licentiat degree from Lund University, Lund, Sweden in 1988 and 1992 respectively. Since February 1988 he has been with the Dept. of Applied Electronics, Lund University, in the digital signal processing group. He is working mainly in the field of silicon implementation of DSPs related to the communication area, especially with serial-parallel architectures.
