
Convolutional Coding on Xtensa Processors
Application Note

Tensilica, Inc.
3255-6 Scott Blvd.
Santa Clara, CA 95054
(408) 986-8000
Fax (408) 986-8919
www.tensilica.com

January 2009
Doc Number: AN01-123-04

TENSILICA,INC.



© 2005-2008 Tensilica, Inc.

All Rights Reserved

Printed in the United States of America

This publication is provided AS IS. Tensilica, Inc. (hereafter Tensilica) does not make any warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Information in this document is provided solely to enable system and software developers to use Tensilica processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual property rights or licenses granted hereunder to design or fabricate Tensilica integrated circuits or integrated circuits based on the information in this document. Tensilica does not warrant that the contents of this publication, whether individually or as one or more groups, meets your requirements or that the publication is error-free. This publication could include technical inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated in new editions of this publication.

The following terms are trademarks of Tensilica, Inc.: OSKit, Tensilica, Vectra, and Xtensa. All other trademarks and registered trademarks are the property of their respective companies.

Document Change History:

September 1998 (Revised January, 2001; February, 2005)

January 2009


Digitally signed by Tensilica Technical Publications. Reason: Certified original Tensilica document, 1/2009.




Contents

1 Communication System Challenges
2 A Simple Encoder
3 The Encoding Process
4 Viterbi Decoding
5 Details of the Viterbi Algorithm
6 Distance Metric Calculation
7 The Trellis Decode Butterfly
8 Implementation on Base Xtensa
9 Full Optimization with TIE
10 Demonstration Instructions
11 Summary
Appendix A VTB2.TIE Code



Figures

Figure 1: Communication System Block Diagram
Figure 2: Simple Convolutional Encoder
Figure 3: Convolutional Encoder State Diagram
Figure 4: Trellis Diagram Showing Most-Likely Path Through States
Figure 5: Distance Metric Graph
Figure 6: Four Butterflies in a Trellis Time Step (K=4)
Figure 7: Butterfly with Distance Metric
Figure 8: Adding State and Branch Distances Metrics
Figure 9: Selecting Smallest Accumulated Distance Metric
Figure 10: Butterfly Operation Diagram

Tables

Table 1: Distance Metric Values



Abstract

This application note looks briefly at popular techniques for convolutional encoding and decoding, especially Viterbi decoding, and illustrates the power of a configurable processor in handling the performance-intensive signal processing demands of coding and decoding. Application-specific processors can be quickly designed, simulated, and built in silicon, and they offer significantly better programmability, performance, and power efficiency than most popular digital signal processors (DSPs). In particular, this paper describes user-defined TIE (Tensilica Instruction Extension) instructions which accelerate distance metric calculations, the most performance-critical task in Viterbi decoding, by 32x over most popular DSPs and 155x over most popular 32-bit RISC cores.

This application note assumes that the reader is familiar with Viterbi decoding, the Xtensa Instruction Set Architecture, and the Tensilica Instruction Extension description language. Please refer to the Xtensa ISA Reference Manual and the Tensilica Instruction Extension (TIE) Language User's Guide for additional information.



1 Communication System Challenges

One of the greatest challenges in communication system design is efficient transmission and reception of information in the presence of errors introduced by the communication channel. The presence of errors is especially pronounced in radio communication, due to the variety of noise sources in the channel. Designers have adopted block coding methods that add redundancy in the encoding of information before transmission. Although the addition of redundant data reduces the overall throughput of the channel, forward error correction improves performance by using the redundant data to correct errors during decoding at the receiver, as shown in Figure 1.

[Figure 1: original data stream -> encoder -> encoded data -> noisy channel -> encoded data + noise -> decoder -> recovered data stream]

FIGURE 1: COMMUNICATION SYSTEM BLOCK DIAGRAM

Convolutional coding, that is, coding based on time-invariant finite state machines, is widely used in wireless communications. This application note looks briefly at popular techniques for convolutional encoding and decoding, especially Viterbi decoding. It illustrates the power of a configurable processor in handling the performance-intensive signal processing demands of coding and decoding. Specifically, it describes user-defined instructions, written in the Tensilica Instruction Extension (TIE) language, which accelerate distance metric calculations, the most performance-critical task in Viterbi decoding, by more than 32 times over most popular digital signal processors and 155 times over most popular 32-bit RISC cores.

2 A Simple Encoder

In convolutional encoding, each new coded bit for transmission is generated by a convolution of the current input bit with some number of earlier input bits and a masking polynomial. The ability of the decoder to detect and correct errors in transmission depends on the number of input bits used in the convolution. That number of bits is called the constraint length. Redundancy is added to the bit stream by the generation of more than one bit of encoded output for each input bit. This ratio of input bits to output bits is called the coding rate. For example, a coding rate of 1/2 will generate 2 output bits from 1 input bit. Popular wireless communication standards (GSM, IS-95, IS-136) use constraint lengths from 5 to 9 and coding rates from 1/2 to 1/4.

A simple convolutional encoder, with a constraint length of 4 and a coding rate of 1/2, is shown in Figure 2. For each new input bit x(i), two new outputs, G0 and G1, are generated for transmission.



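[Figure 2: input Xi feeds a chain of three one-sample delay elements (D); exclusive-OR gates combine the current and delayed bits to form the outputs G0,i and G1,i.]

FIGURE 2: SIMPLE CONVOLUTIONAL ENCODER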

This example implements the convolutional code represented by the polynomials $G_0 = 1 + x + x^3$ and $G_1 = 1 + x + x^2 + x^3$.

The polynomial formulas listed above are a convenient way to show which bits feed the XOR logic that forms each output: the term 1 stands for the current bit (X0), and the terms x, x^2, and x^3 stand for the delayed bits (X1, X2, X3). For example, output G0 (1 + x + x^3) is calculated by XORing the current bit (X0), the previous bit (X1), and the third previous bit (X3). Output G1 (1 + x + x^2 + x^3) is calculated by XORing the current bit (X0), the previous bit (X1), the second previous bit (X2), and the third previous bit (X3).

This encoder can also be expressed as a state diagram, as shown in Figure 3. Each of the

states is labeled with a state number corresponding to the state of the three delay elements of

the circuit above. Note that the most recent bit is assigned to the LSB, while the third previous

bit is assigned to the MSB. Each of the arcs is labeled x, G0, G1 (the input bit x for that arc, and

the G0, G1 outputs for that input).

[Figure 3: state diagram with eight states (000 through 111); each arc is labeled x, G0, G1.]

FIGURE 3: CONVOLUTIONAL ENCODER STATE DIAGRAM

It is convenient to view the encoder as a state diagram showing arcs from one encoder state to another. Each arc is labeled with the corresponding input bit and encoder output bits. Later, this state diagram is converted to a trellis diagram to represent state arcs with respect to time. Note that, except for the encoder outputs, the state representation remains unchanged for any basic convolutional encoder with the same constraint length, because the shifting pattern of bits through the encoder remains the same. Different polynomials will generate different outputs for each arc going from one state to another.
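The state diagram can also be generated programmatically. The following is a minimal sketch in C, not part of the original demonstration code: the type `arc_t` and the function `encoder_arc` are hypothetical names introduced here for illustration. It assumes the state labeling described above (most recent bit in the LSB, third previous bit in the MSB) and the polynomials G0 = 1 + x + x^3 and G1 = 1 + x + x^2 + x^3.

```c
#include <assert.h>

/* One arc of the state diagram: next state plus the two encoder outputs. */
typedef struct {
    unsigned next;  /* state after shifting in the input bit */
    unsigned g0;    /* output for G0 = 1 + x + x^3           */
    unsigned g1;    /* output for G1 = 1 + x + x^2 + x^3     */
} arc_t;

/* state bit 0 (LSB) = previous input bit, bit 1 = second previous bit,
 * bit 2 (MSB) = third previous bit. x is the current input bit. */
arc_t encoder_arc(unsigned state, unsigned x)
{
    assert(state < 8u && x < 2u);
    unsigned x1 = state & 1u;
    unsigned x2 = (state >> 1) & 1u;
    unsigned x3 = (state >> 2) & 1u;
    arc_t a;
    a.g0 = x ^ x1 ^ x3;               /* polynomial-specific */
    a.g1 = x ^ x1 ^ x2 ^ x3;          /* polynomial-specific */
    a.next = ((state << 1) | x) & 7u; /* same for any K = 4 encoder */
    return a;
}
```

Calling `encoder_arc` for all eight states and both input values yields the sixteen arcs of the diagram. Only the two XOR lines depend on the polynomials; the next-state computation is shared by every encoder with this constraint length, which is the point made in the paragraph above.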



3 The Encoding Process

The convolutional encoder described in the previous sections can be implemented either as a hardware state machine or as a software routine running on a processor. Although the hardware implementation for a given encoding polynomial is typically quite simple, a software implementation offers valuable flexibility. The increasing need for adaptive and multi-protocol communication equipment makes a processor-based solution appropriate in many circumstances.

Below is a C implementation of the encoder that was shown earlier.

    // Sample Convolutional Encoder
    // Constraint length 4 and coding rate 1/2
    // G0 = 1 + x + x^3 and G1 = 1 + x + x^2 + x^3

    /* input data for Convolutional Encoder */
    char IN[FrameSize];
    /* output data from Convolutional Encoder */
    char G0[FrameSize], G1[FrameSize];

    void convolve()
    {
        int f;
        for (f = 0; f < FrameSize; f++)
        {
            if (f >= 3)
            {
                // Note that ANSI C XOR operations are + in polynomial representation
                G0[f] = IN[f] ^ IN[f-1] ^ IN[f-3];
                G1[f] = IN[f] ^ IN[f-1] ^ IN[f-2] ^ IN[f-3];
            }
            else if (f == 2) // Assume Delay element 3 flushed to zero
            {
                G0[f] = IN[f] ^ IN[f-1];
                G1[f] = IN[f] ^ IN[f-1] ^ IN[f-2];
            }
            else if (f == 1) // Assume Delay elements 2-3 flushed to zero
            {
                G0[f] = IN[f] ^ IN[f-1];
                G1[f] = IN[f] ^ IN[f-1];
            }
            else if (f == 0) // Initial Condition:
                             // All Delay elements are flushed to zero
            {
                G0[f] = IN[f];
                G1[f] = IN[f];
            }
        }
    }



Encoding can be rewritten, as in the pseudo code below, to take advantage of the Xtensa processor's funnel shift and XOR instructions.

    // Pseudo Code for encoder
    // G0 = 1 + X + X^3 & G1 = 1 + X + X^2 + X^3
    // N = number of input bits in frame

    // Assign Encoder Input & Output Stream
    int *Input_Ptr = &Input;
    int *Output_Ptr_G0 = &Output_G0;
    int *Output_Ptr_G1 = &Output_G1;

    // Initialize Input32_old to zero
    Input32_old = 0;

    // Encode 32 input bits per iteration
    for (i = 0; i ...




    // inner loop of k = 4, r = 1/2 encoding for the
    // G0 = 1 + x + x^3 and G1 = 1 + x + x^2 + x^3
    // input data for Convolutional Encoder
    //
    // computes 64 output pairs per iteration

    // a2 points to the word containing the next 64 input bits
    //    organized with oldest bit in the msb of the word
    // a14 points to the output buffer for G0
    // a15 points to the output buffer for G1
    // a8 contains the oldest 32 input bits from the previous iteration

        movi.n  a1, N/64
        loopnez a1, loopend     // use zero overhead loop, N is number of bits to encode

        l32i    a3, a2, 0       // a3 contains low 32b of input stream (1)
        l32i    a9, a2, 4       // a9 contains high 32b of input stream (1)

        // note that a8 contains high 32b of previous iteration
        ssai    1               // funnel shift 64b by one sample time
        src     a4, a8, a3      // a4 contains low delayed by one (x)
        src     a10, a3, a9     // a10 contains high delayed by one (x)
        ssai    2               // funnel shift 64b by two sample times
        src     a5, a8, a3      // a5 contains low delayed by two (x^2)
        src     a11, a3, a9     // a11 contains high delayed by two (x^2)
        ssai    3               // funnel shift 64b by three sample times
        src     a6, a8, a3      // a6 contains low delayed by three (x^3)
        src     a12, a3, a9     // a12 contains high delayed by three (x^3)

        // compute G0 & G1 for all low 32b
        xor     a4, a4, a3      // G0 = 1 + x
        xor     a4, a4, a6      //      + x^3
        xor     a5, a5, a4      // G1 = G0 + x^2

        // compute G0 & G1 for all high 32b
        xor     a10, a10, a9    // G0 = 1 + x
        xor     a10, a10, a12   //      + x^3
        xor     a11, a11, a10   // G1 = G0 + x^2

        s32i    a4, a14, 0      // store G0 low 32b
        s32i    a5, a15, 0      // store G1 low 32b
        s32i    a10, a14, 4     // store G0 high 32b
        s32i    a11, a15, 4     // store G1 high 32b
        addi    a2, a2, 8       // advance input pointer by 64b
        addi    a14, a14, 8     // and output pointers by 64b
        addi    a15, a15, 8
        mov     a8, a9          // save high 32b for use in next iteration
    loopend:
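The word-parallel idea behind the assembly listing can be modeled in portable C. The sketch below is illustrative only: `funnel_shift` and `encode_word` are hypothetical names, and the funnel shift is emulated with ordinary shifts under the assumption that it takes the 64-bit value formed by the previous word (upper half) and the current word (lower half), shifts it right, and keeps the low 32 bits. Input bits are assumed to be packed with the oldest bit in the MSB, as in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Emulated funnel shift: shift the 64-bit value prev:cur right by n
 * and keep the low 32 bits. Valid for shift amounts 1..31. */
static uint32_t funnel_shift(uint32_t prev, uint32_t cur, unsigned n)
{
    assert(n >= 1u && n <= 31u);
    return (prev << (32u - n)) | (cur >> n);
}

/* Encode 32 input bits at once.
 * prev holds the preceding 32 input bits (0 at the start of a frame).
 * Each output word holds 32 encoded bits in the same bit order. */
void encode_word(uint32_t prev, uint32_t cur, uint32_t *g0, uint32_t *g1)
{
    uint32_t d1 = funnel_shift(prev, cur, 1); /* input delayed by one   (x)   */
    uint32_t d2 = funnel_shift(prev, cur, 2); /* input delayed by two   (x^2) */
    uint32_t d3 = funnel_shift(prev, cur, 3); /* input delayed by three (x^3) */
    *g0 = cur ^ d1 ^ d3;  /* G0 = 1 + x + x^3 */
    *g1 = *g0 ^ d2;       /* G1 = G0 + x^2    */
}
```

To process a frame, a caller would invoke `encode_word` once per 32-bit input word and pass the word just encoded as `prev` for the next call, which mirrors the way the assembly loop saves the high word in a8.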

The assembly routine listed above encodes about 2.5 bits per cycle. The cost of this convolutional coding technique generalizes to 11 + ((k-1) * 5) cycles for each 64 input bits, where k is the constraint length. The actual performance depends on the polynomials used. The convolutional coding performance of a base Xtensa processor is comparable to that of a 16-bit DSP, such as members of the Texas Instruments TMS320C54x family. This class of DSPs is capable of coding 1.5 bits per cycle for a set of polynomials with k = 5 (see: Viterbi Decoding Techniques in the TMS320C54x Family, Henry Hendrix, Texas Instruments Application Note SPRA071, June 1996). For the same polynomials, performance on an Xtensa processor is about 1.8 bits per cycle.
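The cycle estimate can be checked with a few lines of arithmetic. The helper names below are hypothetical and simply evaluate the formula quoted in the text. For k = 4 the formula gives 26 cycles per 64 input bits, or roughly 2.46 bits per cycle, which matches the figure of about 2.5 quoted above.

```c
#include <assert.h>

/* Cycles needed to encode 64 input bits, per the formula in the text. */
int encoder_cycles_per_64_bits(int k)
{
    assert(k >= 2);
    return 11 + (k - 1) * 5;
}

/* Input bits encoded per cycle implied by that cycle count. */
double encoder_bits_per_cycle(int k)
{
    return 64.0 / encoder_cycles_per_64_bits(k);
}
```

For k = 5 the same formula gives 31 cycles, a little over 2 bits per cycle. The lower figure of about 1.8 quoted for the comparison polynomials is consistent with the statement that actual performance depends on the polynomials used.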



4 Viterbi Decoding

The goal of decoding a received bit stream is to find the maximum-likelihood output sequence given the received sequence, which is a combination of the transmitted sequence plus noise. Viterbi decoding offers an efficient algorithm to find this output sequence. It is based on a decoder that attempts to estimate, using the received data sequence, the likelihood that the encoder is in each of its possible states. The graphical modeling of all possible state transitions has come to be called a trellis diagram. A simple trellis diagram is shown below. The trellis diagram is a different way of modeling the state diagram that was shown earlier, but with the added dimension of time. This diagram is used to determine the correct path through the states, based on a particular transmitted sequence, assuming the encoder started in the idle state (000). The challenge for the decoder is to predict this path even when some of the incoming bits (G0, G1) may have been corrupted by noise.

[Figure 4: trellis with the eight encoder states (000 through 111) repeated in columns for Time 0 through Time 4, annotated with the received G0,G1 pairs and the most-likely path through the states.]

FIGURE 4: TRELLIS DIAGRAM SHOWING MOST-LIKELY PATH THROUGH STATES



5 Details of the Viterbi Algorithm

The Viterbi decode algorithm works in two phases. In the first phase, the update phase, the incoming data is analyzed in sequence order. The maximum-likelihood decoder works by maintaining a running estimate of the appropriateness of each possible path through the trellis for the received data sequence. Starting from a known initial state, and for each successive received input pair (G0, G1), the decoder calculates a distance metric between the received input pair and the expected pair corresponding to each state arc in the diagram. The distance metric calculation method is discussed later. The shortest path, the series of arcs with the smallest total distance metric, is taken to be the most-likely path through the trellis diagram. Each path implies a unique state sequence in the encoder, and thus a unique input sequence. This phase is considered the most CPU-intensive task within the Viterbi algorithm, so the remainder of this application note focuses on this area.

In the second phase, the trace back phase, the sequence of arc decisions must be traced back to reconstruct the inferred inputs to the encoder. Recalling that the most recent data shifted into the delay line is the LSB of the state, the inputs based upon the trellis diagram above are inferred to be (1,0,0,0). This phase can be easily accomplished by examining the LSB of each of the states, tracing backward through the most-likely path.

Several popular techniques are used to calculate distance metrics. In general, these methods are categorized as either hard decision decoding or soft decision decoding. In a soft decision decoder, the input to the decoder is an integer in the range between +B and -B. Therefore, the strength of the signal can be used as information by the decoder. In a hard decision decoder, threshold detection is used to quantize input signals into either of two states: +1 or -1. Soft decision decoding with infinite range provides approximately 2.2 dB better coding gain than hard decision decoding, at the expense of slightly more complexity in the decoder.

6 Distance Metric Calculation

In the trellis diagram shown in Figure 4, there are arcs leading from states in one trellis column to states in the next trellis column. Each of these arcs has an associated local distance (branch metric). Recall that the state diagram shown earlier labels each arc with the encoder outputs for each transition. The local distance is determined by comparing the actual received data to the expected encoder outputs for a given arc.

The Hamming Distance technique is one of the more popular techniques used for calculating distance metrics. For a coding rate of 1/2, we can imagine the actual data, G0 and G1, to indicate position in two different dimensions. Each arc in the trellis diagram has a corresponding input pair, R0 and R1, which is the expected output for that arc. The diagram below shows both actual and expected data represented as points in a Cartesian plane. The Hamming distance is determined by adding the absolute differences in each dimension (|G0-R0| + |G1-R1|).


[Figure 5: the expected point (R0, R1) and the actual point (G0, G1) plotted in a plane, with the Hamming distance and the straight-line distance between them marked.]

FIGURE 5: DISTANCE METRIC GRAPH

Another popular distance metric technique is the Euclidian (Square) Distance technique. The Euclidian (Square) Distance is the square of the straight-line distance between two symbols. Using the Pythagorean Theorem, the straight-line distance between the actual and expected input pairs of the previous diagram is calculated as follows:

$$\sqrt{(G_0 - R_0)^2 + (G_1 - R_1)^2}$$

Remove the square root from the straight-line distance calculation to get the Euclidian (Square) Distance. There is a slight bit error rate (BER) performance penalty for using the Euclidian (Square) Distance when compared to the straight-line distance, yet this penalty is negligible when compared with the reduction in complexity. Expanding the Euclidian (Square) distance metric results in the following equation:

$$G_0^2 - 2 R_0 G_0 + R_0^2 + G_1^2 - 2 R_1 G_1 + R_1^2$$

Note that the distance metric for a given arc will be compared against distance metrics of other arcs within the same trellis column. Addition of a constant or multiplication by a positive constant will not affect the comparison. Therefore, distance metric calculation can be simplified by removing constants and constant multipliers. G0 and G1 are actual inputs, which have a range between +B and -B, yet are constant throughout the trellis column. Therefore, the squares of G0 and G1 can be eliminated. Since the expected inputs, R0 and R1, have possible values of +B or -B, the squares of R0 and R1 become B^2, which is a constant and can be eliminated. Thus, the distance metric can be further simplified by removing these constants:

$$-2 R_0 G_0 - 2 R_1 G_1$$

Removing the constant multiplier 2 in the equation above leaves

$$-R_0 G_0 - R_1 G_1$$

Recalling that R0 and R1 have possible values of +B or -B, the distance metric is simplified as shown in the following table:



TABLE 1: DISTANCE METRIC VALUES

Expected Data (R0, R1)   Distance Metric   Removing Constant B   Replace with Sum, Diff
+B, +B                   -BG0 - BG1        -G0 - G1              -Sum
+B, -B                   -BG0 + BG1        -(G0 - G1)            -Diff
-B, -B                   +BG0 + BG1        G0 + G1               Sum
-B, +B                   +BG0 - BG1        G0 - G1               Diff

Note: Sum = G0 + G1; Diff = G0 - G1

The distance metric calculation has been greatly simplified to +/- the sum or difference of the received data. To determine the local distance of a particular arc, determine the expected data for that arc and replace it with the corresponding equation using the table above.
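The simplification can be checked numerically. The sketch below is an illustration rather than code from the demonstration project; `branch_metric` and `squared_distance` are hypothetical names. It assumes that an expected encoder output bit of 1 maps to +B and a bit of 0 maps to -B, a mapping chosen here because it is consistent with the butterfly example later in this note.

```c
#include <assert.h>

/* Simplified local distance for one arc.
 * e0, e1: expected encoder outputs for the arc, as bits (0 or 1).
 * g0, g1: received soft values in the range -B..+B.
 * Assumption: bit 1 corresponds to +B and bit 0 to -B. */
int branch_metric(int e0, int e1, int g0, int g1)
{
    assert((e0 == 0 || e0 == 1) && (e1 == 0 || e1 == 1));
    int r0 = e0 ? 1 : -1;   /* sign of R0, constant B removed */
    int r1 = e1 ? 1 : -1;   /* sign of R1, constant B removed */
    return -(r0 * g0) - (r1 * g1);
}

/* Full Euclidian (Square) distance for the same arc, for comparison. */
int squared_distance(int e0, int e1, int g0, int g1, int B)
{
    int r0 = e0 ? B : -B;
    int r1 = e1 ? B : -B;
    return (g0 - r0) * (g0 - r0) + (g1 - r1) * (g1 - r1);
}
```

For any received pair, the two functions differ only by the constant G0^2 + G1^2 + 2B^2 and the positive scale factor 2B, so they rank the four possible arcs in the same order. With G0 = 5 and G1 = -3, for example, the four simplified values are -Sum = -2, -Diff = -8, Sum = 2, and Diff = 8, matching the four rows of Table 1.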

7 The Trellis Decode Butterfly

To aid in implementation, it is often helpful to arrange calculations in functional groups. The procedure for doing the calculations on a single group can become a template to be used on other like groups. A butterfly can be visualized as a grouping of 2 source states, 2 destination states, and 4 arcs between them. For the trellis diagram shown earlier, with 8 states per column, a time step from one trellis column to another can be visualized as 4 butterflies as shown below.

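[Figure 6: four butterflies, each connecting two source states to two destination states; for example, source states 000 and 100 connect to destination states 000 and 001.]

FIGURE 6: FOUR BUTTERFLIES IN A TRELLIS TIME STEP (K=4)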
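The grouping of states into butterflies follows directly from the shift-register structure. The sketch below uses hypothetical names (`butterfly_t`, `butterfly_states`) and assumes the state labeling used throughout this note, with the most recent input bit in the LSB. Under that labeling, the two source states of a butterfly differ only in their MSB, and the two destination states differ only in their LSB.

```c
#include <assert.h>

/* Source and destination states of one butterfly. */
typedef struct { unsigned src0, src1, dst0, dst1; } butterfly_t;

/* Butterfly b of a trellis with num_states states per column
 * (8 for K = 4, 16 for K = 5). */
butterfly_t butterfly_states(unsigned b, unsigned num_states)
{
    assert(num_states >= 2u && b < num_states / 2u);
    butterfly_t f;
    f.src0 = b;                    /* source state with MSB = 0          */
    f.src1 = b + num_states / 2u;  /* same state with MSB = 1            */
    f.dst0 = 2u * b;               /* input bit 0 shifted into the LSB   */
    f.dst1 = 2u * b + 1u;          /* input bit 1 shifted into the LSB   */
    return f;
}
```

For the eight-state trellis, butterfly 0 joins source states 000 and 100 to destination states 000 and 001, which is the butterfly examined in detail in the rest of this section.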



Let's take a closer look at a single butterfly calculation. The diagram below shows a butterfly diagram with corresponding encoder output values for each arc. The encoder outputs are translated into local distances as per the previous table.

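[Figure 7: butterfly from source states 000 and 100 to destination states 000 and 001. Arcs 000 -> 000 and 100 -> 001 have expected outputs -B, -B (local distance +Sum); arcs 100 -> 000 and 000 -> 001 have expected outputs +B, +B (local distance -Sum).]

FIGURE 7: BUTTERFLY WITH DISTANCE METRIC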

The heart of the butterfly calculation is sometimes called the ADD-COMPARE-SELECT operation. In the ADD stage, the accumulated distance metric is calculated by taking the local distance of each arc in the butterfly and adding it to the accumulated distance metric of the originating state. With the accumulated distance metric of the originating state named StateN (N = state number), the diagram below shows each arc's accumulated distance metric after the ADD stage.

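[Figure 8: accumulated distance metrics after the ADD stage. Into state 000: State0+Sum (from 000) and State4-Sum (from 100). Into state 001: State0-Sum (from 000) and State4+Sum (from 100).]

FIGURE 8: ADDING STATE AND BRANCH DISTANCES METRICS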

In the COMPARE stage, the distance metrics of the arcs into each destination state are compared. In the butterfly diagram, there are two arcs and two corresponding distance metrics leading into each destination state. Of the two arcs, the arc with the smallest distance metric is considered the most-likely arc and the other arc is discarded.

In the SELECT stage, the most-likely arc's accumulated distance metric is stored as the new accumulated distance metric for the state. The diagram below shows the selected arcs and the updated accumulated distance metrics, State0 and State1, assuming State0+Sum < State4-Sum and State0-Sum < State4+Sum.

[Figure 9: selected arcs after the SELECT stage: 000 -> 000 with new State0 = State0+Sum, and 000 -> 001 with new State1 = State0-Sum.]

FIGURE 9: SELECTING SMALLEST ACCUMULATED DISTANCE METRIC

The selected arcs are recorded so this information can be used during the trace back phase to reconstruct the most-likely path through the trellis. One way to code the selected arc is to use the MSB of the originating state. Hence, the most-likely arc into State 0 is coded as 0, and the most-likely arc into State 1 is also coded as 0.

The regularity of the butterfly computation suggests a set of special instructions intended to accelerate the calculation of distance metrics. Variations of the add-compare-select instructions have been implemented on advanced digital signal processors. In our C-based implementation, a macro called ACS is used to implement a variation of the ADD-COMPARE-SELECT calculation. The macro and sample usage are shown for a single butterfly operation.



    /******************************************************************
    ACS is a macro which performs a variation of the ADD-Compare-Select
    operation for each state in the Trellis. It compares 2 accumulated
    distance metrics (X, Y) of the 2 arcs leading into the state. The shortest
    arc is selected as the most-likely arc. The shortest accumulated distance
    metric is stored in S(I) and binary code which designates the most-likely
    arc to the state is stored in Select[j][I], where (I) represents the state
    and (j) represents the trellis column.
    *******************************************************************/

    #define ACS(S, I, X, Y) \
        if ((s1 = (X)) < (s2 = (Y))) { S[(I)] = s1; Select[j][(I)] = 0; } \
        else { S[(I)] = s2; Select[j][(I)] = 1; }

    Diff = G0[j] - G1[j];
    Sum = G0[j] + G1[j];

    // Using ACS macro for single butterfly
    ACS(NewState, 0, State0+Sum, State4-Sum);
    ACS(NewState, 1, State0-Sum, State4+Sum);

A butterfly operation consists of two add-compare-select calculations. The code above is used to perform the butterfly operation shown below.

[Figure 10: butterfly from source states 000 and 100 to destination states 000 and 001, with NewState[0] = Min(State0+Sum, State4-Sum) and NewState[1] = Min(State0-Sum, State4+Sum).]

FIGURE 10: BUTTERFLY OPERATION DIAGRAM

A single butterfly operation is performed for every pair of destination states within a trellis column. The same trellis column operation is performed iteratively on every subsequent trellis column until the end of the frame. Once the end of the frame is reached, the accumulated distance metrics of all states are compared, and the state with the smallest metric is taken as the ending state. The trace back phase begins with this end state. The decoder then extracts the LSB of each state as the deduced input bit and uses the coded path to trace through all prior trellis columns until the inferred input at the beginning of the frame is deduced.
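Both phases can be put together in a compact reference model for the eight-state example code. The sketch below is an illustrative C implementation under stated assumptions, not the demonstration decoder: the names `arc_metric` and `viterbi_decode` are hypothetical, the encoder is assumed to start in state 000, an expected output bit of 1 is assumed to map to +B, and the frame is limited to 64 symbol pairs to keep the select array on the stack.

```c
#include <assert.h>
#include <limits.h>

#define NUM_STATES 8

/* Simplified local distance of the arc leaving `state` with input bit x,
 * for G0 = 1 + x + x^3 and G1 = 1 + x + x^2 + x^3. */
static int arc_metric(unsigned state, unsigned x, int g0, int g1)
{
    unsigned x1 = state & 1u, x2 = (state >> 1) & 1u, x3 = (state >> 2) & 1u;
    unsigned e0 = x ^ x1 ^ x3, e1 = x ^ x1 ^ x2 ^ x3;
    return (e0 ? -g0 : g0) + (e1 ? -g1 : g1);
}

/* Decode n soft symbol pairs (g0[j], g1[j]) into n bits in out[]. */
void viterbi_decode(const int *g0, const int *g1, int n, unsigned char *out)
{
    unsigned char select[64][NUM_STATES];
    int metric[NUM_STATES], next[NUM_STATES];
    assert(n >= 1 && n <= 64);

    /* Known initial state: only state 0 is reachable at time 0. */
    for (int s = 0; s < NUM_STATES; s++)
        metric[s] = (s == 0) ? 0 : INT_MAX / 2;

    /* Update phase: four butterflies per trellis column. */
    for (int j = 0; j < n; j++) {
        for (unsigned b = 0; b < NUM_STATES / 2; b++) {
            unsigned lo = b, hi = b + NUM_STATES / 2;
            for (unsigned x = 0; x < 2; x++) {
                unsigned dst = 2 * b + x;
                int m_lo = metric[lo] + arc_metric(lo, x, g0[j], g1[j]);
                int m_hi = metric[hi] + arc_metric(hi, x, g0[j], g1[j]);
                /* Select bit = MSB of the originating state. */
                if (m_lo < m_hi) { next[dst] = m_lo; select[j][dst] = 0; }
                else             { next[dst] = m_hi; select[j][dst] = 1; }
            }
        }
        for (int s = 0; s < NUM_STATES; s++)
            metric[s] = next[s];
    }

    /* End state: smallest accumulated distance metric. */
    unsigned state = 0;
    for (unsigned s = 1; s < NUM_STATES; s++)
        if (metric[s] < metric[state]) state = s;

    /* Trace back: the LSB of each state is the decoded bit; the stored
     * select bit restores the MSB of the preceding state. */
    for (int j = n - 1; j >= 0; j--) {
        out[j] = (unsigned char)(state & 1u);
        state = (state >> 1) | ((unsigned)select[j][state] << 2);
    }
}
```

As a usage example, the input sequence (1,0,0,0) encodes to the output pairs (1,1), (1,1), (0,1), (1,1). Feeding the corresponding soft values with B = 7 into `viterbi_decode` recovers (1,0,0,0), and it still does so when the third G0 value is corrupted from -7 to +3. The large initial metric for the unreachable states assumes soft values that are small relative to INT_MAX.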

8 Implementation on Base Xtensa

A demonstration GSM Viterbi decoder and test bench were developed in C and are provided as an Xplorer workspace file, Viterbi_v2.xws. The decoder is a soft decision decoder using the Euclidian (Square) Distance metric and the ACS macro described earlier in this application note. Since GSM uses a constraint length of five, there are 16 states in every trellis column (instead of the eight states described in previous sections). Hence, GSM requires eight butterfly operations to decode a single bit (as compared to our previous example, which required only four butterfly operations).

The Viterbi_v2 project is a test bench that prepares a random frame of 1000 bits and then encodes them into GSM coded symbols. The symbols are corrupted to simulate white noise. Finally, the test bench decodes these symbols and compares the output with the original input bits. The Viterbi decoder is benchmarked for performance.



In this original form, a single bit requires 337 cycles to decode on a base Xtensa processor when using aggressive compiler optimizations (-O3 switch used in xt-xcc). Given that the Xtensa processor is at least as efficient as ARM9 and MIPS32 cores in handling ANSI C code, the performance of other 32-bit RISC cores is estimated to be similar.

9 Full Optimization with TIE

The Tensilica Instruction Extension (TIE) language provides a powerful mechanism to add instructions to the base Xtensa instruction set and to generate complete support in hardware and software tools for special purpose operations. The decode butterfly involves the addition of the local distance to the accumulated distance metrics of a pair of adjacent states, then a comparison and selection of the most-likely arc into each of the pair of states. The regularity of this computation suggests a set of special instructions intended to accelerate the butterfly calculation. Variations of add-compare-select instructions have been implemented on advanced digital signal processors to accelerate the Viterbi decoder. Likewise, variations of the add-compare-select instruction can be developed for Xtensa using TIE. Such instructions are invaluable in accelerating Viterbi decoders that support data encoded using arbitrary constraint lengths and polynomials. On the other hand, TIE could be used to develop instructions that accelerate the decoding of data generated from a specific encoder. TIE instructions that are specific to an encoder can be developed with computational performance comparable to a pure hardware implementation. The optimal choice of TIE instructions depends on the balance between flexibility and computational performance required in a given system.

Significant improvement using TIE can be achieved by creating a variation of the add-compare-select butterfly computation and defining this logic as a TIE function as shown below:

    // Viterbi ADD-COMPARE-SELECT Butterfly
    function [33:0] VBFLY ([15:0] StateA, [15:0] StateB, [15:0] Metric)
    {
        wire [15:0] neg_Metric = ~Metric + 1'b1;

        // Add state and path metric
        wire [15:0] stateA_pathA = StateA + Metric;
        wire [15:0] stateB_pathB = StateB + neg_Metric;

        // Compare accumulated metric
        wire [4:0] compA = TIEcmp(stateA_pathA, stateB_pathB, 1'b1);

        // Selected (least value) path is output
        wire [15:0] new_stateA = (compA[4]) ? stateA_pathA : stateB_pathB;
        wire SelectA = (compA[4]) ? 0 : 1;

        wire [15:0] stateA_pathB = StateA + neg_Metric;
        wire [15:0] stateB_pathA = StateB + Metric;
        wire [4:0] compB = TIEcmp(stateA_pathB, stateB_pathA, 1'b1);
        wire [15:0] new_stateB = (compB[4]) ? stateA_pathB : stateB_pathA;
        wire SelectB = (compB[4]) ? 0 : 1;

        assign VBFLY = {SelectA, SelectB, new_stateA, new_stateB};
    }

This TIE function performs the same computation as a pair of ACS macros shown in section 7.
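For readers who want to check the butterfly logic without a TIE toolchain, the function can be modeled in C. The sketch below is a behavioral approximation with hypothetical names (`vbfly_result_t`, `vbfly_model`, `vbfly_pack`). It assumes that bit 4 of the TIEcmp result means "first operand is less than the second" in a signed comparison, which is how the function above uses it, and it models the 16-bit wires with wrap-around unsigned arithmetic.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t new_a, new_b;  /* updated accumulated metrics                 */
    unsigned sel_a, sel_b;  /* 0 = path from StateA kept, 1 = from StateB  */
} vbfly_result_t;

/* Signed 16-bit "less than" on raw bit patterns: flipping the sign bit
 * maps two's-complement order onto unsigned order. */
static int lt16(uint16_t x, uint16_t y)
{
    return (uint16_t)(x ^ 0x8000u) < (uint16_t)(y ^ 0x8000u);
}

vbfly_result_t vbfly_model(uint16_t state_a, uint16_t state_b, uint16_t metric)
{
    vbfly_result_t r;
    uint16_t neg = (uint16_t)(~metric + 1u);          /* two's-complement negate */
    uint16_t a_path_a = (uint16_t)(state_a + metric);
    uint16_t b_path_b = (uint16_t)(state_b + neg);
    uint16_t a_path_b = (uint16_t)(state_a + neg);
    uint16_t b_path_a = (uint16_t)(state_b + metric);

    int a_wins = lt16(a_path_a, b_path_b);
    r.new_a = a_wins ? a_path_a : b_path_b;
    r.sel_a = a_wins ? 0u : 1u;

    int b_wins = lt16(a_path_b, b_path_a);
    r.new_b = b_wins ? a_path_b : b_path_a;
    r.sel_b = b_wins ? 0u : 1u;
    return r;
}

/* Pack the result like {SelectA, SelectB, new_stateA, new_stateB}. */
uint64_t vbfly_pack(vbfly_result_t r)
{
    uint64_t v = ((uint64_t)r.sel_a << 33) | ((uint64_t)r.sel_b << 32) |
                 ((uint64_t)r.new_a << 16) | (uint64_t)r.new_b;
    assert((v >> 34) == 0);  /* result must fit in 34 bits */
    return v;
}
```

With StateA = State0, StateB = State4, and Metric = Sum, `new_a` and `sel_a` correspond to ACS(NewState, 0, State0+Sum, State4-Sum), and `new_b` and `sel_b` correspond to ACS(NewState, 1, State0-Sum, State4+Sum).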



Several additional techniques are used to accelerate the Viterbi decoder:

- The VBFLY TIE function can be instanced several times in an operation so that multiple Viterbi butterfly computations are performed in parallel.
- Internal TIE state (not to be confused with the states in the trellis diagram, referred to as trellis states) holds intermediate data, such as accumulated state metrics, which eliminates many memory accesses.
- Memory accesses and butterfly computations are fused into high-performance TIE operations.
- FLIX with a dual load/store interface allows two operations (both performing a load/store) to be issued in the same instruction word.

Appendix A lists vtb2.tie, the TIE file that describes the TIE operations that accelerate Viterbi decode. The TIE instructions for the trellis update phase of Viterbi decoding are summarized below.

VBIN: Viterbi Input

C Intrinsic Syntax: void VBIN(VREG PG0, VREG *p_PG0)

This operation loads 2 GSM coded symbol pairs (4 bytes) at one time by using a 32-bit load into the 32-bit register file VREG. The load pointer (p_PG0) is also auto-incremented by 4 bytes in preparation for the next VBIN instruction.

VBOUT: Parallel Viterbi Butterfly Operation and Output

C Intrinsic Syntax: void VBOUT(unsigned short *PSelect, VREG PG0, imm i)

This operation updates all state metrics of a trellis column for a single pair of GSM coded data (PG0). The add-compare-select operation is performed on all 16 states of the trellis column using 8 VBFLY TIE functions, which together cover the Viterbi butterfly computations for the entire trellis column.

This operation updates each state's accumulated distance metric, held in 16-bit TIE states (one for each of the 16 trellis states), and writes out 16 select bits for the most likely arcs going into each of the 16 trellis states. The write pointer (PSelect) is auto-incremented in preparation for the next VBOUT instruction. An immediate operand (i) is used to choose a symbol pair of GSM coded data from the 32-bit VREG TIE register file. Since VBIN provides 2 GSM coded symbol pairs, there will be two VBOUT instructions for each VBIN instruction.
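The column update performed by VBOUT can be sketched in plain C as follows. This is a hedged reference model based on the VBOUT listing in Appendix A, not the demonstration source: the function names (trellis_column, select_pair, branch_metric, wrap16) are hypothetical, the less-than interpretation of the TIEcmp result is an assumption, and the value returned for a field that is not one-hot is a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_STATES 16

/* Wrap to 16 bits the way a 16-bit TIE wire would. */
static int16_t wrap16(int32_t v)
{
    return (int16_t)(uint16_t)v;
}

/* Pick symbol pair t (0 or 1) from a 32-bit word, big-endian byte order. */
static void select_pair(uint32_t vreg, int t, uint8_t *g0, uint8_t *g1)
{
    unsigned shift = (t == 1) ? 0u : 16u;
    *g0 = (uint8_t)(vreg >> (shift + 8));
    *g1 = (uint8_t)(vreg >> shift);
}

/* Branch metric for butterfly j (0..7) from its one-hot 4-bit BMSel field.
 * Field 0 occupies bits 31:28; bit order is +sum, -sum, +diff, -diff. */
static int16_t branch_metric(uint32_t bmsel, int j, int16_t sum, int16_t diff)
{
    switch ((bmsel >> (28 - 4 * j)) & 0xFu) {
    case 8: return sum;
    case 4: return wrap16(-(int32_t)sum);
    case 2: return diff;
    case 1: return wrap16(-(int32_t)diff);
    default: return 0;  /* not one-hot: this value is an assumption */
    }
}

/* One trellis column. Updates acc[] in place and returns the 16 select
 * bits with Select0 in bit 15, matching the SelectPaths concatenation. */
uint16_t trellis_column(int16_t acc[NUM_STATES], uint32_t bmsel,
                        uint32_t vreg, int t)
{
    int16_t next[NUM_STATES];
    uint16_t selects = 0;
    uint8_t g0, g1;

    assert(acc != NULL);
    select_pair(vreg, t, &g0, &g1);

    /* 8-bit wrapping sum and difference, sign-extended to 16 bits. */
    int16_t sum  = (int8_t)(uint8_t)(g0 + g1);
    int16_t diff = (int8_t)(uint8_t)(g0 - g1);

    for (int j = 0; j < NUM_STATES / 2; j++) {
        int16_t m = branch_metric(bmsel, j, sum, diff);
        int16_t a = acc[j], b = acc[j + 8];

        int16_t a_plus  = wrap16(a + m), b_minus = wrap16(b - m);
        int16_t a_minus = wrap16(a - m), b_plus  = wrap16(b + m);

        int sel_even = !(a_plus < b_minus);  /* 0 keeps the path from state j */
        int sel_odd  = !(a_minus < b_plus);

        next[2 * j]     = sel_even ? b_minus : a_plus;
        next[2 * j + 1] = sel_odd  ? b_plus  : a_minus;

        selects |= (uint16_t)(sel_even << (15 - 2 * j));
        selects |= (uint16_t)(sel_odd  << (14 - 2 * j));
    }

    for (int s = 0; s < NUM_STATES; s++)
        acc[s] = next[s];
    return selects;
}
```

Butterfly j reads states j and j+8 and writes states 2j and 2j+1, so one call corresponds to one VBOUT; calling it with t = 0 and then t = 1 on the same 32-bit word mirrors the two VBOUT instructions that follow each VBIN.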

WUR_BMSel: Write User Register - Branch Metric Select

The BMSel register is a 32-bit register that sets the distance metric for each path of the Viterbi butterfly computations performed by the VBOUT instruction. Since VBOUT performs 8 butterfly computations, there are 32 path metrics. However, due to path symmetry in the butterfly structure, only the top-most path of each butterfly needs to be defined; the remaining paths are inferred from it. For example, the top-most path in Figure 8 is +sum. The bottom-most path is the same as the top-most path (+sum), and the diagonal paths are the negative of the top-most path (-sum).

The BMSel register is split into eight 4-bit fields, where each bit of a field is a one-hot select for +sum, -sum, +diff, or -diff. The most significant 4-bit field corresponds to the top-most path of the butterfly computation that updates states 0 and 1. The following 4-bit field corresponds to the top-most path of the butterfly computation that updates states 2 and 3, and so on.
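The field layout described above can be illustrated with a small packing helper in C. This is only a sketch: pack_bmsel and bmsel_field are hypothetical names, the one-hot codes follow the bit order +sum, -sum, +diff, -diff used by the TIEsel calls in Appendix A, and the field values in the usage note are arbitrary rather than the GSM polynomial settings.

```c
#include <assert.h>
#include <stdint.h>

/* One-hot codes for a single 4-bit BMSel field. */
enum { DIST_SUM = 8, DIST_NEG_SUM = 4, DIST_DIFF = 2, DIST_NEG_DIFF = 1 };

/* Pack eight one-hot codes into a 32-bit word.
 * fields[0] (butterfly for states 0 and 1) lands in bits 31:28. */
uint32_t pack_bmsel(const uint8_t fields[8])
{
    uint32_t word = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t f = fields[i];
        /* Each field must select exactly one of the four metrics. */
        assert(f == DIST_SUM || f == DIST_NEG_SUM ||
               f == DIST_DIFF || f == DIST_NEG_DIFF);
        word = (word << 4) | f;
    }
    return word;
}

/* Recover field i, where i = 0 is the most significant field. */
uint8_t bmsel_field(uint32_t word, int i)
{
    assert(i >= 0 && i < 8);
    return (uint8_t)((word >> (28 - 4 * i)) & 0xFu);
}
```

Packing the arbitrary sequence {8, 4, 2, 1, 8, 4, 2, 1} yields 0x84218421, which makes the most-significant-field-first ordering easy to see in a debugger.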

Prior to executing VBOUT instructions, the BMSel register should be initialized with the appropriate branch metric selection for the butterfly computations. Because the branch metrics are programmable, the VBOUT instruction supports different polynomials for Viterbi coding (provided that the constraint length is k=5 and the coding rate is 1/2).

In this example, the path metrics for each butterfly computation are taken directly from the

GSM decoder C source code. The initialization for standard GSM coded polynomials is shown in

the sample code below:


#define dist_sum 8
#define dist_neg_sum 4
#define dist_diff 2
#define dist_neg_diff 1

WUR_BMSel((dist_sum




work-per-cycle basis. Note that the Xtensa-based implementation is written in C, whereas hand-coded assembly is required to obtain performance numbers for many DSP machines.

The TIE operations for the trace-back phase of Viterbi decoding are summarized below.

BACKTRACE: Viterbi Backtrace

C Intrinsic Syntax: void BACKTRACE(unsigned short *PSelect)

This operation loads the 16 select bits (from address PSelect) that were stored during execution of VBOUT instructions. From the current minimum state, the select value (representing the most likely path) is used to trace backward to the previous trellis stage. The LSB value of the minimum state is considered to be the most likely output bit and is saved in a holding register to be written to memory later using the STORE_OUT operation. The select pointer (PSelect) is post-decremented by 2 in preparation for the next BACKTRACE operation.

BACKTRACE0: Viterbi Backtrace initialization

C Intrinsic Syntax: void BACKTRACE0(char MinState)

This instruction is a subset of the BACKTRACE operation and is executed only once, before the BACKTRACE instructions. It initializes the minimum state after the update phase. The state number with the minimum value is passed as the argument MinState.
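The state movement performed by BACKTRACE0 and BACKTRACE can be modeled in C as sketched below. The model follows the operation bodies listed in Appendix A, but the type and function names (backtrace_ctx, backtrace_init, backtrace_step) are hypothetical, and it assumes that TIEmux with selector value 0 picks its first data argument, so that state s reads bit 15 - s of the select word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Context mirroring the MinState and Output TIE states. */
typedef struct {
    uint8_t min_state;  /* 4-bit current most-likely trellis state      */
    uint8_t output;     /* 1-bit decoded output held for a later store  */
} backtrace_ctx;

/* Model of BACKTRACE0: seed the state; its LSB is the first output bit. */
void backtrace_init(backtrace_ctx *ctx, uint8_t min_state)
{
    assert(ctx != NULL && min_state < 16);
    ctx->min_state = min_state;
    ctx->output = min_state & 1u;
}

/* Model of one BACKTRACE step on a 16-bit select word (Select0 in bit 15). */
void backtrace_step(backtrace_ctx *ctx, uint16_t sel)
{
    assert(ctx != NULL && ctx->min_state < 16);
    uint8_t old = ctx->min_state;

    /* Select bit stored for the current state by the update phase. */
    uint8_t bit = (uint8_t)((sel >> (15 - old)) & 1u);

    /* Bit 1 of the old state becomes the LSB of the predecessor state,
     * which is the decoded output bit for this step. */
    ctx->output = (old >> 1) & 1u;

    /* Predecessor: the select bit becomes the MSB, the rest shifts right. */
    ctx->min_state = (uint8_t)((bit << 3) | (old >> 1));
}
```

Starting from state 5 with a select word of 0x0400, one step moves to state 10 and holds an output bit of 0. A select bit of 0 therefore means the survivor came from state s >> 1, and a select bit of 1 means it came from (s >> 1) + 8, which is the inverse of the butterfly mapping used during the update phase.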

STORE_OUT: Store eight output values

C Intrinsic Syntax: void STORE_OUT(unsigned char *POutput)

This instruction performs a byte store of the single-bit output value calculated in prior executions of the BACKTRACE instruction to pointer POutput. The POutput pointer is post-decremented by one in preparation for the next STORE_OUT operation.

The main loop for the Viterbi decoder's trace-back phase is shown below:

for (i = FS-1; i >= 1; i--) {
    BACKTRACE(PSelect);
    STORE_OUT(ptr_output);
}

The disassembly of the Viterbi decoder's backtrace loop is as follows:

loopgtz a10, 60000fe0
{ store_out a9; backtrace a8 }

The loop consists of a single FLIX instruction that contains both BACKTRACE and STORE_OUT operations. These operations are effectively pipelined such that the backtrace is done in the first iteration and then the output bit is written to memory in the next iteration. As a result, an output bit is written every clock cycle. This means that the trace-back phase of Viterbi decoding occurs at a rate of one cycle per bit.

The highly optimized assembly code described in this section was directly compiled from C

source code with the TIE variable set (#define TIE). Upon building this example and simulating

it, the console shows the following:


Processing New Frame
Errors detected = 0, Benchmark = 2.167000 cycles per bit

Viterbi decode performance of 2.17 cycles per bit is more than a 155x improvement over the standard implementation without TIE acceleration (337 cycles per bit). The TIE area for this approach is 28.7K gates, in addition to 47K gates for the base Xtensa LX2 core. This core is capable of being synthesized up to 264 MHz (worst case) in .13 LV. Therefore, this solution is capable of decoding a GSM coded bitstream at a peak rate of 130 Mbits per second.

10 Demonstration Instructions

The demonstration requires that you have installed Xplorer CE 2.1.1 with the RB-2008.3 software tools. The workspace, Viterbi_V2.xws, can be obtained from the Tensilica support website.

Follow these steps to build and simulate the demonstration code.

1. Start Xplorer and import the Viterbi_V2.xws workspace. Select all components provided in the workspace for installation into your workspace.

2. In the workspace toolbar, select project (P: Viterbi_v2), configuration (C: Viterbi_v2) and release target (T: Release).

3. Click Build Active to compile and then click Run to simulate. The console will display the decode error and benchmark results.

To compare performance with the ANSI C implementation (without TIE), you can comment out the line (#define TIE) in the main.c file of the Viterbi_V2 project.

11 Summary

Xtensa processors offer significant advantages for complex telephony applications. The Xtensa architecture combines a powerful general-purpose 32-bit instruction set design with a unique configuration and extension process. These are used together to solve some of the toughest problems in communication system design, including efficient convolutional coding and Viterbi decoding. Application-specific processors are quickly designed, simulated, and built in silicon, and offer significantly better programmability, performance and power efficiency than most popular DSPs. With the benefit of TIE, Xtensa solutions can offer almost 155x improvement in communication processing efficiency compared to conventional 32-bit RISC cores and over 32x improvement when compared to specialized DSPs.


Appendix A VTB2.TIE Code

// VTB2.TIE
// TIE Extensions for Viterbi Acceleration

// FLIX
format vtb_flix 32 {slot_a, slot_b}
slot_opcodes slot_a {VBIN, STORE_OUT}
slot_opcodes slot_b {VBOUT, BACKTRACE, BACKTRACE0}

// States used by Viterbi Instructions
state AccumDist0 16 add_read_write
state AccumDist1 16 add_read_write
// ... AccumDist2 through AccumDist9 ...
state AccumDistA 16 add_read_write
state AccumDistB 16 add_read_write
state AccumDistC 16 add_read_write
state AccumDistD 16 add_read_write
state AccumDistE 16 add_read_write
state AccumDistF 16 add_read_write
state MinState 4 add_read_write
state BMSel 32 add_read_write
state Output 1 add_read_write

// Immediates
immediate_range imm8 0 7 1

regfile VREG 32 2 vr

// Viterbi ADD-COMPARE-SELECT Butterfly
function [33:0] VBFLY ([15:0] StateA, [15:0] StateB, [15:0] Metric)
{
    wire [15:0] neg_Metric = ~Metric + 1'b1;

    wire [15:0] stateA_pathA = StateA + Metric;
    wire [15:0] stateB_pathB = StateB + neg_Metric;
    wire [4:0] compA = TIEcmp(stateA_pathA, stateB_pathB, 1'b1);
    wire [15:0] new_stateA = (compA[4]) ? stateA_pathA : stateB_pathB;
    wire SelectA = (compA[4]) ? 0 : 1;

    wire [15:0] stateA_pathB = StateA + neg_Metric;
    wire [15:0] stateB_pathA = StateB + Metric;
    wire [4:0] compB = TIEcmp(stateA_pathB, stateB_pathA, 1'b1);
    wire [15:0] new_stateB = (compB[4]) ? stateA_pathB : stateB_pathA;
    wire SelectB = (compB[4]) ? 0 : 1;

    assign VBFLY = {SelectA, SelectB, new_stateA, new_stateB};
}

operation VBIN {out VREG GInput, inout AR *ars} {out VAddr, in MemDataIn32}
{
    assign VAddr = ars;
    assign GInput = MemDataIn32;
    assign ars = ars + 4;
}

operation VBOUT


{inout AR *ars, in VREG GInput, in imm8 t}
{
    in BMSel,
    inout AccumDist0, inout AccumDist1, inout AccumDist2, inout AccumDist3,
    inout AccumDist4, inout AccumDist5, inout AccumDist6, inout AccumDist7,
    inout AccumDist8, inout AccumDist9, inout AccumDistA, inout AccumDistB,
    inout AccumDistC, inout AccumDistD, inout AccumDistE, inout AccumDistF,
    out VAddr, out MemDataOut16
}
{
    // Choose G0 from GInput based upon immediate argument t
    // Written for Big Endian Ordering
    wire [7:0] G0 = ((t == 1) ? GInput[15:8] : GInput[31:24]);

    // Choose G1 from GInput based upon immediate argument t
    // Written for Big Endian Ordering
    wire [7:0] G1 = ((t == 1) ? GInput[7:0] : GInput[23:16]);

    // Declare temporary variables for AccumDist
    wire [15:0] State0 = AccumDist0;
    wire [15:0] State1 = AccumDist1;
    wire [15:0] State2 = AccumDist2;
    wire [15:0] State3 = AccumDist3;
    wire [15:0] State4 = AccumDist4;
    wire [15:0] State5 = AccumDist5;
    wire [15:0] State6 = AccumDist6;
    wire [15:0] State7 = AccumDist7;
    wire [15:0] State8 = AccumDist8;
    wire [15:0] State9 = AccumDist9;
    wire [15:0] StateA = AccumDistA;
    wire [15:0] StateB = AccumDistB;
    wire [15:0] StateC = AccumDistC;
    wire [15:0] StateD = AccumDistD;
    wire [15:0] StateE = AccumDistE;
    wire [15:0] StateF = AccumDistF;

    // Calculate Sum/Diff for input
    wire [7:0] Sum_8 = G0 + G1;
    wire [7:0] Diff_8 = G0 - G1;
    wire [15:0] Sum = {8{Sum_8[7]}, Sum_8};
    wire [15:0] Diff = {8{Diff_8[7]}, Diff_8};
    wire [15:0] neg_Sum = ~Sum + 1;
    wire [15:0] neg_Diff = ~Diff + 1;

    // Calculate Accumulated Path Metrics
    // Compare/Select Shortest Path into each State
    // using 8 parallel VBFLY functions
    wire [15:0] new_AccumDist0, new_AccumDist1, new_AccumDist2, new_AccumDist3,
                new_AccumDist4, new_AccumDist5, new_AccumDist6, new_AccumDist7,
                new_AccumDist8, new_AccumDist9, new_AccumDistA, new_AccumDistB,
                new_AccumDistC, new_AccumDistD, new_AccumDistE, new_AccumDistF;
    wire Select0, Select1, Select2, Select3, Select4, Select5, Select6, Select7,
         Select8, Select9, SelectA, SelectB, SelectC, SelectD, SelectE, SelectF;


    wire [15:0] DistA = TIEsel(BMSel[31], Sum, BMSel[30], neg_Sum, BMSel[29], Diff, BMSel[28], neg_Diff);
    assign {Select0, Select1, new_AccumDist0, new_AccumDist1} = VBFLY(State0, State8, DistA);

    wire [15:0] DistB = TIEsel(BMSel[27], Sum, BMSel[26], neg_Sum, BMSel[25], Diff, BMSel[24], neg_Diff);
    assign {Select2, Select3, new_AccumDist2, new_AccumDist3} = VBFLY(State1, State9, DistB);

    wire [15:0] DistC = TIEsel(BMSel[23], Sum, BMSel[22], neg_Sum, BMSel[21], Diff, BMSel[20], neg_Diff);
    assign {Select4, Select5, new_AccumDist4, new_AccumDist5} = VBFLY(State2, StateA, DistC);

    wire [15:0] DistD = TIEsel(BMSel[19], Sum, BMSel[18], neg_Sum, BMSel[17], Diff, BMSel[16], neg_Diff);
    assign {Select6, Select7, new_AccumDist6, new_AccumDist7} = VBFLY(State3, StateB, DistD);

    wire [15:0] DistE = TIEsel(BMSel[15], Sum, BMSel[14], neg_Sum, BMSel[13], Diff, BMSel[12], neg_Diff);
    assign {Select8, Select9, new_AccumDist8, new_AccumDist9} = VBFLY(State4, StateC, DistE);

    wire [15:0] DistF = TIEsel(BMSel[11], Sum, BMSel[10], neg_Sum, BMSel[9], Diff, BMSel[8], neg_Diff);
    assign {SelectA, SelectB, new_AccumDistA, new_AccumDistB} = VBFLY(State5, StateD, DistF);

    wire [15:0] DistG = TIEsel(BMSel[7], Sum, BMSel[6], neg_Sum, BMSel[5], Diff, BMSel[4], neg_Diff);
    assign {SelectC, SelectD, new_AccumDistC, new_AccumDistD} = VBFLY(State6, StateE, DistG);

    wire [15:0] DistH = TIEsel(BMSel[3], Sum, BMSel[2], neg_Sum, BMSel[1], Diff, BMSel[0], neg_Diff);
    assign {SelectE, SelectF, new_AccumDistE, new_AccumDistF} = VBFLY(State7, StateF, DistH);

    // Store new state metrics
    assign AccumDist0 = new_AccumDist0;
    assign AccumDist1 = new_AccumDist1;
    // ... AccumDist2 through AccumDist9 ...
    assign AccumDistA = new_AccumDistA;
    assign AccumDistB = new_AccumDistB;
    assign AccumDistC = new_AccumDistC;
    assign AccumDistD = new_AccumDistD;
    assign AccumDistE = new_AccumDistE;
    assign AccumDistF = new_AccumDistF;

    // Write out the Binary Encoded Paths
    wire [15:0] SelectPaths = {Select0, Select1, Select2, Select3, Select4, Select5, Select6, Select7,
                               Select8, Select9, SelectA, SelectB, SelectC, SelectD, SelectE, SelectF};
    assign VAddr = ars;
    assign MemDataOut16 = SelectPaths;

    // Update the output pointer
    assign ars = ars + 2;


}

// Initialize Backtrace instruction
operation BACKTRACE0 {in AR ars} {out MinState, out Output}
{
    // initialize MinState w/ most likely endstate
    assign MinState = ars;

    // the LSB is the output bit
    assign Output = ars[0];
}

operation BACKTRACE {inout AR *art} {inout MinState, out Output, out VAddr, in MemDataIn16}
{
    // Read in Paths for trellis column and postdecrement pointer
    assign VAddr = art;
    wire [15:0] Sel = MemDataIn16;
    assign art = art - 2;

    // Select path for trellis state
    wire DataIn8 = TIEmux(MinState[3:0], Sel[15], Sel[14], Sel[13], Sel[12], Sel[11],
                          Sel[10], Sel[9], Sel[8], Sel[7], Sel[6], Sel[5], Sel[4],
                          Sel[3], Sel[2], Sel[1], Sel[0]);

    // Trace backward one bit to previous state
    assign MinState = {DataIn8, MinState[3:1]};

    // Save output bit
    assign Output = MinState[1];
}

schedule backtrace_sched {BACKTRACE} {use MinState 2; def MinState 2; def Output 2;}

operation STORE_OUT {inout AR *Addr} {in Output, out VAddr, out MemDataOut8}
{
    assign VAddr = Addr;
    assign MemDataOut8 = {7'b0, Output};
    assign Addr = Addr - 1;
}
