8/23/2019 HSPA+interleaver
http://slidepdf.com/reader/full/hspainterleaver 1/14
Linköping University Post Print
Memory Conflict Analysis and Implementation
of a Re-configurable Interleaver Architecture
Supporting Unified Parallel Turbo Decoding
Rizwan Asghar, Di Wu, Johan Eilert and Dake Liu
N.B.: When citing this work, cite the original article.
The original publication is available at www.springerlink.com:
Rizwan Asghar, Di Wu, Johan Eilert and Dake Liu, Memory Conflict Analysis and
Implementation of a Re-configurable Interleaver Architecture Supporting Unified Parallel
Turbo Decoding, 2010, Journal of Signal Processing Systems for Signal, Image, and Video
Technology, (60), 1, 15-29.
http://dx.doi.org/10.1007/s11265-009-0394-8
Copyright: Springer Science Business Media
http://www.springerlink.com/
Postprint available at: Linköping University Electronic Press
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-25599
Memory Conflict Analysis and Implementation of a
Re-configurable Interleaver Architecture Supporting
Unified Parallel Turbo Decoding
Rizwan Asghar, Di Wu, Johan Eilert, Dake Liu
Abstract − This paper presents a novel hardware interleaver
architecture for unified parallel turbo decoding. The
architecture is fully re-configurable among multiple standards
such as HSPA Evolution, DVB-SH, 3GPP-LTE and WiMAX.
Turbo codes, widely used for error correction in today's
consumer electronics, tend to introduce high latency because
of large block sizes and multiple iterations. Many parallel
turbo decoding architectures have recently been proposed to
enhance the channel throughput, but the interleaving algorithms
used in different standards do not readily allow their use
because of a high percentage of memory conflicts. The
architecture presented in this paper provides a re-configurable
platform for implementing the parallel interleavers of different
standards by managing the conflicts involved in each. The memory
conflicts are managed by applying different approaches such as
stream misalignment, memory division and the use of a small FIFO
buffer. The proposed flexible architecture is low cost and
occupies 0.085 mm² in a 65 nm CMOS process. It can implement up
to 8 parallel interleavers and can operate at a frequency of
200 MHz, thus providing significant support to higher throughput
systems based on parallel SISO processors.
Keywords − Parallel interleaver, Parallel turbo decoding, Block
interleaver, Multi standard, HSPA, LTE, WiMAX, DVB-SH.
Rizwan Asghar
Department of Electrical Engineering,
Linköping University,
Linköping, SE-58183, Sweden
Phone: +46 13 281323
Fax: +46 13 139282
e-mail: [email protected]
Di Wu, Johan Eilert, Dake Liu
e-mail: [email protected]; [email protected]; [email protected]
1 Introduction
The latest trends in radio communication systems demand
flexible, general purpose solutions for data processing,
including symbol processing and forward error correction
(FEC). Turbo codes [1] are widely used as a forward error
correction scheme in consumer electronics in order to detect
and correct errors introduced during transmission over noisy
channels. Interleaving plays a vital role in improving the
performance of turbo codes in terms of bit error rate. The
primary function of the interleaver is to improve the distance
properties of the coding scheme and to disperse the sequence
of bits in a bit stream so as to minimize the effect of burst
errors. At present a number of standard specifications such as
HSPA Evolution [2][3], DVB-SH [4], UMTS-LTE [5] and
WiMAX [6] already include turbo codes. The turbo code scheme
adopted is the parallel concatenated code with two 8-state
constituent encoders and an internal interleaver. One turbo
code internal interleaver is used in the turbo encoder, while
the turbo decoder uses multiple instances of the interleaver
and de-interleaver to decode the received bits iteratively
(Figure 1).
Along with multi-standard support, requirements for higher
throughput are also emerging in response to customer needs.
Turbo codes, however, generally exhibit latency (which further
reduces throughput) due to large block sizes and the multiple
iterations over soft bits needed to reach a reliable hard
decision. The latency can increase further if the interleaver
block size varies in every transmission time interval (TTI)
and some pre-processing is required whenever the block size
changes. Parallel turbo decoding is a well established
technique for meeting high throughput requirements [7] - [10],
but it also requires the implementation of parallel
interleavers. If there are P parallel SISO processors, then P
sub-interleavers have to be implemented in parallel, and each
sub-interleaver is responsible for generating N/P interleaver
addresses independently, where N is the total block size. The
main schemes of parallel turbo decoding can be applied to
different standards without any modification; however,
interleaving varies among the standards. Many challenges are
involved, including unified address generation and conflict
management, which restrict the use of the same parallel turbo
decoding architecture for multiple standards. This paper
focuses on handling these challenges and presents an
implementation of a re-configurable interleaver architecture
targeting unified parallel turbo decoding.
The rest of this paper is organized as follows. Section 2
reviews previous work and the challenges involved in
implementing parallel interleaver support for parallel turbo
decoding. Sections 3 to 6 cover parallel interleaving in the
HSPA Evolution, DVB-SH, 3GPP-LTE and WiMAX standards
respectively. Following the hardware multiplexing methodology,
the unified architecture of the re-configurable parallel
interleaver is presented in Section 7. Section 8 provides the
implementation results, followed by a conclusion.
2 Background and Challenges
From an implementation perspective, recursive systematic
convolutional encoders are very simple to implement compared
with the SISO decoding in a turbo decoder, but the interleavers
tend to have the same complexity on both sides because of
their variability. The implementation of the interleaver on
the decoder side becomes more challenging when parallel
interleaving is needed in order to support parallel SISO
processing. Other than the architectural complexity of
generating parallel interleaver addresses, one of the big
challenges is to deal with the memory conflicts associated
with multiple simultaneous writes and reads. If the addresses
generated in parallel map to different memories, i.e. one
address for each memory, then there is no conflict; if two or
more addresses map to the same memory, a conflict occurs
(Figure 2) and it needs to be resolved on-the-fly.
The conflict percentage among the addresses generated in
parallel is sometimes very high, and no straightforward
solution exists except doubling the memory size, which is not
cost efficient. The works in [11] - [13] provide a good
theoretical background and also propose the generation of
conflict free interleavers, but they cannot be used directly
with the existing interleaver algorithms of the standards
investigated in this paper. The work presented in [14] - [19]
covers single address generation supporting a maximum of two
standards but does not support parallel interleaver address
generation. Reference [21] provides a parallel interleaver
architecture for HSPA+ but does not cover support for a
multi-mode environment. A good analysis of memory conflicts
for the interleaver in a turbo decoder is provided in [22],
where a stalling mechanism is used to stall the MAP decoders
when a conflict occurs. The main focus of that work is buffer
management, and the stall process may result in additional
control complexity. Reference [23] provides a good theoretical
background along with an interleaver architecture supporting
parallel turbo decoding. However, it supports only one
standard, i.e. WiMAX duo-binary turbo decoding. The
implementation of a parallel interleaver for a multi-mode
environment appears to be a bottleneck, which also prevents
the use of the same turbo decoding core for multiple
standards. This motivates the work on a re-configurable
interleaver architecture to support unified parallel turbo
decoding. A low cost and re-configurable solution for parallel
interleaver implementation for existing interleaver algorithms
can open more opportunities to meet high throughput
requirements and at the same time help to meet the fast
time-to-market requirements of industry and customers. This
paper addresses the management of memory conflicts and
proposes certain schemes to reduce them with low silicon cost
overhead. By exploiting the hardware re-use methodology among
different standards, a unified architecture is presented which
is low cost and fully re-configurable to generate the parallel
interleaving patterns on-the-fly.
3 Parallel Interleaver for HSPA Evolution
Figure 1: (a) Turbo Encoder; (b) Turbo Decoder.
Figure 2: A situation of conflict at (T+k).
The throughput requirements for WCDMA based systems have
been raised in a series of specifications. After the addition
of MIMO and 64 QAM in Release 8 [3], the throughput
requirements have reached 43.2 Mbps. The interleaver
associated with HSPA+ was not inherently designed to support
parallel SISO processing, but the higher throughput
requirements and the need to change the block size in each
transmission time interval motivate the provision of more than
one SISO processor in parallel. The interleaving algorithm for
3GPP-WCDMA is given below. Here N is the block size in bits,
R is the number of rows and C is the number of columns.
Find the number of rows R, the prime number p and the primitive root v for the particular block size, as given in the standard.
Determine the column size:
    C = p-1, if N ≤ R × (p-1)
    C = p,   if R × (p-1) < N ≤ R × p
    C = p+1, if R × p < N
Construct the intra-row permutation sequence S(j) by:
    S(j) = [ v × S(j-1) ] % p, for j = 1, 2, ..., p-2
Determine the least prime integer sequence q(i) for i = 1, 2, ..., R-1, taking q(0) = 1, such that g.c.d(q(i), p-1) = 1, q(i) > 6 and q(i) > q(i-1).
Apply the inter-row permutation to q(i) to find r(i) = T( q(i) ).
Perform the intra-row permutations U(i,j) for i = 0, 1, ..., R-1 and j = 0, 1, ..., p-2:
    If C = p:   U(i,j) = S[ (j × r(i)) mod (p-1) ], and U(i,p-1) = 0.
    If C = p+1: U(i,j) = S[ (j × r(i)) mod (p-1) ], U(i,p-1) = 0 and U(i,p) = p; in addition, if N = R × C, exchange U(R-1,0) with U(R-1,p).
    If C = p-1: U(i,j) = S[ (j × r(i)) mod (p-1) ] - 1.
Perform the inter-row permutations.
Read the addresses column-wise.
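The two sequence-construction steps of the algorithm above can be sketched in software as follows. This is only an illustrative Python sketch, not the hardware described in this paper: the function names are hypothetical, the initial value S(0) = 1 is an assumption that the listing does not state explicitly, and p = 7 with v = 3 are example values chosen for the demonstration.

```python
from math import gcd


def is_prime(n):
    """Trial-division primality test; adequate for the small values used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def intra_row_base_sequence(p, v):
    """Return [S(0), ..., S(p-2)] with S(j) = (v * S(j-1)) % p.

    S(0) = 1 is an assumed initial value (not stated in the listing).
    """
    s = [1]
    for _ in range(1, p - 1):
        s.append((v * s[-1]) % p)
    return s


def least_prime_sequence(p, num_rows):
    """Return [q(0), ..., q(R-1)]: q(0) = 1, then increasing primes
    greater than 6 that are coprime to p-1."""
    q = [1]
    candidate = 7
    while len(q) < num_rows:
        if is_prime(candidate) and gcd(candidate, p - 1) == 1:
            q.append(candidate)
        candidate += 1
    return q


if __name__ == "__main__":
    print(intra_row_base_sequence(7, 3))
    print(least_prime_sequence(7, 5))
```

Scanning candidates in increasing order guarantees q(i) > q(i-1) without an explicit check, which mirrors the "least prime" wording of the step.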
The complexity of the implementation is evident from the
presence of functions such as modulo computation, intra-row
and inter-row permutations, multiplications, finding the least
prime integers, and computing the greatest common divisor.
After applying the intra-row and inter-row permutations, a
block of random addresses denoted by $y_k$ appears as shown
below:

$$
\begin{bmatrix}
y_1 & y_{R+1} & y_{2R+1} & \cdots & y_{(C-1)R+1} \\
y_2 & y_{R+2} & y_{2R+2} & \cdots & y_{(C-1)R+2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y_R & y_{2R} & y_{3R} & \cdots & y_{CR}
\end{bmatrix}
$$

The output from the interleaver is the sequence read
column-wise from the permuted matrix. The output address $Y_k$
within the matrix can be expressed as:

    loop i = 1 to C
        loop j = 1 to R
            $Y_k = y_{(i-1)R + j}$

Generating two addresses $Y^1_k$ and $Y^2_k$ at the same time
reduces the overall loop size to half, as follows:

    loop i = 1 to C/2
        loop j = 1 to R
            $Y^1_k = y_{(i-1)R + j}$
            $Y^2_k = y_{(i-1)R + j + RC/2}$

The two addresses generated simultaneously are used to write
two data values into two memory locations. A conflict that
occurs while writing into the same memory needs to be resolved
on-the-fly. The following sub-section provides a comprehensive
analysis aimed at reducing the memory conflicts associated
with the whole range of block sizes in HSPA+.
3.1 Memory Conflict Analysis
The number of conflicts for the HSPA+ interleaver reaches 50%
of the data size, as shown in Figure 3. One way to reduce the
number of conflicts is to introduce a misalignment in the
generation of the two addresses (Figure 4). The misalignment
can be achieved by inserting a delay line on one set of
addresses and data values. Because this adds latency to one of
the two sequences, the chosen delay, called the misalignment
factor (MF), cannot be very large. There are three possible
values for R (number of rows) in the prescribed interleaver,
i.e. 5, 10 or 20, and a good MF must be within the total number
of rows. It is observed that the inter-row permutation
patterns for R=5 or 10 respond well to the misalignment
technique, whereas the permutation patterns for R=20 respond
less well. The best MF found is 3 for R=5 or R=10 and 5 for
R=20. Using these values of MF, the conflicts are reduced to
around 10%, as shown in Figure 3. Still, the conflict count is
high enough that it cannot be managed without the support of
extra memories. Alternatively, a large number of buffer
registers can be used to handle this situation, but this
involves higher latency.
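The effect of misalignment on the conflict count can be explored with a small software model such as the one below. It is a sketch under stated assumptions rather than the analysis flow used for Figure 3: the function name is hypothetical, each memory bank is assumed to hold a contiguous range of addresses, and stream m is assumed to be delayed by m × MF cycles.

```python
def count_conflicts(perm, num_streams, num_banks, mf=0):
    """Count same-cycle bank collisions for parallel writes.

    perm        : interleaved addresses; stream m handles the m-th
                  contiguous segment of this list
    num_streams : number of addresses generated per cycle
    num_banks   : number of memory banks; a bank holds a contiguous
                  address range (an assumption of this sketch)
    mf          : misalignment factor; stream m is delayed by m*mf cycles
    """
    n = len(perm)
    seg = n // num_streams
    bank_size = -(-n // num_banks)  # ceiling division
    conflicts = 0
    total_cycles = seg + mf * (num_streams - 1)
    for t in range(total_cycles):
        used = set()
        for m in range(num_streams):
            k = t - m * mf  # position of stream m in its own segment
            if 0 <= k < seg:
                bank = perm[m * seg + k] // bank_size
                if bank in used:
                    conflicts += 1  # a second write hits a busy bank
                else:
                    used.add(bank)
    return conflicts


if __name__ == "__main__":
    toy = [0, 4, 1, 5, 2, 6, 3, 7]  # toy permutation, not an HSPA+ pattern
    print(count_conflicts(toy, 2, 2, mf=0))
    print(count_conflicts(toy, 2, 2, mf=1))
```

In the toy permutation both streams alternate between the two banks in lock-step, so delaying one stream by a single cycle removes every collision; real interleaver patterns are less regular, which is consistent with the residual conflicts reported above.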
Another approach is to split the memory banks into relatively
smaller sub-memories to reduce the number of conflicts. Up to
8 memories are already needed for 8 parallel turbo
interleavers in this design, so each memory bank required for
HSPA+ can be divided into 3 sub-memories. Applying this
configuration to the interleaver address generation and data
writing reduces the number of conflicts to zero for most of
the block sizes (Figure 3), although many block sizes still
face some conflicts. The number of these conflicts is very
small, however, and they can be handled using buffer
registers. The number of FIFO buffers can be reduced further
by applying a progressive write while conflicts occur in the
other memory bank. The total number of FIFO buffers needed for
each memory after memory division is plotted in Figure 5, with
a maximum of 12 FIFO registers required for memory M3.
Figure 3: Conflict analysis for HSPA+. (Panels: conflict count per sub-interleaver and conflict percentage per sub-interleaver versus block size, without misalignment, with misalignment and with memory division.)
3.2 Pre-Processing
In order to make the interleaver architecture fully
autonomous, parameters such as $R$, $C$, $q(i)$, $T(i)$,
$S(j)$, $p$ and $v$ need to be computed in hardware. Some of
these parameters can be computed using lookup tables, while
others need closed-loop or recursive computations. The most
critical parameter, consuming more clock cycles and more
hardware to compute, is the intra-row permutation pattern
$S(j)$. The function to be computed for all the permutations
is:

$$S(j) = [v \times S(j-1)] \,\%\, p \qquad (1)$$

where $j = 1, 2, \ldots, p-2$. This function exhibits an
unknown clock cycle delay because the value $S(j-1)$ is not
known in advance for each $j$. Targeting a low cost
implementation, it is therefore computed recursively
beforehand and the results are stored in a small RAM called
the intra-row permutation RAM (IRP_RAM). One way of computing
this parameter is the Interleaved Modulo Multiplication
Algorithm [20], which needs more than one clock cycle to
compute one value. The hardware to compute $S(j)$ using the
modulo multiplication algorithm is shown in Figure 6 (a).
Given that the maximum size of $v$ is 5 bits, this algorithm
can take 5 clock cycles to compute each $S(j)$ recursively.
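A bit-serial modular multiplication of the kind referred to above can be sketched in software as follows. This is a generic shift-add-reduce formulation written for illustration; it is not claimed to match the datapath of Figure 6 (a), and the function name and the `width` parameter are assumptions. Each loop iteration corresponds loosely to one clock cycle of a five-cycle scheme.

```python
def interleaved_modmul(v, s, p, width=5):
    """Return (v * s) % p by scanning the bits of v from MSB to LSB.

    Requires 0 <= s < p and 0 <= v < 2**width.
    """
    acc = 0
    for bit in range(width - 1, -1, -1):
        acc <<= 1              # shift-left by one
        if (v >> bit) & 1:
            acc += s           # conditionally add the multiplicand
        # acc < 3*p at this point, so two conditional subtractions
        # are enough to bring it back below p
        if acc >= p:
            acc -= p
        if acc >= p:
            acc -= p
    return acc


def intra_row_sequence(p, v):
    """Build S(0..p-2) recursively, assuming S(0) = 1."""
    s = [1]
    for _ in range(1, p - 1):
        s.append(interleaved_modmul(v, s[-1], p))
    return s


if __name__ == "__main__":
    print(intra_row_sequence(7, 3))
```

Only shifts, additions and subtractions are used, which is why such a scheme is cheap in hardware but costs one iteration per bit of v.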
Figure 4: Misalignment of address and data to reduce memory conflicts.
Figure 5: Final conflict count and FIFO size requirement for HSPA+ interleaver. (Panels: conflict count and FIFO size for each memory, M1 to M6, versus block size.)
Figure 6: Hardware for computing S(j) using (a) Interleaved modulo multiplication algorithm, (b) Segmentation based modulo computation.
Table 1: HSPA+ Pre-processing cycle cost comparison (cycles).
Block Size   Ref. [14]   Ref. [16]   Ref. [18]†   Proposed
40           317         15          19           11
41           295         23          31           14
5040         3587        802         821          303
5114         3048        563         583          310
† Values computed by using the methodology given in [18]
Another approach to computing the modulo function is
segmentation based modulo computation (SBMC). The idea is to
use only additions or shift-left-by-one operations in series,
followed by a modulo addition. This method might not be
hardware efficient when both multiplication terms have a wide
range; however, it is beneficial if either of the
multiplication terms has a relatively small range. Here the
parameter $v$ has a limited number of values, i.e. 2, 3, 5, 6,
7 or 19. Thus SBMC can be optimized to achieve a low cost
solution. The hardware for computing the intra-row permutation
using SBMC is shown in Figure 6(b). The main advantage of this
scheme over the Modulo Multiplication Algorithm is single
clock cycle execution for finding each new $S(j)$, which gives
a large saving in pre-processing cycle cost over previously
proposed designs (see Table 1). To support the generation of
two interleaved addresses per clock cycle, two parallel
IRP_RAMs, each of size 256 x 8b, would be required to store
the intra-row permutation patterns. Because this is hardware
inefficient, a single RAM of size 128x16b can be used instead,
with the 16 bit data output divided for use by two different
sections of address generation. The second address with which
the data has to be merged is computed by:
$$j_{aux} = \left(j + \frac{C}{2}\right) \,\%\, p \qquad (2)$$
The parameters $j$ and $j_{aux}$ are compared and the smaller
one is used to write the data. The same expression and
comparison are used to resolve the data distribution after
reading the data. The other parameters to be computed in the
pre-processing phase are the prime number $p$, the associated
integer $v$ and the total number of columns $C$. The parameter
$p$ is stored in a lookup table, with a smaller sub-lookup
table for the parameter $v$. The lookup table is addressed via
a counter, and for each value of $p$ the condition
$(p \times R) \ge N - R$ is tested using a comparator. Once
$p$ is found, $C$ can have only three values, i.e. $p-1$, $p$
or $p+1$, thus requiring at most three clock cycles. The
hardware used during the pre-processing phase remains idle
during the execution phase; therefore it can be reused to
reach a low cost solution.
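The search for p and C described above can be mimicked in software as follows. This is a sketch under assumptions: the lookup table is taken to contain all primes from 7 to 257, the number of rows R is passed in by the caller, and any block-size ranges that the standard treats as special cases are not handled. The function names are hypothetical.

```python
def primes_between(lo, hi):
    """Return all primes in [lo, hi] by trial division."""
    result = []
    for n in range(max(lo, 2), hi + 1):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            result.append(n)
    return result


# Assumed content of the prime lookup table (an assumption of this sketch).
CANDIDATE_PRIMES = primes_between(7, 257)


def find_p_and_c(n, r):
    """Return (p, C) for block size n and row count r.

    p is the first table entry satisfying p * r >= n - r, found by
    stepping through the table as a counter would; C is then chosen
    from p-1, p and p+1 using the column-size conditions.
    """
    for p in CANDIDATE_PRIMES:
        if p * r >= n - r:
            break
    else:
        raise ValueError("block size too large for the prime table")
    if n <= r * (p - 1):
        c = p - 1
    elif n <= r * p:
        c = p
    else:
        c = p + 1
    return p, c


if __name__ == "__main__":
    print(find_p_and_c(40, 5))
    print(find_p_and_c(5114, 20))
```

The three-way test on C mirrors the "at most three clock cycles" remark: each branch corresponds to one comparator evaluation.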
3.3 HW for Parallel Interleaver in HSPA+
The parallel processing of a turbo decoder with parallel SISO
blocks requires the interleaver to generate more than one
address every clock cycle and, at the same time, resolve the
memory conflicts. The computation of the final interleaved
address requires that the intra-row permutation data be
obtained in the correct order from the IRP_RAM. The data
output from the RAM is denoted as $U(i,j)$ and is given by:

$$U(i,j) = S[(j \times r(i)) \,\%\, (p-1)] \qquad (3)$$

The address ($RA$) to the RAM, i.e.
$(j \times r(i)) \,\%\, (p-1)$, involves a modulo function. We
present here three alternatives for computing the RAM address.
The first method involves recursive computation of addresses
as shown in Figure 8. The data is written into memory
row-wise, but when reading back column-wise, the IRP_RAM
address of the previous column is used to find the next
address. The recursive function is given by:

$$RA(i,j) = [RA(i,j-1) + qmod(i)] \,\%\, (p-1) \qquad (4)$$

The parameter $qmod(i)$ is computed from the least prime
number sequence $q(i)$ by $qmod(i) = q(i) \,\%\, (p-1)$. With
the condition $q(i) < 2(p-1)$, only a subtraction is needed to
obtain $qmod(i)$. The computational part associated with the
recursive approach is very small and limited to just a few
adders, but it needs a circular buffer of size 20 x 8b
(Figure 6 (a)) in order to keep the old addresses of a whole
column. This approach is low cost and can operate at a very
high clock rate due to its shorter critical path. Using the
same hardware for computing $S(j)$ requires five clock cycles.
To reduce this to a single clock cycle, a mix of the recursive
approach and the segmentation based modulo computation is
needed (Figure 6(b)). This second approach is not very
hardware efficient but has the benefit of reducing the
pre-processing time.
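The recursive address update of eq. (4) can be checked against the direct modulo expression with a short sketch like the one below. The function name and the `multipliers` argument (one value per row, playing the role of q(i) or r(i)) are hypothetical, and the list `prev` stands in for the circular buffer that holds the addresses of the previous column.

```python
def ram_addresses_recursive(multipliers, p, num_cols):
    """Return table[j][i] = (j * multipliers[i]) % (p - 1), built recursively.

    Each new column is derived from the previous one with a single
    addition and a conditional subtraction per row.
    """
    mod = p - 1
    step = [m % mod for m in multipliers]   # qmod(i)
    prev = [0] * len(multipliers)           # column j = 0
    table = [prev[:]]
    for _ in range(1, num_cols):
        nxt = []
        for ra, s in zip(prev, step):
            ra += s
            if ra >= mod:      # both operands are < mod, so one
                ra -= mod      # subtraction replaces the modulo
            nxt.append(ra)
        prev = nxt
        table.append(prev[:])
    return table


if __name__ == "__main__":
    for column in ram_addresses_recursive([1, 7, 11, 13, 17], 7, 6):
        print(column)
```

Because every stored address and every step are already below p-1, their sum never reaches 2(p-1), so no multiplier or general modulo unit is required in the column-to-column update.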
Figure 7: Computing RAM address (RA) using SBMC approach.
Figure 8: Column by column recursive address computation.
The third approach is based entirely on segmentation based
modulo computation and is a non-recursive way to compute the
address $RA$. It can be applied directly to the term
$(j \times r(i)) \,\%\, (p-1)$ to get the address for the
register file. Since the parameter $r(i)$ can take only 22
values, prime numbers up to 89, the SBMC approach can be used
after applying some optimizations. The hardware required to
compute the function $(j \times r(i)) \,\%\, (p-1)$ using the
SBMC approach is shown in Figure 7. It needs more additions
than the recursive approach; however, it can also be used
directly to compute the intra-row permutation patterns in the
pre-processing phase, thus providing single clock cycle
support for computing each permutation pattern $S(j)$. This
approach is less suitable for very high clock frequencies
because of its longer critical path. Pipelining can improve
the performance of this scheme but at the same time introduces
higher control complexity.
The computed address is used to get the correct intra-row
permutation $U(i,j)$ from the RAM. It also needs to pass
through some exception handling logic to correct itself for
the special cases associated with $C = p$, $C = p+1$ or
$C = p-1$ as given in the algorithm. The final interleaved
addresses $Y^1_{i,j}$ and $Y^2_{i,j}$ can be found by
combining the inter-row permutation with the intra-row
permutation as follows:

$$Y^1_{i,j} = C \times r(i) + U^1(i,j) \qquad (5)$$

$$Y^2_{i,j} = C \times r(i) + U^2(i,j) \qquad (6)$$
The complete hardware for the generation of parallel
interleaved addresses and data handling is shown in Figure 9.
A FIFO buffer of size 12 is needed to handle the conflicts, as
discussed in the previous sub-section. During run time two
interleaved addresses are generated per clock cycle, except
when the block size is not exactly equal to $R \times C$. In
this case the data written into the RAM is zero padded and
data pruning is performed using the comparators.
4 Parallel Interleaver for DVB-SH
As an improvement over DVB-H, the DVB-SH standard
specification provides satellite services to handheld devices.
Along with other improvements, the FEC is also improved by the
inclusion of turbo coding. The interleaving associated with
the turbo code writes the whole block of information
sequentially into an array and reads the information back in
an order defined by the interleaving algorithm. The flow for
interleaver address computation is shown in Figure 10. There
are only two possible block sizes, i.e. N = 1146 or N = 12282,
with the parameter n being the smallest value such that
$2^{n+5} \ge N$. Two lookup tables, each having 32 entries,
are also provided for use in interleaving. The interleaving
algorithm can be described as follows:
Find the smallest parameter n such that $2^{n+5} \ge N$.
Initialize an (n+5) bit counter (called Z here).
Find T = Truncate(Z[(n+4):5] + 1) to n bits.
Obtain the table lookup output L = LUT{Z[4:0]}.
Multiply and truncate to find M = Truncate(T × L) to n bits.
Find the bit reverse BR = BitReverse{Z[4:0]}.
Find the tentative address by concatenation, TA = {BR, M}.
Compare TA with N and prune the address if TA ≥ N.
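The steps above map onto a few lines of software, sketched below. The 32-entry lookup tables of the standard are not reproduced in this paper, so the table is passed in by the caller and the values used in the demonstration are placeholders, not the DVB-SH tables. The function names are hypothetical.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def dvbsh_addresses(block_size, lut):
    """Generate interleaved addresses following the counter-based flow.

    lut must have 32 entries. Addresses >= block_size are pruned.
    """
    n = 0
    while (1 << (n + 5)) < block_size:   # smallest n with 2**(n+5) >= N
        n += 1
    mask = (1 << n) - 1
    out = []
    for z in range(1 << (n + 5)):        # (n+5)-bit counter Z
        low = z & 0x1F                   # Z[4:0]
        t = ((z >> 5) + 1) & mask        # T, truncated to n bits
        m = (t * lut[low]) & mask        # M, truncated to n bits
        ta = (bit_reverse(low, 5) << n) | m   # TA = {BR, M}
        if ta < block_size:              # pruning
            out.append(ta)
    return out


if __name__ == "__main__":
    placeholder_lut = [2 * k + 1 for k in range(32)]  # NOT the standard's table
    addresses = dvbsh_addresses(1146, placeholder_lut)
    print(len(addresses), addresses[:8])
```

With any table of odd entries the multiplication is invertible modulo 2^n, so the generated addresses form a permutation of 0 to N-1; the loop also makes visible that 2^(n+5) counter steps are spent to obtain N valid addresses.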
Figure 9: HW for parallel interleaver address generation in HSPA+.
Figure 10: Interleaver address computation flow for DVB-SH.
Figure 11: Total conflicts with and without misalignment (DVB-SH). (Panels: memory conflicts for N=1146 and N=12282 with 2, 4 and 8 SISO blocks.)
The latency in turbo coding can be high because of the large
block size of 12282, which calls for parallel SISO processing.
The parallel address generation involves a number of conflicts
which need to be resolved on-the-fly. The final architecture
proposed can support up to 8 parallel SISO blocks. The
following sub-sections provide a detailed conflict analysis
and the hardware for parallel interleaver address generation
in DVB-SH.
4.1 Memory Conflict Analysis for DVB-SH Interleaver
The interleaved addresses generated in parallel for DVB-SH are
not conflict free. Figure 11 shows the number of conflicts for
the two block sizes. When the misalignment method is applied
to the address and data streams, the conflict percentage
reduces significantly.
Further analysis of the conflicts on each memory (N=12282
only) is provided in Figure 12 and Figure 13. Although
misalignment reduces the number of conflicts, it increases the
number of FIFO registers required to handle them. The analysis
reveals that with misalignment the distribution of conflicts
for a particular memory over a range of block size becomes
irregular, which increases the requirement for FIFO cells.
Without misalignment, on the other hand, the number of
conflicts is large but they are uniformly distributed among
the different memories, and hence it is more hardware
efficient to use the scheme without misalignment.
4.2 HW for Parallel Interleaver in DVB-SH
Because the algorithm computes many invalid addresses, the
total number of clock cycles needed to complete one block is
much higher than the block size. Simulation reveals that the
block size N=1146 needs 2045 cycles whereas N=12282 needs
16382 cycles to obtain as many valid values as the block size.
Our first target is to reduce the invalid computations and
then generate the parallel addresses to further enhance the
throughput. Instead of treating the information as one long
array, we consider it as a block with a total number of rows R
of 32 (corresponding to the 32 entries in the lookup table)
and a total number of columns C of 64 and 512 for N = 1146 and
12282 respectively. The generation of valid addresses within
one row is very systematic, which can be exploited to reduce
the number of invalid addresses. Simulation results show that
about 14 computations for N = 1146 and 8 computations for
N = 12282 within a row can be completely discarded. This also
reduces the lookup table requirement to 18 and 24 entries for
N = 1146 and N = 12282 respectively. The new lookup tables
($T_{bas}$) and the corresponding increment step for the row
counter are given in Table 2. After skipping the unnecessary
computations, which appear at regular intervals, we end up
with total computations of 1152 and 12288 for N = 1146 and
12282 respectively. Only six extra computations are needed to
get the whole range of addresses, and at the same time we get
block sizes that are exactly divisible by 8.
Figure 12: Memory conflicts and FIFO size requirement for DVB-SH (N=12282; with misalignment).
Figure 13: Memory conflicts and FIFO size requirement for DVB-SH (N=12282; without misalignment).
Table 2: Lookup Table for Basic and Parallel Address Generation (DVB).
No    N = 1146                         N = 12282
(j)   Effective Rc  Tbas  Taux  Inc.   Effective Rc  Tbas  Taux  Inc.
0     0             3     3     1      0             13    5     1
1     1             27    3     1      1             335   7     1
2     2             15    7     2      2             87    7     2
3     4             29    5     2      4             15    7     1
4     6             1     1     2      5             1     1     1
5     8             3     3     2      6             333   5     2
6     10            15    7     2      8             13    5     1
7     12            17    1     2      9             1     1     1
8     14            39    7     2      10            121   1     2
9     16            19    3     1      12            1     1     1
10    17            27    3     1      13            175   7     1
11    18            15    7     2      14            421   5     2
12    20            45    5     2      16            509   5     1
13    22            33    1     2      17            215   7     1
14    24            13    5     2      18            47    7     2
15    26            15    7     2      20            295   7     1
16    28            17    1     2      21            229   5     1
17    30            15    7     2      22            427   6     2
18    --            --    --    --     24            409   1     1
19    --            --    --    --     25            387   6     1
20    --            --    --    --     26            193   1     2
21    --            --    --    --     28            501   5     1
22    --            --    --    --     29            313   1     1
23    --            --    --    --     30            489   1     2
For parallel address generation, the computation-intensive
part is computing $M$ in the algorithm. With the
straightforward method, generating 8 parallel values requires
up to 8 multipliers. We present here a novel approach that
uses recursive computation of $M$. After obtaining the
parameter $M$ for the first SISO, the rest of the $M$ values
for the parallel addresses can be computed recursively over
the whole column size as shown in Figure 8. The lookup table
entries required to produce the correct $M$ values for the
parallel addresses at each row instant are named $T_{aux}$ and
are given in Table 2. The modified algorithm supporting
generation of 8 parallel addresses ($I_m$) is given as
Algorithm 1 and the hardware for computing the parallel
addresses is shown in Figure 14.
5 Parallel Interleaver for 3GPP-LTE
The channel coding in LTE involves a turbo code with an
internal interleaver based on a quadratic permutation
polynomial (QPP). QPP interleavers have a very compact
representation and also exhibit a structure that allows easy
analysis of their properties. The internal interleaver for the
turbo code is specified by the following quadratic permutation
polynomial:

$$I_{(x)} = (f_1 \cdot x + f_2 \cdot x^2) \,\%\, N \qquad (7)$$

Here $x = 0, 1, 2, \ldots, (N-1)$, where N is the block size.
This polynomial provides deterministic interleaver behavior
for different block sizes with appropriate values of $f_1$ and
$f_2$. Direct implementation of the permutation polynomial
given in eq. (7) is inefficient due to multiplications, the
modulo function and the bit growth problem. The simplified
hardware solution is to use the recursive approach described
in the following sub-section.
5.1 Basic Interleaver
Eq. (7) can be re-written for recursive computation as:
$$ I_{(x+1)} = \left( I_{(x)} + g_{(x)} \right) \bmod N \qquad (8) $$
where $g_{(x)} = (f_1 + f_2 + 2 \cdot f_2 \cdot x) \bmod N$, which can also be
computed recursively as $g_{(x+1)} = (g_{(x)} + 2 \cdot f_2) \bmod N$. The two
recursive terms $I_{(x+1)}$ and $g_{(x+1)}$ are easy to compute and they
provide the basic interleaver functionality for LTE.
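The recursion of eq. (8) can be checked against the direct form of eq. (7) with a short sketch. The function names are illustrative, and the parameter set N = 40, f1 = 3, f2 = 10 used in the check is an assumed example rather than a value quoted in this paper. Because every operand is already reduced modulo N, each modulo becomes a single conditional subtraction.

```python
def acc_mod(a, b, n):
    """Add two values that are already below n and reduce modulo n."""
    s = a + b
    return s - n if s >= n else s


def qpp_direct(n, f1, f2):
    """Eq. (7): needs multiplications and a wide modulo."""
    return [(f1 * x + f2 * x * x) % n for x in range(n)]


def qpp_recursive(n, f1, f2):
    """Eq. (8): two accumulate-and-modulo steps per address."""
    addr = 0                   # I(0) = 0
    g = (f1 + f2) % n          # g(0) = f1 + f2
    step = (2 * f2) % n        # constant increment of g
    out = []
    for _ in range(n):
        out.append(addr)
        addr = acc_mod(addr, g, n)   # I(x+1) = (I(x) + g(x)) mod N
        g = acc_mod(g, step, n)      # g(x+1) = (g(x) + 2*f2) mod N
    return out
```

The two adders in the loop body correspond to the two recursive terms named in the text.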
5.2 HW for Parallel Interleaver in UMTS-LTE
The interleaver used in LTE is special in the sense that it is
inherently conflict free when parallel interleaving is required,
which eases the implementation of parallel turbo
decoding. Parallel interleaver addresses can be generated by
replicating the hardware shown in Figure 15, with support
from a LUT providing the starting
values. In this case a total of 32 additions are needed; however,
Figure 14: HW supporting 8 parallel interleaver addresses for DVB-SH.
Algorithm 1: Modified Algorithm for Parallel Address Generation
in DVB-SH.
Initialization:
  if (N = 1146):      n = 6; R_T = 18; C_T = 64;
  elseif (N = 12282): n = 9; R_T = 24; C_T = 512;
  R_c(0) = 0;
  loop p: 1 to R_T
    M_1(0, p) = 0
  end loop p
Execution:
  loop i: 1 to C_T
    loop j: 1 to R_T
      R_c(j) = (R_c(j-1) + Inc(j)) % 32
      M_1(i, j) = (T_bas(j) + M_1(i-1, j)) % C_T
      I_1(i, j) = M_1(i, j) + 2^n . j_flipped
      loop m: 2 to 8
        M_m(i, j) = (T_aux(j) + M_(m-1)(i, j)) % C_T
        I_m(i, j) = M_m(i, j) + 2^n . j_flipped
      end loop m
    end loop j
  end loop i
to achieve a low cost solution we present here hardware in which
the first part is reused for multiple stages. This optimized
hardware uses 18 additions in total to generate 8 parallel
interleaver addresses, thus saving 14 additions.
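The conflict-free property can be illustrated with a small sketch that replicates the basic recursion once per SISO, which corresponds to the straightforward replicated arrangement rather than the optimized 18-addition hardware. The starting values that a LUT would provide are computed directly here, the bank of an address is assumed to be the address divided by the sub-block size, and the parameters N = 40, f1 = 3, f2 = 10 with 8 SISOs are an assumed example.

```python
def acc_mod(a, b, n):
    """Add two values that are already below n and reduce modulo n."""
    s = a + b
    return s - n if s >= n else s


def qpp_parallel(n, f1, f2, num_siso):
    """One row of num_siso interleaved addresses per clock step."""
    w = n // num_siso                                     # sub-block size
    starts = [m * w for m in range(num_siso)]
    addr = [(f1 * s + f2 * s * s) % n for s in starts]    # stand-in for the LUT
    g = [(f1 + f2 + 2 * f2 * s) % n for s in starts]      # g at each start index
    step = (2 * f2) % n
    rows = []
    for _ in range(w):
        rows.append(list(addr))
        addr = [acc_mod(a, gi, n) for a, gi in zip(addr, g)]
        g = [acc_mod(gi, step, n) for gi in g]
    return rows


def is_conflict_free(rows, w):
    """True if no two addresses of any row fall into the same bank (addr // w)."""
    return all(len({a // w for a in row}) == len(row) for row in rows)
```

Replicating the recursion this way costs two additions per SISO and step, plus the additions needed to form the starting values.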
6 Parallel Interleaver for WiMAX
The IEEE 802.16e standard [6], known as WiMAX, adds
mobility features to broadband technology with higher
transmission speeds, i.e. up to 70 Mb/sec. Convolutional turbo
coding (CTC) is adopted in the FEC block, which requires an
internal interleaver.
6.1 CTC Interleaver
CTC, also termed duo-binary turbo coding, can offer many
advantages, such as performance, over the classical single-binary
turbo codes. The work in [19] uses some algorithmic
simplifications to implement the CTC interleaver, which make
it possible to compute the interleaver address recursively. The
interleaver in the duo-binary turbo codes works on pairs of bits.
The presence of the CTC interleaver guarantees that the two output
parts, the systematic output and the parity output, become completely
un-correlated during transmission. The parameters defining the
interleaver function as described in [6] are designated as
$P_0$, $P_1$, $P_2$ and $P_3$. The two steps of interleaving are described below:
Step 1: Let the incoming sequence be
$$ u_0 = [(A_0, B_0), (A_1, B_1), (A_2, B_2), \ldots, (A_{N-1}, B_{N-1})] $$
for $i = 0, \ldots, N-1$: if $(i \bmod 2 = 1)$ then $(A_i, B_i) \rightarrow (B_i, A_i)$.
The new sequence is
$$ u_1 = [(A_0, B_0), (B_1, A_1), (A_2, B_2), (B_3, A_3), \ldots, (B_{N-1}, A_{N-1})] $$
Step 2: The function $I_{(x)}$ providing the mapping address is
defined by a set of 4 expressions with a switch selection:
for $x = 0, \ldots, N-1$, switch $(x \bmod 4)$:
case 0: $I_{(x \bmod 4 = 0)} = (P_0 \cdot x + 1) \bmod N$
case 1: $I_{(x \bmod 4 = 1)} = (P_0 \cdot x + 1 + N/2 + P_1) \bmod N$
case 2: $I_{(x \bmod 4 = 2)} = (P_0 \cdot x + 1 + P_2) \bmod N$
case 3: $I_{(x \bmod 4 = 3)} = (P_0 \cdot x + 1 + N/2 + P_3) \bmod N$
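Combining the four equations, $I_{(x)}$ becomes:
$$ I_{(x)} = \left( \theta_{(x)} + Q_{(x)} \right) \bmod N \qquad (9) $$
where the accumulated term, written here as $\theta_{(x)}$, can be computed
using recursion, i.e. $\theta_{(x+1)} = (\theta_{(x)} + P_0) \bmod N$ with the
initialization $\theta_{(0)} = 0$, and $Q_{(x)}$ is given by:
$$ Q_{(x)} = \begin{cases}
1 & \text{if } (x \bmod 4 = 0) \\
1 + N/2 + P_1 & \text{if } (x \bmod 4 = 1) \\
1 + P_2 & \text{if } (x \bmod 4 = 2) \\
1 + N/2 + P_3 & \text{if } (x \bmod 4 = 3)
\end{cases} $$
As both $\theta_{(x)}$ and $Q_{(x)}$ are less than $N$, their sum is below $2N$, and thus
$I_{(x)}$ can be computed by using additions only.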
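A minimal sketch of eq. (9), compared against the four-case definition of Step 2, is given below. The function names are illustrative, the four offsets are reduced modulo N beforehand so that the "less than N" condition holds, and the parameter set N = 24, P0 = 5, P1 = P2 = P3 = 0 is an assumed example rather than a value quoted in this paper.

```python
def acc_mod(a, b, n):
    """Add two values that are already below n and reduce modulo n."""
    s = a + b
    return s - n if s >= n else s


def ctc_direct(n, p0, p1, p2, p3):
    """Four-case switch of Step 2, using a multiplication per address."""
    offsets = (1, 1 + n // 2 + p1, 1 + p2, 1 + n // 2 + p3)
    return [(p0 * x + offsets[x % 4]) % n for x in range(n)]


def ctc_recursive(n, p0, p1, p2, p3):
    """Eq. (9): accumulate P0, then add the offset selected by x mod 4."""
    q = [v % n for v in (1, 1 + n // 2 + p1, 1 + p2, 1 + n // 2 + p3)]
    p0 %= n
    theta = 0                                  # accumulated term, starts at 0
    out = []
    for x in range(n):
        out.append(acc_mod(theta, q[x % 4], n))
        theta = acc_mod(theta, p0, n)
    return out
```

Only additions and conditional subtractions appear in the loop, which is the property the text relies on.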
Figure 15: HW for parallel interleaving in 3GPP-LTE.
Figure 16: Conflict count and FIFO requirement for N=108 (WiMAX). [Bar chart for 2 SISO and 4 SISO; series: total conflicts, conflicts in M1 to M4, FIFO sizes F1 to F4.]
Algorithm 2: Proposed Parallel Address Generation Algorithm
for WiMAX.
Initialization:
  N_sub_blk = N / (No. of SISO)
  init theta(0) = 0 and I(0) = 0
Execution:
  loop j: 1 to N_sub_blk
    theta(j) = (theta(j-1) + P_0) % N
    I_1(x) = (theta(j) + Q(x)) % N
    loop m: 2 to 8
      I_m(x) = (I_(m-1)(x) + N_sub_blk) % N
    end loop m
  end loop j
6.2 HW for Parallel Interleaver in WiMAX
The maximum block size in the normal transmission scheme
(i.e. non H-ARQ) is 240; thus up to 4 parallel SISO processors
are enough to achieve sufficient bandwidth. H-ARQ, on
the other hand, involves bigger block sizes (max. of
2400), so up to 8 parallel SISO processors are useful to
reduce the latency and enhance the overall bandwidth. While
generating parallel interleaver addresses, the only case which
produces memory conflicts is the block size N=108
(corresponding to QPSK rate 3/4 or 64-QAM rate 3/4). All the
other block sizes support parallelism and are
inherently conflict free. The number of conflicts and the optimal
size of the FIFO registers required to handle them are
shown in Figure 16.
The parallel interleaver addresses other than those of the basic
interleaver can be generated by successive computations, each based
on the result from its predecessor, as given in Algorithm 2. The
hardware for the generation of up to 8 parallel interleaved
addresses for WiMAX is shown in Figure 17. While generating
8 parallel addresses for the block sizes supported by H-ARQ,
some permutation among the generated addresses is required for
selected block sizes (N=480 & N=960), as shown in Table 3.
All the other block sizes have a straight one-to-one mapping
onto the different memories.
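The successive computation can be sketched as follows, under the assumption that each further address is obtained from its predecessor by adding the sub-block size modulo N and that the memory bank is the address divided by the sub-block size. The function names are illustrative and the parameter set N = 24, P0 = 5, P1 = P2 = P3 = 0 with 4 SISOs is an assumed example. For this particular set the m-th generated address coincides with the interleaved address of index x + (m - 1) * N_sub_blk; this need not hold for other parameter sets, where a re-ordering such as the one in Table 3 may be needed.

```python
def acc_mod(a, b, n):
    """Add two values that are already below n and reduce modulo n."""
    s = a + b
    return s - n if s >= n else s


def ctc_parallel(n, p0, p1, p2, p3, num_siso):
    """Rows of num_siso addresses; each one derived from its predecessor."""
    w = n // num_siso                          # sub-block size
    q = [v % n for v in (1, 1 + n // 2 + p1, 1 + p2, 1 + n // 2 + p3)]
    p0 %= n
    theta = 0
    rows = []
    for x in range(w):
        row = [acc_mod(theta, q[x % 4], n)]    # basic interleaver address
        for _ in range(1, num_siso):
            row.append(acc_mod(row[-1], w, n)) # predecessor + sub-block size
        rows.append(row)
        theta = acc_mod(theta, p0, n)
    return rows
```

Since consecutive addresses in a row differ by exactly one sub-block size, they always land in different banks under the assumed bank mapping.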
7 Unified Parallel Interleaver Architecture
The interleaver parallelism for the individual standards has been
explored in detail in the earlier sections, and the number of
parallel SISO processors to be supported is summarized in
Table 4. The main focus of the work has been to adopt a
methodology which results in common computing elements,
thus preparing the ground for efficient hardware multiplexing. As
a result we conclude that an accumulation
followed by modulo logic (acc_mod) is the common
computing element. Therefore an array of acc_mod elements,
re-configurable for different combinations, serves as
the main part of the complete computing core, along with an
auxiliary part consisting of a multiplier and a comparator. As
a mandatory part for some of the applications, a circular buffer
of size 24 also needs to be incorporated with the
acc_mod array. The address computation is followed by the conflict
resolution, which mainly comprises multiplexers and shift
registers. The complete hardware block diagram for the
re-configurable parallel address generation is shown in Figure 18.
The second part of the conflict management is a FIFO
register bank dedicated to each memory. The size of the FIFO
register bank could have been very large, but we reduced it
by using progressive writes whenever there is a conflict in
other memories. Covering all the standards, the FIFO size
requirement for the different memories after applying progressive
writes is given in Table 5. The role of the controller is very
important in achieving this goal, as it continuously checks for
empty slots for the corresponding memory. If some other
memory has a conflict at a certain time instant so that the
corresponding memory is free, the controller initiates the
left-over writes for this memory. The other main task handled by
the controller is to control the sequence of operations during the
pre-processing and execution phases. The pre-processing includes
Figure 17: Hardware for parallel interleaving in WiMAX.
Table 3: Permutations for Correct Memory Mapping in WiMAX.
Parallel Address Generation Sequence   I1  I2  I3  I4  I5  I6  I7  I8
Permutation for N = 480 & N = 960      1   6   3   8   5   2   7   4
Figure 18: Unified parallel interleaver architecture.
computation of the necessary parameters when changing the
standard or block size. During the execution phase the
controller keeps track of the block size by employing row and column
counters, thus providing the block synchronization required for
each type of interleaver implementation.
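The effect of progressive writes on the FIFO depth can be illustrated with a simplified model. It assumes, beyond what the text specifies, that each memory accepts one write per cycle, that all incoming writes pass through that memory's FIFO, and that a memory with pending entries drains one of them in every cycle, including cycles in which it receives no new address. The function name, the bank mapping (address divided by bank size) and the input stream are hypothetical.

```python
from collections import deque


def simulate_fifos(cycles, bank_size, num_banks):
    """Return per-bank conflict counts, maximum FIFO depths and the write order."""
    fifos = [deque() for _ in range(num_banks)]
    conflicts = [0] * num_banks
    max_depth = [0] * num_banks
    written = []

    def service():
        # Every bank performs at most one write per cycle.
        for b, fifo in enumerate(fifos):
            if fifo:
                written.append(fifo.popleft())
            max_depth[b] = max(max_depth[b], len(fifo))

    for addrs in cycles:
        incoming = [0] * num_banks
        for a in addrs:
            b = a // bank_size
            fifos[b].append(a)
            incoming[b] += 1
        for b in range(num_banks):
            if incoming[b] > 1:
                conflicts[b] += incoming[b] - 1   # extra writes to a busy bank
        service()

    while any(fifos):                             # flush left-over writes
        service()
    return conflicts, max_depth, written
```

In this model a bank that is idle in a given cycle uses the free slot to empty its FIFO, so the required depth stays close to the size of the largest burst of conflicts rather than their total count.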
There are 8 memories (M1 to M8) considered as part
of the interleaver, each having a size of 1536 x 5b to cover the
complete range of block sizes for all standards. The memory
selection is mainly made by using the MSB part of the generated
address or through comparison.
8 Implementation Results
By exploiting hardware re-use across the different
implementations, the final architecture shown in Figure 18
achieves the objective of being low cost. The design is
specified in Verilog HDL and synthesized using 65nm CMOS
technology. The synthesis results for two cases, i.e. with 6K
memory and 12K memory, are summarized in Table 6. The 6K
memory size covers all the cases except one, i.e. N=12282 for
DVB-SH; thus the bigger memory can be used if this block size
is specifically required. The chip core layout is shown in
Figure 19 and it utilizes 0.085 mm² area in total for the 6K memory
case. It can operate at a frequency of 200 MHz and consumes
12 mW power in total, where most of the power is consumed by
the memories. The address generation hardware for the
re-configurable parallel interleaver is silicon efficient
and consumes little power (18090 µm² and 0.56 mW, see Table 6),
which makes it a good choice for
future implementations targeting multi-mode operation.
9 Conclusion
In this paper a multi-mode parallel interleaver architecture
targeting parallel SISO decoding for different standards has
been presented. The design is low cost and supports high
frequency at low power. The conflicts occurring while
generating parallel addresses have been managed by
incorporating different schemes. The interleaver address
generation hardware for the different standards has been modified
in such a way that efficient hardware multiplexing can be
achieved. The functionality of a parallel turbo decoder for
different standards is more or less the same, but parallel
interleaving appears to be the bottleneck in reaching a unified
version of parallel turbo decoding. The proposed architecture
plays a vital role in achieving unified parallel turbo decoding
with multi-standard support by using the existing architectures.
References
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon
limit error-correcting coding and decoding: Turbo-codes,”
Proceedings of IEEE ICC, May 1993, vol. 2, pp. 1064 - 1070.
[2] 3GPP, “Technical Specification Group Radio Access Network;
Multiplexing and Channel Coding (FDD) (25.212 V5.9.0),”
June 2004.
Table 4: Interleaver Parallelism in Different Standards.
Parameter               HSPA+   DVB-SH      3GPP-LTE    WiMAX
Block Size (Max)        5114    12282       6144        2400
Data Rate (Mbps)        43.2    50          100         70
Parallel SISO Support   2       2, 4 or 8   2, 4 or 8   2, 4 or 8
Table 5: FIFO Size Requirement for Different Memories.
Scheme    Parameter   M1    M2    M3    M4    M5    M6    M7    M8
HSPA+     Conflicts   0     4     39    0     22    17    --    --
          FIFO Size   0     4     12    0     1     1     --    --
DVB-SH    Conflicts   3072  3072  2304  2304  1344  1344  1344  1344
          FIFO Size   2     2     1     1     1     1     1     1
LTE       Conflicts   0     0     0     0     0     0     0     0
          FIFO Size   0     0     0     0     0     0     0     0
WiMAX     Conflicts   26    26    13    13    --    --    --    --
          FIFO Size   4     4     2     2     --    --    --    --
Final FIFO Size       4     4     12    2     1     1     1     1
Table 6: Summary of Implementation Results.
Parameter                 Value with 6K Mem     Value with 12K Mem
Total Area                84330 µm²             123810 µm²
Total Power Consumption   12.04 mW              12.65 mW
Memory Configuration      768 x 5b x 8          1536 x 5b x 8
Memory Area               66240 µm² (78.5 %)    105720 µm² (85.3 %)
Memory Power              11.48 mW (95.3 %)     12.09 mW (95.5 %)
AGU Area                  18090 µm² (both cases)
AGU Power                 0.56 mW (both cases)
Clock Rate                200 MHz (both cases)
Figure 19: Layout snapshot of proposed unified interleaver.
[3] 3GPP, “Technical Specification Group Radio Access Network;
Multiplexing and Channel Coding (FDD) (25.212 V8.4.0),”
Dec. 2008.
[4] ETSI EN 302-583 V1.1.1, “Digital Video Broadcasting (DVB);
Framing Structure, Channel Coding and Modulation for Satellite
Services to Handheld Devices (SH) below 3 GHz,” March 2008.
[5] 3GPP-LTE, “Technical Specification Group Radio Access
Network; Evolved Universal Terrestrial Radio Access (E-UTRA);
Multiplexing and Channel Coding,” Release 8, 3GPP
TS 36.212 v8.0.0, (2007-09).
[6] IEEE 802.16e–2005, “IEEE Standard for Local and
Metropolitan Area Networks, Part 16: Air Interface for Fixed
Broadband Wireless Access Systems – Amendment 2: Medium
Access Control Layers for Combined Fixed and Mobile
Operations in Licensed Bands”.
[7] Y. Zhang and K.K. Parhi, “Parallel turbo decoding,”
Proceedings of ISCAS, May 2004, vol. 2, pp. II-509-512.
[8] Y. C. Lu and E.H. Lu, “A parallel decoder design for low
latency turbo decoding,” Proceedings of ICICIC, Sept. 2007, pp.
386-389.
[9] Y. Sun, Y. Zhu, M. Goel and J. R. Cavallaro, “Configurable and
scalable high throughput turbo decoder architecture for multiple
4G wireless standards,” Proceedings of ASAP 2008, pp. 209 –
214.
[10] Cheng-Hung Lin, Chun-Yu Chen and An-Yeu Wu, “High-
Throughput 12-Mode CTC Decoder for WiMAX Standard,”
Proceedings of IEEE VLSI-DAT, April 2008, pp. 216 – 219.
[11] R. Dobkin, M. Peleg and R. Ginosar, “Parallel interleaver
design and VLSI architecture for low-latency MAP turbo
decoders,” IEEE Transaction on VLSI Systems, vol. 13, no. 4,
April 2005, pp. 427 – 438.
[12] A. Giulietti, L. van der Perre, M. Strum, “Parallel turbo coding
interleavers: avoiding collisions in accesses to storage
elements,” IEE Electronic Letters, Vol. 38, No. 5, pp. 232-234,
Feb. 2002.
[13] J. Kwak, S-M. Park, S-S. Yoon and K. Lee, “Implementation of
a parallel turbo decoder with dividable interleaver,” Proceedings
of ISCAS, May 2003, vol. 2, pp. 65-68.
[14] M. Shin and I.-C. Park, “Processor-based turbo interleaver for
multiple third-generation wireless standards,” IEEE
Communication Letters, vol. 7, no. 5, May 2003, pp. 210 – 212.
[15] P. Ampadu and K. Kornegay, “An efficient hardware interleaver
for 3G turbo decoding,” RAWCON, August 2003, pp. 199 –
201.
[16] R. Asghar and D. Liu, “Very low cost configurable hardware
interleaver for 3G turbo decoding,” Proceedings of ICTTA,
April 2008, pp. 1 – 5.
[17] R. Asghar and D. Liu, “Dual standard re-configurable hardware
interleaver for turbo decoding,” Proceedings of ISWPC, May
2008, pp. 768 – 772.
[18] Z. Wang and Q. Li, “Very low-complexity hardware interleaver
for turbo decoding,” IEEE Transaction on Circuits and System –
II: vol. 54, no. 7, July 2007, pp. 636 – 640.
[19] R. Asghar and D. Liu, “Low complexity multi mode interleaver
core for WiMAX with support for convolutional interleaving,”
Journal of Elec., Comm. and Computer Engineering, vol. 3, no.
1, 2009, pp. 20 – 29.
[20] G. R. Blakley, “A computer algorithm for calculating the
product A*B mod M,” IEEE Transaction on Computers, vol.C-
32, No.5, May 1983, pp.497–500.
[21] R. Asghar, D. Wu, J. Eilert and D. Liu, “Memory conflict
analysis and interleaver design for parallel turbo decoding
supporting HSPA evolution,” Accepted for publication in 12th
Euromicro DSD-2009, August 2009.
[22] F. Speziali and J. Zory, “Scalable and area efficient concurrent
interleaver for high throughput turbo-decoders,” Proceedings of
Euromicro DSD-2004, pp. 334 - 341.
[23] M. Martina, M. Nicola and G. Masera, “Hardware design of a
low complexity, parallel interleaver for WiMAX Duo-Binary
turbo decoding,” IEEE Communication Letters, vol. 12, no. 11,
pp. 846 - 848, November 2008.
Rizwan Asghar is working toward a Ph.D. degree at the Department of
Electrical Engineering, Linköping University, Sweden. He received
the M.Sc. degree in Physics from Quaid-i-Azam University,
Islamabad, Pakistan, and the M.S. degree in Computer Engineering from
the Center for Advanced Studies in Engineering, Islamabad, affiliated
with U.E.T. Taxila, Pakistan. His research activity is mainly focused
on flexible and re-configurable FEC sub-systems for baseband
processors.
Di Wu received M.Sc. degrees in electrical engineering from Beijing
University of Posts and Telecoms and from Linköpings Universitet,
Linköping, Sweden, in 2003 and 2005 respectively. He is currently a
Ph.D. candidate at the Division of Computer Engineering at the
Department of Electrical Engineering at the same university. His
research interests include application specific processor design and
fast prototyping of wireless systems.
Johan Eilert received his M.Sc. in computer science and engineering
from Linköpings Universitet, Linköping, Sweden, in 2003. He is
currently a Ph.D. candidate at the Division of Computer Engineering
at the Department of Electrical Engineering at the same university.
His research interests include application specific processors for
multimedia signal processing and MIMO baseband signal processing.
Dake Liu is Professor and the Director of the Computer Engineering
Division at the Department of Electrical Engineering of Linköping
University, Sweden. He received his Doctor of Technology degree from
Linköping University, Sweden, in 1995. He is an IEEE senior member.
Dake has published more than 100 papers in journals and international
conferences and holds 5 US patents. Dake’s research interests are
high-performance low-power ASIP (application specific instruction
set processors), integration of on-chip multi-processors for
communications, and media digital signal processing. Dake also has
experience in the design of communication systems and radio
frequency CMOS integrated circuits. Dake Liu is the co-founder and
CTO of FreehandDSP AB, Stockholm, Sweden. FreehandDSP was
acquired by VIA technologies (http://www.viatech.com/en/index.jsp)
in 2001. Dake Liu is currently the co-founder and CTO of Coresonic
AB (http://www.coresonic.com/), Linköping, Sweden.