Cost-Effective and Variable-Channel FastICA Hardware Architecture and Implementation for EEG Signal Processing
Lan-Da Van & Po-Yen Huang & Tsung-Che Lu
Received: 1 July 2014 / Revised: 18 February 2015 / Accepted: 23 February 2015 / Published online: 21 March 2015
© Springer Science+Business Media New York 2015
Abstract This paper proposes a cost-effective and variable-channel floating-point fast independent component analysis (FastICA) hardware architecture and implementation for EEG signal processing. The Gram-Schmidt orthonormalization based whitening process is utilized to eliminate the use of dedicated hardware for eigenvalue decomposition (EVD) in the FastICA algorithm. The proposed two processing units, PU1 and PU2, in the presented FastICA hardware architecture can be reused for the centering operation of preprocessing and the updating step of the fixed-point algorithm of the FastICA algorithm, and PU1 is reused for the Gram-Schmidt orthonormalization operation of preprocessing and the fixed-point algorithm to reduce the hardware cost and support 2-to-16 channel FastICA. Apart from the FastICA processing, the proposed hardware architecture supports re-reference, synchronized average, and moving average functions. The cost-effective and variable-channel FastICA hardware architecture is implemented in a 90 nm 1P9M complementary metal-oxide-semiconductor (CMOS) process. As a result, the FastICA hardware implementation consumes 19.4 mW at 100 MHz with a 1.0 V supply voltage. The core size of the chip is 1.43 mm². From the experimental results, the presented work achieves satisfactory performance for each function.
Keywords Cost effective · EEG signal processing · FastICA algorithm · Gram-Schmidt · Hardware architecture · Variable channel
1 Introduction
Electroencephalogram (EEG) research plays an important role in exploring the relation between human behavior and the brain [1, 2]. The EEG signals are the electrical potential recordings captured from the scalp sensors, where the electrical potentials mean that the components of the brain activity [3] are mixed. In addition, the EEG signals may be contaminated by the artifacts of eye activity or muscular activity [4–10]. Therefore, analyzing EEG signals becomes a challenge. In brain research, the independent component analysis (ICA) algorithm is regarded as a useful method to study the brain activity through EEG signals [4–10]. The ICA algorithm is developed to solve the blind source separation (BSS) problem such that the mixed-up signals can be separated [11–19] and the brain activity information can be revealed.
Hardware architecture and implementation of ICA algorithms are challenging when considering cost, flexibility, speed, and/or power. Many works have proposed the following ICA hardware implementations. Kim et al. [20] proposed a field-programmable gate array (FPGA) implementation of the 2-channel BSS constructed by the adaptive noise canceling (ANC) module. The 2-channel ANC model consumes 98.8 mW at 12.288 MHz at 1.8 V in FPGA. Du et al. [21–24] proposed a parallel ICA algorithm and the corresponding FPGA implementation on a Pilchard board, where several sub-processes run with multiple data parallelism. Jain et al. [25] proposed a parallel architecture for the FastICA algorithm. A floating-point arithmetic unit is adopted in the pipelined FastICA design of [26] to increase the precision. One hardware-efficient fixed-point FastICA is addressed in [27]. Fixed-point arithmetic is used in an FPGA-based INFOMAX ICA design [28]. Acharyya et al. [29] proposed a pioneering coordinate rotation digital computer (CORDIC)-based [30] n-dimensional FastICA algorithm and architecture by reusing the CORDIC module. Although the proposed architecture [29] has low-complexity characteristics by the algorithm analysis, the post-layout result of the proposed overall architecture is not available. Van et al. [31] proposed an energy-efficient eight-channel FastICA implementation with an early determination scheme. The use of the dedicated eigenvalue decomposition (EVD) processor for eight-channel preprocessing in [31] adds overhead to the hardware cost. Besides, it is predicted that the computational complexity of the EVD based whitening process increases greatly when a higher channel number is requested [31]. Yang et al. [32] proposed a low-power eight-channel FastICA processor for epileptic seizure detection with fixed-point arithmetic. Roh et al. [33] proposed a 16-channel self-configured ICA implementation for a wearable neuro-feedback system.

L.-D. Van (*) · P.-Y. Huang · T.-C. Lu, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan 300, Republic of China. e-mail: [email protected]

J Sign Process Syst (2016) 82:91–113. DOI 10.1007/s11265-015-0988-2
It has been pointed out that the Gram-Schmidt [34] based whitening described in [12, 35, 36] may be applied to the solution of the BSS problem [35]. Concerning the computational complexity and hardware cost, in this paper the Gram-Schmidt orthonormalization is applied in the whitening process and the fixed-point algorithm. Thus, it is expected that the computational complexity can be reduced and the hardware resources can be reused. Note that the Gram-Schmidt orthonormalization has been adopted for the fixed-point iteration in the principal component analysis (PCA) algorithm to reduce the dimensions in [37]. To the best of our knowledge, the state-of-the-art dedicated hardware implementations do not adopt the Gram-Schmidt based whitening for FastICA processing. On the other hand, for more flexibility, it is desirable to support variable channel selection and provide re-reference [1], synchronized average [1], and moving average [2] functions for EEG signal processing. Thus, we are motivated to propose a cost-effective FastICA architecture [38] that supports variable-channel FastICA operations and re-reference, synchronized average, and moving average functions. Herein, two reused processing units (PUs) are proposed to achieve a cost-effective and variable-channel FastICA hardware implementation. The main contributions of this work are summarized as follows:
1) Propose a cost-effective and 2-to-16 channel floating-point FastICA architecture with two new reused PUs adopting Gram-Schmidt based whitening.

2) Implement the FastICA hardware architecture in the application-specific integrated circuit (ASIC) approach to support variable-channel FastICA, re-reference, synchronized average, and moving average functions.
As a result, the proposed FastICA architecture is implemented in the TSMC 90 nm 1P9M complementary metal-oxide-semiconductor (CMOS) process with a core size of 1.43 mm². The power consumption for 16-channel processing is 19.4 mW at 100 MHz.

The rest of this paper is organized as follows. The background for the FastICA algorithm is described in Section 2. The Gram-Schmidt orthonormalization based whitening is described and analyzed in Section 3. Next, the cost-effective and variable-channel FastICA architecture with the proposed PUs and the corresponding post-layout implementation are described in Section 4. In Section 5, we show the software and the corresponding post-layout simulation results for the validation of the proposed hardware implementation. Last, the conclusion and future work are given in Section 6.
2 Background for the FastICA Algorithm
2.1 Independent Component Analysis
Considering that the number of blind source signals and the sample number are n and m, respectively, the BSS system can be modeled as (1) [26, 29, 31].

$$X = AS \qquad (1)$$

where $X$, $A$, and $S$ denote an n by m mixed-signal matrix, an n by n mixing matrix, and an n by m blind-source-signal matrix, respectively. $X$ and $S$ can be further expressed as

$$X = [\,x_1\ x_2\ \cdots\ x_n\,]^T \qquad (2)$$

$$S = [\,s_1\ s_2\ \cdots\ s_n\,]^T \qquad (3)$$

The mixed-signal vector $x_i$ and the source-signal vector $s_i$ can be respectively expressed in the following equations.

$$x_i = [\,x_i(1)\ x_i(2)\ \cdots\ x_i(m)\,]^T, \quad \text{for } i = 1, 2, 3, \ldots, n, \qquad (4)$$

$$s_i = [\,s_i(1)\ s_i(2)\ \cdots\ s_i(m)\,]^T, \quad \text{for } i = 1, 2, 3, \ldots, n, \qquad (5)$$

where $x_i(j)$ and $s_i(j)$ represent a mixed signal and a source signal at discrete time $j$, respectively. The ICA algorithm aims to find the matrix $S$ without knowing the matrix $A$, where the matrix $S$ is assumed to have no more than one Gaussian-distributed source-signal vector. By observing the matrix $X$, the inverse of the matrix $A$ can be estimated as the weight matrix $W^T$, and the blind source signals $S$ can be obtained by (6) [31].

$$S = W^T X, \qquad (6)$$
where

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}. \qquad (7)$$
The basic idea of ICA is to maximize the non-Gaussianity of $w^T X$, where $w$ denotes a column vector of $W$. Among the ICA algorithms, the FastICA algorithm proposed in [3, 13–16] has a fast convergence rate and good performance under low signal-to-noise ratio (SNR) conditions [8]. Therefore, the FastICA algorithm is chosen for the hardware implementation for EEG signal processing.
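To make the model in (1) and (6) concrete, the following NumPy sketch (an illustration of the equations, not of the paper's hardware) mixes uniformly distributed sources with a random $A$ and verifies that the ideal unmixing matrix $W^T = A^{-1}$ recovers $S$; ICA must instead estimate $W^T$ from $X$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 512                       # n channels, m samples (the paper uses up to 16 x 512)

S = rng.uniform(-1.0, 1.0, (n, m))  # uniformly distributed source-signal matrix, Eq. (3)
A = rng.normal(size=(n, n))         # random mixing matrix
X = A @ S                           # Eq. (1): observed mixed-signal matrix

# With a perfect estimate W^T = A^{-1}, Eq. (6) recovers the sources exactly.
W_T = np.linalg.inv(A)
S_hat = W_T @ X
print(np.allclose(S_hat, S))        # → True
```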
2.2 Preprocessing
In the FastICA algorithm, preprocessing is required to center and whiten the mixed signals. Through the preprocessing, the mixed signals become zero-mean and unit-variance. The centering process can be expressed as (8).

$$\bar{x}_i(j) = x_i(j) - E\{x_i\}, \quad \text{for } i = 1, 2, 3, \ldots, n, \qquad (8)$$

where $\bar{x}_i(j)$ and $E\{x_i\}$ denote the mixed-signal value with zero mean and the expected value of the random variable $x_i(j)$ of vector $x_i$ [31], respectively. Therefore, the centered matrix $\bar{X}$ can be expressed as (9).

$$\bar{X} = \begin{bmatrix} \bar{x}_1^T \\ \bar{x}_2^T \\ \vdots \\ \bar{x}_n^T \end{bmatrix} = \begin{bmatrix} \bar{x}_1(1) & \bar{x}_1(2) & \cdots & \bar{x}_1(m) \\ \bar{x}_2(1) & \bar{x}_2(2) & \cdots & \bar{x}_2(m) \\ \vdots & \vdots & \ddots & \vdots \\ \bar{x}_n(1) & \bar{x}_n(2) & \cdots & \bar{x}_n(m) \end{bmatrix} \qquad (9)$$
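As a quick numerical check of (8) and (9), the following NumPy sketch (illustrative only; the hardware realization is described in Section 4.3) centers each channel by subtracting its sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 512                                    # channels x samples
X = rng.normal(loc=3.0, scale=2.0, size=(n, m))  # raw mixed signals with nonzero mean

# Eq. (8): subtract each channel's sample mean E{x_i}, row by row
X_bar = X - X.mean(axis=1, keepdims=True)

print(np.allclose(X_bar.mean(axis=1), 0.0))      # every channel is now zero-mean
```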
For the whitening process, EVD can be used to obtain unit-variance signals. Equation (10) can be obtained through the EVD of the covariance matrix $C_x$ of $\tilde{x}$, where $\tilde{x}$ denotes the random column vector of $\bar{X}$.

$$C_x = E\{\tilde{x}\tilde{x}^T\} = EDE^T, \qquad (10)$$

where $E$ denotes the eigenvector matrix of $C_x$ and $D$ represents the eigenvalue matrix of $C_x$. $D$ can be further expressed as (11) [31].

$$D = \mathrm{diag}(d_1, d_2, \cdots, d_n), \qquad (11)$$

where $d_1, d_2, \ldots, d_n$ are the eigenvalues of $C_x$. The whitening process can then be written as (12) [31].

$$Z = D^{-1/2}E^T\bar{X} = P\bar{X} \qquad (12)$$

After the whitening process, the covariance matrix of $\tilde{z}$ becomes the identity matrix $I$ as shown in (13), where $\tilde{z}$ denotes the random column vector of $Z$.

$$E\{\tilde{z}\tilde{z}^T\} = P\,E\{\tilde{x}\tilde{x}^T\}\,P^T = D^{-1/2}E^T EDE^T E D^{-1/2} = I \qquad (13)$$

Therefore, unit covariance of the whitened signals can be guaranteed.
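Equations (10)-(13) can be exercised in a few lines of NumPy; this is a software illustration of the EVD based whitening (the approach this paper replaces), with `numpy.linalg.eigh` standing in for the cyclic Jacobi method:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 512
A = rng.normal(size=(n, n))
X_bar = A @ rng.uniform(-1, 1, (n, m))
X_bar -= X_bar.mean(axis=1, keepdims=True)  # centered mixed signals

# Eq. (10): EVD of the covariance matrix C_x = E{x x^T}
C_x = (X_bar @ X_bar.T) / m
d, E = np.linalg.eigh(C_x)                  # eigenvalues d_i, eigenvector matrix E

# Eq. (12): whitening matrix P = D^{-1/2} E^T and whitened signals Z = P X_bar
P = np.diag(d ** -0.5) @ E.T
Z = P @ X_bar

# Eq. (13): the covariance matrix of Z is the identity matrix
print(np.allclose((Z @ Z.T) / m, np.eye(n)))
```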
2.3 Fixed-Point Algorithm
The weight matrix is the key to finding the independent components. The non-Gaussianity needs to be measured well in order to find the independence between vectors. In [15], the negentropy approximation is used to estimate the non-Gaussianity. With this approximation, the FastICA algorithm can find the maximum of the non-Gaussianity by adjusting the weight matrix. For the training of the weight matrix, the fixed-point algorithm [15] is required. To estimate several independent components, either the symmetric orthogonalization or the deflationary orthogonalization can be utilized in the fixed-point algorithm [15]. With the symmetric orthogonalization, the independent components are estimated in parallel; with the deflationary orthogonalization, they are estimated one by one. In this work, we adopt the fixed-point algorithm with the symmetric orthogonalization such that the computation of the fixed-point algorithm can be parallelized in the hardware design. Assume $W = [\,w_1\ w_2\ \cdots\ w_n\,]$, where $w_i$, for $i = 1, 2, \ldots, n$, represents a column vector of $W$ [31]. The fixed-point algorithm with the symmetric orthogonalization can be described as follows.
Step 1 Set the initial values of $w_i$ with unit norm for $i = 1, 2, 3, \ldots, n$.

Step 2 Calculate the updating step [16],
$$w_i^+ = E\{\tilde{z}\,g(w_i^T\tilde{z})\} - E\{g'(w_i^T\tilde{z})\}\,w_i, \quad \text{for } i = 1, 2, 3, \ldots, n,$$
where $g$ is the derivative of the non-quadratic function $G(u) = \frac{1}{a}\log\cosh(au)$, and $a$ is a constant parameter.

Step 3 Calculate the Gram-Schmidt orthonormalization for $W$ using (14), (15) and (16).

$$w_1 = \frac{w_1^+}{\|w_1^+\|} \qquad (14)$$

$$w_{k+1}^{\#} = w_{k+1}^{+} - \sum_{i=1}^{k}\left[(w_{k+1}^{+})^T w_i\right]w_i, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (15)$$

$$w_{k+1} = \frac{w_{k+1}^{\#}}{\|w_{k+1}^{\#}\|}, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (16)$$

Step 4 Go back to Step 2 if not converged.
Note that the symmetric orthogonalization is based on EVD in [15]. In this work, we adopt the Gram-Schmidt orthonormalization for the symmetric orthogonalization based fixed-point algorithm [31] in Step 3. Finally, when convergence is achieved, the maximum of the non-Gaussianity of $W^T Z$ can be estimated.
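Steps 1-4 above can be sketched in software as follows. This is an illustrative NumPy version under our own naming (`gs_orthonormalize`, `fastica_fixed_point`) and our own convergence test, not the hardware scheduling described later in the paper:

```python
import numpy as np

def gs_orthonormalize(Wp):
    """Gram-Schmidt orthonormalization of the columns of Wp, Eqs. (14)-(16)."""
    n = Wp.shape[1]
    W = np.empty_like(Wp)
    W[:, 0] = Wp[:, 0] / np.linalg.norm(Wp[:, 0])            # Eq. (14)
    for k in range(1, n):
        w = Wp[:, k] - W[:, :k] @ (W[:, :k].T @ Wp[:, k])    # Eq. (15)
        W[:, k] = w / np.linalg.norm(w)                      # Eq. (16)
    return W

def fastica_fixed_point(Z, a=1.0, max_iter=100, tol=1e-6):
    """Symmetric fixed-point iteration on whitened data Z (n x m)."""
    n, m = Z.shape
    rng = np.random.default_rng(3)
    W = gs_orthonormalize(rng.normal(size=(n, n)))           # Step 1: unit-norm init
    for _ in range(max_iter):
        Y = W.T @ Z
        g = np.tanh(a * Y)                                   # g = G'(u), G = (1/a) log cosh(au)
        g_prime = a * (1.0 - g ** 2)
        # Step 2: w_i^+ = E{z g(w_i^T z)} - E{g'(w_i^T z)} w_i
        Wp = (Z @ g.T) / m - W * g_prime.mean(axis=1)
        W_new = gs_orthonormalize(Wp)                        # Step 3
        if np.max(1.0 - np.abs(np.sum(W_new * W, axis=0))) < tol:
            return W_new                                     # Step 4: converged
        W = W_new
    return W
```

Whatever the data, the returned matrix has orthonormal columns, since every iteration ends with the Gram-Schmidt step.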
3 Gram-Schmidt Orthonormalization Based Whitening for the FastICA Algorithm

As mentioned in the previous section, the EVD based whitening process is generally adopted in the FastICA algorithm. In [31], since the whitening process and the fixed-point algorithm apply CORDIC and multiplier as well as adder modules, respectively, a specially designed CORDIC-based EVD processor is needed for the EVD based whitening process. However, the hardware cost of a dedicated computing unit and the computational complexity of the EVD based whitening become large as the number of channels increases in [31]. The pioneering work in [29] can reduce the hardware cost by reusing the CORDIC module to perform both the EVD based whitening and the fixed-point algorithm of FastICA. In this work, the concept of the Gram-Schmidt based whitening process [12, 35, 36] is adopted for the FastICA algorithm. The Gram-Schmidt based whitening process has lower computational complexity than the conventional EVD based whitening does. In this section, we analyze and compare the computational complexity of the two whitening processes.
3.1 Computational Complexity Analysis
The Gram-Schmidt orthonormalization based whitening process for the FastICA algorithm is described as follows. First, the Gram-Schmidt orthonormalization of the zero-mean mixed-signal vectors is illustrated in (17), (18) and (19).

$$z_1^+ = \frac{\bar{x}_1}{\|\bar{x}_1\|} \qquad (17)$$

$$z_{k+1}^{\#} = \bar{x}_{k+1} - \sum_{i=1}^{k}\left(\bar{x}_{k+1}^T z_i^+\right)z_i^+, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (18)$$

$$z_{k+1}^{+} = \frac{z_{k+1}^{\#}}{\|z_{k+1}^{\#}\|}, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (19)$$

Such a process gives an orthogonal matrix $Z^+ = [\,z_1^+\ z_2^+\ \cdots\ z_n^+\,]^T$, where the vectors $\{z_1^+, z_2^+, \cdots, z_n^+\}$ are mutually orthogonal. The covariance matrix of $\tilde{z}^+$ is then derived in (20).

$$E\{\tilde{z}^+(\tilde{z}^+)^T\} = \mathrm{diag}\!\left(\frac{(z_1^+)^T z_1^+}{m},\ \frac{(z_2^+)^T z_2^+}{m},\ \cdots,\ \frac{(z_n^+)^T z_n^+}{m}\right) = \frac{1}{m}I, \qquad (20)$$

where $\tilde{z}^+$ denotes the random column vector of $Z^+$. Since unit variance is required for the whitening process, it is necessary to scale the orthogonal matrix $Z^+$ by a factor $m^{1/2}$ as follows.

$$z_{k+1} = m^{1/2}z_{k+1}^{+}, \quad \text{for } k = 0, 1, 2, \ldots, n-1 \qquad (21)$$

Therefore, the covariance matrix of $\tilde{z}$ equals the identity matrix $I$.
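The whitening chain (17)-(19) plus the scaling (21) can be sketched as below; the function name is our own, and the final check confirms the unit-covariance property without any EVD:

```python
import numpy as np

def gram_schmidt_whiten(X_bar):
    """Gram-Schmidt orthonormalization based whitening, Eqs. (17)-(19) and (21).

    X_bar is n x m with zero-mean rows (one row per channel)."""
    n, m = X_bar.shape
    Zp = np.empty_like(X_bar)
    Zp[0] = X_bar[0] / np.linalg.norm(X_bar[0])          # Eq. (17)
    for k in range(1, n):
        z = X_bar[k] - Zp[:k].T @ (Zp[:k] @ X_bar[k])    # Eq. (18)
        Zp[k] = z / np.linalg.norm(z)                    # Eq. (19)
    return np.sqrt(m) * Zp                               # Eq. (21): scale by m^{1/2}

rng = np.random.default_rng(5)
n, m = 4, 512
X_bar = rng.normal(size=(n, n)) @ rng.uniform(-1, 1, (n, m))
X_bar -= X_bar.mean(axis=1, keepdims=True)

Z = gram_schmidt_whiten(X_bar)
print(np.allclose((Z @ Z.T) / m, np.eye(n)))             # unit covariance, no EVD needed
```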
The computational complexity is defined as the sum of multiplications, divisions, and square roots. Tables 1 and 2 summarize the computational complexity of the whitening process based on EVD and Gram-Schmidt orthonormalization, respectively, where mul, div, and sqrt denote the corresponding multiplication, division, and square root. Since the detailed implementation of the EVD function call in MATLAB is a black box to us, EVD is assumed to be performed by the cyclic Jacobi method [39] for the computational complexity in Table 1, where $J$ denotes an n by n Jacobi rotation matrix [39]. Note that the shift and add operations are excluded in Tables 1 and 2 since they contribute less complexity compared to the multiplication, division, and square root operations.

In Table 1, since the covariance matrix $C_x$ is an n by n symmetric matrix, only $n(n+1)/2$ elements are required for the calculation of $C_x$. Therefore, the computational complexity for the calculation of $C_x$ is $m[n(n+1)/2]$ due to $m$ multiplications for each element calculation of $C_x$. The product of an n by n matrix and a Jacobi rotation matrix costs $4n$ multiplications since only $2n$ elements need to be calculated, where each element needs two multiplications. Thus, $4n[n(n-1)/2 - 1]$ multiplications are required for the matrix multiplication of the $n(n-1)/2$ $J$ matrices to obtain the eigenvector matrix $E$. In Table 2, $(\bar{x}_{k+1}^T z_i^+)z_i^+$ needs to be calculated for $i = 1, 2, 3, \ldots, k$ and $k = 1, 2, 3, \ldots, n-1$. Since the calculation of $(\bar{x}_{k+1}^T z_i^+)z_i^+$ needs $2m$ multiplications, the computational complexity for the calculation of $z_{k+1}^{\#}$ can be obtained as $2m\sum_{k=1}^{n-1}k$.

The bar charts in terms of the number of multiplications and the sum of divisions and square roots versus the number of channels with $m = 512$ are shown in Fig. 1a and b, respectively. As can be seen in Fig. 1a and b, the computational complexity is mainly dominated by the multiplications for both the EVD and the Gram-Schmidt orthonormalization based whitening processes.
As a result, the Gram-Schmidt orthonormalization based whitening process requires fewer computations than the EVD based whitening process does, especially when the channel number increases. Obviously, the computations for the eigenvalue matrix $D$ and eigenvector matrix $E$ are not required for the Gram-Schmidt based whitening. Therefore, EVD is not necessary in the Gram-Schmidt based whitening process, such that the CORDIC module for EVD processing is not required in the proposed hardware architecture. Instead, from the analysis, the Gram-Schmidt based whitening process can be realized by general computing modules such as addition, multiplication, and inverse-square-root modules. Since these computing modules can be shared with the fixed-point algorithm of FastICA, the hardware cost is expected to be saved.
3.2 Influence of the Whitening Process on the FastICA Algorithm
Besides the computational complexity, the two whitening approaches mentioned in the previous subsection need to be explored to see their influence on the separation quality of the FastICA algorithm. Therefore, the MATLAB coding resource [40] of the FastICA algorithm is modified/recoded and then simulated with the EVD based and the Gram-Schmidt orthonormalization based whitening processes, respectively. A random mixing matrix $A$ and a uniformly distributed source-signal matrix $S$ are used to generate the mixed-signal matrix $X$. The mixed-signal matrix $X$ is then processed with the FastICA algorithm using the above two whitening approaches. The estimated source-signal vectors are then compared with the source-signal vectors using the performance index defined in (22).

$$\text{Performance Index} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{abs\_corr\_coef}(\hat{s}_i, s_i) \qquad (22)$$

where $\hat{s}_i$ and $\mathrm{abs\_corr\_coef}(\hat{s}_i, s_i)$ denote the i-th estimated source-signal vector computed by the FastICA algorithm and the absolute correlation coefficient between $\hat{s}_i$ and $s_i$, respectively. The absolute correlation coefficient [41] is expressed in (23) using the notations in this paper.

$$\mathrm{abs\_corr\_coef}(\hat{s}_i, s_i) = \left|\frac{E\{(\hat{s}_i - E\{\hat{s}_i\})(s_i - E\{s_i\})\}}{\sigma_{\hat{s}_i}\,\sigma_{s_i}}\right| \qquad (23)$$

where $\sigma$ denotes the standard deviation. Note that the order of the estimated source-signal vectors is sorted by associating each estimated source-signal vector with the source-signal vector having the maximum absolute correlation coefficient in the simulation.
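The quality metric (22)-(23) is easy to reproduce in software. The sketch below uses our own function names and omits the channel-matching step described above, pairing channels directly; a perfect, sign- and scale-ambiguous estimate scores 1.0:

```python
import numpy as np

def abs_corr_coef(s_hat, s):
    """Eq. (23): absolute correlation coefficient between two signal vectors."""
    cov = ((s_hat - s_hat.mean()) * (s - s.mean())).mean()
    return abs(cov / (s_hat.std() * s.std()))

def performance_index(S_hat, S):
    """Eq. (22): mean absolute correlation over (already matched) channel pairs."""
    n = S.shape[0]
    return sum(abs_corr_coef(S_hat[i], S[i]) for i in range(n)) / n

rng = np.random.default_rng(6)
S = rng.uniform(-1, 1, (4, 512))
S_hat = -2.0 * S                 # ICA recovers sources only up to sign and scale
print(round(performance_index(S_hat, S), 6))   # → 1.0
```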
Figure 2 shows the average performance index and the average iteration number versus the number of channels, where the sample number is set to 512. The average performance index and the average iteration number are obtained over 500 test rounds. The maximum iteration number is set to 511 in the simulation. As can be seen, the performance indices and iteration numbers with the EVD based and the Gram-Schmidt orthonormalization based whitening processes are close. That
Table 1 Computational complexity of the EVD based whitening process.

| Procedure | Computational complexity |
|---|---|
| Calculate covariance matrix $C_x = E\{\tilde{x}\tilde{x}^T\}$ | $\{m[n(n+1)/2]\}$ muls |
| Calculate eigenvector matrix $E = J_{12}J_{13}\cdots J_{1n}J_{23}\cdots J_{n-1,n}$ | $\{4n[n(n-1)/2 - 1]\}$ muls |
| Calculate eigenvalue matrix $D = E^T C_x E$ | $(2n^3)$ muls |
| Calculate whitening matrix $P = D^{-1/2}E^T$ | $(n^2)$ muls + $(n)$ divs + $(n)$ sqrts |
| Calculate whitened signal $Z = P\bar{X}$ | $(mn^2)$ muls |
| Total | $\left[4n^3 + \left(\frac{3m}{2} - 1\right)n^2 + \left(\frac{m}{2} - 4\right)n\right]$ muls + $(n)$ divs + $(n)$ sqrts |
Table 2 Computational complexity of the Gram-Schmidt orthonormalization based whitening process.

| Procedure | Computational complexity |
|---|---|
| Calculate $z_1^+ = \bar{x}_1/\|\bar{x}_1\|$ | $(m)$ muls + $(m)$ divs + $(1)$ sqrt |
| Calculate $z_{k+1}^{\#} = \bar{x}_{k+1} - \sum_{i=1}^{k}(\bar{x}_{k+1}^T z_i^+)z_i^+$ for $k = 1, 2, 3, \ldots, n-1$ | $\left(2m\sum_{k=1}^{n-1}k\right)$ muls |
| Calculate $z_{k+1}^{+} = z_{k+1}^{\#}/\|z_{k+1}^{\#}\|$ for $k = 1, 2, 3, \ldots, n-1$ | $[m(n-1)]$ muls + $[m(n-1)]$ divs + $(n-1)$ sqrts |
| Calculate $z_{k+1} = m^{1/2}z_{k+1}^{+}$ for $k = 0, 1, 2, \ldots, n-1$ | $(mn)$ muls |
| Total | $(mn^2 + mn)$ muls + $(mn)$ divs + $(n)$ sqrts |
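As a quick sanity check of the Total rows, the snippet below evaluates both closed-form multiplication counts for m = 512 (the helper names are ours; the formulas are copied from Tables 1 and 2):

```python
def evd_whitening_muls(n, m):
    """Table 1 total: multiplications of the EVD based whitening (cyclic Jacobi)."""
    return 4 * n**3 + (3 * m // 2 - 1) * n**2 + (m // 2 - 4) * n

def gs_whitening_muls(n, m):
    """Table 2 total: multiplications of the Gram-Schmidt based whitening."""
    return m * n**2 + m * n

m = 512
for n in (4, 8, 16):
    print(n, evd_whitening_muls(n, m), gs_whitening_muls(n, m))
```

The Gram-Schmidt count stays below the EVD count at every channel number, and the gap widens with n, matching the trend in Fig. 1.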
means the Gram-Schmidt orthonormalization based whitening process can be an alternative to the EVD based whitening process with similar separation quality in MATLAB simulation.

4 Cost-Effective and Variable-Channel FastICA Hardware Architecture and Implementation

In this section, the cost-effective floating-point FastICA hardware architecture and implementation for the FastICA processing, re-reference, synchronized average, and moving average functions are addressed. The reasonable sample size for the hardware implementation of the FastICA function is explored first. Then, the hardware architecture, hardware operations, and implementation of these four functions are revealed and illustrated.
4.1 Sample Size Analysis
Since FastICA is a statistical algorithm, increasing the sample size improves the separation quality. However, the memory size becomes larger once the sample size increases. In addition, a larger sample size leads to larger computational complexity according to the discussion in Section 3. For the above reasons, the reasonable sample size for hardware implementation needs to be determined. Therefore, we simulate the average performance index versus the number of channels with the sample sizes of 256, 512, and 1024 in Fig. 3, where the simulation is performed by the double-precision floating-point (i.e., the most accurate precision) FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process. The details of $A$ and $X$ in this section are the same as those of the simulation described in Section 3.2. From the simulation result, using 1024 samples can reach the best
[Figure 1 a Number of multiplications and b sum of divisions and square roots versus number of channels with the EVD and Gram-Schmidt orthonormalization based whitening processes.]
performance among the three different sample sizes, but the memory cost becomes a penalty. On the other hand, although using 256 samples attains the lowest memory cost, its performance index is the worst among the three sample sizes. In summary, 512 is a tradeoff sample size for hardware implementation. To evaluate the effects of IEEE-754 single-precision floating-point and fixed-point arithmetic, the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process is performed with 32-bit floating-point and 32-bit fixed-point simulations, respectively. An 11-bit fraction part is adopted in the fixed-point simulation for the random data and the three datasets in Section 5. Figure 3 shows the performance index versus the number of channels with floating-point and fixed-point simulations. It can be seen that the performance indices with the 512 sample size manipulated by double-precision and single-precision floating point are almost the same. For the targeted variable channel number with a maximum of 16, the performance index with the fixed-point simulation is lower than that of the floating-point simulation. It should be noted that the channel number should not be too high while considering an acceptable performance index. As a result, the maximum channel number, the sample size, and the arithmetic format in our hardware implementation are set to 16, 512, and the single-precision floating-point arithmetic format, respectively, according to the simulation results, where the details will be discussed in the next subsection.
4.2 Hardware Architecture
Figure 4 shows the system diagram of the proposed FastICA hardware architecture realizing the FastICA, re-reference, synchronized average, and moving average functions. The proposed FastICA hardware architecture can perform different operations according to the instruction input, where the available instruction set is listed in Table 3 and is briefly described as follows. The LOAD instruction loads either fixed-point or floating-point data into the memory. The OUTPUT instruction outputs the processed data from the memory. The FASTICA instruction operates the FastICA algorithm. The REREF instruction produces the re-reference signal. The SYNAVG instruction offers the synchronized average signal. Finally, the MOVAVG instruction makes the proposed architecture act as a moving average filter for the input signal. In this work, the names REREF and MOVAVG are the same as those in [42, 43] and MATLAB, respectively.
In the proposed hardware architecture, the IEEE-754 single-precision floating-point format in (24), where the notations are the same as those of [31], is adopted in the hardware design in order to preserve the computation accuracy.

$$L_{float} = (-1)^{s_L} \times 1.f_L \times 2^{e_L - 127}, \qquad (24)$$

where $L_{float}$, $s_L$, $e_L$ and $f_L$ represent the value of the IEEE-754 single-precision floating-point number, the sign bit, the 8-bit exponent, and the 23-bit fraction part of the mantissa, respectively. Since the input EEG signals will be in fixed-point format if they are captured by the scalp sensors and quantized by the analog-to-digital converter (ADC), a fixed-to-floating converter is utilized in this work to convert the fixed-point format into the IEEE-754 single-precision floating-point format. Afterwards, the input signals are stored in the data memory.
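For reference, (24) can be checked against a host machine's native single-precision encoding; the decoder below handles only the normalized case (no zeros, subnormals, infinities, or NaNs):

```python
import struct

def decode_ieee754(bits):
    """Decode a 32-bit word per Eq. (24): (-1)^s_L * 1.f_L * 2^(e_L - 127)."""
    s = (bits >> 31) & 0x1      # sign bit s_L
    e = (bits >> 23) & 0xFF     # 8-bit exponent e_L
    f = bits & 0x7FFFFF         # 23-bit fraction f_L
    return (-1.0) ** s * (1.0 + f / 2**23) * 2.0 ** (e - 127)

# Round-trip against the host's own single-precision encoding of -6.25.
bits = struct.unpack('>I', struct.pack('>f', -6.25))[0]
print(decode_ieee754(bits))     # → -6.25
```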
Herein, the processing units (PUs), PU1 and PU2, are utilized to process the data stored in the memory. The simplified architecture of the PUs is illustrated in Fig. 5. Each PU contains a multiply-and-add (MAA) unit, a temporary memory, and a division-by-2^λ operation unit, where the name multiply-and-add (MAA) is the same as that in [44] and λ is a positive integer. In this work,
[Figure 2 Average performance index and average iteration number versus number of channels with the whitening processes based on EVD and Gram-Schmidt orthonormalization.]
[Figure 3 Performance index versus number of channels with different sample numbers: 1024, 512, and 256 samples (double-precision floating-point), 512 samples (single-precision floating-point), and 512 samples (fixed-point).]
PU1 and PU2 are reused in the centering operation of preprocessing and the updating step [16] (i.e., Step 2 of the fixed-point algorithm in Section 2.3) of the fixed-point algorithm. Through parallel manipulation of PU1 and PU2, the centering operation and the updating step can be accelerated. It is noted that, since the Gram-Schmidt orthonormalization processes the vectors sequentially, only PU1 is reused for the Gram-Schmidt orthonormalization in the preprocessing and fixed-point algorithm of the FastICA algorithm. Thus, PU1 adopts the inverse-square-root unit based on the design of [45] for the Gram-Schmidt orthonormalization but PU2 does not. The temporary memory is used to store the computation result such that the result can be reused in a subsequent computation or transferred to the data memory. Since the IEEE-754 single-precision floating-point format is utilized in the hardware design, the division-by-2^λ operation can be realized by subtracting λ from the exponent, as used for several functions mentioned in Sections 4.3 and 4.5. For the FastICA function, PU1 and PU2 are reused to perform the centering operation of preprocessing and the updating step of the fixed-point algorithm, and PU1 is reused for the computations of the Gram-Schmidt orthonormalization. Therefore, a dedicated EVD calculation unit is not required in the proposed hardware architecture. The other three function computations are also realized by the two reused PUs. Thus, a cost-effective hardware architecture can be attained. The hardware operations of the FastICA and other functions with the proposed PUs will be described in detail in the following subsections.
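The exponent-subtraction trick can be sketched in software as follows; this is an illustrative model of the division-by-2^λ unit (normalized operands only, no exponent-underflow handling), not the RTL itself:

```python
import struct

def divide_by_pow2(x, lam):
    """Divide a single-precision value by 2^lam by subtracting lam from the
    8-bit exponent field of its IEEE-754 encoding (normalized inputs only)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    exp = (bits >> 23) & 0xFF
    bits = (bits & ~(0xFF << 23)) | (((exp - lam) & 0xFF) << 23)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# lam = 9 divides by 512, the setting used for the 512-sample mean in Section 4.3.
print(divide_by_pow2(1024.0, 9))   # → 2.0
```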
The 32-bit instruction format of the proposed hardware architecture is described in Table 4. The first 3 bits of the instruction represent the OP code for different operations. The remaining bits of the instruction are used to specify the user-defined parameters. The ranges of the OP code and the parameters are described in Table 5. Notably, the third-parameter format of the FASTICA instruction is defined as the first 15 bits of the IEEE-754 single-precision floating-point format. In order to simplify the hardware design, the first parameter can only be set to 2, 4, 8, or 16 for the SYNAVG and MOVAVG instructions. Also, the second parameter of LOAD and OUTPUT can only be set to either 511 or 0 since it is used to specify the input data format or the output data type. With these functions and the user-defined parameters, the flexibility of the proposed hardware architecture is attained.

For the chip implementation of the proposed FastICA hardware architecture, the cell-based design flow with the standard cell library in the TSMC 90 nm 1P9M CMOS process is adopted. The Artisan Memory Compiler is used to generate the memory modules. Synopsys Design Compiler is employed to synthesize the proposed design with a timing constraint of 10 ns. Finally, Cadence SoC Encounter is used for the placement and routing procedure. Figure 6 shows the layout of the proposed FastICA hardware implementation, where the layout area occupies 1.43 mm².
[Figure 4 System diagram of the proposed FastICA hardware architecture: instruction input, controller, fixed-to-floating converter, data memory, Processing Unit 1, Processing Unit 2, and output MUX.]
Table 3 Instruction set of the proposed FastICA hardware architecture.

| Instruction | Parameters | Operation |
|---|---|---|
| LOAD | Number of channels, input data type (fixed/floating) | Load data into the memory |
| OUTPUT | Number of channels, content type (weight/signal) | Output data from the memory |
| FASTICA | Number of channels, number of maximum iterations, threshold value | Perform FastICA algorithm |
| REREF | Number of channels, index of the baseline channel | Remove baseline |
| SYNAVG | Number of trials | Average all trials |
| MOVAVG | Number of samples on average, index of the target channel | Moving average filter |
4.3 Hardware Operations of the Preprocessing with Gram-Schmidt Orthonormalization Based Whitening Process of FastICA
To illustrate the hardware operations of the FastICA function using the PUs in Fig. 5, the computations of the preprocessing are described in this subsection. For the centering process, (8) can be recast as (25).

$$\bar{x}_i(j) = x_i(j) - \frac{1}{512}\sum_{l=1}^{512}x_i(l), \qquad (25)$$
where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, 512$. The computation of (25) is described as follows. First, $x_i(l)$ are read from the data memory and used to calculate the result of $\left[\sum_{l=1}^{512}x_i(l)\right]/512$ using the MAA unit and the division-by-2^λ operation unit, where λ is set to 9. Next, the result of $x_i(j)$ minus $\left[\sum_{l=1}^{512}x_i(l)\right]/512$ is calculated on the MAA unit and saved to the data memory. The centering process can be performed either on the MAA unit of PU1 or the MAA unit of PU2. Therefore, PU1 and PU2 concurrently perform the computations of the centering process for different channels. For example, if $n = 16$, $\bar{x}_1(j)$ and $\bar{x}_2(j)$ are calculated first with $i = 1$ and $i = 2$ in PU1 and PU2, respectively, for $j = 1, 2, \ldots, 512$. Next, $\{\bar{x}_3(j), \bar{x}_4(j)\}$, $\{\bar{x}_5(j), \bar{x}_6(j)\}$, $\ldots$, $\{\bar{x}_{15}(j), \bar{x}_{16}(j)\}$ are calculated with $i = \{3, 4\}$, $i = \{5, 6\}$, $\ldots$, $i = \{15, 16\}$ in PU1 and PU2 for $j = 1, 2, 3, \ldots, 512$, respectively.

After the centering process, (17), (18), (19) and (21) are used in the Gram-Schmidt orthonormalization based whitening process and can be recast as (26), (27) and (28), respectively.
\mathbf{z}_{k+1}^{\#} =
\begin{cases}
\mathbf{x}_{k+1}, & \text{for } k = 0 \\
\mathbf{x}_{k+1} - \left(\mathbf{x}_{k+1}^{T}\mathbf{z}_{1}^{+}\right)\mathbf{z}_{1}^{+} - \cdots - \left(\mathbf{x}_{k+1}^{T}\mathbf{z}_{k}^{+}\right)\mathbf{z}_{k}^{+}, & \text{for } k = 1, 2, \ldots, n-1
\end{cases} \qquad (26)
Figure 5 Simplified architecture of the processing units (PUs) and the corresponding controller.
Table 4 Instruction format of the proposed FastICA hardware architecture.

OP code | First parameter | Second parameter | Third parameter
3 bits | 5 bits | 9 bits | 15 bits
Table 5 Range of the OP code and parameters of the instructions.

Instruction | OP code | First parameter | Second parameter | Third parameter
LOAD | 0 (000) | 1~16 | 511/0 (fixed/floating) | –
OUTPUT | 1 (001) | 1~16 | 511/0 (weight/signal) | –
FASTICA | 2 (010) | 2~16 | 1~511 | 1~20
REREF | 3 (011) | 1~16 | 0~15 | –
SYNAVG | 4 (100) | 2/4/8/16 | – | –
MOVAVG | 5 (101) | 2/4/8/16 | 0~15 | –
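As an illustrative sketch of how one of these instructions could be assembled from Tables 4 and 5, the snippet below packs the 3-, 5-, 9- and 15-bit fields into a single 32-bit word. The paper specifies only the field widths, so the MSB-first field ordering and the helper name are assumptions.

```python
def encode_instruction(opcode, p1, p2=0, p3=0):
    """Pack an instruction into a 32-bit word per Table 4: a 3-bit OP code
    followed by 5-, 9- and 15-bit parameter fields (assumed MSB-first)."""
    assert opcode < 2**3 and p1 < 2**5 and p2 < 2**9 and p3 < 2**15
    return (opcode << 29) | (p1 << 24) | (p2 << 15) | p3

# FASTICA (OP code 2) on 16 channels with parameter values 511 and 20
word = encode_instruction(2, 16, 511, 20)
```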
Figure 6 Layout of the proposed FastICA hardware implementation (processing units, data memories, and temporary memories).
\mathbf{z}_{k+1}^{+} = \frac{\mathbf{z}_{k+1}^{\#}}{\left\lVert \mathbf{z}_{k+1}^{\#} \right\rVert} = \mathbf{z}_{k+1}^{\#}\left(1 / \left\lVert \mathbf{z}_{k+1}^{\#} \right\rVert\right) = \mathbf{z}_{k+1}^{\#}\left[\left(\mathbf{z}_{k+1}^{\#}\right)^{T}\mathbf{z}_{k+1}^{\#}\right]^{-1/2} \qquad (27)

\mathbf{z}_{k+1} = 512^{1/2}\,\mathbf{z}_{k+1}^{+} \qquad (28)
According to (26), (27) and (28), the computations of the Gram-Schmidt orthonormalization based whitening process are described as follows.
Step 1 Let i = 1 and initially set z#_{k+1} = x_{k+1} at the temporary memory.
Step 2 Go to Step 6 if k = 0. Otherwise, go to Step 3.
Step 3 Calculate x^T_{k+1} z^+_i on the MAA unit.
Step 4 Renew z#_{k+1} with the calculation result of z#_{k+1} − (x^T_{k+1} z^+_i) z^+_i using the MAA unit and increase i by 1.
Step 5 Repeat Step 3-Step 4 until i > k.
Step 6 Calculate [(z#_{k+1})^T z#_{k+1}]^{−1/2} on the MAA unit and the inverse-square-root unit.
Step 7 Calculate z^+_{k+1} on the MAA unit.
Step 8 Repeat Step 1-Step 7 until n vectors {z^+_1, z^+_2, ⋯, z^+_n} of Z^+ are obtained.
Step 9 Calculate z_{k+1} on the MAA unit. Repeat this step until n vectors {z_1, z_2, ⋯, z_n} of Z are obtained.
Note that the calculations of Step 1-Step 9 are only performed on the MAA unit of PU1. In Step 1, x_{k+1} is read from the data memory. In Step 3, z^+_i is read from the data memory to compute the dot product of x_{k+1} and z^+_i on the MAA unit. In Step 4, the result of (x^T_{k+1} z^+_i) z^+_i is calculated and subtracted from z#_{k+1} on the MAA unit. In Step 5, Step 3-Step 4 are repeated while i ≤ k such that z#_{k+1} = x_{k+1} − ∑_{i=1}^{k} (x^T_{k+1} z^+_i) z^+_i is obtained. In Step 6, the result of (z#_{k+1})^T z#_{k+1} is calculated on the MAA unit and then processed with the inverse-square-root operation unit to obtain [(z#_{k+1})^T z#_{k+1}]^{−1/2}. In Step 7, z^+_{k+1} is calculated by multiplying z#_{k+1} and [(z#_{k+1})^T z#_{k+1}]^{−1/2} on the MAA unit. In Step 8, Step 1-Step 7 are repeated to sequentially perform the calculations of the n vectors {z^+_1, z^+_2, ⋯, z^+_n} of Z^+ with k = 0, 1, 2, …, n−1; when Step 1-Step 7 are done with k = n−1, the calculation of Z^+ is completed. In Step 9, z_{k+1} is obtained by multiplying 512^{1/2} and z^+_{k+1} on the MAA unit. Step 9 is repeated until the n vectors of Z are obtained, and the result of Z is saved to the data memory. The proposed FastICA hardware architecture supports variable-channel processing, where the number of channels, n, is confined to 2-to-16 and defined by the user. In Step 8, the controller controls PU1 to sequentially calculate {z^+_1, z^+_2, z^+_3, ⋯, z^+_n}; when the controller detects that the last calculation of z^+_n is completed, it controls PU1 to execute the operation of Step 9. By setting a different number of channels, n, the hardware operations of the Gram-Schmidt orthonormalization can be realized on the same PU1. For example, if n = 16, z^+_1 is calculated first using Steps 1, 2, 6 and 7 with k = 0. Next, z^+_2 is calculated through Step 1-Step 7 with k = 1. Afterward, {z^+_3, z^+_4, ⋯, z^+_16} are calculated sequentially with k = 2, 3, …, 15, respectively. Once {z^+_1, z^+_2, ⋯, z^+_16} are obtained, the vectors {z_1, z_2, ⋯, z_16} are calculated through Step 9. Similarly, if n = 4, z^+_1 is calculated first and then {z^+_2, z^+_3, z^+_4} are calculated sequentially.
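Under the assumption of row-vector channels that have already been centered per (25), Step 1-Step 9 can be modeled in software as the sketch below. The function name is ours, m stands in for the fixed sample size of 512, and ordinary floating-point arithmetic replaces the MAA and inverse-square-root units.

```python
import numpy as np

def gram_schmidt_whiten(x):
    """Functional model of the Gram-Schmidt orthonormalization based
    whitening of (26)-(28). x: (n, m) centered data, one row per channel.
    Returns Z with Z @ Z.T / m = I (the whiteness condition)."""
    n, m = x.shape
    z_plus = np.zeros_like(x, dtype=float)
    for k in range(n):
        z_sharp = x[k].astype(float).copy()      # Step 1: z#_{k+1} = x_{k+1}
        for i in range(k):                       # Steps 3-5: remove projections
            z_sharp -= (x[k] @ z_plus[i]) * z_plus[i]
        # Steps 6-7: scale by the inverse square root of the energy
        z_plus[k] = z_sharp * (z_sharp @ z_sharp) ** -0.5
    return np.sqrt(m) * z_plus                   # Step 9: z_{k+1} = m^{1/2} z^+_{k+1}
```

The m^{1/2} scaling of Step 9 is what turns the orthonormal rows of Z^+ into whitened rows whose sample covariance is the identity.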
4.4 Hardware Operations of the Fixed-Point Algorithm of FastICA
To implement the fixed-point algorithm using the two reused PUs, the equations can be recast as (29), (30) and (31).
\mathbf{w}_{i}^{+} = E\left\{\tilde{\mathbf{z}}\, g\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}\right)\right\} - E\left\{g'\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}\right)\right\}\mathbf{w}_{i}
= \frac{\sum_{j=1}^{512} \tilde{\mathbf{z}}_{j}\, g\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}_{j}\right) - \sum_{j=1}^{512} g'\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}_{j}\right)\mathbf{w}_{i}}{512}
\cong \frac{\sum_{j=1}^{512} \tilde{\mathbf{z}}_{j}\left[\alpha\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}_{j}\right) + \beta\right] - \sum_{j=1}^{512} \{\alpha\}\,\mathbf{w}_{i}}{512} \qquad (29)

where z̃_j denotes the j-th column vector of Z, and α and β denote two parameters.

\mathbf{w}_{k+1}^{\#} =
\begin{cases}
\mathbf{w}_{k+1}^{+}, & \text{for } k = 0 \\
\mathbf{w}_{k+1}^{+} - \left[\left(\mathbf{w}_{k+1}^{+}\right)^{T}\mathbf{w}_{1}\right]\mathbf{w}_{1} - \cdots - \left[\left(\mathbf{w}_{k+1}^{+}\right)^{T}\mathbf{w}_{k}\right]\mathbf{w}_{k}, & \text{for } k = 1, 2, \ldots, n-1
\end{cases} \qquad (30)

\mathbf{w}_{k+1} = \frac{\mathbf{w}_{k+1}^{\#}}{\left\lVert \mathbf{w}_{k+1}^{\#} \right\rVert} = \mathbf{w}_{k+1}^{\#}\left(1 / \left\lVert \mathbf{w}_{k+1}^{\#} \right\rVert\right) = \mathbf{w}_{k+1}^{\#}\left[\left(\mathbf{w}_{k+1}^{\#}\right)^{T}\mathbf{w}_{k+1}^{\#}\right]^{-1/2} \qquad (31)
The two non-linear functions g(w^T_i z̃_j) and g'(w^T_i z̃_j) in (29) are realized as α(w^T_i z̃_j) + β and α, respectively, by the piecewise linear approximation method. The values of α and β are obtained by table look-up according to the value of w^T_i z̃_j. Further discussions of the piecewise linear approximation method can be found in [38]. According to (29), (30) and (31), the computations of the fixed-point algorithm are described as follows.
Step 1 Let j = 1 at the beginning.
Step 2 Calculate w^T_i z̃_j on the MAA unit.
Step 3 Calculate α(w^T_i z̃_j) + β on the MAA unit.
Step 4 Calculate z̃_j [α(w^T_i z̃_j) + β] on the MAA unit.
Step 5 Accumulate the results of z̃_j [α(w^T_i z̃_j) + β] and α at the temporary memory using the MAA unit, respectively, and increase j by 1.
Step 6 Repeat Step 2-Step 5 until j > 512.
Step 7 Calculate w^+_i on the MAA unit.
Step 8 Repeat Step 1-Step 7 until n vectors {w^+_1, w^+_2, ⋯, w^+_n} are obtained.
Step 9 Calculate the Gram-Schmidt orthonormalization for W.
Step 10 Repeat Step 1-Step 9 until W converges.
Note that the calculations of Step 1-Step 7 can be performed on the MAA unit of either PU1 or PU2. In other words, two w^+_i vectors can be concurrently calculated on the MAA units of PU1 and PU2, respectively. In Step 2, w_i and z̃_j are read from the data memory to calculate their dot product on the MAA unit. In Step 3, the result of α(w^T_i z̃_j) + β is calculated on the MAA unit. In Step 4, z̃_j is read from the data memory and multiplied by α(w^T_i z̃_j) + β on the MAA unit. In Step 5, the results of z̃_j[α(w^T_i z̃_j) + β] and α are separately accumulated at the temporary memory using the MAA unit. In Step 6, Step 2-Step 5 are repeated while j ≤ 512 to obtain ∑_{j=1}^{512} z̃_j[α(w^T_i z̃_j) + β] and ∑_{j=1}^{512}{α}. In Step 7, the result of ∑_{j=1}^{512} z̃_j[α(w^T_i z̃_j) + β] minus ∑_{j=1}^{512}{α} w_i is calculated on the MAA unit. The division-by-512 operation is not necessary in Step 7 since it does not affect the normalization result. In Step 8, Step 1-Step 7 are repeated to obtain the n vectors {w^+_1, w^+_2, ⋯, w^+_n}. According to the user-defined number of channels, n, the calculations of the w^+_i vectors are repeated on both PU1 and PU2 to obtain all n vectors with i = 1, 2, 3, …, n. For example, if n = 16, w^+_1 and w^+_2 are calculated first with i = 1 and i = 2 in PU1 and PU2, respectively, through Step 1-Step 7. Next, {w^+_3, w^+_4}, {w^+_5, w^+_6}, …, {w^+_15, w^+_16} are calculated with i = {3, 4}, i = {5, 6}, … and i = {15, 16} in PU1 and PU2, respectively. In Step 9, the Gram-Schmidt orthonormalization is calculated for W; this calculation is similar to Step 1-Step 8 of the whitening process in Section 4.3. In Step 10, the sum of the absolute dot products of the old and new weight vectors of W is calculated on the MAA unit to determine whether W has converged.
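One sweep of Step 1-Step 10 can be sketched in software as below. The piecewise-linear helper pwl_g uses made-up breakpoints purely for illustration (the actual segments come from the look-up table of [38], which is not reproduced here), and the division by 512 is omitted as in Step 7.

```python
import numpy as np

def pwl_g(u):
    """Hypothetical piecewise-linear stand-in for the nonlinearity g and
    its derivative: returns (alpha*u + beta, alpha) per sample. The slopes
    and offsets are illustrative guesses, not the table of [38]."""
    alpha = np.where(np.abs(u) < 1.0, 1.0, 0.1)                 # assumed slopes
    beta = np.where(u > 1.0, 0.9, np.where(u < -1.0, -0.9, 0.0))
    return alpha * u + beta, alpha

def fastica_update(W, Z):
    """One sweep of the fixed-point algorithm, (29)-(31): update every row
    w_i from the old W, then re-orthonormalize by Gram-Schmidt."""
    n, m = Z.shape
    W_new = np.empty_like(W)
    for i in range(n):
        u = W[i] @ Z                                  # w_i^T z_j for all j
        g, g_prime = pwl_g(u)
        W_new[i] = Z @ g - g_prime.sum() * W[i]       # (29), /512 omitted
    for k in range(n):                                # (30) and (31)
        for i in range(k):
            W_new[k] -= (W_new[k] @ W_new[i]) * W_new[i]
        W_new[k] *= (W_new[k] @ W_new[k]) ** -0.5
    return W_new
```

The convergence test of Step 10 would then compare the sum of the absolute dot products between the rows of the old and new W against the user-defined threshold.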
4.5 Implementation of the Re-reference, Synchronized Average and Moving Average Functions
In terms of the re-reference, synchronized average and moving average processing, the re-reference signal y_{r,i}(j), the synchronized average signal y_s(j), and the moving average signal y_m(j) are defined in (32), (33), and (34), respectively.
y_{r,i}(j) = x_i(j) - x_{\mathrm{baseline}}(j), \quad \text{for } j = 1, 2, 3, \ldots, m, \qquad (32)

y_s(j) = \frac{1}{h}\sum_{\mathrm{trial}=1}^{h} x_{\mathrm{trial}}(j), \quad \text{for } j = 1, 2, 3, \ldots, m, \qquad (33)

y_m(j) = \frac{1}{r}\sum_{k=0}^{r-1} x_{\mathrm{target\text{-}channel}}(j-k), \quad \text{for } j = 1, 2, 3, \ldots, m, \qquad (34)
where i = 1, 2, 3, …, n, and h and r denote the number of trials and the number of samples on average, respectively. Since h and r are the first parameters of the SYNAVG and
Figure 7 First dataset: 16-channel mixed signals generated by MATLAB and [42].
MOVAVG instructions, respectively, they can only be set to 2, 4, 8, or 16 in the proposed hardware architecture, as mentioned in Section 4.2. Therefore, the division operations in (33) and (34) can be performed with the division-by-2^λ units in the PUs by setting λ to 1, 2, 3 or 4. The operations of addition and subtraction in (32), (33), and (34) are performed on the MAA units of PU1 and PU2. For the re-reference function, PU1 and PU2 concurrently perform the computations for different channels. For the synchronized average and moving average functions, PU1 and PU2 concurrently perform the computations for different discrete times j.
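Functional models of (32)-(34) are sketched below. The function names and the array layout (one row per channel or trial) are our assumptions; in the hardware the divisions become shifts because h and r are restricted to powers of two.

```python
import numpy as np

def rereference(x, baseline):
    """(32): subtract the baseline channel from every channel.
    x: (n, m) array; baseline: index of the baseline channel."""
    return x - x[baseline]

def synchronized_average(trials):
    """(33): average h trials sample-by-sample. trials: (h, m) array."""
    return trials.mean(axis=0)

def moving_average(x, r):
    """(34): causal moving average over r samples (r in {2, 4, 8, 16}),
    treating samples before j = 1 as zero."""
    padded = np.concatenate([np.zeros(r - 1), x])
    return np.convolve(padded, np.ones(r) / r, mode='valid')
```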
Figure 8 Comparison between (a) the floating-point software separation results, (b) the artificial source signals and (c) the fixed-point software separation results of the FastICA function for the first dataset.
5 Evaluation and Comparison Results
In this section, the double-precision floating-point (i.e., the most accurate precision) software simulation in MATLAB is performed to verify the FastICA algorithm based on the Gram-Schmidt orthonormalization for the whitening process and the fixed-point algorithm. The software simulations of the re-reference, synchronized average and moving average functions are also performed in MATLAB. The hardware architecture is validated by the post-layout simulation results. Finally, the results are summarized and compared with other related works.
5.1 Software Simulation
To verify the separation quality of the FastICA algorithm based on the Gram-Schmidt orthonormalization for the whitening process and the fixed-point algorithm, the software simulation is performed. In our simulation, three datasets are adopted. The first dataset contains 16-channel artificial mixed signals which are obtained by randomly mixing the 16-channel artificial source signals in 12-bit fixed-point format as shown in Fig. 7. Note that the 16-channel artificial source signals contain two channels of sparse and very sparse source signals in [46] and 14 other channels of source signals generated by MATLAB functions including abs, sin, square, rem, sinc, rand, log and randperm. Since the source signals in the first dataset are known, the software separation results can be compared to the source signals to determine the separation quality. Figure 8 shows the artificial source signals and the floating-point and fixed-point software separation results of the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process for the first dataset. The numbers between Fig. 8a and b denote the absolute correlation coefficients between the source signals and the floating-point separation results. On the other hand, the numbers between Fig. 8b and c denote the absolute correlation coefficients between the source signals and the fixed-point separation results. For the floating-point separation results, the average and minimum absolute correlation coefficients are 0.9830 and 0.9656, respectively. The signal-to-interference ratio (SIR) [47] results for the 16-channel source signals and the software separation results with the first dataset are 25.0082, 13.0428, 19.1261, 12.8986, 13.6465, 19.4119, 11.6197, 14.2428, 18.2994, 13.4162, 12.5925, 28.7617, 12.6471, 16.2938, 13.4617, and 15.8396 dB, respectively, with an average of 16.2693 dB. As a result, the artificial mixed signals are well separated in the floating-point software simulation. For the fixed-point separation results, the average and minimum absolute correlation coefficients are 0.8389 and 0.4223, respectively, which are lower than those of the floating-point simulation results.
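The quality measure quoted above, the absolute correlation coefficient between a source signal and its matched separated component, can be reproduced with a one-liner; the function name is ours.

```python
import numpy as np

def abs_corrcoef(source, separated):
    """Absolute Pearson correlation coefficient between one source signal
    and one separated component (the quality measure used in this section)."""
    return abs(np.corrcoef(source, separated)[0, 1])
```

A perfectly recovered component scores 1 regardless of its sign or scale, which is why the metric is taken in absolute value: ICA can only recover sources up to permutation, sign and scaling.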
The second dataset contains the 16-channel EEG signals which are adopted from the dataset in the EEGLAB toolbox [42, 43]. The details of the experiment were described in [6]. Figure 9 shows the 16-channel EEG signals which are captured from the positions of FPZ, F3, FZ, F4, C3, C4, CZ, P3, PZ, P4, PO3, POZ, PO4, O1, OZ, and O2 of the international 10-20 electrode placement system. The 16-channel EEG signals are transformed to 12-bit fixed-point format with a sampling rate of 256 Hz and a window size of two seconds. Figure 10a and b show the floating-point and fixed-point software simulation results for the second dataset, respectively. The numbers in the middle of Fig. 10 denote the absolute correlation coefficients between the floating-point and fixed-point simulation results. For the floating-point simulation results, as can be seen in Fig. 10a, the independent components associated with the late positive event-related potential can be extracted
Figure 9 Second dataset: 16-channel EEG signals.
in the first three rows. The floating-point simulation results of Fig. 10a are similar to those reported in [6]. However, in Fig. 10b, one component cannot be easily observed in the second row. The average and minimum absolute correlation coefficients are 0.5507 and 0.3231, respectively.

The third dataset contains 4-channel EEG signals which have an eye blinking artifact as shown in Fig. 11. The 4-channel EEG signals are captured from the positions of FP1, FP2, F3 and F4 of the international 10-20 electrode placement system. The sampling rate and the format of the third dataset are the same as those of the second dataset. The software simulation results for the third dataset are shown in Fig. 12. As can be seen, the eye blinking component is separated in the first row of Fig. 12. According to the software simulation results with the above three datasets, the validation
Figure 10 Software simulation result utilizing the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process for the second dataset in MATLAB: (a) floating-point simulation results and (b) fixed-point simulation results.
Figure 11 Third dataset: 4-channel EEG signals.
of the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process can be achieved for both artificial mixed signals and EEG signals.

Figure 13 shows the software simulation result of the re-reference function, where the EEG signals at the positions of PZ and CZ in Fig. 9 are used as the input signal and the baseline signal, respectively. Figure 14 shows the software simulation result of the synchronized average function with 16 EEG trials captured from the position of CPZ. Finally, the EEG signal with high-frequency noise adopted from [46] and the corresponding software simulation result of the moving average function are shown in Fig. 15. As can be seen, the high-frequency noise of the EEG signal is filtered out after the moving average processing.
5.2 Post-Layout Simulation
In this subsection, the post-layout simulation is performed. The results of the post-layout simulation are compared to the software simulation results of the previous subsection to validate the function of the proposed cost-effective hardware implementation. For the FastICA function, the three datasets mentioned in the previous subsection are used for the post-layout simulation. Figures 16, 17 and 18 show the comparison results with the first dataset of the 16-channel artificial mixed signals in Fig. 7, the second dataset of the 16-channel EEG signals in Fig. 9 and the third dataset of the 4-channel EEG signals in Fig. 11, respectively. In Figs. 16, 17 and 18, the numbers in the middle of the figures denote the absolute correlation coefficients between the software simulation results and the post-layout simulation results. For the first dataset, the average and the minimum values of the absolute correlation
Figure 12 Software simulation result utilizing the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process for the third dataset in MATLAB.
Figure 13 Software simulation result of the re-reference function in MATLAB.
Figure 14 (a) 16 trials of the EEG signals from the position of CPZ; (b) software simulation result of the synchronized average function in MATLAB.
coefficients are 0.9967 and 0.9852, respectively. For the second dataset, the average and minimum values of the absolute correlation coefficients are 0.9441 and 0.8424, respectively. For the third dataset, the average absolute correlation coefficient is 0.9868. Although the datasets in this work are different from those in [31], it should be noted that the minimum value of the absolute correlation coefficients of the second dataset (16-channel EEG signals), 0.8424, is close to that of the EEG signals in Fig. 17 of [31], 0.8499. According to the comparison results, the post-layout simulation results attain enough accuracy compared to the software simulation results for all three datasets.

Figure 19 shows the comparison result of the artificial source signals and the post-layout simulation results with the first dataset in Fig. 7, where the numbers in the middle of Fig. 19 denote the absolute correlation coefficients between
Figure 15 (a) EEG signal with high-frequency noise; (b) software simulation result of the moving average function in MATLAB.
Figure 16 Comparison between the software simulation results and post-layout simulation results of the FastICA function for the first dataset in Fig. 7.
Figure 17 Comparison between the software simulation results and post-layout simulation results of the FastICA function for the second dataset in Fig. 9.
Figure 18 Comparison between the software simulation results and post-layout simulation results of the FastICA function for the third dataset in Fig. 11.
the source signals and the post-layout simulation results. The average and minimum absolute correlation coefficients are 0.9822 and 0.9678, respectively. The SIR results for the 16-channel source signals and the post-layout simulation results with the first dataset are 22.2621, 13.0944, 18.5266, 14.6086, 14.0360, 19.4625, 12.1755, 14.3472, 17.7386, 13.8883, 13.5538, 27.8766, 12.8188, 16.7454, 13.8240 and 14.1117 dB, respectively, with an average value of 16.1919 dB. The results show that the separation quality of the proposed cost-effective FastICA hardware implementation is satisfactory. Figures 20, 21, and 22 show the comparison results of the re-reference, synchronized average and moving
Figure 19 Comparison between the artificial source signals and post-layout simulation results of the FastICA function for the first dataset in Fig. 7.
Figure 20 Comparison between the software simulation result and post-layout simulation result of the re-reference function.
average functions, respectively. As can be seen, the absolute correlation coefficients can be up to 1 such that their accuracies are guaranteed. Since the post-layout simulation results are close to the software simulation results, the function of the hardware implementation is validated with enough accuracy.
5.3 Evaluation and Comparison Results
According to the post-layout implementation results of the proposed FastICA hardware architecture, Table 6 summarizes the post-layout characteristics. The power consumption of the proposed hardware architecture for the FastICA processing is measured after RC extraction of the placed-and-routed netlist with Synopsys PrimeTime. The FastICA hardware architecture consumes 19.4 mW at 100 MHz when 16 channels are activated. Figure 23 shows the power consumption results for different numbers of channels. In the proposed FastICA hardware architecture, the power consumption results for different numbers of channels are close since PU1 and PU2 are frequently utilized due to the heavy computations of the updating step [16] in (29) with the sample number of 512. On the other hand, Fig. 24 shows the energy evaluation results for different numbers of channels. As can be seen, with a higher channel number, the energy consumption increases since the computation time increases with the larger computational complexity. For the proposed FastICA hardware architecture, we target the FastICA processing of EEG data with a window size of 2 s and a sampling rate of 256 Hz. As can be seen in Table 6, the maximum hardware computation time for the 16-channel FastICA processing is 1.85 s at 100 MHz, which is shorter than 2 s. That means the proposed FastICA hardware architecture can output 512 samples per channel every 2 s; in other words, the throughput is 256 samples/s. Thus, the proposed FastICA hardware architecture implementation can support enough throughput for real-time application in this work.
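The real-time argument above reduces to a simple check of the numbers reported in Table 6, sketched here for clarity:

```python
window_s = 2.0          # window size in seconds (256 Hz x 512 samples)
samples = 512           # samples per channel per window
compute_time_s = 1.85   # maximum 16-channel FastICA time at 100 MHz (Table 6)

real_time = compute_time_s < window_s   # computation fits within one window
throughput = samples / window_s         # samples per second per channel
```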
Table 7 lists the comparison results in terms of application, algorithm, arithmetic, number of channels/weight vectors, sample size, speed, power dissipation, gate count, computation time, implementation approach, process technology, core area and variable channel among the post-layout implementation results of the proposed FastICA hardware architecture and the existing ICA implementations. In [20] and [26], FPGA implementations are used for two-channel ICA processing with operating frequencies of 12.288 and 50 MHz, respectively. The gate count in [20] is 11.469 thousand in the FPGA approach; however, its computation time is larger than 60 s for ANC. Based on the parallel ICA algorithm, the operating frequency of the ICA implementation on FPGA is 20.161 MHz [21, 22] and between 21.357 and 35.921 MHz [23] for image processing. The design of [28] operates at 68 MHz on FPGA using the INFOMAX algorithm for four-channel EEG signal processing. In [31, 32], the hardware implementation achieves eight-channel FastICA processing with a dedicated EVD processor. In [33], the self-configured ICA implementation achieves 16-channel processing with a dedicated principal component decomposition engine (PCDE) for EVD processing. Since the implementation approach information and post-layout results of the overall architecture are not provided in [29], this low-complexity pioneer work is not listed in Table 7 for hardware comparison. In this work, the proposed hardware architecture can perform the FastICA processing with a variable number of channels from 2 to 16. By proposing two reused PUs for the preprocessing and the fixed-point algorithm in the FastICA algorithm, the gate count of this work is only 47.43 % higher than that reported in [31], while the processing capability in this work is up to 16 channels and 512 samples, which are two
Figure 21 Comparison between the software simulation result and post-layout simulation result of the synchronized average function.
Figure 22 Comparison between the software simulation result and post-layout simulation result of the moving average function.
J Sign Process Syst (2016) 82:91–113 109
times as those provided in [31]. The gate count of this work with floating-point implementation is 5.79 times that reported in [32] with fixed-point implementation; however, the FastICA processor in [32] supports only eight-channel electrocorticography (ECoG) signal processing. For 16-channel EEG signal processing in our work, the separation quality results in Section 5.2 can be guaranteed by using floating-point arithmetic and twice the sample size. Although the ICA hardware in [33] provides 16-channel FastICA processing with good separation quality, the variable-channel feature is not offered. On the other hand, the core size of [33] is larger than that of this work even when the difference in process technology is considered. In summary, this hardware architecture is cost effective while providing satisfactory separation quality for the 2-to-16-channel EEG signal processing application. In addition, the user-defined parameters as well as the functions of re-reference, synchronized average, and moving average add flexibility to the proposed hardware architecture.
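The relative gate-count figures quoted above follow directly from the Table 7 entries (401 K for this work, 272 K for [31], and 69.2 K for [32]); a minimal sketch reproducing them, with a helper name of our own choosing:

```python
# Reproduce the gate-count comparisons quoted in the text from Table 7:
# 401K gates (this work), 272K gates (Van [31]), 69.2K gates (Yang [32]).

def percent_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

this_work, van_31, yang_32 = 401.0, 272.0, 69.2  # gate counts, x10^3

print(round(percent_increase(this_work, van_31), 2))  # 47.43 (% higher than [31])
print(round(this_work / yang_32, 2))                  # 5.79 (times the count of [32])
```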
6 Conclusion
In this work, we propose a cost-effective, 2-to-16 variable-channel FastICA hardware architecture for EEG signal processing. With the Gram-Schmidt based whitening in FastICA, the two proposed PUs can be shared between the preprocessing and the fixed-point algorithm to increase the hardware efficiency of the presented FastICA hardware architecture. The functions of re-reference, synchronized average, and moving average, as well as the user-defined parameters, are also supported in the proposed FastICA hardware architecture to attain flexibility. The evaluation and simulation results demonstrate that the proposed hardware architecture achieves satisfactory BSS quality and efficient hardware usage. On the other hand, if more than 16 channels are demanded, the sample size must be increased substantially, as analyzed in Fig. 3, so a large memory size will be required. Balancing separation quality against memory size for larger sample sizes and channel numbers will be the subject of future work.
Table 6 Summary of the post-layout characteristics.
Power supply: 1.0 V
Max. clock: 100 MHz
Core power for 16 channels: 19.4 mW
Gate count: 401 K*
Core area: 1.300 × 1.100 mm²
Pin count: 131
Process technology: TSMC 90 nm CMOS
Max. computation time: 1.85 s
* This number is counted using the size of a two-input NAND gate.
Figure 23 Power consumption in linear scale of the proposed FastICA hardware architecture versus the number of channels (2, 4, 8, and 16).
Figure 24 Energy consumption in log scale of the proposed FastICA hardware architecture versus the number of channels (2, 4, 8, and 16).
Table 7 Hardware comparison results among various ICA implementations.

Kim [20]: Application: Speech; Algorithm: ICA; Arithmetic: Floating-point; Channels/weight vectors (WVs): 2; Sample size: 512; Speed: 12.288 MHz (FPGA), N/A (ASIC); Power: 98.8 mW (FPGA), 14.5 mW (ASIC); Gates: 11.469 K (FPGA, for ANC), N/A (ASIC); Computation time: ≥60 s (FPGA), N/A (ASIC); Approach: FPGA, ASIC; Process: N/A (FPGA), 350 nm (ASIC); Core area: N/A; Variable channel: No.

Du [21, 22]: Application: Image; Algorithm: Parallel ICA; Arithmetic: Fixed-point; Channels/WVs: 4 (WVs); Sample size: N/A; Speed: 20.161 MHz (FPGA), N/A (ASIC); Power: N/A; Gates: 229.5 K (FPGA), N/A (ASIC); Computation time: N/A; Approach: FPGA, ASIC; Process: N/A (FPGA), 180 nm (ASIC); Core area: 1.41 mm²; Variable channel: No.

Du [23] *1: Application: Image; Algorithm: Parallel ICA; Arithmetic: Fixed-point; Channels/WVs: 4, 8, 12, 16, 20, 24 (WVs); Sample size: N/A; Speed: 21.829 MHz *2, 21.357 MHz *3, 35.921 MHz *4; Power: N/A; Gates: N/A; Computation time: 1129.5 s *5; Approach: FPGA; Process: N/A; Core area: N/A; Variable channel: Yes.

Shyu [26]: Application: Speech; Algorithm: FastICA; Arithmetic: Floating-point; Channels/WVs: 2; Sample size: 3000; Speed: 50 MHz; Power: N/A; Gates: N/A; Computation time: 0.003 s; Approach: FPGA; Process: N/A; Core area: N/A; Variable channel: No.

Huang [28]: Application: EEG; Algorithm: INFOMAX; Arithmetic: Fixed-point; Channels/WVs: 4; Sample size: 512; Speed: 68 MHz; Power: N/A; Gates: 315 K; Computation time: N/A; Approach: FPGA; Process: N/A; Core area: N/A; Variable channel: No.

Van [31]: Application: EEG; Algorithm: FastICA; Arithmetic: Hybrid; Channels/WVs: 8; Sample size: 256; Speed: 100 MHz; Power: 16.35 mW; Gates: 272 K; Computation time: 0.29 s (max); Approach: ASIC; Process: 90 nm; Core area: 1.49 mm²; Variable channel: No.

Yang [32]: Application: ECoG; Algorithm: FastICA; Arithmetic: Fixed-point; Channels/WVs: 8; Sample size: 256; Speed: 11 MHz; Power: 0.0816 mW; Gates: 69.2 K; Computation time: 0.0842 s (max); Approach: ASIC; Process: 90 nm; Core area: 0.40 mm²; Variable channel: No.

Roh [33]: Application: EEG; Algorithm: Self-configured ICA; Arithmetic: N/A; Channels/WVs: 16; Sample size: 512; Speed: 20 MHz; Power: 4.45 mW *6; Gates: N/A; Computation time: Variable; Approach: ASIC; Process: 130 nm; Core area: 7.02 mm²; Variable channel: No.

This work: Application: EEG; Algorithm: FastICA; Arithmetic: Floating-point; Channels/WVs: 2–16; Sample size: 512; Speed: 100 MHz; Power: 19.4 mW *7; Gates: 401 K; Computation time: 1.85 s (max); Approach: ASIC; Process: 90 nm; Core area: 1.43 mm²; Variable channel: Yes.

*1: Needs an external memory. *2: For sub-matrix. *3: For external decorrelation. *4: For comparison. *5: For 20 WVs. *6: For Mode 3. *7: For 16 channels.
Acknowledgments This work was supported in part by the Ministry of Science and Technology (MOST) Grants MOST 103-2221-E-009-099-MY3 and MOST 103-2911-I-009-101, and in part by Contract 104W963. The authors thank the Chip Implementation Center for their support.
References
1. Luck, S. J. (2005). An introduction to the event-related potential technique. Cambridge: MIT Press.
2. Bruce, E. N. (2005). Biomedical signal processing and signal modeling. New York: Wiley.
3. Hyvärinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4–5), 411–430.
4. Vigário, R. N. (1997). Extraction of ocular artefacts from EEG using independent component analysis. Electroencephalography and Clinical Neurophysiology, 103(3), 395–404.
5. Vigário, R., Särelä, J., Jousmäki, V., Hämäläinen, M., & Oja, E. (2000). Independent component approach to the analysis of EEG and MEG recordings. IEEE Transactions on Biomedical Engineering, 47(5), 589–593.
6. Makeig, S., Westerfield, M., Jung, T. P., Covington, J., Townsend, J., Sejnowski, T. J., & Courchesne, E. (1999). Functionally independent components of the late positive event-related potential during visual spatial attention. Journal of Neuroscience, 19(7), 2665–2680.
7. Castellanos, N. P., & Makarov, V. A. (2006). Recovering EEG brain signals: artifact suppression with wavelet enhanced independent component analysis. Journal of Neuroscience Methods, 158(2), 300–312.
8. Kachenoura, A., Albera, L., Senhadji, L., & Comon, P. (2008). ICA: a potential tool for BCI systems. IEEE Signal Processing Magazine, 25(1), 57–68.
9. Albera, L., Kachenoura, A., Comon, P., Karfoul, A., Wendling, F., Senhadji, L., & Merlet, I. (2012). ICA-based EEG denoising: a comparative analysis of fifteen methods. Bulletin of the Polish Academy of Sciences Technical Sciences, 60(3), 407–418.
10. Gao, J. F., Yang, Y., Lin, P., Wang, P., & Zheng, C. X. (2010). Automatic removal of eye-movement and blink artifacts from EEG signals. Brain Topography, 23(1), 105–114.
11. Lee, T. W. (1998). Independent component analysis: Theory and applications. Boston: Kluwer.
12. Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing: Learning algorithms and applications. New York: Wiley.
13. Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
14. Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
15. Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
16. Oja, E., & Yuan, Z. (2006). The FastICA algorithm revisited: convergence analysis. IEEE Transactions on Neural Networks, 17(6), 1370–1381.
17. Ollila, E. (2010). The deflation-based FastICA estimator: statistical analysis revisited. IEEE Transactions on Signal Processing, 58(3), 1527–1541.
18. Choi, S., Cichocki, A., & Amari, S. (2000). Flexible independent component analysis. Journal of VLSI Signal Processing, 26(1–2), 25–38.
19. Zarzoso, V., & Comon, P. (2010). Robust independent component analysis by iterative maximization of the kurtosis contrast with algebraic optimal step size. IEEE Transactions on Neural Networks, 21(2), 248–261.
20. Kim, C. M., Park, H. M., Kim, T., Choi, Y. K., & Lee, S. Y. (2003). FPGA implementation of ICA algorithm for blind signal separation and adaptive noise canceling. IEEE Transactions on Neural Networks, 14(5), 1038–1046.
21. Du, H., Qi, H., & Peterson, G. D. (2004). Parallel ICA and its hardware implementation in hyperspectral image analysis. Proceedings of SPIE, 5439, 74–83.
22. Du, H., & Qi, H. (2004). An FPGA implementation of parallel ICA for dimensionality reduction in hyperspectral images. Proceedings of IEEE International Geoscience and Remote Sensing Symposium, 5, 3257–3260.
23. Du, H., & Qi, H. (2006). A reconfigurable FPGA system for parallel independent component analysis. EURASIP Journal on Embedded Systems, 2006(23025), 1–12.
24. Du, H., Qi, H., & Wang, X. (2007). Comparative study of VLSI solutions to independent component analysis. IEEE Transactions on Industrial Electronics, 54(1), 548–558.
25. Jain, V. K., Bhanja, S., Chapman, G. H., Doddannagari, L., & Nguyen, N. (2005). A parallel architecture for the ICA algorithm: DSP plane of a 3-D heterogeneous sensor. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 5, 77–80.
26. Shyu, K. K., Lee, M. H., Wu, Y. T., & Lee, P. L. (2008). Implementation of pipelined FastICA on FPGA for real-time blind source separation. IEEE Transactions on Neural Networks, 19(6), 958–970.
27. Acharyya, A., Maharatna, K., Sun, J., Al-Hashimi, B. M., & Gunn, S. R. (2009). Hardware efficient fixed-point VLSI architecture for 2D kurtotic FastICA. Proceedings of European Conference on Circuit Theory and Design, 165–168.
28. Huang, W. C., Hung, S. H., Chung, J. F., Chang, M. H., Van, L. D., & Lin, C. T. (2008). FPGA implementation of 4-channel ICA for on-line EEG signal separation. Proceedings of IEEE Biomedical Circuits and Systems Conference, 65–68.
29. Acharyya, A., Maharatna, K., Al-Hashimi, B. M., & Reeve, J. (2011). Coordinate rotation based low complexity N-D FastICA algorithm and architecture. IEEE Transactions on Signal Processing, 59(8), 3997–4011.
30. Volder, J. E. (1959). The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers, 8(3), 330–334.
31. Van, L. D., Wu, D. Y., & Chen, C. S. (2011). Energy-efficient FastICA implementation for biomedical signal separation. IEEE Transactions on Neural Networks, 22(11), 1809–1823.
32. Yang, C. H., Shih, Y. H., & Chiueh, H. (2015). An 81.6 μW FastICA processor for epileptic seizure detection. IEEE Transactions on Biomedical Circuits and Systems, 9(1), 60–71.
33. Roh, T., Song, K., Cho, H., Shin, D., & Yoo, H. J. (2014). A wearable neuro-feedback system with EEG-based mental status monitoring and transcranial electrical stimulation. IEEE Transactions on Biomedical Circuits and Systems, 8(6), 755–764.
34. Strang, G. (2009). Introduction to linear algebra (4th ed.). Wellesley, MA: Wellesley-Cambridge Press.
35. Cichocki, A., Osowski, S., & Siwek, K. (2004). Prewhitening algorithms of signals in the presence of white noise. Proceedings of VI International Workshop "Computational Problems of Electrical Engineering", 205–208.
36. Woolfson, M. S., Bigan, C., Crowe, J. A., & Hayes-Gill, B. R. (2008). Method to separate sparse components from signal mixtures. Digital Signal Processing, 18(6), 985–1012.
37. Sharma, A., & Paliwal, K. K. (2007). Fast principal component analysis using fixed-point algorithm. Pattern Recognition Letters, 28(10), 1151–1155.
38. Huang, P. Y. (2011). Design and implementation of a multi-function cost-effective EEG signal processor. M.S. thesis, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan. (Advisor: Lan-Da Van).
39. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
40. The FastICA package for MATLAB. http://www.cis.hut.fi/projects/ica/fastica/.
41. Lan, T., Erdogmus, D., Pavel, M., & Mathan, S. (2006). Automatic frequency bands segmentation using statistical similarity for power spectrum density based brain computer interfaces. Proceedings of International Joint Conference on Neural Networks, 4650–4655.
42. Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(2), 9–21.
43. EEGLAB Wiki – SCCN. http://sccn.ucsd.edu/wiki/EEGLAB.
44. Wu, A. Y. (1995). Algorithm-based low-power digital signal processing system designs. Ph.D. thesis, Department of Electrical Engineering, University of Maryland, College Park, MD, USA.
45. Schulte, M. J., & Wires, K. E. (1999). High-speed inverse square roots. Proceedings of 14th IEEE Symposium on Computer Arithmetic, 124–131.
46. ICALAB for Signal Processing – benchmarks. http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/benchmarks/.
47. ICALAB for Signal Processing. http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/.
Lan-Da Van received the B.S. (Honors) and the M.S. degrees from Tatung Institute of Technology, Taipei, Taiwan, in 1995 and 1997, respectively, and the Ph.D. degree from National Taiwan University (NTU), Taipei, Taiwan, in 2001, all in electrical engineering.
From 2001 to 2006, he was an Associate Researcher at the National Chip Implementation Center (CIC), Hsinchu, Taiwan. Since February 2006, he has been with the faculty of the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, where he is currently an Associate Professor. His research interests are in VLSI algorithms, architectures, and chips for digital signal processing (DSP) and biomedical signal processing. This includes the design of low-power/high-performance/cost-effective 3-D graphics processing systems, adaptive/learning systems, computer arithmetic, multidimensional filters, and transforms. He has published more than 50 journal and conference papers in these areas.
Dr. Van was a recipient of the Chunghwa Picture Tube (CPT) and Motorola fellowships in 1995 and 1997, respectively. He was the elected chairman of the IEEE NTU Student Branch in 2000. In 2001, he received the IEEE award for outstanding leadership and service to the IEEE NTU Student Branch. In 2005, he was a recipient of the Best Poster Award at the iNEER Conference for Engineering Education and Research (iCEER). In 2014, he was a recipient of the teaching award of the College of Computer Science, National Chiao Tung University, and of the best paper award at the IEEE International Conference on Internet of Things (iThings 2014). Since 2009, he has served as an officer of the IEEE Taipei Section. In 2014, he served as Track Co-Chair of the 22nd IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC). He serves as Associate Editor and Guest Editor of the Journal of Medical Imaging and Health Informatics, and has served as a reviewer for several IEEE transactions and journals.
Po-Yen Huang received the B.S. degree in the EECS undergraduate honors program and the M.S. degree in computer science from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2010 and 2011, respectively. His research interests include VLSI information processing algorithms and architectures for biomedical signal processing.
Tsung-Che Lu received the B.S. degree in engineering science from National Cheng Kung University (NCKU), Tainan, Taiwan, in 2006, and the M.S. degree in engineering and system science from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2008. He is currently working toward the Ph.D. degree with the Department of Computer Science, National Chiao Tung University (NCTU), Hsinchu, Taiwan. His research interests include analog and digital circuits for biomedical signal processing.