Cost-Effective and Variable-Channel FastICA Hardware Architecture and Implementation for EEG Signal Processing
Lan-Da Van & Po-Yen Huang & Tsung-Che Lu
Received: 1 July 2014 / Revised: 18 February 2015 / Accepted: 23 February 2015 / Published online: 21 March 2015
© Springer Science+Business Media New York 2015
Abstract This paper proposes a cost-effective and variable-channel floating-point fast independent component analysis (FastICA) hardware architecture and implementation for EEG signal processing. The Gram-Schmidt orthonormalization based whitening process is utilized to eliminate the use of dedicated hardware for eigenvalue decomposition (EVD) in the FastICA algorithm. The proposed two processing units, PU1 and PU2, in the presented FastICA hardware architecture can be reused for the centering operation of preprocessing and the updating step of the fixed-point algorithm of the FastICA algorithm, and PU1 is reused for the Gram-Schmidt orthonormalization operation of preprocessing and the fixed-point algorithm to reduce the hardware cost and support 2-to-16 channel FastICA. Apart from the FastICA processing, the proposed hardware architecture supports re-reference, synchronized average, and moving average functions. The cost-effective and variable-channel FastICA hardware architecture is implemented in a 90 nm 1P9M complementary metal-oxide-semiconductor (CMOS) process. As a result, the FastICA hardware implementation consumes 19.4 mW at 100 MHz with a 1.0 V supply voltage. The core size of the chip is 1.43 mm². From the experimental results, the presented work achieves satisfactory performance for each function.
Keywords Cost effective · EEG signal processing · FastICA algorithm · Gram-Schmidt · Hardware architecture · Variable channel
1 Introduction
Electroencephalogram (EEG) research plays an important role in exploring the relation between human behavior and the brain [1, 2]. The EEG signals are the electrical potential recordings captured from the scalp sensors, where the electrical potentials mean that the components of the brain activity [3] are mixed. In addition, the EEG signals may be contaminated by the artifacts of eye activity or muscular activity [4–10]. Therefore, analyzing EEG signals becomes a challenge. In brain research, the independent component analysis (ICA) algorithm is regarded as a useful method to study the brain activity through EEG signals [4–10]. The ICA algorithm is developed to solve the blind source separation (BSS) problem such that the mixed-up signals can be separated [11–19] and the brain activity information can be revealed.
Hardware architecture and implementation of ICA algorithms are challenging when considering cost, flexibility, speed, and/or power. Many works have proposed the following ICA hardware implementations. Kim et al. [20] proposed a field-programmable gate array (FPGA) implementation of the 2-channel BSS constructed by the adaptive noise canceling (ANC) module. The 2-channel ANC model consumes 98.8 mW at 12.288 MHz at 1.8 V in FPGA. Du et al. [21–24] proposed a parallel ICA algorithm and the corresponding FPGA implementation on a Pilchard board, where several sub-processes run with multiple data parallelism. Jain et al. [25] proposed a parallel architecture for the FastICA algorithm. A floating-point arithmetic unit is adopted in the pipelined FastICA design of [26] to increase the precision. One hardware-efficient fixed-point FastICA is addressed in [27]. Fixed-point arithmetic is used in an FPGA-based INFOMAX ICA design [28]. Acharyya et al. [29] proposed a pioneering coordinate rotation digital computer (CORDIC)-based [30] n-dimensional FastICA algorithm and architecture by reusing the CORDIC module. Although the proposed architecture [29] has low-complexity characteristics by the algorithm analysis, the post-layout result of the proposed overall architecture is not available. Van et al. [31] proposed an energy-efficient eight-channel FastICA implementation with an early determination scheme. The use of the dedicated eigenvalue decomposition (EVD) processor for eight-channel preprocessing in [31] adds overhead to the hardware cost. Besides, it is predicted that the computational complexity of the EVD based whitening process increases greatly when a higher channel number is requested [31]. Yang et al. [32] proposed a low-power eight-channel FastICA processor for epileptic seizure detection with fixed-point arithmetic. Roh et al. [33] proposed a 16-channel self-configured ICA implementation for a wearable neuro-feedback system.

L.-D. Van (*) · P.-Y. Huang · T.-C. Lu, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan 300, Republic of China. e-mail: [email protected]

J Sign Process Syst (2016) 82:91–113. DOI 10.1007/s11265-015-0988-2
It has been pointed out that the Gram-Schmidt [34] based whitening described in [12, 35, 36] may be applied to the solution of the BSS problem [35]. Concerning the computational complexity and hardware cost, in this paper the Gram-Schmidt orthonormalization is applied in the whitening process and the fixed-point algorithm. Thus, it is expected that the computational complexity can be reduced and the hardware resources can be reused. Note that the Gram-Schmidt orthonormalization has been adopted for the fixed-point iteration in the principal component analysis (PCA) algorithm to reduce the dimensions in [37]. To the best of our knowledge, the state-of-the-art dedicated hardware implementations do not adopt the Gram-Schmidt based whitening for FastICA processing. On the other hand, for more flexibility, it is desirable to support variable channel selection and provide re-reference [1], synchronized average [1], and moving average [2] functions for EEG signal processing. Thus, we are motivated to propose a cost-effective FastICA architecture [38] that supports variable-channel FastICA operations and re-reference, synchronized average, and moving average functions. Herein, two reused processing units (PUs) are proposed to achieve a cost-effective and variable-channel FastICA hardware implementation. The main contributions of this work are summarized as follows:
1) Propose a cost-effective and 2-to-16 channel floating-point FastICA architecture with two new reused PUs adopting Gram-Schmidt based whitening.

2) Implement the FastICA hardware architecture in the application-specific integrated circuit (ASIC) approach to support variable-channel FastICA, re-reference, synchronized average, and moving average functions.
As a result, the proposed FastICA architecture is implemented in the TSMC 90 nm 1P9M complementary metal-oxide-semiconductor (CMOS) process with a core size of 1.43 mm². The power consumption for 16-channel processing is 19.4 mW at 100 MHz.

The rest of this paper is organized as follows. The background for the FastICA algorithm is described in Section 2. The Gram-Schmidt orthonormalization based whitening is described and analyzed in Section 3. Next, the cost-effective and variable-channel FastICA architecture with the proposed PUs and the corresponding post-layout implementation are described in Section 4. In Section 5, we show the software and the corresponding post-layout simulation results for the validation of the proposed hardware implementation. Last, the conclusion and future work are given in Section 6.
2 Background for the FastICA Algorithm
2.1 Independent Component Analysis
Considering that the number of blind source signals and the sample number are n and m, respectively, the BSS system can be modeled as (1) [26, 29, 31].

$$X = AS \qquad (1)$$

where $X$, $A$, and $S$ denote an n by m mixed-signal matrix, an n by n mixing matrix, and an n by m blind-source-signal matrix, respectively. $X$ and $S$ can be further expressed as

$$X = [\,x_1\ x_2\ \cdots\ x_n\,]^T \qquad (2)$$

$$S = [\,s_1\ s_2\ \cdots\ s_n\,]^T \qquad (3)$$

The mixed-signal vector $x_i$ and the source-signal vector $s_i$ can be respectively expressed in the following equations.

$$x_i = [\,x_i(1)\ x_i(2)\ \cdots\ x_i(m)\,]^T, \quad \text{for } i = 1, 2, 3, \ldots, n, \qquad (4)$$

$$s_i = [\,s_i(1)\ s_i(2)\ \cdots\ s_i(m)\,]^T, \quad \text{for } i = 1, 2, 3, \ldots, n, \qquad (5)$$

where $x_i(j)$ and $s_i(j)$ represent a mixed signal and a source signal at discrete time $j$, respectively. The ICA algorithm aims to find the matrix $S$ without knowing the matrix $A$, where the matrix $S$ is assumed to have no more than one Gaussian-distributed source-signal vector. By observing the matrix $X$, the inverse of the matrix $A$ can be estimated as the weight matrix $W^T$, and the blind source signals $S$ can be obtained by (6) [31].

$$S = W^T X, \qquad (6)$$
where

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}. \qquad (7)$$
The basic idea of ICA is to maximize the non-Gaussianity of $w^T X$, where $w$ denotes a column vector of $W$. Among the ICA algorithms, the FastICA algorithm proposed in [3, 13–16] has a fast convergence rate and good performance under low signal-to-noise ratio (SNR) conditions [8]. Therefore, the FastICA algorithm is chosen for the hardware implementation for EEG signal processing.
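To make the model in (1) and (6) concrete, the following NumPy sketch (an illustration of the equations, not of the paper's hardware) mixes uniformly distributed sources with a random $A$ and verifies that the ideal unmixing matrix $W^T = A^{-1}$ recovers $S$; ICA must instead estimate $W^T$ from $X$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 512                       # n channels, m samples (the paper uses up to 16 x 512)

S = rng.uniform(-1.0, 1.0, (n, m))  # uniformly distributed source-signal matrix, Eq. (3)
A = rng.normal(size=(n, n))         # random mixing matrix
X = A @ S                           # Eq. (1): observed mixed-signal matrix

# With a perfect estimate W^T = A^{-1}, Eq. (6) recovers the sources exactly.
W_T = np.linalg.inv(A)
S_hat = W_T @ X
print(np.allclose(S_hat, S))        # → True
```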
2.2 Preprocessing
In the FastICA algorithm, preprocessing is required to center and whiten the mixed signals. Through the preprocessing, the mixed signals become zero-mean and unit-variance. The centering process can be expressed as (8).

$$\bar{x}_i(j) = x_i(j) - E\{x_i\}, \quad \text{for } i = 1, 2, 3, \ldots, n, \qquad (8)$$

where $\bar{x}_i(j)$ and $E\{x_i\}$ denote the mixed-signal value with zero mean and the expected value of the random variable $x_i(j)$ of vector $x_i$ [31], respectively. Therefore, the centered matrix $\bar{X}$ can be expressed as (9).

$$\bar{X} = \begin{bmatrix} \bar{x}_1^T \\ \bar{x}_2^T \\ \vdots \\ \bar{x}_n^T \end{bmatrix} = \begin{bmatrix} \bar{x}_1(1) & \bar{x}_1(2) & \cdots & \bar{x}_1(m) \\ \bar{x}_2(1) & \bar{x}_2(2) & \cdots & \bar{x}_2(m) \\ \vdots & \vdots & \ddots & \vdots \\ \bar{x}_n(1) & \bar{x}_n(2) & \cdots & \bar{x}_n(m) \end{bmatrix} \qquad (9)$$
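As a quick numerical check of (8) and (9), the following NumPy sketch (illustrative only; the hardware realization is described in Section 4.3) centers each channel by subtracting its sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 512                                    # channels x samples
X = rng.normal(loc=3.0, scale=2.0, size=(n, m))  # raw mixed signals with nonzero mean

# Eq. (8): subtract each channel's sample mean E{x_i}, row by row
X_bar = X - X.mean(axis=1, keepdims=True)

print(np.allclose(X_bar.mean(axis=1), 0.0))      # every channel is now zero-mean
```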
For the whitening process, EVD can be used to obtain unit-variance signals. Equation (10) can be obtained through the EVD of the covariance matrix $C_x$ of $\tilde{x}$, where $\tilde{x}$ denotes the random column vector of $\bar{X}$.

$$C_x = E\{\tilde{x}\tilde{x}^T\} = EDE^T, \qquad (10)$$

where $E$ denotes the eigenvector matrix of $C_x$ and $D$ represents the eigenvalue matrix of $C_x$. $D$ can be further expressed as (11) [31].

$$D = \mathrm{diag}(d_1, d_2, \cdots, d_n), \qquad (11)$$

where $d_1, d_2, \ldots, d_n$ are the eigenvalues of $C_x$. The whitening process can then be written as (12) [31].

$$Z = D^{-1/2}E^T\bar{X} = P\bar{X} \qquad (12)$$

After the whitening process, the covariance matrix of $\tilde{z}$ becomes the identity matrix $I$ as shown in (13), where $\tilde{z}$ denotes the random column vector of $Z$.

$$E\{\tilde{z}\tilde{z}^T\} = P\,E\{\tilde{x}\tilde{x}^T\}\,P^T = D^{-1/2}E^T EDE^T E D^{-1/2} = I \qquad (13)$$

Therefore, unit covariance of the whitened signals can be guaranteed.
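Equations (10)-(13) can be exercised in a few lines of NumPy; this is a software illustration of the EVD based whitening (the approach this paper replaces), with `numpy.linalg.eigh` standing in for the cyclic Jacobi method:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 512
A = rng.normal(size=(n, n))
X_bar = A @ rng.uniform(-1, 1, (n, m))
X_bar -= X_bar.mean(axis=1, keepdims=True)  # centered mixed signals

# Eq. (10): EVD of the covariance matrix C_x = E{x x^T}
C_x = (X_bar @ X_bar.T) / m
d, E = np.linalg.eigh(C_x)                  # eigenvalues d_i, eigenvector matrix E

# Eq. (12): whitening matrix P = D^{-1/2} E^T and whitened signals Z = P X_bar
P = np.diag(d ** -0.5) @ E.T
Z = P @ X_bar

# Eq. (13): the covariance matrix of Z is the identity matrix
print(np.allclose((Z @ Z.T) / m, np.eye(n)))
```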
2.3 Fixed-Point Algorithm
The weight matrix is the key to finding the independent components. The non-Gaussianity needs to be measured well in order to find the independence between vectors. In [15], the negentropy approximation is used to estimate the non-Gaussianity. With this approximation, the FastICA algorithm can find the maximum of the non-Gaussianity by adjusting the weight matrix. For the training of the weight matrix, the fixed-point algorithm [15] is required. To estimate several independent components, either the symmetric orthogonalization or the deflationary orthogonalization can be utilized in the fixed-point algorithm [15]. With the symmetric orthogonalization, the independent components are estimated in parallel; with the deflationary orthogonalization, they are estimated one by one. In this work, we adopt the fixed-point algorithm with the symmetric orthogonalization such that the computation of the fixed-point algorithm can be parallelized in the hardware design. Assume $W = [\,w_1\ w_2\ \cdots\ w_n\,]$, where $w_i$, for $i = 1, 2, \ldots, n$, represents a column vector of $W$ [31]. The fixed-point algorithm with the symmetric orthogonalization can be described as follows.
Step 1 Set the initial values of $w_i$ with unit norm for $i = 1, 2, 3, \ldots, n$.

Step 2 Calculate the updating step [16],
$$w_i^+ = E\{\tilde{z}\,g(w_i^T\tilde{z})\} - E\{g'(w_i^T\tilde{z})\}\,w_i, \quad \text{for } i = 1, 2, 3, \ldots, n,$$
where $g$ is the derivative of the non-quadratic function $G(u) = \frac{1}{a}\log\cosh(au)$, and $a$ is a constant parameter.

Step 3 Calculate the Gram-Schmidt orthonormalization for $W$ using (14), (15) and (16).

$$w_1 = \frac{w_1^+}{\|w_1^+\|} \qquad (14)$$

$$w_{k+1}^{\#} = w_{k+1}^{+} - \sum_{i=1}^{k}\left[(w_{k+1}^{+})^T w_i\right]w_i, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (15)$$

$$w_{k+1} = \frac{w_{k+1}^{\#}}{\|w_{k+1}^{\#}\|}, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (16)$$

Step 4 Go back to Step 2 if not converged.
Note that the symmetric orthogonalization is based on EVD in [15]. In this work, we adopt the Gram-Schmidt orthonormalization for the symmetric orthogonalization based fixed-point algorithm [31] in Step 3. Finally, when convergence is achieved, the maximum of the non-Gaussianity of $W^T Z$ can be estimated.
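Steps 1-4 above can be sketched in software as follows. This is an illustrative NumPy version under our own naming (`gs_orthonormalize`, `fastica_fixed_point`) and our own convergence test, not the hardware scheduling described later in the paper:

```python
import numpy as np

def gs_orthonormalize(Wp):
    """Gram-Schmidt orthonormalization of the columns of Wp, Eqs. (14)-(16)."""
    n = Wp.shape[1]
    W = np.empty_like(Wp)
    W[:, 0] = Wp[:, 0] / np.linalg.norm(Wp[:, 0])            # Eq. (14)
    for k in range(1, n):
        w = Wp[:, k] - W[:, :k] @ (W[:, :k].T @ Wp[:, k])    # Eq. (15)
        W[:, k] = w / np.linalg.norm(w)                      # Eq. (16)
    return W

def fastica_fixed_point(Z, a=1.0, max_iter=100, tol=1e-6):
    """Symmetric fixed-point iteration on whitened data Z (n x m)."""
    n, m = Z.shape
    rng = np.random.default_rng(3)
    W = gs_orthonormalize(rng.normal(size=(n, n)))           # Step 1: unit-norm init
    for _ in range(max_iter):
        Y = W.T @ Z
        g = np.tanh(a * Y)                                   # g = G'(u), G = (1/a) log cosh(au)
        g_prime = a * (1.0 - g ** 2)
        # Step 2: w_i^+ = E{z g(w_i^T z)} - E{g'(w_i^T z)} w_i
        Wp = (Z @ g.T) / m - W * g_prime.mean(axis=1)
        W_new = gs_orthonormalize(Wp)                        # Step 3
        if np.max(1.0 - np.abs(np.sum(W_new * W, axis=0))) < tol:
            return W_new                                     # Step 4: converged
        W = W_new
    return W
```

Whatever the data, the returned matrix has orthonormal columns, since every iteration ends with the Gram-Schmidt step.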
3 Gram-Schmidt Orthonormalization Based Whitening for the FastICA Algorithm

As mentioned in the previous section, the EVD based whitening process is generally adopted in the FastICA algorithm. In [31], since the whitening process and the fixed-point algorithm apply CORDIC and multiplier as well as adder modules, respectively, a specially designed CORDIC-based EVD processor is needed for the EVD based whitening process. However, the hardware cost of a dedicated computing unit and the computational complexity of the EVD based whitening become large as the number of channels increases in [31]. The pioneering work in [29] can reduce the hardware cost by reusing the CORDIC module to perform both the EVD based whitening and the fixed-point algorithm of FastICA. In this work, the concept of the Gram-Schmidt based whitening process [12, 35, 36] is adopted for the FastICA algorithm. The Gram-Schmidt based whitening process has lower computational complexity than the conventional EVD based whitening does. In this section, we analyze and compare the computational complexity of the two whitening processes.
3.1 Computational Complexity Analysis
The Gram-Schmidt orthonormalization based whitening process for the FastICA algorithm is described as follows. First, the Gram-Schmidt orthonormalization of the zero-mean mixed-signal vectors is illustrated in (17), (18) and (19).

$$z_1^+ = \frac{\bar{x}_1}{\|\bar{x}_1\|} \qquad (17)$$

$$z_{k+1}^{\#} = \bar{x}_{k+1} - \sum_{i=1}^{k}\left(\bar{x}_{k+1}^T z_i^+\right)z_i^+, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (18)$$

$$z_{k+1}^{+} = \frac{z_{k+1}^{\#}}{\|z_{k+1}^{\#}\|}, \quad \text{for } k = 1, 2, 3, \ldots, n-1 \qquad (19)$$

Such a process gives an orthogonal matrix $Z^+ = [\,z_1^+\ z_2^+\ \cdots\ z_n^+\,]^T$, where the vectors $\{z_1^+, z_2^+, \cdots, z_n^+\}$ are mutually orthogonal. The covariance matrix of $\tilde{z}^+$ is then derived in (20).

$$E\{\tilde{z}^+(\tilde{z}^+)^T\} = \mathrm{diag}\!\left(\frac{(z_1^+)^T z_1^+}{m},\ \frac{(z_2^+)^T z_2^+}{m},\ \cdots,\ \frac{(z_n^+)^T z_n^+}{m}\right) = \frac{1}{m}I, \qquad (20)$$

where $\tilde{z}^+$ denotes the random column vector of $Z^+$. Since unit variance is required for the whitening process, it is necessary to scale the orthogonal matrix $Z^+$ by a factor $m^{1/2}$ as follows.

$$z_{k+1} = m^{1/2}z_{k+1}^{+}, \quad \text{for } k = 0, 1, 2, \ldots, n-1 \qquad (21)$$

Therefore, the covariance matrix of $\tilde{z}$ equals the identity matrix $I$.
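The whitening chain (17)-(19) plus the scaling (21) can be sketched as below; the function name is our own, and the final check confirms the unit-covariance property without any EVD:

```python
import numpy as np

def gram_schmidt_whiten(X_bar):
    """Gram-Schmidt orthonormalization based whitening, Eqs. (17)-(19) and (21).

    X_bar is n x m with zero-mean rows (one row per channel)."""
    n, m = X_bar.shape
    Zp = np.empty_like(X_bar)
    Zp[0] = X_bar[0] / np.linalg.norm(X_bar[0])          # Eq. (17)
    for k in range(1, n):
        z = X_bar[k] - Zp[:k].T @ (Zp[:k] @ X_bar[k])    # Eq. (18)
        Zp[k] = z / np.linalg.norm(z)                    # Eq. (19)
    return np.sqrt(m) * Zp                               # Eq. (21): scale by m^{1/2}

rng = np.random.default_rng(5)
n, m = 4, 512
X_bar = rng.normal(size=(n, n)) @ rng.uniform(-1, 1, (n, m))
X_bar -= X_bar.mean(axis=1, keepdims=True)

Z = gram_schmidt_whiten(X_bar)
print(np.allclose((Z @ Z.T) / m, np.eye(n)))             # unit covariance, no EVD needed
```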
The computational complexity is defined as the sum of multiplications, divisions, and square roots. Tables 1 and 2 summarize the computational complexity of the whitening process based on EVD and Gram-Schmidt orthonormalization, respectively, where mul, div, and sqrt denote the corresponding multiplication, division, and square root. Since the detailed implementation of the EVD function call in MATLAB is a black box to us, EVD is assumed to be performed by the cyclic Jacobi method [39] for the computational complexity in Table 1, where $J$ denotes an n by n Jacobi rotation matrix [39]. Note that the shift and add operations are excluded in Tables 1 and 2 since they contribute less complexity compared to the multiplication, division, and square root operations.

In Table 1, since the covariance matrix $C_x$ is an n by n symmetric matrix, only $n(n+1)/2$ elements are required for the calculation of $C_x$. Therefore, the computational complexity for the calculation of $C_x$ is $m[n(n+1)/2]$ due to $m$ multiplications for each element calculation of $C_x$. The product of an n by n matrix and a Jacobi rotation matrix costs $4n$ multiplications since only $2n$ elements need to be calculated, where each element needs two multiplications. Thus, $4n[n(n-1)/2 - 1]$ multiplications are required for the matrix multiplication of the $n(n-1)/2$ $J$ matrices to obtain the eigenvector matrix $E$. In Table 2, $(\bar{x}_{k+1}^T z_i^+)z_i^+$ needs to be calculated for $i = 1, 2, 3, \ldots, k$ and $k = 1, 2, 3, \ldots, n-1$. Since the calculation of $(\bar{x}_{k+1}^T z_i^+)z_i^+$ needs $2m$ multiplications, the computational complexity for the calculation of $z_{k+1}^{\#}$ can be obtained as $2m\sum_{k=1}^{n-1}k$.

The bar charts in terms of the number of multiplications and the sum of divisions and square roots versus the number of channels with $m = 512$ are shown in Fig. 1a and b, respectively. As can be seen in Fig. 1a and b, the computational complexity is mainly dominated by the multiplications for both the EVD and the Gram-Schmidt orthonormalization based whitening processes.
As a result, the Gram-Schmidt orthonormalization based whitening process requires fewer computations than the EVD based whitening process does, especially when the channel number increases. Obviously, the computations for the eigenvalue matrix $D$ and eigenvector matrix $E$ are not required for the Gram-Schmidt based whitening. Therefore, EVD is not necessary in the Gram-Schmidt based whitening process, such that the CORDIC module for EVD processing is not required in the proposed hardware architecture. Instead, from the analysis, the Gram-Schmidt based whitening process can be realized by general computing modules such as addition, multiplication, and inverse-square-root modules. Since these computing modules can be shared with the fixed-point algorithm of FastICA, the hardware cost is expected to be saved.
3.2 Influence of the Whitening Process on the FastICA Algorithm
Besides the computational complexity, the two whitening approaches mentioned in the previous subsection need to be explored to see their influence on the separation quality of the FastICA algorithm. Therefore, the MATLAB coding resource [40] of the FastICA algorithm is modified/recoded and then simulated with the EVD based and the Gram-Schmidt orthonormalization based whitening processes, respectively. A random mixing matrix $A$ and a uniformly distributed source-signal matrix $S$ are used to generate the mixed-signal matrix $X$. The mixed-signal matrix $X$ is then processed with the FastICA algorithm using the above two whitening approaches. The estimated source-signal vectors are then compared with the source-signal vectors using the performance index defined in (22).

$$\text{Performance Index} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{abs\_corr\_coef}(\hat{s}_i, s_i) \qquad (22)$$

where $\hat{s}_i$ and $\mathrm{abs\_corr\_coef}(\hat{s}_i, s_i)$ denote the i-th estimated source-signal vector computed by the FastICA algorithm and the absolute correlation coefficient between $\hat{s}_i$ and $s_i$, respectively. The absolute correlation coefficient [41] is expressed in (23) using the notations in this paper.

$$\mathrm{abs\_corr\_coef}(\hat{s}_i, s_i) = \left|\frac{E\{(\hat{s}_i - E\{\hat{s}_i\})(s_i - E\{s_i\})\}}{\sigma_{\hat{s}_i}\,\sigma_{s_i}}\right| \qquad (23)$$

where $\sigma$ denotes the standard deviation. Note that the order of the estimated source-signal vectors is sorted by associating each estimated source-signal vector with the source-signal vector having the maximum absolute correlation coefficient in the simulation.
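The quality metric (22)-(23) is easy to reproduce in software. The sketch below uses our own function names and omits the channel-matching step described above, pairing channels directly; a perfect, sign- and scale-ambiguous estimate scores 1.0:

```python
import numpy as np

def abs_corr_coef(s_hat, s):
    """Eq. (23): absolute correlation coefficient between two signal vectors."""
    cov = ((s_hat - s_hat.mean()) * (s - s.mean())).mean()
    return abs(cov / (s_hat.std() * s.std()))

def performance_index(S_hat, S):
    """Eq. (22): mean absolute correlation over (already matched) channel pairs."""
    n = S.shape[0]
    return sum(abs_corr_coef(S_hat[i], S[i]) for i in range(n)) / n

rng = np.random.default_rng(6)
S = rng.uniform(-1, 1, (4, 512))
S_hat = -2.0 * S                 # ICA recovers sources only up to sign and scale
print(round(performance_index(S_hat, S), 6))   # → 1.0
```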
Figure 2 shows the average performance index and the average iteration number versus the number of channels, where the sample number is set to 512. The average performance index and the average iteration number are obtained over 500 test rounds. The maximum iteration number is set to 511 in the simulation. As can be seen, the performance indices and iteration numbers with the EVD based and the Gram-Schmidt orthonormalization based whitening processes are close. That
Table 1 Computational complexity of the EVD based whitening process.

| Procedure | Computational complexity |
|---|---|
| Calculate covariance matrix $C_x = E\{\tilde{x}\tilde{x}^T\}$ | $\{m[n(n+1)/2]\}$ muls |
| Calculate eigenvector matrix $E = J_{12}J_{13}\cdots J_{1n}J_{23}\cdots J_{n-1,n}$ | $\{4n[n(n-1)/2 - 1]\}$ muls |
| Calculate eigenvalue matrix $D = E^T C_x E$ | $(2n^3)$ muls |
| Calculate whitening matrix $P = D^{-1/2}E^T$ | $(n^2)$ muls + $(n)$ divs + $(n)$ sqrts |
| Calculate whitened signal $Z = P\bar{X}$ | $(mn^2)$ muls |
| Total | $\left[4n^3 + \left(\frac{3m}{2} - 1\right)n^2 + \left(\frac{m}{2} - 4\right)n\right]$ muls + $(n)$ divs + $(n)$ sqrts |
Table 2 Computational complexity of the Gram-Schmidt orthonormalization based whitening process.

| Procedure | Computational complexity |
|---|---|
| Calculate $z_1^+ = \bar{x}_1/\|\bar{x}_1\|$ | $(m)$ muls + $(m)$ divs + $(1)$ sqrt |
| Calculate $z_{k+1}^{\#} = \bar{x}_{k+1} - \sum_{i=1}^{k}(\bar{x}_{k+1}^T z_i^+)z_i^+$ for $k = 1, 2, 3, \ldots, n-1$ | $\left(2m\sum_{k=1}^{n-1}k\right)$ muls |
| Calculate $z_{k+1}^{+} = z_{k+1}^{\#}/\|z_{k+1}^{\#}\|$ for $k = 1, 2, 3, \ldots, n-1$ | $[m(n-1)]$ muls + $[m(n-1)]$ divs + $(n-1)$ sqrts |
| Calculate $z_{k+1} = m^{1/2}z_{k+1}^{+}$ for $k = 0, 1, 2, \ldots, n-1$ | $(mn)$ muls |
| Total | $(mn^2 + mn)$ muls + $(mn)$ divs + $(n)$ sqrts |
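As a quick sanity check of the Total rows, the snippet below evaluates both closed-form multiplication counts for m = 512 (the helper names are ours; the formulas are copied from Tables 1 and 2):

```python
def evd_whitening_muls(n, m):
    """Table 1 total: multiplications of the EVD based whitening (cyclic Jacobi)."""
    return 4 * n**3 + (3 * m // 2 - 1) * n**2 + (m // 2 - 4) * n

def gs_whitening_muls(n, m):
    """Table 2 total: multiplications of the Gram-Schmidt based whitening."""
    return m * n**2 + m * n

m = 512
for n in (4, 8, 16):
    print(n, evd_whitening_muls(n, m), gs_whitening_muls(n, m))
```

The Gram-Schmidt count stays below the EVD count at every channel number, and the gap widens with n, matching the trend in Fig. 1.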
means the Gram-Schmidt orthonormalization based whitening process can be an alternative to the EVD based whitening process with similar separation quality in MATLAB simulation.

4 Cost-Effective and Variable-Channel FastICA Hardware Architecture and Implementation

In this section, the cost-effective floating-point FastICA hardware architecture and implementation for the FastICA processing, re-reference, synchronized average, and moving average functions are addressed. The reasonable sample size for the hardware implementation of the FastICA function is explored first. Then, the hardware architecture, hardware operations, and implementation of these four functions are revealed and illustrated.
4.1 Sample Size Analysis
Since FastICA is a statistical algorithm, increasing the sample size improves the separation quality. However, the memory size becomes larger once the sample size increases. In addition, a larger sample size leads to larger computational complexity according to the discussion in Section 3. For the above reasons, the reasonable sample size for hardware implementation needs to be determined. Therefore, we simulate the average performance index versus the number of channels with the sample sizes of 256, 512, and 1024 in Fig. 3, where the simulation is performed by the double-precision floating-point (i.e., the most accurate precision) FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process. The details of $A$ and $X$ in this section are the same as those of the simulation described in Section 3.2. From the simulation result, using 1024 samples can reach the best
[Figure 1 a Number of multiplications and b sum of divisions and square roots versus number of channels with the EVD and Gram-Schmidt orthonormalization based whitening processes.]
performance among the three different sample sizes, but the memory cost becomes a penalty. On the other hand, although using 256 samples attains the lowest memory cost, its performance index is the worst among the three sample sizes. In summary, 512 is a tradeoff sample size for hardware implementation. To evaluate the effects of IEEE-754 single-precision floating-point and fixed-point arithmetic, the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process is performed with 32-bit floating-point and 32-bit fixed-point simulations, respectively. An 11-bit fraction part is adopted in the fixed-point simulation for the random data and the three datasets in Section 5. Figure 3 shows the performance index versus the number of channels with floating-point and fixed-point simulations. It can be seen that the performance indices with the 512 sample size manipulated by double-precision and single-precision floating point are almost the same. For the targeted variable channel number with a maximum of 16, the performance index with the fixed-point simulation is lower than that of the floating-point simulation. It should be noted that the channel number should not be too high while considering an acceptable performance index. As a result, the maximum channel number, the sample size, and the arithmetic format in our hardware implementation are set to 16, 512, and the single-precision floating-point arithmetic format, respectively, according to the simulation results, where the details will be discussed in the next subsection.
4.2 Hardware Architecture
Figure 4 shows the system diagram of the proposed FastICA hardware architecture realizing the FastICA, re-reference, synchronized average, and moving average functions. The proposed FastICA hardware architecture can perform different operations according to the instruction input, where the available instruction set is listed in Table 3 and is briefly described as follows. The LOAD instruction loads either fixed-point or floating-point data into the memory. The OUTPUT instruction outputs the processed data from the memory. The FASTICA instruction operates the FastICA algorithm. The REREF instruction produces the re-reference signal. The SYNAVG instruction offers the synchronized average signal. Finally, the MOVAVG instruction makes the proposed architecture act as a moving average filter for the input signal. In this work, the names REREF and MOVAVG are the same as those in [42, 43] and MATLAB, respectively.
In the proposed hardware architecture, the IEEE-754 single-precision floating-point format in (24), where the notations are the same as those of [31], is adopted in the hardware design in order to preserve the computation accuracy.

$$L_{float} = (-1)^{s_L} \times 1.f_L \times 2^{e_L - 127}, \qquad (24)$$

where $L_{float}$, $s_L$, $e_L$ and $f_L$ represent the value of the IEEE-754 single-precision floating-point number, the sign bit, the 8-bit exponent, and the 23-bit fraction part of the mantissa, respectively. Since the input EEG signals will be in fixed-point format if they are captured by the scalp sensors and quantized by the analog-to-digital converter (ADC), a fixed-to-floating converter is utilized in this work to convert the fixed-point format into the IEEE-754 single-precision floating-point format. Afterwards, the input signals are stored in the data memory.
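For reference, (24) can be checked against a host machine's native single-precision encoding; the decoder below handles only the normalized case (no zeros, subnormals, infinities, or NaNs):

```python
import struct

def decode_ieee754(bits):
    """Decode a 32-bit word per Eq. (24): (-1)^s_L * 1.f_L * 2^(e_L - 127)."""
    s = (bits >> 31) & 0x1      # sign bit s_L
    e = (bits >> 23) & 0xFF     # 8-bit exponent e_L
    f = bits & 0x7FFFFF         # 23-bit fraction f_L
    return (-1.0) ** s * (1.0 + f / 2**23) * 2.0 ** (e - 127)

# Round-trip against the host's own single-precision encoding of -6.25.
bits = struct.unpack('>I', struct.pack('>f', -6.25))[0]
print(decode_ieee754(bits))     # → -6.25
```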
Herein, the processing units (PUs), PU1 and PU2, are utilized to process the data stored in the memory. The simplified architecture of the PUs is illustrated in Fig. 5. Each PU contains a multiply-and-add (MAA) unit, a temporary memory, and a division-by-2^λ operation unit, where the name multiply-and-add (MAA) is the same as that in [44] and λ is a positive integer. In this work,
[Figure 2 Average performance index and average iteration number versus number of channels with the whitening processes based on EVD and Gram-Schmidt orthonormalization.]
[Figure 3 Performance index versus number of channels with different sample numbers: 1024, 512, and 256 samples (double-precision floating-point), 512 samples (single-precision floating-point), and 512 samples (fixed-point).]
PU1 and PU2 are reused in the centering operation of preprocessing and the updating step [16] (i.e., Step 2 of the fixed-point algorithm in Section 2.3) of the fixed-point algorithm. Through parallel manipulation of PU1 and PU2, the centering operation and the updating step can be accelerated. It is noted that, since the Gram-Schmidt orthonormalization processes the vectors sequentially, only PU1 is reused for the Gram-Schmidt orthonormalization in the preprocessing and fixed-point algorithm of the FastICA algorithm. Thus, PU1 adopts the inverse-square-root unit based on the design of [45] for the Gram-Schmidt orthonormalization but PU2 does not. The temporary memory is used to store the computation result such that the result can be reused in a subsequent computation or transferred to the data memory. Since the IEEE-754 single-precision floating-point format is utilized in the hardware design, the division-by-2^λ operation can be realized by subtracting λ from the exponent, as used for several functions mentioned in Sections 4.3 and 4.5. For the FastICA function, PU1 and PU2 are reused to perform the centering operation of preprocessing and the updating step of the fixed-point algorithm, and PU1 is reused for the computations of the Gram-Schmidt orthonormalization. Therefore, a dedicated EVD calculation unit is not required in the proposed hardware architecture. The other three function computations are also realized by the two reused PUs. Thus, a cost-effective hardware architecture can be attained. The hardware operations of the FastICA and other functions with the proposed PUs will be described in detail in the following subsections.
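The exponent-subtraction trick can be sketched in software as follows; this is an illustrative model of the division-by-2^λ unit (normalized operands only, no exponent-underflow handling), not the RTL itself:

```python
import struct

def divide_by_pow2(x, lam):
    """Divide a single-precision value by 2^lam by subtracting lam from the
    8-bit exponent field of its IEEE-754 encoding (normalized inputs only)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    exp = (bits >> 23) & 0xFF
    bits = (bits & ~(0xFF << 23)) | (((exp - lam) & 0xFF) << 23)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# lam = 9 divides by 512, the setting used for the 512-sample mean in Section 4.3.
print(divide_by_pow2(1024.0, 9))   # → 2.0
```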
The 32-bit instruction format of the proposed hardware architecture is described in Table 4. The first 3 bits of the instruction represent the OP code for different operations. The remaining bits of the instruction are used to specify the user-defined parameters. The ranges of the OP code and the parameters are described in Table 5. Notably, the third-parameter format of the FASTICA instruction is defined as the first 15 bits of the IEEE-754 single-precision floating-point format. In order to simplify the hardware design, the first parameter can only be set to 2, 4, 8, or 16 for the SYNAVG and MOVAVG instructions. Also, the second parameter of LOAD and OUTPUT can only be set to either 511 or 0 since it is used to specify the input data format or the output data type. With these functions and the user-defined parameters, the flexibility of the proposed hardware architecture is attained.

For the chip implementation of the proposed FastICA hardware architecture, the cell-based design flow with the standard cell library in the TSMC 90 nm 1P9M CMOS process is adopted. The Artisan Memory Compiler is used to generate the memory modules. Synopsys Design Compiler is employed to synthesize the proposed design with a timing constraint of 10 ns. Finally, Cadence SoC Encounter is used for the placement and routing procedure. Figure 6 shows the layout of the proposed FastICA hardware implementation, where the layout area occupies 1.43 mm².
[Figure 4 System diagram of the proposed FastICA hardware architecture: instruction input, controller, fixed-to-floating converter, data memory, Processing Unit 1, Processing Unit 2, and output MUX.]
Table 3 Instruction set of the proposed FastICA hardware architecture.

| Instruction | Parameters | Operation |
|---|---|---|
| LOAD | Number of channels, input data type (fixed/floating) | Load data into the memory |
| OUTPUT | Number of channels, content type (weight/signal) | Output data from the memory |
| FASTICA | Number of channels, number of maximum iterations, threshold value | Perform FastICA algorithm |
| REREF | Number of channels, index of the baseline channel | Remove baseline |
| SYNAVG | Number of trials | Average all trials |
| MOVAVG | Number of samples on average, index of the target channel | Moving average filter |
4.3 Hardware Operations of the Preprocessing with Gram-Schmidt Orthonormalization Based Whitening Process of FastICA
To illustrate the hardware operations of the FastICA function using the PUs in Fig. 5, the computations of the preprocessing are described in this subsection. For the centering process, (8) can be recast as (25).

$$\bar{x}_i(j) = x_i(j) - \frac{1}{512}\sum_{l=1}^{512}x_i(l), \qquad (25)$$
where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, 512$. The computation of (25) is described as follows. First, $x_i(l)$ are read from the data memory and used to calculate the result of $\left[\sum_{l=1}^{512}x_i(l)\right]/512$ using the MAA unit and the division-by-2^λ operation unit, where λ is set to 9. Next, the result of $x_i(j)$ minus $\left[\sum_{l=1}^{512}x_i(l)\right]/512$ is calculated on the MAA unit and saved to the data memory. The centering process can be performed either on the MAA unit of PU1 or the MAA unit of PU2. Therefore, PU1 and PU2 concurrently perform the computations of the centering process for different channels. For example, if $n = 16$, $\bar{x}_1(j)$ and $\bar{x}_2(j)$ are calculated first with $i = 1$ and $i = 2$ in PU1 and PU2, respectively, for $j = 1, 2, \ldots, 512$. Next, $\{\bar{x}_3(j), \bar{x}_4(j)\}$, $\{\bar{x}_5(j), \bar{x}_6(j)\}$, $\ldots$, $\{\bar{x}_{15}(j), \bar{x}_{16}(j)\}$ are calculated with $i = \{3, 4\}$, $i = \{5, 6\}$, $\ldots$, $i = \{15, 16\}$ in PU1 and PU2 for $j = 1, 2, 3, \ldots, 512$, respectively.

After the centering process, (17), (18), (19) and (21) are used in the Gram-Schmidt orthonormalization based whitening process and can be recast as (26), (27) and (28), respectively.
\mathbf{z}_{k+1}^{\#} =
\begin{cases}
\mathbf{x}_{k+1}, & \text{for } k = 0 \\
\mathbf{x}_{k+1} - \left(\mathbf{x}_{k+1}^{T}\mathbf{z}_{1}^{+}\right)\mathbf{z}_{1}^{+} - \cdots - \left(\mathbf{x}_{k+1}^{T}\mathbf{z}_{k}^{+}\right)\mathbf{z}_{k}^{+}, & \text{for } k = 1, 2, \ldots, n-1
\end{cases} \qquad (26)
Figure 5 Simplified architecture of the processing units (PUs) and the corresponding controller.
Table 4 Instruction format of the proposed FastICA hardware architecture.

OP code | First parameter | Second parameter | Third parameter
3 bits | 5 bits | 9 bits | 15 bits
Table 5 Range of the OP code and parameters of the instructions.

Instruction | OP code | First parameter | Second parameter | Third parameter
LOAD | 0 (000) | 1~16 | 511/0 (fixed/floating) | –
OUTPUT | 1 (001) | 1~16 | 511/0 (weight/signal) | –
FASTICA | 2 (010) | 2~16 | 1~511 | 1~20
REREF | 3 (011) | 1~16 | 0~15 | –
SYNAVG | 4 (100) | 2/4/8/16 | – | –
MOVAVG | 5 (101) | 2/4/8/16 | 0~15 | –
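As an illustrative sketch of how one of these instructions could be assembled from Tables 4 and 5, the snippet below packs the 3-, 5-, 9- and 15-bit fields into a single 32-bit word. The paper specifies only the field widths, so the MSB-first field ordering and the helper name are assumptions.

```python
def encode_instruction(opcode, p1, p2=0, p3=0):
    """Pack an instruction into a 32-bit word per Table 4: a 3-bit OP code
    followed by 5-, 9- and 15-bit parameter fields (assumed MSB-first)."""
    assert opcode < 2**3 and p1 < 2**5 and p2 < 2**9 and p3 < 2**15
    return (opcode << 29) | (p1 << 24) | (p2 << 15) | p3

# FASTICA (OP code 2) on 16 channels with parameter values 511 and 20
word = encode_instruction(2, 16, 511, 20)
```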
Figure 6 Layout of the proposed FastICA hardware implementation (processing units, data memories, and temporary memories).
\mathbf{z}_{k+1}^{+} = \frac{\mathbf{z}_{k+1}^{\#}}{\left\lVert \mathbf{z}_{k+1}^{\#} \right\rVert} = \mathbf{z}_{k+1}^{\#}\left(1 / \left\lVert \mathbf{z}_{k+1}^{\#} \right\rVert\right) = \mathbf{z}_{k+1}^{\#}\left[\left(\mathbf{z}_{k+1}^{\#}\right)^{T}\mathbf{z}_{k+1}^{\#}\right]^{-1/2} \qquad (27)

\mathbf{z}_{k+1} = 512^{1/2}\,\mathbf{z}_{k+1}^{+} \qquad (28)
According to (26), (27) and (28), the computations of the Gram-Schmidt orthonormalization based whitening process are described as follows.
Step 1 Let i = 1 and initially set z#_{k+1} = x_{k+1} at the temporary memory.
Step 2 Go to Step 6 if k = 0. Otherwise, go to Step 3.
Step 3 Calculate x^T_{k+1} z^+_i on the MAA unit.
Step 4 Renew z#_{k+1} with the calculation result of z#_{k+1} − (x^T_{k+1} z^+_i) z^+_i using the MAA unit and increase i by 1.
Step 5 Repeat Step 3-Step 4 until i > k.
Step 6 Calculate [(z#_{k+1})^T z#_{k+1}]^{−1/2} on the MAA unit and the inverse-square-root unit.
Step 7 Calculate z^+_{k+1} on the MAA unit.
Step 8 Repeat Step 1-Step 7 until n vectors {z^+_1, z^+_2, ⋯, z^+_n} of Z^+ are obtained.
Step 9 Calculate z_{k+1} on the MAA unit. Repeat this step until n vectors {z_1, z_2, ⋯, z_n} of Z are obtained.
Note that the calculations of Step 1-Step 9 are only performed on the MAA unit of PU1. In Step 1, x_{k+1} is read from the data memory. In Step 3, z^+_i is read from the data memory to compute the dot product of x_{k+1} and z^+_i on the MAA unit. In Step 4, the result of (x^T_{k+1} z^+_i) z^+_i is calculated and subtracted from z#_{k+1} on the MAA unit. In Step 5, Step 3-Step 4 are repeated while i ≤ k such that z#_{k+1} = x_{k+1} − ∑_{i=1}^{k} (x^T_{k+1} z^+_i) z^+_i is obtained. In Step 6, the result of (z#_{k+1})^T z#_{k+1} is calculated on the MAA unit and then processed with the inverse-square-root operation unit to obtain [(z#_{k+1})^T z#_{k+1}]^{−1/2}. In Step 7, z^+_{k+1} is calculated by multiplying z#_{k+1} and [(z#_{k+1})^T z#_{k+1}]^{−1/2} on the MAA unit. In Step 8, Step 1-Step 7 are repeated to sequentially perform the calculations of the n vectors {z^+_1, z^+_2, ⋯, z^+_n} of Z^+ with k = 0, 1, 2, …, n−1; when Step 1-Step 7 are done with k = n−1, the calculation of Z^+ is completed. In Step 9, z_{k+1} is obtained by multiplying 512^{1/2} and z^+_{k+1} on the MAA unit. Step 9 is repeated until the n vectors of Z are obtained, and the result of Z is saved to the data memory. The proposed FastICA hardware architecture supports variable-channel processing, where the number of channels, n, is confined to 2-to-16 and defined by the user. In Step 8, the controller controls PU1 to sequentially calculate {z^+_1, z^+_2, z^+_3, ⋯, z^+_n}; when the controller detects that the last calculation of z^+_n is completed, it controls PU1 to execute the operation of Step 9. By setting a different number of channels, n, the hardware operations of the Gram-Schmidt orthonormalization can be realized on the same PU1. For example, if n = 16, z^+_1 is calculated first using Steps 1, 2, 6 and 7 with k = 0. Next, z^+_2 is calculated through Step 1-Step 7 with k = 1. Afterward, {z^+_3, z^+_4, ⋯, z^+_16} are calculated sequentially with k = 2, 3, …, 15, respectively. Once {z^+_1, z^+_2, ⋯, z^+_16} are obtained, the vectors {z_1, z_2, ⋯, z_16} are calculated through Step 9. Similarly, if n = 4, z^+_1 is calculated first and then {z^+_2, z^+_3, z^+_4} are calculated sequentially.
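Under the assumption of row-vector channels that have already been centered per (25), Step 1-Step 9 can be modeled in software as the sketch below. The function name is ours, m stands in for the fixed sample size of 512, and ordinary floating-point arithmetic replaces the MAA and inverse-square-root units.

```python
import numpy as np

def gram_schmidt_whiten(x):
    """Functional model of the Gram-Schmidt orthonormalization based
    whitening of (26)-(28). x: (n, m) centered data, one row per channel.
    Returns Z with Z @ Z.T / m = I (the whiteness condition)."""
    n, m = x.shape
    z_plus = np.zeros_like(x, dtype=float)
    for k in range(n):
        z_sharp = x[k].astype(float).copy()      # Step 1: z#_{k+1} = x_{k+1}
        for i in range(k):                       # Steps 3-5: remove projections
            z_sharp -= (x[k] @ z_plus[i]) * z_plus[i]
        # Steps 6-7: scale by the inverse square root of the energy
        z_plus[k] = z_sharp * (z_sharp @ z_sharp) ** -0.5
    return np.sqrt(m) * z_plus                   # Step 9: z_{k+1} = m^{1/2} z^+_{k+1}
```

The m^{1/2} scaling of Step 9 is what turns the orthonormal rows of Z^+ into whitened rows whose sample covariance is the identity.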
4.4 Hardware Operations of the Fixed-Point Algorithm of FastICA
To implement the fixed-point algorithm using the two reused PUs, the equations can be recast as (29), (30) and (31).
\mathbf{w}_{i}^{+} = E\left\{\tilde{\mathbf{z}}\, g\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}\right)\right\} - E\left\{g'\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}\right)\right\}\mathbf{w}_{i}
= \frac{\sum_{j=1}^{512} \tilde{\mathbf{z}}_{j}\, g\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}_{j}\right) - \sum_{j=1}^{512} g'\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}_{j}\right)\mathbf{w}_{i}}{512}
\cong \frac{\sum_{j=1}^{512} \tilde{\mathbf{z}}_{j}\left[\alpha\!\left(\mathbf{w}_{i}^{T}\tilde{\mathbf{z}}_{j}\right) + \beta\right] - \sum_{j=1}^{512} \{\alpha\}\,\mathbf{w}_{i}}{512} \qquad (29)

where z̃_j denotes the j-th column vector of Z, and α and β denote two parameters.

\mathbf{w}_{k+1}^{\#} =
\begin{cases}
\mathbf{w}_{k+1}^{+}, & \text{for } k = 0 \\
\mathbf{w}_{k+1}^{+} - \left[\left(\mathbf{w}_{k+1}^{+}\right)^{T}\mathbf{w}_{1}\right]\mathbf{w}_{1} - \cdots - \left[\left(\mathbf{w}_{k+1}^{+}\right)^{T}\mathbf{w}_{k}\right]\mathbf{w}_{k}, & \text{for } k = 1, 2, \ldots, n-1
\end{cases} \qquad (30)

\mathbf{w}_{k+1} = \frac{\mathbf{w}_{k+1}^{\#}}{\left\lVert \mathbf{w}_{k+1}^{\#} \right\rVert} = \mathbf{w}_{k+1}^{\#}\left(1 / \left\lVert \mathbf{w}_{k+1}^{\#} \right\rVert\right) = \mathbf{w}_{k+1}^{\#}\left[\left(\mathbf{w}_{k+1}^{\#}\right)^{T}\mathbf{w}_{k+1}^{\#}\right]^{-1/2} \qquad (31)
The two non-linear functions g(w^T_i z̃_j) and g'(w^T_i z̃_j) in (29) are realized as α(w^T_i z̃_j) + β and α, respectively, by the piecewise linear approximation method. The values of α and β are obtained by table look-up according to the value of w^T_i z̃_j. Further discussions of the piecewise linear approximation method can be found in [38]. According to (29), (30) and (31), the computations of the fixed-point algorithm are described as follows.
Step 1 Let j = 1 at the beginning.
Step 2 Calculate w^T_i z̃_j on the MAA unit.
Step 3 Calculate α(w^T_i z̃_j) + β on the MAA unit.
Step 4 Calculate z̃_j [α(w^T_i z̃_j) + β] on the MAA unit.
Step 5 Accumulate the results of z̃_j [α(w^T_i z̃_j) + β] and α at the temporary memory using the MAA unit, respectively, and increase j by 1.
Step 6 Repeat Step 2-Step 5 until j > 512.
Step 7 Calculate w^+_i on the MAA unit.
Step 8 Repeat Step 1-Step 7 until n vectors {w^+_1, w^+_2, ⋯, w^+_n} are obtained.
Step 9 Calculate the Gram-Schmidt orthonormalization for W.
Step 10 Repeat Step 1-Step 9 until W converges.
Note that the calculations of Step 1-Step 7 can be performed on the MAA unit of either PU1 or PU2. In other words, two w^+_i vectors can be concurrently calculated on the MAA units of PU1 and PU2, respectively. In Step 2, w_i and z̃_j are read from the data memory to calculate their dot product on the MAA unit. In Step 3, the result of α(w^T_i z̃_j) + β is calculated on the MAA unit. In Step 4, z̃_j is read from the data memory and multiplied by α(w^T_i z̃_j) + β on the MAA unit. In Step 5, the results of z̃_j[α(w^T_i z̃_j) + β] and α are separately accumulated at the temporary memory using the MAA unit. In Step 6, Step 2-Step 5 are repeated while j ≤ 512 to obtain ∑_{j=1}^{512} z̃_j[α(w^T_i z̃_j) + β] and ∑_{j=1}^{512}{α}. In Step 7, the result of ∑_{j=1}^{512} z̃_j[α(w^T_i z̃_j) + β] minus ∑_{j=1}^{512}{α} w_i is calculated on the MAA unit. The division-by-512 operation is not necessary in Step 7 since it does not affect the normalization result. In Step 8, Step 1-Step 7 are repeated to obtain the n vectors {w^+_1, w^+_2, ⋯, w^+_n}. According to the user-defined number of channels, n, the calculations of the w^+_i vectors are repeated on both PU1 and PU2 to obtain all n vectors with i = 1, 2, 3, …, n. For example, if n = 16, w^+_1 and w^+_2 are calculated first with i = 1 and i = 2 in PU1 and PU2, respectively, through Step 1-Step 7. Next, {w^+_3, w^+_4}, {w^+_5, w^+_6}, …, {w^+_15, w^+_16} are calculated with i = {3, 4}, i = {5, 6}, … and i = {15, 16} in PU1 and PU2, respectively. In Step 9, the Gram-Schmidt orthonormalization is calculated for W; this calculation is similar to Step 1-Step 8 of the whitening process in Section 4.3. In Step 10, the sum of the absolute dot products of the old and new weight vectors of W is calculated on the MAA unit to determine whether W has converged.
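One sweep of Step 1-Step 10 can be sketched in software as below. The piecewise-linear helper pwl_g uses made-up breakpoints purely for illustration (the actual segments come from the look-up table of [38], which is not reproduced here), and the division by 512 is omitted as in Step 7.

```python
import numpy as np

def pwl_g(u):
    """Hypothetical piecewise-linear stand-in for the nonlinearity g and
    its derivative: returns (alpha*u + beta, alpha) per sample. The slopes
    and offsets are illustrative guesses, not the table of [38]."""
    alpha = np.where(np.abs(u) < 1.0, 1.0, 0.1)                 # assumed slopes
    beta = np.where(u > 1.0, 0.9, np.where(u < -1.0, -0.9, 0.0))
    return alpha * u + beta, alpha

def fastica_update(W, Z):
    """One sweep of the fixed-point algorithm, (29)-(31): update every row
    w_i from the old W, then re-orthonormalize by Gram-Schmidt."""
    n, m = Z.shape
    W_new = np.empty_like(W)
    for i in range(n):
        u = W[i] @ Z                                  # w_i^T z_j for all j
        g, g_prime = pwl_g(u)
        W_new[i] = Z @ g - g_prime.sum() * W[i]       # (29), /512 omitted
    for k in range(n):                                # (30) and (31)
        for i in range(k):
            W_new[k] -= (W_new[k] @ W_new[i]) * W_new[i]
        W_new[k] *= (W_new[k] @ W_new[k]) ** -0.5
    return W_new
```

The convergence test of Step 10 would then compare the sum of the absolute dot products between the rows of the old and new W against the user-defined threshold.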
4.5 Implementation of the Re-reference, Synchronized Average and Moving Average Functions
In terms of the re-reference, synchronized average and moving average processing, the re-reference signal y_{r,i}(j), the synchronized average signal y_s(j), and the moving average signal y_m(j) are defined in (32), (33), and (34), respectively.
y_{r,i}(j) = x_i(j) - x_{\mathrm{baseline}}(j), \quad \text{for } j = 1, 2, 3, \ldots, m, \qquad (32)

y_s(j) = \frac{1}{h}\sum_{\mathrm{trial}=1}^{h} x_{\mathrm{trial}}(j), \quad \text{for } j = 1, 2, 3, \ldots, m, \qquad (33)

y_m(j) = \frac{1}{r}\sum_{k=0}^{r-1} x_{\mathrm{target\text{-}channel}}(j-k), \quad \text{for } j = 1, 2, 3, \ldots, m, \qquad (34)
where i = 1, 2, 3, …, n, and h and r denote the number of trials and the number of samples on average, respectively. Since h and r are the first parameters of the SYNAVG and
Figure 7 First dataset: 16-channel mixed signals generated by MATLAB and [42].
MOVAVG instructions, respectively, they can only be set to 2, 4, 8, or 16 in the proposed hardware architecture, as mentioned in Section 4.2. Therefore, the division operations in (33) and (34) can be performed with the division-by-2^λ units in the PUs by setting λ to 1, 2, 3 or 4. The operations of addition and subtraction in (32), (33), and (34) are performed on the MAA units of PU1 and PU2. For the re-reference function, PU1 and PU2 concurrently perform the computations for different channels. For the synchronized average and moving average functions, PU1 and PU2 concurrently perform the computations for different discrete times j.
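Functional models of (32)-(34) are sketched below. The function names and the array layout (one row per channel or trial) are our assumptions; in the hardware the divisions become shifts because h and r are restricted to powers of two.

```python
import numpy as np

def rereference(x, baseline):
    """(32): subtract the baseline channel from every channel.
    x: (n, m) array; baseline: index of the baseline channel."""
    return x - x[baseline]

def synchronized_average(trials):
    """(33): average h trials sample-by-sample. trials: (h, m) array."""
    return trials.mean(axis=0)

def moving_average(x, r):
    """(34): causal moving average over r samples (r in {2, 4, 8, 16}),
    treating samples before j = 1 as zero."""
    padded = np.concatenate([np.zeros(r - 1), x])
    return np.convolve(padded, np.ones(r) / r, mode='valid')
```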
Figure 8 Comparison between (a) the floating-point software separation results, (b) the artificial source signals and (c) the fixed-point software separation results of the FastICA function for the first dataset.
5 Evaluation and Comparison Results
In this section, the double-precision floating-point (i.e., the most accurate precision) software simulation in MATLAB is performed to verify the FastICA algorithm based on the Gram-Schmidt orthonormalization for the whitening process and the fixed-point algorithm. The software simulations of the re-reference, synchronized average and moving average functions are also performed in MATLAB. The hardware architecture is validated by the post-layout simulation results. Finally, the results are summarized and compared with other related works.
5.1 Software Simulation
To verify the separation quality of the FastICA algorithm based on the Gram-Schmidt orthonormalization for the whitening process and the fixed-point algorithm, the software simulation is performed. In our simulation, three datasets are adopted. The first dataset contains 16-channel artificial mixed signals which are obtained by randomly mixing the 16-channel artificial source signals in 12-bit fixed-point format as shown in Fig. 7. Note that the 16-channel artificial source signals contain two channels of sparse and very sparse source signals in [46] and 14 other channels of source signals generated by MATLAB functions including abs, sin, square, rem, sinc, rand, log and randperm. Since the source signals in the first dataset are known, the software separation results can be compared to the source signals to determine the separation quality. Figure 8 shows the artificial source signals and the floating-point and fixed-point software separation results of the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process for the first dataset. The numbers between Fig. 8a and b denote the absolute correlation coefficients between the source signals and the floating-point separation results. On the other hand, the numbers between Fig. 8b and c denote the absolute correlation coefficients between the source signals and the fixed-point separation results. For the floating-point separation results, the average and minimum absolute correlation coefficients are 0.9830 and 0.9656, respectively. The signal-to-interference ratio (SIR) [47] results for the 16-channel source signals and the software separation results with the first dataset are 25.0082, 13.0428, 19.1261, 12.8986, 13.6465, 19.4119, 11.6197, 14.2428, 18.2994, 13.4162, 12.5925, 28.7617, 12.6471, 16.2938, 13.4617, and 15.8396 dB, respectively, with an average of 16.2693 dB. As a result, the artificial mixed signals are well separated in the floating-point software simulation. For the fixed-point separation results, the average and minimum absolute correlation coefficients are 0.8389 and 0.4223, respectively, which are lower than those of the floating-point simulation results.
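The quality measure quoted above, the absolute correlation coefficient between a source signal and its matched separated component, can be reproduced with a one-liner; the function name is ours.

```python
import numpy as np

def abs_corrcoef(source, separated):
    """Absolute Pearson correlation coefficient between one source signal
    and one separated component (the quality measure used in this section)."""
    return abs(np.corrcoef(source, separated)[0, 1])
```

A perfectly recovered component scores 1 regardless of its sign or scale, which is why the metric is taken in absolute value: ICA can only recover sources up to permutation, sign and scaling.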
The second dataset contains the 16-channel EEG signals which are adopted from the dataset in the EEGLAB toolbox [42, 43]. The details of the experiment were described in [6]. Figure 9 shows the 16-channel EEG signals which are captured from the positions of FPZ, F3, FZ, F4, C3, C4, CZ, P3, PZ, P4, PO3, POZ, PO4, O1, OZ, and O2 of the international 10-20 electrode placement system. The 16-channel EEG signals are transformed to 12-bit fixed-point format with a sampling rate of 256 Hz and a window size of two seconds. Figure 10a and b show the floating-point and fixed-point software simulation results for the second dataset, respectively. The numbers in the middle of Fig. 10 denote the absolute correlation coefficients between the floating-point and fixed-point simulation results. For the floating-point simulation results, as can be seen in Fig. 10a, the independent components associated with the late positive event-related potential can be extracted
Figure 9 Second dataset: 16-channel EEG signals.
in the first three rows. The floating-point simulation results of Fig. 10a are similar to those reported in [6]. However, in Fig. 10b, one component cannot be easily observed in the second row. The average and minimum absolute correlation coefficients are 0.5507 and 0.3231, respectively.

The third dataset contains 4-channel EEG signals which have an eye blinking artifact as shown in Fig. 11. The 4-channel EEG signals are captured from the positions of FP1, FP2, F3 and F4 of the international 10-20 electrode placement system. The sampling rate and the format of the third dataset are the same as those of the second dataset. The software simulation results for the third dataset are shown in Fig. 12. As can be seen, the eye blinking component is separated in the first row of Fig. 12. According to the software simulation results with the above three datasets, the validation
Figure 10 Software simulation result utilizing the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process for the second dataset in MATLAB: (a) floating-point simulation results and (b) fixed-point simulation results.
Figure 11 Third dataset: 4-channel EEG signals.
of the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process can be achieved for both artificial mixed signals and EEG signals.

Figure 13 shows the software simulation result of the re-reference function, where the EEG signals at the positions of PZ and CZ in Fig. 9 are used as the input signal and the baseline signal, respectively. Figure 14 shows the software simulation result of the synchronized average function with 16 EEG trials captured from the position of CPZ. Finally, the EEG signal with high-frequency noise adopted from [46] and the corresponding software simulation result of the moving average function are shown in Fig. 15. As can be seen, the high-frequency noise of the EEG signal is filtered out after the moving average processing.
5.2 Post-Layout Simulation
In this subsection, the post-layout simulation is performed. The results of the post-layout simulation are compared to the software simulation results of the previous subsection to validate the function of the proposed cost-effective hardware implementation. For the FastICA function, the three datasets mentioned in the previous subsection are used for the post-layout simulation. Figures 16, 17 and 18 show the comparison results with the first dataset of the 16-channel artificial mixed signals in Fig. 7, the second dataset of the 16-channel EEG signals in Fig. 9 and the third dataset of the 4-channel EEG signals in Fig. 11, respectively. In Figs. 16, 17 and 18, the numbers in the middle of the figures denote the absolute correlation coefficients between the software simulation results and the post-layout simulation results. For the first dataset, the average and the minimum values of the absolute correlation
Figure 12 Software simulation result utilizing the FastICA algorithm with the Gram-Schmidt orthonormalization based whitening process for the third dataset in MATLAB.
Figure 13 Software simulation result of the re-reference function in MATLAB.
Figure 14 (a) 16 trials of the EEG signals from the position of CPZ; (b) software simulation result of the synchronized average function in MATLAB.
coefficients are 0.9967 and 0.9852, respectively. For the second dataset, the average and minimum values of the absolute correlation coefficients are 0.9441 and 0.8424, respectively. For the third dataset, the average absolute correlation coefficient is 0.9868. Although the datasets in this work are different from those in [31], it should be noted that the minimum value of the absolute correlation coefficients of the second dataset (16-channel EEG signals), 0.8424, is close to that of the EEG signals in Fig. 17 of [31], 0.8499. According to the comparison results, the post-layout simulation results attain enough accuracy compared to the software simulation results for all three datasets.

Figure 19 shows the comparison result of the artificial source signals and the post-layout simulation results with the first dataset in Fig. 7, where the numbers in the middle of Fig. 19 denote the absolute correlation coefficients between
Figure 15 (a) EEG signal with high-frequency noise; (b) software simulation result of the moving average function in MATLAB.
Figure 16 Comparison between the software simulation results and post-layout simulation results of the FastICA function for the first dataset in Fig. 7.
Figure 17 Comparison between the software simulation results and post-layout simulation results of the FastICA function for the second dataset in Fig. 9.
Figure 18 Comparison between the software simulation results and post-layout simulation results of the FastICA function for the third dataset in Fig. 11.
the source signals and the post-layout simulation results. The average and minimum absolute correlation coefficients are 0.9822 and 0.9678, respectively. The SIR results for the 16-channel source signals and the post-layout simulation results with the first dataset are 22.2621, 13.0944, 18.5266, 14.6086, 14.0360, 19.4625, 12.1755, 14.3472, 17.7386, 13.8883, 13.5538, 27.8766, 12.8188, 16.7454, 13.8240 and 14.1117 dB, respectively, with an average value of 16.1919 dB. The results show that the separation quality of the proposed cost-effective FastICA hardware implementation is satisfactory. Figures 20, 21, and 22 show the comparison results of the re-reference, synchronized average and moving
Figure 19 Comparison between the artificial source signals and post-layout simulation results of the FastICA function for the first dataset in Fig. 7.
Figure 20 Comparison between the software simulation result and post-layout simulation result of the re-reference function.
average functions, respectively. As can be seen, the absolute correlation coefficients can be up to 1 such that their accuracies are guaranteed. Since the post-layout simulation results are close to the software simulation results, the function of the hardware implementation is validated with enough accuracy.
5.3 Evaluation and Comparison Results
According to the post-layout implementation results of the proposed FastICA hardware architecture, Table 6 summarizes the post-layout characteristics. The power consumption of the proposed hardware architecture for the FastICA processing is measured after RC extraction of the placed-and-routed netlist with Synopsys PrimeTime. The FastICA hardware architecture consumes 19.4 mW at 100 MHz when 16 channels are activated. Figure 23 shows the power consumption results for different numbers of channels. In the proposed FastICA hardware architecture, the power consumption results for different numbers of channels are close since PU1 and PU2 are frequently utilized due to the heavy computations of the updating step [16] in (29) with the sample number of 512. On the other hand, Fig. 24 shows the energy evaluation results for different numbers of channels. As can be seen, with a higher channel number, the energy consumption increases since the computation time increases with the larger computational complexity. For the proposed FastICA hardware architecture, we target the FastICA processing of EEG data with a window size of 2 s and a sampling rate of 256 Hz. As can be seen in Table 6, the maximum hardware computation time for the 16-channel FastICA processing is 1.85 s at 100 MHz, which is shorter than 2 s. That means the proposed FastICA hardware architecture can output 512 samples per channel every 2 s; in other words, the throughput is 256 samples/s. Thus, the proposed FastICA hardware architecture implementation can support enough throughput for real-time application in this work.
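The real-time argument above reduces to a simple check of the numbers reported in Table 6, sketched here for clarity:

```python
window_s = 2.0          # window size in seconds (256 Hz x 512 samples)
samples = 512           # samples per channel per window
compute_time_s = 1.85   # maximum 16-channel FastICA time at 100 MHz (Table 6)

real_time = compute_time_s < window_s   # computation fits within one window
throughput = samples / window_s         # samples per second per channel
```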
Table 7 lists the comparison results in terms of application, algorithm, arithmetic, number of channels/weight vectors, sample size, speed, power dissipation, gate count, computation time, implementation approach, process technology, core area and variable channel among the post-layout implementation results of the proposed FastICA hardware architecture and the existing ICA implementations. In [20] and [26], FPGA implementations are used for two-channel ICA processing with operating frequencies of 12.288 and 50 MHz, respectively. The gate count in [20] is 11.469 thousand in the FPGA approach; however, its computation time is larger than 60 s for ANC. Based on the parallel ICA algorithm, the operating frequency of the ICA implementation on FPGA is 20.161 MHz [21, 22] and between 21.357 and 35.921 MHz [23] for image processing. The design of [28] operates at 68 MHz on FPGA using the INFOMAX algorithm for four-channel EEG signal processing. In [31, 32], the hardware implementation achieves eight-channel FastICA processing with a dedicated EVD processor. In [33], the self-configured ICA implementation achieves 16-channel processing with a dedicated principal component decomposition engine (PCDE) for EVD processing. Since the implementation approach information and post-layout results of the overall architecture are not provided in [29], this low-complexity pioneer work is not listed in Table 7 for hardware comparison. In this work, the proposed hardware architecture can perform the FastICA processing with a variable number of channels from 2 to 16. By proposing two reused PUs for the preprocessing and the fixed-point algorithm in the FastICA algorithm, the gate count of this work is only 47.43 % higher than that reported in [31], while the processing capability in this work is up to 16 channels and 512 samples, which are two
Figure 21 Comparison between the software simulation result and post-layout simulation result of the synchronized average function.
Figure 22 Comparison between the software simulation result and post-layout simulation result of the moving average function.
J Sign Process Syst (2016) 82:91–113 109
times as those provided in [31]. The gate count of this work with floating-point implementation is 5.79 times that reported in [32] with fixed-point implementation; however, the FastICA processor in [32] supports only eight-channel electrocorticography (ECoG) signal processing. For 16-channel EEG signal processing in our work, the separation quality results in Section 5.2 can be guaranteed by using floating-point arithmetic and twice the sample size. Although the ICA hardware in [33] provides 16-channel FastICA processing with good separation quality, the variable-channel feature is not offered. On the other hand, the core size of [33] is larger than that of this work even when the difference in process technology is considered. In summary, this hardware architecture is cost effective while providing satisfactory separation quality for the 2-to-16-channel EEG signal processing application. In addition, the user-defined parameters as well as the functions of re-reference, synchronized average, and moving average add flexibility to the proposed hardware architecture.
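The relative gate-count figures quoted above follow directly from the Table 7 entries (401 K for this work, 272 K for [31], and 69.2 K for [32]); a minimal sketch reproducing them, with a helper name of our own choosing:

```python
# Reproduce the gate-count comparisons quoted in the text from Table 7:
# 401K gates (this work), 272K gates (Van [31]), 69.2K gates (Yang [32]).

def percent_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

this_work, van_31, yang_32 = 401.0, 272.0, 69.2  # gate counts, x10^3

print(round(percent_increase(this_work, van_31), 2))  # 47.43 (% higher than [31])
print(round(this_work / yang_32, 2))                  # 5.79 (times the count of [32])
```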
6 Conclusion
In this work, we propose a cost-effective, 2-to-16 variable-channel FastICA hardware architecture for EEG signal processing. With the Gram-Schmidt based whitening in FastICA, the two proposed PUs can be shared between the preprocessing and the fixed-point algorithm to increase the hardware efficiency of the presented FastICA hardware architecture. The functions of re-reference, synchronized average, and moving average, as well as the user-defined parameters, are also supported in the proposed FastICA hardware architecture to attain flexibility. The evaluation and simulation results demonstrate that the proposed hardware architecture achieves satisfactory BSS quality and efficient hardware usage. On the other hand, if more than 16 channels are demanded, the sample size must be increased substantially, as analyzed in Fig. 3, so a large memory size will be required. Balancing separation quality against memory size for larger sample sizes and channel numbers will be the subject of future work.
Table 6 Summary of the post-layout characteristics.
Power supply: 1.0 V
Max. clock: 100 MHz
Core power for 16 channels: 19.4 mW
Gate count: 401 K*
Core area: 1.300 × 1.100 mm²
Pin count: 131
Process technology: TSMC 90 nm CMOS
Max. computation time: 1.85 s
* This number is counted using the size of a two-input NAND gate.
Figure 23 Power consumption in linear scale of the proposed FastICA hardware architecture versus the number of channels (2, 4, 8, and 16).
Figure 24 Energy consumption in log scale of the proposed FastICA hardware architecture versus the number of channels (2, 4, 8, and 16).
Table 7 Hardware comparison results among various ICA implementations.

Kim [20]: Application: Speech; Algorithm: ICA; Arithmetic: Floating-point; Channels/weight vectors (WVs): 2; Sample size: 512; Speed: 12.288 MHz (FPGA), N/A (ASIC); Power: 98.8 mW (FPGA), 14.5 mW (ASIC); Gates: 11.469 K (FPGA, for ANC), N/A (ASIC); Computation time: ≥60 s (FPGA), N/A (ASIC); Approach: FPGA, ASIC; Process: N/A (FPGA), 350 nm (ASIC); Core area: N/A; Variable channel: No.

Du [21, 22]: Application: Image; Algorithm: Parallel ICA; Arithmetic: Fixed-point; Channels/WVs: 4 (WVs); Sample size: N/A; Speed: 20.161 MHz (FPGA), N/A (ASIC); Power: N/A; Gates: 229.5 K (FPGA), N/A (ASIC); Computation time: N/A; Approach: FPGA, ASIC; Process: N/A (FPGA), 180 nm (ASIC); Core area: 1.41 mm²; Variable channel: No.

Du [23] *1: Application: Image; Algorithm: Parallel ICA; Arithmetic: Fixed-point; Channels/WVs: 4, 8, 12, 16, 20, 24 (WVs); Sample size: N/A; Speed: 21.829 MHz *2, 21.357 MHz *3, 35.921 MHz *4; Power: N/A; Gates: N/A; Computation time: 1129.5 s *5; Approach: FPGA; Process: N/A; Core area: N/A; Variable channel: Yes.

Shyu [26]: Application: Speech; Algorithm: FastICA; Arithmetic: Floating-point; Channels/WVs: 2; Sample size: 3000; Speed: 50 MHz; Power: N/A; Gates: N/A; Computation time: 0.003 s; Approach: FPGA; Process: N/A; Core area: N/A; Variable channel: No.

Huang [28]: Application: EEG; Algorithm: INFOMAX; Arithmetic: Fixed-point; Channels/WVs: 4; Sample size: 512; Speed: 68 MHz; Power: N/A; Gates: 315 K; Computation time: N/A; Approach: FPGA; Process: N/A; Core area: N/A; Variable channel: No.

Van [31]: Application: EEG; Algorithm: FastICA; Arithmetic: Hybrid; Channels/WVs: 8; Sample size: 256; Speed: 100 MHz; Power: 16.35 mW; Gates: 272 K; Computation time: 0.29 s (max); Approach: ASIC; Process: 90 nm; Core area: 1.49 mm²; Variable channel: No.

Yang [32]: Application: ECoG; Algorithm: FastICA; Arithmetic: Fixed-point; Channels/WVs: 8; Sample size: 256; Speed: 11 MHz; Power: 0.0816 mW; Gates: 69.2 K; Computation time: 0.0842 s (max); Approach: ASIC; Process: 90 nm; Core area: 0.40 mm²; Variable channel: No.

Roh [33]: Application: EEG; Algorithm: Self-configured ICA; Arithmetic: N/A; Channels/WVs: 16; Sample size: 512; Speed: 20 MHz; Power: 4.45 mW *6; Gates: N/A; Computation time: Variable; Approach: ASIC; Process: 130 nm; Core area: 7.02 mm²; Variable channel: No.

This work: Application: EEG; Algorithm: FastICA; Arithmetic: Floating-point; Channels/WVs: 2–16; Sample size: 512; Speed: 100 MHz; Power: 19.4 mW *7; Gates: 401 K; Computation time: 1.85 s (max); Approach: ASIC; Process: 90 nm; Core area: 1.43 mm²; Variable channel: Yes.

*1: Needs an external memory. *2: For sub-matrix. *3: For external decorrelation. *4: For comparison. *5: For 20 WVs. *6: For Mode 3. *7: For 16 channels.
Acknowledgments This work was supported in part by the Ministry of Science and Technology (MOST) Grants MOST 103-2221-E-009-099-MY3 and MOST 103-2911-I-009-101, and in part by Contract 104W963. The authors thank the Chip Implementation Center for their support.
References
1. Luck, S. J. (2005). An introduction to the event-related potential technique. Cambridge: MIT Press.
2. Bruce, E. N. (2005). Biomedical signal processing and signal modeling. New York: Wiley.
3. Hyvärinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4–5), 411–430.
4. Vigário, R. N. (1997). Extraction of ocular artefacts from EEG using independent component analysis. Electroencephalography and Clinical Neurophysiology, 103(3), 395–404.
5. Vigário, R., Särelä, J., Jousmäki, V., Hämäläinen, M., & Oja, E. (2000). Independent component approach to the analysis of EEG and MEG recordings. IEEE Transactions on Biomedical Engineering, 47(5), 589–593.
6. Makeig, S., Westerfield, M., Jung, T. P., Covington, J., Townsend, J., Sejnowski, T. J., & Courchesne, E. (1999). Functionally independent components of the late positive event-related potential during visual spatial attention. Journal of Neuroscience, 19(7), 2665–2680.
7. Castellanos, N. P., & Makarov, V. A. (2006). Recovering EEG brain signals: artifact suppression with wavelet enhanced independent component analysis. Journal of Neuroscience Methods, 158(2), 300–312.
8. Kachenoura, A., Albera, L., Senhadji, L., & Comon, P. (2008). ICA: a potential tool for BCI systems. IEEE Signal Processing Magazine, 25(1), 57–68.
9. Albera, L., Kachenoura, A., Comon, P., Karfoul, A., Wendling, F., Senhadji, L., & Merlet, I. (2012). ICA-based EEG denoising: a comparative analysis of fifteen methods. Bulletin of the Polish Academy of Sciences Technical Sciences, 60(3), 407–418.
10. Gao, J. F., Yang, Y., Lin, P., Wang, P., & Zheng, C. X. (2010). Automatic removal of eye-movement and blink artifacts from EEG signals. Brain Topography, 23(1), 105–114.
11. Lee, T. W. (1998). Independent component analysis: Theory and applications. Boston: Kluwer.
12. Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing: Learning algorithms and applications. New York: Wiley.
13. Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
14. Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
15. Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
16. Oja, E., & Yuan, Z. (2006). The FastICA algorithm revisited: convergence analysis. IEEE Transactions on Neural Networks, 17(6), 1370–1381.
17. Ollila, E. (2010). The deflation-based FastICA estimator: statistical analysis revisited. IEEE Transactions on Signal Processing, 58(3), 1527–1541.
18. Choi, S., Cichocki, A., & Amari, S. (2000). Flexible independent component analysis. Journal of VLSI Signal Processing, 26(1–2), 25–38.
19. Zarzoso, V., & Comon, P. (2010). Robust independent component analysis by iterative maximization of the kurtosis contrast with algebraic optimal step size. IEEE Transactions on Neural Networks, 21(2), 248–261.
20. Kim, C. M., Park, H. M., Kim, T., Choi, Y. K., & Lee, S. Y. (2003). FPGA implementation of ICA algorithm for blind signal separation and adaptive noise canceling. IEEE Transactions on Neural Networks, 14(5), 1038–1046.
21. Du, H., Qi, H., & Peterson, G. D. (2004). Parallel ICA and its hardware implementation in hyperspectral image analysis. Proceedings of SPIE, 5439, 74–83.
22. Du, H., & Qi, H. (2004). An FPGA implementation of parallel ICA for dimensionality reduction in hyperspectral images. Proceedings of IEEE International Geoscience and Remote Sensing Symposium, 5, 3257–3260.
23. Du, H., & Qi, H. (2006). A reconfigurable FPGA system for parallel independent component analysis. EURASIP Journal on Embedded Systems, 2006(23025), 1–12.
24. Du, H., Qi, H., & Wang, X. (2007). Comparative study of VLSI solutions to independent component analysis. IEEE Transactions on Industrial Electronics, 54(1), 548–558.
25. Jain, V. K., Bhanja, S., Chapman, G. H., Doddannagari, L., & Nguyen, N. (2005). A parallel architecture for the ICA algorithm: DSP plane of a 3-D heterogeneous sensor. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 5, 77–80.
26. Shyu, K. K., Lee, M. H., Wu, Y. T., & Lee, P. L. (2008). Implementation of pipelined FastICA on FPGA for real-time blind source separation. IEEE Transactions on Neural Networks, 19(6), 958–970.
27. Acharyya, A., Maharatna, K., Sun, J., Al-Hashimi, B. M., & Gunn, S. R. (2009). Hardware efficient fixed-point VLSI architecture for 2D kurtotic FastICA. Proceedings of European Conference on Circuit Theory and Design, 165–168.
28. Huang, W. C., Hung, S. H., Chung, J. F., Chang, M. H., Van, L. D., & Lin, C. T. (2008). FPGA implementation of 4-channel ICA for on-line EEG signal separation. Proceedings of IEEE Biomedical Circuits and Systems Conference, 65–68.
29. Acharyya, A., Maharatna, K., Al-Hashimi, B. M., & Reeve, J. (2011). Coordinate rotation based low complexity N-D FastICA algorithm and architecture. IEEE Transactions on Signal Processing, 59(8), 3997–4011.
30. Volder, J. E. (1959). The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers, 8(3), 330–334.
31. Van, L. D., Wu, D. Y., & Chen, C. S. (2011). Energy-efficient FastICA implementation for biomedical signal separation. IEEE Transactions on Neural Networks, 22(11), 1809–1823.
32. Yang, C. H., Shih, Y. H., & Chiueh, H. (2015). An 81.6 μW FastICA processor for epileptic seizure detection. IEEE Transactions on Biomedical Circuits and Systems, 9(1), 60–71.
33. Roh, T., Song, K., Cho, H., Shin, D., & Yoo, H. J. (2014). A wearable neuro-feedback system with EEG-based mental status monitoring and transcranial electrical stimulation. IEEE Transactions on Biomedical Circuits and Systems, 8(6), 755–764.
34. Strang, G. (2009). Introduction to linear algebra (4th ed.). Wellesley, MA: Wellesley-Cambridge Press.
35. Cichocki, A., Osowski, S., & Siwek, K. (2004). Prewhitening algorithms of signals in the presence of white noise. Proceedings of VI International Workshop "Computational Problems of Electrical Engineering", 205–208.
36. Woolfson, M. S., Bigan, C., Crowe, J. A., & Hayes-Gill, B. R. (2008). Method to separate sparse components from signal mixtures. Digital Signal Processing, 18(6), 985–1012.
37. Sharma, A., & Paliwal, K. K. (2007). Fast principal component analysis using fixed-point algorithm. Pattern Recognition Letters, 28(10), 1151–1155.
38. Huang, P. Y. (2011). Design and implementation of a multi-function cost-effective EEG signal processor. M.S. thesis, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan. (Advisor: Lan-Da Van).
39. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
40. The FastICA package for MATLAB. http://www.cis.hut.fi/projects/ica/fastica/.
41. Lan, T., Erdogmus, D., Pavel, M., & Mathan, S. (2006). Automatic frequency bands segmentation using statistical similarity for power spectrum density based brain computer interfaces. Proceedings of International Joint Conference on Neural Networks, 4650–4655.
42. Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(2), 9–21.
43. EEGLAB Wiki – SCCN. http://sccn.ucsd.edu/wiki/EEGLAB.
44. Wu, A. Y. (1995). Algorithm-based low-power digital signal processing system designs. Ph.D. thesis, Department of Electrical Engineering, University of Maryland, College Park, MD, USA.
45. Schulte, M. J., & Wires, K. E. (1999). High-speed inverse square roots. Proceedings of 14th IEEE Symposium on Computer Arithmetic, 124–131.
46. ICALAB for Signal Processing – benchmarks. http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/benchmarks/.
47. ICALAB for Signal Processing. http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/.
Lan-Da Van received the B.S. (Honors) and the M.S. degrees from Tatung Institute of Technology, Taipei, Taiwan, in 1995 and 1997, respectively, and the Ph.D. degree from National Taiwan University (NTU), Taipei, Taiwan, in 2001, all in electrical engineering.
From 2001 to 2006, he was an Associate Researcher at the National Chip Implementation Center (CIC), Hsinchu, Taiwan. Since February 2006, he has been with the faculty of the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, where he is currently an Associate Professor. His research interests are in VLSI algorithms, architectures, and chips for digital signal processing (DSP) and biomedical signal processing. This includes the design of low-power/high-performance/cost-effective 3-D graphics processing systems, adaptive/learning systems, computer arithmetic, multidimensional filters, and transforms. He has published more than 50 journal and conference papers in these areas.
Dr. Van was a recipient of the Chunghwa Picture Tube (CPT) and Motorola fellowships in 1995 and 1997, respectively. He was the elected chairman of the IEEE NTU Student Branch in 2000. In 2001, he received the IEEE award for outstanding leadership and service to the IEEE NTU Student Branch. In 2005, he was a recipient of the Best Poster Award at the iNEER Conference for Engineering Education and Research (iCEER). In 2014, he was a recipient of the teaching award of the College of Computer Science, National Chiao Tung University, and of the best paper award at the IEEE International Conference on Internet of Things (iThings 2014). Since 2009, he has served as an officer of the IEEE Taipei Section. In 2014, he served as Track Co-Chair of the 22nd IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC). He serves as Associate Editor and Guest Editor of the Journal of Medical Imaging and Health Informatics, and has served as a reviewer for several IEEE transactions and journals.
Po-Yen Huang received the B.S. degree in the EECS undergraduate honors program and the M.S. degree in computer science from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2010 and 2011, respectively. His research interests include VLSI information processing algorithms and architectures for biomedical signal processing.
Tsung-Che Lu received the B.S. degree in engineering science from National Cheng Kung University (NCKU), Tainan, Taiwan, in 2006, and the M.S. degree in engineering and system science from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2008. He is currently working toward the Ph.D. degree with the Department of Computer Science, National Chiao Tung University (NCTU), Hsinchu, Taiwan. His research interests include analog and digital circuits for biomedical signal processing.