# An Efficient ASIC Implementation of 16-channel On-line Recursive ICA Processor for Real-time EEG System

Wai-Chi Fang (IEEE Fellow), Kuan-Ju Huang, Chia-Ching Chou, Jui-Chung Chang, Gert Cauwenberghs (IEEE Fellow), and Tzyy-Ping Jung (IEEE Senior Member)

Abstract—This paper proposes an efficient very-large-scale integration (VLSI) design of a 16-channel on-line recursive independent component analysis (ORICA) processor ASIC for real-time EEG systems, implemented in TSMC 40 nm CMOS technology. ORICA is well suited to separating artifacts in real-time EEG systems because it is highly efficient and operates in real time. The proposed ORICA processor is composed of an ORICA processing unit and a singular value decomposition (SVD) processing unit. Compared with previous work [1], the proposed ORICA processor improves effectiveness and reduces hardware complexity by using a deeper pipeline architecture, a shared arithmetic processing unit, and shared registers. Sixteen-channel random signals containing 8 super-Gaussian and 8 sub-Gaussian channels are used to analyze the dependence of the source components, and the average correlation coefficient between the original source signals and the extracted ORICA signals is 0.95452. Finally, the proposed ORICA processor ASIC is implemented in TSMC 40 nm CMOS technology and consumes 15.72 mW at a 100 MHz operating frequency.

## I. INTRODUCTION

Electroencephalography (EEG) is a non-invasive tool for recording the electrical activity along the scalp produced by the firing of neurons within the brain. In recent years, many portable EEG systems have been proposed in academic research and in industry, along with many miniature bio-status recorder systems. However, EEG signals are very sensitive and are always contaminated by various disturbances such as ocular artifacts, electromyography (EMG) signals, and electrical noise from nearby instruments, which seriously degrade the precision of identification and analysis of the acquired EEG signals.

Independent component analysis (ICA) has proven to be an effective method for separating the clean EEG signal and the artifacts in contaminated EEG signals into different channels. The results obtained after ICA processing can be used for further applications such as brain–computer interfaces (BCIs) [2]. For BCI applications that require an immediate response, real-time ICA pre-processing is essential. The ORICA algorithm proposed in [3] differs from common ICA algorithms such as Infomax [4] and FastICA [5]. It can be used in real-time EEG systems to separate artifacts because of its faster convergence rate and satisfactory separation performance. ORICA is therefore more feasible and efficient for obtaining accurate results in a real-time EEG system. Because of the complicated computations of ORICA, it is not suitable for a PC-based implementation. Therefore, this study proposes a highly efficient hardware design of a 16-channel on-line recursive ICA processor ASIC for real-time EEG systems.

The organization of this paper is as follows. In Section II, the adopted ORICA algorithm is described. In Section III, the system architecture and design methods are presented. The experimental results are given in Section IV, and Section V is the conclusion.

Wai-Chi Fang, Kuan-Ju Huang, Chia-Ching Chou, and Jui-Chung Chang are with the Department of Electronics Engineering, National Chiao-Tung University, Taiwan (corresponding author; phone: +886-3-573-5603; fax: +886-3-513-1291; e-mail: wfang@mail.nctu.edu.tw).

Gert Cauwenberghs and Tzyy-Ping Jung are with the Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, San Diego, California, United States of America.

## II. DESCRIPTION OF ORICA ALGORITHM

This proposed application-specific integrated circuit (ASIC) adopts a recursive algorithm, ORICA [3], to implement a real-time EEG acquisition system. The flow chart of the ORICA algorithm in this paper is depicted in Fig. 1. The ORICA algorithm presented in this paper is composed of two main parts: the whitening unit and the training unit.

#### A. Whitening

After the raw EEG signals X are acquired from each channel, the whitening unit estimates the covariance matrix Cov(X) of X and creates the uncorrelated vector Z, as described in (1) to (3). The inverse square root P of the covariance matrix is obtained by the SVD unit in the real-time EEG system. Whitening reduces the computation time of the ORICA training unit and accelerates its convergence.

$$Cov(X) = E[X X^{T}] \tag{1}$$

$$P = Cov(X)^{-1/2} \tag{2}$$

$$Y = W_n \times P \times X \tag{3}$$
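As a short check of why the transformation in (2) decorrelates the data, the following derivation may help. It assumes that the whitened vector is $Z = P X$ (the text names Z but does not write this product explicitly), that X is zero-mean, and that Cov(X) is symmetric positive definite so that P is symmetric.

$$E[Z Z^{T}] = E[P X X^{T} P^{T}] = P \, Cov(X) \, P^{T} = Cov(X)^{-1/2} \, Cov(X) \, Cov(X)^{-1/2} = I$$

Under these assumptions the channels of Z are uncorrelated and have unit variance, which is the starting point that the training stage relies on.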

#### B. ORICA Training


The ORICA training is described by (4) to (8). The goal is to find an adaptive separating matrix  $W_n$  and the independent components Y. When  $\Delta W_n$  is close to zero, the  $W_n$  matrix has converged. To make the ORICA algorithm feasible in hardware, the coefficients  $\lambda_0$  and  $\gamma$  are set to 0.995 and 0.6, respectively, in this design.



Figure 1. The flow chart of the ORICA algorithm.

<sup>\*</sup>Research supported by the Ministry of Science and Technology of Taiwan.

$$k = \operatorname{sign}\left(\frac{E\left\{Y^{4}\right\}}{\left(E\left\{Y^{2}\right\}\right)^{2}} - 3\right) \tag{4}$$

$$f = \begin{cases} -2 \tanh(Y), & k = 1 \\ \tanh(Y) - Y, & k = -1 \end{cases} \tag{5}$$

$$\Delta W_n = \frac{\lambda_n}{1 - \lambda_n} \left[ W_n - \frac{Y \times f^T \times W_n}{1 + \lambda_n (f^T \times Y - 1)} \right] \tag{6}$$

$$\lambda_n = \frac{\lambda_0}{t^{\gamma}} \tag{7}$$

$$W_{n+1} = (W_n + \Delta W_n)^{-1/2} + (W_n + \Delta W_n) \tag{8}$$
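To make (4) to (7) concrete, the following is a minimal software sketch in plain Python. It is not the authors' hardware implementation: the function names are hypothetical, the expectations in (4) are replaced by sample means over a buffer, a kurtosis of exactly 3 is treated as sub-Gaussian by assumption, and the orthogonalization step (8) is left out.

```python
import math

LAMBDA_0 = 0.995  # initial learning rate (value given in the text)
GAMMA = 0.6       # decay exponent (value given in the text)


def kurtosis_sign(samples):
    """Eq. (4): +1 for super-Gaussian samples, -1 otherwise (sample means)."""
    m2 = sum(y ** 2 for y in samples) / len(samples)
    m4 = sum(y ** 4 for y in samples) / len(samples)
    return 1 if m4 / (m2 ** 2) - 3.0 > 0 else -1


def nonlinearity(y, k):
    """Eq. (5): select f according to the kurtosis sign k."""
    return -2.0 * math.tanh(y) if k == 1 else math.tanh(y) - y


def learning_rate(t):
    """Eq. (7): lambda_n = lambda_0 / t**gamma, for t >= 1."""
    return LAMBDA_0 / t ** GAMMA


def delta_w(w, y, f, lam):
    """Eq. (6): weight increment for one sample (w is n x n, y and f length n)."""
    n = len(w)
    f_dot_y = sum(f[i] * y[i] for i in range(n))
    denom = 1.0 + lam * (f_dot_y - 1.0)
    # Row vector f^T W, reused for every row of the outer product Y (f^T W).
    ftw = [sum(f[i] * w[i][j] for i in range(n)) for j in range(n)]
    scale = lam / (1.0 - lam)
    return [[scale * (w[i][j] - y[i] * ftw[j] / denom) for j in range(n)]
            for i in range(n)]
```

A call such as `delta_w(w, y, [nonlinearity(v, k) for v in y], learning_rate(t))` would produce the increment for sample index `t`. Computing the row vector `f^T W` once and reusing it keeps the update at quadratic rather than cubic cost in the channel count.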

#### C. The SVD

Processing each sample requires inverse and inverse-square-root matrices, as in (2) and (8). Because these operations are very difficult to implement directly in hardware, this paper develops an SVD unit to handle the large amount of complicated computation.

The Jacobi SVD (JSVD) algorithm has proven to be an effective method for computing pseudo-inverses and matrix approximations and for handling ill-posed problems. A rectangular matrix  $A_{m \times n}$  can be decomposed into three matrices, as shown in (9). The columns of  $U_{m \times m}$  and  $V_{n \times n}$  are the eigenvectors of  $AA^{T}$  and  $A^{T}A$ , respectively, and the diagonal elements of  $\Sigma_{m \times n}$  are the singular values of the matrix A.

$$A_{m \times n} = U_{m \times m} \, \Sigma_{m \times n} \, V^T_{n \times n} \tag{9}$$

Because U and V are unitary (orthogonal) matrices, (9) can be rewritten in the following form:

$$A = U \Sigma V^{T} \Longrightarrow U^{T} A V = \Sigma \tag{10}$$
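The relation in (10) can be illustrated in software for the symmetric case that arises in (2). The sketch below is an assumption-laden illustration rather than the paper's design: it uses floating-point arithmetic instead of CORDIC, it applies cyclic Jacobi rotations to a symmetric positive-definite matrix (for which the SVD coincides with the eigendecomposition, so U = V), and the names `jacobi_eigh`, `inv_sqrt_sym`, and `matmul` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def jacobi_eigh(a, sweeps=30, tol=1e-24):
    """Return (w, v) with a ~= v diag(w) v^T for a symmetric matrix a."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes a[p][q] after the two-sided rotation.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # a <- a J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # a <- J^T a (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # v <- v J accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def inv_sqrt_sym(a):
    """Inverse square root v diag(w**-0.5) v^T of a symmetric positive-definite a."""
    w, v = jacobi_eigh(a)
    n = len(a)
    return [[sum(v[i][k] * v[j][k] / math.sqrt(w[k]) for k in range(n))
             for j in range(n)] for i in range(n)]
```

For a covariance-like matrix `c`, `p = inv_sqrt_sym(c)` should satisfy `matmul(matmul(p, c), p)` being close to the identity. Once the diagonal form is available, the inverse and the inverse square root only require element-wise operations on the diagonal entries, which is the reason an SVD-style unit is attractive for (2) and (8).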

## III. THE REAL-TIME EEG SYSTEM ARCHITECTURE

The hardware architecture of the 16-channel real-time EEG system based on the ORICA processor is shown in Fig. 2. It comprises five main processing units: a system control unit, a whitening unit, an ORICA training unit, an SVD unit, and an ORICA output stage with a floating matrix multiplier.

#### A. The System Control Unit

This control unit is in charge of the data flow of the ORICA processor. It arbitrates access to each unit to avoid data conflicts and structural hazards.

#### B. The Whitening Unit

In the pre-processing stage, the raw EEG signals are whitened in the whitening unit. The whitening transformation is an effective method for decorrelating the original EEG sources. This unit transforms the data so that the covariance matrix *COV\_X* becomes the identity matrix. This effectively creates new random variables that are uncorrelated and have unit variance. After decorrelation, the iterations in the training unit converge efficiently, and the computational complexity is greatly reduced.



Figure 2. The overall hardware architecture of the proposed EEG system.

#### C. The ORICA Training Unit

The training unit, which is shown in Fig. 3, calculates the unmixing matrix  $W_n$ , and it consumes most of the computation time because of the iterative training loops. This unit contains:

1) A Kurtosis unit, which uses registers to store the calculated independent component Y. The system then raises Y to the second and fourth powers and compares the values of  $E\{Y^4\}$  and  $3*E\{Y^2\}^2$ . A positive (negative) value of k in (4) indicates a super-Gaussian (sub-Gaussian) component.

2) A *Tanh LUT*, which is a look-up table of the hyperbolic tangent function used to approximate the value of *f* shown in (5). As simulated in MATLAB, the output of *Tanh*(*Y*) is almost saturated when the input is larger than 3 or smaller than -3. Therefore, this work uses 48 16-bit numbers to represent the values of the set  $\{Tanh(Y), -3 < Y < 3\}$ . When the input value is outside this range, the mirrored non-linear lookup unit outputs the saturation value, which is +1 or -1.

3) A Weight Update unit, which employs parallel shared multipliers and adders to calculate the updated unmixing matrix  $W_{n+1}$ , as shown in (6) and (8).  $W_{n+1}$  is the separating matrix used to calculate the estimated independent components. In addition, this unit uses the SVD unit to complete the inverse matrix operation.

4) A Learning Rate unit, which calculates  $\lambda_n$  as shown in (7). Since the 16-channel ORICA processor must process a huge amount of data, the errors caused by iterative operation accumulate easily. To improve the accuracy of the proposed system, this paper uses a finer curve-approximation method in the hardware implementation. Because the ORICA algorithm must evaluate a power function,  $Y_c = 0.995/t^{0.6}$ , to calculate the value of  $\lambda_n$ , this work adopts eight straight-line segments, whose slopes are 0.3546, 0.2201, 0.1684, 0.0711, 0.0357, 0.0205, 0.0140 and 0.0106, to approximate the  $Y_c$  curve in the corresponding intervals shown in Fig. 4. Compared with the previous work [1], which uses only two straight lines for the approximation, this work has a faster convergence rate and better data accuracy.

Figure 3. The hardware architecture of the ORICA training unit.

Figure 4. The diagram of Y<sub>c</sub> approximation curve.
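Items 2) and 4) above both replace a non-linear function with a small table, and the idea can be sketched in a few lines of Python. Everything beyond the numbers quoted in the text is an assumption made for illustration: the 48 entries are taken to cover 0 to 3 uniformly with negative inputs handled by odd symmetry, the 16-bit format is taken to be Q1.15, and the segment breakpoints are hypothetical powers of two. Because the paper's actual intervals are only given in Fig. 4, the chord slopes computed here are not expected to match the slopes listed in item 4).

```python
import math

LUT_SIZE = 48             # number of stored entries (from the text)
Y_MAX = 3.0               # saturation threshold (from the text)
STEP = Y_MAX / LUT_SIZE   # assumed uniform spacing of 0.0625
Q15 = 1 << 15             # assumed Q1.15 scaling for the 16-bit words

# Assumed layout: entry i holds tanh at the midpoint of [i*STEP, (i+1)*STEP).
TANH_LUT = [round(math.tanh((i + 0.5) * STEP) * Q15) for i in range(LUT_SIZE)]


def tanh_lut(y):
    """Mirrored lookup: table for |y| < 3, saturation to +1 or -1 outside."""
    sign = -1.0 if y < 0 else 1.0
    mag = abs(y)
    if mag >= Y_MAX:
        return sign
    return sign * TANH_LUT[int(mag / STEP)] / Q15


# Hypothetical breakpoints giving eight segments; not taken from the paper.
BREAKPOINTS = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def exact_rate(t):
    """Reference curve Y_c = 0.995 / t**0.6."""
    return 0.995 / t ** 0.6


def piecewise_rate(t):
    """Eight-segment chord approximation of Y_c; held constant past the end."""
    if t >= BREAKPOINTS[-1]:
        return exact_rate(BREAKPOINTS[-1])
    for lo, hi in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if lo <= t < hi:
            slope = (exact_rate(hi) - exact_rate(lo)) / (hi - lo)
            return exact_rate(lo) + slope * (t - lo)
    raise ValueError("t must be >= 1")
```

In hardware the slopes and intercepts would presumably be stored as constants rather than recomputed, so each evaluation reduces to one comparison chain, one multiplication, and one addition. With the midpoint layout assumed here, the table error is bounded by roughly half a step near zero, where tanh is steepest.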

#### D. The ORICA Output Stage and Floating Matrix Multiplier

The independent components of the mixed signals are extracted in this stage. The raw EEG signals, the P matrix from the whitening unit, and the W matrix from the ORICA training unit are all required to compute the resulting components in the floating matrix multiplier. The floating matrix multiplier uses a shared scalar-product unit to calculate both the unmixing matrix W and the independent component analysis output  $ORICA\_OUT$ . The estimated independent components are finally obtained by multiplying the unmixing matrix W with x. In addition, a handshaking mechanism is implemented to make the output interface flexible.
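The output computation described above can be sketched as a pair of matrix-vector products built on one reusable dot-product routine. This is only an illustration: the function names are hypothetical, the evaluation order W(Px) is an assumption consistent with (3), and ordinary Python floats stand in for the hardware floating-point format.

```python
def dot(a, b):
    """Scalar product shared by every row-by-vector multiplication."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m, v):
    """Matrix-vector product expressed through the shared dot product."""
    return [dot(row, v) for row in m]


def orica_out(w, p, x):
    """Independent components for one sample, computed as W (P x)."""
    return mat_vec(w, mat_vec(p, x))
```

Evaluating `P x` first keeps the per-sample cost at two matrix-vector products instead of a matrix-matrix product followed by a matrix-vector product.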



Figure 5. The block diagram of the proposed SVD processor.



Figure 6. The process of CORDIC data fetching.

#### E. The SVD Unit

The SVD unit, shown in Fig. 5, calculates the diagonal, inverse, and inverse-square-root matrices of the target matrices. It adopts the coordinate rotation digital computer (CORDIC) algorithm [6] to perform the SVD of the target matrix. This processor uses an inverse-square-root (INVSQRT) unit and an Inverse unit to calculate (2) and (8).

To reduce area and power consumption, this paper uses three single-port SRAMs instead of dual-port SRAMs to store data. The SVD control unit is the top module of this unit and controls the execution mode of the SVD. The *Angle\_CORDICs* fetch the four elements corresponding to the index pair (p,q) from SRAM  $\Sigma$  to calculate  $\theta$  and  $\phi$ , as shown in Fig. 6. After the angles have been calculated, the *Vector\_CORDICs* fetch the row vectors of matrices U and  $\Sigma$  and the column vectors of matrices  $\Sigma$  and R. Once the *Vector\_CORDICs* have processed these elements, the SVD processor obtains the updated elements of the corresponding vectors.
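The two roles described above, computing an angle and then rotating vectors by it, correspond to the two classic CORDIC modes. The sketch below emulates them in floating point for illustration only; the iteration count of 16 is an assumption, the function names are hypothetical, and the paper's word lengths and exact angle formulas for  $\theta$  and  $\phi$  are not reproduced here.

```python
import math

ITERATIONS = 16  # assumed number of micro-rotations
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Constant gain accumulated by the un-normalized micro-rotations.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_vectoring(x, y):
    """Vectoring mode: drive y to zero, return (magnitude, angle); needs x > 0."""
    angle = 0.0
    for i in range(ITERATIONS):
        d = -1.0 if y > 0 else 1.0  # rotate towards the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * ANGLES[i]      # input angle is minus the applied rotation
    return x / GAIN, angle


def cordic_rotation(x, y, angle):
    """Rotation mode: rotate (x, y) counter-clockwise by angle (about |angle| < 1.74)."""
    for i in range(ITERATIONS):
        d = 1.0 if angle >= 0 else -1.0  # reduce the residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * ANGLES[i]
    return x / GAIN, y / GAIN
```

In a fixed-point datapath the multiplications by `2.0 ** -i` become right shifts, which is why CORDIC suits angle and rotation computations without general-purpose multipliers. The gain correction can be applied once at the end rather than in every iteration.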

## IV. EXPERIMENTAL RESULTS

The 16-channel random independent source signals, which contain 8 super-Gaussian and 8 sub-Gaussian channels, are shown in Fig. 7 (a). The maximum correlation between any two source signals is 0.0032, which is used to analyze the dependence of the source components and shows that they are nearly uncorrelated. The source signals are mixed with a stationary mixing matrix to generate the measured signals shown in Fig. 7 (b). To verify the performance of the proposed design, the ORICA signals are extracted using the designed processor. In Fig. 7 (c), the extracted independent components are mapped to the channels of the corresponding original sources so that they are easier to identify. To analyze the non-stationary characteristics of the ORICA output, the correlation coefficient of the ORICA output is evaluated. The average correlation between the original source signals and the extracted ORICA signals is 0.95452. The ability to separate the mixture of super-Gaussian and sub-Gaussian random signals does not imply the ability to find out the artifacts and components in real EEG signals.
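The evaluation metric can be sketched as an average of per-channel Pearson correlation coefficients. This is an assumption about how the reported figure could be computed, not a description of the authors' script: the function names are hypothetical, channels are assumed to be already matched by index as in Fig. 7 (c), and the absolute value is taken because ICA recovers components only up to sign.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_abs_correlation(sources, estimates):
    """Mean |r| over channel pairs that are already matched by index."""
    pairs = list(zip(sources, estimates))
    return sum(abs(pearson(s, e)) for s, e in pairs) / len(pairs)
```

A value near 1 from `average_abs_correlation` would indicate that every extracted channel closely follows its source up to scale and sign.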

## V. CONCLUSION

This paper presents an efficient VLSI design of a 16-channel on-line recursive independent component analysis (ORICA) processor ASIC for real-time EEG systems, implemented in TSMC 40 nm CMOS technology. The proposed design uses hardware parallelism and pipelining to achieve real-time processing and data handling. Moreover, the high-efficiency ORICA training unit employs several design techniques, such as kurtosis size decision, an optimized mirrored look-up table, an automatic learning rate decision procedure, and an optimized specification analysis, to efficiently estimate the unmixing weight matrix. As a result, hardware cost and power consumption are reduced. The average correlation coefficient between the original source signals and the extracted ORICA signals is 0.95452.

This ORICA processor ASIC is implemented in TSMC 40 nm CMOS technology. The ASIC occupies a core area of 1800 x 1800  $\mu$m<sup>2</sup> and consumes 15.72 mW at a core supply voltage of 0.9 V with a 100 MHz operating clock frequency. The specification and the silicon layout of the real-time EEG system with the proposed ORICA processor ASIC are shown in Fig. 8.

## ACKNOWLEDGMENT

This work was supported in part by the Ministry of Science and Technology of Taiwan, R.O.C., under grant NSC102-2220-E-009-033 and 101-2221-E-009-169-MY2. The authors would also like to express their sincere appreciation to the National Chip Implementation Center for chip fabrication and testing service.

## REFERENCES

- [1] W.-Y. Shih, J.-C. Liao, K.-J. Huang, W.-C. Fang, G. Cauwenberghs, and T.-P. Jung, "An efficient VLSI implementation of on-line recursive ICA processor for real-time multi-channel EEG signal separation," *Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE*, pp. 6808–6811, 3–7 July 2013.
- [2] A. Palumbo, B. Calabrese, G. Cocorullo, M. Lanuzza, P. Veltri, P. Vizza, A. Gambardella, and M. Sturniolo, "A novel ICA-based hardware system for reconfigurable and portable BCI," *Medical Measurements and Applications, 2009. MeMeA 2009. IEEE International Workshop on*, pp. 95–98, 29–30 May 2009.
- [3] M. T. Akhtar, T.-P. Jung, S. Makeig, and G. Cauwenberghs, "Recursive independent component analysis for online blind source separation," *Circuits and Systems (ISCAS), 2012 IEEE International Symposium on*, pp. 2813–2816, 20–23 May 2012.
- [4] A. J. Bell and T. J. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," *Neural Computation*, vol. 7, no. 6, pp. 1129–1159, Nov. 1995.
- [5] A. Hyvärinen and E. Oja, "A Fast Fixed-Point Algorithm for Independent Component Analysis," *Neural Computation*, vol. 9, no. 7, pp. 1483–1492, 1997.
- [6] J. Ma, K. K. Parhi, and E. F. Deprettere, "An algorithm transformation approach to CORDIC based parallel singular value decompositions architectures," *Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on*, vol. 2, pp. 1401–1405, 24–27 Oct. 1999.

| Parameter           | Value                   |
|---------------------|-------------------------|
| Technology          | TSMC 40 nm CMOS         |
| Core size           | 1800 x 1800 $\mu m^2$   |
| Output delay        | 0.0075 s                |
| Gate count          | 0.572 million           |
| Sample rate         | 128 Hz                  |
| Operation frequency | 100 MHz                 |
| Power consumption   | 15.72 mW                |

Figure 8. The specification and silicon layout of the proposed ASIC
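As a rough consistency check on the figures in the specification, the values can be combined as follows. Reading the output delay as the processing latency per sample and assuming that the quoted power is drawn continuously are interpretations, not statements from the paper; the symbols $T_s$, $N_{clk}$, $N_{delay}$, and $E_{sample}$ are introduced here only for this calculation.

$$T_s = \frac{1}{128\ \text{Hz}} = 7.8125\ \text{ms}, \qquad N_{clk} = \frac{100\ \text{MHz}}{128\ \text{Hz}} = 781{,}250\ \text{cycles per sample}$$

$$N_{delay} = 0.0075\ \text{s} \times 100\ \text{MHz} = 750{,}000\ \text{cycles} < N_{clk}$$

$$E_{sample} = 15.72\ \text{mW} \times 7.8125\ \text{ms} \approx 122.8\ \mu\text{J}$$

Under these assumptions the 7.5 ms delay fits within one 7.8125 ms sample period, which is consistent with the real-time claim at a 128 Hz sample rate.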