# **A Low Power Biomedical Signal Processor ASIC Based on Hardware Software Codesign**

Z. D. Nie, L. Wang, *Member, IEEE*, W. G. Chen, T. Zhang, and Y. T. Zhang, *Fellow, IEEE*

*Abstract*— **A low power biomedical digital signal processor ASIC based on a hardware and software codesign methodology is presented in this paper. The codesign methodology was used to achieve higher system performance and design flexibility. The hardware implementation includes a low power 32-bit RISC CPU (ARM7TDMI), a low power AHB-compatible bus, and a scalable digital co-processor optimized for low power Fast Fourier Transform (FFT) calculations. The co-processor can be scaled for 8-point, 16-point and 32-point FFTs, taking approximately 50, 100 and 150 clock cycles, respectively. The complete design was intensively simulated using the ARM DSM model and emulated on the ARM Versatile platform before being committed to silicon. The multi-million-gate ASIC was fabricated using SMIC 0.18 μm mixed-signal CMOS 1P6M technology. The die area measures 5,000 μm x 2,350 μm. The power consumption was approximately 3.6 mW at a 1.8 V power supply and a 1 MHz clock rate. The power consumed by FFT calculations was less than 1.5 % of that of a conventional embedded software-based solution.**

## I. INTRODUCTION

Recent years have seen a surge in the development of wireless, low power wearable and implantable devices for physiological measurements and telemedicine applications [1, 2]. Consequently, various signal processing algorithms have been developed to process measured signals such as ECG, EEG, EMG, EGG, respiration and PPG [3]. Most of these signals are periodic, and many vital signs such as heart rate, pulse rate, EEG rhythms and respiration rate have their primary features in the frequency spectrum. Spectrum analysis is therefore a fundamental building block, and a Fast Fourier Transform (FFT) unit is a natural first choice [4-6].

On the other hand, wearable or implantable devices must be small and discreet. One of the design challenges is the on-node processing capability.

Manuscript received April 24, 2009. This work was supported in part by Chinese Academy of Sciences "the 100 Talent People" Program and Key Lab for Biomedical Informatics and Health Engineering, Chinese Academy of Sciences.

L. Wang is with the Institute of Biomedical and Health Engineering (IBHE), Shenzhen Institute of Advanced Technology (SIAT) (phone: 86-755-86392277, e-mail: wang.lei@siat.ac.cn).

Y. T. Zhang is with the Institute of Biomedical and Health Engineering (IBHE) Shenzhen Institute of Advanced Technology (SIAT) (e-mail: yt.zhang@siat.ac.cn).

Because of the extreme size and power constraints, it is difficult to perform spectrum analysis within wearable or implantable devices. A possible alternative is to transmit all the raw data to a more powerful base station, such as a PDA, for post-processing [7]. This approach is disadvantageous because (a) the RF transmission consumes a large amount of battery power and (b) it requires a high bandwidth RF channel.

Different off-the-shelf IC modules have been used to tackle the on-node computational bottleneck. A conventional microprocessor is easy to use, but its architecture and instruction set are not optimized for such computations. The use of FPGAs has also been suggested, but their relatively high power consumption prohibits their use in practical wearable or implantable applications.

An Application-Specific Integrated Circuit (ASIC) can be fully customized, providing maximal design flexibility at the lowest-possible power consumption [8]. In an ASIC, all functional building blocks can be integrated into a single piece of silicon, which means potential size reduction for the sensor nodes. This also simplifies the subsequent packaging and assembly processes. An ASIC is cost-effective when volume production is applied.

In this paper, we present a mixed-signal ASIC based on hardware and software codesign for scalable FFT calculations. Codesign is a methodology for solving design problems in processor-based embedded systems that allows the concurrent design of both hardware and software [9-10]. The design advantages of the codesign approach for biomedical signal spectrum analysis are also illustrated in this paper.

## II. SYSTEM ARCHITECTURE

The complete ASIC was designed based on the codesign methodology. It was partitioned into a hardware portion and an embedded software portion. Fig. 1 illustrates the system architecture of the ASIC.

The system has the following primary features:

--32-bit RISC ARM7TDMI processor.

--Low power AHB compatible bus (LPAHB).

--Scalable FFT module that could be scaled for 8-point, 16-point and 32-point FFT.

--4K-word SRAM for data and program storage.

--Digital interfaces to various off-chip ADC and RF front end modules.

--On-chip VCO to generate clock rates up to 30 MHz, eliminating the need for off-chip oscillators.

Z. D. Nie, W. G. Chen, and T. Zhang are with the Institute of Biomedical and Health Engineering (IBHE), Shenzhen Institute of Advanced Technology (SIAT) (e-mail: zd.nie@siat.ac.cn).



Fig. 1. ASIC system architecture

As for the design flow, Verilog was used for the RTL descriptions. The RTL code was simulated and synthesized using Synopsys tools and then emulated on the ARM Versatile platform. Cadence tools were used for the back end. The software part was written in C and assembly language using the ARM RealView Development Suite (RVDS).

## III. HARDWARE IMPLEMENTATION

### *A. ARM7TDMI implementation*

The ARM7TDMI core is a 32-bit embedded RISC processor delivered as a hard macro cell optimized to provide the best combination of performance, power and area characteristics. The ARM7TDMI core enables system designers to build embedded devices requiring small size, low power and high performance [11].

### *B. Low power AHB compatible system bus design*

In this design, an AHB compatible system bus was designed to interconnect the system modules. For simplicity, the AHB-Lite architecture is used for high-efficiency communication.



Fig. 2. The Structure of Low Power AHB Compatible Bus

Typically, the power dissipated by system-level buses contributes the largest portion of the total power of a complex VLSI system. Minimizing the switching activity at the I/O interfaces therefore provides significant savings in the overall power budget [12]. In our design, a bus-invert code (INV) was used to minimize switching on the data bus. For the address bus, a modified Gray encoding was adopted to preserve the one-transition property for consecutive addresses of byte-addressable machines [13]. The structure of the low power AHB compatible bus is illustrated in Fig. 2.
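The two encodings can be illustrated with a short behavioural model. The sketch below is not the RTL of the bus described here. It assumes a 32-bit data bus and the classic bus-invert rule (drive the complement when more than half of the lines would toggle). For the address bus it assumes one common form of the modified Gray scheme: a reflected Gray code applied to the word index, with the two byte-offset bits passed through unchanged. All function names are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bus lines that differ between two values."""
    return bin(a ^ b).count("1")


def bus_invert(prev_lines: int, data: int, width: int = 32):
    """Bus-invert coding: return (value driven on the bus, invert flag).

    If driving `data` directly would toggle more than half of the lines,
    the complement is driven instead and the extra invert line is raised.
    """
    mask = (1 << width) - 1
    data &= mask
    if hamming(prev_lines & mask, data) > width // 2:
        return (~data) & mask, 1
    return data, 0


def bus_invert_decode(lines: int, invert: int, width: int = 32) -> int:
    """Receiver side: undo the inversion when the invert line is set."""
    mask = (1 << width) - 1
    return (~lines) & mask if invert else lines & mask


def to_gray(n: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_word_address(addr: int) -> int:
    """Assumed modified Gray scheme: Gray-code the word index, keep the byte offset."""
    return (to_gray(addr >> 2) << 2) | (addr & 0x3)
```

With the bus previously at `0x00000000`, writing `0xFFFFFFF0` would toggle 28 lines if sent directly; the encoder drives `0x0000000F` with the invert flag set, so only 4 data lines and the invert line change. Under the assumed address scheme, stepping from one 32-bit word to the next always changes exactly one address line.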

### *C. Scalable FFT circuit description*

A scalable FFT module supporting 8-point, 16-point and 32-point FFTs was implemented in the digital co-processor. The FFT was designed in a scalable manner.

A 2-point FFT is the basic building unit of the scalable design. For a $2^n$-point FFT, the Decimation-In-Time and Decimation-In-Frequency algorithms are derived from the Cooley-Tukey algorithm [14]. The radix-2 Decimation-In-Time FFT was used and was carried out by a modified butterfly architecture.

The inputs of the scalable FFT were 8-bit complex numbers, $X_{p\text{-}in}(n)$ and $X_{q\text{-}in}(n)$. Because the absolute value of a twiddle factor $W_N^k$ is less than or equal to 1, $W_N^k$ was multiplied by $2^6$ before being multiplied with $X_{q\text{-}in}(n)$, and the product was then right-shifted by six bits. With $X_{q\text{-}in}(n) = X_{q\text{-}R}(n) + i\,X_{q\text{-}I}(n)$ and $W_N^k = W_{N\text{-}R}^k + i\,W_{N\text{-}I}^k$, the complex product was computed as:

$$
\begin{aligned}
X_{qn}(n) = {} & \left[ X_{q\text{-}I}(n)\,(W_{N\text{-}R}^k - W_{N\text{-}I}^k) + W_{N\text{-}R}^k\,(X_{q\text{-}R}(n) - X_{q\text{-}I}(n)) \right] \\
& + i \left[ X_{q\text{-}R}(n)\,(W_{N\text{-}R}^k + W_{N\text{-}I}^k) - W_{N\text{-}R}^k\,(X_{q\text{-}R}(n) - X_{q\text{-}I}(n)) \right]
\end{aligned}
\tag{1}
$$

The results of the modified butterfly were:

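$$
\begin{aligned}
X_{p\text{-}out}(n) &= \left[ X_{p\text{-}in\text{-}R}(n) + X_{qn\text{-}R}(n) \right] + i \left[ X_{p\text{-}in\text{-}I}(n) + X_{qn\text{-}I}(n) \right] \\
X_{q\text{-}out}(n) &= \left[ X_{p\text{-}in\text{-}R}(n) - X_{qn\text{-}R}(n) \right] + i \left[ X_{p\text{-}in\text{-}I}(n) - X_{qn\text{-}I}(n) \right]
\end{aligned}
\tag{2}
$$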
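The arithmetic of the butterfly can be checked with a small integer model. The sketch below is a behavioural illustration rather than the implemented circuit. It assumes that the twiddle factors are rounded to the nearest integer after scaling by $2^6$ and that the right shift is arithmetic (rounding toward negative infinity), neither of which is stated in the text. The function names are hypothetical.

```python
import math

SHIFT = 6  # twiddle factors are scaled by 2**6, as described in the text


def twiddle_q6(k: int, n: int):
    """W_N^k = exp(-2*pi*i*k/N), scaled by 2**6 and rounded (assumed rounding)."""
    angle = -2.0 * math.pi * k / n
    scale = 1 << SHIFT
    return round(math.cos(angle) * scale), round(math.sin(angle) * scale)


def cmul3(xr: int, xi: int, wr: int, wi: int):
    """Complex multiply using three real multiplications.

    real = xi*(wr - wi) + wr*(xr - xi) = xr*wr - xi*wi
    imag = xr*(wr + wi) - wr*(xr - xi) = xr*wi + xi*wr
    """
    shared = wr * (xr - xi)             # multiplier 1, shared by both parts
    real = xi * (wr - wi) + shared      # multiplier 2
    imag = xr * (wr + wi) - shared      # multiplier 3
    return real >> SHIFT, imag >> SHIFT  # undo the 2**6 scaling


def butterfly(xp, xq, w):
    """Radix-2 DIT butterfly on (real, imag) integer pairs."""
    tr, ti = cmul3(xq[0], xq[1], w[0], w[1])
    return (xp[0] + tr, xp[1] + ti), (xp[0] - tr, xp[1] - ti)
```

For example, multiplying $10 + 20i$ by the twiddle factor $-i$ (stored as `(0, -64)`) gives $20 - 10i$, which matches the exact product because this twiddle factor is representable without rounding error.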

Equations (1) and (2) indicate that one complex multiplication needs 3 multipliers, 3 subtracters and 2 adders [15]. The basic structure is illustrated in Fig. 3. Two complex numbers were fed into Stage2; the result of Stage2 was sign-extended and then fed into DFF 2A\_0 after the demultiplexer. In the next step, Stage2 received another two complex numbers, and its result was again sign-extended and then fed into DFF 2A\_1. As soon as the inputs of Stage4 were ready, Stage4 was enabled. By reusing the basic modules in this way, a 4-point FFT was achieved [16].



Fig. 3. The Structure of The Basic Module

Fig. 4 illustrates that, by scaling up, the 8-point, 16-point and 32-point FFTs were achieved hierarchically.



The signal SEL was used to set the operation mode: "00" for idle status, "01" for 8-point FFT, "10" for 16-point FFT, and "11" for 32-point FFT.
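A functional model of the scalable FFT helps to show how one radix-2 butterfly, repeated over $\log_2 N$ stages, covers all three sizes selected by SEL. The sketch below is a software illustration under stated assumptions, not a description of the hardware datapath: it uses in-place decimation-in-time with bit-reversed input ordering, twiddle factors rounded after scaling by $2^6$, and unbounded Python integers instead of fixed word widths. The names `fixed_fft` and `SEL_TO_POINTS` are hypothetical.

```python
import math

SHIFT = 6  # twiddle scaling of 2**6, as in the text
SEL_TO_POINTS = {0b01: 8, 0b10: 16, 0b11: 32}  # 0b00 means idle


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fixed_fft(samples, sel: int):
    """Radix-2 DIT FFT on (real, imag) integer pairs; returns None when idle."""
    if sel == 0b00:
        return None
    n = SEL_TO_POINTS[sel]
    if len(samples) != n:
        raise ValueError("expected %d samples" % n)
    stages = n.bit_length() - 1
    data = [samples[bit_reverse(i, stages)] for i in range(n)]
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for j in range(half):
                # Twiddle factor exp(-2*pi*i*j / (2*half)), scaled by 2**6.
                angle = -math.pi * j / half
                wr = round(math.cos(angle) * (1 << SHIFT))
                wi = round(math.sin(angle) * (1 << SHIFT))
                pr, pi_ = data[start + j]
                qr, qi = data[start + j + half]
                # Three-multiplier complex product, then shift back by 6 bits.
                shared = wr * (qr - qi)
                tr = (qi * (wr - wi) + shared) >> SHIFT
                ti = (qr * (wr + wi) - shared) >> SHIFT
                data[start + j] = (pr + tr, pi_ + ti)
                data[start + j + half] = (pr - tr, pi_ - ti)
        half *= 2
    return data
```

Calling `fixed_fft([(16, 0)] * 8, 0b01)` on a constant input places all the energy in bin 0, and a unit impulse produces a flat spectrum for any of the three sizes.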

## IV. SOFTWARE IMPLEMENTATION

The codesign methodology was used to bridge the software and hardware design. Fig. 5 demonstrates the software flow. The source code was a hybrid of ARM assembly code and C code, which was compiled and linked in the RVDS. The code was debugged in the Instruction Set Simulator, which is hardware independent. The RealView ICE platform downloaded the code into the hardware for emulation.



Fig. 5. Software Implementation Flow



Fig. 6. Source Code Flow

Fig. 6 illustrates the source code design flow. The source code mainly contained two parts: boot code and application code. In this design, all the modules except the ARM7TDMI were slaves, and the boot code was responsible for booting the complete digital system. For the application code, Table I gives an example of writing to and reading from the FFT coprocessor.

TABLE I. SIMPLE INSTRUCTIONS FOR WRITING TO AND READING FROM THE FFT

| Instruction | Comment |
|-------------|---------|
| `MOV r1, #0xc8000000` | FFT control register address to r1 |
| `STR r0, [r1, #0xdc]` | move data to FFT control register |
| `LDR r2, [r0, #0x44]` | read the result from FFT result register |

## V. RESULTS AND DISCUSSIONS

The completed system was intensively simulated using the ARM DSM model and emulated on the ARM Versatile platform. The latter uses a Logic Tile (Xilinx Virtex-5 FPGA XC5VLX330) to host all the digital hardware. Results indicated that the ARM core booted successfully with the current setup and that all the digital hardware worked as intended.

The digital design was synthesized with Synopsys Design Compiler, and the full-chip layout was carried out in Cadence SoC Encounter and Virtuoso using the SMIC 0.18 μm mixed-signal CMOS 1P6M library. Fig. 7 gives the layout view of the ASIC. The die area measures 5000 μm by 2350 μm. Table II lists the ASIC specifications.



Fig. 7. The Layout View of the ASIC



\*: the power was estimated at 1.8V power supply and at 1 MHz clock rate.

TABLE III. CLOCK CYCLES FOR THE EXECUTION OF DIFFERENT FFTS

| Approach          | FFT 8 | FFT 16 | FFT 32 |
|-------------------|-------|--------|--------|
| Embedded approach | 59500 | 114800 | 208000 |
| Codesign approach | 50    | 100    | 150    |

Table III compares the computational performance of an embedded (pure software) approach with that of the codesign approach implemented in our design. To complete a 32-point FFT, the software approach took approximately 200K clock cycles on a 16-bit ARM-compatible microprocessor, whereas our approach took merely 150. As indicated in Table II, the FFT hardware consumed 11 times more power than the ARM core. Because the cycle count is reduced by more than three orders of magnitude, the energy consumed per 32-point FFT using our approach was nevertheless less than 1.5 % of that of a pure embedded software solution.
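The figure of less than 1.5 % can be reproduced with a back-of-the-envelope calculation. The sketch below is an estimate under assumptions that the text does not spell out: both approaches run at the same clock rate, so energy is proportional to power multiplied by cycle count; the software approach draws only the ARM core power; and "11 times more power" is read as the FFT hardware drawing 11 times the ARM core power, with the ARM core optionally still active during the hardware FFT.

```python
CYCLES_SOFTWARE = {8: 59500, 16: 114800, 32: 208000}  # Table III, embedded approach
CYCLES_CODESIGN = {8: 50, 16: 100, 32: 150}           # Table III, codesign approach
FFT_TO_ARM_POWER = 11.0  # FFT hardware power relative to the ARM core (from the text)


def energy_ratio(points: int, arm_active: bool = True) -> float:
    """Energy per FFT of the codesign approach relative to pure software.

    Assumes equal clock rates, so energy scales as power * cycles.
    """
    relative_power = FFT_TO_ARM_POWER + (1.0 if arm_active else 0.0)
    return relative_power * CYCLES_CODESIGN[points] / CYCLES_SOFTWARE[points]


for n in (8, 16, 32):
    print(n, f"{100 * energy_ratio(n):.2f} %")
```

Under these assumptions, all three FFT sizes come out at roughly 1 % of the software energy, which is consistent with the stated bound.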

In order to evaluate the accuracy of the FFT computation, the fixed-point solution adopted in our design was compared against a floating-point solution (computed on a PC). Formulas (3)-(8) were used:

$$
S_{out\_mat\_r} = \sum_{k=0}^{N} \left| X_{out\_mat\_k\_r} \right| \tag{3}
$$

$$
S_{out\_mat\_i} = \sum_{k=0}^{N} \left| X_{out\_mat\_k\_i} \right| \tag{4}
$$

$$
S_{out\_vcs\_r} = \sum_{k=0}^{N} \left| X_{out\_vcs\_k\_r} \right| \tag{5}
$$

$$
S_{out\_vcs\_i} = \sum_{k=0}^{N} \left| X_{out\_vcs\_k\_i} \right| \tag{6}
$$

$$
Error_r = \frac{\left|\, |S_{out\_mat\_r}| - |S_{out\_vcs\_r}| \,\right|}{|S_{out\_mat\_r}|} \tag{7}
$$

$$
Error_i = \frac{\left|\, |S_{out\_mat\_i}| - |S_{out\_vcs\_i}| \,\right|}{|S_{out\_mat\_i}|} \tag{8}
$$

$X_{out\_mat\_k\_r}$ and $X_{out\_mat\_k\_i}$ were the real and imaginary parts calculated in floating point, and $X_{out\_vcs\_k\_r}$ and $X_{out\_vcs\_k\_i}$ were the real and imaginary parts calculated in fixed point using our approach. The results are given in Table IV. They indicate that the average relative errors, which were mainly caused by quantization, were approximately 3 %.

TABLE IV. ERROR RATE BETWEEN FIXED AND FLOATING POINT CALCULATION

|           | FFT 8 | FFT 16 | FFT 32 |
|-----------|-------|--------|--------|
| $Error_r$ | 2.9%  | 3.1%   | 3.6%   |
| $Error_i$ | 0%    | 2.0%   | 2.8%   |
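The error metric of formulas (3)-(8) is straightforward to compute once both sets of FFT outputs are available. The sketch below is one possible implementation, with the floating-point reference passed as Python complex numbers and the fixed-point results as integer pairs. The function name and the handling of an all-zero reference part are assumptions, since the text does not say how a zero denominator was treated.

```python
def relative_errors(reference, fixed):
    """Relative error of the summed |real| and |imag| parts, per formulas (3)-(8).

    `reference` holds complex floating-point FFT outputs; `fixed` holds
    (real, imag) integer pairs from the fixed-point computation.
    """
    s_ref_r = sum(abs(x.real) for x in reference)   # formula (3)
    s_ref_i = sum(abs(x.imag) for x in reference)   # formula (4)
    s_fix_r = sum(abs(r) for r, _ in fixed)         # formula (5)
    s_fix_i = sum(abs(i) for _, i in fixed)         # formula (6)
    # Assumption: report zero error when the reference sum itself is zero.
    err_r = abs(s_ref_r - s_fix_r) / s_ref_r if s_ref_r else 0.0  # formula (7)
    err_i = abs(s_ref_i - s_fix_i) / s_ref_i if s_ref_i else 0.0  # formula (8)
    return err_r, err_i
```

Because the metric compares sums of magnitudes, errors of opposite sign in different bins can cancel, so it measures the aggregate deviation rather than the worst-case error in any single bin.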

## VI. CONCLUSIONS

A low power biomedical digital signal processor ASIC based on a hardware and software codesign methodology was presented. The ASIC was fabricated in SMIC 0.18 μm mixed-signal CMOS 1P6M technology. The die area measures 5000 μm by 2350 μm. Simulation results indicated that, for FFT calculations, the dedicated architecture consumes less than 1.5 % of the power of a pure embedded software approach. In the future we will characterize the error propagation of the scalable FFT and develop more sophisticated codesign strategies to host a greater variety of biomedical spectrum analysis applications.

## **REFERENCES**

- [1] E. N. Bruce, Biomedical Signal Processing and Signal Modeling. New York: IEEE Wiley, 2001.
- [2] R. M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach. IEEE Press, 2002.
- [3] S. Cerutti, "In the Spotlight: Biomedical Signal Processing," IEEE Reviews in Biomedical Engineering, vol. 1, pp. 8-11, 2008.
- [4] A. P. Yoganathan, R. Gupta, and W. H. Corcoran, "Fast Fourier transform in the analysis of biomedical data," Medical and Biological Engineering and Computing, vol. 14, no. 2, pp. 239-245, 1976.
- [5] M. Cesarelli, F. Clemente, and M. Bracale, "A flexible FFT algorithm for processing biomedical signals using a personal computer," Journal of Biomedical Engineering, vol. 12, no. 6, pp. 527-530, 1990.
- [6] L. Basano and P. Ottonello, "Real-Time FFT to Monitor Muscle Fatigue," IEEE Transactions on Biomedical Engineering, vol. BME-33, no. 11, pp. 1049-1051, Nov. 1986.
- [7] B. Lo, S. Thiemjarus, R. King, and G.-Z. Yang, "Body Sensor Network - A Wireless Sensor Platform for Pervasive Healthcare Monitoring," Adjunct Proceedings of the 3rd International Conference on Pervasive Computing (PERVASIVE 2005), pp. 77-80, May 2005.
- [8] L. Wang, S. Thiemjarus, B. Lo, and G.-Z. Yang, "Toward A Mixed-Signal Reconfigurable ASIC for Real-Time Activity Recognition," 5th International Workshop on Wearable and Implantable Body Sensor Networks (BSN 2008) and 5th International Summer School and Symposium on Medical Devices and Biosensors (ISSS-MDBS 2008), Hong Kong, Jun. 1-3, 2008.
- [9] T. Sumanaweera and D. Liu, "Medical image reconstruction with the FFT," in GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, M. Pharr, Ed. Addison-Wesley, Mar. 2005, pp. 765-784.
- [10] G. De Micheli and R. K. Gupta, "Hardware/software co-design," Proceedings of the IEEE, vol. 85, no. 3, pp. 349-365, Mar. 1997.
- [11] www.arm.com
- [12] L. Benini, G. De Micheli, E. Macii, et al., "Address Bus Encoding Techniques for System-Level Power Optimization," Design, Automation, and Test in Europe, Part IV, 2008.1, pp. 275-289.
- [13] H. Mehta, R. M. Owens, and M. J. Irwin, "Some issues in gray code addressing," Proceedings of the Sixth Great Lakes Symposium on VLSI, pp. 178-181, Mar. 22-23, 1996.
- [14] C. H. Sung, K. B. Lee, and C. W. Jen, "Design and implementation core," Asia-Pacific Conference on ASIC, pp. 295-298, 2002.
- [15] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Transactions on Electronic Computers, vol. 13, no. 2, pp. 14-17, 1964.
- [16] A. D. Booth, "A signed binary multiplication technique," Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236-240, 1951.