# A new VLSI system for adaptive recursive filtering \*

Kam Hoi CHENG \*\* and Sartaj SAHNI

Computer Science Department, 136 Lind Hall, University of Minnesota, Minneapolis, MN 55455, U.S.A.

Received July 1986

Abstract. We develop an efficient bidirectional chain VLSI system for the adaptive recursive filtering problem. Our design is an improvement over previous designs. It matches the performance of a broadcast chain but does not use the broadcast capability.

Keywords. VLSI architectures, systolic systems, adaptive recursive filtering.

# 1. Introduction

VLSI architectures for a variety of problems have been proposed by several authors. A bibliography of over 150 research papers dealing with this subject appears in [6]. In this paper, we are concerned solely with the adaptive recursive filtering problem. The input to this problem is an  $n \times w$  matrix A of weighting coefficients and a  $1 \times w$  vector  $(x_{1-w}, \ldots, x_0)$ . The output is a  $1 \times n$  vector  $(x_1, \ldots, x_n)$  where

$$x_{i} = \sum_{j=1}^{w} a_{ij} x_{i+j-w-1}, \quad i = 1, 2, \ldots, n. \tag{1}$$

In evaluating a VLSI design, we assume that the VLSI system will be attached to the host processor using a bus as in Fig. 1. The evaluation of a VLSI design should take the following into account:

- (1) *Processors*: how many processors are used in the VLSI system? This figure is denoted by P.
- (2) Bus bandwidth: the maximum amount of data to be transmitted between the host and the VLSI system in any cycle. This figure is denoted by B.
- (3) Speed: how much time does the VLSI system need to complete its task? This time may be decomposed into the times  $T_{\rm C}$  (time for computations) and  $T_{\rm D}$  (time for data transmissions both within the VLSI system and between the host and the VLSI system).

Let C denote the time spent for computation by a single processor algorithm and D denote the total amount of data that needs to be transmitted between the host and VLSI system. As an example, consider the problem of multiplying two $n \times n$ matrices A and B to get Y. Each element of Y is the sum of n products. We shall count one multiplication and addition as one arithmetic (or computation) step. If the classical matrix multiplication algorithm is used, $C = n^3$. If P = n, then $T_{\rm C} \ge n^2$. The host needs to send $2n^2$ elements to the VLSI system and receive $n^2$ elements back. So, $D = 3n^2$. With a bandwidth of n, $T_{\rm D}$ must be at least 3n. $T_{\rm D}$ will exceed 3n if the bandwidth is not used to capacity at all times. For the adaptive recursive filtering problem, $C = nw$ and $D = nw + n + w$. If n processors are used, then $T_{\rm C} \ge w$ and $T_{\rm D} > w + 1$.

<sup>\*</sup> This research was supported in part by the National Science Foundation under grant MCS-83-05567.

<sup>\*\*</sup> Currently with the Computer Science Department, University of Houston, Houston, TX, U.S.A.

The ratio

$$R_{\rm D} = B * T_{\rm D}/D$$

measures the effectiveness with which the bandwidth B has been used. Clearly,  $R_D \ge 1$  for every VLSI design.

The ratio

$$R_C = P * T_C/C$$

measures the effectiveness of processor utilization. Once again, we see that  $R_C \ge 1$  for every VLSI design.

Finally, we may combine the two efficiency ratios  $R_{\rm C}$  and  $R_{\rm D}$  into the single ratio  $R = R_{\rm C} * R_{\rm D}$ . A design that makes effective use of the available bandwidth and processors will have R close to 1.

The efficiency measure R as defined here is the same as that used in [1-3] to evaluate VLSI designs for matrix multiplication and back substitution. This measure is also quite similar to that proposed in [4]. In fact, the two measures become identical when  $T_{\rm C} = T_{\rm D}$ .

In comparing different architectures for the same problem, one must be wary about over emphasizing the importance of  $R_{\rm C}$ ,  $R_{\rm D}$  and R. Clearly, by using P=1 and B=1, we get  $R_{\rm C}=R_{\rm D}=R=1$  but no speed up at all. So, we are really interested in minimizing  $T_{\rm C}$  and  $T_{\rm D}$  while keeping R close to 1.

VLSI architectures for the adaptive recursive filtering problem have been proposed earlier in [4,5,7–9]. The design of [4] uses a broadcast chain and has P=w, B=w+2, $T_{\rm C}=n+w-1$, $T_{\rm D}=n+w$, $R_{\rm C}\sim 1+w/n\sim 1$, $R_{\rm D}\sim 1+1/w+w/n\sim 1$ and $R\sim 1$. The design of [5] uses a bidirectional chain of processors. An improved version is described in [8]. For this, $P=\lceil w/2\rceil$, $B=\lceil w/2\rceil+2$, $T_{\rm C}=2n+w-2$, $T_{\rm D}=2(n+w-1)$, $R_{\rm C}\sim 1+w/(2n)\sim 1$, $R_{\rm D}\sim 1+3/w+w/n\sim 1$ and $R\sim 1$. The design of [7] uses a systolic ring architecture to solve the simple recurrence problem. It can be easily extended to solve the adaptive recursive filtering problem. This extension has $P=\lceil w/2\rceil$, $B=\lceil w/2\rceil+1$, $T_{\rm C}=2(n-1)+w$, $T_{\rm D}=2(n+w-1)+1$, $R_{\rm C}\sim 1+w/(2n)\sim 1$, $R_{\rm D}\sim 1+1/w+w/n\sim 1$ and $R\sim 1$.

While all the above designs have $R \sim 1$, the broadcast chain of [4] has a $T_{\rm C}$ and $T_{\rm D}$ that are about half those of the other designs. In this paper, we develop a bidirectional chain VLSI system that has the same (actually slightly smaller) $T_{\rm D}$ and $T_{\rm C}$ as the broadcast chain of [4]. For our design, P = w, B = w + 1, $T_{\rm C} = n + \lceil w/2 \rceil$, $T_{\rm D} = n + w + 1$, $R_{\rm C} \sim 1 + w/(2n) \sim 1$, $R_{\rm D} \sim 1 + w/n \sim 1$ and $R \sim 1$. Our design shows that a broadcast chain is not required to obtain this $T_{\rm C}$ and $T_{\rm D}$ performance.
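The ratios $R_{\rm C}$, $R_{\rm D}$ and R defined in this section can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name `efficiency` and the sample sizes n = 1000, w = 10 are assumptions chosen for illustration, while the P, B, $T_{\rm C}$ and $T_{\rm D}$ figures are the ones quoted above for the broadcast chain of [4] and for the proposed design.

```python
from math import ceil


def efficiency(P, B, T_C, T_D, n, w):
    """Return (R_C, R_D, R) for the adaptive recursive filtering problem.

    Uses C = n*w and D = n*w + n + w as stated in the text.
    The function name and signature are illustrative assumptions.
    """
    C = n * w              # single-processor computation steps
    D = n * w + n + w      # data items exchanged with the host
    R_C = P * T_C / C      # processor utilization ratio
    R_D = B * T_D / D      # bandwidth utilization ratio
    return R_C, R_D, R_C * R_D


# Sample problem size (an assumption for this sketch).
n, w = 1000, 10

# Figures quoted in the text for the proposed design and for [4].
proposed = efficiency(P=w, B=w + 1, T_C=n + ceil(w / 2), T_D=n + w + 1, n=n, w=w)
broadcast = efficiency(P=w, B=w + 2, T_C=n + w - 1, T_D=n + w, n=n, w=w)

# Degenerate case P = B = 1: all ratios equal 1, but there is no speed-up.
serial = efficiency(P=1, B=1, T_C=n * w, T_D=n * w + n + w, n=n, w=w)

print("proposed :", proposed)
print("broadcast:", broadcast)
print("serial   :", serial)
```

For n much larger than w, both parallel designs give ratios close to 1, in line with the asymptotic estimates in the text.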

# 2. o(n) throughput bidirectional chain

An o(n) throughput bidirectional chain for the adaptive recursive filtering problem can be obtained by extending the systolic design of [9] for the nonadaptive recursive filtering problem. This extension requires us to recast (1) into the following form:

$$\begin{aligned}
x_{i} &= \sum_{j=1}^{w} a_{ij} x_{i+j-w-1} \\
&= \sum_{j=1}^{w-1} a_{ij} x_{i+j-w-1} + a_{iw} \sum_{j=1}^{w} a_{i-1, j} x_{i+j-w-2} \\
&= \sum_{j=1}^{w-1} a_{ij} x_{i+j-w-1} + a_{iw} \sum_{j=0}^{w-1} a_{i-1, j+1} x_{i+j-w-1} \\
&= \sum_{j=0}^{w-1} b_{ij} x_{i+j-w-1}, \quad i \ge 1
\end{aligned} \tag{2}$$

where

$$b_{i0} = a_{iw} a_{i-1,1}, \qquad b_{ij} = a_{ij} + a_{iw} a_{i-1,j+1}, \quad 1 \le j \le w - 1, \tag{3}$$

$$a_{01} = 1, \qquad a_{0j} = 0, \quad 2 \le j \le w, \qquad x_{-w} = x_0. \tag{4}$$
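The equivalence of recurrences (1) and (2) under the conventions (3) and (4) can be checked on a small instance. The sketch below is illustrative only: the function names `filter_direct` and `filter_recast`, the dictionary storage of the $x_i$ and the integer test data are assumptions, not part of the original paper. Integer data is used so that both evaluations agree exactly.

```python
def filter_direct(a, x_init):
    """Evaluate recurrence (1). a is an n x w list of rows; x_init = [x_{1-w}, ..., x_0]."""
    n, w = len(a), len(x_init)
    x = {m: x_init[m + w - 1] for m in range(1 - w, 1)}   # keyed by the true index m
    for i in range(1, n + 1):
        x[i] = sum(a[i - 1][j - 1] * x[i + j - w - 1] for j in range(1, w + 1))
    return [x[i] for i in range(1, n + 1)]


def filter_recast(a, x_init):
    """Evaluate recurrence (2) with the b_ij of (3) and the boundary values of (4)."""
    n, w = len(a), len(x_init)
    x = {m: x_init[m + w - 1] for m in range(1 - w, 1)}
    x[-w] = x[0]                                          # x_{-w} = x_0, from (4)

    def coef(i, j):
        # a_ij, with row 0 defined by (4): a_01 = 1 and a_0j = 0 for j >= 2
        if i == 0:
            return 1 if j == 1 else 0
        return a[i - 1][j - 1]

    for i in range(1, n + 1):
        b = [coef(i, w) * coef(i - 1, 1)]                 # b_i0
        b += [coef(i, j) + coef(i, w) * coef(i - 1, j + 1) for j in range(1, w)]
        # Only x_{i-w-1} .. x_{i-2} are read here: x_{i-1} is not needed.
        x[i] = sum(b[j] * x[i + j - w - 1] for j in range(w))
    return [x[i] for i in range(1, n + 1)]


# Small integer instance (arbitrary sample data).
a = [[(3 * i + 2 * j) % 5 - 2 for j in range(4)] for i in range(7)]
x_init = [1, 2, 3, 4]
print(filter_direct(a, x_init) == filter_recast(a, x_init))
```

The point of the recast form is visible in `filter_recast`: the sum for $x_i$ reads no value newer than $x_{i-2}$, at the price of computing the $b_{ij}$ from two consecutive rows of A.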

To calculate the  $b_{ij}$ 's of (3) dynamically, w PEs in addition to the w+1 PEs used in [9] are needed. The performance figures of the resulting VLSI system are P=2w+1, B=2w+3,  $T_{\rm C} \sim n + [w/2]$ ,  $T_{\rm D} \sim n + w$ ,  $R_{\rm C} \sim 2 + 1/w + w/n \sim 2$ ,  $R_{\rm D} \sim 2 + 1/w + w/n \sim 2$  and  $R \sim 4$ .

Improved performance can be obtained by using the bidirectional chain architecture of Fig. 2. Each PE has the ability to add, multiply, and transfer data to/from its left and right adjacent processors. All the even numbered PEs are on the left, while all the odd numbered PEs are on the right. The output is generated from the middle PE, PE(w). The PEs to the left of PE(w) compute all terms involving even columns of A, while PEs on the right compute all terms involving odd columns of A. The case when w is odd is shown in Fig. 2(a). The case when w is even is shown in Fig. 2(b).

The middle processor, PE(w), has five registers: A, V, X, Y and Z. The remaining PEs have three registers (A, X and Y) each. We use the notation R(i) to denote register R, $R \in \{A, V, X, Y, Z\}$, of PE(i). The A register of each PE is used to hold an input value from the A matrix. PE(i) receives input from column i of A only, $1 \le i \le w$. The X register of each PE holds an $x_i$ value while the Y registers hold partial sums in the computation of an $x_i$. In each cycle, the X(i)'s move one step away from the center PE, PE(w), while the Y(i)'s move one step towards this PE.

The working of the VLSI system is described formally in Algorithm 2.1. The first for loop sets up the initial configuration. The three steps in the **parallel do** are executed simultaneously. When this for loop terminates, PE(w) contains  $x_p$  for  $p = \lceil (w-1)/2 \rceil - w = \lceil -(w+1)/2 \rceil$  in its X register. The X register of a PE that is a units away from PE(w) contains  $x_{p-a}$ . The second for loop contains two sets of concurrently executed statements. In the first set, i.e. first **parallel do**, essentially five concurrent activities are performed in each iteration of this loop:

- (1) PE(w) either inputs an $x_i$, $i \le 0$, or outputs a newly computed $x_i$, $i > 0$.
- (2) All X values move one PE away from the middle PE.
- (3) Each PE inputs an A value. Note that we assume $a_{ij} = 0$ for $i \le 0$.
- (4) All Y values move one PE towards the middle PE. However, the Y value from PE(w-1) is moved to the Z register of PE(w) rather than to its Y register (this latter register receives the Y value from PE(w-2)). The boundary PEs (1 and 2) reset their Y registers to zero.
- (5) From the data patterns of Fig. 2(a) and (b), we observe that if the Y value in PE(w-1) is a partial sum for $x_i$, then that in PE(w-2) is a partial sum for $x_{i+1}$. Hence, Y(w) and Z(w) contain incompatible partial sums. The partial sum in Y(w) is to be used in the next iteration. V(w) is used to save the previous value of Y(w). Consequently, V(w) and Z(w) contain partial sums for the same $x_i$.

Fig. 2. (a) w is odd; (b) w is even.

In the second **parallel do** set of statements, either a new term is added to a partial sum Y(i) or a new  $x_i$  computed. PE(w) computes a new  $x_i$  by computing (V(w) + Z(w)) and A(w) \* X(w) in parallel. The two results are then added (the operations may also be pipelined). Assuming that the time for an addition is no more than that for a multiply, the computation performed in PE(w) takes the same time as that performed in the other PEs.

```
for j ← 1 to ⌈(w-1)/2⌉ do
   do in parallel
      X(w) ← x_{j-w}
      X(w-1) ← X(w)
      X(i) ← X(i+2),  1 ≤ i ≤ w-2
   end
end
for j ← ⌈(w+1)/2⌉ to n+w do
   do in parallel
      case
         j < w+1 : X(w) ← x_{j-w}
         j = w+1 : X(w) ← x_0
         j > w+1 : output X(w)   { output x_{j-w-1} }
      endcase
      X(w-1) ← X(w)
      X(i) ← X(i+2),  1 ≤ i ≤ w-2
      A(w) ← a_{j-w,w}
      A(i) ← a_{j+⌊(w-i)/2⌋+1-w, i},  1 ≤ i ≤ w-1
      Y(1) ← Y(2) ← 0
      Y(i) ← Y(i-2),  3 ≤ i ≤ w
      V(w) ← Y(w)
      Z(w) ← Y(w-1)
   end
   do in parallel
      Y(i) ← Y(i) + A(i) * X(i),  1 ≤ i ≤ w-1
      X(w) ← (V(w) + Z(w)) + A(w) * X(w),  j ≥ w+1
   end
end
output X(w)   { output x_n }
```

Algorithm 2.1.
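The data movement of Algorithm 2.1 can be emulated sequentially to check that the values leaving PE(w) agree with recurrence (1). The following is a sketch under stated assumptions, not the authors' implementation: the function names `direct_filter` and `chain_filter` are hypothetical, registers are modelled as 1-indexed Python lists initialised to zero, the simultaneous assignments of each **parallel do** are emulated by reading old register values before writing new ones, rows of A beyond n are treated as zero (the text states this only for $i \le 0$), and $w \ge 3$ is assumed so that the boundary PEs are distinct from PE(w).

```python
def direct_filter(a, x_init):
    """Reference result: evaluate recurrence (1) term by term."""
    n, w = len(a), len(x_init)
    x = {m: x_init[m + w - 1] for m in range(1 - w, 1)}    # x_{1-w} .. x_0
    for i in range(1, n + 1):
        x[i] = sum(a[i - 1][j - 1] * x[i + j - w - 1] for j in range(1, w + 1))
    return [x[i] for i in range(1, n + 1)]


def chain_filter(a, x_init):
    """Sequential emulation of Algorithm 2.1 (index 0 of each register list is unused)."""
    n, w = len(a), len(x_init)
    assert w >= 3, "this sketch assumes w >= 3"

    def coef(r, c):
        # a_{rc}; rows outside 1..n are treated as zero (assumption for r > n)
        return a[r - 1][c - 1] if 1 <= r <= n else 0

    X, Y, A = [0] * (w + 1), [0] * (w + 1), [0] * (w + 1)
    V = Z = 0
    out = []

    def shifted_x(j):
        # One cycle of X movement, computed from the old register contents.
        new = X[:]
        if j <= w + 1:
            new[w] = x_init[min(j, w) - 1]     # x_{j-w}, and x_0 again when j = w+1
        else:
            out.append(X[w])                   # output x_{j-w-1}
        new[w - 1] = X[w]
        for i in range(1, w - 1):
            new[i] = X[i + 2]
        return new

    for j in range(1, w // 2 + 1):             # j = 1 .. ceil((w-1)/2)
        X = shifted_x(j)

    for j in range(w // 2 + 1, n + w + 1):     # j = ceil((w+1)/2) .. n+w
        # First parallel do: all right-hand sides use the old values.
        new_X = shifted_x(j)
        new_Y = [0] * (w + 1)                  # Y(1) = Y(2) = 0
        for i in range(3, w + 1):
            new_Y[i] = Y[i - 2]
        V, Z = Y[w], Y[w - 1]
        A[w] = coef(j - w, w)
        for i in range(1, w):
            A[i] = coef(j + (w - i) // 2 + 1 - w, i)
        X, Y = new_X, new_Y
        # Second parallel do: extend partial sums, or finish a new x value in PE(w).
        for i in range(1, w):
            Y[i] += A[i] * X[i]
        if j >= w + 1:
            X[w] = (V + Z) + A[w] * X[w]

    out.append(X[w])                           # output x_n
    return out


# Integer sample data (arbitrary) so that both results can be compared exactly.
w, n = 5, 8
a = [[(3 * i + 2 * j) % 5 - 2 for j in range(w)] for i in range(n)]
x_init = list(range(1, w + 1))
print(chain_filter(a, x_init) == direct_filter(a, x_init))
```

The second loop runs for $n + \lceil w/2 \rceil$ iterations, which corresponds to the $T_{\rm C}$ figure quoted for this design.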

Table 1 w = 5

| j | X(2) | Y(2) | X(4) | Y(4) | X(5) | Y(5) | V(5) | Z(5) | X(3) | Y(3) | X(1) | Y(1) |
|----|----------|-------|----------|-------|----------|-------|-------|-------|----------|-------|----------|-------|
| 1 | - | - | - | - | $x_{-4}$ | - | - | - | - | - | - | - |
| 2 | - | - | $x_{-4}$ | - | $x_{-3}$ | - | - | - | $x_{-4}$ | - | - | - |
| 3 | $x_{-4}$ | - | $x_{-3}$ | - | $x_{-2}$ | - | - | - | $x_{-3}$ | - | $x_{-4}$ | [1,1] |
| 4 | $x_{-3}$ | [1,2] | $x_{-2}$ | - | $x_{-1}$ | - | - | - | $x_{-2}$ | [1,3] | $x_{-3}$ | [2,1] |
| 5 | $x_{-2}$ | [2,2] | $x_{-1}$ | [1,4] | $x_0$ | [1,3] | - | - | $x_{-1}$ | [2,3] | $x_{-2}$ | [3,1] |
| 6 | $x_{-1}$ | [3,2] | $x_0$ | [2,4] | $x_1$ | [2,3] | [1,3] | [1,4] | $x_0$ | [3,3] | $x_{-1}$ | [4,1] |
| 7 | $x_0$ | [4,2] | $x_1$ | [3,4] | $x_2$ | [3,3] | [2,3] | [2,4] | $x_1$ | [4,3] | $x_0$ | [5,1] |
| 8 | $x_1$ | [5,2] | $x_2$ | [4,4] | $x_3$ | [4,3] | [3,3] | [3,4] | $x_2$ | [5,3] | $x_1$ | [6,1] |
| 9 | $x_2$ | [6,2] | $x_3$ | [5,4] | $x_4$ | [5,3] | [4,3] | [4,4] | $x_3$ | [6,3] | $x_2$ | [7,1] |
| 10 | $x_3$ | [7,2] | $x_4$ | [6,4] | $x_5$ | [6,3] | [5,3] | [5,4] | $x_4$ | [7,3] | $x_3$ | [8,1] |

Table 2 C = nw, D = nw + n + w

| Performance | Broadcast chain [4] | Bidirectional chain [5] | Bidirectional chain [9] | Bidirectional chain (ours) | Systolic ring [7] |
|------------------|-------|-------------------------|-------------|-----------------------|-------------------------|
| P | $w$ | $\lceil w/2 \rceil$ | $2w$ | $w$ | $\lceil w/2 \rceil$ |
| B | $w+2$ | $\lceil w/2 \rceil + 2$ | $2w$ | $w+1$ | $\lceil w/2 \rceil + 1$ |
| $T_{\rm C}$ | $n+w$ | $2n+w$ | $n + [w/2]$ | $n + \lceil w/2 \rceil$ | $2(n-1)+w$ |
| $T_{\rm D}$ | $n+w$ | $2(n+w-1)$ | $n+w$ | $n+w$ | $2(n+w-1)$ |
| $R_{\rm C}$ | 1 | 1 | 2 | 1 | 1 |
| $R_{\rm D}$ | 1 | 1 | 2 | 1 | 1 |
| R | 1 | 1 | 4 | 1 | 1 |

Table 1 is a timing diagram for the case w = 5, where j refers to the **for** loop index of Algorithm 2.1. For each PE, the table shows the contents of its registers following the execution of the **for** loop body for that value of j. The notation [i, p] denotes $\sum_{j=1,\, j \text{ odd}}^{p} a_{ij} x_{i+j-w-1}$ for PEs on the right of PE(w) and $\sum_{j=1,\, j \text{ even}}^{p} a_{ij} x_{i+j-w-1}$ for PEs on the left. Since w is odd, V(w) contains the sum of the odd terms while Z(w) contains the sum of the even terms.

The performance figures of this design are P = w, B = w + 1,  $T_C = n + \lceil w/2 \rceil$ ,  $T_D = n + w + 1$ ,  $R_C \sim 1 + w/(2n) \sim 1$ ,  $R_D \sim 1 + w/n \sim 1$  and  $R \sim 1$ .

## 3. Conclusions

We have developed a VLSI system for the adaptive recursive filtering problem that has $T_{\rm C}$ and $T_{\rm D}$ that are o(n) and also has $R \sim 1$. Previously, this had been done only for the case of VLSI systems using the broadcast capability. Our design does not employ this capability. The performance characteristics for various VLSI systems for the adaptive recursive filtering problem are summarized in Table 2. In going through this table, one should keep in mind that the different systolic solutions require PEs of different complexity.

Further improvement in throughput (at the expense of design complexity) is possible. However, this cannot be obtained using recurrence (1) as in order to compute  $x_i$ , we need to know  $x_{i-1}$ . Hence  $x_i$  can be computed, at best, one cycle after  $x_{i-1}$  has been computed. However, we can bring both  $T_C$  and  $T_D$  down to o(n/2) by computing two  $x_i$ 's each cycle using recurrence (2) and formulae (3) and (4). The idea is the same as used in the back substitution problem in [2]. The VLSI system that incorporates this uses more hardware and is quite a bit more complex. The method may be extended to get a  $T_C$  and  $T_D$  of o(n/k) for any fixed k.

# References

- [1] K.H. Cheng and S. Sahni, VLSI systems for matrix multiplication, in: Foundations of Software Technology and Theoretical Computer Science, Lecture Notes in Computer Science (Springer, Berlin, 1985) 428–456.
- [2] K.H. Cheng and S. Sahni, VLSI architectures for back substitution, in: H.-J. Kugler, ed., *Information Processing* '86 (North-Holland, Amsterdam, 1986) 373–378.
- [3] K.H. Cheng and S. Sahni, VLSI architectures for LU decomposition, in: Proc. 20th Annual Hawaii International Conference on System Sciences, Vol. II (1987) 177-187.
- [4] K.H. Huang and J.A. Abraham, Efficient parallel algorithms for processor arrays, in: *Proc. IEEE International Conference on Parallel Processing* (1982) 271–279.

- [5] H.T. Kung and C.E. Leiserson, Systolic arrays for VLSI, Department of Computer Science, Carnegie-Mellon University, 1978.
- [6] H.T. Kung, A listing of systolic papers, Department of Computer Science, Carnegie-Mellon University, 1984.
- [7] H.T. Kung and M. Lam, Wafer scale integration and two level pipelined implementations of systolic arrays, J. Parallel and Distributed Processing 1 (1) (1984).
- [8] C.E. Leiserson, Area-Efficient VLSI Computation (MIT Press, Cambridge, MA, 1983).
- [9] Y. Robert and M. Tchuente, Designing efficient systolic algorithms, Laboratoire IMAG, BP 68, F38402 Saint Martin D'Heres Cedex, 1984.