EBLAST - A High Compression Image Transformation

Mark S. Schmalz Gerhard X. Ritter

Department of Computer and Information Science and Engineering

University of Florida, Gainesville, FL 32611

Frank M. Caimi

Department of Electrical and Software Engineering

Harbor Branch Oceanographic Institution, Fort Pierce, FL 34946

ABSTRACT

Exploitation of limited communication channel bandwidth typically requires an encoder that has low error rate and high compression ratio, especially when the channel has intensive error correction. For example, noisy channels such as acoustic uplinks in underwater (UW) communication can have costly error correction processes that increase effective source entropy and thus increase compression requirements.

In this paper, we present a high-compression image transformation (called EBLAST for Enhanced Blurring, Local Averaging, and Thresholding) that facilitates image transmission along low-bandwidth channels such as acoustic modems, and exhibits low mean-squared error (MSE) with spatially uniform reconstruction error. Although initially applied to tactical communication problems, this innovative transform Blurring, Local Averaging, and Thresholding), can be used in commercial applications such as videotelephony. EBLAST's compression ratio (CR), which usually ranges from 200:1 to as high as 250:1 on underwater imagery, can in some cases be increased by follow-on Huffman encoding to yield CR approximating 300:1 with no additional information loss. Additionally, in previous work the authors showed that restriction of the source image to typical targets of interest, followed by compression, could increase CR to 20,000:1 or greater. The background information is characterized, then discarded prior to compression. In the decompression step, the background is approximately reconstructed from statistical parameters, which can provide verisimilitude for human target cueing applications.

Analyses of EBLAST's formulation and implementation emphasize assessment of EBLAST's performance on a suite of innovative image quality measures as well as more traditional metrics such as computational cost, ease of parallelization, and scalability. Due to near-uniform spatial distribution of reconstruction error, EBLAST is suitable for ATR using compressed or decompressed imagery. EBLAST is expressed in image algebra, a rigorous, concise notation that unifies linear and nonlinear mathematics in the image domain. Image algebra has been implemented on a variety of workstations and parallel processors. As a result, our algorithms are widely portable.

Keywords: Image compression, Parallel computation, SIMD mesh

1. INTRODUCTION

The compression of digital imagery for surveillance applications that employ power-limited autonomous vehicles has been a topic of keen research interest which has not been addressed effectively using traditional signal processing techniques. For example, streaming compression transforms such as Huffman encoding or delta modulation can be implemented in embedded digital signal processors (DSPs) or parallel SIMD meshes with small memory models [-,-,-]. In practice, such methods do not provide sufficient compression for transmission of realistically large images across noisy, low-bandwidth channels such as underwater acoustic modems [-,-]. The preceding comments also hold for established block-oriented transforms such as JPEG [-], visual pattern image coding (VPIC [-]), and various types of non-hierarchically structured transform coding techniques [-,-,-]. In contrast, pyramid-structured transforms such as EPIC [-], SPIHT [-], and wavelet packet compression [-,-] produce sufficiently high CR, but req uire that the entire source image be present in memory at some stage of the compression process. This requirement is prohibitive for real-time processing of moderate- to high-resolution images on DSPs or parallel meshes with small local memories. In response to this situation, a novel high-compression image transformation called EBLAST has been developed to support reduction of the data burden inherent in multispectral underwater imagery, for transmission along acoustic communication channels. EBLAST has advantages of (a) high compression ratio (100:1 < CR < 250:1), and (b) block structure that facilitates parallel computation on SIMD meshes and other types of parallel processors.

Unlike hierarchically-structured (e.g., wavelet-based) transforms, EBLAST processes image data in blockwise fashion. For example, let an MxN-pixel image be partitioned into KxL-pixel rectangular encoding blocks, and assume that each pixel requires m bits of storage. If one encoding block is stored per processing element (PE), then KLm bits of local memory are required. If K,L = 10 (a typical value) and m = 8, then 800 bits of memory are required. Although an image-template convolution is required by ELBAST compression and decompression processes, convolution can be performed locally using the shift-and-add paradigm if the templates are small (a typical constraint). Small template size is crucial to minimize boundary effects, which can impact storage efficiency. For example, if K,L = 10 and a 3x3-pixel template is employed, then 44 pixels would be stored locally in addition to the KL = 100 pixels within each encoding block. If m = 8, then 144(8) = 1152 bits would be stored locally. This situation is advantageous for embedded processors on board power-limited platforms such as underwater autonomous vehicles (UAVs). Additionally, since EBLAST computation is generally constrained to within-block operations, power consumption can be confined primarily to within-ALU operations. As a result, power consumption associated with I/O-bound processing (e.g., hierarchical transforms on SIMD meshes [-]) is circumvented in EBLAST.

In this paper, the background and basic theory of EBLAST is presented in Section 2, with error analysis and estimation of noise sensitivity given in Section 3. Examples of EBLAST application to UW image compression are given in Section 4, together with an implementational analysis of computational requirements and comparison of EBLAST with currently-employed compression transforms. Conclusions and suggestions for future work are presented in Section 5.

2. BACKGROUND AND THEORY

EBLAST evolved from an experimental transform called BLAST (unrelated to the medical image compression procedure of the same name [-]), which was discovered in the course of image enhancement that supported the authors' research in underwater target recognition. In particular, it was found that imagery preprocessed by a sharpening filter, then compressed with JPEG or VPIC, yielded improved subjective visual quality in the decompressed image, versus decompressed imagery obtained in the absence of presharpening. Although it was initially conjectured that sharpening merely enhanced greyscale gradients that defined target features, in-depth analysis showed that JPEG and VPIC each implement a type of low-pass filter, but in different ways. For example, JPEG discards small-magnitude coefficients of the discrete cosine transformation that often represent high spatial frequencies, whereas VPIC represents a source block in terms of its mean. Using the VPIC algorithm as a first-order model, it was possible to verify that the presharpening step linearly combines source pixels, so as to project information from a given source block to adjacent encoding blocks. This increases the probability that information from adjacent blocks can be (a) averaged with the current block, thereby making the source image more compressible; and (b) retrieved during decompression via an approximation to the inverse of the convolution template employed in the block averaging process.

It is important to note that EBLAST achieves its high compression ratio through block averaging of the presharpened source image. To facilitate higher compression than can be obtained from the source quantization scheme, EBLAST also quantizes either (a) the source block greyscales or (b) the block mean into m greylevels, where m < n, the number of source greylevels. For example, in some types of noise eight-bit imagery, it is possible to quantize the block mean to three bits using statistical sampling techniques, without significant loss of source information.

In this section, we discuss the components of EBLAST, as follows. Section 2.1 contains an analysis of the quantization scheme. Section 2.2 discusses utility of rectangular or nonrectangular block domains employed during averaging. In Section 2.3, a comprehensive analysis of template formulation and application for compression and decompression is given, which can partially compensate quantization and averaging errors. Section 2.4 summarizes block and image enhancement strategies that are designed to make the decompressed image appear more realistic for human target cueing applications.

2.1. Grayscale and Block Gradient Quantization

In order to make a source image more compressible, one can apply quantization to the source greylevels, thereby reducing the number of bits required to represent information between the perceived black and saturation levels (gb and gs) of an image display. In this study, gb and gs were estimated by querying users about the just-noticeable differences in greylevel at the extrema of a linear greyscale displayed on a monitor placed in a darkened room. Also, it was found that quantization of the computed block gradient magnitude tends to reduce block entropy at the cost of accuracy of representation and, hence, of reproduction in the decompression process.

2.1.1. Observation. In practice, we have found that grayscale information that occurs within the interval [gb, gs] subtends approximately 50 to 60 percent of the greyscale range R of the source image. Given the typical case where [gb, gs] appears in the middle of R, is not unreasonable to assume that [gb, gs] would subtend the middle three quintiles of R.

2.1.2. Analysis. It has been shown that human observers typically judge 27 greylevels to be sufficient for visually acceptable rendering of a continuous-appearing digital greyscale [-]. Assuming that a greyscale range R partitioned into quintiles has 27 greylevels, then linear quantization would yield 5.4 greylevels per quintile. Given Observation 2.1.1, the middle three quin-

tiles of R would represent 16.2 = 5.4(3) greylevels. Thus implies that the perceived visible greyscale range [gb, gs] of an image could be linearly quantized into 16 greylevels using four bits, which approximates the computed partition width of 16.2 grey-

levels within an error of less than 1.5 percent of [gb, gs].

2.1.3. Example. Figure 1b illustrates the effect of greyscale restriction (per Observation 2.1.1) applied to the source image shown in Figure 1a. Note the preservation of visible image features, and the observation of features in dark areas, as shown in the error image of Figure 1c, where MSE = ___ percent of full scale (percent fs). Linear quantization of Figure 1a into 27 greylevels is shown in Figure 1d, and quantization of the source image given in Figure 1b into 16 greylevels is shown in Figure 1e. The difference between Figure 1d and the quantized restricted image of Figure 1e is shown in Figure 1f. Note that the errors which occur in the pseudolinear (smooth) areas of face, wall, and mirror reflection are typical of linear quantization errors. The MSE of Figure 1f equals _____ percent fs, an increase of _____ percent fs over the MSE of Figure 1c.

 

Figure 1. Effect of greyscale restriction and quantization on visual appearance of a digital image: (a) lena source image,

(b) restriction of source image to [gb, gs], (c) error image (absolute difference between a) and b), (d) quantization of a) to 27

greylevels, (e) quantization of the middle three quintiles of a) to 16 greylevels, (f) error image (difference between d) and e).

2.1.4. Observation. Nonlinear quantization can be employed to reduce the number of source greylevels within the restricted

greyscale range [gb, gs]. Various types of statistical quantization schemes have been reported in the literature [-,-,-], for exam-

ple, linearly bounded m-tiles or partition boundaries determined from models of the human visual system (HVS). In practice, depending on image statistics, desired contrast, and viewing conditions, one could select a brightening function (e.g., a sinusoidal greyscale transform) or a contrast enhancement operation such as the logarithm of the greylevel (per Example 2.1.6).

2.1.5. Alternate Approach. An alternative approach is to apply histogram equalization to the restriction of the source image

to R = [gb, gs], which renders the cumulative histogram linear within R. Thus, an m-tile partitioning of the histogram-equalized image yields a linear quantization of the resultant contrast-enhanced greyscale.

2.1.6. Example. An outdoor image of ships is shown in Figure 2a, which is restricted in Figure 2b to the middle three quintiles of its greyscale range, per the technique of Sections 2.1.1-2.1.3. Figures 2c and 2d illustrate linear quantization of the greyscale to 4 bits per pixel (bpp) and 3 bpp, respectively. The error images and error histograms of Figures 2c (and 2d),computed with respect to the source image, are given in Figures 2d and 2e (2g-h). As predicted by theory [-], the MSE of Figure 2f (having value ____ percent fs) extends the MSE of Figure 2c (at ___ percent fs) by ____ percent fs. In contrast, Figure 2i is enhanced with a sine transform applied to the greyscale values, which yields the error image (error histogram) shown in Figure

2j (2k). Per the approach outlined in Section 2.1.5, histogram equalization of Figure 2a restricted to [gb, gs] yields Figure 2l, whose error image at MSE = _____ percent fs and error histogram are given in Figures 2m and 2n, respectively.

 

Figure 2. Different methods of quantizing a greyscale-restricted image: (a) source image of ___x___ pixels, (b) restriction of a) to the middle three quintiles of its greyscale, (c-d) quantization of b) to four bpp with associated error image and histogram, (f-h) quantization of b) to three bpp with associated error image and histogram, (i-k) sine transformation of greyscale in b) with associated error image and histogram, and (l-n) histogram equalization of b), with associated error image and histogram, where the error histograms associated with images i) and l) were computed with respect to sine(a) and histogram-eq(a).

The error associated with linear or nonlinear quantization methods can be determined from the source image histogram, and can be modelled in an information-theoretic sense, as discussed in the following development.

2.1.7. Assumption. Let X denote an MxN-pixel domain, and let source image be input to the histogram operation to yield histogram h.

2.1.8. Observation. Given Assumption 2.1.7, let be linearly quantized into n < m bins, where . Thus, m/n bins of h(a) are combined into one bin of the quantized image , whose histogram is denoted by . This means that the errors due to quantization can be computed from a quantized value , and its corresponding bins in .

We formalize this observation, as follows.

2.1.9. Definition. Given Assumption 2.1.7, let range(a) be linearly quantized into n < m greyvalues, by a map . Given an error threshold , it is said that q has an e-dual if and only if

,

where (|| ||) denotes the norm. Let the projection such that if , then .

2.1.10. Lemma. Given Assumption 2.1.7 and Definition 2.1.9, if both and its histogram h are known, then the mean error per quantization bin inherent in the linear quantization q(a) is given by

,

Proof. From Definition 2.1.9, each greylevel f in is mapped to one greylevel . For each of values , the quantization error f-q(f) occurs h(f) times. q

2.1.11. Assumption. Given Definition 2.1.9, let the nonlinear quantization map , where n < m and m/n is nonconstant. In the most general case, given , quantization bins in range(a) are collected in

,

which we assume for purposes of convenience to be an indexed set.

A more general expression for quantization error can now be stated in the following theorem.

2.1.12. Theorem. Given Assumption 2.1.11, if a source image and its histogram are known, then the mean-squared error per quantization bin due to the quantization of a by is given by

.

Proof. The theorem generalizes Lemma 2.1.10 via Assumption 2.1.11, which replaces the assumption that range(a) is partitioned into m/n equally spaced bins. q

2.1.13. Observation. In Assumption 2.1.11, the various bins contained in C can be chosen by a variety of methods. The

authors prefer to first restrict range(a) to the interval R = [gb, gs], clamping pixel values less (greater) than gb (gs) to the value gb (gs). Given a source image , such clamping is expressed in image algebra as follows:

.

The resultant image b is then input to a cumulative histogram operator , from which the n-tiles are obtained (over [gb, gs] quantized to Zn) that constitute the elements of C.

2.1.14. Complexity. The restriction shown in Observation 2.1.13 requires O(|X|) comparisons, while construction of c(a) requires |X| + 2m additions. Determination of the n-tiles of c(a) requires m comparisons, and construction fo the quantization lookup table (LUT) that implements q requires O(n) I/O operations.

2.2.15. Example. Given Theorem 2.1.12, the occurrence of quantization error for the source image of Figure 3a is illustrated

in Figures 3b,c,d for m-tile based quantization with m = ___, ___, and ___ bins, respectively. The graph of versus a(x) for the preceding values of m is shown in Figure 3e, while is graphed as a function of m in Figure 3f.

 

 

Figure 3. Occurrence of error in various m-tile based quantization maps: (a) source image

a; (b-d) m-tile quantization of a restricted to [gb, gs] at m = ___, ___, and ___ bpp, respectively; (e) graph of as a function of a(x); and (f) graph of as a function of m.

We next examine the effect on encoding block shape and size on the compression ratio and reconstruction quality of a blockwise-compressed image.

2.2. Block Parameter Effects

The shape of encoding blocks tends to influence the reconstruction (decompression) error at a given compresison ratio, as discussed in Section 2.2.1 and following. However, regardless of the block shape, the effect of block size on compression ratio is direct and can likewise influence quantization error.

2.2.1. Definition. An image transform is denoted as .

2.2.2. Definition. The domain compression ratio of a transform is given by

2.2.3. Definition. The maximum word width required to encode values in a set S is denoted by siz(S).

2.2.4. Example. If and the interval is linearly quantized into n bins, then a Boolean encoding of S requires bits per value in S. However, linear quantization may not distinguish closely adjacent values in S.

2.2.5. Definition. The range compression ratio of a transform , denoted by , is given by

2.2.6. Definition. The compression ratio of a transform is defined as

It follows that .

2.2.7. Definition. A transform T is called a compressive or compression transform if and only if .

2.2.8. Observation. As CRD increases, more values in the input of T are grouped together by T, and are represented by fewer bits in the output of T as CRR increases. The employment of various quantization strategies yields different CRR. However, in block-structured compression transforms, CRD is typically a function of blocksize. For example, a KxL-pixel block having n bpp, which is represented in terms of m < n bpp in the compressed image, has CRD = KL and CRR = n/m.

2.2.9. Example. Let M,N = 1024 and K,L = 8, and assume that a source iamge has 8-bit pixels. If the quantization map q defined in Section 2.1.9 transforms greyscale information into 3 bpp of compressed data, then the compression ratio componnents are given by

and ,

which implies a net compression ratio of .

2.2.10. Observation. Encoding blocks need not be rectangular -- in practice, we have found that elliptical encoding blocks tend to reduce perceived blocking effect in a decompressed image. While rectangular blocks have the advantage of precise tiling of the source domain, non-rectangular blocks can be more problematic, especially when one attempts to resolve issues of block overlap, for example, in elliptical blocks. The following algorithms have been found to be useful in this study.

2.2.11. Assumption. Let a finite source domain be partitioned into encoding blocks, each of which has a distinct reference point y in the compressed domain Y. Denoting the partitioning function as , further assume that a dual of g can be expressed as the map , which returns the coordinates in X of the y-th encoding block.

 

Figure 4. Coverage of image domain by elliptical encoding blocks have KxL-pixel bounding boxes: (a) maximum partial coverage with no overlap, (b) partial coverage with overlap, and (c) total coverage with no overlap.

2.2.12. Observation. Given Assumption 2.2.11, let a source image be partitioned into elliptical encoding blocks, each having a bounding box of size KxL pixels. Ellipse overlap is computed as follows.

Case 1. To achieve maximum partial coverage of X with no overlap, the elliptical encoding blocks must be arranged as

shown in Figure 4a. Thus, if an ellipse occupies a fraction fE of its bounding box, then a fraction (1 - fE) of the source data will be ignored. Since this corresponds to the probability of detection of a source pixel (denoted by

Prd), it might initially appear attractive to apply information theoretic analysis to compute detected entropy HE of image a at a given fE, as follows:

,

where Pr(a(xi)) denotes probability of a source pixel value at source point .

Case 2. Total coverage of X can be achieved using elliptical blocks, but with overlap, as shown in Figure 4c. This increases fE to 1.0, hence (1 - fE) equals zero, and all data is accounted for in the encoding block. Unfortunately, the redundancy that results from block overlap can decrease the compression ratio, as shown below.

2.2.13. Effect of Overlap. Assume that an ellipse of major axis K pixels and minor axis length L pixels is inscribed within a

KxL-pixel rectangle, as shown in Figure 4a. The area of the rectangle equals KL pixels, and the area of the ellipse is pKL/4

pixels. Hence, fE = p/4. Conversely, if the ellipse circumscribes the rectangle, then fE = 4/p. Thus, the range represents the ellipse area sampling factor by which the domain compression ratio CRD is increased (decreased) at the mini-

mum (maximum) of fE due to undersampling (oversampling) of the source domain.

An additional effect induced by elliptical encoding blocks configured as shown in Figure 4b,c is the presence of interstitial structure, which causes a fixed pattern error in the compression transform. This error is discussed as follows.

2.2.14. Effect of Interstitial Structure. Elliptical encoding blocks have the disadvantage that interstitial gaps, occurring

when fE < 1, can cause source featurs to be distorted or ignored if such features are subtended by one or more gaps. Conversely, when fE > 1, oversampling increases redundancy and thus partially defeats the purpose of compression. The effect of fE of varying the ellipse major axis length L and minor axis length K (per the geometry of Figure 4) is illustrated in Figure 5. The graphs of fE shown in Figure 5 were computed by constructing ellipses as shown in Figure 4, then measuring overlap at various resolutions, determined by blocksize. Due to the quantization effects encountered at selected block resolution levels,

certain of the contour lines have a stepwise appearance.

 

Figure 5. Ellipse overlap fraction fE as a function of minor and major axis length K and L, respectively, for K,L = 6, 8, and 10 pixels. These encoding block dimensions are typical values for EBLAST compression.

2.2.15. Observation. One might reasonably ask if elliptical blocks are the only feasible block shape. The authors' research has shown that other block shapes, such as serpentine tiles (shown in Figure 6a) or a combination of octagons and squares (Figure 6c) also reduce blocking effect. Unfortunately, the overlap effect discussed in Section 2.2.13 tends to increase in the presence of serpentine blocks, since an enclosed (and enclosing) rectangle is filled less by a serpentine block than by an ellipse, as can be readily observed by inspection of Figure 6b. Although the space-filling octagon-square pattern does not have overlap problems, it tends to induce blocking effect at its horizontal and vertical boundaries. Hence, both patterns have been found less suitable as encoding block domains than ellipses or rectangles.

 

Figure 6. Additional non-rectangular block shapes investigated by the authors include (a) interlocking serpentine blocks having (b) less space-filling area than an ellipse incscribed in the corresponding bounding box, and (c) an octagon-square pattern.

 

Theory presented in this paper is expressed in terms of image algebra, a rigorous, concise notation for image and signal processing developed at UF under sponsorship of DARPA and the US Air Force for over a decade [14]. Image algebra unifies linear and nonlinear mathematics in the image domain, and has been implemented on a variety of workstations and parallel processors, including Honeywell's Parallel Recirculating Pipeline processor (PREP) [15], Lockheed-Martin's PAL series of processors [5], ERIM's Cytocomputer [16], the Maspar MP-2 [17], and networks of Inmos Transputers [18]. Under DARPA funding, UF is currently developing an implementation of image algebra for reconfigurable hardware such as FPGAs [19].

2.1.1. Assumption. Let X or Y denote a point set (usually a subset of Euclidean n-space Rn), and let F or G denote a value set (customarily a subset of the reals R). Set operations include union ( ), intersection ( ), cardinality (|S|), subtraction (\), choice, and complement.

2.1.2. Definition. An image is a mapping from a point set X to a value set F, denoted by . For example, X can be an MxN-pixel array. A real-valued image can be equivalently represented by its graph . The domain and range of image a are denoted by and .

2.1.3. Definition. The restriction of image a to a subset W of X is denoted by . A restriction of to a subset of a that is characterized by a property S of F is given by

 

For example, given a threshold , if then .

2.1.4. Definition. A template is a mapping from a point set to an image set. For example, an F-valued template from Y to X, denoted by the map , has values denoted by , and weights that are denoted by . The graph representation of t is given by . The point y is called the target point of t, and is called the target pixel of t.

2.1.5. Definition. Given an image , a template , and the associative, commutative operations , the backward image-template operation, also called the right product, is given by

 

If , with and , then the right product is instantiated as the image-template convolution

.

Reference 14 contains a discussion of operations between templates, which are similar to image-template operations.

2.2. Image Compression.

Our theoretical development is based on the formal description of an image transformation which is elaborated in [20].

2.2.1. Definition. An image transform is denoted as .

2.2.2. Definition. The domain compression ratio of a transform is given by

2.2.3. Definition. The maximum word width required to encode values in a set S is denoted by siz(S).

2.2.4. Example. If and the interval is linearly quantized into n bins, then a Boolean encoding of S requires bits per value in S. However, linear quantization may not distinguish closely adjacent values in S.

2.2.5. Definition. The range compression ratio of a transform , denoted by , is given by

2.2.6. Definition. The compression ratio of a transform is defined as

It follows that .

2.2.7. Definition. A transform T is called a compressive or compression transform if and only if .

2.3. SIMD and multi-SIMD Architectures.

A formal model of a generalized SIMD mesh is presented, which is elaborated to yield a model of MSIMD structure.

2.3.1. Definition. A SIMD mesh has domain X represented as an MxN-point array. Let the mesh be denoted by

,

where each processing element (PE) is represented by an n-element vector , that contains parameters such as register contents, functionality (including faulting or failing behaviors and exceptions), and a unique pro-

cessor address for each PE. The vector a can also include I/O buffer parameters and values of data input or output to the mesh.

2.3.2. Observation. At a given time t, all processors of a SIMD mesh have the same functionality. However, the execution of certain operations may be blocked locally by a mask bit stored in a given PE's register. The mask bit is input from a mask image broadcast to the mesh a priori. This type of instruction masking can be advantageous, for example, when spatial permutations are implemented, where adjacent PEs could transfer data from opposite directions.

2.3.3. Definition. A multi-SIMD mesh is a SIMD mesh whose functionality can vary per PE. Typically, the mesh domain X is subdivided into q contiguous (customarily rectangular) partitions . Each partition has functionality fi, which may differ from the functionality fj of another partition Wj, as shown in Figure 1a.

2.3.4. Observation. A given PE may perform either a computational operation or an I/O task (e.g., NEWS communication if X is four-connected). Thus, an MSIMD mesh may be configured as an ensemble of computational partitions (also called functional units) that communicate externally to the mesh and with each other via I/O-intensive partitions. The I/O partitions can

be treated as a single functional unit, such as the tree-structured bus schematically diagrammed by partitions f6a through f6d in Figure 1b. This technique is extensible to the regular array of functional units shown schematically in Figure 1c. Additionally,

if I/O-intensive PEs are located at the borders of the mesh, then buffered external I/O can be implemented as shown in Figure 1d. This technique can be advantageous for processing stream-compressed (or -encrypted) data by first decompressing (decrypting) the data within the boundary PEs before transferring data to the mesh.

 

Figure 1. MSIMD mesh with (a) differing functionality per partition; (b) computational units f1 through f5 and data routing partitions f6a through f6d.; (c) functional blocks connected by I/O buses or embedded in an I/O-intensive surround; and (d) boundary PEs that can function as buffered parallel buses for external I/O.

2.3.5. Assumption. Given current backplane technology, it is reasonable to assume that X is a circulant array, but no subset of X is circulant in two dimensions. However, a given partition of X may be circulant in one dimension if it includes boundary processors, for example, the partition having functionality f1, which is shown schematically in Figure 1a.

2.3.6. Observation. The I/O buses referred to in Observation 2.2.4 can be employed in data transmission between pipeline segments, which includes the task of data or instruction forwarding. This admits pipelining as a technique for mapping ISP/ATR algorithms to multi-SIMD processors. Far from being a pedagogic curiosity, pipelined multi-SIMD processing is a technique that would allow fast, dynamic reconfiguration at medium granularity (i.e., at the functional unit or pipeline segment level), to meet a variety of time-varying input conditions. For example, statistical nonstationarities due to changes in image content (important for context-dependent transforms) could be partially detected or compensated via adaptive pipelining.

2.4. Pipelined Processing with SIMD and Multi-SIMD Architectures.

This section contains an overview of the implementational strategy employed in this study.

2.4.1. Assumption. Let a transform be decomposed into functional units , ..., , ..., , where 1 < i < n, such that, given a source image , the transformed image . Also assume that each fi has local communication (e.g., with fi+1) and fast communica-

tion with fi+k .

2.4.2. Observation. Numerous image operations and transforms can be implemented in pipelined fashion, provided that forward and backward data dependencies do not exceed the data forwarding capabilities of a given pipeline's hardware implementation. In reconfigurable pipeline processors where pipeline segment functionality can be dynamically programmed, it is possible to buffer a given forwarding from a pipeline configuration at time t to a subsequent configuration at time t+Dt. This capability facilitates the construction of a virtual pipeline P, whose length n can exceed the number of segments m in a physical pipeline P' that performs the computation of P. We discuss implementational issues of this concept, as follows.

 

Figure 2. Three cases of pipeline processing: (a) logical pipeline P with m stages can be accommodated

by an n-segment physical pipeline P', provided that ; (b) implementation of P on P' requires reconfigurations per invocation of P, as well as the use of output buffers; and (c) if data forwarding is present, reconfiguration requires no more than n forwarding buffers in addition to an output buffer.

2.4.3. Example. Three cases pertain to reconfigurable pipelines, as shown in Figure 2:

Case 1. Assume that a static pipeline P has m logical segments comprising a computational cascade with functionality denoted by . If P can be implemented in hardware that supports each segment in a static pipeline segment, as shown in Figure 2a, then reconfiguration is unnecessary, and this is not an interesting case.

Case 2. Let the assumption concerning P's functionality made in Case 1 hold, but let m exceed the number of physical segments n available in hardware. Two alternatives are of interest. First, one could group the functionality of several segments of P into one segment of the n-segment pipeline, to yield n logical segments, albeit at reduced

throughput due to longer compute times per segment. Second, one could implement P on a reconfigurable pipeline P' that has n segments and requires reconfigurations. However, this implies that the pipeline would require output buffering after each reconfiguration to provide input to the next reconfiguration, as shown sche-

matically in Figure 2b. The associated context switching and I/O overhead could be prohibitive if timing issues are crucial, for example, in real-time implementations.

Case 3. A more interesting case occurs when forwarding (a) exists between stages i and i+k of P, where k > 1 and , and the (b) spans reconfiguration boundaries. In such cases, the use of forwarding buffers would be required, to preserve forwarded data between segment reconfigurations, as shown schematically in Figure 2c.

2.4.4. Remark. An implementational issue arises regarding the number of forwarding buffers (no more than n per reconfiguration) and their implementation on a SIMD mesh. A solution is readily obtained when tagged data are employed. For example, if a datum x is to be forwarded from the i-th pipeline segment to the (i+k)-th segment, where i < n and i+k > n, then the sequence (f, x, i, i+k) would be transmitted, where f denotes a forwarding tag. Note that the transmission of the index i potentially supports process rollback resulting from pipeline configuration error or failure of the (i+k)-th segment. This allows a row of PEs that are adjacent to the PEs configured for pipeline computation to serve as both an I/O bus (per Figure 1b) and as a series of forwarding buffers, which terminate in the (n+1)-th PE of Figure 3.

 

Figure 3. PEs in a multi-SIMD architecture configured for data forwarding on an n-segment pipeline.

2.4.5. Observation. An additional issue concerns the timing of data forwarding operations in relationship to logical, arithmetic, and transcendental operations. For example, in Figure 3a, the forwarding PEs each would require two cycles for buffered data transfer between adjacent PEs, namely, (1) NEWS communication, and (2) copying of data from the NEWS buffer register to a local memory partition reserved as a forwarding buffer. Thus, forwarding data along n processors of the pipeline shown in Figure 3a would require 2(n-1) clock cycles. For large n, this may seem prohibitive. However, if this cost is amortized per PE, then the computational complexity in cycles per PE tends to predominate over I/O cost, and is not reduced by data transfer along the forwarding PEs. The following example is illustrative.

2.4.6. Example. Let a logical pipeline P mapped to an n-segment physical pipeline P'require k additions, p multiplications, and q transcendental operations, where it is assumed that k+p+q = n, for purposes of simplicity (e.g., Case 1, Example 2.4.2).

Letting each addition, multiplication, and transcendental operation require Na, Nm, and Nt cycles, respectively, the mean work requirement per processor is given by

cycles/PE .

For example, if n = 16, k = 10, p = 4, and q = 2, with Na = 20, Nm = 200 and Nt = 1000, with one operation computed per pipeline segment, then

= 187.5 cycles/PE .

Thus, in this example, the computational cost far exceeds the previously-mentioned forwarding cost of two cycles per PE. Current research indicates that this observation is typical for compression transforms thus far investigated.

We next analyze the complexity and pipelining of several common compression transforms.

3. IMAGE COMPRESSION TRANSFORMS

For the purposes of this study, the following classes characterize image compression transforms:

Class 1. Streaming transforms, in which source pixels are streamed into an encoder (e.g., by scanning a source image in normal order), thereby producing an output (compressed) bitstream.

Examples: Huffman encoding, digital pulse code modulation, and delta modulation.

Class 2. Block transforms, which partition a 2-D source domain into multi-pixel encoding blocks (e.g., KxL-pixel rectangular partitions). The encoder operates on each source block individually, often independent of other blocks.

Examples: JPEG, MPEG, VPIC, VQ, and EBLAST.

Class 3. Pyramidal transforms represent a source image at multiple resolution levels (i.e., hierarchically) via a structure such as a quadtree or octree. The encoder usually computes over the entire image representation at some stage of the compression process, which typically requires a large memory model.

Examples: Wavelet-based compression [13], EPIC [12], and SPIHT [21].

Streaming transforms are often amenable to pipeline implementation. By configuring one or more functional units of an FPGA (or processing elements/partitions of an MSIMD mesh) as connected pipeline segments, implementation on small processors is feasible, with small cost per segment.

Block transforms require memory capacity that is usually proportional to blocksize, and tend to consume more cycles per block than streaming transforms consume per pixel. However, it is unclear if block transforms are more efficient per pixel for a given compression ratio, all other parameters being equal. Additionally, disadvantages of blocking effect and representational locality often imply tradeoffs between compression ratio and visual image quality in high-compression block transforms.

Pyramidal transforms represent an image globally at multiple resolution levels, and can generally achieve improved image quality over block or streaming transforms, especially at high compression ratios. As mentioned previously, required computation over the entire image typically involves large local memory that is often unavailable on many DSPs, SIMDs, or FPGA-based processors. Separable linear transforms may not suffice for such applications, since the row (column) transforms of a 2-D image on X must be marshalled (at O(|X|) storage requirement) prior to application of the column (row) transforms.

3.1. Huffman Encoding.

Let B = {0,1} and let the map denote Huffman encoding, which employs (1) quantization to map each source pixel value to the interval Zm = [0,m-1]; (2) a codebook constructed from input statistics, via which each quantized pixel value is mapped to a bit string of maximum length n; and (3) concatenation of bitstrings to yield the compressed version of source image .

3.1.1. Complexity. If c contains words of average length k bits, then |X| table lookups and I/O operations are required for Huffman compression.

3.1.2. Parallel implementation. Let X be divided into P partitions of equal size, which can be concatenated in normal scanning order to yield X. Let each Wi be input to one of P pipelines, each of which consist of (a) a table lookup pro-

cessor that outputs the bit string in c corresponding to a source pixel value from block b = , and (b) a concatenation processor that assembles the bit strings into a representation of b. This operation would be followed by a concatenation net-

work that would assemble the bit string on Y which represents source image a, as shown in Figure 6a.

3.1.3. Implementational complexity. The table lookup processor that implements the quantization map requires O(1) time and O(m) memory in one PE. The concatenation processor requires stages of a tree-structured network comprised of PEs, each having a maximum of KLn bits storage.

3.2. Visual Pattern Image Coding (VPIC).

VPIC is a codebook-based, lossy compression transform that partitions a source image blockwise, then encodes each block in terms of a best-match codebook exemplar and several statistical parameters. VPIC's chief advantages are (a) a small codebook, usually comprised of four to eight exemplars; (b) low computational cost that is linear in terms of blocksize; and (c) low representational error for many types of natural scenes. VPIC compression ratios typically range from 20:1 to 45:1.

3.2.1. Definition. Given an MxN-pixel array X and an image , let a function , where , index each

KxL-pixel partition of a. Such partitions are called encoding blocks, which are customarily rectangular. For a given , let h have a dual representation that returns the domain points in X of the y-th encoding block, such that . Additionally, let the function locate a landmark in such as the block centroid.

3.2.2. Definition. Given Definition 3.2.1, an encoding block in is defined as .

3.2.3. Algorithm. Given the preceding definition, for each block , the VPIC transform is computed as follows:

Step 1. Partition b(y) into top, bottom, left, and right halves bt, bb, bl, bt from which the gradient D is computed as

.

The block mean m = S b/KL is clamped to the interval [gb,gs], delimited by black and saturation greylevels.

Step 2. Given a prespecified threshold T, if D < T, then b is a mean block and ac(y) = (f,m) is stored, where flag f = 0. Step 1 is then applied to the next block.

Step 3. Compute block orientation q and thresholded bitmap d, as follows:

.

Step 4. Determine index j of the exemplar in c that best matches d, and store ac(y) = (f,m,D,q,j), where f = 1 denotes an edge block.

All values comprising the Nm (Ne) mean (edge) blocks of the compressed image are subsequently reduced to packed binary format. If the compressed image is large, then lossless compression such as Huffman or Lempel-Ziv encoding may be em-

ployed to further increase the compression ratio, given sufficient inter-block correlation.

3.2.4. Observation. Decompression involves scaling the u-th exemplar pattern for mean and gradient g, where the latter can be thought of as being similar to the block standard deviation. The decompression step therefore requires one multiplica-

tion and KL additions per edge block, as well as KL I/O operations per mean block. The compression ratio is thus proportional to the fraction of edge blocks, i.e., , where denotes the number of mean blocks.

3.2.5. Parallel Implementation. Due to its block structure, VPIC is amenable to SIMD-parallel computation. Since mean and edge blocks are associated with different computations, the piecewise reconfigurability of MSIMD meshes is also useful for parallel VPIC computation. In practice, computation of VPIC mean blocks can be decomposed into the following stages: (1) block mean, (2) block gradient and thresholding, and (3) storage of mean block parameters. For edge blocks, Step 3 is replaced by computation of block gradient orientation, and the following steps are added: (4) exemplar pattern matching and (5) storage of edge block parameters.

Using the partitioning strategy discussed in Section 2.4, a three-segment VPIC pipeline can be implemented as follows.

Segment 1 (denoted by f1) would compute the block mean, while f2 would compute the block gradient. The latter operation can be performed efficiently if block partition sums (e.g., S and S ) are forwarded from f1 to f2. The classification of a block as mean or edge would be implemented in f2, which would cause f3 to (a) store (0,m) in the case of a mean block, or (b)

compute gradient orientation q in the case of an edge block. If an edge block was present, block thresholding and exemplar

pattern matching would be performed in f3 , which would store ac(y) = (f,m,D,q,j), as discussed in Section 3.2.3. This implies reconfigurability of f3, to permit storage of (0,m) or computation of q, and also implies that f3 would be reconfigured with sufficient speed to avoid penalizing the computation of q (the uncommon and therefore non-default case).

3.2.6. Complexity. Input of each block to f1 requires KL I/O operations, which would require KL/p cycles for p-byte parallel data transfer. Block mean computation in f1 requires KL-1 additions and one division, with the consequent time requirement:

,

where , , and denote the time delay associated with I/O, addition, or multiplication. Given forwarding of the four block partitions from f1 to f2, gradient computation in f2 requires three additions and one multiplication for each component and , as well as one addition, two multiplications, and one square root to compute D. Given the overhead of gradient thresholding and comparison, we thus obtain the following time requirement for f2:

,

where and denote the time delay associated with the operations of square root and comparison. Given forwarding of and to f3, together with d, the thresholded version of b, the reconfiguration of f3 , denoted by , could compute q in one application of an inverse tangent function. Given a codebook c of Q exemplars stored in local memory, correlation of d

and c using the Hamming distance as a matching criterion would require QKL exclusive-or operations and Q(KL-1) summa-

tions. Assuming 16-bit I/O between PEs and a reconfiguration time denoted by , the total time for f3 is given by

,

where the left member describes the cost of storing a mean block representation, while the right member corresponds to the processing associated with an edge block. The predominant impact of edge blocks manifests as an increase by a factor of Q in the number of addition and comparison operations.

3.3. Vector Quantization.

Let an image be indexed by a function h, per Definition 3.2.1, to yield a collection of KxL-pixel encoding blocks

denoted by A = .

3.3.1. Algorithm. Vector quantization of a source image proceeds as follows:

Step 1. Derive a codebook from A (e.g., via the Lloyd-Max clustering algorithm [10] or Kohonen net [22]).

Step 2. For each block in A, find the index of the best-match exemplar in c, under constraint of an error bound . Set .

Decompression involves substitution of the codebook exemplars indexed by to yield an approximation .

3.3.2. Complexity. Given a codebook that is chosen a priori, VQ can provide an efficient image compression technique. The correlation overhead involved in Step 1 of the preceding algorithm entails O(|X|) multiplications and additions, which can be computed blockwise and in parallel. Decompression is less costly, requiring only |X| I/O operations.

3.3.3. Observation. The chief advantage of VQ decompression is computational efficiency. A secondary advantage is that the prespecification of exemplar matching error implies a rigorous error bound per block, provided that the codebook accurately portrays test image statistics. Unfortunately, image training sets from which VQ codebooks are typically constructed tend to incompletely describe test imagery. In such cases, the representation error in the decompressed image can exceed the error bound.

The chief disadvantage of VQ is the high cost of codebook construction, with codebook search overhead and representational error being secondary deficiencies. Additional problems can accrue from the presence of blocking effects, similar to JPEG-compressed imagery. Although block boundaries can be partially obliterated by application of a smoothing operator, high-frequency detail tends to be lost.

3.3.4. Technique. Recent advances in VQ emphasize reduced error at high compression ratios using smaller codebooks of 32 to 64 exemplars [9]. An additional development has been the application of lossless, entropy-based coding schemes (e.g., Huffman coding) to the codebook indices that comprise pixel values of the compressed image. Since certain exemplars are employed more frequently than others, additional compression may thus be obtained, albeit with slightly increased decompression overhead of O(|Y|) I/O operations. The codebook search problem, which represents the computational bottleneck in VQ, can be addressed via the following three methods:

1) Brute-force pattern matching, which employs pointwise comparison of a KxL-pixel encoding block b(y) with each of Q codebook exemplars, thereby incurring work of W = QKL comparisons and additions. This cost can be shared over N parallel processors each having local memory of QKL storage elements, thus yielding a work reduction of W/N per processor, albeit with an overhead for combining results of O(logN).

2) Progressive search, in which b(y) is mapped to a Boolean vector that encodes one or more addresses of codebook vectors which yield a best match. Such vectors can be thought of as multilevel hash keys of n bits, which can be

decoded in O(log n) time. This scheme is described in detail in Reference [23], and can be implemented in parallel using from one to log(n) processors. In the latter case, O(log2n) cost is incurred in pipelined execution. Thus, with

O(log n) overhead, an source image having |Y| encoding blocks can be processed by a codebook search algorithm in time T(Y,n) = |Y| + O(log n).

3) Projective block indexing, in which feature measures are derived from a given block via projection of a subset of block contents to a multidimensional feature space. Combination of the resultant (m) n-bit feature vectors requires O(mn) work on an n-PE vector processor. Such combination can also be implemented in O(log m) time using 2mn processors in a tree-structured configuration. Examples of this technique include Tabular Nearest Neighbor (TNE) indexing [24], and a similar technique that lacks the sophistication of TNE's combination mechanism [25].

In this study, we emphasize Method 3, due to (a) efficiency and suitability for parallel computation, as well as (b) utility for

adaptive VQ-based compression [26].

 

Figure 4. VQ projective block indexing: (a) projection of two pattern clusters c1 and c2 to feature-space coordinate axes A1 through Am ; (b) pointwise combination of Boolean projection vectors to yield a result .

3.3.5. Parallel implementation. Let b denote a KxL-pixel source block and assume that a codebook exists that characterizes input within an error tolerance e. That is, a best-match exemplar d in c is constrained as || b - d || < e. Let m pro-

jections of b onto feature space F be constructed, thereby yielding an ensemble of vectors denoted by . Let the m vectors in v contain a unitary value where a given block b(i), i = 1..n, is represented by a cluster described by a projection of b to the i-th dimension, and zero elsewhere. Let denote an operation that combines to yield a result that represents the signature of b, from which an exemplar in c can be derived monotonically. This pattern matching process consists of two steps: (1) projection and (2) vector combination. The

first step requires the computation of projections from pattern cluster (e.g., exemplar region) boundaries to each feature-space axis, as shown in Figure 4a. If a test point p possibly lies within a given pattern cluster of the i-th projection, then a value is established. Otherwise, . Observe that this process merely determines the membership of the i-th projection of p in each bounding interval that is a projection of cj to the i-th feature-space axis. The class interval extrema

would be precomputed, leaving only the determination of membership, which can be computed in 2Qm comparisons. The intersection operation illustrated in Figure 4b requires Q(m-1) combination operations, where Boolean and is illustrated.

3.3.6. Implementational complexity. Given the preceding analysis, the total work requirement of projective classification is 2mQ comparisons and mQ combination operations (i.e., invocations of g). When implemented in terms of Q PEs, as shown in Figure 5a, VQ codebook search requires a time delay expressed as

.

If the cost of comparisons predominates, which is usual, then this case is useful when , i.e., for small codebooks that are highly overrepresented in the feature space. Here, classification can be performed as shown in Figure 4a, with no additional

search overhead, which is conceptually equivalent to a closed hashing scheme with one item per bucket. In contrast, when , the marshalling operation shown in Figure 5b is not required. Hence, it is more economical to perform 2Q comparisons per PE to establish the projection vectors , which are intersected pointwise, as shown in Figure 4b.

 

Figure 5. Schematic diagram of parallel implementation of VQ when (a) and (b) .

3.3.6. Example. Beginning with a large codebook, let Q = 16K and m = 256. In Figure 5a, 2(16K) + 3m = 32K + 768 = 33,536 PEs would be required, but only 256 comparisons and eight invocations of g are required per PE, with 16K I/O operations. In contrast, if Q = 256 and m = 256, then the architecture of Figure 5b requires 511 PEs, each of which implements 512 comparisons and eight invocations of g, whereas Figure 5a would require 32K comparisons and eight invocations of g.

3.4. JPEG.

JPEG is a well-established compression standard [8], which is based upon quantization and selection of coefficients derived from cosine transformation of encoding blocks of a two-dimensional source image .

3.4.1. Algorithm. Let the map denote JPEG encoding, which employs (1) partitioning of X into

KxL-pixel blocks, where K and L are customarily powers of two; (2) transformation of each source block by the discrete cosine transform (DCT); (3) quantization and selection of transform coefficients in each block, via an entropy criterion derived from a user-specified quality parameter; and (4) Huffman encoding of DCT coefficients into m bit vectors per block, where m can vary with block size or variance, and prespecified transform parameters.

3.4.2. Complexity. A separable 2-D DCT requires K column transforms, each having real multiplications, and L row transforms, each having real multiplications. Transform quantization, selection, and Huffman encoding require O(KL) work, in which I/O operations predominate if table lookup is employed.

3.4.3. Parallel implementation. Each encoding block can be thought of as inducing a separate JPEG pipeline, which has (a) block segmentation into L columns, followed by L DCT column transforms; (b) marshalling of the column transforms into a KxL-pixel array; (c) partitioning of the KxL-pixel array from b) into K rows, to which are applied K DCT row transforms; (d) marshalling of the K rows into a KxL-pixel block, (e) quantization and selection maps or lookup tables applied to the block's KL pixels, and (f) the application of Huffman encoding described in Section 3.1. Note that the quantization and selection maps can be replicated in rowwise fashion to facilitate parallelism, and that the use of row and column transforms represents rigorous practice, due to linear separability of the DCT.

 

Figure 6. Schematic diagram of pipelining strategy for (a) JPEG and (b) EBLAST compression.

3.4.4. Complexity. If each row (or column) DCT is implemented on one PE (assuming sufficient computational resources), then (or ) cosines and multiplications are required per PE. Marshalling requires O(K+L) I/O operations on a dedicated processor with K+L I/O lines, which represents a cost that is quadratic in KL. This could be achieved at

the cost of K+L rows of PEs, for a total of K(K+L) PEs per block. As depicted in Figure 6a, the cost of coefficient quantization and selection is thus 2L PEs. If L > K, then the block representation can be rotated to yield a space requirement of

S(K,L) = 2K(K + L + 2) PEs

for DCT, including coefficient quantization and selection. Due to the frequency of square macroblocks in JPEG, this consideration can usually be overlooked.

The preceding configuration presupposes that the PEs dedicated to DCT computation have sufficient arithmetic capability (e.g., real multiplication) to efficiently implement the DCT. If this is the case, then the time complexity can be expressed as

,

which includes the cost of Huffman encoding implemented per Section 3.1.

3.5. EBLAST.

Recently, a new image compression transform, called Enhanced Blurring, Local Averaging, Sharpening, and Thresholding (EBLAST) has been developed that is suitable for human target cueing as well as ATR applications. In underwater imaging applications, EBLAST has achieved compression ratios exceeding 250:1 with visually acceptable image quality [11].

3.5.1. Algorithm. The EBLAST transform, denoted by , employs (1) partitioning of the source domain X into KxL-pixel blocks; (2) preconditioning of the entire image (blockwise) by convolution with a template ; (3) computation of the block mean m; and (4) quantization of each block mean into . Decompression reverses this process by dequantization and projection of each block mean to its respective source block domain on X, followed by postconditioning via convolution with a template . If s and t are spatially invariant, then reconstruction error in-

volved in the decompression process tends to be spatially uniform, which is advantageous for ATR applications. Additionally, if there are n distinct values in the source image, then CR = KL(n/m).

3.5.2. Complexity. Let a source image be partitioned into KxL-pixel blocks . Given a block averaging and quantization operation , EBLAST compression is expressed as

This implies a work requirement that is dominated by

,

where denotes the support of template s at point . A similar number of additions are also required, with 2|X|/KL table lookup operations required for quantization.

3.5.3. Parallel implementation. EBLAST can be readily decomposed to facilitate parallel implementation, given the structure of underlying convolution, as well as block mean and quantization operations. Assume that each convolution is divided

into K row pipelines, each of which operates on L pixels (as shown in Figure 1). In order to implement the convolution without undue inter-processor I/O, each row convolution must operate on rows containing nL pixels each.

For example, if K,L = 10 and s is a 3x3-pixel template, then 30 pixels or 240 bits (assuming an 8-bit greyscale) of local

memory is required, which is not prohibitive in practice. The output of each row processor would then be L pixels, corresponding to rowwise convolution.

3.5.4. Implementational complexity. Observing that the block mean can be computed using a summation tree of depth 2K PEs, and quantization requires one PE (as shown previously), a total 3K+1 PEs is required. This implies a critical-path time

complexity of

.

The convolution operation shown in Section 3.5.2 requires only integer additions and multiplications. Per Reference 27, as few as eight bits of precision can be employed in EBLAST compression to yield acceptable MSE, which verifies the

suitability of integer arithmetic in this case. Space consumption is likewise reasonable, with PEs required per block if the parallel implementation emphasizes rowwise processing. By employing integer operations, the ratio can be reduced, which is advantageous for embedded SIMD meshes as well as reconfigurable logic implementations.

We next model practical pipeline architectures and MSIMD architecture designs for image compression applications.

4. PRELIMINARY MODELLING AND SIMULATION RESULTS

As part of a larger research and development effort in fault-tolerant control of adaptive or reconfigurable computing devices, two of the authors (MSS and GXR) are constructing models of various computational device types, with specific instances of each device. The models can be perturbed to simulate realistic faulting and failing behaviors. Example devices currently being modeled include Lockheed-Martin's PAL-I SIMD processor (385 MOPs peak bandwidth per card using eight-bit operations) and configurable logic devices such as the Xilinx XC62xx and 40xx series of field-programmable gate arrays. In this section, we concentrate on device models for the PAL-I processor (Section 4.1), then present preliminary modelling results for JPEG and EBLAST on these models (Section 4.2).

4.1. Example Device Models: PAL-I SIMD Processor.

The PAL-I processor is comprised of 9,216 PEs, each of which has an arithmetic logic unit (ALU) and memory (typically 4K bits), as illustrated schematically in Figure 7. At a high level, each PAL-I system is comprised of a PAL host (e.g., a sequential workstation) and one or more PAL units. Each PAL unit has one or two IPV cards that perform computational work, which is managed by an I/O controller (IOC) card, also shown in Figure 7. The IOC card contains an interface to the PAL host, a frame buffer for storage of image operands, and an instruction sequencer with memory. The host interface can convert middleware instruction issued by the PAL Host into PAL_WS library calls, which directly control the IPV card. In general, the IPV card contains a 4x6-element array of PAL chips, where each chip contains a 12x16-element array of PEs, described previously. The IPV card has I/O buffer memory, and the PEs are interconnected by a circulant mesh communication network with serial dependencies along rows and columns of the mesh.

The datapaths associated with a typical PAL image computation are illustrated schematically in Figure 8, where the names of major PAL components shown in Figure 7 are italicized. Dataflow between the host and the PAL unit is restricted to images and instruction bursts. In contrast, dataflow between the IOC and IPV cards is comprised of image partitions and Boolean instruction vectors. Instructions are passed in the traditional manner, from instruction memory to the I/O buffer and communication net (NEWS grid). However, PAL-I can also access each PE's ALU and memory, thereby facilitating efficient manipulation of Boolean data for purposes of optimizing image and signal processing operations such as morphological erosion or dilation. This feature is not modeled in the current research prototype, for purposes of simplicity. However, the efficient partitioning of image and template operands to facilitate mapping of image algebra to a PAL-I system is discussed in detail in Reference 28. Such partitioning algorithms and heuristics further support (a) realistic performance modelling and (b) efficient implementation of compression transforms on PAL and similar SIMD machines.

 

Figure 7. Low-Level Structural Model for the Lockheed-Martin PAL-I SIMD processor system.

 

Figure 8. Dataflow diagram for the Lockheed-Martin PAL-I SIMD processor system.

4.2. Simulation Model Results.

The dataflow graph of Figure 8 comprises a high-level architecture graph whose vertices are derived directly from the structural diagram of Figure 7. One can thus parameterize each element with measured performance information (e.g., channel error rate and bandwidth for communication elements such as the host-unit bus or NEWS grid), thereby facilitating derivation of a mathematically rigorous model of computation. Such models can be instantiated in software as discrete-event simulations or as analytical estimations of computational cost. This study emphasized the latter perspective. An example is shown in the computational cost models of Figure 9a, which are primarily concerned with the overhead incurred by arithmetic operations. Of key interest is the effect of the blocksize parameters K and L, which directly impact the domain compression ratio CRD discussed in Section 2.2.

An additional consideration is the ratio . For example, if cycles and cycles of the PAL mesh clock, then the graph of Figure 9b is obtained for JPEG and EBLAST. In the case of EBLAST, given K,L = 10 (a typical figure), setting yields the following approximation to the time complexity:

,

where the error of approximation is zero for 10x10-pixel blocks, and varies with increasing or decreasing blocksize. The adjustment of to yield more efficient processing can be facilitated in practice by the use of limited-precision arithmetic, per Reference 27. For example, this permits the space complexity of multiplication (which tends to vary nearly quadratically

with the number of bits precision B), to be reduced faster than the complexity of addition, which tends to be log-linear in B for certain types of adders. The space complexity directly impacts the pathlength along which electrons flow from input to output. As a result, time complexity of a given operation tends to vary log-linearly to quadratically with the number of bits precision,

where multiplication requires more resources than its component operation of addition. In practice, this also means that the only feasible way to decrease is to decrease . Hence, reduced-precision computation is indicated wherever supported by the target architecture.

 

Figure 9. Arithmetic complexity models of JPEG and EBLAST, as a function of blocksize,

where Dtmand Dta denote delays due to multiplication and addition.

It is readily verified from the graphs of Figure 9b and Figure 10 that the effect of decreasing the cost of multiplication with respect to addition is to cause the number of arithmetic cycles required by JPEG to approach that of EBLAST. For example, at fma = 3:1, JPEB requires 960 cycles for 8x8-pixel blocks, as opposed to 835 cycles for EBLAST, which represents a difference between JPEG and EBLAST of 125 cycles, or 14.9 percent of EBLAST's arithmetic complexity. In contrast, at fma = 2:1, JPEG (ELBAST) requires 720 (630) cycles for 8x8-pixel blocks, a difference of 90 cycles, or 14.2 percent of EBLAST's arithmetic complexity. Hence, it is reasonable to infer that JPEG is multiplication-bound with respect to EBLAST. In practice, this is a correct conclusion, due to the high incidence of multiplications in JPEG's cosine transformation step. If JPEG's requirement of real multiplication is accounted for (which it is not in this preliminary model), then the disparity in computational cost between the two transformations would further increase.

It is also interesting to note that EBLAST's arithmetic complexity scales linearly with K and L, which is unusual for compression transforms. This is advantageous for pipeline processing, since there is no critical blocksize at which EBLAST yields diminishing returns in terms of computational efficiency. As a result, EBLAST's compression ratio is limited in terms of a computational implementation by (a) local memory size, (b) error constraints on decompressed image quality, and (c) arithmetic precision required to achieve a given compression ratio under prespecified error constraints. Similar statements apply to JPEG, but with the caveat that the cost of compression and decompression increases more slowly with increasing blocksize. For example, in the data presented in Figures 9b and 10a, the arithmetic cost of JPEG appears to become quasi-linear after approximately the point K,L = 12 pixels, which decreases to approximately 10 pixels in Figure 10b. In future research, we plan to investigate this effect in a variety of transform-coding compression methods, with emphasis on wavelets.

 

Figure 10. Arithmetic complexity of JPEG and EBLAST, as a function of blocksize, for fma = 2:1 and 3:1.

A further advantage of EBLAST is its ability to accurately compress imagery where source blocks are statistically subsampled at ratios as high as 4:1 on blocks as small as 8 pixels square. In contrast, JPEG and commonly-employed block-structured transform coding methods are not necessarily amenable to input subsampling, which produces aliasing of high-frequency artifacts in the presence of missing pixels. As a result, research has shown the time complexity figure shown for EBLAST could be decreased by a factor of 4:1 with respect to the complexity of JPEG, without significantly decreasing the visual quality or mean-squared error of corresponding decompressed imagery. The linear scaling of EBLAST's arithmetic complexity as a function of the subsampling factor is due to (a) a linear scaling of complexity with blocksize, and (b) relative insensitivity of EBLAST to input noise or changes in input variance. The latter effect results from the block averaging process employed in EBLAST's compression step. This advantage bodes well for implementation on processor arrays with small local memories, especially if large blocksize is employed to achieve a high compression ratio.

5. CONCLUSIONS

The emergence of embedded parallel and reconfigurable processors has made the implementation of compression transforms on such devices attractive implementationally, due to advantages of functional flexibility, increased speed and throughput, as well as increased use of substrate that have not been realized in more conventional computing devices. This paper introduces problems and solutions associated with the parallel implementation of streaming transforms such as Huffman encoding and block oriented transforms such as JPEG, vector quantization, visual pattern image coding, or EBLAST, with additional discussion of pyramidally-structured transforms such as wavelet-based compression. It is shown that streaming and block-oriented transforms can be pipelined to yield efficient execution without sacrificing computational accuracy or fidelity in the reconstructed image. In contrast, pyramid-based transforms require computation over the entire source image at some stage of the compression algorithm. Such techniques are unsuitable for compression of large-format images on embedded parallel processors with small local memories.

This paper introduces techniques for pipelined configuration of a relatively new type of parallel processor, called a multi-SIMD or MSIMD mesh, which has advantages of (a) amenability to synchronous pipelined implementations, (b) local reconfigurability of locally heterogeneous functionality, and (c) only slightly increased complexity with respect to traditional SIMD meshes. Simulation models of Lockheed-Martin's PAL-I processor were modified to facilitate estimation of the arithmetic complexity of two block-structured compression transforms (JPEG and EBLAST) on MSIMD pipelines. It is shown that EBLAST incurs fewer arithmetic bottlenecks than JPEG (and consequently less arithmetic complexity) for a given blocksize. The parallel computation of EBLAST has two additional advantages for MSIMD meshes. Namely, EBLAST input can be subsampled at factor f to yield linear scaling of arithmetic complexity with f. Current research has shown that visually acceptable decompressed image quality persists up to f = 6:1, with acceptable error for ATR purposes at f < 4:1. The second advantage of EBLAST is suitability for computation with limited-precision arithmetic. For example, on SIMD meshes such as PAL-I, EBLAST can be computed at seven or eight bits precision without degradation in the visible quality of corresponding decompressed imagery. Additionally, mean-squared error less than 10 percent of full scale is thus encountered. However, JPEG does not exhibit the advantage of amenability to sub-sampling, although computation of JPEG at somewhat reduced precision (e.g., 10-12 bits for arithmetic operations) is feasible. The advantages of EBLAST bode well for implementation on small embedded processors in applications where low power consumption is a key constraint.

Future work involves implementation of fault injection processes for the MSIMD hardware models, followed by perturbation of the models to determine effects of faults such as bus errors and row/column defects in mesh processing elements. Subsequently, a variety of compression algorithms not reported in this paper would be tested on the enhanced MSIMD models.

REFERENCES

[1] Yates, R.B., N.A. Thacker, S.J. Evans, S.N. Walker, and P.A. Ivey. "Array processor for general purpose digital image compression", IEEE Journal of Solid-State Circuits 30(3):244-250 (1995).

[2] Caefer, C.E., J. Silverman, J.M. Mooney, A.P. Tannes, and V.E. Vickers. "Signal-processing algorithms for point-target detection in consecutive-frame staring imagery", Proceedings SPIE 2020:93-107 (1993).

[3] Lu, G. "Approach to image retrieval based on shape", Journal of Information Science 23(2):119-127 (1997).

[4] Yang, J.Y., Y. Lee, H. Lee, and J. Kim. "Variable length coding ASIC chip for HDTV video encoders", IEEE Transactions on Consumer Electronics 43(3): 633-638 (1997).

[5] Jackson, R.D., P.C. Coffield, and J.N. Wilson. "New SIMD computer vision architecture with image algebra programming environment", Proceedings IEEE Aerospace Applications Conference 1:169-185 (1997).

[6] Jeon, B., J. Park, and J. Jeong. "Huffman coding of DCT coefficients using dynamic codeword assignment and adaptive codebook selection", Signal Processing: Image Communication 12(3):253-262 (1998).

[7] Jain, A.K. "Image data compression: A review", Proceedings IEEE 69(2):349-389 (1981).

[8] Panchanathan, S., N. Gamaz, and A. Jain. "JPEG based scalable image compression", Computer Commun. 19(12):1001-1013 (1996).

[9] Masselos, K., P. Merakos, T. Stouraitis, and C.E. Goutis. "Novel codebook design techniques for vector quantization image compression", Proceedings IEEE International Symposium on Circuits and Systems 2:1321-1324 (1997).

[10] Scheunders, P. "Genetic Lloyd-Max image quantization algorithm", Pattern Recognition Letters 17(5):547-556 (1996).

[11] Schmalz, M.S., G.X. Ritter, and F.M. Caimi. "Performance evaluation of data compression transforms for underwater imaging and object recognition", Proceedings IEEE Oceans `97 Conference 2:1075-1081 (1997).

[12] Eastwood, R.L., L.E. Freitag, and J.A. Catapovic. "Compression techniques for improving underwater acoustic transmission of images and data", Proceedings IEEE Oceans `96 Conference (supplement) pp. 63-68 (1996).

[13] Meyer, F.G., A.Z. Averbuch, J.O. Stromberg, and R.R. Coifman. "Fast wavelet packet image compression", Proceedings of the 1998 IEEE Data Compression Conference, p.563 (1998).

[14] Ritter, G.X. and J.N. Wilson. Handbook of Computer Vision Algorithms in Image Algebra, Boca Raton, FL: CRC Press (1996).

[15] Honeywell, Inc. A Programmer's Guide to the Parallel Recirculating Pipeline (PREP), Internal Document (1988).

[16] Lougheed, R.M. and D.L. McCubbrey. "The Cytocomputer: A practical pipelined image processor", Proceedings of the Seventh International Symposium on Computer Architecture, pp. 411-418 (1980).

[17] Meyer, T. and J.L. Davidson. "Image Algebra preprocessor for the MasPar parallel computer", Proc. SPIE 1568:92-100 (1991).

[18] Crookes, D., P.J. Morrow, and P.J. McParland. "An algebra-based language for image processing on Transputers", Proceedings of the IEE Third Annual Conference on Image Processing and its Applications 307:457-461 (1989).

[19] Sweat, M. and J.N. Wilson. "An overview of AIM: Supporting computer vision on heterogeneous high-performance computing systems", in Proceedings SPIE 3452 (in press).

[20] Schmalz, M.S. On the Processing of Compressed and Encrypted Imagery, with Applications in Efficient, Secure Computation, Ph.D. Dissertation, University of Florida (1996).

[21] Said, A. and W.A. Pearlman. "New, fast, and efficient image codec based on set partitioning in hierarchical trees", IEEE Transactions on Circuits and Systems for Video Technology 6(3):243-250 (1996).

[22] Rezai-Rad, G.A. and R.J. Green. "Competitive learning algorithm for non-zero memory codebook design in encoding of CT sequences", Proceedings of the International Conference on Information, Communications and Signal Processing, ICICS `97 1:279-282 (1997).

[23] Krumm, J. "Object detection with vector quantized binary features", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1997, pp.179-185 (1997).

[24] Personal communication with G. Key, Frontier Technology, Inc. (1998).

[25] Kossentini, F. and M.J.T. Smith. "Fast PNN design algorithm for entropy-constrained residual vector quantization", IEEE Transactions on Image Processing 7(7):1045-1050 (1998).

[26] Shin, F.B., D.H. Kil, and G.J. Dobeck. "Impact of lossy image compression on automatic target recognition performance", Proceedings of the IEEE Oceans `96 Conference 2:943-948 (1996).

[27] Schmalz, M.S., G.X. Ritter, and F.M. Caimi. "Limits on computational precision of image compression transformations", in Proceedings SPIE 3456 (in press).

[28] Riedy, E.J. and Wilson, J.N. "Heuristic cost optimization for image template operations on the PAL-I SIMD-parallel image processor", UF/CCVV Technical Note #PAL-97-01 (1997).