GIS - Computational Problems: GIS Features

GIS - Computational Problems: § 1: GIS Features

Instructors: G.X. Ritter and M.S.Schmalz

The majority of GIS datasets are currently represented in vector format, which is convenient due to storage efficiency but can be difficult to manipulate analytically. To achieve tractability, vector format GIS data is often rasterized (e.g., conformed to a grid-cell representation). The processes involved in vectorization of map data as well as rasterization of vector data have indiosyncratic errors that manifest as representational error in a given GIS system.

Section Overview. In this section, for purposes of orientation, we overview the vectorization of GIS data, then discuss theoretical and practical issues associated with feature and dataset merging. It is important to note that this section does not have as its goal the development of a specific "best" technique or algorithm for performing vectorization or rasterization operations. The material is structured as follows:

In Section 2.1, we outline basic problems in dealing with vector-format GIS datasets, then discuss vectorization, rasterization, and dataset merging. Sections 2.2 and 2.3 detail the unsolved problem of determining feature type from map data, satellite imagery, etc., and resolving conflicting feature types or attributes in rasterized vector datasets or grid-cell data. Section 2.4 extends the preceding concepts to include the merging of disjoint datasets for which some coordinate information is available. Section 2.5 details a related sub-problem, namely, the coregistration of map information, imagery, and GIS data.

Goals. For the purposes of this course, goals for this section include:

Learning GIS terminology and background;
Exploring two dataset-related problems instantiated as merging and alignment of vector- or raster-format data;
Understanding GIS feature types, attributes, and errors inherent in GIS data, with the ability to estimate output error for GIS algorithms when input error is known; and
Prediction of computational cost and accuracy (e.g., the space-time-error bandwidth product) for a given algorithm.

Orientation, Terminology, and Problems. The process of meeting our goals will be more efficient if we understand some differences between GIS and the better-established discipline of cartography or map-making (which is often confused with GIS).

Goals: Cartography emphasizes map-making with some emphasis on analysis (e.g., derivation of map projections). In contrast, GIS emphasizes storage, retrieval, and analysis of spatial datasets.
Basis: In cartography, map information is referenced to a datum plane, whereas GIS seeks a common spatial domain to (hopefully) minimize coregistration error when comparing multiple spatial datasets.
Key Problems: Cartographers seek to maintain map databases with current, accuracy information, and render updates more accurate as with improvement of land surveying and bathymetry techniques (bathymetry = measurement of water depth). GIS deals with these problems as well, but also emphasizes the acquisition, formatting, merging, alignment, analysis, and reporting of spatial data about land use, function, features and attributes, etc.

GIS terms can often be confused with corresponding terminology and notation in cartography. Hence, we offer the following definitions:

Spatial Data are information about land or water that are referenced to a spatial domain (e.g., map coordinates such as latitude and longitude);
Spectral Data comprise signals, imagery, or information specific to one or more frequencies or wavelengths (e.g., passbands in the electromagnetic spectrum);
Resource Data comprise signals, imagery, or information concerning the availability or use of natural or manmade resources that are related to spatial (e.g., map) domains; and
Coregistered Data are two or more datasets (a) whose domains have a nonempty intersection W and (b) which have corresponding features in each dataset aligned on the same point in W.

Key Problems of Interest in this Section include the feature conflict resolution problem and the dataset merging problem. Exact or approximate solutions to these problems discussed in this section are intended to support further research in the higher-level problems of heterogeneous dataset merging and alignment.

Assumption.

F^X

ⁿ

The Feature Correspondence Problem can be stated as follows.

P_FCP: If N : X -> 2^X denotes a neighborhood on X, then given a point x X, does a point y X exist such that a|_N(x) and b|_N(y) correlate better than for any other choice of y?

Extension: For each z X, determine

d(z) = x(z) - y(z)

_FCP

ⁿ

the disparity map of

with respect to

Definition: Given d, if a (or b) is compensated for d to yield a (or b), such that maximal correlation exists between a|_N(x) and b|_N(x) for each x X, then b and a (a and b) are said to be coregistered.

The Dataset Merging Problem can be stated as follows:

P_DSM: Assume that there exists ground truth ground truth g F^X and a comparison function c : F × F -> R. If a(x) b(x) for some x X, then what mapping f : F^X × F^X -> F^X can be found (if any) such that the representational error

e = ||c(f(a,b), g)||

For example, if c is the subtraction operation, then minimization could be subject to a constraint such as:

|e| |c(a,g)| + |c(b,g)| .

As an aside, note that in the extension of the corresponding points problem, if the disparity map d is a constant image, then coregistration can be a trivial process, provided that significant data does not lie outside the common domain W. Otherwise, we say that d represents nonuniform disparity. In cases where the magnitude or direction component of the disparity d is a nonlinear function of x, we say that there exists anisomorphic disparity. For example, consider the simple cases of optical distortions such as barrel distortion or pincushion distortion, which nonlinearly warp an input image.

We next consider the processes of map segmentation into vector data sets, as well as rasterization and merging of vector data sets.

2.1. Polygonalization and Merging of Vector Data Sets.

Given a scene as shown in Figure 2.1, a corresponding map a on a two-dimensional domain X

R² contains |X| points x

X that may be expressed in terms of the map coordinates (x,y)

X. For example, one could have the customary coordinates of latitude and longitude, or survey measure from a given benchmark.

Figure 2.1. Schematic diagram of (a) natural scene and (b) corresponding 2-D map.

A given domain point x

X customarily has measurement error associated with locating an object on the map a. One way to express such error is via an abstraction called Circular Error Probable (CEP), as shown in Figure 2.2. For example, given North-South error

_NS (sometimes called Northing error) or East-West error

_EW (sometimes called Easting error), we can approximate CEP, denoted by

_CEP, as follows:

_CEP = ((_NS² + _EW²)/2)^1/2 ,

which assumes RMS error accumulation that may or may not hold in practice.

Figure 2.2. Schematic diagram of circular error probable.

The concept of statistical characterization of spatial error will be important to our subsequent analyses of GIS data and algorithm accuracy. In practice, spatial error propagates through the various phases of GIS data acquisition and processing algorithms, as shown in Figure 2.3. Hence, if we have analytical expressions for the constituent processes shown in Figure 2.3, we can apply well-known error analysis procedures to obtain error functions of these processes, from which estimations of GIS dataset error can be made.

Figure 2.3. Schematic diagram of error propagation in a GIS data acquisition, preprocessing, and storage system.

In order to better understand the mathematics of algorithms involved in GIS data processing and archival, we present the following discussion of vectorization and rasterization processes.

2.1.1. Overview of Vectorization and Rasterization

Map data can be converted to a more compact format that describes (a) the boundary coordinates of regions within the map, and (b) the contents of each region. By coercing each bounded region to have one and only one feature type, one can construct a map from a set of boundary coordinates to a set of feature types, which we call the vectorization map.

High-Level Vectorization Algorithm. In general, vectorization proceeds according to the following steps:

Step 1.

Step 2. Segment the map labelled in pointwise fashion in Step 1 to yield polygonal regions, each of which contain one and only one feature type or attribute set.

Step 3. Given the vertices of the polygons obtained in Step 2, represent each polygon as a path whose segments are described by their vertices. These path segments are called vectors and may be straight or curvilinear. For example, ARC/INFO uses curvilinear vectors that are circular arcs.

Step 4. Associate each polygon (represented as a chain of vectors or their vertices) with the geature type and attributes contained within the corresponding map region.

Step 5. Encode and store the output of Step 4. For example, one could use ARC/INFO, ERDAS, or IDRISI format, and could store the data to disk or digital tape.

Rasterizing might initially appear to be the inverse of vectorization, since it might appear that one merely converts between vector and map data, or vice versa. However, this is an oversimplification that yields little information about the subprocesses and errors that are idiosyncratic to each of these two important and markedly distinct processes.

In particular, we note that vectorization is concerned primarily with the spatial definition of map neighborhoods or regions in terms of a set of vector components (e.g., line segments or arcs) that delimit a region having one and only one feature type or attribute set. In contrast, rasterization assumes a given spatial segmentation scheme (e.g., an image pixel grid) that is not necessarily data-dependent, then associates one or more feature types or attribute sets per grid cell. A key problem in vectorization is accurately portraying the region boundary with a limited set of vector primitives, whereas rasterization can be seen as primarily emphasizing the resolution of feature type or attribute conflicts within a fixed, prespecified spatial framework.

Discussion. Vectorization is acustomary procedure that has many different implementations. Errors inherent in vectorization include:

Alignment Errors that result, for example, from inaccurate alignment of linear vectors with curved region boundaries;
Quantization Errors which arise primarily from the choice of error bound when matching one or more vectors to curved region boundaries that are subtended by the vector(s); and
Segmentation Errors produced by (a) incorrect labelling of the source maps (often considered a systematic error), or (b) gross errors in vectorizing a region boundary.

By choosing curvilinear vector segments or by decreasing the error bound, it may be possible to reduce vector alignment and quantization errors, albeit with increased computational cost. However, quantization errors, which derive from Nyquists's sampling theorm, are generally considered irreducible unless they can be averaged to zero over the course of a computation. Although such errors still remain as local features of the GIS dataset, they are considered (erroneously) to be "averaged out" over a larger area of the dataset domain.

High-Level Rasterization Algorithm. The following steps are generally employed in rasterization:

Step 1.

Step 2. Discretize or partition a 2-D spatial map domain X into rectangular blocks of size q×q map units.

Step 3. Resolve feature type conflicts within each block or grid cell on a case-by-case basis, using contextual information as required.

Step 4. Resolve any remaining feature attribute conflicts.

Discussion. Rasterization tends to have similar errors to those exhibited by vectorization, which include:

Spatial Quantization Error due to the nonzero resolution interval q; and
Feature Resolution Error that results from incorrect assignment of feature type or attributes within a grid cell.

Note that quantization error derives from the Nyquist criterion (sampling theory) and decreases as q approaches zero. Additionally, when two conflicting feature types are present within a given grid cell, feature type or attribute resolution errors are unavoidable.

Dataset Merging Problem. The dataset merging problem in GIS can be stated as follows:

P_DMP :

Given multiple GIS datasets (vector or raster format), merge data to form GIS dataset on a common spatial domain.

Inherent in the process of combining datasets is feature conflict resolution. Assume, for purposes of simplicity, that each GIS dataset is registered on a common, two-dimensional spatial domain X. Then, P_DMP can be reduced to the following question:

Given GIS datasets

F^X

, with what probability does

Inherent in this specification of P_DMP is the assumption that GIS data can be coregistered on a common spatial domain. Raster format data (e.g., an MxN-pixel rectangularly-structured domain) is the domain most commonly employed, which is also called grid cell format. Hence, if a dataset is in raster format, it may have to be resampled to produce a raster of different resolution. If the data is in vector format, then rasterization may be required. Thus, we next examine vector-to-raster conversion in detail.

2.1.2. Rasterization of Vector-format GIS Data.

We have previously shown that the rasterization problem can be reduced to the problem of resolving which feature attribute a grid cell should be labelled. Additionally, we discussed area-based methods for feature label assignment. Three pertinent cases will be discussed in this section, which are listed in increasing order of difficulty, as follows:

Case I.

Case II. Multiple vector segments connected to form one polyline cross a grid cell, thereby partitioning the cell into two polygons, neither of which could be convex.

Case III. Multiple vector segments connected to form multiple polylines cross a grid cell, thereby partitioning the cell into multiple polygons.

Fortunately, the area of each polygonal partition of a grid cell can be accurately computed in every case using a simple algorithm derived from the plane geometric formula for the area of a trapezoid.

Recall. Given a trapezoid T having parallel sides of length a and b and height h, the area of T is given by

< T > = h · (a + b)/2 .

Observation. If a given polygon A in a vectorized map can be decomposed into trapezoids A₁, A₂, ..., A_i, ..., A_n, each of which has parallel sides of length a_i and b_i, with height h_i, then the area of A would be computed as:

< A > = < A_i > = h_i · (a_i + b_i)/2 .

Applying this observation to Case I, we see that there are two sub-cases:

Case Ia:

_i+1

z_y

₁

₂

x_y

a₁ = || x_y(1) - z_y(1) || and b₁ = || z_y(2) - x_y(3) ||

h₁ = || x_y(2) - x_y(3) || .

< A₁ > = h₁ · (a₁ + b₁)/2 ,

₂

Case Ib: The grid cell is divided into a triangle and an irregular pentagon, as shown in Figure ___b. Given the notational conventions of Figure ___a, the area of the triangular partition A₁ is given by

< A₁ > = || x_y(1) - z_y(1) || &183; ||x_y(1) - z_y(2) ||/2 ,

₁

In Case II, the grid cell is bisected by a sequence of line segments whose vertices are given by the sequence S = {x₁, x₂,..., x_i, ..., x_n}

X. Each consecutive pair of vertices x_i,x_i+1 defines the slant side of a trapezoid, as shown in Figure ___a. Summing the areas of the component trapezoids is a simple matter, unless the sequence reverses direction, as shown in Figure ___b. In this case, we must employ a more generalized algorithm for specifying the area of an irregular polygon, as follows.

Algorithm. Given a map domain X R², let the image a ,

%%%%LEFT OFF HERE%%%%

2.2. Determining Feature Type from GIS/Image Data.

2.3. Resolving GIS Attribute Conflicts.

2.4. Merging Disjoint Heterogeneous Regions.

2.5. Coregistration of GIS Features and Neighborhoods.

References.

{Doa97} Doaks, J. Another Introduction to GIS, New York: Bogus Press (1997).

{Doe97} Doe, J.H. An Introduction to GIS, New York: Bogus Press (1997).

This concludes our introductory discussion of GIS issues. We next consider computational problems related to GIS features.