The majority of GIS datasets are currently represented in vector format, and have an inherent representational error that arises from sensor errors as well as from discretization or polygonalization processes that we discussed in detail in Section 2. GIS algorithms propagate this dataset error through various stages of computation, yielding a GIS product or result that often has unexpected errors. Such errors influence coregistration (e.g., map overlay) and tend to corrupt the derivation of range or elevation data from stereo imagery. This further impacts the integration of surface models (e.g., spline models based on elevation data) with GIS datasets.
Section Overview. In this section, we discuss errors in GIS datasets as well as error and complexity measures, then investigate the effect of such errors on elevation and surface modelling with GIS. The section is structured as follows:
In Section 4.1, we present an in-depth discussion of errors in spatial datasets, together with measures for quantifying such errors and various types of error analysis. Section 4.2 details techniques for error analysis as well as the estimation of complexity in GIS algorithms, both of which are required for the design and implementation of accurate, efficient software. The difficult problem of automatic analysis of stereophotogrammetric data is presented in Section 4.3, and an extension of this problem to multi-modal GIS (e.g., derived and measured elevation data) is discussed in Section 4.4.
Goals. For the purposes of the current course, goals for this section include:
We begin our discussion with a distinction between cartographic error and GIS error.
Note that map accuracy is a relatively minor issue in cartography. Users of maps are rarely aware of this problem, due to their familiarity with map notation and its underlying assumptions. For example, if a map shows that a drainage ditch runs parallel to a road, one assumes from world knowledge that the ditch is located close to the road. However, on the map, for purposes of cartographic license, the ditch may be drawn offset laterally from the road by as much as the width of the road.
Hence, the key issues in this section are measurement, estimation, and prediction of GIS error. The following observations pertain:
The error modelling process can be decomposed into the following steps or levels:
Level 1. Error isolation -- Identify and classify the sources of error present in the source datasets and in each stage of computation.
Level 2. Error detection and measurement -- Devise and implement test procedures for detecting and estimating error deterministically or stochastically.
Level 3. Error propagation modelling -- Predict or estimate the accrual of error as a computational cascade comprising a sequence of operations.
Level 4. Strategies for error management -- Determine methods for carrying out GIS computations to achieve minimum output error.
Level 5. Strategies for error "reduction" -- Note that the common assumption is erroneous, namely, that error can be reduced via summation over error distributions. In practice, such distributions are often asymmetrical with nonzero means. Indeed, the assumption that error can be nontrivially reduced derives from convenient pedagogic examples that employ symmetric distributions with zero mean or assume infinite computational precision. In practice, "error reduction" often strives to re-order the processes of a GIS algorithm, or substitute less erroneous processes, which is a type of error management.
At Level 1 (error isolation), the following types of error are found:
In this course, we will concentrate upon positional and attribute error primarily, and thus examine Levels 1-5 of the preceding hierarchy for error sources.
Level 1 (Error isolation) sources of error are classified by type as shown in the following examples:
Attribute: Poor assessment of attributes due to measurement, conceptual, quantitative, or qualitative error, as well as systematic (e.g., operator) error in manual classification systems.
Level 2 (Error detection and measurement) focuses on methods of assessing accuracy levels in spatial data, for example:
Level 3 (Error propagation modelling) focuses on error propagation and error production, which are distinguished in the following examples:
Level 4 (Error management) focuses on coping with error and making specifiably accurate decisions in the presence of error. In GIS, this can involve:
Level 5 (Strategies for error reduction) includes but is not limited to the following methods:
Theory. Given a_{i} ∈ F^{X}, where 1 ≤ i ≤ n, let there be k_{i} polygons per layer. According to McAlpine and Cook [McA71], an estimate of the number of spurious polygons is given by:
k_{ovl} = ( Σ_{i=1}^{n} k_{i}^{1/2} )^{2} .
This process is illustrated in Figure 4.1.1. Note that k_{ovl} grows rapidly with n (quadratically, when all layers contain equal numbers of polygons), which means that spurious polygons tend to shrink correspondingly as n increases, due to the constant size of X.
Figure 4.1.1. Effect of overlaying two maps (a,b) with polygons to produce a map (c) with fragmented polygons.
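To make the McAlpine-Cook estimate concrete, here is a minimal Python sketch; the function name and the layer polygon counts are ours, for illustration only.

```python
import math

def spurious_polygon_estimate(k):
    # McAlpine-Cook estimate of the polygon count after overlaying
    # n layers, where k[i] is the polygon count of layer i:
    #   k_ovl = (sum_i k_i**(1/2))**2
    return sum(math.sqrt(ki) for ki in k) ** 2

# Four equal layers of 100 polygons yield n**2 * k = 1600 polygons,
# illustrating the quadratic growth in n for equal-sized layers:
print(spurious_polygon_estimate([100, 100, 100, 100]))  # 1600.0
```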
2. Error Detection could involve, for example, comparing polygon size to a prespecified threshold (e.g., the IDRISI system's MERGE command). However, in practice, this approach is flawed by a lack of correlation between polygon size and representational error.
3. Error Propagation: Models of polygon fragmentation are typically based on k_{i}, the number of polygons per layer, rather than spatial error per layer or per polygon. Hence, analysis must be modified such that another measure, such as polygon size, is employed to characterize or detect error.
Goodchild [Goo78] developed an error model based on the intersection of a truthed cartographic line and its digitized representation in different layers.
Given two layers with numbers of line vertices denoted by v_{1} and v_{2}, the maximum number of spurious polygons is given by:
max(N_{p}) = 2 · min(v_{1},v_{2}) - 4 .
If regular interleaving of vertices exists as shown in Figure 4.1.2, then the number of spurious polygons is estimated by Goodchild as:
N_{p} ~ 2 v_{1} v_{2} / (v_{1} + v_{2}) - 3 ,
which has been shown to be an overestimate [Ver94]. Additionally, the relationship between the number of vertices and the number of spurious polygons has been observed to vary considerably.
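Both counts are straightforward to compute; a short Python sketch follows (the function names and vertex counts are hypothetical).

```python
def max_spurious(v1, v2):
    # Goodchild upper bound on spurious polygons produced by
    # overlaying two digitizations of the same line with v1 and v2
    # vertices: max(N_p) = 2 * min(v1, v2) - 4.
    return 2 * min(v1, v2) - 4

def interleaved_estimate(v1, v2):
    # Goodchild estimate under regular vertex interleaving;
    # shown to be an overestimate in practice [Ver94].
    return 2 * v1 * v2 / (v1 + v2) - 3

print(max_spurious(100, 150))          # 196
print(interleaved_estimate(100, 150))  # 117.0
```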
Figure 4.1.2. Two types of vertex interleaving in digitized datasets.
Theory. Given layers 1 ≤ i,j ≤ n, the covariance for layers i and j is given by:
s_{ij} = (1/|X|) · Σ_{x ∈ X} (a_{j}(x) - b_{j}(x)) · (a_{i}(x) - b_{i}(x)) ,
where a_{k}(x) and b_{k}(x) denote, respectively, the observed and reference (truthed) values of layer k at location x.
When i = j, this equation defines the error variance within a given data layer.
Observation. The preceding equation facilitates calculation of the error variance of a composite map a ∈ (F^{n})^{X} as a function of a prespecified arithmetic operator applied during map overlay.
Example 1. When n layers are added, the composite error variance is given by [Ver94]:
s_{c} = Σ_{i=1}^{n} Σ_{j=1}^{n} s_{ij} .
Remark. Although error variance is always positive per layer, the error covariance may be positive or negative. Negative error covariance can imply that the error variance of the composite map may be lower than the variances of the individual map layers. This would appear to contradict the absolute error assumption, namely, that absolute error of a map cannot be less than the error of its most erroneous component map. However, this appearance is false, since we are dealing with signed error, not the absolute value of an error measure.
Example 2. When a subtraction operator is applied, i.e., the composite map is a_{1} - (a_{2} + ... + a_{n}), the composite error variance is given by [Ver94]:
s_{c} = Σ_{i=1}^{n} Σ_{j=1}^{n} s_{ij} - 4 · Σ_{j=2}^{n} s_{1j} .
Implementational Issues. The preceding method depends upon one's ability to measure variance and covariance for a given dataset. This implies that a and b coincide spatially (i.e., are coregistered), which may not hold in practice for irregular polygons. Note that rasterization can achieve some spatial alignment at the pixel level, albeit at the cost of additional spatial and attribute error.
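The addition and subtraction formulas can be verified numerically from per-cell signed errors e_{k}(x) = a_{k}(x) - b_{k}(x) of coregistered layers. The following Python sketch (function names ours) also illustrates the Remark above: strongly negative covariance can drive the composite variance of a sum toward zero.

```python
def covariance(e_i, e_j):
    # s_ij = (1/|X|) * sum over cells x of e_i(x) * e_j(x), where
    # e_k(x) is the signed error of layer k at cell x; the two error
    # grids must be coregistered (same cells, same order).
    assert len(e_i) == len(e_j)
    return sum(u * v for u, v in zip(e_i, e_j)) / len(e_i)

def composite_variance_sum(errors):
    # Variance of the sum of n layers: s_c = sum_i sum_j s_ij.
    n = len(errors)
    return sum(covariance(errors[i], errors[j])
               for i in range(n) for j in range(n))

def composite_variance_diff(errors):
    # Variance of a_1 - (a_2 + ... + a_n):
    # s_c = sum_i sum_j s_ij - 4 * sum_{j>=2} s_1j.
    n = len(errors)
    return composite_variance_sum(errors) - 4 * sum(
        covariance(errors[0], errors[j]) for j in range(1, n))

# Two toy error grids (flattened rasters) with perfectly negative
# covariance: the errors cancel, so the sum's variance is zero.
e1 = [1.0, -1.0, 1.0]
e2 = [-1.0, 1.0, -1.0]
print(composite_variance_sum([e1, e2]))   # 0.0
print(composite_variance_diff([e1, e2]))  # 4.0
```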
Algorithm. Given attribute (e.g., cover) classes,
Remark. A minor modification of this algorithm uses the same number of samples from each cover class, to avoid error due to under-representation by small sample size.
Example. Consider employment of the logical and operator to match attributes between GIS data layers. Given the i-th layer with fraction Pr[E_{i}] of cells correctly classified, the composite map accuracy is given by:
Pr[E_{c}] = Pr[E_{1} and E_{2}] = Pr[E_{1}] · Pr[E_{2} | E_{1}] ,
where the conditional probability Pr[E_{2} | E_{1}] denotes the fraction of cells correctly classified in Layer 1 that are correctly classified in Layer 2. For more than two layers, the composite map accuracy is given by:
Pr[E_{c}] = Pr[E_{1} ∩ E_{2} ∩ ... ∩ E_{n}] = Pr[E_{1}] · Pr[E_{2} | E_{1}] · ... · Pr[E_{n} | ∩(E_{i})] ,
where ∩(E_{i}) = E_{1} ∩ E_{2} ∩ ... ∩ E_{n-1} .
From these equations, minimum and maximum composite map accuracy can be computed as:
Pr[E_{c}^{max}] = min(Pr[E_{i}]), i = 1..n
Pr[E_{c}^{min}] = max( 0, 1 - Σ_{i=1}^{n} (1 - Pr[E_{i}]) ) .
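As a quick numerical check on these bounds, and on the special case of the chain rule in which layer errors are independent (so that Pr[E_{2} | E_{1}] = Pr[E_{2}]), consider the following Python sketch; the function names and accuracy values are ours, for illustration only.

```python
def composite_accuracy_bounds(p):
    # p[i] = Pr[E_i], the fraction of cells correctly classified in
    # layer i.  For the intersection of the events E_i:
    #   upper bound:              min_i Pr[E_i]
    #   lower (Bonferroni) bound: max(0, 1 - sum_i (1 - Pr[E_i]))
    upper = min(p)
    lower = max(0.0, 1.0 - sum(1.0 - pi for pi in p))
    return lower, upper

def composite_accuracy_independent(p):
    # Special case of the chain rule when layers err independently:
    # Pr[E_c] = Pr[E_1] * Pr[E_2] * ... * Pr[E_n].
    out = 1.0
    for pi in p:
        out *= pi
    return out

acc = [0.9, 0.8, 0.95]
print(composite_accuracy_bounds(acc))       # ≈ (0.65, 0.8)
print(composite_accuracy_independent(acc))  # ≈ 0.684
```

Note that the independent-layers product always lies between the two bounds.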
Observation. The preceding results yield several concepts and observations concerning map accuracy under the logical and (set-intersection) operator, as follows:
The following related discussion of categorical coverages is taken from [Chr94].
Recall. Coregistration and co-location in map overlay is a fundamental operation upon which more complex GIS procedures are constructed.
Observation. Map overlay employs positional information to construct new polygons that share characteristics of the separate source layers or primary maps. One can think of overlay as visually implementing Venn diagrams applied to geographic database queries. Figure 4.1.3 is illustrative of the inheritance problems that result from misregistered data.
Figure 4.1.3. Map overlay can be thought of as implementing Venn diagrams. Regions A ∩ C and B ∩ D inherit, or share characteristics of, their components.
Example. Categorical coverage is portrayed in Figure 4.1.4, which shows a three-class soil labelling problem. Assume that the labels are based on soil density (e.g., density(silt) > density(loam) > density(sand)). Since density measures can be expressed in terms of a subset of R, and the density-based labelling problem exemplified in Figure 4.1.4 occurs on the continuous domain R^{2}, this is a categorical coverage problem.
Figure 4.1.4. Example of categorical coverage problem, a type of map distinct from collection units.
Observations. The following instances are illustrative of some differences between categorical coverage and collection zone maps:
Example. In sociological studies, the geographic unit may be a city block or township, when in fact a person, firm, or household would be more appropriate. This difference can be due in part to the availability of statistical data over prespecified spatial domains (e.g., census data organized geographically).
As part of this case study of errors in categorical data, we present the following high-level discussion of the effects of category error and overlay error in GIS maps.
Observation. The most common form of map overlay error is called a sliver, as shown in Figure 4.1.5. Some GIS become clogged with polygon representations of slivers and hence require spatial filtration. An example of this is the MERGE command in the IDRISI system.
Figure 4.1.5. "Slivers", the most common map overlay error.
Method. Existing techniques for analysis of categorical data can be adapted to GIS, for example:
Observation. The log-linear model allows a contingency table to be decomposed into effects in terms of a hierarchical model. Salient features of the log-linear model are listed as follows:
Figure 4.2.1. Underlying model of suitability analysis applied to confidence limits.
Definition. A rank map has attributes that are ordinal-, interval-, or ratio-valued and pertain to rankings that represent mathematical relations between attributes of more than one map. For example, a soil map could be ranked to determine potential for landslides or rate of water percolation.
Definition. Geographic suitability analysis applies a given operation or transformation to the attributes of one or more maps, for example, a rank map or an overlay.
Notation. Given n primary maps that are overlaid, let salient variables be defined as follows:
Definition. The result of a suitability analysis restricted to the p-th polygon of the resultant suitability map has an attribute r_{p} that is a function of the weight vector w = (w_{1}, w_{2}, ... , w_{n}) and the attribute vector a_{p} = (a_{p1}, a_{p2}, ... , a_{pn})', as follows:
r_{p} = f(w,a_{p}) ,
which yields the resultant vector r = (r_{1}, r_{2}, ... , r_{P(n)}).

Observation. If a_{p} is replaced by point or line attributes (e.g., raster data is employed), then the preceding formulation attains full generality, including most commonly known GIS suitability analyses. In particular, the weighted intersection overlay is represented by:
r_{p} = f(w,a_{p}) = Σ_{i=1}^{n} w_{i} · a_{p,i} , p = 1,2,...,P(n) .
Observation. The multidimensional scaling approach derives r_{p} from the primary map attributes and a given attribute c_{i}, where i = 1..n, as follows:
r_{p} = f(w,a_{p}) = ( Σ_{i=1}^{n} w_{i} · (a_{p,i} - c_{i})^{2} )^{1/2} , p = 1,2,...,P(n) .
Note that the scalar c_{i} is chosen for each of the primary or rank maps. Hence, c = (c_{1}, c_{2}, ..., c_{n}).

Given the preceding theory, we now address the problem of sensitivity analysis of GIS algorithms.
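A minimal Python sketch of the two resultant-attribute forms above (the weighted intersection overlay and the multidimensional-scaling, or target, form); the function names, weights, and attribute values are ours, for illustration only.

```python
def overlay_suitability(w, a_p):
    # Weighted intersection overlay: r_p = sum_i w_i * a_{p,i}.
    return sum(wi * ai for wi, ai in zip(w, a_p))

def target_suitability(w, a_p, c):
    # Multidimensional-scaling (target) form:
    #   r_p = (sum_i w_i * (a_{p,i} - c_i)**2) ** (1/2),
    # where c_i is the most (or least) preferred value for map i.
    return sum(wi * (ai - ci) ** 2 for wi, ai, ci in zip(w, a_p, c)) ** 0.5

w = [0.5, 0.3, 0.2]    # weights for three rank maps
a_p = [4.0, 2.0, 5.0]  # attributes of polygon p in each map
c = [5.0, 5.0, 5.0]    # ideal attribute per map
print(overlay_suitability(w, a_p))    # ≈ 3.6
print(target_suitability(w, a_p, c))  # ≈ 1.789 (distance from the ideal)
```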
Definition. Confidence limits for attribute errors indicate the range of validity in an output attribute value given the error ranges in the input attribute values (e.g., in the primary maps).
Algorithm. The process of suitability map generation involves the following steps:
Step 2. Intersect the polygons of the n rank maps to yield the P(n) polygons of the suitability map. Attribute values are given by Equations (II) or (III).
Step 3. Transform primary map attributes to rank map attributes (interval or ratiometric data that connote a ranking), as follows:
T_{i}(b_{p,i}) = a_{p,i} , where i = 1..n, p = 1..P(n) .
Step 4. Transform rank map attributes to suitability map attributes via Equation (II) or (III).

Observation. One can represent the attributes of the suitability map as
r_{p} = Σ_{i=1}^{n} w_{i} · a_{p,i} = Σ_{i=1}^{n} w_{i} · T_{i}(b_{p,i}) , p = 1,2,...,P(n) ,

to obtain overlay suitability, and

r_{p} = ( Σ_{i=1}^{n} w_{i} · (a_{p,i} - c_{i})^{2} )^{1/2} = ( Σ_{i=1}^{n} w_{i} · (T_{i}(b_{p,i}) - c_{i})^{2} )^{1/2} , p = 1,2,...,P(n) ,
for target suitability, where c_{i} denotes the most or least preferred value of the i-th rank map.

The matrix product provides a concise, convenient expression for the preceding two equations. For example, the overlay form can be written as:

r = T · w ,

where T denotes the P(n) × n matrix with entries T_{p,i} = T_{i}(b_{p,i}).
Likewise, given the ideal attributes I_{1} through I_{n}, if t = (r_{1}^{2}, r_{2}^{2}, ... , r_{P(n)}^{2}), then the target attributes are given by:

t = T · w ,

where the matrix T now has entries T_{p,i} = (T_{i}(b_{p,i}) - I_{i})^{2}.
We next consider measures of geographic sensitivity. Given an expression for T_{n}, the sensitivity of individual polygons can be determined. At least five sensitivity measures for entire maps have been defined by Lodwick [Lod94], as follows:
1. Attribute Sensitivity Measures (ASMs):
m_{1}(s,r,r~) = Σ_{p=1}^{P(n)} s_{p} · | r_{p} - r~_{p} | ,
where the weight vector s = (s_{1}, s_{2}, ... , s_{P(n)}) and r~ denotes the resultant vector recomputed after perturbation of the input attributes or weights.

Notes:
· If s_{p} = 1, then m_{1} is said to be unweighted.
· If s_{p} = A_{p}, where A_{p} denotes the area of the p-th polygon, then m_{1} is said to be an areal-weighted sensitivity measure.
· A normalized areal-weighted attribute sensitivity measure can be computed by setting the p-th weight, as follows:
s_{p} = A_{p} / Σ_{p=1}^{P(n)} A_{p} .
The weighted position sensitivity measure (e.g., weighted for position in a given rank ordering scheme) is given by:
m_{2}(s,r,r~) = Σ_{p=1}^{P(n)} s_{p} · | pos(r_{p}) - pos(r~_{p}) | ,
where pos(·) denotes the position of an attribute value in the rank ordering, and the weight vector s = (s_{1}, s_{2}, ... , s_{P(n)}) was defined previously.

Notes:
· If s_{p} = 1,then m_{2} is said to be unweighted.
· If s_{p} = A_{p}, then m_{2} is termed an areal-weighted rank sensitivity measure.
· If s_{p} portrays the proportion of total map area occupied by the p-th polygon (per the preceding section on ASMs), then m_{2} is said to be a normalized areal-weighted rank sensitivity measure.
For example, if k denotes the number of primary maps remaining after map removal, then the corresponding MRSM is given by:
m_{3}(s,r,r~) = Σ_{p=1}^{P(n)} s_{p} · | (r_{p} / n) - (r~_{p} / k) | ,
where the weight vector s was defined previously and r~ denotes the resultant vector recomputed from the k remaining maps.

4. Polygon Sensitivity Measures (PoSMs):
The polygons that undergo maximum change have indices in the set
J = { j ∈ N : | r_{j} - r~_{j} | = max_{1 ≤ i ≤ P(n)} | r_{i} - r~_{i} | } .
The polygon sensitivity measure is thus given by
m_{4}(s,r,r~) = max_{1 ≤ i ≤ P(n)} | r_{i} - r~_{i} | .
m_{5}(s,r,r~) = Σ_{p=1}^{P(n)} s_{p} · A_{p} ,
where s_{p} = 1 if r_{p} ≠ r~_{p} and s_{p} = 0 otherwise.

Note: If s_{p} = | r_{p} - r~_{p} |, then we obtain a relative area sensitivity measure.
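A few of these sensitivity measures are simple enough to sketch directly in Python. In the sketch below (function names ours), `r` holds the baseline resultant attributes and `r_pert` holds the attributes recomputed after some perturbation of the inputs.

```python
def asm(r, r_pert, s=None):
    # m1: attribute sensitivity measure, sum_p s_p * |r_p - r'_p|;
    # unweighted when s_p = 1 for all p.
    if s is None:
        s = [1.0] * len(r)
    return sum(sp * abs(a - b) for sp, a, b in zip(s, r, r_pert))

def polygon_sensitivity(r, r_pert):
    # m4: largest attribute change over any single polygon.
    return max(abs(a - b) for a, b in zip(r, r_pert))

def area_sensitivity(r, r_pert, areas):
    # m5: total area of polygons whose attribute changed
    # (s_p = 1 iff the attribute of polygon p changed).
    return sum(A for a, b, A in zip(r, r_pert, areas) if a != b)

# Baseline vs. perturbed resultant attributes for three polygons:
r, r_pert = [3.0, 5.0, 2.0], [3.5, 5.0, 1.0]
print(asm(r, r_pert))                             # 1.5
print(polygon_sensitivity(r, r_pert))             # 1.0
print(area_sensitivity(r, r_pert, [10, 20, 30]))  # 40
```

Passing `s=areas` to `asm` yields the areal-weighted variant described above.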
Step 2. Measure sensitivity using m_{1} through m_{5}, as discussed in the preceding sections.
Step 3. If the probability distribution associated with the input attribute errors is known, then Monte Carlo simulation of r = T · w or t = T · w would produce resultant attributes that could be analyzed statistically.
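As a sketch of such a simulation, assume (for illustration only) independent zero-mean Gaussian perturbations of the rank-map attribute matrix T; the function name and parameters below are ours.

```python
import random

def monte_carlo_overlay(w, T, sigma, trials=1000, seed=0):
    # Perturb each rank-map attribute T[p][i] with zero-mean Gaussian
    # noise of standard deviation sigma, recompute r = T · w in each
    # trial, and return the per-polygon mean of the simulated attributes.
    rng = random.Random(seed)
    P = len(T)
    totals = [0.0] * P
    for _ in range(trials):
        for p, row in enumerate(T):
            totals[p] += sum(wi * (tpi + rng.gauss(0.0, sigma))
                             for wi, tpi in zip(w, row))
    return [t / trials for t in totals]

# With sigma = 0 the simulation reproduces r = T · w exactly:
print(monte_carlo_overlay([0.5, 0.5], [[2.0, 4.0], [6.0, 8.0]],
                          sigma=0.0, trials=10))  # [3.0, 7.0]
```

With sigma > 0, the collected samples could also be summarized by variances or empirical confidence limits per polygon.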
[Bib77] Bibby, J. "The general linear model: A cautionary tale", in Analysis of Survey Data 2:35-80, Eds. C. O'Muircheartaigh and C. Payne, New York: John Wiley (1977).
[Bis75] Bishop, Y., S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice, Cambridge, MA: MIT Press (1975).
[Chr94] Chrisman, N.R. "Modelling error in overlaid categorical maps", in Accuracy of Spatial Databases, Eds. M. Goodchild and S. Gopal, London: Taylor and Francis, Second Printing (1994).
[Goo78] Goodchild, M.F. "Statistical aspects of the polygon overlay problem", Harvard Papers on Geographic Information Systems, Volume 6, Reading, MA: Addison-Wesley (1978).
[McA71] McAlpine, J.R. and B.G. Cook. "Data reliability from map overlay", in Proceedings of the 43rd Congress of the Australian and New Zealand Association for the Advancement of Science (1971).
[Ver94] Veregin, H. "Error modeling for the map overlay operation", in Accuracy of Spatial Databases, Eds. M. Goodchild and S. Gopal, London: Taylor and Francis, Second Printing (1994).