Cryptology - I: Appendix A - Review of Statistics

Instructors: R.E. Newman-Wolfe and M.S. Schmalz


In this class, we often express cryptologic theory in terms of statistical measures. Since the reader's recall of basic statistics may need refreshing in order to follow the lecture material, we present the following brief review. We begin with a summary of parameters and distributions, then discuss computational means for determining such distributions. Finally, we consider how various distributions can be manipulated to achieve certain properties, such as increased entropy or randomness.

A-1. Statistical Parameters.

Basic statistical parameters include the mean, standard deviation, central and noncentral moments, as well as measures derived from moments (e.g., skewness and kurtosis). Following preliminary definitions, we show how certain measures can be computed from the histogram of an image or message.
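
For concreteness, the following Python sketch (a minimal illustration, not the authors' implementation; the function name histogram_moments is hypothetical) computes the mean, standard deviation, skewness, and excess kurtosis from a histogram hist, where hist[g] is the number of occurrences of value g:

    import math

    def histogram_moments(hist):
        """Compute mean, standard deviation, skewness, and excess
        kurtosis from a histogram, where hist[g] counts value g."""
        n = sum(hist)
        mean = sum(g * c for g, c in enumerate(hist)) / n
        # Central moments of orders 2, 3, and 4.
        m2 = sum(c * (g - mean) ** 2 for g, c in enumerate(hist)) / n
        m3 = sum(c * (g - mean) ** 3 for g, c in enumerate(hist)) / n
        m4 = sum(c * (g - mean) ** 4 for g, c in enumerate(hist)) / n
        std = math.sqrt(m2)
        skewness = m3 / std ** 3          # third standardized moment
        kurtosis = m4 / m2 ** 2 - 3.0     # excess kurtosis (0 for a Gaussian)
        return mean, std, skewness, kurtosis

For the symmetric histogram [0, 2, 5, 2, 0], this returns a mean of 2.0 and a skewness of 0.0, as expected.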

A-2. Computing Probability Distributions.

In this section, we discuss the nature and computation of various types of probability distributions (e.g., Gaussian, Poisson, Lorentzian, Gamma, Chi-squared, and Student's t) that will be useful in cryptanalysis. As an example, we present theory and algorithms for the Gaussian distribution. In Section A-3, we discuss the manipulation of such distributions as a prelude to statistical cryptanalysis.
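
As a point of reference, the Gaussian (normal) density with mean mu and standard deviation sigma is p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)). A minimal Python sketch (function name hypothetical, not the authors' implementation) evaluates this density directly:

    import math

    def gaussian_pdf(x, mu=0.0, sigma=1.0):
        """Density of the normal distribution N(mu, sigma^2) at x."""
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))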

There are many different types of probability distributions, each of which has been derived from observations of natural phenomena. Unfortunately, common probability distributions are rarely derived from first principles. That is, the laws of physics are typically not employed to construct a body of theory from which a given distribution is computed a priori. Rather, data is gathered from observations of one or more processes, and a probability distribution is fitted to measures derived from the data. Thus, in practice, there are relatively few causal models that can deterministically link a given physical process with the statistical distribution that characterizes the outcomes of that process. As a result, one often has little knowledge about why a given distribution occurs in a given situation.

For example, it is well known that the number of photons arriving at a detector in a fixed time interval can be characterized with reasonable accuracy by a Poisson distribution (equivalently, the times between successive arrivals are exponentially distributed). It is likewise known that the greylevels of images that depict natural scenes under spatially uniform illumination are typically Gaussian-distributed. Lettington [Let94] has shown that the greylevels of gradient images taken from selected scenes that contain naturally-occurring or manufactured objects generally conform to a Lorentzian distribution. The reasons for these behaviors are presently not apparent. As a result, there are interesting possibilities for research in causal models that underlie probability distributions.

We shall next discuss uses of sampling distributions.

A-3. Manipulating Probability Distributions.

It is occasionally useful to test whether or not there is a significant difference between two sample means. Also, we often need to test whether two variables in a given bivariate sample are associated with each other. For example, one might want to test whether the frequencies of occurrence of digrams in plaintext and ciphertext corpora are associated with each other. In this section, we examine several simple but useful tests: the t-test, the chi-squared test for dependence, the Phi test for association, Cramer's V-test, and Pearson correlation.

A-3.1. t-test for Significant Difference of Means.
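
As a hedged sketch of the standard two-sample procedure (not necessarily the development given in lecture), the pooled-variance t statistic for independent samples x and y of sizes n1 and n2 is t = (mean(x) - mean(y)) / sqrt(s_p^2 (1/n1 + 1/n2)), with n1 + n2 - 2 degrees of freedom, where s_p^2 is the pooled sample variance. In Python (function name hypothetical):

    import math

    def two_sample_t(xs, ys):
        """Pooled-variance t statistic for the difference of two sample means."""
        n1, n2 = len(xs), len(ys)
        m1, m2 = sum(xs) / n1, sum(ys) / n2
        # Unbiased sample variances.
        v1 = sum((x - m1) ** 2 for x in xs) / (n1 - 1)
        v2 = sum((y - m2) ** 2 for y in ys) / (n2 - 1)
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
        t = (m1 - m2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
        return t, n1 + n2 - 2   # statistic and degrees of freedom

A value of |t| that is large relative to the Student's t distribution with the returned degrees of freedom indicates a significant difference of means.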

A-3.2. Chi-Squared Test for Dependence.

Unlike the t-test, which compares the means of two samples distributed over a real-valued interval, the chi-squared test examines nominal or ordinal measurements and compares group frequencies. Because it operates directly on frequency data, the chi-squared test is inherently useful in conjunction with histogram-based manipulation.
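
A minimal sketch (function name hypothetical), assuming the observed frequencies are given as a contingency table represented as a list of rows: the statistic sums (observed - expected)^2 / expected over all cells, where the expected count for cell (i, j) under independence is (row total i)(column total j)/n:

    def chi_squared_statistic(table):
        """Chi-squared statistic for independence on an r x c table
        of observed frequencies, given as a list of rows."""
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        n = sum(row_totals)
        chi2 = 0.0
        for i, row in enumerate(table):
            for j, observed in enumerate(row):
                expected = row_totals[i] * col_totals[j] / n
                chi2 += (observed - expected) ** 2 / expected
        df = (len(table) - 1) * (len(table[0]) - 1)
        return chi2, df

Large values of the statistic relative to the chi-squared distribution with df degrees of freedom indicate dependence between the row and column variables.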

A-3.3. Phi Test of Association.

Occasionally, one prefers to compare finite sets of Boolean data. The Phi coefficient measures the association between bivariate data described by a 2 × 2-cell table. Phi is preferred over the chi-squared test because Phi corrects for the fact that the chi-squared result varies with the number of cases.
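
For a 2 × 2 table with first row (a, b) and second row (c, d), Phi = (ad - bc) / sqrt((a + b)(c + d)(a + c)(b + d)), which equals sqrt(chi-squared / n). A minimal Python sketch (function name hypothetical):

    import math

    def phi_coefficient(a, b, c, d):
        """Phi coefficient of association for the 2 x 2 table
        [[a, b], [c, d]]; equivalent to sqrt(chi-squared / n)."""
        denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
        return (a * d - b * c) / denom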

A-3.4. Cramer's V-Test.

Cramer's V adjusts the Phi test result for the number of rows or columns in the input data table; the adjustment depends upon the minimum of the number of rows and the number of columns. Phi and Cramer's V are equivalent for a 2 × 2-cell table.
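
A minimal sketch, reusing the hypothetical chi_squared_statistic function from Section A-3.2: Cramer's V = sqrt(chi2 / (n (k - 1))), where k is the minimum of the number of rows and columns:

    import math

    def cramers_v(table):
        """Cramer's V: the chi-squared result rescaled for table size."""
        chi2, _ = chi_squared_statistic(table)   # sketch from Section A-3.2
        n = sum(sum(row) for row in table)
        k = min(len(table), len(table[0]))       # min(rows, columns)
        return math.sqrt(chi2 / (n * (k - 1)))

The associational tests described thus far can be extended to yield a more powerful test of association, called Pearson correlation.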

A-3.5. Pearson Product-Moment Correlation.

Correlation measures the strength of the relationship between two variables. Pearson Product-Moment Correlation (PPMC) estimates the degree to which this relationship is linear; thus, PPMC is occasionally called zero-order correlation or linear correlation.
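
A minimal Python sketch (function name hypothetical): the sample coefficient is r = sum((x - mean(x))(y - mean(y))) / sqrt(sum((x - mean(x))^2) sum((y - mean(y))^2)), which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship):

    import math

    def pearson_r(xs, ys):
        """Pearson product-moment correlation of paired samples xs, ys."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / math.sqrt(sxx * syy)

We next discuss simple methods by which one distribution can be made to portray another distribution.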

A-3.6. Transforming Probability Distributions.
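
The development of this subsection is only sketched here, by way of one standard technique stated under the assumption that the inverse cumulative distribution function F^{-1} of the target distribution is available: if u is uniform on [0, 1), then F^{-1}(u) follows the target distribution (inverse-transform sampling). A minimal Python sketch for the exponential distribution with rate lam, whose inverse CDF is -ln(1 - u)/lam (function name hypothetical):

    import math
    import random

    def exponential_sample(lam):
        """Inverse-transform sampling: map a uniform variate through the
        inverse CDF of the target distribution (here, exponential with
        rate lam)."""
        u = random.random()              # uniform on [0, 1)
        return -math.log(1.0 - u) / lam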