Computational Learning Theory

Li M. Fu

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Introduction

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Topics

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

PAC Model

Valiant (1984) introduced the notion of ``learnable.'' We say that a class of target concepts is learnable if there exists a polynomial-time algorithm that meets the following condition: For every concept in the class and every probability distribution on the instance space, the algorithm, given randomly drawn examples, produces with high probability (at least 1 - delta) a hypothesis whose probability of error is at most a small bound epsilon. The running time must be polynomial in 1/epsilon, 1/delta, and the size of the problem. The probability of error is relative to the distribution of instances and is defined as the probability that an instance is either in the hypothesis and not in the target concept or in the target concept and not in the hypothesis. When the error is small, the hypothesis is a good approximation to the target concept. This learning theory has been used as a basis for analyzing the properties of learning algorithms.
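
The error defined above is the probability mass of the region where the hypothesis and the target concept disagree. A minimal sketch of that computation follows; the toy instance space, the uniform distribution, and the names hypothesis_error, target, and hypothesis are illustrative assumptions, not part of the original notes.

```python
def hypothesis_error(distribution, target, hypothesis):
    """Probability mass of the instances on which hypothesis and target disagree."""
    return sum(p for x, p in distribution.items() if target(x) != hypothesis(x))


# Assumed toy instance space: the integers 0..9 under a uniform distribution.
distribution = {x: 0.1 for x in range(10)}


def target(x):
    """Assumed target concept c: x is at least 5."""
    return x >= 5


def hypothesis(x):
    """Assumed hypothesis h: x is at least 7."""
    return x >= 7


# h and c disagree exactly on x = 5 and x = 6.
error = hypothesis_error(distribution, target, hypothesis)
print(round(error, 3))
```

Here the two instances in the target concept but not in the hypothesis carry all of the error mass; a different distribution over the same instances would give a different error for the same pair (c, h).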

PAC Learning

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

VC-Dimension and Growth Function

Given a set of instances, a hypothesis h in H can partition the set into two groups: The instances in one group satisfy the hypothesis h and those in the other group do not. The partition is called the dichotomy induced by h. The maximum number of dichotomies induced by hypotheses in H on any set of m instances is defined as the growth function of H with respect to m. The Vapnik-Chervonenkis dimension (VCdim) of H is the largest m such that the corresponding growth function is equal to 2^{m}. That is, the VCdim of H is at least m if and only if H can induce all possible dichotomies of some set of m instances drawn from the instance space (H is then said to shatter that set). Thus, the Vapnik-Chervonenkis dimension of H measures the capacity of H. For example, the VCdim of a single perceptron with two input units is 3, since it is possible to find a set of 3 points that can be linearly dichotomized in all 2^{3} ways but no set of 4 points can be dichotomized in all 2^{4} ways. It turns out that the VCdim of a single perceptron with a d-dimensional input is d + 1.
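
To make the definitions concrete, the following sketch counts dichotomies by brute force for a simpler hypothesis class than the perceptron: closed intervals [a, b] on the real line. This class is an assumption chosen for illustration (it is not discussed in the notes), as are the function names. For distinct points the count depends only on m, so it equals the growth function.

```python
def interval_dichotomies(points):
    """Set of labelings of distinct `points` induced by closed intervals [a, b].

    Labelings are tuples of booleans in sorted-point order.
    """
    m = len(points)
    # An interval that misses every point labels all of them negative.
    labelings = {tuple(False for _ in range(m))}
    # Otherwise an interval picks out a contiguous run of the sorted points.
    for i in range(m):
        for j in range(i, m):
            labelings.add(tuple(i <= k <= j for k in range(m)))
    return labelings


def is_shattered(points):
    """True if intervals induce all 2^m dichotomies of the m points."""
    return len(interval_dichotomies(points)) == 2 ** len(points)


for m in range(1, 5):
    pts = list(range(m))
    print(m, len(interval_dichotomies(pts)), 2 ** m, is_shattered(pts))
```

The count is m(m + 1)/2 + 1, which equals 2^{m} for m = 1 and m = 2 but falls short for m = 3 (the labeling positive, negative, positive cannot be produced by one interval), so the VCdim of this class is 2.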

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

VC-Dimension Examples:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Sample Complexity

The version space VS is said to be epsilon-exhausted (with respect to the target concept c and the instance distribution D) if
For all h in VS, error_{D}(h) < epsilon

Suppose a consistent learner has been trained on m independently drawn examples of the target concept. The probability that the version space is not epsilon-exhausted is less than or equal to

|H|e^{-epsilon*m}

The sample complexity for a consistent learner, that is, the number of examples sufficient to make this probability at most delta:

m >= (1/epsilon)(ln|H| + ln(1/delta))
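
The step from the probability bound to the sample complexity can be written out as a short derivation. It assumes only that the bound on the probability of a non-exhausted version space is required to be at most delta:

```latex
\begin{align*}
|H|\, e^{-\epsilon m} \le \delta
  &\iff -\epsilon m \le \ln \delta - \ln |H| \\
  &\iff m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)
\end{align*}
```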

An agnostic learner does not assume that the target concept is in H and can output a hypothesis with nonzero training error. In this case, we use the Hoeffding (additive Chernoff) bounds to bound the generalization error. For a single hypothesis h and m independently drawn training examples:

P[True-error(h) > training-error(h) + epsilon]
<= e^{-2m*epsilon^{2}}
Thus,
P[there exists h in H, True-error(h) > training-error(h) + epsilon]
<= |H|e^{-2m*epsilon^{2}}
The sample complexity for an agnostic learner is
m >= (1/(2*epsilon^{2}))(ln|H| + ln(1/delta))
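
As a numerical check, the sketch below evaluates the agnostic sample complexity and confirms that the resulting m pushes the union bound |H| exp(-2 m epsilon^2) below delta. The function names and the example values (|H| = 1000, epsilon = 0.1, delta = 0.05) are assumptions chosen for illustration.

```python
import math


def agnostic_sample_size(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/(2 eps^2)) (ln|H| + ln(1/delta))."""
    bound = (math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2)
    return math.ceil(bound)


def uniform_deviation_bound(h_size, m, epsilon):
    """Union bound |H| * exp(-2 m eps^2) on the probability that some h in H
    has true error exceeding its training error by more than epsilon."""
    return h_size * math.exp(-2.0 * m * epsilon ** 2)


# Assumed example values.
m = agnostic_sample_size(h_size=1000, epsilon=0.1, delta=0.05)
print(m, uniform_deviation_bound(1000, m, 0.1))
```

Because epsilon enters as 1/epsilon^2 rather than 1/epsilon, halving epsilon roughly quadruples the number of examples, in contrast to the consistent-learner case.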

For the class C of concepts described by conjunctions of boolean literals over n boolean variables,

|H| = 3^{n}
m >= (1/epsilon)(n ln 3 + ln(1/delta))
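
A quick evaluation of this bound shows how mild the requirement is. The sketch assumes illustrative values n = 10, epsilon = 0.1, delta = 0.05; the function name is hypothetical.

```python
import math


def conjunction_sample_size(n, epsilon, delta):
    """Smallest integer m with m >= (1/eps) (n ln 3 + ln(1/delta))."""
    return math.ceil((n * math.log(3) + math.log(1.0 / delta)) / epsilon)


# Assumed example values: 10 boolean variables, 10% error, 95% confidence.
print(conjunction_sample_size(10, 0.1, 0.05))  # 140
print(conjunction_sample_size(20, 0.1, 0.05))  # 250
```

The bound grows linearly in n, 1/epsilon, and ln(1/delta), which is what makes conjunctions of boolean literals PAC-learnable.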

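For the class C of unbiased concepts (every subset of the 2^{n} instances over n boolean variables is a possible concept),

|H| = 2^{2^{n}}
m >= (1/epsilon)(2^{n} ln 2 + ln(1/delta))
This bound grows exponentially in n, so the class is not PAC-learnable.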
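
The exponential growth is easy to see numerically. The sketch below works with ln|H| = 2^n ln 2 directly rather than forming |H| itself; the function name and the values of epsilon and delta are assumptions for illustration.

```python
import math


def unbiased_sample_size(n, epsilon, delta):
    """Smallest integer m with m >= (1/eps) (2^n ln 2 + ln(1/delta)).

    Uses ln|H| = 2^n * ln 2 so that |H| = 2^(2^n) is never materialized.
    """
    return math.ceil(((2 ** n) * math.log(2) + math.log(1.0 / delta)) / epsilon)


# Assumed example values: epsilon = 0.1, delta = 0.05.
for n in (5, 10, 15, 20):
    print(n, unbiased_sample_size(n, 0.1, 0.05))
```

Each additional boolean variable roughly doubles the bound, so the number of examples required by this analysis is not polynomial in n.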

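k-term DNF is not efficiently PAC-learnable if H = k-term DNF, because finding a consistent hypothesis in H is computationally intractable (unless RP = NP), but it becomes PAC-learnable if H = k-CNF, since every k-term DNF expression can be rewritten as a k-CNF expression.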

Given an infinite hypothesis space H for learning a concept, a learning system that outputs a hypothesis consistent with a set of m random instances is probably approximately correct, with probability at least 1 - delta and error at most epsilon, if

m >= (4 log_{2}(2/delta) + 8 VCdim(H) log_{2}(13/epsilon)) / epsilon
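
The bound can be evaluated for the two-input perceptron mentioned earlier, whose VCdim is 3. The sketch assumes base-2 logarithms, as this bound is usually stated; the function name and the values of epsilon and delta are illustrative assumptions.

```python
import math


def vc_sample_size(vc_dim, epsilon, delta):
    """Smallest integer m with
    m >= (4 log2(2/delta) + 8 VCdim(H) log2(13/epsilon)) / epsilon.

    Base-2 logarithms are an assumption about the intended form of the bound.
    """
    bound = (4 * math.log2(2.0 / delta)
             + 8 * vc_dim * math.log2(13.0 / epsilon)) / epsilon
    return math.ceil(bound)


# Perceptron with two inputs: VCdim = 3. Assumed epsilon = 0.1, delta = 0.05.
print(vc_sample_size(3, 0.1, 0.05))
```

The VCdim plays the role that ln|H| plays for finite hypothesis spaces: the bound is linear in VCdim(H) and remains finite even though H itself is infinite.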

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Generalization Error

For finite hypothesis spaces,
P[True-error(h) > training-error(h) + epsilon]
<= e^{-2m*epsilon^{2}}
Thus,
P[there exists h in H, True-error(h) > training-error(h) + epsilon]
<= |H|e^{-2m*epsilon^{2}}

For infinite hypothesis spaces, use the bounds based on the VC-dimension under uniform convergence (these are worst-case bounds).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

VC-dimension of a Computing Neural Network

The VCdim of a one-hidden-layer perceptron with full connectivity between the layers is in the range (Baum and Haussler 1989, Hush and Horne 1993)
2[N_{h}/2]d <= VCdim <= 2N_{w} log(eN_{n})
where [*] is the floor operation that returns the largest integer less than or equal to its argument, N_{h} is the number of hidden units, N_{w} is the total number of weights in the network, N_{n} is the total number of nodes in the network, e is the base of the natural logarithm, and d is the number of input units.
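
The range can be computed for a given architecture as sketched below. The notes do not state the base of the logarithm or whether bias weights and input units are counted, so the sketch takes N_{w} and N_{n} as explicit arguments and assumes a base-2 logarithm; the function name and the example network are hypothetical.

```python
import math


def one_hidden_layer_vc_bounds(d, n_hidden, n_weights, n_nodes):
    """Return (lower, upper) for 2*floor(N_h/2)*d <= VCdim <= 2*N_w*log(e*N_n).

    The logarithm is taken base 2 here, which is an assumption.
    """
    lower = 2 * (n_hidden // 2) * d
    upper = 2 * n_weights * math.log2(math.e * n_nodes)
    return lower, upper


# Hypothetical network: 10 inputs, 5 hidden units, 1 output unit.
# Counting conventions below (no bias weights, all units counted as nodes)
# are assumptions for the example.
d, n_hidden = 10, 5
n_weights = d * n_hidden + n_hidden * 1   # 55 weights
n_nodes = d + n_hidden + 1                # 16 nodes
lower, upper = one_hidden_layer_vc_bounds(d, n_hidden, n_weights, n_nodes)
print(lower, round(upper, 1))
```

The gap between the two bounds is wide even for this small network, so the range constrains the capacity only loosely.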

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Summary

Other Supplementary Material