Computational Learning Theory
Li M. Fu
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Introduction
- The theory about learnability
- The theory about learning error
- The theory about sample size
- The theory about computational complexity
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Topics
- PAC model
- VC dimension
- Mistake bound
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
PAC Model
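Valiant (1984) introduced the notion of ``learnable.''
We say that a class of target concepts is learnable
(probably approximately correct, or PAC, learnable)
if the following condition is met:
There exists a polynomial-time algorithm such that,
for every concept in the class and
any probability distribution on the instance space,
the algorithm produces, with probability at least 1 - delta, a hypothesis
whose probability of error is at most epsilon.
Here epsilon and delta are arbitrarily small positive constants, and the
running time must be polynomial in 1/epsilon and 1/delta.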
The probability of error is relative to the distribution
of instances and is defined to be
the probability of instances
that are either in the hypothesis and not in the target concept or
in the target concept and not in the hypothesis.
When the error is small, the hypothesis is a good
approximation to the target concept.
This learning theory has been used as a basis for analyzing
the properties of learning algorithms.
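The error definition above can be illustrated on a small finite instance space. This is a minimal sketch: the instances, the distribution, and the function name `error` are hypothetical and chosen only for illustration. Concepts are represented as sets of instances, so the error is the probability mass of their symmetric difference.

```python
def error(hypothesis, target, distribution):
    """Probability mass of the instances on which hypothesis and target disagree."""
    disagreement = hypothesis ^ target  # symmetric difference of the two sets
    return sum(distribution[x] for x in disagreement)

# Hypothetical toy instance space with four instances and their probabilities
distribution = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
target = {"a", "b"}
hypothesis = {"a", "c"}

err = error(hypothesis, target, distribution)
print(err)  # 0.375: "b" is missed (0.25) and "c" is wrongly included (0.125)
```

The same hypothesis can have a different error under a different distribution, which is why the PAC condition is required to hold for any distribution.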
PAC Learning
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
VC-Dimension and Growth Function
Given a set of instances, a hypothesis h in H can partition
the set into two groups: The instances in
one group satisfy the hypothesis h
and those in the other group do not.
The partition is called the dichotomy induced by h.
The maximum number of dichotomies induced by
hypotheses in H on any set of m instances is defined as
the growth function of H with respect to m.
The Vapnik-Chervonenkis dimension (VCdim)
of H is the largest m
such that the corresponding growth function is equal to 2^{m}.
That is, the Vapnik-Chervonenkis dimension of H is at least m
if and only if
H can induce all possible dichotomies of some set of m
instances drawn from the instance space
(such a set is said to be shattered by H).
Thus, the Vapnik-Chervonenkis dimension of H
measures the capacity of H.
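As a small illustration of dichotomies and the growth function, the sketch below counts the dichotomies that threshold hypotheses of the form [0, a] induce on three points. The helper names, the points, and the grid of thresholds are assumptions made for this example only.

```python
def dichotomies(points, hypotheses):
    """Set of distinct labelings that the hypotheses induce on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def threshold(a):
    # Hypothesis for the interval [0, a]: an instance x satisfies it when x <= a.
    return lambda x: x <= a

points = [0.2, 0.5, 0.8]
# Illustrative finite grid of thresholds. It contains a value below, between,
# and above the points, so no dichotomy of these particular points is missed.
hypotheses = [threshold(a) for a in (0.1, 0.3, 0.6, 0.9)]

induced = dichotomies(points, hypotheses)
count = len(induced)
print(count, 2 ** len(points))  # 4 8
```

Only 4 of the 8 possible dichotomies appear, so this set of three points is not shattered by the threshold class, whereas a single point is (both labelings occur).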
For example, the VCdim of a single perceptron with two input
units is 3, since it is possible to find a set of
3 points that can be linearly dichotomized in all 2^{3} ways
but no set of four points can be dichotomized
in all possible ways.
It turns out that the VCdim of a single perceptron with
a d-dimensional input is d + 1.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
VC-Dimension Examples:
- C: [0, a], 0 < a < 1, a in R
- C: [a, b], 0 < a,b < 1, a,b in R
- C: k non-intersecting intervals
- C: half spaces of the plane
- C: circles in the plane
- C: axis-aligned rectangles in the plane
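For the last example in the list, shattering can be checked by brute force. The sketch below relies on one observation: a subset of points can be cut out by an axis-aligned rectangle exactly when the bounding box of the subset contains no other point of the set. The point sets and function names are hypothetical. The check is consistent with the known result that axis-aligned rectangles in the plane have VCdim 4, although testing a single five-point set is not a proof of the upper bound.

```python
from itertools import combinations

def rectangle_realizes(points, subset):
    """True if some axis-aligned rectangle contains exactly `subset` of `points`."""
    if not subset:
        return True  # a rectangle placed away from all points contains none of them
    xs = [p[0] for p in subset]
    ys = [p[1] for p in subset]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    # The bounding box is the smallest rectangle containing the subset, so the
    # subset is realizable exactly when the box captures no other point.
    return all(
        p in subset or not (lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)
        for p in points
    )

def shattered(points):
    """True if every subset of `points` is realized by some rectangle."""
    return all(
        rectangle_realizes(points, set(s))
        for r in range(len(points) + 1)
        for s in combinations(points, r)
    )

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
with_center = diamond + [(0, 0)]
print(shattered(diamond), shattered(with_center))  # True False
```

The four points in a diamond arrangement are shattered, while adding the center point breaks shattering: any rectangle containing the top and bottom points also contains the center.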
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Sample Complexity
The version space is said to be epsilon-exhausted (with respect to the target concept c
and the distribution D) if
For all h in VS, error_{D}(h) < epsilon
Suppose a consistent learner has learned m independently drawn examples of the
target concept.
The probability that the version space is not epsilon-exhausted is less than
or equal to
|H| e^{-epsilon*m}
Requiring this probability to be at most delta gives
the sample complexity for a consistent learner:
m >= (1/epsilon)(ln|H| + ln(1/delta))
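The bound for a consistent learner is easy to evaluate numerically. The following is a minimal sketch; the function name and the values |H| = 1000, epsilon = 0.1, delta = 0.05 are illustrative assumptions, not taken from the text.

```python
import math

def consistent_sample_size(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon)(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Hypothetical setting: |H| = 1000, epsilon = 0.1, delta = 0.05
m = consistent_sample_size(1000, 0.1, 0.05)
print(m)  # 100

# Sanity check: with this m the failure probability bound |H| e^{-epsilon*m}
# is indeed no larger than delta.
print(1000 * math.exp(-0.1 * m) <= 0.05)  # True
```

Note that m grows only logarithmically with |H| and with 1/delta, but linearly with 1/epsilon.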
An agnostic learner does not assume that the target concept is in
H and can learn a hypothesis with nonzero training error.
In this case, we use the Hoeffding (additive Chernoff) bounds to obtain
the generalization error:
P[True-error(h) > training-error(h) + epsilon]
<= e^{-2m*epsilon^{2}}
Thus,
P[there exists h in H, True-error(h) > training-error(h) + epsilon]
<= |H| e^{-2m*epsilon^{2}}
The sample complexity for an agnostic learner is
m >= (1/(2*epsilon^{2}))(ln|H| + ln(1/delta))
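To see how the agnostic bound behaves, the sketch below evaluates it for an assumed setting (|H| = 1000, epsilon = 0.1, delta = 0.05; these values and the function name are illustrative only) and checks the result against the union bound it was derived from.

```python
import math

def agnostic_sample_size(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/(2 epsilon^2))(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * epsilon ** 2))

# Hypothetical setting: |H| = 1000, epsilon = 0.1, delta = 0.05
m_agnostic = agnostic_sample_size(1000, 0.1, 0.05)

# Union bound |H| e^{-2 m epsilon^2} evaluated at the computed m
bound = 1000 * math.exp(-2 * m_agnostic * 0.1 ** 2)
print(m_agnostic, bound <= 0.05)  # 496 True
```

The dependence on 1/epsilon^{2} rather than 1/epsilon is the price of dropping the assumption that the target concept is in H.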
For the class C of concepts described by conjunctions of boolean
literals,
|H| = 3^{n}
m >= (1/epsilon)(n ln 3 + ln(1/delta))
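The count of 3^{n} hypotheses arises because each of the n boolean variables can appear as a positive literal, appear negated, or be absent from the conjunction. As a worked instance of the bound, assume n = 10, epsilon = 0.1, and delta = 0.05 (illustrative values, not from the text):

```latex
\begin{align*}
m &\ge \frac{1}{\epsilon}\left(n \ln 3 + \ln\frac{1}{\delta}\right)
   = \frac{1}{0.1}\left(10 \ln 3 + \ln 20\right) \\
  &\approx 10\,(10.986 + 2.996) \approx 139.8
\end{align*}
```

so under these assumptions 140 examples suffice, a number that grows only linearly in n.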
For the class C of unbiased concepts
|H| = 2^{2^{n}}
m >= (1/epsilon)(2^{n} ln 2 + ln(1/delta))
The required number of examples grows exponentially in n,
so this class is not PAC-learnable.
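k-term DNF is not PAC-learnable if H = k-term DNF
(finding a consistent hypothesis is computationally intractable), but it becomes
PAC-learnable if H = k-CNF, since every k-term DNF expression can be
rewritten as a k-CNF expression.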
Given an infinite hypothesis space H for learning a concept,
the learning system which learns a set of m random instances
is probably approximately correct with probability at least 1 - delta
and error at most epsilon if
m >= (4 log_{2}(2/delta) + 8 VCdim(H) log_{2}(13/epsilon)) / epsilon
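The VC-dimension bound can be evaluated in the same way as the finite-space bounds. This sketch assumes the logarithms are base 2 and uses illustrative values for epsilon and delta; the function name is hypothetical.

```python
import math

def vc_sample_size(vc_dim, epsilon, delta):
    """Smallest integer m satisfying the VC-dimension bound (base-2 logs assumed)."""
    bound = (4 * math.log2(2 / delta)
             + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon
    return math.ceil(bound)

# A single perceptron with two inputs has VCdim = 3.
# epsilon = 0.1 and delta = 0.05 are illustrative choices.
m = vc_sample_size(3, 0.1, 0.05)
print(m)  # 1899
```

The bound grows linearly in VCdim(H), so the VC dimension plays the role that ln|H| plays for finite hypothesis spaces.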
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Generalization Error
For finite hypothesis spaces,
P[True-error(h) > training-error(h) + epsilon]
<= e^{-2m*epsilon^{2}}
Thus,
P[there exists h in H, True-error(h) > training-error(h) + epsilon]
<= |H| e^{-2m*epsilon^{2}}
For infinite hypothesis spaces, use the bounds based on VC-dimensions
under uniform convergence (in the worst-case scenario).
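For the finite case, the union bound can be turned around: setting |H| e^{-2m epsilon^2} equal to delta and solving for epsilon gives the gap that holds, with probability at least 1 - delta, between true error and training error for every h in H. A minimal sketch, with hypothetical function name and values:

```python
import math

def generalization_gap(h_size, m, delta):
    """epsilon that solves |H| * exp(-2 * m * epsilon^2) = delta."""
    return math.sqrt((math.log(h_size) + math.log(1 / delta)) / (2 * m))

# Hypothetical setting: |H| = 1000, m = 500 training examples, delta = 0.05
gap = generalization_gap(1000, 500, 0.05)
print(round(gap, 3))

# Quadrupling the number of examples halves the gap.
print(round(generalization_gap(1000, 2000, 0.05) / gap, 3))
```

The gap shrinks at the rate 1/sqrt(m), which is slower than the 1/m rate available to a consistent learner.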
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
VC-dimension of a Computing Neural Network
The VCdim of a one-hidden-layer perceptron
with full connectivity between the layers
is in the range
(Baum and Haussler 1989, Hush and Horne 1993)
2[N_{h}/2]d <= VCdim <= 2N_{w} log(eN_{n})
where [*] is the floor operation that returns
the largest integer less than or equal to its argument,
N_{h} is the number of hidden units, N_{w} is the
total number of weights in the network, N_{n} is the total
number of nodes in the network, e is the base of the natural
logarithm, and d is the number of input units.
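The two bounds can be computed for a concrete network. This sketch rests on several assumptions not stated in the text: the logarithm is taken base 2, the weight count includes one bias per hidden and output unit, the node count covers the hidden and output units only, and the network has a single output. The network size and function name are hypothetical.

```python
import math

def vc_bounds(n_hidden, d, n_weights, n_nodes):
    """Lower and upper bounds on the VCdim of a one-hidden-layer perceptron."""
    lower = 2 * (n_hidden // 2) * d  # // implements the floor operation [*]
    upper = 2 * n_weights * math.log2(math.e * n_nodes)  # base-2 log is an assumption
    return lower, upper

# Hypothetical network: d = 10 inputs, 5 hidden units, 1 output unit.
d, n_hidden = 10, 5
# Assumed counting convention: weights include a bias for each hidden and output unit.
n_weights = (d + 1) * n_hidden + (n_hidden + 1)
# Assumed counting convention: nodes are the hidden units plus the output unit.
n_nodes = n_hidden + 1

lower, upper = vc_bounds(n_hidden, d, n_weights, n_nodes)
print(lower, round(upper, 1))
```

Under these assumptions the lower bound is 40, and the gap between the two bounds shows how loosely the capacity of even a small network is pinned down.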
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Summary
- What is PAC learning?
- What is the difference between a consistent learner
and an agnostic learner?
- How to estimate sample complexity for finite hypothesis
spaces?
- How to estimate sample complexity for infinite hypothesis
spaces?
- What is the problem with the sample complexity estimated
based on PAC learning?