Decision Tree Learning
Li M. Fu
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Introduction
- It is one of the most widely used and practical methods for inductive inference.
- It is a method for approximating discrete-valued functions.
- It is capable of learning disjunctive expressions.
- It incompletely searches a complete hypothesis space.
- Its inductive bias favors small trees.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Topics
- What problems are appropriate?
- The algorithm
- The information theoretical criterion
- How to deal with a large data base?
- The hypothesis space
- Inductive bias
- Issues:
  - Overfitting the data
  - Continuous attributes
  - The entropy criterion
  - Missing attributes
  - Attributes with different costs
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Appropriate Problems
- Attribute-value representation
- Discrete output values (inputs can be continuous)
- Disjunctive concepts
- Noise and errors
- Missing information
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Attribute Selection Based on Entropy
Select the feature that minimizes the entropy function H.
For a feature, the entropy H_{j} is calculated over the instances taking each value j.
The sum of these entropies, weighted by the probability of each value,
is the entropy for that feature.
H = Sum_{j} (p_{j} * H_{j})
H_{j} = Sum_{i}(- p_{i} * log p_{i})
where p_{j} is the probability of the jth value of the feature and
p_{i} is the probability of the ith class among the instances with that value.
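The criterion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the dictionary-based row format, and the toy data are all assumptions made for the example, and the logarithm is taken base 2.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H_j = Sum_i(-p_i * log2 p_i) over the class labels of one subset."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def feature_entropy(rows, labels, feature):
    """H = Sum_j(p_j * H_j): weighted entropy after splitting on `feature`."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[feature]].append(label)
    total = len(labels)
    return sum(len(s) / total * entropy(s) for s in subsets.values())

# Hypothetical toy data (not from the slides).
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

# Select the feature with the smallest weighted entropy.
best = min(rows[0], key=lambda f: feature_entropy(rows, labels, f))
```

Here "outlook" separates the two classes perfectly, so its weighted entropy is zero and it is selected; "windy" leaves both subsets evenly mixed.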
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Decision Tree Learning Method
ID3 Learning Method
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ID3 on Large Data Sets
- (1) Select a random subset W (called the ``window'')
from the training set.
- (2) Build a decision tree for the current window.
  - Select the feature that minimizes the entropy function H.
  - Categorize training instances into subsets by this feature.
  - Repeat this process recursively until each subset
    contains instances of one kind (class)
    or some statistical criterion is satisfied.
- (3) Scan the entire training set for exceptions to
the decision tree.
- (4) If exceptions are found, insert some of them
into W and repeat from step 2. The insertion may be done
either by replacing some of the existing instances
in the window or by augmenting it with the new exceptions.
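Steps (1) to (4) can be sketched as follows. This is a rough sketch under stated assumptions: it uses the augmenting variant of step 4, stops only when subsets are pure or no features remain (no statistical criterion), and assumes class labels are plain strings. The names `build_tree`, `classify`, `id3_with_window`, `window_size`, `max_add`, and `max_rounds` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def split(examples, feature):
    """Group (row, label) pairs by the row's value for `feature`."""
    groups = defaultdict(list)
    for row, label in examples:
        groups[row[feature]].append((row, label))
    return groups

def build_tree(examples, features):
    """Step 2: returns a class label (leaf) or a (feature, branches, default) tuple."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def weighted(f):
        return sum(len(g) / len(examples) * entropy([l for _, l in g])
                   for g in split(examples, f).values())

    best = min(features, key=weighted)          # feature minimizing H
    rest = [f for f in features if f != best]
    branches = {v: build_tree(g, rest) for v, g in split(examples, best).items()}
    return (best, branches, majority)

def classify(tree, row):
    while isinstance(tree, tuple):
        feature, branches, default = tree
        tree = branches.get(row[feature], default)  # unseen value: majority class
    return tree

def id3_with_window(training, features, window_size, max_add, seed=0, max_rounds=100):
    rng = random.Random(seed)
    window = rng.sample(training, min(window_size, len(training)))  # step 1
    for _ in range(max_rounds):
        tree = build_tree(window, features)                         # step 2
        exceptions = [ex for ex in training
                      if classify(tree, ex[0]) != ex[1]]            # step 3
        if not exceptions:
            break
        window += exceptions[:max_add]                              # step 4 (augment)
    return tree

# Hypothetical toy data: all 8 combinations of three binary features.
features = ["a", "b", "c"]
training = [({"a": a, "b": b, "c": c}, "pos" if (a and b) or c else "neg")
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]
tree = id3_with_window(training, features, window_size=3, max_add=2)
errors = sum(classify(tree, row) != label for row, label in training)
```

The `max_rounds` cap is a safeguard added for the sketch: with noisy or contradictory training data, exceptions may never disappear, so the loop needs some other way to stop.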
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Hypothesis Space Search
- Complete hypothesis space (because of disjunctive representation)
- Incomplete search
- No backtracking
- Non-incremental (but can be modified to be incremental)
- Uses statistical information from the whole set of training examples for node selection
(less sensitive to noise than the version space (VS) approach in this sense)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Inductive Bias
- Shorter trees are preferred.
- Attributes with higher information gain are selected first in
tree construction.
- Preference bias (in contrast to the restriction bias of the VS approach)
- Why prefer short hypotheses? Occam's razor (contentious?)
Generalization?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Overfitting to the Training Data
- For a hypothesis fit to the training data, the training error is
typically smaller than the test error.
- Solutions:
- Early stopping
- Validation sets
- Statistical criterion for continuation (of the tree)
- Post-pruning
- Minimum description length (cost function = error + complexity)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Pruning Techniques
- Reduced error pruning (of nodes)
- Rule post-pruning
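Reduced error pruning can be illustrated with a small sketch. It assumes a simple bottom-up variant: a decision node is replaced by a leaf labelled with its majority class whenever that does not increase the error on the validation examples reaching the node. The tree layout (nested dictionaries with "feature", "branches", and "majority" keys), the function names, and the toy data are assumptions made for the example.

```python
def classify(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["majority"])
    return tree

def errors(tree, examples):
    return sum(classify(tree, row) != label for row, label in examples)

def prune(tree, validation):
    """Return a pruned copy of `tree`; the input tree is not modified."""
    if not isinstance(tree, dict):
        return tree
    feature = tree["feature"]
    pruned_branches = {}
    for value, subtree in tree["branches"].items():
        # Only validation examples that reach this branch judge its subtree.
        reaching = [(r, l) for r, l in validation if r[feature] == value]
        pruned_branches[value] = prune(subtree, reaching)
    candidate = dict(tree, branches=pruned_branches)
    leaf = tree["majority"]
    # Keep the leaf if it is no worse than the subtree on the validation data.
    if errors(leaf, validation) <= errors(candidate, validation):
        return leaf
    return candidate

# Hypothetical tree in which the "windy" split was fit to noise.
tree = {
    "feature": "outlook", "majority": "yes",
    "branches": {
        "sunny": {"feature": "windy", "majority": "yes",
                  "branches": {"no": "yes", "yes": "no"}},
        "rain": "no",
    },
}
validation = [
    ({"outlook": "sunny", "windy": "no"}, "yes"),
    ({"outlook": "sunny", "windy": "yes"}, "yes"),
    ({"outlook": "rain", "windy": "no"}, "no"),
    ({"outlook": "rain", "windy": "yes"}, "no"),
]
pruned = prune(tree, validation)
```

In this example the "windy" node collapses to the leaf "yes", while the root split on "outlook" is kept because replacing it would increase the validation error. Note that in this variant a node reached by no validation examples is always pruned.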
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ID3 versus C4.5 (C5)
C4.5 extension
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Other Issues:
- Continuous-valued attributes
- Gain ratio instead of Gain
- Missing attribute values
- Attributes with different costs
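Two of these issues can be made concrete with a short sketch: the gain ratio (information gain divided by the split information, i.e. the entropy of the attribute's own values) and threshold selection for a continuous attribute. The function names and the toy data are assumptions for illustration; the threshold search tries midpoints between adjacent sorted values whose classes differ, which is one common approach.

```python
import math
from collections import Counter

def entropy(items):
    total = len(items)
    return -sum(c / total * math.log2(c / total) for c in Counter(items).values())

def gain(values, labels):
    """Information gain of splitting `labels` by the parallel list `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Gain divided by split information; 0.0 if the attribute has one value."""
    split_info = entropy(values)
    return gain(values, labels) / split_info if split_info > 0 else 0.0

def best_threshold(numbers, labels):
    """Pick the candidate threshold t maximizing the gain of the test x > t."""
    pairs = sorted(zip(numbers, labels))
    candidates = [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:])
                  if la != lb and a != b]
    return max(candidates,
               key=lambda t: gain([x > t for x in numbers], labels),
               default=None)

# Hypothetical toy data.
temperature = [40, 48, 60, 72, 80, 90]
labels = ["no", "no", "yes", "yes", "yes", "no"]
record_id = [1, 2, 3, 4, 5, 6]          # many-valued attribute, useless for prediction

threshold = best_threshold(temperature, labels)
is_warm = [x > threshold for x in temperature]
```

On this data plain gain prefers `record_id`, because a split into six singleton subsets is trivially pure, whereas the gain ratio prefers the binary test `is_warm`. This is the issue with multi-valued attributes that the gain ratio is meant to correct.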
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Summary
- Why is it a practical method?
- How does it differ from the version space approach?
- What is its inductive bias?
- How should the entropy criterion be modified for
multi-valued attributes? What is the issue here?
- How to avoid overfitting the data?
- How to deal with a large data base?