|
 |
 |
 |
Research Interest
|
Alin Dobra
Assistant Professor CISE Department University of Florida |
I have a long standing interest in applying statistical tools and
techniques (EM algorithm, Variational Analysis, etc.) to learning
problems in Machine Learning/Datamining. Most of my recent work with
my thesis advisor Johannes Gehrke explored
how this techniques can be applied in the context of Classification
and Regression trees.
Starting with my visit as a summer intern at Bell-Labs in Summer 2001,
I became interested in designing randomized algorithms for computing
good and guaranteeable approximations to complex SQL queries over
streaming data.
I am actively pursuing both these research interests at the moment. A
more detailed description of each follows.
Classification and Regression Tree
Construction
- Bias Correction in Classification Tree
Construction. In the absence of significant correlations
between predictor attributes and predicted attribute in classification
tree construction there is a strong tendency exhibited by most of the
split selection methods to choose the split attribute that takes more
values. Such a tendency, especially if it is pronounced, is
undesirable for a number of reasons including nonuniformity in the
search for the solution and the fact that non predictive attributes
with large domains are preferred to slightly predictive but with small
domain attributes.
The progress we have made in this area is to give a universal method
for bias correction for any split selection method that is
statistically grounded. We also used the general technique to correct
the bias of the gini gain for k-ary splits, i.e. splits in which there
is an outgoing branch for each possible value of the attribute.
A desirable extension of this work is to binary splits (CART type of
classification trees). This is a nontrivial extension since it
essentially involved building a good approximation of the gini gain for
pure noise. We plan to address this problem in future work.
- Scalable Linear Regression Trees.
Regression trees, the generalization of classification trees for
continuous prediction, compare favorably with other regressors and in
the same time are very appealing for untrained users due to their
intuitive nature. For complex problems it is desirable to have more
complex predictors in leaves, in particular linear predictors, but in
the same time keep the properties that make regression trees
appealing.
Scalable regression tree algorithms for the case when the predictors
in leaves are constants can be easily obtained with minor changes of
scalable classification tree algorithm. This is not the case though
for regression trees with linear models in the leaves for which the
scalability problem is much harder. The main idea of our algorithm,
that provides a scalable algorithm for the construction of this
regressors, is to use the EM algorithm for Gaussian mixtures, for each
intermediate node, and then to locally transform the regression
problem into a classification problem based on closeness to these
clusters. A side benefit of this approach, apart from providing a
principled way to speed up the learning, is the fact that good oblique
splits can be easily determined.
- Probabilistic Classification and Regression
Trees. A problem with regular classification and regression
trees is the fact that, once the split in a node is decided, any
datapoint will contribute to only one of the subtrees of the node,
thus leading to exponential data fragmentation. This means that
usually the number of samples available for decisions in the leaves is
small which leads to small sample problems, i.e. decisions will have
large statistical errors. Also, the prediction of classic trees has
sharp discontinuities on the decision boundaries (split points) which
is in itself a problem for some application domains. We plan to
address all these problems with a variation of the regular
classification and regression trees that uses information in all the
datapoints to take any of the decisions but weights the importance of
the datapoints based on closeness to the split point. The challenge is
to preserve the efficiency of the tree construction algorithms which
such a process and to give good statistical justifications for the
chosen model. This is mostly work in progress.
Streaming Algorithms
|