Home
Teaching
Academics
Papers
Research

transparent transparent transparent
Research Interest
Alin Dobra
Assistant Professor
CISE Department
University of Florida

I have a long standing interest in applying statistical tools and techniques (EM algorithm, Variational Analysis, etc.) to learning problems in Machine Learning/Datamining. Most of my recent work with my thesis advisor Johannes Gehrke explored how this techniques can be applied in the context of Classification and Regression trees.

Starting with my visit as a summer intern at Bell-Labs in Summer 2001, I became interested in designing randomized algorithms for computing good and guaranteeable approximations to complex SQL queries over streaming data.

I am actively pursuing both these research interests at the moment. A more detailed description of each follows.

Classification and Regression Tree Construction

  • Bias Correction in Classification Tree Construction. In the absence of significant correlations between predictor attributes and predicted attribute in classification tree construction there is a strong tendency exhibited by most of the split selection methods to choose the split attribute that takes more values. Such a tendency, especially if it is pronounced, is undesirable for a number of reasons including nonuniformity in the search for the solution and the fact that non predictive attributes with large domains are preferred to slightly predictive but with small domain attributes.

    The progress we have made in this area is to give a universal method for bias correction for any split selection method that is statistically grounded. We also used the general technique to correct the bias of the gini gain for k-ary splits, i.e. splits in which there is an outgoing branch for each possible value of the attribute.

    A desirable extension of this work is to binary splits (CART type of classification trees). This is a nontrivial extension since it essentially involved building a good approximation of the gini gain for pure noise. We plan to address this problem in future work.

  • Scalable Linear Regression Trees. Regression trees, the generalization of classification trees for continuous prediction, compare favorably with other regressors and in the same time are very appealing for untrained users due to their intuitive nature. For complex problems it is desirable to have more complex predictors in leaves, in particular linear predictors, but in the same time keep the properties that make regression trees appealing.

    Scalable regression tree algorithms for the case when the predictors in leaves are constants can be easily obtained with minor changes of scalable classification tree algorithm. This is not the case though for regression trees with linear models in the leaves for which the scalability problem is much harder. The main idea of our algorithm, that provides a scalable algorithm for the construction of this regressors, is to use the EM algorithm for Gaussian mixtures, for each intermediate node, and then to locally transform the regression problem into a classification problem based on closeness to these clusters. A side benefit of this approach, apart from providing a principled way to speed up the learning, is the fact that good oblique splits can be easily determined.

  • Probabilistic Classification and Regression Trees. A problem with regular classification and regression trees is the fact that, once the split in a node is decided, any datapoint will contribute to only one of the subtrees of the node, thus leading to exponential data fragmentation. This means that usually the number of samples available for decisions in the leaves is small which leads to small sample problems, i.e. decisions will have large statistical errors. Also, the prediction of classic trees has sharp discontinuities on the decision boundaries (split points) which is in itself a problem for some application domains. We plan to address all these problems with a variation of the regular classification and regression trees that uses information in all the datapoints to take any of the decisions but weights the importance of the datapoints based on closeness to the split point. The challenge is to preserve the efficiency of the tree construction algorithms which such a process and to give good statistical justifications for the chosen model. This is mostly work in progress.

Streaming Algorithms

© 2004 Alin Dobra email: adobra at cise dot ufl dot edu