Research Ideology: I am more interested in understanding and explaining why things work than in blindly developing new techniques for a particular problem (for clustering alone, for example, there are more than 100 algorithms). If a new algorithm crops up from the analysis or understanding, then I am all for it (e.g. Support Vector Machines came from Vapnik-Chervonenkis theory, and Boosting was inspired by the Probably Approximately Correct framework). I believe that understanding the essence behind techniques is essential to making progress: if and when a technique fails, we are then in a significantly better position to find a fix if one exists, or to conclude that none does. This can be very hard to do if we do not understand the thing we are dealing with.
Specific Research: My primary research area is Machine Learning/Data Mining. I am interested in bridging the gap between statistics and Data Mining/Machine Learning, which will hopefully lead to improved decision making in the choice of models for the application at hand.
Current State: Dr. Alin Dobra and I have developed a novel approach for analyzing classification models and model selection measures in the non-asymptotic regime. The approach encompasses ways of accurately and efficiently computing the moments of the generalization error for the classification algorithm at hand. The moments are taken over the space of classifiers induced by the classification algorithm and samples of size N from a given or built underlying distribution. We have thus far applied the methodology to analyze the Naive Bayes classifier, Random Decision Trees and the K-Nearest Neighbor algorithm. Model selection measures such as cross-validation, leave-one-out and hold-out-set estimation can also be studied in this framework. We have not only provided a framework to study these algorithms but have also suggested ways to do this efficiently (using non-linear optimization techniques) while maintaining high accuracy.
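To make "moments of the generalization error over samples of size N" concrete, here is a minimal simulation sketch. It is not the method described above, which computes the moments accurately and efficiently rather than by sampling; it is a plain Monte Carlo stand-in. The joint distribution JOINT, the majority-vote table classifier and all function names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical joint distribution P(x, y): one binary attribute x, binary label y.
JOINT = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.10, (1, 1): 0.40}


def draw_sample(n, rng):
    """Draw n i.i.d. (x, y) pairs from JOINT."""
    cells = list(JOINT)
    weights = [JOINT[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)


def train(sample):
    """Toy classifier: for each x, predict the label seen most often (ties -> 0)."""
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for x, y in sample:
        counts[(x, y)] += 1
    return {x: int(counts[(x, 1)] > counts[(x, 0)]) for x in (0, 1)}


def generalization_error(classifier):
    """Exact error of one fixed classifier under JOINT."""
    return sum(p for (x, y), p in JOINT.items() if classifier[x] != y)


def ge_moments(n, trials=2000, seed=0):
    """Monte Carlo estimate of the mean and variance of the generalization
    error over classifiers induced by training samples of size n."""
    rng = random.Random(seed)
    errors = [generalization_error(train(draw_sample(n, rng)))
              for _ in range(trials)]
    mean = sum(errors) / trials
    second_moment = sum(e * e for e in errors) / trials
    return mean, second_moment - mean * mean


if __name__ == "__main__":
    for n in (2, 10, 200):
        print(n, ge_moments(n))
```

Each training sample induces one classifier, each classifier has one exact generalization error under the known distribution, and the moments summarize how that error varies across samples of the same size.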
Future Work: This includes a) analyzing more classification algorithms in the framework and b) enhancing scalability. For example, extending the analysis of Random Decision Trees to other, deterministic decision tree algorithms is not difficult, but making it scalable is a challenge.
Current State: In collective classification, instances are classified taking into account the class labels of related instances. This differs from traditional classification, where instances are classified based only on their own set of attributes. A natural question that arises is: when should collective classification be preferred to traditional classification? I have tried to answer this question by providing necessary conditions for collective classification to outperform traditional classification. Moreover, I have derived distribution-free bounds for this setting. A distinctive feature of the derived bounds is that they depend on the degree of auto-correlation between interacting data points. The bounds also help in estimating the effective sample size (ESS), which is the equivalent independent sample size of a larger correlated sample.
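As a small worked illustration of what an effective sample size means, the sketch below uses the standard textbook special case of n observations with equal variance and a common pairwise correlation rho, where the variance of the sample mean matches that of n / (1 + (n - 1) * rho) independent observations. This is not the estimator obtained from the bounds described above; the function name and the restriction to 0 <= rho <= 1 are assumptions made for the example.

```python
def effective_sample_size(n, rho):
    """ESS of n equicorrelated observations (common pairwise correlation rho),
    defined by matching the variance of the sample mean to that of an
    independent sample. Textbook special case, for illustration only."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch assumes 0 <= rho <= 1")
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    for rho in (0.0, 0.1, 0.5, 1.0):
        print(rho, effective_sample_size(100, rho))
```

With rho = 0 the sample counts in full, with rho = 1 it is worth a single observation, and even a modest rho = 0.1 reduces 100 correlated points to roughly 9 independent ones under this model.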
Future Work: To find sufficient conditions for collective classification to outperform traditional classification. To derive bounds that are much tighter than those already derived while still taking into account the degree of auto-correlation between interacting instances. A more interesting problem is to build a principled approach (probably based on information theory) to estimate ESS, so that the theory developed for i.i.d. data (viz. statistical bounds etc.) also becomes applicable to correlated data.
In the final year of my undergraduate studies I did a research project at the Center for Development of Advanced Computing (C-DAC). We developed a proprietary character recognition algorithm for printed Devanagari script documents. Devanagari script has many more characters than the Roman script. Moreover, these characters can be combined to produce composite characters, which makes efficient and accurate recognition challenging. The algorithm used fuzzy class membership.
The algorithm was integrated into their commercial Optical Character Recognition (OCR) software product, "Chitrankan". We also won the second prize at Concepts 2004, a reputed national-level final-year undergraduate project competition sponsored by Microsoft, Cybage, Calsoft and a few other companies.