Specific Research: My primary research area is Machine Learning/Data mining. I am mainly interested in two things: a) bridging the gap between statistics and Data mining/Machine Learning and b) developing methods/algorithms that can be deployed in real-world applications. I realize that these two goals are somewhat antithetical from the theory/applications perspective, but much less so from the usefulness/progress perspective.
Dr. Alin Dobra and I have developed a novel approach for analyzing classification models and model selection measures in the non-asymptotic regime. The approach encompasses ways of accurately and efficiently computing the moments of the generalization error for the classification algorithm at hand. The moments are taken over the space of classifiers induced by the classification algorithm and by samples of size N drawn from a given or constructed underlying distribution. We have thus far applied the methodology to analyze the Naive Bayes classifier, Random Decision Trees and the K Nearest Neighbor algorithm. Model selection measures such as cross-validation, leave-one-out and hold-out-set estimation can also be studied in this framework. We have not only provided a framework to study these algorithms but have also suggested ways to do this efficiently (using non-linear optimization techniques) while maintaining high accuracy.
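To make concrete what "moments of the generalization error over samples of size N" means, here is a minimal brute-force sketch. It is not the efficient computation described above: it estimates the first two moments by plain Monte Carlo simulation. The joint distribution `JOINT`, the majority-vote table classifier and all function names are assumptions chosen only for illustration.

```python
import random

# Hypothetical joint distribution P(x, y): one discrete attribute x, binary label y.
JOINT = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.10, (1, 1): 0.20,
    (2, 0): 0.05, (2, 1): 0.25,
}


def draw_sample(n, rng):
    """Draw n i.i.d. (x, y) pairs from JOINT."""
    cells = list(JOINT)
    weights = [JOINT[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)


def train_majority(sample):
    """Illustrative classifier: predict the majority label seen for each x."""
    counts = {}
    for x, y in sample:
        counts.setdefault(x, [0, 0])[y] += 1
    # Ties are broken in favour of label 0.
    return {x: int(c[1] > c[0]) for x, c in counts.items()}


def generalization_error(classifier):
    """Exact error under JOINT; values of x not seen in training are predicted as 0."""
    return sum(p for (x, y), p in JOINT.items() if classifier.get(x, 0) != y)


def error_moments(n, trials=2000, seed=0):
    """Monte Carlo estimate of the mean and variance of the generalization
    error over training samples of size n."""
    rng = random.Random(seed)
    errors = [
        generalization_error(train_majority(draw_sample(n, rng)))
        for _ in range(trials)
    ]
    mean = sum(errors) / trials
    second_moment = sum(e * e for e in errors) / trials
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    for n in (5, 20, 100):
        mean, var = error_moments(n)
        print(f"N={n}: mean error={mean:.4f}, variance={var:.6f}")
```

In this toy setting the mean error can never fall below the Bayes error of the assumed distribution (0.25), and simulation of this kind becomes expensive quickly, which is the cost an analytical computation of the moments is meant to avoid.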
Future work in this area includes a) analyzing more classification algorithms in the framework and b) enhancing scalability. For example, extending the analysis of Random Decision Trees to other, deterministic decision tree algorithms is not difficult, but making it scalable is a challenge.
Current State: In collective classification, instances are classified taking into account the class labels of related instances. This differs from traditional classification, where instances are classified based only on their own set of attributes. A natural question that arises is: when should collective classification be preferred to traditional classification? I have tried to answer this question by providing necessary conditions for collective classification to outperform traditional classification. Moreover, I have derived distribution-free bounds for this setting. A unique feature of these bounds is that they depend on the degree of auto-correlation between interacting datapoints. The bounds also help in estimating the effective sample size (ESS), which is the size of the independent sample that is equivalent to a larger correlated sample.
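As a small illustration of the ESS idea, the sketch below uses the standard textbook formula for an equicorrelated sample, in which every pair of observations shares the same correlation rho: the variance of the sample mean is inflated by a factor of 1 + (n - 1) * rho, so ESS = n / (1 + (n - 1) * rho). This is an assumption made for illustration only and is not the estimate obtained from the bounds described above; the function name is hypothetical.

```python
def effective_sample_size(n, rho):
    """ESS of n observations with a common pairwise correlation rho.

    Assumes the equicorrelated model (illustrative only): the variance of the
    sample mean is (sigma^2 / n) * (1 + (n - 1) * rho), so the sample is worth
    n / (1 + (n - 1) * rho) independent observations.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1] for this sketch")
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    for rho in (0.0, 0.1, 0.5, 1.0):
        print(f"n=100, rho={rho}: ESS={effective_sample_size(100, rho):.2f}")
```

With rho = 0 the sample keeps its full size, and with rho = 1 it is worth a single independent observation, which matches the intuition that stronger auto-correlation leaves less independent information.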
In the future I would like to work out sufficient conditions for collective classification to outperform traditional classification. I would also like to derive bounds much tighter than the existing ones while still taking into account the degree of auto-correlation between interacting instances. A more interesting problem is to build a principled approach (probably based on information theory) to estimate ESS, which would make the theory developed for i.i.d. data (viz. statistical bounds etc.) applicable to correlated data as well.
Time series forecasting is a well-studied problem in machine learning/data mining and statistics. In the literature, usually a single time series is given and the goal is to forecast it for the next few steps. The problems I have been dealing with involve multiple time series obtained from a complex industrial process, and the goal is to keep predicting a chosen one for as many steps as possible. In some cases this chosen, or target, time series may never be measured for the steps we are predicting, and hence a standard online learning approach may not apply. Moreover, there are temporarily unstable and weak relationships between the target and the input time series, with the target drifting over time. The input time series may also be sampled at lower frequencies than the target. I have done some initial work that addresses these issues reasonably well; however, a lot more could be done.
Building any kind of theory for clustering that will aid practitioners is hard because the "goodness" of the final clustering depends on what it will be used for. This "goodness" is a vague, intuitive notion which changes depending on the situation and is hence extremely challenging to formalize precisely. In more technical terms, the clustering quality measures that one cares about change from application to application. Hence, it is hard to say that a particular set of clustering quality measures is the best choice for all problems. Given that the space of clustering problems is so irregular (in terms of what is good), a general theory that is oblivious to these differences would most likely not be of much use. That is why I feel that a different approach to mapping the space is required.
One way of mapping the space could be to create a set of canonical problems that correspond to important and sufficiently varied application domains. Researchers can then focus on these canonical problems and develop theory (viz. the best clustering quality measures, the best clustering algorithms etc.) for each of them. A practitioner can then map their specific problem to the closest canonical problem and use the suggested algorithms or evaluation techniques.
A major portion of the current effort towards bringing some kind of sanity to clustering follows what I consider a bottom-up approach, where specific algorithms are studied and characterized (i.e. best algorithm under some conditions, etc.) in the hope that those conditions are something practitioners will want or know in their problem. This is easier from an academic standpoint but harder to connect with reality. The approach that I have described is more of a top-down approach, where you identify the conditions that matter a priori and then try to develop theory that exactly or approximately satisfies those conditions. This will be closer to reality, though it might be more of a challenge for researchers. But isn't that what we (researchers) are here for? Moreover, I believe that such an approach would lead to a more focused effort on what practitioners care about, rather than research that is solely of academic interest.
In the final year of my undergrad I did a research project at the Center for Development of Advanced Computing (C-DAC). We developed a proprietary character recognition algorithm for printed Devanagari script documents. Devanagari script has many more characters than the Roman script. Moreover, these characters can be combined to produce composite characters, which makes efficient and accurate recognition challenging. The algorithm used fuzzy class membership.
The algorithm was integrated into their commercial Optical Character Recognition (OCR) software product, "Chitrankan". We also won the 2nd prize at Concepts 2004, which is a reputed national-level final-year undergraduate project competition sponsored by Microsoft, Cybage, Calsoft and a few other companies.