Welcome

Hello! I am currently a Post-doctoral Research Scientist in the Computer and Information Science and Engineering department at the University of Florida. Previously, I was a Pre-doctoral Fellow and a Ph.D. student at UF.

I work with Dr. Sanjay Ranka and Dr. Chris Jermaine (now at Rice University), pursuing research in data mining and machine learning. My recent work has focused on designing innovative mixture modeling frameworks that can capture interesting patterns in subspaces of high-dimensional data, and novel regression-based mixture models that allow all model parameters to "evolve" with respect to a regressor variable such as time.

I also collaborate with the Bioinformatics research group and the Wireless research group in the CISE department on research problems that involve modeling and machine learning.

LinkedIn Facebook Twitter

Schedule

Office: CSE 402.
My scheduled research meetings are on Mondays at 11am and on Thursdays at 11:30am in CSE 432.
If you need to meet me at any other time, please email me to set up an appointment.

Disclaimer

The opinions expressed on this website are not endorsed by, sponsored by, or provided on behalf of the Computer and Information Science and Engineering department at the University of Florida. All the facts on this page are true to the best of the author's knowledge.

Compatibility

Your web browser needs to support HTML 4.01, CSS 2, and JavaScript to view this website correctly.

Overview of Academic Profile

At present, I am a post-doctoral research scientist with the Computer and Information Science and Engineering department at the University of Florida. My research and development interests are in the fields of data mining and machine learning. In the last few years, I have investigated novel probabilistic generative mixture models.

My formal education is as follows:

  • Doctor of Philosophy in Computer Engineering from the University of Florida.
    Dissertation: Novel Mixture Models to Learn Complex and Evolving Patterns in High-dimensional Data.
    Adviser: Dr. Sanjay Ranka. Co-adviser: Dr. Christopher Jermaine.
  • Master of Science in Computer Networking from North Carolina State University
  • Bachelor of Engineering in Electronics and Communications from Nirma Institute of Technology, Gujarat University

During my time as a graduate student, I also gained extensive hands-on teaching experience by delivering lectures, leading supplementary teaching sessions and lab sessions, and designing and grading assignments and exams for undergraduate and graduate courses.

A more detailed version of my resume is available in PDF and HTML formats.

Research Projects

Data Mining and Machine Learning

  • Piecewise Linear Regression Mixture Models:
    • Developed a novel probabilistic mixture model that allows both the mixing proportions and the mixture components to evolve with respect to regressors such as time in a piecewise linear fashion.
    • Developed a Bayesian learning algorithm using Markov chain Monte Carlo methods (Gibbs sampling) to simulate the joint distribution of the random variables and the data in the proposed model.
    • Prototyped the learning algorithm in Matlab and implemented it in C.
    • Validated the modeling framework using real-world datasets that evolve with respect to time.
  • Probabilistic Weighted Ensemble of Roles Models:
    • Developed a Bayesian probabilistic generative mixture model that allows multiple sources to interact and influence data generation in an intuitive and meaningful way.
    • Developed a Markov chain Monte Carlo learning algorithm based on Gibbs sampling.
    • Improved run-time performance by at least 10x using a highly accurate approximation.
    • Designed an innovative scheme to quantify the importance of data attributes using KL divergence.
    • Prototyped the learning algorithm using Matlab. Implemented a parallelized version using C.
    • Validated the modeling framework using real-world high-dimensional (1000+ attribute) datasets.
  • Subset Mixture Models:
    • Developed an alternative to standard mixture modeling that allows a set of components to generate a data point as opposed to a single component.
    • Developed a Monte Carlo Expectation Maximization learning algorithm, based on stratified sampling, to perform maximum likelihood estimation of the model parameters given a dataset.
    • Designed a novel scheme, based on user-supplied constraints, to limit model components to data subspaces.
    • Prototyped the learning algorithm in Matlab and implemented it in C.
    • Validated the modeling framework using real-world high-dimensional (100+ attribute) datasets.
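
As a rough illustration of how KL divergence can be used to score attribute importance, the sketch below compares a mixture component's distribution over each attribute with a background distribution and ranks attributes by the resulting divergence. It is a minimal example under stated assumptions, not the scheme from the projects above: the attributes are assumed to be discrete, and the names `kl_divergence`, `rank_attributes`, `component`, and `background` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    Both arguments are sequences of probabilities over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                # p puts mass where q has none: the divergence is infinite.
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def rank_attributes(component, background):
    """Rank attributes by how far the component departs from the background.

    Assumption for this sketch: each argument maps an attribute name to a
    discrete distribution, and both share the same attribute names.
    """
    scores = {
        name: kl_divergence(component[name], background[name])
        for name in component
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: attribute "a" departs from the background, attribute "b" does not.
component = {"a": [0.9, 0.1], "b": [0.5, 0.5]}
background = {"a": [0.5, 0.5], "b": [0.5, 0.5]}
ranking = rank_attributes(component, background)
```

In this toy example, attribute "a" ranks first because the component's distribution over it differs from the background, while "b" scores zero.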

Applied Statistical Modeling and Machine Learning

  • Bioinformatics and Computational Biology:
    • Collaborated with the bioinformatics research group to develop graphical models that can learn the direct and indirect effects of an external stimulus on a set of related attributes in a dataset.
    • Used a Markov random field as a prior in a Bayesian framework to capture the relationships and interactions among the attributes, and to model their interaction with the external stimulus.
    • Developed an optimization scheme based on Iterated Conditional Modes (ICM) to learn the directly affected attributes.
    • Extended the base model to multi-class data with the goal of learning how the directly affected attributes differ across classes under the external stimulus.
    • Applied both models to study the direct and indirect effects of external stimuli such as radiation and drugs on gene expression in single-class and multi-class datasets by leveraging gene interaction networks.
  • Wireless Network Usage:
    • Worked closely with colleagues from the wireless research group to investigate spatio-temporal data mining approaches for modeling user behavior and wireless traffic in a campus-wide wireless network.
    • Explored various clustering methods using attributes like time, location, type of traffic, and target IP address.
    • Utilized co-clustering methods that can simultaneously cluster along the aforementioned attributes.
    • Employed subset mixture models that can discover correlations in network usage patterns.
    • Analyzed changes in clustering and correlations based on variation in location and time.

Parallel and Distributed Computing

  • Developed parallelized versions of learning algorithms for my modeling frameworks using threads and barrier synchronization for shared memory multi-core machines.
  • Experience with various parallel programming models such as shared memory, threads, message passing, data parallel, single instruction multiple data, and multiple instruction multiple data.
  • Experience with issues such as partitioning, communication, synchronization, data dependencies, load balancing, granularity, I/O costs, and performance analysis.
  • Familiar with MPI, MapReduce, CUDA, Hadoop, HBase, Hypertable, Pig, and Mahout.
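
The threads-and-barrier pattern mentioned above can be sketched as follows. This is an illustration only, written in Python rather than the C used for the actual implementations, and the function name `parallel_normalize` and its two-phase workload are hypothetical. Each worker computes a partial sum over its own partition, waits at a barrier until every partial sum is ready, and then normalizes its partition using the global total. Because CPython threads do not run Python bytecode in parallel, the example shows the synchronization structure rather than a speedup.

```python
import threading


def parallel_normalize(values, num_workers=4):
    """Divide every value by the global sum using a two-phase threaded scheme.

    Assumption for this sketch: the values sum to a nonzero total.
    """
    n = len(values)
    partial = [0.0] * num_workers   # one slot per worker, so no write conflicts
    result = [0.0] * n
    barrier = threading.Barrier(num_workers)

    def worker(rank):
        # Contiguous partition of the index range owned by this worker.
        lo = rank * n // num_workers
        hi = (rank + 1) * n // num_workers

        # Phase 1: each worker writes only its own partial sum.
        partial[rank] = sum(values[lo:hi])

        # No worker may read `partial` until all workers have written to it.
        barrier.wait()

        # Phase 2: every worker reads all partial sums, writes only its slice.
        total = sum(partial)
        for i in range(lo, hi):
            result[i] = values[i] / total

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


normalized = parallel_normalize([1.0, 2.0, 3.0, 4.0], num_workers=2)
```

Without the barrier, a fast worker could read a partial sum that a slower worker has not yet written, which is the kind of data dependency the synchronization step guards against.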

Publications

  • Manas Somaiya, Christopher Jermaine, and Sanjay Ranka. Learning Correlations Using the Mixture-of-subsets Model. ACM Transactions on Knowledge Discovery from Data, Volume 1, Issue 4, Pages 1 - 42, January 2008.
  • Manas Somaiya, Christopher Jermaine, and Sanjay Ranka. A POWER Framework for Multi-class Membership in Bayesian Mixture Models. Accepted at ACM Conference on Knowledge Discovery and Data Mining 2010.
  • Manas Somaiya, Christopher Jermaine, and Sanjay Ranka. A Framework for Piecewise Linear Regression Mixture Models. Submitted to ACM Transactions on Knowledge Discovery from Data.
  • Nirmalya Bandyopadhyay, Manas Somaiya, Tamer Kahveci, and Sanjay Ranka. Modeling Perturbations using Gene Networks. Accepted at International Conference on Computational Systems Bioinformatics 2010.
  • Nirmalya Bandyopadhyay, Manas Somaiya, Tamer Kahveci, and Sanjay Ranka. Identifying Differentially Regulated Genes. Submitted to ACM Conference on Bioinformatics and Computational Biology 2010.
  • Saeed Moghaddam, Ahmed Helmy, Sanjay Ranka, and Manas Somaiya. Data-driven Co-clustering Model of Internet Usage in Large Mobile Societies. Submitted to ACM International Conference on Modeling, Analysis, and Simulation of Wireless and Mobile Systems 2010.

Coursework

University of Florida

Analysis of Algorithms, Computer Architecture Principles, Software Testing and Verification, Formal Languages and Computation Theory, Programming Language Principles, Operating System Principles, Cryptographic Protocols, Approximate Query Processing, Bioinformatics, Database System Implementation, Advanced Data Structures, Introduction to Parallel Computing.

North Carolina State University

Operating Systems Principles, Software Engineering, Computer Networks, Technology Evaluation and Commercialization Concepts, Internet Protocols, High Speed Networks, Advanced Protocol Design, Data Structures, Introduction to Computer Performance Modeling, Concurrent Software Systems, Information Systems Security, String Processing Languages.

Teaching

University of Florida

  • COT 5405 Analysis of Algorithms, CIS 4930/6930 Data Mining, and CIS 4930/6930 Introduction to Parallel Computing
    • Teaching select topics.
    • Supplementary teaching, including programming labs and leading group discussions.
    • Designing and evaluating class assignments and exams.
    • Holding office hours to discuss student difficulties in one-on-one sessions.
  • CAP 5510 Bioinformatics, COP 4600 Operating Systems, COP 4720 Information and Database Systems 2, COT 3100 Applications of Discrete Structures
    • Supplementary teaching, including programming labs and leading group discussions.
    • Grading of class assignments and exams.
    • Holding office hours to discuss student difficulties in one-on-one sessions.

North Carolina State University

  • CSC 116 Introduction to Computing (Java) and CSC 216 Programming Concepts (Java)
    • Supplementary teaching, including programming labs and leading group discussions.
    • Grading of class assignments and exams.
    • Holding office hours to discuss student difficulties in one-on-one sessions.

Activities

  • Program Committee: IC3 2010
  • Organizing Committee: Workshop on Data Mining for Biomedical Applications, October 2008
  • Research track paper reviewer: IC3 2010, PAKDD 2010, IC3 2009, ICPP 2009, KDD 2009, KDD 2008, ICDCIT 2008
  • Student reviewer: IEEE Potentials 2003
  • Student member: IEEE, IEEE Computer Society, ACM, ACM SIGMOD, ACM SIGKDD

Social and Professional Networks

LinkedIn Facebook Twitter