Announcements
- 03/14/2005: Homework 4 has been posted. It is due on Friday, Mar 25
- 02/14/2005: Homework 3 has been posted. It is due on Friday, Feb 25
- 02/02/2005: Homework 2 is due Monday instead of Friday
- 01/21/2005: Homework 2 has been posted. It is due on Friday, Feb 4
- 01/21/2005: Homework 1 has been posted. It is due on Friday, Jan 28
- 01/11/2005: There will be no homework #0 since the answer to the
question was already put on the blackboard (the mapping is exactly the
inverse of the c.d.f. of the normal distribution).
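As an illustration of the remark about the inverse c.d.f., here is a minimal Python sketch. It assumes the question concerned mapping uniform samples on (0, 1) to normally distributed ones; the original question is not reproduced on this page, so that reading is a guess. The function names `normal_from_uniform` and `sample_normal` are hypothetical; `NormalDist.inv_cdf` is part of the Python standard library (3.8+).

```python
import random
from statistics import NormalDist, mean, stdev


def normal_from_uniform(u, mu=0.0, sigma=1.0):
    """Map u in (0, 1) to a normal value by applying the inverse c.d.f."""
    return NormalDist(mu, sigma).inv_cdf(u)


def sample_normal(n, seed=0):
    """Draw n standard-normal samples by transforming uniform samples."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        u = rng.random()  # uniform on [0, 1)
        if u > 0.0:       # inv_cdf is only defined on the open interval (0, 1)
            samples.append(normal_from_uniform(u))
    return samples


if __name__ == "__main__":
    xs = sample_normal(10_000)
    print("sample mean:", mean(xs))
    print("sample std: ", stdev(xs))
```

If the assumption holds, the sample mean and standard deviation should come out close to 0 and 1 respectively, since applying the inverse c.d.f. to a uniform variable yields a variable with that c.d.f.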
CAP 6930: Datamining
Datamining is a graduate course
geared toward Ph.D.-level students. The goal of the class is to expose
students to the topic of datamining and to develop research
skills like:
- Critical thinking
- Paper writing
- Conducting research
Tentatively, the following topics will be explored in the class:
- What is datamining
  - Connections with mathematics (statistics), machine learning, and databases
- Introduction to Probability Theory and Statistics
- Learning and data processing
- Unsupervised learning
  - Clustering
    - Similarity based
    - Density based
  - Outlier detection
- Supervised learning
  - Classification
    - Classification trees
    - Nearest-neighbor classification
    - Bayesian classifiers
    - Ensemble methods (bagging, boosting)
  - Regression
    - Linear regression
    - Regression trees
    - Other regression methods
- Association rule mining
- Privacy preserving datamining
Instructor
- Alin Dobra
Office: 474 CSE Building
Email: adobra at cise.ufl.edu
Office Hours (instructor's office)
- Wednesday 4pm-5:30pm
- Thursday 1pm-2:30pm
Class Requirements and Grading
Grading will be based on 60% homework (one homework every 1-2 weeks), 30% class project (writing a small research paper) and 10% class participation. There will be no exam.
High-level collaboration on the homework is allowed and encouraged. You may arrive at the solution together with other students as long as you list the people you worked with and write up the assignment separately. Please type the homework (you can draw diagrams by hand).
Details on the project will be clarified later. In principle, you are expected to take a small topic of your choosing and write a small paper on it. Options are:
- Survey paper: read 10-15 papers on the same topic and summarize the information
- Research paper: develop a new method and write a paper about it (do not be overly ambitious).
- Experimental paper: implement an existing technique in a new way and write a small experimental paper about it.
The class participation will be judged not only on physical presence,
but also on your contribution to the class. Questions and opinions are
encouraged at all times.
Lecture Notes
- Introduction to Probability Theory and Statistics
- Introduction to classification and regression
- Eibe Frank, Ph.D. thesis
- Loh, W.-Y. and Shih, Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, vol. 7, pp. 815-840.
- Alin Dobra and J. E. Gehrke. SECRET: A Scalable Linear Regression Tree Algorithm. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
- Bucila, Gehrke, Kifer and White, DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints
- Smola, Scholkopf, A Tutorial on Support Vector Regression
Scribed Notes
Homework
Homework 1
Homework 2
Homework 3
Homework 4