Syllabus

Instructor

Professor Christopher Jermaine ("Chris"). My contact info is:

There will also be at least one TA for the course, Subramanian Arumugam. 

Overview and Course Description

This course will cover the topic of indexing for different type of large, (typically) disk-based data sets. Indexing generally refers to the design, implementation, and use of data structures that allow you to quickly find database objects that satisfy a particular selection criteria. For example, in a database containing a large number of surveillance photographs, you may want to find all photographs that are similar to the picture of a known criminal, without having to check each and every photograph in the database explicitly. In the class, we will study data structures that are appropriate for this problem, as well as for many other database search problems. 

Basic Course Outline and Topics Covered

  1. Indexing basics: hard disks, external sorting, hashing, B+-Trees (first 3 weeks)
  2. Speeding index updates (week 4)
  3. Indexing spatial and multidimensional data (weeks 5-6)
  4. Indexing for nearest neighbor-queries (week 7)
  5. Indexing time: temporal and spatial-temporal data (week 8)
  6. Indexing sequential data for DNA and text search applications (weeks 9-10)
  7. Indexing sequential data for time series (week 11)
  8. Indexing for general metric spaces (week 12)
  9. Indexing hierarchical databases, particularly XML (weeks 13-14)
  10. Indexing graph databases, including chemical compound databases (week 15-16)

What to Expect in this Course

The first three weeks of the course will be very traditional: they will consist entirely of lectures given by me that cover the basics of database storage and indexing. In fact, if you have taken COP 6726 (Database Implementation or "DBI" as it is typically called) with me, you will find that most of this material is review from that class. There is not a textbook for the course, so it is important that you show up to class and take notes as needed!

After the first three weeks, things will change a bit and the course will be centered around reading and discussing technical papers from the research literature that describe various indexing problems and then propose algorithmic solutions to them. All of these papers can be found on the web and so you can easily download and print them out yourself -- thus, not only is there not a course textbook, there is also not a packet of notes that you need to buy. You'll just download and print the papers you need to keep up.  

While I will present and lead the discussion on a few of the technical papers that we will cover in the class, each student (along with a partner) will be responsible for presenting one technical paper during the semester. To do this, you will need to prepare a PowerPoint or whiteboard presentation on the paper, give this presentation in class, and handle any questions that come up on the technical aspects of the paper during your presentation. Since we will have between 30 and 40 students in the class, between 15 and 20 of the lectures will consist of student presentations.

To make sure that a reasonable fraction of the people in the class actually read the paper before each presentation, you will be asked to turn in a typed, 20-line (or longer) summary (in an 11-point font or smaller) of the main technical idea in each paper before class on the day that the paper is presented. You won't have to summarize all of the papers, just enough to keep you honest and make sure that you don't rely only on other people's presentations to figure out what is going on in class (see the Grading section below). To obtain 100% credit towards your grade for summarizing the papers, you'll need to turn in acceptable summaries for 50% of the papers that we discuss.

To make sure that you attend class and actually try to understand the material that we will cover, there will be a short, multiple choice quiz in class on most Fridays. There will be ten quizzes during the semester, and you can drop the lowest two quiz scores. Since you can drop two quizzes and there are so many of them, in general, make-ups are not allowed. If you have to be out of town for a quiz, then you'll simply drop that quiz. Each quiz will have six to eight questions and take approximately 15 minutes. If you have taken DBI with me, then you know exactly what to expect on these quizzes. There are no exams in the class.

Finally, there is an implementation part of the course where you will be asked to actually implement and evaluate some of the indexing strategies that we will read about. This implementation will be done in teams of two, and will require C/C++ programming. Grad students will be required to do more than undergrads, though undergrads are welcome to complete all of the assignments. 

As I've told people, the programming workload for grad students will be about 1/2 to 2/3 of the workload for the DBI class, and the programming workload for undergrads will be about the same to a bit more than the programming workload that you saw if you took CIS 4301 with me.  As a wild guess, I'd guess that your group will have to write 500 lines of code for assignment 1, 1000 lines of code for assignment 2, 1000 lines of code for assignment 3, and then 500-1000 lines of code for each of the last two assignments (which undergrads won't be required to do). 

Grading

Grading is as follows:

The Programming Assignments

There will be five different programming assignments. Undergrad students will only be asked to complete the first three assignments, though they are certainly welcome to complete all five. All programming assignments will be completed in teams of two, using the C/C++ programming language. "Mixed" teams of one undergraduate and one graduate student are allowed, but the undergraduate on the team should be aware that he or she will then need to complete all five assignments. Each programming assignment will involve implementing some indexing structure, running some experiments to validate that structure, and then turning in the code and a short report describing the experimental results. All assignments are weighted equally. They are:

  1. Due 4th week for everyone. Implementing an unordered index file with sequential scan.
  2. Due 7th week for grads, 8th week for ugrads. Implementing a B+-Tree secondary index; grads also need to implement a Hilbert version of the tree for multidimensional data.
  3. Due 11th week for grads, 14th week for ugrads. Implementing an R-Tree.
  4. Due 12th week for grads. Implementing KNN search using an R-Tree and also sequential scan.
  5. Due last week for grads. Implementing time series indexing using an R-Tree. 

Mailing List and Web Page

The class will make extensive use of a mailing list to disseminate information. It is absolutely mandatory that you sign up for the list and monitor it closely. The class will also have a web page that will be used to post assignments and other "big" pieces of information, located at http://www.cise.ufl.edu/class/cis4930fa07ilg/. The web page is not yet fully functional; at present, it contains exactly this document.

Academic Dishonesty

There are implementations of R-Trees and B+-Trees floating around out there on the 'Net. It is absolutely prohibited for you to refer in any way, shape, or form to one of those implementations when you are working to complete the course project. Any cases of academic dishonesty will be prosecuted to the fullest extent possible. Just don't do it!

Class Attendance

Class attendance will not be recorded and will not affect your grade, but it is assumed that all students will attend most of the class meetings. If you don't attend class, you can't turn in your summary of the paper for that day, and you typically can't make up any quizzes that you miss. 

Students with Disabilities

Students requesting classroom accommodation must first register with the Dean of Students Office. The Dean of Students Office will provide documentation to the student who must then provide this documentation to me when requesting accommodation, which will typically take the form of extra time to complete the in-class quizzes.