CIS 6930 / CIS 4930: Data Science: Large-scale Advanced Data Analysis
Instructor: Daisy Zhe Wang
Section: 6263 / 6874
Location: Tuesday CSE E119 / Thursday CSE E119
Time: Tuesday 8th-9th periods (3:00-4:55pm) / Thursday 9th period (4:05-4:55pm)
Office hour: Tuesday 10th period / Thursday 10th period (5:00-6:00pm)
Contact: E456 (office), (352) 562-8936 (office phone)
Overview
More and more companies are generating large amounts of diverse data (e.g., tweets, logs, click-streams, health care records, mobile phones, sensor networks) and applying sophisticated statistical models and algorithms to support decisions, perform quantitative analysis, and build data-intensive products and services. Examples include Netflix, Google, Facebook, Twitter, Amazon, Fox Interactive, and Splunk. Database systems have traditionally been the de facto framework for scalable data management, querying, and analysis. However, the new requirements of deep analysis and big data go beyond the capabilities of traditional database systems. This course will describe real-life applications that require large-scale advanced data analysis; cutting-edge algorithms used for different analysis tasks; and existing data management systems and computing infrastructures developed to scale to both the data and the computation.
In this course, we will discuss recent publications on Data Science, with an emphasis on algorithms and systems for large-scale advanced data analysis. Each student will be responsible for presenting one or more such papers in class and for participating in discussions of the papers presented by others. Each student will also do a class project, which has the largest impact on the final grade. Every student should be comfortable with programming and should preferably have prior experience with data management systems, data modeling, and analysis.
Presentations
- [08/23/11] cis6930fa11 introduction class
- [08/25/11] cis6930fa11 MADLib by Chris Grant
- [08/30/11] cis6930fa11 BayesStore-IE
- [09/01/11] cis6930fa11 ML on Multicore (Mahout) by Joir-dan Gumbs
- [09/06/11] cis6930fa11 Spark by Morgan Buer
- [09/06/11] cis6930fa11 MauveDB by Prithvi Raj
- [09/08/11] cis6930fa11 Dremel by Aravinth Bheemaraj
- [09/13/11] cis6930fa11 SciDB by Morgan Buer
- [09/13/11] cis6930fa11 Map-Reduce Online by Neeraj Ganapathy
- [09/15/11] cis6930fa11 Polaris by Genesh Viswanathan
- [09/27/11] cis6930fa11 Usher by Prithvi Raj
- [09/27/11] cis6930fa11 Wrangler by Kun Li
- [09/29/11] cis6930fa11 WebTables by Genesh Viswanathan
- [10/04/11] cis6930fa11 BBQ by Gautam S. Thakur
- [10/04/11] cis6930fa11 Potter's Wheel by Sunny Khatri
- [10/11/11] cis6930fa11 FusionTable by Aravinth Bheemaraj
- [10/11/11] cis6930fa11 PayAsYouGoIntegration by Chris Grant
- [10/13/11] cis6930fa11 IterativeBlocking by Lin Shuang
- [10/18/11] cis6930fa11 QueryDrivenER by Sean Goldberg
- [10/20/11] cis6930fa11 OpenIE by Yibin Wang
- [10/25/11] cis6930fa11 Reverb by Sindhura Tokala
- [10/25/11] cis6930fa11 NEROverTweets by Sunny Khatri
- [10/27/11] cis6930fa11 DBLife by Gautam S. Thakur
- [11/01/11] cis6930fa11 ULDB by Sindhura Tokala
- [11/01/11] cis6930fa11 PrDB by Kun Li
Announcements
- Guidelines for the final presentation and final report: cis6930fa11 final guidelines. Good luck! See you Dec. 6th at 2pm EST in E404.
- Sign up for your final project presentation slots here. Please send me your slides before the presentation.
- A sign-up sheet for the presentation slots on Dec. 1st and 6th will be posted soon. Stay tuned.
- Good job on the midterm presentations! Three weeks until the final presentation, so keep up the good work!
- A literature review on Crowd Search must be turned in before class on Nov. 17th.
- Sign up for your project midterm presentation slot here. Please send me your slides before the class in which you are presenting your work. Grades will be given considering both the presentation and the slides.
- Project discussions will be held in my office E456.
- Sign up for your project proposal discussion slot here. A 1-2 page cis6930fa11 project proposal is due at the time of the discussion. The proposal will be graded.
- There will be two 50-minute paper presentations per person. Sign up for a second paper presentation slot here.
- The deadline for signing up for a first paper presentation slot is before the Tuesday class on 08/30/2011. We will make any adjustments during that class.
- Project groups should include 2-3 people. Start forming your groups.
- Sign up for one first paper presentation slot each here. Only sign up for slots that have 0 votes; in other words, each presentation is given by one individual. The paper(s) to be presented in each time slot can be found in the cis6930fa11 reading list. (Papers are presented in the order of the list. If there are three papers in the Tuesday class, the first two go in the first hour, and the last one goes in the second hour.)
- Slides of the cis6930fa11 introduction class.
- A brief reading list is here: cis6930fa11 reading list. I will update the list throughout the semester.
- The course syllabus is ready: cis6930fa11 syllabus.
- Please check the announcements regularly.
Prerequisites
Information and Database Systems I (CIS 4301) or equivalent is a prerequisite. Preferably, you have already taken one of the following courses: COP6726 Database System Implementation, CIS4930DTM Data Mining, or a course in Machine Learning or Natural Language Processing. I will assume that students already have basic knowledge of database systems, data mining, and data analysis, and are comfortable with basic computer programming (e.g., in C or Java).
Topics
This course will cover the most recent developments in a broad range of Data Science problems. I would like to put more focus on algorithms and systems that enable advanced (statistical/machine learning) data analysis. The topics are as follows:
- Data Collection and Cleaning
- Data Integration
- Data Analysis
- Frameworks: MADLib, Mahout
- Applications and Algorithms: Text Analysis, A/B Testing, Log Analysis
- Trends and Advanced Topics: Probabilistic Databases, Crowd-sourcing
- User Interface Design
Grading
(Subject to minor changes until the first day of the semester.) Grading will be based on the project (55%), presentations (20%), and homework (25%). Class participation and novelty in projects will each be rewarded with a bonus (5% each). Late submissions will incur a deduction of 20% of the points for each late day.
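As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function names are hypothetical, and two readings of the policy are assumptions rather than official rules: that each bonus adds up to 5 percentage points to the final grade, and that the late deduction is 20% of the earned points per late day, floored at zero.

```python
# Weights taken from the grading paragraph; everything else is illustrative.
WEIGHTS = {"project": 0.55, "presentations": 0.20, "homework": 0.25}
BONUS_CAP = 5.0       # assumption: each bonus adds at most 5 percentage points
LATE_PENALTY = 0.20   # 20% of the points per late day


def apply_late_penalty(points, days_late):
    """Assumption: deduct 20% of the earned points per late day, never below zero."""
    return max(0.0, points * (1.0 - LATE_PENALTY * days_late))


def final_grade(scores, participation_bonus=0.0, novelty_bonus=0.0):
    """Combine component scores (each on a 0-100 scale) into a final percentage."""
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    bonus = min(participation_bonus, BONUS_CAP) + min(novelty_bonus, BONUS_CAP)
    return base + bonus


if __name__ == "__main__":
    homework = apply_late_penalty(90.0, days_late=2)
    grade = final_grade(
        {"project": 80.0, "presentations": 90.0, "homework": homework},
        participation_bonus=5.0,
    )
    print(f"homework after penalty: {homework:.1f}, final grade: {grade:.1f}")
```

Under these assumptions, a homework turned in two days late keeps 60% of its points, and anything five or more days late receives no credit.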
Textbook and Some Pointers
There is no required textbook for this class. We will use papers as our main source.
Some related links and pointers:
- Data Science Summit (they have some great talks, e.g., Data Scientist DNA)
- Kaggle Competitions
- Data Science course at Berkeley offered by Hammerbacher and Franklin (look into their Resource section for more readings)
Other Interesting Projects
- Google Refine
- IBM BigSheets
- Collaborative Sentence Translation
- Google Tenzing: A SQL Implementation On the MapReduce Framework
Datasets
- Enron Email Dataset
- New York Times Dataset
- Twitter Dataset
- UCI datasets
- Kaggle datasets
Other
- I strongly encourage class attendance and participation.
- If I postpone or cancel the office hour, I will post it in the announcements section.
- Check out Services for students with disabilities.
- Cheating, plagiarism, and other types of academic dishonesty will be subject to punishment.