New Technologies for Online Aggregation
This document gives an overview of the research project "CAREER: New Technologies for Online Aggregation," funded by the National Science Foundation under grant number 0347408. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF. For more information or for any comments or questions, please contact Chris Jermaine.
Note: this particular NSF-funded project (as of August 2009), is now winding down. However, our work on online aggregation continues as the NSF-funded project, The Desgin and Implementation of the DBO Database System.
Databases: A Solved Problem? Databases have been the subject of intensive research and development for more than 30 years. Excellent commercial database systems now exist, such as IBM's DB2, Oracle Corporation's Oracle DBMS, Microsoft's SQL-Server, and several others. Nowadays, there are even a number of high-quality public-license, or "free" database systems, the most popular of which are MySQL and PostgreSQL. Databases are ubiquitous, and every large company in existence today relies on database technology for its day-to-day operations. Databases even power a large portion of the World Wide Web, since almost all commercial and institutional websites use a database backend to manage website content.
Despite the fact that database systems have been wildly successful commercially and have had a tremendous impact on daily life, there is still much room for improvement. One particular area where database technology is still sub-par is in the area of data warehousing. Today, huge quantities of electronic information are collected from a plethora of different sources. Products sold by retail companies, web pages viewed by internet browsers, summaries of calls logged by product help centers, and a multitude of other activities in our daily lives produce database records that are logged in a large, electronic information repositories. These repositories are typically called data warehouses.
The reason that the data is collected is for subsequent analysis -- figuring out how customers react to sales promotions, how websites are viewed, how and why customers complain, and so on. Unfortunately, current analytic tools are not up to the task. For evidence of this, one need only look at the latest TPC-H benchmark results in detail. The TPC-H benchmark is a data warehouse and associated set of analytic queries that database hardware and software vendors use to test the performance of their systems. Take a look at the latest benchmark results -- the current performance is not all that great. You'll see that it is possible to spend a million dollars to archive a few hundred gigabytes of data (which is not that much nowadays) and you still have to wait 20+ minutes to get an answer to a complicated SQL query.
Thus, we argue that databases are most certainly not a solved problem!
A Possible Solution: Online Query Processing. One way to address this is to use online query processing. In online processing, the database makes use of randomized algorithms to come up with a quick guess as to the answer to a query. As the user waits, the guess is refined, until eventually the "guess" is totally accurate as query processing is completed. This has the advantage of allowing the user to stop waiting for the final query answer as soon as the guess is "good enough". The potential benefit should be obvious: if it takes ten hours to get the exact answer, but only five minutes to get a high-quality guess, then we have saved a huge amount of both computer and user time. Analytic queries over a data warehouse are particularly amenable to this sort of approximation because the questions that are asked are almost always statistical in nature (indeed, every one of the 22 TPH-H benchamrk queries is statistical).
For an example of how online query processing might work, consider the simple SQL query:
SELECT SUM (e.SALARY)
FROM EMPLOYEE AS e
WHERE e.DEPARTMENT = 'Accounting'
To answer this query, we could simply randomly (offline) scramble the records in the relation EMPLOYEE on disk, and then read them in, one-at-a-time, in random order when the query is issued. At all times, no matter how much data we have processed, we compute SUM(e.SALARY)for those records where e.DEPARTMENT = 'Accounting', and scale the current sum up by the inverse of the fraction of the records that we have processed so far. For example, if we have processed 50% of the records and the current sum is $1,000,000, then $2,000,000 is a good guess for the final answer to the query. In fact, this estimate is unbiased, and the error of the guess can be bounded using classical statistical methods.
Redesigning the Database System From the Ground Up. While performing this sort of online aggregation for simple SUM queries over a single database table is relatively easy and was first proposed more than ten years ago, the applicability of prior work was quite limited, for several reasons. First, the type of query that could be answered using online aggregation consisted of simple, SELECT-FROM-WHERE-GROUP BY queries, with no subqueries. Second, and perhaps more significant, existing algorithms for online aggregation were not scalable. As soon as enough data had been processed that all of the records read in so far could not be stored in memory, it was not known how a statistically-meaningful guess for the final query answer could be arrived at. Given that a modern, relatively inexpensive hard disk array can produce a gigabyte of data in a second, and that a high-end system might have 50GB of main memory, this means that online aggregation could only be used for around 50 seconds!
In response to this, the goal of this particular project is to redesign the database system from the ground up to support scalable, online analytic query processing. The prototype database system that we are working on is called DBO, which is short for "Database-Online". DBO has the following, specific design goals:
The ultimate goal is to fully implement and distribute a version of DBO that can be used to answer analytic queries over very large, multiple terabyte-sized databases.
Project Results So Far. Since the project began in May 2004, we have made significant progress towards meeting these goals. Some specific results are as follows:
SELECT SUM (s.AMT)
FROM SALES AS s
WHERE s.DATE >= 'June 1, 2005' AND
s.DATE <= 'June 1, 2006'
If the records are scrambled on disk, then it is really hard to efficiently access only the relevant records, and records from all of the other time periods must be processed as well. We have proposed a very interesting solution for this problem in the paper:
Materialized Sample Views for Database Approximation. Shantanu Joshi and Chris Jermaine. IEEE Trans. Knowl. Data Eng. 20(3): 337-351 (2008)
Online Estimation For Subset-Based SQL Queries. Chris Jermaine, Alin Dobra, Abhijit Pol, Shantanu Joshi. VLDB Conference 2005: 745-756
We have also considered the (very difficult!) problem of extending this work to the case where no index is available:
Sampling-Based Estimators for Subset-Based Queries. Shantanu Joshi and Chris Jermaine. VLDB Journal 18(1): 181-202 (2009)
Online Random Shuffling of Large Database Tables. Chris Jermaine. IEEE Trans. Knowl. Data Eng. 19(1): 73-84 (2007)
The Sort-Merge-Shrink Join. Chris Jermaine, Alin Dobra, Subramanian Arumugam, Shantanu Joshi, Abhijit Pol. ACM Trans. Database Syst. 31(4): 1382-1416 (2006)
A Disk-Based Join With Probabilistic Guarantees. Chris Jermaine, Alin Dobra, Subramanian Arumugam, Shantanu Joshi, Abhijit Pol. SIGMOD Conference 2005: 563-574
More recently, we extended this work to handle entire join "trees" over large numbers of input relations:
Scalable Approximate Query Processing with the DBO Engine. Chris Jermaine, Subramanian Arumugam, Abhijit Pol, Alin Dobra. SIGMOD Conference 2007: 725-736.
This paper was co-awarded the SIGMOD 2007 best paper award! A demo of the current version of the DBO system appeared in the upcoming SIGMOD 2008 conference:
The DBO Database System. Florin Rusu, Fei Xu, Luis Perez, Mingxi Wu, Ravi Jampani, Chris Jermaine. SIGMOD Conference 2008 1223-1226 (demonstration).
Project Personnel. The work of the following people has been sponsored at one time or another through the NSF funding associated with this project:
Subramanian Arumugam - Post-doctoral asssociate, CISE (graduated from CISE in 2008)
Chris Jermaine - Associate professor, CISE
Shantanu Joshi - PhD student, CISE (graduated; now at Oracle)
Abhijit Pol - PhD student, CISE (graduated; now at Yahoo!)
Mingxi Wu - PhD student, CISE (graduate; now at Oracle)
Fei Xu - PhD student, CISE
Last modified: May 4, 2009