THALIA: Test Harness for the Assessment of Legacy information Integration Approaches

Abstract:

Information integration is a challenging problem. Despite significant efforts to develop effective solutions and tools for automating information integration, current techniques are still mostly manual, requiring significant programmatic set-up with only limited reusability of code. We introduce a new, publicly available testbed and benchmark called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches) to simplify the evaluation of existing integration technologies and to enable more thorough and more focused testing. THALIA provides researchers with a collection of downloadable data sources representing university course catalogs, a set of twelve benchmark queries, and a scoring function for ranking the performance of an integration system. Our benchmark focuses on syntactic and semantic heterogeneities, since we believe they still pose the greatest technical challenges. A second important contribution of this paper is a systematic classification of the different types of syntactic and semantic heterogeneities, which leads directly to the twelve queries that make up the benchmark. A sample evaluation of two integration systems at the end of the paper shows how THALIA can identify the problem areas that need the greatest attention from our research community if we want to improve the usefulness of today’s integration systems.