THALIA can be used in two ways:

1. As a rich source of test data for integration problems exhibiting a wide variety of syntactic and semantic heterogeneities, which we have grouped into three categories: attribute heterogeneities (e.g., synonyms, simple and complex mappings, union types, and heterogeneities due to language of expression), missing data (e.g., nulls, virtual columns, and semantic incompatibility), and a variety of structural heterogeneities. See our introductory publication for a more detailed description.

2. As a benchmark for objectively evaluating the capabilities of integration technology, taking into account both the correctness of the solution and the amount of programmatic effort (i.e., the complexity of external functions) needed to resolve any heterogeneities. Our benchmark currently comprises twelve queries, each requiring the resolution of a particular type of heterogeneity.
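
To make the notion of an external function more concrete, here is a minimal sketch in Python of the kind of small conversion routine that a simple-mapping heterogeneity might call for. It assumes that one catalog records meeting times on a 12-hour clock while another uses a 24-hour clock. The function name `to_24_hour` and the time formats are hypothetical illustrations, not part of THALIA.

```python
import re


def to_24_hour(time_12h: str) -> str:
    """Convert a 12-hour time such as '1:30pm' to 24-hour 'HH:MM'.

    Hypothetical helper: the input format is an assumption made for
    illustration, not a format documented by THALIA.
    """
    match = re.fullmatch(
        r"\s*(\d{1,2}):(\d{2})\s*([ap]m)\s*", time_12h, re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognized time format: {time_12h!r}")

    hour = int(match.group(1))
    minute = match.group(2)
    meridiem = match.group(3).lower()
    if not 1 <= hour <= 12 or int(minute) > 59:
        raise ValueError(f"time out of range: {time_12h!r}")

    # 12am maps to hour 0 and 12pm to hour 12, so reduce 12 to 0 first.
    hour = hour % 12
    if meridiem == "pm":
        hour += 12
    return f"{hour:02d}:{minute}"
```

An integration system that needs a routine like this to answer a query spends more programmatic effort than one that resolves the mapping declaratively, which is the distinction the benchmark tries to capture.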

Downloadable university course catalogs are represented as well-formed XML that is valid according to the schema extracted for each course catalog. Extraction and translation from the original representation were done using a source-specific wrapper, which preserves the structural and semantic heterogeneities that exist among the different course catalogs. We used an enhanced version of the Telegraph Screen Scraper (TESS) system, developed at UC Berkeley, to extract the source data. The enhanced version, DataExtractor (HTMLtoXML), is available for download along with the examples used to extract the data provided in THALIA. The DataExtractor (HTMLtoXML) tool adds functionality to the TESS system and can store the extracted data in an XML file.
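
As a quick sanity check on a downloaded catalog, the following Python sketch tests whether a document is well-formed using only the standard library. The sample documents and element names are made up for illustration. The check covers well-formedness only; confirming validity against the extracted schema would require a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical snippets; real THALIA catalogs use their own element names.
sample_ok = "<catalog><Course><Title>Databases</Title></Course></catalog>"
sample_bad = "<catalog><Course><Title>Databases</Course></catalog>"

print(is_well_formed(sample_ok))
print(is_well_formed(sample_bad))
```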

For each type of heterogeneity listed above, we have formulated a benchmark query against two data sources from our testbed that requires a particular integration activity in order to provide the correct result. The two sources play different roles: one supplies the reference schema, which is used to formulate the query, and the other supplies the challenge schema, which exhibits the type of heterogeneity that the integration system must resolve.
Note that in some cases (e.g., Benchmark Query 9), a query may also illustrate additional types of heterogeneity that are showcased in other queries. Queries are written in XQuery version 1.0.
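
The relationship between a reference schema and a challenge schema can be sketched in Python for the simplest case, a synonym heterogeneity. The two inline documents, their element names (`Instructor`, `Lecturer`, `Title`, `CourseTitle`), and the `SYNONYMS` mapping are all hypothetical and are not taken from the THALIA catalogs or from its XQuery benchmark queries. The sketch only shows why a query written against the reference schema returns nothing from the challenge source until the synonym is resolved.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference source: the query is formulated against these names.
REFERENCE_XML = """
<catalog>
  <Course><Title>Database Systems</Title><Instructor>Mark</Instructor></Course>
  <Course><Title>Compilers</Title><Instructor>Ann</Instructor></Course>
</catalog>
"""

# Hypothetical challenge source: same information under synonymous names.
CHALLENGE_XML = """
<catalog>
  <Course><CourseTitle>Data Mining</CourseTitle><Lecturer>Mark</Lecturer></Course>
  <Course><CourseTitle>Networks</CourseTitle><Lecturer>Lee</Lecturer></Course>
</catalog>
"""

# Reference-schema element name -> challenge-schema element name.
SYNONYMS = {"Title": "CourseTitle", "Instructor": "Lecturer"}


def courses_taught_by(xml_text, instructor, rename=None):
    """Return titles of courses taught by `instructor`.

    `rename` maps reference element names to the names used in the source;
    when it is omitted, the reference names are used unchanged.
    """
    rename = rename or {}
    title_tag = rename.get("Title", "Title")
    instructor_tag = rename.get("Instructor", "Instructor")

    root = ET.fromstring(xml_text)
    titles = []
    for course in root.iter("Course"):
        if course.findtext(instructor_tag) == instructor:
            titles.append(course.findtext(title_tag))
    return titles


def integrated_query(instructor):
    """Answer the query over both sources, resolving the synonyms."""
    return (
        courses_taught_by(REFERENCE_XML, instructor)
        + courses_taught_by(CHALLENGE_XML, instructor, SYNONYMS)
    )
```

Calling `courses_taught_by(CHALLENGE_XML, "Mark")` without the mapping yields an empty list, whereas `integrated_query("Mark")` returns matching titles from both sources.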

Please note that integration systems that do not provide query processing can still use the benchmark by providing an integrated schema over the two data sources associated with each benchmark query.

Users can browse both the repository of cached course catalogs in their original representation and our collection of extracted XML documents before running the benchmark. Users can also download the DataExtractor (HTMLtoXML) wrapper tool along with the examples used for THALIA.

In order to exchange information about the capabilities of existing integration systems, users are encouraged to upload the outcome of their benchmark evaluation.