TODS Test Dataset Generation

  • This page describes how to generate the test datasets used in the experiments of the TODS paper.
  • The experiments were performed on two synthetically generated relations, EMP and SALES. Each relation has 200 million records. With a record size of 100 bytes, this yields a 20-gigabyte relation.
  • The record structure of the relations is as follows:
    	struct EMP {
    		int Key; // 4 bytes
    		double Value; // 8 bytes
    		int data[22]; // 88 bytes
    	};
    	struct SALES {
    		int Key; // 4 bytes
    		double Value; // 8 bytes
    		int data[22]; // 88 bytes
    	};
  • There are a total of 5 million unique keys in the range [0, 4999999]. These keys are distributed among the 200 million records according to a Zipfian distribution.

    Record Allocation Table

  • The actual dataset can be generated from the record allocation files given below, one pair per zipf parameter value. We will refer to a record allocation file as recalloc hereafter. Click on the links to download the files (each file is about 3-4 MB in size and is in TXT format).
  • For each zipf parameter, recalloc1 provides information about the EMP relation and recalloc2 provides information about the SALES relation.

    zipf    EMP          SALES
    0.0     recalloc1    recalloc2
    0.2     recalloc1    recalloc2
    0.4     recalloc1    recalloc2
    0.6     recalloc1    recalloc2
    0.8     recalloc1    recalloc2
    1.0     recalloc1    recalloc2
  • Each line in the recalloc file provides the following information:
     Key    Frequency     AVG_Value
  • The recalloc file has an entry for every Key present in the relation. For each Key, the Frequency field tells us how many records in the relation should carry that Key. The AVG_Value field tells us what the average of the Value column should be over all records with that Key. In our data generation, the Value column of the records with a given Key followed a normal distribution with mean AVG_Value and standard deviation AVG_Value/30.0. The rest of the fields in the relation tuple are filled with random values (for example, using the function lrand48()).