DOMRUL: Learning the Domain Rules

A New Generation Machine Learning and Data Mining System

The DOMRUL system is developed and implemented in Lisp by Li Min Fu for learning the domain rules based on the neural network technology. DOMRUL can learn in any domain with no restrictions on rule number and rule size. It has been installed on the Internet, providing a free service of rule learning. The web version is provided at "http://whale.cise.ufl.edu:1234/examples/servlets/domrul.htm" DOMRUL. To access this web version, the Netscape browser is preferred. It is also available by sending the data to RULE@CISE.UFL.EDU with the subject line rule (i.e., Subject: rule). The result is automatically communicated back to the user.

To discover underlying domain regularities or rules has been a major long-term goal for scientific research (knowledge discovery) and engineering application (problem solving). However, when the domain rules get complex, current machine learning programs learn only approximate rather than true domain rules from a limited amount of observed data. DOMRUL is intended for discovering precisely the domain rules with neither false positive nor false negative domain rules. In this perspective, experiments in learning complex logic rule sets have demonstrated that DOMRUL is ten times more accurate than the commonly used rule learning tool C4.5. In light of this result, DOMRUL appears a major breakthrough in rule learning technology. For the details of the system's learning theory and related issues, please refer to Publications in this column.

How to Use it

The program is designed to be very easy to use. It is domain independent and no parameter tuning is needed. In the content of your e-mail message, please set three variables, with BEGIN as the first line and END as the last line. However, if you use the web version, the data file should not include BEGIN and END.

Please type the following symbols correctly in each data element: t, f, positive, and negative.

For example, a data file submitted to the web interface looks like

(setq domain 'hepatitis)
(setq feature-vector '(fever fatigue high-SGOT))
(setq data '(
	     ((t f t)(positive))
	     ((t f f)(negative))
	     ((t t t)(positive))
	     ))
Or in an e-mail message:
BEGIN
(setq domain 'hepatitis)
(setq feature-vector '(fever fatigue high-SGOT))
(setq data '(
	     ((t f t)(positive))
	     ((t f f)(negative))
	     ((t t t)(positive))
	     ))
END

[Annotation:] In this example, the domain is about hepatitis. There are three features used: fever, fatigue, and high-SGOT, each assuming binary values: t (true/yes) and f (false/no). There are three instances:

   instance #1: ((t f t)(positive))
   instance #2: ((t f f)(negative))
   instance #3: ((t t t)(positive))
Each instance has a feature vector value along with a label indicating a positive (example) or negative instance (counterexample) of a certain predefined concept in the domain. Here the concept is simply hepatitis. So, for example, instance #1 is a positive instance with the feature vector value of (t f t) meaning fever = yes, fatigue = no, and high-SGOT = yes.

Another quick self-explanatory example is the exclusive function. The data file is as follows.

(setq domain 'xor)
(setq feature-vector '(x1 x2))
(setq data '(
((t t)(negative))
((t f)(positive))
((f t)(positive))
((f f)(negative))
))
Or in an e-mail message:
BEGIN
(setq domain 'xor)
(setq feature-vector '(x1 x2))
(setq data '(
((t t)(negative))
((t f)(positive))
((f t)(positive))
((f f)(negative))
))
END

[Note 1:] Due to limited computational resources, there should be not more than 25 features and 1500 instances. The data exceeding these limits will not be processed.

[Note 2:] The current version does not process continuous attributes directly. They must be discretized properly before the data is sent.

[Note 3:] The system performs multivariant learning. It is recommended that the number of instances be at least ten times as large as that of features.

[Note 4:] The program is designed to return no more than ten (best) rules for each data submission. If the domain has more rules, an iterative version of this program is needed to learn all of them. This version is not available to the public at this time. In this case, the user can submit multiple selected data sets as an alternative.

[Note 5:] It is the user's responsibility to interpret and use the rules produced by this program.

Performance Evaluation

In many cases, existing machine learning programs can only learn approximate rather than true domain rules. Here, the domain rules refer to the set of rules which bear domain semantics and can explain all the domain instances. Learning the domain rules is important for both engineering applications and scientific research. The domain rules provide insight into the precise nature of the domain. If such rules are learned, prediction error should be zero. DOMRUL is able to discover precisely (with neither false positive nor false negative) domain rules given adequate data. It can be viewed as a new generation rule learning tool, based on neural network technology. DOMRUL has extraordinary capability to learn the domain rules from even a small fraction of domain instances. This ability has been demonstrated by the following simulation experiments. A domain of 25 variables is defined in each of ten case studies concerning learning logic rules. In these cases, the number of the domain rules ranges from 5 to 15 with each rule having 5-10 logic literals. The training data consists of 1500 randomly generated instances. Each training instance is classified as positive if it can match any of the domain rules and as negative otherwise. It is ensured that each rule can be matched by at least some instances so that it can be learned. The training data represents just an extremely small fraction (4.5 x 10^{-5}) of 2^{25} possible domain instances. The learning system does not know what the domain rules are. The learning task is to discover them inductively based on the classified data. A learned rule which is not exactly identical to any of the domain rules is counted as a false positive rule. A domain rule which is not learned is counted as a false negative rule. Overall, there are 98 domain rules for the concept ``truth''. DOMRUL learned 97 rules with 3 false positive and 4 false negative rules, or equivalently a false positive rate of 3.1% and a false negative rate of 4.1%. DOMRUL learned the domain rules with 100% accuracy in seven cases. In comparison, the most popular rule learning system C4.5 learned 101 rules with 46 false positive and 43 false negative rules, or equivalently a false positive rate of 45.5% and a false negative rate of 43.9%. C4.5 learned the domain rules with 100% accuracy in two cases. Thus, DOMRUL is ten times better than C4.5 in terms of errors. In light of this result, DOMRUL appears as a major breakthrough in rule learning technology. The results are summarized in the following table.

Error rates in learning the domain rules by C4.5 and DOMRUL. FP: False positive rules. FN: False negative rules. RN: Rule number.

Case Domain RN C4.5 DOMRUL
Learn RN FP No. FN No. Learn RN FP No. FN No.
1 5 5 0 0 5 0 0
2 7 7 3 3 7 0 0
3 10 9 4 5 10 0 0
4 5 5 0 0 5 0 0
5 15 13 4 6 15 1 1
6 14 13 4 5 13 1 2
7 14 18 13 9 14 0 0
8 9 9 4 4 9 0 0
9 6 7 4 3 6 1 1
10 13 15 10 8 13 0 0
Overall 98 101 46 43 97 3 4

The statistical one-sided paired t-test can be applied. The null hypothesis is that the rules generated by DOMRUL and those by C4.5 are not different with respect to their errors. The t values based on the results of false positives and false negatives are 3.308 and 4.114, respectively. The null hypothesis can be rejected at the level of significance of < 0.005 (degree of freedom = 9) for both cases. Thus, the difference is very statistically significant.

Furthermore, it has been demonstrated that DOMRUL can effectively deal with noise, inconsistent data, and uncertainty where other machine learning systems based on symbolic search, logic inductive programming, and decision trees fall short, and that DOMRUL learns rules far more accurately than previous systems based on neural networks. In other experiments, 10% noise is added to the data. That is, 10% of the training instances are perturbed on their feature values and the class label. There is a chance of 10% for each value and the class label to be changed in those perturbed instances. Under this condition, C4.5 and DOMRUL were evaluated again on the first five cases which were chosen without any bias. The results are summarized in the following table, which shows C4.5 learned 53 rules with 37 false positive and 26 false negative rules, or equivalently a false positive rate of 69.8% and a false negative rate of 61.9%, whereas DOMRUL learned 41 rules with 3 false positive and 4 false negative rules, or equivalently a false positive rate of 7.3% and a false negative rate of 9.5%. Thus, DOMRUL is robust even in a noisy condition.

Error rates in learning the domain rules under 10% data noise. FP: False positive rules. FN: False negative rules. RN: Rule number.

Case Domain RN C4.5 DOMRUL
Learn RN FP No. FN No. Learn RN FP No. FN No.
1 5 7 3 1 5 0 0
2 7 10 8 5 7 1 1
3 10 10 7 7 10 1 1
4 5 4 1 2 5 0 0
5 15 22 18 11 14 1 2
Overall 42 53 37 26 41 3 4

In a domain of 100 variables with each rule having 25-50 logic literals, DOMRUL successfully discovered the domain rules based on 5000 instances. In this domain, C4.5 failed to learn any rule correctly. Notice that exhaustive search is practically impossible in domains of this size given today's computer technology.

Logic Data Set 1
Logic Noisy Data Set 1
Logic Domain Rule Set 1
Logic Data Set 2
Logic Noisy Data Set 2
Logic Domain Rule Set 2
Logic Data Set 3
Logic Noisy Data Set 3
Logic Domain Rule Set 3
Logic Data Set 4
Logic Noisy Data Set 4
Logic Domain Rule Set 4
Logic Data Set 5
Logic Noisy Data Set 5
Logic Domain Rule Set 5
Logic Data Set 6
Logic Domain Rule Set 6
Logic Data Set 7
Logic Domain Rule Set 7
Logic Data Set 8
Logic Domain Rule Set 8
Logic Data Set 9
Logic Domain Rule Set 9
Logic Data Set 10
Logic Domain Rule Set 10

DOMRUL's performance superiority has also been demonstrated in a number of real world domains (reported elsewhere).

References: