To discover underlying domain regularities or rules has been a major long-term goal for scientific research (knowledge discovery) and engineering application (problem solving). However, when the domain rules get complex, current machine learning programs learn only approximate rather than true domain rules from a limited amount of observed data. DOMRUL is intended for discovering precisely the domain rules with neither false positive nor false negative domain rules. In this perspective, experiments in learning complex logic rule sets have demonstrated that DOMRUL is ten times more accurate than the commonly used rule learning tool C4.5. In light of this result, DOMRUL appears a major breakthrough in rule learning technology. For the details of the system's learning theory and related issues, please refer to Publications in this column.
Please type the following symbols correctly in each data element: t, f, positive, and negative.
For example, a data file submitted to the web interface looks like
(setq domain 'hepatitis) (setq feature-vector '(fever fatigue high-SGOT)) (setq data '( ((t f t)(positive)) ((t f f)(negative)) ((t t t)(positive)) ))Or in an e-mail message:
BEGIN (setq domain 'hepatitis) (setq feature-vector '(fever fatigue high-SGOT)) (setq data '( ((t f t)(positive)) ((t f f)(negative)) ((t t t)(positive)) )) END
[Annotation:] In this example, the domain is about hepatitis. There are three features used: fever, fatigue, and high-SGOT, each assuming binary values: t (true/yes) and f (false/no). There are three instances:
instance #1: ((t f t)(positive)) instance #2: ((t f f)(negative)) instance #3: ((t t t)(positive))Each instance has a feature vector value along with a label indicating a positive (example) or negative instance (counterexample) of a certain predefined concept in the domain. Here the concept is simply hepatitis. So, for example, instance #1 is a positive instance with the feature vector value of (t f t) meaning fever = yes, fatigue = no, and high-SGOT = yes.
Another quick self-explanatory example is the exclusive function. The data file is as follows.
(setq domain 'xor) (setq feature-vector '(x1 x2)) (setq data '( ((t t)(negative)) ((t f)(positive)) ((f t)(positive)) ((f f)(negative)) ))Or in an e-mail message:
BEGIN (setq domain 'xor) (setq feature-vector '(x1 x2)) (setq data '( ((t t)(negative)) ((t f)(positive)) ((f t)(positive)) ((f f)(negative)) )) END
[Note 1:] Due to limited computational resources, there should be not more than 25 features and 1500 instances. The data exceeding these limits will not be processed.
[Note 2:] The current version does not process continuous attributes directly. They must be discretized properly before the data is sent.
[Note 3:] The system performs multivariant learning. It is recommended that the number of instances be at least ten times as large as that of features.
[Note 4:] The program is designed to return no more than ten (best) rules for each data submission. If the domain has more rules, an iterative version of this program is needed to learn all of them. This version is not available to the public at this time. In this case, the user can submit multiple selected data sets as an alternative.
[Note 5:] It is the user's responsibility to interpret and use the rules produced by this program.
| Case | Domain RN | C4.5 | DOMRUL | ||||
|---|---|---|---|---|---|---|---|
| Learn RN | FP No. | FN No. | Learn RN | FP No. | FN No. | ||
| 1 | 5 | 5 | 0 | 0 | 5 | 0 | 0 |
| 2 | 7 | 7 | 3 | 3 | 7 | 0 | 0 |
| 3 | 10 | 9 | 4 | 5 | 10 | 0 | 0 |
| 4 | 5 | 5 | 0 | 0 | 5 | 0 | 0 |
| 5 | 15 | 13 | 4 | 6 | 15 | 1 | 1 |
| 6 | 14 | 13 | 4 | 5 | 13 | 1 | 2 |
| 7 | 14 | 18 | 13 | 9 | 14 | 0 | 0 |
| 8 | 9 | 9 | 4 | 4 | 9 | 0 | 0 |
| 9 | 6 | 7 | 4 | 3 | 6 | 1 | 1 |
| 10 | 13 | 15 | 10 | 8 | 13 | 0 | 0 |
| Overall | 98 | 101 | 46 | 43 | 97 | 3 | 4 |
The statistical one-sided paired t-test can be applied. The null hypothesis is that the rules generated by DOMRUL and those by C4.5 are not different with respect to their errors. The t values based on the results of false positives and false negatives are 3.308 and 4.114, respectively. The null hypothesis can be rejected at the level of significance of < 0.005 (degree of freedom = 9) for both cases. Thus, the difference is very statistically significant.
Furthermore, it has been demonstrated that DOMRUL can effectively deal with noise, inconsistent data, and uncertainty where other machine learning systems based on symbolic search, logic inductive programming, and decision trees fall short, and that DOMRUL learns rules far more accurately than previous systems based on neural networks. In other experiments, 10% noise is added to the data. That is, 10% of the training instances are perturbed on their feature values and the class label. There is a chance of 10% for each value and the class label to be changed in those perturbed instances. Under this condition, C4.5 and DOMRUL were evaluated again on the first five cases which were chosen without any bias. The results are summarized in the following table, which shows C4.5 learned 53 rules with 37 false positive and 26 false negative rules, or equivalently a false positive rate of 69.8% and a false negative rate of 61.9%, whereas DOMRUL learned 41 rules with 3 false positive and 4 false negative rules, or equivalently a false positive rate of 7.3% and a false negative rate of 9.5%. Thus, DOMRUL is robust even in a noisy condition.
| Case | Domain RN | C4.5 | DOMRUL | ||||
|---|---|---|---|---|---|---|---|
| Learn RN | FP No. | FN No. | Learn RN | FP No. | FN No. | ||
| 1 | 5 | 7 | 3 | 1 | 5 | 0 | 0 |
| 2 | 7 | 10 | 8 | 5 | 7 | 1 | 1 |
| 3 | 10 | 10 | 7 | 7 | 10 | 1 | 1 |
| 4 | 5 | 4 | 1 | 2 | 5 | 0 | 0 |
| 5 | 15 | 22 | 18 | 11 | 14 | 1 | 2 |
| Overall | 42 | 53 | 37 | 26 | 41 | 3 | 4 |
In a domain of 100 variables with each rule
having 25-50 logic literals,
DOMRUL successfully discovered the domain rules based on
5000 instances.
In this domain, C4.5 failed to learn any rule correctly.
Notice that exhaustive search is practically impossible in domains of
this size given today's computer technology.
Logic Data Set 1
Logic Noisy Data Set 1
Logic Domain Rule Set 1
Logic Data Set 2
Logic Noisy Data Set 2
Logic Domain Rule Set 2
Logic Data Set 3
Logic Noisy Data Set 3
Logic Domain Rule Set 3
Logic Data Set 4
Logic Noisy Data Set 4
Logic Domain Rule Set 4
Logic Data Set 5
Logic Noisy Data Set 5
Logic Domain Rule Set 5
Logic Data Set 6
Logic Domain Rule Set 6
Logic Data Set 7
Logic Domain Rule Set 7
Logic Data Set 8
Logic Domain Rule Set 8
Logic Data Set 9
Logic Domain Rule Set 9
Logic Data Set 10
Logic Domain Rule Set 10
DOMRUL's performance superiority has also been demonstrated in a number of real world domains (reported elsewhere).