************************************************************************* * * * ID3 Documentation * * * * Copyright - (1990 McDonnell Douglas Corporation) * * * ************************************************************************* I. Introduction Name of Algorithm: Enhanced version of ID3 Version: 3.1b3 Name of File Containing Program: id3.public.lisp (id3v31b3.lsp) Name of File Containing Documentation: id3.doc.public (id3v31.doc) File Size: 90K Authors: John F. Dooley, Keith Hacke, and Dan C. St. Clair Date of Last Code Revision: June 27, 1990 Date of Last Document Revision: April 19, 1991 Software Requirements: Common Lisp Special Hardware Requirements: none The code contains the following copyright notice: Copyright - (1990 McDonnell Douglas Corporation) II. Description of Algorithm The ID3 algorithm, developed by Quinlan [1986b], is a non-incremental inductive "classification" algorithm. It uses a set of training examples to build a decision tree representing the concepts learned. This program then tests the "goodness" of the decision tree by attempting to classify a number of test examples. The ID3 algorithm is non-incremental in the sense that it assumes the entire training set is available at the beginning of the algorithm. (As described in the Chi-Square notes, this code performs a limited implementation of the Chi-Square test. It does NOT perform the cutoff described in Quinlan [1986a], page 154. III. User's Guide 1. Getting Started and Stopped To start the ID3 program, first load it into Lisp, using whatever command is appropriate to your system. It is suggested that the program be compiled before it is run. To start the program, simply type: > (run) A copyright message will be printed, followed by a prompt to hit the key. When RETURN is hit, the type of machine and the current date and time are printed, and a series of questions will be asked for which the user will have to provide answers. These questions tailor this run to the user's needs, allowing some variation in the form of the algorithm that is run, and in the selection of which statistical tests are used. There are a set of standard defaults that can be used to avoid most of these questions. The following questions are always asked: Training File to read:[Input the name of the file containing the training examples.] Testing File to read: [Input the name of the file containing the test examples.] Use standard defaults (y/n)? [y]: If the user types 'y', or hits RETURN, the following questions will NOT appear. Instead, default values will be used. The current defaults are: No concept strength test. No chi-square test. Sequential access to the training file. Use the entire training file. When testing, require an exact conclusion match for acceptance of a test example. Perform Concept Strength Test [N is default, else enter a Rate <1.0]? If the user wishes to use the Concept Strength Metric to test and prevent node splitting when the current tree is classifying examples with an accuracy above the concept strength limit, enter the limit, a fraction between 0.0 and 1.0 [St.Clair, et. al.1990a,b]. Perform Chi-Square Test [N]? If the user wishes to use the Chi-Square test to prevent node splitting on attributes having little significance, enter a 'Y', else just hit RETURN. If a 'Y' is entered, a second question is asked, concerning the chi-square confidence level to be used. The chi-square test will examine all the available attributes of a leaf node (those not already used as test attributes on the path from the root to the leaf), and return a list of those attributes whose computed chi-square value is greater than the one extracted from the chi-square table. If none of the available attributes passes the test (the list is empty), then the node is not split. Otherwise, the entropy calculation is made and an attribute selected to be the test attribute for the split node. Note that the current chi-square table allows a maximum of 10 values per attribute. To increase this number,the user must consult a chi-square table (like the one in Hogg and Tanis) and add more values to each row in the table. At what significance level? [0.95]? Four levels are available, 0.90, 0.95 (the default), 0.975, and 0.99. This selects the appropriate row of the chi-square table from which to extract the table value. Fraction of training file to use [1.0]? The fraction of the training file to be used in building the tree (between 0.0 and 1.0) should be entered. Just typing RETURN, allows the entire training file to be used. Random or Sequential access [R]? If the user wants the training file to be read sequentially, type an 's', else if random reading of examples is desired, just hit RETURN. If random access is chosen, a file of random example numbers is generated and this fact is reported to the user. Choose method to determine when to accept conclusion: 1. Conclusion matches any of the training examples 2. Conclusion must match exactly the training examples 3. Conclusion must match the most common training example Enter number of choice [1]: This option will change the manner in which test examples are accepted or rejected by the decision tree. The tree is said to have classified a test example correctly if the attribute values of the test example match the attribute values along a path in the tree and the test conclusion "matches" conclusion(s) found at the corresponding leaf node of the tree. The meaning of conclusion "matches" is as follows: If (1) is chosen (the default value) the test conclusion is said to "match" the conclusion at the leaf node if it is equal to any one of them. If (2) is chosen, the test conclusion is said to "match" the conclusion at the leaf node if there is only one conclusion at the leaf node and the test example's conclusion is equivalent to it. If (3) is chosen, then the test example is accepted if it's conclusion matches the most common conclusion at that leaf. Once these questions are answered, the program will begin to run. The program reads the training file and builds a single decision tree. When the tree has been built, it is tested using the testing file named above. Information is printed for each test example that the tree is unable to classify (see errorS description below). After testing is complete, summary information will be printed, along with a one-line menu. The summary information includes the fraction of test examples correctly classified, the number of interior nodes in the tree, the number of leaves with single conclusions, and the number with multiple conclusions, if any. The one-line menu, shown below, allows the user to get reports on aspects of the tree, and the testing results: [Run,Entropy,Chi,sTrength,Kounts,Print,print-All,Output-file,errorS,Quit]: Each entry in the menu represents an option the user can take once the decision tree has been built and tested. To execute an option the user should type the letter indicated (by Uppercase) in the menu (for example, C for chi-square, O to output the tree to a text file, etc.) Note that except for *root*, the node names are generated automatically by the program using the "gentemp" Common Lisp function. All node names begin with a prefix of "T" which is followed by an integer, as in "T1234". These node names can be seen by printing the tree. The options are: Run - halts the current execution, resets all the variables, and starts over again. Entropy - prompts for a node and computes the entropy function for all the attributes at that node. Chi - prompts for a node and an attribute list. Computes the chi-square values for the attributes at that node, and the tabular chi-square values used to decide node splitting. Remember that for attributes with more than 10 values, this test will fail and the program will usually bomb ungracefully.. sTrength - prompts for a node and computes the Concept Strength metric for the subtree [St.Clair, et. al., 1990]. Kounts - prints out counts for nodes in the current tree, such as number of interior nodes, number of leaves, number of splits prevented, number of pull-ups, etc.. Print - prompts for a node and prints the tree from that node "down". Only the node information is printed, to make the tree more readable. print-All - prompts for a node and prints the entire tree from that node "down". It will print all the node information, plus a list of all the instances stored at each leaf. Note that for larger data sets this tree may not be very readable on a video display screen. errorS - prompts for a node and prints out a table of numbers of test classification failures at that node, as follows: Failure Type Count ------------------------------ ------- Missing Tree Value 3 Different Conclusions at Leaf 12 ------------------------------ ------- "Missing Tree Value" means that the test example had a value for a test attribute that was not found in the tree. These "missing values" occur when the training set does not contain examples that use all the values of attributes that are included in the tree. For example, if an attribute "outlook", has values SUNNY, RAINY, and OVERCAST, and no training example contains a SUNNY, then SUNNY is a missing value. If a test example contains SUNNY, it will not be classified correctly by the tree if "outlook" is one of the attributes on the classification path. As indicated above, this is reported as the tree is tested. "Different Conclusions at Leaf" means that the conclusion of a test example is different than any of the conclusions found at the leaf node the test example reaches. This is an indication that the current set of attributes for this training set is inadequate to discriminate between at least two of the testing examples. Output-file - prompts for an output file name and then writes the tree, and enough information to re-create the tree into a text file. [See Output Format section for a description of the file format.] Quit - exits the program. 2. Input Data Format The training example file consists of six parts: 1. A descriptive comment in the form of a Lisp string. This is a string in double quotes which, in turn, is enclosed in parentheses. This comment is merely printed at the beginning of the run. 2. The attribute space description, which consists of a list containing a set of pairs of the form "attribute (list of values)". 3. The attribute "use" list, which consists of the word "use", followed by a list of attributes to use when building the tree. Only those attributes will be used, and only those values from examples will be extracted when the example is read from the training file. For example: (use class hair eyes) 4. The goal space description, which is a list containing the name of the goal attribute and all it's possible values. The goal must be one of the attributes contained in the attribute space description above. 5. Training examples are organized as lists, one example per list. Attribute values must be listed in the same order as the attribute names appear in the attribute space description. 6. The file must end with a null list (). The example, below is Utgoff's Physical Characteristics data from [Utgoff 1988]. (" Physical Characteristics data Source: Utgoff, ID5, June, 1988, Machine Learning ") (class (pos neg) height (short tall) hair (blond dark red) eyes (blue brown) ) (use class height hair eyes) (class pos neg) (neg short blond brown) (neg tall dark brown) (pos tall blond blue) (neg tall dark blue) (neg short dark blue) (pos tall red blue) (neg tall blond brown) (pos short blond blue) () The testing file contains lists of attribute values to be used to test the decision tree. Each list must be ordered to match the order of attributes in the attribute space description. The null list must be included as the last line in the file. A testing file used for Utgoff's data might be: (neg short blond brown) (neg tall dark brown) (pos tall blond blue) (neg tall dark blue) (neg short dark blue) (pos tall red blue) (neg tall blond brown) (pos short blond blue) () 3. Output Format Output is normally directed to the screen (to save any output, the user should create a "dribble" file before executing this program). A typical session's output looks like: > (run) Copyright Unpublished - (1989, 1990) McDonnell Douglas Research Laboratories). All Rights Reserved under the copyright laws by McDonnell Douglas Corporation. McDonnell Douglas Corporation Proprietary rights are included in the information disclosed herein. Recipients by accepting this software agrees that neither this software nor the information disclosed herein nor any part thereof shall be reproduced or transferred to other software or used or disclosed to others for manufacturing or for any other purpose except as specifically authorized in writing by McDonnell Douglas Corporation. Press to continue... "******* ID3 - June 1990 (modified by Dooley & St. Clair) ********" "Sun-4" 23:57:7 6/8/90 Friday Training file to read: physical.dat Testing file to read: physicalt.dat Use standard defaults (y/n)? [y]: n Perform Concept Strength test [ means NO, else enter a Rate <1.0]? Perform Chi-square test [n]? Training file size = 8 Fraction of training file to use [1.0]? Random (R) or Sequential (S) access [S]? Chose method to determine when to accept conclusion 1. Conclusion matches any of the training examples 2. Conclusion must match exactly the training examples 3. Conclusion must match the most common training example Enter number of choice [1]: Physical characteristics data Source: Utgoff, ID5 June, 1988, Machine Learning ATTRIBUTE SPACE = CLASS HEIGHT HAIR EYES GOAL SPACE = CLASS Sample Size = 8 Building ID3 Tree #Examples Accuracy Error Rate Leaves Same/Concl Interior Conflicting 8 1.000 0.000 4 0 2 0 Errors due to missing attribute values in tree: 0 Errors due to different test example conclusions: 0 Completed! Time to execute is 2.95 Number of splits: 2 Total Number of instances read from training file: 8 Number of leaf nodes in tree: 4 Number of same conclusion interior nodes: 0 Number of interior nodes in tree: 2 Number of multiple conclusion leaf nodes in tree: 0 [Run,Entropy,Chi,sTrength,Kounts,Print,print-All,Output-file,errorS,Quit]? a Which node[*ROOT*]: HAIR T3 [(ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (POS 3 NEG 5))] (1 1) BLOND EYES T4 [(ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (POS 2 NEG 2))] (2/3 2/5) BLUE T8 [(ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (POS 2 NEG 0))] (2/3 0) (CLASS POS HEIGHT TALL) (CLASS POS HEIGHT SHORT) BROWN T7 [(ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (POS 0 NEG 2))] (0 2/5) (CLASS NEG HEIGHT SHORT) (CLASS NEG HEIGHT TALL) DARK T5 [(ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (POS 0 NEG 3))] (0 3/5) (CLASS NEG HEIGHT TALL EYES BROWN) (CLASS NEG HEIGHT TALL EYES BLUE) (CLASS NEG HEIGHT SHORT EYES BLUE) RED T6 [(ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (POS 1 NEG 0))] (1/3 0) (CLASS POS HEIGHT TALL EYES BLUE) [Run,Entropy,Chi,sTrength,Kounts,Print,print-All,Output-file,errorS,Quit]: o Output File name: physical.txt [Run,Entropy,Chi,sTrength,Kounts,Print,print-All,Output-file,errorS,Quit]: q T > ***************************************************************************** A brief explanation of the output table above seems in order: Iteration - number of training examples read to create the current tree. Accuracy - fraction of examples from the test file that were correctly classified by the decision tree. Error Rate - is (1 - Accuracy). Leaves - number of leaves in the current tree. Same/Concl - number of interior nodes with only a single conclusion for all the subtrees under the node. Interior - number of interior nodes in the decision tree. The total number of nodes in the tree is (Leaves + Interior). Conflicting - number of leaves containing examples with two or more conclusions. These nodes were not split because the split was prevented either by the chi-square test, the effective-rate test or because no attributes were left on which to split. Output File Format The output file created when the user exercises the 'o' option is an ASCII text file containing enough information to re-create the tree. (The current version of the ID3 implementation will not allow this file as input. There is a separate program, called "testtree.lsp" that reads this file as input and allows the use of several testing files in succession.) The file format contains only alphanumeric characters, Carriage Returns, Line Feeds, and spaces. The format is as follows: Training file name for identification. Interior Nodes: XX Single Conclusion Leaves: XX Multiple Conclusion Leaves: XX Attributes Used in Tree: (list of attributes tested in interior nodes of tree) ( (a left parenthesis to start the attribute space) ...the attribute space of the training set, presented as ATTRIBUTE-NAME (LIST of VALUES) ...with each attribute on a separate line, and the list of values inside parentheses. ) (a right parenthesis to end the attribute space) (USE [the use list for the tree - a list of the attributes used] ) (GOAL-SPACE presented as a list like (GOAL-ATTRIBUTE Values)) Number of Training Examples in this set. NIL [Node information, with each node presented as:] Node-Name [the gentemp node name assigned to this node] Attribute-Name [if this node is an interior node, **LEAF** otherwise] Children List If the node is an interior node, the Children list is presented as a list of (VALUE NODE-NAME) pairs, all on one line. If the node is a leaf node, the list is presented as follows: ( [a starting left parenthesis all alone at the beginning of the line] (Instance List) [a list of ATTRIBUTE VALUE pairs for this instance; there may be several of these lists, one to a line] ) [a closing right parenthesis all alone at the beginning of the line] (ERRORS (VAL1 COUNT1 ...VALN COUNTN) STATS (VAL1 COUNT1 ...VALN COUNTN)) [the counts of instances in this sub-tree taken from the node's property list] NIL [a NIL used to indicate the end of the information for this node] NIL [at the end of the file, a final, extra NIL to indicate EOF] For example, the output file for the tree created using Utgoff's physical characteristics data looks like: physical.data Interior Nodes: 2 Single Conclusion Leaves: 4 Multiple Conclusion Leaves: 0 Attributes used in tree: (EYES HAIR) ( CLASS (POS NEG) HEIGHT (SHORT TALL) HAIR (BLOND DARK RED) EYES (BLUE BROWN) ) (USE CLASS HEIGHT HAIR EYES) (CLASS POS NEG) 8 NIL T0002 HAIR (BLOND T0003 DARK T0004 RED T0005 (ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (NEG 5 POS 3)) NIL T0003 EYES (BLUE T0007 BROWN T0006) (ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (NEG 2 POS 2)) NIL T0004 **LEAF** ( (CLASS NEG HEIGHT TALL EYES BROWN) (CLASS NEG HEIGHT TALL EYES BLUE) (CLASS NEG HEIGHT SHORT EYES BLUE) ) (ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (NEG 3 POS 0)) NIL T0005 **LEAF** ( (CLASS POS HEIGHT TALL EYES BLUE) ) (ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (NEG 0 POS 1)) NIL T0007 **LEAF** ( (CLASS POS HEIGHT TALL) (CLASS POS HEIGHT SHORT) ) (ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (NEG 0 POS 2)) NIL T0006 **LEAF** ( (CLASS NEG HEIGHT SHORT) (CLASS NEG HEIGHT TALL) ) (ERRORS (MISSING-VAL 0 DIFF-CONCL 0) STATS (NEG 2 POS 0)) NIL NIL 4. Error Messages Aside from fatal Lisp error messages, which are due to incorrect input data, and the fatal error of trying to compute the chi-square values of attributes with more than 10 values, the only error messages printed are: "File does not exist!" is printed if either the training or testing file names can not be found. The user is prompted for a new file name. "Multiple Conclusions at leaf node " is printed if two examples at the same leaf node have exactly the same data, and different conclusions. The data is still added to the tree. "Value missing in instance (.....) for attribute XXX in node Tyyy" is printed if a value is missing from an input example. "***ERROR*** Invalid attribute value!" is printed if an input value for an input example is not in the attribute space description. not in attribute space...ignoring it. is printed when the program is creating the "use list" if it finds an attribute in the user's provided use list that is not in the attribute space. "Error trying to classify: (instance list) Instance failed acceptance criterion. Path taken: TXXX NIL TYYY ATTRIBUTE1 TZZZ ATTRIBUTE2 etc. is printed when a test example has a different conclusion than the conclusion(s) at a leaf node it has reached. "Error trying to classify: (instance list) Test attribute value doesn't match tree. Path taken: TXXX NIL TYYY ATTRIBUTE1 TZZZ ATTRIBUTE2 etc. is printed when a test example has a valid attribute value that was not used to create an edge in the tree. (That is, the value was "missing" in the training set, so an edge for it was not created.) IV. Program Structures and Control Flow 1. Data Structures The most important data structure is, of course, the decision tree. Each node in the tree is created using the same Common Lisp "defstruct" construct (defstruct treenode (test-attribute nil (children nil :type list)) For interior nodes, the test-attribute field contains the name of the attribute used to test examples at that point in the tree. For leaf nodes, the test-attribute field is always NIL. For interior nodes, the children field contains a list of (value node-name) pairs that specify which node in the tree to go to when an example has a particular value. For leaf nodes, the children field contains a list of the examples classified at this leaf, minus the attributes and values the example traversed on the path from the root node to this leaf. For both kinds of nodes, the table of instance counts required by the ID3 algorithm and classification errors at each node is kept on the property list of the node. 2. Major Control Flow The (run) function acts as the driver for the program, setting up some initial variable values via the (build-description-space) function, and calling the (build-decision-tree) function which is a generic tree building function. It initializes more variables, opens the appropriate files, and calls the (build-decision-recurs) function which actually implements the ID3 algorithm. (build-decision-recurs) in turn calls the (split-up-tree) function, which reads in the initial set of training examples, and calls (test-for- split) to begin the process of testing the tree and splitting leaf nodes to create new subtrees. Once the tree has been built, the (learned-goal) function is called to test the tree. (Learned-goal) opens the testing file, and reads test examples one at a time and attempts to classify them. (Learned-goal) calls (knownp), which calls (knownp-rec) to recursively descend the tree, matching test attribute values with the values stored in the tree. At a leaf, (knownp-rec) calls (same-conclusion) to test the acceptance criterion specified by the user. Counts of all correctly and incorrectly classified test examples are kept, and the summaries are reported at the end. At the end of testing the function (menu) is called, which displays the one- line menu and waits for user input. (Menu) calls a number of functions, based on the user selection. For more details please see the section below, and the source code. 3. Top-Level Pseudo-Code for ID3 Implementation BEGIN Initialize variables and the root of the tree. Prompt for and read the initial training set size. READ in the set of training examples, adding them to the root of the tree, which is initially a leaf. Using (split-up-tree) build the decision tree by recursively calling (test-for-split) on first the root, splitting it, and then on all it's children, splitting them, etc. IF (the set of examples at the leaf contains more than one conclusion) THEN Find the best test attribute of those remaining by computing the entropy of each attribute at the leaf and picking the attribute with the smallest entropy value. IF (we are using the chi-square test and the computed chi- square value is smaller than the chi-square table value) THEN DO NOT SPLIT. ELSE IF (we are using the concept strength test and the computed Concept Strength performance rate is greater than the concept strength limit entered by the user) THEN DO NOT SPLIT. ELSE Split the leaf into a sub-tree using the new test attribute. (The new sub-tree will have at least two leaves.) ENDIF ENDIF Test the accuracy of the finished tree by calling (learned-goal) to read the entire test file and attempt to classify all the examples in it. END. 4. Description of Major Tree Construction Routines a) (split-up-tree) - is called from (build-decision-recurs) or (build-decision-hat-recurs) which has already read in the set of training examples. (split-up-tree) recursively builds the decision tree starting at the root by first calling (test-for-split) to split the root. (test-for-split) tests a leaf to see if there are multiple conclusions at the leaf. If it finds more than one conclusion, it determines the best test attribute and makes the leaf into a subtree. (split-up-tree) then calls itself, passing each of the root's new children as the root of a subtree. b) (split-leaf) - is called from (test-for-split) when a leaf contains a set of examples with more than one conclusion. The leaf is then made into the root of a subtree, each leaf of which has examples with only one conclusion. (split-leaf) accomplishes this by calling the function (make-a-leaf-a-node), which makes the leaf into an interior node by changing it's test-attribute field, and calling (add-kids-to-node) which makes the new leaves and attaches them to the children field of the new interior node. V. Hardware and Software Requirements The ID3 program was written in Common Lisp, and should run on any system that supports that standard. No special hardware requirements are necessary. The program has run successfully without modification on Symbolics Lisp Machines, TI Explorer II's and MicroExplorers, Apple Macintosh II's running Allegro Lisp, a Sun- 4/260, a VAX-11/785, and a Vax 8600, both running Vax Lisp. VI. Authors' Addresses John F. Dooley 7121 Stanford Ave. University City, MO 63130 E-mail: s092874@umrvma.umr.edu Dan C. St. Clair University of MO-Rolla, and McDonnell Douglas Research Laboratories P.O. Box 516 St. Louis, MO 63166 E-mail: c0567@umrvmb.umr.edu Keith R. Hacke McDonnell Douglas Corporation P.O. Box 516 St. Louis, MO 63166 VII. Acknowledgements Support for this effort was provided by McDonnell Douglas Research Laboratories, St. Louis, MO. VIII. References Quinlan, J. R., "The Effect of Noise on Concept Learning". Machine Learning: An Artificial Intelligence Approach, Vol. II, 149-166, 1986. 1986. Quinlan, J. R., "Induction of decision trees". Machine Learning, 1:1, 81-106, 1986. St. Clair, D. C, Bond, W. E., and Sabharwal, C. L., "Measuring Concept Strength in Classification Trees," Fifth International Conference on Methodologies for Intelligent Systems: Selected Papers, Knoxville, TN, October 1990, pp. 168-179. St. Clair, D. C, Sabharwal, C. L., Hacke, K., and Bond, W. E., "Using Decision-Tree Classifier Systems to Extract Knowledge from Databases," Fifth Conference on Artificial Intelligence for Space Applications, May, 1990, pp. 507-516. Utgoff, P. E., "ID5: An incremental ID3". Proceedings of the Fifth International Conference on Machine Learning, 107-120, 1988.