This project applies several classification methods learned in class to the data from "The Insurance Company Benchmark (COIL 2000)" [5]. The methods are a linear classifier, a Bayes classifier, a k-NN classifier, and a neural network. The results of all methods are reported, compared, and analyzed. Feature selection is critical in this problem, and it is based on careful analysis of the data. For each classifier, the results are also compared while varying the amount of data, the feature dimension, k in k-NN, and v in Parzen windows.
The Insurance Company Benchmark (COIL 2000) is a set of very detailed survey information on people, some of whom bought or plan to buy a caravan insurance policy. The people were asked to answer 85 questions, each of which can be regarded as one feature for classification. The data set consists of 3 parts. The first is the training data, which contains 5822 survey responses, 348 of them from caravan policy holders. The second is the testing data, which contains answers from 4000 potential caravan insurance policy buyers. The last is the true data, which shows which of the 4000 actually bought the policy.
The original purpose of this data is to answer the question: "Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?" Two kinds of work have been done to answer it: prediction of which customers will buy the policy, and description of the customers and why they will buy. Several works address the first task, and naive Bayes gives the best result: 121 out of 238 total. Other methods tested are random selection (42), k-NN (94), neural network (105) and linear (118).
In this project, besides the target rates, I will also give the error rate caused by misclassification for the different methods.
In the training process (with the training data) I looked into the feature differences between the two classes: policy holders and un-holders. The meaning and interpretation of each feature are given in the data dictionary in the attachment. Each feature has a range of values, each representing a specific meaning or selection under that feature. Although what exactly each value stands for is not explained, the values themselves do give information, at least when comparing the elements of the two classes.
Exclusive features:
These features are special because the policy holders never take certain values of them, while the un-holders do take those values. By looking at these features, we can easily tell who in the test data will not buy the caravan insurance policy, based on the assumption that anyone who takes such an exclusive value will not buy the policy. For example, in feature 1 the possible values for both classes range from 1 to 41, but class 1 (the policy holders) never takes the values 15 to 19, 21, 28, and 40, which class 2 (the policy un-holders) does take. By checking for these values of feature 1, we can find people in the test data who will not buy the caravan policy. Searching all 56 exclusive features finds more. All these prospective un-holders are selected 100% correctly if the above assumption holds. Thus, we mark these people as un-holders and delete them from the test data. In the classification methods that follow, we use only the remaining records, to decrease the error rate caused by classifying.
By this calculation, we find 523 people who will not buy the policy. Checking against the true data, 17 of the 523 (3.25%) are wrong, which adds 17/4000 = 0.425% to the overall error rate. This means the assumption is not perfect: the conditions in the testing data are not exactly the same as those in the training data. But compared with the results in part 5, this error is negligible.
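The exclusive-value filter described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code that produced the reported numbers: the function names `holder_values` and `flag_non_buyers`, the toy records, and the 0/1 label encoding (1 = policy holder) are all hypothetical. The sketch flags any test record that takes a value no training policy holder ever took.

```python
def holder_values(train_rows, train_labels):
    """For each feature, collect the set of values seen among policy holders."""
    n_features = len(train_rows[0])
    seen = [set() for _ in range(n_features)]
    for row, label in zip(train_rows, train_labels):
        if label == 1:  # 1 = policy holder (assumed encoding)
            for j, value in enumerate(row):
                seen[j].add(value)
    return seen


def flag_non_buyers(test_rows, seen):
    """Return indices of test records with at least one value no holder ever took."""
    flagged = []
    for i, row in enumerate(test_rows):
        if any(value not in seen[j] for j, value in enumerate(row)):
            flagged.append(i)
    return flagged


# Toy data: two features, three training records (hypothetical values).
train_rows = [[1, 0], [15, 0], [2, 3]]
train_labels = [1, 0, 1]
test_rows = [[1, 3], [15, 0], [2, 0]]

seen = holder_values(train_rows, train_labels)
flagged = flag_non_buyers(test_rows, seen)
remaining = [i for i in range(len(test_rows)) if i not in flagged]
print(flagged, remaining)
```

Only the records in `remaining` would then be passed on to the classifiers.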
Our goal is to find as many potential policy buyers as possible, subject to a high hit rate, i.e. the ratio of correct selections to the number of selected candidates should be as high as possible. The experiments show that the hit rate is inversely related to the total number of selections, so balancing these two target parameters is a problem. The results also show that preprocessing gives fewer targets but a higher hit rate, while without preprocessing we get slightly more potential buyers at the expense of a lower hit rate.
Another special aspect of this project is that, in both the training and the testing data, the proportions of policy holders and un-holders are very unbalanced. Only 6% of the training data are policy holders and 94% are un-holders, which results in fewer claims of policy holders, especially for classifiers that need the priors P(w1) and P(w2). The typical case is the Bayes classifier, whose decision boundary is P(w1)P(x|w1) - P(w2)P(x|w2) = 0. When P(w1) is much bigger than P(w2) and x is randomly distributed under either w1 or w2, P(w1) dominates the classifier, and most data that belong to w2 may be classified as w1.
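The effect of unbalanced priors on the discriminant P(w1)P(x|w1) - P(w2)P(x|w2) can be illustrated with a small sketch. It assumes one-dimensional Gaussian class densities with hypothetical means and a common standard deviation, and takes w1 as the class with the large prior, as in the paragraph above; only the 94%/6% split comes from the data.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def discriminant(x, prior1, prior2, mean1, mean2, std):
    """g(x) = P(w1) p(x|w1) - P(w2) p(x|w2); a positive value means 'decide w1'."""
    return prior1 * gaussian_pdf(x, mean1, std) - prior2 * gaussian_pdf(x, mean2, std)


# Hypothetical one-dimensional classes: w1 centred at 0, w2 centred at 1.
x = 0.9  # a point much closer to the w2 mean
balanced = discriminant(x, 0.5, 0.5, 0.0, 1.0, 1.0)
skewed = discriminant(x, 0.94, 0.06, 0.0, 1.0, 1.0)
print(balanced < 0, skewed > 0)
```

With equal priors the point is assigned to w2, but with the 94%/6% priors the same point is assigned to w1, even though it lies close to the w2 mean.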
Bayes classifier:
The classifier assumes that the training records of different people are independent, that the features are independent of each other, and that all features have normal distributions. The features are selected from those with high correlation, as judged from real-world experience and from the structure of the data themselves. For example, boating and caravanning are indicative of an outdoor lifestyle: those who have boats are 6 times more likely to also have a caravan (http://www.liacs.nl/~putten/library/cc2000/report2.html, No.20).
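Under these assumptions (independent, normally distributed features), a Bayes classifier can be sketched as below. This is a minimal illustration rather than the implementation behind the reported results; the function names, the class labels, the variance floor, and the toy records are all hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_class(rows):
    """Per-feature mean and variance for one class (features assumed independent)."""
    columns = list(zip(*rows))
    # A small floor keeps the variance positive for constant features.
    return [(mean(col), max(pvariance(col), 1e-6)) for col in columns]


def log_likelihood(x, params):
    """Sum of per-feature Gaussian log densities."""
    total = 0.0
    for value, (mu, var) in zip(x, params):
        total += -0.5 * math.log(2.0 * math.pi * var) - (value - mu) ** 2 / (2.0 * var)
    return total


def classify(x, params_by_class, priors):
    """Pick the class with the largest log prior + log likelihood."""
    scores = {c: math.log(priors[c]) + log_likelihood(x, params_by_class[c])
              for c in params_by_class}
    return max(scores, key=scores.get)


# Hypothetical toy data with two features per record.
holders = [[6, 2], [5, 1], [6, 1]]
non_holders = [[0, 0], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
params = {"holder": fit_class(holders), "non_holder": fit_class(non_holders)}
n = len(holders) + len(non_holders)
priors = {"holder": len(holders) / n, "non_holder": len(non_holders) / n}

print(classify([6, 1], params, priors), classify([0, 0], params, priors))
```

Working with log probabilities avoids numerical underflow when many features are multiplied together.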
The testing is done under two conditions: with preprocessing and without preprocessing. In both cases, I use the whole training data to train, and use the resulting classifier to classify the testing data. Different sets of features are compared:
| | feature=[47 68 44 65 59 80 43 16 21 5] | feature=[47 68 44 65 59 80 43 16 21 20] | feature=11:20 | feature=21:30 | feature=1:20 | feature=11:30 | feature=1:30 |
| with preprocessing | 36/166 (21.69%) | 32/170 (18.82%) | 9/54 (16.67%) | 9/75 (12.00%) | 24/155 (15.48%) | 37/239 (15.48%) | 50/409 (12.22%) |
| without preprocessing | 40/192 (20.83%) | 36/198 (18.18%) | 10/75 (13.33%) | 11/101 (10.89%) | 27/187 (14.44%) | 40/276 (14.49%) | 53/449 (11.80%) |
I also changed the amount of training data used (feature=[47 68 44 65 59 80 43 16 21 5], without preprocessing):
| data percentage | 1/14 | 1/13 | 1/12 | 1/11 | 1/10 | 1/9 | 1/8 | 1/7 | 1/6 | 1/5 | 1/4 | 1/3 | 1/2 | 1/1 |
| without preprocessing | 14/148 | 14/162 | 15/174 | 16/185 | 18/202 | 18/210 | 20/220 | 22/229 | 26/242 | 28/265 | 31/302 | 39/354 | 48/441 | 40/192 |
| data percentage | 1/14 | 2/14 | 3/14 | 4/14 | 5/14 | 6/14 | 7/14 | 8/14 | 9/14 | 10/14 | 11/14 | 12/14 | 13/14 | 14/14 |
| without preprocessing | 14/148 | 22/229 | 28/268 | 34/330 | 39/357 | 42/395 | 48/441 | 54/426 | 54/401 | 53/381 | 52/362 | 47/319 | 45/280 | 40/192 |
The following figure shows the relationship between the number of claimed policy holders, the number of correctly claimed policy holders, and the hit rate. As the amount of data used increases, the hit rate increases, while the number of claims goes up and then down. This is because more training data put more constraints on the classifier, so it is harder to make a claim, but whenever a claim is made, its accuracy is higher. The relationship between the hit rate and the number of claimed policy holders can also be seen. Consistent with what I stated in part four, "expectation of the result", when enough data is used (say, more than 50%), a higher hit rate is accompanied by fewer claimed policy holders.
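The trend can be checked directly from the tabulated counts. The short sketch below recomputes the hit rate for each data fraction in the second table (1/14 to 14/14); the numbers are copied from that table, and the variable names are only illustrative.

```python
# (correct, claimed) pairs copied from the second table above,
# for data fractions 1/14, 2/14, ..., 14/14.
results = [(14, 148), (22, 229), (28, 268), (34, 330), (39, 357), (42, 395),
           (48, 441), (54, 426), (54, 401), (53, 381), (52, 362), (47, 319),
           (45, 280), (40, 192)]

hit_rates = [correct / claimed for correct, claimed in results]
claimed_counts = [claimed for _, claimed in results]
peak = claimed_counts.index(max(claimed_counts)) + 1  # fraction numerator at the peak

for k, (claimed, rate) in enumerate(zip(claimed_counts, hit_rates), start=1):
    print(f"{k:2d}/14  claimed={claimed:3d}  hit rate={rate:.2%}")
print("claims peak at", f"{peak}/14")
```

The number of claims peaks at 7/14 of the data, and from that point on the hit rate rises at every step while the number of claims falls.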

Fisher Linear:
For training, all the data are used, and the features are [47 68 44 65 59 80 43 16 21 5]. The results are given under two conditions: with preprocessing and without preprocessing. In the first figure, 89 / 545 means that 545 claims were made and 89 of them are correct, a hit rate of 16.33%. Compared with the benchmark (http://www.liacs.nl/~putten/library/cc2000/PUTTEN~1.pdf), the result is quite good, but the hit rate is low (still under 20%).
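For reference, the standard Fisher linear discriminant direction w = Sw^{-1}(m1 - m2) can be sketched in plain Python as below. This is an assumption-laden illustration, not the code used for the results above: the helper names, the toy two-dimensional classes, and the midpoint threshold are all hypothetical choices.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mu):
    """Sum of outer products (x - mu)(x - mu)^T over one class."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fisher_direction(class1, class2):
    """w = Sw^{-1} (m1 - m2), with Sw the within-class scatter."""
    m1, m2 = mean_vector(class1), mean_vector(class2)
    s1, s2 = scatter_matrix(class1, m1), scatter_matrix(class2, m2)
    sw = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    return solve(sw, [a - b for a, b in zip(m1, m2)]), m1, m2


def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical toy classes in two dimensions.
class1 = [[4.0, 2.0], [5.0, 3.0], [6.0, 2.5], [5.0, 1.5]]
class2 = [[1.0, 1.0], [2.0, 2.5], [0.5, 2.0], [1.5, 0.5]]
w, m1, m2 = fisher_direction(class1, class2)
# Threshold halfway between the projected class means (one simple choice).
threshold = 0.5 * (project(w, m1) + project(w, m2))
print(project(w, [5.0, 2.0]) > threshold, project(w, [1.0, 1.5]) > threshold)
```

Moving the threshold along the projection axis is one way to trade the number of claims against the hit rate.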


Neural Network:
The features are again [47 68 44 65 59 80 43 16 21 5], because the feature analysis shows this is close to the best set available. There are two sets of results: one with only 10 hidden units and 2% of the training data, the other with 30 hidden units and 20% of the training data. The results show that more hidden units and more training data give more correct claims and a higher accuracy. The training data verification result is obtained by testing the network on the remaining training data. In the first case the best MSE in training is 0, while in the second case the MSE is 5%, yet the second network still gives better results.
| 10 hidden units, 2% training data | training data verify result: 27 / 348 (7.76%) | testing data result: 22 / 327 (6.73%) |
| 30 hidden units, 20% training data | training data verify result: 77 /348 (22.13%) | testing data result: 23 / 213 (10.8%) |
1-NN
The 1-NN classifier used the whole training data, and we get 127 real policy buyers out of 1361 claims, a hit rate of only 9.33%. The reason may lie in the fact that the data values are actually answer choices to specific questions and carry no numerical meaning. The numbers in different features cannot be compared, since their "units" are quite different. We cannot use them effectively, just as we cannot get an exact meaning by comparing "one pound" with "3 feet", unless we can normalize the "units" of these features in a principled way, which is unfortunately not available.
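The "units" problem can be seen in a small 1-NN sketch. The code below assumes squared Euclidean distance on the raw answer codes; the function name and the two toy records are hypothetical, with the first feature ranging widely (like feature 1, which takes values from 1 to 41) and the second taking only 0 or 1.

```python
def nearest_neighbour(x, train_rows, train_labels):
    """Return the label of the training record closest to x (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for row, label in zip(train_rows, train_labels):
        dist = sum((a - b) ** 2 for a, b in zip(x, row))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical records: feature 0 ranges over 1..41, feature 1 over 0..1.
train_rows = [[10, 1], [12, 0]]
train_labels = ["holder", "non_holder"]

# The query agrees with the holder on feature 1 but matches the non-holder
# on the wide-range feature 0, which decides the outcome.
print(nearest_neighbour([12, 1], train_rows, train_labels))
```

A difference of 2 in the wide-range code outweighs a complete disagreement on the binary feature, although neither difference has a comparable meaning.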
K-L
The K-L feature dimension reduction is done by observing the divergence of the two classes' projections on the 85 eigenvectors. The vectors with the biggest divergence (divergence = (mu1-mu2)^2 / (p1*sigma1 + p2*sigma2)) are kept. As mentioned before, there is a tradeoff between the hit rate and the hit number. The following result is obtained with the linear classifier. An adequate number of dimensions gives good selection results; with dimension=30, the result is the best in terms of both hit rate and hit number.
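The selection by divergence can be sketched as follows. The sketch assumes that the projections of each class onto the eigenvectors are already computed, interprets sigma1 and sigma2 as the variances of those projections, and takes p1 and p2 as the class priors; the function names and the toy projections are hypothetical.

```python
from statistics import mean, pvariance


def divergence(proj1, proj2, p1, p2):
    """(mu1 - mu2)^2 / (p1 * sigma1 + p2 * sigma2), with sigma taken as the variance."""
    mu1, mu2 = mean(proj1), mean(proj2)
    return (mu1 - mu2) ** 2 / (p1 * pvariance(proj1) + p2 * pvariance(proj2))


def top_directions(projections1, projections2, p1, p2, keep):
    """Rank directions by divergence and return the indices of the best `keep`."""
    scores = [divergence(a, b, p1, p2) for a, b in zip(projections1, projections2)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:keep], scores


# Hypothetical projections of each class onto three eigenvectors
# (one inner list per eigenvector).
class1_proj = [[2.0, 2.2, 1.8], [0.1, -0.1, 0.0], [1.0, 3.0, 2.0]]
class2_proj = [[0.0, 0.2, -0.2], [0.0, 0.1, -0.1], [1.5, 2.5, 2.0]]
kept, scores = top_directions(class1_proj, class2_proj, p1=0.06, p2=0.94, keep=1)
print(kept)
```

Only the first direction separates the class means in this toy example, so it is the one that is kept.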

The Fisher linear classifier gives the best result in terms of the hit number of policy holders, and the Bayes classifier gives the highest hit rate. The neural network gives the worst result, partly because less data was used in training. The overall result is not ideal, for the following reasons:
The features are not good: few of the 85 features separate the classes. They look too similar for both classes. This can be seen from the data dictionary. It also explains why feature dimension reduction does not bring an obvious improvement.
The proportions of the two classes are heavily unbalanced, so the "small" class may be overwhelmed by, or hardly distinguished from, the "big" one.
So, the conventional classifiers cannot give much better results on this specific problem. We need to find other ways to solve it.
In part 3, I described and analyzed the features; here I give a detailed analysis of which ones are more important and why I selected them.
First of all, the analysis is based on real-life experience. Since this problem is very realistic, we must focus on the meaning of each feature. The caravan policy is not a general policy and is aimed at a specific group of people, so we pay more attention to features that reflect the characteristics of this kind of policy holder. For example, [1], [2], [3] and [4] state that people with the following characteristics (the strongest predictors) are more likely to buy a caravan policy. First, they have contributions to car policies (feature 47), and the number of car policies is also important (feature 68). The reason is obvious: the caravan policy is closely connected with the car policy, and many insurance companies may require a car policy as a prerequisite. Second, many caravan policy holders have some contribution to fire policies (features 59 and 80). The reason is again very "real life": the fire insurance is for the caravan, and the level of fire insurance cover most indicative of a caravan policy is feature 59 with value 4 (43.39% of holders vs. only 19.64% of un-holders). Third, the number of boat policies and their contribution (features 82 and 61) matter, because boating and caravanning are indicative of an outdoor lifestyle: those who have boats are 6 times more likely to also have a caravan. Besides, the purchasing power class (feature 43), the number of cars owned (features 32, 33, 34) and the income (features 37 - 42) are also worth some consideration.
Second, the features are also selected according to the actual data distribution, and are not constrained by the above analysis. For example, after calculation, I found that features 82 and 61 do not improve the target rate or the target number at all; on the contrary, they dramatically damage the results. This is because most records (96.26% of holders and 99.63% of un-holders) take the value zero for them. On the other hand, features 16, 21, 5 and 20 improve either the target rate or the target number. In particular, feature 5 (customer main type) gives a better result than features 33, 34 and 42 (2 cars, no car, and average income), which are recommended in [2]. As for the other features, since most records (often 99%) take the value zero, they carry little information. Here is an example of how the features affect the results (using the Bayes classifier):
| feature | results: | hit rate |
| 47, 68 | 15 / 114 | 0.1316 |
| 47, 68, 44,65 | 14 / 111 | 0.1261 |
| 47, 68, 44, 65, 59, 80 | 14 / 98 | 0.1429 |
| 47, 68, 44, 65, 59, 80, 43 | 14 / 95 | 0.1474 |
| 47, 68, 44, 65, 59, 80, 43, 16 | 24 / 172 | 0.1890 |
| 47, 68, 44, 65, 59, 80, 43, 16, 20 * | 34 / 160 | 0.2125 |
| 47, 68, 44, 65, 59, 80, 43, 16, 20, 5 ** | 36 / 166 | 0.2169 |
| 47, 68, 44, 65, 59, 80, 43, 16, 20, 33 ** | 33 / 155 | 0.2129 |
| 47, 68, 44, 65, 59, 80, 43, 82, 61 * | 3 / 17 | 0.1765 |
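The sparsity check behind the rejection of features 82 and 61 (most records take the value zero) can be sketched as below. The function name and the toy records are hypothetical; on the real data the same computation would be run per class on the chosen feature columns.

```python
def zero_fraction(rows, feature_index):
    """Fraction of records whose value for the given feature is zero."""
    values = [row[feature_index] for row in rows]
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical records with two features; the second is almost always zero.
holders = [[6, 0], [5, 0], [6, 1], [4, 0]]
non_holders = [[0, 0], [5, 0], [0, 0], [6, 0]]

for j in range(2):
    print(j, zero_fraction(holders, j), zero_fraction(non_holders, j))
```

A feature whose zero fraction is close to 1 in both classes gives the classifier very little to separate them with.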
[1] Charles Elkan: "CoIL Challenge 2000 Entry", http://www.liacs.nl/~putten/library/cc2000/ELKANP~1.pdf
[2] YongSeog Kim and W. Nick Street: "CoIL Challenge 2000: Choosing and Explaining Likely Caravan Insurance Customers", May 26, 2000, http://www.liacs.nl/~putten/library/cc2000/STREET~1.pdf
[3] Petri Kontkanen: "CoIL 2000 Submission", http://www.liacs.nl/~putten/library/cc2000/KONTKA~1.pdf
[4] Philip Brierley: "COIL 2000 Challenge: Characteristics of caravan insurance policy owners", http://www.liacs.nl/~putten/library/cc2000/BRIERL~1.pdf
[5] Peter van der Putten, Michel de Ruiter and Maarten van Someren: "CoIL Challenge 2000 Tasks and Results: Predicting and Explaining Caravan Policy Ownership", http://www.liacs.nl/~putten/library/cc2000/PUTTEN~1.pdf