Abstract—Decision trees are one of the most widely used methods of inductive inference. They can be used to learn discrete or continuous valued hypotheses and create compact rules for evaluation of a set of data. An advantage of decision and regression trees is that they are robust to noisy data, which makes them perfectly suited to be able to predict whether a heart attack patient will be alive one year after his incident where all the data may not be available. This paper is a survey of some of the methods of constructing and evaluating decision trees.
1. INTRODUCTION
A myocardial infarction, commonly referred to as a heart attack, is a serious medical condition where blood vessels that supply blood to the heart are blocked, preventing enough oxygen from getting to the heart. The heart muscle dies from the lack of oxygen and impairs heart function or kills the patient. Heart attacks are positively correlated with diabetes, smoking, high blood pressure, obesity, age and alcohol consumption. While prognosis varies greatly based on underlying personal health, the extent of damage and the treatment given, for the period of 20052008 in the United States, the median mortality rate at 30 days was 16.6% with a range from 10.9% to 24.9% depending on the admitting hospital.[9] That rate improves to 96% at the 1 year mark if the patient survives the heart attack.
Physicians would like to be able to tell their patients their possible rate of survival and predict whether or not a certain treatment could help the patient. In order to make that prediction, we can use decision trees to model whether or not the patient has a good chance of survival. Using regression trees, we can map the input space into a realvalued domain using attributes that cardiologists could examine to determine the patient’s chances of survival after a given timeframe.
In building a decision tree, we use the most influential attribute values to represent the internal nodes of the tree at each level. Each internal node tests an attribute value, each edge corresponds to an attribute value and each leaf node leads to a classification – in our case, deceased or alive. We are able to traverse the tree from the root to classify an unseen example. The tree can also be expressed in the form of simple rules. This would be helpful when explaining the prognosis to the patient.
1.1 OUTLINE OF RESEARCH
In this research survey, we implemented a decision tree using an adapted ID3 algorithm. We evaluated different splitting criterion as well as different approaches to handling missing attributes in the dataset. In addition, the author considers different approaches to handling continuous valued attributes and methods to reduce decision tree overfitting. Lastly, results are discussed in section 4 after running the decision tree multiple times with the echocardiogram dataset.
1.2 DATA
The data that we used is from the University of California at Irvine machine learning repository. The dataset that we chose was the 1989 echocardiogram dataset. This dataset contained 132 rows, with 12 attributes, 8 of which were actually usable for decision tree construction (the remaining four were references for the original contributor of this dataset). This dataset had missing values and all of the attributes were continuous valued.
The data described different measurements of patients who had suffered from acute myocardial infarction at some point in the past. The attributes included “AGEATHEARTATTACK” (the patients’ age in years when the heart attack happened), “PERICARDIALEFFUSION” (binary choice relative to fluid around the heart), “FRACTIONALSHORTENING” (a measure of contractility around the heart where lower numbers are abnormal), “EPSS” (Epoint septal separation which is another measure of contractility where larger numbers are abnormal), “LVDD” (left ventricular enddiastolic dimension, where larger numbers are abnormal), “WALLMOTIONINDEX” (a measure of how many segments of the left ventricle are seen moving divided by the number of segments seen) and “ALIVEATONE” (a binary choice where 0 represents deceased or unknown and 1 is alive at one year).
It is important to note that not all rows could be used for learning. Two attributes, “SURVIVAL” and “STILLALIVE” must be analyzed as a set. SURVIVAL described the number of months the patient had survived since the heart attack. Some of the rows described patients who survived less than a year and are still alive based on STILLALIVE (a binary attribute, 0 representing deceased and 1 representing alive). These patients cannot be used for prediction.
It has previously been noted that “the problem addressed by past researchers was to predict from other variables whether or not the patient will survive at least one year. The most difficult part of this problem is correctly predicting that the patient will not survive. This difficulty seems to stem from the size of the dataset.” [1] In implementing the decision tree logic, we have found that this is the case as well.
2. APPLICATION
We chose to write the implementation of the decision tree in Java because of the ease of use of the language. Java also presented superior capabilities in working with and parsing data from files. Using Java allowed the author to more efficiently model the problem through the use of OO concepts, such as polymorphism and inheritance.
2.1 DECISION TREE CONSTRUCTION
The decision tree is constructed using the ID3 algorithm originally described by Quinlan [4] and shown in Mitchell [2] with an adaptation by the author to handle numeric, continuous valued attributes, missing attributes and pruning. ID3 is a simple decision tree algorithm. The algorithm is shown below
ID3 (Examples, Target_Attribute, Attributes)

Create a root node for the tree

If all Examples are positive, return the singlenode tree root, with label = +.

If all Examples are negative, return the singlenode tree root, with label = .

If the number of predicting Attributes is empty, then return the single node tree root, with label = most common value of target attribute in the examples.

Otherwise Begin

A = The Attribute that best classifies Examples.

Decision tree attribute for root = A.

For each possible value, vi, of A

Add a new tree branch below root, corresponding to the test A = vi.

Let Examples(vi) be the subset of examples that have the value vi for A

If Examples(vi) is empty

Below this new branch, add a leaf node with the label = most common target value in the examples.


Else

Below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes  {A})




End

Return root
The ID3 algorithm “uses information gain as splitting criteria and growing stops when all instances belong to a single value of target feature or when best information gain is not greater than zero.” [6] For further information on ID3, see [2], [4], [6], [7].
2.2 SPLITTING CRITERION
A decision tree is formed by having some concrete concept of splitting data into subsets which form children nodes of the parent. In our implementation of the decision tree, we decided to use a univariate impuritybased splitting criterion called entropy.
Entropy is the measure of impurity or chaos in the data. If all elements in a set of data belong to the same class, the entropy would be zero and if all the elements in a dataset were evenly mixed, the entropy would be one. Entropy may be measured with the following equation
where is the number of positive training examples in T and is the number of negative training examples in T. For further discussion on entropy, see Mitchell [2] or Alpaydin [3].
The ID3 algorithm can use this measure of entropy of each attribute set of values to determine the best gain, or expected reduction in entropy due to splitting on A or the difference in entropies before splitting and after splitting on A.
At each level of the tree, the “best” attributes can be found to create the maximum reduction in entropy, called information gain. These two calculations represent the preference to create the smallest tree possible for several different reasons – mainly that a short hypothesis that accurately describes the data is unlikely to be coincidence. For further discussion on entropy and gain, see Mitchell [2], Alpaydin [3], Steinberg [5] and Lior et al [6].
2.3 MISSING VALUES
In the dataset, there are many missing values. The missing values show up mainly for the attributes ALIVEATONE (43.2%), EPSS (10.6%) and LVDD (7.5% of the data). In addition, 66.3% of the data had a row with at least 1 missing attribute value.
Clearly, this dataset is not ideal for predicting attributes. Fortunately, decision trees are robust to noisy data. The strategy of replacing the missing attribute values with the most common value among training examples was suggested in [2]. We decided to implement this idea to deal with these missing values and initiate those values with a surrogate value. The surrogate values were calculated using the average of all of the attributes for continuousvalued data or the most common attribute value for discrete valued data. Our replacement was done only after all the data had been read in from the dataset instead of using a moving average while the data was still being read.
There are two main reasons we chose this method. First, this is an extremely simple method that doesn’t require a lot of calculation. Since decision trees cannot back up once they have made a splitting decision, we would never have to worry about the data changing without needing to regenerate the tree. Second, this allowed the program to describe a finer grain representation of the true average value of the particular attribute for the global dataset.
2.4 HANDLING CONTINUOUS AND DISCRETE ATTRIBUTES
In ID3, handling discrete attributes are quite simple. For each attribute, a child node is created and that branch is taken when the tree is traversed to evaluate a test set. To handle continuous data, some partitioning of the attribute values must take place to discretize the possible range. For example, if we had the attributes shown in the following table:
PlayTennis  Temperature 
No  40 
No  48 
Yes  60 
Yes  72 
Yes  80 
No  90 
We may consider creating a discrete variable that corresponds to Temperature > 40 and Temperature < (48 + 60) / 2 that would produce a true or false value.
For our implementation, we decided to take the average of all of the values and make that our discrete split value, where all instances less than the average branch went to the left node and all the instances greater than the discrete value branch went to the right node. The reason that we chose this approach is because it was a simple implementation over other methods that included binary search, statistical search and weighted searches. [2] [6] This causes our tree to look closer in form to a CART tree rather than an ID3 tree, as CART can only produce trees that are binary trees. CART uses this approach of a single partitioning value when using univariate splitting criteria. [5] The main reason that we are able to implement this approach is the data is such that it is increasingly “abnormal” the larger the value becomes.
2.5 TESTING
Testing of the decision tree logic was carried out using the PlayTennis example shown in [2]. This example used only discrete values. After verifying the decision tree could accurately classify these examples, the program was adapted to use continuous valued attributes.
Testing performed was done with a 3fold cross validation method, were the data was divided into training and test sets such that  R  = k X  S , where R is the size of the training set, k is the relative size and S is the size of the test set.
For each test run, we chose 88 training examples and 44 test examples at random. We attempted to keep labeled classes in the training set as balanced as possible by setting a threshold n. This threshold prevented the classes from becoming unbalanced by no more than a difference of n. If the threshold n was ever reached, we discarded random selections until we were under the threshold again in order to balance the classes. Discussion of results are in section 4.
2.6 DECISION TREE PRUNING
In the process of building the decision tree, the accuracy of the tree is determined by the training examples. “However, measured over an independent set of training examples, the accuracy first increases and then decreases.” [2] This is an instance of overfitting the data. To prevent this overfitting condition, the tree is evaluated and then cut back to the “essential” nodes such that the accuracy does not decrease with realworld training examples.
In our implementation of pruning, we had no stopping criteria to prevent overfitting. Our implementation let the overfitting occur and then we used a postpruning method called ReducedError pruning, as described by Quinlan. [7] In this algorithm, the tree nodes are traversed from bottom to top while the procedure checking to see if replacing it with the most frequent class improves the accuracy of the decision tree. Pruning continues until accuracy decreases and the algorithm ends with the smallest, accurate decision tree.
Further discussion of pruning may be found in [2], [3], [5] and [6].
3. POTENTIAL PROBLEMS
There are some problems that we encountered during implementation. The first problem is that ID3 does not natively support missing attributes, numeric attributes or pruning. The algorithm had to be adapted to support these features. In adapting the algorithm, less than efficient methods were chosen. In my implementation, we split at the mean of the attribute values. This would cause the surrogate values and the split points to be the same value. In addition, this method does not handle outliers in the data well and forces everything towards the center of the sample set. Implementing a better search algorithm or multivariate splitting might allow the author to see improved accuracy. Another alternative would be to use the C4.5 algorithm [8], which is an evolution of ID3 adds full support for these requirements.
Another problem that we experienced is having enough training data without missing attributes to build an effective decision tree. With the large missing attribute rate, it would be hard to get a good handle on any sort of trends in the dataset. A possible solution may be to use a weighted method to assign the most probable value at the point where we encounter the missing value.
4. EVALUATION
For the evaluation of the results, we have chosen not to assign value to the results so that a result of ALIVEATONE equal to 1 is positive and a result of ALIVEATONE equal to 1 as well instead of negative. One may choose to assign intrinsic values to the instances, but one could easily evaluate these results without them. We have chosen to only focus on the results rather than make a determination of the “goodness” of a particular outcome. Since we have chosen this method of evaluation, we would not have any false positives or true negative results, therefore we will not be reporting any precision calculations.
4.2 MACRO EVALUATION
In 10,000 independent runs using random subsets of data for each test, the overall recall for the decision tree was 66.82%. The average Fmeasure for these runs was 0.38. See figure 4.1. In the majority of the runs, the most influential attributes where SURVIVAL, EPSS and LVDD, where each of these attributes appeared in 100% of the decision trees created.
4.1 MIRCO EVALUATION
When broken down into smaller sets of 1,000 runs, the recall and Fmeasure vary greatly. In 10 runs, we had a range from 27.27% recall to 93.18% recall. The median recall was 84.09%.
Run  Recall  Fmeasure 
1  90.90%  0.47 
2  84.09%  0.46 
3  93.18%  0.48 
4  88.63%  0.47 
5  31.81%  0.24 
6  84.09%  0.46 
7  45.45%  0.31 
8  27.27%  0.21 
9  38.63%  0.28 
10  88.63%  0.47 
Overall  66.82%  0.38 
Figure 4.1
There is a large gap between our maximum and our minimum recall. This can be attributed to several issues, including poor data in the data set and less than efficient splitting choices. The accuracy of the decision tree depends partly on the data with which you decide to train. The data that we used was missing many attributes, with almost half (43.2%) of those attribute values being the target attribute. In the absence of a value, the most common value was substituted which, in the case of this data set, would heavily skew the results towards predicting death. It should be noted that overly pessimistic results and overly optimistic results can each present their own dangers to the patient.
Another improvement to the results may come in the form of changing the policy on attribute splitting and missing value handling. If a more accurate method of splitting were implemented (multivariate criterion, using a better search method, etc.), we would expect to see a more consistent result set.
Based on these results, we would feel fairly confident that we could make a useful prediction, but only if we had a confidence rating on the results (due to the high variability of the results).
Based on these results, while not highly accurate, they could provide good insight into what attributes are the most important in regards to classification. In other words, the decision tree would describe the attributes that should be used for other machine learning programs, such as clustering, artificial neural networks or support vector machines. We could be confident that these attributes are the most important because they were chosen through entropy and gain calculations while constructing the decision tree.
5. CONCLUSION
In this research project, different methods of constructing decision and regression trees were explored. Additionally, different methods of node splitting, missing attribute substitution and tree pruning were investigated. While the results of this project show only a 66% accuracy, decision trees are still a valid machine learning technique. With an augmented decision logic and a better data set, decision trees may be able to predict discrete or continuous data at a much better rate.
References
 Salzberg, Stephen. University of California, Irvine Machine Learning Data Repository. 1989. [Online]. http://archive.ics.uci.edu/ml/datasets/Echocardiogram
 Mitchell, Tom M. Machine Learning WCB McGrawHill, Boston, MA. 1997
 Alpaydin, Ethem. Introduction to Machine Learning, Second Edition. The MIT Press, Cambridge, MA. 2010.
 Quinlan, J. R. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81106.
 Steinberg, Dan. CART: Classification and Regression Trees. Taylor and Francis Group. pp 179201, 2009.
 Rokach, Lior and Maimon, Oded. TopDown Induction of Decision Tree Classifieres – A Survey. IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews. Vol. 35, No. 4. pp 476487, November 2005.
 Quinlan, J.R. Simplifying Decision Trees. International Journal of ManMachine Studies. Vol. 27, pp. 221234, 1987.
 Quinlan, J.R. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann, 1993.
 Krumholz H et al. Patterns of hospital performance in acute myocardial infarction and heart failure – 30day mortality and readmission. Circulation: Cardiovascular Quality and Outcomes. [Online] http://circoutcomes.ahajournals.org/cgi/content/abstract/2/5/407. 2009.