In a previous post, I explored how one might apply decision trees to solve a complex problem. This post will explore the code necessary to implement that decision tree. If you would like a full copy of the source code, it is available here in zip format.

Entropy.java – In Entropy.java, we are concerned with calculating the amount of entropy, or the amount of uncertainty or randomness, associated with a particular variable. For example, consider a classifier with two classes, YES and NO. If a particular variable or attribute, say x, has three training examples of class YES and three training examples of class NO (for a total of six), the entropy would be 1. This is because there is an equal number of examples of each class for this variable, which is as mixed as the set can get. Likewise, if x had all six training examples of a particular class, say YES, then the entropy would be 0 because this particular variable would be pure, making it a leaf node in our decision tree.
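In symbols (this is the quantity the code below computes, summed over the two PlayTennis classes):

Entropy(S) = -\sum_i p_i \log_2 p_i

where p_i is the proportion of examples in S that belong to class i. For the worked example above, p_YES = p_NO = 3/6, so the entropy is -(1/2)\log_2(1/2) - (1/2)\log_2(1/2) = 1.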

Entropy may be calculated in the following way:


import java.util.ArrayList;

public class Entropy {	
	public static double calculateEntropy(ArrayList<Record> data) {
		double entropy = 0;
		
		if(data.size() == 0) {
			// nothing to do
			return 0;
		}
		
		// loop over each possible value of the class attribute (PlayTennis)
		for(int i = 0; i < Hw1.setSize("PlayTennis"); i++) {
			int count = 0;
			for(int j = 0; j < data.size(); j++) {
				Record record = data.get(j);
				
				// the class attribute (PlayTennis) is stored at index 4 of each record
				if(record.getAttributes().get(4).getValue() == i) {
					count++;
				}
			}
				
			// accumulate -p * log2(p) for this class value
			double probability = count / (double)data.size();
			if(count > 0) {
				entropy += -probability * (Math.log(probability) / Math.log(2));
			}
		}
		
		return entropy;
	}
	
	// gain = entropy(parent) - sum over subsets of (|subset| / |parent|) * entropy(subset);
	// "data" is the total number of records at the parent node
	public static double calculateGain(double rootEntropy, ArrayList<Double> subEntropies, ArrayList<Integer> setSizes, int data) {
		double gain = rootEntropy; 
		
		for(int i = 0; i < subEntropies.size(); i++) {
			gain += -((setSizes.get(i) / (double)data) * subEntropies.get(i));
		}
		
		return gain;
	}
}
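As a quick sanity check of the worked example above, the following hedged sketch builds six records (three YES, three NO) and passes them to calculateEntropy. It assumes the Record class from the zip (the same one FileReader.java below populates) and should print 1.0.

import java.util.ArrayList;

public class EntropyExample {
	public static void main(String[] args) {
		ArrayList<Record> sample = new ArrayList<Record>();
		for(int i = 0; i < 6; i++) {
			ArrayList<DiscreteAttribute> attrs = new ArrayList<DiscreteAttribute>();
			attrs.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Sunny));
			attrs.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Hot));
			attrs.add(new DiscreteAttribute("Humidity", DiscreteAttribute.High));
			attrs.add(new DiscreteAttribute("Wind", DiscreteAttribute.Weak));
			// first three records are YES, last three are NO -> maximum impurity
			attrs.add(new DiscreteAttribute("PlayTennis",
					i < 3 ? DiscreteAttribute.PlayYes : DiscreteAttribute.PlayNo));
			
			Record r = new Record();
			r.setAttributes(attrs);
			sample.add(r);
		}
		
		// an evenly split two-class set has entropy 1
		System.out.println(Entropy.calculateEntropy(sample));
	}
}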

Tree.java – This tree class contains all our code for building our decision tree. Note that at each level, we choose the attribute that presents the best gain for that node. The gain is simply the expected reduction in the entropy of X achieved by learning the state of the random variable A. Gain is also known as Kullback–Leibler divergence. Gain can be calculated in the following way:
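In the usual notation (and matching what calculateGain above computes), the gain of an attribute A over a set S is

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where S_v is the subset of S for which attribute A takes the value v.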

Notice that gain is calculated as a function of all the values of the attribute.

import java.io.*;
import java.util.*;

public class Tree {
	public Node buildTree(ArrayList<Record> records, Node root, LearningSet learningSet) {
		int bestAttribute = -1;
		double bestGain = 0;
		root.setEntropy(Entropy.calculateEntropy(root.getData()));
		
		if(root.getEntropy() == 0) {
			return root;
		}
		
		// consider each unused predictor attribute (the class attribute, PlayTennis, is excluded)
		for(int i = 0; i < Hw1.NUM_ATTRS - 2; i++) {
			if(!Hw1.isAttributeUsed(i)) {
				double entropy = 0;
				ArrayList<Double> entropies = new ArrayList<Double>();
				ArrayList<Integer> setSizes = new ArrayList<Integer>();
				
				// iterate over each possible value of attribute i, keeping the entropy
				// and set-size lists aligned so that calculateGain pairs them correctly
				for(int j = 0; j < Hw1.setSize(Hw1.attrMap.get(i)); j++) {
					ArrayList<Record> subset = subset(root, i, j);
					setSizes.add(subset.size());
					
					if(subset.size() != 0) {
						entropy = Entropy.calculateEntropy(subset);
						entropies.add(entropy);
					}
					else {
						// an empty subset contributes zero entropy
						entropies.add(0.0);
					}
				}
				
				double gain = Entropy.calculateGain(root.getEntropy(), entropies, setSizes, root.getData().size());
				
				if(gain > bestGain) {
					bestAttribute = i;
					bestGain = gain;
				}
			}
		}
		if(bestAttribute != -1) {
			int setSize = Hw1.setSize(Hw1.attrMap.get(bestAttribute));
			root.setTestAttribute(new DiscreteAttribute(Hw1.attrMap.get(bestAttribute), 0));
			root.children = new Node[setSize];
			root.setUsed(true);
			Hw1.usedAttributes.add(bestAttribute);
			
			for (int j = 0; j< setSize; j++) {
				root.children[j] = new Node();
				root.children[j].setParent(root);
				root.children[j].setData(subset(root, bestAttribute, j));
				root.children[j].getTestAttribute().setName(Hw1.getLeafNames(bestAttribute, j));
				root.children[j].getTestAttribute().setValue(j);
			}

			for (int j = 0; j < setSize; j++) {
				buildTree(root.children[j].getData(), root.children[j], learningSet);
			}

			root.setData(null);
		}
		else {
			return root;
		}
		
		return root;
	}
	
	public ArrayList<Record> subset(Node root, int attr, int value) {
		ArrayList<Record> subset = new ArrayList<Record>();
		
		for(int i = 0; i < root.getData().size(); i++) {
			Record record = root.getData().get(i);
			
			if(record.getAttributes().get(attr).getValue() == value) {
				subset.add(record);
			}
		}
		return subset;
	}
	
	public double calculateSurrogates(ArrayList<Record> records) {
		return 0;
	}
}

DiscreteAttribute.java – DiscreteAttribute.java holds the named constants and enumerations for each attribute value in the PlayTennis example, along with constructors that delegate to Attribute.


public class DiscreteAttribute extends Attribute {
	public static final int Sunny = 0;
	public static final int Overcast = 1;
	public static final int Rain = 2;

	public static final int Hot = 0;
	public static final int Mild = 1;
	public static final int Cool = 2;
	
	public static final int High = 0;
	public static final int Normal = 1;
	
	public static final int Weak = 0;
	public static final int Strong = 1;
	
	public static final int PlayNo = 0;
	public static final int PlayYes = 1;
	
	enum PlayTennis {
		No,
		Yes
	}
	
	enum Wind {
		Weak,
		Strong
	}
	
	enum Humidity {
		High,
		Normal
	}
	
	enum Temp {
		Hot,
		Mild,
		Cool
	}
	
	enum Outlook {
		Sunny,
		Overcast,
		Rain
	}

	public DiscreteAttribute(String name, double value) {
		super(name, value);
	}

	public DiscreteAttribute(String name, String value) {
		super(name, value);
	}
}

Hw1.java – This class is our main driver class. It reads in the records, builds the tree and then traverses it to classify an example.

import java.util.*;

public class Hw1 {
	public static int NUM_ATTRS = 6;
	public static ArrayList<String> attrMap;
	public static ArrayList<Integer> usedAttributes = new ArrayList<Integer>();

	public static void main(String[] args) {
		populateAttrMap();

		Tree t = new Tree();
		ArrayList<Record> records;
		LearningSet learningSet = new LearningSet();
		
		// read in all our data
		records = FileReader.buildRecords();
		
		Node root = new Node();
		
		for(Record record : records) {
			root.getData().add(record);
		}
		
		t.buildTree(records, root, learningSet);
		traverseTree(records.get(12), root);
		return;
	}
	
	public static void traverseTree(Record r, Node root) {
		// walk down the tree, following the branch whose value matches the
		// record's value for the attribute tested at the current node
		if(root.children != null) {
			double nodeValue = 0;
			for(int i = 0; i < r.getAttributes().size(); i++) {
				if(r.getAttributes().get(i).getName().equalsIgnoreCase(root.getTestAttribute().getName())) {
					nodeValue = r.getAttributes().get(i).getValue();
					break;
				}
			}
			for(int i = 0; i < root.getChildren().length; i++) {
				if(nodeValue == root.children[i].getTestAttribute().getValue()) {
					traverseTree(r, root.children[i]);
					return;
				}
			}
		}
		
		// at a leaf, print the prediction
		System.out.print("Prediction for Play Tennis: ");
		if(root.getTestAttribute().getValue() == 0) {
			System.out.println("No");
		}
		else if(root.getTestAttribute().getValue() == 1) {
			System.out.println("Yes");
		}

		return;
	}
	
	public static boolean isAttributeUsed(int attribute) {
		if(usedAttributes.contains(attribute)) {
			return true;
		}
		else {
			return false;
		}
	}
	
	public static int setSize(String set) {
		if(set.equalsIgnoreCase("Outlook")) {
			return 3;
		}
		else if(set.equalsIgnoreCase("Wind")) {
			return 2;
		}
		else if(set.equalsIgnoreCase("Temperature")) {
			return 3;
		}
		else if(set.equalsIgnoreCase("Humidity")) {
			return 2;
		}
		else if(set.equalsIgnoreCase("PlayTennis")) {
			return 2;
		}
		return 0;
	}
	
	public static String getLeafNames(int attributeNum, int valueNum) {
		if(attributeNum == 0) {
			if(valueNum == 0) {
				return "Sunny";
			}
			else if(valueNum == 1) {
				return "Overcast";
			}
			else if(valueNum == 2) {
				return "Rain";
			}
		}
		else if(attributeNum == 1) {
			if(valueNum == 0) {
				return "Hot";
			}
			else if(valueNum == 1) {
				return "Mild";
			}
			else if(valueNum == 2) {
				return "Cool";
			}
		}
		else if(attributeNum == 2) {
			if(valueNum == 0) {
				return "High";
			}
			else if(valueNum == 1) {
				return "Normal";
			}
		}
		else if(attributeNum == 3) {
			if(valueNum == 0) {
				return "Weak";
			}
			else if(valueNum == 1) {
				return "Strong";
			}
		}
		
		return null;
	}
	
	public static void populateAttrMap() {
		attrMap = new ArrayList<String>();
		attrMap.add("Outlook");
		attrMap.add("Temperature");
		attrMap.add("Humidity");
		attrMap.add("Wind");
		attrMap.add("PlayTennis");
	}
}

Node.java – Node.java holds the information for each node in the tree: its parent, children, data subset, entropy and test attribute.


import java.util.*;

public class Node {
	private Node parent;
	public Node[] children;
	private ArrayList<Record> data;
	private double entropy;
	private boolean isUsed;
	private DiscreteAttribute testAttribute;

	public Node() {
		this.data = new ArrayList<Record>();
		setEntropy(0.0);
		setParent(null);
		setChildren(null);
		setUsed(false);
		setTestAttribute(new DiscreteAttribute("", 0));
	}

	public void setParent(Node parent) {
		this.parent = parent;
	}

	public Node getParent() {
		return parent;
	}

	public void setData(ArrayList<Record> data) {
		this.data = data;
	}

	public ArrayList<Record> getData() {
		return data;
	}

	public void setEntropy(double entropy) {
		this.entropy = entropy;
	}

	public double getEntropy() {
		return entropy;
	}

	public void setChildren(Node[] children) {
		this.children = children;
	}

	public Node[] getChildren() {
		return children;
	}

	public void setUsed(boolean isUsed) {
		this.isUsed = isUsed;
	}

	public boolean isUsed() {
		return isUsed;
	}

	public void setTestAttribute(DiscreteAttribute testAttribute) {
		this.testAttribute = testAttribute;
	}

	public DiscreteAttribute getTestAttribute() {
		return testAttribute;
	}
}

FileReader.java – The least interesting class in the code; it simply parses playtennis.data into Record objects.

import java.io.*;
import java.util.ArrayList;
import java.util.StringTokenizer;

public class FileReader {
	public static final String PATH_TO_DATA_FILE = "playtennis.data";

    public static ArrayList<Record> buildRecords() {
		BufferedReader reader = null;
		ArrayList<Record> records = new ArrayList<Record>();

        try { 
           File f = new File(PATH_TO_DATA_FILE);
           FileInputStream fis = new FileInputStream(f); 
           reader = new BufferedReader(new InputStreamReader(fis));
           
           // read the first record of the file
           String line;
           Record r = null;
           ArrayList<DiscreteAttribute> attributes;
           while ((line = reader.readLine()) != null) {
              StringTokenizer st = new StringTokenizer(line, ",");
              attributes = new ArrayList<DiscreteAttribute>();
              r = new Record();
              
              if(Hw1.NUM_ATTRS != st.countTokens()) {
            	  throw new Exception("Unknown number of attributes!");
              }
              	
			  @SuppressWarnings("unused")
			  String day = st.nextToken();
			  String outlook = st.nextToken();
			  String temperature = st.nextToken();
			  String humidity = st.nextToken();
			  String wind = st.nextToken();
			  String playTennis = st.nextToken();
			  
			  if(outlook.equalsIgnoreCase("overcast")) {
				  attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Overcast));
			  }
			  else if(outlook.equalsIgnoreCase("sunny")) {
				  attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Sunny));
			  }
			  else if(outlook.equalsIgnoreCase("rain")) {
				  attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Rain));
			  }
			  
			  if(temperature.equalsIgnoreCase("hot")) {
				  attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Hot));
			  }
			  else if(temperature.equalsIgnoreCase("mild")) {
				  attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Mild));
			  }
			  else if(temperature.equalsIgnoreCase("cool")) {
				  attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Cool));
			  }
			  
			  if(humidity.equalsIgnoreCase("high")) {
				  attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.High));
			  }
			  else if(humidity.equalsIgnoreCase("normal")) {
				  attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.Normal));
			  }
			  
			  if(wind.equalsIgnoreCase("weak")) {
				  attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Weak));

			  }
			  else if(wind.equalsIgnoreCase("strong")) {
				  attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Strong));

			  }
			  
			  if(playTennis.equalsIgnoreCase("no")) {
				  attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayNo));
			  }
			  else if(playTennis.equalsIgnoreCase("yes")) {
				  attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayYes));
			  }
			    		    
			  r.setAttributes(attributes);
			  records.add(r);
           }

        } 
        catch (IOException e) { 
           System.out.println("Uh oh, got an IOException error: " + e.getMessage()); 
        } 
        catch (Exception e) {
            System.out.println("Uh oh, got an Exception error: " + e.getMessage()); 
        }
        finally { 
           // close the reader that was actually opened
           if (reader != null) {
              try {
                 reader.close();
              } catch (IOException ioe) {
                 System.out.println("IOException error trying to close the file: " + ioe.getMessage()); 
              }
           }
        }
		return records;
	}
}

**EDIT:**

playtennis.data is only a simple text file that describes the learned attribute “play tennis” for a particular learning episode. I modeled my playtennis.data file off of Tom Mitchell’s play tennis example in his book “Machine Learning.” Essentially, it contains attributes describing each learning episode: outlook (sunny, overcast, rain), temperature (hot, mild, cool), humidity (high, normal), wind (strong, weak) and play tennis (yes, no). Based on this information, one can trace the decision tree to derive the learned attribute. A sample decision tree is below. All you have to do is create a comma-delimited text file that describes each particular situation (make sure the columns in the text file match the parsing that occurs in FileReader.java); a few example rows are shown below.
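For reference, the first few rows of such a file would look like this (the values follow Mitchell's PlayTennis table, and the day label in the first column is read and then discarded by FileReader.java):

D1,Sunny,Hot,High,Weak,No
D2,Sunny,Hot,High,Strong,No
D3,Overcast,Hot,High,Weak,Yes
D4,Rain,Mild,High,Weak,Yes

...and so on for the remaining ten rows of the table.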

**EDIT 2:**

After so many requests, I have worked up a small playtennis.data file based on the Tom Mitchell “Machine Learning” book. It follows his example data exactly and can be found here (rename the file to “playtennis.data” before running, as WordPress wouldn’t let me upload a .data file due to “security restrictions”). A small note of caution: the above code was put together very quickly for the purposes of my own learning. I am not claiming that it is by any means complete or fault tolerant; however, I do believe that the entropy and gain calculations are sound and correct.

Abstract—Decision trees are one of the most widely used methods of inductive inference. They can be used to learn discrete or continuous valued hypotheses and create compact rules for evaluating a set of data. An advantage of decision and regression trees is that they are robust to noisy data, which makes them well suited to predicting whether a heart attack patient will be alive one year after the incident, even when not all of the data is available. This paper is a survey of some of the methods of constructing and evaluating decision trees.

1.  INTRODUCTION

A myocardial infarction, commonly referred to as a heart attack, is a serious medical condition in which the blood vessels that supply blood to the heart are blocked, preventing enough oxygen from reaching the heart. The heart muscle dies from the lack of oxygen, which impairs heart function or kills the patient. Heart attacks are positively correlated with diabetes, smoking, high blood pressure, obesity, age and alcohol consumption. While prognosis varies greatly based on underlying personal health, the extent of damage and the treatment given, for the period of 2005-2008 in the United States the median mortality rate at 30 days was 16.6%, with a range from 10.9% to 24.9% depending on the admitting hospital. [9] For patients who survive the heart attack itself, survival at the one-year mark improves to 96%.

Physicians would like to be able to tell their patients their possible rate of survival and predict whether or not a certain treatment could help the patient. In order to make that prediction, we can use decision trees to model whether or not the patient has a good chance of survival. Using regression trees, we can map the input space into a real-valued domain using attributes that cardiologists could examine to determine the patient’s chances of survival after a given timeframe.

In building a decision tree, we use the most influential attributes to represent the internal nodes of the tree at each level. Each internal node tests an attribute, each edge corresponds to an attribute value and each leaf node leads to a classification – in our case, deceased or alive. We are able to traverse the tree from the root to classify an unseen example. The tree can also be expressed in the form of simple rules, which would be helpful when explaining the prognosis to the patient.

1.1  OUTLINE OF RESEARCH

In this research survey, we implemented a decision tree using an adapted ID3 algorithm. We evaluated different splitting criteria as well as different approaches to handling missing attributes in the dataset. In addition, we considered different approaches to handling continuous valued attributes and methods to reduce decision tree over-fitting. Lastly, results are discussed in section 4 after running the decision tree multiple times with the echocardiogram dataset.

1.2  DATA

The data that we used is from the University of California at Irvine machine learning repository.  The dataset that we chose was the 1989 echocardiogram dataset.  This dataset contained 132 rows, with 12 attributes, 8 of which were actually usable for decision tree construction (the remaining four were references for the original contributor of this dataset).  This dataset had missing values and all of the attributes were continuous valued.

The data described different measurements of patients who had suffered from acute myocardial infarction at some point in the past.  The attributes included “AGE-AT-HEART-ATTACK” (the patients’ age in years when the heart attack happened), “PERICARDIAL-EFFUSION” (binary choice relative to fluid around the heart), “FRACTIONAL-SHORTENING” (a measure of contractility around the heart where lower numbers are abnormal), “EPSS” (E-point septal separation which is another measure of contractility where larger numbers are abnormal), “LVDD” (left ventricular end-diastolic dimension, where larger numbers are abnormal), “WALL-MOTION-INDEX” (a measure of how many segments of the left ventricle are seen moving divided by the number of segments seen) and “ALIVE-AT-ONE” (a binary choice where 0 represents deceased or unknown and 1 is alive at one year).

It is important to note that not all rows could be used for learning.  Two attributes, “SURVIVAL” and “STILL-ALIVE” must be analyzed as a set.  SURVIVAL described the number of months the patient had survived since the heart attack.  Some of the rows described patients who survived less than a year and are still alive based on STILL-ALIVE (a binary attribute, 0 representing deceased and 1 representing alive).  These patients cannot be used for prediction.

It has previously been noted that “the problem addressed by past researchers was to predict from other variables whether or not the patient will survive at least one year.  The most difficult part of this problem is correctly predicting that the patient will not survive.  This difficulty seems to stem from the size of the dataset.” [1]  In implementing the decision tree logic, we have found that this is the case as well.

2.  APPLICATION

We chose to implement the decision tree in Java because of the language's ease of use and its strong facilities for working with and parsing data from files. Using Java also allowed us to model the problem more naturally through object-oriented concepts such as polymorphism and inheritance.

2.1  DECISION TREE CONSTRUCTION

The decision tree is constructed using the ID3 algorithm originally described by Quinlan [4] and shown in Mitchell [2], with our own adaptations to handle numeric, continuous valued attributes, missing attributes and pruning. ID3 is a simple decision tree algorithm, shown below.

ID3 (Examples, Target_Attribute, Attributes)
  • Create a root node for the tree
  • If all Examples are positive, return the single-node tree root, with label = +.
  • If all Examples are negative, return the single-node tree root, with label = -.
  • If the number of predicting Attributes is empty, then return the single node tree root, with label = most common
    value of target attribute in the examples.
  • Otherwise Begin
    • A = The Attribute that best classifies Examples.
    • Decision tree attribute for root = A.
    • For each possible value, vi, of A
      • Add a new tree branch below root, corresponding to the test A = vi.
      • Let Examples(vi) be the subset of examples that have the value vi for A
      • If Examples(vi) is empty
        • Below this new branch, add a leaf node with the label = most common target value in the examples.
      • Else
        • Below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes - {A})
  • End
  • Return root

The ID3 algorithm “uses information gain as splitting criteria and growing stops when all instances belong to a single value of target feature or when best information gain is not greater than zero.” [6]  For further information on ID3, see [2], [4], [6], [7].

2.2  SPLITTING CRITERION

A decision tree is formed by having some concrete criterion for splitting data into subsets which form the child nodes of the parent. In our implementation of the decision tree, we decided to use a univariate, impurity-based splitting criterion called entropy.

Entropy is the measure of impurity or chaos in the data. If all elements in a set of data belong to the same class, the entropy is zero, and if the elements of a two-class dataset are evenly mixed, the entropy is one. Entropy may be measured with the following equation

Entropy(T) = -p_+ \log_2 p_+ - p_- \log_2 p_-

where p_+ is the proportion of positive training examples in T and p_- is the proportion of negative training examples in T. For further discussion on entropy, see Mitchell [2] or Alpaydin [3].

The ID3 algorithm uses this measure of entropy over each attribute's set of values to determine the best gain: the expected reduction in entropy due to splitting on an attribute A, i.e. the difference between the entropy before splitting and the weighted entropy after splitting on A.
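Written out (in the form given by Mitchell [2]), this is

Gain(T, A) = Entropy(T) - \sum_{v \in Values(A)} \frac{|T_v|}{|T|} Entropy(T_v)

where T_v is the subset of T for which attribute A takes the value v.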

At each level of the tree, the “best” attribute can be found: the one that creates the maximum reduction in entropy, called the information gain. These two calculations represent a preference for the smallest possible tree, for several reasons – mainly that a short hypothesis which accurately describes the data is unlikely to be a coincidence. For further discussion on entropy and gain, see Mitchell [2], Alpaydin [3], Steinberg [5] and Rokach and Maimon [6].

2.3  MISSING VALUES

In the dataset, there are many missing values. The missing values show up mainly in the attributes ALIVE-AT-ONE (43.2%), EPSS (10.6%) and LVDD (7.5%). In addition, 66.3% of the rows had at least one missing attribute value.

Clearly, this dataset is not ideal for prediction. Fortunately, decision trees are robust to noisy data. The strategy of replacing missing attribute values with the most common value among the training examples was suggested in [2]. We decided to implement this idea and initialize missing values with a surrogate value. The surrogate values were calculated using the average of all values of the attribute for continuous-valued data, or the most common value for discrete-valued data. The replacement was done only after all the data had been read in from the dataset, rather than using a moving average while the data was still being read.

There are two main reasons we chose this method. First, it is an extremely simple method that does not require much calculation. Since decision trees cannot back up once they have made a splitting decision, we never have to worry about the data changing underneath us and having to regenerate the tree. Second, it allowed the program to use a finer-grained representation of the true average value of the particular attribute over the global dataset.
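A minimal sketch of this substitution is below. It is not the survey's actual code: the double[] row representation and the use of Double.NaN as the missing-value marker are assumptions made purely for illustration.

import java.util.*;

public class SurrogateFiller {
	// replace any missing (NaN) continuous attribute with the mean of the observed values,
	// computed over the whole dataset after everything has been read in
	public static void fillMissingWithMean(List<double[]> rows, int attributeIndex) {
		double sum = 0;
		int observed = 0;
		for(double[] row : rows) {
			if(!Double.isNaN(row[attributeIndex])) {
				sum += row[attributeIndex];
				observed++;
			}
		}
		double mean = (observed > 0) ? sum / observed : 0;
		for(double[] row : rows) {
			if(Double.isNaN(row[attributeIndex])) {
				row[attributeIndex] = mean;
			}
		}
	}

	// for a discrete attribute, substitute the most common observed value instead
	public static void fillMissingWithMode(List<double[]> rows, int attributeIndex) {
		Map<Double, Integer> counts = new HashMap<Double, Integer>();
		for(double[] row : rows) {
			if(!Double.isNaN(row[attributeIndex])) {
				Integer c = counts.get(row[attributeIndex]);
				counts.put(row[attributeIndex], (c == null) ? 1 : c + 1);
			}
		}
		double mode = 0;
		int best = -1;
		for(Map.Entry<Double, Integer> e : counts.entrySet()) {
			if(e.getValue() > best) {
				best = e.getValue();
				mode = e.getKey();
			}
		}
		for(double[] row : rows) {
			if(Double.isNaN(row[attributeIndex])) {
				row[attributeIndex] = mode;
			}
		}
	}
}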

2.4  HANDLING CONTINUOUS AND DISCRETE ATTRIBUTES

In ID3, handling discrete attributes is quite simple. For each value of the attribute, a child node is created, and that branch is taken when the tree is traversed to evaluate a test example. To handle continuous data, some partitioning of the attribute values must take place to discretize the possible range. For example, suppose we had the attributes shown in the following table:

PlayTennis Temperature
No 40
No 48
Yes 60
Yes 72
Yes 80
No 90

We may consider creating a discrete, Boolean-valued attribute corresponding to Temperature > (48 + 60) / 2 = 54, the midpoint between the adjacent values where the classification changes, which produces a true or false value for each example.

For our implementation, we decided to take the average of all of the values and make that our split value: all instances with a value less than the average went to the left child and all instances with a value greater than or equal to the average went to the right child. We chose this approach because it is simpler to implement than other methods such as binary search, statistical search and weighted searches. [2] [6] This causes our tree to look closer in form to a CART tree than an ID3 tree, as CART can only produce binary trees; CART uses this approach of a single partitioning value when using univariate splitting criteria. [5] The main reason we can get away with this approach is that, for this data, larger values are increasingly “abnormal.”
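A minimal sketch of this mean-based split, under the same illustrative double[] row representation assumed earlier (again, not the survey's actual code):

import java.util.*;

public class MeanSplit {
	// compute the split point as the average of all values of one attribute,
	// then partition the rows: values below the mean go left, the rest go right
	public static List<List<double[]>> splitOnMean(List<double[]> rows, int attributeIndex) {
		double sum = 0;
		for(double[] row : rows) {
			sum += row[attributeIndex];
		}
		double mean = rows.isEmpty() ? 0 : sum / rows.size();

		List<double[]> left = new ArrayList<double[]>();
		List<double[]> right = new ArrayList<double[]>();
		for(double[] row : rows) {
			if(row[attributeIndex] < mean) {
				left.add(row);
			}
			else {
				right.add(row);
			}
		}
		return Arrays.<List<double[]>>asList(left, right);
	}
}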

2.5  TESTING

Testing of the decision tree logic was carried out using the PlayTennis example shown in [2].  This example used only discrete values.  After verifying the decision tree could accurately classify these examples, the program was adapted to use continuous valued attributes.

Testing was performed with a 3-fold cross validation method, where the data was divided into training and test sets such that |R| = k × |S|, where |R| is the size of the training set, k is the relative size and |S| is the size of the test set.

For each test run, we chose 88 training examples and 44 test examples at random. We attempted to keep the labeled classes in the training set as balanced as possible by setting a threshold n: the class counts were never allowed to differ by more than n. If the threshold was ever reached, we discarded random selections until we were under the threshold again. Discussion of the results is in section 4.
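A sketch of this balanced random split is below; the double[] rows, the class label in the last column and the helper names are illustrative assumptions, while the 88/44 sizes and the threshold n come from the description above.

import java.util.*;

public class BalancedSplit {
	// randomly draw trainSize rows for training (the rest become the test set),
	// rejecting draws that would let the class counts differ by more than n
	public static List<List<double[]>> split(List<double[]> rows, int trainSize, int n, int labelIndex) {
		List<double[]> pool = new ArrayList<double[]>(rows);
		Collections.shuffle(pool);

		List<double[]> train = new ArrayList<double[]>();
		List<double[]> test = new ArrayList<double[]>();
		int positives = 0, negatives = 0;

		for(double[] row : pool) {
			boolean positive = row[labelIndex] == 1;
			int newDiff = Math.abs((positives + (positive ? 1 : 0)) - (negatives + (positive ? 0 : 1)));
			if(train.size() < trainSize && newDiff <= n) {
				train.add(row);
				if(positive) positives++; else negatives++;
			}
			else {
				test.add(row);
			}
		}
		return Arrays.<List<double[]>>asList(train, test);
	}

	// e.g. split(rows, 88, n, rows.get(0).length - 1) for the 88/44 split used in testing
}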

2.6  DECISION TREE PRUNING

In the process of building the decision tree, the accuracy of the tree is determined by the training examples. “However, measured over an independent set of training examples, the accuracy first increases and then decreases.” [2] This is an instance of over-fitting the data. To prevent this over-fitting condition, the tree is evaluated and then cut back to the “essential” nodes so that accuracy does not decrease on unseen, real-world examples.

In our implementation of pruning, we had no stopping criteria to prevent over-fitting during construction. Instead, we let the over-fitting occur and then applied a post-pruning method called reduced-error pruning, as described by Quinlan. [7] In this algorithm, the tree nodes are traversed from bottom to top, checking at each node whether replacing it with the most frequent class of its examples harms the accuracy of the decision tree; if not, the node is pruned. Pruning continues until further pruning would decrease accuracy, and the algorithm ends with the smallest accurate version of the tree.
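A compact sketch of reduced-error pruning in this spirit is below; the PruneNode structure, the int[]-encoded examples and the validation-set bookkeeping are assumptions for illustration, not the survey's implementation.

import java.util.*;

// hypothetical minimal tree structure used only for this sketch
class PruneNode {
	PruneNode[] children;   // null for a leaf
	int majorityClass;      // most frequent class among the training examples at this node
	int attributeIndex;     // attribute tested at this node (internal nodes only)
}

public class ReducedErrorPruning {
	// classify one example (attribute values index directly into the children array)
	static int classify(PruneNode node, int[] example) {
		while(node.children != null) {
			node = node.children[example[node.attributeIndex]];
		}
		return node.majorityClass;
	}

	static double accuracy(PruneNode root, List<int[]> examples, List<Integer> labels) {
		int correct = 0;
		for(int i = 0; i < examples.size(); i++) {
			if(classify(root, examples.get(i)) == labels.get(i)) correct++;
		}
		return examples.isEmpty() ? 0 : correct / (double) examples.size();
	}

	// bottom-up: tentatively turn each internal node into a leaf (its majority class)
	// and keep the change only if validation accuracy does not decrease
	static void prune(PruneNode root, PruneNode node, List<int[]> validation, List<Integer> labels) {
		if(node.children == null) return;
		for(PruneNode child : node.children) {
			prune(root, child, validation, labels);
		}
		double before = accuracy(root, validation, labels);
		PruneNode[] saved = node.children;
		node.children = null;                       // tentatively make this node a leaf
		if(accuracy(root, validation, labels) < before) {
			node.children = saved;                  // pruning hurt accuracy, so undo it
		}
	}
}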

Further discussion of pruning may be found in [2], [3], [5] and [6].

3.  POTENTIAL PROBLEMS

There were some problems that we encountered during implementation. The first is that ID3 does not natively support missing attributes, numeric attributes or pruning. The algorithm had to be adapted to support these features, and in adapting it, less than efficient methods were chosen. In our implementation, we split at the mean of the attribute values, which causes the surrogate values and the split points to be the same value. In addition, this method does not handle outliers in the data well and forces everything towards the center of the sample set. Implementing a better search algorithm or multivariate splitting might yield improved accuracy. Another alternative would be to use the C4.5 algorithm [8], an evolution of ID3 that adds full support for these requirements.

Another problem that we experienced was having enough training data without missing attributes to build an effective decision tree. With such a large rate of missing attributes, it is hard to get a good handle on any trends in the dataset. A possible solution may be to use a weighted method to assign the most probable value at the point where the missing value is encountered.

4.  EVALUATION

For the evaluation of the results, we have chosen not to assign a value judgment to the outcomes: a result of ALIVE-AT-ONE equal to 1 is treated as positive, and a result of ALIVE-AT-ONE equal to 0 is treated as positive as well rather than negative. One may choose to assign intrinsic values to the instances, but the results can easily be evaluated without them. We have chosen to focus only on the results rather than make a determination of the “goodness” of a particular outcome. Since we have chosen this method of evaluation, we would not have any false positive or true negative results, and therefore we will not be reporting any precision calculations.
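For reference, recall and F-measure are conventionally defined as

Recall = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

where TP and FN are the true positive and false negative counts and Precision = TP / (TP + FP).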

4.1  MACRO EVALUATION

In 10,000 independent runs using random subsets of the data for each test, the overall recall for the decision tree was 66.82%. The average F-measure for these runs was 0.38. See figure 4.1. In the majority of the runs, the most influential attributes were SURVIVAL, EPSS and LVDD, with each of these attributes appearing in 100% of the decision trees created.

4.2  MICRO EVALUATION

When broken down into smaller sets of 1,000 runs, the recall and F-measure vary greatly. Across the 10 sets of runs, recall ranged from 27.27% to 93.18%, with a median of 84.09%.

Run Recall F-measure
1 90.90% 0.47
2 84.09% 0.46
3 93.18% 0.48
4 88.63% 0.47
5 31.81% 0.24
6 84.09% 0.46
7 45.45% 0.31
8 27.27% 0.21
9 38.63% 0.28
10 88.63% 0.47
Overall 66.82% 0.38
Figure 4.1

There is a large gap between our maximum and minimum recall. This can be attributed to several issues, including poor data in the dataset and less than efficient splitting choices. The accuracy of the decision tree depends partly on the data with which you decide to train. The data that we used was missing many attribute values, and almost half (43.2%) of those missing values were for the target attribute. In the absence of a value, the most common value was substituted, which, in the case of this dataset, heavily skews the results towards predicting death. It should be noted that overly pessimistic results and overly optimistic results each present their own dangers to the patient.

Another improvement to the results may come in the form of changing the policy on attribute splitting and missing value handling.  If a more accurate method of splitting were implemented (multivariate criterion, using a better search method, etc.), we would expect to see a more consistent result set.

Based on these results, we would feel fairly confident that we could make a useful prediction, but only if we also had a confidence rating on the results, given their high variability.

While these results are not highly accurate, they could provide good insight into which attributes are the most important for classification. In other words, the decision tree could identify the attributes that should be used by other machine learning methods, such as clustering, artificial neural networks or support vector machines. We can be confident that these attributes are the most important because they were chosen through the entropy and gain calculations while constructing the decision tree.

5.  CONCLUSION

In this research project, different methods of constructing decision and regression trees were explored. Additionally, different methods of node splitting, missing attribute substitution and tree pruning were investigated. While the results of this project show only about 66% recall, decision trees are still a valid machine learning technique. With augmented decision logic and a better dataset, decision trees may be able to predict discrete or continuous data at a much better rate.

References

  1. Salzberg, Stephen. University of California, Irvine Machine Learning Data Repository, 1989. [Online]. http://archive.ics.uci.edu/ml/datasets/Echocardiogram
  2. Mitchell, Tom M. Machine Learning. WCB McGraw-Hill, Boston, MA, 1997.
  3. Alpaydin, Ethem. Introduction to Machine Learning, Second Edition. The MIT Press, Cambridge, MA, 2010.
  4. Quinlan, J. R. Induction of Decision Trees. Machine Learning, Vol. 1, No. 1, pp. 81-106, March 1986.
  5. Steinberg, Dan. CART: Classification and Regression Trees. Taylor and Francis Group, pp. 179-201, 2009.
  6. Rokach, Lior and Maimon, Oded. Top-Down Induction of Decision Tree Classifiers – A Survey. IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews, Vol. 35, No. 4, pp. 476-487, November 2005.
  7. Quinlan, J. R. Simplifying Decision Trees. International Journal of Man-Machine Studies, Vol. 27, pp. 221-234, 1987.
  8. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.
  9. Krumholz, H. et al. Patterns of Hospital Performance in Acute Myocardial Infarction and Heart Failure – 30-Day Mortality and Readmission. Circulation: Cardiovascular Quality and Outcomes, 2009. [Online]. http://circoutcomes.ahajournals.org/cgi/content/abstract/2/5/407
All code owned and written by David Stites and published on this blog is licensed under MIT/BSD.