Tag Archive: supervised learning

Abstract—Recommendation systems take artifacts about items and provide suggestions to the user on what other products they might like. There are many different types of recommender algorithms, including nearest-neighbor, linear classifiers and SVMs. However, most recommender systems are collaborative systems that rely on users to rate the products that they bought. This paper presents an analysis of recommender systems using a mobile device and backend data points for a coupon delivery system.

Index Terms—machine learning, recommender systems, supervised learning, nearest neighbor, classification


Recommendations are a part of everyday life. An individual constantly receives recommendations from friends, family, salespeople and Internet resources, such as online reviews. We want to make the most informed choices possible about decisions in our daily life. For example, when buying a flat screen TV, we want to have the best resolution, size and refresh rate for the money. There are many factors that influence our decisions – budget, time, product features and most importantly, previous experience. We can analyze all the factors that led up to the decision and then make a conclusion or decision based on those results. A recommender system uses historical data to recommend new items that may be of interest to a particular user.

Coupious is a mobile phone application that gives its users coupons and deals for businesses around their geographic location. The application runs on the user’s cell phone and automatically retrieves coupons based upon GPS coordinates. Redemption of these coupons is as simple as tapping a “Use Now” button. Coupious’ service is now available on the iPhone, iPod Touch, and Android platforms. Coupious is currently available in Minneapolis, MN, West Lafayette, IN at Purdue University and Berkeley, CA. Clip Mobile is the Canada-based version of Coupious that is currently available in Toronto.

Using push technology, it is possible to integrate a mobile recommendation system into Coupious. The benefit of this would be threefold: 1) offer the customers the best possible coupons based on their personal spending habits – if a user feels they received a “good deal” with Coupious, they would be more likely to use it again and integrate it into their bargain shopping strategy, 2) offer businesses the ability to capitalize on their market demographics – the ability to reach individual customers to provide goods or services would drive more revenue and add value to the product and 3) adding recommendations to the service would immediately make the system more useful to a user, as it would present desirable, geographically proximate offers without extraneous offers.


In this research project, we evaluated a batch k-nearest neighbors algorithm implemented in Java. We chose Java for the implementation of the algorithm because of the ease of use of the language. Java also presented superior capabilities in working with and parsing data from files. Using Java allowed the author to more efficiently model the problem through the use of OO concepts, such as polymorphism and inheritance. The kNN algorithm was originally suggested by Donald Knuth in [9], where it was referred to as the post-office problem: assigning residences to the nearest post office.

The goal of this research was to find a solution that we felt would be successful in achieving the highest rate of coupon redemption. Presumably, achieving the highest rate of redemption required learning what the user likes with the smallest error percentage. Additionally, we wanted to know if increasing the number of attributes used for computation would affect the quality of the result set.

1.2 DATA

The data that we used is from the Coupious production database. Currently, there are approximately 70,000 rows of data (“nearby” queries, impressions and details impressions) and approximately 3,400 of those represent actual coupon redemptions. The data is an aggregate from March 25th, 2009 until February 11, 2010. The results are from a mixture of different cities where Coupious is currently in production.

From a logical standpoint, Coupious is simply a conduit through which a user may earn his discount and has no vested interest in whether or not a user redeems a coupon in a particular session. However, from a business standpoint, Coupious markets the product based on being able to entice sales through coupon redemption. Therefore, for classification purposes, sessions that ended in one or more redemptions will be labeled +1 and sessions that ended without redemption will be labeled -1.


While there has not been much previous work on mobile recommendation systems specifically, there has been a large amount of work on recommender systems and classification in general. In [1], [2] and [3], direct marketing is studied using collaborative filtering. In [3], the authors use SVMs and latent class models to predict whether or not a customer would be likely to buy a particular product. The most direct comparison of work would be in [1] and [8], where SVMs and linear classifiers are used to build content-driven recommender systems.


Broadly, recommender systems can be grouped into two categories, content-based and collaborative. In content-based systems, the recommendations are based solely on the attributes of the actual product. For example, in Coupious, attributes of a particular coupon redemption include the distance from the merchant when the coupon was used, the date and time of the redemption, the category of the coupon, the expiry and the offer text. These attributes are physical characteristics of the coupon. Recommendations can be made to users without relying on any experience-based information.

In collaborative systems, recommendations are based not only on product attributes but also on the overlap of preferences of “like-minded” people, first introduced by Goldberg et al. [5]. For example, a user is asked to rate how well they liked the product or give an opinion of a movie. This provides the algorithm a baseline of preference for a particular user, which allows the algorithm to associate product attributes with a positive or negative response. Since Coupious does not ask for user ratings, this paper will focus exclusively on content-based applications.

Many content-based systems have similar or common attributes. As stated in [3], the “central problem of content-based recommendation is identifying a sufficiently large set of key attributes.” If the attribute set is too small, there may not be enough information for the program to build a profile of the user. Conversely, if there are too many attributes, the program won’t be able to identify the truly influential attributes, which leads to poor performance [6]. Also, while the label for a particular feature vector with Coupious data is +1 or -1, many of the features in the data are multi-valued attributes (such as distance, date-time stamps, etc.), which may be hard to represent in a binary manner, if the algorithm requires it.

In feature selection, we are “interested in finding k of the d dimensions that give us the most information and accuracy and we discard the other (d – k) dimensions [4].” How can we find the attributes that will give us the most information and accuracy? For Coupious, the attribute set is quite limited. For this research, we explicitly chose the features that would contribute to our recommendations. In all cases, all attributes were under consideration while the algorithm was running and we never partitioned the attributes across different runs to create different recommendations.


As shown in [7], “Once the clustering is complete, performance can be very good, since the size of the group that must be analyzed is much smaller.” The nearest neighbor algorithm is such a classification algorithm. K-nearest neighbors is one of the simplest machine learning algorithms, where an instance is classified by a majority vote of the closest training examples in the hypothesis space. This algorithm is an example of instance-based learning, where previous data points are stored and interpolation is performed to find a recommendation. An important distinction to make is that kNN does not try to create a model when it is given training data, but rather performs the computation when tested. The algorithm works by attempting to find previously seen data points that are “closest” to the query data point and then uses their previous outputs for prediction.

The algorithm is shown below:

KNN (Examples, Target_Instance)

  • Determine the parameter K = number of nearest neighbors.
  • Calculate the distance between Target_Instance and all the Examples based on the Euclidean distance.
  • Sort the distances and determine the nearest neighbors based on the Kth minimum distance.
  • Gather the Category Y of the nearest neighbors.
  • Use simple majority of the category of nearest neighbors as the prediction value of the Target_Instance.
  • Return classification of Target_Instance.
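The steps above can be sketched in Java as follows. This is a minimal illustration, not the Coupious production code: the int-coded feature vectors, the ±1 labels (+1 redemption, -1 hit) and the class name are assumptions made for the example.

```java
import java.util.*;

// Minimal kNN sketch: examples are int-coded attribute vectors with ±1 labels.
public class KnnSketch {
    public static int classify(List<int[]> examples, List<Integer> labels,
                               int[] query, int k) {
        // Sort example indices by Euclidean distance to the query.
        Integer[] idx = new Integer[examples.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(examples.get(i), query)));
        // Simple majority vote over the k nearest neighbors.
        int vote = 0;
        for (int i = 0; i < k && i < idx.length; i++) vote += labels.get(idx[i]);
        return vote >= 0 ? 1 : -1;
    }

    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

The vote is taken over labels coded as +1 and -1, so a non-negative sum means the majority of the k neighbors were redemptions.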

This algorithm lends itself well to the Coupious application. When a user uses the Coupious application, they want it to launch quickly and present a list of coupons within 10-15 seconds. Since this algorithm is so simple, we are able to calculate coupons that the user might enjoy fairly quickly and deliver them to the handset. Also, because Coupious doesn’t know any personal details about the user, the ability to cluster users into groups without the need for any heavy additional implementation on the front or back ends of the application is advantageous.

There are several possible problems with this approach. While this is a simple algorithm, it doesn’t take high feature dimensionality into account. If there are many features, the algorithm will have to perform a lot of computations to create the clusters. Additionally, each attribute is given the same degree of influence on the recommendation. In Coupious, the date and time the coupon was redeemed has the same amount of bearing on the recommendation as does the current distance from the offer and previous redemption history. The features may not scale in proportion to their importance, and the performance of the recommendation may be degraded by irrelevant features. Lastly, [6] describes the curse of dimensionality: if a product has 20 attributes, only 2 of which are actually useful, classification may be completely different when considering all 20 attributes but identical when considering only the 2 relevant ones.


While implementing the k-nearest neighbors algorithm, we decided to evaluate instances based upon seven different attributes – coupon category, distance, expiration, offer, redemption date, handset and handset operating system – and upon the session result, a “hit” or a “redemption.” The offer and expiration attributes required extra processing before they were able to be used for clustering. For both of these attributes, a “bag of words” implementation (applying a Naive Bayes classifier to the text to determine its classification) was used to determine what type of offer the coupon had made.

For the OFFER attribute, we split the value space into 4 discrete values: PAYFORFREE (an instance where the customer had to pay some initial amount to receive a free item), PERCENTAGE (an instance where the customer received some percentage discount), DOLLAR (an instance where a customer received a dollar amount discount) and UNKNOWN (an instance where the classification was unknown). While a PERCENTAGE and a DOLLAR offer may compute to exactly the same discount, consumers tend to react differently when seeing percentage discounts versus dollar discounts (i.e. 50% off instead of a $5 discount, even though they are identical when the price is $10).
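As a rough illustration, a keyword-based discretizer for the OFFER attribute might look like the following. The keyword rules here are guesses for illustration only; they are not the actual “bag of words” classifier used in Coupious.

```java
// Illustrative discretizer for the OFFER attribute described above.
// The keyword matching is a simplified stand-in for the real classifier.
public class OfferClassifier {
    enum Offer { PAYFORFREE, PERCENTAGE, DOLLAR, UNKNOWN }

    public static Offer classify(String offerText) {
        String t = offerText.toLowerCase();
        if (t.contains("free")) return Offer.PAYFORFREE;              // e.g. "buy one, get one free"
        if (t.contains("%") || t.contains("percent")) return Offer.PERCENTAGE;
        if (t.contains("$")) return Offer.DOLLAR;
        return Offer.UNKNOWN;                                         // no rule matched
    }
}
```

A real implementation would weight word evidence rather than fire on the first matching keyword, but the output value space is the same four-way split.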

For the expiration attribute, some text parsing was done to determine what type of expiration the coupon had (DATE, USES, NONE or UNKNOWN). If the parsing detected a date format, we used DATE, and if the parsing detected that it was a limited usage coupon (either limited by total uses across a population, herein known as “global,” or limited by uses per customer, herein known as “local”), we designated that the coupon was USES. If the coupon was valid indefinitely, we designated that the coupon expiration was NONE. We did not further distinguish coupons by how far in the future the date was, by the type of limited usage (global or local) or by the number of uses left.

An important detail to note is that in our implementation of kNN, if an attribute was unknown, we declined to use it when computing the nearest neighbors. The UNKNOWN value was typically the last value in a Java enumeration and would therefore be assigned a high integer value, which would skew the results unnecessarily if used for computation.

For the remaining attributes (OFFERTIME, CATEGORY, DISTANCE, HANDSET, which considers both handset model and OS, and ACTION), the value space of each attribute was divided over the possible discrete values for that attribute according to Table 2.2.

Attribute Possible Values
OFFERTIME Morning, Afternoon, Evening, Night, Unknown
CATEGORY Entertainment, Automotive, Food & Dining, Health & Beauty, Retail, Sports & Recreation, Travel, Clothing & Apparel, Electronics & Appliances, Furniture & Decor, Grocery, Hobbies & Crafts, Home Services, Hotels & Lodging, Nightlife & Bars, Nonprofits & Youth, Office Supplies, Other, Pet Services, Professional Services, Real Estate, Unknown
DISTANCE Less than 2 miles, 2 to 5 miles, 5 to 10 miles, 10 to 20 miles, 20 to 50 miles, 50 to 100 miles, Unknown
HANDSET Device: iPhone, iPod, G1, Hero, myTouch, Droid, Unknown
OS: iPhone, Android, Unknown
ACTION Redeem, Hit, Unknown

Table 2.2

In the case of the handset OS, iPhone and iPod were classified together as the iPhone OS as they are the same OS with different build targets. It is important to note that kNN works equally well with continuous valued attributes as well as discrete-valued attributes. For further discussion on using kNN with real-valued attributes, see [10].


There were several problems that were encountered while implementing kNN. The first problem was “majority voting.” Majority voting is the last step in the algorithm to classify an instance and is an inherent problem with the way the kNN algorithm works. If a particular class dominates the training data, it will skew the votes towards that class, since we are only considering data at a local level, that is, the distance from our classification instance to the nearest data points. There are two ways to solve this problem: 1) balancing the dataset, or 2) weighting the neighboring instances. Balancing is the simplest technique, where an equal proportion of each class is present in the training data. A more complicated, but more effective, method is to weight the neighboring instances such that neighbors that are farther away have a smaller weight value than closer neighbors.
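The distance-weighting fix can be sketched as follows: each neighbor votes with weight 1/(d² + ε) instead of a flat vote, so close neighbors dominate even when one class outnumbers the other. The class name and the exact weighting function are illustrative assumptions, not the weighting scheme used in the paper.

```java
// Distance-weighted voting sketch: given the distances and ±1 labels of the
// k nearest neighbors, closer neighbors contribute larger votes.
public class WeightedVote {
    public static int classify(double[] distances, int[] labels) {
        double score = 0;
        for (int i = 0; i < distances.length; i++) {
            // Inverse-square weight; the epsilon guards against division by zero
            // when a neighbor coincides exactly with the query point.
            double w = 1.0 / (distances[i] * distances[i] + 1e-9);
            score += w * labels[i];
        }
        return score >= 0 ? 1 : -1;
    }
}
```

With this scheme, a single very close neighbor can outvote several distant ones, which directly counteracts the skew a dominant class causes under flat majority voting.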

Determining “k” beforehand is a troublesome problem because, if “k” is too large, the separation between classifications becomes blurred. This could cause the program to group two clusters together that are, in fact, distinct clusters themselves. However, a large “k” value does reduce the effect of noise by including more samples in the hypothesis. If “k” is too small, we might not calculate a good representation of the sample space because our results would be too local. In either case, if the target attribute is binary, “k” should be an odd number to avoid ties in the majority vote.

Also, kNN has a high computation cost because it has to compute the distance from the query point to every stored point. The time complexity of a kNN query is O(n) in the number of stored instances, which wouldn’t scale well to hypothesis spaces with millions of instances.

Discretizing non-related attributes, like category, presented a unique challenge for the author. When considering continuous attributes, such as distance, it was easy to discretize the data. In the case of Coupious, the distance ranges were already defined in the application, so we just had to translate those over to our classification algorithm. However, some attributes, such as category, while related at an attribute level (in that they were all categories of coupon), had no natural numeric values. In this case, we simply assigned them increasing integer values.

Another problem that we encountered is the sparsity problem. Since we implemented a content-driven model, the degree of accuracy relied upon how much data we had about a particular end user to build their profile. If the customer only had one or two sessions ending in no redemptions, it might not be possible to achieve any accuracy about this person. We dealt with this problem by artificially creating new records to supplement previous real records.

Coupious relies on the GPS module inside the smart phone to tell us where a user is currently located. From that position, the user gets a list of coupons that are close to the user’s location. However, there are no safeguards in place to guarantee that a redemption is real. A curious user may attempt to redeem a coupon when he is not at the actual merchant location. In the data, we can account for this by calculating the distance from the merchant at the time of the redemption request. However, this GPS reading may be inaccurate, as the GPS module can adjust its accuracy to save power and battery life.

Lastly, accuracy is somewhat limited because of the fact that we are using a content-only model. There is no way to interact with the user to ask if the recommendations are truly useful. To achieve this additional metric would require major changes to the application and the backend systems that are outside the scope of this research paper.

Despite the multiple problems with kNN, it is quite robust to noisy data as indicated in section 4, which makes it well-suited for this classification task, as the author can only verify the reasonableness of the data, not the integrity.


Testing of the kNN logic was carried out using a 3-fold cross validation method, where the data was divided into training and test sets such that |R| = k × |S|, where |R| is the size of the training set, k is the relative size and |S| is the size of the test set.

For each test run, we chose 4,000 training examples and 1 test example at random. We attempted to keep the labeled classes in the training set as balanced as possible by setting a threshold n. This threshold prevented the classes from becoming unbalanced by more than a difference of n. If the threshold n was ever reached, we discarded random selections until we were under the threshold again, thereby balancing the classes. Discussion of the results is in section 5.
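The thresholded sampling described above might be sketched like this. The class and method names are assumptions for illustration, with +1 marking redemption sessions and -1 marking hit-only sessions as in the labeling scheme above.

```java
import java.util.*;

// Balanced random sampling sketch: draw session indices at random, but skip
// any draw that would push the class counts more than n apart.
public class BalancedSampler {
    public static List<Integer> sample(List<Integer> labels, int size, int n, Random rng) {
        List<Integer> picked = new ArrayList<>();
        int pos = 0, neg = 0;
        while (picked.size() < size) {
            int i = rng.nextInt(labels.size());
            int label = labels.get(i);
            // Discard the draw if accepting it would exceed the imbalance threshold.
            if (label == 1 && (pos + 1) - neg > n) continue;
            if (label == -1 && (neg + 1) - pos > n) continue;
            if (label == 1) pos++; else neg++;
            picked.add(i);
        }
        return picked;
    }
}
```

With threshold n = 1 the class counts can never differ by more than one, so an even-sized training set comes out exactly balanced.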


Even though kNN is a simple algorithm, the classification results were quite accurate. To test the algorithm, we performed 10,000 independent runs where an equal number of “hit” and “redemption” rows were selected at random (2,000 of each, so as to keep the inductive bias as fair as possible). An individual classification instance was chosen at random from that set of 4,000 instances and was then classified according to its nearest neighbors.


When evaluating the results, one main factor affected the recall: the number k of neighbors considered in the calculation (see Table 5.1). In the 10,000 independent runs using random subsets of data for each test, the overall recall for the kNN algorithm was 85.32%. The average F-measure for these runs was 0.45.

However, if one considers the subset of results with k-size less than or equal to 15, the average recall was much higher – 94%. Beyond a k-size of 30, we see a significant drop-off in recall. This can be attributed to the groups becoming less defined: as k-size grows, the nodes being used are “further” away from the classification instance, and therefore the results are not as “good.”


When broken down into smaller sets of 1,000 runs, the recall and F-measure vary greatly. Across the 16 runs, recall ranged from 29.40% to 98.20%. The median recall was 92.70%.

Run K Size Recall F-measure Avg. Distance
1 1 98.20% 0.49 0.18
2 2 97.80% 0.49 0.23
3 3 95.80% 0.48 0.49
4 4 95.70% 0.48 0.44
5 5 95.60% 0.48 0.46
6 6 95.20% 0.48 0.59
7 7 93.10% 0.48 0.48
8 8 92.70% 0.47 0.69
9 9 94.20% 0.48 0.82
10 10 93.40% 0.48 0.88
11 15 91.70% 0.48 1.21
12 30 87.50% 0.47 2.16
13 50 85.10% 0.46 3.63
14 100 70.80% 0.42 7.57
15 200 48.90% 0.32 16.57
16 500 29.40% 0.22 31.64
Overall 85.32% 0.45


Table 5.1

There is a large gap between our maximum and our minimum recall. This can be attributed mainly to our k-size. Based on these results, we would feel fairly confident that we could make a useful prediction, but only if we had a confidence rating on the results (due to the high variability of the results).

These results could provide good insight into what k-size or neighbors are the most influential in suggesting a coupon. This would allow us to more carefully target Coupious advertising based on the user.

When could one consider these results valid? If the average distance of a classification instance to the examples is below a threshold such that the classification truly reflects its neighbors, we could consider the results valid. Another form of validating the results could be pruning the attribute set. If we were able to prune away attributes that didn’t affect the recall in a negative manner, we would be left with a set of attributes that truly influence the customers’ purchasing behavior, although this task is better suited for a decision or regression tree. An improvement to this kNN algorithm might come in the form of altering the k-size based upon the population or attributes that we are considering.


In this research project, we have implemented the kNN algorithm for recommender systems. The algorithm for the nearest neighbor was explored and several problems were identified and overcome. Different techniques were investigated to improve the accuracy of the system. The results of this project show an overall recall of 85.32%, which makes kNN an excellent, simple technique for implementing recommender systems.


[1] Zhang, T., Iyengar, V. S. Recommender Systems Using Linear Classifiers. Journal of Machine Learning Research 2. (2002). 313-334.
[2] Basu, C., Hirsh, H., Cohen, W. Recommendation as Classification: Using Social and Content-Based Information in Recommendation.
[3] Cheung, K. W., Kowk, J. T., Law, M. H., Tsui, K. C. Mining Customer Product Ratings for Personalized Marketing
[4] E. Alpaydin, Introduction to Machine Learning, 2nd ed. MIT Press, Cambridge, Mass., 2010.
[5] D. Goldberg, D. Nichols, B.M. Oki, D. Terry, Collaborative filtering to weave an information tapestry, Communications of the ACM 35. (12) (December 1992) 61 – 70.
[6] T.M. Mitchell, Machine Learning, New York. McGraw-Hill, 1997.
[7] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering,” Proc. Fifth Int’l Conf. Computer and Information Technology, 2002.
[8] D. Barbella, S. Benzaid, J. Christensen, B. Jackson, X. V. Qin, D. Musicant. Understanding Support Vector Machine Classifications via a Recommender System-Like Approach. [Online]. http://www.cs.carleton.edu/faculty/dmusican/svmzen.pdf
[9] D.E. Knuth. “The Art of Computer Programming.” Addison-Wesley. 1973.
[10] C. Elkan. “Nearest Neighbor Classification.” University of California – San Diego. 2007. [Online]. http://cseweb.ucsd.edu/~elkan/151/nearestn.pdf

In a previous post, I explored how one might apply decision trees to solve a complex problem. This post will explore the code necessary to implement that decision tree. If you would like a full copy of the source code, it is available here in zip format.

Entropy.java – In Entropy.java, we are concerned with calculating the amount of entropy, or the amount of uncertainty or randomness, for a particular variable. For example, consider a classifier with two classes, YES and NO. If a particular variable or attribute, say x, has three training examples of class YES and three training examples of class NO (for a total of six), the entropy would be 1. This is because there is an equal number of both classes for this variable, which is as mixed up as it can get. Likewise, if x had all six training examples of a particular class, say YES, then entropy would be 0 because this particular variable would be pure, thus making it a leaf node in our decision tree.

Entropy may be calculated as Entropy(S) = -Σ_i p_i log2(p_i), where p_i is the proportion of examples in class i. It is implemented in the following way:

import java.util.ArrayList;

public class Entropy {
	public static double calculateEntropy(ArrayList<Record> data) {
		double entropy = 0;
		if(data.size() == 0) {
			// nothing to do
			return 0;
		}
		for(int i = 0; i < Hw1.setSize("PlayTennis"); i++) {
			// count the records whose class value matches class i
			int count = 0;
			for(int j = 0; j < data.size(); j++) {
				Record record = data.get(j);
				if(record.getAttributes().get(4).getValue() == i) {
					count++;
				}
			}
			double probability = count / (double)data.size();
			if(count > 0) {
				entropy += -probability * (Math.log(probability) / Math.log(2));
			}
		}
		return entropy;
	}

	public static double calculateGain(double rootEntropy, ArrayList<Double> subEntropies, ArrayList<Integer> setSizes, int data) {
		double gain = rootEntropy;
		for(int i = 0; i < subEntropies.size(); i++) {
			gain += -((setSizes.get(i) / (double)data) * subEntropies.get(i));
		}
		return gain;
	}
}

Tree.java – This tree class contains all our code for building our decision tree. Note that at each level, we choose the attribute that presents the best gain for that node. The gain is simply the expected reduction in the entropy of X achieved by learning the state of the random variable A. Gain is also known as Kullback-Leibler divergence. Gain can be calculated in the following way:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) * Entropy(Sv)

Notice that gain is calculated as a function of all the values of the attribute.

import java.io.*;
import java.util.*;

public class Tree {
	public Node buildTree(ArrayList<Record> records, Node root, LearningSet learningSet) {
		int bestAttribute = -1;
		double bestGain = 0;
		// compute this node's entropy before testing it for purity
		root.setEntropy(Entropy.calculateEntropy(root.getData()));
		if(root.getEntropy() == 0) {
			return root;
		}
		for(int i = 0; i < Hw1.NUM_ATTRS - 2; i++) {
			if(!Hw1.isAttributeUsed(i)) {
				double entropy = 0;
				ArrayList<Double> entropies = new ArrayList<Double>();
				ArrayList<Integer> setSizes = new ArrayList<Integer>();
				// partition the node's data on each value of attribute i
				for(int j = 0; j < Hw1.setSize(Hw1.attrMap.get(i)); j++) {
					ArrayList<Record> subset = subset(root, i, j);
					if(subset.size() != 0) {
						entropy = Entropy.calculateEntropy(subset);
						entropies.add(entropy);
						setSizes.add(subset.size());
					}
				}
				double gain = Entropy.calculateGain(root.getEntropy(), entropies, setSizes, root.getData().size());
				if(gain > bestGain) {
					bestAttribute = i;
					bestGain = gain;
				}
			}
		}
		if(bestAttribute != -1) {
			int setSize = Hw1.setSize(Hw1.attrMap.get(bestAttribute));
			// mark the attribute so descendants do not split on it again
			Hw1.usedAttributes.add(bestAttribute);
			root.setTestAttribute(new DiscreteAttribute(Hw1.attrMap.get(bestAttribute), 0));
			root.children = new Node[setSize];
			for (int j = 0; j < setSize; j++) {
				root.children[j] = new Node();
				root.children[j].setParent(root);
				root.children[j].setData(subset(root, bestAttribute, j));
				root.children[j].getTestAttribute().setName(Hw1.getLeafNames(bestAttribute, j));
			}

			for (int j = 0; j < setSize; j++) {
				buildTree(root.children[j].getData(), root.children[j], learningSet);
			}
		}
		else {
			return root;
		}
		return root;
	}

	public ArrayList<Record> subset(Node root, int attr, int value) {
		ArrayList<Record> subset = new ArrayList<Record>();
		for(int i = 0; i < root.getData().size(); i++) {
			Record record = root.getData().get(i);
			if(record.getAttributes().get(attr).getValue() == value) {
				subset.add(record);
			}
		}
		return subset;
	}

	public double calculateSurrogates(ArrayList<Record> records) {
		return 0;
	}
}

DiscreteAttribute.java – This class holds the discrete values for each attribute.

public class DiscreteAttribute extends Attribute {
	public static final int Sunny = 0;
	public static final int Overcast = 1;
	public static final int Rain = 2;
	public static final int Hot = 0;
	public static final int Mild = 1;
	public static final int Cool = 2;
	public static final int High = 0;
	public static final int Normal = 1;
	public static final int Weak = 0;
	public static final int Strong = 1;
	public static final int PlayNo = 0;
	public static final int PlayYes = 1;

	enum PlayTennis { No, Yes }
	enum Wind { Weak, Strong }
	enum Humidity { High, Normal }
	enum Temp { Hot, Mild, Cool }
	enum Outlook { Sunny, Overcast, Rain }

	public DiscreteAttribute(String name, double value) {
		super(name, value);
	}

	public DiscreteAttribute(String name, String value) {
		super(name, value);
	}
}

Hw1.java – This class is our main driver class

import java.util.*;

public class Hw1 {
	public static int NUM_ATTRS = 6;
	public static ArrayList<String> attrMap;
	public static ArrayList<Integer> usedAttributes = new ArrayList<Integer>();

	public static void main(String[] args) {
		populateAttrMap();
		Tree t = new Tree();
		ArrayList<Record> records;
		LearningSet learningSet = new LearningSet();
		// read in all our data
		records = FileReader.buildRecords();
		Node root = new Node();
		// the root node starts out holding every record
		for(Record record : records) {
			root.getData().add(record);
		}
		t.buildTree(records, root, learningSet);
		traverseTree(records.get(12), root);
	}

	public static void traverseTree(Record r, Node root) {
		while(root.children != null) {
			double nodeValue = 0;
			// find the record's value for this node's test attribute
			for(int i = 0; i < r.getAttributes().size(); i++) {
				if(r.getAttributes().get(i).getName().equalsIgnoreCase(root.getTestAttribute().getName())) {
					nodeValue = r.getAttributes().get(i).getValue();
				}
			}
			// descend into the child branch matching that value
			for(int i = 0; i < root.getChildren().length; i++) {
				if(nodeValue == root.children[i].getTestAttribute().getValue()) {
					traverseTree(r, root.children[i]);
					return;
				}
			}
			break;
		}
		System.out.print("Prediction for Play Tennis: ");
		if(root.getTestAttribute().getValue() == DiscreteAttribute.PlayNo) {
			System.out.println("No");
		}
		else if(root.getTestAttribute().getValue() == DiscreteAttribute.PlayYes) {
			System.out.println("Yes");
		}
	}

	public static boolean isAttributeUsed(int attribute) {
		if(usedAttributes.contains(attribute)) {
			return true;
		}
		else {
			return false;
		}
	}

	public static int setSize(String set) {
		if(set.equalsIgnoreCase("Outlook")) {
			return 3;
		}
		else if(set.equalsIgnoreCase("Wind")) {
			return 2;
		}
		else if(set.equalsIgnoreCase("Temperature")) {
			return 3;
		}
		else if(set.equalsIgnoreCase("Humidity")) {
			return 2;
		}
		else if(set.equalsIgnoreCase("PlayTennis")) {
			return 2;
		}
		return 0;
	}

	public static String getLeafNames(int attributeNum, int valueNum) {
		if(attributeNum == 0) {
			if(valueNum == 0) {
				return "Sunny";
			}
			else if(valueNum == 1) {
				return "Overcast";
			}
			else if(valueNum == 2) {
				return "Rain";
			}
		}
		else if(attributeNum == 1) {
			if(valueNum == 0) {
				return "Hot";
			}
			else if(valueNum == 1) {
				return "Mild";
			}
			else if(valueNum == 2) {
				return "Cool";
			}
		}
		else if(attributeNum == 2) {
			if(valueNum == 0) {
				return "High";
			}
			else if(valueNum == 1) {
				return "Normal";
			}
		}
		else if(attributeNum == 3) {
			if(valueNum == 0) {
				return "Weak";
			}
			else if(valueNum == 1) {
				return "Strong";
			}
		}
		return null;
	}

	public static void populateAttrMap() {
		attrMap = new ArrayList<String>();
		// attribute names in file order (after the Day column); inferred from
		// the rest of the code, as the original listing was truncated here
		attrMap.add("Outlook");
		attrMap.add("Temperature");
		attrMap.add("Humidity");
		attrMap.add("Wind");
		attrMap.add("PlayTennis");
	}
}

Node.java – Node.java holds our information in the tree.

import java.util.*;

public class Node {
	private Node parent;
	public Node[] children;
	private ArrayList<Record> data;
	private double entropy;
	private boolean isUsed;
	private DiscreteAttribute testAttribute;

	public Node() {
		this.data = new ArrayList<Record>();
		setTestAttribute(new DiscreteAttribute("", 0));
	}

	public void setParent(Node parent) {
		this.parent = parent;
	}

	public Node getParent() {
		return parent;
	}

	public void setData(ArrayList<Record> data) {
		this.data = data;
	}

	public ArrayList<Record> getData() {
		return data;
	}

	public void setEntropy(double entropy) {
		this.entropy = entropy;
	}

	public double getEntropy() {
		return entropy;
	}

	public void setChildren(Node[] children) {
		this.children = children;
	}

	public Node[] getChildren() {
		return children;
	}

	public void setUsed(boolean isUsed) {
		this.isUsed = isUsed;
	}

	public boolean isUsed() {
		return isUsed;
	}

	public void setTestAttribute(DiscreteAttribute testAttribute) {
		this.testAttribute = testAttribute;
	}

	public DiscreteAttribute getTestAttribute() {
		return testAttribute;
	}
}

FileReader.java – The least interesting class in the code

import java.io.*;
import java.util.ArrayList;
import java.util.StringTokenizer;

public class FileReader {
	public static final String PATH_TO_DATA_FILE = "playtennis.data";

	public static ArrayList<Record> buildRecords() {
		BufferedReader reader = null;
		ArrayList<Record> records = new ArrayList<Record>();

		try {
			File f = new File(PATH_TO_DATA_FILE);
			FileInputStream fis = new FileInputStream(f);
			reader = new BufferedReader(new InputStreamReader(fis));
			String line;
			Record r = null;
			ArrayList<DiscreteAttribute> attributes;
			while ((line = reader.readLine()) != null) {
				StringTokenizer st = new StringTokenizer(line, ",");
				attributes = new ArrayList<DiscreteAttribute>();
				r = new Record();
				if (Hw1.NUM_ATTRS != st.countTokens()) {
					throw new Exception("Unknown number of attributes!");
				}
				String day = st.nextToken(); // read and discarded; the day is not a predictive attribute
				String outlook = st.nextToken();
				String temperature = st.nextToken();
				String humidity = st.nextToken();
				String wind = st.nextToken();
				String playTennis = st.nextToken();
				if (outlook.equalsIgnoreCase("overcast")) {
					attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Overcast));
				} else if (outlook.equalsIgnoreCase("sunny")) {
					attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Sunny));
				} else if (outlook.equalsIgnoreCase("rain")) {
					attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Rain));
				}
				if (temperature.equalsIgnoreCase("hot")) {
					attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Hot));
				} else if (temperature.equalsIgnoreCase("mild")) {
					attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Mild));
				} else if (temperature.equalsIgnoreCase("cool")) {
					attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Cool));
				}
				if (humidity.equalsIgnoreCase("high")) {
					attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.High));
				} else if (humidity.equalsIgnoreCase("normal")) {
					attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.Normal));
				}
				if (wind.equalsIgnoreCase("weak")) {
					attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Weak));
				} else if (wind.equalsIgnoreCase("strong")) {
					attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Strong));
				}
				if (playTennis.equalsIgnoreCase("no")) {
					attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayNo));
				} else if (playTennis.equalsIgnoreCase("yes")) {
					attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayYes));
				}
				// Attach the parsed attributes to the record and collect it.
				// (Record's setter name is assumed; the original lines were lost.)
				r.setAttributes(attributes);
				records.add(r);
			}
		} catch (IOException e) {
			System.out.println("Uh oh, got an IOException error: " + e.getMessage());
		} catch (Exception e) {
			System.out.println("Uh oh, got an Exception error: " + e.getMessage());
		} finally {
			if (reader != null) {
				try {
					reader.close();
				} catch (IOException ioe) {
					System.out.println("IOException error trying to close the file: " + ioe.getMessage());
				}
			}
		}
		return records;
	}
}


playtennis.data is only a simple text file that describes the learned attribute “play tennis” for each learning episode. I modeled my playtennis.data file off of Tom Mitchell’s play tennis example in his book “Machine Learning.” It contains the attributes describing each learning episode: outlook (sunny, overcast, rain), temperature (hot, mild, cool), humidity (high, normal), wind (strong, weak) and the target attribute, play tennis (yes, no). Based on this information, one can trace the decision tree to derive the learned attribute. A sample decision tree is below. All you have to do is create a comma-delimited text file that describes each particular situation (make sure you match the columns in the text file to the parsing that occurs in the Java).
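For reference, a few comma-delimited rows in the format the parser above expects might look like this (values taken from Mitchell’s PlayTennis table; the day column is read but not used as a predictive attribute):

```text
D1,Sunny,Hot,High,Weak,No
D2,Sunny,Hot,High,Strong,No
D3,Overcast,Hot,High,Weak,Yes
D4,Rain,Mild,High,Weak,Yes
D5,Rain,Cool,Normal,Weak,Yes
D6,Rain,Cool,Normal,Strong,No
```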

**EDIT 2:**

After so many requests, I have worked up a small playtennis.data file based on the Tom Mitchell “Machine Learning” book. It follows his example data exactly and can be found here (rename the file to “playtennis.data” before running, as WordPress wouldn’t let me upload a .data file due to “security restrictions”). A small note of caution: the above code was put together very quickly for purposes of my own learning. I am not claiming that it is by any means complete or fault tolerant; however, I do believe that the entropy and gain calculations are sound and correct.

Abstract—Decision trees are one of the most widely used methods of inductive inference.  They can be used to learn discrete or continuous valued hypotheses and create compact rules for evaluating a set of data.  An advantage of decision and regression trees is that they are robust to noisy data, which makes them well suited to predicting whether a heart attack patient will be alive one year after the incident, even when not all the data is available.  This paper is a survey of some of the methods of constructing and evaluating decision trees.


A myocardial infarction, commonly referred to as a heart attack, is a serious medical condition in which the blood vessels that supply blood to the heart are blocked, preventing enough oxygen from getting to the heart. The heart muscle dies from the lack of oxygen, which impairs heart function or kills the patient. Heart attacks are positively correlated with diabetes, smoking, high blood pressure, obesity, age and alcohol consumption. While prognosis varies greatly based on underlying personal health, the extent of damage and the treatment given, for the period of 2005-2008 in the United States, the median mortality rate at 30 days was 16.6%, with a range from 10.9% to 24.9% depending on the admitting hospital. [9] For patients who survive the heart attack itself, the survival rate at the one-year mark is 96%.

Physicians would like to be able to tell their patients their possible rate of survival and predict whether or not a certain treatment could help the patient. In order to make that prediction, we can use decision trees to model whether or not the patient has a good chance of survival. Using regression trees, we can map the input space into a real-valued domain using attributes that cardiologists could examine to determine the patient’s chances of survival after a given timeframe.

In building a decision tree, we use the most influential attribute values to represent the internal nodes of the tree at each level.  Each internal node tests an attribute value, each edge corresponds to an attribute value and each leaf node leads to a classification – in our case, deceased or alive.  We are able to traverse the tree from the root to classify an unseen example.  The tree can also be expressed in the form of simple rules.  This would be helpful when explaining the prognosis to the patient.


In this research survey, we implemented a decision tree using an adapted ID3 algorithm.  We evaluated different splitting criteria as well as different approaches to handling missing attributes in the dataset.  In addition, we considered different approaches to handling continuous valued attributes and methods to reduce decision tree over-fitting.  Lastly, results are discussed in section 4 after running the decision tree multiple times with the echocardiogram dataset.

1.2  DATA

The data that we used is from the University of California at Irvine machine learning repository.  The dataset that we chose was the 1989 echocardiogram dataset.  This dataset contained 132 rows, with 12 attributes, 8 of which were actually usable for decision tree construction (the remaining four were references for the original contributor of this dataset).  This dataset had missing values and all of the attributes were continuous valued.

The data described different measurements of patients who had suffered from acute myocardial infarction at some point in the past.  The attributes included “AGE-AT-HEART-ATTACK” (the patients’ age in years when the heart attack happened), “PERICARDIAL-EFFUSION” (binary choice relative to fluid around the heart), “FRACTIONAL-SHORTENING” (a measure of contractility around the heart where lower numbers are abnormal), “EPSS” (E-point septal separation which is another measure of contractility where larger numbers are abnormal), “LVDD” (left ventricular end-diastolic dimension, where larger numbers are abnormal), “WALL-MOTION-INDEX” (a measure of how many segments of the left ventricle are seen moving divided by the number of segments seen) and “ALIVE-AT-ONE” (a binary choice where 0 represents deceased or unknown and 1 is alive at one year).

It is important to note that not all rows could be used for learning.  Two attributes, “SURVIVAL” and “STILL-ALIVE” must be analyzed as a set.  SURVIVAL described the number of months the patient had survived since the heart attack.  Some of the rows described patients who survived less than a year and are still alive based on STILL-ALIVE (a binary attribute, 0 representing deceased and 1 representing alive).  These patients cannot be used for prediction.

It has previously been noted that “the problem addressed by past researchers was to predict from other variables whether or not the patient will survive at least one year.  The most difficult part of this problem is correctly predicting that the patient will not survive.  This difficulty seems to stem from the size of the dataset.” [1]  In implementing the decision tree logic, we have found that this is the case as well.


We chose to write the implementation of the decision tree in Java because of the ease of use of the language.  Java also presented superior capabilities in working with and parsing data from files, and allowed us to model the problem more efficiently through the use of OO concepts such as polymorphism and inheritance.


The decision tree is constructed using the ID3 algorithm originally described by Quinlan [4] and shown in Mitchell [2], with our own adaptation to handle numeric, continuous valued attributes, missing attributes and pruning.  ID3 is a simple decision tree algorithm.  The algorithm is shown below.

ID3 (Examples, Target_Attribute, Attributes)
  • Create a root node for the tree
  • If all Examples are positive, return the single-node tree root, with label = +.
  • If all Examples are negative, return the single-node tree root, with label = -.
  • If the number of predicting Attributes is empty, then return the single-node tree root, with label = most common
    value of the target attribute in the examples.
  • Otherwise Begin
    • A = The Attribute that best classifies Examples.
    • Decision tree attribute for root = A.
    • For each possible value, vi, of A
      • Add a new tree branch below root, corresponding to the test A = vi.
      • Let Examples(vi) be the subset of examples that have the value vi for A
      • If Examples(vi) is empty
        • Below this new branch, add a leaf node with the label = most common target value in the examples.
      • Else
        • Below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes - {A})
  • End
  • Return root

The ID3 algorithm “uses information gain as splitting criteria and growing stops when all instances belong to a single value of target feature or when best information gain is not greater than zero.” [6]  For further information on ID3, see [2], [4], [6], [7].


A decision tree is formed by repeatedly splitting the data into subsets, which form the child nodes of the parent.  In our implementation of the decision tree, we decided to use a univariate impurity-based splitting criterion called entropy.

Entropy is the measure of impurity or chaos in the data.  If all elements in a set of data belong to the same class, the entropy is zero; if the elements are evenly mixed between the two classes, the entropy is one.  For a set T of positive and negative examples, entropy may be measured with the following equation:

Entropy(T) = -p+ log2(p+) - p- log2(p-)

where p+ is the proportion of positive training examples in T and p- is the proportion of negative training examples in T.  For further discussion on entropy, see Mitchell [2] or Alpaydin [3].

The ID3 algorithm can use this measure of entropy to determine, for each candidate attribute A, the gain: the expected reduction in entropy due to splitting on A, i.e., the difference between the entropy before splitting and the weighted entropy of the subsets after splitting on A:

Gain(T, A) = Entropy(T) - Σ_v (|T_v| / |T|) * Entropy(T_v)

where T_v is the subset of T for which attribute A has value v.

At each level of the tree, the “best” attributes can be found to create the maximum reduction in entropy, called information gain.  These two calculations represent the preference to create the smallest tree possible for several different reasons – mainly that a short hypothesis that accurately describes the data is unlikely to be coincidence.  For further discussion on entropy and gain, see Mitchell [2], Alpaydin [3], Steinberg [5] and Lior et al [6].
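The entropy and gain calculations above can be sketched in Java (a standalone illustration with our own class and method names, not the implementation used in this survey), using the PlayTennis counts from Mitchell [2]:

```java
public class EntropyGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a boolean-labeled set containing p positive and n negative examples.
    static double entropy(int p, int n) {
        if (p == 0 || n == 0) return 0.0;   // a pure set has zero entropy
        double total = p + n;
        double pp = p / total, pn = n / total;
        return -pp * log2(pp) - pn * log2(pn);
    }

    // Information gain of splitting a (p, n) set into subsets, each given as {p_i, n_i}.
    static double gain(int p, int n, int[][] subsets) {
        double g = entropy(p, n);
        double total = p + n;
        for (int[] s : subsets) {
            g -= ((s[0] + s[1]) / total) * entropy(s[0], s[1]);
        }
        return g;
    }

    public static void main(String[] args) {
        // Mitchell's PlayTennis set: 9 positive and 5 negative examples.
        System.out.printf("Entropy([9+,5-]) = %.3f%n", entropy(9, 5));
        // Splitting on Wind partitions it into Weak (6+, 2-) and Strong (3+, 3-).
        System.out.printf("Gain(S, Wind) = %.3f%n", gain(9, 5, new int[][]{{6, 2}, {3, 3}}));
    }
}
```

Running this prints an entropy of about 0.940 for the 9+/5- set and a gain of about 0.048 for splitting on Wind, matching Mitchell’s worked example.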


In the dataset, there are many missing values.  The missing values show up mainly for the attributes ALIVE-AT-ONE (43.2%), EPSS (10.6%) and LVDD (7.5%).  In addition, 66.3% of the rows had at least one missing attribute value.

Clearly, this dataset is not ideal for predicting attributes.  Fortunately, decision trees are robust to noisy data.  The strategy of replacing missing attribute values with the most common value among training examples was suggested in [2].  We decided to implement this idea and initialize missing values with a surrogate value.  The surrogate values were calculated using the average of all of the attribute’s values for continuous valued data, or the most common attribute value for discrete valued data.  The replacement was done only after all the data had been read in from the dataset, instead of using a moving average while the data was still being read.
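The surrogate substitution described above might be sketched as follows (a simplified, standalone illustration; the class and method names are ours, and missing values are marked with NaN/null for convenience):

```java
import java.util.*;

public class Surrogate {
    // Replace missing (NaN) entries in a continuous column with the column mean,
    // computed over the observed values only.
    static double[] fillContinuous(double[] column) {
        double sum = 0; int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = sum / count;
        double[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) filled[i] = mean;
        }
        return filled;
    }

    // Replace missing (null) entries in a discrete column with the most common value.
    static String[] fillDiscrete(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) {
            if (v != null) counts.merge(v, 1, Integer::sum);
        }
        String mode = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        String[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (filled[i] == null) filled[i] = mode;
        }
        return filled;
    }
}
```

For example, an EPSS column of {4.0, NaN, 12.0, 8.0} is filled with the observed mean 8.0, and a humidity column of {"high", null, "high", "normal"} is filled with the mode "high".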

There are two main reasons we chose this method.  First, it is an extremely simple method that doesn’t require a lot of calculation.  Since decision trees cannot back up once they have made a splitting decision, computing the surrogates once up front means we never have to worry about the substituted values changing and the tree needing to be regenerated.  Second, averaging over the entire dataset gives a finer-grained representation of the true average value of each attribute than a running average computed while reading would.


In ID3, handling discrete attributes is quite simple.  For each attribute value, a child node is created, and that branch is taken when the tree is traversed to evaluate a test set.  To handle continuous data, some partitioning of the attribute values must take place to discretize the possible range.  For example, suppose we had the attributes shown in the following table:

PlayTennis Temperature
No 40
No 48
Yes 60
Yes 72
Yes 80
No 90

We may consider creating a discrete variable corresponding to Temperature > 54, where the candidate threshold (48 + 60) / 2 = 54 sits midway between the last “No” example and the first “Yes” example; the new attribute then takes a true or false value for each instance.

For our implementation, we decided to take the average of all of the values and make that our discrete split value, where all instances less than the average went to the left branch and all instances greater than or equal to it went to the right branch.  We chose this approach because it was simpler to implement than other methods, including binary search, statistical search and weighted searches. [2] [6]  This causes our tree to look closer in form to a CART tree than an ID3 tree, as CART can only produce binary trees.  CART uses this approach of a single partitioning value when using univariate splitting criteria. [5]  The main reason we are able to use this approach is that the data is such that it becomes increasingly “abnormal” the larger a value becomes.
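The mean-split rule can be illustrated with the temperature column from the table above (a standalone sketch, not our actual tree code):

```java
import java.util.*;

public class MeanSplit {
    // The split point for a continuous attribute is simply the column mean.
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] temperature = {40, 48, 60, 72, 80, 90};
        double split = mean(temperature);   // (40+48+60+72+80+90)/6 = 65.0
        // Values below the mean go to the left branch, the rest to the right.
        List<Double> left = new ArrayList<>(), right = new ArrayList<>();
        for (double v : temperature) {
            if (v < split) left.add(v); else right.add(v);
        }
        System.out.println("split=" + split + " left=" + left + " right=" + right);
    }
}
```

For this column the split lands at 65, sending {40, 48, 60} left and {72, 80, 90} right; note how a single outlier (say, 200) would drag the split point well away from the bulk of the data, which is exactly the weakness discussed in section 3 below.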


Testing of the decision tree logic was carried out using the PlayTennis example shown in [2].  This example used only discrete values.  After verifying the decision tree could accurately classify these examples, the program was adapted to use continuous valued attributes.

Testing was performed with a 3-fold cross validation method, where the data was divided into training and test sets such that |R| = k × |S|, where |R| is the size of the training set, k is the relative size and |S| is the size of the test set.

For each test run, we chose 88 training examples and 44 test examples at random. We attempted to keep the labeled classes in the training set as balanced as possible by setting a threshold n, which prevented the class counts from differing by more than n. If the threshold n was ever reached, we discarded random selections until we were under the threshold again.  Discussion of the results is in section 4.
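The balanced sampling described above might be sketched like this (our own reconstruction; the exact bookkeeping for the threshold n in our implementation may differ):

```java
import java.util.*;

public class BalancedSplit {
    // Randomly select trainSize training examples, discarding any draw that would
    // push the positive/negative class-count difference past the threshold n.
    static List<Integer> sampleBalanced(int[] labels, int trainSize, int n, Random rng) {
        List<Integer> pool = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) pool.add(i);
        Collections.shuffle(pool, rng);
        List<Integer> train = new ArrayList<>();
        int pos = 0, neg = 0;
        for (int i : pool) {
            if (train.size() == trainSize) break;
            int p = pos + (labels[i] == 1 ? 1 : 0);
            int q = neg + (labels[i] == 1 ? 0 : 1);
            if (Math.abs(p - q) > n) continue;   // discard: would unbalance the classes
            pos = p; neg = q;
            train.add(i);
        }
        return train;   // everything not selected can serve as the test set
    }
}
```

A single greedy pass like this can under-fill the training set when the remaining pool is dominated by one class; re-drawing from the discarded examples until the quota is met is one simple remedy.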


In the process of building the decision tree, the accuracy of the tree is determined by the training examples.  “However, measured over an independent set of training examples, the accuracy first increases and then decreases.” [2]  This is an instance of over-fitting the data.  To prevent this over-fitting condition, the tree is evaluated and then cut back to the “essential” nodes such that the accuracy does not decrease on unseen examples.

In our implementation of pruning, we had no stopping criteria to prevent over-fitting.  We let the over-fitting occur and then used a post-pruning method called Reduced-Error Pruning, as described by Quinlan. [7]  In this algorithm, the tree nodes are traversed from bottom to top, checking whether replacing each node with the most frequent class improves the accuracy of the decision tree.  Pruning continues until further pruning would decrease accuracy, and the algorithm ends with the smallest accurate decision tree.
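Reduced-error pruning can be sketched as follows (a minimal standalone illustration with an invented TreeNode structure, not our actual implementation):

```java
public class Pruner {
    // Minimal tree node for the sketch: a leaf predicts `label`; an internal node
    // tests attribute `attr` and routes example[attr] to the matching child.
    static class TreeNode {
        int attr = -1, label;   // attr == -1 marks a leaf
        TreeNode[] children;
        int majorityLabel;      // most frequent training label at this node
    }

    static int classify(TreeNode node, int[] example) {
        while (node.attr != -1) node = node.children[example[node.attr]];
        return node.label;
    }

    static double accuracy(TreeNode root, int[][] xs, int[] ys) {
        int ok = 0;
        for (int i = 0; i < xs.length; i++) {
            if (classify(root, xs[i]) == ys[i]) ok++;
        }
        return (double) ok / xs.length;
    }

    // Reduced-error pruning: bottom-up, replace a subtree with a leaf predicting
    // its majority label whenever validation accuracy does not decrease.
    static void prune(TreeNode root, TreeNode node, int[][] xs, int[] ys) {
        if (node.attr == -1) return;
        for (TreeNode child : node.children) prune(root, child, xs, ys);
        double before = accuracy(root, xs, ys);
        int savedAttr = node.attr;
        node.attr = -1;                 // tentatively collapse to a leaf
        node.label = node.majorityLabel;
        if (accuracy(root, xs, ys) < before) node.attr = savedAttr;   // undo
    }
}
```

Note that the accuracy check should be run against a held-out validation set, not the training data, or the pruning step will rarely fire.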

Further discussion of pruning may be found in [2], [3], [5] and [6].


There are some problems that we encountered during implementation.  The first is that ID3 does not natively support missing attributes, numeric attributes or pruning; the algorithm had to be adapted to support these features.  In adapting the algorithm, less than efficient methods were chosen.  In our implementation, we split at the mean of the attribute values, which causes the surrogate values and the split points to be the same value.  In addition, this method does not handle outliers in the data well and forces everything towards the center of the sample set.  Implementing a better search algorithm or multivariate splitting might yield improved accuracy.  Another alternative would be to use the C4.5 algorithm [8], an evolution of ID3 that adds full support for these requirements.

Another problem that we experienced is having enough training data without missing attributes to build an effective decision tree.  With the large missing attribute rate, it would be hard to get a good handle on any sort of trends in the dataset.  A possible solution may be to use a weighted method to assign the most probable value at the point where we encounter the missing value.


For the evaluation of the results, we have chosen not to assign value judgments to the outcomes: a result of ALIVE-AT-ONE equal to 1 is treated as the positive class, and a result of ALIVE-AT-ONE equal to 0 is simply the other class rather than a “negative” outcome.  One may choose to assign intrinsic values to the instances, but the results can easily be evaluated without them.  We have chosen to focus only on the results rather than make a determination of the “goodness” of a particular outcome.  Since we have chosen this method of evaluation, we do not distinguish false positives or true negatives, and therefore we will not be reporting any precision calculations.


In 10,000 independent runs using random subsets of data for each test, the overall recall for the decision tree was 66.82%.  The average F-measure for these runs was 0.38.  See figure 4.1.  The most influential attributes were SURVIVAL, EPSS and LVDD; each of these attributes appeared in 100% of the decision trees created.


When broken down into smaller sets of 1,000 runs, the recall and F-measure vary greatly.  Across the 10 sets, recall ranged from 27.27% to 93.18%, with a median of 84.09%.

Run Recall F-measure
1 90.90% 0.47
2 84.09% 0.46
3 93.18% 0.48
4 88.63% 0.47
5 31.81% 0.24
6 84.09% 0.46
7 45.45% 0.31
8 27.27% 0.21
9 38.63% 0.28
10 88.63% 0.47
Overall 66.82% 0.38
Figure 4.1

There is a large gap between our maximum and our minimum recall.  This can be attributed to several issues, including poor data in the dataset and less than efficient splitting choices.  The accuracy of the decision tree depends partly on the data used for training.  The data that we used was missing many attribute values, and the target attribute itself was missing in almost half (43.2%) of the rows.  In the absence of a value, the most common value was substituted, which, in the case of this dataset, heavily skews the results towards predicting death.  It should be noted that overly pessimistic results and overly optimistic results each present their own dangers to the patient.

Another improvement to the results may come in the form of changing the policy on attribute splitting and missing value handling.  If a more accurate method of splitting were implemented (multivariate criterion, using a better search method, etc.), we would expect to see a more consistent result set.

Based on these results, we would feel fairly confident that we could make a useful prediction, but only if we had a confidence rating on the results (due to the high variability of the results).

While these results are not highly accurate, they could provide good insight into which attributes are the most important for classification.  In other words, the decision tree identifies the attributes that should be used by other machine learning methods, such as clustering, artificial neural networks or support vector machines.  We can be confident that these attributes are the most important because they were chosen through entropy and gain calculations while constructing the decision tree.


In this research project, different methods of constructing decision and regression trees were explored, along with different methods of node splitting, missing attribute substitution and tree pruning.  While the results of this project show only about 66% recall, decision trees are still a valid machine learning technique.  With augmented decision logic and a better dataset, decision trees may be able to predict discrete or continuous data at a much better rate.


  1. Salzberg, Stephen.  University of California, Irvine Machine Learning Data Repository.  1989.  [Online]. http://archive.ics.uci.edu/ml/datasets/Echocardiogram
  2. Mitchell, Tom M.  Machine Learning WCB McGraw-Hill, Boston, MA.  1997
  3. Alpaydin, Ethem.  Introduction to Machine Learning, Second Edition.  The MIT Press, Cambridge, MA.  2010.
  4. Quinlan, J. R. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.
  5. Steinberg, Dan.  CART: Classification and Regression Trees. Taylor and Francis Group. pp 179-201, 2009.
  6. Rokach, Lior and Maimon, Oded. Top-Down Induction of Decision Trees Classifiers – A Survey.  IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews.  Vol. 35, No. 4.  pp 476-487, November 2005.
  7. Quinlan, J.R.  Simplifying Decision Trees. International Journal of Man-Machine Studies.  Vol. 27, pp. 221-234, 1987.
  8. Quinlan, J.R.  C4.5: Programs for Machine Learning.  San Francisco, CA: Morgan Kaufmann, 1993.
  9. Krumholz H et al. Patterns of hospital performance in acute myocardial infarction and heart failure – 30-day mortality and readmission. Circulation: Cardiovascular Quality and Outcomes. [Online] http://circoutcomes.ahajournals.org/cgi/content/abstract/2/5/407. 2009.
All code owned and written by David Stites and published on this blog is licensed under MIT/BSD.