Category: Machine Learning

Abstract— With the Internet becoming a major resource for many individuals, businesses, universities and countries, there is an increasing amount of traffic on the Internet.  Some applications available on the Internet, such as online securities trading, must provide stringent performance guarantees from their web servers.  Because Internet traffic is bursty and unpredictable, and the interactions between hardware and software in multi-tier systems are complex, a large degree of automation is needed in capacity planning.

Index Terms—capacity planning, multi-tier systems, requests, sessions, performance, autonomic computing, distributed systems

I. Introduction

The Internet has become an important and popular channel for a variety of different services, such as news aggregation, online shopping, social networking and financial brokerage.  Many of the popular services on the Internet today, such as Facebook, Twitter, Amazon and eBay [2, 5, 9], are built on a generic multi-tier architecture.  Requests generated by users flow through each layer in this architecture.  Each tier in this system provides certain functionality to the following tier by executing part of the incoming request [5].  This multi-tier system includes 1) a web tier running a web server, such as Apache, to respond to incoming requests, 2) an application tier running an application container, such as Tomcat, which hosts an application, and 3) a backend database tier running database software such as MySQL.

Tiered Application [2]

Each of these tiers may be a single machine or may be distributed over a cluster of nodes.  In addition, these servers may be virtualized, so that multiple applications can be hosted and run on the same physical node.  Often, to ensure acceptable performance, especially in application hosting, service level agreements (SLAs) are put in place to govern the desired system performance.  An SLA can specify one or more parameters, such as the required total uptime of a system, total throughput, average end-to-end delay of requests within a certain percentile, etc.  The SLA target is maintained by adding resources to each tier, and resources can be removed as necessary (without violating the SLA).

Traffic on a computer network is inherently non-uniform and hard to predict, mainly due to the “think” time of the end user.  This “bursty” behavior is characterized by short, uneven spikes of peak congestion in the life of an application [4].  Popular services can experience dynamic and varying workloads that depend on popularity, time or date, or general demand.  In addition, Internet flash crowds can cause bursty and unpredictable workloads in each of these tiers.  For example, during the November 2000 holiday season, an overload caused a forty-minute downtime [5].  These flash crowds can cause a significant deviation from the normal traffic profile of an application, affecting its performance.

The performance of the system can be characterized by the total measured end-to-end delay of these incoming requests and their workload at each tier.  Because the interactions between the software and hardware in each of these aforementioned tiers can be quite complex, allocating sufficient resources (CPU, disk, network bandwidth, etc.) to each tier so that none saturates and becomes a bottleneck is often a difficult, lengthy and potentially error-prone process for human operators to model and estimate.  For example, if the performance of an application tier dominates the performance of the system, that tier becomes the bottleneck.  It is non-trivial to model the process (scheduling, algorithms, memory, I/O and CPU times) and the time it takes to execute the business logic in this tier, especially if resources are shared, as is common in distributed systems.  In addition, static allocation does not address Internet flash crowds and their potential to overload the system and jeopardize SLA compliance.

Capacity planning and autonomic resource management play an important role in modern data centers.  Service differentiation (Quality of Service) and performance isolation help these complex systems adapt to changing workloads, both in the system as a whole and within each tier.  In this paper, we present a survey of the large variety of models, techniques and results of autonomic resource management in large-scale, multi-tier systems.

II. Network Traffic Characteristics

To truly understand the motivation for capacity planning and the difficulty of modeling the interactions between incoming requests and the hardware and software that service them, we must understand network traffic characteristics.  Incoming session and request rates tend to fluctuate based on a variety of different factors.  These varying workloads can be characterized as “bursty”, which [4] defines as short, uneven spikes of peak congestion during the lifetime of the system.  These traffic patterns deviate significantly from the average traffic arrival rates.

One way to characterize burstiness, as seen in [4], is with the index of dispersion I.  The index of dispersion measures whether a set of observed occurrences is clustered or dispersed compared to a standard model.  The larger I is, the more dispersed the observed occurrences of bursty traffic are.  We can see in Figure 1 that burstiness can aggregate into different periods represented by different indices of dispersion.


Figure 1

A simple way to calculate the index of dispersion for a particular time window is as follows:

I = Var(Nt) / E[Nt]

where Nt is the number of requests completed in a time window of t seconds (with t counted ignoring server idle time), Var(Nt) is the variance of the number of completed requests and E[Nt] is the mean number of requests completed during busy periods.  An astute reader can deduce that as t becomes large, if there are no bursty periods within t, the index of dispersion remains low.
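As an illustration, the index of dispersion can be estimated from a trace of per-window completion counts.  The sketch below (with invented counts) is this author's illustration, not code from [4]:

```python
import statistics

def index_of_dispersion(counts):
    """I = Var(N_t) / E[N_t] for per-window request completion counts N_t."""
    mean = statistics.fmean(counts)
    var = statistics.pvariance(counts)
    return var / mean

smooth = [100, 101, 99, 100, 100]   # steady traffic -> I near 0
bursty = [5, 400, 2, 380, 3]        # spiky traffic -> large I

assert index_of_dispersion(smooth) < 1 < index_of_dispersion(bursty)
```

Steady traffic yields a value well below 1, while the bursty trace, whose variance dwarfs its mean, yields a large I.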

Now that bursty periods can be quantified via I, what causes burstiness?  [4] suggests that “burstiness in service times can be a result of a certain workload combination” and that it “can be caused by a bottleneck switch between tiers and can be a result of ‘hidden’ resource contention between the transactions of different types and across different tiers.”  That is, one server or tier may be lightly loaded during a particular period but become saturated in other periods when a large number of requests are processed.

This hidden resource contention among different types of transactions across different tiers can be extremely difficult to model, and even more troublesome for human operators to provision for correctly.  Autonomic provisioning of system resources is needed to optimize service metrics.

III. Self-Management

As previously mentioned, the major motivation behind autonomic resource management and capacity planning is to reduce human involvement in these activities due to their difficulty.  If systems are able to plan for the future and react to the present without operator input, they take the load off of system administrators and respond more accurately to the situation at hand.  In order to do this, one has to define the goals and properties of such an autonomic system.  The main properties of an autonomic system, as cited in [1], are:

  • Self-configuration: autonomic systems configure themselves based on high-level goals specifying what is desired (SLAs), not how to achieve it.  Systems with self-configuration can install and set themselves up.
  • Self-optimization: autonomic systems can optimize their use of resources either proactively or reactively in an attempt to improve service and meet SLAs.
  • Self-healing: autonomic systems can detect and diagnose problems at both high and low levels (software and hardware).  Autonomic systems should also attempt to fix the problem.
  • Self-protection: autonomic systems should protect themselves from attacks, but also from trusted users trying to make changes.

One particularly important feature of autonomic systems is the ability to exhibit “proactive” behavior.  For example, software or hardware agents should be:

  • Autonomic: “Agents operate without direct intervention of humans and have some kind of control over their actions and internal state [1].”
  • Social: “Agents interact with other agents via some agent-communication languages [1].”
  • Reactive and proactive: “Agents are able to perceive their environment and respond in a timely fashion.  Agents do not simply act in response to their environment, but they are able to exhibit goal-directed behavior in taking the initiative [1].”

In order to realize these properties, IBM suggested MAPE-K, a general framework for an autonomic control loop, as detailed in [1].  MAPE-K allows for clear boundaries in the model to classify the work that is taking place in the autonomic system.  In MAPE-K, there are managed elements, which represent any software or hardware resource managed by the autonomic agents.  Sensors collect information about managed elements and provide input to the MAPE-K loop so that the autonomic manager can execute changes.  Typically the managers are additional software components configured with high-level goals, which leaves the implementation of these goals up to the autonomic manager.  Effectors actually carry out the changes to the managed elements (these changes can be fine- or coarse-grained).  Each of these elements can be observed in the following sections.

In the MAPE-K loop, there are five different components: monitor, analyze, plan, execute and knowledge.  Monitoring involves “capturing properties of the environment (physical or virtual) that are important to the self properties of the autonomic system [1].”  This information is captured from sensors in the system in two different ways: passively (using built-in tools to run an analysis on the system) or actively (engineering the software to monitor and improve performance).
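The loop described above can be sketched as a minimal skeleton.  The class, the SLA threshold and the action names here are this author's invention for illustration, not an interface from [1]:

```python
class AutonomicManager:
    """Minimal MAPE-K sketch: monitor -> analyze -> plan -> execute,
    with a shared knowledge base accumulated across iterations."""

    def __init__(self, sensors, effectors):
        self.sensors = sensors        # callables returning metric values
        self.effectors = effectors    # callables applying changes
        self.knowledge = []           # history shared by all phases

    def monitor(self):
        return {name: read() for name, read in self.sensors.items()}

    def analyze(self, metrics, sla_latency=1.0):
        return metrics["latency"] > sla_latency   # SLA violated?

    def plan(self, violated):
        return [("add_capacity", 1)] if violated else []

    def execute(self, actions):
        for action, amount in actions:
            self.effectors[action](amount)

    def step(self):
        metrics = self.monitor()
        actions = self.plan(self.analyze(metrics))
        self.execute(actions)
        self.knowledge.append((metrics, actions))  # retained for learning

added = []
mgr = AutonomicManager(sensors={"latency": lambda: 2.0},
                       effectors={"add_capacity": added.append})
mgr.step()          # latency 2.0 exceeds the 1.0 SLA -> one allocation
```

A real manager would replace the threshold test in analyze() with the learned models discussed below, but the phase boundaries stay the same.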

In the planning phase of MAPE-K, the manager uses the monitoring data from the sensors to produce a series of changes to one or more managed elements.  These changes can be calculated by having the autonomic system keep state information about the managed elements and data, so that adaptation plans can be adjusted over time.  An additional benefit of keeping state information is that the system can maintain an architectural model that mirrors the actual system, and proposed changes can be verified against this model to ensure system integrity remains intact before and after they are applied.  If any violations occur after applying the changes to the model, the changes can be aborted or rolled back to avoid damage or downtime to the system.

In the knowledge part of the model, the knowledge used to effect adaptation can come from humans, logs, sensor data, or day-to-day observation of a system's behavior [1].  In this part of the model there is a large space in which one can use machine learning to acquire knowledge about the system.  While [1] suggested reinforcement learning and Bayesian techniques, other authors suggest K-Nearest Neighbors (KNN) [2] and neural networks with fuzzy control [7].  This author would suggest that decision trees could be an effective method of acquiring the knowledge needed to effect the proposed plans.  Also, clustering could be used to identify and classify previously similar plans and determine whether they succeeded or failed.

Finally, there are several different levels of autonomic maturity, including: Basic, Managed, Predictive, Adaptive and Fully Autonomic.

Now that we have seen a general framework for automated resource management, let’s continue to explore each component.

  1. Managed Elements

Managed elements in an autonomic system consist of all the hardware and software elements that can be managed by autonomic agents.  Because multi-tier systems can be very complex, there are multiple levels of detail at which one can view the system.

The first and highest level is where the system is viewed as a black box.  At this level, we consider the end-to-end delay from when a request enters the system until it returns to us.  If there is congestion or delay caused by insufficient capacity at any tier, we are unable to know which tier's capacity is causing the problem.  Typically, changes in capacity allocation are fairly coarse-grained at this level.  [8] plans its allocation strategy at this level.

At the next level down, we no longer monitor system performance by a single delay metric.  We are now able to monitor the performance of individual tiers.  When congestion occurs at a single tier, we are able to target that tier and increase the allocated capacity of just that tier.  However, one must be careful not to trigger cascading downstream allocations with the increased capacity at the original tier.  [5] plans its allocation strategy at this level.

Lastly, at the most fine-grained level, we are able to collect statistics on individual components within a tier such as CPU usage for an application on a particular node.  Figure 2 shows an example of the black-box and per-tier paradigms.


  2. Autonomic Managers and Controllers

In the MAPE-K loop, the monitoring, analyzing and planning phases are done in the control plane, or autonomic manager.  The manager should adjust managed elements in the following fashion [2]:

  • Prompt. It should sense impending SLA violations accurately and quickly and trigger resource allocation requests as soon as possible to deal with load increases (or decreases).
  • Stable. It should avoid unnecessary oscillations in provisioning resources (adding and removing).

Keeping these two guidelines in mind, there are two important decisions that every autonomic manager must make: 1) when to provision resources and 2) once a provisioning decision is made, how much of a particular resource to provision.

When To Provision

When considering the question of “when to provision”, one must consider the timescale that is being evaluated.  There are two main methods of provisioning: predictively and reactively.  Predictive provisioning attempts to stay ahead of the workload, while reactive provisioning follows to correct for short term fluctuations.

In predictive provisioning, the autonomic system plans increased (or decreased) capacity for the future.  In real-world workloads, the system or application will typically see a peak in traffic during the middle of the day and a minimum during the middle of the night [5].  Other factors, such as recent news (breaking stories) or seasons (holiday shopping), can affect this distribution of traffic.  By using predictive analysis, automated systems are able to provision sufficient capacity well in advance of any potential events that would cause SLA violations.

Prediction can be done via a number of different methods, including statistical analysis, machine learning, or simply past observations of workload.  This prediction is usually implemented in the control plane or autonomic manager of the system [2, 5, 6, 7, 8].  For example, in [2], a scheduler for each application monitors the application database tier and collects various metrics about the system and application performance, such as average query throughput, average number of connections, read/write ratios, and system statistics such as CPU, memory and I/O usage.  These metrics are reported back to the autonomic manager, and an adaptive filter predicts the future load based on the currently measured load information.  Each of these metrics is weighted to reflect the usefulness of that feature.  In addition, a KNN classifier determines whether an SLA is broken and redirects the resource allocation to adjust the number of databases so that the tier is no longer in violation.  A resource allocation component decides how to map the resources dynamically.  [7] uses a self-adaptive neural fuzzy controller to decide upon the allocation of resources.  Moreover, [6] uses a model estimator, which automatically learns online a model of the relationship between an application's resource allocation and its performance, and an optimizer, which predicts the resource allocation required to meet performance targets.  Lastly, in [5], the authors use a histogram of request arrival rates for each hour over several days.  Using that data, the peak workload is estimated as a high percentile of the arrival rate distribution for that hour.  With these metrics, the application is able to predict shifting workloads in a sliding window.
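The per-hour percentile estimate described for [5] can be sketched as follows.  The arrival-rate samples and the choice of the 95th percentile are this author's illustration:

```python
import statistics

def peak_estimate(rates_for_hour, percentile=95):
    """High percentile of past arrival rates (req/s) observed for one hour
    of the day; used as the provisioning target for that hour."""
    qs = statistics.quantiles(rates_for_hour, n=100, method="inclusive")
    return qs[percentile - 1]

# Noon-hour arrival rates over eight previous days, one of them a spike day.
past_noon_rates = [120, 135, 128, 400, 131, 140, 125, 133]
print(peak_estimate(past_noon_rates))   # -> 309.0, far above the ~164 mean
```

Provisioning for the 95th percentile rather than the mean sizes the hour for its observed near-worst case, at the cost of some idle capacity on quiet days.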

Most of the papers surveyed use the error in prediction e(k) (that is, the difference between the predicted workload or arrival rate λpred(t) and the observed workload or arrival rate λobs(t) for a time period t) or the change in the error in prediction Δe(k) = e(k) − e(k − 1) as input to their algorithms to help determine the next period's allocation [5, 6, 7].  Since workloads and arrival rates often exhibit bursty overloads [4, 5], these parameters fine-tune the prediction model.

Proactive provisioning alone is not enough to make the system robust and immune to SLA violations.  For example, there may be errors in prediction if the workload or arrival rate deviates greatly from previous days.  As mentioned earlier, Internet flash crowds can cause spikes in network requests that have the potential to congest and overload the system due to their bursty, unpredictable nature.  In these cases, the prediction would lag behind the actual events and there would be insufficient capacity.

Reactive provisioning can be used to quickly correct deviations caused by these unpredicted events.  Reactive provisioning plans on a shorter time scale, perhaps on a several-minute basis.  If anomalies are detected, reactive provisioning can quickly allocate additional capacity to the affected tiers so they do not become a bottleneck.

In [5], the authors implement reactive provisioning by comparing the predicted session arrival rate λpred(t) with the observed arrival rate λobs(t) for a time period t.  If these two measurements differ by more than a threshold, they take corrective provisioning action.  In [2], the resource manager monitors the average latency received from each workload scheduler during each sampling period.  The resource manager uses a smoothed latency average computed as an exponentially weighted mean of the form WL = α × L + (1 − α) × WL, where L is the current query latency.  The larger the α parameter, the more responsive the average is to the current latency.  In [6], the authors reactively provision limited resources using an autoregressive moving-average (ARMA) model, where two parameters a1(k) and a2(k) capture the correlation between the application's past and present performance, and b0(k) and b1(k) are vectors of coefficients capturing the correlation between the current performance and the recent resource allocations.  Note that if the spike in traffic is large, it may require several rounds of reactive provisioning to bring the capacity to an acceptable level.
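The smoothed-latency rule quoted from [2] can be sketched directly.  The α value and latency samples below are illustrative:

```python
def smooth_latencies(samples, alpha=0.3, initial=0.0):
    """Exponentially weighted mean: WL = alpha * L + (1 - alpha) * WL.
    Larger alpha reacts faster to the current sample; smaller alpha
    damps brief spikes."""
    wl = initial
    history = []
    for latency in samples:
        wl = alpha * latency + (1 - alpha) * wl
        history.append(wl)
    return history

spiky = [1.0, 1.0, 10.0, 1.0, 1.0]          # one-sample burst
print(smooth_latencies(spiky)[-1])           # stays well below the 10.0 spike
```

A single outlier moves the smoothed value only modestly, so one bursty sample does not by itself trigger a provisioning round.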

In [8], the authors consider provisioning resources based on “what-if” analysis.  They argue that most web applications consist of services that form a directed acyclic graph, which can be formed into a decision tree.  In their model, they ask each tier to predict, online, its future performance in the event it received an additional capacity unit or had one capacity unit removed.  The performance model in [8] uses the following equation to calculate response time:

Rserver = Sserver / (1 − λ · Sserver / n)

where Rserver is the average response time of the service, n is the number of CPU cores assigned to the service, λ is the average request rate and Sserver is the mean service time of the server.  When a threshold of service time is exceeded, the service time is re-computed.  These predictions are given to the parent node.  Using these predictions, provisioning agents negotiate resources with each other based on maximum performance gain and minimum performance loss.  The root node selects which services to provision across the tree when the SLA is about to be violated, or to de-provision if resources can be removed without causing an SLA violation.  Furthermore, in [8], the authors also consider how provisioning cache instances using the previously described “performance promises” would affect workloads in a particular tier and all child tiers due to an increased hit rate.  Figure 3 from [8] illustrates the decision process:
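The per-service what-if evaluation can be sketched with a standard queueing approximation, R = S / (1 − λS/n), standing in for the exact performance model of [8]; the function names and all numbers are this author's illustration:

```python
def response_time(n_cores, lam, service_time):
    """Approximate mean response time for a service with n_cores cores,
    arrival rate lam (req/s) and mean service time service_time (s)."""
    rho = lam * service_time / n_cores
    if rho >= 1.0:
        return float("inf")            # saturated: unbounded delay
    return service_time / (1.0 - rho)

def what_if(n_cores, lam, service_time):
    """Performance promise reported to the parent node:
    (gain if one core is added, loss if one core is removed)."""
    now = response_time(n_cores, lam, service_time)
    gain = now - response_time(n_cores + 1, lam, service_time)
    loss = response_time(max(n_cores - 1, 1), lam, service_time) - now
    return gain, loss

gain, loss = what_if(n_cores=4, lam=20.0, service_time=0.1)
assert gain > 0 and loss > 0   # more cores help, fewer cores hurt
```

A parent can then rank its children by these promises, granting the next capacity unit to the child reporting the largest gain.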

Figure 3

We can see that service i has k immediate child services, and it aggregates its own performance promises with those of its children before reporting to its parent.

Lastly, while the majority of the papers reviewed were concerned with provisioning additional capacity, [2] also considered removing unneeded capacity.  If the average latency is below a low-end threshold, the resource manager triggers a resource removal.  The system then performs a temporary removal of the resource.  If the average latency remains below the low threshold, the resource is permanently removed.  A temporary removal is performed first because mistakenly removing a resource is a potentially costly operation if it negatively impacts system performance.  The main motivation behind this logic is that under-utilized resources are wasted in the system.

When comparing the effectiveness of reactive and proactive provisioning, Figure 4 from [5] shows that proper proactive provisioning can greatly reduce the time spent in violation of the SLA.

Figure 4

In addition to handling smoothly increasing workloads, the predictive provisioning techniques from [5] can also handle sudden load bursts effectively, as shown in Figure 5.

Figure 5

By combining predictive and reactive provisioning, systems are able to have sufficient capacity to handle both predictable workloads and short-term instabilities.

How Much To Provision

The question of “how much of a given resource to provision” is less straightforward than “when to provision.”  As stated earlier, it may be necessary to go through several rounds of provisioning before the end-to-end delay reaches an acceptable level.  Moreover, depending on the implementation and type of controller used, the model determining “how much” can differ.  However, all provisioning algorithms attempt to provision enough to meet the SLA, within a time period t, without overprovisioning, because that would waste resources.  Additionally, one critical factor to consider and avoid is that increasing the provisioning at a particular tier k might create a downstream bottleneck at a later tier k + n.  Several authors in the survey explore how to avoid this phenomenon.

The authors in [5] present a general framework for determining provisioning needs based on average end-to-end latency.  Suppose a system has n tiers, denoted by T1, T2, …, TN, and let R denote the desired end-to-end delay.  Suppose further that the end-to-end response times are broken down into per-tier response times denoted by d1, d2, …, dN, such that Σ di = R.  Lastly, assume that the incoming session rate is λ.  Typically, one would provision the system for the worst-case scenario, that is, the peak of λ.  Individual server capacity can be modeled using M/M/1 first-come, first-served (FCFS), non-preemptive queues, where each request in the queue has a certain amount of work to do.

Assuming the service rate of the queue is μ, λ ⁄ μ = ρ is the service ratio.  If ρ is less than 1, the queuing delay — the time a request waits in the queue before being serviced — is bounded and finite.  If ρ equals 1, the queue length is unbounded, but the queuing delay is infinite only if the inter-arrival times of requests are not deterministic.  Lastly, if ρ is greater than 1, the queuing delay is infinite.  From this we can derive useful quantities, such as the response time (the queuing delay plus the time it takes to execute the request).  This behavior can be modeled with the queuing theory result [9]:

di = si + λi (σa² + σs²) / (2 (1 − λi si))

where di is the average response time for tier i, si is the average service time for a request at that tier and λi is the request arrival rate at tier i.  σa² and σs² are the variance of inter-arrival time and the variance of service time, respectively, which can be monitored online.  Using this equation, we can obtain a lower bound on the request rate for server i.  Assuming a think time of Ζ, requests are issued at a rate of (1 / Ζ) [5].  Therefore, the number of servers ηi needed at tier i to service a peak request rate can be computed as:

ηi = ⌈ βi λ τ / λi ⌉

where βi is a tier-specific constant, τ is the average session duration, λi is the capacity of a single server and λ is the session arrival rate.  The authors in [5] assume that the servers in each tier are load balanced, although other models do exist that explore provisioning without load balancing, such as [10].
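Assuming the server count takes the form ηi = ⌈βi λ τ / λi⌉, consistent with the variables just described (the exact expression in [5] may differ), the sizing step can be sketched as:

```python
import math

def servers_needed(beta_i, lam, tau, cap_i):
    """Per-tier server count sketch.
    beta_i: tier-specific request multiplier
    lam:    peak session arrival rate (sessions/s)
    tau:    average session duration (s)
    cap_i:  request capacity of a single server at this tier (req/s)"""
    return math.ceil(beta_i * lam * tau / cap_i)

# A tier where each session triggers twice the base request load:
print(servers_needed(beta_i=2.0, lam=5.0, tau=60.0, cap_i=100.0))  # -> 6
```

The ceiling matters: provisioning must round up, since a fractional server cannot absorb the residual load.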

In [7], the authors use a neural fuzzy controller with four layers to determine how much of a particular resource should be (re)allocated (see Figure 6).

Figure 6

In layer 1, the input variables e(k) and Δe(k) are passed into the neural controller nodes.  Each node in layer 2 acts as a “linguistic term” assigned to one of the input variables in layer 1, using its membership function to determine the degree to which an input value belongs to a fuzzy set (e.g., “negative large” corresponds to the numeric value −50).  Layer 3 multiplies the outputs from layer 2 together to determine the firing strength of each rule.  Lastly, in layer 4, the output of layer 3 is “defuzzified” into a numeric output in terms of a resource adjustment Δm(k).  The magnitude of the adjustment is determined by the online learning of the neural fuzzy controller, described in more detail in [7].  This online learning algorithm adapts quite rapidly to stationary and dynamic workloads to guarantee the 95th-percentile delay, as seen in Figure 7 from [7].
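A toy version of this four-layer inference can be sketched with triangular membership functions and weighted-average defuzzification.  The fuzzy sets and the rule table below are invented and far simpler than the learned controller in [7]:

```python
def tri(x, center, width=50.0):
    """Triangular membership degree of x in a fuzzy set at `center`."""
    return max(0.0, 1.0 - abs(x - center) / width)

SETS = {"NL": -50.0, "ZE": 0.0, "PL": 50.0}   # negative-large / zero / positive-large

def fuzzy_adjust(e, de, rule_out):
    # Layers 1-2: membership degrees of e(k) and delta-e(k) per term.
    mu_e = {name: tri(e, c) for name, c in SETS.items()}
    mu_de = {name: tri(de, c) for name, c in SETS.items()}
    # Layer 3: rule firing strength is the product of the two degrees.
    num = den = 0.0
    for a, wa in mu_e.items():
        for b, wb in mu_de.items():
            w = wa * wb
            num += w * rule_out[(a, b)]   # Layer 4: weighted-average
            den += w                      # defuzzification
    return num / den if den else 0.0      # resource adjustment dm(k)

# Invented rule table: large positive error -> add resources, and vice versa.
rules = {(a, b): (SETS[a] + SETS[b]) / 10.0 for a in SETS for b in SETS}
assert fuzzy_adjust(50.0, 0.0, rules) > 0 > fuzzy_adjust(-50.0, 0.0, rules)
```

In [7] the membership centers and rule outputs are tuned online rather than fixed; this sketch only shows the inference path from error to adjustment.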

Figure 7

In [6], the authors use the optimizer to determine changes in resource allocations for finite resources such as disk and CPU.  This is a significant departure from other models, which assume a (much) larger, essentially unlimited, pool of resources from which to allocate.  Their optimizer uses ua as the resource allocation required to meet the performance target or SLA.  The optimizer's high-level goal is to determine an allocation in a stable manner (no resource oscillations) by using a cost minimization function to find the optimum resource assignments.  Their cost minimization function is:

J = Jp + q · Jc

where Jp = (yna(k) − 1)² serves as a penalty for deviation of the application performance from the desired target, Jc = ||ura(k) − ua(k − 1)||² serves as a control cost to improve the stability of the system, and q weights the two terms.  The controller attempts to minimize this linear combination of both functions.  One important point to note in this particular allocation scheme is that allocation requests are confined by the following constraint:

Σi u_irt ≤ 1, for each resource r at each tier t

where u_irt is the requested share of resource r at tier t for application i.  Stated another way, the allocations across all applications for a particular resource, such as CPU or disk, in a particular tier must total less than or equal to 100%.  When this constraint is violated, that resource is experiencing contention.  It is important to note that in this allocation scheme, while every application can receive adequate capacity when no contention exists, under contention applications are weighted according to their priorities in the minimization function, so that the most “important” application receives the largest share of the divided resources.  Clearly, because we cannot allocate more than 100% of a node's physical CPU, disk or other resource, all applications suffer some performance degradation under contention – the only difference is the extent of that degradation.
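The constraint can be checked directly.  The application names and shares below are illustrative:

```python
def contended(allocations):
    """allocations: {(app, resource, tier): fraction of the physical
    resource requested}. Returns the (resource, tier) pairs whose total
    requested share exceeds 1.0, i.e. the contended resources."""
    totals = {}
    for (app, res, tier), frac in allocations.items():
        totals[(res, tier)] = totals.get((res, tier), 0.0) + frac
    return {key for key, total in totals.items() if total > 1.0}

alloc = {("A", "cpu", "db"): 0.7, ("B", "cpu", "db"): 0.5,
         ("A", "disk", "db"): 0.4, ("B", "disk", "db"): 0.3}
assert contended(alloc) == {("cpu", "db")}   # 0.7 + 0.5 = 1.2 > 1.0
```

When the check reports contention, the optimizer in [6] falls back to its priority weights to split the resource rather than granting every request.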

Other authors, of course, decide how much to provision based on their controller implementations.  In [2], the authors use a k-nearest neighbors classifier to decide the number of databases to be provisioned.  Lastly, the authors in [8] use a decision-tree process to determine the maximum performance gain (and minimum performance loss) when allocating or removing resources.  The performance of the algorithm in [8] was quite impressive at observing its SLA and remaining well-provisioned without wasting resources, as seen in Figure 8.

Figure 8

Before looking at additional MAPE-K components, it is worth examining two enhancements to resource provisioning: preventing downstream bottlenecks and preventing allocation oscillations.

  A. Preventing Downstream Bottlenecks

The relationship between an incoming request and the work units at successive tiers is not necessarily 1-to-1.  A single incoming search request at the web tier may trigger one or more query requests at a database tier.  Therefore, downstream tier capacity has to be considered when allocating additional capacity at upstream tiers.

First, consider what would happen if downstream tiers were not adjusted for the additional incoming requests.  When an increase in allocation is triggered by an SLA violation at tier k, the capacity there increases by n.  If the downstream tier k + 1 is almost at capacity, that tier now has up to n additional requests to service, which has the potential to cause an SLA violation.  When the resource manager detects a potential violation at tier k + 1, there will be another round of provisioning, and because it takes a fixed amount of time to provision additional resources, tier k + 1 could meanwhile be in violation.  Following this pattern, a single increase could trigger a provisioning event at every remaining downstream tier.  Clearly, this cascading chain of allocation, violation and additional provisioning, in addition to being wasteful, increases the total time in violation of the SLA.  [5] proposes a solution using the constant βi, which is greater than one if a request triggers multiple requests at tier i, or less than one if caching at prior tiers reduces the work at that tier.
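The βi adjustment can be sketched as follows; the per-tier multipliers are this author's illustration:

```python
def induced_loads(incoming_rate, betas):
    """betas[i] is the tier-specific constant: > 1 if a request fans out
    into multiple requests at tier i, < 1 if caching at prior tiers
    absorbs work. Returns the effective request rate seen by each tier,
    so every tier can be sized in the same provisioning round."""
    return [incoming_rate * beta for beta in betas]

# Web tier passes requests through; the app tier issues three queries per
# request; a cache in front of the database absorbs half of those.
print(induced_loads(100.0, betas=[1.0, 3.0, 1.5]))  # -> [100.0, 300.0, 150.0]
```

Sizing all tiers from these induced rates in one round is what prevents the violate-then-provision cascade described above.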

  B. Preventing Allocation Oscillations

Because Internet traffic is bursty by nature, the load on the system can fluctuate rapidly.  Because of this, a necessary enhancement to the system controller is a filter to smooth or dampen very brief load spikes.  This is not to say that we do not want to respond to all SLA violations, but adding or removing capacity can be a very costly and time-consuming operation.

In addition, in [2], the authors keep state on each allocation, as either a STEADY or MIGRATION state.  During the MIGRATION state, it may take a period of time after the SLA violation and the associated capacity allocation is triggered for the migration of additional resources to occur and “warm up.”  [2] found that latency may continue to be high, or even increase, during this period, so samples taken during it may not reflect the actual system state.  Furthermore, they found that making decisions based on samples from this migration and warm-up period may continue to add unnecessary replicas, which will need to be removed later.  This is wasteful and could penalize other applications hosted on the same machine.
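The STEADY/MIGRATION guard can be sketched as a small state machine.  The warm-up length and latency samples are illustrative, not taken from [2]:

```python
STEADY, MIGRATION = "STEADY", "MIGRATION"

class ReplicaController:
    """Ignore latency samples while a new replica migrates and warms up,
    so stale-looking latency does not trigger a pile-up of replicas."""

    def __init__(self, sla_latency, warmup_samples=3):
        self.state, self.replicas = STEADY, 1
        self.sla, self.warmup = sla_latency, warmup_samples
        self.cooldown = 0

    def sample(self, latency):
        if self.state == MIGRATION:
            self.cooldown -= 1                 # discard warm-up samples
            if self.cooldown == 0:
                self.state = STEADY
        elif latency > self.sla:
            self.replicas += 1                 # trigger one allocation...
            self.state = MIGRATION             # ...then hold until warm
            self.cooldown = self.warmup
        return self.replicas

ctl = ReplicaController(sla_latency=1.0)
for latency in [2.0, 2.5, 2.4, 2.2, 0.8]:      # latency stays high during
    ctl.sample(latency)                        # migration, then recovers
assert ctl.replicas == 2                       # only one replica was added
```

Without the MIGRATION guard, the three high samples after the first violation would each have added another replica that would later need to be removed.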

For additional information on each model and provisioning algorithm, refer to the related paper.

  C. Effectors

The execution of the plan in the MAPE-K loop happens through the effectors.  In some of the papers that were surveyed, [citation needed – 5?], the authors used request policing to ensure that the admission rate does not exceed capacity.  Any sessions beyond capacity are dropped at the threshold of the system; once a request is admitted, it is not dropped at any intermediate tier.  Alternatively, each tier could have its own request policing mechanism; however, this is quite wasteful, as requests that have already had work done on them may be dropped at downstream tiers.
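Threshold-based session policing can be sketched as a simple admission counter.  The class and method names, and the concurrency model, are illustrative rather than taken from any surveyed system.

```java
// Sketch of session policing at the system threshold: new sessions are admitted
// only while capacity remains; once admitted, a session is never dropped at an
// intermediate tier.
public class SessionPolicer {
    private final int maxConcurrentSessions;
    private int active = 0;

    public SessionPolicer(int maxConcurrentSessions) {
        this.maxConcurrentSessions = maxConcurrentSessions;
    }

    /** Admit a new session if capacity remains; otherwise drop at the door. */
    public synchronized boolean tryAdmit() {
        if (active >= maxConcurrentSessions) return false;
        active++;
        return true;
    }

    /** Release capacity when an admitted session completes. */
    public synchronized void sessionEnded() {
        if (active > 0) active--;
    }
}
```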

In the majority of the papers [2, 5, 6, 7, 8], additional capacity was allocated from a pool of free resources.  If no free resources remained, or a resource had reached its replication maximum, resources were re-provisioned or the sessions and requests were capped at a maximum.  In [2], additional database capacity was added to the tier by replicating the existing database.  During migration, requests were held, and when the migration was complete they were “played” in stages to bring the new copy up to date with any additional writes that occurred during migration.

In [5], the authors used agile server switching.  In this model, each node had a copy of each application running in a separate virtual machine (VM), but only one application was active at a time.  When the control plane determined that additional capacity was needed, and the application running on a node scheduled for replacement was not the application receiving the additional capacity, its sessions and requests were ramped down using fixed-rate or measurement-based techniques.  Once all sessions of the previous application on that VM had concluded, the VM was suspended and the new allocation was brought up.  This allowed additional capacity for an application to be provisioned quickly.

However, admission control may not be the best choice, and [3] presented an alternative to dropping sessions: serving additional load in a “degraded mode.”  Using traditional control theory (a closed loop that reacts to sensors and actuators), the load on the server can be continually adjusted to stay within the bounds of process schedulability given the available resources.

The actuator or effector is responsible for performing a specific action based on controller output.  In the case of [3], admission control was initially used as a way of controlling server load.  This limits the number of clients the server responds to concurrently.  All other clients are rejected; while this is not a favorable outcome for them, it is a way of preserving QoS for the admitted clients.

However, this actuator can be adapted to provide a “degraded” level of service, in which the content served to a degraded client differs from what a “full service” client receives.  For example, the degraded content may contain fewer embedded objects, such as images.  To perform this degradation, there must be multiple versions of the content, and the web server serves from the appropriate file structure at request time.  There can be multiple service levels, and the controller output is expressed as m = I + F, where the integer part I denotes a service level and the fractional part F denotes the fraction of clients served at level I + 1 (the remainder are served at level I).  When m = M, all clients receive the highest level of service; conversely, when m = 0, all clients are rejected.  This gives a more general model in which a percentage of clients receives degraded service, rather than specifying exactly which clients (individual clients may be assigned through a separate mechanism, such as hashing their IP address).  The server load is controlled through the single input variable m.
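Under the interpretation that a fraction F of clients is served at level I + 1 and the rest at level I (consistent with m = M meaning full service and m = 0 meaning rejection), per-client service level selection can be sketched as follows.  The hashing scheme and names are our own illustration.

```java
// Sketch of content-degradation level selection: given the controller output m
// and a client identifier, return the service level for that client, where
// m = I + F means a fraction F of clients is served at level I + 1 and the
// rest at level I. Level 0 means the client is rejected.
public class ContentDegrader {
    public static int serviceLevel(double m, int clientId) {
        int i = (int) Math.floor(m);
        double f = m - i;
        // Hash the client id into [0, 1) so the same client consistently lands
        // in the same fraction of the population (Knuth multiplicative hash).
        double u = Math.abs(clientId * 2654435761L % 1000) / 1000.0;
        return (u < f) ? i + 1 : i;
    }
}
```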

The monitor proposed in [3] expresses utilization U as a function of the served request rate R and delivered byte bandwidth W: U = aR + bW, where a and b are constants derived through linear regression.  Keeping this approximate utilization below the deadline-monotonic schedulability bound U < 0.58 ensures that deadlines are met.  Using this scheme of admission and “capacity” planning, the authors of [3] were able to admit almost 3 times as many requests to the system, as shown in Figure 9, before significant service degradation occurred.
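The utilization model and admission test can be sketched directly from the formula.  The regression constants used below are placeholders, not values from [3].

```java
// Sketch of the utilization-based admission test from [3]: utilization is
// estimated as U = a*R + b*W, where R is the served request rate, W the
// delivered bandwidth, and a, b constants fitted offline by linear regression.
// Requests are admitted only while U stays below the deadline-monotonic
// schedulability bound of 0.58.
public class UtilizationMonitor {
    private final double a, b;                 // regression-derived constants
    private static final double BOUND = 0.58;  // schedulability bound from [3]

    public UtilizationMonitor(double a, double b) {
        this.a = a;
        this.b = b;
    }

    public double utilization(double requestRate, double bandwidth) {
        return a * requestRate + b * bandwidth;
    }

    /** Would admitting extra load (deltaR, deltaW) keep the server schedulable? */
    public boolean canAdmit(double r, double w, double deltaR, double deltaW) {
        return utilization(r + deltaR, w + deltaW) < BOUND;
    }
}
```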

Figure 9

  D. Sensors

Sensors provide the data for the monitoring phase of the MAPE-K loop.  The sensors that report to the control plane may track any number of metrics.  In the papers surveyed, authors used CPU usage [5, 6], bandwidth, throughput, active network connections [2, 5], I/O transfers [2, 6], cache hits [8], end-to-end delay or response time [5, 7], session arrival rate [5], request arrival rate [5], read/write ratios [2], lock ratios [2] and memory usage [2].  These parameters can be recorded and monitored online or offline, using standard system tools such as vmstat, iostat and sysstat, or custom software written to record statistics.

IV. Quality of Service & Performance Isolation

Lastly, the control theory approach was extended in [3] using virtual servers to show support for performance isolation (each virtual server can guarantee a maximum request rate and maximum throughput), service differentiation (each virtual server supports request prioritization) and excess capacity sharing (conforming virtual servers under overload can temporarily exceed capacity to avoid client degradation).

With performance isolation, a virtual server is configured with a maximum request rate RMAX and a maximum delivered bandwidth WMAX.  This expresses a throughput agreement to serve up to the specified levels.  If RMAX is exceeded, then the agreement on WMAX is revoked.

Under performance isolation, the virtual server can adapt content, as previously mentioned, to stay within the agreement bounds.  When the virtual servers are configured, if the aggregate of their utilizations Ui remains below the schedulability bound U < 0.58, then the system capacity is planned correctly.  To perform the correct bookkeeping, load classification is done when the request arrives, and the requests served and bandwidth delivered are then charged against the corresponding virtual server.  Lastly, based on the measured request rate Ri and delivered bandwidth Wi of virtual server i, the control loop applies the degree of content degradation necessary to keep the utilization of that virtual server at the proper level, preventing overload.

In service differentiation, the goal is to support clients at a number of different service levels, with lower-priority clients degraded first.  In [3], if there are m priority levels, where 1 is the highest priority, the capacity of the virtual server is made available to clients in priority order.  For the purposes of their research, [3] used two service levels, premium and basic, where premium clients received guaranteed service and basic clients did not.

Lastly, in sharing excess capacity, if the load on one virtual server exceeds its maximum capacity C, the overloaded server may temporarily use the resources of underutilized virtual servers.  This requires only a simple modification to the control loop.
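The per-virtual-server bookkeeping, together with excess capacity sharing, can be sketched as a small ledger.  The sharing policy shown, borrowing only while aggregate utilization stays under the bound, is our simplification of [3]; all names are ours.

```java
// Sketch of per-virtual-server bookkeeping for performance isolation with
// excess capacity sharing: each request is classified to a virtual server and
// charged against it, and a server that exceeds its configured share may
// borrow only whatever utilization the other servers leave unused.
public class VirtualServerLedger {
    private final double[] share;  // configured utilization share per server
    private final double[] used;   // utilization charged so far
    private static final double BOUND = 0.58;  // schedulability bound

    public VirtualServerLedger(double[] share) {
        this.share = share.clone();
        this.used = new double[share.length];
    }

    /** Charge a request's utilization cost u to virtual server i. */
    public void charge(int i, double u) { used[i] += u; }

    /** Server i may exceed its share only while total utilization allows. */
    public boolean mayServe(int i, double u) {
        double total = 0;
        for (double x : used) total += x;
        if (used[i] + u <= share[i]) return true;   // within its own share
        return total + u <= BOUND;                  // borrow excess capacity
    }
}
```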

V. Conclusion

In this survey, we presented information on burstiness in network traffic and how it affects service times.  We also showed how autonomic systems can provision themselves to handle the dynamic workloads.  Dynamic provisioning in multi-tier applications raises some interesting and unique challenges.  As distributed systems become more complex, the techniques surveyed in this paper will become more useful and relevant.


[1]     M.C. Huebscher, J.A. McCann.  A survey of autonomic computing: Degrees, models, and applications.  ACM Computing Surveys, 40(3), 2008.
[2]     J. Chen, G Soundararajan, C. Amza. Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers. Proc. of IEEE ICAC, 2006.
[3]     T. Abdelzaher, K.G. Shin, N. Bhatti. Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach. IEEE Transactions on Parallel and Distributed Systems, 13(1), 2002.
[4]     N. Mi, G. Casale, L. Cherkasova, E. Smirni. Burstiness in multi-tier applications: Symptoms, causes, and new models. Proc. of ACM/IFIP/USENIX Middleware, 2008.
[5]     B. Urgaonkar, P. Shenoy, A. Chandra, P. Goyal, and T. Wood. Agile dynamic provisioning of multi-tier Internet applications. ACM Trans. on Autonomous and Adaptive Systems, 3(1):1-39, 2008.
[6]     P. Padala, K.-Y. Hou, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, and A. Merchant. Automated control of multiple virtualized resources. Proc. of EuroSys, 2009.
[7]     P. Lama and X. Zhou. Autonomic provisioning with self-adaptive neural fuzzy control for end-to-end delay guarantee. Proc. IEEE/ACM MASCOTS, pages 151-160, 2010.
[8]     D. Jiang, G. Pierre, and C.-H. Chi. Autonomous resource provisioning for multi-service web applications. Proc. ACM WWW, 2010.
[9]     L. Kleinrock. Queueing Systems, Volume 2: Computer Applications. John Wiley and Sons, Inc. 1976.
[10]  B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier Internet services and its applications. Proc. of ACM ICMMCS, Banff, Canada, 2005.

Abstract—Clustering is an easy to use and implement method of unsupervised inductive inference. Clustering can be used to learn discrete or continuous valued hypotheses and create compact groups of objects that display similar characteristics, while maintaining a high degree of separation from other groupings. This paper is a survey of some of the methods of constructing and evaluating clusters.

Index Terms—machine learning, hierarchical agglomerative clustering, k-means clustering, unsupervised learning, Pearson’s coefficient, Euclidean distance

1. Introduction

Clustering is a method of dividing a heterogeneous group of objects into smaller, more homogeneous groups whose members display similar characteristics to one another while displaying one or more dissimilar characteristics to objects in other clusters. Clustering is an unsupervised learning technique with many applications in data mining, pattern recognition, image analysis and market segmentation. Clustering is easy to implement and fairly quick to produce groupings.

The main purposes of a Department of Corrections at any governmental level are to enhance public safety, to provide proper care of the inmates, to supervise inmates under their jurisdiction and to assist inmates’ re-entry into society.

There is no doubt that inmates at correctional institutions are dangerous to society. However, even after they are incarcerated at these institutions, some individuals remain ongoing offenders. While all individuals in prison have displayed some sort of deviant behavior, it is hypothesized that certain combinations of personality traits make some inmates more likely to be sexual predators and some inmates more likely to be sexual victims of these predators. At most correctional facilities, sexual contact between inmates, consensual or not, is not permitted.

Identification of those inmates likely to be sexually predatory toward other inmates would greatly assist corrections facilities in their goal of providing a safer environment for incarceration. Clustering can help with this goal by comparing a particular offender to known perpetrators and victims. After comparison, victims can be incarcerated separately from predators and receive any special needs assistance that can be offered while predators can be segregated in such a fashion as to reduce the potential for successful predatory behaviors.

1.1 Outline of Research

In this research survey, we implemented two different types of clustering algorithms: a standard “bottom-up” hierarchical method, single link clustering, and a standard “top-down” partitional algorithm, k-means clustering. We evaluated different distance measures, including Euclidean distance and Pearson’s correlation coefficient. Results are discussed in section 4 after running the clustering algorithms multiple times with the provided Colorado inmate dataset.

1.2 Data

The dataset that we used was provided by Dr. Coolidge of the Department of Psychology at the University of Colorado at Colorado Springs, and is publicly available. The dataset pertains to scores on personality disorder tests given to inmates in the State of Colorado; Dr. Coolidge’s inventory of personality disorder tests is given to all inmates in the state. The dataset provided contained 100 rows (25 rows describing victims of sexual abuse and 75 describing perpetrators of sexual abuse) with 14 attributes chosen by Dr. Coolidge.

The data described different measurements of how inmates had scored on different personality disorder tests. The tests included antisocial (AN), avoidant (AV), borderline (BO), dependent (DE), depressive (DP), histrionic (HI), narcissistic (NA), obsessive-compulsive (OC), paranoid (PA), passive-aggressive (PG), schizotypal (ST), schizoid (SZ), sadistic (SA) and self-defeating (SD) markers. The scores on these individual tests are reported as T scores, a type of standardized score that can be mathematically transformed into other standardized scores. T scores follow a Gaussian distribution in which 50 is always the mean and the standard deviation is always 10.
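The standard conversion between z-scores and T scores implied by this description (mean 50, standard deviation 10) is T = 50 + 10z, which in code is simply:

```java
// Standard conversion between z-scores and T scores (mean 50, standard
// deviation 10), the scale used for the personality inventory scores in the
// dataset.
public class TScore {
    public static double fromZ(double z) { return 50.0 + 10.0 * z; }
    public static double toZ(double t)   { return (t - 50.0) / 10.0; }
}
```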

It should be noted that even though the dataset is quite small, with only 100 rows available, the quality of the data is very good. The astute reader will appreciate that incarcerated persons might not be completely truthful in answering the test questions for a variety of reasons, such as apathy or the desire to appear more “damaged” than other inmates. The data was cross-checked with several other validation methods to ensure that the answers provided were reasonable; test scores that were not reasonable were discarded by Dr. Coolidge and are not included in this dataset.

2. Application

We chose to implement the clustering algorithms in Java because of the language’s ease of use and its strong capabilities for parsing data from files. Java also allowed us to model the problem more efficiently through object-oriented concepts such as polymorphism and inheritance. Lastly, several Java libraries were available, such as Java-ML [1], that helped analyze the clusters after the algorithms had been run.

2.1 Hierarchical Cluster Construction

In agglomerative single link clustering, clusters are constructed by comparing each point (or cluster) with every other point (or cluster). Each object is placed in a separate cluster, and at each iteration of the algorithm, we merge the closest pair of points (or clusters), until certain termination conditions are satisfied.

This algorithm requires defining the notion of a single link, or the proximity between points and clusters. For single link, the proximity of two clusters is defined as the minimum distance between any point in one cluster and any point in the other. The pair of clusters connected by the smallest such link, or edge, is merged.

This is often called “the nearest neighbor clustering technique.” [4] Relating this algorithm to graph theory, this clustering technique constructs a minimum spanning tree by finding connected components, so the algorithm is quite similar to Kruskal’s or Prim’s algorithm.

MSTSingleLink (Elements, AdjacencyMatrix)
Create a set of clusters C = {t1, t2,…,tn} from Elements
Create a distance matrix showing the distance between all clusters in C
k = n, where n is the number of clusters
Repeat
  Ci, Cj = closest pair of clusters in the distance matrix
  d = dis(Ci, Cj) // update dendrogram with this merge distance
  C = C – {Ci} – {Cj} ∪ {Ci ∪ Cj} // merge the two closest clusters
  dis({Ci ∪ Cj}, C) // recalculate the distance from the new cluster to all remaining clusters
  k = k – 1
Until k = 1

Typically, the termination criterion for this algorithm is that all elements end up grouped in one cluster. A better termination criterion is to record the distances at which merges take place: if there is a large jump in this distance (large being defined by the user), it indicates that the two objects or clusters are highly dissimilar and should not be merged. Note that the running time of MSTSingleLink is O(n^2), which makes it impractical for large datasets. For further information on single link MST, see [2], [4].
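A compact single-link implementation along these lines, using a distance-jump threshold as the termination criterion, might look like the following sketch. This is illustrative code, not the survey’s actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal single-link agglomerative clusterer in the spirit of the
// MSTSingleLink pseudocode: clusters are repeatedly merged at the smallest
// inter-cluster distance until the next merge distance would exceed a
// user-supplied threshold.
public class SingleLink {
    /** Returns clusters as lists of indices into points. */
    public static List<List<Integer>> cluster(double[][] points, double maxMergeDist) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkDistance(points, clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > maxMergeDist) break;   // too dissimilar: stop merging
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }

    /** Single link: minimum Euclidean distance between any cross-cluster pair. */
    private static double linkDistance(double[][] p, List<Integer> a, List<Integer> b) {
        double min = Double.MAX_VALUE;
        for (int i : a)
            for (int j : b) {
                double s = 0;
                for (int k = 0; k < p[i].length; k++)
                    s += (p[i][k] - p[j][k]) * (p[i][k] - p[j][k]);
                min = Math.min(min, Math.sqrt(s));
            }
        return min;
    }
}
```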

2.2 Partitional Cluster Construction

Where a hierarchical clustering algorithm creates clusters in multiple steps, a partitional algorithm, such as k-means, creates the clusters in one step. [4] Also, in partitional clustering, the number of clusters to create must be known a priori and used as input to the algorithm.

In k-means, elements are moved among a set of clusters until some termination criterion is reached, such as convergence. A possible convergence test is checking whether any element has changed clusters. Using k-means allows one to achieve a “high degree of similarity among elements, while a high degree of dissimilarity among elements in different clusters.” [4]

KMeansCluster (Elements, k)
Create the set of items {t1, t2,…,tn} from Elements
Assign initial values for the means m1, m2,…,mk
Repeat
  Assign each item ti to the cluster with the closest mean
  Recalculate the mean of each cluster
Until convergence criteria met

Note that the running time of KMeansCluster is O(tkn), where t is the number of iterations. While k-means does not suffer from the chaining problem, it does have other problems: it does not handle outliers well, does not work with categorical data and does not produce cluster shapes other than convex clusters. [4] Also, while k-means produces good results, it does not scale well and is not time-efficient. [4] While the provided dataset is not large, k-means could have problems attempting to cluster millions of objects. Lastly, it is possible for k-means to find a local optimum and miss the global optimum. For further information on k-means clustering, see [2], [4].
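A corresponding k-means sketch is shown below. The initialization (random points drawn from the data) is one common, if naive, choice; the survey does not specify its own.

```java
import java.util.Random;

// A minimal k-means sketch matching the KMeansCluster pseudocode: points are
// reassigned to the nearest mean and means recomputed until no assignment
// changes. Illustrative only.
public class KMeans {
    /** Returns the cluster index assigned to each point. */
    public static int[] cluster(double[][] points, int k, long seed) {
        int n = points.length, d = points[0].length;
        double[][] means = new double[k][d];
        Random rnd = new Random(seed);
        for (int c = 0; c < k; c++)
            means[c] = points[rnd.nextInt(n)].clone();   // naive initialization
        int[] assign = new int[n];
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < n; i++) {                // assignment step
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double s = 0;
                    for (int j = 0; j < d; j++)
                        s += (points[i][j] - means[c][j]) * (points[i][j] - means[c][j]);
                    if (s < bestD) { bestD = s; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            for (int c = 0; c < k; c++) {                // update step
                double[] sum = new double[d];
                int count = 0;
                for (int i = 0; i < n; i++)
                    if (assign[i] == c) {
                        count++;
                        for (int j = 0; j < d; j++) sum[j] += points[i][j];
                    }
                if (count > 0)                           // keep old mean if empty
                    for (int j = 0; j < d; j++) means[c][j] = sum[j] / count;
            }
        }
        return assign;
    }
}
```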

2.3 Distance Criterion

In both algorithms, cluster formation relies on some notion of a distance measure, a metric used to determine how “similar” two elements are. The choice of distance metric influences the shape of the clusters. While there are many distance measures, such as Mahalanobis, Hamming, city-block and Minkowski [2], [6], our implementation used two: Euclidean distance and Pearson’s correlation coefficient.

2.3.1 Euclidean Distance

Euclidean distance is the ordinary distance between two points that one would measure with a ruler. It is a simple distance metric, and by far the most commonly used, though one must make sure all attributes are on the same scale [2]. For points x and y with n attributes, it is defined as d(x, y) = √(Σ (xi – yi)²).
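As a direct translation into code, over the 14 T-score attributes or any equal-length feature vectors:

```java
// Euclidean distance: the square root of the summed squared differences
// between corresponding attributes of two equal-length feature vectors.
public class EuclideanDistance {
    public static double distance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++)
            sum += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }
}
```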

2.3.2 Pearson’s Correlation Coefficient

Pearson’s correlation coefficient is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1 inclusive. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. [3] Although it depends on the data being analyzed, typically anything over 0.5 or below −0.5 indicates a large correlation. Pearson’s correlation coefficient can be calculated as r = Σ (Xi – X̄)(Yi – Ȳ) / √(Σ (Xi – X̄)² Σ (Yi – Ȳ)²), where X̄ and Ȳ are the means of the two variables.

For further information on Pearson’s correlation coefficient, see [3].
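In code, the coefficient can be computed from the deviations of each sample about its mean:

```java
// Pearson's correlation coefficient for two equal-length samples: the
// covariance of X and Y divided by the product of their standard deviations,
// giving a value in [-1, 1].
public class Pearson {
    public static double correlation(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }
}
```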

2.4 Testing

Testing of the clustering algorithms was performed with the entire dataset of 100 examples. For single link clustering, the algorithm was run using both the Euclidean distance measure and Pearson’s correlation coefficient. For k-means clustering, the algorithm was run using the Euclidean distance measure nine times, with the number of clusters specified, k, increasing by one each run, from 2 clusters to 10 clusters. Discussion of results is in section 4.

3. Potential Problems

There was one main problem encountered during implementation and testing: with single link clustering, we observed the chaining effect. The chaining effect occurs when “points that are not related to each other at all are merged together to form a cluster simply because they happen to be near (perhaps via a transitive relationship) points that are close to each other.” [4] By including dissimilar objects, chaining can skew the clusters. One potential solution is to specify a maximum distance threshold above which points (or clusters) are not merged; this could also serve as part of the termination criteria. Another solution is to use a complete link distance criterion. [4] The chaining effect is most obvious when the output of the clustering program is viewed as a dendrogram.

4. Evaluation

For the evaluation of the results, we have chosen several different evaluation criteria. For single link clustering, we evaluate our results by examining the intra-cluster and inter-cluster distances, which measure the homogeneity of the clusters. In addition, we evaluate the distance threshold at which the clusters were merged and the entropy within each cluster. Lastly, we evaluate each cluster by calculating its recall, precision and F measure.

For k-means clustering, we evaluate the clusters using some of the same techniques (recall, precision and F measure), and we also introduce the sum of squared errors (SSE) and the Bayesian information criterion (BIC).

4.1 Macro Evaluation

In single link clustering, the distance threshold that produced the best clusters was 46.23 (see Figure 4.1). Cluster 1 contained inmates whose personalities were characteristic of victims of sexual abuse, while cluster 2 contained inmates whose personalities were characteristic of perpetrators of sexual abuse.

The remaining clusters, 3, 4, 5 and 6, were outliers in their own clusters, and their personalities exhibited behavior of both victims and perpetrators of sexual abuse. We noted that the distance thresholds required to merge clusters 4, 5 and 6 were much higher: 49.60, 55.22 and 59.63, respectively.

Cluster Cluster Members Intra-cluster Distance Inter-cluster Distance
1 1, 65, 4, 6, 97, 45, 49, 58, 53, 19, 18, 42, 67, 55, 62, 36, 7, 32, 54, 69, 39, 38, 89, 24, 26, 72, 8, 3, 9, 11, 95, 91, 27 15.64 90.82
2 2, 10, 16, 41, 50, 47, 51, 87, 78, 64, 68, 79, 77, 44, 84, 29, 31, 34, 63, 33, 90, 80, 40, 74, 82, 37, 43, 71, 48, 93, 96, 98, 22, 73, 52, 20, 57, 59, 46, 61, 75, 85, 66, 92, 83, 94, 88, 81, 70, 100, 60, 99, 86, 56, 13, 15, 30, 21, 14, 12, 23 8.73 79.20
3 5, 35, 25 0.17 66.25
4 17 0.0 67.19
5 28 0.0 67.19
6 76 0.0 67.20
Figure 4.1 – Index of clusters with intra-cluster and inter-cluster distance for Euclidean single link clustering

In Pearson’s coefficient clustering, the distance threshold that produced the best clusters was 0.035 (see Figure 4.2).
Cluster Cluster Members Intra-cluster Distance Inter-cluster Distance
1 1, 17, 14, 52, 48, 54, 27, 21, 22, 26, 77, 74, 33, 50, 57, 62, 47, 49, 79, 31, 34, 69, 67, 2, 86, 71, 100, 3, 41, 10, 29, 8, 25, 40, 83, 37, 19, 44, 46, 72, 55, 81, 88, 66, 30, 98, 38, 70, 20, 45, 36, 80, 60, 87, 13, 56, 91, 68, 51, 23, 16, 53, 89, 12, 65, 82, 94, 96, 90, 64, 58, 43 19.71 138.31
2 4, 11, 35, 84, 39, 15, 61, 63, 18, 92, 24, 76, 6, 7, 59, 95, 78 3.68 108.01
3 5, 97, 73, 93, 75 2.5 106.01
4 32, 42, 99 0.0 106.72
5 28 0.0 109.65
6 9 0.0 109.65
7 85 0.0 109.65
Figure 4.2 – Index of clusters with intra-cluster and inter-cluster distance for Pearson’s single link clustering

Based on these results, the Euclidean distance may produce better clusters, judging by intra-cluster distance. The results might also be improved by changing the policy for finding the best distance at which to merge clusters (i.e., using complete link or average link distance measures). If a more accurate method of distance finding were implemented, we would expect a more consistent result set, because there would be less of an effect from the chaining problem.

4.2 Micro Evaluation

When the individual clusters are broken down and their members analyzed, we obtain the following results. The recall, precision, F measure and entropy measurements assume the same clusters as in section 4.1. In our calculations, we assign a negative value to failing to identify a sexual abuse perpetrator, because such an inmate would be allowed to interact with the general inmate population instead of being held in administrative segregation. In addition, we consider only the first two clusters, predator and victim, and no mixed classes.

Cluster Recall Precision F-measure
1 44.00% 32.35% 0.37
2 66.67% 75.75% 0.71
Overall 55.34% 54.05% 0.54
Figure 4.5 – Recall, Precision, F-Measure for Euclidean Single Link Clustering

Cluster Recall Precision F-measure
1 64.00% 22.22% 0.33
2 13.33% 70.58% 0.22
Overall 38.67% 46.40% 0.28
Figure 4.5 – Recall, Precision, F-Measure for Pearson’s Coefficient Single Link Clustering

From these results, it appears that, despite the chaining effect, Euclidean distance is the superior distance measure when clustering via hierarchical agglomerative methods.

Our next test involved k-means clustering. We ran the algorithm with 2 to 10 clusters and measured the quality of the clusters using the Bayesian information criterion (a criterion for model selection among a class of parametric models) and the sum of squared errors, defined as:

BIC = k ln(n) – 2 ln(L)

SSE = Σj Σti∈Cj d(ti, mj)²

where x is the observed data, n is the number of data points in x, k is the number of free parameters to be estimated, p(x|k) is the likelihood of the observed data given the number of parameters, L is the maximized value of the likelihood function, and mj is the mean of cluster Cj.
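Given a clustering and its cluster means, the SSE can be computed directly; the BIC additionally requires the maximized likelihood, so only the SSE is sketched here.

```java
// Sum of squared errors for a clustering: for each point, the squared
// Euclidean distance to its assigned cluster's mean, summed over all points.
public class ClusterSSE {
    public static double sse(double[][] points, int[] assign, double[][] means) {
        double total = 0;
        for (int i = 0; i < points.length; i++) {
            double[] m = means[assign[i]];
            for (int j = 0; j < points[i].length; j++)
                total += (points[i][j] - m[j]) * (points[i][j] - m[j]);
        }
        return total;
    }
}
```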

k Clusters Members BIC Score SSE
2 Cluster 1: 2, 5, 10, 12, 14, 15, 16, 20, 21, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 41, 43, 44, 46, 47, 50, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 92, 94, 98, 99, 100
Cluster 2: 1, 3, 4, 6, 7, 8, 9, 11, 13, 17, 18, 19, 24, 26, 27, 28, 30, 32, 36, 38, 39, 42, 45, 48, 49, 52, 53, 54, 55, 58, 62, 64, 65, 67, 68, 69, 72, 74, 76, 82, 89, 91, 93, 95, 96, 97 144,666.03 223,893.45
3 Cluster 1: 1, 3, 4, 6, 7, 9, 11, 13, 18, 21, 24, 30, 32, 36, 38, 41, 42, 44, 45, 47, 48, 49, 50, 52, 53, 54, 55, 58, 62, 64, 65, 67, 68, 69, 74, 78, 82, 93, 95, 96, 97, 98
Cluster 2: 2, 5, 10, 12, 14, 15, 16, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 92, 94, 99, 100
Cluster 3: 8, 17, 19, 26, 27, 28, 39, 72, 76, 89, 91 142,191.31 190,017.27
4 Cluster 1: 8, 17, 26, 28, 39, 72, 76, 89,
Cluster 2: 3, 4, 6, 19, 27, 36, 42, 45, 49, 55, 62, 67, 91, 95, 97
Cluster 3: 2, 5, 10, 12, 14, 15, 16, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100
Cluster 4: 1, 7, 9, 11, 13, 18, 21, 24, 30, 32, 38, 41, 44, 47, 48, 50, 52, 53, 54, 58, 64, 65, 68, 69, 74, 78, 79, 82, 84, 87, 93, 96, 98 143,341.18 174,843.29
5 Cluster 1: 8, 17, 24, 26, 38, 65, 72, 89
Cluster 2: 2, 10, 12, 13, 15, 16, 21, 22, 23, 29, 30, 32, 33, 37, 40, 41, 44, 47, 48, 50, 51, 52, 56, 71, 73, 77, 78, 79, 84, 87, 90, 93, 96, 98
Cluster 3: 5, 14, 20, 25, 31, 34, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 75, 80, 81, 83, 85, 86, 88, 92, 94, 99, 100
Cluster 4: 7, 28, 36, 39, 76
Cluster 5: 1, 3, 4, 6, 9, 11, 18, 19, 27, 42, 45, 49, 53, 54, 55, 58, 62, 64, 67, 68, 69, 74, 82, 91, 95, 97 141,884.73 152,909.21
6 Cluster 1: 8, 17, 24, 26, 38, 65, 72, 89
Cluster 2: 2, 5, 10, 12, 14, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100
Cluster 3: 1, 7, 13, 15, 16, 21, 30, 32, 41, 44, 47, 48, 50, 52, 53, 54, 58, 64, 68, 69, 73, 74, 78, 79, 82, 84, 87, 93, 96, 98
Cluster 4: 4, 6, 18, 19, 36, 39, 42, 45, 49, 62, 67, 91, 97
Cluster 5: 28, 76
Cluster 6: 3, 9, 11, 27, 55, 95 141,230.52 144,776.69
7 Cluster 1: 2, 10, 12, 13, 15, 16, 21, 22, 23, 29, 30, 31, 33, 37, 40, 41, 44, 47, 48, 50, 51, 56, 73, 77, 78, 79, 84, 87, 90, 93, 94, 98
Cluster 2: 5, 14, 20, 25, 34, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 71, 75, 80, 81, 83, 85, 86, 88, 92, 99, 100,
Cluster 3: 4, 6, 9, 11, 18, 19, 42, 45, 49, 53, 55, 58, 62, 64, 67, 68, 74, 82, 95, 96, 97
Cluster 4: 28, 76,
Cluster 5: 3, 27, 91
Cluster 6: 1, 7, 32, 36, 39, 52, 54, 69
Cluster 7: 8, 17, 24, 26, 38, 65, 72, 89 140,546.79 135,787.72
8 Cluster 1: 3, 27, 95
Cluster 2: 1, 7, 24, 30, 32, 38, 53, 54, 58, 64, 65, 68, 69, 74, 82
Cluster 3: 39
Cluster 4: 9, 11, 18, 42, 62, 67,
Cluster 5: 4, 6, 19, 36, 45, 49, 55, 91, 97
Cluster 6: 5, 10, 12, 14, 20, 22, 23, 25, 29, 31, 34, 35, 37, 40, 43, 46, 56, 57, 59, 60, 61, 63, 66, 70, 71, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100
Cluster 7: 8, 17, 26, 28, 72, 76, 89
Cluster 8: 2, 13, 15, 16, 21, 33, 41, 44, 47, 48, 50, 51, 52, 73, 78, 79, 84, 87, 93, 96, 98 140,510.19 138,410.06
9 Cluster 1: 3, 9, 11, 74, 82, 95
Cluster 2: 15, 29, 30, 31, 33, 34, 40, 41, 44, 47, 50, 73, 78, 79, 84, 87, 90
Cluster 3: 8, 17, 24, 38, 65, 72, 89,
Cluster 4: 26, 28, 39, 76
Cluster 5: 5, 14, 20, 25, 35, 46, 57, 59, 60, 61, 63, 66, 70, 75, 80, 81, 83, 85, 86, 88, 92, 100
Cluster 6: 4, 6, 18, 19, 27, 36, 42, 45, 49, 55, 62, 67, 91, 97
Cluster 7: 2, 10, 16, 21, 22, 37, 43, 51, 71, 77, 94, 98, 99
Cluster 8: 1, 7, 13, 32, 48, 52, 53, 54, 58, 64, 68, 69, 93, 96
Cluster 9: 12, 23, 56 140,889.59 126,846.88
10 Cluster 1: 28, 76
Cluster 2: 91
Cluster 3: 30, 32, 53, 54, 58, 64, 68, 69, 74, 82
Cluster 4: 5, 14, 20, 25, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 71, 75, 80, 81, 83, 85, 86, 88, 92, 99, 100
Cluster 5: 27
Cluster 6: 9
Cluster 7: 3, 4, 6, 11, 18, 19, 42, 45, 49, 55, 62, 67, 95, 97
Cluster 8: 1, 7, 13, 21, 36, 39, 48, 52, 93, 96
Cluster 9: 2, 10, 12, 15, 16, 22, 23, 29, 31, 33, 34, 37, 40, 41, 44, 47, 50, 51, 56, 73, 77, 78, 79, 84, 87, 90, 94, 98
Cluster 10: 8, 17, 24, 26, 38, 65, 72, 89 140,789.73 117,432.09
Figure 4.6 – Cluster Members, BIC Score and SSE Score for Euclidean k-means Clustering

Cluster Recall Precision F-measure
1 48.00% 22.22% 0.30
2 17.30% 28.20% 0.21
Overall 32.65% 25.21% 0.26
Figure 4.7 – Recall, Precision, F-Measure for Euclidean k-means Clustering

In these results, we can see that as the number of clusters increases, the sum of squared errors decreases. This is because fewer dissimilar items are included in each cluster, so the clusters more accurately represent their true nature. We would also expect the precision, recall and F measure to increase as the number of clusters increases. However, it would become harder to interpret the actual “class” of the clusters as k increases, as we were instructed to disregard the class of each instance in the dataset. Additionally, we might achieve better results by using decision trees to identify the most influential personality test markers and then clustering on a subset of those markers.

Based on all of these results, while not highly accurate, clustering could give prison officials good insight into which attributes matter most in identifying who might be ongoing offenders.

5. Conclusion

In this research project, different methods of constructing clusters were explored. Additionally, different distance measures were implemented and then analyzed to see how they affected the accuracy of the clusters created.

While the results of this project show only a maximum of 67% accuracy, clustering is still a valid machine learning technique. With a more advanced algorithm and a larger dataset, clustering may be able to predict predators and victims at a much better rate.



In a previous post, I explored how one might apply classification to solve a complex problem. This post will explore the code necessary to implement that nearest neighbor classification algorithm. If you would like a full copy of the source code, it is available here in zip format. The listing below is the main driver of the code. To do the classification, we are essentially interested in finding the distance between the particular instance we are trying to classify and the other instances.  We then determine the classification of the instance we want from a “majority vote” of the k closest instances.  Each feature of an instance is a separate class that essentially just stores a continuous or discrete value, depending on whether you are using regression to classify your neighbors.  The additional feature classes and the file reader are left to the reader as an exercise.  Note that it would be fairly easy to weight features using this model if you want to give one feature more clout than another in determining the neighbors.

A nice visualization of the algorithm is provided by Kardi Teknomo. As you can see, we take the k closest instances and use a “majority vote” to classify the instance.  While this is an extremely simple method, it is great for noisy data and large data sets.  The two drawbacks are the O(n^2) running time and the fact that we have to determine k ahead of time.  However, despite this, as shown in the previous paper, the accuracy can be quite high.

import java.util.*;

public class Knn {
	public static final String PATH_TO_DATA_FILE = "";
	public static final int NUM_ATTRS = 9;
	public static final int K = 262;

	public static final int CATEGORY_INDEX = 0;
	public static final int DISTANCE_INDEX = 1;
	public static final int EXPIRATION_INDEX = 2;
	public static final int HANDSET_INDEX = 3;
	public static final int OFFER_INDEX = 4;
	public static final int WSACTION_INDEX = 5;
	public static final int NUM_RUNS = 1000;
	public static double averageDistance = 0;

	public static void main(String[] args) {
		ArrayList<Instance> instances = null;
		ArrayList<Neighbor> distances = null;
		ArrayList<Neighbor> neighbors = null;
		WSAction.Action classification = null;
		Instance classificationInstance = null;
		FileReader reader = null;
		int numRuns = 0, truePositives = 0, falsePositives = 0, falseNegatives = 0, trueNegatives = 0;
		double precision = 0, recall = 0, fMeasure = 0;

		// seed with one false positive to avoid division by zero when computing precision
		falsePositives = 1;

		reader = new FileReader(PATH_TO_DATA_FILE);
		instances = reader.buildInstances();

		do {
			classificationInstance = extractIndividualInstance(instances);

			distances = calculateDistances(instances, classificationInstance);
			neighbors = getNearestNeighbors(distances);
			classification = determineMajority(neighbors);

			System.out.println("Gathering " + K + " nearest neighbors to:");
			printClassificationInstance(classificationInstance);

			System.out.println("\nExpected situation result for instance: " + classification.toString());

			// (reconstructed) score the prediction against the instance's actual action:
			// count a match as a true positive, a miss as a false negative
			if(classification.toString().equals(((WSAction)classificationInstance.getAttributes().get(WSACTION_INDEX)).getAction().toString())) {
				truePositives++;
			}
			else {
				falseNegatives++;
			}

			numRuns++;
		} while(numRuns < NUM_RUNS);

		precision = (double)truePositives / (double)(truePositives + falsePositives);
		recall = (double)truePositives / (double)(truePositives + falseNegatives);
		// note: this omits the factor of 2 used in the standard F1 score
		fMeasure = (precision * recall) / (precision + recall);

		System.out.println("Precision: " + precision);
		System.out.println("Recall: " + recall);
		System.out.println("F-Measure: " + fMeasure);
		System.out.println("Average distance: " + (averageDistance / (double)(NUM_RUNS * K)));
	}

	public static Instance extractIndividualInstance(ArrayList<Instance> instances) {
		Random generator = new Random(new Date().getTime());
		int random = generator.nextInt(instances.size() - 1);

		Instance singleInstance = instances.get(random);

		return singleInstance;
	}

	public static void printClassificationInstance(Instance classificationInstance) {
		for(Feature f : classificationInstance.getAttributes()) {
			System.out.print(f.getName() + ": ");
			if(f instanceof Category) {
				System.out.print(((Category)f).getCategory().toString() + ", ");
			}
			else if(f instanceof Distance) {
				System.out.print(((Distance)f).getDistance().toString() + ", ");
			}
			else if (f instanceof Expiration) {
				System.out.print(((Expiration)f).getExpiry().toString() + ", ");
			}
			else if (f instanceof Handset) {
				System.out.print(((Handset)f).getDevice().toString() + ", ");
				System.out.print(((Handset)f).getOs().toString() + ", ");
			}
			else if (f instanceof Offer) {
				System.out.print(((Offer)f).getOfferType().toString() + ", ");
			}
			else if (f instanceof WSAction) {
				System.out.print(((WSAction)f).getAction().toString() + ", ");
			}
		}
		System.out.println();
	}

	public static void printNeighbors(ArrayList<Neighbor> neighbors) {
		int i = 0;
		for(Neighbor neighbor : neighbors) {
			Instance instance = neighbor.getInstance();

			System.out.println("\nNeighbor " + (i + 1) + ", distance: " + neighbor.getDistance());
			// same per-feature printout as printClassificationInstance
			printClassificationInstance(instance);
			i++;
		}
	}

	public static WSAction.Action determineMajority(ArrayList<Neighbor> neighbors) {
		int yea = 0, ney = 0;

		for(int i = 0; i < neighbors.size(); i++) {
			Neighbor neighbor = neighbors.get(i);
			Instance instance = neighbor.getInstance();
			if(instance.isRedeemed()) {
				yea++;
			}
			else {
				ney++;
			}
		}

		if(yea > ney) {
			return WSAction.Action.Redeem;
		}
		else {
			return WSAction.Action.Hit;
		}
	}

	public static ArrayList<Neighbor> getNearestNeighbors(ArrayList<Neighbor> distances) {
		ArrayList<Neighbor> neighbors = new ArrayList<Neighbor>();

		// distances arrive sorted ascending, so the first K entries are the nearest
		for(int i = 0; i < K; i++) {
			averageDistance += distances.get(i).getDistance();
			neighbors.add(distances.get(i));
		}

		return neighbors;
	}

	public static ArrayList<Neighbor> calculateDistances(ArrayList<Instance> instances, Instance singleInstance) {
		ArrayList<Neighbor> distances = new ArrayList<Neighbor>();
		Neighbor neighbor = null;
		int distance = 0;

		for(int i = 0; i < instances.size(); i++) {
			Instance instance = instances.get(i);
			distance = 0;
			neighbor = new Neighbor();

			// for each feature, go through and calculate the "distance"
			for(Feature f : instance.getAttributes()) {
				if(f instanceof Category) {
					Category.Categories cat = ((Category) f).getCategory();
					Category singleInstanceCat = (Category)singleInstance.getAttributes().get(CATEGORY_INDEX);
					distance += Math.pow((cat.ordinal() - singleInstanceCat.getCategory().ordinal()), 2);
				}
				else if(f instanceof Distance) {
					Distance.DistanceRange dist = ((Distance) f).getDistance();
					Distance singleInstanceDist = (Distance)singleInstance.getAttributes().get(DISTANCE_INDEX);
					distance += Math.pow((dist.ordinal() - singleInstanceDist.getDistance().ordinal()), 2);
				}
				else if (f instanceof Expiration) {
					Expiration.Expiry exp = ((Expiration) f).getExpiry();
					Expiration singleInstanceExp = (Expiration)singleInstance.getAttributes().get(EXPIRATION_INDEX);
					distance += Math.pow((exp.ordinal() - singleInstanceExp.getExpiry().ordinal()), 2);
				}
				else if (f instanceof Handset) {
					// there are two calculations needed here, one for device, one for OS
					Handset.Device device = ((Handset) f).getDevice();
					Handset singleInstanceHandset = (Handset)singleInstance.getAttributes().get(HANDSET_INDEX);
					distance += Math.pow((device.ordinal() - singleInstanceHandset.getDevice().ordinal()), 2);

					Handset.OS os = ((Handset) f).getOs();
					distance += Math.pow((os.ordinal() - singleInstanceHandset.getOs().ordinal()), 2);
				}
				else if (f instanceof Offer) {
					Offer.OfferType offer = ((Offer) f).getOfferType();
					Offer singleInstanceOffer = (Offer)singleInstance.getAttributes().get(OFFER_INDEX);
					distance += Math.pow((offer.ordinal() - singleInstanceOffer.getOfferType().ordinal()), 2);
				}
				else if (f instanceof WSAction) {
					WSAction.Action action = ((WSAction) f).getAction();
					WSAction singleInstanceAction = (WSAction)singleInstance.getAttributes().get(WSACTION_INDEX);
					distance += Math.pow((action.ordinal() - singleInstanceAction.getAction().ordinal()), 2);
				}
				else {
					System.out.println("Unknown feature in distance calculation.  Exiting for debug: " + f);
					System.exit(1);
				}
			}

			neighbor.setDistance(distance);
			neighbor.setInstance(instance);
			distances.add(neighbor);
		}

		// bubble sort the neighbors ascending by distance
		for (int i = 0; i < distances.size(); i++) {
			for (int j = 0; j < distances.size() - i - 1; j++) {
				if(distances.get(j).getDistance() > distances.get(j + 1).getDistance()) {
					Neighbor tempNeighbor = distances.get(j);
					distances.set(j, distances.get(j + 1));
					distances.set(j + 1, tempNeighbor);
				}
			}
		}

		return distances;
	}
}

Abstract—Recommendation systems take artifacts about items and provide suggestions to the user on what other products they might like. There are many different types of recommender algorithms, including nearest-neighbor, linear classifiers and SVMs. However, most recommender systems are collaborative systems that rely on users to rate the products that they bought. This paper presents an analysis of recommender systems using a mobile device and backend data points for a coupon delivery system.

Index Terms—machine learning, recommender systems, supervised learning, nearest neighbor, classification


Recommendations are a part of everyday life. An individual constantly receives recommendations from friends, family, salespeople and Internet resources, such as online reviews. We want to make the most informed choices possible about decisions in our daily life. For example, when buying a flat screen TV, we want to have the best resolution, size and refresh rate for the money. There are many factors that influence our decisions – budget, time, product features and most importantly, previous experience. We can analyze all the factors that led up to the decision and then make a conclusion or decision based on those results. A recommender system uses historical data to recommend new items that may be of interest to a particular user.

Coupious is a mobile phone application that gives its users coupons and deals for businesses around their geographic location. The application runs on the user’s cell phone and automatically retrieves coupons based upon GPS coordinates. Redemption of these coupons is as simple as tapping a “Use Now” button. Coupious is now available on the iPhone, iPod Touch, and Android platforms, and is currently deployed in Minneapolis, MN; West Lafayette, IN at Purdue University; and Berkeley, CA. Clip Mobile is the Canada-based version of Coupious that is currently available in Toronto.

Using push technology, it is possible to integrate a mobile recommendation system into Coupious. The benefit of this would be threefold: 1) offer customers the best possible coupons based on their personal spending habits – if a user feels they received a “good deal” with Coupious, they would be more likely to use it again and integrate it into their bargain shopping strategy, 2) offer businesses the ability to capitalize on their market demographics – the ability to reach individual customers to provide goods or services would drive more revenue and add value to the product, and 3) adding coupons to the service would immediately make the system more useful to a user, as it would present desirable, geographically proximate offers without extraneous ones.


In this research project, we evaluated the batch k-nearest neighbors algorithm in Java. We chose to implement the algorithm in Java because of the ease of use of the language. Java also presented superior capabilities for working with and parsing data from files, and allowed us to model the problem more efficiently through the use of OO concepts, such as polymorphism and inheritance. The kNN algorithm was originally suggested by Donald Knuth in [9], where it was referred to as the post-office problem of assigning residences to the nearest post office.

The goal of this research was to find a solution that we felt would be most successful in achieving the highest rate of coupon redemption. Presumably, achieving the highest rate of redemption requires learning what the user likes with the smallest error percentage. Additionally, we wanted to know whether increasing the attributes used for computation would affect the quality of the result set.

1.2 DATA

The data that we used is from the Coupious production database. Currently, there are approximately 70,000 rows of data (“nearby” queries, impressions and details impressions) and approximately 3,400 of those represent actual coupon redemptions. The data is an aggregate from March 25th, 2009 until February 11, 2010. The results are from a mixture of different cities where Coupious is currently in production.

From a logical standpoint, Coupious is simply a conduit through which a user may earn his discount and has no vested interest in whether or not a user redeems a coupon in a particular session. However, from a business standpoint, Coupious markets the product based on being able to entice sales through coupon redemption. Therefore, for classification purposes, sessions that ended in one or more redemptions will be labeled +1 and sessions that ended without redemption will be labeled -1.


While there hasn’t been any previous work in the space of mobile recommendation systems, there has been a large amount of work in the recommender systems space and classification. In [1], [2] and [3], direct marketing is studied using collaborative filtering. In [3], the authors use SVMs and latent class models to model predictions of whether or not a customer would be likely to buy a particular product. The most direct comparison of work would be in [1] and [8], where SVMs and linear classifiers are used to cluster content driven recommender systems.


Broadly, recommender systems can be grouped into two categories, content-based and collaborative based. In content-based systems, the recommendations are based solely on the attributes of the actual product. For example, in Coupious, attributes of a particular coupon redemption include the distance from the merchant when the coupon was used, the date and time of the redemption, the category of the coupon, the expiry and the offer text. These attributes are physical characteristics of the coupon. Recommendations can be made to users without relying on any experience-based information.

In collaborative systems, recommendations are based not only on product attributes but on the overlap of preferences of “like-minded” people, first introduced by Goldberg et al. [5]. For example, a user is asked to rate how well they liked the product or give an opinion of a movie. This provides the algorithm a baseline of preference for a particular user, which allows the algorithm to associate product attributes with a positive or negative response. Since Coupious does not ask for user ratings, this paper will focus exclusively on content-based applications.

Many content-based systems have similar or common attributes. As stated in [3], the “central problem of content-based recommendation is identifying a sufficiently large set of key attributes.” If the attribute set is too small, there may not be enough information for the program to build a profile of the user. Conversely, if there are too many attributes, the program won’t be able to identify the truly influential attributes, which leads to poor performance [6]. Also, while the label for a particular feature vector with Coupious data is +1 or -1, many of the features in the data are multi-valued attributes (such as distance, date-time stamps, etc.), which may be hard to represent in a binary manner, if the algorithm requires it.

In feature selection, we are “interested in finding k of the d dimensions that give us the most information and accuracy and we discard the other (d – k) dimensions [4].” How can we find the attributes that will give us the most information and accuracy? For Coupious, the attribute set is quite limited. For this research, we explicitly decided the features that will contribute to our recommendations. In all cases, all attributes were under consideration while the algorithm was running and we never partitioned the attributes for different runs to create different recommendations.


As shown in [7], “Once the clustering is complete, performance can be very good, since the size of the group that must be analyzed is much smaller.” The nearest neighbor algorithm is such a classification algorithm. K-nearest neighbors is one of the simplest machine learning algorithms: an instance is classified by a majority vote of the closest training examples in the hypothesis space. This algorithm is an example of instance-based learning, where previous data points are stored and interpolation is performed to find a recommendation. An important distinction to make is that kNN does not try to create a model when it is given test data, but rather performs the computation when tested. The algorithm works by attempting to find the previously seen data points that are “closest” to the query data point and then uses their outputs for prediction, measuring closeness with the standard Euclidean distance, d(x, y) = sqrt(Σ (x_i − y_i)^2).

The algorithm is shown below:

KNN (Examples, Target_Instance)

  • Determine the parameter K = number of nearest neighbors.
  • Calculate the distance between Target_Instance and all the Examples based on the Euclidean distance.
  • Sort the distances and determine the nearest neighbors based on the Kth minimum distance.
  • Gather the Category Y of the nearest neighbors.
  • Use simple majority of the category of nearest neighbors as the prediction value of the Target_Instance.
  • Return classification of Target_Instance.
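The steps above can be condensed into a self-contained sketch. The 1-D feature values and boolean redemption labels below are illustrative assumptions; the production implementation works over the full attribute set:

```java
import java.util.*;

// Minimal kNN sketch over 1-D examples: compute distances, sort, take the
// K nearest, and return the majority vote. Data here is hypothetical.
public class KnnSketch {
    public static boolean classify(double[] examples, boolean[] labels, double target, int k) {
        Integer[] idx = new Integer[examples.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort indices by Euclidean distance to the target (absolute difference in 1-D)
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(examples[i] - target)));
        int yea = 0, ney = 0;
        for (int i = 0; i < k; i++) {            // gather the K nearest neighbors
            if (labels[idx[i]]) yea++; else ney++;
        }
        return yea > ney;                        // simple majority vote
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 1.5, 2.0, 8.0, 9.0};
        boolean[] redeemed = {true, true, true, false, false};
        System.out.println(classify(xs, redeemed, 1.2, 3)); // true: nearest 3 are redemptions
    }
}
```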

This algorithm lends itself well to the Coupious application. When a user uses the Coupious application, they want it to launch quickly and present a list of coupons within 10-15 seconds. Since this algorithm is so simple, we are able to calculate coupons that the user might enjoy fairly quickly and deliver them to the handset. Also, because Coupious doesn’t know any personal details about the user, the ability to cluster users into groups without the need for any heavy additional implementation on the front or back ends of the application is advantageous.

There are several possible problems with this approach. While this is a simple algorithm, it doesn’t take high feature dimensionality into account. If there are many features, the algorithm will have to perform a lot of computations to create the clusters. Additionally, each attribute is given the same degree of influence on the recommendation. In Coupious, the date and time the coupon was redeemed has the same amount of bearing on the recommendation as does the current distance from the offer and previous redemption history. The features may not scale compared to their importance, and the performance of the recommendation may be degraded by irrelevant features. Lastly, [6] describes the curse of dimensionality: if a product has 20 attributes, only 2 of which are actually useful, classification may be completely different when considering all 20 attributes but identical when considering only the 2 relevant attributes.


While implementing the k-nearest neighbors algorithm, we decided to evaluate instances based upon seven different attributes, including coupon category, distance, expiration, offer, redemption date, handset and handset operating system, and upon the session result, a “hit” or “redemption.” The offer and expiration attributes required extra processing before they were able to be used for clustering. In both of these attributes, a “bag of words” implementation (applying a Naive Bayes classifier to the text to determine the classification) was used to determine what type of offer the coupon had made.

For the OFFER attribute, we split the value space into 4 discrete values, PAYFORFREE (an instance where the customer had to pay some initial amount to receive a free item), PERCENTAGE (an instance where the customer received some percentage discount), DOLLAR (an instance where a customer received a dollar amount discount) and UNKNOWN (an instance where the classification was unknown). While the PERCENTAGE and DOLLAR values can compute to exactly the same discount, consumers tend to react differently when seeing percentage discounts versus dollar discounts (i.e. 50% off instead of a $5 discount, even though they equate to an identical discount if the price were $10).
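As a sketch of this discretization (the keyword patterns below are illustrative assumptions, not the production parsing rules):

```java
// Sketch: discretizing free-form offer text into the four OFFER values.
// The keyword patterns are illustrative assumptions, not the production rules.
public class OfferParser {
    enum OfferType { PAYFORFREE, PERCENTAGE, DOLLAR, UNKNOWN }

    public static OfferType classify(String offerText) {
        String text = offerText.toLowerCase();
        if (text.contains("% off") || text.contains("percent")) return OfferType.PERCENTAGE;
        if (text.contains("$") && text.contains("off")) return OfferType.DOLLAR;
        if (text.contains("buy") && text.contains("free")) return OfferType.PAYFORFREE;
        return OfferType.UNKNOWN;               // fall through when nothing matches
    }

    public static void main(String[] args) {
        System.out.println(classify("50% off any entree"));           // PERCENTAGE
        System.out.println(classify("$5 off your purchase"));         // DOLLAR
        System.out.println(classify("Buy one coffee, get one free")); // PAYFORFREE
    }
}
```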

For the expiration attribute, some text parsing was done to determine what type of expiration the coupon had (DATE, USES, NONE or UNKNOWN). If the parsing detected a date format, we used DATE, and if the parsing detected that it was a limited usage coupon (either limited by total uses across a population, herein known as “global,” or limited by uses per customer, herein known as “local”), we designated that the coupon was USES. If the coupon was valid indefinitely, we designated that the coupon expiration was NONE. We did not distinguish by how far in the future the expiration date was, by the type of limited usage (global or local), or by the number of uses left.

An important detail to note is that, in our implementation of kNN, if an attribute was unknown, we declined to use it in the nearest neighbor computation. The UNKNOWN value was typically the last value in a Java enumeration and would therefore be assigned a high integer value, which would skew the results unnecessarily if used for computation.

For the remaining attributes (OFFERTIME, CATEGORY, DISTANCE, HANDSET, which considers handset model and OS and ACTION), the value space of each attribute was divided over the possible discrete values for that attribute according to Table 2.2.

Attribute Possible Values
OFFERTIME Morning, Afternoon, Evening, Night, Unknown
CATEGORY Entertainment, Automotive, Food & Dining, Health & Beauty, Retail, Sports & Recreation, Travel, Clothing & Apparel, Electronics & Appliances, Furniture & Decor, Grocery, Hobbies & Crafts, Home Services, Hotels & Lodging, Nightlife & Bars, Nonprofits & Youth, Office Supplies, Other, Pet Services, Professional Services, Real Estate, Unknown
DISTANCE Less than 2 miles, 2 to 5 miles, 5 to 10 miles, 10 to 20 miles, 20 to 50 miles, 50 to 100 miles, Unknown
HANDSET Device: iPhone, iPod, G1, Hero, myTouch, Droid, Unknown
OS: iPhone, Android, Unknown
ACTION Redeem, Hit, Unknown

Table 2.2

In the case of the handset OS, iPhone and iPod were classified together as the iPhone OS as they are the same OS with different build targets. It is important to note that kNN works equally well with continuous valued attributes as well as discrete-valued attributes. For further discussion on using kNN with real-valued attributes, see [10].


There were several problems that were encountered while implementing kNN. The first problem was “majority voting,” the last step the algorithm uses to classify an instance, and an inherent problem with the way the kNN algorithm works. If a particular class dominates the training data, it will skew the votes towards that class, since we are only considering data at a local level, that is, the distance from our classification instance to the nearest data points. There are two ways to solve this problem: 1) balancing the dataset, or 2) weighting the neighboring instances. Balancing is the simplest technique, where an equal proportion of each class is present in the training data. A more complicated, but more effective, method is to weight the neighboring instances such that neighbors that are further away have a smaller weight value than closer neighbors.
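The distance-weighted alternative can be sketched as follows (the 1/(d + 1) weighting and the flat neighbor representation are assumptions for illustration):

```java
// Sketch: distance-weighted majority voting, where a neighbor's vote counts
// 1/(d + 1), so far-away neighbors carry less weight than close ones.
public class WeightedVote {
    public static boolean vote(double[] distances, boolean[] redeemed) {
        double yea = 0, ney = 0;
        for (int i = 0; i < distances.length; i++) {
            double weight = 1.0 / (distances[i] + 1.0); // +1 avoids division by zero
            if (redeemed[i]) yea += weight; else ney += weight;
        }
        return yea > ney;
    }

    public static void main(String[] args) {
        // two distant redemptions vs. one very close non-redemption
        double[] d = {9.0, 9.0, 0.0};
        boolean[] r = {true, true, false};
        System.out.println(vote(d, r)); // false: the close neighbor outweighs the far pair
    }
}
```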

Determining “k” beforehand is a troublesome problem. This is due to the fact that, if “k” is too large, the separation between classifications becomes blurred. This could cause the program to group two clusters together that are, in fact, distinct clusters themselves. However, a large “k” value does reduce the effect of noise by including more samples in the hypothesis. If “k” is too small, we might not calculate a good representation of the sample space because our results would be too local. In either case, if the target attribute is binary, “k” should be an odd number to avoid ties in majority voting.

Also, kNN has a high computation cost because it has to compute the distance from the query point to every other point. The time-complexity of each classification is O(n) distance calculations, which wouldn’t scale well to hypothesis spaces with millions of instances.

Discretizing non-related attributes, like category, presented a unique challenge. When considering continuous attributes, such as distance, it was easy to discretize the data. In the case of Coupious, the distance ranges were already defined in the application, so we just had to translate those over to our classification algorithm. However, some attributes, such as category, while related at an attribute level (in that they were all categories of coupon), had no natural ordering. In this case, we simply assigned them increasing integer values.

Another problem that we encountered is the sparsity problem. Since we implemented a content-driven model, the degree of accuracy relied upon how much data we had about a particular end user to build their profile. If the customer only had one or two sessions ending in no redemptions, it might not be possible to achieve any accuracy about this person. We dealt with this problem by artificially creating new records to supplement previous real records.

Coupious relies on the GPS module inside the smart phone to tell us where a user is currently located. From that position, the user gets a list of coupons that are close to the user’s location. However, there are no safeguards in place to guarantee that a redemption is real. A curious user may attempt to redeem a coupon when he is not at the actual merchant location. In the data, we can account for this by calculating the distance from the merchant at the time of the redemption request. However, this GPS reading may be inaccurate, as the GPS module can adjust its accuracy to save power and battery life.
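The merchant-distance check described above can be sketched with the standard haversine formula; the coordinates in the example are hypothetical:

```java
// Sketch: great-circle distance between a redemption's GPS fix and the
// merchant's location, using the standard haversine formula.
public class GeoCheck {
    static final double EARTH_RADIUS_MILES = 3958.8;

    public static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // flag a redemption whose fix is implausibly far from the merchant
        // (hypothetical points roughly in the Twin Cities area)
        double miles = haversineMiles(44.9778, -93.2650, 44.9537, -93.0900);
        System.out.println(miles > 1.0); // more than a mile away: suspicious
    }
}
```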

Lastly, accuracy is somewhat limited because of the fact that we are using a content-only model. There is no way to interact with the user to ask if the recommendations are truly useful. To achieve this additional metric would require major changes to the application and the backend systems that are outside the scope of this research paper.

Despite the multiple problems with kNN, it is quite robust to noisy data, as indicated in section 4, which makes it well-suited for this classification task, as we can only verify the reasonableness of the data, not the integrity.


Testing of the kNN logic was carried out using a 3-fold cross validation method, where the data was divided into training and test sets such that |R| = k × |S|, where R is the training set, S is the test set, and k is their relative size.
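A split satisfying |R| = k × |S| can be sketched as follows (integer indices stand in for instances; the shuffle seed is arbitrary):

```java
import java.util.*;

// Sketch: dividing data into training set R and test set S with |R| = k * |S|,
// here for k = 3 (a 3-fold style split). Indices stand in for instances.
public class SplitSketch {
    public static List<List<Integer>> split(List<Integer> data, int k) {
        List<Integer> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(7)); // fixed seed for repeatability
        int testSize = shuffled.size() / (k + 1);     // |S| such that |R| = k * |S|
        List<Integer> test = shuffled.subList(0, testSize);
        List<Integer> train = shuffled.subList(testSize, shuffled.size());
        return Arrays.asList(new ArrayList<>(train), new ArrayList<>(test));
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 120; i++) data.add(i);
        List<List<Integer>> parts = split(data, 3);
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size()); // 90 / 30
    }
}
```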

For each test run, we chose 4,000 training examples and 1 test example at random. We attempted to keep the labeled classes in the training set as balanced as possible by setting a threshold n. This threshold prevented the classes from becoming unbalanced by more than a difference of n. If the threshold n was ever reached, we discarded random selections until we were under the threshold again, thereby keeping the classes balanced. Discussion of the results is in section 5.
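The threshold-based balancing can be sketched like this (the pool, sample size, and threshold n = 2 below are hypothetical values):

```java
import java.util.*;

// Sketch: drawing a class-balanced training sample, discarding random picks
// whenever the class counts would drift apart by more than a threshold n.
public class BalancedSampler {
    public static List<Boolean> sample(List<Boolean> labels, int size, int n, Random rng) {
        List<Boolean> picked = new ArrayList<>();
        int pos = 0, neg = 0;
        while (picked.size() < size) {
            boolean label = labels.get(rng.nextInt(labels.size()));
            // discard the pick if it would unbalance the classes beyond n
            if (label && pos - neg >= n) continue;
            if (!label && neg - pos >= n) continue;
            if (label) pos++; else neg++;
            picked.add(label);
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Boolean> labels = new ArrayList<>();
        for (int i = 0; i < 100; i++) labels.add(i < 90); // 90/10 imbalanced pool
        List<Boolean> train = sample(labels, 40, 2, new Random(42));
        long pos = train.stream().filter(b -> b).count();
        System.out.println(Math.abs(2 * pos - train.size()) <= 2); // within the threshold
    }
}
```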


Even though kNN is a simplistic algorithm, the classification results were quite accurate. To test the algorithm, we performed 10,000 independent runs where an equal number of “hit” and “redemption” rows were selected at random (2,000 of each so as to keep the inductive bias as fair as possible). An individual classification instance was chosen at random from that set of 4,000 instances and was then classified according to its nearest neighbors.


When evaluating the results, one main factor affected the recall: the size k of the neighborhood considered in the calculation (see Table 5.1). In the 10,000 independent runs using random subsets of data for each test, the overall recall for the kNN algorithm was 85.32%. The average F-measure for these runs was 0.45.

However, if one considers a subset of results with k-size less than or equal to 15, the average recall was much higher – 94%. After a k-size greater than 30, we see a significant drop-off of recall. This can be attributed to the fact that the groups are becoming less defined because, as k-size grows, the nodes that are being used are “further” away from the classification instance, and therefore the results are not as “good.”


When broken down into smaller sets of 1,000 runs, the recall and F-measure vary greatly. In 16 runs, we had a range from 29.40% recall to 98.20% recall. The median recall was 92.70%.

Run K Size Recall F-measure Avg. Distance
1 1 98.20% 0.49 0.18
2 2 97.80% 0.49 0.23
3 3 95.80% 0.48 0.49
4 4 95.70% 0.48 0.44
5 5 95.60% 0.48 0.46
6 6 95.20% 0.48 0.59
7 7 93.10% 0.48 0.48
8 8 92.70% 0.47 0.69
9 9 94.20% 0.48 0.82
10 10 93.40% 0.48 0.88
11 15 91.70% 0.48 1.21
12 30 87.50% 0.47 2.16
13 50 85.10% 0.46 3.63
14 100 70.80% 0.42 7.57
15 200 48.90% 0.32 16.57
16 500 29.40% 0.22 31.64
Overall 85.32% 0.45


Table 5.1

There is a large gap between our maximum and our minimum recall. This can be attributed mainly to our k-size. Based on these results, we would feel fairly confident that we could make a useful prediction, but only if we had a confidence rating on the results (due to the high variability of the results).

These results could provide good insight into what k-size or neighbors are the most influential in suggesting a coupon. This would allow us to more carefully target Coupious advertising based on the user.

When could one consider these results valid? If the average distance of a classification instance to the examples is below a threshold such that the classification truly reflects its neighbors, we could consider the results valid. Another form of validating the results could be pruning the attribute set. If we were able to prune away attributes that didn’t affect the recall in a negative manner, we would be left with a set of attributes that truly influence the customers’ purchasing behavior, although this task is better suited for a decision or regression tree. An improvement to this kNN algorithm might come in the form of altering the k-size based upon the population or attributes that we are considering.


In this research project, we have implemented the kNN algorithm for recommender systems. The algorithm for the nearest neighbor was explored and several problems were identified and overcome. Different techniques were investigated to improve the accuracy of the system. The results of this project show an overall accuracy of 85.32%, which makes kNN an excellent, simple technique for implementing recommender systems.



In a previous post, I explored how one might apply decision trees to solve a complex problem. This post will explore the code necessary to implement that decision tree. If you would like a full copy of the source code, it is available here in zip format. In the Entropy class, we are concerned with calculating the amount of entropy, or the amount of uncertainty or randomness in a particular variable. For example, consider a classifier with two classes, YES and NO. If a particular variable or attribute, say x, has three training examples of class YES and three training examples of class NO (for a total of six), the entropy would be 1. This is because there is an equal number of both classes for this variable, which is as mixed up as the set can get. Likewise, if x had all six training examples of a particular class, say YES, the entropy would be 0, because this variable would be pure, making it a leaf node in our decision tree.
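Those two boundary cases can be verified in a few lines. This is a standalone sketch, separate from the project's Entropy class, that computes two-class entropy directly from yes/no counts:

```java
public class EntropyCheck {
    // Two-class entropy, in bits, for a split of yes/no counts.
    static double entropy(int yes, int no) {
        int total = yes + no;
        double e = 0;
        for (int count : new int[]{yes, no}) {
            if (count > 0) {
                double p = count / (double) total;
                e += -p * (Math.log(p) / Math.log(2)); // log base 2 via change of base
            }
        }
        return e;
    }

    public static void main(String[] args) {
        System.out.println(entropy(3, 3)); // 1.0 -- maximally mixed
        System.out.println(entropy(6, 0)); // 0.0 -- pure, a leaf node
    }
}
```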

Entropy may be calculated in the following way:

import java.util.ArrayList;

public class Entropy {
	// Entropy, in bits, of a set of records with respect to the target
	// attribute, PlayTennis (stored at index 4 of each record).
	public static double calculateEntropy(ArrayList<Record> data) {
		double entropy = 0;
		if (data.size() == 0) {
			// nothing to do
			return 0;
		}
		for (int i = 0; i < Hw1.setSize("PlayTennis"); i++) {
			// count the records whose target value is i
			int count = 0;
			for (int j = 0; j < data.size(); j++) {
				Record record = data.get(j);
				if (record.getAttributes().get(4).getValue() == i) {
					count++;
				}
			}
			double probability = count / (double) data.size();
			if (count > 0) {
				entropy += -probability * (Math.log(probability) / Math.log(2));
			}
		}
		return entropy;
	}

	// Information gain: the root entropy minus the size-weighted
	// entropies of the subsets produced by the split.
	public static double calculateGain(double rootEntropy, ArrayList<Double> subEntropies, ArrayList<Integer> setSizes, int data) {
		double gain = rootEntropy;
		for (int i = 0; i < subEntropies.size(); i++) {
			gain += -((setSizes.get(i) / (double) data) * subEntropies.get(i));
		}
		return gain;
	}
}

The Tree class contains all our code for building our decision tree. Note that at each level, we choose the attribute that presents the best gain for that node. The gain is simply the expected reduction in entropy achieved by learning the state of the random variable A; gain is also known as Kullback-Leibler divergence. Gain can be calculated in the following way:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Notice that gain is calculated as a function of all the values of the attribute.
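As a concrete check of this calculation, take Mitchell's fourteen play-tennis examples (9 yes, 5 no) and split on Outlook: Sunny holds 2 yes/3 no, Overcast 4 yes/0 no, and Rain 3 yes/2 no. The following standalone sketch, separate from the project's classes, reproduces the textbook gain for that split:

```java
public class GainCheck {
    // Two-class entropy, in bits, from yes/no counts.
    static double entropy(int yes, int no) {
        int total = yes + no;
        double e = 0;
        for (int c : new int[]{yes, no}) {
            if (c > 0) {
                double p = c / (double) total;
                e += -p * (Math.log(p) / Math.log(2));
            }
        }
        return e;
    }

    // Gain(S, Outlook): root entropy minus the size-weighted entropies
    // of the Sunny, Overcast and Rain subsets.
    static double gainOutlook() {
        double root = entropy(9, 5); // entropy of the full set, ~0.940
        return root
                - (5 / 14.0) * entropy(2, 3)
                - (4 / 14.0) * entropy(4, 0)
                - (5 / 14.0) * entropy(3, 2);
    }

    public static void main(String[] args) {
        System.out.printf("Gain(S, Outlook) = %.3f%n", gainOutlook()); // ~0.247
    }
}
```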

import java.util.*;

public class Tree {
	public Node buildTree(ArrayList<Record> records, Node root, LearningSet learningSet) {
		int bestAttribute = -1;
		double bestGain = 0;
		root.setEntropy(Entropy.calculateEntropy(root.getData()));
		if (root.getEntropy() == 0) {
			// the node is pure; nothing left to split on
			return root;
		}
		// consider every attribute that has not been used yet
		for (int i = 0; i < Hw1.NUM_ATTRS - 2; i++) {
			if (!Hw1.isAttributeUsed(i)) {
				ArrayList<Double> entropies = new ArrayList<Double>();
				ArrayList<Integer> setSizes = new ArrayList<Integer>();
				// compute the entropy of each subset produced by splitting on i
				for (int j = 0; j < Hw1.setSize(Hw1.attrMap.get(i)); j++) {
					ArrayList<Record> subset = subset(root, i, j);
					if (subset.size() != 0) {
						entropies.add(Entropy.calculateEntropy(subset));
						setSizes.add(subset.size());
					}
				}
				double gain = Entropy.calculateGain(root.getEntropy(), entropies, setSizes, root.getData().size());
				if (gain > bestGain) {
					bestAttribute = i;
					bestGain = gain;
				}
			}
		}
		if (bestAttribute != -1) {
			Hw1.usedAttributes.add(bestAttribute);
			int setSize = Hw1.setSize(Hw1.attrMap.get(bestAttribute));
			// keep this node's branch value; only update which attribute it tests
			root.getTestAttribute().setName(Hw1.attrMap.get(bestAttribute));
			root.children = new Node[setSize];
			// create one child per value of the chosen attribute
			for (int j = 0; j < setSize; j++) {
				root.children[j] = new Node();
				root.children[j].setParent(root);
				root.children[j].setData(subset(root, bestAttribute, j));
				root.children[j].setTestAttribute(new DiscreteAttribute(Hw1.getLeafNames(bestAttribute, j), j));
			}
			for (int j = 0; j < setSize; j++) {
				buildTree(root.children[j].getData(), root.children[j], learningSet);
			}
		}
		return root;
	}

	// returns the records at this node whose attribute attr has the given value
	public ArrayList<Record> subset(Node root, int attr, int value) {
		ArrayList<Record> subset = new ArrayList<Record>();
		for (int i = 0; i < root.getData().size(); i++) {
			Record record = root.getData().get(i);
			if (record.getAttributes().get(attr).getValue() == value) {
				subset.add(record);
			}
		}
		return subset;
	}

	// placeholder: surrogate splits are not implemented
	public double calculateSurrogates(ArrayList<Record> records) {
		return 0;
	}
}

The DiscreteAttribute class enumerates the values each attribute can take:

public class DiscreteAttribute extends Attribute {
	public static final int Sunny = 0;
	public static final int Overcast = 1;
	public static final int Rain = 2;
	public static final int Hot = 0;
	public static final int Mild = 1;
	public static final int Cool = 2;
	public static final int High = 0;
	public static final int Normal = 1;
	public static final int Weak = 0;
	public static final int Strong = 1;
	public static final int PlayNo = 0;
	public static final int PlayYes = 1;

	enum PlayTennis { No, Yes }
	enum Wind { Weak, Strong }
	enum Humidity { High, Normal }
	enum Temp { Hot, Mild, Cool }
	enum Outlook { Sunny, Overcast, Rain }

	public DiscreteAttribute(String name, double value) {
		super(name, value);
	}

	public DiscreteAttribute(String name, String value) {
		super(name, value);
	}
}

The Hw1 class is our main driver class:

import java.util.*;

public class Hw1 {
	public static int NUM_ATTRS = 6;
	public static ArrayList<String> attrMap;
	public static ArrayList<Integer> usedAttributes = new ArrayList<Integer>();

	public static void main(String[] args) {
		populateAttrMap();
		Tree t = new Tree();
		ArrayList<Record> records;
		LearningSet learningSet = new LearningSet();
		// read in all our data
		records = FileReader.buildRecords();
		Node root = new Node();
		for (Record record : records) {
			root.getData().add(record);
		}
		t.buildTree(records, root, learningSet);
		// classify the thirteenth record against the learned tree
		traverseTree(records.get(12), root);
	}

	public static void traverseTree(Record r, Node root) {
		if (root.children != null) {
			double nodeValue = 0;
			// find the record's value for the attribute this node tests
			for (int i = 0; i < r.getAttributes().size(); i++) {
				if (r.getAttributes().get(i).getName().equalsIgnoreCase(root.getTestAttribute().getName())) {
					nodeValue = r.getAttributes().get(i).getValue();
					break;
				}
			}
			// descend into the child branch matching that value
			for (int i = 0; i < root.getChildren().length; i++) {
				if (nodeValue == root.children[i].getTestAttribute().getValue()) {
					traverseTree(r, root.children[i]);
					return;
				}
			}
		}
		// at a leaf every remaining record shares one class; use it as the prediction
		System.out.print("Prediction for Play Tennis: ");
		if (root.getData().size() > 0
				&& root.getData().get(0).getAttributes().get(4).getValue() == DiscreteAttribute.PlayNo) {
			System.out.println("No");
		}
		else {
			System.out.println("Yes");
		}
	}

	public static boolean isAttributeUsed(int attribute) {
		return usedAttributes.contains(attribute);
	}

	public static int setSize(String set) {
		if (set.equalsIgnoreCase("Outlook")) {
			return 3;
		}
		else if (set.equalsIgnoreCase("Wind")) {
			return 2;
		}
		else if (set.equalsIgnoreCase("Temperature")) {
			return 3;
		}
		else if (set.equalsIgnoreCase("Humidity")) {
			return 2;
		}
		else if (set.equalsIgnoreCase("PlayTennis")) {
			return 2;
		}
		return 0;
	}

	public static String getLeafNames(int attributeNum, int valueNum) {
		if (attributeNum == 0) {
			if (valueNum == 0) {
				return "Sunny";
			}
			else if (valueNum == 1) {
				return "Overcast";
			}
			else if (valueNum == 2) {
				return "Rain";
			}
		}
		else if (attributeNum == 1) {
			if (valueNum == 0) {
				return "Hot";
			}
			else if (valueNum == 1) {
				return "Mild";
			}
			else if (valueNum == 2) {
				return "Cool";
			}
		}
		else if (attributeNum == 2) {
			if (valueNum == 0) {
				return "High";
			}
			else if (valueNum == 1) {
				return "Normal";
			}
		}
		else if (attributeNum == 3) {
			if (valueNum == 0) {
				return "Weak";
			}
			else if (valueNum == 1) {
				return "Strong";
			}
		}
		return null;
	}

	public static void populateAttrMap() {
		// index order matches the column order of the data file
		attrMap = new ArrayList<String>();
		attrMap.add("Outlook");
		attrMap.add("Temperature");
		attrMap.add("Humidity");
		attrMap.add("Wind");
	}
}

The Node class holds our information in the tree:

import java.util.*;

public class Node {
	private Node parent;
	public Node[] children;
	private ArrayList<Record> data;
	private double entropy;
	private boolean isUsed;
	private DiscreteAttribute testAttribute;

	public Node() {
		data = new ArrayList<Record>();
		setTestAttribute(new DiscreteAttribute("", 0));
	}

	public void setParent(Node parent) { this.parent = parent; }
	public Node getParent() { return parent; }

	public void setData(ArrayList<Record> data) { this.data = data; }
	public ArrayList<Record> getData() { return data; }

	public void setEntropy(double entropy) { this.entropy = entropy; }
	public double getEntropy() { return entropy; }

	public void setChildren(Node[] children) { this.children = children; }
	public Node[] getChildren() { return children; }

	public void setUsed(boolean isUsed) { this.isUsed = isUsed; }
	public boolean isUsed() { return isUsed; }

	public void setTestAttribute(DiscreteAttribute testAttribute) { this.testAttribute = testAttribute; }
	public DiscreteAttribute getTestAttribute() { return testAttribute; }
}

FileReader, the least interesting class in the code, parses the comma-separated data file into records:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.StringTokenizer;

public class FileReader {
	public static final String PATH_TO_DATA_FILE = "";

	public static ArrayList<Record> buildRecords() {
		BufferedReader reader = null;
		ArrayList<Record> records = new ArrayList<Record>();

		try {
			File f = new File(PATH_TO_DATA_FILE);
			FileInputStream fis = new FileInputStream(f);
			reader = new BufferedReader(new InputStreamReader(fis));
			String line;
			Record r;
			ArrayList<DiscreteAttribute> attributes;
			// each line of the file is one comma-separated training example
			while ((line = reader.readLine()) != null) {
				StringTokenizer st = new StringTokenizer(line, ",");
				attributes = new ArrayList<DiscreteAttribute>();
				r = new Record();
				if (Hw1.NUM_ATTRS != st.countTokens()) {
					throw new Exception("Unknown number of attributes!");
				}
				String day = st.nextToken(); // the day column is not used as an attribute
				String outlook = st.nextToken();
				String temperature = st.nextToken();
				String humidity = st.nextToken();
				String wind = st.nextToken();
				String playTennis = st.nextToken();
				if (outlook.equalsIgnoreCase("overcast")) {
					attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Overcast));
				}
				else if (outlook.equalsIgnoreCase("sunny")) {
					attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Sunny));
				}
				else if (outlook.equalsIgnoreCase("rain")) {
					attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Rain));
				}
				if (temperature.equalsIgnoreCase("hot")) {
					attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Hot));
				}
				else if (temperature.equalsIgnoreCase("mild")) {
					attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Mild));
				}
				else if (temperature.equalsIgnoreCase("cool")) {
					attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Cool));
				}
				if (humidity.equalsIgnoreCase("high")) {
					attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.High));
				}
				else if (humidity.equalsIgnoreCase("normal")) {
					attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.Normal));
				}
				if (wind.equalsIgnoreCase("weak")) {
					attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Weak));
				}
				else if (wind.equalsIgnoreCase("strong")) {
					attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Strong));
				}
				if (playTennis.equalsIgnoreCase("no")) {
					attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayNo));
				}
				else if (playTennis.equalsIgnoreCase("yes")) {
					attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayYes));
				}
				// attach the parsed attributes to the record and keep it
				r.getAttributes().addAll(attributes);
				records.add(r);
			}
		}
		catch (IOException e) {
			System.out.println("Uh oh, got an IOException error: " + e.getMessage());
		}
		catch (Exception e) {
			System.out.println("Uh oh, got an Exception error: " + e.getMessage());
		}
		finally {
			if (reader != null) {
				try {
					reader.close();
				} catch (IOException ioe) {
					System.out.println("IOException error trying to close the file: " + ioe.getMessage());
				}
			}
		}
		return records;
	}
}

**EDIT:** The data file is only a simple text file that describes the learned attribute "play tennis" for a particular learning episode. I modeled my file off of Tom Mitchell's play-tennis example in his book "Machine Learning." Essentially, it contains attributes describing each learning episode: outlook (sunny, overcast, rain), temperature (hot, mild, cool), humidity (high, normal), wind (strong, weak) and play tennis (yes, no). Based on this information, one can trace the decision tree to derive the learned attribute. A sample decision tree is below. All you have to do is create a comma-delimited text file that describes each particular situation (make sure you match the columns in the text file to the parsing that occurs in the Java).
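For reference, a file following Mitchell's table would start like this. The lines are comma-separated, since the parser tokenizes on ",", and the column order matches what FileReader expects: day, outlook, temperature, humidity, wind, play tennis.

```
D1,Sunny,Hot,High,Weak,No
D2,Sunny,Hot,High,Strong,No
D3,Overcast,Hot,High,Weak,Yes
D4,Rain,Mild,High,Weak,Yes
```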

**EDIT 2:**

After so many requests, I have worked up a small data file based on the Tom Mitchell "Machine Learning" book. It follows his example data exactly and can be found here (rename the file to "" before running, as WordPress wouldn't let me upload a .data file due to "security restrictions"). A small note of caution: the above code was put together very quickly for purposes of my own learning. I am not claiming that it is by any means complete or fault tolerant; however, I do believe that the entropy and gain calculations are sound and correct.

All code owned and written by David Stites and published on this blog is licensed under MIT/BSD.