Abstract— With the Internet becoming a major resource for many individuals, businesses, universities and countries, there is an increasing amount of traffic on the Internet.  Some applications available on the Internet (such as online securities trading) need to provide stringent performance guarantees from their web servers.  Because Internet traffic is bursty and unpredictable, and the interactions between hardware and software in multi-tier systems are complex, a large degree of automation is needed in capacity planning.

Index Terms—capacity planning, multi-tier systems, requests, sessions, performance, autonomic computing, distributed systems

I. Introduction

The Internet has become an important and popular channel for a variety of services, such as news aggregation, online shopping, social networking and financial brokerage.  Many of the popular services on the Internet today, such as Facebook, Twitter, Amazon and eBay [2, 5, 9], are built on a generic multi-tier architecture.  Requests generated by users flow through each layer of this architecture, and each tier provides certain functionality to the following tier by executing part of the incoming request [5].  Such a multi-tier system typically includes 1) a web tier running a web server, such as Apache, to respond to incoming requests; 2) an application tier running an application container, such as Tomcat, which hosts the application; and 3) a backend database tier running database software such as MySQL.

Tiered Application [2]

Each of these tiers may be a single machine or may be distributed over a cluster of nodes.  In addition, these servers may be virtualized, so that multiple applications can be hosted and run on the same physical node.  Often, to ensure acceptable performance, especially in application hosting, service level agreements (SLAs) are put into place to govern the desired system performance.  An SLA can specify one or multiple parameters, such as required total uptime of a system, total throughput, average end-to-end delay of requests within a certain percentile, etc.  The SLA target is maintained by adding resources to each tier, and can also be maintained by removing resources when doing so does not violate the SLA.

Traffic on a computer network is inherently non-uniform and hard to predict, mainly due to the “think” time of the end user.  This “bursty” behavior is characterized by short, uneven spikes of peak congestion in the life of an application [4].  Popular services can experience dynamic and varying workloads that depend on popularity, time or date, or general demand.  In addition, Internet flash crowds can cause bursty and unpredictable workloads in each of these tiers.  For example, during the November 2000 holiday season, Amazon.com experienced a forty-minute downtime due to an overload [5].  These flash crowds can cause a significant deviation from the normal traffic profile of an application, affecting performance.

The performance of the system can be characterized by the total measured end-to-end delay of incoming requests and the workload they place on each tier.  Because the interactions between the software and hardware in each of these tiers can be quite complex, allocating sufficient resources (CPU, disk, network bandwidth, etc.) to each tier so that it does not saturate and become a bottleneck is often a difficult, lengthy and potentially error-prone process for human operators to model and estimate.  For example, if the performance of an application tier dominates the performance of the system, that tier becomes the bottleneck.  It is non-trivial to model the process (scheduling, algorithms, memory, I/O and CPU times) and the time it takes to execute the business logic in this tier, especially if resources are shared, as is common in distributed systems.  In addition, static allocation does not address Internet flash crowds and their potential to overload the system and jeopardize SLA compliance.

Capacity planning and autonomic resource management play an important role in modern data centers.  Service differentiation (quality of service) and performance isolation help these complex systems adapt to changing workloads, both in the system as a whole and within each tier.  In this paper, we present a survey of the models, techniques and results of autonomic resource management in large-scale, multi-tier systems.

II. Network Traffic Characteristics

To truly understand the need for capacity planning and the difficulties it presents in modeling the interactions between incoming requests and the hardware and software that service them, we have to understand network traffic characteristics.  Incoming session and request rates tend to fluctuate based on a variety of factors.  These varying workloads can be characterized as “bursty”, which is defined by [4] as short, uneven spikes of peak congestion during the lifetime of the system.  These traffic patterns deviate significantly from the average traffic arrival rates.

One way to characterize burstiness, as seen in [4], is to use the variable I to represent the index of dispersion.  The index of dispersion measures whether a set of observed occurrences is clustered or dispersed compared to a standard model.  The larger I is, the more dispersed the observed occurrences of bursty traffic are.  As Figure 1 shows, burstiness can aggregate into different periods, represented by different indices of dispersion.


Figure 1

A simple way to calculate the index of dispersion for a particular time window of length t is as follows:

I = Var(Nt) / E[Nt]

where Nt is the number of requests completed in a time window of t seconds (with t counted ignoring server idle time), Var(Nt) is the variance of the number of completed requests and E[Nt] is the mean number of requests completed during busy periods.  An astute reader can deduce that as t becomes large, if there are no bursty periods within t, the index of dispersion is low.
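
As an illustration, here is a minimal sketch of this calculation, assuming we already have a series of per-window completed-request counts (window boundaries and idle-time handling are simplified away):

```python
import numpy as np

def index_of_dispersion(completed_counts):
    """Index of dispersion I = Var(N_t) / E[N_t] over a series of
    per-window completed-request counts (idle time already excluded)."""
    counts = np.asarray(completed_counts, dtype=float)
    return counts.var() / counts.mean()

# Example: a smooth workload versus one with bursty windows.
smooth = [100, 102, 98, 101, 99, 100]
bursty = [20, 15, 400, 18, 350, 22]
print(index_of_dispersion(smooth))   # small I: close to a smooth, Poisson-like flow
print(index_of_dispersion(bursty))   # much larger I, indicating burstiness
```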

Now that bursty periods can be quantified via I, what causes burstiness?  [4] suggests that “burstiness in service times can be a result of a certain workload combination” and that it “can be caused by a bottleneck switch between tiers and can be a result of ‘hidden’ resource contention between the transactions of different types and across different tiers.”  That is, one server or tier may be lightly loaded during a particular period, but may become saturated in other periods when a large number of requests are processed.

This hidden resource contention between different types of transactions across different tiers can be extremely difficult to model, and even more troublesome for human operators to provision for correctly.  Autonomic provisioning of system resources is needed to optimize service metrics.

III. Self-Management

As previously mentioned, the major motivation behind autonomic resource management and capacity planning is to reduce human involvement in these difficult activities.  If systems are able to plan for the future and react to the present without operator input, the load is taken off system administrators and the system can respond more accurately to the situation at hand.  In order to do this, one has to define the goals and properties of such an autonomic system.  The main properties of an autonomic system, as cited in [1], are:

  • Self-configuration: autonomic systems configure themselves based on high-level goals that specify what is desired (e.g. SLAs), not how to achieve it.  Systems with self-configuration can install and set themselves up.
  • Self-optimization: autonomic systems optimize their use of resources, either proactively or reactively, in an attempt to improve service and meet SLAs.
  • Self-healing: autonomic systems can detect and diagnose problems at both high and low levels (software and hardware), and should also attempt to fix them.
  • Self-protection: autonomic systems should protect themselves from attacks, but also from trusted users trying to make changes.

One particularly important feature of an autonomic system is the ability to exhibit “proactive” behavior.  For example, software or hardware agents should be:

  • Autonomic: “Agents operate without direct intervention of humans and have some kind of control over their actions and internal state [1].”
  • Social: “Agents interact with other agents via some agent-communication languages [1].”
  • Reactive and proactive: “Agents are able to perceive their environment and respond in a timely fashion.  Agents do not simply act in response to their environment, but are able to exhibit goal-directed behavior by taking the initiative [1].”

In order to model these properties, IBM suggested a general autonomic control loop framework, MAPE-K, as detailed in [1].  MAPE-K draws clear boundaries in the model to classify the work taking place in the autonomic system.  In MAPE-K, there are managed elements, which represent any software or hardware resource managed by the autonomic agents.  Sensors collect information about managed elements and provide input to the MAPE-K loop so that the autonomic manager can decide on and execute changes.  The managers are typically additional software components configured with high-level goals, leaving the implementation of those goals up to the autonomic manager.  Effectors actually carry out the changes to the managed elements (these changes can be fine- or coarse-grained).  Each of these elements is discussed in the following sections.
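
To make the structure concrete, the following is a minimal sketch of a MAPE-K-style loop over the components just described; the class and method names (Sensor, Effector, AutonomicManager, and the metrics()/reconfigure() calls on the managed element) are illustrative assumptions, not an API from any surveyed system:

```python
class Sensor:
    """Collects metrics (e.g. latency, CPU) from a managed element."""
    def __init__(self, managed_element):
        self.managed_element = managed_element

    def read(self):
        return self.managed_element.metrics()


class Effector:
    """Carries out changes (fine- or coarse-grained) on a managed element."""
    def __init__(self, managed_element):
        self.managed_element = managed_element

    def apply(self, change):
        self.managed_element.reconfigure(change)


class AutonomicManager:
    """Monitor -> Analyze -> Plan -> Execute, sharing a Knowledge base."""
    def __init__(self, sensors, effectors, sla, knowledge=None):
        self.sensors, self.effectors = sensors, effectors
        self.sla = sla                     # high-level goal, e.g. target latency
        self.knowledge = knowledge or {}   # models, history, learned policies

    def monitor(self):
        return [s.read() for s in self.sensors]

    def analyze(self, observations):
        return any(o["latency"] > self.sla["latency"] for o in observations)

    def plan(self, violation):
        return [{"add_capacity": 1}] if violation else []

    def execute(self, changes):
        for effector, change in zip(self.effectors, changes):
            effector.apply(change)

    def step(self):
        observations = self.monitor()
        self.execute(self.plan(self.analyze(observations)))
```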

In the MAPE-K loop, there are five different components: monitor, analyze, plan, execute and knowledge.  Monitoring involves “capturing properties of the environment (physical or virtual) that are important to the self properties of the autonomic system [1].”  This information is captured from sensors in the system in two different ways: passively (using built-in tools to run an analysis on the system) or actively (engineering the software to monitor and improve performance).

In the planning phase of MAPE-K, the manager uses the monitoring data from the sensors to produce a series of changes to one or more managed elements.  These changes can be calculated by having the autonomic system keep state information about the managed elements and data, so that adaptation plans can be adjusted over time.  An additional benefit of keeping state information is that the system can maintain an architectural model that mirrors the actual system, so proposed changes can be verified to preserve system integrity before and after they are applied.  If any violations occur after applying the changes to the model, the changes can be aborted or rolled back to avoid damage or downtime to the system.

In the knowledge part of the model, the knowledge used to effect adaptation can come from humans, logs, sensor data, or day-to-day observation of a system’s behavior [1].  In this part of the model, there is a large space in which machine learning can be used to acquire knowledge about the system.  While [1] suggests reinforcement learning and Bayesian techniques, other authors suggest K-nearest neighbors (KNN) [2] and neural networks with fuzzy control [7].  This author would suggest that decision trees could be an effective method of acquiring the knowledge needed to effect the proposed plans.  Clustering could also be used to identify and classify previously similar plans, to see whether they succeeded or resulted in failure.

Finally, there are several different levels of autonomicity, including: Basic, Managed, Predictive, Adaptive and Fully Autonomic.

Now that we have seen a general framework for automated resource management, let’s continue to explore each component.

  1. Managed Elements

Managed elements in an autonomic system consist of all the hardware and software elements that can be managed by autonomic agents.  Because multi-tier systems can be very complex, there are multiple levels of detail at which one can view the system.

The first and highest level views the system as a black box.  At this level, we consider the end-to-end delay from when a request enters the system until its response is returned.  If there is congestion or delay caused by insufficient capacity at any tier, we are unable to know which tier’s capacity is causing the problem.  Changes in capacity allocation are typically fairly coarse-grained at this level.  [8] plans its allocation strategy at this level.

At the next level down, we no longer monitor system performance by a single delay metric.  We are now able to monitor the performance of individual tiers.  When congestion occurs at a single tier, we are able to target that tier and increase the allocated capacity of just that tier.  However, one must be careful that the increased capacity at the original tier does not trigger further allocations downstream.  [5] plans its allocation strategy at this level.

Lastly, at the most fine-grained level, we are able to collect statistics on individual components within a tier such as CPU usage for an application on a particular node.  Figure 2 shows an example of the black-box and per-tier paradigms.

Figure 2

  2. Autonomic Managers and Controllers

In the MAPE-K loop, the monitoring, analyzing and planning phases are carried out in the control plane, or autonomic manager.  The manager should adjust managed elements in the following fashion [2]:

  • Prompt. It should sense impending SLA violations accurately and quickly and trigger resource allocation requests as soon as possible to deal with load increases (or decreases).
  • Stable. It should avoid unnecessary oscillations in provisioning resources (adding and removing).

Keeping the two guidelines in mind, there are two important decisions that every autonomic manager will be making: 1) when to provision resources and 2) when a provisioning decision is made, how much of a particular resource should be provisioned.

When To Provision

When considering the question of “when to provision”, one must consider the timescale that is being evaluated.  There are two main methods of provisioning: predictively and reactively.  Predictive provisioning attempts to stay ahead of the workload, while reactive provisioning follows to correct for short term fluctuations.

In predictive provisioning, the autonomic system plans increased (or decreased) capacity for the future.  In real-world workloads, the system or application will typically see a peak in traffic during the middle of the day and a minimum during the middle of the night [5].  Other factors, such as recent news (breaking important stories) or seasons (holiday shopping), can affect this distribution of traffic.  By using predictive analysis, automated systems are able to provision sufficient capacity well in advance of any potential events that would cause SLA violations.

Prediction can be done via a number of different methods, including statistical analysis, machine learning, or simply past observations of workload.  This prediction is usually implemented in the control plane or autonomic manager of the system [2, 5, 6, 7, 8].  For example, in [2], a scheduler for each application monitors the application’s database tier and collects various metrics about system and application performance, such as average query throughput, average number of connections, read/write ratios and system statistics such as CPU, memory and I/O usage.  These metrics are reported back to the autonomic manager, and an adaptive filter predicts the future load based on the currently measured load information.  Each of these metrics is weighted to reflect the usefulness of that feature.  In addition, a KNN classifier determines whether an SLA is violated and redirects resource allocation to adjust the number of databases so that the tier is no longer in violation.  A resource allocation component decides how to map the resources dynamically.  [7] uses a self-adaptive neural fuzzy controller to decide upon the allocation of resources.  Moreover, [6] uses a model estimator, which automatically learns online a model of the relationship between an application’s resource allocation and its performance, and an optimizer, which predicts the resource allocation required to meet performance targets.  Lastly, in [5], the authors use a histogram of request arrival rates for each hour over several days.  Using that data, the peak workload is estimated as a high percentile of the arrival rate distribution for that hour.  By using these metrics, the application is able to predict shifting workloads in a sliding window.
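
As a rough sketch of the histogram-based approach attributed to [5], the following keeps per-hour arrival-rate observations over past days and takes a high percentile as the predicted peak; the class name, percentile and data values are illustrative:

```python
import numpy as np
from collections import defaultdict

class WorkloadPredictor:
    """Per-hour history of observed arrival rates over past days; the
    predicted peak for an hour is a high percentile of that history."""
    def __init__(self, percentile=95):
        self.history = defaultdict(list)   # hour -> list of observed rates
        self.percentile = percentile

    def observe(self, hour, arrival_rate):
        self.history[hour].append(arrival_rate)

    def predict_peak(self, hour):
        rates = self.history[hour]
        if not rates:
            return 0.0
        return float(np.percentile(rates, self.percentile))

# After several days of observations, provision for the predicted peak
# of the upcoming hour rather than for the recent average.
predictor = WorkloadPredictor()
for day_rate in [120, 150, 135, 180, 140]:      # arrival rates seen at 2pm
    predictor.observe(hour=14, arrival_rate=day_rate)
peak = predictor.predict_peak(hour=14)           # near the high end of observed rates
```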

Most of the papers surveyed use the error in prediction e(k) (that is, the difference between the predicted workload or arrival rate λpred(t) and the observed workload or arrival rate λobs(t) for a time period t) or the change in the prediction error Δe(k) = e(k) − e(k − 1) as input to their algorithms to help determine the next period’s allocation [5, 6, 7].  Since workloads and arrival rates often exhibit bursty overloads [4, 5], these parameters fine-tune the prediction model.

Proactive provisioning alone is not enough to make the system robust and immune to SLA violations.  For example, there may be errors in prediction if the workload or arrival rate deviates greatly from previous days.  As mentioned earlier, Internet flash crowds can cause spikes in requests that have the potential to congest and overload the system due to their bursty, unpredictable nature.  In these cases, the predictions would lag behind the actual events and there would be insufficient capacity.

Reactive provisioning can be used to quickly correct any deviations caused by these unpredicted events.  Reactive provisioning plans on a shorter time scale, perhaps on a several-minute basis.  If anomalies are detected, reactive provisioning can quickly allocate additional capacity to the affected tiers so they do not become a bottleneck.

In [5], the authors implement reactive provisioning by comparing the predicted session arrival rate λpred(t) and the observed arrival rate λobs(t) for a time period t.  If these two measurements differ by more than a threshold, corrective provisioning action is taken.  In [2], the resource manager monitors the average latency received from each workload scheduler during each sampling period.  The resource manager uses a smoothed latency average computed as an exponentially weighted mean of the form WL = α × L + (1 – α) × WL, where L is the current query latency.  The larger the α parameter, the more responsive the average is to the current latency.  In [6], the authors reactively provision limited resources by using an auto-regressive moving-average (ARMA) model, where two parameters a1(k) and a2(k) capture the correlation between the application’s past and present performance, and b0(k) and b1(k) are vectors of coefficients capturing the correlation between the current performance and the recent resource allocations.  Note that if the spike in traffic is large, it may require several rounds of reactive provisioning to bring capacity to an acceptable level.
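
A minimal sketch combining the two reactive triggers just described: the predicted-versus-observed arrival-rate threshold from [5], and the exponentially weighted latency average WL = α × L + (1 – α) × WL from [2].  Thresholds and parameter values are illustrative assumptions:

```python
def needs_corrective_provisioning(pred_rate, obs_rate, threshold=0.2):
    """React if the observed arrival rate exceeds the prediction by more
    than a relative threshold (flash crowd or prediction error)."""
    if pred_rate <= 0:
        return obs_rate > 0
    return (obs_rate - pred_rate) / pred_rate > threshold


class LatencyMonitor:
    """Exponentially weighted moving latency: WL = alpha*L + (1 - alpha)*WL."""
    def __init__(self, alpha=0.3, sla_latency=1.0):
        self.alpha = alpha
        self.sla_latency = sla_latency
        self.wl = 0.0

    def sample(self, latency):
        self.wl = self.alpha * latency + (1 - self.alpha) * self.wl
        return self.wl > self.sla_latency   # True -> trigger provisioning
```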

In [8], the authors consider provisioning resources based on “what-if” analysis.  They argue that most web applications consist of services that form a directed acyclic graph, which can be turned into a decision tree.  In their model, each tier is asked to predict, online, its future performance in the event it received an additional capacity unit or had one capacity unit removed.  The performance model in [8] uses the following equation to calculate response time:

where Rserver is the average response time of the service, n is the number of CPU cores assigned to the service, λ is the average request rate and Sserver is the mean service time of the server.  When a threshold of service time is exceeded, the service time is re-computed.  These predictions are given to the parent node.  Using these predictions, provisioning agents negotiate resources with each other based on maximum performance gain and minimum performance loss.  The root node selects which services to provision across the tree when the SLA is about to be violated, or to de-provision if resources can be removed without causing an SLA violation.  Furthermore, in [8], the authors also consider how provisioning cache instances using the previously described “performance promises” would affect workloads in a particular tier and all child tiers due to an increased hit rate.  Figure 3 from [8] illustrates the decision process:

Figure 3

We can see that service i has k immediate child services and aggregates its own performance promises as follows:

Lastly, while the majority of the papers reviewed were concerned with provisioning additional capacity, [2] also considered removing unneeded capacity.  If the average latency is below a low-end threshold, the resource manager triggers a resource removal.  The system then performs a temporary removal of the resource.  If the average latency remains below the low threshold, the resource is permanently removed.  A temporary removal is performed first because mistakenly removing a resource is a potentially costly operation if it negatively impacts system performance.  The main motivation behind this logic is that under-utilized resources are wasted resources.
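
A hedged sketch of this two-step removal logic; the state handling and sample counts are illustrative rather than the exact mechanism in [2]:

```python
class CapacityReducer:
    """Low latency -> temporary removal -> (still low) -> permanent removal."""
    def __init__(self, low_threshold, confirm_samples=5):
        self.low_threshold = low_threshold
        self.confirm_samples = confirm_samples
        self.pending_removal = None
        self.low_count = 0

    def step(self, avg_latency, replicas):
        if self.pending_removal is not None:
            if avg_latency < self.low_threshold:
                self.low_count += 1
                if self.low_count >= self.confirm_samples:
                    self.pending_removal = None        # removal becomes permanent
            else:
                replicas.append(self.pending_removal)  # roll back: re-add the replica
                self.pending_removal = None
        elif avg_latency < self.low_threshold and len(replicas) > 1:
            self.pending_removal = replicas.pop()      # temporary removal
            self.low_count = 0
        return replicas
```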

When comparing the effectiveness of reactive and proactive provisioning, Figure 4 from [5] shows that proper proactive provisioning can greatly reduce the time spent in violation of the SLA.

Figure 4

In addition to handling smoothly increasing workloads, the provisioning techniques from [5] can also handle sudden load bursts effectively, as shown in Figure 5.

Figure 5

By combining predictive and reactive provisioning, systems are able to maintain sufficient capacity to handle predictable workloads as well as short-term instabilities.

How Much To Provision

The question of “how much of a given resource to provision” is less straightforward than “when to provision.”  As stated earlier, it may be necessary to go through several rounds of provisioning before the end-to-end delay reaches an acceptable level.  Moreover, depending on the implementation and type of controller used, the model determining “how much” may differ.  However, all provisioning algorithms attempt to provision enough to meet the SLA within a time period t without overprovisioning, because that would waste resources.  Additionally, one critical factor to consider and try to avoid is that increasing the provisioning at a particular tier k might create a downstream bottleneck at a later tier k + n.  Several authors in the survey explore how to avoid this phenomenon.

The authors in [5] present a general framework for determining provisioning needs based on average end-to-end latency.  Suppose a system has N tiers, denoted by T1, T2, …, TN, and let R denote the desired end-to-end delay.  Suppose further that the end-to-end response time is broken down into per-tier response times denoted by d1, d2, …, dN, such that Σ di = R.  Lastly, assume that the incoming session rate is λ.  Typically, one would provision the system for the worst-case scenario, that is, the peak of λ.  Individual server capacity can be modeled using first-come, first-served (FCFS), non-preemptive G/G/1 queues, where each request in the queue has a certain amount of work to do.

Assuming the service rate of the queue is μ, λ ⁄ μ = ρ is the utilization (traffic intensity).  If ρ is less than 1, then the queuing delay, that is, the time a request waits in the queue before being serviced, is bounded and finite.  If ρ is equal to 1, the queue length grows without bound, and the queuing delay is infinite unless the inter-arrival times of requests are deterministic.  Lastly, if ρ is greater than 1, the queuing delay is infinite.  From this model we can derive useful quantities, such as the response time (the queuing delay plus the time it takes to execute the request).  This behavior can be modeled by the following queuing theory result [9]:

di ≈ si + λi (σa² + σb²) / (2 (1 − λi si))

where di is the average response time for tier i, si is the average service time for a request at that tier, and λi is the request arrival rate at tier i.  σa² and σb² are the variance of inter-arrival time and the variance of service time, respectively, which can be monitored online.  Using this equation, we can obtain a lower bound on the request rate a single server at tier i can sustain.  Assuming a think time of Z, each session issues requests at a rate of 1 ⁄ Z [5].  Therefore, the number of servers ηi needed at tier i to service a peak request rate can be computed as:

where βi is a tier-specific constant, τ is the average session duration, λi is the capacity of a single server and λ is the request arrival rate.  The authors in [5] assume that the servers in each tier are load balanced, although other models exist that explore provisioning without load balancing, such as [10].
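
The following sketch inverts the queuing result above to estimate the request rate a single server can sustain, then sizes the tier.  The way the peak request rate, βi and per-server capacity are combined in servers_needed is an illustrative assumption and may differ from the exact formula in [5]:

```python
import math

def server_capacity(d_i, s_i, var_arrival, var_service):
    """Largest request rate a single tier-i server can sustain while keeping its
    average response time under d_i, obtained by inverting the G/G/1-style bound:
    lambda_i <= 1 / (s_i + (var_a + var_b) / (2 * (d_i - s_i)))."""
    assert d_i > s_i, "per-tier delay target must exceed the mean service time"
    return 1.0 / (s_i + (var_arrival + var_service) / (2.0 * (d_i - s_i)))

def servers_needed(peak_request_rate, beta_i, lambda_i):
    """Illustrative sizing step (assumed form): scale the peak request rate by the
    tier-specific constant beta_i and divide by the per-server capacity."""
    return math.ceil(beta_i * peak_request_rate / lambda_i)

lam = server_capacity(d_i=0.5, s_i=0.05, var_arrival=0.01, var_service=0.02)
n = servers_needed(peak_request_rate=400, beta_i=1.5, lambda_i=lam)
```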

In [7], the authors use a neural fuzzy controller with four layers to determine how much of a particular resource should be (re)allocated (see Figure 6).

Figure 6

In layer 1, the input variables e(k) and Δe(k) are passed to the neural controller nodes.  In layer 2, each node acts as a “linguistic term” assigned to one of the input variables from layer 1, using its membership function to determine the degree to which an input value belongs to a fuzzy set (e.g., “negative large” corresponds to the numeric value −50).  Layer 3 multiplies the outputs from layer 2 together to determine the firing strength of each rule.  Lastly, in layer 4, the output of layer 3 is “defuzzified” into a numeric output in terms of a resource adjustment Δm(k).  The magnitude of the adjustment is determined by the online learning of the neural fuzzy controller, described in more detail in [7].  This online learning algorithm is able to adapt quite rapidly to stationary and dynamic workloads to guarantee a 95th-percentile delay, as seen in Figure 7 from [7].
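
For intuition, here is a toy fuzzy controller in the same spirit: fuzzify e(k) and Δe(k) with fixed triangular membership functions (layer 2), fire rules by multiplying memberships (layer 3), and defuzzify to a resource adjustment Δm(k) (layer 4).  Unlike the controller in [7], nothing here is learned online; all shapes, ranges and rule consequents are made-up illustrative values:

```python
def triangular(x, left, center, right):
    """Degree of membership of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def fuzzy_adjustment(e, delta_e):
    """Toy two-input fuzzy controller: fuzzify error and error change,
    fire rules by multiplying memberships, defuzzify by weighted average."""
    # Linguistic terms over the error range (illustrative shapes and ranges).
    sets = {
        "neg": lambda x: triangular(x, -100, -50, 0),
        "zero": lambda x: triangular(x, -50, 0, 50),
        "pos": lambda x: triangular(x, 0, 50, 100),
    }
    # Rule consequents: how many resource units to add or remove.
    consequents = {("neg", "neg"): -2, ("neg", "zero"): -1, ("neg", "pos"): 0,
                   ("zero", "neg"): -1, ("zero", "zero"): 0, ("zero", "pos"): 1,
                   ("pos", "neg"): 0, ("pos", "zero"): 1, ("pos", "pos"): 2}
    num = den = 0.0
    for term_e, member_e in sets.items():
        for term_de, member_de in sets.items():
            strength = member_e(e) * member_de(delta_e)   # layer-3 style rule firing
            num += strength * consequents[(term_e, term_de)]
            den += strength
    return num / den if den else 0.0                      # layer-4 style defuzzification

delta_m = fuzzy_adjustment(e=30.0, delta_e=10.0)  # positive error -> add capacity
```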

Figure 7

In [6], the authors use the optimizer to determine changes in resource allocations for finite resources such as disk and CPU.  This is a significant departure from the other models, which assume a (much) larger, essentially unlimited, pool of resources from which to allocate.  Their optimizer uses ua, the resource allocation required to meet the performance target or SLA.  The optimizer’s high-level goal is to determine an allocation in a stable manner (no resource oscillations) by using a cost-minimization function to find the optimal resource assignments.  In their cost-minimization function:

where Jp = (yna(k) – 1)² serves as a penalty for deviation of the application performance from the desired target, and Ju = ||ua(k) – ua(k – 1)||² serves as a control cost that improves the stability of the system.  The controller attempts to minimize a linear combination of both functions.  One important point to note in this particular allocation scheme is that allocation requests are confined by the following constraint:

Σi uirt ≤ 1

where uirt is the requested allocation for application i of resource r at tier t.  Stated another way, the allocations for all applications for a particular resource, such as CPU or disk, at a particular tier must sum to less than or equal to 100%.  When this constraint is violated, that resource is experiencing contention.  It is important to note that in this allocation scheme, while there is the potential for every application to receive adequate capacity to satisfy its demand, when contention exists, applications are weighted according to their priorities in the minimization function so that the most “important” application receives the largest share of the divided resources.  Clearly, because we cannot allocate more than 100% of a node’s physical CPU, disk or other resource, all applications suffer some performance degradation under contention; the only difference is the extent to which each application suffers.
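
A small sketch of the contention check and priority weighting described above; the actual optimizer in [6] solves the cost-minimization problem, whereas this simply caps priority-weighted shares so the granted allocations respect the 100% constraint (application names and numbers are illustrative):

```python
def resolve_contention(requests, priorities):
    """requests: app -> requested share of one resource at one tier (0..1).
    If the total exceeds 1.0 the resource is under contention; split it in
    proportion to priority-weighted demand, never granting more than asked."""
    total = sum(requests.values())
    if total <= 1.0:
        return dict(requests)                 # no contention: grant as requested
    weight_sum = sum(priorities[a] * requests[a] for a in requests)
    return {a: min(requests[a], priorities[a] * requests[a] / weight_sum)
            for a in requests}

granted = resolve_contention(
    requests={"shop": 0.7, "blog": 0.6},      # 130% of CPU requested at this tier
    priorities={"shop": 3.0, "blog": 1.0},
)
# The high-priority "shop" keeps its full request; "blog" absorbs the shortfall.
```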

Other authors, of course, decide how much to provision based on their controller implementation.  In [2], the authors use a k-nearest neighbors classifier to decide the number of databases to be provisioned.  Lastly, in [8], the authors use a decision-tree process to determine the maximum performance gain (and minimal performance loss) when allocating or removing resources.  The algorithm in [8] performed quite impressively, observing its SLA and remaining well-provisioned without wasting resources, as seen in Figure 8.

Figure 8

Before looking at additional MAPE-K components, it is worth examining two additional enhancements to resource provisioning: preventing downstream bottlenecks and preventing allocation oscillations.

  A. Preventing Downstream Bottlenecks

The relationship between an incoming request and the work units at successive tiers is not necessarily one-to-one.  A single incoming search request at the web tier may trigger one or more query requests at the database tier.  Therefore, downstream tier capacity has to be considered when allocating additional capacity at upstream tiers.

First, consider what would happen if downstream tiers were not also adjusted for the additional incoming requests.  When an increase in allocation is triggered by an SLA violation at tier k, the capacity there increases by n.  If the downstream tier k + 1 is almost at capacity, it now has up to n additional requests to service, which has the potential to cause an SLA violation there.  When the resource manager detects a potential violation at tier k + 1, there will be another round of provisioning, and because it takes a fixed amount of time to provision additional resources, tier k + 1 could meanwhile be in violation.  Following this pattern, the cascade could trigger provisioning events at each downstream tier.  This cascading chain of allocation, violation and additional provisioning is wasteful and increases the total time spent in violation of the SLA.  [5] proposes a solution using the constant βi, which is greater than one if a request triggers multiple requests at tier i and less than one if caching at prior tiers reduces the work at that tier.
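
A minimal sketch of sizing every tier in one pass using the βi multipliers, so that an upstream increase does not create a downstream bottleneck; the β values and per-server capacities are illustrative:

```python
import math

def provision_all_tiers(incoming_request_rate, betas, per_server_capacity):
    """betas[i] > 1 if one incoming request fans out into several requests at
    tier i, < 1 if upstream caching absorbs some of them.  Sizing every tier
    from the same incoming rate avoids the cascade of violations described above."""
    plan = []
    for beta_i, cap_i in zip(betas, per_server_capacity):
        tier_rate = beta_i * incoming_request_rate
        plan.append(math.ceil(tier_rate / cap_i))
    return plan

# Web, app and database tiers: caching trims the app tier, queries fan out at the DB.
servers = provision_all_tiers(
    incoming_request_rate=300,
    betas=[1.0, 0.8, 2.5],
    per_server_capacity=[400, 120, 150],
)   # e.g. [1, 2, 5]
```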

  B. Preventing Allocation Oscillations

Because Internet traffic is bursty by nature, the load on the system can fluctuate rapidly.  As a result, a necessary enhancement to the system controller is a filter to smooth or dampen very brief load spikes.  This is not to say that we do not want to respond to SLA violations, but adding or removing capacity can be a very costly and time-consuming operation.

In addition, in [2], the authors keep state on each allocation, marking it as either in a STEADY or a MIGRATION state.  During the MIGRATION state, it may take some time after the SLA violation triggers an additional capacity allocation for the migration of the new resources to complete and “warm up.”  [2] found that latency may continue to be high, or even increase, during this period, so samples taken during it may not reflect the actual system state.  Furthermore, they found that making decisions based on samples from this migration and warm-up period may add unnecessary replicas, which will need to be removed later.  This is wasteful and could penalize other applications hosted on the same machine.
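
A hedged sketch combining the two safeguards in this subsection: samples taken during a MIGRATION/warm-up period are ignored, and a new allocation is only triggered after several consecutive violations, damping brief spikes.  The state names echo [2], but the persistence counter and warm-up length are illustrative assumptions:

```python
class DampedTrigger:
    """Suppress provisioning decisions during MIGRATION/warm-up and require
    several consecutive SLA violations before acting (oscillation damping)."""
    STEADY, MIGRATION = "STEADY", "MIGRATION"

    def __init__(self, sla_latency, persist=3, warmup_samples=10):
        self.sla_latency = sla_latency
        self.persist = persist
        self.warmup_samples = warmup_samples
        self.state = self.STEADY
        self.violations = 0
        self.warmup_left = 0

    def start_migration(self):
        self.state = self.MIGRATION
        self.warmup_left = self.warmup_samples

    def sample(self, latency):
        if self.state == self.MIGRATION:
            self.warmup_left -= 1                 # samples here are not trusted
            if self.warmup_left <= 0:
                self.state = self.STEADY
                self.violations = 0
            return False
        self.violations = self.violations + 1 if latency > self.sla_latency else 0
        if self.violations >= self.persist:       # sustained, not a brief spike
            self.start_migration()                # provisioning kicks off a migration
            return True
        return False
```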

For additional information on each model and provisioning algorithm, refer to the related paper.

  3. Effectors

The execution of the plan in the MAPE-K loop happens via the effectors.  In some of the papers surveyed, such as [5], the authors used request policing to ensure that the admission rate does not exceed capacity.  Any sessions beyond capacity are dropped at the threshold of the system; once a request is admitted, it is not dropped at any intermediate tier.  Alternatively, each tier could have its own request policing mechanism; however, this is quite wasteful, since requests that have already had work done on them may be dropped at downstream tiers.

In the majority of the papers [2, 5, 6, 7, 8], additional capacity was allocated from a pool of free resources.  If no more free resources existed, or a resource reached its replication maximum, resources were re-provisioned or the sessions and requests were capped at a maximum.  In [2], additional database capacity was added to the tier by replicating the existing database.  During the period of migration, requests were held and then, when the migration was complete, “played” in stages to bring the new copy up to date with any additional writes that occurred during migration.

In [5], the authors used agile server switching.  In this model, each node hosts a copy of each application in a separate virtual machine (VM); however, only one application is active at a time.  When the control plane determines that it needs to allocate additional capacity, and the application running on a node scheduled for replacement is not the application receiving the additional capacity, its sessions and requests are ramped down using fixed-rate or measurement-based techniques.  Once all the sessions of the previous application on that VM have concluded, the VM is suspended and the new application is brought up.  This allows additional capacity to be provisioned for an application quickly.

However, admission control may not be the best choice, and [3] presented an alternative to dropping sessions: serving some clients in a “degraded mode,” effectively creating additional capacity.  Using traditional control theory (a closed loop that reacts to sensors and actuators), we can adjust the load on the server to stay within the bounds of process schedulability based on available resources.

The actuator, or effector, is responsible for performing a specific action based on the controller output.  In [3], the authors initially used admission control as a way of controlling server load.  This limits the number of clients the server responds to concurrently.  All other clients are rejected; while this is not a favorable outcome, it is a way of providing QoS to the admitted clients.

However, this actuator can be adapted to provide a “degraded” level of service.  At a degraded level of service, the content served to the client differs from what a “full service” client receives; for example, degraded content may include fewer images.  To perform this degradation, there must be multiple versions of the content, and the web server serves the content from the appropriate file structure at request time.  There can be M + 1 service levels, with the current level m = I + F, where I is the integer service level and F is the fraction of clients served at level I.  When m = M, all clients receive the highest level of service, and conversely, when m = 0, all clients are rejected.  This provides a more general model in which a percentage of clients receive degraded service, rather than specifying exactly which clients (individual clients can be assigned through a mechanism such as hashing their IP address).  The server load is controlled through the input variable m.

The monitor proposed in [3] expresses the utilization U as a function of the served request rate R and the delivered byte bandwidth W (U = aR + bW, where a and b are constants derived through linear regression).  This function approximates the utilization, which must satisfy U < 0.58 for deadline-monotonic schedulers to meet their deadlines.  Using this scheme of admission control and “capacity” planning, the authors in [3] were able to admit almost three times the number of requests to the system, as shown in Figure 9, before significant service degradation occurred.
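
A rough sketch of a degradation actuator driven by the U = aR + bW estimate: when U crosses the 0.58 bound, the service level m is nudged down, and it is restored when there is slack.  The actual controller in [3] is a proper feedback loop; the step size, slack margin and constants here are illustrative assumptions:

```python
class DegradationController:
    """U = a*R + b*W utilization estimate driving the service level m in [0, M]."""
    def __init__(self, a, b, max_level, u_target=0.58, step=0.25):
        self.a, self.b = a, b            # constants fit offline by linear regression
        self.m = float(max_level)        # start by serving everyone full content
        self.max_level = max_level
        self.u_target = u_target
        self.step = step

    def update(self, served_rate, delivered_bandwidth):
        """Called each sampling period with the measured R and W."""
        u = self.a * served_rate + self.b * delivered_bandwidth
        if u > self.u_target:
            # Over the schedulability bound: degrade content for some clients.
            self.m = max(0.0, self.m - self.step)
        elif u < 0.9 * self.u_target:
            # Comfortable slack: restore richer content gradually.
            self.m = min(float(self.max_level), self.m + self.step)
        return self.m   # m = 0 rejects all clients; m = M serves all at full service
```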

Figure 9

  4. Sensors

Sensors provide the input to the monitoring phase of the MAPE-K loop.  The sensors that report data to the control plane can capture any number of metrics.  In the papers surveyed, authors used CPU usage [5, 6], bandwidth, throughput, active network connections [2, 5], I/O transfers [2, 6], cache hits [8], end-to-end delay or response time [5, 7], session arrival rate [5], request arrival rate [5], read/write ratios [2], lock ratios [2] and memory usage [2].  These parameters can be recorded and monitored online or offline, using standard system tools such as vmstat, iostat and sysstat, or custom software written to record statistics.

IV. Quality of Service & Performance Isolation

Lastly, the control theory approach of [3] was extended using virtual servers to support performance isolation (each virtual server can guarantee a maximum request rate and maximum throughput), service differentiation (each virtual server supports request prioritization) and excess capacity sharing (conforming virtual servers under overload can temporarily exceed capacity to avoid client degradation).

With performance isolation, a virtual server is configured with a maximum request rate RMAX and a maximum delivered bandwidth WMAX.  This expresses a throughput agreement to serve up to the specified levels.  If RMAX is exceeded, then the agreement on WMAX is revoked.

In performance isolation, the virtual server can adapt content, as previously mentioned, to stay within the agreement bounds.  When each of the i virtual servers is configured, if their aggregate utilization Σ U*i is less than 0.58, then the system capacity is planned correctly.  To perform the correct bookkeeping, load classification is done when the request arrives, and, based on that classification, the served requests and delivered bandwidth are charged against the corresponding virtual server.  Lastly, based on the measured rates of requests (Ri) and delivered bandwidth (Wi) for virtual server i, the control loop applies the degree of content degradation necessary to keep the utilization of that virtual server at the proper level, preventing overload.

In service differentiation, the goal is to support a number of different clients at different service levels (lower-priority clients are degraded first).  In [3], if there are m priority levels, where 1 is the highest priority, the capacity of the virtual server is made available to clients in priority order.  For the purposes of their research, [3] used two service levels, premium and basic, where premium received guaranteed service and basic did not.

Lastly, in sharing excess capacity, if the load on one virtual server exceeds its maximum capacity C, then the server under overload may temporarily utilize the resources of underutilized virtual servers.  This requires only a simple modification to the control loop.

V. Conclusion

In this survey, we presented information on burstiness in network traffic and how it affects service times.  We also showed how autonomic systems can provision themselves to handle the dynamic workloads.  Dynamic provisioning in multi-tier applications raises some interesting and unique challenges.  As distributed systems become more complex, the techniques surveyed in this paper will become more useful and relevant.

References

[1]     M. C. Huebscher and J. A. McCann.  A survey of autonomic computing: Degrees, models, and applications.  ACM Computing Surveys, 40(3), 2008.
[2]     J. Chen, G. Soundararajan, and C. Amza. Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers. Proc. of IEEE ICAC, 2006.
[3]     T. Abdelzaher, K.G. Shin, N. Bhatti. Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach. IEEE Transactions on Parallel and Distributed Systems, 13(1), 2002.
[4]     N. Mi, G. Casale, L. Cherkasova, and E. Smirni. Burstiness in multi-tier applications: Symptoms, causes, and new models. Proc. of ACM/IFIP/USENIX Middleware, 2008.
[5]     B. Urgaonkar, P. Shenoy, A. Chandra, P. Goyal, and T. Wood. Agile dynamic provisioning of multi-tier Internet applications. ACM Trans. on Autonomous and Adaptive Systems, 3(1):1-39, 2008.
[6]     P. Padala, K.-Y. Hou, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, and A. Merchant. Automated control of multiple virtualized resources. Proc. of EuroSys, 2009.
[7]     P. Lama and X. Zhou. Autonomic provisioning with self-adaptive neural fuzzy control for end-to-end delay guarantee. Proc. IEEE/ACM MASCOTS, pages 151-160, 2010.
[8]     D. Jiang, G. Pierre, and C.-H. Chi. Autonomous resource provisioning for multi-service web applications. Proc. ACM WWW, 2010.
[9]     L. Kleinrock. Queueing Systems, Volume 2: Computer Applications. John Wiley and Sons, Inc., 1976.
[10]  B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier Internet services and its applications. Proc. of ACM SIGMETRICS, Banff, Canada, 2005.