Category: Web

With this blog post, I am pleased to announce that A Few Guys Coding has created a service with a different twist. Wedding customs and traditions are changing all the time, whether it is the customs, the music, the attire or even the colors (I once went to a wedding that had black and purple as its main colors). I had even heard of online RSVPs. This software stack, WeddingWare, was born out of necessity. For my own wedding, now 5 months away, we needed a way for guests to RSVP. Since we didn’t want to spend all the money on response cards, we are environmentally friendly (it saves trees!) and I am a software engineer, I suggested that we do RSVPs online (each guest will still receive a paper invitation with an invitation code). A couple of brainstorming sessions and much, much code later, WeddingWare is officially available for license.

While the primary focus is custom wedding RSVPs, the software can also handle shower invitations as well as bachelorette party invitations. Deep customization is a main focus of this software because, after all, who wants to send their guests to a long, auto-generated URL? Our software is simple, and we will work with clients directly to set up an RSVP site where the look is clean and the URL is custom. The software has a full-featured backend that allows the party hosts to log in and 1) create new parties and invitations, 2) have multiple events and 3) get metrics on guests that are attending (such as outstanding invitations vs. people who have responded). Guests will be directed to a friendly, intuitive site that the hosts have fully customized, down to the color scheme. After RSVPing, the guest will receive confirmation emails with hotels, driving directions, maps, etc. The site is mobile-browser friendly and we are working on a matching iOS application. Pricing is still to be determined, yet it will be affordable and based on the number of guests.
A live demo is available at

Disclaimer: I am not a “web guy” by nature. I have recently started experimenting with it more and more because I eventually want to become proficient in web development and design. These may be simple tips for “web people,” but a typical “desktop application” person might struggle with them.

  1. Debugging in Chrome, Firefox and Safari is easy, debugging in IE is hard: Chrome, Firefox and Safari all make it extremely easy to debug issues related to HTML, CSS and Javascript (AJAX included!) because Chrome and Safari have built-in “inspectors.”  Firefox requires an additional plugin, Firebug, but it is just as useful once you have it.  IE makes this process painful; however, in IE 8 (I haven’t verified anything before or after this version), the Tools menu has a “Developer Tools” window that attempts to do the same thing as Firebug and the built-in inspector for Chrome and Safari.  While using it is kind of like using rock knives compared to the other tools, it did work and I was able to debug my CSS in IE (Javascript is much harder). Microsoft has a nice writeup on this tool here.  In fact, go ahead and use Photoshop or Illustrator to create your designs – that’s fine.  However, when it comes time, always test first in IE since it has the poorest support.  Chances are that if it works in IE the first time, it’ll work in the other browsers the first time as well.  Oh yeah, and get CSSEdit (sorry Windows people – you’re on your own).
  2. RAILS_ENV != RACK_ENV in Rails 3: I was having a devil of a time getting my staging server to work with my staging database.  Passenger and Apache would connect and use the proper staging files directory, but it would constantly connect to the production database.  The trick is that in Rails 3 (only!!!) you need to set ENV['RACK_ENV'] = "staging" or ENV['RACK_ENV'] = "production".  In Rails 2.x.x, you still use RAILS_ENV to tell it which environment to use.
  3. Having a good deployment strategy and capistrano recipe is key: Many months earlier, I had already put up a plain HTML site (no frameworks, *gasp*) at a really cheap host, iPage (they were $30 for the year and believe me, you get what you pay for, but that was OK because it was just text and images). Everything was working just fine there while I was developing this Rails app. Initially (before I switched to Rails from Google App Engine), I thought that I would host the application at some place like Heroku, since their deployment looks dead simple. As it turned out, it would be much cheaper, by about $40 per month, to use my primary host (the one this site is using) and just buy some additional RAM upgrades for my VM to support multiple Passenger instances. Anyway, I knew that I would need to start testing outside of a development environment. In addition to the standard development, test and production environments given to you in Rails, I converted the test environment into a local version of a production environment (class caching, actual mail delivery, etc.), though I kept my database adapter as SQLite. I also created an additional environment called staging. The theory behind this is that I would run exactly as I would in production (using MySQL, class caching, sessions stored via :mem_cache, using delayed_job to deliver mail, etc.) with test fixtures and data in the database. Obviously, production would be the production environment with live data. This allowed me to iron out any bugs that might crop up when switching to a more production-like environment and catch any assumptions I had made in development that wouldn’t be valid in production (when switching adapters, etc.). I dedicated a subdomain to deploying my staging environment and password protected it. If you are interested in my Capistrano file, environments file and Apache vhost file, they are here (with the obvious implementation and sensitive path/password redactions).
  4. Radio buttons: I was making some AJAX calls that needed to select radio buttons in the same group, based on the :success or :error AJAX callback, when the call returned. I was having trouble getting this to happen, but this code here is solid:
    // this example had a yes/no. it can be extended to any number of radio buttons
    var radios = $('input:radio[name=radio_group_name]');
    radios.filter('[value=val1]').attr('checked', false);
    radios.filter('[value=val2]').attr('checked', true);
  5. Rails UJS: Rails 3 embraces a more MVC way of thinking by abstracting the Javascript choice (whether it is jQuery, Prototype, etc.). Before, the implementation was fairly specific and you sometimes had to mix specific JS code into your Ruby code. With UJS, the implementation in Rails 3 takes advantage of the new HTML5 data-* attributes, so Rails doesn’t spit out Prototype-based Javascript inline (the old helpers are still there). Rather, the helpers just define an appropriate data attribute in the tag, and that’s where the Javascript comes in. Obviously this is a more flexible and reusable pattern. You can also plug and play different frameworks without the headache of writing a lot of the same code over again (I swapped jQuery in for Prototype in several minutes – just remember to get the new Rails framework Javascript files, such as rails.js, for jQuery or you’ll have weird errors that are hard to track down.  Check out jqueryrails.). Since I am far from a RoR or Javascript expert, this, this and this have a more expert breakdown (why re-invent the wheel?).
  6. Remote form_for (AJAX) and :onsubmit(): In Rails 2 and at least 3, form_for with :remote => true overrides the :onsubmit => ‘func();’ method to do the actual form submission.  If you want to bind something to your form before it gets submitted (or during or after!), bind the form using jQuery.bind() and then observe the AJAX callback functions to do what you need.
    <script type="text/javascript">
      function get_address_information() {
        // check jquery for all the possible callbacks. there are also
        // success and error. complete gets called after both success and error
        var pars = 'param1=x&param2=y&param3=z';
        $.ajax({
          type: "POST",
          url: "/get_invite_address",
          data: pars,
          dataType: 'script',
          beforeSend: function() {
            // do your before send checking here
          },
          complete: function(data, status) {
            // do some post processing here
          }
        });
      }
    </script>
  7. Rails is as good as everyone says (and other languages/frameworks are as bad as you think): Truly. The only drawback is the lack of internationalization. However, in terms of ease of use, the language, setup and data access, Rails is superior in every way. Rails sets you up for testing (as well as performance testing) from the word go. The framework allows you to concentrate on coding while it does the heavy lifting.

Report card from here

Abstract— With the Internet becoming a major resource for many individuals, businesses, universities and countries, there is an increasing amount of traffic on the Internet.  Some applications available on the Internet need to be able to provide stringent performance guarantees from their web servers (such as online securities trading).  Because Internet traffic is bursty and unpredictable, and the interactions between hardware and software in multi-tier systems are complex, there needs to be a large degree of automation in capacity planning.

Index Terms—capacity planning, multi-tier systems, requests, sessions, performance, autonomic computing, distributed systems

I. Introduction

The Internet has become an important and popular channel for a variety of different services, such as news aggregation, online shopping, social networking, financial brokerage and other services. Many of the popular services on the Internet today, such as Facebook, Twitter, Amazon and eBay [2, 5, 9], are built on a generic multi-tier architecture.  Requests generated by users flow through each layer in this architecture.  Each tier in this system provides certain functionality to the following tier by executing part of the incoming request [5].  This multi-tier system includes 1) a web tier running a web server, such as Apache, to respond to incoming requests, 2) an application tier running an application container, such as Tomcat, which hosts an application, and 3) a backend database tier running database software such as MySQL.

Tiered Application [2]

Each of these tiers may be a single machine or may be distributed over a cluster of nodes.  In addition, these servers may be virtualized, such that multiple applications can be hosted and run on the same physical node.  Oftentimes, to ensure acceptable performance, especially in application hosting, service level agreements (SLAs) are put into place to govern the desired system performance.  An SLA can specify one or multiple parameters, such as the required total uptime of a system, total throughput, average end-to-end delay of requests within a certain percentile, etc.  The SLA target is maintained by the addition of resources into each tier and can be maintained by removing resources as necessary (without violating the SLA).

Traffic on a computer network is inherently non-uniform and hard to predict, mainly due to the “think” time of the end user.  This “bursty” behavior is characterized by short, uneven spikes of peak congestion in the life of an application [4].  Popular services can experience dynamic and varying workloads that depend on popularity, time or date, or general demand.  In addition, Internet flash crowds can cause bursty and unpredictable workloads in each of these tiers.  For example, during the November 2000 holiday season, one major e-commerce site experienced a forty-minute downtime due to an overload [5].  These flash crowds can cause a significant deviation from the normal traffic profile of an application, affecting its performance.

The performance of the system can be characterized by the total measured end-to-end delay generated by these incoming requests and their workload at each tier.  Because the interactions between the software and hardware in each of these aforementioned tiers can be quite complex, the management process of allocating sufficient resources (CPU, disk, network bandwidth, etc.) to each tier so that it does not saturate and become a bottleneck in the system is often a difficult, lengthy and potentially error-prone process for human operators to model and estimate.  For example, if the performance of an application tier dominates the performance of the system, that tier becomes the bottleneck.  It is non-trivial to model the process (scheduling, algorithms, memory, I/O and CPU times) and the time it takes to execute the business logic in this tier, especially if resources are shared, as is common in distributed systems.  In addition to these problems, static allocation does not address the issue of Internet flash crowds and the potential they have to overload the system and jeopardize SLA compliance.

Capacity planning and autonomic resource management play an important role in modern data centers.  Service differentiation (Quality of Service) and performance isolation help these complex systems adapt to changing workloads, both within the system as a whole and within each tier.  In this paper, we will present a survey of the large variety of models, techniques and results of autonomic resource management in large-scale, multi-tier systems.

II. Network Traffic Characteristics

To truly understand the reason for capacity planning and the difficulties it presents in modeling the interactions between incoming requests and the hardware and software that service them, we have to understand network traffic characteristics.  Incoming session and request rates tend to fluctuate based on a variety of different factors.  These varying workloads can be characterized as “bursty”, which is defined by [4] as short, uneven spikes of peak congestion during the lifetime of the system.  These traffic patterns deviate significantly from the average traffic arrival rates.

One way to characterize burstiness, as seen in [4], is to use the variable I to represent the index of dispersion.  The index of dispersion is used to measure whether a set of observed occurrences is clustered or dispersed compared to a standard model.  When I is larger, the observed occurrences of bursty traffic are more dispersed.  We can see in Figure 1 that burstiness can aggregate into different periods represented by different indices of dispersion.


Figure 1

A simple way to calculate the index of dispersion for a particular time window is as follows:

I = Var(Nt) / E[Nt]

where Nt is the number of requests completed in a time window of t seconds (with t counted ignoring server idle time), Var(Nt) is the variance of the number of completed requests and E[Nt] is the mean number of requests completed during busy periods.  An astute reader can deduce that as t becomes large, if there are no bursty periods within t, the index of dispersion is low.
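To make the metric concrete, here is a small sketch (my own, not code from [4]; function and variable names are invented) that estimates I from a sequence of per-window completed-request counts:

```python
import statistics

def index_of_dispersion(counts):
    """Estimate the index of dispersion I = Var(Nt) / E[Nt] from a
    sequence of per-window completed-request counts Nt (windows are
    assumed to already exclude server idle time)."""
    return statistics.variance(counts) / statistics.mean(counts)

# A smooth, Poisson-like count sequence yields a small I, while a
# bursty sequence with uneven spikes yields a much larger I.
smooth = [10, 11, 9, 10, 10, 11, 9, 10]
bursty = [2, 1, 40, 2, 1, 38, 2, 2]
```

Running `index_of_dispersion` on the two sequences above shows the bursty trace producing an index far higher than the smooth one, matching the intuition that large I signals clustered spikes of load.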

Now that bursty periods can be quantified via I, what causes burstiness?  [4] suggests that “burstiness in service times can be a result of a certain workload combination” and that it “can be caused by a bottleneck switch between tiers and can be a result of ‘hidden’ resource contention between the transactions of different types and across different tiers.”  That is, one server or tier may be lightly loaded during a particular period, but may become saturated in other periods when a large number of requests are processed.

This hidden resource contention and different types of transactions across different tiers can be extremely difficult to model and even more troublesome for human operators to correctly provision for.  Autonomic provisioning of system resources is needed to optimize service metrics.

III. Self-Management

As previously mentioned, the major motivation behind autonomic resource management and capacity planning is to reduce the human involvement in these activities due to their difficulty.  If the systems are able to plan for the future and react to the present without operator input, it will take the load off of system administrators and more accurately respond to the situation at hand.  In order to do this, one has to define the goals and properties of such an autonomic system.  The main properties of an autonomic system, as cited in [1] are:

  • Self-configuration: autonomic systems configure themselves based on high-level goals, specifying what is desired (SLAs), not how to achieve the goals.  Systems with self-configuration can install and set themselves up.
  • Self-optimization: autonomic systems can optimize their use of resources either proactively or reactively in an attempt to improve service and meet SLAs.
  • Self-healing: autonomic systems can detect and diagnose problems at both high and low levels (software and hardware).  Autonomic systems should also attempt to fix the problem.
  • Self-protecting: autonomic systems should protect themselves from attacks but also from trusted users trying to make changes.

One particularly important feature of autonomic systems is the ability to exhibit “proactive” behavior.  For example, software or hardware agents should be:

  • Autonomic: “Agents operate without direct intervention of humans and have some kind of control over their actions and internal state [1].”
  • Social: “Agents interact with other agents via some agent-communication languages [1].”
  • Reactive and proactive: “Agents are able to perceive their environment and respond in a timely fashion.  Agents do not simply act in response to their environment, but they are able to exhibit goal-directed behavior in taking the initiative [1].”

In order to model these properties, IBM suggested a general autonomic control loop framework, MAPE-K, as detailed in [1].  MAPE-K allows for clear boundaries in the model to classify the work that is taking place in the autonomic system.  In MAPE-K, there are managed elements, which represent any software or hardware resource that is managed by the autonomic agents.  Sensors collect information about managed elements and provide input to the MAPE-K loop so that the autonomic manager can execute the changes.  Typically the managers are additional software components configured with high-level goals, which leaves the implementation of these goals up to the autonomic manager.  Effectors actually carry out the changes to the managed elements (these changes can be fine or coarse grained).  Each of these elements can be observed in the following sections.

In the MAPE-K loop, there are five different components: monitor, analyze, plan, execute and knowledge.  Monitoring involves “capturing properties of the environment (physical or virtual) that are important to the self properties of the autonomic system [1].”  This information is captured from sensors in the system in two different ways, passively (using built-in tools to run an analysis on the system) or actively (engineering the software to monitor and improve performance).

In the planning model of MAPE-K, the manager uses the monitoring data from the sensors to produce a series of changes to one or more managed elements.  These changes can be calculated by having the autonomic system keep state information about the managed elements and data so that adaptation plans can be adjusted over time.  An additional benefit of keeping state information is that systems are able to create an architectural model where the actual system mirrors the model, and proposed changes can be verified to ensure that system integrity remains intact before and after they are applied.  If any violations occur after applying the changes to the model, the changes can be aborted or rolled back to avoid damage or downtime to the system.

In the knowledge part of the model, the knowledge used to effect adaptation can come from humans, logs, sensor data, or day-to-day observation of a system to observe its behavior [1].  In this part of the model, there is a large space in which one can use machine learning to acquire knowledge about the system.  While [1] suggested reinforcement learning and Bayesian techniques, other authors suggest K-Nearest Neighbors (KNN) [2] and neural networks with fuzzy control [7].  This author would suggest that decision trees could be an effective method of acquiring the knowledge to effect the proposed plans.  Also, clustering could be used as a way of identifying and classifying previously similar plans to see if they were successful or if they resulted in failure.

Finally, there are several different levels of autonomic maturity: Basic, Managed, Predictive, Adaptive and Fully Autonomic.

Now that we have seen a general framework for automated resource management, let’s continue to explore each component.
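The control loop itself can be sketched in a few lines of Python (the class and method names here are illustrative, not from [1]): a sensor feeds the monitor step, analysis checks the high-level goal, planning produces a change, and an effector executes it, with knowledge accumulating as history.

```python
class AutonomicManager:
    """Minimal MAPE-K sketch: monitor, analyze, plan and execute over
    a shared knowledge base."""

    def __init__(self, sensor, effector, sla_target):
        self.sensor = sensor          # reads a metric from a managed element
        self.effector = effector      # applies a change to a managed element
        self.sla_target = sla_target  # high-level goal (e.g. max latency)
        self.knowledge = []           # K: history of observations

    def run_once(self):
        metric = self.sensor()                  # Monitor
        self.knowledge.append(metric)
        violation = metric > self.sla_target    # Analyze
        plan = 1 if violation else 0            # Plan: add one capacity unit
        if plan:
            self.effector(plan)                 # Execute
        return plan
```

In a real system the plan step would be far richer (see the provisioning strategies below in the survey), but the boundaries between the phases are exactly the ones MAPE-K prescribes.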

  1. Managed Elements

Managed elements in an autonomic system consist of all the hardware and software elements that can be managed by autonomic agents.  Because multi-tier systems can be very complex, there are multiple levels of detail at which one can view the system.

The first and highest level is where the system is viewed as a black box.  At this level, we consider the end-to-end delay from when the request enters the system until it returns back to us.  If there is congestion or delay caused by insufficient capacity at any tier, we are unable to know which tier’s capacity is causing the problem. Typically, changes in allocation to capacity are fairly coarse-grained at this level.  [8] plans their allocation strategy at this level.

At the next level down, we no longer monitor system performance by a single delay metric.  We are now able to monitor the performance of individual tiers.  When congestion occurs at a single tier, we are able to target that tier and increase the allocation capacity of just that tier.  However, one must be careful not to trigger downstream allocations with the increased capacity at the original tier.  [5] plans their allocation strategy at this level.

Lastly, at the most fine-grained level, we are able to collect statistics on individual components within a tier such as CPU usage for an application on a particular node.  Figure 2 shows an example of the black-box and per-tier paradigms.


  2. Autonomic Managers and Controllers

In the MAPE-K loop, the monitoring, analyzing and planning phases are done in the control plane or autonomic manager.  The manager should adjust managed elements in the following fashion [2]:

  • Prompt. It should sense impending SLA violations accurately and quickly and trigger resource allocation requests as soon as possible to deal with load increases (or decreases).
  • Stable. It should avoid unnecessary oscillations in provisioning resources (adding and removing).

Keeping the two guidelines in mind, there are two important decisions that every autonomic manager will be making: 1) when to provision resources and 2) when a provisioning decision is made, how much of a particular resource should be provisioned.

When To Provision

When considering the question of “when to provision”, one must consider the timescale that is being evaluated.  There are two main methods of provisioning: predictively and reactively.  Predictive provisioning attempts to stay ahead of the workload, while reactive provisioning follows to correct for short term fluctuations.

In predictive provisioning, the autonomic system plans increased (or decreased) capacity for the future.  When evaluating real-world workloads, typically the system or application will see a peak in traffic during the middle of the day and a minimum during the middle of the night [5].  Other factors, such as recent news (breaking important stories) or seasons (holiday shopping), can affect this distribution of traffic.  By using predictive analysis, automated systems are able to provision sufficient capacity well in advance of any potential events that would cause SLA violations.

Prediction can be done via a number of different methods including statistical analysis, machine learning, or simply using past observations of workload.  This prediction is usually implemented in the control plane or autonomic manager of the system [2, 5, 6, 7, 8].  For example, in [2], a scheduler for each application monitors the application database tier and collects various metrics about the system and application performance, such as average query throughput, average number of connections, read/write ratios and system statistics such as CPU, memory and I/O usage.  These metrics are reported back to the autonomic manager and an adaptive filter predicts the future load based on the current measured load information.  Each of these metrics is weighted to reflect the usefulness of that feature.  In addition, a KNN classifier determines if an SLA is broken and redirects the resource allocation to adjust the number of databases so that that tier is no longer in violation.  A resource allocation component decides how to map the resources dynamically.  [7] uses a self-adaptive neural fuzzy controller to decide upon the allocation of resources.  Moreover, [6] uses a model estimator, which automatically learns online a model for the relationship between an application’s resource allocation and its performance, and an optimizer, which predicts the resource allocation required to meet performance targets.  Lastly, in [5], the authors use a histogram of request arrival rates for each hour over several days.  Using that data, the peak workload is estimated as a high percentile of the arrival rate distribution for that hour.  By using these metrics, the application is able to predict shifting workloads in a sliding window.
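The histogram-based predictor attributed to [5] might be sketched as follows (the 95th-percentile default and all names are assumptions on my part, not taken from the paper):

```python
from collections import defaultdict

# Maps an hour of the day to the arrival rates observed during that
# hour over the past several days.
arrival_history = defaultdict(list)

def record(hour, rate):
    """Log one observed arrival rate for a given hour of the day."""
    arrival_history[hour].append(rate)

def predicted_peak(hour, pct=95):
    """Estimate the peak workload for an hour of the day as a high
    percentile of the arrival-rate distribution seen for that hour."""
    rates = sorted(arrival_history[hour])
    idx = min(len(rates) - 1, len(rates) * pct // 100)
    return rates[idx]
```

Provisioning against `predicted_peak` rather than the mean is what lets the system absorb the daily peak without an SLA violation, at the cost of holding some spare capacity during quieter hours.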

Most of the papers surveyed use the error in prediction e(k) (that is, the difference between the predicted workload or arrival rate λpred(t) and the observed workload or arrival rate λobs(t) for a time period t) or the change in the error in prediction Δe(k) (e(k) − e(k − 1)) as input to their algorithms to help determine the next period’s allocation [5, 6, 7].  Since workloads and arrival rates often exhibit bursty overloads [4, 5], these parameters fine-tune the prediction model.

Predictive provisioning alone is not enough to make the system robust and immune to SLA violations.  For example, there may be errors in prediction if the workload or arrival rate deviates greatly from previous days.  As mentioned earlier, Internet flash crowds can cause spikes in network requests that have the potential to cause congestion and overload the system due to their bursty, unpredictable nature.  In these cases, the predictions would lag behind the actual events and there would be insufficient capacity.

Reactive provisioning can be used to quickly correct any deviations from these unpredicted events.  Reactive provisioning is used to plan on a shorter time scale, perhaps on a several-minute basis.  If anomalies are detected, reactive provisioning can quickly allocate additional capacity to the affected tiers so they don’t become a bottleneck.

In [5], the authors implement reactive provisioning by comparing the predicted session arrival rate λpred(t) and the observed arrival rate λobs(t) for a time period t.  If these two measurements differ by more than a threshold, they take corrective provisioning action.  In [2], the resource manager monitors the average latency received from each workload scheduler during each sampling period.  The resource manager uses a smoothed latency average computed as an exponentially weighted mean of the form WL = α × L + (1 − α) × WL, where L is the current query latency.  When the α parameter is larger, the average is more responsive to the current latency.  In [6], the authors attempt to reactively provision limited resources by using an auto-regressive moving-average (ARMA) model, where two parameters a1(k) and a2(k) capture the correlation between the application’s past and present performance, and b0(k) and b1(k) are vectors of coefficients capturing the correlation between the current performance and the recent resource allocations.  Note that if the spike in traffic is large, it may require several rounds of reactive provisioning to bring the capacity to an acceptable level.
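The exponentially weighted mean from [2] is easy to state in code; this sketch (the function name is mine) folds a series of latency samples into the smoothed value WL:

```python
def smoothed_latency(samples, alpha=0.3):
    """Fold latency samples into WL = alpha * L + (1 - alpha) * WL.
    A larger alpha makes WL track the current latency L more closely;
    a smaller alpha damps out short-lived spikes."""
    wl = samples[0]
    for latency in samples[1:]:
        wl = alpha * latency + (1 - alpha) * wl
    return wl
```

With a sudden spike in the samples, a large α moves WL most of the way toward the spike while a small α barely reacts, which is exactly the responsiveness trade-off described above.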

In [8], the authors consider provisioning resources based on “what-if” analysis.  They argue that most web applications consist of services that form a directed acyclic graph, which can be formed into a decision tree.  In their model, they ask each tier to predict, online, its future performance in the event it received an additional capacity unit or had one capacity unit removed.  The performance model in [8] uses the following equation to calculate response time:

where Rserver is the average response time of the service, n is the number of CPU cores assigned to the service, λ is the average request rate and Sserver is the mean service time of the server.  When a threshold of service time is exceeded, re-computation of the service time occurs.  These predictions are given to the parent node.  Using these predictions, provisioning agents negotiate resources with each other based on maximum performance gain and minimum performance loss.  The root node selects which services to provision across the tree when the SLA is about to be violated, or to de-provision, if resources can be removed without causing an SLA violation. Furthermore, in [8], the authors also consider how provisioning cache instances using the previously described “performance promises” would affect workloads in a particular tier and all child tiers due to an increased hit rate.  Figure 3 from [8] illustrates the decision process:

Figure 3

We can see that a service i has k immediate children services and it aggregates its own performance promises as follows:

Lastly, while the majority of the papers reviewed were concerned with provisioning additional capacity, [2] also considered removing unneeded capacity.  If the average latency is below a low-end threshold, the resource manager triggers a resource removal.  The system then performs a temporary removal of the resource.  If the average latency remains below the low threshold, the resource is permanently removed.  A temporary removal is performed first because mistakenly removing a resource is a potentially costly operation if it negatively impacts system performance.  The main motivation behind this logic is that under-utilized resources are wasted resources.
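The temporary-then-permanent removal policy from [2] might look like the following sketch (the observation window and every name here are my own guesses at the mechanism, not the paper's code):

```python
def try_remove_resource(measure_latency, remove, restore,
                        low_threshold, observe_periods=3):
    """Temporarily remove a resource, watch latency for a few sampling
    periods, and only make the removal permanent if latency stays below
    the low threshold; otherwise roll the removal back."""
    remove()  # temporary removal
    for _ in range(observe_periods):
        if measure_latency() >= low_threshold:
            restore()  # removal hurt performance: roll back
            return False
    return True  # latency stayed low: make removal permanent
```

The rollback path is what makes the operation safe: the cost of a mistaken removal is bounded by a few sampling periods of degraded latency rather than a sustained SLA violation.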

When comparing the effectiveness of reactive and proactive provisioning, Figure 4 from [5] shows that proper proactive provisioning can greatly reduce the time spent in violation of the SLA.

Figure 4

In addition to handling smoothly increasing workloads, the provisioning techniques from [5] can also handle sudden load bursts effectively, as shown in Figure 5.

Figure 5

By combining predictive and reactive provisioning, systems are able to have sufficient capacity to handle predictable workloads as well as short-term instabilities.

How Much To Provision

The question of “how much of a given resource to provision” is less straightforward than “when to provision.”  As stated earlier, it may be necessary to go through several rounds of provisioning before the end-to-end delay is at an acceptable level.  Moreover, depending on the implementation and type of controller used, the model determining “how much” can differ.  However, all provisioning algorithms attempt to provision enough to meet the SLA, within a time period t, without overprovisioning, because that would be a waste of resources.  Additionally, one critical factor to consider and try to avoid is that increasing the provisioning at a particular tier k might create a bottleneck at a downstream tier k + n.  Several authors in the survey explore how to avoid this phenomenon.

The authors in [5] present a general framework for determining provisioning needs based on average end-to-end latency.  Consider a system that has n tiers, denoted by T1, T2, … TN, and let R denote the desired end-to-end delay.  Suppose further that the end-to-end response times are broken down into per-tier response times denoted by d1, d2, … dN, such that Σ di = R.  Lastly, assume that the incoming session rate is λ.  Typically, one would want to provision the system for the worst-case scenario, that is, the peak of λ.  Individual server capacity can be modeled using M/M/1 first-come, first-served (FCFS), non-preemptive queues, where each request in the queue has a certain amount of work to do.

Assuming the service rate of the queue is μ, then ρ = λ ⁄ μ is the service ratio (utilization).  If ρ is less than 1, then the queuing delay, that is, the amount of time a request waits in the queue before being serviced, is bounded and finite.  If ρ is equal to 1, the queue length is unbounded, but the queuing delay is infinite only if the inter-arrival times of requests are not deterministic.  Lastly, if ρ is greater than 1, then the queuing delay is infinite.  From this model we can derive useful quantities, such as the response time (the queuing delay plus the time it takes to execute the request).  This behavior can be modeled with the queuing theory result [9]:

    λi ≥ [ si + (σa² + σb²) ⁄ (2 · (di − si)) ]⁻¹

where di is the average response time for tier i, si is the average service time for a request at that tier, and λi is the request rate that a single server at tier i can sustain.  σa² and σb² are the variance of inter-arrival time and the variance of service time, respectively, which can be monitored online.  This equation gives a lower bound on the request rate a single server at tier i can service while meeting the per-tier delay di.  Assuming a think-time of Ζ, each active session issues requests at a rate of (1 ⁄ Ζ) [5].  A peak session rate of λ with average session duration τ yields λ·τ concurrent sessions, so the number of servers ηi needed at tier i to service a peak request rate can be computed as:

    ηi = ⌈ (βi · λ · τ) ⁄ (Ζ · λi) ⌉

where βi is a tier-specific constant capturing how many tier-i requests an incoming request generates, τ is the average session duration, λi is the capacity of a single server and λ is the peak session arrival rate.  The authors in [5] assume that the servers in each tier are load balanced, although other models do exist that explore provisioning without load balancing, such as [10].
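To make the model concrete, the capacity bound and the server count can be computed directly.  The following sketch is our own illustration of the equations above; all numeric parameters (service times, variances, session characteristics) are invented, not taken from [5]:

```java
// Sketch of the capacity model described above; all numbers are
// hypothetical examples, not values from [5].
public class Provisioning {
    // Lower bound on per-server capacity (requests/sec) from the
    // queuing result: lambda_i >= 1 / (s + (varA + varB) / (2 (d - s)))
    static double serverCapacity(double s, double d, double varA, double varB) {
        return 1.0 / (s + (varA + varB) / (2.0 * (d - s)));
    }

    // Servers needed at tier i: eta = ceil(beta * lambda * tau / (Z * cap))
    static int serversNeeded(double beta, double sessionRate,
                             double sessionDuration, double thinkTime,
                             double perServerCapacity) {
        double tierRequestRate = beta * sessionRate * sessionDuration / thinkTime;
        return (int) Math.ceil(tierRequestRate / perServerCapacity);
    }

    public static void main(String[] args) {
        // 10 ms average service time, 50 ms per-tier delay target
        double cap = serverCapacity(0.010, 0.050, 0.0004, 0.0001);
        // 10 sessions/sec peak, 300 s sessions, 5 s think time, beta = 1.5
        System.out.println(serversNeeded(1.5, 10, 300, 5, cap));
    }
}
```

Note how the per-server capacity falls as the delay target di approaches the raw service time si, which is why several rounds of provisioning may be needed when delay targets are tightened.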

In [7], the authors use a neural fuzzy controller to determine how much of a particular resource should be (re)allocated.  Their controller design has four layers (see Figure 6).

Figure 6

In layer 1, the input variables e(k) and Δe(k) are passed into the controller’s input nodes.  In layer 2, each node acts as a “linguistic term” assigned to one of the input variables from layer 1, using its membership function to determine the degree to which an input value belongs to a fuzzy set (e.g., “negative large” might correspond to the numeric value −50).  Each node in layer 3 multiplies the relevant outputs from layer 2 to determine the firing strength of a particular rule.  Lastly, in layer 4, the output of layer 3 is “defuzzified” into a numeric resource adjustment Δm(k).  The magnitude of the adjustment is determined by the online learning of the neural fuzzy controller, described in more detail in [7].  This online learning algorithm adapts quite rapidly to both stationary and dynamic workloads to guarantee the 95th-percentile delay, as seen in Figure 7 from [7].

Figure 7

In [6], the authors use an optimizer to determine changes in allocations of finite resources such as disk and CPU.  This is a significant departure from other models, in which it is assumed that there is a (much) larger, essentially unlimited, pool of resources to draw increased allocations from.  Their optimizer computes ura(k), the resource allocation required to meet the performance target or SLA.  The optimizer’s high-level goal is to determine an allocation in a stable manner (no resource oscillations) by using a cost minimization function to find the optimum resource assignments:

    J = Jp + Jc

where Jp = (yna(k) − 1)² serves as a penalty for deviation of the application performance from the desired target and Jc = ||ura(k) − ua(k − 1)||² serves as a control cost that improves the stability of the system.  The controller attempts to minimize the linear combination of both functions.  One important point to note in this particular allocation scheme is that allocation requests are confined by the following constraint:

    Σi urirt ≤ 1

where urirt is the requested allocation for application i of resource r at tier t.  Stated another way, the allocations for all applications for a particular resource, such as CPU or disk, at a particular tier must sum to less than or equal to 100%.  When this constraint is violated, that resource is experiencing contention.  It is important to note that in this allocation scheme, while there exists the potential for every application to receive adequate capacity to satisfy demand, when contention exists, applications are weighted according to their priorities in the minimization function, so that the most “important” application receives the largest share of the divided resource.  Clearly, because we cannot allocate more than 100% of a node’s physical CPU, disk or other resource, all applications suffer some performance degradation under contention; the only difference is the extent of that degradation.
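A simplified sketch of the constraint check and of priority-weighted division under contention follows.  The proportional-share policy here is our own simplification for illustration; the real optimizer in [6] folds priorities into its cost minimization rather than dividing shares directly:

```java
import java.util.Arrays;

// Simplified sketch of the allocation constraint: requested shares of one
// resource at one tier must sum to at most 1.0 (100%).  Under contention,
// this sketch divides the resource in proportion to application priority.
public class AllocationCheck {
    static boolean contended(double[] requestedShares) {
        return Arrays.stream(requestedShares).sum() > 1.0;
    }

    static double[] resolve(double[] requestedShares, double[] priorities) {
        if (!contended(requestedShares)) {
            return requestedShares.clone();  // everyone gets what they asked for
        }
        double totalPriority = Arrays.stream(priorities).sum();
        double[] granted = new double[requestedShares.length];
        for (int i = 0; i < granted.length; i++) {
            // higher-priority applications receive the larger share of 100%
            granted[i] = priorities[i] / totalPriority;
        }
        return granted;
    }
}
```

Under contention every application still gets something, but the degradation is skewed toward the low-priority applications, matching the behavior described above.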

Other authors, of course, decide how much to provision based on their controller implementation.  In [2], the authors use a k-nearest-neighbors classifier to decide the number of databases to be provisioned.  Lastly, the authors in [8] use a decision-tree process to determine the maximum performance gain (and minimal performance loss) when allocating or removing resources.  The performance of the algorithm in [8] was quite impressive at observing its SLA and remaining well-provisioned without wasting resources, as seen in Figure 8.

Figure 8

Before looking at additional MAPE-K components, it is worth examining two additional enhancements to resource provisioning: preventing downstream bottlenecks and preventing allocation oscillations.

A. Preventing Downstream Bottlenecks

The relationship between an incoming request and the work units at successive tiers is not necessarily 1-to-1.  A single incoming search request at the web tier may trigger one or more query requests at a database tier.  Therefore, downstream tier capacity has to be considered when allocating additional capacity at upstream tiers.

First, consider what would happen if downstream tiers were not adjusted for additional incoming requests.  When an increase in allocation is triggered by an SLA violation at tier k, that tier’s capacity increases and it can now pass n additional requests downstream.  If the downstream tier k + 1 is almost at capacity, that tier now has an additional n requests to service, which has the potential to cause an SLA violation there.  When the resource manager detects a potential violation at tier k + 1, there will be another round of provisioning.  In the meantime, because it takes a fixed amount of time to provision additional resources, tier k + 1 could be in violation.  Following this pattern, a single adjustment could cause up to k provisioning events.  This cascading chain of allocation, violation and further provisioning, in addition to being wasteful, increases the total time spent in violation of the SLA.  [5] proposes a solution using the tier-specific constant βi, which is greater than one if a request triggers multiple requests at tier i, or less than one if caching at prior tiers reduces the work at tier i.

B. Preventing Allocation Oscillations

Because internet traffic is bursty by nature, there is the potential for rapidly fluctuating load in the system.  A necessary enhancement to the system controller is therefore a filter that smooths or dampens very brief load spikes.  This is not to say that we don’t want to respond to SLA violations, but adding or removing capacity can be a very costly and time-consuming operation.
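One simple realization of such a filter, a minimal sketch of our own rather than a design from the surveyed papers, is an exponentially weighted moving average of the observed load combined with a provisioning threshold:

```java
// Minimal sketch: an EWMA smooths the observed load, and provisioning is
// triggered only when the smoothed value crosses a threshold, so very
// brief spikes are damped while sustained load still triggers action.
public class LoadFilter {
    private final double alpha;   // smoothing factor, 0 < alpha <= 1
    private double smoothed;

    LoadFilter(double alpha, double initialLoad) {
        this.alpha = alpha;
        this.smoothed = initialLoad;
    }

    // Fold a new load sample into the smoothed estimate.
    double observe(double load) {
        smoothed = alpha * load + (1 - alpha) * smoothed;
        return smoothed;
    }

    boolean shouldProvision(double threshold) {
        return smoothed > threshold;
    }
}
```

A smaller alpha damps spikes more aggressively, at the cost of reacting more slowly to genuine sustained increases; tuning it is exactly the trade-off between responsiveness and oscillation discussed above.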

In addition, in [2], the authors track the state of each allocation as either a STEADY or a MIGRATION state.  During the MIGRATION state, it may take some time after the SLA violation triggers an additional capacity allocation for the migration of the additional resources to occur and for them to “warm up.”  [2] found that latency may continue to be high, or even increase, during this period, and as such, samples taken during this period may not reflect the actual system state.  Furthermore, they found that making decisions based on samples from the migration and warm-up period may continue to add unnecessary replicas, which will need to be removed later.  This is wasteful and could penalize other applications hosted on the same machine.

For additional information on each model and provisioning algorithm, refer to the related paper.

Effectors

The execution of the plan in the MAPE-K loop happens through the effectors.  In some of the papers surveyed, such as [5], the authors used request policing to ensure that the admission rate does not exceed capacity.  Any extra sessions beyond capacity are dropped at the threshold of the system.  Once a request is admitted, it is not dropped at any intermediate tier.  Alternatively, each tier could have its own request-policing mechanism; however, this is quite wasteful, as requests that have already had work done on them may be dropped at downstream tiers.

In the majority of the papers [2, 5, 6, 7, 8], additional capacity was allocated from some pool of free resources.  If no more free resources existed, or a resource had reached its replication maximum, resources were re-provisioned or the sessions and requests were capped at a maximum.  In [2], additional database capacity was added to the tier by replicating the existing database.  During the migration period, requests were held, and when the migration was complete they were “played” in stages to bring the new copy up to date with the current copy, including any additional writes that occurred during migration.

In [5], the authors used agile server switching.  In this model, each node had a copy of the application running in a separate virtual machine (VM); however, only one application was active at a time.  When the control plane determined that it needed to allocate additional capacity, and the application running on a node scheduled for replacement was not the application receiving the additional capacity, sessions and requests were ramped down using fixed-rate or measurement-based techniques.  Once all the sessions of the previous application in that VM had concluded, the VM was suspended and the new application was brought up.  This allowed applications to be provisioned for additional capacity quickly.

However, admission control may not always be the best choice, and [3] presents an alternative to dropping sessions: serving some clients in a “degraded mode.”  Using traditional control theory (a closed loop that reacts to sensors and actuators), the load on the server can be readily adjusted to stay within the bounds of process schedulability given the available resources.

The actuator, or effector, is responsible for performing a specific action based on the controller output.  In the case of [3], the authors initially used admission control as the way of controlling server load.  This limits the number of clients the server responds to concurrently.  All other clients are rejected; while this is not a favorable outcome for them, it is a way of preserving QoS for the admitted clients.

However, this actuator can be adapted to provide a “degraded” level of service, where the content served to a degraded client is different from that which the “full service” client receives.  For example, the degraded content may include fewer images.  To perform this degradation, there must be multiple versions of the content, and the web server serves the content from the appropriate file structure at request time.  The control variable can be written as m = I + F, where the integer part I is a service level and the fraction F is the fraction of clients served at the next level up, I + 1 (the remainder are served at level I).  When m = M, all clients receive the highest service level; conversely, when m = 0, all clients are rejected.  This allows the more general statement that a percentage of clients will receive degraded service, rather than specifying exactly which clients (individual clients may be assigned through a mechanism such as hashing their IP address).  The server load is controlled through the input variable m.
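A sketch of this scheme, with hypothetical names, using the hashing-based client assignment the text alludes to:

```java
// Sketch of fractional service levels: m = I + F means a fraction F of
// clients is served at level I + 1 and the rest at level I; level 0 is
// rejection.  The hash-based client assignment is one possible mechanism.
public class ServiceLevels {
    // Map a client id deterministically to a value in [0, 1).
    static double clientFraction(String clientId) {
        return (clientId.hashCode() & 0x7fffffff) / ((double) Integer.MAX_VALUE + 1);
    }

    // The content tree (service level) this client should be served from.
    static int levelFor(String clientId, double m) {
        int i = (int) Math.floor(m);
        double f = m - i;
        return clientFraction(clientId) < f ? i + 1 : i;
    }
}
```

Because the assignment is a deterministic function of the client, each client consistently sees the same service level until the controller moves m.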

The monitor proposed in [3] expresses the utilization U as a function of the served request rate R and the delivered byte bandwidth W (U = aR + bW, where a and b are constants derived through linear regression).  This function approximates utilization well enough to keep a deadline-monotonic scheduler below its schedulability bound of U &lt; 0.58, so that deadlines are met.  Using this scheme of admission control and “capacity” planning, the authors in [3] were able to admit almost 3 times the number of requests to the system, as shown in Figure 9, before significant service degradation occurred.

Figure 9
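An admission test built on this monitor can be sketched as follows; the regression constants a and b below are invented for illustration, not taken from [3]:

```java
// Sketch of the utilization monitor U = a*R + b*W with an admission test
// against the deadline-monotonic schedulability bound of 0.58.  The
// constants A and B are hypothetical; in [3] they come from offline
// linear regression against measured server utilization.
public class AdmissionSketch {
    static final double A = 0.001;    // utilization cost per request/sec
    static final double B = 0.00002;  // utilization cost per KB/sec

    static double utilization(double requestRate, double bandwidth) {
        return A * requestRate + B * bandwidth;
    }

    // Admit load only while the schedulability bound still holds.
    static boolean admit(double requestRate, double bandwidth) {
        return utilization(requestRate, bandwidth) < 0.58;
    }
}
```

In the full design, crossing the bound does not have to mean rejection: the controller can instead lower the content variable m so that the same request rate costs less delivered bandwidth.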

Sensors

Sensors are part of the knowledge phase in the MAPE-K loop.  The data they report to the control plane can cover any number of metrics.  In the papers surveyed, authors used CPU usage [5, 6], bandwidth, throughput, active network connections [2, 5], I/O transfers [2, 6], cache hits [8], end-to-end delay or response time [5, 7], session arrival rate [5], request arrival rate [5], read/write ratios [2], lock ratios [2] and memory usage [2].  These parameters can be recorded and monitored online or offline, using standard system tools such as vmstat, iostat and sysstat, or custom-written software.

IV. Quality of Service & Performance Isolation

Lastly, the control theory approach was extended in [3] using virtual servers to show support for performance isolation (each virtual server can guarantee a maximum request rate and maximum throughput), service differentiation (each virtual server supports request prioritization) and excess capacity sharing (conforming virtual servers under overload can temporarily exceed capacity to avoid client degradation).

With performance isolation, a virtual server is configured with a maximum request rate RMAX and a maximum delivered bandwidth WMAX.  This expresses a throughput agreement to serve up to the specified levels.  If RMAX is exceeded, then the agreement on WMAX is revoked.

In performance isolation, the virtual server can adapt content as previously mentioned to stay within the agreement bounds.  When the virtual servers are configured, if the sum of their individual utilizations Ui is less than the schedulability bound of 0.58, then the system capacity is planned correctly.  To perform the correct bookkeeping, load classification is done when the request arrives, and the requests served and bandwidth delivered are then charged against the corresponding virtual server.  Lastly, based on the rates for virtual server i of requests (Ri) and delivered bandwidth (Wi), the control loop achieves the degree of content degradation necessary to keep the utilization of that virtual server at the proper level, preventing overload.

In service differentiation, the goal is to support a number of different clients at different service levels (lower-priority clients are degraded first).  In [3], if there are m priority levels, where 1 is the highest priority, the capacity of the virtual server is made available to clients in priority order.  For the purposes of their research, [3] used two service levels, premium and basic, where premium service was guaranteed and basic service was not.

Lastly, in sharing excess capacity, if the load on one virtual server exceeds its maximum capacity C, then the server under overload may temporarily utilize the resources of underutilized virtual servers.  This requires only a simple modification to the control loop.

V. Conclusion

In this survey, we presented information on burstiness in network traffic and how it affects service times.  We also showed how autonomic systems can provision themselves to handle the dynamic workloads.  Dynamic provisioning in multi-tier applications raises some interesting and unique challenges.  As distributed systems become more complex, the techniques surveyed in this paper will become more useful and relevant.


[1]     M.C. Huebscher, J.A. McCann.  A survey of autonomic computing: Degrees, models, and applications.  ACM Computing Surveys, 40(3), 2008.
[2]     J. Chen, G. Soundararajan, C. Amza. Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers. Proc. of IEEE ICAC, 2006.
[3]     T. Abdelzaher, K.G. Shin, N. Bhatti. Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach. IEEE Transactions on Parallel and Distributed Systems, 13(1), 2002.
[4]     N. Mi, G. Casale, L. Cherkasova, E. Smirni. Burstiness in multi-tier applications: Symptoms, causes, and new models. Proc. of ACM/IFIP/USENIX Middleware, 2008.
[5]     B. Urgaonkar, P. Shenoy, A. Chandra, P. Goyal, and T. Wood. Agile dynamic provisioning of multi-tier Internet applications. ACM Trans. on Autonomous and Adaptive Systems, 3(1):1-39, 2008.
[6]     P. Padala, K.-Y. Hou, K.G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, and A. Merchant. Automated control of multiple virtualized resources. Proc. of EuroSys, 2009.
[7]     P. Lama and X. Zhou. Autonomic provisioning with self-adaptive neural fuzzy control for end-to-end delay guarantee. Proc. of IEEE/ACM MASCOTS, pages 151-160, 2010.
[8]     D. Jiang, G. Pierre, and C.-H. Chi. Autonomous resource provisioning for multi-service web applications. Proc. of ACM WWW, 2010.
[9]     L. Kleinrock. Queueing Systems, Volume 2: Computer Applications. John Wiley and Sons, Inc., 1976.
[10]   B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier Internet services and its applications. Proc. of ACM ICMMCS, Banff, Canada, 2005.

I recently spent a good amount of time trying to figure out why I could not retrieve an object from a session where I had previously stored it.  After doing a little thinking, I came up with the answer, and I hope this saves someone else an immense amount of time.  There is one setup item that needs to be done before you can start using the standard HttpSession objects in your App Engine project.  This line needs to go in your appengine-web.xml file:
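Sessions are disabled by default on App Engine, and the element that enables them is:

```xml
<!-- appengine-web.xml: goes inside the <appengine-web-app> root element -->
<sessions-enabled>true</sessions-enabled>
```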


If you don’t, you may find yourself with only read access to the session object.

In addition, you need to make sure that all the objects that you are going to persist to the session implement the java.io.Serializable interface.  This is particularly important, and it is what I failed to realize until I had struggled with it for 2 hours.  The reason the object needs to be serializable is that App Engine stores session data in the datastore and memcache.  Any objects referenced by the value you put in the session must be serializable as well, so that the entire object graph is available.  What I found interesting is that the session data must be committed in a transactional manner, because I had also stored a String in the session and that wasn’t persisted either.  If an object isn’t serializable, the app will NOT fail on the local development server, but it will fail when deployed to the cloud.

A bit of sample code:

public void doGet(HttpServletRequest req, HttpServletResponse resp) {
  HttpSession session = req.getSession(true);
  // Passing a boolean to getSession() lets you inspect whether a session
  // already exists.  If you pass true, an existing session is returned,
  // or a new one is created if none exists yet.  If you pass false and
  // no session exists, you get back null and no session is created; if
  // a session does exist, it is returned.
  String name = (String) session.getAttribute("name");
  session.setAttribute("age", 25);
}

You can check that everything is being stored correctly: on your machine, you should see a cookie for the domain you’re working in (dev: localhost; prod: your deployed domain) with the key JSESSIONID.  The value of JSESSIONID should match what is in the _ah_SESSION table in the App Engine datastore, where you can also visually inspect the serialized bytes of the session.

There is a small gotcha between the standard J2EE HttpSession and the GAE HttpSession: when a service modifies an object stored in the session, the changes will be lost when another service gets the object from the session.  The fix is to invoke setAttribute again after modifying the object.  This workaround solves the inconsistencies but has an important trade-off: every setAttribute triggers a new serialization and a write to the datastore.
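The gotcha can be demonstrated with a small stand-in for the App Engine session (our own simulation: like GAE, it stores a serialized copy of each attribute rather than a live object reference):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the GAE-backed HttpSession: attributes are
// kept as serialized bytes, so reads return a fresh copy, not the live
// object.  This mirrors why in-place mutation is lost on App Engine.
class FakeDatastoreSession {
    private final Map<String, byte[]> store = new HashMap<>();

    void setAttribute(String key, Serializable value) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);              // serialize, as GAE does
            oos.flush();
            store.put(key, bos.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    Object getAttribute(String key) {
        byte[] bytes = store.get(key);
        if (bytes == null) return null;
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();             // deserialize a fresh copy
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}

// A hypothetical session-stored object; note that it must be Serializable.
class Person implements Serializable {
    String name;
    Person(String name) { this.name = name; }
}
```

Mutating the Person returned by getAttribute changes only the local copy; calling setAttribute again is what persists the change, at the cost of another serialization and write.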

Lastly, depending on the utilization of your application, you may find yourself with more than a few stored sessions.  When a session is written to the datastore, its expiration is set to System.currentTimeMillis() + 24 * 60 * 60 * 1000, that is, 86,400,000 ms (24 hours) in the future.  The _expires field is updated each time the session is active, so this could amount to quite a bit of data storage.  There is currently no automatic removal of expired sessions in GAE.

One last note: remember that App Engine is a distributed architecture, so unlike J2EE, you are never guaranteed that a request will be processed by the same application server instance as the previous request.  Even though the object is serialized correctly into memcache, you still have to call setAttribute() every time, because memory is not shared between instances.

This research article was co-authored by David Stites (dstites[at] and Jonathan Snyder (jsynder4[at]

Abstract—A common problem for many web sites is achieving a high ranking in search engines.  Many searchers make quick decisions about which result they will choose, and unless a web site appears within a certain threshold of the top rankings, the site may receive little traffic from search engines.  Search engine optimization, while widely regarded as a difficult art, provides a simple framework for improving the volume and/or quality of traffic to a web site that uses its techniques.  This paper is a survey of how search engines work broadly, of SEO guidelines, and a practical case study using the University of Colorado at Colorado Springs’ Engineering and Science department home page.

Index Terms—search engine optimization, search indexing, keyword discovery, web conversion cycle, optimized search, organic search, search marketing

1.  Introduction

Search engines are the main portal for finding information on the web.  Millions of users each day search the internet looking to purchase products or find information.  Additionally, users generally only look at the first few results; if a website is not in the first couple of pages, it will rarely be visited.  All these factors give rise to the many techniques employed to raise a website’s search engine ranking.

Search engine optimization (SEO) is the practice of optimizing a web site so that the website will rank high for particular queries on a search engine.  Effective SEO involves understanding how search engines operate, making goals and measuring progress, and a fair amount of trial and error.

Discussed in the following sections is a review of search engine technology, a step by step guide of SEO practices, lessons learned during the promotion of a website, and recommendations for the Engineering and Applied Science College of the University of Colorado at Colorado Springs.

In this survey, we spend our research efforts determining how SEO affects organic search results and do not consider paid inclusion.  For more information on paid inclusion SEO, see [1].

2.  A Review of Search Engine Technology

Search engine optimization techniques can be organized into two categories: white hat and black hat.  White hat practices seek to raise search engine rankings by providing good and useful content.  On the other hand, black hat techniques seek to trick the search engine into thinking that a website should be ranked high when in fact the page provides very little useful content.  These non-useful pages are called web spam because, from the user’s perspective, they are undesired web pages.

The goal of the search engine is to provide useful search results to a user.  The growth of web search has been an arms race where the search engine develops techniques to demote spam pages, and then websites again try to manipulate their pages to the top of popular queries.  Search engine optimization is a careful balance of working to convince the search engine that a page is relevant and worthwhile for certain searches while making sure that the page is not marked as web spam.

The accomplishment of this objective requires understanding of how a search engine ranks pages, and how a search engine marks pages as spam.  Although the exact algorithms of commercial search engines are unknown, many papers have been published giving the general ideas that search engines may employ.  Additionally, empirical evidence of a search engine’s rankings can give clues to its methods.  Search engine technology involves many systems such as crawling, indexing, authority evaluation, and query time ranking.

2.1  Crawling

Most of the work of the search engine is done long before the user ever enters a query.  Search engines crawl the web continually, downloading content from millions of websites.  These crawlers start with seed pages and follow all the links contained on all the pages that they download.  For a URL to ever appear in a search result page, the URL must first be crawled.  Search engine crawlers identify themselves in HTTP traffic with a particular user-agent, which websites can log.

In the eyes of search engine crawlers, not all websites are created equal.  In fact, some sites are crawled more often than others.  Crawlers apply sophisticated algorithms to prioritize which sites to visit.  These algorithms can determine how often a webpage changes, and how much of it is changing [7].  Additionally, depending on how important a website is, the crawler may crawl some sites more deeply.  One goal of a search engine optimizer is to have web crawlers crawl the website often and to crawl all of the pages on the site.

Search crawlers look for special files in the root of the domain.  First they look for a “robots.txt” file, which tells the crawler which pages not to crawl.  This can be useful for keeping the crawler away from admin pages, since only certain people have access to those pages.  The robots.txt file may contain a reference to an XML site map.  The site map is an XML document that lists every page that is part of the website.  This can be helpful for dynamically generated websites where the pages may not all be reachable by links.  An example of this is the Wikipedia site.  Wikipedia has a great search interface, but crawlers do not know how to use the search or what to search for.  Wikipedia has a site map which lists all the pages it contains, enabling the crawler to know for sure that it has crawled every page.  When optimizing a website, it is important to make it easy for the site to be completely crawled.  Crawlers are wary of getting caught in infinite loops of pages or downloading content that will never be used in search result pages. [5]
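A minimal robots.txt combining both mechanisms might look like the following (the disallowed path and sitemap URL are placeholders):

```
User-agent: *
Disallow: /admin/

Sitemap: http://www.example.com/sitemap.xml
```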

2.2  Indexing

After the search engine has crawled a particular page, it then analyzes the page.  In its most simple form, the search engine throws away all the HTML tags and simply looks at the text.  More sophisticated algorithms look more closely at the structure of the document to determine which sections relate to each other, which sections contain navigational content, and which sections have large text, and weight the sections accordingly. [3]  Once the text is extracted, information retrieval techniques are used to create an index.  The index is a listing of every keyword on every webpage.  It can be thought of like the index of a book: a book’s index tells on which page a subject is mentioned, while in this case the index tells which websites contain particular words.  The index also stores a score for how many times each word appears, normalized by the length of the document.  When creating the index, words are first converted into their root form.  For example, the word “tables” is converted to “table”, “running” is converted to “run”, and so forth.
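The indexing steps described above (strip markup, stem words, record length-normalized counts) can be sketched as a toy inverted index.  The crude suffix-stripping stemmer below is our own stand-in for a real stemming algorithm such as Porter’s:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy inverted index: maps each (crudely stemmed) word to the documents
// containing it, with a count normalized by document length.
public class TinyIndex {
    private final Map<String, Map<String, Double>> index = new HashMap<>();

    // Extremely naive suffix stripping, for illustration only.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 4) {
            w = w.substring(0, w.length() - 3);
            // collapse a doubled final consonant: "running" -> "runn" -> "run"
            if (w.length() > 2 && w.charAt(w.length() - 1) == w.charAt(w.length() - 2)) {
                w = w.substring(0, w.length() - 1);
            }
            return w;
        }
        if (w.endsWith("s") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    void add(String docId, String text) {
        String[] words = text.split("\\W+");
        for (String word : words) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(stem(word), k -> new HashMap<>())
                 .merge(docId, 1.0 / words.length, Double::sum);
        }
    }

    // Documents containing the word, with normalized frequencies.
    Map<String, Double> lookup(String word) {
        return index.getOrDefault(stem(word), Collections.emptyMap());
    }
}
```

Because both indexing and lookup pass through the same stemmer, a query for “run” matches a document containing only “running”, which is exactly the behavior described in the text.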

In order for a page to appear in the search engine results page, the page must contain the words that were searched for.  When promoting a website, particular queries should be targeted, and these words should be put on the page as much as possible.  However, this kind of promoting is abused.  Indeed, some sites are just pages and pages of keywords with advertisements designed to get a high search engine ranking.  Other tactics include putting large lists of keywords on the page, but making the keywords only visible to the search engine.  In fact some search engines actually parse the html and css of the page to find words that are being hidden to the user.  These kinds of tactics can easily flag a site as being a spam site.  Therefore one should make sure that the targeted keywords are not used in excess and text is not intentionally hidden.

One problem that many pages have is using images to display text.  Search engines do not bother trying to read the text in images.  This can be damaging to a site, especially when the logo is an image.  A large portion of searches on the internet are navigational searches, in which people are looking for a particular page.  For these kinds of searches, it is hard to rank high in the search engine result page when the company brand name does not appear on the page except in an image.  One alternative is to provide text in the alt field of the image tag.  This may not be the best option, however, because the text in the alt field is text that the user does not normally see; it is therefore prone to keyword abuse and hence suspicious to the search engine.

2.3  Authority Evaluation

One of the major factors that set Google apart from other search engines is its early use of PageRank as an additional factor when ranking search results. [3]  Google found that because the web is not a homogeneous collection of useful documents, many searches would yield spam pages or pages that were not useful.  In an effort to combat this, Google extracted the link structure of all the documents.  The links between pages are viewed as a sort of endorsement from one page to another.  The idea is that if Page A links to Page B, then the quality of Page B is as high or higher than that of Page A.

The mathematics behind this is a random-surfer model.  The web surfer starts on any page on the internet.  With probability 85%, the random surfer follows a link on the current page; with probability 15%, the surfer jumps to a random page on the internet.  The PageRank of a given page is the probability that the surfer is visiting that page.
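The random-surfer model can be computed by simple iteration.  The following sketch uses the 85%/15% split from the text and, for brevity, ignores dangling pages (pages with no outgoing links):

```java
// Power-iteration sketch of PageRank:
//   PR(p) = 0.15/N + 0.85 * sum over in-links q of PR(q) / outdegree(q)
public class PageRank {
    // links[i] holds the indices of the pages that page i links to.
    static double[] compute(int[][] links, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);        // start uniform
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, 0.15 / n); // the 15% random jump
            for (int i = 0; i < n; i++) {
                for (int target : links[i]) {
                    // page i endorses each target with an equal share
                    next[target] += 0.85 * pr[i] / links[i].length;
                }
            }
            pr = next;
        }
        return pr;
    }
}
```

Even a page with no in-links keeps a baseline rank from the random jump, which is precisely what link farms exploit.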

One of the problems with this model is that any page on the internet will have some PageRank, even if it is never linked to.  This baseline page rank can be used to artificially boost the PageRank of another page.  Creating thousands of pages that all link to a target page can generate a high PageRank for the target page.  These pages are called link farms.

Many techniques have been proposed to combat link farms.  In the “Trust-rank” algorithm [4], a small number of pages are reviewed manually to determine which pages are good.  These pages are used as the pages that the web surfer will go to 15% of the time.  This effectively eliminates the ability to create link farms because a random page on the internet must have at least one page with trust-rank linking to it before it has any trust-rank.   Another technique is to use actual user data to determine which sites are spam.  By tracking which sites a user visits and how long the visits are, a search engine can determine which are the most useful pages for a given query. [6]

2.4  Result Ranking

With traditional information retrieval techniques, the results are ranked according to how often the query terms appear on a page.  Additionally, the query terms are weighted according to how often they occur across all documents.  For example, the word “the” is so common in the English language that it is weighted very low.  On the other hand, the word “infrequent” occurs much less often, so it is given a higher weight.
In the early days of the World Wide Web, search engines simply ranked pages according to the frequency of the query words on a page relative to the page’s length.  This algorithm is relatively simple to fool.  One can optimize landing pages to contain high frequencies of the targeted query terms.  To combat this, the relevance score and the authority score are combined to determine the final ranking.  In order for a document to appear in the search engine result page it must match the keywords in the query, but the final ranking is largely determined by the authority score.
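The term-weighting scheme described above is the familiar inverse document frequency.  A toy illustration, with a made-up four-document corpus and one common smoothed formula variant (an assumption; implementations differ in the exact formula):

```python
import math

# Toy illustration of inverse document frequency: words that occur in
# many documents get a low weight, rare words get a high weight.
# The corpus and the smoothing (+1) are illustrative assumptions.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "an infrequent word appears here",
    "the weather is nice today",
]

def idf(term):
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / (1 + containing))

# "the" appears in three of four documents, so its weight is far lower
# than that of "infrequent", which appears in only one.
```

A term-frequency count multiplied by this weight gives the classic TF-IDF relevance score for a page.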

3. Guidelines to SEO

Equipped with the knowledge of how search engines index and rank web sites, one has to tailor content in a way that gives the best opportunity for being ranked highly.  There are many guidelines to SEO, but we attempt to distill the most important ones into our survey here.  For a more complete picture of SEO best practices and information, see [1].

3.1  Refine The Search Campaign

The first thing the reader must consider is why they are going to do SEO on their web site.  In a broad sense, the purpose is to increase the volume or quality of traffic to one’s web site, but one needs to break this down into smaller subgoals.  We will look at defining the target audience to determine why people use search in the first place, and then we will examine how one can leverage the searcher’s intent to design a web site that will best serve that audience.

3.2  Define The Target Audience

Typically, searchers aren’t really sure of what they are looking for.  After all, that is why they are performing a search in the first place.  When designing a web site, the webmaster must be cognizant of the searcher’s intent.

Generally, a searcher’s intent can be broken down into 3 different categories.  Knowing which category a particular searcher might fall into is important in deciding how one designs a web site and which keywords one might choose.

  • Navigational searchers want to find a specific web site, such as JPMorgan Chase, and might use keywords such as “jp morgan chase investments web site.”  Typically, navigational searchers are looking for very specific information “because they have visited in the past, someone has told them about it because they have heard of a company and they just assume the site exists.  Unlike other types of searchers, there is only one right answer.” [1]  It is important to note that navigational searchers typically just want to get to the home page of the web site – not deep information.  With navigational searches, it is possible to bring up multiple results with the same name (i.e. A Few Guys Painting and A Few Guys Coding).
  • Informational searchers want information on a particular subject to answer a question they have or to learn more about a particular subject.  A typical query for an informational searcher might be “how do I write iPhone applications.”  Unlike navigational searchers, informational searchers typically want deep information on a particular subject – they just don’t know where it exists.  Typically, “information queries don’t have a single right answer.” [1]  By far, this type of search dominates the types of searches that people perform, and therefore the key to having a high ranking in informational searches is to choose the right keywords for your web site.
  • Transactional searchers want to do something, whether it is purchase an item, sign up for a newsletter, etc.  A sample transactional search query might be “colorado rockies tickets.”  Transactional queries are the hardest type of query to SEO for because “the queries are often related to specific products.” [1]  The fact that there are many retailers or companies that provide the same services only complicates matters.  In addition, it can be hard to decipher whether the searcher wants information or a transaction (i.e. “Canon EOS XSi”).

After understanding the target searcher, one has to consider how the searcher will consume the information they find.  According to Hunt et al., “nearly all users look at the first two or three organic search results, less time looking at results ranked below #3 and far less time scanning results that rank seventh and below.  Eighty-three percent report scrolling down only when they don’t find the result they want within the first three results.”

So what does this mean for people doing SEO?  Choosing the correct keywords and descriptions to go on one’s site is probably the most important action one can take.  Searchers choose a search result fairly quickly, and they do so by evaluating 4 different pieces of information included with every result: the URL, the title, the snippet, and “other” factors.

Graph 3.1 – Identifying the percentages of the 4 influencing factors of a search result. [1]

3.3  Identify Web Site Goals

Now that we have identified the types of searchers that are looking for information, let’s consider why one would attempt SEO in the first place.  For that, one needs to identify some goals, starting with deciding the purpose of the web site.  Generally, there are six types of goals:

  • Web sales.  This goal is selling goods or services online.  Note that this could be a purely online model or a mix between online and brick and mortar stores.  Web sales can be broken down further by specifying 1) whether the store is a retailer that sells many different products from different manufacturers or a manufacturer’s own site and 2) what type of delivery is available (instant or traditional shipping).  In all cases, the ultimate desire is to increase the number of sales online.  This type of site would benefit from being optimized with transactional searchers in mind.
  • Offline sales.  This goal involves converting web visitors into sales at brick and mortar stores.  While this type of goal would benefit from transactional optimization, the benefits from SEO are much harder to calculate because there isn’t as good a measure of sales success.  The most important concept that a webmaster of an offline sales site can remember is to always emphasize the “call-to-action”, which is the thing that you are trying to get someone to do – in this case, convert from a web visit to a physical sale.
  • Leads. Leads are conceptually the same as offline sales, however leads are defined by when the customer switches to the offline channel.  Customers who do some research online and then switch to offline are leads while customers who know model numbers, prices, etc. are offline sales.  With leads, the search optimization and marketing strategy needs to be different because customers who are leads are typically informational searchers instead of transactional searchers as in the case of web and offline sales.  Therefore, people who want to optimize for leads need to attract customers who are still deciding on what they want.
  • Market awareness.  If a goal for one’s web site is market awareness, this is an instance where paid placement could boost visibility more quickly than organic results, simply because the product or service isn’t well known yet.  With market awareness, the web site mainly exists to raise awareness, so one would want to optimize the site for navigational and informational searchers.
  • Information & entertainment. These sites exist solely to disseminate information or provide entertainment.  They typically don’t sell anything, but might raise revenue through ad placement or premium content, such as ESPN Insider.  Sites that focus on information and entertainment should focus almost exclusively on optimizing their site for informational searchers.
  • Persuasion.  Persuasion websites are typically designed to influence public opinion or provide help to people.  They are usually not designed to make money.  In order to reach the most people, sites like these should be designed for an informational searcher.

3.4  Define SEO Success

A crucial element of undertaking SEO is to determine what should be used to measure the “success” of the efforts performed.  Depending on what type of goal the webmaster or marketing department has defined for the website (web or offline sales, leads, market awareness, information and entertainment, or persuasion), there are different ways to measure success.  For example, the natural way for a site that does web or offline sales to determine success is to count conversions (i.e. the ratio of “lookers” to “buyers”).  If a baseline conversion ratio can be established prior to SEO, one can measure the difference between the old rate and the new rate after SEO.

This logic can be applied to any of the goals presented.  For example, with market awareness, you could measure conversion by sending out surveys to consumers or perhaps including some sort of promotion or “call-to-action” that would be unique to the SEO campaign.  If a consumer were to use that particular promotion or perform that action, it would give an indication of a conversion.

Another way to determine success is to analyze the traffic to the web site, and there are a couple of different metrics that can be used for this, including “page views and visits and visitors.” [1]  Using the most basic calculation with page views, you could determine a per-hour, per-day, per-week, per-month or even per-year average count of visitors to your website.  Using some tracking elements (such as cookies or JavaScript), you could determine the rise (or fall) of visitors to the site before and after SEO was performed.  The amount of information that this simple metric provides is invaluable, because you could also determine peak visiting hours, which pages had the most hits at a specific time, visitor loyalty (unique vs. returning visitors), visitor demographics (location, browser and operating system, etc.) and even the time spent per visit.
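The traffic metrics above can be computed from even a very simple access log.  The sketch below uses a made-up log with hypothetical visitor IDs (as might be assigned via a cookie) purely to illustrate the page-view and unique-vs-returning-visitor calculations:

```python
from collections import Counter
from datetime import date

# A sketch of the basic traffic metrics mentioned above, computed from
# a simplified access log.  The entries and visitor IDs are made up.
log = [
    (date(2010, 4, 1), "visitor-1", "/"),
    (date(2010, 4, 1), "visitor-2", "/services"),
    (date(2010, 4, 2), "visitor-1", "/portfolio"),
    (date(2010, 4, 2), "visitor-3", "/"),
    (date(2010, 4, 2), "visitor-3", "/contact"),
]

# Page views per day, and visits per tracked visitor.
page_views_per_day = Counter(day for day, _, _ in log)
visits_per_visitor = Counter(visitor for _, visitor, _ in log)

unique_visitors = len(visits_per_visitor)
returning_visitors = sum(1 for n in visits_per_visitor.values() if n > 1)
```

Comparing these numbers for the periods before and after SEO gives the rise (or fall) in traffic that the text describes.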

3.5  Decide on Keywords

Keywords are the most important element of SEO.  It is important that during SEO, one focuses on the keywords that searchers are most likely to use while searching.  When choosing keywords, it is important to consider several different factors such as keyword variations, density, search volume and keyword “competition.”

Keyword variations play an important role in choosing keywords because one has to consider how an average searcher may try to find information.  For example, certain keywords such as “review” or “compare” might be used interchangeably in a search query.  In addition, some keywords are brand names that people automatically associate with products, such as “Kleenex” for tissues and “iPod” for MP3 players.  If one has more variations on a certain sequence of keywords, one has a higher likelihood of being found.

Search volume is also a big factor in determining which keywords to choose, because one wants to choose keywords that people are actually searching for.  If a particular keyword only has 3,000 queries a month, it is much better to use a keyword that has 20,000 queries a month because 1) searchers are associating the higher-volume keyword with the subject instead of the lower-volume keyword and 2) using the higher-volume keyword will reach a larger audience.  However, this is not to say that low-volume keywords have no value.  In fact, the opposite is true.  Mega-volume keywords, such as brand names, are often fought over by companies competing for the highest ranking.  If one can achieve the same or nearly the same results with lower-contention keywords, the SEO process is much easier, because fewer companies target them, making it easier to achieve a high ranking.

Lastly, keyword density is an important factor.  Search engines have determined that anywhere from 3%-12% of the page is a good density for keywords, with 7% being optimal.  If a spider detects a higher percentage than this, it might consider the page spam, on the assumption that one is trying to stuff as many high-volume keywords into the page as possible without any relevant context or content for the user.  Typically, this results in a lower ranking, no ranking, or sometimes the page even being removed from the index or blocked.
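The density calculation itself is straightforward: occurrences of the keyword divided by total words on the page.  A small sketch, using the 3%-12% band from the text and a deliberately keyword-stuffed sample sentence (the page content is made up):

```python
import re

# A sketch of the keyword-density check described above.  The 3%-12%
# band comes from the text; the sample page content is made up and
# deliberately stuffed to trip the spam threshold.
def keyword_density(text, keyword):
    words = re.findall(r"[a-z']+", text.lower())
    return words.count(keyword.lower()) / len(words)

page = ("iPhone development and iPhone consulting for custom iPhone "
        "applications, mobile software and contract engineering services")

density = keyword_density(page, "iphone")      # 3 of 15 words = 20%
is_spammy = density > 0.12                     # beyond the upper bound
```

Running the same check with the target band 0.03–0.12 over each landing page is a quick way to audit a site before a spider does.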

3.5.1  Create Landing Pages for Keyword Combinations

The last step in assessing a site under consideration for SEO is to identify the pages on the site that specific keyword queries should lead to when a searcher enters that query.  For example, P&G might want “best laundry detergent” to lead to their Tide home page.  Landing pages are the pages that these queries lead to and are “designed to reinforce the searchers intent.” [1]  Each of the keywords or phrases identified in section 3.5 must lead to a landing page, and those pages must be indexed.  If there isn’t a landing page already for some of those keywords, then one must be created.

3.6  Page Elements That Matter And Don’t Matter

Now that the target audience, goals, keywords and landing pages are identified, it is crucial to consider the design of the web page.

  • Eliminate popup windows.  The content in popup windows is not indexed by spiders.  If important content, navigation, or links are contained inside a popup window, they will not be seen by the spider.  Such content needs to be moved outside the popup window.
  • Pulldown navigation.  Pulldown navigation suffers from the same problem as popup windows.  Since spiders cannot interact with these elements (mouse over or click on them), they cannot index the content, which creates a large problem if the navigation is done with pulldown menus.  Either the pulldown menu must be implemented in a compatible way or the site has to provide some other alternative means of navigation.
  • Simplifying dynamic URLs.  Pages that use dynamic content and URLs must be simplified for the spiders to crawl.  The nature of a dynamic URL means that a spider could spend an infinite amount of time attempting to crawl all the possible URLs, and that would produce a lot of duplicate content.  To deal with this, spiders will only crawl dynamic URLs if the URL has fewer than 2 dynamic parameters, is less than 1,000 characters long, does not contain a session identifier, and every valid URL is linked from another page.
  • Validating HTML.  Robots are very sensitive to correctly formed documents.  It is important that the HTML is valid for the spider to get a good indication of what the page is truly about.
  • Reduce dependencies.  Some technologies, such as Flash, make it impossible for the spider to index the content inside them.  The spider cannot view this particular content, so any important keywords or information inside it is lost.  The content should be moved outside to allow indexing to take place.
  • Slim down page content.  A spider’s time is valuable, and typically spiders don’t crawl all the pages of a bloated web site.  Google and Yahoo! spiders stop at about 100,000 characters. [1]  The typical cause of HTML page bloat is embedded content such as styling and JavaScript.  A simple way to solve this is to link to an external Cascading Style Sheet (CSS) so that you are able to make use of re-usable styles.  Another way to reduce JavaScript bloat is to use a program to obfuscate long files, which replaces long variable names such as “longVariableName” with much shorter versions such as “A.”
  • Use redirects.  From time to time, pages within a web site move.  It is important to make accurate use of the correct type of redirects within your site so that when spiders attempt to visit the old URL, they are redirected to the new URL.  If they find that the server returns a 404 (Not Found) for the old URL, the spider might remove that particular page from the index.  Instead, the proper way to indicate that the page has moved permanently is to use a server-side redirect, also known as a “301 redirect.”  This is returned to the spider when it attempts to navigate to the old URL, and the spider is then able to update the index with the new URL.  A sample implementation might look like the following:

Redirect 301 /oldDirectory/oldName.html

Note that spiders cannot follow JavaScript or Meta refresh directives. [1]  Additionally, one can use a “302 redirect” for temporarily moved URLs.  See Fielding et al. for more information on the “302 redirect.”

  • Create site maps.  Site maps are important for larger sites because “it not only allows spiders to access your site’s pages but they also serve as very powerful clues to the search engine as to the thematic content of your site.” [1]  The anchor text used for the links can provide some very good keywords to the spider.
  • Titles and snippets.  Together, the title and the snippet that the search spider extracts account for a large part of how a particular page is indexed.  The title, the most important clue to a spider about the subject of a page, is the most easily fixed element.  The title is a great place to use the keywords that were previously decided on.  For example, the title element for StubHub, a ticket brokerage site, is “Tickets at StubHub! Where Fans Buy and Sell Tickets.”  Additionally, the snippet, or summary that the spider comes up with to describe the result, is important as well.  Typically, the spider uses the first block of text that it runs across as the snippet.  For example, for WebMD, Googlebot uses the following snippet: “The leading source for trustworthy and timely health and medical news and information. Providing credible health information, supportive community, …”  In both of these examples, it is clear that having essential keywords present in both is highly correlated with high rankings.
  • Formatting heading elements.  Using traditional HTML subsection formatting elements, such as <h1>, <h2>, <h3>, etc. to denote important information can help give context clues to spiders on what text is important on a particular page.

3.6.1  The Importance of Links

Links, both internal and external, play a big role in SEO and in page ranking.  Search engines place a certain value on links because they can use these links to judge the value of the information.  Similar to how scientific papers have relied on citations to validate and confer status upon an author’s work, links apply status and value to a particular site.  “Inbound links act as a surrogate for the quality and “trustworthiness” of the content, which spiders cannot discern from merely looking at the words on the page.” [1]

Several factors can influence how an algorithm ranks the link popularity of a particular page, including link quality, link quantity, anchor text, and link relevancy.

Using the 4 link popularity factors from above, search engines use a theory of hub and authority pages to create link value.  Hub pages are web pages that link to other pages on a similar subject.  Authority pages are pages that are linked to by many other pages on a particular subject.  Search engines usually assign a high rank to authority pages because they are most closely related to a searcher’s keywords.  Using this model, it is easy to see why the harder an inbound link is to get, the more valuable it is likely to be.  For more information on inbound and outbound link importance and their value, see Hunt et al. [1]

3.7  Getting Robots To Crawl The Site

In addition to all of the goals and HTML elements above, it is crucial to have a “robots.txt” file that will allow the robot to crawl the site.  This file can give instructions to web robots using the Robots Exclusion Protocol.  Before any indexing occurs at a site, the robot will check the top-level directory for the “robots.txt” file and will index the site accordingly.  A sample “robots.txt” file might look like the following.

User-agent: *
Disallow:

These instructions apply to all robots, or “User-agents,” and they are to index the whole site because nothing is listed under the “Disallow” directive.  It is possible to tailor this file to individual robots and server content.  While the protocol is completely advisory, following it is highly recommended to improve the quality of what the robots index.  Note that a robot does not have to obey the “robots.txt” file; however, most non-malicious robots do obey the instructions.

3.8  Dispelling SEO Myths

  • SEO is a one time activity.  SEO is a continual activity that must be revisited from time to time to ensure that the most up-to-date keywords and content are being used.
  • SEO is a quick fix process for site traffic.  Generating high quality organic traffic that will help with conversion is a slow process.  For each change that one makes to the web page, spiders must re-index that page and then calculate the new rankings.  It could take several months to years depending on your goal of achieving a higher ranking, passing a competitor in rankings or achieving the top results spot.
  • META tags help with page ranking.  Early in the development of search engines, META tags were abused by keyword spammers who packed in as many highly searched keywords as possible, and many spiders now give META tags little to no credence.

4.  Promotion Of

In order to test the authors’ hypothesis, we chose to perform our SEO experiment with the A Few Guys Coding web site.  A Few Guys Coding, LLC is a small, author-owned company that provides contract engineering services mainly for mobile platforms (iPhone, iPod Touch, iPad, Android) but also does web and desktop applications.  This web site has never had SEO performed on it and was not designed with any such considerations in mind.

4.1  Initial Investigation For

In order to determine what keywords we should focus on, we created a survey and asked the following question to people who are not engineers: “Suppose you were a manager of a business that had a great idea for a mobile phone application. You knew that you had to hire an iPhone, iPod or iPad developer because you didn’t directly know anyone who could do this work for you. What words, terms or phrases might you consider searching for in Google to find this person?”  Table 4.1 represents the range of responses that were provided.  In analyzing this data, the words provided were stemmed to account for different endings.  In addition, stop words, or words that are ignored because they don’t provide any significance to the query, were also filtered out and not considered.
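The stemming and stop-word filtering described above can be sketched as follows.  The crude suffix-stripping stemmer and the tiny stop-word list are assumptions for illustration (the authors' actual stemmer is not specified), and the sample responses are made up:

```python
import re
from collections import Counter

# A sketch of the survey analysis described above: stop words are
# filtered out, and a crude suffix-stripping stemmer (an assumption,
# not necessarily the stemmer the authors used) merges word endings
# before counting.
STOP_WORDS = {"a", "an", "and", "for", "i", "in", "or", "the", "to", "you"}

def stem(word):
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

responses = [
    "iphone developer for hire",
    "mobile application developers",
    "hire an iphone app developer",
]

counts = Counter(
    stem(word)
    for response in responses
    for word in re.findall(r"[a-z]+", response.lower())
    if word not in STOP_WORDS
)
```

With this processing, "developer" and "developers" collapse into one count, which is how the table below aggregates variant word endings.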

Looking at the results and given the question posed to the survey audience, the percentages of the top three results did not surprise the authors.  What did surprise the authors were the relatively high number of results for the keywords “technician”, “help”, “inventor/invention”, “creator” and “market/marketing” and the relatively low number of results for “mobile”, “phone”, “software” and “programmer.”

After careful consideration, the authors chose to incorporate some of the keywords suggested by the survey audience into the web pages.

Keyword Count Percentage
Developer 81 19.33%
Application 72 17.18%
iPhone 51 12.17%
Apple 24 5.73%
Phone 21 5.01%
iPod 17 4.06%
Mobile 17 4.06%
iPad 17 4.06%
Programmer 17 4.06%
Technician 11 2.63%
Help 10 2.39%
Mac/Macintosh 9 2.15%
Technology 6 1.43%
Market/Marketing 6 1.43%
Software 7 1.67%
Inventor/Invention 5 1.19%
Creator 5 1.19%
Business 4 0.95%
iTunes 3 0.72%
Designer/Designing 3 0.72%
Hire 3 0.72%
Handset 3 0.72%
Top 3 0.72%
Computer 2 0.48%
Company 2 0.48%
Contractor 3 0.72%
AT&T 1 0.24%
Store 1 0.24%
3G 1 0.24%
OS 1 0.24%
Code 1 0.24%
File 1 0.24%
Resource 1 0.24%
Sale 1 0.24%
Science 1 0.24%
Graphics 1 0.24%
Analyst 1 0.24%
Creative 1 0.24%
Devoted 1 0.24%
Energetic 1 0.24%
Smart 1 0.24%
Engineer 1 0.24%
Objective-C 1 0.24%
Total 419 100.00%

Table 4.1 – Responses from a search audience regarding possible keywords for

4.2  Baseline Rankings

Prior to performing SEO, the site’s Google PageRank was 1/10.  The website was indexed by major search engines, such as Google, Yahoo!, Bing, AOL and Ask, but was not crawled on a regular basis.

Before performing any SEO activities, the amount of traffic that was coming from search engines was a little over 7% overall.  Most traffic was coming from direct referrals, for example when the user entered the address from a business card they had received (see Graph 5.1).

In addition, the titles and snippets of the pages were not as good as they could be (i.e. they didn’t include keywords or readily extractable content for the snippet).  For example, the title of the home page was “Welcome to A Few Guys Coding, LLC.”

Moreover, the web site was a transition from an older web site, so many of the links already in Google referred to old pages that no longer existed.  When those pages were clicked on in the Google results, the user was taken to a “404 Not Found” page.

Graph 5.1 – Sources of traffic for before SEO activities

In addition to the work done with keyword densities and titles, the authors also started a blog and Twitter account that addressed computer science and programming topics.  In the blog, we linked to other parts of the main site when a blog post referenced a topic that A Few Guys Coding dealt with, such as iPhone applications.  A summary feed of this blog was placed on the main page of the web site, and the authors noticed an increase in crawler traffic after the spider determined the content was changing frequently enough to warrant additional visits to the site to re-calculate PageRank.

4.3  Rankings After SEO

After an initial pass of SEO was performed on the web site, the traffic from search engines increased dramatically and the PageRank increased 1 point to 2/10.  The authors believe that this increase was largely from changing the titles of the web pages themselves, changing keyword densities and tying keywords to landing pages.  Using Google Analytics, over a two-month period, traffic was up 81.24%.

The keyword densities are not quite as high as the recommended 3%-12%; however, this is a marked improvement from before, when the important keyword densities were all 1% and below.  See Table 5.2.

Keyword Count Density Page
iPhone 10 2.55% /services
iPad 6 1.53% /services
iPod 6 1.53% /services
Code 10 2.55% /services
Software 2 1.32% /
Application 10 2.55% /services
Develop/Developer 4 1.42% /services

Table 5.2 – Keyword densities for certain pages on

Graph 5.2 – Sources of traffic for after SEO activities (April 2010)

Pages Page Views % Pageviews Avg. Time On Page
/ 387 66.38% 1:42
/learnmore 59 10.12% 2:09
/services 41 7.03% 1:08
/portfolio 44 7.55% 1:19
/contact 35 6.00% 0:45
/services/iphone 8 1.37% 0:39
/getaquote 6 1.03% 0:15
/services/ipad 3 0.51% 0:27
Totals 583 100.00% 1:03

Table 5.3 – Page View Overview for top 8 most visited pages in April, 2010

Graph 5.3 – Number of visitors per day for April, 2010 against Time on Site goal achievement rate

Another action taken by the authors was to use Apache mod_rewrite and a redirect file: we were able to direct the spider to update its index to the new pages (from the older site) using a “301 redirect.”  We were able to transform the URLs using mod_rewrite to match the current top-level domain.  This ensured that pages were not removed from the crawler’s index due to returning status 404.
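The actual rewrite rules are not reproduced here; a hypothetical sketch of the kind of mod_rewrite rule described above might look like the following (the directory names and domain are illustrative assumptions, not the authors’ actual configuration):

```apache
# Issue a permanent (301) redirect so spiders update their index to the
# new location; the old directory and target domain are hypothetical.
RewriteEngine On
RewriteRule ^oldDirectory/(.*)$$1 [R=301,L]
```

The `R=301` flag makes the redirect permanent, which is what tells the spider to replace the old URL in its index rather than keep both.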

Lastly, the authors set several goals for the web site (besides the increase in PageRank), including a “Time on Site” measure that would help measure SEO success.  If a particular user stayed on the site for longer than 5 minutes and/or had 10 or more page views, we considered the goal met.  See Graph 5.3 for a comparison of visitors to goals met.

4.4  Conclusions

Clearly, making simple changes to content (such as keywords and titles) can have a large effect on search engine ranking and on the amount and quality of traffic that is directed to a site that has had SEO.  The difference in search engine traffic over a month represented a 400% increase.  It might be reasonable to infer that an additional increase of 3-4% in keyword density could increase search engine referrals by another 50%-60%.  The authors would like to see the effects of continuing SEO at 3, 6, 9 and 12 month periods.

5.  Recommendations To UCCS

Based on the research that we performed, we would like to make some suggestions to the UCCS EAS webmaster that would allow the UCCS EAS site to be ranked higher than it currently is.  Currently, the site has a Google PageRank of 4/10.  By following the suggestions below, it may be possible to raise the PageRank 1 or 2 points to 5/10 or 6/10.  We have broken down our recommendations into a four step process: define the scope and goals, understand the decision making process of a potential EAS student, create content, and build links.

5.1  Define the Scope and Goals

The first step is to define the scope and goals of the Search Engine Optimization efforts.  One question to ask is: Is this effort only related to getting more EAS students, or is it to promote UCCS as a whole?  How much effort can be put into this project?  The answers to these questions will determine the scope of the project.  One potential goal is to recruit more students to the college, but a new student may have attended even without the SEO campaign.  Consequently, a plan for measuring how students became interested in the college will be important.

5.2  Understanding the Decision Making Process

Key to understanding how to boost traffic through SEO practices is understanding the content target users are searching for.  This can be discovered through surveys of or interviews with students who have gone through the process of choosing a college to attend.  These can be current UCCS students or students of other universities.  From the authors’ own experience, a typical decision-making process would involve finding answers to these questions: Should I go to college?  What is the best school for me?  Why should I go to UCCS?  Which major should I pick?  How do I apply?  Where can I find help with my application?  Each of these questions is a good area for creating good content.

5.3  Create Content

Once the decision making process is understood, content can be created to answer the questions that people are searching for.  Special attention should be paid to using keywords that people would search for when they have those questions.  A quick look at the current EAS site reveals that there is little content related to recruitment.  Additionally, the current pages have few keywords that are searched frequently.  Additional recommendations are listed below that highlight some of the deficiencies of the current EAS pages.

  • Eliminate non-indexable content.  The Adobe Flash content on the main landing page is unable to be indexed by spiders and robots.  All the information contained in that Flash element is lost.  Additionally, any images that contain content are non-indexable as well.
  • Remove “expandable” navigation.  Spiders are unable to “click” on these individual sections to expand them and are therefore unable to crawl the linked pages.  The navigation should be reworked so that all links are accessible without needing to perform any special interface actions.
  • Choose better titles for individual pages.  Regardless of the actual page’s content or purpose, all pages under the main landing page have the title “UCCS | College of Engineering and Applied Science.”  These titles should change depending on the subject or content of the page so that spiders and robots are able to create a more accurate index of the page.
  • Better use of heading formatting.  Headings should use valid heading tags such as <h1>, <h2>, <h3>, etc. so that the robots that crawl the site can extract main ideas and content from the page for the index.
  • Check and adjust keywords and keyword densities.  The densities of the keywords on the page are low and should reflect what the page is about.  For example, on the application page for EAS, the keywords concerning admission and applying are sparse.  In fact, out of the top 18 keywords on the page, only 4 or 5 have anything to do with admission, and their densities are low, ranging from .79% to 2.37%.
Keyword       Count   Density
Engineering   63      5.54%
Science       43      3.78%
Computer      39      3.43%
Department    39      3.43%
Admission     27      2.37%
Application   23      2.02%
Colorado      20      1.76%
Electrical    19      1.67%
UCCS          18      1.58%
Mechanical    18      1.58%
Form          15      1.32%
Aerospace     14      1.23%
Springs       12      1.05%
College       12      1.05%
Applied       12      1.05%
Student       11      0.97%
Application   10      0.88%
Financial     9       0.79%
  • Include a site map.  A site map would help ensure that all the pages that were meant to be accessible to a web visitor are also accessible to a spider crawling the content.

5.4  Build Links

Lastly, it is important to build the authority score by getting other pages to link to the content pages.  Links can be built within organizations that already have relationships with the university.  For example, the City of Colorado Springs, engineering organizations, and businesses that recruit from the EAS college could all be great sources of links.  Publishing press releases about new websites to news organizations could also help generate links.  Another form of link building is simply linking to the target pages heavily from other pages on the EAS site.

Another thing to consider is that people understand that content on the EAS site is going to be biased towards the EAS college.  Prospective students will not trust such content as much as if it were coming from a third-party site.  One way to overcome this is to write content for other websites.  This provides two benefits: it creates a seemingly unbiased source of information that can still be slanted towards recruiting for EAS, and the content can link back to the EAS website, providing a good link for building a page’s authority score.

By following these guidelines, the EAS college can succeed in generating more traffic on its recruitment pages and, in turn, see more students attend the college.

6.  Conclusion

In this research project, we have investigated the implementation of search engines.  We have also presented different elements that affect search ranking.  Based on our research and case study, we have provided recommendations to the UCCS EAS department to improve their PageRank within Google and other major search engines.  Indeed, search engine optimization is an important technique that any webmaster should learn so that their site can be ranked as highly as possible.



I recently came across the problem of needing to create incremental backups of my server to a remote site in case of a failure. Since my VPS provider didn’t offer this as a service (paid or free), I had to come up with a different solution. This solution assumes that you are running Ubuntu (in my case, Karmic Koala), have root access and have an Amazon S3 account.  It also assumes that you are willing to spend the money to back up to S3.  The pricing structure is here, but in my experience, my initial backup cost $3.78 and since then, my average monthly bill has been < $0.25.  You can calculate your own bill with this handy Amazon S3/EC2 calculator.

I know that using FUSE is not the fastest method of backing up, so your mileage may vary depending on your tolerance levels and needs.  The actual download site for FuseOverAmazon is here.  Also, I am using rsync because incremental (differential) backups are far more efficient and save more time and money than doing full backups every week.

1.  The first step is to install all the dependencies we’ll need for FUSE:

sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev
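Before building, it doesn't hurt to confirm that the build tools actually landed on your PATH. This little helper is my own addition (the function name require is arbitrary):

```shell
# require: fail loudly if a needed command is missing from PATH
require() {
    command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

# e.g., before attempting the s3fs build:
#   require gcc && require make
```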

Next, install the most recent version of s3fs. As of now, the most recent is r191, but here is a link to the downloads section so that you can check which version is the most up-to-date. I chose to put my source download in /usr/local/src.

tar -xzf s3fs*
cd s3fs
make
sudo make install
sudo mkdir -p /mnt/s3
sudo chown yourusername:yourusername /mnt/s3

2. Scripting your backup plan:

You’ll need to create a bucket in the S3 cloud.  If you haven’t done this already, you can use an online tool like JetS3t (my favorite).  I would recommend creating a separate bucket for each logical site you are going to back up.  For example, I back up each one of my Unfuddle repositories to a different bucket, which makes restoring easier.  You might also want to consider replicating to multiple locations if you don’t trust that Amazon can keep your data safe, or even use a separate service provider like JungleDisk, Mozy or Backblaze.

Using gvim or TextMate (or some other text editor), we are going to automate mounting the volume, performing a sync and unmounting the volume.  The reason I unmount is safety.  If somehow the hard disk becomes corrupted, I have a bit of time to stop the script from running and replicating the bad data.  If the volume were constantly mounted, that might not be the case.  It is also easy to wipe out the volume if you aren’t careful.

The following goes in your backup script (or whatever you name yours):


/usr/bin/s3fs yourbucket -o accessKeyId=yourS3key -o secretAccessKey=yourS3secretkey /mnt/s3
/usr/bin/rsync -avz --delete /home/username/dir/you/want/to/backup /mnt/s3
/usr/bin/rsync -avz --log-file=log.file --delete --exclude /sys --exclude /mnt --exclude /proc --exclude /tmp / /mnt/s3 #exclude some directories
mail -s "backup complete with log" you@example.com < log.file #email yourself the log (you@example.com is a placeholder)
mv log.file log.file.`date +"%Y%m%d%H%M%S"` # move the file to a log with a datetime stamp
/bin/umount /mnt/s3

There are some directories that I don’t want to back up, one being /proc, because that directory is managed by the OS while the system is running; you don’t want to restore it. Also, even though rsync is smart enough to recognize cycles, we don’t want to back up our /mnt/s3 directory itself. We exclude those here. Note the --delete option: it will delete any files on the destination that have been removed on the ’source’. Lastly, note that we can increase/decrease the verbosity of the script and email ourselves a transcript of the backup session so we know that it actually took place – not a bad way to keep tabs.  After we are finished emailing ourselves (the potentially massive log file), we rename it to keep track of our backups on the server as well. There are many more options for rsync, so check out the man page to customize your script.

chmod 755

Before you run the entire script, you might want to use the line above to change the permissions on the script you just saved.  You can verify the integrity of the script by running each command individually, which isn’t a bad idea after editing it for your own situation, because mistakes do happen.  A quick df -h after the S3 volume is mounted will show 256T available for your own personal use.
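In that same verify-before-trusting spirit, the script can refuse to rsync when the mount didn’t actually take, so a failed s3fs mount can’t result in a “backup” written to the bare local directory. This guard is my own addition and is Linux-specific (it reads /proc/mounts):

```shell
# is_mounted: succeed only if the given path appears as a mount point
is_mounted() {
    grep -qs " $1 " /proc/mounts
}

# usage in the backup script, right after the s3fs line:
#   is_mounted /mnt/s3 || { echo "s3 mount failed, aborting" >&2; exit 1; }
```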

The most important part is automating the backup process.  If you forget and you lose your most recent data, then what was the point!?  We are going to use the good ol’ fashioned *nix cron daemon to handle this process for us. There are two options for creating your crontab.  You can either put this script (or a symlink to it) in your cron.hourly, cron.daily, cron.weekly or cron.monthly folder, or you can directly edit the crontab file to have more control over when the script runs.  I personally run mine every hour and every week on Sunday.  Here is a nice cron reference to customize your schedule.

crontab -e
0 * * * * /path/to/ # this runs it hourly, on the hour
0 0 * * 0 /path/to/ # this runs it every week on Sunday at midnight
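For reference, the five crontab fields are minute, hour, day-of-month, month, and day-of-week, in that order. A few common schedules as a sketch (the script path is a placeholder):

```shell
# m   h   dom  mon  dow   command
  0   *    *    *    *    /path/to/backup-script   # every hour, on the hour
  0   3    *    *    0    /path/to/backup-script   # Sundays at 03:00
  30  2    1    *    *    /path/to/backup-script   # 02:30 on the 1st of each month
```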

A note about speed: the initial backup could take a long time.  The server’s upstream speed is the limiting factor in how long this takes.  While rsync is a great program, using FUSE is not the speediest option in the world. There is another solution out there called ‘s3sync.’

To run the script initially and create your first back-up (if you can’t wait), simply run this command: sudo ./

One last nice thing is that this can be adapted to run anywhere: other servers, your home computers, etc.  If you can install Ruby and the dependencies above, you can have ultra-cheap backups without a lot of hassle.

That’s it!

All code owned and written by David Stites and published on this blog is licensed under MIT/BSD.