<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>A Few Guys Coding Blog</title>
	<atom:link href="http://afewguyscoding.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://afewguyscoding.com</link>
	<description>Thoughts on computer science and software from [self]</description>
	<lastBuildDate>Tue, 20 Nov 2012 03:15:04 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Preventing User and Hardware Tracking in Mobile Devices</title>
		<link>http://afewguyscoding.com/2012/11/preventing-user-hardware-tracking-mobile-devices/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=preventing-user-hardware-tracking-mobile-devices</link>
		<comments>http://afewguyscoding.com/2012/11/preventing-user-hardware-tracking-mobile-devices/#comments</comments>
		<pubDate>Mon, 19 Nov 2012 19:03:18 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Android]]></category>
		<category><![CDATA[Apple]]></category>
		<category><![CDATA[iOS]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[OS X]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[apps]]></category>
		<category><![CDATA[attack]]></category>
		<category><![CDATA[confidentiality]]></category>
		<category><![CDATA[driver]]></category>
		<category><![CDATA[entitlements]]></category>
		<category><![CDATA[frameworks]]></category>
		<category><![CDATA[hardware]]></category>
		<category><![CDATA[hardware address]]></category>
		<category><![CDATA[identifier]]></category>
		<category><![CDATA[ipod]]></category>
		<category><![CDATA[ipod touch]]></category>
		<category><![CDATA[mac address]]></category>
		<category><![CDATA[mac os]]></category>
		<category><![CDATA[os x]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[sandbox]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=434</guid>
		<description><![CDATA[PREVENTING USER AND HARDWARE TRACKING IN MOBILE DEVICES by  David Robert Stites  B.S., Purdue University, 2007  Thesis directed by Professor Rory Lewis A thesis submitted to the Graduate Faculty of the University of Colorado at Colorado Springs in partial fulfillment of the requirements for the degree of Master of Science Department of Computer Science © [...]]]></description>
				<content:encoded><![CDATA[<p>PREVENTING USER AND HARDWARE TRACKING IN MOBILE DEVICES<br />
by  David Robert Stites <br />
B.S., Purdue University, 2007<br />
 Thesis directed by Professor Rory Lewis</p>
<p>A thesis submitted to the Graduate Faculty of the University of Colorado at Colorado Springs in partial fulfillment of the requirements for the degree of Master of Science<br />
Department of Computer Science</p>
<p>© Copyright By David Robert Stites 2012  All Rights Reserved</p>
<p>Mobile devices, such as smartphones or PDAs, have become increasingly popular with consumers and often provide essential functionality in their everyday life.  Usually these mobile devices contain a great deal of sensitive information such as addresses, contacts, incoming/outgoing call logs, SMS messages and, on the latest models, a calendar, emails and potentially the user’s current location.  A smartphone or mobile device today can be as powerful as a desktop or laptop in some respects and, while the latest models feature a complete OS, for many users these devices are “just phones”, so the risk connected to mobile device privacy is underestimated.  There is an existing privacy problem associated with user and hardware tracking in mobile devices.  Using the hardware interface address, users can be tracked without their knowledge or consent and have rich profiles built about their location and preferences.  This information can potentially be cross-correlated with other existing datasets to build advertising profiles for these users.  The proposed mitigation to this problem is a framework to support randomly generated, disposable hardware addresses.</p>
<p>Full text of the document can be found here: <a href='http://blog.afewguyscoding.com/wp-content/uploads/2012/11/Preventing-User-and-Hardware-Tracking-in-Mobile-Devices.pdf'>PDF</a></p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/11/preventing-user-hardware-tracking-mobile-devices/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>IPC Made Easy: Introducing XPC and NSXPCConnection</title>
		<link>http://afewguyscoding.com/2012/07/ipc-easy-introducing-xpc-nsxpcconnection/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ipc-easy-introducing-xpc-nsxpcconnection</link>
		<comments>http://afewguyscoding.com/2012/07/ipc-easy-introducing-xpc-nsxpcconnection/#comments</comments>
		<pubDate>Sat, 28 Jul 2012 00:02:17 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[OS X]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[daemon]]></category>
		<category><![CDATA[ipc]]></category>
		<category><![CDATA[nsxpcconnection]]></category>
		<category><![CDATA[nsxpcinterface]]></category>
		<category><![CDATA[nsxpclistener]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=409</guid>
		<description><![CDATA[Some new classes were introduced in Foundation.framework as part of Mac OS X 10.8 (Mountain Lion) to help ease the pain associated with performing IPC (inter-process communication) in Mac OS applications. Among them were NSXPCConnection, NSXPCListener, NSXPCListenerDelegate and NSXPCInterface. You can find the documentation inside the development portal or as part of the Xcode bundle [...]]]></description>
				<content:encoded><![CDATA[<p>Some new classes were introduced in Foundation.framework as part of Mac OS X 10.8 (Mountain Lion) to help ease the pain associated with performing IPC (inter-process communication) in Mac OS applications.  Among them were <a href="https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSXPCConnection_reference/translated_content/NSXPCConnection.html" target="_blank">NSXPCConnection</a>, <a href="https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSXPCListener_reference/translated_content/NSXPCListener.html#//apple_ref/doc/uid/TP40012171" target="_blank">NSXPCListener</a>, <a href="https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSXPCListenerDelegate_reference/translated_content/NSXPCListenerDelegate.html#//apple_ref/doc/uid/TP40012172" target="_blank">NSXPCListenerDelegate</a> and <a href="http://developer.apple.com/library/mac///#/documentation/Foundation/Reference/NSXPCInterface_reference/translated_content/NSXPCInterface.html" target="_blank">NSXPCInterface</a>. You can find the documentation inside the development portal or as part of the Xcode bundle but this post is meant to show you how easy it truly is to package up messages and send them off to other processes.</p>
<p>But first, a bit of background.  What is IPC?  Courtesy of Wikipedia, &#8220;IPC is a set of methods for the exchange of data among multiple threads in one or more processes. Processes may be running on one or more computers connected by a network. IPC methods are divided into methods for message passing, synchronization, shared memory, and remote procedure calls (RPC). The method of IPC used may vary based on the bandwidth and latency of communication between the threads, and the type of data being communicated.&#8221;  </p>
<p>Originally, you could have done IPC in OS X using mach messages, which is how drivers traditionally communicated. </p>
<p>While information sharing and modularity are definitely some of the benefits of IPC, one of the biggest wins in my mind is the fact that we can perform privilege separation with IPC.  Consider the following: you have written some code that takes some user input, crunches on it and then returns a result.  Note that this doesn&#8217;t have to be intensive computation; it could be as easy as interpolating an NSString.  User input is a taint source, meaning that the input data is untrusted and could potentially (and perhaps unintentionally) be malicious.  If your program were running in a privileged mode or had some increased set of ACLs, and the input were able to exploit a vulnerability, then the exploit would inherit the same privilege level as the application.  Moving the processing into a separate, less privileged process limits what such an exploit can reach.  </p>
<p>Another benefit of this separation is that suppose the input causes the program to crash.  If the processing were done in the main application, it would crash the entire application.  If the processing is done in the daemon, then the daemon can crash and the application would still be alive and well.</p>
<p>I have attached a project that demonstrates NSXPCConnection&#8217;s ability to say &#8220;Hello.&#8221;  <a href='http://blog.afewguyscoding.com/wp-content/uploads/2012/07/hello.tar.gz'>hello.tar.gz</a></p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/07/ipc-easy-introducing-xpc-nsxpcconnection/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>StegaGram</title>
		<link>http://afewguyscoding.com/2012/05/stegagram/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=stegagram</link>
		<comments>http://afewguyscoding.com/2012/05/stegagram/#comments</comments>
		<pubDate>Sun, 13 May 2012 05:26:32 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[iOS]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[app]]></category>
		<category><![CDATA[app store]]></category>
		<category><![CDATA[asymmetric]]></category>
		<category><![CDATA[cryptography]]></category>
		<category><![CDATA[decryption]]></category>
		<category><![CDATA[diffe-helman]]></category>
		<category><![CDATA[encryption]]></category>
		<category><![CDATA[ios]]></category>
		<category><![CDATA[ipod]]></category>
		<category><![CDATA[man-in-the-middle]]></category>
		<category><![CDATA[mitm]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[rsa]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[stegagram]]></category>
		<category><![CDATA[steganography]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=398</guid>
		<description><![CDATA[Everyone knows &#8212; or at least everyone SHOULD know &#8212; that email is not a secure form of communication. It&#8217;s a lot like yelling across a parking lot. Your message is sent &#8220;in the clear&#8221; along most of the connections that lie between you and the recipient. For the times when you want to send [...]]]></description>
				<content:encoded><![CDATA[<p>Everyone knows &#8212; or at least everyone SHOULD know &#8212; that email is not a secure form of communication. It&#8217;s a lot like yelling across a parking lot. Your message is sent &#8220;in the clear&#8221; along most of the connections that lie between you and the recipient. For the times when you want to send a message that does NOT stand out in the open for others to read, StegaGram is your answer.</p>
<p>Double Armor<br />
StegaGram protects your communication in two ways. First, it locks the message so that it can only be read by the person to whom you&#8217;re sending the message. Then, it hides the locked message inside a picture, so that it doesn&#8217;t even look like a locked message is being sent. As an analogy, consider keeping your valuables in a strong safe, located in your front yard. It&#8217;s a great safe, but why invite attention to the fact that you have it? Using StegaGram is like keeping that strong safe hidden in a secret panel behind a picture in your house.</p>
<p>No Password, No Problem<br />
Short passwords are not very secure because they can be quickly guessed by computer programs. Long passwords are better, but can be hard to remember. We chose to avoid these problems altogether by using long strings of random numbers, known as keys. If you want more details you can read more below. Otherwise, suffice it to say that it&#8217;s stronger than a password but you don&#8217;t need to remember anything. You just need to pass a key to your friend using a QR Code &#8212; those barcode-looking squares you see all over the place.</p>
<p>Under the Hood<br />
You know those cars that look cool from the outside but lack actual power and performance when driving? Yeah, that&#8217;s not us. StegaGram is clean and easy to use, but also employs the latest methods of cryptography and steganography. In fact, our initial version was denied for public distribution because it was too strong. We had to tone it down a bit. As for our hiding methods, they don&#8217;t just avoid detection by the human eye. We use a technique that passes well under the radar of digital analysis programs which search for anomalies in histograms.</p>
<p>For the Nerds<br />
StegaGram uses a combination of asymmetric cryptography and an optimized version of the Graph-Theoretic approach to steganography. The asymmetry of the cryptographic keys allows for a distributed authentication model, similar to that used in the PGP community. Our initial version uses 2048-bit RSA encryption. As for the key exchange, the QR-Code method prevents the classic &#8216;Man-in-the-Middle Attack&#8217; used against the Diffie-Hellman pattern, because there is no communication over a network during the exchange. In addition, our steganographic algorithm preserves first-order statistics, unlike most other freely-available steganographic software. For more details, take a look through <a href="http://blog.afewguyscoding.com/wp-content/uploads/2012/05/stegagram.pdf">our research paper</a>.</p>
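<p>To illustrate the asymmetric part only, here is a minimal sketch in Java rather than the app&#8217;s Objective-C. It is not StegaGram&#8217;s code: the class name <code>RsaMessageSketch</code>, its methods and the OAEP padding choice are assumptions made for the example, and the 2048-bit key pair is generated in memory instead of being exchanged by QR Code.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import javax.crypto.Cipher;

// Hypothetical illustration: encrypt a short message to a recipient's RSA public key.
final class RsaMessageSketch {
    // Padding scheme is an assumption; the post only says "2048-bit RSA".
    private static final String TRANSFORMATION = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Anyone holding the public key can lock a message for its owner.
    static byte[] encrypt(String message, PublicKey recipientKey) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, recipientKey);
            return cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Only the holder of the matching private key can unlock it.
    static String decrypt(byte[] ciphertext, PrivateKey ownKey) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, ownKey);
            return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        KeyPair recipient = newKeyPair();
        byte[] locked = encrypt("meet at noon", recipient.getPublic());
        System.out.println(decrypt(locked, recipient.getPrivate()));
    }
}
```

<p>RSA on its own only handles short messages (a 2048-bit key with this padding fits roughly 190 bytes), which is enough for the sketch.</p>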
<p>Fine Print<br />
This application was created for academic and recreational purposes, and comes with no guarantees or warranties for its information protection. It&#8217;s pretty burly, as mentioned above, but is used at your own risk. Thank you to Alex Renger for the idea of StegaGram and for the steganographic algorithm. Thank you to Dr. Yue from the University of Colorado at Colorado Springs for his teaching and support.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/05/stegagram/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Slides from February Colorado Springs iOS Developer Meetup and Colorado Springs March Cocoaheads</title>
		<link>http://afewguyscoding.com/2012/03/slides-february-colorado-springs-ios-developer-meetup-colorado-springs-march-cocoaheads/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=slides-february-colorado-springs-ios-developer-meetup-colorado-springs-march-cocoaheads</link>
		<comments>http://afewguyscoding.com/2012/03/slides-february-colorado-springs-ios-developer-meetup-colorado-springs-march-cocoaheads/#comments</comments>
		<pubDate>Sat, 10 Mar 2012 03:50:29 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=386</guid>
		<description><![CDATA[Please enjoy the slides from previous Meetups. &#160; Buffer Overflows A Survey on Mobile Device Security SQL Injections]]></description>
				<content:encoded><![CDATA[<p>Please enjoy the slides from previous Meetups.</p>
<p>&nbsp;</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2012/03/Buffer-Overflows-upload.pdf">Buffer Overflows</a></p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2012/03/A-Survey-on-Mobile-Device-Security.pdf">A Survey on Mobile Device Security</a></p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2012/03/SQL-Injections.pdf">SQL Injections</a></p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/03/slides-february-colorado-springs-ios-developer-meetup-colorado-springs-march-cocoaheads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Password Security Theater</title>
		<link>http://afewguyscoding.com/2012/02/password-security-theater/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=password-security-theater</link>
		<comments>http://afewguyscoding.com/2012/02/password-security-theater/#comments</comments>
		<pubDate>Thu, 09 Feb 2012 02:55:13 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Security]]></category>
		<category><![CDATA[password]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[theater]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=375</guid>
		<description><![CDATA[I got to thinking about passwords last night because I couldn&#8217;t sleep and I wanted to clarify something.  Many people have asked me, &#8220;Does using numbers and special characters add to the &#8220;hardness&#8221; of a password?&#8221; Initially I say &#8220;No, but it adds to the key space&#8221; (key space is the total number of possibilities that [...]]]></description>
				<content:encoded><![CDATA[<p>I got to thinking about passwords last night because I couldn&#8217;t sleep and I wanted to clarify something.  Many people have asked me, &#8220;Does using numbers and special characters add to the &#8220;hardness&#8221; of a password?&#8221;</p>
<p>Initially I say &#8220;No, but it adds to the key space&#8221; (key space is the total number of possibilities that an attacker must try).  What I actually mean to say is &#8220;Yes, it does increase the &#8220;hardness&#8221; of the password (and increases the key space), but it is not strictly necessary to use.&#8221;</p>
<p>One of the tips I like to use is to create a password that is composed of several words that you like or are particularly memorable.  For example, you could probably look around your office or home and do this.  Looking around me, I could choose <strong>projectorchocolateglobemarker</strong> (you could choose something else that is easier to remember and I probably would too, but this is just an example).</p>
<p>Consider this: if a password is composed of lower case letters, upper case letters, numbers and special characters, that is 72 characters to choose from.  If your mean system administrator said that your password had to be 8 characters long and combine those 4 types of characters, then the total number of combinations would be at most 72^8, or ~7.22 x 10^14.  That&#8217;s a lot of possible combinations a cracker/hacker must try!</p>
<p>Now consider the password where I ran the words together (<strong>projectorchocolateglobemarker</strong>).  This would be resistant to a dictionary attack.  A dictionary attack successively tries every entry in a pre-arranged list of values called a dictionary.  Commonly, these lists really are words from a dictionary, because people tend to pick passwords that are easy to remember (through no fault of their own: humans are notoriously bad at remembering things that don&#8217;t make sense or that they have no connection to).   While those individual words (projector, chocolate, globe and marker) would all be in the dictionary, the concatenation of them would not.  While we have fewer total characters to choose from (26 lower case characters), the password is 29 characters long, so the total number of possible combinations is orders of magnitude larger.  In fact, it would be 26^29, or ~1.08 x 10^41.  That is a much larger key space!</p>
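<p>The two key-space figures can be checked with a few lines of Java. This is only a sketch of the arithmetic; the class name <code>KeySpace</code> and its <code>size</code> helper are made up for the example.</p>

```java
import java.math.BigInteger;

// Hypothetical helper: counts the candidate passwords an attacker must try.
final class KeySpace {
    // alphabetSize choices for each of `length` positions.
    static BigInteger size(int alphabetSize, int length) {
        return BigInteger.valueOf(alphabetSize).pow(length);
    }

    public static void main(String[] args) {
        String phrase = "projectorchocolateglobemarker";
        // 8 characters drawn from 72 symbols
        System.out.println("72^8 = " + size(72, 8));
        // all-lowercase phrase, one of 26 letters per position
        System.out.println("26^" + phrase.length() + " = " + size(26, phrase.length()));
    }
}
```

<p>The length of the password is the exponent, which is why the longer all-lowercase phrase wins despite the smaller alphabet.</p>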
<p>Now, if your eyes haven&#8217;t glazed over by now, consider a computer that could do 1 key attempt per microsecond (which is certainly not out of the realm of possibility); that is 1 million key attempts per second.  Trying every key, the first password would take about 722M seconds (roughly 22 years), while the second would take on the order of 10^27 years.  Clearly, we can see which is more secure (and I didn&#8217;t need a lot of different character types to do it).  These longer, harder passwords are also more immune to what we call &#8220;rainbow table&#8221; attacks.</p>
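<p>Here is a rough sketch of that time estimate, assuming the attacker has to walk the whole key space at a fixed rate. The class <code>CrackTime</code> and the 365-day year are assumptions of the example, not anything standard.</p>

```java
import java.math.BigInteger;

// Hypothetical helper: worst-case brute-force time at a fixed guessing rate.
final class CrackTime {
    static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60;

    // Whole years needed to try every key in the key space (rounded down).
    static BigInteger yearsToExhaust(BigInteger keySpace, long guessesPerSecond) {
        return keySpace.divide(BigInteger.valueOf(guessesPerSecond))
                       .divide(BigInteger.valueOf(SECONDS_PER_YEAR));
    }

    public static void main(String[] args) {
        long rate = 1_000_000L; // one guess per microsecond
        BigInteger complexPassword = BigInteger.valueOf(72).pow(8);
        BigInteger wordPhrase = BigInteger.valueOf(26).pow(29);
        System.out.println("8 mixed characters: " + yearsToExhaust(complexPassword, rate) + " years");
        System.out.println("29 lowercase letters: " + yearsToExhaust(wordPhrase, rate) + " years");
    }
}
```

<p>On average an attacker finds the key after searching half the space, so halve these numbers for the expected case; the gap between the two passwords stays just as wide.</p>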
<p>A lot of the password stuff is what is called &#8220;security theater.&#8221;  Security theater is a term that describes security countermeasures intended to provide the feeling of improved security while doing little or nothing to actually improve security.  Having a system administrator create policies that don&#8217;t make sense (such as the crazy combinations of letters, numbers, special chars) when the resulting password is far less secure is an example of security theater.  Bruce Schneier uses the term a lot to describe TSA security.</p>
<p>Another way to create a memorable password is to think of a memorable phrase such as &#8220;my sister likes to eat juicy oranges every day&#8221;.  Then, take the first letter of each word and combine them to make your password.  In this example, your password would be <strong>msltejoed</strong>.</p>
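<p>The first-letter trick is mechanical enough to write down. A small sketch follows; the class <code>PhraseAcronym</code> and its method name are hypothetical.</p>

```java
// Hypothetical helper: builds a password from the first letter of each word.
final class PhraseAcronym {
    static String firstLetters(String phrase) {
        StringBuilder password = new StringBuilder();
        // Split on any run of whitespace and keep the first character of each word.
        for (String word : phrase.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                password.append(Character.toLowerCase(word.charAt(0)));
            }
        }
        return password.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstLetters("my sister likes to eat juicy oranges every day"));
    }
}
```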
<p>Lastly, I wanted to mention one more tip.  If you can set up a policy that locks the account after a certain number of failed attempts, such as 3, do so: it is one of the most effective protections for a password-protected system because it prevents an attacker from mounting a &#8220;brute force&#8221; attack, where they try every possible key.</p>
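<p>A lockout policy like that can be sketched in a few lines. This is an in-memory illustration for a single account under my own assumptions: the class <code>LoginThrottle</code> is hypothetical, and a real system would compare salted hashes and persist the counter instead of holding a plain-text password.</p>

```java
// Hypothetical sketch: lock an account after a fixed number of failed logins.
final class LoginThrottle {
    private final int maxFailures;
    private int failures = 0;

    LoginThrottle(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    boolean isLocked() {
        return failures >= maxFailures;
    }

    // Returns true only if the account is not locked and the guess is correct.
    boolean attempt(String guess, String actual) {
        if (isLocked()) {
            return false; // even the right password is refused once locked
        }
        if (actual.equals(guess)) {
            failures = 0; // a successful login resets the counter
            return true;
        }
        failures++;
        return false;
    }

    public static void main(String[] args) {
        LoginThrottle throttle = new LoginThrottle(3);
        String[] guesses = {"aaa", "bbb", "ccc", "msltejoed"};
        for (String guess : guesses) {
            System.out.println(guess + " accepted: " + throttle.attempt(guess, "msltejoed"));
        }
        System.out.println("locked: " + throttle.isLocked());
    }
}
```

<p>With a limit of 3, the attacker gets three guesses in total rather than a million per second, which is why the size of the key space matters much less once lockout is in place.</p>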
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/02/password-security-theater/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Licensing Module in Java</title>
		<link>http://afewguyscoding.com/2012/02/licensing-module-java/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=licensing-module-java</link>
		<comments>http://afewguyscoding.com/2012/02/licensing-module-java/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 20:55:10 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[license]]></category>
		<category><![CDATA[module]]></category>
		<category><![CDATA[rsa]]></category>
		<category><![CDATA[source code]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=364</guid>
		<description><![CDATA[I recently had a need to build a licensing module in Java for a project I was working on.  All of the modules out there cost money, so I figured that I would release a free one.  What is neat about this is that it is pretty simple, easy and fast to implement and it [...]]]></description>
				<content:encoded><![CDATA[<p>I recently had a need to build a licensing module in Java for a project I was working on.  All of the modules out there cost money, so I figured that I would release a free one.  What is neat about this is that it is pretty simple, easy and fast to implement and it comes with a license key generator.  The best part is that it relies on PKI, so unless someone were able to patch the binary to skip the authentication check, they would have to be able to break a 2048-bit RSA key (which is pretty safe considering no one has publicly broken a 1024-bit RSA key).</p>
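<p>To make the PKI idea concrete, here is a minimal sketch of signing some license bytes with a private key and checking them with the public key. It is not the module&#8217;s actual code: the class <code>LicenseSignatureSketch</code>, its methods and the choice of <code>SHA256withRSA</code> are assumptions for illustration, and the key pair is generated in memory instead of being loaded from .der files.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical illustration of signature-based license validation.
final class LicenseSignatureSketch {
    // Signature algorithm is an assumption made for this example.
    private static final String ALGORITHM = "SHA256withRSA";

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Done once by the vendor, who keeps the private key secret.
    static byte[] sign(byte[] licenseBytes, PrivateKey vendorKey) {
        try {
            Signature signer = Signature.getInstance(ALGORITHM);
            signer.initSign(vendorKey);
            signer.update(licenseBytes);
            return signer.sign();
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Done by the shipped application, which only carries the public key.
    static boolean verify(byte[] licenseBytes, byte[] signature, PublicKey vendorPublicKey) {
        try {
            Signature verifier = Signature.getInstance(ALGORITHM);
            verifier.initVerify(vendorPublicKey);
            verifier.update(licenseBytes);
            return verifier.verify(signature);
        } catch (GeneralSecurityException ex) {
            return false; // malformed signatures count as invalid
        }
    }

    public static void main(String[] args) {
        KeyPair vendor = newKeyPair();
        byte[] license = "name=Jane;type=TRIAL".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(license, vendor.getPrivate());
        System.out.println("genuine: " + verify(license, signature, vendor.getPublic()));
        license[0] ^= 1; // tamper with a single bit
        System.out.println("tampered: " + verify(license, signature, vendor.getPublic()));
    }
}
```

<p>Because the application only ever holds the public key, someone who edits the license file cannot produce a matching signature without the vendor&#8217;s private key.</p>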
<p>This particular licensing module has support for multiple license types: Trial, Single Version and Lifetime.  Also, the license has support for information such as name, email, license number, license type, expiration date and version number.  There is support for blacklisted, invalid, phony, and expired keys.  The one thing I want to mention before we get into the code is that you&#8217;ll need a public and private keypair (<a href="http://www.openssl.org/docs/HOWTO/keys.txt" target="_blank">you can use OpenSSL to do this</a>) in .der format (<a href="http://www.openssl.org/docs/apps/x509.html" target="_blank">an X509 certificate using OpenSSL again</a>).  So without further ado&#8230;</p>
<p><a href="http://blog.afewguyscoding.com/2012/02/licensing-module-java/license/" rel="attachment wp-att-369">Licensing code zip</a></p>
<p><strong>License.java: This is your license object.  It will be written to disk and checked when the program starts.  It contains all the information the managers will need to check the license.<br />
</strong></p>
<pre>package License;

import java.io.*;
import java.util.*;

public class License implements Serializable {
    private static final long serialVersionUID = 1L;

    private String name;
    private String email;
    private String licenseNumber;
    private LicenseType licenseType;
    private Date expiration;
    private String version;

    public License() {
        name = "";
        email = "";
        licenseNumber = "";
        expiration = new Date();
        version = "";
        licenseType = LicenseType.TRIAL;
    }

    public License(String name, String email, String licenseNumber, Date expiration, LicenseType licenseType, String version) {
        this.name = name;
        this.email = email;
        this.licenseNumber = licenseNumber;
        this.expiration = expiration;
        this.licenseType = licenseType;
        this.version = version;
    }

    // getters and setters here
}</pre>
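<p>Since the license object is written to disk with Java serialization, here is a small sketch of that round trip. The nested <code>DemoLicense</code> class is a cut-down, hypothetical stand-in for the License class, and the sketch uses a byte array where the module uses a file.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: serialize an object and read it back.
final class SerializationRoundTrip {
    // Cut-down stand-in for the License class, for illustration only.
    static final class DemoLicense implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String licenseNumber;

        DemoLicense(String name, String licenseNumber) {
            this.name = name;
            this.licenseNumber = licenseNumber;
        }
    }

    static byte[] toBytes(Serializable object) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(buffer);
            out.writeObject(object);
            out.close(); // flushes the stream into the buffer
            return buffer.toByteArray();
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    static Object fromBytes(byte[] bytes) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        DemoLicense original = new DemoLicense("Jane Doe", "ABC-123");
        DemoLicense copy = (DemoLicense) fromBytes(toBytes(original));
        System.out.println(copy.name + " / " + copy.licenseNumber);
    }
}
```

<p>Serialization alone offers no tamper protection, which is why the license file is paired with a signature that is verified before the object is read.</p>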
<p><strong>KeyStatus.java: This is an enumeration that returns the status of a key validation operation.<br />
</strong></p>
<pre>package License;

public enum KeyStatus {
    KEY_GOOD,
    KEY_INVALID,
    KEY_BLACKLISTED,
    KEY_PHONY,
    KEY_EXPIRED
}</pre>
<p><strong>LicenseType.java: This is an enumeration that represents the type of license being generated.<br />
</strong></p>
<pre>package License;

public enum LicenseType {
    TRIAL, SINGLE_VERSION, LIFETIME
}</pre>
<p><strong>LicenseFileFilter.java: This is purely for usability on the UI end.  We can provide the JFileChooser this filter and it will find only our license files.<br />
</strong></p>
<pre>package License;

import Utility.FileExtension;
import java.io.File;
import javax.swing.filechooser.FileFilter;

public class LicenseFileFilter extends FileFilter {
    public boolean accept(File f) {
        return f.isDirectory() || f.getName().toLowerCase().endsWith(FileExtension.ZIP);
    }

    public String getDescription() {
        return "License files";
    }
}</pre>
<p><strong>LicenseManager.java: This is the main workhorse for the licensing code.  It generates the keys and checks them.<br />
</strong></p>
<pre>package License;

import java.io.*;
import java.net.*;
import java.security.*;
import java.util.*;
import javax.swing.JOptionPane;

public class LicenseManager {
    private static LicenseManager instance;
    public static boolean IS_TRIAL = true;
    public static boolean IS_LICENSED =  false;
    public static License LICENSE = null;
    private static final int ENTROPY = 456456456;
    private static final String HEXES = "0123456789ABCDEF";

    public static final String LICENSE_FILENAME = "license";
    public static final String HASH_FILENAME = "license.sha1";
    public static final String SIGNATURE_FILENAME = "license.sig";

    private static final int KEY_LEN = 62;
    private static final byte[] def = new byte[]{24, 4, 124, 10, 91};
    private static final byte[][] params = new byte[][]{{24, 4, 127}, {10, 0, 56}, {1, 2, 91}, {7, 1, 100}};
    private static final Set&lt;String&gt; blacklist = new TreeSet&lt;String&gt;();

    private Timer t;
    private static final int DELAY = 900000;

    static {
        blacklist.add("11111111");
    }

    protected LicenseManager() {
        t = new Timer();
        t.scheduleAtFixedRate(new CheckLicenseTask(), DELAY, DELAY);
    }

    public static LicenseManager getLicenseManager() {
        if (instance == null) {
            instance = new LicenseManager();
        }

        return instance;
    }

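    /**
     * Serializes the license object to the license file.
     *
     * @param lic the license to write
     * @param path prefix (directory, including trailing separator) prepended to the license file name
     */
    private void writeLicenseFile(License lic, String path) {
        try {
            ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(new File(path + LICENSE_FILENAME)));
            oos.writeObject(lic);
            oos.close();
        }
        catch(Exception ex) {
            // report the failure instead of silently swallowing it
            System.out.println(ex.toString());
        }
    }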

    public static KeyStatus readLicenseFile(String licensePath, String signaturePath, String hashPath) {
        try {
            // read in file and validate the LICENSE based on the signature
            // this will remove chances of faking a LICENSE file
            // the LICENSE file has to be signed with our key
            File licenseFile = new File(licensePath);
            File signatureFile = new File(signaturePath);
            File hashFile = new File(hashPath);

            KeyStatus status = EncryptionManager.getEncryptionManager().verify(licenseFile, signatureFile, hashFile);

            if(!status.equals(KeyStatus.KEY_GOOD)) {
                return KeyStatus.KEY_INVALID;
            }

            ObjectInputStream ois = new ObjectInputStream(new FileInputStream(licenseFile));
            LICENSE = (License)ois.readObject();
            ois.close();

            String lic = LICENSE.getLicenseNumber();
            if(LICENSE.getLicenseType().equals(LicenseType.TRIAL)) {
                IS_TRIAL = true;

                Calendar c = Calendar.getInstance();
                if(c.getTime().after(LICENSE.getExpiration())) {
                    return KeyStatus.KEY_EXPIRED;
                }

                Date expiration = LICENSE.getExpiration();

                long val = expiration.getTime() - c.getTime().getTime();
                val /= (1000 * 60 * 60 * 24);
                JOptionPane.showMessageDialog(null, "This is a trial version of the software.  You have " + val + " days remaining in your trial.", "Trial Version", JOptionPane.INFORMATION_MESSAGE);
            }
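            else if(LICENSE.getLicenseType().equals(LicenseType.SINGLE_VERSION)) {
                IS_TRIAL = false;
                status = checkKey(lic);
                // only mark the software as licensed when the key actually checks out
                IS_LICENSED = status.equals(KeyStatus.KEY_GOOD);
            }
            else if(LICENSE.getLicenseType().equals(LicenseType.LIFETIME)) {
                IS_TRIAL = false;
                status = checkKey(lic);
                // only mark the software as licensed when the key actually checks out
                IS_LICENSED = status.equals(KeyStatus.KEY_GOOD);
            }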

            if(!status.equals(KeyStatus.KEY_GOOD)) {
                return status;
            }

            return KeyStatus.KEY_GOOD;
        }
        catch(Exception ex) {
            System.out.println(ex.toString());
            return KeyStatus.KEY_INVALID;
        }
    }

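    /**
     * Builds a license from the given details and writes it to disk.
     *
     * @param name licensee name
     * @param email licensee email address
     * @param authCode hex string that is hashed with SHA-512 and used to derive the license key
     * @param licenseType type of license to generate
     * @param expiration date after which the license is no longer valid
     * @param version software version the license applies to
     * @param path prefix prepended to the license file name
     */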
    public void createLicense(String name, String email, String authCode, LicenseType licenseType, Date expiration, String version, String path) {
        byte[] entropy = null;

        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            digest.reset();
            entropy = digest.digest(getByteArrayFromHexString(authCode));
        }
        catch(NoSuchAlgorithmException ex) { /* this will never happen */ }

        License lic = new License(name, email, LicenseManager.makeKey(ENTROPY, entropy), expiration, licenseType, version);
        writeLicenseFile(lic, path);
    }

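    /**
     * Derives machine-specific entropy from the MAC address of the local
     * host's network interface, falling back to a default value when no
     * address is available.
     *
     * @return the SHA-512 digest of the MAC address (or of the default value)
     */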
    private static byte[] getHardwareEntropy() {
        byte[] mac;
        try {
            NetworkInterface ni = NetworkInterface.getByInetAddress(InetAddress.getLocalHost());
            if (ni != null) {
                mac = ni.getHardwareAddress();
                if (mac == null) {
                    mac = def;
                }
            } else {
                mac = def;
            }
        }
        catch (Exception ex) {
            mac = def;
        }

        byte[] entropyEncoded = null;
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            digest.reset();
            entropyEncoded = digest.digest(mac);
        }
        catch(NoSuchAlgorithmException ex) { /* this will never happen */ }

        return entropyEncoded;
    }

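    /**
     * Derives a single key byte by shifting the seed and mixing it with c.
     *
     * @param seed the key seed
     * @param a selects the first shift distance (a % 25), whose parity picks the OR or AND variant
     * @param b selects the second shift distance (b % 3)
     * @param c the byte combined with the shifted seed
     * @return the derived key byte
     */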
    private static byte getKeyByte(final int seed, final byte a, final byte b, final byte c) {
        final int a1 = a % 25;
        final int b1 = b % 3;
        if (a1 % 2 == 0) {
            return (byte) (((seed &gt;&gt; a1) &amp; 0x000000FF) ^ ((seed &gt;&gt; b1) | c));
        } else {
            return (byte) (((seed &gt;&gt; a1) &amp; 0x000000FF) ^ ((seed &gt;&gt; b1) &amp; c));
        }
    }

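    /**
     * Computes a 16-bit running-sum checksum over the bytes of a string.
     *
     * @param s the string to checksum
     * @return the checksum as four hexadecimal digits
     */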
    private static String getChecksum(final String s) {
        int left = 0x0056;
        int right = 0x00AF;
        for (byte b : s.getBytes()) {
            right += b;
            if (right &gt; 0x00FF) {
                right -= 0x00FF;
            }
            left += right;
            if (left &gt; 0x00FF) {
                left -= 0x00FF;
            }
        }
        int sum = (left &lt;&lt; 8) + right;
        return intToHex(sum, 4);
    }

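    /**
     * Builds a key string: eight hex digits of the seed, two hex digits for
     * each of the 25 key bytes, and a four-digit checksum.
     *
     * @param seed the key seed
     * @param entropy bytes mixed, three at a time, into key bytes 4 and up
     * @return the complete key string
     */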
    public static String makeKey(final int seed, byte[] entropy) {
        // fill keyBytes with values derived from seed.
        // the parameters used here must be exactly the same
        // as the ones used in the checkKey function.
        final byte[] keyBytes = new byte[25];
        keyBytes[0] = getKeyByte(seed, params[0][0], params[0][1], params[0][2]);
        keyBytes[1] = getKeyByte(seed, params[1][0], params[1][1], params[1][2]);
        keyBytes[2] = getKeyByte(seed, params[2][0], params[2][1], params[2][2]);
        keyBytes[3] = getKeyByte(seed, params[3][0], params[3][1], params[3][2]);
        for(int i = 4, j = 0; (j + 2) &lt; entropy.length; i++, j += 3) {
            keyBytes[i] = getKeyByte(seed, entropy[j], entropy[j + 1], entropy[j + 2]);
        }      

        // the key string begins with a hexadecimal string of the seed
        final StringBuilder result = new StringBuilder(intToHex(seed, 8));

        // then is followed by hexadecimal strings of each byte in the key
        for (byte b : keyBytes) {
            result.append(intToHex(b, 2));
        }

        // add checksum to key string
        String key = result.toString();
        key += getChecksum(key);

        return key;
    }

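    /**
     * Checks that the key has the expected length and that its last four
     * characters match the checksum of the rest of the key.
     *
     * @param key the key string to check
     * @return true if the length and checksum are valid
     */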
    private static boolean validateKeyChecksum(final String key) {
        if (key.length() != KEY_LEN) {
            return false;
        }

        // last four characters are the checksum
        final String checksum = key.substring(KEY_LEN - 4);
        return checksum.equals(getChecksum(key.substring(0, KEY_LEN - 4)));
    }

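    /**
     * Validates a key: checksum, blacklist, the four fixed key bytes and
     * finally the bytes derived from this machine's hardware entropy.
     *
     * @param key the key string to validate
     * @return the status of the key
     */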
    public static KeyStatus checkKey(final String key) {
        if (!validateKeyChecksum(key)) {
            return KeyStatus.KEY_INVALID; // bad checksum or wrong number of
            // characters
        }

        // test against blacklist
        for (String bl : blacklist) {
            if (key.startsWith(bl)) {
                return KeyStatus.KEY_BLACKLISTED;
            }
        }

        // at this point, the key is either valid or forged,
        // because a forged key can have a valid checksum.
        // we now test the "bytes" of the key to determine if it is
        // actually valid.

        // when building your release application, use conditional defines
        // or comment out most of the byte checks! this is the heart
        // of the partial key verification system. by not compiling in
        // each check, there is no way for someone to build a keygen that
        // will produce valid keys. if an invalid keygen is released, you can
        // simply change which byte checks are compiled in, and any serial
        // number built with the fake keygen no longer works.

        // note that the parameters used for getKeyByte calls MUST
        // MATCH the values that makeKey uses to make the key in the
        // first place!

        // extract the seed from the supplied key string
        final int seed;
        try {
            seed = Integer.valueOf(key.substring(0, 8), 16);
        } catch (NumberFormatException e) {
            return KeyStatus.KEY_PHONY;
        }

        // test key 0
        final String kb0 = key.substring(8, 10);
        final byte b0 = getKeyByte(seed, params[0][0], params[0][1], params[0][2]);
        if (!kb0.equals(intToHex(b0, 2))) {
            return KeyStatus.KEY_PHONY;
        }

        // test key1
        final String kb1 = key.substring(10, 12);
        final byte b1 = getKeyByte(seed, params[1][0], params[1][1], params[1][2]);
        if (!kb1.equals(intToHex(b1, 2))) {
            return KeyStatus.KEY_PHONY;
        }

        // test key2
        final String kb2 = key.substring(12, 14);
        final byte b2 = getKeyByte(seed, params[2][0], params[2][1], params[2][2]);
        if (!kb2.equals(intToHex(b2, 2))) {
            return KeyStatus.KEY_PHONY;
        }

        // test key3
        final String kb3 = key.substring(14, 16);
        final byte b3 = getKeyByte(seed, params[3][0], params[3][1], params[3][2]);
        if (!kb3.equals(intToHex(b3, 2))) {
            return KeyStatus.KEY_PHONY;
        }

        // test the hardware entropy
        byte[] encodedEntropy = getHardwareEntropy();
        for(int i = 16, j = 0; (j + 2) &lt; encodedEntropy.length; i += 2, j += 3) {
            String kb = key.substring(i, i + 2);
            byte b = getKeyByte(seed, encodedEntropy[j], encodedEntropy[j + 1], encodedEntropy[j + 2]);
            if(!kb.equals(intToHex(b, 2))) {
                return KeyStatus.KEY_INVALID;
            }
        }

        // if we get this far, then it means the key is either good, or was made
        // with a keygen derived from "this" release.
        return KeyStatus.KEY_GOOD;
    }

    /**
     *
     * @param n
     * @param chars
     * @return
     */
    protected static String intToHex(final Number n, final int chars) {
        return String.format("%0" + chars + "x", n);
    }

    /**
     *
     * @param raw
     * @return
     */
    public static String getHexStringFromBytes(byte[] raw) {
        if ( raw == null ) {
            return null;
        }
        final StringBuilder hex = new StringBuilder( 2 * raw.length );
        for ( final byte b : raw ) {
            hex.append(HEXES.charAt((b &amp; 0xF0) &gt;&gt; 4)).append(HEXES.charAt((b &amp; 0x0F)));
        }

        return hex.toString();
    }

    /**
     *
     * @param s
     * @return
     */
    public static byte[] getByteArrayFromHexString(String s) {
        int len = s.length();
        byte[] data = new byte[len / 2];
        for (int i = 0; i &lt; len; i += 2) {
            data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) &lt;&lt; 4)
                                 + Character.digit(s.charAt(i+1), 16));
        }
        return data;
    }

    private class CheckLicenseTask extends TimerTask {
        public CheckLicenseTask() { }

        @Override
        public void run() {
            System.out.println("checking license");
        }
    }
}</pre>
<p><strong>EncryptionManager.java: This module manages hashing, signing and signature verification using your keys.  It is a singleton object.<br />
</strong></p>
<pre>package License;

import Utility.Logger;
import java.io.*;
import java.security.*;
import java.security.spec.*;
import java.util.Arrays;
import javax.crypto.*;

public class EncryptionManager {

    private static EncryptionManager instance;
    // this file should be in your jar
    private static final String PUBLIC_KEY_FILE = "/License/public_key.der";
    // this file will be on your hard drive
    private static final String PRIVATE_KEY_FILE = "/path/to/your/private_key.der";
    private static PublicKey publicKey;
    private static PrivateKey privateKey;

    protected EncryptionManager() throws GeneralSecurityException {
    }

    public static EncryptionManager getEncryptionManager() {
        if (instance == null) {
            try {
                instance = new EncryptionManager();

                try {
                    privateKey = loadPrivateKey(PRIVATE_KEY_FILE);
                }
                catch (Exception ex) {
                    //Logger.getLogger().ALog("private key failed to load - couldn't instantiate encryption manager.\n" + ex.toString());
                }
                try {
                    publicKey = loadPublicKey(PUBLIC_KEY_FILE);
                }
                catch (Exception ex) {
                    //Logger.getLogger().ALog("public key failed to load - couldn't instantiate encryption manager.\n" + ex.toString());
                }
            }
            catch(GeneralSecurityException ex) {
                //Logger.getLogger().ALog("couldn't instantiate encryption manager.\n" + ex.toString());
            }
        }

        return instance;
    }

    /**
     *
     * @param filename
     * @return
     * @throws Exception
     */
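    private static PublicKey loadPublicKey(String filename) throws Exception {
        // resolve the resource through this class so the jar's own class loader is used
        InputStream in = EncryptionManager.class.getResourceAsStream(filename);
        if (in == null) {
            throw new FileNotFoundException("public key resource not found: " + filename);
        }

        // available() is only an estimate, so read until end of stream instead
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        in.close();

        X509EncodedKeySpec spec = new X509EncodedKeySpec(out.toByteArray());
        KeyFactory kf = KeyFactory.getInstance("RSA");
        return kf.generatePublic(spec);
    }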

    /**
     *
     * @param filename
     * @return
     * @throws Exception
     */
    private static PrivateKey loadPrivateKey(String filename) throws Exception {
        File f = new File(filename);
        FileInputStream fis = new FileInputStream(f);
        DataInputStream dis = new DataInputStream(fis);
        byte[] keyBytes = new byte[(int) f.length()];
        dis.readFully(keyBytes);
        dis.close();

        PKCS8EncodedKeySpec spec = new PKCS8EncodedKeySpec(keyBytes);
        KeyFactory kf = KeyFactory.getInstance("RSA");
        return kf.generatePrivate(spec);
    }

    /**
     *
     * @param dataToHashPath
     * @return
     */
    public static byte[] digest(File dataToHashPath) {
        try {
            InputStream fin = new FileInputStream(dataToHashPath);
            MessageDigest md5Digest = MessageDigest.getInstance("MD5");

            byte[] buffer = new byte[1024];
            int read;

            // feed the file to the digest until end of stream
            while ((read = fin.read(buffer)) != -1) {
                md5Digest.update(buffer, 0, read);
            }
            fin.close();

            return md5Digest.digest();
        } catch (Exception e) {
            return null;
        }
    }

    /**
     *
     * @param dataToVerify
     * @param signatureFile
     * @param hashFile
     * @return
     * @throws NoSuchAlgorithmException
     * @throws NoSuchProviderException
     * @throws InvalidKeyException
     * @throws SignatureException
     * @throws NoSuchPaddingException
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static KeyStatus verify(File dataToVerify, File signatureFile, File hashFile) throws NoSuchAlgorithmException, NoSuchProviderException, InvalidKeyException,
            SignatureException, NoSuchPaddingException, FileNotFoundException, IOException {
        // first validate the hash of the file
        FileInputStream hashfis = new FileInputStream(hashFile);
        byte[] hashToVerify = new byte[hashfis.available()];
        hashfis.read(hashToVerify);
        hashfis.close();

        byte[] licenseBytes = digest(dataToVerify);
        if(!Arrays.equals(licenseBytes, hashToVerify)) {
            Logger.getLogger().ALog("key failed to pass hash check");
            return KeyStatus.KEY_INVALID;
        }

        // now validate that we were the ones who shipped it
        Signature rsaSignature = Signature.getInstance("SHA1withRSA");
        rsaSignature.initVerify(publicKey);

        FileInputStream sigfis = new FileInputStream(signatureFile);
        byte[] sigToVerify = new byte[sigfis.available()];
        sigfis.read(sigToVerify);
        sigfis.close();

        // sign() signs the bytes stored in the hash file, so feed that file to the verifier
        FileInputStream datafis = new FileInputStream(hashFile);
        BufferedInputStream bufin = new BufferedInputStream(datafis);

        byte[] buffer = new byte[1024];
        int len;
        while ((len = bufin.read(buffer)) != -1) {
            rsaSignature.update(buffer, 0, len);
        }

        bufin.close();

        if (rsaSignature.verify(sigToVerify)) {
            return KeyStatus.KEY_GOOD;
        } else {
            Logger.getLogger().ALog("key failed to pass signature check");
            return KeyStatus.KEY_INVALID;
        }
    }

    /**
     *
     * @param dataToSign
     * @param signatureFilePath
     * @param hashFilePath
     * @throws NoSuchAlgorithmException
     * @throws NoSuchProviderException
     * @throws InvalidKeyException
     * @throws SignatureException
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static void sign(byte[] dataToSign, String signatureFilePath, String hashFilePath) throws NoSuchAlgorithmException, NoSuchProviderException, InvalidKeyException,
            SignatureException, FileNotFoundException, IOException {
        // initialize the signing algorithm with our private key
        Signature rsaSignature = Signature.getInstance("SHA1withRSA");
        rsaSignature.initSign(privateKey);
        rsaSignature.update(dataToSign, 0, dataToSign.length);

        // sign it
        byte[] sig = rsaSignature.sign();

        // save the signature to disk to verify later
        FileOutputStream fos = new FileOutputStream(signatureFilePath);
        fos.write(sig);
        fos.close();

        fos = new FileOutputStream(hashFilePath);
        fos.write(dataToSign);
        fos.close();
    }
}</pre>
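<p>The sign-then-verify flow in EncryptionManager can be exercised end to end without any key files. The following is only a minimal sketch under stated assumptions: the class <code>SignVerifySketch</code> and its methods are hypothetical and not part of the module above, a throwaway in-memory RSA key pair stands in for <code>private_key.der</code> and <code>public_key.der</code>, and SHA-256 is swapped in for the MD5 and SHA-1 used in the listing.</p>
<pre>import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical demo class; not part of the licensing module above.
class SignVerifySketch {

    // Hash the license bytes (assumption: SHA-256 instead of the module's MD5).
    static byte[] hash(byte[] data) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    // Sign the hash with the vendor's private key.
    static byte[] sign(byte[] hash, PrivateKey key) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initSign(key);
        s.update(hash);
        return s.sign();
    }

    // Re-hash the data, compare it with the shipped hash, then check the signature.
    static boolean verify(byte[] data, byte[] shippedHash, byte[] sig, PublicKey key) throws Exception {
        if (!MessageDigest.isEqual(hash(data), shippedHash)) {
            return false;
        }
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(key);
        s.update(shippedHash);
        return s.verify(sig);
    }

    public static void main(String[] args) throws Exception {
        // Throwaway key pair standing in for the .der files.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();

        byte[] license = "name=Jane Doe;type=TRIAL".getBytes(StandardCharsets.UTF_8);
        byte[] h = hash(license);
        byte[] sig = sign(h, pair.getPrivate());

        // Untouched license data passes both checks.
        System.out.println(verify(license, h, sig, pair.getPublic()));

        // Edited license data fails the hash comparison.
        byte[] tampered = "name=Jane Doe;type=LIFETIME".getBytes(StandardCharsets.UTF_8);
        System.out.println(verify(tampered, h, sig, pair.getPublic()));
    }
}</pre>
<p>As in the module, the signature covers the hash rather than the license itself, so verification has two steps: the hash must match the data, and the signature must match the hash.</p>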
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/02/licensing-module-java/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Affected Applications from &#8220;A Survey Of Mobile Device Security: Threats, Vulnerabilities and Defenses&#8221;</title>
		<link>http://afewguyscoding.com/2012/01/affected-applications-a-survey-mobile-device-security-threats-vulnerabilities-defenses/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=affected-applications-a-survey-mobile-device-security-threats-vulnerabilities-defenses</link>
		<comments>http://afewguyscoding.com/2012/01/affected-applications-a-survey-mobile-device-security-threats-vulnerabilities-defenses/#comments</comments>
		<pubDate>Tue, 17 Jan 2012 16:25:07 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[iOS]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[app store]]></category>
		<category><![CDATA[apps]]></category>
		<category><![CDATA[attack]]></category>
		<category><![CDATA[availability]]></category>
		<category><![CDATA[blackberry]]></category>
		<category><![CDATA[cell phone]]></category>
		<category><![CDATA[confidentiality]]></category>
		<category><![CDATA[defense]]></category>
		<category><![CDATA[integrity]]></category>
		<category><![CDATA[ipod touch]]></category>
		<category><![CDATA[jailbreak]]></category>
		<category><![CDATA[operating system]]></category>
		<category><![CDATA[sandbox]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[southwest airlines]]></category>
		<category><![CDATA[threat]]></category>
		<category><![CDATA[vulnerability]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=356</guid>
		<description><![CDATA[The author surveyed over 230 applications (the full list of applications can be found in A Survey Of Mobile Device Security: Threats, Vulnerabilities and Defenses), including applications in the “Top” categories on the iTunes store to determine what type of information could be extracted from auditing packet streams. The results were quite surprising. To perform [...]]]></description>
				<content:encoded><![CDATA[<p>The author surveyed over 230 applications (the full list of applications can be found in <strong><a title="A Survey Of Mobile Device Security: Threats, Vulnerabilities and Defenses" href="http://blog.afewguyscoding.com/2011/12/survey-mobile-device-security-threats-vulnerabilities-defenses/">A Survey Of Mobile Device Security: Threats, Vulnerabilities and Defenses</a></strong>), including applications in the “Top” categories on the iTunes Store, to determine what type of information could be extracted by auditing packet streams. The results were quite surprising.</p>
<p>To perform this audit, the author launched one application at a time and used Wireshark to capture and analyze packets. The experiment was performed on an open network that the author created. The access point was a Cisco Small Business router (WAP4410N) configured with a hidden SSID and MAC address authentication to prevent outside users from associating with the access point and introducing extraneous packets. While the author realizes that hidden SSIDs and MAC address authentication are easily defeated, they were used to keep casual users off the access point. The mobile devices used were an Apple iPod Touch 4G, an Apple iPad 1G and an Apple iPhone 4, all running iOS 5.0.1.</p>
<p>To classify the findings, the author defined four levels of potential security breach:</p>
<ul>
<li><em>None</em>: This level is defined as having no potential security breaches and no exposure of confidential information.</li>
<li><em>Low</em>: This level is defined as having a few potential security breaches or exposure of confidential information that could not directly affect the user, such as device IDs that could be used in tracking users (in iOS, these are called UUIDs).</li>
<li><em>Medium</em>: This level is defined as having several potential security breaches, exposure of potentially serious confidential information, or exposure of information that would let an attacker identify the user individually, such as addresses or latitude and longitude coordinates.</li>
<li><em>High</em>: This level is defined as having multiple potential security breaches or exposure of extremely confidential information, such as account numbers, PINs, and username/password combinations.</li>
</ul>
<p>For more information on a specific application, including the version number in which the vulnerability was found, see the full list of applications at the end of this post.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="93">
<p align="center"><strong>Application</strong></p>
</td>
<td valign="top" width="93">
<p align="center"><strong>Level</strong></p>
</td>
<td valign="top" width="93">
<p align="center"><strong>Risks Found</strong></p>
</td>
</tr>
<tr>
<td valign="top" width="93">GrubHub</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID</p>
</td>
</tr>
<tr>
<td valign="top" width="93">The Weather Channel</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Path</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Handmade</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID</p>
</td>
</tr>
<tr>
<td valign="top" width="93">iHeartRadio</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">Reverse Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">TabbedOut</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID, Platform</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Priceline</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID, Geocoded location, “Search” API is unencrypted</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Free WiFi</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Coupious</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Geocoded location, UUID, coupon redemption codes</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Delivery Status</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">UPS transmits reverse geocoded locations and tracking numbers.</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Color</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Reverse geocoded location and photos taken and shared by users</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Cloudette</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Username in plaintext and password, hashed with MD5</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Gas Buddy</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, hashed with MD5</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Ness</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Reverse geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Southwest Airlines</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password in plaintext</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Minus</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password in plaintext</p>
</td>
</tr>
<tr>
<td valign="top" width="93">WordPress</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password in plaintext</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Foodspotting</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">ustream</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, UUID, geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Labelbox</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, geocoded location</p>
</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p>The majority of the applications that were surveyed encrypted the exchanges of confidential or sensitive information, such as usernames, passwords and account numbers via SSL/TLS.</p>
<p>However, many applications performed some sort of tracking or storing of analytic information, such as passing the UUID in a call to a web service.  In some instances, this identifying information was not encrypted.  While this is not dangerous in the sense that an attacker could use the information to “identify” a particular person, none of the applications told users that information such as the UUID, phone OS and model was being used or recorded, nor did they let the user “opt out.”</p>
<p>The largest single potential security breach was with the Southwest Airlines application.  Because the username and password were submitted to a web server via a POST operation in plaintext, an attacker could simply sniff for this data.  With captured credentials, an attacker could log into the victim’s account and book travel, use award miles and possibly change information in the victim’s profile.  This is worrisome not only because a potential attacker could fraudulently use a victim’s account and credit card information, but also because of the possibility of terrorist threats to air travel.</p>
<p>For example, consider the possibility of a person who is currently (and rightfully) on the Department of Homeland Security’s “No-Fly” list.  If this person were able to capture a victim’s credentials and create a fake ID, he could pass through TSA security without being stopped.</p>
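<p>The plaintext POST described above is a mistake that a client can guard against before any credentials leave the device. The following is only a minimal sketch, not code from any of the surveyed applications: the class <code>TransportGuard</code>, the method <code>requireHttps</code> and the <code>example.com</code> endpoints are all hypothetical.</p>
<pre>import java.net.URI;

// Hypothetical guard; not taken from any surveyed application.
class TransportGuard {

    // Reject any endpoint that would carry credentials outside TLS.
    static URI requireHttps(String endpoint) {
        URI uri = URI.create(endpoint);
        if (!"https".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("refusing to send credentials over " + uri.getScheme());
        }
        return uri;
    }

    public static void main(String[] args) {
        // An HTTPS endpoint is accepted and returned unchanged.
        System.out.println(requireHttps("https://example.com/login"));

        // A plaintext HTTP endpoint is rejected before any request is made.
        try {
            requireHttps("http://example.com/login");
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage());
        }
    }
}</pre>
<p>A check like this only confirms that the URL asks for TLS. It does not replace certificate validation by the HTTP client.</p>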
<p>Of the 253 applications surveyed, 91.7% had no risk found, 3.1% had a low risk, 2.3% had a medium risk and 2.3% had a high risk.  While it would be desirable to have no applications in the “Medium” or “High” category, the number of applications that presented a security risk was both surprising and far too high.  There are over 500,000 applications on the iOS App Store, so extrapolating the results, there could be roughly 15,500 applications in the “Low” risk category and 11,500 applications in each of the “Medium” and “High” risk categories.</p>
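<p>The extrapolation is simply each observed rate multiplied by the size of the store. As a minimal sketch, assuming a round figure of 500,000 applications and using a hypothetical helper class <code>RiskExtrapolation</code>, the 3.1% rate works out to 15,500 applications and each 2.3% rate works out to 11,500:</p>
<pre>// Hypothetical helper; the rates and store size are the figures quoted in the text.
class RiskExtrapolation {

    // Scale an observed percentage to a store of the given size, rounded to whole applications.
    static long estimate(long storeSize, double percent) {
        return Math.round(storeSize * percent / 100.0);
    }

    public static void main(String[] args) {
        long storeSize = 500000L;
        System.out.println("Low:    " + estimate(storeSize, 3.1));
        System.out.println("Medium: " + estimate(storeSize, 2.3));
        System.out.println("High:   " + estimate(storeSize, 2.3));
    }
}</pre>
<p>This assumes the surveyed sample is representative of the whole store, which a sample drawn largely from the “Top” lists may not be.</p>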
<p>Overall, the proportion of applications with some sort of security risk is low.  This is not very surprising to the author, as many of these applications are in the “Top” applications list and any potential security flaws would likely have already been found.</p>
<p>Because iOS does not have a robust privilege system, a user has no way of knowing that their information is being used in a dangerous or insecure way.  While applications can show that there is network traffic by displaying a spinning “network activity indicator”, they are not required to do so.  In fact, on iOS a legitimate application or a piece of malware could access the network interfaces, sending and receiving information, without ever alerting the user.</p>
<p>Developers typically do not follow the principle of least privilege.  If an application needs a set of privileges for its functionality, developers will request them all up front rather than only when they are needed.  This is particularly dangerous because it could give an attacker an entry point to compromise the application.</p>
<p>The authors of [19] surveyed 940 Android applications and found that more than 50% required one unnecessary permission and 6% required more than four unnecessary permissions.  Developers may request more permissions than are necessary because 1) they don’t understand the importance of security and least privilege, 2) they are planning future releases that will require these privileges or 3) they don’t fully understand how to work with the platform and make the code function correctly.</p>
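<p>An “unnecessary” permission is one that an application requests but never uses, so over-privilege can be viewed as a set difference. The following is only an illustrative sketch and not the method used in [19]: the class <code>PermissionAudit</code> is hypothetical, the input lists are made up, and working out which permissions an application actually uses is assumed to happen elsewhere.</p>
<pre>import java.util.Arrays;

// Hypothetical audit helper; the permission lists below are made-up inputs.
class PermissionAudit {

    // Return, comma-separated, the requested permissions that are never used.
    static String unnecessary(String[] requested, String[] used) {
        StringBuilder extra = new StringBuilder();
        for (String permission : requested) {
            if (!Arrays.asList(used).contains(permission)) {
                if (extra.length() != 0) {
                    extra.append(", ");
                }
                extra.append(permission);
            }
        }
        return extra.toString();
    }

    public static void main(String[] args) {
        String[] requested = {
            "android.permission.INTERNET",
            "android.permission.ACCESS_FINE_LOCATION",
            "android.permission.READ_CONTACTS"
        };
        String[] used = { "android.permission.INTERNET" };

        // Lists the two permissions that were requested but never used.
        System.out.println(unnecessary(requested, used));
    }
}</pre>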
<p>Since mobile devices and smartphones are unique in that they have a built-in billing system, there must be ongoing education of developers with emphasis on security and privacy or additional built-in measures in the OS to enforce security over code the developers write or the permissions for which they ask.</p>
<p>Here is the full list of applications tested.</p>
<p>Applications shown in bold are bundled with iOS by Apple.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="87">
<p align="center"><strong>Application</strong></p>
</td>
<td valign="top" width="52">
<p align="center"><strong>Version</strong></p>
</td>
<td valign="top" width="84">
<p align="center"><strong>Application</strong></p>
</td>
<td valign="top" width="53">
<p align="center"><strong>Version</strong></p>
</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Messages</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">RedLaser Classic</td>
<td valign="top" width="53">2.9.8</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Calendar</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">eBay</td>
<td valign="top" width="53">2.4.0</td>
</tr>
<tr>
<td valign="top" width="87"><strong>App Store</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Craigslist</td>
<td valign="top" width="53">3.033</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Settings</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Key Ring</td>
<td valign="top" width="53">5.4</td>
</tr>
<tr>
<td valign="top" width="87">Spotify</td>
<td valign="top" width="52">0.4.21</td>
<td valign="top" width="84">Coupious</td>
<td valign="top" width="53">1.4.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Contacts</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Cars</td>
<td valign="top" width="53">1.6.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Notes</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Amazon PriceCheck</td>
<td valign="top" width="53">1.2.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Newsstand</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Linode</td>
<td valign="top" width="53">1.0.6</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Reminders</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Unfuddle</td>
<td valign="top" width="53">1.1.1</td>
</tr>
<tr>
<td valign="top" width="87">Find My Friends</td>
<td valign="top" width="52">1.0</td>
<td valign="top" width="84">MiniBooks</td>
<td valign="top" width="53">1.0.2</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Videos</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">iTC Mobile</td>
<td valign="top" width="53">2.4</td>
</tr>
<tr>
<td valign="top" width="87">Vlingo</td>
<td valign="top" width="52">2.1.1</td>
<td valign="top" width="84">Blueprint viewer</td>
<td valign="top" width="53">1.7</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Photos</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Square</td>
<td valign="top" width="53">2.2</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Camera</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">WordPress</td>
<td valign="top" width="53">2.9.2</td>
</tr>
<tr>
<td valign="top" width="87">Instagram</td>
<td valign="top" width="52">2.0.5</td>
<td valign="top" width="84"><strong>Maps</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">iMovie</td>
<td valign="top" width="52">1.2.2</td>
<td valign="top" width="84">FlightTrack</td>
<td valign="top" width="53">4.2.2</td>
</tr>
<tr>
<td valign="top" width="87">DashOfColor</td>
<td valign="top" width="52">3.1</td>
<td valign="top" width="84">Kayak</td>
<td valign="top" width="53">19.0.6</td>
</tr>
<tr>
<td valign="top" width="87">ColorSplash</td>
<td valign="top" width="52">1.7.2</td>
<td valign="top" width="84">Southwest</td>
<td valign="top" width="53">1.8.2</td>
</tr>
<tr>
<td valign="top" width="87">UStream Broadcaster</td>
<td valign="top" width="52">2.1</td>
<td valign="top" width="84">American</td>
<td valign="top" width="53">1.3.3</td>
</tr>
<tr>
<td valign="top" width="87">TiltShiftGen</td>
<td valign="top" width="52">2.02</td>
<td valign="top" width="84">Fly Delta</td>
<td valign="top" width="53">1.6</td>
</tr>
<tr>
<td valign="top" width="87">Gorillacam</td>
<td valign="top" width="52">1.2.2</td>
<td valign="top" width="84">Flysmart</td>
<td valign="top" width="53">2.5.25</td>
</tr>
<tr>
<td valign="top" width="87">CameraPlus</td>
<td valign="top" width="52">2.4</td>
<td valign="top" width="84">Priceline Negotiator</td>
<td valign="top" width="53">5.6</td>
</tr>
<tr>
<td valign="top" width="87">PS Express</td>
<td valign="top" width="52">2.03</td>
<td valign="top" width="84">Free WiFi</td>
<td valign="top" width="53">1.1.2</td>
</tr>
<tr>
<td valign="top" width="87">Dropcam</td>
<td valign="top" width="52">1.4.3</td>
<td valign="top" width="84">Google Earth</td>
<td valign="top" width="53">3.2</td>
</tr>
<tr>
<td valign="top" width="87">Chase</td>
<td valign="top" width="52">2.14.5799</td>
<td valign="top" width="84">Translator</td>
<td valign="top" width="53">3.1</td>
</tr>
<tr>
<td valign="top" width="87">Citibank</td>
<td valign="top" width="52">3.7</td>
<td valign="top" width="84"><strong>Phone</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Discover</td>
<td valign="top" width="52">2.1</td>
<td valign="top" width="84"><strong>Mail</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Fidelity</td>
<td valign="top" width="52">1.6.1</td>
<td valign="top" width="84"><strong>Safari</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">TD Trader</td>
<td valign="top" width="52">115.12</td>
<td valign="top" width="84"><strong>Music</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">PayPal</td>
<td valign="top" width="52">3.6</td>
<td valign="top" width="84">Flixster</td>
<td valign="top" width="53">5.02</td>
</tr>
<tr>
<td valign="top" width="87">Mint.com</td>
<td valign="top" width="52">2.0</td>
<td valign="top" width="84">Boxee</td>
<td valign="top" width="53">1.2.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Stocks</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">redbox</td>
<td valign="top" width="53">2.3.1</td>
</tr>
<tr>
<td valign="top" width="87">thinkorswim</td>
<td valign="top" width="52">115.12</td>
<td valign="top" width="84">YouTube</td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Geico</td>
<td valign="top" width="52">2.0.2</td>
<td valign="top" width="84">Fandango</td>
<td valign="top" width="53">4.5</td>
</tr>
<tr>
<td valign="top" width="87">Dropbox</td>
<td valign="top" width="52">1.4.6</td>
<td valign="top" width="84">XFINITY TV</td>
<td valign="top" width="53">1.8</td>
</tr>
<tr>
<td valign="top" width="87">1Password</td>
<td valign="top" width="52">3.6.1</td>
<td valign="top" width="84">IMDb</td>
<td valign="top" width="53">2.3.1</td>
</tr>
<tr>
<td valign="top" width="87">Alarm Clock</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">i.TV</td>
<td valign="top" width="53">3.4.1</td>
</tr>
<tr>
<td valign="top" width="87">Planets</td>
<td valign="top" width="52">3.1</td>
<td valign="top" width="84">MobiTV</td>
<td valign="top" width="53">1.0</td>
</tr>
<tr>
<td valign="top" width="87">Dictation</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">Netflix</td>
<td valign="top" width="53">1.4</td>
</tr>
<tr>
<td valign="top" width="87">Inrix Traffic</td>
<td valign="top" width="52">3.5.1</td>
<td valign="top" width="84">VNC</td>
<td valign="top" width="53">3.2.1</td>
</tr>
<tr>
<td valign="top" width="87">Adobe Ideas</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">RDP</td>
<td valign="top" width="53">2.8</td>
</tr>
<tr>
<td valign="top" width="87">IP-Relay</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">TouchTerm</td>
<td valign="top" width="53">2.1</td>
</tr>
<tr>
<td valign="top" width="87">iLlumination</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Scorekeeper</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Fake-a-call</td>
<td valign="top" width="52">5.05</td>
<td valign="top" width="84">Statware</td>
<td valign="top" width="53">1.0.3</td>
</tr>
<tr>
<td valign="top" width="87">HeyTell</td>
<td valign="top" width="52">2.3.2</td>
<td valign="top" width="84">NIKE+ GPS</td>
<td valign="top" width="53">3.2.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Weather</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">MiLB Triple A</td>
<td valign="top" width="53">1.1.0</td>
</tr>
<tr>
<td valign="top" width="87">The Weather Channel</td>
<td valign="top" width="52">2.1.1</td>
<td valign="top" width="84">Pandora</td>
<td valign="top" width="53">3.1.16</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Calculator</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Shazam</td>
<td valign="top" width="53">4.8.4</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Clock</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Soundhound</td>
<td valign="top" width="53">4.1.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Compass</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">iHeartRadio</td>
<td valign="top" width="53">4.0.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Voice Memos</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Last.fm</td>
<td valign="top" width="53">3.2.0</td>
</tr>
<tr>
<td valign="top" width="87">AroundMe</td>
<td valign="top" width="52">5.1.0</td>
<td valign="top" width="84">Songify</td>
<td valign="top" width="53">1.0.6</td>
</tr>
<tr>
<td valign="top" width="87">myAT&amp;T</td>
<td valign="top" width="52">2.1.2</td>
<td valign="top" width="84"><strong>iTunes</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">WeddingWire</td>
<td valign="top" width="52">3.1</td>
<td valign="top" width="84">Virtuoso</td>
<td valign="top" width="53">1.0</td>
</tr>
<tr>
<td valign="top" width="87">LogTen</td>
<td valign="top" width="52">3.3.1</td>
<td valign="top" width="84">I Am T-Pain</td>
<td valign="top" width="53">2.0.0</td>
</tr>
<tr>
<td valign="top" width="87">French</td>
<td valign="top" width="52">1.0</td>
<td valign="top" width="84">Scrabble</td>
<td valign="top" width="53">1.13.78</td>
</tr>
<tr>
<td valign="top" width="87">Binary Calc</td>
<td valign="top" width="52">1.4</td>
<td valign="top" width="84">Harbor Master</td>
<td valign="top" width="53">2.1</td>
</tr>
<tr>
<td valign="top" width="87">Amazon</td>
<td valign="top" width="52">1.8.0</td>
<td valign="top" width="84">Zombie Duck</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Groupon</td>
<td valign="top" width="52">1.5.7</td>
<td valign="top" width="84">Zombieville</td>
<td valign="top" width="53">1.7</td>
</tr>
<tr>
<td valign="top" width="87">LivingSocial</td>
<td valign="top" width="52">3.2.2</td>
<td valign="top" width="84">Table Tennis</td>
<td valign="top" width="53">4.1.0</td>
</tr>
<tr>
<td valign="top" width="87">Yowza</td>
<td valign="top" width="52">2.5</td>
<td valign="top" width="84">iFighter</td>
<td valign="top" width="53">1.9</td>
</tr>
<tr>
<td valign="top" width="87">Coupons</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Hired Gun</td>
<td valign="top" width="53">1.8</td>
</tr>
<tr>
<td valign="top" width="87">Airport Utility</td>
<td valign="top" width="52">1.0</td>
<td valign="top" width="84">Lock n’ Roll</td>
<td valign="top" width="53">3.0</td>
</tr>
<tr>
<td valign="top" width="87">Walgreens</td>
<td valign="top" width="52">3.0.2</td>
<td valign="top" width="84">Sneezies Lite</td>
<td valign="top" width="53">1.3</td>
</tr>
<tr>
<td valign="top" width="87">MyHumana</td>
<td valign="top" width="52">3.0.2</td>
<td valign="top" width="84">Pad Racer</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Nike + iPod</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Uno</td>
<td valign="top" width="53">2.0.0</td>
</tr>
<tr>
<td valign="top" width="87">Gold’s Gym Spotter</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">CamWow</td>
<td valign="top" width="53">2.2</td>
</tr>
<tr>
<td valign="top" width="87">Lose It!</td>
<td valign="top" width="52">3.7.2</td>
<td valign="top" width="84">Labelbox</td>
<td valign="top" width="53">1.3.1</td>
</tr>
<tr>
<td valign="top" width="87">FitnessTrack</td>
<td valign="top" width="52">1.5.5</td>
<td valign="top" width="84">Photosynth</td>
<td valign="top" width="53">1.1.2</td>
</tr>
<tr>
<td valign="top" width="87">LIVESTRONG</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Color Effects</td>
<td valign="top" width="53">3.1</td>
</tr>
<tr>
<td valign="top" width="87">MyFitnessPal</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Saturation</td>
<td valign="top" width="53">1.0</td>
</tr>
<tr>
<td valign="top" width="87">Nutrisystem</td>
<td valign="top" width="52">2.3</td>
<td valign="top" width="84">Peppermint</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Kindle</td>
<td valign="top" width="52">2.8.5</td>
<td valign="top" width="84">FlickStackrXP</td>
<td valign="top" width="53">1.9.6</td>
</tr>
<tr>
<td valign="top" width="87">Instapaper</td>
<td valign="top" width="52">4.0.1</td>
<td valign="top" width="84">Minus</td>
<td valign="top" width="53">2.1.3</td>
</tr>
<tr>
<td valign="top" width="87">iBooks</td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Gallery</td>
<td valign="top" width="53">2.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Zinio</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">Handmade</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Twitter</td>
<td valign="top" width="52">4.0</td>
<td valign="top" width="84">StubHub</td>
<td valign="top" width="53">2.4.1</td>
</tr>
<tr>
<td valign="top" width="87">Facebook</td>
<td valign="top" width="52">4.0.3</td>
<td valign="top" width="84">Pushpins</td>
<td valign="top" width="53">2.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Google+</td>
<td valign="top" width="52">1.0.7.2940</td>
<td valign="top" width="84">Black Friday</td>
<td valign="top" width="53">2.0</td>
</tr>
<tr>
<td valign="top" width="87">foursquare</td>
<td valign="top" width="52">4.1.2</td>
<td valign="top" width="84">Sam’s Club</td>
<td valign="top" width="53">2.1.1</td>
</tr>
<tr>
<td valign="top" width="87">LinkedIn</td>
<td valign="top" width="52">4.2</td>
<td valign="top" width="84">Cyber Monday</td>
<td valign="top" width="53">2.1.0</td>
</tr>
<tr>
<td valign="top" width="87">Meebo</td>
<td valign="top" width="52">1.95</td>
<td valign="top" width="84">Words With Friends</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Yelp</td>
<td valign="top" width="52">5.5.0</td>
<td valign="top" width="84">Ultimate Free Word Finder</td>
<td valign="top" width="53">1.01</td>
</tr>
<tr>
<td valign="top" width="87">PingChat</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Mad Gab</td>
<td valign="top" width="53">2.0</td>
</tr>
<tr>
<td valign="top" width="87">Bump</td>
<td valign="top" width="52">2.5.6</td>
<td valign="top" width="84">Metal Storm</td>
<td valign="top" width="53">4.0.2</td>
</tr>
<tr>
<td valign="top" width="87">Color</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">Need For Speed</td>
<td valign="top" width="53">1.0.11</td>
</tr>
<tr>
<td valign="top" width="87">Cloudette</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Madden NFL 2010</td>
<td valign="top" width="53"></td>
</tr>
<tr>
<td valign="top" width="87">soundtracking</td>
<td valign="top" width="52">2.0.2</td>
<td valign="top" width="84">Shizzlr</td>
<td valign="top" width="53">3.2.1</td>
</tr>
<tr>
<td valign="top" width="87">Free RSS</td>
<td valign="top" width="52">3.4</td>
<td valign="top" width="84">Flashlight</td>
<td valign="top" width="53">5.1</td>
</tr>
<tr>
<td valign="top" width="87">NetNewsWire</td>
<td valign="top" width="52">2.0.5</td>
<td valign="top" width="84">Tip Calculator</td>
<td valign="top" width="53">1.3.1</td>
</tr>
<tr>
<td valign="top" width="87">FOX News</td>
<td valign="top" width="52">1.2.4</td>
<td valign="top" width="84">PCalc Lite</td>
<td valign="top" width="53">2.4.3</td>
</tr>
<tr>
<td valign="top" width="87">OpenTable</td>
<td valign="top" width="52">3.4.2</td>
<td valign="top" width="84">Fake Call</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Urbanspoon</td>
<td valign="top" width="52">1.17</td>
<td valign="top" width="84">To Do</td>
<td valign="top" width="53">3.2</td>
</tr>
<tr>
<td valign="top" width="87">Epicurious</td>
<td valign="top" width="52">3.0.1</td>
<td valign="top" width="84">Google</td>
<td valign="top" width="53">1.0.0.8117</td>
</tr>
<tr>
<td valign="top" width="87">WinePhD</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">Evernote</td>
<td valign="top" width="53">4.1.6</td>
</tr>
<tr>
<td valign="top" width="87">TabbedOut</td>
<td valign="top" width="52">2.3.3</td>
<td valign="top" width="84">Coin Flip</td>
<td valign="top" width="53">2.2</td>
</tr>
<tr>
<td valign="top" width="87">Foodspotting</td>
<td valign="top" width="52">2.7</td>
<td valign="top" width="84">Grades 2</td>
<td valign="top" width="53">2.03</td>
</tr>
<tr>
<td valign="top" width="87">GrubHub</td>
<td valign="top" width="52">2.20</td>
<td valign="top" width="84">Sundry Notes</td>
<td valign="top" width="53">3.2</td>
</tr>
<tr>
<td valign="top" width="87">RecipeGrazer</td>
<td valign="top" width="52">1.3</td>
<td valign="top" width="84">OneNote</td>
<td valign="top" width="53">1.2</td>
</tr>
<tr>
<td valign="top" width="87">Starbucks</td>
<td valign="top" width="52">2.1.1</td>
<td valign="top" width="84">Enigmo</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Starbucks Mobile Card</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Angry Birds</td>
<td valign="top" width="53">1.6.3</td>
</tr>
<tr>
<td valign="top" width="87">Ness</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">JellyCar</td>
<td valign="top" width="53">1.5.4</td>
</tr>
<tr>
<td valign="top" width="87">iDisk</td>
<td valign="top" width="52">1.2.1</td>
<td valign="top" width="84">Runway</td>
<td valign="top" width="53">1.6</td>
</tr>
<tr>
<td valign="top" width="87">Remote</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">RockBand Free</td>
<td valign="top" width="53">1.3.49</td>
</tr>
<tr>
<td valign="top" width="87">Apple Store</td>
<td valign="top" width="52">2.0</td>
<td valign="top" width="84">Game Center</td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Find iPhone</td>
<td valign="top" width="52">1.3</td>
<td valign="top" width="84">App For Cats</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Pages</td>
<td valign="top" width="52">1.5</td>
<td valign="top" width="84">PadRacer</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Places</td>
<td valign="top" width="52">1.31</td>
<td valign="top" width="84">Implode</td>
<td valign="top" width="53">2.2.4</td>
</tr>
<tr>
<td valign="top" width="87">TripAdvisor</td>
<td valign="top" width="52">5.9</td>
<td valign="top" width="84">Astronut</td>
<td valign="top" width="53">1.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Google Latitude</td>
<td valign="top" width="52">2.2.1</td>
<td valign="top" width="84">Monopoly</td>
<td valign="top" width="53">1.2.9</td>
</tr>
<tr>
<td valign="top" width="87">Gas Buddy</td>
<td valign="top" width="52">1.10</td>
<td valign="top" width="84">Deliveries</td>
<td valign="top" width="53">4.5</td>
</tr>
<tr>
<td valign="top" width="87">Maplets</td>
<td valign="top" width="52">2.2.2</td>
<td valign="top" width="84">Skype</td>
<td valign="top" width="53">3.5.454</td>
</tr>
<tr>
<td valign="top" width="87">iTranslate</td>
<td valign="top" width="52">5.1</td>
<td valign="top" width="84">Units</td>
<td valign="top" width="53">2.1.2</td>
</tr>
<tr>
<td valign="top" width="87">Translate</td>
<td valign="top" width="52">1.6.2</td>
<td valign="top" width="84">NCAA Football 2011</td>
<td valign="top" width="53"></td>
</tr>
<tr>
<td valign="top" width="87">KG Free</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">ESPN ScoreCenter</td>
<td valign="top" width="53">2.2.2</td>
</tr>
<tr>
<td valign="top" width="87">Wikipedia</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">Ski Report</td>
<td valign="top" width="53">2.2.1</td>
</tr>
<tr>
<td valign="top" width="87">White Noise</td>
<td valign="top" width="52">5.0.3</td>
<td valign="top" width="84">EpicMix</td>
<td valign="top" width="53">2.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Sleep Machine Lite</td>
<td valign="top" width="52">2.0.1</td>
<td valign="top" width="84">MLB At Bat</td>
<td valign="top" width="53">4.6.1</td>
</tr>
<tr>
<td valign="top" width="87">Inception</td>
<td valign="top" width="52">1.6</td>
<td valign="top" width="84">Purdue</td>
<td valign="top" width="53">3.0</td>
</tr>
<tr>
<td valign="top" width="87">Sleep</td>
<td valign="top" width="52">2.0.1</td>
<td valign="top" width="84">NASA</td>
<td valign="top" width="53">1.43</td>
</tr>
<tr>
<td valign="top" width="87">Night Stand</td>
<td valign="top" width="52">1.0.4</td>
<td valign="top" width="84">80,000 Wallpapers</td>
<td valign="top" width="53">1.98</td>
</tr>
<tr>
<td valign="top" width="87">Geico BroStache</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Wedding 911</td>
<td valign="top" width="53">1.06</td>
</tr>
<tr>
<td valign="top" width="87">CamCard</td>
<td valign="top" width="52">2.6.0.4</td>
<td valign="top" width="84">Path</td>
<td valign="top" width="53">2.0.2</td>
</tr>
<tr>
<td valign="top" width="87">Offline Pages</td>
<td valign="top" width="52">1.5.2</td>
<td valign="top" width="84">Facebook Messenger</td>
<td valign="top" width="53">1.5.2</td>
</tr>
<tr>
<td valign="top" width="87">GPS Tracker</td>
<td valign="top" width="52">1.2.2</td>
<td valign="top" width="84">Quora</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">TextPics Free</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">Big Button Box</td>
<td valign="top" width="53">3.0</td>
</tr>
<tr>
<td valign="top" width="87">Peel</td>
<td valign="top" width="52">2.0</td>
<td valign="top" width="84"></td>
<td valign="top" width="53"></td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2012/01/affected-applications-a-survey-mobile-device-security-threats-vulnerabilities-defenses/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Survey Of Mobile Device Security: Threats, Vulnerabilities and Defenses</title>
		<link>http://afewguyscoding.com/2011/12/survey-mobile-device-security-threats-vulnerabilities-defenses/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=survey-mobile-device-security-threats-vulnerabilities-defenses</link>
		<comments>http://afewguyscoding.com/2011/12/survey-mobile-device-security-threats-vulnerabilities-defenses/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 07:09:28 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Android]]></category>
		<category><![CDATA[iOS]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[app store]]></category>
		<category><![CDATA[Apple]]></category>
		<category><![CDATA[apps]]></category>
		<category><![CDATA[attack]]></category>
		<category><![CDATA[availability]]></category>
		<category><![CDATA[blackberry]]></category>
		<category><![CDATA[cell phone]]></category>
		<category><![CDATA[confidentiality]]></category>
		<category><![CDATA[defense]]></category>
		<category><![CDATA[integrity]]></category>
		<category><![CDATA[ipod touch]]></category>
		<category><![CDATA[jailbreak]]></category>
		<category><![CDATA[operating system]]></category>
		<category><![CDATA[sandbox]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[southwest airlines]]></category>
		<category><![CDATA[threat]]></category>
		<category><![CDATA[vulnerability]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=335</guid>
		<description><![CDATA[By David Stites and Anitha Tadimalla {dstites, atadimalla}@uccs.edu University of Colorado at Colorado Springs Abstract— Mobile devices, such as smartphones and PDAs, have become increasingly popular with consumers and often provide essential functionality in our everyday life. Usually these mobile devices contain lots of sensitive information, such as addresses, contacts, incoming/outgoing call logs, SMS messages, [...]]]></description>
				<content:encoded><![CDATA[<p>By David Stites and Anitha Tadimalla {dstites, atadimalla}@uccs.edu<br />
University of Colorado at Colorado Springs</p>
<p><strong><em>Abstract</em></strong><strong>— Mobile devices, such as smartphones and PDAs, have become increasingly popular with consumers and often provide essential functionality in everyday life.  These devices usually contain a great deal of sensitive information, such as addresses, contacts, incoming/outgoing call logs, SMS messages and, on the latest models, a calendar, emails and potentially the user’s current position.  A smartphone or mobile device today is as powerful as a desktop or laptop, and while the latest models feature a complete OS, for many users these devices are “just phones”, so the risk connected to mobile device security is underestimated.  This makes mobile devices an interesting target for malicious users.  Damages that a user can sustain include financial loss, loss of privacy and confidentiality, reduced processing speed and shortened battery life.</strong></p>
<p><strong>Index Terms—mobile, security, malware, defense</strong></p>
<p>1. Introduction</p>
<p>Mobile devices, especially cellphones, have changed a great deal from their counterparts of the 1990s.  Gone are the days of brick-sized phones with 1-line displays, 9 analog buttons and several KB of memory.  In recent years there has been an explosion [2] of powerful mobile computing devices.  These new smartphones and tablets, small enough to fit in a pocket or backpack, hold an immense amount of computing power.  Information is available at a simple touch or finger flick, and many users rely on these devices to access all sorts of data and services, such as email, personal contacts and websites, and even to perform tasks normally reserved for a desktop system, such as video conferencing, watching movies or listening to music.</p>
<p>These devices are able to access the internet, download additional software, send and receive email, browse websites and exchange SMS messages with other users.  In addition to these capabilities, many subscribers use their cell phone as a primary method of communication, storing their personal contact information (which includes addresses, email addresses, phone numbers, <em>etc.</em>) as well as photographs they have taken.  Additionally, many of these devices have a built-in GPS that allows the user to “geo-tag” photographs and use <em>Location Based Services</em>, such as FourSquare, Twitter and Facebook, in addition to the basic mapping and GPS functionality.</p>
<p>Devices that run iOS (iPhone, iPad and iPod Touch), Android and Windows Mobile (which together represent the majority of the market share) present a brand new computing paradigm in terms of availability, user interface and security.  These devices are being targeted as never before by attackers [1].  Today more than 300 kinds of malware, among them worms, Trojan horses, other viruses and spyware, have been unleashed against the devices [1].  Although desktop systems remain the most widely targeted platform, as mobile computing becomes more ubiquitous and powerful, the lines between a traditional desktop system and a mobile system will blur and these devices will gradually enter the virtual battlefield.</p>
<p>Clearly, these new capabilities, combined with the fact that users store personal information on the devices, make them a prime target for attackers.  Attacks against mobile devices fall into three basic categories:</p>
<ul>
<li><strong>Confidentiality attacks</strong>: Data theft, data harvesting [3]</li>
<li><strong>Integrity attacks</strong>: Phone hijacking [2]</li>
<li><strong>Availability attacks</strong>: Protocol based Denial-of-Service attacks, battery draining [1]</li>
</ul>
<p>These three categories cover a wide spectrum of security issues, and there are many different attacks that an attacker could carry out.  Any of the aforementioned attacks could range in severity from “low” to “high”, which makes this research area a wide open problem.</p>
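<p>As a small illustration, the three categories and the example attacks listed above can be written down as a lookup table. This is only a sketch for organizing the taxonomy: the names <code>AttackCategory</code>, <code>EXAMPLE_ATTACKS</code> and <code>attacks_in</code> are made up for this example and do not come from any of the cited works.</p>

```python
from enum import Enum


class AttackCategory(Enum):
    """The three basic attack categories named in the text."""
    CONFIDENTIALITY = "confidentiality"
    INTEGRITY = "integrity"
    AVAILABILITY = "availability"


# Example attacks from the list above, keyed to the category they fall under.
EXAMPLE_ATTACKS = {
    "data theft": AttackCategory.CONFIDENTIALITY,
    "data harvesting": AttackCategory.CONFIDENTIALITY,
    "phone hijacking": AttackCategory.INTEGRITY,
    "protocol-based denial of service": AttackCategory.AVAILABILITY,
    "battery draining": AttackCategory.AVAILABILITY,
}


def attacks_in(category):
    """Return the example attacks in the given category, sorted by name."""
    return sorted(name for name, cat in EXAMPLE_ATTACKS.items() if cat is category)


if __name__ == "__main__":
    for category in AttackCategory:
        print(category.value, attacks_in(category))
```

<p>A table like this is only a starting point; a real threat model would presumably also record severity, since the same attack can range from “low” to “high”.</p>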
<p>2. Survey Organization</p>
<p>This research paper is organized in the following manner: Section 3 discusses some notable previous work that has been done in this field, Section 4 discusses threats, vulnerabilities and defenses in mobile device security, and Sections 5 and 6 present original work by the authors that shows current and potential security threats from mobile device applications.</p>
<p>3.  Previous Work</p>
<p>There has been a significant amount of previous work done in the area of mobile security.  In [2], the authors show that because many mobile operating systems are based on codebases similar to those of their desktop counterparts (as in the case of Windows Mobile and iOS), rootkits, or stealthy malware that affects system programs and files, can equally affect the mobile versions of these systems.  In addition to showing that the systems can be exploited, they also show the implications of being able to compromise these devices.</p>
<p>In [4], the authors demonstrated the ability to deploy “a Trojan with few and innocuous permissions, that can extract a small amount of targeted private information from the audio sensor of the phone. Using targeted profiles for context-aware analysis, Soundcomber intelligently pulls out sensitive data such as credit card and PIN numbers from both tone- and speech-based interaction with phone menu systems.”</p>
<p>In [5], the authors show that even though Android has an advanced permissions system with a sandboxed execution environment, a “genuine application [can be] exploited at runtime or a malicious application can escalate granted permissions and imply that Android’s security model cannot deal with a transitive permission usage attack and Android’s sandbox model fails as a last resort against malware and sophisticated runtime attacks.”</p>
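<p>The idea of transitive permission usage can be sketched with a toy model. The code below is not Android code and is not the exploit from [5]; it is a simplified simulation, with hypothetical class and function names, of one app that holds a permission being asked to act on behalf of another app that does not.</p>

```python
class App:
    """Toy stand-in for an installed app and the permissions it was granted."""

    def __init__(self, name, permissions):
        self.name = name
        self.permissions = frozenset(permissions)


class ContactsExporter(App):
    """Hypothetical benign app that exposes an interface to other apps."""

    def export_contacts(self, caller, check_caller):
        # The deputy itself must hold the permission to read the data.
        if "READ_CONTACTS" not in self.permissions:
            raise PermissionError(self.name + " lacks READ_CONTACTS")
        # Defense: also require the permission from whoever is asking.
        if check_caller and "READ_CONTACTS" not in caller.permissions:
            raise PermissionError(caller.name + " lacks READ_CONTACTS")
        return ["alice@example.com", "bob@example.com"]  # made-up data


def try_export(deputy, caller, check_caller):
    """Return the exported contacts, or None if the request is refused."""
    try:
        return deputy.export_contacts(caller, check_caller)
    except PermissionError:
        return None


deputy = ContactsExporter("exporter", {"READ_CONTACTS"})
malicious = App("flashlight", set())

# Without a caller check, the unprivileged app obtains the contacts.
print(try_export(deputy, malicious, check_caller=False))
# With the check, the same request is refused.
print(try_export(deputy, malicious, check_caller=True))
```

<p>In this sketch the unprivileged “flashlight” app never holds the contacts permission itself, yet it receives the data whenever the privileged app does not verify who is calling it.</p>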
<p>In [3], the author shows that many of the attacks that occur on iOS involve the theft or procurement of personal or private information.  For example, iOS applications can access a user’s address book, which includes phone numbers, addresses and email addresses.  Additionally, an application can access other data such as call history, carrier information and photographs.</p>
<p>4. Threats, Vulnerabilities and Defenses to Mobile Device Security</p>
<p><strong>A) Attacks and Vulnerabilities</strong></p>
<p>The mobile device market has seen explosive growth, and the demand for smartphones, tablets and other integrated devices has increased dramatically.  According to Nielsen [5], “in the third quarter of 2009, smartphones accounted for 40% of new phones sold in the period, up from 25% in the prior quarter. And in the third quarter, for the first time, more people accessed the Internet from smartphones than regular phones. Assuming that 150 million people will be using smartphones by mid-2011, that means 120 million will be on the mobile Internet and 90 million, or 60%, will be watching video”, a projection based on current data trends.</p>
<p>Not only is the user base enormous, but people are also spending more time on their mobile devices than ever before.  “As of the second quarter, Nielsen has previously reported that some 15 million U.S. mobile subscribers watch video on their phones for an average of three hours, 15 minutes each month” [5].</p>
<p>Attacks such as the ones detailed later in this section had not been seen until recently.  Securing mobile devices presents unique challenges to researchers, attackers, defenders and users.  This “uniqueness” is due to the fact that computers don’t typically have some of the interfaces that mobile devices do, such as GPS hardware, or similar data storage paradigms, such as storage in the cloud.  One potential reason mobile malware and attacks are becoming more common is that people are using their mobile devices for more day-to-day tasks such as banking, email and web surfing.  This may lead users to store more “valuable” information on these devices, which makes them a prime target for attackers due to the wealth of information they could obtain.  In addition, while users may have good security habits when it comes to more traditional systems, they may not realize that their mobile device could be just as vulnerable as their servers, desktops and laptops.</p>
<p><strong>I) History and Current Day Data</strong></p>
<p>The first known cell phone virus, Cabir (EPOC.cabir and Symbian/Cabir), appeared in 2004 and was written by the group 29A [1, 8].  It affected devices that had Bluetooth modules and spread using Bluetooth via OBEX (object exchange). While it was originally designed as a proof-of-concept that would affect Symbian OS devices, it could spread to any Bluetooth-enabled device such as desktop computers and printers.</p>
<p>The virus would activate each time the phone was turned on and immediately start looking for other hosts to infect.  However, Cabir required the victim user to accept the transfer before any transfer could start.  If an attacker were able to masquerade as someone the victim trusted, then they could easily socially engineer their way into infecting the victim’s phone.</p>
<p>Since the virus was released as a proof-of-concept to shock the security community into focusing on the importance of mobile malware, it did nothing other than drain the battery of the infected device while it constantly scanned for new targets to which to send the virus.  A variant of Cabir, Mabir, was released that infected phones not only via Bluetooth but also by SMS, so it could use the carrier as an attack vector.  This greatly increased the infection potential of the virus, since a victim no longer had to be within the 10-meter range of Bluetooth to be infected.</p>
<p>Other notable historic mobile malware includes Duts (a virus for the PocketPC platform), Skulls (a trojan horse that infects all applications, including SMS/MMSes) and Commwarrior (a worm that used MMS messages and Bluetooth to spread to other devices).  A more complete list of mobile viruses, trojan horses, worms and malware exists at [1, 7, 8].</p>
<p>Clearly, since mobile devices represent the future of computing, the fact that mobile malware is becoming more prolific shouldn’t be a surprise.  In fact, F-Secure reported an almost 400% increase in mobile malware within a two-year period from 2005 to 2007 [2].</p>
<p>Additionally, [9] reports that the amount of Android malware jumped 37% in the third quarter of 2011 and that Android is now the “exclusive platform for all new mobile malware.  While the Symbian OS remains the platform with the all-time greatest number of malware, Android is clearly today’s target.” There are currently over 1,200 known malware samples [9].</p>
<p>In [10], the authors performed a survey of mobile malware in the wild.  They determined that between 2009 and 2011, there were 46 pieces of malware released, including 4 for iOS, 24 for Symbian and 18 for Android.  They also determined that the most common malicious activities were confidentiality attacks (collecting user information, 61%) and integrity attacks (sending premium-rate SMS messages, 52%).</p>
<p><strong>II) Reasons for an increase in attacks</strong></p>
<p>There are a number of reasons that the community is now experiencing an increase in malware including:</p>
<ul>
<li><em>Increased computing power and storage capabilities</em>: While many consumers may not recognize mobile devices as being equivalent in power to their larger counterparts (laptops and desktop PCs) due to their size, many smartphones, tablet devices and other PDA type devices have a rich set of hardware interfaces.  Many smart phones, such as iPhones and Android-based phones, have powerful dual-core processors and a large amount of storage space to accommodate music, movies, documents and other types of media that can be consumed on the go.</li>
</ul>
<p>In addition, the software applications that come pre-installed on the device by the device manufacturer, such as web browsers, email clients and messaging applications, allow the user to interact much more with the physical and virtual world than previous mobile devices did.  Furthermore, additional software applications can be downloaded from the Internet, installed and run by the user.  These third-party applications are able to access this advanced hardware as well as GPS and network interfaces (3G, WiFi and Bluetooth).</p>
<p>This provides mobile malware and crimeware authors a much larger array of possibilities to carry out their attacks.  In addition, more sophisticated hardware and software could make it easier for these attackers to “hide” their attack by ensuring that it only consumes a small portion of the resources, thereby mimicking a legitimate application.</p>
<ul>
<li><em>Increased network connectivity</em>: There is a widespread availability of 802.11 WLANs and high-speed broadband data access (3G, WiMAX).  These services allow users to stay constantly connected to services such as email and messaging at home, at work and in foreign places, such as coffee shops.  In addition, many applications utilize network connections to either request or send data.  For example, a game application might transmit a user’s high score to a web server for storage.</li>
</ul>
<p>Additionally, some applications rely on the fact that the phone will have a network connection to receive and send data, such as the Amazon.com shopping application.  Many recent applications, such as Facebook, Twitter and Foursquare, also utilize <em>location-based services</em>.  These applications provide additional functionality if they are able to access the network and a user’s current location.</p>
<p>Lastly, many applications can make use of social data, such as the friends one might have on Facebook.  This Facebook data could be stored within the application.  While Facebook might maintain rigorous security standards on who and what can access the user’s data, other third party applications might not be so careful with the user’s data.</p>
<ul>
<li><em>Standardization of OS and interfaces</em>: The OS is consistent across all devices in the same family, so malware applications would have more effect, being able to exploit the same security vulnerability across many devices.  In addition, many manufacturers give third party developers access to the system to write applications for the platform.  For example, one can freely download the Android and iOS SDKs.  Using a provided SDK, one could craft a virus or some other piece of malware and then submit it for inclusion in the respective application storefront.</li>
</ul>
<ul>
<li><em>Enterprise integration</em>: Many mobile platforms, such as Android, iOS and BlackBerry, support standards for integration into an enterprise environment.  For example, many devices on these platforms have support for Virtual Private Networks (VPNs) as well as Exchange server integration.  Thieves and malware authors recognize that this greatly enhances the infection potential if a mobile virus, trojan horse or worm is able to spread from a mobile device to a corporate environment.</li>
</ul>
<p>Consider the case of an employee with a mobile device becoming infected while at a coffee shop and taking that device back to the corporate environment where it could spread throughout the organization.  Where previously an attack might have only stolen information pertaining to the particular victim, now the attacker could have potentially obtained information on many different people as well as corporate information.</p>
<ul>
<li><em>Other reasons, social engineering and hacktivism</em>: Socially engineered attacks are becoming more prevalent and more sophisticated.  [9] reports “that targeting content works based on cultural and sociological differences between geographic regions.  Hacktivism have become part of the mainstream in 2011 due to groups such as Anonymous and LulzSec.”</li>
</ul>
<p><strong>III) Attack Vectors, Motivations and Types of Attacks</strong></p>
<p>There are a number of attack vectors that exist for compromising confidentiality, integrity or availability.  In fact, many of the attack vectors are the same ones that are available to desktop applications.  Mobile malware often spreads via interfaces and services shared with desktop systems as well as interfaces unique to smart phones, including SMS and MMS.</p>
<p>While the motivation behind such attacks can be varied and numerous, as well as being outside the scope of this paper, [10] identifies some current and future incentives including:</p>
<ul>
<li><em>Novelty and amusement</em></li>
<li><em>Financial Gain</em></li>
<li><em>Political Gain</em></li>
<li><em>Damage resources</em></li>
</ul>
<p>In addition to numerous attack vectors and motivations, there are many different types of attacks that a malicious individual could attempt to carry out.  [10] defined three main categories of attacks: 1) <em>malware attacks</em>, 2) <em>grayware attacks</em> and 3) <em>spyware attacks</em>.</p>
<p><em>Malware attacks</em>: “Malware attacks are attacks that gain access to a device for the purpose of stealing data, damaging the device, annoying the user, <em>etc</em>.  The attacker defrauds the user into installing the malicious application or gains unauthorized remote access by taking advantage of device vulnerabilities.  These particular types of threats provide no notice to the user and typically includes worms, Trojan horses and viruses [10].”</p>
<ul>
<li><em>RF attacks</em>: In this particular type of attack, an attacker could compromise confidentiality, integrity and availability.  For example, if an attacker were to have the correct equipment, the attacker could sniff the air (WLAN and RF) for user data that the attacker could steal if it were unencrypted, such as usernames, passwords and account numbers. This particular type of attack is made easy if: 1) the network the user is on doesn’t utilize encryption and, 2) the application transmits confidential information in plaintext.</li>
</ul>
<p>As we’ll see later on in section V, there are a number of iOS applications today that transmit private information without encryption. Additionally, if an attacker were able to successfully carry out a “mis-association” attack where the mobile device joins the wrong access point, the attacker could attack the integrity of communications by performing a “man-in-the-middle” attack, if the data were not validated or signed.  Consider an example where a victim user sent an SMS message that said “Transfer $1000 to account <em>X</em>.”  The attacker could alter the SMS to say “Transfer $10,000 to account <em>Y</em>”, where <em>Y</em> was his account.</p>
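<p>To make the “validated or signed” condition concrete, the following minimal sketch shows how a keyed message authentication code lets a receiver detect the altered transfer request.  It is only an illustration using Python’s standard library: the shared key and the idea that sender and receiver already hold it are assumptions, not something the surveyed papers describe.</p>

```python
import hashlib
import hmac

# Hypothetical shared secret, assumed to be provisioned out of band
# between the sender and the receiver; for illustration only.
SHARED_KEY = b"example-key-not-for-production"


def sign(message: bytes, key: bytes = SHARED_KEY) -> bytes:
    """Return an HMAC-SHA256 tag computed over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message: bytes, tag: bytes, key: bytes = SHARED_KEY) -> bool:
    """Compare tags in constant time; False means the message was altered."""
    return hmac.compare_digest(sign(message, key), tag)


original = b"Transfer $1000 to account X"
tag = sign(original)

# A man-in-the-middle can rewrite the text, but cannot recompute a valid
# tag without knowing the key.
tampered = b"Transfer $10,000 to account Y"

print(verify(original, tag))   # True
print(verify(tampered, tag))   # False
```

<p>Without such a check, the receiver has no way to tell the two messages apart.</p>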
<p>Lastly, an attacker could also compromise the availability of RF services.  Consider an example where an aircraft was WiFi enabled (passengers could access the WiFi during flight).  An attacker could carry out a “disassociation” attack, where he transmits 802.11 management frames that cause the wireless clients to disassociate from the access point.  The denial-of-service would continue until the attacker stopped transmitting the packets.</p>
<ul>
<li><em>Bluetooth attacks</em>: This class includes many possible attacks, similar to RF attacks, that can compromise the confidentiality and integrity of data.  In “blue-jacking”, an attacker’s malware application could insert contacts or SMS messages into a victim user’s mobile device.  Additionally, in “blue-snarfing”, the attacker is able to steal or transfer the victim’s data.  Yet another type of Bluetooth attack is “blue-tracking”, where an attacker could follow the victim’s movements.  Lastly, in “blue-bugging”, an attacker could listen in on conversations by activating the attack software and having the phone call them back.  When the attacker answers the call back, the attacker would then be able to listen in to a conversation.</li>
</ul>
<ul>
<li><em>SMS attacks</em>: “SMS spam is used for commercial advertising and spreading phishing links.  Commercial spammers are incentivized to use malware to send SMS spam because sending SMS spam is illegal in most countries” [10].  In addition to sending and receiving regular-rate SMS messages, SMS can also be used as an attack vector by exploiting vulnerabilities in the software stack, such as performing SMS fuzzing.</li>
</ul>
<ul>
<li><em>GPS/Location attacks</em>: In this type of attack, the attacker can access the GPS hardware to monitor the user’s movement and current location.  This data can be used to create a profile of a particular user.  In addition, this information could be sold to other companies for purposes of marketing and sending advertisements to the user.  A more insidious use of a GPS/location attack would allow a criminal to track when a victim leaves their residence and then rob the residence while they are gone.</li>
</ul>
<ul>
<li><em>Application masquerading and personal data attacks</em>: This particular type of attack is as simple as accessing a user’s private data and saving, sending or using it in an unauthorized manner.  For example, in [3], the author shows multiple examples of information that an application could access that could potentially be sensitive or confidential, such as a user’s phonebook, keyboard cache, location cache and photo albums.</li>
</ul>
<p>In addition, [3] also showed that it would be relatively trivial for an application to traverse the file system of a mobile device, recording information that it could find valuable.  This can all occur in the background and the user would never know that it is happening.  This data could be sent off to a server and stored for later misuse.</p>
<ul>
<li><em>Phone “Jailbreaking” and 3rd party application stores</em>: While not encouraged by the manufacturer (and in some cases against the carrier terms and agreements), many users prefer to “jailbreak” their mobile devices.  Jailbreaking is the process of removing the security limitations that are imposed by the operating system, such as the restriction to run only signed applications, which in turn lets the user install additional extensions and themes.  This also allows the user to bypass the application sandboxing mechanism and install applications from unofficial application stores.  Users find this desirable because they can add functionality that didn’t previously exist or that they could not get from any existing application on the application stores.  However, users may not realize that this could potentially be a problem if they were to install a malicious application that would normally be killed by the OS on an un-jailbroken device.  For example, one of the available exploits for iOS arises when a user jailbreaks their iDevice, installs the OpenSSH package and doesn’t change the root password for the device, which is “alpine”.  This would allow a hacker to log in to their device as the root user and exploit the system because they would have full access to read, write and execute any command.</li>
</ul>
<ul>
<li><em>Premium-rate attacks</em>: Premium rate services can deliver valuable content to a user’s mobile device.  When used in a legitimate manner, a user could receive financial information, technical support or even adult services.  These services can cost as much as several dollars per message or minute.  The authors of [10] identified 24 of the 46 pieces of malware surveyed as sending premium-rate SMS messages.  In one case, [10] found that applications purporting to be a Russian adult video player sent premium-rate SMS messages to an adult service.</li>
</ul>
<p>Another piece of malware, Geinimi, sent premium SMS messages to numbers specified by the remote command servers.  This is potentially a very large security concern, because these premium SMS messages don’t require a user’s permission to send, so they potentially could go unnoticed until an attacker has racked up hundreds or even thousands of dollars on a victim user’s phone bill.  In addition to premium-rate SMS message attacks, [10] also found that 2 of the 46 malicious applications contained premium-rate phone call attacks.</p>
<ul>
<li><em>Power management attacks</em>: [11] describes three different classes of power management attacks.  A power management attack is a form of a denial-of-service attack that affects the availability of the mobile device by draining the battery more quickly than it would under normal operation.  The attacker’s goal is to maximize the difference in power consumption between active and sleep states and keep the device from sleeping.</li>
</ul>
<p>The three classes of attacks are 1) service request attacks, 2) benign power attacks and 3) malignant power attacks.  In service request attacks, repeated requests are made to the victim for services, typically over a network.  Even if the request is ultimately not granted, power must be expended by the device to determine whether or not to grant the service request.  In benign power attacks, the mobile system executes valid but power-hungry tasks (such as displaying a hidden animated gif or executing JavaScript).  In malignant power attacks, attackers create or modify binary executables that force the mobile system to consume more power than it normally would.  For example, consider an application that plays a silent audio track in the background while the application is running.  This type of attack might be difficult for the user to detect because they may think that their battery can no longer hold a charge.</p>
<ul>
<li><em>Time-activated and location-activated attacks</em>: An attacker may choose to activate attacks at certain locations or at a pre-determined or random time in the future.  When the victim arrives at the location and uses the software, it could activate whatever the intended malignant function is.  This would be fairly easy to do as many applications have access to the GPS.  This type of attack may also sneak by any analysis of the application as it wouldn’t run every time the application was launched, but rather only under certain conditions.</li>
</ul>
<p>In addition to malware attacks, [10] also defines several other types of “attacks” including:</p>
<p><em>Grayware attacks</em>:  In [10], the authors classified grayware attacks as “some legitimate applications collect user data for the purpose of marketing or user profiling. Grayware spies on users, but the companies that distribute grayware do not aim to harm users. Pieces of grayware provide real functionality and value to the users. The companies that distribute grayware may disclose their collection habits in their privacy policies, with varying degrees of clarity. Grayware sits at the edge of legality; its behavior may be legal or illegal depending on the jurisdiction of the complaint and the wording of its privacy policy. Unlike malware or personal spyware, illegal grayware is punished with corporate fines rather than personal sentences. Even when the activity of grayware is legal, users may object to the data collection if they discover it. Application markets may choose to remove or allow grayware when detected on a case-by-case basis.”</p>
<p>For example, in 2009, bloggers raised concern over the PinchMedia LLC analytics framework.  This framework provided third party application developers usage information about their users.  The users were not informed and they were not given an option to opt-out [3].  In addition, Storm8 and MogoRoad also faced legal issues when it was discovered that they were collecting users’ contact information without informing them and then  transmitting the collected information in plaintext [3].</p>
<p><em>Spyware attacks</em>: The last category that [10] detailed was “Spyware attacks.”  “Spyware collects personal information such as location or text message history over a period of time. With personal spyware, the attacker has physical access to the device and installs the software without the user&#8217;s knowledge. Personal spyware sends the victim&#8217;s information to the person who installed the application onto the victim&#8217;s device, rather than to the author of the application. For example, a person might install personal spyware onto a spouse&#8217;s phone. It is legal to sell personal spyware in the U.S. because it does not defraud the purchaser (i.e., the attacker). Personal spyware is honest about its purpose to the person who purchases and installs the application. However, it may be illegal to install personal spyware on another person&#8217;s smartphone without his or her authorization [10].”</p>
<p>For more on platform-specific exploits in the papers we surveyed, please see [1, 2, 3, 4, 6, 10, 11, 18, 20].</p>
<p><strong>B) Defenses</strong></p>
<p>When defending against mobile phone security threats and mobile malware, there are two main categories: <em>prevention</em> and <em>detection and recovery</em>.  The majority of the work that has been already implemented falls into the class of prevention.  However, detection and recovery is becoming a more popular research topic.  We discuss some of the major defenses below:</p>
<ul>
<li><em>Code analysis (static and dynamic)</em>: There are two main techniques of determining an application’s characteristics: statically and dynamically.  Both have certain advantages and disadvantages.</li>
</ul>
<p>In static analysis, many techniques may be used to determine how the program works, such as decompilation, decryption, pattern matching and static system call analysis.  The central idea behind static analysis is finding signatures of malicious code in a fast and easy manner.  In [3], the author describes part of the App Store review process for applications and details that, using static analysis, one can dump the strings in a binary file and check them against a black list of forbidden classes, method names and file paths.  However, static analysis can be circumvented with obfuscation techniques.  Additionally, some languages, such as Objective-C, allow developers to look up classes and methods by name at runtime.  This feature of the language adds additional opportunity to “hide” malware functionality.</p>
<p>The primary disadvantage of static code analysis is that malicious code patterns have to be known in advance, making it impossible to automatically detect new malware or malicious polymorphic code without the intervention of a human expert [23].</p>
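<p>The string-dumping check can be sketched in a few lines.  This is an illustrative model only, not the actual App Store review tooling: the black list entries, the sample “binaries” and the function names are all hypothetical.  It also shows why splitting a name into pieces that are joined at runtime defeats the check.</p>

```python
import re

# Hypothetical black list; these entries are made up for illustration.
BLACKLIST = [b"HypotheticalPrivateClass", b"/private/var/example/contacts.db"]


def printable_strings(blob: bytes, min_len: int = 4) -> list:
    """Extract runs of printable ASCII, similar to the Unix `strings` tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return re.findall(pattern, blob)


def flag_forbidden(blob: bytes, blacklist=BLACKLIST) -> list:
    """Return the black-listed entries that appear in the binary's strings."""
    found = set()
    for text in printable_strings(blob):
        for bad in blacklist:
            if bad in text:
                found.add(bad)
    return sorted(found)


# A "binary" that embeds the forbidden names directly.
plain = (b"\x00\x01HypotheticalPrivateClass\x00\xff"
         b"/private/var/example/contacts.db\x00")

# The same class name stored in pieces, to be joined at runtime.
obfuscated = b"\x00Hypothet\x00icalPri\x00vateClass\x00"

print(flag_forbidden(plain))       # both entries are reported
print(flag_forbidden(obfuscated))  # [] because no contiguous match exists
```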
<p>Dynamic analysis is a set of techniques which involve running an application in a controlled environment and monitoring its behavior.  Various heuristics can be used to capture behavior of the application such as monitoring file changes, network activity, processes, threads and system call tracing [23].</p>
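<p>As a minimal sketch of one such heuristic, monitoring file changes, the code below snapshots a directory before and after running a stand-in “application” and reports what changed.  The <code>app_under_test</code> function and the file names are hypothetical; a real dynamic analysis environment would also trace network activity and system calls.</p>

```python
import hashlib
import os
import tempfile


def snapshot(root: str) -> dict:
    """Map each file path under root to the SHA-256 of its contents."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            state[os.path.relpath(path, root)] = digest
    return state


def diff(before: dict, after: dict) -> dict:
    """Classify paths as created, deleted or modified between snapshots."""
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(p for p in before
                           if p in after and before[p] != after[p]),
    }


def app_under_test(root: str) -> None:
    """Hypothetical stand-in for the monitored application."""
    with open(os.path.join(root, "settings.txt"), "w") as handle:
        handle.write("volume=11")
    with open(os.path.join(root, "contacts_copy.txt"), "w") as handle:
        handle.write("alice,bob")


with tempfile.TemporaryDirectory() as sandbox:
    with open(os.path.join(sandbox, "settings.txt"), "w") as handle:
        handle.write("volume=5")
    before = snapshot(sandbox)
    app_under_test(sandbox)
    report = diff(before, snapshot(sandbox))

print(report)
```

<p>Here the unexpected <code>contacts_copy.txt</code> would show up under “created”, which is the kind of behavior an analyst would then investigate.</p>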
<p>One large difference between the iOS App Store and the Android Marketplace is that applications that are released to the iOS App Store are reviewed by a human.  This accomplishes an important goal in that it rejects applications that are in a “legal gray zone such as casino gambling or collecting personal data.”  The Android Marketplace allows an author to post apps without needing any <em>a priori </em>review but rather it relies on “crowd-sourcing” to review the applications and post comments, either positive or negative.</p>
<ul>
<li><em>ASLR and DEP</em>:  ASLR (address space layout randomization) is a computer security method that involves re-arranging the positions of key data areas, including the position of libraries, the heap and the stack in the process’ address space.  This technique makes it more difficult for an attacker to predict target addresses [21].  For example, an attacker who is attempting to overflow the stack would first have to locate the address of the stack before they could overflow it.  If an attacker were to guess incorrectly and the program terminated, the next time the program was launched, the stack address would have moved again and the attacker would have to start all over again in their attempt to find it.  While the full description of how ASLR is implemented and its effectiveness is outside the scope of this paper, the reader may learn more at [21].</li>
</ul>
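<p>The “start all over again” effect can be illustrated with a small simulation.  This is a deliberately simplified model and not a description of any platform’s implementation: it assumes the base address is drawn uniformly from 2<sup>n</sup> slots on every launch and that the attacker guesses blindly.  The entropy values are hypothetical.</p>

```python
import random


def launches_until_hit(entropy_bits: int, rng: random.Random) -> int:
    """Count launches until a blind guess matches the randomized base."""
    slots = 2 ** entropy_bits
    launches = 0
    while True:
        launches += 1
        base = rng.randrange(slots)    # loader picks a fresh base each launch
        guess = rng.randrange(slots)   # attacker guesses without any leak
        if guess == base:
            return launches


def average_launches(entropy_bits: int, trials: int = 1000, seed: int = 1) -> float:
    """Average the number of launches needed over many simulated attackers."""
    rng = random.Random(seed)
    total = sum(launches_until_hit(entropy_bits, rng) for _ in range(trials))
    return total / trials


for bits in (4, 8):
    print(bits, "bits:", round(average_launches(bits)), "launches on average")
```

<p>Because each guess succeeds with probability 1/2<sup>n</sup>, the average grows to roughly 2<sup>n</sup> launches, so every extra bit of randomness doubles the attacker’s expected work in this model.</p>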
<p>In addition to ASLR, DEP (data execution prevention) is another security feature that prevents an application or service from executing code from a non-executable memory region.  This protection can be enforced in hardware and/or software.  Again, while the full description of how DEP is implemented and its effectiveness is outside the scope of this paper, the reader may learn more at [22].</p>
<p>The platforms that support ASLR and DEP include iOS and Windows Mobile.  Android has plans to support ASLR but the BlackBerry platform has neither.</p>
<ul>
<li><em>Application sandboxing</em>: “Sandboxing consists of running mobile code in a restricted environment called a sandbox [17].”  A sandbox can be characterized by two different mechanisms: 1) confining code, either through type checking, language properties or the use of protection domains to prevent the subversion of trusted code and 2) enforcing a fixed policy for the execution of code [17].</li>
</ul>
<p>When applied to the real world, each application will have its own environment.   Other applications should not be able to interfere with another application’s environment nor should that particular application be able to interfere with other applications’ environments.  Both iOS and Android implement application sandboxes.  In the iOS model, each application only has <em>read-write</em> access to a few directories (the application’s own directory and a temporary directory) and applications cannot read or write to any directories (including other applications’ and system directories) other than those for which they are authorized.</p>
<p>In the iOS sandbox, all applications share the same sandbox rules and they’re allowed any action any application could ever need [18].  In the Android sandbox, by contrast, “applications must explicitly share resources and data and do this by declaring permissions they need for additional capabilities not provided by the basic sandbox.  These additional permissions are granted or denied by the user at install time only. [13]”</p>
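<p>The directory rule can be modeled as a simple path-confinement check.  The sketch below is only a model of the policy, since the real enforcement happens inside the operating system; the directory names are hypothetical and POSIX-style paths are assumed.</p>

```python
import os

# Hypothetical writable roots for one application.
APP_ROOTS = ["/apps/com.example.game", "/apps/com.example.game/tmp"]


def is_allowed(path: str, writable_roots=APP_ROOTS) -> bool:
    """Allow access only if the resolved path lies inside a permitted root."""
    real = os.path.realpath(path)  # collapses ".." and resolves symlinks
    for root in writable_roots:
        root_real = os.path.realpath(root)
        if os.path.commonpath([real, root_real]) == root_real:
            return True
    return False


print(is_allowed("/apps/com.example.game/Documents/save.dat"))           # True
print(is_allowed("/apps/com.example.other/Documents/secret.db"))         # False
print(is_allowed("/apps/com.example.game/../com.example.other/secret"))  # False
```

<p>Resolving the path before comparing it matters: the third request looks like it starts inside the application’s directory but escapes it through <code>..</code>.</p>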
<ul>
<li><em>Permission systems</em>:  In Android, there is the additional protection of a permission system.  This permission system “treats all applications as potentially buggy or malicious so they are assigned a low-privilege ID that can only access their own files [19].”  Depending on what the application wants to do, it can request an elevation of permissions from the user, at install-time, which the user will ultimately grant or deny.</li>
</ul>
<p>For example, there are several different levels of permissions including Normal (permissions that protect access that could annoy but not harm the user, such as setting the wallpaper), Dangerous (permissions that protect access that could potentially harm the user, such as gathering private information), and System (permissions that protect access to the most dangerous privileges, such as deleting applications).</p>
<p>In addition to the permissions system, Android also has the Intents system.  Intents are typed interprocess messages that are directed to particular applications or system services, or broadcast to applications subscribing to a particular intent type [20].  Access to the Intents system is funneled through ActivityManagerService, which only allows intents to be sent by applications with the appropriate permissions and by processes with a UID that matches the system’s [19].</p>
<p>iOS does have a rudimentary concept of a permissions system but not one that is as in depth as the Android model.   For example, when an application attempts to access the user’s location via the GPS hardware, the user is asked whether this is an acceptable action and can allow or deny it.  However, for the most part, Apple relies on a number of other techniques to handle security issues, such as sandboxing, ASLR/DEP and code analysis.</p>
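<p>The install-time decision can be sketched as follows.  This is a simplified model of the three levels described above, not Android’s actual package manager logic; the mapping, the <code>install</code> function and the all-or-nothing prompt are assumptions made for illustration.</p>

```python
from enum import Enum


class Level(Enum):
    NORMAL = "normal"
    DANGEROUS = "dangerous"
    SYSTEM = "system"


# Illustrative mapping of permission names to the levels described above.
PERMISSION_LEVELS = {
    "SET_WALLPAPER": Level.NORMAL,
    "READ_CONTACTS": Level.DANGEROUS,
    "DELETE_PACKAGES": Level.SYSTEM,
}


def install(requested, user_accepts, is_system_app=False):
    """Return the set of granted permissions, or None if the user cancels."""
    dangerous = [p for p in requested
                 if PERMISSION_LEVELS[p] is Level.DANGEROUS]
    # Install time is the only moment the user is asked (all or nothing).
    if dangerous and not user_accepts(dangerous):
        return None
    granted = set()
    for name in requested:
        level = PERMISSION_LEVELS[name]
        if level is Level.SYSTEM and not is_system_app:
            continue  # never granted to ordinary third party applications
        granted.add(name)
    return granted


print(install(["SET_WALLPAPER", "READ_CONTACTS"], lambda names: True))
print(install(["SET_WALLPAPER", "READ_CONTACTS"], lambda names: False))
print(install(["SET_WALLPAPER", "DELETE_PACKAGES"], lambda names: True))
```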
<ul>
<li><em>ACLs and capability lists</em>: Both iOS and Android implement standard UNIX type permissions with users and groups.  This allows the systems to implement the concept of “least privilege” where accounts and users are able to access only the data that they are privileged to access.  In iOS, files have permissions and third party applications run as the user “mobile”, instead of root.  In addition, certain operations on Android require the proper permissions, such as accessing network interfaces.  Typically, when a user “roots” or “jailbreaks” their device, they are elevating the permissions of the application to that of a more privileged account to bypass certain security features such as sandboxing.</li>
</ul>
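<p>For readers unfamiliar with UNIX type permissions, this short sketch inspects the owner, group and other permission bits of a file on a POSIX system.  The file and the chosen mode are hypothetical examples and are not taken from either platform.</p>

```python
import os
import stat
import tempfile


def describe(path: str) -> dict:
    """Split a file's mode string into owner, group and other triples."""
    text = stat.filemode(os.stat(path).st_mode)   # for example "-rw-r-----"
    return {"owner": text[1:4], "group": text[4:7], "other": text[7:10]}


with tempfile.TemporaryDirectory() as folder:
    private_file = os.path.join(folder, "keys.db")
    with open(private_file, "w") as handle:
        handle.write("secret")
    # Owner may read and write, the group may read, everyone else gets nothing.
    os.chmod(private_file, 0o640)
    access = describe(private_file)

print(access)
```

<p>Under least privilege, a process running as an unrelated, unprivileged user falls into the “other” column and cannot open the file at all.</p>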
<ul>
<li><em>Code signing</em>: Code signing is a process where executable binaries are signed digitally by software authors to guarantee that the code has not been altered or corrupted [15].  This is an important part of the defense systems in the Android, BlackBerry, Windows Mobile and iOS security models, and it is used extensively in the Android and iOS models for validating third party applications.  An important distinction between the two is that with iOS code signing, the code must be signed with a certificate validated by a <em>certificate authority </em>or CA (in this case, Apple), whereas with Android, self-signed certificates are acceptable [13, 14].</li>
</ul>
<p>In addition to validating applications, iOS also uses code signing when booting up.  When a device running iOS boots up, the first significant piece of code that runs is the BootROM or the “SecureROM”, which is read-only [16].  Within this BootROM, an Apple root certificate is embedded such that the firmware can be validated as being official and secure.  Once the RSA signature has been checked, control is passed to the low level bootloader or the “LLB.”  This module runs several setup routines and checks the signature of iBoot, which is a stage 2 bootloader for all devices running iOS.  Once iBoot starts, it allows the device to go into a “recovery” or “DFU” mode that allows devices to be restored from any state.  Firmware images from Apple are signed and checked when the upgrade or downgrade occurs.  If the device boots normally instead of going into DFU mode, control is passed to <em>Launchd</em>, <em>CommCenter</em>, and <em>Springboard</em> [11].</p>
<p>In addition to validating system software, iOS also signs the code directory structure with SHA-1 hashes of memory pages, and a PKCS#7 signature is embedded in all binaries that are downloaded from the App Store [11].  Binaries that do not validate at run time are killed by the OS to prevent any binary that has been altered from running. Lastly, to provide a defense against rootkits, code directory hashes for system binaries are cached in the kernel [11].</p>
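<p>The per-page hashing idea can be sketched as below.  This is a toy model under stated assumptions: the 4096-byte page size, the function names and the sample data are hypothetical, and the trusted hash list is simply taken as given, whereas on a real device that list is itself protected by the embedded signature.</p>

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_hashes(binary: bytes, page_size: int = PAGE_SIZE) -> list:
    """Return the SHA-1 digest of every page of the binary."""
    return [hashlib.sha1(binary[i:i + page_size]).hexdigest()
            for i in range(0, len(binary), page_size)]


def first_bad_page(binary: bytes, trusted: list, page_size: int = PAGE_SIZE):
    """Return the index of the first page that fails validation, or None."""
    current = page_hashes(binary, page_size)
    if len(current) != len(trusted):
        return min(len(current), len(trusted))
    for index, (have, want) in enumerate(zip(current, trusted)):
        if have != want:
            return index
    return None


original = bytes(range(256)) * 48          # three pages of sample data
trusted = page_hashes(original)            # recorded at signing time

patched = bytearray(original)
patched[5000] ^= 0xFF                      # flip one byte inside page 1

print(first_bad_page(original, trusted))        # None
print(first_bad_page(bytes(patched), trusted))  # 1
```

<p>Hashing page by page lets the check run as each page is loaded, so a single altered byte is caught without re-hashing the whole binary.</p>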
<p>While regular users may not know or care whether a binary is signed properly, this particular process creates accountability and increased trust on the platform if an application can be traced back to a known source.</p>
<p>The astute reader will recognize that a signed binary may <em>not</em> necessarily contain safe or bug-free code &#8211; just code that is verified as coming from a particular known source and that has not been subject to tampering.  Also, if the system does not strictly enforce running only validated binaries, users could be tricked (through social engineering perhaps) into running code that fails to validate.  Lastly, one potential flaw with Google’s scheme is that Google does not require that a CA sign the application signing certificate &#8211; they only record the signatures for bookkeeping purposes.  The additional level of validation that Apple introduces could potentially reduce the amount of malware and spyware on Android if it were adopted there.</p>
<ul>
<li><em>Data encryption</em>: Data encryption and decryption are very CPU- and power-intensive activities.  Battery life on a mobile device is a very precious commodity because the usage model for a mobile device is such that the devices are meant to be used out in the field where a constant power source isn’t always available.  As such, there haven’t been any commercial implementations of automatic whole disk encryption.  However, certain OSes, such as iOS, encrypt particular items such as the Keychain (a password manager and storage framework) and provide the ability to encrypt individual files.  Another option for third party applications is to provide their own encryption services framework, using known models, such as PKI, and known programs, such as OpenSSL.</li>
</ul>
<p>One interesting study of mobile device encryption was done by the authors of [28].  In their research, they explored the use of <em>Field Programmable Gate Arrays</em> (FPGAs), processors and ASIC hardware in the context of finding a framework for encryption on hand-held communication units. They used the IDEA encryption algorithm to show the tradeoffs among the suggested technologies.  They measured their results using three different metrics: 1) performance, 2) programmability and 3) power consumption.  They determined that, since power consumption is directly related to frequency, FPGAs provided the highest performance per watt (MOPS/watt).</p>
<ul>
<li><em>Detection and recovery defenses</em>: A considerable amount of notable work has been done in this category of mobile malware defense.  For example, [24] presented various approaches for mitigating malware on mobile devices.  The authors implemented and evaluated the suggested approaches on Google Android. The work is divided into the following three segments: a host-based intrusion detection framework; an implementation of SELinux in Android; and static analysis of Android application files.</li>
</ul>
<p>[24] determined that to provide well-rounded protection, a security suite for mobile devices or smart phones (especially open-source ones such as Android) should include a collection of tools blending various capabilities that operate in synergistic fashion.</p>
<p>In [24], the authors’ first approach was an innovative host-based intrusion detection system (IDS) for detecting malware on mobile devices. This framework relies on a lightweight agent (in terms of CPU, memory and battery consumption) that continuously samples various features on a device, analyzes the collected data using machine learning and temporal reasoning methods and infers the state of the device. Features in groups such as Messaging, Phone Calls and Applications belong to the Application Framework category and were extracted through APIs provided by the framework; features in groups such as Keyboard, Touch Screen, Scheduling and Memory belong to the Linux Kernel category.</p>
<p>This study on anomaly detection was based on various detection algorithms. Its purpose was to understand how a detection algorithm, a particular feature selection method and a number of top features can be combined to differentiate between benign and malicious applications that are not included in the training set, when training and testing are performed on different devices, and to find the specific features that yield maximum detection accuracy. Empirical results suggest that the proposed framework is effective in detecting malware on mobile devices in general and on Android in particular (accuracy of 87.4% with false positive rate of 12.6%).</p>
<p>The authors’ second study examined the applicability of detecting malware instances using a light version of the Knowledge-based Temporal Abstraction (KBTA) method that can be activated on resource-limited devices. The new approach was applied to detecting malware on Android-powered devices. Evaluation results demonstrated the effectiveness of the new approach in detecting malicious applications on mobile devices (detection rate above 94% in most scenarios) and the feasibility of running such a system on mobile devices (CPU consumption was 3% on average).</p>
<p>This study also proposed the implementation of SELinux in Android in order to harden the Android system and to enforce low-level access control on critical Linux processes that run under privileged users. By choosing this route, the system can be better protected from scenarios in which an attacker exploits a vulnerability in one of the highly privileged processes.</p>
<p>[25] is another interesting study using malware behavior detection.  This paper proposed a behavior-based malware detection system for the Windows Mobile platform called WMMD (Windows Mobile Malware Detection system). WMMD uses API interception techniques to dynamically analyze an application’s behavior and compare it with a library of malicious behavior characteristics using model checking.</p>
<p>The architecture of the proposed framework consists of two modules: a Dynamic Analysis Module and a Behavior Detection Module. Both modules use the API interception technique to obtain information about running software. The Dynamic Analysis Module is responsible for analyzing the program’s behavior. The Behavior Detection Module monitors the process’s real-time information and compares it with a behavior signature library. Once it detects malicious behavior, it alerts the user and offers feedback that can be used to construct a new behavior model.</p>
<p>All the experiments were done first in the Windows Mobile Emulator on a PC and then verified on a real mobile phone. The emulator ran Windows Mobile 6.0 Professional, and the phone was an HTC PPC6800 running Windows Mobile 6.0.</p>
<p>To test WMMD’s effectiveness against obfuscation and packing techniques, the authors used the UPX packer to pack five Windows Mobile viruses and compared WMMD to six other anti-virus software packages for Windows Mobile 6.0, each with an updated virus signature database. The results revealed that the other anti-virus products could only detect a virus before packing and were ineffective once the virus was packed. WMMD, however, could still detect those viruses after packing, since it checks the API calls made during real execution, which packing cannot change. For example, <em>WinCE.Infojack.A</em> is a trojan that binds to popular software installation files. It extracts a file named <em>mservice.exe</em> to the \Windows directory, creates an <em>mservice.lnk</em> file in the \Windows\StartUp directory, and then starts the <em>mservice.exe</em> process. Other variants of <em>WinCE.Infojack.A</em> exhibit the same behavior, so WMMD only needs to monitor the CreateFile() and CreateProcess() APIs and check whether their arguments satisfy this specific behavior.</p>
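<p>The idea of matching intercepted API calls against a behavior signature can be sketched as follows. This is a simplified ordered-subsequence match, not the model checking used in [25]; the signature layout, the trace format and the filler calls in the example traces are assumptions made for illustration.</p>

```python
# Hypothetical behavior signature for the WinCE.Infojack.A example:
# ordered (API name, argument substring) pairs that must appear in a trace.
INFOJACK_SIGNATURE = [
    ("CreateFile", r"\windows\mservice.exe"),
    ("CreateFile", r"\windows\startup\mservice.lnk"),
    ("CreateProcess", "mservice.exe"),
]


def matches(trace, signature):
    """Return True if the signature steps occur, in order, within the trace."""
    step = 0
    for api, argument in trace:
        if step == len(signature):
            break
        wanted_api, wanted_arg = signature[step]
        # Compare case-insensitively, since Windows paths ignore case.
        if api == wanted_api and wanted_arg in argument.lower():
            step += 1
    return step == len(signature)


# Intercepted calls from a packed sample (illustrative, not captured data).
packed_trace = [
    ("CreateFile", r"\Windows\mservice.exe"),
    ("RegOpenKeyEx", r"HKLM\Software"),
    ("CreateFile", r"\Windows\StartUp\mservice.lnk"),
    ("CreateProcess", r"\Windows\mservice.exe"),
]
benign_trace = [
    ("CreateFile", r"\My Documents\notes.txt"),
    ("CreateProcess", r"\Windows\calc.exe"),
]

print(matches(packed_trace, INFOJACK_SIGNATURE))
print(matches(benign_trace, INFOJACK_SIGNATURE))
```

<p>The match depends only on what the program does at run time, which is why packing the file on disk does not affect the result.</p>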
<p>This evaluation on real-world mobile malware shows that behavioral detection can successfully detect malware variants whose behavioral patterns match existing patterns in the database, which other anti-virus products cannot detect.</p>
<p>In [26], the authors point out that malware does not always arrive through the provider network: they considered defending mobile devices against “proximity malware,” which spreads between devices that are physically near one another.  The dynamics of proximity propagation inherently depend upon the mobility dynamics of a user population in a given geographic region. Unfortunately, there is no ideal methodology for modeling user mobility. Traces of mobile user contacts reflect actual behavior, but they are difficult to generalize and only capture a subset of all contacts due to a lack of geographic coverage. Analytic epidemiological models are efficient to compute and scale well, but simplify many details. Synthetic models are flexible and provide the necessary geographic coverage, but lack the full authenticity of user mobility traces.</p>
<p>[26] assumed that devices have a trusted defense software component that can examine messages and files transferred between devices, securely record persistent information about these transfers, and control device hardware when necessary (e.g., disable radio communication). These assumptions may be strong, but not unreasonable given the increasing prevalence of trusted computing modules. However, if malware has the ability to disable defense software, we can predict what the result will be: unchecked propagation through a population.</p>
<p>The first strategy described in [26] simply uses local evidence to detect malware and prevent further dissemination by the device, such as by disabling the Bluetooth or WiFi radio. Preventing further propagation by disabling communication may inconvenience the user, but voice and messaging with the provider network remain possible. This strategy stops the infected device from spreading the malware but makes no attempt to notify other devices or the network about its presence. It serves as a useful baseline for comparison.</p>
<p>The second strategy described in [26] extends local detection with an active mitigation component. In this strategy, each device maintains a table <em>S</em> of signatures of malware files, such as an MD5 hash over the file content. After a device <em>X</em> infers that it is infected, it disables the malware and warns subsequent devices about it. Device <em>X</em> computes a content-based signature <em>s</em> over the file(s) that triggered the infection recognition (e.g., the hash it has used to track file transfers in the first place). When <em>X</em> comes into proximity contact with another device <em>Y</em>, <em>X</em> disseminates the signature <em>s</em> to <em>Y</em>. If <em>Y</em> is infected, it immediately disables the malware. <em>Y</em> then adds <em>s</em> to its signature table <em>S</em>. Whenever another device shares a file with <em>Y</em>, <em>Y</em> will check the file against the signatures in <em>S</em>. The device can then either delete the file or warn the user about the file.</p>
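<p>The signature-dissemination strategy can be modeled with a short sketch. The <code>Device</code> class, its method names and the file names are hypothetical; only the use of an MD5 hash over the file content as the signature comes from the description above, and the step where a device infers that it is infected is taken as given.</p>

```python
import hashlib


def signature(content):
    """Content-based signature s: an MD5 hash over the file content."""
    return hashlib.md5(content).hexdigest()


class Device:
    def __init__(self, name):
        self.name = name
        self.table = set()    # signature table S
        self.files = {}       # file name -> content held on the device

    def detect_infection(self, file_name):
        """The device infers it is infected: disable the file, record s."""
        content = self.files.pop(file_name)
        self.table.add(signature(content))

    def purge(self):
        """Disable any held file whose signature is now in S."""
        self.files = {name: content for name, content in self.files.items()
                      if signature(content) not in self.table}

    def meet(self, other):
        """Proximity contact: disseminate known signatures to the other device."""
        other.table |= self.table
        other.purge()

    def receive(self, file_name, content):
        """Check an incoming file against S and reject it on a match."""
        if signature(content) in self.table:
            return False
        self.files[file_name] = content
        return True


malware = b"malicious payload"
x, y = Device("X"), Device("Y")
x.files["game.sis"] = malware
y.files["game.sis"] = malware

x.detect_infection("game.sis")   # X recognizes the infection locally
x.meet(y)                        # X warns Y during a proximity contact
print(sorted(y.files))           # Y has disabled its copy
print(y.receive("again.sis", malware))
```

<p>An exact content hash only catches byte-identical copies, which is one reason the third strategy hands signature generation to a better-equipped server.</p>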
<p>The third strategy described in [26] relies upon the network provider to disseminate signatures using a broadcast mechanism. In addition to standard unicast messaging, providers are also able to send data packets over broadcast at a low cost.  In this strategy, whenever a device decides that it is infected, it sends the malware content to an anti-virus server in the provider network (using MMS). The server, since it presumably contains far greater processing power than the mobile devices, can compute a better quality signature. Also, due to access to anti-virus experts, the server may also be able to compute a patch that contains information on how devices may “cure” themselves and remove the infection from the device. Manual involvement in generating patches is also a possibility.</p>
<p>Lastly, in [27], the authors used data mining techniques to detect malware behavior.  This paper proposed a technique of ontology-based behavioral analysis to develop a detection method for smart phone malware. For the experiment, a mobile environment was constructed in the laboratory. An HTC HD2 smart phone with the Windows Mobile 6.5 operating system was adopted as the main test platform. The authors installed and ran their mobile malware detection system (MMDS) on the HTC HD2, and other smart phones then sent files or messages to it through MMS or SMS. The MMDS automatically filters all files and messages by extracting their behavioral characteristics, and the system determines the degree of danger of these behaviors. When the user confirms that a message poses a danger of intrusion, the system refuses the MMS or SMS.</p>
<p>In the experiment in [27], the proposed FPN model detected most of the new mobile malware. However, two pieces of mobile malware could not be detected by the FPN model. Because mobile malware samples are difficult to collect, the authors could not obtain a wide variety of malware types to test. Thus, the FPN model, which is built on ontology-based behavior analysis of the malware that was available, may not detect malware such as the two missed samples.</p>
<p><strong>I) Future Defense Work</strong></p>
<p>The authors would also like to explore the feasibility of implementing  other security features such as:</p>
<ul>
<li><em>Enhanced permissions systems</em>: It would be worthwhile to spend time researching how one could detect over-privileged applications and reduce their privilege level to only the privileges that are needed.  Additionally, research into permission models that allow finer-grained control over application permissions would allow developers and users to control exactly what information or modules could be used or accessed.  Lastly, research into dynamic analysis models to determine malicious activity would be useful, as mobile platforms and applications are becoming more complicated and future attacks could possibly target push notification systems and in-application purchasing systems.</li>
</ul>
<ul>
<li><em>Trusted computing modules</em>: Trusted computing is a technology that ensures that a computer will consistently behave in an expected way and that those behaviors will be strictly enforced by hardware and software.  If these modules could be included as hardware on the phone, we could ensure that hackers couldn’t deploy rootkits on phones because the boot process would be verified and secure.</li>
</ul>
<ul>
<li><em>Encryption modules</em>: Many desktop operating systems implement some form of full hard drive encryption, such as Windows BitLocker and Mac OS X FileVault.  Using a hardware encryption module, it would be possible to encrypt the entire hard drive without consuming a large amount of battery power.  For example, the user application space could be encrypted in a separate volume that is mounted and decrypted at boot time.  In addition to encrypting the entire user application space, applications could also provide their own separate encryption keys to encrypt application-specific data.</li>
</ul>
<ul>
<li><em>Firewall modules</em>: Since smartphones and mobile devices are now as powerful as desktop computers, and the trend of consumers using them as a replacement for desktop and laptop systems will continue, more personal and confidential data will be stored on mobile platforms.  It would be valuable to invest research time into mobile firewalls and packet filtering in an attempt to detect whether data harvesting is occurring on the device and whether the harvested data is being transferred across a network interface.</li>
</ul>
<p align="center">5. Applications Survey</p>
<p>The author surveyed over 230 applications (the full list of applications can be found in Appendix A), including applications in the “Top” categories on the iTunes Store, to determine what type of information could be extracted from auditing packet streams.  The results were quite surprising.</p>
<p>To perform this audit, the author launched one application at a time and used Wireshark to capture and analyze packets.  The experiment was performed on an open network that the author created.  The access point was a Cisco Small Business router (WAP4410N) and was configured using a hidden SSID and MAC address authentication to prevent outside users from associating with the access point and introducing extraneous packets.  While the author realizes that hidden SSIDs and MAC address authentication are easily defeated mechanisms, they were used to prevent casual users from using the access point.  The mobile devices used were an Apple iPod Touch 4G, an Apple iPad 1G and an iPhone 4, configured with iOS 5.0.1.</p>
<p>For reasons of classification, the authors created several different levels of potential security breaches.  The levels are defined as:</p>
<ul>
<li><em>None</em>: This level is defined as having no potential security breaches and no exposure of confidential information.</li>
</ul>
<ul>
<li><em>Low</em>: This level is defined as having a few potential security breaches or exposure of confidential information that could not directly affect the user, such as device IDs that could be used in tracking users (in iOS, these are called UUIDs).</li>
</ul>
<ul>
<li><em>Medium</em>: This level is defined as having several potential security breaches, exposure of confidential information that is potentially serious, or exposure of information that would allow an attacker to identify the user on an individual basis, such as addresses, latitudes or longitudes, etc.</li>
</ul>
<ul>
<li><em>High</em>: This level is defined as having multiple potential security breaches or exposure of extremely confidential information, such as account numbers, PINs, and username/password combinations.</li>
</ul>
<p>For more information on the specific application, including the version number of the application with the vulnerability, see Appendix A for a full listing.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="93">
<p align="center"><strong>Application</strong></p>
</td>
<td valign="top" width="93">
<p align="center"><strong>Level</strong></p>
</td>
<td valign="top" width="93">
<p align="center"><strong>Risks Found</strong></p>
</td>
</tr>
<tr>
<td valign="top" width="93">GrubHub</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID</p>
</td>
</tr>
<tr>
<td valign="top" width="93">The Weather Channel</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Path</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Handmade</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID</p>
</td>
</tr>
<tr>
<td valign="top" width="93">iHeartRadio</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">Reverse Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">TabbedOut</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID, Platform</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Priceline</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">UUID, Geocoded location, “Search” API is unencrypted</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Free WiFi</td>
<td valign="top" width="93">
<p align="center">Low</p>
</td>
<td valign="top" width="93">
<p align="right">Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Coupious</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Geocoded location, UUID, coupon redemption codes</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Delivery Status</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">UPS transmits reverse geocoded locations and tracking numbers.</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Color</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Reverse geocoded location and photos taken and shared by users</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Cloudette</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Username in plaintext and password, hashed with MD5</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Gas Buddy</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, hashed with MD5</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Ness</td>
<td valign="top" width="93">
<p align="center">Medium</p>
</td>
<td valign="top" width="93">
<p align="right">Reverse geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Southwest Airlines</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password in plaintext</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Minus</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password in plaintext</p>
</td>
</tr>
<tr>
<td valign="top" width="93">WordPress</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password in plaintext</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Foodspotting</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, Geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">ustream</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, UUID, geocoded location</p>
</td>
</tr>
<tr>
<td valign="top" width="93">Labelbox</td>
<td valign="top" width="93">
<p align="center">High</p>
</td>
<td valign="top" width="93">
<p align="right">Username and password, geocoded location</p>
</td>
</tr>
</tbody>
</table>
<p>The majority of the applications that were surveyed encrypted the exchanges of confidential or sensitive information, such as usernames, passwords and account numbers via SSL/TLS.</p>
<p>However, many applications performed some sort of tracking or storing of analytic information, such as passing the UUID in a call to a web service.  In some instances, this identifying information was not encrypted.  While this is not dangerous in the sense that an attacker could use the information to “identify” a particular person, none of the applications let users know that information such as the UUID, phone OS and model was being used or recorded, nor did they let the user “opt out.”</p>
<p>The largest single potential security breach was with the Southwest Airlines application.  Because the username and password were submitted to a web server via a POST operation in plaintext, an attacker could simply sniff for this data.  If a login were captured, one could use those credentials to log into that account and book travel, use award miles and possibly change information in the victim’s profile.  This is worrisome not only from the standpoint of a potential attacker fraudulently using a victim’s account and credit card information, but also because of the possibility of terrorist threats to air travel.</p>
<p>For example, consider the possibility of a person who is currently (and rightfully) on the Department of Homeland Security’s “No-Fly” list.  If this person were able to capture a victim’s credentials and create a fake ID, he could pass through TSA security without being stopped.</p>
<p>Of the 253 applications surveyed, 91.7% had no risk found, 3.1% had a low risk, 2.3% had a medium risk and 2.3% had a high risk.  While it would be desirable to have no applications in the “Medium” or “High” categories, the number of applications that the authors found to present a security risk was surprisingly large.  There are over 500,000 applications on the iOS App Store, so extrapolating the results, there could be at least 15,500 applications in the “Low” risk category and at least 11,500 applications in each of the “Medium” and “High” risk categories.</p>
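<p>The extrapolation is a straight scaling of the surveyed shares to the size of the store, which can be checked with a few lines. The sketch assumes exactly 500,000 applications (the text says “over”) and that the surveyed sample is representative of the whole store, which a sample drawn largely from the “Top” lists may not be.</p>

```python
STORE_SIZE = 500_000   # applications on the iOS App Store, per the text

# Shares reported by the survey, in tenths of a percent to stay in integers.
share_per_mille = {"Low": 31, "Medium": 23, "High": 23}

estimate = {level: STORE_SIZE * share // 1000
            for level, share in share_per_mille.items()}

for level, count in estimate.items():
    print(level, count)
print("Medium and High combined", estimate["Medium"] + estimate["High"])
```
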
<p>Overall, the proportion of applications with some sort of security risk is low.  This is not very surprising to the authors, as many of these applications are in the “Top” applications list and any potential security flaws would likely have already been found.</p>
<p>Because iOS does not have a robust privilege system, there is no way for users to know that their information is being used in a dangerous or insecure way.  While there is support for showing users that there is network traffic by using a spinning “network activity indicator”, it is certainly not mandatory for applications to do so.  In fact, a legitimate or malware application on iOS could access the network interfaces, sending and receiving information, and never alert the user.</p>
<p>Developers typically do not follow the principle of least privilege.  If an application needs a set of privileges for functionality, they will request them up front, not just when they are needed.  This is particularly dangerous because this could be an entry point for an attacker to compromise the application.</p>
<p>The authors of [19] surveyed 940 Android applications and found that more than 50% required 1 extra unnecessary permission and 6% required more than 4 unnecessary permissions.  Developers may request more permissions than are necessary because 1) they don’t understand the importance of security and least privilege, 2) they are planning future releases that will require these privileges or 3) they don’t fully understand how to work with the platform and make the code function correctly.</p>
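<p>At its core, flagging an over-privileged application is a set difference between the permissions it requests and the permissions its code actually exercises. The sketch below assumes both sets are already known; working out the second set is the hard part and is not shown. The two applications and their permission sets are made up for illustration and are not data from [19].</p>

```python
def unnecessary_permissions(requested, used):
    """Permissions an application requests but never exercises."""
    return set(requested) - set(used)


# Hypothetical applications: (permissions requested, permissions used).
apps = {
    "flashlight": (
        {"android.permission.CAMERA",
         "android.permission.INTERNET",
         "android.permission.READ_CONTACTS"},
        {"android.permission.CAMERA"},
    ),
    "notes": (
        {"android.permission.INTERNET"},
        {"android.permission.INTERNET"},
    ),
}

extra = {name: unnecessary_permissions(requested, used)
         for name, (requested, used) in apps.items()}
over_privileged = sorted(name for name, perms in extra.items() if perms)

print(over_privileged)
print(sorted(extra["flashlight"]))
```
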
<p>Since mobile devices and smartphones are unique in that they have a built-in billing system, there must be ongoing education of developers with emphasis on security and privacy or additional built-in measures in the OS to enforce security over code the developers write or the permissions for which they ask.</p>
<p align="center">6. Using Mobile Devices as Network Monitors</p>
<p>We also researched the possibility of using WiFi sniffing and cracking utilities on the iPhone, as well as the feasibility of releasing a spyware application into the iTunes App Store and collecting user information.  As a side note, all the techniques used here could easily be ported to other mobile platforms, such as Android, but this particular research focused on iOS devices.</p>
<p>To be able to perform basic packet sniffing, there are several critical elements that must be performed.  The first element is to be able to put the particular interface into a “promiscuous” mode, in the case of wired network interfaces or “monitor” mode, in the case of wireless network interfaces.</p>
<p>In normal network interface operation, the kernel will discard any packets that are not destined for the specific node.  Using “monitor” mode, we are able to capture and further analyze the packets that would normally be discarded.  This was a particularly easy piece of code (see Appendix B) to write (if one understands the BSD subsystem and C).  In our particular implementation, we have a utility function to toggle the “promiscuous” or “monitor” bit (IFF_PROMISC) in the interface flags word (which is read with the SIOCGIFFLAGS ioctl request) as well as a function to list all the interfaces that the OS knows about.</p>
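<p>The two helpers described above can be approximated in a few lines. This is not the code from Appendix B: it only shows the bit manipulation on a flags word held in an ordinary integer, with function names chosen for illustration. On a real system the flags would be read with the SIOCGIFFLAGS ioctl and written back with SIOCSIFFLAGS, which requires root.</p>

```python
import socket

IFF_PROMISC = 0x100   # promiscuous bit in the BSD and Linux net/if.h headers


def set_promiscuous(flags, enabled):
    """Return the flags word with the promiscuous bit set or cleared."""
    if enabled:
        return flags | IFF_PROMISC
    return flags & ~IFF_PROMISC


def is_promiscuous(flags):
    """True if the promiscuous bit is set in the flags word."""
    return bool(flags & IFF_PROMISC)


def list_interfaces():
    """Names of the network interfaces the OS knows about."""
    return [name for _index, name in socket.if_nameindex()]


flags = 0x8843                        # example flags word, bit initially clear
flags = set_promiscuous(flags, True)
print(hex(flags), is_promiscuous(flags))
```

<p>Every other bit in the flags word is left untouched, so the interface keeps its existing state when the mode is switched on and off.</p>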
<p>In addition to being able to manipulate the flags of the network interface, we also need to be able to access the bpf files, or Berkeley Packet Filter devices, that are located in /dev.  These bpf devices “provide a raw interface to data link layers in a protocol independent fashion.  All packets, even those destined for other hosts, are accessible through this mechanism” [29].</p>
<p>Access to these files is restricted to root only.  This presents a problem, as root access is not available to applications distributed through the App Store.  Therefore, to install and distribute this application, one must have a jailbroken iOS device.</p>
<p>To give the programmer friendlier access to the raw packet data from the bpf device, a user space library, libpcap (Unix or Linux) or WinPcap (Windows), is layered on top of these kernel devices.  Interestingly, the authors discovered that the SDK for Mac OS included a pre-built library, libpcap.dylib.  Unfortunately, that library is not available natively for iOS, but it can be cross-compiled for the ARM architecture by downloading the source from <a href="http://www.tcpdump.org/">www.tcpdump.org</a>.</p>
<p>Lastly, to perform useful functions with this data, one must create an interface to analyze, extract and filter packet information at the application level.  While a .pcap file could be created for later analysis, it might not be as useful as having live analysis.  Live analysis can easily be done with a user space program such as tcpdump or Wireshark.</p>
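<p>For the offline case, the classic .pcap file layout is simple enough to read without libpcap: a 24-byte global header followed by a 16-byte header in front of each captured packet. The sketch below handles only that classic microsecond-timestamp format in either byte order; the function name is hypothetical and the sample capture is built in memory rather than taken from a device.</p>

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4   # classic pcap with microsecond timestamps


def read_packets(data):
    """Return (link_type, [(ts_sec, ts_usec, packet_bytes), ...])."""
    # The magic number reveals the byte order the file was written in.
    for order in ("<", ">"):
        if struct.unpack_from(order + "I", data, 0)[0] == PCAP_MAGIC:
            break
    else:
        raise ValueError("not a classic pcap file")

    header = struct.Struct(order + "IHHiIII")   # 24-byte global header
    record = struct.Struct(order + "IIII")      # 16-byte per-packet header
    *_unused, link_type = header.unpack_from(data, 0)

    packets, offset = [], header.size
    while len(data) - offset >= record.size:
        ts_sec, ts_usec, incl_len, _orig_len = record.unpack_from(data, offset)
        offset += record.size
        packets.append((ts_sec, ts_usec, data[offset:offset + incl_len]))
        offset += incl_len
    return link_type, packets


# Build a one-packet capture in memory (link type 1 is Ethernet).
sample = (struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
          + struct.pack("<IIII", 10, 500, 3, 3) + b"abc")
print(read_packets(sample))
```

<p>Filtering and protocol decoding would then operate on the returned packet bytes, which is the part that tcpdump and Wireshark already provide.</p>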
<p>This particular research shows there could be a major potential data confidentiality problem with mobile devices.  Assuming that the device is jailbroken and that we were able to release an app into the Cydia App Store that a user could download, we could silently harvest and store personal and sensitive information from that particular user, <em>in addition</em> to any other device that is on the same open wireless network (assuming that the user was connected), without alerting the user.  In addition to being able to sniff packets, since we are in the Cydia App Store, we would have full use of all APIs, including Apple private APIs, to harvest personal information.</p>
<p>One question that the authors would like to explore in future research is if we could also use the <em>aircrack-ng</em> suite on mobile devices.</p>
<p align="center">7. Conclusion</p>
<p>In this paper, we examined the history of mobile threats and vulnerabilities as well as current threats and vulnerabilities against mobile devices.  We also researched and examined defense mechanisms that currently exist and proposed future research topics.  Additionally, we performed two experiments: one an audit of mobile application security and the other a study of the feasibility of turning a mobile device into an RF sniffing and data collection device.</p>
<p align="center">Appendix A: Full Application List</p>
<p>Bold applications represent applications bundled with iOS from Apple.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="87">
<p align="center"><strong>Application</strong></p>
</td>
<td valign="top" width="52">
<p align="center"><strong>Version</strong></p>
</td>
<td valign="top" width="84">
<p align="center"><strong>Application</strong></p>
</td>
<td valign="top" width="53">
<p align="center"><strong>Version</strong></p>
</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Messages</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">RedLaser Classic</td>
<td valign="top" width="53">2.9.8</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Calendar</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">eBay</td>
<td valign="top" width="53">2.4.0</td>
</tr>
<tr>
<td valign="top" width="87"><strong>App Store</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Craigslist</td>
<td valign="top" width="53">3.033</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Settings</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Key Ring</td>
<td valign="top" width="53">5.4</td>
</tr>
<tr>
<td valign="top" width="87">Spotify</td>
<td valign="top" width="52">0.4.21</td>
<td valign="top" width="84">Coupious</td>
<td valign="top" width="53">1.4.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Contacts</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Cars</td>
<td valign="top" width="53">1.6.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Notes</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Amazon PriceCheck</td>
<td valign="top" width="53">1.2.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Newsstand</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Linode</td>
<td valign="top" width="53">1.0.6</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Reminders</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Unfuddle</td>
<td valign="top" width="53">1.1.1</td>
</tr>
<tr>
<td valign="top" width="87">Find My Friends</td>
<td valign="top" width="52">1.0</td>
<td valign="top" width="84">MiniBooks</td>
<td valign="top" width="53">1.0.2</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Videos</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">iTC Mobile</td>
<td valign="top" width="53">2.4</td>
</tr>
<tr>
<td valign="top" width="87">Vlingo</td>
<td valign="top" width="52">2.1.1</td>
<td valign="top" width="84">Blueprint viewer</td>
<td valign="top" width="53">1.7</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Photos</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Square</td>
<td valign="top" width="53">2.2</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Camera</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">WordPress</td>
<td valign="top" width="53">2.9.2</td>
</tr>
<tr>
<td valign="top" width="87">Instagram</td>
<td valign="top" width="52">2.0.5</td>
<td valign="top" width="84"><strong>Maps</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">iMovie</td>
<td valign="top" width="52">1.2.2</td>
<td valign="top" width="84">FlightTrack</td>
<td valign="top" width="53">4.2.2</td>
</tr>
<tr>
<td valign="top" width="87">DashOfColor</td>
<td valign="top" width="52">3.1</td>
<td valign="top" width="84">Kayak</td>
<td valign="top" width="53">19.0.6</td>
</tr>
<tr>
<td valign="top" width="87">ColorSplash</td>
<td valign="top" width="52">1.7.2</td>
<td valign="top" width="84">Southwest</td>
<td valign="top" width="53">1.8.2</td>
</tr>
<tr>
<td valign="top" width="87">UStream Broadcaster</td>
<td valign="top" width="52">2.1</td>
<td valign="top" width="84">American</td>
<td valign="top" width="53">1.3.3</td>
</tr>
<tr>
<td valign="top" width="87">TiltShiftGen</td>
<td valign="top" width="52">2.02</td>
<td valign="top" width="84">Fly Delta</td>
<td valign="top" width="53">1.6</td>
</tr>
<tr>
<td valign="top" width="87">Gorillacam</td>
<td valign="top" width="52">1.2.2</td>
<td valign="top" width="84">Flysmart</td>
<td valign="top" width="53">2.5.25</td>
</tr>
<tr>
<td valign="top" width="87">CameraPlus</td>
<td valign="top" width="52">2.4</td>
<td valign="top" width="84">Priceline Negotiator</td>
<td valign="top" width="53">5.6</td>
</tr>
<tr>
<td valign="top" width="87">PS Express</td>
<td valign="top" width="52">2.03</td>
<td valign="top" width="84">Free WiFi</td>
<td valign="top" width="53">1.1.2</td>
</tr>
<tr>
<td valign="top" width="87">Dropcam</td>
<td valign="top" width="52">1.4.3</td>
<td valign="top" width="84">Google Earth</td>
<td valign="top" width="53">3.2</td>
</tr>
<tr>
<td valign="top" width="87">Chase</td>
<td valign="top" width="52">2.14.5799</td>
<td valign="top" width="84">Translator</td>
<td valign="top" width="53">3.1</td>
</tr>
<tr>
<td valign="top" width="87">Citibank</td>
<td valign="top" width="52">3.7</td>
<td valign="top" width="84"><strong>Phone</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Discover</td>
<td valign="top" width="52">2.1</td>
<td valign="top" width="84"><strong>Mail</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Fidelity</td>
<td valign="top" width="52">1.6.1</td>
<td valign="top" width="84"><strong>Safari</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">TD Trader</td>
<td valign="top" width="52">115.12</td>
<td valign="top" width="84"><strong>Music</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">PayPal</td>
<td valign="top" width="52">3.6</td>
<td valign="top" width="84">Flixster</td>
<td valign="top" width="53">5.02</td>
</tr>
<tr>
<td valign="top" width="87">Mint.com</td>
<td valign="top" width="52">2.0</td>
<td valign="top" width="84">Boxee</td>
<td valign="top" width="53">1.2.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Stocks</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">redbox</td>
<td valign="top" width="53">2.3.1</td>
</tr>
<tr>
<td valign="top" width="87">thinkorswim</td>
<td valign="top" width="52">115.12</td>
<td valign="top" width="84"><strong>YouTube</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Geico</td>
<td valign="top" width="52">2.0.2</td>
<td valign="top" width="84">Fandango</td>
<td valign="top" width="53">4.5</td>
</tr>
<tr>
<td valign="top" width="87">Dropbox</td>
<td valign="top" width="52">1.4.6</td>
<td valign="top" width="84">XFINITY TV</td>
<td valign="top" width="53">1.8</td>
</tr>
<tr>
<td valign="top" width="87">1Password</td>
<td valign="top" width="52">3.6.1</td>
<td valign="top" width="84">IMDb</td>
<td valign="top" width="53">2.3.1</td>
</tr>
<tr>
<td valign="top" width="87">Alarm Clock</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">i.TV</td>
<td valign="top" width="53">3.4.1</td>
</tr>
<tr>
<td valign="top" width="87">Planets</td>
<td valign="top" width="52">3.1</td>
<td valign="top" width="84">MobiTV</td>
<td valign="top" width="53">1.0</td>
</tr>
<tr>
<td valign="top" width="87">Dictation</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">Netflix</td>
<td valign="top" width="53">1.4</td>
</tr>
<tr>
<td valign="top" width="87">Inrix Traffic</td>
<td valign="top" width="52">3.5.1</td>
<td valign="top" width="84">VNC</td>
<td valign="top" width="53">3.2.1</td>
</tr>
<tr>
<td valign="top" width="87">Adobe Ideas</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">RDP</td>
<td valign="top" width="53">2.8</td>
</tr>
<tr>
<td valign="top" width="87">IP-Relay</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">TouchTerm</td>
<td valign="top" width="53">2.1</td>
</tr>
<tr>
<td valign="top" width="87">iLlumination</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Scorekeeper</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Fake-a-call</td>
<td valign="top" width="52">5.05</td>
<td valign="top" width="84">Statware</td>
<td valign="top" width="53">1.0.3</td>
</tr>
<tr>
<td valign="top" width="87">HeyTell</td>
<td valign="top" width="52">2.3.2</td>
<td valign="top" width="84">NIKE+ GPS</td>
<td valign="top" width="53">3.2.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Weather</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">MiLB Triple A</td>
<td valign="top" width="53">1.1.0</td>
</tr>
<tr>
<td valign="top" width="87">The Weather Channel</td>
<td valign="top" width="52">2.1.1</td>
<td valign="top" width="84">Pandora</td>
<td valign="top" width="53">3.1.16</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Calculator</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Shazam</td>
<td valign="top" width="53">4.8.4</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Clock</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Soundhound</td>
<td valign="top" width="53">4.1.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Compass</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">iHeartRadio</td>
<td valign="top" width="53">4.0.1</td>
</tr>
<tr>
<td valign="top" width="87"><strong>Voice Memos</strong></td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Last.fm</td>
<td valign="top" width="53">3.2.0</td>
</tr>
<tr>
<td valign="top" width="87">AroundMe</td>
<td valign="top" width="52">5.1.0</td>
<td valign="top" width="84">Songify</td>
<td valign="top" width="53">1.0.6</td>
</tr>
<tr>
<td valign="top" width="87">myAT&#038;T</td>
<td valign="top" width="52">2.1.2</td>
<td valign="top" width="84"><strong>iTunes</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">WeddingWire</td>
<td valign="top" width="52">3.1</td>
<td valign="top" width="84">Virtuoso</td>
<td valign="top" width="53">1.0</td>
</tr>
<tr>
<td valign="top" width="87">LogTen</td>
<td valign="top" width="52">3.3.1</td>
<td valign="top" width="84">I Am T-Pain</td>
<td valign="top" width="53">2.0.0</td>
</tr>
<tr>
<td valign="top" width="87">French</td>
<td valign="top" width="52">1.0</td>
<td valign="top" width="84">Scrabble</td>
<td valign="top" width="53">1.13.78</td>
</tr>
<tr>
<td valign="top" width="87">Binary Calc</td>
<td valign="top" width="52">1.4</td>
<td valign="top" width="84">Harbor Master</td>
<td valign="top" width="53">2.1</td>
</tr>
<tr>
<td valign="top" width="87">Amazon</td>
<td valign="top" width="52">1.8.0</td>
<td valign="top" width="84">Zombie Duck</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Groupon</td>
<td valign="top" width="52">1.5.7</td>
<td valign="top" width="84">Zombieville</td>
<td valign="top" width="53">1.7</td>
</tr>
<tr>
<td valign="top" width="87">LivingSocial</td>
<td valign="top" width="52">3.2.2</td>
<td valign="top" width="84">Table Tennis</td>
<td valign="top" width="53">4.1.0</td>
</tr>
<tr>
<td valign="top" width="87">Yowza</td>
<td valign="top" width="52">2.5</td>
<td valign="top" width="84">iFighter</td>
<td valign="top" width="53">1.9</td>
</tr>
<tr>
<td valign="top" width="87">Coupons</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Hired Gun</td>
<td valign="top" width="53">1.8</td>
</tr>
<tr>
<td valign="top" width="87">Airport Utility</td>
<td valign="top" width="52">1.0</td>
<td valign="top" width="84">Lock n’ Roll</td>
<td valign="top" width="53">3.0</td>
</tr>
<tr>
<td valign="top" width="87">Walgreens</td>
<td valign="top" width="52">3.0.2</td>
<td valign="top" width="84">Sneezies Lite</td>
<td valign="top" width="53">1.3</td>
</tr>
<tr>
<td valign="top" width="87">MyHumana</td>
<td valign="top" width="52">3.0.2</td>
<td valign="top" width="84">Pad Racer</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Nike + iPod</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Uno</td>
<td valign="top" width="53">2.0.0</td>
</tr>
<tr>
<td valign="top" width="87">Gold’s Gym Spotter</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">CamWow</td>
<td valign="top" width="53">2.2</td>
</tr>
<tr>
<td valign="top" width="87">Lose It!</td>
<td valign="top" width="52">3.7.2</td>
<td valign="top" width="84">Labelbox</td>
<td valign="top" width="53">1.3.1</td>
</tr>
<tr>
<td valign="top" width="87">FitnessTrack</td>
<td valign="top" width="52">1.5.5</td>
<td valign="top" width="84">Photosynth</td>
<td valign="top" width="53">1.1.2</td>
</tr>
<tr>
<td valign="top" width="87">LIVESTRONG</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Color Effects</td>
<td valign="top" width="53">3.1</td>
</tr>
<tr>
<td valign="top" width="87">MyFitnessPal</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Saturation</td>
<td valign="top" width="53">1.0</td>
</tr>
<tr>
<td valign="top" width="87">Nutrisystem</td>
<td valign="top" width="52">2.3</td>
<td valign="top" width="84">Peppermint</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Kindle</td>
<td valign="top" width="52">2.8.5</td>
<td valign="top" width="84">FlickStackrXP</td>
<td valign="top" width="53">1.9.6</td>
</tr>
<tr>
<td valign="top" width="87">Instapaper</td>
<td valign="top" width="52">4.0.1</td>
<td valign="top" width="84">Minus</td>
<td valign="top" width="53">2.1.3</td>
</tr>
<tr>
<td valign="top" width="87">iBooks</td>
<td valign="top" width="52"><strong>5.0.1</strong></td>
<td valign="top" width="84">Gallery</td>
<td valign="top" width="53">2.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Zinio</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">Handmade</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Twitter</td>
<td valign="top" width="52">4.0</td>
<td valign="top" width="84">StubHub</td>
<td valign="top" width="53">2.4.1</td>
</tr>
<tr>
<td valign="top" width="87">Facebook</td>
<td valign="top" width="52">4.0.3</td>
<td valign="top" width="84">Pushpins</td>
<td valign="top" width="53">2.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Google+</td>
<td valign="top" width="52">1.0.7.2940</td>
<td valign="top" width="84">Black Friday</td>
<td valign="top" width="53">2.0</td>
</tr>
<tr>
<td valign="top" width="87">foursquare</td>
<td valign="top" width="52">4.1.2</td>
<td valign="top" width="84">Sam’s Club</td>
<td valign="top" width="53">2.1.1</td>
</tr>
<tr>
<td valign="top" width="87">LinkedIn</td>
<td valign="top" width="52">4.2</td>
<td valign="top" width="84">Cyber Monday</td>
<td valign="top" width="53">2.1.0</td>
</tr>
<tr>
<td valign="top" width="87">Meebo</td>
<td valign="top" width="52">1.95</td>
<td valign="top" width="84">Words With Friends</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Yelp</td>
<td valign="top" width="52">5.5.0</td>
<td valign="top" width="84">Ultimate Free Word Finder</td>
<td valign="top" width="53">1.01</td>
</tr>
<tr>
<td valign="top" width="87">PingChat</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Mad Gab</td>
<td valign="top" width="53">2.0</td>
</tr>
<tr>
<td valign="top" width="87">Bump</td>
<td valign="top" width="52">2.5.6</td>
<td valign="top" width="84">Metal Storm</td>
<td valign="top" width="53">4.0.2</td>
</tr>
<tr>
<td valign="top" width="87">Color</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">Need For Speed</td>
<td valign="top" width="53">1.0.11</td>
</tr>
<tr>
<td valign="top" width="87">Cloudette</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Madden NFL 2010</td>
<td valign="top" width="53"></td>
</tr>
<tr>
<td valign="top" width="87">soundtracking</td>
<td valign="top" width="52">2.0.2</td>
<td valign="top" width="84">Shizzlr</td>
<td valign="top" width="53">3.2.1</td>
</tr>
<tr>
<td valign="top" width="87">Free RSS</td>
<td valign="top" width="52">3.4</td>
<td valign="top" width="84">Flashlight</td>
<td valign="top" width="53">5.1</td>
</tr>
<tr>
<td valign="top" width="87">NetNewsWire</td>
<td valign="top" width="52">2.0.5</td>
<td valign="top" width="84">Tip Calculator</td>
<td valign="top" width="53">1.3.1</td>
</tr>
<tr>
<td valign="top" width="87">FOX News</td>
<td valign="top" width="52">1.2.4</td>
<td valign="top" width="84">PCalc Lite</td>
<td valign="top" width="53">2.4.3</td>
</tr>
<tr>
<td valign="top" width="87">OpenTable</td>
<td valign="top" width="52">3.4.2</td>
<td valign="top" width="84">Fake Call</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Urbanspoon</td>
<td valign="top" width="52">1.17</td>
<td valign="top" width="84">To Do</td>
<td valign="top" width="53">3.2</td>
</tr>
<tr>
<td valign="top" width="87">Epicurious</td>
<td valign="top" width="52">3.0.1</td>
<td valign="top" width="84">Google</td>
<td valign="top" width="53">1.0.0.8117</td>
</tr>
<tr>
<td valign="top" width="87">WinePhD</td>
<td valign="top" width="52">1.2</td>
<td valign="top" width="84">Evernote</td>
<td valign="top" width="53">4.1.6</td>
</tr>
<tr>
<td valign="top" width="87">TabbedOut</td>
<td valign="top" width="52">2.3.3</td>
<td valign="top" width="84">Coin Flip</td>
<td valign="top" width="53">2.2</td>
</tr>
<tr>
<td valign="top" width="87">Foodspotting</td>
<td valign="top" width="52">2.7</td>
<td valign="top" width="84">Grades 2</td>
<td valign="top" width="53">2.03</td>
</tr>
<tr>
<td valign="top" width="87">GrubHub</td>
<td valign="top" width="52">2.20</td>
<td valign="top" width="84">Sundry Notes</td>
<td valign="top" width="53">3.2</td>
</tr>
<tr>
<td valign="top" width="87">RecipeGrazer</td>
<td valign="top" width="52">1.3</td>
<td valign="top" width="84">OneNote</td>
<td valign="top" width="53">1.2</td>
</tr>
<tr>
<td valign="top" width="87">Starbucks</td>
<td valign="top" width="52">2.1.1</td>
<td valign="top" width="84">Enigmo</td>
<td valign="top" width="53">4.1</td>
</tr>
<tr>
<td valign="top" width="87">Starbucks Mobile Card</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">Angry Birds</td>
<td valign="top" width="53">1.6.3</td>
</tr>
<tr>
<td valign="top" width="87">Ness</td>
<td valign="top" width="52">1.1</td>
<td valign="top" width="84">JellyCar</td>
<td valign="top" width="53">1.5.4</td>
</tr>
<tr>
<td valign="top" width="87">iDisk</td>
<td valign="top" width="52">1.2.1</td>
<td valign="top" width="84">Runway</td>
<td valign="top" width="53">1.6</td>
</tr>
<tr>
<td valign="top" width="87">Remote</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">RockBand Free</td>
<td valign="top" width="53">1.3.49</td>
</tr>
<tr>
<td valign="top" width="87">Apple Store</td>
<td valign="top" width="52">2.0</td>
<td valign="top" width="84"><strong>Game Center</strong></td>
<td valign="top" width="53"><strong>5.0.1</strong></td>
</tr>
<tr>
<td valign="top" width="87">Find iPhone</td>
<td valign="top" width="52">1.3</td>
<td valign="top" width="84">App For Cats</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Pages</td>
<td valign="top" width="52">1.5</td>
<td valign="top" width="84">PadRacer</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">Places</td>
<td valign="top" width="52">1.31</td>
<td valign="top" width="84">Implode</td>
<td valign="top" width="53">2.2.4</td>
</tr>
<tr>
<td valign="top" width="87">TripAdvisor</td>
<td valign="top" width="52">5.9</td>
<td valign="top" width="84">Astronut</td>
<td valign="top" width="53">1.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Google Latitude</td>
<td valign="top" width="52">2.2.1</td>
<td valign="top" width="84">Monopoly</td>
<td valign="top" width="53">1.2.9</td>
</tr>
<tr>
<td valign="top" width="87">Gas Buddy</td>
<td valign="top" width="52">1.10</td>
<td valign="top" width="84">Deliveries</td>
<td valign="top" width="53">4.5</td>
</tr>
<tr>
<td valign="top" width="87">Maplets</td>
<td valign="top" width="52">2.2.2</td>
<td valign="top" width="84">Skype</td>
<td valign="top" width="53">3.5.454</td>
</tr>
<tr>
<td valign="top" width="87">iTranslate</td>
<td valign="top" width="52">5.1</td>
<td valign="top" width="84">Units</td>
<td valign="top" width="53">2.1.2</td>
</tr>
<tr>
<td valign="top" width="87">Translate</td>
<td valign="top" width="52">1.6.2</td>
<td valign="top" width="84">NCAA Football 2011</td>
<td valign="top" width="53"></td>
</tr>
<tr>
<td valign="top" width="87">KG Free</td>
<td valign="top" width="52"></td>
<td valign="top" width="84">ESPN ScoreCenter</td>
<td valign="top" width="53">2.2.2</td>
</tr>
<tr>
<td valign="top" width="87">Wikipedia</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">Ski Report</td>
<td valign="top" width="53">2.2.1</td>
</tr>
<tr>
<td valign="top" width="87">White Noise</td>
<td valign="top" width="52">5.0.3</td>
<td valign="top" width="84">EpicMix</td>
<td valign="top" width="53">2.0.1</td>
</tr>
<tr>
<td valign="top" width="87">Sleep Machine Lite</td>
<td valign="top" width="52">2.0.1</td>
<td valign="top" width="84">MLB At Bat</td>
<td valign="top" width="53">4.6.1</td>
</tr>
<tr>
<td valign="top" width="87">Inception</td>
<td valign="top" width="52">1.6</td>
<td valign="top" width="84">Purdue</td>
<td valign="top" width="53">3.0</td>
</tr>
<tr>
<td valign="top" width="87">Sleep</td>
<td valign="top" width="52">2.0.1</td>
<td valign="top" width="84">NASA</td>
<td valign="top" width="53">1.43</td>
</tr>
<tr>
<td valign="top" width="87">Night Stand</td>
<td valign="top" width="52">1.0.4</td>
<td valign="top" width="84">80,000 Wallpapers</td>
<td valign="top" width="53">1.98</td>
</tr>
<tr>
<td valign="top" width="87">Geico BroStache</td>
<td valign="top" width="52">1.0.1</td>
<td valign="top" width="84">Wedding 911</td>
<td valign="top" width="53">1.06</td>
</tr>
<tr>
<td valign="top" width="87">CamCard</td>
<td valign="top" width="52">2.6.0.4</td>
<td valign="top" width="84">Path</td>
<td valign="top" width="53">2.0.2</td>
</tr>
<tr>
<td valign="top" width="87">Offline Pages</td>
<td valign="top" width="52">1.5.2</td>
<td valign="top" width="84">Facebook Messenger</td>
<td valign="top" width="53">1.5.2</td>
</tr>
<tr>
<td valign="top" width="87">GPS Tracker</td>
<td valign="top" width="52">1.2.2</td>
<td valign="top" width="84">Quora</td>
<td valign="top" width="53">1.1</td>
</tr>
<tr>
<td valign="top" width="87">TextPics Free</td>
<td valign="top" width="52">2.2</td>
<td valign="top" width="84">Big Button Box</td>
<td valign="top" width="53">3.0</td>
</tr>
<tr>
<td valign="top" width="87">Peel</td>
<td valign="top" width="52">2.0</td>
<td valign="top" width="84"></td>
<td valign="top" width="53"></td>
</tr>
</tbody>
</table>
<p align="center">Appendix B</p>
<pre>#include &lt;stdio.h&gt;</pre>
<pre>#include &lt;stdlib.h&gt;</pre>
<pre>#include &lt;unistd.h&gt;</pre>
<pre>#include &lt;string.h&gt;</pre>
<pre>#include &lt;sys/errno.h&gt;</pre>
<pre>#include &lt;sys/types.h&gt;</pre>
<pre>#include &lt;sys/ioctl.h&gt;</pre>
<pre>#include &lt;sys/socket.h&gt;</pre>
<pre>#include &lt;net/if.h&gt;</pre>
<pre>int go_promisc(int);</pre>
<pre>int get_iface_list(struct ifconf *);</pre>
<pre>int get_iface_names(void);</pre>
<pre>int go_promisc(int on) {</pre>
<pre>int fd, status;</pre>
<pre>struct ifreq ifr; /* stack storage, so there is nothing to free */</pre>
<pre>memset(&#038;ifr, 0, sizeof(ifr));</pre>
<pre>strncpy(ifr.ifr_name, "en0", sizeof(ifr.ifr_name) - 1);</pre>
<pre>fd = socket(AF_INET, SOCK_STREAM, 0);</pre>
<pre>if(fd < 0) {</pre>
<pre>perror("opening socket");</pre>
<pre>return -1;</pre>
<pre>}</pre>
<pre>/* read the current flags; stop here rather than write back garbage */</pre>
<pre>status = ioctl(fd, SIOCGIFFLAGS, &#038;ifr);</pre>
<pre>if(status < 0) {</pre>
<pre>perror("ioctl(SIOCGIFFLAGS)");</pre>
<pre>close(fd);</pre>
<pre>return -1;</pre>
<pre>}</pre>
<pre>if(on) {</pre>
<pre>ifr.ifr_flags |= IFF_PROMISC;</pre>
<pre>ifr.ifr_flags |= IFF_ALLMULTI;</pre>
<pre>}</pre>
<pre>else {</pre>
<pre>ifr.ifr_flags &#038;= ~IFF_PROMISC;</pre>
<pre>ifr.ifr_flags &#038;= ~IFF_ALLMULTI;</pre>
<pre>}</pre>
<pre>/* write the updated flags back to the interface */</pre>
<pre>status = ioctl(fd, SIOCSIFFLAGS, &#038;ifr);</pre>
<pre>if(status < 0) {</pre>
<pre>perror("ioctl(SIOCSIFFLAGS)");</pre>
<pre>status = -1;</pre>
<pre>}</pre>
<pre>close(fd);</pre>
<pre>return status;</pre>
<pre>}</pre>
<pre>int get_iface_list(struct ifconf *ifconf) {</pre>
<pre>int sock, rval;</pre>
<pre>sock = socket(AF_INET, SOCK_STREAM, 0);</pre>
<pre>if(sock < 0) {</pre>
<pre>perror("opening socket");</pre>
<pre>return (-1);</pre>
<pre>}</pre>
<pre>if((rval = ioctl(sock, SIOCGIFCONF, (char *)ifconf)) < 0) {</pre>
<pre>perror("ioctl(SIOCGIFCONF)");</pre>
<pre>}</pre>
<pre>close(sock);</pre>
<pre>return rval;</pre>
<pre>}</pre>
<pre>int get_iface_names() {</pre>
<pre>static struct ifreq ifreqs[20];</pre>
<pre>struct ifconf ifconf;</pre>
<pre>int  nifaces, i;</pre>
<pre>memset(&#038;ifconf, 0, sizeof(ifconf));</pre>
<pre>ifconf.ifc_buf = (char*) (ifreqs);</pre>
<pre>ifconf.ifc_len = sizeof(ifreqs);</pre>
<pre>if(get_iface_list(&#038;ifconf) < 0) {</pre>
<pre>return -1;</pre>
<pre>}</pre>
<pre>nifaces =  ifconf.ifc_len / sizeof(struct ifreq);</pre>
<pre>printf("Interfaces (count = %d)\n", nifaces);</pre>
<pre>for(i = 0; i < nifaces; i++) {</pre>
<pre>printf("\t%-10s\n", ifreqs[i].ifr_name);</pre>
<pre>}</pre>
<pre>return 0;</pre>
<pre>}</pre>
<pre>int main(int argc, char *argv[]) {</pre>
<pre>if(argc != 2) {</pre>
<pre>printf("usage: promisc [ON | OFF]\n");</pre>
<pre>return -1;</pre>
<pre>}</pre>
<pre>get_iface_names();</pre>
<pre>if(strcmp(argv[1], "ON") == 0) {</pre>
<pre>return go_promisc(1);</pre>
<pre>}</pre>
<pre>else if(strcmp(argv[1], "OFF") == 0) {</pre>
<pre>return go_promisc(0);</pre>
<pre>}</pre>
<pre>else {</pre>
<pre>printf("usage: promisc [ON | OFF]\n");</pre>
<pre>return -1;</pre>
<pre>}</pre>
<pre>return 0;</pre>
<pre>}</pre>
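<p>The flag handling in go_promisc can also be isolated into a small pure helper, which makes the bit manipulation easy to check without touching a real interface or needing root. The sketch below is only an illustration: toggle_promisc, is_promisc and promisc_self_check are hypothetical names that do not appear in the listing above, and the _DEFAULT_SOURCE define is an assumption that only matters on glibc when compiling with a strict -std= mode.</p>

```c
#define _DEFAULT_SOURCE   /* assumption: exposes IFF_* on glibc under strict -std= modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>       /* IFF_UP, IFF_PROMISC, IFF_ALLMULTI */

/* Hypothetical helper: returns the flags word with promiscuous and
 * all-multicast reception switched on or off, leaving other bits alone. */
static short toggle_promisc(short flags, int on) {
    const short mask = (short)(IFF_PROMISC | IFF_ALLMULTI);
    return on ? (short)(flags | mask) : (short)(flags & ~mask);
}

/* Hypothetical helper: reports whether a flags word, such as the one
 * returned by SIOCGIFFLAGS, has promiscuous mode set. */
static int is_promisc(short flags) {
    return (flags & IFF_PROMISC) != 0;
}

/* Usage example: turn the mode on and off again on an "up" interface
 * and confirm that only the two targeted bits change. */
static void promisc_self_check(void) {
    short flags = IFF_UP;
    flags = toggle_promisc(flags, 1);
    assert(is_promisc(flags));
    assert(flags & IFF_UP);
    flags = toggle_promisc(flags, 0);
    assert(!is_promisc(flags));
    assert(flags == IFF_UP);
}
```

<p>Keeping the bit logic separate from the ioctl calls is a design choice rather than a requirement; it simply lets the on/off behavior be exercised on any machine.</p>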
<p align="center">References</p>
<p>[1]     M. Hypponen, “Malware Goes Mobile”, November 2006, Scientific American Magazine. Pages 70–77.  http://www.cs.virginia.edu/~robins/Malware_Goes_Mobile.pdf</p>
<p>[2]     J. Bickford, O. O'Hare, A. Baliga, V. Ganapathy, and L. Iftode, “Rootkits on Smartphones: Attacks, Implications and Opportunities”. ACM, In the Workshop on Mobile Computing Systems and Applications. Annapolis, MD. Feb. 2010.</p>
<p>[3]     N. Seriot, “iPhone Privacy”, Black Hat DC 2010. Arlington, Virginia, USA. <a href="http://seriot.ch/">http://seriot.ch</a></p>
<p>[4]     R. Schlegel, K. Zhang, X. Zhou, M. Intwala, A. Kapadia, X. Wang, “Soundminer: A Stealthy and Context-Aware Sound Trojan for Smartphones”. In Proceedings of the 18th Annual Network &#038; Distributed System Security Symposium (NDSS). Feb. 2011.</p>
<p>[5]     J. Rocha. “The Droid: Is this the smartphone consumers are looking for?”. Nov. 2011. http://blog.nielsen.com/nielsenwire/consumer/the-droid-is-this-the-smartphone-consumers-are-looking-for/</p>
<p>[6]     A. Bose. “Propagation, detection and containment of mobile malware”. 2008. http://hdl.handle.net/2027.42/60849</p>
<p>[7]     http://www.symbianpoint.com/types-latest-list-mobile-viruses.html</p>
<p>[8]     A. Gostev. “Mobile Malware Evolution: An Overview, Part 1”. Sep. 2006. <a href="http://www.securelist.com/en/analysis?pubid=200119916">http://www.securelist.com/en/analysis?pubid=200119916</a></p>
<p>[9]     McAfee. “McAfee Threats Report, Third Quarter 2011”. 2011.  www.mcafee.com/us/resources/reports/rp-quarterly-threat-q3-2011.pdf</p>
<p>[10]  A. Felt, M. Finifter, E. Chin, S. Hanna, D. Wagner. “A Survey of Mobile Malware in the Wild.” ACM CCS Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM). Oct. 2011.</p>
<p>[11]  T. Martin, M. Hsiao, D. Ha, J. Krishnaswami. “Denial-of-Service Attacks on Battery Powered Mobile Computers”. Proc. of Second IEEE Annual Conference on Pervasive Computing and Communications (PERCOM). 2004.</p>
<p>[12]  K. Aras. “Jailbreaking iOS - How an iPhone breaks free.” Stuttgart Media University.</p>
<p>[13]  Google. “Application Signing”. 2011. http://developer.android.com/guide/publishing/app-signing.html</p>
<p>[14]  Apple. “Configuring Development Assets”. 2011. http://developer.apple.com/library/IOs/#documentation/Xcode/Conceptual/ios_development_workflow/100-Configuring_Your_Developmet_Assets/identities_and_devices.html</p>
<p>[15]  “Code Signing.” 2011. http://en.wikipedia.org/wiki/Code_signing</p>
<p>[16]  The iPhone Wiki. 2011. http://www.theiphonewiki.com/wiki/index.php</p>
<p>[17]  S. Loureiro, R. Molva, Y. Roudier. “Mobile Code Security”. Institut Eurecom. 2011. <a href="http://www.eurecom.fr/~nsteam/Papers/mcs5.pdf">http://www.eurecom.fr/~nsteam/Papers/mcs5.pdf</a></p>
<p>[18]  C. Miller, “Mobile Attacks and Defenses”, IEEE Security and Privacy. Vol. 9, Issue 4, Pages 68-70, July-Aug 2011</p>
<p>[19]  A. Felt, E. Chin, S. Hanna, D. Song, D. Wagner. “Android Permissions Demystified”. Proceedings of the ACM Conference on Computer and Communications Security (CCS 2011). 2011.</p>
<p>[20]  W. Enck, D. Octeau, P. McDaniel, S. Chaudhuri. “A Study of Android Application Security.” Proceedings of the 20th USENIX Security Symposium. 2011.</p>
<p>[21]  H. Shacham, M. Page, B. Pfaff, E. Goh, N. Modadugu, D. Boneh. “On the effectiveness of address-space randomization”. Proceedings of the ACM Conference on Computer and Communications Security (CCS’04). New York. 2004.</p>
<p>[22]  R. Hund, T. Holz, F. Freiling. “Return-oriented rootkits: bypassing kernel code integrity protection mechanisms”. Proceedings of the 18th Conference on USENIX Security Symposium (SSYM’09). Berkeley, CA. 2009.</p>
<p>[23]  T. Blasing, L. Batyuk, A.D. Schmidt, S. A. Camtepe, S. Albayrak. “An Android Application Sandbox System for Suspicious Software Detection.”</p>
<p>[24]  A. Shabtai. “Malware Detection on Mobile Devices”. Proceedings of IEEE on Mobile Data Management. Pages 289-290. 2010.</p>
<p>[25]  S. Dai, Y. Liu, T. Wang, T. Wei, W. Zou. “Behavior based malware detection on mobile phones”. IEEE Wireless Communications Networking and Mobile Computing (WiCOM). Pages 1-4. 2010.</p>
<p>[26]  G. Zyba, G. Voelker, M. Liljenstam, A. Mehes, P. Johansson. “Defending mobile phones from proximity malware”. Proceedings of IEEE INFOCOM. 2009.</p>
<p>[27]  H.S. Chiang, W.J. Tsaur. “Identifying smartphone malware using data-mining technology”. Proceedings of IEEE Computer Communications and Networks (ICCCN). Pages 1-6. 2011.</p>
<p>[28]  O. Mencer, M. Mar, M.J. Flynn. “Hardware Software Tri-Design of Encryption for Mobile Communications Units.” Proceedings of IEEE Acoustics, Speech and Signal Processing. 1998.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/12/survey-mobile-device-security-threats-vulnerabilities-defenses/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>BackTrack 5 R1 and Older Wireless Dongles</title>
		<link>http://afewguyscoding.com/2011/10/backtrack-5-r1-older-wireless-dongles/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=backtrack-5-r1-older-wireless-dongles</link>
		<comments>http://afewguyscoding.com/2011/10/backtrack-5-r1-older-wireless-dongles/#comments</comments>
		<pubDate>Fri, 28 Oct 2011 15:17:09 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[802.11]]></category>
		<category><![CDATA[backtrack]]></category>
		<category><![CDATA[dongle]]></category>
		<category><![CDATA[driver]]></category>
		<category><![CDATA[drivers]]></category>
		<category><![CDATA[firmware]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[ndiswrapper]]></category>
		<category><![CDATA[netgear]]></category>
		<category><![CDATA[p54usb]]></category>
		<category><![CDATA[ubuntu]]></category>
		<category><![CDATA[usb]]></category>
		<category><![CDATA[wireless]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=332</guid>
		<description><![CDATA[Recently I started doing some interesting security work. I downloaded a great series from Vivek Ramachandran on 802.11 Wireless Hacking. For the series, he suggested using the Alfa Networks AWUS036h card. Because I have a wedding coming up and I need to save money, I wanted to see if I could get it working without [...]]]></description>
				<content:encoded><![CDATA[<p>Recently I started doing some interesting security work. I downloaded a <a href="http://www.securitytube.net/user/Vivek-Ramachandran" target="_blank">great series</a> from Vivek Ramachandran on 802.11 wireless hacking. For the series, he suggests using the Alfa Networks AWUS036H card. Because I have a wedding coming up and need to save money, I wanted to see whether I could follow along with an old NetGear USB dongle (WG111v1) that I already had. When I first plugged it into my BT5 VM, it produced the error message &#8220;(p54usb) cannot load firmware isl3887usb (-2)&#8221;. Using ndiswrapper is a fine way to go if you never plan to use monitor mode with the network card, but I needed that capability for the tutorial series. To solve the issue, install the non-free firmware package with the following command, and your wireless adapter should work, provided it can use the p54 driver.<br />
<code>apt-get install linux-firmware-nonfree</code></p>
<p>I discovered all of this after several hours of building drivers, realizing that I had accidentally downloaded the x64 version of BT5 (which is why the drivers weren&#8217;t working), and learning that ndiswrapper doesn&#8217;t (and won&#8217;t ever) support monitor mode.</p>
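<p>The &#8220;(-2)&#8221; in the p54usb message corresponds to ENOENT on Linux, meaning the driver could not find the firmware file it asked for. A quick way to confirm that the package fixed this is to check whether the file is now present. The sketch below is a hedged illustration, not part of the original fix: firmware_present is a hypothetical helper, and /lib/firmware is assumed to be the firmware directory, which is the usual location but may differ on your system.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if <dir>/<name> exists and is readable
 * by the current user, 0 otherwise (including when the path is too long). */
static int firmware_present(const char *dir, const char *name) {
    char path[512];
    assert(dir != NULL && name != NULL);
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return 0;
    }
    return access(path, R_OK) == 0;
}
```

<p>Calling firmware_present("/lib/firmware", "isl3887usb") before and after installing the package should go from 0 to 1 if the missing file was the cause; the directory name is the assumption to adjust if it does not.</p>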
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/10/backtrack-5-r1-older-wireless-dongles/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Restoring &#8220;Bounce&#8221; Functionality in Mail.app and Lion</title>
		<link>http://afewguyscoding.com/2011/08/restoring-bounce-functionality-mail-app-lion/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=restoring-bounce-functionality-mail-app-lion</link>
		<comments>http://afewguyscoding.com/2011/08/restoring-bounce-functionality-mail-app-lion/#comments</comments>
		<pubDate>Wed, 31 Aug 2011 18:10:15 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[bounce]]></category>
		<category><![CDATA[lion]]></category>
		<category><![CDATA[mac]]></category>
		<category><![CDATA[mail]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=327</guid>
		<description><![CDATA[Download Bouce Message.workflow and place the file in ~/Library/Services/. The &#8220;Bounce&#8221; functionality was one of my most used features of Mail, and it was particularly handy in removing yourself from non-spam, non-automated mailing lists (such as recruiters and people you no longer want to hear from).  This feature was removed in Lion/Mail 5!  While this [...]]]></description>
				<content:encoded><![CDATA[<p>Download <a href='http://blog.afewguyscoding.com/wp-content/uploads/2011/08/Bouce-Message.workflow.zip'>Bouce Message.workflow</a> and place the file in ~/Library/Services/.</p>
<div>
<p>The &#8220;Bounce&#8221; functionality was one of my most used features of Mail, and it was particularly handy in removing yourself from non-spam, non-automated mailing lists (such as recruiters and people you no longer want to hear from).  This feature was removed in Lion/Mail 5!  While this feature probably won&#8217;t ever be coming back, you can restore the functionality yourself in about 30 seconds using AppleScript and Automator.</p>
<p>1. Open Automator.</p>
<p>2. Create a new service.</p>
<p>3. Configure the service so that it receives &#8220;no input&#8221; in &#8220;Mail&#8221;.</p>
<p>4. Drag &#8220;Get Selected Mail Messages&#8221; into the workflow.</p>
<p>5. Drag &#8220;Run AppleScript&#8221; into the workflow.</p>
<p>6. Use the following AppleScript, then save the workflow with a name like &#8220;Bounce Message&#8221;.</p>
<pre><strong>on</strong> <strong>run</strong> {input, parameters}</pre>
<pre>    <strong>tell</strong> <em>application</em> "Mail"</pre>
<pre>        <strong>repeat</strong> <strong>with</strong> eachMessage <strong>in</strong> input</pre>
<pre>            <strong>bounce</strong> eachMessage</pre>
<pre>            <strong>delete</strong> eachMessage # added as a suggestion to be more like the Apple functionality</pre>
<pre>        <strong>end</strong> <strong>repeat</strong></pre>
<pre>    <strong>end</strong> <strong>tell</strong></pre>
<pre><strong>end</strong> <strong>run</strong></pre>
<p>7. In Mail, select the message you would like to bounce.  Then from the &#8220;Mail&#8221; menu, choose &#8220;Services&#8221;, then click on your new service.  The message will bounce.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/08/restoring-bounce-functionality-mail-app-lion/feed/</wfw:commentRss>
		<slash:comments>33</slash:comments>
		</item>
		<item>
		<title>7 Quick Tips Learned While Creating WeddingWare</title>
		<link>http://afewguyscoding.com/2011/06/7-quick-tips-learned-creating-weddingware/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=7-quick-tips-learned-creating-weddingware</link>
		<comments>http://afewguyscoding.com/2011/06/7-quick-tips-learned-creating-weddingware/#comments</comments>
		<pubDate>Thu, 23 Jun 2011 16:43:35 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[ajax]]></category>
		<category><![CDATA[capistrano]]></category>
		<category><![CDATA[chrome]]></category>
		<category><![CDATA[css]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[deploy]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[firefox]]></category>
		<category><![CDATA[form_for]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[internet explorer]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[onsubmit]]></category>
		<category><![CDATA[rack]]></category>
		<category><![CDATA[rails]]></category>
		<category><![CDATA[rails 3]]></category>
		<category><![CDATA[rails ujs]]></category>
		<category><![CDATA[remote]]></category>
		<category><![CDATA[safari]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=292</guid>
		<description><![CDATA[With this blog post, I am pleased to announce that A Few Guys Coding has created a service with a different twist. Wedding customs and traditions are changing all the time, whether it is the music, the attire or even the colors (I once went to a wedding that had black and [...]]]></description>
				<content:encoded><![CDATA[<p>With this blog post, I am pleased to announce that A Few Guys Coding has created a service with a different twist.  Wedding customs and traditions are changing all the time, whether it is the music, the attire or even the colors (I once went to a wedding that had black and purple as the main colors).  I had even heard of online RSVPs.  This software stack, WeddingWare, was born out of necessity.  For my own wedding, 5 months away now, we needed a way for guests to RSVP.  Since we didn&#8217;t want to spend all the money on response cards, we are environmentally friendly (it saves trees!) and I am a software engineer, I suggested that we come up with online RSVPs (the guest will still receive a paper invitation with an invitation code).  After a couple of brainstorming sessions and much, much code, WeddingWare is officially available for license.  While the primary focus is custom wedding RSVPs, the software can also handle shower invitations as well as bachelorette party invitations.  Customization is a main focus of this software because, after all, who wants to go to a URL that looks like this: www.theknot.com/user/123456/mywedding/rsvp.html?  Our software is simple and we will work with clients directly to set up an RSVP site where the look is clean and the URL is custom (for example, ours will be, eventually, at <a href="http://www.davidheartsrachel.com">www.davidheartsrachel.com</a>).  The software has a full-featured backend that allows the party hosts to log in and 1) create new parties and invitations, 2) have multiple events and 3) get metrics on guests that are attending (such as outstanding invitations vs. people who have responded).  Guests will be directed to a friendly, intuitive site that the hosts have fully customized, right down to the color scheme.  After RSVPing, the guest will receive confirmation emails with hotels, driving directions, maps, etc.
The site is mobile browser friendly and we are working on a matching iOS application.  Pricing is still to be determined, but it will be affordable and based on the number of guests. A live demo is available at <a href="http://dhr.afewguyscoding.com">http://dhr.afewguyscoding.com</a>.</p>
<p>Disclaimer: I am not a &#8220;web guy&#8221; by nature. I have recently started experimenting more and more with it because eventually I want to become proficient with web and design. Therefore, these may be simple tips for &#8220;web people&#8221;, but a typical &#8220;desktop application&#8221; person might struggle with them.</p>
<ol>
<li><strong>Debugging in Chrome, Firefox and Safari is easy, debugging in IE is hard</strong>: Chrome, Firefox and Safari all make it extremely easy to debug issues related to HTML, CSS and Javascript (AJAX included!).  Chrome and Safari have built-in &#8220;inspectors.&#8221;  Firefox requires an additional plugin, Firebug, but it is just as useful once you have it.  IE makes this process painful.  However, in IE 8 (I haven&#8217;t verified any version before or after this one), the Tools menu has a &#8220;Developer Tools&#8221; window that attempts to do the same thing as Firebug and the built-in inspectors in Chrome and Safari.  While it is kind of like using rock knives compared to the other tools, it did work and I was able to debug my CSS in IE (Javascript is much harder). Microsoft has a nice writeup on this tool <a href="http://msdn.microsoft.com/en-us/library/dd565627(v=vs.85).aspx" target="_blank">here</a>.  Go ahead and use Photoshop or Illustrator to create your designs; that&#8217;s fine.  However, when it comes time to build the pages, always test first in IE since it has the poorest support.  Chances are that if it works in IE the first time, it&#8217;ll work in the other browsers the first time as well.  Oh yeah, and get <a href="http://www.macrabbit.com/cssedit" target="_blank">CSSEdit</a> (sorry Windows people, you&#8217;re on your own).</li>
<li><strong>RAILS_ENV != RACK_ENV in Rails 3</strong>: I was having a devil of a time getting my staging server to work with my staging database.  Passenger and Apache would use the proper staging files directory, but the app would constantly connect to the production database.  The trick is that in Rails 3 (only!!!) you need to set <code>ENV['RACK_ENV'] = "staging"</code> or <code>ENV['RACK_ENV'] = "production"</code>.  In Rails 2.x.x, you still use RAILS_ENV to tell it which environment to use.</li>
<li><strong>Having a good deployment strategy and capistrano recipe is key</strong>: Many months earlier, I had already put up a plain HTML site (no frameworks, *gasp*) at www.davidheartsrachel.com at a really cheap host, <a href="http://www.ipage.com" target="_blank">iPage</a> (they were $30 for the year and believe me, you get what you pay for, but that was OK because it was just text and images). Everything was working just fine there while I was developing this Rails app. Initially (before I switched to Rails from Google App Engine), I thought that I would host the application at some place like <a href="http://www.heroku.com" target="_blank">Heroku</a>, since their deployment looks dead simple. As it turned out, it would be much cheaper, by about $40 per month, to use my primary host (the one this site is using) and just buy some additional RAM for my VM to support multiple Passenger instances. Anyway, I knew that I would need to start testing outside of a development environment. In addition to the standard <em>development, test </em>and <em>production</em> environments given to you in Rails, I converted the <em>test </em>environment into a local version of a <em>production </em>environment (class caching, actual mail delivery, etc.); however, I kept my database adapter as SQLite. I also created an additional environment called <em>staging</em>. The theory behind this is that the app would run there exactly as it would in production (using MySQL, class caching, sessions stored via :mem_cache, using delayed_job to deliver mail, etc.) but with test fixtures and data in the database. Obviously <em>production</em> would be the production environment with live data. This allowed me to iron out any bugs that might crop up when switching to a more production-like environment and catch any assumptions I had made in development that wouldn&#8217;t be valid in production (when switching adapters, etc.).
I dedicated a subdomain to deploying my staging environment and password protected it. If you are interested in my Capistrano file, environments file and Apache vhost file, they are <a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/06/capistrano-env.zip">here</a> (with the obvious implementation and sensitive path/password redactions).</li>
<li><strong>Radio buttons</strong>: I was making some AJAX calls that should have selected radio buttons in the same group when the call returned, based on the :success or :error AJAX callback. I was having trouble getting this to happen, but this code is solid:<br />
<code>// this example had a yes/no. it can be extended to any number of radio buttons<br />
var radios = $('input:radio[name=radio_group_name]');<br />
radios.filter('[value=val1]').attr('checked', false);<br />
radios.filter('[value=val2]').attr('checked', true);</code></li>
<li><strong>Rails UJS</strong>: Rails 3 embraces a more MVC way of thinking by abstracting the Javascript choice (whether it is jQuery, Prototype, etc.). Before, the implementation was fairly framework-specific and you sometimes had to mix specific JS code into your Ruby code. The UJS implementation in Rails 3 takes advantage of the new HTML5 data-* attributes, so Rails doesn’t spit out Prototype-based Javascript inline (the old helpers are still there). Rather, the helpers just define an appropriate data attribute in the tag, and that’s where Javascript comes in. Obviously this is a more flexible and reusable pattern. You can also plug and play different frameworks without the headache of writing a lot of the same code over again (I swapped Prototype for jQuery in several minutes; just remember to get the new Rails framework Javascript files, such as rails.js, for jQuery or you&#8217;ll have weird errors that are hard to track down.  Check out <a href="http://rubygems.org/gems/jquery-rails" target="_blank">jquery-rails</a>.). Since I am far from a RoR or Javascript expert, <a href="http://joshhuckabee.com/jquery-rails-3">this</a>, <a href="http://therailworld.com/posts/26-Using-Prototype-and-JQuery-with-Rails3-UJS-">this</a> and <a href="http://net.tutsplus.com/tutorials/javascript-ajax/using-unobtrusive-javascript-and-ajax-with-rails-3/">this</a> have a more expert breakdown (why re-invent the wheel).</li>
<li><strong>Remote form_for (AJAX) and :onsubmit()</strong>: In Rails 3 (and possibly Rails 2), form_for with :remote =&gt; true overrides the :onsubmit =&gt; 'func();' handler in order to do the actual form submission.  If you want to run something before your form gets submitted (or during or after!), bind to the form using jQuery.bind() and then use the AJAX callback functions to do what you need.<br />
<code>&lt;script type="text/javascript"&gt;<br />
  function get_address_information() {<br />
    // check the jQuery docs for all the possible callbacks. there are also success and error. complete gets called after both success and error<br />
    var pars = 'param1=x&amp;param2=y&amp;param3=z';<br />
    $.ajax({ type: "POST",<br />
      url: "/get_invite_address",<br />
      data: pars, dataType: 'script',<br />
      beforeSend: function() {<br />
        // do your before send checking here<br />
      },<br />
      complete: function(data, status) {<br />
        // do some post processing here<br />
      }<br />
    });<br />
  }<br />
&lt;/script&gt;</code></li>
<li><strong>Rails is as good as everyone says (and other languages/frameworks are as bad as you think)</strong>: Truly. The only drawback is the lack of internationalization. However, in terms of use, the language, setup and data access, Rails is superior in every way. Rails sets you up for testing (as well as performance testing) from the word go. The framework allows you to concentrate on coding while it does the heavy lifting.</li>
</ol>
<p>Report card from <a href=".pardontheinformation.com%2F2008%2F09%2Fjava-vs-ruby-on-rails-it-is-dead-heat.html">here</a></p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/06/report.jpg" rel="lightbox[292]"><img class="aligncenter size-full wp-image-312" title="report" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/06/report.jpg" alt="" width="830" height="485" /></a><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/06/rails-books-v-java-books.jpg" rel="lightbox[292]"><img class="aligncenter size-full wp-image-311" title="rails-books-v-java-books" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/06/rails-books-v-java-books.jpg" alt="" width="500" height="412" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/06/7-quick-tips-learned-creating-weddingware/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Survey on Resource Management and Allocation Techniques in Autonomic Distributed Systems</title>
		<link>http://afewguyscoding.com/2011/03/survey-resource-management-allocation-techniques-autonomic-distributed-systems/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=survey-resource-management-allocation-techniques-autonomic-distributed-systems</link>
		<comments>http://afewguyscoding.com/2011/03/survey-resource-management-allocation-techniques-autonomic-distributed-systems/#comments</comments>
		<pubDate>Thu, 31 Mar 2011 20:11:14 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[allocation]]></category>
		<category><![CDATA[autonomic computing]]></category>
		<category><![CDATA[capacity planning]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[multi-tier systems]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[requests]]></category>
		<category><![CDATA[sessions]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=263</guid>
		<description><![CDATA[Abstract— With the Internet becoming a major resource for many individuals, businesses, universities and countries, there is an increasing amount of traffic on the Internet.  Some applications available on the Internet need to be able to provide stringent performance guarantees from their web servers (such as online securities trading).  Due to the fact that Internet [...]]]></description>
				<content:encoded><![CDATA[<p>Abstract— <strong>With the Internet becoming a major resource for many individuals, businesses, universities and countries, there is an increasing amount of traffic on the Internet.  Some applications available on the Internet need to be able to provide stringent performance guarantees from their web servers (such as online securities trading).  Due to the fact that Internet traffic is bursty and unpredictable, and the interactions between hardware and software in multi-tier systems are complex, there needs to be a large degree of automation in capacity planning</strong>.</p>
<p>Index Terms—<strong>capacity planning, multi-tier systems, requests, sessions, performance, autonomic computing, distributed systems</strong></p>
<h2>I. Introduction</h2>
<p>The Internet has become an important and popular channel for a variety of services, such as news aggregation, online shopping, social networking and financial brokerage. Many of the popular services on the Internet today, such as Facebook, Twitter, Amazon and eBay [2, 5, 9], are built on a generic multi-tier architecture.  Requests generated by users flow through each layer in this architecture.  Each tier in this system provides certain functionality to the following tier by executing part of the incoming request [5].  This multi-tier system includes 1) a web tier running a web server, such as Apache, to respond to incoming requests, 2) an application tier running an application container, such as Tomcat, which hosts an application and 3) a backend database tier running database software such as MySQL.</p>
<p style="text-align: center;"><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/tiers.png" rel="lightbox[263]"><img class="size-full wp-image-269" title="Multi-tier web application" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/tiers.png" alt="" width="446" height="113" /></a></p>
<p style="text-align: center;"><em>Tiered Application [2]</em></p>
<p>Each of these tiers may be a single machine or may be distributed over a cluster of nodes.  In addition, these servers may be virtualized, such that multiple applications can be hosted and run on the same physical node.  Often, to ensure acceptable performance, especially in application hosting, service level agreements (SLAs) are put into place to govern the desired system performance.  An SLA can specify one or multiple parameters, such as required total uptime of a system, total throughput, average end-to-end delay of requests within a certain percentile, etc.  The SLA target is maintained by adding resources to each tier as load grows, and resources can be removed when they are no longer needed (without violating the SLA).</p>
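<p>As a minimal sketch of how a delay-based SLA term could be checked, the following Python computes a nearest-rank percentile of measured end-to-end delays and compares it against a target.  The function names, the sample delays and the 500 ms target are illustrative assumptions and are not taken from any of the surveyed systems.</p>

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(delays_ms, pct, target_ms):
    """True when the pct-th percentile of end-to-end delay is within the (assumed) target."""
    return target_ms >= percentile(delays_ms, pct)


# Hypothetical end-to-end delays in milliseconds for ten requests.
sample_delays = [120, 95, 310, 150, 101, 99, 180, 2200, 130, 110]
print(percentile(sample_delays, 90))
print(meets_sla(sample_delays, 90, 500))
```

<p>A percentile is used rather than a plain mean because a single slow request (the 2200 ms sample here) would otherwise dominate the result.</p>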
<p>Traffic on a computer network is inherently non-uniform and hard to predict, mainly due to the “think” time of the end user.  This “bursty” behavior is characterized by short, uneven spikes of peak congestion in the life of an application [4].  Popular services can experience dynamic and varying workloads that depend on popularity, time or date, or general demand.  In addition, Internet flash crowds can cause bursty and unpredictable workloads in each of these tiers.  For example, during the November 2000 holiday season, Amazon.com experienced forty minutes of downtime due to an overload [5].  These flash crowds can cause a significant deviation from the normal traffic profile of an application, affecting its performance.</p>
<p>The performance of the system can be characterized by the total measured end-to-end delay of these incoming requests and the workload they place on each tier.  Because the interactions between the software and hardware in each of these tiers can be quite complex, allocating sufficient resources (CPU, disk, network bandwidth, etc.) to each tier so that none saturates and becomes a bottleneck is often a difficult, lengthy and potentially error-prone process for human operators to model and estimate.  For example, if the performance of an application tier dominates the performance of the system, that tier becomes the bottleneck.  It is non-trivial to model the process (scheduling, algorithms, memory, I/O and CPU times) and the time it takes to execute the business logic in this tier, especially if resources are shared, as is common in distributed systems.  In addition to these problems, static allocation does not address the issue of Internet flash crowds and the potential they have to overload the system and jeopardize SLA compliance.</p>
<p>Capacity planning and autonomic resource management play an important role in modern data centers.  Service differentiation (Quality of Service) and performance isolation help these complex systems adapt to changing workloads within the system as a whole and within each tier.  In this paper, we present a survey of the large variety of models, techniques and results of autonomic resource management in large-scale, multi-tier systems.</p>
<h2>II. Network Traffic Characteristics</h2>
<p>To truly understand the reason for capacity planning, and the difficulty of modeling the interactions between incoming requests and the hardware and software that service them, we have to understand network traffic characteristics.  Incoming session and request rates tend to fluctuate based on a variety of different factors.  These varying workloads can be characterized as “bursty”, which is defined by [4] as short, uneven spikes of peak congestion during the lifetime of the system.  These traffic patterns deviate significantly from the average traffic arrival rates.</p>
<p>One way to characterize burstiness, as seen in [4], is to use the variable <em>I</em> to represent the <em>index of dispersion</em>.  The index of dispersion is used to measure whether a set of observed occurrences is clustered or dispersed compared to a standard model.  When <em>I </em>is larger, the observed occurrences of bursty traffic are more dispersed.  We can see in <em>Figure</em> <em>1</em> that burstiness can aggregate into different periods represented by different indices of dispersion.</p>
<p>&nbsp;</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/index-dispersion-high.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-264" title="index-dispersion-high" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/index-dispersion-high.png" alt="" width="205" height="170" /></a><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/index-dispersion-low.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-265" title="index-dispersion-low" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/index-dispersion-low.png" alt="" width="195" height="171" /></a></p>
<p style="text-align: center;"><em> Figure 1</em></p>
<p>A simple way to calculate the index of dispersion is as follows:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/idx-dispersion-eq.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-270" title="idx-dispersion-eq" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/idx-dispersion-eq.png" alt="" width="154" height="73" /></a></p>
<p>where <em>N<sub>t</sub></em> is the number of requests completed in a time window of <em>t</em> seconds, where <em>t </em>seconds are counted ignoring server idle time, <em>Var(N<sub>t</sub>)</em> is the variance of the number of completed requests and <em>E[N<sub>t</sub>]</em> is the mean service rate during busy periods.  An astute reader can deduce that as <em>t</em> becomes large, if there are no bursty periods within <em>t</em>, the index of dispersion is low.</p>
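<p>To make the definition concrete, the following Python sketch estimates <em>I</em> as <em>Var(N<sub>t</sub>)</em> divided by <em>E[N<sub>t</sub>]</em> from a list of per-window completion counts.  It is only a finite-sample estimate at one fixed window size, and the function name and the sample counts are assumptions made for illustration.</p>

```python
from statistics import fmean, pvariance


def index_of_dispersion(completions_per_window):
    """Estimate I = Var(N_t) / E[N_t] from request counts observed in equal busy-time windows."""
    mean = fmean(completions_per_window)
    if mean == 0:
        raise ValueError("no completed requests in any window")
    return pvariance(completions_per_window) / mean


# Hypothetical counts: the same 20 requests, spread evenly vs. packed into one window.
steady = [5, 5, 5, 5]
bursty = [0, 0, 20, 0]
print(index_of_dispersion(steady))
print(index_of_dispersion(bursty))
```

<p>Both workloads complete the same total number of requests, yet the evenly spread one has no variance at all while the clustered one yields a much larger estimate, which is the behavior the index is meant to capture.</p>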
<p>Now that bursty periods can be quantified via <em>I</em>, what causes burstiness?  [4] suggests that “burstiness in service times can be a result of a certain workload combination” and “burstiness in service times can be caused by a bottleneck switch between tiers and can be a result of “hidden” resource contention between the transactions of different types and across different tiers.”  That is, one server or tier may be lightly loaded during a particular period, but may become saturated in other periods where a large number of requests are processed.</p>
<p>This hidden resource contention and different types of transactions across different tiers can be extremely difficult to model and even more troublesome for human operators to correctly provision for.  Autonomic provisioning of system resources is needed to optimize service metrics.</p>
<h2>III. Self-Management</h2>
<p>As previously mentioned, the major motivation behind autonomic resource management and capacity planning is to reduce the human involvement in these activities due to their difficulty.  If the systems are able to plan for the future and react to the present without operator input, it will take the load off of system administrators and more accurately respond to the situation at hand.  In order to do this, one has to define the goals and properties of such an autonomic system.  The main properties of an autonomic system, as cited in [1] are:</p>
<ul>
<li>Self-configuration: autonomic systems configure themselves based on high level goals, specifying what is desired (SLAs), not how to achieve the goals.  Systems with self-configuration can install and set themselves up.</li>
<li>Self-optimization: autonomic systems can optimize their use of resources either proactively or reactively in an attempt to improve service and meet SLAs.</li>
<li>Self-healing: autonomic systems can detect and diagnose problems at both high and low levels (software and hardware).  Autonomic systems should also attempt to fix the problem.</li>
<li>Self-protecting: autonomic systems should protect themselves from attacks but also from trusted users trying to make changes.</li>
</ul>
<p>One particularly important feature of an autonomic system is the ability to exhibit “proactive” behavior.  For example, software or hardware agents should be:</p>
<ul>
<li>Autonomic: “Agents operate without direct intervention of humans and have some kind of control over their actions and internal state [1].”</li>
<li>Social: “Agents interact with other agents via some agent-communication languages [1].”</li>
<li>Reactive and proactive: “Agents are able to perceive their environment and respond in a timely fashion.  Agents do not simply act in response to their environment, but they are able to exhibit goal-directed behavior in taking the initiative [1].”</li>
</ul>
<p>In order to model these properties, IBM suggested a general framework for the autonomic control loop, MAPE-K, as detailed in [1].  MAPE-K allows for clear boundaries in the model to classify the work that is taking place in the autonomic system.  In MAPE-K, there are managed elements, which represent any software or hardware resource that is managed by the autonomic agents.  Sensors collect information about managed elements and provide input to the MAPE-K loop so that the autonomic manager can execute changes.  Typically the managers are additional software components configured with high-level goals, and the implementation of those goals is left up to the autonomic manager.  Effectors actually carry out the changes to the managed elements (these changes can be fine or coarse grained).  Each of these elements is discussed in the following sections.</p>
<p>In the MAPE-K loop, there are five different components: monitor, analyze, plan, execute and knowledge.  Monitoring involves “capturing properties of the environment (physical or virtual) that are important to the self properties of the autonomic system [1].”  This information is captured from sensors in the system in two different ways: passively (using built-in tools to run an analysis on the system) or actively (engineering the software to monitor and improve performance).</p>
<p>In the planning phase of MAPE-K, the manager uses the monitoring data from the sensors to produce a series of changes to one or more managed elements.  These changes can be calculated by having the autonomic system keep state information about the managed elements and data so that adaptation plans can be adjusted over time.  An additional benefit of keeping state information is that the system can maintain an architectural model that mirrors the actual system, so proposed changes can first be applied to the model to verify that system integrity remains intact before and after the change.  If any violations occur after applying the changes to the model, the changes can be aborted or rolled back to avoid damage or downtime to the system.</p>
<p>In the knowledge part of the model, the knowledge used to effect adaptation can come from humans, logs, sensor data, or day-to-day observation of a system&#8217;s behavior [1].  In this part of the model, there is ample room to use machine learning to acquire knowledge about the system.  While [1] suggested reinforcement learning and Bayesian techniques, other authors suggest K-Nearest Neighbors (KNN) [2] and neural networks with fuzzy control [7].  This author would suggest that decision trees could be an effective method of acquiring the knowledge to effect the proposed plans.  Also, clustering could be used as a way of identifying and classifying previously similar plans to see if they were successful or if they resulted in failure.</p>
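<p>The phases described above can be sketched as one pass of a control loop.  The Python below is a deliberately small illustration, not IBM&#8217;s reference design: the managed element is a dictionary holding a server count, the sensor data is a list of request delays, and the single rule of adding one server when the average delay exceeds a target is an assumption chosen only to keep the example short.</p>

```python
def monitor(sensor_readings):
    """Monitor: summarize raw sensor samples (here, request delays in ms)."""
    return {"avg_delay_ms": sum(sensor_readings) / len(sensor_readings)}


def analyze(metrics, knowledge):
    """Analyze: compare the observation against the high-level goal stored in knowledge."""
    return metrics["avg_delay_ms"] > knowledge["target_delay_ms"]


def plan(violation, knowledge):
    """Plan: propose a change to the managed element (assumed rule: add one server)."""
    if violation:
        return {"servers": knowledge["servers"] + 1}
    return {}


def execute(change, managed_element, knowledge):
    """Execute: the effector applies the change and the knowledge base records the outcome."""
    managed_element.update(change)
    knowledge["servers"] = managed_element["servers"]
    knowledge["history"].append(dict(managed_element))


def mape_k_iteration(sensor_readings, managed_element, knowledge):
    """Run monitor, analyze, plan and execute once over shared knowledge."""
    metrics = monitor(sensor_readings)
    violation = analyze(metrics, knowledge)
    change = plan(violation, knowledge)
    execute(change, managed_element, knowledge)
    return managed_element


# Hypothetical tier with two servers and a 200 ms average-delay goal.
tier = {"servers": 2}
knowledge = {"target_delay_ms": 200, "servers": 2, "history": []}
print(mape_k_iteration([250, 300, 190], tier, knowledge))
```

<p>The history list stands in for the state information mentioned above; a learning component would read from it instead of relying on the fixed rule.</p>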
<p>Finally, there are several different levels of automaticity including: <em>Basic, Managed, Predictive, Adaptive and Fully Autonomic</em>.</p>
<p>Now that we have seen a general framework for automated resource management, let’s continue to explore each component.</p>
<ol>
<li>
<h3>Managed Elements</h3>
</li>
</ol>
<p>Managed elements in an autonomic system consist of all the hardware and software elements that can be managed by autonomic agents.  Because multi-tier systems can be very complex, there are multiple levels of detail at which one can view the system.</p>
<p>The first and highest level is where the system is viewed as a black box.  At this level, we consider only the end-to-end delay between the request entering the system and the response returning to us.  If there is congestion or delay caused by insufficient capacity at any tier, we are unable to know which tier’s capacity is causing the problem. Typically, changes in allocation to capacity are fairly coarse-grained at this level.  The authors of [8] plan their allocation strategy at this level.</p>
<p>At the next level down, we no longer monitor system performance by a single delay metric.  We are now able to monitor the performance of individual tiers.  When congestion occurs at a single tier, we are able to target that tier and increase the allocation capacity of just that tier.  However, one must be careful not to trigger downstream allocations with the increased capacity at the original tier.  The authors of [5] plan their allocation strategy at this level.</p>
<p>Lastly, at the most fine-grained level, we are able to collect statistics on individual components within a tier such as CPU usage for an application on a particular node.  <em>Figure 2</em> shows an example of the black-box and per-tier paradigms.</p>
<p>&nbsp;</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/req-per-tier.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-271" title="req-per-tier" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/req-per-tier.png" alt="" width="388" height="320" /></a></p>
<ol>
<li>
<h3>Autonomic Managers and Controllers</h3>
</li>
</ol>
<p>In the MAPE-K loop, the monitoring, analyzing and planning phases are done in the control plane or autonomic manager.  The manager should adjust managed elements in the following fashion: [2]</p>
<ul>
<li>Prompt. It should sense impending SLA violations accurately and quickly and trigger resource allocation requests as soon as possible to deal with load increases (or decreases).</li>
<li>Stable. It should avoid unnecessary oscillations in provisioning resources (adding and removing).</li>
</ul>
<p>Keeping the two guidelines in mind, there are two important decisions that every autonomic manager will be making: 1) when to provision resources and 2) when a provisioning decision is made, how much of a particular resource should be provisioned.</p>
<h4 style="text-align: center;"><em>When To Provision</em></h4>
<p>When considering the question of “when to provision”, one must consider the timescale that is being evaluated.  There are two main methods of provisioning: predictive and reactive.  Predictive provisioning attempts to stay ahead of the workload, while reactive provisioning follows to correct for short-term fluctuations.</p>
<p>In predictive provisioning, the autonomic system is planning increased (or decreased) capacity for the future.  When evaluating real-world workloads, typically the system or application will see a peak in traffic during the middle of the day and a minimum during the middle of the night [5].  Other factors, such as recent news (breaking important stories) or seasons (holiday shopping) can affect this distribution of traffic.  By using predictive analysis, automated systems are able to provision sufficient capacity well in advance of any potential events that would cause SLA violations.</p>
<p>Prediction can be done via a number of different methods including statistical analysis, machine learning, or simply using past observations of workload.  This prediction is usually implemented in the control plane or autonomic manager of the system [2, 5, 6, 7, 8].  For example, in [2], a scheduler for each application monitors the application database tier and collects metrics about system and application performance, such as average query throughput, average number of connections, read/write ratios, and system statistics such as CPU, memory and I/O usage.  These metrics are reported back to the autonomic manager, and an adaptive filter predicts the future load based on the currently measured load.  Each of these metrics is weighted to reflect the usefulness of that feature.  In addition, a KNN classifier determines if an SLA is broken and redirects the resource allocation to adjust the number of databases such that the tier is no longer in violation.  A resource allocation component decides how to map the resources dynamically.  [7] uses a self-adaptive neural fuzzy controller to decide upon the allocation of resources.  Moreover, [6] uses a model estimator, which automatically learns online a model of the relationship between an application’s resource allocation and its performance, and an optimizer, which predicts the resource allocation required to meet performance targets.  Lastly, in [5], the authors use a histogram of request arrival rates for each hour over several days.  Using that data, the peak workload is estimated as a high percentile of the arrival rate distribution for that hour.  By using these metrics, the application is able to predict shifting workloads in a sliding window.</p>
<p>Most of the papers surveyed use the error in prediction e(k) (that is, the difference between the predicted workload or arrival rate λ<sub>pred</sub>(t) and the observed workload or arrival rate λ<sub>obs</sub>(t) for a time period t) or the change in error in prediction Δe(k) = e(k) − e(k − 1) as input to their algorithms to help determine the next period’s allocation [5, 6, 7].  Since workloads and arrival rates often exhibit bursty overloads [4, 5], these parameters help fine-tune the prediction model.</p>
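<p>The histogram-based estimate described for [5] can be sketched as follows.  This is a minimal illustration rather than the authors’ implementation: the function name <em>predict_peak_rates</em>, the nearest-rank percentile method and the sample data are all assumptions made for this example.</p>

```python
import math
from collections import defaultdict


def predict_peak_rates(observations, percentile=0.95):
    """Estimate the peak arrival rate for each hour of the day.

    observations: iterable of (hour_of_day, arrival_rate) pairs collected
    over several days (hypothetical input format).
    Returns a dict mapping hour -> arrival rate at the given percentile.
    """
    by_hour = defaultdict(list)
    for hour, rate in observations:
        by_hour[hour].append(rate)

    peaks = {}
    for hour, rates in by_hour.items():
        ordered = sorted(rates)
        # Nearest-rank percentile: the smallest sample such that at least
        # `percentile` of the samples are at or below it.
        rank = max(1, math.ceil(percentile * len(ordered)))
        peaks[hour] = ordered[rank - 1]
    return peaks


# Four days of made-up measurements for noon and 3 a.m.
history = [(12, 100.0), (12, 120.0), (12, 110.0), (12, 400.0),
           (3, 10.0), (3, 12.0), (3, 9.0), (3, 11.0)]
peaks = predict_peak_rates(history, percentile=0.75)
```

<p>Choosing a high percentile instead of the maximum keeps a single outlier day (the 400 above) from dictating the provisioned capacity for that hour.</p>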
<p>Proactive provisioning alone is not enough to make the system robust and immune to SLA violations.  For example, there may be errors in prediction if the workload or arrival rate deviates greatly from previous days.  As mentioned earlier, Internet flash crowds can spike network requests and have the potential to cause congestion and overload the system due to their bursty, unpredictable nature.  In these cases, the predictions would lag behind the actual events and there would be insufficient capacity.</p>
<p>Reactive provisioning can be used to quickly correct for these unpredicted events.  Reactive provisioning is used to plan on a shorter time scale, perhaps on a several-minute basis.  If anomalies are detected, reactive provisioning can quickly allocate additional capacity to the affected tiers so they don’t become a bottleneck.</p>
<p>In [5], the authors implement reactive provisioning by comparing the predicted session arrival rate λ<sub>pred</sub>(t) and the observed arrival rate λ<sub>obs</sub>(t) for a time period t.  If these two measurements differ by more than a threshold, they take corrective provisioning action.  In [2], the resource manager monitors the average latency received from each workload scheduler during each sampling period.  The resource manager uses a smoothed latency average computed as an exponentially weighted mean of the form WL = α × L + (1 − α) × WL, where L is the current query latency.  The larger the α parameter, the more responsive the average is to the current latency.  In [6], the authors attempt to reactively provision limited resources by using an auto-regressive moving-average (ARMA) model, where two parameters a<sub>1</sub>(k) and a<sub>2</sub>(k) capture the correlation between the application’s past and present performance and b<sub>0</sub>(k) and b<sub>1</sub>(k) are vectors of coefficients capturing the correlation between the current performance and the recent resource allocations.  Note that if the spike in traffic is large, it may require several rounds of reactive provisioning to get the capacity to an acceptable level.</p>
<p>In [8], the authors consider provisioning resources based on “what-if” analysis.  They argue that most web applications consist of services that form a directed acyclic graph, which can be formed into a decision tree.  In their model, they ask each tier to predict, online, its future performance in the event it received an additional capacity unit or had one capacity unit removed.  The performance model in [8] uses the following equation to calculate response time:</p>
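<p>Two of the reactive checks above are simple enough to sketch directly.  The smoothing function follows the update WL = α × L + (1 − α) × WL quoted from [2].  The threshold test is only an assumed reading of [5]: the text says the rates must “differ by more than a threshold”, and this sketch interprets that as a relative difference.  Function and parameter names are hypothetical.</p>

```python
def smooth_latency(previous_wl, latency, alpha):
    """One update step of WL = alpha * L + (1 - alpha) * WL."""
    return alpha * latency + (1 - alpha) * previous_wl


def needs_correction(predicted_rate, observed_rate, threshold):
    """Return True when the observed arrival rate deviates from the
    prediction by more than `threshold`.

    Assumption: the deviation is measured relative to the predicted rate.
    """
    deviation = abs(observed_rate - predicted_rate) / predicted_rate
    return deviation > threshold


# A latency spike of 200 against a running average of 100:
responsive = smooth_latency(100.0, 200.0, alpha=0.5)    # follows the spike
damped = smooth_latency(100.0, 200.0, alpha=0.25)       # reacts more slowly
```

<p>With the larger α the average jumps halfway to the new sample, while the smaller α moves only a quarter of the way, which is the responsiveness trade-off described above.</p>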
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/response-time.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-272" title="response-time" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/response-time.png" alt="" width="321" height="128" /></a></p>
<p>where <em>R</em><em><sub>server</sub></em> is the average response time of the service, <em>n</em> is the number of CPU cores assigned to the service, λ is the average request rate and <em>S</em><em><sub>server</sub></em> is the mean service time of the server.  When a threshold of service time is exceeded, re-computation of service time occurs.  These predictions are given to the parent node.  Using these predictions, provisioning agents negotiate resources with each other based on maximum performance gain and minimum performance loss.  The root node selects which services to provision across the tree when the SLA is about to be violated, or which to de-provision if resources can be removed without causing an SLA violation. Furthermore, in [8], the authors also consider how provisioning cache instances using the previously described “performance promises” would affect workloads in a particular tier and all children tiers due to an increased hit rate.  <em>Figure 3</em> from [8] illustrates the decision process:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/per-node-gain.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-276" title="per-node-gain" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/per-node-gain.png" alt="" width="512" height="244" /></a></p>
<p style="text-align: center;"><em>Figure 3</em></p>
<p>We can see that service <em>I</em> has <em>k</em> immediate children services and it aggregates its own performance promises as follows:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/max-min-gain-loss.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-274" title="max-min-gain-loss" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/max-min-gain-loss.png" alt="" width="512" height="94" /></a></p>
<p>Lastly, while the majority of the papers reviewed were concerned with provisioning additional capacity, [2] also considered removing unneeded capacity.  If the average latency is below a low-end threshold, the resource manager triggers a resource removal.  The system then performs a temporary removal of the resource.  If the average latency remains below the low threshold, the resource is permanently removed.  The reason that a temporary removal is performed first is that mistakenly removing a resource is potentially a costly operation if it negatively impacts the system performance.  The main motivation behind this logic is that unused resources are wasted if they are being under-utilized in the system.</p>
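<p>The two-step removal logic described for [2] can be sketched as a small decision function.  This is an illustration under assumptions: the function name, the string return values and the idea of passing in two pre-computed latency averages are all hypothetical, not taken from the paper.</p>

```python
def removal_decision(avg_latency_before, avg_latency_during_trial, low_threshold):
    """Decide what to do with a possibly unneeded resource.

    avg_latency_before: average latency with the resource in place.
    avg_latency_during_trial: average latency measured while the resource
        is temporarily removed.
    Returns "keep", "restore" or "remove".
    """
    # Step 1: only start a trial when latency is already below the
    # low-end threshold.
    if avg_latency_before >= low_threshold:
        return "keep"
    # Step 2: the resource has been temporarily removed. If latency rose
    # to or above the threshold, the removal was a mistake and is undone.
    if avg_latency_during_trial >= low_threshold:
        return "restore"
    # Latency stayed low without the resource: remove it permanently.
    return "remove"
```

<p>The trial step is what makes a wrong guess cheap: a resource is given up for good only after the system has demonstrated that it can do without it.</p>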
<p>When comparing the effectiveness of reactive and proactive provisioning, <em>Figure 4</em> from [5] shows that proper proactive provisioning can greatly reduce the time spent in violation of the SLA.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/sla-time-violation.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-273" title="sla-time-violation" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/sla-time-violation.png" alt="" width="421" height="313" /></a></p>
<p style="text-align: center;"><em>Figure 4</em></p>
<p>In addition to handling smoothly increasing workloads, the predictive provisioning techniques from [5] can also handle sudden load bursts effectively, as shown in <em>Figure 5</em>.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/reduced-sla-violation.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-278" title="reduced-sla-violation" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/reduced-sla-violation.png" alt="" width="428" height="314" /></a></p>
<p style="text-align: center;"><em>Figure 5</em></p>
<p>By combining predictive and reactive provisioning, systems are able to have sufficient capacity to handle predictable workloads as well as short-term instabilities.</p>
<h4 style="text-align: center;"><em>How Much To Provision</em></h4>
<p>The question of “how much of a given resource to provision” is less straightforward than “when to provision.”  As we stated earlier, it may be necessary to go through several rounds of provisioning before the end-to-end delay is at an acceptable level.  Moreover, depending on the implementation and type of controller you use, the model determining “how much” could be different.  However, all provisioning algorithms attempt to provision enough to meet the SLA, within a time period t, without overprovisioning, because that would be a waste of resources.  Additionally, one critical factor to consider and try to avoid is that increasing the provisioning at a particular tier k might create a downstream bottleneck at tier k + n, assuming n tiers.  Several authors in the survey explore how to avoid this phenomenon.</p>
<p>The authors in [5] present a general framework for determining provisioning needs based on average end-to-end latency.  Suppose a system has N tiers, denoted by T<sub>1</sub>, T<sub>2</sub>,…T<sub>N</sub>, and let R denote the desired end-to-end delay.  Suppose further that the end-to-end response times are broken down into per-tier response times denoted by d<sub>1</sub>, d<sub>2</sub>,…d<sub>N</sub>, such that Σ d<sub>i</sub> = R.  Lastly, assume that the incoming session rate is λ.  Typically, one would want to provision the system for the worst-case scenario, that is, the peak of λ.  Individual server capacity can be modeled using G/G/1 first come, first serve (FCFS), non-preemptive queues, where each request in the queue has a certain amount of work to do.</p>
<p>Assuming the service rate of the queue is μ, then ρ = λ ⁄ μ is the utilization of the server.  If ρ is less than 1, then the queuing delay, that is, the amount of time the request waits in the queue to be serviced, is bounded and finite.  If ρ is equal to 1, the queue length is infinite but queuing delay is only infinite if the inter-arrival times of requests are not deterministic.  Lastly, if ρ is greater than 1, then the queuing delay is infinite.  The model yields useful quantities, such as response time (the queuing delay plus the time it takes to execute the request).  This behavior can be modeled in the queuing theory result [9]:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/queue-theory.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-277" title="queue-theory" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/queue-theory.png" alt="" width="215" height="85" /></a></p>
<p>where d<sub>i</sub> is the average response time for tier i, s<sub>i</sub> is the average service time for a request at that tier and λ<sub>i</sub> is the request arrival rate at tier i.  σ<sub>a</sub><sup>2</sup> and σ<sub>b</sub><sup>2</sup> are the variance of inter-arrival time and the variance of service time, respectively, which can be monitored online.  Using this equation, we can obtain a lower bound on the request rate for server i.  Assuming a think-time of Ζ, requests are issued at a rate of (1 / Ζ) [5].  Therefore, the number of servers η<sub>i</sub> needed at tier i to service a peak request rate can be computed as:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/peak-req-rate.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-275" title="peak-req-rate" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/peak-req-rate.png" alt="" width="158" height="77" /></a></p>
<p>where β<sub>i</sub> is a tier specific constant, τ is average session duration, λ<sub>i</sub> is the capacity of a single server and λ is the request arrival rate.  The authors in [5] assume that the servers in each tier are load balanced, although other models do exist that explore provisioning without load balancing, such as [10].</p>
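<p>The sizing step can be sketched in code.  The exact equations are the ones shown in the figures above; the forms used here are an assumed reading of them: a single-server queuing bound that uses both variances, λ<sub>i</sub> ≥ 1 / (s<sub>i</sub> + (σ<sub>a</sub><sup>2</sup> + σ<sub>b</sub><sup>2</sup>) / (2 (d<sub>i</sub> − s<sub>i</sub>))), and a server count of the form ⌈β<sub>i</sub> λ τ / (λ<sub>i</sub> Ζ)⌉, which follows from Little’s law (λτ concurrent sessions, each issuing 1/Ζ requests per unit time).  All function names and sample values are hypothetical.</p>

```python
import math


def server_capacity(d_i, s_i, var_interarrival, var_service):
    """Request rate one server can sustain while keeping its average
    response time at or below d_i (assumed form of the queuing bound)."""
    queuing_term = (var_interarrival + var_service) / (2.0 * (d_i - s_i))
    return 1.0 / (s_i + queuing_term)


def servers_needed(beta_i, peak_session_rate, session_duration,
                   think_time, capacity_per_server):
    """Number of servers for tier i at the peak session arrival rate.

    Little's law gives peak_session_rate * session_duration concurrent
    sessions; each issues one request per think_time, and beta_i scales
    the result for tier i.
    """
    tier_request_rate = (beta_i * peak_session_rate * session_duration
                         / think_time)
    return math.ceil(tier_request_rate / capacity_per_server)


# Example: 0.5 s response-time budget, 0.1 s mean service time.
capacity = server_capacity(d_i=0.5, s_i=0.1,
                           var_interarrival=0.04, var_service=0.04)
count = servers_needed(beta_i=2.0, peak_session_rate=10.0,
                       session_duration=300.0, think_time=5.0,
                       capacity_per_server=110.0)
```

<p>In the second call, 10 sessions per second lasting 300 seconds give 3000 concurrent sessions and 600 requests per second; β<sub>i</sub> = 2 doubles that to 1200 at this tier, which rounds up to 11 servers of capacity 110.</p>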
<p>In [7], the authors use a neural fuzzy controller with four layers to determine how much of a particular resource should be (re)allocated (see <em>Figure 6</em>).</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/neural-fuzzy.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-266" title="neural-fuzzy" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/neural-fuzzy.png" alt="" width="277" height="238" /></a></p>
<p style="text-align: center;"><em>Figure 6</em></p>
<p>In layer 1, the input variables <em>e(k)</em> and <em>Δe(k)</em> are passed into the neural controller nodes.  In layer 2, each node acts as a “linguistic term” assigned to one of the input variables in layer 1, and uses its membership function to determine the degree to which an input value belongs to a fuzzy set (e.g. negative large corresponds to the numeric value -50).  Layer 3 multiplies together the outputs from layer 2 to determine the firing strength of a particular rule.  Lastly, in layer 4, the output of layer 3 is “defuzzified” into numeric output in terms of resource adjustment <em>Δm(k)</em>.  The magnitude of adjustment is determined by the online learning of the neural fuzzy controller, described in more detail in [7].  This online learning algorithm is able to adapt quite rapidly to stationary and dynamic workloads to guarantee 95<sup>th</sup>-percentile delay, as seen in <em>Figure 7</em> from [7].</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/neural-fuzzy-dynamic-workload.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-267" title="neural-fuzzy-dynamic-workload" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/neural-fuzzy-dynamic-workload.png" alt="" width="249" height="161" /></a><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/neural-fuzzy-static-workload.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-268" title="neural-fuzzy-static-workload" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/neural-fuzzy-static-workload.png" alt="" width="250" height="175" /></a></p>
<p style="text-align: center;"><em>Figure 7</em></p>
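<p>The forward pass through the four layers can be sketched as below.  This is a toy illustration, not the controller from [7]: the triangular membership functions, the three linguistic terms, the weighted-average defuzzification and the fixed rule outputs are all assumptions, and the online learning that tunes these values in [7] is omitted entirely.</p>

```python
def triangle(x, left, center, right):
    """Triangular membership function that peaks at `center`."""
    if x == center:
        return 1.0
    if x > center:
        return max(0.0, (right - x) / (right - center))
    return max(0.0, (x - left) / (center - left))


# Hypothetical linguistic terms as (left, center, right) triples.
TERMS = {
    "negative": (-100.0, -50.0, 0.0),
    "zero": (-50.0, 0.0, 50.0),
    "positive": (0.0, 50.0, 100.0),
}
SIGN = {"negative": -1, "zero": 0, "positive": 1}

# Hypothetical rule outputs: one resource adjustment per pair of terms.
RULE_OUTPUT = {(a, b): SIGN[a] + SIGN[b] for a in TERMS for b in TERMS}


def resource_adjustment(e, delta_e, rule_output=RULE_OUTPUT):
    # Layer 1: the inputs e(k) and delta e(k) are passed in unchanged.
    # Layer 2: degree of membership of each input in each linguistic term.
    mu_e = {term: triangle(e, *shape) for term, shape in TERMS.items()}
    mu_de = {term: triangle(delta_e, *shape) for term, shape in TERMS.items()}
    # Layer 3: firing strength of each rule is the product of memberships.
    strength = {(a, b): mu_e[a] * mu_de[b] for a in TERMS for b in TERMS}
    total = sum(strength.values())
    if total == 0.0:
        return 0.0
    # Layer 4: defuzzify with a firing-strength-weighted average.
    weighted = sum(strength[rule] * rule_output[rule] for rule in strength)
    return weighted / total
```

<p>For example, an error of 25 with no change in error half-fires the “zero” and “positive” rules, so the sketch returns an adjustment halfway between their outputs.</p>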
<p>In [6], the authors use the optimizer to determine changes in resource allocations for finite resources such as disk and CPU.  This is a significant departure from the other models, which assume a much larger (essentially unlimited) pool of resources from which to allocate additional capacity.  Their optimizer uses u<sub>ar</sub> as the resource allocation required to meet the performance target or SLA.  The optimizer’s high-level goal is to determine an allocation in a stable manner (no resource oscillations) by using a cost minimization function to find the optimum resource assignments.  In their cost minimization function:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/cost-minimization.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-281" title="cost-minimization" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/cost-minimization.png" alt="" width="396" height="43" /></a></p>
<p>where J<sub>p</sub> = (yn<sub>a</sub>(k) &#8211; 1)<sup>2</sup> serves as a penalty for deviation of the application performance from the desired target and J<sub>c</sub> = ||ur<sub>a</sub>(k) – u<sub>a</sub>(k &#8211; 1)||<sup>2</sup> serves as a control cost function to improve stability of the system.  The controller will attempt to minimize the linear combination of both functions.  One important point to note in this particular allocation scheme is that allocation requests are confined by the following constraint:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/cosntraint.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-280" title="cosntraint" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/cosntraint.png" alt="" width="122" height="47" /></a></p>
<p>where ur<sub>irt</sub> is a tuple of requested resources for application i, resource r for tier t.  Stated another way, the allocations for all applications for a particular resource, such as CPU or disk, in a particular tier must be less than or equal to 100%.  When this constraint is violated, a particular resource is experiencing contention.  It is important to note that in this allocation scheme, while there exists the potential for all applications to receive adequate capacity to satisfy demand, when contention exists, applications are weighted according to their priorities in the minimization function so that the most “important” application receives the largest share of the divided resources.  Clearly, because we cannot allocate more than 100% of a node’s physical CPU, disk or other resource, all applications suffer from performance degradation – the only difference is to what extent each application suffers that degradation.</p>
<p>Other authors of course decide how much to provision based on their controller implementation.  In [2], the authors use a <em>k</em>-nearest neighbors classifier to decide the number of databases to be provisioned.  Lastly, in [8], the authors use a decision tree process to determine the maximum performance gain (and minimal performance loss) when allocating or removing resources.  The performance of the algorithm in [8] was quite impressive in observing its SLA and remaining well-provisioned without wasting resources, as seen in <em>Figure 8</em>.</p>
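<p>The two ingredients of this scheme, the cost being minimized and the capacity constraint, can be sketched for a single application and a single resource.  The weight <em>q</em> that combines the two cost terms is an assumption (the text says only that a linear combination is minimized), as are the function names; the actual optimizer in [6] solves for the allocations rather than just evaluating them.</p>

```python
def allocation_cost(normalized_performance, requested, previous, q):
    """Evaluate penalty + q * control_cost for one application.

    normalized_performance: measured performance divided by its target,
        so 1.0 means the target is met exactly.
    requested, previous: allocation vectors for this and the last interval.
    q: assumed weight trading off tracking error against stability.
    """
    penalty = (normalized_performance - 1.0) ** 2
    control_cost = sum((r - p) ** 2 for r, p in zip(requested, previous))
    return penalty + q * control_cost


def in_contention(requested_shares):
    """True when the requests of all applications for one resource in one
    tier add up to more than 100% (shares expressed as fractions)."""
    return sum(requested_shares) > 1.0
```

<p>A larger <em>q</em> makes large swings between consecutive allocations more expensive, which is how the control cost term damps oscillation.</p>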
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/cache-sla.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-279" title="cache-sla" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/cache-sla.png" alt="" width="513" height="378" /></a></p>
<p style="text-align: center;"><em>Figure 8</em></p>
<p>Before looking at additional MAPE-K components, it is worth looking at two additional enhancements to resource provisioning: preventing downstream bottlenecks and preventing allocation oscillations.</p>
<ol>
<li>
<h5><em>A. </em><em>Preventing Downstream Bottlenecks</em></h5>
</li>
</ol>
<p>The relationship between an incoming request and the work units at successive tiers is not necessarily 1-to-1.  A single incoming search request at the web tier may trigger one or more query requests at a database tier.  Therefore, downstream tier capacity has to be considered when allocating additional capacity at upstream tiers.</p>
<p>First, consider what would happen if downstream tiers were not adjusted for the additional incoming requests.  When an increase in allocation is triggered by an SLA violation at tier <em>k</em>, the capacity will increase by <em>n</em>.  If the downstream tier <em>k + 1</em> is almost at capacity, that tier now has an additional <em>n </em>requests to service, which has the potential to cause an SLA violation.  When the resource manager detects a potential violation at tier <em>k + 1</em>, there will be another round of provisioning.  In the meantime, because it takes a fixed amount of time to provision additional resources, tier <em>k + 1</em> could be in violation.  Following this pattern, a single increase could cause up to <em>k</em> provisioning events.  Clearly, this cascading chain of allocation, violation and additional provisioning, in addition to being wasteful, increases the total time in violation of the SLA.  [5] proposes a solution using the constant β<sub>i</sub>, which is greater than one if a request triggers multiple requests at tier <em>i</em> or less than one if caching at prior tiers reduces the work at this tier.</p>
<ol>
<li>
<h5><em>B. </em><em>Preventing Allocation Oscillations</em></h5>
</li>
</ol>
<p>Because Internet traffic has a bursty nature, there is a potential for a rapidly fluctuating load in the system.  A necessary enhancement to the system controller is therefore a filter to smooth or dampen very brief load spikes.  This is not to say that we don’t want to respond to all SLA violations, but adding or removing capacity can potentially be a very costly and time-consuming operation.</p>
<p>In addition, in [2], the authors keep state on the additional allocation, as either <em>STEADY</em> or <em>MIGRATION</em> state.  During the <em>MIGRATION</em> state, it may take a period of time after the SLA violation and associated additional capacity allocation is triggered for the migration of additional resources to occur and “warm up.”  [2] found that latency may continue to be high or even increase during this period, and as such, sampling during this period may not be reflective of the actual system state.  Furthermore, they found that making decisions based on samples from this migration and warm-up period may continue to add unnecessary replicas, which will need to be removed later.  This is wasteful and could penalize other applications hosted on the same machine.</p>
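<p>The idea of ignoring samples taken during migration can be sketched with a small state holder.  The state names come from [2]; the class, its methods and the choice to simply drop samples are assumptions made for illustration.</p>

```python
STEADY = "STEADY"
MIGRATION = "MIGRATION"


class LatencySampler:
    """Collects latency samples, skipping those taken during migration."""

    def __init__(self):
        self.state = STEADY
        self.samples = []

    def start_migration(self):
        # New capacity is being added and warmed up.
        self.state = MIGRATION

    def finish_migration(self):
        self.state = STEADY

    def record(self, latency):
        """Store the sample only in STEADY state; return True if stored."""
        if self.state == MIGRATION:
            # Latency measured now does not reflect the settled system,
            # so it must not trigger further allocations.
            return False
        self.samples.append(latency)
        return True


sampler = LatencySampler()
sampler.record(100.0)
sampler.start_migration()
sampler.record(900.0)      # dropped: taken while replicas warm up
sampler.finish_migration()
sampler.record(120.0)
```

<p>Only the two steady-state samples survive, so a provisioning decision based on them is not skewed by the transient spike.</p>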
<p>For additional information on each model and provisioning algorithm, refer to the related paper.</p>
<ol>
<li>
<h3>Effectors</h3>
</li>
</ol>
<p>The execution of the plan in the MAPE-K loop happens with the effectors. In some of the papers that were surveyed, [citation needed – 5?], the authors used request policing to ensure that the admission rate does not exceed the capacity.  Any extra sessions beyond capacity are dropped at the threshold of the system.  Once a request was admitted, it was not dropped at any intermediate tier.  Alternatively, each tier could have its own request policing mechanism; however, this is quite wasteful, as requests that have had prior work done may be dropped at downstream tiers.</p>
<p>In the majority of the papers [2, 5, 6, 7, 8], additional capacity was allocated from some pool of free resources.  If no more free resources existed or that resource reached its replication maximum, resources were re-provisioned or the sessions and requests were capped at a maximum. In [2], additional database capacity was added to the tier by replicating the existing database.  During the period of migration, requests were held, and when the migration was complete, they were “played” in stages to bring the new copy up to date with any additional writes that may have occurred during migration.</p>
<p>In [5], the authors used agile server switching.  In this model, each node had a copy of the application running in a separate virtual machine (VM); however, only one application was active at a time.  When the control plane determined that it needed to allocate additional capacity, if the application running on a particular node scheduled for replacement was not the application receiving the additional capacity, the sessions and requests were ramped down using fixed-rate or measurement-based techniques.  Once all the sessions had been concluded for the previous application in that VM, it was suspended and the new allocation was brought up.  This allowed new applications to quickly be provisioned for additional capacity.</p>
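<p>Policing at the threshold of the system can be sketched as a counter of active sessions.  This is a minimal illustration with hypothetical names; it models capacity as a fixed number of concurrent sessions, which is a simplification of the rate-based policing described above.</p>

```python
class SessionPolicer:
    """Admits sessions at the system entry point until capacity is reached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = 0

    def admit(self):
        """Return True and count the session if there is room, else False."""
        if self.active >= self.capacity:
            # Dropped at the threshold, before any tier has done work.
            return False
        self.active += 1
        return True

    def finish(self):
        """Release the slot of a session that has completed."""
        self.active = max(0, self.active - 1)


policer = SessionPolicer(capacity=2)
decisions = [policer.admit(), policer.admit(), policer.admit()]
policer.finish()
after_release = policer.admit()
```

<p>Because the check happens once at the entry point, an admitted session never needs to be dropped at a later tier, which avoids wasting work already done upstream.</p>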
<p>However, admission control may not be the best choice and [3] presented an alternative to dropping sessions by allocating additional capacity in a “degraded mode.”  Using traditional control theory (having a closed loop that reacts to sensors and actuators), we can readily adjust the load on the server to stay within bounds of process schedulability based on available resources.</p>
<p>The actuator or effector is responsible for performing a specific action based on controller output.  In the case of [3], they initially used admission control as a way of controlling server load.  This limits the number of clients that the server can respond to concurrently.  All other clients are rejected and while this is not a favorable outcome, it is a way of providing QoS to other clients.</p>
<p>However, this actuator can be adapted to provide a “degraded” level of service.  In this “degraded” level of service, the content that is served to the client receiving the degraded service is different from that which the “full service” client receives.  For example, the degraded content may contain fewer elements such as images.  To perform this degradation, there must be multiple versions of the content, and the web server will serve the content from the appropriate file structure at request time.  There can be <em>m</em> different service levels, where <em>m = I + F </em>(<em>I</em> is the service level and <em>F</em> is the fraction of clients who are served at level <em>I</em>).  When <em>m = M, </em>all clients receive the highest service and conversely when <em>m = 0, </em>all clients are rejected.  This allows a more general model of saying that a percentage of clients will receive degraded service, rather than specifying exactly which clients (which clients may be addressed through different functionality, such as hashing their IP address).  The server load is controlled through the input variable <em>m</em>.</p>
<p>The monitor proposed in [3] expresses the utilization <em>U</em> as a function of the served request rate <em>R</em> and the delivered byte bandwidth <em>W</em> (<em>U = aR + bW, </em>where <em>a</em> and <em>b</em> are constants derived through linear regression).  This estimate is compared against the utilization bound under which deadline-monotonic schedulers are able to meet deadlines, <em>U</em> &lt; 0.58.  Using this scheme of admission and “capacity” planning, the authors in [3] were able to admit almost 3 times the number of requests to the system, as shown in <em>Figure 9</em>, before significant service degradation occurred.</p>
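<p>One way to turn the controller output <em>m</em> into a per-client decision is sketched below.  The interpretation is an assumption: the integer part of <em>m</em> is taken as a base level, and the fractional part as the share of clients served one level higher, which is consistent with everyone being rejected at <em>m = 0</em>.  Hashing the IP address is the mechanism mentioned in the text; the use of SHA-256 and the function name are hypothetical.</p>

```python
import hashlib
import math


def service_level(client_ip, m):
    """Map a client to an integer service level for controller output m.

    Level 0 means the client is rejected.
    """
    base_level = math.floor(m)
    fraction = m - base_level
    # Hash the address to a stable point in [0, 1) so that the same
    # client keeps the same relative position from request to request.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2 ** 64
    # Assumption: a share `fraction` of clients gets the next level up.
    if fraction > point:
        return base_level + 1
    return base_level
```

<p>Because the hash is deterministic, raising <em>m</em> slightly upgrades a few more clients without reshuffling the ones already being served at the higher level.</p>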
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/degradation.png" rel="lightbox[263]"><img class="aligncenter size-full wp-image-282" title="degradation" src="http://blog.afewguyscoding.com/wp-content/uploads/2011/03/degradation.png" alt="" width="515" height="321" /></a></p>
<p style="text-align: center;"><em>Figure 9</em></p>
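<p>The utilization monitor from [3] reduces to a one-line linear model.  In the sketch below, the coefficients <em>a</em> and <em>b</em> are made-up values standing in for the constants that [3] derives through linear regression, and the function names are hypothetical; the 0.58 bound is the one quoted in the text.</p>

```python
UTILIZATION_BOUND = 0.58


def estimate_utilization(request_rate, bandwidth, a, b):
    """U = a * R + b * W for served request rate R and delivered bandwidth W."""
    return a * request_rate + b * bandwidth


def exceeds_bound(request_rate, bandwidth, a, b, bound=UTILIZATION_BOUND):
    """True when estimated utilization reaches the schedulability bound,
    signalling that service should be degraded or requests rejected."""
    return estimate_utilization(request_rate, bandwidth, a, b) >= bound


# Hypothetical coefficients: cost per request and cost per byte delivered.
A, B = 0.002, 0.0002
light_load = exceeds_bound(100.0, 1000.0, A, B)
heavy_load = exceeds_bound(200.0, 1000.0, A, B)
```

<p>The first measurement gives a utilization of about 0.4 and stays under the bound; doubling the request rate pushes it to about 0.6, which is the point at which the control loop would start lowering the service level.</p>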
<ol>
<li>
<h3>Sensors</h3>
</li>
</ol>
<p>Sensors are part of the knowledge phase in the MAPE-K loop.  The sensors that report data to the control plane can be any number of metrics.  In the papers surveyed, there were authors that used CPU usage [5, 6], bandwidth, throughput, active network connections [2, 5], I/O transfer [2, 6], cache hits [8], end-to-end delay or response time [5, 7], session arrival rate [5], request arrival rate [5], read/write ratios [2], lock ratios [2] and memory usage [2].  These parameters can be recorded and monitored online or offline, using standard system tools such as<em> vmstat, iostat</em> and <em>sysstat</em> or custom written software to record statistics.</p>
<h2>IV. Quality of Service &amp; Performance Isolation</h2>
<p>Lastly, the control theory approach was extended in [3] using virtual servers to show support for performance isolation (each virtual server can guarantee a maximum request rate and maximum throughput), service differentiation (each virtual server supports request prioritization) and excess capacity sharing (conforming virtual servers under overload can temporarily exceed capacity to avoid client degradation).</p>
<p>With performance isolation, a virtual server is configured with a maximum request rate <em>R</em><em><sub>MAX</sub></em><em> </em>and a maximum delivered bandwidth of <em>W</em><em><sub>MAX</sub></em>.  This expresses a throughput agreement to serve up to the specified levels.  If <em>R</em><em><sub>MAX</sub></em> is exceeded, then the agreement on <em>W</em><em><sub>MAX</sub></em> is revoked.</p>
<p>In performance isolation, the virtual server can adapt content as previously mentioned to stay within the agreement bounds.  When each of the <em>i</em> servers is configured, if the aggregate of their utilizations <em>U*</em><em><sub>i</sub></em> stays below the bound of 0<em>.</em>58, then the system capacity is planned correctly.  To perform the correct bookkeeping, load classification is done when the request is performed; then, based on the classification, the requests served and the delivered bandwidth are charged against the requested virtual server.  Lastly, based on the rates for virtual server <em>i</em> for requests (<em>R</em><em><sub>i</sub></em>) and delivered bandwidth (<em>W</em><em><sub>i</sub></em>), the control loop achieves the degree of content degradation necessary to keep the utilization of that virtual server at the proper level, preventing overload.</p>
<p>In service differentiation, the goal is to support a number of different clients at different service levels (lower priority clients are degraded first).  In [3], if there are <em>m</em> priority levels, where 1 is the highest priority, the capacity of that virtual server is made available to clients in priority order.  For the purposes of their research, [3] had two service levels, premium and basic, where premium was guaranteed service and basic was not.</p>
<p>Lastly, in sharing excess capacity, if the load on one server exceeds the maximum capacity <em>C</em>, then the server under overload may temporarily utilize the resources of underutilized virtual servers.  This requires only a simple modification to the control theory loop.</p>
<h2>IV. Conclusion</h2>
<p>In this survey, we presented information on burstiness in network traffic and how it affects service times.  We also showed how autonomic systems can provision themselves to handle the dynamic workloads.  Dynamic provisioning in multi-tier applications raises some interesting and unique challenges.  As distributed systems become more complex, the techniques surveyed in this paper will become more useful and relevant.</p>
<h3>References</h3>
<h5>[1]     M.C. Huebscher, J.A. McCann. <em>A survey of autonomic computing: Degrees, models, and applications</em>. ACM Computing Surveys, 40(3), 2008.</h5>
<h5>[2]     J. Chen, G. Soundararajan, C. Amza. <em>Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers</em>. Proc. of IEEE ICAC, 2006.</h5>
<h5>[3]     T. Abdelzaher, K.G. Shin, N. Bhatti. <em>Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach</em>. IEEE Transactions on Parallel and Distributed Systems, 13(1), 2002.</h5>
<h5>[4]     N. Mi, G. Casale, L. Cherkasova, E. Smirni. <em>Burstiness in multi-tier applications: Symptoms, causes, and new models</em>. Proc. of ACM/IFIP/USENIX Middleware, 2008.</h5>
<h5>[5]     B. Urgaonkar, P. Shenoy, A. Chandra, P. Goyal, and T. Wood. <em>Agile dynamic provisioning of multi-tier Internet applications</em>. ACM Trans. on Autonomous and Adaptive Systems, 3(1):1-39, 2008.</h5>
<h5>[6]     P. Padala, K.-Y. Hou, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, and A. Merchant. <em>Automated control of multiple virtualized resources</em>. Proc. of EuroSys, 2009.</h5>
<h5>[7]     P. Lama and X. Zhou. <em>Autonomic provisioning with self-adaptive neural fuzzy control for end-to-end delay guarantee</em>. Proc. IEEE/ACM MASCOTS, pages 151-160, 2010.</h5>
<h5>[8]     D. Jiang, G. Pierre, and C.-H. Chi. <em>Autonomous resource provisioning for multi-service web applications</em>. Proc. ACM WWW, 2010.</h5>
<h5>[9]     L. Kleinrock. <em>Queueing Systems, Volume 2: Computer Applications</em>. John Wiley and Sons, Inc., 1976.</h5>
<h5>[10]  B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. <em>An analytical model for multi-tier Internet services and its applications.</em> Proc. of ACM ICMMCS.  Banff, Canada, 2005.</h5>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/03/survey-resource-management-allocation-techniques-autonomic-distributed-systems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HttpSession and Google App Engine</title>
		<link>http://afewguyscoding.com/2011/02/httpsession-google-app-engine/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=httpsession-google-app-engine</link>
		<comments>http://afewguyscoding.com/2011/02/httpsession-google-app-engine/#comments</comments>
		<pubDate>Sat, 05 Feb 2011 22:03:36 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[httpservletrequest]]></category>
		<category><![CDATA[httpsession]]></category>
		<category><![CDATA[j2ee]]></category>
		<category><![CDATA[serialization]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=254</guid>
		<description><![CDATA[I recently spent a good amount of time trying to figure out why I could not retrieve an object from a session that I had previously stored.  After doing a little thinking, I came up with the answer and I hope this saves someone else an immense amount of time.  There is one setup item [...]]]></description>
				<content:encoded><![CDATA[<p>I recently spent a good amount of time trying to figure out why I could not retrieve an object from a session that I had previously stored.  After doing a little thinking, I came up with the answer and I hope this saves someone else an immense amount of time.  There is one setup item that needs to be done before you can start using the standard HttpSession objects in your App Engine project.  This line needs to go in your <em>appengine-web.xml</em> file:</p>
<pre>&lt;sessions-enabled&gt;true&lt;/sessions-enabled&gt;</pre>
<p>If you don&#8217;t, you may find yourself with only read access to the session object.</p>
<p>In addition, you need to <strong>make sure that all your objects</strong> that you are going to persist to the session implement the <span style="font-family: Consolas, Monaco, 'Courier New', Courier, monospace; font-size: 12px; line-height: 18px; white-space: pre;">java.io.Serializable </span>interface.  This is particularly important, and it is what I failed to realize until I had struggled with this for 2 hours.  The objects need to be serializable because App Engine stores session data in the datastore and memcache.  Any objects referenced by the value you put in the session must be serializable as well, so that the entire object graph can be stored.  What I found interesting is that the session data must be committed in a transactional manner, because I had also stored a String in the session and that wasn&#8217;t persisted either.  If the object isn&#8217;t serializable, the app will <strong>NOT</strong> fail on a local development machine, but <strong>will</strong> fail when deployed to the cloud.</p>
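<p>One way to catch a non-serializable object before deploying is to push the value through plain Java serialization yourself.  The sketch below uses only <em>java.io</em>; the <em>UserProfile</em> class and the <em>canSerialize</em> helper are hypothetical names made up for this example and are not part of the App Engine API.</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical session value; every field it references is serializable too.
class UserProfile implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;

    UserProfile(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

final class SessionSafety {
    // Serializes the value with standard Java serialization and reports
    // whether the whole object graph could be written.
    static boolean canSerialize(Object value) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            out.writeObject(value);
            out.close();
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new UserProfile("Ada", 25))); // serializable graph
        System.out.println(canSerialize(new Object()));               // plain Object is not
    }
}
```

<p>Running a check like this in a unit test is a cheap way to find the problem locally, since the local development server will not complain.</p>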
<p>A bit of sample code:</p>
<pre>public void doGet(HttpServletRequest req, HttpServletResponse resp) {
  // getSession(boolean) controls whether a missing session is created.
  // getSession(true) returns the existing session, or creates a new one
  // if none exists yet.
  // getSession(false) returns the existing session, or null if there is
  // none; no session is created in that case.
  HttpSession session = req.getSession(true);

  // read a previously stored value (null if it was never set)
  String name = (String) session.getAttribute("name");

  // store a value; the int is autoboxed to a serializable Integer
  session.setAttribute("age", 25);
}</pre>
<p>To check that everything is being stored correctly, look at the cookies on your machine: you should see a cookie for the domain you&#8217;re working in (dev: localhost, prod: appname.appspot.com) with the key <span style="font-family: Consolas, Monaco, 'Courier New', Courier, monospace; font-size: 12px; line-height: 18px; white-space: pre;">JSESSIONID</span>.  The value for <span style="font-family: Consolas, Monaco, 'Courier New', Courier, monospace; font-size: 12px; line-height: 18px; white-space: pre;">JSESSIONID </span>should match what is in your <span style="font-family: Consolas, Monaco, 'Courier New', Courier, monospace; font-size: 12px; line-height: 18px; white-space: pre;">_ah_SESSION </span>table in the App Engine datastore.  You can visually inspect the bytes in the session as well.</p>
<p>There is a small gotcha between the standard J2EE HttpSession and the GAE HttpSession.  The difference shows up when services manipulate objects that are already stored in the session: your changes will be lost when another service gets the object from the session.  The fix is to invoke setAttribute again after modifying an object (say, a Person object) that is stored in the session.  This workaround solves the inconsistencies but has a pretty important trade-off: every setAttribute triggers a new serialization and a write to the datastore.</p>
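<p>The effect is easy to reproduce without App Engine.  The sketch below is a simulation, not GAE code: <em>CopyingSession</em> is a hypothetical stand-in that keeps a serialized copy of a single attribute, which is roughly what a datastore-backed session does, and <em>Person</em> is a made-up value class.  The real session decides when to write on its own schedule, so treat this only as a model of why the second setAttribute call matters.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;

    Person(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for a session that stores serialized bytes.
final class CopyingSession {
    private byte[] stored;

    void setAttribute(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(value);
            out.close();
            stored = bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Object getAttribute() {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stored));
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Mutates the object after storing it; writes it back only if asked to.
    static String nameSeenByNextRequest(boolean setAgain) {
        CopyingSession session = new CopyingSession();
        Person person = new Person("Ada");
        session.setAttribute(person);
        person.name = "Grace";             // change made after the write
        if (setAgain) {
            session.setAttribute(person);  // the workaround: store it again
        }
        return ((Person) session.getAttribute()).name;
    }

    public static void main(String[] args) {
        System.out.println(nameSeenByNextRequest(false)); // prints Ada
        System.out.println(nameSeenByNextRequest(true));  // prints Grace
    }
}
```

<p>Without the second call the next reader still sees the old name, because it is deserializing the bytes written before the change.</p>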
<p>Lastly, depending on the utilization of your application, you may find yourself with more than a few sessions.  That is because when the session is written to the datastore, the expiration of the session is set to System.currentTimeMillis() + 24 * 60 * 60 * 1,000 = System.currentTimeMillis() + 86,400,000, i.e. one day from the current time.  The _expires field is updated each time the session is active, so that could be quite a bit of data storage.  There is <strong>currently</strong> no automatic removal of expired sessions in GAE.</p>
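<p>For reference, the expiration arithmetic looks like this in Java.  The class and method names (<em>SessionExpiry</em>, <em>expiresAt</em>) are hypothetical; the sketch only shows how the one-day offset is computed and is not how App Engine itself is written.</p>

```java
import java.util.concurrent.TimeUnit;

final class SessionExpiry {
    // One day in milliseconds: 24 * 60 * 60 * 1,000 = 86,400,000.
    static final long ONE_DAY_MILLIS = TimeUnit.DAYS.toMillis(1);

    // Expiration timestamp for a session that was active at nowMillis.
    static long expiresAt(long nowMillis) {
        return nowMillis + ONE_DAY_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(ONE_DAY_MILLIS); // prints 86400000
        System.out.println(expiresAt(System.currentTimeMillis()));
    }
}
```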
<p><strong>One last note</strong>: remember that App Engine is a distributed architecture, so a difference from J2EE is that you are never guaranteed that a request will be processed by the same application server instance as the previous request.  While the object is being serialized correctly in memcache, you still have to call setAttribute() every time you change it, because memory is not shared between instances.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/02/httpsession-google-app-engine/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Rumors of a &#8220;Home Button&#8221;-less iPad2 and iPhone5</title>
		<link>http://afewguyscoding.com/2011/01/rumors-home-button-less-ipad2-iphone5/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=rumors-home-button-less-ipad2-iphone5</link>
		<comments>http://afewguyscoding.com/2011/01/rumors-home-button-less-ipad2-iphone5/#comments</comments>
		<pubDate>Thu, 13 Jan 2011 17:06:54 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[iOS]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[interface]]></category>
		<category><![CDATA[ios]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=241</guid>
		<description><![CDATA[Normally, I think that simpler is better. I enjoy simple things and not just necessarily things like the smell of cut grass, an ice cold beer at a friend&#8217;s BBQ, the first snowfall of winter or reading a book. After all, Occam was on to something (the law of succinctness is a principle which generally [...]]]></description>
				<content:encoded><![CDATA[<p>Normally, I think that simpler is better. I enjoy simple things and not just necessarily things like the smell of cut grass, an ice cold beer at a friend&#8217;s BBQ, the first snowfall of winter or reading a book.  After all, Occam was on to something (the law of succinctness is a principle which generally recommends selecting the competing hypothesis that makes the fewest new assumptions, when the hypotheses are equal in other respects). Simple is good &#8211; fewer moving parts, fewer things to break.</p>
<p>I also enjoy simple software and hardware interfaces.  Things that I can just pick up and know instantly how to use without reading an instruction manual.  Something so simple that Grandma could use it.  Apple products always have incredibly simple, unified interfaces: &#8220;One Interface to rule them all, One Interface to find them, One Interface to bring them all, and in the darkness bind them.&#8221;  Simple people living in a simple world.  However, simple does NOT imply intuitive.</p>
<p>Ever since the dawn of the electronics age, electronics have had buttons or some sort of interface that had something you could push, pull, or manipulate in some other fashion. Now, there are rumors that iPad2 and iPhone5 will not have home buttons.  In Jonathan Geller&#8217;s <a href="http://www.bgr.com/2011/01/12/exclusive-apple-will-remove-home-button-on-next-ipad-and-iphone-photo-booth-and-ilife-coming/" target="_blank">post</a>:</p>
<blockquote><p>&#8220;We just got some pretty wild information &#8230; while it’s hard to believe at first, it does make sense. &#8230; The iPad will be losing the home button. &#8230; Instead of button taps, you will use new multitouch gestures to navigate to the home screen and &#8230; launch the app switcher.<br />
We’re told that this change will make its way over to the iPhone as well. &#8230; Apple employees are already testing iPads and iPhones with no home buttons on the Apple campus. &#8230; Steve Jobs didn’t want any physical buttons on the original iPhone &#8230; it looks like he may soon get his wish.&#8221;</p></blockquote>
<p>So what will be replacing the home button?  In iOS 4.3, which was seeded to developers yesterday, there are new gestures that can be done with four or five fingers.  This is not particularly a stretch, since previous iOS releases supported UIGestureRecognizer to simplify recognizing things like taps, pinch-in and pinch-out.  You could already customize the number of fingers needed to perform a gesture, so this is simply an extension of existing functionality.  In iOS 4.3, a four or five finger pinch brings you to the home screen, a swipe up or down reveals the multi-tasking bar and a swipe left or right allows switching between apps.</p>
<p>I am going to do what a devoted Applephite would never do and disagree with what Steve wants.  I think that this is a bad decision because, unlike Steve, regular users <strong>actually <em>like </em>buttons</strong>!  They like the tactile feel of pushing something, the affirmative &#8220;click&#8221; that something was done or something will happen.  Consider some software that would process information for whole minutes at a time without providing feedback in the form of a spinner or a progress bar.  Most users would get fed up and try to quit the application.  Without the feedback, users may not know what to do.  While these gestures may be fine for people who have been around technology their whole life (think GenY and younger), there are a lot of users who would have problems picking up the device and being able to use it straight away.  I would even go so far as to say that they may never discover some of the features without reading the manual or browsing forums for tips.  Apple has always prided itself on simple AND intuitive, easy to use software.  I don&#8217;t think regular users are ready for this.</p>
<p>To make my point even clearer, there are still people I know with iOS 4.2 that don&#8217;t know that double-tapping the home button brings up the task bar.  They literally have over 75 applications &#8220;running&#8221; and if it weren&#8217;t for me, they would have continued on into their own blissful world, not knowing of that functionality.  And that&#8217;s WITH a button.  These gestures should be included in the software in addition to the home button, so users can choose how they want to interact with the device.</p>
<p>Usability aside, how would we perform operations such as recovering crashed phones?</p>
<p>Personally, I would rather see support for a 4G network and WiFi-less FaceTime support.  Although, who knows what the new feature set will be, because apparently iOS 4.3 drops support for ARMv6 devices (iPhone 3G).</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2011/01/rumors-home-button-less-ipad2-iphone5/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Horizontal Scroll Wheel with Cocoa and Objective-C</title>
		<link>http://afewguyscoding.com/2010/08/horiztonal-scroll-wheel-cocoa-objective-c/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=horiztonal-scroll-wheel-cocoa-objective-c</link>
		<comments>http://afewguyscoding.com/2010/08/horiztonal-scroll-wheel-cocoa-objective-c/#comments</comments>
		<pubDate>Tue, 03 Aug 2010 14:55:25 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=210</guid>
		<description><![CDATA[As engineers, we should attempt to cross the boundary of developers vs. front-end UI design engineers. Typically, people who write code are not very good UI engineers because they just don&#8217;t have the experience. It isn&#8217;t that we want to make a bad interface, we are just very accustomed to using standard UI components. However, [...]]]></description>
				<content:encoded><![CDATA[<p>As engineers, we should attempt to cross the boundary of developers vs. front-end UI design engineers. Typically, people who write code are not very good UI engineers because they just don&#8217;t have the experience. It isn&#8217;t that we want to make a bad interface, we are just very accustomed to using standard UI components. However, the standard UI components can&#8217;t always get the job done in the most effective or user friendly manner. This was precisely the case when I needed to come up with a quick, intuitive way to filter information in one of my iPhone applications.</p>
<p>The horizontal scroll wheel is great because it is built out of standard UI components (UIImageView and UIScrollView) and there is no subclassing to be done to make the widget work. It can be dropped into any application.  There is little code to support the layout and generation of the scroll wheel, and all layout calculations are done dynamically so you can have as many or as few items in the list as desired. Also, with very little extra work, you can add polish using the audio framework and Core Animation.</p>
<p>For users, it is great because it is an easy to use and understand, quick method to filter a UITableView. The component requires no explanation.  A sign of good UI design is a WYSIWYG type approach where the interface is so intuitive it is a joy to use. Create components that users can&#8217;t stop playing with.</p>
<p>Let&#8217;s get started already. By the way, this tutorial assumes you know Objective-C, Cocoa and basic programming principles. I welcome feedback on any of the code.</p>
<p>ScrollWheel.h</p>
<pre><code>#import &lt;UIKit/UIKit.h&gt;
#import "ChannelCategory.h"
#import "NSArray+FirstObject.h"
#import "Colors.h"
#import "Debugger.h"

#define SCROLL_WHEEL_CATEGORY_CHANGED @"ScrollWheelCategoryChanged"
#define SCROLL_WHEEL_CATEGORY_ADDED @"ScrollWheelCategoryAdded"

@interface CategoryScrollWheel : UIView &lt;UIScrollViewDelegate&gt; {
  IBOutlet UIScrollView *_scrollView;
}

@property (nonatomic, retain) IBOutlet UIScrollView *_scrollView;

- (void) setupCategories: (NSArray *) categories;
- (void) scrollToButton: (id) sender;
- (void) rebuildCategories;
- (NSArray *) loadNewsCategories;
@end</code>

ScrollWheel.m

<code>#import "CategoryScrollWheel.h"</code>

<code>@implementation CategoryScrollWheel

@synthesize _scrollView;

- (id)initWithFrame:(CGRect)frame {
  if ((self = [super initWithFrame:frame])) {
    // Initialization code
  }
  return self;
}

- (void) rebuildCategories {
  // release the previous wheel and alloc a new one

  [_scrollView release], _scrollView = nil;
  _scrollView = [[UIScrollView alloc] init];

  // reload the categories into the scroll wheel

  [self setupCategories: [NSArray arrayWithArray: [self loadNewsCategories]]];
}

- (NSArray *) loadNewsCategories {
  NSManagedObjectContext *managedObjectContext = [(RSSAppDelegate *)[[UIApplication sharedApplication] delegate] managedObjectContext];

  NSFetchRequest *request = [[NSFetchRequest alloc] init];
  NSError *error = nil;
  NSSortDescriptor *sortDescriptor = [[NSSortDescriptor alloc] initWithKey:@"name" ascending:NO];
  NSArray *sortDescriptors = [[NSArray alloc] initWithObjects: sortDescriptor, nil];

  [request setEntity: [NSEntityDescription entityForName:@"ChannelCategory" inManagedObjectContext: managedObjectContext]];
  [request setSortDescriptors: sortDescriptors];

  NSArray *fetchResults = [managedObjectContext executeFetchRequest: request error: &amp;error];

  // the request and sort descriptors are not needed once the fetch has run,
  // so release them here on every path
  [request release], request = nil;
  [sortDescriptor release], sortDescriptor = nil;
  [sortDescriptors release], sortDescriptors = nil;

  if (fetchResults == nil) {
    // the fetch failed; log the error
    ALog("%@", error);
    return nil;
  }

  if ([fetchResults count] == 0) {
    // no categories to show
    return nil;
  }

  return fetchResults;
}

- (void) setupCategories: (NSArray *) categories {
  // our variable x will keep track of how far our scroll view extends
  // we need to add a half scroll view width to get the arrow in the middle to
  // be able to center up

  double x = ([self _scrollView].frame.size.width / 2);
  double lastHoldover = 0;
  
  for(ChannelCategory *category in categories) {
    UIButton *button = [[UIButton alloc] init];
    [button setTitle: [category name] forState: UIControlStateNormal];
    [[button titleLabel] setFont: [UIFont fontWithName: @"Helvetica-Bold" size: 11]];
    [[button titleLabel] setTextColor: UI_COLOR_LIGHT_GREY];
    
    CGRect rect = CGRectMake(x, 0, [[category name] sizeWithFont: [UIFont fontWithName: @"Helvetica-Bold" size: 11]].width, 30);

    if([category isEqual: [categories firstObject]]) {
      x -= (rect.size.width / 2);
      rect.origin.x = x;
    }

    [button setFrame: rect];
    [button addTarget: self action: @selector(scrollToButton:) forControlEvents: UIControlEventTouchUpInside];

    [[self _scrollView] addSubview: button];

    x += button.frame.size.width + 10;

    if([category isEqual: [categories lastObject]]) {
      lastHoldover = (button.frame.size.width / 2);
      // bad hack to lop off the extra 10 on the last item
      x -= 10;
    }

    [button release], button = nil;
  }

  // this adds a half scrollview width to the end of the scrollview

  x += ([self _scrollView].frame.size.width / 2);
  x -= lastHoldover;

  CGSize size = CGSizeMake(x, 30);

  [[self _scrollView] setContentSize: size];
}

- (void)scrollViewDidEndDecelerating:(UIScrollView *)scrollView {
  // we add the offset and half the view together to bring it into line with the
  // arrow in the middle
  double val = [[self _scrollView] contentOffset].x + ([self _scrollView].frame.size.width / 2);

  // loop through and see which button is closest.  snap to that button
  // using scrollToButton

  UIButton *closestButton = nil;
  double closest = CGFLOAT_MAX;

  for (UIView *view in [[self _scrollView] subviews]) {
    if ([view isKindOfClass:[UIButton class]]) {
      // fabs, not abs: the operands are floating point
      double calculatedValue = fabs(view.frame.origin.x - val);

      if (calculatedValue &lt; closest) {
        closestButton = (UIButton *)view;
        closest = calculatedValue;
      }
    }
  }

  if (closestButton == nil) {
    // no buttons in the scroll view, so there is nothing to snap to
    return;
  }

  CGPoint offset = closestButton.frame.origin;
  CGRect scrollViewFrame = [[self _scrollView] frame];
  offset.x -= (scrollViewFrame.size.width / 2) - (closestButton.frame.size.width / 2);

  [[self _scrollView] setContentOffset:offset animated:YES];
  [[NSNotificationCenter defaultCenter] postNotificationName: SCROLL_WHEEL_CATEGORY_CHANGED object: [[closestButton titleLabel] text]];
}

- (void) scrollToButton: (id) sender {
  CGPoint offset = ((UIButton *)sender).frame.origin;
  CGRect scrollViewFrame = [[self _scrollView] frame];

  offset.x -= (scrollViewFrame.size.width / 2) - (((UIButton *)sender).frame.size.width / 2);

  [[self _scrollView] setContentOffset:offset animated:YES];
  [[NSNotificationCenter defaultCenter] postNotificationName: SCROLL_WHEEL_CATEGORY_CHANGED object: [[sender titleLabel] text]];
}

/*
// Only override drawRect: if you perform custom drawing.
// An empty implementation adversely affects performance during animation.
- (void)drawRect:(CGRect)rect {
  // Drawing code
}
*/

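- (void)dealloc {
  // release the retained scroll view before tearing down the superclass
  [_scrollView release], _scrollView = nil;
  [super dealloc];
}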
</code>
<code>@end</code></pre>
<p>ScrollWheel.xib</p>
<p>In my ScrollWheel.xib, I have 2 UIImageViews and a UIScrollView.  Literally, it is that easy.  The image view is meant to create a nice frame around the scroll view giving the illusion of &#8220;limited view.&#8221;  The top UIImageView may even have a small arrow pointing down to indicate which item is currently selected.  The UIScrollView is the one referenced in the code above.</p>
<p>So there it is &#8211; a neat, clean, compact way of taking standard UI components and turning them into a custom UI to allow for selection of categories.  If you have any questions or comments on how to improve the code, please let me know!</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/08/horiztonal-scroll-wheel-cocoa-objective-c/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Predicting Sexual Predators &amp; Victims in Colorado’s Correctional Facilities Using Clustering</title>
		<link>http://afewguyscoding.com/2010/06/k-means-single-link-clustering/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=k-means-single-link-clustering</link>
		<comments>http://afewguyscoding.com/2010/06/k-means-single-link-clustering/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 03:08:16 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[kmeans]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[single link]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=150</guid>
		<description><![CDATA[Abstract—Clustering is an easy to use and implement method of unsupervised inductive inference. Clustering can be used to learn discrete or continuous valued hypotheses and create compact groups of objects that display similar characteristics, while maintaining a high degree of separation from other groupings. This paper is a survey of some of the methods of [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Abstract—Clustering is an easy to use and implement method of unsupervised inductive inference.  Clustering can be used to learn discrete or continuous valued hypotheses and create compact groups of objects that display similar characteristics, while maintaining a high degree of separation from other groupings.  This paper is a survey of some of the methods of constructing and evaluating clusters</strong>.</p>
<p><em>Index Terms—machine learning, hierarchical agglomerative clustering, k-means clustering, unsupervised learning, Pearson’s coefficient, euclidean distance</em></p>
<h3>1.  Introduction</h3>
<p>Clustering is a method of dividing a heterogeneous group of objects into smaller, more homogeneous groups that display similar characteristics to other objects in the cluster while displaying one or more dissimilar characteristics to objects in other clusters.  Clustering is an unsupervised learning technique that has many applications in data mining, pattern recognition, image analysis and market segmentation.  Clustering is easy to implement and produces the groupings fairly quickly.</p>
<p>The main purposes of a Department of Corrections at any governmental level are to enhance public safety, to provide proper care of the inmates, to supervise inmates under their jurisdiction and to assist inmates’ re-entry into society.</p>
<p>There is no doubt that inmates at correctional institutions are dangerous to society.  However, even after they are incarcerated at these institutions, some individuals remain ongoing offenders.  While all individuals in prison have displayed some sort of deviant behavior, it is hypothesized that certain combinations of personality traits make some inmates more likely to be sexual predators and some inmates more likely to be sexual victims of these predators.  At most correctional facilities, sexual contact between inmates, consensual or not, is not permitted.</p>
<p>Identification of those inmates likely to be sexually predatory toward other inmates would greatly assist corrections facilities in their goal of providing a safer environment for incarceration.  Clustering can help with this goal by comparing a particular offender to known perpetrators and victims.  After comparison, victims can be incarcerated separately from predators and receive any special needs assistance that can be offered while predators can be segregated in such a fashion as to reduce the potential for successful predatory behaviors.</p>
<h3>1.1  Outline of Research</h3>
<p>In this research survey, we implemented two different types of clustering algorithms &#8211; a standard “bottom-up” hierarchical method, single link clustering, and a standard “top-down” partitional algorithm, k-means clustering.  We evaluated different distance measure criteria, including Euclidean distance and Pearson’s correlation coefficient.  Results are discussed in section 4 after running the clustering algorithms multiple times with the provided Colorado inmate dataset.</p>
<h3>1.2  Data</h3>
<p>The dataset that we used was provided by Dr. Coolidge of the Department of Psychology at the University of Colorado at Colorado Springs. The dataset is publicly available at http://www.cs.uccs.edu/~kalita/work/cs586/2010/CoolidgePerpetratorVictimData.csv. This dataset pertains to scores on personality disorder tests given to inmates in the State of Colorado. Dr. Coolidge’s inventory of personality disorder tests is given to all inmates in the State of Colorado.  The dataset provided contained 100 rows (25 rows describing victims of sexual abuse and 75 describing perpetrators of sexual abuse) with 14 attributes chosen by Dr. Coolidge.</p>
<p>The data described different measurements of how inmates had scored on different personality disorder tests.  The tests included antisocial <em>(AN)</em>, avoidant <em>(AV)</em>, borderline <em>(BO)</em>, dependent <em>(DE)</em>, depressive <em>(DP)</em>, histrionic <em>(HI)</em>, narcissistic <em>(NA)</em>, obsessive-compulsive <em>(OC)</em>, paranoid <em>(PA)</em>, passive-aggressive <em>(PG)</em>, schizotypal<em> (ST)</em>, schizoid <em>(SZ)</em>, sadistic <em>(SA) </em>and self-defeating<em> (SD)</em> markers.  The scores on these individual tests are measured by T scores, which are a type of standardized score that can be mathematically transformed into other types of standardized scores.  T scores have a Gaussian distribution with a mean that is always 50 and a standard deviation that is always 10.</p>
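<p>As an example of such a transformation, a T score can be converted to a z score (the number of standard deviations from the mean) and back.  The Java sketch below assumes only the mean of 50 and standard deviation of 10 stated above; the class and method names are hypothetical and were not part of our implementation.</p>

```java
final class TScore {
    static final double MEAN = 50.0;
    static final double STD_DEV = 10.0;

    // Number of standard deviations a T score lies from the mean (a z score).
    static double toZ(double tScore) {
        return (tScore - MEAN) / STD_DEV;
    }

    // Inverse transformation, from a z score back to a T score.
    static double fromZ(double z) {
        return MEAN + STD_DEV * z;
    }

    public static void main(String[] args) {
        System.out.println(toZ(70.0));   // prints 2.0
        System.out.println(fromZ(-1.5)); // prints 35.0
    }
}
```

<p>A T score of 70 on any one of the 14 scales therefore sits two standard deviations above the mean.</p>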
<p>It should be noted that even though the dataset is quite small, with only 100 rows available, the quality of the data in the dataset is very good.  The astute reader can appreciate the fact that incarcerated persons might not be completely truthful in their answering of the test questions for a variety of reasons, such as a lack of caring or the desire to appear more “damaged” than any other inmate.  The data is cross-checked with several other validation methods to ensure that the answers provided are reasonable.  Test scores that are not reasonable were discarded by Dr. Coolidge and not included in this dataset.</p>
<h3>2.  Application</h3>
<p>We chose to write the implementation of the clustering algorithms in Java because of the ease of use of the language.  Java also presented superior capabilities in working with and parsing data from files.  Using Java allowed us to more efficiently model the problem through the use of OO concepts, such as polymorphism and inheritance.  Lastly, several different Java libraries were available, such as Java-ML [1], that increased the ability to analyze the clusters after the algorithms had been run.</p>
<h3>2.1  Hierarchical Cluster Construction</h3>
<p>In agglomerative single link clustering, clusters are constructed by comparing each point (or cluster) with every other point (or cluster).  Each object is placed in a separate cluster, and at each iteration of the algorithm, we merge the closest pair of points (or clusters), until certain termination conditions are satisfied.</p>
<p>This algorithm requires defining the idea of a single link, or the proximity of two clusters.  For single link, the proximity of two clusters is defined as the minimum distance between any point in one cluster and any point in the other (a single point is treated as a cluster of size one).  The two points (or clusters) joined by the shortest such link, or edge, are merged together.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.52.47-PM.png" rel="lightbox[150]"><img class="aligncenter size-full wp-image-214" title="Minimum Distance" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.52.47-PM.png" alt="" width="294" height="38" /></a></p>
<p>This is often called “the nearest neighbor clustering technique.” [4]  Relating this algorithm to graph theory, this clustering technique constructs a minimum spanning tree by finding connected components, so the algorithm is quite similar to Kruskal’s or Prim’s algorithm.</p>
<p><em> MSTSingleLink (Elements, AdjacencyMatrix)<br />
Create a set of clusters C = {t1, t2,&#8230;,tn} from Elements, one cluster per element<br />
Create a partial distance matrix showing the distance between all clusters in C.<br />
k = n, where n is the number of clusters<br />
d = 0<br />
Repeat<br />
Ci, Cj = Closest clusters in AdjacencyMatrix<br />
d = dis(Ci, Cj) // update dendrogram with distance threshold<br />
dis({Ci ∪ Cj}, C) // recalculate the distance from the new cluster to all other points in C<br />
C = C &#8211; {Ci} &#8211; {Cj} ∪ {Ci ∪ Cj} // merge the two closest clusters or points<br />
dis(Ci, Cj) = 0<br />
k = k &#8211; 1 // one fewer cluster after each merge<br />
Until k = 1</em></p>
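<p>The pseudocode above can be sketched in runnable form as follows.  This is a deliberately naive Python illustration, not the Java implementation used in this project: it recomputes cluster distances on every pass, favoring clarity over speed.  The function names and the optional <em>max_distance</em> cut-off are assumptions made for the example.</p>

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_link_distance(points, c1, c2, dist):
    """Smallest distance between any member of c1 and any member of c2."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)


def single_link(points, max_distance=float("inf"), dist=euclidean):
    """Merge the two closest clusters until one cluster remains or the
    closest pair is farther apart than max_distance."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (distance, merged members): the dendrogram levels
    while len(clusters) > 1:
        # Find the closest pair of clusters (a is always less than b).
        d, a, b = min(
            (single_link_distance(points, clusters[a], clusters[b], dist), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > max_distance:
            break
        merged = clusters[a] + clusters[b]
        # Delete the higher index first so the lower index stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
        merges.append((d, sorted(merged)))
    return clusters, merges


# Made-up 2-D points, not rows from the inmate dataset.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_link(points, max_distance=2.0)
print(sorted(sorted(c) for c in clusters))
```

<p>Recording the merge distances in <em>merges</em> is what makes a threshold-based stopping rule possible: a large jump between consecutive distances is visible in that list.</p>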
<p>Typically, the termination criterion for this algorithm is that all elements are grouped together in one cluster.  A better termination criterion would be to record the distances at which the merges of individual objects and clusters take place and to watch for a large jump in this distance (with “large” defined by the user).  Such a jump gives the user an indication that the two objects or clusters should not be merged because they are highly dissimilar.  Note that the running time for <em>MSTSingleLink</em> is <em>O(n^2)</em>, which makes it impractical for large datasets.  For further information on single link MST, see [2], [4].</p>
<h3>2.2  Partitional Cluster Construction</h3>
<p>Where a hierarchical clustering algorithm creates clusters in multiple steps, a partitional algorithm, such as <em>k</em>-means, creates the clusters in one step. [4]  Also, in partitional clustering, the number of clusters to create must be known a priori and used as input to the algorithm.</p>
<p>In <em>k</em>-means, elements are moved among clusters until some termination criterion, such as convergence, is reached.  One possible test for convergence is to check whether any element changed clusters during the last iteration.  Using <em>k</em>-means allows one to achieve a “high degree of similarity among elements, while a high degree of dissimilarity among elements in different clusters.” [4]</p>
<p><em> KMeansCluster (Elements, k)<br />
Create a set of items {t1, t2,&#8230;,tn} from Elements<br />
Assign initial values for means, m1, m2,&#8230;,mk<br />
Repeat<br />
Assign each item ti to the cluster which has the closest mean<br />
Calculate new means for each cluster<br />
Until convergence criteria met</em></p>
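<p>As a rough illustration of the loop above, here is a small Python sketch.  It is not the Java implementation used in this project.  It assumes Euclidean distance, picks the initial means by sampling <em>k</em> of the input points, and treats “no element changed clusters” as the convergence test; all names are illustrative.</p>

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # initial means: k distinct input points
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, means[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: no point changed cluster
            break
        assignment = new_assignment
        # Recompute each mean; keep the old mean if a cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = mean(members)
    return assignment, means
```

<p>Because the starting means are random, different seeds can converge to different local optima on less cleanly separated data.</p>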
<p>Note that the running time for <em>KMeansCluster</em> is <em>O(tkn)</em>, where <em>t</em> is the number of iterations, <em>k</em> is the number of clusters and <em>n</em> is the number of elements.  While <em>k</em>-means does not suffer from the chaining problem, it does have other problems.  <em>k</em>-means does not handle outliers well, does not work with categorical data and produces only convex clusters. [4]  Also, while <em>k</em>-means produces good results, it does not scale well and is not time-efficient. [4]  While this particular dataset is not large, <em>k</em>-means could have problems clustering millions of objects.  Lastly, it is possible for <em>k</em>-means to find a local optimum and miss the global optimum.  For further information on <em>k</em>-means clustering, see [2], [4].</p>
<h3>2.3  Distance Criterion</h3>
<p>In both algorithms, cluster formation relies on some notion of distance.  Using a distance metric, we can determine how “similar” two elements are.  The choice of distance metric influences the shape of the clusters.  While there are many distance measures, such as Mahalanobis, Hamming, city-block and Minkowski [2], [6], our implementation used two: Euclidean distance and Pearson’s correlation coefficient.</p>
<h3>2.3.1  Euclidean Distance</h3>
<p>Euclidean distance is the ordinary distance between two points that one would measure with a ruler.  It is a simple distance metric and by far the most commonly used, although one has to make sure all attributes have the same scale [2].  It is defined by the following equation:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.52.32-PM.png" rel="lightbox[150]"><img class="aligncenter size-full wp-image-213" title="Euclidean Distance Equation" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.52.32-PM.png" alt="" width="377" height="57" /></a></p>
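<p>In code, the Euclidean distance is a sum of squared attribute differences followed by a square root.  The Python sketch below uses two made-up profiles of three T scores each; they are not rows from the dataset, and the function name is illustrative.  Since every attribute here is a T score, all attributes already share the same scale.</p>

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences, attribute by attribute."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical profiles of three T scores each.
inmate_a = (50.0, 62.0, 47.0)
inmate_b = (53.0, 58.0, 47.0)
print(euclidean_distance(inmate_a, inmate_b))  # differences of 3, 4 and 0 give 5.0
```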
<h3>2.3.2  Pearson’s Correlation Coefficient</h3>
<p>Pearson’s correlation coefficient is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1 inclusive.  A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases.  A value of −1 implies that all data points lie on a line for which Y decreases as X increases.  A value of 0 implies that there is no linear correlation between the variables. [3]  Although it depends on the data being analyzed, typically anything over 0.5 or below −0.5 indicates a large correlation.  Pearson’s correlation coefficient can be calculated using the equation below, where the barred X and Y are the means of the two variables.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.51.55-PM.png" rel="lightbox[150]"><img class="aligncenter size-full wp-image-212" title="Pearson Correlation Coefficient" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.51.55-PM.png" alt="" width="342" height="67" /></a></p>
<p>For further information on Pearson’s correlation coefficient, see [3].</p>
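<p>The coefficient can be sketched in a few lines of Python.  Clustering needs a distance rather than a similarity, and this article does not state how the coefficient was converted, so the <em>pearson_distance</em> helper below (1 minus the coefficient) is only one common convention and should be read as an assumption.</p>

```python
import math


def pearson(xs, ys):
    """Pearson's r for two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(xs, ys):
    """Assumed conversion: 0 when r = 1, rising to 2 when r = -1."""
    return 1.0 - pearson(xs, ys)


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly increasing line
print(pearson([1, 2, 3], [6, 4, 2]))  # perfectly decreasing line
```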
<h3>2.4  Testing</h3>
<p>Testing of the clustering algorithms was performed with the entire dataset of 100 examples.  For single link clustering, the algorithm was run using the Euclidean distance measure and Pearson’s correlation coefficient.  For <em>k</em>-means clustering, the algorithm was run using the Euclidean distance measure 9 times.  Each time the program was run, the number of clusters specified, <em>k</em>, was increased by one, from 2 clusters to 10 clusters.  The results are discussed in section 4.</p>
<h3>3.  Potential Problems</h3>
<p>We encountered one main problem during implementation and testing: with single link clustering, we observed the chaining effect.  The chaining effect is where “points that are not related to each other at all are merged together to form a cluster simply because they happen to be near (perhaps via a transitive relationship) points that are close to each other.” [4]  Including dissimilar objects in this way can skew the clusters.  One potential solution would be to specify a maximum distance threshold, above which points (or clusters) would not be merged.  This could also serve as part of the termination criteria.  Another solution would be to use a complete link distance criterion. [4]  The chaining effect is most obvious when the output of the clustering program is a dendrogram.</p>
<h3>4.  Evaluation</h3>
<p>To evaluate the results, we use several different criteria.  For single link clustering, we examine the intra-cluster and inter-cluster distances, which measure the homogeneity of the clusters and their separation from one another.  In addition, we evaluate the distance threshold at which the clusters were merged and the entropy within the cluster.  Lastly, we evaluate each cluster by calculating its recall, precision and F measure.</p>
<p>For k-means clustering, we evaluate the clusters using some of the same techniques (recall, precision and F measure) and also introduce the sum of squared errors (SSE) and the Bayesian information criterion (BIC).</p>
<h3>4.1 Macro Evaluation</h3>
<p>In single link clustering with Euclidean distance, the distance threshold that produced the best clusters was 46.23 (see Figure 4.1).  Cluster 1 contained inmates whose personalities resembled those of victims of sexual abuse, while cluster 2 contained inmates whose personalities resembled those of perpetrators of sexual abuse.</p>
<p>The remaining clusters, 3, 4, 5 and 6, were outliers that stayed in their own clusters; their personalities exhibited behavior of both victims and perpetrators of sexual abuse.  We noted that much higher distance thresholds were needed to merge clusters 4, 5 and 6: 49.60, 55.22 and 59.63, respectively.</p>
<table>
<tr><th>Cluster</th><th>Cluster Members</th><th>Intra-cluster Distance</th><th>Inter-cluster Distance</th></tr>
<tr><td>1</td><td>1, 65, 4, 6, 97, 45, 49, 58, 53, 19, 18, 42, 67, 55, 62, 36, 7, 32, 54, 69, 39, 38, 89, 24, 26, 72, 8, 3, 9, 11, 95, 91, 27</td><td>15.64</td><td>90.82</td></tr>
<tr><td>2</td><td>2, 10, 16, 41, 50, 47, 51, 87, 78, 64, 68, 79, 77, 44, 84, 29, 31, 34, 63, 33, 90, 80, 40, 74, 82, 37, 43, 71, 48, 93, 96, 98, 22, 73, 52, 20, 57, 59, 46, 61, 75, 85, 66, 92, 83, 94, 88, 81, 70, 100, 60, 99, 86, 56, 13, 15, 30, 21, 14, 12, 23</td><td>8.73</td><td>79.20</td></tr>
<tr><td>3</td><td>5, 35, 25</td><td>0.17</td><td>66.25</td></tr>
<tr><td>4</td><td>17</td><td>0.0</td><td>67.19</td></tr>
<tr><td>5</td><td>28</td><td>0.0</td><td>67.19</td></tr>
<tr><td>6</td><td>76</td><td>0.0</td><td>67.20</td></tr>
</table>
<p><em>Figure 4.1 &#8211; Index of clusters with intra-cluster and inter-cluster distance for Euclidean single link clustering</em></p>
<p>In Pearson’s coefficient clustering, the distance threshold that produced the best clusters was 0.035 (see Figure 4.2).</p>
<table>
<tr><th>Cluster</th><th>Cluster Members</th><th>Intra-cluster Distance</th><th>Inter-cluster Distance</th></tr>
<tr><td>1</td><td>1, 17, 14, 52, 48, 54, 27, 21, 22, 26, 77, 74, 33, 50, 57, 62, 47, 49, 79, 31, 34, 69, 67, 2, 86, 71, 100, 3, 41, 10, 29, 8, 25, 40, 83, 37, 19, 44, 46, 72, 55, 81, 88, 66, 30, 98, 38, 70, 20, 45, 36, 80, 60, 87, 13, 56, 91, 68, 51, 23, 16, 53, 89, 12, 65, 82, 94, 96, 90, 64, 58, 43</td><td>19.71</td><td>138.31</td></tr>
<tr><td>2</td><td>4, 11, 35, 84, 39, 15, 61, 63, 18, 92, 24, 76, 6, 7, 59, 95, 78</td><td>3.68</td><td>108.01</td></tr>
<tr><td>3</td><td>5, 97, 73, 93, 75</td><td>2.5</td><td>106.01</td></tr>
<tr><td>4</td><td>32, 42, 99</td><td>0.0</td><td>106.72</td></tr>
<tr><td>5</td><td>28</td><td>0.0</td><td>109.65</td></tr>
<tr><td>6</td><td>9</td><td>0.0</td><td>109.65</td></tr>
<tr><td>7</td><td>85</td><td>0.0</td><td>109.65</td></tr>
</table>
<p><em>Figure 4.2 &#8211; Index of clusters with intra-cluster and inter-cluster distance for Pearson’s single link clustering</em></p>
<p>Judged by intra-cluster distance, these results suggest that Euclidean distance may produce better clusters.  The results might also improve with a different policy for finding the best distance at which to merge clusters (for example, complete link or average link distance measures).  If a more accurate method of distance finding were implemented, we would expect to see a more consistent result set because there would be less of an effect from the chaining problem.</p>
<h3>4.2 Micro Evaluation</h3>
<p>When the individual clusters are broken down and their members are analyzed, we obtain the following results.  These recall, precision, F measure and entropy measurements assume the same clusters as in section 4.1.  In our calculations, we assign a negative value to failing to identify a sexual abuse perpetrator, because that person would be allowed to interact with the general inmate population instead of being placed in administrative segregation.  In addition, we consider only the first two clusters, predator and victim, and no mixed classes.</p>
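<table>
<tr><th>Cluster</th><th>Recall</th><th>Precision</th><th>F-measure</th></tr>
<tr><td>1</td><td>44.00%</td><td>32.35%</td><td>0.37</td></tr>
<tr><td>2</td><td>66.67%</td><td>75.75%</td><td>0.71</td></tr>
<tr><td>Overall</td><td>55.34%</td><td>54.05%</td><td>0.54</td></tr>
</table>
<p><em>Figure 4.5 &#8211; Recall, Precision, F-Measure for Euclidean Single Link Clustering</em></p>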
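<table>
<tr><th>Cluster</th><th>Recall</th><th>Precision</th><th>F-measure</th></tr>
<tr><td>1</td><td>64.00%</td><td>22.22%</td><td>0.33</td></tr>
<tr><td>2</td><td>13.33%</td><td>70.58%</td><td>0.22</td></tr>
<tr><td>Overall</td><td>38.67%</td><td>46.40%</td><td>0.28</td></tr>
</table>
<p><em>Figure 4.5 &#8211; Recall, Precision, F-Measure for Pearson’s Coefficient Single Link Clustering</em></p>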
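<p>For readers who want to check the figures, the F measure is the harmonic mean of precision and recall.  The Python sketch below recomputes the per-cluster F measures for Euclidean single link clustering from the reported percentages.  The <em>precision_recall</em> helper and its count arguments are illustrative, since the underlying true and false positive counts are not listed in this article.</p>

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from raw counts (the counts are hypothetical inputs)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for Euclidean single link clustering.
print(round(f_measure(0.3235, 0.4400), 2))  # cluster 1: 0.37
print(round(f_measure(0.7575, 0.6667), 2))  # cluster 2: 0.71
```

<p>The “Overall” rows appear to be simple averages of the two cluster rows rather than F measures computed from the averaged precision and recall.</p>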
<p>These results show that, despite the chaining effect, Euclidean distance appears to be the superior distance measure when clustering via hierarchical agglomerative methods.</p>
<p>Our next test used k-means clustering.  We ran the algorithm with k ranging from 2 to 10 clusters and measured the quality of the clusters using the Bayesian information criterion (a criterion for model selection among a class of parametric models) and the sum of squared errors, as defined below:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.53.21-PM.png" rel="lightbox[150]"><img class="aligncenter size-full wp-image-215" title="BIC Equation" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.53.21-PM.png" alt="" width="365" height="35" /></a></p>
<p>where x is the observed data, n is the number of data points in x, k is the number of free parameters to be estimated, p(x|k) is the likelihood of the observed data given the number of parameters and L is the maximized value of the likelihood function.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.53.28-PM.png" rel="lightbox[150]"><img class="aligncenter size-full wp-image-216" title="Log Liklihood" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/06/Screen-shot-2010-06-17-at-8.53.28-PM.png" alt="" width="125" height="63" /></a></p>
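<p>Both measures can be sketched in Python.  The sum of squared errors below is the total squared Euclidean distance from each point to its cluster centroid, and the BIC helper assumes the standard form, −2 ln L + k ln n, with the log-likelihood supplied by the caller.  How the likelihood was obtained for the k-means clusters is not described in this article, so that input is left open; all names are illustrative.</p>

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sum_of_squared_errors(clusters):
    """Total squared Euclidean distance from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in members)
    return total


def bic(log_likelihood, num_params, num_points):
    """Assumed standard form: -2 ln L + k ln n, with ln L passed in."""
    return -2.0 * log_likelihood + num_params * math.log(num_points)


# Two made-up clusters of 2-D points, not taken from the dataset.
example = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
print(sum_of_squared_errors(example))  # each of the first two points is 1 from (1, 0): 2.0
```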
<table>
<tr><th>k</th><th>Cluster Members</th><th>BIC Score</th><th>SSE</th></tr>
<tr><td>2</td><td>Cluster 1: 2, 5, 10, 12, 14, 15, 16, 20, 21, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 41, 43, 44, 46, 47, 50, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 92, 94, 98, 99, 100<br />
Cluster 2: 1, 3, 4, 6, 7, 8, 9, 11, 13, 17, 18, 19, 24, 26, 27, 28, 30, 32, 36, 38, 39, 42, 45, 48, 49, 52, 53, 54, 55, 58, 62, 64, 65, 67, 68, 69, 72, 74, 76, 82, 89, 91, 93, 95, 96, 97</td><td>144,666.03</td><td>223,893.45</td></tr>
<tr><td>3</td><td>Cluster 1: 1, 3, 4, 6, 7, 9, 11, 13, 18, 21, 24, 30, 32, 36, 38, 41, 42, 44, 45, 47, 48, 49, 50, 52, 53, 54, 55, 58, 62, 64, 65, 67, 68, 69, 74, 78, 82, 93, 95, 96, 97, 98<br />
Cluster 2: 2, 5, 10, 12, 14, 15, 16, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 92, 94, 99, 100<br />
Cluster 3: 8, 17, 19, 26, 27, 28, 39, 72, 76, 89, 91</td><td>142,191.31</td><td>190,017.27</td></tr>
<tr><td>4</td><td>Cluster 1: 8, 17, 26, 28, 39, 72, 76, 89<br />
Cluster 2: 3, 4, 6, 19, 27, 36, 42, 45, 49, 55, 62, 67, 91, 95, 97<br />
Cluster 3: 2, 5, 10, 12, 14, 15, 16, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100<br />
Cluster 4: 1, 7, 9, 11, 13, 18, 21, 24, 30, 32, 38, 41, 44, 47, 48, 50, 52, 53, 54, 58, 64, 65, 68, 69, 74, 78, 79, 82, 84, 87, 93, 96, 98</td><td>143,341.18</td><td>174,843.29</td></tr>
<tr><td>5</td><td>Cluster 1: 8, 17, 24, 26, 38, 65, 72, 89<br />
Cluster 2: 2, 10, 12, 13, 15, 16, 21, 22, 23, 29, 30, 32, 33, 37, 40, 41, 44, 47, 48, 50, 51, 52, 56, 71, 73, 77, 78, 79, 84, 87, 90, 93, 96, 98<br />
Cluster 3: 5, 14, 20, 25, 31, 34, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 75, 80, 81, 83, 85, 86, 88, 92, 94, 99, 100<br />
Cluster 4: 7, 28, 36, 39, 76<br />
Cluster 5: 1, 3, 4, 6, 9, 11, 18, 19, 27, 42, 45, 49, 53, 54, 55, 58, 62, 64, 67, 68, 69, 74, 82, 91, 95, 97</td><td>141,884.73</td><td>152,909.21</td></tr>
<tr><td>6</td><td>Cluster 1: 8, 17, 24, 26, 38, 65, 72, 89<br />
Cluster 2: 2, 5, 10, 12, 14, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100<br />
Cluster 3: 1, 7, 13, 15, 16, 21, 30, 32, 41, 44, 47, 48, 50, 52, 53, 54, 58, 64, 68, 69, 73, 74, 78, 79, 82, 84, 87, 93, 96, 98<br />
Cluster 4: 4, 6, 18, 19, 36, 39, 42, 45, 49, 62, 67, 91, 97<br />
Cluster 5: 28, 76<br />
Cluster 6: 3, 9, 11, 27, 55, 95</td><td>141,230.52</td><td>144,776.69</td></tr>
<tr><td>7</td><td>Cluster 1: 2, 10, 12, 13, 15, 16, 21, 22, 23, 29, 30, 31, 33, 37, 40, 41, 44, 47, 48, 50, 51, 56, 73, 77, 78, 79, 84, 87, 90, 93, 94, 98<br />
Cluster 2: 5, 14, 20, 25, 34, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 71, 75, 80, 81, 83, 85, 86, 88, 92, 99, 100<br />
Cluster 3: 4, 6, 9, 11, 18, 19, 42, 45, 49, 53, 55, 58, 62, 64, 67, 68, 74, 82, 95, 96, 97<br />
Cluster 4: 28, 76<br />
Cluster 5: 3, 27, 91<br />
Cluster 6: 1, 7, 32, 36, 39, 52, 54, 69<br />
Cluster 7: 8, 17, 24, 26, 38, 65, 72, 89</td><td>140,546.79</td><td>135,787.72</td></tr>
<tr><td>8</td><td>Cluster 1: 3, 27, 95<br />
Cluster 2: 1, 7, 24, 30, 32, 38, 53, 54, 58, 64, 65, 68, 69, 74, 82<br />
Cluster 3: 39<br />
Cluster 4: 9, 11, 18, 42, 62, 67<br />
Cluster 5: 4, 6, 19, 36, 45, 49, 55, 91, 97<br />
Cluster 6: 5, 10, 12, 14, 20, 22, 23, 25, 29, 31, 34, 35, 37, 40, 43, 46, 56, 57, 59, 60, 61, 63, 66, 70, 71, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100<br />
Cluster 7: 8, 17, 26, 28, 72, 76, 89<br />
Cluster 8: 2, 13, 15, 16, 21, 33, 41, 44, 47, 48, 50, 51, 52, 73, 78, 79, 84, 87, 93, 96, 98</td><td>140,510.19</td><td>138,410.06</td></tr>
<tr><td>9</td><td>Cluster 1: 3, 9, 11, 74, 82, 95<br />
Cluster 2: 15, 29, 30, 31, 33, 34, 40, 41, 44, 47, 50, 73, 78, 79, 84, 87, 90<br />
Cluster 3: 8, 17, 24, 38, 65, 72, 89<br />
Cluster 4: 26, 28, 39, 76<br />
Cluster 5: 5, 14, 20, 25, 35, 46, 57, 59, 60, 61, 63, 66, 70, 75, 80, 81, 83, 85, 86, 88, 92, 100<br />
Cluster 6: 4, 6, 18, 19, 27, 36, 42, 45, 49, 55, 62, 67, 91, 97<br />
Cluster 7: 2, 10, 16, 21, 22, 37, 43, 51, 71, 77, 94, 98, 99<br />
Cluster 8: 1, 7, 13, 32, 48, 52, 53, 54, 58, 64, 68, 69, 93, 96<br />
Cluster 9: 12, 23, 56</td><td>140,889.59</td><td>126,846.88</td></tr>
<tr><td>10</td><td>Cluster 1: 28, 76<br />
Cluster 2: 91<br />
Cluster 3: 30, 32, 53, 54, 58, 64, 68, 69, 74, 82<br />
Cluster 4: 5, 14, 20, 25, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 71, 75, 80, 81, 83, 85, 86, 88, 92, 99, 100<br />
Cluster 5: 27<br />
Cluster 6: 9<br />
Cluster 7: 3, 4, 6, 11, 18, 19, 42, 45, 49, 55, 62, 67, 95, 97<br />
Cluster 8: 1, 7, 13, 21, 36, 39, 48, 52, 93, 96<br />
Cluster 9: 2, 10, 12, 15, 16, 22, 23, 29, 31, 33, 34, 37, 40, 41, 44, 47, 50, 51, 56, 73, 77, 78, 79, 84, 87, 90, 94, 98<br />
Cluster 10: 8, 17, 24, 26, 38, 65, 72, 89</td><td>140,789.73</td><td>117,432.09</td></tr>
</table>
<p><em>Figure 4.6 &#8211; Cluster Members, BIC Score and SSE Score for Euclidean k-means Clustering</em></p>
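<table>
<tr><th>Cluster</th><th>Recall</th><th>Precision</th><th>F-measure</th></tr>
<tr><td>1</td><td>48.00%</td><td>22.22%</td><td>0.30</td></tr>
<tr><td>2</td><td>17.30%</td><td>28.20%</td><td>0.21</td></tr>
<tr><td>Overall</td><td>32.65%</td><td>25.21%</td><td>0.26</td></tr>
</table>
<p><em>Figure 4.7 &#8211; Recall, Precision, F-Measure for Euclidean k-means Clustering</em></p>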
<p>In these results, we can see that, as the number of clusters increases, the sum of squared errors generally decreases (the one exception is k = 8, where it is slightly higher than at k = 7).  This is because we are including fewer dissimilar items in each of the clusters, so they more accurately represent the true nature of that cluster.  We would also expect the precision, recall and F measure to increase as the number of clusters increases.  However, it would become harder to interpret the actual “class” of the clusters as k increases, because we were instructed to disregard the class of each instance in the dataset.  Additionally, we might achieve better results if we used decision trees to identify the most influential personality test markers and then used a subset of those markers for clustering.</p>
<p>Based on all results, while the clusters are not highly accurate, prison officials could still gain useful insight into which attributes are most important in identifying who might be ongoing offenders.</p>
<h3>5.  Conclusion</h3>
<p>In this research project, different methods of constructing clusters were explored.  Additionally, different distance measures were implemented and then analyzed to see how they affected the accuracy of the clusters created.</p>
<p>While the results of this project show a maximum accuracy of only 67%, clustering is still a valid machine learning technique.  With a more advanced algorithm and a larger dataset, clustering may be able to identify predators and victims at a much better rate.</p>
<h4>References</h4>
<p>[1] Abeel, T., Van de Peer, Y., Saeys, Y.  Java-ML: A Machine Learning Library.  Journal of Machine Learning Research.  Vol. 10, pp. 931-934.  2009.</p>
<p>[2] Alpaydin, E.  Introduction to Machine Learning, Second Edition.  The MIT Press, Cambridge, MA.  2010.</p>
<p>[3] Coolidge, F.  Statistics: A Gentle Introduction, 2nd edition.  SAGE Publications, Inc.  2006.</p>
<p>[4] Dunham, M.  Data Mining: Introductory and Advanced Topics.  Prentice-Hall.  2002.</p>
<p>[5] Saha, S., Bandyopadhyay, S.  Performance Evaluation of Some Symmetry-Based Cluster Validity Indexes.  IEEE Transactions on Systems, Man and Cybernetics &#8211; Part C: Applications and Review.  Vol. 39, No. 4.  July 2009.</p>
<p>[6] Jain, A.K., Murty, M.N., Flynn, P.J.  Data clustering: a review.  ACM Computing Surveys.  Vol. 31, No. 3.  Sept. 1999.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/06/k-means-single-link-clustering/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A Practical Survey of Search Engine Optimization Techniques</title>
		<link>http://afewguyscoding.com/2010/05/practical-survey-search-engine-optimization-techniques/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=practical-survey-search-engine-optimization-techniques</link>
		<comments>http://afewguyscoding.com/2010/05/practical-survey-search-engine-optimization-techniques/#comments</comments>
		<pubDate>Fri, 21 May 2010 05:05:21 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[a few guys coding]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[ipod]]></category>
		<category><![CDATA[keyword discovery]]></category>
		<category><![CDATA[optimize]]></category>
		<category><![CDATA[organic]]></category>
		<category><![CDATA[search engine optimization]]></category>
		<category><![CDATA[search indexing]]></category>
		<category><![CDATA[search marketing]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[uccs]]></category>
		<category><![CDATA[web conversion cycle]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=180</guid>
		<description><![CDATA[This research article was co-authored by David Stites (dstites[at]uccs.edu) and Jonathan Snyder (jsynder4[at]uccs.edu) Abstract—A common problem for many web sites is appearing as a high ranking in search engines.  Many searchers make quick decisions about which result they will choose and unless a web site appears within a certain threshold of the top rankings, the [...]]]></description>
				<content:encoded><![CDATA[<p>This research article was co-authored by David Stites (dstites[at]uccs.edu) and Jonathan Snyder (jsynder4[at]uccs.edu)</p>
<p><strong><em>Abstract</em>—A common problem for many web sites is ranking highly in search engines.  Many searchers decide quickly which result they will choose, and unless a web site appears within a certain threshold of the top rankings, it may receive little traffic from search engines.  Search engine optimization, while widely regarded as a difficult art, provides a simple framework for improving the volume and/or quality of traffic to a web site that uses its techniques.  This paper surveys how search engines broadly work, presents SEO guidelines and describes a practical case study using the University of Colorado at Colorado Springs’ Engineering and Science department home page.</strong></p>
<p><strong><em>Index Terms</em>—search engine optimization, search indexing, keyword discovery, web conversion cycle, optimized search, organic search, search marketing</strong></p>
<h3>1.  Introduction</h3>
<p>Search engines are the main portal for finding information on the web.  Millions of users search the internet each day, looking to purchase products or find information.  Users generally look at only the first few results; if a website is not on the first couple of pages, it will rarely be visited.  All these factors have given rise to many techniques for raising a website’s search engine ranking.</p>
<p>Search engine optimization (SEO) is the practice of optimizing a web site so that the website will rank high for particular queries on a search engine.  Effective SEO involves understanding how search engines operate, making goals and measuring progress, and a fair amount of trial and error.</p>
<p>The following sections present a review of search engine technology, a step-by-step guide to SEO practices, lessons learned during the promotion of a website, and recommendations for the Engineering and Applied Science College of the University of Colorado at Colorado Springs.</p>
<p>In this survey, we spend our research efforts determining how SEO affects organic search results and do not consider paid inclusion.  For more information on paid inclusion SEO, see [1].</p>
<h3>2.  A Review of Search Engine Technology</h3>
<p>Search engine optimization techniques can be organized into two categories: white hat and black hat.  White hat practices seek to raise search engine rankings by providing good and useful content.  Black hat techniques, on the other hand, seek to trick the search engine into thinking that a website should be ranked high when in fact the page provides very little useful content.  These non-useful pages are called web spam because, from the user’s perspective, they are undesired web pages.</p>
<p>The goal of the search engine is to provide useful search results to a user.  The growth of web search has been an arms race where the search engine develops techniques to demote spam pages, and then websites again try to manipulate their pages to the top of popular queries.  Search engine optimization is a careful balance of working to convince the search engine that a page is relevant and worthwhile for certain searches while making sure that the page is not marked as web spam.</p>
<p>The accomplishment of this objective requires understanding of how a search engine ranks pages, and how a search engine marks pages as spam.  Although the exact algorithms of commercial search engines are unknown, many papers have been published giving the general ideas that search engines may employ.  Additionally, empirical evidence of a search engine’s rankings can give clues to its methods.  Search engine technology involves many systems such as crawling, indexing, authority evaluation, and query time ranking.</p>
<h3>2.1  Crawling</h3>
<p>Most of the work of the search engine is done long before the user ever enters a query.  Search engines crawl the web continually, downloading content from millions of websites.  These crawlers start with seed pages and follow all the links contained on the pages they download.  For a URL to ever appear in a search result page, it must first be crawled.  While crawling, search engine crawlers identify themselves in HTTP traffic with a particular user-agent string, which websites can log.</p>
<p>In the eyes of search engine crawlers, not all websites are created equal; some sites are crawled more often than others.  Crawlers apply sophisticated algorithms to prioritize which sites to visit.  These algorithms can determine how often a webpage changes and how much of it is changing. [7]  Additionally, depending on how important a website is, the crawler may crawl some sites more deeply.  One goal of a search engine optimizer is to have web crawlers crawl the website often and crawl all of the pages on the site.</p>
<p>Search crawlers look for special files in the root of the domain.  First they look for a “robots.txt” file, which tells the crawler which pages not to crawl.  This can be used, for example, to keep the crawler away from admin pages that only certain people can access.  The robots.txt file may contain a reference to an XML site map, an XML document that lists every page that is part of the website.  A site map is helpful for dynamically generated websites whose pages may not all be reachable through links.  Wikipedia is an example: it has a great search interface, but crawlers do not know how to use the search or what to search for.  Wikipedia’s site map lists all the pages it contains, which lets the crawler know for sure that it has crawled every page.  When optimizing a website, it is important to make the site easy to crawl completely.  Crawlers are wary of getting caught in infinite loops of pages or downloading content that will never be used in search result pages. [5]</p>
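<p>To make the robots.txt mechanics concrete, the sketch below uses Python’s standard <em>urllib.robotparser</em> module to read a small robots.txt file the way a well-behaved crawler would.  The file contents, the example.com URLs and the “ExampleBot” user-agent are made up for illustration, and <em>site_maps()</em> requires Python 3.8 or later.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from a real domain).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Admin pages are off limits to every crawler; ordinary pages are not.
print(parser.can_fetch("ExampleBot", "http://www.example.com/admin/users"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/courses.html"))

# The Sitemap line tells the crawler where the full page listing lives.
print(parser.site_maps())
```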
<h3>2.2  Indexing</h3>
<p>After the search engine has crawled a particular page, it analyzes the page.  In its simplest form, the search engine throws away all the HTML tags and looks only at the text.  More sophisticated algorithms look more closely at the structure of the document to determine which sections relate to each other, which sections contain navigational content and which sections have large text, and weight the sections accordingly. [3]  Once the text is extracted, information retrieval techniques are used to create an index, a listing of every keyword on every webpage.  This index can be thought of as the index in a book: a book index tells on which pages a subject is mentioned, while a search index tells which web pages contain particular words.  The index also contains a score for how many times the word appears, normalized by the length of the document.  When creating the index, the words are first converted into their root form.  For example, the word “tables” is converted to “table”, “running” is converted to “run”, and so forth.</p>
<p>In order for a page to appear in the search engine results page, the page must contain the words that were searched for.  When promoting a website, particular queries should be targeted, and their words should appear on the page as often as is reasonable.  However, this kind of promotion is abused.  Indeed, some sites are just pages and pages of keywords with advertisements, designed to get a high search engine ranking.  Other tactics include putting large lists of keywords on the page but making the keywords visible only to the search engine.  In fact, some search engines parse the HTML and CSS of the page to find words that are being hidden from the user.  These kinds of tactics can easily flag a site as spam.  Therefore, one should make sure that the targeted keywords are not used in excess and that text is not intentionally hidden.</p>
<p>One problem that many pages have is using images to display text.  Search engines do not bother trying to read the text on images.  This can be damaging to a site, especially when the logo is an image.  A large portion of searches on the internet are navigational searches, in which people are looking for a particular page.  For these kinds of searches, it is hard to rank high in the search engine result page when the company brand name is not on the page except in an image.  One alternative is to provide text in the alt attribute of the image tag.  This may not be the best option, however, because alt text is not normally visible to the user and is therefore prone to keyword abuse and, hence, suspicious to the search engine.</p>
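<p>The indexing step can be illustrated with a toy inverted index in Python.  This sketch assumes the HTML tags have already been stripped.  Its stemmer is a deliberately crude stand-in that only removes a trailing “s” (it would not turn “running” into “run”), and the page names and text are made up.</p>

```python
from collections import Counter, defaultdict


def stem(word):
    """Toy stemmer: strips a plural 's' only. Real engines use far richer rules."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(pages):
    """Map each stemmed term to {url: occurrences / page length in words}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        terms = [stem(w) for w in text.lower().split()]
        for term, count in Counter(terms).items():
            index[term][url] = count / len(terms)
    return index


# Hypothetical pages, already stripped of HTML tags.
pages = {
    "a.html": "wooden tables and chairs",
    "b.html": "table of contents",
}
index = build_index(pages)
print(sorted(index["table"]))  # both pages contain the stem "table"
```

<p>Looking up a query term in such an index returns every page that contains it, together with a length-normalized frequency that a ranking step can use.</p>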
<h3>2.3  Authority Evaluation</h3>
<p>One of the major factors that sets Google apart from other search engines is its early use of PageRank as an additional factor when ranking search results. [3]  Google found that because the web is not a homogeneous collection of useful documents, many searches would yield spam pages, or pages that were not useful.  In an effort to combat this, Google extracted the link structure of all the documents.  The links between pages are viewed as a sort of endorsement from one page to another.  The idea is that if Page A links to Page B, then the quality of Page B is as high or higher than that of Page A.</p>
<p>The mathematics behind PageRank models a random web surfer.  The surfer starts on any page on the internet.  With probability 85%, the surfer follows a link on the current page.  With probability 15%, the surfer jumps to a random page on the internet.  The PageRank of a given page is the long-run probability that the surfer is visiting that page.</p>
<p>One of the problems with this model is that any page on the internet will have some PageRank, even if it is never linked to.  This baseline PageRank can be used to artificially boost the PageRank of another page.  Creating thousands of pages that all link to a target page can generate a high PageRank for the target page.  These pages are called link farms.</p>
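<p>The random surfer model can be sketched in a few lines of Python.  The following is a minimal power-iteration sketch, not Google&#8217;s implementation; the function name, the toy link graph and the handling of pages without outgoing links are all assumptions made for illustration.</p>

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping is the probability of following a link (85% in the text).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the 15% "random jump".
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The surfer follows one of the outgoing links at random.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: from a page with no links, the surfer jumps anywhere.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D is never linked to.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

<p>In this toy graph, page D still ends up with the baseline share of 0.15 / 4 even though nothing links to it, because the random jump can land on any page.</p>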
<p>Many techniques have been proposed to combat link farms.  In the “Trust-rank” algorithm [4], a small number of pages are reviewed manually to determine which pages are good.  These pages are used as the pages that the web surfer will go to 15% of the time.  This effectively eliminates the ability to create link farms because a random page on the internet must have at least one page with trust-rank linking to it before it has any trust-rank.   Another technique is to use actual user data to determine which sites are spam.  By tracking which sites a user visits and how long the visits are, a search engine can determine which are the most useful pages for a given query. [6]</p>
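<p>The effect of restricting the random jump to reviewed pages can be illustrated with a small sketch.  This is a simplified reading of the idea described above, not the algorithm from [4]; the function name, the page names and the choice of seed page are hypothetical.</p>

```python
def trust_rank(links, trusted, damping=0.85, iterations=50):
    """PageRank-style iteration where the random jump only lands on trusted pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    # The jump distribution is uniform over the manually reviewed seed pages.
    seed = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in pages}
    rank = dict(seed)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * seed[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links sends the surfer back to the seeds.
                for p in pages:
                    new_rank[p] += damping * rank[page] * seed[p]
        rank = new_rank
    return rank


# Hypothetical web: two legitimate pages and a small link farm.
web = {
    "good_home": ["good_article"],
    "good_article": ["good_home"],
    "farm_1": ["spam_target"],
    "farm_2": ["spam_target"],
    "farm_3": ["spam_target"],
    "spam_target": ["farm_1"],
}
scores = trust_rank(web, trusted={"good_home"})
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

<p>Because no trusted page links into the farm, every farm page and the spam target keep a score of zero no matter how many farm pages are added.</p>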
<h3>2.4  Result Ranking</h3>
<p>With traditional information retrieval techniques, the results are ranked according to how often the query terms appear on a page.  Additionally, the query terms are weighted according to how often they occur in any document.  For example, the word “the” is so common in the English language that it is weighted very low.  On the other hand the word “infrequent” occurs much less often, so it would be given a higher weight.<br />
In the early days of the World Wide Web, search engines simply ranked pages according to the words on the page, taking into account the page’s length and the relative frequency of the words.  This algorithm is relatively simple to fool.  One can optimize landing pages to contain high frequencies of the targeted query terms.  To combat this, the relevance score and the authority score are combined to determine the final ranking.  In order for a document to appear on the search engine results page it must match the keywords in the query, but the final ranking is largely determined by the authority score.</p>
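<p>The term weighting described above is commonly implemented as TF-IDF (term frequency times inverse document frequency).  The sketch below assumes that particular formula, whitespace tokenization and three made-up documents; real search engines use more elaborate variants.</p>

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document for a query using a basic TF-IDF sum."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if doc_freq == 0:
                continue
            tf = counts[term] / len(words)       # how often the term appears here
            idf = math.log(n_docs / doc_freq)    # rare terms get a higher weight
            score += tf * idf
        scores[name] = score
    return scores


# Hypothetical documents: "the" appears everywhere, "infrequent" only once.
docs = {
    "a": "the cat sat on the mat",
    "b": "the infrequent visitor left the building",
    "c": "the dog chased the ball",
}
scores = tf_idf_scores("the infrequent", docs)
print(scores)
```

<p>Since “the” occurs in every document, its inverse document frequency is log(3/3) = 0 and it contributes nothing; only the document containing “infrequent” receives a non-zero score.</p>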
<h3>3. Guidelines to SEO</h3>
<p>Equipped with the knowledge of how search engines index and rank web sites, one has to tailor content in a way that gives the best opportunity for being ranked highly.  There are many guidelines to SEO, but we attempt to distill the most important ones into our survey here.  For a more complete picture of SEO best practices and information, see [1].</p>
<h3>3.1  Refine The Search Campaign</h3>
<p>The first thing that the reader must consider is why they are going to do SEO on their web site.  In a broad sense, the purpose is to increase the volume or quality of traffic to one’s web site, but one needs to break this down into smaller subgoals.  We will look at defining the target audience to determine why people use search in the first place, and then we will examine how one can leverage the searcher’s intent to design a web site that will best serve the audience.</p>
<h3>3.2  Define The Target Audience</h3>
<p>Typically, searchers aren’t really sure of what they are looking for.  After all, that is why they are performing a search in the first place.  When designing a web site, the webmaster must be cognizant of the searcher’s intent.</p>
<p>Generally, a searcher’s intent can be broken down into 3 different categories.  Knowing which category a particular searcher might fall into is important for how one designs a web site and which keywords one chooses.</p>
<ul>
<li><em>Navigational </em>searchers want to find a specific web site, such as JPMorgan Chase, and might use keywords such as “jp morgan chase investments web site.”  Typically, <em>navigational</em> searchers are looking for very specific information “because they have visited in the past, someone has told them about it because they have heard of a company and they just assume the site exists.  Unlike other types of searchers, there is only one right answer.” [1]  It is important to note that typically <em>navigational</em> searchers just want to get to the home page of the web site &#8211; not deep information.  With navigational searches, it is possible to bring up multiple results with similar names (<em>e.g.</em> A Few Guys Painting and A Few Guys Coding).</li>
<li><em>Informational</em> searchers want information on a particular subject to answer a question they have or to learn more about a particular subject.  A typical query for an <em>informational</em> searcher might be “how do I write iPhone applications.”  Unlike <em>navigational</em> searchers, <em>informational</em> searchers typically want deep information on a particular subject &#8211; they just don’t know where it exists.  Typically, “information queries don’t have a single right answer.” [1]  By far, this type of search dominates the types of searches that people perform, and therefore the key to having a high ranking in informational searches is to choose the right keywords for your web site.</li>
<li><em>Transactional</em> searchers want to do something, whether it is to purchase an item, sign up for a newsletter, <em>etc.</em> A sample transactional search query might be “colorado rockies tickets.”  <em>Transactional</em> queries are the hardest type of query to optimize for because “the queries are often related to specific products.” [1]  The fact that there are many retailers or companies that provide the same services only complicates matters.  In addition, it can be hard to decipher whether the searcher wants information or a transaction (<em>e.g.</em> “Canon EOS XSi”).</li>
</ul>
<p>After understanding the target searcher, one has to consider how the searcher will consume the information they find.  According to Hunt <em>et al.</em>, “nearly all users look at the first two or three organic search results, less time looking at results ranked below #3 and <em>far</em> less time scanning results that rank seventh and below.  Eighty-three percent report scrolling down only when they don’t find the result they want within the first three results.”</p>
<p>So what does this mean for people doing SEO?  Choosing the correct keywords and descriptions to go on your site is probably the most important action one can take.  Searchers choose a search result fairly quickly, and they do so by evaluating 4 different pieces of information, included with every result, which include: URL, title, the snippet, and “other” factors.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/factors.png" rel="lightbox[180]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/factors.png" alt="" title="Factors in SEO" width="283" height="284" class="aligncenter size-full wp-image-187" /></a><br />
<em>Graph 3.1 &#8211; Identifying the percentages of the 4 influencing factors of a search result. [1]</em></p>
<h3>3.3  Identify Web Site Goals</h3>
<p>Now that we have identified the types of searchers that are looking for information, let’s consider why one wants to attempt SEO in the first place.  For that, one needs to identify some goals, including deciding on the purpose of the web site.  Generally, there are six types of goals:</p>
<ul>
<li><em>Web sales</em>.  This goal is selling goods or services online.  Note that this could be a purely online model (as in the case of Amazon.com or Buy.com) or it can be a mix between online and brick and mortar stores.  <em>Web sales </em>can be broken down further by specifying 1) whether the store is a retailer that sells many different products from different manufacturers or is a manufacturer&#8217;s site and 2) what type of delivery is available (instant or traditional shipping).  In all cases, the ultimate desire is to increase the number of sales online.  This type of site would benefit from optimizing for <em>transactional </em>searchers.</li>
<li><em>Offline sales. </em>This goal involves converting web visitors into sales at brick and mortar stores. While this type of goal would benefit from <em>transactional </em>optimization, the benefits from SEO are much harder to calculate because there isn’t as good a measure of sales success.  The most important concept that a webmaster of an <em>offline sales</em> site can remember is to always emphasize the “call-to-action”, which is the thing that you are trying to get someone to do, in this case, convert from web to physical sale.</li>
<li><em>Leads.</em> Leads are conceptually the same as offline sales, however leads are defined by when the customer switches to the offline channel.  Customers who do some research online and then switch to offline are <em>leads</em> while customers who know model numbers, prices, <em>etc.</em> are <em>offline sales</em>.  With leads, the search optimization and marketing strategy needs to be different because customers who are leads are typically <em>informational</em> searchers instead of <em>transactional</em> searchers as in the case of <em>web</em> and <em>offline</em> <em>sales</em>.  Therefore, people who want to optimize for leads need to attract customers who are still deciding on what they want.</li>
<li><em>Market awareness</em>. If a goal for one’s web site is market awareness, this is an instance where paid placement could help boost your page rank more quickly than organic results, simply because the product or service isn’t well known yet.  With <em>market awareness</em>, the web site mainly exists to raise awareness, so you would want to optimize the site for <em>navigational </em>and <em>informational </em>searchers.</li>
<li><em>Information &amp; entertainment</em>. These sites exist solely to disseminate information or provide entertainment.  They typically don’t sell anything, but might raise revenue through ad placement or premium content, such as ESPN Insider.  Sites that focus on information and entertainment should focus almost exclusively on optimizing their site for <em>informational</em> searchers.</li>
<li><em>Persuasion</em>.  Persuasion websites are typically designed to influence public opinion or provide help to people.  They are usually not designed to make money.  In order to reach the most people, sites like these should be designed for an <em>informational </em>searcher.</li>
</ul>
<h3>3.4  Define SEO Success</h3>
<p>A crucial element of undertaking SEO is to determine what should be used to measure the “success” of the efforts performed.  Depending on what type of goal the webmaster or marketing department has defined for the website (<em>web or offline sales, leads, market awareness, information and entertainment </em>and <em>persuasion</em>), there are different ways to measure success.  For example, the natural way for a site that does <em>web</em> or <em>offline sales</em> to determine success is to count the conversions (<em>i.e.</em> the ratio of “buyers” to “lookers”).  If a baseline conversion ratio can be established prior to SEO, you can measure the difference between the old rate and the new rate after SEO.</p>
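<p>The comparison of conversion rates before and after SEO is simple arithmetic.  The sketch below uses invented visitor and buyer counts purely for illustration; the function names are hypothetical as well.</p>

```python
def conversion_rate(buyers, visitors):
    """Fraction of visitors ("lookers") who became buyers."""
    if visitors == 0:
        return 0.0
    return buyers / visitors


def relative_change(before, after):
    """Relative change of a rate, e.g. 0.5 means a 50% improvement."""
    return (after - before) / before


# Hypothetical numbers, not measurements from any real site.
baseline = conversion_rate(buyers=30, visitors=2000)    # before SEO
after_seo = conversion_rate(buyers=54, visitors=2400)   # after SEO
lift = relative_change(baseline, after_seo)

print(f"baseline: {baseline:.2%}, after SEO: {after_seo:.2%}, lift: {lift:.0%}")
```

<p>Comparing rates rather than raw sales counts keeps the measurement meaningful even when overall traffic changes between the two periods.</p>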
<p>This logic can be applied to any of the goals presented.  For example, with market awareness, you could measure conversion by sending out surveys to consumers or perhaps including some sort of promotion or “call-to-action” that would be unique to the SEO campaign.  If a consumer were to use that particular promotion or perform that action, it would give an indication of a conversion.</p>
<p>Another way to determine success is to analyze the traffic to the web site.  There are a couple of different metrics that can be used for this, including “page views and visits and visitors.” [1]  Using the most basic calculation with page views, you could determine a per-hour, per-day, per-week, per-month or even per-year average count of visitors to your website.  Using some tracking elements (such as cookies or JavaScript), you could determine the rise (or fall) of visitors to the site before and after SEO was performed.  The amount of information that this simple metric would provide would be invaluable because you could also determine peak visiting hours, which pages had the most hits at a specific time, visitor loyalty (unique vs. returning visitors), visitor demographics (location, browser and operating system, <em>etc.</em>) and even the time spent per visit.</p>
<h3>3.5  Decide on Keywords</h3>
<p>Keywords are the most important element of SEO.  It is important that during SEO, one focuses on the keywords that searchers are most likely to use while searching.  When choosing keywords, it is important to consider several different factors such as keyword variations, density, search volume and keyword “competition.”</p>
<p>Keyword variations play an important role in choosing keywords because one has to consider how an average searcher may try to find information.  For example, certain keywords such as “review” or “compare” might be used interchangeably in a search query.  In addition, some keywords are brand names that people automatically associate with products, such as “Kleenex” for tissues and “iPod” for MP3 players.  If a site covers more variations on a certain sequence of keywords, it has a higher likelihood of being found.</p>
<p>Search volume is also a big factor in determining which keywords to choose because one wants to choose keywords that people are actually searching for.  If a particular keyword only has 3,000 queries a month, it is much better to use a keyword that has 20,000 queries a month because 1) the searchers are associating the higher volume keyword with whatever the subject is instead of the lower volume keyword and 2) using the higher volume keyword will reach a larger audience.  However, this is not to say that low-volume keywords have no value.  In fact, the opposite is true.  Mega-volume keywords, such as brand names, are often used by companies competing to have the highest ranking.  If one were able to achieve the same or nearly the same results with lower contention keywords, the SEO process would be much easier because fewer companies target them, making it easier to achieve a high ranking with them.</p>
<p>Lastly, keyword density is an important factor.  When deciding where to rank a page, search engines consider a keyword density of anywhere from 3% to 12% of the page to be good, with 7% being optimal.  If a search spider detects a higher percentage than this, it might consider the page spam, simply because one is trying to stuff as many high volume keywords into the page as possible without any relevant context or content for the user.  Typically, this results in a lower ranking or no ranking, and sometimes the page is even removed from the index or blocked.</p>
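<p>Keyword density can be estimated with a short script.  The sketch below assumes that density means occurrences of a single keyword divided by the total word count, and it uses the 3%-12% band quoted above only as an illustrative threshold; the sample text and function names are made up.</p>

```python
import re


def keyword_density(text, keyword):
    """Share of the words in text that are exactly the given keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return hits / len(words)


def in_suggested_band(density, low=0.03, high=0.12):
    """True when the density falls inside the band quoted in the text."""
    return low <= density <= high


# Hypothetical page copy with 17 words, 3 of which are "iPhone".
page_text = (
    "We build iPhone applications. Our iPhone developers ship "
    "applications for clients who need an iPhone app quickly."
)
density = keyword_density(page_text, "iPhone")
print(f"density: {density:.1%}, within band: {in_suggested_band(density)}")
```

<p>For this short sample the density is 3/17, or roughly 17.6%, which falls well above the band and would be a candidate for rewording.</p>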
<h3>3.5.1  Create Landing Pages for Keyword Combinations</h3>
<p>The last step in assessing a site under consideration for SEO is to identify the pages on the site that specific keyword queries should lead to when a searcher enters them.  For example, P&amp;G might want “best laundry detergent” to lead to their Tide home page.  Landing pages are the pages that these queries lead to and are “designed to reinforce the searchers intent.” [1]  Each of the keywords or phrases identified in section 3.5 must lead to a landing page and those pages must be indexed.  If there isn’t a landing page already for some of those keywords, then one must be created.</p>
<h3>3.6  Page Elements That Matter And Don’t Matter</h3>
<p>Now that the target audience, goals, keywords and landing pages are identified, it is crucial to consider the design of the web page.</p>
<ul>
<li><em>Eliminate popup windows</em>.  The content in popup windows is not indexed by spiders.  If important content, navigation or links are contained inside the popup window, then they will not be seen by the spider.  The content needs to be moved outside the popup window.</li>
<li><em>Pulldown navigation</em>.  Pulldown navigation suffers from the same problem as popup windows.  Since spiders cannot interact with these elements (mouse over or click on them), they cannot index the content, which creates a large problem if the navigation is done with pulldown menus.  Either the pulldown menu must be done in a compatible way or the site has to allow for some alternative means of navigation.</li>
<li><em>Simplifying dynamic URLs.</em> Pages that use dynamic content and URLs must be simplified for the spiders to crawl.  The nature of a dynamic URL means that a spider could spend an infinite amount of time attempting to crawl all the possible URLs, and that would produce a lot of duplicate content.  To deal with this, spiders will only crawl dynamic URLs if the URL has fewer than 2 dynamic parameters, is less than 1,000 characters long, does not contain a session identifier and every valid URL is linked from another page.</li>
<li><em>Validating HTML</em>.  Robots are very sensitive to correctly formed documents.  It is important that the HTML is valid for the spider to get a good indication of what the page is truly about.</li>
<li><em>Reduce dependencies</em>.  Some technologies, such as Flash, make it impossible for the spider to index the content inside them.  The spider cannot view this particular content, so any important keywords or information inside it are lost.  The content should be moved outside to allow indexing to take place.</li>
<li><em>Slim down page content</em>.  A spider’s time is valuable and typically spiders don’t crawl all the pages of a bloated web site.  Google and Yahoo! spiders stop at about 100,000 characters. [1]  The typical cause of HTML page bloat is embedded content such as styling and JavaScript.  A simple way to solve this is to link to an external Cascading Style Sheet (CSS) so that you are able to make use of reusable styles.  A way to reduce JavaScript bloat is to use a program that minifies or obfuscates long files by replacing long variable names such as “<em>longVariableName</em>” with much shorter versions such as “<em>A</em>.”</li>
<li><em>Use redirects</em>.  From time to time, pages within a web site move.  It is important to use the correct type of redirect within your site so that when spiders attempt to visit the old URL, they are redirected to the new URL.  If they find that the server returns a 404 (Not Found) for the old URL, the spider might remove that particular page from the index.  Instead, the proper way to indicate that the page has moved permanently is to use a <em>server-side</em> <em>redirect</em>, also known as a “301 redirect.”  This is returned to the spider when it attempts to navigate to the old URL and it is then able to update the index with the new URL.  A sample implementation might look like the following:</li>
</ul>
<p>Redirect 301 /oldDirectory/oldName.html <a href="http://www.domain.com/newDirectory/newName.html">http://www.domain.com/newDirectory/newName.html</a></p>
<p>Note that spiders cannot follow JavaScript or Meta refresh redirects. [1]  Additionally, one can use a “302 redirect” for temporarily moved URLs.  See Fielding <em>et al.</em> for more information on the “302 redirect.”</p>
<ul>
<li><em>Create site maps. </em>Site maps are important for larger sites because “it not only allows spiders to access your site’s pages but they also serve as very powerful clues to the search engine as to the thematic content of your site.” [1]  The anchor text used for the links could provide some very good keywords to the spider.</li>
<li><em>Titles and snippets</em>.  Together, the title and the snippet that the search spider extracts account for a large part of how it indexes a particular page.  The title, the most important clue to a spider on the particular subject of a page, is the most easily fixed element.  The title is a great place to use keywords that were previously decided on.  For example, the title element for StubHub, a ticket brokerage site, is “Tickets at StubHub! Where Fans Buy and Sell Tickets.” Additionally, the snippet, or summary that the spider comes up with to describe the result, is important as well.  Typically, the spider uses the first block of text that it runs across as the snippet.  For example, for WebMD, Googlebot uses the following snippet: “The leading source for trustworthy and timely health and medical news and information. Providing credible health information, supportive community, &#8230;”  In both of these examples, it is clear that having essential keywords present in both the title and the snippet is highly correlated with having a high page rank.</li>
<li><em>Formatting heading elements</em>.  Using traditional HTML subsection formatting elements, such as &lt;h1&gt;, &lt;h2&gt;, &lt;h3&gt;, <em>etc</em>. to denote important information can help give context clues to spiders on what text is important on a particular page.</li>
</ul>
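<p>The dynamic URL criteria listed above can be turned into a quick pre-flight check.  The sketch below is an assumption-laden illustration: the list of session parameter names is a guess at common conventions, the function name is hypothetical, and the requirement that every URL be linked from another page cannot be checked from the URL alone.</p>

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed names of common session parameters; real sites may use others.
SESSION_KEYS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def looks_crawlable(url, max_params=1, max_length=999):
    """Apply the URL-only criteria from the text to a single URL."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if len(url) > max_length:
        return False  # must be less than 1,000 characters long
    if len(params) > max_params:
        return False  # must have fewer than 2 dynamic parameters
    if any(key.lower() in SESSION_KEYS for key, _ in params):
        return False  # must not carry a session identifier
    return True


# Hypothetical URLs on a placeholder domain.
for candidate in (
    "http://www.domain.com/products?id=42",
    "http://www.domain.com/products?id=42&color=red",
    "http://www.domain.com/cart?PHPSESSID=abc123",
):
    print(candidate, looks_crawlable(candidate))
```

<p>A check like this could be run over a site&#8217;s internal links to find URLs worth rewriting into simpler, static-looking forms.</p>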
<h3>3.6.1  The Importance of Links</h3>
<p>Links, both internal and external, play a big role in SEO and in page ranking.  Search engines place a certain value on links because they can use these links to judge the value of the information.  Similar to how scientific papers have relied on citations to validate and confer status upon an author’s work, links confer status and value on a particular site.  “Inbound links act as a surrogate for the quality and “trustworthiness” of the content, which spiders cannot discern from merely looking at the words on the page.”  [1]</p>
<p>Several factors can influence how an algorithm ranks the link popularity of a particular page, including link quality, link quantity, anchor text and link relevancy.</p>
<p>Using the 4 link popularity factors from above, search engines use a theory of <em>hub</em> and <em>authority</em> pages to create link value.  <em>Hub</em> pages are web pages that link to other pages on a similar subject.  <em>Authority </em>pages are pages that are linked to by many other pages on a particular subject.  Search engines usually assign a high rank to authority pages because they are most closely related to a searcher’s keywords. Using this model, it is easy to see why the harder an inbound link is to get, the more valuable it might be.  For more information on inbound and outbound link importance and their value, see Hunt <em>et al.</em></p>
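<p>The hub and authority idea is usually associated with the HITS algorithm.  The text does not say which algorithm any particular search engine uses, so the following is only a sketch of HITS on an invented link graph, with hypothetical page names.</p>

```python
import math


def hits(links, iterations=20):
    """Compute hub and authority scores by repeated mutual reinforcement."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: three linking pages and three guides on one subject.
web = {
    "directory": ["guide_a", "guide_b", "guide_c"],
    "blogroll": ["guide_a", "guide_b"],
    "fan_page": ["guide_a"],
}
hub, auth = hits(web)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

<p>Here the page that links to the most guides emerges as the strongest hub, and the guide linked from every hub emerges as the strongest authority.</p>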
<h3>3.7  Getting Robots To Crawl The Site</h3>
<p>In addition to all of the goals and HTML elements above, it is crucial to have a “robots.txt” file that will allow the robot to crawl the site.  This file can give instructions to web robots using the <em>Robots Exclusion Protocol</em>.  Before any indexing occurs at a site, the robot will check the top-level directory for the “robots.txt” file and will index the site accordingly.  A sample “robots.txt” file might look like the following.</p>
<p>User-agent: *<br />
Disallow:</p>
<p>These instructions apply to all robots, or “User-agents”, and tell them to index the whole site because nothing is listed under the “Disallow” directive.  It is possible to tailor this file to individual robots and to specific server content.  While the protocol is completely advisory, using it is highly recommended to improve the quality of what the robots index.  Note that a robot <em>does not</em> have to obey the “robots.txt” file; however, most non-malicious robots do obey the instructions.</p>
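<p>One way to check how a well-behaved robot would interpret a “robots.txt” file is Python&#8217;s standard <em>urllib.robotparser</em> module.  The sketch below parses the rules from strings rather than fetching them; the robot name “ExampleBot”, the “/private/” path and the domain are placeholders.</p>

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The sample file from the text: an empty Disallow permits everything.
allow_all = build_parser("User-agent: *\nDisallow:\n")

# A hypothetical variant that keeps robots out of one directory.
block_private = build_parser("User-agent: *\nDisallow: /private/\n")

print(allow_all.can_fetch("ExampleBot", "http://www.domain.com/index.html"))
print(block_private.can_fetch("ExampleBot", "http://www.domain.com/index.html"))
print(block_private.can_fetch("ExampleBot", "http://www.domain.com/private/notes.html"))
```

<p>The first two checks report that the URL may be fetched, while the third reports that it may not, which matches how a compliant spider would treat the “Disallow” line.</p>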
<h3>3.8  Dispelling SEO Myths</h3>
<ul>
<li><em>SEO is a one time activity</em>.  SEO is a continual activity that must be revisited from time to time to ensure that the most up-to-date keywords and content are being used.</li>
<li><em>SEO is a quick fix process for site traffic</em>.  Generating high quality organic traffic that will help with conversion is a slow process.  For each change that one makes to the web page, spiders must re-index that page and then calculate the new rankings.  It could take several months to years depending on whether the goal is achieving a higher ranking, passing a competitor in the rankings or achieving the top results spot.</li>
<li><em>META tags help with page ranking.</em>  In the early days of search engines, keyword spammers abused META tags by packing in as many highly searched keywords as possible, and many spiders now give META tags little to no credence.</li>
</ul>
<h3>4.  Promotion Of AFewGuysCoding.com</h3>
<p>In order to test the authors’ hypothesis, we chose to perform our SEO experiment with AFewGuysCoding.com.  A Few Guys Coding, LLC is a small, author-owned company that provides contract engineering services mainly for mobile platforms (iPhone, iPod Touch, iPad, Android) but also does web and desktop applications.   This web site has never had SEO performed on it and was not designed with any such considerations in mind.</p>
<h3>4.1  Initial Investigation For AFewGuysCoding.com</h3>
<p>In order to determine what keywords we should focus on, we created a survey and asked the following question to people who are not engineers: “Suppose you were a manager of a business that had a great idea for a mobile phone application. You knew that you had to hire an iPhone, iPod or iPad developer because you didn&#8217;t directly know anyone who could do this work for you. What words, terms or phrases might you consider searching for in Google to find this person?”  <em>Table 4.1</em> represents the range of responses that were provided.  In analyzing this data, the words provided were stemmed to account for different endings.  In addition, stop words, or words that are ignored because they don’t provide any significance to the query, were also filtered out and not considered.</p>
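<p>The stemming, stop word filtering and counting described above can be sketched as follows.  The suffix-stripping stemmer, the stop word list and the sample responses are all simplified assumptions for illustration and are not the procedure or data actually used for the survey.</p>

```python
import re
from collections import Counter

# A tiny assumed stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "for", "i", "in", "my", "of", "the", "to"}


def naive_stem(word):
    """Strip a few common English endings (a crude stand-in for a real stemmer)."""
    for suffix in ("ers", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_keywords(responses):
    """Count stemmed, non-stop-word terms across all survey responses."""
    counts = Counter()
    for response in responses:
        for word in re.findall(r"[a-z0-9]+", response.lower()):
            if word not in STOP_WORDS:
                counts[naive_stem(word)] += 1
    return counts


# Hypothetical survey answers.
responses = [
    "iPhone developer",
    "hire iPhone developers",
    "application developer for the iPhone",
    "developing iPhone applications",
]
counts = count_keywords(responses)
total = sum(counts.values())
for term, count in counts.most_common():
    print(f"{term}: {count} ({count / total:.2%})")
```

<p>With stemming, “developer”, “developers” and “developing” are all counted under a single stem, which is what allows variants of the same word to be grouped into one row of a keyword table.</p>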
<p>Looking at the results and given the question posed to the survey audience, the percentages of the top three results did not surprise the authors.  What did surprise the authors was the relatively high number of results for the keywords “technician”, “help”, “inventor/invention”, “creator” and “market/marketing” and the relatively low number of results for “mobile”, “phone”, “software” and “programmer.”</p>
<p>After careful consideration, the authors chose to incorporate some of the keywords suggested by the survey audience into the web page.</p>
<table cellspacing="0" cellpadding="0" border="1">
<tbody>
<tr>
<td valign="top"><strong>Keyword</strong></td>
<td valign="top"><strong>Count</strong></td>
<td valign="top"><strong>Percentage</strong></td>
</tr>
<tr>
<td valign="top">Developer</td>
<td valign="top">81</td>
<td valign="top">19.33%</td>
</tr>
<tr>
<td valign="top">Application</td>
<td valign="top">72</td>
<td valign="top">17.18%</td>
</tr>
<tr>
<td valign="top">iPhone</td>
<td valign="top">51</td>
<td valign="top">12.17%</td>
</tr>
<tr>
<td valign="top">Apple</td>
<td valign="top">24</td>
<td valign="top">5.73%</td>
</tr>
<tr>
<td valign="top">Phone</td>
<td valign="top">21</td>
<td valign="top">5.01%</td>
</tr>
<tr>
<td valign="top">iPod</td>
<td valign="top">17</td>
<td valign="top">4.06%</td>
</tr>
<tr>
<td valign="top">Mobile</td>
<td valign="top">17</td>
<td valign="top">4.06%</td>
</tr>
<tr>
<td valign="top">iPad</td>
<td valign="top">17</td>
<td valign="top">4.06%</td>
</tr>
<tr>
<td valign="top">Programmer</td>
<td valign="top">17</td>
<td valign="top">4.06%</td>
</tr>
<tr>
<td valign="top">Technician</td>
<td valign="top">11</td>
<td valign="top">2.63%</td>
</tr>
<tr>
<td valign="top">Help</td>
<td valign="top">10</td>
<td valign="top">2.39%</td>
</tr>
<tr>
<td valign="top">Mac/Macintosh</td>
<td valign="top">9</td>
<td valign="top">2.15%</td>
</tr>
<tr>
<td valign="top">Technology</td>
<td valign="top">6</td>
<td valign="top">1.43%</td>
</tr>
<tr>
<td valign="top">Market/Marketing</td>
<td valign="top">6</td>
<td valign="top">1.43%</td>
</tr>
<tr>
<td valign="top">Software</td>
<td valign="top">7</td>
<td valign="top">1.67%</td>
</tr>
<tr>
<td valign="top">Inventor/Invention</td>
<td valign="top">5</td>
<td valign="top">1.19%</td>
</tr>
<tr>
<td valign="top">Creator</td>
<td valign="top">5</td>
<td valign="top">1.19%</td>
</tr>
<tr>
<td valign="top">Business</td>
<td valign="top">4</td>
<td valign="top">0.95%</td>
</tr>
<tr>
<td valign="top">iTunes</td>
<td valign="top">3</td>
<td valign="top">0.72%</td>
</tr>
<tr>
<td valign="top">Designer/Designing</td>
<td valign="top">3</td>
<td valign="top">0.72%</td>
</tr>
<tr>
<td valign="top">Hire</td>
<td valign="top">3</td>
<td valign="top">0.72%</td>
</tr>
<tr>
<td valign="top">Handset</td>
<td valign="top">3</td>
<td valign="top">0.72%</td>
</tr>
<tr>
<td valign="top">Top</td>
<td valign="top">3</td>
<td valign="top">0.72%</td>
</tr>
<tr>
<td valign="top">Computer</td>
<td valign="top">2</td>
<td valign="top">0.48%</td>
</tr>
<tr>
<td valign="top">Company</td>
<td valign="top">2</td>
<td valign="top">0.48%</td>
</tr>
<tr>
<td valign="top">Contractor</td>
<td valign="top">3</td>
<td valign="top">0.72%</td>
</tr>
<tr>
<td valign="top">AT&amp;T</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Store</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">3G</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">OS</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Code</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">File</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Resource</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Sale</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Science</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Graphics</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Analyst</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Creative</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Devoted</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Energetic</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Smart</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Engineer</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Objective-C</td>
<td valign="top">1</td>
<td valign="top">0.24%</td>
</tr>
<tr>
<td valign="top">Total</td>
<td valign="top">419</td>
<td valign="top">100.00%</td>
</tr>
</tbody>
</table>
<p><em>Table 4.1 &#8211; Responses from a search audience regarding possible keywords for AFewGuysCoding.com</em></p>
<h3>4.2  Baseline Rankings</h3>
<p>Prior to performing SEO on AFewGuysCoding.com, the Google page rank was 1/10.  The website was indexed by major search engines, such as Google, Yahoo!, Bing, AOL and Ask, but was not crawled on a regular basis.</p>
<p>Before performing any SEO activities, the amount of traffic that was coming from search engines was a little over 7% overall.  Most traffic was coming from direct referrals, for example when the user entered the address from a business card they had gotten (see <em>graph 5.1</em>).</p>
<p>In addition, the titles and snippets of the page were not as good as they could be (<em>i.e.</em> they didn’t include keywords or readily extractable content for the snippet).  For example, the title of the home page for AFewGuysCoding.com was “Welcome to A Few Guys Coding, LLC.”</p>
<p>Moreover, the web site was a transition from an older web site, <a href="http://www.davidstites.com">www.davidstites.com</a>, so many of the links already in Google referred to old pages that no longer existed.  When those pages were clicked on in the Google results, it took the user to a “404 Unavailable” page.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/traffic-before.png" rel="lightbox[180]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/traffic-before.png" alt="" title="Source of traffic before SEO" width="322" height="321" class="aligncenter size-full wp-image-190" /></a><br />
<em>Graph 5.1 &#8211; Sources of traffic for AFewGuysCoding.com before SEO activities</em></p>
<p>In addition to the work done with keyword densities and titles, the authors also started a blog and a Twitter account (<a href="http://blog.afewguyscoding.com">http://blog.afewguyscoding.com</a>, <a href="http://www.twitter.com/afewguyscoding">http://www.twitter.com/afewguyscoding</a>) that addressed computer science and programming topics.  Whenever a blog post referenced a topic that A Few Guys Coding dealt with, such as iPhone applications, we linked to the relevant part of the main site.  A summary feed of this blog was placed on the main page of the AFewGuysCoding.com web site, and the authors noticed an increase in crawler traffic after the spider determined the content was changing frequently enough to warrant additional visits to the site to re-calculate PageRank.</p>
<h3>4.3  Rankings After SEO</h3>
<p>After an initial pass of SEO was performed on the web site, the traffic from search engines increased dramatically and the PageRank increased 1 point to 2/10.  The authors believe that this increase was largely from changing the titles of the web pages themselves, changing keyword densities and tying keywords to landing pages.  According to Google Analytics, over a two-month period, traffic for AFewGuysCoding.com was up 81.24%.</p>
<p>The keyword densities are not quite as high as 3%-11%; however, this is a marked improvement from before, when the important keyword densities were all 1% and below.  See <em>Table 5.2</em>.</p>
<table cellspacing="0" cellpadding="0"  border="1">
<tbody>
<tr>
<td valign="top"><strong>Keyword</strong></td>
<td valign="top"><strong>Count</strong></td>
<td valign="top"><strong>Density</strong></td>
<td valign="top"><strong>Page</strong></td>
</tr>
<tr>
<td valign="top">iPhone</td>
<td valign="top">10</td>
<td valign="top">2.55%</td>
<td valign="top">/services</td>
</tr>
<tr>
<td valign="top">iPad</td>
<td valign="top">6</td>
<td valign="top">1.53%</td>
<td valign="top">/services</td>
</tr>
<tr>
<td valign="top">iPod</td>
<td valign="top">6</td>
<td valign="top">1.53%</td>
<td valign="top">/services</td>
</tr>
<tr>
<td valign="top">Code</td>
<td valign="top">10</td>
<td valign="top">2.55%</td>
<td valign="top">/services</td>
</tr>
<tr>
<td valign="top">Software</td>
<td valign="top">2</td>
<td valign="top">1.32%</td>
<td valign="top">/</td>
</tr>
<tr>
<td valign="top">Application</td>
<td valign="top">10</td>
<td valign="top">2.55%</td>
<td valign="top">/services</td>
</tr>
<tr>
<td valign="top">Develop/Developer</td>
<td valign="top">4</td>
<td valign="top">1.42%</td>
<td valign="top">/services</td>
</tr>
</tbody>
</table>
<p><em>Table 5.2 &#8211; Keyword densities for certain pages on AFewGuysCoding.com</em></p>
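<p>The densities in the table appear consistent with a simple ratio: occurrences of the keyword divided by the total number of words on the page (for example, 10 occurrences at 2.55% implies a page of roughly 392 words).  The following is a minimal sketch of that calculation, not the tool used in the study; the class name, method name and sample text are hypothetical, and it only handles single-word keywords.</p>

```java
import java.util.Locale;

// Hypothetical helper: density = keyword occurrences / total words on the page.
class KeywordDensity {

    // Returns the keyword's share of all words in pageText, as a percentage.
    static double densityPercent(String pageText, String keyword) {
        // Split on anything that is not a letter or digit; case is ignored.
        String[] words = pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        String target = keyword.toLowerCase(Locale.ROOT);
        int total = 0;
        int hits = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (word.equals(target)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }

    public static void main(String[] args) {
        // Made-up sample text, only to show the call.
        String text = "We build iPhone apps. Our iPhone team ships fast.";
        System.out.printf("%.2f%%%n", densityPercent(text, "iPhone"));
    }
}
```

<p>Running the same function over each page before and after a content change gives a quick way to confirm that an edit moved a keyword's density in the intended direction.</p>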
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/traffic-after.png" rel="lightbox[180]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/traffic-after.png" alt="" title="Sources of traffic after SEO" width="308" height="303" class="aligncenter size-full wp-image-189" /></a><em>Graph 5.2 &#8211; Sources of traffic for AFewGuysCoding.com after SEO activities (April 2010)</em></p>
<table cellspacing="0" cellpadding="0" border="1">
<tbody>
<tr>
<td valign="top"><strong>Pages</strong></td>
<td valign="top"><strong>Page Views</strong></td>
<td valign="top"><strong>% Pageviews</strong></td>
<td valign="top"><strong>Avg. Time On Page</strong></td>
</tr>
<tr>
<td valign="top">/</td>
<td valign="top">387</td>
<td valign="top">66.38%</td>
<td valign="top">1:42</td>
</tr>
<tr>
<td valign="top">/learnmore</td>
<td valign="top">59</td>
<td valign="top">10.12%</td>
<td valign="top">2:09</td>
</tr>
<tr>
<td valign="top">/services</td>
<td valign="top">41</td>
<td valign="top">7.03%</td>
<td valign="top">1:08</td>
</tr>
<tr>
<td valign="top">/portfolio</td>
<td valign="top">44</td>
<td valign="top">7.55%</td>
<td valign="top">1:19</td>
</tr>
<tr>
<td valign="top">/contact</td>
<td valign="top">35</td>
<td valign="top">6.00%</td>
<td valign="top">0:45</td>
</tr>
<tr>
<td valign="top">/services/iphone</td>
<td valign="top">8</td>
<td valign="top">1.37%</td>
<td valign="top">0:39</td>
</tr>
<tr>
<td valign="top">/getaquote</td>
<td valign="top">6</td>
<td valign="top">1.03%</td>
<td valign="top">0:15</td>
</tr>
<tr>
<td valign="top">/services/ipad</td>
<td valign="top">3</td>
<td valign="top">0.51%</td>
<td valign="top">0:27</td>
</tr>
<tr>
<td valign="top">Totals</td>
<td valign="top">583</td>
<td valign="top">100.00%</td>
<td valign="top">1:03</td>
</tr>
</tbody>
</table>
<p><em>Table 5.3 &#8211; Page View Overview for top 8 most visited pages in April, 2010</em></p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/hits-goals.png" rel="lightbox[180]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/hits-goals.png" alt="" title="Visitors vs. Goals Met Daily" width="356" height="219" class="aligncenter size-full wp-image-188" /></a><em>Graph 5.3 &#8211; Number of visitors per day for April, 2010 against Time on Site goal achievement rate</em></p>
<p>Another action taken by the authors was to use Apache <em>mod_rewrite</em> and a redirect file to direct spiders to update their indexes to the new pages (from the older site) using a “301 redirect”.  We were able to transform each URL using <em>mod_rewrite</em> to match the current domain.  This ensured that pages were not removed from the crawlers’ indexes for returning status 404.</p>
<p>Lastly, the authors set several goals for the web site (besides the increase in PageRank), including a “Time on Site” measure that would help measure SEO success.  If a particular user stayed on the site for longer than 5 minutes and/or had 10 or more page views, we considered the goal met.  See <em>Graph 5.3</em> for a comparison of visitors to goals met.</p>
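<p>To illustrate what such a redirect does, here is a small Java sketch of the mapping logic.  It is not the authors' <em>mod_rewrite</em> configuration: the class and method names are hypothetical, and it assumes that an old page keeps its path and query string and only the host changes.</p>

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: mirrors a host-rewriting permanent redirect, not the real server rules.
class LegacyRedirect {
    static final int MOVED_PERMANENTLY = 301;
    // Hosts assumed for this sketch.
    static final String OLD_HOST = "www.davidstites.com";
    static final String NEW_HOST = "afewguyscoding.com";

    // Returns the Location value to send with a 301 response,
    // or null when the request is not for the old host.
    static String redirectTarget(String requestedUrl) {
        URI old = URI.create(requestedUrl);
        if (!OLD_HOST.equalsIgnoreCase(old.getHost())) {
            return null;
        }
        try {
            // Keep scheme, path and query; swap only the host.
            URI target = new URI(old.getScheme(), NEW_HOST, old.getPath(), old.getQuery(), null);
            return target.toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build redirect target", e);
        }
    }

    public static void main(String[] args) {
        String location = redirectTarget("http://www.davidstites.com/portfolio");
        System.out.println(MOVED_PERMANENTLY + " Location: " + location);
    }
}
```

<p>Because the status is 301 rather than 302, a crawler can treat the move as permanent and replace the old URL in its index instead of keeping both.</p>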
<h3>4.4  Conclusions</h3>
<p>Clearly, making simple changes to content (such as keywords and titles) can have a large effect on search engine ranking and on the amount and quality of traffic that is directed to a site.  The difference in search engine traffic over a month represented a 400% increase. It might be reasonable to infer that an additional increase of 3-4% in keyword density might generate search engine referrals up to 50%-60%. The authors would like to see the effects of continuing SEO after 3, 6, 9 and 12 months.</p>
<h3>5.  Recommendations To UCCS</h3>
<p>Based on the research that we performed, we would like to make some suggestions to the UCCS EAS webmaster that would allow the UCCS EAS site to be ranked higher than it currently is.  Currently, the URL <a href="http://eas.uccs.edu">http://eas.uccs.edu</a>/ has a Google PageRank of 4/10. By following the suggestions below, it may be possible to raise the PageRank 1 or 2 points to 5/10 or 6/10.  We have broken down our recommendations into a four-step process: define the scope and goals, understand the decision making process of a potential EAS student, create content, and build links.</p>
<h3>5.1  Define the Scope and Goals</h3>
<p>The first step is to define the scope and goals of the Search Engine Optimization efforts.  One question to ask is: Is this effort only related to getting more EAS students, or is it to promote UCCS as a whole?  How much effort can be put into this project?  The answers to these questions will determine the scope of the project.  One potential goal is to recruit more students to the college, but a new student may have attended even without the SEO campaign.  Consequently, a plan for measuring how students became interested in the college will be important.</p>
<h3>5.2  Understanding the Decision Making Process</h3>
<p>Key to understanding how to boost traffic through SEO practices is understanding the content that target users are searching for.  This can be discovered through surveys or interviews with students who have gone through the process of choosing a college to attend.  These can be current UCCS students or students of other universities.  From my own experience, I would guess that a typical decision making process would involve finding answers to these questions: Should I go to college? What is the best school for me?  Why should I go to UCCS?  Which major should I pick? How do I apply?  Where can I find help with my application?  Each of these questions is a good area for creating good content.</p>
<h3>5.3  Create Content</h3>
<p>Once the decision making process is understood, content can be created to answer the questions that people are searching for.  Special attention should be paid to using the keywords that people would search for when they have that question.  A quick look at the current EAS site reveals that there is little content related to recruitment.  Additionally, the current pages have few keywords that are searched frequently.  Additional recommendations are listed below that highlight some of the deficiencies of the current EAS pages.</p>
<ul>
<li><em>Eliminate non-indexable content.</em> The Adobe Flash content on the main landing page, <a href="http://www.eas.uccs.edu">http://www.eas.uccs.edu</a>/, cannot be indexed by spiders and robots, so all the information contained in that Flash element is lost.  Additionally, any images that contain content are non-indexable as well.</li>
<li><em>Remove “expandable” navigation</em>.  Spiders are unable to “click” on these individual sections to expand them and are therefore unable to crawl the linked pages.  The navigation should be reworked so that all links are accessible without needing to perform any special interface actions.</li>
<li><em>Choose better titles for individual pages</em>.  Regardless of a page’s actual content or purpose, all pages under the main landing page have the title “UCCS | College of Engineering and Applied Science.”  These titles should change depending on the subject or content of the page so that spiders and robots are able to create a more accurate index of the page.</li>
<li><em>Better use of heading formatting</em>.  Headings should use valid heading tags such as &lt;h1&gt;, &lt;h2&gt;, &lt;h3&gt;, <em>etc</em>. so that the robots that crawl the site can extract main ideas and content from the page for the index.</li>
<li><em>Check and adjust keywords and keyword densities</em>.  The keywords on a page, and their densities, should reflect what the page is about.  For example, on the application page for EAS, the densities of the keywords concerning admission and applying are low.  In fact, out of the top 18 keywords on the page only 4 or 5 have anything to do with admission, and their densities range from 0.79% to 2.37%.</li>
</ul>
<table cellspacing="0" cellpadding="0" border="1">
<tbody>
<tr>
<td valign="top"><strong>Keyword</strong></td>
<td valign="top"><strong>Count</strong></td>
<td valign="top"><strong>Density</strong></td>
</tr>
<tr>
<td valign="top">Engineering</td>
<td valign="top">63</td>
<td valign="top">5.54%</td>
</tr>
<tr>
<td valign="top">Science</td>
<td valign="top">43</td>
<td valign="top">3.78%</td>
</tr>
<tr>
<td valign="top">Computer</td>
<td valign="top">39</td>
<td valign="top">3.43%</td>
</tr>
<tr>
<td valign="top">Department</td>
<td valign="top">39</td>
<td valign="top">3.43%</td>
</tr>
<tr>
<td valign="top">Admission</td>
<td valign="top">27</td>
<td valign="top">2.37%</td>
</tr>
<tr>
<td valign="top">Application</td>
<td valign="top">23</td>
<td valign="top">2.02%</td>
</tr>
<tr>
<td valign="top">Colorado</td>
<td valign="top">20</td>
<td valign="top">1.76%</td>
</tr>
<tr>
<td valign="top">Electrical</td>
<td valign="top">19</td>
<td valign="top">1.67%</td>
</tr>
<tr>
<td valign="top">UCCS</td>
<td valign="top">18</td>
<td valign="top">1.58%</td>
</tr>
<tr>
<td valign="top">Mechanical</td>
<td valign="top">18</td>
<td valign="top">1.58%</td>
</tr>
<tr>
<td valign="top">Form</td>
<td valign="top">15</td>
<td valign="top">1.32%</td>
</tr>
<tr>
<td valign="top">Aerospace</td>
<td valign="top">14</td>
<td valign="top">1.23%</td>
</tr>
<tr>
<td valign="top">Springs</td>
<td valign="top">12</td>
<td valign="top">1.05%</td>
</tr>
<tr>
<td valign="top">College</td>
<td valign="top">12</td>
<td valign="top">1.05%</td>
</tr>
<tr>
<td valign="top">Applied</td>
<td valign="top">12</td>
<td valign="top">1.05%</td>
</tr>
<tr>
<td valign="top">Student</td>
<td valign="top">11</td>
<td valign="top">0.97%</td>
</tr>
<tr>
<td valign="top">Application</td>
<td valign="top">10</td>
<td valign="top">0.88%</td>
</tr>
<tr>
<td valign="top">Financial</td>
<td valign="top">9</td>
<td valign="top">0.79%</td>
</tr>
</tbody>
</table>
<ul>
<li><em>Include a site map</em>.  A site map would help ensure that all the pages that were meant to be accessible to a web visitor are also accessible to a spider crawling the content.</li>
</ul>
<h3>5.4  Build Links</h3>
<p>Lastly, it is important to build the authority score by getting other pages to link to the content pages.  Links can be built within organizations that already have relationships with the university.  For example, the City of Colorado Springs, engineering organizations, and businesses that recruit from the EAS college could all be great sources of links.  Publishing press releases about new websites to news organizations could also be helpful in generating links.  Another form of link building is simply to link to the target pages heavily from other pages on the EAS site.</p>
<p>Another thing to consider is that people understand that content on the EAS site is going to be biased towards the EAS college.  Prospective students are not going to trust a document as much as if it were coming from a third party site.  One way to overcome this is to write content for other websites on the web.  This provides two benefits.  First, it creates a seemingly unbiased source of information that can be slanted towards recruiting for EAS.  Second, the content can link back to the EAS website, providing a good link for building the authority score of a page.</p>
<p>By following these guidelines, the EAS college can succeed in generating more traffic on its recruitment pages and therefore see more students attend the college.</p>
<h3>6.  Conclusion</h3>
<p>In this research project, we have investigated the implementation of search engines.  We have also presented different elements that affect search ranking.  Based on our research and case study with AFewGuysCoding.com, we have provided recommendations to the UCCS EAS department to improve their PageRank within Google and other major search engines.  Indeed, search engine optimization is an important technique that any web master must master so that their site can be indexed as high in the rankings as possible.</p>
<h3>References</h3>
<ol>
<li>Hunt, B., Moran, M.  <em>Search Engine Marketing, Inc.: Driving Search Traffic to Your Company’s Web Site, 2nd ed</em>. (2008). IBM Press.</li>
<li>Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., Berners-Lee, T.  <em>RFC 2616 &#8211; HTTP/1.1 Status Code Definitions</em>.  1999 [Online]. <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html">http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html</a></li>
<li>S. Brin and L. Page. <em>The Anatomy of a Large-Scale Hypertextual Web Search Engine</em>. Computer Networks and ISDN Systems. 1998. pp. 107-117.</li>
<li>Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. <em>Combating Web Spam with TrustRank</em>. Very Large Data Bases. Vol. 30, pp. 576-587. 2004.</li>
<li>O. Brandman, J. Cho, H. Garcia-Molina, N. Shivakumar. <em>Crawler-Friendly Web Servers</em>. ACM SIGMETRICS Performance Evaluation Review. Vol. 28, No. 2, pp. 9-14. 2000.</li>
<li>G. Rodriguez-Mula, H. Garcia-Molina, A. Paepcke.  <em>Collaborative value filtering on the Web</em>. Computer Networks and ISDN Systems. Vol. 30, No. 1-7, pp. 736-738. 1998.</li>
<li>J. Cho, H. Garcia-Molina.  <em>The Evolution of the Web and Implications for an Incremental Crawler</em>.  Very Large Data Bases. pp. 200-209. 2000.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/05/practical-survey-search-engine-optimization-techniques/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>K Nearest Neighbors in Java</title>
		<link>http://afewguyscoding.com/2010/05/nearest-neighbors-java/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=nearest-neighbors-java</link>
		<comments>http://afewguyscoding.com/2010/05/nearest-neighbors-java/#comments</comments>
		<pubDate>Sat, 08 May 2010 03:40:09 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[knn]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nearest neighbor]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=128</guid>
		<description><![CDATA[In a previous post, I explored how one might apply classification to solve a complex problem. This post will explore the code necessary to implement that nearest neighbor classification algorithm. If you would like a full copy of the source code, it is available here in zip format. Knn.java &#8211; This is the main driver [...]]]></description>
				<content:encoded><![CDATA[<p>In a <a href="http://blog.afewguyscoding.com/2010/04/content-recommenation-systems-for-a-mobile-platform/">previous post</a>, I explored how one might apply classification to  solve a complex problem.  This post will explore the code necessary to  implement that nearest neighbor classification algorithm.  If you would like a full copy of the  source code, it is available <a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/ml-knn.zip">here</a> in zip format.<br />
</p>
<p><strong>Knn.java</strong> &#8211; This is the main driver of the code. To do the classification, we are essentially interested in finding the distance between the particular instance we are trying to classify and the other instances.  We then determine the classification of the instance we want from a &#8220;majority vote&#8221; of the <em>k</em> closest instances.  Each feature of an instance is a separate class that essentially just stores a continuous or discrete value, depending on whether or not you are using regression to classify your neighbors.  The additional feature classes and file reader are left to the reader as an exercise.  Note that it would be fairly easy to weight features using this model if you want to give one feature more clout than another in determining the neighbors.</p>
<p>A nice visualization of the algorithm is provided by <a href="http://people.revoledu.com/kardi/tutorial/KNN/HowTo_KNN.html">Kardi Teknomo</a>. As you can see, we take the <em>k</em> closest instances and use a &#8220;majority vote&#8221; to classify the instance.  While this is an extremely simple method, it is great for noisy data and large data sets.  The two drawbacks are the running time, <em>O(n^2)</em> as implemented here, and the fact that we have to determine <em>k</em> ahead of time.  However, despite this, as shown in the previous paper, the accuracy can be quite high.</p>
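<p>As a rough sketch of the weighting idea, each feature's squared difference can be multiplied by a per-feature weight before it is summed.  The class below is hypothetical and is not part of the downloadable source; it assumes every feature has already been reduced to an ordinal value, and the weights shown are made-up numbers.</p>

```java
// Sketch of feature weighting; names, weights and encodings are assumptions for illustration.
class WeightedDistance {

    // Weighted squared distance between two instances whose features are encoded as ordinals.
    static double distance(int[] a, int[] b, double[] weights) {
        if (a.length != b.length || a.length != weights.length) {
            throw new IllegalArgumentException("feature and weight arrays must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            int diff = a[i] - b[i];
            // A larger weight lets this feature push candidate neighbors further apart.
            sum += weights[i] * diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] candidate = {1, 2, 3};
        int[] query = {1, 4, 0};
        double[] weights = {1.0, 2.0, 0.5}; // second feature counts double, third counts half
        System.out.println(distance(candidate, query, weights));
    }
}
```

<p>Setting every weight to 1.0 reduces this to a plain sum of squared differences, so weighting can be introduced without changing results until the weights are tuned.</p>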
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/HowTo_KNN_clip_image008.jpg" rel="lightbox[128]"><img class="aligncenter size-full wp-image-167" title="KNN Graph Visualization" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/05/HowTo_KNN_clip_image008.jpg" alt="" width="576" height="433" /></a></p>
<pre><code>import java.util.*;

public class Knn {
	public static final String PATH_TO_DATA_FILE = "coupious.data";
	public static final int NUM_ATTRS = 9;
	public static final int K = 262;

	public static final int CATEGORY_INDEX = 0;
	public static final int DISTANCE_INDEX = 1;
	public static final int EXPIRATION_INDEX = 2;
	public static final int HANDSET_INDEX = 3;
	public static final int OFFER_INDEX = 4;
	public static final int WSACTION_INDEX = 5;
	public static final int NUM_RUNS = 1000;
	public static double averageDistance = 0;

	public static void main(String[] args) {
		ArrayList&lt;Instance&gt; instances = null;
		ArrayList&lt;Neighbor&gt; distances = null;
		ArrayList&lt;Neighbor&gt; neighbors = null;
		WSAction.Action classification = null;
		Instance classificationInstance = null;
		FileReader reader = null;
		int numRuns = 0, truePositives = 0, falsePositives = 0, falseNegatives = 0, trueNegatives = 0;
		double precision = 0, recall = 0, fMeasure = 0;

		falsePositives = 1;

		reader = new FileReader(PATH_TO_DATA_FILE);
		instances = reader.buildInstances();

		do {
			classificationInstance = extractIndividualInstance(instances);

			distances = calculateDistances(instances, classificationInstance);
			neighbors = getNearestNeighbors(distances);
			classification = determineMajority(neighbors);

			System.out.println("Gathering " + K + " nearest neighbors to:");
			printClassificationInstance(classificationInstance);

			printNeighbors(neighbors);
			System.out.println("\nExpected situation result for instance: " + classification.toString());

			if(classification.toString().equals(((WSAction)classificationInstance.getAttributes().get(WSACTION_INDEX)).getAction().toString())) {
				truePositives++;
			}
			else {
				falseNegatives++;
			}
			numRuns++;

			instances.add(classificationInstance);
		} while(numRuns &lt; NUM_RUNS);

		precision = ((double)(truePositives / (double)(truePositives + falsePositives)));
		recall = ((double)(truePositives / (double)(truePositives + falseNegatives)));
		// F-measure (F1) is the harmonic mean of precision and recall: 2PR / (P + R)
		fMeasure = ((double)(2 * precision * recall) / (double)(precision + recall));

		System.out.println("Precision: " + precision);
		System.out.println("Recall: " + recall);
		System.out.println("F-Measure: " + fMeasure);
		System.out.println("Average distance: " + (double)(averageDistance / (double)(NUM_RUNS * K)));
	}

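	public static Instance extractIndividualInstance(ArrayList&lt;Instance&gt; instances) {
		Random generator = new Random(new Date().getTime());
		// nextInt(n) returns a value in [0, n), so pass size() to make every instance eligible
		int random = generator.nextInt(instances.size());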

		Instance singleInstance = instances.get(random);
		instances.remove(random);

		return singleInstance;
	}

	public static void printClassificationInstance(Instance classificationInstance) {
		for(Feature f : classificationInstance.getAttributes()) {
			System.out.print(f.getName() + ": ");
			if(f instanceof Category) {
				System.out.println(((Category)f).getCategory().toString());
			}
			else if(f instanceof Distance) {
				System.out.println(((Distance)f).getDistance().toString());
			}
			else if (f instanceof Expiration) {
				System.out.println(((Expiration)f).getExpiry().toString());
			}
			else if (f instanceof Handset) {
				System.out.print(((Handset)f).getOs().toString() + ", ");
				System.out.println(((Handset)f).getDevice().toString());
			}
			else if (f instanceof Offer) {
				System.out.println(((Offer)f).getOfferType().toString());
			}
			else if (f instanceof WSAction) {
				System.out.println(((WSAction)f).getAction().toString());
			}
		}
	}

	public static void printNeighbors(ArrayList&lt;Neighbor&gt; neighbors) {
		int i = 0;
		for(Neighbor neighbor : neighbors) {
			Instance instance = neighbor.getInstance();

			System.out.println("\nNeighbor " + (i + 1) + ", distance: " + neighbor.getDistance());
			i++;
			for(Feature f : instance.getAttributes()) {
				System.out.print(f.getName() + ": ");
				if(f instanceof Category) {
					System.out.println(((Category)f).getCategory().toString());
				}
				else if(f instanceof Distance) {
					System.out.println(((Distance)f).getDistance().toString());
				}
				else if (f instanceof Expiration) {
					System.out.println(((Expiration)f).getExpiry().toString());
				}
				else if (f instanceof Handset) {
					System.out.print(((Handset)f).getOs().toString() + ", ");
					System.out.println(((Handset)f).getDevice().toString());
				}
				else if (f instanceof Offer) {
					System.out.println(((Offer)f).getOfferType().toString());
				}
				else if (f instanceof WSAction) {
					System.out.println(((WSAction)f).getAction().toString());
				}
			}
		}
	}

	public static WSAction.Action determineMajority(ArrayList&lt;Neighbor&gt; neighbors) {
		int yea = 0, ney = 0;

		// count the votes of the k nearest neighbors
		for(int i = 0; i &lt; neighbors.size(); i++) {
			Neighbor neighbor = neighbors.get(i);
			Instance instance = neighbor.getInstance();
			if(instance.isRedeemed()) {
				yea++;
			}
			else {
				ney++;
			}
		}

		if(yea &gt; ney) {
			return WSAction.Action.Redeem;
		}
		else {
			return WSAction.Action.Hit;
		}
	}

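	public static ArrayList&lt;Neighbor&gt; getNearestNeighbors(ArrayList&lt;Neighbor&gt; distances) {
		ArrayList&lt;Neighbor&gt; neighbors = new ArrayList&lt;Neighbor&gt;();

		// distances is already sorted in ascending order, so the first K entries are the nearest
		for(int i = 0; i &lt; K; i++) {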
			averageDistance += distances.get(i).getDistance();
			neighbors.add(distances.get(i));
		}

		return neighbors;
	}

	public static ArrayList&lt;Neighbor&gt; calculateDistances(ArrayList&lt;Instance&gt; instances, Instance singleInstance) {
		ArrayList&lt;Neighbor&gt; distances = new ArrayList&lt;Neighbor&gt;();
		Neighbor neighbor = null;
		int distance = 0;

		for(int i = 0; i &lt; instances.size(); i++) {
			Instance instance = instances.get(i);
			distance = 0;
			neighbor = new Neighbor();

			// for each feature, go through and calculate the "distance"
			for(Feature f : instance.getAttributes()) {
				if(f instanceof Category) {
					Category.Categories cat = ((Category) f).getCategory();
					Category singleInstanceCat = (Category)singleInstance.getAttributes().get(CATEGORY_INDEX);
					distance += Math.pow((cat.ordinal() - singleInstanceCat.getCategory().ordinal()), 2);
				}
				else if(f instanceof Distance) {
					Distance.DistanceRange dist = ((Distance) f).getDistance();
					Distance singleInstanceDist = (Distance)singleInstance.getAttributes().get(DISTANCE_INDEX);
					distance += Math.pow((dist.ordinal() - singleInstanceDist.getDistance().ordinal()), 2);
				}
				else if (f instanceof Expiration) {
					Expiration.Expiry exp = ((Expiration) f).getExpiry();
					Expiration singleInstanceExp = (Expiration)singleInstance.getAttributes().get(EXPIRATION_INDEX);
					distance += Math.pow((exp.ordinal() - singleInstanceExp.getExpiry().ordinal()), 2);
				}
				else if (f instanceof Handset) {
					// there are two calculations needed here, one for device, one for OS
					Handset.Device device = ((Handset) f).getDevice();
					Handset singleInstanceDevice = (Handset)singleInstance.getAttributes().get(HANDSET_INDEX);
					distance += Math.pow((device.ordinal() - singleInstanceDevice.getDevice().ordinal()), 2);

					Handset.OS os = ((Handset) f).getOs();
					Handset singleInstanceOs = (Handset)singleInstance.getAttributes().get(HANDSET_INDEX);
					distance += Math.pow((os.ordinal() - singleInstanceOs.getOs().ordinal()), 2);
				}
				else if (f instanceof Offer) {
					Offer.OfferType offer = ((Offer) f).getOfferType();
					Offer singleInstanceOffer = (Offer)singleInstance.getAttributes().get(OFFER_INDEX);
					distance += Math.pow((offer.ordinal() - singleInstanceOffer.getOfferType().ordinal()), 2);
				}
				else if (f instanceof WSAction) {
					WSAction.Action action = ((WSAction) f).getAction();
					WSAction singleInstanceAction = (WSAction)singleInstance.getAttributes().get(WSACTION_INDEX);
					distance += Math.pow((action.ordinal() - singleInstanceAction.getAction().ordinal()), 2);
				}
				else {
					System.out.println("Unknown category in distance calculation.  Exiting for debug: " + f);
					System.exit(1);
				}
			}
			neighbor.setDistance(distance);
			neighbor.setInstance(instance);

			distances.add(neighbor);
		}

		// sort the neighbors by ascending distance (simple bubble sort)
		for (int i = 0; i &lt; distances.size(); i++) {
			for (int j = 0; j &lt; distances.size() - i - 1; j++) {
				if(distances.get(j).getDistance() &gt; distances.get(j + 1).getDistance()) {
					Neighbor tempNeighbor = distances.get(j);
					distances.set(j, distances.get(j + 1));
					distances.set(j + 1, tempNeighbor);
				}
			}
		}

		return distances;
	}

}</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/05/nearest-neighbors-java/feed/</wfw:commentRss>
		<slash:comments>29</slash:comments>
		</item>
		<item>
		<title>A Better App Rating System</title>
		<link>http://afewguyscoding.com/2010/04/app-rating-system/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=app-rating-system</link>
		<comments>http://afewguyscoding.com/2010/04/app-rating-system/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 13:12:39 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[app store]]></category>
		<category><![CDATA[rating system]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=130</guid>
		<description><![CDATA[As developers, we try our best to create the best products we can.  We really do.  Yes, I know that some features get pushed to the back burner when releases are behind schedule or some bugs get kludged and we hedge our bets on whether the user will hit them or not.  However, for the [...]]]></description>
				<content:encoded><![CDATA[<p>As developers, we try our best to <a href="http://www.afewguyscoding.com/services/" target="_blank">create the best products</a> we can.  We really do.  Yes, I know that some features get pushed to the back burner when releases are behind schedule or some bugs get kludged and we hedge our bets on whether the user will hit them or not.  However, for the most part, we do what makes the application look the best.  I mean why not &#8211; we are type A personalities after all (usually).</p>
<p>So, it really bugs (read: annoys) me when users criticize the application that they are using without providing any helpful feedback about what they don&#8217;t like or even what went wrong!  Help us help you &#8211; we can&#8217;t do anything unless you provide us some information.  We aren&#8217;t expecting you to debug the application yourself or even send us crash logs &#8211; just a general, <strong>accurate</strong> description of the problem.  However, don&#8217;t think that the responsibility is completely on the consumer.  Developers have to at least attempt to make collecting this information easy.</p>
<p>A good example is the Apple App Store.  For those of you who aren&#8217;t familiar with its rating system, you can rate an app after you download it with a 1-5 star designation.  You can also enter a comment along with your rating.  If you break down what kind of reviews people leave, there are several main categories: 1) the faithful user who expounds the virtues of the application and leaves a glowing review; 2) the impossible-to-please user who gives your app a 3 star rating for giving it the old &#8220;college try&#8221; (e.g. &#8220;This app is good but would be better if it had x, y and z feature&#8221;); 3) the user who leaves a good comment but a poor star rating, or vice versa; 4) the user who reports bugs and gives a low star rating (e.g. &#8220;I updated the app and every time I attempt to launch it, the app crashes!  please fix!!!!&#8221;); and 5) the user whose review consists of no helpful information and a low rating (e.g. &#8220;this app sucks&#8221; or &#8220;the worst application ever&#8221;).</p>
<p>To give you an idea of how broken the system is, here is a small sampling of some of the typical reviews found from the 5 categories above.  <strong>Note: These comments are taken in the original form and have not been edited</strong>.</p>
<ul>
<li>&#8220;You get what you pay for.  A BIG FAT ZERO.  WHAT A JOKE.&#8221; (1 star)</li>
<li>&#8220;This app is garbage.&#8221; (1 star)</li>
<li>&#8220;Pure crap.  Sorry devs, but this is bad.&#8221; (1 star)</li>
<li>&#8220;The best app ever.&#8221; (5 stars)</li>
<li>&#8220;A great way to save money.&#8221; (1 star)</li>
<li>&#8220;I love the idea. But every time load the app it thinks for 30 seconds and returns to the home screen.  Good luck guys!.&#8221; (3 stars)</li>
</ul>
<p>If you browse the comments for any given app, you can see they are all over the place.  Let me be clear, however &#8211; good apps deserve good ratings and bad apps deserve bad ratings.  However, you are able to <strong>rate apps without even downloading them </strong>to try them.  As proof of how ridiculous this is, the night before the iPad launched, iTunes users were able to go to the iTunes App Store and download iPad apps.  Absolutely no one had an iPad to test the applications with, yet the ratings were still pouring in.  How could end users possibly evaluate an application without even running it?</p>
<p>Another part of the system that is broken is that the only time that the iPhone prompts you to rate applications is when you delete them off your phone.  Typically, if you&#8217;re deleting something, something went horribly, horribly wrong (i.e. the application crashes all the time, won&#8217;t run, you don&#8217;t have a use for it anymore).  At this point, the user&#8217;s rating will most likely be negative, which will end up affecting the rating of your application.</p>
<p>Lastly, there is no way to disable ratings, edit ratings or reset ratings.  The only way they get &#8220;reset&#8221; is if you release a new version, but even then the previous ratings don&#8217;t disappear &#8211; the iTunes App Store only separates the ratings into &#8220;This version&#8221; and &#8220;Previous versions.&#8221;  I am not saying that you should be able to &#8220;hide&#8221; the previously poor versions, but they shouldn&#8217;t affect the overall rating because every piece of software has known and unknown bugs.  The &#8220;overall&#8221; rating is the lifetime rating of the application.  Anyone who understands averages fairly well knows that once you get some bad ratings that bring your overall rating down, it takes a LOT more ratings to improve the average.  For example, suppose we had 10,000 ratings that gave us an overall of 2.5/5 stars.  It would require<strong> another 2,500 5-star ratings</strong> to raise the app rating 1/2 star to 3/5 stars, since (10,000 &times; 2.5 + 2,500 &times; 5) / 12,500 = 3.0, and that assumes not a single lower rating arrives in the meantime.  Every additional low rating only makes the climb longer.  This is stacked against the developer and isn&#8217;t fair.</p>
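<p>To make the arithmetic concrete, here is a minimal Python sketch that solves (count &times; average + 5x) / (count + x) = target for x.  The function names are mine and purely illustrative.  With 10,000 ratings averaging 2.5 stars, it shows that 2,500 five-star ratings lift the average to exactly 3.0 (5,000 would take it to roughly 3.33), provided no lower ratings arrive in between.</p>

```python
import math


def five_star_ratings_needed(count, average, target, top=5.0):
    """Smallest number of top-score ratings that lifts the average to at least target."""
    if target >= top:
        raise ValueError("target must be below the top score")
    if average >= target:
        return 0
    # Solve (count * average + top * x) / (count + x) >= target for x.
    return math.ceil(count * (target - average) / (top - target))


def new_average(count, average, extra, score=5.0):
    """Average after adding `extra` ratings that all have the same score."""
    return (count * average + extra * score) / (count + extra)


print(five_star_ratings_needed(10000, 2.5, 3.0))  # 2500
print(round(new_average(10000, 2.5, 5000), 2))    # 3.33
```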
<p>So how can we improve the process so that both the consumer and the developer are satisfied, where the consumer gets to voice their opinion and the developer gets to fix the product?  First of all, let&#8217;s remove the functionality that allows someone to rate an app as they are removing it.  Let&#8217;s also can the ability to rate apps without even downloading them.  If we were able to prompt the user for a rating after they had used the app <em>n</em> times or after some total accumulated usage time <em>t</em> (tracked by iTunes), then we might get a far less biased opinion.  After meeting these requirements, we could then unlock the ability to rate the app, or iTunes could even prompt us for our thoughts (similar to how Amazon and eBay follow up with you after you&#8217;ve placed an order to check on the service).</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/FirefoxScreenSnapz001.png" rel="lightbox[130]"><img class="aligncenter size-full wp-image-156" title="Amazon Review System" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/FirefoxScreenSnapz001.png" alt="" width="709" height="372" /></a></p>
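<p>The gating idea (unlock ratings only after <em>n</em> launches or <em>t</em> of accumulated usage) could look something like the Python sketch below.  Everything in it is an assumption for illustration: the <code>UsageRecord</code> type, the thresholds of 5 launches and 600 seconds, and the premise that the store tracks these counters at all.</p>

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical per-user, per-app counters (assumed to be tracked by the store)."""
    launches: int = 0
    seconds_used: float = 0.0


def may_rate(usage, min_launches=5, min_seconds=600.0):
    """Return True once the user has used the app enough to give an informed rating.

    `usage` is None for someone who never downloaded the app.
    """
    if usage is None:
        return False
    return usage.launches >= min_launches or usage.seconds_used >= min_seconds


print(may_rate(None))                                      # False
print(may_rate(UsageRecord(launches=2, seconds_used=90)))  # False
print(may_rate(UsageRecord(launches=7, seconds_used=90)))  # True
```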
<p>Another idea, even though it might deter a few people from leaving ratings, good AND bad, is to have a required comment field that would allow someone to elaborate on why an app deserves a particular rating.  If the user can&#8217;t take 2 minutes to fill out a simple rating form, then they can&#8217;t be too serious about rating the application.  This could even be taken one step further by allowing Apple or the developer to create a questionnaire with different fields depending on the rating.  eBay has a very good example of this: it lists categories of behaviors that are encouraged.  In the same vein, you could also introduce categories.  Admittedly, this last addition is probably overkill because it is more complicated to implement.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/FirefoxScreenSnapz002.png" rel="lightbox[130]"><img class="aligncenter size-full  wp-image-157" title="eBay Rating System" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/FirefoxScreenSnapz002.png" alt="" width="395" height="175" /></a></p>
<p>Lastly, it might be helpful to have a rating system within a rating system where other users could click a &#8220;thumbs up&#8221; or &#8220;thumbs down&#8221; to indicate whether another user&#8217;s review was particularly helpful or completely off base.</p>
<p>While no rating system is absolutely perfect, we want to get the best information possible about a potential purchase. While a rating system should be easy for the user to express their opinion, it shouldn&#8217;t be so simple as to give them all the power or reduce the rating to a simple binary &#8220;Good&#8221; or &#8220;Bad&#8221; rating.</p>
<p>Of course, the other alternative to all this is that we, as developers, could just produce bug-free software the first time we have an app release.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/04/app-rating-system/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Backing up to Amazon S3 with Ruby s3fs</title>
		<link>http://afewguyscoding.com/2010/04/backing-amazon-s3-ruby-s3fs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=backing-amazon-s3-ruby-s3fs</link>
		<comments>http://afewguyscoding.com/2010/04/backing-amazon-s3-ruby-s3fs/#comments</comments>
		<pubDate>Tue, 20 Apr 2010 14:44:44 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[amazon s3]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=103</guid>
		<description><![CDATA[I recently came across a problem of needing to create incremental backups to a remote site for my server in the case of a failure. Since my VPS provider didn&#8217;t provide this as a service (paid or free), I had to come up with a different solution. This solution assumes that you are using Ubuntu [...]]]></description>
				<content:encoded><![CDATA[<p>I recently came across a problem of needing to create incremental backups to a remote site for <a href="http://www.afewguyscoding.com" target="_blank">my server</a> in the case of a failure.  Since my <a href="http://www.linode.com">VPS provider</a> didn&#8217;t provide this as a service (paid or free), I had to come up with a different solution.  This solution assumes that you are using Ubuntu (in my case, Karmic Koala) and that you have root access and an <a href="http://aws.amazon.com/s3">Amazon S3 account</a>.  It also assumes that you are willing to spend the money to back up to S3.  The pricing structure is <a href="http://aws.amazon.com/s3/#pricing" target="_blank">here</a>, but in my experience, my initial backup cost $3.78 and since then, my average monthly bill has been &lt; $0.25.  You can calculate your own bill with this handy Amazon S3/EC2 <a href="http://calculator.s3.amazonaws.com/calc5.html" target="_blank">calculator</a>.</p>
<p>I know that using FUSE is not the fastest method of backing up, so your mileage may vary depending on your tolerance levels and needs.  The actual download site for FuseOverAmazon is <a href="http://code.google.com/p/s3fs/wiki/FuseOverAmazon">here</a>.  Also, I am using rsync because I believe that incremental (differential) backups are far more efficient and cost/time saving than full backups every week.</p>
<p>1.  The first step is to install all the dependencies we&#8217;ll need for FUSE:</p>
<pre><code>sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev</code></pre>
<p>Next, install the most recent version of s3fs. As of now the most recent is r191, but here is a link to the <a href="http://code.google.com/p/s3fs/downloads/list">downloads</a> section so that you can check to see which version is the most up-to-date.  I chose to put my src download in <code>/usr/local/src</code>.</p>
<pre><code>wget http://s3fs.googlecode.com/files/s3fs-r191-source.tar.gz
tar -xzf s3fs*
cd s3fs
make
sudo make install
sudo mkdir -p /mnt/s3
sudo chown yourusername:yourusername /mnt/s3</code></pre>
<p>2. Scripting your backup plan:</p>
<p>You&#8217;ll need to create a bucket on the S3 cloud.  If you haven&#8217;t done this already, you can use an online tool like <a href="http://jets3t.s3.amazonaws.com/index.html" target="_blank">JetS3t</a> (my favorite).  I would recommend that you create a separate bucket for each logical site you are going to back up.  For example, I back up each one of my repositories in Unfuddle in a different bucket.  That makes restoring easier.  You might also want to consider replicating to multiple locations if you don&#8217;t trust that Amazon can keep your data safe, or even use a separate service provider like <a href="http://www.jungledisk.com" target="_blank">JungleDisk</a>, <a href="http://www.mozy.com" target="_blank">Mozy</a> or <a href="http://www.backblaze.com" target="_blank">Backblaze</a>.</p>
<p>Using gvim or TextMate (or some other text editor), we are going to automate mounting the volume, performing a sync and unmounting the volume.  The reason I unmount is for safety.  If somehow the hard disk becomes corrupted, I have a bit of time to prevent the script from running and replicating the bad data.  If the volume is constantly mounted, that may not be the case.  It is also easy to wipe out the volume if you aren&#8217;t careful.</p>
<p>The following will be the contents of your backup script, s3fs-backup.sh (or whatever you name yours):</p>
<pre><code>#!/bin/bash

/usr/bin/s3fs yourbucket -o accessKeyId=yourS3key -o secretAccessKey=yourS3secretkey /mnt/s3
/usr/bin/rsync -avz --delete /home/username/dir/you/want/to/backup /mnt/s3
/usr/bin/rsync -avz --log-file=log.file --delete --exclude /sys --exclude /mnt --exclude /proc --exclude /tmp / /mnt/s3 #exclude some directories
mail -s "backup complete with log" user@host.org &lt; log.file #email yourself the log
mv log.file log.file.`date +"%Y%m%d%H%M%S"` # move the file to a log with a datetime stamp
/bin/umount /mnt/s3</code></pre>
<p>There are some directories that I don&#8217;t want to back up, one being <code>proc</code>, because that directory is managed by the OS while the system is running.  You don&#8217;t want to restore this directory.  Also, even though rsync is smart enough to recognize cycles, we don&#8217;t want to back up our <code>/mnt/s3</code> directory.  We exclude those here.  Note the <code>--delete</code> option: it deletes from the backup any files that have been removed from the source.  Lastly, note that we can increase/decrease the verbosity of the script and email ourselves a transcript of the backup session so we know that it actually took place &#8211; not a bad way to keep tabs.  After we are finished emailing ourselves the (potentially massive) log file, we rename it to keep track of our backups on the server as well.  There are many more options with rsync, so check out the <em>man</em> pages for the command to customize your script.</p>
<pre><code>chmod 755 s3fs-backup.sh</code></pre>
<p>Before you run the entire script, you might want to use the line above to change the permissions on the script you just saved.  You can verify the integrity of the script by running each command individually, which isn&#8217;t a bad idea after editing it for your own situation because mistakes do happen.  A quick check with <code>df -h</code> after the S3 volume is mounted will show 256T available for your own personal use.</p>
<p>The most important part is automating the backup process.  If you forget and you lose your most recent data, then what was the point!?  We are going to use the good ol&#8217; fashioned *nix cron daemon to handle this process for us.  There are two options for scheduling the job.  You can either put this script (or a softlink to it) in your cron.hourly, cron.daily, cron.weekly or cron.monthly folder, or you can directly edit the crontab file to have more control over when the script runs.  I personally run mine every hour and every week on Sunday.  Here is a nice <a href="http://en.wikipedia.org/wiki/Cron" target="_blank">cron reference</a> to customize your schedule.</p>
<p><code><span style="font-family: monospace, Monaco, 'Courier New', Courier, monospace; line-height: 18px; font-size: 12px; white-space: pre;">crontab -e<br />
0 * * * * /path/to/s3fs-backup.sh # this runs it hourly, on the hour<br />
0 0 * * 0 /path/to/s3fs-backup.sh # this runs it every week on Sunday at midnight </span></code></p>
<p>A note about speed: The initial backup could take a <strong>long</strong> time.  The server up-stream speed is the limiting factor on how long this takes.  While rsync is a great program, using FUSE is not the speediest option in the world. There is another solution out there called &#8216;<a href="http://s3sync.net/wiki" target="_blank">s3sync</a>.&#8217;</p>
<p>To run the script initially and create your first backup (if you can&#8217;t wait), simply run this command: <code>sudo ./s3fs-backup.sh</code>.</p>
<p>One last nice thing is that this can be adapted to run anywhere: other servers, your home computers, etc.  If you can install FUSE and the dependencies above, you can have ultra cheap backups without a lot of hassle.</p>
<p>That&#8217;s it!</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/04/backing-amazon-s3-ruby-s3fs/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Content Recommendation Systems for a Mobile Platform</title>
		<link>http://afewguyscoding.com/2010/04/content-recommenation-systems-for-a-mobile-platform/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=content-recommenation-systems-for-a-mobile-platform</link>
		<comments>http://afewguyscoding.com/2010/04/content-recommenation-systems-for-a-mobile-platform/#comments</comments>
		<pubDate>Sat, 10 Apr 2010 07:36:36 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[blackberry]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[coupious]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[knn]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nearest neighbor]]></category>
		<category><![CDATA[supervised learning]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=75</guid>
		<description><![CDATA[Abstract—Recommendation systems take artifacts about items and provide suggestions to the user on what other products they might like. There are many different types of recommender algorithms, including nearest-neighbor, linear classifiers and SVMs. However, most recommender systems are collaborative systems that rely on users to rate the products that they bought. This paper presents an [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Abstract—Recommendation systems take artifacts about items and provide suggestions to the user on what other products they might like.  There are many different types of recommender algorithms, including nearest-neighbor, linear classifiers and SVMs.  However, most recommender systems are collaborative systems that rely on users to rate the products that they bought.  This paper presents an analysis of recommender systems using a mobile device and backend data points for a coupon delivery system. </strong></p>
<h6><em><span style="font-weight: normal;">Index Terms—machine learning, recommender systems, supervised learning, nearest neighbor, classification</span></em></h6>
<h3>1. INTRODUCTION</h3>
<p>Recommendations are a part of everyday life.  An individual constantly receives recommendations from friends, family, salespeople and Internet resources, such as online reviews.  We want to make the most informed choices possible about decisions in our daily life.  For example, when buying a flat screen TV, we want to have the best resolution, size and refresh rate for the money.  There are many factors that influence our decisions – budget, time, product features and most importantly, previous experience.  We can analyze all the factors that led up to the decision and then make a conclusion or decision based on those results.  A recommender system uses historical data to recommend new items that may be of interest to a particular user.</p>
<p><a href="http://www.coupious.com">Coupious</a> is a mobile phone application that gives its users coupons and deals for businesses around their geographic location.  The application runs on the user’s cell phone and automatically retrieves coupons based upon GPS coordinates.  Redemption of these coupons is as simple as tapping a “Use Now” button.  Coupious’ service is now available on the iPhone, iPod Touch, and Android platforms.  <a href="http://www.coupious.com">Coupious</a> is currently available in Minneapolis, MN; West Lafayette, IN (at Purdue University); and Berkeley, CA.  Clip Mobile is the Canada-based version of Coupious that is currently available in Toronto.</p>
<p>Using push technology, it is possible to integrate a mobile recommendation system into Coupious.  The benefit of this would be threefold: 1) it would offer customers the best possible coupons based on their personal spending habits &#8211; if a user feels they received a “good deal” with Coupious, they would be more likely to use it again and integrate it into their bargain shopping strategy, 2) it would offer businesses the ability to capitalize on their market demographics &#8211; the ability to reach individual customers to provide goods or services would drive more revenue and add value to the product and 3) adding recommendations to the service would immediately make the system more useful to a user, as it would present desirable, geographically proximate offers without extraneous ones.</p>
<h3>1.1 OUTLINE OF RESEARCH</h3>
<p>In this research project, we evaluated the batch k-nearest neighbors (kNN) algorithm in Java.  We chose to write the implementation in Java because of the ease of use of the language.  Java also presented superior capabilities in working with and parsing data from files.  Using Java allowed the author to more efficiently model the problem through the use of OO concepts, such as polymorphism and inheritance.  The kNN algorithm was originally suggested by Donald Knuth in [9], where it was referred to as the post-office problem: assigning each residence to its nearest post office.</p>
<p>The goal of this research was to find a solution that we felt would be successful in achieving the highest rate of coupon redemption.  Presumably, achieving the highest rate of redemption requires learning what the user likes with the smallest error percentage.  Additionally, we wanted to know whether increasing the number of attributes used for computation would affect the quality of the result set.</p>
<h3>1.2 DATA</h3>
<p>The data that we used is from the Coupious production database.  Currently, there are approximately 70,000 rows of data (“nearby” queries, impressions and details impressions) and approximately 3,400 of those represent actual coupon redemptions. The data is an aggregate from March 25th, 2009 until February 11, 2010.  The results are from a mixture of different cities where Coupious is currently in production.</p>
<p>From a logical standpoint, Coupious is simply a conduit through which a user may earn his discount and has no vested interest in whether or not a user redeems a coupon in a particular session.  However, from a business standpoint, Coupious markets the product based on being able to entice sales through coupon redemption.  Therefore, for classification purposes, sessions that ended in one or more redemptions will be labeled +1 and sessions that ended without redemption will be labeled -1.</p>
<h3>1.3 PREVIOUS WORK</h3>
<p>While there hasn’t been any previous work in the space of mobile recommendation systems, there has been a large amount of work in the recommender systems space and classification.   In [1], [2] and [3], direct marketing is studied using collaborative filtering.  In [3], the authors use SVMs and latent class models to model predictions of whether or not a customer would be likely to buy a particular product.   The most direct comparison of work would be in [1] and [8], where SVMs and linear classifiers are used to cluster content driven recommender systems.</p>
<h3>2.  APPLICATION</h3>
<p>Broadly, recommender systems can be grouped into two categories: content-based and collaborative.  In content-based systems, the recommendations are based solely on the attributes of the actual product.  For example, in Coupious, the attributes of a particular coupon redemption include the distance from the merchant when the coupon was used, the date and time of the redemption, the category of the coupon, the expiry and the offer text.  These attributes are physical characteristics of the coupon.  Recommendations can be made to users without relying on any experience-based information.</p>
<p>In collaborative systems, recommendations are based not only on product attributes but also on the overlap of preferences of “like-minded” people, an idea first introduced by Goldberg et al. [5].  For example, a user is asked to rate how well they liked the product or to give an opinion of a movie.  This provides the algorithm with a baseline of preference for a particular user, which allows the algorithm to associate product attributes with a positive or negative response.  Since Coupious does not ask for user ratings, this paper will focus exclusively on content-based applications.</p>
<p>Many content-based systems have similar or common attributes.  As stated in [3], the “central problem of content-based recommendation is identifying a sufficiently large set of key attributes.”  If the attribute set is too small, there may not be enough information for the program to build a profile of the user.  Conversely, if there are too many attributes, the program won’t be able to identify the truly influential attributes, which leads to poor performance [6].  Also, while the label for a particular feature vector with Coupious data is +1 or -1, many of the features in the data are multi-valued attributes (such as distance, date-time stamps, etc.), which may be hard to represent in a binary manner, if the algorithm requires it.</p>
<p>In feature selection, we are “interested in finding k of the d dimensions that give us the most information and accuracy and we discard the other (d – k) dimensions [4].”  How can we find the attributes that will give us the most information and accuracy?  For Coupious, the attribute set is quite limited.  For this research, we explicitly chose the features that would contribute to our recommendations.  In all cases, all attributes were under consideration while the algorithm was running; we never partitioned the attributes across different runs to create different recommendations.</p>
<h3>2.1  THE kNN ALGORITHM</h3>
<p>As shown in [7], “Once the clustering is complete, performance can be very good, since the size of the group that must be analyzed is much smaller.”  The nearest neighbor algorithm is one such classification algorithm.  K-nearest neighbors is one of the simplest machine learning algorithms: an instance is classified by a majority vote of the closest training examples in the hypothesis space.  The algorithm is an example of instance-based learning, where previous data points are stored and interpolation is performed to find a recommendation.  An important distinction is that kNN does not try to build a model when it is given training data, but rather defers the computation until it is queried.  The algorithm works by finding the previously seen data points that are &#8220;closest&#8221; to the query data point and then using their previous outputs for prediction.  The algorithm calculates its approximation using this definition:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/knn.png" rel="lightbox[75]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/knn.png" alt="" title="knn equation" width="203" height="46" class="aligncenter size-full wp-image-120" /></a></p>
<p>The algorithm is shown below</p>
<p><em>KNN (Examples, Target_Instance)<br />
</em></p>
<ul>
<li>Determine the parameter <em>K</em> = number of nearest neighbors.</li>
<li>Calculate the distance between <em>Target_Instance</em> and all the <em>Examples</em> based on the Euclidean distance.</li>
<li>Sort the distances and take the <em>K</em> examples with the smallest distances as the nearest neighbors.</li>
<li>Gather the Category <em>Y</em> of the nearest neighbors.</li>
<li>Use simple majority of the category of nearest neighbors as the prediction value of the <em>Target_Instance</em>.</li>
<li>Return classification of <em>Target_Instance</em>.</li>
</ul>
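<p>The steps above can be sketched in a few lines of Python.  This is only an illustration, not the Java implementation evaluated in this paper; the function names and the toy data are assumptions, and features are taken to be already encoded as numbers.</p>

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(examples, target, k=3):
    """Return the majority label among the k examples nearest to target.

    examples is a list of (feature_vector, label) pairs.
    """
    if k < 1 or k > len(examples):
        raise ValueError("k must be between 1 and the number of examples")
    # Sort all examples by distance to the target and keep the k nearest.
    ranked = sorted(examples, key=lambda ex: euclidean(ex[0], target))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: label +1 for a redemption, -1 for a session without one.
examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1),
            ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
print(knn_classify(examples, (5.2, 5.1)))  # 1
print(knn_classify(examples, (0.2, 0.1)))  # -1
```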
<p>This algorithm lends itself well to the Coupious application.  When a user uses the Coupious application, they want it to launch quickly and present a list of coupons within 10-15 seconds.  Since this algorithm is so simple, we are able to calculate coupons that the user might enjoy fairly quickly and deliver them to the handset.  Also, because Coupious doesn’t know any personal details about the user, the ability to cluster users into groups without the need for any heavy additional implementation on the front or back ends of the application is advantageous.</p>
<p>There are several possible problems with this approach.  While this is a simple algorithm, it doesn’t take high feature dimensionality into account.  If there are many features, the algorithm will have to perform a lot of computations to create the clusters.  Additionally, each attribute is given the same degree of influence on the recommendation.  In Coupious, the date and time the coupon was redeemed has the same bearing on the recommendation as the current distance from the offer and the previous redemption history.  The features may not be scaled in proportion to their importance, and the quality of the recommendation may be degraded by irrelevant features.  Lastly, [6] describes the curse of dimensionality: if a product has 20 attributes, only 2 of which are actually relevant, two instances that are identical in those 2 attributes may still end up far apart once all 20 attributes are considered, and so may be classified differently.</p>
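<p>One common way to soften the equal-influence problem is to give each attribute its own weight in the distance computation.  The Python sketch below is an assumption for illustration, not something evaluated in this paper; the weights shown are arbitrary.  A weight of 0 removes an attribute from the comparison entirely.</p>

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance in which each attribute's squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


print(weighted_distance((0, 0), (3, 4), (1, 1)))  # 5.0, both attributes count equally
print(weighted_distance((0, 0), (3, 4), (1, 0)))  # 3.0, the second attribute is ignored
```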
<h3>2.2  ATTRIBUTE SELECTION</h3>
<p>While implementing the <em>k</em>-nearest neighbors algorithm, we decided to evaluate instances based upon seven different attributes, including coupon category, distance, expiration, offer, redemption date, handset and handset operating system, and upon the session result, a “hit” or “redemption.”  The offer and expiration attributes required extra processing before they were able to be used for clustering.  In both of these attributes, a “bag of words” (applying a Naive Bayes classifier to text to determine the classification) implementation was used to determine what type of offer the coupon had made.</p>
<p>For the OFFER attribute, we split the value space into 4 discrete values, <em>PAYFORFREE</em> (an instance where the customer had to pay some initial amount to receive a free item), <em>PERCENTAGE</em> (an instance where the customer received some percentage discount), <em>DOLLAR</em> (an instance where a customer received a dollar amount discount) and <em>UNKNOWN</em> (an instance where the classification was unknown).  While the <em>PERCENTAGE</em> and <em>DOLLAR</em> values essentially compute to the exact same values when calculated, consumers tend to react differently when seeing higher percentage discounts versus dollar discounts (i.e. 50% off instead of $5 discount, even though they may equate to an identical discount, if the price were $10).</p>
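<p>As a rough illustration of mapping offer text onto these four values, the Python sketch below uses plain keyword and pattern matching.  Note that this is a deliberately simpler stand-in for the “bag of words” classifier described above, and the patterns and sample offers are assumptions rather than the rules used in this research.</p>

```python
import re


def classify_offer(text):
    """Map free-form offer text to PAYFORFREE, PERCENTAGE, DOLLAR or UNKNOWN."""
    lowered = text.lower()
    # "Buy X, get Y free" style offers: something is free, but a purchase is required.
    if re.search(r"\bfree\b", lowered) and re.search(r"\b(buy|purchase|with)\b", lowered):
        return "PAYFORFREE"
    if re.search(r"\d+\s*(%|percent)", lowered):
        return "PERCENTAGE"
    if re.search(r"\$\s*\d+", lowered):
        return "DOLLAR"
    return "UNKNOWN"


print(classify_offer("Buy one entree, get one free"))  # PAYFORFREE
print(classify_offer("50% off any pizza"))             # PERCENTAGE
print(classify_offer("$5 off your order"))             # DOLLAR
print(classify_offer("Ask about our specials"))        # UNKNOWN
```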
<p>For the expiration attribute, some text parsing was done to determine what type of expiration the coupon had (<em>DATE</em>, <em>USES</em>, <em>NONE</em> or <em>UNKNOWN</em>).  If the parsing detected a date format, we used <em>DATE</em> and if the parsing detected that it was a limited usage coupon (either limited by total uses across a population, herein known as “global” or limited by uses per customer, herein known as “local”), we designated that the coupon was <em>USES</em>.  If the coupon was valid indefinitely, we designated that the coupon expiration was <em>NONE</em>.  We did not distinguish further by how far in the future a date was, by the type of limited usage (global or local) or by the number of uses left.</p>
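<p>A minimal Python sketch of this kind of parsing is shown below.  The regular expressions and the sample strings are hypothetical; the actual wording of Coupious expiration text is not given here, so a real parser would need patterns fitted to the production data.</p>

```python
import re

# Hypothetical patterns; real expiration text may be phrased quite differently.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
USES_RE = re.compile(r"\b(per (customer|person|visit)|\d+ uses?|first \d+ customers)\b")
NONE_RE = re.compile(r"\b(no expiration|never expires|does not expire)\b")


def classify_expiration(text):
    """Map expiration text to DATE, USES, NONE or UNKNOWN."""
    lowered = text.strip().lower()
    if DATE_RE.search(lowered):
        return "DATE"
    if USES_RE.search(lowered):
        return "USES"  # covers both "global" and "local" usage limits
    if NONE_RE.search(lowered):
        return "NONE"
    return "UNKNOWN"


print(classify_expiration("Expires 12/31/2010"))                # DATE
print(classify_expiration("Limit one per customer"))            # USES
print(classify_expiration("Valid for the first 100 customers")) # USES
print(classify_expiration("No expiration"))                     # NONE
print(classify_expiration("See store for details"))             # UNKNOWN
```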
<p>An important detail to note is that in our implementation of kNN, if an attribute was unknown, we declined to use it for computation of the nearest neighbor.  The <em>UNKNOWN</em> value was typically the last value in a Java enumeration and would therefore be assigned a high integer value, which would skew the results unnecessarily if used for computation.</p>
<p>For the remaining attributes (<em>OFFERTIME</em>, <em>CATEGORY</em>, <em>DISTANCE</em>, <em>HANDSET</em>, which considers handset model and OS, and <em>ACTION</em>), the value space of each attribute was divided over the possible discrete values for that attribute according to Table 2.2.</p>
<table border="1">
<tbody>
<tr bgcolor="ccc">
<th><strong>Attribute</strong></th>
<th><strong>Possible Values</strong></th>
</tr>
<tr>
<td><em>OFFERTIME</em></td>
<td><em>Morning, Afternoon, Evening, Night, Unknown</em></td>
</tr>
<tr bgcolor="eee">
<td><em>CATEGORY</em></td>
<td><em>Entertainment, Automotive, Food &amp; Dining, Health &amp; Beauty, Retail, Sports &amp; Recreation, Travel, Clothing &amp; Apparel, Electronics &amp; Appliances, Furniture &amp; Decor, Grocery, Hobbies &amp; Crafts, Home Services, Hotels &amp; Lodging, Nightlife &amp; Bars, Nonprofits &amp; Youth, Office Supplies, Other, Pet Services, Professional Services, Real Estate, Unknown</em></td>
</tr>
<tr>
<td><em>DISTANCE</em></td>
<td><em>Less than 2 miles, 2 to 5 miles, 5 to 10 miles, 10 to 20 miles, 20 to 50 miles, 50 to 100 miles, Unknown</em></td>
</tr>
<tr bgcolor="eee">
<td><em>HANDSET</em></td>
<td><strong>Device</strong>:<em> iPhone, iPod, G1, Hero, myTouch, Droid, Unknown</em><br />
<strong>OS</strong>: <em>iPhone, Android, Unknown</em></td>
</tr>
<tr>
<td><em>ACTION</em></td>
<td><em>Redeem, Hit, Unknown</em></td>
</tr>
</tbody>
</table>
<p><em>Table 2.2</em></p>
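<p>The rule of leaving <em>UNKNOWN</em> values out of the distance computation can be sketched as follows.  This Python version is an assumption for illustration: it uses <code>None</code> as a stand-in for the <em>UNKNOWN</em> enumeration value and integer codes for the discrete values in Table 2.2.</p>

```python
import math

UNKNOWN = None  # stand-in for the UNKNOWN enumeration value


def distance_ignoring_unknown(a, b):
    """Euclidean distance over only the positions where both values are known."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not UNKNOWN and y is not UNKNOWN]
    if not pairs:
        return math.inf  # nothing comparable, so treat the instances as infinitely far apart
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


print(distance_ignoring_unknown((1, UNKNOWN, 3), (4, 2, 7)))  # 5.0
print(distance_ignoring_unknown((UNKNOWN,), (1,)))            # inf
```

<p>One caveat with this sketch is that instances with many unknown values are compared on fewer attributes and so tend to look closer; dividing by the number of compared attributes would be one way to compensate.</p>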
<p>In the case of the handset OS, iPhone and iPod were classified together as the iPhone OS, as they are the same OS with different build targets.  It is important to note that kNN works with continuous-valued attributes as well as with discrete-valued attributes.  For further discussion on using kNN with real-valued attributes, see [10].</p>
<h3>3. PROBLEMS</h3>
<p>We encountered several problems while implementing kNN.  The first was “majority voting.”  Majority voting is the last step the algorithm takes to classify an instance, and it has an inherent weakness: if a particular class dominates the training data, it skews the votes towards that class, since we only consider data at a local level, that is, the distance from our classification instance to the nearest data points.  There are two ways to solve this problem: 1) balancing the dataset, or 2) weighting the neighboring instances.  Balancing is the simplest technique: an equal proportion of each class is present in the training data.  A more complicated, but more effective, method is to weight the neighboring instances so that neighbors that are farther away have a smaller weight than closer neighbors.</p>
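<p>As a rough sketch of the weighting idea, the vote below gives each neighbor a weight that shrinks with its distance.  The weight function 1 / (1 + distance) and the class and method names are assumptions made for illustration; the text does not say which weighting was used.</p>

```java
public class WeightedVote {
    // Returns true if the distance-weighted vote favors the positive class.
    // distances[i] is the distance to neighbor i; positive[i] is that neighbor's class.
    public static boolean vote(double[] distances, boolean[] positive) {
        double score = 0;
        for (int i = 0; i < distances.length; i++) {
            // Assumed weighting: closer neighbors count more
            double weight = 1.0 / (1.0 + distances[i]);
            score += positive[i] ? weight : -weight;
        }
        return score > 0;
    }

    public static void main(String[] args) {
        double[] distances = {0.1, 2.0, 2.5};
        boolean[] labels = {true, false, false};
        // Plain majority voting would pick the negative class (2 of 3 neighbors);
        // here the single close neighbor outweighs the two distant ones
        System.out.println(vote(distances, labels)); // prints true
    }
}
```

<p>With equal distances this reduces to ordinary majority voting, so the weighting only changes the outcome when the neighbors are unevenly spread.</p>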
<p>Determining <em>k</em> beforehand is troublesome.  If <em>k</em> is too large, the separation between classifications becomes blurred, which could cause the program to group together two clusters that are, in fact, distinct.  However, a large <em>k</em> does reduce the effect of noise by including more samples in the hypothesis.  If <em>k</em> is too small, we might not calculate a good representation of the sample space because our results are too local.  In either case, if the target attribute is binary, <em>k</em> should be an odd number to avoid the possibility of ties in majority voting.</p>
<p>Also, kNN has a high computation cost because it has to compute the distance from the classification instance to every training instance.  Classifying a single instance therefore takes <em>O(n)</em> time in the number of training instances, which wouldn’t scale well to hypothesis spaces with millions of instances.</p>
<p>Discretizing attributes whose values are not related to one another, like category, presented a unique challenge for the author.  Continuous attributes, such as distance, were easy to discretize.  In the case of Coupious, the distance ranges were already defined in the application, so we just had to translate those over to our classification algorithm.  However, some attributes, such as category, while related at an attribute level (in that they were all categories of coupon), had no natural numeric values.  In this case, we simply assigned them increasing integer values.</p>
<p>Another problem that we encountered is the sparsity problem.  Since we implemented a content-driven model, the degree of accuracy relied upon how much data we had about a particular end user to build their profile.  If the customer only had one or two sessions ending in no redemptions, it might not be possible to make an accurate prediction for this person.  We dealt with this problem by artificially creating new records to supplement previous real records.</p>
<p>Coupious relies on the GPS module inside the smart phone to tell us where a user is currently located.  From that position, the user gets a list of coupons that are close to the user’s location.  However, there are no safeguards in place to guarantee that a redemption is real.  A curious user may attempt to redeem a coupon when he is not at the actual merchant location.  In the data, we can account for this by calculating the distance from the merchant at the time of the redemption request.  However, this GPS reading may be inaccurate, as the GPS module can adjust its accuracy to save power and battery life.</p>
<p>Lastly, accuracy is somewhat limited because we are using a content-only model.  There is no way to interact with the user to ask if the recommendations are truly useful.  Achieving this additional metric would require major changes to the application and the backend systems that are outside the scope of this research paper.</p>
<p>Despite the multiple problems with kNN, it is quite robust to noisy data as indicated in section 4, which makes it well-suited for this classification task, as the author can only verify the reasonableness of the data, not the integrity.</p>
<h3>4. TESTING</h3>
<p>Testing of the kNN logic was carried out using a 3-fold cross-validation method, where the data was divided into training and test sets such that | <em>R</em> | = <em>k</em> &#215; | <em>S</em> |, where <em>R</em> is the training set, <em>S</em> is the test set, and <em>k</em> is the relative size of the two.</p>
<p>For each test run, we chose 4,000 training examples and 1 test example at random.  We attempted to keep labeled classes in the training set as balanced as possible by setting a threshold <em>n</em>.  This threshold prevented the classes from becoming unbalanced by more than a difference of <em>n</em>.  If the threshold <em>n</em> was ever reached, we discarded random selections until we were under the threshold again, thereby balancing the classes.  The results are discussed in section 5.</p>
<h3>5.  EVALUATION</h3>
<p>Even though kNN is a simplistic algorithm, the classification results were quite accurate.  To test the algorithm, we performed 10,000 independent runs where an equal number of “hit” and “redemption” rows were selected at random (2,000 of each so as to keep the inductive bias as fair as possible).  An individual classification instance was chosen at random from that set of 4,000 instances and was then classified according to its nearest neighbors.</p>
<h3>5.1 MACRO EVALUATION</h3>
<p>When evaluating the results, there was one main factor that affected the recall: the number of neighbors <em>k</em> considered in the calculation (see <em>Table 5.1</em>).  In the 10,000 independent runs using random subsets of data for each test, the overall recall for the kNN algorithm was 85.32%.  The average F-measure for these runs was 0.45.</p>
<p>However, if one considers the subset of results with k-size less than or equal to 15, the average recall was much higher, at 94%.  For k-sizes greater than 30, we see a significant drop-off in recall.  This can be attributed to the fact that the groups become less defined: as k-size grows, the nodes that are being used are “further” away from the classification instance, and therefore the results are not as “good.”</p>
<h3>5.2 MICRO EVALUATION</h3>
<p>When broken down into smaller sets of 1,000 runs, the recall and F-measure vary greatly.  In 16 runs, we had a range from 29.40% recall to 98.20% recall.  The median recall was 92.70%.</p>
<table border="1">
<tr bgcolor="ccc">
<th><strong>Run</strong></th>
<th><strong>K Size</strong></th>
<th><strong>Recall</strong></th>
<th><strong>F-measure</strong></th>
<th><strong>Avg. Distance</strong></th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>98.20%</td>
<td>0.49</td>
<td>0.18</td>
</tr>
<tr bgcolor="eee">
<td>2</td>
<td>2</td>
<td>97.80%</td>
<td>0.49</td>
<td>0.23</td>
</tr>
<tr>
<td>3</td>
<td>3</td>
<td>95.80%</td>
<td>0.48</td>
<td>0.49</td>
</tr>
<tr bgcolor="eee">
<td>4</td>
<td>4</td>
<td>95.70%</td>
<td>0.48</td>
<td>0.44</td>
</tr>
<tr>
<td>5</td>
<td>5</td>
<td>95.60%</td>
<td>0.48</td>
<td>0.46</td>
</tr>
<tr bgcolor="eee">
<td>6</td>
<td>6</td>
<td>95.20%</td>
<td>0.48</td>
<td>0.59</td>
</tr>
<tr>
<td>7</td>
<td>7</td>
<td>93.10%</td>
<td>0.48</td>
<td>0.48</td>
</tr>
<tr bgcolor="eee">
<td>8</td>
<td>8</td>
<td>92.70%</td>
<td>0.47</td>
<td>0.69</td>
</tr>
<tr>
<td>9</td>
<td>9</td>
<td>94.20%</td>
<td>0.48</td>
<td>0.82</td>
</tr>
<tr bgcolor="eee">
<td>10</td>
<td>10</td>
<td>93.40%</td>
<td>0.48</td>
<td>0.88</td>
</tr>
<tr>
<td>11</td>
<td>15</td>
<td>91.70%</td>
<td>0.48</td>
<td>1.21</td>
</tr>
<tr bgcolor="eee">
<td>12</td>
<td>30</td>
<td>87.50%</td>
<td>0.47</td>
<td>2.16</td>
</tr>
<tr>
<td>13</td>
<td>50</td>
<td>85.10%</td>
<td>0.46</td>
<td>3.63</td>
</tr>
<tr bgcolor="eee">
<td>14</td>
<td>100</td>
<td>70.80%</td>
<td>0.42</td>
<td>7.57</td>
</tr>
<tr>
<td>15</td>
<td>200</td>
<td>48.90%</td>
<td>0.32</td>
<td>16.57</td>
</tr>
<tr bgcolor="eee">
<td>16</td>
<td>500</td>
<td>29.40%</td>
<td>0.22</td>
<td>31.64</td>
</tr>
<tr bgcolor="ccc">
<td>Overall</td>
<td></td>
<td>85.32%</td>
<td>0.45</td>
<td>4.25</td>
</tr>
</table>
<p><em>Table 5.1</em></p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/graph.png" rel="lightbox[75]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/04/graph.png" alt="" title="knn-graph" width="308" height="243" class="aligncenter size-full wp-image-121" /></a></p>
<p>There is a large gap between our maximum and our minimum recall.  This can be attributed mainly to our <em>k</em>-size.  Based on these results, we would feel fairly confident that we could make a useful prediction, but only if we had a confidence rating on the results (due to the high variability of the results).</p>
<p>These results could provide good insight into what <em>k</em>-size or neighbors are the most influential in suggesting a coupon.  This would allow us to more carefully target Coupious advertising based on the user.</p>
<p>When could one consider these results valid?  If the average distance of a classification instance to the examples is below a threshold such that the classification truly reflects its neighbors, we could consider the results valid.  Another form of validating the results could be pruning the attribute set.  If we were able to prune away attributes that didn’t affect the recall in a negative manner, we would be left with a set of attributes that truly influence the customer’s purchasing behavior, although this task is better suited to a decision or regression tree.  An improvement to this kNN algorithm might come in the form of altering the k-size based upon the population or attributes that we are considering.</p>
<h3>6.  CONCLUSION</h3>
<p>In this research project, we have implemented the kNN algorithm for recommender systems.  The algorithm for the nearest neighbor was explored and several problems were identified and overcome.  Different techniques were investigated to improve the accuracy of the system.  The results of this project show an overall recall of 85.32%, which makes kNN an excellent, simple technique for implementing recommender systems.</p>
<h4>REFERENCES</h4>
<p>[1] Zhang, T., Iyengar, V. S.  Recommender Systems Using Linear Classifiers.  Journal of Machine Learning Research 2. (2002). 313-334.<br />
[2] Basu, C., Hirsh, H., Cohen, W.  Recommendation as Classification: Using Social and Content-Based Information  in Recommendation.<br />
[3] Cheung, K. W., Kwok, J. T., Law, M. H., Tsui, K. C.  Mining Customer Product Ratings for Personalized Marketing.<br />
[4] E. Alpaydin, Introduction to Machine Learning, 2nd ed.  MIT Press.  Cambridge, Mass, 2010.<br />
[5] D. Goldberg, D. Nichols, B.M. Oki, D. Terry, Collaborative filtering to weave an information tapestry, Communications of the ACM 35 (12) (December 1992) 61-70.<br />
[6] T.M. Mitchell, Machine Learning, New York.  McGraw-Hill, 1997.<br />
[7] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering,” Proc. Fifth Int’l Conf. Computer and Information Technology, 2002.<br />
[8] D. Barbella, S. Benzaid, J. Christensen, B. Jackson, X. V. Qin, D. Musicant.  Understanding Support Vector Machine Classifications via a Recommender System-Like Approach.  [Online]. <a href="http://www.cs.carleton.edu/faculty/dmusican/svmzen.pdf">http://www.cs.carleton.edu/faculty/dmusican/svmzen.pdf</a><br />
[9] D.E. Knuth. “The Art of Computer Programming.”  Addison-Wesley.  1973.<br />
[10] C. Elkan. “Nearest Neighbor Classification.”  University of California &#8211; San Diego.  2007. [Online].  <a href="http://cseweb.ucsd.edu/~elkan/151/nearestn.pdf" target="_blank">http://cseweb.ucsd.edu/~elkan/151/nearestn.pdf</a></p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/04/content-recommenation-systems-for-a-mobile-platform/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Solving the Multitasking Problem in iPhone OS</title>
		<link>http://afewguyscoding.com/2010/04/solving-multitasking-problem-iphone-os/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=solving-multitasking-problem-iphone-os</link>
		<comments>http://afewguyscoding.com/2010/04/solving-multitasking-problem-iphone-os/#comments</comments>
		<pubDate>Thu, 01 Apr 2010 14:41:25 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[Apple]]></category>
		<category><![CDATA[iPad]]></category>
		<category><![CDATA[ipod]]></category>
		<category><![CDATA[multitasking]]></category>
		<category><![CDATA[tablet]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=95</guid>
		<description><![CDATA[iPhone OS is one of the most advanced OSs available today &#8211; and it&#8217;s a mobile OS. Using the OS, it is clear to see why mobile computing is the future. When compared to other OSs, the iPhone OS holds its own. Broadly, looking at most modern desktop OSs, it has support for nearly [...]]]></description>
				<content:encoded><![CDATA[<p>iPhone OS is one of the most advanced OSs available today &#8211; and it&#8217;s a mobile OS.  Using the OS, it is clear to see why mobile computing is the future.  When compared to other OSs, the iPhone OS holds its own.  Like most modern desktop OSs, it supports nearly everything that one would have available in a &#8220;full&#8221;, modern OS: networking, full graphics support (video, animation, photos), security, etc.  The one thing that it lacks is true multitasking.  Let&#8217;s be clear though: today&#8217;s iPhone 3.x firmware is a fully preemptive multitasking operating system, but it artificially restricts apps (other than specific ones bundled with the system by Apple) from running in the background.</p>
<p>But <em>why</em>?  Apple initially avoided an app model supporting multiple apps running at once to help preserve battery life and simplify the user experience.</p>
<p>So, Apple opted instead for a &#8220;compromise&#8221;: push notifications, which enable third-party apps to appear to respond to outside updates (such as incoming messages or news alerts) even when they are not actually running.  This gives the &#8220;illusion&#8221; of third-party multitasking by displaying messages to the user when the app needs their attention.  Other OSs, such as Android and BlackBerry, have full support for third-party multitasking, at the expense of battery life and perhaps a more complicated user experience.</p>
<p>So what is one to do?  A few days after I wrote the initial draft of this post, the <a href="http://www.tuaw.com/2010/04/01/rumor-apples-iphone-4-0-to-support-multitasking-expose-like-i/">news</a> of iPhone OS 4.0 supporting multitasking came out.  But, I don&#8217;t care about rumors &#8211; I want results.  Just as in &#8220;Jerry Maguire&#8221;, <strong>show me the money.</strong> Oh, and for the sake of argument, let&#8217;s assume that our hardware is a single processor and our OS is using kernel threads and round-robin scheduling.</p>
<p>First, let&#8217;s consider how most modern OSs implement multitasking.  Multitasking is actually a simple sleight of hand.  A process is simply an instance of a program or executable that is mapped into memory.  We all know that we can have multiple programs or processes running at a time.  But does the computer work with all the processes at the same time?  No.  Similar to how a human works and processes information, a processor works with whatever data it is handed, but is only capable of working with one particular process at a time.  The processes are scheduled in a round-robin queue and when their turn comes up, they get a small piece of time with the processor.  After its time, or quantum, expires, or the process blocks, the process is removed from the running context and placed back in the queue to wait for its turn again.  This context switching happens so fast that a normal user cannot tell that this is the true model of what is happening and believes that all programs are executing concurrently, as in the processor sharing model.  The model described above has a whole array of issues relating to memory management (including page faulting), security and protection.  The model I would like to describe below would have the same issues, but I will only talk about security and protection.</p>
<p>Armed with a <strong>very</strong> basic idea of how processes work, how can we adapt this to a more mobile OS?  What are the areas that we can adjust?</p>
<p><strong><span style="text-decoration: line-through;">Quantum</span> &#8211; </strong><strong><span style="color: #ff0000;">BAD</span></strong><br />
First, we can adjust the quantum, so there is less overhead associated with servicing each process and hence less power consumption.  But is this a good solution?  Absolutely not.  Quanta that are too short cause unnecessary thrashing with all the context switching that occurs, and in an extreme case, the time spent servicing a context switch may be longer than the time the process actually receives from the processor.  What about variable quanta?  Not a good idea either.  The processes would receive non-uniform amounts of time, making scheduling more difficult and perhaps causing a &#8220;jittery&#8221; response in the UI.</p>
<p><strong><span style="text-decoration: line-through;">Scheduling</span> &#8211; </strong><strong><span style="color: #ff0000;">BAD</span></strong><br />
What if we change the type of scheduling we do?  What if we try FCFS (first come, first served) or SJF (shortest job first)?  That is not a good solution either because with FCFS, the process will continue to run and hog the processor until it is finished and in the case of applications, they will never finish until the user terminates the app.  However, the solution proposed will draw inspiration from this category.</p>
<p><strong><span style="text-decoration: line-through;">Notification</span> &#8211; </strong><strong><span style="color: #ff0000;">BAD</span></strong><br />
What if we eliminate the context switching and interrupts?  We can set bits to indicate the app needs service.  This idea is the absolute worst of them so far.  What happens if two background applications request OS services in the same off round-robin cycle?  If you proposed using shared memory for notification, you just lost an &#8220;interrupt.&#8221;  What about not using shared memory?  What a pain!  Now you have to set up dynamic data structures for each app that runs and terminates.  OK, enough with the freshman ideas.  Are you ready to have a heap of knowledge dumped on you?</p>
<p>Let&#8217;s revisit the scheduling idea and draw inspiration from a multi-level feedback queue.  In this proposed solution, there would be the same blocked and ready queues that any scheduler would have.  However, there would be a foreground queue and a background queue.  The foreground queue would contain the currently &#8220;focused&#8221; app, i.e. the process that the user is actively using.  The background queue would contain all the remaining ready processes.  The foreground queue would receive processor cycles as normal and the background queue would receive a fraction of the cycles to the tune of 1/<em>n</em> processes, with <em>n</em> capped at some upper bound to ensure that all processes still receive decent service.  Giving the background processes some processor time ensures they can still continue to work.  When the user wants to switch to a different process, the foreground process is removed and placed in the background queue.  The rest of the process paradigm would stay the same, where a process could interrupt and request services from the OS.</p>
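<p>To make the two-queue idea concrete, here is a small simulation in Java.  It is only a sketch under my own assumptions: apps are represented by name, every quantum is the same length, and the hypothetical <em>backgroundShare</em> parameter says how many foreground quanta pass before one background app gets a turn.  None of these names come from the iPhone OS.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TwoQueueScheduler {
    private String foreground;                                   // the app the user is actively using
    private final Deque<String> background = new ArrayDeque<>(); // all other ready apps
    private final int backgroundShare; // assumed knob: foreground quanta per background quantum
    private int tick = 0;

    public TwoQueueScheduler(String foregroundApp, int backgroundShare) {
        this.foreground = foregroundApp;
        this.backgroundShare = backgroundShare;
    }

    public void addBackground(String app) {
        background.addLast(app);
    }

    // The old foreground app moves to the back of the background queue
    public void switchTo(String app) {
        background.remove(app);
        background.addLast(foreground);
        foreground = app;
    }

    // Returns the app that receives the next quantum
    public String next() {
        tick++;
        if (!background.isEmpty() && tick % (backgroundShare + 1) == 0) {
            String app = background.pollFirst(); // round-robin within the background queue
            background.addLast(app);
            return app;
        }
        return foreground;
    }

    public static void main(String[] args) {
        TwoQueueScheduler scheduler = new TwoQueueScheduler("Mail", 3);
        scheduler.addBackground("Music");
        scheduler.addBackground("Maps");
        List<String> order = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            order.add(scheduler.next());
        }
        System.out.println(order); // prints [Mail, Mail, Mail, Music, Mail, Mail, Mail, Maps]
    }
}
```

<p>In this sketch the foreground app gets three of every four quanta and the background apps take turns sharing the fourth, so background work keeps moving without starving the app on screen.</p>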
<p>Allowing multitasking on the iPhone has implications, and the major one is security and protection.  When the user presses the home button on the iPhone, the application receives a delegate message from the OS, <em>applicationWillTerminate</em>.  From that point, the application has 5 whole seconds to exit before the OS steps in and kills the process.  With multiple processes, we now have to consider how multiple apps will behave together.  Obviously the iPhone OS implements a virtual address space, so that separate processes have separate chunks of transparent memory.  However, some additional security measures might need to be taken so that processes are unable to access anything but their own VAs.</p>
<p>As much as I wish I could test this hypothesis, I can&#8217;t because I have no access to the iPhone firmware.  So I will leave it up to the folks at 1 Infinite Loop, but this is just my $0.02.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/04/solving-multitasking-problem-iphone-os/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ID3 Decision Trees in Java</title>
		<link>http://afewguyscoding.com/2010/03/id3-decision-trees-java/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=id3-decision-trees-java</link>
		<comments>http://afewguyscoding.com/2010/03/id3-decision-trees-java/#comments</comments>
		<pubDate>Wed, 31 Mar 2010 17:30:12 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[decision trees]]></category>
		<category><![CDATA[ID3]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[supervised learning]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=67</guid>
		<description><![CDATA[In a previous post, I explored how one might apply decision trees to solve a complex problem. This post will explore the code necessary to implement that decision tree. If you would like a full copy of the source code, it is available here in zip format. Entropy.java &#8211; In Entropy.java, we are concerned with [...]]]></description>
				<content:encoded><![CDATA[<p>In a previous post, I explored how one might apply decision trees to solve a complex problem.  This post will explore the code necessary to implement that decision tree.  If you would like a full copy of the source code, it is available <a href='http://blog.afewguyscoding.com/wp-content/uploads/2010/03/src.zip'>here</a> in zip format.</p>
<p><strong>Entropy.java</strong> &#8211; In Entropy.java, we are concerned with calculating the entropy, or the amount of uncertainty or randomness, of a particular variable.  For example, consider a classifier with two classes, <em>YES</em> and <em>NO</em>.  If a particular variable or attribute, say <em>x</em>, has three training examples of class <em>YES</em> and three training examples of class <em>NO</em> (for a total of six), the entropy would be 1.  This is because there is an equal number of examples of both classes for this variable, which is the most mixed up you can get.  Likewise, if <em>x</em> had all six training examples of a particular class, say <em>YES</em>, then entropy would be 0 because this particular variable would be <em>pure</em>, thus making it a leaf node in our decision tree.</p>
<p>Entropy may be calculated in the following way:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/entropy-equation.png" rel="lightbox[67]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/entropy-equation.png" alt="" title="entropy-equation" width="356" height="40" class="aligncenter size-full wp-image-46" /></a><br />
</p>
<pre>
<code>import java.util.ArrayList;

public class Entropy {	
	public static double calculateEntropy(ArrayList&lt;Record&gt; data) {
		double entropy = 0;
		
		if(data.size() == 0) {
			// nothing to do
			return 0;
		}
		
		// One pass per class label; index 4 of each record holds the PlayTennis class
		for(int i = 0; i &lt; Hw1.setSize("PlayTennis"); i++) {
			int count = 0;
			for(int j = 0; j &lt; data.size(); j++) {
				Record record = data.get(j);
				
				if(record.getAttributes().get(4).getValue() == i) {
					count++;
				}
			}
				
			// Accumulate -p * log2(p); classes with no examples contribute nothing
			double probability = count / (double)data.size();
			if(count &gt; 0) {
				entropy += -probability * (Math.log(probability) / Math.log(2));
			}
		}
		
		return entropy;
	}
	
	public static double calculateGain(double rootEntropy, ArrayList&lt;Double&gt; subEntropies, ArrayList&lt;Integer&gt; setSizes, int data) {
		double gain = rootEntropy; 
		
		for(int i = 0; i &lt; subEntropies.size(); i++) {
			gain += -((setSizes.get(i) / (double)data) * subEntropies.get(i));
		}
		
		return gain;
	}
}</code></pre>
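<p>As a quick sanity check of the two cases described above, the standalone sketch below computes entropy directly from per-class counts.  It does not use the <em>Record</em> or <em>Hw1</em> classes from the download; the class and method names are made up for this example.</p>

```java
public class EntropyExample {
    // Shannon entropy (base 2) of a set described by its per-class counts
    public static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0;
        for (int c : counts) {
            if (c == 0) {
                continue; // an empty class contributes nothing
            }
            double p = c / (double) total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(3, 3)); // three YES, three NO: maximally mixed
        System.out.println(entropy(6, 0)); // six YES: pure
    }
}
```

<p>The first call evaluates to 1 and the second to 0, matching the mixed and pure cases in the description of Entropy.java.</p>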
<p><strong>Tree.java</strong> &#8211; This tree class contains all our code for building our decision tree.  Note that at each level, we choose the attribute that presents the best <em>gain</em> for that node.  The gain is simply the expected reduction in the entropy of <em>X</em> achieved by learning the state of the random variable <em>A</em>.  Gain is closely related to the Kullback-Leibler divergence: it equals the mutual information between <em>X</em> and <em>A</em>.  Gain can be calculated in the following way:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/gain-equation.png" rel="lightbox[67]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/gain-equation.png" alt="" title="gain-equation" width="356" height="69" class="aligncenter size-full wp-image-47" /></a></p>
<p>Notice that gain is calculated as a function of all the values of the attribute.</p>
<pre><code>import java.io.*;
import java.util.*;

public class Tree {
	public Node buildTree(ArrayList&lt;Record&gt; records, Node root, LearningSet learningSet) {
		int bestAttribute = -1;
		double bestGain = 0;
		root.setEntropy(Entropy.calculateEntropy(root.getData()));
		
		if(root.getEntropy() == 0) {
			return root;
		}
		
		for(int i = 0; i &lt; Hw1.NUM_ATTRS - 2; i++) {
			if(!Hw1.isAttributeUsed(i)) {
				double entropy = 0;
				ArrayList&lt;Double&gt; entropies = new ArrayList&lt;Double&gt;();
				ArrayList&lt;Integer&gt; setSizes = new ArrayList&lt;Integer&gt;();
				
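				// Split on every value of attribute i; sizes and entropies must stay index-aligned for calculateGain
				for(int j = 0; j &lt; Hw1.setSize(Hw1.attrMap.get(i)); j++) {
					ArrayList&lt;Record&gt; subset = subset(root, i, j);
					
					if(subset.size() != 0) {
						setSizes.add(subset.size());
						entropy = Entropy.calculateEntropy(subset);
						entropies.add(entropy);
					}
				}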
				
				double gain = Entropy.calculateGain(root.getEntropy(), entropies, setSizes, root.getData().size());
				
				if(gain &gt; bestGain) {
					bestAttribute = i;
					bestGain = gain;
				}
			}
		}
		if(bestAttribute != -1) {
			int setSize = Hw1.setSize(Hw1.attrMap.get(bestAttribute));
			root.setTestAttribute(new DiscreteAttribute(Hw1.attrMap.get(bestAttribute), 0));
			root.children = new Node[setSize];
			root.setUsed(true);
			Hw1.usedAttributes.add(bestAttribute);
			
			for (int j = 0; j&lt; setSize; j++) {
				root.children[j] = new Node();
				root.children[j].setParent(root);
				root.children[j].setData(subset(root, bestAttribute, j));
				root.children[j].getTestAttribute().setName(Hw1.getLeafNames(bestAttribute, j));
				root.children[j].getTestAttribute().setValue(j);
			}

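			for (int j = 0; j &lt; setSize; j++) {
				buildTree(root.children[j].getData(), root.children[j], learningSet);
			}
			// Release the attribute so that sibling subtrees can still split on it
			Hw1.usedAttributes.remove(Integer.valueOf(bestAttribute));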

			root.setData(null);
		}
		else {
			return root;
		}
		
		return root;
	}
	
	public ArrayList&lt;Record&gt; subset(Node root, int attr, int value) {
		ArrayList&lt;Record&gt; subset = new ArrayList&lt;Record&gt;();
		
		for(int i = 0; i &lt; root.getData().size(); i++) {
			Record record = root.getData().get(i);
			
			if(record.getAttributes().get(attr).getValue() == value) {
				subset.add(record);
			}
		}
		return subset;
	}
	
	public double calculateSurrogates(ArrayList&lt;Record&gt; records) {
		return 0;
	}
}</code></pre>
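<p>To see the gain calculation with concrete numbers, the sketch below works through a split on Outlook.  The counts are an assumption: they are taken from the classic 14-example PlayTennis data set in Mitchell&#8217;s Machine Learning textbook, which may differ from the data file in the download.  The class and method names are also made up for this example.</p>

```java
public class GainExample {
    // Entropy (base 2) of a two-class set with the given Yes/No counts
    static double entropy(int yes, int no) {
        double h = 0;
        int total = yes + no;
        for (int c : new int[] {yes, no}) {
            if (c > 0) {
                double p = c / (double) total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Gain = entropy(parent) - sum over subsets of (|subset| / |parent|) * entropy(subset)
    public static double gain(int[] parent, int[][] subsets) {
        int total = parent[0] + parent[1];
        double g = entropy(parent[0], parent[1]);
        for (int[] s : subsets) {
            g -= (s[0] + s[1]) / (double) total * entropy(s[0], s[1]);
        }
        return g;
    }

    public static void main(String[] args) {
        int[] all = {9, 5};                            // assumed: 9 Yes, 5 No overall
        int[][] byOutlook = {{2, 3}, {4, 0}, {3, 2}};  // assumed: Sunny, Overcast, Rain
        System.out.printf("%.3f%n", gain(all, byOutlook));
    }
}
```

<p>Under these assumed counts the gain for Outlook comes out to roughly 0.247.  The Overcast subset is pure, so it contributes no entropy, which is a large part of why Outlook is such an attractive first split.</p>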
<p><strong>DiscreteAttribute.java</strong> &#8211; A subclass of the parent, Attribute.java.</p>
<pre><code>
public class DiscreteAttribute extends Attribute {
	public static final int Sunny = 0;
	public static final int Overcast = 1;
	public static final int Rain = 2;

	public static final int Hot = 0;
	public static final int Mild = 1;
	public static final int Cool = 2;
	
	public static final int High = 0;
	public static final int Normal = 1;
	
	public static final int Weak = 0;
	public static final int Strong = 1;
	
	public static final int PlayNo = 0;
	public static final int PlayYes = 1;
	
	enum PlayTennis {
		No,
		Yes
	}
	
	enum Wind {
		Weak,
		Strong
	}
	
	enum Humidity {
		High,
		Normal
	}
	
	enum Temp {
		Hot,
		Mild,
		Cool
	}
	
	enum Outlook {
		Sunny,
		Overcast,
		Rain
	}

	public DiscreteAttribute(String name, double value) {
		super(name, value);
	}

	public DiscreteAttribute(String name, String value) {
		super(name, value);
	}
}</code></pre>
<p><strong>Hw1.java</strong> &#8211; This class is our main driver class.</p>
<pre><code>import java.util.*;

public class Hw1 {
	public static int NUM_ATTRS = 6;
	public static ArrayList&lt;String&gt; attrMap;
	public static ArrayList&lt;Integer&gt; usedAttributes = new ArrayList&lt;Integer&gt;();

	public static void main(String[] args) {
		populateAttrMap();

		Tree t = new Tree();
		ArrayList&lt;Record&gt; records;
		LearningSet learningSet = new LearningSet();
		
		// read in all our data
		records = FileReader.buildRecords();
		
		Node root = new Node();
		
		for(Record record : records) {
			root.getData().add(record);
		}
		
		t.buildTree(records, root, learningSet);
		traverseTree(records.get(12), root);
		return;
	}
	
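	public static void traverseTree(Record r, Node root) {
		// Descend iteratively; children[j] is the branch for value j of the node's test attribute
		while(root.children != null) {
			int branch = -1;
			for(int i = 0; i &lt; r.getAttributes().size(); i++) {
				if(r.getAttributes().get(i).getName().equalsIgnoreCase(root.getTestAttribute().getName())) {
					double nodeValue = r.getAttributes().get(i).getValue();
					branch = (int) nodeValue;
					break;
				}
			}
			if(branch &lt; 0 || branch &gt;= root.children.length) {
				// the record has no usable value for this attribute, so stop here
				break;
			}
			root = root.children[branch];
		}
		
		System.out.print("Prediction for Play Tennis: ");
		// A leaf keeps its training records (buildTree only clears data on internal nodes);
		// attribute 4 of a record is its PlayTennis label
		if(root.getData() == null || root.getData().isEmpty()) {
			System.out.println("Unknown");
			return;
		}
		double label = root.getData().get(0).getAttributes().get(4).getValue();
		if(label == DiscreteAttribute.PlayNo) {
			System.out.println("No");
		}
		else if(label == DiscreteAttribute.PlayYes) {
			System.out.println("Yes");
		}
	}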
	
	public static boolean isAttributeUsed(int attribute) {
		return usedAttributes.contains(attribute);
	}
	
	public static int setSize(String set) {
		if(set.equalsIgnoreCase("Outlook")) {
			return 3;
		}
		else if(set.equalsIgnoreCase("Wind")) {
			return 2;
		}
		else if(set.equalsIgnoreCase("Temperature")) {
			return 3;
		}
		else if(set.equalsIgnoreCase("Humidity")) {
			return 2;
		}
		else if(set.equalsIgnoreCase("PlayTennis")) {
			return 2;
		}
		return 0;
	}
	
	public static String getLeafNames(int attributeNum, int valueNum) {
		if(attributeNum == 0) {
			if(valueNum == 0) {
				return "Sunny";
			}
			else if(valueNum == 1) {
				return "Overcast";
			}
			else if(valueNum == 2) {
				return "Rain";
			}
		}
		else if(attributeNum == 1) {
			if(valueNum == 0) {
				return "Hot";
			}
			else if(valueNum == 1) {
				return "Mild";
			}
			else if(valueNum == 2) {
				return "Cool";
			}
		}
		else if(attributeNum == 2) {
			if(valueNum == 0) {
				return "High";
			}
			else if(valueNum == 1) {
				return "Normal";
			}
		}
		else if(attributeNum == 3) {
			if(valueNum == 0) {
				return "Weak";
			}
			else if(valueNum == 1) {
				return "Strong";
			}
		}
		
		return null;
	}
	
	public static void populateAttrMap() {
		attrMap = new ArrayList&lt;String&gt;();
		attrMap.add("Outlook");
		attrMap.add("Temperature");
		attrMap.add("Humidity");
		attrMap.add("Wind");
		attrMap.add("PlayTennis");
	}
}</code></pre>
<p><strong>Node.java</strong> &#8211; Holds the information for a single node of the tree: its parent, its children, its records, its entropy and its test attribute.</p>
<pre><code>
import java.util.*;

public class Node {
	private Node parent;
	public Node[] children;
	private ArrayList&lt;Record&gt; data;
	private double entropy;
	private boolean isUsed;
	private DiscreteAttribute testAttribute;

	public Node() {
		this.data = new ArrayList&lt;Record&gt;();
		setEntropy(0.0);
		setParent(null);
		setChildren(null);
		setUsed(false);
		setTestAttribute(new DiscreteAttribute("", 0));
	}

	public void setParent(Node parent) {
		this.parent = parent;
	}

	public Node getParent() {
		return parent;
	}

	public void setData(ArrayList&lt;Record&gt; data) {
		this.data = data;
	}

	public ArrayList&lt;Record&gt; getData() {
		return data;
	}

	public void setEntropy(double entropy) {
		this.entropy = entropy;
	}

	public double getEntropy() {
		return entropy;
	}

	public void setChildren(Node[] children) {
		this.children = children;
	}

	public Node[] getChildren() {
		return children;
	}

	public void setUsed(boolean isUsed) {
		this.isUsed = isUsed;
	}

	public boolean isUsed() {
		return isUsed;
	}

	public void setTestAttribute(DiscreteAttribute testAttribute) {
		this.testAttribute = testAttribute;
	}

	public DiscreteAttribute getTestAttribute() {
		return testAttribute;
	}
}
</code></pre>
<p><strong>FileReader.java</strong> &#8211; The least interesting class in the code: it parses playtennis.data into a list of records.</p>
<pre><code>import java.io.*;
import java.util.ArrayList;
import java.util.StringTokenizer;

public class FileReader {
	public static final String PATH_TO_DATA_FILE = "playtennis.data";

    public static ArrayList&lt;Record&gt; buildRecords() {
		BufferedReader reader = null;
		DataInputStream dis = null;
		ArrayList&lt;Record&gt; records = new ArrayList&lt;Record&gt;();

        try { 
           File f = new File(PATH_TO_DATA_FILE);
           FileInputStream fis = new FileInputStream(f); 
           reader = new BufferedReader(new InputStreamReader(fis));
           
           // read the file one line at a time; each line is one record
           String line;
           Record r = null;
           ArrayList&lt;DiscreteAttribute&gt; attributes;
           while ((line = reader.readLine()) != null) {
              StringTokenizer st = new StringTokenizer(line, ",");
              attributes = new ArrayList&lt;DiscreteAttribute&gt;();
              r = new Record();
              
              if(Hw1.NUM_ATTRS != st.countTokens()) {
            	  throw new Exception("Unknown number of attributes!");
              }
              	
			  @SuppressWarnings("unused")
			  String day = st.nextToken();
			  String outlook = st.nextToken();
			  String temperature = st.nextToken();
			  String humidity = st.nextToken();
			  String wind = st.nextToken();
			  String playTennis = st.nextToken();
			  
			  if(outlook.equalsIgnoreCase("overcast")) {
				  attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Overcast));
			  }
			  else if(outlook.equalsIgnoreCase("sunny")) {
				  attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Sunny));
			  }
			  else if(outlook.equalsIgnoreCase("rain")) {
				  attributes.add(new DiscreteAttribute("Outlook", DiscreteAttribute.Rain));
			  }
			  
			  if(temperature.equalsIgnoreCase("hot")) {
				  attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Hot));
			  }
			  else if(temperature.equalsIgnoreCase("mild")) {
				  attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Mild));
			  }
			  else if(temperature.equalsIgnoreCase("cool")) {
				  attributes.add(new DiscreteAttribute("Temperature", DiscreteAttribute.Cool));
			  }
			  
			  if(humidity.equalsIgnoreCase("high")) {
				  attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.High));
			  }
			  else if(humidity.equalsIgnoreCase("normal")) {
				  attributes.add(new DiscreteAttribute("Humidity", DiscreteAttribute.Normal));
			  }
			  
			  if(wind.equalsIgnoreCase("weak")) {
				  attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Weak));

			  }
			  else if(wind.equalsIgnoreCase("strong")) {
				  attributes.add(new DiscreteAttribute("Wind", DiscreteAttribute.Strong));

			  }
			  
			  if(playTennis.equalsIgnoreCase("no")) {
				  attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayNo));
			  }
			  else if(playTennis.equalsIgnoreCase("yes")) {
				  attributes.add(new DiscreteAttribute("PlayTennis", DiscreteAttribute.PlayYes));
			  }
			    		    
			  r.setAttributes(attributes);
			  records.add(r);
           }

        } 
        catch (IOException e) { 
           System.out.println("Uh oh, got an IOException error: " + e.getMessage()); 
        } 
        catch (Exception e) {
            System.out.println("Uh oh, got an Exception error: " + e.getMessage()); 
        }
        finally { 
           if (reader != null) {
              try {
                 reader.close();
              } catch (IOException ioe) {
                 System.out.println("IOException error trying to close the file: " + ioe.getMessage()); 
              }
           }
        }
		return records;
	}
}</code></pre>
<p><strong>**EDIT:**</strong></p>
<p>playtennis.data is a simple text file that lists the training examples for the learned attribute &#8220;play tennis.&#8221;  I modeled my playtennis.data file on Tom Mitchell&#8217;s play tennis example in his book &#8220;Machine Learning.&#8221;  Each line describes one learning episode: a day label, <strong>outlook</strong> (<em>sunny, overcast, rain</em>), <strong>temperature</strong> (<em>hot, mild, cool</em>), <strong>humidity</strong> (<em>high, normal</em>), <strong>wind</strong> (<em>weak, strong</em>) and <strong>play tennis</strong> (<em>yes, no</em>).  Given this information, one can trace the decision tree to derive the learned attribute.  A sample decision tree is below.  All you have to do is create a comma-delimited text file with one line per episode, keeping the columns in the order that FileReader.java parses them, for example <code>D1,Sunny,Hot,High,Weak,No</code>.<br />
<a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/playtennis.png" rel="lightbox[67]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/playtennis.png" alt="" title="playtennis" width="355" height="265" class="aligncenter size-full wp-image-238" /></a></p>
<p><strong>**EDIT 2:**</strong></p>
<p>After so many requests, I have worked up a small playtennis.data file based on the Tom Mitchell &#8220;Machine Learning&#8221; book.  This follows his example data exactly and can be found <a href='http://blog.afewguyscoding.com/wp-content/uploads/2010/03/playtennis.txt'>here</a> (rename the file to &#8220;playtennis.data&#8221; before running, as WordPress wouldn&#8217;t let me upload a .data file due to &#8220;security restrictions&#8221;).  <strong>A small note of caution</strong>: the above code was put together very quickly for purposes of my own learning.  I am not claiming that it is by any means complete or fault tolerant; however, I do believe that the entropy and gain calculations are sound and correct.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/03/id3-decision-trees-java/feed/</wfw:commentRss>
		<slash:comments>88</slash:comments>
		</item>
		<item>
		<title>Message: HTML Parsing Error: Unable to modify the parent container element before the child element is closed (KB927917)</title>
		<link>http://afewguyscoding.com/2010/03/message-html-parsing-error-unable-to-modify-the-parent-container-element-before-the-child-element-is-closed-kb927917/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=message-html-parsing-error-unable-to-modify-the-parent-container-element-before-the-child-element-is-closed-kb927917</link>
		<comments>http://afewguyscoding.com/2010/03/message-html-parsing-error-unable-to-modify-the-parent-container-element-before-the-child-element-is-closed-kb927917/#comments</comments>
		<pubDate>Sat, 27 Mar 2010 20:31:01 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[dom]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[IE]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[KB927917]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=26</guid>
		<description><![CDATA[I recently ran across this problem when I was attempting to test some code that I was using for a client web site. I researched the error but there weren&#8217;t a whole lot of good articles on the problem. The error is fairly cryptic but here is the basic gist: the markup has some JavaScript [...]]]></description>
				<content:encoded><![CDATA[<p>I recently ran across this problem when I was attempting to test some code that I was using for a client web site.  I researched the error but there weren&#8217;t a whole lot of good articles on the problem.  The error is fairly cryptic but here is the basic gist: the markup has some JavaScript code that tries to modify the parent node in the DOM from a child node before the child is finished loading. Here&#8217;s a short example of how to trigger the error:</p>
<p><code>&lt;div id="parent"&gt;</code><br />
&nbsp;&nbsp;<code>&lt;div id="child"&gt;Child node content</code><br />
    &nbsp;&nbsp;&nbsp;&nbsp;<code>&lt;script&gt;</code><br />
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<code>document.getElementById('parent').innerHTML = 'New child';</code><br />
    &nbsp;&nbsp;&nbsp;&nbsp;<code>&lt;/script&gt;</code><br />
 &nbsp;&nbsp; <code>&lt;/div&gt;</code><br />
<code>&lt;/div&gt; </code></p>
<p>The content will still be replaced but with a warning in the console. To get rid of the warning just move the script out to the parent container:</p>
<p><code>&lt;div id="parent"&gt;</code><br />
  &nbsp;&nbsp;<code>&lt;div id="child"&gt;Same child node content&lt;/div&gt;</code><br />
  &nbsp;&nbsp;<code>&lt;script&gt;</code><br />
    &nbsp;&nbsp;&nbsp;&nbsp;<code>document.getElementById('parent').innerHTML = 'New child';</code><br />
  &nbsp;&nbsp;<code>&lt;/script&gt;</code><br />
<code>&lt;/div&gt;  </code></p>
<p>The problem I was having was related to writing out the markup associated with <a href="http://static.railstips.org/orderedlist/demos/fancy-zoom-jquery/">FancyZoom</a> div content.  The same effect can be achieved by using body.onload or jQuery(document).ready, although I did notice that if you use <em>onload()</em> or <em>.ready</em> and the page is not cached locally, you might have a FOUC (<strong>f</strong>lash <strong>o</strong>f <strong>u</strong>nstyled <strong>c</strong>ontent).  </p>
<p>If you want more information about this problem head over to <a href="http://support.microsoft.com/default.aspx/kb/927917">Microsoft&#8217;s support site</a>.  </p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/03/message-html-parsing-error-unable-to-modify-the-parent-container-element-before-the-child-element-is-closed-kb927917/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Decision Tree Learning Acting As A Cardiologist</title>
		<link>http://afewguyscoding.com/2010/03/decision-tree-learning-acting-as-a-cardiologist/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=decision-tree-learning-acting-as-a-cardiologist</link>
		<comments>http://afewguyscoding.com/2010/03/decision-tree-learning-acting-as-a-cardiologist/#comments</comments>
		<pubDate>Sun, 21 Mar 2010 23:37:05 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[decision trees]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[myocardial infarction]]></category>
		<category><![CDATA[supervised learning]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=12</guid>
		<description><![CDATA[Abstract—Decision trees are one of the most widely used methods of inductive inference.  They can be used to learn discrete or continuous valued hypotheses and create compact rules for evaluation of a set of data.  An advantage of decision  and regression trees is that they are robust to noisy data, which makes them perfectly suited [...]]]></description>
				<content:encoded><![CDATA[<p><strong><em>Abstract</em>—Decision trees are one of the most widely used methods of inductive inference.  They can be used to learn discrete or continuous valued hypotheses and create compact rules for evaluation of a set of data.  An advantage of decision  and regression trees is that they are robust to noisy data, which makes them perfectly suited to be able to predict whether a heart attack patient will be alive one year after his incident where all the data may not be available.  This paper is a survey of some of the methods of constructing and evaluating decision trees.</strong></p>
<h3><strong>1.  INTRODUCTION</strong></h3>
<p>A myocardial infarction, commonly referred to as a heart attack, is a serious medical condition where blood vessels that supply blood to the heart are blocked, preventing enough oxygen from getting to the heart. The heart muscle dies from the lack of oxygen, which impairs heart function or kills the patient. Heart attacks are positively correlated with diabetes, smoking, high blood pressure, obesity, age and alcohol consumption. While prognosis varies greatly based on underlying personal health, the extent of damage and the treatment given, for the period of 2005-2008 in the United States, the median mortality rate at 30 days was 16.6% with a range from 10.9% to 24.9% depending on the admitting hospital.[9] For patients who survive the heart attack itself, the survival rate at the one-year mark is 96%.</p>
<p>Physicians would like to be able to tell their patients their possible rate of survival and predict whether or not a certain treatment could help the patient. In order to make that prediction, we can use decision trees to model whether or not the patient has a good chance of survival. Using regression trees, we can map the input space into a real-valued domain using attributes that cardiologists could examine to determine the patient’s chances of survival after a given timeframe.</p>
<p>In building a decision tree, we use the most influential attribute values to represent the internal nodes of the tree at each level.  Each internal node tests an attribute value, each edge corresponds to an attribute value and each leaf node leads to a classification – in our case, deceased or alive.  We are able to traverse the tree from the root to classify an unseen example.  The tree can also be expressed in the form of simple rules.  This would be helpful when explaining the prognosis to the patient.</p>
<h3><strong>1.1  OUTLINE OF RESEARCH</strong></h3>
<p>In this research survey, we implemented a decision tree using an adapted ID3 algorithm.  We evaluated different splitting criteria as well as different approaches to handling missing attributes in the dataset.  In addition, we considered different approaches to handling continuous valued attributes and methods to reduce decision tree over-fitting.  Lastly, results are discussed in section 4 after running the decision tree multiple times with the echocardiogram dataset.</p>
<h3>1.2  DATA</h3>
<p>The data that we used is from the University of California at Irvine machine learning repository.  The dataset that we chose was the 1989 echocardiogram dataset.  This dataset contained 132 rows, with 12 attributes, 8 of which were actually usable for decision tree construction (the remaining four were references for the original contributor of this dataset).  This dataset had missing values and all of the attributes were continuous valued.</p>
<p>The data described different measurements of patients who had suffered from acute myocardial infarction at some point in the past.  The attributes included “<em>AGE-AT-HEART-ATTACK</em>” (the patients’ age in years when the heart attack happened), “<em>PERICARDIAL-EFFUSION</em>” (binary choice relative to fluid around the heart), “<em>FRACTIONAL-SHORTENING</em>” (a measure of contractility around the heart where lower numbers are abnormal), “<em>EPSS</em>” (E-point septal separation which is another measure of contractility where larger numbers are abnormal), “<em>LVDD</em>” (left ventricular end-diastolic dimension, where larger numbers are abnormal), “<em>WALL-MOTION-INDEX</em>” (a measure of how many segments of the left ventricle are seen moving divided by the number of segments seen) and “<em>ALIVE-AT-ONE</em>” (a binary choice where 0 represents deceased or unknown and 1 is alive at one year).</p>
<p>It is important to note that not all rows could be used for learning.  Two attributes, “<em>SURVIVAL</em>” and “<em>STILL-ALIVE</em>” must be analyzed as a set.  <em>SURVIVAL</em> described the number of months the patient had survived since the heart attack.  Some of the rows described patients who survived less than a year and are still alive based on <em>STILL-ALIVE</em> (a binary attribute, 0 representing deceased and 1 representing alive).  These patients cannot be used for prediction.</p>
<p>It has previously been noted that “the problem addressed by past researchers was to predict from other variables whether or not the patient will survive at least one year.  The most difficult part of this problem is correctly predicting that the patient will <em>not</em> survive.  This difficulty seems to stem from the size of the dataset.” [1]  In implementing the decision tree logic, we have found that this is the case as well.</p>
<h3>2.  APPLICATION</h3>
<p>We chose to write the implementation of the decision tree in Java because of the ease of use of the language.  Java also presented superior capabilities in working with and parsing data from files.  Using Java allowed the author to more efficiently model the problem through the use of OO concepts, such as polymorphism and inheritance.</p>
<h3>2.1  DECISION TREE CONSTRUCTION</h3>
<p>The decision tree is constructed using the ID3 algorithm originally described by Quinlan [4] and shown in Mitchell [2], with an adaptation by the author to handle numeric, continuous valued attributes, missing attributes and pruning.  ID3 is a simple decision tree algorithm.  The algorithm is shown below:</p>
<pre>ID3 (Examples, Target_Attribute, Attributes)</pre>
<ul>
<li>
<pre>Create a root node for the tree</pre>
</li>
<li>
<pre>If all Examples are positive, return the single-node tree root, with label = +.</pre>
</li>
<li>
<pre>If all Examples are negative, return the single-node tree root, with label = -.</pre>
</li>
<li>
<pre>If the set of predicting Attributes is empty, then return the single-node tree root, with label = most common
value of the target attribute in the examples.</pre>
</li>
<li>
<pre>Otherwise Begin</pre>
<ul>
<li>
<pre>A = The Attribute that best classifies Examples.</pre>
</li>
<li>
<pre>Decision tree attribute for root = A.</pre>
</li>
<li>
<pre>For each possible value, vi, of A</pre>
<ul>
<li>
<pre>Add a new tree branch below root, corresponding to the test A = vi.</pre>
</li>
<li>
<pre>Let Examples(vi) be the subset of examples that have the value vi for A</pre>
</li>
<li>
<pre>If Examples(vi) is empty</pre>
<ul>
<li>
<pre>Below this new branch, add a leaf node with the label = most common target value in the examples.</pre>
</li>
</ul>
</li>
<li>
<pre>Else</pre>
<ul>
<li>
<pre>Below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes - {A})</pre>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>
<pre>End</pre>
</li>
<li>
<pre>Return root</pre>
</li>
</ul>
<p>The ID3 algorithm “uses information gain as splitting criteria and growing stops when all instances belong to a single value of target feature or when best information gain is not greater than zero.” [6]  For further information on ID3, see [2], [4], [6], [7].</p>
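<p>The recursion above can be sketched in Java as follows.  This is a minimal illustration rather than the implementation used in this paper: it assumes discrete attributes encoded as small integers, a 0/1 class label in the last column, and hypothetical names (<em>Id3Sketch</em>, <em>TreeNode</em>, <em>build</em>).  Growth stops when a node is pure or when no unused attribute has a positive information gain.</p>

```java
import java.util.Arrays;

final class Id3Sketch {
    static final class TreeNode {
        int attribute = -1;   // index of the attribute tested here; -1 marks a leaf
        int label;            // majority class (0 or 1) of the examples at this node
        TreeNode[] children;  // one child per value of the tested attribute
    }

    // Each row holds one discrete value per attribute, followed by the 0/1 class label.
    // valueCounts[a] is the number of distinct values attribute a can take.
    static TreeNode build(int[][] rows, int[] valueCounts, boolean[] used) {
        int labelCol = valueCounts.length;
        int positives = 0;
        for (int[] row : rows) positives += row[labelCol];

        TreeNode node = new TreeNode();
        node.label = (2 * positives >= rows.length) ? 1 : 0;
        if (positives == 0 || positives == rows.length) return node;  // pure node

        // Pick the unused attribute with the largest positive information gain.
        int best = -1;
        double bestGain = 0.0;
        for (int a = 0; a != valueCounts.length; a++) {
            if (used[a]) continue;
            double g = gain(rows, a, valueCounts[a], labelCol);
            if (g > bestGain) { bestGain = g; best = a; }
        }
        if (best == -1) return node;  // nothing useful left to split on

        node.attribute = best;
        node.children = new TreeNode[valueCounts[best]];
        boolean[] nextUsed = used.clone();
        nextUsed[best] = true;
        for (int v = 0; v != valueCounts[best]; v++) {
            final int attr = best, value = v;
            int[][] subset = Arrays.stream(rows)
                    .filter(r -> r[attr] == value).toArray(int[][]::new);
            if (subset.length == 0) {
                node.children[v] = new TreeNode();     // empty branch becomes a leaf
                node.children[v].label = node.label;   // labelled with the parent majority
            } else {
                node.children[v] = build(subset, valueCounts, nextUsed);
            }
        }
        return node;
    }

    // Entropy of a set with pos positive examples out of total.
    static double entropy(int pos, int total) {
        if (pos == 0 || pos == total) return 0.0;
        double p = (double) pos / total;
        return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.log(2);
    }

    // Parent entropy minus the size-weighted entropy of each value's subset.
    static double gain(int[][] rows, int attr, int values, int labelCol) {
        int[] size = new int[values];
        int[] pos = new int[values];
        int totalPos = 0;
        for (int[] row : rows) {
            size[row[attr]]++;
            pos[row[attr]] += row[labelCol];
            totalPos += row[labelCol];
        }
        double g = entropy(totalPos, rows.length);
        for (int v = 0; v != values; v++) {
            if (size[v] == 0) continue;
            g -= (double) size[v] / rows.length * entropy(pos[v], size[v]);
        }
        return g;
    }
}
```

<p>Called as <em>Id3Sketch.build(rows, valueCounts, new boolean[valueCounts.length])</em>, the sketch returns the root of the tree; a node whose <em>attribute</em> is -1 is a leaf whose <em>label</em> is the prediction.</p>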
<h3>2.2  SPLITTING CRITERION</h3>
<p>A decision tree is formed by having some concrete concept of splitting data into subsets which form children nodes of the parent.  In our implementation of the decision tree, we decided to use a univariate impurity-based splitting criterion called entropy.</p>
<p>Entropy is the measure of impurity or chaos in the data.  If all elements in a set of data belong to the same class, the entropy is zero, and if the elements are evenly split between two classes, the entropy is one.  Entropy may be measured with the following equation:</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/entropy-equation.png" rel="lightbox[12]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/entropy-equation.png" alt="" title="entropy-equation" width="356" height="40" class="aligncenter size-full wp-image-46" /></a></p>
<p>where <img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/p-plus.png" alt="" title="p-plus" width="19" height="18" class="alignnone size-full wp-image-48" /> is the proportion of positive training examples in T and <img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/p-plus.png" alt="" title="p-plus" width="19" height="18" class="alignnone size-full wp-image-48" /> is the proportion of negative training examples in T.  For further discussion on entropy, see Mitchell [2] or Alpaydin [3].</p>
<p>The ID3 algorithm uses this measure of entropy to compute the gain of each attribute A: the expected reduction in entropy due to splitting on A, that is, the difference between the entropy before the split and the size-weighted entropy of the subsets after it.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/gain-equation.png" rel="lightbox[12]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/gain-equation.png" alt="" title="gain-equation" width="356" height="69" class="aligncenter size-full wp-image-47" /></a></p>
<p>At each level of the tree, the “best” attributes can be found to create the maximum reduction in entropy, called information gain.  These two calculations represent the preference to create the smallest tree possible for several different reasons – mainly that a short hypothesis that accurately describes the data is unlikely to be coincidence.  For further discussion on entropy and gain, see Mitchell [2], Alpaydin [3], Steinberg [5] and Lior <em>et al</em> [6].</p>
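<p>As a worked example, the two calculations can be sketched in a few lines of Java.  The class and method names are hypothetical, and the counts come from the PlayTennis example in Mitchell [2] (9 positive and 5 negative examples overall; Wind = Weak covers 6 positive and 2 negative, Wind = Strong covers 3 of each) rather than from the echocardiogram data.</p>

```java
final class GainSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Entropy of a set with pos positive and neg negative examples.
    static double entropy(int pos, int neg) {
        if (pos == 0 || neg == 0) return 0.0;  // a pure (or empty) set has no impurity
        double p = (double) pos / (pos + neg);
        return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
    }

    // Gain of a split: parent entropy minus the size-weighted entropy of each subset.
    // subsets[v] = {positives, negatives} among the examples that take value v.
    static double gain(int[][] subsets) {
        int pos = 0, neg = 0;
        for (int[] s : subsets) { pos += s[0]; neg += s[1]; }
        double g = entropy(pos, neg);
        for (int[] s : subsets) {
            int size = s[0] + s[1];
            if (size == 0) continue;
            g -= (double) size / (pos + neg) * entropy(s[0], s[1]);
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.printf("Entropy(S)    = %.3f%n", entropy(9, 5));
        System.out.printf("Gain(S, Wind) = %.3f%n", gain(new int[][] {{6, 2}, {3, 3}}));
    }
}
```

<p>For these counts the entropy of the full set is roughly 0.940 and the gain from splitting on Wind is roughly 0.048, which agrees with the values Mitchell reports for this example.</p>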
<h3>2.3  MISSING VALUES</h3>
<p>In the dataset, there are many missing values.  The missing values show up mainly for the attributes <em>ALIVE-AT-ONE</em> (43.2%), <em>EPSS</em> (10.6%) and <em>LVDD</em> (7.5%).  In addition, 66.3% of the rows had at least one missing attribute value.</p>
<p>Clearly, this dataset is not ideal for predicting attributes.  Fortunately, decision trees are robust to noisy data.  The strategy of replacing the missing attribute values with the most common value among training examples was suggested in [2].  We decided to implement this idea and initialize the missing values with a surrogate value.  The surrogate values were calculated using the average of all observed values of the attribute for continuous-valued data, or the most common attribute value for discrete-valued data.  The replacement was done only after all the data had been read in from the dataset, instead of using a moving average while the data was still being read.</p>
<p>There are two main reasons we chose this method.  First, this is an extremely simple method that doesn’t require a lot of calculation.  Since decision trees cannot back up once they have made a splitting decision, we would never have to worry about the data changing without needing to regenerate the tree.  Second, this allowed the program to describe a finer grain representation of the true average value of the particular attribute for the global dataset.</p>
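<p>A minimal sketch of the surrogate-value step for one continuous attribute is shown below.  It assumes that missing entries have been read in as <em>Double.NaN</em>; that encoding and the names <em>ImputeSketch</em> and <em>fillWithMean</em> are illustrative choices, not part of the original program.</p>

```java
final class ImputeSketch {
    // Replace every missing entry (Double.NaN) in one attribute column with the
    // mean of the observed entries, computed once over the whole column.
    static double[] fillWithMean(double[] column) {
        double sum = 0.0;
        int observed = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        double[] filled = column.clone();
        if (observed == 0) return filled;  // nothing to average; leave the column as-is
        double mean = sum / observed;
        for (int i = 0; i != filled.length; i++) {
            if (Double.isNaN(filled[i])) filled[i] = mean;
        }
        return filled;
    }
}
```

<p>Because the mean is computed only after the whole column is available, every missing entry receives the same surrogate, which is the global average described above rather than a moving one.</p>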
<h3>2.4  HANDLING CONTINUOUS AND DISCRETE ATTRIBUTES</h3>
<p>In ID3, handling discrete attributes is quite simple.  For each attribute value, a child node is created and that branch is taken when the tree is traversed to evaluate a test set.  To handle continuous data, some partitioning of the attribute values must take place to discretize the possible range.  For example, suppose we had the attributes shown in the following table:</p>
<table style="border: 1px solid #000;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top"><strong>PlayTennis</strong></td>
<td valign="top"><strong>Temperature</strong></td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">No</td>
<td valign="top">40</td>
</tr>
<tr>
<td valign="top">No</td>
<td valign="top">48</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">Yes</td>
<td valign="top">60</td>
</tr>
<tr>
<td valign="top">Yes</td>
<td valign="top">72</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">Yes</td>
<td valign="top">80</td>
</tr>
<tr>
<td valign="top">No</td>
<td valign="top">90</td>
</tr>
</tbody>
</table>
<p>We may consider creating a boolean attribute from a threshold placed midway between adjacent values whose classification differs, such as <em>Temperature</em> &gt; (48 + 60) / 2 = 54 or <em>Temperature</em> &gt; (80 + 90) / 2 = 85; each such test produces a true or false value.</p>
<p>For our implementation, we decided to take the average of all of the values of an attribute and make that the split value: all instances with a value less than the average went to the left node and all instances with a value greater than the average went to the right node.  We chose this approach because it was simpler to implement than other methods, which include binary search, statistical search and weighted searches. [2] [6]  This causes our tree to look closer in form to a CART tree than an ID3 tree, as CART can only produce binary trees.  CART uses this approach of a single partitioning value when using univariate splitting criteria. [5]  The main reason that we are able to use this approach is that the data becomes increasingly “abnormal” as the value grows.</p>
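<p>The mean split can be sketched as below, using the <em>Temperature</em> column from the table above as sample input.  The names are hypothetical, and sending values exactly equal to the mean to the left is an assumption, since the text does not say which side they take.</p>

```java
final class MeanSplitSketch {
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // right[i] is true when row i goes to the right child (value above the mean);
    // everything else, including values equal to the mean, goes to the left child.
    static boolean[] goesRight(double[] values) {
        double threshold = mean(values);
        boolean[] right = new boolean[values.length];
        for (int i = 0; i != values.length; i++) {
            right[i] = values[i] > threshold;
        }
        return right;
    }

    public static void main(String[] args) {
        double[] temperature = {40, 48, 60, 72, 80, 90};
        System.out.println("split at " + mean(temperature));
        System.out.println(java.util.Arrays.toString(goesRight(temperature)));
    }
}
```

<p>For this column the mean is 65, so 40, 48 and 60 fall to the left and 72, 80 and 90 fall to the right.  The outlier sensitivity of this choice is discussed in section 3.</p>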
<h3>2.5  TESTING</h3>
<p>Testing of the decision tree logic was carried out using the PlayTennis example shown in [2].  This example used only discrete values.  After verifying the decision tree could accurately classify these examples, the program was adapted to use continuous valued attributes.</p>
<p>Testing was done with a <em>3-fold cross validation</em> method, where the data was divided into training and test sets such that | <em>R</em> | = <em>k</em> &times; | <em>S</em> |, where <em>R</em> is the training set, <em>k</em> is the relative size and <em>S</em> is the test set.</p>
<p>For each test run, we chose 88 training examples and 44 test examples at random. We attempted to keep the labeled classes in the training set as balanced as possible by setting a threshold <em>n</em>. This threshold prevented the class counts from differing by more than <em>n</em>. If the threshold <em>n</em> was ever reached, we discarded random selections until we were under the threshold again in order to balance the classes.  The results are discussed in section 4.</p>
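<p>A random 88/44 partition of the 132 rows can be sketched as follows.  This only shows the shuffling step under assumed names (<em>SplitSketch</em>, <em>shuffledIndices</em>) and an arbitrary seed; the class-balancing threshold <em>n</em> described above is not modeled here.</p>

```java
import java.util.Arrays;
import java.util.Random;

final class SplitSketch {
    // Shuffle the row indices 0..n-1 with a Fisher-Yates pass and return them.
    // The caller takes the first entries as the training set and the rest as the test set.
    static int[] shuffledIndices(int n, long seed) {
        int[] idx = new int[n];
        for (int i = 0; i != n; i++) idx[i] = i;
        Random rng = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);  // uniform position in 0..i
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return idx;
    }

    public static void main(String[] args) {
        int[] idx = shuffledIndices(132, 42L);
        int[] train = Arrays.copyOfRange(idx, 0, 88);
        int[] test = Arrays.copyOfRange(idx, 88, 132);
        System.out.println(train.length + " training rows, " + test.length + " test rows");
    }
}
```

<p>Drawing both sets from a single shuffle guarantees that no row appears in both the training set and the test set.</p>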
<h3>2.6  DECISION TREE PRUNING</h3>
<p>In the process of building the decision tree, the accuracy of the tree is determined by the training examples.  “However, measured over an independent set of training examples, the accuracy first increases and then decreases.” [2]  This is an instance of over-fitting the data.  To prevent this over-fitting condition, the tree is evaluated and then cut back to the “essential” nodes such that the accuracy does not decrease with real-world training examples.</p>
<p>In our implementation, we had no stopping criteria to prevent over-fitting while the tree was being built.  We let the over-fitting occur and then used a post-pruning method called Reduced-Error pruning, as described by Quinlan. [7]  In this algorithm, the tree nodes are traversed from bottom to top, and at each node the procedure checks whether replacing it with the most frequent class improves the accuracy of the decision tree.  Pruning continues until accuracy decreases and the algorithm ends with the smallest accurate decision tree.</p>
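<p>The idea can be sketched on a small binary tree as shown below.  This is a simplified single bottom-up pass, not the code used for the results in section 4; it assumes a separate pruning set and that every internal node already stores the majority class of its training examples in <em>label</em>.  All names are hypothetical.</p>

```java
final class PruneSketch {
    static final class TreeNode {
        int attribute;         // attribute tested at an internal node
        double threshold;      // values above the threshold go right
        TreeNode left, right;  // both null at a leaf
        int label;             // leaf prediction, or majority training class otherwise
    }

    static int predict(TreeNode node, double[] row) {
        while (node.left != null) {
            node = row[node.attribute] > node.threshold ? node.right : node.left;
        }
        return node.label;
    }

    // Number of pruning-set rows the whole tree classifies correctly.
    static int correct(TreeNode root, double[][] rows, int[] labels) {
        int hits = 0;
        for (int i = 0; i != rows.length; i++) {
            if (predict(root, rows[i]) == labels[i]) hits++;
        }
        return hits;
    }

    // Bottom-up: prune the children first, then try turning this node into a leaf
    // and keep the change unless accuracy on the pruning set drops.
    static void prune(TreeNode node, TreeNode root, double[][] rows, int[] labels) {
        if (node.left == null) return;
        prune(node.left, root, rows, labels);
        prune(node.right, root, rows, labels);
        int before = correct(root, rows, labels);
        TreeNode left = node.left, right = node.right;
        node.left = null;
        node.right = null;
        if (before > correct(root, rows, labels)) {
            node.left = left;    // the leaf version is worse, so restore the subtree
            node.right = right;
        }
    }
}
```

<p>Ties are resolved in favor of pruning, which matches the preference for the smallest tree that is still accurate.</p>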
<p>Further discussion of pruning may be found in [2], [3], [5] and [6].</p>
<h3>3.  POTENTIAL PROBLEMS</h3>
<p>There are some problems that we encountered during implementation.  The first problem is that ID3 does not natively support missing attributes, numeric attributes or pruning.  The algorithm had to be adapted to support these features.  In adapting the algorithm, less than efficient methods were chosen.  In our implementation, we split at the mean of the attribute values.  This causes the surrogate values and the split points to be the same value.  In addition, this method does not handle outliers in the data well and forces everything towards the center of the sample set.  Implementing a better search algorithm or multivariate splitting might improve accuracy.  Another alternative would be to use the C4.5 algorithm [8], an evolution of ID3 that adds full support for these requirements.</p>
<p>Another problem that we experienced is having enough training data without missing attributes to build an effective decision tree.  With the large missing attribute rate, it would be hard to get a good handle on any sort of trends in the dataset.  A possible solution may be to use a weighted method to assign the most probable value at the point where we encounter the missing value.</p>
<h3>4.  EVALUATION</h3>
<p>For the evaluation of the results, we have chosen not to assign a value to either outcome: a result of <em>ALIVE-AT-ONE</em> equal to 1 is treated as positive, and a result of <em>ALIVE-AT-ONE</em> equal to 0 is treated as positive as well rather than negative.  One may choose to assign intrinsic values to the instances, but one could easily evaluate these results without them.  We have chosen to focus only on the results rather than make a determination of the “goodness” of a particular outcome.  Since we have chosen this method of evaluation, we would not have any false positive or true negative results; therefore, we will not be reporting any precision calculations.</p>
<h3>4.1  MACRO EVALUATION</h3>
<p>In 10,000 independent runs using random subsets of data for each test, the overall recall for the decision tree was 66.82%.  The average F-measure for these runs was 0.38.  See <em>figure 4.1</em>.  In the majority of the runs, the most influential attributes were <em>SURVIVAL, EPSS</em> and <em>LVDD</em>; each of these attributes appeared in 100% of the decision trees created.</p>
<h3>4.2  MICRO EVALUATION</h3>
<p>When broken down into smaller sets of 1,000 runs, the recall and F-measure vary greatly.  Across the 10 sets, recall ranged from 27.27% to 93.18%.  The median recall was 84.09%.</p>
<table style="border: 1px solid #000;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top"><strong>Run</strong></td>
<td valign="top"><strong>Recall</strong></td>
<td valign="top"><strong>F-measure</strong></td>
</tr>
<tr>
<td valign="top">1</td>
<td valign="top">90.90%</td>
<td valign="top">0.47</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">2</td>
<td valign="top">84.09%</td>
<td valign="top">0.46</td>
</tr>
<tr>
<td valign="top">3</td>
<td valign="top">93.18%</td>
<td valign="top">0.48</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">4</td>
<td valign="top">88.63%</td>
<td valign="top">0.47</td>
</tr>
<tr>
<td valign="top">5</td>
<td valign="top">31.81%</td>
<td valign="top">0.24</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">6</td>
<td valign="top">84.09%</td>
<td valign="top">0.46</td>
</tr>
<tr>
<td valign="top">7</td>
<td valign="top">45.45%</td>
<td valign="top">0.31</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">8</td>
<td valign="top">27.27%</td>
<td valign="top">0.21</td>
</tr>
<tr>
<td valign="top">9</td>
<td valign="top">38.63%</td>
<td valign="top">0.28</td>
</tr>
<tr bgcolor="#ccc">
<td valign="top">10</td>
<td valign="top">88.63%</td>
<td valign="top">0.47</td>
</tr>
<tr>
<td valign="top">Overall</td>
<td valign="top">66.82%</td>
<td valign="top">0.38</td>
</tr>
</tbody>
</table>
<h6><em>Figure 4.1</em></h6>
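<p>The range and median quoted for the ten sets can be checked directly from the recall column of <em>figure 4.1</em>.  The short sketch below does so with Python's standard library; the variable names are ours and the values are simply copied from the table.</p>

```python
from statistics import median

# Recall (in percent) for sets 1 through 10, copied from figure 4.1.
recalls = [90.90, 84.09, 93.18, 88.63, 31.81, 84.09, 45.45, 27.27, 38.63, 88.63]

lowest = min(recalls)
highest = max(recalls)
middle = median(recalls)   # mean of the two central values for an even count
spread = highest - lowest  # gap between the best and the worst set

print(f"min={lowest:.2f}% max={highest:.2f}% median={middle:.2f}% spread={spread:.2f}")
```

<p>The median sits well above the overall recall because four of the ten sets fall far below the rest, which pulls the average down without moving the middle of the distribution.</p>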
<p>There is a large gap between our maximum and our minimum recall.  This can be attributed to several issues, including poor data in the data set and less than efficient splitting choices.  The accuracy of the decision tree depends partly on the data with which you decide to train.  The data that we used was missing many attribute values, and almost half (43.2%) of those missing values were for the target attribute.  In the absence of a value, the most common value was substituted, which in the case of this data set would heavily skew the results towards predicting death.  It should be noted that overly pessimistic results and overly optimistic results can each present their own dangers to the patient.</p>
<p>Another improvement to the results may come in the form of changing the policy on attribute splitting and missing value handling.  If a more accurate method of splitting were implemented (multivariate criterion, using a better search method, <em>etc</em>.), we would expect to see a more consistent result set.</p>
<p><a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/decision-tree-cardiology-line-graph.png" rel="lightbox[12]"><img src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/decision-tree-cardiology-line-graph.png" alt="" title="decision-tree-cardiology-line-graph" width="343" height="210" class="aligncenter size-full wp-image-45" /></a></p>
<p>Based on these results, we would feel fairly confident that we could make a useful prediction, but only if each prediction came with a confidence rating, given the high variability between runs.</p>
<p>While not highly accurate, these results could provide good insight into which attributes are the most important with regard to classification.  In other words, the decision tree would describe the attributes that should be used for other machine learning programs, such as clustering, artificial neural networks or support vector machines.  We could be confident that these attributes are the most important because they were chosen through entropy and gain calculations while constructing the decision tree.</p>
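<p>For readers unfamiliar with those calculations, the sketch below shows the standard entropy and information gain definitions used by ID3-style decision trees.  It is a minimal illustration, not our actual implementation; the function names and the toy attribute names <em>a</em>, <em>b</em> and <em>t</em> are hypothetical.</p>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    labels = [ex[target] for ex in examples]
    # Group the target labels by the value the attribute takes.
    partitions = {}
    for ex in examples:
        partitions.setdefault(ex[attribute], []).append(ex[target])
    # Expected entropy after the split, weighted by partition size.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Toy data: "a" separates the classes perfectly, "b" tells us nothing.
toy = [
    {"a": "x", "b": "p", "t": 1},
    {"a": "x", "b": "q", "t": 1},
    {"a": "y", "b": "p", "t": 0},
    {"a": "y", "b": "q", "t": 0},
]
gain_a = information_gain(toy, "a", "t")
gain_b = information_gain(toy, "b", "t")
```

<p>At each node the attribute with the highest gain is chosen for the split, which is why attributes that appear in every tree are reasonable candidates to carry over to other learners.</p>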
<h3>5.  CONCLUSION</h3>
<p>In this research project, different methods of constructing decision and regression trees were explored.  Additionally, different methods of node splitting, missing attribute substitution and tree pruning were investigated.  While the results of this project show an overall recall of only 66.82%, decision trees are still a valid machine learning technique.  With augmented decision logic and a better data set, decision trees may be able to predict discrete or continuous data at a much better rate.</p>
<h4>References</h4>
<ol>
<li>Salzberg, Stephen.  University of California, Irvine Machine Learning Data Repository.  1989.  [Online]. http://archive.ics.uci.edu/ml/datasets/Echocardiogram</li>
<li>Mitchell, Tom M.  <em>Machine Learning</em>.  WCB McGraw-Hill, Boston, MA.  1997.</li>
<li>Alpaydin, Ethem.  <em>Introduction to Machine Learning</em>, Second Edition.  The MIT Press, Cambridge, MA.  2010.</li>
<li>Quinlan, J. R. <em>Induction of Decision Trees</em>. Mach. Learn. 1, 1 (Mar. 1986), 81-106.</li>
<li>Steinberg, Dan.  <em>CART: Classification and Regression Trees.</em> Taylor and Francis Group. pp 179-201, 2009.</li>
<li>Rokach, Lior and Maimon, Oded. <em>Top-Down Induction of Decision Trees Classifiers &#8211; A Survey</em>.  IEEE Transactions on Systems, Man and Cybernetics &#8211; Part C: Applications and Reviews.  Vol. 35, No. 4.  pp 476-487, November 2005.</li>
<li>Quinlan, J.R.  <em>Simplifying Decision Trees</em>. International Journal of Man-Machine Studies.  Vol. 27, pp. 221-234, 1987.</li>
<li>Quinlan, J.R.  <em>C4.5: Programs for Machine Learning</em>.  San Francisco, CA: Morgan Kaufmann, 1993.</li>
<li>Krumholz H <em>et al</em>. <em>Patterns of hospital performance in acute myocardial infarction and heart failure &#8211; 30-day mortality and readmission</em>. Circulation: Cardiovascular Quality and Outcomes. [Online] http://circoutcomes.ahajournals.org/cgi/content/abstract/2/5/407. 2009.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/03/decision-tree-learning-acting-as-a-cardiologist/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Why Airlines Got Mobile Right</title>
		<link>http://afewguyscoding.com/2010/03/why-united-airlines-got-mobile-right/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=why-united-airlines-got-mobile-right</link>
		<comments>http://afewguyscoding.com/2010/03/why-united-airlines-got-mobile-right/#comments</comments>
		<pubDate>Mon, 15 Mar 2010 20:45:23 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[airlines]]></category>
		<category><![CDATA[iPhone]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=14</guid>
		<description><![CDATA[Air travel just got slightly more tolerable.  Last week, United Airlines launched their wireless check-in and paperless boarding pass service.  When I saw this, I was slightly skeptical because I normally expect more progressive companies such as Southwest or Virgin America to launch services similar to this.  American Airlines and United only recently added WiFi [...]]]></description>
				<content:encoded><![CDATA[<p>Air travel just got <em>slightly</em> more tolerable.  Last week, United Airlines launched their wireless check-in and paperless boarding pass service.  When I saw this, I<a href="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/mobile.jpg" rel="lightbox[14]"><img class="alignright size-full wp-image-23" title="United Airlines Mobile Check-in" src="http://blog.afewguyscoding.com/wp-content/uploads/2010/03/mobile.jpg" alt="" width="184" height="184" /></a> was slightly skeptical because I normally expect more progressive companies such as Southwest or Virgin America to launch services like this.  American Airlines and United only recently added WiFi to their routes (it is available on about 3 United routes, and most of American&#8217;s MD-80s have GoGo InFlight).  At certain airports (see below), you can use your smartphone to check in and grab your boarding pass.</p>
<ol>
<li>Enter <a href="http://mobile.united.com/">mobile.united.com</a> into the browser of your mobile device to check in for any flight within the U.S. starting 24 hours before scheduled departure.</li>
<li>If you are flying through one of the airports listed below, you can select to receive your boarding pass via email. Otherwise, visit any EasyCheck-in® kiosk to print your boarding pass at the airport.</li>
<li>To use your device as your boarding pass, scan the barcode on your screen at security and at the gate when boarding.</li>
</ol>
<p>You&#8217;ll also be able to check flight status and get up-to-the-minute arrival and departure information, including times and gate information at the following airports (with United):</p>
<ul>
<li>Chicago O&#8217;Hare (ORD)</li>
<li>Dallas &#8211; Fort Worth (DFW)</li>
<li>Denver (DEN)</li>
<li>Las Vegas (LAS)</li>
<li>Los Angeles (LAX)</li>
<li>New York LaGuardia (LGA)</li>
<li>San Francisco (SFO)</li>
<li>Washington Dulles (IAD)</li>
</ul>
<h3><span style="color: #800000;">Why Is This Awesome?</span></h3>
<ul>
<li><strong>It is so easy!</strong> Being able to check in for your flight on the way to the airport and walk past the ticket counters without ever needing to print a boarding pass will quickly become my favorite part of flying.  The whole idea of printing your boarding pass at home for your &#8220;convenience&#8221; is really not a &#8220;convenience&#8221; for most customers.  It is a cost-saving measure by the airlines.  There is less interaction required with customers, and fewer paper and ink tickets going out that the airline has to pay for.</li>
<li><strong>This might mean reduced ticket prices?</strong> But probably not, since the airline industry barely scrapes by with a profit anyway.  You do have to wonder, though, whether this might reduce the number of people the airline industry employs, since there would be less of a need to interact with customers.</li>
</ul>
<h3><span style="color: #800000;">Why Will This Flop?</span></h3>
<ul>
<li><strong>What happened to security?</strong> Is this really secure?  I have a really hard time believing that the TSA is going to let this fly (yes, I know, a terrible pun).  The first time that a TSA agent sees this, they are probably going to reject the passenger outright.  Depending on what the boarding pass looks like, how hard would it be to fake something that looked like this?  Not very, if you were technically skilled.  The TSA agents would need a scanner that interfaces with the United ticket database to verify that these are actual, genuine tickets.</li>
</ul>
<h3><span style="color: #800000;">What I Would Like To See</span></h3>
<ul>
<li><strong>Dedicated apps</strong>.  If you&#8217;re a business, having your own app for smart phones is <em>de rigueur</em>.  This would create value for customers because they would love the convenience of being able to walk straight into security with their carry-on and get right to their gate.  You could allow for a much richer customer experience, such as choosing your seats by using not only a seat map, but an actual photo of that seat to let you see if you would like the window placement, leg room, TV location, etc.  It might benefit the airlines as well because customers could move to standby for other flights, upgrade to first class, check on missing baggage &#8211; all without ever needing to call or speak to anyone.</li>
</ul>
<p>Other airlines, such as Continental and Delta (planned), are also migrating to online boarding passes.  Everything considered, this is an excellent use of smart phones and rather overdue.  This is a classic example of writing apps to create value rather than writing apps to drive revenue.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/03/why-united-airlines-got-mobile-right/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Weakening of Computer Science Educations</title>
		<link>http://afewguyscoding.com/2010/03/the-weakening-of-computer-science-educations/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-weakening-of-computer-science-educations</link>
		<comments>http://afewguyscoding.com/2010/03/the-weakening-of-computer-science-educations/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 20:15:45 +0000</pubDate>
		<dc:creator>David Stites</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[object-oriented]]></category>

		<guid isPermaLink="false">http://blog.afewguyscoding.com/?p=3</guid>
		<description><![CDATA[I have a major problem with how computer science is being taught to new engineers.  In recent years, it seems that all anyone focuses on anymore is what programming languages that one knows and it seems to be shifting to all OO languages.  This is a very wrong and dangerous attitude to take.  However, this [...]]]></description>
				<content:encoded><![CDATA[<p>I have a major problem with how computer science is being taught to new engineers.  In recent years, it seems that all anyone focuses on is which programming languages one knows, and the focus seems to be shifting entirely to OO languages.  This is a very wrong and dangerous attitude to take.  However, this is not to say that knowing a broad range of technologies is bad &#8211; it isn&#8217;t.  It will make you a better engineer.  Universities that want to stay competitive in graduating competent computer science majors need to realize this and refocus their curriculum.  At the same time, there needs to be a major paradigm shift in the industry away from obsessing over whose repertoire of languages is bigger.</p>
<p>Skills that are being taught are softening.  Good old-fashioned coding and algorithm design are giving way to using libraries of code that essentially do all the work for the programmer.  But wait &#8211; isn&#8217;t code reuse a fundamental of good, quality programming?  Yes, it is!  However, this should not be the starting point for students to learn.  When the libraries are used, the logic behind the code is lost.  Good design principles dictate that when using an API, we shouldn&#8217;t care what&#8217;s inside the &#8220;black box&#8221;; we just accept that if we feed the API good and proper data, it will perform as expected and we&#8217;ll get values out.  Students who are learning should care, though, because they aren&#8217;t just learning to program, they are learning to think.  An example from Java: what is the complexity of .contains() on a list?  It should be obvious to most engineers that it is O(n), since every element may have to be checked, but a student who has only ever called the library method may never have had to work that out.  Universities that teach that it is acceptable to use APIs and libraries all the time are doing a major disservice to their students.</p>
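<p>To make the O(n) point concrete, here is a minimal sketch of what a list-style contains has to do under the hood.  I have written it in Python rather than Java to keep it short, and the function is a hypothetical stand-in that also counts comparisons; it is not the JDK&#8217;s implementation.</p>

```python
def contains(items, target):
    """Linear search: return (found, number of comparisons made)."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            # Best case is one comparison, when the target is first.
            return True, comparisons
    # Worst case: every element was examined and none matched.
    return False, comparisons


found, cost = contains([3, 1, 4, 1, 5], 4)
missing, worst_cost = contains(list(range(1000)), -1)
```

<p>Searching for a value that is not there costs one comparison per element, so doubling the list doubles the work.  That is exactly the kind of reasoning a student skips when the method is only ever treated as a black box.</p>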
<p>OO languages are becoming the <em>de facto</em> first course.  This is a backwards approach because OO allows for advanced design with encapsulation, inheritance and polymorphism.  We have to get back to the basics by learning to code as a function of learning computer science, where one can appreciate architecture designs, compilers, memory management, language theory and the relationship between hardware and software.  Knowing a procedural programming language is an absolute necessity; the <a title="Linux Kernel" href="http://en.wikipedia.org/wiki/Linux_Kernel">Linux Kernel</a>, <a title="Git (software)" href="http://en.wikipedia.org/wiki/Git_(software)">Git</a> and <a title="Apache Server" href="http://en.wikipedia.org/wiki/Apache_Server">Apache Server</a> are all products of procedural languages.  Not bad company at all.</p>
<p>The first programming language I learned at Purdue was C.  This was not the common path for most CS majors &#8211; the majority of people started with Java.  Learning C first created a foundation that was later built upon by taking the required Java course.  As a friend recently put it &#8211; it is the difference between learning to drive a car and learning how to engineer one.  Anyone can drive a car and some people love it so much, they become cab drivers.  That is why they have &#8220;X Technology for Dummies&#8221; books.  But if you want to dive into the internals of the car, to look at how the engine works for example, you need the solid education of understanding the principles behind the application.</p>
<p>Being a &#8220;good developer&#8221; does not mean that you are practicing computer science, because in reality, coding is only a small fraction of what computer science is about.  Programming languages are simply a means to an end.</p>
]]></content:encoded>
			<wfw:commentRss>http://afewguyscoding.com/2010/03/the-weakening-of-computer-science-educations/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
