Abstract—Clustering is an easy-to-use and easy-to-implement method of unsupervised inductive inference. Clustering can be used to learn discrete- or continuous-valued hypotheses and to create compact groups of objects that display similar characteristics while maintaining a high degree of separation from other groupings. This paper surveys some of the methods of constructing and evaluating clusters.

Index Terms—machine learning, hierarchical agglomerative clustering, k-means clustering, unsupervised learning, Pearson’s coefficient, Euclidean distance

1. Introduction

Clustering is a method of dividing a heterogeneous group of objects into smaller, more homogeneous groups whose members display similar characteristics to other objects in the same cluster while displaying one or more dissimilar characteristics to objects in other clusters. Clustering is an unsupervised learning technique that has many applications in data mining, pattern recognition, image analysis and market segmentation. Clustering is easy to implement and produces groupings fairly quickly.

The main purposes of a Department of Corrections at any governmental level are to enhance public safety, to provide proper care of the inmates, to supervise inmates under their jurisdiction and to assist inmates’ re-entry into society.

There is no doubt that inmates at correctional institutions are dangerous to society. However, even after they are incarcerated at these institutions, some individuals remain ongoing offenders. While all individuals in prison have displayed some sort of deviant behavior, it is hypothesized that certain combinations of personality traits make some inmates more likely to be sexual predators and some inmates more likely to be sexual victims of these predators. At most correctional facilities, sexual contact between inmates, consensual or not, is not permitted.

Identification of those inmates likely to be sexually predatory toward other inmates would greatly assist corrections facilities in their goal of providing a safer environment for incarceration. Clustering can help with this goal by comparing a particular offender to known perpetrators and victims. After comparison, victims can be incarcerated separately from predators and receive any special needs assistance that can be offered while predators can be segregated in such a fashion as to reduce the potential for successful predatory behaviors.

1.1 Outline of Research

In this research survey, we implemented two different types of clustering algorithms – a standard “bottom-up” hierarchical method, single link clustering, and a standard “top-down” partitional algorithm, k-means clustering. We evaluated different distance measure criteria, including Euclidean distance and Pearson’s correlation coefficient. Results are discussed in section 4 after running the clustering algorithms multiple times with the provided Colorado inmate dataset.

1.2 Data

The dataset that we used was provided by Dr. Coolidge of the Department of Psychology at the University of Colorado at Colorado Springs. The dataset is publicly available at http://www.cs.uccs.edu/~kalita/work/cs586/2010/CoolidgePerpetratorVictimData.csv. This dataset pertains to scores on personality disorder tests given to inmates in the State of Colorado; Dr. Coolidge’s inventory of personality disorder tests is given to all inmates in the State of Colorado. The provided dataset contained 100 rows (25 rows describing victims of sexual abuse and 75 describing perpetrators of sexual abuse) with 14 attributes chosen by Dr. Coolidge.

The data described how inmates scored on tests for different personality disorders. The tests included antisocial (AN), avoidant (AV), borderline (BO), dependent (DE), depressive (DP), histrionic (HI), narcissistic (NA), obsessive-compulsive (OC), paranoid (PA), passive-aggressive (PG), schizotypal (ST), schizoid (SZ), sadistic (SA) and self-defeating (SD) markers. The scores on these individual tests are reported as T scores, a type of standardized score that can be mathematically transformed into other types of standardized scores. T scores follow a Gaussian distribution with a mean of 50 and a standard deviation of 10.
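
Concretely, a T score is a linear rescaling of the standard (z) score:

T = 10z + 50

so, for example, a raw score one standard deviation above the test mean corresponds to a T score of 60.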

It should be noted that even though the dataset is quite small, with only 100 rows available, the quality of the data is very good. Incarcerated persons might not be completely truthful in answering the test questions for a variety of reasons, such as indifference or a desire to appear more “damaged” than other inmates. The data were therefore cross-checked with several other validation methods to ensure that the answers provided were reasonable. Test scores that were not reasonable were discarded by Dr. Coolidge and are not included in this dataset.

2. Application

We chose to write the implementation of the clustering algorithms in Java because of the ease of use of the language. Java also presented superior capabilities for reading and parsing data from files. Using Java allowed us to model the problem more efficiently through the use of object-oriented concepts, such as polymorphism and inheritance. Lastly, several Java libraries were available, such as Java-ML [1], that increased our ability to analyze the clusters after the algorithms had been run.

2.1 Hierarchical Cluster Construction

In agglomerative single link clustering, clusters are constructed by comparing each point (or cluster) with every other point (or cluster). Each object is placed in a separate cluster, and at each iteration of the algorithm, we merge the closest pair of points (or clusters), until certain termination conditions are satisfied.

This algorithm requires defining the notion of a single link, that is, the proximity of one point or cluster to another. Under the single link criterion, the proximity of two points (or clusters) is defined as the minimum distance between any member of one and any member of the other. At each step, the pair of points (or clusters) joined by the shortest such link, or edge, is merged.

This is often called “the nearest neighbor clustering technique.” [4] In graph-theoretic terms, this clustering technique effectively constructs a minimum spanning tree of the data by repeatedly joining connected components, so the algorithm is quite similar to Kruskal’s or Prim’s algorithm.

MSTSingleLink (Elements, AdjacencyMatrix)
Create a set of singleton clusters C = {{t1}, {t2},…,{tn}} from Elements
Create a distance matrix giving the distance between all clusters in C
k = n, where n is the number of clusters
d = 0
Repeat
Ci, Cj = closest pair of clusters in AdjacencyMatrix
d = dis(Ci, Cj) // record this distance threshold in the dendrogram
C = C – {Ci} – {Cj} ∪ {Ci ∪ Cj} // merge the two closest clusters or points
Recalculate dis({Ci ∪ Cj}, Ck) for every remaining cluster Ck in C // single link: minimum pairwise distance
k = k – 1
Until k = 1

Typically, the termination criterion for this algorithm is that all elements end up grouped together in one cluster. A better termination criterion would be to record the distances at which the merges of individual objects and clusters take place; if there is a large jump in this distance (with “large” defined by the user), it gives the user an indication that the two objects or clusters are highly dissimilar and should not be merged. Note that the running time for MSTSingleLink is O(n^2), which makes it impractical for large datasets. For further information on single link MST clustering, see [2], [4].
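
To make the procedure concrete, the following is a minimal Java sketch of agglomerative single link clustering over a precomputed distance matrix. It favors clarity over the O(n^2) MST formulation and is not our actual implementation; the maxMergeDistance parameter is purely illustrative of the distance-jump termination criterion described above.

import java.util.ArrayList;
import java.util.List;

public class SingleLinkClustering {

    /** Agglomerative single link clustering over a precomputed distance matrix.
     *  Merging stops when the closest pair of clusters is farther apart than maxMergeDistance. */
    public static List<List<Integer>> cluster(double[][] dist, double maxMergeDistance) {
        // Start with each element in its own singleton cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > 1) {
            // Find the closest pair of clusters under the single link criterion.
            int bestI = -1, bestJ = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLinkDistance(clusters.get(i), clusters.get(j), dist);
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            // Stop early rather than merge highly dissimilar clusters.
            if (best > maxMergeDistance) {
                break;
            }
            // Merge the two closest clusters (removing the higher index keeps bestI valid).
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }

    /** Single link (nearest neighbor) distance: the minimum distance between
     *  any member of cluster a and any member of cluster b. */
    private static double singleLinkDistance(List<Integer> a, List<Integer> b, double[][] dist) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                min = Math.min(min, dist[i][j]);
            }
        }
        return min;
    }
}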

2.2 Partitional Cluster Construction

Where a hierarchical clustering algorithm creates clusters in multiple steps, a partitional algorithm, such as k-means, creates the clusters in one step. [4] Also, in partitional clustering, the number of clusters to create must be known a priori and used as input to the algorithm.

In k-means, elements are moved among a set of clusters until some termination criterion is reached, such as convergence. A possible convergence test is to check whether any element has changed clusters since the previous iteration. Using k-means allows one to achieve a “high degree of similarity among elements, while a high degree of dissimilarity among elements in different clusters.” [4]

KMeansCluster (Elements, k)
Create a set of items T = {t1, t2,…,tn} from Elements
Assign initial values for the means m1, m2,…,mk
Repeat
Assign each item ti to the cluster which has the closest mean
Calculate a new mean for each cluster
Until convergence criteria met

Note that the running time for KMeansCluster is O(tkn), where t is the number of iterations. While k-means does not suffer from the chaining problem, it does have other problems. k-means does not handle outliers well, does not work with categorical data and does not produce any cluster shapes other than convex clusters. [4] Also, while k-means produces good results, it does not scale well and is not time-efficient. [4] While the provided dataset is not large, k-means could have problems attempting to cluster millions of objects. Lastly, it is possible for k-means to find a local optimum and miss the global optimum. For further information on k-means clustering, see [2], [4].
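
For reference, the following is a compact Java sketch of the k-means procedure using squared Euclidean distance. It is a simplified outline rather than our actual implementation; in particular, the random seeding of the initial means and the fixed iteration cap are illustrative assumptions.

import java.util.Random;

public class KMeans {

    /** Basic k-means: returns a cluster index (0..k-1) for each row of data. */
    public static int[] cluster(double[][] data, int k, int maxIterations) {
        int n = data.length, dim = data[0].length;
        // Seed the initial means with k randomly chosen data points.
        double[][] means = new double[k][];
        Random rnd = new Random(42);
        for (int j = 0; j < k; j++) {
            means[j] = data[rnd.nextInt(n)].clone();
        }
        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: place each item in the cluster with the closest mean.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    double dist2 = squaredEuclidean(data[i], means[j]);
                    if (dist2 < bestDist) { bestDist = dist2; best = j; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // Convergence: no element changed clusters on this pass.
            if (!changed && iter > 0) break;
            // Update step: recompute each mean from its current members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += data[i][d];
            }
            for (int j = 0; j < k; j++) {
                // An empty cluster keeps its previous mean.
                if (counts[j] > 0) {
                    for (int d = 0; d < dim; d++) means[j][d] = sums[j][d] / counts[j];
                }
            }
        }
        return assignment;
    }

    /** Squared Euclidean distance between two equal-length vectors. */
    private static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}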

2.3 Distance Criterion

In both algorithms, cluster formation relies on having some notion of a distance measure. Using this metric, we can determine how “similar” two elements are, and the choice of distance metric influences the shape of the resulting clusters. While there are many distance measures, such as the Mahalanobis, Hamming, city-block and Minkowski distances [2], [6], in our implementation we used two different distance measures, the Euclidean distance and Pearson’s correlation coefficient.

2.3.1 Euclidean Distance

Euclidean distance is the ordinary distance between two points that one would measure with a ruler. It is a simple distance metric and by far the most commonly used, although one must make sure all attributes are on the same scale [2]. It is defined by the following equation:
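
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

where x and y are two n-dimensional points (here, the vectors of 14 T scores for two inmates).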

2.3.2 Pearson’s Correlation Coefficient

Pearson’s correlation coefficient is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1 inclusive. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. [3] Although it depends on the data being analyzed, typically anything over 0.5 or below -0.5 indicates a strong correlation. Pearson’s correlation coefficient can be calculated using the equation below, where X̄ and Ȳ are the means of the two variables.
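
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}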

For further information on Pearson’s correlation coefficient, see [3].
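
For illustration, the two distance measures might be coded in Java as follows. The pearsonDistance conversion d = 1 − r is one common way of turning a correlation into a dissimilarity and is shown here only as an example; it is not necessarily the conversion used in our implementation.

public final class Distances {

    /** Straight-line (Euclidean) distance between two score vectors. */
    public static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Pearson's correlation coefficient r between two score vectors (-1 to +1). */
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** One common way to turn a correlation into a dissimilarity: d = 1 - r, so
     *  perfectly correlated vectors are at distance 0 and anticorrelated ones at 2. */
    public static double pearsonDistance(double[] x, double[] y) {
        return 1.0 - pearson(x, y);
    }
}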

2.4 Testing

Testing of the clustering algorithms was performed with the entire dataset of 100 examples. For single link clustering, the algorithm was run using both the Euclidean distance measure and Pearson’s correlation coefficient. For k-means clustering, the algorithm was run with the Euclidean distance measure for values of k from 2 clusters to 10 clusters, increasing the specified number of clusters, k, by one on each run. Discussion of the results is in section 4.

3. Potential Problems

There was one main problem that we encountered during implementation and testing: with single link clustering, we observed the chaining effect. The chaining effect is where “points that are not related to each other at all are merged together to form a cluster simply because they happen to be near (perhaps via a transitive relationship) points that are close to each other.” [4] By including dissimilar objects, chaining can cause clusters to become skewed. A potential solution to this problem would be to specify a maximum distance threshold above which points (or clusters) are not merged; this could also serve as part of the termination criteria. Another solution would be to use a complete link distance criterion. [4] The chaining effect is most obviously seen when the output of the clustering program is a dendrogram.
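
To make the complete link alternative concrete, the only change is in how the distance between two clusters is computed, as the following sketch shows:

import java.util.List;

final class LinkageCriteria {
    /** Single link: cluster distance is the closest pair of members (prone to chaining). */
    static double singleLink(List<Integer> a, List<Integer> b, double[][] dist) {
        double best = Double.POSITIVE_INFINITY;
        for (int i : a) for (int j : b) best = Math.min(best, dist[i][j]);
        return best;
    }

    /** Complete link: cluster distance is the farthest pair of members, so two clusters
     *  merge only when all of their members are reasonably close, which resists chaining. */
    static double completeLink(List<Integer> a, List<Integer> b, double[][] dist) {
        double worst = Double.NEGATIVE_INFINITY;
        for (int i : a) for (int j : b) worst = Math.max(worst, dist[i][j]);
        return worst;
    }
}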

4. Evaluation

For the evaluation of the results, we have chosen to use several different evaluation criteria. For single link clustering, we evaluate our results by examining the intra-cluster and inter-cluster distances, which measure the homogeneity and separation of the clusters. In addition, we evaluate the distance threshold at which the clusters were merged and the entropy within each cluster. Lastly, we evaluate each cluster by calculating its recall, precision and F measure.

For k-means clustering, we evaluate the clusters using some of the same techniques (recall, precision and F measure) and we also introduce the sum of squared errors (SSE) and the Bayesian information criterion (BIC).

4.1 Macro Evaluation

In single link clustering, the distance threshold that produced the best clusters was 46.23 (see Figure 4.1). Cluster 1 contained inmates whose personality profiles matched those of victims of sexual abuse, while cluster 2 contained inmates whose profiles matched those of perpetrators of sexual abuse.

The remaining clusters, 3, 4, 5 and 6, were outliers in their own clusters whose personality profiles exhibited behavior of both victims and perpetrators of sexual abuse. We noted that the distance thresholds required to merge clusters 4, 5 and 6 were much higher, at 49.60, 55.22 and 59.63 respectively.

Cluster Cluster Members Intra-cluster Distance Inter-cluster Distance
1 1, 65, 4, 6, 97, 45, 49, 58, 53, 19, 18, 42, 67, 55, 62, 36, 7, 32, 54, 69, 39, 38, 89, 24, 26, 72, 8, 3, 9, 11, 95, 91, 27 15.64 90.82
2 2, 10, 16, 41, 50, 47, 51, 87, 78, 64, 68, 79, 77, 44, 84, 29, 31, 34, 63, 33, 90, 80, 40, 74, 82, 37, 43, 71, 48, 93, 96, 98, 22, 73, 52, 20, 57, 59, 46, 61, 75, 85, 66, 92, 83, 94, 88, 81, 70, 100, 60, 99, 86, 56, 13, 15, 30, 21, 14, 12, 23 8.73 79.20
3 5, 35, 25 0.17 66.25
4 17 0.0 67.19
5 28 0.0 67.19
6 76 0.0 67.20
Figure 4.1 – Index of clusters with intra-cluster and inter-cluster distance for Euclidean single link clustering

In Pearson’s coefficient clustering, the distance threshold that produced the best clusters was 0.035 (see Figure 4.2).
Cluster Cluster Members Intra-cluster Distance Inter-cluster Distance
1 1, 17, 14, 52, 48, 54, 27, 21, 22, 26, 77, 74, 33, 50, 57, 62, 47, 49, 79, 31, 34, 69, 67, 2, 86, 71, 100, 3, 41, 10, 29, 8, 25, 40, 83, 37, 19, 44, 46, 72, 55, 81, 88, 66, 30, 98, 38, 70, 20, 45, 36, 80, 60, 87, 13, 56, 91, 68, 51, 23, 16, 53, 89, 12, 65, 82, 94, 96, 90, 64, 58, 43 19.71 138.31
2 4, 11, 35, 84, 39, 15, 61, 63, 18, 92, 24, 76, 6, 7, 59, 95, 78 3.68 108.01
3 5, 97, 73, 93, 75 2.5 106.01
4 32, 42, 99 0.0 106.72
5 28 0.0 109.65
6 9 0.0 109.65
7 85 0.0 109.65
Figure 4.2 – Index of clusters with intra-cluster and inter-cluster distance for Pearson’s single link clustering

Based on these results, the Euclidean distance appears to produce better clusters in terms of intra-cluster distance. Another improvement to the results may come from changing the policy for determining the best distance at which to merge the clusters (i.e., using complete link or average link distance measures). If a more robust merging criterion were implemented, we would expect to see a more consistent result set because there would be less of an effect from the chaining problem.

4.2 Micro Evaluation

When the individual clusters are broken down and their individual members are analyzed, we obtain the following results. These recall, precision, F measure and entropy measurements assume the same clusters as in section 4.1. In our calculations, we assign a negative value to failing to identify a sexual abuse perpetrator, because such a perpetrator would be allowed to interact with the general inmate population instead of being placed in administrative segregation. In addition, we only consider the first two clusters, of either predator or victim, and no mixed classes.
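
For reference, these measures are computed with their standard definitions, where TP, FP and FN are the counts of true positives, false positives and false negatives for a given class:

\text{Recall} = \frac{TP}{TP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP} \qquad F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}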

Cluster Recall Precision F-measure
1 44.00% 32.35% 0.37
2 66.67% 75.75% 0.71
Overall 55.34% 54.05% 0.54
Figure 4.5 – Recall, Precision, F-Measure for Euclidean Single Link Clustering

Cluster Recall Precision F-measure
1 64.00% 22.22% 0.33
2 13.33% 70.58% 0.22
Overall 38.67% 46.40% 0.28
Figure 4.5 – Recall, Precision, F-Measure for Pearson’s Coefficient Single Link Clustering

These results show that, despite the chaining effect, the Euclidean distance appears to be the superior distance measure when clustering via hierarchical agglomerative methods.

Our next test involved k-means clustering. We ran the algorithm for 2 to 10 clusters and measured the quality of the clusters using the Bayesian information criterion (a criterion for model selection among a class of parametric models) and the sum of squared errors, as defined below:
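
BIC = -2\,\ln p(x \mid k) \approx -2\,\ln L + k\,\ln n

SSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2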

In the BIC, x is the observed data, n is the number of data points in x, k is the number of free parameters to be estimated, p(x|k) is the likelihood of the observed data given the number of parameters and L is the maximized value of the likelihood function; in the SSE, C_j denotes the jth of the K clusters and μ_j its centroid.

k Clusters Members BIC Score SSE
2 Cluster 1: 2, 5, 10, 12, 14, 15, 16, 20, 21, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 41, 43, 44, 46, 47, 50, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 92, 94, 98, 99, 100
Cluster 2: 1, 3, 4, 6, 7, 8, 9, 11, 13, 17, 18, 19, 24, 26, 27, 28, 30, 32, 36, 38, 39, 42, 45, 48, 49, 52, 53, 54, 55, 58, 62, 64, 65, 67, 68, 69, 72, 74, 76, 82, 89, 91, 93, 95, 96, 97 144,666.03 223,893.45
3 Cluster 1: 1, 3, 4, 6, 7, 9, 11, 13, 18, 21, 24, 30, 32, 36, 38, 41, 42, 44, 45, 47, 48, 49, 50, 52, 53, 54, 55, 58, 62, 64, 65, 67, 68, 69, 74, 78, 82, 93, 95, 96, 97, 98
Cluster 2: 2, 5, 10, 12, 14, 15, 16, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 92, 94, 99, 100
Cluster 3: 8, 17, 19, 26, 27, 28, 39, 72, 76, 89, 91 142,191.31 190,017.27
4 Cluster 1: 8, 17, 26, 28, 39, 72, 76, 89,
Cluster 2: 3, 4, 6, 19, 27, 36, 42, 45, 49, 55, 62, 67, 91, 95, 97
Cluster 3: 2, 5, 10, 12, 14, 15, 16, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 73, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100
Cluster 4: 1, 7, 9, 11, 13, 18, 21, 24, 30, 32, 38, 41, 44, 47, 48, 50, 52, 53, 54, 58, 64, 65, 68, 69, 74, 78, 79, 82, 84, 87, 93, 96, 98 143,341.18 174,843.29
5 Cluster 1: 8, 17, 24, 26, 38, 65, 72, 89
Cluster 2: 2, 10, 12, 13, 15, 16, 21, 22, 23, 29, 30, 32, 33, 37, 40, 41, 44, 47, 48, 50, 51, 52, 56, 71, 73, 77, 78, 79, 84, 87, 90, 93, 96, 98
Cluster 3: 5, 14, 20, 25, 31, 34, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 75, 80, 81, 83, 85, 86, 88, 92, 94, 99, 100
Cluster 4: 7, 28, 36, 39, 76
Cluster 5: 1, 3, 4, 6, 9, 11, 18, 19, 27, 42, 45, 49, 53, 54, 55, 58, 62, 64, 67, 68, 69, 74, 82, 91, 95, 97 141,884.73 152,909.21
6 Cluster 1: 8, 17, 24, 26, 38, 65, 72, 89
Cluster 2: 2, 5, 10, 12, 14, 20, 22, 23, 25, 29, 31, 33, 34, 35, 37, 40, 43, 46, 51, 56, 57, 59, 60, 61, 63, 66, 70, 71, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100
Cluster 3: 1, 7, 13, 15, 16, 21, 30, 32, 41, 44, 47, 48, 50, 52, 53, 54, 58, 64, 68, 69, 73, 74, 78, 79, 82, 84, 87, 93, 96, 98
Cluster 4: 4, 6, 18, 19, 36, 39, 42, 45, 49, 62, 67, 91, 97
Cluster 5: 28, 76
Cluster 6: 3, 9, 11, 27, 55, 95 141,230.52 144,776.69
7 Cluster 1: 2, 10, 12, 13, 15, 16, 21, 22, 23, 29, 30, 31, 33, 37, 40, 41, 44, 47, 48, 50, 51, 56, 73, 77, 78, 79, 84, 87, 90, 93, 94, 98
Cluster 2: 5, 14, 20, 25, 34, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 71, 75, 80, 81, 83, 85, 86, 88, 92, 99, 100,
Cluster 3: 4, 6, 9, 11, 18, 19, 42, 45, 49, 53, 55, 58, 62, 64, 67, 68, 74, 82, 95, 96, 97
Cluster 4: 28, 76,
Cluster 5: 3, 27, 91
Cluster 6: 1, 7, 32, 36, 39, 52, 54, 69
Cluster 7: 8, 17, 24, 26, 38, 65, 72, 89 140,546.79 135,787.72
8 Cluster 1: 3, 27, 95
Cluster 2: 1, 7, 24, 30, 32, 38, 53, 54, 58, 64, 65, 68, 69, 74, 82
Cluster 3: 39
Cluster 4: 9, 11, 18, 42, 62, 67,
Cluster 5: 4, 6, 19, 36, 45, 49, 55, 91, 97
Cluster 6: 5, 10, 12, 14, 20, 22, 23, 25, 29, 31, 34, 35, 37, 40, 43, 46, 56, 57, 59, 60, 61, 63, 66, 70, 71, 75, 77, 80, 81, 83, 85, 86, 88, 90, 92, 94, 99, 100
Cluster 7: 8, 17, 26, 28, 72, 76, 89
Cluster 8: 2, 13, 15, 16, 21, 33, 41, 44, 47, 48, 50, 51, 52, 73, 78, 79, 84, 87, 93, 96, 98 140,510.19 138,410.06
9 Cluster 1: 3, 9, 11, 74, 82, 95
Cluster 2: 15, 29, 30, 31, 33, 34, 40, 41, 44, 47, 50, 73, 78, 79, 84, 87, 90
Cluster 3: 8, 17, 24, 38, 65, 72, 89,
Cluster 4: 26, 28, 39, 76
Cluster 5: 5, 14, 20, 25, 35, 46, 57, 59, 60, 61, 63, 66, 70, 75, 80, 81, 83, 85, 86, 88, 92, 100
Cluster 6: 4, 6, 18, 19, 27, 36, 42, 45, 49, 55, 62, 67, 91, 97
Cluster 7: 2, 10, 16, 21, 22, 37, 43, 51, 71, 77, 94, 98, 99
Cluster 8: 1, 7, 13, 32, 48, 52, 53, 54, 58, 64, 68, 69, 93, 96
Cluster 9: 12, 23, 56 140,889.59 126,846.88
10 Cluster 1: 28, 76
Cluster 2: 91
Cluster 3: 30, 32, 53, 54, 58, 64, 68, 69, 74, 82
Cluster 4: 5, 14, 20, 25, 35, 43, 46, 57, 59, 60, 61, 63, 66, 70, 71, 75, 80, 81, 83, 85, 86, 88, 92, 99, 100
Cluster 5: 27
Cluster 6: 9
Cluster 7: 3, 4, 6, 11, 18, 19, 42, 45, 49, 55, 62, 67, 95, 97
Cluster 8: 1, 7, 13, 21, 36, 39, 48, 52, 93, 96
Cluster 9: 2, 10, 12, 15, 16, 22, 23, 29, 31, 33, 34, 37, 40, 41, 44, 47, 50, 51, 56, 73, 77, 78, 79, 84, 87, 90, 94, 98
Cluster 10: 8, 17, 24, 26, 38, 65, 72, 89 140,789.73 117,432.09
Figure 4.6 – Cluster Members, BIC Score and SSE Score for Euclidean k-means Clustering

Cluster Recall Precision F-measure
1 48.00% 22.22% 0.30
2 17.30% 28.20% 0.21
Overall 32.65% 25.21% 0.26
Figure 4.7 – Recall, Precision, F-Measure for Euclidean k-means Clustering

In these results, we can see that, as the number of clusters increases, the sum of squared errors decreases. This is because we are including fewer dissimilar items in each cluster, so the clusters more accurately represent their true nature. We would also expect the precision, recall and F measure to increase as the number of clusters increases. However, it would become harder to interpret the actual “class” of the clusters as k increases, since we were instructed to disregard the class of each instance in the dataset. Additionally, we might achieve better results if we used decision trees to identify the most influential personality test markers and then used a subset of those markers for clustering.

Based on all of these results, while the clusters are not highly accurate, prison officials could still obtain good insight into which attributes are the most important in identifying who might be ongoing offenders.

5. Conclusion

In this research project, different methods of constructing clusters were explored. Additionally, different distance measures were implemented and then analyzed to see how they affected the accuracy of the clusters created.

While the results of this project show only a maximum of 67% accuracy, clustering is still a valid machine learning technique. With a more advanced algorithm and a larger dataset, clustering may be able to predict predators and victims at a much better rate.

References

[1] Abeel, T., de Peer, Y. V., Saeys, Y. Java-ML: A Machine Learning Library. Journal of Machine Learning Research, Vol. 10, pp. 931-934. 2009.

[2] Alpaydin, E. Introduction to Machine Learning, Second Edition. The MIT Press, Cambridge, MA. 2010.

[3] Coolidge, F. Statistics: A Gentle Introduction, 2nd Edition. SAGE Publications, Inc. 2006.

[4] Dunham, M. Data Mining: Introductory and Advanced Topics. Prentice-Hall. 2002.

[5] Saha, S., Bandyopadhyay, S. Performance Evaluation of Some Symmetry-Based Cluster Validity Indexes. IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews. Vol. 39, No. 4. July 2009.

[6] Jain, A.K., Murty, M.N., Flynn, P.J. Data Clustering: A Review. ACM Computing Surveys. Vol. 31, No. 3. Sept. 1999.
