Figure 3.5 illustrates the basic idea of clustering. Assume we have a data set with only two variables: age and weight. Such a data set could be obtained by projecting Table 3.1 onto the last two columns. The dots correspond to persons having a particular age and weight. Through a clustering technique like k-means, the three clusters shown on the right-hand side of Fig. 3.5 can be discovered. Ideally, the instances in one cluster are close to one another while being farther away from instances in other clusters. Each cluster has a centroid, denoted by a +. The centroid marks the “center” of the cluster and can be computed by taking the average of the coordinates of the instances in the cluster. Note that Fig. 3.5 shows only two dimensions. This is somewhat misleading, as there will typically be many dimensions (e.g., the number of courses or products). However, the two-dimensional view helps to convey the basic idea.

Distance-based clustering algorithms such as k-means and agglomerative hierarchical clustering assume a notion of distance. The most common approach is to consider each instance to be an n-dimensional vector, where n is the number of variables, and then simply take the Euclidean distance. For this purpose, ordinal and binary values need to be made numeric, e.g., true = 1, false = 0, and cum laude = 2, passed = 1, failed = 0. Note that scaling is important when defining a distance metric. For example, if one variable represents a distance in meters ranging from 10 to 1,000,000 while another variable represents some utilization factor ranging from 0.2 to 0.8, then the distance variable will dominate the utilization variable. Hence, some normalization is needed; the first sketch below makes this concrete.

Figure 3.6 shows the basic idea of k-means clustering. Here, we have simplified things as much as possible: k = 2 and there are only 10 instances. The approach starts with a random initialization of two centroids, denoted by the two + symbols. In Fig. 3.6(a), the centroids are placed randomly in the two-dimensional space. Using the selected distance metric, here the standard Euclidean distance, all instances are assigned to the closest centroid. All instances with an open dot are assigned to the centroid on the left, whereas all instances with a closed dot are assigned to the centroid on the right. Based on this assignment, we get two initial clusters.
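To make the numeric encoding and normalization step concrete, the following minimal Python sketch rescales every variable to [0, 1] (min-max normalization) before taking the Euclidean distance. The function names and the sample values are illustrative only, not taken from Table 3.1.

```python
import math

def min_max_normalize(instances):
    """Rescale each variable to [0, 1] so that no single variable
    (e.g., meters vs. a utilization factor) dominates the distance."""
    n_vars = len(instances[0])
    lows = [min(inst[i] for inst in instances) for i in range(n_vars)]
    highs = [max(inst[i] for inst in instances) for i in range(n_vars)]
    return [
        tuple((inst[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
              for i in range(n_vars))
        for inst in instances
    ]

def euclidean(a, b):
    """Standard Euclidean distance between two n-dimensional vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Binary and ordinal values are made numeric first, e.g.,
# true = 1, false = 0; cum laude = 2, passed = 1, failed = 0.
raw = [(10.0, 0.2), (500000.0, 0.8), (1000000.0, 0.5)]  # (meters, utilization)
norm = min_max_normalize(raw)
print(euclidean(norm[0], norm[1]))  # both variables now contribute comparably
```

Without the normalization step, the distance between the first two instances would be almost entirely determined by the meters variable.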
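Building on that distance, the assignment-and-update loop of Fig. 3.6 can be sketched as follows. This is a minimal illustration, assuming two numeric variables as in Fig. 3.5; the ten (age, weight) pairs are invented for the example. Full k-means repeats the two steps until the assignment no longer changes.

```python
import math
import random

def kmeans(instances, k=2, iterations=10, seed=0):
    """Minimal k-means: random initial centroids, then repeatedly
    (1) assign each instance to the closest centroid and
    (2) recompute each centroid as the average of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(instances, k)  # random initialization, cf. Fig. 3.6(a)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for inst in instances:
            # assignment step: closest centroid under Euclidean distance
            nearest = min(range(k), key=lambda j: math.dist(inst, centroids[j]))
            clusters[nearest].append(inst)
        for j, cluster in enumerate(clusters):
            if cluster:  # update step: centroid = coordinate-wise average
                centroids[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Ten (age, weight) instances and k = 2, as in the simplified example:
data = [(25, 70), (27, 72), (30, 80), (22, 65), (26, 75),
        (55, 90), (60, 95), (58, 88), (62, 92), (57, 85)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
```

After the first assignment, the two clusters correspond to the open and closed dots of Fig. 3.6; recomputing the centroids moves the + symbols toward the centers of their clusters.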