Why do we need clustering?

What is clustering? Definition, methods and examples

Clustering is a category of machine learning algorithms that sorts data into similar groups. The advantage of clustering is that, as an unsupervised machine learning method, it does not require any prior knowledge of the data and operates purely on similarities within the data. Clustering algorithms are used in many settings, from grouping customers or products, to outlier detection in banking, to spam filtering. In this article, we start with a definition of clustering before introducing the various methods and algorithms.


Definition of clustering: what is it?

Simply put, clustering is a machine learning method for organizing data points into groups. Similarities in the data (e.g. similar age, same gender) are used to identify groups that are as homogeneous as possible (e.g. young, male people). Clustering works without any prior knowledge of which entries belong together; it calculates the similarities purely from the data itself. Clustering is therefore a suitable method for generating groups or segments without prior knowledge and deriving insights from them.

The goals of using clustering can be roughly divided into two categories. The first category aims to combine similar data points and thus reduce complexity. The other category tries to identify data points that do not belong to any large group and therefore have special characteristics; this is called outlier detection. In both cases, the aim is to identify similar groups in order to take appropriately tailored measures.

This knowledge gain is applied in many areas. Whether for customer clustering, product clustering, fraud detection or spam filtering, clustering is a very versatile approach in the field of machine learning and data science.

Clustering as a method in unsupervised machine learning

Clustering as a method belongs to the field of machine learning. More precisely, it is classified as unsupervised machine learning: the data does not contain any target variable for the algorithm to orient itself on; instead, patterns and relationships are calculated purely from the data itself.

Since clustering is a machine learning method, it also falls under the broader category of artificial intelligence. Artificial intelligence algorithms learn from data and can extract patterns or knowledge without predefined rules. Clustering is therefore particularly interesting in the area of data mining, where existing data is examined for as yet unknown factors.

Examples of the use of clustering in companies

Customer grouping and customer segments using clustering

A very common area of application for clustering is in marketing and product development. The algorithm is used here to identify meaningful segments of similar customers. The similarity can be based on the master data (e.g. age, gender), on transaction data (e.g. number of purchases, shopping cart value) or other behavioral data (e.g. number of service requests, duration of membership in the loyalty program).

Once customer clusters have been identified, more individual campaigns can be rolled out. For example, a personalized newsletter, individual offers, different types of service contracts or other promotions are the result of this better customer understanding.

Clustering as a spam filter

Another interesting example of clustering in everyday use is its use as a spam filter. Meta-attributes of e-mails (e.g. length, character distribution, attributes from the header, ...) are used to separate spam from genuine e-mails.

Product data analysis: groups, quality and more

Another example of the use of clustering in companies is product data analysis. Product data is central master data in every company, yet it is often under-maintained and it is unclear whether it has the best possible structure.

Clustering can help, for example, to develop category trees and product data structures by using the similarity of product categories or individual products based on their master data. The pricing strategy can also be supported in order to see which products are in fact very similar and thus possibly fall into the same price range.

Product data quality is an additional area of application in the product data environment. Clustering can help to identify which products show poor data quality and, based on this, give recommendations for correction.

Fraud detection using clustering

An example from which both consumers and companies benefit is outlier detection in the sense of fraud detection. Banks and credit card companies are very active in this area and use (among other things) clustering to detect unusual transactions and mark them for review.

Which methods and algorithms are there in clustering?

In clustering, as in many other areas of machine learning, there is now a large variety of methods that can be used. Depending on the application and, above all, the underlying data, each algorithm can deliver a different result. I deliberately say "different" rather than better or worse, because data can be grouped and separated in many ways, especially in high-dimensional space. This is precisely what makes clustering so complex: choosing the right algorithm for the application at hand.

In the following we would like to introduce four of the most prominent clustering algorithms before we describe further algorithms at least in brief.

k-means as an example of partitioning clustering

Partitioning clustering is one of the best-known clustering approaches, and k-means is the most frequently used algorithm in this category. The "k" stands for the number of clusters to be defined, while "means" stands for the mean value, i.e. where the center of each cluster is located.

As the name suggests, k-means looks for a center for each of its clusters such that the variance of all assigned points around it is as low as possible. This is done in an iterative process:

  1. Initialization: random selection of K centers
  2. Allocation of all data points to the closest center, measured by a distance metric
  3. Move the centers to the center of all assigned data points
  4. Go to 2), unless a termination criterion is met

The distance metric measures the distance between a data point and a cluster center, and a range of calculation methods can be used here (e.g. Euclidean distance, Manhattan distance). For example, a person aged 18 and 160cm is closer to another person aged 20 and 170cm than to a person aged 60 and 190cm.

Once the algorithm terminates, i.e. the termination criterion has been reached (e.g. a maximum number of passes or only a slight change compared to the previous step), it outputs the closest cluster center for each data point.
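To make this concrete, here is a minimal sketch of k-means with scikit-learn. The small customer table (ages and purchase counts) and the choice of k = 3 are purely illustrative assumptions:

```python
# Minimal k-means sketch with scikit-learn on made-up customer data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, purchases per year]
X = np.array([
    [18, 4], [20, 5], [22, 3],     # younger, few purchases
    [45, 20], [50, 22], [48, 18],  # middle-aged, frequent buyers
    [70, 8], [68, 7],              # older, moderate activity
])

# Scale the features so that age and purchase count contribute comparably
X_scaled = StandardScaler().fit_transform(X)

# k = 3 clusters; n_init repeats the random initialization several times
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(labels)                   # cluster index for each data point
print(kmeans.cluster_centers_)  # learned cluster centers (in scaled space)
```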

Areas of application and special features of k-means

The fact that k-means is the most widely used clustering algorithm is due to its properties. On the one hand it is very easy to implement, on the other hand it scales well even with large amounts of data. The result can be controlled iteratively by varying the number of clusters.

The disadvantages of k-means are that you have to determine the number of clusters yourself, that it is primarily suitable for circular (spherical) clusters and that it can only process numerical data. In addition, k-means is susceptible to outliers, so attention must be paid to the preprocessing of the data.


Hierarchical (agglomerative) clustering

In addition to k-means, hierarchical clustering is one of the most frequently used algorithms. Here, however, the number of clusters is not defined in advance; instead, each data point starts in its own cluster and is then successively merged with the closest one. The algorithm looks roughly like this:

  1. Initialization: Each data point is a cluster
  2. Linkage: For each cluster, the closest one is found according to the distance metric and these clusters are merged
  3. Go to 2), unless a termination criterion is met

The relevant part is obviously the linkage, i.e. finding and merging two clusters. There are roughly four types of linkage: Single Linkage, Average Linkage, Complete or Maximum Linkage, and Ward Linkage:

  • Single linkage, the simplest case, uses the two individual data points of two clusters that are closest to each other as the cluster distance. A high variance is therefore “good” for a cluster, since more other data points can be reached from it.
  • Average linkage compares the distance from each data point to every other data point of another cluster and then takes their mean value. Consequently, it is not “outliers” that determine whether clusters are put together, but the entire composition.
  • Complete (maximum) linkage reverses single linkage: it uses the two data points of two clusters that are furthest apart as the cluster distance and then merges the pair of clusters for which this distance is smallest. As a result, it tends to focus on “non-outliers”.
  • Ward linkage, named after Joe H. Ward, Jr. (1963), compares the variance of the candidate merged clusters. As a variance-minimization method, it is very similar to the k-means objective and tries to form clusters that are as homogeneous as possible.
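The following is a minimal sketch of agglomerative clustering with SciPy. The two small point groups and the cut into two flat clusters are assumptions chosen only for illustration:

```python
# Minimal agglomerative clustering sketch with SciPy on toy 2D data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([
    [1.0, 2.0], [1.2, 1.9], [0.8, 2.1],  # first group
    [8.0, 8.5], [8.2, 8.1], [7.9, 8.4],  # second group
])

# Build the merge tree; "ward" minimizes within-cluster variance,
# alternatives are "single", "average" and "complete"
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would visualize the merge hierarchy (requires matplotlib)
```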

Areas of application and special features of hierarchical clustering

Hierarchical clustering has the special feature that you can precisely track the composition of the clusters down to the individual data level. This makes the method particularly attractive in areas in which the composition of the clusters is to be interpreted or when it is unclear in advance how many clusters are to be specified.

Building on this thought, hierarchical clustering makes particular sense for all use cases in which the proximity of individual data points is of interest. For example, the proximity of people or products can be analyzed very flexibly and formed into any number of (or just a few) clusters. The graphics that can be produced from this (“dendrograms”) allow the findings to be conveyed particularly visually.

Hierarchical agglomerative clustering, on the other hand, has difficulties with its high resource requirements and with chaining effects that incorrectly connect several clusters. With single linkage, for example, a kind of bridge can form between two clusters that do not actually belong together, because the distance between individual data points is smaller than the typical distances within the clusters. Finally, it remains a challenge to determine which number of clusters makes sense; both experience and domain knowledge are required here.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that is mainly used in high-dimensional spaces or to detect outliers. It is based on the idea that nearby points are assigned to the same cluster if they can reach enough other points. If this is not the case, the point is labelled as “noise”. The high-level algorithm works as follows:

  1. For each data point, determine how many other data points lie close to it
  2. If the number of neighboring data points exceeds a threshold, the point is considered a core point of a “dense” cluster
  3. If data points can be reached from a core point but do not meet the threshold themselves, they are considered density-reachable
  4. If data points cannot be reached from any core point, they are regarded as noise

The algorithm is controlled by two parameters: the neighborhood radius (epsilon) and the minimum number of data points in a cluster (minPoints). If you increase epsilon, more points can end up in the same cluster, but you lose selectivity and may merge several clusters. If the minimum number is increased, fewer clusters are recognized and more data points are classified as noise or merely density-reachable.
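As a small illustration, here is a minimal DBSCAN sketch with scikit-learn; the toy points and the parameter values eps=0.5 and min_samples=3 are assumptions:

```python
# Minimal DBSCAN sketch with scikit-learn on toy 2D data.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],  # first dense group
    [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],              # second dense group
    [9.0, 0.5],                                      # isolated point -> noise
])

# eps is the neighborhood radius (epsilon), min_samples corresponds to minPoints
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Points that belong to no cluster are labeled -1 (noise)
print(db.labels_)
```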

Areas of application and special features of DBSCAN

DBSCAN has some advantages compared to other clustering algorithms such as k-means. For example, DBSCAN can detect and differentiate clusters of any shape, not just spherical ones like k-means. The fact that the density of the cluster and not just the distance is included makes DBSCAN more robust against outliers than other algorithms.

Problems, on the other hand, are that the behavior of the algorithm is very strongly controlled by the two parameters. This requires a lot of experience or at least a willingness to experiment. Furthermore, DBSCAN is hard to partition, since all points have to remain reachable from each other during the computation, which leads to problems with scaling.

Fuzzy Clustering (c-Means)

Fuzzy clustering (also called c-means or soft k-means) is, as the name suggests, an algorithm that does not assign each point to exactly one cluster, but works with membership probabilities. Consequently, each data point can belong to several clusters with different probabilities.

As the alternative names c-means and soft k-means suggest, the algorithm is based on the same principle as k-means, but calculates membership probabilities instead of fixed cluster assignments. The cluster centers are likewise calculated based on these probabilities. Fuzzy clustering thus provides, for each point, a distribution over the clusters rather than a hard assignment.
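Since fuzzy c-means is not included in scikit-learn, the following is a minimal from-scratch NumPy sketch of the idea. The toy data, the fuzziness exponent m = 2 and the fixed number of iterations are assumptions for illustration only:

```python
# Minimal from-scratch fuzzy c-means sketch (illustrative, not an official API).
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial membership matrix; each row sums to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update centers as membership-weighted means
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Update memberships from the distances to the centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        inv = dist ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [3.0, 3.0]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))  # each row: membership of one point across the 2 clusters
```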

Areas of application and special features of fuzzy clustering

The same advantages and disadvantages as for k-means also apply to fuzzy clustering. What is different, however, is the soft assignment to clusters via probabilities, which enables additional use cases. For example, when assigning customers, it often makes more sense to give each person a “proximity” to a prototype rather than a binary segment. The drawback is that, due to the additional membership probabilities, the effort required is higher than with k-means.

Further clustering algorithms

  • Model-based clustering: Assumes that the data was generated by a mixture of probability distributions (e.g. Gaussian mixture models) and assigns each point according to the estimated model.
  • Divisive clustering: The top-down counterpart of agglomerative clustering; it starts with all data points in a single cluster and splits it step by step.
  • Relocation Algorithms: Iteratively reassign data points between clusters to improve an objective function; k-means itself belongs to this family.
  • K-Medoids: Similar to k-means, but k-medoids chooses real data points as center points instead of the average of all objects.
  • Subspace Clustering: Searches for clusters in lower-dimensional subspaces of the feature space, i.e. using only subsets of the variables.
  • Projection Technique: Projects the data into a lower-dimensional space (e.g. via principal component analysis) before clustering.
  • Co-Clustering Technique: Clusters the rows and columns of a data matrix simultaneously (also known as biclustering).
  • K-Prototypes: Combines k-means for numerical features with k-modes for categorical features and is therefore well suited for dealing with categorical variables.

Common questions in clustering

What is a distance metric?

The distance metric defines how the “similarity” between two points is calculated. The simplest case in one-dimensional space is simply the difference between two numerical values (e.g. age), but the absolute difference, a power of the difference or the logarithmic difference are also possible approaches for a distance metric.

Obviously, it becomes more interesting when one talks about a multidimensional distance metric. There is a whole range of established distance metrics such as the Euclidean distance, the Manhattan distance or the cosine distance. As usual, each of the metrics has advantages and disadvantages and the distance metric plays a central role in every machine learning algorithm, as it defines how sensitively the algorithm reacts to deviations and how it interprets them.
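As a small illustration, here is a sketch comparing common distance metrics with SciPy on two made-up feature vectors (e.g. [age, height in cm]); the values are assumptions:

```python
# Comparing common distance metrics with SciPy on two illustrative vectors.
from scipy.spatial.distance import euclidean, cityblock, cosine

a = [18, 160]
b = [20, 170]

print(euclidean(a, b))  # straight-line distance
print(cityblock(a, b))  # Manhattan distance (sum of absolute differences)
print(cosine(a, b))     # cosine distance (1 - cosine similarity), ignores magnitude
```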

How do you determine the optimal number of clusters?

It is often unclear how many clusters you should create. Especially with partitioning methods like k-means, you have to decide in advance which “k” you want to optimize for. Methods such as the elbow method, the average silhouette method or the gap statistic are used here. In principle, it is always a matter of iteratively calculating different numbers of clusters (e.g. 2 to 10) to find the number that best separates the clusters.
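A minimal sketch of the elbow and silhouette approaches with scikit-learn might look as follows; the synthetic three-group data and the range of k values are assumptions:

```python
# Elbow and silhouette sketch with scikit-learn on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),
               rng.normal([5, 5], 0.5, (30, 2)),
               rng.normal([0, 5], 0.5, (30, 2))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ (within-cluster sum of squares) drops sharply up to the "elbow";
    # the silhouette score is highest for the best-separated clustering
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```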

In addition to these statistical calculations, in practice different numbers of clusters are often simply tried out in order to find a size that suits the problem at hand. Here, statistically optimal analyses are sometimes ignored, because business requirements (e.g. “I want four newsletter groups”) impose conditions or experience (e.g. “We think in terms of three customer ratings”) has a major influence.

In order to bring all this information together, various software packages now also offer cumulative approaches that run many optimization methods (e.g. elbow, silhouette, ...) and then output the most frequent number of clusters as a recommendation. In R, for example, this is NbClust() in the package of the same name, which compares about 30 methods for k-means.

What role does data quality play in clustering?

As in any data-based approach, data quality is highly relevant. Since clustering works directly on the data and uses them as indicators for the groupings, poor data quality naturally has even more serious consequences.

In addition to generally poor data quality, outliers are a special case of their own. There are two factors to consider here. On the one hand, clustering is very well suited to detecting outliers, which would be the direct application.

However, if this is not the goal, the algorithm can be negatively influenced by outliers. In particular, algorithms that work directly on distance metrics such as k-means are very susceptible to outliers. It is therefore important to look at the data quality very critically and to correct it if necessary.

How do I deal with redundancy (features with high correlation) in clustering?

Redundant features, i.e. variables that are very similar or even completely identical, have a strong influence on clustering. With unweighted attributes, in the simplest case this leads to a feature (e.g. weight) effectively being counted twice (e.g. once in grams and once in kilograms).

There are several ways to address this redundancy or correlation of features. The first step is definitely to become aware of a correlation, i.e. to identify correlations using a correlation matrix or similar analyses. There are also distance metrics that take correlations between features into account, notably the Mahalanobis distance.
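A minimal sketch of spotting such redundancy with a pandas correlation matrix could look like this; the DataFrame with a duplicated weight column is a made-up example:

```python
# Spotting redundant features via a correlation matrix with pandas.
import pandas as pd

df = pd.DataFrame({
    "weight_kg": [60, 72, 80, 55, 90],
    "weight_g":  [60000, 72000, 80000, 55000, 90000],  # same information, other unit
    "height_cm": [165, 180, 175, 160, 185],
})

corr = df.corr()
print(corr.round(2))  # weight_kg and weight_g correlate perfectly (1.0)

# One simple remedy: drop (or down-weight) one of a highly correlated pair
df_reduced = df.drop(columns=["weight_g"])
```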

Beyond this insight, you then have to decide whether and, if so, how to correct it. In many cases you will want to exclude redundant features or, even better, down-weight them. The latter has the advantage that small interaction effects are not removed entirely, as they would be with a complete exclusion.

In general, high correlations can place an undesirably heavy weight on a certain feature. It is therefore important to familiarize yourself thoroughly with the available data beforehand in order to identify redundancies.

How do I deal with a large number of variables?

A very high number of variables usually leads to very long runtimes and may cause minimal effects to be used to differentiate the clusters. What is often used here is a principal component analysis (PCA) to reduce the feature set. It checks which directions in the data carry a high share of the variance in the first place. In this way, the number of variables can be reduced and other problems avoided.
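A minimal sketch of this with scikit-learn might look as follows; the synthetic 50-column data and the 90% variance threshold are assumptions:

```python
# Reducing many variables with PCA before clustering (scikit-learn sketch).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))                       # 5 hidden factors
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))   # 50 correlated features

X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain ~90% of the variance
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)  # far fewer columns than the original 50

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```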

Which preprocessing is common in clustering?

The preprocessing steps depend on the clustering algorithm used, but there are some general steps that are performed in advance:

  • Missing data: If individual data entries are missing, it must be decided how to deal with them. Missing data can be eliminated, for example, by removing the affected entries or attributes, or by imputation.
  • Curse of dimensionality: As briefly mentioned, a very high number of features can have negative effects. One therefore tries to keep the number to a maximum of approx. 20-30 features.
  • Data normalization: Data in its raw format can have a massive impact on the results of the distance calculation. For example, a distance in the variable “weight” can have a greater impact than one in “length in millimeters”. Therefore, the data is normalized, for example using a z-transformation.
  • Categorical variables: Not all algorithms can process categorical features, for example “gender”. Accordingly, the data must either be converted into numbers (e.g. “large/small” into “10/100”) or turned into binary variables using one-hot encoding (e.g. “black/green” into “black -> 1/0, green -> 0/1”). A short sketch of these steps follows below.
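The following minimal sketch runs through these steps with pandas and scikit-learn; the small DataFrame, its column names and the chosen imputation are assumptions for illustration:

```python
# Minimal preprocessing sketch: imputation, one-hot encoding, z-transformation.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "weight_kg": [60.0, None, 80.0, 55.0],
    "length_mm": [1650, 1800, 1750, 1600],
    "color":     ["black", "green", "black", "green"],
})

# Missing data: here imputed with the column mean (removal would be an alternative)
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].mean())

# Categorical variables: one-hot encoding into binary columns
df = pd.get_dummies(df, columns=["color"])

# Normalization: z-transformation so both numerical scales contribute comparably
num_cols = ["weight_kg", "length_mm"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df)
```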

The role of clustering in the data driven company

As one of the central categories in the area of machine learning, clustering plays a key role in a data-driven company when it comes to evaluating and using data. Clustering has a wide range of uses and can therefore be applied creatively to many problems in business.

From a technical point of view, clustering is also somewhat easier to get started with than complex neural networks or time series analyses, so it can also be used by junior data scientists or other colleagues. And the effect can be very high: getting to know my customers better, spotting outliers in my transactions, or gradually and automatically increasing data quality is of great value.

We therefore definitely recommend the use of clustering in the company and look forward to exciting use cases.

Further information & tutorials

Clustering in python

The following is a video tutorial on how to implement clustering in python:

Clustering in R

Here is a version of a video tutorial on clustering in R:

Book recommendation on clustering: algorithms and methods

A broad yet still very detailed introduction to the topic of clustering is the book “Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python”. The great thing about the book is that, in addition to theoretical concepts and practical exercises, it always takes a step back to ask the question “Why do we need this?”.
