What is Association Rule Mining

Association rules - shopping cart analysis

Table of Contents

1. Introduction

2. Web data mining
2.1 Data mining
2.2 Web mining

3. Association rules
3.1 Basic concept
3.2 Apriori algorithm
3.3 Data formats for association rules

4. Shopping cart analysis on Amazon

II List of Figures

III Bibliography

1. Introduction

The use of the Internet is increasing steadily. According to the Federal Statistical Office, Internet use rose by 2 percentage points from 2008 to 2009, and the trend continues upward: “In the first quarter of 2009, 73% of people aged ten and over used the Internet. In the same period of the previous year the share was still 71%.”[1]

The Internet serves people not only as a source of information and entertainment, but also as a sales channel for online purchases. In a survey by the Federal Statistical Office, 75% of respondents stated that they had already ordered goods via the Internet, and 55% had done so within the last three months. Figure 1 shows that private individuals aged 25 to 54 shop online most often. In both 2002 and 2009, this age group had the largest share of Internet shoppers, and it also shows the largest percentage-point increase over the last seven years. Clothing, sporting goods, tourism services and private consumer goods such as furniture or toys are among the most popular goods ordered over the Internet.[2]

Figure not included in this excerpt

Figure 1: Purchases made over the Internet by private individuals[3]

In order to increase sales through suitable advertising measures, companies carry out shopping cart analyses. To capture the relationships between products bought on the Internet, large amounts of data have to be processed continuously and evaluated regularly. This is done using association rules, which are found by algorithms such as Apriori that automatically examine large amounts of data. The following sections describe what web data mining is, how association rules capture product relationships and how the Apriori algorithm works. The last part of this seminar paper illustrates the process using an example.

2. Web data mining

In today's information age, more data is being produced and disseminated on the Internet than ever before. The goal of web data mining is to extract useful information or knowledge from large data sets and analyze it. Building on the knowledge gained from data mining (databases, texts), a specialization for the Internet emerged: web mining (hyperlinks, web content and usage data). The following section explains both processes.

2.1 Data mining

To cope with huge amounts of data and information, software developers have devised mathematical methods, implemented them as algorithms and built software applications on top of them that address various problems and detect recurring patterns.

The data mining process is also referred to as KDD ("knowledge discovery in databases"), that is, gaining knowledge from existing data. KDD is generally defined as a pattern recognition process.

Before processing, the raw data (large volumes of data, databases, texts or images) must be examined and prepared. Noise and other disturbing factors must be removed. Once the data have been cleaned, they are fed into the algorithm, which extracts patterns. A decision is then made as to whether the results are usable and the useful information can be saved; this is often the case. The entire process is usually repeated several times in order to achieve relevant results, which are then saved, for example, in relational databases.[4]
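The cleaning step can be illustrated with a minimal Python sketch. It assumes that the raw data arrive as lists of item names per shopping cart; the function name `clean_baskets` and the sample records are hypothetical and serve only as an illustration.

```python
def clean_baskets(raw_records):
    """Normalize raw basket records and drop unusable ones (illustrative)."""
    cleaned = []
    for record in raw_records:
        # Trim whitespace, lowercase the names, skip empty or missing entries.
        # Using a set also removes duplicate items within one basket.
        items = {item.strip().lower() for item in record if item and item.strip()}
        if items:  # baskets that end up empty carry no information
            cleaned.append(frozenset(items))
    return cleaned


# Hypothetical raw records containing noise (blanks, None, duplicates).
raw = [["Beer", " chips "], ["", None], ["Eggs", "flour", "eggs"]]
baskets = clean_baskets(raw)
```

Only after such a step would the baskets be passed on to the mining algorithm.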

2.2 Web mining

Established data mining techniques have been specialized for the Internet in order to extract patterns and content from billions of partly dynamic web pages, hyperlinks and personal data. The web mining process can be divided into three categories:

- Web structure mining: The structure of a site can be recognized from its hyperlinks. Useful content can be filtered out and checked for similarities.
- Web content mining: Web content can be classified and summarized. Product descriptions, forum entries or reviews can be found, and opinions can be extracted from customer ratings.
- Web usage mining: Every step a user takes on a site is recorded in a log file, which the algorithm evaluates. The user's interaction with the site is examined and analyzed in detail.[5]

The information obtained through web data mining can then be used by the algorithms to derive association rules.
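As an illustration of web usage mining, the following Python sketch groups log entries by visitor to reconstruct click paths. The log format (timestamp, visitor ID, URL, separated by spaces) and the function name `click_paths` are assumptions made for this example; real server logs are structured differently.

```python
from collections import defaultdict


def click_paths(log_lines):
    """Group requested URLs by visitor, in the order they appear in the log."""
    paths = defaultdict(list)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        _timestamp, visitor, url = parts
        paths[visitor].append(url)
    return dict(paths)


# Hypothetical log excerpt in the assumed "timestamp visitor url" format.
log = [
    "2010-01-07T10:00:01 u1 /home",
    "2010-01-07T10:00:05 u2 /home",
    "2010-01-07T10:00:09 u1 /product/beer",
    "broken line",
    "2010-01-07T10:00:15 u1 /cart",
]
paths = click_paths(log)
```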

3. Association rules

Since association rules were introduced in 1993, they have become one of the most widely used models for finding regularities (co-occurrence relationships) between data items. The best-known application is shopping cart analysis, which examines customers' purchase decisions by determining the sets of items in their shopping carts. According to one study, 10% of all shoppers put beer and chips in their shopping cart together (support), and 80% of beer buyers also bought chips (confidence).

The following parameters are relevant for the association rules:

- Support - frequency of joint occurrence (e.g. 10% of all shoppers bought both products together).
- Confidence - strength of the rule (e.g. 80% of beer buyers also bought chips).

The algorithms are designed to discover all association rules that meet a predefined minimum support and minimum confidence (see 3.1). Companies apply this knowledge in a targeted manner, e.g. in the form of cross-selling.[6]
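Both parameters can be computed directly from a list of shopping carts. The following Python sketch uses invented sample data, chosen so that the beer and chips figures of 10% and 80% are reproduced; the function names are illustrative.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Share of transactions containing `x` that also contain `y`."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Invented data: 40 shopping carts in total.
transactions = (
    [{"beer", "chips"}] * 4      # beer and chips together
    + [{"beer", "eggs"}] * 1     # beer without chips
    + [{"eggs", "flour"}] * 35   # carts without beer
)
rule_support = support({"beer", "chips"}, transactions)          # 4 of 40 carts
rule_confidence = confidence({"beer"}, {"chips"}, transactions)  # 4 of 5 beer carts
```

Here 4 of 40 carts contain both products (support 10%), and 4 of the 5 carts with beer also contain chips (confidence 80%).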
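What "discovering all rules" above the two thresholds means can be shown with a deliberately simple Python sketch. It tries every item combination by brute force and is therefore only suitable for tiny examples; it is not the Apriori algorithm. The function name `mine_rules` and the sample data are assumptions.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force search for rules X -> Y that meet both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        # Number of transactions that contain the whole item set.
        return sum(1 for t in transactions if itemset <= t)

    rules = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            itemset_count = count(itemset)
            if itemset_count == 0 or itemset_count / n < min_support:
                continue  # not frequent enough
            # Split the frequent item set into every pair (X, Y).
            for k in range(1, size):
                for x in combinations(candidate, k):
                    x = frozenset(x)
                    conf = itemset_count / count(x)
                    if conf >= min_confidence:
                        rules.append((set(x), set(itemset - x), itemset_count / n, conf))
    return rules


# Invented sample data: five shopping carts.
transactions = [
    {"chips", "beer"},
    {"chips", "beer", "eggs"},
    {"eggs", "flour", "sugar"},
    {"eggs", "flour"},
    {"beer"},
]
rules = mine_rules(transactions, min_support=0.4, min_confidence=0.8)
for x, y, sup, conf in rules:
    print(sorted(x), "->", sorted(y), f"support={sup:.2f}", f"confidence={conf:.2f}")
```

With these data, only the rules {chips} -> {beer} and {flour} -> {eggs} meet both thresholds. The rule {beer} -> {chips} has the same support but a confidence of only 2/3 and is discarded.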

"Cross-selling refers to the covering of a customer requirement through the sale of additional products that are connected to the entry-level products (i.e. products that originally created an interest in buying or a business relationship) but are not substitutes for the entry-level products. The sale of the additional products can take place at different times or at the same time as the sale of the entry-level products. A provider can sell additional products that he has created himself or bought from other providers. "[7]

Association rules can be transferred to many other kinds of data; for example, relationships between words in web documents can be examined on this basis. However, the analysis does not capture the order in which items were placed in the shopping cart. For that purpose, sequential pattern algorithms are used, which can, for example, track a user's click path.[8]

3.1 Basic concept

The purpose of association rules is to find item sets (sets of items in the shopping cart) that are related to other items. Set of items: $I = \{i_1, i_2, \dots, i_m\}$, e.g. {chips, beer, eggs, flour, sugar}.

These items are observed over the transactions of a certain period. The transactions describe all purchases made.

Set of transactions: $T = \{t_1, t_2, \dots, t_n\}$, e.g. {{chips, beer}, {eggs, flour, sugar}}

Each transaction $t_i$ consists of items, so $t_i \subseteq I$ must hold.

An association rule is an implication of the form $X \rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. X and Y are item sets and subsets of I. The statement behind this formula is: if X is present, Y is probably present too. For example, a customer who buys eggs (X) very likely also buys flour (Y).

If a transaction t is {eggs, flour, sugar}, this means not only that the customer placed three items in the shopping cart, but also that they bought exactly eggs, flour and sugar. In the association rule {eggs, flour} → {sugar}, {eggs, flour} stands for X and {sugar} for Y. The rule states that a customer who buys X is very likely to buy Y as well.

A transaction $t_i \in T$ is said to contain an item set X if X is a subset of $t_i$; X is then said to cover $t_i$. The support of an association rule $X \rightarrow Y$ is the proportion of transactions in T that contain both X and Y, and it can be read as an estimate of the probability that X and Y occur together. Here n stands for the number of transactions in T.

Figure not included in this excerpt
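The omitted formula presumably corresponds to the standard definition of support. It is reproduced here under that assumption, where $(X \cup Y).\text{count}$ denotes the number of transactions in T that contain all items of X and Y (this notation is an assumption). The second line applies it to the example T = {{chips, beer}, {eggs, flour, sugar}} with n = 2:

```latex
\[
\text{support}(X \rightarrow Y) = \frac{(X \cup Y).\text{count}}{n}
\]
\[
\text{support}(\{\text{eggs}, \text{flour}\} \rightarrow \{\text{sugar}\}) = \frac{1}{2} = 0.5
\]
```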

Support is a useful measure for association rules. If the value is too low, the rule may have come about by chance. Such a rule is also rarely profitable in everyday business, because it covers too few purchases to be worth acting on.

Confidence describes the accuracy of the rule. It is the proportion of the transactions in T containing X that also contain Y, and it can be interpreted as an estimate of the conditional probability of Y given X. Confidence thus determines how reliably the rule predicts. If the value is too low, Y cannot be reliably inferred from X.
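Written as a formula, under the assumption that the standard definition is meant and that $X.\text{count}$ denotes the number of transactions containing X, confidence relates the joint occurrence to the occurrence of X alone. The numerical example takes the beer and chips figures from section 3; the share of 12.5% of shopping carts containing beer is an assumed value chosen to match them:

```latex
\[
\text{confidence}(X \rightarrow Y)
  = \frac{(X \cup Y).\text{count}}{X.\text{count}}
  = \frac{\text{support}(X \rightarrow Y)}{\text{support}(X)}
\]
\[
\text{confidence}(\{\text{beer}\} \rightarrow \{\text{chips}\}) = \frac{0.10}{0.125} = 0.80
\]
```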


[1] [Destatis, 2009a] (07.01.2010)

[2] See [Destatis, 2009b] (07.01.2010)

[3] [Destatis, 2009b] (07.01.2010)

[4] See [Liu, 2007] p.6

[5] See [Liu, 2007] p.7

[6] See [Liu, 2007] p.13

[7] [Schafer, 2002] p.56

[8] See [Liu, 2007] p.13
