What is word2vec?

A vector is worth a thousand words

Stefan Fischerländer

Stefan Fischerländer is the managing partner of the Passau digital agency Gipfelstolz. Since 2000 he has been advising clients with a focus on technical SEO and working as a developer. With his theoretical knowledge and his practical experience, he also supports the SEO tool TermLabs.io as a tech evangelist.


AlphaGo, RankBrain and Google AI: machine learning and artificial intelligence have become household terms among search engine optimizers. What machine learning actually is, however, and how search engines use it, is far less well known. Reason enough for Stefan Fischerländer to present the basics of these methods and to show what is already possible today with software published by Google. The results are impressive, eye-opening, and give at least a rough impression of how far machines can go in assessing content based on texts and their words.

Hardly an SEO conference goes by without a lecture on the importance of machine learning for Google. And Sundar Pichai, Google's CEO, said in October 2015: "We will introduce machine learning into all of our products, in search, in advertising, on YouTube or in the Play Store." But even though Pichai mentioned search first in his list, the examples in lectures and blog posts mostly refer to other, more visual or playful applications. Recently, however, it has become increasingly clear that, at the latest with the relatively new part of the Google algorithm called RankBrain, machine learning methods have also found their way into normal web search. Since RankBrain is the third most important ranking factor according to Google's own statements, this should be reason enough for every online marketer to take a closer look at machine learning.

According to a definition by Arthur Samuel, who coined the term, machine learning describes methods with which computers learn to solve tasks without being explicitly programmed for them. A problem in this category that search engines face all the time is classification: an algorithm tries to sort objects as appropriately as possible into given groups (classes). For search queries with purchase intent (transactional queries), for example, Google often shows almost exclusively online shops and price search engines. The search engine must therefore be using a process somewhere that can classify websites into classes such as "online shop" or "price search engine".

Classify objects

How computers approach this task can be shown with a method that classifies car models. Cars can be characterized by a wide variety of attributes; two particularly meaningful ones are engine power in kilowatts and curb weight in kilograms. Table 1 shows this information for six models.



Model               Power / kW    Weight / kg
Opel Corsa          51            1120
BMW X6              225           2100
Mercedes B-Class
Ford Transit
Mercedes Sprinter
Ford Expedition

Table 1

In general terms, the table contains six objects (car models), each with two variables (power and weight). The objects can thus be drawn in a coordinate system with power on the x-axis and weight on the y-axis. For the data from Table 1, this yields the blue dots in Figure 1.

A person can already gather a lot of information from this graphic representation. Car models that are close neighbors in the graphic are apparently very similar, while cars separated by a large gap, such as the Corsa and the X6, are quite different. In addition, a person immediately recognizes that the cars shown apparently fall into three classes.

Short distance means great similarity

From a mathematical point of view, the car models are each two-dimensional vectors. The Opel Corsa, for example, is represented by the vector (51 1120), the BMW X6 by the vector (225 2100). This representation of arbitrary objects as vectors forms the basis of almost all machine learning methods, and the assumption made here, that two vectors with a small distance represent quite similar objects, applies to most such methods.

The distances between vectors (considered here as position vectors starting at the origin of the coordinate system) can easily be calculated using the Euclidean distance. For the two-dimensional case the formula is:

d = √((x1 − y1)² + (x2 − y2)²)

where (x1 x2) and (y1 y2) are the two position vectors whose distance is to be calculated.

The distance between the Corsa and the X6 is then:

d = √((51 − 225)² + (1120 − 2100)²) = √(174² + 980²) ≈ 995.3

For the Corsa and the B-Class, on the other hand, the distance is only 277.8.
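The distance calculation can be checked directly in Python, using the two vectors from the text:

```python
import math

corsa = (51, 1120)  # (power in kW, weight in kg)
x6 = (225, 2100)

# Euclidean distance between the two position vectors
d = math.dist(corsa, x6)
print(round(d, 1))  # 995.3
```

`math.dist` works for any number of dimensions, which will matter again below.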

For classification, the method still needs the predefined classes to which the objects (car models) are to be assigned. Such classes can be defined by specifying typical values; the three classes "compact car", "SUV" and "van" could therefore be defined using the values in Table 2, for example. These class representatives are shown in Figure 1 as red dots.


Class          Power / kW    Weight / kg
Compact car
SUV
Van

Table 2


A very simple method of assigning the six car models to the classes is to calculate, for each car, the distance to each of the specified class representatives and then to put the car into the class for which the distance is smallest. This method is called a distance classifier and is probably one of the simplest machine learning methods there is.
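Such a distance classifier fits into a few lines of Python. The class representatives below are illustrative stand-ins, since the exact values from Table 2 are not reproduced here:

```python
import math

# Illustrative class representatives as (power in kW, weight in kg);
# the actual values from Table 2 may differ.
classes = {
    "compact car": (60, 1200),
    "SUV": (220, 2100),
    "van": (100, 2000),
}

def classify(car):
    """Assign a (power, weight) vector to the class with the smallest Euclidean distance."""
    return min(classes, key=lambda name: math.dist(car, classes[name]))

print(classify((51, 1120)))   # Opel Corsa -> compact car
print(classify((225, 2100)))  # BMW X6 -> SUV
```

The same function works unchanged for vectors with more than two components, since `math.dist` accepts any dimension.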

Calculate with any number of dimensions

The chosen example with only two variables (weight and power) for the cars is of course very simple, and most objects that computers have to work with have significantly more such variables. A task with more than two dimensions can no longer be represented graphically, but the method just shown does not need a graphical representation at all. It is sufficient if the distance between two objects (vectors) can be calculated, and the Euclidean distance allows that for any number of dimensions. In the general case, the distance formula shown above for two dimensions becomes:

d = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)

where n is the number of dimensions.

In addition to the classification method just shown, there are many others: cluster algorithms, for example, do not require predefined classes but instead try to form groups (clusters) of similar objects on their own. Google News, for example, presents articles on the same topic together; this requires a cluster algorithm in the background that recognizes which articles belong together and must be displayed accordingly. Methods for dimensionality reduction simplify given data so that the essential properties stand out better or can be shown in a graphic. Other methods are used to discover outliers in data or to make predictions, such as Amazon's famous "customers who bought this also bought" feature. Although the results can seem quite intelligent, these methods have little to do with artificial intelligence. Rather, they are statistical methods whose results can form a knowledge base, which in turn can serve as a basis for real artificial intelligence.

The vector space model

As different as the purpose of the various methods may be, at its core, machine learning always consists of two steps: The first step is to represent objects as vectors. As in the car example earlier, this may be trivial. For other tasks, however, this is often the real difficulty. The second step is then to calculate with the vectors obtained in this way; e.g. to determine distances or to find vectors that point in a similar direction. As soon as a problem has been traced back to vectors, all methods can be applied to it without going into the specific task. This is so interesting for search engines because in the classic Information retrieval - the basis of text-based search engines - documents (web pages) are represented as vectors.

In this vector space model, the objects represented as vectors are the web pages to be indexed. But how does a text become a vector? For this purpose, a vector is created for each text whose dimension equals the number of distinct words that occur in at least one of the texts to be indexed. Each dimension stands for one word; if a word occurs in a document, the vector contains a non-zero number at that position. Which number exactly depends on the specific implementation and is determined by formulas that should sound familiar to search engine optimizers: TF-IDF and WDF-IDF are two very frequently used weighting schemes. In both, the number increases when a word occurs more frequently in the document; it decreases, however, when the word appears in a large number of different documents.

This procedure can be illustrated with a concrete example: vectors are to be created for the documents listed in Table 3. Unimportant words such as "is" or "the" are disregarded for the sake of simplicity. This leaves the list of all words that occur in at least one of the documents: (Paris capital France city metropolis). The document vectors therefore have dimension 5; for a "real" web index, the dimension would be in the order of several million.




D1   Paris is the capital of France and the largest city in France
D2   Paris is a city in France
D3   Paris is a metropolis

Table 3


To keep the example simple, a different calculation is chosen instead of TF-IDF or WDF-IDF: the elements of the vector are simply the number of word occurrences in the respective document. This results in the vector for document 3 from Table 3: D3 = (1 0 0 0 1). The first component of the vector stands for the word Paris, the second for capital, and so on; the fifth component stands for metropolis. With these vectors you can now calculate as you like: texts can be classified and clustered as shown in the car example, the similarity of texts can be determined, and everything else the arsenal of machine learning methods has to offer can be applied.
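Building these count vectors can be sketched in a few lines of Python, using the vocabulary and documents from Table 3:

```python
vocabulary = ["Paris", "capital", "France", "city", "metropolis"]

documents = {
    "D1": "Paris is the capital of France and the largest city in France",
    "D2": "Paris is a city in France",
    "D3": "Paris is a metropolis",
}

def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

for name, text in documents.items():
    print(name, to_vector(text))
# D3 -> [1, 0, 0, 0, 1]
```

Note that D1 gets a 2 in the France component, since the word occurs twice; a TF-IDF or WDF-IDF weighting would replace these raw counts with the formulas mentioned above.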

Neural networks on the rise

The methods presented so far are fairly conventional statistical methods. Thanks to various high-profile successes, however, methods based on neural networks are often in the foreground when machine learning is discussed. Neural networks show amazing results in various areas: AlphaGo, the program developed by Google, now beats the best Go players, and image recognition has advanced considerably. For a long time, neural networks were rarely used for working with texts, but that changed at the latest with Google's release of the word2vec software.

Word2vec accepts texts in any language, analyzes the word contexts in the texts and then creates a vector for each word that appears in them. Such a method, which finds vectors that represent words as well as possible, is called a word embedding. The dimension of the vectors can be specified by the user and is usually somewhere between 10 and 500. Word2vec calculates the vectors in such a way that terms with a similar meaning are represented by similar vectors. Since each word is now represented by a vector, words can be calculated with like vectors, which enables interesting insights.

Arithmetic with words

This calculation serves as the prime example in the English-language literature on the topic: king - man + woman = queen. Translated, this means something like: if you subtract all male characteristics from a king and add the characteristics of a woman, you get a queen. This is entirely plausible; the amazing thing is that word2vec establishes this connection without having any idea what a king actually is.
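Word2vec learns such vectors from text; the mechanics of the arithmetic itself can be sketched with hand-made toy vectors. The numbers and the two "meaning axes" below are invented for illustration and are not real word2vec output:

```python
import math

# Toy two-dimensional "word vectors"; one axis roughly encodes
# "royalty", the other "maleness". Purely illustrative values.
vectors = {
    "king":  (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man":   (0.1, 0.9),
    "woman": (0.1, 0.1),
}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def nearest(v, exclude=()):
    """Return the word whose vector is closest to v (Euclidean distance)."""
    candidates = {w: u for w, u in vectors.items() if w not in exclude}
    return min(candidates, key=lambda w: math.dist(v, candidates[w]))

# king - man + woman
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Real word2vec vectors have hundreds of dimensions and no human-readable axes, but the nearest-neighbor lookup after vector arithmetic works exactly like this.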

If you run word2vec on the German Wikipedia, you get very similar findings. Some particularly meaningful results are summarized in Table 4. It is interesting that the queen shows up only as the third-best result. This is probably because queens played a far smaller role in the German-speaking world than, for example, in England.


Expression                       closest vectors
King - man + woman               consort, regent, queen
Queen - woman + man              king, bodyguard, royal couple
better - good + high             higher, lower, high
Yandex - Russia                  search engine, Gmail, Yahoo
Canberra - Australia + Canada    Ottawa, Ontario, Montreal

Table 4

The calculation Canberra - Australia + Canada listed in the table is suitable for determining the capital of any country. If Canada is replaced by Ghana in the formula, word2vec delivers Accra as the capital of Ghana. In an experiment with ten randomly chosen countries, the system found the correct capital seven times.

Why does this work? Word2vec calculates the word vectors in such a way that as much of the meaning in the underlying texts as possible is preserved. Not only are words with similar meanings placed close together; the vectors that point from one word to another also carry meaning of their own. The expression Canberra - Australia evidently yields something like the meaning "is capital of". Figure 4 shows several such is-capital-of vectors, and it is clearly visible that these vectors are very similar.

150,000 tweets are enough

But knowledge can be gained not only from a huge text corpus like Wikipedia, which comprises a good 5 GB of text data in the German version alone. An evaluation of around 150,000 tweets by German search engine optimizers shows that word2vec can group relevant conferences, tools, search engines and internet companies from these short text snippets alone (Figure 5).

This software is now more than just an interesting gimmick. Paul Haahr, one of the leading developers of Google's ranking algorithms, explained at a conference in 2016: "word2vec is one layer of what RankBrain is doing", and the developer of word2vec is also the main author of the RankBrain paper. Haahr also admitted, however, that even Google does not fully understand what exactly RankBrain is doing; it is correspondingly difficult for outsiders to see through. Nevertheless, the examples shown above should make it clear that the ranking algorithm can now access detailed world knowledge with the help of word2vec. Google researchers write: "Word2vec can be used to automatically extract facts, but also to check the correctness of existing facts."

Online marketers should be aware of the importance of this development. Optimizing for individual keywords regardless of the user's search intent, or relying on questionable content with dubious facts, may still largely work today. But developments such as word2vec give the algorithm world knowledge that increasingly condemns any optimization not consistently geared to the needs of the user to failure.