What is the Overture Keyword Search Tool

How do search engines work

Seminar paper by Thomas Haidlas, WS 2003/2004

Abstract

Everyone knows search engines, and everyone uses them. Most people have a personal favorite; they know how to work with it and how to obtain good results. Why we need them should be clear: we use search engines to find information on the WWW.
But why do we have to use several different search tools for this at all? If there were a central administration of the WWW, it could maintain a central index: it would add new pages to the index and delete old ones from it. Such an index does not exist, however; this is a consequence of the history of the net and of its topology. We therefore have to rely on different search tools, and these tools are the topic of the following text.

1.1 Introduction

What good is the largest transport network without signposts and road maps? Nobody would know how to reach their destination. You would drive randomly through the city in search of the nearest theater or cinema. Of course, asking does not cost anything, but you would first have to find someone who has already been to the theater you are looking for. Signposts, street names and other signs are therefore necessary to find your way through the maze of streets.
Correspondingly, constructive help is needed to get the information you want on the World Wide Web, unless you know someone who knows someone who knows the right web address. The WWW is basically a huge information database, but unfortunately a totally unstructured one. Anyone can put their information on the web and make it available to everyone else; there is no central register that manages this information. Searching this mass of data by hand would cost a lot of time and nerves before the desired information is found. So there must be helpers that support us in our Internet research. These popular helpers are called search engines. Unfortunately, they are not a panacea either. For one thing, the search engines do not have all the pages of the WWW in their index. If a search term is entered into a search engine and no relevant hits are found, this does not mean that the information sought does not exist; the search engine simply does not have it in its index yet. We will see later how the index comes about.
Conversely, there can be far too many hits for a search query. There is then an oversupply of information, and the decisive source has to be picked out of the large number of hits. According to research studies, only the first three pages of hits are of interest to the surfer; the remaining hits are largely ignored. So there has to be a mechanism that makes a sensible assessment possible. The e-commerce operators seem to have read these studies as well: they have their pages optimized by so-called search engine optimizers (SEOs) in order to achieve a better position in the hit lists. But there are also measures available to private individuals for getting their pages indexed by the search engines. So search engines serve different purposes:
A search engine is primarily a search tool. A surfer uses it to gather information about a certain topic: he enters the search term into a form and, after sending the request, receives a list of websites that could be associated with the search term. In addition, the search engine has become a tool of advertising and sales. Some search engine operators make money through advertising banners and pop-ups; others charge for an entry in their index or for a top position in the hit list.
So you should not be too naive about the first positions in a hit list.

1.2 Definition: search engine

"Search engine - program for information research on the Internet, which searches the World Wide Web for key terms in files and / or documents and provides the locations in retrievable databases."
(Source: Der Brockhaus)

1.3 Other search tools

1.3.1 Catalogs

Search engines process websites by machine. A catalog, on the other hand, is compiled by editors, so it contains a human factor. The editors surf the net and add interesting websites to their archive. For each site a short review is usually written, which is included in the catalog together with the link. The catalog is sorted hierarchically by categories. The decisive factor for its quality is therefore not the ability of robots but the editorial skill of the editors.
The best-known representative of this kind is Yahoo. Because of the human factor, catalogs cover only a very small area of the web; they cannot compete with the large number of entries in a search engine. In addition, there is for the most part no ranking in catalogs. Catalogs often offer a search function, but it does not search the recorded pages themselves; it searches the reviews within the catalog. If a query cannot be answered, it is often forwarded to a partner search engine, which searches its index and returns the results. The user often does not even notice the forwarding.

1.3.2 Meta search engines

Most surfers have their favorites when it comes to search engines. They know how to enter more complex search queries or how to interpret the ranking and use it to their advantage. In short, they know how to find information most effectively. However, using only one search engine is not always advisable. It is true that today you usually get too many rather than too few results, but there are also more unusual search queries for which the results are few and far between. One should keep in mind that no search engine delivers the same results as another, and none of them searches the whole web.
The conclusion to draw from this is that you should submit your search query to several search engines. The meta search engines were developed for precisely this purpose.
They receive a request and pass it on in parallel to cooperating search engines. These then go about their usual work and send the results back to the meta search engine. Because several search engines are queried in parallel, identical hits can be returned more than once, so the meta searchers have to eliminate duplicate results.
Since these meta search engines cover a large area, they are particularly suitable for searching for rare terms.

2. How do the search engines work

The following chapter deals with the operation of search engines in more detail. The aim is to examine how the search engines build their index, how users' queries are processed and how the weighting of the results comes about.

2.1 Harvesting

Harvesting is about indexing as many pages on the web as possible. This task is done by programs that the search engines send out. These programs, called robots (also crawlers or gatherers), follow the links on the pages one after another, much like a human surfer. Starting from a few start URLs, the robots roam large parts of the WWW.
How exactly the current page is searched varies from search engine to search engine. WebCrawler and Infoseek, for example, only follow the top-level links, whereas AltaVista and Excite also search several sub-levels and send the content back to the search engine that dispatched them. Most robots, however, have a problem with frames and image maps: they only follow ordinary anchor links (<a href=...>). Content that can only be reached via frame references or image maps therefore mostly remains unvisited.
Since the web is constantly changing, a single pass is of course not enough; the crawls must be started in cycles. How long these cycles are depends on the operator and can vary between a few days and several weeks. Fireball even sends its robots out every day. It is particularly effective to send the robots out more often for pages whose update frequency is high, so that new content is picked up promptly.
It should be clear that a single robot cannot visit all the sites on the net in a short time. That is why the search engines start several robots in order to cope with the large number of pages. So that the same page is not visited multiple times, the robots must be coordinated centrally. The large number of search engines with their robots creates considerable traffic on the network.
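The following is a minimal sketch of such a robot in Python, assuming a single-threaded, breadth-first crawl and a hypothetical start URL; real crawlers additionally honor robots.txt (see below) and throttle their requests. It keeps a set of visited URLs so that no page is fetched twice, extracts the <a href=...> links of each page and resolves relative links against the current page.
# Minimal crawler sketch (assumptions: single robot, no robots.txt check,
# hypothetical start URL).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":                        # robots typically only follow <a href=...>
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_urls, max_pages=50):
    queue = deque(start_urls)                 # the start URLs mentioned above
    visited = set()
    harvested = {}                            # URL -> raw HTML, handed to the indexer later
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                          # unreachable pages are simply skipped
        harvested[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return harvested

# pages = crawl(["http://www.example.org/"])  # hypothetical start URL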
There are two ways to prevent robots from visiting your website.
On the one hand, meta tags can be used. They determine whether the page may be included in the search engine's index and whether the links on the page may be followed.
Example meta tag: <meta name="robots" content="index,follow">. The keyword index instructs the robots to include the page in the index; follow allows the robot to follow the links on the page. To prevent indexing or the analysis of the links, the keywords noindex and nofollow are used instead, e.g. <meta name="robots" content="noindex,nofollow">. However, these meta tags are only recommendations; whether or not the robots adhere to them is up to them. The robots.txt file is another way of keeping robots out. This file must be located in the root directory of the web server and can block certain files or entire directories.

2.1.1 Structure of a robots.txt

A robots.txt consists of rules, and each rule in turn consists of two parts. First it is specified which robot the rule applies to; then it is determined which files this robot may index and which it must ignore. Since the search engine operators give their robots fixed names, this is quite simple. The names of the robots are published on the pages of the search engines; the AltaVista robot, for example, is called Scooter.
Example 1:
User-agent: Scooter
Disallow: /privat/geht_dich_gar_nix_an.html
Allow: /alles_offen/
Here the directory alles_offen ("all open") is allowed for the robot Scooter, while access to the file geht_dich_gar_nix_an.html in the directory privat is forbidden.
Example 2:
User-agent: *
Allow: /alles_offen/
So that not every robot has to be addressed individually, there is the * notation. It addresses all robots that visit the site. In this example, all robots are allowed to index.
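As a sketch of how a robot can honor these rules, the following uses Python's standard urllib.robotparser module on the rules from Example 1; the user agent Scooter and the paths are the ones from the text.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: Scooter
Disallow: /privat/geht_dich_gar_nix_an.html
Allow: /alles_offen/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The robot asks before fetching a URL:
print(parser.can_fetch("Scooter", "/alles_offen/index.html"))            # True
print(parser.can_fetch("Scooter", "/privat/geht_dich_gar_nix_an.html"))  # False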

2.2 Structuring of the accumulated data

When the little helpers have sent their data back to the search engine, it still has to be processed. For this purpose an index table is created that contains all the pages visited and the words (keywords) they contain.
To increase the access speed, the search engines try to keep this table completely in main memory. When a search query arrives, the search engine looks up the search string in the table and lists the relevant pages.
This description of the process is of course somewhat naive, because Boolean operators (AND/OR) and truncation must also be taken into account.
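A minimal sketch of such an index table is an inverted index that maps each keyword to the set of pages containing it; the documents below are hypothetical and stand in for the pages delivered by the robots.
from collections import defaultdict
import re

documents = {                      # hypothetical result of the harvesting step
    "http://www.example.org/a.html": "search engines index the web",
    "http://www.example.org/b.html": "catalogs are edited by humans",
    "http://www.example.org/c.html": "robots crawl the web for search engines",
}

inverted_index = defaultdict(set)  # keyword -> set of pages containing it
for url, text in documents.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        inverted_index[word].add(url)

print(sorted(inverted_index["web"]))
# ['http://www.example.org/a.html', 'http://www.example.org/c.html']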

2.3 Evaluation of the request

As already described above, the search string is compared with the index table. If the search string consists of only a single word, the process is relatively simple: wherever there is a match, the page is included in the hit list. Since searchers rarely enter just a single word into the search form, the search string has to be processed further. This processing includes the analysis of truncations, Boolean operators and search expressions, so not only letters but also control and special characters have to be handled.

2.3.1 Table comparison

Now the string must be compared with the entries in the table. If an entry in the table matches the search string, the entry is marked as "true". If the search string consists of more than one word, it must also be checked whether the other conditions apply to this page as well. If this is the case, the page can be added to the hit list.
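Using the inverted index sketched in section 2.2, such a table comparison for Boolean queries can be illustrated as set operations: AND is an intersection, OR a union. This is only a sketch; truncation and phrase handling are omitted.
def evaluate_and(index, terms):
    # pages on which all query terms occur
    result = None
    for term in terms:
        pages = index.get(term, set())
        result = pages if result is None else result & pages
    return result or set()

def evaluate_or(index, terms):
    # pages on which at least one query term occurs
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

print(sorted(evaluate_and(inverted_index, ["search", "web"])))
# ['http://www.example.org/a.html', 'http://www.example.org/c.html']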
But how does the final ranking of such a list come about?

2.4 Weighting of the results

There are several options for weighting the hit list. The procedures used differ from search engine to search engine.

2.4.1 Number of matching keywords

If the query consists of several linked words, the pages with the most matches are placed at the top of the hit list.

2.4.2 Frequency of occurrence

The page on which the search term occurs most frequently is listed highest.

2.4.3 Position of the occurrence of a search term

This is about the place where the search term occurs. First, the domain name and the URL are examined. This is plausible: a website that carries the search term in its name is likely to deal with related topics.
The document names are also checked; if a file name contains the term searched for, this is reflected in the hit list as well. If the search string appears in a heading, it can be assumed that the following text is also relevant for the searcher. If headings are included not as text but as graphics, the weighting of the page is of course made more difficult.
If a search string is found early in the document, the document is classified as more relevant than one in which the string only appears at the end.
Site operators can help the robots to index their site by using meta tags, which summarize the content of the page in a short, machine-readable form. However, meta tags are used less for weighting today, because they were often abused for spamming in the past: meta tags were declared that had nothing to do with the content of the page, which allowed surfers to be directed to pages that did not match their search.
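The criteria from 2.4.1 to 2.4.3 can be combined into a toy scoring function like the one below. The individual weights are invented for illustration only; every search engine chooses its own.
import re

def score(url, title, body, query_terms):
    words = re.findall(r"[a-z]+", body.lower())
    points = 0.0
    for term in query_terms:
        term = term.lower()
        occurrences = words.count(term)
        if occurrences == 0:
            continue
        points += 1.0                    # 2.4.1: another query term matched
        points += 0.1 * occurrences      # 2.4.2: frequency of occurrence
        if term in url.lower():          # 2.4.3: term appears in the URL
            points += 2.0
        if term in title.lower():        # 2.4.3: term appears in a heading
            points += 1.5
        points += 1.0 / (1 + words.index(term))   # 2.4.3: early occurrence counts more
    return points

print(score("http://www.example.org/search.html", "Search engines",
            "how search engines build their index", ["search", "index"]))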

2.4.4 Number of links

This is a weighting method that takes into account a pre-evaluation by the Internet community: the degree of linking is evaluated. The assumption is that if a site operator links to another site, he regards it as a relevant source of information. Popular pages therefore appear further up in the hit list.
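A sketch of this idea: given a link graph recording which page links to which, count the inbound links per page and sort by that count. The graph below is hypothetical.
from collections import Counter

links = {                       # page -> pages it links to (hypothetical)
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}

inlinks = Counter(target for targets in links.values() for target in targets)
print(inlinks.most_common())    # [('c.html', 3), ('b.html', 2)]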

3. Ranking

Since website operators would like to see their website placed high in the ranking, they have come up with methods to influence the ranking in their favor. This includes, for example, the above-described spamming using meta tags.
In order to get better hit lists and to prevent spamming, search engine operators have to come up with new ranking methods.

3.1.1 Listing for money

Money makes the world go round. Some search engines have probably taken this sentence to heart. Website operators can buy a place in the hit list here. Since of course not everyone can and wants to buy a place in the ranking list, this method is combined with other methods. (e.g. http://www.espotting.de or http://www.content.overture.com)

3.1.2 Behavior of users and administrators

The behavior of users and administrators is analyzed here. On the one hand, as already described above, the linking by external sites is analyzed here. A more detailed description of how this works follows later. Furthermore, the surfers themselves are observed. So it is taken into account, for example, how many surfers visit the site.

3.2 Listing for money

Capitalism has also found its way into the search engine sector: whoever pays the most is listed highest. This works roughly as follows. The page providers pay an amount to be included in the catalog of the search engine. When a search query is made, the relevant web pages are displayed, and the entry that has been paid for the most appears at the top. Next to the links, the prices that have been paid for the entries are shown, along with how much additional income the search engine earns when a user clicks on the link. However, this ranking method is not used on its own but is combined with other methods: the paid entries come first, and the normal hits are listed after them.
Of course, this means that the content of the websites is no longer decisive; the operator who can invest the most money is listed first. Online shops and other commercial sites often use this method to lead surfers to their pages and to advertise their products. If you think about it, this kind of marketing must be worthwhile for everyone involved.
Another variant of the listing-for-money principle is to buy faster processing from the search engine operator. The aim is to have the website added to the search engine's catalog without waiting for a robot to find it in its harvesting cycle. Operators of commercial sites simply cannot afford to lose time because the robots have not yet indexed their website. Time is money!
Ranking against cash is a double-edged sword. This ranking method is advantageous for surfers who mainly use the Internet to buy products from established manufacturers and online shops. Surfers who are looking for up-to-date information or for smaller, unknown online shops, however, should avoid such search engines.

3.3 Page Rank / Link Frequency

This is how Google works. It is rated how often a website has been linked to by other site operators: the more links point to the page, the higher its rank in the hit list. It is even checked how often the page from which the link originates has itself been linked to, so a link from a popular page moves the linked page further up in the hit list. There are additional ranking points if the search term appears in the link text. This also prevents some spamming tricks: one could, for example, create "phantom pages" whose only purpose is to place a link to one's own website, but since nothing links to these phantom pages, they carry hardly any weight and barely influence the ranking.
Unfortunately, there are already new tricks that Google is struggling with; more on that later. This "Google" method also has disadvantages of its own. It is extremely difficult for new pages, since there are of course no links to them from other pages yet.
Search queries that combine many terms lead to problems, whereas queries with just one word work without difficulty.
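The idea that a page inherits weight from the pages linking to it can be illustrated with a simplified power-iteration sketch of PageRank. This is the textbook formulation with a damping factor of 0.85 and a hypothetical link graph; Google's actual algorithm is far more elaborate.
def pagerank(links, damping=0.85, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:        # a page passes its own weight on to the pages it links to
                new_rank[target] += share
        rank = new_rank
    return rank

links = {                                 # hypothetical link graph: page -> outgoing links
    "a.html": ["c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
print(sorted(pagerank(links).items(), key=lambda item: item[1], reverse=True))
# c.html ends up highest: two pages link to it, one of them itself well linked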

3.4 Hit Popularity Engine

This ranking method is only applied after a pre-sorting, so there is already an index that is pre-sorted. When a request is made, the hit list is displayed with the help of this pre-sorting. A click on a link then counts as a vote for the page concerned: the click is stored in a database and helps the page move up in the hit list. Frequently clicked pages are considered popular and therefore good by this method, so they are listed further up the next time someone searches for the same topic. Logically, attempted fraud has to be intercepted here as well: the site owner must not be able to click his own site repeatedly in order to push it up in the ranking. Just as with ranking by link frequency, new pages are at a disadvantage here. In addition, high-quality pages may be sidelined simply because they are not visited. However, if such a page that has slipped down is clicked again, this click counts more than a click on a popular page, which allows the page to return to the front ranks. This method, called the "Hit Popularity Engine", was developed by Direct Hit. Some search engine operators have partnered with the company and use this engine for their search engines.
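The click-voting idea can be sketched as follows: clicks are recorded per query and page, and later hit lists for the same query are re-sorted by the accumulated votes. The rule that a click on a rarely clicked page counts more is modelled here by a diminishing vote weight; this is only a guess at the behaviour described above, since Direct Hit's exact scheme is not public.
from collections import defaultdict

votes = defaultdict(lambda: defaultdict(float))   # query -> page -> accumulated vote weight

def register_click(query, page):
    # a click on a page with few votes so far counts more than a click
    # on an already popular page
    votes[query][page] += 1.0 / (1.0 + votes[query][page])

def rerank(query, presorted_pages):
    # most-voted pages first; the pre-sorted order breaks ties (sort is stable)
    return sorted(presorted_pages, key=lambda page: -votes[query].get(page, 0.0))

register_click("cinema", "b.html")
print(rerank("cinema", ["a.html", "b.html", "c.html"]))   # ['b.html', 'a.html', 'c.html']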

4. Election manipulation

The world could have been so beautiful. After many search engines had been fighting the flood of spam for a long time, a new silver lining appeared on the horizon: a new search engine called Google worked with a new ranking concept. As already described above, this concept was meant to provide a preliminary evaluation of the pages by the Internet community. The new Google algorithm takes into account the keywords a page contains and the frequency with which it is linked to. If a webmaster likes someone else's page, he links to it and thus ensures that Google lists it higher in its ranking. So, whether young or old, every webmaster gets a vote. Actually a good idea, given the enthusiasm with which private webmasters maintain their websites: with the same standards they apply to their own pages, they also observe and evaluate other people's pages. As a result, interesting and informative websites are frequently linked and thus made known to all Google users. But even this idealistic approach has been undermined: the democratic process has been manipulated by electoral fraudsters. But first to the "why".
Why should hit lists be manipulated, and who is interested in doing so? It is primarily e-commerce portals that want to influence hit lists in their favor. They want to place their offers as favorably as possible and visible to everyone. Basically it is a form of advertising: search engine marketing. They hire search engine optimizers who know every trick and exploit every loophole in Google's "Webmaster Guidelines" [6] in order to provide their clients with a good ranking. Do not misunderstand me: it is of course completely legitimate for SEOs to optimize their customers' pages by legal means. But unfortunately, as in other industries, there seem to be black sheep here as well.

4.1 How is Google being tricked?

First of all, the so-called link farms should be mentioned. To set up a link farm, many new domains are registered, and programs then generate thousands of interlinked pages. These pages contain keywords that are related to one another; in some cases they are even elaborately designed to conceal the fact that they are part of a link farm. Once a link farm has been set up, links are placed from it to the customers' web pages. Redirects are another popular spamming method. Unfortunately, Google can only recognize redirects implemented via HTML meta tags or simple JavaScript.
Search engine optimizers therefore sometimes deliberately obfuscate the redirect commands so that the Google indexer can no longer recognize them. The surfer is drawn to the provider's page via an intermediate page that contains the terms searched for.
The swapped page is, so to speak, the wolf in sheep's clothing. The search engine optimizer creates a normal page and lets the Google robots index it; afterwards, the content of the page is simply exchanged.
So-called cloaking, on the other hand, is a little more technical. Programs on the server determine who is issuing the request. If it is a robot, it is shown the page that has been optimized for search engines, which again produces a deliberately high ranking. The user who follows the ranking, however, is presented with a completely different page than the robot.
Quite simple but effective is adding links to guest books, preferably guest books with a high ranking. If a robot then runs across such a guest book, it automatically reaches the linked page and adds it to the search engine's index. If the guest book is very popular, the linked page will of course also achieve a high ranking.
If the layout of a page is sufficiently complex, for example because of complicated table structures, even the simple trick of "blind text" can no longer be detected. Blind text is text written in the background color of the website. It can then only be read by the robots and contains specifically selected links and keywords that flow into the ranking process.
There is also a trade among webmasters: webmasters of well-listed sites are paid to set a link to one's own site or link farm, so the link farms also get a good ranking. The trade in web links even goes so far that companies gather a large number of partners around them. The partners' services are then praised on the websites, and the direct link to the partner site is of course subtly pointed out. If a deal comes about, the partner earns a share of the income. It is only logical that you can collect more commissions if your own page rises even further in the ranking. In a nutshell: a high ranking brings more visitors to my site, which leads to more direct income, i.e. sales of my products; the visitors also discover the websites of my partner sites and buy there as well, which in turn leads to indirect income through commissions.
What remains to be said is that the combination of these different methods unfortunately means that a good and popular search engine may once again fall victim to commercial interests.

5. Conclusion

When looking for information on the Internet, the first step is to select the appropriate tool. If no catalog covers the desired topic, or if a term has to be looked up spontaneously and quickly, the search engine is the right tool. Even once you have found your favorite search engine, you should keep looking beyond it. The WWW is a dynamic network, and the search engine market can change as well, be it through better ranking procedures or through excessive spamming that litters the hit lists. The ranking method used should also be considered and assessed; this helps to separate relevant results from irrelevant ones.
If the information sought is very specific, a meta search engine should be used. The search query extends over several search engines and delivers results in a reasonable time.
If you are interested in optimizing your own website, simply type "search engine ranking" into Google. Perhaps there really is a relation between a provider's position in the ranking and its know-how.

6. List of sources

[1] http://www.suchfibel.de
[2] Bager, Jo. "Orientationless information collector", c't 23/99
[3] Karzauninkat, Stefan. "Target manhunt", c't 23/99
[4] Lennartz, Sven. "I am important", c't 23/99
[5] Karzauninkat, Stefan. "Google trashed", c't 1/03
[6] http://www.google.com/webmasters
[7] Sander-Beuermann, Wolfgang. "Treasure hunter", c't 13/98
[8] Dittmar, Arno. "Search engines and inquiries in the WWW" (SS 02): http://homepages.fh-giessen.de/~hg10013/Lehre/MMS/SS02/Dittmar/index.htm
[9] Rudolf, Ralf. "Search engines and inquiries in the WWW": http://homepages.fh-giessen.de/~hg10013/Lehre/MMS/SS02/Rudolf/index.htm