First off, you might ask: so, who is this fub-guy anyway, that he thinks he can tell us about search engines with any authority? I have been researching, building, modifying and using search engines in a professional capacity since '95. In '96, I graduated with a thesis comparing how people used two search engines with different paradigms. Then, in '97 I started work at the computer science department of the University of Nijmegen, as part of the DoRo ("DOcument ROuting") ESPRIT project. I did research on learning algorithms for text classification. This even resulted in a publication: a poster presentation at SIGIR'98 in Melbourne. In '98, I started work at Cap Gemini at the Advanced Technology Services division. There, I was the architect and search engine builder of the Sibylle project. My professional involvement with search engines ended when I started work at GX in November '01.
How Search Engines Work
A Search Engine consists of roughly three components: the crawler, the indexer and the search-interface. The crawler is a program that 'harvests' webpages. It operates on a list of URLs: one by one, the URLs are requested from their servers, the HTML is downloaded and stored on disk, and links are extracted from the HTML and placed back on the list.
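To make that concrete, here's a toy version of the crawl loop in Python. It 'fetches' from an in-memory dict instead of real webservers, and the example URLs and the link-extracting regex are just mine, for illustration:

```python
import re
from collections import deque

# A toy 'web': URL -> HTML body (stand-in for real HTTP fetches)
WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a>',
    "http://b.example/": '<a href="http://a.example/">a</a> <a href="http://c.example/">c</a>',
    "http://c.example/": "no links here",
}

def crawl(seed_urls):
    """Breadth-first crawl: fetch, store, extract links, put new URLs on the list."""
    queue = deque(seed_urls)
    stored = {}                      # URL -> downloaded HTML ('on disk')
    while queue:
        url = queue.popleft()
        if url in stored or url not in WEB:
            continue                 # already crawled, or unreachable
        html = WEB[url]              # in real life: an HTTP GET
        stored[url] = html
        for link in re.findall(r'href="([^"]+)"', html):
            queue.append(link)       # back on the list it goes
    return stored
```

A real crawler adds politeness delays, robots.txt handling and the re-crawl heuristics described below, but the loop itself really is this simple.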
Of course, this is a never-ending task: the internet is huge and even if you could crawl everything, some pages will have changed when you have finished your list -- so you'd have to start again. Therefore, some heuristics are used to determine which pages should be crawled more often than others -- pages that everybody links to and that change often are crawled more frequently.
Next is the indexer. This program strips all HTML tags out of the text, adds meta-data (the title of the document, the URL, date last modified, etc.) and places all this information in the index. The index is an optimised database that contains information in the form of 'term X occurs Y times in document Z' or even 'term X occurs on position Y in document Z'. With all the added overhead of the indexing structure and metadata, the final size on disk is about twice the size of the HTML! Operations that could be run here include stemming (reducing words to their linguistic 'stem') or removal of stopwords (no-one should search on 'the', for instance).
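A minimal sketch of such a positional index in Python -- the stopword list and documents are made up for the example:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of"}   # tiny illustrative list

def build_index(documents):
    """Build a positional inverted index: term -> {doc_id: [positions]}.

    This encodes exactly 'term X occurs on position Y in document Z'.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.lower().split()):
            if word in STOPWORDS:
                continue               # no-one should search on 'the'
            index[word][doc_id].append(pos)
    return index
```

So `build_index({"doc1": "The cat sat"})["cat"]["doc1"]` gives `[1]`: 'cat' occurs at position 1 of doc1 (and 'the' was dropped as a stopword).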
When all the information is stored in the index, the search-interface can be used. This is the 'front-end' part of a search engine: users give their queries to the search-interface for processing. Basically, the user's query is run through the index and documents that match the query are presented -- think of it as a rather complicated SQL statement and you're not too far off (in fact, in the case of the Oracle Index Server, the query is an SQL statement!).
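Stripped of all ranking, query processing can be as simple as this sketch -- an AND-query over the stored documents (the documents in the usage note are my own toy example):

```python
def search(query, documents):
    """Return the ids of documents containing every query term (an AND query)."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in documents.items()
            if all(t in text.lower().split() for t in terms)]
```

With `documents = {"d1": "basketball news today", "d2": "basketball scores", "d3": "cooking news"}`, the query 'basketball news' matches only d1. A production engine would of course run this against the index instead of re-scanning the raw text.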
But how does the search engine know what documents to present? Search engines work on the 'keyword assumption': If word X occurs in a document, the document is 'about' X. We know that this isn't true -- but we don't (yet) have machines that can determine the meaning of any given text. So counting words is all we can do.
But clearly, not every document that contains X is a desirable result when the user searches for X. Most search engines use a mathematical 'trick', called TF/IDF, to determine which documents to present for a given query. 'TF' stands for 'Term Frequency': the number of occurrences of the term in a document. 'IDF' stands for 'Inverse Document Frequency', which is derived from the number of documents the term occurs in at all: the fewer documents contain a term, the higher its IDF. Terms with a high IDF are more specific (and thus, in the context of a search query, more important) than terms with a low IDF. And documents in which the term occurs more often are better matches than documents where the term occurs only once.
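The exact formula differs per engine, but the common textbook form multiplies TF by a log-scaled IDF. A sketch, with made-up toy documents:

```python
import math

def tf_idf(term, doc_terms, corpus):
    """Score 'term' in one tokenised document against a corpus of tokenised docs."""
    tf = doc_terms.count(term)                   # term frequency in this document
    df = sum(1 for d in corpus if term in d)     # number of docs containing the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)             # rarer term -> higher IDF
    return tf * idf
```

For a corpus of three documents where 'basketball' occurs once in one document and twice in another, the twice-occurrence document scores higher -- same IDF, higher TF.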
TF/IDF was the state of the (commercial) art in the days of AltaVista. But then the spammers came, who tried to manipulate their documents to gain favourable indexing for their documents on certain terms. This is partly solved by imposing an upper limit on the TF (say, 10), but that doesn't negate the ill effects of the spamming. Also, suppose that for a given term X (say, 'basketball'), there are 150 documents whose TF for that term is 10 -- because the IDF is the same for all documents, the TF/IDF weight of that term for those 150 documents is the same. But you want to present only 10 documents on the first page -- what are you to do?
This is where additional heuristics come into play. For instance, if the site name contains the term (as is the case with www.basketball.com), then give the document a higher weight. Also, if the URL path contains the term (as is the case with dmoz.org/Sports/Basketball/), then give it a higher weight (but slightly less than with the term in the site name).
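Such a heuristic could look like this -- the weights 2.0 and 1.5 are invented for the example; real engines tune these numbers carefully:

```python
def url_bonus(term, url):
    """Heuristic boost: term in the site name beats term in the path."""
    host, _, path = url.partition("/")
    if term in host.lower():
        return 2.0                   # www.basketball.com
    if term in path.lower():
        return 1.5                   # dmoz.org/Sports/Basketball/
    return 0.0
```

This bonus would then be added to (or multiplied into) the TF/IDF score to break ties between otherwise equally-weighted documents.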
It is also important to realise that a webpage does not exist in a vacuum. It connects to other pages (on-site or off-site) via links, and other pages link to it. Each link comes with a link-text, and that text is usually a description of the contents of the linked page -- so the link-texts of all inbound links are good descriptions of the content of that page. That is what makes www.nba.com a good candidate for the top position for 'basketball'.
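One common way to use this is to index a page under its own words plus the link-texts of the links pointing at it. A sketch (the example texts are made up):

```python
def page_terms(own_text, inbound_anchors):
    """A page is described by its own words plus its inbound link-texts."""
    terms = own_text.lower().split()
    for anchor in inbound_anchors:
        terms.extend(anchor.lower().split())   # treat anchor words as page content
    return terms
```

If two sites link to the NBA homepage with the anchor 'basketball scores', the page now counts two extra occurrences of 'basketball' -- even if the page itself never uses the word.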
As with any feature, this creates weird artifacts for certain situations. For instance, search for the term 'here' -- you'll get links to the Adobe Acrobat Reader and the RealOne player -- because lots of webmasters link to these documents with a text like 'You need the Acrobat Reader to view these documents. Click here to download it.' There have been incidents where 'evil empire' linked to microsoft.com, and more of the kind.
And, of course, there is the Google PageRank, by far the most powerful feature of that search engine. It uses the links between webpages to determine which document is 'more important' than the others.
Not all sites are created equal. There are small sites that contain nothing but baby-pictures of the owners' newborn kid, and then there are sites like Yahoo, that everyone links to. Clearly, a hit on Yahoo is of more value to the average searcher than a hit on the page of baby-pictures -- but how do you calculate a site's worth?
Some sites, like Yahoo, are important because they link to a lot of pages, and they get linked to a lot. Such a site is, so to speak, a crossroads on the web. If a lot of people link to a site, they implicitly endorse its content -- if they thought it was crap, they wouldn't link to it. (Unless, of course, it is a list of sites that suck, but that will not be the case on average.) Therefore, these often-linked-to sites are considered 'important' and 'trusted'.
These sites then link to other sites, and 'pass on' a bit of their 'trust'. The assumption is that the quality of the content on the site also extends to the off-site links. And those sites pass a bit of their trust to their off-site links, and so on.
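This 'passing on of trust' is essentially what PageRank computes, by iterating until the trust values stop changing. A simplified sketch -- the link graph in the usage note is invented, and this toy version ignores real-world details like pages with no outgoing links:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a link graph.

    links: dict mapping each page to the list of pages it links to.
    Each page passes its rank, split evenly, along its outgoing links;
    the damping factor models a surfer who sometimes jumps to a random page.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            for target in outs:
                new[target] += damping * rank[page] / len(outs)
        rank = new
    return rank
```

On a toy graph where two blogs and the baby-pictures page all link to Yahoo, Yahoo ends up with the highest rank -- it collects trust from everyone who links to it.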
Therefore, the www.nba.com website has a very high 'trust' from all sites that link to it, which also helps to put it on the top spot for 'basketball'.
www.livejournal.com is also a very 'trusted' site, because a lot of people link to it in their blogs. That is why, if you search for 'fub', my LJ pops up as the 12th hit. The eleven earlier hits are all institutions that have 'FUB' as their acronym.
I hope you don't get confused by the technical terms. If any of you have any additional queries, just ask! And if I've made a mistake, I'm sure xaviar_nl will correct me!