Introduction to Information Retrieval - Search Engines

  • by Michael Hawthornthwaite
  • Published on 1/30/07

This article aims to provide readers with an overview of the very basics of information retrieval. Understanding these principles can help you to optimise your website content for the search engines and also help you to analyse search engine algorithm changes. However, the details in this article are not intended to describe how modern search engines work, as they use many additional factors, including link analysis.

Information retrieval (IR) is the science of searching for documents and for information within documents. Information retrieval techniques form some of the most fundamental elements of web search engine technology. This article discusses information retrieval in the context of search engines.

Indexes

It is unrealistic to access documents remotely in real time when performing a search, as this would be exceptionally slow and unreliable. Therefore a local index is created in advance, which for search engines is built from the pages fetched by a crawler (aka spider). Thus, when you perform a search you are not actually searching the web, but are searching a version of the web as seen and stored by the crawler at some point in the past.

The index does not usually contain the whole document (although this may be stored in a separate document cache). Instead it stores a representation of the terms relevant to the document that is quick and easy to search. Building this representation involves several stages (not all systems will include all of them):

1. Document
This is the document in its raw format with all text, structure and formatting.

2. Structure Analysis
Recognising headings, paragraphs, titles, bold text, lists, etc.

3. Lexical Analysis
Converting the characters in the document into a list of words. This process may include analysing digits, hyphens, punctuation and the case of letters. Proper Noun Analysis can use the case and format of words/phrases to identify important information such as names, places, dates and organisations.

4. Stopwords Removal
The removal of words which occur very often and so provide no ability to discriminate between documents. For example: 'the', 'it', 'is'. However, some search engines appear to leave these words in the index and remove them at the user query level instead, which allows '+word' queries to be performed.

5. Stemming
This is a conflation procedure which reduces variations of a word into a single root. For example, both 'worked' and 'working' may be reduced to 'work'. The Porter Stemming Algorithm can be used to perform stemming.

After these processes have been performed we have a list of index terms for this particular document.
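
The lexical analysis, stopword removal and stemming stages above can be sketched roughly as follows. This is a minimal illustration only: the stopword list is a tiny made-up sample, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer such as the Porter Stemming Algorithm, which applies far more careful rules.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"the", "it", "is", "a", "an", "and", "this", "about", "of", "to"}

# Suffixes removed by the naive stemmer below (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lexical analysis: lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop words that are too common to discriminate between documents."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Run the stages in order and return the document's index terms."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


print(index_terms("It is working: the crawler worked"))
# ['work', 'crawler', 'work']
```

Both 'working' and 'worked' are reduced to 'work', so a query for either form would match this document.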

Index Term Weighting

We now need to calculate to what degree a term is relevant to a particular document. The following is an example of a weighting scheme:

* Index Term Frequency
This is the frequency of a term inside a document. The frequency is usually normalised within the particular document:
TermFrequency(term, document) = (no. occurrences of term in document) / (no. occurrences of the most frequent term in document)

* Inverse Document Frequency
The inverse of the frequency of a term across all the documents in the set. Terms that appear in many documents are not very useful, as they do not allow us to discriminate between documents.
IDF(term) = log([no. documents in collection] / [no. documents in collection containing term])

* Weight
This is the actual index term weight for a particular term in a particular document:
Weight(term, document) = TermFrequency(term, document) * IDF(term)

Other factors may also contribute to the weight, such as the term's position in the document, whether it was in the title, whether it was bold, whether it was in a list, etc.
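
The three formulas in the weighting scheme above can be written as a short sketch. The function names are illustrative, and the natural logarithm is an assumption, since the formula does not specify a base. Documents are represented simply as lists of index terms.

```python
import math
from collections import Counter


def term_frequency(term, doc_terms):
    """Occurrences of term divided by occurrences of the most frequent term."""
    counts = Counter(doc_terms)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def idf(term, collection):
    """log(no. documents / no. documents containing term); natural log assumed."""
    containing = sum(1 for doc in collection if term in doc)
    if containing == 0:
        return 0.0  # term absent from the whole collection
    return math.log(len(collection) / containing)


def weight(term, doc_terms, collection):
    """Index term weight: TermFrequency * IDF."""
    return term_frequency(term, doc_terms) * idf(term, collection)


# Toy collection of three documents, each already reduced to index terms.
docs = [
    ["file", "website", "search", "engine", "optimisation"],
    ["website", "design", "tutorial", "file"],
    ["file", "bespoke", "software", "design", "development"],
]

print(weight("file", docs[0], docs))    # 0.0, as 'file' is in every document
print(weight("search", docs[0], docs))  # log(3), as 'search' is in one document
```

A term found in every document gets an IDF of log(1) = 0, which is the formula's way of saying that it cannot discriminate between documents.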

Reverse Index

We now have a list of terms (with their weights) for a given document. However, a list of the documents that contain a particular word is much more useful for searching than a list of the words in a particular document. This is called a reverse index (more commonly known as an inverted index).

For example, if we had the following three documents:

1. This is a file about website search engine optimisation
2. A website design tutorial file
3. A file about bespoke software design and development

Then the index terms for each document may be as follows (weights would be in parentheses):

1. file(?), website(?), search(?), engine(?), optimisation(?)
2. website(?), design(?), tutorial(?), file(?)
3. file(?), bespoke(?), software(?), design(?), development(?)

However, the reverse index would be:

file: document1(?), document2(?), document3(?)
website: document1(?), document2(?)
search: document1(?)
engine: document1(?)
optimisation: document1(?)
design: document2(?), document3(?)
tutorial: document2(?)
bespoke: document3(?)
software: document3(?)
development: document3(?)

The reverse index then allows us to easily find the relevant documents for a particular word.
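
As a rough sketch, the reverse index above can be built from the per-document term lists like this. Weights are left out because the example leaves them unspecified, and the name `build_reverse_index` is illustrative only.

```python
from collections import defaultdict

# Index terms for the three example documents, keyed by document number.
documents = {
    1: ["file", "website", "search", "engine", "optimisation"],
    2: ["website", "design", "tutorial", "file"],
    3: ["file", "bespoke", "software", "design", "development"],
}


def build_reverse_index(docs):
    """Map each term to the list of document numbers that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # dict.fromkeys removes repeated terms while keeping their order.
        for term in dict.fromkeys(docs[doc_id]):
            index[term].append(doc_id)
    return dict(index)


reverse_index = build_reverse_index(documents)
print(reverse_index["file"])    # [1, 2, 3]
print(reverse_index["design"])  # [2, 3]
```

Looking up a word is then a single dictionary access rather than a scan over every document.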

Similarity Matching

This is the process for computing the relevance of a document to a particular query. It can comprise:

* Query Term Weighting
Applies weights to each term in a query. For example, terms at the beginning of a query may be weighted more heavily.

* Similarity Coefficient
Uses the query term weights and document term weights to compute the similarity between a query and a document. The similarity could be calculated using the vector space model and calculating the cosine coefficient (this will not be discussed here).
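
For readers who want a concrete picture, the cosine coefficient mentioned above can be sketched as follows. Queries and documents are treated as sparse vectors of term weights; the weights used in the example are made-up numbers for illustration only.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (q_norm * d_norm)


# Hypothetical weights: a two-term query and two candidate documents.
query = {"website": 1.0, "design": 1.0}
doc_a = {"website": 0.4, "design": 0.4, "tutorial": 1.1}
doc_b = {"bespoke": 1.1, "software": 1.1}

print(cosine_similarity(query, doc_a))  # greater than 0: shares terms with the query
print(cosine_similarity(query, doc_b))  # 0.0: no terms in common
```

The result ranges from 0 (no shared terms) to 1 (identical direction) when all weights are non-negative, so documents can be ordered by this value.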

Refreshing the Index

Documents can change at any time, so the index needs to be continually refreshed. The crawler needs to decide how often to reindex particular documents, based on how often they are updated. If a document is rarely updated, then reindexing it frequently would be a waste of resources. However, documents that change constantly need to be reindexed frequently, as they may no longer be relevant to the terms they are currently indexed for.
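
One simple way to picture this trade-off is an adaptive revisit interval. The sketch below is a hypothetical heuristic, not a description of how any real crawler schedules its visits: the function name, the halving and doubling rule, and the limits are all assumptions chosen for illustration.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Return the number of hours to wait before reindexing a document again.

    Hypothetical rule: halve the interval if the document changed since the
    last visit, double it if it did not, and keep the result within limits.
    """
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


print(next_interval(24, changed=True))   # 12.0
print(next_interval(24, changed=False))  # 48
```

Under this rule, frequently updated documents drift towards short intervals and static documents towards long ones, which matches the resource argument above.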

Measuring Accuracy of IR Systems

Two of the simplest ways to assess the accuracy of a basic information retrieval system are Precision and Recall. These are calculated using the set of relevant documents and the set of retrieved documents (the documents perceived to be relevant by the system). The relevant documents returned to the user are those where these two sets overlap.

* Precision
Ratio of no. relevant documents returned to the total number of documents retrieved - i.e. the proportion of the documents returned that are relevant.

* Recall
Ratio of no. relevant documents returned to the total number of relevant documents - i.e. the proportion of the relevant documents that are returned.

The documents actually returned from the retrieved documents set will be decided using some form of ranking mechanism (discussion of this is beyond the scope of this article).

Generally, there is a compromise between precision and recall, as increasing the number of documents retrieved is likely to also increase the number of irrelevant documents in the set of retrieved documents.
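
Both measures can be computed directly from the two sets, as in the sketch below. The document numbers are invented for the example, and returning 0.0 for an empty set is a convention assumed here rather than part of the definitions.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of document identifiers."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents are truly relevant; the system retrieved three documents.
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
print(p)  # 2 of the 3 retrieved documents are relevant
print(r)  # 0.5: 2 of the 4 relevant documents were retrieved
```

Retrieving every document in the collection would push recall to 1.0 but would usually drag precision down, which is the compromise described above.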

Web Search Engines

Web search engines (such as Google, Yahoo! and MSN) usually combine information retrieval techniques with link structure analysis, as well as many other undisclosed techniques. The techniques described above are very easily spammed, for example by repeating a term many times to inflate its weight, so any useful search engine needs to try to filter out spamming where possible.

Michael Hawthornthwaite works at Acid Computer Services (Manchester) who specialize in web design, web development and bespoke software development.


The opinions expressed in the above articles are solely of the authors and the owners of Point Articles may or may not agree with them. Copyright (c) 2008 Point Articles. All rights reserved.