Sunday 29 November 2009

3.8 Information Retrieval

Information retrieval refers to the retrieval of unstructured information relevant to a particular user’s requirements. Because the relevance of the results is subjective, information retrieval is probabilistic, whereas querying a database for structured information is deterministic. For example, many users may enter the same search terms into a search engine while actually looking for different information, whereas several users who query an RDBMS using the same SQL should be attempting to retrieve the same information.

To allow efficient retrieval of unstructured information such as text, the information has to be indexed. This involves identifying the relevant fields and words to index and preparing the text, which is achieved by removing stop words, stemming and identifying synonyms. The most widely used type of index is an inverted file: an index of searchable terms in which each term has a list of the documents that contain it.
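To make the idea of an inverted file more concrete, here is a minimal Python sketch of the preparation and indexing steps. It is only an illustration under some assumptions of my own: the stop list is a tiny made-up one, the `stem` function is a crude suffix stripper standing in for a real stemmer, the sample documents are invented, and synonym handling is left out entirely.

```python
from collections import defaultdict

# Assumed, very small stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prepare(text):
    """Lowercase the text, strip punctuation, drop stop words, then stem."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each prepared term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in prepare(text):
            index[term].add(doc_id)
    return index


# Invented sample documents, keyed by document id.
docs = {
    1: "Indexing the text of web pages",
    2: "The retrieval of indexed pages",
    3: "Querying a database",
}
index = build_inverted_index(docs)
```

With these sample documents, "Indexing" and "indexed" are both reduced to the term `index`, so looking up that one term leads to documents 1 and 2, while stop words such as "the" never reach the index at all.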

To find resources for my DITA blog I have relied mainly on Google. Google has three distinct parts: Googlebot, the web crawler that finds and retrieves web pages; the indexer, which sorts through the full text of web pages and stores search terms in a massive database; and the query processor, which carries out the search by comparing the entered terms with the index. There is currently some confusion about Google’s use of stop words. Google used to ignore stop words automatically, but it told you that it was doing so and gave you the option to repeat the search with the words included. This message no longer appears, and it is unclear whether Google has stopped using stop words and now indexes every single word, or whether it still uses stop words but no longer tells the searcher.
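The query processor step, and the reason stop words matter to the searcher, can be sketched in a few lines of Python. This is not how Google actually works; it is a toy under my own assumptions, with a made-up index (`INDEX`), a made-up stop list and a hypothetical `search` function that simply returns the pages containing every query term.

```python
# Assumed stop list for illustration only.
STOP_WORDS = {"the", "of", "to", "be", "or", "not"}

# Hypothetical toy inverted index: term -> set of page ids.
INDEX = {
    "to": {1, 2, 3},
    "be": {1, 3},
    "or": {1, 2},
    "not": {1},
    "hamlet": {1, 4},
}


def search(query, index, ignore_stop_words):
    """Return the ids of pages that contain every term in the query."""
    terms = query.lower().split()
    if ignore_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    if not terms:
        # Nothing left to look up once the stop words are removed.
        return set()
    # Start from the pages for the first term, then narrow with each other term.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


with_stop_words_ignored = search("to be or not to be hamlet", INDEX, True)
with_every_word_used = search("to be or not to be hamlet", INDEX, False)
```

In this toy example the two searches return different sets of pages, and a query made up only of stop words returns nothing when they are ignored, which is why it matters to the searcher whether the words are being dropped silently.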
