Sunday, October 12, 2008

ESP : Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Index Design Factors
Major factors in designing a search engine's architecture include:

Merge factors

How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.

Storage techniques
How to store the index data, that is, whether information should be data compressed or filtered.

Index size
How much computer storage is required to support the index.

Lookup speed
How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

Maintenance
How the index is maintained over time.

Fault tolerance
How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

Index Data Structures
Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include:

Suffix tree

Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. The suffix tree is a type of trie. Tries support extendable hashing, which is important for search engine indexing.[8] Used for searching for patterns in DNA sequences and clustering. A major drawback is that the storage of a word in the tree may require more storage than storing the word itself. An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression such as the BWT algorithm.

Tree
An ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table but less space-efficient.

Inverted index
Stores a list of occurrences of each atomic search criterion[10], typically in the form of a hash table or binary tree.

Citation index
Stores citations or hyperlinks between documents to support citation analysis, a subject of Bibliometrics.

Ngram index
Stores sequences of length of data to support other types of retrieval or text mining.

Term document matrix
Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.