Tokens

Within a document collection, we assume that each document has a unique document identifier (docID). For now, you can think of tokens and normalized tokens as loosely equivalent to words. To build the index, multiple occurrences of the same term from the same document are merged, and the result is split into a dictionary and postings, as shown in Figure 1.4. The postings are sorted by term and, secondarily, by docID, so that each term's postings list is in docID order. This ordering provides the basis for efficient query processing. The inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search. In Chapter 5, we will examine how the dictionary and the postings can each be optimized for storage and access efficiency. We will also discuss how to use the data structure of a postings list in a search engine.
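
The indexing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function names `build_index` and `intersect`, the toy documents, and the choice to store a document frequency in the dictionary are all assumptions made for the example. It also assumes the tokens have already been normalized.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index.

    docs: dict mapping docID (int) -> list of normalized tokens.
    Returns (dictionary, postings); both names are hypothetical.
    """
    # Collect (term, docID) pairs for every token occurrence.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]

    # set() merges repeated occurrences of a term within one document;
    # sorted() orders the pairs by term, then secondarily by docID.
    pairs = sorted(set(pairs))

    # Group the sorted pairs into one postings list per term.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)

    # Assumption: the dictionary records each term's document frequency.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)


def intersect(p1, p2):
    """Merge-intersect two postings lists that are sorted by docID."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical three-document collection, already tokenized and normalized.
docs = {
    1: ["new", "home", "sales", "new"],
    2: ["home", "sales", "rise"],
    3: ["new", "sales"],
}
dictionary, postings = build_index(docs)
both = intersect(postings["new"], postings["home"])  # docs containing both terms
```

Because every postings list is in docID order, `intersect` can walk the two lists in a single pass, which is one way the sorted postings support efficient query processing.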


An Introduction to Information Retrieval: Manning