1.3. Recoll overview

Recoll uses the Xapian information retrieval library as its storage and retrieval engine. Xapian is a very mature package using a sophisticated probabilistic ranking model. Recoll provides the mechanisms for getting data into the system (indexing) and back out of it (searching).

In practice, Xapian works by remembering where terms appear in your document files. The acquisition process is called indexing.
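
As a rough illustration of what "remembering where terms appear" means, the following Python sketch builds a toy inverted index that maps each term to the documents and word positions where it occurs. It is not Xapian's actual data structure or API; the names build_index and search, and the sample documents, are invented for this example.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        # Split the text into word tokens and record where each one occurs.
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(name, []).append(pos)
    return index


def search(index, term):
    """Return the sorted names of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical document set, standing in for files on disk.
docs = {
    "notes.txt": "Xapian stores terms and positions",
    "todo.txt": "Index the notes, then search the index",
}
index = build_index(docs)
print(search(index, "index"))
```

Note that the index holds only terms and positions, not the original text, which is why it cannot stand in for the documents themselves.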

The resulting index can be big (roughly the size of the original document set), but it is not a document archive. Recoll can only display documents that still exist at the place from which they were indexed. (There is a way to reconstruct a document from the information in the index, but the result is poor, because all formatting, punctuation and capitalisation are lost.)

Recoll stores all internal data in Unicode UTF-8 format, and it can index files with different character sets, encodings, and languages into the same index. It has input filters for many document types.
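
The idea of normalising differently encoded input to UTF-8 can be sketched in a few lines of Python. This is only an illustration of the principle, not Recoll's input filter code; the helper name to_utf8 is hypothetical, and the character set of each input is assumed to be known in advance.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given character set and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# The same word as it might arrive from two differently encoded files.
from_latin1 = "café".encode("latin-1")
from_utf16 = "café".encode("utf-16")

# After conversion both inputs have the same byte representation,
# so they can be stored and matched as the same term.
same = to_utf8(from_latin1, "latin-1") == to_utf8(from_utf16, "utf-16")
print(same)
```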

Stemming depends on the document language. Recoll stores the unstemmed versions of terms and uses auxiliary databases for term expansion. This lets it switch stemming languages, or add a language, without reindexing. Storing documents in different languages in the same index is possible, and useful in practice, but it can cause confusion. Recoll currently makes no attempt at automatic language recognition.
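
The following Python sketch shows why keeping unstemmed terms makes it possible to change stemming languages without reindexing: the expansion table is derived from the list of indexed terms alone, so it can be rebuilt with a different stemmer at any time. The suffix-stripping function toy_stem is a deliberately crude stand-in for a real stemmer, and all names here are hypothetical rather than part of Recoll or Xapian.

```python
from collections import defaultdict


def toy_stem(term):
    """Crude suffix stripper, used only for illustration."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def build_expansion(terms, stem):
    """Build the auxiliary table: stem -> set of unstemmed indexed terms."""
    expansion = defaultdict(set)
    for term in terms:
        expansion[stem(term)].add(term)
    return expansion


def expand(query_term, expansion, stem):
    """Return every indexed term that shares a stem with the query term."""
    return sorted(expansion.get(stem(query_term), {query_term}))


# Unstemmed terms as they would be stored in the main index.
indexed_terms = {"index", "indexes", "indexing", "search", "searches"}

# The auxiliary table is built from the term list, not from the documents.
expansion = build_expansion(indexed_terms, toy_stem)
print(expand("indexing", expansion, toy_stem))
```

Supporting another language in this sketch would mean calling build_expansion again with a different stem function; the indexed terms stay untouched.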

Recoll has many parameters which define exactly what to index, and how to classify and decode the source documents. These are kept in a configuration file. A default configuration is copied into a standard location (usually something like /usr/[local/]share/recoll/examples) during installation. The default parameters from this file may be overridden by values that you set inside your personal configuration, found by default in the .recoll subdirectory of your home directory. The default configuration will index your home directory with default parameters and should be sufficient for giving Recoll a try, but you may want to adjust it later.
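
The layering of personal settings over installed defaults can be pictured with a short Python sketch. It models only the lookup order, not how Recoll actually parses its configuration files; the parameter names and values below are illustrative assumptions.

```python
from collections import ChainMap

# Values as they might come from the installed default configuration.
defaults = {
    "topdirs": "~",
    "indexstemminglanguages": "english",
}

# Values set in the personal configuration; only the overrides are listed.
personal = {
    "indexstemminglanguages": "english french",
}

# Lookups consult the personal settings first, then fall back to the defaults.
effective = ChainMap(personal, defaults)

for name in sorted(effective):
    print(name, "=", effective[name])
```

Anything not mentioned in the personal configuration keeps its default value, which is why a personal file can stay very short.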

Indexing starts automatically the first time you run the recoll graphical search interface. You can also start it by running the recollindex command.

Searches are performed inside the recoll program, which has many options to help you find what you are looking for.