4.5. Extending Recoll

4.5.1. Writing a document filter

Recoll filters are executable programs that translate a specific format (e.g. OpenOffice, Acrobat) into the Recoll indexing input format, which is HTML.

Recoll filters are usually shell scripts, but any executable will do. The filters themselves are very simple: most of the difficulty lies in extracting the text from the native format, not in producing the output that Recoll expects. Fortunately, most document formats already have translators or text extractors that handle the difficult part and can be called from the filter.

Filters are called with a single argument which is the source file name. They should output the result to stdout.

The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the filter whether the operation is for previewing (yes) or for indexing (no). Some filters use this to output a slightly different format, but handling it is optional.
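A filter written in Python could read this variable as in the minimal sketch below. The function name is_for_preview is hypothetical, and treating an unset variable as "no" is an assumption made for the example, not documented Recoll behaviour.

```python
import os


def is_for_preview(environ=os.environ):
    """Return True when the filter is being run to build a preview."""
    # Assumption: an unset variable or any value other than "yes"
    # means the filter is being run for indexing.
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
```

A filter could then branch on is_for_preview() to emit, for instance, slightly richer markup for the preview window and plain text for the index.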

The output HTML can be very minimal, as in the following example:

<html><head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>some text content</body></html>

Take care to escape special characters inside the text by converting them to the appropriate entities: "&" must become "&amp;" and "<" must become "&lt;".
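Putting the pieces so far together, a very small filter might look like the following sketch. It is only an illustration: the names text_to_html and run_filter are hypothetical, and it assumes the source file is UTF-8 plain text, where a real filter would call a format-specific extractor. It uses html.escape from the Python standard library, which also escapes ">" in addition to the two characters mentioned above.

```python
import html
import sys


def text_to_html(text):
    """Wrap already-extracted text in a minimal HTML document."""
    # quote=False escapes &, < and > only; quotes are harmless in body text.
    body = html.escape(text, quote=False)
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
        "</head>\n"
        "<body>" + body + "</body></html>\n"
    )


def run_filter(path, out=None):
    """Read the source file at `path` and write HTML to a binary stream."""
    if out is None:
        out = sys.stdout.buffer
    # Assumption for this sketch: the input is UTF-8 plain text. A real
    # filter would run a format-specific text extractor here instead.
    with open(path, encoding="utf-8", errors="replace") as src:
        text = src.read()
    # Encode explicitly so the bytes match the charset declared in the header.
    out.write(text_to_html(text).encode("utf-8"))
```

A filter script built on this sketch would end with a call such as run_filter(sys.argv[1]), since the source file name arrives as the single command-line argument and the result goes to stdout.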

The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.

Recoll will also make use of other header fields if they are present: title, description, keywords.

As of Recoll release 1.9, filters can also "invent" field names. These should be output as meta tags:

<meta name="somefield" content="Some textual data" />

In this case, a correspondence between the field name and a Xapian prefix must also be added to the mimeconf file. See the existing entries for inspiration. The field can then be used inside the query language to narrow searches.
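The meta tag for such a custom field can be generated as in the sketch below. The helper name meta_tag is hypothetical. Because the values sit inside double-quoted attributes, the example also escapes quote characters, which goes slightly beyond the escaping required for body text.

```python
import html


def meta_tag(name, content):
    """Return a <meta> element carrying a custom field."""
    # Attribute values are double-quoted, so quote=True is used to
    # escape quote characters in addition to &, < and >.
    return '<meta name="{}" content="{}" />'.format(
        html.escape(name, quote=True), html.escape(content, quote=True)
    )
```

Calling meta_tag("somefield", "Some textual data") reproduces the example tag shown above. The returned line belongs in the head section of the filter output.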

The easiest way to write a new filter is probably to start from an existing one.