3.9. External parsers

The DataparkSearch indexer can use external parsers to index various file types (mime types).

A parser is an executable program that converts one of the mime types to text/plain or text/html. For example, if you have PostScript files, you can use the ps2ascii parser (filter), which reads a PostScript file from stdin and writes ASCII text to stdout.

3.9.1. Supported parser types

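Indexer supports four types of parsers that can:

  * read data from stdin and send the result to stdout

  * read data from a file and send the result to stdout

  * read data from a file and send the result to a file

  * read data from stdin and send the result to a file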

3.9.2. Setting up parsers

  1. Configure mime types

    Configure your web server to send the appropriate "Content-Type" header. For Apache, have a look at the mime.types file; most mime types are already defined there.

    If you want to index local files or files fetched via FTP, use the "AddType" command in indexer.conf to associate file name extensions with their mime types. For example:

    
AddType text/html *.html
    

  2. Add parsers

    Add lines with parser definitions. Each line has the following format, with three arguments:

    
Mime <from_mime> <to_mime> <command line>
    

    For example, the following line defines a parser for man pages:

    
# Use deroff for parsing man pages ( *.man )
Mime  application/x-troff-man   text/plain   deroff
    

    This parser reads data from stdin and writes the result to stdout.

    Many parsers cannot operate on stdin and require a file to read from. In this case indexer creates a temporary file in /tmp and removes it when the parser exits. Use the $1 macro in the parser command line to substitute the file name. For example, a Mime command for the "catdoc" MS Word to ASCII converter may look like this:

    
Mime application/msword text/plain "/usr/bin/catdoc -a $1"
    

    If your parser writes its result to an output file, use the $2 macro. indexer replaces $2 with a temporary file name, starts the parser, reads the result from this temporary file, and then removes it. For example:

    
Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"
    

    The parser above reads data from the first temporary file and writes the result to the second one. Both temporary files are removed when the parser exits. Note that this parser produces exactly the same result as the previous one; the two differ only in execution mode: file->stdout and file->file, respectively.
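
Before adding a file->file command to indexer.conf, it can help to try it by hand. The following is a rough shell imitation of that mode, not indexer's actual implementation: run_file_to_file is a hypothetical helper, mktemp is assumed to be available, and tr is only a stand-in for a real converter such as catdoc.

```shell
#!/bin/sh
# Rough imitation of the file -> file mode for trying a parser by hand.
# run_file_to_file is a hypothetical helper, not part of DataparkSearch.
# Usage: run_file_to_file 'command using $1 and $2' < document
run_file_to_file() {
    cmd=$1
    in_tmp=$(mktemp) || return 1
    out_tmp=$(mktemp) || { rm -f "$in_tmp"; return 1; }
    cat > "$in_tmp"                      # document body arrives on stdin
    # Run the command with $1 and $2 bound to the two temporary files.
    sh -c "$cmd" parser "$in_tmp" "$out_tmp"
    status=$?
    cat "$out_tmp"                       # read the result back
    rm -f "$in_tmp" "$out_tmp"           # remove both temporary files
    return $status
}

# Stand-in converter: strip carriage returns from the input file.
printf 'hello\r\n' | run_file_to_file 'tr -d "\r" < "$1" > "$2"'
```

The single quotes around the command keep the calling shell from expanding $1 and $2, so they are only bound to the temporary file names when the command runs.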

3.9.3. Avoid indexer hang on parser execution

To avoid an indexer hang during parser execution, you can limit the time allowed for parser execution, in seconds, with the ParserTimeOut command in your indexer.conf. For example:


ParserTimeOut 600

The default value is 300 seconds, i.e. 5 minutes.

3.9.4. Pipes in parser's command line

You can use pipes in a parser's command line. For example, the following lines are useful for indexing gzipped man pages from a local disk:


AddType  application/x-gzipped-man  *.1.gz *.2.gz *.3.gz *.4.gz
Mime     application/x-gzipped-man  text/plain  "zcat | deroff"
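
A piped command line like this can be checked from a shell first, since it is an ordinary stdin->stdout pipeline. The sketch below makes a few assumptions: strip_requests is a hypothetical stand-in for deroff (which may not be installed everywhere), and gzip -dc is used instead of zcat because on some systems zcat handles only .Z files.

```shell
#!/bin/sh
# Manual check of a stdin -> stdout pipeline before putting it in a Mime line.
# strip_requests is a hypothetical stand-in for deroff: it only drops troff
# request lines (those starting with a dot) and does far less than deroff.
strip_requests() {
    grep -v '^\.'
}

# Compress a tiny man-page-like sample, then feed it through the pipeline
# on stdin, the way a gzipped document would reach the parser.
printf '.TH DEMO 1\nplain text\n' | gzip -c | gzip -dc | strip_requests
```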

3.9.5. Charsets and parsers

Some parsers produce output in a charset other than the one given in the LocalCharset command. Specify the parser's output charset so that indexer converts the output to the proper one. For example, if your catdoc is configured to produce output in the windows-1251 charset but LocalCharset is koi8-r, use this command for parsing MS Word documents:


Mime  application/msword  "text/plain; charset=windows-1251" "catdoc -a $1"

3.9.6. DPS_URL environment variable

When executing a parser, indexer sets the DPS_URL environment variable to the URL being processed. You can use this variable in parser scripts.
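
As an illustration, here is a minimal sketch of a stdin->stdout parser body that reads DPS_URL. The function name add_url_header, the "Source:" line, and the "unknown" fallback are assumptions made for this example; only the DPS_URL variable itself comes from indexer.

```shell
#!/bin/sh
# Hypothetical stdin -> stdout parser body: prepends the URL being indexed
# to the document text. DPS_URL is set by indexer; the "unknown" fallback
# only applies when the function is run by hand without it.
add_url_header() {
    printf 'Source: %s\n' "${DPS_URL:-unknown}"
    cat    # pass the document from stdin through unchanged
}

# Manual check outside indexer: set DPS_URL by hand and feed text on stdin.
printf 'body text\n' | DPS_URL='http://example.com/doc.txt' add_url_header
```

In a real parser script, the manual check on the last line would be replaced by a plain call to add_url_header, so that the document supplied by indexer on stdin is passed through.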

3.9.7. Some third-party parsers