TRE API reference manual

The regcomp() functions

#include <tre/regex.h>

int regcomp(regex_t *preg, const char *regex, int cflags);
int regncomp(regex_t *preg, const char *regex, size_t len, int cflags);
int regwcomp(regex_t *preg, const wchar_t *regex, int cflags);
int regwncomp(regex_t *preg, const wchar_t *regex, size_t len, int cflags);

The regcomp() function compiles the regex string pointed to by regex to an internal representation and stores the result in the pattern buffer structure pointed to by preg. The regncomp() function is like regcomp(), but regex is not terminated with the null byte. Instead, the len argument is used to give the length of the string, and the string may contain null bytes. The regwcomp() and regwncomp() functions work like regcomp() and regncomp(), respectively, but take a wide character (wchar_t) string instead of a byte string.

The cflags argument is the bitwise inclusive OR of zero or more of the following flags (defined in the header <tre/regex.h>):

REG_EXTENDED
Use POSIX Extended Regular Expression (ERE) compatible syntax when compiling regex. The default syntax is the POSIX Basic Regular Expression (BRE) syntax, but it is considered obsolete.
REG_ICASE
Ignore case. Subsequent searches with the regexec family of functions using this pattern buffer will be case insensitive.
REG_NOSUB
Do not report submatches. Subsequent searches with the regexec family of functions will only report whether a match was found or not and will not fill the submatch array.
REG_NEWLINE
Normally the newline character is treated as an ordinary character. When this flag is used, the newline character ('\n', ASCII code 10) is treated specially as follows:
  1. The match-any-character operator (dot "." outside a bracket expression) does not match a newline.
  2. A non-matching list ([^...]) not containing a newline does not match a newline.
  3. The match-beginning-of-line operator ^ matches the empty string immediately after a newline as well as the empty string at the beginning of the string (but see the REG_NOTBOL regexec() flag below).
  4. The match-end-of-line operator $ matches the empty string immediately before a newline as well as the empty string at the end of the string (but see the REG_NOTEOL regexec() flag below).
REG_LITERAL
Interpret the entire regex argument as a literal string, that is, all characters will be considered ordinary. This is a nonstandard extension, compatible with but not specified by POSIX.
REG_NOSPEC
Same as REG_LITERAL. This flag is provided for compatibility with BSD.
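
The effect of REG_NEWLINE on the anchors and on the dot operator can be checked with a small helper. The following is a minimal sketch, not part of TRE: matches() is a hypothetical name, regfree() is the standard POSIX cleanup call (not described in this section), and the block includes the system <regex.h> so that it builds without TRE installed. With TRE the include would be <tre/regex.h> as in the synopsis.

```c
#include <regex.h>   /* with TRE: <tre/regex.h> */
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` (compiled with `cflags`)
   matches somewhere in `string`, 0 if it does not, -1 if it fails to compile. */
static int matches(const char *pattern, int cflags, const char *string)
{
    regex_t preg;
    int rc;

    /* REG_NOSUB: only the yes/no answer is needed here. */
    if (regcomp(&preg, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&preg, string, 0, NULL, 0);
    regfree(&preg);
    return rc == 0;
}
```

With this helper, matches("^b$", REG_EXTENDED, "a\nb\nc") should report no match, while adding REG_NEWLINE to the flags should make the same call succeed, because ^ and $ then also match around the embedded newlines.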

The regex_t structure has the following fields that the application can read:

size_t re_nsub
Number of parenthesized subexpressions in regex.
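
As an illustration of compiling a pattern and reading re_nsub, consider the sketch below. It is only an example under stated assumptions: count_subexpressions() is a hypothetical helper, regfree() is the standard POSIX call for releasing a compiled pattern, and the system <regex.h> is included so the block builds without TRE (with TRE, include <tre/regex.h> instead).

```c
#include <regex.h>   /* with TRE: <tre/regex.h> */

/* Hypothetical helper: compile `pattern` as a case-insensitive ERE and
   return the number of parenthesized subexpressions, or -1 on error. */
static long count_subexpressions(const char *pattern)
{
    regex_t preg;
    long nsub;

    if (regcomp(&preg, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;
    nsub = (long)preg.re_nsub;   /* field filled in by regcomp() */
    regfree(&preg);              /* release the pattern buffer */
    return nsub;
}
```

A caller that wants every submatch reported would typically size its pmatch array to re_nsub + 1 elements, since element 0 holds the whole match under POSIX.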

The regcomp() functions return zero if the compilation was successful, or one of the following error codes if there was an error:

REG_BADPAT
Invalid regexp. TRE returns this only if a multibyte character set is used in the current locale, and regex contained an invalid multibyte sequence.
REG_ECOLLATE
Invalid collating element referenced. TRE returns this whenever equivalence classes or multicharacter collating elements are used in bracket expressions (they are not supported yet).
REG_ECTYPE
Unknown character class name in [[:name:]].
REG_EESCAPE
The last character of regex was a backslash (\).
REG_ESUBREG
Invalid back reference; number in \digit invalid.
REG_EBRACK
[] imbalance.
REG_EPAREN
\(\) or () imbalance.
REG_EBRACE
\{\} or {} imbalance.
REG_BADBR
{} content invalid: not a number, more than two numbers, first larger than second, or number too large.
REG_ERANGE
Invalid character range, e.g. ending point is earlier in the collating order than the starting point.
REG_ESPACE
Out of memory.
REG_BADRPT
Invalid use of repetition operator. TRE never returns this.
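
A caller will usually want to turn a nonzero return value into a readable message. The sketch below shows one way to do that. It assumes the standard POSIX regerror() and regfree() functions, which are not described in this section, and uses the hypothetical helper name compile_checked(). The system <regex.h> is included so the block builds without TRE.

```c
#include <regex.h>   /* with TRE: <tre/regex.h> */
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as an ERE. On failure, store a
   human-readable message in errbuf and return the regcomp() error code.
   On success, store an empty string and return zero. */
static int compile_checked(const char *pattern, char *errbuf, size_t errbuf_size)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc != 0) {
        /* regerror() maps the code to a message, truncating if needed. */
        regerror(rc, &preg, errbuf, errbuf_size);
        return rc;
    }
    if (errbuf_size > 0)
        errbuf[0] = '\0';
    regfree(&preg);
    return 0;
}
```

According to the list above, a pattern such as "(abc" would be expected to produce REG_EPAREN, and "[abc" would be expected to produce REG_EBRACK.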

The regexec() functions

#include <tre/regex.h>

int regexec(const regex_t *preg, const char *string, size_t nmatch,
            regmatch_t pmatch[], int eflags);
int regnexec(const regex_t *preg, const char *string, size_t len,
             size_t nmatch, regmatch_t pmatch[], int eflags);
int regwexec(const regex_t *preg, const wchar_t *string, size_t nmatch,
             regmatch_t pmatch[], int eflags);
int regwnexec(const regex_t *preg, const wchar_t *string, size_t len,
              size_t nmatch, regmatch_t pmatch[], int eflags);

The regexec() function matches the null-terminated string against the compiled regexp preg, initialized by a previous call to any one of the regcomp functions. The regnexec() function is like regexec(), but string is not terminated with a null byte. Instead, the len argument is used to give the length of the string, and the string may contain null bytes. The regwexec() and regwnexec() functions work like regexec() and regnexec(), respectively, but take a wide character (wchar_t) string instead of a byte string. The eflags argument is a bitwise OR of zero or more of the following flags:

REG_NOTBOL
When this flag is used, the match-beginning-of-line operator ^ does not match the empty string at the beginning of string. If REG_NEWLINE was used when compiling preg, the empty string immediately after a newline character will still be matched.
REG_NOTEOL
When this flag is used, the match-end-of-line operator $ does not match the empty string at the end of string. If REG_NEWLINE was used when compiling preg, the empty string immediately before a newline character will still be matched.

These flags are useful when different portions of a string are passed to regexec and the beginning or end of the partial string should not be interpreted as the beginning or end of a line.
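
A typical case is a loop that finds successive matches by calling regexec() on the remainder of the string. The sketch below is one possible way to write such a loop, not a TRE facility: count_matches() is a hypothetical helper, regfree() is the standard POSIX cleanup call, and the system <regex.h> is included so the block builds without TRE.

```c
#include <regex.h>   /* with TRE: <tre/regex.h> */

/* Hypothetical helper: count non-overlapping matches of `pattern` in
   `string`, or return -1 if the pattern does not compile. */
static int count_matches(const char *pattern, const char *string)
{
    regex_t preg;
    regmatch_t m;
    const char *p = string;
    int count = 0;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Only the first call starts at the real beginning of the line;
       later calls pass REG_NOTBOL so that ^ cannot match at p. */
    while (regexec(&preg, p, 1, &m, p == string ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m.rm_eo > 0)
            p += m.rm_eo;        /* continue after this match */
        else if (*p != '\0')
            p++;                 /* empty match: step one byte forward */
        else
            break;               /* empty match at the end of the string */
    }
    regfree(&preg);
    return count;
}
```

Without REG_NOTBOL, the pattern "^ab" would be found again at the start of every remaining portion of "ababab". With the flag, only the occurrence at the true beginning of the string is counted.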

If REG_NOSUB was used when compiling preg, nmatch is zero, or pmatch is NULL, then the pmatch argument is ignored. Otherwise, the submatches corresponding to the parenthesized subexpressions are filled in the elements of pmatch, which must be dimensioned to have at least nmatch elements.

The regmatch_t structure contains at least the following fields:

regoff_t rm_so
Byte offset from start of string to start of substring.
regoff_t rm_eo
Byte offset from start of string to the first character after the substring.

The length of a submatch in bytes can be computed by subtracting rm_so from rm_eo. If a parenthesized subexpression did not participate in a match, the rm_so and rm_eo fields for the corresponding pmatch element are set to -1. When a multibyte character set is in effect, the submatch offsets are given as byte offsets, not character offsets.

The regexec() functions return zero if a match was found, otherwise they return REG_NOMATCH to indicate no match, or REG_ESPACE to indicate that enough temporary memory could not be allocated to complete the matching operation.
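
The handling of pmatch can be illustrated with a short sketch. The helper submatch_length() and the constant MAX_GROUPS are hypothetical names chosen for this example. The sketch relies on the POSIX convention that pmatch[0] describes the whole match and pmatch[i] the i-th parenthesized subexpression. It uses the standard regfree() call and the system <regex.h> so that it builds without TRE.

```c
#include <regex.h>   /* with TRE: <tre/regex.h> */
#include <stddef.h>

#define MAX_GROUPS 10   /* arbitrary limit for this example */

/* Hypothetical helper: length in bytes of submatch `idx`, -1 if that
   subexpression did not participate, -2 on compile error or no match. */
static long submatch_length(const char *pattern, const char *string, size_t idx)
{
    regex_t preg;
    regmatch_t pmatch[MAX_GROUPS];
    long len = -2;

    if (idx >= MAX_GROUPS || regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -2;

    if (regexec(&preg, string, MAX_GROUPS, pmatch, 0) == 0) {
        if (pmatch[idx].rm_so == -1)
            len = -1;                                       /* unused group */
        else
            len = (long)(pmatch[idx].rm_eo - pmatch[idx].rm_so);
    }
    regfree(&preg);
    return len;
}
```

For the pattern "(a)|(b)" matched against "b", the first subexpression does not participate, so its offsets are -1, while the second has length 1.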

The approximate matching functions

#include <tre/regex.h>

typedef struct {
  int cost_ins;
  int cost_del;
  int cost_subst;
  int max_cost;

  int max_ins;
  int max_del;
  int max_subst;
  int max_err;
} regaparams_t;

typedef struct {
  size_t nmatch;
  regmatch_t *pmatch;
  int cost;
  int num_ins;
  int num_del;
  int num_subst;
} regamatch_t;

int regaexec(const regex_t *preg, const char *string,
             regamatch_t *match, regaparams_t params, int eflags);
int reganexec(const regex_t *preg, const char *string, size_t len,
              regamatch_t *match, regaparams_t params, int eflags);
int regawexec(const regex_t *preg, const wchar_t *string,
              regamatch_t *match, regaparams_t params, int eflags);
int regawnexec(const regex_t *preg, const wchar_t *string, size_t len,
               regamatch_t *match, regaparams_t params, int eflags);

The regaexec() function searches for the best match in string against the compiled regexp preg, initialized by a previous call to any one of the regcomp functions.

The reganexec() function is like regaexec(), but string is not terminated by a null byte. Instead, the len argument is used to tell the length of the string, and the string may contain null bytes. The regawexec() and regawnexec() functions work like regaexec() and reganexec(), respectively, but take a wide character (wchar_t) string instead of a byte string.

The eflags argument has the same meaning as for the regexec() functions.

The params struct controls the approximate matching parameters:

int cost_ins
The default cost of an inserted character, that is, an extra character in string.
int cost_del
The default cost of a deleted character, that is, a character missing from string.
int cost_subst
The default cost of a substituted character.
int max_cost
The maximum allowed cost of a match. If this is set to zero, only exact matches are searched for, and the results are equivalent to those returned by the regexec() functions.
int max_ins
Maximum allowed number of inserted characters.
int max_del
Maximum allowed number of deleted characters.
int max_subst
Maximum allowed number of substituted characters.
int max_err
Maximum allowed number of errors (inserts + deletes + substitutes).

The match argument points to a regamatch_t structure. The nmatch and pmatch fields must be filled in by the caller. If REG_NOSUB was used when compiling the regexp, or match->nmatch is zero, or match->pmatch is NULL, the match->pmatch field is ignored. Otherwise, the submatches corresponding to the parenthesized subexpressions are filled in the elements of match->pmatch, which must be dimensioned to have at least match->nmatch elements. The match->cost field is set to the cost of the match found, and the match->num_ins, match->num_del, and match->num_subst fields are set to the number of inserts, deletes, and substitutes in the match, respectively.

The regaexec() functions return zero if a match with a cost not exceeding params.max_cost was found, otherwise they return REG_NOMATCH to indicate no match, or REG_ESPACE to indicate that enough temporary memory could not be allocated to complete the matching operation.
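
How the cost and limit fields interact can be sketched in plain C. The block below does not call TRE. It uses local stand-in types (struct sketch_params and struct sketch_counts, both hypothetical) that mirror the fields of regaparams_t and regamatch_t, and it assumes that the cost of a match is the weighted sum of its edit counts and that all limits are inclusive.

```c
#include <stdbool.h>

/* Hypothetical stand-ins mirroring the regaparams_t and regamatch_t fields. */
struct sketch_params {
    int cost_ins, cost_del, cost_subst, max_cost;
    int max_ins, max_del, max_subst, max_err;
};
struct sketch_counts {
    int num_ins, num_del, num_subst;
};

/* Assumed cost model: each kind of edit is weighted by its cost. */
static int sketch_cost(const struct sketch_params *p,
                       const struct sketch_counts *c)
{
    return p->cost_ins * c->num_ins
         + p->cost_del * c->num_del
         + p->cost_subst * c->num_subst;
}

/* True if a candidate match with these edit counts respects every limit. */
static bool sketch_within_limits(const struct sketch_params *p,
                                 const struct sketch_counts *c)
{
    int errors = c->num_ins + c->num_del + c->num_subst;

    return sketch_cost(p, c) <= p->max_cost
        && c->num_ins <= p->max_ins
        && c->num_del <= p->max_del
        && c->num_subst <= p->max_subst
        && errors <= p->max_err;
}
```

Under these assumptions, with costs of 1 for inserts and deletes, 2 for substitutes, and max_cost set to 3, one insert plus one substitute is acceptable (cost 3), while two substitutes are not (cost 4).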

Checking build time options

#include <tre/regex.h>

char *tre_version(void);
int tre_config(int query, void *result);

The tre_config() function can be used to find out which optional features have been compiled into the TRE library, and to retrieve other parameters that may change between releases.

The query argument is an integer that selects the information requested. The result argument is a pointer to a variable in which the information is returned. The return value of a call to tre_config() is zero if query was recognized, REG_NOMATCH otherwise.

The following values are recognized for query:

TRE_CONFIG_APPROX
The result is an integer that is set to one if approximate matching support is available, zero if not.
TRE_CONFIG_WCHAR
The result is an integer that is set to one if wide character support is available, zero if not.
TRE_CONFIG_MULTIBYTE
The result is an integer that is set to one if multibyte character set support is available, zero if not.
TRE_CONFIG_SYSTEM_ABI
The result is an integer that is set to one if TRE has been compiled to be compatible with the system regex ABI, zero if not.
TRE_CONFIG_VERSION
The result is a pointer to a static character string that gives the version of the TRE library.

The tre_version() function returns a character string that gives the version of the TRE library.

Preprocessor definitions

The header <tre/regex.h> defines certain C preprocessor symbols.

Version information

The following definitions may be useful for checking whether a new enough version is being used. Note that it is recommended to use the pkg-config tool for version and other checks in Autoconf scripts.

TRE_VERSION
The version string.
TRE_VERSION_1
The major version number (first part of version string).
TRE_VERSION_2
The minor version number (second part of version string).
TRE_VERSION_3
The micro version number (third part of version string).
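
A version comparison based on these three numbers might look like the sketch below. The function version_at_least() is a hypothetical helper, not part of TRE. It takes the version numbers as plain integers so that the block builds without the TRE header.

```c
/* Hypothetical helper: returns 1 if version major.minor.micro is at least
   want_major.want_minor.want_micro, 0 otherwise. */
static int version_at_least(int major, int minor, int micro,
                            int want_major, int want_minor, int want_micro)
{
    if (major != want_major)
        return major > want_major;   /* major number decides first */
    if (minor != want_minor)
        return minor > want_minor;   /* then the minor number */
    return micro >= want_micro;      /* finally the micro number */
}
```

With the TRE header included, the first three arguments would be TRE_VERSION_1, TRE_VERSION_2, and TRE_VERSION_3, and the last three the minimum version the application requires.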

Features

The following definitions may be useful for checking whether all necessary features are enabled. Use these only if compile-time checking suffices (that is, when linking statically with TRE). When linking dynamically, tre_config() should be used instead.

TRE_APPROX
This is defined if approximate matching support is enabled. The prototypes for approximate matching functions are defined only if TRE_APPROX is defined.
TRE_WCHAR
This is defined if wide character support is enabled. The prototypes for wide character matching functions are defined only if TRE_WCHAR is defined.
TRE_MULTIBYTE
This is defined if multibyte character set support is enabled. If this is not defined, any locale settings are ignored, and the default locale is used when parsing regexps and matching strings.