PCRE(3)                                                 PCRE(3)





NAME
       PCRE - Perl-compatible regular expressions

INTRODUCTION

       The  PCRE  library  is a set of functions that implement
       regular expression pattern matching using the same  syn-
       tax  and semantics as Perl, with just a few differences.
       The current implementation of PCRE (release 5.x)  corre-
       sponds  approximately  with  Perl 5.8, including support
       for UTF-8 encoded strings and Unicode  general  category
       properties.  However,  this support has to be explicitly
       enabled; it is not the default.

       PCRE is written in C and released as a C library. A num-
       ber  of  people  have written wrappers and interfaces of
       various kinds. A C++ class is among these contributions,
       which are in the Contrib directory at the primary FTP
       site:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl  regular  expression  fea-
       tures  are  and  are  not supported by PCRE are given in
       separate documents. See the pcrepattern  and  pcrecompat
       pages.

       Some  features  of  PCRE  can  be included, excluded, or
       changed when the library  is  built.  The  pcre_config()
       function  makes  it  possible  for  a client to discover
       which features are available.  The  features  themselves
       are described in the pcrebuild page. Documentation about
       building PCRE for various operating systems can be found
       in the README file in the source distribution.

USER DOCUMENTATION

       The  user  documentation  for PCRE comprises a number of
       different sections. In the "man" format, each  of  these
       is  a separate "man page". In the HTML format, each is a
       separate page, linked from the index page. In the  plain
       text format, all the sections are concatenated, for ease
       of searching. The sections are as follows:

         pcre              this document
         pcreapi           details of PCRE's native API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcregrep          description of the pcregrep command
         pcrepartial       details of the partial matching
                             facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible API
         pcreprecompile    details of saving and re-using
                             precompiled patterns
         pcresample        discussion of the sample program
         pcretest          description of the pcretest testing
                             command

       In  addition,  in the "man" and HTML formats, there is a
       short  page  for  each  library  function,  listing  its
       arguments and results.

LIMITATIONS

       There  are some size limitations in PCRE but it is hoped
       that they will never in practice be relevant.

       The maximum length of a compiled pattern is 65539  (sic)
       bytes  if  PCRE  is  compiled  with the default internal
       linkage size of  2.  If  you  want  to  process  regular
       expressions  that  are  truly  enormous, you can compile
       PCRE with an internal linkage size of 3 or  4  (see  the
       README file in the source distribution and the pcrebuild
       documentation for details). In these cases the limit  is
       substantially larger, but execution is slower.

       All values in repeating quantifiers must  be  less  than
       65536.   The  maximum number of capturing subpatterns is
       65535.

       There is no limit to the number of non-capturing subpat-
       terns,  but the maximum depth of nesting of all kinds of
       parenthesized subpattern,  including  capturing  subpat-
       terns,  assertions,  and  other  types of subpattern, is
       200.
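
       As an illustration of the nesting limit, the following C
       sketch estimates the parenthesis depth of a pattern before
       it is handed to the library. It is not PCRE's own parser:
       the names max_paren_depth(), within_nesting_limit() and
       MAX_NESTING_DEPTH are hypothetical, and the scan is
       simplified (a backslash escapes the next character, the
       contents of [...] are treated as literal, and \Q...\E
       quoting is not handled).

```c
#include <assert.h>
#include <stddef.h>

/* Nesting limit quoted in the text above; the macro name is hypothetical. */
#define MAX_NESTING_DEPTH 200

/* Hypothetical helper, not part of PCRE. Returns the deepest level of
   parenthesis nesting in a pattern, or -1 if parentheses are unbalanced. */
int max_paren_depth(const char *pattern)
{
    int depth = 0, max_depth = 0;
    int in_class = 0;
    size_t i;

    assert(pattern != NULL);
    for (i = 0; pattern[i] != '\0'; i++) {
        char c = pattern[i];
        if (c == '\\') {
            if (pattern[i + 1] == '\0') break;  /* trailing backslash */
            i++;                                /* skip the escaped character */
        } else if (in_class) {
            if (c == ']') in_class = 0;         /* end of character class */
        } else if (c == '[') {
            in_class = 1;
            /* A ']' directly after '[' or '[^' is a literal class member. */
            if (pattern[i + 1] == '^') i++;
            if (pattern[i + 1] == ']') i++;
        } else if (c == '(') {
            if (++depth > max_depth) max_depth = depth;
        } else if (c == ')') {
            if (--depth < 0) return -1;         /* ')' without a matching '(' */
        }
    }
    return depth == 0 ? max_depth : -1;
}

/* Returns 1 if the pattern is balanced and within the limit, else 0. */
int within_nesting_limit(const char *pattern)
{
    int depth = max_paren_depth(pattern);
    return depth >= 0 && depth <= MAX_NESTING_DEPTH;
}
```

       A check of this kind can only give an early warning; the
       library's own compile-time diagnosis remains authoritative.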

       The maximum length of a subject string  is  the  largest
       positive  number that an integer variable can hold. How-
       ever, PCRE uses  recursion  to  handle  subpatterns  and
       indefinite  repetition.  This  means  that the available
       stack space may limit the size of a subject string  that
       can be processed by certain patterns.


UTF-8 AND UNICODE PROPERTY SUPPORT

       From  release 3.3, PCRE has had some support for charac-
       ter strings encoded in the UTF-8 format. For release 4.0
       this  was greatly extended to cover most common require-
       ments, and in release 5.0 additional support for Unicode
       general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to
       include UTF-8 support in the code, and, in addition, you
       must call pcre_compile() with the PCRE_UTF8 option flag.
       When you do this,  both  the  pattern  and  any  subject
       strings that are matched against it are treated as UTF-8
       strings instead of just strings of bytes.
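
       The difference between a string of bytes and a string of
       UTF-8 characters can be seen in a small C sketch. The
       function utf8_char_count() is a hypothetical helper, not a
       PCRE function, and it assumes its input is already valid
       UTF-8: it counts every byte that is not a continuation
       byte (bit pattern 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Counts the characters in a
   NUL-terminated string that is assumed to be valid UTF-8, by counting
   the bytes that start a character (everything except 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

       For the five-byte string "caf\xC3\xA9", strlen() reports
       5 bytes while utf8_char_count() reports 4 characters,
       which is the unit that matching works in when PCRE_UTF8
       is set.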

       If you compile PCRE with UTF-8 support, but do  not  use
       it  at  run  time, the library will be a bit bigger, but
       the additional run time overhead is limited  to  testing
       the  PCRE_UTF8  flag in several places, so it should not
       be very large.

       If PCRE is built with Unicode character property support
       (which  implies  UTF-8  support),  the  escape sequences
       \p{..}, \P{..}, and \X  are  supported.   The  available
       properties that can be tested are limited to the general
       category properties such as Lu for an upper case  letter
       or  Nd for a decimal number. A full list is given in the
       pcrepattern documentation. The PCRE library is increased
       in  size  by  about 90K when Unicode property support is
       included.

       The following comments apply when  PCRE  is  running  in
       UTF-8 mode:

       1.  When  you set the PCRE_UTF8 flag, the strings passed
       as patterns and subjects are  checked  for  validity  on
       entry  to  the  relevant  functions. If an invalid UTF-8
       string is passed, an error return is given. In some sit-
       uations,  you  may  already  know  that your strings are
       valid, and therefore want to skip these checks in  order
       to    improve    performance.    If    you    set    the
       PCRE_NO_UTF8_CHECK flag at compile time or at run  time,
       PCRE  assumes  that  the  pattern or subject it is given
       (respectively) contains only valid UTF-8 codes. In  this
       case,  it  does not diagnose an invalid UTF-8 string. If
       you  pass  an  invalid  UTF-8  string   to   PCRE   when
       PCRE_NO_UTF8_CHECK  is  set,  the results are undefined.
       Your program may crash.

       2. In a pattern, the escape sequence \x{...}, where the
       contents of the braces are a string of hexadecimal digits,
       is interpreted as a UTF-8 character whose code number is
       the given hexadecimal number, for example: \x{1234}. If a
       non-hexadecimal digit appears between the braces, the item
       is not recognized. This escape sequence can be used either
       as a literal, or within a character class.

       3.  The  original  hexadecimal  escape  sequence,  \xhh,
       matches a two-byte  UTF-8  character  if  the  value  is
       greater than 127.

       4.  Repeat  quantifiers  apply to complete UTF-8 charac-
       ters, not to individual bytes, for example:  \x{100}{3}.

       5.  The  dot  metacharacter  matches one UTF-8 character
       instead of a single byte.

       6. The escape sequence \C can be used to match a  single
       byte in UTF-8 mode, but it can split a multibyte charac-
       ter, so the rest of the subject may then start partway
       through a character.

       7. The character escapes \b, \B, \d, \D, \s, \S, \w, and
       \W  correctly test characters of any code value, but the
       characters that PCRE recognizes as  digits,  spaces,  or
       word  characters remain the same set as before, all with
       values less than 256. This remains true even  when  PCRE
       includes  Unicode property support, because to do other-
       wise would slow down PCRE in many common cases.  If  you
       really  want to test for a wider sense of, say, "digit",
       you must use Unicode property tests such as \p{Nd}.

       8. Similarly, characters  that  match  the  POSIX  named
       character classes are all low-valued characters.

       9.  Case-insensitive matching applies only to characters
       whose values are less than 128,  unless  PCRE  is  built
       with  Unicode  property support. Even when Unicode prop-
       erty support is available, PCRE still uses its own char-
       acter  tables when checking the case of low-valued char-
       acters, so as not to degrade performance.   The  Unicode
       property  information  is  used only for characters with
       higher values.
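
       The byte lengths mentioned in comments 2 to 4 can be
       checked with a short C sketch of UTF-8 encoding. The
       function utf8_encode() is a hypothetical helper, not part
       of PCRE, and it follows the four-byte form of UTF-8
       (rejecting surrogates and values above U+10FFFF), which is
       an assumption rather than a statement about what any PCRE
       release accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Encodes one code point as UTF-8
   into out, which must have room for 4 bytes. Returns the number of bytes
   written, or 0 for a surrogate or a value above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by 3 more */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

       Under these assumptions the character written as \x{1234}
       occupies three bytes, any value from 128 to 255 (as in
       \xhh) occupies two, and \x{100}{3} therefore matches three
       characters that span six bytes of the subject.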

AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       Cambridge CB2 3QG, England.
       Phone: +44 1223 334714

Last updated: 09 September 2004
Copyright (c) 1997-2004 University of Cambridge.



