PCRE(3)                                                 PCRE(3)

NAME
       PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

       Some pattern constructs are more efficient than others.
       For example, a character class such as [aeiou] is more
       efficient than the equivalent set of alternatives
       (a|e|i|o|u). In general, the simplest construction that
       provides the required behaviour is usually the most
       efficient. Jeffrey Friedl's book "Mastering Regular
       Expressions" contains a lot of useful general discussion
       about optimizing regular expressions for efficient
       performance. This document contains a few observations
       about PCRE.
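
       The comparison between a character class and a set of
       alternatives can be tried out with a short script. This
       is a sketch only: it uses Python's re module as a
       stand-in for PCRE, so the timings it prints say nothing
       definite about PCRE itself, and the helper names
       (vowel_offsets, time_pattern) are invented for this
       illustration.

```python
import re
import timeit

# Two equivalent ways to match a single lowercase vowel.
CLASS_PATTERN = re.compile(r"[aeiou]")
ALTERNATION_PATTERN = re.compile(r"(a|e|i|o|u)")


def vowel_offsets(pattern, subject):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in pattern.finditer(subject)]


def time_pattern(pattern, subject, repeats=200):
    """Return the total seconds taken by `repeats` full scans of subject."""
    return timeit.timeit(lambda: vowel_offsets(pattern, subject), number=repeats)


if __name__ == "__main__":
    subject = "the quick brown fox jumps over the lazy dog " * 200
    # Both forms must find exactly the same matches before timing means anything.
    assert vowel_offsets(CLASS_PATTERN, subject) == vowel_offsets(
        ALTERNATION_PATTERN, subject
    )
    print("class      :", time_pattern(CLASS_PATTERN, subject))
    print("alternation:", time_pattern(ALTERNATION_PATTERN, subject))
```

       Checking that both forms return identical matches
       before timing them guards against comparing patterns
       that are not in fact equivalent.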

       Using Unicode character properties (the \p, \P,  and  \X
       escapes)  is  slow, because PCRE has to scan a structure
       that contains data for over fifteen thousand  characters
       whenever  it  needs  a  character's property. If you can
       find an alternative pattern that does not use  character
       properties, it will probably be faster.

       When  a pattern begins with .* not in parentheses, or in
       parentheses that are not the subject of a backreference,
       and  the  PCRE_DOTALL  option  is  set,  the  pattern is
       implicitly anchored by PCRE, since it can match only  at
       the  start  of a subject string. However, if PCRE_DOTALL
       is not set, PCRE cannot make this optimization,  because
       the  .  metacharacter does not then match a newline, and
       if the subject string contains newlines, the pattern may
       match  from  the  character immediately following one of
       them instead of from the very start.  For  example,  the
       pattern

         .*second

       matches the subject "first\nand second" (where \n stands
       for a newline character), with the match starting at the
       seventh  character.  In  order  to  do this, PCRE has to
       retry the match starting after every newline in the sub-
       ject.
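
       The example can be reproduced with a few lines of
       Python. This is an illustrative sketch that assumes
       Python's re module treats . and newlines the same way
       as PCRE does by default; the helper name describe_match
       is hypothetical.

```python
import re

SUBJECT = "first\nand second"


def describe_match(pattern, subject):
    """Return (start offset, matched text), or None if there is no match."""
    m = re.search(pattern, subject)
    if m is None:
        return None
    return m.start(), m.group(0)


if __name__ == "__main__":
    # Offsets are zero-based, so offset 6 is the seventh character.
    print(describe_match(r".*second", SUBJECT))
```

       The reported start is a zero-based offset, so a start
       of 6 corresponds to the seventh character, the one
       immediately after the newline.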

       If  you  are  using  such a pattern with subject strings
       that do not contain newlines, the  best  performance  is
       obtained by setting PCRE_DOTALL, or starting the pattern
       with ^.* to indicate explicit anchoring. That saves PCRE
       from having to scan along the subject looking for a new-
       line to restart at.
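
       The three forms of the pattern can be compared side by
       side. The sketch below again uses Python's re module,
       with re.DOTALL standing in for PCRE_DOTALL; that
       mapping and the helper names are assumptions made for
       the illustration.

```python
import re


def find_start(pattern, subject, flags=0):
    """Return the zero-based start offset of the first match, or None."""
    m = re.search(pattern, subject, flags)
    return None if m is None else m.start()


def variants(subject):
    """Return match starts for the plain, DOTALL and ^-anchored forms."""
    return (
        find_start(r".*second", subject),             # plain, dot stops at newline
        find_start(r".*second", subject, re.DOTALL),  # dot also matches newline
        find_start(r"^.*second", subject),            # explicitly anchored
    )


if __name__ == "__main__":
    print(variants("first and second"))    # subject without a newline
    print(variants("first\nand second"))   # subject with a newline
```

       On a subject without newlines all three forms agree.
       On a subject that does contain a newline they differ,
       so the faster forms are safe only when the subjects
       really are free of newlines.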

       Beware  of  patterns  that  contain  nested   indefinite
       repeats.  These can take a long time to run when applied
       to a string that does not match.  Consider  the  pattern
       fragment

         (a+)*

       This can match "aaaa" in 16 different ways, and this
       number increases very rapidly as the string gets longer.
       (The * repeat can match 0, 1, 2, 3, or 4 times, and for
       each of those cases other than 0 or 4, the + repeats can
       match different numbers of times.) When the remainder of
       the pattern is such that the entire match is going to
       fail, PCRE has in principle to try every possible
       variation, and this can take an extremely long time.
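
       The growth can be made concrete by enumerating the
       variations. The sketch below is not how PCRE works
       internally; it simply lists every way of dividing a
       matched prefix of the subject into non-empty runs, one
       run per iteration of the group, and counts the empty
       match as one further way. The exact total depends on
       what is counted as a distinct way, and the function
       names are invented for this illustration.

```python
def compositions(k):
    """Yield each way to split k characters into non-empty runs.

    Every result is a tuple of run lengths that sums to k; the empty
    tuple stands for zero iterations of the group.
    """
    if k == 0:
        yield ()
        return
    for first in range(1, k + 1):
        for rest in compositions(k - first):
            yield (first,) + rest


def ways_to_match(n):
    """List every way (a+)* can match some prefix of a string of n 'a's."""
    ways = []
    for prefix_len in range(n + 1):
        ways.extend(compositions(prefix_len))
    return ways


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, len(ways_to_match(n)))
```

       Under this way of counting, each additional character
       doubles the number of variations, which is why a
       failing match can become impractically slow on quite
       short subjects.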

       An optimization catches some of the simpler cases such
       as

         (a+)*b

       where a literal character follows. Before embarking on
       the standard matching procedure, PCRE checks that there
       is a "b" later in the subject string, and if there is
       not, it fails the match immediately. However, when there
       is no following literal this optimization cannot be
       used. You can see the difference by comparing the
       behaviour of

         (a+)*\d

       with the pattern above. When applied to a whole line of
       "a" characters, (a+)*b fails almost instantly, whereas
       (a+)*\d takes an appreciable time with strings longer
       than about 20 characters.
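
       The idea behind the literal check can be sketched in
       user code. This is an illustration of the principle
       only, not PCRE's implementation: the function name and
       its "required" argument are hypothetical, and Python's
       re module is used in place of PCRE.

```python
import re


def search_with_required_literal(pattern, required, subject):
    """Search subject, failing at once if a required literal is absent.

    `required` must be a literal string that every match of `pattern`
    is known to contain; the caller is responsible for that guarantee.
    """
    if required not in subject:
        # A cheap linear scan replaces the whole backtracking search.
        return None
    return re.search(pattern, subject)


if __name__ == "__main__":
    # Fails immediately: no "b" anywhere in the subject.
    print(search_with_required_literal(r"(a+)*b", "b", "a" * 40))
    # Normal match when the literal is present.
    print(search_with_required_literal(r"(a+)*b", "b", "aaab"))
```

       No such shortcut is available for (a+)*\d, because no
       single literal character is required by every match.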

       In many cases, the solution to this kind of  performance
       issue  is to use an atomic group or a possessive quanti-
       fier.
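
       As a sketch of the atomic group approach, the pattern
       (a+)*\d can be rewritten so that each run of "a"
       characters, once matched, is never divided up again.
       The example uses Python's re module rather than PCRE.
       It assumes that atomic groups are accepted from Python
       3.11 onwards; on older interpreters it falls back to a
       lookahead and backreference emulation. The names ATOMIC
       and atomic_search are invented for the illustration.

```python
import re
import sys

if sys.version_info >= (3, 11):
    # Atomic group: once a+ has matched, its characters are not given back.
    ATOMIC = r"(?>a+)*\d"
else:
    # Emulation: a lookahead is never backtracked into, so capturing
    # inside it and then consuming the capture behaves atomically.
    ATOMIC = r"(?:(?=(a+))\1)*\d"


def atomic_search(subject):
    """Search subject with the atomic form of (a+)*\\d."""
    return re.search(ATOMIC, subject)


if __name__ == "__main__":
    # Fails without trying every way of dividing up the run of "a"s.
    print(atomic_search("a" * 40))
    # Still matches when a digit follows.
    print(atomic_search("aaa7"))
```

       The rewritten pattern matches the same subjects as the
       original here, because a run of "a" characters can only
       be followed by something that is not an "a". It does
       not, however, capture the run in a numbered group.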

Last updated: 09 September 2004
Copyright (c) 1997-2004 University of Cambridge.