PCRETEST(1)                                         PCRETEST(1)





NAME
       pcretest - a program for testing Perl-compatible regular
       expressions.

SYNOPSIS

       pcretest [-C]  [-d]  [-i]  [-m]  [-o  osize]  [-p]  [-t]
       [source]
            [destination]

       pcretest was written as a test program for the PCRE reg-
       ular expression library itself, but it can also be  used
       for  experimenting  with regular expressions. This docu-
       ment describes the features of  the  test  program;  for
       details  of  the regular expressions themselves, see the
       pcrepattern  documentation.  For  details  of  the  PCRE
       library  function  calls  and  their  options,  see  the
       pcreapi documentation.

OPTIONS

       -C        Output the version number of the PCRE library,
                 and   all   available  information  about  the
                 optional features that are included, and  then
                 exit.

       -d        Behave  as  if  each  regex had the /D (debug)
                 modifier; the internal form  is  output  after
                 compilation.

       -i        Behave  as  if each regex had the /I modifier;
                 information  about  the  compiled  pattern  is
                 given after compilation.

       -m        Output the size of each compiled pattern after
                 it has been compiled. This  is  equivalent  to
                 adding /M to each regular expression. For com-
                 patibility with earlier versions of  pcretest,
                 -s is a synonym for -m.

       -o osize  Set  the number of elements in the output vec-
                 tor that is used when calling  pcre_exec()  to
                 be  osize.  The  default value is 45, which is
                 enough for 14  capturing  subexpressions.  The
                 vector  size  can  be  changed  for individual
                 matching calls by including  \O  in  the  data
                 line (see below).

       -p        Behave as if each regex had the /P modifier;
                 the POSIX wrapper API is used to call PCRE.
                 None of the other options has any effect when
                 -p is set.

       -t        Run each compile, study, and match many times
                 with a timer, and output the resulting time
                 per compile or match (in milliseconds). Do not
                 set -m with -t, because you will then get the
                 size output a zillion times, and the timing
                 will be distorted.
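
       As a rough illustration of the arithmetic behind the -o
       default, the following Python sketch assumes the layout
       described in the pcreapi documentation: pcre_exec() uses
       the first two-thirds of the vector for offset pairs, and
       the first pair describes the whole match. The function
       name is hypothetical and is not part of pcretest.

```python
def capturing_capacity(osize):
    """Capturing subexpressions whose offsets fit in a vector of osize ints."""
    pairs = osize // 3          # two-thirds of the vector holds (start, end) pairs
    return max(pairs - 1, 0)    # the first pair is reserved for the whole match


print(capturing_capacity(45))   # capacity for the default vector size
```

       Under that assumption, a pattern with more capturing
       subexpressions than this needs a larger value for -o or
       \O.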

DESCRIPTION

       If  pcretest  is  given two filename arguments, it reads
       from the first and writes to the second. If it is  given
       only  one filename argument, it reads from that file and
       writes to stdout. Otherwise, it  reads  from  stdin  and
       writes  to  stdout,  and prompts for each line of input,
       using "re>"  to  prompt  for  regular  expressions,  and
       "data>" to prompt for data lines.

       The  program  handles  any  number of sets of input on a
       single input  file.  Each  set  starts  with  a  regular
       expression,  and continues with any number of data lines
       to be matched against the pattern.

       Each data line is matched separately and  independently.
       If you want to do multiple-line matches, you have to use
       the \n escape sequence in a  single  line  of  input  to
       encode the newline characters. The maximum length of a
       data line is 30,000 characters.

       An empty line signals the end  of  the  data  lines,  at
       which  point a new regular expression is read. The regu-
       lar expressions are given enclosed in  any  non-alphanu-
       meric delimiters other than backslash, for example

         /(a|bc)x+yz/

       White  space  before the initial delimiter is ignored. A
       regular expression may be continued over  several  input
       lines, in which case the newline characters are included
       within it. It  is  possible  to  include  the  delimiter
       within the pattern by escaping it, for example

         /abc\/def/

       If  you do so, the escape and the delimiter form part of
       the  pattern,  but  since  delimiters  are  always  non-
       alphanumeric,  this  does not affect its interpretation.
       If the terminating delimiter is immediately followed  by
       a backslash, for example,

         /abc/\

       then  a  backslash  is  added to the end of the pattern.
       This is done to provide a way of testing the error  con-
       dition  that  arises  if a pattern finishes with a back-
       slash, because

         /abc\/

       is interpreted as the  first  line  of  a  pattern  that
       starts  with  "abc/",  causing pcretest to read the next
       line as a continuation of the regular expression.
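
       The delimiter rules above can be sketched in Python as
       follows. This is an illustration of the description, not
       pcretest's own code; the function name is hypothetical,
       and a return value of None stands for "closing delimiter
       not found, so a continuation line would be read".

```python
def split_pattern_line(line):
    """Split a line such as '/abc\\/def/i' into (pattern, modifiers)."""
    text = line.lstrip()                  # white space before the delimiter is ignored
    if not text:
        raise ValueError("no pattern on this line")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2                        # escape and escaped character both stay in the pattern
            continue
        if ch == delim:
            pattern, rest = text[1:i], text[i + 1:]
            if rest.startswith("\\"):     # delimiter immediately followed by a backslash
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest.strip()
        i += 1
    return None                           # pattern continues on the next input line


print(split_pattern_line(r"/abc\/def/"))
print(split_pattern_line("/abc/\\"))
print(split_pattern_line(r"/abc\/"))
```

       The three calls correspond to the three examples in the
       text: an escaped delimiter, a trailing backslash, and a
       pattern that is still open at the end of the line.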

PATTERN MODIFIERS

       A pattern may be followed by any  number  of  modifiers,
       which  are  mostly  single  characters.  Following  Perl
       usage, these are referred to below as, for example, "the
       /i  modifier",  even though the delimiter of the pattern
       need not always be a slash, and no slash  is  used  when
       writing  modifiers.  Whitespace  may  appear between the
       final pattern delimiter  and  the  first  modifier,  and
       between the modifiers themselves.

       The  /i, /m, /s, and /x modifiers set the PCRE_CASELESS,
       PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED options,
       respectively,  when pcre_compile() is called. These four
       modifier letters have the same  effect  as  they  do  in
       Perl. For example:

         /caseless/i

       The  following table shows additional modifiers for set-
       ting PCRE options that do not correspond to anything  in
       Perl:

         /A    PCRE_ANCHORED
         /C    PCRE_AUTO_CALLOUT
         /E    PCRE_DOLLAR_ENDONLY
         /N    PCRE_NO_AUTO_CAPTURE
         /U    PCRE_UNGREEDY
         /X    PCRE_EXTRA

       Searching  for  all possible matches within each subject
       string can be requested by the /g or /G modifier.  After
       finding  a  match,  PCRE  is  called again to search the
       remainder of the subject string. The difference  between
       /g  and /G is that the former uses the startoffset argu-
       ment to pcre_exec() to start searching at  a  new  point
       within  the  entire string (which is in effect what Perl
       does), whereas the latter passes over a  shortened  sub-
       string.  This makes a difference to the matching process
       if  the  pattern  begins  with  a  lookbehind  assertion
       (including \b or \B).
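
       The difference can be reproduced with Python's re module,
       used here only as a stand-in for PCRE. The sketch assumes
       that the pos argument of search() plays the role of
       startoffset (text before pos stays visible to \B), while
       slicing the subject imitates the shortened substring of
       /G. Both function names are hypothetical.

```python
import re


def matches_g(pattern, subject):
    """/g style: keep the whole subject and move the start offset."""
    rx, pos, found = re.compile(pattern), 0, []
    while pos <= len(subject):
        m = rx.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Step past an empty match so that the loop always terminates.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def matches_G(pattern, subject):
    """/G style: search again in the substring that follows each match."""
    rx, rest, found = re.compile(pattern), subject, []
    while True:
        m = rx.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]               # earlier text is no longer visible
    return found


print(matches_g(r"\Bi(\w\w)", "Mississippi"))
print(matches_G(r"\Bi(\w\w)", "Mississippi"))
```

       In the second function the substring left after the first
       match begins with "i", so \B cannot match at its start
       and one occurrence is missed.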

       If  any  call  to  pcre_exec()  in  a  /g or /G sequence
       matches an empty string, the next call is done with  the
       PCRE_NOTEMPTY  and  PCRE_ANCHORED  flags set in order to
       search for another, non-empty, match at the same  point.
       If this second match fails, the start offset is advanced
       by one, and the normal match is retried.  This  imitates
       the  way Perl handles such cases when using the /g modi-
       fier or the split() function.
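
       The retry logic for empty matches can be sketched as a
       loop around a matching function. In this Python
       illustration exec_fn is a hypothetical stand-in for
       pcre_exec() that accepts notempty and anchored flags, and
       match_x_star is a toy matcher for the pattern x* written
       only to exercise the loop.

```python
def global_matches(exec_fn, subject):
    """Collect (start, end) pairs the way a /g sequence is described above."""
    results, start, retry = [], 0, False
    while start <= len(subject):
        # After an empty match, retry at the same point, non-empty and anchored.
        m = exec_fn(subject, start, notempty=retry, anchored=retry)
        if m is None:
            if not retry:
                break                   # a normal match failed: no more matches
            start += 1                  # the special retry failed: advance by one
            retry = False
            continue
        results.append(m)
        start = m[1]
        retry = (m[0] == m[1])          # empty match: the next call is the special retry
    return results


def match_x_star(subject, start, notempty=False, anchored=False):
    """Toy matcher for x*: returns (start, end) or None."""
    pos = start
    while pos <= len(subject):
        end = pos
        while end < len(subject) and subject[end] == "x":
            end += 1
        if end > pos or not notempty:
            return (pos, end)
        if anchored:
            return None
        pos += 1
    return None


print(global_matches(match_x_star, "axxb"))
```

       With the subject "axxb" the loop reports an empty match
       before "a", the run "xx", and empty matches before "b"
       and at the end of the string.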

       There are yet more modifiers  for  controlling  the  way
       pcretest operates.

       The  /+ modifier requests that as well as outputting the
       substring that  matched  the  entire  pattern,  pcretest
       should  in  addition output the remainder of the subject
       string. This is useful for tests where the subject  con-
       tains multiple copies of the same substring.

       The /L modifier must be followed directly by the name of
       a locale, for example,

         /pattern/Lfr_FR

       For this reason, it must be the last modifier. The given
       locale  is  set,  pcre_maketables() is called to build a
       set of character tables for the locale, and this is then
       passed  to  pcre_compile()  when  compiling  the regular
       expression. Without an /L modifier, NULL  is  passed  as
       the  tables  pointer;  that  is,  /L applies only to the
       expression on which it appears.

       The /I modifier requests that pcretest  output  informa-
       tion about the compiled pattern (whether it is anchored,
       has a fixed first character, and so on). It does this by
       calling  pcre_fullinfo()  after  compiling a pattern. If
       the pattern is studied, the results  of  that  are  also
       output.

       The  /D modifier is a PCRE debugging feature, which also
       assumes /I.  It causes the  internal  form  of  compiled
       regular  expressions  to be output after compilation. If
       the pattern was studied,  the  information  returned  is
       also output.

       The  /F  modifier causes pcretest to flip the byte order
       of the fields  in  the  compiled  pattern  that  contain
       2-byte  and 4-byte numbers. This facility is for testing
       the feature in PCRE that allows it to  execute  patterns
       that  were  compiled  on a host with a different endian-
       ness. This feature  is  not  available  when  the  POSIX
       interface  to  PCRE  is being used, that is, when the /P
       pattern modifier is  specified.  See  also  the  section
       about saving and reloading compiled patterns below.
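
       For readers unfamiliar with byte-order flipping, the
       following Python sketch shows the operation on 2-byte and
       4-byte unsigned values. It is a generic illustration with
       hypothetical function names; it does not reproduce the
       layout of a compiled pattern or pcretest's own code.

```python
def flip16(value):
    """Reverse the byte order of an unsigned 2-byte value."""
    return int.from_bytes(value.to_bytes(2, "big"), "little")


def flip32(value):
    """Reverse the byte order of an unsigned 4-byte value."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


print(hex(flip16(0x1234)), hex(flip32(0x11223344)))
```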

       The  /S  modifier causes pcre_study() to be called after
       the expression has been compiled, and the  results  used
       when the expression is matched.

       The /M modifier causes the size of the memory block used
       to hold the compiled pattern to be output.

       The /P modifier causes pcretest to  call  PCRE  via  the
       POSIX  wrapper API rather than its native API. When this
       is done, all other modifiers except /i, /m, and  /+  are
       ignored. REG_ICASE is set if /i is present, and REG_NEW-
       LINE is set if /m  is  present.  The  wrapper  functions
       force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
       REG_NEWLINE is set.

       The /8 modifier causes pcretest to call  PCRE  with  the
       PCRE_UTF8  option  set.  This turns on support for UTF-8
       character handling in PCRE, provided that  it  was  com-
       piled  with  this  support  enabled.  This modifier also
       causes any non-printing characters in output strings  to
       be  printed  using  the  \x{hh...}  notation if they are
       valid UTF-8 sequences.

       If the /? modifier is used with /8, it  causes  pcretest
       to   call  pcre_compile()  with  the  PCRE_NO_UTF8_CHECK
       option, to suppress the checking of the string for UTF-8
       validity.

DATA LINES

       Before  each data line is passed to pcre_exec(), leading
       and trailing whitespace  is  removed,  and  it  is  then
       scanned for \ escapes. Some of these are pretty esoteric
       features, intended for checking out  some  of  the  more
       complicated  features  of  PCRE. If you are just testing
       "ordinary" regular expressions, you probably don't  need
       any of these. The following escapes are recognized:

         \a         alarm (= BEL)
         \b         backspace
         \e         escape
         \f         formfeed
         \n         newline
         \r         carriage return
         \t         tab
         \v         vertical tab
         \nnn       octal character (up to 3 octal digits)
         \xhh       hexadecimal character (up to 2 hex digits)
         \x{hh...}  hexadecimal character, any number of digits
                      in UTF-8 mode
         \A         pass the PCRE_ANCHORED option to
                      pcre_exec()
         \B         pass the PCRE_NOTBOL option to pcre_exec()
         \Cdd       call pcre_copy_substring() for substring dd
                      after a successful match (number less
                      than 32)
         \Cname     call pcre_copy_named_substring() for
                      substring "name" after a successful
                      match (name terminated by next
                      non-alphanumeric character)
         \C+        show the current captured substrings at
                      callout time
         \C-        do not supply a callout function
         \C!n       return 1 instead of 0 when callout number n
                      is reached
         \C!n!m     return 1 instead of 0 when callout number n
                      is reached for the mth time
         \C*n       pass the number n (may be negative) as
                      callout data; this is used as the callout
                      return value
         \Gdd       call pcre_get_substring() for substring dd
                      after a successful match (number less
                      than 32)
         \Gname     call pcre_get_named_substring() for
                      substring "name" after a successful
                      match (name terminated by next
                      non-alphanumeric character)
         \L         call pcre_get_substringlist() after a
                      successful match
         \M         discover the minimum MATCH_LIMIT setting
         \N         pass the PCRE_NOTEMPTY option to
                      pcre_exec()
         \Odd       set the size of the output vector passed to
                      pcre_exec() to dd (any number of digits)
         \P         pass the PCRE_PARTIAL option to pcre_exec()
         \S         output details of memory get/free calls
                      during matching
         \Z         pass the PCRE_NOTEOL option to pcre_exec()
         \?         pass the PCRE_NO_UTF8_CHECK option to
                      pcre_exec()
         \>dd       start the match at offset dd (any number of
                      digits); this sets the startoffset
                      argument for pcre_exec()

       A backslash followed by any other character simply
       escapes that character. If the very last character is a
       backslash, it is ignored. This gives a way of passing an
       empty line as data, since a real empty line terminates
       the data input.
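
       The character escapes can be sketched in Python as shown
       below. This is a simplified illustration with a
       hypothetical function name: it decodes only the escapes
       that produce characters, and it raises an error for the
       option escapes (\A, \B, \C, and so on) instead of acting
       on them. Treating \x with no digits as a zero character
       is an assumption.

```python
import string

SIMPLE = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
          "n": "\n", "r": "\r", "t": "\t", "v": "\v"}
OPTION_ESCAPES = set("ABCGLMNOPSZ?>")    # recognized by pcretest, not by this sketch


def decode_data_line(line):
    """Return the subject string encoded by one data line."""
    text = line.strip()                  # leading and trailing whitespace is removed
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            out.append(ch)
            continue
        if i == len(text):
            break                        # a final backslash is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc in "01234567":          # \nnn: up to 3 octal digits
            digits = esc
            while len(digits) < 3 and i < len(text) and text[i] in "01234567":
                digits += text[i]
                i += 1
            out.append(chr(int(digits, 8)))
        elif esc == "x" and text[i:i + 1] == "{" and "}" in text[i:]:
            close = text.index("}", i)   # \x{hh...}: any number of hex digits
            out.append(chr(int(text[i + 1:close], 16)))
            i = close + 1
        elif esc == "x":                 # \xhh: up to 2 hex digits
            digits = ""
            while len(digits) < 2 and i < len(text) and text[i] in string.hexdigits:
                digits += text[i]
                i += 1
            out.append(chr(int(digits or "0", 16)))   # assumption: no digits gives NUL
        elif esc in OPTION_ESCAPES:
            raise ValueError("option escape \\%s is outside this sketch" % esc)
        else:
            out.append(esc)              # any other character stands for itself
    return "".join(out)


print(repr(decode_data_line(r"  one\ntwo\x21  ")))
```

       A data line consisting of a single backslash decodes to
       the empty string, which is how an empty subject can be
       supplied.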

       If  \M  is  present,  pcretest calls pcre_exec() several
       times, with different values in the match_limit field of
       the  pcre_extra data structure, until it finds the mini-
       mum number that is needed for pcre_exec()  to  complete.
       This  number is a measure of the amount of recursion and
       backtracking that takes place, and checking it  out  can
       be  instructive.  For most simple matches, the number is
       quite small, but for patterns with very large numbers of
       matching possibilities, it can become large very quickly
       with increasing length of subject string.
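
       The text does not say how pcretest chooses the successive
       limits. The Python sketch below assumes one plausible
       strategy, doubling followed by bisection, and replaces
       pcre_exec() with a hypothetical predicate succeeds(limit)
       that reports whether matching completes within the limit.

```python
def minimum_limit(succeeds, upper_bound=10_000_000):
    """Smallest limit for which succeeds(limit) is true."""
    lo, hi = 0, 1                        # lo is known to be too small, hi is a candidate
    while not succeeds(hi):              # doubling phase: find any limit that works
        lo, hi = hi, hi * 2
        if hi > upper_bound:
            raise ValueError("no limit up to upper_bound is sufficient")
    while hi - lo > 1:                   # bisection phase: narrow down to the minimum
        mid = (lo + hi) // 2
        if succeeds(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Toy predicate standing in for a match that needs a limit of at least 37.
print(minimum_limit(lambda limit: limit >= 37))
```

       The sketch relies on the predicate being monotonic: once
       a limit is sufficient, every larger limit is sufficient
       too.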

       When \O is used, the value specified may be higher or
       lower than the size set by the -o command line option
       (or defaulted to 45); \O applies only to the call of
       pcre_exec() for the line in which it appears.

       If  the  /P modifier was present on the pattern, causing
       the POSIX wrapper API to be used, only \B  and  \Z  have
       any  effect,  causing  REG_NOTBOL  and  REG_NOTEOL to be
       passed to regexec() respectively.

       The use of \x{hh...} to represent  UTF-8  characters  is
       not  dependent on the use of the /8 modifier on the pat-
       tern. It is recognized always. There may be  any  number
       of  hexadecimal  digits inside the braces. The result is
       from one to six bytes, encoded according  to  the  UTF-8
       rules.
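
       The one-to-six-byte encoding can be sketched in Python as
       follows. The sketch assumes the original UTF-8 scheme,
       which extends the familiar encoding to values up to
       0x7fffffff; the function name is hypothetical and the
       code is not taken from pcretest.

```python
def utf8_encode_extended(value):
    """Encode an integer as 1 to 6 bytes using the original UTF-8 rules."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if value < 0x80:
        return bytes([value])            # single byte, same as ASCII
    # Find the shortest length whose payload can hold the value.
    for nbytes, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if value < limit:
            break
    tail = []
    for _ in range(nbytes - 1):          # continuation bytes carry 6 bits each
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    lead = (0xFF << (8 - nbytes)) & 0xFF # nbytes leading one bits, then a zero
    return bytes([lead | value] + tail[::-1])


print(utf8_encode_extended(0x263A).hex())
```

       For values that are valid Unicode code points the result
       agrees with Python's own UTF-8 encoder; the 5-byte and
       6-byte forms exist only in the older scheme.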

OUTPUT FROM PCRETEST

       When a match succeeds, pcretest outputs the list of cap-
       tured substrings that pcre_exec() returns, starting with
       number  0 for the string that matched the whole pattern.
       Otherwise, it outputs "No match" or "Partial match" when
       pcre_exec()      returns      PCRE_ERROR_NOMATCH      or
       PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE
       negative error number. Here is an example of an interac-
       tive pcretest run.

         $ pcretest
         PCRE version 5.00 07-Sep-2004

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       If the strings contain any non-printing characters, they
       are  output as \0x escapes, or as \x{...} escapes if the
       /8 modifier was present on the pattern. If  the  pattern
       has the /+ modifier, the output for substring 0 is
       followed by the rest of the subject string, identified
       by "0+" like this:

           re> /cat/+
         data> cataract
          0: cat
          0+ aract

       If the pattern has the /g or /G modifier, the results of
       successive matching attempts  are  output  in  sequence,
       like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No  match"  is  output  only if the first match attempt
       fails.

       If any of the sequences \C, \G, or \L are present  in  a
       data  line  that is successfully matched, the substrings
       extracted by the convenience functions are  output  with
       C,  G,  or L after the string number instead of a colon.
       This is in addition to the normal full list. The  string
       length  (that  is,  the return from the extraction func-
       tion) is given in parentheses after each string  for  \C
       and \G.

       Note  that  while patterns can be continued over several
       lines (a plain ">" prompt is  used  for  continuations),
       data lines may not. However, newlines can be included in
       data by means of the \n escape.

CALLOUTS

       If the pattern contains any callout requests, pcretest's
       callout  function is called during matching. By default,
       it displays the callout number, the  start  and  current
       positions  in the text at the callout time, and the next
       pattern item to be tested. For example, the output

         --->pqrabcdef
           0    ^  ^     \d

       indicates that callout number 0  occurred  for  a  match
       attempt  starting at the fourth character of the subject
       string, when the pointer was at the seventh character of
       the  data,  and  when the next pattern item was \d. Just
       one circumflex is output if the start and current  posi-
       tions are the same.

       Callouts  numbered 255 are assumed to be automatic call-
       outs, inserted as a result of the /C  pattern  modifier.
       In this case, instead of showing the callout number, the
       offset in the pattern, preceded by a  plus,  is  output.
       For example:

           re> /\d?[A-E]\*/C
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       The callout function in pcretest returns zero (carry on
       matching) by default, but you can use a \C item in a
       data line (as described above) to change this.

       Inserting callouts can be helpful when using pcretest to
       check  complicated  regular  expressions.  For   further
       information about callouts, see the pcrecallout documen-
       tation.

SAVING AND RELOADING COMPILED PATTERNS

       The facilities described in this section are not
       available when the POSIX interface to PCRE is being
       used, that is, when the /P pattern modifier is
       specified.

       When the POSIX interface is not in use,  you  can  cause
       pcretest  to write a compiled pattern to a file, by fol-
       lowing the modifiers with > and a file name.  For  exam-
       ple:

         /pattern/im >/some/file

       See  the  pcreprecompile  documentation for a discussion
       about saving and re-using compiled patterns.

       The data that is written  is  binary.  The  first  eight
       bytes  are  the length of the compiled pattern data fol-
       lowed by the length of the  optional  study  data,  each
       written as four bytes in big-endian order (most signifi-
       cant byte first). If there is no study data (either  the
       pattern  was not studied, or studying did not return any
       data), the second length is zero. The lengths  are  fol-
       lowed by an exact copy of the compiled pattern. If there
       is additional study data, this follows immediately after
       the  compiled  pattern. After writing the file, pcretest
       expects to read a new pattern.
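
       The file layout described above can be read back with a
       few lines of Python. This is a sketch under the
       assumption that both lengths are unsigned; the function
       name is hypothetical, and the pattern and study data are
       returned as opaque bytes because their internal format
       is not described here.

```python
import struct


def split_saved_pattern(blob):
    """Split the bytes of a saved file into (pattern_data, study_data)."""
    if len(blob) < 8:
        raise ValueError("file too short for the 8-byte header")
    # Two 4-byte big-endian lengths: compiled pattern, then study data.
    pattern_len, study_len = struct.unpack(">II", blob[:8])
    body = blob[8:]
    if len(body) < pattern_len + study_len:
        raise ValueError("file shorter than its header claims")
    pattern = body[:pattern_len]
    study = body[pattern_len:pattern_len + study_len]
    return pattern, study


# Made-up payload: 3 bytes of "pattern" data and no study data.
example = struct.pack(">II", 3, 0) + b"abc"
print(split_saved_pattern(example))
```

       A study length of zero yields an empty second element,
       matching the case where the pattern was not studied.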

       A saved pattern can be reloaded into pcretest by
       specifying < and a file name instead of a pattern. The
       name of the file must not contain a < character, as
       otherwise pcretest will interpret the line as a pattern
       delimited by < characters. For example:

          re> </some/file
         Compiled regex loaded from /some/file
         No study data

       When the pattern has been loaded, pcretest  proceeds  to
       read data lines in the usual way.

       You can copy a file written by pcretest to a different
       host and reload it there, even if the new host has
       opposite endianness to the one on which the pattern was
       compiled. For example, you can compile on an x86 machine
       and run on a SPARC machine.

       File  names  for saving and reloading can be absolute or
       relative, but note that the shell facility of  expanding
       a  file  name that starts with a tilde (~) is not avail-
       able.

       The ability to save and  reload  files  in  pcretest  is
       intended  for  testing  and  experimentation.  It is not
       intended for production use because only a  single  pat-
       tern  can be written to a file. Furthermore, there is no
       facility for supplying custom character tables  for  use
       with  a  reloaded  pattern.  If the original pattern was
       compiled with custom tables, an attempt to match a  sub-
       ject  string using a reloaded pattern is likely to cause
       pcretest to crash.  Finally, if you attempt  to  load  a
       file  that  is  not in the correct format, the result is
       undefined.

AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       Cambridge CB2 3QG, England.

Last updated: 10 September 2004
Copyright (c) 1997-2004 University of Cambridge.



