-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions


INTRODUCTION

       The  PCRE  library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax and semantics as Perl, with
       just  a  few  differences. Certain features that appeared in Python and
       PCRE before they appeared in Perl are also available using  the  Python
       syntax.  There is also some support for certain .NET and Oniguruma syn-
       tax items, and there is an option for  requesting  some  minor  changes
       that give better JavaScript compatibility.

       The  current  implementation of PCRE (release 7.x) corresponds approxi-
       mately with Perl 5.10, including support for UTF-8 encoded strings  and
       Unicode general category properties. However, UTF-8 and Unicode support
       has to be explicitly enabled; it is not the default. The Unicode tables
       correspond to Unicode release 5.1.

       In  addition to the Perl-compatible matching function, PCRE contains an
       alternative matching function that matches the same  compiled  patterns
       in  a different way. In certain circumstances, the alternative function
       has some advantages. For a discussion of the two  matching  algorithms,
       see the pcrematching page.

       PCRE  is  written  in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds.  In  particular,
       Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
       included as part of the PCRE distribution. The pcrecpp page has details
       of  this  interface.  Other  people's contributions can be found in the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are  and  are
       not supported by PCRE are given in separate documents. See the pcrepat-
       tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
       page.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcrebuild page. Documentation about  build-
       ing  PCRE for various operating systems can be found in the README file
       in the source distribution.

       The library contains a number of undocumented  internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


USER DOCUMENTATION

       The user documentation for PCRE comprises a number  of  different  sec-
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain text format, all the sections are concatenated, for ease
       of searching. The sections are as follows:

         pcre              this document
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcresyntax        quick syntax reference
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the sample program
         pcrestack         discussion of stack usage
         pcretest          description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short  page  for
       each C library function, listing its arguments and results.


LIMITATIONS

       There  are some size limitations in PCRE but it is hoped that they will
       never in practice be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process regular expressions that are truly enormous,  you  can  compile
       PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
       the source distribution and the pcrebuild documentation  for  details).
       In  these  cases the limit is substantially larger.  However, the speed
       of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns.

       The maximum length of the name of a named subpattern is 32  characters,
       and the maximum number of named subpatterns is 10000.

       The maximum length of a subject string is the largest  positive  number
       that  an integer variable can hold. However, when using the traditional
       matching function, PCRE uses recursion to handle subpatterns and indef-
       inite  repetition.  This means that the available stack space may limit
       the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.


UTF-8 AND UNICODE PROPERTY SUPPORT

       From  release  3.3,  PCRE  has  had  some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly  extended
       to  cover  most common requirements, and in release 5.0 additional sup-
       port for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support  in  the  code,  and, in addition, you must call pcre_compile()
       with the PCRE_UTF8 option flag, or the  pattern  must  start  with  the
       sequence  (*UTF8).  When  either of these is the case, both the pattern
       and any subject strings that are matched  against  it  are  treated  as
       UTF-8 strings instead of just strings of bytes.

       If  you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time  overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so should not be
       very big.

       If PCRE is built with Unicode character property support (which implies
       UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
       ported.  The available properties that can be tested are limited to the
       general  category  properties such as Lu for an upper case letter or Nd
       for a decimal number, the Unicode script names such as Arabic  or  Han,
       and  the  derived  properties  Any  and L&. A full list is given in the
       pcrepattern documentation. Only the short names for properties are sup-
       ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
       ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
       optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
       does not support this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and
       subjects are (by default) checked for validity on entry to the relevant
       functions. From release 7.3 of PCRE, the check is according to the rules
       of  RFC  3629, which are themselves derived from the Unicode specifica-
       tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which
       allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current
       check allows only values in the range U+0 to U+10FFFF, excluding U+D800
       to U+DFFF.

       The excluded code points are the surrogate areas of Unicode:  the  high
       surrogates (U+D800 to U+DBFF) and the low surrogates (U+DC00 to U+DFFF).
       Of the Low Surrogate Area, the Unicode Standard says  this:  "The  Low
       Surrogate Area does not contain any character assignments, consequently
       no character code charts or namelists are  provided  for  this  area.
       Surrogates  are  reserved  for use with UTF-16 and then must be used in
       pairs." The code points that are encoded by UTF-16 pairs are  available
       as  independent code points in the UTF-8 encoding. (In other words, the
       whole surrogate thing is a fudge for UTF-16 which unfortunately  messes
       up UTF-8.)

       If an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error  return
       (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
       that your strings are valid, and therefore want to skip these checks in
       order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
       compile time or at run time, PCRE assumes that the pattern  or  subject
       it  is  given  (respectively)  contains only valid UTF-8 codes. In this
       case, it does not diagnose an invalid UTF-8 string.

       If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,
       what  happens  depends on why the string is invalid. If the string con-
       forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
       string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,
       apart from the initial validity test, PCRE (when in UTF-8 mode) handles
       strings  according  to  the more liberal rules of RFC 2279. However, if
       the string does not even conform to RFC 2279, the result is  undefined.
       Your program may crash.

       If  you  want  to  process  strings  of  values  in the full range 0 to
       0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can
       set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
       this situation, you will have to apply your own validity check.

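       The RFC 3629 check described above can be sketched as a small  C
       function. This is an illustration only, not PCRE's own checker:
       the name utf8_is_valid_rfc3629() is hypothetical. It accepts code
       points U+0 to U+10FFFF, rejects U+D800 to U+DFFF, and also rejects
       overlong encodings, which RFC 3629 forbids as well. A program that
       sets PCRE_NO_UTF8_CHECK might run a check of this kind once on its
       input beforehand.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the first len bytes of s are well-formed UTF-8 under the
   RFC 3629 rules, 0 otherwise. Hypothetical helper, not part of PCRE. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i++];
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest value allowed for this length */
        int extra;          /* number of continuation bytes expected */

        if (c < 0x80) continue;                        /* one-byte (ASCII) */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte or RFC 2279 5/6-byte lead */

        if (len - i < (size_t)extra) return 0;         /* truncated sequence */
        while (extra-- > 0) {
            c = s[i++];
            if ((c & 0xC0) != 0x80) return 0;          /* not a continuation */
            cp = (cp << 6) | (unsigned long)(c & 0x3F);
        }
        if (cp < min) return 0;                        /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                   /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    /* surrogate */
    }
    return 1;
}
```

       For example, the three bytes ED A0 80 (which would decode to U+D800)
       are rejected, while F4 8F BF BF (U+10FFFF) is accepted.
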
   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
       two-byte UTF-8 character if the value is greater than 127.

       2.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
       characters for values greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
       vidual bytes, for example: \x{100}{3}.

       4.  The dot metacharacter matches one UTF-8 character instead of a sin-
       gle byte.

       5. The escape sequence \C can be used to match a single byte  in  UTF-8
       mode,  but  its  use can lead to some strange effects. This facility is
       not available in the alternative matching function, pcre_dfa_exec().

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
       test  characters of any code value, but the characters that PCRE recog-
       nizes as digits, spaces, or word characters  remain  the  same  set  as
       before, all with values less than 256. This remains true even when PCRE
       includes Unicode property support, because to do otherwise  would  slow
       down  PCRE in many common cases. If you really want to test for a wider
       sense of, say, "digit", you must use Unicode  property  tests  such  as
       \p{Nd}.  Note  that  this  also applies to \b, because it is defined in
       terms of \w and \W.

       7. Similarly, characters that match the POSIX named  character  classes
       are all low-valued characters.

       8.  However,  the Perl 5.10 horizontal and vertical whitespace matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
       acters.

       9.  Case-insensitive  matching  applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property  support.
       Even  when  Unicode  property support is available, PCRE still uses its
       own character tables when checking the case of  low-valued  characters,
       so  as not to degrade performance.  The Unicode property information is
       used only for characters with higher values. Even when Unicode property
       support is available, PCRE supports case-insensitive matching only when
       there is a one-to-one mapping between a letter's  cases.  There  are  a
       small  number  of  many-to-one  mappings in Unicode; these are not sup-
       ported by PCRE.


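       Points 1 to 3 above follow from how UTF-8 lays out code points  in
       bytes. The sketch below (utf8_encode() is a hypothetical helper,
       not a PCRE function) shows why \xb3 and \777 each correspond  to  a
       two-byte sequence in UTF-8 mode: every value from 0x80 to 0x7FF
       needs exactly two bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes code point cp (0 to 0x10FFFF) as UTF-8 into out, which must
   have room for 4 bytes. Returns the number of bytes written.
   Hypothetical helper for illustration; not part of the PCRE API. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

       With this layout, 0xB3 becomes the bytes C2 B3 and 0x100 becomes
       C4 80, so a pattern such as \x{100}{3} matches a run of six bytes
       in the subject: three complete characters, not three bytes.
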
AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam  magnet,
       so  I've  taken  it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.


REVISION

       Last updated: 11 April 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------


PCREBUILD(3)                                                      PCREBUILD(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE BUILD-TIME OPTIONS

       This  document  describes  the  optional  features  of PCRE that can be
       selected when the library is compiled. It assumes use of the  configure
       script,  where the optional features are selected or deselected by pro-
       viding options to configure before running the make  command.  However,
       the  same  options  can be selected in both Unix-like and non-Unix-like
       environments using the GUI facility of  CMakeSetup  if  you  are  using
       CMake instead of configure to build PCRE.

       The complete list of options for configure (which includes the standard
       ones such as the  selection  of  the  installation  directory)  can  be
       obtained by running

         ./configure --help

       The  following  sections  include  descriptions  of options whose names
       begin with --enable or --disable. These settings specify changes to the
       defaults  for  the configure command. Because of the way that configure
       works, --enable and --disable always come in pairs, so  the  complemen-
       tary  option always exists as well, but as it specifies the default, it
       is not described.


C++ SUPPORT

       By default, the configure script will search for a C++ compiler and C++
       header files. If it finds them, it automatically builds the C++ wrapper
       library for PCRE. You can disable this by adding

         --disable-cpp

       to the configure command.


UTF-8 SUPPORT

       To build PCRE with support for UTF-8 Unicode character strings, add

         --enable-utf8

       to the configure command. Of itself, this  does  not  make  PCRE  treat
       strings  as UTF-8. As well as compiling PCRE with this option, you also
       have to set the PCRE_UTF8  option  when  you  call  the  pcre_compile()
       function.

       If  you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
       expects its input to be either ASCII or UTF-8 (depending on the runtime
       option).  It  is not possible to support both EBCDIC and UTF-8 codes in
       the same  version  of  the  library.  Consequently,  --enable-utf8  and
       --enable-ebcdic are mutually exclusive.


UNICODE CHARACTER PROPERTY SUPPORT

       UTF-8  support allows PCRE to process character values greater than 255
       in the strings that it handles. On its own, however, it does  not  pro-
       vide any facilities for accessing the properties of such characters. If
       you want to be able to use the pattern escapes \P, \p,  and  \X,  which
       refer to Unicode character properties, you must add

         --enable-unicode-properties

       to  the configure command. This implies UTF-8 support, even if you have
       not explicitly requested it.

       Including Unicode property support adds around 30K  of  tables  to  the
       PCRE  library.  The supported properties are the general category prop-
       erties such as Lu and Nd, the Unicode script names, and the  derived
       properties  Any  and L&. Details are given in the pcrepattern documen-
       tation.


CODE VALUE OF NEWLINE

       By default, PCRE interprets the linefeed (LF) character  as  indicating
       the  end  of  a line. This is the normal newline character on Unix-like
       systems. You can compile PCRE to use carriage return (CR)  instead,  by
       adding

         --enable-newline-is-cr

       to  the  configure  command.  There  is  also  a --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two character sequence CRLF. If you want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which  causes  PCRE  to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE to recognize any Unicode newline sequence.

       Whatever line ending convention is selected when PCRE is built  can  be
       overridden  when  the library functions are called. At build time it is
       conventional to use the standard for your operating system.


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches  any  Unicode  newline
       sequence,  whatever  has  been selected as the line ending sequence. If
       you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
       ever  is selected when PCRE is built can be overridden when the library
       functions are called.


BUILDING SHARED AND STATIC LIBRARIES

       The PCRE building process uses libtool to build both shared and  static
       Unix  libraries by default. You can suppress one of these by adding one
       of

         --disable-shared
         --disable-static

       to the configure command, as required.


POSIX MALLOC USAGE

       When PCRE is called through the POSIX interface (see the pcreposix doc-
       umentation),  additional  working  storage  is required for holding the
       pointers to capturing substrings, because PCRE requires three  integers
       per  substring,  whereas  the POSIX interface provides only two. If the
       number of expected substrings is small, the wrapper function uses space
       on the stack, because this is faster than using malloc() for each call.
       The default threshold above which the stack is no longer used is 10; it
       can be changed by adding a setting such as

         --with-posix-malloc-threshold=20

       to the configure command.


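       The stack-or-heap decision described above can be sketched as fol-
       lows. This is not the code of the POSIX wrapper itself: the func-
       tion run_with_workspace() and the macro value are  assumptions  made
       for illustration, using the default threshold of 10 and three inte-
       gers of workspace per substring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POSIX_MALLOC_THRESHOLD 10   /* the default value quoted above */

/* Hypothetical stand-in for a wrapper that needs three ints of workspace
   per expected substring. Returns 1 if heap storage was used, 0 if the
   stack array sufficed, and -1 if allocation failed. */
int run_with_workspace(size_t nmatch)
{
    int small_vec[POSIX_MALLOC_THRESHOLD * 3];   /* cheap stack storage */
    int *ovector = small_vec;
    int used_heap = 0;
    size_t i;

    /* Guard the size computation against overflow. */
    assert(nmatch <= (size_t)-1 / (3 * sizeof(int)));

    if (nmatch > POSIX_MALLOC_THRESHOLD) {
        /* Too many substrings for the stack array: fall back to malloc(). */
        ovector = (int *)malloc(sizeof(int) * nmatch * 3);
        if (ovector == NULL) return -1;
        used_heap = 1;
    }

    /* Placeholder for the real matching work: mark every slot unset. */
    for (i = 0; i < nmatch * 3; i++) ovector[i] = -1;

    if (used_heap) free(ovector);
    return used_heap;
}
```

       Raising the threshold trades a larger stack frame on every call for
       fewer malloc() calls when many substrings are requested.
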
HANDLING VERY LARGE PATTERNS

       Within  a  compiled  pattern,  offset values are used to point from one
       part to another (for example, from an opening parenthesis to an  alter-
       nation  metacharacter).  By default, two-byte values are used for these
       offsets, leading to a maximum size for a  compiled  pattern  of  around
       64K.  This  is sufficient to handle all but the most gigantic patterns.
       Nevertheless, some people do want to process enormous patterns,  so  it
       is  possible  to compile PCRE to use three-byte or four-byte offsets by
       adding a setting such as

         --with-link-size=3

       to the configure command. The value given must be 2,  3,  or  4.  Using
       longer  offsets slows down the operation of PCRE because it has to load
       additional bytes when handling them.


AVOIDING EXCESSIVE STACK USAGE

       When matching with the pcre_exec() function, PCRE implements backtrack-
       ing  by  making recursive calls to an internal function called match().
       In environments where the size of the  stack  is  limited,  this  can
       severely  limit PCRE's operation. (The Unix environment does not usually
       suffer from this problem, but it may sometimes be necessary to increase
       the  maximum  stack size.  There is a discussion in the pcrestack docu-
       mentation.) An alternative approach to recursion that uses memory  from
       the  heap  to remember data, instead of using recursive function calls,
       has been implemented to work round the problem of limited  stack  size.
       If you want to build a version of PCRE that works this way, add

         --disable-stack-for-recursion

       to  the  configure  command. With this configuration, PCRE will use the
       pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
       ment  functions. By default these point to malloc() and free(), but you
       can replace the pointers so that your own functions are used.

       Separate functions are  provided  rather  than  using  pcre_malloc  and
       pcre_free  because  the  usage  is  very  predictable:  the block sizes
       requested are always the same, and  the  blocks  are  always  freed  in
       reverse  order.  A calling program might be able to implement optimized
       functions that perform better  than  malloc()  and  free().  PCRE  runs
       noticeably more slowly when built in this way. This option affects only
       the   pcre_exec()   function;   it   is   not   relevant   for    the
       pcre_dfa_exec() function.


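       As a sketch of the kind of optimized function mentioned above,  the
       following pair keeps released blocks on a free list and hands  them
       out again, relying on the stated property that all requests are the
       same size. The names lifo_malloc() and lifo_free() are hypothetical,
       the code is not thread-safe, and cached blocks are never returned to
       the system; treat it as an illustration under those assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A released block is reused as a free-list node. */
struct free_block { struct free_block *next; };

static struct free_block *free_list = NULL;
static size_t block_size = 0;          /* fixed by the first request */

/* Hypothetical replacement with the same shape as malloc(). */
void *lifo_malloc(size_t size)
{
    struct free_block *b;

    if (size < sizeof(struct free_block)) size = sizeof(struct free_block);
    if (block_size == 0) block_size = size;
    assert(size == block_size);        /* relies on uniform request sizes */

    if (free_list != NULL) {           /* reuse the most recently freed block */
        b = free_list;
        free_list = b->next;
        return b;
    }
    return malloc(block_size);         /* nothing cached: ask the system */
}

/* Hypothetical replacement with the same shape as free(). */
void lifo_free(void *p)
{
    struct free_block *b = (struct free_block *)p;

    if (b == NULL) return;
    b->next = free_list;               /* push onto the free list */
    free_list = b;
}
```

       In a library built with --disable-stack-for-recursion,  a  program
       would install such functions by assigning them to the  two  vari-
       ables named above, for example "pcre_stack_malloc = lifo_malloc;"
       and "pcre_stack_free = lifo_free;", before calling pcre_exec().
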
LIMITING PCRE RESOURCE USAGE

       Internally,  PCRE has a function called match(), which it calls repeat-
       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
       pcre_exec()  function.  By controlling the maximum number of times this
       function may be called during a single matching operation, a limit  can
       be  placed  on  the resources used by a single call to pcre_exec(). The
       limit can be changed at run time, as described in the pcreapi  documen-
       tation.  The default is 10 million, but this can be changed by adding a
       setting such as

         --with-match-limit=500000

       to  the  configure  command.  This  setting  has  no  effect   on   the
       pcre_dfa_exec() matching function.

       In  some  environments  it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
       for-recursion is specified) that is used. A second limit controls this;
       it  defaults  to  the  value  that is set for --with-match-limit, which
       imposes no additional constraints. However, you can set a  lower  limit
       by adding, for example,

         --with-match-limit-recursion=10000

       to  the  configure  command.  This  value can also be overridden at run
       time.


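       The difference between the two limits can be seen in a toy  back-
       tracking matcher. The code below is not PCRE's match() function: it
       is a hypothetical wildcard matcher ('*' matches any run of charac-
       ters, '?' any single character), and its return codes are invented
       for the example. It counts every call against one limit and checks
       the nesting depth against the other.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MATCH            1
#define TOY_NOMATCH          0
#define TOY_ERR_MATCHLIMIT  (-1)   /* hypothetical codes, not PCRE's */
#define TOY_ERR_RECURSION   (-2)

struct toy_limits {
    unsigned long match_limit;       /* maximum total number of calls */
    unsigned long recursion_limit;   /* maximum nesting depth */
    unsigned long calls;             /* calls made so far */
};

static int toy_match(const char *pat, const char *sub,
                     unsigned long depth, struct toy_limits *lim)
{
    int rc;

    if (++lim->calls > lim->match_limit) return TOY_ERR_MATCHLIMIT;
    if (depth > lim->recursion_limit) return TOY_ERR_RECURSION;

    if (*pat == '\0') return *sub == '\0' ? TOY_MATCH : TOY_NOMATCH;

    if (*pat == '*') {
        /* Backtracking point: try the rest of the pattern at each
           remaining subject position. */
        for (;; sub++) {
            rc = toy_match(pat + 1, sub, depth + 1, lim);
            if (rc != TOY_NOMATCH) return rc;   /* match, or a limit error */
            if (*sub == '\0') return TOY_NOMATCH;
        }
    }

    if (*sub != '\0' && (*pat == '?' || *pat == *sub))
        return toy_match(pat + 1, sub + 1, depth + 1, lim);
    return TOY_NOMATCH;
}

/* Entry point: match the whole of sub against pat under both limits. */
int toy_exec(const char *pat, const char *sub,
             unsigned long match_limit, unsigned long recursion_limit)
{
    struct toy_limits lim;

    assert(pat != NULL && sub != NULL);
    lim.match_limit = match_limit;
    lim.recursion_limit = recursion_limit;
    lim.calls = 0;
    return toy_match(pat, sub, 1, &lim);
}
```

       Because the depth can never exceed the number of calls made, set-
       ting the recursion limit equal to the match limit adds no extra con-
       straint in this sketch, which mirrors the default described above. A
       pattern with several '*' items against a long non-matching  subject
       hits the call limit, whereas a long run of literal characters  hits
       a small depth limit after only a few calls.
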
CREATING CHARACTER TABLES AT BUILD TIME

       PCRE uses fixed tables for processing characters whose code values  are
       less  than 256. By default, PCRE is built with a set of tables that are
       distributed in the file pcre_chartables.c.dist. These  tables  are  for
       ASCII codes only. If you add

         --enable-rebuild-chartables

       to  the  configure  command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and  run.  This  outputs
       the  source  for  a new set of tables, created in the default locale of
       your C runtime system. (This method of replacing the  tables  does  not
       work if you are cross compiling, because dftables is run on the local
       host. If you need to create alternative tables  when  cross  compiling,
       you will have to do so "by hand".)


USING EBCDIC CODE

       PCRE  assumes  by  default that it will run in an environment where the
       character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
       This  is  the  case for most computer operating systems. PCRE can, how-
       ever, be compiled to run in an EBCDIC environment by adding

         --enable-ebcdic

       to the configure command. This setting implies --enable-rebuild-charta-
       bles.  You  should  only  use  it if you know that you are in an EBCDIC
       environment (for example,  an  IBM  mainframe  operating  system).  The
       --enable-ebcdic option is incompatible with --enable-utf8.


PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By default, pcregrep reads all files as plain text. You can build it so
       that it recognizes files whose names end in .gz or .bz2, and reads them
       with libz or libbz2, respectively, by adding one or both of

         --enable-pcregrep-libz
         --enable-pcregrep-libbz2

       to the configure command. These options naturally require that the rel-
       evant libraries are installed on your system. Configuration  will  fail
       if they are not.


PCRETEST OPTION FOR LIBREADLINE SUPPORT

       If you add

         --enable-pcretest-libreadline

       to  the  configure  command,  pcretest  is  linked with the libreadline
       library, and when its input is from a terminal, it reads it  using  the
       readline() function. This provides line-editing and history facilities.
       Note that libreadline is GPL-licenced, so if you distribute a binary of
       pcretest linked in this way, there may be licensing issues.

       Setting  this  option  causes  the -lreadline option to be added to the
       pcretest build. In many operating environments with a  system-installed
       libreadline this is sufficient. However, in some environments (e.g.  if
       an unmodified distribution version of readline is in use),  some  extra
       configuration  may  be necessary. The INSTALL file for libreadline says
       this:

         "Readline uses the termcap functions, but does not link with the
         termcap or curses library itself, allowing applications which link
         with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate  library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


SEE ALSO

       pcreapi(3), pcre_config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 17 March 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------


PCREMATCHING(3)                                                PCREMATCHING(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE for matching a compiled regular expression against a given sub-
       ject  string.  The  "standard"  algorithm  is  the  one provided by the
       pcre_exec() function.  This works in the same way  as  Perl's  matching
       function, and provides a Perl-compatible matching operation.

       An  alternative  algorithm is provided by the pcre_dfa_exec() function;
       this operates in a different way, and is not  Perl-compatible.  It  has
       advantages  and disadvantages compared with the standard algorithm, and
       these are described below.

       When there is only one possible way in which a given subject string can
       match  a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities. For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.
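
       The three answers can be listed by hand. The following is a minimal
       sketch in plain C that makes no PCRE calls: the function name
       all_match_lengths() is hypothetical, and the code is hard-wired to
       the one pattern ^<.*> on the assumption that the subject contains no
       newline.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written stand-in for the pattern ^<.*> (no PCRE calls).
 * Stores the length of every possible match in lengths[], shortest
 * first, and returns how many were stored. Returns 0 if the subject
 * does not start with '<'. */
size_t all_match_lengths(const char *subject, size_t *lengths, size_t max)
{
    size_t n = strlen(subject);
    size_t count = 0;
    size_t i;

    if (n == 0 || subject[0] != '<')
        return 0;                      /* the ^< part fails */

    for (i = 1; i < n; i++) {
        if (subject[i] == '>' && count < max)
            lengths[count++] = i + 1;  /* a match ends just after this '>' */
    }
    return count;
}
```

       For the subject string shown above this yields three lengths: 11,
       28, and 48 characters. The alternative algorithm reports all of
       them, whereas the standard algorithm reports just one.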


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be rep-
       resented  as  a  tree structure. An unlimited repetition in the pattern
       makes the tree of infinite size, but it is still a tree.  Matching  the
       pattern  to a given subject string (from a given starting point) can be
       thought of as a search of the tree.  There are two  ways  to  search  a
       tree:  depth-first  and  breadth-first, and these correspond to the two
       matching algorithms provided by PCRE.


THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
       depth-first search of the pattern tree. That is, it  proceeds  along  a
       single path through the tree, checking that the subject matches what is
       required. When there is a mismatch, the algorithm  tries  any  alterna-
       tives  at  the  current point, and if they all fail, it backs up to the
       previous branch point in the  tree,  and  tries  the  next  alternative
       branch  at  that  level.  This often involves backing up (moving to the
       left) in the subject string as well.  The  order  in  which  repetition
       branches  are  tried  is controlled by the greedy or ungreedy nature of
       the quantifier.

       If a leaf node is reached, a matching string has  been  found,  and  at
       that  point the algorithm stops. Thus, if there is more than one possi-
       ble match, this algorithm returns the first one that it finds.  Whether
       this  is the shortest, the longest, or some intermediate length depends
       on the way the greedy and ungreedy repetition quantifiers are specified
       in the pattern.
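
       The effect of the quantifier on which match is found first can be
       sketched in plain C. This is an illustration only, not PCRE's
       implementation: the function first_match_length() is hypothetical
       and is hard-wired to the patterns ^<.*> (greedy) and ^<.*?>
       (ungreedy).

```c
#include <stddef.h>
#include <string.h>

/* Depth-first match of '<', any run of characters, then '>', anchored
 * at the start of the subject. Stands in for ^<.*> when greedy is
 * non-zero and for ^<.*?> when it is zero. Returns the length of the
 * first match found, or -1 if there is none. */
long first_match_length(const char *subject, int greedy)
{
    size_t n = strlen(subject);
    size_t k;

    if (n == 0 || subject[0] != '<')
        return -1;

    if (greedy) {
        /* Longest run first, then back up one character at a time. */
        for (k = n; k-- > 1; )
            if (subject[k] == '>')
                return (long)(k + 1);
    } else {
        /* Shortest run first, then extend one character at a time. */
        for (k = 1; k < n; k++)
            if (subject[k] == '>')
                return (long)(k + 1);
    }
    return -1;
}
```

       Both loops stop at the first success, which is why only one of the
       possible matches is ever reported.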

       Because  it  ends  up  with a single path through the tree, it is rela-
       tively straightforward for this algorithm to keep  track  of  the  sub-
       strings  that  are  matched  by portions of the pattern in parentheses.
       This provides support for capturing parentheses and back references.


THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of  the  tree.  Starting
       from  the  first  matching  point  in the subject, it scans the subject
       string from left to right, once, character by character, and as it does
       this,  it remembers all the paths through the tree that represent valid
       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
       though  it is not implemented as a traditional finite state machine (it
       keeps multiple states active simultaneously).

       The scan continues until either the end of the subject is  reached,  or
       there  are  no more unterminated paths. At this point, terminated paths
       represent the different matching possibilities (if there are none,  the
       match  has  failed).   Thus,  if there is more than one possible match,
       this algorithm finds all of them, and in particular, it finds the long-
       est.  In PCRE, there is an option to stop the algorithm after the first
       match (which is necessarily the shortest) has been found.

       Note that all the matches that are found start at the same point in the
       subject. If the pattern

         cat(er(pillar)?)?

       is  matched  against the string "the caterpillar catchment", the result
       will be the three strings "cat", "cater", and "caterpillar" that  start
       at  the fifth character of the subject. The algorithm does not automat-
       ically move on to find matches that start at later positions.
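
       The "same starting point" rule can be illustrated with a small
       plain C sketch. It makes no PCRE calls; the helper matches_at() is
       hypothetical and simply tests the three strings that an optional
       "er" and "pillar" after "cat" can produce. Reporting the longest
       match first is a choice made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written stand-in for "cat", optionally followed by "er",
 * optionally followed by "pillar", tried at one fixed start offset.
 * Writes the lengths of all matches, longest first, into lengths[]
 * (room for three entries) and returns the number of matches. */
size_t matches_at(const char *subject, size_t start, size_t *lengths)
{
    static const char *const forms[] = { "caterpillar", "cater", "cat" };
    size_t count = 0;
    size_t i;

    if (start > strlen(subject))
        return 0;

    for (i = 0; i < sizeof forms / sizeof forms[0]; i++) {
        size_t len = strlen(forms[i]);
        if (strncmp(subject + start, forms[i], len) == 0)
            lengths[count++] = len;
    }
    return count;
}
```

       With the subject "the caterpillar catchment", offset 4 (the fifth
       character) gives three matches. The "cat" of "catchment" at offset
       16 is found only if the caller asks again at that position.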

       There are a number of features of PCRE regular expressions that are not
       supported by the alternative matching algorithm. They are as follows:

       1.  Because  the  algorithm  finds  all possible matches, the greedy or
       ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
       ungreedy quantifiers are treated in exactly the same way. However, pos-
       sessive quantifiers can make a difference when what follows could  also
       match what is quantified, for example in a pattern like this:

         ^a++\w!

       This  pattern matches "aaab!" but not "aaa!", which would be matched by
       a non-possessive quantifier. Similarly, if an atomic group is  present,
       it  is matched as if it were a standalone pattern at the current point,
       and the longest match is then "locked in" for the rest of  the  overall
       pattern.

       2. When dealing with multiple paths through the tree simultaneously, it
       is not straightforward to keep track of  captured  substrings  for  the
       different  matching  possibilities,  and  PCRE's implementation of this
       algorithm does not attempt to do this. This means that no captured sub-
       strings are available.

       3.  Because no substrings are captured, back references within the pat-
       tern are not supported, and cause errors if encountered.

       4. For the same reason, conditional expressions that use  a  backrefer-
       ence  as  the  condition or test for a specific group recursion are not
       supported.

       5. Because many paths through the tree may be  active,  the  \K  escape
       sequence, which resets the start of the match when encountered (but may
       be on some paths and not on others), is not  supported.  It  causes  an
       error if encountered.

       6.  Callouts  are  supported, but the value of the capture_top field is
       always 1, and the value of the capture_last field is always -1.

       7. The \C escape sequence, which (in the standard algorithm) matches  a
       single  byte, even in UTF-8 mode, is not supported because the alterna-
       tive algorithm moves through the subject  string  one  character  at  a
       time, for all active paths through the tree.

       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
       negative assertion.
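
       The possessive quantifier example in item 1 above can be checked
       with a hand-written matcher. This is a sketch in plain C under
       stated assumptions: match_a_word_bang() is a hypothetical name, \w
       is taken to mean an ASCII letter, digit, or underscore, and the
       code is hard-wired to ^a++\w! and ^a+\w!.

```c
#include <ctype.h>
#include <stddef.h>

static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Stands in for ^a++\w! when possessive is non-zero and for ^a+\w!
 * when it is zero. Returns 1 if the start of s matches, otherwise 0. */
int match_a_word_bang(const char *s, int possessive)
{
    size_t run = 0;
    size_t k;

    while (s[run] == 'a')
        run++;                     /* a+ first takes every 'a' it can */
    if (run == 0)
        return 0;

    /* A possessive quantifier keeps the whole run. An ordinary greedy
     * one may give characters back, one at a time, down to one 'a'. */
    for (k = run; k >= 1; k--) {
        if (is_word_char(s[k]) && s[k + 1] == '!')
            return 1;
        if (possessive)
            break;
    }
    return 0;
}
```

       For "aaa!" the possessive form fails because all three "a"
       characters are held by a++, leaving nothing for \w. The ordinary
       form succeeds by giving the last "a" back.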


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using  the alternative matching algorithm provides the following advan-
       tages:

       1. All possible matches (at a single point in the subject) are automat-
       ically  found,  and  in particular, the longest match is found. To find
       more than one match using the standard algorithm, you have to do kludgy
       things with callouts.

       2.  There is much better support for partial matching. The restrictions
       on the content of the pattern that apply when using the standard  algo-
       rithm  for  partial matching do not apply to the alternative algorithm.
       For non-anchored patterns, the starting position of a partial match  is
       available.

       3.  Because  the  alternative  algorithm  scans the subject string just
       once, and never needs to backtrack, it is possible to  pass  very  long
       subject  strings  to  the matching function in several pieces, checking
       for partial matching each time.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than  the  standard  algorithm.  This  is
       partly  because  it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use does not provide the
       performance advantage that it does for the standard algorithm.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 19 April 2008
       Copyright (c) 1997-2008 University of Cambridge.
------------------------------------------------------------------------------


PCREAPI(3)                                                          PCREAPI(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE NATIVE API

       #include <pcre.h>

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre *pcre_compile2(const char *pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            int *workspace, int wscount);

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_stringtable_entries(const pcre *code,
            const char *name, char **first, char **last);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       void pcre_free_substring(const char *stringptr);

       void pcre_free_substring_list(const char **stringptr);

       const unsigned char *pcre_maketables(void);

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

       int pcre_refcount(pcre *code, int adjust);

       int pcre_config(int what, void *where);

       char *pcre_version(void);

       void *(*pcre_malloc)(size_t);

       void (*pcre_free)(void *);

       void *(*pcre_stack_malloc)(size_t);

       void (*pcre_stack_free)(void *);

       int (*pcre_callout)(pcre_callout_block *);


PCRE API OVERVIEW

       PCRE has its own native API, which is described in this document. There
       are also some wrapper functions that correspond to  the  POSIX  regular
       expression  API.  These  are  described in the pcreposix documentation.
       Both of these APIs define a set of C function calls. A C++  wrapper  is
       distributed with PCRE. It is documented in the pcrecpp page.

       The  native  API  C  function prototypes are defined in the header file
       pcre.h, and on Unix systems the library itself is called  libpcre.   It
       can normally be accessed by adding -lpcre to the command for linking an
       application  that  uses  PCRE.  The  header  file  defines  the  macros
       PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
       bers for the library.  Applications can use these  to  include  support
       for different releases of PCRE.

       The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
       pcre_exec() are used for compiling and matching regular expressions  in
       a  Perl-compatible  manner. A sample program that demonstrates the sim-
       plest way of using them is provided in the file  called  pcredemo.c  in
       the  source distribution. The pcresample documentation describes how to
       compile and run it.

       A second matching function, pcre_dfa_exec(), which is not Perl-compati-
       ble,  is  also provided. This uses a different algorithm for the match-
       ing. The alternative algorithm finds all possible matches (at  a  given
       point  in  the subject), and scans the subject just once. However, this
       algorithm does not return captured substrings. A description of the two
       matching  algorithms and their advantages and disadvantages is given in
       the pcrematching documentation.

       In addition to the main compiling and  matching  functions,  there  are
       convenience functions for extracting captured substrings from a subject
       string that is matched by pcre_exec(). They are:

         pcre_copy_substring()
         pcre_copy_named_substring()
         pcre_get_substring()
         pcre_get_named_substring()
         pcre_get_substring_list()
         pcre_get_stringnumber()
         pcre_get_stringtable_entries()

       pcre_free_substring() and pcre_free_substring_list() are also provided,
       to free the memory used for extracted strings.

       The  function  pcre_maketables()  is  used  to build a set of character
       tables  in  the  current  locale   for   passing   to   pcre_compile(),
       pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
       provided for specialist use.  Most  commonly,  no  special  tables  are
       passed,  in  which case internal tables that are generated when PCRE is
       built are used.

       The function pcre_fullinfo() is used to find out  information  about  a
       compiled  pattern; pcre_info() is an obsolete version that returns only
       some of the available information, but is retained for  backwards  com-
       patibility.   The function pcre_version() returns a pointer to a string
       containing the version of PCRE and its date of release.

       The function pcre_refcount() maintains a  reference  count  in  a  data
       block  containing  a compiled pattern. This is provided for the benefit
       of object-oriented applications.

       The global variables pcre_malloc and pcre_free  initially  contain  the
       entry  points  of  the  standard malloc() and free() functions, respec-
       tively. PCRE calls the memory management functions via these variables,
       so  a  calling  program  can replace them if it wishes to intercept the
       calls. This should be done before calling any PCRE functions.

       The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
       indirections  to  memory  management functions. These special functions
       are used only when PCRE is compiled to use  the  heap  for  remembering
       data, instead of recursive function calls, when running the pcre_exec()
       function. See the pcrebuild documentation for  details  of  how  to  do
       this.  It  is  a non-standard way of building PCRE, for use in environ-
       ments that have limited stacks. Because of the greater  use  of  memory
       management,  it  runs  more  slowly. Separate functions are provided so
       that special-purpose external code can be  used  for  this  case.  When
       used,  these  functions  are always called in a stack-like manner (last
       obtained, first freed), and always for memory blocks of the same  size.
       There  is  a discussion about PCRE's stack usage in the pcrestack docu-
       mentation.
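
       The stack-like, same-size calling pattern described above is what
       makes a simple block cache workable as "special-purpose external
       code". The following is a sketch only. The names frame_alloc(),
       frame_free(), and CACHE_SLOTS are hypothetical. The code relies
       entirely on the same-size guarantee, because frame_free() cannot
       tell how large a block is, and it is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical block cache; none of these names come from PCRE. */
#define CACHE_SLOTS 32

static void  *cache[CACHE_SLOTS];  /* freed blocks kept for reuse       */
static size_t cache_size;          /* block size seen on the first call */
static int    cache_top;           /* number of blocks now in the cache */

/* Same signature as pcre_stack_malloc: void *(*)(size_t). */
void *frame_alloc(size_t size)
{
    if (cache_size == 0)
        cache_size = size;         /* remember the size first requested */
    if (size == cache_size && cache_top > 0)
        return cache[--cache_top]; /* reuse the most recently freed one */
    return malloc(size);
}

/* Same signature as pcre_stack_free: void (*)(void *). */
void frame_free(void *block)
{
    if (block == NULL)
        return;
    if (cache_top < CACHE_SLOTS)
        cache[cache_top++] = block; /* assumes block has cache_size bytes */
    else
        free(block);
}
```

       In a build of PCRE that uses the heap, such functions would be
       installed by assigning them to pcre_stack_malloc and
       pcre_stack_free before any PCRE function is called.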

       The global variable pcre_callout initially contains NULL. It can be set
       by  the  caller  to  a "callout" function, which PCRE will then call at
       specified points during a matching operation. Details are given in  the
       pcrecallout documentation.


NEWLINES

       PCRE  supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a  single  LF  (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline  sequences
       are  the  three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
       separator, U+2028), and PS (paragraph separator, U+2029).
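
       The five conventions can be expressed as a small recognizer. This
       is a plain C sketch, not PCRE code: the enum and the function
       newline_length() are hypothetical, and the subject is assumed to be
       a zero-terminated UTF-8 string, so that NEL, LS, and PS occupy two
       or three bytes.

```c
#include <stddef.h>

enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the length in bytes of the newline that starts at p[0], or 0
 * if there is none under the given convention. p must be terminated by
 * a zero byte. Hypothetical helper, not part of the PCRE API. */
size_t newline_length(const unsigned char *p, enum newline_convention conv)
{
    int cr = (p[0] == 0x0D);
    int lf = (p[0] == 0x0A);
    int crlf = cr && p[1] == 0x0A;

    switch (conv) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : (cr || lf) ? 1 : 0;
    case NL_ANY:
        if (crlf)
            return 2;
        if (cr || lf || p[0] == 0x0B || p[0] == 0x0C)
            return 1;                              /* CR, LF, VT, FF */
        if (p[0] == 0xC2 && p[1] == 0x85)
            return 2;                              /* NEL, U+0085    */
        if (p[0] == 0xE2 && p[1] == 0x80 &&
            (p[2] == 0xA8 || p[2] == 0xA9))
            return 3;                              /* LS and PS      */
        return 0;
    }
    return 0;
}
```

       Note that under the CRLF convention a lone CR or a lone LF is not a
       newline at all, whereas the "any of the three" convention accepts
       each of them as well as the pair.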

       Each  of  the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE is built, a  default
       can be specified; if none is given, the built-in default is LF, which
       is the Unix standard. When PCRE is run, the default can be overridden,
       either when a pattern is compiled, or when it is matched.

       At compile time, the newline convention can be specified by the options
       argument of pcre_compile(), or it can be specified by special  text  at
       the start of the pattern itself; this overrides any other settings. See
       the pcrepattern page for details of the special character sequences.

       In the PCRE documentation the word "newline" is used to mean "the char-
       acter  or pair of characters that indicate a line break". The choice of
       newline convention affects the handling of  the  dot,  circumflex,  and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position  advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre_exec() options below.

       The choice of newline convention does not affect the interpretation  of
       the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
       which is controlled in a similar way, but by separate options.


MULTITHREADING

       The PCRE functions can be used in  multi-threading  applications,  with
       the  proviso  that  the  memory  management  functions  pointed  to  by
       pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
       callout function pointed to by pcre_callout, are shared by all threads.

       The  compiled form of a regular expression is not altered during match-
       ing, so the same compiled pattern can safely be used by several threads
       at once.


SAVING PRECOMPILED PATTERNS FOR LATER USE

       The compiled form of a regular expression can be saved and re-used at a
       later time, possibly by a different program, and even on a  host  other
       than  the  one  on  which  it  was  compiled.  Details are given in the
       pcreprecompile documentation. However, compiling a  regular  expression
       with  one version of PCRE for use with a different version is not guar-
       anteed to work and may cause crashes.


CHECKING BUILD-TIME OPTIONS

       int pcre_config(int what, void *where);

       The function pcre_config() makes it possible for a PCRE client to  dis-
       cover which optional features have been compiled into the PCRE library.
       The pcrebuild documentation has more details about these optional  fea-
       tures.

       The  first  argument  for pcre_config() is an integer, specifying which
       information is required; the second argument is a pointer to a variable
       into  which  the  information  is  placed. The following information is
       available:

         PCRE_CONFIG_UTF8

       The output is an integer that is set to one if UTF-8 support is  avail-
       able; otherwise it is set to zero.

         PCRE_CONFIG_UNICODE_PROPERTIES

       The  output  is  an  integer  that is set to one if support for Unicode
       character properties is available; otherwise it is set to zero.

         PCRE_CONFIG_NEWLINE

       The output is an integer whose value specifies  the  default  character
       sequence  that is recognized as meaning "newline". The five values that
       are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
       and  -1  for  ANY.  Though they are derived from ASCII, the same values
       are returned in EBCDIC environments. The default should normally corre-
       spond to the standard sequence for your operating system.

         PCRE_CONFIG_BSR

       The output is an integer whose value indicates what character sequences
       the \R escape sequence matches by default. A value of 0 means  that  \R
       matches  any  Unicode  line ending sequence; a value of 1 means that \R
       matches only CR, LF, or CRLF. The default can be overridden when a pat-
       tern is compiled or matched.

         PCRE_CONFIG_LINK_SIZE

       The  output  is  an  integer that contains the number of bytes used for
       internal linkage in compiled regular expressions. The value is 2, 3, or
       4.  Larger  values  allow larger regular expressions to be compiled, at
       the expense of slower matching. The default value of  2  is  sufficient
       for  all  but  the  most massive patterns, since it allows the compiled
       pattern to be up to 64K in size.

         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

       The output is an integer that contains the threshold  above  which  the
       POSIX  interface  uses malloc() for output vectors. Further details are
       given in the pcreposix documentation.

         PCRE_CONFIG_MATCH_LIMIT

       The output is a long integer that gives the default limit for the  num-
       ber  of  internal  matching  function calls in a pcre_exec() execution.
       Further details are given with pcre_exec() below.

         PCRE_CONFIG_MATCH_LIMIT_RECURSION

       The output is a long integer that gives the default limit for the depth
       of   recursion  when  calling  the  internal  matching  function  in  a
       pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
       below.

         PCRE_CONFIG_STACKRECURSE

       The  output is an integer that is set to one if internal recursion when
       running pcre_exec() is implemented by recursive function calls that use
       the  stack  to remember their state. This is the usual way that PCRE is
       compiled. The output is zero if PCRE was compiled to use blocks of data
       on  the  heap  instead  of  recursive  function  calls.  In  this case,
       pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
       blocks on the heap, thus avoiding the use of the stack.
 | 
						|
 | 
						|
 | 
						|
COMPILING A PATTERN
 | 
						|
 | 
						|
       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre *pcre_compile2(const char *pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);
 | 
						|
 | 
						|
       Either of the functions pcre_compile() or pcre_compile2() can be called
 | 
						|
       to compile a pattern into an internal form. The only difference between
 | 
						|
       the  two interfaces is that pcre_compile2() has an additional argument,
 | 
						|
       errorcodeptr, via which a numerical error code can be returned.
 | 
						|
 | 
						|
       The pattern is a C string terminated by a binary zero, and is passed in
 | 
						|
       the  pattern  argument.  A  pointer to a single block of memory that is
 | 
						|
       obtained via pcre_malloc is returned. This contains the  compiled  code
 | 
						|
       and related data. The pcre type is defined for the returned block; this
 | 
						|
       is a typedef for a structure whose contents are not externally defined.
 | 
						|
       It is up to the caller to free the memory (via pcre_free) when it is no
 | 
						|
       longer required.
 | 
						|
 | 
						|
       Although the compiled code of a PCRE regex is relocatable, that is,  it
 | 
						|
       does not depend on memory location, the complete pcre data block is not
 | 
						|
       fully relocatable, because it may contain a copy of the tableptr  argu-
 | 
						|
       ment, which is an address (see below).
 | 
						|
 | 
						|
       The options argument contains various bit settings that affect the com-
 | 
						|
       pilation. It should be zero if no options are required.  The  available
 | 
						|
       options  are  described  below. Some of them (in particular, those that
 | 
						|
       are compatible with Perl, but also some others) can  also  be  set  and
 | 
						|
       unset  from  within  the  pattern  (see the detailed description in the
 | 
						|
       pcrepattern documentation). For those options that can be different  in
 | 
						|
       different  parts  of  the pattern, the contents of the options argument
 | 
						|
       specifies their initial settings at the start of compilation and execu-
 | 
						|
       tion.  The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
 | 
						|
       time of matching as well as at compile time.
 | 
						|
 | 
						|
       If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
 | 
						|
       if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
 | 
						|
       sets the variable pointed to by errptr to point to a textual error mes-
 | 
						|
       sage. This is a static string that is part of the library. You must not
 | 
						|
       try to free it. The offset from the start of the pattern to the charac-
 | 
						|
       ter where the error was discovered is placed in the variable pointed to
 | 
						|
       by erroffset, which must not be NULL. If it is, an immediate  error  is
 | 
						|
       given.
 | 
						|
 | 
						|
       If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
 | 
						|
       codeptr argument is not NULL, a non-zero error code number is  returned
 | 
						|
       via  this argument in the event of an error. This is in addition to the
 | 
						|
       textual error message. Error codes and messages are listed below.
 | 
						|
 | 
						|
       If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
 | 
						|
       character  tables  that  are  built  when  PCRE  is compiled, using the
 | 
						|
       default C locale. Otherwise, tableptr must be an address  that  is  the
 | 
						|
       result  of  a  call to pcre_maketables(). This value is stored with the
 | 
						|
       compiled pattern, and used again by pcre_exec(), unless  another  table
 | 
						|
       pointer is passed to it. For more discussion, see the section on locale
 | 
						|
       support below.
 | 
						|
 | 
						|
       This code fragment shows a typical straightforward  call  to  pcre_com-
 | 
						|
       pile():
 | 
						|
 | 
						|
         pcre *re;
         const char *error;
         int erroffset;
         re = pcre_compile(
           "^A.*Z",          /* the pattern */
           0,                /* default options */
           &error,           /* for error message */
           &erroffset,       /* for error offset */
           NULL);            /* use default character tables */
 | 
						|
 | 
						|
       The  following  names  for option bits are defined in the pcre.h header
 | 
						|
       file:
 | 
						|
 | 
						|
         PCRE_ANCHORED
 | 
						|
 | 
						|
       If this bit is set, the pattern is forced to be "anchored", that is, it
 | 
						|
       is  constrained to match only at the first matching point in the string
 | 
						|
       that is being searched (the "subject string"). This effect can also  be
 | 
						|
       achieved  by appropriate constructs in the pattern itself, which is the
 | 
						|
       only way to do it in Perl.
 | 
						|
 | 
						|
         PCRE_AUTO_CALLOUT
 | 
						|
 | 
						|
       If this bit is set, pcre_compile() automatically inserts callout items,
 | 
						|
       all  with  number  255, before each pattern item. For discussion of the
 | 
						|
       callout facility, see the pcrecallout documentation.
 | 
						|
 | 
						|
         PCRE_BSR_ANYCRLF
         PCRE_BSR_UNICODE
 | 
						|
 | 
						|
       These options (which are mutually exclusive) control what the \R escape
 | 
						|
       sequence  matches.  The choice is either to match only CR, LF, or CRLF,
 | 
						|
       or to match any Unicode newline sequence. The default is specified when
 | 
						|
       PCRE is built. It can be overridden from within the pattern, or by set-
 | 
						|
       ting an option when a compiled pattern is matched.
 | 
						|
 | 
						|
         PCRE_CASELESS
 | 
						|
 | 
						|
       If this bit is set, letters in the pattern match both upper  and  lower
 | 
						|
       case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
 | 
						|
       changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
 | 
						|
       always  understands the concept of case for characters whose values are
 | 
						|
       less than 128, so caseless matching is always possible. For  characters
 | 
						|
       with  higher  values,  the concept of case is supported if PCRE is com-
 | 
						|
       piled with Unicode property support, but not otherwise. If you want  to
 | 
						|
       use  caseless  matching  for  characters 128 and above, you must ensure
 | 
						|
       that PCRE is compiled with Unicode property support  as  well  as  with
 | 
						|
       UTF-8 support.
 | 
						|
 | 
						|
         PCRE_DOLLAR_ENDONLY
 | 
						|
 | 
						|
       If  this bit is set, a dollar metacharacter in the pattern matches only
 | 
						|
       at the end of the subject string. Without this option,  a  dollar  also
 | 
						|
       matches  immediately before a newline at the end of the string (but not
 | 
						|
       before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
 | 
						|
       if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
 | 
						|
       Perl, and no way to set it within a pattern.
 | 
						|
 | 
						|
         PCRE_DOTALL
 | 
						|
 | 
						|
       If this bit is set, a dot metacharacter in the pattern matches all
       characters, including those that indicate newline. Without it, a dot
       does not match when the current position is at a newline. This option
       is equivalent to Perl's /s option, and it can be changed within a
       pattern by a (?s) option setting. A negative class such as [^a] always
       matches newline characters, independent of the setting of this option.
 | 
						|
 | 
						|
         PCRE_DUPNAMES
 | 
						|
 | 
						|
       If  this  bit is set, names used to identify capturing subpatterns need
 | 
						|
       not be unique. This can be helpful for certain types of pattern when it
 | 
						|
       is  known  that  only  one instance of the named subpattern can ever be
 | 
						|
       matched. There are more details of named subpatterns  below;  see  also
 | 
						|
       the pcrepattern documentation.
 | 
						|
 | 
						|
         PCRE_EXTENDED
 | 
						|
 | 
						|
       If  this  bit  is  set,  whitespace  data characters in the pattern are
 | 
						|
       totally ignored except when escaped or inside a character class. White-
 | 
						|
       space does not include the VT character (code 11). In addition, charac-
 | 
						|
       ters between an unescaped # outside a character class and the next new-
 | 
						|
       line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
 | 
						|
       option, and it can be changed within a pattern by a  (?x)  option  set-
 | 
						|
       ting.
 | 
						|
 | 
						|
       This  option  makes  it possible to include comments inside complicated
 | 
						|
       patterns.  Note, however, that this applies only  to  data  characters.
 | 
						|
       Whitespace   characters  may  never  appear  within  special  character
 | 
						|
       sequences in a pattern, for  example  within  the  sequence  (?(  which
 | 
						|
       introduces a conditional subpattern.
 | 
						|
 | 
						|
         PCRE_EXTRA
 | 
						|
 | 
						|
       This  option  was invented in order to turn on additional functionality
 | 
						|
       of PCRE that is incompatible with Perl, but it  is  currently  of  very
 | 
						|
       little  use. When set, any backslash in a pattern that is followed by a
 | 
						|
       letter that has no special meaning  causes  an  error,  thus  reserving
 | 
						|
       these  combinations  for  future  expansion.  By default, as in Perl, a
 | 
						|
       backslash followed by a letter with no special meaning is treated as  a
 | 
						|
       literal.  (Perl can, however, be persuaded to give a warning for this.)
 | 
						|
       There are at present no other features controlled by  this  option.  It
 | 
						|
       can also be set by a (?X) option setting within a pattern.
 | 
						|
 | 
						|
         PCRE_FIRSTLINE
 | 
						|
 | 
						|
       If  this  option  is  set,  an  unanchored pattern is required to match
 | 
						|
       before or at the first  newline  in  the  subject  string,  though  the
 | 
						|
       matched text may continue over the newline.
 | 
						|
 | 
						|
         PCRE_JAVASCRIPT_COMPAT
 | 
						|
 | 
						|
       If this option is set, PCRE's behaviour is changed in some ways so that
 | 
						|
       it is compatible with JavaScript rather than Perl. The changes  are  as
 | 
						|
       follows:
 | 
						|
 | 
						|
       (1)  A  lone  closing square bracket in a pattern causes a compile-time
 | 
						|
       error, because this is illegal in JavaScript (by default it is  treated
 | 
						|
       as a data character). Thus, the pattern AB]CD becomes illegal when this
 | 
						|
       option is set.
 | 
						|
 | 
						|
       (2) At run time, a back reference to an unset subpattern group  matches
 | 
						|
       an  empty  string (by default this causes the current matching alterna-
 | 
						|
       tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
 | 
						|
       set  (assuming  it can find an "a" in the subject), whereas it fails by
 | 
						|
       default, for Perl compatibility.
 | 
						|
 | 
						|
         PCRE_MULTILINE
 | 
						|
 | 
						|
       By default, PCRE treats the subject string as consisting  of  a  single
 | 
						|
       line  of characters (even if it actually contains newlines). The "start
 | 
						|
       of line" metacharacter (^) matches only at the  start  of  the  string,
 | 
						|
       while  the  "end  of line" metacharacter ($) matches only at the end of
 | 
						|
       the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
 | 
						|
       is set). This is the same as Perl.
 | 
						|
 | 
						|
       When PCRE_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. If there are no
       newlines in a subject string, or no occurrences of ^ or $ in a pattern,
       setting PCRE_MULTILINE has no effect.
 | 
						|
 | 
						|
         PCRE_NEWLINE_CR
         PCRE_NEWLINE_LF
         PCRE_NEWLINE_CRLF
         PCRE_NEWLINE_ANYCRLF
         PCRE_NEWLINE_ANY
 | 
						|
 | 
						|
       These  options  override the default newline definition that was chosen
 | 
						|
       when PCRE was built. Setting the first or the second specifies  that  a
 | 
						|
       newline  is  indicated  by a single character (CR or LF, respectively).
 | 
						|
       Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
 | 
						|
       two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
 | 
						|
       that any of the three preceding sequences should be recognized. Setting
 | 
						|
       PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
 | 
						|
       recognized. The Unicode newline sequences are the three just mentioned,
 | 
						|
       plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
 | 
						|
       U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
 | 
						|
       (paragraph  separator,  U+2029).  The  last  two are recognized only in
 | 
						|
       UTF-8 mode.
 | 
						|
 | 
						|
       The newline setting in the  options  word  uses  three  bits  that  are
 | 
						|
       treated as a number, giving eight possibilities. Currently only six are
 | 
						|
       used (default plus the five values above). This means that if  you  set
 | 
						|
       more  than one newline option, the combination may or may not be sensi-
 | 
						|
       ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
 | 
						|
       PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
 | 
						|
       cause an error.
 | 
						|
 | 
						|
       The only time that a line break is specially recognized when  compiling
 | 
						|
       a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
 | 
						|
       character class is encountered. This indicates  a  comment  that  lasts
 | 
						|
       until  after the next line break sequence. In other circumstances, line
 | 
						|
       break  sequences  are  treated  as  literal  data,   except   that   in
 | 
						|
       PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
 | 
						|
       and are therefore ignored.
 | 
						|
 | 
						|
       The newline option that is set at compile time becomes the default that
 | 
						|
       is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
 | 
						|
 | 
						|
         PCRE_NO_AUTO_CAPTURE
 | 
						|
 | 
						|
       If this option is set, it disables the use of numbered capturing paren-
 | 
						|
       theses in the pattern. Any opening parenthesis that is not followed  by
 | 
						|
       ?  behaves as if it were followed by ?: but named parentheses can still
 | 
						|
       be used for capturing (and they acquire  numbers  in  the  usual  way).
 | 
						|
       There is no equivalent of this option in Perl.
 | 
						|
 | 
						|
         PCRE_UNGREEDY
 | 
						|
 | 
						|
       This  option  inverts  the "greediness" of the quantifiers so that they
 | 
						|
       are not greedy by default, but become greedy if followed by "?". It  is
 | 
						|
       not  compatible  with Perl. It can also be set by a (?U) option setting
 | 
						|
       within the pattern.
 | 
						|
 | 
						|
         PCRE_UTF8
 | 
						|
 | 
						|
       This option causes PCRE to regard both the pattern and the  subject  as
 | 
						|
       strings  of  UTF-8 characters instead of single-byte character strings.
 | 
						|
       However, it is available only when PCRE is built to include UTF-8  sup-
 | 
						|
       port.  If not, the use of this option provokes an error. Details of how
 | 
						|
       this option changes the behaviour of PCRE are given in the  section  on
 | 
						|
       UTF-8 support in the main pcre page.
 | 
						|
 | 
						|
         PCRE_NO_UTF8_CHECK
 | 
						|
 | 
						|
       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
 | 
						|
       automatically checked. There is a  discussion  about  the  validity  of
 | 
						|
       UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
 | 
						|
       bytes is found, pcre_compile() returns an error. If  you  already  know
 | 
						|
       that your pattern is valid, and you want to skip this check for perfor-
 | 
						|
       mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
 | 
						|
       set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
 | 
						|
       undefined. It may cause your program to crash. Note  that  this  option
 | 
						|
       can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
 | 
						|
       UTF-8 validity checking of subject strings.
 | 
						|
 | 
						|
 | 
						|
COMPILATION ERROR CODES
 | 
						|
 | 
						|
       The following table lists the error codes that may be returned by
       pcre_compile2(), along with the error messages that may be returned by
       both compiling functions. As PCRE has developed, some error codes have
       fallen out of use. To avoid confusion, they have not been re-used.
 | 
						|
 | 
						|
          0  no error
          1  \ at end of pattern
          2  \c at end of pattern
          3  unrecognized character follows \
          4  numbers out of order in {} quantifier
          5  number too big in {} quantifier
          6  missing terminating ] for character class
          7  invalid escape sequence in character class
          8  range out of order in character class
          9  nothing to repeat
         10  [this code is not in use]
         11  internal error: unexpected repeat
         12  unrecognized character after (? or (?-
         13  POSIX named classes are supported only within a class
         14  missing )
         15  reference to non-existent subpattern
         16  erroffset passed as NULL
         17  unknown option bit(s) set
         18  missing ) after comment
         19  [this code is not in use]
         20  regular expression is too large
         21  failed to get memory
         22  unmatched parentheses
         23  internal error: code overflow
         24  unrecognized character after (?<
         25  lookbehind assertion is not fixed length
         26  malformed number or name after (?(
         27  conditional group contains more than two branches
         28  assertion expected after (?(
         29  (?R or (?[+-]digits must be followed by )
         30  unknown POSIX class name
         31  POSIX collating elements are not supported
         32  this version of PCRE is not compiled with PCRE_UTF8 support
         33  [this code is not in use]
         34  character value in \x{...} sequence is too large
         35  invalid condition (?(0)
         36  \C not allowed in lookbehind assertion
         37  PCRE does not support \L, \l, \N, \U, or \u
         38  number after (?C is > 255
         39  closing ) for (?C expected
         40  recursive call could loop indefinitely
         41  unrecognized character after (?P
         42  syntax error in subpattern name (missing terminator)
         43  two named subpatterns have the same name
         44  invalid UTF-8 string
         45  support for \P, \p, and \X has not been compiled
         46  malformed \P or \p sequence
         47  unknown property name after \P or \p
         48  subpattern name is too long (maximum 32 characters)
         49  too many named subpatterns (maximum 10000)
         50  [this code is not in use]
         51  octal value is greater than \377 (not in UTF-8 mode)
         52  internal error: overran compiling workspace
         53  internal error: previously-checked referenced subpattern not found
         54  DEFINE group contains more than one branch
         55  repeating a DEFINE group is not allowed
         56  inconsistent NEWLINE options
         57  \g is not followed by a braced, angle-bracketed, or quoted
               name/number or by a plain number
         58  a numbered reference must not be zero
         59  (*VERB) with an argument is not supported
         60  (*VERB) not recognized
         61  number is too big
         62  subpattern name expected
         63  digit expected after (?+
         64  ] is an invalid data character in JavaScript compatibility mode
 | 
						|
 | 
						|
       The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
 | 
						|
       values may be used if the limits were changed when PCRE was built.
 | 
						|
 | 
						|
 | 
						|
STUDYING A PATTERN
 | 
						|
 | 
						|
       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);
 | 
						|
 | 
						|
       If  a  compiled  pattern is going to be used several times, it is worth
 | 
						|
       spending more time analyzing it in order to speed up the time taken for
 | 
						|
       matching.  The function pcre_study() takes a pointer to a compiled pat-
 | 
						|
       tern as its first argument. If studying the pattern produces additional
 | 
						|
       information  that  will  help speed up matching, pcre_study() returns a
 | 
						|
       pointer to a pcre_extra block, in which the study_data field points  to
 | 
						|
       the results of the study.
 | 
						|
 | 
						|
       The  returned  value  from  pcre_study()  can  be  passed  directly  to
 | 
						|
       pcre_exec(). However, a pcre_extra block  also  contains  other  fields
 | 
						|
       that  can  be  set  by the caller before the block is passed; these are
 | 
						|
       described below in the section on matching a pattern.
 | 
						|
 | 
						|
       If studying the pattern does not produce any additional information,
       pcre_study() returns NULL. In that circumstance, if the calling program
       wants to pass any of the other fields to pcre_exec(), it must set up
       its own pcre_extra block.
 | 
						|
 | 
						|
       The  second  argument of pcre_study() contains option bits. At present,
 | 
						|
       no options are defined, and this argument should always be zero.
 | 
						|
 | 
						|
       The third argument for pcre_study() is a pointer for an error  message.
 | 
						|
       If  studying  succeeds  (even  if no data is returned), the variable it
 | 
						|
       points to is set to NULL. Otherwise it is set to  point  to  a  textual
 | 
						|
       error message. This is a static string that is part of the library. You
 | 
						|
       must not try to free it. You should test the  error  pointer  for  NULL
 | 
						|
       after calling pcre_study(), to be sure that it has run successfully.
 | 
						|
 | 
						|
       This is a typical call to pcre_study():
 | 
						|
 | 
						|
         pcre_extra *pe;
         pe = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options exist */
           &error);        /* set to NULL or points to a message */
 | 
						|
 | 
						|
       At present, studying a pattern is useful only for non-anchored patterns
 | 
						|
       that do not have a single fixed starting character. A bitmap of  possi-
 | 
						|
       ble starting bytes is created.
 | 
						|
 | 
						|
 | 
						|
LOCALE SUPPORT
 | 
						|
 | 
						|
       PCRE  handles  caseless matching, and determines whether characters are
 | 
						|
       letters, digits, or whatever, by reference to a set of tables,  indexed
 | 
						|
       by  character  value.  When running in UTF-8 mode, this applies only to
 | 
						|
       characters with codes less than 128. Higher-valued  codes  never  match
 | 
						|
       escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
 | 
						|
       with Unicode character property support. The use of locales  with  Uni-
 | 
						|
       code  is discouraged. If you are handling characters with codes greater
 | 
						|
       than 128, you should either use UTF-8 and Unicode, or use locales,  but
 | 
						|
       not try to mix the two.
 | 
						|
 | 
						|
       PCRE  contains  an  internal set of tables that are used when the final
 | 
						|
       argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
 | 
						|
       applications.  Normally, the internal tables recognize only ASCII char-
 | 
						|
       acters. However, when PCRE is built, it is possible to cause the inter-
 | 
						|
       nal tables to be rebuilt in the default "C" locale of the local system,
 | 
						|
       which may cause them to be different.
 | 
						|
 | 
						|
       The internal tables can always be overridden by tables supplied by  the
 | 
						|
       application that calls PCRE. These may be created in a different locale
 | 
						|
       from the default. As more and more applications change  to  using  Uni-
 | 
						|
       code, the need for this locale support is expected to die away.
 | 
						|
 | 
						|
       External  tables  are  built by calling the pcre_maketables() function,
 | 
						|
       which has no arguments, in the relevant locale. The result can then  be
 | 
						|
       passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
 | 
						|
       example, to build and use tables that are appropriate  for  the  French
 | 
						|
       locale  (where  accented  characters  with  values greater than 128 are
 | 
						|
       treated as letters), the following code could be used:
 | 
						|
 | 
						|
         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre_maketables();
         re = pcre_compile(..., tables);
 | 
						|
 | 
						|
       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
 | 
						|
       if you are using Windows, the name for the French locale is "french".
 | 
						|
 | 
						|
       When  pcre_maketables()  runs,  the  tables are built in memory that is
 | 
						|
       obtained via pcre_malloc. It is the caller's responsibility  to  ensure
 | 
						|
       that  the memory containing the tables remains available for as long as
 | 
						|
       it is needed.
 | 
						|
 | 
						|
       The pointer that is passed to pcre_compile() is saved with the compiled
 | 
						|
       pattern,  and the same tables are used via this pointer by pcre_study()
 | 
						|
       and normally also by pcre_exec(). Thus, by default, for any single pat-
 | 
						|
       tern, compilation, studying and matching all happen in the same locale,
 | 
						|
       but different patterns can be compiled in different locales.
 | 
						|
 | 
						|
       It is possible to pass a table pointer or NULL (indicating the  use  of
 | 
						|
       the  internal  tables)  to  pcre_exec(). Although not intended for this
 | 
						|
       purpose, this facility could be used to match a pattern in a  different
 | 
						|
       locale from the one in which it was compiled. Passing table pointers at
 | 
						|
       run time is discussed below in the section on matching a pattern.
 | 
						|
 | 
						|
 | 
						|
INFORMATION ABOUT A PATTERN
 | 
						|
 | 
						|
       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);
 | 
						|
 | 
						|
       The pcre_fullinfo() function returns information about a compiled
       pattern. It replaces the obsolete pcre_info() function, which is
       nevertheless retained for backwards compatibility (and is documented
       below).
 | 
						|
 | 
						|
       The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
 | 
						|
       pattern.  The second argument is the result of pcre_study(), or NULL if
 | 
						|
       the pattern was not studied. The third argument specifies  which  piece
 | 
						|
       of  information  is required, and the fourth argument is a pointer to a
 | 
						|
       variable to receive the data. The yield of the  function  is  zero  for
 | 
						|
       success, or one of the following negative numbers:
 | 
						|
 | 
						|
         PCRE_ERROR_NULL       the argument code was NULL
                               the argument where was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found
         PCRE_ERROR_BADOPTION  the value of what was invalid
 | 
						|
 | 
						|
       The  "magic  number" is placed at the start of each compiled pattern as
 | 
						|
       a simple check against passing an arbitrary memory pointer. Here is a
 | 
						|
       typical  call  of pcre_fullinfo(), to obtain the length of the compiled
 | 
						|
       pattern:
 | 
						|
 | 
						|
         int rc;
         size_t length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           pe,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */
 | 
						|
 | 
						|
       The possible values for the third argument are defined in  pcre.h,  and
 | 
						|
       are as follows:
 | 
						|
 | 
						|
         PCRE_INFO_BACKREFMAX

       Return  the  number  of  the highest back reference in the pattern. The
       fourth argument should point to an int variable. Zero  is  returned  if
       there are no back references.

         PCRE_INFO_CAPTURECOUNT

       Return  the  number of capturing subpatterns in the pattern. The fourth
       argument should point to an int variable.

         PCRE_INFO_DEFAULT_TABLES

       Return a pointer to the internal default character tables within  PCRE.
       The  fourth  argument should point to an unsigned char * variable. This
       information call is provided for internal use by the pcre_study() func-
       tion.  External  callers  can  cause PCRE to use its internal tables by
       passing a NULL table pointer.

         PCRE_INFO_FIRSTBYTE

       Return information about the first byte of any matched  string,  for  a
       non-anchored  pattern. The fourth argument should point to an int vari-
       able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
       is still recognized for backwards compatibility.)

       If  there  is  a  fixed first byte, for example, from a pattern such as
       (cat|cow|coyote), its value is returned. Otherwise, if either

       (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
       branch starts with "^", or

       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
       set (if it were set, the pattern would be anchored),

       -1 is returned, indicating that the pattern matches only at  the  start
       of  a  subject string or after any newline within the string. Otherwise
       -2 is returned. For anchored patterns, -2 is returned.

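       As an illustration only, the three kinds of  result  described  above
       can  be  told  apart as in the following sketch. The enum and function
       names are hypothetical and are not part of the PCRE API; the  function
       merely  classifies  an  int  value  that  has already been obtained by
       calling pcre_fullinfo() with PCRE_INFO_FIRSTBYTE.

```c
#include <assert.h>

/* Hypothetical names for illustration; not part of the PCRE API. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: a fixed first byte was recorded */
    FIRST_BYTE_LINE_START,  /* -1: matches only at the start of the subject
                               or after a newline within it */
    FIRST_BYTE_UNKNOWN      /* -2: nothing recorded, or pattern is anchored */
};

/* Classify a value returned through the fourth argument of
   pcre_fullinfo() for PCRE_INFO_FIRSTBYTE. */
enum first_byte_kind classify_first_byte(int value)
{
    assert(value >= -2);    /* the text above describes no smaller value */
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_UNKNOWN;
}
```

       For the pattern (cat|cow|coyote) mentioned above, the  value  obtained
       would  be  expected to fall in the FIRST_BYTE_FIXED case, because every
       alternative starts with "c".
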
         PCRE_INFO_FIRSTTABLE

       If the pattern was studied, and this resulted in the construction of  a
       256-bit table indicating a fixed set of bytes for the first byte in any
       matching string, a pointer to the table is returned. Otherwise NULL  is
       returned.  The fourth argument should point to an unsigned char * vari-
       able.

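       A 256-bit table occupies 32 bytes, one bit per possible byte value. The
       sketch below shows one way of testing a bit in such a  table.  The  bit
       layout  is  an  assumption  that is not stated in this document: byte c
       is taken to be represented by bit (c & 7) of table byte  (c  /  8).  The
       function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if byte value c is marked in a 32-byte (256-bit) table,
   otherwise 0. ASSUMPTION: bit (c & 7) of table[c / 8] stands for c. */
int first_table_allows(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] >> (c & 7)) & 1;
}
```

       A caller should check that the pointer obtained with PCRE_INFO_FIRSTTA-
       BLE is not NULL before using it in this way, because NULL  is  returned
       when no table was constructed.
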
         PCRE_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
       characters,  otherwise  0.  The  fourth argument should point to an int
       variable. An explicit match is either a literal CR or LF character,  or
       \r or \n.

         PCRE_INFO_JCHANGED

       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The fourth argument should point to an int variable.  (?J)
       and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

         PCRE_INFO_LASTLITERAL

       Return  the  value of the rightmost literal byte that must exist in any
       matched string, other than at its  start,  if  such  a  byte  has  been
       recorded. The fourth argument should point to an int variable. If there
       is no such byte, -1 is returned. For anchored patterns, a last  literal
       byte  is  recorded only if it follows something of variable length. For
       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
       /^a\dz\d/ the returned value is -1.

         PCRE_INFO_NAMECOUNT
         PCRE_INFO_NAMEENTRYSIZE
         PCRE_INFO_NAMETABLE

       PCRE  supports the use of named as well as numbered capturing parenthe-
       ses. The names are just an additional way of identifying the  parenthe-
       ses, which still acquire numbers. Several convenience functions such as
       pcre_get_named_substring() are provided for  extracting  captured  sub-
       strings  by  name. It is also possible to extract the data directly, by
       first converting the name to a number in order to  access  the  correct
       pointers in the output vector (described with pcre_exec() below). To do
       the conversion, you need  to  use  the  name-to-number  map,  which  is
       described by these three values.

       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
       of  each  entry;  both  of  these  return  an int value. The entry size
       depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
       a  pointer  to  the  first  entry of the table (a pointer to char). The
       first two bytes of each entry are the number of the capturing parenthe-
       sis,  most  significant byte first. The rest of the entry is the corre-
       sponding name, zero terminated. The names are  in  alphabetical  order.
       When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
       theses numbers. For example, consider  the  following  pattern  (assume
       PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
       ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named subpatterns, so the table has  four  entries,  and
       each  entry  in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When  writing  code  to  extract  data from named subpatterns using the
       name-to-number map, remember that the length of the entries  is  likely
       to be different for each compiled pattern.

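       The entry layout described above can be walked as in the following min-
       imal  sketch.  The  function  name_to_number() and the array sample_ta-
       ble[] are hypothetical names introduced  for  illustration.  The  array
       reproduces  the  example  table,  with the undefined (??) bytes arbi-
       trarily set to zero. In a real program the  table  pointer,  the  entry
       count,  and  the  entry  size  would  come  from  pcre_fullinfo()  with
       PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <string.h>

/* The example table from the text: four entries of eight bytes each.
   Undefined (??) bytes are set to zero here purely for illustration. */
static const char sample_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Hypothetical helper: scan the name-to-number map for name and return
   the number of its capturing parenthesis, or -1 if it is not present.
   With duplicate names, the first (lowest-numbered) entry is returned. */
int name_to_number(const char *table, int count, int entry_size,
                   const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        /* Bytes 0-1: group number, most significant byte first.
           Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

       A linear scan is used for simplicity; because the names  are  held  in
       alphabetical  order,  a binary search would also be possible. The entry
       size is passed as a parameter rather than fixed, since it differs  from
       one  compiled  pattern  to  another.  The library's own pcre_get_string-
       number() function performs this conversion for you.
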
         PCRE_INFO_OKPARTIAL

       Return  1 if the pattern can be used for partial matching, otherwise 0.
       The fourth argument should point to an int  variable.  The  pcrepartial
       documentation  lists  the restrictions that apply to patterns when par-
       tial matching is used.

         PCRE_INFO_OPTIONS

       Return a copy of the options with which the pattern was  compiled.  The
       fourth  argument  should  point to an unsigned long int variable. These
       option bits are those specified in the call to pcre_compile(), modified
       by any top-level option settings at the start of the pattern itself. In
       other words, they are the options that will be in force  when  matching
       starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
       the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
       and PCRE_EXTENDED.

       A  pattern  is  automatically  anchored by PCRE if all of its top-level
       alternatives begin with one of the following:

         ^     unless PCRE_MULTILINE is set
         \A    always
         \G    always
         .*    if PCRE_DOTALL is set and there are no back
                 references to the subpattern in which .* appears

       For such patterns, the PCRE_ANCHORED bit is set in the options returned
       by pcre_fullinfo().

         PCRE_INFO_SIZE

       Return  the  size  of the compiled pattern, that is, the value that was
       passed as the argument to pcre_malloc() when PCRE was getting memory in
       which to place the compiled data. The fourth argument should point to a
       size_t variable.

         PCRE_INFO_STUDYSIZE

       Return the size of the data block pointed to by the study_data field in
       a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
       pcre_malloc() when PCRE was getting memory into which to place the data
       created  by  pcre_study(). The fourth argument should point to a size_t
       variable.


OBSOLETE INFO FUNCTION

       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

       The pcre_info() function is now obsolete because its interface  is  too
       restrictive  to return all the available data about a compiled pattern.
       New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
       pcre_info()  is the number of capturing subpatterns, or one of the fol-
       lowing negative numbers:

         PCRE_ERROR_NULL       the argument code was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found

       If the optptr argument is not NULL, a copy of the  options  with  which
       the  pattern  was  compiled  is placed in the integer it points to (see
       PCRE_INFO_OPTIONS above).

       If the pattern is not anchored and the  firstcharptr  argument  is  not
       NULL,  it is used to pass back information about the first character of
       any matched string (see PCRE_INFO_FIRSTBYTE above).


REFERENCE COUNTS

       int pcre_refcount(pcre *code, int adjust);

       The pcre_refcount() function is used to maintain a reference  count  in
       the data block that contains a compiled pattern. It is provided for the
       benefit of applications that  operate  in  an  object-oriented  manner,
       where different parts of the application may be using the same compiled
       pattern, but you want to free the block when they are all done.

       When a pattern is compiled, the reference count field is initialized to
       zero.   It is changed only by calling this function, whose action is to
       add the adjust value (which may be positive or  negative)  to  it.  The
       yield of the function is the new value. However, the value of the count
       is constrained to lie between 0 and 65535, inclusive. If the new  value
       is outside these limits, it is forced to the appropriate limit value.

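       The arithmetic described above can be modelled  as  follows.  This  is
       only  a  sketch  of  the  documented behaviour, not PCRE's own code; the
       function name adjusted_refcount() is hypothetical, and the current count
       is passed in explicitly instead of being read from a compiled pattern.

```c
#include <assert.h>

/* Hypothetical model of the documented rule: add adjust to the current
   count and force the result into the range 0..65535. The comparisons
   are arranged so that no intermediate sum can overflow an int. */
int adjusted_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    if (adjust > 65535 - current)
        return 65535;           /* would exceed the upper limit */
    if (adjust < -current)
        return 0;               /* would fall below zero */
    return current + adjust;
}
```

       One consequence of the lower limit is that a surplus decrement is  not
       remembered:  in  this model, taking a count of 1 down by 5 yields 0, and
       a subsequent increment of 1 yields 1 again.
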
       Except  when it is zero, the reference count is not correctly preserved
       if a pattern is compiled on one host and then  transferred  to  a  host
       whose byte-order is different. (This seems a highly unlikely scenario.)


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       The  function pcre_exec() is called to match a subject string against a
       compiled pattern, which is passed in the code argument. If the  pattern
       has been studied, the result of the study should be passed in the extra
       argument. This function is the main matching facility of  the  library,
       and it operates in a Perl-like manner. For specialist use there is also
       an alternative matching function, which is described below in the  sec-
       tion about the pcre_dfa_exec() function.

       In  most applications, the pattern will have been compiled (and option-
       ally studied) in the same process that calls pcre_exec().  However,  it
       is possible to save compiled patterns and study data, and then use them
       later in different processes, possibly even on different hosts.  For  a
       discussion about this, see the pcreprecompile documentation.

       Here is an example of a simple call to pcre_exec():

         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector of integers for substring information */
           30);            /* number of elements (NOT size in bytes) */

   Extra data for pcre_exec()

       If  the  extra argument is not NULL, it must point to a pcre_extra data
       block. The pcre_study() function returns such a block (when it  doesn't
       return  NULL), but you can also create one for yourself, and pass addi-
       tional information in it. The pcre_extra block contains  the  following
       fields (not necessarily in this order):

         unsigned long int flags;
         void *study_data;
         unsigned long int match_limit;
         unsigned long int match_limit_recursion;
         void *callout_data;
         const unsigned char *tables;

       The  flags  field  is a bitmap that specifies which of the other fields
       are set. The flag bits are:

         PCRE_EXTRA_STUDY_DATA
         PCRE_EXTRA_MATCH_LIMIT
         PCRE_EXTRA_MATCH_LIMIT_RECURSION
         PCRE_EXTRA_CALLOUT_DATA
         PCRE_EXTRA_TABLES

       Other flag bits should be set to zero. The study_data field is  set  in
       the  pcre_extra  block  that is returned by pcre_study(), together with
       the appropriate flag bit. You should not set this yourself, but you may
       add  to  the  block by setting the other fields and their corresponding
       flag bits.

       The match_limit field provides a means of preventing PCRE from using up
       a  vast amount of resources when running patterns that are not going to
       match, but which have a very large number  of  possibilities  in  their
       search  trees.  The  classic  example  is  the  use of nested unlimited
       repeats.

       Internally, PCRE uses a function called match() which it calls  repeat-
       edly  (sometimes  recursively). The limit set by match_limit is imposed
       on the number of times this function is called during  a  match,  which
       has  the  effect  of  limiting the amount of backtracking that can take
       place. For patterns that are not anchored, the count restarts from zero
       for each position in the subject string.

       The  default  value  for  the  limit can be set when PCRE is built; the
       built-in default is 10 million, which handles all but the most  extreme
       cases.  You  can  override  the default by supplying pcre_exec() with a
       pcre_extra    block    in    which    match_limit    is    set,     and
       PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

       The match_limit_recursion field is similar to match_limit, but  instead
       of limiting the total number of times that match() is called, it limits
       the depth of recursion. The recursion depth is a  smaller  number  than
       the  total number of calls, because not all calls to match() are recur-
       sive.  This limit is of use only if it is set smaller than match_limit.

       Limiting the recursion depth limits the amount of  stack  that  can  be
       used, or, when PCRE has been compiled to use memory on the heap instead
       of the stack, the amount of heap memory that can be used.

       The default value for match_limit_recursion can be  set  when  PCRE  is
       built;  the  built-in  default  is  the  same  value as the default for
       match_limit. You can override the default by supplying pcre_exec() with
       a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
       PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
       limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.

       The  callout_data  field is used in conjunction with the "callout" fea-
       ture, which is described in the pcrecallout documentation.

       The tables field  is  used  to  pass  a  character  tables  pointer  to
       pcre_exec();  this overrides the value that is stored with the compiled
       pattern. A non-NULL value is stored with the compiled pattern  only  if
       custom  tables  were  supplied to pcre_compile() via its tableptr argu-
       ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
       PCRE's  internal  tables  to be used. This facility is helpful when re-
       using patterns that have been saved after compiling  with  an  external
       set  of  tables,  because  the  external tables might be at a different
       address when pcre_exec() is called. See the  pcreprecompile  documenta-
       tion for a discussion of saving compiled patterns for later use.

   Option bits for pcre_exec()

       The  unused  bits of the options argument for pcre_exec() must be zero.
       The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
       PCRE_NOTBOL,    PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_START_OPTIMIZE,
       PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.

         PCRE_ANCHORED

       The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
       matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
       turned out to be anchored by virtue of its contents, it cannot be  made
       unanchored at matching time.

         PCRE_BSR_ANYCRLF
         PCRE_BSR_UNICODE

       These options (which are mutually exclusive) control what the \R escape
       sequence matches. The choice is either to match only CR, LF,  or  CRLF,
       or  to  match  any Unicode newline sequence. These options override the
       choice that was made or defaulted when the pattern was compiled.

         PCRE_NEWLINE_CR
         PCRE_NEWLINE_LF
         PCRE_NEWLINE_CRLF
         PCRE_NEWLINE_ANYCRLF
         PCRE_NEWLINE_ANY

       These options override  the  newline  definition  that  was  chosen  or
       defaulted  when the pattern was compiled. For details, see the descrip-
       tion of pcre_compile()  above.  During  matching,  the  newline  choice
       affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
       ters. It may also alter the way the match position is advanced after  a
       match failure for an unanchored pattern.

       When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
       set, and a match attempt for an unanchored pattern fails when the  cur-
       rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
       explicit matches for  CR  or  LF  characters,  the  match  position  is
       advanced by two characters instead of one, in other words, to after the
       CRLF.

       The above rule is a compromise that makes the most common cases work as
       expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
       option is not set), it does not match the string "\r\nA" because, after
       failing  at the start, it skips both the CR and the LF before retrying.
       However, the pattern [\r\n]A does match that string,  because  it  con-
       tains an explicit CR or LF reference, and so advances only by one char-
       acter after the first failure.

       An explicit match for CR or LF is either a literal appearance of one of
       those  characters,  or  one  of the \r or \n escape sequences. Implicit
       matches such as [^X] do not count, nor does \s (which includes  CR  and
       LF in the characters that it matches).

       Notwithstanding  the above, anomalous effects may still occur when CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in the
       pattern.

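       The rule for advancing the match position after  a  failure  can  be
       modelled  as in the sketch below. It is an illustration of the rule as
       described, not PCRE's own code. The function and parameter  names  are
       hypothetical:  crlf_is_newline  stands  for  "PCRE_NEWLINE_CRLF,
       PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is in force",  and  has_ex-
       plicit_crlf stands for the value that PCRE_INFO_HASCRORLF reports. The
       model assumes single-byte characters (that is, not UTF-8 mode).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: how far the start position moves after a failed
   match attempt at subject[pos] for an unanchored pattern. */
size_t advance_after_failure(const char *subject, size_t length, size_t pos,
                             int crlf_is_newline, int has_explicit_crlf)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && !has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return 2;   /* step over the whole CRLF sequence */
    return 1;       /* the ordinary case: move on by one character */
}
```

       With  the  subject "\r\nA", the model gives an advance of 2 for .+A (no
       explicit CR or LF in the pattern) and an advance of 1 for [\r\n]A, which
       corresponds to the two cases discussed above.
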
         PCRE_NOTBOL

       This option specifies that the first character of the subject string is
       not  the  beginning  of a line, so the circumflex metacharacter should
       not match before it. Setting this without  PCRE_MULTILINE  (at  compile
       time)  causes  circumflex never to match. This option affects only the
       behaviour of the circumflex metacharacter. It does not affect \A.

         PCRE_NOTEOL

       This option specifies that the end of the subject string is not the end
       of a line, so the dollar metacharacter should not match it nor  (except
       in  multiline mode) a newline immediately before it. Setting this with-
       out PCRE_MULTILINE (at compile time) causes dollar never to match. This
       option  affects only the behaviour of the dollar metacharacter. It does
       not affect \Z or \z.

         PCRE_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set.  If  there are alternatives in the pattern, they are tried. If all
       the alternatives match the empty string, the entire  match  fails.  For
       example, if the pattern

         a?b?

       is  applied  to  a string not beginning with "a" or "b", it matches the
       empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
       match is not valid, so PCRE searches further into the string for occur-
       rences of "a" or "b".

       Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
       cial  case  of  a  pattern match of the empty string within its split()
       function, and when using the /g modifier. It  is  possible  to  emulate
       Perl's behaviour after matching a null string by first trying the match
       again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
       if  that  fails by advancing the starting offset (see below) and trying
       an ordinary match again. There is some code that demonstrates how to do
       this in the pcredemo.c sample program.

         PCRE_NO_START_OPTIMIZE

       There  are a number of optimizations that pcre_exec() uses at the start
       of a match, in order to speed up the process. For  example,  if  it  is
       known  that  a  match must start with a specific character, it searches
       the subject for that character, and fails immediately if it cannot find
       it,  without actually running the main matching function. When callouts
       are in use, these optimizations can cause  them  to  be  skipped.  This
       option  disables  the  "start-up" optimizations, causing performance to
       suffer, but ensuring that the callouts do occur.

         PCRE_NO_UTF8_CHECK

       When PCRE_UTF8 is set at compile time, the validity of the subject as a
       UTF-8  string is automatically checked when pcre_exec() is subsequently
       called.  The value of startoffset is also checked  to  ensure  that  it
       points  to  the start of a UTF-8 character. There is a discussion about
       the validity of UTF-8 strings in the section on UTF-8  support  in  the
       main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
       pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-
       tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

       If  you  already  know that your subject is valid, and you want to skip
       these   checks   for   performance   reasons,   you   can    set    the
       PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
       do this for the second and subsequent calls to pcre_exec() if  you  are
       making  repeated  calls  to  find  all  the matches in a single subject
       string. However, you should be  sure  that  the  value  of  startoffset
       points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
       set, the effect of passing an invalid UTF-8 string as a subject,  or  a
       value  of startoffset that does not point to the start of a UTF-8 char-
       acter, is undefined. Your program may crash.

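       If  you  set  PCRE_NO_UTF8_CHECK but compute startoffset yourself, you
       may wish to confirm that the offset does not fall inside a  multi-byte
       character.  The  following  sketch  relies  on  a  property  of  UTF-8
       encoding: continuation bytes, and only those, have the bit pattern 10
       in  their  two  most significant bits. The function name is hypotheti-
       cal, and the check covers the offset only; it does  not  validate  the
       subject as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if offset is at the start of a character
   in a UTF-8 subject of the given length in bytes (or at its very end),
   and 0 if it lands on a continuation byte (bit pattern 10xxxxxx). */
int is_utf8_char_start(const char *subject, size_t length, size_t offset)
{
    assert(subject != NULL && offset <= length);
    if (offset == length)
        return 1;   /* the end of the subject is not inside a character */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

       For  example,  in the four-byte subject "a\xC3\xA9z" (the letter a, a
       two-byte character, and the letter z), offsets 0, 1, and 3 are accept-
       able starting points, but offset 2 is not.
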
         PCRE_PARTIAL

       This option turns on the  partial  matching  feature.  If  the  subject
       string  fails to match the pattern, but at some point during the match-
       ing process the end of the subject was reached (that  is,  the  subject
       partially  matches  the  pattern and the failure to match occurred only
       because there were not enough subject characters), pcre_exec()  returns
       PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
       used, there are restrictions on what may appear in the  pattern.  These
       are discussed in the pcrepartial documentation.

   The string to be matched by pcre_exec()

       The  subject string is passed to pcre_exec() as a pointer in subject, a
       length (in bytes) in length, and a starting byte offset in startoffset.
       In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
       acter. Unlike the pattern string, the subject may contain  binary  zero
       bytes.  When the starting offset is zero, the search for a match starts
       at the beginning of the subject, and this is by  far  the  most  common
       case.

       A  non-zero  starting offset is useful when searching for another match
       in the same subject by calling pcre_exec() again after a previous  suc-
       cess.   Setting  startoffset differs from just passing over a shortened
       string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which  finds  occurrences  of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not  a  word  boundary.)
       When  applied  to the string "Mississippi" the first call to pcre_exec()
       finds the first occurrence. If pcre_exec() is called  again  with  just
       the  remainder  of  the  subject,  namely  "issippi", it does not match,
       because \B is always false at the start of the subject, which is deemed
       to  be  a  word  boundary. However, if pcre_exec() is passed the entire
       string again, but with startoffset set to 4, it finds the second occur-
       rence  of "iss" because it is able to look behind the starting point to
       discover that it is preceded by a letter.

       If a non-zero starting offset is passed when the pattern  is  anchored,
       one attempt to match at the given offset is made. This can only succeed
       if the pattern does not require the match to be at  the  start  of  the
       subject.

   How pcre_exec() returns captured substrings

       In  general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject  may  be  picked  out  by
       parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
       this is called "capturing" in what follows, and the  phrase  "capturing
       subpattern"  is  used for a fragment of a pattern that picks out a sub-
       string. PCRE supports several other kinds of  parenthesized  subpattern
       that do not cause substrings to be captured.

       Captured substrings are returned to the caller via a vector of integers
       whose address is passed in ovector. The number of elements in the  vec-
       tor  is  passed in ovecsize, which must be a non-negative number. Note:
       this argument is NOT the size of ovector in bytes.

       The first two-thirds of the vector is used to pass back  captured  sub-
       strings,  each  substring using a pair of integers. The remaining third
       of the vector is used as workspace by pcre_exec() while  matching  cap-
       turing  subpatterns, and is not available for passing back information.
       The number passed in ovecsize should always be a multiple of three.  If
       it is not, it is rounded down.

       When  a  match  is successful, information about captured substrings is
 | 
						|
       returned in pairs of integers, starting at the  beginning  of  ovector,
 | 
						|
       and  continuing  up  to two-thirds of its length at the most. The first
 | 
						|
       element of each pair is set to the byte offset of the  first  character
 | 
						|
       in  a  substring, and the second is set to the byte offset of the first
 | 
						|
       character after the end of a substring. Note: these values  are  always
 | 
						|
       byte offsets, even in UTF-8 mode. They are not character counts.
 | 
						|
 | 
						|
       The  first  pair  of  integers, ovector[0] and ovector[1], identify the
 | 
						|
       portion of the subject string matched by the entire pattern.  The  next
 | 
						|
       pair  is  used for the first capturing subpattern, and so on. The value
 | 
						|
       returned by pcre_exec() is one more than the highest numbered pair that
 | 
						|
       has  been  set.  For example, if two substrings have been captured, the
 | 
						|
       returned value is 3. If there are no capturing subpatterns, the  return
 | 
						|
       value from a successful match is 1, indicating that just the first pair
 | 
						|
       of offsets has been set.
 | 
						|
 | 
						|
       If a capturing subpattern is matched repeatedly, it is the last portion
 | 
						|
       of the string that it matched that is returned.
 | 
						|
 | 
						|
       If  the vector is too small to hold all the captured substring offsets,
 | 
						|
       it is used as far as possible (up to two-thirds of its length), and the
 | 
						|
       function  returns  a value of zero. If the substring offsets are not of
 | 
						|
       interest, pcre_exec() may be called with ovector  passed  as  NULL  and
 | 
						|
       ovecsize  as zero. However, if the pattern contains back references and
 | 
						|
       the ovector is not big enough to remember the related substrings,  PCRE
 | 
						|
       has  to  get additional memory for use during matching. Thus it is usu-
 | 
						|
       ally advisable to supply an ovector.
 | 
						|
 | 
						|
       The pcre_info() function can be used to find  out  how  many  capturing
 | 
						|
       subpatterns  there  are  in  a  compiled pattern. The smallest size for
 | 
						|
       ovector that will allow for n captured substrings, in addition  to  the
 | 
						|
       offsets of the substring matched by the whole pattern, is (n+1)*3.
 | 
						|
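       The sizing rules above amount to two small pieces of arithmetic. The
       helpers below are a sketch only; ovector_size_for() and usable_pairs()
       are names made up for this example and do not exist in PCRE.

```c
/*
 * Hypothetical helper: smallest ovecsize that has room for the offsets
 * of the whole match plus n captured substrings.
 */
int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/*
 * Hypothetical helper: number of offset pairs that can be passed back
 * for a given ovecsize. ovecsize is rounded down to a multiple of three,
 * two-thirds of that is used for pairs, and each pair takes two elements,
 * which works out to ovecsize / 3 with integer division.
 */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

       For a pattern with two capturing subpatterns, ovector_size_for(2)
       gives 9. An ovecsize of 10 offers no more room than 9, because both
       yield three usable pairs.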
 | 
						|
       It  is  possible for capturing subpattern number n+1 to match some part
       of the subject when subpattern n has not been used at all. For example,
       if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
       return from the function is 4, and subpatterns 1 and 3 are matched, but
       2  is  not.  When  this happens, both values in the offset pairs corre-
       sponding to unused subpatterns are set to -1.

       Offset values that correspond to unused subpatterns at the end  of  the
       expression  are  also  set  to  -1. For example, if the string "abc" is
       matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
       matched.  The  return  from the function is 2, because the highest used
       capturing subpattern number is 1. However, you can refer to the offsets
       for  the  second  and third capturing subpatterns if you wish (assuming
       the vector is large enough, of course).
 | 
						|
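       Reading a capture directly from the offset pairs might look like the
       following sketch. copy_capture() is a hypothetical helper written for
       this example, and its return codes (-1 for an unset or absent capture,
       -2 for a buffer that is too small) are likewise invented here. Unlike
       the convenience functions described later, it reports an unset
       subpattern as an error rather than as an empty string.

```c
#include <string.h>

/*
 * Hypothetical helper: copy capture number n into buf as a C string.
 * rc is the positive value returned by pcre_exec(). Returns the length
 * copied, -1 if the capture is unset or beyond the highest pair that
 * was set, or -2 if buf cannot hold the text plus a terminating zero.
 */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc)
        return -1;                  /* no such pair was returned */
    start = ovector[2 * n];         /* byte offset of first character */
    end = ovector[2 * n + 1];       /* byte offset just past the end */
    if (start < 0 || end < 0)
        return -1;                  /* pair is (-1, -1): unset */
    len = end - start;
    if (len + 1 > bufsize)
        return -2;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

       For "abc" matched against (a|(z))(bc), the vector begins 0,3, 0,1,
       -1,-1, 1,3 and rc is 4, so captures 0, 1 and 3 yield "abc", "a" and
       "bc", while capture 2 is reported as unset.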
 | 
						|
       Some convenience functions are provided  for  extracting  the  captured
       substrings as separate strings. These are described below.

   Error return values from pcre_exec()

       If  pcre_exec()  fails, it returns a negative number. The following are
       defined in the header file:

         PCRE_ERROR_NOMATCH        (-1)

       The subject string did not match the pattern.

         PCRE_ERROR_NULL           (-2)

       Either code or subject was passed as NULL,  or  ovector  was  NULL  and
       ovecsize was not zero.

         PCRE_ERROR_BADOPTION      (-3)

       An unrecognized bit was set in the options argument.

         PCRE_ERROR_BADMAGIC       (-4)

       PCRE  stores a 4-byte "magic number" at the start of the compiled code,
       to catch the case when it is passed a junk pointer and to detect when a
       pattern that was compiled in an environment of one endianness is run in
       an environment with the other endianness. This is the error  that  PCRE
       gives when the magic number is not present.

         PCRE_ERROR_UNKNOWN_OPCODE (-5)

       While running the pattern match, an unknown item was encountered in the
       compiled pattern. This error could be caused by a bug  in  PCRE  or  by
       overwriting of the compiled pattern.

         PCRE_ERROR_NOMEMORY       (-6)

       If  a  pattern contains back references, but the ovector that is passed
       to pcre_exec() is not big enough to remember the referenced substrings,
       PCRE  gets  a  block of memory at the start of matching to use for this
       purpose. If the call via pcre_malloc() fails, this error is given.  The
       memory is automatically freed at the end of matching.

         PCRE_ERROR_NOSUBSTRING    (-7)

       This  error is used by the pcre_copy_substring(), pcre_get_substring(),
       and  pcre_get_substring_list()  functions  (see  below).  It  is  never
       returned by pcre_exec().

         PCRE_ERROR_MATCHLIMIT     (-8)

       The  backtracking  limit,  as  specified  by the match_limit field in a
       pcre_extra structure (or defaulted) was reached.  See  the  description
       above.

         PCRE_ERROR_CALLOUT        (-9)

       This error is never generated by pcre_exec() itself. It is provided for
       use by callout functions that want to yield a distinctive  error  code.
       See the pcrecallout documentation for details.

         PCRE_ERROR_BADUTF8        (-10)

       A  string  that contains an invalid UTF-8 byte sequence was passed as a
       subject.

         PCRE_ERROR_BADUTF8_OFFSET (-11)

       The UTF-8 byte sequence that was passed as a subject was valid, but the
       value  of startoffset did not point to the beginning of a UTF-8 charac-
       ter.

         PCRE_ERROR_PARTIAL        (-12)

       The subject string did not match, but it did match partially.  See  the
       pcrepartial documentation for details of partial matching.

         PCRE_ERROR_BADPARTIAL     (-13)

       The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
       items that are not supported for partial matching. See the  pcrepartial
       documentation for details of partial matching.

         PCRE_ERROR_INTERNAL       (-14)

       An  unexpected  internal error has occurred. This error could be caused
       by a bug in PCRE or by overwriting of the compiled pattern.

         PCRE_ERROR_BADCOUNT       (-15)

       This error is given if the value of the ovecsize argument is negative.

         PCRE_ERROR_RECURSIONLIMIT (-21)

       The internal recursion limit, as specified by the match_limit_recursion
       field  in  a  pcre_extra  structure (or defaulted) was reached. See the
       description above.

         PCRE_ERROR_BADNEWLINE     (-23)

       An invalid combination of PCRE_NEWLINE_xxx options was given.

       Error numbers -16 to -20 and -22 are not used by pcre_exec().
 | 
						|
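       When logging a failed match it can help to turn the return value into
       a name. The function below is an illustrative sketch: exec_error_name()
       is not a PCRE function, and the numbers are simply copied from the list
       above. A program that includes the PCRE header would normally use the
       PCRE_ERROR_xxx macros in place of the bare numbers.

```c
/*
 * Hypothetical helper: map a pcre_exec() return value to the name used
 * in this section. Values of zero or more are not errors (zero means
 * the match succeeded but ovector was too small for all the offsets).
 */
const char *exec_error_name(int rc)
{
    if (rc >= 0)
        return "no error";
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:  return "not a pcre_exec() error";   /* -16..-20, -22, others */
    }
}
```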
 | 
						|
 | 
						|
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       Captured substrings can be  accessed  directly  by  using  the  offsets
       returned  by  pcre_exec()  in  ovector.  For convenience, the functions
       pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
       string_list()  are  provided for extracting captured substrings as new,
       separate, zero-terminated strings. These functions identify  substrings
       by  number.  The  next section describes functions for extracting named
       substrings.

       A substring that contains a binary zero is correctly extracted and  has
       a  further zero added on the end, but the result is not, of course, a C
       string.  However, you can process such a string  by  referring  to  the
       length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
       string().  Unfortunately, the interface to pcre_get_substring_list() is
       not  adequate for handling strings containing binary zeros, because the
       end of the final string is not independently indicated.

       The first three arguments are the same for all  three  of  these  func-
       tions:  subject  is  the subject string that has just been successfully
       matched, ovector is a pointer to the vector of integer offsets that was
       passed to pcre_exec(), and stringcount is the number of substrings that
       were captured by the match, including the substring  that  matched  the
       entire regular expression. This is the value returned by pcre_exec() if
       it is greater than zero. If pcre_exec() returned zero, indicating  that
       it  ran out of space in ovector, the value passed as stringcount should
       be the number of elements in the vector divided by three.
 | 
						|
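       The rule for choosing stringcount can be written down in a few lines.
       This is a sketch under the assumptions stated in the comment;
       stringcount_for() is a name invented for the example.

```c
/*
 * Hypothetical helper: the value to pass as stringcount, given the
 * return value rc of pcre_exec() and the ovecsize that was passed to it.
 * Returns 0 when the match failed (rc < 0), since there is then nothing
 * to extract.
 */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small */
    return 0;
}
```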
 | 
						|
       The functions pcre_copy_substring() and pcre_get_substring() extract  a
       single  substring,  whose  number  is given as stringnumber. A value of
       zero extracts the substring that matched the  entire  pattern,  whereas
       higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
       string(), the string is placed in buffer,  whose  length  is  given  by
       buffersize,  while  for  pcre_get_substring()  a new block of memory is
       obtained via pcre_malloc, and its address is  returned  via  stringptr.
       The  yield  of  the function is the length of the string, not including
       the terminating zero, or one of these error codes:

         PCRE_ERROR_NOMEMORY       (-6)

       The buffer was too small for pcre_copy_substring(), or the  attempt  to
       get memory failed for pcre_get_substring().

         PCRE_ERROR_NOSUBSTRING    (-7)

       There is no substring whose number is stringnumber.

       The  pcre_get_substring_list()  function  extracts  all  available sub-
       strings and builds a list of pointers to them. All this is  done  in  a
       single block of memory that is obtained via pcre_malloc. The address of
       the memory block is returned via listptr, which is also  the  start  of
       the  list  of  string pointers. The end of the list is marked by a NULL
       pointer. The yield of the function is zero if all  went  well,  or  the
       error code

         PCRE_ERROR_NOMEMORY       (-6)

       if the attempt to get the memory block failed.

       When any of these functions encounters a substring that is unset, which
       can happen when capturing subpattern number n+1 matches  some  part  of
       the  subject,  but subpattern n has not been used at all, it returns an
       empty string. This can be distinguished from a genuine zero-length sub-
       string  by inspecting the appropriate offset in ovector, which is nega-
       tive for unset substrings.

       The two convenience functions pcre_free_substring() and  pcre_free_sub-
       string_list()  can  be  used  to free the memory returned by a previous
       call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
       tively.  They  do  nothing  more  than  call the function pointed to by
       pcre_free, which of course could be called directly from a  C  program.
       However,  PCRE is used in some situations where it is linked via a spe-
       cial  interface  to  another  programming  language  that  cannot   use
       pcre_free  directly;  it is for these cases that the functions are pro-
       vided.


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       To extract a substring by name, you first have to find  the  associated
       number. For example, for this pattern

         (a+)b(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 2. If the name is known to
       be unique (PCRE_DUPNAMES was not set), you can find the number from the
       name by calling pcre_get_stringnumber(). The first argument is the com-
       piled pattern, and the second is the name. The yield of the function is
       the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
       subpattern of that name.

       Given the number, you can extract the substring directly, or use one of
       the functions described in the previous section. For convenience, there
       are also two functions that do the whole job.

       Most   of   the   arguments    of    pcre_copy_named_substring()    and
       pcre_get_named_substring()  are  the  same  as  those for the similarly
       named functions that extract by number. As these are described  in  the
       previous  section,  they  are not re-described here. There are just two
       differences:

       First, instead of a substring number, a substring name is  given.  Sec-
       ond, there is an extra argument, given at the start, which is a pointer
       to the compiled pattern. This is needed in order to gain access to  the
       name-to-number translation table.

       These  functions call pcre_get_stringnumber(), and if it succeeds, they
       then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
       ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
       behaviour may not be what you want (see the next section).

       Warning: If the pattern uses the "(?|" feature to set up multiple  sub-
       patterns  with  the  same  number,  you cannot use names to distinguish
       them, because names are not included in the compiled code. The matching
       process uses only numbers.


DUPLICATE SUBPATTERN NAMES

       int pcre_get_stringtable_entries(const pcre *code,
            const char *name, char **first, char **last);

       When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
       subpatterns are not required to  be  unique.  Normally,  patterns  with
       duplicate  names  are such that in any one match, only one of the named
       subpatterns participates. An example is shown in the pcrepattern  docu-
       mentation.

       When    duplicates   are   present,   pcre_copy_named_substring()   and
       pcre_get_named_substring() return the first substring corresponding  to
       the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
       (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
       function  returns one of the numbers that are associated with the name,
       but it is not defined which it is.

       If you want to get full details of all captured substrings for a  given
       name,  you  must  use  the pcre_get_stringtable_entries() function. The
       first argument is the compiled pattern, and the second is the name. The
       third  and  fourth  are  pointers to variables which are updated by the
       function. After it has run, they point to the first and last entries in
       the  name-to-number  table  for  the  given  name.  The function itself
       returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
       there  are none. The format of the table is described above in the sec-
       tion entitled Information about a  pattern.   Given  all  the  relevant
       entries  for the name, you can extract each of their numbers, and hence
       the captured data, if any.
 | 
						|
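       Walking the entries between first and last might be done as in the
       sketch below. It assumes the entry layout given in the section
       entitled Information about a pattern: two bytes holding the subpattern
       number, most significant byte first, followed by the zero-terminated
       name. collect_group_numbers() is a hypothetical helper, not part of
       PCRE.

```c
/*
 * Hypothetical helper: visit the name-table entries from first to last
 * (inclusive), entrysize bytes apart, and store up to max subpattern
 * numbers in numbers[]. Returns the number of entries visited.
 * Assumed entry layout: 2-byte number (high byte first), then the name.
 */
int collect_group_numbers(const char *first, const char *last,
                          int entrysize, int *numbers, int max)
{
    int count = 0;
    const char *p;

    if (entrysize <= 0)
        return 0;               /* e.g. a negative error code was passed */
    for (p = first; p <= last; p += entrysize) {
        const unsigned char *e = (const unsigned char *)p;

        if (count < max)
            numbers[count] = (e[0] << 8) | e[1];
        count++;
    }
    return count;
}
```

       In real use, first, last and entrysize would come from a call to
       pcre_get_stringtable_entries(). Each number n obtained this way
       selects the offset pair ovector[2*n] and ovector[2*n+1], which is
       negative if that subpattern did not take part in the match.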
 | 
						|
 | 
						|
FINDING ALL POSSIBLE MATCHES

       The traditional matching function uses a  similar  algorithm  to  Perl,
       which stops when it finds the first match, starting at a given point in
       the subject. If you want to find all possible matches, or  the  longest
       possible  match,  consider using the alternative matching function (see
       below) instead. If you cannot use the alternative function,  but  still
       need  to  find all possible matches, you can kludge it up by making use
       of the callout facility, which is described in the pcrecallout documen-
       tation.

       What you have to do is to insert a callout right at the end of the pat-
       tern.  When your callout function is called, extract and save the  cur-
       rent  matched  substring.  Then  return  1, which forces pcre_exec() to
       backtrack and try other alternatives. Ultimately, when it runs  out  of
       matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            int *workspace, int wscount);

       The  function  pcre_dfa_exec()  is  called  to  match  a subject string
       against a compiled pattern, using a matching algorithm that  scans  the
       subject  string  just  once, and does not backtrack. This has different
       characteristics to the normal algorithm, and  is  not  compatible  with
       Perl.  Some  of the features of PCRE patterns are not supported. Never-
       theless, there are times when this kind of matching can be useful.  For
       a discussion of the two matching algorithms, see the pcrematching docu-
       mentation.

       The arguments for the pcre_dfa_exec() function  are  the  same  as  for
       pcre_exec(), plus two extras. The ovector argument is used in a differ-
       ent way, and this is described below. The other  common  arguments  are
       used  in  the  same way as for pcre_exec(), so their description is not
       repeated here.

       The two additional arguments provide workspace for  the  function.  The
       workspace  vector  should  contain at least 20 elements. It is used for
       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
       workspace  will  be  needed for patterns and subjects where there are a
       lot of potential matches.

       Here is an example of a simple call to pcre_dfa_exec():

         int rc;
         int ovector[10];
         int wspace[20];
         rc = pcre_dfa_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector of integers for substring information */
           10,             /* number of elements (NOT size in bytes) */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre_dfa_exec()

       The unused bits of the options argument  for  pcre_dfa_exec()  must  be
       zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
       LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,
       PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
       three of these are the same as for pcre_exec(), so their description is
       not repeated here.

         PCRE_PARTIAL

       This  has  the  same general effect as it does for pcre_exec(), but the
       details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for
       pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into
       PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have
       been no complete matches, but there is still at least one matching pos-
       sibility. The portion of the string that provided the partial match  is
       set as the first matching string.

         PCRE_DFA_SHORTEST

       Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the alterna-
       tive  algorithm  works, this is necessarily the shortest possible match
       at the first possible matching point in the subject string.

         PCRE_DFA_RESTART

       When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and
       returns  a  partial  match, it is possible to call it again, with addi-
       tional subject characters, and have it continue with  the  same  match.
       The  PCRE_DFA_RESTART  option requests this action; when it is set, the
       workspace and wscount arguments must reference the same vector as
       before, because data about the match so far is left in the vector
       after a partial match. There is more discussion of this facility in
       the pcrepartial documentation.

   Successful returns from pcre_dfa_exec()

       When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
       string in the subject. Note, however, that all the matches from one run
       of  the  function  start  at the same point in the subject. The shorter
       matches are all initial substrings of the longer matches. For  example,
       if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something>
         <something> <something else>
         <something> <something else> <something further>

       On  success,  the  yield of the function is a number greater than zero,
       which is the number of matched substrings.  The  substrings  themselves
       are  returned  in  ovector. Each string uses two elements; the first is
       the offset to the start, and the second is the offset to  the  end.  In
       fact,  all  the  strings  have the same start offset. (Space could have
       been saved by giving this only once, but it was decided to retain  some
       compatibility  with  the  way pcre_exec() returns data, even though the
       meaning of the strings is different.)

       The strings are returned in reverse order of length; that is, the long-
       est  matching  string is given first. If there were too many matches to
       fit into ovector, the yield of the function is zero, and the vector  is
       filled with the longest matches.
 | 
						|
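       Processing the result of a successful call could look like the sketch
       below. print_dfa_matches() is a hypothetical helper written for this
       example; it handles only a positive return value and does not cover
       the case where the yield is zero because ovector was too small.

```c
#include <stdio.h>

/*
 * Hypothetical helper: print the rc matches that pcre_dfa_exec() left
 * in ovector, longest first, and return the length of the shortest one.
 * Expects rc > 0; returns -1 otherwise.
 */
int print_dfa_matches(const char *subject, const int *ovector, int rc)
{
    int i;

    if (rc <= 0)
        return -1;
    for (i = 0; i < rc; i++) {
        int start = ovector[2 * i];      /* same for every match */
        int end = ovector[2 * i + 1];

        printf("%d: %.*s\n", i, end - start, subject + start);
    }
    /* The last pair describes the shortest match. */
    return ovector[2 * rc - 1] - ovector[2 * rc - 2];
}
```

       For the example subject above, the three matches all start at offset
       8 and end at offsets 56, 36 and 19 (values worked out by hand for
       this illustration), so a vector beginning 8,56, 8,36, 8,19 with a
       return value of 3 prints the three strings in that order.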
 | 
						|
   Error returns from pcre_dfa_exec()

       The  pcre_dfa_exec()  function returns a negative number when it fails.
       Many of the errors are the same  as  for  pcre_exec(),  and  these  are
       described  above.   There are in addition the following errors that are
       specific to pcre_dfa_exec():

         PCRE_ERROR_DFA_UITEM      (-16)

       This return is given if pcre_dfa_exec() encounters an item in the  pat-
       tern  that  it  does not support, for instance, the use of \C or a back
       reference.

         PCRE_ERROR_DFA_UCOND      (-17)

       This return is given if pcre_dfa_exec()  encounters  a  condition  item
       that  uses  a back reference for the condition, or a test for recursion
       in a specific group. These are not supported.

         PCRE_ERROR_DFA_UMLIMIT    (-18)

       This return is given if pcre_dfa_exec() is called with an  extra  block
       that contains a setting of the match_limit field. This is not supported
       (it is meaningless).

         PCRE_ERROR_DFA_WSSIZE     (-19)

       This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
       workspace vector.

         PCRE_ERROR_DFA_RECURSE    (-20)

       When  a  recursive subpattern is processed, the matching function calls
       itself recursively, using private vectors for  ovector  and  workspace.
       This  error  is  given  if  the output vector is not large enough. This
       should be extremely rare, as a vector of size 1000 is used.


SEE ALSO

       pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
       pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
       pcrestack(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 11 April 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------


PCRECALLOUT(3)                                                  PCRECALLOUT(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE CALLOUTS

       int (*pcre_callout)(pcre_callout_block *);

       PCRE provides a feature called "callout", which is a means of temporar-
       ily passing control to the caller of PCRE  in  the  middle  of  pattern
       matching.  The  caller of PCRE provides an external function by putting
       its entry point in the global variable pcre_callout. By  default,  this
       variable contains NULL, which disables all calling out.

       Within  a  regular  expression,  (?C) indicates the points at which the
       external function is to be called.  Different  callout  points  can  be
       identified  by  putting  a number less than 256 after the letter C. The
       default value is zero.  For  example,  this  pattern  has  two  callout
       points:

         (?C1)abc(?C2)def

       If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
       called, PCRE automatically  inserts  callouts,  all  with  number  255,
       before  each  item in the pattern. For example, if PCRE_AUTO_CALLOUT is
       used with the pattern

         A(\d{2}|--)

       it is processed as if it were

       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
 | 
						|
 | 
						|
       Notice that there is a callout before and after  each  parenthesis  and
 | 
						|
       alternation  bar.  Automatic  callouts  can  be  used  for tracking the
 | 
						|
       progress of pattern matching. The pcretest command has an  option  that
 | 
						|
       sets  automatic callouts; when it is used, the output indicates how the
 | 
						|
       pattern is matched. This is useful information when you are  trying  to
 | 
						|
       optimize the performance of a particular pattern.
 | 
						|
 | 
						|
 | 
						|
MISSING CALLOUTS
 | 
						|
 | 
						|
       You  should  be  aware  that,  because of optimizations in the way PCRE
 | 
						|
       matches patterns by default, callouts  sometimes  do  not  happen.  For
 | 
						|
       example, if the pattern is
 | 
						|
 | 
						|
         ab(?C4)cd
 | 
						|
 | 
						|
       PCRE knows that any matching string must contain the letter "d". If the
 | 
						|
       subject string is "abyz", the lack of "d" means that  matching  doesn't
 | 
						|
       ever  start,  and  the  callout is never reached. However, with "abyd",
 | 
						|
       though the result is still no match, the callout is obeyed.
 | 
						|
 | 
						|
       You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
 | 
						|
       MIZE  option  to  pcre_exec()  or  pcre_dfa_exec(). This slows down the
 | 
						|
       matching process, but does ensure that callouts  such  as  the  example
 | 
						|
       above are obeyed.
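
       The effect described in this section can be modelled with a very small
       function. This is a deliberately simplified sketch and not PCRE's own
       code: the name match_attempt_starts() and its no_start_optimize
       argument are hypothetical, and the library may use more than this one
       check before starting a match.

```c
#include <string.h>

/* Returns 1 if a match attempt (and therefore any callout in the pattern)
   would be started, or 0 if the subject is rejected first because a byte
   that every match must contain is absent. */
int match_attempt_starts(const char *subject, size_t length,
                         unsigned char required_byte, int no_start_optimize)
{
    if (no_start_optimize)
        return 1;               /* the pre-check has been disabled */
    return memchr(subject, required_byte, length) != NULL;
}
```

       For the pattern ab(?C4)cd the required byte is "d", so "abyz" is
       rejected before matching starts, whereas "abyd" is not.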
 | 
						|
 | 
						|
 | 
						|
THE CALLOUT INTERFACE
 | 
						|
 | 
						|
       During  matching, when PCRE reaches a callout point, the external func-
 | 
						|
       tion defined by pcre_callout is called (if it is set). This applies  to
 | 
						|
       both  the  pcre_exec()  and the pcre_dfa_exec() matching functions. The
 | 
						|
       only argument to the callout function is a pointer to a
       pcre_callout_block structure. This structure contains the following
       fields:
 | 
						|
 | 
						|
         int          version;
         int          callout_number;
         int         *offset_vector;
         const char  *subject;
         int          subject_length;
         int          start_match;
         int          current_position;
         int          capture_top;
         int          capture_last;
         void        *callout_data;
         int          pattern_position;
         int          next_item_length;
 | 
						|
 | 
						|
       The  version  field  is an integer containing the version number of the
 | 
						|
       block format. The initial version was 0; the current version is 1.  The
 | 
						|
       version  number  will  change  again in future if additional fields are
 | 
						|
       added, but the intention is never to remove any of the existing fields.
 | 
						|
 | 
						|
       The callout_number field contains the number of the  callout,  as  com-
 | 
						|
       piled  into  the pattern (that is, the number after ?C for manual call-
 | 
						|
       outs, and 255 for automatically generated callouts).
 | 
						|
 | 
						|
       The offset_vector field is a pointer to the vector of offsets that  was
 | 
						|
       passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
 | 
						|
       pcre_exec() is used, the contents can be inspected in order to  extract
 | 
						|
       substrings  that  have  been  matched  so  far,  in the same way as for
 | 
						|
       extracting substrings after a match has completed. For  pcre_dfa_exec()
 | 
						|
       this field is not useful.
 | 
						|
 | 
						|
       The subject and subject_length fields contain copies of the values that
       were passed to pcre_exec() or pcre_dfa_exec().
 | 
						|
 | 
						|
       The start_match field normally contains the offset within  the  subject
 | 
						|
       at  which  the  current  match  attempt started. However, if the escape
 | 
						|
       sequence \K has been encountered, this value is changed to reflect  the
 | 
						|
       modified  starting  point.  If the pattern is not anchored, the callout
 | 
						|
       function may be called several times from the same point in the pattern
 | 
						|
       for different starting points in the subject.
 | 
						|
 | 
						|
       The  current_position  field  contains the offset within the subject of
 | 
						|
       the current match pointer.
 | 
						|
 | 
						|
       When the pcre_exec() function is used, the capture_top  field  contains
 | 
						|
       one  more than the number of the highest numbered captured substring so
 | 
						|
       far. If no substrings have been captured, the value of  capture_top  is
 | 
						|
       one.  This  is always the case when pcre_dfa_exec() is used, because it
 | 
						|
       does not support captured substrings.
 | 
						|
 | 
						|
       The capture_last field contains the number of the  most  recently  cap-
 | 
						|
       tured  substring. If no substrings have been captured, its value is -1.
 | 
						|
       This is always the case when pcre_dfa_exec() is used.
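
       When pcre_exec() is in use, offset_vector, capture_top, and
       capture_last together allow a callout to read what has been captured
       so far. The sketch below assumes the usual pcre_exec() layout, in which
       substring n occupies elements 2n and 2n+1 of the vector and an unset
       substring has negative offsets. The helper name copy_capture() is
       hypothetical, and the values are passed as separate arguments so that
       the code compiles without <pcre.h>.

```c
#include <string.h>

/* Copies captured substring number n into out as a NUL-terminated string.
   Returns its length, or -1 if substring n is not set or does not fit. */
int copy_capture(const char *subject, const int *offset_vector,
                 int capture_top, int n, char *out, size_t out_size)
{
    int start, end, len;

    if (n < 1 || n >= capture_top)
        return -1;                  /* nothing captured for n so far */

    start = offset_vector[2 * n];
    end   = offset_vector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;                  /* this substring is unset */

    len = end - start;
    if ((size_t)len + 1 > out_size)
        return -1;                  /* no room for the text plus the NUL */

    memcpy(out, subject + start, (size_t)len);
    out[len] = '\0';
    return len;
}
```

       Inside a callout, a typical call would pass the capture_last value as
       n after checking that it is not -1.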
 | 
						|
 | 
						|
       The callout_data field contains a value that is passed to pcre_exec()
       or pcre_dfa_exec() specifically so that it can be passed back in
       callouts. It is passed in the callout_data field of the pcre_extra data
       structure. If no such data was passed, the value of callout_data in a
       pcre_callout_block structure is NULL. There is a description of the
       pcre_extra structure in the pcreapi documentation.
 | 
						|
 | 
						|
       The  pattern_position field is present from version 1 of the pcre_call-
 | 
						|
       out structure. It contains the offset to the next item to be matched in
 | 
						|
       the pattern string.
 | 
						|
 | 
						|
       The  next_item_length field is present from version 1 of the pcre_call-
 | 
						|
       out structure. It contains the length of the next item to be matched in
 | 
						|
       the  pattern  string. When the callout immediately precedes an alterna-
 | 
						|
       tion bar, a closing parenthesis, or the end of the pattern, the  length
 | 
						|
       is  zero.  When the callout precedes an opening parenthesis, the length
 | 
						|
       is that of the entire subpattern.
 | 
						|
 | 
						|
       The pattern_position and next_item_length fields are intended  to  help
 | 
						|
       in  distinguishing between different automatic callouts, which all have
 | 
						|
       the same callout number. However, they are set for all callouts.
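
       A callout function that traces the progress of a match might look like
       the following sketch. So that it compiles without the PCRE headers, it
       declares a local stand-in structure with the fields listed above; the
       names demo_callout_block, trace_log, and trace_callout are all
       hypothetical. A real program would include <pcre.h> and take a
       pcre_callout_block pointer instead.

```c
#include <stdio.h>

/* Stand-in for pcre_callout_block, with the fields listed above. */
typedef struct demo_callout_block {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
    int          pattern_position;   /* present from version 1 */
    int          next_item_length;   /* present from version 1 */
} demo_callout_block;

/* Hypothetical record that the caller hands over through callout_data. */
typedef struct trace_log {
    int calls;                       /* number of callouts seen */
    int last_number;                 /* callout_number of the latest one */
    int last_subject_offset;         /* its current_position */
    int last_pattern_offset;         /* its pattern_position, or -1 */
} trace_log;

int trace_callout(demo_callout_block *cb)
{
    trace_log *log = (trace_log *)cb->callout_data;

    /* pattern_position exists only in version 1 and later blocks */
    int pattern_offset = (cb->version >= 1) ? cb->pattern_position : -1;

    printf("callout %d: subject offset %d of %d, pattern offset %d\n",
           cb->callout_number, cb->current_position, cb->subject_length,
           pattern_offset);

    if (log != NULL) {               /* NULL when no data was passed */
        log->calls++;
        log->last_number = cb->callout_number;
        log->last_subject_offset = cb->current_position;
        log->last_pattern_offset = pattern_offset;
    }
    return 0;                        /* zero: matching proceeds as normal */
}
```

       With automatic callouts every call carries the number 255, so the
       recorded pattern offset is what tells one callout point from another.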
 | 
						|
 | 
						|
 | 
						|
RETURN VALUES
 | 
						|
 | 
						|
       The external callout function returns an integer to PCRE. If the  value
 | 
						|
       is  zero,  matching  proceeds  as  normal. If the value is greater than
 | 
						|
       zero, matching fails at the current point, but  the  testing  of  other
 | 
						|
       matching possibilities goes ahead, just as if a lookahead assertion had
 | 
						|
       failed. If the value is less than zero, the  match  is  abandoned,  and
 | 
						|
       pcre_exec() (or pcre_dfa_exec()) returns the negative value.
 | 
						|
 | 
						|
       Negative   values   should   normally   be   chosen  from  the  set  of
 | 
						|
       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
 | 
						|
       dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
 | 
						|
       reserved for use by callout functions; it will never be  used  by  PCRE
 | 
						|
       itself.
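
       The three kinds of return value can be seen together in the sketch
       below. It is an illustration under stated assumptions, not a
       recommended design: the reduced stand-in structure demo_callout_block,
       the callout_budget type, and the rule that callout number 1 requires a
       digit are all hypothetical. The value -9 for PCRE_ERROR_CALLOUT is the
       one defined in pcre.h and is repeated only so that the code compiles
       on its own.

```c
#include <ctype.h>
#include <stddef.h>

#ifndef PCRE_ERROR_CALLOUT
#define PCRE_ERROR_CALLOUT (-9)      /* normally supplied by <pcre.h> */
#endif

/* Reduced stand-in for pcre_callout_block: only the fields used here. */
typedef struct demo_callout_block {
    int          callout_number;
    const char  *subject;
    int          subject_length;
    int          current_position;
    void        *callout_data;
} demo_callout_block;

/* Hypothetical caller data: how many more callouts are tolerated. */
typedef struct callout_budget {
    int remaining;
} callout_budget;

int limited_callout(demo_callout_block *cb)
{
    callout_budget *budget = (callout_budget *)cb->callout_data;

    /* Negative return: the whole match is abandoned, and the matching
       function returns this value to its caller. */
    if (budget != NULL && budget->remaining-- <= 0)
        return PCRE_ERROR_CALLOUT;

    /* Positive return: matching fails at this point only, and other
       possibilities are still tried. Here callout number 1 insists that
       the next subject character is a digit. */
    if (cb->callout_number == 1 &&
        (cb->current_position >= cb->subject_length ||
         !isdigit((unsigned char)cb->subject[cb->current_position])))
        return 1;

    return 0;                        /* zero: matching proceeds as normal */
}
```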
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 15 March 2009
 | 
						|
       Copyright (c) 1997-2009 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRECOMPAT(3)                                                    PCRECOMPAT(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions
 | 
						|
 | 
						|
 | 
						|
DIFFERENCES BETWEEN PCRE AND PERL
 | 
						|
 | 
						|
       This  document describes the differences in the ways that PCRE and Perl
 | 
						|
       handle regular expressions. The differences described here  are  mainly
 | 
						|
       with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
 | 
						|
       some features that are expected to be in the forthcoming Perl 5.10.
 | 
						|
 | 
						|
       1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
 | 
						|
       of  what  it does have are given in the section on UTF-8 support in the
 | 
						|
       main pcre page.
 | 
						|
 | 
						|
       2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
 | 
						|
       permits  them,  but they do not mean what you might think. For example,
 | 
						|
       (?!a){3} does not assert that the next three characters are not "a". It
 | 
						|
       just asserts that the next character is not "a" three times.
 | 
						|
 | 
						|
       3.  Capturing  subpatterns  that occur inside negative lookahead asser-
 | 
						|
       tions are counted, but their entries in the offsets  vector  are  never
 | 
						|
       set.  Perl sets its numerical variables from any such patterns that are
 | 
						|
       matched before the assertion fails to match something (thereby succeed-
 | 
						|
       ing),  but  only  if the negative lookahead assertion contains just one
 | 
						|
       branch.
 | 
						|
 | 
						|
       4. Though binary zero characters are supported in the  subject  string,
 | 
						|
       they are not allowed in a pattern string because it is passed as a nor-
 | 
						|
       mal C string, terminated by zero. The escape sequence \0 can be used in
 | 
						|
       the pattern to represent a binary zero.
 | 
						|
 | 
						|
       5.  The  following Perl escape sequences are not supported: \l, \u, \L,
 | 
						|
       \U, and \N. In fact these are implemented by Perl's general string-han-
 | 
						|
       dling  and are not part of its pattern matching engine. If any of these
 | 
						|
       are encountered by PCRE, an error is generated.
 | 
						|
 | 
						|
       6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
 | 
						|
       is  built  with Unicode character property support. The properties that
 | 
						|
       can be tested with \p and \P are limited to the general category  prop-
 | 
						|
       erties  such  as  Lu and Nd, script names such as Greek or Han, and the
 | 
						|
       derived properties Any and L&.
 | 
						|
 | 
						|
       7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
 | 
						|
       ters  in  between  are  treated as literals. This is slightly different
 | 
						|
       from Perl in that $ and @ are  also  handled  as  literals  inside  the
 | 
						|
       quotes.  In Perl, they cause variable interpolation (but of course PCRE
 | 
						|
       does not have variables). Note the following examples:
 | 
						|
 | 
						|
           Pattern            PCRE matches      Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
 | 
						|
 | 
						|
       The \Q...\E sequence is recognized both inside  and  outside  character
 | 
						|
       classes.
 | 
						|
 | 
						|
       8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
 | 
						|
       constructions. However, there is support for recursive  patterns.  This
 | 
						|
       is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
 | 
						|
       "callout" feature allows an external function to be called during  pat-
 | 
						|
       tern matching. See the pcrecallout documentation for details.
 | 
						|
 | 
						|
       9.  Subpatterns  that  are  called  recursively or as "subroutines" are
 | 
						|
       always treated as atomic groups in  PCRE.  This  is  like  Python,  but
 | 
						|
       unlike Perl.
 | 
						|
 | 
						|
       10.  There are some differences that are concerned with the settings of
 | 
						|
       captured strings when part of  a  pattern  is  repeated.  For  example,
 | 
						|
       matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
 | 
						|
       unset, but in PCRE it is set to "b".
 | 
						|
 | 
						|
       11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
 | 
						|
       (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
 | 
						|
       the forms without an  argument.  PCRE  does  not  support  (*MARK).  If
 | 
						|
       (*ACCEPT)  is within capturing parentheses, PCRE does not set that cap-
 | 
						|
       ture group; this is different to Perl.
 | 
						|
 | 
						|
       12. PCRE provides some extensions to the Perl regular expression facil-
 | 
						|
       ities.   Perl  5.10  will  include new features that are not in earlier
 | 
						|
       versions, some of which (such as named parentheses) have been  in  PCRE
 | 
						|
       for some time. This list is with respect to Perl 5.10:
 | 
						|
 | 
						|
       (a)  Although  lookbehind  assertions  must match fixed length strings,
 | 
						|
       each alternative branch of a lookbehind assertion can match a different
 | 
						|
       length of string. Perl requires them all to have the same length.
 | 
						|
 | 
						|
       (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
 | 
						|
       meta-character matches only at the very end of the string.
 | 
						|
 | 
						|
       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
 | 
						|
       cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
 | 
						|
       ignored.  (Perl can be made to issue a warning.)
 | 
						|
 | 
						|
       (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
 | 
						|
       fiers is inverted, that is, by default they are not greedy, but if fol-
 | 
						|
       lowed by a question mark they are.
 | 
						|
 | 
						|
       (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
 | 
						|
       tried only at the first matching position in the subject string.
 | 
						|
 | 
						|
       (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
 | 
						|
       TURE options for pcre_exec() have no Perl equivalents.
 | 
						|
 | 
						|
       (g) The \R escape sequence can be restricted to match only CR,  LF,  or
 | 
						|
       CRLF by the PCRE_BSR_ANYCRLF option.
 | 
						|
 | 
						|
       (h) The callout facility is PCRE-specific.
 | 
						|
 | 
						|
       (i) The partial matching facility is PCRE-specific.
 | 
						|
 | 
						|
       (j) Patterns compiled by PCRE can be saved and re-used at a later time,
 | 
						|
       even on different hosts that have the other endianness.
 | 
						|
 | 
						|
       (k) The alternative matching function (pcre_dfa_exec())  matches  in  a
 | 
						|
       different way and is not Perl-compatible.
 | 
						|
 | 
						|
       (l)  PCRE  recognizes some special sequences such as (*CR) at the start
 | 
						|
       of a pattern that set overall options that cannot be changed within the
 | 
						|
       pattern.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 11 September 2007
 | 
						|
       Copyright (c) 1997-2007 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCREPATTERN(3)                                                  PCREPATTERN(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions
 | 
						|
 | 
						|
 | 
						|
PCRE REGULAR EXPRESSION DETAILS
 | 
						|
 | 
						|
       The  syntax and semantics of the regular expressions that are supported
 | 
						|
       by PCRE are described in detail below. There is a quick-reference  syn-
 | 
						|
       tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
 | 
						|
       semantics as closely as it can. PCRE  also  supports  some  alternative
 | 
						|
       regular  expression  syntax (which does not conflict with the Perl syn-
 | 
						|
       tax) in order to provide some compatibility with regular expressions in
 | 
						|
       Python, .NET, and Oniguruma.
 | 
						|
 | 
						|
       Perl's  regular expressions are described in its own documentation, and
 | 
						|
       regular expressions in general are covered in a number of  books,  some
 | 
						|
       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
 | 
						|
       Expressions", published by  O'Reilly,  covers  regular  expressions  in
 | 
						|
       great  detail.  This  description  of  PCRE's  regular  expressions  is
 | 
						|
       intended as reference material.
 | 
						|
 | 
						|
       The original operation of PCRE was on strings of  one-byte  characters.
 | 
						|
       However,  there is now also support for UTF-8 character strings. To use
 | 
						|
       this, you must build PCRE to  include  UTF-8  support,  and  then  call
 | 
						|
       pcre_compile()  with  the  PCRE_UTF8  option.  There  is also a special
 | 
						|
       sequence that can be given at the start of a pattern:
 | 
						|
 | 
						|
         (*UTF8)
 | 
						|
 | 
						|
       Starting a pattern with this sequence  is  equivalent  to  setting  the
 | 
						|
       PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
 | 
						|
       UTF-8 mode affects pattern matching  is  mentioned  in  several  places
 | 
						|
       below.  There  is  also  a  summary of UTF-8 features in the section on
 | 
						|
       UTF-8 support in the main pcre page.
 | 
						|
 | 
						|
       The remainder of this document discusses the  patterns  that  are  sup-
 | 
						|
       ported  by  PCRE when its main matching function, pcre_exec(), is used.
 | 
						|
       From  release  6.0,   PCRE   offers   a   second   matching   function,
 | 
						|
       pcre_dfa_exec(),  which matches using a different algorithm that is not
 | 
						|
       Perl-compatible. Some of the features discussed below are not available
 | 
						|
       when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
 | 
						|
       alternative function, and how it differs from the normal function,  are
 | 
						|
       discussed in the pcrematching page.
 | 
						|
 | 
						|
 | 
						|
NEWLINE CONVENTIONS
 | 
						|
 | 
						|
       PCRE  supports five different conventions for indicating line breaks in
 | 
						|
       strings: a single CR (carriage return) character, a  single  LF  (line-
 | 
						|
       feed) character, the two-character sequence CRLF, any of the three pre-
 | 
						|
       ceding, or any Unicode newline sequence. The pcreapi page  has  further
 | 
						|
       discussion  about newlines, and shows how to set the newline convention
 | 
						|
       in the options arguments for the compiling and matching functions.
 | 
						|
 | 
						|
       It is also possible to specify a newline convention by starting a  pat-
 | 
						|
       tern string with one of the following five sequences:
 | 
						|
 | 
						|
         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences
 | 
						|
 | 
						|
       These override the default and the options given to pcre_compile(). For
 | 
						|
       example, on a Unix system where LF is the default newline sequence, the
 | 
						|
       pattern
 | 
						|
 | 
						|
         (*CR)a.b
 | 
						|
 | 
						|
       changes the convention to CR. That pattern matches "a\nb" because LF is
 | 
						|
       no longer a newline. Note that these special settings,  which  are  not
 | 
						|
       Perl-compatible,  are  recognized  only at the very start of a pattern,
 | 
						|
       and that they must be in upper case.  If  more  than  one  of  them  is
 | 
						|
       present, the last one is used.
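
       The rules in the preceding paragraph (very start of the pattern only,
       upper case only, last one wins) can be sketched as a small scanner.
       This is not PCRE's own code: the enum newline_kind and the function
       scan_newline_prefix() are hypothetical names, and the sketch looks
       only for the five newline sequences, ignoring any other special
       sequence that may appear at the start of a pattern.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    NL_DEFAULT, NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY
} newline_kind;

/* Scans newline-convention sequences at the very start of the pattern.
   The last one found is stored in *kind (which is left unchanged if there
   is none). Returns the offset of the first character after them. */
size_t scan_newline_prefix(const char *pattern, newline_kind *kind)
{
    static const struct { const char *text; newline_kind kind; } table[] = {
        { "(*CR)",      NL_CR      },
        { "(*LF)",      NL_LF      },
        { "(*CRLF)",    NL_CRLF    },
        { "(*ANYCRLF)", NL_ANYCRLF },
        { "(*ANY)",     NL_ANY     }
    };
    size_t pos = 0;

    for (;;) {
        size_t i, matched = 0;

        /* The comparison is case-sensitive, so "(*cr)" is not accepted. */
        for (i = 0; i < sizeof table / sizeof table[0]; i++) {
            size_t len = strlen(table[i].text);
            if (strncmp(pattern + pos, table[i].text, len) == 0) {
                *kind = table[i].kind;      /* a later setting overrides */
                matched = len;
                break;
            }
        }
        if (matched == 0)
            return pos;                     /* ordinary pattern text starts */
        pos += matched;
    }
}
```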
 | 
						|
 | 
						|
       The  newline  convention  does  not  affect what the \R escape sequence
 | 
						|
       matches. By default, this is any Unicode  newline  sequence,  for  Perl
 | 
						|
       compatibility.  However, this can be changed; see the description of \R
 | 
						|
       in the section entitled "Newline sequences" below. A change of \R  set-
 | 
						|
       ting can be combined with a change of newline convention.
 | 
						|
 | 
						|
 | 
						|
CHARACTERS AND METACHARACTERS
 | 
						|
 | 
						|
       A  regular  expression  is  a pattern that is matched against a subject
 | 
						|
       string from left to right. Most characters stand for  themselves  in  a
 | 
						|
       pattern,  and  match  the corresponding characters in the subject. As a
 | 
						|
       trivial example, the pattern
 | 
						|
 | 
						|
         The quick brown fox
 | 
						|
 | 
						|
       matches a portion of a subject string that is identical to itself. When
 | 
						|
       caseless  matching is specified (the PCRE_CASELESS option), letters are
 | 
						|
       matched independently of case. In UTF-8 mode, PCRE  always  understands
 | 
						|
       the  concept  of case for characters whose values are less than 128, so
 | 
						|
       caseless matching is always possible. For characters with  higher  val-
 | 
						|
       ues,  the concept of case is supported if PCRE is compiled with Unicode
 | 
						|
       property support, but not otherwise.   If  you  want  to  use  caseless
 | 
						|
       matching  for  characters  128  and above, you must ensure that PCRE is
 | 
						|
       compiled with Unicode property support as well as with UTF-8 support.
 | 
						|
 | 
						|
       The power of regular expressions comes  from  the  ability  to  include
 | 
						|
       alternatives  and  repetitions in the pattern. These are encoded in the
 | 
						|
       pattern by the use of metacharacters, which do not stand for themselves
 | 
						|
       but instead are interpreted in some special way.
 | 
						|
 | 
						|
       There  are  two different sets of metacharacters: those that are recog-
 | 
						|
       nized anywhere in the pattern except within square brackets, and  those
 | 
						|
       that  are  recognized  within square brackets. Outside square brackets,
 | 
						|
       the metacharacters are as follows:
 | 
						|
 | 
						|
         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier
 | 
						|
 | 
						|
       Part of a pattern that is in square brackets  is  called  a  "character
 | 
						|
       class". In a character class the only metacharacters are:
 | 
						|
 | 
						|
         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class
 | 
						|
 | 
						|
       The following sections describe the use of each of the metacharacters.
 | 
						|
 | 
						|
 | 
						|
BACKSLASH
 | 
						|
 | 
						|
       The backslash character has several uses. Firstly, if it is followed by
 | 
						|
       a non-alphanumeric character, it takes away any  special  meaning  that
 | 
						|
       character  may  have.  This  use  of  backslash  as an escape character
 | 
						|
       applies both inside and outside character classes.
 | 
						|
 | 
						|
       For example, if you want to match a * character, you write  \*  in  the
 | 
						|
       pattern.   This  escaping  action  applies whether or not the following
 | 
						|
       character would otherwise be interpreted as a metacharacter, so  it  is
 | 
						|
       always  safe  to  precede  a non-alphanumeric with backslash to specify
 | 
						|
       that it stands for itself. In particular, if you want to match a  back-
 | 
						|
       slash, you write \\.
 | 
						|
 | 
						|
       If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
 | 
						|
       the pattern (other than in a character class) and characters between  a
 | 
						|
       # outside a character class and the next newline are ignored. An escap-
 | 
						|
       ing backslash can be used to include a whitespace  or  #  character  as
 | 
						|
       part of the pattern.
 | 
						|
 | 
						|
       If  you  want  to remove the special meaning from a sequence of charac-
 | 
						|
       ters, you can do so by putting them between \Q and \E. This is  differ-
 | 
						|
       ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
 | 
						|
       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
 | 
						|
       tion. Note the following examples:
 | 
						|
 | 
						|
         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 | 
						|
 | 
						|
       The  \Q...\E  sequence  is recognized both inside and outside character
 | 
						|
       classes.
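
       One way to see what \Q...\E means in PCRE is to rewrite a pattern so
       that each quoted section is replaced by individually escaped
       characters. The sketch below does this. It is an illustration, not
       PCRE's implementation: the name expand_quoted() is hypothetical, only
       ASCII non-alphanumeric characters are escaped, and a \Q with no
       following \E is assumed to quote to the end of the pattern.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies pattern to out with every \Q...\E section replaced by
   backslash-escaped literals. Returns the length of the result, or -1
   if out is too small. */
int expand_quoted(const char *pattern, char *out, size_t out_size)
{
    const char *p = pattern;
    size_t n = 0;
    int quoting = 0;

    while (*p != '\0') {
        if (quoting) {
            if (p[0] == '\\' && p[1] == 'E') {
                quoting = 0;            /* end of the quoted section */
                p += 2;
                continue;
            }
            /* Everything here is a literal, including $, @, and \ */
            if ((unsigned char)*p < 128 && !isalnum((unsigned char)*p)) {
                if (n + 1 >= out_size) return -1;
                out[n++] = '\\';
            }
            if (n + 1 >= out_size) return -1;
            out[n++] = *p++;
        } else if (p[0] == '\\' && p[1] == 'Q') {
            quoting = 1;                /* start of a quoted section */
            p += 2;
        } else if (p[0] == '\\' && p[1] != '\0') {
            /* An ordinary escape pair such as \$ is copied unchanged */
            if (n + 2 >= out_size) return -1;
            out[n++] = *p++;
            out[n++] = *p++;
        } else {
            if (n + 1 >= out_size) return -1;
            out[n++] = *p++;
        }
    }
    if (n >= out_size) return -1;
    out[n] = '\0';
    return (int)n;
}
```

       Applied to the first and third patterns in the table above, the
       function produces the same text, abc\$xyz, which is why PCRE matches
       the same subject string for both.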
 | 
						|
 | 
						|
   Non-printing characters
 | 
						|
 | 
						|
       A second use of backslash provides a way of encoding non-printing char-
 | 
						|
       acters  in patterns in a visible manner. There is no restriction on the
 | 
						|
       appearance of non-printing characters, apart from the binary zero  that
 | 
						|
       terminates  a  pattern,  but  when  a pattern is being prepared by text
 | 
						|
       editing, it is usually easier  to  use  one  of  the  following  escape
 | 
						|
       sequences than the binary character it represents:
 | 
						|
 | 
						|
         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh..
 | 
						|
 | 
						|
       The  precise  effect of \cx is as follows: if x is a lower case letter,
 | 
						|
       it is converted to upper case. Then bit 6 of the character (hex 40)  is
 | 
						|
       inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
 | 
						|
       becomes hex 7B.
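
       The \cx rule just described amounts to two steps, shown in the
       following sketch. The function name control_escape() is hypothetical,
       and the code assumes an ASCII character set.

```c
/* Returns the character value that \cx produces for the character x. */
int control_escape(int x)
{
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';         /* a lower case letter becomes upper case */
    return x ^ 0x40;            /* bit 6 (hex 40) is inverted */
}
```

       The three cases given in the text correspond to control_escape('z'),
       control_escape('{'), and control_escape(';').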

       After \x, from zero to two hexadecimal digits are read (letters can be
       in upper or lower case). Any number of hexadecimal digits may appear
       between \x{ and }, but the value of the character code must be less
       than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
       the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
       than the largest Unicode code point, which is 10FFFF.

       If characters other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not recognized.
       Instead, the initial \x will be interpreted as a basic hexadecimal
       escape, with no following digits, giving a character whose value is
       zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x. There is no difference in the way they are
       handled. For example, \xdc is exactly the same as \x{dc}.
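
       The two \x forms and the fallback for a malformed braced form can be
       modelled as follows. This is a Python sketch of the rules as described
       above, not PCRE's own parser; parse_hex_escape is a hypothetical name,
       its argument is the pattern text that follows \x, and treating empty
       braces as unrecognized is an assumption.

```python
import re
import string

_BRACED = re.compile(r"\{([0-9A-Fa-f]+)\}")


def parse_hex_escape(rest: str, utf8: bool = False):
    """Return (character code, characters consumed after the \\x)."""
    braced = _BRACED.match(rest)
    if braced:
        code = int(braced.group(1), 16)
        limit = 2**31 if utf8 else 256
        if code >= limit:
            raise ValueError("character value too large for this mode")
        return code, braced.end()
    # Basic form: from zero to two hexadecimal digits.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in string.hexdigits:
        n += 1
    return (int(rest[:n], 16) if n else 0), n
```

       With this sketch, "dc" and "{dc}" both give the code 0xdc, while "{zz}"
       gives the code zero and consumes nothing, so the brace and what
       follows it are left as ordinary pattern text.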

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\07 specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is
       complicated. Outside a character class, PCRE reads it and any following
       digits as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and uses them to
       generate a data character. Any subsequent digits stand for themselves.
       In non-UTF-8 mode, the value of a character specified in octal must be
       less than \400. In UTF-8 mode, values up to \777 are permitted. For
       example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
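
       The decision between a back reference and an octal character can be
       written out as a small function. The Python sketch below models only
       the rules stated above; interpret_digit_escape is a hypothetical name,
       its first argument is the run of digits after the backslash, and the
       upper limits on octal values (\400 and \777) are not checked.

```python
def interpret_digit_escape(digits: str, groups: int, in_class: bool = False):
    """Return (kind, value, literal_rest) for a backslash followed by digits.

    groups is the number of capturing left parentheses seen so far.
    """
    octal = "01234567"
    if digits[0] == "0":
        # \0 followed by up to two further octal digits.
        n = 1
        while n < 3 and n < len(digits) and digits[n] in octal:
            n += 1
        return ("char", int(digits[:n], 8), digits[n:])
    if not in_class:
        number = int(digits)
        if number < 10 or number <= groups:
            return ("backref", number, "")
    # Re-read up to three octal digits; the rest stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in octal:
        n += 1
    return ("char", int(digits[:n], 8) if n else 0, digits[n:])


print(interpret_digit_escape("11", groups=0))   # ('char', 9, '')
print(interpret_digit_escape("11", groups=11))  # ('backref', 11, '')
print(interpret_digit_escape("81", groups=0))   # ('char', 0, '81')
```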

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, the sequence \b is interpreted as the backspace character (hex
       08), and the sequences \R and \X are interpreted as the characters "R"
       and "X", respectively. Outside a character class, these sequences have
       different meanings (see below).

   Absolute and relative back references

       The sequence \g followed by an unsigned or a negative number,
       optionally enclosed in braces, is an absolute or relative back
       reference. A named back reference can be coded as \g{name}. Back
       references are discussed later, following the discussion of
       parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a "subroutine".
       Details are discussed later. Note that \g{...} (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a back
       reference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types. The
       following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal whitespace character
         \H     any character that is not a horizontal whitespace character
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \v     any vertical whitespace character
         \V     any character that is not a vertical whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into two disjoint sets. Any given character matches one, and only one,
       of each pair.

       These character type sequences can appear both inside and outside
       character classes. They each match one character of the appropriate
       type. If the current matching point is at the end of the subject
       string, all of them fail, since there is no character to match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
       "use locale;" is included in a Perl script, \s may match the VT
       character. In PCRE, it never does.
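
       The difference from the POSIX class is a single code. As a Python
       sketch (the names are hypothetical, and the POSIX set shown is the one
       for the "C" locale):

```python
# The characters matched by \s as listed above: HT, LF, FF, CR, space.
PCRE_BACKSLASH_S = {9, 10, 12, 13, 32}

# POSIX "space" in the "C" locale additionally contains VT (11).
POSIX_SPACE = PCRE_BACKSLASH_S | {11}


def matches_backslash_s(ch: str) -> bool:
    """True if ch is one of the five \\s characters listed above."""
    return ord(ch) in PCRE_BACKSLASH_S


print(matches_backslash_s("\t"))    # True
print(matches_backslash_s("\x0b"))  # False
```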

       In UTF-8 mode, characters with values greater than 128 never match \d,
       \s, or \w, and always match \D, \S, and \W. This is true even when
       Unicode character property support is available. These sequences retain
       their original meanings from before UTF-8 support was available, mainly
       for efficiency reasons. Note that this also affects \b, because it is
       defined in terms of \w and \W.

       The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
       the other sequences, these do match certain high-valued codepoints in
       UTF-8 mode. The horizontal space characters are:

         U+0009     Horizontal tab
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed
         U+000B     Vertical tab
         U+000C     Formfeed
         U+000D     Carriage return
         U+0085     Next line
         U+2028     Line separator
         U+2029     Paragraph separator
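
       The two tables above translate directly into lookup sets. The Python
       sketch below only restates the lists; the names are hypothetical, and
       it says nothing about how PCRE stores these characters internally.

```python
# Horizontal space characters, as listed above (U+2000 to U+200A inclusive).
HORIZONTAL = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000}

# Vertical space characters, as listed above.
VERTICAL = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def is_horizontal_space(ch: str) -> bool:
    """Model of \\h for a single character."""
    return ord(ch) in HORIZONTAL


def is_vertical_space(ch: str) -> bool:
    """Model of \\v for a single character."""
    return ord(ch) in VERTICAL


print(is_horizontal_space("\u3000"), is_vertical_space("\u0085"))  # True True
```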

       A "word" character is an underscore or any character less than 256 that
       is a letter or digit. The definition of letters and digits is
       controlled by PCRE's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcreapi page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes greater
       than 128 are used for accented letters, and these are matched by \w.
       The use of locales with Unicode is discouraged.

   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
       mode \R is equivalent to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
       return, U+000D), or NEL (next line, U+0085). The two-character sequence
       is treated as a single unit that cannot be split.

       In UTF-8 mode, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode character property support is not needed
       for these characters to be recognized.
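
       The default behaviour of \R can be modelled without a regular
       expression engine. The Python sketch below is an illustration only;
       match_newline_sequence is a hypothetical name, and the utf8 flag
       stands in for PCRE's UTF-8 mode.

```python
def match_newline_sequence(subject: str, pos: int, utf8: bool = False) -> int:
    """Return how many characters \\R would match at pos (0 means no match)."""
    # CRLF is tried first and is taken as one unit that cannot be split.
    if subject.startswith("\r\n", pos):
        return 2
    singles = "\n\x0b\f\r\x85"
    if utf8:
        singles += "\u2028\u2029"  # LS and PS are added in UTF-8 mode
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0


print(match_newline_sequence("a\r\nb", 1))             # 2
print(match_newline_sequence("\u2028", 0))             # 0
print(match_newline_sequence("\u2028", 0, utf8=True))  # 1
```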

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
       (BSR is an abbreviation for "backslash R".) This can be made the
       default when PCRE is built; if this is the case, the other behaviour
       can be requested via the PCRE_BSR_UNICODE option. It is also possible
       to specify these settings by starting a pattern string with one of the
       following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to pcre_compile(), but
       they can be overridden by options given to pcre_exec(). Note that these
       special settings, which are not Perl-compatible, are recognized only at
       the very start of a pattern, and that they must be in upper case. If
       more than one of them is present, the last one is used. They can be
       combined with a change of newline convention, for example, a pattern
       can start with:

         (*ANY)(*BSR_ANYCRLF)

       Inside a character class, \R matches the letter "R".

   Unicode character properties

       When PCRE is built with Unicode character property support, three
       additional escape sequences that match characters with specific
       properties are available. When not in UTF-8 mode, these sequences are
       of course limited to testing characters whose codepoints are less than
       256, but they do work in this mode. The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, and "Any", which matches
       any character (including newline). Other properties such as
       "InMusicalSymbols" are not currently supported by PCRE. Note that
       \P{Any} does not match any characters, so always causes a match
       failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common". The current list of scripts is:

       Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
       Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
       Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
       Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.

       Each character has exactly one general category property, specified by
       a two-letter abbreviation. For compatibility with Perl, negation can be
       specified by including a circumflex between the opening brace and the
       property name. For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       general category properties that start with that letter. In this case,
       in the absence of negation, the curly brackets in the escape sequence
       are optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".
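
       The general category rules (two-letter codes, one-letter groups, L&
       and Any) can be tried out with Python's unicodedata module, which
       reports the same two-letter codes. This is a sketch under assumptions:
       has_property is a hypothetical name, script names are not handled, and
       Python's Unicode tables may be a different version from those built
       into PCRE.

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Model of \\p{prop} for general category properties only."""
    category = unicodedata.category(ch)  # for example "Lu" or "Nd"
    if prop == "Any":
        return True
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        # A single letter covers every category that starts with it.
        return category[0] == prop
    return category == prop


print(has_property("A", "Lu"))       # True
print(has_property("a", "L"))        # True
print(has_property("\u02b0", "L&"))  # False: U+02B0 is a modifier letter (Lm)
```

       The negated forms \P{xx} and \p{^xx} correspond to "not
       has_property(ch, xx)".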

       The Cs (Surrogate) property applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
       RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity
       checking has been turned off (see the discussion of PCRE_NO_UTF8_CHECK
       in the pcreapi page).

       The long synonyms for these properties that Perl supports (such as
       \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned)
       property. Instead, this property is assumed for any code point that is
       not in the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode characters that form an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" property, followed
       by zero or more characters with the "mark" property, and treats the
       sequence as an atomic group (see below). Characters with the "mark"
       property are typically accents that affect the preceding character.
       None of them have codepoints less than 256, so in non-UTF-8 mode \X
       matches any one character.
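
       The effect of repeatedly matching \X can be imitated by grouping each
       non-mark character with the marks that follow it. The Python sketch
       below uses the unicodedata module to test the "mark" property;
       extended_sequences is a hypothetical name, and Python's Unicode tables
       may differ in version from PCRE's.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch has one of the M (mark) general categories."""
    return unicodedata.category(ch).startswith("M")


def extended_sequences(text: str) -> list:
    """Split text the way successive (?>\\PM\\pM*) matches would."""
    found = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            i += 1  # \PM cannot match a mark, so no sequence starts here
            continue
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        found.append(text[i:j])
        i = j
    return found


# "e" followed by U+0301 (combining acute accent), then "x"
print([len(s) for s in extended_sequences("e\u0301x")])  # [2, 1]
```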

       Matching characters by Unicode property is not fast, because PCRE has
       to search a structure that contains data for over fifteen thousand
       characters. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE.

   Resetting the match start

       The escape sequence \K, which is a Perl 5.10 feature, causes any
       previously matched characters not to be included in the final matched
       sequence. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar". This feature
       is similar to a lookbehind assertion (described below). However, in
       this case, the part of the subject before the real match does not have
       to be of fixed length, as lookbehind assertions do. The use of \K does
       not interfere with the setting of captured substrings. For example,
       when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".
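
       Python's re module has no \K, but the reported result can be imitated
       by matching the leading part in a non-capturing group and reporting
       only a named group for the remainder. This is an emulation for
       illustration, not how PCRE implements \K; search_keep_tail and the
       group name "kept" are hypothetical.

```python
import re


def search_keep_tail(prefix: str, tail: str, subject: str):
    """Find prefix followed by tail, but report only the tail part."""
    m = re.search(f"(?:{prefix})(?P<kept>{tail})", subject)
    if m is None:
        return None
    return m.start("kept"), m.group("kept")


print(search_keep_tail("foo", "bar", "xfoobar"))   # (4, 'bar')
# The leading part need not be of fixed length:
print(search_keep_tail("fo+", "bar", "fooobar"))   # (4, 'bar')
```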

   Simple assertions

       The final use of backslash is for certain simple assertions. An
       assertion specifies a condition that has to be met at a particular
       point in a match, without consuming any characters from the subject
       string. The use of subpatterns for more complicated assertions is
       described below. The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       These assertions may not appear in character classes (but note that \b
       has a different meaning, namely the backspace character, inside a
       character class).

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
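
       The word boundary definition reduces to comparing two tests. In the
       Python sketch below, is_word_boundary is a hypothetical name and \w is
       assumed to mean ASCII letters, digits and underscore, which ignores
       any locale-specific character tables.

```python
import string

WORD = set(string.ascii_letters + string.digits + "_")


def is_word_boundary(subject: str, pos: int) -> bool:
    """Model of \\b at position pos (0 <= pos <= len(subject))."""
    # Positions outside the string count as "not a word character".
    before = pos > 0 and subject[pos - 1] in WORD
    after = pos < len(subject) and subject[pos] in WORD
    return before != after


print([p for p in range(6) if is_word_boundary("ab cd", p)])  # [0, 2, 3, 5]
```

       \B is the negation of the same test.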

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three
       assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
       which affect only the behaviour of the circumflex and dollar
       metacharacters. However, if the startoffset argument of pcre_exec() is
       non-zero, indicating that matching is to start at a point other than
       the beginning of the subject, \A can never match. The difference
       between \Z and \z is that \Z matches before a newline at the end of the
       string as well as at the very end, whereas \z matches only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate
       arguments, you can mimic Perl's /g option, and it is in this kind of
       implementation where \G can be useful.
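
       The shape of such a loop can be shown in Python, whose re module has
       no \G but whose compiled-pattern match(string, pos) method anchors the
       attempt at pos, much as a pattern that starts with \G is anchored at
       startoffset. This is an analogy under that assumption, not a use of
       pcre_exec(); scan_all is a hypothetical name.

```python
import re


def scan_all(pattern: str, subject: str) -> list:
    """Collect matches that each start exactly where the previous one ended."""
    compiled = re.compile(pattern)
    pos = 0
    found = []
    while pos <= len(subject):
        m = compiled.match(subject, pos)  # anchored at pos
        if m is None:
            break  # the chain of adjacent matches is broken
        found.append(m.group())
        # Step past an empty match so that the loop always advances.
        pos = m.end() if m.end() > pos else pos + 1
    return found


print(scan_all(r"\d+,?", "12,34,x56"))  # ['12,', '34,']
```

       The "56" is not reported because no match can start at the "x", which
       is the behaviour an anchored scan is meant to give.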

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset
       argument of pcre_exec() is non-zero, circumflex can never match if the
       PCRE_MULTILINE option is unset. Inside a character class, circumflex
       has an entirely different meaning (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the
       subject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       A dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default). Dollar need not
       be the last character of the pattern if a number of alternatives are
       involved, but it should be the last item in any branch in which it
       appears. Dollar has no special meaning in a character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, a circumflex
       matches immediately after internal newlines as well as at the start of
       the subject string. It does not match after a newline that ends the
       string. A dollar matches before any newlines in the string, as well as
       at the very end, when PCRE_MULTILINE is set. When newline is specified
       as the two-character sequence CRLF, isolated CR and LF characters do
       not indicate newlines.

       For example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but not otherwise.
       Consequently, patterns that are anchored in single line mode because
       all branches start with ^ are not anchored in multiline mode, and a
       match for circumflex is possible when the startoffset argument of
       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
       PCRE_MULTILINE is set.
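
       The /^abc$/ example can be reproduced with Python's re module, which
       is a different engine from PCRE but treats this particular case in the
       same way when the newline is LF. Python's re.MULTILINE flag plays the
       part of PCRE_MULTILINE here; other details of the two engines differ
       and are not shown.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

print(single)        # None
print(multi.span())  # (4, 7)
```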

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether or not PCRE_MULTILINE is
       set.


FULL STOP (PERIOD, DOT)

       Outside a character class, a dot in the pattern matches any one
       character in the subject string except (by default) a character that
       signifies the end of a line. In UTF-8 mode, the matched character may
       be more than one byte long.

       When a line ending is defined as a single character, dot never matches
       that character; when the two-character sequence CRLF is used, dot does
       not match CR if it is immediately followed by LF, but otherwise it
       matches all characters (including isolated CRs and LFs). When any
       Unicode line endings are being recognized, dot does not match CR or LF
       or any of the other line ending characters.

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE_DOTALL option is set, a dot matches any one character, without
       exception. If the two-character sequence CRLF is present in the subject
       string, it takes two dots to match it.
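
       The same effect can be observed in Python's re module, where the
       re.DOTALL flag corresponds to PCRE_DOTALL. Python's engine always
       treats LF alone as the line ending for dot, so this sketch illustrates
       only the single-character newline case and the two-dot rule for CRLF.

```python
import re

# Without DOTALL, dot does not match the newline character.
plain = re.fullmatch(r"a.b", "a\nb")

# With DOTALL, dot matches any one character.
dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so one dot is not enough but two are.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)

print(plain is None, dotall is not None)                # True True
print(crlf_one_dot is None, crlf_two_dots is not None)  # True True
```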

       The handling of dot is entirely independent of the handling of
       circumflex and dollar, the only relationship being that they both
       involve newlines. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

       Outside a character class, the escape sequence \C matches any one byte,
       both in and out of UTF-8 mode. Unlike a dot, it always matches any
       line-ending characters. The feature is provided in Perl in order to
       match individual bytes in UTF-8 mode. Because it breaks up UTF-8
       characters into individual bytes, what remains in the string may be a
       malformed UTF-8 string. For this reason, the \C escape sequence is best
       avoided.
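
       Why the remainder can be malformed is easy to see at the byte level.
       The Python sketch below does not use \C itself; it only splits a
       two-byte UTF-8 character by hand, which is what a single \C match
       would do. The helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


encoded = "\u00e9".encode("utf-8")  # e with acute accent: two bytes, C3 A9
first_byte, remainder = encoded[:1], encoded[1:]

print(is_valid_utf8(encoded))    # True
print(is_valid_utf8(remainder))  # False: a lone continuation byte
```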

       PCRE does not allow \C to appear in lookbehind assertions (described
       below), because in UTF-8 mode this would make it impossible to
       calculate the length of the lookbehind.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not
       special. If a closing square bracket is required as a member of the
       class, it should be the first data character in the class (after an
       initial circumflex, if present) or escaped with a backslash.

       A character class matches a single character in the subject. In UTF-8
       mode, the character may occupy more than one byte. A matched character
       must be in the set of characters defined by the class, unless the first
       character in the class definition is a circumflex, in which case the
       subject character must not be in the set defined by the class. If a
       circumflex is actually required as a member of the class, ensure it is
       not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion: it still
       consumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.
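
       These points hold for character classes in most regular expression
       engines, so they can be tried in Python's re module as a quick check.
       The variable names are illustrative only, and Python's engine is not
       PCRE.

```python
import re

# [aeiou] matches lower case vowels; [^aeiou] matches everything else.
vowels = re.findall(r"[aeiou]", "regular")
others = re.findall(r"[^aeiou]", "regular")

# A negated class still needs a character to consume, so it fails on an
# empty subject, where there is nothing left to match.
at_end = re.search(r"[^aeiou]", "")

print(vowels)  # ['e', 'u', 'a']
print(others)  # ['r', 'g', 'l', 'r']
print(at_end)  # None
```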
 | 
						|
 | 
						|
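       The basic class rules above can be tried out with a  short  sketch.  It
       uses Python's re module rather than PCRE itself, on the assumption that
       Python treats these particular class constructs in the  same  way;  the
       variable names and subject strings are illustrative only.

```python
import re

vowel = re.compile(r"[aeiou]")
not_vowel = re.compile(r"[^aeiou]")

# A class matches exactly one character from (or outside) its set.
print(bool(vowel.fullmatch("e")), bool(not_vowel.fullmatch("x")))

# A negated class is not an assertion: it needs a character to consume,
# so it cannot match at the end of the subject.
print(not_vowel.search("") is None)

# "]" is a literal when it is the first data character, and "^" is a
# literal when it is not the first character.
bracket_or_x = re.compile(r"[]x]")
caret_or_x = re.compile(r"[x^]")
print(bool(bracket_or_x.fullmatch("]")), bool(caret_or_x.fullmatch("^")))
```
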
       In  UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of bytes, or by using the  \x{  escaping
       mechanism.

       When  caseless  matching  is set, any letters in a class represent both
       their upper case and lower case versions, so for  example,  a  caseless
       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
       understands  the  concept  of case for characters whose values are less
       than 128, so caseless matching is always possible. For characters  with
       higher  values,  the  concept  of case is supported if PCRE is compiled
       with Unicode property support, but not otherwise.  If you want  to  use
       caseless  matching  for  characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well  as  with  UTF-8
       support.

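       As a rough illustration of caseless classes, the following sketch  uses
       Python's  re.IGNORECASE flag as a stand-in for PCRE_CASELESS. That sub-
       stitution is an assumption; only ASCII letters are used, so  the  notes
       on Unicode property support do not come into play.

```python
import re

caseless = re.compile(r"[aeiou]", re.IGNORECASE)
caseless_negated = re.compile(r"[^aeiou]", re.IGNORECASE)
caseful_negated = re.compile(r"[^aeiou]")

# A caseless class accepts either case of each listed letter.
print(bool(caseless.fullmatch("A")))

# A caseless negated class rejects "A"; the caseful version accepts it.
print(caseless_negated.fullmatch("A") is None)
print(bool(caseful_negated.fullmatch("A")))
```
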
       Characters  that  might  indicate  line breaks are never treated in any
       special way  when  matching  character  classes,  whatever  line-ending
       sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
       PCRE_MULTILINE options is used. A class such as [^a] always matches one
       of these characters.

       The  minus (hyphen) character can be used to specify a range of charac-
       ters in a character  class.  For  example,  [d-m]  matches  any  letter
       between  d  and  m,  inclusive.  If  a minus character is required in a
       class, it must be escaped with a backslash  or  appear  in  a  position
       where  it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so  it
       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
       preted  as a class containing a range followed by two other characters.
       The octal or hexadecimal representation of "]" can also be used to  end
       a range.

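       The hyphen and "]" rules can be  checked  with  the  sketch  below.  It
       assumes  that  Python's  re  module  parses  these classes as described
       above; the pattern names are made up for the example.

```python
import re

# A plain range of letters.
d_to_m = re.compile(r"[d-m]")

# A hyphen in last position cannot start a range, so the class holds
# "W" and "-" and is followed by the literal string "46]".
w_or_hyphen_then_46 = re.compile(r"[W-]46]")

# An escaped "]" can end a range: W..] plus the characters "4" and "6".
w_to_bracket_4_6 = re.compile(r"[W-\]46]")

for subject in ("W46]", "-46]", "Z", "]", "4"):
    print(subject,
          bool(w_or_hyphen_then_46.fullmatch(subject)),
          bool(w_to_bracket_4_6.fullmatch(subject)))
```
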
       Ranges  operate in the collating sequence of character values. They can
       also  be  used  for  characters  specified  numerically,  for   example
       [\000-\037].  In UTF-8 mode, ranges can include characters whose values
       are greater than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
       concept  of  case for characters with values greater than 127 only when
       it is compiled with Unicode property support.

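       A small sketch of numeric and caseless ranges follows. It assumes  that
       Python's  re  module  accepts  the  same octal escapes in a class, with
       re.IGNORECASE standing in for PCRE_CASELESS. The  \x{...}  form  is not
       shown because Python's re does not accept that syntax.

```python
import re

# Octal codes 0 to 31, that is, the ASCII control characters.
control_chars = re.compile(r"[\000-\037]")

# W..c covers W X Y Z [ \ ] ^ _ ` a b c; caseless matching adds the
# other case of each letter (w x y z A B C).
w_to_c_caseless = re.compile(r"[W-c]", re.IGNORECASE)

for subject in ("x", "B", "^", "d", "V"):
    print(subject, bool(w_to_c_caseless.fullmatch(subject)))
```
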
       The character types \d, \D, \p, \P, \s, \S, \w, and \W may also  appear
       in  a  character  class,  and add the characters that they match to the
       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
       flex  can  conveniently  be used with the upper case character types to
       specify a more restricted set of characters  than  the  matching  lower
       case  type.  For example, the class [^\W_] matches any letter or digit,
       but not underscore.

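       The two example classes can be exercised as follows. This is  a  sketch
       using  Python's  re  module;  the re.ASCII flag is an assumption chosen
       here so that \d and \W cover only ASCII characters.

```python
import re

# \d contributes the decimal digits; the letters A-F are listed by hand.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

# "Not a non-word character and not underscore" leaves letters and digits.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

for subject in ("7", "C", "G", "_"):
    print(subject,
          bool(hex_digit.fullmatch(subject)),
          bool(letter_or_digit.fullmatch(subject)))
```
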
       The only metacharacters that are recognized in  character  classes  are
       backslash,  hyphen  (only  where  it can be interpreted as specifying a
       range), circumflex (only at the start), opening  square  bracket  (only
       when  it can be interpreted as introducing a POSIX class name - see the
       next section), and the terminating  closing  square  bracket.  However,
       escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
       and  space  (32). Notice that this list includes the VT character (code
       11). This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which  is  indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater than 127 do not match any
       of the POSIX character classes.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns.  For
       example, the pattern

         gilbert|sullivan

       matches  either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty  alternative  is  permitted  (matching  the  empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used. If the  alternatives
       are  within a subpattern (defined below), "succeeds" means matching the
       rest of the main pattern as well as the alternative in the subpattern.


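       The left-to-right rule for alternatives can be seen in a short  sketch.
       It  assumes  that  Python's  re  module  orders alternatives as PCRE is
       described as doing above; the patterns a|ab and (a|ab)c are invented for
       the illustration.

```python
import re

composers = re.compile(r"gilbert|sullivan")
print(bool(composers.fullmatch("sullivan")))

# Alternatives are tried left to right, so "a" wins even though "ab"
# would give a longer match.
first_wins = re.match(r"a|ab", "ab")
print(first_wins.group())

# Inside a subpattern, "succeeds" includes matching what follows: "a"
# leaves "bc", which is not "c", so the second alternative is used.
inside_group = re.fullmatch(r"(a|ab)c", "abc")
print(inside_group.group(1))

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(s|)", "cat")
print(repr(empty_alt.group(1)))
```
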
INTERNAL OPTION SETTING

       The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
       PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
       within the pattern by  a  sequence  of  Perl  option  letters  enclosed
       between "(?" and ")".  The option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-
       LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
       is  also  permitted.  If  a  letter  appears  both before and after the
       hyphen, the option is unset.

       The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
       can  be changed in the same way as the Perl-compatible options by using
       the characters J, U and X respectively.

       When one of these option changes occurs at  top  level  (that  is,  not
       inside  subpattern parentheses), the change applies to the remainder of
       the pattern that follows. If the change is placed right at the start of
       a pattern, PCRE extracts it into the global options (and it will there-
       fore show up in data extracted by the pcre_fullinfo() function).

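       The effect of option letters at the very start  of  a  pattern  can  be
       sketched  with Python's re module, which is used here as an analogy and
       not as the PCRE API. Reading the flags attribute of the  compiled  pat-
       tern  plays a role similar to querying the options with pcre_fullinfo().
       Note that recent Python versions accept an unscoped  sequence  such  as
       (?im) only at the start of a pattern, and do not accept unscoped unset-
       ting such as (?im-sx), so those cases are not shown.

```python
import re

# (?im) at the start sets caseless, multiline matching for the whole
# pattern, exactly as if the flags had been passed to compile().
pattern = re.compile(r"(?im)^abc$")

# The leading option letters show up in the compiled pattern's flags.
print(bool(pattern.flags & re.IGNORECASE))
print(bool(pattern.flags & re.MULTILINE))

# Each line is tested separately and in either case.
print(pattern.findall("ABC\nabc\nxyz"))
```
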
       An option change within a subpattern (see below for  a  description  of
       subpatterns) affects only that part of the current pattern that follows
       it, so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
       used).   By  this means, options can be made to have different settings
       in different parts of the pattern. Any changes made in one  alternative
       do  carry  on  into subsequent branches within the same subpattern. For
       example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There  would  be
       some very weird behaviour otherwise.

       Note:  There  are  other  PCRE-specific  options that can be set by the
       application when the compile or match functions  are  called.  In  some
       cases the pattern can contain special leading sequences such as (*CRLF)
       to override what the application has set or what  has  been  defaulted.
       Details  are  given  in the section entitled "Newline sequences" above.
       There is also the (*UTF8) leading sequence that  can  be  used  to  set
       UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.


SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets), which can be
       nested.  Turning part of a pattern into a subpattern does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches one of the words "cat", "cataract", or  "caterpillar".  Without
       the  parentheses,  it  would  match  "cataract", "erpillar" or an empty
       string.

       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
       that,  when  the  whole  pattern  matches,  that portion of the subject
       string that matched the subpattern is passed back to the caller via the
       ovector  argument  of pcre_exec(). Opening parentheses are counted from
       left to right (starting from 1) to obtain  numbers  for  the  capturing
       subpatterns.

       For  example,  if the string "the red king" is matched against the pat-
       tern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are num-
       bered 1, 2, and 3, respectively.

       The  fact  that  plain  parentheses  fulfil two functions is not always
       helpful.  There are often times when a grouping subpattern is  required
       without  a capturing requirement. If an opening parenthesis is followed
       by a question mark and a colon, the subpattern does not do any  captur-
       ing,  and  is  not  counted when computing the number of any subsequent
       capturing subpatterns. For example, if the string "the white queen"  is
       matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2. The maximum number of capturing subpatterns is 65535.

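       The numbering of capturing and non-capturing subpatterns in  the  exam-
       ples  above  can  be  reproduced with the sketch below. It uses Python's
       re module, whose match object stands in for  the  ovector  returned  by
       pcre_exec(); that mapping is an assumption made for the illustration.

```python
import re

# Parentheses localize the alternatives; the empty one allows plain "cat".
cat_words = re.compile(r"cat(aract|erpillar|)")
print([bool(cat_words.fullmatch(w)) for w in ("cat", "cataract", "caterpillar")])

# Opening parentheses are numbered from left to right, starting at 1.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# (?: ... ) groups without capturing and is skipped in the numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
print(non_capturing.groups())
```
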
       As a convenient shorthand, if any option settings are required  at  the
       start  of  a  non-capturing  subpattern,  the option letters may appear
       between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried  from  left  to right, and options are not reset until the end of
       the subpattern is reached, an option setting in one branch does  affect
       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
       "Saturday".


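       The shorthand form can be tried with Python's re module, which supports
       the  (?i:...)  notation.  Only  that form is shown, because recent Python
       versions reject an unscoped (?i) in the middle of a pattern. The second
       pattern below is an invented example of where the option's scope ends.

```python
import re

# The option applies to every branch inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")
print([bool(weekend.fullmatch(s)) for s in ("Saturday", "SUNDAY", "monday")])

# Outside the group the option no longer applies.
partly_caseless = re.compile(r"(?i:sat)urday")
print(bool(partly_caseless.fullmatch("SATurday")),
      bool(partly_caseless.fullmatch("SATURDAY")))
```
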
DUPLICATE SUBPATTERN NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a subpattern
       uses  the same numbers for its capturing parentheses. Such a subpattern
       starts with (?| and is itself a non-capturing subpattern. For  example,
       consider this pattern:

         (?|(Sat)ur|(Sun))day

       Because  the two alternatives are inside a (?| group, both sets of cap-
       turing parentheses are numbered one. Thus, when  the  pattern  matches,
       you  can  look  at captured substring number one, whichever alternative
       matched. This construct is useful when you want to  capture  part,  but
       not all, of one of a number of alternatives. Inside a (?| group, paren-
       theses are numbered as usual, but the number is reset at the  start  of
       each  branch. The numbers of any capturing buffers that follow the sub-
       pattern start after the highest number used in any branch. The  follow-
       ing  example  is taken from the Perl documentation.  The numbers under-
       neath show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A backreference or a recursive call to  a  numbered  subpattern  always
       refers to the first one in the pattern with the given number.

       An  alternative approach to using this "branch reset" feature is to use
       duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

       Identifying capturing parentheses by number is simple, but  it  can  be
       very  hard  to keep track of the numbers in complicated regular expres-
       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
       change.  To help with this difficulty, PCRE supports the naming of sub-
       patterns. This feature was not added to Perl until release 5.10. Python
       had  the  feature earlier, and PCRE introduced it at release 4.0, using
       the Python syntax. PCRE now supports both the Perl and the Python  syn-
       tax.

       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
       to capturing parentheses from other parts of the pattern, such as back-
       references, recursion, and conditions, can be made by name as  well  as
       by number.

       Names  consist  of  up  to  32 alphanumeric characters and underscores.
       Named capturing parentheses are still  allocated  numbers  as  well  as
       names,  exactly as if the names were not present. The PCRE API provides
       function calls for extracting the name-to-number translation table from
       a compiled pattern. There is also a convenience function for extracting
       a captured substring by name.

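       Since the (?P<name>...) form comes from Python,  it  can  be  shown  in
       Python  directly.  In this sketch the groupindex attribute plays a role
       comparable to the name-to-number table, and  group("name")  to  extrac-
       tion by name; both are Python facilities, not the PCRE API. The names
       "year" and "month" and the subject string are made up for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2009-04")

# Named groups still receive ordinary numbers, counted left to right.
print(dict(date.groupindex))

# A captured substring can be fetched by name or by number.
print(m.group("year"), m.group(1))
print(m.group("month"), m.group(2))
```
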
       By default, a name must be unique within a pattern, but it is  possible
       to relax this constraint by setting the PCRE_DUPNAMES option at compile
       time. This can be useful for patterns where only one  instance  of  the
       named  parentheses  can  match. Suppose you want to match the name of a
       weekday, either as a 3-letter abbreviation or as the full name, and  in
       both cases you want to extract the abbreviation. This pattern (ignoring
       the line breaks) does the job:

         (?<DN>Mon|Fri|Sun)(?:day)?|
         (?<DN>Tue)(?:sday)?|
         (?<DN>Wed)(?:nesday)?|
         (?<DN>Thu)(?:rsday)?|
         (?<DN>Sat)(?:urday)?

       There are five capturing substrings, but only one is ever set  after  a
       match.  (An alternative way of solving this problem is to use a "branch
       reset" subpattern, as described in the previous section.)

       The convenience function for extracting the data by  name  returns  the
       substring  for  the first (and in this example, the only) subpattern of
       that name that matched. This saves searching  to  find  which  numbered
       subpattern  it  was. If you make a reference to a non-unique named sub-
       pattern from elsewhere in the pattern, the one that corresponds to  the
       lowest  number  is used. For further details of the interfaces for han-
       dling named subpatterns, see the pcreapi documentation.

       Warning: You cannot use different names to distinguish between two sub-
       patterns  with  the same number (see the previous section) because PCRE
       uses only the numbers when matching.


REPETITION

       Repetition is specified by quantifiers, which can  follow  any  of  the
       following items:

         a literal data character
         the dot metacharacter
         the \C escape sequence
         the \X escape sequence (in UTF-8 mode with Unicode properties)
         the \R escape sequence
         an escape such as \d that matches a single character
         a character class
         a back reference (see next section)
         a parenthesized subpattern (unless it is an assertion)

       The  general repetition quantifier specifies a minimum and maximum num-
       ber of permitted matches, by giving the two numbers in  curly  brackets
       (braces),  separated  by  a comma. The numbers must be less than 65536,
       and the first must be less than or equal to the second. For example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
       special  character.  If  the second number is omitted, but the comma is
       present, there is no upper limit; if the second number  and  the  comma
       are  both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, while

         \d{8}

       matches exactly 8 digits. An opening curly bracket that  appears  in  a
       position  where a quantifier is not allowed, or one that does not match
       the syntax of a quantifier, is taken as a literal character. For  exam-
       ple, {,6} is not a quantifier, but a literal string of four characters.

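       The three quantifier examples above can be checked  with  the  following
       sketch,  which  assumes  that  Python's  re  module  handles these forms
       identically. The {,6} case is deliberately left out:  Python's  re  reads
       {,6} as {0,6}, which differs from the PCRE behaviour described above.

```python
import re

two_to_four_z = re.compile(r"z{2,4}")          # minimum 2, maximum 4
three_plus_vowels = re.compile(r"[aeiou]{3,}")  # minimum 3, no upper limit
eight_digits = re.compile(r"\d{8}")             # exactly 8

print([bool(two_to_four_z.fullmatch("z" * n)) for n in range(1, 6)])
print(bool(three_plus_vowels.fullmatch("aeiouaeiou")))
print(bool(eight_digits.fullmatch("20090401")))
```
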
       In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
       acters, each of which is represented by a two-byte sequence. Similarly,
       when Unicode property support is available, \X{3} matches three Unicode
       extended  sequences,  each of which may be several bytes long (and they
       may be of different lengths).

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present. This may be use-
       ful for subpatterns that are referenced as subroutines  from  elsewhere
       in the pattern. Items other than subpatterns that have a {0} quantifier
       are omitted from the compiled pattern.

       For convenience, the three most common quantifiers have  single-charac-
       ter abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It  is  possible  to construct infinite loops by following a subpattern
       that can match no characters with a quantifier that has no upper limit,
       for example:

         (a?)*

       Earlier versions of Perl and PCRE used to give an error at compile time
       for such patterns. However, because there are cases where this  can  be
       useful,  such  patterns  are now accepted, but if any repetition of the
       subpattern does in fact match no characters, the loop is forcibly  bro-
       ken.

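       As a sketch of the last point, the pattern (a?)* can be run in Python's
       re  module,  which  likewise  accepts  it  and returns instead of looping
       forever. That Python breaks the loop in the same way  internally  is  an
       assumption; only the overall matched text is examined here.

```python
import re

# The group can match the empty string, yet matching still terminates.
loop = re.compile(r"(a?)*")

print(repr(loop.match("aab").group(0)))  # the two leading "a" characters
print(repr(loop.match("b").group(0)))    # an empty match at the start
```
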
       By  default,  the quantifiers are "greedy", that is, they match as much
       as possible (up to the maximum  number  of  permitted  times),  without
       causing  the  rest of the pattern to fail. The classic example of where
       this gives problems is in trying to match comments in C programs. These
       appear  between  /*  and  */ and within the comment, individual * and /
       characters may appear. An attempt to match C comments by  applying  the
       pattern

         /\*.*\*/

       to the string

         /* first comment */  not comment  /* second comment */

       fails,  because it matches the entire string owing to the greediness of
       the .*  item.

       However, if a quantifier is followed by a question mark, it  ceases  to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern

         /\*.*?\*/

       does the right thing with the C comments. The meaning  of  the  various
       quantifiers  is  not  otherwise  changed,  just the preferred number of
       matches.  Do not confuse this use of question mark with its  use  as  a
       quantifier  in its own right. Because it has two uses, it can sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

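       The C comment example and the doubled question mark can  be  reproduced
       with  the  sketch  below.  It  uses  Python's re module on the assumption
       that greedy and minimizing quantifiers behave there as described above.

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last "*/", so one match spans the whole string.
greedy = re.findall(r"/\*.*\*/", source)
print(greedy)

# Minimizing: .*? stops at the first "*/", giving each comment separately.
lazy = re.findall(r"/\*.*?\*/", source)
print(lazy)

# \d??\d prefers one digit, but takes two when the rest requires it.
print(re.match(r"\d??\d", "12").group())
print(re.fullmatch(r"\d??\d", "12").group())
```
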
       If  the PCRE_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by default, but  individual  ones
       can  be  made  greedy  by following them with a question mark. In other
       words, it inverts the default behaviour.

       When a parenthesized subpattern is quantified  with  a  minimum  repeat
       count  that is greater than 1 or with a limited maximum, more memory is
       required for the compiled pattern, in proportion to  the  size  of  the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
       alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
       the  pattern  is  implicitly anchored, because whatever follows will be
       tried against every character position in the subject string, so  there
       is  no  point  in  retrying the overall match at any position after the
       first. PCRE normally treats such a pattern as though it  were  preceded
       by \A.

       In  cases  where  it  is known that the subject string contains no new-
       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
       mization, or alternatively using ^ to indicate anchoring explicitly.

       However,  there is one situation where the optimization cannot be used.
       When .*  is inside capturing parentheses that  are  the  subject  of  a
       backreference  elsewhere  in the pattern, a match at the start may fail
       where a later one succeeds. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth  charac-
       ter. For this reason, such a pattern is not implicitly anchored.

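       The backreference case can be confirmed with a short sketch in Python's
       re  module.  It  shows  only where the match is found; it says nothing
       about PCRE's internal anchoring optimization.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets are zero-based, so offset 3 is the fourth character.
print(m.start(), m.group(1), m.group())
```

       Attempts starting at the first three characters fail because the  text
       captured by (.*) would have to reappear after "abc".
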
       When a capturing subpattern is repeated, the value captured is the sub-
       string that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is  "tweedledee".  However,  if there are nested capturing subpatterns,
       the corresponding captured values may have been set in previous  itera-
       tions. For example, after

         /(a|(b))+/

       matches "aba" the value of the second captured substring is "b".


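       Both repeated-capture examples can be run as written in  Python's  re
       module.  That  Python  keeps  values from earlier iterations in the same
       way as PCRE is an assumption of this sketch.

```python
import re

# The outer group is overwritten on every iteration; the last one wins.
tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(tweedle.group(1))

# The nested group was last set on the second iteration and keeps "b",
# even though the final iteration matched "a" through the other branch.
nested = re.fullmatch(r"(a|(b))+", "aba")
print(nested.groups())
```
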
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the  repeated  item
       to  be  re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to  prevent  this,
       either to change the nature of the match, or to cause it to fail earlier
       than it otherwise might, when the author of the pattern knows there  is
       no point in carrying on.

       Consider,  for  example, the pattern \d+foo when applied to the subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action  of  the matcher is to try again with only 5 digits matching the
       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
       the means for specifying that once a subpattern has matched, it is  not
       to be re-evaluated in this way.

       If  we  use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo" the first time.  The  notation
       is a kind of special parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
       tains once it has matched, and a failure further into  the  pattern  is
       prevented  from  backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a subpattern of  this  type  matches
       the  string  of  characters  that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
 | 
						|
       such as the above example can be thought of as a maximizing repeat that
 | 
						|
       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
 | 
						|
       pared  to  adjust  the number of digits they match in order to make the
 | 
						|
       rest of the pattern match, (?>\d+) can only match an entire sequence of
 | 
						|
       digits.
 | 
						|
 | 
						|
       Atomic  groups in general can of course contain arbitrarily complicated
       subpatterns, and can be nested. However, when  the  subpattern  for  an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier", can be  used.  This
       consists  of  an  additional  + character following a quantifier. Using
       this notation, the previous example can be rewritten as

         \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

         (abc|xyz){2,3}+

       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
       simpler  forms  of atomic group. However, there is no difference in the
       meaning of a possessive quantifier and  the  equivalent  atomic  group,
       though  there  may  be a performance difference; possessive quantifiers
       should be slightly faster.

       The possessive quantifier syntax is an extension to the Perl  5.8  syn-
       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so implemented it when he
       built  Sun's Java package, and PCRE copied it from there. It ultimately
       found its way into Perl at release 5.10.

       PCRE has an optimization that automatically "possessifies" certain sim-
       ple  pattern  constructs.  For  example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence  of  A's
       when B must follow.

       When  a  pattern  contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of  times,  the  use  of  an
       atomic  group  is  the  only way to avoid some failing matches taking a
       very long time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist  of  non-
       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
       matches, it runs quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting  failure.  This  is  because  the
       string  can be divided between the internal \D+ repeat and the external
       * repeat in a large number of ways, and all  have  to  be  tried.  (The
       example  uses  [!?]  rather than a single character at the end, because
       both PCRE and Perl have an optimization that allows  for  fast  failure
       when  a single character is used. They remember the last single charac-
       ter that is required for a match, and fail early if it is  not  present
       in  the  string.)  If  the pattern is changed so that it uses an atomic
       group, like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.

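       The quick failure of the atomic form can be tried out with the sketch
       below. It is written in Python, whose re module is not PCRE, and it
       assumes the emulation (?=(\D+))\2 in place of (?>\D+) so that it does
       not rely on native atomic-group support. The names ATOMIC, has_match
       and LONG_RUN are hypothetical, and the run length is arbitrary.

```python
import re

# The pattern in the text is ((?>\D+)|<\d+>)*[!?]. Assumption of this
# sketch: the atomic group (?>\D+) is emulated as (?=(\D+))\2, where
# group 2 holds the run of non-digits found by the lookahead.
ATOMIC = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")


def has_match(subject):
    """Report whether the emulated atomic pattern matches anywhere."""
    return ATOMIC.search(subject) is not None


# A long run of "a" with no "!" or "?"; 52 is an arbitrary length.
LONG_RUN = "a" * 52
```

       Here has_match(LONG_RUN) reports failure without trying the many ways
       of dividing the run between the inner and outer repeats, because each
       run of non-digits is taken as a single unit.
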


BACK REFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a capturing sub-
       pattern earlier (that is, to its left) in the pattern,  provided  there
       have been that many previous capturing left parentheses.

       However, if the decimal number following the backslash is less than 10,
       it is always taken as a back reference, and causes  an  error  only  if
       there  are  not that many capturing left parentheses in the entire pat-
       tern. In other words, the parentheses that are referenced need  not  be
       to  the left of the reference for numbers less than 10. A "forward back
       reference" of this type can make sense when a  repetition  is  involved
       and  the  subpattern to the right has participated in an earlier itera-
       tion.

       It is not possible to have a numerical "forward back  reference"  to  a
       subpattern  whose  number  is  10  or  more using this syntax because a
       sequence such as \50 is interpreted as a character  defined  in  octal.
       See the subsection entitled "Non-printing characters" above for further
       details of the handling of digits following a backslash.  There  is  no
       such  problem  when named parentheses are used. A back reference to any
       subpattern is possible using named parentheses (see below).

       Another way of avoiding the ambiguity inherent in  the  use  of  digits
       following a backslash is to use the \g escape sequence, which is a fea-
       ture introduced in Perl 5.10.  This  escape  must  be  followed  by  an
       unsigned  number  or  a negative number, optionally enclosed in braces.
       These examples are all identical:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the  ambigu-
       ity that is present in the older syntax. It is also useful when literal
       digits follow the reference. A negative number is a relative reference.
       Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the most recently started captur-
       ing subpattern before \g, that is, it is equivalent to  \2.  Similarly,
       \g{-2} would be equivalent to \1. The use of relative references can be
       helpful in long patterns, and also in  patterns  that  are  created  by
       joining together fragments that contain references within themselves.

       A  back  reference matches whatever actually matched the capturing sub-
       pattern in the current subject string, rather  than  anything  matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not  "sense and responsibility". If caseful matching is in force at the
       time of the back reference, the case of letters is relevant. For  exam-
       ple,

         ((?i)rah)\s+\1

       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
       original capturing subpattern is matched caselessly.

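       The first of these patterns can be checked with a short sketch. It uses
       Python's re module, a different engine from PCRE that treats a numbered
       back reference in the same way for this pattern. The names PAIR and
       is_pair are hypothetical.

```python
import re

# \1 must repeat the text that group 1 actually matched, not merely
# something else that the subpattern (sens|respons) could match.
PAIR = re.compile(r"(sens|respons)e and \1ibility")


def is_pair(subject):
    """True if the whole subject has the form described in the text."""
    return PAIR.fullmatch(subject) is not None
```

       is_pair accepts "sense and sensibility" and "response and
       responsibility" and rejects "sense and responsibility".
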
       There are several different ways of writing back  references  to  named
       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
       \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
       unified back reference syntax, in which \g can be used for both numeric
       and named references, is also supported. We  could  rewrite  the  above
       example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A  subpattern  that  is  referenced  by  name may appear in the pattern
       before or after the reference.

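       Of the forms listed, the Python one, (?P<name>...) with (?P=name), can
       be run directly in Python's re module, which is a separate engine from
       PCRE. The sketch below uses it to find an immediately repeated word;
       the group name "word" and the function name are hypothetical, and the
       pattern is an illustration rather than one taken from the text.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to whatever it
# matched. The group name "word" is a hypothetical choice.
DOUBLED = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")


def first_doubled_word(text):
    """Return the first word that is immediately repeated, or None."""
    m = DOUBLED.search(text)
    return m.group("word") if m else None
```

       Matching is caseful here, so "rah rah" is reported as a doubled word
       but "RAH rah" is not.
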
       There may be more than one back reference to the same subpattern. If  a
       subpattern  has  not actually been used in a particular match, any back
       references to it always fail. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". Because  there
       may  be  many  capturing parentheses in a pattern, all digits following
       the backslash are taken as part of a potential back  reference  number.
       If the pattern continues with a digit character, some delimiter must be
       used to terminate the back reference. If the  PCRE_EXTENDED  option  is
       set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
       ments" below) can be used.

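       Both points can be illustrated with a sketch in Python's re module,
       which is not PCRE but also fails a reference to an unset group and also
       accepts an empty (?#) comment. The names UNSET and DELIMITED are
       hypothetical.

```python
import re

# Group 2 is set only when the "bc" alternative is taken, so \2 fails
# whenever the match starts with "a".
UNSET = re.compile(r"(a|(bc))\2")

# An empty comment (?#) ends the back reference, so the "0" that follows
# is a literal digit and not part of a reference to group 10.
DELIMITED = re.compile(r"(a)\1(?#)0")
```

       UNSET matches "bcbc" but not "aa", and DELIMITED matches "aa0".
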
       A back reference that occurs inside the parentheses to which it  refers
       fails  when  the subpattern is first used, so, for example, (a\1) never
       matches.  However, such references can be useful inside  repeated  sub-
       patterns. For example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of the subpattern,  the  back  reference  matches  the  character
       string  corresponding  to  the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does  not  need
       to  match the back reference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of zero.


ASSERTIONS

       An assertion is a test on the characters  following  or  preceding  the
       current  matching  point that does not actually consume any characters.
       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
       described above.

       More  complicated  assertions  are  coded as subpatterns. There are two
       kinds: those that look ahead of the current  position  in  the  subject
       string,  and  those  that  look  behind  it. An assertion subpattern is
       matched in the normal way, except that it does not  cause  the  current
       matching position to be changed.

       Assertion  subpatterns  are  not  capturing subpatterns, and may not be
       repeated, because it makes no sense to assert the  same  thing  several
       times.  If  any kind of assertion contains capturing subpatterns within
       it, these are counted for the purposes of numbering the capturing  sub-
       patterns in the whole pattern.  However, substring capturing is carried
       out only for positive assertions, because it does not  make  sense  for
       negative assertions.

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?! for
       negative assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the  semi-
       colon in the match, and

         foo(?!bar)

       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar"  that  is  preceded  by  something
       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve the other effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is  with  (?!)  because  an  empty  string
       always  matches, so an assertion that requires there not to be an empty
       string must always fail.

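       The lookahead patterns above can be exercised with the following
       sketch. It uses Python's re module, a separate engine from PCRE that
       accepts the same lookahead syntax. The helper name match_starts and the
       sample subjects are hypothetical.

```python
import re


def match_starts(pattern, subject):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]


# \w+(?=;) leaves the semicolon out of the matched text.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# foo(?!bar) skips the "foo" of "foobar" and finds the one in "foobaz".
foo_not_before_bar = match_starts(r"foo(?!bar)", "foobar foobaz")

# (?!foo)bar still finds the "bar" in "foobar": the assertion is tested
# where "bar" begins, and the next three characters there are not "foo".
bar_anywhere = match_starts(r"(?!foo)bar", "foobar")

# (?!) can never succeed, so nothing matches.
forced_failure = match_starts(r"a(?!)", "aaa")
```
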
   Lookbehind assertions

       Lookbehind assertions start with (?<= for positive assertions and  (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted  such  that  all  the
       strings it matches must have a fixed length. However, if there are sev-
       eral top-level alternatives, they do not all  have  to  have  the  same
       fixed length. Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes  an  error at compile time. Branches that match different length
       strings are permitted only at the top level of a lookbehind  assertion.
       This  is  an  extension  compared  with  Perl (at least for 5.8), which
       requires all branches to match the same length of string. An  assertion
       such as

         (?<=ab(c|de))

       is  not  permitted,  because  its single top-level branch can match two
       different lengths, but it is acceptable if rewritten to  use  two  top-
       level branches:

         (?<=abc|abde)

       In some cases, the Perl 5.10 escape sequence \K (see above) can be used
       instead of a lookbehind assertion; this is not restricted to  a  fixed-
       length.

       The  implementation  of lookbehind assertions is, for each alternative,
       to temporarily move the current position back by the fixed  length  and
       then try to match. If there are insufficient characters before the cur-
       rent position, the assertion fails.

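       A sketch of the simple fixed-length case follows, using Python's re
       module rather than PCRE. Python requires the whole lookbehind to have a
       single fixed width, so the PCRE extension that allows top-level
       alternatives of different lengths, as in (?<=bullock|donkey), is
       deliberately not shown. The helper name match_starts is hypothetical.

```python
import re


def match_starts(pattern, subject):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]


# Only the occurrences of "bar" that are not preceded by "foo".
bar_not_after_foo = match_starts(r"(?<!foo)bar", "foobar bazbar bar")

# At offset 0 there are fewer than three characters to look back over,
# so the attempt to match "foo" behind the current position fails.
too_few_characters = match_starts(r"(?<=foo)bar", "bar")
```
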
       PCRE does not allow the \C escape (which matches a single byte in UTF-8
       mode)  to appear in lookbehind assertions, because it makes it impossi-
       ble to calculate the length of the lookbehind. The \X and  \R  escapes,
       which can match different numbers of bytes, are also not permitted.

       Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
       assertions to specify efficient matching at  the  end  of  the  subject
       string. Consider a simple pattern such as

         abcd$

       when  applied  to  a  long string that does not match. Because matching
       proceeds from left to right, PCRE will look for each "a" in the subject
       and  then  see  if what follows matches the rest of the pattern. If the
       pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this  fails
       (because there is no following "a"), it backtracks to match all but the
       last character, then all but the last two characters, and so  on.  Once
       again  the search for "a" covers the entire string, from right to left,
       so we are no better off. However, if the pattern is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item; it can  match  only  the
       entire  string.  The subsequent lookbehind assertion does a single test
       on the last four characters. If it fails, the match fails  immediately.
       For  long  strings, this approach makes a significant difference to the
       processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice  that
       each  of  the  assertions is applied independently at the same point in
       the subject string. First there is a  check  that  the  previous  three
       characters  are  all  digits,  and  then there is a check that the same
       three characters are not "999".  This pattern does not match "foo" pre-
       ceded  by  six  characters, the first three of which are digits and the
       last three of which are not "999". For example, it does not match
       "123abcfoo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This  time  the  first assertion looks at the preceding six characters,
       checking that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

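       The two patterns can be compared side by side with this sketch in
       Python's re module (not PCRE, but the same syntax applies here). The
       names SAME_THREE and SIX_BACK are hypothetical.

```python
import re

# Both assertions are tested at the same point, just before "foo", and
# both look at the same three preceding characters.
SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now spans six characters; the second still looks
# at only the last three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")
```

       SAME_THREE finds "foo" in "123foo" but not in "999foo" or "123abcfoo";
       SIX_BACK finds it in "123abcfoo" but not in "123999foo".
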
       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches  an occurrence of "baz" that is preceded by "bar" which in turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern that matches "foo" preceded by three digits and  any
       three characters that are not "999".

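       A minimal sketch of the two nested forms, again in Python's re module
       rather than PCRE; both patterns are fixed-width and so are accepted
       there as well. The constant names are hypothetical.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by
# "foo".
BAZ_AFTER_PLAIN_BAR = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are
# not "999"; the negative lookahead runs inside the lookbehind.
FOO_AFTER_DIGITS = re.compile(r"(?<=\d{3}(?!999)...)foo")
```
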


CONDITIONAL SUBPATTERNS

       It  is possible to cause the matching process to obey a subpattern con-
       ditionally or to choose between two alternative subpatterns,  depending
       on  the result of an assertion, or whether a previous capturing subpat-
       tern matched or not. The two possible forms of  conditional  subpattern
       are

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If  the  condition is satisfied, the yes-pattern is used; otherwise the
       no-pattern (if present) is used. If there are more  than  two  alterna-
       tives in the subpattern, a compile-time error occurs.

       There  are  four  kinds of condition: references to subpatterns, refer-
       ences to recursion, a pseudo-condition called DEFINE, and assertions.

   Checking for a used subpattern by number

       If the text between the parentheses consists of a sequence  of  digits,
       the  condition  is  true if the capturing subpattern of that number has
       previously matched. An alternative notation is to  precede  the  digits
       with a plus or minus sign. In this case, the subpattern number is rela-
       tive rather than absolute.  The most recently opened parentheses can be
       referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In
       looping constructs it can also make sense to refer to subsequent groups
       with constructs such as (?(+2).

       Consider  the  following  pattern, which contains non-significant white
       space to make it more readable (assume the PCRE_EXTENDED option) and to
       divide it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The  first  part  matches  an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The sec-
       ond  part  matches one or more characters that are not parentheses. The
       third part is a conditional subpattern that tests whether the first set
       of parentheses matched or not. If they did, that is, if the subject
       started with an opening parenthesis, the condition is true, and so the
       yes-pattern is executed and a closing parenthesis is required.  Other-
       wise, since no-pattern is not present, the subpattern matches nothing.
       In other words, this pattern matches  a  sequence  of  non-parentheses,
       optionally enclosed in parentheses.

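       This pattern can be tried out with the sketch below. It uses Python's
       re module, a separate engine from PCRE that also supports the
       (?(1)...) condition; re.VERBOSE stands in for PCRE_EXTENDED. The names
       OPTIONAL_PARENS and accepts are hypothetical.

```python
import re

# Whitespace in the pattern is ignored under re.VERBOSE, which plays the
# role that PCRE_EXTENDED plays in the text.
OPTIONAL_PARENS = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """True if the whole subject matches the conditional pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

       accepts returns True for "abc" and "(abc)" and False for "(abc" and
       "abc)", since a closing parenthesis is required exactly when the
       opening one was matched.
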
       If you were embedding this pattern in a larger one,  you  could  use  a
       relative reference:

         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

       This  makes  the  fragment independent of the parentheses in the larger
       pattern.

   Checking for a used subpattern by name

       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
       used  subpattern  by  name.  For compatibility with earlier versions of
       PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
       also  recognized. However, there is a possible ambiguity with this syn-
       tax, because subpattern names may  consist  entirely  of  digits.  PCRE
       looks  first for a named subpattern; if it cannot find one and the name
       consists entirely of digits, PCRE looks for a subpattern of  that  num-
       ber,  which must be greater than zero. Using subpattern names that con-
       sist entirely of digits is not recommended.

       Rewriting the above example to use a named subpattern gives this:

         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )


   Checking for pattern recursion

       If the condition is the string (R), and there is no subpattern with the
       name  R, the condition is true if a recursive call to the whole pattern
       or any subpattern has been made. If digits or a name preceded by amper-
       sand follow the letter R, for example:

         (?(R3)...) or (?(R&name)...)

       the  condition is true if the most recent recursion is into the subpat-
       tern whose number or name is given. This condition does not  check  the
       entire recursion stack.

       At  "top  level", all these recursion test conditions are false. Recur-
       sive patterns are described below.

   Defining subpatterns for use by reference only

       If the condition is the string (DEFINE), and  there  is  no  subpattern
       with  the  name  DEFINE,  the  condition is always false. In this case,
       there may be only one alternative  in  the  subpattern.  It  is  always
       skipped  if  control  reaches  this  point  in the pattern; the idea of
       DEFINE is that it can be used to define "subroutines" that can be  ref-
       erenced  from elsewhere. (The use of "subroutines" is described below.)
       For example, a pattern to match an IPv4 address could be  written  like
       this (ignore whitespace and line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The  first  part  of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component  of
       an  IPv4  address  (a number less than 256). When matching takes place,
       this part of the pattern is skipped because DEFINE acts  like  a  false
       condition.

       The rest of the pattern uses references to the named group to match the
       four dot-separated components of an IPv4 address, insisting on  a  word
       boundary at each end.

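       Engines without DEFINE can approximate the same reuse by building the
       pattern from a shared fragment. The sketch below does this in Python,
       whose re module has neither (?(DEFINE)...) nor (?&name); it is a
       substitute technique (string composition), not the PCRE mechanism
       itself. The names BYTE, IPV4 and is_ipv4 are hypothetical.

```python
import re

# One component of a dotted quad: a number less than 256, using the same
# alternatives as the "byte" group in the text.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Python's re has no (?(DEFINE)...) group or (?&name) call, so the
# fragment is reused by string composition instead.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")


def is_ipv4(subject):
    """True if the whole subject is four dot-separated byte values."""
    return IPV4.fullmatch(subject) is not None
```

       The drawback of composition is that the fragment is copied into the
       pattern at each use, whereas a DEFINE group is written once and called
       by name.
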
   Assertion conditions

       If  the  condition  is  not  in any of the above formats, it must be an
       assertion.  This may be a positive or negative lookahead or  lookbehind
       assertion.  Consider  this  pattern,  again  containing non-significant
       white space, and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The condition  is  a  positive  lookahead  assertion  that  matches  an
       optional  sequence of non-letters followed by a letter. In other words,
       it tests for the presence of at least one letter in the subject.  If  a
       letter  is found, the subject is matched against the first alternative;
       otherwise it is  matched  against  the  second.  This  pattern  matches
       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
       letters and dd are digits.

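       Python's re module, which is not PCRE, does not accept an assertion as
       a condition, but the same choice can be sketched by guarding the first
       alternative with the lookahead and the second with its negation. This
       emulation is an assumption of the sketch, and the name DATE is
       hypothetical.

```python
import re

# The text's pattern is (?(?=[^a-z]*[a-z]) yes | no ). Assumption of
# this sketch: the assertion condition is emulated by guarding the first
# alternative with the lookahead and the second with its negation.
DATE = re.compile(r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}
    )
""", re.VERBOSE)
```

       DATE.match succeeds on "12-abc-34" and on "12-34-56". It fails on
       "12-34-56 x": the trailing letter selects the first alternative, which
       does not fit, and the second alternative is then not tried.
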


COMMENTS

       The sequence (?# marks the start of a comment that continues up to  the
       next  closing  parenthesis.  Nested  parentheses are not permitted. The
       characters that make up a comment play no part in the pattern  matching
       at all.

       If  the PCRE_EXTENDED option is set, an unescaped # character outside a
       character class introduces a  comment  that  continues  to  immediately
       after the next newline in the pattern.

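       Both comment styles can be sketched in Python's re module, which is a
       separate engine from PCRE but supports (?#...) and, under re.VERBOSE
       (standing in for PCRE_EXTENDED), the # form. The names INLINE and
       EXTENDED and the sample pattern are hypothetical.

```python
import re

# Inline comment: everything from (?# to the next ")" is ignored.
INLINE = re.compile(r"\d{4}(?#year)-\d{2}(?#month)")

# Extended-mode comment: an unescaped # outside a character class starts
# a comment that runs to the end of the line.
EXTENDED = re.compile(r"""
    \d{4}   # year
    -
    \d{2}   # month
    [#]     # inside a class, the first # on this line is a literal
""", re.VERBOSE)
```

       INLINE matches "2024-05", while EXTENDED requires a trailing "#"
       because the # inside the character class is not a comment.
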


RECURSIVE PATTERNS

       Consider  the problem of matching a string in parentheses, allowing for
       unlimited nested parentheses. Without the use of  recursion,  the  best
       that  can  be  done  is  to use a pattern that matches up to some fixed
       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
       depth.

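       The fixed-depth limitation can be made concrete with a sketch that
       builds such a pattern for a chosen depth. It is written in Python,
       whose re module is not PCRE and has no recursion syntax; the function
       name nested_parens is hypothetical.

```python
import re


def nested_parens(max_depth):
    """Compile a pattern for parenthesized text nested up to max_depth."""
    # Depth 1: parentheses that contain no further parentheses.
    pattern = r"\([^()]*\)"
    for _ in range(max_depth - 1):
        # Wrap the previous level: each item inside is either a single
        # non-parenthesis character or a group of the shallower depth.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return re.compile(pattern)
```

       A pattern built for depth 2 matches "(a(b)c)" but not "(a(b(c)))"; the
       latter needs depth 3. The pattern text grows with every extra level,
       which is what recursion avoids.
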
       For some time, Perl has provided a facility that allows regular expres-
       sions to recurse (amongst other things). It does this by  interpolating
       Perl  code in the expression at run time, and the code can refer to the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (??{$re}) )* \)}x;

       The (??{...}) item interpolates Perl code at run time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
       it  supports  special  syntax  for recursion of the entire pattern, and
       also for individual subpattern recursion.  After  its  introduction  in
       PCRE  and  Python,  this  kind of recursion was introduced into Perl at
       release 5.10.

       A special item that consists of (? followed by a  number  greater  than
       zero and a closing parenthesis is a recursive call of the subpattern of
       the given number, provided that it occurs inside that  subpattern.  (If
       not,  it  is  a  "subroutine" call, which is described in the next sec-
       tion.) The special item (?R) or (?0) is a recursive call of the  entire
       regular expression.

       In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
       always treated as an atomic group. That is, once it has matched some of
       the subject string, it is never re-entered, even if it contains untried
       alternatives and there is a subsequent matching failure.

       This PCRE pattern solves the nested  parentheses  problem  (assume  the
       PCRE_EXTENDED option is set so that white space is ignored):

         \( ( (?>[^()]+) | (?R) )* \)

       First  it matches an opening parenthesis. Then it matches any number of
       substrings which can either be a  sequence  of  non-parentheses,  or  a
       recursive  match  of the pattern itself (that is, a correctly parenthe-
       sized substring).  Finally there is a closing parenthesis.

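       The three steps just described can be mirrored by a small hand-written
       function. The sketch below is plain Python with no regular expression
       at all; it illustrates what the pattern accepts and is not a
       description of how PCRE implements recursion. The function name
       match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced group at pos, or -1."""
    # Step 1: an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # Step 2: any number of non-parentheses or nested balanced groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # Nested group: the same routine calls itself, as (?R) does.
            pos = match_parens(subject, pos)
            if pos < 0:
                return -1
        else:
            pos += 1
    # The subject ran out before the closing parenthesis was found.
    return -1
```

       For "(ab(cd)ef)" the function returns 10, the length of the whole
       string, and for the unbalanced "(a(b)" it returns -1.
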
       If this were part of a larger pattern, you would not  want  to  recurse
 | 
						|
       the entire pattern, so instead you could use this:
 | 
						|
 | 
						|
         ( \( ( (?>[^()]+) | (?1) )* \) )
 | 
						|
 | 
						|
       We  have  put the pattern into parentheses, and caused the recursion to
 | 
						|
       refer to them instead of the whole pattern.
 | 
						|
 | 
						|
       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
 | 
						|
       tricky.  This is made easier by the use of relative references. (A Perl
 | 
						|
       5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write
 | 
						|
       (?-2) to refer to the second most recently opened parentheses preceding
 | 
						|
       the recursion. In other  words,  a  negative  number  counts  capturing
 | 
						|
       parentheses leftwards from the point at which it is encountered.
 | 
						|
 | 
						|
       It  is  also  possible  to refer to subsequently opened parentheses, by
 | 
						|
       writing references such as (?+2). However, these  cannot  be  recursive
 | 
						|
       because  the  reference  is  not inside the parentheses that are refer-
 | 
						|
       enced. They are always "subroutine" calls, as  described  in  the  next
 | 
						|
       section.
 | 
						|
 | 
						|
       An  alternative  approach is to use named parentheses instead. The Perl
 | 
						|
       syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
 | 
						|
       supported. We could rewrite the above example as follows:
 | 
						|
 | 
						|
         (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
 | 
						|
 | 
						|
       If  there  is more than one subpattern with the same name, the earliest
 | 
						|
       one is used.
 | 
						|
 | 
						|
       This particular example pattern that we have been looking  at  contains
 | 
						|
       nested  unlimited repeats, and so the use of atomic grouping for match-
 | 
						|
       ing strings of non-parentheses is important when applying  the  pattern
 | 
						|
       to strings that do not match. For example, when this pattern is applied
 | 
						|
       to
 | 
						|
 | 
						|
         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
 | 
						|
 | 
						|
       it yields "no match" quickly. However, if atomic grouping is not  used,
 | 
						|
       the  match  runs  for a very long time indeed because there are so many
       different ways the + and * repeats can carve up the  subject,  and  all
       have to be tested before failure can be reported.

       At the end of a match, the values set for any capturing subpatterns are
       those from the outermost level of the recursion at which the subpattern
       value  is  set.   If  you want to obtain intermediate values, a callout
       function can be used (see below and the pcrecallout documentation).  If
       the pattern above is matched against

         (ab(cd)ef)

       the  value  for  the  capturing  parentheses is "ef", which is the last
       value taken on at the top level. If additional parentheses  are  added,
       giving

         \( ( ( (?>[^()]+) | (?R) )* ) \)
            ^                        ^
            ^                        ^

       the  string  they  capture is "ab(cd)ef", the contents of the top level
       parentheses. If there are more than 15 capturing parentheses in a  pat-
       tern, PCRE has to obtain extra memory to store data during a recursion,
       which it does by using pcre_malloc, freeing  it  via  pcre_free  after-
       wards.  If  no  memory  can  be  obtained,  the  match  fails  with the
       PCRE_ERROR_NOMEMORY error.

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion.   Consider  this pattern, which matches text in angle brack-
       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
       brackets  (that is, when recursing), whereas any characters are permit-
       ted at the outer level.

         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional  subpattern,  with
       two  different  alternatives for the recursive and non-recursive cases.
       The (?R) item is the actual recursive call.


SUBPATTERNS AS SUBROUTINES

       If the syntax for a recursive subpattern reference (either by number or
       by  name)  is used outside the parentheses to which it refers, it oper-
       ates like a subroutine in a programming language. The "called"  subpat-
       tern may be defined before or after the reference. A numbered reference
       can be absolute or relative, as in these examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is  used, it does match "sense and responsibility" as well as the other
       two strings. Another example is  given  in  the  discussion  of  DEFINE
       above.
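
       The difference between a back reference and a subroutine  call  can  be
       tried  out with Python's re module, used here only for illustration.
       That module has no (?1) syntax, so the sketch below  assumes  that  the
       call can be imitated by repeating the text of group 1; the variable
       names are hypothetical.

```python
import re

# \1 must repeat the exact text that group 1 captured.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (sens|respons)e and (?1)ibility: the text of group 1 is
# written out again, so either alternative may match the second time.
call_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
results = {
    s: (backref.fullmatch(s) is not None, call_like.fullmatch(s) is not None)
    for s in subjects
}

for subject, (by_backref, by_call) in results.items():
    print(f"{subject!r}: back reference={by_backref}, call={by_call}")
```

       Repeating the text is only an approximation of a real call: the  copy
       is an ordinary group, and it picks up whatever options are in force at
       the point where it is written.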

       Like recursive subpatterns, a "subroutine" call is always treated as an
       atomic group. That is, once it has matched some of the subject  string,
       it  is  never  re-entered, even if it contains untried alternatives and
       there is a subsequent matching failure.

       When a subpattern is used as a subroutine, processing options  such  as
       case-independence are fixed when the subpattern is defined. They cannot
       be changed for different calls. For example, consider this pattern:

         (abc)(?i:(?-1))

       It matches "abcabc". It does not match "abcABC" because the  change  of
       processing option does not affect the called subpattern.


ONIGURUMA SUBROUTINE SYNTAX

       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an  alternative  syntax  for  referencing a subpattern as a subroutine,
       possibly recursively. Here are two of the examples used above,  rewrit-
       ten using this syntax:

         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility

       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
       plus or a minus sign it is taken as a relative reference. For example:

         (abc)(?i:\g<-1>)

       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
       synonymous.  The former is a back reference; the latter is a subroutine
       call.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
       Perl  code to be obeyed in the middle of matching a regular expression.
       This makes it possible, amongst other things, to extract different sub-
       strings that match the same pair of parentheses when there is a repeti-
       tion.

       PCRE provides a similar feature, but of course it cannot obey arbitrary
       Perl code. The feature is called "callout". The caller of PCRE provides
       an external function by putting its entry point in the global  variable
       pcre_callout.   By default, this variable contains NULL, which disables
       all calling out.

       Within a regular expression, (?C) indicates the  points  at  which  the
       external  function  is  to be called. If you want to identify different
       callout points, you can put a number less than 256 after the letter  C.
       The  default  value is zero.  For example, this pattern has two callout
       points:

         (?C1)abc(?C2)def

       If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
       automatically  installed  before each item in the pattern. They are all
       numbered 255.

       During matching, when PCRE reaches a callout point (and pcre_callout is
       set),  the  external function is called. It is provided with the number
       of the callout, the position in the pattern, and, optionally, one  item
       of  data  originally supplied by the caller of pcre_exec(). The callout
       function may cause matching to proceed, to backtrack, or to fail  alto-
       gether. A complete description of the interface to the callout function
       is given in the pcrecallout documentation.


BACKTRACKING CONTROL

       Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
       which are described in the Perl documentation as "experimental and sub-
       ject to change or removal in a future version of Perl". It goes  on  to
       say:  "Their usage in production code should be noted to avoid problems
       during upgrades." The same remarks apply to the PCRE features described
       in this section.

       Since  these  verbs  are  specifically related to backtracking, most of
       them can be  used  only  when  the  pattern  is  to  be  matched  using
       pcre_exec(), which uses a backtracking algorithm. With the exception of
       (*FAIL), which behaves like a failing negative assertion, they cause an
       error if encountered by pcre_dfa_exec().

       The  new verbs make use of what was previously invalid syntax: an open-
       ing parenthesis followed by an asterisk. In Perl, they are generally of
       the form (*VERB:ARG) but PCRE does not support the use of arguments, so
       its general form is just (*VERB). Any number of these verbs  may  occur
       in a pattern. There are two kinds:

   Verbs that act immediately

       The following verbs act as soon as they are encountered:

          (*ACCEPT)

       This  verb causes the match to end successfully, skipping the remainder
       of the pattern. When inside a recursion, only the innermost pattern  is
       ended  immediately.  PCRE  differs  from  Perl  in  what happens if the
       (*ACCEPT) is inside capturing parentheses. In Perl, the data so far  is
       captured: in PCRE no data is captured. For example:

         A(A|B(*ACCEPT)|C)D

       This  matches  "AB", "AAD", or "ACD", but when it matches "AB", no data
       is captured.

         (*FAIL) or (*F)

       This verb causes the match to fail, forcing backtracking to  occur.  It
       is  equivalent to (?!) but easier to read. The Perl documentation notes
       that it is probably useful only when combined  with  (?{})  or  (??{}).
       Those  are,  of course, Perl features that are not present in PCRE. The
       nearest equivalent is the callout feature, as for example in this  pat-
       tern:

         a+(?C)(*FAIL)

       A  match  with the string "aaaa" always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).
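
       The figure of 10 can be checked by counting. The Python sketch below is
       a hand simulation, not a call into PCRE: it assumes that  no  start-of-
       match  optimization is applied, so that every starting offset is tried,
       and the name count_callouts is hypothetical.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL), unanchored."""
    calls = 0
    for start in range(len(subject)):         # one attempt per starting offset
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time;
        # the callout is reached once for each length before (*FAIL) rejects it
        for _length in range(run, 0, -1):
            calls += 1
    return calls


print(count_callouts("aaaa"))
```

       For "aaaa" the four starting offsets contribute 4, 3, 2, and 1 callouts
       respectively.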

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered. Matching con-
       tinues  with what follows, but if there is no subsequent match, a fail-
       ure is forced.  The verbs  differ  in  exactly  what  kind  of  failure
       occurs.

         (*COMMIT)

       This  verb  causes  the whole match to fail outright if the rest of the
       pattern does not match. Even if the pattern is unanchored,  no  further
       attempts  to find a match by advancing the start point take place. Once
       (*COMMIT) has been passed, pcre_exec() is committed to finding a  match
       at the current starting point, or not at all. For example:

         a+(*COMMIT)b

       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
       of dynamic anchor, or "I've started, so I must finish."

         (*PRUNE)

       This verb causes the match to fail at the current position if the  rest
       of the pattern does not match. If the pattern is unanchored, the normal
       "bumpalong" advance to the next starting character then happens.  Back-
       tracking  can  occur as usual to the left of (*PRUNE), or when matching
       to the right of (*PRUNE), but if there is no match to the right,  back-
       tracking  cannot  cross (*PRUNE).  In simple cases, the use of (*PRUNE)
       is just an alternative to an atomic group or possessive quantifier, but
       there  are  some uses of (*PRUNE) that cannot be expressed in any other
       way.

         (*SKIP)

       This verb is like (*PRUNE), except that if the pattern  is  unanchored,
       the  "bumpalong" advance is not to the next character, but to the posi-
       tion in the subject where (*SKIP) was  encountered.  (*SKIP)  signifies
       that  whatever  text  was  matched leading up to it cannot be part of a
       successful match. Consider:

         a+(*SKIP)b

       If the subject is "aaaac...",  after  the  first  match  attempt  fails
       (starting  at  the  first  character in the string), the starting point
       skips on to start the next attempt at "c". Note that a possessive quan-
       tifier does not have the same effect in this example; although it would
       suppress backtracking  during  the  first  match  attempt,  the  second
       attempt  would  start at the second character instead of skipping on to
       "c".

         (*THEN)

       This verb causes a skip to the next alternative if the rest of the pat-
       tern does not match. That is, it cancels pending backtracking, but only
       within the current alternative. Its name  comes  from  the  observation
       that it can be used for a pattern-based if-then-else block:

         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If  the COND1 pattern matches, FOO is tried (and possibly further items
       after the end of the group if FOO succeeds);  on  failure  the  matcher
       skips  to  the second alternative and tries COND2, without backtracking
       into COND1. If (*THEN) is used outside  of  any  alternation,  it  acts
       exactly like (*PRUNE).
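
       The effect of (*COMMIT), (*PRUNE), and (*SKIP) on the starting point
       can be traced by hand. The following Python sketch simulates the  pat-
       terns of the form a+(*VERB)b shown above; it does not call PCRE, the
       function name starts_tried is hypothetical, and  it  assumes  that  a
       new attempt is made at each offset short of the end of the subject.

```python
def starts_tried(subject, verb):
    """Simulate an unanchored a+(*VERB)b; return (offsets tried, match)."""
    if verb not in ("COMMIT", "PRUNE", "SKIP"):
        raise ValueError("unsupported verb: " + verb)
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                      # greedy a+
        if end == start:
            start += 1                    # verb never reached: normal bumpalong
            continue
        if end < len(subject) and subject[end] == "b":
            return tried, subject[start:end + 1]
        # "b" failed after the verb was passed, so a+ may not give anything back
        if verb == "COMMIT":
            return tried, None            # no further starting points at all
        start = end if verb == "SKIP" else start + 1
    return tried, None


for verb in ("COMMIT", "PRUNE", "SKIP"):
    print(verb, starts_tried("aaaac", verb))
```

       With the subject "aaaac", (*PRUNE) retries at every offset, (*SKIP)
       jumps from offset 0 straight to the "c", and (*COMMIT) gives up after
       the first attempt.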


SEE ALSO

       pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 11 April 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------


PCRESYNTAX(3)                                                    PCRESYNTAX(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION SYNTAX SUMMARY

       The  full syntax and semantics of the regular expressions that are sup-
       ported by PCRE are described in  the  pcrepattern  documentation.  This
       document contains just a quick-reference summary of the syntax.


QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal


CHARACTERS

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any character
         \e         escape (hex 1B)
         \f         formfeed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \ddd       character with octal code ddd, or backreference
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..


CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one byte, even in UTF-8 mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal whitespace character
         \H         a character that is not a horizontal whitespace character
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a whitespace character
         \S         a character that is not a whitespace character
         \v         a vertical whitespace character
         \V         a character that is not a vertical whitespace character
         \w         a "word" character
         \W         a "non-word" character
         \X         an extended Unicode sequence

       In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.


GENERAL CATEGORY PROPERTY CODES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator


SCRIPT NAMES FOR \p AND \P

       Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
       Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu-
       neiform,  Cypriot,  Cyrillic,  Deseret, Devanagari, Ethiopic, Georgian,
       Glagolitic, Gothic, Greek, Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,
       Hebrew,  Hiragana,  Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi,
       Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian,  Malayalam,
       Mongolian,  Myanmar,  New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian,
       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurash-
       tra,  Shavian,  Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tag-
       banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
       Ugaritic, Vai, Yi.


CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       whitespace
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE, POSIX character set names recognize only ASCII characters. You
       can use \Q...\E inside a character class.


QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
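
       The difference between greedy and lazy quantifiers shows up in how much
       of  the  subject  a match consumes. This sketch uses Python's re module
       purely for illustration, on the assumption that  its  greedy  and  lazy
       quantifiers behave like the ones listed here; the possessive forms are
       left out because older versions of that module do not accept them.

```python
import re

subject = "aaaa"
patterns = ("a?", "a??", "a*", "a*?", "a+", "a+?",
            "a{2}", "a{2,3}", "a{2,3}?", "a{2,}", "a{2,}?")

# Text consumed by each quantifier when matching at the start of the subject.
matched = {p: re.match(p, subject).group() for p in patterns}

for pattern, text in matched.items():
    print(f"{pattern:8} matched {text!r}")
```

       Each lazy form stops at the minimum its quantifier allows, while  each
       greedy form takes as much as it can.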


ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary (only ASCII letters recognized)
         \B          not a word boundary
         ^           start of subject
                      also after internal newline in multiline mode
         \A          start of subject
         $           end of subject
                      also before newline at end of subject
                      also before internal newline in multiline mode
         \Z          end of subject
                      also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject


MATCH POINT RESET

         \K          reset start of match


ALTERNATION

         expr|expr|expr...


CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative
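
       Named and non-capturing groups can be illustrated with Python's re mod-
       ule, which accepts the Python form (?P<name>...) listed above; the oth-
       er named forms and (?|...) are not assumed to be available  there.  The
       pattern and the group names are made up for this example.

```python
import re

# Two named capturing groups and one non-capturing group.
pattern = re.compile(r"(?P<day>\d\d?) (?:of )?(?P<month>[A-Z][a-z]+)")
match = pattern.search("Last updated: 11 April 2009")

by_name = match.groupdict()    # captured text keyed by group name
by_number = match.groups()     # the non-capturing group takes no number
print(by_name, by_number)
```

       Only the two named groups are numbered; the (?:...) group contributes
       nothing to the list of captured substrings.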


ATOMIC GROUPS

         (?>...)         atomic, non-capturing group


COMMENT

         (?#....)        comment (not nestable)


OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following is recognized only at the start of a pattern or after one
       of the newline-setting options with similar syntax:

         (*UTF8)         set UTF-8 mode


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.
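
       Assertions test the text around the current position without  consuming
       it.  The  sketch below uses Python's re module only as an illustration;
       its look behind rule is stricter than the one stated  above  (all  al-
       ternatives  must  have the same length), and the subject strings are
       made up for the example.

```python
import re

text = "10 dollars, 20 euros, 30 dollars"

# Numbers followed by " dollars"; the unit is checked but not consumed.
ahead = re.findall(r"\d+(?= dollars)", text)

# Whole numbers that are not followed by " euros".
not_ahead = re.findall(r"\b\d+\b(?! euros)", text)

# Digits immediately preceded by the fixed-length text "USD".
behind = re.findall(r"(?<=USD)\d+", "USD100 EUR200 USD300")

# A look behind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(ahead, not_ahead, behind, variable_ok)
```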


BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE extension)
         \g'+n'          call subpattern by relative number (PCRE extension)
         \g<-n>          call subpattern by relative number (PCRE extension)
         \g'-n'          call subpattern by relative number (PCRE extension)


CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)...        absolute reference condition
         (?(+n)...       relative reference condition
         (?(-n)...       relative reference condition
         (?(<name>)...   named reference condition (Perl)
         (?('name')...   named reference condition (Perl)
         (?(name)...     named reference condition (PCRE)
         (?(R)...        overall recursion condition
         (?(Rn)...       specific group recursion condition
         (?(R&name)...   specific recursion condition
         (?(DEFINE)...   define subpattern for reference
         (?(assert)...   assertion condition


BACKTRACKING CONTROL

       The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)

       The  following  act only when a subsequent match failure causes a back-
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance start to current matching position
         (*THEN)         local failure, backtrack to next alternative


NEWLINE CONVENTIONS

       These are recognized only at the very start of the pattern or  after  a
       (*BSR_...) or (*UTF8) option.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence


WHAT \R MATCHES

       These  are  recognized only at the very start of the pattern or after a
       (*...) option that sets the newline convention or UTF-8 mode.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence


CALLOUTS

         (?C)      callout
         (?Cn)     callout with data n


SEE ALSO

       pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 11 April 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------


PCREPARTIAL(3)                                                  PCREPARTIAL(3)


NAME
       PCRE - Perl-compatible regular expressions


PARTIAL MATCHING IN PCRE

       In  normal  use  of  PCRE,  if  the  subject  string  that is passed to
       pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
       short  to  match  the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
       There are circumstances where it might be helpful to  distinguish  this
       case from other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error as soon as a mistake is made, possibly beeping and not
       reflecting the character that has been typed. This  immediate  feedback
       is  likely  to  be a better user interface than a check that is delayed
       until the entire string has been entered.
 | 
						|
 | 
						|
       PCRE supports the concept of partial matching by means of the PCRE_PAR-
 | 
						|
       TIAL   option,   which   can   be   set  when  calling  pcre_exec()  or
 | 
						|
       pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
 | 
						|
       PCRE_ERROR_NOMATCH  is converted into PCRE_ERROR_PARTIAL if at any time
 | 
						|
       during the matching process the last part of the subject string matched
 | 
						|
       part  of  the  pattern. Unfortunately, for non-anchored matching, it is
 | 
						|
       not possible to obtain the position of the start of the partial  match.
 | 
						|
       No captured data is set when PCRE_ERROR_PARTIAL is returned.
 | 
						|
 | 
						|
       When   PCRE_PARTIAL   is  set  for  pcre_dfa_exec(),  the  return  code
 | 
						|
       PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the  end  of
 | 
						|
       the  subject is reached, there have been no complete matches, but there
 | 
						|
       is still at least one matching possibility. The portion of  the  string
 | 
						|
       that provided the partial match is set as the first matching string.
 | 
						|
 | 
						|
       Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
 | 
						|
       the last literal byte in a pattern, and abandons  matching  immediately
 | 
						|
       if  such a byte is not present in the subject string. This optimization
 | 
						|
       cannot be used for a subject string that might match only partially.
 | 
						|
 | 
						|
 | 
						|
RESTRICTED PATTERNS FOR PCRE_PARTIAL
 | 
						|
 | 
						|
       Because of the way certain internal optimizations  are  implemented  in
 | 
						|
       the  pcre_exec()  function, the PCRE_PARTIAL option cannot be used with
 | 
						|
       all patterns. These restrictions do not apply when  pcre_dfa_exec()  is
 | 
						|
       used.  For pcre_exec(), repeated single characters such as
 | 
						|
 | 
						|
         a{2,4}
 | 
						|
 | 
						|
       and repeated single metasequences such as
 | 
						|
 | 
						|
         \d+
 | 
						|
 | 
						|
       are  not permitted if the maximum number of occurrences is greater than
 | 
						|
       one.  Optional items such as \d? (where the maximum is one) are permit-
 | 
						|
       ted.   Quantifiers  with any values are permitted after parentheses, so
 | 
						|
       the invalid examples above can be coded thus:
 | 
						|
 | 
						|
         (a){2,4}
 | 
						|
         (\d)+
 | 
						|
 | 
						|
       These constructions run more slowly, but for the kinds  of  application
 | 
						|
       that  are  envisaged  for this facility, this is not felt to be a major
 | 
						|
       restriction.
 | 
						|
 | 
						|
       If PCRE_PARTIAL is set for a pattern that does not conform to the
       restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
       (-13). To find out in advance whether a compiled pattern can be used
       for partial matching, call pcre_fullinfo() with PCRE_INFO_OKPARTIAL.
 | 
						|
 | 
						|
 | 
						|
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
 | 
						|
 | 
						|
       If  the  escape  sequence  \P  is  present in a pcretest data line, the
 | 
						|
       PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
 | 
						|
       uses the date example quoted above:
 | 
						|
 | 
						|
           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match
         data> 3ju\P
         Partial match
         data> 3juj\P
         No match
         data> j\P
         No match
 | 
						|
 | 
						|
       The  first  data  string  is  matched completely, so pcretest shows the
 | 
						|
       matched substrings. The remaining four strings do not  match  the  com-
 | 
						|
       plete  pattern,  but  the first two are partial matches. The same test,
 | 
						|
       using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),
 | 
						|
       produces the following output:
 | 
						|
 | 
						|
           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P\D
          0: 25jun04
         data> 23dec3\P\D
         Partial match: 23dec3
         data> 3ju\P\D
         Partial match: 3ju
         data> 3juj\P\D
         No match
         data> j\P\D
         No match
 | 
						|
 | 
						|
       Notice  that in this case the portion of the string that was matched is
 | 
						|
       made available.
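
       The three outcomes in the pcretest runs above (complete match, partial
       match, no match) can be illustrated without PCRE. The sketch below is a
       hand-written check for the ddmmmyy date format only; it is not how PCRE
       works internally, and the names classify(), try_day() and the three
       enum constants are illustrative assumptions. It is meant to show what
       "potentially valid so far" means for the five data lines used above.

```c
#include <ctype.h>
#include <string.h>

enum { NO_MATCH, PARTIAL, COMPLETE };

static const char *months[] = { "jan", "feb", "mar", "apr", "may", "jun",
                                "jul", "aug", "sep", "oct", "nov", "dec" };

/* Check s against: ndigits day digits, a month name, two year digits. */
static int try_day(const char *s, size_t ndigits)
{
    size_t len = strlen(s), i, m, have;
    const char *rest;

    for (i = 0; i < ndigits; i++) {
        if (i == len) return PARTIAL;          /* input ended, still valid */
        if (!isdigit((unsigned char)s[i])) return NO_MATCH;
    }
    rest = s + ndigits;
    len -= ndigits;
    have = len < 3 ? len : 3;                  /* month letters typed so far */
    for (m = 0; m < 12; m++)
        if (strncmp(rest, months[m], have) == 0) break;
    if (m == 12) return NO_MATCH;
    if (have < 3) return PARTIAL;
    for (i = 3; i < 5; i++) {                  /* two year digits */
        if (i == len) return PARTIAL;
        if (!isdigit((unsigned char)rest[i])) return NO_MATCH;
    }
    return len == 5 ? COMPLETE : NO_MATCH;     /* anything extra is an error */
}

/* The day may have one or two digits, so try both readings. */
static int classify(const char *s)
{
    int two = try_day(s, 2), one = try_day(s, 1);
    if (two == COMPLETE || one == COMPLETE) return COMPLETE;
    if (two == PARTIAL || one == PARTIAL) return PARTIAL;
    return NO_MATCH;
}
```

       An application that checks each keystroke would accept the character
       while the result is PARTIAL or COMPLETE and reject it on NO_MATCH. With
       PCRE the same decision is driven by the return code of the matching
       function when PCRE_PARTIAL is set.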
 | 
						|
 | 
						|
 | 
						|
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
 | 
						|
 | 
						|
       When a partial match has been found using pcre_dfa_exec(), it is possi-
 | 
						|
       ble  to  continue  the  match  by providing additional subject data and
 | 
						|
       calling pcre_dfa_exec() again with the same  compiled  regular  expres-
 | 
						|
       sion, this time setting the PCRE_DFA_RESTART option. You must also pass
 | 
						|
       the same working space as before, because this is where details of  the
 | 
						|
       previous  partial  match are stored. Here is an example using pcretest,
 | 
						|
       using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
 | 
						|
       \D are as above):
 | 
						|
 | 
						|
           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05
 | 
						|
 | 
						|
       The  first  call has "23ja" as the subject, and requests partial match-
 | 
						|
       ing; the second call  has  "n05"  as  the  subject  for  the  continued
 | 
						|
       (restarted)  match.   Notice  that when the match is complete, only the
 | 
						|
       last part is shown; PCRE does  not  retain  the  previously  partially-
 | 
						|
       matched  string. It is up to the calling program to do that if it needs
 | 
						|
       to.
 | 
						|
 | 
						|
       You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial
 | 
						|
       matching over multiple segments. This facility can be used to pass very
 | 
						|
       long subject strings to pcre_dfa_exec(). However, some care  is  needed
 | 
						|
       for certain types of pattern.
 | 
						|
 | 
						|
       1.  If  the  pattern contains tests for the beginning or end of a line,
 | 
						|
       you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
 | 
						|
       ate,  when  the subject string for any call does not contain the begin-
 | 
						|
       ning or end of a line.
 | 
						|
 | 
						|
       2. If the pattern contains backward assertions (including  \b  or  \B),
 | 
						|
       you  need  to  arrange for some overlap in the subject strings to allow
 | 
						|
       for this. For example, you could pass the subject in  chunks  that  are
 | 
						|
       500  bytes long, but in a buffer of 700 bytes, with the starting offset
 | 
						|
       set to 200 and the previous 200 bytes at the start of the buffer.
 | 
						|
 | 
						|
       3. Matching a subject string that is split into multiple segments  does
 | 
						|
       not  always produce exactly the same result as matching over one single
 | 
						|
       long string.  The difference arises when there  are  multiple  matching
 | 
						|
       possibilities,  because a partial match result is given only when there
 | 
						|
       are no completed matches in a call to pcre_dfa_exec(). This means  that
 | 
						|
       as  soon  as  the  shortest match has been found, continuation to a new
 | 
						|
       subject segment is no longer possible.  Consider this pcretest example:
 | 
						|
 | 
						|
           re> /dog(sbody)?/
         data> do\P\D
         Partial match: do
         data> gsb\R\P\D
          0: g
         data> dogsbody\D
          0: dogsbody
          1: dog
 | 
						|
 | 
						|
       The pattern matches the words "dog" or "dogsbody". When the subject  is
 | 
						|
       presented  in  several  parts  ("do" and "gsb" being the first two) the
 | 
						|
       match stops when "dog" has been found, and it is not possible  to  con-
 | 
						|
       tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single
 | 
						|
       string, both matches are found.
 | 
						|
 | 
						|
       Because of this phenomenon, it does not usually make  sense  to  end  a
 | 
						|
       pattern that is going to be matched in this way with a variable repeat.
 | 
						|
 | 
						|
       4. Patterns that contain alternatives at the top level which do not all
 | 
						|
       start with the same pattern item may not work as expected. For example,
 | 
						|
       consider this pattern:
 | 
						|
 | 
						|
         1234|3789
 | 
						|
 | 
						|
       If  the  first  part of the subject is "ABC123", a partial match of the
 | 
						|
       first alternative is found at offset 3. There is no partial  match  for
 | 
						|
       the second alternative, because such a match does not start at the same
 | 
						|
       point in the subject string. Attempting to  continue  with  the  string
 | 
						|
       "789" does not yield a match because only those alternatives that match
 | 
						|
       at one point in the subject are remembered. The problem arises  because
 | 
						|
       the  start  of the second alternative matches within the first alterna-
 | 
						|
       tive. There is no problem with anchored patterns or patterns such as:
 | 
						|
 | 
						|
         1234|ABCD
 | 
						|
 | 
						|
       where no string can be a partial match for both alternatives.
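
       The overlap arrangement described in item 2 can be sketched in plain C.
       The helper below handles only the buffer: it keeps the tail of the
       previous data at the front, appends the next chunk, and returns the
       starting offset to use for the next call. The name next_window() and
       the OVERLAP and CHUNK constants are illustrative assumptions taken from
       the 200/500 byte figures in the text; the call to pcre_dfa_exec()
       itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OVERLAP 200   /* earlier bytes kept so backward assertions can see them */
#define CHUNK   500   /* new bytes examined per call */

/* buf must have room for OVERLAP + CHUNK bytes. prev_len is the number of
   valid bytes currently in buf (0 before the first chunk). On return,
   *total_len is the new number of valid bytes, and the return value is the
   offset in buf at which the new chunk begins. */
static size_t next_window(char *buf, size_t prev_len,
                          const char *chunk, size_t chunk_len,
                          size_t *total_len)
{
    size_t keep = prev_len < OVERLAP ? prev_len : OVERLAP;

    assert(chunk_len <= CHUNK);
    memmove(buf, buf + (prev_len - keep), keep);   /* slide the tail forward */
    memcpy(buf + keep, chunk, chunk_len);          /* append the new data */
    *total_len = keep + chunk_len;
    return keep;
}
```

       The caller would pass buf and *total_len as the subject and its length,
       and the returned value as the starting offset, so that matching resumes
       at the new data while the preceding bytes remain visible.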
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 04 June 2007
 | 
						|
       Copyright (c) 1997-2007 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions
 | 
						|
 | 
						|
 | 
						|
SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
 | 
						|
 | 
						|
       If  you  are running an application that uses a large number of regular
 | 
						|
       expression patterns, it may be useful to store them  in  a  precompiled
 | 
						|
       form  instead  of  having to compile them every time the application is
 | 
						|
       run.  If you are not  using  any  private  character  tables  (see  the
 | 
						|
       pcre_maketables()  documentation),  this is relatively straightforward.
 | 
						|
       If you are using private tables, it is a little bit more complicated.
 | 
						|
 | 
						|
       If you save compiled patterns to a file, you can copy them to a differ-
 | 
						|
       ent  host  and  run them there. This works even if the new host has the
 | 
						|
       opposite endianness to the one on which  the  patterns  were  compiled.
 | 
						|
       There  may  be a small performance penalty, but it should be insignifi-
 | 
						|
       cant. However, compiling regular expressions with one version  of  PCRE
 | 
						|
       for  use  with  a  different  version is not guaranteed to work and may
 | 
						|
       cause crashes.
 | 
						|
 | 
						|
 | 
						|
SAVING A COMPILED PATTERN
 | 
						|
       The value returned by pcre_compile() points to a single block of memory
 | 
						|
       that  holds  the compiled pattern and associated data. You can find the
 | 
						|
       length of this block in bytes by calling pcre_fullinfo() with an  argu-
 | 
						|
       ment  of  PCRE_INFO_SIZE. You can then save the data in any appropriate
 | 
						|
       manner. Here is sample code that compiles a pattern and writes it to  a
 | 
						|
       file. It assumes that the variable fd refers to a file that is open for
 | 
						|
       output:
 | 
						|
 | 
						|
         int erroroffset, rc;
         size_t size, written;   /* PCRE_INFO_SIZE yields a size_t */
         const char *error;
         pcre *re;

         /* fd must be a FILE * open for binary output, e.g. fopen(name, "wb") */
         re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
         if (re == NULL) { ... handle errors ... }
         rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
         if (rc < 0) { ... handle errors ... }
         written = fwrite(re, 1, size, fd);
         if (written != size) { ... handle errors ... }
 | 
						|
 | 
						|
       In this example, the bytes  that  comprise  the  compiled  pattern  are
 | 
						|
       copied  exactly.  Note that this is binary data that may contain any of
 | 
						|
       the 256 possible byte  values.  On  systems  that  make  a  distinction
 | 
						|
       between binary and non-binary data, be sure that the file is opened for
 | 
						|
       binary output.
 | 
						|
 | 
						|
       If you want to write more than one pattern to a file, you will have  to
 | 
						|
       devise  a  way of separating them. For binary data, preceding each pat-
 | 
						|
       tern with its length is probably  the  most  straightforward  approach.
 | 
						|
       Another  possibility is to write out the data in hexadecimal instead of
 | 
						|
       binary, one pattern to a line.
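
       The length-prefix approach can be sketched with standard C I/O alone.
       The helpers below treat each compiled pattern as an opaque block of
       bytes; the names write_record() and read_record() are illustrative
       assumptions, not part of PCRE. The prefix is written as a raw size_t,
       which is only suitable when the file is read back on the same kind of
       host; a file meant to be moved between hosts would need a fixed-width
       prefix with a defined byte order.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write one record: the length, then that many bytes. Returns 0 on success. */
static int write_record(FILE *fp, const void *data, size_t size)
{
    if (fwrite(&size, sizeof size, 1, fp) != 1) return -1;
    if (size > 0 && fwrite(data, 1, size, fp) != size) return -1;
    return 0;
}

/* Read the next record into newly allocated memory and store its length in
   *size. Returns NULL at end of file or on error; the caller frees the
   result. */
static void *read_record(FILE *fp, size_t *size)
{
    void *data;

    if (fread(size, sizeof *size, 1, fp) != 1) return NULL;
    data = malloc(*size > 0 ? *size : 1);
    if (data == NULL) return NULL;
    if (*size > 0 && fread(data, 1, *size, fp) != *size) {
        free(data);
        return NULL;
    }
    return data;
}
```

       In use, the block and length passed to write_record() would be the
       pointer returned by pcre_compile() and the value obtained with
       PCRE_INFO_SIZE. The file should be opened in binary mode for both
       writing and reading.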
 | 
						|
 | 
						|
       Saving compiled patterns in a file is only one possible way of  storing
 | 
						|
       them  for later use. They could equally well be saved in a database, or
 | 
						|
       in the memory of some daemon process that passes them  via  sockets  to
 | 
						|
       the processes that want them.
 | 
						|
 | 
						|
       If  the pattern has been studied, it is also possible to save the study
 | 
						|
       data in a similar way to the compiled  pattern  itself.  When  studying
 | 
						|
       generates  additional  information, pcre_study() returns a pointer to a
 | 
						|
       pcre_extra data block. Its format is defined in the section on matching
 | 
						|
       a  pattern in the pcreapi documentation. The study_data field points to
 | 
						|
       the binary study data,  and  this  is  what  you  must  save  (not  the
 | 
						|
       pcre_extra  block itself). The length of the study data can be obtained
 | 
						|
       by calling pcre_fullinfo() with  an  argument  of  PCRE_INFO_STUDYSIZE.
 | 
						|
       Remember  to check that pcre_study() did return a non-NULL value before
 | 
						|
       trying to save the study data.
 | 
						|
 | 
						|
 | 
						|
RE-USING A PRECOMPILED PATTERN
 | 
						|
 | 
						|
       Re-using a precompiled pattern is straightforward. Having  reloaded  it
 | 
						|
       into   main   memory,   you   pass   its   pointer  to  pcre_exec()  or
 | 
						|
       pcre_dfa_exec() in the usual way. This  should  work  even  on  another
 | 
						|
       host,  and  even  if  that  host has the opposite endianness to the one
 | 
						|
       where the pattern was compiled.
 | 
						|
 | 
						|
       However, if you passed a pointer to custom character tables when the
       pattern was compiled (the tableptr argument of pcre_compile()), you
       must now pass a similar pointer to pcre_exec() or pcre_dfa_exec(),
       because the pointer saved with the compiled pattern is no longer valid
       after reloading. A field in a pcre_extra block is used to pass this
       data, as described in the section on matching a pattern in the pcreapi
       documentation.
 | 
						|
 | 
						|
       If you did not provide custom character tables  when  the  pattern  was
 | 
						|
       compiled,  the  pointer  in  the compiled pattern is NULL, which causes
 | 
						|
       pcre_exec() to use PCRE's internal tables. Thus, you  do  not  need  to
 | 
						|
       take any special action at run time in this case.
 | 
						|
 | 
						|
       If  you  saved study data with the compiled pattern, you need to create
 | 
						|
       your own pcre_extra data block and set the study_data field to point to
 | 
						|
       the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA
 | 
						|
       bit in the flags field to indicate that study  data  is  present.  Then
 | 
						|
       pass  the  pcre_extra  block  to  pcre_exec() or pcre_dfa_exec() in the
 | 
						|
       usual way.
 | 
						|
 | 
						|
 | 
						|
COMPATIBILITY WITH DIFFERENT PCRE RELEASES
 | 
						|
 | 
						|
       In general, it is safest to  recompile  all  saved  patterns  when  you
 | 
						|
       update  to  a new PCRE release, though not all updates actually require
 | 
						|
       this. Recompiling is definitely needed for release 7.2.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 13 June 2007
 | 
						|
       Copyright (c) 1997-2007 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCREPERFORM(3)                                                  PCREPERFORM(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions
 | 
						|
 | 
						|
 | 
						|
PCRE PERFORMANCE
 | 
						|
 | 
						|
       Two  aspects  of performance are discussed below: memory usage and pro-
 | 
						|
       cessing time. The way you express your pattern as a regular  expression
 | 
						|
       can affect both of them.
 | 
						|
 | 
						|
 | 
						|
MEMORY USAGE
 | 
						|
 | 
						|
       Patterns are compiled by PCRE into a reasonably efficient byte code, so
 | 
						|
       that most simple patterns do not use much memory. However, there is one
 | 
						|
       case where memory usage can be unexpectedly large. When a parenthesized
 | 
						|
       subpattern has a quantifier with a minimum greater than 1 and/or a lim-
 | 
						|
       ited  maximum,  the  whole subpattern is repeated in the compiled code.
 | 
						|
       For example, the pattern
 | 
						|
 | 
						|
         (abc|def){2,4}
 | 
						|
 | 
						|
       is compiled as if it were
 | 
						|
 | 
						|
         (abc|def)(abc|def)((abc|def)(abc|def)?)?
 | 
						|
 | 
						|
       (Technical aside: It is done this way so that backtrack  points  within
 | 
						|
       each of the repetitions can be independently maintained.)
 | 
						|
 | 
						|
       For  regular expressions whose quantifiers use only small numbers, this
 | 
						|
       is not usually a problem. However, if the numbers are large,  and  par-
 | 
						|
       ticularly  if  such repetitions are nested, the memory usage can become
 | 
						|
       an embarrassment. For example, the very simple pattern
 | 
						|
 | 
						|
         ((ab){1,1000}c){1,3}
 | 
						|
 | 
						|
       uses 51K bytes when compiled. When PCRE is compiled  with  its  default
 | 
						|
       internal  pointer  size of two bytes, the size limit on a compiled pat-
 | 
						|
       tern is 64K, and this is reached with the above pattern  if  the  outer
 | 
						|
       repetition is increased from 3 to 4. PCRE can be compiled to use larger
 | 
						|
       internal pointers and thus handle larger compiled patterns, but  it  is
 | 
						|
       better to try to rewrite your pattern to use less memory if you can.
 | 
						|
 | 
						|
       One  way  of reducing the memory usage for such patterns is to make use
 | 
						|
       of PCRE's "subroutine" facility. Re-writing the above pattern as
 | 
						|
 | 
						|
         ((ab)(?2){0,999}c)(?1){0,2}
 | 
						|
 | 
						|
       reduces the memory requirements to 18K, and indeed it remains under 20K
 | 
						|
       even  with the outer repetition increased to 100. However, this pattern
 | 
						|
       is not exactly equivalent, because the "subroutine" calls  are  treated
 | 
						|
       as  atomic groups into which there can be no backtracking if there is a
 | 
						|
       subsequent matching failure. Therefore, PCRE cannot  do  this  kind  of
 | 
						|
       rewriting  automatically.   Furthermore,  there is a noticeable loss of
 | 
						|
       speed when executing the modified pattern. Nevertheless, if the  atomic
 | 
						|
       grouping  is  not  a  problem and the loss of speed is acceptable, this
 | 
						|
       kind of rewriting will allow you to process patterns that  PCRE  cannot
 | 
						|
       otherwise handle.
 | 
						|
 | 
						|
 | 
						|
PROCESSING TIME
 | 
						|
 | 
						|
       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character class
       like [aeiou] than a set of single-character alternatives such as
       (a|e|i|o|u). In general, the simplest construction that provides the
       required behaviour is usually the most efficient. Jeffrey Friedl's book
       "Mastering Regular Expressions" contains a lot of useful general
       discussion about optimizing regular expressions for efficient
       performance. This document contains a few observations about PCRE.
 | 
						|
 | 
						|
       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
 | 
						|
       slow,  because PCRE has to scan a structure that contains data for over
 | 
						|
       fifteen thousand characters whenever it needs a  character's  property.
 | 
						|
       If  you  can  find  an  alternative pattern that does not use character
 | 
						|
       properties, it will probably be faster.
 | 
						|
 | 
						|
       When a pattern begins with .* not in  parentheses,  or  in  parentheses
 | 
						|
       that are not the subject of a backreference, and the PCRE_DOTALL option
 | 
						|
       is set, the pattern is implicitly anchored by PCRE, since it can  match
 | 
						|
       only  at  the start of a subject string. However, if PCRE_DOTALL is not
 | 
						|
       set, PCRE cannot make this optimization, because  the  .  metacharacter
 | 
						|
       does  not then match a newline, and if the subject string contains new-
 | 
						|
       lines, the pattern may match from the character  immediately  following
 | 
						|
       one of them instead of from the very start. For example, the pattern
 | 
						|
 | 
						|
         .*second
 | 
						|
 | 
						|
       matches  the subject "first\nand second" (where \n stands for a newline
 | 
						|
       character), with the match starting at the seventh character. In  order
 | 
						|
       to do this, PCRE has to retry the match starting after every newline in
 | 
						|
       the subject.
 | 
						|
 | 
						|
       If you are using such a pattern with subject strings that do  not  con-
 | 
						|
       tain newlines, the best performance is obtained by setting PCRE_DOTALL,
 | 
						|
       or starting the pattern with ^.* or ^.*? to indicate  explicit  anchor-
 | 
						|
       ing.  That saves PCRE from having to scan along the subject looking for
 | 
						|
       a newline to restart at.
 | 
						|
 | 
						|
       Beware of patterns that contain nested indefinite  repeats.  These  can
 | 
						|
       take  a  long time to run when applied to a string that does not match.
 | 
						|
       Consider the pattern fragment
 | 
						|
 | 
						|
         ^(a+)*
 | 
						|
 | 
						|
       This can match "aaaa" in 16 different ways, and this  number  increases
 | 
						|
       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
 | 
						|
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
 | 
						|
       repeats  can  match  different numbers of times.) When the remainder of
 | 
						|
       the pattern is such that the entire match is going to fail, PCRE has in
 | 
						|
       principle  to  try  every  possible  variation,  and  this  can take an
 | 
						|
       extremely long time, even for relatively short strings.
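
       The figure of 16 can be checked by brute-force enumeration. The sketch
       below counts every way that ^(a+)* can match at the start of a run of n
       "a" characters, treating each choice of how many characters each
       iteration of the group consumes as a separate way. The function name
       count_ways() is an illustrative assumption; the code models the
       counting argument only and does not call PCRE.

```c
/* Count the ways ^(a+)* can match at the start of a run of n 'a' characters.
   Each way is a sequence of non-empty pieces, one per iteration of the
   group, that together consume some prefix of the run. */
static unsigned long count_ways(unsigned n)
{
    unsigned long ways = 1;       /* stop here: no further iterations */
    unsigned k;

    for (k = 1; k <= n; k++)      /* the next (a+) consumes k characters */
        ways += count_ways(n - k);
    return ways;
}
```

       Under this model the count doubles with each additional character, so
       a run of 20 characters already allows over a million variations, which
       is why a failing match can take so long.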
 | 
						|
 | 
						|
       An optimization catches some of the more simple cases such as
 | 
						|
 | 
						|
         (a+)*b
 | 
						|
 | 
						|
       where a literal character follows. Before  embarking  on  the  standard
 | 
						|
       matching  procedure,  PCRE checks that there is a "b" later in the sub-
 | 
						|
       ject string, and if there is not, it fails the match immediately.  How-
 | 
						|
       ever,  when  there  is no following literal this optimization cannot be
 | 
						|
       used. You can see the difference by comparing the behaviour of
 | 
						|
 | 
						|
         (a+)*\d
 | 
						|
 | 
						|
       with that of the pattern above. When applied to a whole line of "a"
       characters, (a+)*b fails almost instantly, whereas (a+)*\d takes an
       appreciable time with strings longer than about 20 characters.
 | 
						|
 | 
						|
       In many cases, the solution to this kind of performance issue is to use
 | 
						|
       an atomic group or a possessive quantifier.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 06 March 2007
 | 
						|
       Copyright (c) 1997-2007 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCREPOSIX(3)                                                      PCREPOSIX(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions.
 | 
						|
 | 
						|
 | 
						|
SYNOPSIS OF POSIX API
 | 
						|
 | 
						|
       #include <pcreposix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);
 | 
						|
 | 
						|
 | 
						|
DESCRIPTION
 | 
						|
 | 
						|
       This  set  of  functions provides a POSIX-style API to the PCRE regular
 | 
						|
       expression package. See the pcreapi documentation for a description  of
 | 
						|
       PCRE's native API, which contains much additional functionality.
 | 
						|
 | 
						|
       The functions described here are just wrapper functions that ultimately
       call the PCRE native API. Their prototypes are defined in the
       pcreposix.h header file, and on Unix systems the library itself is
       called pcreposix.a, so it can be accessed by adding -lpcreposix to the
       command for linking an application that uses them. Because the POSIX
       functions call the native ones, it is also necessary to add -lpcre.
 | 
						|
 | 
						|
       I have implemented only those POSIX option bits that can be  reasonably
 | 
						|
       mapped  to PCRE native options. In addition, the option REG_EXTENDED is
 | 
						|
       defined with the value zero. This has no  effect,  but  since  programs
 | 
						|
       that  are  written  to  the POSIX interface often use it, this makes it
 | 
						|
       easier to slot in PCRE as a replacement library.  Other  POSIX  options
 | 
						|
       are not even defined.
 | 
						|
 | 
						|
       When  PCRE  is  called  via these functions, it is only the API that is
 | 
						|
       POSIX-like in style. The syntax and semantics of  the  regular  expres-
 | 
						|
       sions  themselves  are  still  those of Perl, subject to the setting of
 | 
						|
       various PCRE options, as described below. "POSIX-like in  style"  means
 | 
						|
       that  the  API  approximates  to  the POSIX definition; it is not fully
 | 
						|
       POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
 | 
						|
       even less compatible.
 | 
						|
 | 
						|
       The  header for these functions is supplied as pcreposix.h to avoid any
 | 
						|
       potential clash with other POSIX  libraries.  It  can,  of  course,  be
 | 
						|
       renamed or aliased as regex.h, which is the "correct" name. It provides
 | 
						|
       two structure types, regex_t for  compiled  internal  forms,  and  reg-
 | 
						|
       match_t  for  returning  captured substrings. It also defines some con-
 | 
						|
       stants whose names start  with  "REG_";  these  are  used  for  setting
 | 
						|
       options and identifying error codes.
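
       A minimal sketch of the four calls working together follows. It
       includes the system <regex.h> so that it builds without PCRE installed;
       replacing that header with pcreposix.h and linking with -lpcreposix
       -lpcre is the substitution described above, and is assumed rather than
       shown. The helper name find_number() is illustrative. The pattern is
       chosen to mean the same thing in POSIX extended syntax and in Perl
       syntax.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Copy the first run of digits found in text into out (outsize >= 1).
   Returns 0 on a match, otherwise the regcomp() or regexec() error code. */
static int find_number(const char *text, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[2];              /* m[0]: whole match, m[1]: first group */
    char errbuf[128];
    size_t len;
    int rc;

    rc = regcomp(&re, "([0-9]+)", REG_EXTENDED);
    if (rc != 0) {
        regerror(rc, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "regcomp failed: %s\n", errbuf);
        return rc;
    }
    rc = regexec(&re, text, 2, m, 0);
    if (rc == 0) {
        len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= outsize) len = outsize - 1;   /* truncate to fit */
        memcpy(out, text + m[1].rm_so, len);
        out[len] = '\0';
    }
    regfree(&re);                 /* release the compiled form */
    return rc;
}
```

       The rm_so and rm_eo members of each regmatch_t give the start offset
       of a captured substring and the offset just past its end.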
 | 
						|
 | 
						|
 | 
						|
COMPILING A PATTERN
 | 
						|
 | 
						|
       The  function regcomp() is called to compile a pattern into an internal
 | 
						|
       form. The pattern is a C string terminated by a  binary  zero,  and  is
 | 
						|
       passed  in  the  argument  pattern. The preg argument is a pointer to a
 | 
						|
       regex_t structure that is used as a base for storing information  about
 | 
						|
       the compiled regular expression.
 | 
						|
 | 
						|
       The argument cflags is either zero, or contains one or more of the bits
 | 
						|
       defined by the following macros:
 | 
						|
 | 
						|
         REG_DOTALL
 | 
						|
 | 
						|
       The PCRE_DOTALL option is set when the regular expression is passed for
 | 
						|
       compilation to the native function. Note that REG_DOTALL is not part of
 | 
						|
       the POSIX standard.
 | 
						|
 | 
						|
         REG_ICASE
 | 
						|
 | 
						|
       The PCRE_CASELESS option is set when the regular expression  is  passed
 | 
						|
       for compilation to the native function.
 | 
						|
 | 
						|
         REG_NEWLINE
 | 
						|
 | 
						|
       The  PCRE_MULTILINE option is set when the regular expression is passed
 | 
						|
       for compilation to the native function. Note that this does  not  mimic
 | 
						|
       the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
 | 
						|
       tion).
 | 
						|
 | 
						|
         REG_NOSUB
 | 
						|
 | 
						|
       The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
 | 
						|
       passed for compilation to the native function. In addition, when a pat-
 | 
						|
       tern that is compiled with this flag is passed to regexec() for  match-
 | 
						|
       ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
 | 
						|
       strings are returned.
 | 
						|
 | 
						|
         REG_UTF8
 | 
						|
 | 
						|
       The PCRE_UTF8 option is set when the regular expression is  passed  for
 | 
						|
       compilation  to the native function. This causes the pattern itself and
 | 
						|
       all data strings used for matching it to be treated as  UTF-8  strings.
 | 
						|
       Note that REG_UTF8 is not part of the POSIX standard.
 | 
						|
 | 
						|
       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they aren't) or
       by a negative class such as [^a] (they are).
 | 
						|
 | 
						|
       The  yield of regcomp() is zero on success, and non-zero otherwise. The
 | 
						|
       preg structure is filled in on success, and one member of the structure
 | 
						|
       is  public: re_nsub contains the number of capturing subpatterns in the
 | 
						|
       regular expression. Various error codes are defined in the header file.
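
       As an illustration of the compile step, here is a minimal sketch. It
       assumes the system <regex.h> header so that it builds without PCRE
       installed; with PCRE you would include pcreposix.h instead and link
       with -lpcreposix -lpcre. The helper name count_subpatterns is
       hypothetical.

```cpp
#include <cassert>
#include <regex.h>   // assumption: system POSIX header; pcreposix.h offers the same calls

// Hypothetical helper: compiles `pattern` and returns the number of capturing
// subpatterns reported in re_nsub, or -1 if regcomp() fails.
long count_subpatterns(const char *pattern, int cflags) {
    regex_t re;
    // REG_EXTENDED is defined as zero in pcreposix.h, so OR-ing it in is
    // harmless there; a native POSIX library needs it for this syntax.
    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0) {
        return -1;                      // non-zero yield: compilation failed
    }
    long nsub = static_cast<long>(re.re_nsub);
    regfree(&re);                       // release memory tied to the regex_t
    return nsub;
}
```

       Calling count_subpatterns("(a)(b)c", REG_ICASE) compiles the pattern
       caselessly and reports two capturing subpatterns.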
 | 
						|
 | 
						|
 | 
						|
MATCHING NEWLINE CHARACTERS
 | 
						|
 | 
						|
       This area is not simple, because POSIX and Perl take different views of
 | 
						|
       things.   It  is  not possible to get PCRE to obey POSIX semantics, but
 | 
						|
       then PCRE was never intended to be a POSIX engine. The following  table
 | 
						|
       lists  the  different  possibilities for matching newline characters in
 | 
						|
       PCRE:
 | 
						|
 | 
						|
                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE
 | 
						|
 | 
						|
       This is the equivalent table for POSIX:
 | 
						|
 | 
						|
                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE
 | 
						|
 | 
						|
       PCRE's behaviour is the same as Perl's, except that there is no equiva-
 | 
						|
       lent  for  PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
 | 
						|
       no way to stop newline from matching [^a].
 | 
						|
 | 
						|
       The  default  POSIX  newline  handling  can  be  obtained  by   setting
 | 
						|
       PCRE_DOTALL  and  PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
 | 
						|
       behave exactly as for the REG_NEWLINE action.
 | 
						|
 | 
						|
 | 
						|
MATCHING A PATTERN
 | 
						|
 | 
						|
       The function regexec() is called  to  match  a  compiled  pattern  preg
 | 
						|
       against  a  given string, which is by default terminated by a zero byte
 | 
						|
       (but see REG_STARTEND below), subject to the options in  eflags.  These
 | 
						|
       can be:
 | 
						|
 | 
						|
         REG_NOTBOL
 | 
						|
 | 
						|
       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
 | 
						|
       function.
 | 
						|
 | 
						|
         REG_NOTEMPTY
 | 
						|
 | 
						|
       The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
 | 
						|
       ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
 | 
						|
       However, setting this option can give more POSIX-like behaviour in some
 | 
						|
       situations.
 | 
						|
 | 
						|
         REG_NOTEOL
 | 
						|
 | 
						|
       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
 | 
						|
       function.
 | 
						|
 | 
						|
         REG_STARTEND
 | 
						|
 | 
						|
       The string is considered to start at string +  pmatch[0].rm_so  and  to
 | 
						|
       have  a terminating NUL located at string + pmatch[0].rm_eo (there need
 | 
						|
       not actually be a NUL at that location), regardless  of  the  value  of
 | 
						|
       nmatch.  This  is a BSD extension, compatible with but not specified by
 | 
						|
       IEEE Standard 1003.2 (POSIX.2), and should  be  used  with  caution  in
 | 
						|
       software intended to be portable to other systems. Note that a non-zero
 | 
						|
       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
 | 
						|
       of the string, not how it is matched.
 | 
						|
 | 
						|
       If  the pattern was compiled with the REG_NOSUB flag, no data about any
 | 
						|
       matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
 | 
						|
       regexec() are ignored.
 | 
						|
 | 
						|
       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.
 | 
						|
 | 
						|
       A  successful  match  yields  a  zero  return;  various error codes are
 | 
						|
       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
 | 
						|
       failure code.
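
       The way rm_so and rm_eo delimit each substring can be sketched as
       follows. This assumes the system <regex.h> header (pcreposix.h
       declares the same functions); the helper name capture_text and the
       fixed array size of 10 are choices made for the example only.

```cpp
#include <cassert>
#include <regex.h>   // assumption: system POSIX header; pcreposix.h offers the same calls
#include <string>

// Hypothetical helper: returns the text of capture `group` (0 is the whole
// match), or an empty string if there is no match or the entry is unused.
std::string capture_text(const char *pattern, const char *subject,
                         size_t group) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return std::string();

    const size_t kMax = 10;             // room for the match plus 9 captures
    regmatch_t pm[kMax];
    std::string out;
    if (group < kMax && regexec(&re, subject, kMax, pm, 0) == 0 &&
        pm[group].rm_so != -1) {        // unused entries are set to -1
        // rm_so is the first character, rm_eo is one past the last.
        out.assign(subject + pm[group].rm_so,
                   static_cast<size_t>(pm[group].rm_eo - pm[group].rm_so));
    }
    regfree(&re);
    return out;
}
```

       For the subject "ruby:1234" and the pattern "([a-z]+):([0-9]+)",
       group 0 is the whole string, group 1 is "ruby", and group 2 is "1234".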
 | 
						|
 | 
						|
 | 
						|
ERROR MESSAGES
 | 
						|
 | 
						|
       The regerror() function maps a non-zero errorcode from either regcomp()
 | 
						|
       or regexec() to a printable message. If preg is  not  NULL,  the  error
 | 
						|
       should have arisen from the use of that structure. A message terminated
 | 
						|
       by a binary zero is placed  in  errbuf.  The  length  of  the  message,
 | 
						|
       including the zero, is limited to errbuf_size. The yield of the
       function is the size of the buffer needed to hold the whole message.
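
       Because the yield is the buffer size needed, a common idiom is to call
       regerror() twice: once to learn the size, and once to fetch the
       message. The sketch below assumes the system <regex.h> header
       (pcreposix.h declares the same functions); compile_error_message is a
       hypothetical helper name.

```cpp
#include <cassert>
#include <regex.h>   // assumption: system POSIX header; pcreposix.h offers the same calls
#include <string>
#include <vector>

// Hypothetical helper: returns the regcomp() error text for `pattern`, or an
// empty string if the pattern compiles successfully.
std::string compile_error_message(const char *pattern) {
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return std::string();
    }
    // First call with a zero-sized buffer: only the required size (which
    // includes the terminating zero) is of interest.
    size_t needed = regerror(rc, &re, NULL, 0);
    std::vector<char> buf(needed > 0 ? needed : 1);
    // Second call: the message now fits without truncation.
    regerror(rc, &re, buf.data(), buf.size());
    return std::string(buf.data());
}
```

       The exact wording of the message depends on the library in use, so
       callers should not compare it against fixed text.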
 | 
						|
 | 
						|
 | 
						|
MEMORY USAGE
 | 
						|
 | 
						|
       Compiling a regular expression causes memory to be allocated and  asso-
 | 
						|
       ciated  with  the preg structure. The function regfree() frees all such
 | 
						|
       memory, after which preg may no longer be used as  a  compiled  expres-
 | 
						|
       sion.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 11 March 2009
 | 
						|
       Copyright (c) 1997-2009 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRECPP(3)                                                          PCRECPP(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions.
 | 
						|
 | 
						|
 | 
						|
SYNOPSIS OF C++ WRAPPER
 | 
						|
 | 
						|
       #include <pcrecpp.h>
 | 
						|
 | 
						|
 | 
						|
DESCRIPTION
 | 
						|
 | 
						|
       The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
 | 
						|
       functionality was added by Giuseppe Maxia. This brief man page was con-
 | 
						|
       structed  from  the  notes  in the pcrecpp.h file, which should be con-
 | 
						|
       sulted for further details.
 | 
						|
 | 
						|
 | 
						|
MATCHING INTERFACE
 | 
						|
 | 
						|
       The "FullMatch" operation checks that supplied text matches a  supplied
 | 
						|
       pattern  exactly.  If pointer arguments are supplied, it copies matched
 | 
						|
       sub-strings that match sub-patterns into them.
 | 
						|
 | 
						|
         Example: successful match
 | 
						|
            pcrecpp::RE re("h.*o");
 | 
						|
            re.FullMatch("hello");
 | 
						|
 | 
						|
         Example: unsuccessful match (requires full match):
 | 
						|
            pcrecpp::RE re("e");
 | 
						|
            !re.FullMatch("hello");
 | 
						|
 | 
						|
         Example: creating a temporary RE object:
 | 
						|
            pcrecpp::RE("h.*o").FullMatch("hello");
 | 
						|
 | 
						|
       You can pass in a "const char*" or a "string" for "text". The  examples
 | 
						|
       below  tend to use a const char*. You can, as in the different examples
 | 
						|
       above, store the RE object explicitly in a variable or use a  temporary
 | 
						|
       RE  object.  The  examples below use one mode or the other arbitrarily.
 | 
						|
       Either could correctly be used for any of these examples.
 | 
						|
 | 
						|
       You must supply extra pointer arguments to extract matched subpieces.
 | 
						|
 | 
						|
         Example: extracts "ruby" into "s" and 1234 into "i"
 | 
						|
            int i;
 | 
						|
            string s;
 | 
						|
            pcrecpp::RE re("(\\w+):(\\d+)");
 | 
						|
            re.FullMatch("ruby:1234", &s, &i);
 | 
						|
 | 
						|
         Example: does not try to extract any extra sub-patterns
 | 
						|
            re.FullMatch("ruby:1234", &s);
 | 
						|
 | 
						|
         Example: does not try to extract into NULL
 | 
						|
            re.FullMatch("ruby:1234", NULL, &i);
 | 
						|
 | 
						|
         Example: integer overflow causes failure
 | 
						|
            !re.FullMatch("ruby:1234567891234", NULL, &i);
 | 
						|
 | 
						|
         Example: fails because there aren't enough sub-patterns:
 | 
						|
            !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
 | 
						|
 | 
						|
         Example: fails because string cannot be stored in integer
 | 
						|
            !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
 | 
						|
 | 
						|
       The provided pointer arguments can be pointers to  any  scalar  numeric
 | 
						|
       type, or one of:
 | 
						|
 | 
						|
          string        (matched piece is copied to string)
 | 
						|
          StringPiece   (StringPiece is mutated to point to matched piece)
 | 
						|
          T             (where "bool T::ParseFrom(const char*, int)" exists)
 | 
						|
          NULL          (the corresponding matched sub-pattern is not copied)
 | 
						|
 | 
						|
       The  function returns true iff all of the following conditions are sat-
 | 
						|
       isfied:
 | 
						|
 | 
						|
         a. "text" matches "pattern" exactly;
 | 
						|
 | 
						|
         b. The number of matched sub-patterns is >= number of supplied
 | 
						|
            pointers;
 | 
						|
 | 
						|
         c. The "i"th argument has a suitable type for holding the
 | 
						|
            string captured as the "i"th sub-pattern. If you pass in
 | 
						|
            void * NULL for the "i"th argument, or a non-void * NULL
 | 
						|
            of the correct type, or pass fewer arguments than the
 | 
						|
            number of sub-patterns, "i"th captured sub-pattern is
 | 
						|
            ignored.
 | 
						|
 | 
						|
       CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
 | 
						|
       string  is  assigned  the  empty  string. Therefore, the following will
 | 
						|
       return false (because the empty string is not a valid number):
 | 
						|
 | 
						|
          int number;
 | 
						|
          pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
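
       The failure cases described above (overflow, non-numeric text, and the
       empty string from an absent optional sub-pattern) can be pictured with
       a small conversion routine. This is a sketch of the rule only, written
       with the C library's strtol(); it is not the wrapper's own code, and
       parse_int_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: stores `text` in *dest as a base-10 int and returns
// true, or returns false if the text cannot be held in an int.
bool parse_int_sketch(const std::string &text, int *dest) {
    if (text.empty()) return false;     // empty capture is not a number
    errno = 0;
    char *end = NULL;
    long v = std::strtol(text.c_str(), &end, 10);
    if (*end != '\0') return false;     // leftover characters, e.g. "ruby"
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX) {
        return false;                   // value does not fit in an int
    }
    if (dest != NULL) *dest = static_cast<int>(v);  // NULL: nothing is stored
    return true;
}
```

       A failed conversion leaves *dest untouched in this sketch; whether the
       wrapper gives the same guarantee is not stated here.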
 | 
						|
 | 
						|
       The matching interface supports at most 16 arguments per call.  If  you
 | 
						|
       need    more,    consider    using    the    more   general   interface
 | 
						|
       pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
 | 
						|
 | 
						|
       NOTE: Do not use no_arg, which is used internally to mark the end of  a
 | 
						|
       list  of optional arguments, as a placeholder for missing arguments, as
 | 
						|
       this can lead to segfaults.
 | 
						|
 | 
						|
 | 
						|
QUOTING METACHARACTERS
 | 
						|
 | 
						|
       You can use the "QuoteMeta" operation to insert backslashes before  all
 | 
						|
       potentially  meaningful  characters  in  a string. The returned string,
 | 
						|
       used as a regular expression, will exactly match the original string.
 | 
						|
 | 
						|
         Example:
 | 
						|
            string quoted = RE::QuoteMeta(unquoted);
 | 
						|
 | 
						|
       Note that it's legal to escape a character even if it has no special
       meaning in a regular expression, so this function does that. (This
       also makes it identical to the Perl function of the same name; see
       "perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
       "1\.5\-2\.0\?".
 | 
						|
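
       The quoting rule can be sketched in a few lines. This is an
       illustration of the rule as described above, not the library's
       implementation: it assumes that ASCII letters, digits and underscore
       are left alone, that bytes with the high bit set are passed through so
       that UTF-8 sequences stay intact, and it does not address NUL bytes.
       The name quote_meta_sketch is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: puts a backslash before every byte that is not an
// ASCII letter, digit, underscore, or high-bit byte.
std::string quote_meta_sketch(const std::string &unquoted) {
    std::string quoted;
    quoted.reserve(unquoted.size() * 2);
    for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(unquoted[i]);
        const bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                           (c >= '0' && c <= '9') || c == '_' || c >= 0x80;
        if (!plain) quoted += '\\';     // escape even if not strictly needed
        quoted += unquoted[i];
    }
    return quoted;
}
```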
 | 
						|
 | 
						|
PARTIAL MATCHES
 | 
						|
 | 
						|
       You can use the "PartialMatch" operation when you want the  pattern  to
 | 
						|
       match any substring of the text.
 | 
						|
 | 
						|
         Example: simple search for a string:
 | 
						|
            pcrecpp::RE("ell").PartialMatch("hello");
 | 
						|
 | 
						|
         Example: find first number in a string:
 | 
						|
            int number;
 | 
						|
            pcrecpp::RE re("(\\d+)");
 | 
						|
            re.PartialMatch("x*100 + 20", &number);
 | 
						|
            assert(number == 100);
 | 
						|
 | 
						|
 | 
						|
UTF-8 AND THE MATCHING INTERFACE
 | 
						|
 | 
						|
       By default, pattern and text are plain text, one byte per character.
       The UTF8 flag, passed to the constructor, causes both pattern and
       string to be treated as UTF-8 text, still a byte stream but potentially
       multiple bytes per character. In practice, the text is likelier to be
       UTF-8 than the pattern, but the match returned may depend on the UTF8
       flag, so always use it when matching UTF-8 text. For example, "." will
       match one byte normally but with UTF8 set may match all the bytes of a
       multi-byte character.
 | 
						|
 | 
						|
         Example:
 | 
						|
            pcrecpp::RE_Options options;
 | 
						|
            options.set_utf8();
 | 
						|
            pcrecpp::RE re(utf8_pattern, options);
 | 
						|
            re.FullMatch(utf8_string);
 | 
						|
 | 
						|
         Example: using the convenience function UTF8():
 | 
						|
            pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
 | 
						|
            re.FullMatch(utf8_string);
 | 
						|
 | 
						|
       NOTE: The UTF8 flag is ignored if pcre was not configured with the
 | 
						|
             --enable-utf8 flag.
 | 
						|
 | 
						|
 | 
						|
PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
 | 
						|
 | 
						|
       PCRE defines some modifiers to  change  the  behavior  of  the  regular
 | 
						|
       expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
 | 
						|
       RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
 | 
						|
       rently, the following modifiers are supported:
 | 
						|
 | 
						|
          modifier              description               Perl equivalent

          PCRE_CASELESS         case insensitive match      /i
          PCRE_MULTILINE        multiple lines match        /m
          PCRE_DOTALL           dot matches newlines        /s
          PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
          PCRE_EXTRA            strict escape parsing       N/A
          PCRE_EXTENDED         ignore whitespace           /x
          PCRE_UTF8             handles UTF8 chars          built-in
          PCRE_UNGREEDY         reverses * and *?           N/A
          PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
 | 
						|
 | 
						|
       (*)  Both Perl and PCRE allow non capturing parentheses by means of the
 | 
						|
       "?:" modifier within the pattern itself. e.g. (?:ab|cd) does  not  cap-
 | 
						|
       ture, while (ab|cd) does.
 | 
						|
 | 
						|
       For a full account of how each modifier works, please check the PCRE
       API reference page.
 | 
						|
 | 
						|
       For each modifier, there are two member functions whose  name  is  made
 | 
						|
       out  of  the  modifier  in  lowercase,  without the "PCRE_" prefix. For
 | 
						|
       instance, PCRE_CASELESS is handled by
 | 
						|
 | 
						|
         bool caseless()
 | 
						|
 | 
						|
       which returns true if the modifier is set, and
 | 
						|
 | 
						|
         RE_Options & set_caseless(bool)
 | 
						|
 | 
						|
       which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
       be accessed through the set_match_limit() and match_limit() member
       functions. Setting match_limit to a non-zero value will limit the
       execution of pcre to keep it from doing bad things like blowing the
       stack or taking an eternity to return a result. A value of 5000 is good
       enough to stop stack blowup in a 2MB thread stack. Setting match_limit
       to zero disables match limiting. Alternatively, you can call
       set_match_limit_recursion(), which uses
       PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE recurses.
       match_limit() limits the number of calls PCRE makes to its internal
       match() function; match_limit_recursion() limits the depth of internal
       recursion, and therefore the amount of stack that is used.
 | 
						|
 | 
						|
       Normally, to pass one or more modifiers to a RE class,  you  declare  a
 | 
						|
       RE_Options object, set the appropriate options, and pass this object to
 | 
						|
       a RE constructor. Example:
 | 
						|
 | 
						|
          RE_Options opt;
 | 
						|
          opt.set_caseless(true);
 | 
						|
          if (RE("HELLO", opt).PartialMatch("hello world")) ...
 | 
						|
 | 
						|
       RE_Options has two constructors. The default constructor takes no
       arguments and creates a set of flags that are off by default. The
       other constructor takes the parameter option_flags, which is there to
       facilitate transfer of legacy code from C programs. This lets you do
 | 
						|
 | 
						|
          RE(pattern,
 | 
						|
            RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
 | 
						|
 | 
						|
       However, new code is better off doing
 | 
						|
 | 
						|
          RE(pattern,
 | 
						|
            RE_Options().set_caseless(true).set_multiline(true))
 | 
						|
              .PartialMatch(str);
 | 
						|
 | 
						|
       If you are going to pass one of the most used modifiers, there are some
 | 
						|
       convenience functions that return a RE_Options class with the appropri-
 | 
						|
       ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
 | 
						|
       and EXTENDED().
 | 
						|
 | 
						|
       If you need to set several options at once, and you don't want to go
       through the pains of declaring a RE_Options object and setting several
       options, there is a parallel method that gives you this ability on the
       fly. You can chain several set_xxxxx() member functions, since each of
       them returns a reference to its class object. For example, to pass
       PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
       statement, you may write:
 | 
						|
 | 
						|
          RE(" ^ xyz \\s+ .* blah$",
 | 
						|
            RE_Options()
 | 
						|
              .set_caseless(true)
 | 
						|
              .set_extended(true)
 | 
						|
              .set_multiline(true)).PartialMatch(sometext);
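
       The chaining works because each setter returns a reference to the
       object it was called on. The sketch below shows the idea with a
       hypothetical stand-in class, OptionsSketch. Its bit values are
       invented for the example and are not PCRE's; it is not the real
       RE_Options class.

```cpp
#include <cassert>

// Hypothetical stand-in for an options class with chainable setters.
class OptionsSketch {
 public:
  // Illustrative bit values only; they are NOT the PCRE_* constants.
  enum { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  OptionsSketch() : flags_(0) {}                       // all flags off
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Each setter returns *this, which is what makes chaining possible.
  OptionsSketch &set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch &set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch &set_extended(bool on) { return set(kExtended, on); }

 private:
  OptionsSketch &set(int bit, bool on) {
    if (on) flags_ |= bit; else flags_ &= ~bit;
    return *this;
  }
  int flags_;
};
```

       With this shape, OptionsSketch().set_caseless(true).set_extended(true)
       builds a fully configured temporary in a single expression.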
 | 
						|
 | 
						|
 | 
						|
SCANNING TEXT INCREMENTALLY
 | 
						|
 | 
						|
       The "Consume" operation may be useful if you want to  repeatedly  match
 | 
						|
       regular expressions at the front of a string and skip over them as they
 | 
						|
       match. This requires use of the "StringPiece" type, which represents  a
 | 
						|
       sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
 | 
						|
       pcrecpp namespace.
 | 
						|
 | 
						|
         Example: read lines of the form "var = value" from a string.
 | 
						|
            string contents = ...;                 // Fill string somehow
 | 
						|
            pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
 | 
						|
 | 
						|
            string var;
 | 
						|
            int value;
 | 
						|
            pcrecpp::RE re("(\\w+) = (\\d+)\n");
 | 
						|
            while (re.Consume(&input, &var, &value)) {
 | 
						|
              ...;
 | 
						|
            }
 | 
						|
 | 
						|
       Each successful call  to  "Consume"  will  set  "var/value",  and  also
 | 
						|
       advance "input" so it points past the matched text.
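
       The match-then-advance behaviour can be imitated with the C++ standard
       library, which may help to picture what happens to "input". The sketch
       below uses std::regex (ECMAScript syntax, not PCRE) and a pair of
       string iterators in place of StringPiece; consume_sketch is a
       hypothetical name and is not part of the wrapper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical analogue of Consume(): matches `re` only at `pos`, stores the
// two captures, and advances `pos` past the matched text on success.
bool consume_sketch(std::string::const_iterator &pos,
                    std::string::const_iterator end,
                    const std::regex &re, std::string *var, int *value) {
    std::smatch m;
    // match_continuous anchors the match at the current position.
    if (!std::regex_search(pos, end, m, re,
                           std::regex_constants::match_continuous)) {
        return false;
    }
    *var = m[1].str();
    *value = std::stoi(m[2].str());     // throws if the digits overflow int
    pos = m[0].second;                  // skip over what was matched
    return true;
}
```

       Calling this in a loop over "a = 1\nbb = 22\n" yields the pairs
       ("a", 1) and ("bb", 22), after which the position reaches the end of
       the string and the next call fails.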
 | 
						|
 | 
						|
       The  "FindAndConsume"  operation  is  similar to "Consume" but does not
 | 
						|
       anchor your match at the beginning of  the  string.  For  example,  you
 | 
						|
       could extract all words from a string by repeatedly calling
 | 
						|
 | 
						|
         pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
 | 
						|
 | 
						|
 | 
						|
PARSING HEX/OCTAL/C-RADIX NUMBERS
 | 
						|
 | 
						|
       By default, if you pass a pointer to a numeric value, the corresponding
 | 
						|
       text is interpreted as a base-10  number.  You  can  instead  wrap  the
 | 
						|
       pointer with a call to one of the operators Hex(), Octal(), or CRadix()
 | 
						|
       to interpret the text in another base. The CRadix  operator  interprets
 | 
						|
       C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
 | 
						|
       base-10.
 | 
						|
 | 
						|
         Example:
 | 
						|
           int a, b, c, d;
 | 
						|
           pcrecpp::RE re("(.*) (.*) (.*) (.*)");
 | 
						|
           re.FullMatch("100 40 0100 0x40",
 | 
						|
                        pcrecpp::Octal(&a), pcrecpp::Hex(&b),
 | 
						|
                        pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
 | 
						|
 | 
						|
       will leave 64 in a, b, c, and d.
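
       The arithmetic behind that result can be checked with the C library's
       strtol(), whose base argument follows the same conventions: 8 and 16
       force a base, and 0 selects the base from a C-style prefix. This is
       only a sketch of the number interpretation; parse_radix_sketch is a
       hypothetical name, and the wrapper's operators are not claimed to be
       implemented this way.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: interprets `text` in the given base. Base 0 means
// "C radix": a leading "0x" selects hex, a leading "0" selects octal, and
// anything else is decimal.
long parse_radix_sketch(const char *text, int base) {
    return std::strtol(text, NULL, base);
}
```

       Octal 100 is 1*64, hex 40 is 4*16, "0100" is octal 100 again, and
       "0x40" is hex 40 again, so all four values come out as 64.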
 | 
						|
 | 
						|
 | 
						|
REPLACING PARTS OF STRINGS
 | 
						|
 | 
						|
       You can replace the first match of "pattern" in "str"  with  "rewrite".
 | 
						|
       Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
 | 
						|
       insert text matching corresponding parenthesized group  from  the  pat-
 | 
						|
       tern. \0 in "rewrite" refers to the entire matching text. For example:
 | 
						|
 | 
						|
         string s = "yabba dabba doo";
 | 
						|
         pcrecpp::RE("b+").Replace("d", &s);
 | 
						|
 | 
						|
       will  leave  "s" containing "yada dabba doo". The result is true if the
 | 
						|
       pattern matches and a replacement occurs, false otherwise.
 | 
						|
 | 
						|
       GlobalReplace is like Replace except that it replaces  all  occurrences
 | 
						|
       of  the  pattern  in  the string with the rewrite. Replacements are not
 | 
						|
       subject to re-matching. For example:
 | 
						|
 | 
						|
         string s = "yabba dabba doo";
 | 
						|
         pcrecpp::RE("b+").GlobalReplace("d", &s);
 | 
						|
 | 
						|
       will leave "s" containing "yada dada doo". It  returns  the  number  of
 | 
						|
       replacements made.
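
       The difference between replacing the first match and replacing every
       match can be reproduced with the C++ standard library. The sketch
       below uses std::regex_replace as an analogue, not pcrecpp itself. Note
       that std::regex writes group references in the rewrite string as $1
       rather than \1, and that these hypothetical helpers return the new
       string instead of modifying their argument.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical analogue of Replace(): rewrites only the first match.
std::string replace_first_sketch(const std::string &s,
                                 const std::string &pattern,
                                 const std::string &rewrite) {
    return std::regex_replace(s, std::regex(pattern), rewrite,
                              std::regex_constants::format_first_only);
}

// Hypothetical analogue of GlobalReplace(): rewrites every match, scanning
// left to right without re-matching the inserted text.
std::string replace_all_sketch(const std::string &s,
                               const std::string &pattern,
                               const std::string &rewrite) {
    return std::regex_replace(s, std::regex(pattern), rewrite);
}
```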
 | 
						|
 | 
						|
       Extract  is like Replace, except that if the pattern matches, "rewrite"
 | 
						|
       is copied into "out" (an additional argument) with substitutions.   The
 | 
						|
       non-matching  portions  of "text" are ignored. Returns true iff a match
 | 
						|
       occurred and the extraction happened successfully;  if no match occurs,
 | 
						|
       the string is left unaffected.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       The C++ wrapper was contributed by Google Inc.
 | 
						|
       Copyright (c) 2007 Google Inc.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 17 March 2009
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRESAMPLE(3)                                                    PCRESAMPLE(3)
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE - Perl-compatible regular expressions
 | 
						|
 | 
						|
 | 
						|
PCRE SAMPLE PROGRAM
 | 
						|
 | 
						|
       A simple, complete demonstration program, to get you started with using
 | 
						|
       PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
 | 
						|
 | 
						|
       The program compiles the regular expression that is its first argument,
 | 
						|
       and  matches  it  against the subject string in its second argument. No
 | 
						|
       PCRE options are set, and default character tables are used. If  match-
 | 
						|
       ing  succeeds,  the  program  outputs  the  portion of the subject that
 | 
						|
       matched, together with the contents of any captured substrings.
 | 
						|
 | 
						|
       If the -g option is given on the command line, the program then goes on
 | 
						|
       to check for further matches of the same regular expression in the same
 | 
						|
       subject string. The logic is a little bit tricky because of the  possi-
 | 
						|
       bility  of  matching an empty string. Comments in the code explain what
 | 
						|
       is going on.
 | 
						|
 | 
						|
       If PCRE is installed in the standard include  and  library  directories
 | 
						|
       for  your  system, you should be able to compile the demonstration pro-
 | 
						|
       gram using this command:
 | 
						|
 | 
						|
         gcc -o pcredemo pcredemo.c -lpcre
 | 
						|
 | 
						|
       If PCRE is installed elsewhere, you may need to add additional  options
 | 
						|
       to  the  command line. For example, on a Unix-like system that has PCRE
 | 
						|
       installed in /usr/local, you  can  compile  the  demonstration  program
 | 
						|
       using a command like this:
 | 
						|
 | 
						|
         gcc -o pcredemo -I/usr/local/include pcredemo.c \
 | 
						|
             -L/usr/local/lib -lpcre
 | 
						|
 | 
						|
       Once  you  have  compiled the demonstration program, you can run simple
 | 
						|
       tests like this:
 | 
						|
 | 
						|
         ./pcredemo 'cat|dog' 'the cat sat on the mat'
 | 
						|
         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
 | 
						|
 | 
						|
       Note that there is a  much  more  comprehensive  test  program,  called
 | 
						|
       pcretest,  which  supports  many  more  facilities  for testing regular
 | 
						|
       expressions and the PCRE library. The pcredemo program is provided as a
 | 
						|
       simple coding example.
 | 
						|
 | 
						|
       On some operating systems (e.g. Solaris), when PCRE is not installed in
 | 
						|
       the standard library directory, you may get an error like this when you
 | 
						|
       try to run pcredemo:
 | 
						|
 | 
						|
         ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
 | 
						|
 | 
						|
       This is caused by the way shared library support works  on  those  sys-
 | 
						|
       tems. You need to add
 | 
						|
 | 
						|
         -R/usr/local/lib
 | 
						|
 | 
						|
       (for example) to the compile command to get round this problem.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge CB2 3QH, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 23 January 2008
 | 
						|
       Copyright (c) 1997-2008 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
PCRESTACK(3)                                                      PCRESTACK(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE DISCUSSION OF STACK USAGE

       When  you call pcre_exec(), it makes use of an internal function called
       match(). This calls itself recursively at branch points in the pattern,
       in  order to remember the state of the match so that it can back up and
       try a different alternative if the first one fails.  As  matching  pro-
       ceeds  deeper  and deeper into the tree of possibilities, the recursion
       depth increases.

       Not all calls of match() increase the recursion depth; for an item such
       as  a* it may be called several times at the same level, after matching
       different numbers of a's. Furthermore, in a number of cases  where  the
       result  of  the  recursive call would immediately be passed back as the
       result of the current call (a "tail recursion"), the function  is  just
       restarted instead.

       The pcre_dfa_exec() function operates in an entirely different way, and
       hardly uses recursion at all. The limit on its complexity is the amount
       of  workspace  it  is  given.  The comments that follow do NOT apply to
       pcre_dfa_exec(); they are relevant only for pcre_exec().

       You can set limits on the number of times that match() is called,  both
       in  total  and  recursively. If the limit is exceeded, an error occurs.
       For details, see the section on  extra  data  for  pcre_exec()  in  the
       pcreapi documentation.

       Each  time  that match() is actually called recursively, it uses memory
       from the process stack. For certain kinds of  pattern  and  data,  very
       large  amounts of stack may be needed, despite the recognition of "tail
       recursion".  You can often reduce the amount of recursion,  and  there-
       fore  the  amount of stack used, by modifying the pattern that is being
       matched. Consider, for example, this pattern:

         ([^<]|<(?!inet))+

       It matches from wherever it starts until it encounters "<inet"  or  the
       end  of  the  data,  and is the kind of pattern that might be used when
       processing an XML file. Each iteration of the outer parentheses matches
       either  one  character that is not "<" or a "<" that is not followed by
       "inet". However, each time a  parenthesis  is  processed,  a  recursion
       occurs, so this formulation uses a stack frame for each matched charac-
       ter. For a long string, a lot of stack is required. Consider  now  this
       rewritten pattern, which matches exactly the same strings:

         ([^<]++|<(?!inet))+

       This  uses very much less stack, because runs of characters that do not
       contain "<" are "swallowed" in one item inside the parentheses.  Recur-
       sion  happens  only when a "<" character that is not followed by "inet"
       is encountered (and we assume this is relatively  rare).  A  possessive
       quantifier  is  used  to stop any backtracking into the runs of non-"<"
       characters, but that is not related to stack usage.

       This example shows that one way of avoiding stack problems when  match-
       ing long subject strings is to write repeated parenthesized subpatterns
       to match more than one character whenever possible.

   Compiling PCRE to use heap instead of stack

       In environments where stack memory is constrained, you  might  want  to
       compile  PCRE to use heap memory instead of stack for remembering back-
       up points. This makes it run a lot more slowly, however. Details of how
       to do this are given in the pcrebuild documentation. When built in this
       way, instead of using the stack, PCRE obtains and frees memory by call-
       ing  the  functions  that  are  pointed to by the pcre_stack_malloc and
       pcre_stack_free variables. By default,  these  point  to  malloc()  and
       free(),  but you can replace the pointers to cause PCRE to use your own
       functions. Since the block sizes are always the same,  and  are  always
       freed in reverse order, it may be possible to implement customized mem-
       ory handlers that are more efficient than the standard functions.

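       As an illustration of such customized handlers, here is  a  minimal
       sketch that exploits the fixed block size by keeping freed blocks on
       a list and handing them out again. It is not part of PCRE: the names
       frame_alloc(), frame_free(), and frame_release_all() are hypothetical,
       the code is not thread-safe, and it assumes a C11 compiler (for
       max_align_t).

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block. The union forces the memory
   that follows it to be suitably aligned for any object (C11). */
typedef union block_header {
    struct {
        union block_header *next;   /* next block on the free list */
        size_t size;                /* usable size of this block */
    } info;
    max_align_t align;
} block_header;

/* Freed blocks, most recently freed first (hypothetical helper state). */
static block_header *free_list = NULL;

/* Same signature as malloc(): reuse a cached block when one is big
   enough, otherwise fall back on malloc(). */
void *frame_alloc(size_t size)
{
    block_header *h = free_list;

    if (h != NULL && h->info.size >= size) {
        free_list = h->info.next;
    } else {
        h = malloc(sizeof(block_header) + size);
        if (h == NULL)
            return NULL;
        h->info.size = size;
    }
    return h + 1;   /* the caller's memory starts after the header */
}

/* Same signature as free(): keep the block for later reuse instead of
   returning it to the system. */
void frame_free(void *p)
{
    block_header *h;

    if (p == NULL)
        return;
    h = (block_header *)p - 1;
    h->info.next = free_list;
    free_list = h;
}

/* Return all cached blocks to the system, for example at shutdown. */
void frame_release_all(void)
{
    while (free_list != NULL) {
        block_header *h = free_list;
        free_list = h->info.next;
        free(h);
    }
}
```

       In a PCRE library that has been built to use the heap,  the  handlers
       could then be installed before calling pcre_exec() by  assigning
       frame_alloc to pcre_stack_malloc and frame_free to pcre_stack_free.
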
   Limiting PCRE's stack usage

       PCRE has an internal counter that can be used to  limit  the  depth  of
       recursion,  and  thus cause pcre_exec() to give an error code before it
       runs out of stack. By default, the limit is very  large,  and  unlikely
       ever  to operate. It can be changed when PCRE is built, and it can also
       be set when pcre_exec() is called. For details of these interfaces, see
       the pcrebuild and pcreapi documentation.

       As a very rough rule of thumb, you should reckon on about 500 bytes per
       recursion. Thus, if you want to limit your  stack  usage  to  8MB,  you
       should  set  the  limit at 16000 recursions. A 64MB stack, on the other
       hand, can support around 128000 recursions. The pcretest  test  program
       has a command line option (-S) that can be used to increase the size of
       its stack.

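       The rule of thumb above amounts to a single division. The  following
       sketch shows the arithmetic; the function name and the macro are
       hypothetical, and the figure of 500 bytes is only the rough estimate
       quoted in the text, not a measured value.

```c
#include <stddef.h>

/* Rough per-recursion stack cost quoted in the text (an estimate). */
#define APPROX_BYTES_PER_RECURSION 500

/* Estimate how many match() recursions fit into stack_bytes of stack. */
unsigned long recursion_limit_for_stack(size_t stack_bytes)
{
    return (unsigned long)(stack_bytes / APPROX_BYTES_PER_RECURSION);
}
```

       For a stack of 8*1024*1024 bytes this gives 16777, which  the  text
       rounds down to 16000. The result could be stored in the
       match_limit_recursion field of a pcre_extra block, with
       PCRE_EXTRA_MATCH_LIMIT_RECURSION set in its flags field, as described
       in the pcreapi documentation. It is probably wise to choose a lower
       value, because the rest of the program needs stack as well.
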
   Changing stack size in Unix-like systems

       In Unix-like environments, there is not often a problem with the  stack
       unless  very  long  strings  are  involved, though the default limit on
       stack size varies from system to system. Values from 8MB  to  64MB  are
       common. You can find your default limit by running the command:

         ulimit -s

       Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
       though sometimes a more explicit error message is given. You  can  nor-
       mally increase the limit on stack size by code such as this:

         struct rlimit rlim;
         getrlimit(RLIMIT_STACK, &rlim);
         rlim.rlim_cur = 100*1024*1024;
         setrlimit(RLIMIT_STACK, &rlim);

       This  reads  the current limits (soft and hard) using getrlimit(), then
       attempts to increase the soft limit to  100MB  using  setrlimit().  You
       must do this before calling pcre_exec().

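       The fragment above omits the header file and any error  checking.  A
       more complete sketch might look like the following. The function name
       raise_stack_soft_limit() is hypothetical, and the choice to clamp the
       request to the hard limit instead of failing is an assumption made
       for this example.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Try to make the soft stack limit at least `wanted` bytes. A request
   above the hard limit is reduced to the hard limit, because an
   unprivileged process cannot go beyond it.
   Returns 0 on success, -1 if a system call fails. */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* Nothing to do if the current soft limit is already big enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    if (setrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

       A program would call, for  example,
       raise_stack_soft_limit(100*1024*1024) early in main(), before the
       first call to pcre_exec().
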
   Changing stack size in Mac OS X

       Using setrlimit(), as described above, should also work on Mac OS X. It
       is also possible to set a stack size when linking a program. There is a
       discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
       http://developer.apple.com/qa/qa2005/qa1419.html.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 09 July 2008
       Copyright (c) 1997-2008 University of Cambridge.
------------------------------------------------------------------------------
 | 
						|
 |