-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions


INTRODUCTION

       The PCRE library is a set of functions that implement regular
       expression pattern matching using the same syntax and semantics as
       Perl, with just a few differences. Certain features that appeared in
       Python and PCRE before they appeared in Perl are also available using
       the Python syntax. There is also some support for certain .NET and
       Oniguruma syntax items, and there is an option for requesting some
       minor changes that give better JavaScript compatibility.

       The current implementation of PCRE (release 7.x) corresponds
       approximately with Perl 5.10, including support for UTF-8 encoded
       strings and Unicode general category properties. However, UTF-8 and
       Unicode support has to be explicitly enabled; it is not the default.
       The Unicode tables correspond to Unicode release 5.1.

       In addition to the Perl-compatible matching function, PCRE contains an
       alternative matching function that matches the same compiled patterns
       in a different way. In certain circumstances, the alternative function
       has some advantages. For a discussion of the two matching algorithms,
       see the pcrematching page.

       PCRE is written in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds. In particular,
       Google Inc. have provided a comprehensive C++ wrapper. This is now
       included as part of the PCRE distribution. The pcrecpp page has details
       of this interface. Other people's contributions can be found in the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are and are
       not supported by PCRE are given in separate documents. See the
       pcrepattern and pcrecompat pages. There is a syntax summary in the
       pcresyntax page.

       Some features of PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it possible for a
       client to discover which features are available. The features
       themselves are described in the pcrebuild page. Documentation about
       building PCRE for various operating systems can be found in the README
       file in the source distribution.

       The library contains a number of undocumented internal functions and
       data tables that are used by more than one of the exported external
       functions, but which are not intended for use by external callers.
       Their names all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


USER DOCUMENTATION

       The user documentation for PCRE comprises a number of different
       sections. In the "man" format, each of these is a separate "man page".
       In the HTML format, each is a separate page, linked from the index
       page. In the plain text format, all the sections are concatenated, for
       ease of searching. The sections are as follows:

         pcre              this document
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcresyntax        quick syntax reference
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the sample program
         pcrestack         discussion of stack usage
         pcretest          description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short page for
       each C library function, listing its arguments and results.


LIMITATIONS

       There are some size limitations in PCRE but it is hoped that they will
       never in practice be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process regular expressions that are truly enormous, you can compile
       PCRE with an internal linkage size of 3 or 4 (see the README file in
       the source distribution and the pcrebuild documentation for details).
       In these cases the limit is substantially larger. However, the speed
       of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns.

       The maximum length of the name of a named subpattern is 32 characters,
       and the maximum number of named subpatterns is 10000.

       The maximum length of a subject string is the largest positive number
       that an integer variable can hold. However, when using the traditional
       matching function, PCRE uses recursion to handle subpatterns and
       indefinite repetition. This means that the available stack space may
       limit the size of a subject string that can be processed by certain
       patterns. For a discussion of stack issues, see the pcrestack
       documentation.


UTF-8 AND UNICODE PROPERTY SUPPORT

       From release 3.3, PCRE has had some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly extended
       to cover most common requirements, and in release 5.0 additional
       support for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support in the code, and, in addition, you must call pcre_compile()
       with the PCRE_UTF8 option flag, or the pattern must start with the
       sequence (*UTF8). When either of these is the case, both the pattern
       and any subject strings that are matched against it are treated as
       UTF-8 strings instead of just strings of bytes.

       If you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so should not be
       very big.

       If PCRE is built with Unicode character property support (which implies
       UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are
       supported. The available properties that can be tested are limited to
       the general category properties such as Lu for an upper case letter or
       Nd for a decimal number, the Unicode script names such as Arabic or
       Han, and the derived properties Any and L&. A full list is given in the
       pcrepattern documentation. Only the short names for properties are
       supported. For example, \p{L} matches a letter. Its Perl synonym,
       \p{Letter}, is not supported. Furthermore, in Perl, many properties may
       optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
       does not support this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. From release 7.3 of PCRE, the check is according to the
       rules of RFC 3629, which are themselves derived from the Unicode
       specification. Earlier releases of PCRE followed the rules of RFC 2279,
       which allows the full range of 31-bit values (0 to 0x7FFFFFFF). The
       current check allows only values in the range U+0 to U+10FFFF,
       excluding U+D800 to U+DFFF.
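
       The two range rules just described can be sketched in C as follows.
       This is an illustration only, not PCRE's own validity check: the
       function names are hypothetical, and the sketch tests decoded code
       point values rather than the byte sequences that PCRE examines.

```c
#include <stdbool.h>

/* Hypothetical helper: true if cp is acceptable under the RFC 3629 rule
   described above (U+0 to U+10FFFF, excluding U+D800 to U+DFFF). */
bool cp_valid_rfc3629(unsigned long cp)
{
    if (cp > 0x10FFFFUL) return false;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return false;
    return true;
}

/* Hypothetical helper: true if cp fits the older RFC 2279 rule, which
   allows any 31-bit value (0 to 0x7FFFFFFF). */
bool cp_valid_rfc2279(unsigned long cp)
{
    return cp <= 0x7FFFFFFFUL;
}
```

       A value such as 0x110000 passes the older test but fails the newer
       one, which is the difference between the two releases described
       above.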

       The excluded code points are the surrogate areas of Unicode (the high
       surrogates U+D800 to U+DBFF and the low surrogates U+DC00 to U+DFFF).
       Of the "Low Surrogate Area", the Unicode Standard says this: "The Low
       Surrogate Area does not contain any character assignments,
       consequently no character code charts or namelists are provided for
       this area. Surrogates are reserved for use with UTF-16 and then must be
       used in pairs." The code points that are encoded by UTF-16 pairs are
       available as independent code points in the UTF-8 encoding. (In other
       words, the whole surrogate thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8.)

       If an invalid UTF-8 string is passed to PCRE, an error return
       (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
       that your strings are valid, and therefore want to skip these checks in
       order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
       compile time or at run time, PCRE assumes that the pattern or subject
       it is given (respectively) contains only valid UTF-8 codes. In this
       case, it does not diagnose an invalid UTF-8 string.

       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
       what happens depends on why the string is invalid. If the string
       conforms to the "old" definition of UTF-8 (RFC 2279), it is processed
       as a string of characters in the range 0 to 0x7FFFFFFF. In other words,
       apart from the initial validity test, PCRE (when in UTF-8 mode) handles
       strings according to the more liberal rules of RFC 2279. However, if
       the string does not even conform to RFC 2279, the result is undefined.
       Your program may crash.

       If you want to process strings of values in the full range 0 to
       0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
       set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
       this situation, you will have to apply your own validity check.

   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape sequence (such as \xb3) matches a
       two-byte UTF-8 character if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and match two-byte UTF-8
       characters for values greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not to
       individual bytes, for example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a
       single byte.

       5. The escape sequence \C can be used to match a single byte in UTF-8
       mode, but its use can lead to some strange effects. This facility is
       not available in the alternative matching function, pcre_dfa_exec().

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but the characters that PCRE
       recognizes as digits, spaces, or word characters remain the same set as
       before, all with values less than 256. This remains true even when PCRE
       includes Unicode property support, because to do otherwise would slow
       down PCRE in many common cases. If you really want to test for a wider
       sense of, say, "digit", you must use Unicode property tests such as
       \p{Nd}. Note that this also applies to \b, because it is defined in
       terms of \w and \W.

       7. Similarly, characters that match the POSIX named character classes
       are all low-valued characters.

       8. However, the Perl 5.10 horizontal and vertical whitespace matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode
       characters.

       9. Case-insensitive matching applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property support.
       Even when Unicode property support is available, PCRE still uses its
       own character tables when checking the case of low-valued characters,
       so as not to degrade performance. The Unicode property information is
       used only for characters with higher values. Even when Unicode property
       support is available, PCRE supports case-insensitive matching only when
       there is a one-to-one mapping between a letter's cases. There are a
       small number of many-to-one mappings in Unicode; these are not
       supported by PCRE.
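
       Items 1 to 3 in the list above follow from the way UTF-8 assigns byte
       lengths to value ranges. The sketch below computes that length under
       RFC 3629. The function name is hypothetical and the code is not taken
       from PCRE; it also does not reject the surrogate range.

```c
/* Hypothetical helper: number of bytes UTF-8 uses for code point cp under
   RFC 3629, or 0 if cp is above U+10FFFF. Surrogates are not rejected. */
int utf8_encoded_length(unsigned long cp)
{
    if (cp < 0x80UL) return 1;       /* 0 to 127: one byte */
    if (cp < 0x800UL) return 2;      /* 128 to 2047: two bytes */
    if (cp < 0x10000UL) return 3;    /* 2048 to 65535: three bytes */
    if (cp <= 0x10FFFFUL) return 4;  /* up to U+10FFFF: four bytes */
    return 0;
}
```

       Both \xb3 (179) and \777 (511) fall in the two-byte range, and
       \x{100}{3} matches three two-byte characters, six bytes in all.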


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam magnet,
       so I've taken it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.


REVISION

       Last updated: 11 April 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------


PCREBUILD(3)                                                      PCREBUILD(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE BUILD-TIME OPTIONS

       This document describes the optional features of PCRE that can be
       selected when the library is compiled. It assumes use of the configure
       script, where the optional features are selected or deselected by
       providing options to configure before running the make command.
       However, the same options can be selected in both Unix-like and
       non-Unix-like environments using the GUI facility of CMakeSetup if you
       are using CMake instead of configure to build PCRE.

       The complete list of options for configure (which includes the standard
       ones such as the selection of the installation directory) can be
       obtained by running

         ./configure --help

       The following sections include descriptions of options whose names
       begin with --enable or --disable. These settings specify changes to the
       defaults for the configure command. Because of the way that configure
       works, --enable and --disable always come in pairs, so the
       complementary option always exists as well, but as it specifies the
       default, it is not described.


C++ SUPPORT

       By default, the configure script will search for a C++ compiler and C++
       header files. If it finds them, it automatically builds the C++ wrapper
       library for PCRE. You can disable this by adding

         --disable-cpp

       to the configure command.


UTF-8 SUPPORT

       To build PCRE with support for UTF-8 Unicode character strings, add

         --enable-utf8

       to the configure command. Of itself, this does not make PCRE treat
       strings as UTF-8. As well as compiling PCRE with this option, you also
       have to set the PCRE_UTF8 option when you call the pcre_compile()
       function.

       If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
       expects its input to be either ASCII or UTF-8 (depending on the runtime
       option). It is not possible to support both EBCDIC and UTF-8 codes in
       the same version of the library. Consequently, --enable-utf8 and
       --enable-ebcdic are mutually exclusive.


UNICODE CHARACTER PROPERTY SUPPORT

       UTF-8 support allows PCRE to process character values greater than 255
       in the strings that it handles. On its own, however, it does not
       provide any facilities for accessing the properties of such characters.
       If you want to be able to use the pattern escapes \P, \p, and \X, which
       refer to Unicode character properties, you must add

         --enable-unicode-properties

       to the configure command. This implies UTF-8 support, even if you have
       not explicitly requested it.

       Including Unicode property support adds around 30K of tables to the
       PCRE library. The properties that can be tested are limited to the
       general category properties such as Lu and Nd, the Unicode script
       names, and the derived properties Any and L&. Details are given in the
       pcrepattern documentation.


CODE VALUE OF NEWLINE

       By default, PCRE interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline character on Unix-like
       systems. You can compile PCRE to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the configure command. There is also a --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two character sequence CRLF. If you want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which causes PCRE to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE to recognize any Unicode newline sequence.

       Whatever line ending convention is selected when PCRE is built can be
       overridden when the library functions are called. At build time it is
       conventional to use the standard for your operating system.
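
       To make the difference between the conventions concrete, here is a
       minimal C sketch of how a newline test could vary with the selected
       convention. It is not PCRE's code: the enum and function names are
       hypothetical, and the "any Unicode newline" convention is left out
       because this section does not list the sequences it recognizes.

```c
#include <stddef.h>

/* Hypothetical names for four of the five build-time conventions. */
enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Returns the length in bytes of the newline sequence that starts at
   s[0], or 0 if none starts there. len is the number of bytes available. */
size_t newline_length(const char *s, size_t len, enum newline_convention nl)
{
    if (len == 0) return 0;
    switch (nl) {
    case NL_CR:
        return s[0] == '\r' ? 1 : 0;
    case NL_LF:
        return s[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return (len >= 2 && s[0] == '\r' && s[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        /* CR followed by LF counts as one two-byte newline. */
        if (s[0] == '\r')
            return (len >= 2 && s[1] == '\n') ? 2 : 1;
        return s[0] == '\n' ? 1 : 0;
    }
    return 0;
}
```

       With the same two bytes CR LF as input, the sketch reports a one-byte
       newline under the CR convention, none under LF (the first byte is not
       a linefeed), and a two-byte newline under CRLF and ANYCRLF.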


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches any Unicode newline
       sequence, whatever has been selected as the line ending sequence. If
       you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or CRLF.
       Whatever is selected when PCRE is built can be overridden when the
       library functions are called.


BUILDING SHARED AND STATIC LIBRARIES

       The PCRE building process uses libtool to build both shared and static
       Unix libraries by default. You can suppress one of these by adding one
       of

         --disable-shared
         --disable-static

       to the configure command, as required.


POSIX MALLOC USAGE

       When PCRE is called through the POSIX interface (see the pcreposix
       documentation), additional working storage is required for holding the
       pointers to capturing substrings, because PCRE requires three integers
       per substring, whereas the POSIX interface provides only two. If the
       number of expected substrings is small, the wrapper function uses space
       on the stack, because this is faster than using malloc() for each call.
       The default threshold above which the stack is no longer used is 10; it
       can be changed by adding a setting such as

         --with-posix-malloc-threshold=20

       to the configure command.
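
       The stack-or-heap choice described above can be sketched as follows.
       This is a hedged illustration of the pattern, not the POSIX wrapper's
       source: the macro and function names are hypothetical, and the fill
       loop merely stands in for the matching call.

```c
#include <stdlib.h>

#define MALLOC_THRESHOLD 10   /* assumed: mirrors the documented default */

/* Hypothetical wrapper. nmatch is the number of substrings the caller
   expects. Returns the number of ints in the working vector, or -1 if
   allocation fails. *used_heap reports which storage was chosen. */
int with_workspace(size_t nmatch, int *used_heap)
{
    int small[MALLOC_THRESHOLD * 3];  /* three ints per substring */
    int *ovector = small;
    size_t count = nmatch * 3;
    size_t i;

    *used_heap = nmatch > MALLOC_THRESHOLD;
    if (*used_heap) {
        ovector = malloc(count * sizeof *ovector);
        if (ovector == NULL) return -1;
    }

    for (i = 0; i < count; i++) ovector[i] = -1;  /* stand-in for matching */

    if (*used_heap) free(ovector);
    return (int)count;
}
```

       Ten expected substrings fit in the 30-int stack array; eleven or more
       take the malloc() path, which is the cost the threshold is meant to
       avoid for small requests.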


HANDLING VERY LARGE PATTERNS

       Within a compiled pattern, offset values are used to point from one
       part to another (for example, from an opening parenthesis to an
       alternation metacharacter). By default, two-byte values are used for
       these offsets, leading to a maximum size for a compiled pattern of
       around 64K. This is sufficient to handle all but the most gigantic
       patterns. Nevertheless, some people do want to process enormous
       patterns, so it is possible to compile PCRE to use three-byte or
       four-byte offsets by adding a setting such as

         --with-link-size=3

       to the configure command. The value given must be 2, 3, or 4. Using
       longer offsets slows down the operation of PCRE because it has to load
       additional bytes when handling them.
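
       The relationship between link size, maximum offset, and the cost of
       loading an offset can be sketched as below. These helpers are
       hypothetical and are not PCRE's own macros; in particular, storing
       the most significant byte first is an assumption made for the sketch.

```c
/* Hypothetical helpers, not PCRE's own macros. link_size is 2, 3, or 4. */

/* Largest offset that fits in link_size bytes. */
unsigned long max_offset(int link_size)
{
    return (link_size >= 4) ? 0xFFFFFFFFUL
                            : (1UL << (8 * link_size)) - 1UL;
}

/* Store v at p, most significant byte first (an assumed byte order). */
void put_offset(unsigned char *p, unsigned long v, int link_size)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xFFUL);
        v >>= 8;
    }
}

/* Load an offset: one shift-and-or step per byte, which is why longer
   offsets cost more to handle. */
unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long v = 0;
    int i;
    for (i = 0; i < link_size; i++)
        v = (v << 8) | p[i];
    return v;
}
```

       With a link size of 2 the largest offset is 65535, which corresponds
       to the "around 64K" figure above; a link size of 3 raises it to
       16777215.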


AVOIDING EXCESSIVE STACK USAGE

       When matching with the pcre_exec() function, PCRE implements
       backtracking by making recursive calls to an internal function called
       match(). In environments where the size of the stack is limited, this
       can severely limit PCRE's operation. (The Unix environment does not
       usually suffer from this problem, but it may sometimes be necessary to
       increase the maximum stack size. There is a discussion in the pcrestack
       documentation.) An alternative approach to recursion that uses memory
       from the heap to remember data, instead of using recursive function
       calls, has been implemented to work round the problem of limited stack
       size. If you want to build a version of PCRE that works this way, add

         --disable-stack-for-recursion

       to the configure command. With this configuration, PCRE will use the
       pcre_stack_malloc and pcre_stack_free variables to call memory
       management functions. By default these point to malloc() and free(),
       but you can replace the pointers so that your own functions are used.

       Separate functions are provided rather than using pcre_malloc and
       pcre_free because the usage is very predictable: the block sizes
       requested are always the same, and the blocks are always freed in
       reverse order. A calling program might be able to implement optimized
       functions that perform better than malloc() and free(). PCRE runs
       noticeably more slowly when built in this way. This option affects only
       the pcre_exec() function; it is not relevant for the pcre_dfa_exec()
       function.
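
       Because the blocks are all the same size and are freed in reverse
       order, a replacement allocator can be a simple last-in, first-out
       pool. The sketch below shows one possible shape. The function names,
       the pool capacity, and the block size are all assumptions; a real
       replacement would have to be sized for the blocks PCRE actually
       requests.

```c
#include <stddef.h>

#define POOL_BLOCKS 64        /* assumed capacity of the pool */
#define POOL_BLOCK_SIZE 256   /* assumed upper bound on the block size */

/* The union forces an alignment suitable for common object types. */
static union pool_block {
    unsigned char bytes[POOL_BLOCK_SIZE];
    long double align_ld;
    void *align_ptr;
} pool[POOL_BLOCKS];
static size_t pool_top = 0;   /* index of the next free block */

/* Hand out the next block, or NULL if the request does not fit. */
void *lifo_stack_malloc(size_t size)
{
    if (size > POOL_BLOCK_SIZE || pool_top == POOL_BLOCKS) return NULL;
    return pool[pool_top++].bytes;
}

/* Release the most recently allocated block. Calls that do not follow
   reverse order are ignored in this sketch. */
void lifo_stack_free(void *block)
{
    if (pool_top > 0 && block == (void *)pool[pool_top - 1].bytes)
        pool_top--;
}
```

       A program linked against a PCRE built with
       --disable-stack-for-recursion would assign functions like these to the
       pcre_stack_malloc and pcre_stack_free variables before calling
       pcre_exec(). That assignment is omitted here so that the sketch
       compiles without the PCRE headers.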


LIMITING PCRE RESOURCE USAGE

       Internally, PCRE has a function called match(), which it calls
       repeatedly (sometimes recursively) when matching a pattern with the
       pcre_exec() function. By controlling the maximum number of times this
       function may be called during a single matching operation, a limit can
       be placed on the resources used by a single call to pcre_exec(). The
       limit can be changed at run time, as described in the pcreapi
       documentation. The default is 10 million, but this can be changed by
       adding a setting such as

         --with-match-limit=500000

       to the configure command. This setting has no effect on the
       pcre_dfa_exec() matching function.

       In some environments it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to restrict the maximum amount of stack (or heap, if
       --disable-stack-for-recursion is specified) that is used. A second
       limit controls this; it defaults to the value that is set for
       --with-match-limit, which imposes no additional constraints. However,
       you can set a lower limit by adding, for example,

         --with-match-limit-recursion=10000

       to the configure command. This value can also be overridden at run
       time.
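
       The two limits can be illustrated with a toy backtracking matcher
       that counts every call and checks its recursion depth. This is a
       sketch of the idea only. It is not PCRE's match() function, and the
       type, function, and error-code names are hypothetical.

```c
enum { TOY_MATCH = 1, TOY_NOMATCH = 0,
       TOY_ERR_CALLS = -1, TOY_ERR_DEPTH = -2 };   /* hypothetical codes */

struct toy_limits {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* total-call limit */
    unsigned long depth_limit;  /* recursion-depth limit */
};

/* Toy backtracking matcher, anchored at the start of sub. The pattern may
   contain literal characters, '.' (any character) and 'x*' (zero or more
   of x). Every call is counted and its depth is checked. */
int toy_match(const char *pat, const char *sub,
              struct toy_limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->call_limit) return TOY_ERR_CALLS;
    if (depth > lim->depth_limit) return TOY_ERR_DEPTH;

    if (pat[0] == '\0') return TOY_MATCH;

    if (pat[1] == '*') {
        for (;;) {
            /* Try the rest of the pattern; on failure consume one more. */
            int r = toy_match(pat + 2, sub, lim, depth + 1);
            if (r != TOY_NOMATCH) return r;   /* match or limit error */
            if (*sub == '\0' || (pat[0] != '.' && pat[0] != *sub))
                return TOY_NOMATCH;
            sub++;
        }
    }

    if (*sub != '\0' && (pat[0] == '.' || pat[0] == *sub))
        return toy_match(pat + 1, sub + 1, lim, depth + 1);
    return TOY_NOMATCH;
}
```

       In this sketch the depth of a call can never exceed the number of
       calls made so far, so setting the depth limit equal to the call limit
       adds no constraint, in the same way as the documented default. A
       pattern such as a*a*a*b tried against a run of sixteen "a" characters
       makes many calls at shallow depth and so trips the call limit, while
       a long literal pattern trips the depth limit first.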


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE uses fixed tables for processing characters whose code values are
       less than 256. By default, PCRE is built with a set of tables that are
       distributed in the file pcre_chartables.c.dist. These tables are for
       ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and run. This outputs
       the source for a new set of tables, created in the default locale of
       your C runtime system. (This method of replacing the tables does not
       work if you are cross compiling, because dftables is run on the local
       host. If you need to create alternative tables when cross compiling,
       you will have to do so "by hand".)
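
       The idea of generating tables from the C runtime's locale can be
       sketched as follows. This is not the dftables source, and the layout
       of PCRE's real tables differs; the flag names and the function name
       are hypothetical.

```c
#include <ctype.h>

/* Hypothetical flag bits; the layout of PCRE's real tables differs. */
#define TBL_SPACE 0x01
#define TBL_DIGIT 0x02
#define TBL_WORD  0x04

/* Fill a 256-entry table from the C library's classification functions,
   which follow whatever locale is current when this runs. */
void build_char_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        unsigned char flags = 0;
        if (isspace(c)) flags |= TBL_SPACE;
        if (isdigit(c)) flags |= TBL_DIGIT;
        if (isalnum(c) || c == '_') flags |= TBL_WORD;
        table[c] = flags;
    }
}
```

       Because the classification functions are evaluated on the machine
       that runs the generator, the resulting table reflects that machine's
       locale, which is why this approach does not carry over to cross
       compiling.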


USING EBCDIC CODE

       PCRE assumes by default that it will run in an environment where the
       character code is ASCII (or Unicode, which is a superset of ASCII).
       This is the case for most computer operating systems. PCRE can,
       however, be compiled to run in an EBCDIC environment by adding

         --enable-ebcdic

       to the configure command. This setting implies
       --enable-rebuild-chartables. You should only use it if you know that
       you are in an EBCDIC environment (for example, an IBM mainframe
       operating system). The --enable-ebcdic option is incompatible with
       --enable-utf8.


PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By default, pcregrep reads all files as plain text. You can build it so
       that it recognizes files whose names end in .gz or .bz2, and reads them
       with libz or libbz2, respectively, by adding one or both of

         --enable-pcregrep-libz
         --enable-pcregrep-libbz2

       to the configure command. These options naturally require that the
       relevant libraries are installed on your system. Configuration will
       fail if they are not.


PCRETEST OPTION FOR LIBREADLINE SUPPORT

       If you add

         --enable-pcretest-libreadline

       to the configure command, pcretest is linked with the libreadline
       library, and when its input is from a terminal, it reads it using the
       readline() function. This provides line-editing and history facilities.
       Note that libreadline is GPL-licenced, so if you distribute a binary of
       pcretest linked in this way, there may be licensing issues.

       Setting this option causes the -lreadline option to be added to the
       pcretest build. In many operating environments with a system-installed
       libreadline this is sufficient. However, in some environments (e.g. if
       an unmodified distribution version of readline is in use), some extra
       configuration may be necessary. The INSTALL file for libreadline says
       this:

         "Readline uses the termcap functions, but does not link with the
         termcap or curses library itself, allowing applications which link
         with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


SEE ALSO

       pcreapi(3), pcre_config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 17 March 2009
       Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREMATCHING(3)                                                PCREMATCHING(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE MATCHING ALGORITHMS
 | |
| 
 | |
|        This document describes the two different algorithms that are available
 | |
|        in PCRE for matching a compiled regular expression against a given sub-
 | |
|        ject  string.  The  "standard"  algorithm  is  the  one provided by the
 | |
|        pcre_exec() function.  This works in the same way  as  Perl's  matching
 | |
|        function, and provides a Perl-compatible matching operation.
 | |
| 
 | |
|        An  alternative  algorithm is provided by the pcre_dfa_exec() function;
 | |
|        this operates in a different way, and is not  Perl-compatible.  It  has
 | |
|        advantages  and disadvantages compared with the standard algorithm, and
 | |
|        these are described below.
 | |
| 
 | |
|        When there is only one possible way in which a given subject string can
 | |
|        match  a pattern, the two algorithms give the same answer. A difference
 | |
|        arises, however, when there are multiple possibilities. For example, if
 | |
|        the pattern
 | |
| 
 | |
|          ^<.*>
 | |
| 
 | |
|        is matched against the string
 | |
| 
 | |
|          <something> <something else> <something further>
 | |
| 
 | |
|        there are three possible answers. The standard algorithm finds only one
 | |
|        of them, whereas the alternative algorithm finds all three.
 | |
| 
 | |
| 
 | |
| REGULAR EXPRESSIONS AS TREES
 | |
| 
 | |
|        The set of strings that are matched by a regular expression can be rep-
 | |
|        resented  as  a  tree structure. An unlimited repetition in the pattern
 | |
|        makes the tree of infinite size, but it is still a tree.  Matching  the
 | |
|        pattern  to a given subject string (from a given starting point) can be
 | |
|        thought of as a search of the tree.  There are two  ways  to  search  a
 | |
|        tree:  depth-first  and  breadth-first, and these correspond to the two
 | |
|        matching algorithms provided by PCRE.
 | |
| 
 | |
| 
 | |
| THE STANDARD MATCHING ALGORITHM
 | |
| 
 | |
|        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
 | |
|        sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
 | |
|        depth-first search of the pattern tree. That is, it  proceeds  along  a
 | |
|        single path through the tree, checking that the subject matches what is
 | |
|        required. When there is a mismatch, the algorithm  tries  any  alterna-
 | |
|        tives  at  the  current point, and if they all fail, it backs up to the
 | |
|        previous branch point in the  tree,  and  tries  the  next  alternative
 | |
|        branch  at  that  level.  This often involves backing up (moving to the
 | |
|        left) in the subject string as well.  The  order  in  which  repetition
 | |
|        branches  are  tried  is controlled by the greedy or ungreedy nature of
 | |
|        the quantifier.
 | |
| 
 | |
|        If a leaf node is reached, a matching string has  been  found,  and  at
 | |
|        that  point the algorithm stops. Thus, if there is more than one possi-
 | |
|        ble match, this algorithm returns the first one that it finds.  Whether
 | |
|        this  is the shortest, the longest, or some intermediate length depends
 | |
|        on the way the greedy and ungreedy repetition quantifiers are specified
 | |
|        in the pattern.
 | |
| 
 | |
|        Because  it  ends  up  with a single path through the tree, it is rela-
 | |
|        tively straightforward for this algorithm to keep  track  of  the  sub-
 | |
|        strings  that  are  matched  by portions of the pattern in parentheses.
 | |
|        This provides support for capturing parentheses and back references.
 | |
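
       The depth-first search can be sketched in a few lines of C. This is a
       toy matcher written for illustration only, not PCRE's implementation:
       it understands literal characters, ".", and a single-character "*"
       repetition. The names match_here and match_star and the greedy flag
       are assumptions. The function returns the length of the first match
       found at the start of the text, or -1 if there is none.

```c
#include <assert.h>
#include <stddef.h>

int match_here(const char *pat, const char *text, int greedy);

/* Match c* followed by the rest of the pattern. Each repeat count is a
   branch point; the greedy flag decides the order in which the branches
   are tried. */
int match_star(char c, const char *pat, const char *text, int greedy)
{
    int max = 0;
    int n;

    /* Find the largest number of characters the repetition could take. */
    while (text[max] != '\0' && (c == '.' || text[max] == c))
        max++;

    if (greedy) {
        for (n = max; n >= 0; n--) {        /* longest first, then back up */
            int rest = match_here(pat, text + n, greedy);
            if (rest >= 0) return n + rest;
        }
    } else {
        for (n = 0; n <= max; n++) {        /* shortest first */
            int rest = match_here(pat, text + n, greedy);
            if (rest >= 0) return n + rest;
        }
    }
    return -1;                              /* every branch failed */
}

/* Follow a single path; stop at the first leaf that is reached. */
int match_here(const char *pat, const char *text, int greedy)
{
    assert(pat != NULL && text != NULL);
    if (pat[0] == '\0') return 0;           /* leaf node: a match was found */
    if (pat[1] == '*')
        return match_star(pat[0], pat + 2, text, greedy);
    if (text[0] != '\0' && (pat[0] == '.' || pat[0] == text[0])) {
        int rest = match_here(pat + 1, text + 1, greedy);
        return rest < 0 ? -1 : 1 + rest;
    }
    return -1;                              /* mismatch: caller backs up */
}
```

       With the subject "<something> <something else> <something further>",
       match_here("<.*>", subject, 1) gives 48, the longest answer, because
       the greedy search tries the longest repetition first. With a greedy
       flag of 0 the same call gives 11, the shortest answer.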
| 
 | |
| 
 | |
| THE ALTERNATIVE MATCHING ALGORITHM
 | |
| 
 | |
|        This algorithm conducts a breadth-first search of  the  tree.  Starting
 | |
|        from  the  first  matching  point  in the subject, it scans the subject
 | |
|        string from left to right, once, character by character, and as it does
 | |
|        this,  it remembers all the paths through the tree that represent valid
 | |
|        matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
 | |
|        though  it is not implemented as a traditional finite state machine (it
 | |
|        keeps multiple states active simultaneously).
 | |
| 
 | |
|        The scan continues until either the end of the subject is  reached,  or
 | |
|        there  are  no more unterminated paths. At this point, terminated paths
 | |
|        represent the different matching possibilities (if there are none,  the
 | |
|        match  has  failed).   Thus,  if there is more than one possible match,
 | |
|        this algorithm finds all of them, and in particular, it finds the long-
 | |
|        est.  In PCRE, there is an option to stop the algorithm after the first
 | |
|        match (which is necessarily the shortest) has been found.
 | |
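
       The single left-to-right scan with several active paths can be
       sketched as follows. This is an illustrative toy, not the code behind
       pcre_dfa_exec(): it accepts only literal characters, ".", and a
       single-character "*" repetition, and every name in it (struct item,
       parse_items, all_match_lengths, MAX_ITEMS) is an assumption. It
       records the length of every match that starts at the beginning of the
       text, shortest first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 32

/* One pattern item: a literal character or '.', optionally starred. */
struct item { char ch; int starred; };

/* Split the toy pattern into items; returns the count, or -1 if too long. */
int parse_items(const char *pat, struct item *items)
{
    int n = 0;
    while (*pat != '\0') {
        if (n == MAX_ITEMS) return -1;
        items[n].ch = *pat++;
        items[n].starred = (*pat == '*');
        if (items[n].starred) pat++;
        n++;
    }
    return n;
}

/* A starred item may match zero times, so state i also implies state i+1. */
static void close_states(const struct item *items, int n, unsigned char *set)
{
    int i;
    for (i = 0; i < n; i++)
        if (set[i] && items[i].starred) set[i + 1] = 1;
}

/* Scan the text once, keeping every live state. State n means "pattern
   complete", so each time it is active a match length is recorded.
   Returns the number of lengths stored, or -1 if the pattern is too long. */
int all_match_lengths(const char *pat, const char *text,
                      int *lengths, int max_lengths)
{
    struct item items[MAX_ITEMS];
    unsigned char active[MAX_ITEMS + 1] = {0};
    unsigned char next[MAX_ITEMS + 1];
    int n, i, pos, found = 0;

    assert(pat != NULL && text != NULL && lengths != NULL);
    n = parse_items(pat, items);
    if (n < 0) return -1;

    active[0] = 1;
    close_states(items, n, active);
    for (pos = 0; ; pos++) {
        int any = 0;
        if (active[n] && found < max_lengths) lengths[found++] = pos;
        if (text[pos] == '\0') break;       /* end of the subject */
        memset(next, 0, sizeof next);
        for (i = 0; i < n; i++) {
            if (!active[i]) continue;
            if (items[i].ch != '.' && items[i].ch != text[pos]) continue;
            next[items[i].starred ? i : i + 1] = 1;
            any = 1;
        }
        if (!any) break;                    /* no unterminated paths remain */
        close_states(items, n, next);
        memcpy(active, next, sizeof active);
    }
    return found;
}
```

       For the pattern "<.*>" and the subject "<something> <something else>
       <something further>" the function stores the three lengths 11, 28,
       and 48 in a single pass, without ever moving backwards in the
       subject.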
| 
 | |
|        Note that all the matches that are found start at the same point in the
 | |
|        subject. If the pattern
 | |
| 
 | |
|          cat(er(pillar)?)?
 | |
| 
 | |
|        is  matched  against the string "the caterpillar catchment", the result
 | |
|        will be the three strings "cat", "cater", and "caterpillar" that  start
 | |
|        at the fifth character of the subject. The algorithm does not
 | |
|        automatically move on to find matches that start at later positions.
 | |
| 
 | |
|        There are a number of features of PCRE regular expressions that are not
 | |
|        supported by the alternative matching algorithm. They are as follows:
 | |
| 
 | |
|        1.  Because  the  algorithm  finds  all possible matches, the greedy or
 | |
|        ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
 | |
|        ungreedy quantifiers are treated in exactly the same way. However, pos-
 | |
|        sessive quantifiers can make a difference when what follows could  also
 | |
|        match what is quantified, for example in a pattern like this:
 | |
| 
 | |
|          ^a++\w!
 | |
| 
 | |
|        This  pattern matches "aaab!" but not "aaa!", which would be matched by
 | |
|        a non-possessive quantifier. Similarly, if an atomic group is  present,
 | |
|        it  is matched as if it were a standalone pattern at the current point,
 | |
|        and the longest match is then "locked in" for the rest of  the  overall
 | |
|        pattern.
 | |
| 
 | |
|        2. When dealing with multiple paths through the tree simultaneously, it
 | |
|        is not straightforward to keep track of  captured  substrings  for  the
 | |
|        different  matching  possibilities,  and  PCRE's implementation of this
 | |
|        algorithm does not attempt to do this. This means that no captured sub-
 | |
|        strings are available.
 | |
| 
 | |
|        3.  Because no substrings are captured, back references within the pat-
 | |
|        tern are not supported, and cause errors if encountered.
 | |
| 
 | |
|        4. For the same reason, conditional expressions that use  a  backrefer-
 | |
|        ence  as  the  condition or test for a specific group recursion are not
 | |
|        supported.
 | |
| 
 | |
|        5. Because many paths through the tree may be  active,  the  \K  escape
 | |
|        sequence, which resets the start of the match when encountered (but may
 | |
|        be on some paths and not on others), is not  supported.  It  causes  an
 | |
|        error if encountered.
 | |
| 
 | |
|        6.  Callouts  are  supported, but the value of the capture_top field is
 | |
|        always 1, and the value of the capture_last field is always -1.
 | |
| 
 | |
|        7. The \C escape sequence, which (in the standard algorithm) matches  a
 | |
|        single  byte, even in UTF-8 mode, is not supported because the alterna-
 | |
|        tive algorithm moves through the subject  string  one  character  at  a
 | |
|        time, for all active paths through the tree.
 | |
| 
 | |
|        8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
 | |
|        are not supported. (*FAIL) is supported, and  behaves  like  a  failing
 | |
|        negative assertion.
 | |
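
       The possessive-quantifier behaviour described in item 1 of the list
       above can be checked with two hand-written matchers. They are
       hard-coded for the patterns ^a++\w! and ^a+\w! and are a sketch for
       illustration, not PCRE code. The function names are assumptions, and
       \w is taken to mean an ASCII letter, digit, or underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* \w, assumed here to be an ASCII letter, digit, or underscore. */
static int is_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Does \w! match immediately after the first n characters of s? */
static int tail_matches(const char *s, size_t n)
{
    return is_word(s[n]) && s[n + 1] == '!';
}

/* ^a++\w! : possessive, so the run of 'a's is never shortened. */
int match_possessive(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] == 'a') n++;
    return n >= 1 && tail_matches(s, n);
}

/* ^a+\w! : non-possessive, so 'a's are given back one at a time. */
int match_backtracking(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] == 'a') n++;
    for (; n >= 1; n--)
        if (tail_matches(s, n)) return 1;
    return 0;
}
```

       Both functions accept "aaab!". Only match_backtracking() accepts
       "aaa!", because it lets the final "a" be matched by \w instead.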
| 
 | |
| 
 | |
| ADVANTAGES OF THE ALTERNATIVE ALGORITHM
 | |
| 
 | |
|        Using  the alternative matching algorithm provides the following advan-
 | |
|        tages:
 | |
| 
 | |
|        1. All possible matches (at a single point in the subject) are automat-
 | |
|        ically  found,  and  in particular, the longest match is found. To find
 | |
|        more than one match using the standard algorithm, you have to do kludgy
 | |
|        things with callouts.
 | |
| 
 | |
|        2.  There is much better support for partial matching. The restrictions
 | |
|        on the content of the pattern that apply when using the standard  algo-
 | |
|        rithm  for  partial matching do not apply to the alternative algorithm.
 | |
|        For non-anchored patterns, the starting position of a partial match  is
 | |
|        available.
 | |
| 
 | |
|        3.  Because  the  alternative  algorithm  scans the subject string just
 | |
|        once, and never needs to backtrack, it is possible to  pass  very  long
 | |
|        subject  strings  to  the matching function in several pieces, checking
 | |
|        for partial matching each time.
 | |
| 
 | |
| 
 | |
| DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
 | |
| 
 | |
|        The alternative algorithm suffers from a number of disadvantages:
 | |
| 
 | |
|        1. It is substantially slower than  the  standard  algorithm.  This  is
 | |
|        partly  because  it has to search for all possible matches, but is also
 | |
|        because it is less susceptible to optimization.
 | |
| 
 | |
|        2. Capturing parentheses and back references are not supported.
 | |
| 
 | |
|        3. Although atomic groups are supported, their use does not provide the
 | |
|        performance advantage that it does for the standard algorithm.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 19 April 2008
 | |
|        Copyright (c) 1997-2008 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREAPI(3)                                                          PCREAPI(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE NATIVE API
 | |
| 
 | |
|        #include <pcre.h>
 | |
| 
 | |
|        pcre *pcre_compile(const char *pattern, int options,
 | |
|             const char **errptr, int *erroffset,
 | |
|             const unsigned char *tableptr);
 | |
| 
 | |
|        pcre *pcre_compile2(const char *pattern, int options,
 | |
|             int *errorcodeptr,
 | |
|             const char **errptr, int *erroffset,
 | |
|             const unsigned char *tableptr);
 | |
| 
 | |
|        pcre_extra *pcre_study(const pcre *code, int options,
 | |
|             const char **errptr);
 | |
| 
 | |
|        int pcre_exec(const pcre *code, const pcre_extra *extra,
 | |
|             const char *subject, int length, int startoffset,
 | |
|             int options, int *ovector, int ovecsize);
 | |
| 
 | |
|        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
 | |
|             const char *subject, int length, int startoffset,
 | |
|             int options, int *ovector, int ovecsize,
 | |
|             int *workspace, int wscount);
 | |
| 
 | |
|        int pcre_copy_named_substring(const pcre *code,
 | |
|             const char *subject, int *ovector,
 | |
|             int stringcount, const char *stringname,
 | |
|             char *buffer, int buffersize);
 | |
| 
 | |
|        int pcre_copy_substring(const char *subject, int *ovector,
 | |
|             int stringcount, int stringnumber, char *buffer,
 | |
|             int buffersize);
 | |
| 
 | |
|        int pcre_get_named_substring(const pcre *code,
 | |
|             const char *subject, int *ovector,
 | |
|             int stringcount, const char *stringname,
 | |
|             const char **stringptr);
 | |
| 
 | |
|        int pcre_get_stringnumber(const pcre *code,
 | |
|             const char *name);
 | |
| 
 | |
|        int pcre_get_stringtable_entries(const pcre *code,
 | |
|             const char *name, char **first, char **last);
 | |
| 
 | |
|        int pcre_get_substring(const char *subject, int *ovector,
 | |
|             int stringcount, int stringnumber,
 | |
|             const char **stringptr);
 | |
| 
 | |
|        int pcre_get_substring_list(const char *subject,
 | |
|             int *ovector, int stringcount, const char ***listptr);
 | |
| 
 | |
|        void pcre_free_substring(const char *stringptr);
 | |
| 
 | |
|        void pcre_free_substring_list(const char **stringptr);
 | |
| 
 | |
|        const unsigned char *pcre_maketables(void);
 | |
| 
 | |
|        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
 | |
|             int what, void *where);
 | |
| 
 | |
|        int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
 | |
| 
 | |
|        int pcre_refcount(pcre *code, int adjust);
 | |
| 
 | |
|        int pcre_config(int what, void *where);
 | |
| 
 | |
|        char *pcre_version(void);
 | |
| 
 | |
|        void *(*pcre_malloc)(size_t);
 | |
| 
 | |
|        void (*pcre_free)(void *);
 | |
| 
 | |
|        void *(*pcre_stack_malloc)(size_t);
 | |
| 
 | |
|        void (*pcre_stack_free)(void *);
 | |
| 
 | |
|        int (*pcre_callout)(pcre_callout_block *);
 | |
| 
 | |
| 
 | |
| PCRE API OVERVIEW
 | |
| 
 | |
|        PCRE has its own native API, which is described in this document. There
 | |
|        are also some wrapper functions that correspond to  the  POSIX  regular
 | |
|        expression  API.  These  are  described in the pcreposix documentation.
 | |
|        Both of these APIs define a set of C function calls. A C++  wrapper  is
 | |
|        distributed with PCRE. It is documented in the pcrecpp page.
 | |
| 
 | |
|        The  native  API  C  function prototypes are defined in the header file
 | |
|        pcre.h, and on Unix systems the library itself is called  libpcre.   It
 | |
|        can normally be accessed by adding -lpcre to the command for linking an
 | |
|        application  that  uses  PCRE.  The  header  file  defines  the  macros
 | |
|        PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
 | |
|        bers for the library.  Applications can use these  to  include  support
 | |
|        for different releases of PCRE.
 | |
| 
 | |
|        The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
 | |
|        pcre_exec() are used for compiling and matching regular expressions  in
 | |
|        a  Perl-compatible  manner. A sample program that demonstrates the sim-
 | |
|        plest way of using them is provided in the file  called  pcredemo.c  in
 | |
|        the  source distribution. The pcresample documentation describes how to
 | |
|        compile and run it.
 | |
| 
 | |
|        A second matching function, pcre_dfa_exec(), which is not Perl-compati-
 | |
|        ble,  is  also provided. This uses a different algorithm for the match-
 | |
|        ing. The alternative algorithm finds all possible matches (at  a  given
 | |
|        point  in  the subject), and scans the subject just once. However, this
 | |
|        algorithm does not return captured substrings. A description of the two
 | |
|        matching  algorithms and their advantages and disadvantages is given in
 | |
|        the pcrematching documentation.
 | |
| 
 | |
|        In addition to the main compiling and  matching  functions,  there  are
 | |
|        convenience functions for extracting captured substrings from a subject
 | |
|        string that is matched by pcre_exec(). They are:
 | |
| 
 | |
|          pcre_copy_substring()
 | |
|          pcre_copy_named_substring()
 | |
|          pcre_get_substring()
 | |
|          pcre_get_named_substring()
 | |
|          pcre_get_substring_list()
 | |
|          pcre_get_stringnumber()
 | |
|          pcre_get_stringtable_entries()
 | |
| 
 | |
|        pcre_free_substring() and pcre_free_substring_list() are also provided,
 | |
|        to free the memory used for extracted strings.
 | |
| 
 | |
|        The  function  pcre_maketables()  is  used  to build a set of character
 | |
|        tables  in  the  current  locale   for   passing   to   pcre_compile(),
 | |
|        pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
 | |
|        provided for specialist use.  Most  commonly,  no  special  tables  are
 | |
|        passed,  in  which case internal tables that are generated when PCRE is
 | |
|        built are used.
 | |
| 
 | |
|        The function pcre_fullinfo() is used to find out  information  about  a
 | |
|        compiled  pattern; pcre_info() is an obsolete version that returns only
 | |
|        some of the available information, but is retained for  backwards  com-
 | |
|        patibility.   The function pcre_version() returns a pointer to a string
 | |
|        containing the version of PCRE and its date of release.
 | |
| 
 | |
|        The function pcre_refcount() maintains a  reference  count  in  a  data
 | |
|        block  containing  a compiled pattern. This is provided for the benefit
 | |
|        of object-oriented applications.
 | |
| 
 | |
|        The global variables pcre_malloc and pcre_free  initially  contain  the
 | |
|        entry  points  of  the  standard malloc() and free() functions, respec-
 | |
|        tively. PCRE calls the memory management functions via these variables,
 | |
|        so  a  calling  program  can replace them if it wishes to intercept the
 | |
|        calls. This should be done before calling any PCRE functions.
 | |
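
       As a sketch of such interception, the wrappers below count the blocks
       that are currently allocated. The names counting_malloc,
       counting_free, and counting_live_blocks are hypothetical. The block
       compiles on its own; the assignments that install the wrappers are
       shown in a comment because they need <pcre.h> and the PCRE library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of blocks obtained through the wrapper and not yet freed. */
static size_t live_blocks = 0;

/* Same contract as malloc(), plus bookkeeping. */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) live_blocks++;
    return p;
}

/* Same contract as free(), plus bookkeeping. */
void counting_free(void *p)
{
    if (p != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(p);
}

size_t counting_live_blocks(void)
{
    return live_blocks;
}

/* In a program that includes <pcre.h> and is linked with -lpcre, install
   the wrappers before the first call to any PCRE function:

       pcre_malloc = counting_malloc;
       pcre_free   = counting_free;
*/
```

       The counter is not protected by a lock, so a multi-threaded program
       would need to add its own synchronization.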
| 
 | |
|        The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
 | |
|        indirections  to  memory  management functions. These special functions
 | |
|        are used only when PCRE is compiled to use  the  heap  for  remembering
 | |
|        data, instead of recursive function calls, when running the pcre_exec()
 | |
|        function. See the pcrebuild documentation for  details  of  how  to  do
 | |
|        this.  It  is  a non-standard way of building PCRE, for use in environ-
 | |
|        ments that have limited stacks. Because of the greater  use  of  memory
 | |
|        management,  it  runs  more  slowly. Separate functions are provided so
 | |
|        that special-purpose external code can be  used  for  this  case.  When
 | |
|        used,  these  functions  are always called in a stack-like manner (last
 | |
|        obtained, first freed), and always for memory blocks of the same  size.
 | |
|        There  is  a discussion about PCRE's stack usage in the pcrestack docu-
 | |
|        mentation.
 | |
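
       Because the blocks are all the same size and are released in
       last-obtained, first-freed order, a replacement can keep freed blocks
       on a list and hand them out again without calling malloc(). The
       following is a minimal sketch under those assumptions. The names
       pool_stack_malloc, pool_stack_free, and pool_fresh_allocations are
       hypothetical, the code is not thread-safe, and cached blocks are
       never returned to the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of each block. The max_align_t member (C11)
   keeps the payload that follows suitably aligned. */
union pool_header {
    struct { union pool_header *next; size_t size; } info;
    max_align_t align;
};

static union pool_header *free_list = NULL;   /* most recently freed first */
static size_t fresh_allocations = 0;          /* calls that reached malloc() */

void *pool_stack_malloc(size_t size)
{
    union pool_header *h = free_list;

    if (h != NULL && h->info.size >= size) {
        free_list = h->info.next;             /* reuse a cached block */
    } else {
        h = malloc(sizeof *h + size);
        if (h == NULL) return NULL;
        h->info.size = size;
        fresh_allocations++;
    }
    return h + 1;                             /* payload follows the header */
}

void pool_stack_free(void *p)
{
    union pool_header *h;

    if (p == NULL) return;
    h = (union pool_header *)p - 1;
    h->info.next = free_list;                 /* push onto the cache */
    free_list = h;
}

size_t pool_fresh_allocations(void)
{
    return fresh_allocations;
}

/* With a PCRE library built to use the heap, the functions would be
   installed with:

       pcre_stack_malloc = pool_stack_malloc;
       pcre_stack_free   = pool_stack_free;
*/
```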
| 
 | |
|        The global variable pcre_callout initially contains NULL. It can be set
 | |
|        by  the  caller  to  a "callout" function, which PCRE will then call at
 | |
|        specified points during a matching operation. Details are given in  the
 | |
|        pcrecallout documentation.
 | |
| 
 | |
| 
 | |
| NEWLINES
 | |
| 
 | |
|        PCRE  supports five different conventions for indicating line breaks in
 | |
|        strings: a single CR (carriage return) character, a  single  LF  (line-
 | |
|        feed) character, the two-character sequence CRLF, any of the three pre-
 | |
|        ceding, or any Unicode newline sequence. The Unicode newline  sequences
 | |
|        are  the  three just mentioned, plus the single characters VT (vertical
 | |
|        tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
 | |
|        separator, U+2028), and PS (paragraph separator, U+2029).
 | |
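
       The five conventions can be made concrete with a small helper that
       reports how many bytes of newline start at a given position. This is
       an illustration rather than PCRE code. The enumeration names are
       hypothetical and are not PCRE's option bits, and the subject is
       assumed to be UTF-8 encoded for the "any Unicode newline" case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the five conventions. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the byte length of the newline sequence that starts at s[pos]
   under the given convention, or 0 if none starts there. */
size_t newline_length(const unsigned char *s, size_t len, size_t pos,
                      enum nl_convention nl)
{
    size_t left;

    assert(s != NULL && pos <= len);
    left = len - pos;
    if (left == 0) return 0;
    s += pos;

    switch (nl) {
    case NL_CR:
        return s[0] == '\r' ? 1 : 0;
    case NL_LF:
        return s[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return (left >= 2 && s[0] == '\r' && s[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
    case NL_ANY:
        if (s[0] == '\r') return (left >= 2 && s[1] == '\n') ? 2 : 1;
        if (s[0] == '\n') return 1;
        if (nl == NL_ANYCRLF) return 0;
        if (s[0] == 0x0B || s[0] == 0x0C) return 1;          /* VT, FF */
        if (left >= 2 && s[0] == 0xC2 && s[1] == 0x85)
            return 2;                                        /* NEL, U+0085 */
        if (left >= 3 && s[0] == 0xE2 && s[1] == 0x80 &&
            (s[2] == 0xA8 || s[2] == 0xA9))
            return 3;                                        /* LS, PS */
        return 0;
    }
    return 0;
}
```

       Under the CRLF convention a lone CR or a lone LF is not a newline,
       whereas under the ANYCRLF convention each of CR, LF, and CRLF is one.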
| 
 | |
|        Each  of  the first three conventions is used by at least one operating
 | |
|        system as its standard newline sequence. When PCRE is built, a  default
 | |
|        can be specified. If none is, the default is LF, which is the Unix
 | |
|        standard. When PCRE is run, the default can be overridden, either when
 | |
|        a pattern is compiled, or when it is matched.
 | |
| 
 | |
|        At compile time, the newline convention can be specified by the options
 | |
|        argument of pcre_compile(), or it can be specified by special  text  at
 | |
|        the start of the pattern itself; this overrides any other settings. See
 | |
|        the pcrepattern page for details of the special character sequences.
 | |
| 
 | |
|        In the PCRE documentation the word "newline" is used to mean "the char-
 | |
|        acter  or pair of characters that indicate a line break". The choice of
 | |
|        newline convention affects the handling of  the  dot,  circumflex,  and
 | |
|        dollar metacharacters, the handling of #-comments in /x mode, and, when
 | |
|        CRLF is a recognized line ending sequence, the match position  advance-
 | |
|        ment for a non-anchored pattern. There is more detail about this in the
 | |
|        section on pcre_exec() options below.
 | |
| 
 | |
|        The choice of newline convention does not affect the interpretation  of
 | |
|        the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
 | |
|        which is controlled in a similar way, but by separate options.
 | |
| 
 | |
| 
 | |
| MULTITHREADING
 | |
| 
 | |
|        The PCRE functions can be used in  multi-threading  applications,  with
 | |
|        the  proviso  that  the  memory  management  functions  pointed  to  by
 | |
|        pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
 | |
|        callout function pointed to by pcre_callout, are shared by all threads.
 | |
| 
 | |
|        The  compiled form of a regular expression is not altered during match-
 | |
|        ing, so the same compiled pattern can safely be used by several threads
 | |
|        at once.
 | |
| 
 | |
| 
 | |
| SAVING PRECOMPILED PATTERNS FOR LATER USE
 | |
| 
 | |
|        The compiled form of a regular expression can be saved and re-used at a
 | |
|        later time, possibly by a different program, and even on a  host  other
 | |
|        than  the  one  on  which  it  was  compiled.  Details are given in the
 | |
|        pcreprecompile documentation. However, compiling a  regular  expression
 | |
|        with  one version of PCRE for use with a different version is not guar-
 | |
|        anteed to work and may cause crashes.
 | |
| 
 | |
| 
 | |
| CHECKING BUILD-TIME OPTIONS
 | |
| 
 | |
|        int pcre_config(int what, void *where);
 | |
| 
 | |
|        The function pcre_config() makes it possible for a PCRE client to  dis-
 | |
|        cover which optional features have been compiled into the PCRE library.
 | |
|        The pcrebuild documentation has more details about these optional  fea-
 | |
|        tures.
 | |
| 
 | |
|        The  first  argument  for pcre_config() is an integer, specifying which
 | |
|        information is required; the second argument is a pointer to a variable
 | |
|        into  which  the  information  is  placed. The following information is
 | |
|        available:
 | |
| 
 | |
|          PCRE_CONFIG_UTF8
 | |
| 
 | |
|        The output is an integer that is set to one if UTF-8 support is  avail-
 | |
|        able; otherwise it is set to zero.
 | |
| 
 | |
|          PCRE_CONFIG_UNICODE_PROPERTIES
 | |
| 
 | |
|        The  output  is  an  integer  that is set to one if support for Unicode
 | |
|        character properties is available; otherwise it is set to zero.
 | |
| 
 | |
|          PCRE_CONFIG_NEWLINE
 | |
| 
 | |
|        The output is an integer whose value specifies  the  default  character
 | |
|        sequence  that is recognized as meaning "newline". The five values that
 | |
|        are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
 | |
|        and  -1  for  ANY.  Though they are derived from ASCII, the same values
 | |
|        are returned in EBCDIC environments. The default should normally corre-
 | |
|        spond to the standard sequence for your operating system.
 | |
| 
 | |
|          PCRE_CONFIG_BSR
 | |
| 
 | |
|        The output is an integer whose value indicates what character sequences
 | |
|        the \R escape sequence matches by default. A value of 0 means  that  \R
 | |
|        matches  any  Unicode  line ending sequence; a value of 1 means that \R
 | |
|        matches only CR, LF, or CRLF. The default can be overridden when a pat-
 | |
|        tern is compiled or matched.
 | |
| 
 | |
|          PCRE_CONFIG_LINK_SIZE
 | |
| 
 | |
|        The  output  is  an  integer that contains the number of bytes used for
 | |
|        internal linkage in compiled regular expressions. The value is 2, 3, or
 | |
|        4.  Larger  values  allow larger regular expressions to be compiled, at
 | |
|        the expense of slower matching. The default value of  2  is  sufficient
 | |
|        for  all  but  the  most massive patterns, since it allows the compiled
 | |
|        pattern to be up to 64K in size.
 | |
| 
 | |
|          PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
 | |
| 
 | |
|        The output is an integer that contains the threshold  above  which  the
 | |
|        POSIX  interface  uses malloc() for output vectors. Further details are
 | |
|        given in the pcreposix documentation.
 | |
| 
 | |
|          PCRE_CONFIG_MATCH_LIMIT
 | |
| 
 | |
|        The output is a long integer that gives the default limit for the  num-
 | |
|        ber  of  internal  matching  function calls in a pcre_exec() execution.
 | |
|        Further details are given with pcre_exec() below.
 | |
| 
 | |
|          PCRE_CONFIG_MATCH_LIMIT_RECURSION
 | |
| 
 | |
|        The output is a long integer that gives the default limit for the depth
 | |
|        of   recursion  when  calling  the  internal  matching  function  in  a
 | |
|        pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
 | |
|        below.
 | |
| 
 | |
|          PCRE_CONFIG_STACKRECURSE
 | |
| 
 | |
|        The  output is an integer that is set to one if internal recursion when
 | |
|        running pcre_exec() is implemented by recursive function calls that use
 | |
|        the  stack  to remember their state. This is the usual way that PCRE is
 | |
|        compiled. The output is zero if PCRE was compiled to use blocks of data
 | |
|        on  the  heap  instead  of  recursive  function  calls.  In  this case,
 | |
|        pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
 | |
|        blocks on the heap, thus avoiding the use of the stack.
 | |
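
       The integer reported for PCRE_CONFIG_NEWLINE, described above, can be
       turned into a readable name with a helper such as the one below. The
       function name newline_config_name is an assumption. The value 3338
       for CRLF is simply 13 * 256 + 10, that is, the code for CR followed
       by the code for LF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a PCRE_CONFIG_NEWLINE result to a name; NULL for unknown values. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";      /* 3338 */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return NULL;
    }
}

/* In a program that includes <pcre.h> and is linked with -lpcre:

       int nl;
       if (pcre_config(PCRE_CONFIG_NEWLINE, &nl) == 0)
           puts(newline_config_name(nl));
*/
```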
| 
 | |
| 
 | |
| COMPILING A PATTERN
 | |
| 
 | |
|        pcre *pcre_compile(const char *pattern, int options,
 | |
|             const char **errptr, int *erroffset,
 | |
|             const unsigned char *tableptr);
 | |
| 
 | |
|        pcre *pcre_compile2(const char *pattern, int options,
 | |
|             int *errorcodeptr,
 | |
|             const char **errptr, int *erroffset,
 | |
|             const unsigned char *tableptr);
 | |
| 
 | |
|        Either of the functions pcre_compile() or pcre_compile2() can be called
 | |
|        to compile a pattern into an internal form. The only difference between
 | |
|        the  two interfaces is that pcre_compile2() has an additional argument,
 | |
|        errorcodeptr, via which a numerical error code can be returned.
 | |
| 
 | |
|        The pattern is a C string terminated by a binary zero, and is passed in
 | |
|        the  pattern  argument.  A  pointer to a single block of memory that is
 | |
|        obtained via pcre_malloc is returned. This contains the  compiled  code
 | |
|        and related data. The pcre type is defined for the returned block; this
 | |
|        is a typedef for a structure whose contents are not externally defined.
 | |
|        It is up to the caller to free the memory (via pcre_free) when it is no
 | |
|        longer required.
 | |
| 
 | |
|        Although the compiled code of a PCRE regex is relocatable, that is,  it
 | |
|        does not depend on memory location, the complete pcre data block is not
 | |
|        fully relocatable, because it may contain a copy of the tableptr  argu-
 | |
|        ment, which is an address (see below).
 | |
| 
 | |
|        The options argument contains various bit settings that affect the com-
 | |
|        pilation. It should be zero if no options are required.  The  available
 | |
|        options  are  described  below. Some of them (in particular, those that
 | |
|        are compatible with Perl, but also some others) can  also  be  set  and
 | |
|        unset  from  within  the  pattern  (see the detailed description in the
 | |
|        pcrepattern documentation). For those options that can be different  in
 | |
|        different  parts  of  the pattern, the contents of the options argument
 | |
|        specify their initial settings at the start of compilation and
 | |
|        execution. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set
 | |
|        at the time of matching as well as at compile time.
 | |
| 
 | |
|        If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
 | |
|        if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
 | |
|        sets the variable pointed to by errptr to point to a textual error mes-
 | |
|        sage. This is a static string that is part of the library. You must not
 | |
|        try to free it. The offset from the start of the pattern to the charac-
 | |
|        ter where the error was discovered is placed in the variable pointed to
 | |
|        by erroffset, which must not be NULL. If it is, an immediate  error  is
 | |
|        given.
 | |
| 
 | |
|        If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
 | |
|        codeptr argument is not NULL, a non-zero error code number is  returned
 | |
|        via  this argument in the event of an error. This is in addition to the
 | |
|        textual error message. Error codes and messages are listed below.
 | |
| 
 | |
|        If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
 | |
|        character  tables  that  are  built  when  PCRE  is compiled, using the
 | |
|        default C locale. Otherwise, tableptr must be an address  that  is  the
 | |
|        result  of  a  call to pcre_maketables(). This value is stored with the
 | |
|        compiled pattern, and used again by pcre_exec(), unless  another  table
 | |
|        pointer is passed to it. For more discussion, see the section on locale
 | |
|        support below.
 | |
| 
 | |
|        This code fragment shows a typical straightforward  call  to  pcre_com-
 | |
|        pile():
 | |
| 
 | |
|          pcre *re;
 | |
|          const char *error;
 | |
|          int erroffset;
 | |
|          re = pcre_compile(
 | |
|            "^A.*Z",          /* the pattern */
 | |
|            0,                /* default options */
 | |
|            &error,           /* for error message */
 | |
|            &erroffset,       /* for error offset */
 | |
|            NULL);            /* use default character tables */
 | |
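
       When the call fails, the message and the offset can be combined into
       a diagnostic that points at the offending character. The helper below
       is a sketch: the name format_compile_error and the layout of its
       output are assumptions, and it uses only the values that
       pcre_compile() is documented to return through errptr and erroffset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write a diagnostic into buf: the message with its offset, then the
   pattern, then a caret under the character at erroffset. Returns what
   snprintf() returns, or -1 if an argument is unusable. */
int format_compile_error(char *buf, size_t bufsize, const char *pattern,
                         const char *error, int erroffset)
{
    if (buf == NULL || pattern == NULL || error == NULL || erroffset < 0)
        return -1;
    /* "%*s" with an empty string prints erroffset spaces before the caret. */
    return snprintf(buf, bufsize, "error at offset %d: %s\n  %s\n  %*s^",
                    erroffset, error, pattern, erroffset, "");
}

/* Typical use after a failed compilation:

       if (re == NULL) {
           char msg[256];
           format_compile_error(msg, sizeof msg, pattern, error, erroffset);
           fprintf(stderr, "%s\n", msg);
       }
*/
```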
| 
 | |
|        The  following  names  for option bits are defined in the pcre.h header
 | |
|        file:
 | |
| 
 | |
|          PCRE_ANCHORED
 | |
| 
 | |
|        If this bit is set, the pattern is forced to be "anchored", that is, it
 | |
|        is  constrained to match only at the first matching point in the string
 | |
|        that is being searched (the "subject string"). This effect can also  be
 | |
|        achieved  by appropriate constructs in the pattern itself, which is the
 | |
|        only way to do it in Perl.
 | |
| 
 | |
|          PCRE_AUTO_CALLOUT
 | |
| 
 | |
|        If this bit is set, pcre_compile() automatically inserts callout items,
 | |
|        all  with  number  255, before each pattern item. For discussion of the
 | |
|        callout facility, see the pcrecallout documentation.
 | |
| 
 | |
|          PCRE_BSR_ANYCRLF
 | |
|          PCRE_BSR_UNICODE
 | |
| 
 | |
|        These options (which are mutually exclusive) control what the \R escape
 | |
|        sequence  matches.  The choice is either to match only CR, LF, or CRLF,
 | |
|        or to match any Unicode newline sequence. The default is specified when
 | |
|        PCRE is built. It can be overridden from within the pattern, or by set-
 | |
|        ting an option when a compiled pattern is matched.
 | |
| 
 | |
|          PCRE_CASELESS
 | |
| 
 | |
|        If this bit is set, letters in the pattern match both upper  and  lower
 | |
|        case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
 | |
|        changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
 | |
|        always  understands the concept of case for characters whose values are
 | |
|        less than 128, so caseless matching is always possible. For  characters
 | |
|        with  higher  values,  the concept of case is supported if PCRE is com-
 | |
|        piled with Unicode property support, but not otherwise. If you want  to
 | |
|        use  caseless  matching  for  characters 128 and above, you must ensure
 | |
|        that PCRE is compiled with Unicode property support  as  well  as  with
 | |
|        UTF-8 support.
 | |
| 
 | |
|          PCRE_DOLLAR_ENDONLY
 | |
| 
 | |
|        If  this bit is set, a dollar metacharacter in the pattern matches only
 | |
|        at the end of the subject string. Without this option,  a  dollar  also
 | |
|        matches  immediately before a newline at the end of the string (but not
 | |
|        before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
 | |
|        if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
 | |
|        Perl, and no way to set it within a pattern.
 | |
| 
 | |
|          PCRE_DOTALL
 | |
| 
 | |
|        If this bit is set, a dot metacharacter in the pattern matches all char-
 | |
|        acters,  including  those that indicate newline. Without it, a dot does
 | |
|        not match when the current position is at a  newline.  This  option  is
 | |
|        equivalent  to Perl's /s option, and it can be changed within a pattern
 | |
|        by a (?s) option setting. A negative class such as [^a] always  matches
 | |
|        newline characters, independent of the setting of this option.
 | |
| 
 | |
|          PCRE_DUPNAMES
 | |
| 
 | |
|        If  this  bit is set, names used to identify capturing subpatterns need
 | |
|        not be unique. This can be helpful for certain types of pattern when it
 | |
|        is  known  that  only  one instance of the named subpattern can ever be
 | |
|        matched. There are more details of named subpatterns  below;  see  also
 | |
|        the pcrepattern documentation.
 | |
| 
 | |
|          PCRE_EXTENDED
 | |
| 
 | |
|        If  this  bit  is  set,  whitespace  data characters in the pattern are
 | |
|        totally ignored except when escaped or inside a character class. White-
 | |
|        space does not include the VT character (code 11). In addition, charac-
 | |
|        ters between an unescaped # outside a character class and the next new-
 | |
|        line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
 | |
|        option, and it can be changed within a pattern by a  (?x)  option  set-
 | |
|        ting.
 | |
| 
 | |
|        This  option  makes  it possible to include comments inside complicated
 | |
|        patterns.  Note, however, that this applies only  to  data  characters.
 | |
|        Whitespace   characters  may  never  appear  within  special  character
 | |
|        sequences in a pattern, for  example  within  the  sequence  (?(  which
 | |
|        introduces a conditional subpattern.
 | |
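       The rules for PCRE_EXTENDED can be modelled with a short routine. This
       is a simplified sketch and not PCRE's own code: strip_extended() is a
       hypothetical function that assumes LF is the newline, and it does not
       handle \Q...\E, (?#...) comments, or POSIX class names inside a
       character class.

```c
#include <stddef.h>

/* Simplified model of what PCRE_EXTENDED ignores; not PCRE's own code.
   out must have room for strlen(pattern) + 1 bytes. Returns the length
   of the stripped pattern. */
size_t strip_extended(const char *pattern, char *out)
{
    size_t n = 0;
    int in_class = 0;

    while (*pattern != '\0') {
        char c = *pattern++;

        if (c == '\\' && *pattern != '\0') {
            /* An escaped character is always kept, whitespace included. */
            out[n++] = c;
            out[n++] = *pattern++;
        } else if (in_class) {
            /* Everything inside a character class is data. */
            if (c == ']') in_class = 0;
            out[n++] = c;
        } else if (c == '[') {
            in_class = 1;
            out[n++] = c;
            /* A ] directly after [ or [^ is a literal member of the class. */
            if (*pattern == '^') out[n++] = *pattern++;
            if (*pattern == ']') out[n++] = *pattern++;
        } else if (c == '#') {
            /* Comment: skip up to and including the next newline. */
            while (*pattern != '\0' && *pattern != '\n') pattern++;
            if (*pattern == '\n') pattern++;
        } else if (c == ' ' || c == '\t' || c == '\n' ||
                   c == '\f' || c == '\r') {
            /* Whitespace outside a class is dropped. VT (code 11) is not
               in this list, so it is kept by the final branch. */
        } else {
            out[n++] = c;
        }
    }
    out[n] = '\0';
    return n;
}
```

       Under these assumptions the pattern "a b # note", followed by a newline
       and "[ c]\ d", is reduced to "ab[ c]\ d": the space inside the class
       and the escaped space both survive.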
| 
 | |
|          PCRE_EXTRA
 | |
| 
 | |
|        This  option  was invented in order to turn on additional functionality
 | |
|        of PCRE that is incompatible with Perl, but it  is  currently  of  very
 | |
|        little  use. When set, any backslash in a pattern that is followed by a
 | |
|        letter that has no special meaning  causes  an  error,  thus  reserving
 | |
|        these  combinations  for  future  expansion.  By default, as in Perl, a
 | |
|        backslash followed by a letter with no special meaning is treated as  a
 | |
|        literal.  (Perl can, however, be persuaded to give a warning for this.)
 | |
|        There are at present no other features controlled by  this  option.  It
 | |
|        can also be set by a (?X) option setting within a pattern.
 | |
| 
 | |
|          PCRE_FIRSTLINE
 | |
| 
 | |
|        If  this  option  is  set,  an  unanchored pattern is required to match
 | |
|        before or at the first  newline  in  the  subject  string,  though  the
 | |
|        matched text may continue over the newline.
 | |
| 
 | |
|          PCRE_JAVASCRIPT_COMPAT
 | |
| 
 | |
|        If this option is set, PCRE's behaviour is changed in some ways so that
 | |
|        it is compatible with JavaScript rather than Perl. The changes  are  as
 | |
|        follows:
 | |
| 
 | |
|        (1)  A  lone  closing square bracket in a pattern causes a compile-time
 | |
|        error, because this is illegal in JavaScript (by default it is  treated
 | |
|        as a data character). Thus, the pattern AB]CD becomes illegal when this
 | |
|        option is set.
 | |
| 
 | |
|        (2) At run time, a back reference to an unset subpattern group  matches
 | |
|        an  empty  string (by default this causes the current matching alterna-
 | |
|        tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
 | |
|        set  (assuming  it can find an "a" in the subject), whereas it fails by
 | |
|        default, for Perl compatibility.
 | |
| 
 | |
|          PCRE_MULTILINE
 | |
| 
 | |
|        By default, PCRE treats the subject string as consisting  of  a  single
 | |
|        line  of characters (even if it actually contains newlines). The "start
 | |
|        of line" metacharacter (^) matches only at the  start  of  the  string,
 | |
|        while  the  "end  of line" metacharacter ($) matches only at the end of
 | |
|        the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
 | |
|        is set). This is the same as Perl.
 | |
| 
 | |
|        When  PCRE_MULTILINE  is  set,  the "start of line" and "end of line"
 | |
|        constructs match immediately following or immediately  before  internal
 | |
|        newlines  in  the  subject string, respectively, as well as at the very
 | |
|        start and end. This is equivalent to Perl's /m option, and  it  can  be
 | |
|        changed within a pattern by a (?m) option setting. If there are no new-
 | |
|        lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
 | |
|        setting PCRE_MULTILINE has no effect.
 | |
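       The positions at which ^ and $ may match, as described for
       PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY, can be written down as two
       small predicates. This is a model only, assuming that LF is the newline
       and that no other option is in play; caret_matches_at() and
       dollar_matches_at() are hypothetical names, not PCRE functions.

```c
#include <stddef.h>

/* Model only: assumes LF is the newline and ignores every other PCRE rule. */

/* Can "^" match at byte offset pos of a subject of length len? */
int caret_matches_at(const char *subject, size_t len, size_t pos,
                     int multiline)
{
    if (pos == 0) return 1;                     /* very start */
    /* After an internal newline, that is, one that does not end the
       subject. */
    return multiline && pos < len && subject[pos - 1] == '\n';
}

/* Can "$" match at byte offset pos of a subject of length len? */
int dollar_matches_at(const char *subject, size_t len, size_t pos,
                      int multiline, int dollar_endonly)
{
    if (pos == len) return 1;                   /* very end */
    if (pos > len || subject[pos] != '\n') return 0;
    if (multiline) return 1;                    /* before any newline */
    /* Before a terminating newline, unless PCRE_DOLLAR_ENDONLY is set. */
    return !dollar_endonly && pos == len - 1;
}
```

       The dollar_endonly argument is consulted only when multiline is zero,
       which mirrors the rule that PCRE_DOLLAR_ENDONLY is ignored if
       PCRE_MULTILINE is set.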
| 
 | |
|          PCRE_NEWLINE_CR
 | |
|          PCRE_NEWLINE_LF
 | |
|          PCRE_NEWLINE_CRLF
 | |
|          PCRE_NEWLINE_ANYCRLF
 | |
|          PCRE_NEWLINE_ANY
 | |
| 
 | |
|        These  options  override the default newline definition that was chosen
 | |
|        when PCRE was built. Setting the first or the second specifies  that  a
 | |
|        newline  is  indicated  by a single character (CR or LF, respectively).
 | |
|        Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
 | |
|        two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
 | |
|        that any of the three preceding sequences should be recognized. Setting
 | |
|        PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
 | |
|        recognized. The Unicode newline sequences are the three just mentioned,
 | |
|        plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
 | |
|        U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
 | |
|        (paragraph  separator,  U+2029).  The  last  two are recognized only in
 | |
|        UTF-8 mode.
 | |
| 
 | |
|        The newline setting in the  options  word  uses  three  bits  that  are
 | |
|        treated as a number, giving eight possibilities. Currently only six are
 | |
|        used (default plus the five values above). This means that if  you  set
 | |
|        more  than one newline option, the combination may or may not be sensi-
 | |
|        ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
 | |
|        PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
 | |
|        cause an error.
 | |
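       The arithmetic behind the three newline bits can be seen in a short
       sketch. The NL_ constants below are stand-ins whose values are believed
       to match the PCRE_NEWLINE_ definitions in pcre.h for this release; treat
       them as an assumption and check your own header. newline_setting() is a
       hypothetical function, not part of PCRE.

```c
#include <stddef.h>

/* Stand-ins for the PCRE_NEWLINE_xxx options. The values are believed to
   match pcre.h; they are spelled out here only so that the sketch compiles
   without the PCRE headers. */
#define NL_CR       0x00100000UL
#define NL_LF       0x00200000UL
#define NL_CRLF     0x00300000UL
#define NL_ANY      0x00400000UL
#define NL_ANYCRLF  0x00500000UL
#define NL_MASK     0x00700000UL   /* the three newline bits */

/* Returns a name for the newline setting in an options word, or NULL if
   the three bits form one of the unused numbers. */
const char *newline_setting(unsigned long options)
{
    switch (options & NL_MASK) {
    case 0:          return "default";
    case NL_CR:      return "CR";
    case NL_LF:      return "LF";
    case NL_CRLF:    return "CRLF";
    case NL_ANY:     return "ANY";
    case NL_ANYCRLF: return "ANYCRLF";
    default:         return NULL;   /* numbers 6 and 7 are unused */
    }
}
```

       With these values, NL_CR | NL_LF is the same number as NL_CRLF, which
       is the equivalence described above, whereas NL_LF | NL_ANY gives the
       unused number 6.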
| 
 | |
|        The only time that a line break is specially recognized when  compiling
 | |
|        a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
 | |
|        character class is encountered. This indicates  a  comment  that  lasts
 | |
|        until  after the next line break sequence. In other circumstances, line
 | |
|        break  sequences  are  treated  as  literal  data,   except   that   in
 | |
|        PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
 | |
|        and are therefore ignored.
 | |
| 
 | |
|        The newline option that is set at compile time becomes the default that
 | |
|        is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
 | |
| 
 | |
|          PCRE_NO_AUTO_CAPTURE
 | |
| 
 | |
|        If this option is set, it disables the use of numbered capturing paren-
 | |
|        theses in the pattern. Any opening parenthesis that is not followed  by
 | |
|        ?  behaves as if it were followed by ?: but named parentheses can still
 | |
|        be used for capturing (and they acquire  numbers  in  the  usual  way).
 | |
|        There is no equivalent of this option in Perl.
 | |
| 
 | |
|          PCRE_UNGREEDY
 | |
| 
 | |
|        This  option  inverts  the "greediness" of the quantifiers so that they
 | |
|        are not greedy by default, but become greedy if followed by "?". It  is
 | |
|        not  compatible  with Perl. It can also be set by a (?U) option setting
 | |
|        within the pattern.
 | |
| 
 | |
|          PCRE_UTF8
 | |
| 
 | |
|        This option causes PCRE to regard both the pattern and the  subject  as
 | |
|        strings  of  UTF-8 characters instead of single-byte character strings.
 | |
|        However, it is available only when PCRE is built to include UTF-8  sup-
 | |
|        port.  If not, the use of this option provokes an error. Details of how
 | |
|        this option changes the behaviour of PCRE are given in the  section  on
 | |
|        UTF-8 support in the main pcre page.
 | |
| 
 | |
|          PCRE_NO_UTF8_CHECK
 | |
| 
 | |
|        When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
 | |
|        automatically checked. There is a  discussion  about  the  validity  of
 | |
|        UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
 | |
|        bytes is found, pcre_compile() returns an error. If  you  already  know
 | |
|        that your pattern is valid, and you want to skip this check for perfor-
 | |
|        mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
 | |
|        set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
 | |
|        undefined. It may cause your program to crash. Note  that  this  option
 | |
|        can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
 | |
|        UTF-8 validity checking of subject strings.
 | |
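       An application that sets PCRE_NO_UTF8_CHECK needs some other assurance
       that its strings are valid. One way to get it is to validate a string
       once, when it enters the program. The following is a minimal sketch of
       such a check using the strict RFC 3629 rules. It is not PCRE's own
       routine, and PCRE's check may accept or reject different inputs;
       utf8_is_valid() is a hypothetical name.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8 by the RFC 3629
   rules, otherwise 0. Not PCRE's own validity check. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }            /* ASCII */
        if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                  /* stray continuation or bad lead byte */

        if (len - i <= extra) return 0;             /* truncated sequence */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;   /* not a continuation */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
        i += extra + 1;
    }
    return 1;
}
```

       The length is passed explicitly so that the same function can be used
       for subject strings that contain binary zeros.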
| 
 | |
| 
 | |
| COMPILATION ERROR CODES
 | |
| 
 | |
|        The following table lists the error  codes  that  may  be  returned  by
 | |
|        pcre_compile2(),  along with the error messages that may be returned by
 | |
|        both compiling functions. As PCRE has developed, some error codes  have
 | |
|        fallen out of use. To avoid confusion, they have not been re-used.
 | |
| 
 | |
|           0  no error
 | |
|           1  \ at end of pattern
 | |
|           2  \c at end of pattern
 | |
|           3  unrecognized character follows \
 | |
|           4  numbers out of order in {} quantifier
 | |
|           5  number too big in {} quantifier
 | |
|           6  missing terminating ] for character class
 | |
|           7  invalid escape sequence in character class
 | |
|           8  range out of order in character class
 | |
|           9  nothing to repeat
 | |
|          10  [this code is not in use]
 | |
|          11  internal error: unexpected repeat
 | |
|          12  unrecognized character after (? or (?-
 | |
|          13  POSIX named classes are supported only within a class
 | |
|          14  missing )
 | |
|          15  reference to non-existent subpattern
 | |
|          16  erroffset passed as NULL
 | |
|          17  unknown option bit(s) set
 | |
|          18  missing ) after comment
 | |
|          19  [this code is not in use]
 | |
|          20  regular expression is too large
 | |
|          21  failed to get memory
 | |
|          22  unmatched parentheses
 | |
|          23  internal error: code overflow
 | |
|          24  unrecognized character after (?<
 | |
|          25  lookbehind assertion is not fixed length
 | |
|          26  malformed number or name after (?(
 | |
|          27  conditional group contains more than two branches
 | |
|          28  assertion expected after (?(
 | |
|          29  (?R or (?[+-]digits must be followed by )
 | |
|          30  unknown POSIX class name
 | |
|          31  POSIX collating elements are not supported
 | |
|          32  this version of PCRE is not compiled with PCRE_UTF8 support
 | |
|          33  [this code is not in use]
 | |
|          34  character value in \x{...} sequence is too large
 | |
|          35  invalid condition (?(0)
 | |
|          36  \C not allowed in lookbehind assertion
 | |
|          37  PCRE does not support \L, \l, \N, \U, or \u
 | |
|          38  number after (?C is > 255
 | |
|          39  closing ) for (?C expected
 | |
|          40  recursive call could loop indefinitely
 | |
|          41  unrecognized character after (?P
 | |
|          42  syntax error in subpattern name (missing terminator)
 | |
|          43  two named subpatterns have the same name
 | |
|          44  invalid UTF-8 string
 | |
|          45  support for \P, \p, and \X has not been compiled
 | |
|          46  malformed \P or \p sequence
 | |
|          47  unknown property name after \P or \p
 | |
|          48  subpattern name is too long (maximum 32 characters)
 | |
|          49  too many named subpatterns (maximum 10000)
 | |
|          50  [this code is not in use]
 | |
|          51  octal value is greater than \377 (not in UTF-8 mode)
 | |
|          52  internal error: overran compiling workspace
 | |
|          53  internal error: previously-checked referenced subpattern not
 | |
|                found
 | |
|          54  DEFINE group contains more than one branch
 | |
|          55  repeating a DEFINE group is not allowed
 | |
|          56  inconsistent NEWLINE options
 | |
|          57  \g is not followed by a braced, angle-bracketed, or quoted
 | |
|                name/number or by a plain number
 | |
|          58  a numbered reference must not be zero
 | |
|          59  (*VERB) with an argument is not supported
 | |
|          60  (*VERB) not recognized
 | |
|          61  number is too big
 | |
|          62  subpattern name expected
 | |
|          63  digit expected after (?+
 | |
|          64  ] is an invalid data character in JavaScript compatibility mode
 | |
| 
 | |
|        The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
 | |
|        values may be used if the limits were changed when PCRE was built.
 | |
| 
 | |
| 
 | |
| STUDYING A PATTERN
 | |
| 
 | |
|        pcre_extra *pcre_study(const pcre *code, int options,
 | |
|             const char **errptr);
 | |
| 
 | |
|        If  a  compiled  pattern is going to be used several times, it is worth
 | |
|        spending more time analyzing it in order to speed up the time taken for
 | |
|        matching.  The function pcre_study() takes a pointer to a compiled pat-
 | |
|        tern as its first argument. If studying the pattern produces additional
 | |
|        information  that  will  help speed up matching, pcre_study() returns a
 | |
|        pointer to a pcre_extra block, in which the study_data field points  to
 | |
|        the results of the study.
 | |
| 
 | |
|        The  returned  value  from  pcre_study()  can  be  passed  directly  to
 | |
|        pcre_exec(). However, a pcre_extra block  also  contains  other  fields
 | |
|        that  can  be  set  by the caller before the block is passed; these are
 | |
|        described below in the section on matching a pattern.
 | |
| 
 | |
|        If studying the pattern does not produce any  additional  information,
 | |
|        pcre_study() returns NULL. In that circumstance, if the calling program
 | |
|        wants to pass any of the other fields to pcre_exec(), it  must  set  up
 | |
|        its own pcre_extra block.
 | |
| 
 | |
|        The  second  argument of pcre_study() contains option bits. At present,
 | |
|        no options are defined, and this argument should always be zero.
 | |
| 
 | |
|        The third argument for pcre_study() is a pointer for an error  message.
 | |
|        If  studying  succeeds  (even  if no data is returned), the variable it
 | |
|        points to is set to NULL. Otherwise it is set to  point  to  a  textual
 | |
|        error message. This is a static string that is part of the library. You
 | |
|        must not try to free it. You should test the  error  pointer  for  NULL
 | |
|        after calling pcre_study(), to be sure that it has run successfully.
 | |
| 
 | |
|        This is a typical call to pcre_study():
 | |
| 
 | |
|          pcre_extra *pe;
 | |
|          pe = pcre_study(
 | |
|            re,             /* result of pcre_compile() */
 | |
|            0,              /* no options exist */
 | |
|            &error);        /* set to NULL or points to a message */
 | |
| 
 | |
|        At present, studying a pattern is useful only for non-anchored patterns
 | |
|        that do not have a single fixed starting character. A bitmap of  possi-
 | |
|        ble starting bytes is created.
 | |
| 
 | |
| 
 | |
| LOCALE SUPPORT
 | |
| 
 | |
|        PCRE  handles  caseless matching, and determines whether characters are
 | |
|        letters, digits, or whatever, by reference to a set of tables,  indexed
 | |
|        by  character  value.  When running in UTF-8 mode, this applies only to
 | |
|        characters with codes less than 128. Higher-valued  codes  never  match
 | |
|        escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
 | |
|        with Unicode character property support. The use of locales  with  Uni-
 | |
|        code  is  discouraged.  If you are handling characters with codes of 128
 | |
|        or greater, you should either use UTF-8 and Unicode, or use locales, but
 | |
|        not try to mix the two.
 | |
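       The idea of a table indexed by character value can be illustrated with
       the standard <ctype.h> functions. This sketch builds a single case-flip
       table for the current locale and uses it for a caseless byte
       comparison. It is an illustration only: PCRE's real tables hold more
       than case information, their layout is internal to the library, and
       both function names here are hypothetical.

```c
#include <ctype.h>

/* Fills flip[c] with the opposite-case form of byte c in the current
   locale, or with c itself if c has no case. */
void build_case_flip_table(unsigned char flip[256])
{
    int c;

    for (c = 0; c < 256; c++) {
        if (islower(c))      flip[c] = (unsigned char)toupper(c);
        else if (isupper(c)) flip[c] = (unsigned char)tolower(c);
        else                 flip[c] = (unsigned char)c;
    }
}

/* Caseless comparison of two bytes by table lookup. */
int bytes_equal_caseless(const unsigned char flip[256],
                         unsigned char a, unsigned char b)
{
    return a == b || flip[a] == b;
}
```

       Because the table is filled from whatever locale is current at the
       time, calling setlocale() before build_case_flip_table() changes which
       bytes above 127 are treated as letters.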
| 
 | |
|        PCRE  contains  an  internal set of tables that are used when the final
 | |
|        argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
 | |
|        applications.  Normally, the internal tables recognize only ASCII char-
 | |
|        acters. However, when PCRE is built, it is possible to cause the inter-
 | |
|        nal tables to be rebuilt in the default "C" locale of the local system,
 | |
|        which may cause them to be different.
 | |
| 
 | |
|        The internal tables can always be overridden by tables supplied by  the
 | |
|        application that calls PCRE. These may be created in a different locale
 | |
|        from the default. As more and more applications change  to  using  Uni-
 | |
|        code, the need for this locale support is expected to die away.
 | |
| 
 | |
|        External  tables  are  built by calling the pcre_maketables() function,
 | |
|        which has no arguments, in the relevant locale. The result can then  be
 | |
|        passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
 | |
|        example, to build and use tables that are appropriate  for  the  French
 | |
|        locale  (where  accented  characters  with  values greater than 128 are
 | |
|        treated as letters), the following code could be used:
 | |
| 
 | |
|          setlocale(LC_CTYPE, "fr_FR");
 | |
|          tables = pcre_maketables();
 | |
|          re = pcre_compile(..., tables);
 | |
| 
 | |
|        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
 | |
|        if you are using Windows, the name for the French locale is "french".
 | |
| 
 | |
|        When  pcre_maketables()  runs,  the  tables are built in memory that is
 | |
|        obtained via pcre_malloc. It is the caller's responsibility  to  ensure
 | |
|        that  the memory containing the tables remains available for as long as
 | |
|        it is needed.
 | |
| 
 | |
|        The pointer that is passed to pcre_compile() is saved with the compiled
 | |
|        pattern,  and the same tables are used via this pointer by pcre_study()
 | |
|        and normally also by pcre_exec(). Thus, by default, for any single pat-
 | |
|        tern, compilation, studying and matching all happen in the same locale,
 | |
|        but different patterns can be compiled in different locales.
 | |
| 
 | |
|        It is possible to pass a table pointer or NULL (indicating the  use  of
 | |
|        the  internal  tables)  to  pcre_exec(). Although not intended for this
 | |
|        purpose, this facility could be used to match a pattern in a  different
 | |
|        locale from the one in which it was compiled. Passing table pointers at
 | |
|        run time is discussed below in the section on matching a pattern.
 | |
| 
 | |
| 
 | |
| INFORMATION ABOUT A PATTERN
 | |
| 
 | |
|        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
 | |
|             int what, void *where);
 | |
| 
 | |
|        The pcre_fullinfo() function returns information about a compiled  pat-
 | |
|        tern. It replaces the obsolete pcre_info() function, which is neverthe-
 | |
|        less retained for backwards compatibility (and is documented below).
 | |
| 
 | |
|        The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
 | |
|        pattern.  The second argument is the result of pcre_study(), or NULL if
 | |
|        the pattern was not studied. The third argument specifies  which  piece
 | |
|        of  information  is required, and the fourth argument is a pointer to a
 | |
|        variable to receive the data. The yield of the  function  is  zero  for
 | |
|        success, or one of the following negative numbers:
 | |
| 
 | |
|          PCRE_ERROR_NULL       the argument code was NULL
 | |
|                                the argument where was NULL
 | |
|          PCRE_ERROR_BADMAGIC   the "magic number" was not found
 | |
|          PCRE_ERROR_BADOPTION  the value of what was invalid
 | |
| 
 | |
|        The  "magic  number" is placed at the start of each compiled pattern as
 | |
|        a simple check against passing an arbitrary memory pointer. Here is  a
 | |
|        typical  call  of pcre_fullinfo(), to obtain the length of the compiled
 | |
|        pattern:
 | |
| 
 | |
|          int rc;
 | |
|          size_t length;
 | |
|          rc = pcre_fullinfo(
 | |
|            re,               /* result of pcre_compile() */
 | |
|            pe,               /* result of pcre_study(), or NULL */
 | |
|            PCRE_INFO_SIZE,   /* what is required */
 | |
|            &length);         /* where to put the data */
 | |
| 
 | |
|        The possible values for the third argument are defined in  pcre.h,  and
 | |
|        are as follows:
 | |
| 
 | |
|          PCRE_INFO_BACKREFMAX
 | |
| 
 | |
|        Return  the  number  of  the highest back reference in the pattern. The
 | |
|        fourth argument should point to an int variable. Zero  is  returned  if
 | |
|        there are no back references.
 | |
| 
 | |
|          PCRE_INFO_CAPTURECOUNT
 | |
| 
 | |
|        Return  the  number of capturing subpatterns in the pattern. The fourth
 | |
|        argument should point to an int variable.
 | |
| 
 | |
|          PCRE_INFO_DEFAULT_TABLES
 | |
| 
 | |
|        Return a pointer to the internal default character tables within  PCRE.
 | |
|        The  fourth  argument should point to an unsigned char * variable. This
 | |
|        information call is provided for internal use by the pcre_study() func-
 | |
|        tion.  External  callers  can  cause PCRE to use its internal tables by
 | |
|        passing a NULL table pointer.
 | |
| 
 | |
|          PCRE_INFO_FIRSTBYTE
 | |
| 
 | |
|        Return information about the first byte of any matched  string,  for  a
 | |
|        non-anchored  pattern. The fourth argument should point to an int vari-
 | |
|        able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
 | |
|        is still recognized for backwards compatibility.)
 | |
| 
 | |
|        If  there  is  a  fixed first byte, for example, from a pattern such as
 | |
|        (cat|cow|coyote), its value is returned. Otherwise, if either
 | |
| 
 | |
|        (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
 | |
|        branch starts with "^", or
 | |
| 
 | |
|        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
 | |
|        set (if it were set, the pattern would be anchored),
 | |
| 
 | |
|        -1 is returned, indicating that the pattern matches only at  the  start
 | |
|        of  a  subject string or after any newline within the string. Otherwise
 | |
|        -2 is returned. For anchored patterns, -2 is returned.
 | |
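       The three kinds of result for PCRE_INFO_FIRSTBYTE can be told apart
       with a small classifier. This is a sketch: the enum and the function
       classify_first_byte() are hypothetical names, and value stands for the
       int that pcre_fullinfo() stored through its fourth argument.

```c
/* Hypothetical names; value is the int stored for PCRE_INFO_FIRSTBYTE. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* a fixed first byte was recorded */
    FIRST_BYTE_LINE_START,  /* -1: matches only at the start of a line */
    FIRST_BYTE_NONE         /* -2: nothing useful is known */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)  return FIRST_BYTE_FIXED;
    if (value == -1) return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

       An anchored pattern falls into the last group, since -2 is returned
       for it.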
| 
 | |
|          PCRE_INFO_FIRSTTABLE
 | |
| 
 | |
|        If the pattern was studied, and this resulted in the construction of  a
 | |
|        256-bit table indicating a fixed set of bytes for the first byte in any
 | |
|        matching string, a pointer to the table is returned. Otherwise NULL  is
 | |
|        returned.  The fourth argument should point to an unsigned char * vari-
 | |
|        able.
 | |
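       A 256-bit table occupies 32 bytes. The sketch below tests one bit of
       it. The bit layout is an assumption that is not stated in this
       document: byte value c is taken to be represented by bit (c % 8),
       counting from the least significant bit, of table[c / 8].
       first_byte_allowed() is a hypothetical name.

```c
#include <stddef.h>

/* Assumed layout: byte value c is represented by bit (c % 8), counting
   from the least significant bit, of table[c / 8]. table is the pointer
   stored for PCRE_INFO_FIRSTTABLE and may be NULL. */
int first_byte_allowed(const unsigned char *table, unsigned char c)
{
    if (table == NULL) return 1;   /* no table: any byte may start a match */
    return (table[c / 8] >> (c % 8)) & 1;
}
```

       A caller could use such a test to skip subject positions that cannot
       start a match, though pcre_exec() already makes use of the study data
       when it is given the pcre_extra block.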
| 
 | |
|          PCRE_INFO_HASCRORLF
 | |
| 
 | |
|        Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
 | |
|        characters,  otherwise  0.  The  fourth argument should point to an int
 | |
|        variable. An explicit match is either a literal CR or LF character,  or
 | |
|        \r or \n.
 | |
| 
 | |
|          PCRE_INFO_JCHANGED
 | |
| 
 | |
|        Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
 | |
|        otherwise 0. The fourth argument should point to an int variable.  (?J)
 | |
|        and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
 | |
| 
 | |
|          PCRE_INFO_LASTLITERAL
 | |
| 
 | |
|        Return  the  value of the rightmost literal byte that must exist in any
 | |
|        matched string, other than at its  start,  if  such  a  byte  has  been
 | |
|        recorded. The fourth argument should point to an int variable. If there
 | |
|        is no such byte, -1 is returned. For anchored patterns, a last  literal
 | |
|        byte  is  recorded only if it follows something of variable length. For
 | |
|        example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
 | |
|        /^a\dz\d/ the returned value is -1.
 | |
| 
 | |
|          PCRE_INFO_NAMECOUNT
 | |
|          PCRE_INFO_NAMEENTRYSIZE
 | |
|          PCRE_INFO_NAMETABLE
 | |
| 
 | |
|        PCRE  supports the use of named as well as numbered capturing parenthe-
 | |
|        ses. The names are just an additional way of identifying the  parenthe-
 | |
|        ses, which still acquire numbers. Several convenience functions such as
 | |
|        pcre_get_named_substring() are provided for  extracting  captured  sub-
 | |
|        strings  by  name. It is also possible to extract the data directly, by
 | |
|        first converting the name to a number in order to  access  the  correct
 | |
|        pointers in the output vector (described with pcre_exec() below). To do
 | |
|        the conversion, you need  to  use  the  name-to-number  map,  which  is
 | |
|        described by these three values.
 | |
| 
 | |
|        The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
 | |
|        gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
 | |
|        of  each  entry;  both  of  these  return  an int value. The entry size
 | |
|        depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
 | |
|        a  pointer  to  the  first  entry of the table (a pointer to char). The
 | |
|        first two bytes of each entry are the number of the capturing parenthe-
 | |
|        sis,  most  significant byte first. The rest of the entry is the corre-
 | |
|        sponding name, zero terminated. The names are  in  alphabetical  order.
 | |
|        When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
 | |
|        theses numbers. For example, consider  the  following  pattern  (assume
 | |
|        PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
 | |
|        ignored):
 | |
| 
 | |
|          (?<date> (?<year>(\d\d)?\d\d) -
 | |
|          (?<month>\d\d) - (?<day>\d\d) )
 | |
| 
 | |
|        There are four named subpatterns, so the table has  four  entries,  and
 | |
|        each  entry  in the table is eight bytes long. The table is as follows,
 | |
|        with non-printing bytes shown in hexadecimal, and undefined bytes shown
 | |
|        as ??:
 | |
| 
 | |
|          00 01 d  a  t  e  00 ??
 | |
|          00 05 d  a  y  00 ?? ??
 | |
|          00 04 m  o  n  t  h  00
 | |
|          00 02 y  e  a  r  00 ??
 | |
| 
 | |
|        When  writing  code  to  extract  data from named subpatterns using the
 | |
|        name-to-number map, remember that the length of the entries  is  likely
 | |
|        to be different for each compiled pattern.
 | |
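       A lookup in the name-to-number map can be sketched as follows.
       lookup_group_number() is a hypothetical function; its table,
       entry_size and count arguments stand for the values obtained with
       PCRE_INFO_NAMETABLE, PCRE_INFO_NAMEENTRYSIZE and PCRE_INFO_NAMECOUNT.
       The example_table array reproduces the four entries shown above, with
       the undefined bytes set to zero.

```c
#include <stddef.h>
#include <string.h>

/* Returns the number of the capturing parentheses called name, or -1 if
   the name is not in the table. With duplicate names, the first matching
   entry wins. */
int lookup_group_number(const char *table, int entry_size, int count,
                        const char *name)
{
    int i;

    for (i = 0; i < count; i++) {
        /* Step by the entry size reported for this particular pattern. */
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;

        /* Bytes 0 and 1 hold the number, most significant byte first;
           the zero-terminated name starts at byte 2. */
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four-entry table from the text; undefined bytes are zero here. */
static const char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

       A linear scan keeps the sketch short. Since the names are in
       alphabetical order, a binary search is also possible for large tables.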
| 
 | |
|          PCRE_INFO_OKPARTIAL
 | |
| 
 | |
|        Return  1 if the pattern can be used for partial matching, otherwise 0.
 | |
|        The fourth argument should point to an int  variable.  The  pcrepartial
 | |
|        documentation  lists  the restrictions that apply to patterns when par-
 | |
|        tial matching is used.
 | |
| 
 | |
|          PCRE_INFO_OPTIONS
 | |
| 
 | |
|        Return a copy of the options with which the pattern was  compiled.  The
 | |
|        fourth  argument  should  point to an unsigned long int variable. These
 | |
|        option bits are those specified in the call to pcre_compile(), modified
 | |
|        by any top-level option settings at the start of the pattern itself. In
 | |
|        other words, they are the options that will be in force  when  matching
 | |
|        starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
 | |
|        the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
 | |
|        and PCRE_EXTENDED.
 | |
| 
 | |
|        A  pattern  is  automatically  anchored by PCRE if all of its top-level
 | |
|        alternatives begin with one of the following:
 | |
| 
 | |
|          ^     unless PCRE_MULTILINE is set
 | |
|          \A    always
 | |
|          \G    always
 | |
|          .*    if PCRE_DOTALL is set and there are no back
 | |
|                  references to the subpattern in which .* appears
 | |
| 
 | |
|        For such patterns, the PCRE_ANCHORED bit is set in the options returned
 | |
|        by pcre_fullinfo().
 | |
| 
 | |
|          PCRE_INFO_SIZE
 | |
| 
 | |
|        Return  the  size  of the compiled pattern, that is, the value that was
 | |
|        passed as the argument to pcre_malloc() when PCRE was getting memory in
 | |
|        which to place the compiled data. The fourth argument should point to a
 | |
|        size_t variable.
 | |
| 
 | |
|          PCRE_INFO_STUDYSIZE
 | |
| 
 | |
|        Return the size of the data block pointed to by the study_data field in
 | |
|        a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
 | |
|        pcre_malloc() when PCRE was getting memory into which to place the data
 | |
|        created  by  pcre_study(). The fourth argument should point to a size_t
 | |
|        variable.
 | |
| 
 | |
| 
 | |
| OBSOLETE INFO FUNCTION
 | |
| 
 | |
|        int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
 | |
| 
 | |
|        The pcre_info() function is now obsolete because its interface  is  too
 | |
|        restrictive  to return all the available data about a compiled pattern.
 | |
|        New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
 | |
|        pcre_info()  is the number of capturing subpatterns, or one of the fol-
 | |
|        lowing negative numbers:
 | |
| 
 | |
|          PCRE_ERROR_NULL       the argument code was NULL
 | |
|          PCRE_ERROR_BADMAGIC   the "magic number" was not found
 | |
| 
 | |
|        If the optptr argument is not NULL, a copy of the  options  with  which
 | |
|        the  pattern  was  compiled  is placed in the integer it points to (see
 | |
|        PCRE_INFO_OPTIONS above).
 | |
| 
 | |
|        If the pattern is not anchored and the  firstcharptr  argument  is  not
 | |
|        NULL,  it is used to pass back information about the first character of
 | |
|        any matched string (see PCRE_INFO_FIRSTBYTE above).
 | |
| 
 | |
| 
 | |
| REFERENCE COUNTS
 | |
| 
 | |
|        int pcre_refcount(pcre *code, int adjust);
 | |
| 
 | |
|        The pcre_refcount() function is used to maintain a reference  count  in
 | |
|        the data block that contains a compiled pattern. It is provided for the
 | |
|        benefit of applications that  operate  in  an  object-oriented  manner,
 | |
|        where different parts of the application may be using the same compiled
 | |
|        pattern, but you want to free the block when they are all done.
 | |
| 
 | |
|        When a pattern is compiled, the reference count field is initialized to
 | |
|        zero.   It is changed only by calling this function, whose action is to
 | |
|        add the adjust value (which may be positive or  negative)  to  it.  The
 | |
|        yield of the function is the new value. However, the value of the count
 | |
|        is constrained to lie between 0 and 65535, inclusive. If the new  value
 | |
|        is outside these limits, it is forced to the appropriate limit value.
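
       The adjustment rule can be modelled in a few lines. This is only a
       sketch of the arithmetic described above, not code from the library;
       the name adjusted_refcount is hypothetical.

```c
/* Model of the rule described above: add adjust to the current count, then
   force the result into the range 0..65535. */
static int adjusted_refcount(int current, int adjust)
{
    long long value = (long long)current + (long long)adjust;
    if (value < 0) value = 0;
    if (value > 65535) value = 65535;
    return (int)value;
}
```

       In particular, decrementing a count that is already zero leaves it at
       zero, so the yield alone cannot distinguish "just released" from
       "released once too often".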
 | |
| 
 | |
|        Except  when it is zero, the reference count is not correctly preserved
 | |
|        if a pattern is compiled on one host and then  transferred  to  a  host
 | |
|        whose byte-order is different. (This seems a highly unlikely scenario.)
 | |
| 
 | |
| 
 | |
| MATCHING A PATTERN: THE TRADITIONAL FUNCTION
 | |
| 
 | |
|        int pcre_exec(const pcre *code, const pcre_extra *extra,
 | |
|             const char *subject, int length, int startoffset,
 | |
|             int options, int *ovector, int ovecsize);
 | |
| 
 | |
|        The  function pcre_exec() is called to match a subject string against a
 | |
|        compiled pattern, which is passed in the code argument. If the  pattern
 | |
|        has been studied, the result of the study should be passed in the extra
 | |
|        argument. This function is the main matching facility of  the  library,
 | |
|        and it operates in a Perl-like manner. For specialist use there is also
 | |
|        an alternative matching function, which is described below in the  sec-
 | |
|        tion about the pcre_dfa_exec() function.
 | |
| 
 | |
|        In  most applications, the pattern will have been compiled (and option-
 | |
|        ally studied) in the same process that calls pcre_exec().  However,  it
 | |
|        is possible to save compiled patterns and study data, and then use them
 | |
|        later in different processes, possibly even on different hosts.  For  a
 | |
|        discussion about this, see the pcreprecompile documentation.
 | |
| 
 | |
|        Here is an example of a simple call to pcre_exec():
 | |
| 
 | |
|          int rc;
 | |
|          int ovector[30];
 | |
|          rc = pcre_exec(
 | |
|            re,             /* result of pcre_compile() */
 | |
|            NULL,           /* we didn't study the pattern */
 | |
|            "some string",  /* the subject string */
 | |
|            11,             /* the length of the subject string */
 | |
|            0,              /* start at offset 0 in the subject */
 | |
|            0,              /* default options */
 | |
|            ovector,        /* vector of integers for substring information */
 | |
|            30);            /* number of elements (NOT size in bytes) */
 | |
| 
 | |
|    Extra data for pcre_exec()
 | |
| 
 | |
|        If  the  extra argument is not NULL, it must point to a pcre_extra data
 | |
|        block. The pcre_study() function returns such a block (when it  doesn't
 | |
|        return  NULL), but you can also create one for yourself, and pass addi-
 | |
|        tional information in it. The pcre_extra block contains  the  following
 | |
|        fields (not necessarily in this order):
 | |
| 
 | |
|          unsigned long int flags;
 | |
|          void *study_data;
 | |
|          unsigned long int match_limit;
 | |
|          unsigned long int match_limit_recursion;
 | |
|          void *callout_data;
 | |
|          const unsigned char *tables;
 | |
| 
 | |
|        The  flags  field  is a bitmap that specifies which of the other fields
 | |
|        are set. The flag bits are:
 | |
| 
 | |
|          PCRE_EXTRA_STUDY_DATA
 | |
|          PCRE_EXTRA_MATCH_LIMIT
 | |
|          PCRE_EXTRA_MATCH_LIMIT_RECURSION
 | |
|          PCRE_EXTRA_CALLOUT_DATA
 | |
|          PCRE_EXTRA_TABLES
 | |
| 
 | |
|        Other flag bits should be set to zero. The study_data field is  set  in
 | |
|        the  pcre_extra  block  that is returned by pcre_study(), together with
 | |
|        the appropriate flag bit. You should not set this yourself, but you may
 | |
|        add  to  the  block by setting the other fields and their corresponding
 | |
|        flag bits.
 | |
| 
 | |
|        The match_limit field provides a means of preventing PCRE from using up
 | |
|        a  vast amount of resources when running patterns that are not going to
 | |
|        match, but which have a very large number  of  possibilities  in  their
 | |
|        search  trees.  The  classic  example  is  the  use of nested unlimited
 | |
|        repeats.
 | |
| 
 | |
|        Internally, PCRE uses a function called match() which it calls  repeat-
 | |
|        edly  (sometimes  recursively). The limit set by match_limit is imposed
 | |
|        on the number of times this function is called during  a  match,  which
 | |
|        has  the  effect  of  limiting the amount of backtracking that can take
 | |
|        place. For patterns that are not anchored, the count restarts from zero
 | |
|        for each position in the subject string.
 | |
| 
 | |
|        The  default  value  for  the  limit can be set when PCRE is built; the
 | |
|        default default is 10 million, which handles all but the  most  extreme
 | |
|        cases.  You  can  override the default by supplying pcre_exec() with a
 | |
|        pcre_extra    block    in    which    match_limit    is    set,     and
 | |
|        PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
 | |
|        exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
 | |
| 
 | |
|        The match_limit_recursion field is similar to match_limit, but  instead
 | |
|        of limiting the total number of times that match() is called, it limits
 | |
|        the depth of recursion. The recursion depth is a  smaller  number  than
 | |
|        the  total number of calls, because not all calls to match() are recur-
 | |
|        sive.  This limit is of use only if it is set smaller than match_limit.
 | |
| 
 | |
|        Limiting the recursion depth limits the amount of  stack  that  can  be
 | |
|        used, or, when PCRE has been compiled to use memory on the heap instead
 | |
|        of the stack, the amount of heap memory that can be used.
 | |
| 
 | |
|        The default value for match_limit_recursion can be  set  when  PCRE  is
 | |
|        built;  the  default  default  is  the  same  value  as the default for
 | |
|        match_limit. You can override the default by supplying pcre_exec() with
 | |
|        a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
 | |
|        PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
 | |
|        limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
 | |
| 
 | |
|        The callout_data field is used in conjunction  with  the  "callout"
 | |
|        feature, which is described in the pcrecallout documentation.
 | |
| 
 | |
|        The tables field  is  used  to  pass  a  character  tables  pointer  to
 | |
|        pcre_exec();  this overrides the value that is stored with the compiled
 | |
|        pattern. A non-NULL value is stored with the compiled pattern  only  if
 | |
|        custom  tables  were  supplied to pcre_compile() via its tableptr argu-
 | |
|        ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
 | |
|        PCRE's  internal  tables  to be used. This facility is helpful when re-
 | |
|        using patterns that have been saved after compiling  with  an  external
 | |
|        set  of  tables,  because  the  external tables might be at a different
 | |
|        address when pcre_exec() is called. See the  pcreprecompile  documenta-
 | |
|        tion for a discussion of saving compiled patterns for later use.
 | |
| 
 | |
|    Option bits for pcre_exec()
 | |
| 
 | |
|        The  unused  bits of the options argument for pcre_exec() must be zero.
 | |
|        The only bits that may be set are PCRE_ANCHORED, PCRE_BSR_xxx,
 | |
|        PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
 | |
|        PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
 | |
| 
 | |
|          PCRE_ANCHORED
 | |
| 
 | |
|        The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
 | |
|        matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
 | |
|        turned out to be anchored by virtue of its contents, it cannot be  made
 | |
|        unanchored at matching time.
 | |
| 
 | |
|          PCRE_BSR_ANYCRLF
 | |
|          PCRE_BSR_UNICODE
 | |
| 
 | |
|        These options (which are mutually exclusive) control what the \R escape
 | |
|        sequence matches. The choice is either to match only CR, LF,  or  CRLF,
 | |
|        or  to  match  any Unicode newline sequence. These options override the
 | |
|        choice that was made or defaulted when the pattern was compiled.
 | |
| 
 | |
|          PCRE_NEWLINE_CR
 | |
|          PCRE_NEWLINE_LF
 | |
|          PCRE_NEWLINE_CRLF
 | |
|          PCRE_NEWLINE_ANYCRLF
 | |
|          PCRE_NEWLINE_ANY
 | |
| 
 | |
|        These options override  the  newline  definition  that  was  chosen  or
 | |
|        defaulted  when the pattern was compiled. For details, see the descrip-
 | |
|        tion of pcre_compile()  above.  During  matching,  the  newline  choice
 | |
|        affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
 | |
|        ters. It may also alter the way the match position is advanced after  a
 | |
|        match failure for an unanchored pattern.
 | |
| 
 | |
|        When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
 | |
|        set, and a match attempt for an unanchored pattern fails when the  cur-
 | |
|        rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
 | |
|        explicit matches for  CR  or  LF  characters,  the  match  position  is
 | |
|        advanced by two characters instead of one, in other words, to after the
 | |
|        CRLF.
 | |
| 
 | |
|        The above rule is a compromise that makes the most common cases work as
 | |
|        expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
 | |
|        option is not set), it does not match the string "\r\nA" because, after
 | |
|        failing  at the start, it skips both the CR and the LF before retrying.
 | |
|        However, the pattern [\r\n]A does match that string,  because  it  con-
 | |
|        tains an explicit CR or LF reference, and so advances only by one char-
 | |
|        acter after the first failure.
 | |
| 
 | |
|        An explicit match for CR or LF is either a literal appearance of one of
 | |
|        those  characters,  or  one  of the \r or \n escape sequences. Implicit
 | |
|        matches such as [^X] do not count, nor does \s (which includes  CR  and
 | |
|        LF in the characters that it matches).
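
       The advance rule described above can be sketched as a small function.
       This is a model of the documented behaviour, not PCRE's own code: the
       name next_start_position and both flag parameters are hypothetical,
       and the sketch counts bytes, so it ignores multi-byte characters in
       UTF-8 mode.

```c
/* Returns the next starting position after a failed match attempt at pos
   for an unanchored pattern. crlf_is_newline is non-zero when
   PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is in force;
   explicit_cr_or_lf is non-zero when the pattern contains an explicit match
   for CR or LF. */
static int next_start_position(const char *subject, int length, int pos,
                               int crlf_is_newline, int explicit_cr_or_lf)
{
    if (crlf_is_newline && !explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* skip the whole CRLF sequence */
    return pos + 1;       /* ordinary single-character advance */
}
```

       For the subject "\r\nA", a failure at position 0 moves the next
       attempt to position 2 when the pattern is .+A, but only to position 1
       when the pattern is [\r\n]A.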
 | |
| 
 | |
|        Notwithstanding  the above, anomalous effects may still occur when CRLF
 | |
|        is a valid newline sequence and explicit \r or \n escapes appear in the
 | |
|        pattern.
 | |
| 
 | |
|          PCRE_NOTBOL
 | |
| 
 | |
|        This option specifies that the first character of the subject string is not
 | |
|        the beginning of a line, so the  circumflex  metacharacter  should  not
 | |
|        match  before it. Setting this without PCRE_MULTILINE (at compile time)
 | |
|        causes circumflex never to match. This option affects only  the  behav-
 | |
|        iour of the circumflex metacharacter. It does not affect \A.
 | |
| 
 | |
|          PCRE_NOTEOL
 | |
| 
 | |
|        This option specifies that the end of the subject string is not the end
 | |
|        of a line, so the dollar metacharacter should not match it nor  (except
 | |
|        in  multiline mode) a newline immediately before it. Setting this with-
 | |
|        out PCRE_MULTILINE (at compile time) causes dollar never to match. This
 | |
|        option  affects only the behaviour of the dollar metacharacter. It does
 | |
|        not affect \Z or \z.
 | |
| 
 | |
|          PCRE_NOTEMPTY
 | |
| 
 | |
|        An empty string is not considered to be a valid match if this option is
 | |
|        set.  If  there are alternatives in the pattern, they are tried. If all
 | |
|        the alternatives match the empty string, the entire  match  fails.  For
 | |
|        example, if the pattern
 | |
| 
 | |
|          a?b?
 | |
| 
 | |
|        is  applied  to  a string not beginning with "a" or "b", it matches the
 | |
|        empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
 | |
|        match is not valid, so PCRE searches further into the string for occur-
 | |
|        rences of "a" or "b".
 | |
| 
 | |
|        Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
 | |
|        cial  case  of  a  pattern match of the empty string within its split()
 | |
|        function, and when using the /g modifier. It  is  possible  to  emulate
 | |
|        Perl's behaviour after matching a null string by first trying the match
 | |
|        again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
 | |
|        if  that  fails by advancing the starting offset (see below) and trying
 | |
|        an ordinary match again. There is some code that demonstrates how to do
 | |
|        this in the pcredemo.c sample program.
 | |
| 
 | |
|          PCRE_NO_START_OPTIMIZE
 | |
| 
 | |
|        There  are a number of optimizations that pcre_exec() uses at the start
 | |
|        of a match, in order to speed up the process. For  example,  if  it  is
 | |
|        known  that  a  match must start with a specific character, it searches
 | |
|        the subject for that character, and fails immediately if it cannot find
 | |
|        it,  without actually running the main matching function. When callouts
 | |
|        are in use, these optimizations can cause  them  to  be  skipped.  This
 | |
|        option  disables  the  "start-up" optimizations, causing performance to
 | |
|        suffer, but ensuring that the callouts do occur.
 | |
| 
 | |
|          PCRE_NO_UTF8_CHECK
 | |
| 
 | |
|        When PCRE_UTF8 is set at compile time, the validity of the subject as a
 | |
|        UTF-8  string is automatically checked when pcre_exec() is subsequently
 | |
|        called.  The value of startoffset is also checked  to  ensure  that  it
 | |
|        points  to  the start of a UTF-8 character. There is a discussion about
 | |
|        the validity of UTF-8 strings in the section on UTF-8  support  in  the
 | |
|        main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
 | |
|        pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-
 | |
|        tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
 | |
| 
 | |
|        If  you  already  know that your subject is valid, and you want to skip
 | |
|        these   checks   for   performance   reasons,   you   can    set    the
 | |
|        PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
 | |
|        do this for the second and subsequent calls to pcre_exec() if  you  are
 | |
|        making  repeated  calls  to  find  all  the matches in a single subject
 | |
|        string. However, you should be  sure  that  the  value  of  startoffset
 | |
|        points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
 | |
|        set, the effect of passing an invalid UTF-8 string as a subject,  or  a
 | |
|        value  of startoffset that does not point to the start of a UTF-8 char-
 | |
|        acter, is undefined. Your program may crash.
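
       Before setting PCRE_NO_UTF8_CHECK, a caller may want to confirm for
       itself that startoffset is sensible. The following sketch relies only
       on the UTF-8 rule that continuation bytes have the form 10xxxxxx. The
       name is_utf8_char_start is hypothetical, and accepting an offset equal
       to the subject length is an assumption of this sketch. It does not
       validate the subject string as a whole.

```c
/* Returns 1 if offset can be the start of a UTF-8 character in subject,
   0 otherwise. A byte of the form 10xxxxxx is a continuation byte and so
   cannot begin a character. An offset equal to length (the end of the
   subject) is accepted. */
static int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length) return 0;
    if (offset == length) return 1;
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

       For example, in the four-byte subject "a\xC3\xA9z" the offsets 0, 1,
       and 3 are character starts, while offset 2 falls in the middle of the
       two-byte character.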
 | |
| 
 | |
|          PCRE_PARTIAL
 | |
| 
 | |
|        This option turns on the  partial  matching  feature.  If  the  subject
 | |
|        string  fails to match the pattern, but at some point during the match-
 | |
|        ing process the end of the subject was reached (that  is,  the  subject
 | |
|        partially  matches  the  pattern and the failure to match occurred only
 | |
|        because there were not enough subject characters), pcre_exec()  returns
 | |
|        PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
 | |
|        used, there are restrictions on what may appear in the  pattern.  These
 | |
|        are discussed in the pcrepartial documentation.
 | |
| 
 | |
|    The string to be matched by pcre_exec()
 | |
| 
 | |
|        The  subject string is passed to pcre_exec() as a pointer in subject, a
 | |
|        length (in bytes) in length, and a starting byte offset in startoffset.
 | |
|        In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
 | |
|        acter. Unlike the pattern string, the subject may contain  binary  zero
 | |
|        bytes.  When the starting offset is zero, the search for a match starts
 | |
|        at the beginning of the subject, and this is by  far  the  most  common
 | |
|        case.
 | |
| 
 | |
|        A  non-zero  starting offset is useful when searching for another match
 | |
|        in the same subject by calling pcre_exec() again after a previous  suc-
 | |
|        cess.   Setting  startoffset differs from just passing over a shortened
 | |
|        string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
 | |
|        with any kind of lookbehind. For example, consider the pattern
 | |
| 
 | |
|          \Biss\B
 | |
| 
 | |
|        which  finds  occurrences  of "iss" in the middle of words. (\B matches
 | |
|        only if the current position in the subject is not  a  word  boundary.)
 | |
|        When  applied to the string "Mississippi" the first call to pcre_exec()
 | |
|        finds the first occurrence. If pcre_exec() is called  again  with  just
 | |
|        the  remainder  of  the  subject,  namely "issippi", it does not match,
 | |
|        because \B is always false at the start of the subject, which is deemed
 | |
|        to  be  a  word  boundary. However, if pcre_exec() is passed the entire
 | |
|        string again, but with startoffset set to 4, it finds the second occur-
 | |
|        rence  of "iss" because it is able to look behind the starting point to
 | |
|        discover that it is preceded by a letter.
 | |
| 
 | |
|        If a non-zero starting offset is passed when the pattern  is  anchored,
 | |
|        one attempt to match at the given offset is made. This can only succeed
 | |
|        if the pattern does not require the match to be at  the  start  of  the
 | |
|        subject.
 | |
| 
 | |
|    How pcre_exec() returns captured substrings
 | |
| 
 | |
|        In  general, a pattern matches a certain portion of the subject, and in
 | |
|        addition, further substrings from the subject  may  be  picked  out  by
 | |
|        parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
 | |
|        this is called "capturing" in what follows, and the  phrase  "capturing
 | |
|        subpattern"  is  used for a fragment of a pattern that picks out a sub-
 | |
|        string. PCRE supports several other kinds of  parenthesized  subpattern
 | |
|        that do not cause substrings to be captured.
 | |
| 
 | |
|        Captured substrings are returned to the caller via a vector of integers
 | |
|        whose address is passed in ovector. The number of elements in the  vec-
 | |
|        tor  is  passed in ovecsize, which must be a non-negative number. Note:
 | |
|        this argument is NOT the size of ovector in bytes.
 | |
| 
 | |
|        The first two-thirds of the vector is used to pass back  captured  sub-
 | |
|        strings,  each  substring using a pair of integers. The remaining third
 | |
|        of the vector is used as workspace by pcre_exec() while  matching  cap-
 | |
|        turing  subpatterns, and is not available for passing back information.
 | |
|        The number passed in ovecsize should always be a multiple of three.  If
 | |
|        it is not, it is rounded down.
 | |
| 
 | |
|        When  a  match  is successful, information about captured substrings is
 | |
|        returned in pairs of integers, starting at the  beginning  of  ovector,
 | |
|        and  continuing  up  to two-thirds of its length at the most. The first
 | |
|        element of each pair is set to the byte offset of the  first  character
 | |
|        in  a  substring, and the second is set to the byte offset of the first
 | |
|        character after the end of a substring. Note: these values  are  always
 | |
|        byte offsets, even in UTF-8 mode. They are not character counts.
 | |
| 
 | |
|        The  first  pair  of  integers, ovector[0] and ovector[1], identify the
 | |
|        portion of the subject string matched by the entire pattern.  The  next
 | |
|        pair  is  used for the first capturing subpattern, and so on. The value
 | |
|        returned by pcre_exec() is one more than the highest numbered pair that
 | |
|        has  been  set.  For example, if two substrings have been captured, the
 | |
|        returned value is 3. If there are no capturing subpatterns, the  return
 | |
|        value from a successful match is 1, indicating that just the first pair
 | |
|        of offsets has been set.
 | |
| 
 | |
|        If a capturing subpattern is matched repeatedly, it is the last portion
 | |
|        of the string that it matched that is returned.
 | |
| 
 | |
|        If  the vector is too small to hold all the captured substring offsets,
 | |
|        it is used as far as possible (up to two-thirds of its length), and the
 | |
|        function  returns  a value of zero. If the substring offsets are not of
 | |
|        interest, pcre_exec() may be called with ovector  passed  as  NULL  and
 | |
|        ovecsize  as zero. However, if the pattern contains back references and
 | |
|        the ovector is not big enough to remember the related substrings,  PCRE
 | |
|        has  to  get additional memory for use during matching. Thus it is usu-
 | |
|        ally advisable to supply an ovector.
 | |
| 
 | |
|        The pcre_fullinfo() function can be used to find out how many capturing
 | |
|        subpatterns  there  are  in  a  compiled pattern. The smallest size for
 | |
|        ovector that will allow for n captured substrings, in addition  to  the
 | |
|        offsets of the substring matched by the whole pattern, is (n+1)*3.
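
       The sizing rules above amount to a little arithmetic, sketched here
       with hypothetical helper names (OVECCOUNT, usable_pairs, and
       min_ovecsize are not part of the PCRE API).

```c
/* Number of elements in a real array (not a pointer). This, and not
   sizeof(v), is the value to pass as ovecsize. */
#define OVECCOUNT(v) ((int)(sizeof(v) / sizeof((v)[0])))

/* Number of offset pairs that can be passed back for a given ovecsize.
   Only the first two-thirds of the vector holds pairs, and a size that is
   not a multiple of three is rounded down. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}

/* Smallest ovecsize that allows for n captured substrings in addition to
   the offsets of the whole match. */
static int min_ovecsize(int n)
{
    return (n + 1) * 3;
}
```

       An ovector of 30 elements, as in the simple example call, therefore
       has room for the whole match plus nine captured substrings.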
 | |
| 
 | |
|        It  is  possible for capturing subpattern number n+1 to match some part
 | |
|        of the subject when subpattern n has not been used at all. For example,
 | |
|        if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
 | |
|        return from the function is 4, and subpatterns 1 and 3 are matched, but
 | |
|        2  is  not.  When  this happens, both values in the offset pairs corre-
 | |
|        sponding to unused subpatterns are set to -1.
 | |
| 
 | |
|        Offset values that correspond to unused subpatterns at the end  of  the
 | |
|        expression  are  also  set  to  -1. For example, if the string "abc" is
 | |
|        matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
 | |
|        matched.  The  return  from the function is 2, because the highest used
 | |
|        capturing subpattern number is 1. However, you can refer to the offsets
 | |
|        for  the  second  and third capturing subpatterns if you wish (assuming
 | |
|        the vector is large enough, of course).
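
       When reading offset pairs directly, the -1 case has to be checked
       before a length is computed. The sketch below shows one way to do
       that; captured_length and example_ovector are hypothetical names, and
       the vector contents are what the text above describes for "abc"
       matched against (a|(z))(bc), with the final third shown as zero
       because it is workspace.

```c
/* Returns the length in bytes of captured substring n, or -1 if n is not
   below rc or the subpattern did not take part in the match. rc is the
   positive value returned by pcre_exec(). */
static int captured_length(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc) return -1;
    if (ovector[2 * n] < 0) return -1;          /* unset: both values -1 */
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Offsets for "abc" against (a|(z))(bc): whole match 0..3, group 1 is
   0..1, group 2 is unset, group 3 is 1..3. The return value is 4. */
static const int example_ovector[12] = {
    0, 3,   0, 1,   -1, -1,   1, 3,
    0, 0, 0, 0      /* workspace third, contents not meaningful */
};
```

       The substring itself starts at subject + ovector[2*n] and is not zero
       terminated, so the length must be carried alongside the pointer.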
 | |
| 
 | |
|        Some convenience functions are provided  for  extracting  the  captured
 | |
|        substrings as separate strings. These are described below.
 | |
| 
 | |
|    Error return values from pcre_exec()
 | |
| 
 | |
|        If  pcre_exec()  fails, it returns a negative number. The following are
 | |
|        defined in the header file:
 | |
| 
 | |
|          PCRE_ERROR_NOMATCH        (-1)
 | |
| 
 | |
|        The subject string did not match the pattern.
 | |
| 
 | |
|          PCRE_ERROR_NULL           (-2)
 | |
| 
 | |
|        Either code or subject was passed as NULL,  or  ovector  was  NULL  and
 | |
|        ovecsize was not zero.
 | |
| 
 | |
|          PCRE_ERROR_BADOPTION      (-3)
 | |
| 
 | |
|        An unrecognized bit was set in the options argument.
 | |
| 
 | |
|          PCRE_ERROR_BADMAGIC       (-4)
 | |
| 
 | |
|        PCRE  stores a 4-byte "magic number" at the start of the compiled code,
 | |
|        to catch the case when it is passed a junk pointer and to detect when a
 | |
|        pattern that was compiled in an environment of one endianness is run in
 | |
|        an environment with the other endianness. This is the error  that  PCRE
 | |
|        gives when the magic number is not present.
 | |
| 
 | |
|          PCRE_ERROR_UNKNOWN_OPCODE (-5)
 | |
| 
 | |
|        While running the pattern match, an unknown item was encountered in the
 | |
|        compiled pattern. This error could be caused by a bug  in  PCRE  or  by
 | |
|        overwriting of the compiled pattern.
 | |
| 
 | |
|          PCRE_ERROR_NOMEMORY       (-6)
 | |
| 
 | |
|        If  a  pattern contains back references, but the ovector that is passed
 | |
|        to pcre_exec() is not big enough to remember the referenced substrings,
 | |
|        PCRE  gets  a  block of memory at the start of matching to use for this
 | |
|        purpose. If the call via pcre_malloc() fails, this error is given.  The
 | |
|        memory is automatically freed at the end of matching.
 | |
| 
 | |
|          PCRE_ERROR_NOSUBSTRING    (-7)
 | |
| 
 | |
|        This  error is used by the pcre_copy_substring(), pcre_get_substring(),
 | |
|        and  pcre_get_substring_list()  functions  (see  below).  It  is  never
 | |
|        returned by pcre_exec().
 | |
| 
 | |
|          PCRE_ERROR_MATCHLIMIT     (-8)
 | |
| 
 | |
|        The  backtracking  limit,  as  specified  by the match_limit field in a
 | |
|        pcre_extra structure (or defaulted) was reached.  See  the  description
 | |
|        above.
 | |
| 
 | |
|          PCRE_ERROR_CALLOUT        (-9)
 | |
| 
 | |
|        This error is never generated by pcre_exec() itself. It is provided for
 | |
|        use by callout functions that want to yield a distinctive  error  code.
 | |
|        See the pcrecallout documentation for details.
 | |
| 
 | |
|          PCRE_ERROR_BADUTF8        (-10)
 | |
| 
 | |
|        A  string  that contains an invalid UTF-8 byte sequence was passed as a
 | |
|        subject.
 | |
| 
 | |
|          PCRE_ERROR_BADUTF8_OFFSET (-11)
 | |
| 
 | |
|        The UTF-8 byte sequence that was passed as a subject was valid, but the
 | |
|        value  of startoffset did not point to the beginning of a UTF-8 charac-
 | |
|        ter.
 | |
| 
 | |
|          PCRE_ERROR_PARTIAL        (-12)
 | |
| 
 | |
|        The subject string did not match, but it did match partially.  See  the
 | |
|        pcrepartial documentation for details of partial matching.
 | |
| 
 | |
|          PCRE_ERROR_BADPARTIAL     (-13)
 | |
| 
 | |
|        The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
 | |
|        items that are not supported for partial matching. See the  pcrepartial
 | |
|        documentation for details of partial matching.
 | |
| 
 | |
|          PCRE_ERROR_INTERNAL       (-14)
 | |
| 
 | |
|        An  unexpected  internal error has occurred. This error could be caused
 | |
|        by a bug in PCRE or by overwriting of the compiled pattern.
 | |
| 
 | |
|          PCRE_ERROR_BADCOUNT       (-15)
 | |
| 
 | |
|        This error is given if the value of the ovecsize argument is negative.
 | |
| 
 | |
|          PCRE_ERROR_RECURSIONLIMIT (-21)
 | |
| 
 | |
|        The internal recursion limit, as specified by the match_limit_recursion
 | |
|        field  in  a  pcre_extra  structure (or defaulted) was reached. See the
 | |
|        description above.
 | |
| 
 | |
|          PCRE_ERROR_BADNEWLINE     (-23)
 | |
| 
 | |
|        An invalid combination of PCRE_NEWLINE_xxx options was given.
 | |
| 
 | |
|        Error numbers -16 to -20 and -22 are not used by pcre_exec().
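
       For diagnostics it can be convenient to turn a return value into
       text. The following sketch uses the numeric values listed above so
       that it stands alone; real code should include pcre.h and compare
       against the symbolic names instead. The function name exec_error_text
       and the message wording are hypothetical.

```c
/* Describe a return value from pcre_exec(). */
static const char *exec_error_text(int rc)
{
    if (rc >= 0) return "no error";
    switch (rc) {
    case -1:  return "no match";                   /* PCRE_ERROR_NOMATCH */
    case -2:  return "NULL argument";              /* PCRE_ERROR_NULL */
    case -3:  return "bad option bit";             /* PCRE_ERROR_BADOPTION */
    case -4:  return "bad magic number";           /* PCRE_ERROR_BADMAGIC */
    case -5:  return "unknown opcode";             /* PCRE_ERROR_UNKNOWN_OPCODE */
    case -6:  return "out of memory";              /* PCRE_ERROR_NOMEMORY */
    case -8:  return "match limit reached";        /* PCRE_ERROR_MATCHLIMIT */
    case -9:  return "callout error";              /* PCRE_ERROR_CALLOUT */
    case -10: return "invalid UTF-8 in subject";   /* PCRE_ERROR_BADUTF8 */
    case -11: return "bad UTF-8 start offset";     /* PCRE_ERROR_BADUTF8_OFFSET */
    case -12: return "partial match";              /* PCRE_ERROR_PARTIAL */
    case -13: return "pattern not usable for partial matching";
                                                   /* PCRE_ERROR_BADPARTIAL */
    case -14: return "internal error";             /* PCRE_ERROR_INTERNAL */
    case -15: return "negative ovecsize";          /* PCRE_ERROR_BADCOUNT */
    case -21: return "recursion limit reached";    /* PCRE_ERROR_RECURSIONLIMIT */
    case -23: return "bad newline options";        /* PCRE_ERROR_BADNEWLINE */
    default:  return "error not used by pcre_exec()";
    }
}
```

       Most callers treat -1 (and -12 when partial matching is in use) as an
       ordinary outcome rather than a failure, and report only the remaining
       values.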
 | |
| 
 | |
| 
 | |
| EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
 | |
| 
 | |
|        int pcre_copy_substring(const char *subject, int *ovector,
 | |
|             int stringcount, int stringnumber, char *buffer,
 | |
|             int buffersize);
 | |
| 
 | |
|        int pcre_get_substring(const char *subject, int *ovector,
 | |
|             int stringcount, int stringnumber,
 | |
|             const char **stringptr);
 | |
| 
 | |
|        int pcre_get_substring_list(const char *subject,
 | |
|             int *ovector, int stringcount, const char ***listptr);
 | |
| 
 | |
|        Captured substrings can be  accessed  directly  by  using  the  offsets
 | |
|        returned  by  pcre_exec()  in  ovector.  For convenience, the functions
 | |
|        pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
 | |
|        string_list()  are  provided for extracting captured substrings as new,
 | |
|        separate, zero-terminated strings. These functions identify  substrings
 | |
|        by  number.  The  next section describes functions for extracting named
 | |
|        substrings.
 | |
| 
 | |
|        A substring that contains a binary zero is correctly extracted and  has
 | |
|        a  further zero added on the end, but the result is not, of course, a C
 | |
|        string.  However, you can process such a string  by  referring  to  the
 | |
|        length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
 | |
|        string().  Unfortunately, the interface to pcre_get_substring_list() is
 | |
|        not  adequate for handling strings containing binary zeros, because the
 | |
|        end of the final string is not independently indicated.
 | |
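Because the offsets in ovector are available directly, a caller that has to cope with binary zeros can work with a pointer and a length instead of a copied C string. The sketch below assumes the usual layout in which capture n occupies ovector[2*n] (start) and ovector[2*n+1] (end); capture_span is a hypothetical helper, not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to capture n inside subject and store its length
   in *len. Nothing is copied and no terminating zero is added, so
   embedded binary zeros need no special treatment. Returns NULL if n
   is out of range or the group did not take part in the match. */
const char *capture_span(const char *subject, const int *ovector,
                         int stringcount, int n, int *len)
{
    assert(subject != NULL && ovector != NULL && len != NULL);
    if (n < 0 || n >= stringcount)
        return NULL;
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return NULL;
    *len = end - start;
    return subject + start;
}
```

The returned pointer refers to the original subject, so it stays valid only as long as the subject buffer does.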
| 
 | |
|        The first three arguments are the same for all  three  of  these  func-
 | |
|        tions:  subject  is  the subject string that has just been successfully
 | |
|        matched, ovector is a pointer to the vector of integer offsets that was
 | |
|        passed to pcre_exec(), and stringcount is the number of substrings that
 | |
|        were captured by the match, including the substring  that  matched  the
 | |
|        entire regular expression. This is the value returned by pcre_exec() if
 | |
|        it is greater than zero. If pcre_exec() returned zero, indicating  that
 | |
|        it  ran out of space in ovector, the value passed as stringcount should
 | |
|        be the number of elements in the vector divided by three.
 | |
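The rule for choosing stringcount can be captured in a small helper. This is only a sketch: effective_stringcount is a hypothetical name, rc stands for a non-negative return from pcre_exec(), and ovecsize is the number of elements in ovector.

```c
#include <assert.h>

/* Value to pass as stringcount to the substring extraction functions.
   A positive rc is used as it stands; rc == 0 means that ovector was
   too small, so the number of elements divided by three is used. */
int effective_stringcount(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    return rc > 0 ? rc : ovecsize / 3;
}
```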
| 
 | |
|        The functions pcre_copy_substring() and pcre_get_substring() extract  a
 | |
|        single  substring,  whose  number  is given as stringnumber. A value of
 | |
|        zero extracts the substring that matched the  entire  pattern,  whereas
 | |
|        higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
 | |
|        string(), the string is placed in buffer,  whose  length  is  given  by
 | |
|        buffersize,  while  for  pcre_get_substring()  a new block of memory is
 | |
|        obtained via pcre_malloc, and its address is  returned  via  stringptr.
 | |
|        The  yield  of  the function is the length of the string, not including
 | |
|        the terminating zero, or one of these error codes:
 | |
| 
 | |
|          PCRE_ERROR_NOMEMORY       (-6)
 | |
| 
 | |
|        The buffer was too small for pcre_copy_substring(), or the  attempt  to
 | |
|        get memory failed for pcre_get_substring().
 | |
| 
 | |
|          PCRE_ERROR_NOSUBSTRING    (-7)
 | |
| 
 | |
|        There is no substring whose number is stringnumber.
 | |
| 
 | |
|        The  pcre_get_substring_list()  function  extracts  all  available sub-
 | |
|        strings and builds a list of pointers to them. All this is  done  in  a
 | |
|        single block of memory that is obtained via pcre_malloc. The address of
 | |
|        the memory block is returned via listptr, which is also  the  start  of
 | |
|        the  list  of  string pointers. The end of the list is marked by a NULL
 | |
|        pointer. The yield of the function is zero if all  went  well,  or  the
 | |
|        error code
 | |
| 
 | |
|          PCRE_ERROR_NOMEMORY       (-6)
 | |
| 
 | |
|        if the attempt to get the memory block failed.
 | |
| 
 | |
|        When  any of these functions encounter a substring that is unset, which
 | |
|        can happen when capturing subpattern number n+1 matches  some  part  of
 | |
|        the  subject, but subpattern n has not been used at all, they return an
 | |
|        empty string. This can be distinguished from a genuine zero-length sub-
 | |
|        string  by inspecting the appropriate offset in ovector, which is nega-
 | |
|        tive for unset substrings.
 | |
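The check on the offsets might look like the following sketch. It assumes that capture n occupies ovector[2*n] and ovector[2*n+1]; the enum and the function name classify_capture are illustrative and not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

enum capture_state { CAPTURE_UNSET, CAPTURE_EMPTY, CAPTURE_NONEMPTY };

/* Distinguish an unset substring (negative offset) from one that
   genuinely matched the empty string (start equal to end). */
enum capture_state classify_capture(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0)
        return CAPTURE_UNSET;
    return (end == start) ? CAPTURE_EMPTY : CAPTURE_NONEMPTY;
}
```

For a pattern such as (a)?(b) matched against "b", subpattern 1 would be classified as unset and subpattern 2 as non-empty.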
| 
 | |
|        The two convenience functions pcre_free_substring() and  pcre_free_sub-
 | |
|        string_list()  can  be  used  to free the memory returned by a previous
 | |
|        call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
 | |
|        tively.  They  do  nothing  more  than  call the function pointed to by
 | |
|        pcre_free, which of course could be called directly from a  C  program.
 | |
|        However,  PCRE is used in some situations where it is linked via a spe-
 | |
|        cial  interface  to  another  programming  language  that  cannot   use
 | |
|        pcre_free  directly;  it is for these cases that the functions are pro-
 | |
|        vided.
 | |
| 
 | |
| 
 | |
| EXTRACTING CAPTURED SUBSTRINGS BY NAME
 | |
| 
 | |
|        int pcre_get_stringnumber(const pcre *code,
 | |
|             const char *name);
 | |
| 
 | |
|        int pcre_copy_named_substring(const pcre *code,
 | |
|             const char *subject, int *ovector,
 | |
|             int stringcount, const char *stringname,
 | |
|             char *buffer, int buffersize);
 | |
| 
 | |
|        int pcre_get_named_substring(const pcre *code,
 | |
|             const char *subject, int *ovector,
 | |
|             int stringcount, const char *stringname,
 | |
|             const char **stringptr);
 | |
| 
 | |
|        To extract a substring by name, you first have to find the associated
 | |
|        number. For example, for this pattern
 | |
| 
 | |
|          (a+)b(?<xxx>\d+)...
 | |
| 
 | |
|        the number of the subpattern called "xxx" is 2. If the name is known to
 | |
|        be unique (PCRE_DUPNAMES was not set), you can find the number from the
 | |
|        name by calling pcre_get_stringnumber(). The first argument is the com-
 | |
|        piled pattern, and the second is the name. The yield of the function is
 | |
|        the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
 | |
|        subpattern of that name.
 | |
| 
 | |
|        Given the number, you can extract the substring directly, or use one of
 | |
|        the functions described in the previous section. For convenience, there
 | |
|        are also two functions that do the whole job.
 | |
| 
 | |
|        Most   of   the   arguments    of    pcre_copy_named_substring()    and
 | |
|        pcre_get_named_substring()  are  the  same  as  those for the similarly
 | |
|        named functions that extract by number. As these are described  in  the
 | |
|        previous  section,  they  are not re-described here. There are just two
 | |
|        differences:
 | |
| 
 | |
|        First, instead of a substring number, a substring name is  given.  Sec-
 | |
|        ond, there is an extra argument, given at the start, which is a pointer
 | |
|        to the compiled pattern. This is needed in order to gain access to  the
 | |
|        name-to-number translation table.
 | |
| 
 | |
|        These  functions call pcre_get_stringnumber(), and if it succeeds, they
 | |
|        then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
 | |
|        ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
 | |
|        behaviour may not be what you want (see the next section).
 | |
| 
 | |
|        Warning: If the pattern uses the "(?|" feature to set up multiple  sub-
 | |
|        patterns  with  the  same  number,  you cannot use names to distinguish
 | |
|        them, because names are not included in the compiled code. The matching
 | |
|        process uses only numbers.
 | |
| 
 | |
| 
 | |
| DUPLICATE SUBPATTERN NAMES
 | |
| 
 | |
|        int pcre_get_stringtable_entries(const pcre *code,
 | |
|             const char *name, char **first, char **last);
 | |
| 
 | |
|        When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
 | |
|        subpatterns are not required to  be  unique.  Normally,  patterns  with
 | |
|        duplicate  names  are such that in any one match, only one of the named
 | |
|        subpatterns participates. An example is shown in the pcrepattern  docu-
 | |
|        mentation.
 | |
| 
 | |
|        When    duplicates   are   present,   pcre_copy_named_substring()   and
 | |
|        pcre_get_named_substring() return the first substring corresponding  to
 | |
|        the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
 | |
|        (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
 | |
|        function  returns one of the numbers that are associated with the name,
 | |
|        but it is not defined which it is.
 | |
| 
 | |
|        If you want to get full details of all captured substrings for a  given
 | |
|        name,  you  must  use  the pcre_get_stringtable_entries() function. The
 | |
|        first argument is the compiled pattern, and the second is the name. The
 | |
|        third  and  fourth  are  pointers to variables which are updated by the
 | |
|        function. After it has run, they point to the first and last entries in
 | |
|        the  name-to-number  table  for  the  given  name.  The function itself
 | |
|        returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
 | |
|        there  are none. The format of the table is described above in the sec-
 | |
|        tion entitled Information about a  pattern.   Given  all  the  relevant
 | |
|        entries  for the name, you can extract each of their numbers, and hence
 | |
|        the captured data, if any.
 | |
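Extracting the numbers from the entries could be done along the following lines. This sketch relies on the table format given in the section it refers to: in the 8-bit library each entry starts with the subpattern number in two bytes, most significant first, followed by the zero-terminated name. collect_group_numbers is a hypothetical helper; first, last and entry_size are the values produced by pcre_get_stringtable_entries().

```c
#include <assert.h>
#include <stddef.h>

/* Store the subpattern number of every entry from first to last
   (inclusive) in numbers[], up to max_numbers of them. Returns how
   many numbers were stored. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2 && last >= first);
    int entries = (int)((last - first) / entry_size) + 1;
    int count = 0;
    for (int i = 0; i < entries && count < max_numbers; i++) {
        const unsigned char *entry =
            (const unsigned char *)first + (size_t)i * (size_t)entry_size;
        numbers[count++] = (entry[0] << 8) | entry[1];
    }
    return count;
}
```

Each number obtained this way can then be checked against ovector to see whether that particular subpattern was set in the match.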
| 
 | |
| 
 | |
| FINDING ALL POSSIBLE MATCHES
 | |
| 
 | |
|        The traditional matching function uses a  similar  algorithm  to  Perl,
 | |
|        which stops when it finds the first match, starting at a given point in
 | |
|        the subject. If you want to find all possible matches, or  the  longest
 | |
|        possible  match,  consider using the alternative matching function (see
 | |
|        below) instead. If you cannot use the alternative function,  but  still
 | |
|        need  to  find all possible matches, you can kludge it up by making use
 | |
|        of the callout facility, which is described in the pcrecallout documen-
 | |
|        tation.
 | |
| 
 | |
|        What you have to do is to insert a callout right at the end of the pat-
 | |
|        tern.  When your callout function is called, extract and save the  cur-
 | |
|        rent  matched  substring.  Then  return  1, which forces pcre_exec() to
 | |
|        backtrack and try other alternatives. Ultimately, when it runs  out  of
 | |
|        matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
 | |
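A callout function for this technique might be sketched as follows. To keep the example compilable on its own, demo_callout_block is a cut-down stand-in holding just the three pcre_callout_block fields that are read; real code would include pcre.h, take a pcre_callout_block pointer, and assign the function to pcre_callout. The match_log type and the limit of 16 saved matches are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for pcre_callout_block (fields used by this sketch only). */
typedef struct {
    int start_match;        /* offset where this match attempt began */
    int current_position;   /* offset of the current match pointer */
    void *callout_data;     /* value supplied by the caller */
} demo_callout_block;

#define MAX_SAVED 16

typedef struct {
    int count;
    int start[MAX_SAVED];
    int end[MAX_SAVED];
} match_log;

/* Intended for a callout at the very end of the pattern: record the
   offsets of the string matched so far, then return 1 so that the
   matcher fails at this point and backtracks to try other
   alternatives. */
int save_match_callout(demo_callout_block *cb)
{
    assert(cb != NULL && cb->callout_data != NULL);
    match_log *saved = cb->callout_data;
    if (saved->count < MAX_SAVED) {
        saved->start[saved->count] = cb->start_match;
        saved->end[saved->count] = cb->current_position;
        saved->count++;
    }
    return 1;
}
```

Only offsets are stored here; the caller can turn them into strings after pcre_exec() has returned PCRE_ERROR_NOMATCH.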
| 
 | |
| 
 | |
| MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
 | |
| 
 | |
|        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
 | |
|             const char *subject, int length, int startoffset,
 | |
|             int options, int *ovector, int ovecsize,
 | |
|             int *workspace, int wscount);
 | |
| 
 | |
|        The  function  pcre_dfa_exec()  is  called  to  match  a subject string
 | |
|        against a compiled pattern, using a matching algorithm that  scans  the
 | |
|        subject  string  just  once, and does not backtrack. This has different
 | |
|        characteristics to the normal algorithm, and  is  not  compatible  with
 | |
|        Perl.  Some  of the features of PCRE patterns are not supported. Never-
 | |
|        theless, there are times when this kind of matching can be useful.  For
 | |
|        a discussion of the two matching algorithms, see the pcrematching docu-
 | |
|        mentation.
 | |
| 
 | |
|        The arguments for the pcre_dfa_exec() function  are  the  same  as  for
 | |
|        pcre_exec(), plus two extras. The ovector argument is used in a differ-
 | |
|        ent way, and this is described below. The other  common  arguments  are
 | |
|        used  in  the  same way as for pcre_exec(), so their description is not
 | |
|        repeated here.
 | |
| 
 | |
|        The two additional arguments provide workspace for  the  function.  The
 | |
|        workspace  vector  should  contain at least 20 elements. It is used for
 | |
|        keeping  track  of  multiple  paths  through  the  pattern  tree.  More
 | |
|        workspace  will  be  needed for patterns and subjects where there are a
 | |
|        lot of potential matches.
 | |
| 
 | |
|        Here is an example of a simple call to pcre_dfa_exec():
 | |
| 
 | |
|          int rc;
 | |
|          int ovector[10];
 | |
|          int wspace[20];
 | |
|          rc = pcre_dfa_exec(
 | |
|            re,             /* result of pcre_compile() */
 | |
|            NULL,           /* we didn't study the pattern */
 | |
|            "some string",  /* the subject string */
 | |
|            11,             /* the length of the subject string */
 | |
|            0,              /* start at offset 0 in the subject */
 | |
|            0,              /* default options */
 | |
|            ovector,        /* vector of integers for substring information */
 | |
|            10,             /* number of elements (NOT size in bytes) */
 | |
|            wspace,         /* working space vector */
 | |
|            20);            /* number of elements (NOT size in bytes) */
 | |
| 
 | |
|    Option bits for pcre_dfa_exec()
 | |
| 
 | |
|        The unused bits of the options argument  for  pcre_dfa_exec()  must  be
 | |
|        zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
 | |
|        LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,
 | |
|        PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
 | |
|        three of these are the same as for pcre_exec(), so their description is
 | |
|        not repeated here.
 | |
| 
 | |
|          PCRE_PARTIAL
 | |
| 
 | |
|        This  has  the  same general effect as it does for pcre_exec(), but the
 | |
|        details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for
 | |
|        pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into
 | |
|        PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have
 | |
|        been no complete matches, but there is still at least one matching pos-
 | |
|        sibility. The portion of the string that provided the partial match  is
 | |
|        set as the first matching string.
 | |
| 
 | |
|          PCRE_DFA_SHORTEST
 | |
| 
 | |
|        Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
 | |
|        stop as soon as it has found one match. Because of the way the alterna-
 | |
|        tive  algorithm  works, this is necessarily the shortest possible match
 | |
|        at the first possible matching point in the subject string.
 | |
| 
 | |
|          PCRE_DFA_RESTART
 | |
| 
 | |
|        When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and
 | |
|        returns  a  partial  match, it is possible to call it again, with addi-
 | |
|        tional subject characters, and have it continue with  the  same  match.
 | |
|        The  PCRE_DFA_RESTART  option requests this action; when it is set, the
 | |
|        workspace and wscount arguments must reference the same vector as before
 | |
|        because  data  about  the  match so far is left in them after a partial
 | |
|        match. There is more discussion of this  facility  in  the  pcrepartial
 | |
|        documentation.
 | |
| 
 | |
|    Successful returns from pcre_dfa_exec()
 | |
| 
 | |
|        When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
 | |
|        string in the subject. Note, however, that all the matches from one run
 | |
|        of  the  function  start  at the same point in the subject. The shorter
 | |
|        matches are all initial substrings of the longer matches. For  example,
 | |
|        if the pattern
 | |
| 
 | |
|          <.*>
 | |
| 
 | |
|        is matched against the string
 | |
| 
 | |
|          This is <something> <something else> <something further> no more
 | |
| 
 | |
|        the three matched strings are
 | |
| 
 | |
|          <something>
 | |
|          <something> <something else>
 | |
|          <something> <something else> <something further>
 | |
| 
 | |
|        On  success,  the  yield of the function is a number greater than zero,
 | |
|        which is the number of matched substrings.  The  substrings  themselves
 | |
|        are  returned  in  ovector. Each string uses two elements; the first is
 | |
|        the offset to the start, and the second is the offset to  the  end.  In
 | |
|        fact,  all  the  strings  have the same start offset. (Space could have
 | |
|        been saved by giving this only once, but it was decided to retain  some
 | |
|        compatibility  with  the  way pcre_exec() returns data, even though the
 | |
|        meaning of the strings is different.)
 | |
| 
 | |
|        The strings are returned in reverse order of length; that is, the long-
 | |
|        est  matching  string is given first. If there were too many matches to
 | |
|        fit into ovector, the yield of the function is zero, and the vector  is
 | |
|        filled with the longest matches.
 | |
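Walking the results of a successful call could look like this sketch. The helper names are hypothetical, and it is assumed that when the function yields zero the whole of ovector has been filled with pairs, so that ovecsize / 2 matches are available.

```c
#include <assert.h>
#include <stdio.h>

/* Number of (start, end) pairs left in ovector by pcre_dfa_exec(),
   given a non-negative return value rc. */
int dfa_pair_count(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    return rc > 0 ? rc : ovecsize / 2;
}

/* Print every match, longest first, and return the length of the
   last (shortest) one. */
int dfa_print_matches(const char *subject, const int *ovector, int pairs)
{
    assert(subject != NULL && ovector != NULL && pairs > 0);
    int shortest = 0;
    for (int i = 0; i < pairs; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        printf("%d: %.*s\n", i, end - start, subject + start);
        shortest = end - start;
    }
    return shortest;
}
```

For the <.*> example above, the three pairs would be (8, 56), (8, 36) and (8, 19), in that order.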
| 
 | |
|    Error returns from pcre_dfa_exec()
 | |
| 
 | |
|        The  pcre_dfa_exec()  function returns a negative number when it fails.
 | |
|        Many of the errors are the same  as  for  pcre_exec(),  and  these  are
 | |
|        described  above.   There are in addition the following errors that are
 | |
|        specific to pcre_dfa_exec():
 | |
| 
 | |
|          PCRE_ERROR_DFA_UITEM      (-16)
 | |
| 
 | |
|        This return is given if pcre_dfa_exec() encounters an item in the  pat-
 | |
|        tern  that  it  does not support, for instance, the use of \C or a back
 | |
|        reference.
 | |
| 
 | |
|          PCRE_ERROR_DFA_UCOND      (-17)
 | |
| 
 | |
|        This return is given if pcre_dfa_exec()  encounters  a  condition  item
 | |
|        that  uses  a back reference for the condition, or a test for recursion
 | |
|        in a specific group. These are not supported.
 | |
| 
 | |
|          PCRE_ERROR_DFA_UMLIMIT    (-18)
 | |
| 
 | |
|        This return is given if pcre_dfa_exec() is called with an  extra  block
 | |
|        that contains a setting of the match_limit field. This is not supported
 | |
|        (it is meaningless).
 | |
| 
 | |
|          PCRE_ERROR_DFA_WSSIZE     (-19)
 | |
| 
 | |
|        This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
 | |
|        workspace vector.
 | |
| 
 | |
|          PCRE_ERROR_DFA_RECURSE    (-20)
 | |
| 
 | |
|        When  a  recursive subpattern is processed, the matching function calls
 | |
|        itself recursively, using private vectors for  ovector  and  workspace.
 | |
|        This  error  is  given  if  the output vector is not large enough. This
 | |
|        should be extremely rare, as a vector of size 1000 is used.
 | |
| 
 | |
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepartial(3),
 | |
|        pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 11 April 2009
 | |
|        Copyright (c) 1997-2009 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRECALLOUT(3)                                                  PCRECALLOUT(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE CALLOUTS
 | |
| 
 | |
|        int (*pcre_callout)(pcre_callout_block *);
 | |
| 
 | |
|        PCRE provides a feature called "callout", which is a means of temporar-
 | |
|        ily passing control to the caller of PCRE  in  the  middle  of  pattern
 | |
|        matching.  The  caller of PCRE provides an external function by putting
 | |
|        its entry point in the global variable pcre_callout. By  default,  this
 | |
|        variable contains NULL, which disables all calling out.
 | |
| 
 | |
|        Within  a  regular  expression,  (?C) indicates the points at which the
 | |
|        external function is to be called.  Different  callout  points  can  be
 | |
|        identified  by  putting  a number less than 256 after the letter C. The
 | |
|        default value is zero.  For  example,  this  pattern  has  two  callout
 | |
|        points:
 | |
| 
 | |
|          (?C1)abc(?C2)def
 | |
| 
 | |
|        If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
 | |
|        called, PCRE automatically  inserts  callouts,  all  with  number  255,
 | |
|        before  each  item in the pattern. For example, if PCRE_AUTO_CALLOUT is
 | |
|        used with the pattern
 | |
| 
 | |
|          A(\d{2}|--)
 | |
| 
 | |
|        it is processed as if it were
 | |
| 
 | |
|        (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
 | |
| 
 | |
|        Notice that there is a callout before and after  each  parenthesis  and
 | |
|        alternation  bar.  Automatic  callouts  can  be  used  for tracking the
 | |
|        progress of pattern matching. The pcretest command has an  option  that
 | |
|        sets  automatic callouts; when it is used, the output indicates how the
 | |
|        pattern is matched. This is useful information when you are  trying  to
 | |
|        optimize the performance of a particular pattern.
 | |
| 
 | |
| 
 | |
| MISSING CALLOUTS
 | |
| 
 | |
|        You  should  be  aware  that,  because of optimizations in the way PCRE
 | |
|        matches patterns by default, callouts  sometimes  do  not  happen.  For
 | |
|        example, if the pattern is
 | |
| 
 | |
|          ab(?C4)cd
 | |
| 
 | |
|        PCRE knows that any matching string must contain the letter "d". If the
 | |
|        subject string is "abyz", the lack of "d" means that  matching  doesn't
 | |
|        ever  start,  and  the  callout is never reached. However, with "abyd",
 | |
|        though the result is still no match, the callout is obeyed.
 | |
| 
 | |
|        You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
 | |
|        MIZE  option  to  pcre_exec()  or  pcre_dfa_exec(). This slows down the
 | |
|        matching process, but does ensure that callouts  such  as  the  example
 | |
|        above are obeyed.
 | |
| 
 | |
| 
 | |
| THE CALLOUT INTERFACE
 | |
| 
 | |
|        During  matching, when PCRE reaches a callout point, the external func-
 | |
|        tion defined by pcre_callout is called (if it is set). This applies  to
 | |
|        both  the  pcre_exec()  and the pcre_dfa_exec() matching functions. The
 | |
|        only argument to the callout function is a pointer  to  a  pcre_callout
 | |
|        block. This structure contains the following fields:
 | |
| 
 | |
|          int          version;
 | |
|          int          callout_number;
 | |
|          int         *offset_vector;
 | |
|          const char  *subject;
 | |
|          int          subject_length;
 | |
|          int          start_match;
 | |
|          int          current_position;
 | |
|          int          capture_top;
 | |
|          int          capture_last;
 | |
|          void        *callout_data;
 | |
|          int          pattern_position;
 | |
|          int          next_item_length;
 | |
| 
 | |
|        The  version  field  is an integer containing the version number of the
 | |
|        block format. The initial version was 0; the current version is 1.  The
 | |
|        version  number  will  change  again in future if additional fields are
 | |
|        added, but the intention is never to remove any of the existing fields.
 | |
| 
 | |
|        The callout_number field contains the number of the  callout,  as  com-
 | |
|        piled  into  the pattern (that is, the number after ?C for manual call-
 | |
|        outs, and 255 for automatically generated callouts).
 | |
| 
 | |
|        The offset_vector field is a pointer to the vector of offsets that  was
 | |
|        passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
 | |
|        pcre_exec() is used, the contents can be inspected in order to  extract
 | |
|        substrings  that  have  been  matched  so  far,  in the same way as for
 | |
|        extracting substrings after a match has completed. For  pcre_dfa_exec()
 | |
|        this field is not useful.
 | |
| 
 | |
|        The subject and subject_length fields contain copies of the values that
 | |
|        were passed to pcre_exec() or pcre_dfa_exec().
 | |
| 
 | |
|        The start_match field normally contains the offset within  the  subject
 | |
|        at  which  the  current  match  attempt started. However, if the escape
 | |
|        sequence \K has been encountered, this value is changed to reflect  the
 | |
|        modified  starting  point.  If the pattern is not anchored, the callout
 | |
|        function may be called several times from the same point in the pattern
 | |
|        for different starting points in the subject.
 | |
| 
 | |
|        The  current_position  field  contains the offset within the subject of
 | |
|        the current match pointer.
 | |
| 
 | |
|        When the pcre_exec() function is used, the capture_top  field  contains
 | |
|        one  more than the number of the highest numbered captured substring so
 | |
|        far. If no substrings have been captured, the value of  capture_top  is
 | |
|        one.  This  is always the case when pcre_dfa_exec() is used, because it
 | |
|        does not support captured substrings.
 | |
| 
 | |
|        The capture_last field contains the number of the  most  recently  cap-
 | |
|        tured  substring. If no substrings have been captured, its value is -1.
 | |
|        This is always the case when pcre_dfa_exec() is used.
 | |
| 
 | |
|        The callout_data field contains a value that is passed  to  pcre_exec()
 | |
|        or  pcre_dfa_exec()  specifically  so  that  it  can  be passed back in
 | |
|        callouts. It is passed in the callout_data field of the pcre_extra data
 | |
|        structure.  If  no such data was passed, the value of callout_data in a
 | |
|        pcre_callout block is NULL. There is a description  of  the  pcre_extra
 | |
|        structure in the pcreapi documentation.
 | |
| 
 | |
|        The  pattern_position field is present from version 1 of the pcre_call-
 | |
|        out structure. It contains the offset to the next item to be matched in
 | |
|        the pattern string.
 | |
| 
 | |
|        The  next_item_length field is present from version 1 of the pcre_call-
 | |
|        out structure. It contains the length of the next item to be matched in
 | |
|        the  pattern  string. When the callout immediately precedes an alterna-
 | |
|        tion bar, a closing parenthesis, or the end of the pattern, the  length
 | |
|        is  zero.  When the callout precedes an opening parenthesis, the length
 | |
|        is that of the entire subpattern.
 | |
| 
 | |
|        The pattern_position and next_item_length fields are intended  to  help
 | |
|        in  distinguishing between different automatic callouts, which all have
 | |
|        the same callout number. However, they are set for all callouts.
 | |
| 
 | |
| 
 | |
| RETURN VALUES
 | |
| 
 | |
|        The external callout function returns an integer to PCRE. If the  value
 | |
|        is  zero,  matching  proceeds  as  normal. If the value is greater than
 | |
|        zero, matching fails at the current point, but  the  testing  of  other
 | |
|        matching possibilities goes ahead, just as if a lookahead assertion had
 | |
|        failed. If the value is less than zero, the  match  is  abandoned,  and
 | |
|        pcre_exec() (or pcre_dfa_exec()) returns the negative value.
 | |
| 
 | |
|        Negative   values   should   normally   be   chosen  from  the  set  of
 | |
|        PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
 | |
|        dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
 | |
|        reserved for use by callout functions; it will never be  used  by  PCRE
 | |
|        itself.
 | |
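The three kinds of return value can be seen together in a sketch such as the one below. demo_callout_block is again a cut-down stand-in for pcre_callout_block so that the example compiles without pcre.h, and DEMO_ERROR_CALLOUT repeats the value (-9) that pcre.h gives to PCRE_ERROR_CALLOUT. The callout_policy structure, passed through callout_data, is purely an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERROR_CALLOUT (-9)   /* PCRE_ERROR_CALLOUT in pcre.h */

/* Stand-in for pcre_callout_block (fields used by this sketch only). */
typedef struct {
    int callout_number;
    void *callout_data;
} demo_callout_block;

typedef struct {
    int calls_left;    /* abandon the match once this many calls are used */
    int veto_number;   /* callout number at which matching should fail */
} callout_policy;

int policy_callout(demo_callout_block *cb)
{
    assert(cb != NULL);
    callout_policy *policy = cb->callout_data;
    if (policy == NULL)
        return 0;                    /* no data passed: proceed as normal */
    if (policy->calls_left <= 0)
        return DEMO_ERROR_CALLOUT;   /* negative: the match is abandoned */
    policy->calls_left--;
    if (cb->callout_number == policy->veto_number)
        return 1;                    /* positive: fail here, try elsewhere */
    return 0;                        /* zero: matching proceeds */
}
```

With the real library, the negative value would become the result of pcre_exec() or pcre_dfa_exec(), which lets the caller tell a callout-initiated abort apart from an ordinary failure to match.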
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 15 March 2009
 | |
|        Copyright (c) 1997-2009 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRECOMPAT(3)                                                    PCRECOMPAT(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| DIFFERENCES BETWEEN PCRE AND PERL
 | |
| 
 | |
|        This  document describes the differences in the ways that PCRE and Perl
 | |
|        handle regular expressions. The differences described here  are  mainly
 | |
|        with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
 | |
|        some features that are expected to be in the forthcoming Perl 5.10.
 | |
| 
 | |
|        1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
 | |
|        of  what  it does have are given in the section on UTF-8 support in the
 | |
|        main pcre page.
 | |
| 
 | |
|        2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
 | |
|        permits  them,  but they do not mean what you might think. For example,
 | |
|        (?!a){3} does not assert that the next three characters are not "a". It
 | |
|        just asserts that the next character is not "a" three times.
 | |
| 
 | |
|        3.  Capturing  subpatterns  that occur inside negative lookahead asser-
 | |
|        tions are counted, but their entries in the offsets  vector  are  never
 | |
|        set.  Perl sets its numerical variables from any such patterns that are
 | |
|        matched before the assertion fails to match something (thereby succeed-
 | |
|        ing),  but  only  if the negative lookahead assertion contains just one
 | |
|        branch.
 | |
| 
 | |
|        4. Though binary zero characters are supported in the  subject  string,
 | |
|        they are not allowed in a pattern string because it is passed as a nor-
 | |
|        mal C string, terminated by zero. The escape sequence \0 can be used in
 | |
|        the pattern to represent a binary zero.
 | |
| 
 | |
|        5.  The  following Perl escape sequences are not supported: \l, \u, \L,
 | |
|        \U, and \N. In fact these are implemented by Perl's general string-han-
 | |
|        dling  and are not part of its pattern matching engine. If any of these
 | |
|        are encountered by PCRE, an error is generated.
 | |
| 
 | |
|        6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
 | |
|        is  built  with Unicode character property support. The properties that
 | |
|        can be tested with \p and \P are limited to the general category  prop-
 | |
|        erties  such  as  Lu and Nd, script names such as Greek or Han, and the
 | |
|        derived properties Any and L&.
 | |
| 
 | |
|        7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
 | |
|        ters  in  between  are  treated as literals. This is slightly different
 | |
|        from Perl in that $ and @ are  also  handled  as  literals  inside  the
 | |
|        quotes.  In Perl, they cause variable interpolation (but of course PCRE
 | |
|        does not have variables). Note the following examples:
 | |
| 
 | |
|            Pattern            PCRE matches      Perl matches
 | |
| 
 | |
|            \Qabc$xyz\E        abc$xyz           abc followed by the
 | |
|                                                   contents of $xyz
 | |
|            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
 | |
|            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
 | |
| 
 | |
|        The \Q...\E sequence is recognized both inside  and  outside  character
 | |
|        classes.
 | |
| 
 | |
|        8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
 | |
|        constructions. However, there is support for recursive  patterns.  This
 | |
|        is  not available in Perl 5.8, but is in Perl 5.10.  Also,  the  PCRE
 | |
|        "callout" feature allows an external function to be called during  pat-
 | |
|        tern matching. See the pcrecallout documentation for details.
 | |
| 
 | |
|        9.  Subpatterns  that  are  called  recursively or as "subroutines" are
 | |
|        always treated as atomic groups in  PCRE.  This  is  like  Python,  but
 | |
|        unlike Perl.
 | |
| 
 | |
|        10.  There are some differences that are concerned with the settings of
 | |
|        captured strings when part of  a  pattern  is  repeated.  For  example,
 | |
|        matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
 | |
|        unset, but in PCRE it is set to "b".
 | |
| 
 | |
|        11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
 | |
|        (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
 | |
|        the forms without an  argument.  PCRE  does  not  support  (*MARK).  If
 | |
|        (*ACCEPT)  is within capturing parentheses, PCRE does not set that cap-
 | |
|        ture group; this is different to Perl.
 | |
| 
 | |
|        12. PCRE provides some extensions to the Perl regular expression
 | |
|        facilities. Perl 5.10 includes new features that are not in earlier
 | |
|        versions, some of which (such as named parentheses) have been  in  PCRE
 | |
|        for some time. This list is with respect to Perl 5.10:
 | |
| 
 | |
|        (a)  Although  lookbehind  assertions  must match fixed length strings,
 | |
|        each alternative branch of a lookbehind assertion can match a different
 | |
|        length of string. Perl requires them all to have the same length.
 | |
| 
 | |
|        (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
 | |
|        meta-character matches only at the very end of the string.
 | |
| 
 | |
|        (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
 | |
|        cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
 | |
|        ignored.  (Perl can be made to issue a warning.)
 | |
| 
 | |
|        (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
 | |
|        fiers is inverted, that is, by default they are not greedy, but if fol-
 | |
|        lowed by a question mark they are.
 | |
| 
 | |
|        (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
 | |
|        tried only at the first matching position in the subject string.
 | |
| 
 | |
|        (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
 | |
|        TURE options for pcre_exec() have no Perl equivalents.
 | |
| 
 | |
|        (g) The \R escape sequence can be restricted to match only CR,  LF,  or
 | |
|        CRLF by the PCRE_BSR_ANYCRLF option.
 | |
| 
 | |
|        (h) The callout facility is PCRE-specific.
 | |
| 
 | |
|        (i) The partial matching facility is PCRE-specific.
 | |
| 
 | |
|        (j) Patterns compiled by PCRE can be saved and re-used at a later time,
 | |
|        even on different hosts that have the other endianness.
 | |
| 
 | |
|        (k) The alternative matching function (pcre_dfa_exec())  matches  in  a
 | |
|        different way and is not Perl-compatible.
 | |
| 
 | |
|        (l)  PCRE  recognizes some special sequences such as (*CR) at the start
 | |
|        of a pattern that set overall options that cannot be changed within the
 | |
|        pattern.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 11 September 2007
 | |
|        Copyright (c) 1997-2007 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREPATTERN(3)                                                  PCREPATTERN(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE REGULAR EXPRESSION DETAILS
 | |
| 
 | |
|        The  syntax and semantics of the regular expressions that are supported
 | |
|        by PCRE are described in detail below. There is a quick-reference  syn-
 | |
|        tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
 | |
|        semantics as closely as it can. PCRE  also  supports  some  alternative
 | |
|        regular  expression  syntax (which does not conflict with the Perl syn-
 | |
|        tax) in order to provide some compatibility with regular expressions in
 | |
|        Python, .NET, and Oniguruma.
 | |
| 
 | |
|        Perl's  regular expressions are described in its own documentation, and
 | |
|        regular expressions in general are covered in a number of  books,  some
 | |
|        of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
 | |
|        Expressions", published by  O'Reilly,  covers  regular  expressions  in
 | |
|        great  detail.  This  description  of  PCRE's  regular  expressions  is
 | |
|        intended as reference material.
 | |
| 
 | |
|        The original operation of PCRE was on strings of  one-byte  characters.
 | |
|        However,  there is now also support for UTF-8 character strings. To use
 | |
|        this, you must build PCRE to  include  UTF-8  support,  and  then  call
 | |
|        pcre_compile()  with  the  PCRE_UTF8  option.  There  is also a special
 | |
|        sequence that can be given at the start of a pattern:
 | |
| 
 | |
|          (*UTF8)
 | |
| 
 | |
|        Starting a pattern with this sequence  is  equivalent  to  setting  the
 | |
|        PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
 | |
|        UTF-8 mode affects pattern matching  is  mentioned  in  several  places
 | |
|        below.  There  is  also  a  summary of UTF-8 features in the section on
 | |
|        UTF-8 support in the main pcre page.
 | |
| 
 | |
|        The remainder of this document discusses the  patterns  that  are  sup-
 | |
|        ported  by  PCRE when its main matching function, pcre_exec(), is used.
 | |
|        From  release  6.0,   PCRE   offers   a   second   matching   function,
 | |
|        pcre_dfa_exec(),  which matches using a different algorithm that is not
 | |
|        Perl-compatible. Some of the features discussed below are not available
 | |
|        when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
 | |
|        alternative function, and how it differs from the normal function,  are
 | |
|        discussed in the pcrematching page.
 | |
| 
 | |
| 
 | |
| NEWLINE CONVENTIONS
 | |
| 
 | |
|        PCRE  supports five different conventions for indicating line breaks in
 | |
|        strings: a single CR (carriage return) character, a  single  LF  (line-
 | |
|        feed) character, the two-character sequence CRLF, any of the three pre-
 | |
|        ceding, or any Unicode newline sequence. The pcreapi page  has  further
 | |
|        discussion  about newlines, and shows how to set the newline convention
 | |
|        in the options arguments for the compiling and matching functions.
 | |
| 
 | |
|        It is also possible to specify a newline convention by starting a  pat-
 | |
|        tern string with one of the following five sequences:
 | |
| 
 | |
|          (*CR)        carriage return
 | |
|          (*LF)        linefeed
 | |
|          (*CRLF)      carriage return, followed by linefeed
 | |
|          (*ANYCRLF)   any of the three above
 | |
|          (*ANY)       all Unicode newline sequences
 | |
| 
 | |
|        These override the default and the options given to pcre_compile(). For
 | |
|        example, on a Unix system where LF is the default newline sequence, the
 | |
|        pattern
 | |
| 
 | |
|          (*CR)a.b
 | |
| 
 | |
|        changes the convention to CR. That pattern matches "a\nb" because LF is
 | |
|        no longer a newline. Note that these special settings,  which  are  not
 | |
|        Perl-compatible,  are  recognized  only at the very start of a pattern,
 | |
|        and that they must be in upper case.  If  more  than  one  of  them  is
 | |
|        present, the last one is used.
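
       The "very start of a pattern" and "last one is used" rules can be
       sketched as a small parser. This is an illustration in Python, not PCRE
       code: the name split_newline_prefix and the default of "LF" are
       assumptions, and the sketch deliberately ignores other leading
       sequences such as (*UTF8).

```python
import re

# Only the five upper-case newline sequences are recognised.
_NEWLINE_PREFIX = re.compile(r"\(\*(CR|LF|CRLF|ANYCRLF|ANY)\)")

def split_newline_prefix(pattern: str, default: str = "LF"):
    """Return (convention, rest_of_pattern) for a pattern string.

    Leading newline sequences are consumed one after another; when several
    are present, the last one determines the convention.
    """
    convention = default
    pos = 0
    while True:
        m = _NEWLINE_PREFIX.match(pattern, pos)
        if m is None:
            break
        convention = m.group(1)
        pos = m.end()
    return convention, pattern[pos:]

if __name__ == "__main__":
    for p in ("(*CR)a.b", "(*CR)(*ANYCRLF)a.b", "a(*CR)b", "(*cr)a.b"):
        print(repr(p), "->", split_newline_prefix(p))
```

       A sequence that is in lower case, or that appears after the start of
       the pattern, is left in place and the default convention is kept.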
 | |
| 
 | |
|        The  newline  convention  does  not  affect what the \R escape sequence
 | |
|        matches. By default, this is any Unicode  newline  sequence,  for  Perl
 | |
|        compatibility.  However, this can be changed; see the description of \R
 | |
|        in the section entitled "Newline sequences" below. A change of \R  set-
 | |
|        ting can be combined with a change of newline convention.
 | |
| 
 | |
| 
 | |
| CHARACTERS AND METACHARACTERS
 | |
| 
 | |
|        A  regular  expression  is  a pattern that is matched against a subject
 | |
|        string from left to right. Most characters stand for  themselves  in  a
 | |
|        pattern,  and  match  the corresponding characters in the subject. As a
 | |
|        trivial example, the pattern
 | |
| 
 | |
|          The quick brown fox
 | |
| 
 | |
|        matches a portion of a subject string that is identical to itself. When
 | |
|        caseless  matching is specified (the PCRE_CASELESS option), letters are
 | |
|        matched independently of case. In UTF-8 mode, PCRE  always  understands
 | |
|        the  concept  of case for characters whose values are less than 128, so
 | |
|        caseless matching is always possible. For characters with  higher  val-
 | |
|        ues,  the concept of case is supported if PCRE is compiled with Unicode
 | |
|        property support, but not otherwise.   If  you  want  to  use  caseless
 | |
|        matching  for  characters  128  and above, you must ensure that PCRE is
 | |
|        compiled with Unicode property support as well as with UTF-8 support.
 | |
| 
 | |
|        The power of regular expressions comes  from  the  ability  to  include
 | |
|        alternatives  and  repetitions in the pattern. These are encoded in the
 | |
|        pattern by the use of metacharacters, which do not stand for themselves
 | |
|        but instead are interpreted in some special way.
 | |
| 
 | |
|        There  are  two different sets of metacharacters: those that are recog-
 | |
|        nized anywhere in the pattern except within square brackets, and  those
 | |
|        that  are  recognized  within square brackets. Outside square brackets,
 | |
|        the metacharacters are as follows:
 | |
| 
 | |
|          \      general escape character with several uses
 | |
|          ^      assert start of string (or line, in multiline mode)
 | |
|          $      assert end of string (or line, in multiline mode)
 | |
|          .      match any character except newline (by default)
 | |
|          [      start character class definition
 | |
|          |      start of alternative branch
 | |
|          (      start subpattern
 | |
|          )      end subpattern
 | |
|          ?      extends the meaning of (
 | |
|                 also 0 or 1 quantifier
 | |
|                 also quantifier minimizer
 | |
|          *      0 or more quantifier
 | |
|          +      1 or more quantifier
 | |
|                 also "possessive quantifier"
 | |
|          {      start min/max quantifier
 | |
| 
 | |
|        Part of a pattern that is in square brackets  is  called  a  "character
 | |
|        class". In a character class the only metacharacters are:
 | |
| 
 | |
|          \      general escape character
 | |
|          ^      negate the class, but only if the first character
 | |
|          -      indicates character range
 | |
|          [      POSIX character class (only if followed by POSIX
 | |
|                   syntax)
 | |
|          ]      terminates the character class
 | |
| 
 | |
|        The following sections describe the use of each of the metacharacters.
 | |
| 
 | |
| 
 | |
| BACKSLASH
 | |
| 
 | |
|        The backslash character has several uses. Firstly, if it is followed by
 | |
|        a non-alphanumeric character, it takes away any  special  meaning  that
 | |
|        character  may  have.  This  use  of  backslash  as an escape character
 | |
|        applies both inside and outside character classes.
 | |
| 
 | |
|        For example, if you want to match a * character, you write  \*  in  the
 | |
|        pattern.   This  escaping  action  applies whether or not the following
 | |
|        character would otherwise be interpreted as a metacharacter, so  it  is
 | |
|        always  safe  to  precede  a non-alphanumeric with backslash to specify
 | |
|        that it stands for itself. In particular, if you want to match a  back-
 | |
|        slash, you write \\.
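
       Because a backslash before any non-alphanumeric character is always
       safe, literal text can be quoted mechanically. The following Python
       sketch shows the idea; quote_literal is a hypothetical helper, not part
       of PCRE, and Python's re module is used only to check the result. It
       assumes ASCII text that contains no binary zero.

```python
import re

def quote_literal(text: str) -> str:
    """Put a backslash before every ASCII non-alphanumeric character."""
    out = []
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            out.append("\\" + ch)   # escaping is harmless even if ch is not special
        else:
            out.append(ch)          # letters and digits must NOT be escaped
    return "".join(out)

if __name__ == "__main__":
    sample = r"1+1=2 (really?) \ *"
    pattern = quote_literal(sample)
    print(pattern)
    # The quoted pattern matches the original text and nothing else.
    print(re.fullmatch(pattern, sample) is not None)
```

       Letters and digits are left alone because a backslash followed by an
       alphanumeric character usually has a special meaning of its own.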
 | |
| 
 | |
|        If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
 | |
|        the pattern (other than in a character class) and characters between  a
 | |
|        # outside a character class and the next newline are ignored. An escap-
 | |
|        ing backslash can be used to include a whitespace  or  #  character  as
 | |
|        part of the pattern.
 | |
| 
 | |
|        If  you  want  to remove the special meaning from a sequence of charac-
 | |
|        ters, you can do so by putting them between \Q and \E. This is  differ-
 | |
|        ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
 | |
|        sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
 | |
|        tion. Note the following examples:
 | |
| 
 | |
|          Pattern            PCRE matches   Perl matches
 | |
| 
 | |
|          \Qabc$xyz\E        abc$xyz        abc followed by the
 | |
|                                              contents of $xyz
 | |
|          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
 | |
|          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 | |
| 
 | |
|        The  \Q...\E  sequence  is recognized both inside and outside character
 | |
|        classes.
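
       The PCRE column of the table can be reproduced with a short Python
       sketch that rewrites each \Q...\E section as escaped literal text.
       The helper name expand_quoted is hypothetical. Treating an unterminated
       \Q as running to the end of the pattern is an assumption, and the
       sketch does not handle a \Q that is itself preceded by an escaping
       backslash.

```python
import re

def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E sections as backslash-escaped literal text."""
    out = []
    i = 0
    while i < len(pattern):
        start = pattern.find(r"\Q", i)
        if start == -1:
            out.append(pattern[i:])          # no more quoted sections
            break
        out.append(pattern[i:start])         # text before \Q is kept as-is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: without \E, quoting continues to the end.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        i = end + 2
    return "".join(out)

if __name__ == "__main__":
    rows = [
        (r"\Qabc$xyz\E", "abc$xyz"),
        (r"\Qabc\$xyz\E", r"abc\$xyz"),
        (r"\Qabc\E\$\Qxyz\E", "abc$xyz"),
    ]
    for pat, subject in rows:
        ok = re.fullmatch(expand_quoted(pat), subject) is not None
        print(pat, "matches", subject, ":", ok)
```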
 | |
| 
 | |
|    Non-printing characters
 | |
| 
 | |
|        A second use of backslash provides a way of encoding non-printing char-
 | |
|        acters  in patterns in a visible manner. There is no restriction on the
 | |
|        appearance of non-printing characters, apart from the binary zero  that
 | |
|        terminates  a  pattern,  but  when  a pattern is being prepared by text
 | |
|        editing, it is usually easier  to  use  one  of  the  following  escape
 | |
|        sequences than the binary character it represents:
 | |
| 
 | |
|          \a        alarm, that is, the BEL character (hex 07)
 | |
|          \cx       "control-x", where x is any character
 | |
|          \e        escape (hex 1B)
 | |
|          \f        formfeed (hex 0C)
 | |
|          \n        linefeed (hex 0A)
 | |
|          \r        carriage return (hex 0D)
 | |
|          \t        tab (hex 09)
 | |
|          \ddd      character with octal code ddd, or backreference
 | |
|          \xhh      character with hex code hh
 | |
|          \x{hhh..} character with hex code hhh..
 | |
| 
 | |
|        The  precise  effect of \cx is as follows: if x is a lower case letter,
 | |
|        it is converted to upper case. Then bit 6 of the character (hex 40)  is
 | |
|        inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
 | |
|        becomes hex 7B.
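
       The \cx rule is simple enough to check by hand. A minimal Python sketch
       of the two steps (fold lower case to upper case, then invert bit 6)
       follows; control_code is a hypothetical name used only for
       illustration.

```python
def control_code(x: str) -> int:
    """Return the character code that \\cx stands for."""
    if "a" <= x <= "z":
        x = x.upper()          # step 1: lower case letters become upper case
    return ord(x) ^ 0x40       # step 2: invert bit 6 (hex 40)

if __name__ == "__main__":
    for x in "z{;":
        print(f"\\c{x} -> hex {control_code(x):02X}")
```

       Note that the case folding applies only to letters, which is why \c{
       and \c; simply swap values with each other.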
 | |
| 
 | |
|        After \x, from zero to two hexadecimal digits are read (letters can  be
 | |
|        in  upper  or  lower case). Any number of hexadecimal digits may appear
 | |
|        between \x{ and }, but the value of the character  code  must  be  less
 | |
|        than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
 | |
|        the maximum value in hexadecimal is 7FFFFFFF. Note that this is  bigger
 | |
|        than the largest Unicode code point, which is 10FFFF.
 | |
| 
 | |
|        If  characters  other than hexadecimal digits appear between \x{ and },
 | |
|        or if there is no terminating }, this form of escape is not recognized.
 | |
|        Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
 | |
|        escape, with no following digits, giving a  character  whose  value  is
 | |
|        zero.
 | |
| 
 | |
|        Characters whose value is less than 256 can be defined by either of the
 | |
|        two syntaxes for \x. There is no difference in the way  they  are  han-
 | |
|        dled. For example, \xdc is exactly the same as \x{dc}.
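
       The rules for the two \x forms can be modelled as follows. This is a
       Python sketch of the behaviour described above, not PCRE's own parser;
       parse_hex_escape is a hypothetical name, its argument is the pattern
       text immediately after "\x", and treating empty braces as the value
       zero is an assumption.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_hex_escape(rest: str, utf8: bool = False):
    """Return (character value, number of characters consumed after \\x)."""
    if rest.startswith("{"):
        close = rest.find("}")
        digits = rest[1:close] if close != -1 else None
        if digits is not None and all(c in HEX_DIGITS for c in digits):
            value = int(digits, 16) if digits else 0
            limit = 0x7FFFFFFF if utf8 else 0xFF
            if value > limit:
                raise ValueError("character value in \\x{...} is too large")
            return value, close + 1
        # Non-hex content or no closing brace: the braced form is not
        # recognised, so fall through to the basic form below.
    consumed = 0
    while consumed < 2 and consumed < len(rest) and rest[consumed] in HEX_DIGITS:
        consumed += 1
    value = int(rest[:consumed], 16) if consumed else 0
    return value, consumed

if __name__ == "__main__":
    for text in ("dc", "{dc}", "{zz}", "4g"):
        print(repr(text), "->", parse_hex_escape(text))
```

       When the braced form is rejected, nothing after \x is consumed, so the
       brace and what follows are treated as ordinary pattern text after a
       character whose value is zero.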
 | |
| 
 | |
|        After  \0  up  to two further octal digits are read. If there are fewer
 | |
|        than two digits, just  those  that  are  present  are  used.  Thus  the
 | |
|        sequence \0\x\07 specifies two binary zeros followed by a BEL character
 | |
|        (code value 7). Make sure you supply two digits after the initial  zero
 | |
|        if the pattern character that follows is itself an octal digit.
 | |
| 
 | |
|        The handling of a backslash followed by a digit other than 0 is compli-
 | |
|        cated.  Outside a character class, PCRE reads it and any following dig-
 | |
|        its  as  a  decimal  number. If the number is less than 10, or if there
 | |
|        have been at least that many previous capturing left parentheses in the
 | |
|        expression,  the  entire  sequence  is  taken  as  a  back reference. A
 | |
|        description of how this works is given later, following the  discussion
 | |
|        of parenthesized subpatterns.
 | |
| 
 | |
|        Inside  a  character  class, or if the decimal number is greater than 9
 | |
|        and there have not been that many capturing subpatterns, PCRE  re-reads
 | |
|        up to three octal digits following the backslash, and uses them to gen-
 | |
|        erate a data character. Any subsequent digits stand for themselves.  In
 | |
|        non-UTF-8  mode,  the  value  of a character specified in octal must be
 | |
|        less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
 | |
|        example:
 | |
| 
 | |
|          \040   is another way of writing a space
 | |
|          \40    is the same, provided there are fewer than 40
 | |
|                    previous capturing subpatterns
 | |
|          \7     is always a back reference
 | |
|          \11    might be a back reference, or another way of
 | |
|                    writing a tab
 | |
|          \011   is always a tab
 | |
|          \0113  is a tab followed by the character "3"
 | |
|          \113   might be a back reference, otherwise the
 | |
|                    character with octal code 113
 | |
|          \377   might be a back reference, otherwise
 | |
|                    the byte consisting entirely of 1 bits
 | |
|          \81    is either a back reference, or a binary zero
 | |
|                    followed by the two characters "8" and "1"
 | |
| 
 | |
|        Note  that  octal  values of 100 or greater must not be introduced by a
 | |
|        leading zero, because no more than three octal digits are ever read.
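
       The decision between a back reference and an octal character can be
       written out as a short function. The Python sketch below follows the
       rules and the table above; interpret_digit_escape is a hypothetical
       name, digits is the run of decimal digits that follows the backslash,
       and captures_so_far is the number of capturing left parentheses seen
       earlier in the pattern. The limits on the octal value (\400 and \777)
       are not checked.

```python
def interpret_digit_escape(digits: str, captures_so_far: int,
                           in_class: bool = False):
    """Return ("backref", n, "") or ("char", code, literal_digits_left_over)."""
    if not in_class and digits[0] != "0":
        number = int(digits)                 # read ALL digits as decimal
        if number < 10 or number <= captures_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits after the backslash.
    octal = ""
    for ch in digits[:3]:
        if ch in "01234567":
            octal += ch
        else:
            break
    code = int(octal, 8) if octal else 0     # no octal digits: binary zero
    return ("char", code, digits[len(octal):])

if __name__ == "__main__":
    for digits, captures in (("040", 0), ("40", 39), ("7", 0), ("11", 5),
                             ("0113", 0), ("113", 0), ("377", 0), ("81", 0)):
        print("\\" + digits, captures, "->",
              interpret_digit_escape(digits, captures))
```

       The third element of a "char" result holds any digits that were not
       consumed and therefore stand for themselves, as in \0113 and \81.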
 | |
| 
 | |
|        All the sequences that define a single character value can be used both
 | |
|        inside  and  outside character classes. In addition, inside a character
 | |
|        class, the sequence \b is interpreted as the backspace  character  (hex
 | |
|        08),  and the sequences \R and \X are interpreted as the characters "R"
 | |
|        and "X", respectively. Outside a character class, these sequences  have
 | |
|        different meanings (see below).
 | |
| 
 | |
|    Absolute and relative back references
 | |
| 
 | |
|        The  sequence  \g followed by an unsigned or a negative number, option-
 | |
|        ally enclosed in braces, is an absolute or relative back  reference.  A
 | |
|        named back reference can be coded as \g{name}. Back references are dis-
 | |
|        cussed later, following the discussion of parenthesized subpatterns.
 | |
| 
 | |
|    Absolute and relative subroutine calls
 | |
| 
 | |
|        For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
 | |
|        name or a number enclosed either in angle brackets or single quotes, is
 | |
|        an alternative syntax for referencing a subpattern as  a  "subroutine".
 | |
|        Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and
 | |
|        \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
 | |
|        reference; the latter is a subroutine call.
 | |
| 
 | |
|    Generic character types
 | |
| 
 | |
|        Another use of backslash is for specifying generic character types. The
 | |
|        following are always recognized:
 | |
| 
 | |
|          \d     any decimal digit
 | |
|          \D     any character that is not a decimal digit
 | |
|          \h     any horizontal whitespace character
 | |
|          \H     any character that is not a horizontal whitespace character
 | |
|          \s     any whitespace character
 | |
|          \S     any character that is not a whitespace character
 | |
|          \v     any vertical whitespace character
 | |
|          \V     any character that is not a vertical whitespace character
 | |
|          \w     any "word" character
 | |
|          \W     any "non-word" character
 | |
| 
 | |
|        Each pair of escape sequences partitions the complete set of characters
 | |
|        into  two disjoint sets. Any given character matches one, and only one,
 | |
|        of each pair.
 | |
| 
 | |
|        These character type sequences can appear both inside and outside char-
 | |
|        acter  classes.  They each match one character of the appropriate type.
 | |
|        If the current matching point is at the end of the subject string,  all
 | |
|        of them fail, since there is no character to match.
 | |
| 
 | |
|        For  compatibility  with Perl, \s does not match the VT character (code
 | |
|        11).  This makes it different from the POSIX "space" class. The  \s
 | |
|        characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
 | |
|        "use locale;" is included in a Perl script, \s may match the VT charac-
 | |
|        ter. In PCRE, it never does.
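
       The difference matters when patterns are moved between regex engines.
       As an illustration, Python's re module treats VT as whitespace for \s,
       so a Python pattern that needs the PCRE set described here has to
       spell it out. The names below are hypothetical and the sketch only
       models the five characters listed above.

```python
import re

# Code points matched by \s as described above: HT, LF, FF, CR, space.
PCRE_S_CODES = (9, 10, 12, 13, 32)

# Explicit character class with the same members, e.g. [\x09\x0a...].
PCRE_S_CLASS = "[" + "".join(f"\\x{c:02x}" for c in PCRE_S_CODES) + "]"

if __name__ == "__main__":
    vt = "\x0b"
    print("explicit class matches VT:", bool(re.fullmatch(PCRE_S_CLASS, vt)))
    print("Python \\s matches VT:     ", bool(re.fullmatch(r"\s", vt)))
```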
 | |
| 
 | |
|        In  UTF-8 mode, characters with values greater than 128 never match \d,
 | |
|        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
 | |
|        code  character  property  support is available. These sequences retain
 | |
|        their original meanings from before UTF-8 support was available, mainly
 | |
|        for  efficiency  reasons. Note that this also affects \b, because it is
 | |
|        defined in terms of \w and \W.
 | |
| 
 | |
|        The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
 | |
|        the  other  sequences, these do match certain high-valued codepoints in
 | |
|        UTF-8 mode.  The horizontal space characters are:
 | |
| 
 | |
|          U+0009     Horizontal tab
 | |
|          U+0020     Space
 | |
|          U+00A0     Non-break space
 | |
|          U+1680     Ogham space mark
 | |
|          U+180E     Mongolian vowel separator
 | |
|          U+2000     En quad
 | |
|          U+2001     Em quad
 | |
|          U+2002     En space
 | |
|          U+2003     Em space
 | |
|          U+2004     Three-per-em space
 | |
|          U+2005     Four-per-em space
 | |
|          U+2006     Six-per-em space
 | |
|          U+2007     Figure space
 | |
|          U+2008     Punctuation space
 | |
|          U+2009     Thin space
 | |
|          U+200A     Hair space
 | |
|          U+202F     Narrow no-break space
 | |
|          U+205F     Medium mathematical space
 | |
|          U+3000     Ideographic space
 | |
| 
 | |
|        The vertical space characters are:
 | |
| 
 | |
|          U+000A     Linefeed
 | |
|          U+000B     Vertical tab
 | |
|          U+000C     Formfeed
 | |
|          U+000D     Carriage return
 | |
|          U+0085     Next line
 | |
|          U+2028     Line separator
 | |
|          U+2029     Paragraph separator
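
       For engines that lack \h and \v, the two lists above can be turned
       into explicit character classes. The Python sketch below builds them
       from the code points; char_class and the list names are hypothetical,
       and the classes reflect only the characters listed in this document.

```python
import re

HORIZONTAL = ([0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
              + list(range(0x2000, 0x200B))      # U+2000 .. U+200A
              + [0x202F, 0x205F, 0x3000])
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]

def char_class(code_points, negate: bool = False) -> str:
    """Build a [...] class (or [^...] when negate is true) from code points."""
    body = "".join(f"\\u{cp:04x}" for cp in code_points)
    return "[" + ("^" if negate else "") + body + "]"

H = char_class(HORIZONTAL)                 # stands in for \h
NOT_H = char_class(HORIZONTAL, True)       # stands in for \H
V = char_class(VERTICAL)                   # stands in for \v
NOT_V = char_class(VERTICAL, True)         # stands in for \V

if __name__ == "__main__":
    print(bool(re.fullmatch(H, "\u00a0")))   # non-break space
    print(bool(re.fullmatch(V, "\u2028")))   # line separator
```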
 | |
| 
 | |
|        A "word" character is an underscore or any character less than 256 that
 | |
|        is  a  letter  or  digit.  The definition of letters and digits is con-
 | |
|        trolled by PCRE's low-valued character tables, and may vary if  locale-
 | |
|        specific  matching is taking place (see "Locale support" in the pcreapi
 | |
|        page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
 | |
|        systems,  or "french" in Windows, some character codes greater than 128
 | |
|        are used for accented letters, and these are matched by \w. The use  of
 | |
|        locales with Unicode is discouraged.
 | |
| 
 | |
|    Newline sequences
 | |
| 
 | |
|        Outside  a  character class, by default, the escape sequence \R matches
 | |
|        any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
 | |
|        mode \R is equivalent to the following:
 | |
| 
 | |
|          (?>\r\n|\n|\x0b|\f|\r|\x85)
 | |
| 
 | |
|        This  is  an  example  of an "atomic group", details of which are given
 | |
|        below.  This particular group matches either the two-character sequence
 | |
|        CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
 | |
|        U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
 | |
|        return, U+000D), or NEL (next line, U+0085). The two-character sequence
 | |
|        is treated as a single unit that cannot be split.
 | |
| 
 | |
|        In UTF-8 mode, two additional characters whose codepoints  are  greater
 | |
|        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
 | |
|        rator, U+2029).  Unicode character property support is not  needed  for
 | |
|        these characters to be recognized.
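
       The effect of the atomic group can be seen by emulating \R in an
       engine that does not provide it. The Python sketch below avoids
       relying on atomic-group syntax and instead uses a negative lookahead
       so that CR is never matched on its own when LF follows, which gives
       the same result. The constant names are hypothetical.

```python
import re

# Non-UTF-8 set: CRLF as a unit, else one of LF, VT, FF, CR, NEL.
BSR_8BIT = r"(?:\r\n|(?!\r\n)[\n\x0b\f\r\x85])"

# UTF-8 mode adds LS (U+2028) and PS (U+2029).
BSR_UTF8 = r"(?:\r\n|(?!\r\n)[\n\x0b\f\r\x85\u2028\u2029])"

if __name__ == "__main__":
    text = "a\r\nb\nc\x85d\r"
    print(re.findall(BSR_8BIT, text))
    # CRLF cannot be split: "newline, then LF" does not match a lone CRLF.
    print(re.fullmatch(BSR_8BIT + r"\n", "\r\n"))
```

       Without the lookahead (or an atomic group), the second test would
       succeed by matching CR alone and leaving LF for the trailing \n.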
 | |
| 
 | |
|        It is possible to restrict \R to match only CR, LF, or CRLF (instead of
 | |
|        the complete set  of  Unicode  line  endings)  by  setting  the  option
 | |
|        PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
 | |
|        (BSR is an abbreviation for "backslash R".) This can be made the default
 | |
|        when  PCRE  is  built;  if this is the case, the other behaviour can be
 | |
|        requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
 | |
|        specify  these  settings  by  starting a pattern string with one of the
 | |
|        following sequences:
 | |
| 
 | |
|          (*BSR_ANYCRLF)   CR, LF, or CRLF only
 | |
|          (*BSR_UNICODE)   any Unicode newline sequence
 | |
| 
 | |
|        These override the default and the options given to pcre_compile(), but
 | |
|        they can be overridden by options given to pcre_exec(). Note that these
 | |
|        special settings, which are not Perl-compatible, are recognized only at
 | |
|        the  very  start  of a pattern, and that they must be in upper case. If
 | |
|        more than one of them is present, the last one is  used.  They  can  be
 | |
|        combined  with  a  change of newline convention, for example, a pattern
 | |
|        can start with:
 | |
| 
 | |
|          (*ANY)(*BSR_ANYCRLF)
 | |
| 
 | |
|        Inside a character class, \R matches the letter "R".
 | |
| 
 | |
|    Unicode character properties
 | |
| 
 | |
|        When PCRE is built with Unicode character property support, three addi-
 | |
|        tional  escape sequences that match characters with specific properties
 | |
|        are available.  When not in UTF-8 mode, these sequences are  of  course
 | |
|        limited  to  testing characters whose codepoints are less than 256, but
 | |
|        they do work in this mode.  The extra escape sequences are:
 | |
| 
 | |
|          \p{xx}   a character with the xx property
 | |
|          \P{xx}   a character without the xx property
 | |
|          \X       an extended Unicode sequence
 | |
| 
 | |
|        The property names represented by xx above are limited to  the  Unicode
 | |
|        script names, the general category properties, and "Any", which matches
 | |
|        any character (including newline). Other properties such as "InMusical-
 | |
|        Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does
 | |
|        not match any characters, so always causes a match failure.
 | |
| 
 | |
|        Sets of Unicode characters are defined as belonging to certain scripts.
 | |
|        A  character from one of these sets can be matched using a script name.
 | |
|        For example:
 | |
| 
 | |
|          \p{Greek}
 | |
|          \P{Han}
 | |
| 
 | |
|        Those that are not part of an identified script are lumped together  as
 | |
|        "Common". The current list of scripts is:
 | |
| 
 | |
|        Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
 | |
|        Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,
 | |
|        Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
 | |
|        Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-
 | |
|        gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,
 | |
|        Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
 | |
|        Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
 | |
|        Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
 | |
|        Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
 | |
| 
 | |
|        Each  character has exactly one general category property, specified by
 | |
|        a two-letter abbreviation. For compatibility with Perl, negation can be
 | |
|        specified  by  including a circumflex between the opening brace and the
 | |
|        property name. For example, \p{^Lu} is the same as \P{Lu}.
 | |
| 
 | |
|        If only one letter is specified with \p or \P, it includes all the gen-
 | |
|        eral  category properties that start with that letter. In this case, in
 | |
|        the absence of negation, the curly brackets in the escape sequence  are
 | |
|        optional; these two examples have the same effect:
 | |
| 
 | |
|          \p{L}
 | |
|          \pL
 | |
| 
 | |
|        The following general category property codes are supported:
 | |
| 
 | |
|          C     Other
 | |
|          Cc    Control
 | |
|          Cf    Format
 | |
|          Cn    Unassigned
 | |
|          Co    Private use
 | |
|          Cs    Surrogate
 | |
| 
 | |
|          L     Letter
 | |
|          Ll    Lower case letter
 | |
|          Lm    Modifier letter
 | |
|          Lo    Other letter
 | |
|          Lt    Title case letter
 | |
|          Lu    Upper case letter
 | |
| 
 | |
|          M     Mark
 | |
|          Mc    Spacing mark
 | |
|          Me    Enclosing mark
 | |
|          Mn    Non-spacing mark
 | |
| 
 | |
|          N     Number
 | |
|          Nd    Decimal number
 | |
|          Nl    Letter number
 | |
|          No    Other number
 | |
| 
 | |
|          P     Punctuation
 | |
|          Pc    Connector punctuation
 | |
|          Pd    Dash punctuation
 | |
|          Pe    Close punctuation
 | |
|          Pf    Final punctuation
 | |
|          Pi    Initial punctuation
 | |
|          Po    Other punctuation
 | |
|          Ps    Open punctuation
 | |
| 
 | |
|          S     Symbol
 | |
|          Sc    Currency symbol
 | |
|          Sk    Modifier symbol
 | |
|          Sm    Mathematical symbol
 | |
|          So    Other symbol
 | |
| 
 | |
|          Z     Separator
 | |
|          Zl    Line separator
 | |
|          Zp    Paragraph separator
 | |
|          Zs    Space separator
 | |
| 
 | |
|        The  special property L& is also supported: it matches a character that
 | |
|        has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
 | |
|        classified as a modifier or "other".
 | |
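The way two-letter codes, single-letter codes, and L& relate can be sketched in Python with the standard unicodedata module. This is only an illustration, not PCRE: the module's Unicode tables are newer than the 5.1 tables mentioned earlier, and the helper name has_property is hypothetical.

```python
import unicodedata

def has_property(ch, code):
    """Return True if ch has the general category property `code`.

    `code` may be a two-letter category (e.g. "Lu"), a single letter
    standing for every category that starts with it (e.g. "L"), or the
    special "L&" (Lu, Ll, or Lt).
    """
    category = unicodedata.category(ch)   # e.g. "Lu" for "A"
    if code == "L&":
        return category in ("Lu", "Ll", "Lt")
    return category.startswith(code)

print(unicodedata.category("A"))        # Lu
print(has_property("A", "L"))           # True: Lu starts with L
print(has_property("\u02b0", "L&"))     # False: U+02B0 is a modifier letter (Lm)
```

Negation as in \P{Lu} or \p{^Lu} would simply be "not has_property(ch, code)" in this sketch.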
| 
 | |
|        The  Cs  (Surrogate)  property  applies only to characters in the range
 | |
|        U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see
 | |
|        RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
 | |
|        ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in
 | |
|        the pcreapi page).
 | |
| 
 | |
|        The  long  synonyms  for  these  properties that Perl supports (such as
 | |
|        \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
 | |
|        any of these properties with "Is".
 | |
| 
 | |
|        No character that is in the Unicode table has the Cn (unassigned) prop-
 | |
|        erty.  Instead, this property is assumed for any code point that is not
 | |
|        in the Unicode table.
 | |
| 
 | |
|        Specifying  caseless  matching  does not affect these escape sequences.
 | |
|        For example, \p{Lu} always matches only upper case letters.
 | |
| 
 | |
|        The \X escape matches any number of Unicode  characters  that  form  an
 | |
|        extended Unicode sequence. \X is equivalent to
 | |
| 
 | |
|          (?>\PM\pM*)
 | |
| 
 | |
|        That  is,  it matches a character without the "mark" property, followed
 | |
|        by zero or more characters with the "mark"  property,  and  treats  the
 | |
|        sequence  as  an  atomic group (see below).  Characters with the "mark"
 | |
|        property are typically accents that  affect  the  preceding  character.
 | |
|        None  of  them  have  codepoints less than 256, so in non-UTF-8 mode \X
 | |
|        matches any one character.
 | |
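A minimal Python sketch of what (?>\PM\pM*) does at a single position, assuming the standard unicodedata module as a stand-in for PCRE's property tables. The function name match_extended_sequence is hypothetical.

```python
import unicodedata

def is_mark(ch):
    # General categories Mc, Me and Mn all start with "M".
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, pos=0):
    """Mimic (?>\\PM\\pM*) at `pos`; return the end offset or None."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                      # \PM needs one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                         # \pM* takes every following mark
    return end

text = "e\u0301x"                        # "e" + combining acute accent + "x"
print(match_extended_sequence(text, 0))  # 2
```

Because the loop never gives back a mark once it has taken it, the sketch behaves like the atomic group in the equivalent pattern.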
| 
 | |
|        Matching characters by Unicode property is not fast, because  PCRE  has
 | |
|        to  search  a  structure  that  contains data for over fifteen thousand
 | |
|        characters. That is why the traditional escape sequences such as \d and
 | |
|        \w do not use Unicode properties in PCRE.
 | |
| 
 | |
|    Resetting the match start
 | |
| 
 | |
|        The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
 | |
|        ously matched characters not  to  be  included  in  the  final  matched
 | |
|        sequence. For example, the pattern:
 | |
| 
 | |
|          foo\Kbar
 | |
| 
 | |
|        matches  "foobar",  but reports that it has matched "bar". This feature
 | |
|        is similar to a lookbehind assertion (described  below).   However,  in
 | |
|        this  case, the part of the subject before the real match does not have
 | |
|        to be of fixed length, as lookbehind assertions do. The use of \K  does
 | |
|        not  interfere  with  the setting of captured substrings.  For example,
 | |
|        when the pattern
 | |
| 
 | |
|          (foo)\Kbar
 | |
| 
 | |
|        matches "foobar", the first substring is still set to "foo".
 | |
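Python's re module has no \K, so the sketch below only imitates its effect with a capturing group, and contrasts that with a lookbehind, which needs a fixed-length prefix. It is an illustration under that assumption, not PCRE behaviour.

```python
import re

# Without \K, a capturing group marks the part that should be reported.
m = re.search(r"fo+(bar)", "xfooobar")
print(m.group(1), m.span(1))             # bar (5, 8)

# A lookbehind gives the same result when the prefix has a fixed length.
print(re.search(r"(?<=foo)bar", "foobar").group())   # bar

try:
    re.compile(r"(?<=fo+)bar")           # variable-length prefix
except re.error as exc:
    print("rejected:", exc)
```

With \K the pattern fo+\Kbar would report "bar" directly, with no need for the extra group.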
| 
 | |
|    Simple assertions
 | |
| 
 | |
|        The final use of backslash is for certain simple assertions. An  asser-
 | |
|        tion  specifies a condition that has to be met at a particular point in
 | |
|        a match, without consuming any characters from the subject string.  The
 | |
|        use  of subpatterns for more complicated assertions is described below.
 | |
|        The backslashed assertions are:
 | |
| 
 | |
|          \b     matches at a word boundary
 | |
|          \B     matches when not at a word boundary
 | |
|          \A     matches at the start of the subject
 | |
|          \Z     matches at the end of the subject
 | |
|                  also matches before a newline at the end of the subject
 | |
|          \z     matches only at the end of the subject
 | |
|          \G     matches at the first matching position in the subject
 | |
| 
 | |
|        These assertions may not appear in character classes (but note that  \b
 | |
|        has a different meaning, namely the backspace character, inside a char-
 | |
|        acter class).
 | |
| 
 | |
|        A word boundary is a position in the subject string where  the  current
 | |
|        character  and  the previous character do not both match \w or \W (i.e.
 | |
|        one matches \w and the other matches \W), or the start or  end  of  the
 | |
|        string if the first or last character matches \w, respectively.
 | |
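The word-boundary definition can be checked position by position. The lines below use Python's re module, whose \b and \B follow the same definition for ASCII text; this is an assumption about Python, not a PCRE test.

```python
import re

subject = "ab cd"
print([m.start() for m in re.finditer(r"\b", subject)])   # [0, 2, 3, 5]
print([m.start() for m in re.finditer(r"\B", subject)])   # [1, 4]

# Inside a character class, \b is the backspace character instead.
print(re.findall(r"[\b]", "a\bb"))                        # ['\x08']
```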
| 
 | |
|        The  \A,  \Z,  and \z assertions differ from the traditional circumflex
 | |
|        and dollar (described in the next section) in that they only ever match
 | |
|        at  the  very start and end of the subject string, whatever options are
 | |
|        set. Thus, they are independent of multiline mode. These  three  asser-
 | |
|        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
 | |
|        affect only the behaviour of the circumflex and dollar  metacharacters.
 | |
|        However,  if the startoffset argument of pcre_exec() is non-zero, indi-
 | |
|        cating that matching is to start at a point other than the beginning of
 | |
|        the  subject,  \A  can never match. The difference between \Z and \z is
 | |
|        that \Z matches before a newline at the end of the string as well as at
 | |
|        the very end, whereas \z matches only at the end.
 | |
| 
 | |
|        The  \G assertion is true only when the current matching position is at
 | |
|        the start point of the match, as specified by the startoffset  argument
 | |
|        of  pcre_exec().  It  differs  from \A when the value of startoffset is
 | |
|        non-zero. By calling pcre_exec() multiple times with appropriate
 | |
|        arguments, you can mimic Perl's /g option, and it is in this kind of
 | |
|        implementation that \G is useful.
 | |
| 
 | |
|        Note, however, that PCRE's interpretation of \G, as the  start  of  the
 | |
|        current match, is subtly different from Perl's, which defines it as the
 | |
|        end of the previous match. In Perl, these can  be  different  when  the
 | |
|        previously  matched  string was empty. Because PCRE does just one match
 | |
|        at a time, it cannot reproduce this behaviour.
 | |
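The repeated-call technique can be sketched in Python, whose re module has no \G. Pattern.match(string, pos) must match exactly at pos, which is loosely analogous to a pattern that starts with \G run with a non-zero startoffset. The token pattern and the function name are hypothetical.

```python
import re

TOKEN = re.compile(r"\d+,?")             # hypothetical token: digits, optional comma

def contiguous_tokens(subject, start=0):
    """Collect tokens that follow one another with no gap, like /\\G.../g."""
    tokens = []
    pos = start
    while pos < len(subject):
        m = TOKEN.match(subject, pos)    # must match exactly at pos
        if m is None:
            break                        # gap found: stop, as \G would
        tokens.append(m.group())
        pos = m.end()                    # next attempt starts where this one ended
    return tokens

print(contiguous_tokens("12,345,x6"))    # ['12,', '345,']
```

The token pattern always consumes at least one character, so the loop avoids the empty-match case in which PCRE and Perl interpret \G differently.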
| 
 | |
|        If all the alternatives of a pattern begin with \G, the  expression  is
 | |
|        anchored to the starting match position, and the "anchored" flag is set
 | |
|        in the compiled regular expression.
 | |
| 
 | |
| 
 | |
| CIRCUMFLEX AND DOLLAR
 | |
| 
 | |
|        Outside a character class, in the default matching mode, the circumflex
 | |
|        character  is  an  assertion  that is true only if the current matching
 | |
|        point is at the start of the subject string. If the  startoffset  argu-
 | |
|        ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
 | |
|        PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
 | |
|        has an entirely different meaning (see below).
 | |
| 
 | |
|        Circumflex  need  not be the first character of the pattern if a number
 | |
|        of alternatives are involved, but it should be the first thing in  each
 | |
|        alternative  in  which  it appears if the pattern is ever to match that
 | |
|        branch. If all possible alternatives start with a circumflex, that  is,
 | |
|        if  the  pattern  is constrained to match only at the start of the sub-
 | |
|        ject, it is said to be an "anchored" pattern.  (There  are  also  other
 | |
|        constructs that can cause a pattern to be anchored.)
 | |
| 
 | |
|        A  dollar  character  is  an assertion that is true only if the current
 | |
|        matching point is at the end of  the  subject  string,  or  immediately
 | |
|        before a newline at the end of the string (by default). Dollar need not
 | |
|        be the last character of the pattern if a number  of  alternatives  are
 | |
|        involved,  but  it  should  be  the last item in any branch in which it
 | |
|        appears. Dollar has no special meaning in a character class.
 | |
| 
 | |
|        The meaning of dollar can be changed so that it  matches  only  at  the
 | |
|        very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
 | |
|        compile time. This does not affect the \Z assertion.
 | |
| 
 | |
|        The meanings of the circumflex and dollar characters are changed if the
 | |
|        PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
 | |
|        matches immediately after internal newlines as well as at the start  of
 | |
|        the  subject  string.  It  does not match after a newline that ends the
 | |
|        string. A dollar matches before any newlines in the string, as well  as
 | |
|        at  the very end, when PCRE_MULTILINE is set. When newline is specified
 | |
|        as the two-character sequence CRLF, isolated CR and  LF  characters  do
 | |
|        not indicate newlines.
 | |
| 
 | |
|        For  example, the pattern /^abc$/ matches the subject string "def\nabc"
 | |
|        (where \n represents a newline) in multiline mode, but  not  otherwise.
 | |
|        Consequently,  patterns  that  are anchored in single line mode because
 | |
|        all branches start with ^ are not anchored in  multiline  mode,  and  a
 | |
|        match  for  circumflex  is  possible  when  the startoffset argument of
 | |
|        pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
 | |
|        PCRE_MULTILINE is set.
 | |
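The /^abc$/ example can be reproduced with Python's re module, on the assumption that re.MULTILINE plays the role of PCRE_MULTILINE and that the pos argument of search() plays the role of startoffset.

```python
import re

subject = "def\nabc"
print(re.search(r"^abc$", subject))                  # None: ^ only at the very start
print(re.search(r"^abc$", subject, re.MULTILINE))    # a match object spanning (4, 7)

# By default, $ also matches just before a newline that ends the subject.
print(re.search(r"abc$", "abc\n") is not None)       # True

# With a non-zero start position, ^ cannot match in single-line mode.
print(re.compile(r"^abc").search("abcabc", 3))       # None
```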
| 
 | |
|        Note  that  the sequences \A, \Z, and \z can be used to match the start
 | |
|        and end of the subject in both modes, and if all branches of a  pattern
 | |
|        start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
 | |
|        set.
 | |
| 
 | |
| 
 | |
| FULL STOP (PERIOD, DOT)
 | |
| 
 | |
|        Outside a character class, a dot in the pattern matches any one charac-
 | |
|        ter  in  the subject string except (by default) a character that signi-
 | |
|        fies the end of a line. In UTF-8 mode, the  matched  character  may  be
 | |
|        more than one byte long.
 | |
| 
 | |
|        When  a line ending is defined as a single character, dot never matches
 | |
|        that character; when the two-character sequence CRLF is used, dot  does
 | |
|        not  match  CR  if  it  is immediately followed by LF, but otherwise it
 | |
|        matches all characters (including isolated CRs and LFs). When any  Uni-
 | |
|        code  line endings are being recognized, dot does not match CR or LF or
 | |
|        any of the other line ending characters.
 | |
| 
 | |
|        The behaviour of dot with regard to newlines can  be  changed.  If  the
 | |
|        PCRE_DOTALL  option  is  set,  a dot matches any one character, without
 | |
|        exception. If the two-character sequence CRLF is present in the subject
 | |
|        string, it takes two dots to match it.
 | |
| 
 | |
|        The  handling of dot is entirely independent of the handling of circum-
 | |
|        flex and dollar, the only relationship being  that  they  both  involve
 | |
|        newlines. Dot has no special meaning in a character class.
 | |
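A short Python illustration of dot and its DOTALL variant. Python's re module treats only LF as a line ending for this purpose, so it corresponds to the single-character newline case; re.DOTALL is assumed to stand in for PCRE_DOTALL.

```python
import re

print(re.findall(r"a.c", "abc a\nc"))                # ['abc']
print(re.findall(r"a.c", "abc a\nc", re.DOTALL))     # ['abc', 'a\nc']

# CRLF is two characters, so it takes two dots even in DOTALL mode.
print(re.fullmatch(r"a.b", "a\r\nb", re.DOTALL))     # None
print(re.fullmatch(r"a..b", "a\r\nb", re.DOTALL) is not None)   # True

# Inside a class, dot is an ordinary character.
print(re.findall(r"[.]", "a.b"))                     # ['.']
```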
| 
 | |
| 
 | |
| MATCHING A SINGLE BYTE
 | |
| 
 | |
|        Outside a character class, the escape sequence \C matches any one byte,
 | |
|        both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any
 | |
|        line-ending  characters.  The  feature  is provided in Perl in order to
 | |
|        match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char-
 | |
|        acters  into individual bytes, what remains in the string may be a mal-
 | |
|        formed UTF-8 string. For this reason, the \C escape  sequence  is  best
 | |
|        avoided.
 | |
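Why splitting a character is risky can be shown with a Python bytes pattern. Python has no \C; matching (?s). against bytes is used here as an assumed equivalent, since it also consumes exactly one byte.

```python
import re

data = "h\u00e9".encode("utf-8")         # b'h\xc3\xa9': "é" occupies two bytes
m = re.match(rb"(?s)h.", data)           # on bytes, (?s). matches any one byte
print(m.group())                         # b'h\xc3'
rest = data[m.end():]                    # b'\xa9', a lone continuation byte

try:
    rest.decode("utf-8")
except UnicodeDecodeError:
    print("remainder is not valid UTF-8")
```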
| 
 | |
|        PCRE  does  not  allow \C to appear in lookbehind assertions (described
 | |
|        below), because in UTF-8 mode this would make it impossible  to  calcu-
 | |
|        late the length of the lookbehind.
 | |
| 
 | |
| 
 | |
| SQUARE BRACKETS AND CHARACTER CLASSES
 | |
| 
 | |
|        An opening square bracket introduces a character class, terminated by a
 | |
|        closing square bracket. A closing square bracket on its own is not spe-
 | |
|        cial. If a closing square bracket is required as a member of the class,
 | |
|        it should be the first data character in the class  (after  an  initial
 | |
|        circumflex, if present) or escaped with a backslash.
 | |
| 
 | |
|        A  character  class matches a single character in the subject. In UTF-8
 | |
|        mode, the character may occupy more than one byte. A matched  character
 | |
|        must be in the set of characters defined by the class, unless the first
 | |
|        character in the class definition is a circumflex, in  which  case  the
 | |
|        subject  character  must  not  be in the set defined by the class. If a
 | |
|        circumflex is actually required as a member of the class, ensure it  is
 | |
|        not the first character, or escape it with a backslash.
 | |
| 
 | |
|        For  example, the character class [aeiou] matches any lower case vowel,
 | |
|        while [^aeiou] matches any character that is not a  lower  case  vowel.
 | |
|        Note that a circumflex is just a convenient notation for specifying the
 | |
|        characters that are in the class by enumerating those that are  not.  A
 | |
|        class  that starts with a circumflex is not an assertion: it still con-
 | |
|        sumes a character from the subject string, and therefore  it  fails  if
 | |
|        the current pointer is at the end of the string.
 | |
| 
 | |
|        In  UTF-8 mode, characters with values greater than 255 can be included
 | |
|        in a class as a literal string of bytes, or by using the  \x{  escaping
 | |
|        mechanism.
 | |
| 
 | |
|        When  caseless  matching  is set, any letters in a class represent both
 | |
|        their upper case and lower case versions, so for  example,  a  caseless
 | |
|        [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
 | |
|        match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
 | |
|        understands  the  concept  of case for characters whose values are less
 | |
|        than 128, so caseless matching is always possible. For characters  with
 | |
|        higher  values,  the  concept  of case is supported if PCRE is compiled
 | |
|        with Unicode property support, but not otherwise.  If you want  to  use
 | |
|        caseless  matching  for  characters 128 and above, you must ensure that
 | |
|        PCRE is compiled with Unicode property support as well  as  with  UTF-8
 | |
|        support.
 | |
| 
 | |
|        Characters  that  might  indicate  line breaks are never treated in any
 | |
|        special way  when  matching  character  classes,  whatever  line-ending
 | |
|        sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
 | |
|        PCRE_MULTILINE options is used. A class such as [^a] always matches one
 | |
|        of these characters.
 | |
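The basic class rules can be tried out in Python's re module, which is assumed to agree with PCRE on these points for ASCII subjects.

```python
import re

print(re.findall(r"[aeiou]", "Education"))            # ['u', 'a', 'i', 'o']
print(re.findall(r"[aeiou]", "Education", re.I))      # ['E', 'u', 'a', 'i', 'o']

# A negated class still has to consume a character.
print(re.search(r"q[^u]", "Iraq"))                    # None: nothing follows "q"
print(re.search(r"q[^u]", "Iraq\n").group() == "q\n") # True: newline is matched
```

The last line also shows that a line break is an ordinary character as far as a class is concerned.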
| 
 | |
|        The  minus (hyphen) character can be used to specify a range of charac-
 | |
|        ters in a character  class.  For  example,  [d-m]  matches  any  letter
 | |
|        between  d  and  m,  inclusive.  If  a minus character is required in a
 | |
|        class, it must be escaped with a backslash  or  appear  in  a  position
 | |
|        where  it cannot be interpreted as indicating a range, typically as the
 | |
|        first or last character in the class.
 | |
| 
 | |
|        It is not possible to have the literal character "]" as the end charac-
 | |
|        ter  of a range. A pattern such as [W-]46] is interpreted as a class of
 | |
|        two characters ("W" and "-") followed by a literal string "46]", so  it
 | |
|        would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
 | |
|        backslash it is interpreted as the end of range, so [W-\]46] is  inter-
 | |
|        preted  as a class containing a range followed by two other characters.
 | |
|        The octal or hexadecimal representation of "]" can also be used to  end
 | |
|        a range.
 | |
| 
 | |
|        Ranges  operate in the collating sequence of character values. They can
 | |
|        also  be  used  for  characters  specified  numerically,  for   example
 | |
|        [\000-\037].  In UTF-8 mode, ranges can include characters whose values
 | |
|        are greater than 255, for example [\x{100}-\x{2ff}].
 | |
| 
 | |
|        If a range that includes letters is used when caseless matching is set,
 | |
|        it matches the letters in either case. For example, [W-c] is equivalent
 | |
|        to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
 | |
|        character  tables  for  a French locale are in use, [\xc8-\xcb] matches
 | |
|        accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
 | |
|        concept  of  case for characters with values greater than 127 only when
 | |
|        it is compiled with Unicode property support.
 | |
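To see exactly which characters a range admits, the sketch below enumerates the ASCII members of a class with Python's re module (assumed to parse these classes the same way). The helper name members is hypothetical.

```python
import re

def members(char_class, flags=0):
    """List the ASCII characters that a single-character class matches."""
    pattern = re.compile(char_class, flags)
    return "".join(chr(c) for c in range(128) if pattern.fullmatch(chr(c)))

print(members(r"[d-m]"))            # defghijklm
print(members(r"[-az]"))            # -az
print(members(r"[W-\]46]"))         # 46WXYZ[\]
print(members(r"[W-c]", re.I))      # ABCWXYZ[\]^_`abcwxyz

# Unescaped, the first "]" ends the class: [W-] followed by the literal "46]".
print(re.fullmatch(r"[W-]46]", "-46]") is not None)    # True
```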
| 
 | |
|        The character types \d, \D, \p, \P, \s, \S, \w, and \W may also  appear
 | |
|        in  a  character  class,  and add the characters that they match to the
 | |
|        class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
 | |
|        flex  can  conveniently  be used with the upper case character types to
 | |
|        specify a more restricted set of characters  than  the  matching  lower
 | |
|        case  type.  For example, the class [^\W_] matches any letter or digit,
 | |
|        but not underscore.
 | |
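Both examples can be run in Python's re module. The re.ASCII flag is an assumption made so that \d and \W cover only ASCII, as the traditional escapes do in PCRE.

```python
import re

# Digits plus A-F: runs of upper case hexadecimal digits.
print(re.findall(r"[\dABCDEF]+", "0x1F 0xZZ", re.ASCII))     # ['0', '1F', '0']

# "Not a non-word character and not underscore": letters and digits only.
print(re.findall(r"[^\W_]+", "snake_case 42", re.ASCII))     # ['snake', 'case', '42']
```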
| 
 | |
|        The only metacharacters that are recognized in  character  classes  are
 | |
|        backslash,  hyphen  (only  where  it can be interpreted as specifying a
 | |
|        range), circumflex (only at the start), opening  square  bracket  (only
 | |
|        when  it can be interpreted as introducing a POSIX class name - see the
 | |
|        next section), and the terminating  closing  square  bracket.  However,
 | |
|        escaping other non-alphanumeric characters does no harm.
 | |
| 
 | |
| 
 | |
| POSIX CHARACTER CLASSES
 | |
| 
 | |
|        Perl supports the POSIX notation for character classes. This uses names
 | |
|        enclosed by [: and :] within the enclosing square brackets.  PCRE  also
 | |
|        supports this notation. For example,
 | |
| 
 | |
|          [01[:alpha:]%]
 | |
| 
 | |
|        matches "0", "1", any alphabetic character, or "%". The supported class
 | |
|        names are
 | |
| 
 | |
|          alnum    letters and digits
 | |
|          alpha    letters
 | |
|          ascii    character codes 0 - 127
 | |
|          blank    space or tab only
 | |
|          cntrl    control characters
 | |
|          digit    decimal digits (same as \d)
 | |
|          graph    printing characters, excluding space
 | |
|          lower    lower case letters
 | |
|          print    printing characters, including space
 | |
|          punct    printing characters, excluding letters and digits
 | |
|          space    white space (not quite the same as \s)
 | |
|          upper    upper case letters
 | |
|          word     "word" characters (same as \w)
 | |
|          xdigit   hexadecimal digits
 | |
| 
 | |
|        The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
 | |
|        and  space  (32). Notice that this list includes the VT character (code
 | |
|        11). This makes "space" different to \s, which does not include VT (for
 | |
|        Perl compatibility).
 | |
| 
 | |
|        The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
 | |
|        from Perl 5.8. Another Perl extension is negation, which  is  indicated
 | |
|        by a ^ character after the colon. For example,
 | |
| 
 | |
|          [12[:^digit:]]
 | |
| 
 | |
|        matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
 | |
|        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
 | |
|        these are not supported, and an error is given if they are encountered.
 | |
| 
 | |
|        In UTF-8 mode, characters with values greater than 127 do not match any
 | |
|        of the POSIX character classes.
 | |
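Python's re module does not understand [:name:] items, so the following is only a table-driven sketch of how a few of the names expand to ASCII sets. The names POSIX_CLASSES and posix_members are hypothetical, and only some of the classes are covered.

```python
import string

# ASCII-only membership for a few of the class names listed above.
POSIX_CLASSES = {
    "alpha": set(string.ascii_letters),
    "digit": set(string.digits),
    "alnum": set(string.ascii_letters + string.digits),
    "blank": set(" \t"),
    "space": set("\t\n\x0b\x0c\r "),       # HT, LF, VT, FF, CR, space
    "xdigit": set(string.hexdigits),
    "word": set(string.ascii_letters + string.digits + "_"),
}

ASCII = {chr(code) for code in range(128)}

def posix_members(name, negated=False):
    """Return the ASCII characters matched by [:name:] or [:^name:]."""
    members = POSIX_CLASSES[name]
    return ASCII - members if negated else set(members)

# [01[:alpha:]%] is the union of its literal members and the named class.
example = set("01%") | posix_members("alpha")
print("x" in example, "2" in example)      # True False

# "space" includes VT (11); \s as described above does not.
backslash_s = posix_members("space") - {"\x0b"}
print("\x0b" in posix_members("space"), "\x0b" in backslash_s)   # True False
```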
| 
 | |
| 
 | |
| VERTICAL BAR
 | |
| 
 | |
|        Vertical bar characters are used to separate alternative patterns.  For
 | |
|        example, the pattern
 | |
| 
 | |
|          gilbert|sullivan
 | |
| 
 | |
|        matches  either "gilbert" or "sullivan". Any number of alternatives may
 | |
|        appear, and an empty  alternative  is  permitted  (matching  the  empty
 | |
|        string). The matching process tries each alternative in turn, from left
 | |
|        to right, and the first one that succeeds is used. If the  alternatives
 | |
|        are  within a subpattern (defined below), "succeeds" means matching the
 | |
|        rest of the main pattern as well as the alternative in the subpattern.
 | |
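The left-to-right rule, and what "succeeds" means inside a subpattern, can be seen in a few lines of Python. Its re module is a backtracking matcher and is assumed to order alternatives the same way.

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
print(re.match(r"a|ab", "ab").group())               # a

# Inside a subpattern, the rest of the pattern must match as well.
print(re.fullmatch(r"(a|ab)c", "abc").group(1))      # ab

# An empty alternative matches the empty string.
print(re.fullmatch(r"gilbert|sullivan|", "") is not None)   # True
```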
| 
 | |
| 
 | |
| INTERNAL OPTION SETTING
 | |
| 
 | |
|        The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
 | |
|        PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
 | |
|        within the pattern by  a  sequence  of  Perl  option  letters  enclosed
 | |
|        between "(?" and ")".  The option letters are
 | |
| 
 | |
|          i  for PCRE_CASELESS
 | |
|          m  for PCRE_MULTILINE
 | |
|          s  for PCRE_DOTALL
 | |
|          x  for PCRE_EXTENDED
 | |
| 
 | |
|        For example, (?im) sets caseless, multiline matching. It is also possi-
 | |
|        ble to unset these options by preceding the letter with a hyphen, and a
 | |
|        combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-
 | |
|        LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
 | |
|        is  also  permitted.  If  a  letter  appears  both before and after the
 | |
|        hyphen, the option is unset.
 | |
| 
 | |
|        The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
 | |
|        can  be changed in the same way as the Perl-compatible options by using
 | |
|        the characters J, U and X respectively.
 | |
| 
 | |
|        When one of these option changes occurs at  top  level  (that  is,  not
 | |
|        inside  subpattern parentheses), the change applies to the remainder of
 | |
|        the pattern that follows. If the change is placed right at the start of
 | |
|        a pattern, PCRE extracts it into the global options (and it will there-
 | |
|        fore show up in data extracted by the pcre_fullinfo() function).
 | |
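Python's re module behaves in a comparable way for options placed at the very start of a pattern: they end up in the flags of the compiled object. The sketch assumes this as a rough analogue of what pcre_fullinfo() reports; recent Python versions do not accept an unscoped option change anywhere else in the pattern.

```python
import re

p = re.compile(r"(?im)^abc$")            # options placed right at the start
print(bool(p.flags & re.IGNORECASE), bool(p.flags & re.MULTILINE))   # True True
print(p.search("def\nABC").group())      # ABC
```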
| 
 | |
|        An option change within a subpattern (see below for  a  description  of
 | |
|        subpatterns) affects only that part of the current pattern that follows
 | |
|        it, so
 | |
| 
 | |
|          (a(?i)b)c
 | |
| 
 | |
|        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
 | |
|        used).   By  this means, options can be made to have different settings
 | |
|        in different parts of the pattern. Any changes made in one  alternative
 | |
|        do  carry  on  into subsequent branches within the same subpattern. For
 | |
|        example,
 | |
| 
 | |
|          (a(?i)b|c)
 | |
| 
 | |
|        matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
 | |
|        first  branch  is  abandoned before the option setting. This is because
 | |
|        the effects of option settings happen at compile time. If they happened
 | |
|        at match time, the options in force would depend on the branches tried.
 | |
| 
 | |
|        Note:  There  are  other  PCRE-specific  options that can be set by the
 | |
|        application when the compile or match functions  are  called.  In  some
 | |
|        cases the pattern can contain special leading sequences such as (*CRLF)
 | |
|        to override what the application has set or what  has  been  defaulted.
 | |
|        Details  are  given  in the section entitled "Newline sequences" above.
 | |
|        There is also the (*UTF8) leading sequence that  can  be  used  to  set
 | |
|        UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
 | |
| 
 | |
| 
 | |
| SUBPATTERNS
 | |
| 
 | |
|        Subpatterns are delimited by parentheses (round brackets), which can be
 | |
|        nested.  Turning part of a pattern into a subpattern does two things:
 | |
| 
 | |
|        1. It localizes a set of alternatives. For example, the pattern
 | |
| 
 | |
|          cat(aract|erpillar|)
 | |
| 
 | |
|        matches one of the words "cat", "cataract", or  "caterpillar".  Without
 | |
|        the  parentheses,  it  would  match  "cataract", "erpillar" or an empty
 | |
|        string.
 | |
| 
 | |
|        2. It sets up the subpattern as  a  capturing  subpattern.  This  means
 | |
|        that,  when  the  whole  pattern  matches,  that portion of the subject
 | |
|        string that matched the subpattern is passed back to the caller via the
 | |
|        ovector  argument  of pcre_exec(). Opening parentheses are counted from
 | |
|        left to right (starting from 1) to obtain  numbers  for  the  capturing
 | |
|        subpatterns.
 | |
| 
 | |
|        For  example,  if the string "the red king" is matched against the pat-
 | |
|        tern
 | |
| 
 | |
|          the ((red|white) (king|queen))
 | |
| 
 | |
|        the captured substrings are "red king", "red", and "king", and are num-
 | |
|        bered 1, 2, and 3, respectively.
 | |
| 
 | |
|        The  fact  that  plain  parentheses  fulfil two functions is not always
 | |
|        helpful.  There are often times when a grouping subpattern is  required
 | |
|        without  a capturing requirement. If an opening parenthesis is followed
 | |
|        by a question mark and a colon, the subpattern does not do any  captur-
 | |
|        ing,  and  is  not  counted when computing the number of any subsequent
 | |
|        capturing subpatterns. For example, if the string "the white queen"  is
 | |
|        matched against the pattern
 | |
| 
 | |
|          the ((?:red|white) (king|queen))
 | |
| 
 | |
|        the captured substrings are "white queen" and "queen", and are numbered
 | |
|        1 and 2. The maximum number of capturing subpatterns is 65535.
 | |
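The numbering of captured substrings can be checked with Python's re module, which is assumed to count opening parentheses in the same way.

```python
import re

m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(m.groups())                        # ('red king', 'red', 'king')

# (?: ... ) groups without capturing, so later groups move up by one.
m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(m.groups())                        # ('white queen', 'queen')

words = ("cat", "cataract", "caterpillar", "cater")
print([w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)])
# ['cat', 'cataract', 'caterpillar']
```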
| 
 | |
|        As a convenient shorthand, if any option settings are required  at  the
 | |
|        start  of  a  non-capturing  subpattern,  the option letters may appear
 | |
|        between the "?" and the ":". Thus the two patterns
 | |
| 
 | |
|          (?i:saturday|sunday)
 | |
|          (?:(?i)saturday|sunday)
 | |
| 
 | |
|        match exactly the same set of strings. Because alternative branches are
 | |
|        tried  from  left  to right, and options are not reset until the end of
 | |
|        the subpattern is reached, an option setting in one branch does  affect
 | |
|        subsequent  branches,  so  the above patterns match "SUNDAY" as well as
 | |
|        "Saturday".
 | |
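Python's re module accepts the shorthand form (from Python 3.6), so it can serve as a quick check that the option covers every branch of the group and nothing outside it. The second pattern is a hypothetical extra example, not one from this page.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")
print([s for s in ("Saturday", "SUNDAY", "monday") if days.fullmatch(s)])
# ['Saturday', 'SUNDAY']

# The option is confined to the group: "a" and "c" stay case-sensitive.
scoped = re.compile(r"a(?i:b)c")
print([s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)])
# ['abc', 'aBc']
```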
| 
 | |
| 
 | |
| DUPLICATE SUBPATTERN NUMBERS
 | |
| 
 | |
|        Perl 5.10 introduced a feature whereby each alternative in a subpattern
 | |
|        uses  the same numbers for its capturing parentheses. Such a subpattern
 | |
|        starts with (?| and is itself a non-capturing subpattern. For  example,
 | |
|        consider this pattern:
 | |
| 
 | |
|          (?|(Sat)ur|(Sun))day
 | |
| 
 | |
|        Because  the two alternatives are inside a (?| group, both sets of cap-
 | |
|        turing parentheses are numbered one. Thus, when  the  pattern  matches,
 | |
|        you  can  look  at captured substring number one, whichever alternative
 | |
|        matched. This construct is useful when you want to  capture  part,  but
 | |
|        not all, of one of a number of alternatives. Inside a (?| group, paren-
 | |
|        theses are numbered as usual, but the number is reset at the  start  of
 | |
|        each  branch. The numbers of any capturing buffers that follow the sub-
 | |
|        pattern start after the highest number used in any branch. The  follow-
 | |
|        ing  example  is taken from the Perl documentation.  The numbers under-
 | |
|        neath show in which buffer the captured content will be stored.
 | |
| 
 | |
|          # before  ---------------branch-reset----------- after
 | |
|          / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
 | |
|          # 1            2         2  3        2     3     4
 | |
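The numbering rule can be sketched as a small Python scanner that assigns a number to each capturing parenthesis. It is a toy under stated assumptions, not how PCRE compiles patterns: the function name capture_numbers is hypothetical, and escapes, character classes, and every "(?" construct other than (?: and (?| are ignored.

```python
def capture_numbers(pattern):
    """Number capturing parentheses, honouring (?| branch-reset groups."""
    numbers = []        # number given to each capturing "(" in order
    count = 0           # last number handed out on the current branch
    stack = []          # one entry per open group; a list for (?| groups
    i = 0
    while i < len(pattern):
        if pattern.startswith("(?|", i):
            stack.append([count, count])   # [number at entry, highest seen]
            i += 3
            continue
        if pattern.startswith("(?:", i):
            stack.append(None)
            i += 3
            continue
        ch = pattern[i]
        if ch == "(":
            count += 1
            numbers.append(count)
            stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            reset = stack[-1]
            reset[1] = max(reset[1], count)
            count = reset[0]               # each branch restarts its numbering
        elif ch == ")":
            reset = stack.pop()
            if reset is not None:
                count = max(reset[1], count)   # continue after the highest number
        i += 1
    return numbers

print(capture_numbers("( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z )"))
# [1, 2, 2, 3, 2, 3, 4]
```

The result agrees with the numbers printed under the example pattern.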
| 
 | |
|        A backreference or a recursive call to  a  numbered  subpattern  always
 | |
|        refers to the first one in the pattern with the given number.
 | |
| 
 | |
|        An  alternative approach to using this "branch reset" feature is to use
 | |
|        duplicate named subpatterns, as described in the next section.
 | |
| 
 | |
| 
 | |
| NAMED SUBPATTERNS
 | |
| 
 | |
|        Identifying capturing parentheses by number is simple, but  it  can  be
 | |
|        very  hard  to keep track of the numbers in complicated regular expres-
 | |
|        sions. Furthermore, if an  expression  is  modified,  the  numbers  may
 | |
|        change.  To help with this difficulty, PCRE supports the naming of sub-
 | |
|        patterns. This feature was not added to Perl until release 5.10. Python
 | |
|        had  the  feature earlier, and PCRE introduced it at release 4.0, using
 | |
|        the Python syntax. PCRE now supports both the Perl and the Python  syn-
 | |
|        tax.
 | |
| 
 | |
|        In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
 | |
|        or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
 | |
|        to capturing parentheses from other parts of the pattern, such as back-
 | |
|        references, recursion, and conditions, can be made by name as  well  as
 | |
|        by number.
 | |
| 
 | |
|        Names  consist  of  up  to  32 alphanumeric characters and underscores.
 | |
|        Named capturing parentheses are still  allocated  numbers  as  well  as
 | |
|        names,  exactly as if the names were not present. The PCRE API provides
 | |
|        function calls for extracting the name-to-number translation table from
 | |
|        a compiled pattern. There is also a convenience function for extracting
 | |
|        a captured substring by name.
 | |
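The Python form of the syntax can be tried directly in Python's re module. Its groupindex attribute is used here as an assumed counterpart of the name-to-number table; it is not the PCRE API, and the date pattern is a hypothetical example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2008-09")

# A named group can be read by name or by its ordinary number.
print(m.group("year"), m.group(1))       # 2008 2008
print(dict(date.groupindex))             # {'year': 1, 'month': 2}
```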
| 
 | |
|        By default, a name must be unique within a pattern, but it is  possible
 | |
|        to relax this constraint by setting the PCRE_DUPNAMES option at compile
 | |
|        time. This can be useful for patterns where only one  instance  of  the
 | |
|        named  parentheses  can  match. Suppose you want to match the name of a
 | |
|        weekday, either as a 3-letter abbreviation or as the full name, and  in
 | |
|        both cases you want to extract the abbreviation. This pattern (ignoring
 | |
|        the line breaks) does the job:
 | |
| 
 | |
|          (?<DN>Mon|Fri|Sun)(?:day)?|
 | |
|          (?<DN>Tue)(?:sday)?|
 | |
|          (?<DN>Wed)(?:nesday)?|
 | |
|          (?<DN>Thu)(?:rsday)?|
 | |
|          (?<DN>Sat)(?:urday)?
 | |
| 
 | |
|        There are five capturing substrings, but only one is ever set  after  a
 | |
|        match.  (An alternative way of solving this problem is to use a "branch
 | |
|        reset" subpattern, as described in the previous section.)
 | |
| 
 | |
|        The convenience function for extracting the data by  name  returns  the
 | |
|        substring  for  the first (and in this example, the only) subpattern of
 | |
|        that name that matched. This saves searching  to  find  which  numbered
 | |
|        subpattern  it  was. If you make a reference to a non-unique named sub-
 | |
|        pattern from elsewhere in the pattern, the one that corresponds to  the
 | |
|        lowest  number  is used. For further details of the interfaces for han-
 | |
|        dling named subpatterns, see the pcreapi documentation.
 | |
| 
 | |
|        Warning: You cannot use different names to distinguish between two sub-
 | |
|        patterns  with  the same number (see the previous section) because PCRE
 | |
|        uses only the numbers when matching.
 | |
| 
 | |
| 
 | |
| REPETITION
 | |
| 
 | |
|        Repetition is specified by quantifiers, which can  follow  any  of  the
 | |
|        following items:
 | |
| 
 | |
|          a literal data character
 | |
|          the dot metacharacter
 | |
|          the \C escape sequence
 | |
|          the \X escape sequence (in UTF-8 mode with Unicode properties)
 | |
|          the \R escape sequence
 | |
|          an escape such as \d that matches a single character
 | |
|          a character class
 | |
|          a back reference (see next section)
 | |
|          a parenthesized subpattern (unless it is an assertion)
 | |
| 
 | |
|        The  general repetition quantifier specifies a minimum and maximum num-
 | |
|        ber of permitted matches, by giving the two numbers in  curly  brackets
 | |
|        (braces),  separated  by  a comma. The numbers must be less than 65536,
 | |
|        and the first must be less than or equal to the second. For example:
 | |
| 
 | |
|          z{2,4}
 | |
| 
 | |
|        matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
 | |
|        special  character.  If  the second number is omitted, but the comma is
 | |
|        present, there is no upper limit; if the second number  and  the  comma
 | |
|        are  both omitted, the quantifier specifies an exact number of required
 | |
|        matches. Thus
 | |
| 
 | |
|          [aeiou]{3,}
 | |
| 
 | |
|        matches at least 3 successive vowels, but may match many more, while
 | |
| 
 | |
|          \d{8}
 | |
| 
 | |
|        matches exactly 8 digits. An opening curly bracket that  appears  in  a
 | |
|        position  where a quantifier is not allowed, or one that does not match
 | |
|        the syntax of a quantifier, is taken as a literal character. For  exam-
 | |
|        ple, {,6} is not a quantifier, but a literal string of four characters.
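The three quantifier forms above can be tried out with a short sketch.
It uses Python's re module as a stand-in for PCRE (an assumption; the
engines agree on these forms). The {,6} case is left out because Python
reads it as a quantifier, unlike PCRE. The helper name and the subject
strings are invented for the example.

```python
import re

# Python's re stands in for PCRE; the {min,max} forms shown behave alike.
def matches(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if matches(r"z{2,4}", s)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits.
exact = (matches(r"\d{8}", "20080911"), matches(r"\d{8}", "2008091"))
```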
 | |
| 
 | |
|        In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
 | |
|        individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
 | |
|        acters, each of which is represented by a two-byte sequence. Similarly,
 | |
|        when Unicode property support is available, \X{3} matches three Unicode
 | |
|        extended  sequences,  each of which may be several bytes long (and they
 | |
|        may be of different lengths).
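A small sketch of the character-versus-byte distinction, using Python's
re module as a stand-in for PCRE in UTF-8 mode (an assumption: Python
matches Unicode strings rather than UTF-8 bytes, but the quantifier
counts characters in both cases). The variable names are invented for
the example.

```python
import re

# Python's re works on Unicode strings, comparable to PCRE's UTF-8 mode.
char = "\u0100"                       # LATIN CAPITAL LETTER A WITH MACRON
bytes_per_char = len(char.encode("utf-8"))

# The quantifier counts characters: {2} means two characters, not two bytes.
two_chars = re.fullmatch(r"\u0100{2}", char * 2) is not None
one_char = re.fullmatch(r"\u0100{2}", char) is not None
```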
 | |
| 
 | |
|        The quantifier {0} is permitted, causing the expression to behave as if
 | |
|        the previous item and the quantifier were not present. This may be use-
 | |
|        ful for subpatterns that are referenced as subroutines  from  elsewhere
 | |
|        in the pattern. Items other than subpatterns that have a {0} quantifier
 | |
|        are omitted from the compiled pattern.
 | |
| 
 | |
|        For convenience, the three most common quantifiers have  single-charac-
 | |
|        ter abbreviations:
 | |
| 
 | |
|          *    is equivalent to {0,}
 | |
|          +    is equivalent to {1,}
 | |
|          ?    is equivalent to {0,1}
 | |
| 
 | |
|        It  is  possible  to construct infinite loops by following a subpattern
 | |
|        that can match no characters with a quantifier that has no upper limit,
 | |
|        for example:
 | |
| 
 | |
|          (a?)*
 | |
| 
 | |
|        Earlier versions of Perl and PCRE used to give an error at compile time
 | |
|        for such patterns. However, because there are cases where this  can  be
 | |
|        useful,  such  patterns  are now accepted, but if any repetition of the
 | |
|        subpattern does in fact match no characters, the loop is forcibly  bro-
 | |
|        ken.
 | |
| 
 | |
|        By  default,  the quantifiers are "greedy", that is, they match as much
 | |
|        as possible (up to the maximum  number  of  permitted  times),  without
 | |
|        causing  the  rest of the pattern to fail. The classic example of where
 | |
|        this gives problems is in trying to match comments in C programs. These
 | |
|        appear  between  /*  and  */ and within the comment, individual * and /
 | |
|        characters may appear. An attempt to match C comments by  applying  the
 | |
|        pattern
 | |
| 
 | |
|          /\*.*\*/
 | |
| 
 | |
|        to the string
 | |
| 
 | |
|          /* first comment */  not comment  /* second comment */
 | |
| 
 | |
|        fails,  because it matches the entire string owing to the greediness of
 | |
|        the .*  item.
 | |
| 
 | |
|        However, if a quantifier is followed by a question mark, it  ceases  to
 | |
|        be greedy, and instead matches the minimum number of times possible, so
 | |
|        the pattern
 | |
| 
 | |
|          /\*.*?\*/
 | |
| 
 | |
|        does the right thing with the C comments. The meaning  of  the  various
 | |
|        quantifiers  is  not  otherwise  changed,  just the preferred number of
 | |
|        matches.  Do not confuse this use of question mark with its  use  as  a
 | |
|        quantifier  in its own right. Because it has two uses, it can sometimes
 | |
|        appear doubled, as in
 | |
| 
 | |
|          \d??\d
 | |
| 
 | |
|        which matches one digit by preference, but can match two if that is the
 | |
|        only way the rest of the pattern matches.
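The difference between greedy and minimizing quantifiers can be seen in
a short sketch. Python's re module is used as a stand-in for PCRE (an
assumption; the two agree on these constructs), and the variable names
are invented for the example.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Lazy .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit but takes two when the rest of the match needs it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()
```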
 | |
| 
 | |
|        If  the PCRE_UNGREEDY option is set (an option that is not available in
 | |
|        Perl), the quantifiers are not greedy by default, but  individual  ones
 | |
|        can  be  made  greedy  by following them with a question mark. In other
 | |
|        words, it inverts the default behaviour.
 | |
| 
 | |
|        When a parenthesized subpattern is quantified  with  a  minimum  repeat
 | |
|        count  that is greater than 1 or with a limited maximum, more memory is
 | |
|        required for the compiled pattern, in proportion to  the  size  of  the
 | |
|        minimum or maximum.
 | |
| 
 | |
|        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
 | |
|        alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
 | |
|        the  pattern  is  implicitly anchored, because whatever follows will be
 | |
|        tried against every character position in the subject string, so  there
 | |
|        is  no  point  in  retrying the overall match at any position after the
 | |
|        first. PCRE normally treats such a pattern as though it  were  preceded
 | |
|        by \A.
 | |
| 
 | |
|        In  cases  where  it  is known that the subject string contains no new-
 | |
|        lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
 | |
|        mization, or alternatively using ^ to indicate anchoring explicitly.
 | |
| 
 | |
|        However,  there is one situation where the optimization cannot be used.
 | |
|        When .*  is inside capturing parentheses that  are  the  subject  of  a
 | |
|        backreference  elsewhere  in the pattern, a match at the start may fail
 | |
|        where a later one succeeds. Consider, for example:
 | |
| 
 | |
|          (.*)abc\1
 | |
| 
 | |
|        If the subject is "xyz123abc123" the match point is the fourth  charac-
 | |
|        ter. For this reason, such a pattern is not implicitly anchored.
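The match point in this example can be checked with a brief sketch,
again using Python's re module as a stand-in for PCRE (an assumption).
Note that Python reports zero-based offsets, so the fourth character is
offset 3.

```python
import re

# Python's re stands in for PCRE; the \1 backreference syntax is shared.
match = re.search(r"(.*)abc\1", "xyz123abc123")

start_offset = match.start()   # zero-based, so 3 is the fourth character
captured = match.group(1)
whole_match = match.group()
```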
 | |
| 
 | |
|        When a capturing subpattern is repeated, the value captured is the sub-
 | |
|        string that matched the final iteration. For example, after
 | |
| 
 | |
|          (tweedle[dume]{3}\s*)+
 | |
| 
 | |
|        has matched "tweedledum tweedledee" the value of the captured substring
 | |
|        is  "tweedledee".  However,  if there are nested capturing subpatterns,
 | |
|        the corresponding captured values may have been set in previous  itera-
 | |
|        tions. For example, after
 | |
| 
 | |
|          /(a|(b))+/
 | |
| 
 | |
|        matches "aba" the value of the second captured substring is "b".
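Both capture rules can be observed in a minimal sketch. It uses
Python's re module as a stand-in for PCRE; that the two engines agree
here is an assumption that holds for these two patterns but not for
every nested case. Variable names are invented for the example.

```python
import re

# Only the final iteration of a repeated group is kept.
tweedle = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = tweedle.group(1)

# The inner group keeps the value set in an earlier iteration.
nested = re.match(r"(a|(b))+", "aba")
outer, inner = nested.groups()
```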
 | |
| 
 | |
| 
 | |
| ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
 | |
| 
 | |
|        With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
 | |
|        repetition, failure of what follows normally causes the  repeated  item
 | |
|        to  be  re-evaluated to see if a different number of repeats allows the
 | |
|        rest of the pattern to match. Sometimes it is useful to  prevent  this,
 | |
|        either  to  change the nature of the match, or to cause it to fail earlier
 | |
|        than it otherwise might, when the author of the pattern knows there  is
 | |
|        no point in carrying on.
 | |
| 
 | |
|        Consider,  for  example, the pattern \d+foo when applied to the subject
 | |
|        line
 | |
| 
 | |
|          123456bar
 | |
| 
 | |
|        After matching all 6 digits and then failing to match "foo", the normal
 | |
|        action  of  the matcher is to try again with only 5 digits matching the
 | |
|        \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
 | |
|        "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
 | |
|        the means for specifying that once a subpattern has matched, it is  not
 | |
|        to be re-evaluated in this way.
 | |
| 
 | |
|        If  we  use atomic grouping for the previous example, the matcher gives
 | |
|        up immediately on failing to match "foo" the first time.  The  notation
 | |
|        is a kind of special parenthesis, starting with (?> as in this example:
 | |
| 
 | |
|          (?>\d+)foo
 | |
| 
 | |
|        This  kind  of  parenthesis "locks up" the  part of the pattern it con-
 | |
|        tains once it has matched, and a failure further into  the  pattern  is
 | |
|        prevented  from  backtracking into it. Backtracking past it to previous
 | |
|        items, however, works as normal.
 | |
| 
 | |
|        An alternative description is that a subpattern of  this  type  matches
 | |
|        the  string  of  characters  that an identical standalone pattern would
 | |
|        match, if anchored at the current point in the subject string.
 | |
| 
 | |
|        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
 | |
|        such as the above example can be thought of as a maximizing repeat that
 | |
|        must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
 | |
|        pared  to  adjust  the number of digits they match in order to make the
 | |
|        rest of the pattern match, (?>\d+) can only match an entire sequence of
 | |
|        digits.
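The sketch below tries to make the "must swallow everything" behaviour
visible by following the digits with one more \d. It uses Python's re
module as a stand-in for PCRE, which is an assumption: Python accepts
(?>...) only from version 3.11, so on older versions the sketch falls
back to the well-known lookahead-plus-backreference emulation, which is
not PCRE's atomic group syntax. The subject and variable names are
invented.

```python
import re
import sys

# (?>...) is accepted by Python's re from 3.11; on older versions this sketch
# falls back to the lookahead-plus-backreference emulation (?=(...))\1.
if sys.version_info >= (3, 11):
    atomic_digits = r"(?>\d+)"
else:
    atomic_digits = r"(?=(\d+))\1"

subject = "123456"

# \d+ gives back one digit so that the trailing \d can match.
backtracking = re.search(r"\d+\d", subject)

# The atomic form keeps every digit, so the trailing \d never matches.
atomic = re.search(atomic_digits + r"\d", subject)
```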
 | |
| 
 | |
|        Atomic  groups in general can of course contain arbitrarily complicated
 | |
|        subpatterns, and can be nested. However, when  the  subpattern  for  an
 | |
|        atomic group is just a single repeated item, as in the example above, a
 | |
|        simpler notation, called a "possessive quantifier" can  be  used.  This
 | |
|        consists  of  an  additional  + character following a quantifier. Using
 | |
|        this notation, the previous example can be rewritten as
 | |
| 
 | |
|          \d++foo
 | |
| 
 | |
|        Note that a possessive quantifier can be used with an entire group, for
 | |
|        example:
 | |
| 
 | |
|          (abc|xyz){2,3}+
 | |
| 
 | |
|        Possessive   quantifiers   are   always  greedy;  the  setting  of  the
 | |
|        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
 | |
|        simpler  forms  of atomic group. However, there is no difference in the
 | |
|        meaning of a possessive quantifier and  the  equivalent  atomic  group,
 | |
|        though  there  may  be a performance difference; possessive quantifiers
 | |
|        should be slightly faster.
 | |
| 
 | |
|        The possessive quantifier syntax is an extension to the Perl  5.8  syn-
 | |
|        tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
 | |
|        edition of his book. Mike McCloskey liked it, so implemented it when he
 | |
|        built  Sun's Java package, and PCRE copied it from there. It ultimately
 | |
|        found its way into Perl at release 5.10.
 | |
| 
 | |
|        PCRE has an optimization that automatically "possessifies" certain sim-
 | |
|        ple  pattern  constructs.  For  example, the sequence A+B is treated as
 | |
|        A++B because there is no point in backtracking into a sequence  of  A's
 | |
|        when B must follow.
 | |
| 
 | |
|        When  a  pattern  contains an unlimited repeat inside a subpattern that
 | |
|        can itself be repeated an unlimited number of  times,  the  use  of  an
 | |
|        atomic  group  is  the  only way to avoid some failing matches taking a
 | |
|        very long time indeed. The pattern
 | |
| 
 | |
|          (\D+|<\d+>)*[!?]
 | |
| 
 | |
|        matches an unlimited number of substrings that either consist  of  non-
 | |
|        digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
 | |
|        matches, it runs quickly. However, if it is applied to
 | |
| 
 | |
|          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 | |
| 
 | |
|        it takes a long time before reporting  failure.  This  is  because  the
 | |
|        string  can be divided between the internal \D+ repeat and the external
 | |
|        * repeat in a large number of ways, and all  have  to  be  tried.  (The
 | |
|        example  uses  [!?]  rather than a single character at the end, because
 | |
|        both PCRE and Perl have an optimization that allows  for  fast  failure
 | |
|        when  a single character is used. They remember the last single charac-
 | |
|        ter that is required for a match, and fail early if it is  not  present
 | |
|        in  the  string.)  If  the pattern is changed so that it uses an atomic
 | |
|        group, like this:
 | |
| 
 | |
|          ((?>\D+)|<\d+>)*[!?]
 | |
| 
 | |
|        sequences of non-digits cannot be broken, and failure happens quickly.
 | |
| 
 | |
| 
 | |
| BACK REFERENCES
 | |
| 
 | |
|        Outside a character class, a backslash followed by a digit greater than
 | |
|        0 (and possibly further digits) is a back reference to a capturing sub-
 | |
|        pattern earlier (that is, to its left) in the pattern,  provided  there
 | |
|        have been that many previous capturing left parentheses.
 | |
| 
 | |
|        However, if the decimal number following the backslash is less than 10,
 | |
|        it is always taken as a back reference, and causes  an  error  only  if
 | |
|        there  are  not that many capturing left parentheses in the entire pat-
 | |
|        tern. In other words, the parentheses that are referenced need  not  be
 | |
|        to  the left of the reference for numbers less than 10. A "forward back
 | |
|        reference" of this type can make sense when a  repetition  is  involved
 | |
|        and  the  subpattern to the right has participated in an earlier itera-
 | |
|        tion.
 | |
| 
 | |
|        It is not possible to have a numerical "forward back  reference"  to  a
 | |
|        subpattern  whose  number  is  10  or  more using this syntax because a
 | |
|        sequence such as \50 is interpreted as a character  defined  in  octal.
 | |
|        See the subsection entitled "Non-printing characters" above for further
 | |
|        details of the handling of digits following a backslash.  There  is  no
 | |
|        such  problem  when named parentheses are used. A back reference to any
 | |
|        subpattern is possible using named parentheses (see below).
 | |
| 
 | |
|        Another way of avoiding the ambiguity inherent in  the  use  of  digits
 | |
|        following a backslash is to use the \g escape sequence, which is a fea-
 | |
|        ture introduced in Perl 5.10.  This  escape  must  be  followed  by  an
 | |
|        unsigned  number  or  a negative number, optionally enclosed in braces.
 | |
|        These examples are all identical:
 | |
| 
 | |
|          (ring), \1
 | |
|          (ring), \g1
 | |
|          (ring), \g{1}
 | |
| 
 | |
|        An unsigned number specifies an absolute reference without the  ambigu-
 | |
|        ity that is present in the older syntax. It is also useful when literal
 | |
|        digits follow the reference. A negative number is a relative reference.
 | |
|        Consider this example:
 | |
| 
 | |
|          (abc(def)ghi)\g{-1}
 | |
| 
 | |
|        The sequence \g{-1} is a reference to the most recently started
 | |
|        capturing subpattern before \g; that is, it is equivalent to \2.
 | |
|        Similarly, \g{-2} would be equivalent to \1. The use of relative
 | |
|        references can be helpful in long patterns, and also in patterns that
 | |
|        are created by joining together fragments that contain references
 | |
|        within themselves.
 | |
| 
 | |
|        A  back  reference matches whatever actually matched the capturing sub-
 | |
|        pattern in the current subject string, rather  than  anything  matching
 | |
|        the subpattern itself (see "Subpatterns as subroutines" below for a way
 | |
|        of doing that). So the pattern
 | |
| 
 | |
|          (sens|respons)e and \1ibility
 | |
| 
 | |
|        matches "sense and sensibility" and "response and responsibility",  but
 | |
|        not  "sense and responsibility". If caseful matching is in force at the
 | |
|        time of the back reference, the case of letters is relevant. For  exam-
 | |
|        ple,
 | |
| 
 | |
|          ((?i)rah)\s+\1
 | |
| 
 | |
|        matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
 | |
|        original capturing subpattern is matched caselessly.
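A short sketch of both examples, using Python's re module as a stand-in
for PCRE (an assumption). Python does not allow (?i) in the middle of a
pattern, so the scoped form (?i:rah) is substituted for it; the helper
name is invented for the example.

```python
import re

# Python's re stands in for PCRE. Python requires the scoped form (?i:...)
# for a mid-pattern flag, so it replaces the (?i) of the text above.
def whole(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

sense = r"(sens|respons)e and \1ibility"
sense_results = [whole(sense, s) for s in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)]

# The group matches caselessly, but \1 compares the captured text casefully.
rah = r"((?i:rah))\s+\1"
rah_results = [whole(rah, s) for s in ("rah rah", "RAH RAH", "RAH rah")]
```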
 | |
| 
 | |
|        There are several different ways of writing back  references  to  named
 | |
|        subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
 | |
|        \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
 | |
|        unified back reference syntax, in which \g can be used for both numeric
 | |
|        and named references, is also supported. We  could  rewrite  the  above
 | |
|        example in any of the following ways:
 | |
| 
 | |
|          (?<p1>(?i)rah)\s+\k<p1>
 | |
|          (?'p1'(?i)rah)\s+\k{p1}
 | |
|          (?P<p1>(?i)rah)\s+(?P=p1)
 | |
|          (?<p1>(?i)rah)\s+\g{p1}
 | |
| 
 | |
|        A  subpattern  that  is  referenced  by  name may appear in the pattern
 | |
|        before or after the reference.
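Of the forms listed above, the Python syntax (?P=name) can be tried
directly in Python's re module, used here as a stand-in for PCRE (an
assumption). Two differences from PCRE apply to this sketch: Python
needs the scoped (?i:...) flag, and it requires the named group to
appear before the reference.

```python
import re

# Python's re stands in for PCRE; (?P<p1>...) and (?P=p1) are shared syntax.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

same_case = named.fullmatch("rah rah")
mixed_case = named.fullmatch("RAH rah")

captured = same_case.group("p1")
```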
 | |
| 
 | |
|        There may be more than one back reference to the same subpattern. If  a
 | |
|        subpattern  has  not actually been used in a particular match, any back
 | |
|        references to it always fail. For example, the pattern
 | |
| 
 | |
|          (a|(bc))\2
 | |
| 
 | |
|        always fails if it starts to match "a" rather than "bc". Because  there
 | |
|        may  be  many  capturing parentheses in a pattern, all digits following
 | |
|        the backslash are taken as part of a potential back  reference  number.
 | |
|        If the pattern continues with a digit character, some delimiter must be
 | |
|        used to terminate the back reference. If the  PCRE_EXTENDED  option  is
 | |
|        set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
 | |
|        ments" below) can be used.
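The failure of a reference to an unused subpattern can be checked with
a minimal sketch, using Python's re module as a stand-in for PCRE (an
assumption; both engines fail such a reference). The subjects are
invented for the example.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# When the first alternative matches "a", group 2 is unset and \2 fails.
took_a = pattern.search("abc")

# When the second alternative matches "bc", \2 must repeat it.
took_bc = pattern.search("bcbc")
```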
 | |
| 
 | |
|        A back reference that occurs inside the parentheses to which it  refers
 | |
|        fails  when  the subpattern is first used, so, for example, (a\1) never
 | |
|        matches.  However, such references can be useful inside  repeated  sub-
 | |
|        patterns. For example, the pattern
 | |
| 
 | |
|          (a|b\1)+
 | |
| 
 | |
|        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
 | |
|        ation of the subpattern,  the  back  reference  matches  the  character
 | |
|        string  corresponding  to  the previous iteration. In order for this to
 | |
|        work, the pattern must be such that the first iteration does  not  need
 | |
|        to  match the back reference. This can be done using alternation, as in
 | |
|        the example above, or by a quantifier with a minimum of zero.
 | |
| 
 | |
| 
 | |
| ASSERTIONS
 | |
| 
 | |
|        An assertion is a test on the characters  following  or  preceding  the
 | |
|        current  matching  point that does not actually consume any characters.
 | |
|        The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
 | |
|        described above.
 | |
| 
 | |
|        More  complicated  assertions  are  coded as subpatterns. There are two
 | |
|        kinds: those that look ahead of the current  position  in  the  subject
 | |
|        string,  and  those  that  look  behind  it. An assertion subpattern is
 | |
|        matched in the normal way, except that it does not  cause  the  current
 | |
|        matching position to be changed.
 | |
| 
 | |
|        Assertion  subpatterns  are  not  capturing subpatterns, and may not be
 | |
|        repeated, because it makes no sense to assert the  same  thing  several
 | |
|        times.  If  any kind of assertion contains capturing subpatterns within
 | |
|        it, these are counted for the purposes of numbering the capturing  sub-
 | |
|        patterns in the whole pattern.  However, substring capturing is carried
 | |
|        out only for positive assertions, because it does not  make  sense  for
 | |
|        negative assertions.
 | |
| 
 | |
|    Lookahead assertions
 | |
| 
 | |
|        Lookahead assertions start with (?= for positive assertions and (?! for
 | |
|        negative assertions. For example,
 | |
| 
 | |
|          \w+(?=;)
 | |
| 
 | |
|        matches a word followed by a semicolon, but does not include the  semi-
 | |
|        colon in the match, and
 | |
| 
 | |
|          foo(?!bar)
 | |
| 
 | |
|        matches  any  occurrence  of  "foo" that is not followed by "bar". Note
 | |
|        that the apparently similar pattern
 | |
| 
 | |
|          (?!foo)bar
 | |
| 
 | |
|        does not find an occurrence of "bar"  that  is  preceded  by  something
 | |
|        other  than "foo"; it finds any occurrence of "bar" whatsoever, because
 | |
|        the assertion (?!foo) is always true when the next three characters are
 | |
|        "bar". A lookbehind assertion is needed to achieve the other effect.
 | |
| 
 | |
|        If you want to force a matching failure at some point in a pattern, the
 | |
|        most convenient way to do it is  with  (?!)  because  an  empty  string
 | |
|        always  matches, so an assertion that requires there not to be an empty
 | |
|        string must always fail.
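The lookahead examples above can be exercised with a short sketch. It
uses Python's re module as a stand-in for PCRE (an assumption; the
lookahead syntax is the same), with invented subject strings and
zero-based offsets.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
word = re.search(r"\w+(?=;)", "int count;").group()

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo"; the assertion looks ahead at "bar".
bar_offset = re.search(r"(?!foo)bar", "foobar").start()

# (?!) can never succeed, so the search finds nothing.
forced_failure = re.search(r"(?!)", "any subject")
```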
 | |
| 
 | |
|    Lookbehind assertions
 | |
| 
 | |
|        Lookbehind assertions start with (?<= for positive assertions and  (?<!
 | |
|        for negative assertions. For example,
 | |
| 
 | |
|          (?<!foo)bar
 | |
| 
 | |
|        does  find  an  occurrence  of "bar" that is not preceded by "foo". The
 | |
|        contents of a lookbehind assertion are restricted  such  that  all  the
 | |
|        strings it matches must have a fixed length. However, if there are sev-
 | |
|        eral top-level alternatives, they do not all  have  to  have  the  same
 | |
|        fixed length. Thus
 | |
| 
 | |
|          (?<=bullock|donkey)
 | |
| 
 | |
|        is permitted, but
 | |
| 
 | |
|          (?<!dogs?|cats?)
 | |
| 
 | |
|        causes  an  error at compile time. Branches that match different length
 | |
|        strings are permitted only at the top level of a lookbehind  assertion.
 | |
|        This  is  an  extension  compared  with  Perl (at least for 5.8), which
 | |
|        requires all branches to match the same length of string. An  assertion
 | |
|        such as
 | |
| 
 | |
|          (?<=ab(c|de))
 | |
| 
 | |
|        is  not  permitted,  because  its single top-level branch can match two
 | |
|        different lengths, but it is acceptable if rewritten to  use  two  top-
 | |
|        level branches:
 | |
| 
 | |
|          (?<=abc|abde)
 | |
| 
 | |
|        In some cases, the Perl 5.10 escape sequence \K (see above) can be used
 | |
|        instead of a lookbehind assertion; \K is not restricted to a fixed
 | |
|        length.
 | |
| 
 | |
|        The  implementation  of lookbehind assertions is, for each alternative,
 | |
|        to temporarily move the current position back by the fixed  length  and
 | |
|        then try to match. If there are insufficient characters before the cur-
 | |
|        rent position, the assertion fails.
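A minimal sketch of lookbehind behaviour, using Python's re module as a
stand-in for PCRE. This is an assumption with a known limit: Python is
stricter than PCRE and rejects top-level alternatives of different
lengths, so only single fixed-length lookbehinds are shown. Subjects
and variable names are invented.

```python
import re

# Python's re stands in for PCRE. It is stricter than PCRE: every
# alternative in a lookbehind must have the same length.
preceded_by_foo = re.search(r"(?<!foo)bar", "foobar")
preceded_by_baz = re.search(r"(?<!foo)bar", "bazbar")

# Too few characters before the position: a positive lookbehind fails ...
too_short_positive = re.search(r"(?<=abc)d", "bcd")
# ... which means a negative lookbehind at the subject start succeeds.
too_short_negative = re.search(r"(?<!foo)bar", "bar")
```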
 | |
| 
 | |
|        PCRE does not allow the \C escape (which matches a single byte in UTF-8
 | |
|        mode)  to appear in lookbehind assertions, because it makes it impossi-
 | |
|        ble to calculate the length of the lookbehind. The \X and  \R  escapes,
 | |
|        which can match different numbers of bytes, are also not permitted.
 | |
| 
 | |
|        Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
 | |
|        assertions to specify efficient matching at  the  end  of  the  subject
 | |
|        string. Consider a simple pattern such as
 | |
| 
 | |
|          abcd$
 | |
| 
 | |
|        when  applied  to  a  long string that does not match. Because matching
 | |
|        proceeds from left to right, PCRE will look for each "a" in the subject
 | |
|        and  then  see  if what follows matches the rest of the pattern. If the
 | |
|        pattern is specified as
 | |
| 
 | |
|          ^.*abcd$
 | |
| 
 | |
|        the initial .* matches the entire string at first, but when this  fails
 | |
|        (because there is no following "a"), it backtracks to match all but the
 | |
|        last character, then all but the last two characters, and so  on.  Once
 | |
|        again  the search for "a" covers the entire string, from right to left,
 | |
|        so we are no better off. However, if the pattern is written as
 | |
| 
 | |
|          ^.*+(?<=abcd)
 | |
| 
 | |
|        there can be no backtracking for the .*+ item; it can  match  only  the
 | |
|        entire  string.  The subsequent lookbehind assertion does a single test
 | |
|        on the last four characters. If it fails, the match fails  immediately.
 | |
|        For  long  strings, this approach makes a significant difference to the
 | |
|        processing time.
 | |
| 
 | |
|    Using multiple assertions
 | |
| 
 | |
|        Several assertions (of any sort) may occur in succession. For example,
 | |
| 
 | |
|          (?<=\d{3})(?<!999)foo
 | |
| 
 | |
|        matches "foo" preceded by three digits that are not "999". Notice  that
 | |
|        each  of  the  assertions is applied independently at the same point in
 | |
|        the subject string. First there is a  check  that  the  previous  three
 | |
|        characters  are  all  digits,  and  then there is a check that the same
 | |
|        three characters are not "999". This pattern does not match "foo"
 | |
|        preceded by six characters, the first three of which are digits and
 | |
|        the last three of which are not "999". For example, it doesn't match
 | |
|        "123abcfoo". A pattern to do that is
 | |
| 
 | |
|          (?<=\d{3}...)(?<!999)foo
 | |
| 
 | |
|        This  time  the  first assertion looks at the preceding six characters,
 | |
|        checking that the first three are digits, and then the second assertion
 | |
|        checks that the preceding three characters are not "999".
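The two patterns can be compared side by side in a short sketch, using
Python's re module as a stand-in for PCRE (an assumption; both
lookbehinds are fixed-length, so Python accepts them). The helper name
and subjects are invented for the example.

```python
import re

def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return re.search(pattern, subject) is not None

# Both assertions inspect the same three characters before "foo".
same_point = r"(?<=\d{3})(?<!999)foo"
same_point_results = [found(same_point, s) for s in ("123foo", "999foo", "123abcfoo")]

# Here the first assertion looks back six characters, the second three.
six_back = r"(?<=\d{3}...)(?<!999)foo"
six_back_results = [found(six_back, s) for s in ("123abcfoo", "123999foo", "abcdeffoo")]
```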
 | |
| 
 | |
|        Assertions can be nested in any combination. For example,
 | |
| 
 | |
|          (?<=(?<!foo)bar)baz
 | |
| 
 | |
|        matches  an occurrence of "baz" that is preceded by "bar" which in turn
 | |
|        is not preceded by "foo", while
 | |
| 
 | |
|          (?<=\d{3}(?!999)...)foo
 | |
| 
 | |
|        is another pattern that matches "foo" preceded by three digits and  any
 | |
|        three characters that are not "999".
 | |
| 
 | |
| 
 | |
| CONDITIONAL SUBPATTERNS
 | |
| 
 | |
|        It  is possible to cause the matching process to obey a subpattern con-
 | |
|        ditionally or to choose between two alternative subpatterns,  depending
 | |
|        on  the result of an assertion, or whether a previous capturing subpat-
 | |
|        tern matched or not. The two possible forms of  conditional  subpattern
 | |
|        are
 | |
| 
 | |
|          (?(condition)yes-pattern)
 | |
|          (?(condition)yes-pattern|no-pattern)
 | |
| 
 | |
|        If  the  condition is satisfied, the yes-pattern is used; otherwise the
 | |
|        no-pattern (if present) is used. If there are more  than  two  alterna-
 | |
|        tives in the subpattern, a compile-time error occurs.
 | |
| 
 | |
|        There  are  four  kinds of condition: references to subpatterns, refer-
 | |
|        ences to recursion, a pseudo-condition called DEFINE, and assertions.
 | |
| 
 | |
|    Checking for a used subpattern by number
 | |
| 
 | |
|        If the text between the parentheses consists of a sequence  of  digits,
 | |
|        the  condition  is  true if the capturing subpattern of that number has
 | |
|        previously matched. An alternative notation is to  precede  the  digits
 | |
|        with a plus or minus sign. In this case, the subpattern number is rela-
 | |
|        tive rather than absolute.  The most recently opened parentheses can be
 | |
|        referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In
 | |
|        looping constructs it can also make sense to refer to subsequent groups
 | |
|        with constructs such as (?(+2).
 | |
| 
 | |
|        Consider  the  following  pattern, which contains non-significant white
 | |
|        space to make it more readable (assume the PCRE_EXTENDED option) and to
 | |
|        divide it into three parts for ease of discussion:
 | |
| 
 | |
|          ( \( )?    [^()]+    (?(1) \) )
 | |
| 
 | |
|        The  first  part  matches  an optional opening parenthesis, and if that
 | |
|        character is present, sets it as the first captured substring. The sec-
 | |
|        ond  part  matches one or more characters that are not parentheses. The
 | |
|        third part is a conditional subpattern that tests whether the first set
 | |
|        of parentheses matched or not. If they did, that is, if the subject started
 | |
|        with an opening parenthesis, the condition is true, and so the yes-pat-
 | |
|        tern  is  executed  and  a  closing parenthesis is required. Otherwise,
 | |
|        since no-pattern is not present, the  subpattern  matches  nothing.  In
 | |
|        other  words,  this  pattern  matches  a  sequence  of non-parentheses,
 | |
|        optionally enclosed in parentheses.
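The behaviour of this conditional can be checked with a brief sketch.
It uses Python's re module as a stand-in for PCRE (an assumption; the
(?(1)...) form is shared), with re.VERBOSE playing the role of
PCRE_EXTENDED. The subjects are invented for the example.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: pattern white space is ignored.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

# Map each subject to whether the whole of it matches.
results = {
    subject: optional_parens.fullmatch(subject) is not None
    for subject in ("abc", "(abc)", "(abc", "abc)")
}
```

Requiring the whole subject to match is what exposes the unbalanced
cases; an unanchored search would still find "abc" inside them.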
 | |
| 
 | |
|        If you were embedding this pattern in a larger one,  you  could  use  a
 | |
|        relative reference:
 | |
| 
 | |
|          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
 | |
| 
 | |
|        This  makes  the  fragment independent of the parentheses in the larger
 | |
|        pattern.
 | |
| 
 | |
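       The optional-parenthesis pattern above can be tried out  directly.  The
       sketch  below  is an illustration only: it uses Python's re module, not
       PCRE (an assumption made to keep the  example  self-contained),  relying
       on  the fact that re accepts the same (?(1)...) syntax. re.VERBOSE stands
       in for PCRE_EXTENDED, and the helper name is_match is hypothetical.  The
       relative form (?(-1)...) is a PCRE feature and is not exercised here.

```python
import re

# Same three-part pattern as in the text; re.VERBOSE plays the role of
# PCRE_EXTENDED, so the white space in the pattern is not significant.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_match(subject):
    """Return True if the whole subject matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


if __name__ == "__main__":
    # An unbalanced parenthesis on either side should be rejected.
    for subject in ["abc", "(abc)", "(abc", "abc)"]:
        print(repr(subject), is_match(subject))
```

       Anchoring the whole subject (fullmatch) makes the effect of the  condi-
       tion  visible;  an  unanchored search would simply find "abc" inside an
       unbalanced string.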
|    Checking for a used subpattern by name
 | |
| 
 | |
|        Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
 | |
|        used  subpattern  by  name.  For compatibility with earlier versions of
 | |
|        PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
 | |
|        also  recognized. However, there is a possible ambiguity with this syn-
 | |
|        tax, because subpattern names may  consist  entirely  of  digits.  PCRE
 | |
|        looks  first for a named subpattern; if it cannot find one and the name
 | |
|        consists entirely of digits, PCRE looks for a subpattern of  that  num-
 | |
|        ber,  which must be greater than zero. Using subpattern names that con-
 | |
|        sist entirely of digits is not recommended.
 | |
| 
 | |
|        Rewriting the above example to use a named subpattern gives this:
 | |
| 
 | |
|          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
 | |
| 
 | |
| 
 | |
|    Checking for pattern recursion
 | |
| 
 | |
|        If the condition is the string (R), and there is no subpattern with the
 | |
|        name  R, the condition is true if a recursive call to the whole pattern
 | |
|        or any subpattern has been made. If digits or a name preceded  by  an
 | |
|        ampersand follow the letter R, for example:
 | |
| 
 | |
|          (?(R3)...) or (?(R&name)...)
 | |
| 
 | |
|        the  condition is true if the most recent recursion is into the subpat-
 | |
|        tern whose number or name is given. This condition does not  check  the
 | |
|        entire recursion stack.
 | |
| 
 | |
|        At  "top  level", all these recursion test conditions are false. Recur-
 | |
|        sive patterns are described below.
 | |
| 
 | |
|    Defining subpatterns for use by reference only
 | |
| 
 | |
|        If the condition is the string (DEFINE), and  there  is  no  subpattern
 | |
|        with  the  name  DEFINE,  the  condition is always false. In this case,
 | |
|        there may be only one alternative  in  the  subpattern.  It  is  always
 | |
|        skipped  if  control  reaches  this  point  in the pattern; the idea of
 | |
|        DEFINE is that it can be used to define "subroutines" that can be  ref-
 | |
|        erenced  from elsewhere. (The use of "subroutines" is described below.)
 | |
|        For example, a pattern to match an IPv4 address could be  written  like
 | |
|        this (ignore whitespace and line breaks):
 | |
| 
 | |
|          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
 | |
|          \b (?&byte) (\.(?&byte)){3} \b
 | |
| 
 | |
|        The  first  part of the pattern is a DEFINE group inside which another
 | |
|        group named "byte" is defined. This matches an individual component  of
 | |
|        an  IPv4  address  (a number less than 256). When matching takes place,
 | |
|        this part of the pattern is skipped because DEFINE acts  like  a  false
 | |
|        condition.
 | |
| 
 | |
|        The rest of the pattern uses references to the named group to match the
 | |
|        four dot-separated components of an IPv4 address, insisting on  a  word
 | |
|        boundary at each end.
 | |
| 
 | |
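       The idea of defining "byte" once and referring to it  several  times  can
       be  imitated  in a host language by composing the pattern from a string.
       The sketch below does this with Python's re module, which  has  no  DEFINE
       or (?&name) support; the names BYTE, IPV4 and is_ipv4 are hypothetical,
       and a non-capturing group is used where the PCRE pattern captures.

```python
import re

# One component of an IPv4 address, written once and reused, much as the
# DEFINE group lets the PCRE pattern reuse "byte".
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Textual equivalent of:  \b (?&byte) (\.(?&byte)){3} \b
# re.ASCII restricts \d and \b to ASCII, as PCRE does by default.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(subject):
    """Return True if the whole subject is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ["192.168.0.1", "256.1.1.1", "1.2.3"]:
        print(subject, is_ipv4(subject))
```

       Unlike a PCRE subroutine call, the pasted text is not treated as an atomic
       group, so the two forms are not guaranteed to backtrack identically.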
|    Assertion conditions
 | |
| 
 | |
|        If  the  condition  is  not  in any of the above formats, it must be an
 | |
|        assertion.  This may be a positive or negative lookahead or  lookbehind
 | |
|        assertion.  Consider  this  pattern,  again  containing non-significant
 | |
|        white space, and with the two alternatives on the second line:
 | |
| 
 | |
|          (?(?=[^a-z]*[a-z])
 | |
|          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
 | |
| 
 | |
|        The condition  is  a  positive  lookahead  assertion  that  matches  an
 | |
|        optional  sequence of non-letters followed by a letter. In other words,
 | |
|        it tests for the presence of at least one letter in the subject.  If  a
 | |
|        letter  is found, the subject is matched against the first alternative;
 | |
|        otherwise it is  matched  against  the  second.  This  pattern  matches
 | |
|        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
 | |
|        letters and dd are digits.
 | |
| 
 | |
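       An assertion condition (?(?=A)yes|no) behaves like a  pair  of  alterna-
       tives  guarded  by  complementary  lookaheads,  (?:(?=A)yes|(?!A)no). The
       sketch below uses that rewriting because Python's re  module,  used  here
       only  for  illustration,  does  not  accept an assertion as a condition.
       DATE_LIKE and is_date_like are hypothetical names.

```python
import re

# (?(?=A) yes | no) rewritten as (?: (?=A) yes | (?!A) no ).
# A is the "a letter lies somewhere ahead" test from the text.
DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE | re.ASCII,
)


def is_date_like(subject):
    """Return True for subjects of the form dd-aaa-dd or dd-dd-dd."""
    return DATE_LIKE.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ["12-abc-34", "12-34-56", "12-ab-34"]:
        print(subject, is_date_like(subject))
```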
| 
 | |
| COMMENTS
 | |
| 
 | |
|        The sequence (?# marks the start of a comment that continues up to  the
 | |
|        next  closing  parenthesis.  Nested  parentheses are not permitted. The
 | |
|        characters that make up a comment play no part in the pattern  matching
 | |
|        at all.
 | |
| 
 | |
|        If  the PCRE_EXTENDED option is set, an unescaped # character outside a
 | |
|        character class introduces a  comment  that  continues  to  immediately
 | |
|        after the next newline in the pattern.
 | |
| 
 | |
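       Both comment forms can be seen in a short experiment. The sketch  below
       uses  Python's  re  module as a stand-in for PCRE (an assumption); its
       re.VERBOSE flag is the analogue of PCRE_EXTENDED, and the pattern  names
       are hypothetical.

```python
import re

# An inline (?#...) comment takes no part in matching.
INLINE = re.compile(r"ab(?#this text is ignored)c")

# In extended mode an unescaped # outside a character class starts a
# comment that runs to the end of the line.
EXTENDED = re.compile(
    r"""
    \d{4}   # year
    -
    \d{2}   # month
    """,
    re.VERBOSE,
)

# Inside a character class, # is an ordinary character even in extended mode.
HASH_NUMBER = re.compile(r"[#] \d+", re.VERBOSE)

if __name__ == "__main__":
    print(bool(INLINE.fullmatch("abc")))
    print(bool(EXTENDED.fullmatch("2009-04")))
    print(bool(HASH_NUMBER.fullmatch("#42")))
```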
| 
 | |
| RECURSIVE PATTERNS
 | |
| 
 | |
|        Consider  the problem of matching a string in parentheses, allowing for
 | |
|        unlimited nested parentheses. Without the use of  recursion,  the  best
 | |
|        that  can  be  done  is  to use a pattern that matches up to some fixed
 | |
|        depth of nesting. It is not possible to  handle  an  arbitrary  nesting
 | |
|        depth.
 | |
| 
 | |
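       The fixed-depth limitation can be made concrete. The sketch below builds
       a non-recursive pattern for a chosen maximum depth by  wrapping  one  more
       level  around  the previous pattern each time. It is written for Python's
       re module purely as an illustration; the function names are hypothetical.

```python
import re


def fixed_depth_pattern(depth):
    """Build a pattern for parentheses nested at most `depth` levels deep."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    pattern = r"\([^()]*\)"  # innermost level: no nested parentheses
    for _ in range(depth - 1):
        # Wrap one more level: non-parentheses or a complete inner group.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


def matches_up_to(depth, subject):
    """Return True if subject is balanced and nested no deeper than depth."""
    return re.fullmatch(fixed_depth_pattern(depth), subject) is not None


if __name__ == "__main__":
    print(matches_up_to(2, "(a(b)c)"))
    print(matches_up_to(2, "(a(b(c)))"))
    print(matches_up_to(3, "(a(b(c)))"))
```

       Whatever depth is chosen, a subject nested one level deeper is rejected,
       and the pattern text grows with every level added.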
|        For some time, Perl has provided a facility that allows regular expres-
 | |
|        sions to recurse (amongst other things). It does this by  interpolating
 | |
|        Perl  code in the expression at run time, and the code can refer to the
 | |
|        expression itself. A Perl pattern using code interpolation to solve the
 | |
|        parentheses problem can be created like this:
 | |
| 
 | |
|          $re = qr{\( (?: (?>[^()]+) | (??{$re}) )* \)}x;
 | |
| 
 | |
|        The (??{...}) item interpolates Perl code at run time, and in this case
 | |
|        refers recursively to the pattern in which it appears.
 | |
| 
 | |
|        Obviously, PCRE cannot support the interpolation of Perl code. Instead,
 | |
|        it  supports  special  syntax  for recursion of the entire pattern, and
 | |
|        also for individual subpattern recursion.  After  its  introduction  in
 | |
|        PCRE  and  Python,  this  kind of recursion was introduced into Perl at
 | |
|        release 5.10.
 | |
| 
 | |
|        A special item that consists of (? followed by a  number  greater  than
 | |
|        zero and a closing parenthesis is a recursive call of the subpattern of
 | |
|        the given number, provided that it occurs inside that  subpattern.  (If
 | |
|        not,  it  is  a  "subroutine" call, which is described in the next sec-
 | |
|        tion.) The special item (?R) or (?0) is a recursive call of the  entire
 | |
|        regular expression.
 | |
| 
 | |
|        In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
 | |
|        always treated as an atomic group. That is, once it has matched some of
 | |
|        the subject string, it is never re-entered, even if it contains untried
 | |
|        alternatives and there is a subsequent matching failure.
 | |
| 
 | |
|        This PCRE pattern solves the nested  parentheses  problem  (assume  the
 | |
|        PCRE_EXTENDED option is set so that white space is ignored):
 | |
| 
 | |
|          \( ( (?>[^()]+) | (?R) )* \)
 | |
| 
 | |
|        First  it matches an opening parenthesis. Then it matches any number of
 | |
|        substrings which can either be a  sequence  of  non-parentheses,  or  a
 | |
|        recursive  match  of the pattern itself (that is, a correctly parenthe-
 | |
|        sized substring).  Finally there is a closing parenthesis.
 | |
| 
 | |
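       The three steps just described map directly onto a  recursive  function.
       The  sketch  below  is  a hand-written Python matcher, not a regular ex-
       pression and not PCRE code; it is offered only to show the control  flow
       that (?R) expresses. The name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Match a balanced parenthesized group starting at pos.

    Returns the offset just past the closing parenthesis, or None if
    there is no balanced group at pos.
    """
    # First, an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    # Then any number of non-parentheses or nested groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1  # finally, the closing parenthesis
        if ch == "(":
            end = match_parens(subject, pos)  # the recursive step
            if end is None:
                return None
            pos = end
        else:
            pos += 1
    return None  # ran out of subject before the group was closed


if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))
    print(match_parens("(a(b)"))
```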
|        If this were part of a larger pattern, you would not  want  to  recurse
 | |
|        the entire pattern, so instead you could use this:
 | |
| 
 | |
|          ( \( ( (?>[^()]+) | (?1) )* \) )
 | |
| 
 | |
|        We  have  put the pattern into parentheses, and caused the recursion to
 | |
|        refer to them instead of the whole pattern.
 | |
| 
 | |
|        In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
 | |
|        tricky.  This is made easier by the use of relative references. (A Perl
 | |
|        5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write
 | |
|        (?-2) to refer to the second most recently opened parentheses preceding
 | |
|        the recursion. In other  words,  a  negative  number  counts  capturing
 | |
|        parentheses leftwards from the point at which it is encountered.
 | |
| 
 | |
|        It  is  also  possible  to refer to subsequently opened parentheses, by
 | |
|        writing references such as (?+2). However, these  cannot  be  recursive
 | |
|        because  the  reference  is  not inside the parentheses that are refer-
 | |
|        enced. They are always "subroutine" calls, as  described  in  the  next
 | |
|        section.
 | |
| 
 | |
|        An  alternative  approach is to use named parentheses instead. The Perl
 | |
|        syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
 | |
|        supported. We could rewrite the above example as follows:
 | |
| 
 | |
|          (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
 | |
| 
 | |
|        If  there  is more than one subpattern with the same name, the earliest
 | |
|        one is used.
 | |
| 
 | |
|        This particular example pattern that we have been looking  at  contains
 | |
|        nested  unlimited repeats, and so the use of atomic grouping for match-
 | |
|        ing strings of non-parentheses is important when applying  the  pattern
 | |
|        to strings that do not match. For example, when this pattern is applied
 | |
|        to
 | |
| 
 | |
|          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
 | |
| 
 | |
|        it yields "no match" quickly. However, if atomic grouping is not  used,
 | |
|        the  match  runs  for a very long time indeed because there are so many
 | |
|        different ways the + and * repeats can carve up the  subject,  and  all
 | |
|        have to be tested before failure can be reported.
 | |
| 
 | |
|        At the end of a match, the values set for any capturing subpatterns are
 | |
|        those from the outermost level of the recursion at which the subpattern
 | |
|        value  is  set.   If  you want to obtain intermediate values, a callout
 | |
|        function can be used (see below and the pcrecallout documentation).  If
 | |
|        the pattern above is matched against
 | |
| 
 | |
|          (ab(cd)ef)
 | |
| 
 | |
|        the  value  for  the  capturing  parentheses is "ef", which is the last
 | |
|        value taken on at the top level. If additional parentheses  are  added,
 | |
|        giving
 | |
| 
 | |
|          \( ( ( (?>[^()]+) | (?R) )* ) \)
 | |
|             ^                        ^
 | |
|             ^                        ^
 | |
| 
 | |
|        the  string  they  capture is "ab(cd)ef", the contents of the top level
 | |
|        parentheses. If there are more than 15 capturing parentheses in a  pat-
 | |
|        tern, PCRE has to obtain extra memory to store data during a recursion,
 | |
|        which it does by using pcre_malloc, freeing  it  via  pcre_free  after-
 | |
|        wards.  If  no  memory  can  be  obtained,  the  match  fails  with the
 | |
|        PCRE_ERROR_NOMEMORY error.
 | |
| 
 | |
|        Do not confuse the (?R) item with the condition (R),  which  tests  for
 | |
|        recursion.   Consider  this pattern, which matches text in angle brack-
 | |
|        ets, allowing for arbitrary nesting. Only digits are allowed in  nested
 | |
|        brackets  (that is, when recursing), whereas any characters are permit-
 | |
|        ted at the outer level.
 | |
| 
 | |
|          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
 | |
| 
 | |
|        In this pattern, (?(R) is the start of a conditional  subpattern,  with
 | |
|        two  different  alternatives for the recursive and non-recursive cases.
 | |
|        The (?R) item is the actual recursive call.
 | |
| 
 | |
| 
 | |
| SUBPATTERNS AS SUBROUTINES
 | |
| 
 | |
|        If the syntax for a recursive subpattern reference (either by number or
 | |
|        by  name)  is used outside the parentheses to which it refers, it oper-
 | |
|        ates like a subroutine in a programming language. The "called"  subpat-
 | |
|        tern may be defined before or after the reference. A numbered reference
 | |
|        can be absolute or relative, as in these examples:
 | |
| 
 | |
|          (...(absolute)...)...(?2)...
 | |
|          (...(relative)...)...(?-1)...
 | |
|          (...(?+1)...(relative)...
 | |
| 
 | |
|        An earlier example pointed out that the pattern
 | |
| 
 | |
|          (sens|respons)e and \1ibility
 | |
| 
 | |
|        matches "sense and sensibility" and "response and responsibility",  but
 | |
|        not "sense and responsibility". If instead the pattern
 | |
| 
 | |
|          (sens|respons)e and (?1)ibility
 | |
| 
 | |
|        is  used, it does match "sense and responsibility" as well as the other
 | |
|        two strings. Another example is  given  in  the  discussion  of  DEFINE
 | |
|        above.
 | |
| 
 | |
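       The difference between a back reference  and  a  subroutine  call  can  be
       imitated in a host language. In the sketch below, Python's re module (an
       assumption; it has no (?1) syntax) is given  the  subpattern  text  twice
       where  PCRE  would  call  it. STEM, BACKREF and SUBROUTINE_LIKE are hypo-
       thetical names.

```python
import re

STEM = r"sens|respons"

# Back reference: \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Subroutine-like: the subpattern itself is written out again, so each
# occurrence may choose a different alternative.
SUBROUTINE_LIKE = re.compile(rf"({STEM})e and (?:{STEM})ibility")

if __name__ == "__main__":
    for subject in [
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    ]:
        print(
            subject,
            bool(BACKREF.fullmatch(subject)),
            bool(SUBROUTINE_LIKE.fullmatch(subject)),
        )
```

       Repeating the text is only an approximation: a real  subroutine  call  is
       treated as an atomic group, which pasted text is not.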
|        Like recursive subpatterns, a "subroutine" call is always treated as an
 | |
|        atomic group. That is, once it has matched some of the subject  string,
 | |
|        it  is  never  re-entered, even if it contains untried alternatives and
 | |
|        there is a subsequent matching failure.
 | |
| 
 | |
|        When a subpattern is used as a subroutine, processing options  such  as
 | |
|        case-independence are fixed when the subpattern is defined. They cannot
 | |
|        be changed for different calls. For example, consider this pattern:
 | |
| 
 | |
|          (abc)(?i:(?-1))
 | |
| 
 | |
|        It matches "abcabc". It does not match "abcABC" because the  change  of
 | |
|        processing option does not affect the called subpattern.
 | |
| 
 | |
| 
 | |
| ONIGURUMA SUBROUTINE SYNTAX
 | |
| 
 | |
|        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
 | |
|        name or a number enclosed either in angle brackets or single quotes, is
 | |
|        an  alternative  syntax  for  referencing a subpattern as a subroutine,
 | |
|        possibly recursively. Here are two of the examples used above,  rewrit-
 | |
|        ten using this syntax:
 | |
| 
 | |
|          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
 | |
|          (sens|respons)e and \g'1'ibility
 | |
| 
 | |
|        PCRE  supports  an extension to Oniguruma: if a number is preceded by a
 | |
|        plus or a minus sign it is taken as a relative reference. For example:
 | |
| 
 | |
|          (abc)(?i:\g<-1>)
 | |
| 
 | |
|        Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
 | |
|        synonymous.  The former is a back reference; the latter is a subroutine
 | |
|        call.
 | |
| 
 | |
| 
 | |
| CALLOUTS
 | |
| 
 | |
|        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
 | |
|        Perl  code to be obeyed in the middle of matching a regular expression.
 | |
|        This makes it possible, amongst other things, to extract different sub-
 | |
|        strings that match the same pair of parentheses when there is a repeti-
 | |
|        tion.
 | |
| 
 | |
|        PCRE provides a similar feature, but of course it cannot obey arbitrary
 | |
|        Perl code. The feature is called "callout". The caller of PCRE provides
 | |
|        an external function by putting its entry point in the global  variable
 | |
|        pcre_callout.   By default, this variable contains NULL, which disables
 | |
|        all calling out.
 | |
| 
 | |
|        Within a regular expression, (?C) indicates the  points  at  which  the
 | |
|        external  function  is  to be called. If you want to identify different
 | |
|        callout points, you can put a number less than 256 after the letter  C.
 | |
|        The  default  value is zero.  For example, this pattern has two callout
 | |
|        points:
 | |
| 
 | |
|          (?C1)abc(?C2)def
 | |
| 
 | |
|        If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
 | |
|        automatically  installed  before each item in the pattern. They are all
 | |
|        numbered 255.
 | |
| 
 | |
|        During matching, when PCRE reaches a callout point (and pcre_callout is
 | |
|        set),  the  external function is called. It is provided with the number
 | |
|        of the callout, the position in the pattern, and, optionally, one  item
 | |
|        of  data  originally supplied by the caller of pcre_exec(). The callout
 | |
|        function may cause matching to proceed, to backtrack, or to fail  alto-
 | |
|        gether. A complete description of the interface to the callout function
 | |
|        is given in the pcrecallout documentation.
 | |
| 
 | |
| 
 | |
| BACKTRACKING CONTROL
 | |
| 
 | |
|        Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
 | |
|        which are described in the Perl documentation as "experimental and sub-
 | |
|        ject to change or removal in a future version of Perl". It goes  on  to
 | |
|        say:  "Their usage in production code should be noted to avoid problems
 | |
|        during upgrades." The same remarks apply to the PCRE features described
 | |
|        in this section.
 | |
| 
 | |
|        Since  these  verbs  are  specifically related to backtracking, most of
 | |
|        them can be  used  only  when  the  pattern  is  to  be  matched  using
 | |
|        pcre_exec(), which uses a backtracking algorithm. With the exception of
 | |
|        (*FAIL), which behaves like a failing negative assertion, they cause an
 | |
|        error if encountered by pcre_dfa_exec().
 | |
| 
 | |
|        The  new verbs make use of what was previously invalid syntax: an open-
 | |
|        ing parenthesis followed by an asterisk. In Perl, they are generally of
 | |
|        the form (*VERB:ARG) but PCRE does not support the use of arguments, so
 | |
|        its general form is just (*VERB). Any number of these verbs  may  occur
 | |
|        in a pattern. There are two kinds:
 | |
| 
 | |
|    Verbs that act immediately
 | |
| 
 | |
|        The following verbs act as soon as they are encountered:
 | |
| 
 | |
|           (*ACCEPT)
 | |
| 
 | |
|        This  verb causes the match to end successfully, skipping the remainder
 | |
|        of the pattern. When inside a recursion, only the innermost pattern  is
 | |
|        ended  immediately.  PCRE  differs  from  Perl  in  what happens if the
 | |
|        (*ACCEPT) is inside capturing parentheses. In Perl, the data so far  is
 | |
|        captured; in PCRE, no data is captured. For example:
 | |
| 
 | |
|          A(A|B(*ACCEPT)|C)D
 | |
| 
 | |
|        This  matches  "AB", "AAD", or "ACD", but when it matches "AB", no data
 | |
|        is captured.
 | |
| 
 | |
|          (*FAIL) or (*F)
 | |
| 
 | |
|        This verb causes the match to fail, forcing backtracking to  occur.  It
 | |
|        is  equivalent to (?!) but easier to read. The Perl documentation notes
 | |
|        that it is probably useful only when combined  with  (?{})  or  (??{}).
 | |
|        Those  are,  of course, Perl features that are not present in PCRE. The
 | |
|        nearest equivalent is the callout feature, as for example in this  pat-
 | |
|        tern:
 | |
| 
 | |
|          a+(?C)(*FAIL)
 | |
| 
 | |
|        A  match  with the string "aaaa" always fails, but the callout is taken
 | |
|        before each backtrack happens (in this example, 10 times).
 | |
| 
 | |
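       The figure of 10 callouts can be checked by counting. The sketch  below
       is  a  Python  model  of  the backtracking, not PCRE code: it assumes an
       unanchored match with no start-of-match optimizations, in which a+ first
       takes the longest run of "a" and then gives back one character at a time,
       with one callout per length tried. The name count_callouts is  hypothet-
       ical.

```python
def count_callouts(subject):
    """Count callouts taken by a+(?C)(*FAIL) under the stated assumptions."""
    calls = 0
    for start in range(len(subject)):  # each "bumpalong" start position
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ is tried with lengths run, run-1, ..., 1; each reaches (?C).
        calls += run
    return calls


if __name__ == "__main__":
    print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1
```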
|    Verbs that act after backtracking
 | |
| 
 | |
|        The following verbs do nothing when they are encountered. Matching con-
 | |
|        tinues  with what follows, but if there is no subsequent match, a fail-
 | |
|        ure is forced.  The verbs  differ  in  exactly  what  kind  of  failure
 | |
|        occurs.
 | |
| 
 | |
|          (*COMMIT)
 | |
| 
 | |
|        This  verb  causes  the whole match to fail outright if the rest of the
 | |
|        pattern does not match. Even if the pattern is unanchored,  no  further
 | |
|        attempts  to find a match by advancing the start point take place. Once
 | |
|        (*COMMIT) has been passed, pcre_exec() is committed to finding a  match
 | |
|        at the current starting point, or not at all. For example:
 | |
| 
 | |
|          a+(*COMMIT)b
 | |
| 
 | |
|        This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
 | |
|        of dynamic anchor, or "I've started, so I must finish."
 | |
| 
 | |
|          (*PRUNE)
 | |
| 
 | |
|        This verb causes the match to fail at the current position if the  rest
 | |
|        of the pattern does not match. If the pattern is unanchored, the normal
 | |
|        "bumpalong" advance to the next starting character then happens.  Back-
 | |
|        tracking  can  occur as usual to the left of (*PRUNE), or when matching
 | |
|        to the right of (*PRUNE), but if there is no match to the right,  back-
 | |
|        tracking  cannot  cross (*PRUNE).  In simple cases, the use of (*PRUNE)
 | |
|        is just an alternative to an atomic group or possessive quantifier, but
 | |
|        there  are  some uses of (*PRUNE) that cannot be expressed in any other
 | |
|        way.
 | |
| 
 | |
|          (*SKIP)
 | |
| 
 | |
|        This verb is like (*PRUNE), except that if the pattern  is  unanchored,
 | |
|        the  "bumpalong" advance is not to the next character, but to the posi-
 | |
|        tion in the subject where (*SKIP) was  encountered.  (*SKIP)  signifies
 | |
|        that  whatever  text  was  matched leading up to it cannot be part of a
 | |
|        successful match. Consider:
 | |
| 
 | |
|          a+(*SKIP)b
 | |
| 
 | |
|        If the subject is "aaaac...",  after  the  first  match  attempt  fails
 | |
|        (starting  at  the  first  character in the string), the starting point
 | |
|        skips on to start the next attempt at "c". Note that a possessive
 | |
|        quantifier does not have the same effect in this example; although it would
 | |
|        suppress backtracking  during  the  first  match  attempt,  the  second
 | |
|        attempt  would  start at the second character instead of skipping on to
 | |
|        "c".
 | |
| 
 | |
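       The effect of (*SKIP) on the starting point can be modelled in  a  few
       lines.  The sketch below is a Python simulation written specifically for
       the pattern a+b, with and without (*SKIP) after the a+; it is not a gen-
       eral  implementation of the verb, and start_positions is a hypothetical
       name.

```python
def start_positions(subject, use_skip):
    """Model the start offsets tried for a+b, or a+(*SKIP)b if use_skip.

    Returns (offsets_tried, match_span), where match_span is None on failure.
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1  # greedy a+
        if end > start and end < len(subject) and subject[end] == "b":
            return tried, (start, end + 1)
        if use_skip and end > start:
            start = end  # resume where (*SKIP) was encountered
        else:
            start += 1  # normal bumpalong to the next character
    return tried, None


if __name__ == "__main__":
    print(start_positions("aaaac", use_skip=False))
    print(start_positions("aaaac", use_skip=True))
```

       Giving back characters from a+ cannot help in this pattern, because  the
       character after a shorter run is still "a", so the model only tries the
       longest run at each start.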
|          (*THEN)
 | |
| 
 | |
|        This verb causes a skip to the next alternative if  the  rest  of  the
 | |
|        pattern does not match. That is, it cancels pending backtracking, but only
 | |
|        within the current alternation. Its name  comes  from  the  observation
 | |
|        that it can be used for a pattern-based if-then-else block:
 | |
| 
 | |
|          ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
 | |
| 
 | |
|        If  the COND1 pattern matches, FOO is tried (and possibly further items
 | |
|        after the end of the group if FOO succeeds);  on  failure  the  matcher
 | |
|        skips  to  the second alternative and tries COND2, without backtracking
 | |
|        into COND1. If (*THEN) is used outside  of  any  alternation,  it  acts
 | |
|        exactly like (*PRUNE).
 | |
| 
 | |
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 11 April 2009
 | |
|        Copyright (c) 1997-2009 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRESYNTAX(3)                                                    PCRESYNTAX(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE REGULAR EXPRESSION SYNTAX SUMMARY
 | |
| 
 | |
|        The  full syntax and semantics of the regular expressions that are sup-
 | |
|        ported by PCRE are described in  the  pcrepattern  documentation.  This
 | |
|        document contains just a quick-reference summary of the syntax.
 | |
| 
 | |
| 
 | |
| QUOTING
 | |
| 
 | |
|          \x         where x is non-alphanumeric is a literal x
 | |
|          \Q...\E    treat enclosed characters as literal
 | |
| 
 | |
| 
 | |
| CHARACTERS
 | |
| 
 | |
|          \a         alarm, that is, the BEL character (hex 07)
 | |
|          \cx        "control-x", where x is any character
 | |
|          \e         escape (hex 1B)
 | |
|          \f         formfeed (hex 0C)
 | |
|          \n         newline (hex 0A)
 | |
|          \r         carriage return (hex 0D)
 | |
|          \t         tab (hex 09)
 | |
|          \ddd       character with octal code ddd, or backreference
 | |
|          \xhh       character with hex code hh
 | |
|          \x{hhh..}  character with hex code hhh..
 | |
| 
 | |
| 
 | |
| CHARACTER TYPES
 | |
| 
 | |
|          .          any character except newline;
 | |
|                       in dotall mode, any character whatsoever
 | |
|          \C         one byte, even in UTF-8 mode (best avoided)
 | |
|          \d         a decimal digit
 | |
|          \D         a character that is not a decimal digit
 | |
|          \h         a horizontal whitespace character
 | |
|          \H         a character that is not a horizontal whitespace character
 | |
|          \p{xx}     a character with the xx property
 | |
|          \P{xx}     a character without the xx property
 | |
|          \R         a newline sequence
 | |
|          \s         a whitespace character
 | |
|          \S         a character that is not a whitespace character
 | |
|          \v         a vertical whitespace character
 | |
|          \V         a character that is not a vertical whitespace character
 | |
|          \w         a "word" character
 | |
|          \W         a "non-word" character
 | |
|          \X         an extended Unicode sequence
 | |
| 
 | |
|        In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
 | |
| 
 | |
| 
 | |
| GENERAL CATEGORY PROPERTY CODES FOR \p and \P
 | |
| 
 | |
|          C          Other
 | |
|          Cc         Control
 | |
|          Cf         Format
 | |
|          Cn         Unassigned
 | |
|          Co         Private use
 | |
|          Cs         Surrogate
 | |
| 
 | |
|          L          Letter
 | |
|          Ll         Lower case letter
 | |
|          Lm         Modifier letter
 | |
|          Lo         Other letter
 | |
|          Lt         Title case letter
 | |
|          Lu         Upper case letter
 | |
|          L&         Ll, Lu, or Lt
 | |
| 
 | |
|          M          Mark
 | |
|          Mc         Spacing mark
 | |
|          Me         Enclosing mark
 | |
|          Mn         Non-spacing mark
 | |
| 
 | |
|          N          Number
 | |
|          Nd         Decimal number
 | |
|          Nl         Letter number
 | |
|          No         Other number
 | |
| 
 | |
|          P          Punctuation
 | |
|          Pc         Connector punctuation
 | |
|          Pd         Dash punctuation
 | |
|          Pe         Close punctuation
 | |
|          Pf         Final punctuation
 | |
|          Pi         Initial punctuation
 | |
|          Po         Other punctuation
 | |
|          Ps         Open punctuation
 | |
| 
 | |
|          S          Symbol
 | |
|          Sc         Currency symbol
 | |
|          Sk         Modifier symbol
 | |
|          Sm         Mathematical symbol
 | |
|          So         Other symbol
 | |
| 
 | |
|          Z          Separator
 | |
|          Zl         Line separator
 | |
|          Zp         Paragraph separator
 | |
|          Zs         Space separator
 | |
| 
 | |
| 
 | |
| SCRIPT NAMES FOR \p AND \P
 | |
| 
 | |
|        Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
 | |
|        Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu-
 | |
|        neiform,  Cypriot,  Cyrillic,  Deseret, Devanagari, Ethiopic, Georgian,
 | |
|        Glagolitic, Gothic, Greek, Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,
 | |
|        Hebrew,  Hiragana,  Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi,
 | |
|        Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian,  Malayalam,
 | |
|        Mongolian,  Myanmar,  New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian,
 | |
|        Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurash-
 | |
|        tra,  Shavian,  Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tag-
 | |
|        banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
 | |
|        Ugaritic, Vai, Yi.
 | |
| 
 | |
| 
 | |
| CHARACTER CLASSES
 | |
| 
 | |
|          [...]       positive character class
 | |
|          [^...]      negative character class
 | |
|          [x-y]       range (can be used for hex characters)
 | |
|          [[:xxx:]]   positive POSIX named set
 | |
|          [[:^xxx:]]  negative POSIX named set
 | |
| 
 | |
|          alnum       alphanumeric
 | |
|          alpha       alphabetic
 | |
|          ascii       0-127
 | |
|          blank       space or tab
 | |
|          cntrl       control character
 | |
|          digit       decimal digit
 | |
|          graph       printing, excluding space
 | |
|          lower       lower case letter
 | |
|          print       printing, including space
 | |
|          punct       printing, excluding alphanumeric
 | |
|          space       whitespace
 | |
|          upper       upper case letter
 | |
|          word        same as \w
 | |
|          xdigit      hexadecimal digit
 | |
| 
 | |
|        In PCRE, POSIX character set names recognize only ASCII characters. You
 | |
|        can use \Q...\E inside a character class.
 | |
| 
 | |
| 
 | |
| QUANTIFIERS
 | |
| 
 | |
|          ?           0 or 1, greedy
 | |
|          ?+          0 or 1, possessive
 | |
|          ??          0 or 1, lazy
 | |
|          *           0 or more, greedy
 | |
|          *+          0 or more, possessive
 | |
|          *?          0 or more, lazy
 | |
|          +           1 or more, greedy
 | |
|          ++          1 or more, possessive
 | |
|          +?          1 or more, lazy
 | |
|          {n}         exactly n
 | |
|          {n,m}       at least n, no more than m, greedy
 | |
|          {n,m}+      at least n, no more than m, possessive
 | |
|          {n,m}?      at least n, no more than m, lazy
 | |
|          {n,}        n or more, greedy
 | |
|          {n,}+       n or more, possessive
 | |
|          {n,}?       n or more, lazy
 | |
| 
 | |
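       The difference between the greedy and lazy forms in the table is easiest
       to  see on a small subject. The sketch below uses Python's re module as a
       stand-in for PCRE (an assumption); the possessive forms are left out be-
       cause older Python versions do not accept them.

```python
import re

SUBJECT = "<a><b>"

# Greedy: .+ takes as much as it can and backs off only as far as needed.
greedy = re.match(r"<.+>", SUBJECT).group()

# Lazy: .+? takes as little as it can and extends only as far as needed.
lazy = re.match(r"<.+?>", SUBJECT).group()

# The same distinction applies to counted repeats.
counted_greedy = re.match(r"a{2,3}", "aaaa").group()
counted_lazy = re.match(r"a{2,3}?", "aaaa").group()

if __name__ == "__main__":
    print(greedy, lazy, counted_greedy, counted_lazy)
```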
| 
 | |
| ANCHORS AND SIMPLE ASSERTIONS
 | |
| 
 | |
|          \b          word boundary (only ASCII letters recognized)
 | |
|          \B          not a word boundary
 | |
|          ^           start of subject
 | |
|                       also after internal newline in multiline mode
 | |
|          \A          start of subject
 | |
|          $           end of subject
 | |
|                       also before newline at end of subject
 | |
|                       also before internal newline in multiline mode
 | |
|          \Z          end of subject
 | |
|                       also before newline at end of subject
 | |
|          \z          end of subject
 | |
|          \G          first matching position in subject
 | |
| 
 | |
| 
 | |
| MATCH POINT RESET
 | |
| 
 | |
|          \K          reset start of match
 | |
| 
 | |
| 
 | |
| ALTERNATION
 | |
| 
 | |
|          expr|expr|expr...
 | |
| 
 | |
| 
 | |
| CAPTURING
 | |
| 
 | |
|          (...)           capturing group
 | |
|          (?<name>...)    named capturing group (Perl)
 | |
|          (?'name'...)    named capturing group (Perl)
 | |
|          (?P<name>...)   named capturing group (Python)
 | |
|          (?:...)         non-capturing group
 | |
|          (?|...)         non-capturing group; reset group numbers for
 | |
|                           capturing groups in each alternative
 | |
| 
 | |
| 
 | |
| ATOMIC GROUPS
 | |
| 
 | |
|          (?>...)         atomic, non-capturing group
 | |
| 
 | |
| 
 | |
| COMMENT
 | |
| 
 | |
|          (?#....)        comment (not nestable)
 | |
| 
 | |
| 
 | |
| OPTION SETTING
 | |
| 
 | |
|          (?i)            caseless
 | |
|          (?J)            allow duplicate names
 | |
|          (?m)            multiline
 | |
|          (?s)            single line (dotall)
 | |
|          (?U)            default ungreedy (lazy)
 | |
|          (?x)            extended (ignore white space)
 | |
|          (?-...)         unset option(s)
 | |
| 
 | |
|        The following is recognized only at the start of a pattern or after one
 | |
|        of the newline-setting options with similar syntax:
 | |
| 
 | |
|          (*UTF8)         set UTF-8 mode
 | |
| 
 | |
| 
 | |
| LOOKAHEAD AND LOOKBEHIND ASSERTIONS
 | |
| 
 | |
|          (?=...)         positive look ahead
 | |
|          (?!...)         negative look ahead
 | |
|          (?<=...)        positive look behind
 | |
|          (?<!...)        negative look behind
 | |
| 
 | |
|        Each top-level branch of a look behind must be of a fixed length.
 | |
| 
 | |
| 
 | |
| BACKREFERENCES
 | |
| 
 | |
|          \n              reference by number (can be ambiguous)
 | |
|          \gn             reference by number
 | |
|          \g{n}           reference by number
 | |
|          \g{-n}          relative reference by number
 | |
|          \k<name>        reference by name (Perl)
 | |
|          \k'name'        reference by name (Perl)
 | |
|          \g{name}        reference by name (Perl)
 | |
|          \k{name}        reference by name (.NET)
 | |
|          (?P=name)       reference by name (Python)
 | |
| 
 | |
| 
 | |
| SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
 | |
| 
 | |
|          (?R)            recurse whole pattern
 | |
|          (?n)            call subpattern by absolute number
 | |
|          (?+n)           call subpattern by relative number
 | |
|          (?-n)           call subpattern by relative number
 | |
|          (?&name)        call subpattern by name (Perl)
 | |
|          (?P>name)       call subpattern by name (Python)
 | |
|          \g<name>        call subpattern by name (Oniguruma)
 | |
|          \g'name'        call subpattern by name (Oniguruma)
 | |
|          \g<n>           call subpattern by absolute number (Oniguruma)
 | |
|          \g'n'           call subpattern by absolute number (Oniguruma)
 | |
|          \g<+n>          call subpattern by relative number (PCRE extension)
 | |
|          \g'+n'          call subpattern by relative number (PCRE extension)
 | |
|          \g<-n>          call subpattern by relative number (PCRE extension)
 | |
|          \g'-n'          call subpattern by relative number (PCRE extension)
 | |
| 
 | |
| 
 | |
| CONDITIONAL PATTERNS
 | |
| 
 | |
|          (?(condition)yes-pattern)
 | |
|          (?(condition)yes-pattern|no-pattern)
 | |
| 
 | |
|          (?(n)...        absolute reference condition
 | |
|          (?(+n)...       relative reference condition
 | |
|          (?(-n)...       relative reference condition
 | |
|          (?(<name>)...   named reference condition (Perl)
 | |
|          (?('name')...   named reference condition (Perl)
 | |
|          (?(name)...     named reference condition (PCRE)
 | |
|          (?(R)...        overall recursion condition
 | |
|          (?(Rn)...       specific group recursion condition
 | |
|          (?(R&name)...   specific recursion condition
 | |
|          (?(DEFINE)...   define subpattern for reference
 | |
|          (?(assert)...   assertion condition
 | |
| 
 | |
| 
 | |
| BACKTRACKING CONTROL
 | |
| 
 | |
|        The following act as soon as they are reached:
 | |
| 
 | |
|          (*ACCEPT)       force successful match
 | |
|          (*FAIL)         force backtrack; synonym (*F)
 | |
| 
 | |
|        The  following  act only when a subsequent match failure causes a back-
 | |
|        track to reach them. They all force a match failure, but they differ in
 | |
|        what happens afterwards. Those that advance the start-of-match point do
 | |
|        so only if the pattern is not anchored.
 | |
| 
 | |
|          (*COMMIT)       overall failure, no advance of starting point
 | |
|          (*PRUNE)        advance to next starting character
 | |
|          (*SKIP)         advance start to current matching position
 | |
|          (*THEN)         local failure, backtrack to next alternation
 | |
| 
 | |
| 
 | |
| NEWLINE CONVENTIONS
 | |
| 
 | |
|        These are recognized only at the very start of the pattern or  after  a
 | |
|        (*BSR_...) or (*UTF8) option.
 | |
| 
 | |
|          (*CR)           carriage return only
 | |
|          (*LF)           linefeed only
 | |
|          (*CRLF)         carriage return followed by linefeed
 | |
|          (*ANYCRLF)      all three of the above
 | |
|          (*ANY)          any Unicode newline sequence
 | |
| 
 | |
| 
 | |
| WHAT \R MATCHES
 | |
| 
 | |
|        These  are  recognized only at the very start of the pattern or after a
 | |
|        (*...) option that sets the newline convention or UTF-8 mode.
 | |
| 
 | |
|          (*BSR_ANYCRLF)  CR, LF, or CRLF
 | |
|          (*BSR_UNICODE)  any Unicode newline sequence
 | |
| 
 | |
| 
 | |
| CALLOUTS
 | |
| 
 | |
|          (?C)      callout
 | |
|          (?Cn)     callout with data n
 | |
| 
 | |
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 11 April 2009
 | |
|        Copyright (c) 1997-2009 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREPARTIAL(3)                                                  PCREPARTIAL(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PARTIAL MATCHING IN PCRE
 | |
| 
 | |
|        In  normal  use  of  PCRE,  if  the  subject  string  that is passed to
 | |
|        pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
 | |
|        short  to  match  the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
 | |
|        There are circumstances where it might be helpful to  distinguish  this
 | |
|        case from other cases in which there is no match.
 | |
| 
 | |
|        Consider, for example, an application where a human is required to type
 | |
|        in data for a field with specific formatting requirements.  An  example
 | |
|        might be a date in the form ddmmmyy, defined by this pattern:
 | |
| 
 | |
|          ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
 | |
| 
 | |
|        If the application sees the user's keystrokes one by one, and can check
 | |
|        that what has been typed so far is potentially valid,  it  is  able  to
 | |
|        raise  an  error as soon as a mistake is made, possibly beeping and not
 | |
|        reflecting the character that has been typed. This  immediate  feedback
 | |
|        is  likely  to  be a better user interface than a check that is delayed
 | |
|        until the entire string has been entered.
 | |
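The idea of checking input as it is typed can be illustrated without the library. The following is a minimal hand-written sketch in C, not PCRE code: it classifies a string against the ddmmmyy date pattern above as a full match, a possible prefix of a match, or a definite mismatch. The names date_status, check_layout() and check_date() are invented for this illustration, and treating an empty string as a partial match is a choice of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { DATE_NOMATCH, DATE_PARTIAL, DATE_FULL } date_status;

static const char *const MONTHS[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Check s against one fixed layout: day_digits digits, a three-letter
   month name, then two digits. */
static date_status check_layout(const char *s, size_t day_digits)
{
    size_t len = strlen(s);
    size_t total = day_digits + 5;   /* 3 month letters + 2 year digits */
    size_t i, m, typed;
    int month_ok = 0;

    if (len > total) return DATE_NOMATCH;

    /* Day: every character typed in the day field must be a digit. */
    for (i = 0; i < day_digits && i < len; i++)
        if (!isdigit((unsigned char)s[i])) return DATE_NOMATCH;

    /* Month: whatever has been typed must be a prefix of a month name. */
    if (len > day_digits) {
        typed = len - day_digits;
        if (typed > 3) typed = 3;
        for (m = 0; m < 12 && !month_ok; m++)
            month_ok = strncmp(s + day_digits, MONTHS[m], typed) == 0;
        if (!month_ok) return DATE_NOMATCH;
    }

    /* Year: any characters after the month must be digits. */
    for (i = day_digits + 3; i < len; i++)
        if (!isdigit((unsigned char)s[i])) return DATE_NOMATCH;

    return len == total ? DATE_FULL : DATE_PARTIAL;
}

/* Classify s, trying both the one-digit and two-digit day layouts. */
date_status check_date(const char *s)
{
    date_status one, two;

    assert(s != NULL);
    one = check_layout(s, 1);
    two = check_layout(s, 2);
    return one > two ? one : two;   /* FULL beats PARTIAL beats NOMATCH */
}
```

An input routine could call check_date() after each keystroke and reject the character as soon as the result is DATE_NOMATCH.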
| 
 | |
|        PCRE supports the concept of partial matching by means of the PCRE_PAR-
 | |
|        TIAL   option,   which   can   be   set  when  calling  pcre_exec()  or
 | |
|        pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
 | |
|        PCRE_ERROR_NOMATCH  is converted into PCRE_ERROR_PARTIAL if at any time
 | |
|        during the matching process the last part of the subject string matched
 | |
|        part  of  the  pattern. Unfortunately, for non-anchored matching, it is
 | |
|        not possible to obtain the position of the start of the partial  match.
 | |
|        No captured data is set when PCRE_ERROR_PARTIAL is returned.
 | |
| 
 | |
|        When   PCRE_PARTIAL   is  set  for  pcre_dfa_exec(),  the  return  code
 | |
|        PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the  end  of
 | |
|        the  subject is reached, there have been no complete matches, but there
 | |
|        is still at least one matching possibility. The portion of  the  string
 | |
|        that provided the partial match is set as the first matching string.
 | |
| 
 | |
|        Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
 | |
|        the last literal byte in a pattern, and abandons  matching  immediately
 | |
|        if  such a byte is not present in the subject string. This optimization
 | |
|        cannot be used for a subject string that might match only partially.
 | |
| 
 | |
| 
 | |
| RESTRICTED PATTERNS FOR PCRE_PARTIAL
 | |
| 
 | |
|        Because of the way certain internal optimizations  are  implemented  in
 | |
|        the  pcre_exec()  function, the PCRE_PARTIAL option cannot be used with
 | |
|        all patterns. These restrictions do not apply when  pcre_dfa_exec()  is
 | |
|        used.  For pcre_exec(), repeated single characters such as
 | |
| 
 | |
|          a{2,4}
 | |
| 
 | |
|        and repeated single metasequences such as
 | |
| 
 | |
|          \d+
 | |
| 
 | |
|        are  not permitted if the maximum number of occurrences is greater than
 | |
|        one.  Optional items such as \d? (where the maximum is one) are permit-
 | |
|        ted.   Quantifiers  with any values are permitted after parentheses, so
 | |
|        the invalid examples above can be coded thus:
 | |
| 
 | |
|          (a){2,4}
 | |
|          (\d)+
 | |
| 
 | |
|        These constructions run more slowly, but for the kinds  of  application
 | |
|        that  are  envisaged  for this facility, this is not felt to be a major
 | |
|        restriction.
 | |
| 
 | |
|        If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the
 | |
|        restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
 | |
|        (-13).  You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo()  to
 | |
|        find out if a compiled pattern can be used for partial matching.
 | |
| 
 | |
| 
 | |
| EXAMPLE OF PARTIAL MATCHING USING PCRETEST
 | |
| 
 | |
|        If  the  escape  sequence  \P  is  present in a pcretest data line, the
 | |
|        PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
 | |
|        uses the date example quoted above:
 | |
| 
 | |
|            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | |
|          data> 25jun04\P
 | |
|           0: 25jun04
 | |
|           1: jun
 | |
|          data> 25dec3\P
 | |
|          Partial match
 | |
|          data> 3ju\P
 | |
|          Partial match
 | |
|          data> 3juj\P
 | |
|          No match
 | |
|          data> j\P
 | |
|          No match
 | |
| 
 | |
|        The  first  data  string  is  matched completely, so pcretest shows the
 | |
|        matched substrings. The remaining four strings do not  match  the  com-
 | |
|        plete  pattern,  but  the first two are partial matches. The same test,
 | |
|        using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),
 | |
|        produces the following output:
 | |
| 
 | |
|            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | |
|          data> 25jun04\P\D
 | |
|           0: 25jun04
 | |
|          data> 23dec3\P\D
 | |
|          Partial match: 23dec3
 | |
|          data> 3ju\P\D
 | |
|          Partial match: 3ju
 | |
|          data> 3juj\P\D
 | |
|          No match
 | |
|          data> j\P\D
 | |
|          No match
 | |
| 
 | |
|        Notice  that in this case the portion of the string that was matched is
 | |
|        made available.
 | |
| 
 | |
| 
 | |
| MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
 | |
| 
 | |
|        When a partial match has been found using pcre_dfa_exec(), it is possi-
 | |
|        ble  to  continue  the  match  by providing additional subject data and
 | |
|        calling pcre_dfa_exec() again with the same  compiled  regular  expres-
 | |
|        sion, this time setting the PCRE_DFA_RESTART option. You must also pass
 | |
|        the same working space as before, because this is where details of  the
 | |
|        previous  partial  match are stored. Here is an example using pcretest,
 | |
|        using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
 | |
|        \D are as above):
 | |
| 
 | |
|            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | |
|          data> 23ja\P\D
 | |
|          Partial match: 23ja
 | |
|          data> n05\R\D
 | |
|           0: n05
 | |
| 
 | |
|        The  first  call has "23ja" as the subject, and requests partial match-
 | |
|        ing; the second call  has  "n05"  as  the  subject  for  the  continued
 | |
|        (restarted)  match.   Notice  that when the match is complete, only the
 | |
|        last part is shown; PCRE does  not  retain  the  previously  partially-
 | |
|        matched  string. It is up to the calling program to do that if it needs
 | |
|        to.
 | |
| 
 | |
|        You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial
 | |
|        matching over multiple segments. This facility can be used to pass very
 | |
|        long subject strings to pcre_dfa_exec(). However, some care  is  needed
 | |
|        for certain types of pattern.
 | |
| 
 | |
|        1.  If  the  pattern contains tests for the beginning or end of a line,
 | |
|        you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
 | |
|        ate,  when  the subject string for any call does not contain the begin-
 | |
|        ning or end of a line.
 | |
| 
 | |
|        2. If the pattern contains backward assertions (including  \b  or  \B),
 | |
|        you  need  to  arrange for some overlap in the subject strings to allow
 | |
|        for this. For example, you could pass the subject in  chunks  that  are
 | |
|        500  bytes long, but in a buffer of 700 bytes, with the starting offset
 | |
|        set to 200 and the previous 200 bytes at the start of the buffer.
 | |
| 
 | |
|        3. Matching a subject string that is split into multiple segments  does
 | |
|        not  always produce exactly the same result as matching over one single
 | |
|        long string.  The difference arises when there  are  multiple  matching
 | |
|        possibilities,  because a partial match result is given only when there
 | |
|        are no completed matches in a call to pcre_dfa_exec(). This means  that
 | |
|        as  soon  as  the  shortest match has been found, continuation to a new
 | |
|        subject segment is no longer possible.  Consider this pcretest example:
 | |
| 
 | |
|            re> /dog(sbody)?/
 | |
|          data> do\P\D
 | |
|          Partial match: do
 | |
|          data> gsb\R\P\D
 | |
|           0: g
 | |
|          data> dogsbody\D
 | |
|           0: dogsbody
 | |
|           1: dog
 | |
| 
 | |
|        The pattern matches the words "dog" or "dogsbody". When the subject  is
 | |
|        presented  in  several  parts  ("do" and "gsb" being the first two) the
 | |
|        match stops when "dog" has been found, and it is not possible  to  con-
 | |
|        tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single
 | |
|        string, both matches are found.
 | |
| 
 | |
|        Because of this phenomenon, it does not usually make  sense  to  end  a
 | |
|        pattern that is going to be matched in this way with a variable repeat.
 | |
| 
 | |
|        4. Patterns that contain alternatives at the top level which do not all
 | |
|        start with the same pattern item may not work as expected. For example,
 | |
|        consider this pattern:
 | |
| 
 | |
|          1234|3789
 | |
| 
 | |
|        If  the  first  part of the subject is "ABC123", a partial match of the
 | |
|        first alternative is found at offset 3. There is no partial  match  for
 | |
|        the second alternative, because such a match does not start at the same
 | |
|        point in the subject string. Attempting to  continue  with  the  string
 | |
|        "789" does not yield a match because only those alternatives that match
 | |
|        at one point in the subject are remembered. The problem arises  because
 | |
|        the  start  of the second alternative matches within the first alterna-
 | |
|        tive. There is no problem with anchored patterns or patterns such as:
 | |
| 
 | |
|          1234|ABCD
 | |
| 
 | |
|        where no string can be a partial match for both alternatives.
 | |
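The buffer arrangement described in point 2 above can be sketched as follows. This is an illustration only: the type segment_buffer and the functions segment_init() and segment_load() are hypothetical helpers, not part of PCRE. After each call, the data, length and start_offset members hold the values one would presumably pass as the subject, length and startoffset arguments of pcre_dfa_exec().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK_SIZE = 500, OVERLAP = 200, BUFFER_SIZE = CHUNK_SIZE + OVERLAP };

typedef struct {
    char   data[BUFFER_SIZE];
    size_t length;        /* number of valid bytes in data */
    size_t start_offset;  /* index of the first byte of the new chunk */
} segment_buffer;

/* Start with an empty buffer. */
void segment_init(segment_buffer *sb)
{
    assert(sb != NULL);
    sb->length = 0;
    sb->start_offset = 0;
}

/* Load the next chunk of n bytes (n <= CHUNK_SIZE), keeping up to OVERLAP
   bytes from the end of the previous contents in front of it. */
void segment_load(segment_buffer *sb, const char *chunk, size_t n)
{
    size_t keep;

    assert(sb != NULL && chunk != NULL && n <= (size_t)CHUNK_SIZE);
    keep = sb->length < (size_t)OVERLAP ? sb->length : (size_t)OVERLAP;

    /* Move the tail of the old contents to the front, then append. */
    memmove(sb->data, sb->data + sb->length - keep, keep);
    memcpy(sb->data + keep, chunk, n);

    sb->start_offset = keep;
    sb->length = keep + n;
}
```

For the first segment there is no earlier data, so start_offset is 0; for every later full-sized segment it is 200, matching the figures in the text.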
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 04 June 2007
 | |
|        Copyright (c) 1997-2007 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
 | |
| 
 | |
|        If  you  are running an application that uses a large number of regular
 | |
|        expression patterns, it may be useful to store them  in  a  precompiled
 | |
|        form  instead  of  having to compile them every time the application is
 | |
|        run.  If you are not  using  any  private  character  tables  (see  the
 | |
|        pcre_maketables()  documentation),  this is relatively straightforward.
 | |
|        If you are using private tables, it is a little bit more complicated.
 | |
| 
 | |
|        If you save compiled patterns to a file, you can copy them to a differ-
 | |
|        ent  host  and  run them there. This works even if the new host has the
 | |
|        opposite endianness to the one on which  the  patterns  were  compiled.
 | |
|        There  may  be a small performance penalty, but it should be insignifi-
 | |
|        cant. However, compiling regular expressions with one version  of  PCRE
 | |
|        for  use  with  a  different  version is not guaranteed to work and may
 | |
|        cause crashes.
 | |
| 
 | |
| 
 | |
| SAVING A COMPILED PATTERN
 | |
| 
 | |
|        The value returned by pcre_compile() points to a single block of memory
 | |
|        that  holds  the compiled pattern and associated data. You can find the
 | |
|        length of this block in bytes by calling pcre_fullinfo() with an  argu-
 | |
|        ment  of  PCRE_INFO_SIZE. You can then save the data in any appropriate
 | |
|        manner. Here is sample code that compiles a pattern and writes it to  a
 | |
|        file. It assumes that the variable fd is a stdio stream (a FILE *) that
 | |
|        is open for output:
 | |
| 
 | |
|          int erroroffset, rc;
 | |
|          size_t size, written;   /* PCRE_INFO_SIZE yields a size_t */
 | |
|          const char *error;      /* set by pcre_compile() on failure */
 | |
|          pcre *re;
 | |
| 
 | |
|          re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
 | |
|          if (re == NULL) { ... handle errors ... }
 | |
|          rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
 | |
|          if (rc < 0) { ... handle errors ... }
 | |
|          written = fwrite(re, 1, size, fd);
 | |
|          if (written != size) { ... handle errors ... }
 | |
| 
 | |
|        In this example, the bytes  that  comprise  the  compiled  pattern  are
 | |
|        copied  exactly.  Note that this is binary data that may contain any of
 | |
|        the 256 possible byte  values.  On  systems  that  make  a  distinction
 | |
|        between binary and non-binary data, be sure that the file is opened for
 | |
|        binary output.
 | |
| 
 | |
|        If you want to write more than one pattern to a file, you will have  to
 | |
|        devise  a  way of separating them. For binary data, preceding each pat-
 | |
|        tern with its length is probably  the  most  straightforward  approach.
 | |
|        Another  possibility is to write out the data in hexadecimal instead of
 | |
|        binary, one pattern to a line.
 | |
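The length-prefix approach can be sketched as below. This is one possible layout, not something PCRE prescribes: each record is a 4-byte big-endian length followed by the raw bytes. The function names write_record() and read_record() are invented for the example; the data and size arguments stand for the compiled pattern block and the value obtained with PCRE_INFO_SIZE.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one record: a 4-byte big-endian length followed by the raw bytes.
   Returns 0 on success and -1 on failure. */
int write_record(FILE *f, const void *data, size_t size)
{
    unsigned char hdr[4];

    assert(f != NULL && data != NULL);
    if (size > 0xFFFFFFFFu) return -1;       /* does not fit in 4 bytes */
    hdr[0] = (unsigned char)((size >> 24) & 0xFF);
    hdr[1] = (unsigned char)((size >> 16) & 0xFF);
    hdr[2] = (unsigned char)((size >> 8) & 0xFF);
    hdr[3] = (unsigned char)(size & 0xFF);
    if (fwrite(hdr, 1, sizeof hdr, f) != sizeof hdr) return -1;
    if (fwrite(data, 1, size, f) != size) return -1;
    return 0;
}

/* Read the next record into newly allocated memory and store its length
   in *size. Returns NULL at end of file or on error; the caller frees
   the result. */
void *read_record(FILE *f, size_t *size)
{
    unsigned char hdr[4];
    size_t n;
    void *buf;

    assert(f != NULL && size != NULL);
    if (fread(hdr, 1, sizeof hdr, f) != sizeof hdr) return NULL;
    n = ((size_t)hdr[0] << 24) | ((size_t)hdr[1] << 16) |
        ((size_t)hdr[2] << 8)  |  (size_t)hdr[3];
    buf = malloc(n > 0 ? n : 1);
    if (buf == NULL) return NULL;
    if (fread(buf, 1, n, f) != n) { free(buf); return NULL; }
    *size = n;
    return buf;
}
```

Writing the length in a fixed byte order keeps the framing readable on a host of the opposite endianness, independently of how the pattern bytes themselves are handled.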
| 
 | |
|        Saving compiled patterns in a file is only one possible way of  storing
 | |
|        them  for later use. They could equally well be saved in a database, or
 | |
|        in the memory of some daemon process that passes them  via  sockets  to
 | |
|        the processes that want them.
 | |
| 
 | |
|        If  the pattern has been studied, it is also possible to save the study
 | |
|        data in a similar way to the compiled  pattern  itself.  When  studying
 | |
|        generates  additional  information, pcre_study() returns a pointer to a
 | |
|        pcre_extra data block. Its format is defined in the section on matching
 | |
|        a  pattern in the pcreapi documentation. The study_data field points to
 | |
|        the binary study data,  and  this  is  what  you  must  save  (not  the
 | |
|        pcre_extra  block itself). The length of the study data can be obtained
 | |
|        by calling pcre_fullinfo() with  an  argument  of  PCRE_INFO_STUDYSIZE.
 | |
|        Remember  to check that pcre_study() did return a non-NULL value before
 | |
|        trying to save the study data.
 | |
| 
 | |
| 
 | |
| RE-USING A PRECOMPILED PATTERN
 | |
| 
 | |
|        Re-using a precompiled pattern is straightforward. Having  reloaded  it
 | |
|        into   main   memory,   you   pass   its   pointer  to  pcre_exec()  or
 | |
|        pcre_dfa_exec() in the usual way. This  should  work  even  on  another
 | |
|        host,  and  even  if  that  host has the opposite endianness to the one
 | |
|        where the pattern was compiled.
 | |
| 
 | |
|        However, if you passed a pointer to custom character  tables  when  the
 | |
|        pattern  was  compiled  (the  tableptr argument of pcre_compile()), you
 | |
|        must now pass a similar  pointer  to  pcre_exec()  or  pcre_dfa_exec(),
 | |
|        because  the  value  saved  with the compiled pattern will obviously be
 | |
|        nonsense. A field in a pcre_extra block is used to pass this data,  as
 | |
|        described  in the section on matching a pattern in the pcreapi documen-
 | |
|        tation.
 | |
| 
 | |
|        If you did not provide custom character tables  when  the  pattern  was
 | |
|        compiled,  the  pointer  in  the compiled pattern is NULL, which causes
 | |
|        pcre_exec() to use PCRE's internal tables. Thus, you  do  not  need  to
 | |
|        take any special action at run time in this case.
 | |
| 
 | |
|        If  you  saved study data with the compiled pattern, you need to create
 | |
|        your own pcre_extra data block and set the study_data field to point to
 | |
|        the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA
 | |
|        bit in the flags field to indicate that study  data  is  present.  Then
 | |
|        pass  the  pcre_extra  block  to  pcre_exec() or pcre_dfa_exec() in the
 | |
|        usual way.
 | |
| 
 | |
| 
 | |
| COMPATIBILITY WITH DIFFERENT PCRE RELEASES
 | |
| 
 | |
|        In general, it is safest to  recompile  all  saved  patterns  when  you
 | |
|        update  to  a new PCRE release, though not all updates actually require
 | |
|        this. Recompiling is definitely needed for release 7.2.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 13 June 2007
 | |
|        Copyright (c) 1997-2007 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREPERFORM(3)                                                  PCREPERFORM(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE PERFORMANCE
 | |
| 
 | |
|        Two  aspects  of performance are discussed below: memory usage and pro-
 | |
|        cessing time. The way you express your pattern as a regular  expression
 | |
|        can affect both of them.
 | |
| 
 | |
| 
 | |
| MEMORY USAGE
 | |
| 
 | |
|        Patterns are compiled by PCRE into a reasonably efficient byte code, so
 | |
|        that most simple patterns do not use much memory. However, there is one
 | |
|        case where memory usage can be unexpectedly large. When a parenthesized
 | |
|        subpattern has a quantifier with a minimum greater than 1 and/or a lim-
 | |
|        ited  maximum,  the  whole subpattern is repeated in the compiled code.
 | |
|        For example, the pattern
 | |
| 
 | |
|          (abc|def){2,4}
 | |
| 
 | |
|        is compiled as if it were
 | |
| 
 | |
|          (abc|def)(abc|def)((abc|def)(abc|def)?)?
 | |
| 
 | |
|        (Technical aside: It is done this way so that backtrack  points  within
 | |
|        each of the repetitions can be independently maintained.)
 | |
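To make the growth concrete, the following sketch builds the textual equivalent of a group repeated {min,max} times, in the same shape as the expansion shown above. It is only an illustration of the replication, not PCRE's compiler; the function name expand_repeat() is invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated string holding `group` written out min times,
   followed by (max - min) nested optional copies. The caller frees the
   result. Returns NULL if memory cannot be allocated. */
char *expand_repeat(const char *group, unsigned int min, unsigned int max)
{
    size_t glen, total;
    unsigned int opt, i;
    char *out, *p;

    assert(group != NULL && max >= min);
    glen = strlen(group);
    opt = max - min;

    /* min mandatory copies, opt optional copies, one '?' for the
       innermost copy, and "(" ... ")?" around each outer level. */
    total = (size_t)min * glen;
    if (opt > 0) total += (size_t)opt * glen + 1 + 3 * (size_t)(opt - 1);

    out = malloc(total + 1);
    if (out == NULL) return NULL;
    p = out;

    for (i = 0; i < min; i++) { memcpy(p, group, glen); p += glen; }
    for (i = 0; i < opt; i++) {
        if (i + 1 < opt) *p++ = '(';       /* open every level but the last */
        memcpy(p, group, glen);
        p += glen;
    }
    if (opt > 0) *p++ = '?';               /* innermost optional copy */
    for (i = 1; i < opt; i++) { *p++ = ')'; *p++ = '?'; }
    *p = '\0';
    return out;
}
```

The length of the result grows in proportion to max, which is why large or nested counts make the compiled pattern large.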
| 
 | |
|        For  regular expressions whose quantifiers use only small numbers, this
 | |
|        is not usually a problem. However, if the numbers are large,  and  par-
 | |
|        ticularly  if  such repetitions are nested, the memory usage can become
 | |
|        an embarrassment. For example, the very simple pattern
 | |
| 
 | |
|          ((ab){1,1000}c){1,3}
 | |
| 
 | |
|        uses 51K bytes when compiled. When PCRE is compiled  with  its  default
 | |
|        internal  pointer  size of two bytes, the size limit on a compiled pat-
 | |
|        tern is 64K, and this is reached with the above pattern  if  the  outer
 | |
|        repetition is increased from 3 to 4. PCRE can be compiled to use larger
 | |
|        internal pointers and thus handle larger compiled patterns, but  it  is
 | |
|        better to try to rewrite your pattern to use less memory if you can.
 | |
| 
 | |
|        One  way  of reducing the memory usage for such patterns is to make use
 | |
|        of PCRE's "subroutine" facility. Re-writing the above pattern as
 | |
| 
 | |
|          ((ab)(?2){0,999}c)(?1){0,2}
 | |
| 
 | |
|        reduces the memory requirements to 18K, and indeed it remains under 20K
 | |
|        even  with the outer repetition increased to 100. However, this pattern
 | |
|        is not exactly equivalent, because the "subroutine" calls  are  treated
 | |
|        as  atomic groups into which there can be no backtracking if there is a
 | |
|        subsequent matching failure. Therefore, PCRE cannot  do  this  kind  of
 | |
|        rewriting  automatically.   Furthermore,  there is a noticeable loss of
 | |
|        speed when executing the modified pattern. Nevertheless, if the  atomic
 | |
|        grouping  is  not  a  problem and the loss of speed is acceptable, this
 | |
|        kind of rewriting will allow you to process patterns that  PCRE  cannot
 | |
|        otherwise handle.
 | |
| 
 | |
| 
 | |
| PROCESSING TIME
 | |
| 
 | |
|        Certain  items  in regular expression patterns are processed more effi-
 | |
|        ciently than others. It is more efficient to use a character class like
 | |
|        [aeiou]   than   a   set   of  single-character  alternatives  such  as
 | |
|        (a|e|i|o|u). In general, the simplest construction  that  provides  the
 | |
|        required behaviour is usually the most efficient. Jeffrey Friedl's book
 | |
|        "Mastering Regular Expressions" contains a lot of useful general
 | |
|        discussion about optimizing regular expressions for efficient
 | |
|        performance. This document contains a few observations about PCRE.
 | |
| 
 | |
|        Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
 | |
|        slow,  because PCRE has to scan a structure that contains data for over
 | |
|        fifteen thousand characters whenever it needs a  character's  property.
 | |
|        If  you  can  find  an  alternative pattern that does not use character
 | |
|        properties, it will probably be faster.
 | |
| 
 | |
|        When a pattern begins with .* not in  parentheses,  or  in  parentheses
 | |
|        that are not the subject of a backreference, and the PCRE_DOTALL option
 | |
|        is set, the pattern is implicitly anchored by PCRE, since it can  match
 | |
|        only  at  the start of a subject string. However, if PCRE_DOTALL is not
 | |
|        set, PCRE cannot make this optimization, because  the  .  metacharacter
 | |
|        does  not then match a newline, and if the subject string contains new-
 | |
|        lines, the pattern may match from the character  immediately  following
 | |
|        one of them instead of from the very start. For example, the pattern
 | |
| 
 | |
|          .*second
 | |
| 
 | |
|        matches  the subject "first\nand second" (where \n stands for a newline
 | |
|        character), with the match starting at the seventh character. In  order
 | |
|        to do this, PCRE has to retry the match starting after every newline in
 | |
|        the subject.
 | |
| 
 | |
|        If you are using such a pattern with subject strings that do  not  con-
 | |
|        tain newlines, the best performance is obtained by setting PCRE_DOTALL,
 | |
|        or starting the pattern with ^.* or ^.*? to indicate  explicit  anchor-
 | |
|        ing.  That saves PCRE from having to scan along the subject looking for
 | |
|        a newline to restart at.
 | |
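The starting positions involved when PCRE_DOTALL is not set can be listed with a short sketch. It assumes that the newline convention is a single linefeed, and it only shows which positions are candidates; it does not reproduce how PCRE itself is implemented. The function name line_start_offsets() is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Store in out[] (capacity max_out) the offsets at which a match attempt
   would start: offset 0 and the offset just after each newline.
   Returns the number of offsets found, which may exceed max_out. */
size_t line_start_offsets(const char *subject, size_t length,
                          size_t *out, size_t max_out)
{
    size_t count = 0, i;

    assert(subject != NULL && (out != NULL || max_out == 0));

    if (count < max_out) out[count] = 0;   /* the very start */
    count++;

    for (i = 0; i < length; i++) {
        if (subject[i] == '\n') {
            if (count < max_out) out[count] = i + 1;
            count++;
        }
    }
    return count;
}
```

For the subject "first\nand second" this yields offsets 0 and 6, the second being the seventh character mentioned above. A subject without newlines yields only offset 0, which is the case where explicit anchoring loses nothing.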
| 
 | |
|        Beware of patterns that contain nested indefinite  repeats.  These  can
 | |
|        take  a  long time to run when applied to a string that does not match.
 | |
|        Consider the pattern fragment
 | |
| 
 | |
|          ^(a+)*
 | |
| 
 | |
|        This can match "aaaa" in 16 different ways, and this  number  increases
 | |
|        very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
 | |
|        2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
 | |
|        repeats  can  match  different numbers of times.) When the remainder of
 | |
|        the pattern is such that the entire match is going to fail, PCRE has in
 | |
|        principle  to  try  every  possible  variation,  and  this  can take an
 | |
|        extremely long time, even for relatively short strings.
 | |
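The figure of 16 can be checked by counting. The sketch below is a plain enumeration written for this illustration, not PCRE code: at each point the * repeat either stops, or the group takes some number of characters and the count continues on what is left. The function name count_match_ways() is invented for the example.

```c
#include <assert.h>

/* Number of distinct ways ^(a+)* can match at the start of a run of
   n "a" characters, counting every way of splitting the matched text
   between iterations of the group. */
unsigned long count_match_ways(unsigned int n)
{
    unsigned long ways = 1;                  /* the * repeat stops here */
    unsigned int len;

    assert(n < 8 * sizeof(unsigned long));   /* the result is 2^n */

    for (len = 1; len <= n; len++)           /* a+ takes len characters, */
        ways += count_match_ways(n - len);   /* then the rest is counted */
    return ways;
}
```

The count doubles with each extra character, and the naive enumeration itself takes time proportional to the count, which mirrors the cost of exhausting every variation before reporting a failure.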
| 
 | |
|        An optimization catches some of the more simple cases such as
 | |
| 
 | |
|          (a+)*b
 | |
| 
 | |
|        where a literal character follows. Before  embarking  on  the  standard
 | |
|        matching  procedure,  PCRE checks that there is a "b" later in the sub-
 | |
|        ject string, and if there is not, it fails the match immediately.  How-
 | |
|        ever,  when  there  is no following literal this optimization cannot be
 | |
|        used. You can see the difference by comparing the behaviour of
 | |
| 
 | |
|          (a+)*\d
 | |
| 
 | |
|        with the pattern above. The pattern (a+)*b fails almost instantly when
 | |
|        applied to a whole line of "a" characters, whereas (a+)*\d  takes  an
 | |
|        appreciable time with strings longer than about 20 characters.
 | |
| 
 | |
|        In many cases, the solution to this kind of performance issue is to use
 | |
|        an atomic group or a possessive quantifier.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 06 March 2007
 | |
|        Copyright (c) 1997-2007 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCREPOSIX(3)                                                      PCREPOSIX(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions.
 | |
| 
 | |
| 
 | |
| SYNOPSIS OF POSIX API
 | |
| 
 | |
|        #include <pcreposix.h>
 | |
| 
 | |
|        int regcomp(regex_t *preg, const char *pattern,
 | |
|             int cflags);
 | |
| 
 | |
|        int regexec(regex_t *preg, const char *string,
 | |
|             size_t nmatch, regmatch_t pmatch[], int eflags);
 | |
| 
 | |
|        size_t regerror(int errcode, const regex_t *preg,
 | |
|             char *errbuf, size_t errbuf_size);
 | |
| 
 | |
|        void regfree(regex_t *preg);
 | |
| 
 | |
| 
 | |
| DESCRIPTION
 | |
| 
 | |
|        This  set  of  functions provides a POSIX-style API to the PCRE regular
 | |
|        expression package. See the pcreapi documentation for a description  of
 | |
|        PCRE's native API, which contains much additional functionality.
 | |
| 
 | |
|        The functions described here are just wrapper functions that ultimately
 | |
|        call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
 | |
|        pcreposix.h  header  file,  and  on  Unix systems the library itself is
 | |
|        called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
 | |
|        command  for  linking  an application that uses them. Because the POSIX
 | |
|        functions call the native ones, it is also necessary to add -lpcre.
 | |
| 
 | |
|        I have implemented only those POSIX option bits that can be  reasonably
 | |
|        mapped  to PCRE native options. In addition, the option REG_EXTENDED is
 | |
|        defined with the value zero. This has no  effect,  but  since  programs
 | |
|        that  are  written  to  the POSIX interface often use it, this makes it
 | |
|        easier to slot in PCRE as a replacement library.  Other  POSIX  options
 | |
|        are not even defined.
 | |
| 
 | |
|        When  PCRE  is  called  via these functions, it is only the API that is
 | |
|        POSIX-like in style. The syntax and semantics of  the  regular  expres-
 | |
|        sions  themselves  are  still  those of Perl, subject to the setting of
 | |
|        various PCRE options, as described below. "POSIX-like in  style"  means
 | |
|        that  the  API  approximates  to  the POSIX definition; it is not fully
 | |
|        POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
 | |
|        even less compatible.
 | |
| 
 | |
|        The  header for these functions is supplied as pcreposix.h to avoid any
 | |
|        potential clash with other POSIX  libraries.  It  can,  of  course,  be
 | |
|        renamed or aliased as regex.h, which is the "correct" name. It provides
 | |
|        two structure types, regex_t for  compiled  internal  forms,  and  reg-
 | |
|        match_t  for  returning  captured substrings. It also defines some con-
 | |
|        stants whose names start  with  "REG_";  these  are  used  for  setting
 | |
|        options and identifying error codes.
 | |
| 
 | |
| 
 | |
| COMPILING A PATTERN
 | |
| 
 | |
|        The  function regcomp() is called to compile a pattern into an internal
 | |
|        form. The pattern is a C string terminated by a  binary  zero,  and  is
 | |
|        passed  in  the  argument  pattern. The preg argument is a pointer to a
 | |
|        regex_t structure that is used as a base for storing information  about
 | |
|        the compiled regular expression.
 | |
| 
 | |
|        The argument cflags is either zero, or contains one or more of the bits
 | |
|        defined by the following macros:
 | |
| 
 | |
|          REG_DOTALL
 | |
| 
 | |
|        The PCRE_DOTALL option is set when the regular expression is passed for
 | |
|        compilation to the native function. Note that REG_DOTALL is not part of
 | |
|        the POSIX standard.
 | |
| 
 | |
|          REG_ICASE
 | |
| 
 | |
|        The PCRE_CASELESS option is set when the regular expression  is  passed
 | |
|        for compilation to the native function.
 | |
| 
 | |
|          REG_NEWLINE
 | |
| 
 | |
|        The  PCRE_MULTILINE option is set when the regular expression is passed
 | |
|        for compilation to the native function. Note that this does  not  mimic
 | |
|        the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
 | |
|        tion).
 | |
| 
 | |
|          REG_NOSUB
 | |
| 
 | |
|        The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
 | |
|        passed for compilation to the native function. In addition, when a pat-
 | |
|        tern that is compiled with this flag is passed to regexec() for  match-
 | |
|        ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
 | |
|        strings are returned.
 | |
| 
 | |
|          REG_UTF8
 | |
| 
 | |
|        The PCRE_UTF8 option is set when the regular expression is  passed  for
 | |
|        compilation  to the native function. This causes the pattern itself and
 | |
|        all data strings used for matching it to be treated as  UTF-8  strings.
 | |
|        Note that REG_UTF8 is not part of the POSIX standard.
 | |
| 
 | |
|        In  the  absence  of  these  flags, no options are passed to the native
 | |
|        function.  This means that the regex  is  compiled  with  PCRE  default
 | |
|        semantics.  In particular, the way it handles newline characters in the
 | |
|        subject string is the Perl way, not the POSIX way.  Note  that  setting
 | |
|        PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.
 | |
|        It does not affect the way newlines are matched by . (they  aren't)  or
 | |
|        by a negative class such as [^a] (they are).
 | |
| 
 | |
|        The  yield of regcomp() is zero on success, and non-zero otherwise. The
 | |
|        preg structure is filled in on success, and one member of the structure
 | |
|        is  public: re_nsub contains the number of capturing subpatterns in the
 | |
|        regular expression. Various error codes are defined in the header file.
 | |
| 
 | |
| 
 | |
| MATCHING NEWLINE CHARACTERS
 | |
| 
 | |
|        This area is not simple, because POSIX and Perl take different views of
 | |
|        things.   It  is  not possible to get PCRE to obey POSIX semantics, but
 | |
|        then PCRE was never intended to be a POSIX engine. The following  table
 | |
|        lists  the  different  possibilities for matching newline characters in
 | |
|        PCRE:
 | |
| 
 | |
|                                  Default   Change with
 | |
| 
 | |
|          . matches newline          no     PCRE_DOTALL
 | |
|          newline matches [^a]       yes    not changeable
 | |
|          $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
 | |
|          $ matches \n in middle     no     PCRE_MULTILINE
 | |
|          ^ matches \n in middle     no     PCRE_MULTILINE
 | |
| 
 | |
|        This is the equivalent table for POSIX:
 | |
| 
 | |
|                                  Default   Change with
 | |
| 
 | |
|          . matches newline          yes    REG_NEWLINE
 | |
|          newline matches [^a]       yes    REG_NEWLINE
 | |
|          $ matches \n at end        no     REG_NEWLINE
 | |
|          $ matches \n in middle     no     REG_NEWLINE
 | |
|          ^ matches \n in middle     no     REG_NEWLINE
 | |
| 
 | |
|        PCRE's behaviour is the same as Perl's, except that there is no equiva-
 | |
|        lent  for  PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
 | |
|        no way to stop newline from matching [^a].
 | |
| 
 | |
|        The  default  POSIX  newline  handling  can  be  obtained  by   setting
 | |
|        PCRE_DOTALL  and  PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
 | |
|        behave exactly as for the REG_NEWLINE action.
 | |
| 
 | |
| 
 | |
| MATCHING A PATTERN
 | |
| 
 | |
|        The function regexec() is called  to  match  a  compiled  pattern  preg
 | |
|        against  a  given string, which is by default terminated by a zero byte
 | |
|        (but see REG_STARTEND below), subject to the options in  eflags.  These
 | |
|        can be:
 | |
| 
 | |
|          REG_NOTBOL
 | |
| 
 | |
|        The PCRE_NOTBOL option is set when calling the underlying PCRE matching
 | |
|        function.
 | |
| 
 | |
|          REG_NOTEMPTY
 | |
| 
 | |
|        The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
 | |
|        ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
 | |
|        However, setting this option can give more POSIX-like behaviour in some
 | |
|        situations.
 | |
| 
 | |
|          REG_NOTEOL
 | |
| 
 | |
|        The PCRE_NOTEOL option is set when calling the underlying PCRE matching
 | |
|        function.
 | |
| 
 | |
|          REG_STARTEND
 | |
| 
 | |
|        The string is considered to start at string +  pmatch[0].rm_so  and  to
 | |
|        have  a terminating NUL located at string + pmatch[0].rm_eo (there need
 | |
|        not actually be a NUL at that location), regardless  of  the  value  of
 | |
|        nmatch.  This  is a BSD extension, compatible with but not specified by
 | |
|        IEEE Standard 1003.2 (POSIX.2), and should  be  used  with  caution  in
 | |
|        software intended to be portable to other systems. Note that a non-zero
 | |
|        rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
 | |
|        of the string, not how it is matched.
 | |
| 
 | |
|        If  the pattern was compiled with the REG_NOSUB flag, no data about any
 | |
|        matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
 | |
|        regexec() are ignored.
 | |
| 
 | |
|        Otherwise, the portion of the string that was matched, and  also  any
 | |
|        captured substrings, are returned via the pmatch argument, which points
 | |
|        to  an  array  of nmatch structures of type regmatch_t, containing the
 | |
|        members rm_so and rm_eo. These contain the offset to the first character
 | |
|        of  each  substring and the offset to the first character after the end
 | |
|        of each substring, respectively. The 0th element of the vector  relates
 | |
|        to  the  entire portion of string that was matched; subsequent elements
 | |
|        relate to the capturing subpatterns of the regular  expression.  Unused
 | |
|        entries in the array have both structure members set to -1.
 | |
| 
 | |
|        A  successful  match  yields  a  zero  return;  various error codes are
 | |
|        defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
 | |
|        failure code.
 | |
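
       The compile, match, and free sequence described above might be used
       along the following lines. This is only a sketch: match_key_value() is
       a hypothetical helper, and the system <regex.h> is included so that the
       code builds without PCRE. To use the PCRE wrapper, include
       <pcreposix.h> instead and link with -lpcreposix -lpcre. The pattern is
       chosen so that POSIX and Perl semantics agree on it.

```cpp
#include <regex.h>   // stand-in; with the PCRE wrapper use <pcreposix.h>
#include <cstddef>
#include <string>

// Hypothetical helper: matches "name:digits" and copies the two captured
// substrings out of the subject using the rm_so/rm_eo offsets.
bool match_key_value(const char *subject, std::string *key,
                     std::string *value) {
    regex_t preg;
    // REG_EXTENDED is zero in pcreposix.h, but the system library needs it
    // for this syntax, so passing it works with both.
    if (regcomp(&preg, "^([a-z]+):([0-9]+)$", REG_EXTENDED) != 0)
        return false;

    regmatch_t pmatch[3];  // [0] whole match, [1] and [2] the two groups
    int rc = regexec(&preg, subject, 3, pmatch, 0);
    if (rc == 0) {
        // rm_so is the offset of the first character of the substring,
        // rm_eo the offset of the first character after it.
        key->assign(subject + pmatch[1].rm_so,
                    static_cast<std::size_t>(pmatch[1].rm_eo - pmatch[1].rm_so));
        value->assign(subject + pmatch[2].rm_so,
                      static_cast<std::size_t>(pmatch[2].rm_eo - pmatch[2].rm_so));
    }
    regfree(&preg);  // release the memory associated with preg
    return rc == 0;
}
```

       Called with "ruby:1234", the helper returns true and stores "ruby" and
       "1234"; called with "ruby", regexec() reports no match and the helper
       returns false.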
| 
 | |
| 
 | |
| ERROR MESSAGES
 | |
| 
 | |
|        The regerror() function maps a non-zero error code from either regcomp()
 | |
|        or regexec() to a printable message. If preg is  not  NULL,  the  error
 | |
|        should have arisen from the use of that structure. A message terminated
 | |
|        by a binary zero is placed  in  errbuf.  The  length  of  the  message,
 | |
|        including  the  zero, is limited to errbuf_size. The yield of the func-
 | |
|        tion is the size of buffer needed to hold the whole message.
 | |
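
       Because the yield is the size needed for the whole message, a caller
       can ask for the size first and allocate afterwards. The sketch below
       assumes that a zero errbuf_size is accepted, as POSIX specifies;
       compile_error() is a hypothetical helper, and the system <regex.h>
       stands in for <pcreposix.h> so that the code builds without PCRE.

```cpp
#include <regex.h>   // stand-in; with the PCRE wrapper use <pcreposix.h>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the error message for a pattern that fails
// to compile, or an empty string if the pattern compiles.
std::string compile_error(const char *pattern) {
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&preg);
        return std::string();
    }
    // First call: a zero-length buffer, used only to learn the required
    // size, which includes the terminating zero.
    std::size_t needed = regerror(rc, &preg, NULL, 0);
    // Zero-initialized, and never empty, so &buf[0] is always valid.
    std::vector<char> buf(needed > 0 ? needed : 1);
    // Second call: the buffer is now large enough for the whole message.
    regerror(rc, &preg, &buf[0], buf.size());
    return std::string(&buf[0]);
}
```

       For example, compile_error("(") yields a non-empty message about the
       unmatched parenthesis; the exact wording depends on the library.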
| 
 | |
| 
 | |
| MEMORY USAGE
 | |
| 
 | |
|        Compiling a regular expression causes memory to be allocated and  asso-
 | |
|        ciated  with  the preg structure. The function regfree() frees all such
 | |
|        memory, after which preg may no longer be used as  a  compiled  expres-
 | |
|        sion.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 11 March 2009
 | |
|        Copyright (c) 1997-2009 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRECPP(3)                                                          PCRECPP(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions.
 | |
| 
 | |
| 
 | |
| SYNOPSIS OF C++ WRAPPER
 | |
| 
 | |
|        #include <pcrecpp.h>
 | |
| 
 | |
| 
 | |
| DESCRIPTION
 | |
| 
 | |
|        The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
 | |
|        functionality was added by Giuseppe Maxia. This brief man page was con-
 | |
|        structed  from  the  notes  in the pcrecpp.h file, which should be con-
 | |
|        sulted for further details.
 | |
| 
 | |
| 
 | |
| MATCHING INTERFACE
 | |
| 
 | |
|        The "FullMatch" operation checks that supplied text matches a  supplied
 | |
|        pattern  exactly.  If pointer arguments are supplied, it copies matched
 | |
|        sub-strings that match sub-patterns into them.
 | |
| 
 | |
|          Example: successful match
 | |
|             pcrecpp::RE re("h.*o");
 | |
|             re.FullMatch("hello");
 | |
| 
 | |
|          Example: unsuccessful match (requires full match):
 | |
|             pcrecpp::RE re("e");
 | |
|             !re.FullMatch("hello");
 | |
| 
 | |
|          Example: creating a temporary RE object:
 | |
|             pcrecpp::RE("h.*o").FullMatch("hello");
 | |
| 
 | |
|        You can pass in a "const char*" or a "string" for "text". The  examples
 | |
|        below  tend to use a const char*. You can, as in the different examples
 | |
|        above, store the RE object explicitly in a variable or use a  temporary
 | |
|        RE  object.  The  examples below use one mode or the other arbitrarily.
 | |
|        Either could correctly be used for any of these examples.
 | |
| 
 | |
|        You must supply extra pointer arguments to extract matched subpieces.
 | |
| 
 | |
|          Example: extracts "ruby" into "s" and 1234 into "i"
 | |
|             int i;
 | |
|             string s;
 | |
|             pcrecpp::RE re("(\\w+):(\\d+)");
 | |
|             re.FullMatch("ruby:1234", &s, &i);
 | |
| 
 | |
|          Example: does not try to extract any extra sub-patterns
 | |
|             re.FullMatch("ruby:1234", &s);
 | |
| 
 | |
|          Example: does not try to extract into NULL
 | |
|             re.FullMatch("ruby:1234", NULL, &i);
 | |
| 
 | |
|          Example: integer overflow causes failure
 | |
|             !re.FullMatch("ruby:1234567891234", NULL, &i);
 | |
| 
 | |
|          Example: fails because there aren't enough sub-patterns:
 | |
|             !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
 | |
| 
 | |
|          Example: fails because string cannot be stored in integer
 | |
|             !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
 | |
| 
 | |
|        The provided pointer arguments can be pointers to  any  scalar  numeric
 | |
|        type, or one of:
 | |
| 
 | |
|           string        (matched piece is copied to string)
 | |
|           StringPiece   (StringPiece is mutated to point to matched piece)
 | |
|           T             (where "bool T::ParseFrom(const char*, int)" exists)
 | |
|           NULL          (the corresponding matched sub-pattern is not copied)
 | |
| 
 | |
|        The  function returns true iff all of the following conditions are sat-
 | |
|        isfied:
 | |
| 
 | |
|          a. "text" matches "pattern" exactly;
 | |
| 
 | |
|          b. The number of matched sub-patterns is >= number of supplied
 | |
|             pointers;
 | |
| 
 | |
|          c. The "i"th argument has a suitable type for holding the
 | |
|             string captured as the "i"th sub-pattern. If you pass in
 | |
|             void * NULL for the "i"th argument, or a non-void * NULL
 | |
|             of the correct type, or pass fewer arguments than the
 | |
|             number of sub-patterns, the "i"th captured sub-pattern is
 | |
|             ignored.
 | |
| 
 | |
|        CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
 | |
|        string  is  assigned  the  empty  string. Therefore, the following will
 | |
|        return false (because the empty string is not a valid number):
 | |
| 
 | |
|           int number;
 | |
|           pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
 | |
| 
 | |
|        The matching interface supports at most 16 arguments per call.  If  you
 | |
|        need    more,    consider    using    the    more   general   interface
 | |
|        pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
 | |
| 
 | |
|        NOTE: Do not use no_arg, which is used internally to mark the end of  a
 | |
|        list  of optional arguments, as a placeholder for missing arguments, as
 | |
|        this can lead to segfaults.
 | |
| 
 | |
| 
 | |
| QUOTING METACHARACTERS
 | |
| 
 | |
|        You can use the "QuoteMeta" operation to insert backslashes before  all
 | |
|        potentially  meaningful  characters  in  a string. The returned string,
 | |
|        used as a regular expression, will exactly match the original string.
 | |
| 
 | |
|          Example:
 | |
|             string quoted = RE::QuoteMeta(unquoted);
 | |
| 
 | |
|        Note that it's legal to escape a character even if it  has  no  special
 | |
|        meaning  in  a  regular expression -- so this function does that. (This
 | |
|        also makes it identical to the perl function  of  the  same  name;  see
 | |
|        "perldoc    -f    quotemeta".)    For   example,   "1.5-2.0?"   becomes
 | |
|        "1\.5\-2\.0\?".
 | |
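
       The quoting rule can be restated as a short C++ sketch. It is an
       illustration of the rule as described here, not the library source:
       quote_meta_sketch() is a hypothetical name, and any special handling
       that the real QuoteMeta may apply to zero bytes or UTF-8 sequences is
       not modelled.

```cpp
#include <string>

// Hypothetical restatement of the quoting rule: put a backslash before
// every byte that is not an ASCII letter, a digit, or an underscore.
std::string quote_meta_sketch(const std::string &unquoted) {
    std::string result;
    result.reserve(unquoted.size() * 2);
    for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
        char c = unquoted[i];
        bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                     (c >= '0' && c <= '9') || c == '_';
        if (!plain) result += '\\';  // escaped even if it has no meaning
        result += c;
    }
    return result;
}
```

       Applied to "1.5-2.0?" the sketch produces "1\.5\-2\.0\?", matching the
       example above.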
| 
 | |
| 
 | |
| PARTIAL MATCHES
 | |
| 
 | |
|        You can use the "PartialMatch" operation when you want the  pattern  to
 | |
|        match any substring of the text.
 | |
| 
 | |
|          Example: simple search for a string:
 | |
|             pcrecpp::RE("ell").PartialMatch("hello");
 | |
| 
 | |
|          Example: find first number in a string:
 | |
|             int number;
 | |
|             pcrecpp::RE re("(\\d+)");
 | |
|             re.PartialMatch("x*100 + 20", &number);
 | |
|             assert(number == 100);
 | |
| 
 | |
| 
 | |
| UTF-8 AND THE MATCHING INTERFACE
 | |
| 
 | |
|        By  default,  pattern  and text are plain text, one byte per character.
 | |
|        The UTF8 flag, passed to  the  constructor,  causes  both  pattern  and
 | |
|        string to be treated as UTF-8 text, still a byte stream but potentially
 | |
|        multiple bytes per character. In practice, the text is likelier  to  be
 | |
|        UTF-8  than  the pattern, but the match returned may depend on the UTF8
 | |
|        flag, so always use it when matching UTF-8 text. For example, "."  will
 | |
|        match  one  byte normally, but with UTF8 set it matches one character,
 | |
|        which may occupy several bytes.
 | |
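
       The distinction between bytes and characters can be seen without PCRE.
       In the C++ sketch below, utf8_length() is a hypothetical helper that
       counts the characters in a well-formed UTF-8 string; it does not
       validate its input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters in a well-formed string by
// counting every byte that is not a continuation byte (binary 10xxxxxx).
std::size_t utf8_length(const std::string &s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

       For the string "caf\xC3\xA9" (the last character is U+00E9 encoded as
       two bytes), size() is 5 but utf8_length() is 4. Without the UTF8 flag a
       pattern sees five one-byte characters; with it, four characters.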
| 
 | |
|          Example:
 | |
|             pcrecpp::RE_Options options;
 | |
|             options.set_utf8();
 | |
|             pcrecpp::RE re(utf8_pattern, options);
 | |
|             re.FullMatch(utf8_string);
 | |
| 
 | |
|          Example: using the convenience function UTF8():
 | |
|             pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
 | |
|             re.FullMatch(utf8_string);
 | |
| 
 | |
|        NOTE: The UTF8 flag is ignored if pcre was not configured with the
 | |
|              --enable-utf8 flag.
 | |
| 
 | |
| 
 | |
| PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
 | |
| 
 | |
|        PCRE defines some modifiers to  change  the  behavior  of  the  regular
 | |
|        expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
 | |
|        RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
 | |
|        rently, the following modifiers are supported:
 | |
| 
 | |
|           modifier              description               Perl equivalent
 | |
| 
 | |
|           PCRE_CASELESS         case insensitive match      /i
 | |
|           PCRE_MULTILINE        multiple lines match        /m
 | |
|           PCRE_DOTALL           dot matches newlines        /s
 | |
|           PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
 | |
|           PCRE_EXTRA            strict escape parsing       N/A
 | |
|           PCRE_EXTENDED         ignore whitespace           /x
 | |
|           PCRE_UTF8             handles UTF8 chars          built-in
 | |
|           PCRE_UNGREEDY         reverses * and *?           N/A
 | |
|           PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
 | |
| 
 | |
|        (*)  Both Perl and PCRE allow non-capturing parentheses by means of the
 | |
|        "?:" modifier within the pattern itself. e.g. (?:ab|cd) does  not  cap-
 | |
|        ture, while (ab|cd) does.
 | |
| 
 | |
|        For  a  full  account of how each modifier works, please check the PCRE
 | |
|        API reference page.
 | |
| 
 | |
|        For each modifier, there are two member functions whose  name  is  made
 | |
|        out  of  the  modifier  in  lowercase,  without the "PCRE_" prefix. For
 | |
|        instance, PCRE_CASELESS is handled by
 | |
| 
 | |
|          bool caseless()
 | |
| 
 | |
|        which returns true if the modifier is set, and
 | |
| 
 | |
|          RE_Options & set_caseless(bool)
 | |
| 
 | |
|        which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
 | |
|        be  accessed  through  the  set_match_limit()  and match_limit() member
 | |
|        functions. Setting match_limit to a non-zero value will limit the  exe-
 | |
|        cution  of pcre to keep it from doing bad things like blowing the stack
 | |
|        or taking an eternity to return a result.  A  value  of  5000  is  good
 | |
|        enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
 | |
|        to  zero  disables  match  limiting.  Alternatively,   you   can   call
 | |
|        set_match_limit_recursion(), which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
 | |
|        to  limit  how much PCRE recurses. The match limit restricts the total
 | |
|        number of calls to  PCRE's  internal  match()  function;  the  recursion
 | |
|        limit  restricts  the  depth  of internal recursion, and therefore the
 | |
|        amount of stack that is used.
 | |
| 
 | |
|        Normally, to pass one or more modifiers to a RE class,  you  declare  a
 | |
|        RE_Options object, set the appropriate options, and pass this object to
 | |
|        a RE constructor. Example:
 | |
| 
 | |
|           RE_Options opt;
 | |
|           opt.set_caseless(true);
 | |
|           if (RE("HELLO", opt).PartialMatch("hello world")) ...
 | |
| 
 | |
|        RE_Options has two constructors. The  default  constructor  takes  no
 | |
|        arguments  and creates a set of flags that are off by default. The other
 | |
|        takes an int argument, option_flags, to facilitate transfer  of  legacy
 | |
|        code from C programs.  This lets you do
 | |
| 
 | |
|           RE(pattern,
 | |
|             RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
 | |
| 
 | |
|        However, new code is better off doing
 | |
| 
 | |
|           RE(pattern,
 | |
|             RE_Options().set_caseless(true).set_multiline(true))
 | |
|               .PartialMatch(str);
 | |
| 
 | |
|        If you are going to pass one of the most used modifiers, there are some
 | |
|        convenience functions that return a RE_Options class with the appropri-
 | |
|        ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
 | |
|        and EXTENDED().
 | |
| 
 | |
|        If you need to set several options at once, and you don't  want  to  go
 | |
|        through  the pains of declaring a RE_Options object and setting several
 | |
|        options, there is a parallel method that gives you such ability on  the
 | |
|        fly.  You  can  concatenate several set_xxxxx() member functions, since
 | |
|        each of them returns a reference to its class object. For  example,  to
 | |
|        pass  PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
 | |
|        statement, you may write:
 | |
| 
 | |
|           RE(" ^ xyz \\s+ .* blah$",
 | |
|             RE_Options()
 | |
|               .set_caseless(true)
 | |
|               .set_extended(true)
 | |
|               .set_multiline(true)).PartialMatch(sometext);
 | |
| 
 | |
| 
 | |
| SCANNING TEXT INCREMENTALLY
 | |
| 
 | |
|        The "Consume" operation may be useful if you want to  repeatedly  match
 | |
|        regular expressions at the front of a string and skip over them as they
 | |
|        match. This requires use of the "StringPiece" type, which represents  a
 | |
|        sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
 | |
|        pcrecpp namespace.
 | |
| 
 | |
|          Example: read lines of the form "var = value" from a string.
 | |
|             string contents = ...;                 // Fill string somehow
 | |
|             pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
 | |
| 
 | |
|             string var;
 | |
|             int value;
 | |
|             pcrecpp::RE re("(\\w+) = (\\d+)\n");
 | |
|             while (re.Consume(&input, &var, &value)) {
 | |
|               ...;
 | |
|             }
 | |
| 
 | |
|        Each successful call  to  "Consume"  will  set  "var/value",  and  also
 | |
|        advance "input" so it points past the matched text.
 | |
| 
 | |
|        The  "FindAndConsume"  operation  is  similar to "Consume" but does not
 | |
|        anchor your match at the beginning of  the  string.  For  example,  you
 | |
|        could extract all words from a string by repeatedly calling
 | |
| 
 | |
|          pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
 | |
| 
 | |
| 
 | |
| PARSING HEX/OCTAL/C-RADIX NUMBERS
 | |
| 
 | |
|        By default, if you pass a pointer to a numeric value, the corresponding
 | |
|        text is interpreted as a base-10  number.  You  can  instead  wrap  the
 | |
|        pointer with a call to one of the operators Hex(), Octal(), or CRadix()
 | |
|        to interpret the text in another base. The CRadix  operator  interprets
 | |
|        C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
 | |
|        base-10.
 | |
| 
 | |
|          Example:
 | |
|            int a, b, c, d;
 | |
|            pcrecpp::RE re("(.*) (.*) (.*) (.*)");
 | |
|            re.FullMatch("100 40 0100 0x40",
 | |
|                         pcrecpp::Octal(&a), pcrecpp::Hex(&b),
 | |
|                         pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
 | |
| 
 | |
|        will leave 64 in a, b, c, and d.
 | |
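
       The radix rules can be illustrated with the standard strtol()
       function, whose base 0 applies the same C-style prefix rules that are
       described for CRadix. This is a sketch of the interpretation only, not
       a statement of how the wrapper is implemented; parse_int_radix() is a
       hypothetical helper.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: parse all of `text` as an int in the given radix.
// Radix 8 and 16 correspond to Octal() and Hex(); radix 0 asks strtol to
// apply the C rules ("0" prefix is octal, "0x" is hex, otherwise decimal),
// which is the behaviour described for CRadix().
bool parse_int_radix(const char *text, int radix, int *out) {
    if (*text == '\0') return false;       // empty string is not a number
    errno = 0;
    char *end = NULL;
    long v = std::strtol(text, &end, radix);
    if (*end != '\0') return false;        // trailing non-numeric text
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return false;                      // does not fit in an int
    *out = static_cast<int>(v);
    return true;
}
```

       With this helper, "100" in radix 8, "40" in radix 16, and both "0100"
       and "0x40" in radix 0 all yield 64, while "1234567891234" in radix 10
       is rejected because it does not fit in an int.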
| 
 | |
| 
 | |
| REPLACING PARTS OF STRINGS
 | |
| 
 | |
|        You can replace the first match of "pattern" in "str"  with  "rewrite".
 | |
|        Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
 | |
|        insert text matching corresponding parenthesized group  from  the  pat-
 | |
|        tern. \0 in "rewrite" refers to the entire matching text. For example:
 | |
| 
 | |
|          string s = "yabba dabba doo";
 | |
|          pcrecpp::RE("b+").Replace("d", &s);
 | |
| 
 | |
|        will  leave  "s" containing "yada dabba doo". The result is true if the
 | |
|        pattern matches and a replacement occurs, false otherwise.
 | |
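
       The way backslash-escaped digits in "rewrite" are expanded can be
       sketched in C++ as follows. This is an illustration, not the pcrecpp
       implementation: expand_rewrite() is a hypothetical helper, and its
       treatment of a backslash followed by a non-digit is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: expands a rewrite string given the captured texts.
// groups[0] is the entire match, groups[1] the first parenthesized group,
// and so on.
std::string expand_rewrite(const std::string &rewrite,
                           const std::vector<std::string> &groups) {
    std::string out;
    for (std::string::size_type i = 0; i < rewrite.size(); ++i) {
        char c = rewrite[i];
        if (c == '\\' && i + 1 < rewrite.size()) {
            char next = rewrite[++i];
            if (next >= '0' && next <= '9') {
                std::vector<std::string>::size_type n =
                    static_cast<std::vector<std::string>::size_type>(next - '0');
                // A group that does not exist contributes nothing.
                if (n < groups.size()) out += groups[n];
            } else {
                out += next;  // assumption: "\x" for a non-digit x yields x
            }
        } else {
            out += c;  // ordinary character, copied unchanged
        }
    }
    return out;
}
```

       For a match of "boris@kremvax" with groups "boris" and "kremvax", the
       rewrite string "\2!\1" expands to "kremvax!boris", and "[\0]" expands
       to "[boris@kremvax]".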
| 
 | |
|        GlobalReplace is like Replace except that it replaces  all  occurrences
 | |
|        of  the  pattern  in  the string with the rewrite. Replacements are not
 | |
|        subject to re-matching. For example:
 | |
| 
 | |
|          string s = "yabba dabba doo";
 | |
|          pcrecpp::RE("b+").GlobalReplace("d", &s);
 | |
| 
 | |
|        will leave "s" containing "yada dada doo". It  returns  the  number  of
 | |
|        replacements made.
 | |
| 
 | |
|        Extract  is like Replace, except that if the pattern matches, "rewrite"
 | |
|        is copied into "out" (an additional argument) with substitutions.   The
 | |
|        non-matching  portions  of "text" are ignored. Returns true iff a match
 | |
|        occurred and the extraction happened successfully;  if no match occurs,
 | |
|        the string is left unaffected.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        The C++ wrapper was contributed by Google Inc.
 | |
|        Copyright (c) 2007 Google Inc.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 17 March 2009
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRESAMPLE(3)                                                    PCRESAMPLE(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE SAMPLE PROGRAM
 | |
| 
 | |
|        A simple, complete demonstration program, to get you started with using
 | |
|        PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
 | |
| 
 | |
|        The program compiles the regular expression that is its first argument,
 | |
|        and  matches  it  against the subject string in its second argument. No
 | |
|        PCRE options are set, and default character tables are used. If  match-
 | |
|        ing  succeeds,  the  program  outputs  the  portion of the subject that
 | |
|        matched, together with the contents of any captured substrings.
 | |
| 
 | |
|        If the -g option is given on the command line, the program then goes on
 | |
|        to check for further matches of the same regular expression in the same
 | |
|        subject string. The logic is a little bit tricky because of the  possi-
 | |
|        bility  of  matching an empty string. Comments in the code explain what
 | |
|        is going on.
 | |
| 
 | |
|        If PCRE is installed in the standard include  and  library  directories
 | |
|        for  your  system, you should be able to compile the demonstration pro-
 | |
|        gram using this command:
 | |
| 
 | |
|          gcc -o pcredemo pcredemo.c -lpcre
 | |
| 
 | |
|        If PCRE is installed elsewhere, you may need to add additional  options
 | |
|        to  the  command line. For example, on a Unix-like system that has PCRE
 | |
|        installed in /usr/local, you  can  compile  the  demonstration  program
 | |
|        using a command like this:
 | |
| 
 | |
|          gcc -o pcredemo -I/usr/local/include pcredemo.c \
 | |
|              -L/usr/local/lib -lpcre
 | |
| 
 | |
|        Once  you  have  compiled the demonstration program, you can run simple
 | |
|        tests like this:
 | |
| 
 | |
|          ./pcredemo 'cat|dog' 'the cat sat on the mat'
 | |
|          ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
 | |
| 
 | |
|        Note that there is a  much  more  comprehensive  test  program,  called
 | |
|        pcretest,  which  supports  many  more  facilities  for testing regular
 | |
|        expressions and the PCRE library. The pcredemo program is provided as a
 | |
|        simple coding example.
 | |
| 
 | |
|        On some operating systems (e.g. Solaris), when PCRE is not installed in
 | |
|        the standard library directory, you may get an error like this when you
 | |
|        try to run pcredemo:
 | |
| 
 | |
|          ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
 | |
| 
 | |
|        This is caused by the way shared library support works  on  those  sys-
 | |
|        tems. You need to add
 | |
| 
 | |
|          -R/usr/local/lib
 | |
| 
 | |
|        (for example) to the compile command to get round this problem.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 23 January 2008
 | |
|        Copyright (c) 1997-2008 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| PCRESTACK(3)                                                      PCRESTACK(3)
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE - Perl-compatible regular expressions
 | |
| 
 | |
| 
 | |
| PCRE DISCUSSION OF STACK USAGE
 | |
| 
 | |
|        When  you call pcre_exec(), it makes use of an internal function called
 | |
|        match(). This calls itself recursively at branch points in the pattern,
 | |
|        in  order to remember the state of the match so that it can back up and
 | |
|        try a different alternative if the first one fails.  As  matching  pro-
 | |
|        ceeds  deeper  and deeper into the tree of possibilities, the recursion
 | |
|        depth increases.
 | |
| 
 | |
|        Not all calls of match() increase the recursion depth; for an item such
 | |
|        as  a* it may be called several times at the same level, after matching
 | |
|        different numbers of a's. Furthermore, in a number of cases  where  the
 | |
|        result  of  the  recursive call would immediately be passed back as the
 | |
|        result of the current call (a "tail recursion"), the function  is  just
 | |
|        restarted instead.
 | |
| 
 | |
|        The pcre_dfa_exec() function operates in an entirely different way, and
 | |
|        hardly uses recursion at all. The limit on its complexity is the amount
 | |
|        of  workspace  it  is  given.  The comments that follow do NOT apply to
 | |
|        pcre_dfa_exec(); they are relevant only for pcre_exec().
 | |
| 
 | |
|        You can set limits on the number of times that match() is called, both
 | |
|        in total and recursively. If either limit is exceeded, pcre_exec()
 | |
|        stops and returns an error code. For details, see the section on extra
 | |
|        data for pcre_exec() in the pcreapi documentation.
 | |
| 
 | |
|        Each  time  that match() is actually called recursively, it uses memory
 | |
|        from the process stack. For certain kinds of  pattern  and  data,  very
 | |
|        large  amounts of stack may be needed, despite the recognition of "tail
 | |
|        recursion".  You can often reduce the amount of recursion,  and  there-
 | |
|        fore  the  amount of stack used, by modifying the pattern that is being
 | |
|        matched. Consider, for example, this pattern:
 | |
| 
 | |
|          ([^<]|<(?!inet))+
 | |
| 
 | |
|        It matches from wherever it starts until it encounters "<inet"  or  the
 | |
|        end  of  the  data,  and is the kind of pattern that might be used when
 | |
|        processing an XML file. Each iteration of the outer parentheses matches
 | |
|        either  one  character that is not "<" or a "<" that is not followed by
 | |
|        "inet". However, each time a  parenthesis  is  processed,  a  recursion
 | |
|        occurs, so this formulation uses a stack frame for each matched charac-
 | |
|        ter. For a long string, a lot of stack is required. Consider  now  this
 | |
|        rewritten pattern, which matches exactly the same strings:
 | |
| 
 | |
|          ([^<]++|<(?!inet))+
 | |
| 
 | |
|        This uses far less stack, because runs of characters that do not
 | |
|        contain "<" are "swallowed" in one item inside the parentheses.
 | |
|        Recursion happens only when a "<" character that is not followed by
 | |
|        "inet" is encountered (and we assume this is relatively rare). A
 | |
|        possessive quantifier is used to stop any backtracking into the runs
 | |
|        of non-"<" characters, but that is not related to stack usage.
 | |
| 
 | |
|        This example shows that one way of avoiding stack problems when  match-
 | |
|        ing long subject strings is to write repeated parenthesized subpatterns
 | |
|        to match more than one character whenever possible.
 | |
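       The difference between the two patterns can be pictured with a toy
       model. The sketch below is NOT PCRE's match() function; it is a pair
       of hypothetical C functions (toy_match_per_char and toy_match_runs,
       names invented here) that mimic the two formulations and record the
       deepest recursion level reached. It assumes one recursion per
       iteration of the outer group, as described above.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: these functions imitate the recursion pattern described
   in the text. They are not part of PCRE and do not use its API. */

/* Non-zero if s starts with "<inet", the point where both patterns stop. */
static int at_stop(const char *s)
{
    return strncmp(s, "<inet", 5) == 0;
}

/* Models ([^<]|<(?!inet))+ : every matched character costs one recursion.
   Returns the number of characters matched; *max_depth receives the
   deepest recursion level reached. */
size_t toy_match_per_char(const char *s, int depth, int *max_depth)
{
    if (depth > *max_depth)
        *max_depth = depth;
    if (*s == '\0' || at_stop(s))
        return 0;
    return 1 + toy_match_per_char(s + 1, depth + 1, max_depth);
}

/* Models ([^<]++|<(?!inet))+ : a whole run of non-"<" characters is
   consumed by one iteration, so recursion happens once per run and once
   per "<" that is not followed by "inet". */
size_t toy_match_runs(const char *s, int depth, int *max_depth)
{
    size_t n;

    if (depth > *max_depth)
        *max_depth = depth;
    if (*s == '\0' || at_stop(s))
        return 0;
    n = (*s == '<') ? 1 : strcspn(s, "<");
    return n + toy_match_runs(s + n, depth + 1, max_depth);
}
```

       For the subject "aaaa<b>cccc<inet>" both functions match the same 11
       characters, but the first reaches depth 11 and the second only
       depth 3. The gap widens as the runs between "<" characters get
       longer.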
| 
 | |
|    Compiling PCRE to use heap instead of stack
 | |
| 
 | |
|        In environments where stack memory is constrained, you  might  want  to
 | |
|        compile  PCRE to use heap memory instead of stack for remembering back-
 | |
|        up points. This makes it run a lot more slowly, however. Details of how
 | |
|        to do this are given in the pcrebuild documentation. When built in this
 | |
|        way, instead of using the stack, PCRE obtains and frees memory by call-
 | |
|        ing  the  functions  that  are  pointed to by the pcre_stack_malloc and
 | |
|        pcre_stack_free variables. By default,  these  point  to  malloc()  and
 | |
|        free(),  but you can replace the pointers to cause PCRE to use your own
 | |
|        functions. Since the block sizes are always the same,  and  are  always
 | |
|        freed in reverse order, it may be possible to implement customized mem-
 | |
|        ory handlers that are more efficient than the standard functions.
 | |
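       As an illustration of such a customized handler, here is a minimal
       sketch that keeps freed blocks on a last-in, first-out cache instead
       of returning them to the system. All names (my_stack_malloc,
       my_stack_free, my_stack_release_cache) are hypothetical; only the
       two function signatures are chosen to match the pcre_stack_malloc
       and pcre_stack_free pointer types. The sketch assumes C11 (for
       max_align_t) and is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block. The union forces the payload
   that follows it to be suitably aligned for any object type. */
typedef union block_header {
    struct {
        union block_header *next;  /* link while the block is cached */
        size_t size;               /* payload size of this block */
    } info;
    max_align_t align;
} block_header;

static block_header *cache = NULL;  /* most recently freed block first */

/* Same signature as the function pcre_stack_malloc points to. */
void *my_stack_malloc(size_t size)
{
    block_header *h = cache;

    /* Reuse the most recently freed block when the size matches. */
    if (h != NULL && h->info.size == size) {
        cache = h->info.next;
        return h + 1;
    }
    h = malloc(sizeof(block_header) + size);
    if (h == NULL)
        return NULL;
    h->info.size = size;
    h->info.next = NULL;
    return h + 1;
}

/* Same signature as the function pcre_stack_free points to. */
void my_stack_free(void *block)
{
    block_header *h;

    if (block == NULL)
        return;
    h = (block_header *)block - 1;
    h->info.next = cache;           /* push onto the cache */
    cache = h;
}

/* Hand all cached blocks back to the system, e.g. at program exit. */
void my_stack_release_cache(void)
{
    while (cache != NULL) {
        block_header *next = cache->info.next;
        free(cache);
        cache = next;
    }
}
```

       In a PCRE build that uses the heap, such functions would be
       installed by assigning them to pcre_stack_malloc and pcre_stack_free
       before pcre_exec() is called. Whether this is actually faster than
       malloc() and free() depends on the platform and should be measured.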
| 
 | |
|    Limiting PCRE's stack usage
 | |
| 
 | |
|        PCRE has an internal counter that can be used to  limit  the  depth  of
 | |
|        recursion,  and  thus cause pcre_exec() to give an error code before it
 | |
|        runs out of stack. By default, the limit is very  large,  and  unlikely
 | |
|        ever  to operate. It can be changed when PCRE is built, and it can also
 | |
|        be set when pcre_exec() is called. For details of these interfaces, see
 | |
|        the pcrebuild and pcreapi documentation.
 | |
| 
 | |
|        As a very rough rule of thumb, you should reckon on about 500 bytes per
 | |
|        recursion. Dividing the stack budget by that figure and rounding down
 | |
|        gives the limit: if you want to keep your stack usage within 8MB, you
 | |
|        should set the limit at about 16000 recursions. A 64MB stack, on the
 | |
|        other hand, can support around 128000 recursions. The pcretest test
 | |
|        program has a command line option (-S) that can be used to increase
 | |
|        the size of its stack.
 | |
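       The rule of thumb amounts to a single division. The helper below is
       a hypothetical sketch (the function and macro names are invented
       here), and the 500-byte figure is only the rough estimate quoted
       above, not a measured value for any particular build.

```c
/* Rough per-recursion stack cost, taken from the rule of thumb above. */
#define BYTES_PER_RECURSION 500UL

/* Estimate how many match() recursions fit in a stack budget given in
   bytes. The result is an upper bound; round it down further to leave
   room for the rest of the program's stack use. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}
```

       For an 8MB budget (8*1024*1024 bytes) this yields 16777, which the
       text rounds down to 16000. For 64MB it yields 134217, rounded down
       to about 128000. The chosen value would then be supplied through
       the match_limit_recursion field of the pcre_extra block, with the
       PCRE_EXTRA_MATCH_LIMIT_RECURSION flag set, as described in the
       pcreapi documentation.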
| 
 | |
|    Changing stack size in Unix-like systems
 | |
| 
 | |
|        In Unix-like environments, the stack is seldom a problem unless very
 | |
|        long strings are involved, though the default limit on stack size
 | |
|        varies from system to system. Values from 8MB to 64MB are common. You
 | |
|        can find your default soft limit (usually reported in kilobytes) by
 | |
|        running the command:
 | |
| 
 | |
|          ulimit -s
 | |
| 
 | |
|        Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
 | |
|        though sometimes a more explicit error message is given. You  can  nor-
 | |
|        mally increase the limit on stack size by code such as this:
 | |
| 
 | |
|          #include <sys/resource.h>
 | |
| 
 | |
|          struct rlimit rlim;
 | |
|          getrlimit(RLIMIT_STACK, &rlim);
 | |
|          rlim.rlim_cur = 100*1024*1024;
 | |
|          setrlimit(RLIMIT_STACK, &rlim);
 | |
| 
 | |
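|        This reads the current limits (soft and hard) using getrlimit(), then
 | |
|        attempts to increase the soft limit to 100MB using setrlimit(). The
 | |
|        attempt fails if the hard limit is lower than 100MB, so it is worth
 | |
|        checking the value that setrlimit() returns. You must do this before
 | |
|        calling pcre_exec().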
 | |
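       A slightly fuller version of the same idea, with error checking, is
       sketched below. The function name raise_stack_soft_limit is
       hypothetical. The sketch assumes a POSIX system, and it chooses to
       clamp the request to the hard limit rather than fail, which is a
       design decision of this example and not something PCRE requires.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Try to raise the soft stack limit to at least `wanted` bytes.
   Returns 0 on success (including when the limit is already large
   enough) and -1 if getrlimit() or setrlimit() fails. */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* Nothing to do if the soft limit is unlimited or already sufficient. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* The soft limit may not exceed the hard limit, so clamp the request. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    if (setrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

       A program would call, for example,
       raise_stack_soft_limit(100UL*1024*1024) early in main(), before the
       first call to pcre_exec(). Because the request may have been
       clamped, a caller that needs the full amount should read the limit
       back with getrlimit() afterwards.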
| 
 | |
|    Changing stack size in Mac OS X
 | |
| 
 | |
|        Using setrlimit(), as described above, should also work on Mac OS X. It
 | |
|        is also possible to set a stack size when linking a program. There is a
 | |
|        discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
 | |
|        http://developer.apple.com/qa/qa2005/qa1419.html.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge CB2 3QH, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 09 July 2008
 | |
|        Copyright (c) 1997-2008 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 |