.TH PCRECPP 3 "08 January 2012" "PCRE 8.30"
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF C++ WRAPPER"
.rs
.sp
.B #include <pcrecpp.h>
.
.SH DESCRIPTION
.rs
.sp
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
further details. Note that the C++ wrapper supports only the original 8-bit
PCRE library. There is no 16-bit or 32-bit support at present.
.
.
.SH "MATCHING INTERFACE"
.rs
.sp
The "FullMatch" operation checks that the supplied text matches the supplied
pattern exactly. If pointer arguments are supplied, it copies the sub-strings
matched by sub-patterns into them.
.sp
  Example: successful match:
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");
.sp
  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");
.sp
  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
.sp
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
.P
You must supply extra pointer arguments to extract matched subpieces.
.sp
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\e\ew+):(\e\ed+)");
     re.FullMatch("ruby:1234", &s, &i);
.sp
  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);
.sp
  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);
.sp
  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);
.sp
  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
.sp
  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
.sp
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
.sp
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
.sp
The function returns true iff all of the following conditions are satisfied:
.sp
  a. "text" matches "pattern" exactly;
.sp
  b. The number of matched sub-patterns is >= number of supplied
     pointers;
.sp
  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
.sp
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
.sp
   int number;
   pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
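.P
The numeric failure cases described above (overflow, text that is not a
number, and the empty string left by an absent optional group) can be pictured
with a small stand-alone sketch. The helper \fIparse_int_sketch\fP is
hypothetical and is not part of pcrecpp; it uses only the C library and is
meant to illustrate the rules, not to reproduce the wrapper's own code.
.sp
.nf
```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of pcrecpp: convert captured text to an int,
// failing on empty input, non-numeric text, or out-of-range values.
bool parse_int_sketch(const std::string& text, int* out) {
  if (text.empty()) return false;  // an absent optional group is empty
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (end != text.c_str() + text.size()) return false;   // not entirely numeric
  if (errno == ERANGE) return false;                      // does not fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;   // does not fit in an int
  *out = static_cast<int>(value);
  return true;
}
```
.fi
.P
Under these assumptions, "1234" converts successfully, while "ruby", the empty
string, and a value too large for an int are all rejected.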
.sp
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
\fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
\fBDoMatch\fP.
.P
NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.
.
.
.SH "QUOTING METACHARACTERS"
.rs
.sp
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
.sp
  Example:
     string quoted = RE::QuoteMeta(unquoted);
.sp
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
.
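.P
The quoting rule can be sketched in a few lines. The function
\fIquote_meta_sketch\fP below is hypothetical and only mimics the behaviour
described above for plain ASCII input; it is an assumption that every byte other
than a letter, digit, or underscore gets a backslash, and any special handling
that the real QuoteMeta may apply to NUL bytes or bytes above 127 is left out.
.sp
.nf
```cpp
#include <string>

// Hypothetical helper, not the pcrecpp implementation: put a backslash in
// front of every byte that is not an ASCII letter, digit, or underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
  const char backslash = 92;  // ASCII code of the backslash character
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9') || c == '_';
    if (!plain) quoted += backslash;  // escape anything that might be special
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```
.fi
.P
Applied to "1.5-2.0?", the sketch escapes the two dots, the hyphen, and the
question mark, and leaves the digits alone.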
.SH "PARTIAL MATCHES"
.rs
.sp
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
.sp
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");
.sp
  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\e\ed+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
.
.
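.P
The difference between matching the whole text and matching any substring is
not specific to pcrecpp. As a rough analogy only, the sketch below uses
std::regex from the C++ standard library as a stand-in: std::regex_match plays
the role of FullMatch and std::regex_search the role of PartialMatch. The
function names are hypothetical, and std::regex uses its own pattern grammar,
so this is not a substitute for the wrapper.
.sp
.nf
```cpp
#include <regex>
#include <string>

// Whole-text match: the analogue of FullMatch in this sketch.
bool full_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// Match anywhere in the text: the analogue of PartialMatch in this sketch.
bool partial_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Return the first run of digits in the text as an int, or -1 if there is none.
int first_number_sketch(const std::string& text) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("([0-9]+)"))) return -1;
  return std::stoi(m[1].str());
}
```
.fi
.P
In this sketch the pattern "e" fails as a full match against "hello" but
succeeds as a partial match, and the first number found in "x*100 + 20" is 100.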
.SH "UTF-8 AND THE MATCHING INTERFACE"
.rs
.sp
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match all the bytes of a multi-byte character.
.sp
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8();
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);
.sp
  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
.sp
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
      --enable-utf8 flag.
.
.
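.P
To see why the flag changes what a single "." can match, it helps to count
characters rather than bytes. The sketch below is independent of PCRE and
assumes well-formed UTF-8 input; both function names are hypothetical. It reads
the length of each sequence from its lead byte, which is how one character can
span several bytes.
.sp
.nf
```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with this lead byte,
// or 0 if the byte cannot start a sequence (continuation or invalid byte).
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Count characters rather than bytes, assuming well-formed UTF-8 input.
std::size_t utf8_character_count(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size();) {
    const std::size_t n =
        utf8_sequence_length(static_cast<unsigned char>(text[i]));
    i += (n == 0 ? 1 : n);  // step over stray bytes one at a time
    ++count;
  }
  return count;
}
```
.fi
.P
A string made of the byte for "h" followed by the two bytes 0xC3 0xA9 is three
bytes long but holds two characters, so a byte-oriented "." and a
character-oriented "." see it differently.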
.SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
.rs
.sp
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
.sp
   modifier              description                 Perl equivalent
.sp
   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
.sp
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
.P
For a full account of how each modifier works, see the
PCRE API reference page.
.P
For each modifier, there are two member functions whose names are derived
from the modifier name in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
.sp
  bool caseless()
.sp
which returns true if the modifier is set, and
.sp
  RE_Options & set_caseless(bool)
.sp
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member
functions. Setting \fImatch_limit\fP to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP to zero disables
match limiting. Alternatively, you can use \fBset_match_limit_recursion()\fP,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. \fBmatch_limit()\fP limits the number of matches PCRE does;
\fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
therefore the amount of stack that is used.
.P
Normally, to pass one or more modifiers to a RE class, you declare
a \fIRE_Options\fP object, set the appropriate options, and pass this
object to a RE constructor. Example:
.sp
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
.sp
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The other takes an
\fIoption_flags\fP parameter, which eases the transfer of legacy code from C
programs. This lets you do
.sp
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
.sp
However, new code is better off doing
.sp
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
.sp
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP,
\fBMULTILINE()\fP, \fBDOTALL()\fP, and \fBEXTENDED()\fP.
.P
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several \fBset_xxxxx()\fP member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
.sp
   RE(" ^ xyz \e\es+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
.sp
.
.
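.P
The getter and setter pairing and the chaining of setters can be illustrated
with a toy class. Everything in the sketch below is hypothetical:
\fIOptionsSketch\fP is not RE_Options, and the flag values are made up for the
example rather than taken from pcre.h. It only shows why a setter that returns
a reference to its own object allows several options to be set in one
expression.
.sp
.nf
```cpp
// Hypothetical flag values for illustration only; the real PCRE_* constants
// are defined in pcre.h and have different values.
enum SketchFlag : unsigned {
  kCaseless = 1u << 0,
  kMultiline = 1u << 1,
  kExtended = 1u << 2
};

// A toy options class showing the getter/setter pairing and setter chaining.
class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}  // all flags off by default
  explicit OptionsSketch(unsigned option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }

  bool multiline() const { return (flags_ & kMultiline) != 0; }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }

  bool extended() const { return (flags_ & kExtended) != 0; }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  unsigned all_options() const { return flags_; }

 private:
  // Turn one bit on or off and return *this so that calls can be chained.
  OptionsSketch& set(unsigned bit, bool on) {
    if (on) flags_ |= bit; else flags_ &= ~bit;
    return *this;
  }
  unsigned flags_;
};
```
.fi
.P
In this sketch, chaining set_caseless(true) and set_multiline(true) on a
default-constructed object yields the same flags as constructing the object
from the two flag values combined with a bitwise OR.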
.SH "SCANNING TEXT INCREMENTALLY"
.rs
.sp
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
.sp
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
.sp
     string var;
     int value;
     pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
.sp
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
.P
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
.sp
  pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
.
.
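.P
The consume loop (match at the front, record the captures, step past the match)
can be imitated with the C++ standard library. The sketch below is a stand-in
and not pcrecpp: it uses std::regex with the match_continuous flag to anchor
each attempt at the current position, and a pair of pointers in place of a
StringPiece. The function name and the "name = number;" input format are
assumptions made for the example.
.sp
.nf
```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Repeatedly match "name = number;" at the front of the remaining input and
// step past each match, imitating the Consume loop.
std::vector<std::pair<std::string, int>> consume_all_sketch(
    const std::string& contents) {
  const std::regex re("([a-z]+) = ([0-9]+);");
  std::vector<std::pair<std::string, int>> result;
  const char* first = contents.data();
  const char* const last = first + contents.size();
  std::cmatch m;
  // match_continuous only accepts a match that begins exactly at "first".
  while (std::regex_search(first, last, m, re,
                           std::regex_constants::match_continuous)) {
    result.emplace_back(m[1].str(), std::stoi(m[2].str()));
    first = m[0].second;  // advance past the matched text
  }
  return result;
}
```
.fi
.P
The loop stops at the first position where the pattern no longer matches at the
front, so trailing text that does not fit the format is left unread. Dropping
the match_continuous flag would give the unanchored behaviour that
"FindAndConsume" describes.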
.SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
.rs
.sp
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
.sp
  Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
.sp
will leave 64 in a, b, c, and d.
.
.
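.P
The arithmetic behind the example can be checked with the C library function
strtol, whose base argument follows the same conventions: 8 for octal, 16 for
hexadecimal, and 0 for C-style prefix detection with a base-10 default. The
wrapper function below is hypothetical and is offered only as a way to verify
the numbers, not as a description of how pcrecpp parses them.
.sp
.nf
```cpp
#include <cstdlib>

// Parse text as an integer in the given base using the C library.
// Base 0 asks strtol to honour C-style "0" and "0x" prefixes.
long parse_radix_sketch(const char* text, int base) {
  return std::strtol(text, nullptr, base);
}
```
.fi
.P
Called with ("100", 8), ("40", 16), ("0100", 0), and ("0x40", 0), the sketch
returns 64 each time, matching the four values in the example above.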
.SH "REPLACING PARTS OF STRINGS"
.rs
.sp
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \e0 in "rewrite" refers to the entire matching
text. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
.sp
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
.P
\fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
.sp
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
.P
\fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
.
.
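.P
The first-match and all-matches behaviours can be compared side by side with
the C++ standard library, used here as a stand-in and not as pcrecpp itself.
In the sketch, std::regex_replace with the format_first_only flag corresponds
to Replace, the default call corresponds to GlobalReplace, and counting
matches gives the number of replacements. The function names are hypothetical,
and note that std::regex writes group references in the rewrite string as $1
rather than with a backslash.
.sp
.nf
```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Replace only the first match, as Replace does.
std::string replace_first_sketch(const std::string& s,
                                 const std::string& pattern,
                                 const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replace every match, as GlobalReplace does.
std::string replace_all_sketch(const std::string& s,
                               const std::string& pattern,
                               const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}

// Count non-overlapping matches, i.e. how many replacements would be made.
std::ptrdiff_t count_matches_sketch(const std::string& s,
                                    const std::string& pattern) {
  const std::regex re(pattern);
  return std::distance(std::sregex_iterator(s.begin(), s.end(), re),
                       std::sregex_iterator());
}
```
.fi
.P
With the string "yabba dabba doo", the pattern "b+", and the rewrite "d", the
first function gives "yada dabba doo", the second gives "yada dada doo", and
the count is 2.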
.SH AUTHOR
.rs
.sp
.nf
The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 January 2012
.fi