333 lines
11 KiB
HTML
333 lines
11 KiB
HTML
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<HTML><HEAD><TITLE>Man page of REGEX</TITLE>
|
|
</HEAD><BODY>
|
|
<H1>REGEX</H1>
|
|
Section: Linux Programmer's Manual (7)<BR>Updated: 2009-01-12<BR><A HREF="#index">Index</A>
|
|
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
|
|
|
|
<A NAME="lbAB"> </A>
|
|
<H2>NAME</H2>
|
|
|
|
regex - POSIX.2 regular expressions
|
|
<A NAME="lbAC"> </A>
|
|
<H2>DESCRIPTION</H2>
|
|
|
|
Regular expressions ("RE"s),
|
|
as defined in POSIX.2, come in two forms:
|
|
modern REs (roughly those of
|
|
<I>egrep</I>;
|
|
|
|
POSIX.2 calls these "extended" REs)
|
|
and obsolete REs (roughly those of
|
|
<B><A HREF="/cgi-bin/man/man2html?1+ed">ed</A></B>(1);
|
|
|
|
POSIX.2 "basic" REs).
|
|
Obsolete REs mostly exist for backward compatibility in some old programs;
|
|
they will be discussed at the end.
|
|
POSIX.2 leaves some aspects of RE syntax and semantics open;
|
|
"(!)" marks decisions on these aspects that
|
|
may not be fully portable to other POSIX.2 implementations.
|
|
<P>
|
|
|
|
A (modern) RE is one(!) or more nonempty(!) <I>branches</I>,
|
|
separated by '|'.
|
|
It matches anything that matches one of the branches.
|
|
<P>
|
|
|
|
A branch is one(!) or more <I>pieces</I>, concatenated.
|
|
It matches a match for the first, followed by a match for the second,
|
|
and so on.
|
|
<P>
|
|
|
|
A piece is an <I>atom</I> possibly followed
|
|
by a single(!) '*', '+', '?', or <I>bound</I>.
|
|
An atom followed by '*'
|
|
matches a sequence of 0 or more matches of the atom.
|
|
An atom followed by '+'
|
|
matches a sequence of 1 or more matches of the atom.
|
|
An atom followed by '?'
|
|
matches a sequence of 0 or 1 matches of the atom.
|
|
<P>
|
|
|
|
A <I>bound</I> is '{' followed by an unsigned decimal integer,
|
|
possibly followed by ','
|
|
possibly followed by another unsigned decimal integer,
|
|
always followed by '}'.
|
|
The integers must lie between 0 and
|
|
<B>RE_DUP_MAX</B>
|
|
|
|
(255(!)) inclusive,
|
|
and if there are two of them, the first may not exceed the second.
|
|
An atom followed by a bound containing one integer <I>i</I>
|
|
and no comma matches
|
|
a sequence of exactly <I>i</I> matches of the atom.
|
|
An atom followed by a bound
|
|
containing one integer <I>i</I> and a comma matches
|
|
a sequence of <I>i</I> or more matches of the atom.
|
|
An atom followed by a bound
|
|
containing two integers <I>i</I> and <I>j</I> matches
|
|
a sequence of <I>i</I> through <I>j</I> (inclusive) matches of the atom.
|
|
<P>
|
|
|
|
An atom is a regular expression enclosed in "<I>()</I>"
|
|
(matching a match for the regular expression),
|
|
an empty set of "<I>()</I>" (matching the null string)(!),
|
|
a <I>bracket expression</I> (see below), '.'
|
|
(matching any single character), '^' (matching the null string at the
|
|
beginning of a line), '$' (matching the null string at the
|
|
end of a line), a '\' followed by one of the characters
|
|
"<I>^.[$()|*+?{\</I>"
|
|
(matching that character taken as an ordinary character),
|
|
a '\' followed by any other character(!)
|
|
(matching that character taken as an ordinary character,
|
|
as if the '\' had not been present(!)),
|
|
or a single character with no other significance (matching that character).
|
|
A '{' followed by a character other than a digit is an ordinary
|
|
character, not the beginning of a bound(!).
|
|
It is illegal to end an RE with '\'.
|
|
<P>
|
|
|
|
A <I>bracket expression</I> is a list of characters enclosed in "<I>[]</I>".
|
|
It normally matches any single character from the list (but see below).
|
|
If the list begins with '^',
|
|
it matches any single character
|
|
(but see below) <I>not</I> from the rest of the list.
|
|
If two characters in the list are separated by '-', this is shorthand
|
|
for the full <I>range</I> of characters between those two (inclusive) in the
|
|
collating sequence,
|
|
for example, "<I>[0-9]</I>" in ASCII matches any decimal digit.
|
|
It is illegal(!) for two ranges to share an
|
|
endpoint, for example, "<I>a-c-e</I>".
|
|
Ranges are very collating-sequence-dependent,
|
|
and portable programs should avoid relying on them.
|
|
<P>
|
|
|
|
To include a literal ']' in the list, make it the first character
|
|
(following a possible '^').
|
|
To include a literal '-', make it the first or last character,
|
|
or the second endpoint of a range.
|
|
To use a literal '-' as the first endpoint of a range,
|
|
enclose it in "<I>[.</I>" and "<I>.]</I>"
|
|
to make it a collating element (see below).
|
|
With the exception of these and some combinations using '[' (see next
|
|
paragraphs), all other special characters, including '\', lose their
|
|
special significance within a bracket expression.
|
|
<P>
|
|
|
|
Within a bracket expression, a collating element (a character,
|
|
a multicharacter sequence that collates as if it were a single character,
|
|
or a collating-sequence name for either)
|
|
enclosed in "<I>[.</I>" and "<I>.]</I>" stands for the
|
|
sequence of characters of that collating element.
|
|
The sequence is a single element of the bracket expression's list.
|
|
A bracket expression containing a multicharacter collating element
|
|
can thus match more than one character,
|
|
for example, if the collating sequence includes a "ch" collating element,
|
|
then the RE "<I>[[.ch.]]*c</I>" matches the first five characters
|
|
of "chchcc".
|
|
<P>
|
|
|
|
Within a bracket expression, a collating element enclosed in "<I>[=</I>" and
|
|
"<I>=]</I>" is an equivalence class, standing for the sequences of characters
|
|
of all collating elements equivalent to that one, including itself.
|
|
(If there are no other equivalent collating elements,
|
|
the treatment is as if the enclosing delimiters
|
|
were "<I>[.</I>" and "<I>.]</I>".)
|
|
For example, if o and are the members of an equivalence class,
|
|
then "<I>[[=o=]]</I>", "<I>[[==]]</I>",
|
|
and "<I>[o]</I>" are all synonymous.
|
|
An equivalence class may not(!) be an endpoint
|
|
of a range.
|
|
<P>
|
|
|
|
Within a bracket expression, the name of a <I>character class</I> enclosed
|
|
in "<I>[:</I>" and "<I>:]</I>" stands for the list
|
|
of all characters belonging to that
|
|
class.
|
|
Standard character class names are:
|
|
<P>
|
|
|
|
<DL COMPACT><DT id="1"><DD>
|
|
<TABLE>
|
|
<TR VALIGN=top><TD>alnum</TD><TD>digit</TD><TD>punct<BR></TD></TR>
|
|
<TR VALIGN=top><TD>alpha</TD><TD>graph</TD><TD>space<BR></TD></TR>
|
|
<TR VALIGN=top><TD>blank</TD><TD>lower</TD><TD>upper<BR></TD></TR>
|
|
<TR VALIGN=top><TD>cntrl</TD><TD>print</TD><TD>xdigit<BR></TD></TR>
|
|
</TABLE>
|
|
|
|
</DL>
|
|
|
|
<P>
|
|
|
|
These stand for the character classes defined in
|
|
<B><A HREF="/cgi-bin/man/man2html?3+wctype">wctype</A></B>(3).
|
|
|
|
A locale may provide others.
|
|
A character class may not be used as an endpoint of a range.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
In the event that an RE could match more than one substring of a given
|
|
string,
|
|
the RE matches the one starting earliest in the string.
|
|
If the RE could match more than one substring starting at that point,
|
|
it matches the longest.
|
|
Subexpressions also match the longest possible substrings, subject to
|
|
the constraint that the whole match be as long as possible,
|
|
with subexpressions starting earlier in the RE taking priority over
|
|
ones starting later.
|
|
Note that higher-level subexpressions thus take priority over
|
|
their lower-level component subexpressions.
|
|
<P>
|
|
|
|
Match lengths are measured in characters, not collating elements.
|
|
A null string is considered longer than no match at all.
|
|
For example,
|
|
"<I>bb*</I>" matches the three middle characters of "abbbc",
|
|
"<I>(wee|week)(knights|nights)</I>"
|
|
matches all ten characters of "weeknights",
|
|
when "<I>(.*).*</I>" is matched against "abc" the parenthesized subexpression
|
|
matches all three characters, and
|
|
when "<I>(a*)*</I>" is matched against "bc"
|
|
both the whole RE and the parenthesized
|
|
subexpression match the null string.
|
|
<P>
|
|
|
|
If case-independent matching is specified,
|
|
the effect is much as if all case distinctions had vanished from the
|
|
alphabet.
|
|
When an alphabetic that exists in multiple cases appears as an
|
|
ordinary character outside a bracket expression, it is effectively
|
|
transformed into a bracket expression containing both cases,
|
|
for example, 'x' becomes "<I>[xX]</I>".
|
|
When it appears inside a bracket expression, all case counterparts
|
|
of it are added to the bracket expression, so that, for example, "<I>[x]</I>"
|
|
becomes "<I>[xX]</I>" and "<I>[^x]</I>" becomes "<I>[^xX]</I>".
|
|
<P>
|
|
|
|
No particular limit is imposed on the length of REs(!).
|
|
Programs intended to be portable should not employ REs longer
|
|
than 256 bytes,
|
|
as an implementation can refuse to accept such REs and remain
|
|
POSIX-compliant.
|
|
<P>
|
|
|
|
Obsolete ("basic") regular expressions differ in several respects.
|
|
'|', '+', and '?' are
|
|
ordinary characters and there is no equivalent
|
|
for their functionality.
|
|
The delimiters for bounds are "<I>\{</I>" and "<I>\}</I>",
|
|
with '{' and '}' by themselves ordinary characters.
|
|
The parentheses for nested subexpressions are "<I>\(</I>" and "<I>\)</I>",
|
|
with '(' and ')' by themselves ordinary characters.
|
|
'^' is an ordinary character except at the beginning of the
|
|
RE or(!) the beginning of a parenthesized subexpression,
|
|
'$' is an ordinary character except at the end of the
|
|
RE or(!) the end of a parenthesized subexpression,
|
|
and '*' is an ordinary character if it appears at the beginning of the
|
|
RE or the beginning of a parenthesized subexpression
|
|
(after a possible leading '^').
|
|
<P>
|
|
|
|
Finally, there is one new type of atom, a <I>back reference</I>:
|
|
'\' followed by a nonzero decimal digit <I>d</I>
|
|
matches the same sequence of characters
|
|
matched by the <I>d</I>th parenthesized subexpression
|
|
(numbering subexpressions by the positions of their opening parentheses,
|
|
left to right),
|
|
so that, for example, "<I>\([bc]\)\1</I>" matches "bb" or "cc" but not "bc".
|
|
<A NAME="lbAD"> </A>
|
|
<H2>BUGS</H2>
|
|
|
|
Having two kinds of REs is a botch.
|
|
<P>
|
|
|
|
The current POSIX.2 spec says that ')' is an ordinary character in
|
|
the absence of an unmatched '(';
|
|
this was an unintentional result of a wording error,
|
|
and change is likely.
|
|
Avoid relying on it.
|
|
<P>
|
|
|
|
Back references are a dreadful botch,
|
|
posing major problems for efficient implementations.
|
|
They are also somewhat vaguely defined
|
|
(does
|
|
"<I>a\(\(b\)*\2\)*d</I>" match "abbbd"?).
|
|
Avoid using them.
|
|
<P>
|
|
|
|
POSIX.2's specification of case-independent matching is vague.
|
|
The "one case implies all cases" definition given above
|
|
is current consensus among implementors as to the right interpretation.
|
|
|
|
|
|
|
|
|
|
<A NAME="lbAE"> </A>
|
|
<H2>AUTHOR</H2>
|
|
|
|
|
|
|
|
This page was taken from Henry Spencer's regex package.
|
|
<A NAME="lbAF"> </A>
|
|
<H2>SEE ALSO</H2>
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?1+grep">grep</A></B>(1),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?3+regex">regex</A></B>(3)
|
|
|
|
<P>
|
|
|
|
POSIX.2, section 2.8 (Regular Expression Notation).
|
|
<A NAME="lbAG"> </A>
|
|
<H2>COLOPHON</H2>
|
|
|
|
This page is part of release 5.05 of the Linux
|
|
<I>man-pages</I>
|
|
|
|
project.
|
|
A description of the project,
|
|
information about reporting bugs,
|
|
and the latest version of this page,
|
|
can be found at
|
|
<A HREF="https://www.kernel.org/doc/man-pages/.">https://www.kernel.org/doc/man-pages/.</A>
|
|
<P>
|
|
|
|
<HR>
|
|
<A NAME="index"> </A><H2>Index</H2>
|
|
<DL>
|
|
<DT id="2"><A HREF="#lbAB">NAME</A><DD>
|
|
<DT id="3"><A HREF="#lbAC">DESCRIPTION</A><DD>
|
|
<DT id="4"><A HREF="#lbAD">BUGS</A><DD>
|
|
<DT id="5"><A HREF="#lbAE">AUTHOR</A><DD>
|
|
<DT id="6"><A HREF="#lbAF">SEE ALSO</A><DD>
|
|
<DT id="7"><A HREF="#lbAG">COLOPHON</A><DD>
|
|
</DL>
|
|
<HR>
|
|
This document was created by
|
|
<A HREF="/cgi-bin/man/man2html">man2html</A>,
|
|
using the manual pages.<BR>
|
|
Time: 00:06:09 GMT, March 31, 2021
|
|
</BODY>
|
|
</HTML>
|