qt6windows7/doc/global/includes/corelib/port-from-qregexp.qdocinc
2023-10-29 23:33:08 +01:00

176 lines
7.7 KiB
Plaintext

// Copyright (C) 2022 Giuseppe D'Angelo <dangelog@gmail.com>.
// Copyright (C) 2022 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Giuseppe D'Angelo <giuseppe.dangelo@kdab.com>
// Copyright (C) 2022 The Qt Company Ltd.
// SPDX-License-Identifier: LicenseRef-Qt-Commercial OR GFDL-1.3-no-invariants-only
//! [porting-to-qregularexpression]
The QRegularExpression class introduced in Qt 5 implements Perl-compatible
regular expressions and is a big improvement upon QRegExp in terms of APIs
offered, supported pattern syntax, and speed of execution. The biggest
difference is that QRegularExpression simply holds a regular expression,
and it's \e{not} modified when a match is requested. Instead, a
QRegularExpressionMatch object is returned, to check the result of a match
and extract the captured substring. The same applies to global matching and
QRegularExpressionMatchIterator.
Other differences are outlined below.
\note QRegularExpression does not support all the features available in
Perl-compatible regular expressions. The most notable one is the fact that
duplicated names for capturing groups are not supported, and using them can
lead to undefined behavior. This may change in a future version of Qt.
\section3 Different pattern syntax
Porting a regular expression from QRegExp to QRegularExpression may require
changes to the pattern itself.
In specific scenarios, QRegExp was too lenient and accepted patterns that
are simply invalid when using QRegularExpression. These are easy to detect,
because the QRegularExpression objects built with these patterns are not
valid (see QRegularExpression::isValid()).
In other cases, a pattern ported from QRegExp to QRegularExpression may
silently change semantics. Therefore, it is necessary to review the
patterns used. The most notable cases of silent incompatibility are:
\list
\li Curly braces are needed to use a hexadecimal escape like \c{\xHHHH}
with more than 2 digits. A pattern like \c{\x2022} needs to be ported
to \c{\x{2022}}, or it will match a space (\c{0x20}) followed by the
string \c{"22"}. In general, it is highly recommended to always use
curly braces with the \c{\x} escape, no matter the number of digits
specified.
\li A 0-to-n quantification like \c{{,n}} needs to be ported to \c{{0,n}}
to preserve semantics. Otherwise, a pattern such as \c{\d{,3}} would
match a digit followed by the exact string \c{"{,3}"}.
\li QRegExp by default does Unicode-aware matching, while
QRegularExpression requires a separate option; see below for more
details.
\li c{.} in QRegExp does by default match all characters, including the
newline character. QRegularExpression excludes the newline character
by default. To include the newline character, set the
QRegularExpression::DotMatchesEverythingOption pattern option.
\endlist
For an overview of the regular expression syntax supported by
QRegularExpression, please refer to the
\l{https://pcre.org/original/doc/html/pcrepattern.html}{pcrepattern(3)}
man page, describing the pattern syntax supported by PCRE (the reference
implementation of Perl-compatible regular expressions).
\section3 Porting from QRegExp::exactMatch()
QRegExp::exactMatch() served two purposes: it exactly matched a regular
expression against a subject string, and it implemented partial matching.
\section4 Porting from QRegExp's Exact Matching
Exact matching indicates whether the regular expression matches the entire
subject string. For example, the classes yield on the subject string \c{"abc123"}:
\table
\header \li \li QRegExp::exactMatch() \li QRegularExpressionMatch::hasMatch()
\row \li \c{"\\d+"} \li \b false \li \b true
\row \li \c{"[a-z]+\\d+"} \li \b true \li \b true
\endtable
Exact matching is not reflected in QRegularExpression. If you want
to be sure that the subject string matches the regular expression
exactly, you can wrap the pattern using the QRegularExpression::anchoredPattern()
function:
\snippet code/doc_src_port_from_qregexp.cpp 0
\section4 Porting from QRegExp's Partial Matching
When using QRegExp::exactMatch(), if an exact match was not found, one
could still find out how much of the subject string was matched by the
regular expression by calling QRegExp::matchedLength(). If the returned length
was equal to the subject string's length, then one could conclude that a partial
match was found.
QRegularExpression supports partial matching explicitly by means of the
appropriate QRegularExpression::MatchType.
\section3 Global matching
Due to limitations of the QRegExp API, it was impossible to implement global
matching correctly (that is, like Perl does). In particular, patterns that
can match 0 characters (like \c{"a*"}) are problematic.
QRegularExpression::globalMatch() implements Perl global match correctly, and
the returned iterator can be used to examine each result.
For example, if you have code like:
\snippet code/doc_src_port_from_qregexp.cpp 1
You can rewrite it as:
\snippet code/doc_src_port_from_qregexp.cpp 2
\section3 Unicode properties support
When using QRegExp, character classes such as \c{\w}, \c{\d}, etc. match
characters with the corresponding Unicode property: for instance, \c{\d}
matches any character with the Unicode \c{Nd} (decimal digit) property.
Those character classes only match ASCII characters by default when using
QRegularExpression: for instance, \c{\d} matches exactly a character in the
\c{0-9} ASCII range. It is possible to change this behavior by using the
QRegularExpression::UseUnicodePropertiesOption pattern option.
\section3 Wildcard matching
There is no direct way to do wildcard matching in QRegularExpression.
However, the QRegularExpression::wildcardToRegularExpression() method
is provided to translate glob patterns into a Perl-compatible regular
expression that can be used for that purpose.
For example, if you have code like:
\snippet code/doc_src_port_from_qregexp.cpp 3
You can rewrite it as:
\snippet code/doc_src_port_from_qregexp.cpp 4
Please note though that some shell-like wildcard patterns might not be
translated to what you expect. The following example code will silently
break if simply converted using the above-mentioned function:
\snippet code/doc_src_port_from_qregexp.cpp 5
This is because, by default, the regular expression returned by
QRegularExpression::wildcardToRegularExpression() is fully anchored.
To get a regular expression that is not anchored, pass
QRegularExpression::UnanchoredWildcardConversion as the conversion
options:
\snippet code/doc_src_port_from_qregexp.cpp 6
\section3 Minimal matching
QRegExp::setMinimal() implemented minimal matching by simply reversing the
greediness of the quantifiers (QRegExp did not support lazy quantifiers,
like \c{*?}, \c{+?}, etc.). QRegularExpression instead does support greedy,
lazy, and possessive quantifiers. The QRegularExpression::InvertedGreedinessOption
pattern option can be useful to emulate the effects of QRegExp::setMinimal():
if enabled, it inverts the greediness of quantifiers (greedy ones become
lazy and vice versa).
\section3 Caret modes
The QRegularExpression::AnchorAtOffsetMatchOption match option can be used to
emulate the QRegExp::CaretAtOffset behavior. There is no equivalent for the
other QRegExp::CaretMode modes.
//! [porting-to-qregularexpression]