// Copyright (C) 2022 Giuseppe D'Angelo <dangelog@gmail.com>.
// Copyright (C) 2022 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Giuseppe D'Angelo <giuseppe.dangelo@kdab.com>
// Copyright (C) 2022 The Qt Company Ltd.
// SPDX-License-Identifier: LicenseRef-Qt-Commercial OR GFDL-1.3-no-invariants-only

//! [porting-to-qregularexpression]

    The QRegularExpression class introduced in Qt 5 implements Perl-compatible
    regular expressions and is a big improvement upon QRegExp in terms of APIs
    offered, supported pattern syntax, and speed of execution. The biggest
    difference is that QRegularExpression simply holds a regular expression,
    and it's \e{not} modified when a match is requested. Instead, a
    QRegularExpressionMatch object is returned, to check the result of a match
    and extract the captured substring. The same applies to global matching and
    QRegularExpressionMatchIterator.

    Other differences are outlined below.

    \note QRegularExpression does not support all the features available in
    Perl-compatible regular expressions. The most notable one is the fact that
    duplicated names for capturing groups are not supported, and using them can
    lead to undefined behavior. This may change in a future version of Qt.

    \section3 Different pattern syntax

    Porting a regular expression from QRegExp to QRegularExpression may require
    changes to the pattern itself.

    In specific scenarios, QRegExp was too lenient and accepted patterns that
    are simply invalid when using QRegularExpression. These are easy to detect,
    because the QRegularExpression objects built with these patterns are not
    valid (see QRegularExpression::isValid()).

    In other cases, a pattern ported from QRegExp to QRegularExpression may
    silently change semantics. Therefore, it is necessary to review the
    patterns used. The most notable cases of silent incompatibility are:

    \list

    \li Curly braces are needed to use a hexadecimal escape like \c{\xHHHH}
    with more than 2 digits.
    A pattern like \c{\x2022} needs to be ported to \c{\x{2022}}, or it will
    match a space (\c{0x20}) followed by the string \c{"22"}. In general, it is
    highly recommended to always use curly braces with the \c{\x} escape, no
    matter the number of digits specified.

    \li A 0-to-n quantification like \c{{,n}} needs to be ported to \c{{0,n}}
    to preserve semantics. Otherwise, a pattern such as \c{\d{,3}} would match
    a digit followed by the exact string \c{"{,3}"}.

    \li QRegExp by default does Unicode-aware matching, while
    QRegularExpression requires a separate option; see below for more details.

    \li \c{.} in QRegExp matches all characters by default, including the
    newline character. QRegularExpression excludes the newline character by
    default. To include the newline character, set the
    QRegularExpression::DotMatchesEverythingOption pattern option.

    \endlist

    For an overview of the regular expression syntax supported by
    QRegularExpression, please refer to the
    \l{https://pcre.org/original/doc/html/pcrepattern.html}{pcrepattern(3)}
    man page, describing the pattern syntax supported by PCRE (the reference
    implementation of Perl-compatible regular expressions).

    \section3 Porting from QRegExp::exactMatch()

    QRegExp::exactMatch() served two purposes: it exactly matched a regular
    expression against a subject string, and it implemented partial matching.

    \section4 Porting from QRegExp's Exact Matching

    Exact matching indicates whether the regular expression matches the entire
    subject string. For example, the two classes yield the following results
    on the subject string \c{"abc123"}:

    \table
    \header \li \li QRegExp::exactMatch() \li QRegularExpressionMatch::hasMatch()
    \row \li \c{"\\d+"} \li \b false \li \b true
    \row \li \c{"[a-z]+\\d+"} \li \b true \li \b true
    \endtable

    Exact matching is not reflected in QRegularExpression.
    If you want to be sure that the subject string matches the regular
    expression exactly, you can wrap the pattern using the
    QRegularExpression::anchoredPattern() function:

    \snippet code/doc_src_port_from_qregexp.cpp 0

    \section4 Porting from QRegExp's Partial Matching

    When using QRegExp::exactMatch(), if an exact match was not found, one
    could still find out how much of the subject string was matched by the
    regular expression by calling QRegExp::matchedLength(). If the returned
    length was equal to the subject string's length, then one could conclude
    that a partial match was found.

    QRegularExpression supports partial matching explicitly by means of the
    appropriate QRegularExpression::MatchType.

    \section3 Global matching

    Due to limitations of the QRegExp API, it was impossible to implement
    global matching correctly (that is, like Perl does). In particular,
    patterns that can match 0 characters (like \c{"a*"}) are problematic.

    QRegularExpression::globalMatch() implements Perl global match correctly,
    and the returned iterator can be used to examine each result.

    For example, if you have code like:

    \snippet code/doc_src_port_from_qregexp.cpp 1

    You can rewrite it as:

    \snippet code/doc_src_port_from_qregexp.cpp 2

    \section3 Unicode properties support

    When using QRegExp, character classes such as \c{\w}, \c{\d}, etc. match
    characters with the corresponding Unicode property: for instance, \c{\d}
    matches any character with the Unicode \c{Nd} (decimal digit) property.

    Those character classes only match ASCII characters by default when using
    QRegularExpression: for instance, \c{\d} matches exactly a character in the
    \c{0-9} ASCII range. It is possible to change this behavior by using the
    QRegularExpression::UseUnicodePropertiesOption pattern option.

    \section3 Wildcard matching

    There is no direct way to do wildcard matching in QRegularExpression.
    However, the QRegularExpression::wildcardToRegularExpression() method is
    provided to translate glob patterns into a Perl-compatible regular
    expression that can be used for that purpose.

    For example, if you have code like:

    \snippet code/doc_src_port_from_qregexp.cpp 3

    You can rewrite it as:

    \snippet code/doc_src_port_from_qregexp.cpp 4

    Note, however, that some shell-like wildcard patterns might not be
    translated to what you expect. The following example code will silently
    break if simply converted using the above-mentioned function:

    \snippet code/doc_src_port_from_qregexp.cpp 5

    This is because, by default, the regular expression returned by
    QRegularExpression::wildcardToRegularExpression() is fully anchored. To get
    a regular expression that is not anchored, pass
    QRegularExpression::UnanchoredWildcardConversion as the conversion options:

    \snippet code/doc_src_port_from_qregexp.cpp 6

    \section3 Minimal matching

    QRegExp::setMinimal() implemented minimal matching by simply reversing the
    greediness of the quantifiers (QRegExp did not support lazy quantifiers,
    like \c{*?}, \c{+?}, etc.). QRegularExpression instead does support greedy,
    lazy, and possessive quantifiers. The
    QRegularExpression::InvertedGreedinessOption pattern option can be useful
    to emulate the effects of QRegExp::setMinimal(): if enabled, it inverts the
    greediness of quantifiers (greedy ones become lazy and vice versa).

    \section3 Caret modes

    The QRegularExpression::AnchorAtOffsetMatchOption match option can be used
    to emulate the QRegExp::CaretAtOffset behavior. There is no equivalent for
    the other QRegExp::CaretMode modes.

//! [porting-to-qregularexpression]