Most of the regular expression engines use backtracking
to try all possible execution paths of the regular expression when evaluating
an input, in some cases it can cause performance issues, called catastrophic backtracking
situations. In the worst case, the complexity
of the regular expression is exponential in the size of the input, this means that a small carefully-crafted input (like 20 chars) can trigger
catastrophic backtracking
and cause a denial of service of the application. Super-linear regex complexity can lead to the same impact too
with, in this case, a large carefully-crafted input (thousands chars).
This rule determines the runtime complexity of a regular expression and informs you of the complexity if it is not linear.
Note that, due to improvements to the matching algorithm, some cases of exponential runtime complexity have become impossible when run using JDK 9 or later. In such cases, an issue will only be reported if the project’s target Java version is 8 or earlier.
There is a risk if you answered yes to any of those questions.
To avoid catastrophic backtracking
situations, make sure that none of the following conditions apply to your regular expression.
In all of the following cases, catastrophic backtracking can only happen if the problematic part of the regex is followed by a pattern that can
fail, causing the backtracking to actually happen. Note that when performing a full match (e.g. using String.matches
), the end of the
regex counts as a pattern that can fail because it will only succeed when the end of the string is reached.
r*
or r*?
, such that the regex r
could produce different
possible matches (of possibly different lengths) on the same input, the worst case matching time can be exponential. This can be the case if
r
contains optional parts, alternations or additional repetitions (but not if the repetition is written in such a way that there’s only
one way to match it).
a*b*
is not a problem because a*
and b*
match different
things and a*_a*
is not a problem because the repetitions are separated by a '_'
and can’t match that '_'
.
However, a*a*
and .*_.*
have quadratic runtime. Matcher.find
, String.split
, String.replaceAll
etc.) and the regex is not anchored to the beginning of the string, quadratic runtime is especially hard to avoid because whenever a match fails,
the regex engine will try again starting at the next index. This means that any unbounded repetition (even a possessive one), if it’s followed by a
pattern that can fail, can cause quadratic runtime on some inputs. For example str.split("\\s*,")
will run in quadratic time on strings
that consist entirely of spaces (or at least contain large sequences of spaces, not followed by a comma). In order to rewrite your regular expression without these patterns, consider the following strategies:
{1,5}
instead of +
for instance. nested quantifiers
to limit the number of way the inner group can be matched by the outer quantifier, for instance this
nested quantifier situation (ba+)+
doesn’t cause performance issues, indeed, the inner group can be matched only if there exists
exactly one b
char per repetition of the group. possessive quantifiers
and atomic grouping
. .
to exclude separators where applicable. For example the quadratic regex
.*_.*
can be made linear by changing it to [^_]*_.*
Sometimes it’s not possible to rewrite the regex to be linear while still matching what you want it to match. Especially when using partial matches, for which it is quite hard to avoid quadratic runtimes. In those cases consider the following approaches:
str.split("\\s*,\\s*")
with str.split(",")
and
then trimming the spaces from the strings as a second step. Matcher.find()
, it is often possible to make the regex infallible by making all the parts that could fail optional,
which will prevent backtracking. Of course this means that you’ll accept more strings than intended, but this can be handled by using capturing
groups to check whether the optional parts were matched or not and then ignoring the match if they weren’t. For example the regex x*y
could be replaced with x*(y)?
and then the call to matcher.find()
could be replaced with matcher.find() &&
matcher.group(1) != null
. The first regex evaluation will never end in JDK
<= 9 and the second regex evaluation will never end in any versions of the
JDK
:
java.util.regex.Pattern.compile("(a+)+").matcher( "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"+ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"+ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"+ "aaaaaaaaaaaaaaa!").matches(); // Sensitive java.util.regex.Pattern.compile("(h|h|ih(((i|a|c|c|a|i|i|j|b|a|i|b|a|a|j))+h)ahbfhba|c|i)*").matcher( "hchcchicihcchciiicichhcichcihcchiihichiciiiihhcchi"+ "cchhcihchcihiihciichhccciccichcichiihcchcihhicchcciicchcccihiiihhihihihi"+ "chicihhcciccchihhhcchichchciihiicihciihcccciciccicciiiiiiiiicihhhiiiihchccch"+ "chhhhiiihchihcccchhhiiiiiiiicicichicihcciciihichhhhchihciiihhiccccccciciihh"+ "ichiccchhicchicihihccichicciihcichccihhiciccccccccichhhhihihhcchchihih"+ "iihhihihihicichihiiiihhhhihhhchhichiicihhiiiiihchccccchichci").matches(); // Sensitive
Possessive quantifiers do not keep backtracking positions, thus can be used, if possible, to avoid performance issues:
java.util.regex.Pattern.compile("(a+)++").matcher( "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"+ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"+ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"+ "aaaaaaaaaaaaaaa!").matches(); // Compliant java.util.regex.Pattern.compile("(h|h|ih(((i|a|c|c|a|i|i|j|b|a|i|b|a|a|j))+h)ahbfhba|c|i)*+").matcher( "hchcchicihcchciiicichhcichcihcchiihichiciiiihhcchi"+ "cchhcihchcihiihciichhccciccichcichiihcchcihhicchcciicchcccihiiihhihihihi"+ "chicihhcciccchihhhcchichchciihiicihciihcccciciccicciiiiiiiiicihhhiiiihchccch"+ "chhhhiiihchihcccchhhiiiiiiiicicichicihcciciihichhhhchihciiihhiccccccciciihh"+ "ichiccchhicchicihihccichicciihcichccihhiciccccccccichhhhihihhcchchihih"+ "iihhihihihicichihiiiihhhhihhhchhichiicihhiiiiihchccccchichci").matches(); // Compliant