且构网

Why does this regular expression kill the Java regex engine?

Updated: 2022-05-20 22:39:02

The Java regex engine crashes because this part of your regex causes a stack overflow (really!):

[\s]|[^<]

What happens here is that every character matched by \s can also be matched by [^<]. That means there are two ways to match each whitespace character. If we represent the two character classes with A and B:

A|B

Then, with the alternation applied once per character, a string of three spaces can be matched as AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB. In other words, the complexity of this part of the regex is 2^N, where N is the number of whitespace characters. That will kill any regex engine that has no safeguards against what I call catastrophic backtracking.
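To make the 2^N figure concrete, here is a minimal sketch that counts how many A/B labelings exist for a given input. It is not how the Java engine works internally; it only enumerates the choices a backtracking matcher could be forced to try. The class and method names (AlternationPaths, countPaths, spaces) are made up for illustration.

```java
import java.util.regex.Pattern;

final class AlternationPaths {
    // The two overlapping character classes from the answer: A = [\s], B = [^<].
    static final Pattern A = Pattern.compile("[\\s]");
    static final Pattern B = Pattern.compile("[^<]");

    // Counts the ways to label every character from pos onward as "matched by A"
    // or "matched by B". Returns 0 if some character fits neither class.
    static long countPaths(String input, int pos) {
        if (pos == input.length()) {
            return 1; // reached the end: one complete labeling
        }
        String ch = String.valueOf(input.charAt(pos));
        long paths = 0;
        if (A.matcher(ch).matches()) {
            paths += countPaths(input, pos + 1);
        }
        if (B.matcher(ch).matches()) {
            paths += countPaths(input, pos + 1);
        }
        return paths;
    }

    // Helper (illustrative): builds a string of n spaces.
    static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each extra space doubles the count: 2, 4, 8, 16, 32.
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " spaces: " + countPaths(spaces(n), 0) + " paths");
        }
    }
}
```

Non-whitespace characters other than < contribute only one choice each, so only the whitespace count drives the doubling.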

When you use alternation (the vertical bar) in a regex, always make sure the alternatives are mutually exclusive. That is, at most one of the alternatives should be able to match any given bit of text.
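As a sketch of what mutually exclusive alternatives could look like for this fragment: the rest of the original regex is not shown here, so the repeated non-capturing group around the alternation is an assumption, and the class name ExclusiveAlternation and method allAgree are made up for illustration. One option removes whitespace from the second alternative. Another notes that [^<] already covers whitespace, so a single character class is enough.

```java
import java.util.regex.Pattern;

final class ExclusiveAlternation {
    // Overlapping form from the answer, wrapped in an assumed repeated group.
    static final Pattern OVERLAPPING = Pattern.compile("(?:[\\s]|[^<])*");

    // Mutually exclusive: the second alternative no longer accepts whitespace.
    static final Pattern EXCLUSIVE = Pattern.compile("(?:\\s|[^<\\s])*");

    // Simplest form: [^<] already matches whitespace, so no alternation is needed.
    static final Pattern SINGLE_CLASS = Pattern.compile("[^<]*");

    // True if all three patterns give the same full-match verdict for the input.
    // Only call this with short inputs: OVERLAPPING can backtrack exponentially.
    static boolean allAgree(String input) {
        boolean expected = OVERLAPPING.matcher(input).matches();
        return EXCLUSIVE.matcher(input).matches() == expected
                && SINGLE_CLASS.matcher(input).matches() == expected;
    }

    public static void main(String[] args) {
        String[] samples = {"", "a b  c", "\t\n ", "a <b", "<"};
        for (String s : samples) {
            System.out.println(allAgree(s));
        }
    }
}
```

The rewritten patterns accept the same strings as the overlapping one, but each character can be consumed in only one way, so there is nothing extra to backtrack into.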