The R regex compiler works differently for a given regex

Updated: 2023-11-10 16:16:34

It looks like the TRE regex engine (used by default in base R regex functions), which is based on the regex library originally written by Henry Spencer in 1986, finds the shortest match at the end of the string when the first quantified pattern in the regular expression uses a lazy quantifier and the regex ends with the $ anchor.

Compare these cases:

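# Data is not shown in the text; this value is an assumption reconstructed from the outputs below
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")
sub(" +?on.*$", "", Data)  # "Posted by ondrej" "Posted by ona'je"
sub(" +?on.*", "", Data)   # "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29feb 2020"
sub(" +?on(.*)", "", Data) # as expected
sub(" +on.*", "", Data)    # as expected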

What is going on here?

The first case is sub(" +?on.*$", "", Data). Here the first quantifier sets the greediness of all the quantifiers on the same level of the regex. So the second quantifier, *, is treated as lazy even without a ? after it, because the first space was quantified with +?, a lazy quantifier. This is a known TRE "bug", also present in some other regex engines based on Henry Spencer's regex library.
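To see what the first pattern actually consumes, a minimal sketch can extract the matched substring with regmatches() and regexpr(). The Data vector is an assumption, reconstructed from the outputs quoted above, since the original input is not shown.

```r
# Assumed input, reconstructed from the outputs quoted in the text (not taken from the original post)
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")

# Default engine (TRE): lazy +? followed by .* and the $ anchor
first_case <- sub(" +?on.*$", "", Data)
print(first_case)

# The substring that sub() removed, i.e. the first match in each element
matched <- regmatches(Data, regexpr(" +?on.*$", Data))
print(matched)
```

Given the results reported above, matched should hold the trailing " on 29 Feb 2020." and " on 29feb 2020", not the longer stretch that starts at the first " on".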

The second case, sub(" +?on.*", "", Data), matches as if it were written " +?on.*?" (again, because the first quantifier sets the greediness to lazy on that level). It therefore matches only one or more spaces followed by on, because a lazy .*? at the end of a pattern matches nothing.
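The claimed equivalence can be checked directly by running both spellings side by side. This is a small sketch under the same assumption about Data (reconstructed from the quoted outputs); the variable names are illustrative.

```r
# Assumed input, reconstructed from the outputs quoted in the text
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")

# .* written without ?, but sharing a level with the lazy +?
implicit_lazy <- sub(" +?on.*", "", Data)

# .*? written explicitly as lazy
explicit_lazy <- sub(" +?on.*?", "", Data)

print(implicit_lazy)
print(explicit_lazy)

# TRUE when both spellings remove the same text
print(identical(implicit_lazy, explicit_lazy))
```

If the explanation holds, both calls remove only the first " on" and leave the rest of each string in place.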

The third case, sub(" +?on(.*)", "", Data), yields the expected result because the second quantified pattern, .*, sits on a different level (one level deeper, inside the group), so its greediness is not affected by the +? on the outer level. So, (.*) matches greedily here.
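A short sketch makes the effect of the extra nesting level visible by comparing the ungrouped and grouped patterns. Data is again an assumed input reconstructed from the quoted outputs, and "expected" is taken here to mean that everything from the first " on" onward is removed.

```r
# Assumed input, reconstructed from the outputs quoted in the text
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")

# .* on the same level as the lazy +?: it inherits laziness under TRE
ungrouped <- sub(" +?on.*", "", Data)

# .* moved one level deeper by the parentheses: it stays greedy
grouped <- sub(" +?on(.*)", "", Data)

print(ungrouped)
print(grouped)
```

Wrapping the second quantifier in a group is therefore one possible way to keep it greedy without switching regex engines.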

The fourth case, sub(" +on.*", "", Data), yields the expected result because the first quantifier is greedy, so the next quantified pattern is greedy too.
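As a cross-check on the four cases, the same patterns can be run with perl = TRUE, which makes sub() use PCRE instead of TRE. In PCRE each quantifier keeps its own greediness, so all four patterns should agree. This is a sketch under the assumption that Data has the value reconstructed from the quoted outputs; it is a comparison for illustration, not something the text above prescribes.

```r
# Assumed input, reconstructed from the outputs quoted in the text
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")

# The four patterns discussed above
patterns <- c(" +?on.*$", " +?on.*", " +?on(.*)", " +on.*")

# perl = TRUE switches sub() from the default TRE engine to PCRE
pcre_results <- lapply(patterns, function(p) sub(p, "", Data, perl = TRUE))
names(pcre_results) <- patterns
print(pcre_results)

# Fourth case under the default TRE engine: both quantifiers are greedy
greedy_tre <- sub(" +on.*", "", Data)
print(greedy_tre)
```

Under PCRE every pattern strips everything from the first " on" onward, which matches what the fourth case already does under TRE.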