Do Python regular expressions have an equivalent to Ruby's atomic grouping?

Updated: 2023-02-17 23:13:36

Before version 3.11, Python's re module does not directly support atomic grouping, but you can emulate it with a zero-width lookahead assertion ((?=RE)), which matches from the current point with the semantics you want. Put a named group ((?P<name>RE)) inside the lookahead, then use a named backreference ((?P=name)) to consume exactly what the assertion matched. This works because once a lookahead has succeeded, the engine does not backtrack into it, so the captured text is fixed. Together these give you the same semantics, at the cost of an extra capture group and a lot of syntax. (Python 3.11 and later accept (?>...) directly.)

For example, the link you provided gives the Ruby example of

/"(?>.*)"/.match('"Quote"') #=> nil

We can emulate that in Python like this:

re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"') # => None
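To make the difference visible, here is a minimal sketch that runs the same input through an ordinary greedy group and through the lookahead/backreference emulation. The variable names are only illustrative.

```python
import re

text = '"Quote"'

# Plain greedy group: .* first consumes Quote", then gives back one
# character so the closing quote can match.
plain = re.search(r'"(.*)"', text)

# Emulated atomic group: the lookahead captures Quote" as a whole, and the
# backreference must re-match exactly that text, so nothing is left for the
# closing quote and no character can be given back.
emulated = re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', text)

print(plain.group(1))   # Quote
print(emulated)         # None
```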

We can show that I'm doing something useful and not just spewing line noise, because if we change it so that the inner group doesn't eat the final ", it still matches:

re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"').groupdict()
# => {'tmp': 'Quote'}

You can also use anonymous groups and numeric backreferences, but this gets awfully full of line-noise:

re.search(r'"(?=(.*))\1"', '"Quote"') # => None
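If the syntax gets in the way, the named form can be generated. The sketch below assumes a small helper, atomic(), which is hypothetical and not part of the re module; it simply builds the lookahead-plus-backreference string for a given sub-pattern and scratch group name.

```python
import re

def atomic(sub_pattern, name):
    """Hypothetical helper: emulate (?>sub_pattern) using a lookahead that
    captures into the named group `name`, followed by a backreference."""
    return '(?=(?P<{0}>{1}))(?P={0})'.format(name, sub_pattern)

pattern = '"' + atomic('.*', 'tmp') + '"'
print(pattern)                                 # "(?=(?P<tmp>.*))(?P=tmp)"
print(re.search(pattern, '"Quote"'))           # None

pattern2 = '"' + atomic('[A-Za-z]*', 'tmp') + '"'
print(re.search(pattern2, '"Quote"').group())  # "Quote"
```

Each use within one pattern needs its own group name, since Python rejects a pattern that defines the same named group twice.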

(Full disclosure: I learned this trick from Perl's perlre documentation, which describes it in the section on (?>...).)
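For reference, the (?>...) syntax itself is accepted by Python's re module from version 3.11 on, so the emulation is only needed on older interpreters. A minimal sketch, guarded by a version check so it runs either way:

```python
import re
import sys

if sys.version_info >= (3, 11):
    # Native atomic group, same behavior as the Ruby example.
    result = re.search(r'"(?>.*)"', '"Quote"')
else:
    # Older interpreters reject (?>...) with re.error, so fall back to
    # the lookahead/backreference emulation.
    result = re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"')

print(result)   # None
```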

In addition to having the right semantics, this also has the appropriate performance properties. If we port an example out of perlre:

[nelhage@anarchique:~/tmp]$ cat re.py
import re
import timeit


re_1 = re.compile(r'''\(
                           (
                             [^()]+           # x+
                           |
                             \( [^()]* \)
                           )+
                       \)
                   ''', re.X)
re_2 = re.compile(r'''\(
                           (
                             (?=(?P<tmp>[^()]+ ))(?P=tmp) # Emulate (?> x+)
                           |
                             \( [^()]* \)
                           )+
                       \)''', re.X)

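# Time 10 failing searches with the plain pattern (exponential backtracking).
print(timeit.timeit("re_1.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_1",
                    number=10))

# Time the same searches with the emulated atomic group.
print(timeit.timeit("re_2.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_2",
                    number=10))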

We see a dramatic improvement:

[nelhage@anarchique:~/tmp]$ python re.py
96.0800571442
7.41481781006e-05

The gap only widens as we extend the length of the search string.
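To watch the gap grow without waiting minutes, here is a scaled-down sketch of the same comparison. It writes both patterns compactly (no re.X) and uses shorter inputs than the benchmark above; the names slow and fast and the chosen lengths are assumptions for illustration, and the timings will vary by machine and Python version.

```python
import re
import timeit

# Same two patterns as above, written without verbose mode.
slow = re.compile(r'\(([^()]+|\([^()]*\))+\)')
fast = re.compile(r'\(((?=(?P<tmp>[^()]+))(?P=tmp)|\([^()]*\))+\)')

for n in (12, 15, 18):
    subject = '((()' + 'a' * n   # never matches: no closing paren at the end
    t_slow = timeit.timeit(lambda: slow.search(subject), number=1)
    t_fast = timeit.timeit(lambda: fast.search(subject), number=1)
    print(n, t_slow, t_fast)
```

Both patterns still agree on inputs that do match, such as '(a(b)c)'; only the amount of backtracking on failure differs.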