且构网

Reusing part of a regular expression pattern

Updated: 2023-02-23 13:56:50

Consider this (very simplified) example string:

1aw2,5cx7

As you can see, it is two digit/letter/letter/digit values separated by a comma.

Now, I could match this with the following:

>>> from re import match
>>> match(r"\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>

The problem, though, is that I have to write \d\w\w\d twice. With small patterns this isn't so bad, but with more complex regexes, writing the exact same thing twice makes the final pattern enormous and cumbersome to work with. It also seems redundant.

I tried using a named capture group:

>>> from re import match
>>> match(r"(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>

But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.
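The difference can be checked with the standard re module: a backreference such as (?P=id) has to match the same text again, not the same pattern. A minimal sketch; the second input string is made up for illustration and is not from the question.

```python
import re

# (?P=id) is a backreference: it must match the exact text captured by group "id".
backref = re.compile(r"(?P<id>\d\w\w\d),(?P=id)")

# Different values on each side of the comma: no match.
different = backref.match("1aw2,5cx7")

# The same value repeated (illustrative input): match.
repeated = backref.match("1aw2,1aw2")

print(different)          # None
print(repeated.group(0))  # 1aw2,1aw2
```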

Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used later on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?

No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.

You can always do so by re-using Python variables, of course:

digit_letter_letter_digit = r'\d\w\w\d'

then use string formatting to build the larger pattern:

match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)

or, using Python 3.6+ f-strings:

dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)

I often use this technique to compose larger, more complex patterns from re-usable sub-patterns.
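As a slightly fuller sketch of that composition approach, the sub-pattern can be wrapped in named groups so each half stays addressable. The group names first and second are illustrative choices, not part of the question.

```python
import re

# Reusable sub-pattern: digit, two word characters, digit.
dlld = r"\d\w\w\d"

# Compose the full pattern; named groups keep the two halves addressable.
pair = re.compile(fr"(?P<first>{dlld}),(?P<second>{dlld})")

m = pair.fullmatch("1aw2,5cx7")
print(m.group("first"), m.group("second"))  # 1aw2 5cx7
```

Two things to watch when composing this way: literal braces written in the f-string template itself, such as a {2} quantifier, must be doubled ({{2}}), and a sub-pattern containing alternation is safer wrapped in a non-capturing group (?:...) before it is interpolated.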

If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number (for example (?1)), re-uses the pattern of an already used (implicitly numbered) capturing group:

(\d\w\w\d),(?1)
^........^ ^..^
|           \
|             re-use pattern of capturing group 1  
\
  capturing group 1
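A minimal sketch of the numbered subroutine call in Python, assuming the third-party regex package is installed (pip install regex). The import guard is only there so the snippet does not fail when the package is missing.

```python
# Requires the third-party package: pip install regex
try:
    import regex
except ImportError:  # package not installed; the sketch is skipped
    regex = None

numbered = None
if regex is not None:
    # (?1) re-runs the pattern of group 1; it does not require the same text.
    numbered = regex.match(r"(\d\w\w\d),(?1)", "1aw2,5cx7")
    print(numbered.group(0))  # 1aw2,5cx7
```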

You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern defined by groupname, not the text it matched (the latter two forms are alternatives for compatibility with other engines).
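The named form might look like this in Python, again assuming the third-party regex package is available. Only the (?&groupname) spelling is shown, and the group name dlld is an illustrative choice.

```python
# Requires the third-party package: pip install regex
try:
    import regex
except ImportError:  # package not installed; the sketch is skipped
    regex = None

named = None
if regex is not None:
    # (?<dlld>...) defines and matches the group; (?&dlld) re-runs its pattern.
    named = regex.match(r"(?<dlld>\d\w\w\d),(?&dlld)", "1aw2,5cx7")
    print(named.group(0))  # 1aw2,5cx7
```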

And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:

(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
          |                    \       /          
 creates 'dlld' pattern      uses 'dlld' pattern twice
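The (?(DEFINE)...) pattern above can be tried from Python as follows. This is a sketch that assumes the third-party regex package is installed; the pattern is split across two string literals only to make room for comments.

```python
# Requires the third-party package: pip install regex
try:
    import regex
except ImportError:  # package not installed; the sketch is skipped
    regex = None

defined = None
if regex is not None:
    pattern = regex.compile(
        r"(?(DEFINE)(?<dlld>\d\w\w\d))"  # defines 'dlld' without matching anything
        r"(?&dlld),(?&dlld)"             # calls the 'dlld' sub-pattern twice
    )
    defined = pattern.match("1aw2,5cx7")
    print(defined.group(0))  # 1aw2,5cx7
```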

Just to be explicit: the standard library re module does not support subroutine patterns.