且构网

Using regex to match nested patterns (with recursion in PHP)

Updated: 2023-02-20 09:17:09

You can use the following regex to match what you want (it is written as a string literal for convenience):

'~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~'
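
As a minimal sketch of how the pattern might be applied, the snippet below passes it to preg_match. The sample input and the variable names are invented for illustration and are not part of the original answer.

```php
<?php
// The pattern from the answer, kept in a single-quoted string so that \2 reaches PCRE unchanged.
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical sample input, invented for illustration.
$input = 'before <a=5>text <b>bold <i>nested</i></b> tail</a> after';

if (preg_match($pattern, $input, $m)) {
    // $m[0] holds the whole <a=5>...</a> element, including the nested tags.
    echo $m[0], PHP_EOL; // <a=5>text <b>bold <i>nested</i></b> tail</a>
}
```

Only $m[0] is useful here; groups 1 and 2 exist to drive the recursion and the backreference.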

Here is a breakdown of the regex above:

<a=5>
(
  <([a-zA-Z0-9]+)[^>]*>
  (?1)*
  </\2>
  |
  [^<>]++
)*
</a>
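
The breakdown above can also be kept in the code itself by using the x (extended) modifier, under which PCRE ignores whitespace and #-comments in the pattern. This is a sketch rather than part of the original answer; the comments and the sample input are assumptions added for illustration.

```php
<?php
// Same pattern in extended mode (x flag): whitespace and #-comments are ignored by PCRE.
$verbose = '~
    <a=5>                        # opening tag of the outer element
    (                            # group 1: one piece of content
        <([a-zA-Z0-9]+)[^>]*>    # an opening tag; its name goes into group 2
        (?1)*                    # zero or more pieces of content, via recursion into group 1
        </\2>                    # closing tag with the same name
      |
        [^<>]++                  # or a run of non-tag text (possessive)
    )*
    </a>                         # closing tag of the outer element
~x';

// The compact form from the answer, for comparison.
$compact = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical sample input, invented for illustration.
$input = '<a=5><p class="x">hi</p></a>';

preg_match($verbose, $input, $v);
preg_match($compact, $input, $c);
var_dump($v[0] === $c[0]); // bool(true)
```

Note that the comments must not contain the delimiter ~ or a single quote, since either would end the pattern or the string early.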

The first part, <([a-zA-Z0-9]+)[^>]*>(?1)*</\2>, matches a pair of matching tags together with all of their content. It assumes that a tag name consists only of the characters [a-zA-Z0-9]. The tag name is captured by ([a-zA-Z0-9]+) and backreferenced as \2 when matching the closing tag </\2>.
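
To illustrate the role of the backreference, the sketch below compares an input whose closing tag matches its opening tag with one where it does not. Both inputs are invented for illustration.

```php
<?php
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical inputs, invented for illustration.
$balanced   = '<a=5><b>x</b></a>';
$mismatched = '<a=5><b>x</i></a>'; // <b> is closed by </i>

var_dump(preg_match($pattern, $balanced));   // int(1)
var_dump(preg_match($pattern, $mismatched)); // int(0)
```

In the second input, </\2> requires </b> because group 2 captured "b", so the inner element cannot be closed and the whole match fails.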

The second part, [^<>]++, matches any other content outside the tags. Note that quoted strings are not handled: a > inside a quoted attribute value, for example, is taken as the end of the tag. Depending on your input, the regex may therefore not work.
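
The quoted-string limitation can be seen with a small experiment. The two inputs below are invented for illustration and differ only in whether the attribute value contains a > character.

```php
<?php
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical inputs, invented for illustration.
$plainAttribute  = '<a=5><b title="xy">z</b></a>';
$quotedAngle     = '<a=5><b title="x>y">z</b></a>'; // ">" inside the quoted value

var_dump(preg_match($pattern, $plainAttribute)); // int(1)
var_dump(preg_match($pattern, $quotedAngle));    // int(0)
```

In the second case [^>]* stops at the > inside the quotes, the following > can be consumed by neither alternative, and the match fails.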

Now back to the subroutine call (?1), which recursively calls the first capturing group. Notice that a tag can contain zero or more instances of other tags or non-tag content. Because of the way the regex is written, the outermost <a=5>...</a> pair shares this property.
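
The "zero or more" property of the outermost pair can be checked with a few inputs, from an empty element to mixed content. The inputs and the $results variable are invented for illustration.

```php
<?php
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical inputs, invented for illustration.
$inputs = [
    '<a=5></a>',                 // no content at all
    '<a=5>plain text</a>',       // only non-tag content
    '<a=5><b></b><c>1</c>t</a>', // sibling tags followed by text
];

$results = [];
foreach ($inputs as $input) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[] = preg_match($pattern, $input);
}

var_dump($results === [1, 1, 1]); // bool(true)
```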

Demo on regex101