且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

折叠并捕获单个正则表达式中的重复模式

更新时间:2022-10-18 17:40:14

先阅读本文!

这篇文章是为了展示这种可能性,而不是支持一切正则表达式"解决问题的方法.作者编写了 3-4 个变体,每个变体都有在达到当前解决方案之前难以检测的微妙错误.

对于您的具体示例,还有其他更易于维护的更好解决方案,例如沿分隔符匹配和拆分匹配项.

这篇文章涉及您的具体示例.我真的怀疑一个完整的概括是可能的,但背后的想法可以在类似的情况下重复使用.

总结

  • .NET 支持使用 捕获重复模式CaptureCollection 类.
  • 对于支持G 和look-behind 的语言,我们或许可以构建一个使用全局匹配功能的正则表达式.编写完全正确的代码并不容易,编写一个有细微错误的正则表达式也很容易.
  • 对于没有 G 和后视支持的语言:可以使用 ^ 模拟 G,通过截取输入字符串单场比赛后.(本答案未涵盖).

解决方案

此解决方案假设正则表达式引擎支持 G 匹配边界、前瞻 (?=pattern) 和后视 (?.Java、Perl、PCRE、.NET、Ruby 正则表达式支持上述所有高级功能.

但是,您可以在 .NET 中使用正则表达式.由于 .NET 支持捕获由通过 CaptureCollection 类.

对于您的情况,它可以在一个正则表达式中完成,使用 G 匹配边界,并提前限制重复次数:

(?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:结束)

演示.构造重复w+-,然后w+:end.

(?:start:(?=w+(?:-w+){2,9}:end)|(?!^)G-)(w+)

演示.第一项的构造是 w+,然后重复 -w+.(感谢 ka ᵠ 的建议).这种结构更容易推断其正确性,因为交替较少.

G 匹配边界在您需要进行标记化时特别有用,您需要确保引擎不会跳过并匹配本应无效的内容.

说明

让我们分解正则表达式:

(?:开始:(?=w+(?:-w+){2,9}:end)|(?

最容易识别的部分是(w+)在最后一行,就是你要捕捉的词.

最后一行也很容易识别:要匹配的单词后面可以跟-:end.

我允许正则表达式***地开始匹配字符串中的任何地方.换句话说,start:...:end 可以出现在字符串的任何位置,并且可以出现任意次数;正则表达式将简单地匹配所有单词.您只需要处理返回的数组以分离匹配的令牌实际来自的位置.

至于解释,正则表达式的开头检查字符串start:的存在,然后前瞻检查单词数是否在指定的限制内,并以:结束代码>.要么,要么我们检查前一个匹配之前的字符是-,然后从前一个匹配继续.

对于其他结构:

(?:开始:(?=w+(?:-w+){2,9}:end)|(?!^)G-)(w+)

一切都差不多,除了我们先匹配start:w+,然后再匹配-w+形式的重复.与第一个构造相反,我们首先匹配 start:w+-,然后 w+-(或 w+:endcode> 用于最后一次重复).

使这个正则表达式适用于字符串中间的匹配非常棘手:

  • 我们需要检查 start::end 之间的字数(作为原始正则表达式要求的一部分).

  • G 也匹配字符串的开头!需要 (?!^) 来防止这种行为.如果不注意这一点,当没有任何 start: 时,正则表达式可能会产生匹配.

    对于第一个构造,后视 (?<=-) 已经阻止了这种情况((?!^)(?).

  • 对于第一个构造 (?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:end),我们需要确保在 :end 之后没有匹配任何有趣的东西.后视就是为了这个目的:它可以防止 :end 之后的任何垃圾匹配.

    第二个构造不会遇到这个问题,因为在我们匹配了中间的所有标记后,我们将卡在 : (of :end) 处.

验证版本

如果你想验证输入字符串是否符合格式(前后没有多余的东西),提取数据,你可以这样添加锚点:

(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G-)(w+)(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G)(w+)(?:-|:end)

(Look-behind 也不需要,但我们仍然需要 (?!^) 来防止 G 匹配字符串的开头).>

施工

对于您想要捕获所有重复实例的所有问题,我认为不存在修改正则表达式的通用方法.要转换的困难"(或不可能?)案例的一个示例是当重复必须回溯一个或多个循环以满足特定条件以匹配时.

当原始正则表达式描述整个输入字符串(验证类型)时,与尝试从字符串中间进行匹配(匹配类型)的正则表达式相比,通常更容易转换.但是,您始终可以与原始正则表达式进行匹配,我们将匹配类型问题转换回验证类型问题.

我们通过以下步骤构建这样的正则表达式:

  • 编写一个正则表达式来覆盖重复之前的部分(例如 start:).让我们称之为前缀正则表达式.
  • 匹配并捕获第一个实例.(例如 (w+))
    (此时,第一个实例和分隔符应该已经匹配了)
  • 添加 G 作为替代.通常还需要防止它匹配字符串的开头.
  • 添加分隔符(如果有).(例如 -)
    (在这一步之后,其余的标记也应该已经匹配了,除了最后一个)
  • 在重复后添加覆盖该部分的部分(如有必要)(例如 :end).让我们调用重复后的部分后缀正则表达式(我们是否将其添加到构造中无关紧要).
  • 现在是困难的部分.您需要检查:
    • 除了前缀正则表达式之外,没有其他方法可以开始匹配.记下 G 分支.
    • 后缀正则表达式匹配之后,无法开始任何匹配.注意 G 分支如何开始匹配​​.
    • 对于第一次构造,如果您将后缀正则表达式(例如 :end)与分隔符(例如 -)交替混合使用,请确保不要结束up 允许后缀正则表达式作为分隔符.

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.

So let's say the text is:

start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end

This example has 8 items inside, but say it could have between 3 and 10 items.

I'd ideally like something like this:
start:(?:(w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here

I usually use something like this in simple situations:

start:(w+)-(w+)-(w+)-?(w+)?-?(w+)?-?(w+)?-?(w+)?-?(w+)?-?(w+)?-?(w+)?:end

3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo

And the best I could do so far:

start:(w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end

shorter especially if the matches are complex but still long. demo

Anyone managed to make it work as a 1 regex-only solution without programming?

I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.

Update:

The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation

Update 2 (bounty):

With @nhahtdh's help I got to the RegExp below by using G:

(?:start:(?=(?:[w]+(?:-|(?=:end))){3,10}:end)|(?!^)G-)([w]+)

demo even shorter, but can be described without repeating code

I'm also interested in the ECMA flavor and as it doesn't support G wondering if there's another way, especially without using /g modifier.

Read this first!

This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.

For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.

This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.

Summary

  • .NET supports capturing repeating pattern with CaptureCollection class.
  • For languages that supports G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
  • For languages without G and look-behind support: it is possible to emulate G with ^, by chomping the input string after a single match. (Not covered in this answer).

Solution

This solution assumes the regex engine supports G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.

However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.

For your case, it can be done in one regex, with the use of G match boundary, and look-ahead to constrain the number of repetitions:

(?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:end)

DEMO. The construction is w+- repeated, then w+:end.

(?:start:(?=w+(?:-w+){2,9}:end)|(?!^)G-)(w+)

DEMO. The construction is w+ for the first item, then -w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.

G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.

Explanation

Let us break down the regex:

(?:
  start:(?=w+(?:-w+){2,9}:end)
    |
  (?<=-)G
)
(w+)
(?:-|:end)

The easiest part to recognize is (w+) in the line before last, which is the word that you want to capture.

The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.

I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.

As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.

For the other construction:

(?:
  start:(?=w+(?:-w+){2,9}:end)
    |
  (?!^)G-
)
(w+)

Everything is almost the same, except that we match start:w+ first before matching the repetition of the form -w+. In contrast to the first construction, where we match start:w+- first, and the repeated instances of w+- (or w+:end for the last repetition).

It is quite tricky to make this regex works for matching in middle of the string:

  • We need to check the number of words between start: and :end (as part of the requirement of the original regex).

  • G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.

    For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).

  • For the first construction (?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.

    The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.

Validation Version

If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:

(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G-)(w+)
(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G)(w+)(?:-|:end)

(Look-behind is also not needed, but we still need (?!^) to prevent G from matching the start of the string).

Construction

For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.

When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.

We build such regex by going through these steps:

  • Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
  • Match and capture the first instance. (e.g. (w+))
    (At this point, the first instance and delimiter should have been matched)
  • Add the G as an alternation. Usually also need to prevent it from matching the start of the string.
  • Add the delimiter (if any). (e.g. -)
    (After this step, the rest of the tokens should have also been matched, except the last maybe)
  • Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
  • Now the hard part. You need to check that:
    • There is no other way to start a match, apart from the prefix regex. Take note of the G branch.
    • There is no way to start any match after the suffix regex has been matched. Take note of how G branch starts a match.
    • For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.