更新时间:2022-10-18 17:40:14
这篇文章是为了展示这种可能性,而不是支持一切正则表达式"解决问题的方法.作者编写了 3-4 个变体,每个变体都有在达到当前解决方案之前难以检测的微妙错误.
对于您的具体示例,还有其他更易于维护的更好解决方案,例如沿分隔符匹配和拆分匹配项.
这篇文章涉及您的具体示例.我真的怀疑一个完整的概括是可能的,但背后的想法可以在类似的情况下重复使用.
捕获重复模式CaptureCollection
类.G
和look-behind 的语言,我们或许可以构建一个使用全局匹配功能的正则表达式.编写完全正确的代码并不容易,编写一个有细微错误的正则表达式也很容易.G
和后视支持的语言:可以使用 ^
模拟 G
,通过截取输入字符串单场比赛后.(本答案未涵盖).此解决方案假设正则表达式引擎支持 G
匹配边界、前瞻 (?=pattern)
和后视 (?.Java、Perl、PCRE、.NET、Ruby 正则表达式支持上述所有高级功能.
但是,您可以在 .NET 中使用正则表达式.由于 .NET 支持捕获由通过 CaptureCollection
类.
对于您的情况,它可以在一个正则表达式中完成,使用 G
匹配边界,并提前限制重复次数:
(?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:结束)
演示.构造重复w+-
,然后w+:end
.
(?:start:(?=w+(?:-w+){2,9}:end)|(?!^)G-)(w+)
演示.第一项的构造是 w+
,然后重复 -w+
.(感谢 ka ᵠ 的建议).这种结构更容易推断其正确性,因为交替较少.
G
匹配边界在您需要进行标记化时特别有用,您需要确保引擎不会跳过并匹配本应无效的内容.
让我们分解正则表达式:
(?:开始:(?=w+(?:-w+){2,9}:end)|(?
最容易识别的部分是(w+)
在最后一行,就是你要捕捉的词.
最后一行也很容易识别:要匹配的单词后面可以跟-
或:end
.
我允许正则表达式***地开始匹配字符串中的任何地方.换句话说,start:...:end
可以出现在字符串的任何位置,并且可以出现任意次数;正则表达式将简单地匹配所有单词.您只需要处理返回的数组以分离匹配的令牌实际来自的位置.
至于解释,正则表达式的开头检查字符串start:
的存在,然后前瞻检查单词数是否在指定的限制内,并以:结束代码>.要么,要么我们检查前一个匹配之前的字符是-
,然后从前一个匹配继续.
对于其他结构:
(?:开始:(?=w+(?:-w+){2,9}:end)|(?!^)G-)(w+)
一切都差不多,除了我们先匹配start:w+
,然后再匹配-w+
形式的重复.与第一个构造相反,我们首先匹配 start:w+-
,然后 w+-
(或 w+:end
code> 用于最后一次重复).
使这个正则表达式适用于字符串中间的匹配非常棘手:
我们需要检查 start:
和 :end
之间的字数(作为原始正则表达式要求的一部分).
G
也匹配字符串的开头!需要 (?!^)
来防止这种行为.如果不注意这一点,当没有任何 start:
时,正则表达式可能会产生匹配.
对于第一个构造,后视 (?<=-)
已经阻止了这种情况((?!^)
被 (?).
对于第一个构造 (?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:end)
,我们需要确保在 :end
之后没有匹配任何有趣的东西.后视就是为了这个目的:它可以防止 :end
之后的任何垃圾匹配.
第二个构造不会遇到这个问题,因为在我们匹配了中间的所有标记后,我们将卡在 :
(of :end
) 处.
如果你想验证输入字符串是否符合格式(前后没有多余的东西),和提取数据,你可以这样添加锚点:
(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G-)(w+)(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G)(w+)(?:-|:end)
(Look-behind 也不需要,但我们仍然需要 (?!^)
来防止 G
匹配字符串的开头).>
对于您想要捕获所有重复实例的所有问题,我认为不存在修改正则表达式的通用方法.要转换的困难"(或不可能?)案例的一个示例是当重复必须回溯一个或多个循环以满足特定条件以匹配时.
当原始正则表达式描述整个输入字符串(验证类型)时,与尝试从字符串中间进行匹配(匹配类型)的正则表达式相比,通常更容易转换.但是,您始终可以与原始正则表达式进行匹配,我们将匹配类型问题转换回验证类型问题.
我们通过以下步骤构建这样的正则表达式:
start:
).让我们称之为前缀正则表达式.(w+)
)G
作为替代.通常还需要防止它匹配字符串的开头.-
):end
).让我们调用重复后的部分后缀正则表达式(我们是否将其添加到构造中无关紧要).G
分支.G
分支如何开始匹配.:end
)与分隔符(例如 -
)交替混合使用,请确保不要结束up 允许后缀正则表达式作为分隔符.I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:start:(?:(w+)-?){3,10}:end
nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(w+)-(w+)-(w+)-?(w+)?-?(w+)?-?(w+)?-?(w+)?-?(w+)?-?(w+)?-?(w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
The purpose is to validate a match and capture individual tokens inside match 0
by RegEx alone, without any OS/Software/Programming-Language limitation
With @nhahtdh's help I got to the RegExp below by using G
:
(?:start:(?=(?:[w]+(?:-|(?=:end))){3,10}:end)|(?!^)G-)([w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support G
wondering if there's another way, especially without using /g
modifier.
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
CaptureCollection
class.G
and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.G
and look-behind support: it is possible to emulate G
with ^
, by chomping the input string after a single match. (Not covered in this answer).This solution assumes the regex engine supports G
match boundary, look-ahead (?=pattern)
, and look-behind (?<=pattern)
. Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection
class.
For your case, it can be done in one regex, with the use of G
match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:end)
DEMO. The construction is w+-
repeated, then w+:end
.
(?:start:(?=w+(?:-w+){2,9}:end)|(?!^)G-)(w+)
DEMO. The construction is w+
for the first item, then -w+
repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
G
match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Let us break down the regex:
(?:
start:(?=w+(?:-w+){2,9}:end)
|
(?<=-)G
)
(w+)
(?:-|:end)
The easiest part to recognize is (w+)
in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by -
or :end
.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end
can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:
, and the following look-ahead checks that the number of words is within specified limit and it ends with :end
. Either that, or we check that the character before the previous match is a -
, and continue from previous match.
For the other construction:
(?:
start:(?=w+(?:-w+){2,9}:end)
|
(?!^)G-
)
(w+)
Everything is almost the same, except that we match start:w+
first before matching the repetition of the form -w+
. In contrast to the first construction, where we match start:w+-
first, and the repeated instances of w+-
(or w+:end
for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start:
and :end
(as part of the requirement of the original regex).
G
matches the beginning of the string also! (?!^)
is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:
.
For the first construction, the look-behind (?<=-)
already prevent this case ((?!^)
is implied by (?<=-)
).
For the first construction (?:start:(?=w+(?:-w+){2,9}:end)|(?<=-)G)(w+)(?:-|:end)
, we need to make sure that we don't match anything funny after :end
. The look-behind is for that purpose: it prevents any garbage after :end
from matching.
The second construction doesn't run into this problem, since we will get stuck at :
(of :end
) after we have matched all the tokens in between.
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G-)(w+)
(?:^start:(?=w+(?:-w+){2,9}:end$)|(?!^)G)(w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^)
to prevent G
from matching the start of the string).
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
start:
). Let us call this prefix regex. (w+)
)G
as an alternation. Usually also need to prevent it from matching the start of the string.-
):end
). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).G
branch.G
branch starts a match.:end
) with delimiter (e.g. -
) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.