且构网

Regex: JavaScript capture groups not working with a quantifier

Updated: 2022-10-18 23:02:01


I have this nice regex:

 *(?:(?:([0-9]+)(?:d| ?days?)(?:, ?| )?)|(?:([0-9]+)(?:h| ?hours?)(?:, ?| )?)|(?:([0-9]+)(?:m| ?minutes?)(?:, ?| )?)|(?:([0-9]+)(?:s| ?seconds?)(?:, ?| )?))+

that pretty much matches a human-readable time delta. It works in PHP, Python, and Go, but for some reason the capture groups do not work in JavaScript. Here is a working PHP example on regex101 that shows the capture groups working. You will notice that upon changing it to JavaScript (ECMAScript) mode, only the last value is captured. Can somebody please help and clarify what I am doing wrong, and why it doesn't work in JS?

Here's a simpler example that demonstrates the issue:

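console.log(
  '34'.match(/(?:(3)|(4))+/)
);
// In JavaScript the match array is ['34', undefined, '4']:
// group 1 matched '3' on the first iteration but was reset on the second.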

In PHP, whenever a capture group is matched, it is put into the result and stays there. In JavaScript, things are more complicated: when there are capturing groups on one side of an alternation |, then each time the whole alternation is entered, there are two possibilities:

  • The alternative that is taken contains the capture group, and the result will have the capture group index set to the matched value
  • The alternative that is taken does not contain the capture group, in which case the result will have undefined assigned to that index, even if the capturing group was matched on a previous iteration.
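Both cases can be seen by running the small pattern from above against a few inputs. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
const re = /(?:(3)|(4))+/;

// Only the captures from the final iteration of the + survive.
const forward = '34'.match(re);  // last iteration took the (4) branch
const reversed = '43'.match(re); // last iteration took the (3) branch
const single = '3'.match(re);    // one iteration; the (4) branch was never taken

console.log(Array.from(forward));  // [ '34', undefined, '4' ]
console.log(Array.from(reversed)); // [ '43', '3', undefined ]
console.log(Array.from(single));   // [ '3', '3', undefined ]
```

In each case the full match (index 0) covers the whole input, but the numbered groups only reflect the branch taken last.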

This is described in the specification:

Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.

and

Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated. [...] because each iteration of the outermost * clears all captured Strings contained in the quantified Atom.
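The clearing is easy to observe with a repeated group that contains optional groups. The pattern below is, to my knowledge, the one the specification note itself uses as its illustration; treat the snippet as a sketch for experimenting rather than as a quotation.

```javascript
// The outer (...)* runs three times over "aac", "bbbc" and "ac".
const m = /(z)((a+)?(b+)?(c))*/.exec('zaacbbbcac');

// Group 4, (b+), matched "bbb" on the second iteration, but the third
// iteration cleared it and did not match it again, so it ends up undefined.
console.log(Array.from(m));
// [ 'zaacbbbcac', 'z', 'ac', 'a', undefined, 'c' ]
```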


In your case, the easiest tweak to fix it would be to drop the + quantifier on the outermost (non-capturing) group, so that only one subsequence is matched at a time, e.g. 1m, and then 1d, and then iterate through the matches instead of trying to match everything in one go. To ensure that all the matches are next to each other (e.g. 1m1d, and not 1m 1d), check the index of each match while iterating to see whether it starts right where the previous match ended.
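A minimal sketch of that approach might look like the following. The names parseTimeDelta and UNIT_RE are hypothetical, and the pattern is adapted from the question under two assumptions: the optional separator is kept inside each match, so "1d 2h" still counts as adjacent, and the long unit names are tried before the one-letter forms.

```javascript
// Hypothetical helper: one unit per match, so no quantifier ever resets a capture.
// Groups: 1 = days, 2 = hours, 3 = minutes, 4 = seconds.
// Assumption: long unit names come first so "5minutes" is not cut short at "5m";
// the pattern in the question lists them the other way round.
const UNIT_RE =
  /(?:([0-9]+)(?: ?days?|d)|([0-9]+)(?: ?hours?|h)|([0-9]+)(?: ?minutes?|m)|([0-9]+)(?: ?seconds?|s))(?:, ?| )?/g;

function parseTimeDelta(input) {
  const result = { days: 0, hours: 0, minutes: 0, seconds: 0 };
  let nextIndex = null; // index where the following match must start

  for (const m of input.matchAll(UNIT_RE)) {
    // Stop at the first match that does not directly follow the previous one.
    if (nextIndex !== null && m.index !== nextIndex) break;
    nextIndex = m.index + m[0].length;

    if (m[1] !== undefined) result.days += Number(m[1]);
    else if (m[2] !== undefined) result.hours += Number(m[2]);
    else if (m[3] !== undefined) result.minutes += Number(m[3]);
    else result.seconds += Number(m[4]);
  }
  // null means no unit was found at all.
  return nextIndex === null ? null : result;
}

console.log(parseTimeDelta('1d 2h'));     // { days: 1, hours: 2, minutes: 0, seconds: 0 }
console.log(parseTimeDelta('1m and 1d')); // { days: 0, hours: 0, minutes: 1, seconds: 0 }
```

In the second call the 1d is ignored because it does not start where the previous match ended. Whether to stop silently, as here, or to reject the whole input is a design choice that depends on how strict the parser should be.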