Updated: 2023-02-21 14:11:28
Your pattern doesn't work because of this part: (?<=\((.*)\)\[)
Python's re module doesn't allow variable-length lookbehind, and (.*) can match any number of characters.
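To see the restriction in practice, here is a minimal sketch using only the standard library. The helper name `compiles_with_re` and the `https?://\S+` tail appended to the lookbehind are assumptions made for illustration; only the lookbehind itself comes from the question.

```python
import re

# The lookbehind from the question: (.*) can match any number of
# characters, so the lookbehind has no fixed width.
VARIABLE_WIDTH = r'(?<=\((.*)\)\[)https?://\S+'

# A lookbehind that is exactly one character wide, for comparison.
FIXED_WIDTH = r'(?<=\[)https?://\S+'

def compiles_with_re(pattern):
    """Return (True, None) if `re` accepts the pattern, else (False, message)."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return False, str(exc)
    return True, None

ok_variable, message = compiles_with_re(VARIABLE_WIDTH)
ok_fixed, _ = compiles_with_re(FIXED_WIDTH)

print(ok_variable, message)  # rejected, with the error text reported by re
print(ok_fixed)              # accepted
```

The fixed-width version compiles, but it can only look back over a known number of characters, which is why it cannot check for an arbitrary (text) prefix.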
You can get what you want more conveniently with Python's newer regex module (the re module has fewer features).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
Pattern details:
(?|                                  # open a branch reset group
    # first case: there is only the url
    (?<txt>                          # in this case, the text and the url
        (?<url>                      # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                  # OR
    # the (text)[url] format
    \( ([^)]+) \)                    # this group will be named "txt" too
    \[ (\g<url>) \]                  # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (or numbers) across the branches of an alternation. Since the ?<txt> group is opened first in the first branch, the first group in the second branch automatically gets the same name. The same goes for the ?<url> group.
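If you are limited to the standard re module, which has no branch reset, the same "one txt and one url per match" result can be approximated by giving each branch its own group names and merging them afterwards. This is only a sketch: the group name `bare`, the function `extract_links`, and the simplified URL subpattern (which stops at whitespace or "]" instead of using a punctuation lookbehind) are assumptions, not part of the pattern above.

```python
import re

# Simplified URL subpattern (an assumption): stop at whitespace or "]".
URL = r'(?:ht|f)tps?://[^\s\]]+'

# `re` has no branch reset, so each branch needs its own group names.
LINK = re.compile(
    r'\((?P<txt>[^)]+)\)\[(?P<url>' + URL + r')\]'  # the (text)[url] format
    r'|(?P<bare>' + URL + r')'                       # only the url
)

def extract_links(text):
    """Return (txt, url) pairs; for a bare url, txt and url are the same."""
    pairs = []
    for m in LINK.finditer(text):
        if m.group('bare') is not None:
            pairs.append((m.group('bare'), m.group('bare')))
        else:
            pairs.append((m.group('txt'), m.group('url')))
    return pairs

print(extract_links('see (Google)[http://google.com] or https://example.org/page'))
```

The merging step in `extract_links` is exactly the bookkeeping that branch reset does for you inside the pattern.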
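\g<url> is a reference to the named subpattern ?<url>: it works like an alias, so the url subpattern doesn't have to be rewritten in the second branch. (This is the PCRE-style subroutine-call syntax; if your version of the regex module reads \g<url> as a backreference instead, (?&url) is its subroutine-call form.)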
(?<=\P{P}) checks that the last character of the url is not a punctuation character (useful, for example, to avoid capturing the closing square bracket). (I'm not sure of the syntax; it may be \P{Punct}.)
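The same trailing-punctuation check can be sketched with the standard re module, which has no \P{P}. The version below substitutes a negative lookbehind over `string.punctuation`; that is an ASCII-only approximation of the Unicode punctuation property (it also excludes symbols such as + and =), and the names `NOT_PUNCT_BEHIND` and `find_urls` are hypothetical.

```python
import re
import string

# ASCII-only stand-in for (?<=\P{P}): the last matched character must not
# be one of string.punctuation. A one-character class keeps the lookbehind
# fixed-width, so `re` accepts it.
NOT_PUNCT_BEHIND = r'(?<![' + re.escape(string.punctuation) + r'])'

URL = re.compile(r'(?:ht|f)tps?://\S+' + NOT_PUNCT_BEHIND)

def find_urls(text):
    """Return every url in text, trimmed of trailing ASCII punctuation."""
    return URL.findall(text)

print(find_urls('(Google)[http://google.com] and http://example.org/a.'))
# → ['http://google.com', 'http://example.org/a']
```

\S+ first swallows the closing bracket or the final period, the lookbehind rejects that position, and the engine backtracks one character at a time until the url ends on a non-punctuation character. One side effect is that a url that legitimately ends in "/" loses it.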