且构网

How can I find all Markdown links using regular expressions?

Updated: 2023-02-21 14:11:28

Your pattern fails because of this lookbehind: (?<=\((.*)\)\[). Python's re module doesn't allow variable-length lookbehinds.
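The limitation can be checked directly with the standard library. The following is a minimal sketch; the helper name `compiles` is made up for illustration, and the fixed-width pattern is just a contrasting example, not part of the original question.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the stdlib re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re reports variable-length lookbehinds as a compile-time error
        return False


# The lookbehind from the question contains (.*), so its width is not fixed.
variable_ok = compiles(r"(?<=\((.*)\)\[)")

# A lookbehind of constant width (here the two characters ")[") is accepted.
fixed_ok = compiles(r"(?<=\)\[)")

print(variable_ok, fixed_ok)
```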

You can get what you want more conveniently with the newer regex module for Python (the re module has fewer features).

Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])

Pattern details:

(?|                                       # open a branch reset group
    # first case there is only the url
    (?<txt>                               # in this case, the text and the url  
        (?<url>                           # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                       # OR
    # the (text)[url] format
    \( ([^)]+) \)                         # this group will be named "txt" too 
    \[ (\g<url>) \]                       # this one "url"
)

This pattern uses the branch reset feature (?|...|...|...), which lets the capture groups in each alternative share the same names (or numbers). Because the ?<txt> group is opened first in the first alternative, the first group in the second alternative automatically gets the same name. The same applies to the ?<url> group.
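For comparison, here is a rough sketch of what the same idea looks like with the standard re module, which has no branch reset: each alternative needs its own group names, and the results have to be merged by hand. The names `LINK_RE`, `find_links`, and `bare` are assumptions made for this illustration, and the URL part is simplified (it requires the bare URL to end in a letter or digit instead of using \P{P}).

```python
import re

# Without branch reset, the two alternatives cannot share group names,
# so the bare-url case gets its own group ("bare").
LINK_RE = re.compile(
    r"""
    \( (?P<txt>[^)]+) \)                       # the (text) part
    \[ (?P<url>(?:ht|f)tps?://[^\]\s]+) \]     # the [url] part
  |
    (?P<bare>(?:ht|f)tps?://\S*[^\W_])         # a url on its own
    """,
    re.VERBOSE,
)


def find_links(text):
    """Return (text, url) pairs; for a bare url both items are the url."""
    links = []
    for m in LINK_RE.finditer(text):
        if m.group("bare") is not None:
            links.append((m.group("bare"), m.group("bare")))
        else:
            links.append((m.group("txt"), m.group("url")))
    return links


sample = "See (Example)[http://example.com/page] or visit https://python.org/doc."
print(find_links(sample))
```

The manual merge inside `find_links` is the bookkeeping that branch reset removes: with shared group names, every match exposes txt and url regardless of which alternative matched.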

\g<url> is a reference to the named subpattern ?<url> (like an alias, so there is no need to rewrite it in the second alternative).

(?<=\P{P}) checks that the last character of the URL is not a punctuation character (useful, for example, to keep the closing square bracket out of the match). (I'm not sure of the syntax; it may be \P{Punct}.)
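To make the effect of that check concrete, here is a small standard-library sketch. It assumes that "punctuation" means the Unicode general categories starting with P, and it mimics the backtracking of \S+(?<=\P{P}) by trimming trailing punctuation. The function names are made up for illustration.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """True if the character is in a Unicode P* category (Po, Pe, Ps, ...)."""
    return unicodedata.category(ch).startswith("P")


def trim_trailing_punctuation(url: str) -> str:
    """Drop trailing punctuation until the last character is not punctuation."""
    while url and is_punctuation(url[-1]):
        url = url[:-1]
    return url


# "]" (category Pe) and "." (category Po) are both punctuation,
# so they are removed from the end of the candidate url.
print(trim_trailing_punctuation("http://example.com/page]."))
```

Note that "/" is also punctuation under this definition, so a URL ending in a slash would lose it.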