且构网

Capturing repeated groups in a Python regular expression

Updated: 2022-10-18 17:40:14

I have a mail log file, which is like this:

Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff

What I want is a list of all mail hosts in lines that contain "sm-mta". In this case that would be: ['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']

Running re.findall with the pattern r'sm-mta.*to=.+?@(.*?)[>, ]' returns only the first host of each matching line (['gmail.com', 'gmail.com']).

Running it with r'.+?@(.*?)[>, ]' finds every host on a line, but I need the sm-mta filtering too. Is there a workaround for this?
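
The difference between the two attempts can be reproduced with a short sketch against the three sample lines. The names log, anchored and unfiltered are illustrative only and do not come from the original post.

```python
import re

# The three sample lines from the log excerpt above.
log = """Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""

# Attempt 1: "sm-mta" occurs once per line, so the pattern can match only once per line.
anchored = re.findall(r'sm-mta.*to=.+?@(.*?)[>, ]', log)
print(anchored)    # ['gmail.com', 'gmail.com']

# Attempt 2: no sm-mta requirement, so the sendmail line is picked up as well.
unfiltered = re.findall(r'.+?@(.*?)[>, ]', log)
print(unfiltered)  # ['gmail.com', 'yahoo.com', 'aol.com', 'yahoo.com', 'gmail.com', 'gmail.com']
```

Note that the second list also contains the host from the sendmail line, which is why a filter on sm-mta is still needed.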

If you cannot use the PyPI regex library, you will have to do it in two steps: 1) grab the lines that contain sm-mta and 2) extract the values you need, with something like

import re

txt="""Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
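rx = r'@([^\s>,]+)'  # capture everything after @ up to whitespace, > or ,
# Step 1: keep only the lines that mention sm-mta
filtered_lines = [x for x in txt.split('\n') if 'sm-mta' in x]
# Step 2: extract the hosts from the lines that are left
print(re.findall(rx, " ".join(filtered_lines)))
# ['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']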

See the Python demo online. The @([^\s>,]+) pattern matches @, then captures and returns one or more chars other than whitespace, > and ,.
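
For a large log it may be preferable not to join all the filtered lines into one string. The following is a minimal sketch of the same two steps done line by line with a generator; the names HOST_RX and sm_mta_hosts are made up for this example and are not part of the original answer.

```python
import re

# Same pattern as above, precompiled once (HOST_RX is an illustrative name).
HOST_RX = re.compile(r'@([^\s>,]+)')

def sm_mta_hosts(lines):
    """Yield the host of every address found on lines that mention sm-mta."""
    for line in lines:
        if 'sm-mta' in line:                  # step 1: filter the line
            yield from HOST_RX.findall(line)  # step 2: extract its hosts

sample_lines = [
    "Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff",
    "Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff",
    "Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff",
]
print(list(sm_mta_hosts(sample_lines)))
# ['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']
```

Because the function only iterates over its argument, an open file handle could be passed in place of the list.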

If you can use the PyPI regex library, you can get the list of strings you need with

>>> import regex
>>> x="""Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
>>> rx = r'(?:^(?=.*sm-mta)|\G(?!^)).*?@\K[^\s>,]+'
>>> print(regex.findall(rx, x, regex.M))
['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']

See the Python online demo and a regex demo.

Pattern details

  • (?:^(?=.*sm-mta)|\G(?!^)) - either the start of a line (^, with regex.M) that has the sm-mta substring after any 0+ chars other than line break chars, or the place where the previous match ended (\G), provided that place is not itself the start of a line
  • .*?@ - any 0+ chars other than line break chars, as few as possible, up to and including the next @
  • \K - a match reset operator that discards all the text matched so far in the current iteration
  • [^\s>,]+ - 1 or more chars other than whitespace, , and >
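
To make the roles of \G and \K more concrete, here is a rough emulation using only the standard re module. It is a sketch, not how the regex library works internally: Pattern.match(text, pos) is used to force each match to begin where the previous one ended (the job of \G), and a capturing group stands in for \K. The names LINE_START, STEP and hosts_via_chained_matches are invented for this example.

```python
import re

# Zero-width match at the start of every line that contains sm-mta.
LINE_START = re.compile(r'^(?=.*sm-mta)', re.M)
# Lazily skip to the next @ on the same line, then capture the host.
STEP = re.compile(r'.*?@([^\s>,]+)')

def hosts_via_chained_matches(text):
    hosts = []
    for line_start in LINE_START.finditer(text):
        pos = line_start.start()
        while True:
            # match() only succeeds exactly at pos, which plays the role of \G.
            m = STEP.match(text, pos)
            if m is None:        # no further @ on this line: the chain ends
                break
            hosts.append(m.group(1))  # group 1 is what \K would have left over
            pos = m.end()             # the next match must continue from here
    return hosts

sample = """Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
print(hosts_via_chained_matches(sample))
# ['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']
```

Since . does not match a line break here, each chain stops at the end of its line, which is why the sendmail line never contributes a host.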