Python regular expression to find words with accents

Updated: 2022-11-12 08:05:10

The simplest way to do this is the same way you'd do it in Python 3. That means explicitly using unicode objects instead of str objects, including u-prefixed string literals. Ideally, also put an explicit coding declaration at the top of your file so you can write non-ASCII characters directly in those literals.

# -*- coding: utf-8 -*-

import re

# ur'' is a raw unicode literal (Python 2 only), so [aá] is a class of two characters.
pattern = re.compile(ur'Nombre vern[aá]culo')
text = u'Nombre vernáculo'
match = pattern.search(text)
print match

Notice that I left off the \. on the end of the pattern. Your text doesn't end in a ., so you shouldn't be looking for one, or it's going to fail.
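
To make the point about the trailing \. concrete, here is a minimal sketch. It assumes the original pattern ended in \. as described above, and it uses non-raw u'' literals with \xe1 escapes so it runs on both Python 2 and Python 3; the variable names are only illustrative.

```python
import re

# Sample text with no trailing period, as in the answer above.
text = u'Nombre vern\xe1culo'

# Pattern that also demands a literal '.' after "culo".
with_dot = re.compile(u'Nombre vern[a\xe1]culo\\.')
# The same pattern without the trailing '\.'.
without_dot = re.compile(u'Nombre vern[a\xe1]culo')

dot_match = with_dot.search(text)       # no '.' in the text, so no match
plain_match = without_dot.search(text)  # matches the whole text

print(dot_match is None)
print(plain_match.group(0) == text)
```

If the real text sometimes ends in a period and sometimes does not, making the period optional with \.? would be one way to accept both.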

Of course, if you want to search text that comes from somewhere besides your source code, you'll need to decode('utf-8') it, or use io.open or codecs.open on the file instead of plain open, and so on.
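
A rough sketch of both options follows. It assumes the external data really is UTF-8; the byte string and the temporary file are hypothetical stand-ins for whatever source you actually read from.

```python
import io
import os
import re
import tempfile

pattern = re.compile(u'Nombre vern[a\xe1]culo')

# Case 1: raw bytes from an external source (hypothetical payload, assumed UTF-8).
raw_bytes = b'Nombre vern\xc3\xa1culo'
decoded = raw_bytes.decode('utf-8')  # now a Unicode string
bytes_match = pattern.search(decoded)

# Case 2: a file on disk, read as Unicode text via io.open.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Write a sample file so the example is self-contained.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'Nombre vern\xe1culo\n')
    # io.open with an encoding yields Unicode strings, not raw bytes.
    with io.open(path, encoding='utf-8') as f:
        file_match = pattern.search(f.read())
finally:
    os.remove(path)

print(bytes_match is not None)
print(file_match is not None)
```

Either way, the pattern and the text are both Unicode by the time search is called, which is what lets the character class work.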

If you can't use a coding declaration, or can't trust your text editor to handle UTF-8, you can still use Unicode strings; just escape the characters with their Unicode code points:

import re

# \xe1 is code point U+00E1 (á); the re module interprets the escape in the raw pattern.
pattern = re.compile(ur'Nombre vern[a\xe1]culo')
text = u'Nombre vern\xe1culo'
match = pattern.search(text)
print match


If you have to use str, then you do have to manually encode to UTF-8 and escape the individual bytes, as you were trying to do. But now you're not trying to match a single character, but a multi-byte sequence, \xc3\xa1, so you can't use a character class. Instead, you have to write it out explicitly as a group with alternation:

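import re

# á is two bytes in UTF-8 (\xc3\xa1), so use a non-capturing group, not a character class.
pattern = re.compile(r'Nombre vern(?:a|\xc3\xa1)culo')
text = 'Nombre vern\xc3\xa1culo'
match = pattern.search(text)
print match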
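
To see why the character class fails on byte strings, the following sketch compares the two patterns side by side. It uses b'' and br'' literals, which are plain str on Python 2 and bytes on Python 3, so it should behave the same on both; the variable names are only illustrative.

```python
import re

# UTF-8 bytes for u'Nombre vern\xe1culo'.
text = b'Nombre vern\xc3\xa1culo'

# Character class: matches exactly ONE byte out of 'a', 0xC3, 0xA1.
char_class = re.compile(br'Nombre vern[a\xc3\xa1]culo')
# Alternation: matches either 'a' or the two-byte sequence 0xC3 0xA1.
alternation = re.compile(br'Nombre vern(?:a|\xc3\xa1)culo')

class_match = char_class.search(text)   # fails: one class, two bytes to consume
alt_match = alternation.search(text)    # succeeds on the accented form
ascii_match = alternation.search(b'Nombre vernaculo')  # and on the plain form

print(class_match is None)
print(alt_match is not None)
print(ascii_match is not None)
```

The class consumes only the first byte (0xC3) and then expects "culo", but the next byte is 0xA1, so the match fails.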