且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

将正则表达式转换为国际字符

更新时间:2023-02-17 22:56:42

以下是必需的步骤:

  • 使用u模式选项.这会打开PCRE_UTF8 PCRE_UCP(PHP文档忘记提及其中一个):

  • Use the u pattern option. This turns on PCRE_UTF8 and PCRE_UCP (the PHP docs forget to mention that one):

PCRE_UTF8

此选项使PCRE将模式和主题都视为UTF-8字符的字符串,而不是单字节字符串.但是,仅当PCRE被构建为包含UTF支持时,它才可用.如果不是,则使用此选项会引发错误.在pcreunicode页中提供了有关此选项如何更改PCRE行为的详细信息.

This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

PCRE_UCP

此选项更改PCRE处理\B\b\D\d\S\s\W\w和某些POSIX字符类的方式.默认情况下,仅识别ASCII字符,但是如果设置了PCRE_UCP,则改用Unicode属性对字符进行分类.在pcrepattern页面中有关通用字符类型的部分中提供了更多详细信息.如果设置了PCRE_UCP,则匹配它影响的项目之一会花费更长的时间.仅当PCRE已在Unicode属性支持下进行编译时,此选项才可用.

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

  • \dPCRE_UCP一样好(它已经等效于\p{N}),但是您必须替换这些[a-z]范围以解决重音字符:

  • \d will do just fine with PCRE_UCP (it's equivalent to \p{N} already), but you have to replace these [a-z] ranges to account for accented characters:

    • \p{L}
    • 替换[a-zA-Z]
    • \p{Ll}
    • 替换[a-z]
    • [A-Z]替换为\p{Lu}
    • Replace [a-zA-Z] with \p{L}
    • Replace [a-z] with \p{Ll}
    • Replace [A-Z] with \p{Lu}

    \p{X}的意思是:来自 Unicode类别的字符> X ,其中L表示字母Ll表示小写字母,而Lu表示大写字母.您可以从文档中获取列表.

    \p{X} means: a character from Unicode category X, where L means letter, Ll means lowercase letter and Lu means uppercase letter. You can get a list from the docs.

    请注意,您可以在字符类(例如[\p{L}\d\s])内使用\p{X}.

    Note that you can use \p{X} inside a character class: [\p{L}\d\s] for instance.

    并确保对PHP中的字符串使用UTF8编码.另外,请确保使用支持Unicode的函数来处理这些字符串.

    And make sure you use UTF8 encoding for your strings in PHP. Also, make sure you use Unicode-aware functions to handle these strings.