Updated: 2023-02-26 12:35:26
In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it only matches [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP?
https://regex101.com/r/yhNF8U/1/
Generic solution
Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W looks like:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
Please note the comment for the suggested Unicode property class combination:
This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.
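As a minimal sketch of how the UTS18-style class behaves in JavaScript, the snippet below splits a sample string with it and, for comparison, with the built-in \W. The helper name splitWords and the sample text are assumptions made for illustration, and the pattern relies on Unicode property escapes, which need the u flag.

```javascript
// UTS18-style "non-word" class, copied from the pattern above.
const UNICODE_NON_WORD = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

// Built-in \W for comparison; even with the u flag it stays ASCII-only.
const ASCII_NON_WORD = /\W/gu;

// Hypothetical helper: split text on non-word characters, dropping empty pieces.
function splitWords(text, nonWord) {
  return text.split(nonWord).filter((piece) => piece.length > 0);
}

// "naïve café, привет_мир!" with the accented Latin letters written as escapes
// so that they are guaranteed to be precomposed code points.
const sample = "na\u00EFve caf\u00E9, привет_мир!";

console.log(splitWords(sample, UNICODE_NON_WORD)); // whole words in both scripts
console.log(splitWords(sample, ASCII_NON_WORD));   // only ASCII fragments survive
```

With the built-in \W, every accented or Cyrillic letter acts as a separator, which is the behaviour the question describes.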
More considerations
The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters in each regex engine.
For example, the .NET documentation ("Non-word character: \W") defines \W as [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}], where \p{Ll}\p{Lu}\p{Lt}\p{Lo} together with \p{Lm} can be contracted to a plain \p{L}, so the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].
In Android (see documentation), \W is [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.
In PHP PCRE, \W matches [^\p{L}\p{N}_].
The Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].
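To make the differences concrete, here is a rough sketch that rewrites the quoted definitions as JavaScript classes and probes them with a few characters. It emulates the definitions as quoted here, not the engines themselves, and the names ENGINE_NON_WORD and isNonWord are assumptions introduced for the example.

```javascript
// Approximations of each engine's \W, taken from the definitions quoted above.
// No g flag here, so test() keeps no state between calls.
const ENGINE_NON_WORD = {
  dotnet: /[^\p{L}\p{Nd}\p{Mn}\p{Pc}]/u,
  pcre: /[^\p{L}\p{N}_]/u,
  python3: /[^\p{L}\p{Mn}\p{Nd}_]/u, // per the Rexegg description
};

// Hypothetical helper: true if `ch` counts as a non-word character
// under the quoted definition for `engine`.
function isNonWord(engine, ch) {
  return ENGINE_NON_WORD[engine].test(ch);
}

// Characters on which the quoted definitions disagree.
const probes = {
  "superscript two (No)": "\u00B2",
  "undertie (Pc)": "\u203F",
  "combining acute accent (Mn)": "\u0301",
};

for (const [label, ch] of Object.entries(probes)) {
  const row = Object.keys(ENGINE_NON_WORD).map((e) => `${e}=${isNonWord(e, ch)}`);
  console.log(`${label}: ${row.join(" ")}`);
}
```

Each probe is classified differently by at least one of the three definitions, which is why a single "correct" translation of \W does not exist.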
You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:
/[^\p{L}\p{N}\p{M}\p{Pc}]/gu
where
[^ - the start of the negated character class, which matches a single char other than:
\p{L} - any Unicode letter
\p{N} - any Unicode number (\p{Nd} digits plus \p{Nl} and \p{No})
\p{M} - a combining mark, such as a diacritic
\p{Pc} - a connector punctuation symbol
] - the end of the character class.
Note that it is the \p{Pc} class that matches an underscore.
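A short sketch of the decomposed class in use might look like the following. The helper names removeNonWord and listNonWord and the sample string are made up for illustration; the sample deliberately contains a combining accent and an underscore to show that \p{M} and \p{Pc} keep them.

```javascript
// The decomposed Unicode-aware \W from above.
const NON_WORD = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu;

// Hypothetical helper: delete every non-word character.
function removeNonWord(text) {
  return text.replace(NON_WORD, "");
}

// Hypothetical helper: list the non-word characters found in the text.
function listNonWord(text) {
  return text.match(NON_WORD) ?? [];
}

// "été 2024: snake_case!" where the first "é" is "e" plus a combining acute
// accent (U+0301) and the second is the precomposed U+00E9.
const sample = "e\u0301t\u00E9 2024: snake_case!";

console.log(removeNonWord(sample)); // letters, marks, digits and "_" are kept
console.log(listNonWord(sample));   // the two spaces, ":" and "!"
```

Dropping \p{M} from the class would make the combining accent count as a non-word character and split the first word.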
Note that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched by \p{Other_Alphabetic} (\p{OAlpha}).
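The relationship between \p{L}, \p{Nl} and \p{Alphabetic} can be checked directly. The sketch below uses a hypothetical classify helper, with Ⅻ (U+216B) as the letter number and Ⓐ (U+24B6), a symbol that carries Other_Alphabetic, as a second probe chosen for this example.

```javascript
// Hypothetical helper: report which of the relevant properties a character has.
function classify(ch) {
  return {
    letter: /^\p{L}$/u.test(ch),
    letterNumber: /^\p{Nl}$/u.test(ch),
    alphabetic: /^\p{Alphabetic}$/u.test(ch),
  };
}

console.log("A", classify("A"));      // an ordinary letter
console.log("Ⅻ", classify("\u216B")); // ROMAN NUMERAL TWELVE, a letter number
console.log("Ⓐ", classify("\u24B6")); // CIRCLED LATIN CAPITAL LETTER A, a symbol

// Consequence for the two styles of \W: the \p{L}\p{N}\p{M}\p{Pc} class still
// treats Ⅻ as a word character (via \p{N}), but it treats Ⓐ as a non-word
// character, whereas a class built on \p{Alphabetic} accepts both.
const decomposed = /[^\p{L}\p{N}\p{M}\p{Pc}]/u;
const alphabeticBased = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u;
console.log(decomposed.test("\u24B6"), alphabeticBased.test("\u24B6"));
```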
Other variations:
/[^\p{L}0-9_]/gu - a \W that is aware of Unicode letters only
/[^\p{L}\p{N}_]/gu - (PCRE \W style) a \W that is aware of Unicode letters and digits only.
Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.