且构网

How do I use PHP's preg_replace function to convert Unicode code points to actual characters/HTML entities?

Updated: 2023-02-19 20:17:36

I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).

For example, if I have the following string assignment:

$str = '\u304a\u306f\u3088\u3046';

I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.

As per other Stack Overflow posts I saw for similar issues, I first attempted the following:

$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);

However, whenever I attempt to do this, I get the following PHP error:

Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u

I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.

Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.

Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?

From the PHP manual:

Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.

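First of all, your regular expression uses only one backslash (\), so PCRE reads \u as an escape sequence, which it does not support; that is what triggers the compilation error. As explained in the PHP manual, you need to write \\\\ in the PHP string to match a literal backslash (with some exceptions).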
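The two layers of escaping can be checked directly. The following is a minimal sketch (the variable names $pattern and $input are illustrative, not from the original post) showing what PCRE actually receives once PHP has processed the string literal:

```php
<?php
// Four backslashes in the PHP source...
$pattern = '/\\\\u/';

// ...collapse to two once PHP has parsed the string literal, so PCRE
// receives /\\u/ and reads it as "a literal backslash, then u".
var_dump($pattern === '/' . chr(92) . chr(92) . 'u/');   // bool(true)

// In a single-quoted string, \u is not a PHP escape, so the backslash
// stays in the data: 6 bytes in total (\, u, 3, 0, 4, a).
$input = '\u304a';
var_dump(strlen($input));                 // int(6)

// The doubly escaped pattern matches the literal backslash + u.
var_dump(preg_match($pattern, $input));   // int(1)
```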

Second, your original expression has no capturing group, so $1 in the replacement has nothing to refer to. preg_replace() searches the given string for matches of the supplied pattern and returns the string with each match replaced by the replacement string, in which $1 stands for the text captured by the first group.
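To illustrate what the capturing group contributes, here is a small sketch (variable names are illustrative) comparing a pattern with a group against one without. In the replacement string, $0 is the whole match and $1 is the first group:

```php
<?php
$input = '\u304a';

// With a capturing group: $0 is the full match, $1 is the hex digits.
$withGroup = preg_replace('/\\\\u([0-9a-f]+)/i', '[$0 -> $1]', $input);
echo $withGroup, "\n";      // [\u304a -> 304a]

// Without a capturing group: $1 refers to nothing and is substituted
// with an empty string, so the entity has no code point in it.
$withoutGroup = preg_replace('/\\\\u[0-9a-f]+/i', '&#x$1;', $input);
echo $withoutGroup, "\n";   // &#x;
```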

The updated regular expression with proper escaping and correct capturing groups would look like:

$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);

Output (when rendered as HTML):

おはよう
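If actual characters are wanted rather than entities, one option is to decode the entities afterwards. This is a sketch rather than part of the original answer: it assumes the string contains no other HTML entities that should stay encoded, and it does not combine UTF-16 surrogate pairs (code points above U+FFFF written as two \u escapes).

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Step 1: rewrite each \uXXXX escape as a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);

// Step 2: decode the numeric entities into UTF-8 characters.
// Note: this also decodes any other entities present in the string.
$chars = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

echo $entities, "\n";   // &#x304a;&#x306f;&#x3088;&#x3046;
echo $chars, "\n";      // おはよう
```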

Expression: \\\\u([0-9a-f]+)

  • \\\\ - matches a literal backslash
  • u - matches the literal u character
  • ( - beginning of the capturing group
    • [0-9a-f]+ - character class with the + quantifier -- matches a digit (0-9) or a letter (a-f) one or more times
  • ) - end of capturing group
  • i modifier - used for case-insensitive matching
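The pieces above can be observed with preg_match_all() and preg_match(). The sketch below also shows a side effect of the + quantifier: it is greedy, so hex-looking letters directly after an escape are swallowed into the capture. Using {4} instead is a possible variation, assuming every escape has exactly four hex digits; it is not part of the original answer.

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Collect what the capturing group sees for each match.
preg_match_all('/\\\\u([0-9a-f]+)/i', $str, $m);
print_r($m[1]);   // 304a, 306f, 3088, 3046

// The i modifier lets the same pattern match uppercase input.
preg_match('/\\\\u([0-9a-f]+)/i', '\U304A', $upper);
echo $upper[1], "\n";    // 304A

// Greedy +: the trailing "bc" is valid hex, so it is captured too.
preg_match('/\\\\u([0-9a-f]+)/i', '\u304abc', $greedy);
echo $greedy[1], "\n";   // 304abc

// Hypothetical variation: {4} stops after exactly four hex digits.
preg_match('/\\\\u([0-9a-f]{4})/i', '\u304abc', $fixed);
echo $fixed[1], "\n";    // 304a
```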

Replacement: &#x$1;

  • & - literal ampersand character (&)
  • # - literal pound character (#)
  • x - literal character x, which marks the number as hexadecimal
  • $1 - contents of the first capturing group -- in this case, strings of the form 304a
  • ; - literal semicolon that terminates the entity
