且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

使用 libxml2 解析 xml 时处理代理对

更新时间:2022-02-17 05:16:36

请注意,xD83D 是一个高代理项.一个代理对由一个高代理和一个低代理组成;两个高代理并排不是代理对",这是胡说八道.

Note that xD83D is a high surrogate. A surrogate pair consists of a high surrogate and a low surrogate; having two high surrogates next to each other is not a "surrogate pair", it is nonsense.

另请注意,在 XML 中表示非 BMP 字符的正确方法是作为组合字符的单个字符引用,例如 𒂫.在某些字符编码中需要将非 BMP 字符拆分为两个代理项,但在 XML 字符引用中不需要(或不允许).XML 中的字符引用表示 Unicode 代码点,而不是特定于特定字符编码的数值.

Also note that the correct way to represent a non-BMP character in XML is as a single character reference for the combined character, for example 𒂫. Splitting a non-BMP character into two surrogates is needed in some character encodings, but it is not needed (or allowed) in XML character references. Character references in XML represent Unicode code-points, not the numeric values specific to a particular character encoding.

如果您无法修复创建此错误 XML 的程序,那么***的方法是使用脚本修复它,例如在 Perl 中查找无效的字符引用对并用正确的 XML 表示替换它们.

If you can't fix the program that created this bad XML, then the best approach would be to repair it using a script e.g. in Perl that looks for the invalid character references pairs and replaces them with the correct XML representation.