How to parse invalid (bad / not well-formed) XML?

Updated: 2022-03-19 00:15:23

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
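
To make the distinction concrete, here is a minimal sketch that checks well-formedness only (not validity against a schema) using Python's standard-library parser. The helper name is_well_formed is hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if a conformant parser accepts the text as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Any well-formedness violation (mismatched tags, bare '&',
        # illegal characters, ...) is reported as a ParseError.
        return False
    return True


print(is_well_formed("<a><b/></a>"))   # properly nested
print(is_well_formed("<a><b></a>"))    # mismatched tags
print(is_well_formed("<a>1 & 2</a>"))  # unescaped ampersand
```

A conformant parser stops at the first such error, which is why the options that follow involve either fixing the source or using non-XML tooling first.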

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.

Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)

Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

Standalone: xmlstarlet has robust recovering and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.

Python: Beautiful Soup is Python-based; see the notes in the Differences between parsers section of its documentation. Other suggestions for dealing with not-well-formed markup in Python include, especially, lxml's recover=True option. codecs.EncodedFile() can be used to clean up illegal characters.

Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.

.NET:

XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
@jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
@jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntax issues, but note the rule-breaking warning under the process-the-data-as-text option below.
Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".

PHP: See DOMDocument::$recover and libxml_use_internal_errors(true).

Ruby: Nokogiri supports "Gentle Well-Formedness".

R: See htmlTreeParse() for fault-tolerant markup parsing in R.

Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
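
As one illustration of the tolerant-parser route, the lxml recover=True option mentioned in the Python entry can be sketched as follows. This assumes the third-party lxml package is installed; the helper name parse_leniently and the sample input are made up for illustration. What the recovered tree looks like depends on the underlying libxml2 recovery behaviour, so inspect the result before trusting it.

```python
from lxml import etree  # third-party dependency (assumed installed)


def parse_leniently(data: bytes):
    """Parse with libxml2's recovery mode instead of failing on errors."""
    parser = etree.XMLParser(recover=True)
    # May return None if nothing usable could be recovered.
    return etree.fromstring(data, parser)


# Hypothetical broken input: the second <item> is never closed.
broken = b"<root><item>one</item><item>two</root>"
root = parse_leniently(broken)

# Serialize the repaired tree so it can be handed to a strict XML parser.
print(etree.tostring(root))
```

Recovery is a best-effort guess at what the producer meant, so it is better suited to cleanup before review than to silent use in a pipeline.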

Process the data as text, either manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.

For invalid character errors, use regex to remove/replace invalid characters:

PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
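
A Python counterpart to the one-liners above might look like the sketch below. The character class follows the Char production of XML 1.0 (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, and the supplementary planes). Unlike the PHP and Ruby patterns above, it keeps the supplementary planes. The function name strip_invalid_xml_chars is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT allowed by the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Remove (or replace) characters that may not appear in XML 1.0."""
    return _INVALID_XML_CHARS.sub(replacement, text)


dirty = "<msg>bad\x00 byte\x0b here</msg>"
clean = strip_invalid_xml_chars(dirty)

# After cleanup, a strict parser accepts the text.
print(ET.fromstring(clean).text)
```

Passing a replacement such as " " instead of the default empty string mirrors the PHP and Ruby variants, which substitute a space rather than deleting.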

For ampersands, use regex to replace matches with &amp; (credit: blhsin):

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Note that the above regular expressions won't take comments or CDATA sections into account.
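
Applied in Python, the ampersand pattern could be used roughly as follows. This is a sketch under the same caveat: it is unaware of comments and CDATA sections, and it treats anything shaped like &name; as an entity reference, whether or not that entity is actually declared. The function name escape_bare_ampersands is hypothetical.

```python
import re

# An '&' that does not start a numeric or named reference such as
# &#38; &#x26; or &amp; (case-insensitive for the hex digits).
_BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)", re.IGNORECASE)


def escape_bare_ampersands(text: str) -> str:
    """Replace stray '&' with '&amp;', leaving existing references alone."""
    return _BARE_AMPERSAND.sub("&amp;", text)


sample = "<t>Tom & Jerry &amp; AT&T &#38; &#x26; &lt;</t>"
print(escape_bare_ampersands(sample))
```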
