且构网

How to parse invalid (bad / not well-formed) XML?

Updated: 2022-03-19 00:15:23

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
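To see the distinction in practice, here is a minimal Python sketch (the helper name is_well_formed is made up for illustration). It relies only on the standard library's xml.etree.ElementTree, which, like any conformant parser, rejects text that is not well-formed instead of guessing at a repair:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a conformant XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed: the parser stops at the first violation.
        return False
    return True


print(is_well_formed("<a><b/></a>"))   # properly nested
print(is_well_formed("<a>x & y</a>"))  # bare ampersand
print(is_well_formed("<a><b></a>"))    # mismatched tags
```

Validity is a separate, later question: it only arises once the text is well-formed and there is a schema or DTD to check it against.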

  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)

  2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

  • Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

  • Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.

    Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.

    Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.

    .NET:

    • XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
    • @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
    • @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
    • Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".

    PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.

    Ruby: Nokogiri supports "Gentle Well-Formedness".

    R: See htmlTreeParse() for fault-tolerant markup parsing in R.

    Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
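As a rough illustration of the "tolerant parser first, XML parser second" idea, the sketch below uses only Python's standard library. It feeds the text through html.parser.HTMLParser (the same tolerant tokenizer Beautiful Soup uses for its "html.parser" backend) and re-emits the events as well-formed XML. The names XmlRepairer and repair_markup are hypothetical, and the repair policy is an assumption: unclosed elements are closed when an ancestor closes, stray end tags are dropped, and comments are discarded. HTMLParser also lowercases tag and attribute names and treats a few HTML element names specially, so this is not a general-purpose fixer.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


class XmlRepairer(HTMLParser):
    """Re-emit tolerant parse events as well-formed XML text (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; emit them as empty strings.
        attr_text = "".join(f" {name}={quoteattr(value or '')}" for name, value in attrs)
        self.out.append(f"<{tag}{attr_text}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close every element left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped; escape &, < and > again for XML.
        self.out.append(escape(data))

    def result(self):
        # Close whatever is still open at end of input.
        tail = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + tail


def repair_markup(text):
    repairer = XmlRepairer()
    repairer.feed(text)
    repairer.close()
    return repairer.result()


fixed = repair_markup("<order><note>Fish & chips</order>")
root = ET.fromstring(fixed)  # now accepted by a conformant parser
print(fixed)
```

Any such repair is a guess about what the producer meant, so the result should be checked against the data you expect before it is trusted.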

  3. Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible as what appears to be predictable often is not -- rule breaking is rarely bound by rules.

    • For invalid character errors, use regex to remove/replace invalid characters:

    • PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
    • Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
    • JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
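A Python counterpart to the character-class patterns above might look like the following sketch. The function name strip_illegal_xml_chars and its replacement parameter are illustrative, not taken from any library. The character class is the complement of the XML 1.0 Char production; unlike the PHP and Ruby patterns, it also keeps the supplementary planes (U+10000 to U+10FFFF), which XML 1.0 allows.

```python
import re

# Everything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text, replacement=""):
    """Remove (or replace) characters that XML 1.0 does not allow."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


cleaned = strip_illegal_xml_chars("<a>ok\x00\x1f</a>")
print(cleaned)
```

Passing replacement=" " mirrors the PHP and Ruby variants, which substitute a space instead of deleting the character.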

    For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):

    &(?!(?:#\d+|#x[0-9a-f]+|\w+);)
    

    Note that the above regular expressions won't take comments or CDATA sections into account.
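For illustration, the ampersand pattern could be applied in Python roughly as follows. The helper name escape_bare_ampersands is hypothetical, and the hex character class is widened to [0-9a-fA-F] on the assumption that uppercase hex digits in character references should be left alone too.

```python
import re

# An ampersand that does not start a decimal, hex, or named reference.
_BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)")


def escape_bare_ampersands(text):
    """Replace unescaped ampersands with &amp;, leaving references intact."""
    return _BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands("Fish & chips &amp; peas &#38; &#x26; &lt;"))
# The caveat in action: the ampersand inside CDATA is rewritten as well,
# which changes the content of the section.
print(escape_bare_ampersands("<![CDATA[a & b]]>"))
```

The second call shows why the caveat matters: inside a CDATA section a bare ampersand is legal, so escaping it alters the data.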