更新时间:2023-11-06 20:57:34
以下正则表达式将通过指定否定 XML 文档中整个有效 unicode 条目集的字符类来从 XML 中删除任何无效字符:
The following regex will remove any invalid characters from XML by specifying a character class negating the entire set of valid unicode entries in an XML document:
$rPattern = "[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000\x10FFFF]"
$xmlText -replace $rPattern,''
这很容易变成一个简单的函数:
function Repair-XmlString
{
[CmdletBinding()]
param(
[Parameter(Mandatory=$true,Position=0)]
[string]$inXML
)
# Match all characters that does NOT belong in an XML document
$rPattern = "[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000\x10FFFF]"
# Replace said characters with [String]::Empty and return
return [System.Text.RegularExpressions.Regex]::Replace($inXML,$rPattern,"")
}
然后做:
Repair-XmlString (Get-Content path\to\file.xml -Raw) |Set-Content path\to\file.xml