且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

如何使用 Powershell 从 XML 中删除特殊/坏字符

更新时间:2023-11-06 20:57:34

以下正则表达式将通过指定否定 XML 文档中整个有效 unicode 条目集的字符类来从 XML 中删除任何无效字符:

The following regex will remove any invalid characters from XML by specifying a character class negating the entire set of valid unicode entries in an XML document:

$rPattern = "[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000\x10FFFF]"
$xmlText -replace $rPattern,''

这很容易变成一个简单的函数:

function Repair-XmlString
{
  [CmdletBinding()]
  param(
    [Parameter(Mandatory=$true,Position=0)]
    [string]$inXML
  )

  # Match all characters that does NOT belong in an XML document
  $rPattern = "[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000\x10FFFF]"

  # Replace said characters with [String]::Empty and return
  return [System.Text.RegularExpressions.Regex]::Replace($inXML,$rPattern,"")
}

然后做:

Repair-XmlString (Get-Content path\to\file.xml -Raw) |Set-Content path\to\file.xml