Change PowerShell's default output encoding to UTF-8

Updated: 2022-12-22 15:12:21


By default, when you redirect the output of a command to a file or pipe it into something else in PowerShell, the encoding is UTF-16, which isn't useful. I'm looking to change it to UTF-8.

It can be done on a case-by-case basis by replacing the >foo.txt syntax with | out-file foo.txt -encoding utf8, but this is awkward to have to repeat every time.

The persistent way to set things in PowerShell is to put them in Users\me\Documents\WindowsPowerShell\profile.ps1; I've verified that this file is indeed executed on startup.

It has been said that the output encoding can be set with $PSDefaultParameterValues = @{'Out-File:Encoding' = 'utf8'} but I've tried this and it had no effect.

https://blogs.msdn.microsoft.com/powershell/2006/12/11/outputencoding-to-the-rescue/ which talks about $OutputEncoding looks at first glance as though it should be relevant, but then it talks about output being encoded in ASCII, which is not what's actually happening.

How do you set PowerShell to use UTF-8?

Note:

  • The next section applies primarily to Windows PowerShell.

  • In both cases, the information applies to making PowerShell use UTF-8 for reading and writing files.

    • By contrast, for information on how to send and receive UTF-8-encoded strings to and from external programs, see this answer.

  • In PSv5.1 or higher, where > and >> are effectively aliases of Out-File, you can set the default encoding for > / >> / Out-File via the $PSDefaultParameterValues preference variable:

    • $PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
    • Note:
      • In Windows PowerShell (the legacy edition whose latest and final version is v5.1), this invariably creates UTF-8 files with a (pseudo) BOM.

        • Many Unix-based utilities do not recognize this BOM (see bottom); see this post for workarounds that create BOM-less UTF-8 files.
      • In PowerShell (Core) v6+, BOM-less UTF-8 is the default (see the next section), but if you do want a BOM there, you can use 'utf8BOM'.

  • In PSv5.0 or below, you cannot change the encoding for > / >>, but, on PSv3 or higher, the above technique does work for explicit calls to Out-File.
    (The $PSDefaultParameterValues preference variable was introduced in PSv3.0).

  • In PSv3.0 or higher, if you want to set the default encoding for all cmdlets that support
    an -Encoding parameter
    (which in PSv5.1+ includes > and >>), use:

    • $PSDefaultParameterValues['*:Encoding'] = 'utf8'

If you place this command in your $PROFILE, cmdlets such as Out-File and Set-Content will use UTF-8 encoding by default, but note that this makes it a session-global setting that will affect all commands / scripts that do not explicitly specify an encoding via their -Encoding parameter.

Similarly, be sure to include such commands in any scripts or modules that you want to behave the same way, so that they indeed behave the same even when run by another user or on a different machine; however, to avoid a session-global change, use the following form to create a local copy of $PSDefaultParameterValues:

  • $PSDefaultParameterValues = @{ '*:Encoding' = 'utf8' }
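
As a rough sketch of both placements (the file name and sample string are illustrative only; remember that in Windows PowerShell, utf8 still implies a BOM, as noted above):

    # In $PROFILE: session-global default for every cmdlet that has an -Encoding parameter
    $PSDefaultParameterValues['*:Encoding'] = 'utf8'

    # At the top of a script or module: a local copy, so the calling session is not affected
    $PSDefaultParameterValues = @{ '*:Encoding' = 'utf8' }
    'Käse' | Out-File .\example.txt   # now written as UTF-8 instead of Windows PowerShell's UTF-16LE default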

For a summary of the wildly inconsistent default character encoding behavior across many of the Windows PowerShell standard cmdlets, see the bottom section.


The automatic $OutputEncoding variable is unrelated, and only applies to how PowerShell communicates with external programs (what encoding PowerShell uses when sending strings to them) - it has nothing to do with the encoding that the output redirection operators and PowerShell cmdlets use to save to files.
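
For illustration only (findstr.exe is just an arbitrary external program here):

    # $OutputEncoding governs only what PowerShell sends TO external programs via the pipeline
    $OutputEncoding = [System.Text.UTF8Encoding]::new()
    'Käse' | findstr.exe "K"      # findstr.exe now receives its stdin as UTF-8

    # It has no effect on file output; > / Out-File keep their own default encoding
    'Käse' > .\file.txt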


Optional reading: The cross-platform perspective: PowerShell Core:

PowerShell is now cross-platform, via its PowerShell Core edition, whose encoding - sensibly - defaults to BOM-less UTF-8, in line with Unix-like platforms.

  • This means that source-code files without a BOM are assumed to be UTF-8, and using > / Out-File / Set-Content defaults to BOM-less UTF-8; explicit use of the utf8 -Encoding argument also creates BOM-less UTF-8, but you can opt to create files with a pseudo-BOM by using the utf8BOM value (see the sketch after this list).

  • If you create PowerShell scripts with an editor on a Unix-like platform and nowadays even on Windows with cross-platform editors such as Visual Studio Code and Sublime Text, the resulting *.ps1 file will typically not have a UTF-8 pseudo-BOM:

    • This works fine on PowerShell Core.
    • It may break on Windows PowerShell, if the file contains non-ASCII characters; if you do need to use non-ASCII characters in your scripts, save them as UTF-8 with BOM.
      Without the BOM, Windows PowerShell (mis)interprets your script as being encoded in the legacy "ANSI" codepage (determined by the system locale for pre-Unicode applications; e.g., Windows-1252 on US-English systems).
  • Conversely, files that do have the UTF-8 pseudo-BOM can be problematic on Unix-like platforms, as they cause Unix utilities such as cat, sed, and awk - and even some editors such as gedit - to pass the pseudo-BOM through, i.e., to treat it as data.

    • This may not always be a problem, but definitely can be, such as when you try to read a file into a string in bash with, say, text=$(cat file) or text=$(<file) - the resulting variable will contain the pseudo-BOM as the first 3 bytes.
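
A minimal sketch of the utf8 vs. utf8BOM distinction in PowerShell (Core) v6+ (file names are illustrative):

    'héllo' | Out-File nobom.txt                       # BOM-less UTF-8 (the default)
    'héllo' | Out-File withbom.txt -Encoding utf8BOM   # UTF-8 with BOM

    # The first three bytes of the BOM variant are 0xEF 0xBB 0xBF (239 187 191)
    [System.IO.File]::ReadAllBytes("$PWD/withbom.txt")[0..2]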

Inconsistent default encoding behavior in Windows PowerShell:

Regrettably, the default character encoding used in Windows PowerShell is wildly inconsistent; the cross-platform PowerShell Core edition, as discussed in the previous section, has commendably put an end to this.

Note:

  • The following doesn't aspire to cover all standard cmdlets.

  • Googling cmdlet names to find their help topics now shows you the PowerShell Core version of the topics by default; use the version drop-down list above the list of topics on the left to switch to a Windows PowerShell version.

  • As of this writing, the documentation frequently incorrectly claims that ASCII is the default encoding in Windows PowerShell - see this GitHub docs issue.


Cmdlets that write:

Out-File and > / >> create "Unicode" - UTF-16LE - files by default, in which even ASCII-range characters are represented by 2 bytes; this notably differs from Set-Content / Add-Content (see next point). New-ModuleManifest and Export-CliXml also create UTF-16LE files.

Set-Content (and Add-Content if the file doesn't yet exist / is empty) uses ANSI encoding (the encoding specified by the active system locale's ANSI legacy code page, which PowerShell calls Default).
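
To see the difference directly in Windows PowerShell (file names are illustrative; byte values shown are for a US-English system):

    'abc' | Out-File    "$PWD\o.txt"    # "Unicode" (UTF-16LE): 2 bytes per character, plus BOM
    'abc' | Set-Content "$PWD\s.txt"    # ANSI: 1 byte per ASCII-range character, no BOM

    [System.IO.File]::ReadAllBytes("$PWD\o.txt")[0..3]   # 255 254 97 0  (FF FE BOM, then 'a')
    [System.IO.File]::ReadAllBytes("$PWD\s.txt")[0..2]   # 97 98 99      ('a' 'b' 'c')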

Export-Csv indeed creates ASCII files, as documented, but see the notes re -Append below.

Export-PSSession creates UTF-8 files with BOM by default.

New-Item -Type File -Value currently creates BOM-less(!) UTF-8.

The Send-MailMessage help topic also claims that ASCII encoding is the default - I have not personally verified that claim.

Start-Transcript invariably creates UTF-8 files with BOM, but see the notes re -Append below.

Re commands that append to an existing file:

>> / Out-File -Append make no attempt to match the encoding of a file's existing content. That is, they blindly apply their default encoding, unless instructed otherwise with -Encoding, which is not an option with >> (except indirectly in PSv5.1+, via $PSDefaultParameterValues, as shown above). In short: you must know the encoding of an existing file's content and append using that same encoding.
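
A sketch of the pitfall and the workaround in Windows PowerShell (log.txt is illustrative):

    'first'  | Out-File log.txt -Encoding utf8   # UTF-8 (with BOM) in Windows PowerShell
    'second' >> log.txt                          # appended as UTF-16LE - the file now mixes encodings

    # Append with the encoding the file already uses instead:
    'second' | Out-File log.txt -Append -Encoding utf8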

Add-Content is the laudable exception: in the absence of an explicit -Encoding argument, it detects the existing encoding and automatically applies it to the new content. Thanks, js2010. Note that in Windows PowerShell this means that it is ANSI encoding that is applied if the existing content has no BOM, whereas it is UTF-8 in PowerShell Core.

This inconsistency between Out-File -Append / >> and Add-Content, which also affects PowerShell Core, is discussed in this GitHub issue.

Export-Csv -Append partially matches the existing encoding: it blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/ANSI, but correctly matches UTF-16LE and UTF-16BE.
To put it differently: in the absence of a BOM, Export-Csv -Append assumes UTF-8, whereas Add-Content assumes ANSI.

Start-Transcript -Append partially matches the existing encoding: It correctly matches encodings with BOM, but defaults to potentially lossy ASCII encoding in the absence of one.


Cmdlets that read (that is, the encoding used in the absence of a BOM):

Get-Content and Import-PowerShellDataFile default to ANSI (Default), which is consistent with Set-Content.
ANSI is also what the PowerShell engine itself defaults to when it reads source code from files.
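
That means a BOM-less UTF-8 file must be read with an explicit encoding in Windows PowerShell (data.txt is illustrative):

    Get-Content .\data.txt                   # non-ASCII characters are misread as ANSI
    Get-Content .\data.txt -Encoding UTF8    # read correctly as UTF-8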

By contrast, Import-Csv, Import-CliXml and Select-String assume UTF-8 in the absence of a BOM.
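
Given these asymmetric defaults, one way to stay safe in Windows PowerShell is to specify the encoding on both ends of a round trip (a sketch; rows.csv is illustrative):

    $rows = @([pscustomobject]@{ Name = 'Köln' })
    $rows | Export-Csv .\rows.csv -NoTypeInformation -Encoding UTF8
    Import-Csv .\rows.csv -Encoding UTF8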