且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

使用 Encode::encode 和“utf8"

更新时间:2023-09-11 23:20:34

                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Outputs    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝

如果您无法查看上表,请单击此处

请注意,:encoding(UTF-8) 实际上使用 utf8 进行解码,然后检查结果字符是否在可接受的范围内.这减少了错误输入的错误消息数量,因此很好.

Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.

(编码名称不区分大小写.)

(Encoding names are case-insensitive.)

用于生成上表的测试:

  • :encoding(UTF-8)

$ printf "xC3xA9
xEFxBFxBF
xEDxA0x80
xF8x88x80x80x80
x80
" |
   perl -MB -nle'
      use open ":std", ":encoding(UTF-8)";
      my $sv = B::svref_2object($_);
      printf "%vX%s (internal: %vX, UTF8=%d)
", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
   '
utf8 "xFFFF" does not map to Unicode.
utf8 "xD800" does not map to Unicode.
utf8 "x200000" does not map to Unicode.
utf8 "x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = x80 (internal: 5C.78.38.30, UTF8=1)

  • :encoding(utf8)

    $ printf "xC3xA9
    xEFxBFxBF
    xEDxA0x80
    xF8x88x80x80x80
    x80
    " |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object($_);
          printf "%vX%s (internal: %vX, UTF8=%d)
    ", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = x80 (internal: 5C.78.38.30, UTF8=1)
    

  • :utf8

    $ printf "xC3xA9
    xEFxBFxBF
    xEDxA0x80
    xF8x88x80x80x80
    x80
    " |
       perl -MB -nle'
          use open ":std", ":utf8";
          my $sv = B::svref_2object($_);
          printf "%vX%s (internal: %vX, UTF8=%d)
    ", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)
    

    • :encoding(UTF-8)

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print "x{E9}
    ";
       print "x{FFFF}
    ";
       print "x{D800}
    ";
       print "x{20_0000}
    ";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "x{ffff}" does not map to utf8.
    "x{d800}" does not map to utf8.
    "x{200000}" does not map to utf8.
    
    $ od -t c a
    0000000 303 251  
          x   {   F   F   F   F   }  
          x   {   D
    0000020   8   0   0   }  
          x   {   2   0   0   0   0   0   }  
    
    0000040
    
    $ cat a
    é
    x{FFFF}
    x{D800}
    x{200000}
    

  • :encoding(utf8)

    $ perl -e'
       use open ":std", ":encoding(utf8)";
       print "x{E9}
    ";
       print "x{FFFF}
    ";
       print "x{D800}
    ";
       print "x{20_0000}
    ";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
    
    $ od -t c a
    0000000 303 251  
     355 240 200  
     370 210 200 200 200  
    
    0000015
    
    $ cat a
    é
    ▒
    ▒
    

  • :utf8

    结果与 :encoding(utf8) 相同.

    使用 Perl 5.26 测试.

    Tested using Perl 5.26.

    Encode::encode 默认会用替换字符替换无效字符.即使您将更宽松的utf8"作为编码传递也是如此吗?

    Perl 字符串是 32 位或 64 位字符的字符串,具体取决于构建.utf8 可以编码任何 72 位整数.因此,它能够编码所有可以被要求编码的字符.

    Perl strings are strings of 32-bit or 64-bit characters depending on the build. utf8 can encode any 72-bit integer. It is therefore capable of encoding all characters it can be asked to encode.