Converting byte codes to Unicode

Updated: 2023-02-25 22:55:30

Try this:

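x <- "bi<df>chen Z<fc>rcher hello world <c6>"

# Locate every "<hh>" placeholder (two lowercase hex digits)
m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
# Turn each placeholder into the single byte it names
chars <- lapply(codes, function(code) {
    rawToChar(as.raw(strtoi(substr(code, 2, 3), base = 16L)), multiple = TRUE)
})
regmatches(x, m) <- chars
x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"
# Declare the bytes as latin1 so they print as the intended characters
Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"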

Note that you can't make an escaped character by pasting "\x" onto the front of a number. The "\x" isn't in the string at all; it's just how R chooses to display that byte on screen. Here we use rawToChar() to turn a number into the character we want.
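
One way to check that the backslash escape is only a display convention is to inspect the raw bytes. The following is a minimal sketch using base R's charToRaw() and nchar(); the variable names s and p are illustrative and not part of the original answer.

```r
# Build a one-byte string from the numeric code 0xdf.
s <- rawToChar(as.raw(0xdf))

# The string holds exactly one byte; no backslash or "x" is stored.
nchar(s, type = "bytes")   # 1
charToRaw(s)               # df

# Pasting a literal backslash and "x" in front of the digits
# just produces four ordinary characters.
p <- paste0("\\x", "df")
nchar(p, type = "bytes")   # 4
charToRaw(p)               # 5c 78 64 66
```

The first string is the single byte we want; the second is just the text backslash, x, d, f.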

I tested this on a Mac, so I had to set the encoding to "latin1" to see the correct symbols in the console. A single byte like that isn't valid UTF-8 on its own, because UTF-8 encodes every character above 0x7f as a multi-byte sequence.
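
If real UTF-8 is needed rather than a latin1-marked string, the bytes have to be re-encoded, not just relabelled. The sketch below is a possible follow-up and not part of the original answer. It assumes the input bytes really are latin1 and uses base R's iconv(). It also shows intToUtf8() as an alternative that works because latin1 byte values coincide with the first 256 Unicode code points.

```r
# A latin1 byte for "ß" (0xdf), declared as latin1.
s <- rawToChar(as.raw(0xdf))
Encoding(s) <- "latin1"

# iconv() rewrites the bytes; Encoding<- only changes the label.
u <- iconv(s, from = "latin1", to = "UTF-8")

charToRaw(s)   # df      (one byte, latin1)
charToRaw(u)   # c3 9f   (two bytes, UTF-8)
Encoding(u)    # "UTF-8"

# Alternative: treat the hex value as a Unicode code point directly.
v <- intToUtf8(strtoi("df", base = 16L))
charToRaw(v)   # c3 9f
```

Applied to the full example, iconv(x, from = "latin1", to = "UTF-8") after the regmatches() replacement would yield a UTF-8 string instead of a latin1-marked one.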