且构网

UTF-8 decoding in Java

Updated: 2022-10-19 15:57:49

I'm trying to pass parameters from a PHP middle tier to a Java backend that understands J2EE. I'm writing the controller code in Groovy. In there, I'm trying to decode a parameter that will likely contain international characters.

I am really puzzled by the results of my debugging this problem so far, hence I wanted to share it with you in the hope that someone will be able to give the correct interpretation of my results.

For the sake of my little test, the parameter I'm passing is "déjeuner". Just to be sure, System.out.println("déjeuner") correctly gives me:

déjeuner

in the console

Here are the char, decimal, and hex values of each char of the original string:

next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

Note that the sequence c3 a9 is the UTF-8 encoding of the character I want: http://www.fileformat.info/info/unicode/char/00e9/index.htm

Now if I try to read this string as a UTF-8 string, as in stmt.getBytes("UTF-8"), I suddenly end up with an 11-byte sequence, as follows:

64 c3 83 c2 a9 6a 65 75 6e 65 72

whereas stmt.getBytes("iso-8859-1") gives me 9 bytes:

64 c3 a9 6a 65 75 6e 65 72

Note the c3 a9 sequence here!

Now if I encode the string as UTF-8 and decode the result as UTF-8 again, as in

new String(stmt.getBytes("UTF-8"), "UTF-8");

I get:

next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

Note the c3 a9 sequence,

while

new String(stmt.getBytes("iso-8859-1"), "UTF-8")

results in:

next char: d 100 64
next char: ? -23 e9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

Note the e9, which in UTF-8 (and ASCII) is, again, the 'é' character that I'm longing for.

Unfortunately, in neither case am I ending up with a proper string that would display like the literal string "déjeuner". Strangely enough, the byte sequences both seem correct though.

When dealing with Strings, always remember: byte != char. In your first example you have the char c3 (U+00C3, Ã), not the byte c3, and that is a huge difference: the byte would be part of a UTF-8 sequence, but the char is already a Unicode character. So when you encode that String as UTF-8, the Unicode character c3 must become the byte sequence c3 83 (and likewise the char a9 becomes c2 a9), which is where your 11 bytes come from.
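To make the byte/char distinction concrete, here is a minimal sketch that rebuilds the String from the question out of chars and encodes it. The class name `CharVsByte` and the helper `hex` are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

final class CharVsByte {
    // Formats bytes as space-separated lowercase hex, e.g. "64 c3 83".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Nine chars: the chars U+00C3 and U+00A9, not the bytes c3 and a9.
        String stmt = "d\u00C3\u00A9jeuner";

        System.out.println(stmt.length()); // 9
        // Each of the two non-ASCII chars needs two bytes in UTF-8.
        System.out.println(hex(stmt.getBytes(StandardCharsets.UTF_8)));
        // 64 c3 83 c2 a9 6a 65 75 6e 65 72
    }
}
```

The 11-byte result matches the sequence shown in the question, which suggests the String already held the two chars Ã and © before `getBytes("UTF-8")` was called.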

So the question is: How did you get the String? There must be a bug in that code which doesn't properly handle UTF-8 encoded byte sequences.
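The question does not show where the String comes from, so the following is only a guess at the kind of bug involved: some layer receives the UTF-8 bytes of "déjeuner" and decodes them with a single-byte charset. `WrongCharsetDecode`, `decodeWrong` and `decodeRight` are hypothetical names standing in for that layer.

```java
import java.nio.charset.StandardCharsets;

final class WrongCharsetDecode {
    // Hypothetical buggy layer: treats every incoming byte as one char.
    static String decodeWrong(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // What the layer should do if the sender really uses UTF-8.
    static String decodeRight(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "déjeuner" as it travels over the wire in UTF-8 (9 bytes).
        byte[] raw = {0x64, (byte) 0xc3, (byte) 0xa9, 0x6a, 0x65, 0x75, 0x6e, 0x65, 0x72};

        String wrong = decodeWrong(raw);
        String right = decodeRight(raw);

        System.out.println(wrong.length()); // 9: the bytes c3 and a9 became two chars
        System.out.println(right.length()); // 8: c3 a9 became the single char U+00E9
    }
}
```

If the real cause is of this kind, the fix belongs at the point where the bytes are first decoded, not in later re-encoding.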

The reason ISO-8859-1 usually works is that this encoding maps every char with a code point < 256 (i.e. anything between 0 and 255) to the single byte with the same value, so a String whose chars are really UTF-8 byte values is turned back into exactly those bytes.
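As a sketch of why that round trip works, the snippet below turns the mangled String back into its original bytes with ISO-8859-1 and then decodes them as UTF-8. `Latin1RoundTrip` and `repair` are illustrative names, and this is a workaround rather than a fix for the underlying bug. It also assumes every char in the input is below 256; anything else cannot be encoded in ISO-8859-1 and is replaced during encoding.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    // Recovers a String whose chars are really UTF-8 byte values.
    static String repair(String mangled) {
        // ISO-8859-1 maps chars 0..255 to bytes with the same value.
        byte[] originalBytes = mangled.getBytes(StandardCharsets.ISO_8859_1);
        // Those bytes are the UTF-8 encoding of the intended text.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = "d\u00C3\u00A9jeuner"; // 9 chars, as in the question
        String repaired = repair(mangled);

        System.out.println(repaired.length());               // 8
        System.out.println(repaired.equals("d\u00E9jeuner")); // true
    }
}
```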

Your last example is also wrong: the char e9 is é in ISO-8859-1 and in Unicode. It is not UTF-8, though: it is a char rather than a byte, and as a byte it would be invalid UTF-8 on its own, since é is encoded as c3 a9 and the leading c3 is missing. That said, the String correctly represents the Unicode text you seek.
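The difference between the char e9 and the byte e9 can be checked with a strict decoder. In this sketch, `LoneE9` and `isValidUtf8` are illustrative names; the decoder setup uses the standard `CharsetDecoder` API with `CodingErrorAction.REPORT`, so malformed input raises an exception instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LoneE9 {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String wanted = "d\u00E9jeuner"; // 8 chars, the text the asker is after

        byte[] utf8 = wanted.getBytes(StandardCharsets.UTF_8);        // é becomes c3 a9
        byte[] latin1 = wanted.getBytes(StandardCharsets.ISO_8859_1); // é becomes e9

        System.out.println(utf8.length);         // 9
        System.out.println(latin1.length);       // 8
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: e9 is not followed by continuation bytes
    }
}
```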