UTF-8 encoding error when reading a CSV in Ruby

Updated: 2022-11-22 11:54:13



This is what I was doing:

csv = CSV.open(file_name, "r")

I used this for testing:

line = csv.shift
while not line.nil?
  puts line
  line = csv.shift
end

And I ran into this:

ArgumentError: invalid byte sequence in UTF-8

I read the answer here, and this is what I tried:

csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")

I ran into the following error:

Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8

Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.

CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}

So I did this:

csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")

And still got this:

Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8

It looks like you have a problem detecting the correct encoding of your file. CharlockHolmes gives you a useful hint with :confidence=>37, which simply means the detected encoding may not be the right one.

Based on the error messages and test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb, I found an encoding in which both of the bytes from your error messages are defined. With the help of String#encode it is easy to test:

"\x8F\x98".encode("UTF-8", "cp1256") # => "ڈک"

Your issue looks strictly related to the file, not to Ruby.
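
The same check can be run over several candidate encodings at once. The sketch below is a minimal illustration rather than part of the original answer: the candidate list and the helper name decodable? are assumptions chosen for this example, and only String#encode from core Ruby is used.

```ruby
# Bytes taken from the two error messages in the question.
bytes = "\x8F\x98".b

# Hypothetical shortlist of encodings to try; extend as needed.
candidates = %w[Windows-1250 Windows-1251 Windows-1252 Windows-1256]

# Returns true when the whole byte string converts without an error,
# i.e. every byte maps to a character in the named encoding.
def decodable?(bytes, encoding_name)
  bytes.encode("UTF-8", encoding_name)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

matches = candidates.select { |name| decodable?(bytes, name) }
puts matches.inspect
```

An encoding that survives this test is only a candidate. Whether the decoded text is actually meaningful still has to be checked by eye.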

If we are not sure which encoding to use and can accept losing some characters, we can use the :invalid and :undef options of String#encode, in this case:

"\x8F\x98".encode("UTF-8", "CP1250", :invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
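
Applied to the CSV from the question, one possible approach is to read the file as raw bytes, transcode with replacement, and only then hand the text to the CSV parser. This is a sketch under stated assumptions: the sample data and the temporary file are made up so that the example is self-contained, and CP1250 is used only because it is the encoding in the line above, not because it is known to be the right one for the real file.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample data: a header row plus one row holding the two bytes
# from the error messages.
sample = "id,name\n1,\x8F\x98\n".b

rows = nil
Tempfile.create("sample") do |file|
  file.binmode
  file.write(sample)
  file.flush

  # Read the raw bytes, then transcode with replacement instead of raising.
  raw = File.binread(file.path)
  utf8 = raw.encode("UTF-8", "CP1250",
                    invalid: :replace, undef: :replace, replace: "?")
  rows = CSV.parse(utf8)
end

puts rows.inspect
```

With a real file, File.binread(file_name) would take the place of the temporary file. Every "?" in the output marks a byte that had no mapping in the chosen source encoding.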

Another way is to use Iconv with the //IGNORE option on the target encoding:

Iconv.iconv("UTF-8//IGNORE", "CP1250", "\x8F\x98")
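
Iconv was removed from the standard library in Ruby 2.0 and is now only available as a separate gem. On a current Ruby, a similar "drop what cannot be converted" effect can be sketched with String#encode alone by passing an empty replacement string. This approximates //IGNORE and is not the Iconv call itself.

```ruby
# Same bytes as above, marked as binary.
bytes = "\x8F\x98".b

# An empty replacement string drops unconvertible bytes instead of
# substituting a placeholder character.
dropped = bytes.encode("UTF-8", "CP1250",
                       invalid: :replace, undef: :replace, replace: "")

puts dropped
```

Dropping bytes silently hides how much data was lost, so a visible placeholder such as "?" is often easier to audit afterwards.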

As for the source encoding, the suggestion from CharlockHolmes should normally be pretty good.

PS. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv.