Updated: 2022-11-22 11:54:13
This is what I was doing:
csv = CSV.open(file_name, "r")
I used this for testing:
line = csv.shift
while not line.nil?
puts line
line = csv.shift
end
And I ran into this:
ArgumentError: invalid byte sequence in UTF-8
I read the answer here and this is what I tried:
csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")
I ran into the following error:
Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8
Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.
CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}
So I did this:
csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")
And still got this:
Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8
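Both failures can be reproduced without the file, using only the bytes named in the error messages. The sketch below is a minimal illustration; `convertible?` is a hypothetical helper name, not part of Ruby or CSV.

```ruby
# Hypothetical helper: true if every byte in `bytes` has a mapping
# from `source_encoding` to UTF-8, false if the conversion raises.
def convertible?(bytes, source_encoding)
  bytes.encode("UTF-8", source_encoding)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

# The byte from the first error message, read as Windows-1251.
puts convertible?("\x98", "Windows-1251")
# The byte from the second error message, read as Windows-1252.
puts convertible?("\x8F", "Windows-1252")
```

Each encoding rejects a different byte, which suggests that neither one is the file's real encoding.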
It looks like the problem is detecting the correct encoding of your file. CharlockHolmes gives you a useful hint with :confidence=>37, which simply means the detected encoding may not be the right one.
Based on the error messages and test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb I found an encoding that accepts the bytes from both of your error messages. With the help of String#encode it's easy to test:
"\x8F\x98".encode("UTF-8","cp1256") # => "ڈک"
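The same check can be run over several candidate encodings at once. This is only a sketch: the candidate list and the helper name `encodings_accepting` are assumptions chosen for illustration, and an encoding that accepts the bytes is not necessarily the one the file was written in.

```ruby
# Assumed list of candidates; extend it with whatever encodings are plausible.
CANDIDATES = %w[Windows-1250 Windows-1251 Windows-1252 Windows-1256]

# Hypothetical helper: returns the candidates that can convert every
# byte of `bytes` to UTF-8 without raising.
def encodings_accepting(bytes, candidates = CANDIDATES)
  candidates.select do |name|
    begin
      bytes.encode("UTF-8", name)
      true
    rescue EncodingError
      false
    end
  end
end

# The two problem bytes from the error messages, as a binary string.
p encodings_accepting("\x8F\x98".b)
```

With real data, pass a larger sample such as `File.binread(file_name)` and then inspect the converted text by eye, since some encodings accept almost any byte.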
Your issue appears to be strictly related to the file, not to Ruby.
If we are not sure which encoding to use and can accept losing some characters, we can use the :invalid and :undef options of String#encode, in this case:
"\x8F\x98".encode("UTF-8", "CP1250",:invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
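Applied to the CSV case, one option is to read the raw bytes, transcode them with replacement, and only then hand the text to CSV. A minimal sketch, assuming the file fits in memory; `parse_csv_lossy` is a hypothetical helper, and CP1250 is just the encoding from the example above, not a confirmed source encoding.

```ruby
require "csv"

# Hypothetical helper: transcode raw bytes to UTF-8, replacing anything
# the assumed source encoding cannot map, then parse the result as CSV.
def parse_csv_lossy(raw_bytes, source_encoding)
  utf8 = raw_bytes.encode("UTF-8", source_encoding,
                          invalid: :replace, undef: :replace, replace: "?")
  CSV.parse(utf8)
end

# In-memory sample containing the two problem bytes.
rows = parse_csv_lossy("a,\x8F\x98\nb,c\n".b, "CP1250")
p rows
```

For a real file the input would be `File.binread(file_name)`. Every "?" in the output marks a character that was lost.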
Another way is to use Iconv with the //IGNORE option on the target encoding:
Iconv.iconv("UTF-8//IGNORE","CP1250", "\x8F\x98")
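Iconv was removed from the standard library in Ruby 2.0 and is only available as a separate gem. On newer Rubies, a similar "drop what cannot be converted" effect can be sketched with String#encode and an empty replacement string. `to_utf8_dropping` is a hypothetical helper name.

```ruby
# Hypothetical helper: convert to UTF-8 and silently drop any byte that
# is invalid or has no mapping in the assumed source encoding.
def to_utf8_dropping(bytes, source_encoding)
  bytes.encode("UTF-8", source_encoding,
               invalid: :replace, undef: :replace, replace: "")
end

p to_utf8_dropping("\x8F\x98".b, "CP1250")
```

Dropping bytes silently makes data loss hard to notice later, so a visible replacement character is often the safer choice.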
As the source encoding, the suggestion from CharlockHolmes should be pretty good.
PS. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv.