Opening ISO-8859-1 encoded HTML with Nokogiri messes up the accents

Updated: 2022-10-29 10:46:44


I'm trying to make some changes to an HTML page encoded with charset=iso-8859-1:

doc = Nokogiri::HTML(open(html_file))

Calling puts doc.to_html messes up all the accents in the page, so when I save it back it looks broken in the browser as well.

I'm still on Rails 3.0.6... Any hints how to fix this problem?

For example, here is one of the pages suffering from this: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

I've also asked on GitHub, but I have the feeling this will be faster. I'll update both places if I find a cure for the problem.

UPDATE 1 24 March 2012

Thanks for the comments. I managed to partially solve this issue. I believe it has nothing to do with Nokogiri, however. As I mentioned in a comment, just opening and saving the file is enough to mess up the accents.

The closest to a fix I got is doing this:

thefile = File.open(html_file, "r")
text = thefile.read
doc = Nokogiri::HTML(text)
# ... do any stuff with Nokogiri
File.open(html_file, 'w') { |f| f.write(doc.to_html) }

The original file came in as iso-8859-1; the saved one goes out as utf-8 and pretty much looks OK. Accents are in place, except for the accents on capital letters :-P There I get question marks, as in Econom�a, where there should be an í (i with an accent).

Getting closer, I think. If someone has a hint for handling the capital letters as well, this might be almost done.

Cheers.

The method you used to download the file may have changed the encoding, breaking the accents in the file. Try this to see it working correctly:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = 'http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html'
doc = Nokogiri::HTML(open(url)) # on Ruby 3.0+, call URI.open(url) instead; plain open no longer fetches URLs
File.open("1331108705.html", "w") {|f| f.write(doc.to_html)}
system('open', '1331108705.html') # on Mac OS X, this will open the html file in your browser

How did you download the file?
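
If the file is already on disk, it may also be worth checking how it gets read. The following is only a sketch of one possible cause, not a confirmed diagnosis: it assumes the bytes on disk really are ISO-8859-1 and that Ruby is treating them as UTF-8 by default. The helper name read_latin1_as_utf8 and the sample file are made up for illustration. It uses plain Ruby (1.9 or later) and no Nokogiri, so it shows only the byte-level effect.

```ruby
require 'tempfile'

# Hypothetical helper (not from the original post): read a file whose bytes
# are ISO-8859-1 and return its contents transcoded to UTF-8.
def read_latin1_as_utf8(path)
  # "ISO-8859-1:UTF-8" declares the external (on-disk) encoding and the
  # internal encoding, so Ruby transcodes while reading.
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

# Build a small ISO-8859-1 sample file: byte 0xED is "í" in that charset.
sample = Tempfile.new(['latin1', '.html'])
sample.binmode
sample.write("<p>Econom\xEDa</p>".b)
sample.close

# Treating the same bytes as UTF-8 yields an invalid string, which is the
# kind of input that ends up displayed as a replacement question mark.
raw = File.binread(sample.path).force_encoding('UTF-8')
puts "valid as UTF-8? #{raw.valid_encoding?}"

# Declaring the source encoding gives a proper UTF-8 string instead.
utf8 = read_latin1_as_utf8(sample.path)
puts utf8
```

Nokogiri's HTML parser also accepts an explicit encoding as its third argument, as in Nokogiri::HTML(text, nil, 'ISO-8859-1'), which may be an alternative when the text is handed over as a string.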