且构网


How to load a saved Web page without breaking the page encoding?

Updated: 2023-09-10 17:57:04

Russian and some other Web sites have a problem that cannot be solved in a really reliable way for certain pages or sites. This is not too bad: a good Web browser can even auto-detect the encoding, but that is not reliable either. First of all, it does not guarantee correct rendering without some trial and error. In pathological cases there is real ambiguity, so it is not possible to be 100% sure about the encoding.

You see, a page is merely an array of bytes. When you save it, you save it as is. The encoding of the page can come from three different sources:

  1. For Unicode UTFs, an HTML file can start with a BOM (byte order mark). See:
    http://unicode.org/, http://unicode.org/faq/utf_bom.html. This is not required, because of the two other ways. If a BOM is used, it should not contradict the charset declared in the HTML text or in the HTTP header, see below.
  2. HTTP equivalent. It is placed inside the <head> element and looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

  3. HTTP header. The charset information can come in the HTTP Content-Type header of the response.
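The three sources listed above can be checked one by one. The following is only a minimal Python sketch, not a full encoding sniffer: the function names `charset_from_bom`, `charset_from_meta` and `charset_from_http_header` are made up for illustration, and the meta-tag search is a simple regular expression that assumes the declaration appears in the first few kilobytes of an ASCII-compatible page.

```python
import codecs
import re
from email.message import Message

# BOM signatures of the Unicode UTFs. UTF-32 comes first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Finds charset=... inside a <meta> tag (the http-equiv form shown above,
# and also the shorter <meta charset="..."> form).
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def charset_from_bom(data: bytes):
    """Source 1: a BOM at the very start of the file."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def charset_from_meta(data: bytes):
    """Source 2: the meta declaration in the HTML text itself."""
    match = META_CHARSET.search(data[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def charset_from_http_header(content_type: str):
    """Source 3: the charset parameter of an HTTP Content-Type value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # None when no charset is given


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head></html>'
print(charset_from_bom(page))
print(charset_from_meta(page))
print(charset_from_http_header("text/html; charset=windows-1251"))
```

Only the first two sources survive in a saved file; the third exists only while the page is being served over HTTP.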


    Now, some Russian sites use obsolete encodings like Windows CP-1251 or KOI8-R. This is really bad, but it can be acceptable if the charset is correctly declared in the file and if the page languages are only Russian plus basic Latin (no European characters beyond ASCII). In this case, all no-nonsense browsers render the page correctly. The problem appears when none of the three ways is used, or when they contradict each other. Unfortunately, some sites are like that, even these days. Too bad.
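To see why a missing or contradictory declaration is a real problem, here is a small Python sketch of the ambiguity. The sample phrase and the helper name `decode_candidates` are chosen only for illustration. The same bytes are accepted without error by both CP-1251 and KOI8-R, so nothing in the bytes themselves says which reading is the intended one.

```python
def decode_candidates(raw: bytes, encodings):
    """Return {encoding: decoded text} for every encoding that accepts the bytes."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # this encoding rejects the byte sequence outright
    return results


original = "Привет, мир"
raw = original.encode("cp1251")  # the bytes as an old site would serve them

candidates = decode_candidates(raw, ["cp1251", "koi8_r", "utf-8"])
for name, text in candidates.items():
    print(name, "->", text)
```

UTF-8 rejects these bytes, which is why it is comparatively easy to rule out, but both single-byte encodings produce Cyrillic letters, and only one of the two results is readable Russian. Auto-detection has to guess between them from letter statistics.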

    Now we're coming to the problem of saved files. Some Web sites use only method #3. How do they do it? Usually, an HTTP server has an option like "default charset". If a file has no indication of the charset, the HTTP response comes with an automatically generated header carrying the default charset. This is really bad, as it does not allow for several languages on a page unless the charset is a Unicode UTF (practically, only UTF-8 should be used). The creators of such sites think they need to save some disk space and traffic at the expense of this limitation, and they really do save some: Cyrillic text takes more space in Unicode, even in the most economical UTF-8, where each Cyrillic letter takes two bytes instead of one. Still, this works when you simply view the Web page online.
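The space argument is easy to check. This short Python sketch, with sample strings chosen only for illustration, compares the encoded size of the same text in the single-byte encodings and in UTF-8:

```python
def encoded_sizes(text: str, encodings=("cp1251", "koi8_r", "utf-8")):
    """Return {encoding: number of bytes} for the given text."""
    return {name: len(text.encode(name)) for name in encodings}


cyrillic = encoded_sizes("Привет")  # six Cyrillic letters
latin = encoded_sizes("Hello")      # five ASCII letters

print(cyrillic)
print(latin)
```

The six Cyrillic letters take 6 bytes in CP-1251 or KOI8-R and 12 bytes in UTF-8, while the ASCII word takes 5 bytes in all three. Markup is mostly ASCII, so the saving on a whole page is smaller than the factor of two on the Russian text alone.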

    The problems happen when one saves the file to a local disk. If only method #3 is used, the charset information is lost. Sorry, blame the authors of those sites. I would recommend identifying such situations and fixing the saved file by adding an "http-equiv" meta tag in the head, as in method #2.
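The fix recommended here could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a robust HTML rewriter: the function name `add_charset_meta` is made up, the charset must already be known (for example, noted from the HTTP header while the page was still online), and the page is assumed to be in an ASCII-compatible encoding such as CP-1251, KOI8-R or UTF-8, so the tag can be inserted at the byte level without decoding anything.

```python
import re

# Opening <head> tag, with or without attributes (but not <header>).
HEAD_OPEN = re.compile(rb"<head(\s[^>]*)?>", re.IGNORECASE)
# Any existing charset declaration inside a <meta> tag.
META_CHARSET = re.compile(rb"<meta[^>]*?charset\s*=", re.IGNORECASE)


def add_charset_meta(data: bytes, charset: str) -> bytes:
    """Insert an http-equiv charset declaration right after <head>.

    The bytes are returned unchanged if a declaration is already present
    or if no <head> tag can be found.
    """
    if META_CHARSET.search(data):
        return data
    match = HEAD_OPEN.search(data)
    if match is None:
        return data
    meta = (
        b'<meta http-equiv="Content-Type" content="text/html; charset='
        + charset.encode("ascii")
        + b'" />'
    )
    return data[: match.end()] + b"\n" + meta + data[match.end():]


# A page saved from a site that declared its charset only in the HTTP header.
saved = "<html><head><title>Тест</title></head><body>Привет</body></html>".encode("cp1251")
fixed = add_charset_meta(saved, "windows-1251")
print(fixed.decode("cp1251"))
```

For a file on disk, the same function would be applied to the result of reading the file in binary mode, and the returned bytes written back. The rest of the page is left byte-for-byte as it was saved.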

    This is not just a problem of Russian sites, of course, but I observed it mostly on Russian sites. These days, the situation is getting better.

    —SA