
HTML encoding issues - "Â" character showing up instead of "&nbsp;"

Updated: 2023-02-25 14:34:54



I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.

The process works like this:

  1. Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
  2. Replace the tokens with real data
  3. Tidy the HTML with a simple regex function that properly formats HTML tag attribute values (ensures quotation marks, etc., since ActivePDF's rendering engine hates anything but single quotes around attribute values)
  4. Send off the HTML to a web service that creates the PDF.

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (Firefox). ActivePDF pukes on these non-UTF8 characters.

My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it doesn't change anything.

Private Shared Function ConvertToUTF8(ByVal html As String) As String
    Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Dim source As Byte() = isoEncoding.GetBytes(html)
    Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function

Any ideas?

EDIT:

I'm getting by with this for now, though it hardly seems like a good solution:

Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
    Return Regex.Replace(html, "[^\u0000-\u007F]", "&nbsp;")
End Function

"Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character"

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1, comes out as "Â " (an Â followed by a non-breaking space). That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
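To see the byte-level effect described above, here is a minimal console sketch. The module name `MojibakeDemo` is made up for illustration and nothing in it comes from the original app. It prints code points rather than the characters themselves so the result does not depend on the console's own encoding.

```vb
Imports System
Imports System.Text

Module MojibakeDemo
    Sub Main()
        ' U+00A0 NO-BREAK SPACE, the character that &nbsp; stands for.
        Dim nbsp As String = ChrW(&HA0)
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")

        ' One byte in ISO-8859-1, two bytes in UTF-8.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes(nbsp)))   ' A0
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(nbsp)
        Console.WriteLine(BitConverter.ToString(utf8Bytes))               ' C2-A0

        ' Reading those two UTF-8 bytes as ISO-8859-1 yields two characters:
        ' U+00C2 (Â) followed by U+00A0 (the non-breaking space again).
        Dim misread As String = latin1.GetString(utf8Bytes)
        For Each ch As Char In misread
            Console.WriteLine("U+" & AscW(ch).ToString("X4"))             ' U+00C2, then U+00A0
        Next

        ' Encode as ISO-8859-1, convert to UTF-8, decode as UTF-8:
        ' the string comes back unchanged.
        Dim roundTrip As String = Encoding.UTF8.GetString(
            Encoding.Convert(latin1, Encoding.UTF8, latin1.GetBytes(nbsp)))
        Console.WriteLine(roundTrip = nbsp)                               ' True
    End Sub
End Module
```

The last step mirrors the shape of the ConvertToUTF8 function in the question. A .NET String is already decoded text, so an encode, convert, decode round trip like this returns the same string for any character ISO-8859-1 can represent, which is consistent with the function appearing to do nothing. The encoding only matters at the point where the string is turned into bytes, or where bytes are read back in.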

What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
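As a rough illustration of "keeping non-ASCII characters as character references", the sketch below does this at the string level rather than through a DOM serialiser, so it is a stand-in for the idea and not the parser-based approach itself. The helper name `ToAsciiWithCharRefs` is hypothetical. It assumes the input string has already been decoded correctly.

```vb
Imports System
Imports System.Text

Module CharRefDemo
    ' Hypothetical helper: returns html with every non-ASCII character
    ' replaced by a decimal numeric character reference such as &#160;.
    Function ToAsciiWithCharRefs(ByVal html As String) As String
        Dim sb As New StringBuilder(html.Length)
        Dim i As Integer = 0
        While i < html.Length
            Dim c As Char = html(i)
            If AscW(c) < 128 Then
                ' Plain ASCII passes through untouched.
                sb.Append(c)
                i += 1
            ElseIf Char.IsHighSurrogate(c) AndAlso i + 1 < html.Length AndAlso
                   Char.IsLowSurrogate(html(i + 1)) Then
                ' Characters outside the BMP occupy two Chars; emit one reference.
                sb.Append("&#").Append(Char.ConvertToUtf32(c, html(i + 1))).Append(";")
                i += 2
            Else
                ' Note: an unpaired surrogate would also land here and is not handled specially.
                sb.Append("&#").Append(AscW(c)).Append(";")
                i += 1
            End If
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        ' "Caf", e-acute (U+00E9), non-breaking space (U+00A0), "ok"
        Dim sample As String = "Caf" & ChrW(&HE9) & ChrW(&HA0) & "ok"
        Console.WriteLine(ToAsciiWithCharRefs(sample))   ' Caf&#233;&#160;ok
    End Sub
End Module
```

Unlike the ReplaceNonASCIIChars workaround in the question, which turns every non-ASCII character into &nbsp;, this keeps each character's identity. Because the output is pure ASCII, it reads the same whether a downstream consumer treats it as ISO-8859-1 or UTF-8.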

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.