且构网

How do I embed an uploaded binary file (ASCII-8BIT) in XML (UTF-8)?

Updated: 2023-11-27 12:49:22

I have a file which is uploaded via a regular form_for. This gives me an ActionDispatch::Http::UploadedFile object in the params hash, on which I can call .read to get the content. I now need to embed the file in an XML document, and I'm using a regular Ruby string for now to construct the XML. The default encoding for a Rails string is UTF-8.

As a result, I get the error Encoding::UndefinedConversionError: "\x89" from ASCII-8BIT to UTF-8.

This happens for the following files:

what-matters-now-1.pdf: application/octet-stream; charset=binary
example.csv: text/plain; charset=utf-8
investigations.png: image/png; charset=binary

It does not happen for:

my_test.txt: text/plain; charset=us-ascii

I have tried changing the encoding, but I get the same error:

params[:file].read.encode('utf-8')

First, you cannot embed a binary file in an XML document without some sort of conversion to text. At least the PDF document and the PNG image need to be encoded somehow - probably Base64 - before you start trying to treat their contents as strings of characters instead of sequences of bytes.

The UndefinedConversionError indicates that you're trying to convert to UTF-8 from ASCII-8BIT, which is the label Ruby gives to raw binary data; only bytes in the ASCII range can be carried over as-is. The source includes a byte whose value is 0x89 (137 decimal), which is outside the ASCII range. That is not at all unexpected if the source file is binary (0x89 happens to be the first byte of every PNG file), and Base64-encoding it will fix that problem.
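As a rough sketch of the Base64 approach: the `<attachment>` element and the `embed_as_base64` helper below are made up for illustration, and the literal PNG signature stands in for whatever `params[:file].read` returns in your app.

```ruby
require 'base64'

# Hypothetical helper: wraps raw bytes in a made-up <attachment> element.
def embed_as_base64(bytes)
  encoded = Base64.strict_encode64(bytes) # ASCII-only text, no line breaks
  %(<attachment encoding="base64">#{encoded}</attachment>)
end

# Stand-in for params[:file].read: the 8-byte PNG signature, tagged ASCII-8BIT.
png_bytes = "\x89PNG\r\n\x1A\n".b

# Converting the raw bytes directly fails the same way as in the question.
begin
  png_bytes.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  conversion_error = e.message
end

# The Base64 form is plain ASCII, so it can sit inside a UTF-8 XML string.
xml = embed_as_base64(png_bytes)
```

Whoever consumes the XML can get the original bytes back with `Base64.strict_decode64` on the element's text.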

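If, however, the source file generating that error is already text, then you need to determine and specify what character set it is actually using. If 0x89 is the first non-ASCII byte in the file, the text is neither ASCII nor valid UTF-8 (in UTF-8, 0x89 can only appear as a continuation byte, never at the start of a character), so the most likely options are Latin-1 or Windows-1252.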
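If the upload really is text, something along these lines may help. This is only a sketch: it assumes you already know (or have guessed) the real encoding, the byte strings are stand-ins for `params[:file].read`, and Windows-1252 is just an example.

```ruby
# Stand-ins for uploaded content, tagged ASCII-8BIT as in the error message.
utf8_bytes   = "caf\xC3\xA9".b # "café" saved as UTF-8
cp1252_bytes = "caf\xE9".b     # "café" saved as Windows-1252

# Bytes that are already UTF-8 only need relabelling; nothing is converted.
relabelled = utf8_bytes.dup.force_encoding(Encoding::UTF_8)
looks_valid = relabelled.valid_encoding? # check before trusting the label

# Bytes in another character set must be transcoded from their real encoding.
# The second argument tells Ruby what the source bytes actually are.
converted = cp1252_bytes.encode('UTF-8', 'Windows-1252')
```

The relabelling case may be what applies to a file like example.csv, which is reported as UTF-8 but still arrives as a binary-tagged string.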