QString in Persian

Updated: 2023-02-18 16:51:22

I have been given a Qt project which needs to support the Persian language. The data is sent from a server; with the first line below I read it into a QByteArray, and with the second line I convert it to a QString:

    QByteArray readData = socket->readAll();
    QString DataAsString = QTextCodec::codecForUtfText(readData)->toUnicode(readData);

When the data sent is in English, everything is fine, but when it is in Persian, instead of

سلام

I get

سÙ\u0084اÙ\u0085

I mention the process so that people won't suggest the usual ways of building a multi-language app with tr(). This is all about text and decoding, not about those translation methods. My OS is Windows 8.1 (in case you need to know).

I get this hex value when the server sends سلام

0008d8b3d984d8a7d985

By the way, the server sends two extra bytes at the beginning for a reason I don't know, so I cut them off using:

DataAsString.remove(0,2);

I do this after the data has been converted to QString, which is why the hex value above still has the extra bytes at the beginning.

I was far too curious to wait for a reply and toyed a bit on my own:

I copied the text سلام (in English: "Hello") and pasted it into Notepad++ (which used UTF-8 encoding in my case). Then I switched to View as Hex and got:

The ASCII dump on the right side looks a bit similar to what OP unexpectedly got. This led me to believe that the bytes in readData are encoded in UTF-8. Hence, I took the exposed hex numbers and made a little sample code:

testQPersian.cc:

#include <QtWidgets>

int main(int argc, char **argv)
{
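  // UTF-8 encoded bytes of the text سلام
  QByteArray readData = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";
  // wrong for this input: every byte is treated as one Latin-1 character
  QString textLatin1 = QString::fromLatin1(readData);
  // right for this input: multi-byte sequences are decoded as UTF-8
  QString textUtf8 = QString::fromUtf8(readData);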
  QApplication app(argc, argv);
  QWidget qWin;
  QGridLayout qGrid;
  qGrid.addWidget(new QLabel("Latin-1:"), 0, 0);
  qGrid.addWidget(new QLabel(textLatin1), 0, 1);
  qGrid.addWidget(new QLabel("UTF-8:"), 1, 0);
  qGrid.addWidget(new QLabel(textUtf8), 1, 1);
  qWin.setLayout(&qGrid);
  qWin.show();
  return app.exec();
}

testQPersian.pro:

SOURCES = testQPersian.cc

QT += widgets

Compiled and tested in cygwin on Windows 10:

$ qmake-qt5 testQPersian.pro

$ make

$ ./testQPersian

Again, the output as Latin-1 looks a bit similar to what OP got as well as to what Notepad++ exposed.

The output as UTF-8 provides the expected text (as expected because I provided a proper UTF-8 encoding as input).

It may be a bit confusing that the ASCII/Latin-1 outputs vary. There exist multiple single-byte character encodings which share ASCII in the lower half (0 ... 127) but assign different meanings to the bytes in the upper half (128 ... 255). (Have a look at ISO/IEC 8859 to see what I mean. These were introduced as localizations before Unicode became popular as the final solution of the localization problem.)

The Persian characters surely all have Unicode code points beyond 127. (Unicode shares ASCII for the first 128 code points as well.) Such code points are encoded in UTF-8 as sequences of multiple bytes where each byte has the MSB (the most significant bit, bit 7) set. Hence, if these bytes are (accidentally) interpreted with any ISO 8859 encoding, the upper half becomes relevant. Thus, depending on the ISO 8859 encoding currently in use, this may produce different glyphs.
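To make the difference between the two interpretations concrete, here is a minimal sketch in plain standard C++ (no Qt). The function names decodeLatin1 and decodeUtf8 are made up for this illustration, and the UTF-8 decoder is deliberately simplified: it assumes reasonably well-formed input and handles only sequences of one to three bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Latin-1 view: every byte is one character, code point == byte value.
std::vector<std::uint32_t> decodeLatin1(const std::string &bytes)
{
  std::vector<std::uint32_t> codePoints;
  for (unsigned char byte : bytes) codePoints.push_back(byte);
  return codePoints;
}

// UTF-8 view (simplified): combines 1- to 3-byte sequences into code points.
// Continuation bytes are not validated; anything else yields U+FFFD.
std::vector<std::uint32_t> decodeUtf8(const std::string &bytes)
{
  std::vector<std::uint32_t> codePoints;
  const std::size_t n = bytes.size();
  for (std::size_t i = 0; i < n;) {
    const std::uint32_t b0 = static_cast<unsigned char>(bytes[i]);
    if (b0 < 0x80) { // 0xxxxxxx: plain ASCII
      codePoints.push_back(b0);
      i += 1;
    } else if ((b0 & 0xE0) == 0xC0 && i + 1 < n) { // 110xxxxx 10xxxxxx
      const std::uint32_t b1 = static_cast<unsigned char>(bytes[i + 1]);
      codePoints.push_back(((b0 & 0x1F) << 6) | (b1 & 0x3F));
      i += 2;
    } else if ((b0 & 0xF0) == 0xE0 && i + 2 < n) { // 1110xxxx 10xxxxxx 10xxxxxx
      const std::uint32_t b1 = static_cast<unsigned char>(bytes[i + 1]);
      const std::uint32_t b2 = static_cast<unsigned char>(bytes[i + 2]);
      codePoints.push_back(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
      i += 3;
    } else { // not handled in this sketch
      codePoints.push_back(0xFFFD);
      i += 1;
    }
  }
  return codePoints;
}
```

For the bytes d8 b3 d9 84 d8 a7 d9 85, decodeUtf8 yields the four code points U+0633, U+0644, U+0627, U+0645 (the letters of سلام), while decodeLatin1 yields eight code points starting with U+00D8 (Ø) and U+00B3 (³), which is the kind of garbage OP observed.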


Some continuation:

OP sent the following snapshot:

So, it seems instead of

d8 b3 d9 84 d8 a7 d9 85

he got

00 08 d8 b3 d9 84 d8 a7 d9 85

A possible interpretation:

The server first sends a 16-bit length 00 08, interpreted as a big-endian 16-bit integer: 8, followed by 8 bytes encoded in UTF-8 (which look exactly like the ones I got when playing above). (AFAIK, it's not unusual to use big-endian for binary network protocols to prevent endianness issues if sender and receiver natively have different endianness.) Further reading e.g. here: htons(3) - Linux man page

On the i386 the host byte order is Least Significant Byte first, whereas the network byte order, as used on the Internet, is Most Significant Byte first.
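As a small illustration of why the byte order matters for the two length bytes, the following sketch reads the same pair of bytes both ways. It is plain standard C++, and the helper names readBigEndian16 and readLittleEndian16 are assumptions made up for this example, not part of any library.

```cpp
#include <cstdint>

// Reads a 16-bit value stored most significant byte first (network order).
std::uint16_t readBigEndian16(const unsigned char *bytes)
{
  return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}

// Reads the same two bytes least significant byte first, which is what a
// little-endian host would get when loading them directly as an integer.
std::uint16_t readLittleEndian16(const unsigned char *bytes)
{
  return static_cast<std::uint16_t>((bytes[1] << 8) | bytes[0]);
}
```

For the bytes 00 08, readBigEndian16 gives 8, whereas readLittleEndian16 gives 2048 (0x0800). Composing the value from individual bytes with shifts works the same on any host, so no byte swapping is needed.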


OP states that the data is written with DataOutput.writeUTF:

Writes two bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.

So, the decoding could look like this:

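// sample frame; the explicit size 10 is needed because of the embedded \x00
QByteArray readData("\x00\x08\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", 10);
//QByteArray readData = socket->readAll();
// combine the first two bytes into the big-endian 16 bit length
unsigned length
  = ((uint8_t)readData[0] <<  8) + (uint8_t)readData[1];
// skip the 2 length bytes and decode the payload as UTF-8
QString text = QString::fromUtf8(readData.data() + 2, length);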

  1. The first two bytes are extracted from readData and combined into the length (decoding a big-endian 16-bit integer).

  2. The rest of readData is converted to a QString, providing the previously extracted length. Thereby, the first 2 length bytes of readData are skipped.
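One thing the snippet above does not check is whether the buffer really contains the whole frame. A TCP socket delivers a byte stream, so a single read may return fewer bytes than the length prefix announces. The two steps can be sketched with such a check in plain standard C++ (C++17, no Qt). The helper name extractFrame is hypothetical, and the frame layout is assumed to be exactly a 2-byte big-endian length followed by that many payload bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: extracts the payload of one frame laid out as
// [2-byte big-endian length][length bytes of payload].
// Returns std::nullopt if the buffer does not hold the complete frame yet.
std::optional<std::string> extractFrame(const std::string &buffer)
{
  // step 1: the two length bytes must be present before they can be decoded
  if (buffer.size() < 2) return std::nullopt;
  const std::size_t length
    = (static_cast<std::size_t>(static_cast<unsigned char>(buffer[0])) << 8)
    | static_cast<unsigned char>(buffer[1]);
  // step 2: skip the length bytes and take exactly `length` payload bytes
  if (buffer.size() < 2 + length) return std::nullopt;
  return buffer.substr(2, length);
}
```

In a Qt program the returned payload could then be handed to QString::fromStdString, which decodes it as UTF-8. When std::nullopt comes back, the caller would keep the bytes received so far and try again after more data has arrived.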