且构网


How is a multibyte string converted to a wide-character string in glibc's fxprintf.c?

Updated: 2022-12-22 20:51:26


    Currently, the logic of perror in the glibc source is as follows:

    If stderr is oriented, use it as is; otherwise dup() its descriptor and print through a stream opened on the dup()'ed fd.

    If stderr is wide-oriented, the following logic from stdio-common/fxprintf.c is used:

    size_t len = strlen (fmt) + 1;
    wchar_t wfmt[len];
    for (size_t i = 0; i < len; ++i)
      {
        assert (isascii (fmt[i]));
        wfmt[i] = fmt[i];
      }
    res = __vfwprintf (fp, wfmt, ap);
    

    The format string is converted to wide-character form by the following code, which I do not understand:

    wfmt[i] = fmt[i];
    

    It also asserts that every character is ASCII:

    assert (isascii(fmt[i]));
    

    But the format string is not always ASCII in wide-character programs, because we may use a UTF-8 format string, which can contain non-7-bit values. Why is there no assertion failure when we run the following code (assuming a UTF-8 locale and UTF-8 compiler encoding)?

    #include <stdio.h>
    #include <errno.h>
    #include <wchar.h>
    #include <locale.h>
    int main(void)
    {
      setlocale(LC_CTYPE, "en_US.UTF-8");
      fwide(stderr, 1);
      errno = EINVAL;
      perror("привет мир");  /* note, that the string is multibyte */
      return 0;
    }
    

    $ ./a.out 
    привет мир: Invalid argument
    


    Can we use dup() on wide-oriented stderr to make it not wide-oriented? In that case the code could be rewritten without this mysterious conversion, given that perror() takes only multibyte strings (const char *s) and locale messages are all multibyte anyway.

    Turns out we can. The following code demonstrates this:

    #include <stdio.h>
    #include <wchar.h>
    #include <unistd.h>
    int main(void)
    {
      fwide(stdout,1);
      FILE *fp;
      int fd = -1;
      if ((fd = fileno (stdout)) == -1) return 1;
      if ((fd = dup (fd)) == -1) return 1;
      if ((fp = fdopen (fd, "w+")) == NULL) return 1;
      wprintf(L"stdout: %d, dup: %d\n", fwide(stdout, 0), fwide(fp, 0));
      return 0;
    }
    

    $ ./a.out 
    stdout: 1, dup: 0
    

    BTW, is it worth posting an issue about this improvement to glibc developers?


    NOTE

    Using dup() has a limitation with respect to buffering. I wonder whether the implementation of perror() in glibc takes it into account. The following example demonstrates the issue. Output appears not in the order in which it was written to the streams, but in the order in which the buffers are flushed. Note that the order of values in the output differs from the order in the program: the output of fprintf is flushed first (because of the "\n"), and the output of fwprintf is flushed only when the program exits.

    #include <wchar.h>
    #include <stdio.h>
    #include <unistd.h>
    int main(void)
    {
      wint_t wc = L'b';
      fwprintf(stdout, L"%lc", wc);
    
      /* --- */
    
      FILE *fp;
      int fd = -1;
      if ((fd = fileno (stdout)) == -1) return 1;
      if ((fd = dup (fd)) == -1) return 1;
      if ((fp = fdopen (fd, "w+")) == NULL) return 1;
    
      char c = 'h';
      fprintf(fp, "%c\n", c);
      return 0;
    }
    

    $ ./a.out 
    h
    b
    

    But if we use \n in fwprintf as well, the output order matches the order in the program:

    $ ./a.out 
    b
    h
    

    perror() manages to get away with that, because in GNU libc stderr is unbuffered. But will it work safely in programs where stderr is manually set to buffered mode?


    This is the patch that I would propose to glibc developers:

    diff -urN glibc-2.24.orig/stdio-common/perror.c glibc-2.24/stdio-common/perror.c
    --- glibc-2.24.orig/stdio-common/perror.c   2016-08-02 09:01:36.000000000 +0700
    +++ glibc-2.24/stdio-common/perror.c    2016-10-10 16:46:03.814756394 +0700
    @@ -36,7 +36,7 @@
    
       errstring = __strerror_r (errnum, buf, sizeof buf);
    
    -  (void) __fxprintf (fp, "%s%s%s\n", s, colon, errstring);
    +  (void) _IO_fprintf (fp, "%s%s%s\n", s, colon, errstring);
     }
    
    
    @@ -55,7 +55,7 @@
          of the stream.  What is supposed to happen when the stream isn't
          oriented yet?  In this case we'll create a new stream which is
          using the same underlying file descriptor.  */
    -  if (__builtin_expect (_IO_fwide (stderr, 0) != 0, 1)
    +  if (__builtin_expect (_IO_fwide (stderr, 0) < 0, 1)
           || (fd = __fileno (stderr)) == -1
           || (fd = __dup (fd)) == -1
           || (fp = fdopen (fd, "w+")) == NULL)
    

    NOTE: It wasn't easy to find concrete questions in this post; on the whole, the post seems to be an attempt to engage in a discussion about implementation details of glibc, which it seems to me would be better directed to a forum specifically oriented to development of that library, such as the libc-alpha mailing list. (Or see https://www.gnu.org/software/libc/development.html for other options.) This sort of discussion is not really a good match for Stack Overflow, IMHO. Nonetheless, I tried to answer the questions I could find.

    1. How does wfmt[i] = fmt[i]; convert from multibyte to wide character?

      Actually, the code is:

      assert(isascii(fmt[i]));
      wfmt[i] = fmt[i];
      

      which is based on the fact that the numeric value of an ASCII character is the same as the value of the corresponding wchar_t. Strictly speaking, this need not be the case. The C standard specifies:

      Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__. (§7.19/2)

      (gcc does not define that symbol.)

      However, that only applies to characters in the basic set, not to all characters recognized by isascii. The basic character set contains the 91 printable ascii characters as well as space, newline, horizontal tab, vertical tab and form feed. So it is theoretically possible that one of the remaining control characters will not be correctly converted. However, the actual format string used in the call to __fxprintf only contains characters from the basic character set, so in practice this pedantic detail is not important.

    2. Why is there no assertion failure when we execute perror("привет мир");?

      Because only the format string is being converted, and the format string (which is "%s%s%s\n") contains only ascii characters. Since the format string contains %s (and not %ls), the argument is expected to be char* (and not wchar_t*) in both the narrow- and wide-character orientations.

    3. Can we use dup() on wide-oriented stderr to make it not wide-oriented?

      That would not be a good idea. First, if the stream has an orientation, it might also have a non-empty internal buffer. Since that buffer belongs to the stdio library and not to the underlying POSIX fd, it is not shared with the duplicate fd, so the message printed by perror might land in the middle of existing output. In addition, the multibyte encoding may have shift states, and the output stream may not currently be in the initial shift state. In that case, outputting an ASCII sequence could produce garbled output.

      In the actual implementation, the dup is only performed on streams without orientation; these streams have never had any output directed at them, so they are definitely still in the initial shift state with an empty buffer (if the stream is buffered).

    4. Is it worth posting an issue about this improvement to glibc developers?

      That is up to you, but don't do it here. The normal way of doing that would be to file a bug. There is no reason to believe that glibc developers read SO questions, and even if they do, someone would have to copy the issue to a bug, and also copy any proposed patch.