且构网

All whitespace characters? Is it language independent?

Updated: 2022-11-16 08:34:27

I was wondering whether all languages treat the same set of characters as whitespace, or whether there is any variation.

Can anyone provide a complete list of whitespace characters, separating out the ones that can be entered from a keyboard? If the set differs between languages, the differences and the reasons for them would be even more helpful. Any language is fine, as long as you don't bring up Whitespace or its variants (if any). I certainly don't want a complete list for a language like Whitespace :)

Whether a particular character is categorized as whitespace should depend on the character set being used. That said, a programming language can also make its own definition of what constitutes whitespace.

Most modern languages use the Unicode character set, which does define space separator characters. Any character in the Zs category is a space separator.

You can see the complete list here. In addition you can grep for ;Zs; in the official Unicode Character Database to see those characters. Note that the number of characters in this category may grow as new Unicode versions come into existence, so I will not say how many such characters exist, nor even attempt to list them.
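As an alternative to grepping the database file, here is a minimal sketch that asks Python's standard `unicodedata` module for every code point in category Zs. The function name `space_separators` is made up for illustration, and the result depends on the Unicode version bundled with your Python build, which the script prints first.

```python
import sys
import unicodedata


def space_separators():
    """Return every character whose Unicode general category is Zs."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


if __name__ == "__main__":
    # The list reflects the Unicode data shipped with this interpreter,
    # not necessarily the latest published Unicode version.
    print("Unicode data version:", unicodedata.unidata_version)
    for ch in space_separators():
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Running it under different Python versions is a quick way to see whether the category has changed between Unicode releases.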

In addition to the Zs category, Unicode also defines character properties, one of which is White_Space. As of Unicode 7.0, the characters with this property are all of the characters in category Zs, the line and paragraph separators U+2028 and U+2029, and a few control characters (U+0009, U+000A, U+000B, U+000C, U+000D, and U+0085). You can find all of the characters with the White_Space property at Unicode.org here.
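Python's `unicodedata` module does not expose the whitespace property directly, so the sketch below approximates it. It assumes the property consists of categories Zs, Zl, and Zp plus the six control characters named above; that matches recent Unicode versions as far as I know, but the property file on Unicode.org is the authority. The names `EXTRA_CONTROLS` and `approx_white_space` are hypothetical.

```python
import sys
import unicodedata

# Assumption: these are the only control characters carrying the property.
EXTRA_CONTROLS = {"\u0009", "\u000A", "\u000B", "\u000C", "\u000D", "\u0085"}

# Assumption: every character in these separator categories has the property.
SEPARATOR_CATEGORIES = {"Zs", "Zl", "Zp"}


def approx_white_space():
    """Approximate the set of characters with the Unicode whitespace property."""
    chars = set(EXTRA_CONTROLS)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in SEPARATOR_CATEGORIES:
            chars.add(ch)
    return chars


if __name__ == "__main__":
    for ch in sorted(approx_white_space()):
        print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}")
```

Note that this is not the same test as `str.isspace()`, which uses its own rule and accepts a few extra control characters.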

Now many languages, even modern ones, have special symbols for regular expressions such as \s or [:space:], but beware: in many engines (or in their ASCII modes) these match only certain characters from the ASCII set. Generally they are restricted to

  • SPACE (codepoint 32, U+0020)
  • TAB (codepoint 9, U+0009)
  • LINE FEED (codepoint 10, U+000A)
  • LINE TABULATION (codepoint 11, U+000B)
  • FORM FEED (codepoint 12, U+000C)
  • CARRIAGE RETURN (codepoint 13, U+000D)

Now this list is interesting because it contains not only a space separator (SPACE, category Zs) but also five characters from the "Other, Control" category (Cc). This is what a programming language generally means when it uses the term "whitespace."
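How `\s` behaves varies by engine, so it is worth checking in whatever language you use. As one illustration, Python 3's `re` module matches Unicode whitespace with `\s` on `str` patterns by default and falls back to the six classic characters only when the `re.ASCII` flag is set. The sketch below shows that, and also prints the general category of each classic character; the helper name `matches` is made up for the example.

```python
import re
import unicodedata

# The six "classic" whitespace characters listed above.
CLASSIC = " \t\n\x0b\x0c\r"

ascii_ws = re.compile(r"\s", re.ASCII)  # ASCII-only interpretation of \s
unicode_ws = re.compile(r"\s")          # default for str patterns in Python 3


def matches(pattern, ch):
    """Return True if the single character ch is matched by pattern."""
    return pattern.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in CLASSIC:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              matches(ascii_ws, ch), matches(unicode_ws, ch))
    # NO-BREAK SPACE is in Zs but outside ASCII.
    nbsp = "\u00a0"
    print("U+00A0", matches(ascii_ws, nbsp), matches(unicode_ws, nbsp))
```

Other engines make different choices by default, so treat this as a check of one implementation rather than a general rule.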

So probably the best way to answer your question for a "complete list" of whitespace characters is to say "it depends on what you mean." If you mean "classic whitespace," it is probably the six characters listed above. If you want something more "modern," then it is the union of those six with all the characters from the Unicode category Zs. Then again, you might need to look at characters outside Zs, too (e.g., U+1361, ETHIOPIC WORDSPACE, which is classified as punctuation, as mentioned in a comment to your question by Jerry Coffin). It also depends on what you intend to do with these space characters.
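The two definitions can be written down as small predicates. This is only a sketch of the "classic" and "classic plus Zs" choices described here; the function names are hypothetical, and whether a word separator like U+1361 should count is a decision for your application, not something the code can settle.

```python
import unicodedata

# The six "classic" whitespace characters.
CLASSIC_WHITESPACE = frozenset(" \t\n\x0b\x0c\r")


def is_classic_whitespace(ch):
    """True for SPACE, TAB, LF, VT, FF, and CR only."""
    return ch in CLASSIC_WHITESPACE


def is_modern_whitespace(ch):
    """True for the classic six or any character in Unicode category Zs."""
    return ch in CLASSIC_WHITESPACE or unicodedata.category(ch) == "Zs"


if __name__ == "__main__":
    samples = ["\t", " ", "\u2003", "\u3000", "\u1361"]
    for ch in samples:
        # U+1361 has general category Po, so neither predicate accepts it.
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              is_classic_whitespace(ch), is_modern_whitespace(ch))
```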

Now one last thing: Unicode doesn't have every character in the world yet; it keeps growing. It is possible that someday new space characters will be added. For now, category Zs + the classics are your best bet.