How do I get the number of characters in a string?

Updated: 2023-02-18 20:12:41

How can I get the number of characters of a string in Go?

For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes, not the number of characters, so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.

You can try RuneCountInString from the unicode/utf8 package. It is the string version of RuneCount, which

returns the number of runes in p

As this script illustrates, len("世界") ("World" written in Chinese) is 6, because each of the two characters takes three bytes in UTF-8, but its rune count is 2:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // len counts bytes (6 here); RuneCountInString counts runes (2 here).
    fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
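
To see where the 6 comes from, a small sketch can range over the string: a for range loop over a string decodes one rune per iteration and reports the byte index where that rune starts. The helper name runeOffsets is made up for this illustration.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper for illustration: it returns the
// byte index at which each rune of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // ranging over a string steps rune by rune, not byte by byte
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "世界"
	for i, r := range s {
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
	// Byte length versus number of runes found by ranging.
	fmt.Println(len(s), len(runeOffsets(s)))
}
```

Each of the two characters occupies three bytes, so the runes start at byte offsets 0 and 3 while len reports 6.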

Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
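
The two approaches can be compared side by side with the byte length. This is a minimal sketch; byteLen, runeLen and runeLenViaSlice are names invented here, not library functions. For valid UTF-8 input both rune-counting variants should agree.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen returns the number of bytes in s.
func byteLen(s string) int { return len(s) }

// runeLen counts runes by scanning the string.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

// runeLenViaSlice counts runes by converting the string to a rune slice.
func runeLenViaSlice(s string) int { return len([]rune(s)) }

func main() {
	for _, s := range []string{"hello", "£", "世界"} {
		fmt.Printf("%q: bytes=%d runes=%d runes via []rune=%d\n",
			s, byteLen(s), runeLen(s), runeLenViaSlice(s))
	}
}
```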


Stefan Steiger points to the blog post "Text normalization in Go":

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.

The definition of a character may vary depending on the application.
For normalization we will define it as:

  • a sequence of runes that starts with a starter (a rune that does not modify or combine backwards with any other rune),
  • followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).

The normalization algorithm processes one character at a time.

Using that package and its Iter type, the actual number of "characters" would be:

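package main

import (
    "fmt"

    "golang.org/x/text/unicode/norm"
)

func main() {
    var ia norm.Iter
    // Iterate over the NFKD form of the string; each call to Next
    // advances past one segment (a starter plus any non-starters).
    ia.InitString(norm.NFKD, "école")
    nc := 0
    for !ia.Done() {
        nc++
        ia.Next()
    }
    fmt.Printf("Number of chars: %d\n", nc)
}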

This uses the Unicode normalization form NFKD ("Compatibility Decomposition").
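
If pulling in golang.org/x/text is not an option, the starter/non-starter definition quoted above can be roughly approximated with the standard library alone. The sketch below is an approximation, not what the norm package does: it assumes every rune in the Unicode category Mn (nonspacing mark) is a non-starter and every other rune is a starter, which covers common accents but not every script. The name approxCharCount is made up for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// approxCharCount is a hypothetical helper: it counts runes that are not
// nonspacing marks (category Mn), so a base letter followed by combining
// accents is counted once. This only approximates normalization segments.
func approxCharCount(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			n++
		}
	}
	return n
}

func main() {
	decomposed := "e\u0301cole" // 'e' + combining acute accent, then "cole"
	precomposed := "\u00e9cole" // single code point U+00E9, then "cole"
	fmt.Println(approxCharCount(decomposed), approxCharCount(precomposed))
}
```

Both spellings of "école" are counted the same way here, even though the decomposed one contains one more rune than the precomposed one.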