How do I find the frequency count of an English word using WordNet?

Updated: 2022-06-24 10:09:48

In WordNet, every Lemma has a frequency count, which is returned by the method lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

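Code example:

from nltk.corpus import wordnet

# Python 2 syntax (print statement); a Python 3 version follows below
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        # name() is a method in NLTK 3, like lemmas() and count()
        print l.name() + " " + str(l.count())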

Output:

stack 2
batch 0
deal 1
flock 1
good_deal 13
great_deal 10
hatful 0
heap 2
lot 13
mass 14
mess 0
...
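
The counts above are per sense, so a word such as "stack" appears once for each synset it belongs to. For a per-word total, the same numbers can be read from cntlist.rev directly. The following is a minimal sketch, assuming each line of that file has the form `sense_key sense_number tag_cnt` and that the lemma is the part of the sense key before `%`. The function name `total_counts` and the sample lines are made up for illustration and are not real WordNet data.

```python
from collections import Counter


def total_counts(lines):
    """Sum tag counts per lemma over all of its senses.

    Assumes each line looks like: sense_key sense_number tag_cnt
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        sense_key, _sense_number, tag_cnt = fields
        lemma = sense_key.split("%", 1)[0]  # text before '%' is the lemma
        totals[lemma] += int(tag_cnt)
    return totals


# Illustrative lines only, not copied from the real file
sample = [
    "stack%1:14:00:: 1 2",
    "stack%2:35:00:: 1 1",
    "heap%1:14:00:: 1 2",
]
print(total_counts(sample)["stack"])
```

To run this on the real data, pass an open file handle for cntlist.rev instead of `sample`.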

However, many counts are zero, and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

So it is probably best to choose the corpus that best fits your application and create the data yourself, as Christopher suggested.
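
Creating the data yourself can be as simple as counting tokens in the corpus you choose. Here is a minimal sketch, assuming the corpus is available as plain text and that lowercased alphabetic tokens are an adequate notion of "word" for your application. The function name `word_frequencies` and the sample text are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic tokens (apostrophes allowed) in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Stand-in for a real corpus; replace with the text that fits your domain
corpus = "A stack of books. Another stack, a heap of papers."
freq = word_frequencies(corpus)
print(freq["stack"], freq["heap"], freq["mess"])  # prints: 2 1 0
```

A Counter returns 0 for words it has never seen, so lookups for missing words do not raise an error. Unlike the WordNet counts, these are counts of surface words, not of individual senses.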

To make the WordNet snippet above Python 3.x compatible, just do:

Code example:

from nltk.corpus import wordnet

# Print each lemma of every synset of 'stack' with its frequency count
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print(l.name() + " " + str(l.count()))