Counting phrase frequency in Python 3.3.2

Updated: 2021-07-10 01:27:47

First, here is how I would build the cnt counter you already have, using a generator to reduce memory overhead:

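import collections
import re

def findWords(filepath):
  # Yield lowercase words one line at a time, so the whole file is never held in memory
  with open(filepath) as infile:
    for line in infile:
      words = re.findall(r'\w+', line.lower())
      yield from words

cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))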
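
To try the generator approach without the original file, here is a minimal sketch that runs the same word-extraction logic over an in-memory text. The name iter_words, the variable word_counts, and the sample text are assumptions made for illustration; they are not part of the original answer.

```python
import collections
import io
import re

def iter_words(lines):
    """Yield lowercase words one at a time from any iterable of lines."""
    for line in lines:
        yield from re.findall(r'\w+', line.lower())

# Hypothetical sample text standing in for the real file.
sample = io.StringIO("The central bank met.\nThe bank raised rates.\n")

# Counter consumes the generator lazily, one word at a time.
word_counts = collections.Counter(iter_words(sample))
print(word_counts.most_common(2))
```

Because the function accepts any iterable of lines, an open file object can be passed in the same way as the StringIO object here.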

Now, on to your question about phrases:

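from itertools import tee

phrases = {'central bank', 'high inflation'}
# Two independent iterators over the same word stream
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)  # advance the second iterator so zip pairs each word with its successor
for w1, w2 in zip(fw1, fw2):
  phrase = ' '.join([w1, w2])
  if phrase in phrases:
    cnt[phrase] += 1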
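
The pairing of each word with its successor can be checked on a small input. The following is a sketch only: the function name count_phrases and the sample word list are hypothetical, and it keeps phrase counts in a separate Counter rather than adding them to cnt.

```python
import collections
from itertools import tee

def count_phrases(words, phrases):
    """Count occurrences of the given two-word phrases in a word stream."""
    first, second = tee(words)
    next(second, None)  # shift the second iterator forward by one word
    counts = collections.Counter()
    # zip yields (word, next_word) pairs and stops when the stream ends
    for w1, w2 in zip(first, second):
        phrase = w1 + ' ' + w2
        if phrase in phrases:
            counts[phrase] += 1
    return counts

# Hypothetical word stream for illustration.
words = "the central bank fears high inflation so the central bank acts".split()
phrase_counts = count_phrases(iter(words), {'central bank', 'high inflation'})
print(phrase_counts)
```

Since zip advances both iterators in lockstep, tee only has to buffer the one word by which the second iterator is ahead, so memory use stays small. Note that the word stream is flat, so a phrase split across two lines of the file is still counted.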

Hope this helps.
