且构网

tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

Updated: 2022-04-24 10:43:52

Since version 0.15, the inverse document frequency (idf) weight of each feature can be retrieved via the idf_ attribute of the TfidfVectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
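# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
print(dict(zip(vectorizer.get_feature_names_out(), idf.tolist())))

Output (wrapped here for readability):

{'is': 1.0,
 'nice': 1.4054651081081644,
 'strange': 1.4054651081081644,
 'this': 1.0,
 'very': 1.0}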
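
The values above can be reproduced by hand. The sketch below assumes the vectorizer's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The helper names tokenize and smoothed_idf are illustrative and not part of scikit-learn, and the tokenizer is only a rough stand-in for the library's default.

```python
import math
import re

corpus = ["This is very strange",
          "This is very nice"]

def tokenize(text):
    # Rough stand-in for the default tokenizer: lowercase, words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def smoothed_idf(docs):
    # Hypothetical helper: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    n = len(docs)
    df = {}
    for doc in docs:
        # Count each term at most once per document.
        for term in set(tokenize(doc)):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1
            for term, count in sorted(df.items())}

idf_by_hand = smoothed_idf(corpus)
print(idf_by_hand)
```

Terms that occur in both documents get ln(3/3) + 1 = 1.0, while terms that occur in only one get ln(3/2) + 1, which is where the 1.405... values come from.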

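As discussed in the comments, prior to version 0.15, a workaround is to access idf_ through the vectorizer's supposedly hidden _tfidf attribute (an instance of TfidfTransformer):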

idf = vectorizer._tfidf.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

This should give the same output as above.
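
Note that idf_ holds one value per term, independent of any particular document; the per-document tf-idf weights are the entries of the matrix X returned by fit_transform. As a rough illustration of how the two relate, the sketch below starts from the idf values printed above and assumes the vectorizer's default settings (raw term counts multiplied by idf, then L2 normalization of each document row). The helper tfidf_row is hypothetical and not a scikit-learn function.

```python
import math
from collections import Counter

# idf values as printed above for the two-sentence corpus
idf = {"is": 1.0, "nice": 1.4054651081081644,
       "strange": 1.4054651081081644, "this": 1.0, "very": 1.0}

def tfidf_row(tokens, idf):
    # Hypothetical helper: raw count * idf, then scale the row to unit L2 norm.
    counts = Counter(t for t in tokens if t in idf)
    raw = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {}  # no known terms in this document
    return {t: w / norm for t, w in sorted(raw.items())}

row = tfidf_row("this is very strange".split(), idf)
for term, weight in row.items():
    print(term, round(weight, 4))
```

In this row the rarer term "strange" receives a larger weight than "is", "this" or "very", even though every term appears once in the sentence.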