N-grams of letters in sklearn

Updated: 2023-11-23 13:54:34

The 'analyzer' parameter does what you want.

According to the documentation:

analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
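To make the difference between 'char' and 'char_wb' concrete, here is a minimal sketch that prints the 3-grams each analyzer produces for a short string. The sample text "ab cd" and the variable names are only illustrative; build_analyzer() returns the function the vectorizer uses internally to split a document into features.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "ab cd"  # illustrative input, not from the original answer

# 'char': n-grams slide over the whole string, so they can span the space
char_analyzer = CountVectorizer(analyzer='char', ngram_range=(3, 3)).build_analyzer()
char_grams = char_analyzer(text)

# 'char_wb': each word is padded with a space and handled on its own,
# so no n-gram crosses from one word into the next
wb_analyzer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3)).build_analyzer()
wb_grams = wb_analyzer(text)

print(char_grams)
print(wb_grams)
```

With 'char' the list contains 'b c', which straddles the two words; with 'char_wb' it does not, and the space-padded ' ab' and 'cd ' appear instead.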

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
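If you want n-grams made of letters only (no spaces, digits, or punctuation), a callable analyzer is one option. The sketch below is an assumption about what such a callable might look like; letter_bigrams is a hypothetical helper, not part of sklearn. Note that a callable receives the raw text, so lowercasing has to be done inside it.

```python
from sklearn.feature_extraction.text import CountVectorizer

def letter_bigrams(text):
    """Hypothetical analyzer: drop non-letters, then emit letter 2-grams."""
    letters = [c for c in text.lower() if c.isalpha()]
    return [''.join(letters[i:i + 2]) for i in range(len(letters) - 1)]

# Pass the function itself as the analyzer
vectorizer = CountVectorizer(analyzer=letter_bigrams)
vectorizer.fit(["ab cd"])

features = sorted(vectorizer.vocabulary_)
print(features)
```

Because the space is removed before the n-grams are built, 'bc' becomes a feature here even though 'b' and 'c' belong to different words.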

By default it is set to 'word'; set it to 'char' (or 'char_wb') to get character n-grams.

Just do:

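from sklearn.feature_extraction.text import CountVectorizer

# token_pattern only applies when analyzer='word', so it is omitted here
vectorizer = CountVectorizer(ngram_range=(1, 100),
                             analyzer='char')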
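As a quick check of what this configuration produces, here is a small sketch that fits it on a single made-up document. The input "abc" is purely illustrative. The upper bound of 100 is effectively capped by the document length, so every substring of the text becomes a feature.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 100), analyzer='char')
counts = vectorizer.fit_transform(["abc"])  # illustrative document

# vocabulary_ maps each n-gram to its column index
features = sorted(vectorizer.vocabulary_)
print(features)
print(counts.toarray())
```

For longer documents the number of features grows quickly with such a wide ngram_range, so a smaller upper bound is usually more practical.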