Updated: 2022-06-06 22:26:05
In short:
When POS tagging, you need a sentence in context, not a list of ungrammatical tokens.
When lemmatizing a word out of context, the only way to get the right lemma is to specify the POS tag manually.
See the pos parameter of the lemmatize function, which defaults to the n (noun) POS; see also "WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK".
In long:
A POS tagger usually works on a full sentence, not on individual words. When you tag a single word out of context, what you get is its most frequent tag.
To verify that tagging a single word (i.e. a sentence with only one word) always gives the same tag:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag
>>> ptb2wn_pos = {'J':'a', 'V':'v', 'N':'n', 'R':'r'}
>>> sent = ['skydive']
>>> most_frequent_tag = pos_tag(sent)[0][1]
>>> most_frequent_tag
'JJ'
>>> most_frequent_tag = ptb2wn_pos[most_frequent_tag[0]]
>>> most_frequent_tag
'a'
>>> for _ in range(1000): assert ptb2wn_pos[pos_tag(sent)[0][1][0]] == most_frequent_tag
...
Now, since the tag for the one-word sentence ['skydive'] is always 'a', the WordNetLemmatizer will always return skydive:
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize(sent[0], pos=most_frequent_tag)
'skydive'
Now let's try to get the lemma of a word in the context of a sentence, starting with its POS tag:
>>> sent2 = 'They skydrive from the tower yesterday'
>>> pos_tag(sent2.split())
[('They', 'PRP'), ('skydrive', 'VBP'), ('from', 'IN'), ('the', 'DT'), ('tower', 'NN'), ('yesterday', 'NN')]
>>> pos_tag(sent2.split())[1]
('skydrive', 'VBP')
>>> pos_tag(sent2.split())[1][1]
'VBP'
>>> ptb2wn_pos[pos_tag(sent2.split())[1][1][0]]
'v'
So the context of the input list of tokens matters when you call pos_tag.
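Note that the ptb2wn_pos lookup above raises a KeyError for tags such as PRP, IN or DT, whose first letter is not in the dictionary. A minimal sketch of a more forgiving mapping follows. The helper name ptb_to_wn_pos and the choice to fall back to 'n' (the default pos of lemmatize) are assumptions made for illustration, not part of NLTK.

```python
# Hypothetical helper (not part of NLTK): map a Penn Treebank tag to a
# WordNet POS letter, falling back to a default for everything else.
PTB_PREFIX_TO_WN = {'JJ': 'a', 'VB': 'v', 'NN': 'n', 'RB': 'r'}


def ptb_to_wn_pos(ptb_tag, default='n'):
    """Return 'a', 'v', 'n' or 'r' for a PTB tag, or `default` if unmapped."""
    # Match on the two-letter prefix so that VBP, VBG, NNS, JJR, ... all map,
    # while unrelated tags such as RP (particle) are not mistaken for adverbs.
    return PTB_PREFIX_TO_WN.get(ptb_tag[:2], default)


print(ptb_to_wn_pos('VBP'))  # 'v'
print(ptb_to_wn_pos('PRP'))  # no WordNet equivalent, falls back to 'n'
```

Matching on the two-letter prefix instead of the first letter is a design choice in this sketch; the single-letter dictionary above behaves the same for the tags shown in this answer.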
In your example, you had the list ['skydiving', 'skydiving', 'skydiving'], meaning the sentence you are POS-tagging is ungrammatical:
skydiving skydiving skydiving
And the pos_tag function treats it as a normal sentence, hence giving the tags:
>>> sent3 = 'skydiving skydiving skydiving'.split()
>>> pos_tag(sent3)
[('skydiving', 'VBG'), ('skydiving', 'NN'), ('skydiving', 'VBG')]
In this case the first word is tagged as a verb, the second as a noun and the third as a verb, which returns the following lemmas (not what you want):
>>> wnl.lemmatize('skydiving', 'v')
'skydive'
>>> wnl.lemmatize('skydiving', 'n')
'skydiving'
>>> wnl.lemmatize('skydiving', 'v')
'skydive'
So if your list of tokens forms a valid grammatical sentence, the output may look very different:
>>> sent3 = 'The skydiving sport is an exercise that promotes diving from the sky , ergo when you are skydiving , you feel like you are descending to earth .'
>>> pos_tag(sent3.split())
[('The', 'DT'), ('skydiving', 'NN'), ('sport', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('exercise', 'NN'), ('that', 'IN'), ('promotes', 'NNS'), ('diving', 'VBG'), ('from', 'IN'), ('the', 'DT'), ('sky', 'NN'), (',', ','), ('ergo', 'RB'), ('when', 'WRB'), ('you', 'PRP'), ('are', 'VBP'), ('skydiving', 'VBG'), (',', ','), ('you', 'PRP'), ('feel', 'VBP'), ('like', 'IN'), ('you', 'PRP'), ('are', 'VBP'), ('descending', 'VBG'), ('to', 'TO'), ('earth', 'JJ'), ('.', '.')]
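Putting the pieces together, tagged output like the above can be turned into lemmas by passing each token's mapped tag to a lemmatizer. The sketch below assumes a hypothetical helper, lemmatize_tagged, that accepts any callable taking (word, pos), such as WordNetLemmatizer().lemmatize. Falling back to 'n' for tags without a WordNet equivalent is also an assumption.

```python
# Two-letter PTB prefixes that have a WordNet POS equivalent.
PTB_PREFIX_TO_WN = {'JJ': 'a', 'VB': 'v', 'NN': 'n', 'RB': 'r'}


def lemmatize_tagged(tagged_tokens, lemmatize, default_pos='n'):
    """Lemmatize (token, PTB tag) pairs, e.g. the output of pos_tag.

    `lemmatize` is any callable taking (word, pos); this helper is
    hypothetical and not part of NLTK.
    """
    lemmas = []
    for token, ptb_tag in tagged_tokens:
        # Tags with no WordNet equivalent (DT, IN, PRP, ...) use default_pos.
        wn_pos = PTB_PREFIX_TO_WN.get(ptb_tag[:2], default_pos)
        lemmas.append(lemmatize(token, wn_pos))
    return lemmas
```

With NLTK and its WordNet data installed, lemmatize_tagged(pos_tag(sent3.split()), wnl.lemmatize) would lemmatize each token using the tag it received in context, so the two occurrences of skydiving are looked up with different POS values.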