11.4 Lemmatization
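Lemmatization is a variant of stemming. The main difference between the two is that stemming often produces strings that are not real words, whereas lemmatization yields actual words (the lemma, or dictionary form). For example, lemmatization treats run as the base form of words such as running and ran, and maps better and good to the same lemma.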
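The contrast can be sketched with a toy example. The suffix list and the lemma table below are illustrative assumptions only; they are not the rules of any real stemmer, nor the contents of WordNet. The point is simply that blind suffix stripping can leave fragments such as "runn", while a dictionary lookup returns real words.

```python
# Toy comparison: crude suffix stripping vs. dictionary-based lemmatization.
# SUFFIXES and LEMMA_TABLE are hypothetical, hand-written for illustration.

SUFFIXES = ("ing", "ies", "es", "s", "ly")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "geese": "goose",
    "better": "good",
}

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["running", "easily", "geese", "better"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
```

A real lemmatizer replaces the hand-written table with a full lexical database, which is the role WordNet plays in NLTK.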
import nltk
from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus; if it is missing, run nltk.download("wordnet") once.
words = ["running", "ran", "runs", "easily", "fairly", "cats", "geese", "better", "worse"]
lemmatizer = WordNetLemmatizer()

# Lemmatize each word (pos defaults to "n", so every word is treated as a noun)
lemmatized_words = []
for word in words:
    w = lemmatizer.lemmatize(word)
    lemmatized_words.append(w)

# Print each word next to its lemma
for word, lemmatized in zip(words, lemmatized_words):
    print(f"{word} -> {lemmatized}")
running -> running
ran -> ran
runs -> run
easily -> easily
fairly -> fairly
cats -> cat
geese -> goose
better -> better
worse -> worse