11.4 Lemmatization

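Lemmatization is a variant of stemming. The main difference between the two processes is that stemming often produces strings that are not real words, whereas lemmatization returns an actual word, the lemma (dictionary form). For example, lemmatization maps words such as running and ran to the base form run, and treats better and good as the same lemma.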
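The contrast can be illustrated with a small standard-library sketch. The suffix list in `SUFFIXES` and the lookup table in `LEMMAS` are toy assumptions made up for this illustration; they are not the rules of any real stemmer, nor the contents of WordNet.

```python
# Toy suffix rules and a toy lemma dictionary (illustrative assumptions only).
SUFFIXES = ("ing", "ly", "es", "s")
LEMMAS = {"running": "run", "ran": "run", "geese": "goose", "better": "good"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "easily", "geese", "better"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
# running: stem=runn, lemma=run
# easily: stem=easi, lemma=easily
# geese: stem=geese, lemma=goose
# better: stem=better, lemma=good
```

The rule-based stemmer yields fragments such as `runn` and `easi`, while the dictionary lookup only ever returns real words, at the cost of needing an entry for every irregular form.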

words = ["running", "ran", "runs", "easily", "fairly", "cats", "geese", "better", "worse"]
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Lemmatize each word. No pos argument is passed, so every word
# is treated as a noun (the default is pos="n").
lemmatized_words = []
for word in words:
    w = lemmatizer.lemmatize(word)
    lemmatized_words.append(w)

# Print each word next to its lemma
for word, lemmatized in zip(words, lemmatized_words):
    print(f"{word} -> {lemmatized}")

running -> running
ran -> ran
runs -> run
easily -> easily
fairly -> fairly
cats -> cat
geese -> goose
better -> better
worse -> worse
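
In the output above, running, ran, better, and worse come back unchanged because `lemmatize` treats every word as a noun unless told otherwise. The sketch below passes a part-of-speech tag through the `pos` parameter of `WordNetLemmatizer.lemmatize`. The helper `lemmatize_tagged`, the function `main`, and the hand-assigned tags are assumptions made for illustration; in practice the tags would usually come from a POS tagger.

```python
from typing import Iterable, List, Tuple


def lemmatize_tagged(lemmatizer, tagged_words: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (word, pos, lemma) triples, passing each word's POS tag to the lemmatizer."""
    return [(word, pos, lemmatizer.lemmatize(word, pos=pos)) for word, pos in tagged_words]


def main() -> None:
    # Imported here so the helper above can be reused without NLTK installed.
    # The WordNet corpus must be available (nltk.download("wordnet")).
    from nltk.stem import WordNetLemmatizer

    # Hand-assigned WordNet tags: "n" noun, "v" verb, "a" adjective, "r" adverb.
    tagged_words = [
        ("running", "v"), ("ran", "v"), ("runs", "v"),
        ("easily", "r"), ("fairly", "r"),
        ("cats", "n"), ("geese", "n"),
        ("better", "a"), ("worse", "a"),
    ]
    for word, pos, lemma in lemmatize_tagged(WordNetLemmatizer(), tagged_words):
        print(f"{word} ({pos}) -> {lemma}")
```

Calling `main()` prints one line per word. With these tags, running should map to run and better to good, matching the description at the start of this section.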