11.1 Tokenization

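Tokenization is the process of splitting text into meaningful pieces. These pieces, called tokens, can be words, punctuation marks, numbers, or other special characters that make up a sentence.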

A set of predefined rules lets us convert a sentence into a list of tokens efficiently. The following code snippet shows tokenization with NLTK:

from nltk.tokenize import word_tokenize

text = "This is a sample content!"
# Split the sentence into word and punctuation tokens
word_tokenize(text, language='english')
NLTK library info: {'nltk_data path': 'nltk/nltk_data', 'nltk_data included': ['tokenizers/punkt', 'taggers/averaged_perceptron_tagger', 'taggers/universal_tagset', 'corpora/.DS_Store', 'corpora/inaugural', 'corpora/wordnet.zip', 'corpora/stopwords.zip']}
['This', 'is', 'a', 'sample', 'content', '!']
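
To make the idea of "predefined rules" concrete, here is a minimal sketch of a rule-based tokenizer built on Python's standard `re` module. It is not how NLTK works internally; the pattern and the names `TOKEN_PATTERN` and `simple_tokenize` are assumptions chosen for illustration, and the rules cover only numbers, words, and single punctuation characters.

```python
import re

# Hypothetical rule set (illustration only, not NLTK's actual rules):
#   1. a number, optionally with a decimal part   -> \d+(?:\.\d+)?
#   2. a run of word characters                   -> \w+
#   3. any single non-space, non-word character   -> [^\w\s]
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def simple_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("This is a sample content!")
print(tokens)  # ['This', 'is', 'a', 'sample', 'content', '!']
```

For this simple sentence the sketch yields the same tokens as `word_tokenize`. Real tokenizers add many more rules, for example for contractions and abbreviations, so results will differ on harder input.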

Exercise