11.1 Tokenization

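Tokenization is the process of splitting text into meaningful pieces, called tokens. A token can be a word, a punctuation mark, a number, or any other special character that makes up a sentence.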

A set of predetermined rules lets us convert a sentence into a list of tokens efficiently. The following code snippet shows tokenization with NLTK:

from nltk.tokenize import word_tokenize

text = "This is a sample content!"

# Split the sentence into word and punctuation tokens.
word_tokenize(text, language='english')

NLTK library info: {'nltk_data path': 'nltk/nltk_data', 'nltk_data included': ['tokenizers/punkt', 'taggers/averaged_perceptron_tagger', 'taggers/universal_tagset', 'corpora/.DS_Store', 'corpora/inaugural', 'corpora/wordnet.zip', 'corpora/stopwords.zip']}

['This', 'is', 'a', 'sample', 'content', '!']
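
To make the idea of "predetermined rules" concrete, here is a minimal sketch of a rule-based tokenizer written with Python's standard `re` module. It is not how NLTK works internally; the name `simple_tokenize` and the single regular-expression rule are assumptions chosen purely for illustration.

```python
import re

# Hypothetical rule set (an assumption for this sketch, not NLTK's rules):
#   1. a run of word characters (letters, digits, underscore) is one token;
#   2. any other single non-whitespace character is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Return the list of tokens found in text, in order of appearance."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("This is a sample content!"))
# ['This', 'is', 'a', 'sample', 'content', '!']
```

On this simple sentence the sketch produces the same tokens as the NLTK call. A full tokenizer such as NLTK's applies many more rules (for contractions, abbreviations, and so on), so the two can differ on harder input.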

Exercises