Become a Key-Point Expert: A Beginner's Guide to Text Summarization in Machine Learning

# Looping through the paragraphs and adding them to the variable
for p in paragraphs:
    article_content += p.text

urllib.request is used to fetch the web page data, and BeautifulSoup is then called to parse it.

Step 2: Data processing

To keep the scraped text as free of noise as possible, some basic text cleaning is needed. Here we use NLTK's stopwords and PorterStemmer.

PorterStemmer reduces words to their stem, so that inflected forms such as cleaning and cleaned are both reduced to clean.

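We also need to create a dictionary that stores how often each word appears in the text.

We then loop over the entire text, dropping stop words such as “a” and “the” and recording the frequency of each remaining word.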
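As a rough, dependency-free illustration of the idea just described, the sketch below filters stop words and counts the remaining words. It is not the article's own implementation: the function name build_frequency_table, the tiny DEMO_STOP_WORDS set, and the regex tokenizer are all assumptions made for this example, and the stemmer is passed in as a plain callable that defaults to doing nothing.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for NLTK's English stop word list (kept tiny on purpose).
DEMO_STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "and", "to", "in"})


def build_frequency_table(
    text: str,
    stop_words: Iterable[str] = DEMO_STOP_WORDS,
    stem: Callable[[str], str] = lambda word: word,
) -> Dict[str, int]:
    """Count how often each non-stop word occurs in text, after stemming."""
    stop_set = set(stop_words)
    # Simplified tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stop words first, then stem whatever is left and count it.
    counts = Counter(stem(tok) for tok in tokens if tok not in stop_set)
    return dict(counts)


table = build_frequency_table(
    "The cat chased the mouse, and the mouse hid in a hole."
)
print(table)
```

With NLTK installed and its stopwords corpus downloaded, stopwords.words("english") and PorterStemmer().stem could be passed as the stop_words and stem arguments in place of the placeholders used here.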

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
