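newspaper: an open-source Python library for full-text and article metadata extraction
newspaper is an open-source Python library for extracting news, full text, and article metadata. It supports many natural languages, including Chinese; extracts several kinds of metadata, such as keywords, images, and summaries; and supports multi-threaded downloading.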
- Full Python3 and Python2 support
- Multi-threaded article download framework
- News url identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, ...)
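
The multi-threaded download feature listed above can be pictured with a small standard-library sketch. This is not newspaper's own implementation: `download_all` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever actually retrieves a page (for example, a function that calls `Article.download()`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return {url: result or exception}.

    `fetch` is a hypothetical callable that takes a URL and returns the page
    content; it is injected so the sketch does not depend on any library.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it belongs to.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download should not stop the others.
                results[url] = exc
    return results
```

newspaper's own documentation describes a `news_pool` object for downloading several sources in parallel; the sketch only illustrates the general idea of fanning downloads out over worker threads.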
>>> from newspaper import Article
>>> url = '...'
>>> article = Article(url)
>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.top_image
'...'
>>> article.movies
['http://youtube.com/path/to/link.com', ...]
>>> article.nlp()
>>> article.keywords
['New Years', 'resolution', ...]
>>> article.summary
'The study shows that 93% of people ...'
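
The session above can be wrapped in a small helper that collects the extracted fields into one dictionary. This is a minimal sketch, not part of newspaper: `summarize_article` and `fetch_article` are hypothetical names, and the helper only touches the methods and attributes shown in the session. `fetch_article` assumes the newspaper package is installed.

```python
def summarize_article(article, run_nlp=True):
    """Download and parse an Article-like object, then return its fields.

    Hypothetical helper; it relies only on the methods and attributes that
    appear in the interactive session above.
    """
    article.download()
    article.parse()
    record = {
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
    }
    if run_nlp:
        # In the session above, keywords and summary are read after nlp().
        article.nlp()
        record["keywords"] = list(article.keywords)
        record["summary"] = article.summary
    return record


def fetch_article(url):
    """Build an Article for `url` and summarize it.

    Assumes newspaper is installed; imported lazily so the helper above can
    be used (and tested) without it.
    """
    from newspaper import Article
    return summarize_article(Article(url))
```

Passing `run_nlp=False` skips the keyword and summary step when only the text and basic metadata are needed.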
>>> import newspaper
>>> cnn_paper = newspaper.build('...')
>>> for article in cnn_paper.articles:
...     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...
>>> for category in cnn_paper.category_urls():
...     print(category)
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...
>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...
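
Processing more than one article from a built source usually means looping with a limit and tolerating individual failures. The following is a rough sketch under those assumptions: `parse_first_articles` is a hypothetical helper, `paper` is any object with an `articles` list (such as the `cnn_paper` built above), and failures are assumed to surface as exceptions.

```python
def parse_first_articles(paper, limit=5):
    """Return (url, title) pairs for the first `limit` articles of a source.

    Hypothetical helper: `paper` only needs an `articles` list whose items
    offer download(), parse(), url and title.
    """
    parsed = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:
            # A single broken article should not abort the whole batch.
            print(f"skipping {article.url}: {exc}")
            continue
        parsed.append((article.url, article.title))
    return parsed
```

Called as `parse_first_articles(cnn_paper, limit=3)`, it would download and parse at most three articles and skip any that fail.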
>>> from newspaper import fulltext
>>> import requests
>>> html = requests.get(...).text
>>> text = fulltext(html)
Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto detect a language.
>>> from newspaper import Article
>>> url = '...'
>>> a = Article(url, language='zh')  # Chinese
>>> a.download()
>>> a.parse()
>>> print(a.text[:150])
香港行政長官梁振英在各方壓力下就其大宅的違章建 筑(僭建)問題到立法會接受質詢,并向香港民眾道歉。 梁振英在星期二(12月10日)的答問大會開始之際 在其演說中道歉,但強調他在違章建筑問題上沒有隱瞞的 意圖和動機。 一些親北京陣營議員歡迎梁振英道歉, 且認為應能獲得香港民眾接受,但這些議員也質問梁振英有
>>> print(a.title)
港特首梁振英就住宅違建事件道歉
If you are certain that an entire news source is in one language, go ahead and use the same api :)
>>> import newspaper
>>> sina_paper = newspaper.build('...', language='zh')
>>> for category in sina_paper.category_urls():
...     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...
>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()
>>> print(article.text)
新浪武漢汽車綜合 隨著汽車市場的日趨成熟, 傳統的“集全家之力抱得愛車歸”的全額購車模式已然過時, 另一種輕松的新興 車模式――金融購車正逐步成為時下消費者購 買愛車最為時尚的消費理念,他們認為,這種新穎的購車 模式既能在短期內 ...
>>> print(article.title)
兩年雙免0手續0利率 科魯茲掀背金融輕松購_武漢車市_武漢汽 車網_新浪汽車_新浪網
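
When the language of a source is not known in advance, a caller may want to choose the `language` argument from a text sample rather than hard-code it. The sketch below is a deliberately crude heuristic and is not how newspaper detects languages: `looks_chinese` and `pick_language` are hypothetical helpers that only count characters in the CJK Unified Ideographs block.

```python
def looks_chinese(text, threshold=0.3):
    """Return True if enough of the non-space characters are CJK ideographs.

    Hypothetical heuristic; the 0.3 threshold is an arbitrary assumption.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars) >= threshold


def pick_language(sample_text, default="en"):
    """Pick a language code suitable for the `language` keyword argument."""
    return "zh" if looks_chinese(sample_text) else default
```

The result could then be passed along, for instance `Article(url, language=pick_language(sample))`, where `sample` is any snippet of text already known to come from the source.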