newspaper: An Open-Source Python Library for Full-Text and Article Metadata Extraction

Posted by jopen 9 years ago | 37K reads | Tags: newspaper, Python development

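newspaper is an open-source Python library for extracting news, full text, and article metadata. It supports many natural languages, including Chinese; extracts several kinds of metadata, such as keywords, images, and summaries; and supports multi-threaded downloading.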

Full Python3 and Python2 support

Multi-threaded article download framework

News url identification

Text extraction from html

Top image extraction from html

All image extraction from html

Keyword extraction from text

Summary extraction from text

Author extraction from text

Google trending terms extraction

Works in 10+ languages (English, Chinese, German, Arabic, ...)

>>> from newspaper import Article
>>> url = '...'  # URL truncated in the source
>>> article = Article(url)
>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.top_image
'...'
>>> article.movies
['http://youtube.com/path/to/link.com', ...]
>>> article.nlp()
>>> article.keywords
['New Years', 'resolution', ...]
>>> article.summary
'The study shows that 93% of people ...'
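
The calls above have to run in a fixed order: `download()` fetches the HTML, `parse()` extracts the authors, date, text and images, and `nlp()` derives keywords and a summary from the parsed text. A minimal sketch of a helper that wraps this sequence follows. `collect_metadata` is a hypothetical name, not part of newspaper, and it assumes it is handed an already constructed `Article`-like object. Note that `nlp()` may also need NLTK tokenizer data installed, which this article does not cover.

```python
from typing import Any, Dict


def collect_metadata(article: Any) -> Dict[str, Any]:
    """Run download -> parse -> nlp on one article and gather the results.

    `collect_metadata` is a hypothetical helper for illustration only.
    `article` is assumed to behave like newspaper's Article object.
    """
    article.download()  # fetch the raw HTML for the article's URL
    article.parse()     # populate authors, publish_date, text, top_image
    article.nlp()       # populate keywords and summary from the parsed text

    return {
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
        "keywords": list(article.keywords),
        "summary": article.summary,
    }
```

With the library installed, this could be called as `collect_metadata(Article(url))`, where `url` is the address of the article.
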

>>> import newspaper
>>> cnn_paper = newspaper.build('...')  # URL truncated in the source
>>> for article in cnn_paper.articles:
...     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...
>>> for category in cnn_paper.category_urls():
...     print(category)
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...
>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...
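
A built source can list many articles, and downloading all of them one by one is slow. As a hedged sketch, the hypothetical helper below (`parse_first` is not a newspaper function) attempts only the first few entries of an article list such as `cnn_paper.articles` and skips any that fail. It catches `Exception` broadly because this article does not say which error a failed download raises.

```python
from itertools import islice
from typing import Any, Iterable, List, Tuple


def parse_first(articles: Iterable[Any], limit: int = 5) -> List[Tuple[str, str]]:
    """Download and parse at most `limit` articles.

    Returns (url, title) pairs for the articles that succeeded.
    Hypothetical helper; `articles` is assumed to yield Article-like objects.
    """
    results: List[Tuple[str, str]] = []
    # islice caps the number of articles attempted, not the number returned.
    for article in islice(articles, limit):
        try:
            article.download()
            article.parse()
        except Exception:
            # Broad on purpose: the failure type is an unknown here.
            continue
        results.append((article.url, article.title))
    return results
```

For example, `parse_first(cnn_paper.articles, limit=3)` would try the first three articles of the source built above.
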
>>> import requests
>>> from newspaper import fulltext
>>> html = requests.get(...).text
>>> text = fulltext(html)
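
When the HTML has already been fetched by other means, `fulltext` extracts the body text without building an `Article`. The sketch below wraps it in a hypothetical helper, `text_from_html`, that short-circuits on empty input. The `extractor` parameter is an assumption added here so the helper can be tried without the library installed, and `language` is assumed to be the keyword that `fulltext` accepts for choosing a language.

```python
from typing import Callable, Optional


def text_from_html(html: str, language: str = "en",
                   extractor: Optional[Callable[..., str]] = None) -> str:
    """Return the article body extracted from an HTML string.

    Hypothetical wrapper; `extractor` defaults to newspaper's `fulltext`.
    """
    if not html or not html.strip():
        return ""  # nothing to extract from empty input
    if extractor is None:
        # Imported lazily so this module loads even without newspaper installed.
        from newspaper import fulltext
        extractor = fulltext
    return extractor(html, language=language).strip()
```

A call such as `text_from_html(html)` mirrors `fulltext(html)` above, with surrounding whitespace trimmed from the result.
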

Newspaper provides seamless language extraction and detection. If no language is specified, it attempts to auto-detect one.

>>> from newspaper import Article
>>> url = '...'  # URL truncated in the source
>>> a = Article(url, language='zh') # Chinese

>>> a.download()
>>> a.parse()
>>> print(a.text[:150])
香港行政長官梁振英在各方壓力下就其大宅的違章建
筑（僭建）問題到立法會接受質詢，并向香港民眾道歉。
梁振英在星期二（12月10日）的答問大會開始之際
在其演說中道歉，但強調他在違章建筑問題上沒有隱瞞的
意圖和動機。 一些親北京陣營議員歡迎梁振英道歉，
且認為應能獲得香港民眾接受，但這些議員也質問梁振英有
>>> print(a.title)
港特首梁振英就住宅違建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same API :)

>>> import newspaper
>>> sina_paper = newspaper.build('...')  # URL and any further arguments truncated in the source
>>> for category in sina_paper.category_urls():
...     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...
>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()
>>> print(article.text)
新浪武漢汽車綜合 隨著汽車市場的日趨成熟，
傳統的“集全家之力抱得愛車歸”的全額購車模式已然過時，
另一種輕松的新興 車模式――金融購車正逐步成為時下消費者購
買愛車最為時尚的消費理念，他們認為，這種新穎的購車
模式既能在短期內
...
>>> print(article.title)
兩年雙免0手續0利率 科魯茲掀背金融輕松購_武漢車市_武漢汽
車網_新浪汽車_新浪網

Project homepage: http://www.baiduhome.net/lib/view/home/1432176001646

Article URL: http://www.baiduhome.net/lib/view/open1432176001646.html
