libextract: an open-source Python module for main-content extraction

Posted by xpkdi 10 years ago | 19K views | Python development, libextract

Extracts the main content of HT/XML documents based on statistical features.

Libextract provides two ready-made extractors: api.articles and api.tabular.

libextract.api.articles(document, encoding='utf-8', count=5)

Given an HTML document, and optionally the encoding and the number of predictions (count) to return (in descending rank), articles returns a list of HTML nodes likely to contain the article text of a given web page.

The extraction algorithm is based on text length. Refer to rodricios.github.io/eatiht for an in-depth explanation.
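
To make the text-length idea concrete, here is a minimal sketch using only the standard library. It is not libextract's actual code: the function name rank_by_text_length and the sample markup are hypothetical, and it assumes well-formed markup so that xml.etree.ElementTree can parse it (libextract itself relies on lxml).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def rank_by_text_length(markup, count=5):
    """Rank elements by the total length of text held by their direct children.

    Hypothetical helper illustrating a text-length heuristic; it is not
    part of libextract.
    """
    root = ET.fromstring(markup)
    scores = Counter()
    for parent in root.iter():
        for child in parent:
            text = (child.text or "").strip()
            if text:
                # Credit the parent with the length of each child's text.
                scores[parent] += len(text)
    # (element, score) pairs, highest score first.
    return scores.most_common(count)


# Hypothetical sample page: a short navigation bar and a longer article body.
SAMPLE = """<html><body>
<div id="nav"><a>Home</a><a>About</a></div>
<div id="main"><p>Information extraction is the task of automatically
extracting structured information.</p><p>It is usually applied to
unstructured documents.</p></div>
</body></html>"""

ranked = rank_by_text_length(SAMPLE)
for node, score in ranked:
    print(node.get("id"), score)
```

The element whose children carry the most text ranks first, which is why a block of paragraphs outranks a navigation bar in this sketch.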

libextract.api.tabular(document, encoding='utf-8', count=5)

Given an HTML document, and optionally the encoding and the number of predictions (count) to return (in descending rank), tabular returns a list of HTML nodes likely to contain "tabular" data (i.e. tables and table-like elements).
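
The source does not say how "table-like" elements are detected. One plausible heuristic, shown here purely as an assumption, is to score each element by how often its most common child tag repeats, since rows and list items tend to be repeated siblings. The function name rank_by_repeated_children and the sample page are hypothetical, and the sketch uses the standard library rather than lxml.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def rank_by_repeated_children(markup, count=5):
    """Rank elements by how often their most common child tag repeats.

    Hypothetical heuristic for spotting table-like elements; not confirmed
    to be what libextract does.
    """
    root = ET.fromstring(markup)
    scored = []
    for node in root.iter():
        child_tags = Counter(child.tag for child in node)
        if child_tags:
            # Score = frequency of the most common direct-child tag.
            scored.append((node, child_tags.most_common(1)[0][1]))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


# Hypothetical sample page with placeholder cell values.
PAGE = """<html><body>
<div><p>Intro paragraph.</p></div>
<table>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
<tr><td>gamma</td><td>3</td></tr>
</table>
</body></html>"""

best_node, best_score = rank_by_repeated_children(PAGE)[0]
print(best_node.tag, best_score)  # prints: table 4
```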

Installation

pip install libextract

Usage

Extracting text nodes from a Wikipedia page:

from requests import get
from libextract.api import articles

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = articles(r.content)

Libextract uses Python's de facto HT/XML processing library, lxml.

The predictions returned by both api.articles and api.tabular are lxml HtmlElement objects (along with the associated metric used to rank each prediction).

Therefore, you can access lxml's methods for post-processing.

>>> print(textnodes[0][0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

Tabular-data extraction is just as easy.

from libextract.api import tabular

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = tabular(height_data.content)

To convert an HT/XML element to a Python dict (and, you know, use it with Pandas and stuff):

>>> from libextract import clean
>>> clean.to_dict(tabs[0][0])
{'Entity': ['Monaco', 'Macau', 'Japan', 'Singapore', 'San Marino',
  ...}
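
For a rough idea of what such a conversion involves, the sketch below turns a table element into a dict of columns keyed by header. It is an illustration under stated assumptions, not the implementation of clean.to_dict: the function table_to_dict and the sample table are hypothetical, the first row is assumed to hold the headers, and the standard library stands in for lxml.

```python
import xml.etree.ElementTree as ET


def table_to_dict(table):
    """Map each header cell to the list of values in its column.

    Hypothetical helper; assumes the first row of the table holds headers.
    """
    rows = table.findall(".//tr")
    headers = ["".join(cell.itertext()).strip() for cell in rows[0]]
    columns = {header: [] for header in headers}
    for row in rows[1:]:
        # Pair each cell with the header in the same position.
        for header, cell in zip(headers, row):
            columns[header].append("".join(cell.itertext()).strip())
    return columns


# Hypothetical sample table with placeholder values.
SAMPLE_TABLE = """<table>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>"""

result = table_to_dict(ET.fromstring(SAMPLE_TABLE))
print(result)  # prints: {'Name': ['alpha', 'beta'], 'Value': ['1', '2']}
```

A dict of equal-length lists like this one is a shape that pandas.DataFrame accepts directly.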

Project homepage: http://www.baiduhome.net/lib/view/home/1431939836677
