ContentExtractor: Automatic Web Page Main-Content Extraction, Implemented in Java

Posted by jopen 10 years ago | 85K views | Tags: Algorithm, ContentExtractor

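Introduction

ContentExtractor is an open-source tool for extracting the main content of web pages. It is implemented in Java and offers very high extraction accuracy.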

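Algorithm

ContentExtractor extracts main content with the CEPR algorithm, which applies to almost any web page that contains main content. For an overview of the algorithm, see: http://dl.acm.org/citation.cfm?id=2505558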

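Tutorial

ContentExtractor's interface is very simple. You can extract the main content of a web page from either its URL or its HTML source:

Extracting the main content from a URL: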

public static void main(String[] args) throws Exception {
        // Fetch the page at the given URL and extract its main content
        String content = ContentExtractor.getContentByURL(
            "http://news.xinhuanet.com/world/2014-11/02/c_127166728.htm");
        System.out.println(content);
}
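
When many pages are processed in a row, it may help to keep one failed page from stopping the rest. The following is a minimal sketch of that idea, not part of the library. The `UrlExtractor` interface, the `extractAll` helper, and the stub extractor are hypothetical names introduced here so the example runs without the ContentExtractor jar. With the jar on the build path, `ContentExtractor::getContentByURL` from the tutorial could presumably be passed in place of the stub.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchExtractDemo {

    /** Hypothetical stand-in for a method such as ContentExtractor.getContentByURL. */
    @FunctionalInterface
    interface UrlExtractor {
        String extract(String url) throws Exception;
    }

    /**
     * Runs the extractor on each URL in order. A failed extraction is
     * recorded as null so that one bad page does not abort the batch.
     */
    static Map<String, String> extractAll(List<String> urls, UrlExtractor extractor) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                results.put(url, extractor.extract(url));
            } catch (Exception e) {
                System.err.println("Extraction failed for " + url + ": " + e.getMessage());
                results.put(url, null);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // ASSUMPTION: stub extractor used only so this sketch runs offline;
        // it is not how the library extracts content.
        UrlExtractor stub = url -> {
            if (!url.startsWith("http")) {
                throw new IllegalArgumentException("not an HTTP URL");
            }
            return "content of " + url;
        };
        Map<String, String> results =
                extractAll(List.of("http://example.com/a", "bad-url"), stub);
        results.forEach((url, content) -> System.out.println(url + " -> " + content));
    }
}
```

Recording failures as null keeps the result map aligned with the input list, so the caller can retry or log the failed URLs afterwards.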

Extracting the main content from HTML:

public static void main(String[] args) throws Exception {
        // Placeholder: replace with the HTML source you have already fetched
        String html = "the fetched HTML source";
        String content = ContentExtractor.getContentByHtml(html);
        System.out.println(content);
}
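
The HTML variant assumes the page source is already in a String. One way to obtain it, sketched here with only the Java standard library, is to read an input stream fully and decode it with the page's character encoding. The `readHtml` helper and the `HtmlSourceDemo` class are hypothetical names, not part of ContentExtractor, and the in-memory bytes stand in for a real network response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HtmlSourceDemo {

    /** Reads the whole stream and decodes the bytes with the given charset. */
    static String readHtml(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // ASSUMPTION: in-memory bytes replace a network response so the sketch runs offline.
        byte[] page = "<html><body><p>\u6b63\u6587</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            String html = readHtml(in, StandardCharsets.UTF_8);
            System.out.println(html);
            // With the library on the build path, html could then be passed
            // to ContentExtractor.getContentByHtml(html).
        }
    }
}
```

For a live page, the stream could come from a call such as `URI.create(url).toURL().openStream()`. The charset passed to `readHtml` should match the encoding the page declares, otherwise non-ASCII text will be garbled before extraction.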

Importing into a Project

Download ContentExtractor-{version}-bin.zip from the ContentExtractor GitHub page, https://github.com/hfut-dmic/ContentExtractor, unzip it, and add all of the resulting jar files to your project's build path.
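
After adding the jars, a quick check can confirm that a class is visible at run time. The sketch below uses only `Class.forName` from the standard library. The fully qualified class name in `main` is an assumption made for illustration and should be verified against the contents of the downloaded jar. `ClasspathCheckDemo` and `isOnClasspath` are hypothetical names.

```java
class ClasspathCheckDemo {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheckDemo.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical fully qualified name; check the jar for the real package.
        String extractorClass = "cn.edu.hfut.dmic.contentextractor.ContentExtractor";
        System.out.println("ContentExtractor found: " + isOnClasspath(extractorClass));
        System.out.println("java.lang.String found: " + isOnClasspath("java.lang.String"));
    }
}
```

If the first line reports that the class was not found, either the jars are missing from the build path or the assumed class name does not match the one in the jar.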

Contact Us

You are welcome to join the discussion group: 385105758

Email: wugq@hfut.edu.cn

Developers

ContentExtractor is developed by the DMIC team at Hefei University of Technology.

Project home page: http://www.baiduhome.net/lib/view/home/1414977630825

