HTML parser/extractor: woody

Posted by jopen 11 years ago | 69K views | woody, HTML manipulation libraries

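woody is an HTML parser/extractor for Java. Its usage closely resembles that of webmagic: it is a complete rewrite of webmagic's extraction templates, split out into a separate project so that it can be reused more easily.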

Some new features:

  • Multiple result data types (String, char, byte, short, int, long, double, float, String[], Set, List, Date)
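  • Support for user-defined script processing functions (currently configured as JavaScript functions)
  • Swappable CSS and XPath engines
  • Support for filters
  • Caching of CSS and XPath engine objects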
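To make the caching feature concrete, here is a minimal sketch of the general idea: compile each expression once and reuse the compiled object on later calls. It uses `java.util.regex.Pattern` as a stand-in for a CSS or XPath engine object. The class and method names (`ExpressionCache`, `compiled`, `extractFirst`) are hypothetical and are not part of woody's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from woody itself.
class ExpressionCache {

    // Compiled expressions, keyed by their source string.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns the cached Pattern for this expression, compiling it on first use.
    static Pattern compiled(String expr) {
        return CACHE.computeIfAbsent(expr, Pattern::compile);
    }

    // Returns group 1 of the first match, or null when nothing matches.
    // The expression is expected to contain at least one capturing group.
    static String extractFirst(String expr, String text) {
        Matcher m = compiled(expr).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String expr = "(\\d+) comments";
        System.out.println(extractFirst(expr, "12 comments")); // prints 12
        System.out.println(extractFirst(expr, "7 comments"));  // prints 7
        // Both lookups return the same compiled object.
        System.out.println(compiled(expr) == compiled(expr));  // prints true
    }
}
```

The expression string serves as the cache key, so a model class whose annotations are evaluated against many pages would pay the compilation cost only once per expression.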

A complete example:

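import java.util.Date;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// Plus woody's own classes and annotations (AnnotationExtractor, Model, Inject,
// ExtractBy, ComboExtract, ...); their package is not given here.

public class OsChinaBlog {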

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.oschina.net/news/43879/webmagic-0-3-0").timeout(60000)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0").get();
        String html = doc.html();
        OsChinaBlogModel model = AnnotationExtractor.me().process(html, OsChinaBlogModel.class);
        System.out.println(model.toJson());
    }

    public static class OsChinaBlogModel extends Model {

        public OsChinaBlogModel() {
            // no-arg constructor, used for reflection
        }

        @Inject
        @ComboExtract(value = { @ExtractBy(value = "h1.OSCTitle", type = ExprType.CSS),
                @ExtractBy(value = "//title/text()", type = ExprType.XPATH) }, op = OP.OR)
        public String title;

        @Inject
        @ExtractBy(value = "div.PubDate a[href~=http://my\\.oschina\\.net/]", type = ExprType.CSS)
        public String author;

        @Inject
        @ExtractBy(value = "發布于.\\s*(\\d+年\\d+月\\d+日)", type = ExprType.REGEX)
        public Date publishDate;

        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "div.PubDate", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "(\\d+)評", type = ExprType.REGEX) }, op = OP.AND)
        public int commentNum;

        @Inject
        @ExtractBy(value = "span#p_favor_count", type = ExprType.CSS, setting = @Setting(function = @Function(value = "replace", args = {
                "+", "" })))
        public int collectNum;

        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "div[id=userComments]", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "div.TextContent", type = ExprType.CSS) }, op = OP.AND, multi = true)
        public List<String> commentContents;

        @Inject
        @ExtractBy(value = "div[id=toolbar_wrapper]", setting = @Setting(fliters = { "b", "span" }), type = ExprType.CSS, impl = Document.class)
        public String weibo;

    }
}
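The post does not spell out what `op = OP.OR` and `op = OP.AND` mean in `@ComboExtract`. One plausible reading of the example, offered here only as an assumption, is that OR tries each expression against the same input and keeps the first non-empty result, while AND chains the expressions so that each one runs on the previous one's output. The sketch below illustrates that reading with plain regular expressions. `ComboSketch`, `regex`, `or`, and `and` are hypothetical names, and this is not woody's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of OR/AND combination; not woody's actual implementation.
class ComboSketch {

    // Builds an extractor that returns group 1 of the first match, or null.
    static Function<String, String> regex(String expr) {
        Pattern p = Pattern.compile(expr);
        return text -> {
            Matcher m = p.matcher(text);
            return m.find() ? m.group(1) : null;
        };
    }

    // OR (assumed): run each extractor on the same input, keep the first non-empty result.
    static String or(String input, List<Function<String, String>> extractors) {
        for (Function<String, String> e : extractors) {
            String r = e.apply(input);
            if (r != null && !r.isEmpty()) {
                return r;
            }
        }
        return null;
    }

    // AND (assumed): chain the extractors; each output becomes the next input.
    static String and(String input, List<Function<String, String>> extractors) {
        String current = input;
        for (Function<String, String> e : extractors) {
            current = e.apply(current);
            if (current == null) {
                return null; // one failed step fails the whole chain
            }
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<h1>woody</h1><div class=\"PubDate\">Posted today, 15 comments</div>";

        // The first expression finds nothing, so the second one supplies the title.
        String title = or(html, Arrays.asList(regex("<h2>(.*?)</h2>"), regex("<h1>(.*?)</h1>")));
        System.out.println(title); // prints woody

        // Narrow the input to the PubDate block first, then pull the number out of it.
        String count = and(html, Arrays.asList(
                regex("<div class=\"PubDate\">(.*?)</div>"), regex("(\\d+) comments")));
        int commentNum = Integer.parseInt(count); // string result converted to the field's type
        System.out.println(commentNum); // prints 15
    }
}
```

Under this reading, the `title` field falls back to the XPath expression only when the CSS selector finds nothing, and `commentNum` is read from inside the `div.PubDate` block instead of from the whole page.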

Project home page: http://www.baiduhome.net/lib/view/home/1378731525709
