HtmlExtractor: a template-based component, implemented in Java, for precise extraction of structured information from web pages
HtmlExtractor is a template-based component, implemented in Java, for precisely extracting structured information from web pages. It does not include a crawler itself, but it can be invoked by crawlers or other programs to extract structured information from web pages with high precision.
HtmlExtractor is designed for large-scale distributed environments and uses a master-slave architecture: the master node maintains the extraction rules, and slave nodes request the extraction rules from the master. When the extraction rules change, the master actively notifies the slaves, so rule changes take effect dynamically in real time.
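Rule selection is driven by URL patterns: a template is only applied to pages whose URL matches the pattern it is registered under. The matching step can be illustrated with the standard `java.util.regex` API, using the pattern and URL from the example further below (this is an illustration of the idea, not the component's internal code):

```java
import java.util.regex.Pattern;

public class UrlPatternDemo {
    // The URL pattern string used by the example template below, verbatim
    static final Pattern FINANCE_ARTICLE = Pattern.compile(
            "http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");

    static boolean matches(String url) {
        return FINANCE_ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Only URLs matching a template's pattern are handed to that template
        System.out.println(matches("http://money.163.com/08/1219/16/4THR2TMP002533QK.html")); // true
        System.out.println(matches("http://money.163.com/about.html")); // false
    }
}
```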
How to use it?
HtmlExtractor consists of two subprojects: html-extractor and html-extractor-web. html-extractor implements the extraction logic and acts as the slave node; html-extractor-web provides a web interface for maintaining the extraction rules and acts as the master node.

html-extractor is a jar and can be referenced via Maven:

```xml
<dependency>
    <groupId>org.apdplat</groupId>
    <artifactId>html-extractor</artifactId>
    <version>1.1</version>
</dependency>
```

html-extractor-web is a war and must be deployed to a Servlet/JSP container. Run `mvn jetty:run` in the html-extractor-web directory to start the Jetty Servlet/JSP container, then open a browser and visit http://localhost:8080/html-extractor-web/api/ to view the rules you have defined.

Note: an extraction counts as successful only if every CSS path and extraction expression defined in the page template succeeds. If even one CSS path or extraction expression fails, the whole extraction fails.
How do you use HtmlExtractor for template-based, precise extraction of structured information from web pages?
Standalone (single-machine) usage:
```java
//1. Build the extraction rules
List<UrlPattern> urlPatterns = new ArrayList<>();
//1.1 Build a URL pattern
UrlPattern urlPattern = new UrlPattern();
urlPattern.setUrlPattern("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
//1.2 Build an HTML template
HtmlTemplate htmlTemplate = new HtmlTemplate();
htmlTemplate.setTemplateName("NetEase finance channel");
htmlTemplate.setTableName("finance");
//1.3 Associate the URL pattern with the HTML template
urlPattern.addHtmlTemplate(htmlTemplate);
//1.4 Build a CSS path
CssPath cssPath = new CssPath();
cssPath.setCssPath("h1");
cssPath.setFieldName("title");
cssPath.setFieldDescription("Title");
//1.5 Associate the CSS path with the template
htmlTemplate.addCssPath(cssPath);
//1.6 Build another CSS path
cssPath = new CssPath();
cssPath.setCssPath("div#endText");
cssPath.setFieldName("content");
cssPath.setFieldDescription("Body text");
//1.7 Associate the CSS path with the template
htmlTemplate.addCssPath(cssPath);
//Multiple URL patterns can be built the same way
urlPatterns.add(urlPattern);
//2. Obtain the extraction rule object
ExtractRegular extractRegular = ExtractRegular.getInstance(urlPatterns);
//Note: the extraction rules can be changed dynamically via these three methods:
//extractRegular.addUrlPatterns(urlPatterns);
//extractRegular.addUrlPattern(urlPattern);
//extractRegular.removeUrlPattern(urlPattern.getUrlPattern());
//3. Obtain the HTML extractor
HtmlExtractor htmlExtractor = new DefaultHtmlExtractor(extractRegular);
//4. Extract a web page
String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
HtmlFetcher htmlFetcher = new JSoupHtmlFetcher();
String html = htmlFetcher.fetch(url);
List<ExtractResult> extractResults = htmlExtractor.extract(url, html);
//5. Print the results
int i = 1;
for (ExtractResult extractResult : extractResults) {
    System.out.println((i++) + ". Extraction result for page " + extractResult.getUrl());
    if (!extractResult.isSuccess()) {
        System.out.println("Extraction failed:");
        for (ExtractFailLog extractFailLog : extractResult.getExtractFailLogs()) {
            System.out.println("\turl:" + extractFailLog.getUrl());
            System.out.println("\turlPattern:" + extractFailLog.getUrlPattern());
            System.out.println("\ttemplateName:" + extractFailLog.getTemplateName());
            System.out.println("\tfieldName:" + extractFailLog.getFieldName());
            System.out.println("\tfieldDescription:" + extractFailLog.getFieldDescription());
            System.out.println("\tcssPath:" + extractFailLog.getCssPath());
            if (extractFailLog.getExtractExpression() != null) {
                System.out.println("\textractExpression:" + extractFailLog.getExtractExpression());
            }
        }
        continue;
    }
    Map<String, List<ExtractResultItem>> extractResultItems = extractResult.getExtractResultItems();
    for (String field : extractResultItems.keySet()) {
        List<ExtractResultItem> values = extractResultItems.get(field);
        if (values.size() > 1) {
            int j = 1;
            System.out.println("\tMulti-valued field: " + field);
            for (ExtractResultItem item : values) {
                System.out.println("\t\t" + (j++) + ". " + field + " = " + item.getValue());
            }
        } else {
            System.out.println("\t" + field + " = " + values.get(0).getValue());
        }
    }
    System.out.println("\tdescription = " + extractResult.getDescription());
    System.out.println("\tkeywords = " + extractResult.getKeywords());
}
```
Distributed (multi-machine) usage:
1. Run the master node, which maintains the extraction rules. Either run `mvn jetty:run` in the html-extractor-web directory, or run `mvn install` there and deploy target/html-extractor-web-1.0.war to Tomcat.

2. Obtain an HtmlExtractor instance (the slave node); sample code:

```java
String allExtractRegularUrl = "http://localhost:8080/HtmlExtractorServer/api/all_extract_regular.jsp";
String redisHost = "localhost";
int redisPort = 6379;
ExtractRegular extractRegular = ExtractRegular.getInstance(allExtractRegularUrl, redisHost, redisPort);
HtmlExtractor htmlExtractor = new DefaultHtmlExtractor(extractRegular);
```

3. Extract information; sample code:

```java
String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
HtmlFetcher htmlFetcher = new JSoupHtmlFetcher();
String html = htmlFetcher.fetch(url);
List<ExtractResult> extractResults = htmlExtractor.extract(url, html);
int i = 1;
for (ExtractResult extractResult : extractResults) {
    System.out.println((i++) + ". Extraction result for page " + extractResult.getUrl());
    if (!extractResult.isSuccess()) {
        System.out.println("Extraction failed:");
        for (ExtractFailLog extractFailLog : extractResult.getExtractFailLogs()) {
            System.out.println("\turl:" + extractFailLog.getUrl());
            System.out.println("\turlPattern:" + extractFailLog.getUrlPattern());
            System.out.println("\ttemplateName:" + extractFailLog.getTemplateName());
            System.out.println("\tfieldName:" + extractFailLog.getFieldName());
            System.out.println("\tfieldDescription:" + extractFailLog.getFieldDescription());
            System.out.println("\tcssPath:" + extractFailLog.getCssPath());
            if (extractFailLog.getExtractExpression() != null) {
                System.out.println("\textractExpression:" + extractFailLog.getExtractExpression());
            }
        }
        continue;
    }
    for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
        System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
    }
    System.out.println("\tdescription = " + extractResult.getDescription());
    System.out.println("\tkeywords = " + extractResult.getKeywords());
}
```