jsoup 1.7.1 Released: A Powerful HTML Parser for Java

Posted by jopen 12 years ago | 14K views | jsoup

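jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

The main features of jsoup are:

Parse HTML from a URL, file, or string;

Find and extract data using DOM traversal or CSS selectors;

Manipulate HTML elements, attributes, and text.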
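The features listed above can be sketched in a few lines. This is a minimal illustration, not an excerpt from the jsoup documentation: the class name `JsoupBasics`, the helper method names, and the sample HTML are made up for the example, and it assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class; only the org.jsoup calls come from the library.
class JsoupBasics {

    // Parse an HTML string and collect the href of every link inside div#content.
    static List<String> linkHrefs(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div#content a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Set an attribute and replace the text of the first link, then return its HTML.
    static String rewriteFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        if (link == null) {
            return ""; // no link in the document
        }
        link.attr("rel", "nofollow");
        link.text("jsoup home");
        return link.outerHtml();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<div id=\"content\"><a href=\"http://jsoup.org/\">jsoup</a></div>";
        System.out.println(linkHrefs(html));
        System.out.println(rewriteFirstLink(html));
    }
}
```

`select` takes a CSS selector, while `attr` and `text` act as both getters and setters depending on the number of arguments, which is where the jQuery-like feel comes from.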

This site has also translated the official Cookbook into Chinese: http://www.baiduhome.net/jsoup

Sample code:

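import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
// Two-argument overload: the file's location serves as the base URI
Document doc = Jsoup.parse(input, "UTF-8");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href"); // value of the href attribute
  String linkText = link.text();       // visible text of the link
}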

jsoup 1.7.1 has been released, with notable improvements in performance and stability. Downloads:
jsoup-1.7.1.jar core library

jsoup-1.7.1-sources.jar optional sources jar

jsoup-1.7.1-javadoc.jar optional javadoc jar
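Improvements:


- Improved parse time: 2.3 times faster than the previous release, with lower memory consumption.

- Reduced memory consumption and garbage collection when selecting elements.

- Removed unnecessary synchronisation in Tag.valueOf, so multi-threaded parsing runs faster.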

- Introduced finer granularity of exceptions in Jsoup.connect, including HttpStatusException and UnsupportedMimeTypeException, allowing programmers better control of error cases.

- In Jsoup.clean, allow custom Document.OutputSettings, to control pretty printing, character set, and entity escaping.

- Whitespace normalise document.title() output.

- In Jsoup.connect, fail faster if the return content type is not supported.

- Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.

- In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.

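- If a server doesn't specify a Content-Type header, treat that as OK.

- If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF8), instead of bailing with an unsupported charset exception.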
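The finer-grained exceptions mentioned in the list above let calling code tell an HTTP error apart from an unsupported content type. The sketch below shows one possible way to handle them. The class `FetchErrors`, its method names, the message wording, and the URL are hypothetical. Only `Jsoup.connect`, `HttpStatusException`, and `UnsupportedMimeTypeException` come from jsoup.

```java
import java.io.IOException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;

// Hypothetical helper class showing one way to react to the new exception types.
class FetchErrors {

    // Turn an exception thrown while fetching into a short, readable message.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            UnsupportedMimeTypeException m = (UnsupportedMimeTypeException) e;
            return "Unsupported content type " + m.getMimeType() + " for " + m.getUrl();
        }
        return "I/O error: " + e.getMessage();
    }

    // Fetch a page and return its title, or a description of what went wrong.
    static String titleOrError(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            return doc.title();
        } catch (IOException e) {
            return describe(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL for illustration; running this needs network access.
        System.out.println(titleOrError("http://example.com/"));
    }
}
```

Both exception types extend `IOException`, so code that only catches `IOException` should keep working unchanged.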




Bug fixes:

- Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.

- Fixed whitespace preservation in textarea tags.

- Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.

- Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
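To see what the textarea fix listed above means in practice, the following sketch compares the text extracted from an ordinary paragraph with the text extracted from a textarea. The class `WhitespaceDemo`, the helper `firstText`, and the sample markup are invented for the example, and the behaviour described in the comments is what one would expect from a jsoup version that includes the fix.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the sample markup is invented for illustration.
class WhitespaceDemo {

    // Return the text of the first element matching the CSS query, or "" if none matches.
    static String firstText(String html, String cssQuery) {
        Element el = Jsoup.parse(html).select(cssQuery).first();
        return el == null ? "" : el.text();
    }

    public static void main(String[] args) {
        String raw = "one\n   two";
        // In an ordinary element, text() collapses runs of whitespace into single spaces.
        System.out.println(firstText("<p>" + raw + "</p>", "p"));
        // Inside a textarea, the line break and indentation are expected to be kept.
        System.out.println(firstText("<textarea>" + raw + "</textarea>", "textarea"));
    }
}
```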