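jsoup 1.7.1 Released: A Powerful HTML Parser for Java
jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
jsoup's main features:
- Parse HTML from a URL, a file, or a string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.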
- jsoup-1.7.1.jar: core library
- jsoup-1.7.1-sources.jar: optional sources jar
- jsoup-1.7.1-javadoc.jar: optional javadoc jar
This site has also translated the official Cookbook into Chinese: http://www.baiduhome.net/jsoup
Sample code:
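    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    File input = new File("/tmp/input.html");
    // The base URI argument was lost in the source; "http://example.com/" is a placeholder.
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Element content = doc.getElementById("content");  // null if no element has this id
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");  // value of the href attribute
        String linkText = link.text();        // visible text of the link
    }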
jsoup 1.7.1 has been released, with considerable improvements in performance and stability. Download address:
Improvements:
- Improved parse time: 2.3x faster than the previous release, with lower memory consumption.
- Reduced memory consumption and garbage collection when selecting elements.
- Removed unnecessary synchronisation in Tag.valueOf, allowing multi-threaded parsing to run faster.
- Introduced finer granularity of exceptions in Jsoup.connect, including HttpStatusException and UnsupportedMimeTypeException, allowing programmers better control of error cases.
- In Jsoup.clean, allow custom Document.OutputSettings, to control pretty printing, character set, and entity escaping.
- Whitespace normalise document.title() output.
- In Jsoup.connect, fail faster if the return content type is not supported.
- Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.
- In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.
- If a server doesn't specify a Content-Type header, treat that as OK.
- If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF-8), instead of bailing with an unsupported charset exception.
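The charset fallback described in the last item can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not jsoup's actual implementation: the class `CharsetFallback` and its methods `resolve` and `decode` are hypothetical names introduced here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration only; jsoup's internal code may differ.
final class CharsetFallback {
    private CharsetFallback() {}

    /** Returns the declared charset, or UTF-8 if the name is missing, malformed, or unsupported. */
    static Charset resolve(String declaredName) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(declaredName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Fall back instead of propagating the exception to the caller.
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes a raw response body using the resolved charset. */
    static String decode(byte[] body, String declaredName) {
        return new String(body, resolve(declaredName));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // The declared charset does not exist, so the body is decoded as UTF-8.
        System.out.println(decode(body, "x-no-such-charset"));
        System.out.println(resolve("ISO-8859-1"));
    }
}
```

Falling back to UTF-8 trades strictness for robustness: a page with a bogus charset header still gets parsed, at the cost of possibly mis-decoding content that really was in some other encoding.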
Bug fixes:
- Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.
- Fixed whitespace preservation in textarea tags.
- Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.
- Fixed an issue when normalising whitespace for strings containing high-surrogate characters.

This article was uploaded and shared by user jopen.
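The Windows-1254 bug fix listed above mentions the Turkish locale. The source does not say what caused it, but one plausible cause (an assumption) is the well-known Java pitfall where case conversion depends on the locale. The sketch below shows only that general pitfall; the class `CharsetNameCase` and its method `upper` are hypothetical names, not part of jsoup.

```java
import java.util.Locale;

// Hypothetical demo class; it illustrates locale-sensitive case mapping in general.
final class CharsetNameCase {
    private CharsetNameCase() {}

    /** Upper-cases a charset name under an explicit locale rather than the JVM default. */
    static String upper(String name, Locale locale) {
        return name.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String name = "windows-1254";

        // Under Turkish rules, 'i' upper-cases to U+0130 (capital I with dot above),
        // so the result no longer equals the ASCII name.
        System.out.println(upper(name, turkish).equals("WINDOWS-1254"));     // false
        // Locale.ROOT gives locale-independent mapping.
        System.out.println(upper(name, Locale.ROOT).equals("WINDOWS-1254")); // true
    }
}
```

Passing `Locale.ROOT` (or `Locale.ENGLISH`) whenever identifiers such as charset names are case-normalised avoids this class of bug regardless of where the code runs.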