jsoup 1.7.1 released: a powerful HTML parser for Java

Posted by jopen 12 years ago | 14K views | jsoup

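jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations.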

The main features of jsoup are:

  1. Parse HTML from a URL, file, or string;
  2. Find and extract data using DOM traversal or CSS selectors;
  3. Manipulate HTML elements, attributes, and text.
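The features listed above can be sketched in a few lines. This is a minimal illustration, not code from the jsoup project: the class name `JsoupFeatureSketch`, its helper methods, and the sample markup are all hypothetical. It only relies on `Jsoup.parse(String)`, `select`, `attr`, and `text`.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class JsoupFeatureSketch {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<div id=\"content\"><a href=\"/docs\">Docs</a> <a href=\"/news\">News</a></div>";

    // Parse HTML from a string and collect the href of every link
    // matched by a CSS selector.
    static List<String> linkHrefs(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("#content a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Manipulate the first link: add an attribute and replace its text.
    static Element relabelFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("a").first();
        first.attr("rel", "nofollow").text("Documentation");
        return first;
    }

    public static void main(String[] args) {
        System.out.println(linkHrefs(HTML));
        System.out.println(relabelFirstLink(HTML).outerHtml());
    }
}
```

Returning the modified `Element` rather than a string keeps the sketch independent of how jsoup pretty-prints its output.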

    This site has also translated the official Cookbook into Chinese: http://www.baiduhome.net/jsoup

    Sample code:

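    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    File input = new File("/tmp/input.html");
    // The base URI argument is cut off in the source; "" is a placeholder assumption.
    Document doc = Jsoup.parse(input, "UTF-8", "");

    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href"); // value of the href attribute
        String linkText = link.text();       // visible text of the link
    }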
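The last argument of `Jsoup.parse` in the sample is the base URI, which jsoup uses to resolve relative links. The sketch below shows that effect with a string input and a CSS selector in place of the DOM calls. The class name `BaseUriSketch`, the helper method, and the `http://example.com/` address are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriSketch {
    // Resolve the first link inside #content against the given base URI.
    // Returns an empty string when no such link exists.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("#content a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><a href=\"/docs\">Docs</a></div>";
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
    }
}
```

`link.attr("href")` would still return the raw value `/docs`; `absUrl` is the call that applies the base URI.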

    jsoup 1.7.1 has been released, with many improvements in performance and stability. Downloads:

    • jsoup-1.7.1.jar core library
    • jsoup-1.7.1-sources.jar optional sources jar
    • jsoup-1.7.1-javadoc.jar optional javadoc jar

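      Improvements:

      - Improved parse time: about 2.3x faster than the previous release, with lower memory consumption.

      - Reduced memory consumption and garbage collection when selecting elements.

      - Removed unnecessary synchronisation in Tag.valueOf, so multi-threaded parsing runs faster.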

      - Introduced finer granularity of exceptions in Jsoup.connect, including HttpStatusException and UnsupportedMimeTypeException, allowing programmers better control of error cases.

      - In Jsoup.clean, allow custom Document.OutputSettings, to control pretty printing, character set, and entity escaping.

      - Whitespace normalise document.title() output.

      - In Jsoup.connect, fail faster if the return content type is not supported.

      - Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.

      - In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.

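      - If a server doesn't specify a Content-Type header, treat that as OK.

      - If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF8), instead of bailing with an unsupported charset exception.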
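The finer-grained exceptions from Jsoup.connect mentioned in the list above can be handled separately. The following is a rough sketch only: the class `ConnectErrorSketch` and its helpers are hypothetical, and the message formats are invented for illustration. It assumes that both exception types extend `IOException` and expose `getStatusCode()`, `getMimeType()`, and `getUrl()`.

```java
import java.io.IOException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;

class ConnectErrorSketch {
    // Turn a failure from Jsoup.connect into a short message (hypothetical helper).
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " from " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            UnsupportedMimeTypeException m = (UnsupportedMimeTypeException) e;
            return "unsupported type " + m.getMimeType() + " from " + m.getUrl();
        }
        return "I/O failure: " + e.getMessage();
    }

    // Fetch a page and return its title, or a description of what went wrong.
    static String fetchTitle(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            return doc.title();
        } catch (IOException e) {
            return describe(e);
        }
    }
}
```

Keeping the classification in `describe` separate from the network call means the error handling can be exercised without contacting a server.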

      Bug fixes:

      - Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.

      - Fixed whitespace preservation in textarea tags.

      - Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.

      - Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
