Apache Tika 1.11 released: a Java toolkit for document content extraction

Posted by jopen 10 years ago | 17K views | Apache Tika

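Tika is a toolkit for content extraction (a toolkit for text extracting). It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also provides a convenient extension API for adding support for third-party file formats.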

As of the early 0.2-SNAPSHOT version, Tika supported the following file formats:

PDF - via PDFBox
MS-* - via POI
HTML - uses NekoHTML to tidy malformed HTML into XHTML
OpenOffice formats - provided by Tika
Archive - zip, tar, gzip, bzip, etc.
RTF - provided by Tika
Java class - class parsing is done by ASM
Image - metadata extraction only
XML

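Tika's API is easy to use. At its core is the Parser interface, which defines a parse method:
public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
The stream parameter supplies the file stream to parse; the extracted text is passed to handler, and the metadata is written to metadata. In Tika 1.x releases, parse also takes a fourth ParseContext argument.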
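The handler side of this contract can be sketched with nothing but the JDK. This is a minimal sketch, not Tika code: the JDK's SAX parser stands in for a Tika Parser, and the names HandlerSketch, TextCollector, and extractText are hypothetical. It shows how a ContentHandler receives text as a series of SAX events and accumulates it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; the JDK SAX parser plays the role of a Tika Parser.
class HandlerSketch {

    /** Collects every run of character data reported through SAX events. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so always append.
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses well-formed XML/XHTML from the stream and returns its text content. */
    public static String extractText(InputStream stream) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello, </p><p>Tika</p></body></html>";
        try (InputStream in =
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(extractText(in));
        }
    }
}
```

With Tika on the classpath, a handler like this would be passed as the handler argument of parse instead of being driven by the JDK parser; Tika also ships ready-made handlers such as BodyContentHandler.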

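Tika's ParserUtils utility can return a suitable Parser based on a file's MIME type. Alternatively, Tika provides an AutoDetectParser, which picks a suitable Parser by examining format-specific signatures in the binary content (for example, magic bytes).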
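To make the idea of signature-based detection concrete, here is a minimal sketch in plain Java. It is not Tika's implementation, which relies on a much larger rule database; the class MagicSniffer and its methods are hypothetical, and only four well-known signatures are included. Unknown content falls back to application/octet-stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of magic-byte detection; not part of Tika.
class MagicSniffer {

    // MIME type -> leading bytes that identify the format.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", new byte[] {'%', 'P', 'D', 'F', '-'});
        MAGIC.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        MAGIC.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
        MAGIC.put("application/gzip", new byte[] {0x1F, (byte) 0x8B});
    }

    /** Returns the MIME type whose signature matches the start of head. */
    public static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream"; // generic fallback for unknown data
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead)); // prints "application/pdf"
    }
}
```

A detector of this kind only needs the first few bytes of a stream, which is why detection can run before any parser is chosen.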

Apache Tika 1.11 has been released. This version includes a large number of improvements and bug fixes:

* Java7 API support for allowing java.nio.file.Path as method arguments
    was added to Tika and to ParsingReader, TikaFileTypeDetector, and to
    Tika Config (TIKA-1745, TIKA-1746, TIKA-1751).

* MIME support was added for WebVTT: The Web Video Text Tracks Format
    files (TIKA-1772).

* MIME magic improved to ensure emails detected as message/rfc822
    (TIKA-1771).

* Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility
    with Bouncy Castle (TIKA-1736).

* Make div and other markup more consistent between PPT and
    PPTX (TIKA-1755).

* Parse multiple authors from MSOffice's semi-colon delimited
    author field (TIKA-1765).

* Include CTAKESConfig.properties within tika-parsers resources
    by default (TIKA-1741).

* Prevent infinite recursion when processing inline images
    in PDF files by limiting extraction of duplicate images
    within the same page (TIKA-1742).

* Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707).

* Upgraded tika-batch to use Path throughout (TIKA-1747 and
    TIKA-1754).

* Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744).

* Changed default content handler type for "/rmeta" in tika-server
    to "xml" to align with "-J" option in tika-app.
    Clients can now specify handler types via PathParam. (TIKA-1716).

* The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data
    for machine learning from PDF files is now integrated as a
    Tika parser (TIKA-1699, TIKA-1712).

* The ability to specify the Tesseract Config Path was added
    to the OCR Parser (TIKA-1703).

* Upgraded to ASM 5.0.4 (TIKA-1705).

* Corrected Tika Config XML detector definition explicit loading
    of MimeTypes (TIKA-1708).

* In Tika Parsers, Batch, Server, App and Examples, use Apache
    Commons IO instead of inlined ex-Commons classes, and the Java 7
    Standard Charset definitions (TIKA-1710).

* Upgraded to Commons Compress 1.10, which enables zlib compressed
    archives support (TIKA-1718).
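One item in the list above, parsing multiple authors from a semicolon-delimited author field (TIKA-1765), is easy to picture with a small example. The following is only a sketch of the general idea in plain Java, not the code Tika uses; the class AuthorField and the method splitAuthors are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting a delimited author field; not Tika code.
class AuthorField {

    /** Splits a semicolon-delimited field into trimmed, non-empty author names. */
    public static List<String> splitAuthors(String field) {
        List<String> authors = new ArrayList<>();
        if (field == null) {
            return authors;
        }
        for (String part : field.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                authors.add(name);
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        // prints "[Alice Example, Bob Sample]"
        System.out.println(splitAuthors("Alice Example; Bob Sample;"));
    }
}
```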

For the full list of changes, see:

http://www.apache.org/dist/tika/CHANGES-1.11.txt

Download: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.11-src.zip

Apache Tika is also available from Maven:

http://repo1.maven.org/maven2/org/apache/tika/

For more details, see the release notes.
