Apache Tika 1.8 released: a toolkit for document content extraction
Tika is a toolkit for text extraction. It integrates POI and PDFBox behind a single, unified interface for extracting text. Tika also provides a convenient extension API for adding support for third-party file formats.
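Tika's API is straightforward. At its core is the Parser interface, which defines a parse method:
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
The stream argument supplies the document to be parsed, the extracted text is delivered to handler, and the document's metadata is written into metadata. In Tika 1.x the method also takes a ParseContext, which carries per-parse configuration.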
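To make the division of roles concrete, here is a minimal sketch of the same contract using only the JDK. It is not Tika code: `SketchParser`, `PlainTextParser`, `TextCollector` and `extract` are hypothetical names, and a plain `Map` stands in for Tika's `Metadata` class. Only `org.xml.sax.ContentHandler` is the real SAX type that Tika parsers also write to.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: mirrors the stream / handler / metadata roles, not Tika's API.
final class ParseContractSketch {

    // Hypothetical stand-in for a Tika-style parser; metadata is a plain Map here.
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    // Treats the input as UTF-8 text and forwards it to the handler as SAX events.
    static final class PlainTextParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain"); // metadata is updated in place
            handler.startDocument();
            Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                handler.characters(buf, 0, n); // text content flows to the handler
            }
            handler.endDocument();
        }
    }

    // A handler that simply accumulates every character event it receives.
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    // Convenience wrapper: parse a byte array and return the collected text.
    static String extract(byte[] data, Map<String, String> metadata)
            throws IOException, SAXException {
        TextCollector handler = new TextCollector();
        new PlainTextParser().parse(new ByteArrayInputStream(data), handler, metadata);
        return handler.text();
    }
}
```

The caller owns both the handler and the metadata object, so the same parser can feed a text collector, an XML serializer, or an indexer without modification.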
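You can use Tika's ParserUtils utility to obtain a suitable Parser for a file based on its MIME type. Alternatively, Tika provides AutoDetectParser, which inspects format-specific signatures in the binary content (such as magic numbers) to find a suitable Parser.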
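The idea behind signature-based detection can be sketched with the JDK alone. The class below is a hand-rolled illustration, not Tika's detector: `MagicSniffer` and `detect` are hypothetical names, and the table holds only three well-known leading byte signatures. Tika's real detection is driven by its MIME type registry and is considerably more thorough.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a tiny magic-number sniffer, NOT Tika's implementation.
final class MagicSniffer {
    private static final int MAX_PREFIX = 8;

    // Well-known leading byte signatures mapped to media types.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", new byte[] {'%', 'P', 'D', 'F'});
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    // Peeks at the first bytes of the stream, then rewinds it so that a
    // parser chosen afterwards still sees the document from the start.
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_PREFIX);
        byte[] head = new byte[MAX_PREFIX];
        int filled = 0;
        try {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // stream is shorter than the longest signature
                }
                filled += n;
            }
        } finally {
            in.reset();
        }
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            byte[] sig = entry.getValue();
            if (filled >= sig.length && Arrays.equals(Arrays.copyOf(head, sig.length), sig)) {
                return entry.getKey();
            }
        }
        return "application/octet-stream"; // generic fallback for unknown content
    }
}
```

Rewinding after the peek is the important design point: detection must not consume bytes that the selected parser needs. Wrapping a raw stream in `java.io.BufferedInputStream` is one way to get the mark/reset support this sketch requires.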
- Fix null pointer when processing ODT footer styles (TIKA-1600).
- Upgrade com.drewnoakes' metadata-extractor to 2.0 and add parser for webp metadata (TIKA-1594).
- Duration extracted from MP3s with no ID3 tags (TIKA-1589).
- Upgraded to PDFBox 1.8.9 (TIKA-1575).
- Tika now supports the IsaTab data standard for bioinformatics both in terms of MIME identification and in terms of parsing (TIKA-1580).
- Tika server can now enable CORS requests with the command line "--cors" or "-C" option (TIKA-1586).
- Update jhighlight dependency to avoid using LGPL license. Thanks to @kkrugler for his great contribution (TIKA-1581).
- Updated HDF and NetCDF parsers to output file version in metadata (TIKA-1578 and TIKA-1579).
- Upgraded to POI 3.12-beta1 (TIKA-1531).
- Added tika-batch module for directory to directory batch processing. This is a new, experimental capability, and the API will likely change in future releases (TIKA-1330).
- Translator.translate() Exceptions are now restricted to TikaException and IOException (TIKA-1416).
- Tika now supports MIME detection for Microsoft Enhanced Metafiles (EMF) (TIKA-1554).
- Tika has improved delineation in XML and HTML MIME detection (TIKA-1365).
- Upgraded the Drew Noakes metadata-extractor to version 2.7.2 (TIKA-1576).
- Added basic style support for ODF documents, contributed by Axel Dörfler (TIKA-1063).
- Move Tika server resources and writers to separate org.apache.tika.server.resource and writer packages (TIKA-1564).
- Upgrade UCAR dependencies to 4.5.5 (TIKA-1571).
- Fix Paths in Tika server welcome page (TIKA-1567).
- Fixed infinite recursion while parsing some PDFs (TIKA-1038).
- XHTMLContentHandler now properly passes along body attributes, contributed by Markus Jelsma (TIKA-995).
- TikaCLI option --compare-file-magic to report mime types known to the file(1) tool but not known / fully known to Tika.
- MediaTypeRegistry support for returning known child types.
- Support for excluding (blacklisting) certain Parsers from being used by DefaultParser via the Tika Config file, using the new parser-exclude tag (TIKA-1558).
For details, see the release page.
This release is now available for download:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip
Source: http://www.oschina.net/news/61711/apache-tika-1-8-released