Open-source search engine Nutch 1.9 released
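Nutch is an open-source Java search engine package that provides all the tools and functionality needed to build a search engine. With Nutch you can build a search engine for your own intranet as well as one that covers the entire web. Beyond this basic functionality, Nutch has several distinctive features of its own, such as MapReduce, Hadoop, and plugins.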
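Taken as a whole, Nutch has three main parts: crawling, indexing, and searching. The web db is the set of URLs that Nutch starts from; the Fetcher is the component that downloads web pages, commonly known as the crawler; the indexer builds the index, generating index files and storing them on the system; the searcher is the query component, which searches for a given term and returns the results.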
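The division of labor among these parts can be sketched with a toy in-memory pipeline. This is only an illustration under stated assumptions, not Nutch's actual API: the class `MiniCrawlPipeline`, the methods `fetch`, `index`, and `search`, the `FAKE_WEB` map, and the example.org URLs are all hypothetical, and the "fetcher" reads from a map instead of making HTTP requests.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy pipeline; none of these names come from Nutch itself.
public class MiniCrawlPipeline {

    // Stand-in for the web: URL -> page text. A real fetcher would use HTTP.
    static final Map<String, String> FAKE_WEB = new LinkedHashMap<>();
    static {
        FAKE_WEB.put("http://example.org/a", "Nutch is an open source search engine");
        FAKE_WEB.put("http://example.org/b", "Hadoop runs MapReduce jobs");
    }

    // "Fetcher": collects the content of every URL in the web db that exists.
    static Map<String, String> fetch(List<String> webDb) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String url : webDb) {
            String body = FAKE_WEB.get(url);
            if (body != null) {
                pages.put(url, body);
            }
        }
        return pages;
    }

    // "Indexer": builds an inverted index from lowercase term to sorted URLs.
    static Map<String, Set<String>> index(Map<String, String> pages) {
        Map<String, Set<String>> inverted = new HashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String[] terms = page.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return inverted;
    }

    // "Searcher": looks up a single term and returns the matching URLs.
    static Set<String> search(Map<String, Set<String>> inverted, String term) {
        Set<String> hits = inverted.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : hits;
    }

    public static void main(String[] args) {
        // The "web db": the initial set of URLs, one of which cannot be fetched.
        List<String> webDb = Arrays.asList(
                "http://example.org/a",
                "http://example.org/b",
                "http://example.org/missing");
        Map<String, Set<String>> inverted = index(fetch(webDb));
        System.out.println(search(inverted, "search"));
    }
}
```

In this sketch each stage consumes only the output of the previous one, which mirrors the crawl, index, search split described above; a real deployment would persist the fetched pages and index on disk rather than in memory.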
Apache Nutch 1.9 was released recently. The main changes include:
Improvements
[NUTCH-1502] - Test for CrawlDatum state transitions
[NUTCH-1561] - improve usability of parse-metatags and index-metadata
[NUTCH-1676] - Add rudimentary SSL support to protocol-http
[NUTCH-1745] - Upgrade to ElasticSearch 1.1.0
[NUTCH-1747] - Use AtomicInteger as semaphore in Fetcher
[NUTCH-1757] - ParserChecker to take custom metadata as input
[NUTCH-1758] - IndexChecker to send document to IndexWriters
[NUTCH-1772] - Injector does not need merging if no pre-existing crawldb
[NUTCH-1782] - NodeWalker to return current node
[NUTCH-1787] - update and complete API doc overview page
[NUTCH-1794] - IndexingFilterChecker to optionally dumpText
[NUTCH-1799] - ANT Eclipse task discovers all plugin jars automatically
New Features
[NUTCH-207] - Bandwidth target for fetcher rather than a thread count
[NUTCH-1327] - QueryStringNormalizer
[NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc