Open-Source Search Engine Nutch 1.9 Released
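Nutch is an open-source search engine package written in Java. It provides all the tools and functionality needed to build a search engine. With Nutch you can build a search engine for your own intranet, and you can also build one that covers the whole web. Beyond the basics, Nutch has several distinguishing features of its own, such as MapReduce, Hadoop, and plugins.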
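Viewed as a whole, Nutch has three main parts: crawling, indexing, and searching. The Web db is the set of URLs that Nutch starts from. The Fetcher is the component that downloads web pages, which is what people usually call the crawler. The indexer builds the index, generating the index files and storing them on the system. The searcher is the query component, which looks up a given term and returns the results.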
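
The three stages described above can be pictured with a toy pipeline. This is only a minimal sketch and does not use the Nutch API: the class name MiniCrawlPipeline, the methods fetch, index and search, and the in-memory fakeWeb map are all hypothetical names made up for illustration, and a real fetcher would download pages over HTTP instead of reading them from a map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: none of these names come from Nutch.
public class MiniCrawlPipeline {

    // "Fetcher" stand-in: resolves each seed URL against an in-memory fake web.
    // URLs that are not present in the fake web are silently skipped.
    static Map<String, String> fetch(List<String> seedUrls, Map<String, String> fakeWeb) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String url : seedUrls) {
            String body = fakeWeb.get(url);
            if (body != null) {
                pages.put(url, body);
            }
        }
        return pages;
    }

    // "indexer" stand-in: builds an inverted index from lower-cased term to URLs.
    static Map<String, Set<String>> index(Map<String, String> pages) {
        Map<String, Set<String>> inverted = new HashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String term : page.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return inverted;
    }

    // "searcher" stand-in: returns the URLs containing the term, in sorted order.
    static List<String> search(Map<String, Set<String>> inverted, String term) {
        Set<String> hits = inverted.get(term.toLowerCase());
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("http://example.org/a", "Nutch is an open source crawler");
        fakeWeb.put("http://example.org/b", "Hadoop runs MapReduce jobs");

        // The seed list plays the role of the initial URL set.
        List<String> seeds = Arrays.asList(
                "http://example.org/a",
                "http://example.org/b",
                "http://example.org/missing");

        Map<String, String> pages = fetch(seeds, fakeWeb);
        Map<String, Set<String>> inverted = index(pages);
        System.out.println(search(inverted, "crawler"));
    }
}
```

The stages only communicate through plain data (a URL list, a page map, an inverted index), which mirrors the separation between the Web db, Fetcher, indexer, and searcher in the description, although the real components are far more involved.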

Apache Nutch 1.9 was released recently. The main changes include:
Improvements
- [NUTCH-1502] - Test for CrawlDatum state transitions 
- [NUTCH-1561] - improve usability of parse-metatags and index-metadata 
- [NUTCH-1676] - Add rudimentary SSL support to protocol-http 
- [NUTCH-1745] - Upgrade to ElasticSearch 1.1.0 
- [NUTCH-1747] - Use AtomicInteger as semaphore in Fetcher 
- [NUTCH-1757] - ParserChecker to take custom metadata as input 
- [NUTCH-1758] - IndexChecker to send document to IndexWriters 
- [NUTCH-1772] - Injector does not need merging if no pre-existing crawldb 
- [NUTCH-1782] - NodeWalker to return current node 
- [NUTCH-1787] - update and complete API doc overview page 
- [NUTCH-1794] - IndexingFilterChecker to optionally dumpText 
- [NUTCH-1799] - ANT Eclipse task discovers all plugin jars automatically 
New features
- [NUTCH-207] - Bandwidth target for fetcher rather than a thread count 
- [NUTCH-1327] - QueryStringNormalizer 
- [NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc