Tag Archives: Apache Nutch

The Apache Nutch Virtual Appliance


The Apache Nutch Virtual Appliance (NVA) is under development.

Inspired by the Google Appliance, a black-box search appliance, I recently decided to start working on an Apache Nutch Virtual Appliance.

Project in progress

Since this is a project in progress I can’t show you anything yet, but as development progresses you will of course be kept informed through this page.

Requirements

 

User Requirements

From the user’s perspective, I thought the best model to follow is the Google Appliance. This implies the following specifications:

  • Area of application: Just like the Google Appliance, the NVA will crawl intranets and business networks only, so there is no need to crawl the whole Internet. This eliminates the need for separate Hadoop instances.
  • Plug & play: The user shouldn’t be bothered with technical details such as complicated configurations or difficult XML files. The only thing the user should have to do is plug the VA into the network; the rest should work automatically.
  • Basic configuration should be done when the VA is first started and connected to the network.
  • Fetching, parsing and indexing should all happen in the background, so that the system stays responsive throughout the crawl.
  • During fetching, parsing and indexing, the user should be able to follow the progress on a dashboard, preferably from a web browser.
  • Crawl speed is not a priority, since the whole process runs in the background.
  • The search interface should be clear and intuitive.
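
To make the "intranets only" and "plug & play" requirements a bit more concrete, here is a minimal sketch of how the appliance could generate Nutch's regex-urlfilter.txt from the domains entered at first start, so the user never edits the file by hand. The function names and the idea of a first-start domain list are assumptions made for illustration, not part of the project. The rule format (a leading + or - followed by a regular expression, where the first matching rule decides) is the one Nutch uses for this file.

```python
import re


def build_url_filter(hosts):
    """Return regex-urlfilter.txt content that accepts only the given domains.

    `hosts` is a hypothetical list of plain DNS names collected during
    first-start setup, e.g. ["intranet.example.com"].
    """
    lines = [
        "# Generated at first start: accept only the configured intranet domains.",
        "-^(file|ftp|mailto):",
    ]
    for host in hosts:
        escaped = host.replace(".", r"\.")
        # Accept the domain itself and any subdomain, over http or https.
        lines.append(r"+^https?://([a-z0-9-]+\.)*" + escaped + "/")
    # Reject every URL that no earlier rule accepted.
    lines.append("-.")
    return "\n".join(lines) + "\n"


def accepts(filter_text, url):
    """Emulate first-match-wins evaluation so the rules can be checked locally."""
    for line in filter_text.splitlines():
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: the URL is ignored.
    return False


if __name__ == "__main__":
    rules = build_url_filter(["intranet.example.com"])
    print(rules)
    print(accepts(rules, "http://wiki.intranet.example.com/page"))
    print(accepts(rules, "https://www.google.com/"))
```

The accepts helper is only there to sanity-check the generated rules without starting a crawl; Nutch itself evaluates the file.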

Technical requirements

  • Everything should be open source, including the hypervisor. We will use Apache Nutch 1.x for performance reasons. The development hypervisor will be Oracle VirtualBox. The guest OS will be Linux, probably the Debian distribution.
  • The search interface will be based on Apache Velocity. At present this looks like the most promising framework for building user-friendly, Solr-based interfaces.
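
Whatever the front end ends up looking like, a Solr-based search page boils down to sending the user's query to Solr's select handler. The sketch below shows one way to build such a request URL with paging. The host, port and the core name "nutch" are assumptions for illustration only, and the Velocity templating layer is left out.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, user_query, page=0, rows=10):
    """Build a Solr select URL for one page of results.

    `base_url` and `core` are assumptions about how the appliance is set up,
    e.g. "http://localhost:8983/solr" and "nutch".
    """
    params = {
        "q": user_query,        # the text typed into the search box
        "start": page * rows,   # offset of the first result on this page
        "rows": rows,           # number of results per page
        "wt": "json",           # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(solr_select_url("http://localhost:8983/solr", "nutch", "annual report", page=2))
```

Keeping the URL construction in one small function means the search page only has to pass along the query text and the page number.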

Timelines & Planning

I am planning to do this in my spare time, so it’s difficult for me to give an accurate estimate of the delivery date.

Project members sought

If you are an Apache Nutch developer and interested in joining me on this project, please contact me.

Nutch 1 vs Nutch 2

Introduction

Nutch version 2 has been out for 7 years now, while version 1 is still available and under active development, so we can currently choose between two major branches: 1.x and 2.x. The main difference is storage. The 2.x branch stores its crawl data in NoSQL databases, whereas Nutch 1.x keeps its crawl data in Hadoop file structures (on HDFS or the local file system) and sends its index to Apache Solr.

The backend for the NoSQL connectivity is provided by Apache Gora, which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores, distributed in-memory key/value stores, in-memory data grids, in-memory caches, distributed multi-model stores, and hybrid in-memory architectures.

Gora also enables analysis of data with extensive Apache Hadoop MapReduce™ and Apache Spark™ support. Gora uses the Apache Software License v2.0. Gora graduated from the Apache Incubator in January 2012 to become a top-level Apache project.

At DigitalPebble, a benchmark compared the two versions to determine which was faster. It concluded that Nutch 1.x was still faster on all fronts, mainly because Gora adds a lot of overhead in 2.x.

Conclusion

At present there’s no need to upgrade from Nutch 1 to Nutch 2, and this will probably not change in the coming years.