The Apache Nutch Virtual Appliance

The Apache Nutch Virtual Appliance (NVA) is under development.

Inspired by the Google Appliance, a black-box search appliance, I recently decided to start working on an Apache Nutch Virtual Appliance.

Project in progress

Since this is a project in progress I can’t show you anything yet, but as development progresses you will of course be kept informed through this page.

Requirements

 

User Requirements

From the user’s perspective, I thought the best model to follow is the Google Appliance. This implies the following specifications:

  • Area of Application. Just like the Google Appliance, our NVA will crawl intranets and business networks only, so there is no need to crawl the whole Internet. This eliminates the need for separate Hadoop instances.
  • Plug & play – The user shouldn’t be bothered with technical details such as complicated configurations or difficult XML files. The only thing the user should have to do is plug the VA into the network; the rest should work automatically.
  • Basic configuration should be done when the VA is first started and connected to the network.
  • The Fetching, Parsing and Indexing should all happen in the background to keep the system responsive during the whole process of crawling and Indexing.
  • During Fetching, Parsing and Indexing, the user should be able to follow the progress on a Dashboard, preferably from a web browser.
  • We are absolutely not concerned with speed since the whole process will work in the background.
  • The search interface should be clear and intuitive.
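
The intranet-only scope and the plug & play requirement above suggest that the appliance could generate Nutch's URL filter rules itself during first-start configuration, rather than asking the user to edit them. The following is a minimal sketch, assuming the setup step collects a list of internal host names. The function name `build_url_filter` and the example domains are hypothetical; the output follows the `+`/`-` rule format of Nutch's `regex-urlfilter.txt`, where the first matching rule decides.

```python
import re


def build_url_filter(domains):
    """Return regex-urlfilter.txt style rules that accept only the given domains.

    `domains` is a hypothetical list of intranet host names gathered during
    first-start configuration; nothing here is part of Nutch itself.
    """
    lines = ["# Generated by the appliance setup (illustrative sketch)"]
    unique = sorted({d.strip().lower() for d in domains if d.strip()})
    for domain in unique:
        # Accept the domain and any of its subdomains, over http or https,
        # with an optional port number.
        lines.append(
            r"+^https?://([a-z0-9-]+\.)*" + re.escape(domain) + r"(:[0-9]+)?(/|$)"
        )
    # Reject every URL that no accept rule matched.
    lines.append("-.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_url_filter(["intranet.example.com", "wiki.example.local"]), end="")
```

Writing the result to the filter file and restarting the crawl would be the appliance's job; how that is wired up is left open here.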

Technical requirements

  • Everything should be Open Source, including the hypervisor. We will use Apache Nutch 1.x because of performance considerations. The development hypervisor will be Oracle VirtualBox. The guest OS will be Linux, probably the Debian distribution.
  • The search interface will be based on Apache Velocity, which is at present the most promising framework for creating user-friendly, Solr-based search interfaces.
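
Whatever templating layer renders the results, the search page ultimately sends a query to Solr's select handler. As a rough illustration, the sketch below builds such a request URL. The base URL, the core name `nutch`, and the function name `build_search_url` are assumptions made for this example; `q`, `rows`, `start` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_search_url(query, base_url="http://localhost:8983/solr/nutch", rows=10, start=0):
    """Build a Solr select URL for a user query.

    The default base URL and core name are hypothetical placeholders,
    not a decided part of the appliance.
    """
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    # urlencode escapes the user input, e.g. spaces become '+'.
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("holiday policy"))
```

Keeping URL construction in one small function means paging (`start`, `rows`) can be added to the interface without touching the templates.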

Timelines & Planning

I am planning to do this in my spare time so it’s difficult for me to give an accurate estimation on the delivery date.

Project members sought

If you are an Apache Nutch developer and interested in joining me on this project, then please contact me.
