Enterprise Search

Introduction

In another post I already described how I built an Enterprise Search solution for the Dutch Ministry of Defense.

In this post I will describe how I developed a connector that made Lotus Notes and Domino documents searchable, plus a crawler that crawled Lotus Notes documents directly from Notes servers. I will also explain why Apache Solr is a real Enterprise Search server.


The figure above shows what most people regard as Enterprise Search.

According to Wikipedia, enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience; the term is also used to describe the software that searches for information within an enterprise (though the search function and its results may still be public).

I have been involved in Apache Nutch development, which can be used as a web search engine but also as an intranet search engine. But what if your data (documents) is not only stored as files on simple file systems shared by web servers?

One example is Lotus Notes databases, which usually contain Lotus Notes documents. Similar to Lotus Notes are Microsoft Exchange and SharePoint, which can contain (what else?) Office documents.

How do you crawl and index these systems? Pretty straightforward, as long as it's Microsoft: use Nutch to crawl the web content and fetch all the Office documents, and Solr will do the rest.
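In Nutch 1.x-era releases this was roughly a one-liner (the seed directory, crawl depth, fetch limit, and Solr URL below are placeholders):

```
# Crawl the seed URLs, fetch the documents, and index the results into Solr.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 -solr http://localhost:8983/solr/
```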

For Notes it's a different story. The web content is often shown in views which are categorized. Why? Because the developer thought it was a good idea.

So what happens? The views show the same documents over and over, categorized by name, date, author, subject, and so on. This usually results in crawling madness: although Nutch stores a hash code for every document to prevent duplicates, the crawl slows down and eventually stops. The crawler is exhausted, and the WebDB and LinkDB are overloaded.

So we needed a more structured solution. Luckily, Domino has a Java API, and there are several ways to loop through all Notes documents.

I came up with the following solution:

Apache Solr comes with a Java client API named SolrJ. It lets you construct Solr documents as annotated JavaBeans, which can be collected in an ArrayList, a growable array that holds live objects.
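As a minimal sketch of such a bean (the field names are my own assumptions and must match your Solr schema):

```java
import org.apache.solr.client.solrj.beans.Field;

// A SolrJ bean: the @Field annotation maps each member onto the
// Solr schema field of the same name.
public class NotesDoc {
    @Field public String id;
    @Field public String title;
    @Field public String author;
    @Field public String body;
}
```

A whole list of these beans can then be indexed in one call with SolrClient.addBeans(), as shown after the steps below.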

  1. Get a DocumentCollection with all documents in the database.
  2. Get the first document.
  3. Convert it to a Solr document.
  4. Get the next document.
  5. Convert it to a Solr document.
  6. Repeat until all documents are processed.
  7. Send all Solr documents to the Solr server.
  8. Done.
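Here is a minimal sketch of that loop, assuming the NotesDoc bean above, the Domino Java API (lotus.domino), and a recent SolrJ client (older SolrJ versions use HttpSolrServer instead of HttpSolrClient); the Notes item names "Subject", "From", and "Body" are assumptions about the form design:

```java
import java.util.ArrayList;
import java.util.List;

import lotus.domino.Database;
import lotus.domino.Document;
import lotus.domino.DocumentCollection;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class NotesCrawler {

    // Walk every document in a Notes database and index it in Solr.
    public static void index(Database db, String solrUrl) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
        List<NotesDoc> docs = new ArrayList<>();

        // Step 1: a DocumentCollection with all documents in the database.
        DocumentCollection all = db.getAllDocuments();

        // Steps 2-6: first document, convert, next document, repeat.
        Document doc = all.getFirstDocument();
        while (doc != null) {
            docs.add(toBean(doc));
            Document next = all.getNextDocument(doc);
            doc.recycle();   // release the Notes back-end handle
            doc = next;
        }

        // Steps 7-8: send all documents to the Solr server and commit.
        solr.addBeans(docs);
        solr.commit();
        solr.close();
    }

    // Map a few common Notes items onto the bean; adjust to your forms.
    private static NotesDoc toBean(Document doc) throws Exception {
        NotesDoc bean = new NotesDoc();
        bean.id = doc.getUniversalID();   // the UNID is unique per document
        bean.title = doc.getItemValueString("Subject");
        bean.author = doc.getItemValueString("From");
        bean.body = doc.getItemValueString("Body");
        return bean;
    }
}
```

For very large databases you would send the beans in batches instead of one big list, but the flow stays the same.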

Under the hood, Solr uses Apache Tika (known in Solr as Solr Cell), which converts each document to XHTML, so it is actually XML. To display the search results nicely in the browser, XPath expressions were constructed to show neat titles together with abstracts.
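To illustrate the idea (not the exact code from the project): Tika can parse any supported document into XHTML, and since XHTML is well-formed XML, plain XPath can pull out a title and a first-paragraph abstract. The XPath expressions below are examples, not the ones used in production:

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class TikaXPathDemo {
    public static void main(String[] args) throws Exception {
        // Let Tika detect the document type and render it as XHTML.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new FileInputStream(args[0])) {
            new AutoDetectParser().parse(in, handler, new Metadata());
        }
        String xhtml = handler.toString();

        // XHTML is XML, so a standard DOM + XPath pipeline works on it.
        org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // local-name() sidesteps the XHTML namespace.
        String title = xpath.evaluate("//*[local-name()='title']", dom);
        String firstParagraph = xpath.evaluate("(//*[local-name()='p'])[1]", dom);

        System.out.println("Title:    " + title);
        System.out.println("Abstract: " + firstParagraph);
    }
}
```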

To import data into Solr, Solr is equipped with the DataImportHandler, which can cope with different formats and data sources, such as XML files or even JDBC databases.
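A typical DataImportHandler configuration for a JDBC source looks like the sketch below (the driver, URL, credentials, and column names are placeholders):

```xml
<dataConfig>
  <!-- Hypothetical JDBC source; replace driver, url, and credentials. -->
  <dataSource type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/docs"
              user="solr"
              password="secret"/>
  <document>
    <!-- Each row becomes one Solr document. -->
    <entity name="doc" query="SELECT id, title, body FROM documents">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
      <field column="body"  name="body"/>
    </entity>
  </document>
</dataConfig>
```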

It must be said that importing (indexing) data is extremely fast: a Notes database with 5,000 documents imported in less than a minute.

People have also reported importing the full Wikipedia in less than an hour, which is amazing!

Conclusion

Altogether, it can be said that Apache Solr is a true Enterprise Search server that can compete with commercial products such as Endeca, G2 Crowd, Swiftype, or Algolia.
