INFORMATION RETRIEVAL IN THE REAL WORLD, A TRUE STORY!

This is a true story

History

Back in 2008, I was working on a Search project at a large Dutch governmental organization with branches all over the world. 
It was a research project in which two commercial Enterprise Search products were evaluated and then compared with an Open Source equivalent. This story is more than a simple product comparison: it also touches on some core Information Retrieval subjects.

The Job

The Open Source candidate was Apache Nutch and I was the consultant to implement it.

 

When I first came into the office, I had a meeting with my Project Manager, who showed me around the department and introduced me to some of the other engineers, as well as to some of the management staff and officers I was going to work with.
Everything in this organization was highly confidential, so I can’t tell you many details, partly because I didn’t get all the information myself and partly because the project was classified (the organization was the Dutch Ministry of Defense).
My PM was a nice and reasonable man with the rank of Colonel. I already knew the organization: 20 years earlier I had served in it in uniform, at another location.

This time, however, I was hired as a civilian for my Nutch expertise. I was actively involved in the community at the time, and my name appeared in several Apache documents on the Internet.
At the end of the introduction, my PM gave me my security chip card. He explicitly warned me not to forget it or, even worse, lose it, since it was my key to the building.
A few days later, however, Murphy’s law struck. On my way home I stopped to do some shopping and lost my chip card in the supermarket.
After I noticed I had lost it, I went back to the supermarket and asked whether it had been found. Of course no one had found it (or reported finding it).
I contacted my PM and told him what had happened. As I mentioned before, he was a reasonable man, and he told me not to worry too much about it. The next day we would go to the Security department, who would have a solution for this problem. I wasn’t the first; it had happened before.
The next day I reported my problem to the security office. The MP in charge was obviously not happy with the situation and let me know it the ‘military way’. He knew there was a new procedure for this; he just didn’t know the exact details. Those were in a recent Word document somewhere on the organization’s intranet. He instructed me to go and search for the document and appointed two other security officers to help me. By the time we started searching it was 4 PM. I sat down at my workstation and started looking through the intranet.
The Intranet was basically a large collection of Office Documents, hosted on a Microsoft Internet Information Server (IIS) which came with its own full-text Indexing and Search server, the Microsoft Index Server.

Searching with Microsoft; a lot of work and no results.

My first query was “procedure AND verloren AND pas”. Since the organization was Dutch, I had to search in Dutch; literally translated, this means “procedure AND lost AND pass”. The AND operator tells the search server that every result must contain all three terms: procedure, lost and pass.
The search returned a few thousand documents that matched the query.
I looked into a few of them. I saw reports about stolen passes in the Netherlands Antilles and several other documents that didn’t contain the information I was looking for.

Boolean Search

Searching with operators like AND, OR, NOT and NEAR is called Boolean searching, and such queries are called Boolean queries.
Boolean searching is very common and popular because the operators can be nested into complex queries like: ((cat OR cats) AND (dog OR dogs) NEAR (food OR foods)) NOT (medicine OR medicines).
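
To make this concrete, here is a minimal sketch in Python of how Boolean operators map onto set operations over an inverted index. The four tiny documents and the helper names (build_index, term) are invented for illustration; this is not how Microsoft Index Server is implemented, and NEAR is left out because it also needs term positions.

```python
from collections import defaultdict

# Hypothetical mini-corpus; the texts are invented for illustration.
DOCS = {
    1: "procedure for a lost pass or card",
    2: "report about a stolen pass",
    3: "procedure for ordering office supplies",
    4: "lost and found: card and keys",
}

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().replace(":", " ").split():
            index[word].add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(word):
    """Return the set of documents containing `word` (empty if unknown)."""
    return INDEX.get(word, set())

# procedure AND lost AND (pass OR card): AND is intersection, OR is union.
hits = term("procedure") & term("lost") & (term("pass") | term("card"))

# (pass OR card) NOT stolen: NOT is set difference.
not_stolen = (term("pass") | term("card")) - term("stolen")

print(sorted(hits))
print(sorted(not_stolen))
```

Note that the result is a plain set: a document either matches or it doesn't, and nothing says which match is the best one. With a few thousand matches, that is exactly the problem I ran into.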

Professional Research

Professional researchers and academic librarians still use this when searching for scientific documents in databases like MEDLINE, CAS, Analytical Abstracts, TOXLINE (a subset of MEDLINE), EMBASE, DERWENT or BEILSTEIN, to name just a few. (There are many more, but listing them here falls outside the scope of this article.)

I changed my query to “procedure AND (lost NEAR (pass OR card))”, for obvious reasons.
Again this resulted in a list of over a thousand documents, all matching the query but none of them what I was looking for.
It was now 7 PM and we were getting tired. The security officers were used to leaving at 5 PM.

Finally, the officer in charge decided to phone his superior and ask him if he knew where the document could be.
The superior officer had written the document himself and remembered exactly where he had stored it. A few days later I had a new chip card and was able to walk in and out of the office. I had learned two things from this exercise: I would never lose my chip card again, and the document in question was obviously a needle in a haystack and therefore a good test case.

A few months later I had implemented all the requirements in Nutch (officials are very strict, especially when it comes to security and forms) and I got the opportunity to crawl and index the complete intranet. The intranet consisted of four departments, just like the organization: Army, Air Force, Navy and Military Police. Each department had about 20,000 to 30,000 Office documents, so the complete organization had an estimated 100,000 documents.
We crawled one department at a time. Each crawl took about 3 hours, so crawling the complete intranet took about 12 hours. (We used a single virtual machine, no Hadoop cluster.)

Searching with Nutch; Result in less than a minute!

After the crawl had finished, the first thing I did was run a test search. In the search box I typed (what else) “procedure lost pass”. (The Boolean operators are not necessary for Nutch. Like Google, Nutch regards every term as relevant.)
The result: Bingo! What had taken three men 3 hours (about a full working day in FTE terms) was now done in just ten seconds. A saving of one day Full-Time Equivalent! How was this possible?

The answer lies in a few technical differences between the Microsoft Search system and Nutch. I will explain them here:

  • Result ranking – The only thing Microsoft takes into account is the Term Frequency (Tf) of the search terms in a document. The reasoning is simple: the more often the search terms occur in a document, the more relevant the document. Nutch does the same thing but adds some more intelligence, such as references from other documents (also known as anchors, incoming links or inLinks). The more documents link to a document, the more important it must be, and if the links come from important documents, the ‘score’ must be higher still. This ‘score’, as I call it, is known at Google as ‘PageRank’. In fact, Nutch has copied this idea from Google; both Nutch and Hadoop were inspired by Google publications.
  • Similarity – Lucene (the engine behind Nutch) has this built in. Using the Vector-Space Model of Information Retrieval it’s possible to compute a ‘distance’ between two documents, or between a document and a search query. Think of the vector addition of forces we all learned in high-school physics, where we could predict the motion of an object by computing the resultant of all the forces applied to it. The Vector-Space Model is similar: by expressing the terms of a document (or phrase) as vectors, we can combine them into a single resultant for that document, and the resultants of two documents can then be compared with each other. The closer the resultants of two documents are, the more similar the documents. This is typically used in More Like This (mlt), which we often see on the web. For this to work, the terms and their positions in the document need to be stored in the index, which is a standard feature in Lucene.
  • Document parts (or Meta-Data) – Documents consist of parts like the title, chapters, paragraphs, and body text. When search terms are found in the title of a document or in one of its headings (h1, h2, h3 etc.), the document is likely to be more relevant to the query, and its score is boosted by a factor (configurable in Nutch).
  • Results clustering – This is what actually did the magic in my case. Based on the similarities described above, search results can be grouped together in conceptual clusters. By applying statistical analysis to these groups, relations can be found and related terms can be displayed next to the search results. In my case, I saw (next to the results) the related term ‘badge’. One click on this term narrowed down the results, and my document was number 1 in the list. I clicked it and found what three men had been unable to find a few months earlier.
    Results clustering is not a standard feature of Nutch; it was implemented using a plugin, the Carrot2 plugin. Carrot2 is an engine for results clustering. You can see it in action at carrot2.org. Play around with it, especially with the graphical representations. When I search for my own name, the resulting figure makes it obvious that the person “Evert Wagenaar” has something to do with Nutch, Solr, Lucene, Indexing, Facebook and LinkedIn, and obviously has some followers on Twitter. Type in your own name and see what you can discover about yourself. It’s fun! Please note: you won’t find this view in the online version of Carrot2. Instead, you will need to download the workbench version for your platform, which is an Eclipse Rich Client Application. Although carrot2.org looks like a search engine, it actually isn’t one. It uses public APIs from Google, Yahoo!, Bing and DuckDuckGo to provide the search functionality, and Carrot2 does the clustering itself using different algorithms. You don’t have to go to carrot2.org to access it: I downloaded Carrot2 and installed it on my Apache Tomcat server, so you can run it from here as well.
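
The ranking ideas above can be sketched in a few lines of Python. This is a toy illustration, not Lucene's or Nutch's actual scoring formula: documents and the query are turned into term-frequency vectors, compared with cosine similarity (a common way to measure how close two vectors are in the Vector-Space Model), and a match in the title counts extra. The sample documents, the function names and the title_boost value of 2.0 are all assumptions made up for this example.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector: how often each term occurs in the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def score(query, title, body, title_boost=2.0):
    """Body similarity plus boosted title similarity (boost is hypothetical)."""
    q = tf_vector(query)
    return title_boost * cosine(q, tf_vector(title)) + cosine(q, tf_vector(body))

# Hypothetical documents as (title, body) pairs, invented for illustration.
DOCS = {
    "a": ("procedure lost pass", "what to do when you lose your pass"),
    "b": ("stolen passes report", "a pass was reported lost and a procedure was started"),
    "c": ("canteen menu", "soup and bread"),
}

def rank(query):
    """Return the ids of matching documents, best score first."""
    scored = [(score(query, title, body), doc_id)
              for doc_id, (title, body) in DOCS.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]

print(rank("procedure lost pass"))
```

Unlike a Boolean query, no operators are needed: every term contributes to the score, and the results come back ordered instead of as one flat set. A real engine adds much more on top (link scores, stemming so that ‘passes’ matches ‘pass’, clustering), but the principle is the same.
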
What was the outcome of this study?

I can’t tell. Just like everything at the Ministry of Defense, this is classified information. I’m already in breach by telling you this.

Conclusion

Managers and CEOs, it’s time to wake up!

Start looking at how much time the employees in your organization spend searching for the information they need to do their jobs! You know as well as anyone that time = money.
Even if it’s only 1 FTE per week, you should hire me! I’ll do the job for you for a fixed price. Your investment will pay off in 3 months. This is guaranteed. Not good? Money back!
