Eric Day

Thoughts, code, and other oddments.

Narada – A Scalable Open Source Search Engine

May 27th, 2009

I’ve been working with Patrick Galbraith for the past couple of weeks on a new project that started as an example in his upcoming book. It is a search engine built using Gearman, Sphinx, Drizzle or MySQL, and memcached. Patrick wrote the first implementation in Perl to tie all these pieces together, but there is also a Java version underway, being written by Trond Norbye and Eric Lambert, that will be shown at the CommunityOne and JavaOne conferences next week. I’ve been helping get the system set up on a new cluster and with the port to Drizzle.

Narada provides interfaces that allow you to submit URLs to be indexed and crawled, and then to search those indexes and get a result set back. This allows you to index and search your own set of URLs, possibly for a single website or just for your own personal archive. The crawler in the back-end will be able to stop after some recursion limit from the original URL and also be able to apply URL filters (for example, only index pages under the domain “oddments.org”). Other filters and extensions should be easy to add. Narada is interesting because it is:

  • Open Source – You can modify it to fit your own needs, hopefully in a modular way so that changes can be contributed back to the project.
  • Easy to Scale – The system is built on a number of asynchronous queues, and the processes that perform the work can run on any number of machines. Increasing your capacity is now trivial: simply start up more machines with new workers.
  • Language Agnostic – While the first versions are in Perl and Java, it is easy to mix in other languages. For example, if a certain component were slow, we could rewrite it in C for better performance. The APIs to index and search can also be wrapped for any language, since it will mostly just involve wrapping the Gearman client API. I’m thinking of hacking up a PHP API.
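To make the scaling point a bit more concrete, here is a minimal sketch in Python. It is not Narada code and it does not use the Gearman API: an in-process `queue.Queue` stands in for the job server, threads stand in for workers on separate machines, and `run_workers` and `handler` are names I made up for the illustration. The only idea it shows is that workers pull from a shared queue, so adding capacity means starting more of them.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers):
    """Process jobs with num_workers threads pulling from one shared queue.

    Hypothetical stand-in for a job server plus workers; not the Gearman API.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            outcome = handler(job)
            with lock:  # results is shared between all workers
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Results arrive in whatever order the workers finish, so sort for display.
    print(sorted(run_workers(range(10), lambda n: n * n, num_workers=3)))
```

Changing `num_workers` does not touch the code that submits jobs, which is the property the queue-based design is after.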

So, how does Narada work under the hood?


[Diagram: Narada architecture]

The blue boxes represent your front-end applications that use Narada through the Gearman client API. The yellow boxes represent Gearman workers that each perform one of the tasks in the chain. The orange boxes represent the storage mechanisms, such as Drizzle, MySQL, the Sphinx index, or memcached.

When a URL is submitted, it will first be queued in a Drizzle table for later processing. A Gearman job is started during the table INSERT to notify a Fetch Worker that a new URL is ready. Once a free Fetch Worker is available, it downloads the page and looks for more URLs to index. This is where recursion limits and filters are implemented. Next, it takes the resulting document, pushes it into memcached, and notifies the Document Worker that a new document is ready to be stored and indexed. The Document Worker then stores it in another Drizzle table and will start the Sphinx indexer if it hasn’t been run in a while. We don’t want to run the indexer for every URL, since this would be wasteful and expensive. At this point the document is stored, indexed, and memcached is primed with the content.
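As a rough illustration of the recursion limit and URL filter step, here is a small Python sketch using only the standard library. It is not how the Perl or Java Fetch Worker is written; `LinkCollector`, `allowed`, `next_urls`, `max_depth`, and `allowed_domain` are all hypothetical names, and the filter shown (same domain or a subdomain) is just one possible rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def allowed(url, allowed_domain):
    """True if url is http(s) and its host is allowed_domain or a subdomain."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == allowed_domain or host.endswith("." + allowed_domain)
    )


def next_urls(page_url, html, depth, max_depth, allowed_domain):
    """Return (url, depth + 1) pairs worth queueing from a fetched page."""
    if depth >= max_depth:
        return []  # recursion limit reached: keep this page, follow nothing
    collector = LinkCollector()
    collector.feed(html)
    seen = set()
    result = []
    for href in collector.hrefs:
        # Resolve relative links and drop the fragment so duplicates collapse.
        url = urljoin(page_url, href).split("#", 1)[0]
        if url not in seen and allowed(url, allowed_domain):
            seen.add(url)
            result.append((url, depth + 1))
    return result


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="http://example.com/">Away</a>'
    print(next_urls("http://oddments.org/", page, 0, 2, "oddments.org"))
```

Carrying the depth along with each queued URL is what lets any free worker enforce the limit without knowing where the crawl started.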

When a search request comes in, the client will dispatch a search job to the Search Worker. This worker is responsible for performing the Sphinx search and gathering the necessary information from memcached or Drizzle so the client can return some meaningful results. In the future we will most likely be sharding the data and indexes, so the Search Worker will also be responsible for aggregating multiple shard searches into one set for the caller.
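Sharding is not implemented yet, so the following is only a sketch of what the aggregation step could look like. It assumes each shard returns its hits as `(doc_id, score)` pairs already sorted by descending score, and that scores from different shards are directly comparable, which may not hold for real Sphinx weights. `merge_shard_results` is a made-up name.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit):
    """Merge per-shard hit lists into one list of the top `limit` hits.

    Each element of shard_results must be a list of (doc_id, score) pairs
    sorted from highest to lowest score (an assumption of this sketch).
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, limit))


if __name__ == "__main__":
    shards = [
        [("a", 0.9), ("b", 0.4)],
        [("c", 0.7)],
        [],  # a shard with no matches
    ]
    print(merge_shard_results(shards, limit=2))
```

Because `heapq.merge` is lazy, the Search Worker would only consume as many hits from each shard as it needs to fill the requested page of results.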

The code is still rough around the edges, but we’ve set it up on a couple of clusters so far and it is working quite well. We’ll be actively working on it and refining the install process so it is easier to get up and running.

Posted in Drizzle, Gearman, Main, MySQL

4 Responses to "Narada – A Scalable Open Source Search Engine"

  1. Alfassa says:

    Hi,

    That sounds great. Interesting.

    I was curious about the name – Narada…any stories behind the naming?

    bye,
    Alfassa.

  2. [...] highlight a scalable open source search engine using Drizzle, Memcached, Gearman and Sphinx. See Eric’s write up for more [...]

  3. Eric Day says:

    Hi Alfassa,

    Patrick chose the name based on the Hindu divine sage Narada: http://en.wikipedia.org/wiki/Narada

    From what I understand, his reasoning is that, much like Narada was “the ultimate nomad who roams the three lokas … to find out about the life and welfare of people”, this project roams the web to find out information for people to use. :)

  4. Pradical says:

    Great stuff. Is there a blog or site dedicated to Narada? I know there is a Launchpad site, but lazy slackers like me love a nice blog that has updates, just like the Drizzle blog. :)

    FYI – Narada is also considered a troublemaker, albeit one who leads to clarity at the end of the confusion he causes. Let’s hope this Narada leads to clarity at the end and skips the creating-trouble part :)
