Eric Day
Thoughts, code, and other oddments.
Narada – A Scalable Open Source Search Engine

May 27th, 2009

I’ve been working with Patrick Galbraith for the past couple of weeks on a new project that started as an example in his upcoming book. It is a search engine built using Gearman, Sphinx, Drizzle or MySQL, and memcached. Patrick wrote the first implementation in Perl to tie all these pieces together, but there is also a Java version underway, being written by Trond Norbye and Eric Lambert, that will be shown at the CommunityOne and JavaOne conferences next week. I’ve been helping get the system set up on a new cluster and with the port to Drizzle.

Narada provides interfaces that allow you to submit URLs to be indexed and crawled, and then to search those indexes and get a result set back. This allows you to index and search your own set of URLs, possibly for a single website or just for your own personal archive. The crawler in the back-end will be able to stop after some recursion limit from the original URL and also be able to apply URL filters (for example, only index pages under the domain “oddments.org”). Other filters and extensions should be easy to add.

Narada is interesting because it is:
So, how does Narada work under the hood?
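To make the flow in this section concrete, here is a minimal Python sketch of the submit, fetch, store, index, and search steps. It is not Narada's code (the first implementation is in Perl). Plain function calls stand in for Gearman jobs, and in-memory Python structures stand in for the Drizzle tables, memcached, and the Sphinx index. Every name in it (`submit_url`, `fetch_worker`, `FAKE_WEB`, `MAX_DEPTH`, `INDEX_BATCH`, and so on) is hypothetical, and the batch-size trigger for the indexer is an assumption standing in for "if it hasn't been run in a while".

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory stand-ins for the real storage components.
url_queue = deque()   # Drizzle table that queues submitted URLs
cache = {}            # memcached: URL -> page content
documents = {}        # Drizzle table that stores fetched documents
index = {}            # Sphinx index: word -> set of URLs
pending = []          # documents stored but not yet indexed

# A tiny fake "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://oddments.org/": 'Eric Day blog <a href="/narada">Narada</a> '
                            '<a href="http://example.com/">out</a>',
    "http://oddments.org/narada": 'Narada search engine <a href="/deep">deep</a>',
    "http://oddments.org/deep": "deep page about Gearman",
    "http://example.com/": "unrelated page",
}

MAX_DEPTH = 1                    # recursion limit from the original URL
ALLOWED_DOMAIN = "oddments.org"  # URL filter
INDEX_BATCH = 2                  # assumption: reindex after this many new documents


def submit_url(url, depth=0):
    """Queue a URL. In Narada the table INSERT also starts a Gearman job."""
    url_queue.append((url, depth))


def allowed(url):
    """URL filter: accept only the allowed domain and its subdomains."""
    host = urlparse(url).hostname or ""
    return host == ALLOWED_DOMAIN or host.endswith("." + ALLOWED_DOMAIN)


def fetch_worker():
    """Fetch queued pages, apply the limit and filter, hand documents off."""
    seen = set()
    while url_queue:
        url, depth = url_queue.popleft()
        if url in seen or not allowed(url) or url not in FAKE_WEB:
            continue
        seen.add(url)
        body = FAKE_WEB[url]            # a real worker would download here
        if depth < MAX_DEPTH:           # only follow links within the limit
            for href in re.findall(r'href="([^"]+)"', body):
                submit_url(urljoin(url, href), depth + 1)
        cache[url] = body               # push the document into the cache
        document_worker(url)            # "notify" the Document Worker


def document_worker(url):
    """Store the document; run the indexer only once enough have piled up."""
    documents[url] = cache[url]
    pending.append(url)
    if len(pending) >= INDEX_BATCH:
        run_indexer()


def run_indexer():
    """Index every pending document (a toy inverted index, not Sphinx)."""
    while pending:
        url = pending.pop()
        text = re.sub(r"<[^>]+>", " ", documents[url]).lower()
        for word in re.findall(r"\w+", text):
            index.setdefault(word, set()).add(url)


def search_worker(query):
    """Look up one word, then gather content from the cache or the table."""
    urls = sorted(index.get(query.lower(), set()))
    return [(url, cache.get(url) or documents[url]) for url in urls]


submit_url("http://oddments.org/")
fetch_worker()
run_indexer()  # flush anything still waiting for the next batch
print([url for url, _ in search_worker("narada")])
```

In the sketch, the page one level below the start URL is stored but its own links are not followed, and the off-domain link is dropped by the filter. Batching the indexer is the point of the `pending` list: indexing on every URL would be wasteful.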
The blue boxes represent your front-end applications that use Narada through the Gearman client API. The yellow boxes represent Gearman workers, each of which performs one of the tasks in the chain. The orange boxes represent the storage mechanisms such as Drizzle, MySQL, the Sphinx index, or memcached.

When a URL is submitted, it is first queued in a Drizzle table for later processing. A Gearman job is started during the table INSERT to notify a Fetch Worker that a new URL is ready. Once a Fetch Worker is free, it downloads the page and looks for more URLs to index. This is where recursion limits and filters are implemented. Next, it takes the resulting document, pushes it into memcached, and notifies the Document Worker that a new document is ready to be stored and indexed. The Document Worker then stores it in another Drizzle table and starts the Sphinx indexer if it hasn’t been run in a while. We don’t want to run the indexer for every URL, since this would be wasteful and expensive. At this point the document is stored, indexed, and memcached is primed with the content.

When a search request comes in, the client dispatches a search job to the Search Worker. This worker is responsible for performing the Sphinx search and gathering the necessary information from memcached or Drizzle so the client can return some meaningful results. In the future we will most likely be sharding the data and indexes, so the Search Worker will also be responsible for aggregating multiple shard searches into one set for the caller.

The code is still rough around the edges, but we’ve set it up on a couple of clusters so far and it is working quite well. We’ll be actively working on it and refining the install process so it is easier to get it up and running.

Posted in Drizzle, Gearman, Main, MySQL

4 Responses to "Narada – A Scalable Open Source Search Engine"
Copyright (C) Eric Day - eday@oddments.org. All content licensed under the Creative Commons Attribution 3.0 License.
Hi,
That sounds great. Interesting.
I was curious about the name – Narada…any stories behind the naming?
bye,
Alfassa.