Distributed search follow-up
- 2006-11-06
Some time ago I posted “Just a thought: free distributed search?”, suggesting that relying on the centralized approach of search engine companies like Google might be unwise, and that some kind of decentralized approach could work better for searching. Recently, I was directed to Majestic-12, an actual attempt to implement this kind of strategy. It’s a UK-based project that applies the distributed computing model made famous by SETI@home to the problem of web search. Isn’t that amazing?

The Majestic-12 distributed search engine has grown in power by leaps and bounds over the last year as more people donate their unused CPU cycles to the project. More than 20 billion web pages have been crawled to date (data image snapshot from the Majestic-12 site on 2006-11-06).
From the site’s published rationale for the project:
So what about search engines?

There are millions web sites out there, with billions of pages and so far only a handful of huge companies were able to create a search engine that can provide relevant information to the users. Big companies control the entry point to the data you seek, and neither you nor web masters who run the sites have a say in the matter.

How does Majestic-12 fit into all this?

Majestic-12 is developing a search engine scalable to billions of web pages that is based on support by the community. Since the task of building a World Wide Web search engine is so huge, we have chosen to make Majestic-12 Distributed Search Engine based on the concept of distributed computing. The idea being that many machines work on one task to get it done quicker than one large machine alone. One of the biggest challenges with the search engines is actually getting billions of pages, and to do this cost effectively we have created a client software called MJ12node that can be run on otherwise idle computers. This concept was used successfully by projects like SETI@HOME and distributed.net.

MJ12node software combines machines from all around the globe to crawl, collate and then send back it’s findings to the master server. The crawled data will be analysed (indexed) and added to the Majestic-12 search engine. The result? Hopefully the biggest crawl of the web, and perhaps even the most up to date search engine of its time.
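To make the crawl, collate and send-back loop in that rationale a little more concrete, here is a toy sketch in Python. It is not MJ12node's code or protocol, which I haven't seen: the names crawl_node, master_merge and search, the in-memory FAKE_WEB dictionary standing in for real HTTP fetches, and the whitespace tokenizer are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the web: URL -> page text.
# A real node would fetch pages over HTTP instead.
FAKE_WEB = {
    "http://a.example/": "free software magazine",
    "http://b.example/": "distributed search engine",
    "http://c.example/": "free distributed search",
}


def crawl_node(urls):
    """One volunteer node: 'fetch' its assigned URLs and collate a partial index."""
    partial = defaultdict(set)
    for url in urls:
        for word in FAKE_WEB[url].lower().split():
            partial[word].add(url)
    return partial


def master_merge(partials):
    """Master server: merge partial indexes into one inverted index (word -> sorted URLs)."""
    merged = defaultdict(set)
    for partial in partials:
        for word, urls in partial.items():
            merged[word] |= urls
    return {word: sorted(urls) for word, urls in merged.items()}


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Split the URL list between two simulated nodes, then merge their findings.
urls = sorted(FAKE_WEB)
index = master_merge([crawl_node(urls[0::2]), crawl_node(urls[1::2])])
print(search(index, "free search"))
```

Even in this toy version the crawling is spread across nodes, but the merged index and the search function live in one place, on the master.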
So I guess the answer to my question is “Yes, someone is already working on it”. I never cease to be amazed by the creativity and industry of free software developers!
Copyright information
This entry is (C) Copyright by its author, 2004-2008. Unless a different license is specified in the entry's body, the following license applies: "Verbatim copying and distribution of this entire article is permitted in any medium without royalty provided this notice is preserved and appropriate attribution information (author, original site, original URL) is included".
Biography
Terry Hancock: Terry Hancock is co-owner and technical officer of Anansi Spaceworks, dedicated to the application of free software methods to the development of space.
Very interesting
Submitted by Scott Carpenter on Mon, 2006-11-06 05:54.
Thanks for pointing this out, Terry. As much as I like Google, I agree that we want alternatives that don't involve one company acting as such an important gatekeeper. But in this case, even if millions of individuals contribute cycles and bandwidth to scour the web, isn't Majestic-12 still a single point of control for us to actually use the results?
Will have to read more about it, but I think I'd enjoy running a search node more than a SETI node. I like SETI@Home, but have never felt like I was contributing that much the times I've run it on my machines.
The algorithms for finding relevant content are obviously going to be important in the success of this thing. Maybe we can judge improvements by where freesoftwaremagazine.com ranks in a search for: free software magazine. Running with alexc's magic recipe, it ranks 37th at the moment.
Moving to freedom doesn't even register for my backwater little site, but maybe that's because I'm so obscure :-)
Searching with quotes, "moving to freedom" gets:
Error: Index 'mainindex170206' search failed due to: Index returned result type of InternalError
Extended error: System.Exception: Wrong skip delta=-8978
 at Majestic12.InvertedIndexScanner_Default.UseSkipIndex(Int32 p_iDocID) in H:\Alex\PROJECTS\MJ12searchLib\InvertedIndexScanner.cs:line 629
 at Majestic12.InvertedIndexScanner_Default.SkipToDocID(Int32 p_iDocID) in H:\Alex\PROJECTS\MJ12searchLib\InvertedIndexScanner.cs:line 932
 at
----
http://www.movingtofreedom.org/
Single point of control?
Submitted by Terry Hancock on Tue, 2006-11-07 01:19.
I admit I haven't explored this thoroughly. Presumably, though, what will emerge, either through this project or in reaction to it, is an adaptive radiation of crawler algorithms and search techniques.
One thing I can't find on the Majestic-12 website is a download for their server-side engine, but I don't know if that's intentional, or just a casual omission (clearly the client is the more important thing for them to resolve in the near-term, so they may just not be considering the next step yet). It's a good question, though, whether this is a true free software project. I may have to inquire.
It's pretty clear, though, that if the client and search engines are both open source projects, they will evolve to meet some kind of community consensus of how a search engine should work, rather than the conclusions of corporate management (which are highly distorted from what the search market wants).
Is it Free Software?
Submitted by mjkaye on Tue, 2006-11-07 10:16.
Their site describes the client as "free to download and use". How "Free" is this software?
Is YaCy not a better bet?
--
Cutting Free - Free Software at the cutting edge (cuttingfree.blogsome.com)
I thought so, but...
Submitted by Terry Hancock on Wed, 2006-11-08 07:31.
I thought so when I wrote this post, but I'm beginning to doubt it. Thanks for the link.