Distributed search follow-up

Some time ago I posted "Just a thought: free distributed search?", suggesting that relying on the centralized approach of search engine companies like Google might be unwise, and that some kind of decentralized approach could work better for searching. Recently, I was directed to an actual attempt to implement this kind of strategy, called Majestic-12. It's a UK-based project that applies the distributed computing model made famous by SETI@home to the problem of web search. Isn't that amazing?

The Majestic-12 distributed search engine has grown in power by leaps and bounds over the last year as more people donate their unused CPU cycles to the project. More than 20 billion web pages have been crawled to date (data snapshot image from the Majestic-12 site, 2006-11-06).

From the site's published rationale for the project:

So what about search engines?

There are millions web sites out there, with billions of pages and so far only a handful of huge companies were able to create a search engine that can provide relevant information to the users. Big companies control the entry point to the data you seek, and neither you nor web masters who run the sites have a say in the matter.

How does Majestic-12 fit into all this?

Majestic-12 is developing a search engine scalable to billions of web pages that is based on support by the community. Since the task of building a World Wide Web search engine is so huge, we have chosen to make Majestic-12 Distributed Search Engine based on the concept of distributed computing. The idea being that many machines work on one task to get it done quicker than one large machine alone. One of the biggest challenges with the search engines is actually getting billions of pages, and to do this cost effectively we have created a client software called MJ12node that can be run on otherwise idle computers. This concept was used successfully by projects like SETI@HOME and distributed.net.

MJ12node software combines machines from all around the globe to crawl, collate and then send back it's findings to the master server. The crawled data will be analysed (indexed) and added to the Majestic-12 search engine. The result? Hopefully the biggest crawl of the web, and perhaps even the most up to date search engine of its time.
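To make that division of labour concrete, here is a minimal Python sketch of the crawl, collate, and send-back loop the rationale describes. It is only an illustration under loud assumptions: it is not MJ12node's code or protocol, and every name in it (`FAKE_WEB`, `crawl_batch`, `MasterServer`, `run_crawl`) is hypothetical. The "web" is an in-memory dictionary and the "index" is a toy inverted index.

```python
from collections import defaultdict

# Hypothetical stand-in for the real web: URL -> page text.
FAKE_WEB = {
    "http://a.example/": "free software magazine",
    "http://b.example/": "distributed search engine",
    "http://c.example/": "free distributed search",
}


def crawl_batch(urls, fetch=FAKE_WEB.get):
    """What an idle volunteer machine might do: fetch each URL in its
    batch and collate the results into one report for the master."""
    report = {}
    for url in urls:
        text = fetch(url)
        if text is not None:  # skip pages that could not be fetched
            report[url] = text
    return report


class MasterServer:
    """Hands out work and indexes whatever the nodes send back."""

    def __init__(self, urls, batch_size=2):
        self.pending = list(urls)
        self.batch_size = batch_size
        self.index = defaultdict(set)  # word -> set of URLs containing it

    def next_batch(self):
        # Take the next slice of uncrawled URLs off the queue.
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch

    def receive(self, report):
        # "Analysed (indexed)": here, just a toy inverted index.
        for url, text in report.items():
            for word in text.lower().split():
                self.index[word].add(url)

    def search(self, query):
        # Return the URLs that contain every word of the query.
        words = query.lower().split()
        if not words:
            return set()
        hits = [self.index.get(w, set()) for w in words]
        return set.intersection(*hits)


def run_crawl(master):
    # In a real system each batch would go to a different machine;
    # here the "nodes" simply run one after another in this process.
    while True:
        batch = master.next_batch()
        if not batch:
            break
        master.receive(crawl_batch(batch))
    return master
```

Running `master = run_crawl(MasterServer(FAKE_WEB))` and then `master.search("free search")` returns the one fake URL whose text contains both words. In this sketch the volunteer machines only do the fetching; merging the reports and answering queries still happens on the master.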

So I guess the answer to my question is "Yes, someone is already working on it". I never cease to be amazed by the creativity and industry of free software developers!

Comments

Scott Carpenter's picture

Thanks for pointing this out, Terry. As much as I like Google, I agree that we want alternatives that don't involve one company acting as such an important gatekeeper. But in this case, even if millions of individuals contribute cycles and bandwidth to scour the web, isn't Majestic-12 still a single point of control for us to actually use the results?

Will have to read more about it, but I think I'd enjoy running a search node more than a SETI node. I like SETI@Home, but have never felt like I was contributing that much the times I've run it on my machines.

The algorithms for finding relevant content are obviously going to be important in the success of this thing. Maybe we can judge improvements by where freesoftwaremagazine.com ranks in a search for: free software magazine. Running with alexc's magic recipe, it ranks 37th at the moment.

Moving to freedom doesn't even register for my backwater little site, but maybe that's because I'm so obscure :-)

Searching with quotes, "moving to freedom" gets:

Error: Index 'mainindex170206' search failed due to: Index returned result type of InternalError
Extended error: System.Exception: Wrong skip delta=-8978
   at Majestic12.InvertedIndexScanner_Default.UseSkipIndex(Int32 p_iDocID) in H:\Alex\PROJECTS\MJ12searchLib\InvertedIndexScanner.cs:line 629
   at Majestic12.InvertedIndexScanner_Default.SkipToDocID(Int32 p_iDocID) in H:\Alex\PROJECTS\MJ12searchLib\InvertedIndexScanner.cs:line 932
   at

----
http://www.movingtofreedom.org/

Terry Hancock's picture

I admit I haven't explored this thoroughly. Presumably, though, what will emerge, either through this project or in reaction to it, is an adaptive radiation of crawler algorithms and search techniques.

One thing I can't find on the Majestic-12 website is a download for their server-side engine, but I don't know whether that's intentional or just a casual omission (clearly the client is the more important thing for them to resolve in the near term, so they may just not be considering the next step yet). It's a good question, though, whether this is a true free software project. I may have to inquire.

It's pretty clear, though, that if the client and the search engine are both open source projects, they will evolve to meet some kind of community consensus on how a search engine should work, rather than the conclusions of corporate management (which are highly distorted relative to what the search market actually wants).

mjkaye's picture
Submitted by mjkaye

Their site describes the client as "free to download and use". How "Free" is this software?

Is YaCy not a better bet?

--
Cutting Free - Free Software at the cutting edge (cuttingfree.blogsome.com)

Author information

Biography

Terry Hancock is co-owner and technical officer of Anansi Spaceworks. Currently he is working on a free-culture animated series project about space development, called Lunatics, as well as helping out with the Morevna Project.