Impossible thing #2: Wikipedia

Wikipedia is the largest and most comprehensive encyclopedic work ever created. It's common to draw comparisons to Encyclopedia Britannica, but they are hardly comparable works: Wikipedia is dozens of times larger and covers many more subjects. Accuracy is a more debatable topic, but studies have suggested that the gap between Wikipedia and Britannica is smaller than one might naively suppose. Project Gutenberg is a less well-known but much older part of the free culture movement, having been started in 1971. Today it contains over 24,000 e-texts.

Myth #2

"Even if you can do large things with bazaar methods, corporations are always going to do bigger and better work."

Unlike the previous myth, this one is largely unchallenged. Even inside the free culture community there is a strong perception of the community as a rebel faction embattled against a much more powerful foe. Yet, some projects challenge this world view!

Measuring Wikipedia

It's actually a bit hard to say what the exact size of Wikipedia is today, because the log engine that the site used to measure its size started to fail in 2006, due to the enormous size of the database! Since then, there is no direct data available on the total size of Wikipedia, nor on the English language version (the largest language version, unsurprisingly). There is data on some of the less highly populated language versions, simply because they haven't grown so large yet.

However, we can make some estimates based on the evidence before 2006 and the somewhat less complete statistics which continue to be available. 2006 was a pivotal year for Wikipedia: it was the year it surpassed the Yong-Le Encyclopedia, the former largest encyclopedic work ever created. Commissioned by the Emperor of China in 1403, the Yong-Le Encyclopedia was so large that only two copies of it were ever made (including the original). It was bound into approximately 23,000 volumes, and unfortunately does not survive intact into the present day, although there are still some volumes in existence.

Figure 2.1: Growth of Wikipedia by word count. Late in 2006, the size of the database exceeded the capacity of the logging engine and less systematic estimates have to be used. The diamonds show estimates based on article counts, with an assumption that mean article size remained the same (in the previous data, there is a gradual trend upwards in mean article size).

It was also the year in which Wikipedia apparently made the transition from "exponential" to approximately "linear" growth, which can be regarded as an important maturation step. Instead of growing explosively, as it did in its first few years of existence, Wikipedia is now moving into a more sustainable growth pattern, with an increasing effort being put into improving the quality of existing articles rather than adding new ones. That is not to say that new articles aren't being written: the growth may be linear, but it's linear at something close to adding a whole new Yong-Le Encyclopedia per year! Figure 2.1 illustrates the growth and size of Wikipedia, compared to some significant other works.
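The estimate described in the caption of Figure 2.1 amounts to a simple extrapolation. What follows is a minimal sketch in Python, not the script used to produce the figure; the function name and the numbers in the usage lines are hypothetical placeholders, not measurements of Wikipedia.

```python
def estimate_word_count(last_known_words, last_known_articles, later_articles):
    """Extrapolate a total word count from an article count.

    Assumes the mean article size stays fixed at its last measured value,
    which is the simplifying assumption described for Figure 2.1.
    """
    if last_known_articles <= 0:
        raise ValueError("last_known_articles must be positive")
    mean_words_per_article = last_known_words / last_known_articles
    return mean_words_per_article * later_articles


# Hypothetical placeholder numbers, chosen only to show the arithmetic.
estimate = estimate_word_count(
    last_known_words=1_000_000,
    last_known_articles=2_000,
    later_articles=3_000,
)
print(estimate)
```

Because the earlier data shows mean article size trending gradually upwards, an estimate of this kind is more likely to be too low than too high.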

This is an expected pattern for growth: the entire curve is typically a "sigmoid" (so named because it is "S-shaped"), with an initial period of exponential growth when there is no retarding force whatsoever, followed by roughly linear growth, and finally an asymptotic taper as the phenomenon runs into environmental limits. Thus far, Wikipedia appears to have exhausted the potential for rapidly increasing labor and has already picked all of the "low-hanging fruit" of encyclopedic entries.
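The three phases can be seen numerically in a short sketch. It uses the logistic function, which is one common sigmoid; that choice, and every parameter value below, is an assumption made for illustration and not a fit to Wikipedia's data.

```python
import math


def logistic(t, limit=1.0, rate=1.0, midpoint=0.0):
    """Logistic ("sigmoid") curve: limit / (1 + exp(-rate * (t - midpoint)))."""
    return limit / (1.0 + math.exp(-rate * (t - midpoint)))


def growth_per_step(t, step=1.0, **params):
    """Amount added between t and t + step."""
    return logistic(t + step, **params) - logistic(t, **params)


# Early on, each step adds more than the one before (near-exponential growth).
early = [growth_per_step(t) for t in (-8, -7, -6)]
# Around the midpoint, successive steps add about the same amount (near-linear growth).
middle = [growth_per_step(t) for t in (-1, 0)]
# Late on, each step adds less than the one before (the asymptotic taper).
late = [growth_per_step(t) for t in (5, 6, 7)]

print(early)
print(middle)
print(late)
```

A constant amount added per step, as in the middle list, is what growth driven by a roughly constant effort would look like.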

Now, it is moving into a phase of growth represented primarily by the effort of the existing interested "Wikipedians" (now a fairly stable population, with growth balanced by attrition). Thus the growth rate now represents a fairly constant effort put into improving the encyclopedia. Also, evidence suggests that maintenance and quality-control now represent a much larger fraction of the work as more edits are now dedicated to revisions (and reversions) of existing pages rather than adding new ones. There is also, of course, continuing exponential growth among the less-well-represented languages in Wikipedia, which contributes to the total growth.

Quantity and quality

Of course, if Wikipedia is, as some have suggested, just an "enormous pile of rumors", then the size is not necessarily a good thing. But in fact, Wikipedia is surprisingly accurate. A Nature study in 2005 demonstrated that in the area of science, Wikipedia was only slightly less accurate than Britannica, though it found a number of mistakes in both publications[1]. It is interesting to note that all of the articles objected to in the study were quickly edited to fix the problems, while the same cannot be said for Britannica, since it is harder to change.

There are many areas of knowledge which Wikipedia covers, such as popular culture, which other encyclopedias cannot possibly hope to keep up with (try looking up episode summaries for Buffy the Vampire Slayer in Britannica!). Understandably, it is particularly complete in computer-related subject areas.

Probably the weakest thing about Wikipedia is its susceptibility to intentional bias: many individuals, organizations, and governments have been known to edit Wikipedia articles to put themselves in a more favorable light. On the other hand, critical organizations may edit them to be harsher, and in the end, these effects appear to balance out for all but the most controversial topics. Even there, we have to acknowledge that Wikipedia's coverage fairly depicts controversial topics in all of their controversy (try looking up "Evolution", "Creationism", or "George W. Bush" in Wikipedia for interesting examples of what happens with controversial topics).

These weaknesses describe what might be dubbed the "editorial bias" of Wikipedia, which represents the collective bias of the society of people willing to contribute to the project. It has to be remembered, though, that conventional encyclopedic works are also subject to editorial bias, and usually the bias of one organization. As it stands, researchers using Wikipedia have to take the same kind of critical approach that they've always applied to encyclopedias as sources of information, and they must follow up the sources themselves for serious scholarly work.

Although there has always been concern about the problems caused by intentional vandalism, especially by anonymous contributors, this is not as much of a problem as many would imagine. A study at Dartmouth[2] concluded that anonymous contributors improved articles roughly as much as signed-in users. Thus, it appears likely that the Delphi effect[3] is out-competing vandalism and intentional bias in Wikipedia. In other words, distributed, community-based editorial review works, just as distributed debugging does for free software. Biases and judgement calls are a problem, but in the end they appear to balance out for almost all articles.

Figure 2.2: Growth of Project Gutenberg, measured in number of works, from Wikipedia (Hellisp@Wikipedia / PD).

Project Gutenberg

Started in 1971, Project Gutenberg is the grand-daddy of free culture projects. It predates much of the thought about the "intellectual commons" and it came thirteen years before the GNU Manifesto was written. As such, it does not reflect modern ideas about free licensing, and instead focuses on public domain works. That focus, along with the insistence on "plain text" representations of the works included, reflects attitudes some may regard as dated. This situation has eased somewhat in recent years.

Project Gutenberg measures its size in terms of numbers of e-texts, which can be somewhat confusing since e-texts are of many different lengths. However, a rough estimate of the size of the repository in number of words suggests that it probably is now larger than the fabled Library of Alexandria[4] and it is certainly larger than many modern community libraries.

The collection started fairly small, limited by the relatively small amount of networking and human labor available to the project in its early years. This behavior offers no serious challenge to the conventional wisdom about projects of this type.

However, as the internet and the web matured, so did the community supporting Project Gutenberg. Today, there is a significant volunteer scanning and distributed proof-reading[5] effort going on which has accounted for the tremendous growth that the project has seen over the last decade or so (see Figure 2.2).

The size of Project Gutenberg today is probably more limited by the availability of public domain works than by the labor pool willing to digitize them. The public domain has been starved multiple times in the last few decades by copyright term extensions which have effectively frozen the public domain in the mid 1920s. As more works do move into the public domain, Gutenberg will certainly be capable of capturing them.

The sheer scale of the thing

The sizes of Wikipedia and Project Gutenberg present serious challenges to our understanding of the relative scales of these works compared to the great works of individuals, corporations, or governments. As a means of grounding our perception in reality, it is useful to construct a logarithmic chart, spanning many orders of magnitude. Such a chart is not useful for making fine comparisons (because even a factor of two difference between two objects can seem quite close on a log chart), although by the same token, it's quite forgiving with respect to estimation errors, so we can afford to be fairly daring in our estimation process. What it is useful for is giving us an idea of what sort of things we ought to be comparing to. Figure 2.3 is such a chart, illustrating works on vastly different scales, from individually authored works up to the entire U.S. Library of Congress.
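The remark about a factor of two can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical and the sizes are arbitrary round numbers, not word counts of any real work.

```python
import math


def decades_apart(larger, smaller):
    """Distance between two sizes on a base-10 log axis, in orders of magnitude."""
    if larger <= 0 or smaller <= 0:
        raise ValueError("sizes must be positive")
    return math.log10(larger / smaller)


# A factor-of-two difference (or estimation error) moves a point by only
# about 0.3 of one order of magnitude, whatever the absolute sizes are.
factor_of_two = decades_apart(2_000_000, 1_000_000)
factor_of_ten = decades_apart(10_000_000, 1_000_000)
print(round(factor_of_two, 3), round(factor_of_ten, 3))
```

On a chart spanning many orders of magnitude, a shift of about 0.3 of a decade is small, which is why rough estimates are good enough for this kind of comparison.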

Figure 2.3: Logarithmic chart of various works, compared by estimated word count. Works grouped on the left side are individual works (although the Bible can be regarded as a collection); works in the middle are original encyclopedic works; and works on the right are entire libraries of works.

The U.S. Library of Congress is the largest modern library, and, unsurprisingly, it is several orders of magnitude larger than Project Gutenberg. But there are two caveats to consider. One is that the Library of Congress contains every work which has a copyright registered in the United States (because submitting a copy to the library is a part of the registration process), whereas Project Gutenberg is limited (almost entirely) to those works whose copyrights have expired. The other is that we are comparing a collection of print books to an electronic collection. It would be interesting to compare the output of Project Gutenberg to government-sponsored digitization projects, which would be a much fairer comparison.

It's especially hard, though, to look at this chart and not be a little stunned by Wikipedia! The greatest encyclopedic work of corporate production is probably the Encyclopedia Britannica, yet it falls far behind in this comparison (by well over an order of magnitude!). The greatest encyclopedic work of government production was the Yong-Le Encyclopedia commissioned by the Emperor of China in 1403. Yet even that is several times smaller than the whole of Wikipedia (note that the Wikipedia numbers are the last reliable numbers from 2006, not the later estimates—Wikipedia is considerably larger today).

Our conventional wisdom is that the most powerfully productive organizations are corporations and governments: institutions we regard with awe, reverence, and even fear. But in a few short years, a new class of player—the commons based enterprise—has far out-produced some of the greatest works of both corporations and governments (at least in the area of encyclopedias).

Clearly, the conventional wisdom needs adjusting.

Notes

[1] "Internet encyclopaedias go head to head". Jim Giles. Nature 438, 900 - 901 (2005).

[2] A Dartmouth study found that contributions from anonymous visitors to Wikipedia show a similar quality to those from logged-in, named contributors.

[3] Delphi effect

[4] This statement is difficult to test because no one really knows exactly how big the Library of Alexandria was, and there are estimates that are probably huge exaggerations. However, based on the most reliable estimates I could find, Project Gutenberg is now larger. The Library of Alexandria was measured in numbers of scrolls, but it turns out that scrolls were generally somewhat shorter than books (and therefore than the typical e-texts in Project Gutenberg), but both can be estimated in terms of number of words, to make comparisons possible.

[5] Distributed proof-reading is a collaborative system for sharing the load of proof-reading optical character recognition scans of original works.

Terms

Commons based enterprise: Large scale commons-based peer production efforts may be regarded as a new kind of enterprise-scale institution, alongside corporate and government enterprises.

Author information

Biography

Terry Hancock is co-owner and technical officer of Anansi Spaceworks. Currently he is working on a free-culture animated series project about space development, called Lunatics, as well as helping out with the Morevna Project.
