Impossible thing #2: Wikipedia

Wikipedia is the largest and most comprehensive encyclopedic work ever created in the history of mankind. It's common to draw comparisons to Encyclopedia Britannica, but they are hardly comparable works—Wikipedia is dozens of times larger and covers many more subjects. Accuracy is a more debatable topic, but studies have suggested that the gap in accuracy between Wikipedia and Britannica is far smaller than one might naively suppose. Project Gutenberg is a less well-known, but much older, part of the free culture movement, having been started in 1971. Today it contains over 24,000 e-texts.

Myth #2

"Even if you can do large things with bazaar methods, corporations are always going to do bigger and better work."

Unlike the previous myth, this one is largely unchallenged. Even inside the free culture community there is a strong perception of the community as a rebel faction embattled against a much more powerful foe. Yet, some projects challenge this world view!

Measuring Wikipedia

It's actually a bit hard to say what the exact size of Wikipedia is today, because the log engine that the site used to measure its size started to fail in 2006, due to the enormous size of the database! Since then, there is no direct data available on the total size of Wikipedia, nor on the English language version (the largest language version, unsurprisingly). There is data on some of the less highly populated language versions, simply because they haven't grown so large yet.

However, we can make some estimates based on the evidence before 2006 and the somewhat less complete statistics which continue to be available. 2006 was a pivotal year for Wikipedia: it was the year it surpassed the Yong-Le Encyclopedia, previously the largest encyclopedic work ever created. Commissioned by the Emperor of China in 1403, the Yong-Le Encyclopedia was so large that only two copies of it were ever made (including the original). It was bound into approximately 23,000 volumes and unfortunately does not survive intact into the present day, although some volumes are still in existence.

Figure 2.1: Growth of Wikipedia by word count. Late in 2006, the size of the database exceeded the capacity of the logging engine and less systematic estimates have to be used. The diamonds show estimates based on article counts, with an assumption that mean article size remained the same (in the previous data, there is a gradual trend upwards in mean article size).

It was also the year in which Wikipedia apparently finally transitioned from "exponential" to approximately "linear" growth, which can be regarded as an important maturation step. Instead of growing explosively, as it did in its first few years of existence, Wikipedia is now moving into a more sustainable growth pattern, with an increasing effort being put into improving the quality of existing articles rather than adding new ones (which is not to say that new articles aren't being written: the growth may be linear, but it's linear at something close to adding a whole new Yong-Le Encyclopedia per year!) Figure 2.1 illustrates the growth and size of Wikipedia, compared to some significant other works.
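
For concreteness, the article-count-based estimates in Figure 2.1 amount to a simple back-of-envelope calculation. The following is a minimal Python sketch, using illustrative placeholder numbers rather than the actual figures behind the chart:

    # Back-of-envelope estimate of total word count from article count,
    # assuming the mean article size stays constant over time.
    # Both numbers below are illustrative placeholders, not real figures.

    MEAN_WORDS_PER_ARTICLE = 600      # assumed constant mean article size

    def estimate_word_count(article_count, mean_words=MEAN_WORDS_PER_ARTICLE):
        """Estimate a wiki's total word count from its article count."""
        return article_count * mean_words

    # A hypothetical 1.5 million articles would give roughly:
    print(estimate_word_count(1_500_000))   # 900000000 words (9e8)

Since the pre-2006 data showed a gradual upward trend in mean article size, holding it constant makes estimates of this kind conservative.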

Wikipedia's growth may be linear, but it's linear at something close to adding a whole new Yong-Le Encyclopedia per year

This is an expected pattern for growth: the entire curve is typically a "sigmoid" (so named because it is "S-shaped"), with an initial period of exponential growth when there is no retarding force whatsoever, followed by linear growth, and finally an asymptotic taper as the phenomenon runs into environmental limits. Thus far, Wikipedia appears to have exhausted the potential for rapidly increasing its labor pool and has already picked all of the "low-hanging fruit" of encyclopedic entries.
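
To make that shape concrete, here is a minimal sketch of a logistic ("sigmoid") curve; the rate and ceiling parameters are arbitrary illustrative values, not fitted to Wikipedia's actual data.

    import math

    def logistic(t, ceiling=100.0, rate=1.0, midpoint=0.0):
        """Logistic (sigmoid) growth: near-exponential at first, roughly
        linear around the midpoint, then tapering toward the ceiling."""
        return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

    # Sample the three phases (arbitrary parameters, not a fit to Wikipedia):
    for t in (-6, -4, -2, 0, 2, 4, 6):
        print(t, round(logistic(t), 2))
    # Output climbs roughly 0.25, 1.8, 11.9, 50.0, 88.1, 98.2, 99.75:
    # explosive early growth, near-linear growth in the middle, saturation at the end.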

Now, it is moving into a phase of growth represented primarily by the effort of the existing interested "Wikipedians" (now a fairly stable population, with growth balanced by attrition). Thus the growth rate now represents a fairly constant effort put into improving the encyclopedia. Also, evidence suggests that maintenance and quality-control now represent a much larger fraction of the work as more edits are now dedicated to revisions (and reversions) of existing pages rather than adding new ones. There is also, of course, continuing exponential growth among the less-well-represented languages in Wikipedia, which contributes to the total growth.

Quantity and quality

Of course, if Wikipedia is, as some have suggested, just an "enormous pile of rumors", then the size is not necessarily a good thing. But in fact, Wikipedia is surprisingly accurate. A Nature study in 2005 demonstrated that in the area of science, Wikipedia was only slightly less accurate than Britannica, though it found a number of mistakes in both publications[1]. It is interesting to note that all of the articles objected to in the study were quickly edited to fix the problems, while the same cannot be said for Britannica, since it is harder to change.

There are many areas of knowledge which Wikipedia covers, such as popular culture, which other encyclopedias cannot possibly hope to keep up with (try looking up episode summaries for Buffy the Vampire Slayer in Britannica!). It is understandably particularly complete in computer-related subject areas.

Probably the weakest thing about Wikipedia is its susceptibility to intentional bias: many individuals, organizations, and governments have been known to edit Wikipedia articles to put themselves in a more favorable light. On the other hand, critical organizations may edit them to be more harsh, and in the end, these effects appear to balance out for all but the most controversial topics. Even there, we have to acknowledge that Wikipedia's coverage fairly depicts controversial topics in all of their controversy (try looking up "Evolution", "Creationism", or "George W. Bush" in Wikipedia for interesting examples of what happens with controversial topics).

A study at Dartmouth concluded that anonymous contributors improved articles roughly as much as signed-in users

These weaknesses describe what might be dubbed the "editorial bias" of Wikipedia, which represents the collective bias of the society of people willing to contribute to the project. It has to be remembered, though, that conventional encyclopedic works are also subject to editorial bias, and usually the bias of one organization. As it stands, researchers using Wikipedia have to take the same kind of critical approach that they've always applied to encyclopedias as sources of information, and they must follow up the sources themselves for serious scholarly work.

Although there has always been concern about the problems caused by intentional vandalism (especially by anonymous contributors), this is not as much of a problem as many would imagine. A study at Dartmouth[2] concluded that anonymous contributors improved articles roughly as much as signed-in users. Thus, it appears likely that the Delphi effect[3] is out-competing vandalism and intentional bias in Wikipedia. In other words, distributed, community-based editorial review works, just as distributed debugging does for free software. Biases and judgment calls are a problem, but in the end they appear to balance out for almost all articles.

Figure 2.2: Growth of Project Gutenberg, measured in number of works, from Wikipedia (Hellisp@Wikipedia / PD).

Project Gutenberg

Started in 1971, Project Gutenberg is the grand-daddy of free culture projects. It predates much of the thought about the "intellectual commons", and it came thirteen years before the GNU Manifesto was written. As such it does not reflect modern ideas about free licensing, and instead focuses on public domain works. That, along with its insistence on "plain text" representations of the works it includes, reflects attitudes some may regard as dated, though the situation has softened somewhat in recent years.

Project Gutenberg measures its size in terms of numbers of e-texts, which can be somewhat confusing since e-texts are of many different lengths. However, a rough estimate of the size of the repository in number of words suggests that it probably is now larger than the fabled Library of Alexandria[4] and it is certainly larger than many modern community libraries.
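
The comparison rests on nothing more than multiplying item counts by assumed mean lengths. Here is a rough sketch, in which every figure except the e-text count mentioned earlier is an assumption (and ancient estimates of the Library of Alexandria's holdings vary enormously):

    # Order-of-magnitude comparison by word count. All per-item sizes and
    # the scroll count are assumptions; which collection comes out larger
    # depends entirely on which disputed estimates you plug in.
    GUTENBERG_ETEXTS = 24_000          # e-text count from the text above
    WORDS_PER_ETEXT = 60_000           # assumed mean length of an e-text

    ALEXANDRIA_SCROLLS = 100_000       # placeholder; ancient counts vary wildly
    WORDS_PER_SCROLL = 15_000          # assumed; scrolls ran shorter than books

    gutenberg_words = GUTENBERG_ETEXTS * WORDS_PER_ETEXT      # ~1.4e9
    alexandria_words = ALEXANDRIA_SCROLLS * WORDS_PER_SCROLL  # ~1.5e9

    print(f"Project Gutenberg     ~{gutenberg_words:.1e} words")
    print(f"Library of Alexandria ~{alexandria_words:.1e} words")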

The size of Project Gutenberg today is probably more limited by the availability of public domain works than by the labor pool willing to digitize them

The collection started fairly small, limited by the relatively small amount of networking and human labor available to the project in its early years. This behavior offers no serious challenge to the conventional wisdom about projects of this type.

However, as the internet and the web matured, so did the community supporting Project Gutenberg. Today, there is a significant volunteer scanning and distributed proof-reading[5] effort going on which has accounted for the tremendous growth that the project has seen over the last decade or so (see Figure 2.2).

The size of Project Gutenberg today is probably more limited by the availability of public domain works than by the labor pool willing to digitize them. The public domain has been starved multiple times in the last few decades by copyright term extensions, which have effectively frozen it in the mid-1920s. As more works do move into the public domain, Gutenberg will certainly be capable of capturing them.

The sheer scale of the thing

The sizes of Wikipedia and Project Gutenberg present serious challenges to our understanding of how these works compare in scale to the great works of individuals, corporations, or governments. As a means of grounding our perception in reality, it is useful to construct a logarithmic chart spanning many orders of magnitude. Such a chart is not useful for making fine comparisons (because even a factor of two difference between two objects can seem quite close on a log chart), although by the same token it is quite forgiving with respect to estimation errors, so we can afford to be fairly daring in our estimation process. What it is useful for is giving us an idea of what sorts of things we ought to be comparing them to. Figure 2.3 is such a chart, illustrating works on vastly different scales, from individually authored works up to the entire U.S. Library of Congress.
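
As an illustration of why the log scale is so forgiving, the sketch below computes orders of magnitude for a handful of works; the word counts are rough placeholder figures, not the numbers actually used to draw Figure 2.3.

    import math

    # Rough placeholder word counts (orders of magnitude only, not exact figures).
    works = {
        "a typical novel":               100_000,
        "Encyclopedia Britannica":        44_000_000,
        "Yong-Le Encyclopedia":          370_000_000,
        "Wikipedia (2006 estimate)":   1_000_000_000,
        "US Library of Congress":  1_000_000_000_000,
    }

    for name, words in works.items():
        print(f"{name:28s} log10(word count) = {math.log10(words):4.1f}")
    # A factor-of-two error shifts a point by only ~0.3 on this scale,
    # so rough estimates barely move the picture.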

Figure 2.3: Logarithmic chart of various works, compared by estimated word count. Works grouped on the left side are individual works (although the Bible can be regarded as a collection); works in the middle are original encyclopedic works; and works on the right are entire libraries of works.

The US Library of Congress is the largest modern library, and indeed, it is unsurprisingly several orders of magnitude larger than Project Gutenberg. But there are two caveats to consider. One is that the Library of Congress contains every work with a copyright registered in the United States (because submitting a copy to the library is part of the registration process), while Project Gutenberg is limited (almost entirely) to works whose copyrights have expired. The other is that we are comparing a collection of print books to an electronic collection. It would be interesting, and much fairer, to compare the output of Project Gutenberg to that of government-sponsored digitization projects.

In a few short years, a new player—the Commons Based Enterprise—has far out-produced some of the greatest works of both corporations and governments

It's especially hard, though, to look at this chart and not be a little stunned by Wikipedia! The greatest encyclopedic work of corporate production is probably the Encyclopedia Britannica, yet it falls far behind in this comparison (by well over an order of magnitude!). The greatest encyclopedic work of government production was the Yong-Le Encyclopedia commissioned by the Emperor of China in 1403. Yet even that is several times smaller than the whole of Wikipedia (note that the Wikipedia numbers are the last reliable numbers from 2006, not the later estimates—Wikipedia is considerably larger today).

Our conventional wisdom is that the most powerfully productive organizations are corporations and governments: institutions we regard with awe, reverence, and even fear. But in a few short years, a new class of player—the commons based enterprise—has far out-produced some of the greatest works of both corporations and governments (at least in the area of encyclopedias).

Clearly, the conventional wisdom needs adjusting.

Notes

[1] "Internet encyclopaedias go head to head". Jim Giles. Nature 438, 900 - 901 (2005).

[2] A Dartmouth study found that contributions from anonymous visitors to Wikipedia show a similar quality to those from logged-in, named contributors.

[3] Delphi effect

[4] This statement is difficult to test because no one really knows exactly how big the Library of Alexandria was, and there are estimates that are probably huge exaggerations. However, based on the most reliable estimates I could find, Project Gutenberg is now larger. The Library of Alexandria was measured in numbers of scrolls, but it turns out that scrolls were generally somewhat shorter than books (and therefore than the typical e-texts in Project Gutenberg), but both can be estimated in terms of number of words, to make comparisons possible.

[5] Distributed proof-reading is a collaborative system for sharing the load of proof-reading optical character recognition scans of original works.

Terms

Commons based enterprise: Large scale commons-based peer production efforts may be regarded as a new kind of enterprise-scale institution, alongside corporate and government enterprises.

Author information

Biography

Terry Hancock is co-owner and technical officer of Anansi Spaceworks. Currently he is working on a free-culture animated series project about space development, called Lunatics, as well as helping out with the Morevna Project.