I think the health of our civilization, the depth of our awareness about the underpinnings of our culture and our concern for the future can all be tested by how well we support our libraries.—Carl Sagan, Cosmos
Libraries have been around for a lot longer than software, and librarians long ago learned many of the data management lessons that have only now begun to surface in the world of software and databases. By contrast, software is a young, rapidly changing field, and this has affected its outlook. Five years may seem like an eternity in software development, but in the archival business, it’s just the blink of an eye.
What libraries have not dealt with historically, however, is the dismaying array of data storage mechanisms and file formats that software data represents, the troublesome transience of the tools needed to access that data, and the overwhelming quantity of the data that is produced.
Finding your way
For uncontrolled data sources, such as the world wide web at large, we have no real choice except to use machine indexing systems such as Google, but if we want a free-design database to be a maximally reusable resource, we need to make use of metadata.
Metadata, or “information about information”, is the key to library science. There are many decisions about data sources that cannot be easily made by machine, and metadata tagging allows the librarian to classify works so that they can be retrieved later. Typical metadata includes the familiar “subject”, “title”, and “keyword” indexes that were once the mainstay of library card catalogs, and now of the electronic online public access catalogs (OPACs) used by modern libraries.
For a long time, libraries have used the highly rigorous but somewhat difficult MARC (Machine-Readable Cataloging) system, which is not exactly a relational or object database, but is somewhat in between, and requires special handling to be managed properly. Since MARC was introduced, however, computer science has produced more manageable database design styles, and the increasing quantity of digital and multimedia library resources has begun to make the conventional book-publishing orientation of MARC obsolete. Librarians are, therefore, developing more streamlined, agile metadata systems, such as the Functional Requirements for Bibliographic Records (FRBR), a conceptual entity-relationship model designed to capture the minimal (but complete) requirements for library use, and the even more streamlined Dublin Core (DC) metadata system.
MARC, FRBR, and DC provide “schemas” or lists of the types of metadata elements that can be recorded for a work.
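To make the idea of a schema concrete, here is a minimal Python sketch that checks a record against the fifteen element names of the simple Dublin Core set. The `validate_record` helper and the sample record are hypothetical illustrations, not part of any library system.

```python
# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def validate_record(record):
    """Return the sorted list of field names that are not Dublin Core elements."""
    return sorted(name for name in record if name not in DC_ELEMENTS)


# Hypothetical record for a free design document.
record = {
    "title": "Pressure suit glove assembly",
    "creator": "Example Author",
    "subject": "spacesuits",
    "rights": "Public domain",
    "colour": "white",  # not a Dublin Core element
}

unknown = validate_record(record)
print(unknown)
```

A schema of this kind only says which fields may be recorded; it says nothing yet about which values are acceptable in each field.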
A perhaps less obvious need is the “controlled vocabulary”. There are many ways to express a single subject—is the study of extra-terrestrial life to be called “bioastronomy”, “exobiology”, or “xenobiology”? People have used all three terms, and, through their usage, established different emphases. Should each be given its own category, or should they be treated as synonyms and stored together? Which term will we use for that category?
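One possible answer to these questions, treating the three terms as synonyms filed under a single preferred term, can be sketched in a few lines of Python. The choice of “exobiology” as the preferred term is an arbitrary assumption made for illustration, and `normalize_subject` is a hypothetical helper.

```python
# Hypothetical controlled vocabulary: each preferred term lists its synonyms.
# Choosing "exobiology" as the preferred term is an arbitrary assumption.
VOCABULARY = {
    "exobiology": {"bioastronomy", "xenobiology"},
}

# Invert the vocabulary into a lookup from any known term to its preferred term.
PREFERRED = {}
for preferred, synonyms in VOCABULARY.items():
    PREFERRED[preferred] = preferred
    for synonym in synonyms:
        PREFERRED[synonym] = preferred


def normalize_subject(term):
    """Map a free-text subject to its preferred term, or None if it is unknown."""
    return PREFERRED.get(term.strip().lower())


print(normalize_subject("Xenobiology"))
```

Collapsing synonyms this way makes retrieval simpler, but it discards the differences in emphasis between the terms, which is exactly the trade-off a vocabulary designer has to weigh.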
It’s a common mistake to underestimate the importance and difficulty of selecting appropriate taxonomic vocabularies. This is because we are all biased towards our own fields of endeavor and tend to have only a vague idea of the structure of other disciplines. Consider, for example, the domain-specific controlled vocabulary represented by the “Trove” system, which you use if you look for software projects on the Sourceforge site. It’s an excellent system for finding software, provided that the paradigms of computing don’t change too much. However, should entirely new software types evolve, or should the system be used for things outside of the realm of software, the Trove categories become much less useful.
Fortunately, library associations have done a lot of work on broad, inter-domain classification. For example, in the English language, there are the Anglo-American Cataloguing Rules (AACR), used in Canada and the United States. It seems wise to use these standards whenever possible.
Agility and human nature
One of the problems in applying professional library methods to software works and the results of community-based production is that they are fairly labor-intensive and rely on a class of professional experts to do the classifying. Considering the quantity of data on a site like Sourceforge, it’s not hard to see that hiring librarians to manage the problem would be a daunting prospect.
Fortunately there are other ways to assign metadata to files.
Perhaps the most obvious solution is to have authors assign metadata to their own works. It’s the typical starting point for most systems, and for things like title, attribution, and licensing it is really the only way to do it, because the creator is the one who chooses those properties.
It’s not without problems, however. An author is not always the best person to trust about their own data. Vendors tend to puff up their projects; authors can have greatly inflated (or deflated) egos; and people are often just lazy or inept with the submission mechanisms. It seems clear that we can’t always rely on the people who create works to be good at classifying them.
The library solution is to have items cataloged by professionals who train in doing just that. Even when creators determine their own subject headings, it is usually the case that the schema (what data to record) and the vocabulary (what options are available for each field) are decided by experts.
Social or consensus tagging
A recently successful model, employed by community sites such as Wikipedia and used by companies such as Amazon, is to rely on reader feedback to improve the metadata as it is being used. Many properties of such sites (the ability to edit them, the feedback forms that ask “Is this page useful?”, and other features) make this feasible. Despite some claims to the contrary, Wikipedia is a remarkably successful case of well-organized self-organization. Even if it were to fall short in quality and accuracy compared to a professionally cataloged encyclopedia, such as Encyclopedia Britannica, it would still represent a considerable achievement. Yet a 2005 comparison published in Nature suggested that it does not fall far short of such works.
For purely objective data such as file format, checksum, or size, automatic cataloging is the obvious solution. Advances in search technology and artificial intelligence have allowed us to go much further than this, though. AI-based text analysis and data-mining has been a popular theme for some time, and Google’s search engine benefits from some of this technology.
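For the purely objective fields, automatic cataloging needs nothing beyond the Python standard library. The following is a minimal sketch; the `catalog_file` name and the choice of SHA-256 as the checksum are assumptions made for illustration.

```python
import hashlib
import mimetypes
import os


def catalog_file(path):
    """Collect objective metadata for one file: size, checksum, guessed format."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    # guess_type looks only at the file name and may return None.
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "size": os.path.getsize(path),
        "sha256": digest.hexdigest(),
        "format": mime_type,
    }
```

Note that the format here is guessed from the file extension alone, so it is only as reliable as the name the submitter chose; inspecting the file contents would be a sturdier approach.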
However, in the last year or so, we have started to see more ambitious and original AI techniques being used on non-text data. For example, the free software package “SmartDJ” is a playlist manager that finds songs that “sound alike” by analyzing the recordings themselves, and “imgSeek” is a tool that recognizes similar images, so that a rough sketch can be used to search for a matching image. No doubt these applications are still primitive, but they show a promising possibility for future indexing systems.
Putting them together
Together, these methods can be usefully understood as a continuous spectrum, based on types of metadata and who should be most trusted to establish them (see figure). It’s easy to imagine a system based on creating a metadata stub for a package, starting with creator-provided title, attribution, and licensing, and then allowing that stub to be further constructed by the actions of the different interested parties managing the overall database system.
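The stub-building idea can be sketched as a merge in which each field is taken from whichever kind of source is trusted most for it. The trust policy below, the source names, and the `merge_metadata` helper are all hypothetical; a real archive would have to settle its own policy.

```python
# Hypothetical policy: which kind of source is trusted most for each field.
# Sources listed first take priority.
TRUST_ORDER = {
    "title": ["creator", "librarian"],
    "rights": ["creator"],
    "subject": ["librarian", "community", "creator"],
    "format": ["automatic"],
    "sha256": ["automatic"],
}


def merge_metadata(claims):
    """Build one record from {source: {field: value}} using TRUST_ORDER."""
    record = {}
    for field, sources in TRUST_ORDER.items():
        for source in sources:
            value = claims.get(source, {}).get(field)
            if value is not None:
                record[field] = value
                break  # the most trusted source that has a value wins
    return record


# Made-up claims about one package from three kinds of source.
claims = {
    "creator": {"title": "Open glove design", "rights": "GPL",
                "subject": "best gloves ever"},
    "community": {"subject": "spacesuits"},
    "automatic": {"format": "application/pdf"},
}
merged = merge_metadata(claims)
print(merged)
```

In this example the creator's title and license are kept, but the creator's own subject heading is overridden by the community's, since no librarian has classified the item yet.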
A promising tool for this kind of task is the “Resource Description Framework” (RDF). This is the basis of the so-called “Semantic Web”, and is useful because it provides a completely extensible and decentralized system for assigning metadata. Projects such as “IkeWiki” are already providing ways to make such RDF tagging easy to apply to dynamic websites.
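As a rough illustration of what RDF metadata looks like, the sketch below writes Dublin Core statements about one resource in the line-based N-Triples syntax, using only plain string handling. The `to_ntriples` helper and the example.org address are placeholders; a real system would more likely use a dedicated RDF library.

```python
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def escape_literal(text):
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(resource_uri, metadata):
    """Serialize {element: value} as N-Triples statements about one resource."""
    lines = []
    for element, value in sorted(metadata.items()):
        lines.append('<%s> <%s%s> "%s" .' % (
            resource_uri, DC_NS, element, escape_literal(value)))
    return "\n".join(lines)


# example.org is a placeholder address, not a real archive.
print(to_ntriples("http://example.org/designs/42",
                  {"title": "Open glove design", "rights": "GPL"}))
```

Because every statement names its subject by URI, statements made by different parties on different sites can later be combined without any central registry, which is the decentralization the text refers to.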
Signal to noise
Nearly all of the data that passes through our websites, mail servers, and even development projects is dross. Most data is only temporarily valuable, or is even unwanted “spam”. If we were to institute a policy of saving all of that data, we would not only need ever-growing storage capacity, but would also hinder the recovery of data through an extremely low “signal to noise” ratio. What is needed is an effective information sieve that captures only the permanently valuable information and allows the rest to spill through.
Community-based rating systems, such as Slashdot’s “moderator points” system, are likely to be useful in solving this problem, although it really calls for more than a simple score. For example, an announcement of an upcoming meeting may be very important at the time, but have little permanent value. A more specific rating system is probably called for, which tells why a post is important as well as how important it is. An RDF-based representation might be appropriate for this purpose.
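A rating that records why as well as how much might look like the following sketch. The reason labels, the 1 to 5 scale, and the threshold are hypothetical choices made for illustration, not a description of how Slashdot's system works.

```python
from dataclasses import dataclass

# Hypothetical reasons; only some of them imply lasting value.
PERMANENT_REASONS = {"reference", "design-data"}


@dataclass
class Rating:
    reason: str  # why the post matters, e.g. "announcement" or "reference"
    score: int   # how much it matters, from 1 (low) to 5 (high)


def worth_archiving(ratings, threshold=3.0):
    """Keep a post if its ratings for permanent reasons average at least threshold."""
    scores = [r.score for r in ratings if r.reason in PERMANENT_REASONS]
    return bool(scores) and sum(scores) / len(scores) >= threshold


# A meeting notice is rated highly, but only as an announcement.
meeting = [Rating("announcement", 5), Rating("announcement", 5)]
# A design note gets a solid rating as reference material.
design_note = [Rating("reference", 4), Rating("announcement", 1)]

print(worth_archiving(meeting), worth_archiving(design_note))
```

With a single score, the meeting notice would outrank the design note; separating the reason from the score lets the sieve keep the note and drop the notice.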
Internet bandwidth, server uptime, and data storage capacity are the primary physical costs of maintaining a large online archive of free design data, and if the data is to remain free, it is counter-productive to recover this cost through charging patrons directly for access to it. Fortunately, there are two existing solutions for these problems.
The older technique, as demonstrated by Sourceforge, is to simply have “big pipes” and provide the necessary bandwidth. When demand grows too high, the site can recruit mirrors. Sourceforge attracts sponsors in this way; they use the opportunity to advertise to users (see figure), who, seeing the immediate benefit of the sponsorship, are likely to be positively influenced by the ads.
Swarming downloads and peer-to-peer
A more recent development is the use of peer-to-peer networks and so-called “swarming downloads”, which use the internet in a much more distributed way. In this model, patrons effectively pay for their own download bandwidth via their local ISPs, rather than burdening the server. The cost is so widely distributed, however, that it is not noticeable.
This system is exemplified by Bit Torrent, which allows a file to be “shared” to a peer-to-peer network from a “seed” site, which takes the place of the conventional FTP server. Every additional downloader also becomes an additional uploader, so this system scales particularly well: it shows the highest gain for the packages with the highest demand.
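The scaling argument can be illustrated with some back-of-the-envelope arithmetic. The model below is a deliberately idealized assumption, not a description of the real protocol: it ignores connection limits and peers that leave early, and `share_ratio` is a hypothetical parameter for the average fraction of a full copy that each peer passes on to others.

```python
def seed_upload_direct(file_size, downloads):
    """Bytes the server uploads when it serves every download itself."""
    return file_size * downloads


def seed_upload_swarm(file_size, downloads, share_ratio):
    """Idealized bytes the seed uploads when peers re-share what they receive.

    share_ratio is the assumed average fraction of a full copy that each
    peer uploads to other peers (0.0 = nobody shares, 1.0 = a full copy each).
    """
    if downloads == 0:
        return 0
    demand = file_size * downloads
    supplied_by_peers = min(demand, share_ratio * demand)
    # The seed must always supply at least one complete copy.
    return max(file_size, demand - supplied_by_peers)


# Compare the two models for a 100 MB package at different levels of demand.
for downloads in (10, 1000):
    direct = seed_upload_direct(100, downloads)
    swarm = seed_upload_swarm(100, downloads, share_ratio=1.0)
    print(downloads, direct, swarm)
```

Under these assumptions the seed's cost stays near one copy however many people download, so the saving relative to a conventional server grows in proportion to demand.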
Torrent feeds require special clients, however, so they are not yet ubiquitous. An archive will therefore very likely have to provide both methods, though it seems desirable to find ways to encourage the peer-to-peer method.
Free design sources
There are already a number of sources for free design data, so encouraging the growth of a free design community will involve making existing data more available as much as making more designs. Furthermore, since new designs build on old ones, building a good archive of design data is important to new innovation as well as to end users.
Perhaps the most obvious sources of public good works are public agencies. Under United States law, any work which is done completely by government employees is automatically in the public domain. This includes much of the data developed by such organizations as NASA, the Forest Service, the USGS, DOE, USDA, and even the DOD.
Other countries often have similar rules. In the European Union (EU), public funding may require publication of results under a free license, for example.
NASA, obviously of particular importance to our project, already provides some online access to search its documents, although many of them are not yet digitally imaged, so it may be necessary to pay document processing fees to access the full text. One desirable possibility for our community-based project would be to begin mining these resources effectively and making them more accessible to the free design community.
Community based projects
There is an increasing body of hardware development that is managed in community collaborative websites, such as the Open Cores project, which works on integrated circuit “IP Cores”. These are relocatable elements of IC chips which can be used in large “Application Specific Integrated Circuits” (ASICs) and are therefore obvious candidates for creating commodity reusable designs. Among the project’s successes are free versions of all the major simple gate chips (e.g. 7400 series) and complex projects such as RISC CPUs and micro-controllers. These are all critical areas of development if we want to see “completely free” designs, since computer control systems are an important part of so many advanced hardware projects.
These technologies are early adopters of a free collaboration environment because the technologies lend themselves to software collaboration tools, because the people doing the work have seen successful free software projects, and because the complex designs benefit significantly from the kind of collaborative testing and development that goes into software.
Perhaps the most surprising trend has been initiated by electronics and computer manufacturing companies, who have decided to free the design of older model hardware. This has been done in order to provide a “future proofing” value proposition to customers, or simply to develop goodwill, especially among customers who already see the benefit of using free software such as embedded Linux systems.
There are also a few high-profile cases, such as Sun Microsystems’ recent decision to offer the Verilog source code and other design information for the UltraSPARC T1 CPU under the GPL. This move is presumably meant to bolster Sun’s hardware platform as a “commodity” design, just like the enormously successful Intel architecture machines.
A rich public domain
It is startling to think that all of the progress in launch vehicles, spacecraft, and spacesuits developed by NASA during the 1960s and 1970s, which got us to the moon and built the reusable Space Shuttle, is now in the public domain, regardless of whether it was developed by government agencies or contractors. The reason is that patents, unlike copyrights, have relatively short durations of 20 years or less in most countries (including the US), so anything developed before 1986 is effectively “fair game” at this point.
This data is old, and there are in many cases smarter, more modern ways to design new equipment, based mainly on improvements in computerized micro-controllers and materials science. Nevertheless, the older designs are often an excellent starting place. For example, the requirements and industrial materials available for the creation of spacesuits have not changed significantly since 1970, and most of that data is already in the public domain, even if the sources on the subject matter remain somewhat inaccessible. The primary need, therefore, is to get that information into a more usable form through document imaging, text extraction, metadata tagging, and cataloging.
Bricks and mortar
Even today, libraries reach many more people than the internet, and have a more direct impact on many of them. Technology users who are not primarily in computer science fields are not as well represented online, and they include many of the people we are interested in, both as developers and consumers of new free design. There’s also the possibility of bringing people to the free design community via internet-connected computers in libraries.
So, it’s important for a major free design project to embrace the existing library standards and institutions, particularly if the project wants to appeal to a broad audience. What we should do is provide a “library interface” that allows such an archive to act like an ordinary brick and mortar library as well as an internet resource.
Imagine this scenario: a library patron in a faraway library would be able to search the archive via their own library’s OPAC or website. The archived materials would appear as books in a remote library collection. Using the Interlibrary Loan mechanism, the patron could then request the “book” from the librarian, who would then request it from the archive. The patron would pay the cost to have the book delivered, based on processing fees. Using appropriate print-on-demand technology, the book would be printed digitally and sent to the library. The book might then belong to the patron, or become part of the local library’s collection.
This scenario would require a number of individual technology problems to be solved, but none is a show-stopper. The FRBR provides an interface to the OPAC system (MARC records can be generated). There is apparently nothing stopping an electronic archive from joining an interlibrary loan organization. Systems built for free software documentation already automate document preparation. Print-on-demand service has become a viable business model, with a number of vendors providing the service; and finally, the electronic commerce and shipping industries are entirely capable of handling the transaction fees and shipping.
Joining a library association is also a good idea politically. Librarians are perceived as a mild-mannered group, but they can be fierce in the protection of free speech and free expression rights. There is clearly a lot to be gained by the internet community at large, community-based information production projects, and library associations joining together as defenders of the free interchange of knowledge.
Libraries of the future
It’s not enough to innovate. We must also remember what we innovated and forget the irrelevant details so they don’t pollute the ocean of information our data must be found in. Accurate metadata, achieved by combining a variety of different methods, based on the most reliable sources for each, is essential to ensuring the long-term accessibility of the data we need. There is already a substantial volume of free design data in existence, from community, industry, academic, and government sources, but it is often underutilized because of shortfalls in document imaging and recognition, and (most importantly) metadata tagging. Twenty-first century developments in artificial intelligence and community-based technologies are making it possible to solve these technical problems, which puts us in a good position to start building the digital design libraries of the future.
 Dublin Core initiative
 Trove categorization system
 Cory Doctorow Metacrap
 Jim Giles Internet encyclopaedias go head to head, Nature, 2005
 Semantic Web
 Bit Torrent
 US Forest Service
 US Geological Survey
 US Department of Energy
 US Department of Agriculture
 US Department of Defense
 Open Cores
 SSV’s DIL/NetPC ADNP/ESC1 Single-Board Computer (SBC)
 Interlibrary Loan