Towards a free matter economy (Part 5)
Discovering the future, recovering the past
Download the whole article as PDF
- 2006-04-21
- Mind set | Intermediate
-
Write a full post in response to this!
This content was sponsored by:
I think the health of our civilization, the depth of our awareness about the underpinnings of our culture and our concern for the future can all be tested by how well we support our libraries.—Carl Sagan, Cosmos
Libraries have been around for a lot longer than software, and librarians long ago learned many of the data management lessons that have only now begun to surface in the world of software and databases. By contrast, software is a young, rapidly changing field, and this has affected its outlook. Five years may seem like an eternity in software development, but in the archival business, it’s just the blink of an eye.
What libraries have not dealt with historically, however, is the dismaying array of data storage mechanisms and file formats that software data represents, the troublesome transience of the tools needed to access that data, and the overwhelming quantity of the data that is produced.
Finding your way
For uncontrolled data sources, such as the world wide web at large, we have no real choice except to use machine indexing systems such as Google, but if we want a free-design database to be a maximally reusable resource, we need to make use of metadata.
If we want a free-design database to be a maximally reusable resource, we need to make use of metadata
Metadata, “information about information” is the key to library science. There are many decisions about data sources that cannot be easily made by machine, and metadata tagging allows the librarian to classify works so that they can be retrieved later. Typically these include the familiar “subject”, “title”, and “keyword” indexes that were once the mainstay of library card catalogs, and now of electronic OPACs used by modern libraries [1].
Schemas
For a long time, libraries have used the highly rigorous but somewhat difficult MARC database system [2], which is not exactly a relational or object database, but is somewhat in between, and requires special handling to be managed properly. Since then, however, computer science development has led to more manageable database design styles, and the increasing quantity of digital and multimedia library resources has begun to make the conventional book-publishing orientation of MARC obsolete. Librarians are, therefore, developing more streamlined, agile metadata database systems, such as the FRBR [3], which is a pure relational database schema designed to encapsulate the minimal (but complete) requirements for library use, and the even more streamlined Dublin Core (DC) metadata system [4].
MARC, FRBR, and DC provide “schemas” or lists of the types of metadata elements that can be recorded for a work.
Vocabularies
A perhaps less obvious need is the “controlled vocabulary”. There are many ways to express a single subject—is the study of extra-terrestrial life to be called “bioastronomy”, “exobiology”, or “xenobiology”? People have used all three terms, and, through their usage, established different emphases. Should each be given its own category, or should they be treated as synonyms and stored together? Which term will we use for that category?
It’s a common mistake to underestimate the importance and difficulty of selecting appropriate taxonomic vocabularies. This is because we are all biased towards our own fields of endeavor and tend to have only a vague idea of the structure of other disciplines. Consider, for example, the domain-specific controlled vocabulary represented by the “Trove” system [5], which you use if you look for software projects on the Sourceforge site. It’s an excellent system for finding software, provided that the paradigms of computing don’t change too much. However, should entirely new software types evolve, or should the system be used for things outside of the realm of software, the Trove categories become much less useful.
Fortunately, library associations have done a lot of work on broad, inter-domain classification. For example, in the English language, there is the AACR [6], used in Canada and the United States. It seems like a wise idea to use these standards whenever possible.

Libraries had been in use for a long time before the advent of computers, and librarians have therefore learned important lessons about how information should be cataloged. As our databases become more full-text, more diverse in format, and broader in content, we increasingly encounter the same challenges that have formed the basis of library and information science for centuries (Der Bücherwurm, by Carl Spitzweg from Wikimedia Commons (PD))
Agility and human nature
One of the problems in applying professional library methods to software works and the results of community based production, is that they are fairly labor intensive, and rely on a class of professional experts to do the classifying. Considering the quantity of data on a site like Sourceforge, it’s not hard to see that hiring librarians to manage the problem would be a daunting prospect.
Fortunately there are other ways to assign metadata to files.
Creator tagging
Perhaps the most obvious solution is to have authors assign metadata to their own works. It’s an obvious solution; it’s the typical starting point for most systems; and, for things like title, attribution, and licensing it is really the only way to do it, because the creator is the one who chooses those properties.
It’s not without problems, however. An author is not always the best person to trust about their own data. Vendors tend to puff up their projects; authors can have greatly inflated (or deflated) egos; and people are often just lazy or inept with the submission mechanisms [7]. It seems clear that we can’t always rely on the people who create works to be good at classifying them.
Expert tagging
The library solution is to have items cataloged by professionals who train in doing just that. Even when creators determine their own subject headings, it is usually the case that the schema (what data to record) and the vocabulary (what options are available for each field) are decided by experts.
Despite some claims to the contrary, Wikipedia is a remarkably successful case of well-organized self-organization
Write a full post in response to this!
Similar articles
Do you like this post?
Vote for it!
Copyright information
This article is made available under the "Attribution-Sharealike" Creative Commons License 3.0 available from http://creativecommons.org/licenses/by-sa/3.0/.
Biography
Terry Hancock: Terry Hancock is co-owner and technical officer of Anansi Spaceworks, dedicated to the application of free software methods to the development of space.
- Login or register to post comments
- 7404 reads
- Printer friendly version (unavailable!)




Best voted contents
-
Google App Engine: Is it evil?
Terry Hancock, 2008-04-24 -
The Bizarre Cathedral - 3
Ryan Cartwright, 2008-05-05 -
Free Software Magazine Awards 2008
Tony Mobily, 2008-04-22 -
The Bizarre Cathedral - 2
Ryan Cartwright, 2008-04-27
Similar entries
Buzz authors
All news
From the FSM staff...
- The Top 10 Everything (Dave). The good, the bad and the ugly.
- Free Software news (Dave & Bridget). A site about short stories and writing.
- Book Reviews: Illiterarty (Bridget). Book reviews, blogs, and short stories.
Hot topics - last 60 days
-
Installing an all-in-one printer device in Debian
Ryan Cartwright, 2008-05-05 -
What is the free software community?
Tony Mobily, 2008-03-29 -
Things you miss with GNU/Linux
Ryan Cartwright, 2008-05-01 -
How do you replace Microsoft Outlook? Groupware applications
Ryan Cartwright, 2008-03-20 -
Beyond Synaptic - using apt for better package management
Ryan Cartwright, 2008-04-03
Hot topics - last 21 days
-
Installing an all-in-one printer device in Debian
Ryan Cartwright, 2008-05-05 -
Things you miss with GNU/Linux
Ryan Cartwright, 2008-05-01 -
Digital Rights Management (DRM): is it in its death throes?
Gary Richmond, 2008-05-07 -
Open letter to standards professionals, developers, and activists
Pieter Hintjens, 2008-05-13
Dedicated server