Indexing offline CD-ROM archives


Suppose you've been good (or sort of good anyway), and you have a huge stack of CD-ROMs (or DVDs) with backups and archives of your old files. Great. But how can you find anything? I solved this problem today by making an index of all the files stored on these disks using a few simple GNU command line tools.

I have a metal CD case that is supposed to hold 200 CDs. It's about half full, and most of those disks are backups or archives: about 75 backup disks in all.

There's no way I can remember what's on them all. Nor, typically, can I remember which disk will have a given file that I know I had at one point. This came up today because I'm preparing a presentation and I need some old image files that I know I had a few years ago. Moreover, I'm positive that I would've backed them up. But where?

Well, obviously, I'm going to have to search every disk. But how can I make that easier? And isn't there some way I can avoid doing that in the future?

Well, of course, I need to make some kind of database or index of all the files. Something like what the locate command searches to find files on my running system.

An index directory

So, I made a directory, which I called DiskIndex. Then I proceeded to fill it with files named with an identification code, which I also wrote on each of the disks. For example, I labeled the first disk from 2001 "TCD2001-001" for "Terry's CD, year 2001, first one indexed" (it was too much trouble for me to refine things down to the month, but you could obviously do that, too).

Then I created files for each disk in DiskIndex. I decided to make four possible files for each disk:

  • TCD2001-001.files will be a list of full paths occurring on the CD
  • TCD2001-001.tree will be a tree representation of the directories
  • TCD2001-001.arch will be a list of files in "tgz" archives on the CD
  • TCD2001-001.read will be a copy of the top level "README" file if present

My reasoning is that I can use grep to find file or directory names in the .files file (this gives me a "search" option), and of course, if the file only appears in a tarball, I can find it by searching the .arch file as well. The .read and .tree files are useful for "browsing" the disks, and I will also print those out to store alongside the disks, giving myself a topic-oriented way to find information.

Obviously, I could program this in Python, or write it as a bash script, but I don't actually plan to do this often enough for it to be useful, so I'll just run a few commands on the command line to get what I want. Here's a walk-through.

Obviously, I must mount the CD-ROM. I have my /etc/fstab set up to allow any user to mount or unmount the CD, so this is easy. I do have to tell KDE not to automatically open the CD-ROM in Konqueror with the automount feature (this may or may not happen for you, depending on what version of KDE you have).

$ mount /cdrom

Among my collection, though, I also have some old Macintosh-formatted CD-ROMs, which are kind of a pain. But I can mount them like this:

$ su
Password?
# mount -t hfs /dev/hdc /cdrom

Something similar may work for you. My CD drive is an ATAPI drive attached to the secondary IDE controller, so it appears as /dev/hdc. This is a common but not universal arrangement, so your system may be different. Some GNU/Linux distributions also define /dev/cdrom.

Now I can make my indexes, using find and tree:

$ find /cdrom > TCD2001-001.files
$ tree -d /cdrom > TCD2001-001.tree
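
If you would rather script this step, here is a minimal Python sketch of the same idea as the find command above. The function names (list_paths, write_file_index) are made up for illustration, and the order of the lines may differ from what find prints, which doesn't matter for grep.

```python
import os
import tempfile


def list_paths(mount_point):
    """Return mount_point and every path below it (hypothetical helper)."""
    paths = [mount_point]
    for dirpath, dirnames, filenames in os.walk(mount_point):
        dirnames.sort()  # make the walk order repeatable
        for name in sorted(dirnames + filenames):
            paths.append(os.path.join(dirpath, name))
    return paths


def write_file_index(mount_point, disk_id, index_dir="."):
    """Write one path per line to <index_dir>/<disk_id>.files; return that path."""
    index_path = os.path.join(index_dir, disk_id + ".files")
    # surrogateescape keeps odd file names on old disks from raising errors
    with open(index_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for path in list_paths(mount_point):
            out.write(path + "\n")
    return index_path


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for the mounted disk.
    with tempfile.TemporaryDirectory() as fake_cd, \
            tempfile.TemporaryDirectory() as index_dir:
        os.makedirs(os.path.join(fake_cd, "Writing", "MoonGuide"))
        index = write_file_index(fake_cd, "TCD2001-001", index_dir)
        with open(index) as f:
            print(f.read(), end="")
```

With a real disk mounted, the call would be something like write_file_index("/cdrom", "TCD2001-001", "DiskIndex").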

I check for whether there is a README of some kind, and copy that to my index, if there is:

$ ls /cdrom/*README*
/cdrom/README.txt
$ cat /cdrom/*README* > TCD2001-001.read
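
The reason for checking with ls first is that the cat redirection would create a .read file even on a disk with no README. A scripted version can fold that check in. This is only a sketch: copy_readmes is a hypothetical name, and I'm assuming that concatenating every top-level match in sorted order is acceptable.

```python
import glob
import os
import tempfile


def copy_readmes(mount_point, disk_id, index_dir="."):
    """Concatenate top-level *README* files into <index_dir>/<disk_id>.read.

    Returns the path written, or None when the disk has no README, so that
    no empty .read file is left behind.
    """
    pattern = os.path.join(glob.escape(mount_point), "*README*")
    readmes = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not readmes:
        return None
    out_path = os.path.join(index_dir, disk_id + ".read")
    # Binary mode: copy the bytes as they are, whatever their encoding.
    with open(out_path, "wb") as out:
        for readme in readmes:
            with open(readme, "rb") as src:
                out.write(src.read())
    return out_path


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for the mounted disk.
    with tempfile.TemporaryDirectory() as fake_cd, \
            tempfile.TemporaryDirectory() as index_dir:
        with open(os.path.join(fake_cd, "README.txt"), "w") as f:
            f.write("Backups from 2001\n")
        print(copy_readmes(fake_cd, "TCD2001-001", index_dir))
```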

Finally, I look for archive files. I almost always use tar with the z option, which compresses the archive with gzip, and I almost always use the .tgz extension, so I can safely assume that the archives I want are in this format.

You know your own habits, so if you have some other practice, you'll need to make changes accordingly, either in recognizing the files, or in what application you use to read their indexes.

find /cdrom -name "*.tgz"                  \
    -exec echo "ARCHIVE {}" \;             \
    -exec tar tf {} \; >> TCD2001-001.arch
umount /cdrom

In case you're not already familiar with them: the backslash characters (\) are there to "escape" the newline, so that the first three lines above appear as one single line to the shell.

This command produces my archive listing. It's probably worth breaking that down, as it may look pretty complicated to you. What I'm doing here is searching the directories under /cdrom (which is where my disk is mounted, of course) for files matching *.tgz, which I'm assuming are all the archives I need to expand. Then, whenever I find one, I execute (with the -exec option) two different commands: the first to output the name of the archive file, and the second to actually list its contents.

The variable symbol used by find is a little weird. I don't know any other program that uses this convention, but {} represents the found file name in -exec options.

Finally, of course, I redirect this output to append (what >> does) to my chosen .arch file.
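
For comparison, here is a rough Python sketch of the same find/tar step using the standard tarfile module. The function name index_archives and the UNREADABLE marker line are my own inventions, not part of the commands above. Two differences to be aware of: member names come out without the trailing slash that tar prints for directories, and a damaged archive gets a marker line instead of an error message on the terminal.

```python
import os
import tarfile
import tempfile
import zlib


def index_archives(mount_point, disk_id, index_dir="."):
    """Append an "ARCHIVE <path>" header plus member names for every .tgz file
    under mount_point to <index_dir>/<disk_id>.arch; return that path."""
    arch_path = os.path.join(index_dir, disk_id + ".arch")
    # Append mode mirrors the >> redirection, so a re-run adds duplicate entries.
    with open(arch_path, "a", encoding="utf-8", errors="surrogateescape") as out:
        for dirpath, dirnames, filenames in os.walk(mount_point):
            dirnames.sort()
            for name in sorted(filenames):
                if not name.endswith(".tgz"):
                    continue
                archive = os.path.join(dirpath, name)
                out.write("ARCHIVE %s\n" % archive)
                try:
                    with tarfile.open(archive, "r:gz") as tar:
                        for member in tar:
                            out.write(member.name + "\n")
                except (tarfile.TarError, OSError, EOFError, zlib.error) as err:
                    # Hypothetical marker so a bad archive is visible in the index.
                    out.write("UNREADABLE %s\n" % err)
    return arch_path


if __name__ == "__main__":
    # Demo: build a small .tgz in a throwaway directory and index it.
    with tempfile.TemporaryDirectory() as fake_cd, \
            tempfile.TemporaryDirectory() as index_dir:
        note = os.path.join(fake_cd, "notes.txt")
        with open(note, "w") as f:
            f.write("hello\n")
        with tarfile.open(os.path.join(fake_cd, "old.tgz"), "w:gz") as tar:
            tar.add(note, arcname="docs/notes.txt")
        with open(index_archives(fake_cd, "TCD2001-001", index_dir)) as f:
            print(f.read(), end="")
```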

Making the master table-of-contents

Finally, I want to make a master table-of-contents document that I can use to browse my CD-ROM collection, and tell quickly what I will find on each disk. I want this to be compact enough that I can print it out and store it in the case with the disks.

After trying a couple of command line approaches with this, I decided it'd be simpler to just write a Python script, which is what I did:

#! /usr/bin/env python

import glob, os

toc_header = """\
<html>
<head>
<title>Offline Archive Disks</title>
<link rel="stylesheet" href="toc.css">
</head>
<body>
<h1>Offline Archive Disks</h1>
<hr />
"""

toc_footer = """\
</body>
</html>
"""

toc_fmt = """\
<h2>%s</h2>
<pre class="readme">
%s
</pre>
<pre class="dirtree">
%s
</pre>
<hr />
"""

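# One .tree file per indexed disk; sorting keeps the disks in label order.
tree_files = sorted(glob.glob("*.tree"))

# Indentation prefixes (in tree's ASCII output) of entries three or more
# levels deep. Lines starting with one of these are left out of the TOC.
deep_prefixes = (
    "|   |   |",
    "|   |   `",
    "|       |",
    "|   |    ",
    "|        ",
    "|       `",
    "    |   |",
    "    |   `",
    "    |    ",
    "        |",
    "        `",
    "         ")

toc = toc_header
for tree_file in tree_files:
    disk_id = tree_file[:-5]    # strip the ".tree" extension

    with open(tree_file, 'rt') as f:
        dirtree_lines = f.read().split('\n')
    dirtree = '\n'.join(line for line in dirtree_lines
                        if line[:9] not in deep_prefixes)

    # The .read file is optional: not every disk has a README.
    readme_file = disk_id + ".read"
    if os.path.exists(readme_file):
        with open(readme_file, 'rt') as f:
            readme = f.read()
    else:
        readme = ''
    toc += toc_fmt % (disk_id, readme, dirtree)

toc += toc_footer

with open('disks_toc.html', 'wt') as f:
    f.write(toc)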

This, of course, builds a very simple HTML document out of my tree files (and incidentally, gets rid of the deeper parts of the tree to keep the file from getting too long). I then printed this out to a PostScript file, and used psnup to print two logical pages per physical page, to save a little paper:

cat toc.ps | psnup -2 > toc_2.ps
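
The list of nine-character prefixes in the script is one way to drop everything below the second level of the tree. A more general variant computes each entry's depth from its indentation instead. The sketch below assumes tree's ASCII drawing style ("|-- " and "`-- ", with four characters of indent per level), and the names entry_depth and prune_tree are made up for this example. If your version of tree draws with line-drawing characters, neither this nor the prefix list will match without adjustment.

```python
import re

# Zero or more indent units ("|   " or "    ") followed by a branch connector.
_ENTRY = re.compile(r"^((?:\|   |    )*)(?:\|-- |`-- )")


def entry_depth(line):
    """Return the nesting depth of a tree entry (1 = top level), or None for
    lines that are not entries (the root path, blank lines, the summary)."""
    match = _ENTRY.match(line)
    if match is None:
        return None
    return len(match.group(1)) // 4 + 1


def prune_tree(text, max_depth=2):
    """Drop entries nested deeper than max_depth; keep every other line."""
    kept = []
    for line in text.split("\n"):
        depth = entry_depth(line)
        if depth is None or depth <= max_depth:
            kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "\n".join([
        "/cdrom",
        "|-- Clipart",
        "|   `-- MoonGuide",
        "|       `-- Diagram",
        "`-- Writing",
        "    `-- MoonGuide",
        "        `-- Diagram",
        "",
        "6 directories",
    ])
    print(prune_tree(sample))
```

With max_depth=2 this keeps the same two levels as the prefix list, and changing the number is easier than editing the list. Limiting the depth with tree's -L option at indexing time would also work, but then the saved .tree files themselves would be truncated.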

Finding files

Well, that worked well enough for my purposes. I hope it will be useful to you as well. To use the index, of course, you just use grep. For example, have I got an image of a Saturn V in my collection?

$ grep saturn_v *.files
TCD2001-001.files:/cdrom/Writing/MoonGuide/Diagram/a---.saturn_v.dia.eps
TCD2001-001.files:/cdrom/Writing/MoonGuide/Diagram/a---.saturn_v.dia.jpg
TCD2001-001.files:/cdrom/Writing/MoonGuide/Diagram/a---.saturn_v.dia.ppm
TCD2002-001.files:/cdrom/Clipart/MoonGuide/Diagram/a---.saturn_v.dia.jpg

And obviously I have! Which means I can finish writing that presentation. So good luck!
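
Plain grep on the .arch files tells you which disk a file is on, but not which tarball it sits in. A small script can carry the most recent ARCHIVE header along with each match. This is a sketch only: search_index is a hypothetical helper, it assumes the "ARCHIVE " header lines written by the find command earlier, and unlike plain grep it ignores case.

```python
import glob
import os
import tempfile


def search_index(pattern, index_dir="."):
    """Return (disk_id, description) pairs for index lines containing pattern."""
    needle = pattern.lower()
    base = glob.escape(index_dir)
    hits = []

    # Plain file listings: report the matching path as it is.
    for path in sorted(glob.glob(os.path.join(base, "*.files"))):
        disk_id = os.path.basename(path)[:-len(".files")]
        with open(path, errors="replace") as f:
            for line in f:
                line = line.rstrip("\n")
                if needle in line.lower():
                    hits.append((disk_id, line))

    # Archive listings: remember which tarball the current lines belong to.
    for path in sorted(glob.glob(os.path.join(base, "*.arch"))):
        disk_id = os.path.basename(path)[:-len(".arch")]
        archive = None
        with open(path, errors="replace") as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith("ARCHIVE "):
                    archive = line[len("ARCHIVE "):]
                elif needle in line.lower():
                    hits.append((disk_id, "%s (inside %s)" % (line, archive)))
    return hits


if __name__ == "__main__":
    # Demo with two tiny made-up index files.
    with tempfile.TemporaryDirectory() as index_dir:
        with open(os.path.join(index_dir, "TCD2001-001.files"), "w") as f:
            f.write("/cdrom\n/cdrom/Diagram/a---.saturn_v.dia.jpg\n")
        with open(os.path.join(index_dir, "TCD2002-001.arch"), "w") as f:
            f.write("ARCHIVE /cdrom/old.tgz\nimg/saturn_v.png\n")
        for disk_id, where in search_index("saturn_v", index_dir):
            print(disk_id, where)
```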

Comments

Submitted by mmmmna

I took the time to copy to a local hard disk the contents of every CD I had ever burned. As I sorted the content into a directory hierarchy, I realized that my prior methodology was faulty: I would archive a folder daily until it totally filled a CDR, then I'd delete the folder and start anew. In other words, some early files were burned to CDR multiple times; some later files only got burned once. After I restored all the archived CDRs, after I sorted the files according to topic, weeded out all the files I no longer needed (BIOS flashes for Slot 1 motherboards, manuals for machinery I no longer own, etc), after I resolved that there were no duplicates, I then made a master archive of 11 CDRs. The rest of the original incremental CDRs have been destroyed.

Submitted by Terry Hancock

"I took the time to copy to a local hard disk the contents of every CD I had ever burned."

Yeah, that's a remarkable thing. That's usually practical nowadays, with such large hard drives readily available.

I certainly considered doing this with the collection of CD-ROMs I indexed above. Of course, the whole point of making the archives was to put stuff I didn't really need cluttering up my disk into offline storage (and by "cluttering" I mean "providing confusing forests of files that get in the way of the ones I'm trying to find", not "consuming too much disk space" as I would once have thought).

Having really cheap disk space changes your perspective.

Author information

Biography

Terry Hancock is co-owner and technical officer of Anansi Spaceworks. Currently he is working on a free-culture animated series project about space development, called Lunatics, as well as helping out with the Morevna Project.
