Format Wars

File formats: the past, the present and a possible future

Real programmers love their applications’ source code: the faster and more elegant it is, the better. Users are after very different things: they seem to want simplicity, flashy colors, nice icons and tons of options. In spite of these differences, or perhaps because of them, programmers and users often forget what lies in the middle of it all: information.

Who owns the information?

Almost all software applications are used to manage information, so they are worthless without information to process, store and display. For example, you could use a word processor to write letters, or a video editing suite to edit footage of your girlfriend at the beach.

If information exists before (and independent of) the applications, the file format used to store the information should be defined beforehand. In this ideal situation, you could potentially write several programs (released under free or non-free licenses) to handle your information.

An OpenOffice RTF file opened with Word X for Macintosh

Please keep in mind that here “information” means any kind of creative work: blog entries, private movies, essays, government reports, court rulings, road projects… In an ideal world, the format used to store this information doesn’t matter: it should simply belong only to its author, or whoever paid for its production.

In practice, applications and file formats have historically grown and changed together. Moreover, the file formats of proprietary software have not always been documented (see Microsoft products), or the documentation has been available only to those willing to sign unacceptable NDAs (Non-Disclosure Agreements). As a result, digital information isn’t always under the complete control of the person who created it.

The same OpenOffice RTF file opened with Word 2004 for Macintosh

In my opinion this problem has been underestimated for a long time, probably because in the beginning people didn’t think it was such a big deal.

First of all, far fewer people had computers. When they did have them, the machines were rarely networked and were physically incompatible (think of Macs and PCs, which even had problems sharing a floppy disk!). Resources were very limited: monitors, processors and hard drives weren’t even remotely comparable to what we have today, and therefore visually “fancy” information wasn’t as important as it is today (think WYSIWYG). Even complex spreadsheets were stored in CSV format (plain text with values separated by commas) or as binary files. Back then, the situation was similar to today’s: if the information was stored as text files, you could use powerful text processing tools like sed, awk and then Perl. If it was stored in a binary format, reverse engineering and black magic fixed most of the problems. Exchanging information at that point wasn’t often a problem; even when binary-only formats became more common thanks to WordStar and AutoCAD, the end product was nearly always a stack of paper that was to be shipped or archived somewhere.

This paper could then be read even centuries after it was written, without a concern for what “brand” of paper, or which printer or pen had been used to write on it.

In a way, paper was the lingua franca.

Today, with the internet, CDs and search engines, any file can be used and distributed in several different ways without ever turning into durable, non-proprietary (and non-searchable, I must add) printed paper. Talk about progress…

Today’s scenario

Today’s scenario is in many ways similar to what it was a few years ago, just a bit more complicated. Proprietary file formats are now more complex than before and therefore harder to reverse-engineer. Text-based file formats are still based on text (obviously!), but they have gained a level of complexity as well: rather than representing the information directly (like plain text documents or CSV spreadsheets do), they are usually based on XML.

For example, the content of a cell in OpenOffice.org could be represented with this:

<style:properties style:column-width="1.785cm"/>
...
<table:table-cell><text:p>600000</text:p></table:table-cell>

The two lines above simply state that the column containing this cell must be 1.785 cm wide and that the cell stores the number 600000.
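As a rough illustration of how little code it takes to read such a fragment, here is a minimal Python sketch using the standard library's ElementTree parser. The `<doc>` wrapper element and the `urn:example:...` namespace URIs are placeholders invented for this example (a real OpenOffice.org file declares its own namespaces), and the `read_cell` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions for this sketch only);
# a real document declares its own on the root element.
NS = {
    "style": "urn:example:style",
    "table": "urn:example:table",
    "text": "urn:example:text",
}

# The two lines from the article, wrapped in a hypothetical <doc> root
# so that the fragment is well-formed XML.
FRAGMENT = """
<doc xmlns:style="urn:example:style"
     xmlns:table="urn:example:table"
     xmlns:text="urn:example:text">
  <style:properties style:column-width="1.785cm"/>
  <table:table-cell><text:p>600000</text:p></table:table-cell>
</doc>
"""


def read_cell(xml_text):
    """Return (column_width, cell_text) from the fragment."""
    root = ET.fromstring(xml_text)
    props = root.find("style:properties", NS)
    # ElementTree stores attribute names as {namespace-uri}local-name.
    width = props.get("{%s}column-width" % NS["style"])
    value = root.find("table:table-cell/text:p", NS).text
    return width, value


width, value = read_cell(FRAGMENT)
print(width, value)  # prints: 1.785cm 600000
```

Note that the parser returns both values as strings: deciding that "600000" is a number and "1.785cm" a length is still up to the application.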

A paragraph in a letter could be:

<p>This is the <b>first</b> paragraph</p>

<p>This is the second one</p>

The advantages of XML files are clear: anybody can write an application which manipulates them, as long as they know what every XML tag means in that specific context.
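To make this concrete, the following Python sketch strips the markup from the letter paragraphs shown earlier and returns their plain text. It assumes that `<p>` marks a paragraph and `<b>` marks bold text, which is exactly the kind of knowledge about tag meaning mentioned above; the `<letter>` wrapper and the `plain_paragraphs` helper are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET

# XML requires a single root element; <letter> is a hypothetical wrapper.
LETTER = (
    "<letter>"
    "<p>This is the <b>first</b> paragraph</p>"
    "<p>This is the second one</p>"
    "</letter>"
)


def plain_paragraphs(xml_text):
    """Return the text of each <p>, ignoring inline markup such as <b>."""
    root = ET.fromstring(xml_text)
    # itertext() walks the text inside an element and all its children.
    return ["".join(p.itertext()) for p in root.findall("p")]


paragraphs = plain_paragraphs(LETTER)
for line in paragraphs:
    print(line)
```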

A word on encoding

Even “plain text” can mean different things, depending on how it’s encoded. The encoding defines which sequence of bits represents a particular character (such as a letter, a white space, symbols like “©” and “#”, and so on) used in a written language.

In ASCII (American Standard Code for Information Interchange), for example, the sequence “01000001” corresponds to the capital letter “A”.
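The mapping can be checked in a few lines of Python. This is only a small sketch of the bit-sequence-to-character correspondence described above; the variable names are arbitrary.

```python
bits = "01000001"

# Interpret the eight bits as a binary number: 65.
code = int(bits, 2)

# Decode that single byte using the ASCII table.
char = bytes([code]).decode("ascii")

# Going the other way: character -> code -> eight-bit string.
back = format(ord("A"), "08b")

print(code, char, back)  # prints: 65 A 01000001
```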

The ASCII encoding (or format) is ubiquitous these days, but it has simply outlived its usefulness in a networked world where most people don’t speak English. Over the last few years many more encodings have been created in order to deal with almost any other language on the planet, including those that don’t use the Latin alphabet (Chinese, Hindi, Korean, Japanese…). The resulting confusion has been made worse by the fact that “plain text files” don’t contain, by definition, any headers to declare their internal encoding. Consequently, the programs processing them have to guess, or be told, which encoding they should use to display them; otherwise, blank or strange characters are displayed instead of the correct ones.
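A short Python sketch shows what happens when a program guesses wrong. It takes the “©” symbol mentioned earlier, stores it as UTF-8, and then reads the same bytes back under three different assumptions. The choice of UTF-8 and Latin-1 is only an example; any pair of incompatible encodings would show the same effect.

```python
text = "\u00a9"  # the copyright sign

# Stored as UTF-8, the symbol takes two bytes: 0xC2 0xA9.
raw = text.encode("utf-8")

# A program that is told the right encoding recovers the symbol.
right = raw.decode("utf-8")

# A program that guesses Latin-1 sees two characters instead of one.
wrong = raw.decode("latin-1")

# A program that insists on plain ASCII cannot decode the bytes at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(raw, right, wrong, ascii_ok)
```

Nothing in the two stored bytes says which reading is correct, which is why the encoding has to be guessed or supplied from outside the file.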

Copyright information

Verbatim copying and distribution of this entire article is permitted in any medium without royalty provided this notice is preserved.

Biography

Marco Fioretti: Marco Fioretti is a freelance writer based in Italy.