Format Wars
File formats: the past, the present and a possible future
- 2005-01-20
- Focus | Intermediate
Real programmers love their applications’ source code: the faster and more elegant it is, the better. Users are after very different things: they seem to want simplicity, flashy colors, nice icons and tons of options. In spite of these differences, or perhaps because of them, programmers and users often forget what lies in the middle of it all: information.
Who owns the information?
Almost all software applications are used to manage information, so they are worthless without information to process, store and display. For example, you could use a word processor to write letters or a video editing suite to edit footage of your girlfriend at the beach.
If information exists before (and independent of) the applications, the file format used to store the information should be defined beforehand. In this ideal situation, you could potentially write several programs (released under free or non-free licenses) to handle your information.
Please keep in mind that here “information” means any kind of creative work: blog entries, private movies, essays, government reports, court rulings, road projects… In an ideal world, the format used to store this information doesn’t matter: it should simply belong only to its author, or whoever paid for its production.
In practice, applications and file formats have historically grown and changed together. Moreover, the file formats of proprietary software have often been undocumented (see Microsoft products), or documented only for those willing to sign unacceptable NDAs (Non-Disclosure Agreements). The result is that digital information isn’t always under the complete control of the person who created it.
In my opinion this problem has been underestimated for a long time, probably because in the beginning people didn’t think it was such a big deal.
First of all, far fewer people had computers. When they did have them, they weren’t often networked and were physically incompatible (think of Macs and PCs, which even had problems sharing a floppy disk!). Resources were very limited: monitors, processors and hard drives weren’t even remotely comparable to what we have today, and therefore visually “fancy” information wasn’t as important as it is today (think WYSIWYG). Even complex spreadsheets were stored in CSV format (plain text with values separated by commas) or as binary files. Back then, there was a situation similar to today’s: if the information was stored as text files, you could use powerful text processing tools like sed, awk and then Perl. If it was stored in binary format, reverse engineering and black magic fixed most of the problems. Exchanging information at that point wasn’t often a problem; even when binary-only formats became more common thanks to WordStar and AutoCAD, the end product was nearly always a stack of paper that was to be shipped or archived somewhere.
This paper could then be read even centuries after it was written, without a concern for what “brand” of paper, or which printer or pen had been used to write on it.
In a way, paper was the lingua franca.
Today, with the internet, CDs and search engines, any file can be used and distributed in several different ways without ever turning into durable, non-proprietary (and non-searchable, I must add) printed paper. Talk about progress…
Today’s scenario
Today’s scenario is somewhat similar to what it was a few years ago, just a bit more complicated. Proprietary file formats are now more complex than before and therefore harder to reverse-engineer. Text-based file formats are still based on text (obviously!), but they have gained a level of complexity as well: rather than representing the information directly (like plain text documents or CSV spreadsheets do), they are usually based on XML.
For example, the content of a cell in OpenOffice.org could be represented with this:
<style:properties style:column-width="1.785cm"/>
...
<table:table-cell><text:p>600000</text:p></table:table-cell>
The two elements above simply state that the width of the column containing this cell must be 1.785 cm and that the cell stores the number 600000.
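To make the point concrete, here is a minimal Python sketch that reads the cell value back out of such a fragment using only the standard library. The namespace URIs (`urn:example:table`, `urn:example:text`) and the function name `cell_value` are assumptions made for illustration; a real OpenOffice.org file declares its own namespaces in the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, declared only so that the fragment parses.
# A real OpenOffice.org document defines its own URIs for these prefixes.
NS = {
    "table": "urn:example:table",
    "text": "urn:example:text",
}

FRAGMENT = (
    '<table:table-cell xmlns:table="urn:example:table" '
    'xmlns:text="urn:example:text">'
    "<text:p>600000</text:p>"
    "</table:table-cell>"
)


def cell_value(xml_fragment):
    """Return the text of the first text:p element inside a table cell."""
    cell = ET.fromstring(xml_fragment)
    paragraph = cell.find("text:p", NS)
    return paragraph.text if paragraph is not None else None


print(cell_value(FRAGMENT))
```

No knowledge of the application that wrote the file is needed here, only the meaning of the two tags.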
A paragraph in a letter could be:
<p>This is the <b>first</b> paragraph</p>
<p>This is the second one</p>
The advantages of XML files are clear: anybody can write an application which manipulates them, as long as they know what every XML tag means in that specific context.
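As a rough illustration of that claim, the following Python sketch extracts the plain text of each paragraph from the letter example, ignoring the bold markup. The `<letter>` wrapper element and the function name `plain_paragraphs` are assumptions added here, since an XML document needs a single root element and the fragment above does not show one.

```python
import xml.etree.ElementTree as ET

# The <letter> root element is hypothetical: it only wraps the two
# paragraphs so that the text forms one well-formed XML document.
LETTER = (
    "<letter>"
    "<p>This is the <b>first</b> paragraph</p>"
    "<p>This is the second one</p>"
    "</letter>"
)


def plain_paragraphs(xml_text):
    """Return the text of every <p> element, with inline markup removed."""
    root = ET.fromstring(xml_text)
    return ["".join(p.itertext()) for p in root.iter("p")]


for line in plain_paragraphs(LETTER):
    print(line)
```

Knowing only that `<p>` marks a paragraph and `<b>` marks bold text is enough to write this small converter.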
A word on encoding
Even “plain text” can mean different things, depending on how it’s encoded. The encoding defines which sequence of bits represents a particular character (such as a letter, a white space, symbols like “©” and “#”, and so on) used in a written language.
In ASCII (American Standard Code for Information Interchange), for example, the sequence “01000001” corresponds to the capital letter “A”.
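The mapping between bits and characters can be checked with a few lines of Python. This is only a sketch; the helper names `bits_to_ascii` and `ascii_to_bits` are made up for this example.

```python
def bits_to_ascii(bits):
    """Decode one 8-bit binary string into the ASCII character it encodes."""
    return bytes([int(bits, 2)]).decode("ascii")


def ascii_to_bits(char):
    """Encode one ASCII character as an 8-bit binary string."""
    return format(char.encode("ascii")[0], "08b")


print(bits_to_ascii("01000001"))  # A
print(ascii_to_bits("A"))         # 01000001
```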
The ASCII encoding (or format) is really ubiquitous these days, but has simply outlived its usefulness in a wired world where most people don’t speak English. Over the last few years many more types of encoding have been created in order to deal with almost any other language on the planet, including those that do not use the Latin alphabet (Chinese, Hindi, Korean, Japanese…). The resulting confusion has been made worse by the fact that “plain text files” don’t contain, by definition, any headers to declare their internal encoding. Consequently, the programs processing them have to guess, or be told, which encoding they should use to display them; otherwise, blank or strange characters are displayed instead of the correct ones.
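A small Python sketch shows what a wrong guess looks like. It assumes, for illustration, that the text was saved as UTF-8 and that the reading program guesses Latin-1 or ASCII instead; the function name `decode_as` is hypothetical.

```python
def decode_as(raw_bytes, encoding):
    """Decode bytes with the given encoding, marking undecodable bytes."""
    return raw_bytes.decode(encoding, errors="replace")


# The copyright sign, stored as UTF-8, occupies two bytes.
raw = "©".encode("utf-8")

print(decode_as(raw, "utf-8"))    # correct guess: ©
print(decode_as(raw, "latin-1"))  # wrong guess: Â©
print(decode_as(raw, "ascii"))    # wrong guess: replacement characters
```

The bytes are identical in all three cases; only the assumed encoding changes, which is why the file itself cannot settle the question.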
Copyright information
Verbatim copying and distribution of this entire article is permitted in any medium without royalty provided this notice is preserved.
Biography
Marco Fioretti: Marco Fioretti is a freelance writer based in Italy.