The world does not need a "conversion nightmare": a standard office file format already exists

This is an editorial about file conversions. It starts with a story about Free Software Magazine and our struggle with article formats, and goes on to explain why the world needs to get rid of Office Open XML, which could create more problems than the Microsoft monopoly itself.

When I started Free Software Magazine, we faced the problem every publication has to face: which file format should we use for articles? It was a few years ago now (as they say, time flies when you have fun!). At the time, the web site wasn't our main focus: we were actually printing a paper magazine (!), we were generating amazing PDF files using LaTeX, and we decided that a static web site was going to "do" for quite a while. We decided that the "master" format for our articles would be XML. XML seemed like a good idea at the time. None of the other options seemed quite as feasible: text wasn't enough, HTML was too vague, ODF was too complex, and so on. Plus, everybody was using it.

Since we couldn’t find a single decent semi-visual XML editor, we asked our authors to hand in XML directly. Of course, people became very creative when they created an article file: we had to write a script that deleted white spaces around tags, and generally "cleaned up" the XML files we received. We also had to check manually that the files had the right em dashes, the right opening and closing speech marks, the right apostrophes, and so on. I won't even get started on the problems some authors had with getting the XML right: <p> tags left unclosed, <li> items without <ul> first, and so on. It doesn't sound complicated, but when you have a 2500-word article full of listings, text boxes, figures and so on, and (even worse) when the XML error you get from the parser is as unhelpful as it could be, things get tricky. It was a small nightmare, which repeated itself with every issue of the magazine, and nearly every article. Two prospective (and influential) bloggers refused accounts with Free Software Magazine when they realised they would have to spend time tagging up XML files. Laziness? Maybe. But, as we say around here, "fair enough".
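The kind of clean-up and checking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script the magazine actually used: the function names and the set of block-level tags are hypothetical, and the well-formedness check simply relies on the standard library parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical set of block-level tags; the real article format is not shown here.
BLOCK_TAGS = "p|li|ul|article"


def clean_whitespace(xml_text: str) -> str:
    """Strip whitespace just inside the block-level tags listed above."""
    # Remove whitespace directly after an opening block tag ...
    xml_text = re.sub(rf"<({BLOCK_TAGS})>\s+", r"<\1>", xml_text)
    # ... and directly before a closing block tag.
    xml_text = re.sub(rf"\s+</({BLOCK_TAGS})>", r"</\1>", xml_text)
    return xml_text


def check_well_formed(xml_text: str):
    """Return None if the text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None
```

Running `check_well_formed` on an article with an unclosed <p> tag reports where the parser gave up, which is usually some distance from where the author actually made the mistake.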

Luckily, the delirium is now over. We have upgraded our article format to Markdown Extra (although it has a few tweaks to allow tables and text boxes). Authors can now write articles following this Free Software Magazine article template. Issue 21, this very issue, was edited mainly using the new file format.

Converting the articles from XML to Markdown Extra/FSM was a lot of hard work. I just about managed to do it using XSLT with custom PHP calls within the XSL file. (If you are thinking "the XSLT from a basic format to Markdown should be simple", I will give you a few keywords: "white spaces", "enters", "tables", "clashing escape characters", "CDATA", and so on). The conversion required substantial trial-and-error and tweaking. It contains several hacks I am not especially proud of. To date, I am not yet 100% sure it actually works for every single article. And we are talking about translating an extremely simple XML format into an extremely simple text format. As always, writing the conversion was the easy part. Getting it to actually work on every article was the tricky bit.
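To give a feel for why even a "simple" conversion gets fiddly, here is a minimal Python sketch of an XML-to-Markdown converter. It is not the XSLT/PHP converter described above: the tag names (`p`, `ul`, `li`, `em`, `strong`) and all function names are assumptions made for illustration. Even at this size it has to deal with two of the keywords listed: stray white space and line breaks, and characters that are harmless in XML but special in Markdown.

```python
import re
import xml.etree.ElementTree as ET

# Characters that mean something in Markdown and must be escaped in plain text.
MD_SPECIALS = re.compile(r"([\\*_`\[\]])")


def escape_md(text: str) -> str:
    """Backslash-escape Markdown special characters."""
    return MD_SPECIALS.sub(r"\\\1", text)


def normalise(text) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text or "")


def inline(elem) -> str:
    """Convert the inline content of one element to Markdown."""
    parts = [escape_md(normalise(elem.text))]
    for child in elem:
        inner = inline(child).strip()
        if child.tag == "em":
            parts.append(f"*{inner}*")
        elif child.tag == "strong":
            parts.append(f"**{inner}**")
        else:
            parts.append(inner)  # unknown inline tag: keep the text only
        parts.append(escape_md(normalise(child.tail)))
    return "".join(parts)


def to_markdown(xml_text: str) -> str:
    """Convert a tiny, hypothetical article format to Markdown."""
    root = ET.fromstring(xml_text)
    blocks = []
    for block in root:
        if block.tag == "p":
            blocks.append(inline(block).strip())
        elif block.tag == "ul":
            items = [f"- {inline(li).strip()}" for li in block.findall("li")]
            blocks.append("\n".join(items))
    return "\n\n".join(blocks) + "\n"
```

Tables, text boxes, CDATA sections and nested lists are all missing from this sketch, and each of them adds its own special cases.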

This change won't affect you --well, apart from the occasional glitch due to the occasional hard-to-translate article (we have over 2000 articles in our database, and we checked things by "statistical sampling"...). What is interesting is that this adventure (which I named "article conversion hell") reminded me of something that sounds obvious, but which we tend to forget: file conversions are complicated, sub-optimal, time-consuming, imperfect by nature, often wrong, often the result of guess-work, tricky, and basically evil. When you open a Microsoft Office 2000 file using OpenOffice, things might work seamlessly, things might look a little odd, the file might look perfect--but if saved back as a Microsoft Office 2000 file, it might be ruined forever. There is a reason for this: file conversions need to be avoided (especially, as in this case, if the original file is in an undocumented back-back-back-back-backward compatible format which really doesn't deserve to exist anymore, and didn't deserve to exist in the first place). ODF isn't perfect (yet?), but it aims to be the format for office documents. It's standard, and several pieces of software today can handle it (see: it's not an OpenOffice-only game).
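The "looks perfect, but is ruined when saved back" effect is easy to reproduce with a toy model. The Python sketch below assumes two invented in-memory formats, `RichDoc` and `SimpleDoc`; neither has anything to do with the real Microsoft Office or ODF structures. Whenever one format has a feature the other cannot represent, a round trip silently drops it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RichDoc:
    """Hypothetical format that supports bold ranges and tracked changes."""
    text: str
    bold_ranges: tuple = ()
    tracked_changes: tuple = ()


@dataclass(frozen=True)
class SimpleDoc:
    """Hypothetical format that has no notion of tracked changes."""
    text: str
    bold_ranges: tuple = ()


def rich_to_simple(doc: RichDoc) -> SimpleDoc:
    # Tracked changes have nowhere to go, so they are dropped.
    return SimpleDoc(doc.text, doc.bold_ranges)


def simple_to_rich(doc: SimpleDoc) -> RichDoc:
    # Nothing to restore: the information was lost on the way in.
    return RichDoc(doc.text, doc.bold_ranges)


original = RichDoc("Hello world", ((0, 5),), ("deleted: 'cruel '",))
round_tripped = simple_to_rich(rich_to_simple(original))

# The visible text and formatting survive, but the document is not the same.
print(round_tripped == original)       # False
print(round_tripped.tracked_changes)   # ()
```

Real office formats have hundreds of such features, so every converter ends up making guesses about the ones that do not line up.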

Microsoft's attempt to shove OOXML down ISO's throat (effectively damaging, maybe beyond repair, the image of what should be an independent body) can damage the computer industry immensely. The fact that both ODF and Office Open XML are XML means absolutely nothing. You can see here a technical comparison between the two: converting one format to the other is anything but fun. The thousands of bogus documentation pages that come with OOXML don't help.

What I experienced with Free Software Magazine while converting (which, admittedly, wasn't really that big a deal) would be nothing compared to what the whole world will have to deal with if OOXML became "the" file format "normally" used to exchange office documents. A situation like this will impose constant conversions, quirks, compatibility problems, and so on, on all of us. It will also be a fantastic card for Microsoft: "look, GNU/Linux is sort of good, but you know, you can never trust it to open an XML file... sometimes the images are squint, you know..."

Microsoft knows this. Unsurprisingly, they have recently announced that they would release several conversion tools to translate ODF into OOXML and vice versa. I read the article right in the middle of my "article conversion hell", and wondered if anybody else realised how disastrous it would be if Microsoft managed to convince the world that it was "OK" to have two competing standards, since it's so easy to convert them into each other. The risk is very real: if we don't stop them, Microsoft will muscle its way in, and will force the whole world to fight with conversions for years, or decades, to come.

Microsoft proposed a bogus Office file format while an ISO standard already existed. Their shady practices to get their format fast-tracked and approved by ISO didn't work. But Microsoft is still trying--and I can guarantee, it will keep on trying until it succeeds.

The only possible answer for Microsoft and OOXML is simple: the world already has an office file format. The world neither needs nor wants a "conversion nightmare". The world's ISO-approved Office format already exists: it's called ODF. Microsoft: deal with it!


Comments

Submitted by rasmusp on

I fully agree with your points. Having recently worked in a group where most people used (pirated) Word, I can certainly relate to the issue. Not even older versions of Word can read ooxml. The file size is larger than odf too.

Btw: I really miss the pdf version of the magazine. Did you ever consider bringing it back?

Submitted by tinker on

I am doing my bit to get the great number of M$ hooked users to realise there is an alternative to M$ Office.

If I need to send any documents as attachments to anyone I send odf files as standard, with a link to openoffice.org. So far only one person has mailed back asking if I can send the file as a M$ .doc, though I did get one file returned in the new M$ Office format that is not readable by OpenOffice.

Submitted by utahcb on

Hello:
Read your article on file conversion and agree wholeheartedly. One standard is all that is needed. I do not use Microsoft's OS or office products. I find Open Office quite capable for my use.
The problem is purely greed and nothing else. When the bottom line is nothing but profit and not the interest of the customer, you will never have a product that is competitive and embraces change.
Mr. Gates is a shrewd businessman and knows what will make him money and keep him in control of the market. The problem is that Open Source has not accepted that idea per se. The open-source community has not fully recognized the need to have one standard themselves. When they come to this point, then and only then, will they appeal to the masses.
I am probably ranting in the wrong place, but I do enjoy your magazine very much. Keep up the good work.
I don't know much about html. (Sorry)
Thanks: Utah C. Burger

Submitted by ted61 on

I am a total non-geek who has been using Linux for a few years. I have been using open office and abiword for the past few years with no problems but I have no idea what all the fuss is about. HTML and XML both work for my basic site. To me, anytime something is over-hyped, someone is trying to sell me a bill of goods.

Ryan Cartwright's picture

I have been using open office and abiword for the past few years with no problems

This may be why you have no idea what the fuss is about :o)

HTML and XML both work for my basic site.

I think you've got the wrong end of the stick here. XML is a document standard which pretty much permits anybody to define their own markup tags (schema). As long as you have all your tags properly defined and they are used in the correct manner then it's XML.[1] But you don't have to tell anybody what the schema is.

Tony is using his experiences over converting the FSM website XML to Markdown as an example of what lies in store for people writing converters from one XML schema (ODF) to another (OOXML). The problem is that the creators of OOXML a) are not really telling everyone else all they need to know and b) have a history of changing such things without notice just to keep their market share.

Think of it like this: You define an XML schema which contains tags for bold and italic text (let's say <strong> and <em> respectively). Now I define one but my tags are <bd_0> and <i_0>. The thing is I don't tell you what my tags are and I compress my documents inside an encrypted archive to which only my products have the key.[1] How do you convert your documents to my format or vice-versa? Reverse engineering is against the licence under which I offer my products so you are stuck. Even if after a lot of brute force you get an unencrypted document - what is to stop me "upgrading" my schema and slightly changing the tags you now have?
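Ryan's thought experiment can be sketched in Python. The tag names come from his example; the `convert` function, the `<doc>` wrapper element and the "upgraded" tag `bd_1` are assumptions added for illustration. With a published mapping, the conversion is a trivial rename. Once the other side changes a tag without telling anyone, the converter can only report that it no longer understands the document.

```python
import xml.etree.ElementTree as ET

# The mapping is only available if the schema owner publishes it.
OPEN_TO_CLOSED = {"strong": "bd_0", "em": "i_0"}
CLOSED_TO_OPEN = {closed: opened for opened, closed in OPEN_TO_CLOSED.items()}


def convert(xml_text: str, mapping: dict):
    """Rename known tags; return the new XML and a sorted list of unknown tags."""
    root = ET.fromstring(xml_text)
    unknown = set()
    for elem in root.iter():
        if elem.tag in mapping:
            elem.tag = mapping[elem.tag]
        elif elem is not root:
            unknown.add(elem.tag)  # no idea what this tag means
    return ET.tostring(root, encoding="unicode"), sorted(unknown)


# Known schema: the round trip is easy.
print(convert("<doc><strong>a</strong> and <em>b</em></doc>", OPEN_TO_CLOSED))

# After a silent "upgrade" from bd_0 to bd_1, the converter is stuck.
print(convert("<doc><bd_1>a</bd_1></doc>", CLOSED_TO_OPEN))
```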

The only thing that prevents this kind of behaviour is to have an open standard which is settled upon by many not one and which is publicly available for all to use.

In every argument I have seen for OOXML I am yet to see anything which actually says why it is necessary for Microsoft to have their own format - other than greed.

To me, anytime something is over-hyped, someone is trying to sell me a bill of goods.

Ah, but this is about more than a single bill of goods - this is about tying you down so that every bill of goods you buy must go through one vendor (not a great analogy but I'm just continuing your remarks).

cheers Ryan

[1] Okay big over-simplification but it works - kind of.

Submitted by ted61 on

I missed the point of OOXML. I thought OOXML was designed to be open for all. I figured Microsoft was up to something fishy when they decided to champion an "Open code".

One good thing about me checking in to OOXML is that I am an early adapter. You guys lead the way. I am one of the people willing to try things after people do the hard work of making things user friendly. If I can stick with a distribution, I have no doubt the popularity is ready to take off. I have switched over to Linux for most of my home computers with the others primarily for games.

Maybe Microsoft sees us early adapters making our own web sites from Linux computers using web hosts that run Linux servers. I am using Word Press on one site and PHPBB on another. I have little technical knowledge to build my sites. I have two computers on one desk. One for instructions and one for the operation.

I hope OOXML does not become a standard. It was very frustrating to have to learn to add special code to make my stuff line up in IE7. It still does not line up properly in all of the different Explorer browsers.

Ryan Cartwright's picture

I thought OOXML was designed to be open for all.

Yes it's that word "open" at the start -- but of course OOXML is designed to make money. At least that is the only reason I can think of why MS would spend all that R&D time developing a format when an open standard one exists and is all they needed really.

I hope OOXML does not become a standard. It was very frustrating to have to learn to add special code to make my stuff line up in IE7. It still does not line up properly in all of the different Explorer browsers.

I think you may have indeed missed the point--slightly. Whilst IE7 will display OOXML directly (and not ODF BTW) and I am sure M$ would love it if people wrote web content in it--you should remember that OOXML (Office Open XML) is primarily an office document format and as such has little to do directly with IE.

That said I agree that coding web content to view in the various IE versions is a nightmare. My preferred technique is to produce four CSS files. One for Firefox/Opera/Konqueror/Safari et al, one for IE5.5, one for IE6 and one for IE7. Then I chuck in a script which detects the browser and presents the relevant css links.

cheers

Ryan

Submitted by ovelarsen on

Hi
Good writing - but more and more stories seem to be coming out in the file-format debate saying that the huge effort on the ISO standard is about the future formats in M$ in 2009.

The scheduled 2009 releases of Windows 7 (or whatever it's called) and Office have been in the pipeline for a long time, and the development of the new versions of their OS and Office depends on M$'s version of XML.

That is why they work so hard - and put a HUGE amount of money and people into this - to get their format to become a global ISO standard.

We saw it in the ODF debate - and this time the open source community feels the full force of this multi-billion $ company, with their lawyers and the 'independent research' that comes out.

And M$ just don't get it - they just don't understand the world outside the company - as it was reported a couple of days ago http://www.fsdaily.com/Community/Microsoft_wants_open_sourcers_to_write_an_OOXML_translator

Peace
