The StarOffice XML based file format
To boldly go where no office suite has gone before
StarOffice 6.0 and OpenOffice.org use a new, XML based file format for all their documents. The constituent parts of an office document (content, layout and meta information) are stored as XML streams inside a ZIP file, along with the embedded graphics and objects contained in the document.
Before musing on the technical virtues and uses of the XML based file format, it should be noted that the format as well as the OpenOffice.org application, which serves as a reference implementation for the format, are available under a GNU license. This open and free licensing guarantees that you are not at the mercy of a single company for improvements and fixes of the format or its supporting software, thus providing very strong protection for all investments and efforts you put into this format. Additionally, Sun aims at standardizing the format, thus providing any interested parties with a way to participate in the evolution of the format.
The next chapter introduces several features of the XML format. The chapter after that shows how various types of users benefit from these features. A conclusion closes the paper.
2 How It Works: The Means
This chapter will highlight several technicalities of the XML based file format. The following chapter will then show how to put these into use.
2.1 Separation of Content, Layout, and Meta Information
An office document contains content, for example the text of a letter or the data in a spreadsheet, along with layout information, which describes how the content should look. Also part of the document is meta information, such as who edited the document and what it is called, and additional information such as images or embedded objects. To a user, these are inseparable parts of a single document. For processing the document, however, it makes sense to separate them so that they can be read, interpreted and modified independently of each other. To facilitate this, the StarOffice XML file format stores content, layout, meta information, images and embedded objects in separate streams of a ZIP based package file. The whole file contains the whole document; the individual streams contain the constituent parts of the document.
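The package idea can be sketched with nothing more than a standard ZIP library. The following Python sketch builds a small package in memory and reads one stream without touching the others. The stream names and the XML inside them are assumptions made for illustration; consult the format specification for the names actually used.

```python
import io
import zipfile

# Hypothetical stream names and contents, chosen for illustration only.
SAMPLE_STREAMS = {
    "content.xml": "<document-content>Dear Sir or Madam</document-content>",
    "styles.xml": "<document-styles/>",
    "meta.xml": "<document-meta><title>Letter</title></document-meta>",
}


def build_sample_package() -> bytes:
    """Build a small ZIP package in memory, one stream per document part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, xml in SAMPLE_STREAMS.items():
            package.writestr(name, xml)
    return buffer.getvalue()


def read_stream(package_bytes: bytes, name: str) -> str:
    """Read a single stream; the other streams are never decompressed."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(name).decode("utf-8")


package_bytes = build_sample_package()
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    stream_names = sorted(package.namelist())
meta_xml = read_stream(package_bytes, "meta.xml")
print(stream_names)
print(meta_xml)
```

A tool that only cares about meta information can stop after `read_stream(package_bytes, "meta.xml")` and never has to parse content or layout.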
When creating the XML based file format, we tried to gain as much as possible from related standards. The very use of XML is one example. The ZIP format we use for packages is in widespread use. Many elements and attributes are borrowed from HTML, XSL-FO, XLink, Dublin Core, or SVG. For Math, we use MathML. This allows easy transformation from and into those formats, and it allows people who are already familiar with those formats to quickly understand ours.
2.3 Uniform Representation of Formatting and Layout Information
The StarOffice and OpenOffice.org applications distinguish between formatting through styles and direct formatting, which means applying formatting directly to text or cell ranges. In the XML format, these different ways to format a document use the same style-based representation. Direct formatting is automatically converted into automatic styles, which are the style-based equivalent of the direct formatting the user applied to the document.
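What the uniform representation buys a processing tool can be shown in a few lines of Python. In the sketch below, the element and attribute names are simplified placeholders, not the names defined by the format specification. The point is only that one lookup resolves the formatting of a paragraph, whether the user applied a named style or formatted the text directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: "Heading" is a style the user chose,
# "P1" is an automatic style generated from direct formatting.
DOCUMENT = """<document>
  <styles><style name="Heading" font-size="16pt"/></styles>
  <automatic-styles><style name="P1" font-weight="bold"/></automatic-styles>
  <body><p style-name="Heading">Title</p><p style-name="P1">Important</p></body>
</document>"""


def formatting_by_paragraph(document_xml: str) -> list:
    """Return (text, formatting) pairs using one lookup for both style kinds."""
    root = ET.fromstring(document_xml)
    styles = {}
    for style in root.iter("style"):
        properties = dict(style.attrib)
        styles[properties.pop("name")] = properties
    return [
        (paragraph.text, styles.get(paragraph.get("style-name"), {}))
        for paragraph in root.iter("p")
    ]


paragraph_formatting = formatting_by_paragraph(DOCUMENT)
print(paragraph_formatting)
```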
A primary design goal of the XML format was to represent all structured information contained in a document as XML structures, thus making the document fully accessible to standard XML tools. This is quite different from, for example, an XHTML/CSS solution, where all CSS formatting information is encoded in a text-only format. There, all layout information appears as a single string to an XSLT processor, making it very hard to process the layout information in any way.
Our XML file format is a properly designed file format, as opposed to a mere XML dump of core structures with all their implementation details and limitations. The documents are represented in a way which is easy to understand and use, and not in a way which is easy to implement. This allows the format to abstract application peculiarities and therefore the format may also be used by applications other than StarOffice and OpenOffice.org. Throughout the development of the format, great care was taken to make sure the file format is easy to process. Also, the idealized representation makes it easier to improve the StarOffice and OpenOffice.org applications without having to make major changes to the format itself.
2.6 Common Format Across All Office Applications
The same format is used across all office applications. Similar concepts in the different applications always use the same XML representation. For example, spreadsheet tables and word processor tables share a common XML representation, even though their implementations and limitations are quite different. This has great advantages when processing or generating StarOffice files: The same code works for all applications. For example, a single XSLT style sheet can process both spreadsheets and text documents.
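As a sketch of the "same code works for all applications" claim, the Python function below extracts table rows from a spreadsheet and from a text document with identical logic. The markup is a simplified stand-in with hypothetical element names; the real format uses namespaced elements defined in the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: element names are placeholders.
SPREADSHEET = (
    "<document><table><row><cell>Q1</cell><cell>100</cell></row></table></document>"
)
TEXT_DOCUMENT = (
    "<document><p>Results</p>"
    "<table><row><cell>Name</cell><cell>Ann</cell></row></table></document>"
)


def table_rows(document_xml: str) -> list:
    """Return every table row as a list of cell texts, for any document type."""
    root = ET.fromstring(document_xml)
    return [
        [cell.text or "" for cell in row.iter("cell")]
        for row in root.iter("row")
    ]


spreadsheet_rows = table_rows(SPREADSHEET)
text_rows = table_rows(TEXT_DOCUMENT)
print(spreadsheet_rows)
print(text_rows)
```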
2.7 Open For Extensions and Supplemental Information
Arbitrary XML attributes may be attached to style information and will be preserved when editing such a modified file with StarOffice. Because all formatting information is uniformly represented as styles, and because any document content can be formatted using styles, this allows arbitrary information to be attached to any part of the document content. Further means to add supplemental information, such as allowing complete streams to be part of the packages, may be added.
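A tool might attach its own data to a style roughly as follows. This is a minimal Python sketch: the namespace URI, the attribute name and the simplified style markup are all assumptions made for illustration. Placing the attribute in its own XML namespace is one way to keep it from colliding with attributes the format itself defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and markup, for illustration only.
CUSTOM_NS = "http://example.com/ns/workflow"
ET.register_namespace("wf", CUSTOM_NS)

STYLES = '<styles><style name="P1"/><style name="P2"/></styles>'


def tag_style(styles_xml: str, style_name: str, key: str, value: str) -> str:
    """Attach a custom, namespaced attribute to the style with the given name."""
    root = ET.fromstring(styles_xml)
    for style in root.iter("style"):
        if style.get("name") == style_name:
            style.set(f"{{{CUSTOM_NS}}}{key}", value)
    return ET.tostring(root, encoding="unicode")


tagged_styles = tag_style(STYLES, "P1", "review-status", "approved")
print(tagged_styles)
```

Any content formatted with style `P1` now carries the supplemental information indirectly, through its style reference.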
3 What You Gain: The Ends
The previous chapter has highlighted several features and mechanisms of our new file format. This chapter will look at how this helps different groups of users.
To the user of StarOffice and OpenOffice.org, the main benefits may be summarized as increased robustness, openness, document longevity, and version interoperability. In addition, the user will gain further benefits as the document processing and additional solutions described in the following chapters become reality.
To a user, their own documents are usually valuable resources, into which lots of time and effort have been invested. What happens if, through an error in the hardware or software used, the documents become corrupted? With a binary format, the user is at the mercy of the original application: If it can still read or recover the document, all is well. If not, the document is lost.
XML makes it easy to ignore and tolerate problems in the documents, so the likelihood of lost documents is reduced. Additionally, the human readable/editable nature of XML allows advanced users or service personnel to inspect corrupt files and restore the documents, even without specialized tools or very much in-depth knowledge.
For some users, long-term storage of documents is important. With binary file formats, documents can be read only as long as the supporting application and the system it runs on exist. XML, being a text based and human readable format, allows files to be read even if the original application (or the OS, or the hardware it ran on) is not available anymore. Additionally, the thorough documentation of the format allows the files to be fully interpreted.
A well-known problem with office documents is file format versioning: New versions of the office suite usually come with a new version of the file format, which the older versions don't know about. To be able to read the newer documents, users find themselves forced to upgrade their application.
Contrast this with an XML based format: XML is extensible, making it easy to add new features to the format without losing the ability to read older files. Older applications will simply ignore the new (and, to them, unknown) content, thus reading the newer files as well as they can. The result is a high degree of forwards and backwards compatibility.
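The "ignore what you do not know" behaviour can be sketched in Python. The element names below are hypothetical, and `hologram` stands in for a feature added by some later version of the format. An older reader keeps everything it understands and skips the rest instead of rejecting the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names known to an "older" reader.
KNOWN_ELEMENTS = {"h", "p"}

NEWER_DOCUMENT = (
    "<body><h>Title</h><p>Old feature</p>"
    "<hologram depth='3'>New feature</hologram><p>More text</p></body>"
)


def read_known_content(document_xml: str) -> tuple:
    """Return the texts of known elements and the tags that were skipped."""
    root = ET.fromstring(document_xml)
    texts, skipped = [], []
    for element in root:
        if element.tag in KNOWN_ELEMENTS:
            texts.append(element.text or "")
        else:
            skipped.append(element.tag)  # unknown to this reader: tolerate it
    return texts, skipped


texts, skipped = read_known_content(NEWER_DOCUMENT)
print(texts)
print(skipped)
```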
Documented and Transparent File Content
With the XML file format, the user can finally inspect the content of files that are being sent or received. If yet another macro virus threatens your organization, a simple combination of unzip and grep allows you to check for suspicious content. If you want to make sure that files you send to other people don't contain sensitive information, you can now simply look at them. Or if you need to quickly find a certain document, just use Unix grep or the Windows Explorer context menu to search through the meta information, which is stored as plain text.
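The combination of unzip and grep can also be scripted. The Python sketch below searches every stream of a package for a regular expression and reports the stream name and line number of each match. The sample package, its stream names and its contents are made up for illustration.

```python
import io
import re
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny package in memory; stream names are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "meta.xml",
            "<meta>\n<title>Budget 2002</title>\n<creator>J. Doe</creator>\n</meta>",
        )
        package.writestr(
            "content.xml",
            "<content>\n<p>Confidential: salary table</p>\n</content>",
        )
    return buffer.getvalue()


def grep_package(package_bytes: bytes, pattern: str) -> list:
    """Return (stream name, line number) for every line matching the pattern."""
    regex = re.compile(pattern)
    hits = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            text = package.read(name).decode("utf-8", errors="replace")
            for line_number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((name, line_number))
    return hits


sample = build_sample_package()
sensitive_hits = grep_package(sample, "Confidential")
creator_hits = grep_package(sample, "creator")
print(sensitive_hits)
print(creator_hits)
```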
3.2 Document Processing and Developers
Advanced users and developers may want to make use of the new freedom the StarOffice XML file format brings them, and use and process StarOffice files with other tools and applications. There are several advantages for them:
Standards Based and Openness
The StarOffice XML file format relies on many standards in addition to the actual XML standard itself: It makes use of elements and attributes from HTML, XLink, XSL-FO, Dublin Core, and SVG. Developers familiar with these can easily pick up on the StarOffice format. Also, a developer has a wide choice of tools and code libraries for many programming languages that allow processing and manipulation of XML or ZIP files.
This use of established standards is particularly useful for making office documents available outside of traditional PC applications. For example, by transforming our office documents into HTML, you can make them available through the World Wide Web. Similar transformations into WAP or XSL-FO are possible.
An example of transforming our documents into HTML is available on the OpenOffice.org website, and another one is available through the xml.com website (see reference at the end of this paper).
Easy Import and Export of Other File Formats
Import and export of other 'foreign' file formats can be accomplished by converting the document into the XML file format. This approach has several advantages:
The XML file format provides the developer with a clean, documented target.
Due to XML's human readability, debugging becomes much easier.
The file format and the StarOffice API hide the details of a particular office version, so the developer doesn't have to recompile and update the import/export component for every new version of StarOffice.
Several XML based import or export components may be chained to each other. This can be used to convert between two non-StarOffice formats.
Import and Export components can be integrated into StarOffice and OpenOffice.org, or they can be used stand-alone. In the latter mode, they could be used e.g. for batch conversion of many files. Also, they can be used to view StarOffice XML documents without having to start the full StarOffice application.
Leverage Available Infrastructures
Being based on XML and ZIP, the StarOffice file format can be used with the growing number of widely available tools that can process these formats. Examples are:
XML viewers and editors
Any of the available XML viewers can be used to examine the document content. XML Editors can be used to manually make changes to the document content or its layout.
XML transformation tools and libraries, such as XSLT engines or XPathScript (Perl), can be used to automatically edit, modify or generate StarOffice documents.
There is a growing number of XML aware database and storage products. These may be used to store, index, query and manipulate StarOffice documents.
With the package mechanism using the well-known ZIP format, standard ZIP tools may be used to change the package content. For example, using any ZIP tool, embedded graphics can be changed from low resolution to high resolution ones before giving a document to a print shop.
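The graphics swap mentioned above can be sketched with Python's standard zipfile module. Since a ZIP member cannot be overwritten in place, the sketch copies the package and substitutes one member on the way. The member path `Pictures/logo.png` and the byte strings standing in for image data are hypothetical.

```python
import io
import zipfile


def replace_member(package_bytes: bytes, member: str, new_data: bytes) -> bytes:
    """Return a copy of the package in which one member has new contents."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        if member not in source.namelist():
            raise KeyError(f"{member} is not part of the package")
        for info in source.infolist():
            if info.filename == member:
                data = new_data
            else:
                data = source.read(info.filename)
            target.writestr(info.filename, data)
    return output.getvalue()


# Build a tiny sample package; names and contents are placeholders.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _package:
    _package.writestr("content.xml", "<content/>")
    _package.writestr("Pictures/logo.png", b"low-resolution bytes")

print_ready = replace_member(
    _buffer.getvalue(), "Pictures/logo.png", b"high-resolution bytes"
)
with zipfile.ZipFile(io.BytesIO(print_ready)) as _result:
    new_graphic = _result.read("Pictures/logo.png")
    untouched_content = _result.read("content.xml")
print(new_graphic)
```

The content stream is copied byte for byte, so the document text and layout are unaffected by the swap.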
A generic office suite may not be the right solution for everyone. Often, significant improvements to productivity can be achieved by using custom software solutions tailored exactly to the requirements of a particular organization. StarOffice and OpenOffice.org can become part of such a solution, supplying office functionality as part of a larger, fully integrated package. Solution providers who want to integrate StarOffice or OpenOffice.org into their software will find the XML file format along with the open StarOffice API to be the enabling features for this.
StarOffice as Editor Component
The StarOffice API enables the use of StarOffice as an editor component. In this mode of operation, StarOffice may appear as an edit area within the custom application, controlled through the API. The custom application only needs to represent its own data in the StarOffice XML file format and hand it to the StarOffice component. When the editing is done, the custom application can then convert the XML data stream back into its own, native format for storage or further processing.
Search Engines / Knowledge Management Systems
The use of XML makes office documents accessible to search engines and more advanced knowledge management systems. Since the full document structure is available as XML, a knowledge management system could easily extract or weight document content based on how or where it is contained in the document.
Search engines can usually be configured for different file types. To index and search StarOffice XML files, all that is necessary is to teach the search engine to run the venerable 'unzip' command on each file before processing it.
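For an indexer written in a scripting language, the unzip step and the text extraction fit in one small function. The Python sketch below assumes the document text lives in a stream named `content.xml` and uses simplified, hypothetical markup; it returns the plain text with all tags removed, ready to be handed to an indexer.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(package_bytes: bytes, stream: str = "content.xml") -> str:
    """Unzip one stream and return its text content with whitespace normalized."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(stream))
    # itertext() yields the text of every element in document order.
    return " ".join(" ".join(root.itertext()).split())


# Build a tiny sample package; stream name and markup are placeholders.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _package:
    _package.writestr(
        "content.xml",
        "<content><h>Annual Report</h>"
        "<p>Revenue grew <b>strongly</b> this year</p></content>",
    )

indexed_text = extract_text(_buffer.getvalue())
print(indexed_text)
```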
StarOffice and OpenOffice.org are ideally suited for integration into document management systems. Here, a key feature is the ability to attach additional data to documents or parts of documents. A document management system can use this to include additional information into the documents while still keeping the documents fully editable by the user. As detailed above, StarOffice and OpenOffice.org may additionally be used as editing components inside the document management system.
Partial Editing and Two-way Conversion
Custom applications may want to modify or update office documents based on specific data or computations. XML makes it easy to identify specific parts of a document and to replace them. The separation of content and layout further helps this because it allows changing one without having to change the other.
A similar approach is to extract data from a StarOffice document and process it in some way. Then, merge the new data into the document again, or just recreate the document based on the new data. Such back-and-forth conversions are simplified by XML's nature, and the content/layout separation.
This technique could be used to support StarOffice documents (or parts of documents) on resource constrained devices, such as PDAs.
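The partial editing described in this section might look like the following Python sketch. The markup, the `field` element and its `name` attribute are hypothetical placeholders. The custom application extracts the values it cares about, replaces one of them, and leaves every style reference exactly as it was.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified content stream.
CONTENT = (
    '<content><p style-name="Heading">Invoice</p>'
    '<p style-name="Body">Total: <field name="total">0.00</field></p></content>'
)


def read_fields(content_xml: str) -> dict:
    """Extract all named fields from the content as a dictionary."""
    root = ET.fromstring(content_xml)
    return {field.get("name"): field.text for field in root.iter("field")}


def update_field(content_xml: str, field_name: str, new_value: str) -> str:
    """Replace the text of one named field; layout references stay untouched."""
    root = ET.fromstring(content_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            field.text = new_value
    return ET.tostring(root, encoding="unicode")


updated_content = update_field(CONTENT, "total", "129.50")
print(read_fields(CONTENT))
print(read_fields(updated_content))
```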
StarOffice as Layout Engine
At the end of the data processing chain, a custom application may need to present data to users or generate printable documents, summaries and reports. When using StarOffice, this can be achieved by converting the presentation data into the StarOffice XML file format, and loading it into StarOffice for layout and printing. This way, the full power of StarOffice can be leveraged for professional-looking documents without having to recreate the entire formatting and layout logic. Once again, the separation of content and layout helps, as it allows painless generation of the plain content data, which can then be combined with professional, artistic layout information.
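Generating the plain content data is the easy half of this scenario, as the Python sketch below suggests. The element names and the style names `ReportTitle` and `ReportTable` are hypothetical. The generated content only refers to styles by name; what those styles look like would be defined separately by whoever designs the layout.

```python
import xml.etree.ElementTree as ET


def build_report(title: str, rows: list) -> str:
    """Generate plain content XML that refers to layout only by style name."""
    root = ET.Element("content")
    heading = ET.SubElement(root, "h", {"style-name": "ReportTitle"})
    heading.text = title
    table = ET.SubElement(root, "table", {"style-name": "ReportTable"})
    for row in rows:
        row_element = ET.SubElement(table, "row")
        for value in row:
            ET.SubElement(row_element, "cell").text = str(value)
    return ET.tostring(root, encoding="unicode")


report_xml = build_report("Sales by Region", [["North", 1200], ["South", 950]])
print(report_xml)
```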
The upcoming StarOffice 6.0 and OpenOffice.org both feature a new XML based file format, which stores document content, layout, and meta information as XML inside of a ZIP package, along with embedded graphics and objects. This organization as well as many details in the XML format itself provide many advantages to various groups of users, creating a win-win situation for all end users, developers and solution providers that make use of it.
5 Online Resources and Further Information
OpenOffice.org XML Homepage: http://xml.openoffice.org/
StarOffice/OpenOffice.org XML based file format definition: http://xml.openoffice.org/xml_specification.pdf
The StarOffice/OpenOffice.org API: http://api.openoffice.org/
Adventures with OpenOffice and XML by Matt Sergeant: http://www.xml.com/pub/a/2001/02/07/openoffice.html
OpenOffice.org Filter-Development Using XML: http://xml.openoffice.org/filter/