XML Packages

Purpose of this document

At the end of September, I (dvo) asked the openoffice.org community for help on the problem of embedding binary content in XML documents. This page is intended to summarize the discussion and provide a solid foundation that will guide our implementation efforts.

We will first restate the problem. Then follows a summary of the requirements that have been stated in the discuss and xml-dev mailing lists. The candidate solutions mentioned on the mailing lists are then examined in light of these requirements. Finally, a conclusion is presented.

A few definitions

To clarify the following discussion, a definition of a few terms for use within this document may be in order:

document: A document comprises everything a user considers to be part of their text, spreadsheet or presentation. In general this will include several images or OLE objects in addition to the main content. If images are linked into the document, they are not considered part of it. Whether an image is linked or embedded into the document is determined by the user. Due to limitations in the current OLE handling, OLE objects must always be embedded.
file: A file is a sequence of bytes as stored in the operating system's file system. This fairly obvious definition serves to distinguish between the logical view (the document) and the physical view (the file).
binary large object, BLOB: A BLOB contains binary data that has no structured content that could reasonably be encoded as XML data. Usually, BLOBs will be data in established binary formats such as GIF, JPEG, or OLE data streams.
subdocument: A subdocument refers to an individual component of the document. Each binary large object (BLOB) is one subdocument, as is the main document content encoded as XML.
main document content: The main document content refers to the subdocument that best represents the document as a whole. For example, if an OpenOffice Writer document consists of an XML subdocument containing the text body and several image subdocuments, then the XML subdocument would form the main document content.

The Problem

OpenOffice and the next generation of StarOffice by default use an XML based file format (but the user may choose to save in a different format). Therefore this project differs from many others, as this is an XML based file format intended to be truly a format for the masses. The requirements for a native file format are different (usually more demanding) than those for an interchange format.

XML is meant for structured content and has no native support for binary objects (BLOBs) such as images, OLE objects or other media types. Since embedding of binary objects is needed for office documents, a way to handle XML and binary data in the same document must be provided.

One solution may be to simply store the XML data in a file and link to image data in separate files, much like in HTML documents. While linking files at the user's request should be possible, this is in general not considered adequate for office documents. A document should be stored in a single file to make external handling of the document (e.g. copying the document) possible. Therefore, binary content must be stored in the same file as the XML content, requiring either a package format or a means to encode binary content in XML. For the sake of simplicity, we will refer to all solutions that store XML and BLOB data in a single file as packages.

The Requirements

The following requirements have been identified for the package format:

A. Efficient Operation

Users have traditionally required efficient operation, especially for basic functionality such as loading or saving of documents. It is our experience that independent loading (or saving) of subdocuments is the key to efficient handling of large documents. Additionally, users require that disk space is used efficiently. This leads us to require the document format to support:

small file sizes
on-demand loading (i.e., independent loading of subdocuments)
independent saving of subdocuments

These requirements are strongly supported by ten years of experience at StarOffice. Failing to meet these requirements has repeatedly resulted in dissatisfaction and significant protests from the user base.

Note on independent saving of subdocuments: Previous versions of StarOffice have gotten significant benefit from copying subdocuments on the file system level rather than reading and then writing subdocuments through the UCB. This should be supported by the package format. Saving modified subdocuments into an existing (compound) document may provide further benefits.

B. Compatibility with Existing Tools

A primary motivation for creating an XML based file format is the ability to process, create and manipulate StarOffice documents with external tools. To deserve the label "open" document format, these tools should be standard and widely available. As far as possible, this should apply to both the document and the individual subdocuments (if applicable). In particular, the main document content (using XML) should be accessible to standard XML tools.

This leads us to state three subrequirements:

1. Subdocuments should be usable with standard tools. In particular, XML subdocuments should be usable with standard XML tools, e.g. XSLT transformations.
2. The document should be accessible with standard tools, making it possible to insert, modify and extract subdocuments, e.g. similar to zip or other popular archivers.
3. The document should be accessible with ASCII-based tools, e.g. the Windows file find function.

Note: During discussion on the openoffice.org mailing lists, several people preferred requirement number 1 to be extended to the complete document. This would essentially require the document format to be XML as well, with subdocuments (including BLOBs) being embedded within the XML structure.

Note: Since XML is based on Unicode, ASCII based tools cannot generally be assumed to work on XML files. Thus, no solution will be able to fully support the third requirement. If UTF-8 encoding is used, ASCII-based tools will work at least for those languages that can be properly represented in ASCII, like English or Latin.

Note: Compression interferes with some of the requirements above. The discussion on the mailing list has clearly proven that those requirements are very important to some users. To support these users,

compression should be made optional, and
an additional implementation that writes pure XML (and saves binary data into separate files) should be created.

Formats supporting compression on a per subdocument basis would have an additional advantage in this respect. Also, this would allow a subdocument with e.g. meta information (title, keywords, etc.) to be stored uncompressed, making it more easily accessible.
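Per-subdocument compression can be illustrated with a minimal sketch that uses Python's standard zipfile module. The member names content.xml and meta.xml and the sample data are assumptions made for this illustration, not names fixed by this document.

```python
import io
import zipfile


def build_package(content_xml: bytes, meta_xml: bytes) -> bytes:
    """Write a two-member ZIP: compressed content, uncompressed metadata."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        # Main content: deflate-compressed to keep the file small.
        pkg.writestr("content.xml", content_xml,
                     compress_type=zipfile.ZIP_DEFLATED)
        # Metadata: stored verbatim so byte-oriented search tools can see it.
        pkg.writestr("meta.xml", meta_xml,
                     compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


package = build_package(
    b"<body>" + b"<p>lorem ipsum</p>" * 200 + b"</body>",
    b"<meta><title>Quarterly Report</title></meta>",
)

# A plain byte search over the package file, as an ASCII-based tool would do.
title_visible = b"Quarterly Report" in package   # stored member
body_visible = b"lorem ipsum" in package         # deflated member
```

In this sketch the title stored in the uncompressed member can be found by a plain byte search over the package, while the text of the compressed member cannot.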

C. Security

An additional advantage that may be gained through the use of a package is easy support for document security. We can distinguish between two security considerations:

privacy: be able to encrypt (partial) documents
integrity: be able to verify origin of (partial) documents

Note: These requirements are of fairly low priority.

D. Additional Issues

A package structure supporting documents and subdocuments makes it easy to add additional information to a document, even if the reference implementation does not understand it. For example, in situations where a certain transformation is used very often, the transformation result may be stored in the package as well.
The current implementation is based on the SAX API. XML filters (transformations) also using the SAX API can be pipelined, e.g. they can import (or export) data into (or from) OpenOffice by going directly through the API, rather than going through a file.
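The pipelining idea can be sketched with the SAX filter classes in Python's standard library. This is only an illustration of chaining SAX event handlers; the UppercaseText filter is hypothetical, and the sketch does not show the OpenOffice API.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class UppercaseText(XMLFilterBase):
    """Hypothetical filter: rewrites character data, passes all else through."""

    def characters(self, content):
        super().characters(content.upper())


def run_pipeline(xml_bytes: bytes) -> str:
    """Parse -> filter -> serialize, with no intermediate file."""
    out = io.StringIO()
    parser = xml.sax.make_parser()
    # The filter sits between the parser and the final consumer of events.
    pipeline = UppercaseText(parser)
    pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    pipeline.parse(io.BytesIO(xml_bytes))
    return out.getvalue()


result = run_pipeline(b"<p>hello</p>")
```

Further filters could be stacked in the same way, each taking the previous one as its parent, so that events flow from the parser to the consumer without being written to disk in between.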

The Contenders

On the openoffice.org mailing lists, the following possible solutions were suggested:

ZIP or JAR files
XML with binary data being ASCII-encoded within special tags (e.g. base64)
MIME files
.tgz files
BONOBO libefs

Note: The participants quickly focused on two choices: JAR and XML

Note: ZIP and JAR files are identical for most intents and purposes. JAR differs from the older ZIP in the file ending, as well as in additional meta information stored in a directory called META-INF. JAR files can be accessed using unmodified ZIP tools.

Examination of the contenders

This chapter gives an overview of how well the various contenders met the specified requirements. It was my (dvo) impression that most people would agree on the suitability of the various contenders for the various requirements. The disagreement was mainly between the importance of the requirements.

In particular, one group valued accessibility with XML tools (B1) very highly and consequently argued for XML with base64 encoding, while most others preferred ZIP/JAR.

ZIP / JAR

Tools to create and manipulate ZIP files are widely available on all platforms. The manifest file used with JAR files is usually considered optional and may even be created using a text editor. Access to subdocuments requires unzipping the required subdocuments first. ASCII based tools will in general not work on the package, although (at the user's request) individual subdocuments could be stored uncompressed, thus making them available to ASCII based tools.

Since ZIP files have an index, efficient on-demand loading of subdocuments can be achieved. Subdocuments can be copied from one package to another without uncompressing and recompressing them first. On-demand saving is possible with some hackery: only the newly written subdocument and the files that follow it in the package need to be written. Due to (optional) compression, files can be small, although not quite as small as .tgz files.
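The role of the ZIP index in on-demand loading can be sketched with Python's standard zipfile module. The member names and contents below are assumptions chosen for illustration.

```python
import io
import zipfile


def make_package() -> bytes:
    """Build a small in-memory package with one XML and one binary member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("content.xml", "<body><p>Hello</p></body>")
        pkg.writestr("Pictures/logo.gif", b"GIF89a" + bytes(1000))
    return buf.getvalue()


def load_subdocument(package: bytes, name: str) -> bytes:
    """Read one member; the index tells the reader where it starts."""
    with zipfile.ZipFile(io.BytesIO(package)) as pkg:
        info = pkg.getinfo(name)   # lookup in the central directory
        return pkg.read(info)      # seeks to the member, skips the others


package = make_package()

# The index itself: member name -> (offset of its header, compressed size).
with zipfile.ZipFile(io.BytesIO(package)) as pkg:
    index = {i.filename: (i.header_offset, i.compress_size)
             for i in pkg.infolist()}

content = load_subdocument(package, "content.xml")
```

Because each member is located through the index and compressed on its own, a reader can fetch content.xml without touching the image data.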

Note: As requested on the mailing list, the JAR manifest file should be replaced by an XML-based manifest file. The older JAR manifest may be supported for compatibility reasons. This is to be determined.

	Efficiency			Standard Tools
	size	load	save	XML	document	ASCII
ZIP / JAR	+	+	0	0	++	0

XML + base64

This suggestion is to use plain XML documents and embed binary content as base64 encoded ASCII in special elements.

Tools to generate or manipulate subdocuments in this format are not commonly available, but could be written with reasonable effort. Usability of the main document content with XML tools is, of course, excellent. ASCII based tools should work with no problems (except for the already mentioned UTF encodings).

Without an index, neither on-demand loading nor on-demand saving is possible. Files are not compressed, and binary subdocuments are even expanded by 33% due to the base64 encoding.
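The 33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A minimal sketch in Python, where the element name binary-data is a hypothetical placeholder rather than part of any proposed format:

```python
import base64


def embed_blob(blob: bytes) -> str:
    """Wrap a BLOB as base64 text in a (hypothetical) XML element."""
    encoded = base64.b64encode(blob).decode("ascii")
    return "<binary-data>" + encoded + "</binary-data>"


blob = bytes(3000)                        # 3000 bytes of binary data
encoded_length = len(base64.b64encode(blob))
expansion = encoded_length / len(blob)    # 4 output chars per 3 input bytes
element = embed_blob(blob)
```

Line breaks inserted into the encoded text, as some encoders do, would add slightly to this overhead.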

Note: On the discussion lists, several remedies to allow on-demand reading were suggested: either including indices in the XML file, or placing the binary data at the end of the file. The former was considered to be non-XML (as it relies on the physical layout of the XML data) and useless, since standard XML tools would not update the index and thus OpenOffice couldn't rely on it. The latter cannot be done with standard (e.g. SAX-based) parsers such as those used in the current implementation. The SAX API is necessary for the filter pipelining mentioned in the requirements section, so forgoing SAX may have significant drawbacks.

	Efficiency			Standard Tools
	size	load	save	XML	document	ASCII
XML + base64	--	-	-	++	0	+

MIME

MIME is the established packaging format for emails. As is required for SMTP compatibility, it is ASCII-based (7bit ASCII). Non-Mailer tools to manipulate MIME files are rare.

Being an ASCII based format, accessibility to ASCII based tools is excellent. However, the encoding of non-ASCII characters is solved differently than in XML. Tools to manipulate MIME files (outside of mail programs) are not readily available, but could be written with reasonable effort. Access with XML tools requires unpacking first, but as said before the tools to do so are not currently available.

MIME has no index of subdocuments, making on-demand loading difficult. For saving of individual subdocuments, a reasoning similar to ZIP files applies. However, copying subdocuments requires accessing them, which is slow on MIME due to the lack of an index. Since no compression is used, files get fairly large. In addition, binary documents are usually encoded in base64, enlarging them by 33%.
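The lack of an index can be illustrated with Python's standard email package, used here only as a convenient MIME implementation. The part names and the choice of image/gif are assumptions for the sketch.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_mime_package(content_xml: str, images: dict) -> bytes:
    """Pack XML content plus binary parts into one multipart MIME file."""
    msg = EmailMessage()
    msg.set_content(content_xml, subtype="xml")
    for name, data in images.items():
        # Binary parts are base64-encoded by default.
        msg.add_attachment(data, maintype="image", subtype="gif",
                           filename=name)
    return msg.as_bytes()


def find_subdocument(package: bytes, name: str) -> bytes:
    """Locate a part by name; the whole file is parsed to find it."""
    parsed = message_from_bytes(package, policy=policy.default)
    for part in parsed.iter_parts():      # sequential scan, no index
        if part.get_filename() == name:
            return part.get_content()
    raise KeyError(name)


logo = b"GIF89a" + bytes(594)             # 600 bytes of sample data
package = build_mime_package("<body><p>Hello</p></body>",
                             {"logo.gif": logo})
recovered = find_subdocument(package, "logo.gif")
```

Finding one part means walking the parts in order from the start of the file, and the package is larger than the sum of its binary parts because of the base64 encoding.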

	Efficiency			Standard Tools
	size	load	save	XML	document	ASCII
MIME	--	-	-	-	0	+

.tgz

The .tgz format (tar files compressed with gzip) is the most popular archiving format on the UNIX platform. Tools to manipulate .tgz files exist on most other platforms as well. As participants on the mailing lists mentioned, it is used as package format for KOffice.

.tgz operates according to the compress-after-packaging principle. In general, this results in very good compression ratios. However, the same principle causes on-demand loading of subdocuments to be severely inefficient: Not only is it impossible to quickly determine the start position of subdocuments, it is also impossible to start reading subdocuments once the start position is known because the state of the uncompression tables cannot easily be reconstructed. For the same reason, changing a subdocument or even copying subdocuments is inefficient.
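The compression advantage of compress-after-packaging can be sketched by packing the same subdocuments both ways with Python's standard tarfile and zipfile modules. The sample data is deliberately extreme (identical, otherwise incompressible subdocuments) so that the effect is obvious; real documents would show a smaller gap.

```python
import io
import random
import tarfile
import zipfile


def sample_subdocuments(count: int = 20, size: int = 2000) -> dict:
    """Identical pseudo-random payloads: no redundancy within a file,
    full redundancy across files."""
    rng = random.Random(0)
    payload = bytes(rng.getrandbits(8) for _ in range(size))
    return {"part%02d.bin" % i: payload for i in range(count)}


def tgz_size(docs: dict) -> int:
    """Package first, then compress the whole stream with gzip."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in docs.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


def zip_size(docs: dict) -> int:
    """Compress each subdocument on its own, then package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        for name, data in docs.items():
            pkg.writestr(name, data)
    return len(buf.getvalue())


docs = sample_subdocuments()
sizes = {"tgz": tgz_size(docs), "zip": zip_size(docs)}
```

The single gzip stream can exploit redundancy across subdocuments, which per-member compression cannot. The price is the one described above: a reader has to decompress from the start of the stream to reach any given subdocument.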

Availability of tools for manipulating .tgz files is good, and even very good on UNIX platforms. Accessibility to subdocuments requires unpacking the subdocuments first, which is slow because the document must be uncompressed from the start.

	Efficiency			Standard Tools
	size	load	save	XML	document	ASCII
.tgz	++	--	--	0	+	0

Comparison Table

	Efficiency			Standard Tools
	size	load	save	XML	document	ASCII [6]
ZIP / JAR	+	+	0 [1]	0 [2]	++	0 [2]
XML + base64	--	- [3]	-	++	0 [4]	+
MIME	--	-	- [1]	- [2]	0 [4]	+
.tgz	++	--	--	0 [2]	+ [5]	0 [2]

++ = very good, + = good, 0 = medium, - = bad, -- = very bad

Remarks:

[1] Copying subdocuments is possible. With some hackery, (repeated) partial updates can be supported.
[2] Requires unpackaging (or package-aware tools), so the score depends on the availability of tools and their efficiency.
[3] On-demand loading may be implemented at the expense of being unable to use standard XML parser APIs (e.g. SAX).
[4] Tools can be written with reasonable effort but don't currently exist, at least not on many platforms.
[5] Very good on Unix, good on other platforms.
[6] Due to XML being Unicode-based, ASCII-based tools probably don't work in many circumstances, independent of the document format.

libefs

Additionally, the BONOBO libefs was suggested. Since this was not much discussed in the mailing lists, I (dvo) tried to find out about libefs from the web. This was my impression:

libefs was created to solve a problem very similar to ours. It is designed to be an ambitious file-system-in-a-file solution. If it works out the way its developers imagine, it will be technically superior to the other proposals discussed here.
The page didn't mention any ports, thus making it a UNIX solution only. This prohibits its use in a cross-platform application like OpenOffice.
No external tools to manipulate libefs files exist.
The page I consulted referred to it as "very much a work-in-progress", meaning that currently it is not suitable for a consumer product.
The libefs documentation was a bit unclear on the relationship between libefs API and libefs implementation (meaning: file format(s)). If it is mainly an API, it would compete with the current UCB architecture.

Security

Documents could of course be encrypted for all candidate formats. All formats could encrypt subdocuments as well, except for XML+base64, which would make it impossible to encrypt the main document content without encrypting the binaries as well. In .tgz, subdocument encryption could adversely affect compression, since the entire file is compressed at once.

The same applies to file integrity verification. JAR is very appealing in this regard, as it already has the full infrastructure for subdocument and document integrity verification in place.
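Per-subdocument integrity checking can be sketched as a table of digests, one per package member. This is a simplified stand-in written with Python's standard library and is not the actual JAR signing mechanism; the function names and the choice of SHA-256 are assumptions.

```python
import base64
import hashlib
import io
import zipfile


def make_package(members: dict) -> bytes:
    """Build an in-memory ZIP package from name -> bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        for name, data in members.items():
            pkg.writestr(name, data)
    return buf.getvalue()


def digest_table(package: bytes) -> dict:
    """Compute one base64-encoded SHA-256 digest per subdocument."""
    with zipfile.ZipFile(io.BytesIO(package)) as pkg:
        return {
            name: base64.b64encode(
                hashlib.sha256(pkg.read(name)).digest()).decode("ascii")
            for name in pkg.namelist()
        }


def changed_members(package: bytes, expected: dict) -> list:
    """Return the names whose current digest differs from the expected one."""
    actual = digest_table(package)
    return sorted(n for n in expected if actual.get(n) != expected[n])


original = make_package({"content.xml": b"<body>v1</body>",
                         "Pictures/logo.gif": b"GIF89a"})
expected = digest_table(original)
tampered = make_package({"content.xml": b"<body>v2</body>",
                         "Pictures/logo.gif": b"GIF89a"})
report = changed_members(tampered, expected)
```

A real scheme would additionally sign the digest table so that its origin can be verified; that step is omitted here.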

The Conclusion

The JAR file format seems to provide the best balance among the stated requirements. Unlike all other formats, it does not have any real weakness in regard to any of the requirements, and quite a lot of advantages.

If one values accessibility with standard XML tools very highly, XML + base64 seems a logical choice. While this is one of the requirements for OpenOffice, the xmloff development team does not share the opinion that this single requirement significantly exceeds all other requirements in importance. To accommodate this user base, an additional pure XML file format with external subdocuments should be created.

Remaining Issues

Decide on meta information and manifest file format.
Implement a UCB that allows pure XML documents.