OpenOffice.org filters using the XML based file format
Abstract: This document explains the implementation of OpenOffice.org import and export filter components, focusing on filter components based on the OpenOffice.org XML file format. It is intended as a brief introduction for developers who want to implement OpenOffice.org filters for foreign file formats.
There are several ways to get information into or out of OpenOffice.org: You can
link against the application core,
use the OpenOffice.org API,
use the XML file format.
Each of these ways has unique advantages and disadvantages, which I will briefly summarize:
Using the core data structure and linking against the application core is the traditional way to implement filters in OpenOffice.org. The advantages this method offers are efficiency and direct access to the document. However, the core implementation provides a very implementation-centric view of the applications. Additionally, there are a number of technical disadvantages: Every change in the core data structures or objects has to be followed up by corresponding changes in the code that uses them. Hence filters need to be recompiled to match the binary layout of the application core objects. While these things are manageable (albeit cumbersome) for closed source applications, this method is expected to create a maintenance nightmare if application and filter are developed separately, as is customary in open source applications. Simultaneous delivery of a new application build and the corresponding filters developed by outside parties looks challenging.
Using the OpenOffice.org API (based on UNO) is a much better way, since it solves the technical problems indicated in the last paragraph. The UNO component technology insulates the filter from binary layout (and other compiler and version dependent issues). Additionally, the API is expected to be more stable than the core interfaces, and it even provides a shallow level of abstraction from the core applications. In fact, the native XML filter implementations largely make use of this strategy and are based on the OpenOffice.org API.
The third (and possibly surprising) choice is to import and export documents using the XML based file format. UNO-based XML import and export components feature all of the advantages of the previous method, but additionally provide the filter implementer with a clean, structured, and fully documented view of the document. As a significant difficulty in conversion between formats is the conceptual mapping from one format to the other, a clean, well-structured view of the document may turn out to be beneficial.
First, we will try to get an overview of the import and export process using UNO components. Let's first attempt to gain a view of...
An in-memory OpenOffice.org document is represented by its document model. On disk, the same document is represented as a file. An import component must turn the latter into the former, as shown by the diagram (Illustration 1).
If you make use of UNO, this diagram can be turned into programming reality quite easily. The three entities in the diagram (the file, the model, and the filter) all have direct counterparts in UNO services. The services themselves may consist of several interfaces that finally map into C++ or Java classes. The following diagram annotates the entities with their corresponding services and interfaces:
In Illustration 2 (and all following illustrations) the gray part marks the part a filter implementer will have to program, while the white parts are already built into OpenOffice.org.
If the implementer decides to make use of the OpenOffice.org API
directly, this diagram is the proper starting point: The filter
writer must create a class that implements the ImportFilter
service. To achieve this, the
InputStream must be
obtained from the
MediaDescriptor. The incoming data can
then be interpreted, and the OpenOffice.org document can be
constructed by calling the appropriate methods of the document model.
(The available methods of course depend on the kind of document, as
described by the document service.)
If the advantages of an XML based import or export are desired, the filter implementer may make use of the existing XML import and export components. This way, the import logic does not need to deal with the document model itself, but rather generates the document in its OpenOffice.org XML file format representation. Done in a naive way, such a filter component would generate the XML, write it to file, and then call the built-in XML import to read it again. Since the XML import is based on the SAX API however, a better way exists: The import logic calls the SAX API. Since the XML reader component implements the SAX API, the document thus gets translated from the foreign format into its XML representation and then into the document model without the need to use temporary files, or even to render and subsequently parse an XML character stream.
The link between the XML based import filter
and the XML reader is the SAX
interface. Using this model, the filter implementer has to implement
a class that takes a
MediaDescriptor, reads the
corresponding file, and calls the XDocumentHandler
methods to generate the XML representation. Additionally, a filter
component (labelled "Filter Wrapper" in the diagram) needs
to be written that instantiates the XML import component and the
self-written import filter.
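The idea of driving the XML import through SAX calls, without a temporary file, can be illustrated with a minimal sketch. It is not OpenOffice.org code: `DocumentHandler` is a hypothetical, heavily simplified stand-in for the SAX document handler interface (no attribute lists, no character escaping), `RecordingHandler` stands in for the receiving side, and the line-based "foreign format" is invented for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical, simplified stand-in for the SAX document handler interface.
struct DocumentHandler {
    virtual ~DocumentHandler() = default;
    virtual void startDocument() = 0;
    virtual void endDocument() = 0;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Stand-in for the receiving side (the built-in XML import in the text).
// It only records the events as markup so that the result can be inspected.
class RecordingHandler : public DocumentHandler {
public:
    void startDocument() override {}
    void endDocument() override {}
    void startElement(const std::string& name) override { out_ += "<" + name + ">"; }
    void endElement(const std::string& name) override { out_ += "</" + name + ">"; }
    void characters(const std::string& text) override { out_ += text; }
    const std::string& result() const { return out_; }
private:
    std::string out_;
};

// The self-written import logic: reads an invented line-based format and
// emits one paragraph element per input line directly as handler calls.
void importLines(std::istream& in, DocumentHandler& handler) {
    handler.startDocument();
    handler.startElement("office:body");
    std::string line;
    while (std::getline(in, line)) {
        handler.startElement("text:p");
        handler.characters(line);
        handler.endElement("text:p");
    }
    handler.endElement("office:body");
    handler.endDocument();
}

// Convenience helper for trying the sketch out.
std::string importToMarkup(const std::string& foreignData) {
    std::istringstream in(foreignData);
    RecordingHandler handler;
    importLines(in, handler);
    return handler.result();
}
```

The import logic never sees a character stream of XML; it only knows the handler interface, which is why the receiving component can be exchanged freely.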
The export into a foreign format may of course be implemented in
the same fashion. Instead of the ImportFilter
service, the component now implements the ExportFilter
service. An XML-based export filter would implement the document
handler interface itself, and write the resulting document in the
proper format into the location indicated by the MediaDescriptor.
For an XML-based export filter, the schematic looks like this:
How do the built-in XML export or import components cooperate with
the self-programmed filter? As was briefly mentioned above, the
import and export filter services consist of two major interfaces:
XImporter or XExporter for import and export, respectively, and
XFilter for both filter types. The former interface
passes in the actual document to be imported to or exported from,
while the XFilter interface triggers the filtering
process and passes in the MediaDescriptor that
describes the source or target document.
In the case of an XML-based filter, this functionality gets
distributed to two components. For the import, the built-in XML
import component implements the
XImporter interface as well as
XDocumentHandler. The XML-based filter
component should implement the
XFilter interface, and
additionally provide a way to set an XDocumentHandler.
The filter wrapper then needs to instantiate both components and
connect them by setting the built-in XML import as the document
handler of the XML-based filter. The wrapper can then delegate the
XImporter calls to the XML import and the XFilter
calls to the XML-based filter, thereby implementing the import filter
service.
The export case is slightly more complicated. The additional
problem is that the filter() call of the XFilter
interface provides the MediaDescriptor and
simultaneously controls the filter process. However, in the desired
setup for an XML-based export filter, the built-in XML export
controls the filtering process, but the XML-based filter handles the
file output, and hence needs the MediaDescriptor.
Therefore the filter wrapper has to operate as follows: First it has
to instantiate the XML-based export filter. This filter has to
implement the XDocumentHandler interface. Then it has to
instantiate the XML export, which at instantiation time expects the
document handler as a parameter. The filter wrapper delegates calls
to both the
XFilter and the XExporter
interface to the XML export. For calls to the filter method of
XFilter, it additionally has to pass the
MediaDescriptor on to the XML-based export filter.
The means by which this should happen is left to the implementer.
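The order of operations in the export wrapper may be easier to follow in a small model. The sketch below uses only hypothetical stand-in classes, not the UNO interfaces: the media descriptor is reduced to a string map with an assumed "URL" entry, the source document is a single string, and `setMediaDescriptor` is just one possible way to hand the descriptor to the export filter.

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins; none of these are real OpenOffice.org classes.
using MediaDescriptor = std::map<std::string, std::string>;

struct DocumentHandler {
    virtual ~DocumentHandler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// The self-written XML-based export filter: receives SAX-style events and
// produces the foreign format (here: plain text collected in memory).
class PlainTextExportFilter : public DocumentHandler {
public:
    void setMediaDescriptor(const MediaDescriptor& d) { target_ = d.at("URL"); }
    void startElement(const std::string&) override {}
    void endElement(const std::string& name) override {
        if (name == "text:p") text_ += "\n";   // one line per paragraph
    }
    void characters(const std::string& text) override { text_ += text; }
    const std::string& target() const { return target_; }
    const std::string& text() const { return text_; }
private:
    std::string target_;
    std::string text_;
};

// Stand-in for the built-in XML export: it receives its handler at
// construction time and controls the filtering process.
class BuiltInXmlExport {
public:
    explicit BuiltInXmlExport(DocumentHandler& handler) : handler_(handler) {}
    void setSourceDocument(const std::string& paragraph) { source_ = paragraph; }
    bool filter(const MediaDescriptor&) {
        handler_.startElement("text:p");
        handler_.characters(source_);
        handler_.endElement("text:p");
        return true;
    }
private:
    DocumentHandler& handler_;
    std::string source_;
};

// The filter wrapper: creates the export filter first, then the XML export
// with the filter as its handler, and delegates all calls.
class ExportFilterWrapper {
public:
    ExportFilterWrapper() : exporter_(filter_) {}
    void setSourceDocument(const std::string& doc) { exporter_.setSourceDocument(doc); }
    bool filter(const MediaDescriptor& d) {
        filter_.setMediaDescriptor(d);   // the export filter handles the output
        return exporter_.filter(d);      // the XML export drives the process
    }
    const PlainTextExportFilter& exportFilter() const { return filter_; }
private:
    PlainTextExportFilter filter_;   // declared first, so it exists before exporter_
    BuiltInXmlExport exporter_;
};
```

The member order in the wrapper mirrors the instantiation order described above: the export filter has to exist before the XML export that writes to it.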
We should now have a closer look at the involved services:
The ImportFilter service
describes a generic import filter. The core of the service is
provided by the interfaces XImporter and XFilter.
XImporter supplies the filter object with
the target document (in form of an XComponent), while
XFilter is used to actually start the filtering process,
taking the MediaDescriptor for the source file as a
parameter. Additionally, the
XInitialization interface serves to pass
parameters to the filter at initialization time, while the
XPropertySet can be used to get information from and
about the filter component. It is generally read-only.
The twin of the
ImportFilter is the service ExportFilter.
The main interfaces are XExporter and XFilter.
XExporter supplies the filter with the source
document, whereas the
XFilter starts the filter process. The
MediaDescriptor that gets passed into the
XFilter describes the output file. The XInitialization and
XPropertySet interfaces serve the same purposes as in the ImportFilter.
The MediaDescriptor
finally collects all information about a source or target file to be
imported from or exported to. It contains meta information (such as
the file name), as well as an
InputStream which can be
used to actually manipulate the file. Caveat: Objects
obtained from the
MediaDescriptor may not be referenced
or otherwise held beyond the filter()
method call. Doing so (e.g., keeping a reference to the InputStream
obtained from the
MediaDescriptor) prevents the
InputStream from being closed.
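The caveat about not holding on to objects from the MediaDescriptor can be sketched with ordinary C++ reference counting. This is only an analogy under stated assumptions: the `MediaDescriptor` struct and the use of `std::shared_ptr` are hypothetical stand-ins for the UNO descriptor and UNO references. The filter copies the data it needs and lets go of the stream before returning.

```cpp
#include <istream>
#include <iterator>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical stand-in for a media descriptor: a file name plus a
// reference-counted input stream.
struct MediaDescriptor {
    std::string fileName;
    std::shared_ptr<std::istream> inputStream;
};

class WellBehavedFilter {
public:
    bool filter(const MediaDescriptor& descriptor) {
        // Local reference only; it is released when filter() returns.
        std::shared_ptr<std::istream> in = descriptor.inputStream;
        if (!in)
            return false;
        content_.assign(std::istreambuf_iterator<char>(*in),
                        std::istreambuf_iterator<char>());
        return true;
    }
    const std::string& content() const { return content_; }
private:
    std::string content_;   // the data is kept, the stream is not
};
```

After `filter()` returns, the caller again holds the only reference to the stream and is free to close it. A filter that stored the `shared_ptr` in a member would keep the stream alive for as long as the filter object exists.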
The document model cannot be described by a single service, as it
obviously has to vary greatly, depending on the type of document (e.g.,
text or spreadsheet.) An example for a document model service is the
service. What is important in this context is that all document model
services support the XComponent interface.
The XFilter
interface features only two methods: filter() and
cancel(). The former starts the filtering process
for the given
MediaDescriptor, while the latter cancels
an ongoing filter process.
XFilter must be implemented
for both import and export filters.
The XImporter interface
is used for setting up an import before the filter()
method from the
XFilter interface is called. The
XImporter supplies the filter with its (empty) target
document, i.e., the document whose content is about to be read from
the source file.
The XDocumentHandler interface
is the core interface for handling XML data in OpenOffice.org. It is
part of the SAX interface. It has methods for all parts of XML
documents, like start or end of elements or runs of characters. The
XDocumentHandler interface is used for both incoming
and outgoing XML data, thus allowing chaining of components handling
XML. A component that processes XML data should implement the
XDocumentHandler interface. A component that will
generate XML data should call the methods of an XDocumentHandler
to output the events. The XExtendedDocumentHandler,
being derived from XDocumentHandler, is
an extended version that can also handle comments. If the extended
functionality is desired, the
XDocumentHandler should be
queried for the XExtendedDocumentHandler interface at
runtime. However, implementers should make sure their components
never rely on the presence of the extended document handler,
but rather make sure they could also work with the plain
XDocumentHandler. Since all vital parts of XML can be
handled by the plain XDocumentHandler, this should not pose
much of a problem.
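The advice to query for the extended handler at runtime, while still working with the plain one, might look as follows in a simplified model. The two interfaces are hypothetical reductions of the SAX interfaces, and `dynamic_cast` merely plays the role that a UNO interface query would play in real filter code.

```cpp
#include <string>

// Hypothetical, reduced versions of the plain and the extended handler.
struct DocumentHandler {
    virtual ~DocumentHandler() = default;
    virtual void characters(const std::string& text) = 0;
};

struct ExtendedDocumentHandler : DocumentHandler {
    virtual void comment(const std::string& text) = 0;
};

// Emits a comment only if the handler supports it; the essential output
// goes through the plain interface in either case.
void writeGreeting(DocumentHandler& handler) {
    if (auto* extended = dynamic_cast<ExtendedDocumentHandler*>(&handler))
        extended->comment("generated by example filter");
    handler.characters("Hello World!");
}

// A handler that only supports the plain interface.
class PlainHandler : public DocumentHandler {
public:
    void characters(const std::string& text) override { out += text; }
    std::string out;
};

// A handler that additionally supports comments.
class CommentingHandler : public ExtendedDocumentHandler {
public:
    void characters(const std::string& text) override { out += text; }
    void comment(const std::string& text) override { out += "<!--" + text + "-->"; }
    std::string out;
};
```

Both handlers receive the text; only the second one also receives the comment, and the generating code does not fail when the extension is missing.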
is the parent interface for all document models. Actual documents
derive from this model to provide model specific functionality, such
A filter will have to query at runtime whether it can handle the
Initialization of components can be supported through the
XInitialization interface.
Properties of the filters can be queried using the XPropertySet
interface. The names of the supported properties are part of the
service description. In general, XPropertySet
implementations support both reading and writing, but the intended
use for filter components is to be read-only.
All of OpenOffice.org's applications have built-in XML import and export components. The component names are summarized in the following table:
Additionally, the XML reader and writer components should be
mentioned, even though they have not been discussed in the previous
chapters. These two components implement the XML reader (or parser)
and writer (or unparser) components used by OpenOffice.org for
reading and writing all its XML files. They implement (XML writer) or use (XML
reader) the XDocumentHandler interface. In some sense
they could be considered XML-based filters, since they read or write
character streams and turn them into SAX function calls. Their names
There is a final, crucial step that will not be covered here: Registering a filter with the application. The registration process will make sure that the application knows the filter, and also knows which files the filter can be applied to. The filter registration is described in the filter documentation of the OpenOffice.org Framework project, which is listed among the resources at the end of this document.
This chapter is intended to give brief code examples for the crucial steps in creating XML-based import or export filters. We'll start with the filter wrapper, followed by short examples for importing into and exporting from the XML filters.
The filter wrapper needs to instantiate the built-in XML import or export components. The following code snippet will demonstrate this for an XML-based export filter.
using namespace ::com::sun::star;

// Instantiate the XML export filter component. We need:
// 1) a service factory,
// 2) a document handler,
// 3) a string with the service name.

// Obtain the service factory
uno::Reference< lang::XMultiServiceFactory > xServiceFactory = /* obtained from the calling context */;

// Obtain (or create) the XML-based output filter. It has to implement
// the XDocumentHandler interface, so the export component can write to it.
uno::Reference< xml::sax::XDocumentHandler > xHandler = <your filter>;

// Prepare arguments passed to the XML export filter:
// The XML-based filter in form of an XDocumentHandler.
// Arguments are passed by a sequence of Any.
// Our sequence will contain only 1 element.
uno::Sequence< uno::Any > aArgs( 1 );
aArgs[0] <<= xHandler;

// Instantiate the exporter from the factory.
// The service name below assumes the Writer export component, by analogy
// with the import example further down; use the name for your application.
::rtl::OUString sService =
    ::rtl::OUString::createFromAscii( "com.sun.star.comp.Writer.XMLExporter" );
uno::Reference< document::XExporter > xExporter(
    xServiceFactory->createInstanceWithArguments( sService, aArgs ),
    uno::UNO_QUERY );
ASSERT( xExporter.is(), "can't instantiate XML exporter" );

// Now we have the two components in xHandler and xExporter and can start
// calling the XFilter and XExporter methods. Note that the xHandler needs
// to be informed about its MediaDescriptor.
The following code snippet could be located in a filter wrapper for an XML-based export filter. Its two methods implement the gist of such a wrapper. They are really simple because the filter wrapper doesn't do much of its own. It only delegates to its two components.
using namespace ::com::sun::star;

void SAL_CALL <filter wrapper>::setSourceDocument(
    const uno::Reference<lang::XComponent>& xComponent )
{
    // delegate to XExporter of the built-in XML export
    xExporter->setSourceDocument( xComponent );
}

sal_Bool SAL_CALL <filter wrapper>::filter(
    const uno::Sequence<beans::PropertyValue>& aDescriptor )
{
    // set MediaDescriptor at XML-based export filter
    // (the means by which this happens is left to the implementer)

    // get access to XFilter interface of XML export and delegate to it
    uno::Reference<document::XFilter> xFilter(xExporter, uno::UNO_QUERY);
    return xFilter->filter( aDescriptor );
}
The next example should detail how an import filter would
communicate with the XML import component. Basically, it only needs
to call the
XDocumentHandler methods. The following code
implements the notorious "Hello World!" program as an
OpenOffice.org import filter.
using namespace ::com::sun::star;

// instantiate the XML import component
::rtl::OUString sService =
    ::rtl::OUString::createFromAscii("com.sun.star.comp.Writer.XMLImporter");
uno::Reference<xml::sax::XDocumentHandler> xHandler(
    xServiceFactory->createInstance(sService), uno::UNO_QUERY );
ASSERT( xHandler.is(), "can't instantiate XML import" );

// OK. Now we have the import. Let's make a real simple document.
// a few comments:
// 1. We will use string constants from xmloff/xmlkywd.hxx
// 2. For convenience, we'll use a globally shared attribute list from the
//    xmloff project (xmloff/attrlist.hxx)
// 3. In a real project, we would pre-construct our OUString, rather than use
//    the slow createFromAscii( ) method every time.

// We will write the following document: (the unavoidable 'Hello World!')
// <office:document
//     office:class="text"
//     xmlns:office="http://openoffice.org/2000/office"
//     xmlns:text="http://openoffice.org/2000/text" >
//   <office:body>
//     <text:p>Hello World!</text:p>
//   </office:body>
// </office:document>

SvXMLAttributeList aAttrList;
xHandler->startDocument();

// our first element: first build up the attribute list, then start the element
// DON'T FORGET TO ADD THE NAMESPACES!
aAttrList.AddAttribute(
    ::rtl::OUString::createFromAscii("xmlns:office"),
    ::rtl::OUString::createFromAscii("CDATA"),
    ::rtl::OUString::createFromAscii("http://openoffice.org/2000/office") );
aAttrList.AddAttribute(
    ::rtl::OUString::createFromAscii("xmlns:text"),
    ::rtl::OUString::createFromAscii("CDATA"),
    ::rtl::OUString::createFromAscii("http://openoffice.org/2000/text") );
aAttrList.AddAttribute(
    ::rtl::OUString::createFromAscii("office:class"),
    ::rtl::OUString::createFromAscii("CDATA"),
    ::rtl::OUString::createFromAscii("text") );
xHandler->startElement(
    ::rtl::OUString::createFromAscii("office:document"), aAttrList );

// body element (no attributes)
aAttrList.clear();
xHandler->startElement(
    ::rtl::OUString::createFromAscii("office:body"), aAttrList );

// paragraph element (no attributes)
aAttrList.clear();
xHandler->startElement(
    ::rtl::OUString::createFromAscii("text:p"), aAttrList );

// write text
xHandler->characters(
    ::rtl::OUString::createFromAscii("Hello World!") );

// close paragraph
xHandler->endElement(
    ::rtl::OUString::createFromAscii("text:p") );

// close body
xHandler->endElement(
    ::rtl::OUString::createFromAscii("office:body") );

// close document element
xHandler->endElement(
    ::rtl::OUString::createFromAscii("office:document") );

// close document
xHandler->endDocument();
This chapter briefly mentions a few other uses of XML-based filter components that provide additional value and versatility.
In some circumstances, it may be desirable to have standalone
format conversion tools. This would, for example, enable batch
conversion of legacy documents. The XML-based filter components allow
us to do that with little extra effort. Let us recall that an
XML-based import filter uses OpenOffice.org's built-in XML import to
generate the document. It calls the (generic) XDocumentHandler
interface after it has been supplied with the XDocumentHandler
implementation by the filter wrapper. Now if the filter wrapper
instead supplies the XML-based import filter with the XML writer
component (which implements the XDocumentHandler
interface as well), then the XML writer component will output the XML
as a character stream to disk. Thus we have created the desired
standalone conversion utility by only implementing a new filter
wrapper.
A different possible use is the chaining of
XML-based filters. Suppose the foreign file format in question is
also based on XML. Now it doesn't make sense to re-implement the XML
parser inside that component, so it seems natural to use the existing
parser (or unparser) component. This way, our import (or export)
filter would have to implement the XDocumentHandler
interface for its input, and also use an XDocumentHandler
interface for its output. The resulting implementation is sketched in
Illustration 6. Note that such XML to XML filters could be chained.
Note that, if the other application is also an open source application, it could use UNO component technology as well, and thus use the very same filter components for its own import and export. A filter converting from the foreign XML into OpenOffice.org XML would be an import filter for OpenOffice.org, and simultaneously an export filter for the other application.
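The chaining of XML to XML filters described above can be sketched as a handler that implements the handler interface for its input and calls another handler for its output. Everything in this example is hypothetical: the reduced `DocumentHandler` interface, the `RenamingFilter` that merely maps element names, and the `MarkupSink` at the end of the chain. A real filter would of course perform a more substantial transformation.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical, reduced stand-in for the SAX document handler interface.
struct DocumentHandler {
    virtual ~DocumentHandler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// An XML to XML filter: a handler on its input side, a handler client on
// its output side. Here it only translates element names.
class RenamingFilter : public DocumentHandler {
public:
    RenamingFilter(std::map<std::string, std::string> names, DocumentHandler& next)
        : names_(std::move(names)), next_(next) {}
    void startElement(const std::string& name) override { next_.startElement(translate(name)); }
    void endElement(const std::string& name) override { next_.endElement(translate(name)); }
    void characters(const std::string& text) override { next_.characters(text); }
private:
    // Unknown element names are passed through unchanged.
    std::string translate(const std::string& name) const {
        auto it = names_.find(name);
        return it == names_.end() ? name : it->second;
    }
    std::map<std::string, std::string> names_;
    DocumentHandler& next_;
};

// End of the chain: records the events as markup for inspection.
class MarkupSink : public DocumentHandler {
public:
    void startElement(const std::string& name) override { out_ += "<" + name + ">"; }
    void endElement(const std::string& name) override { out_ += "</" + name + ">"; }
    void characters(const std::string& text) override { out_ += text; }
    const std::string& result() const { return out_; }
private:
    std::string out_;
};
```

Because every stage only depends on the handler interface, two such filters can be placed one after the other without either of them knowing about the other.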
As OpenOffice.org is being developed further, it becomes likely that changes to the file format will eventually have to be made. It is of course mandatory to supply users with the ability to read and write the old formats. This could indeed be handled by an XML to XML transformation, with one format being the old OpenOffice.org XML format, and the other being the new format.
Note that such a filter could also be used by users of the older versions to read and write documents in the new format! Additionally, it could be chained between other XML-based import or export filters, allowing users to utilize import and export filters for versions other than their own. Essentially, this would achieve a decoupling of application, filter, and file format version. The opportunities this opens up are quite amazing: If a new file format is implemented, users would not be forced to upgrade their application to make use of the new filter. Also, users of newer application versions could still use filters developed for the older format.
The following resources may provide additional information:
Tutorial "How to register a component": http://framework.openoffice.org/documentation/filters/index.html
XML file format specification: http://xml.openoffice.org/xml_specification.pdf
Tutorial "A Component in C++": http://udk.openoffice.org/cpp/man/component_tutorial.html
OpenOffice.org API Reference: http://api.openoffice.org/common/ref/
OpenOffice.org XML Homepage: http://xml.openoffice.org
OpenOffice.org Framework Homepage: http://framework.openoffice.org
OpenOffice.org API Homepage: http://api.openoffice.org/
OpenOffice.org UNO Homepage: http://udk.openoffice.org/