Data Sources Character Sets

Functional Specification

Abstract

OpenOffice.org, itself Unicode-enabled, can access non-Unicode (8-bit) databases. When transferring string data over connections to such databases, OOo must therefore convert the data into Unicode. The user can specify which character set to use for this conversion.
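
The following minimal sketch illustrates why the character set choice matters. It is not OOo code; it uses Python's built-in codecs, and the names `raw`, `as_cp1252` and `as_cp437` are made up for this illustration. The same byte read from an 8-bit database yields different Unicode characters depending on the character set used for the conversion.

```python
# One byte as it might be stored in an 8-bit database column.
raw = b"\xe4"

# Interpreted as windows-1252 (Python codec name "cp1252"), 0xE4 is U+00E4.
as_cp1252 = raw.decode("cp1252")

# Interpreted as IBMPC 437 (Python codec name "cp437"), 0xE4 is U+03A3.
as_cp437 = raw.decode("cp437")

print(repr(as_cp1252), repr(as_cp437))
```

If the wrong character set is configured for a data source, the conversion still succeeds but produces the wrong characters, which is why the setting is exposed to the user.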

Functional Description

Character sets are specified per data source. This means that in the data source administration dialog, there is an option where the user chooses a character set to use for every connection created for a data source.
The Character Sets setting is available for the following data source types: Adabas, ODBC, dBase, Text, and MySQL (when adapted via ODBC, see the MySQL spec).
In general (with one exception, see below), StarOffice can support only character sets that are part of the respective IANA standard. The reason is that character sets need to be transported via UNO, and instead of defining our own standard for naming them, we decided to use the most comprehensive standard available: IANA.
OpenOffice.org versions up to 1.0.x supported only a very limited set of character sets, namely windows-1252, macintosh, IBMPC 437, 850, 860, 861, 863, 865 and 866, UTF-8, and Big5-HKSCS.

Since OpenOffice.org 1.1, this list has been extended. For compatibility reasons, the encodings above form the minimum set of required encodings.

Nowadays, OOo data sources support every encoding that is known to OOo in general and that has a valid IANA name. This list is too long to reproduce here in full, and it may be extended in the future without further notice.

The display names of the character sets are the same names that are used elsewhere in the application (for instance "Tools/Options/Load/Save/HTML compatibility/Character Set").
There is one "virtual" character set named "System". Choosing it simply means that the current system character set is used, so the user does not need to make an explicit setting. This is the default when creating new data sources.
For Text and dBase data sources, all text encodings that do not have a constant character size are forbidden. For instance, UTF-8 uses a different number of bytes to encode different characters; thus UTF-8 and all character sets with the same characteristic are not allowed for dBase and Text.
Consider a character set that is not available in the current environment or for the current data source type. First, this means that the list box for selecting the character set does not display it. If, however, the user has changed the character set of a data source by means other than our UI, we fall back to the "System" encoding; that is, "System" is displayed instead of the invalid encoding.
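
The constant-character-size rule for dBase and Text can be illustrated with a small heuristic. This is only a sketch under stated assumptions, not the check OOo performs: the function name `has_constant_char_size` and the sample string are hypothetical, and the test simply encodes a handful of characters with Python's built-in codecs and compares the resulting byte lengths.

```python
# Sample characters from several scripts; characters that an encoding cannot
# represent are skipped. (Hypothetical sample chosen for this illustration.)
SAMPLE = "A\u00e9\u0416\u4e2d\u20ac"  # A, e-acute, Cyrillic Zhe, CJK "middle", euro sign


def has_constant_char_size(encoding: str, sample: str = SAMPLE) -> bool:
    """Heuristically report whether every encodable sample character
    occupies the same number of bytes in the given encoding."""
    sizes = set()
    for ch in sample:
        try:
            sizes.add(len(ch.encode(encoding)))
        except UnicodeEncodeError:
            # The encoding has no representation for this character.
            continue
    return len(sizes) == 1


for name in ("cp1252", "cp866", "utf-8", "big5hkscs"):
    verdict = "allowed" if has_constant_char_size(name) else "forbidden"
    print(f"{name}: {verdict} for dBase/Text under this heuristic")
```

Because the result depends on the sample, a real implementation would rely on known properties of each encoding rather than on probing; the sketch only shows the distinction the rule draws between single-byte sets such as windows-1252 and multi-byte sets such as UTF-8 or Big5-HKSCS.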
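
The fallback behaviour described above can be sketched as follows. This is an illustration only, not the actual OOo implementation: the function names `displayed_charset` and `effective_encoding`, the `available` set, and the use of the Python locale module to stand in for the system character set are all assumptions.

```python
import locale

SYSTEM = "System"  # the "virtual" character set


def displayed_charset(stored: str, available: set) -> str:
    """Return the name to show in the UI: the stored name if it is
    selectable for this data source, otherwise the "System" entry."""
    return stored if stored in available else SYSTEM


def effective_encoding(displayed: str) -> str:
    """Map the displayed name to the encoding actually used for conversion."""
    if displayed == SYSTEM:
        # Assumption: the locale's preferred encoding stands in for the
        # current system character set.
        return locale.getpreferredencoding(False)
    return displayed


available = {"windows-1252", "UTF-8"}
print(displayed_charset("windows-1252", available))  # valid, kept as is
print(displayed_charset("x-bogus", available))       # invalid, falls back
```

The invalid name is never shown to the user; the dialog presents "System" and conversions use whatever the system character set currently is.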