How to add a new locale to the i18n framework

Overview

The i18n framework offers full-featured i18n functionality that covers a wide range of geographies. Besides Western and Eastern European locales and their derivatives, it supports East Asia (CJK), South Asia and South-East Asia (Indian, Thai), and West Asia and the Middle East (Arabic, Hebrew), including the so-called CTL (Complex Text Layout) and BiDi (bidirectional) script types. The framework is built on the UNO component model, which makes adding new i18n components easy.

The following language- and locale-specific attributes are supported:

i18n Attribute Name	Feature/Consumer	Location in Source
Locale Data	Provide all locale sensitive data, like date/time/number/currency format, calendar information etc.	i18npool/source/localedata/data
Character Classification	Provide API to implement features such as switching case, capitalization, punctuation and so on.	i18npool/source/characterclassification
Calendar	Provide the ability to support a variety of calendaring systems	i18npool/source/calendar
Break Iterator	Provide language/script specific Cursor placement, Word, Line, and Sentence breaking	i18npool/source/breakiterator
Collator	Provide the ability to perform sorting and indexing according to local conventions	i18npool/source/collator
Transliteration	Provide string conversion used in searching, input and other features, with further applications for Indian languages	i18npool/source/transliteration
Index entry	Support indexing feature	i18npool/source/indexentry
Search & Replace	Support the Find/Change feature	i18npool/source/search

Locale Data

For most locales this is the only thing you need to implement. Follow the instructions outlined in the excerpt from the Developers Guide I18n chapter.

Depending on the locale, you may also need to implement some of the following components. Refer to the Developers Guide as well, since the information there may be more up to date; this document has not yet been synchronized with it.

CharacterClassification

The component provides toUpper()/toLower()/toTitle() and returns the various character attributes defined by Unicode. These functions are implemented by the class cclass_unicode. If you have language-specific requirements for these functions, you can derive a language-specific class cclass_<locale_name> from cclass_unicode and override the corresponding methods. In most cases these attributes are well defined by Unicode, so you do not need to create your own class.

The class also provides a number parser. If a particular language needs its own number parsing, derive a class and override the method cclass_unicode::parsePredefinedToken(). Number parsing is typically needed to accept dates and other calendar information.

A manager class 'CharacterClassificationImpl' will handle the loading of language specific implementation of CharacterClassification on the fly. If no implementation is provided the implementation defaults to class 'cclass_unicode'.
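
The derive-and-override pattern described above can be sketched as follows. This is only an illustration under stated assumptions: CClassUnicodeSketch and CClassTurkishSketch are hypothetical stand-ins, not the real cclass_unicode API, and the base class handles ASCII letters only. The Turkish rule (lower-case i maps to U+0130) is used as an example of a language-specific requirement.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for cclass_unicode: locale-independent upper-casing.
// Only ASCII letters are handled here to keep the sketch short.
class CClassUnicodeSketch {
public:
    virtual ~CClassUnicodeSketch() = default;

    // Per-character rule that language-specific subclasses may override.
    virtual char32_t toUpperChar(char32_t c) const {
        return (c >= U'a' && c <= U'z') ? static_cast<char32_t>(c - 0x20) : c;
    }

    // Shared driver: applies the per-character rule to the whole string.
    std::u32string toUpper(const std::u32string& text) const {
        std::u32string out;
        out.reserve(text.size());
        for (char32_t c : text)
            out.push_back(toUpperChar(c));
        assert(out.size() == text.size());  // one-to-one mapping keeps the length
        return out;
    }
};

// Hypothetical stand-in for a cclass_<locale_name> class, here for Turkish.
class CClassTurkishSketch : public CClassUnicodeSketch {
public:
    char32_t toUpperChar(char32_t c) const override {
        if (c == U'i')
            return U'\u0130';  // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return CClassUnicodeSketch::toUpperChar(c);  // defer to the default rule
    }
};
```

Code that holds a reference to the base class picks up the Turkish rule automatically when it is handed a CClassTurkishSketch, which is roughly the role the manager class plays when it selects an implementation.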

Calendar

The component provides a calendar service. All calendar implementations are managed by the front-end class 'CalendarImpl', which dynamically calls a locale-specific implementation.

Calendar_gregorian is a wrapper to the ICU Calendar class.

If you need to implement a locale-specific calendar, you can either derive your class from Calendar_gregorian or write one from scratch.

There are three steps to create a locale-specific calendar:

1. Name your calendar <name> (for example, 'gengou' for the Japanese calendar) and add it to the localedata XML file with the proper day, month and era names.
2. Derive a class from either Calendar_gregorian or XCalendar and name it Calendar_<name>. CalendarImpl loads it when the calendar is specified.
3. Add your new calendar as a service in i18npool/source/registerservices/registerservices.cxx.

If you plan to derive from the Gregorian calendar, you need to know the mapping between your new calendar and the Gregorian calendar. For example, the Japanese Emperor Era calendar has a starting year offset to Gregorian calendar for each era. You will need to override the method Calendar_gregorian::convertValue to map the Era/Year/Month/Day from the Gregorian calendar to the calendar for your language.
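
As a rough illustration of the era offset mapping, the sketch below converts a Gregorian year to a Japanese era name and era year. It is not the real Calendar_gregorian::convertValue: the names EraStart, EraYear and gregorianYearToEra are hypothetical, and the conversion works at year granularity only, whereas a real implementation would also compare month and day against the exact start date of each era.

```cpp
#include <cassert>
#include <string>

// First Gregorian year of each modern era (year granularity only).
struct EraStart {
    const char* name;
    int firstGregorianYear;
};

const EraStart kEraStarts[] = {
    {"Meiji", 1868}, {"Taisho", 1912}, {"Showa", 1926},
    {"Heisei", 1989}, {"Reiwa", 2019},
};

struct EraYear {
    std::string era;
    int year;  // 1-based year within the era
};

// Picks the latest era that has started and applies its year offset.
EraYear gregorianYearToEra(int gregorianYear) {
    assert(gregorianYear >= kEraStarts[0].firstGregorianYear);
    const EraStart* current = &kEraStarts[0];
    for (const EraStart& era : kEraStarts) {
        if (era.firstGregorianYear <= gregorianYear)
            current = &era;
    }
    return EraYear{current->name, gregorianYear - current->firstGregorianYear + 1};
}
```

For example, gregorianYearToEra(2000) yields Heisei 12, because 2000 - 1989 + 1 = 12.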

BreakIterator

This component provides character (cell), word, sentence and line-break services to its users, i.e. the BreakIterator component provides the APIs to iterate over a string by character, word, line and sentence. The output layer uses the interface of this component for the following operations:

- Cursor positioning and selection: since a character or cell can take more than one code point, the cursor cannot be moved by simply incrementing or decrementing the index.
- Complex Text Layout languages: in CTL languages (such as Thai, Hebrew, Arabic and Indian languages), multiple characters may combine to form a display cell. Cursor movement must traverse a whole display cell instead of a single character.

Line breaking must be highly configurable in desktop publishing applications. The line breaking algorithm should be able to find a line break with or without a hyphenator. Additionally, it should be able to parse special characters that are illegal if they occur at the end or beginning of a line.

Both of the above are locale-sensitive.

The BreakIterator components are managed by the class BreakIteratorImpl, which dynamically loads the language-specific component under the service name BreakIterator_<language>.
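
The name-based loading can be pictured with the small registry sketch below. It is an illustration only and does not use the UNO service manager: ServiceRegistrySketch, BreakIteratorSketch and BreakIteratorThSketch are hypothetical names, and falling back to a generic Unicode implementation when no language-specific service is registered is an assumption of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the generic implementation.
struct BreakIteratorSketch {
    virtual ~BreakIteratorSketch() = default;
    virtual std::string name() const { return "BreakIterator_Unicode"; }
};

// Hypothetical language-specific implementation.
struct BreakIteratorThSketch : BreakIteratorSketch {
    std::string name() const override { return "BreakIterator_th"; }
};

// Maps service names to factories, playing the role of service registration.
class ServiceRegistrySketch {
public:
    using Factory = std::function<std::unique_ptr<BreakIteratorSketch>()>;

    void registerService(const std::string& serviceName, Factory factory) {
        assert(factory != nullptr);
        factories_[serviceName] = std::move(factory);
    }

    // Builds the service name from the language and instantiates it.
    // Assumption of this sketch: unknown languages get the generic class.
    std::unique_ptr<BreakIteratorSketch> create(const std::string& language) const {
        auto it = factories_.find("BreakIterator_" + language);
        if (it != factories_.end())
            return it->second();
        return std::unique_ptr<BreakIteratorSketch>(new BreakIteratorSketch());
    }

private:
    std::map<std::string, Factory> factories_;
};
```

The same lookup-by-constructed-name idea applies to the other front-end classes in this document, which load services such as Collator_<locale> or IndexEntrySupplier_<locale>.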

The base class 'BreakIterator_Unicode' is a wrapper around the ICU BreakIterator class. While this class meets the requirements of Western languages, it does not for other languages such as those of East Asia (CJK), South Asia and South-East Asia (Indian, Thai) and West Asia and the Middle East (Arabic, Hebrew), which require the enhanced functionality described above.

Thus the current BreakIterator base class has two derived classes, BreakIterator_CJK and BreakIterator_CTL, both derived from BreakIterator_Unicode. The first provides dictionary-based word breaking for Chinese and Japanese; the second provides a more specific definition of character, cell and cluster for languages such as Thai and Arabic.

Use the following steps to create a language-specific BreakIterator service:

1. Derive a class from either BreakIterator_CJK or BreakIterator_CTL and name it BreakIterator_<language>.
2. Add the new service in registerservices.cxx.

There are three methods for word breaking: nextWord(), previousWord() and getWordBoundary(). You can override them with your own language rules.

BreakIterator_CJK provides input string caching and dictionary searching for longest matching. You may provide a sorted dictionary (the encoding needs to be UTF-8) by creating the following file: i18npool/source/breakiterator/data/dict_<language>.

The utility 'gendict' converts it to C code, which is compiled into a shared library for dynamic loading.

All dictionary searching and loading is performed in the xdictionary class. The only thing you need to do is derive your class from BreakIterator_CJK, create an instance of xdictionary with the language name, and pass it to the parent class.
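
To make the idea of longest-match dictionary searching concrete, here is a minimal sketch over a sorted in-memory word list. It does not reproduce the xdictionary class or the format generated by gendict: the functions longestMatchEnd and segmentSketch, the maxWordLen parameter, and the fallback to a single character when nothing matches are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the end index of the longest dictionary word starting at `start`.
// Falls back to a single character when no dictionary entry matches.
std::size_t longestMatchEnd(const std::u32string& text, std::size_t start,
                            const std::vector<std::u32string>& sortedDict,
                            std::size_t maxWordLen) {
    assert(start < text.size());
    const std::size_t limit = std::min(maxWordLen, text.size() - start);
    for (std::size_t len = limit; len > 1; --len) {
        // The dictionary is sorted, so a binary search is sufficient.
        if (std::binary_search(sortedDict.begin(), sortedDict.end(),
                               text.substr(start, len)))
            return start + len;
    }
    return start + 1;
}

// Splits the text greedily into longest-match words.
std::vector<std::u32string> segmentSketch(const std::u32string& text,
                                          const std::vector<std::u32string>& sortedDict,
                                          std::size_t maxWordLen) {
    std::vector<std::u32string> words;
    for (std::size_t pos = 0; pos < text.size();) {
        const std::size_t end = longestMatchEnd(text, pos, sortedDict, maxWordLen);
        words.push_back(text.substr(pos, end - pos));
        pos = end;
    }
    return words;
}
```

With a dictionary containing the entries U+4E2D U+56FD, U+4E2D U+56FD U+4EBA and U+4EBA U+6C11, the four-character text U+4E2D U+56FD U+4EBA U+6C11 is split after the third character, because the three-character entry is the longest match at position 0.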

Collation

There are two types of collation: single-level and multiple-level.

Most European and English locales need multiple level collation. We use the ICU collator to cover this need.

Most CJK languages need only single-level collation. We have created a two-step table lookup to do the collation for these languages. If you have a new language or algorithm in this category, you can derive a new service from Collator_CJK and provide index and weight tables. Here is a sample implementation:

    #include <collator_CJK.hxx>

    // Index table for the first lookup step (contents elided).
    static sal_uInt16 index[] = {
        ...
    };

    // Weight table for the second lookup step (contents elided).
    static sal_uInt16 weight[] = {
        ...
    };

    // Compare two substrings by passing the tables to the parent class.
    sal_Int32 SAL_CALL Collator_zh_CN_pinyin::compareSubstring (
        const ::rtl::OUString& str1, sal_Int32 off1, sal_Int32 len1,
        const ::rtl::OUString& str2, sal_Int32 off2, sal_Int32 len2)
        throw (::com::sun::star::uno::RuntimeException) {
        return compare(str1, off1, len1, str2, off2, len2, index, weight);
    }

    // Compare two whole strings by passing the tables to the parent class.
    sal_Int32 SAL_CALL Collator_zh_CN_pinyin::compareString (
        const ::rtl::OUString& str1,
        const ::rtl::OUString& str2)
        throw (::com::sun::star::uno::RuntimeException) {
        return compare(str1, 0, str1.getLength(), str2, 0, str2.getLength(), index, weight);
    }

The front-end implementation Collator dynamically loads and caches the language-specific service under the name Collator_<locale>.

Steps to add a new service:

1. Derive a new service from the above class.
2. Provide index and weight tables.
3. Register the new service in registerservices.cxx.
4. Add the new service to the collation section of the localedata file.
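
The two-step table lookup behind single-level collation might look like the sketch below. The table layout is an assumption made for illustration and is not taken from Collator_CJK: here the index table has 256 entries keyed by the high byte of a UTF-16 code unit, each holding either an offset into the weight table or the marker kNoPage, and characters without a page fall back to their code unit value. TwoStepTables, weightOf and compareSingleLevel are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint16_t kNoPage = 0xFFFF;  // marks a high byte without weights

struct TwoStepTables {
    std::vector<std::uint16_t> index;   // 256 entries: page offset or kNoPage
    std::vector<std::uint16_t> weight;  // 256 weights per populated page
};

// Step 1: high byte selects a page. Step 2: low byte selects the weight.
std::uint16_t weightOf(char16_t c, const TwoStepTables& t) {
    assert(t.index.size() == 256);
    const std::uint16_t page = t.index[c >> 8];
    if (page == kNoPage)
        return static_cast<std::uint16_t>(c);  // assumed fallback: code unit order
    return t.weight[page + (c & 0xFF)];
}

// Single-level comparison: the first differing weight decides.
int compareSingleLevel(const std::u16string& a, const std::u16string& b,
                       const TwoStepTables& t) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint16_t wa = weightOf(a[i], t);
        const std::uint16_t wb = weightOf(b[i], t);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (a.size() == b.size())
        return 0;
    return a.size() < b.size() ? -1 : 1;  // a prefix sorts first
}
```

Splitting the lookup into pages keeps the weight table small when only some blocks of the code space need custom weights.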

Transliteration

Transliteration is the service for string conversion. The front-end implementation TransliterationImpl dynamically loads and caches specific transliteration services, either by the enum defined in XTransliteration.idl or by implementation name.

We have defined transliteration in three categories: Ignore, OneToOne and Numeric. All of them are derived from transliteration_commonclass.

The Ignore service covers ignoring case, half/full width, katakana/hiragana and so on. You can derive your new service from it and override the folding/transliteration methods.

The OneToOne service is for one-to-one mappings, for example converting lower case to upper case. The class provides two more services, which take a mapping table or a mapping function to do the folding/transliteration. You can derive a class from it and provide a table or function for the parent class to do the transliteration.

The Numeric service converts a number to a number string in a specific language. It can be used to format date strings, for example.

To add a new transliteration:

1. Derive a new class from one of the above three classes.
2. Override the folding/transliteration methods, or provide a table for the parent class to do the transliteration.
3. Register the new service in registerservices.cxx.
4. Add the new service to the transliteration section of the localedata file.
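
The two ways of feeding a one-to-one transliteration, a mapping function or a mapping table, can be sketched as below. OneToOneSketch and katakanaToHiraganaSketch are hypothetical names and do not mirror the real transliteration_OneToOne interface. The katakana example relies on the Unicode layout in which U+30A1 to U+30F6 sit exactly 0x60 above the corresponding hiragana.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical one-to-one transliteration driven by a function or a table.
class OneToOneSketch {
public:
    using MappingFunc = std::function<char32_t(char32_t)>;

    // Variant 1: the caller supplies a mapping function.
    explicit OneToOneSketch(MappingFunc mapping) : map_(std::move(mapping)) {
        assert(map_ != nullptr);
    }

    // Variant 2: the caller supplies a table; unmapped characters pass through.
    explicit OneToOneSketch(const std::map<char32_t, char32_t>& table)
        : map_([table](char32_t c) -> char32_t {
              auto it = table.find(c);
              return it == table.end() ? c : it->second;
          }) {}

    std::u32string transliterate(const std::u32string& in) const {
        std::u32string out;
        out.reserve(in.size());
        for (char32_t c : in)
            out.push_back(map_(c));
        return out;
    }

private:
    MappingFunc map_;
};

// Maps katakana U+30A1..U+30F6 to hiragana; everything else is unchanged.
char32_t katakanaToHiraganaSketch(char32_t c) {
    return (c >= 0x30A1 && c <= 0x30F6) ? static_cast<char32_t>(c - 0x60) : c;
}
```

A table suits sparse or irregular mappings, while a function is more compact when the mapping follows a fixed offset, as in the kana case.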

Indexing

Indexing provides a service for generating index pages. The main method for the service is getIndexCharacter(). Frontend implementation IndexEntrySupplier will load and cache language specific services based on the name IndexEntrySupplier_<locale> dynamically.

We have divided languages into two sets.

The first set is the Latin1 languages, which can be covered by 256 Unicode code points. We use a one-step table lookup to generate the index character. We have generated alphabetic and numeric tables that cover most Latin1 languages. If you need another algorithm or have conflicts with the table, you can create your own table and derive a new class from IndexEntrySupplier_Euro. Here is a sample implementation:

    #include <sal/types.h>
    #include <indexentrysupplier_euro.hxx>
    #include <indexdata_alphanumeric.h>

    OUString SAL_CALL i18n::IndexEntrySupplier_alphanumeric::getIndexCharacter( const OUString& rIndexEntry,
        const lang::Locale& rLocale, const OUString& rSortAlgorithm ) throw (uno::RuntimeException) {
        return getIndexString(rIndexEntry, idxStr);
    }

Here idxStr is the lookup table.

For languages that cannot be covered by the first case, such as CJK, we use a two-step table lookup. Here is a sample implementation:

    #include <indexentrysupplier_cjk.hxx>
    #include <indexdata_zh_pinyin.h>

    OUString SAL_CALL i18n::IndexEntrySupplier_zh_pinyin::getIndexCharacter( const OUString& rIndexEntry,
        const lang::Locale& rLocale, const OUString& rSortAlgorithm ) throw (uno::RuntimeException) {
        return getIndexString(rIndexEntry, idxStr, idx1, idx2);
    }

Here idx1 and idx2 are the two step tables, and idxStr contains all index keys that can be returned. If you have a new language or algorithm, you can derive a new service from IndexEntrySupplier_CJK and provide tables for the parent class to generate the index.

Note that the index depends heavily on collation; each index algorithm should have a collation algorithm to support it.

To add a new service:

1. Derive a new service from one of the above classes.
2. Provide the tables for lookup.
3. Register the new service in registerservices.cxx.
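
For the Latin1 case, a one-step lookup can be sketched as a 256-entry table that maps the first character of an entry to its index key. The functions makeLatin1IndexTableSketch and getIndexStringSketch are hypothetical and do not reproduce the real getIndexString() or the generated index data; grouping the accented letters shown under their base letter is an assumption chosen for the example and is not correct for every language.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds an illustrative 256-entry table: code point -> index key.
std::vector<char16_t> makeLatin1IndexTableSketch() {
    std::vector<char16_t> table(256);
    for (unsigned c = 0; c < 256; ++c)
        table[c] = static_cast<char16_t>(c);          // default: the character itself
    for (unsigned c = u'a'; c <= u'z'; ++c)
        table[c] = static_cast<char16_t>(c - 0x20);   // a..z are filed under A..Z
    table[0xC4] = table[0xE4] = u'A';                 // A/a with diaeresis -> A (assumed)
    table[0xC9] = table[0xE9] = u'E';                 // E/e with acute -> E (assumed)
    return table;
}

// One-step lookup: the first character of the entry selects the index key.
std::u16string getIndexStringSketch(const std::u16string& entry,
                                    const std::vector<char16_t>& idxStr) {
    assert(idxStr.size() == 256);
    if (entry.empty())
        return std::u16string();
    const char16_t c = entry[0];
    // Characters outside Latin1 are returned unchanged in this sketch.
    return std::u16string(1, c < 256 ? idxStr[c] : c);
}
```

A language that files accented letters separately would supply a different table, which is the customization point the text describes.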

Search and Replace

Search and replace is also locale dependent, because some search options are only available for a particular locale. For instance, if “Asian languages support” is enabled, you will see an additional option “Sounds like (Japanese)” in the “Find & Replace” dialog box. With this option, you can turn certain Japanese-specific options on or off in the search and replace process.

Search and replace relies on the transliteration modules for various search options. The transliteration modules will be loaded and the search string will be converted before the search process.
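
The convert-then-search approach can be illustrated with the sketch below, which folds both the text and the search string before running a plain substring search. The names FoldFunc, foldAsciiCase, applyFold and findFolded are hypothetical, and an ASCII case fold stands in for a real transliteration module. Because the fold is one-to-one, a match position in the folded text is also valid in the original text.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

using FoldFunc = std::function<char32_t(char32_t)>;

// Stand-in for a transliteration module: folds ASCII upper case to lower case.
char32_t foldAsciiCase(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 0x20) : c;
}

// Applies the fold to every character of the string.
std::u32string applyFold(const std::u32string& s, const FoldFunc& fold) {
    std::u32string out;
    out.reserve(s.size());
    for (char32_t c : s)
        out.push_back(fold(c));
    return out;
}

// Folds both strings, then searches; returns std::u32string::npos if absent.
std::size_t findFolded(const std::u32string& text, const std::u32string& pattern,
                       const FoldFunc& fold) {
    assert(fold != nullptr);
    return applyFold(text, fold).find(applyFold(pattern, fold));
}
```

Swapping in a different fold function, for example one that ignores the katakana/hiragana distinction, changes the search option without touching the search routine itself.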

Search supports regular expressions. The regular expression implementation uses the transliteration service available for the locale to perform case-insensitive search.
