Issue 23693 - coined words not recognized as words
Summary: coined words not recognized as words
Status: CONFIRMED
Alias: None
Product: Internationalization
Classification: Code
Component: code
Version: OOo 1.1
Hardware: All All
Importance: P4 Trivial with 6 votes
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-12-18 21:35 UTC by grsingleton
Modified: 2013-08-07 15:00 UTC
CC List: 4 users

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Description grsingleton 2003-12-18 21:35:28 UTC
I am working on localizing Klingon and find that I need to coin words that
contain characters from ASCII to indicate glottal stops and other language
requirements. (I call these spit marks.) In testing with the following string,
'|ØL©U', I found that the word broke after the first character, in this case the
'|' character, which was unexpected. I defined the string in my personal
dictionary but found that the '|' character was not included in the definition.
I need these characters to behave as part of a word. I even tried
Format/Character/Language using Maori, which is similar to Hawaiian in its use of
punctuation as part of words. This didn't work either.

I think this is a defect. Sander Vesik suggested that the BreakIterator code
needed to be adjusted, and I would do so, but I don't know where to look in the
source to even try.

Please evaluate this with an eye to the fact that it is needed in the same way
that Welsh required changes.
Comment 1 Dieter.Loeschky 2004-02-05 10:56:12 UTC
DL->grsingleton: Sorry, but our breakiterator doesn't support Klingon ;-)
Comment 2 grsingleton 2004-02-05 13:22:17 UTC
What a stupid remark. There are other languages that employ diacritical marks
and other languages that are created, such as Esperanto. It just happens that
Klingon is more familiar to me and one with which I work. Had you said that you
didn't know how to improve the breakiterator, that would be acceptable because
then we could look for someone with the necessary skills and experience to
tackle the job. I have been active answering questions of users, and
breakiterator questions do come up. I was expecting help and guidance if nothing
else, as the breakiterator code doesn't work all that well even for English.

In the course of writing one also coins words; under the current state of the
breakiterator these are not handled either. Please review the code so that users
have some control over how words are handled with this software, as it is broken.

It is also important to note that the Klingon exercise is important to the
marketing project. 
Comment 3 genedsmith 2004-04-03 04:51:07 UTC
I am no expert on Klingon or other languages but this bug is causing me problems
in plain old English. When I try to enter something like var_name[8] in the OOo
"body text" paragraph style it breaks to a newline at the [. Using reg. exp.
symbols ^ (start of line) and $ (end of line) I see the following wrap effect:

^.....     var_name$
^[8] ..... .....$

Where I would expect to see:
^.......       .....$
^var_name[8] .......$

(The ^ and $ don't actually print, of course, and .... is any other words.)
Note: Even this webform editor used by IssueTracker (or IssueZilla) keeps
var_name[8] as a unit when it wraps it. BTW, so does M/S Word.
Comment 4 jimmyh 2004-04-03 12:48:46 UTC
I have also had problems, in the 'language' of Java. For example the phrase: 
 
o1.equals(o2)==o2.equals(o1) 
 
insists on breaking at the brackets. This is a real problem when I'm writing 
up technical papers in OOo. 
 
I'm not sure what to suggest. Maybe where breakIterator identifies breaks 
should be language-dependent, with the language 'computer code/math' included. 
Or breaking symbols in order of preference could be an aspect of character 
style. 
Comment 5 rblackeagle 2004-04-03 16:42:13 UTC
OOo breaks at any non-alphabetic mark even when there is no space before or
after.  It would seem to me that it should treat words, phrases (mathematical)
and formulas with no spaces or with non-breaking spaces as a whole and break
them or not according to user choice.

This problem with programming languages, Klingon and some foreign languages also
affects Hebrew (according to numerous reports on getting Hebrew working right).
It would also seem to affect some CTL and Asian languages.

The workaround in Microsoft Word was to mark a piece of text and set it to
"non-breaking", which for some Asian fonts was a really poor solution.  I don't
know if MS still has that feature.  I would rather it be a choice connected to
language as well as programming and mathematical writing.

Better yet would be to avoid breaks where there is no space or breaking hyphen.
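
The behaviour asked for here (break only where there is whitespace) can be illustrated with a small sketch. This is not OOo code and says nothing about how Writer's line breaker is implemented; it only uses Python's standard textwrap module, with a hypothetical helper name wrap_on_whitespace, to show the expected result for the var_name[8] and Java examples from the earlier comments.

```python
import textwrap


def wrap_on_whitespace(text: str, width: int) -> list[str]:
    """Wrap text so that line breaks occur only at whitespace.

    Hypothetical helper for illustration; it mimics the behaviour
    requested in this issue, not OOo's actual line breaker.
    """
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,  # never split a token that exceeds the width
        break_on_hyphens=False,  # do not treat hyphens as break opportunities
    )


# "var_name[8]" is moved to the next line as a unit instead of
# being split at the "[" character.
lines = wrap_on_whitespace("aaaa bbbb var_name[8] cccc dddd", 16)
for line in lines:
    print(line)
```

A token that is longer than the line width, such as o1.equals(o2)==o2.equals(o1), is placed on a line of its own rather than being split at the brackets.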
Comment 6 pbatchie 2005-01-24 13:23:41 UTC
Somewhat related, in m69 spellcheck includes the period after abbreviations when
flagging errors, but not when adding words. Thus, 'Jn.' was flagged as an error.
Deciding to Add that to my dictionary, I then found out that 'Jn', without the
trailing period, is now passed as correct. Thus the speller does not
differentiate between abbreviations and non-.

The lack of differentiation may or may not be unavoidable, but it would at least
be logical for spellcheck not to tell the user that he is ok-ing only the
abbreviation form of a string, when he really is ok-ing the string in any form.
Comment 7 Dieter.Loeschky 2005-07-11 08:22:57 UTC
DL-> US: Could you please handle this?
Comment 8 ulf.stroehler 2005-07-11 08:32:26 UTC
Karl, can you help pls.
Comment 9 khirano 2005-07-11 08:57:16 UTC
FYI:
The IPA (International Phonetic Alphabet) symbol for a glottal stop is a
question mark ("?") without the dot at the bottom, but some languages, such as
Aynu when written in the Latin alphabet, use an apostrophe (').
Comment 10 grsingleton 2006-02-25 14:04:46 UTC
I have noticed messages on the NLC list that indicate that this, or something
very similar, is an ongoing problem. Can we have an update, please?
Comment 11 karl.hong 2006-02-27 19:49:26 UTC
OOo uses the ICU breakiterator algorithm to find word boundaries,

http://icu.sourceforge.net/userguide/boundaryAnalysis.html

along with enhanced rules for the different needs of Writer and of individual
languages.

You can set up language-specific rules under the breakiterator data directory,
but that is for developers building their own i18n library, not for end users.

FYI, if you don't want the breakiterator to treat a special symbol or
punctuation mark as a word boundary for the spellchecker, add it as $MidLetter in

http://l10n.openoffice.org/source/browse/l10n/i18npool/source/breakiterator/data/dict_word.txt?rev=1.5&content-type=text/vnd.viewcvs-markup

and rebuild the i18npool project.
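
To show the effect such a rule change is meant to have, here is a minimal sketch. It is not ICU rule syntax and not the contents of dict_word.txt; the function split_words and its extra_word_chars parameter are hypothetical names used only for illustration. It is also a simplification: it treats the extra characters as word characters anywhere, whereas a MidLetter class in the Unicode word-boundary rules only suppresses a break between two letters.

```python
def split_words(text: str, extra_word_chars: str = "") -> list[str]:
    """Split text into words.

    A character belongs to a word if it is alphanumeric or listed in
    extra_word_chars (a hypothetical per-user setting). Everything
    else acts as a word boundary.
    """
    words: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch.isalnum() or ch in extra_word_chars:
            current.append(ch)
        elif current:
            # A boundary character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Without extra characters the coined word from the description
# falls apart; with '|' and '©' allowed it stays in one piece.
default_split = split_words("|ØL©U")
custom_split = split_words("|ØL©U", extra_word_chars="|©")
print(default_split)
print(custom_split)
```

The same idea would cover the var_name[8] case by passing "_[]" as the extra characters.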
Comment 12 nousernameleft 2006-07-15 09:11:57 UTC
I have the same problem with medieval Latin texts. In scholarly fonts, ligatures
and special characters used in medieval Latin (long S, R rotunda, abbreviations...)
are located in the Private Use Area. One word can contain glyphs from various
Latin blocks and the PUA at the same time (see
http://commons.wikimedia.org/wiki/Image:Latin-breve.png for an example). OOo
breaks words at PUA characters, i.e. at the end of a line "succRescentibus"
(with R rotunda) becomes "succ
Rescentibus".

It would be nice to have a user option which allows line breaks only at
whitespace and punctuation.
Comment 13 Rob Weir 2013-07-30 02:46:57 UTC
Reset assignee on issues not touched by assignee in more than 1000 days.