Apache OpenOffice (AOO) Bugzilla – Issue 19848
The hyphen goes after the number when I write in Hebrew (RTL)
Last modified: 2013-08-07 15:00:01 UTC
When I write in Hebrew I have this problem: I type a letter, a hyphen and a number without spaces (A-19), and I get a letter, a number and a hyphen (A19-). This bug happened in 1.1 RC5.
Created attachment 9534 [details] pic of this bug
DL->FME: Would you please take over?
FME: This order is the result of the ICU implementation of the unicode bidi algorithm. The "-" is a "minus" and therefore it belongs to the number. Enter a space between the minus and the number, and you get another result.
"Enter a space between the minus and the number, and you get another result." This is what I did, but when I open a (big) document that I created with MS office (XP) (in MS office the hyphen-minus is in the right place without space) all the hyphen in the wrong place... So you need to replace all the hyphen with a hyphen and one space ("-" to "- "). I think that a newbie will just uninstall OpenOffice... I don't know how MS solved this issue, but they did... BTW All of you did an excellent work with this software, well done!
Some discussion of this problem in other contexts:
http://bugzilla.mozilla.org/show_bug.cgi?id=73251#c32
http://lists.w3.org/Archives/Public/www-international/2003JulSep/0084.html
http://mozilla.org.il/board/viewtopic.php?p=1790#1790
and also
http://plasma-gate.weizmann.ac.il/Linux/maillists/03/10/msg00075.html

In light of the above, I see two practical alternatives for solving the problem.

1. Break compatibility with the Unicode algorithm. Starting with Office 2000, Microsoft uses a different algorithm that fixes this problem (I'm not aware of any other deviation from Unicode) -- use that instead.

-or-

2. a. During text input, use heuristics to produce an encoding that's rendered as desired. In the case of hebrew+minus+digit, instead of a plain HYPHEN-MINUS insert some appropriate Unicode sequence such as RLE+(HYPHEN-MINUS)+PDF or RLE+(NON-BREAKING HYPHEN)+PDF (see note below).
   b. Do something smart about those sequences during editing (e.g., treat them as one logical character).
   c. In the MS Office import filters, add RLE+PDF where necessary so as to simulate Microsoft's algorithm.
   d. Likewise, kludge the MS Office output filters as necessary.

Both seem rather horrible, but so is the current situation. The hebrew+hyphen+digit pattern occurs in many (perhaps most) Hebrew documents, so its being rendered incorrectly in legacy documents is a major issue. As for new documents, "enter a space between the minus and the number" is unsatisfactory since the result is typographically appalling, especially if the space induces a line break.

A couple of notes on 2.a. above: The sequence (HYPHEN-MINUS)+LRM can be used in RTL context, but breaks things in LTR context. Arguably, the Right Thing is to use the single character U+05BE (HEBREW PUNCTUATION MAQAF). Alas, this seems impractical as the character is misrendered or missing in most fonts. Also, Maqaf is not represented on keyboards and is missing from the iso8859-8 charset (though it's present in windows-1255).
Moreover, the widespread use of HYPHEN-MINUS instead of the Maqaf character has virtually eliminated the latter from common texts -- it seems to be perceived as a quaint historical quirk that is bearable in "professional" typesetting, but would look quite strange in (say) everyday correspondence.
FME: Ok, I see the point. On one hand, compatibility with Word is quite important, on the other hand, changing the Unicode Bidi Algorithm does not seem to be the perfect solution. Issue 18024 discusses a related problem. FME->FT: What should we do with this?
*** This issue has been confirmed by popular vote. ***
Correction to my comment above: I think RLE+(NON-BREAKING HYPHEN)+PDF is superfluous since a simple NON-BREAKING HYPHEN should do the job here. Assuming it's rendered correctly in the relevant fonts and apps... Anyway, deciding these details does call for a Unicode guru.
This issue is re-targeted to Office Later.
Automatic insertion of BiDi special characters isn't too bad if you ignore them while typing the text. On the contrary, it makes your text compatible with other programs.
> On the contrary it makes your text compatible with other programs

Is it really? Has anyone tested this? Also, is RLM/LRM the best approach, or would it be better just to insert a proper maqaf?
> is it really? has anyone tested this?

I have. I now use RLM regularly as a workaround for forcing these sequences (HebrewLetter+HyphenMinus+Number) to render properly on both IE and Mozilla.

> Also- is rlm/lrm the best approach, or would it be better just to insert a proper maqaf?

At the moment, inserting RLMs would be better, as the Maqaf glyph is broken in most fonts.

Regardless of what we do with editing new texts, the main issue is with rendering existing ones. There is no way to properly render such sequences in existing texts, unless we decide to stray from the Unicode BiDi algorithm and adopt the variant that is used by Microsoft, Opera and other software vendors (such as Mellel for OS X - redlers.com).

For more information about why the Unicode BiDi algorithm is inadequate for dealing with real-life existing texts, please read the following thread: "The fate of Hebrew texts with Hyphen-Minus instead of Maqaf" http://lists.w3.org/Archives/Public/www-international/2003JulSep/0184.html

Prog.
An RLM fixes the problem in RTL context but breaks things in LTR context (in both the Unicode and Microsoft algorithms). You'd need to keep track of the problem spot and add/remove the RLM whenever the context changes. Nasty.

Meanwhile, I've checked Microsoft's standard fonts (on Office XP, Windows 2000) and it seems that they include neither U+2010 HYPHEN nor U+2011 NON-BREAKING HYPHEN. So it seems that the only combination that's both usable and Unicode-compliant is RLE+(HYPHEN-MINUS)+PDF.
> An RLM fixes the problem in RTL context but breaks things in LTR
> context (in both the Unicode and Microsoft algorithms). You'd need to
> keep track of the problem spot and add/remove the RLM whenever the
> context changes. Nasty.

No problems here. I tested IE6 and Mozilla 1.5 with an LTR and an RTL HTML textarea and in all cases, HebrewLetter+HyphenMinus+RLM+Number rendered properly. I don't know how well bugzilla supports this, but "ה-‏20" should look the same even if you switch this page to RTL.

> Meanwhile, I've checked the standard Microsoft's fonts (on Office XP,
> Windows 2000) and it seems that they include neither U+2010 HYPHEN nor
> U+2011 NON-BREAKING-HYPHEN.

Since these chars are not included in ISO-8859-8-i (logical) and in Windows-1255, I don't really think that they could provide a reasonable solution, even if the fonts did come with them.

> So it seems that the only combination that's both usable and
> Unicode-compliant is RLE+(MINUS-HYPHEN)+PDF.

I disagree. To start with, neither ISO-8859-8-i nor Windows-1255 includes PDF. Furthermore, these charsets do support RLM and this makes HebrewLetter+HyphenMinus+RLM+Number a very useful solution.

Anyway, let me stress again that finding a solution for text composition is the easy part; it's the rendering of *existing texts* that we should actually discuss. This is where the real problem lies.

Prog.
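The charset claims above can be spot-checked with a short script. This is only a sketch: it relies on the codec tables bundled with Python (`iso8859_8` and `cp1255`), which may not match what 2003-era systems shipped, and the helper names `encodable` and `support_table` are made up for illustration.

```python
# Characters proposed in this thread as replacements or helpers for the hyphen.
CANDIDATES = {
    "HYPHEN-MINUS": "\u002d",
    "HEBREW PUNCTUATION MAQAF": "\u05be",
    "LEFT-TO-RIGHT MARK": "\u200e",
    "RIGHT-TO-LEFT MARK": "\u200f",
    "HYPHEN": "\u2010",
    "NON-BREAKING HYPHEN": "\u2011",
    "RIGHT-TO-LEFT EMBEDDING": "\u202b",
    "POP DIRECTIONAL FORMATTING": "\u202c",
}


def encodable(ch, codec):
    """Return True if the character can be encoded in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def support_table(codecs=("iso8859_8", "cp1255")):
    """Map each candidate name to {codec: encodable?} using Python's codec tables."""
    return {
        name: {codec: encodable(ch, codec) for codec in codecs}
        for name, ch in CANDIDATES.items()
    }


if __name__ == "__main__":
    for name, row in support_table().items():
        print(f"{name:28}", row)
```

A character that fails to encode in both charsets cannot survive a round trip through legacy Hebrew text files, which is the practical argument being made for RLM over PDF.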
Prog: sorry, you're right, RLM works perfectly. Given this, I fully concur about PDF. (What I tested was HebrewLetter+HyphenMinus+LRM+Number, which indeed works only in RTL context.)

This leaves open the issue of hiding the RLM during editing. It really ought to be transparent. I don't want to be the one explaining to users why in OpenOffice you need to "mess around with invisible special characters" while in Word "it just works".

> Regardless of what we do with editing new texts, the main issue
> is with rendering existing ones. There is no way to properly render
> such sequences in existing texts, unless we decide to stray from the
> Unicode BiDi algorithm [...]

What's wrong with the other alternative I sketched in my 2003-10-05 comment? Namely, when importing (say) a Word file, automatically insert RLMs whenever needed to "emulate" Microsoft's algorithm. Deciding what to do about text pasted from the clipboard is left as an exercise.

Eran
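The import-time idea could look roughly like the following. It is a minimal sketch, not Microsoft's algorithm and not anything OpenOffice implements: it handles only the narrow HebrewLetter+HyphenMinus+Digit pattern discussed here, and `add_marks` / `strip_marks` are hypothetical helper names.

```python
import re

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Hebrew letter (U+05D0..U+05EA) followed by HYPHEN-MINUS, with an ASCII digit next.
_NEEDS_MARK = re.compile(r"([\u05d0-\u05ea]-)(?=[0-9])")

# The same spot once an RLM has been inserted, so it can be removed again.
_HAS_MARK = re.compile(r"(?<=[\u05d0-\u05ea]-)" + RLM + r"(?=[0-9])")


def add_marks(text):
    """Insert an RLM between the hyphen and the number (assumed import-filter step)."""
    return _NEEDS_MARK.sub(lambda m: m.group(1) + RLM, text)


def strip_marks(text):
    """Remove only the RLMs that add_marks would have inserted (assumed export step)."""
    return _HAS_MARK.sub("", text)


if __name__ == "__main__":
    sample = "\u05d4-20"  # HE, HYPHEN-MINUS, "20"
    marked = add_marks(sample)
    print([hex(ord(ch)) for ch in marked])
    print(strip_marks(marked) == sample)
```

Because `add_marks` does not match text that already carries the mark, running it twice leaves the text unchanged, and `strip_marks` restores the original on export. Hiding the mark during editing is a separate UI problem that this sketch does not address.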
This issue will be covered by #21019
*** This issue has been marked as a duplicate of 21019 ***
closed
Insount@openoffice.org wrote:
> What's wrong with the other alternative I sketched in my 2003-10-05
> comment? Namely, when importing (say) a Word file, automatically
> insert RLMs whenever needed to "emulate" Microsoft's algorithm.
> Deciding what to do about text pasted from the clipboard is left as an
> exercise.

Why plant control characters that aren't supported by all character encodings, when we can implement a better algorithm that handles such sequences perfectly? I also don't like the idea of needlessly changing the original contents of a file, especially of plain text ones.

Falko Tesch, how can this cross-platform request be a dupe of a Windows-only bug?

Prog.
This is NOT a duplicate of bug 21019. The latter gives a possible way to handle the problem of text entry (not the only option, and possibly not applicable to Unix systems). This bug also discusses handling imported/legacy texts.

Prog:
> Why plant control characters that aren't supported by all character
> encodings, when we can implement a better algorithm that handles
> such sequences perfectly?

Compliance with the Unicode bidi algorithm is certainly a consideration; its importance is not for me to decide. Also, do you have any reason to believe that the various algorithms floating out there (Opera, Mellel, various versions of Microsoft) are compatible with *each other*? Assuming not, which one do you pick?
> Also, do you have any reason to believe that the various algorithms
> floating out there (Opera, Mellel, various versions of Micorosft) are
> compatible with *each other*? Assuming not, which one do you pick?

These sequences are handled very well and pretty much the same in any of those applications, but since Microsoft holds more than 95% of the desktop OS and browser market share, I believe that we should adopt their algorithm, especially since they are willing to provide the specifications for their handling of HyphenMinus in Hebrew contexts.

BTW, I believe that the Microsoft BiDi algorithm is just a developed variant of the Unicode BiDi algorithm, though I may be wrong on this.

Prog.
About adopting the Microsoft algorithm: I stress again that Microsoft has employed several different bidi algorithms. For example, all of the following are different:
* Word97+Windows95
* Notepad+Windows98
* MSIE5.5+Windows98
* Notepad+Windows2K
* WordXP+Windows2K
(Some differentiating cases: "A-5", "A-5a", "1A-2", "1-2" and "-1", all in RTL context, where A is an RTL letter.)

Also, the last four variants (not sure about the first) have such wonderful properties as turning " -1 " into " 1- " in RTL context, contrary to both Unicode and reason. Who knows what other surprises will arise.
comments from ft Wed Oct 15 00:03:13 -0800 2003:
> This issue will be covered by #21019

From issue #21019: "Note: Since Unix IMEs do not report any language this feature can only be implemented under Windows."

As issue #21019 covers Windows only, and this issue covers all platforms, even when issue #21019 is resolved, non-Windows users (Mac, Linux, Solaris) will still have this bug. Also, what about legacy text/importing text? Issue #21019 does not cover those issues while this one does. IMO, this issue should be marked as being blocked by issue #21019, but not as a duplicate of it.
Unicode 4.0.1 has recently been released with changes to the properties of several characters. Once OO (and some other projects) are updated to comply with these changes, the HebrewLetter+Hyphen+Number issue will finally be solved. See http://bugzilla.mozilla.org/show_bug.cgi?id=240943 for Mozilla's take on the subject.

Note that this bug has wrongly been marked as a duplicate of bug 21019 (it has nothing to do with this issue), so just to make sure that this important update isn't missed, I'm posting it in both bugs. Sorry for the spam.

Please consider reopening this bug, or post a new one specifically for compliance with the aforementioned changes in Unicode.

Prog.
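One way to see which bidi class a given Unicode database assigns to the characters involved is to query it directly. The sketch below uses Python's `unicodedata` module, which exposes both its current database and a frozen Unicode 3.2 copy (`unicodedata.ucd_3_2_0`); neither is the ICU data OpenOffice uses, so treat the output as indicative only. As I understand the 4.0.1 change, HYPHEN-MINUS moved from class ET (European Terminator) to ES (European Separator). `bidi_class_report` is a made-up helper name.

```python
import unicodedata

# Characters relevant to the HebrewLetter+Hyphen+Number sequence.
SAMPLES = {
    "HEBREW LETTER ALEF": "\u05d0",
    "HYPHEN-MINUS": "\u002d",
    "MINUS SIGN": "\u2212",
    "HEBREW PUNCTUATION MAQAF": "\u05be",
    "DIGIT ONE": "1",
}


def bidi_class_report(db=unicodedata):
    """Return {character name: bidi class} according to the given database object."""
    return {name: db.bidirectional(ch) for name, ch in SAMPLES.items()}


if __name__ == "__main__":
    current = bidi_class_report()                       # database bundled with Python
    legacy = bidi_class_report(unicodedata.ucd_3_2_0)   # Unicode 3.2 snapshot
    print(f"{'character':26} {'3.2':4} {unicodedata.unidata_version}")
    for name in SAMPLES:
        print(f"{name:26} {legacy[name]:4} {current[name]}")
```

Comparing the two columns for HYPHEN-MINUS shows whether the property differs between the pre-4.0.1 snapshot and the newer data.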
> Unicode 4.0.1 has recently been released with changes to the properties
> of several characters. Once OO (and some other projects) will be updated
> to comply with these changes, the HebrewLetter+Hyphen+Number issue will
> finally be solved.

Note that the new Unicode standard will mis-render negative numbers in RTL context: "-1" renders as "1-" (where the "-" is HYPHEN MINUS, not U+2212 MINUS SIGN). The good news is that this is the same as Microsoft's latest variants of the bidi algorithm, so the import/export situation looks promising. The bad news is that the issue of manual overrides (by LRM/RLM or whatever other means) is still crucial.

BTW, the more I think of this, the more I'm convinced that explicit nesting via RLE/LRE+PDF is more natural for the average user than using RLM/LRM (assuming optimal GUI in both cases). RLE/LRE+PDF sets the direction of a run of text, which is a fairly natural and intuitive concept, whereas RLM/LRM require some understanding of the Unicode algorithm. But granted, UI and file export issues are tougher with RLE/LRE+PDF.
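The trade-off can be reproduced with a toy resolver. What follows is a deliberately simplified sketch of a few implicit rules from the Unicode bidi algorithm (W4 to W7, N1, N2, the implicit levels, and the L2 reordering); it ignores explicit embeddings, Arabic classes, whitespace handling and everything else, so it is not a substitute for ICU. Following the convention used earlier in this thread, an uppercase Latin letter stands in for an RTL letter. The `hyphen_class` parameter is an assumption for illustration: "ET" models the older data, "ES" the 4.0.1 data.

```python
def classify(ch, hyphen_class):
    """Assign a simplified bidi class; uppercase ASCII stands in for RTL letters."""
    if ch == "-":
        return hyphen_class
    if "0" <= ch <= "9":
        return "EN"
    if "A" <= ch <= "Z" or "\u05d0" <= ch <= "\u05ea":
        return "R"
    if "a" <= ch <= "z":
        return "L"
    return "ON"


def visual_order(text, hyphen_class, rtl=True):
    """Return the characters of `text` in left-to-right display order."""
    base = "R" if rtl else "L"
    t = [classify(ch, hyphen_class) for ch in text]
    n = len(t)

    # W4: a single ES between two EN becomes EN.
    for i in range(1, n - 1):
        if t[i] == "ES" and t[i - 1] == "EN" and t[i + 1] == "EN":
            t[i] = "EN"

    # W5: a run of ET adjacent to an EN becomes EN.
    i = 0
    while i < n:
        if t[i] != "ET":
            i += 1
            continue
        j = i
        while j < n and t[j] == "ET":
            j += 1
        if (i > 0 and t[i - 1] == "EN") or (j < n and t[j] == "EN"):
            t[i:j] = ["EN"] * (j - i)
        i = j

    # W6: leftover separators and terminators become neutral.
    t = ["ON" if c in ("ES", "ET") else c for c in t]

    # W7: EN becomes L when the previous strong type (or paragraph start) is L.
    strong = base
    for i, c in enumerate(t):
        if c in ("L", "R"):
            strong = c
        elif c == "EN" and strong == "L":
            t[i] = "L"

    def side(k):
        # Neighbour type for N1; text boundaries count as the base direction,
        # and numbers count as R.
        if k < 0 or k >= n:
            return base
        return "R" if t[k] == "EN" else t[k]

    # N1/N2: neutrals follow matching neighbours, otherwise the base direction.
    i = 0
    while i < n:
        if t[i] != "ON":
            i += 1
            continue
        j = i
        while j < n and t[j] == "ON":
            j += 1
        left, right = side(i - 1), side(j)
        t[i:j] = [left if left == right else base] * (j - i)
        i = j

    # Implicit levels: paragraph level 1 for RTL, 0 for LTR.
    if rtl:
        levels = [1 if c == "R" else 2 for c in t]
    else:
        levels = [{"L": 0, "R": 1, "EN": 2}[c] for c in t]

    # L2: from the highest level down to 1, reverse each run at that level or higher.
    chars = list(text)
    for lvl in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < lvl:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= lvl:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


if __name__ == "__main__":
    for sample in ("A-19", "-1"):
        for cls in ("ET", "ES"):
            print(repr(sample), cls, "->", repr(visual_order(sample, cls)))
```

In this model, treating the hyphen as ES keeps it between the letter and the number in "A-19", while "-1" in RTL context comes out as "1-", which is the regression described above.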
I see no reason why users should be aware of arcane control characters at all. They just need to be educated that inputting negative numbers requires switching the input method to English (or another LTR language). That's how it works in MSWORD and most users don't have much difficulty adjusting to the concept. Whether the underlying implementation employs RLE/LRE+PDF or LRM/RLM is of no concern to them. It just works.

Prog.
> They just need to be educated that inputing negative numbers requires
> switching input method to English (or another LTR language).

Which makes sense only for platforms that have a notion of IME.

> That's how it works in MSWORD and most users don't have much
> difficulties adjusting to the concept.

Actually, I suspect that many users just enter "1-". But input aside, invisible character attributes (or anything that emulates them GUI-wise) can be very difficult to *edit*. For example, fixing RTL text with embedded "LTR spaces" in MSWord is exasperating beyond belief. Perhaps the best way to address this is to visually distinguish embedded runs, and this maps nicely to RLE/LRE+PDF.
The HyphenMinus+Number problem is not fixed, although Bug 21019 is marked as fixed. Guess what? This bug really isn't a dupe of 21019 after all. Please re-open. Tested with Writer 1.9.m49 Prog.
On Issue, I posted changes which would update the ICU data files to conform to Unicode 4.0.1 regarding HYPHEN/MINUS. With these changes, the bug as reported goes away. This does cause a problem with the number -1 in RTL mode, as insount pointed out, which for now still requires the use of directional chars.
The issue referred to above was Issue 57833.