19848 – The hyphen goes after the number when I write in Hebrew (RTL)

Status:	CLOSED DUPLICATE of issue 21019

Alias:	None

Product:	Internationalization
Classification:	Code
Component:	BiDi
Version:	OOo 1.1
Hardware:	PC All

Importance:	P3 Trivial with 4 votes
Target Milestone:	---
Assignee:	falko.tesch
QA Contact:	issues@l10n

URL:
Keywords:

Depends on:	16354 18024
Blocks:

Reported:	2003-09-20 20:22 UTC by ipip
Modified:	2013-08-07 15:00 UTC
CC List:	4 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
pic of this bug (31.82 KB, image/png) 2003-09-20 20:30 UTC, ipip

Description ipip 2003-09-20 20:22:34 UTC

When I write in Hebrew I have this problem: 
I type a letter, a hyphen and a number (without spaces: A-19) and I get a
letter, a number and a hyphen (A19-).

This bug happened in 1.1 RC5.

Comment 1 ipip 2003-09-20 20:30:09 UTC

Created attachment 9534
pic of this bug

Comment 2 Dieter.Loeschky 2003-09-29 11:51:33 UTC

DL->FME: Would you please takeover?

Comment 3 frank.meies 2003-09-29 12:06:32 UTC

FME: This order is the result of the ICU implementation of the unicode
bidi algorithm. The "-" is a "minus" and therefore it belongs to the
number. Enter a space between the minus and the number, and you get
another result.
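
To see which bidi classes are involved, the character properties can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch: it reports the classes from the Unicode version bundled with the Python interpreter, which is newer than the data ICU shipped in 2003, so the class shown for the hyphen may differ from what OOo 1.1 used. The helper name `bidi_classes` is made up for this example.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Hebrew letter ALEF, HYPHEN-MINUS, then the digits 1 and 9 ("A-19" in the report).
sample = "\u05d0-19"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cls}")
```

A space has class WS rather than a number-related class, which is consistent with the remark that inserting a space changes the result.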

Comment 4 frank.meies 2003-10-01 12:55:03 UTC

Comment 5 ipip 2003-10-03 21:01:52 UTC

"Enter a space between the minus and the number, and you get
another result."

This is what I did, but when I open a (big) document that I created
with MS Office (XP) (in MS Office the hyphen-minus is in the right
place without a space), all the hyphens are in the wrong place... So you
need to replace every hyphen with a hyphen and one space ("-" to "- "). I
think that a newbie will just uninstall OpenOffice...

I don't know how MS solved this issue, but they did...

BTW
All of you did excellent work with this software, well done!

Comment 6 insount 2003-10-06 00:25:20 UTC

Some discussion of this problem in other contexts:

http://bugzilla.mozilla.org/show_bug.cgi?id=73251#c32
http://lists.w3.org/Archives/Public/www-international/2003JulSep/0084.html
http://mozilla.org.il/board/viewtopic.php?p=1790#1790
and also
http://plasma-gate.weizmann.ac.il/Linux/maillists/03/10/msg00075.html

In light of the above, I see two practical alternatives to solving the
problem.

1. Break compatibility with the Unicode algorithm. Starting with
Office 2000, Microsoft uses a different algorithm that fixes this
problem (I'm not aware of any other deviation from Unicode) -- use
that instead.

-or-

2. a. During text input, use heuristics to produce an encoding that's
rendered as desired. In the case of hebrew+minus+digit, instead of a
plain HYPHEN-MINUS insert some appropriate Unicode sequences such as
RLE+(HYPHEN-MINUS)+PDF or RLE+(NON-BREAKING HYPHEN)+PDF (see note below).
   b. Do something smart about those sequences during editing (e.g.,
treat them as one logical character).
   c. In the MS Office import filters, add RLE+PDF where necessary so
as to simulate Microsoft's algorithm. 
   d. Likewise, kludge the MS Office output filters as necessary.

Both seem rather horrible, but so is the current situation. The
hebrew+hyphen+digit pattern occurs in many (perhaps most) Hebrew
documents, so its being rendered incorrectly in legacy documents is a
major issue. As for new documents, "enter a space between the minus
and the number" is unsatisfactory since the result is typographically
appalling, especially if the space induces a line break.
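
Option 2.a could be sketched roughly as follows. This is a minimal illustration, not OpenOffice code: the function name `wrap_hyphens` and the regular expression are assumptions made for the example, and it only handles a HYPHEN-MINUS sitting directly between a Hebrew letter (alef to tav) and an ASCII digit.

```python
import re

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# A hyphen-minus directly between a Hebrew letter and an ASCII digit.
_HYPHEN = re.compile(r"(?<=[\u05d0-\u05ea])-(?=[0-9])")

def wrap_hyphens(text):
    """Wrap each matching hyphen-minus as RLE + '-' + PDF."""
    return _HYPHEN.sub(RLE + "-" + PDF, text)

wrapped = wrap_hyphens("\u05d0-19")
print([f"U+{ord(ch):04X}" for ch in wrapped])
```

Because a wrapped hyphen is preceded by RLE rather than by a Hebrew letter, running the function twice does not wrap it again.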

A couple of notes on 2.a. above:
The sequence (HYPHEN-MINUS)+LRM can be used in RTL context, but breaks
things in LTR context. 
Arguably, the Right Thing is to use the single character U+05BE
(HEBREW PUNCTUATION MAQAF). Alas, this seems impractical as the
character is misrendered or missing in most fonts. Also, Maqaf is not
represented on keyboards and is missing from the iso8859-8 charset
(though it's present in windows-1255). Moreover, the widespread use of
HYPHEN-MINUS instead of the Maqaf character has virtually eliminated
the latter from common texts -- it seems to be perceived as a quaint
historical quirk that is bearable in "professional" typesetting, but
would look quite strange in (say) everyday correspondence.

Comment 7 frank.meies 2003-10-06 07:47:17 UTC

FME: Ok, I see the point. On one hand, compatibility with Word is
quite important, on the other hand, changing the Unicode Bidi
Algorithm does not seem to be the perfect solution. Issue 18024
discusses a related problem.

FME->FT: What should we do with this?

Comment 8 ipip 2003-10-07 06:22:59 UTC

*** This issue has been confirmed by popular vote. ***

Comment 9 insount 2003-10-08 01:26:01 UTC

Correction to my comment above:
I think RLE+(NON-BREAKING HYPHEN)+PDF is superfluous since a simple
NON-BREAKING HYPHEN should do the job here. Assuming it's rendered
correctly in the relevant fonts and apps...
Anyway, deciding these details does call for a Unicode guru.

Comment 10 falko.tesch 2003-10-08 11:29:34 UTC

This issue is re-targeted to Office Later.

Comment 11 mehlng 2003-10-09 14:40:03 UTC

Automatic insertion of BiDi special characters isn't too bad if you ignore them while 
typing the text. On the contrary it makes your text compatible with other programs.

Comment 12 sforbes 2003-10-09 14:48:27 UTC

>On the contrary it makes your text compatible with other programs

is it really? has anyone tested this?

Also- is rlm/lrm the best approach, or would it be better just to
insert a proper maqaf?

Comment 13 prognathous 2003-10-09 17:08:52 UTC

> is it really? has anyone tested this?

I have. I now use RLM regularly as a workaround for forcing these
sequences (HebrewLetter+HyphenMinus+Number) to render properly on both
IE and Mozilla.

> Also- is rlm/lrm the best approach, or would it be better just to
insert a proper maqaf?

At the moment, inserting RLMs would be better, as the Maqaf glyph is
broken in most fonts.

Regardless of what we do with editing new texts, the main issue is
with rendering existing ones. There is no way to properly render such
sequences in existing texts, unless we decide to stray from the
Unicode BiDi algorithm and adopt the variant that is used by
Microsoft, Opera and other software vendors (such as Mellel for OS X -
redlers.com)

Prog.

For more information about why the Unicode BiDi algorithm is inadequate
for dealing with real-life existing texts, please read the following
thread:

"The fate of Hebrew texts with Hyphen-Minus instead of Maqaf"
http://lists.w3.org/Archives/Public/www-international/2003JulSep/0184.html

Comment 14 insount 2003-10-09 18:12:37 UTC

An RLM fixes the problem in RTL context but breaks things in LTR
context (in both the Unicode and Microsoft algorithms). You'd need to
keep track of the problem spot and add/remove the RLM whenever the
context changes. Nasty.

Meanwhile, I've checked the standard Microsoft's fonts (on Office XP,
Windows 2000) and it seems that they include neither U+2010 HYPHEN nor
U+2011 NON-BREAKING-HYPHEN.

So it seems that the only combination that's both usable and
Unicode-compliant is RLE+(HYPHEN-MINUS)+PDF.

Comment 15 prognathous 2003-10-09 19:43:41 UTC

> An RLM fixes the problem in RTL context but breaks things in LTR
> context (in both the Unicode and Microsoft algorithms). You'd need to
> keep track of the problem spot and add/remove the RLM whenever the
> context changes. Nasty.

No problems here. I tested IE6 and Mozilla 1.5 with an LTR and an RTL
HTML textarea and in all cases, HebrewLetter+HyphenMinus+RLM+Number
rendered properly. I don't know how well bugzilla supports this, but
"&#1492;-&#8207;20" should look the same even if you switch this page to RTL.

> Meanwhile, I've checked the standard Microsoft's fonts (on Office XP,
> Windows 2000) and it seems that they include neither U+2010 HYPHEN nor
> U+2011 NON-BREAKING-HYPHEN.

Since these chars are not included in ISO-8859-8-i (logical) and in
Windows-1255, I don't really think that they could provide a
reasonable solution, even if the fonts did come with them.

> So it seems that the only combination that's both usable and
> Unicode-compliant is RLE+(HYPHEN-MINUS)+PDF.

I disagree. To start with, neither ISO-8859-8-i nor Windows-1255
include PDF. Furthermore, these charsets do support RLM and this makes
HebrewLetter+HyphenMinus+RLM+Number a very useful solution.
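
The encoding claims in this exchange can be checked against the codecs bundled with Python. This is a sketch under assumptions: `can_encode` is a helper invented for the example, and Python's `iso8859_8` codec is used as a stand-in for ISO-8859-8-i on the assumption that the logical variant shares the same code table.

```python
def can_encode(ch, encoding):
    """Return True if the character is representable in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

CHARS = {
    "RLM": "\u200f",
    "LRM": "\u200e",
    "RLE": "\u202b",
    "PDF": "\u202c",
    "MAQAF": "\u05be",
    "HYPHEN": "\u2010",
    "NON-BREAKING HYPHEN": "\u2011",
}

for name, ch in CHARS.items():
    support = {enc: can_encode(ch, enc) for enc in ("iso8859_8", "cp1255")}
    print(name, support)
```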

Anyway, let me stress again that finding a solution for text
composition is the easy part, it's the rendering of *existing texts*
that we should actually discuss. This is where the real problem lies.

Prog.

Comment 16 insount 2003-10-09 21:12:49 UTC

Prog: sorry, you're right, RLM works perfectly. Given this, I fully
concur about PDF. 
(What I tested was HebrewLetter+HyphenMinus+LRM+Number, which indeed
works only in RTL context.)
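
The HebrewLetter+HyphenMinus+RLM+Number workaround could be applied automatically along these lines. It is a hedged sketch: `insert_rlm` and its pattern are hypothetical names for illustration, and the matching is deliberately narrow (Hebrew letters alef to tav, ASCII digits only).

```python
import re

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Hebrew letter followed by hyphen-minus, with an ASCII digit right after.
_LETTER_HYPHEN = re.compile(r"([\u05d0-\u05ea]-)(?=[0-9])")

def insert_rlm(text):
    """Insert an RLM between the hyphen-minus and the number that follows it."""
    return _LETTER_HYPHEN.sub(r"\1" + RLM, text)

# HE, hyphen-minus, "20": the same sequence as the example in Comment 15.
fixed = insert_rlm("\u05d4-20")
print([f"U+{ord(ch):04X}" for ch in fixed])
```

Once the RLM is in place the hyphen is no longer followed directly by a digit, so applying the function again leaves the text unchanged.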

This leaves open the issue of hiding the RLM during editing. It really
ought to be transparent. I don't want to be the one explaining to
users why in OpenOffice you need to "mess around with invisible
special characters" while in Word "it just works".
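
One way to make the marks transparent is to let the editor work with a view that omits them and map caret positions back to the stored text. The following is only a sketch of that idea; `visible_text`, `stored_index` and the set of hidden characters are assumptions for the example, not an existing OpenOffice mechanism.

```python
# Directional marks and embedding controls the user should never have to see.
INVISIBLE = {"\u200e", "\u200f", "\u202a", "\u202b", "\u202c"}

def visible_text(stored):
    """Return the text as the user sees it, without directional controls."""
    return "".join(ch for ch in stored if ch not in INVISIBLE)

def stored_index(stored, visible_pos):
    """Map a caret position in the visible text to an index in the stored text."""
    seen = 0
    for i, ch in enumerate(stored):
        if ch in INVISIBLE:
            continue  # hidden characters do not count as caret stops
        if seen == visible_pos:
            return i
        seen += 1
    return len(stored)

stored = "\u05d4-\u200f20"
print(len(visible_text(stored)), stored_index(stored, 2))
```

With this mapping the caret placed before the first digit lands after the hidden RLM, so typing there cannot separate the mark from the hyphen it belongs to.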


> Regardless of what we do with editing new texts, the main issue 
> is with rendering existing ones. There is no way to properly render 
> such sequences in existing texts, unless we decide to stray from the
> Unicode BiDi algorithm  [...]

What's wrong with the other alternative I sketched in my 2003-10-05
comment? Namely, when importing (say) a Word file, automatically
insert RLMs whenever needed to "emulate" Microsoft's algorithm.
Deciding what to do about text pasted from the clipboard is left as an
exercise.

  Eran

Comment 17 falko.tesch 2003-10-15 09:03:13 UTC

This issue will be covered by #21019

*** This issue has been marked as a duplicate of 21019 ***

Comment 18 falko.tesch 2003-10-15 09:03:31 UTC

closed

Comment 19 prognathous 2003-10-15 10:10:53 UTC

Insount@openoffice.org wrote:

> What's wrong with the other alternative I sketched in my 2003-10-05
> comment? Namely, when importing (say) a Word file, automatically
> insert RLMs whenever needed to "emulate" Microsoft's algorithm.
> Deciding what to do about text pasted from the clipboard is left as an
> exercise.

Why plant control characters that aren't supported by all character
encodings, when we can implement a better algorithm that handles such
sequences perfectly?

I also don't like the idea of needlessly changing the original
contents of a file, especially of plain text ones.

Falko Tesch,

How can this cross-platform request be a dupe of a Windows-only bug?

Prog.

Comment 20 insount 2003-10-15 12:46:35 UTC

This is NOT a duplicate of bug 21019. The latter gives a possible way
to handle the problem of text entry (not the only option, and possibly
not applicable to Unix systems). This bug also discusses handling
imported/legacy texts.

Prog:
> Why plant control characters that aren't supported by all character
> encodings, when we can implement a better algorithm that handles 
> such sequences perfectly?
Compliance with the Unicode bidi algorithm is certainly a
consideration; its importance is not for me to decide.
Also, do you have any reason to believe that the various algorithms
floating out there (Opera, Mellel, various versions of Microsoft) are
compatible with *each other*? Assuming not, which one do you pick?

Comment 21 prognathous 2003-10-15 13:13:02 UTC

> Also, do you have any reason to believe that the various algorithms
> floating out there (Opera, Mellel, various versions of Microsoft) are
> compatible with *each other*? Assuming not, which one do you pick?

These sequences are handled very well and pretty much the same in any
of those applications, but since Microsoft is holding more than 95% of
the desktop OS and browser marketshare, I believe that we should adopt
their algorithm, especially since they are willing to provide the
specifications for their handling of HyphenMinus in Hebrew contexts.

BTW, I believe that Microsoft's BiDi algorithm is just a developed
variant of the Unicode BiDi algorithm, though I may be wrong on this.

Prog.

Comment 22 insount 2003-10-18 15:54:30 UTC

About adopting the Microsoft algorithm: I stress again that Microsoft
has employed several different bidi algorithms. For example, all of
the following are different: 
* Word97+Windows95
* Notepad+Windows98
* MSIE5.5+Windows98
* Notepad+Windows2K
* WordXP+Windows2K

(Some differentiating cases: "A-5", "A-5a", "1A-2", "1-2" and "-1",
all in RTL context, where A is an RTL letter.)

Also, the last four variants (not sure about the first) have such
wonderful properties as turning " -1 " into " 1- " in RTL context,
contrary to both Unicode and reason. Who knows what other surprises
will arise.

Comment 23 sforbes 2003-12-22 11:22:27 UTC

comments from ft Wed Oct 15 00:03:13 -0800 2003:
This issue will be covered by #21019

From issue #21019
"Note: Since Unix IMEs do not report any language this feature can only be
implemented under Windows."

As issue #21019 covers Windows only, and this issue covers all, even when issue
#21019 is resolved, non-Windows users (Mac, Linux, Solaris) will still have this bug.

Also, what about legacy text/importing text? Issue #21019 does not cover those
issues while this does.

IMO, this issue should be marked as being blocked by issue #21019, but not
duplicate of it.

Comment 24 prognathous 2004-05-03 16:54:33 UTC

Unicode 4.0.1 has recently been released with changes to the properties of
several characters. Once OO (and some other projects) are updated to comply
with these changes, the HebrewLetter+Hyphen+Number issue will finally be solved.
See http://bugzilla.mozilla.org/show_bug.cgi?id=240943 for Mozilla's take on the
subject.

Note that this bug has wrongly been marked as a duplicate of bug 21019 (it has
nothing to do with this issue), so just to make sure that this important update isn't
missed, I'm posting it in both bugs. Sorry for the spam.

Please consider reopening this bug, or post a new one specifically for
compliance with the aforementioned changes in Unicode.

Prog.

Comment 25 insount 2004-05-04 03:07:52 UTC

> Unicode 4.0.1 has recently been released with changes to the properties 
> of several characters. Once OO (and some other projects) are updated 
> to comply with these changes, the HebrewLetter+Hyphen+Number issue will 
> finally be solved.

Note that the new Unicode standard will mis-render negative numbers in RTL
context: "-1" renders as "1-" (where the "-" is HYPHEN MINUS, not U+2212 MINUS
SIGN). The good news is that this is the same as Microsoft's latest variants of
the bidi algorithm, so the import/export situation looks promising. The bad news
is that the issue of manual overrides (by LRM/RLM or whatever other means) is still
crucial.

BTW, the more I think of this, the more I'm convinced that explicit nesting via
RLE/LRE+PDF is more natural for the average user than using RLM/LRM (assuming
optimal GUI in both cases). RLE/LRE+PDF sets the direction of a run of text,
which is a fairly natural and intuitive concept, whereas RLM/LRM require some
understanding of the Unicode algorithm. But granted, UI and file export issues
are tougher with RLE/LRE+PDF.

Comment 26 prognathous 2004-05-04 08:04:17 UTC

I see no reason why users should be aware of arcane control characters at all.
They just need to be educated that inputting negative numbers requires switching
input method to English (or another LTR language). 

That's how it works in MSWORD and most users don't have much difficulty
adjusting to the concept. Whether the underlying implementation employs
RLE/LRE+PDF or LRM/RLM is of no concern to them. It just works.

Prog.

Comment 27 insount 2004-05-04 08:25:18 UTC

> They just need to be educated that inputting negative numbers requires 
> switching input method to English (or another LTR language). 

Which makes sense only for platforms that have a notion of IME.

> That's how it works in MSWORD and most users don't have much 
> difficulties adjusting to the concept.

Actually, I suspect that many users just enter "1-". But input aside, invisible
character attributes (or anything that emulates them GUI-wise) can be very
difficult to *edit*. For example, fixing RTL text with embedded "LTR spaces" in
MSWord is exasperating beyond belief. Perhaps the best way to address this is to
visually distinguish embedded runs, and this maps nicely to RLE/LRE+PDF.

Comment 28 prognathous 2004-08-05 10:00:08 UTC

The HyphenMinus+Number problem is not fixed, although Bug 21019 is marked as
fixed. Guess what? This bug really isn't a dupe of 21019 after all. Please re-open.

Tested with Writer 1.9.m49

Prog.

Comment 29 alan 2005-11-14 09:13:13 UTC

On Issue, I posted changes which would update the ICU data files to conform to
Unicode 4.0.1 regarding HYPHEN/MINUS. With these changes, the bug as reported
goes away. This does cause a problem of the number -1 in RTL mode, as insount
pointed out, which for now still requires the use of directional chars.
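
The effect of the property change can be illustrated with a toy model of the bidi rules for an RTL paragraph. This is a heavily simplified sketch, not ICU's implementation: it knows only Hebrew letters, ASCII digits and HYPHEN-MINUS, it applies just the W4, W5 and W6 weak rules plus neutral resolution and reordering, and it assumes the hyphen was classed ET (European Terminator) before Unicode 4.0.1 and ES (European Separator) afterwards, which matches the behaviour described in Comment 3. All function names are made up for the example.

```python
def classify(ch, hyphen_class):
    """Map a character to a bidi class; only a tiny repertoire is supported."""
    if "\u05d0" <= ch <= "\u05ea":
        return "R"
    if ch in "0123456789":
        return "EN"
    if ch == "-":
        return hyphen_class
    raise ValueError(f"unsupported character: {ch!r}")

def resolve(types):
    """Resolve ES/ET in an RTL paragraph that contains only R, EN, ES, ET."""
    types = list(types)
    n = len(types)
    # W4: a single ES between two European numbers becomes EN.
    for i in range(1, n - 1):
        if types[i] == "ES" and types[i - 1] == "EN" and types[i + 1] == "EN":
            types[i] = "EN"
    # W5: a run of ET adjacent to a European number becomes EN.
    i = 0
    while i < n:
        if types[i] != "ET":
            i += 1
            continue
        j = i
        while j < n and types[j] == "ET":
            j += 1
        before = types[i - 1] if i > 0 else None
        after = types[j] if j < n else None
        if "EN" in (before, after):
            types[i:j] = ["EN"] * (j - i)
        i = j
    # W6 turns leftover separators into neutrals; with only R and EN around
    # them in an RTL paragraph, those neutrals always resolve to R.
    return ["R" if t in ("ES", "ET") else t for t in types]

def visual_order(text, hyphen_class):
    """Return the characters in left-to-right screen order (RTL paragraph)."""
    types = resolve([classify(ch, hyphen_class) for ch in text])
    levels = [1 if t == "R" else 2 for t in types]  # EN sits one level higher
    chars = list(text)
    # Reverse runs from the highest level down to level 1.
    for level in range(max(levels, default=1), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            levels[i:j] = reversed(levels[i:j])
            i = j
    return "".join(chars)

for hyphen_class in ("ET", "ES"):
    for text in ("\u05d0-19", "-1"):
        print(hyphen_class, list(text), "->", list(visual_order(text, hyphen_class)))
```

In this model the ET class glues the hyphen to the number, so it ends up on the far left of the letter, while the ES class keeps it between the letter and the number. The same change moves the hyphen of a standalone "-1" to the right of the digit, which is the side effect mentioned above.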

Comment 30 alan 2005-11-14 09:17:53 UTC

The issue referred to above was Issue 57833.