Issue 15456 - Encoding largely ignored in HTML export
Summary: Encoding largely ignored in HTML export
Status: CLOSED FIXED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOo 1.1 Beta2
Hardware: All All
Importance: P3 Trivial with 7 votes
Target Milestone: ---
Assignee: andreas.martens
QA Contact: issues@sw
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-06-10 12:41 UTC by kyrimis
Modified: 2013-08-07 14:43 UTC
CC List: 9 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Document that exhibits the bug (5.56 KB, application/octet-stream)
2003-10-01 09:50 UTC, kyrimis
What I expected to get (711 bytes, application/octet-stream)
2003-10-01 09:52 UTC, kyrimis
screenshot of the failed saving to html (22.43 KB, image/gif)
2003-10-03 04:37 UTC, utomo99
Patch for latin2 chars (1.69 KB, patch)
2006-01-22 14:42 UTC, openskm
revised patch for slight performance enhancement (1.72 KB, patch)
2007-11-23 05:03 UTC, kyoshida
a typo in line comment.... (Easter -> Eastern) (1.73 KB, patch)
2007-11-23 05:34 UTC, kyoshida
new version to also ignore those chars when destination encoding is UTF-8. (1.77 KB, patch)
2007-12-02 18:23 UTC, kyoshida
revised patch to treat UTF-8 differently per mib's suggestion (2.10 KB, text/plain)
2007-12-03 17:00 UTC, kyoshida
revised patch with additional note. (2.37 KB, patch)
2007-12-04 16:08 UTC, kyoshida

Description kyrimis 2003-06-10 12:41:12 UTC
When I save a text document containing Greek characters as an HTML document, the
encoding specified in Options->Load/Save->HTML Compatibility is not used.
Instead, Greek characters appear as &alpha; &beta; etc.

If I mix some accented Latin characters along with the Greek, I get the same
result (e.g., &eacute;), without getting the warning that the documentation
mentions in "About Import and Export Filters / Importing and Exporting in HTML
Format".

If the text contains accented Greek characters, for which there is no "&"
equivalent, the correct character is output, which seems to suggest that
OpenOffice.org favors the "&" form of characters, using it anywhere it can, even
when it can use the actual characters. This may be useful if one is writing in
English, using only the occasional non-English character, but it is very bad if
one is writing in another language, as it increases the size of the generated
HTML file unnecessarily, and renders the generated text unreadable.
Comment 1 mci 2003-09-11 12:11:15 UTC
reassigned to es
Comment 2 utomo99 2003-10-01 07:31:05 UTC
Please attach a document that reproduces the problem, so that we can test
and confirm it faster.
(Without a sample document, confirming the problem is harder and takes more
time.)
Please trim the document to the relevant part, so that the file stays
small but still shows the problem.
Comment 3 kyrimis 2003-10-01 09:50:42 UTC
Created attachment 9861
Document that exhibits the bug
Comment 4 kyrimis 2003-10-01 09:52:04 UTC
Created attachment 9862
What I expected to get
Comment 5 kyrimis 2003-10-01 10:03:31 UTC
I have attached the following two files:

badgreek.sxw shows the problem. Load the file into oowriter, make sure
you have selected the Greek (iso-8859-7) character set in
Options->Load/Save->HTML Compatibility, and save as an HTML file.

correctgreek.html contains what I expected to get when saving the
above file as HTML. (It is NOT the actual output of oowriter!) Note
that, in the actual output, only the accented characters are converted
correctly.
Comment 6 utomo99 2003-10-03 04:37:16 UTC
Created attachment 9918
screenshot of the failed saving to html
Comment 7 utomo99 2003-10-03 04:38:04 UTC
I cannot save the sxw file as HTML.
Screenshot attached.
Comment 8 utomo99 2003-10-03 04:40:53 UTC
Sorry, I forgot to include my details:
OpenOffice.org 1.1 (default install, US), Windows XP Pro SP1.
(And MS Office XP SP2).
Comment 9 kyrimis 2003-10-03 08:42:15 UTC
I tried to emphasize, both in my original bug report, and when I made
the attachment, that you should select the Greek character set
(iso-8859-7) in Tools->Options->Load/Save->HTML Compatibility before
saving.

Setting the character set to iso-8859-1, I get the same error message
as you got, while changing the computer locale to US, instead of the
Greek locale that I use normally, has no effect. Thus, I am fairly
certain that if you set the character set to iso-8859-7 in
Tools->Options->Load/Save->HTML Compatibility, you should be able to
reproduce the problem.
Comment 10 stefan.baltzer 2003-11-03 09:53:58 UTC
.
Comment 11 thorsten.ziehm 2004-08-19 15:47:45 UTC
Because of limited resources for OOo 2.0, it was decided to shift this task to
the next milestone. If somebody is found who can implement this in time for
OOo 2.0, the task will be re-targeted.
Comment 12 eric.savary 2004-12-02 01:03:31 UTC
ES->AMA: As described (reproduced in src68064). Please forward to whoever can fix it.
Comment 13 Regina Henschel 2004-12-02 09:11:06 UTC
It seems to be the same problem as I reported in issue 19514.
Comment 14 erpel 2005-05-05 01:15:48 UTC
For Greek characters this is fixed in issue 28241 and also works with OOo 1.9.100.

Other international characters (like German umlauts [äöüÄÖÜ...], French accents
[èéê...]) are still exported as named entities by OOo.
Comment 15 openskm 2006-01-22 14:42:36 UTC
Created attachment 33449
Patch for latin2 chars
Comment 16 openskm 2006-01-22 14:54:09 UTC
I submitted a patch which has been used for about two years in Hungarian
community builds (and also in OOo 1.1 builds made by Pavel Janík). Basically it
suppresses entities for latin2 characters when the target encoding is ISO-8859-2
or Windows-1250.

IMHO the whole HTML export functionality should be either removed or rewritten.
On the one hand we have XHTML-compliant export filters written in Java; on the
other hand we have this old code in svtools/svhtml, which supports historic
browsers instead of contemporary standards. I doubt that Netscape Navigator,
HTML 3.2 and (old) Internet Explorer compatibility modes are needed by
anyone in 2006. :)
Comment 17 kyoshida 2007-11-23 05:03:22 UTC
Created attachment 49838
revised patch for slight performance enhancement
Comment 18 kyoshida 2007-11-23 05:08:41 UTC
This patch carries over the intent of openskm's patch, but is cleaner and more
efficient.  Instead of converting the character and then doing a massive set of
expensive string comparisons, we catch and process it before the first
conversion takes place.

@ama: please review.
Comment 19 kyoshida 2007-11-23 05:09:11 UTC
changing the issue type to PATCH.
Comment 20 kyoshida 2007-11-23 05:34:15 UTC
Created attachment 49839
a typo in line comment.... (Easter -> Eastern)
Comment 21 andreas.martens 2007-11-23 13:10:26 UTC
ama->mib:
Please have a look at the patch. If it's ok from your side, I'm volunteering to
integrate it in one of our CWSs for OOo2.4.
Comment 22 kyoshida 2007-12-02 18:23:21 UTC
Created attachment 50052
new version to also ignore those chars when destination encoding is UTF-8.
Comment 23 michael.brauer 2007-12-03 13:59:23 UTC
As for UTF-8: I think it is not required to export any characters as entities
except those below 128. I therefore suggest changing line 405 from

if( c < 256 || RTL_TEXTENCODING_UTF8 != rContext.m_eDestEnc )

to
if( c < 128 || RTL_TEXTENCODING_UTF8 != rContext.m_eDestEnc )
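
The effect of the suggested change can be sketched as follows, assuming (as the
comment reads) that this condition decides whether a character is a candidate
for entity lookup. DestEncoding and both function names are hypothetical
stand-ins for illustration, not identifiers from svtools.

```cpp
// Hypothetical stand-in for the rtl_TextEncoding values used by the patch.
enum class DestEncoding { Utf8, Iso8859_1, Iso8859_2, Windows1250 };

// Condition from the earlier patch revision: with UTF-8 as the destination,
// code points 128..255 still qualify for entity lookup.
inline bool qualifiesForEntityBefore(char32_t c, DestEncoding dest) {
    return c < 256 || dest != DestEncoding::Utf8;
}

// Suggested condition: with UTF-8 as the destination, only ASCII qualifies,
// so markup characters such as '<' and '&' can still be escaped.
inline bool qualifiesForEntityAfter(char32_t c, DestEncoding dest) {
    return c < 128 || dest != DestEncoding::Utf8;
}
```

Under the suggested condition, U+00E9 no longer qualifies when the destination
is UTF-8, while it still qualifies for every other destination encoding.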

As for ISO_8859_2/Windows 1250: While I do understand that it may be easier to
read a document in a text editor if there are no entities, we have to understand
that after applying the patch, documents will only be loaded correctly into text
editors whose encoding is set to ISO 8859-2/Windows 1250. If a document is
loaded into an editor that has a different encoding set, characters above 128
may be displayed incorrectly.

Using the entities here, as is the case today, ensures that documents are
editable regardless of the encoding that is set in the editor. For this
reason, I have a slight preference for removing the entities for UTF-8 only. If
we want to get rid of the entities for other encodings anyway, we may
consider calling lcl_svhtml_GetEntityForChar for characters above 128 if and
only if the conversion into the destination format failed, that is, in line 458.
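
A minimal sketch of that "entity only as a fallback" ordering, not the actual
svtools code. It assumes ISO-8859-1 as the only destination encoding, and the
entity table, convertToLatin1 and exportChar are hypothetical names introduced
for illustration; the real lookup is lcl_svhtml_GetEntityForChar, whose
signature is not shown in this issue.

```cpp
#include <map>
#include <string>

// Hypothetical miniature entity table, for illustration only.
static const std::map<char32_t, std::string> kEntityNames = {
    {0x00E9, "eacute"}, {0x03B1, "alpha"}, {0x03B2, "beta"}};

// Illustrative converter for one destination encoding only: ISO-8859-1
// maps U+0000..U+00FF to the byte with the same value.
static bool convertToLatin1(char32_t c, std::string& out) {
    if (c > 0xFF) return false;
    out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    return true;
}

std::string exportChar(char32_t c) {
    // Markup-significant characters are always escaped.
    switch (c) {
        case U'<': return "&lt;";
        case U'>': return "&gt;";
        case U'&': return "&amp;";
        case U'"': return "&quot;";
        default: break;
    }
    std::string out;
    if (convertToLatin1(c, out)) return out;  // raw byte in the target charset
    // Conversion failed: only now fall back to an entity.
    std::map<char32_t, std::string>::const_iterator it = kEntityNames.find(c);
    if (it != kEntityNames.end()) return "&" + it->second + ";";
    // No named entity in the table: use a numeric character reference.
    return "&#" + std::to_string(static_cast<unsigned long>(c)) + ";";
}
```

With this ordering, U+00E9 is written as the single byte 0xE9, while U+03B1,
which ISO-8859-1 cannot represent, is written as &alpha;.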
Comment 24 kyoshida 2007-12-03 15:43:27 UTC
@mib:

I agree with your suggestion on UTF-8.  Thanks for your response.  I will revise
the patch shortly.

I have a few comments regarding the handling of iso-8859-2.

>we have to understand
>that after applying the patch, documents will only loaded correctly into text
>editors whose encoding is set to ISO 8859-2/Windows 1250.

But that's the (unfortunate) design of code pages; there is no way to ensure
that a text document encoded in one encoding is displayed correctly in another
encoding, which is also why we prefer Unicode today.  But that doesn't mean that
we should refuse to export such a document with that encoding if the user
explicitly chooses to do so by setting the encoding in the Options dialog.

>If a document is
>loaded into an editor that has a different encoding set, characters above 128
>may be displayed incorrectly.

This is expected, and that's why the document includes the encoding information
so that the editor knows which codepage to use when loading.

>Using the entities here as it is the case today ensures that documents are
>editable regardless of of the encoding that is set in the editor.

Don't almost all editors these days offer encoding options when opening a text
document?  I'm a bit puzzled by this response; it appears that the editor you
use doesn't support anything outside the ASCII range...

Or, is there any valid use case where this behavior (of encoding everything
above 128 regardless of user-selected encoding) is needed?

Thank you for your response.

Kohei
Comment 25 kyoshida 2007-12-03 15:49:27 UTC
>(of encoding everything above 128 regardless of user-selected encoding)

This should have been

>(of encoding everything above 128 _as entities_ regardless of user-selected
>encoding)

Sorry about that. :P

Kohei
Comment 26 kyoshida 2007-12-03 17:00:04 UTC
Created attachment 50072
revised patch to treat UTF-8 differently per mib's suggestion
Comment 27 michael.brauer 2007-12-04 09:38:28 UTC
Kohei: I should have been clearer. You are right that most text editors do allow
you to set the encoding. What I meant was the following: If I have a file that
is saved in iso-8859-2 and load that into an editor whose encoding is set to
iso-8859-1, then I will see some strange characters if we do not use character
entities, and I will see the character entities in the other case. In the first
case, I need to switch the encoding. In the second case, I can just go ahead. I'm
not using text editors for editing HTML documents too often, and if so, I'm
working mostly with UTF-8. So I don't really know whether today's text editors
auto-detect encodings of HTML documents other than UTF-8, but given that the
encoding information is either in a meta element, or maybe supplied by the
HTTP server only, I don't expect them to do so. In any case, using character
entities instead of the characters themselves makes a document more independent
of the encoding that is used.
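
The "strange characters" case can be made concrete with one byte. The sketch
below is illustrative only: decodeLatin1 and decodeLatin2Subset are hypothetical
helpers, and the ISO-8859-2 table is deliberately incomplete (ASCII, one shared
letter, and the four Hungarian letters whose byte values mean something else in
ISO-8859-1).

```cpp
// ISO-8859-1 maps every byte to the code point with the same value.
char32_t decodeLatin1(unsigned char b) { return b; }

// Deliberately incomplete ISO-8859-2 decoder for illustration.
char32_t decodeLatin2Subset(unsigned char b) {
    if (b < 0x80) return b;            // ASCII is identical in both charsets
    switch (b) {
        case 0xE9: return 0x00E9;      // é, same in both charsets
        case 0xD5: return 0x0150;      // Ő (Õ in ISO-8859-1)
        case 0xDB: return 0x0170;      // Ű (Û in ISO-8859-1)
        case 0xF5: return 0x0151;      // ő (õ in ISO-8859-1)
        case 0xFB: return 0x0171;      // ű (û in ISO-8859-1)
        default:   return 0xFFFD;      // not covered by this sketch
    }
}
```

A file saved in iso-8859-2 stores ő as the byte 0xF5; an editor set to
iso-8859-1 shows that byte as õ, whereas an entity would have been displayed
unchanged under either setting.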

However, I have checked the HTML specification, and did not find any advice to
prefer character entities over just using the character codes. All it says
is that character entities can be used for characters that do not exist in a
certain encoding. So, changing OOo seems not to violate any recommendation, and
I'm fine with changing that if you and ama believe that this is reasonable.

However, I'm wondering whether we should change that behavior in general, as I
have outlined above. Anyway, I know that this is beyond the scope of the patch,
so I'm fine with changing the behavior for ISO-8859-2 only, too. If we do so, we
may want to add a note to the source code that says that the new behavior
applies to ISO-8859-2 only because we received a patch for that encoding only,
and not because the new behavior is for whatever reason only reasonable for
ISO-8859-2. This may save people who look at the code later from wondering why
there is this behavior for ISO-8859-2 only.

Comment 28 kyoshida 2007-12-04 16:05:56 UTC
@mib: thank you.  You put it so beautifully, and I agree with your analysis.

I will attach a new patch with a note outlining the need to extend support for
other code pages.

Kohei
Comment 29 kyoshida 2007-12-04 16:08:11 UTC
Created attachment 50090
revised patch with additional note.
Comment 30 kyoshida 2008-01-15 14:35:28 UTC
@ama: we have a "go ahead" from mib, so the ball is now in your court.
Comment 31 andreas.martens 2008-01-15 15:24:44 UTC
Ok, I will integrate this patch into CWS sw8u10bf05.
Comment 32 Mathias_Bauer 2008-01-17 16:31:46 UTC
target 3.0
Comment 33 andreas.martens 2008-01-23 15:30:28 UTC
Patch integrated into CWS sw8u10bf05.
Comment 34 andreas.martens 2008-01-23 15:31:19 UTC
Done.
Comment 35 andreas.martens 2008-02-26 09:47:16 UTC
Checked in CWS sw8u10bf05.
Comment 36 thorsten.ziehm 2009-07-20 14:40:40 UTC
This issue was closed automatically and was not rechecked in a current version
of OOo. The fix should have been integrated in OOo for more than half a year.
If you think this issue isn't fixed in a current version (OOo 3.1), please
reopen it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo =>
http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues =>
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues