Apache OpenOffice (AOO) Bugzilla – Issue 15456
Encoding largely ignored in HTML export
Last modified: 2013-08-07 14:43:03 UTC
When I save a text document containing Greek characters as an HTML document, the encoding specified in Options->Load/Save->HTML Compatibility is not used. Instead, Greek characters appear as &alpha; &beta; etc. If I mix some accented Latin characters along with the Greek, I get the same result (e.g., &eacute;), without getting the warning that the documentation mentions in "About Import and Export Filters / Importing and Exporting in HTML Format". If the text contains accented Greek characters, for which there is no "&" equivalent, the correct character is output, which seems to suggest that OpenOffice.org favors the "&" form of characters, using it anywhere it can, even when it can use the actual characters. This may be useful if one is writing in English, using only the occasional non-English character, but it is very bad if one is writing in another language, as it increases the size of the generated HTML file unnecessarily, and renders the generated text unreadable.
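To put a rough number on the size effect described in this report, here is a small Python sketch. It is not OOo's exporter code: `export_preferring_entities` and `export_raw` are names made up for this illustration, and the entity table is Python's `html.entities.codepoint2name` (HTML 4 names), which is only assumed to resemble the table OOo uses. Escaping of ASCII markup characters is left out to keep the focus on non-ASCII text.

```python
from html.entities import codepoint2name


def export_preferring_entities(text: str, encoding: str) -> bytes:
    """Imitate the reported behaviour: named entity first, raw character second."""
    out = bytearray()
    for ch in text:
        # Only non-ASCII characters are considered for entity replacement here.
        name = codepoint2name.get(ord(ch)) if ord(ch) >= 128 else None
        if name is not None:
            out += f"&{name};".encode("ascii")
        else:
            out += ch.encode(encoding)
    return bytes(out)


def export_raw(text: str, encoding: str) -> bytes:
    """What the reporter expected: characters written in the selected encoding."""
    return text.encode(encoding)


if __name__ == "__main__":
    # Three unaccented Greek letters, a space, and one accented letter
    # (alpha, beta, gamma, space, alpha with tonos).
    sample = "\u03b1\u03b2\u03b3 \u03ac"
    reported = export_preferring_entities(sample, "iso-8859-7")
    expected = export_raw(sample, "iso-8859-7")
    print(len(reported), reported)
    print(len(expected), expected)
```

For this five-character sample the entity-preferring form takes 22 bytes against 5 bytes for the raw ISO-8859-7 form, and only the accented letter (which has no HTML 4 named entity) survives as a real character, matching what the report describes.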
reassigned to es
Please attach the documents that exhibit this problem so that we can test and confirm it faster. (Without the documents, we cannot confirm the problem easily and will need more time.) Please cut the other parts of the documents so that the file size is small but we can still see the problem.
Created attachment 9861 [details] Document that exhibits the bug
Created attachment 9862 [details] What I expected to get
I have attached the following two files: badgreek.sxw shows the problem. Load the file into oowriter, make sure you have selected the Greek (iso-8859-7) character set in Options->Load/Save->HTML Compatibility, and save as an HTML file. correctgreek.html contains what I expected to get when saving the above file as HTML. (It is NOT the actual output of oowriter!) Note that, in the actual output, only the accented characters are converted correctly.
Created attachment 9918 [details] screenshot of the failed saving to html
I cannot save the sxw file as HTML; screenshot attached.
Sorry, I forgot to include my details: OpenOffice 1.1 (default install, US), Win XP Pro SP1 (and MS Office XP SP2).
I tried to emphasize, both in my original bug report, and when I made the attachment, that you should select the Greek character set (iso-8859-7) in Tools->Options->Load/Save->HTML Compatibility before saving. Setting the character set to iso-8859-1, I get the same error message as you got, while changing the computer locale to US, instead of the Greek locale that I use normally, has no effect. Thus, I am fairly certain that if you set the character set to iso-8859-7 in Tools->Options->Load/Save->HTML Compatibility, you should be able to reproduce the problem.
Because of limited resources for OOo2.0, it was decided to shift this task to the next milestone. If somebody is found who can implement this in time for OOo2.0, this task will be re-targeted.
ES->AMA: as described (reproduced in src68064). Please forward to whoever can fix it.
It seems to be the same problem as I reported in issue 19514.
For Greek characters this is fixed in issue 28241 and also works with OOo 1.9.100. Other international characters (like German umlauts [äöüÄÖÜ...], French accents [èéê...]) are still exported as named entities by OOo.
Created attachment 33449 [details] Patch for latin2 chars
I submitted a patch which has been used for about two years in Hungarian community builds (and also in OOo 1.1 builds made by Pavel Janík). Basically it suppresses entities for latin2 characters when the target encoding is ISO-8859-2 or Windows-1250. IMHO the whole HTML export functionality should be either removed or rewritten. We have XHTML compliant export filters written in Java on one hand, on the other hand we have this old code in svtools/svhtml which has support for historic browsers instead of supporting contemporary standards. I doubt that Netscape Navigator, HTML 3.2 and (old) Internet Explorer compatibility modes are needed by anyone in 2006. :)
Created attachment 49838 [details] revised patch for slight performance enhancement
This patch carries over the intent of openskm's patch, but is cleaner and more efficient. Instead of getting the character converted and then doing a massive set of expensive string comparisons, we catch and process it before the first conversion takes place. @ama: please review.
changing the issue type to PATCH.
Created attachment 49839 [details] a typo in line comment.... (Easter -> Eastern)
ama->mib: Please have a look at the patch. If it's ok from your side, I'm volunteering to integrate it in one of our CWSs for OOo2.4.
Created attachment 50052 [details] new version to also ignore those chars when destination encoding is UTF-8.
As for UTF-8: I think it is not required to export any characters as entities except those below 128. I therefore suggest changing line 405 from if( c < 256 || RTL_TEXTENCODING_UTF8 != rContext.m_eDestEnc ) to if( c < 128 || RTL_TEXTENCODING_UTF8 != rContext.m_eDestEnc ) As for ISO_8859_2/Windows 1250: While I do understand that it may be easier to read a document in a text editor if there are no entities, we have to understand that after applying the patch, documents will only be loaded correctly into text editors whose encoding is set to ISO 8859-2/Windows 1250. If a document is loaded into an editor that has a different encoding set, characters above 128 may be displayed incorrectly. Using the entities here, as is the case today, ensures that documents are editable regardless of the encoding that is set in the editor. For this reason, I have a slight preference for removing the entities for UTF-8 only. If we want to get rid of the entities for other encodings too, we may consider calling lcl_svhtml_GetEntityForChar for characters above 128 if and only if the conversion into the destination format failed, that is, in line 458.
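The "entity if and only if the conversion failed" idea from the comment above can be sketched in Python as follows. This is an illustration of the proposed policy, not the svtools code: `export_char`, `export_text` and `MARKUP_ESCAPES` are hypothetical names, and the fallback to a numeric reference when no named entity exists is an assumption of this sketch.

```python
from html.entities import codepoint2name

# Characters that must always be escaped because they are HTML markup.
MARKUP_ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}


def export_char(ch: str, dest_encoding: str) -> bytes:
    """Return the bytes to write for one character in the destination encoding."""
    if ch in MARKUP_ESCAPES:
        return MARKUP_ESCAPES[ch].encode("ascii")
    try:
        # First choice: the character itself, in the encoding the user selected.
        return ch.encode(dest_encoding)
    except UnicodeEncodeError:
        # Fallback only when the conversion failed: a named entity if one
        # exists, otherwise a numeric character reference.
        name = codepoint2name.get(ord(ch))
        reference = f"&{name};" if name else f"&#{ord(ch)};"
        return reference.encode("ascii")


def export_text(text: str, dest_encoding: str) -> bytes:
    """Export a whole string one character at a time."""
    return b"".join(export_char(ch, dest_encoding) for ch in text)
```

With UTF-8 as the destination, every ordinary character converts, so only the markup characters below 128 end up as entities. With ISO-8859-7, Greek letters are written directly while an accented Latin letter such as é falls back to `&eacute;`.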
@mib: I agree with your suggestion on UTF-8. Thanks for your response. I will revise the patch shortly. I have a few comments regarding the handling of iso-8859-2. >we have to understand that after applying the patch, documents will only loaded correctly into text editors whose encoding is set to ISO 8859-2/Windows 1250. But that's the (unfortunate) design of code pages; there is no way to ensure that a text document encoded in one encoding is displayed correctly in another encoding, which is also why we prefer Unicode today. But that doesn't mean that we should refuse to export such a document with that encoding if the user explicitly chooses to do so by setting the encoding in the Options dialog. >If a document is loaded into an editor that has a different encoding set, characters above 128 may be displayed incorrectly. This is expected, and that's why the document includes the encoding information so that the editor knows which codepage to use when loading. >Using the entities here as it is the case today ensures that documents are editable regardless of of the encoding that is set in the editor. Don't almost all editors these days offer encoding options when opening a text document? I'm a bit puzzled by this response; it appears that the editor you use doesn't support anything other than the ASCII range... Or, is there any valid use case where this behavior (of encoding everything above 128 regardless of user-selected encoding) is needed? Thank you for your response. Kohei
>(of encoding everything above 128 regardless of user-selected encoding) This should have been >(of encoding everything above 128 _as entities_ regardless of user-selected encoding) Sorry about that. :P Kohei
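Kohei's point that the exported document carries its own encoding information can be illustrated with a short Python sketch of how a reader might honour that declaration. `decode_html` is a hypothetical helper, the regular expression is a deliberate simplification of real charset sniffing, and the ISO-8859-1 fallback is an assumption of this example.

```python
import re

# Matches charset=... inside a <meta> declaration near the start of a file.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_html(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode HTML bytes using the charset the document declares, if any."""
    match = _CHARSET_RE.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    # An unknown charset name raises LookupError; a real reader would handle that.
    return raw.decode(encoding)
```

Given a page saved in ISO-8859-7 with `<meta http-equiv="content-type" content="text/html; charset=iso-8859-7">`, this returns the Greek text intact, even though the bytes above 128 would read as different characters under another code page.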
Created attachment 50072 [details] revised patch to treat UTF-8 differently per mib's suggestion
Kohei: I should have been clearer. You are right that most text editors do allow you to set the encoding. What I meant was the following: If I have a file that is saved in iso-8859-2 and load that into an editor whose encoding is set to iso-8859-1, then I will see some strange characters if we do not use character entities, and I will see the character entities in the other case. In the first case, I need to switch the encoding. In the 2nd case, I can just go ahead. I'm not using text editors for editing HTML documents too often, and if so, I'm working mostly with UTF-8. So I don't really know whether today's text editors auto-detect encodings of HTML documents other than UTF-8, but given that the encoding information is either in a meta element, or maybe supplied by the HTTP server only, I don't expect them to do so. In any case, using character entities instead of the characters themselves makes the document more independent of the encoding that is used. However, I have checked the HTML specification, and did not find any advice to use character entities in favor of just using the character codes. All it says is that character entities can be used for characters that do not exist in a certain encoding. So, changing OOo seems not to violate any recommendation, and I'm fine with changing that if you and ama believe that this is reasonable. However, I'm wondering whether we should change that behavior in general, as I have outlined above. Anyway, I know that this is beyond the scope of the patch, so I'm fine with changing the behavior for ISO-8859-2 only, too. If we do so, we may want to add a note to the source code that says that the new behavior applies to ISO-8859-2 only because we received only a patch for that encoding, and not because the new behavior is for whatever reason only reasonable for ISO-8859-2. This may save people who look at the code later from wondering why there is this behavior for ISO-8859-2 only.
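The two cases described in the comment above can be made concrete with a small Python sketch. The function names are invented for this illustration, and numeric character references (via Python's `xmlcharrefreplace` error handler) stand in for the named entities discussed in the thread; the encoding-independence argument is the same for both.

```python
def save_raw(text: str, encoding: str) -> bytes:
    """Save characters directly in the chosen encoding (no entities)."""
    return text.encode(encoding)


def save_with_references(text: str) -> bytes:
    """Save every non-ASCII character as a numeric reference (ASCII-only bytes)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def view_in_editor(stored: bytes, editor_encoding: str) -> str:
    """Decode stored bytes the way an editor set to editor_encoding would."""
    return stored.decode(editor_encoding)
```

Saving the Hungarian letter ő (U+0151) raw in ISO-8859-2 and viewing it in an editor set to ISO-8859-1 shows õ instead, which is the first case. Saving it as `&#337;` yields the same visible text under either editor setting, which is the second case.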
@mib: thank you. You put it so beautifully, and I agree with your analysis. I will attach a new patch with a note outlining the need to extend support for other code pages. Kohei
Created attachment 50090 [details] revised patch with additional note.
@ama: we have a "go ahead" from mib, so the ball is now in your court.
Ok, I will integrate this patch into CWS sw8u10bf05.
target 3.0
Patch integrated into CWS sw8u10bf05.
Done.
Checked in CWS sw8u10bf05.
This issue was closed automatically and was not rechecked in a current version of OOo. The fix should have been integrated in OOo for more than half a year. If you think this issue isn't fixed in a current version (OOo 3.1), please reopen it and change the field 'Target Milestone' accordingly. If you want to download a current version of OOo => http://download.openoffice.org/index.html If you want to know more about the handling of fixed/verified issues => http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues