Apache OpenOffice (AOO) Bugzilla – Issue 12445
Word 6 and word 95 file formats and encoding of characters
Last modified: 2013-08-07 14:41:36 UTC
Hi, when I save a Czech document in 644m1 (binary snapshot without any changes from OOo.org) in DOC format as Word 95 or Word 6, some characters with diacritical marks are corrupted. I will attach these files:
openoffice.sxw - the original file, which I saved in the different formats. It contains only 9 characters with diacritical marks (e, s, c, r, z with caron and y, a, i, e with acute accent).
word*.doc - the above document saved in the various Word formats.
To confirm this, just open all the documents and check whether every word*.doc has the same contents as openoffice.sxw. Only wordxp.doc will match... This is a real blocker for Czech users working with old versions of Microsoft Word :-(
Created attachment 5134 [details] original source file
Created attachment 5135 [details] Saved in Microsoft Word 6.0
Created attachment 5136 [details] Saved in Microsoft Word XP
Created attachment 5137 [details] Saved in Microsoft Word 95 format
The same happens when I use original 1.0.2 Linux binary installation set.
MRUs Area.
Looks like we are trying to save Unicode characters in a format that does not really support Unicode (like Winword 6/95). Have to investigate this further.
I'm ready to investigate that - just give me more information and I can provide details for you. I regularly build OOo and can add debugging stuff. Why do you think it is a Unicode problem? Characters are saved as single bytes there... And they are stored the same as from Word 95, i.e. the same values from Word as from OOo. The problem is somewhere else, I think.
I have accepted this issue and will work on it myself.
If you read in the 95 version saved by Word and resave it as Word 95 in Writer, one problem I see is that the ctor of WW8_SwAttrIter in wrtw8nds.cxx attempts to split the paragraph into ranges of characters that share the same encoding. It does this by assuming that the node's charset is the first charset in use in the paragraph and calls SearchNext with value 1. If the first character is not in the node's charset then this goes horribly wrong. We really should split the paragraph into charset ranges correctly here. And don't get confused by the language set on a range of text; it's a related issue but not the deciding factor.
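To make the "split the paragraph into charset ranges" idea concrete, here is a rough Python sketch rather than the actual wrtw8nds.cxx logic. The candidate list `CANDIDATES` and the helpers `can_encode` and `split_by_charset` are hypothetical names made up for illustration. The point is that the charset of each run is decided from the characters themselves, so a paragraph whose first character is outside the default charset still splits sensibly.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical candidate list, in order of preference; the real exporter's
# set of 8-bit charsets is not stated in this report.
CANDIDATES = ("cp1252", "cp1250", "cp1251")


def can_encode(ch: str, codec: str) -> bool:
    """Return True if the single character ch is representable in codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def split_by_charset(
    text: str, candidates: Sequence[str] = CANDIDATES
) -> List[Tuple[Optional[str], str]]:
    """Split text into (charset, run) pairs.

    A run is extended for as long as its charset can encode the next
    character. Otherwise a new run starts in the first candidate that can.
    Characters that no candidate can encode end up in a run tagged None.
    """
    runs: List[Tuple[Optional[str], str]] = []
    current: Optional[str] = None
    start = 0
    for i, ch in enumerate(text):
        if current is not None and can_encode(ch, current):
            continue  # still inside the current run
        chosen = next((c for c in candidates if can_encode(ch, c)), None)
        if chosen != current:
            if i > start:
                runs.append((current, text[start:i]))
            current, start = chosen, i
    if start < len(text):
        runs.append((current, text[start:]))
    return runs
```

With this sketch, `split_by_charset("čab")` yields a single cp1250 run even though the preferred charset is cp1252, which is the case that goes wrong when the first charset is assumed instead of detected.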
MRU->CMC: Did I understand you correctly that you may have found a solution for this?
Well a partial solution :-(, not a total one. Need more thinking.
Hi Caolan, this fixes writing of Czech documents to Word 6.0 and 95 formats. Without this patch, some Czech characters were written as ? to the .doc file. Maybe Czech Word is different and uses 1250 instead of 1252? What do you think? See attached fix_write.diff
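For reference, the 1250 versus 1252 difference can be reproduced outside OOo with a small Python sketch. It assumes the exporter was effectively encoding to Windows-1252, which this thread suspects but has not confirmed. `SAMPLE` holds the nine characters from the test document and `encode_report` is a hypothetical helper.

```python
from typing import List, Tuple

# The nine characters from openoffice.sxw: e, s, c, r, z with caron and
# y, a, i, e with acute accent.
SAMPLE = "ěščřžýáíé"


def encode_report(text: str, codec: str) -> Tuple[bytes, List[str]]:
    """Encode text to codec, replacing unencodable characters with '?'.

    Returns the encoded bytes and the list of characters that were lost.
    """
    data = text.encode(codec, errors="replace")
    lost = [ch for ch in text if ch.encode(codec, errors="replace") == b"?"]
    return data, lost


data_1252, lost_1252 = encode_report(SAMPLE, "cp1252")
data_1250, lost_1250 = encode_report(SAMPLE, "cp1250")

print("cp1252 lost:", lost_1252)
print("cp1250 lost:", lost_1250)
```

Under cp1252 the characters ě, č and ř have no code point and become ?, while š, ž and the acute-accented vowels survive. That would be consistent with only some of the characters being corrupted. Under cp1250 all nine round-trip.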
Created attachment 5593 [details] This fixes writing to DOC as Word 6.0 and 95 (at least for Czech Word)
Needless to say, that breaks everything that is not Czech :-). But that place in the code is the trouble spot all right. What we need is a better way to get the eChrSet that is passed to that method. The current way of asking the font what its charset is isn't great, to my mind. A better method to split a line into the closest equivalent 8-bit charsets is needed. I'll check with our core team for some ideas.
Yes, it is only a proof of concept - we can easily wrap it in an if Language == CZECH check or similar.
*** Issue 1660 has been marked as a duplicate of this issue. ***
*** Issue 12369 has been marked as a duplicate of this issue. ***
*** Issue 12617 has been marked as a duplicate of this issue. ***
With some luck will be fixed in 2.0. Fix made in workspace/cvs tag limerickfilterteam08
Created attachment 7481 [details] patch showing concept
Reopen to reassign
cmc->mru: Working in limerickfilterteam08
Checked fix with internal CWS filterteam08.
Verified. Fix will be included in OO 2.0.
Fix good in OO 2.0 snapshot src680m13.
*** Issue 37889 has been marked as a duplicate of this issue. ***