Issue 12445 - Word 6 and Word 95 file formats and encoding of characters
Summary: Word 6 and Word 95 file formats and encoding of characters
Status: CLOSED FIXED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: 644
Hardware: PC Linux, all
Importance: P3 Trivial
Target Milestone: ---
Assignee: michael.ruess
QA Contact: issues@sw
URL:
Keywords:
Duplicates: 1660 37889
Depends on:
Blocks: 12617
Reported: 2003-03-18 17:41 UTC by pavel
Modified: 2013-08-07 14:41 UTC (History)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
original source file (5.21 KB, application/octet-stream)
2003-03-18 17:44 UTC, pavel
Saved in Microsoft Word 6.0 (6.00 KB, application/octet-stream)
2003-03-18 17:46 UTC, pavel
Saved in Microsoft Word XP (7.50 KB, application/octet-stream)
2003-03-18 17:48 UTC, pavel
Saved in Microsoft Word 95 format (6.00 KB, application/octet-stream)
2003-03-18 17:48 UTC, pavel
This fixes writing to DOC as Word 6.0 and 95 (at least for Czech Word) (778 bytes, patch)
2003-04-11 18:32 UTC, pavel
patch showing concept (17.83 KB, patch)
2003-07-08 17:09 UTC, caolanm

Description pavel 2003-03-18 17:41:49 UTC
Hi,

When I save a Czech document in 644m1 (a binary snapshot without any
changes from OOo.org) in DOC format as Word 95 or Word 6, some characters
with diacritical marks are corrupted.

I will attach these files:

openoffice.sxw - the original file that I saved in the different formats.
It contains only 9 characters with diacritical marks (e, s, c, r, z with
caron and y, a, i, e with acute accent)
word*.doc - the above document saved in various Word formats

To confirm this, just open all the documents and check whether every word*
file has the same contents as openoffice.sxw. Only wordxp.doc should match...

This is a real blocker for Czech users working with old versions of
Microsoft Word :-(
Comment 1 pavel 2003-03-18 17:44:23 UTC
Created attachment 5134 [details]
original source file
Comment 2 pavel 2003-03-18 17:46:23 UTC
Created attachment 5135 [details]
Saved in Microsoft Word 6.0
Comment 3 pavel 2003-03-18 17:48:13 UTC
Created attachment 5136 [details]
Saved in Microsoft Word XP
Comment 4 pavel 2003-03-18 17:48:48 UTC
Created attachment 5137 [details]
Saved in Microsoft Word 95 format
Comment 5 pavel 2003-03-18 17:52:06 UTC
The same happens when I use original 1.0.2 Linux binary installation set.
Comment 6 h.ilter 2003-03-20 10:24:41 UTC
MRU's area.
Comment 7 michael.ruess 2003-03-25 11:27:03 UTC
Looks like an attempt to save Unicode characters in a format that does
not really support Unicode (like Winword 6/95).
I have to investigate this further.
Comment 8 pavel 2003-03-25 20:23:07 UTC
I'm ready to investigate that - just give me more information and I can
provide details for you. I regularly build OOo and can add debugging
code.

Why do you think it is a Unicode problem? The characters are saved as
single bytes there... And they are stored with the same values whether
the file was written by Word 95 or by OOo. I think the problem is
somewhere else.
Comment 9 pavel 2003-03-30 10:52:28 UTC
I have accepted this issue and will work on it myself.
Comment 10 caolanm 2003-03-31 08:19:52 UTC
If you read in the Word 95 version saved by Word and resave it as Word 95
in Writer, I see that one problem is that the ctor of WW8_SwAttrIter in
wrtw8nds.cxx attempts to split the paragraph into ranges of characters
that share the same encoding. It does this by assuming that the node's
charset is the first charset in use in the paragraph, and calls
SearchNext with value 1. If the first character is not in the node
charset, this goes horribly wrong.

We really should split the paragraph into charset ranges correctly
here. And don't get confused by the language set on a range of text;
it's a related issue but not the deciding factor.
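
A minimal Python sketch of the kind of splitting described above, not the WW8_SwAttrIter code itself. It assumes a hypothetical, ordered list of candidate Windows code pages and uses hypothetical helper names (codepage_for, split_into_charset_runs). The point it illustrates is that the charset of the first run is derived from the first character rather than assumed from the node.

```python
from typing import List, Optional, Tuple

# Hypothetical candidate list in order of preference; not taken from the OOo source.
CANDIDATE_CODEPAGES = ["cp1252", "cp1250", "cp1251"]


def codepage_for(ch: str, current: Optional[str], candidates: List[str]) -> Optional[str]:
    """Keep the current code page if it can encode ch, else try the candidates in order."""
    ordered = ([current] if current else []) + [cp for cp in candidates if cp != current]
    for cp in ordered:
        try:
            ch.encode(cp)
            return cp
        except UnicodeEncodeError:
            continue
    return None  # no candidate can represent this character


def split_into_charset_runs(
    text: str, candidates: List[str] = CANDIDATE_CODEPAGES
) -> List[Tuple[int, int, Optional[str]]]:
    """Return (start, end, codepage) runs covering text; end is exclusive."""
    runs: List[Tuple[int, int, Optional[str]]] = []
    start = 0
    current: Optional[str] = None
    for i, ch in enumerate(text):
        cp = codepage_for(ch, current, candidates)
        if i == 0:
            current = cp  # the first run's charset comes from the text itself
        elif cp != current:
            runs.append((start, i, current))  # close the previous run
            start, current = i, cp
    if text:
        runs.append((start, len(text), current))
    return runs


if __name__ == "__main__":
    # A paragraph whose first character is not in the default (cp1252) charset.
    print(split_into_charset_runs("čau"))
    print(split_into_charset_runs("abč Привет"))
```

Characters that every candidate can encode (plain ASCII, for example) stay in the current run, so a run only ends when the current code page genuinely cannot represent the next character.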
Comment 11 michael.ruess 2003-04-01 15:13:30 UTC
MRU->CMC: Did I understand you correctly that you may have a solution for this?
Comment 12 caolanm 2003-04-01 17:03:45 UTC
Well, a partial solution :-(, not a total one. Needs more thought.
Comment 13 pavel 2003-04-11 18:30:16 UTC
Hi Caolan,

This fixes writing of Czech documents to the Word 6.0 and 95 formats.
Without this patch, some Czech characters are written as ? to the .doc
file. Maybe Czech Word is different and uses code page 1250 instead of
1252? What do you think?

See attached fix_write.diff
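
The 1250 versus 1252 question can be checked outside OOo with a short Python sketch. It assumes the nine characters from the description are the lowercase forms (the report does not say), and the helper names are hypothetical. It only shows which characters each Windows code page can represent; it says nothing about what the export filter actually does.

```python
# Assumption: lowercase e, s, c, r, z with caron and y, a, i, e with acute.
CZECH_SAMPLE = "ěščřžýáíé"


def encode_lossy(text: str, codepage: str) -> bytes:
    """Encode text, substituting b'?' for anything the code page lacks."""
    return text.encode(codepage, errors="replace")


def lost_characters(text: str, codepage: str) -> list:
    """List the characters that the given code page cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


if __name__ == "__main__":
    for cp in ("cp1252", "cp1250"):
        print(cp, encode_lossy(CZECH_SAMPLE, cp), lost_characters(CZECH_SAMPLE, cp))
```

With this sample, code page 1252 has no slot for the e, c and r with caron, so those three come out as ?, while code page 1250 encodes all nine characters.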
Comment 14 pavel 2003-04-11 18:32:26 UTC
Created attachment 5593 [details]
This fixes writing to DOC as Word 6.0 and 95 (at least for Czech Word)
Comment 15 caolanm 2003-04-14 10:28:25 UTC
Needless to say, that breaks everything that is not Czech :-). But that
place in the code is the trouble spot all right. What we need is a
better way to get the eChrSet that is passed to that method. The current
way of asking the font what its charset is isn't great, to my mind. A
better method of splitting a line into the closest equivalent 8-bit
charsets is needed. I'll check with our core team for some ideas.
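
One way to read "closest equivalent 8-bit charset" is: for a given span of text, pick the code page that loses the fewest characters, instead of asking the font. The Python sketch below illustrates that idea only. The function names and the candidate list are hypothetical, and this is not the approach taken in the eventual fix, which the thread does not describe.

```python
from typing import List


def unencodable_count(text: str, codepage: str) -> int:
    """Count characters the single-byte code page cannot represent.

    With errors="ignore" every encodable character yields exactly one byte,
    so the length difference is the number of dropped characters.
    """
    return len(text) - len(text.encode(codepage, errors="ignore"))


def closest_codepage(text: str, candidates: List[str]) -> str:
    """Return the candidate that loses the fewest characters.

    min() keeps the first minimum, so candidate order acts as the tie-break.
    """
    return min(candidates, key=lambda cp: unencodable_count(text, cp))


if __name__ == "__main__":
    candidates = ["cp1252", "cp1250"]  # hypothetical preference order
    for sample in ("Příliš žluťoučký kůň", "naïve café", "plain ascii"):
        print(repr(sample), "->", closest_codepage(sample, candidates))
```

Choosing by content rather than by font means a Czech paragraph set in a Western font would still be routed to code page 1250, while text that fits the default code page is left alone.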
Comment 16 pavel 2003-04-14 22:22:50 UTC
Yes, it is only a proof of concept - we can easily wrap it in an
if Language == CZECH check or similar.
Comment 17 caolanm 2003-05-07 09:04:32 UTC
*** Issue 1660 has been marked as a duplicate of this issue. ***
Comment 18 caolanm 2003-06-30 09:37:26 UTC
*** Issue 12369 has been marked as a duplicate of this issue. ***
Comment 19 michael.ruess 2003-07-03 07:27:06 UTC
*** Issue 12617 has been marked as a duplicate of this issue. ***
Comment 20 caolanm 2003-07-08 17:08:25 UTC
With some luck this will be fixed in 2.0. Fix made in workspace/CVS tag
limerickfilterteam08
Comment 21 caolanm 2003-07-08 17:09:19 UTC
Created attachment 7481 [details]
patch showing concept
Comment 22 caolanm 2003-08-15 17:21:22 UTC
Reopen to reassign
Comment 23 caolanm 2003-08-15 17:21:52 UTC
cmc->mru: Working in limerickfilterteam08
Comment 24 michael.ruess 2003-08-27 16:21:43 UTC
Checked fix with internal CWS filterteam08.
Comment 25 michael.ruess 2003-08-27 16:22:18 UTC
Verified. Fix will be included in OO 2.0.
Comment 26 michael.ruess 2003-11-27 10:38:19 UTC
Fix good in OO 2.0 snapshot src680m13.
Comment 27 michael.ruess 2004-11-25 15:25:11 UTC
*** Issue 37889 has been marked as a duplicate of this issue. ***