12445 – Word 6 and Word 95 file formats and encoding of characters

Summary: Word 6 and Word 95 file formats and encoding of characters

Status:	CLOSED FIXED

Alias:	None

Product:	Writer
Classification:	Application
Component:	code
Version:	644
Hardware:	PC Linux, all

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	michael.ruess
QA Contact:	issues@sw

URL:
Keywords:

Duplicates (2):	1660 37889
Depends on:
Blocks:	12617

Reported:	2003-03-18 17:41 UTC by pavel
Modified:	2013-08-07 14:41 UTC
CC List:	3 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
original source file (5.21 KB, application/octet-stream) 2003-03-18 17:44 UTC, pavel
Saved in Microsoft Word 6.0 (6.00 KB, application/octet-stream) 2003-03-18 17:46 UTC, pavel
Saved in Microsoft Word XP (7.50 KB, application/octet-stream) 2003-03-18 17:48 UTC, pavel
Saved in Microsoft Word 95 format (6.00 KB, application/octet-stream) 2003-03-18 17:48 UTC, pavel
This fixes writing to DOC as Word 6.0 and 95 (at least for Czech Word) (778 bytes, patch) 2003-04-11 18:32 UTC, pavel
patch showing concept (17.83 KB, patch) 2003-07-08 17:09 UTC, caolanm

Description pavel 2003-03-18 17:41:49 UTC

Hi,

when I save a Czech document in 644m1 (a binary snapshot without any changes
from OOo.org) in DOC format as Word 95 or Word 6, some characters with
diacritical marks are corrupted.

I will attach those files:

openoffice.sxw - the original file that I saved in different formats. It only
contains 9 characters with diacritical marks (e, s, c, r, z with caron and
y, a, i, e with acute accent)
word*.doc - the above document saved in various Word formats

To confirm this, just open all the documents and see if all the word* files have
the same contents as openoffice.sxw. Only wordxp.doc does...

This is a real blocker for Czech users working with old versions of
Microsoft Word :-(

Comment 1 pavel 2003-03-18 17:44:23 UTC

Created attachment 5134
original source file

Comment 2 pavel 2003-03-18 17:46:23 UTC

Created attachment 5135
Saved in Microsoft Word 6.0

Comment 3 pavel 2003-03-18 17:48:13 UTC

Created attachment 5136
Saved in Microsoft Word XP

Comment 4 pavel 2003-03-18 17:48:48 UTC

Created attachment 5137
Saved in Microsoft Word 95 format

Comment 5 pavel 2003-03-18 17:52:06 UTC

The same happens when I use the original 1.0.2 Linux binary installation set.

Comment 6 h.ilter 2003-03-20 10:24:41 UTC

MRUs Area.

Comment 7 michael.ruess 2003-03-25 11:27:03 UTC

Looks like we are trying to save Unicode characters in a format that does not
really support Unicode (like Winword 6/95).
I have to investigate this further.

Comment 8 pavel 2003-03-25 20:23:07 UTC

I'm ready to investigate that - just give me more information and I can
provide details for you. I regularly build OOo and can add debugging
code.

Why do you think it is a Unicode problem? Characters are saved as single
bytes there... And they are stored with the same values by OOo as by
Word 95. I think the problem is somewhere else.

Comment 9 pavel 2003-03-30 10:52:28 UTC

I have accepted this issue and will work on it myself.

Comment 10 caolanm 2003-03-31 08:19:52 UTC

If you read in the 95 version saved by Word and resave it as Word 95 in
Writer, I see that one problem is that the ctor of WW8_SwAttrIter in
wrtw8nds.cxx attempts to split the paragraph into ranges of characters
that have the same encoding as each other. It does this by assuming
that the node's charset is the first charset in use in the para and
calls SearchNext with value 1. If the first character is not in the
node charset then this goes horribly wrong.

We really should split the paragraph into charset ranges correctly
here. And don't get confused by the language set on a range of text;
it's a related issue but not the deciding factor.

Comment 11 michael.ruess 2003-04-01 15:13:30 UTC

MRU->CMC: Did I get you right that you may have found a solution for this?

Comment 12 caolanm 2003-04-01 17:03:45 UTC

Well, a partial solution :-(, not a total one. It needs more thinking.

Comment 13 pavel 2003-04-11 18:30:16 UTC

Hi Caolan,

this fixes writing of Czech documents to Word 6.0 and 95 formats.
Without this patch, some Czech characters got written as ? to the .doc
file. Maybe Czech Word is different and uses 1250 instead of 1252?
What do you think?

See the attached fix_write.diff
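
The 1250-versus-1252 question above can be checked outside the office suite. The following is a minimal sketch, assuming Python's built-in cp1250 and cp1252 codecs match the Windows code pages in question; it does not reproduce the Writer export code, and the helper name encode_lossy is hypothetical. It encodes the nine characters from the report with both code pages and substitutes ? for anything unencodable.

```python
# The nine characters from the report: e, s, c, r, z with caron,
# then y, a, i, e with acute accent (written as escapes to stay ASCII-safe).
CZECH = "\u011b\u0161\u010d\u0159\u017e\u00fd\u00e1\u00ed\u00e9"


def encode_lossy(text, codepage):
    """Encode text to an 8-bit code page, writing '?' for unencodable characters.

    Hypothetical helper for illustration only; not taken from the OOo source.
    """
    return text.encode(codepage, errors="replace")


as_cp1252 = encode_lossy(CZECH, "cp1252")
as_cp1250 = encode_lossy(CZECH, "cp1250")

# Characters that cp1252 cannot represent come out as '?'.
lost = [ch for ch, b in zip(CZECH, as_cp1252) if b == ord("?")]

print("cp1252:", as_cp1252)
print("cp1250:", as_cp1250)
print("lost in cp1252:", lost)
```

Under that assumption, three of the nine characters (e, c and r with caron) have no cp1252 byte and become ?, while cp1250 represents all nine.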

Comment 14 pavel 2003-04-11 18:32:26 UTC

Created attachment 5593
This fixes writing to DOC as Word 6.0 and 95 (at least for Czech Word)

Comment 15 caolanm 2003-04-14 10:28:25 UTC

Needless to say, that breaks everything that is not Czech :-). But the
place in the code is the trouble spot all right. What we need is a better
way to get the eChrSet that is passed to that method. The current way
of asking the font what its charset is isn't great to my mind. A better
method to split a line into the closest equivalent 8-bit charsets is
needed. I'll check with our core team for some ideas.
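
One way to picture "splitting a line into the closest equivalent 8-bit charsets" is a greedy scan that, at each position, picks the candidate code page able to encode the longest following run. This is only a sketch under stated assumptions: the candidate list and its order, and the names can_encode and split_into_charset_runs, are hypothetical and are not the approach taken in the actual fix. Unlike the behaviour described for WW8_SwAttrIter, it makes no assumption about which charset the first character belongs to.

```python
# Assumption: an illustrative candidate list; earlier entries win ties.
CANDIDATES = ("cp1252", "cp1250")


def can_encode(ch, codepage):
    """Return True if the single character ch has a byte in codepage."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def split_into_charset_runs(text, candidates=CANDIDATES):
    """Split text into (codepage, substring) runs.

    At each position the candidate that can encode the longest run starting
    there is chosen. A character no candidate can encode becomes its own run
    with codepage None, so the caller can decide how to handle it.
    """
    runs = []
    i = 0
    while i < len(text):
        best_cp, best_end = None, i
        for cp in candidates:
            end = i
            while end < len(text) and can_encode(text[end], cp):
                end += 1
            if end > best_end:  # strict: ties keep the earlier candidate
                best_cp, best_end = cp, end
        if best_cp is None:
            best_end = i + 1  # skip over the unencodable character
        runs.append((best_cp, text[i:best_end]))
        i = best_end
    return runs


# n with tilde exists only in cp1252, r with caron only in cp1250.
print(split_into_charset_runs("\u00f1\u0159"))
```

A greedy choice keeps the number of runs small but is not guaranteed to be minimal; it is meant only to show that the first run need not be in the paragraph's default charset.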

Comment 16 pavel 2003-04-14 22:22:50 UTC

Yes, it is only a proof of concept - we can easily wrap it with
if Language == CZECH or similar.

Comment 17 caolanm 2003-05-07 09:04:32 UTC

*** Issue 1660 has been marked as a duplicate of this issue. ***

Comment 18 caolanm 2003-06-30 09:37:26 UTC

*** Issue 12369 has been marked as a duplicate of this issue. ***

Comment 19 michael.ruess 2003-07-03 07:27:06 UTC

*** Issue 12617 has been marked as a duplicate of this issue. ***

Comment 20 caolanm 2003-07-08 17:08:25 UTC

With some luck this will be fixed in 2.0. Fix made in workspace/CVS tag
limerickfilterteam08

Comment 21 caolanm 2003-07-08 17:09:19 UTC

Created attachment 7481
patch showing concept

Comment 22 caolanm 2003-08-15 17:21:22 UTC

Reopen to reassign

Comment 23 caolanm 2003-08-15 17:21:52 UTC

cmc->mru: Working in limerickfilterteam08

Comment 24 michael.ruess 2003-08-27 16:21:43 UTC

Checked fix with internal CWS filterteam08.

Comment 25 michael.ruess 2003-08-27 16:22:18 UTC

Verified. Fix will be included in OO 2.0.

Comment 26 michael.ruess 2003-11-27 10:38:19 UTC

Fix good in OO 2.0 snapshot src680m13.

Comment 27 michael.ruess 2004-11-25 15:25:11 UTC

*** Issue 37889 has been marked as a duplicate of this issue. ***