Issue 13937 - strings on 97/2000/XP format doesn't show entire document body
Summary: strings on 97/2000/XP format doesn't show entire document body
Status: CLOSED NOT_AN_OOO_ISSUE
Alias: None
Product: Writer
Classification: Application
Component: ui
Version: OOo 1.1 Beta
Hardware: PC Windows XP
Importance: P5 (lowest) Trivial
Target Milestone: ---
Assignee: h.ilter
QA Contact: issues@sw
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-04-29 15:38 UTC by Unknown
Modified: 2003-09-08 16:56 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description Unknown 2003-04-29 15:38:02 UTC
I'm working on a Windows XP platform, but store copies of my documents on a 
Linux network drive.  We frequently scan the documents on our Linux platform 
using a script that runs the 'strings' command to parse out the text of the 
document.   This works fine with documents saved with Microsoft Word 97 (97 
format).  However, when the document is resaved under Open Office in 
97/2000/XP format, the strings command will no longer extract the document body.

Notes:
- If the document is saved in either 6.0 or 95 format, then the strings 
command works fine.
- The strings command does dump the document description information (the type 
of info you see in the pop-up when holding your mouse over the file in Windows 
XP).
Comment 1 caolanm 2003-04-29 16:24:41 UTC
Running strings isn't really a great strategy for getting the content,
because when Word has fast save enabled (Tools->Options->Save), files
that have had text deleted may still contain that text, even though it
is no longer part of the displayed content. And of course you get all
the rest of the non-content textual junk in your file displayed as
well, but you will certainly have seen that.

The additional problem, and this is the one that has hit you, is that
Word can save text in 16-bit Unicode or in 8-bit mode. Word currently
defaults to 8-bit mode for western text in full-save files; the
fast-saved parts are always in Unicode. Non-western characters end up
in 16-bit mode in both fast-save and full-save files. OOo always exports
97+ in 16-bit Unicode, and that's what has scuppered your strings :-(

Word 6/95 couldn't do Unicode, so that's why strings works on them.

Nevertheless, if you have GNU strings (part of binutils), you should be
able to use
strings --encoding=l file.doc
where, according to the strings manual:

--encoding=encoding
           Select the character encoding of the strings that are
           to be found.  Possible values for encoding are:  s =
           single-byte characters (ASCII, ISO 8859, etc.,
           default), b = 16-bit Bigendian, l = 16-bit
           Littleendian, B = 32-bit Bigendian, L = 32-bit
           Littleendian. Useful for finding wide character strings
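
To illustrate why a plain strings run misses 16-bit text while the
l encoding finds it, here is a minimal Python sketch. It is not how GNU
strings is implemented; the function names, the sample text and the
surrounding junk bytes are made up for the illustration, and only
printable ASCII is handled.

```python
import re

MIN_LEN = 4  # GNU strings also defaults to a minimum run of 4 characters


def strings_8bit(data, min_len=MIN_LEN):
    """Return runs of printable ASCII bytes (roughly what plain strings sees)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def strings_16le(data, min_len=MIN_LEN):
    """Return runs of printable ASCII stored as 16-bit little-endian units."""
    # Each character is one printable byte followed by a zero high byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


# Hypothetical document body, wrapped in a few non-text bytes.
body = "Quarterly report"
eight_bit = b"\x01\x02" + body.encode("cp1252") + b"\x00\x03"
sixteen_bit = b"\x01\x02" + body.encode("utf-16-le") + b"\x00\x03"

print(strings_8bit(eight_bit))    # ['Quarterly report']
print(strings_8bit(sixteen_bit))  # [] : every character is split by a zero byte
print(strings_16le(sixteen_bit))  # ['Quarterly report']
```

In the 16-bit case every printable byte is followed by a zero byte, so
an 8-bit scan never sees a run longer than one character.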

An older, simpler patch I wrote myself for strings is at
http://www.csn.ul.ie/~caolan/Patches/binutils.UnicodeStringsAgainst_binutils-2.10.html
but this is redundant with the new strings.

Also, if you are just interested in getting the text content of a
document, you could use wvText from www.wvWare.com. Alternatively, the
OpenOffice SDK offers an example program to load and save a document,
so you could load the .doc files and save them as .txt. Either of these
approaches has the advantage of getting just the content: no fast-saved
deleted portions, no font names, copyright strings, etc.
Comment 2 caolanm 2003-04-29 16:26:17 UTC
Closed. Not an OOo bug.