Apache OpenOffice (AOO) Bugzilla – Issue 13937
strings on 97/2000/XP format doesn't show entire document body
Last modified: 2003-09-08 16:56:16 UTC
I'm working on a Windows XP platform, but store copies of my documents on a Linux network drive. We frequently scan the documents on our Linux platform using a script that runs the 'strings' command to extract the text of each document. This works fine with documents saved by Microsoft Word 97 (97 format). However, when a document is resaved in OpenOffice in 97/2000/XP format, the strings command no longer extracts the document body.

Notes:
- If the document is saved in either 6.0 or 95 format, the strings command works fine.
- The strings command does dump the document description information (the type of info you see in the pop-up when holding your mouse over the file in Windows XP).
Strings isn't really a great strategy for getting the content. When Word has fast save enabled (tools->options->save), text that has been deleted may still be present in the file even though it is no longer part of the displayed content. And of course you get all the rest of the non-content textual junk in the file displayed as well, but you will certainly have seen that.

The additional problem, and this is the one that has hit you, is that Word can save in 16-bit Unicode or in 8-bit mode. Word currently defaults to saving western text in 8-bit mode for full-save files; the fast-saved parts are always in Unicode. Non-western characters end up in 16-bit mode in both fast-save and full-save files. OOo always exports 97+ in 16-bit Unicode, and that's what has scuppered your strings :-( Word 6/95 couldn't do Unicode, which is why strings works on those formats.

Nevertheless, if you have GNU strings (part of binutils), you should be able to use

    strings --encoding=l file.doc

which according to the strings manual is

    --encoding=encoding
    Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-byte characters (ASCII, ISO 8859, etc., default), b = 16-bit Bigendian, l = 16-bit Littleendian, B = 32-bit Bigendian, L = 32-bit Littleendian. Useful for finding wide character strings.

An older, simpler patch I wrote myself for strings is at http://www.csn.ul.ie/~caolan/Patches/binutils.UnicodeStringsAgainst_binutils-2.10.html but it is redundant with the new strings.

Also, if you are just interested in getting the text content of a document, you could use wvText from www.wvWare.com, or the OpenOffice SDK offers an example program to load and save a document: you could load the .doc files and save them as .txt. Either of these approaches has the advantage of getting just the content, with no fast-saved deleted portions and no font names, copyright strings, etc.
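To illustrate why a single-byte scan misses 16-bit text, here is a minimal Python sketch of the idea behind scanning for 16-bit little-endian strings. It is not the GNU strings implementation; the function names ascii_strings and utf16le_strings, the minimum length of 4, and the restriction to printable ASCII code points are all assumptions made for the example.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Find runs of printable single-byte ASCII characters (8-bit scan)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def utf16le_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Find runs of printable ASCII characters stored as UTF-16LE.

    In UTF-16LE each such character is its ASCII byte followed by a
    zero byte, so an 8-bit scan sees only isolated single characters.
    """
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python wide_strings.py file.doc
    with open(sys.argv[1], "rb") as handle:
        for text in utf16le_strings(handle.read()):
            print(text)
```

Like strings itself, this only finds runs of text in the raw bytes, so it will still pick up deleted fast-save text and other non-content strings.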
Closed. Not an OOo bug.