Apache OpenOffice (AOO) Bugzilla – Issue 63199
Optimize PDF output concerning whitespace
Last modified: 2006-05-17 13:18:50 UTC
To conserve space we could strip out a lot of spaces and turn lineends to a single newline character.
If you want to we can handle this in the same CWS (pdfexportimprove) as issue 61139.
Ok, shouldn't take long. ->pl. I'll start on this once the CWS used for 61139 is synced with the latest patch (and I have synced my own copy). In any case I think the line termination character should be changed as well, since that's what gs does, i.e. only LF there (ref. PDF spec v1.5, p. 67). Another point of interest could be using line lengths up to (or nearer to) the maximum of 255 characters (PDF ref p. 67 as well). Let me know what you think.
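To illustrate the line-end change discussed here, the following is a minimal C++ sketch that collapses CR LF and lone CR into a single LF for a fragment of PDF text. `normalizeEol` is a hypothetical helper, not a function from PDFWriterImpl. The actual change would more likely be made in the strings the writer emits, and a filter like this must not be run over a finished file, because changing lengths would corrupt binary stream data and invalidate the byte offsets in the xref table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of PDFWriterImpl): replaces every CR LF pair
// and every lone CR with a single LF. Intended for textual fragments only.
std::string normalizeEol(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')
        {
            out += '\n';
            // Skip the LF of a CR LF pair so it is not emitted twice.
            if (i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
        }
        else
        {
            out += text[i];
        }
    }
    return out;
}
```

Each CR LF pair that is collapsed saves one byte, so the gain grows with the number of short lines in the file.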
I'm ok with that; please proceed with both. However, regarding future PDF/A support, there needs to be an EOL marker before and after each indirect object begin (that is "\nx y obj\n") and endobj ("\nendobj\n").
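The EOL rule described above can be sketched as follows. This is only an illustration under assumptions: `appendIndirectObject` and its string buffer are hypothetical and do not reflect how PDFWriterImpl actually writes objects. It shows that a single LF can serve both as the EOL after one `endobj` and as the EOL before the next `x y obj`.

```cpp
#include <string>

// Hypothetical helper: appends "N G obj" ... "endobj" to buf so that both
// keywords are preceded and followed by an EOL (a single LF here).
void appendIndirectObject(std::string& buf, int objNum, int genNum,
                          const std::string& body)
{
    // EOL before "N G obj": reuse the trailing LF of the previous output.
    if (!buf.empty() && buf.back() != '\n')
        buf += '\n';
    buf += std::to_string(objNum) + ' ' + std::to_string(genNum) + " obj\n";
    buf += body;
    // EOL before "endobj": add one only if the body does not end with it.
    if (body.empty() || body.back() != '\n')
        buf += '\n';
    buf += "endobj\n";
}
```

Sharing the LF between consecutive objects keeps the requirement satisfied without emitting the doubled "\n\n".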
->pl. It appears to be more difficult than I anticipated, mainly because it's hard to build test cases to check for regressions. For testing I use a hardware manual I wrote with OOo, with a couple of images, logos and tables (around 130 pages), an Impress presentation shown at the OOo 2005 conference, and another small document. I test the resulting PDF document with Acrobat and print it in full, so that the whole file is scanned (so I suppose...), then do a check with gs, converting it to PS and then back to PDF (BTW, the quality of the final PDF coming out of gs is worse...). So far this has caught the regressions introduced in the process. I'll add another document I wrote with OOo (this one http://www.tecsa-srl.it/en/pdf/TechDesc_19990523.0938V20_en.pdf, if you are wondering about the way it looks) plus other manuals with diagrams and embedded drawings (WMF metafiles). The figures I have so far range from 1.5% better for the manual without tags to around 8% better for the tagged version, but there's still work to do. The problem is that there is a strong chance of introducing regressions. I only optimise PDF code that I can check in the resulting PDF file (with emacs); this _should_ limit regressions, but you never know. Now question time...
1. What do you think on the matter?
2. Is there a way to check the tags?
3. Is there a document available somewhere on the OOo site (or elsewhere) containing most of the elements needed to test all the execution paths in PDFWriterImpl? That would be most helpful.
4. You mentioned the code "\nendobj\n"; so far I've only seen "\nendobj\n\n". Is the added eol char really needed for future PDF/A?
5. It would be useful to add comments to the PDF file for debugging purposes, of course if and only if PDFWriterImpl is compiled in the debug version (the one with debug=true). Can I do this where I find it useful for further study?
6. Last... is my JCA still out in the wild ;) ?
I can send another e-mail, but with red tape you never know...
If we can save 8% in the tagged case, then by all means let's do that. The lineend conversion to a single \n and the removal of pure "pretty print" whitespace in the dictionaries can be done rather safely, I think, and will bring the most gain. Checking the tagged PDF for regressions will probably not be possible with Acrobat Reader alone; however, Acrobat Reader for PalmOS is available without cost. It runs on Windows and possibly Mac; it reads tagged PDF and constructs a new PDF suitable for a PalmOS device. This is how we tested the tagged PDF originally, so if you have access to a Windows box, that might be a feasible way. On the other hand, if you load a file into the full version of Acrobat you should get any error messages, but you'd need access to an Acrobat license for that. If you have neither a Windows box nor Acrobat, you could send me test files and I can try them for you. For sample documents I'll put hi on the CC list; he's QA's resident expert for PDF testing and probably has a lot of sample files available. The \nendobj\n is a requirement of PDF/A, yes. Currently we overachieve in that area for readability reasons (helps debugging :-)). I just mentioned this because one could save space there, too, but since we want to support PDF/A at some point, we should not break what is already done. There already is a MARK function that does something similar; I suggest implementing a similar function (DEBUGCOMMENT or so) that will not begin a marked content sequence, but simply calls emitComment (see the MARK function) if OSL_DEBUG_LEVEL > 1 and does nothing else. Regarding your JCA: I asked mh and st and they told me they would handle that. I think it would be best if you contacted mh directly about your JCA; if they have not found it by now, I think it will need to be sent again, but please ask mh about the details, for I don't know much about that.
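The DEBUGCOMMENT suggestion could look roughly like the sketch below. The `emitComment` shown here is a stand-in that writes to a `std::string`; the real emitComment in PDFWriterImpl may have a different signature, and the two-argument macro form is an assumption made so that the example is self-contained. The only facts relied on are that a PDF comment starts with '%' and runs to the end of the line, and that the macro should expand to nothing unless OSL_DEBUG_LEVEL > 1.

```cpp
#include <string>

#ifndef OSL_DEBUG_LEVEL
#define OSL_DEBUG_LEVEL 0   // assumption: normally defined by the build system
#endif

// Stand-in for the writer's comment emitter (hypothetical signature):
// a PDF comment starts with '%' and ends at the next end-of-line.
inline void emitComment(std::string& out, const char* text)
{
    out += "% ";
    out += text;
    out += '\n';
}

// Emits a plain comment in debug builds only; it does not begin a marked
// content sequence and compiles to nothing otherwise.
#if OSL_DEBUG_LEVEL > 1
#define DEBUGCOMMENT(out, text) emitComment((out), (text))
#else
#define DEBUGCOMMENT(out, text) ((void)0)
#endif
```

Because the macro vanishes in non-debug builds, the comments cost nothing in the size of release output.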
->pl. Hunted for Acrobat Reader for PalmOS; it would be useful if I had a palmtop, but I don't... I tried to install it on a Windows box (the laptop I use dual-boots Linux and Windows), but it complains that Palm Desktop is missing. I'm no Palm OS expert, so I gave up. I'm going to send you the PDF directly for testing. I dropped mh a few lines about the JCA issue.
->pl. Attached is the patch to clean up the PDF file a little. The PDF 1.4 spec (ch. 3.1.1) says that the use of white space for indentation in the examples is not recommended and that the eol used there is a recommended convention (not mandatory!); elsewhere the character '/' is explained to be a token separator. The only place where an eol is mandatory is the xref table. Based on all this I did an aggressive cleanup: joining lines together and using '/' instead of spaces to separate tokens, except where the separation was meant to be a 'whitespace character' as they call it (e.g. between strings and numbers). I left the eol in place where I wasn't sure about the change. Of course I didn't change the 'obj', 'stream', 'endstream' and 'endobj' tokens; in these cases I only changed the eol to a single LF char. Let me know if you think this approach is too aggressive and I will revert to a less aggressive one, e.g. keeping the recommended convention for arranging tokens into lines (not used by gs, though). The gains depend on the file: they range from 1% to 3% for the untagged versions and from 5% to 12% for the tagged ones. The '<<' and '>>' tokens can still be optimized; the spec is unclear on this, but gs optimizes '<<' (e.g. it emits: /Resources<</ProcSet[/PDF /ImageC /ImageI /Text] ), but not '>>'. Let me know what you think. I sent a tagged version to your e-mail for testing.
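The idea of letting delimiter characters separate tokens can be sketched as below. `isDelimiter` and `joinTokens` are hypothetical names, not code from the attached patch. The sketch treats every PDF delimiter character the same way, including '<' and '>', so it goes one step further than the patch described in this comment, which leaves '<<' and '>>' alone. It also assumes the tokens are already split and contain no comments or stream data.

```cpp
#include <string>
#include <vector>

// PDF delimiter characters: ( ) < > [ ] { } / %
bool isDelimiter(char c)
{
    return std::string("()<>[]{}/%").find(c) != std::string::npos;
}

// Hypothetical helper: concatenates tokens, inserting a space only where two
// regular characters would otherwise run together (e.g. "/Count" and "3").
std::string joinTokens(const std::vector<std::string>& tokens)
{
    std::string out;
    for (const std::string& tok : tokens)
    {
        if (tok.empty())
            continue;
        if (!out.empty() && !isDelimiter(out.back()) && !isDelimiter(tok.front()))
            out += ' ';
        out += tok;
    }
    return out;
}
```

For example, the tokens `<<`, `/Type`, `/Page`, `/Count`, `3`, `>>` are joined with a single space, the one between `/Count` and `3`.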
Created attachment 35032 [details] patch to clean spaces, changing eol
Committed the patch to CWS pdfexportimprove. As to the aggressiveness of the patch: if we're going to remove the pretty-printing whitespace at all, then in my opinion we can go the whole way. So I fully agree with your approach. Regarding the "<<" and ">>": chapter 3.1.1 of the PDF reference says '<' and '>' are delimiter characters, so there does not seem to be a reason not to use them as such. The only limitation we should observe is the limit of 255 characters per line mentioned in chapter 3.4, but I did not see a place in your patch where that was affected.
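A check for the 255-character limit mentioned here might look like the following sketch. `longestLine` and `fitsLineLimit` are hypothetical helpers. Whether the limit counts the EOL character itself, and whether it applies inside stream data, is not settled in this thread, so the sketch simply counts the characters between EOLs in whatever text it is given.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest run of characters between
// end-of-line characters (CR or LF), not counting the EOL itself.
std::size_t longestLine(const std::string& data)
{
    std::size_t longest = 0;
    std::size_t current = 0;
    for (char c : data)
    {
        if (c == '\n' || c == '\r')
        {
            longest = std::max(longest, current);
            current = 0;
        }
        else
        {
            ++current;
        }
    }
    return std::max(longest, current);   // last line may lack a trailing EOL
}

// True if no line exceeds the given limit (255 by default).
bool fitsLineLimit(const std::string& data, std::size_t limit = 255)
{
    return longestLine(data) <= limit;
}
```

Running such a check over the non-stream parts of the output would flag places where joining lines went too far.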
->pl. I'm going to attach the latest patch. I noticed that an eol was added before the endstream token (around line 760); it was neither in my patch nor in the original source file. I merged it in while resyncing the CWS locally, here at my end, and this patch doesn't change it back. I checked the PDF emitted when compiling in debug mode and discovered that there are already a lot of comments in the PDF file, so there is no need to add the function. I think this patch should be the last one needed for this issue.
Created attachment 35084 [details] latest patch, see above
Committed the patch. Sorry, I forgot to mention that I added that newline before the endstream; that is another requirement for PDF/A (like obj/endobj) and I thought I'd add this minor change while you're at the whitespace anyway.
->pl. I'm going to attach the (latest?) patch; this time I stripped the whitespace around the '<<' and '>>' tokens, along with joining together some lines of emitted PDF. I found another endstream that doesn't have an eol in front; I didn't add the eol because I'm not sure about PDF/A. It's at line 6946 after the following patch is applied.
Created attachment 35110 [details] patch, see text and above
Good catch, I added that newline after committing the patch.
reassign for verification
pl->hi: please verify in CWS pdfexportimprove
Verified with CWS pdfexportimprove = ok; the exported file is smaller and shows fewer blanks when viewed in an editor.
Still ok in 680m169