63199 – Optimize PDF output concerning whitespace

Summary: Optimize PDF output concerning whitespace

Status:	CLOSED FIXED

Alias:	None

Product:	gsl
Classification:	Code
Component:	code
Version:	680m150
Hardware:	All All

Importance:	P3 Trivial
Target Milestone:	OOo 2.0.3
Assignee:	h.ilter
QA Contact:	issues@gsl

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-03-15 10:04 UTC by philipp.lohmann
Modified:	2006-05-17 13:18 UTC
CC List:	2 users

See Also:
Issue Type:	PATCH
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
patch to clean spaces, changing eol (124.37 KB, patch) 2006-03-19 19:44 UTC, Giuseppe Castagno (aka beppec56)
latest patch, see above (9.29 KB, patch) 2006-03-21 09:13 UTC, Giuseppe Castagno (aka beppec56)
patch, see text and above (19.04 KB, patch) 2006-03-21 22:45 UTC, Giuseppe Castagno (aka beppec56)

Description philipp.lohmann 2006-03-15 10:04:50 UTC

To conserve space we could strip out a lot of spaces and turn lineends to a
single newline character.

Comment 1 philipp.lohmann 2006-03-15 10:08:38 UTC

If you want to we can handle this in the same CWS (pdfexportimprove) as issue 61139.

Comment 2 Giuseppe Castagno (aka beppec56) 2006-03-15 10:40:40 UTC

Ok, shouldn't take long.
->pl. I'll start on this after the CWS used for 61139 is synced with the latest
patch (and I sync my own copy).
In any case I think that changing the line termination char should be done as
well, since that's what gs does, i.e. only LF there (ref. PDF spec v1.5, page 67).
Another point of interest could be using line lengths up to (or nearer to) the
maximum length of 255 (PDF ref page 67 as well).
Let me know what you think.
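
The 255-character limit mentioned above could be checked on a fragment of emitted PDF text with something like the following. This is only a sketch: the function name is hypothetical, it is written in Python rather than the C++ of PDFWriterImpl, it assumes LF-only line ends, and it must not be run over binary stream data. Whether the limit counts the EOL character is not settled in this thread; the sketch counts bytes without the LF.

```python
PDF_MAX_LINE = 255  # line length limit cited in the thread


def overlong_lines(fragment: bytes, limit: int = PDF_MAX_LINE) -> list:
    """Return the 1-based numbers of LF-terminated lines longer than `limit` bytes.

    Hypothetical helper: `fragment` is assumed to be non-stream PDF text.
    """
    return [
        number
        for number, line in enumerate(fragment.split(b"\n"), start=1)
        if len(line) > limit
    ]
```

An empty result means that joining lines did not push any of them past the limit.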

Comment 3 philipp.lohmann 2006-03-15 10:58:40 UTC

I'm ok with that; please proceed with both. However regarding future PDF/A
support there needs to be an EOL marker before and after each indirect object
begin (that is "\nx y obj\n") and endobj ("\nendobj\n").
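
The framing described here can be written down literally. The sketch below is an illustration in Python with a hypothetical function name, not the OOo code; it only reproduces the two byte patterns quoted in this comment.

```python
def frame_object(num: int, gen: int, body: bytes) -> bytes:
    """Wrap `body` as an indirect object with an EOL before and after
    both the "x y obj" line and the "endobj" keyword."""
    return b"\n%d %d obj\n" % (num, gen) + body + b"\nendobj\n"
```

Concatenating two framed objects yields an empty line between them. A writer that wanted to save that byte could presumably let one LF serve both neighbours, but that is an assumption not confirmed in this thread.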

Comment 4 Giuseppe Castagno (aka beppec56) 2006-03-17 09:47:04 UTC

->pl. It appears it's more difficult than I anticipated, mainly because it's
difficult to build test cases to check for regressions.
For the tests I use a hardware manual I wrote with OOo, with a couple of images,
logos and tables (around 130 pages); an Impress presentation shown at the OOo
2005 conference; and another small document. I test the resulting PDF document
with Acrobat and print it in full, so that it is scanned completely (so I
suppose...), then do a check with gs, converting it to ps and then back to pdf
(BTW, the quality of the final PDF from gs is worse after this conversion...).
So far this has caught the regressions introduced in the process.
I'll add another doc I wrote with OOo (this one
http://www.tecsa-srl.it/en/pdf/TechDesc_19990523.0938V20_en.pdf, if you are
wondering about the way it looks) plus other manuals with diagrams and embedded
drawings (wmf metafiles).

The figures I got so far range from 1.5% better for the manual without tags to
around 8% better for the tagged version, but there's still work to do.
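
For reference, the "x% better" figures are relative size reductions. A minimal sketch of the arithmetic, with a hypothetical function name and byte counts that would come from the exported files before and after the patch:

```python
def saving_percent(before: int, after: int) -> float:
    """Relative size reduction, in percent of the original size."""
    if before <= 0:
        raise ValueError("original size must be positive")
    return 100 * (before - after) / before
```

For example, a file shrinking from 1000 to 920 bytes (made-up numbers, not measurements from this issue) would count as 8% better.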

The problem is that there is a strong chance to introduce regressions.
I optimise only the PDF code whose output I can check in the resulting PDF file
(with emacs); this _should_ limit regressions, but you never know.

Now question time...

1. What do you think about the matter?
2. Is there a way to check the tags?
3. Is there a document available somewhere on the OOo site (or elsewhere)
containing most of the elements needed to test all the execution paths in
PDFWriterImpl? That would be most helpful.
4. You mentioned the code "\nendobj\n"; so far I've seen only "\nendobj\n\n". Is
the added eol char really needed for future PDF/A?

5. It would be useful to add comments to the PDF file for debugging purposes, of
course if and only if PDFWriterImpl is compiled in the debug version (the one
with debug=true). Can I do this where I find it useful for further study?

6. Last... is my JCA still out in the wild ;) ? I can send another e-mail, but
with red tape you never know...

Comment 5 philipp.lohmann 2006-03-17 10:11:10 UTC

If we can save 8% in the tagged case, then by all means let's do that. The
lineend conversion to a single \n and the removal of pure "pretty print"
whitespace in the dictionaries can be done rather safely i think and will bring
the most gain.

Checking the tagged PDF for regression will probably not be possible with
acrobat reader alone; however there is acrobat reader for PalmOS available
without cost. That one is available for Windows and possibly mac; it reads
tagged PDF and constructs a new PDF suitable for a PalmOS device. This is how we
tested the tagged PDF originally, so if you have access to a Windows box, then
that might be a feasible way. On the other hand, if you load a file into the full
version of Acrobat you should get an error message for any problem, but you'd
need access to an Acrobat license for that.

If you have neither a Windows box nor Acrobat, then you could send me test files
and i can try them for you.

For sample documents i'll set hi on the CC list; he's QA's resident expert for
PDF testing and probably has a lot of sample files available.

The \nendobj\n is a requirement of PDF/A, yes. Currently we overachieve in that
area for readability reasons (helps debugging :-)). I just mentioned this
because one could save space with that, too, but since we want to support PDF/A
at some point, we should not break what is already done.

There already is a MARK function that does something similar; i suggest
implementing a similar function (DEBUGCOMMENT or so) that will not begin a
marked content sequence, but simply call emitComment (see the MARK function) if
OSL_DEBUG_LEVEL > 1 and does nothing else.
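
The idea of such a function can be sketched as follows. This is a Python analogy, not the OOo C++ code: `DEBUG_LEVEL` is a hypothetical stand-in for the compile-time `OSL_DEBUG_LEVEL`, and `emit_comment` and `debug_comment` are illustrative names whose real signatures are not shown in this thread. The only PDF fact relied on is that `%` starts a comment running to the end of the line.

```python
import io

DEBUG_LEVEL = 2  # hypothetical stand-in for OSL_DEBUG_LEVEL


def emit_comment(out, text: str) -> None:
    """Write one PDF comment line to the binary stream `out`."""
    # Keep the comment on a single line so it cannot leak into PDF syntax.
    one_line = text.replace("\r", " ").replace("\n", " ")
    out.write(b"%" + one_line.encode("ascii", "replace") + b"\n")


def debug_comment(out, text: str, level: int = DEBUG_LEVEL) -> None:
    """Emit a comment only when the debug level is above 1; otherwise do nothing."""
    if level > 1:
        emit_comment(out, text)
```

With a level of 1 or below the call writes nothing, so a non-debug export is unchanged.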

Regarding your JCA: i asked mh and st and they told me they would handle that. I
think it would be best if you contacted mh directly about your JCA; if they have
not found it by now, i think it will need to be sent again, but please ask mh
about the details for i don't know much about that.

Comment 6 Giuseppe Castagno (aka beppec56) 2006-03-17 13:21:01 UTC

->pl. Hunted for Acro PalmOS, useful if I had a palmtop, but I haven't...
Tried to install it on a Win box (the laptop I use is dual boot Linux and Win);
it complains about Palm Desktop missing. I'm no Palm OS expert, so I gave up.

I'm going to send you directly the PDF for testing.

Dropped a few lines to mh for the JCA issue.

Comment 7 Giuseppe Castagno (aka beppec56) 2006-03-19 19:43:18 UTC

->pl. Attached is the patch to clean up the PDF file a little.
The PDF 1.4 spec (ch. 3.1.1) says that the use of white space for indentation
in the examples is not recommended; the eol used there is a recommended
convention (not mandatory!); elsewhere the character '/' is explained to be a
token separator. The only place where an eol is mandatory is the xref table.
From all this I did an aggressive cleaning, joining lines together and using '/'
instead of spaces to separate tokens, except where the separation was meant to
be a 'whitespace character' as they call it (e.g. between strings and numbers).
I left the eol in place where I wasn't sure about changing it.

Of course I didn't change the 'obj', 'stream', 'endstream' and 'endobj' tokens;
in this case I only changed the eol to a single LF char.

Let me know if you think this is too aggressive an approach and I will revert
to a less aggressive one, e.g. leaving the recommended convention for arranging
tokens into lines (not used by gs though).

The gain figures depend on the file; they range from 1% to 3% for the untagged
version, and from 5% to 12% for the tagged ones.

The '<<' and '>>' tokens can still be optimized. The specs are unclear on this,
but gs optimizes '<<' (e.g. it emits: /Resources<</ProcSet[/PDF /ImageC /ImageI
/Text]) but not '>>'; let me know what you think.

I sent a tagged version to your e-mail for testing.

Comment 8 Giuseppe Castagno (aka beppec56) 2006-03-19 19:44:37 UTC

Created attachment 35032 [details]
patch to clean spaces, changing eol

Comment 9 philipp.lohmann 2006-03-20 11:38:37 UTC

committed the patch to CWS pdfexportimprove

as to aggressiveness of the patch: if we're going to remove the whitespace for
pretty printing at all then in my opinion we can go the whole way. So i fully
agree with your approach. Regarding the "<<" and ">>": in chapter 3.1.1 of the
PDF reference it says '<' and '>' are delimiter characters, so there does not
seem to be a reason not to use them as such. The only limitation we should see
is the 255 characters per line limit mentioned in chapter 3.4, but i did not see
a place in your patch where that was affected.
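
The rule discussed in this comment and in Comment 7 (whitespace is needed only between two tokens that would otherwise run together) can be sketched like this. It is an illustration in Python with hypothetical names, not the patch itself; the delimiter and whitespace sets follow the character classes of the PDF reference, and every token is assumed to be non-empty.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"


def is_regular(byte: int) -> bool:
    """True for a byte that is neither PDF whitespace nor a delimiter."""
    return byte not in PDF_WHITESPACE and byte not in PDF_DELIMITERS


def join_tokens(tokens: list) -> bytes:
    """Join non-empty byte tokens, adding a space only between two regular bytes."""
    out = bytearray()
    for token in tokens:
        if out and is_regular(out[-1]) and is_regular(token[0]):
            out += b" "
        out += token
    return bytes(out)
```

With this rule `/Parent 3 0 R` keeps its spaces, while names, `<<`, `>>`, `[` and `]` are packed together, one step tighter than the gs output quoted in Comment 7. A real writer would still have to respect the 255 characters per line limit.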

Comment 10 Giuseppe Castagno (aka beppec56) 2006-03-21 09:12:16 UTC

->pl. I'm going to attach the latest patch.
I noticed that an eol was added before the endstream token (around line 760);
it was neither in my patch nor in the original source file.
I added it while resyncing the CWS locally, here at my end, and this patch
doesn't change it back.
I checked the PDF emitted when compiling in debug mode and discovered that
there are already a lot of comments in the PDF file, so there is no need to add
the function.
I think this patch should be the last one needed for this issue.

Comment 11 Giuseppe Castagno (aka beppec56) 2006-03-21 09:13:00 UTC

Created attachment 35084 [details]
latest patch, see above

Comment 12 philipp.lohmann 2006-03-21 09:54:35 UTC

committed the patch

sorry, i forgot to mention that i added that newline before the endstream; that
is another requirement for PDF/A (like obj/endobj) and i thought i'd add this
minor change while you're at the whitespace anyway.
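
The newline before endstream can be sketched as follows. This is a Python illustration with a hypothetical function name, and it assumes a direct /Length entry (how PDFWriterImpl writes the length is not shown in this thread). Per the PDF reference, the EOL between the data and endstream is not counted in the stream length.

```python
def frame_stream(payload: bytes) -> bytes:
    """Wrap `payload` as a stream with an EOL before the endstream keyword."""
    header = b"<</Length %d>>" % len(payload)  # length covers the payload only
    return header + b"\nstream\n" + payload + b"\nendstream\n"
```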

Comment 13 Giuseppe Castagno (aka beppec56) 2006-03-21 22:44:53 UTC

->pl. I'm going to attach the (latest ?) patch; this time I stripped the
whitespace around the '<<' and '>>' tokens, along with joining together some
lines of emitted PDF.
I found another endstream that doesn't have an eol in front; I didn't add the
eol because I'm not sure about PDF/A. It's in line 6946 after the following
patch is applied.

Comment 14 Giuseppe Castagno (aka beppec56) 2006-03-21 22:45:42 UTC

Created attachment 35110 [details]
patch, see text and above

Comment 15 philipp.lohmann 2006-03-22 10:54:31 UTC

good catch, i added that newline after committing the patch.

Comment 16 philipp.lohmann 2006-03-24 12:39:40 UTC

reassign for verification

Comment 17 philipp.lohmann 2006-03-24 12:41:38 UTC

pl->hi: please verify in CWS pdfexportimprove

Comment 18 h.ilter 2006-03-29 14:49:20 UTC

Verified with cws pdfexportimprove = ok; the exported file is smaller and shows
fewer blanks in an editor.

Comment 19 h.ilter 2006-05-17 13:18:50 UTC

Still ok in 680m169