Apache OpenOffice (AOO) Bugzilla – Issue 59576
Can't copy text from PDF exported from OOo
Last modified: 2018-01-18 05:47:21 UTC
1. Create a document in Writer. Type "การติดตั้ง Openoffice.org 2.0".
2. Click the "Export Directly as PDF" button and save the PDF file.
3. Open the PDF file in Adobe Reader 7.0.5.
4. Select all text and copy.
5. Paste into Notepad2. The result is not readable.
6. In Adobe Reader, save the file as text.
7. Open the .txt file in Notepad2.
8. The result is not readable either.
Created attachment 32591 [details] The Thai Writer document.
Created attachment 32592 [details] The Thai PDF document.
Created attachment 32593 [details] Cut and paste from PDF document.
Created attachment 32594 [details] Save as from PDF document.
Is this an Adobe Reader problem or an OOo problem? To demonstrate that this is an OOo problem, I think we need an example of a PDF document with Thai text that can be successfully copied from using Adobe Reader. What happens if you use Acrobat to create the PDF file?
The text pastes correctly into gedit ("GNOME's notepad"). Also, the text pasted into OpenOffice.org is invisible at first but appears after using Format->Default. System: OpenOffice.org 2.0.1rc2, Acrobat Reader 7.0.0, Linux 2.6, x86-32
I downloaded another PDF file to test. When I copy text from it and paste into Notepad2, there is no problem. Please test the PDF file.
Created attachment 32618 [details] Document created with Distiller. There is no problem.
aziem: Which attachment did you test? #32592? I can confirm this problem with both Evince and Adobe Reader 7.0 on Ubuntu 5.10.
Reassigned to hi.
Confirmed.
Sorry, but is anybody able to save a *.txt file with a Thai font? By the way, I was not able to paste text into Notepad2 that I had copied to the clipboard from a webpage like http://www.the-thainews.com. Pasting into OOo was OK. In summary, I don't think that we have a PDF issue here.
The result looks like an ordinary encoding problem to me. (Some Thai-encoding instead of UTF-8) Cannot reproduce on linux when pasting to gedit (from evince). All works fine with both attached PDFs.
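For illustration only, here is a minimal Python sketch of the kind of encoding mismatch suggested above. It is not taken from OOo or from any of the viewers. It assumes the sample word from the report and uses Latin-1 as a stand-in for "the wrong decoder". It only shows what such a mismatch looks like and does not establish that this is the cause of the reported problem.

```python
# Thai sample word from the original report.
text = "การติดตั้ง"

# The same text as bytes in two common Thai-capable encodings.
utf8_bytes = text.encode("utf-8")
tis620_bytes = text.encode("tis-620")

# Reading the bytes with a mismatched decoder (Latin-1 is an assumption
# chosen for the demo) yields unreadable output.
misread_utf8 = utf8_bytes.decode("latin-1")
misread_tis620 = tis620_bytes.decode("latin-1")

# Reading them with the matching decoder restores the text.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_tis620 = tis620_bytes.decode("tis-620")

print(repr(misread_utf8))
print(repr(misread_tis620))
```

If the text was only misdecoded, decoding the same bytes with the right encoding recovers it completely. Characters that are missing or replaced after a correct decode point to a different cause.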
> Sorry but is anybody able to save a *.txt file with thai font?
Yes, most can. You only have to look at it with the right font/encoding (TIS-620 or UTF-8).
> Btw. I was not able to paste a text from clipboard into notepad2 which I've
> copied from an webpage like http://www.the-thainews.com
I do exactly the same with my Notepad2, with the right font configured for Thai. It works OK. Try Notepad instead.
> Paste into OOo was ok
It should be.
> In summary I don't think that we have an pdf issue here.
It is an issue with OOo-generated PDF. Text in the Thai PDF from Acrobat PDFMaker 6 (Distiller 6) in attachment 32618 can be selected and pasted into any program without problems. Text selected and pasted from an OOo-generated Thai PDF has missing or converted characters. I guess this has something to do with how OOo encodes Thai text in PDF. Please investigate it further. This lowers the quality of PDF generated by OOo for, I guess, most CTL scripts.
Reopened for further investigation.
HI->HDU: I'm not able to reproduce the problem with my windows system. Maybe you can.
HDU->PL: I think the problem these PDF viewers are having is that for PDF export we currently just keep Unicode values <= 0xFF in their place... Is it possible to use non-Unicode encodings for other text and their corresponding subsets?
forgot to reassign
The problem is not with most characters, but only with composed characters (those which do not have a bijective Unicode <-> glyph mapping). With the provided PDF I can copy the Thai text easily, apart from the composed glyphs, which would need to result in a Unicode sequence instead of one code. I don't see how we can do that. We'd need to output font subsets with different encodings then, but how would that be possible given that the only thing we have is a SalLayout, which knows only about glyph IDs?
The problems are different according to whether you create the PDF with OpenOffice on Windows or on Linux:
- on Linux, the only problem is with SARA AM (0E33)
- on Windows, any character that is not the first in its cluster is lost
Created attachment 33426 [details] OpenDocument test file
Created attachment 33427 [details] PDF generated on Linux
Created attachment 33428 [details] PDF generated on Windows
On Linux, you can see the problems in the first three lines of my test case. (It's better to use Acrobat Reader to test, rather than evince, since evince has some bugs.) There are actually three problems:
a) The first problem you can see in line 2. In the cut-and-pasted text, the single SARA AM (0E33) has turned into two SARA AMs. What happens is that the ICU layout engine decomposes SARA AM into NIKHAHIT (0E4D) and SARA AA (0E32). The glyph to character mapping returned by ICU associates both the NIKHAHIT and the SARA AA glyphs with the SARA AM character.
b) The second problem you can see in line 1. The last character on the line, which is SARA AA in the PDF, has been turned into a SARA AM in the cut-and-pasted text. This happens because when the PDF writer implementation sees the SARA AM character, it creates an entry in the font with the SARA AA glyph associated with the Unicode character SARA AM; when it sees the SARA AA character, it reuses the font entry because it has the same SARA AA glyph, even though this SARA AA glyph is associated with a SARA AA character.
c) The third problem you can see in line 3. The MAI THO (0E49) in the PDF has turned into another SARA AM. In this case the ICU layout engine decomposes the SARA AM as before, then it swaps the MAI THO and NIKHAHIT glyphs: the three characters NO NEN, MAI THO, SARA AM are mapped into four glyphs, NO NEN, NIKHAHIT, MAI THO, SARA AA. The glyph to character mapping generated by ICU is [0 2 1 2]; in other words, it correctly and unambiguously associates the MAI THO glyph with the MAI THO character. However, IcuLayoutEngine::operator() "smooths" this out to [0 2 2 2] as part of its cluster detection heuristics, so you end up with three SARA AMs.
Note that to make the example in line 3 work properly, when SARA AM is decomposed, the NIKHAHIT glyph in the PDF should not be associated with anything, and the SARA AA glyph should be associated with the SARA AM character.
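The three problems can be reproduced in a small model. The Python sketch below is not OOo or ICU code. The function names, the glyph names, and the "first mapping seen for a glyph wins" rule are assumptions made to mirror the behaviour described in this comment, and `None` stands for "glyph associated with no character".

```python
# Hypothetical model of the described behaviour; not OOo or ICU code.
NO_NEN, MAI_THO, SARA_AA, SARA_AM = "\u0E19", "\u0E49", "\u0E32", "\u0E33"

def build_to_unicode(runs):
    """Build a glyph -> text table; the first mapping seen for a glyph wins.

    Each run is (chars, glyphs, glyph_to_char), where glyph_to_char[i] is
    the index of the character glyph i is attributed to, or None.
    """
    table = {}
    for chars, glyphs, glyph_to_char in runs:
        for glyph, index in zip(glyphs, glyph_to_char):
            table.setdefault(glyph, "" if index is None else chars[index])
    return table

def extract_text(glyphs, table):
    """Recover text the way a viewer would: one table lookup per glyph."""
    return "".join(table[g] for g in glyphs)

# a) SARA AM decomposed into two glyphs that both point back at SARA AM.
line2 = ([SARA_AM], ["nikhahit", "sara_aa"], [0, 0])
# b) A plain SARA AA that is laid out after the entry above already exists.
line1_tail = ([SARA_AA], ["sara_aa"], [0])
# c) NO NEN, MAI THO, SARA AM with the smoothed mapping [0 2 2 2].
line3_glyphs = ["no_nen", "nikhahit", "mai_tho", "sara_aa"]
line3_smoothed = ([NO_NEN, MAI_THO, SARA_AM], line3_glyphs, [0, 2, 2, 2])
# The mapping proposed in the note: NIKHAHIT maps to nothing.
line3_wanted = ([NO_NEN, MAI_THO, SARA_AM], line3_glyphs, [0, None, 1, 2])

result_a = extract_text(line2[1], build_to_unicode([line2]))
result_b = extract_text(line1_tail[1], build_to_unicode([line2, line1_tail]))
result_c = extract_text(line3_glyphs, build_to_unicode([line3_smoothed]))
result_wanted = extract_text(line3_glyphs, build_to_unicode([line3_wanted]))
```

In this model, `result_wanted` is the only case that returns the original three characters. The model does not attempt to fix problem b), which needs separate font entries for the same glyph.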
Created attachment 33429 [details] Result of copying from Linux generated PDF with Acrobat Reader and pasting into gedit
On Windows, the situation is worse. The problem is that for any particular cluster the Uniscribe ScriptShape function tells you which glyphs are part of the cluster and which characters are part of the cluster, but it doesn't tell you which glyph corresponds to which character. Accordingly, UniscribeLayout::GetNextGlyphs only generates a glyph to character mapping for the first glyph in the cluster (the others are mapped to -1). For very complex CTL scripts, the mapping between glyphs and characters in a cluster is not very well-defined, but for Thai it's easy (with the exception of SARA AM). I think the following algorithm should work for Thai: map the first glyph in the cluster to the first character in the cluster; then map the last glyph in the cluster to the last character in the cluster, the last but one glyph to the last but one character, and so on, stopping when you get to the first glyph or first character in the cluster.
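A minimal sketch of the proposed mapping, written in Python for brevity rather than as a patch to UniscribeLayout. The function name and the use of counts instead of real glyph and character arrays are assumptions. Indices are relative to the start of the cluster, and -1 marks a glyph with no character, as in the existing code.

```python
def map_cluster_glyphs(num_glyphs, num_chars):
    """Return, for each glyph in one cluster, its character index or -1."""
    mapping = [-1] * num_glyphs
    if num_glyphs == 0 or num_chars == 0:
        return mapping

    # The first glyph maps to the first character.
    mapping[0] = 0

    # Pair glyphs and characters from the end of the cluster backwards,
    # stopping at the first glyph or the first character.
    glyph, char = num_glyphs - 1, num_chars - 1
    while glyph > 0 and char > 0:
        mapping[glyph] = char
        glyph -= 1
        char -= 1
    return mapping

print(map_cluster_glyphs(3, 3))  # one glyph per character
print(map_cluster_glyphs(4, 3))  # one more glyph than characters
```

With four glyphs and three characters, the second glyph is left unmapped and the rest pair up from the end. Whether that is the right result for every cluster containing SARA AM is exactly the exception noted above, so this sketch should not be read as covering that case.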
The general strategy that the PDF writer implementation uses for supporting recovery of the underlying Unicode text is adding a ToUnicode mapping to the font. Although I think this can be made to work (with a bit of hackery) for Thai, I don't think it will work in the general case for CTL. PDF 1.5 introduces a feature designed to handle this, which allows you to explicitly associate a Unicode string with a particular region of the PDF file; see the ActualText property described in section 10.8.3 of the PDF 1.6 specification.
You're right, that should work. However, one would have to find an algorithm for when to start such a span and when to end it. Also, this could interfere with the overall document structure (aka tagged PDF). What would you suggest should start such an ActualText span and what should end it?
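As a rough illustration of what an ActualText region looks like in a content stream, the Python sketch below wraps a text-showing operator in a Span marked-content sequence. The helper name and the glyph codes in the example are hypothetical, and this is not how OOo's PDF writer is structured. The Unicode string is written as a hex string in UTF-16BE with a leading byte order mark.

```python
def actual_text_span(text, content):
    """Wrap content-stream operators in a Span carrying ActualText."""
    # Unicode text strings in PDF are UTF-16BE prefixed with U+FEFF.
    hex_string = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_string}>>> BDC\n{content}\nEMC"

# NO NEN, MAI THO, SARA AM shown with four made-up glyph codes.
span = actual_text_span("\u0E19\u0E49\u0E33", "<0001000200030004> Tj")
print(span)
```

The sketch says nothing about where such a span should start and end, which is left open in this comment.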
reassigning to the owner of OOo's PDF export magic
target
FYI: As I learned with issue 69645, Acrobat Reader will not even let you select text that is in an ActualText region (probably because there is no way to map parts of the selected display to the equivalent parts of the ActualText). So this would not solve the copy/paste problem. Any more thoughts on this?
Reset assignee on issues not touched by assignee in more than 2000 days.
See related issue https://bz.apache.org/ooo/show_bug.cgi?id=58341