Apache OpenOffice (AOO) Bugzilla – Issue 69635
Writer cannot open .xhtml files, nor can it parse XHTML properly
Last modified: 2013-02-07 22:35:24 UTC
How do you do? I have searched the issue database and found some issues about XHTML, but they didn't go into much detail on what I'd like to point out, so I'm submitting this. I haven't tested it with other applications from the OOo suite besides Writer, but I could check if need be. What I've seen under my Kubuntu installation is that Writer is unable to import files that have the .xhtml extension. If I change the extension to .html, Writer loads the file as if it were an HTML tag-soup monstrosity, complete with an unparsed <?xml ?> declaration at the top. This is not very noticeable when dealing with text only, but I'm sure it will cause problems if one adds anything that actually requires XHTML (MathML, SVG and whatnot). Now, I'm sure it can't be that hard to support an open specification that's been around for years, but if there are no resources to work on this issue, I recommend (if the OOo license allows) using either the Gecko engine or KHTML. I can also attach some XHTML test cases here if it helps.
Reassigned to JSI.
AFAIK there is no import of XHTML but there's an export option. Read this if you're interested: http://xml.openoffice.org/sx2ml/ Changing issue type to ENHANCEMENT
It's a FEATURE, not an enhancement, because no XHTML import filter exists and it isn't defined how one should work (with the same technology as the export, and with the same limitations?). This has to be decided by the requirements team.
@ja Thank you. That was interesting, but yes, I believe OOo needs import support for XHTML, as well as proper rendering. Let's hope then that the requirements team agrees with me.
Created attachment 44170 [details] a simple example -- shows up in 2.2 just as the source code
XHTML docs still show up in Writer as just the source in 2.2.1.
*** Issue 69635 has been confirmed by votes. ***
Importing (X)HTML worked fine in OOo 2.0.2 as shipped with Ubuntu 6.06 LTS, so this should be "just" a bug in the current OOo 2.3. I believe this is an important feature, as it allows data processing apps to produce documents programmatically and convert them into nice docs with OpenOffice styles. This is also a (working) feature in MS Word.
I forgot to mention that I stripped the headers from the (X)HTML file. Only the part BETWEEN the body and /body tags was given to, imported and processed by OOo 2.0.2, but the h and p tags were nicely converted to headings and plain text. Still very useful.
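The header-stripping workaround described above could be scripted roughly as follows. This is only a minimal sketch: the function name extract_body and the sample document are made up for illustration, and the regular expression assumes a single, well-formed body element. Whether a given OOo version imports the resulting fragment cleanly has not been verified here.

```python
import re


def extract_body(xhtml_text: str) -> str:
    """Return the markup between <body ...> and </body>.

    Hypothetical helper: if no body element is found, the input is
    returned unchanged.
    """
    match = re.search(
        r"<body\b[^>]*>(.*?)</body\s*>",
        xhtml_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return match.group(1).strip() if match else xhtml_text


# Made-up sample input; not the attachment from this issue.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Report</title></head>
<body>
<h1>Heading</h1>
<p>Plain text.</p>
</body>
</html>"""

fragment = extract_body(sample)
print(fragment)
```

The printed fragment contains only the h1 and p elements and can be saved with an .html extension before opening it in Writer.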
Given that OOo seems to handle the *contents* of XHTML files correctly (i.e. everything but the XHTML header, as noted above by batavus), it seems that OOo only needs to change how it handles this header, and then add the .xhtml extension to the list of openable filetypes, and therefore this bugfix should theoretically be exceedingly simple. Are there any devs following this issue that could give us a rough target milestone forecast? Would 3.2 be reasonable to hope for?
If it's really only a bug in parsing the header, and if we accept the limitations of the current HTML import filter and just add the requirement that no unparsed XML content may end up in the imported document, then yes, that could be low-hanging fruit.
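The requirement that no unparsed XML content reaches the imported document can be illustrated as a preprocessing step outside the filter. This is a sketch under stated assumptions and not OOo's actual filter code: the name strip_xml_declaration is hypothetical, and only a leading <?xml ... ?> declaration is removed, nothing else.

```python
import re


def strip_xml_declaration(text: str) -> str:
    """Remove a leading <?xml ... ?> declaration, if one is present.

    Hypothetical helper for illustration; everything after the
    declaration is left untouched.
    """
    return re.sub(r"\A\s*<\?xml\b.*?\?>\s*", "", text, count=1, flags=re.DOTALL)


with_decl = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body><p>Hi</p></body></html>'
without_decl = "<html><body><p>Hi</p></body></html>"

cleaned = strip_xml_declaration(with_decl)
unchanged = strip_xml_declaration(without_decl)
print(cleaned)
```

A document without a declaration passes through unchanged, so the step is safe to apply to plain HTML input as well.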