Apache OpenOffice (AOO) Bugzilla – Issue 22253
FreeBSD startup problem: GetStorage and non-existent soffice.cfg
Last modified: 2009-06-16 13:14:22 UTC
When OOo fix1 starts on FreeBSD, an error message is shown and OOo exits. The error message is at http://tmp.janik.cz/freebsd-getstorage-error.png
I will work on it.
The attached patch is a simple workaround (warning: this is not meant as a final solution!) for this problem.
Created attachment 11054 [details] Workaround, not for integration!
Daniel, do you have an idea what I should try next? I added you to CC:.
We have found out that the exception itself is thrown perfectly, but there seems to be an issue with linking across module boundaries :-( This is something I cannot solve myself :-(
I'm going to test the whole thing on an older FreeBSD system to see whether the problem is present in FreeBSD 4.9 too (I use 5.1 right now).
@Pavel: I assume this is a linking problem in which two different addresses are used for the same symbol at runtime: the one that the catch handler tests against, and the one the libgcc3_uno.so bridge obtains by calling dlsym( app_handle, "<RTTI-name>" ). I assume those two are different. So it is up to you to test whether different addresses are resolved at runtime. Dump the dlsym'ed address as well as the one resolved by ld.so, using the LD_DEBUG environment variable. Run "setenv LD_DEBUG help" to get all options.
ld.so on FreeBSD supports only LD_LIBRARY_PATH, LD_PRELOAD, LD_BIND_NOW and several others. LD_DEBUG is not supported. Thus I added some debugging manually to bridges/source/cpp_uno/gcc3_freebsd_intel/except.cxx near dlsym:

    OString symName( buf.makeStringAndClear() );
    rtti = (type_info *)dlsym( m_hApp, symName.getStr() );
    fprintf(stderr, "PJ: %s %p (%s)\n", symName.getStr(), rtti, dlerror());

When I run unchanged OOo from fix1, I get:

    PJ: _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE 0x0 (Undefined symbol "_ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE")
    PJ: _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE 0x28998334 ((null))
    pavel@irtos:~/OpenOffice.org1.1.0>

I.e. dlsym cannot find the symbol _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE.

But:

    libsfx645fi.so:
    003860d8 V _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    --
    libucpfile1.so:
    00050f2c V _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE

On GNU/Linux, the same debug fprintf prints:

    PJ: _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE 0x42c1a314 ((null))

i.e. it found the symbol.
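For readers who want to repeat this kind of tracing outside the OOo tree, here is a minimal sketch in plain C. The helper name debug_dlsym and the "no error" placeholder are assumptions for illustration, not part of the bridge code. It clears any stale dlerror() state before the lookup and avoids passing a NULL pointer to %s, which the quoted fprintf relies on the C library to tolerate.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not from except.cxx): look up `name` through
 * `handle` and log the address together with the dlerror() text. */
void *debug_dlsym(void *handle, const char *name)
{
    void *addr;
    const char *err;

    (void)dlerror();              /* discard any earlier, unrelated error */
    addr = dlsym(handle, name);
    err = dlerror();              /* NULL when the lookup succeeded */

    fprintf(stderr, "PJ: %s %p (%s)\n", name, addr, err ? err : "no error");
    return addr;
}
```

Calling debug_dlsym(dlopen(NULL, RTLD_LAZY), "<mangled RTTI name>") would mirror the lookup the bridge performs on the application handle.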
@Pavel: Good work! This makes sense. Nevertheless it is somehow strange that some symbols are found and others are not. Pavel, please try to get the application handle another way. Change

    : m_hApp( dlopen( 0, RTLD_LAZY ) )

to

    : m_hApp( dlopen( 0, RTLD_NOW | RTLD_GLOBAL ) )

A second try may be to find out the differences between the InteractiveIOException (which works) and the InteractiveAugmentedIOException (which does not work) symbols. Maybe "nm -D *.so" shows differences. You can also use the GNU tool "objdump -T *.so", which includes versioning info.
I did the first change:

    PJ: _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE 0x0 (Invalid shared object handle 0x660061)
    PJ: _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE 0x0 (Invalid shared object handle 0x660061)
    PJ: _ZTIN3com3sun4star4task28ClassifiedInteractionRequestE 0x0 (Invalid shared object handle 0x660061)
    PJ: _ZTIN3com3sun4star3uno9ExceptionE 0x0 (Invalid shared object handle 0x660061)
    crash_report: not found
    Fatal exception: Signal 6
    Stack:
    Abort trap (core dumped)

nm output for both symbols:

    pavel@irtos:~/OpenOffice.org1.1.0> nm -D program/* 2>/dev/null|egrep "_ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE|:"|grep -B1 _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    program/libsfx645fi.so:
    003860d8 V _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    --
    program/libucpfile1.so:
    00050f2c V _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    pavel@irtos:~/OpenOffice.org1.1.0> nm -D program/* 2>/dev/null|egrep "_ZTIN3com3sun4star3ucb22InteractiveIOExceptionE|:"|grep -B1 _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    program/libfileacc.so:
    0000ba28 V _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    --
    program/libsfx645fi.so:
    00376854 V _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    --
    program/libsot645fi.so:
    00044240 V _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    --
    program/libucpfile1.so:
    00050f50 V _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    --
    program/libutl645fi.so:
    0007b334 V _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    pavel@irtos:~/OpenOffice.org1.1.0>

objdump output:

    pavel@irtos:~/OpenOffice.org1.1.0> objdump -T program/* 2>/dev/null|grep _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    0000ba28 w DO .data 0000000c UDK_3_0_0 _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    00376854 w DO .data 0000000c Base _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    00044240 w DO .data 0000000c Base _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    00050f50 w DO .data 0000000c UDK_3_0_0 _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    0007b334 w DO .data 0000000c Base _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE
    pavel@irtos:~/OpenOffice.org1.1.0> objdump -T program/* 2>/dev/null|grep _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    003860d8 w DO .data 0000000c Base _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    00050f2c w DO .data 0000000c UDK_3_0_0 _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE
    pavel@irtos:~/OpenOffice.org1.1.0>

When I inspected this much more deeply, I found a small difference in the _end symbols, and this rang a bell for me. We have already met a (probably) similar issue: a missing _end in the map file.

These files do not contain _end on GNU/Linux, but contain it on FreeBSD:

    libcppuhelper3gcc3.so
    libcppuhelpergcc3.so
    libcppuhelpergcc3.so.3
    libcppuhelpergcc3.so.3.1.0

This is because of our patch ftp://ftp.linux.cz/pub/localization/OpenOffice.org/devel/build/Patches/OOo_1.1.0_source-FreeBSD-temp-add_end.diff

This file does not contain the _end symbol on GNU/Linux but contains it on FreeBSD:

    libucpfile1.so

Maybe the last file?

I put the full objdump dumps from FreeBSD and from GNU/Linux at

    http://tmp.janik.cz/objdump/objdump-FreeBSD.log.gz
    http://tmp.janik.cz/objdump/objdump-GNU_Linux.log.gz

(quite long files). A grep for those exceptions is in http://tmp.janik.cz/objdump/objdump.exceptions

Looks similar ;-)
Created attachment 11240 [details] better addsym.awk for FreeBSD
@Pavel: I added a better addsym.awk version as an attachment. Please relink your libs with this one. Just a try... It seems to be a strange problem, but maybe it's the _end problem.
I did a full clean rebuild of the tree with the new script and:

    pavel@irtos:~/OpenOffice.org1.1.0> ./soffice
    PJ: _ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE 0x0 (Undefined symbol "_ZTIN3com3sun4star3ucb31InteractiveAugmentedIOExceptionE")
    PJ: _ZTIN3com3sun4star3ucb22InteractiveIOExceptionE 0x28998334 ((null))
    pavel@irtos:~/OpenOffice.org1.1.0>

I.e. the same :-) I have started the same build on an older FreeBSD system (4.9, I use 5.1 now) so we can compare those two versions of FreeBSD. Maybe this brings some new info.
I think I've found the problem: it seems to be a bug/feature in the *BSDs' dynamic linker ld.elf_so. (The code isn't exactly the same on FreeBSD and NetBSD, but both are derived from the same original implementation, and most improvements in one project have also been added to the other.) When a dlsym() search on the main program is performed (as the bridge does), only the shared libraries loaded at start time are searched, not the ones opened via dlopen(). The symbol ...InteractiveAugmentedIOException is defined only in libsfx645bi.so and libucpfile1.so, which are both loaded via dlopen() calls, so the symbol won't be found. The symbol ...InteractiveIOException, however, is also defined in libutl645bi.so, which soffice.bin is linked against, so this one will be found.
Great investigation! What will we do with this? Should I ask FreeBSD specialists? BTW - the same error is present on FreeBSD 4.9.
BTW - I just tried to check your statement by calling dlsym on a dlopen()ed library, using the example from GNU/Linux's dlopen manual page:

    #include <stdio.h>
    #include <stdlib.h>
    #include <dlfcn.h>

    int main(int argc, char **argv) {
        void *handle;
        double (*cosine)(double);
        char *error;

        handle = dlopen ("libm.so", RTLD_LAZY);
        if (!handle) {
            fprintf (stderr, "%s\n", dlerror());
            exit(1);
        }

        cosine = dlsym(handle, "cos");
        if ((error = dlerror()) != NULL) {
            fprintf (stderr, "%s\n", error);
            exit(1);
        }

        printf ("%f\n", (*cosine)(2.0));
        dlclose(handle);
        return 0;
    }

And it works (FreeBSD 4.9):

    pavel@leda:~> ./a.out
    -0.416147

So?
You missed a small, but very important detail of my statement: "dlsym() search _on_the_main_program_", i.e. the handle is not the dlopen()ed shared library itself, but the main program (which you get back with dlopen(NULL, RTLD_LAZY)).

I've modified your test program a bit:

    #include <stdio.h>
    #include <stdlib.h>
    #include <dlfcn.h>

    int main(int argc, char **argv) {
        void *handle, *handlemain;
        double (*cosine)(double);
        char *error;

        handle = dlopen ("libm.so", RTLD_LAZY|RTLD_GLOBAL);
        if (!handle) {
            fprintf (stderr, "%s\n", dlerror());
            exit(1);
        }

        handlemain = dlopen(NULL, RTLD_LAZY);
        cosine = dlsym(handlemain, "cos");
        if ((error = dlerror()) != NULL) {
            fprintf (stderr, "%s\n", error);
            exit(1);
        }

        printf ("%f\n", (*cosine)(2.0));
        dlclose(handle);
        return 0;
    }

On Linux:

    bash-2.05b$ ./a.out
    -0.416147

while on NetBSD:

    -bash-2.05b$ ./a.out
    Undefined symbol "cos"

I've just carefully reread the NetBSD dlsym manpage and it states that this kind of usage is currently not supported. Ultimately that feature should be added to ld.elf_so, but as it currently isn't, I guess we have to find a workaround. Maybe simulate the feature by hand by remembering all dlopen()ed libraries and searching them as well if the symbol can't be found. If we are lucky, there is only one dlopen() in the whole of OOo, namely the one in sal/osl/unx/module.c.
@mrauch: well done! We can work around this, because the UNO shared lib component loader loads libraries using osl_loadModule(), so this ought to be worked around as you suggested:

- use osl_loadModule() and osl_getSymbol() within the bridge code
- in sal osl_loadModule(): (#if defined BSD) tag the application handle (oslModule) and add all opened handles to a static list
- in sal osl_getSymbol(): (#if defined BSD) extend a search on the application handle to all dlopen()ed modules

One concern I still have:

    componentA.uno.so: V _ZTINMyException
    componentB.uno.so: V _ZTINMyException

Both are loaded, first A, then B. When componentB.uno.so has a catch handler (e.g. catch (MyException &)), which symbol does it resolve? The one from A, or hopefully the one from B? The problem then occurs when the bridge throws an exception using the symbol from componentA.uno.so, because it was loaded first, while componentB.uno.so's catch handler expects its own.
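The list-based workaround outlined above can be sketched in plain C as follows. This is only an illustration under stated assumptions: tracked_open, tracked_sym and MAX_TRACKED are hypothetical names standing in for the osl_loadModule()/osl_getSymbol() changes, the table is fixed-size, and nothing here is thread-safe or handles dlclose().

```c
#include <dlfcn.h>
#include <stddef.h>

#define MAX_TRACKED 256            /* hypothetical fixed capacity */

static void *app_handle;           /* handle returned for dlopen(NULL, ...) */
static void *tracked[MAX_TRACKED]; /* handles of all dlopen()ed libraries */
static size_t tracked_count;

/* dlopen() wrapper: tags the application handle and remembers every
 * library handle it hands out. */
void *tracked_open(const char *path, int mode)
{
    void *h = dlopen(path, mode);
    if (h == NULL)
        return NULL;
    if (path == NULL)
        app_handle = h;
    else if (tracked_count < MAX_TRACKED)
        tracked[tracked_count++] = h;
    return h;
}

/* dlsym() wrapper: a failed search on the application handle is
 * extended to the remembered libraries, in load order. */
void *tracked_sym(void *handle, const char *name)
{
    void *addr = dlsym(handle, name);
    size_t i;

    if (handle != app_handle)
        return addr;
    for (i = 0; addr == NULL && i < tracked_count; i++)
        addr = dlsym(tracked[i], name);
    return addr;
}
```

Because the fallback returns the first match in load order, it has exactly the ambiguity raised in the concern above: with two libraries defining the same RTTI symbol, the copy from the library loaded first is the one returned.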
A friend of mine, Rudolf Cejka, gave me a hint to test. On FreeBSD:

    pavel@leda:~> gcc -o dltest dltest.c
    pavel@leda:~> gcc -o dltestnew dltestnew.c
    pavel@leda:~> ./dltest
    Undefined symbol "cos"
    pavel@leda:~> ./dltestnew
    -0.416147
    pavel@leda:~> diff -u dltest.c dltestnew.c
    --- dltest.c Wed Nov 19 18:57:08 2003
    +++ dltestnew.c Wed Nov 19 18:57:37 2003
    @@ -14,7 +14,7 @@
         }

         handlemain = dlopen(NULL, RTLD_LAZY);
    -    cosine = dlsym(handlemain, "cos");
    +    cosine = dlsym(RTLD_DEFAULT, "cos");
         if ((error = dlerror()) != NULL) {
             fprintf (stderr, "%s\n", error);
             exit(1);
    pavel@leda:~>

RTLD_DEFAULT is described as:

    If dlsym() is called with the special handle RTLD_DEFAULT, the search for the symbol follows the algorithm used for resolving undefined symbols when objects are loaded. The objects searched are as follows, in the given order:

    1. The referencing object itself (or the object from which the call to dlsym() is made), if that object was linked using the -Wsymbolic option to ld(1).
    2. All objects loaded at program start-up.
    3. All objects loaded via dlopen() which are in needed-object DAGs that also contain the referencing object.
    4. All objects loaded via dlopen() with the RTLD_GLOBAL flag set in the mode argument.

Can we use it on *BSD only? @mrauch: could you please test it? I just managed to remove the whole build tree and started from scratch to test something else.

RTLD_DEFAULT is defined on Linux only with __USE_GNU. On Linux:

    pavel@pavel:/tmp> gcc -o dltest dltest.c -ldl
    pavel@pavel:/tmp> gcc -o dltestnew dltestnew.c -ldl
    pavel@pavel:/tmp> ./dltest
    -0.416147
    pavel@pavel:/tmp> ./dltestnew
    -0.416147
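A small portable wrapper around the RTLD_DEFAULT lookup might look like the sketch below. The function name lookup_default is an assumption for illustration. On glibc the macro only becomes visible when _GNU_SOURCE is defined before the first system header (which is what sets __USE_GNU); the guarded fallback covers the case where some header was already included earlier, and uses glibc's own value for the macro.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* glibc: makes <dlfcn.h> define RTLD_DEFAULT */
#endif
#include <dlfcn.h>
#include <stddef.h>

/* Only reached on glibc when _GNU_SOURCE was defined too late to take
 * effect; ((void *)0) is the value glibc itself uses. */
#if !defined(RTLD_DEFAULT) && defined(__GLIBC__)
#define RTLD_DEFAULT ((void *)0)
#endif

/* Hypothetical helper: resolve `name` with the default search order,
 * i.e. the same order used for resolving undefined symbols at load time. */
void *lookup_default(const char *name)
{
    (void)dlerror();              /* clear stale error state */
    return dlsym(RTLD_DEFAULT, name);
}
```

Note that finding a symbol this way says nothing about which copy is found when several loaded libraries define it; the search simply stops at the first object in the order quoted above.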
RTLD_DEFAULT unfortunately solves the problem only partly: the bridge is then able to find the symbol, but OOo nevertheless crashes with the same error box. This seems to be because of the concern Daniel already mentioned: the dlsym() happens to find the symbol in libA.so first, but the catch handler is in libB.so, so the exception still propagates further up the stack. I've then tried to search only the relevant library (i.e. put "libsfx645bi.so" instead of the 0 in dlopen in except.cxx; this works because this is the only place where the bridge is needed during startup), and then OOo starts without further problems. Does anyone know how the linker manages to find the right symbol on Linux? By merging the different vtables into one? I'll have a closer look at ld.elf_so, to see if I can convince it to treat dlopen()ed libraries more like ones loaded as dependencies.
After more debugging: On FreeBSD just using RTLD_DEFAULT should suffice. (@pjanik: Are you able to test this?) On NetBSD there is some slightly different behaviour in the dynamic runtime linker ld.elf_so. As this also makes exception handling in regcomp fail in some cases where the bridge isn't involved (it's pure C++ code), I suspect it's a bug, but I have to see what the NetBSD toolchain experts say. Anyway, changing the behaviour to the FreeBSD one makes the problems disappear for me. I'll now run a full build from scratch over night to check.
Created attachment 11641 [details] use RTLD_DEFAULT
A full build confirmed that it works for NetBSD. I've attached the appropriate patch for FreeBSD to this issue.
This is freebsd / intel specific. Will not affect other platforms. From that point of view approved.
I'm going to compile with this patch and will report results.
After applying the attached patch (freebsd_bridges.patch):

    pavel@leda:~/OpenOffice.org1.1.1> ./soffice
    crash_report: not found
    Fatal exception: Signal 6
    Stack:
    Abort trap (core dumped)

This is the first start of OOo after ./install --single. Thus this patch cannot be used :-(

BTW really interesting:

    pavel@leda:~/OpenOffice.org1.1.1> gdb program/soffice.bin soffice.bin.core
    ...
    Segmentation fault (core dumped)

Everything on FreeBSD 4.9-RELEASE.
The bug is now fixed in NetBSD's dynamic linker (actually already since Dec 7). For FreeBSD I currently have no clue what else could be going wrong, sorry.
Hm, your commit was: http://cvsweb.netbsd.org/bsdweb.cgi/src/libexec/ld.elf_so/symbol.c.diff?r1=1.34&r2=1.35&f=h
Created attachment 12404 [details] same patch for FreeBSD
sorry, ignore my patch.
How to activate LD_DEBUG: add the following line to /etc/make.conf

    CFLAGS+= -DDEBUG

and recompile rtld:

    # cd /usr/src/libexec/rtld-elf/ ; make clean ; make depend ; make ; make install

(martin told me)
Hi, I searched for this issue on Google and found a similar one: http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=5890

Both NetBSD and FreeBSD should have this issue, since /usr/src/libexec/rtld-elf/rtld.c in FreeBSD (http://www.freebsd.org/cgi/cvsweb.cgi/src/libexec/rtld-elf/rtld.c) contains:

    /*
     * XXX - This isn't correct. The search should include the whole
     * DAG rooted at the given object.
     */
    def = symlook_obj(name, hash, obj, false);
    defobj = obj;
    }
    }

NetBSD also has this kind of implementation: http://cvsweb.netbsd.org/bsdweb.cgi/src/libexec/ld.elf_so/rtld.c . According to PR #5890 it seems to be difficult to implement correctly...

--
    /*
     * FIXME - This isn't correct. The search should include the whole
     * DAG rooted at the given object.
     */

which indicates that the correct implementation for ELF is to search both the shared-object specified by the handle passed to dlsym() and any shared objects loaded as a result of loading the object specified by the handle. (the latter is what the Solaris manpage says and what the Solaris implementation does.) Applying the patch in 5890 could cause programs that rely on the Solaris-style behaviour -- which currently work, due to the too-liberal search -- to fail. Michael Hitch notes that "The current way ld.elf_so tracks shared objects would make this type of search somewhat difficult".
--
marking as solved, proper fix is unknown now.
*** Issue 98781 has been marked as a duplicate of this issue. ***
*** Issue 82690 has been marked as a duplicate of this issue. ***
reassign to maho as this is a FreeBSD issue.
.
See the discussion at http://docs.FreeBSD.org/cgi/mid.cgi?20090614.081457.193757375.chat95 . Konstantin replies: http://docs.FreeBSD.org/cgi/mid.cgi?20090614094141.GF23592 . maho checked his patch and it works: http://docs.FreeBSD.org/cgi/mid.cgi?20090615.054654.71139727.chat95 . Konstantin and Alexander again posted a patch to rtld-elf. maho checked their patch and verified that it works. So this is a FreeBSD userland issue.