Apache OpenOffice (AOO) Bugzilla – Issue 22811
OO.o Calc - function precision bugs
Last modified: 2013-08-07 15:15:02 UTC
This bug was initially started at: http://bugzilla.gnome.org/show_bug.cgi?id=127771 There is information of interest there. oocalc's numeric precision is pretty awful for a load of common functions. This should be substantially improved, in-line with gnumeric. Functions VAR, LINEST, SKEW, PRODUCT, GEOMEAN, POWER at a minumum should be improved. This bug is part of the Integrated Collaborative Desktop Bounty Hunt. For more information on prizes, contest rules, and other bounty tasks, visit: http://www.gnome.org/bounties/
For reference, I'm posting this info from the other bugzilla page: I am including a comment block from a program file I am about to attach: /* The following main() function has three different sections, where the N - 1 variance is calculated three times using different methods. First duplicates the OpenOffice.org default behaviour. The second duplicates the Gnumeric behavior. The third is a fixed OpenOffice.org behaviour. The OpenOffice.org loss of precision happens, probably intentionally due to the approxAdd(), approxEqual(), approxSub() and similar functions which drop precision. The fixed OpenOffice.org behaviour generates results which are slightly different from Gnumeric results. This is due to the formula used. There is loss of precision in both the Gnumeric way and the OpenOffice way, due to the inherent nature of floating point architecture. They both produce slightly different results. Gnumeric uses the following formula to find the N - 1 variance: s^2 = (1 / (N - 1)) * Sum1..N((Xi - M)^2) where N is the count of the set, Xi is an element of the set, M is the mean of the set. Due to the stack based parsing nature of the OpenOffice.org code, it uses a different formula derived from the above formula, as it does not know the mean M in advance: s^2 = (1 / (N - 1)) * [Sum1..N(Xi^2) - {(Sum1..N(Xi))^2 / N}] */ --------------------------------- The loss of precision is due to these approx*() functions throwing it away. Not knowing OpenOffice.org code so well, I don't know if it would be wise to edit these approx*() functions and "correct" them as they may cause side effects in other places. The approx*() functions were obviously written with a purpose. We could probably patch GetStVarParams() which would fix many statistics functions which depend on it such as VAR(), VARP(), etc. --------------------------------- It appears that for some of these functions, the bad results are due to loss of precision in the approx*() functions in sal/inc/rtl/math.hxx. The presence of these approx*() functions are intentional. But they probably should not be used in the statistics and math functions. Maybe someone on the OO team can shed some more light on this one.
Created attachment 11531 [details] Test case Excel spreadsheet for future regression testing (OpenOffice.org and Gnumeric compatible)
Created attachment 11532 [details] A one-file simple program which shows the different implementations of OpenOffice.org, Gnumeric, and fixed OpenOffice.org code
Created attachment 11533 [details] Patch for OpenOffice.org 1.1.0 attached
The above patch has effect on atleast the following functions: VAR() VARP() STDEV() STDEVP() DEVSQ() ------ Results of the program with 30000, 30000.001 (hardcoded.. bah I could have used arguments for this instead :-) [mukund@naina mukund]$ gcc -O2 -o foo foo.c [mukund@naina mukund]$ ./foo Default OpenOffice.org behaviour 0 Gnumeric behaviour 5.000000002e-07 Fixed OpenOffice.org behaviour 5.000038072e-07 [mukund@naina mukund]$ The results are a bit different due to the different (but equivalent) formulae being used (please see previous posts).
EEK! The patch doesn't work as intended, and it looks like there's something very wrong with GCC's RSE optimization. As an example, look at this file which I'm going to attach next: This produces different results with no optimization, -O and -O2.
Created attachment 11537 [details] GCC bug?
Anyways, I decided to follow what Michael Meeks said and follow the Gnumeric method and formula. The following patch is tested and works. It produces exactly the same output as which Gnumeric produces.
Created attachment 11538 [details] New OpenOffice.org 1.1.0 patch which produces same results as Gnumeric
Please note that the above patch only fixes the following functions: VAR() VARP() STDEV() STDEVP() DEVSQ()
Hi Niklas, one for you ? Frank
Hi Mukund, this is _exactly_ what we were hoping for :-) I'm assuming that any similar problems that exist with: LINEST, SKEW, PRODUCT, GEOMEAN, POWER can be clobbered in the same way, and - AFAICS you may be the 1st to collect a bounty - which would be cool :-)
The second patch looks good. ApproxSub instead of a simple difference was put there for issue 3552, where the problem was rounding errors leading to a small negative variance. As long as we ensure it doesn't get negative again (which, as far as I can see, can't happen with the patch), I think we can do away with the ApproxSub here and accept the non-zero result for the values from 3552. Alternatively, we could use ApproxSub in the final loop, but I don't really think that's necessary. Note: Before we can commit any patch, we need the JCA, see http://www.openoffice.org/contributing.html
The example in issue 3552 =STDEV(-25.916128;-25.916128;-25.916128) returns 0 with the patch applied which is the correct answer.
Created attachment 11554 [details] Patch for OpenOffice.org 1.1.0 for SKEW() and KURT()
Above is a patch for SKEW() and KURT(). Both functions were patched, as they had the same VAR()-like "incorrect" OpenOffice.org code. Both behave like Gnumeric now. Niklas: I quickly ran through that PDF.. I can fax it tomorrow. The definition of "Contribution" is vague. I want contribution to be only the code which I _voluntarily_ submit for inclusion into OpenOffice.org. From that PDF's content, "Contribution" can mean any code written by me which was added to OpenOffice.org by anyone. Please check this up for me, and if possible change the wording or allow me to change the wording in it when I fax it back :-)
I can't give a test case for the above, as I only know the formulae used are different from the original to the patched one, and the original formula can cause errors such as in 3552 among others.
Created attachment 11561 [details] New patch for SKEW(), KURT() and GEOMEAN()
The above patch is a new patch which returns the proper error (invalid arguments - Err502) when the standard deviation of the input arguments is zero, which'd cause a division by zero error. This new patch just changes the error type returned. This only applies to the SKEW() and KURT() functions. The existing GEOMEAN() implementation formula in OpenOffice.org is fine. It does this basically: GeoMean = [Prod_1..N(Xi)]^(1/N) where Xi is an element of the set of 1..N elements. To avoid overflow, it uses logarithms and addition instead of taking the product of the whole set. ---- There was one glaring error in the GEOMEAN() implementation though. It used overloading of variables of the same name and due to the scope of the inner variable, an outer variable (count of items) was not getting incremented when the stack type was svMatrix. I have fixed this by using different variable names.
I don't really like the idea of integrating a change without knowing which problem it solves. You should look for cases where the new code gives better results than the old one. The duplicate "i" variable in ScGeoMean is unfortunate, but doesn't do any harm, each of the loops does use the right variable. Maybe Michael can comment on what the supposed problems with that function are? As to the text of JCA, it has gone through legal review, so we can't just modify it, but doesn't the "intended to be integrated" part already cover your concerns?
There is a problem in ScGeoMean() where in the switch block for the 'svMatrix' case, 'nCount' in the most immediate scope is incremented with the statement 'nCount++;' whereas the nCount in the outer function block has to be incremented for the correct behaviour.
Oh, that one, sorry, yes, but that is already fixed on the "cac" branch (see revision 1.8.40.2), so you can ignore it.
Code "to be integrated" can mean any code written by me out of context. For that matter, even "... Contributor has ever delivered" is grossly implicit with the rest of the sentence. I don't know if you follow the implication. I have run through the wording with our solicitor here and she feels the same way. The following is a better way of putting it (without the **** of course): 1. Contributor owns, and has sufficient rights to contribute, all source code and related material ****which the Contributor voluntarily intended to be compiled or integrated with the source code**** for the OpenOffice.org open source product (the "Contribution") which Contributor has ever delivered, and Sun has accepted, for incorporation into the technology made available under the OpenOffice.org open source project. Obviously this only transfers rights in point #2, but the definition of "Contribution" is made in #1. I'd to get this corrected before I can sign it. I don't like the wording, and don't want to sign anything which'll cover code outside which I voluntarily contribute to OpenOffice.org. I take IP issues seriously. If this is not possible, I'm sorry but I'll have to pull out.
Ah I didn't know this change happened. So as far as I know, there's no issue with GEOMEAN() :-)
mh->mukund: I'll take your comments upon the JCA for internal review. Thank you for your comments, I hope I'll be avle to come back to you soon.
Created attachment 11602 [details] Test spreadsheet found on the Internet with SKEW(), KURT() and GEOMEAN() functions added
Created attachment 11603 [details] Screenshot of test spreadsheet in Gnumeric 1.2.1
Created attachment 11604 [details] Screenshot of test spreadsheet in friend's MS Excel
Created attachment 11605 [details] Screenshot of test spreadsheet in unpatched OpenOffice.org 1.1.0
Created attachment 11606 [details] Screenshot of test spreadsheet in patched OpenOffice.org 1.1.0
Niklaus: If you compare those screenshots, you'll see the precision improvements in the patched version. I cannot produce a case for behaviour similar to issue 3552, but the original SKEW() and KURT() functions calculate variance similar to the original VAR() function [prior to issue 3552] and it is possible to get that kind of issue. Michael: This basically completes the following functions: VAR() VARP() STDEV() STDEVP() DEVSQ() SKEW() KURT() GEOMEAN() // was only audited; it is fine :-) I am going to concentrate on the rest.
That VAR() patch modified DSTDEV(), DSTDEVP(), DVAR() and DVARP() too as a side effect.
I have gone through ScPow() and MatPow() functions and there do not appear to be any bugs with them. They both use the pow() math library function straight away. So POWER() is okay. Gnumeric may have a _very slight_ advantage over OpenOffice.org on some newer platforms (recent distributions of Linux; recent releases of Solaris; no idea about Windows) as it uses the "long double" data type where it is available. Accordingly it also uses the powl() math function. These functions are only likely to be available on C99 platforms. I am not sure if it is a requirement of any C++ specifications and which version of the spec they were introduced in. In any case, switching to "long double" would probably mean a complete overhaul of most of OpenOffice.org's functions. Note that the above is for informational purposes. The precision loss is not significant.
Those differences in the screenshots look very small. Sure, we might use the better calculation of the variance just for correctness' sake, but still I would like to know what the original problems with KURT and SKEW were.
Ok; if we can't find any problems with KURT/SKEW that's great :-) these were more vague suspicions. As for signing the JCA - that's a prerequisite for the bounty - which I'm happy you've completed in spirit at least; Since Novell signed the JCA we'd be most interested in any legal issues you've noticed - it looked fine to us; there is a list of other signees who might have found any obvious problems (that might damage their business); notably Novell/Ximian, SuSE, SGI, Red Hat, Mdk, etc. have all signed it without any probloms cf. http://www.openoffice.org/copyright/copyrightapproved.html) - my feeling is it's fair and reasonable.
Sorry about the lack of activity here. I have been busy with my day job. Please expect a patch for the rest of the functions by the end of this weekend.
mh->mkund: this is the answer I got: "The issue does not take into account the latter part of the sentence, which is intended to qualify the Contributor's intention for the contributed code: "...Contributor has ever delivered,...for incorporation into the technology made available under the OpenOffice.org open source project." The sentence must be read as a whole. The last part of the sentence is intended to qualify the Contributor's intention when delivering the code for inclusion with Openoffice.org."
I've run into what is probably related strangeness, after reading a posting on the OOoForum (http://www.oooforum.org/forum/viewtopic.php?t=7030). Open a spreadsheet, and in one column type 81 and 891. In the next three cells, have A1*A2, A2*A3, and A3*A4. You should get 72171, 64304361, and 4640910037731. In the next cell, multiply the three previous (A3*A4*A5), and in the last cell, enter A6/1.0E11. Make sure the final cell does not appear in scientific notation (requiring at least one decimal place is one way to do this). In the next column, we have a similar progression starting from 8.10E+021and 8.91E+022. Keep all the formulae the same, except in the very last cell use B6/1.0E211. Now comes the wierdness. You'll note that the two values are slightly different -- one ends in a 3, and the other in a 4. There are no rounding issues, as showing decimal places indicates there's not enough to round to make a difference (nothing but zeroes five places out). But if you then subtract one from the other, you get 0. ??? I'll attach a spreadsheet showing the progressions.
Created attachment 14683 [details] 4 - 3 = ... 0??
Additional detail -- OOo 1.1.1
erikanderson3: This is completely different. The difference between the numbers is close to the limit of the numerical precision (the number of decimals you can display is unrelated to calculation precision). In this case, 0 as a result of subtraction is deliberate, so for example 0.1+0.2-0.3 equals zero. mukund: Is there any progress on the subject of JCA?
nn: I had faxed a signed modified copy of the JCA over to Ximian, assigning joint-copyright for this work to Ximian, Inc. They can submit any of the patches linked to this issue, to the OpenOffice.org project.
mukund - I think the assign everything to Novell, and re-assign to Sun isn't going to fly. Can you re-educate your legal department, and submit a JCA in the normal way ? this really shouldn't be controversial
Any news on this? If we can't get a JCA for the patches, I'll have to close this issue, and we'll have to make an own implementation (not that hard, now that it's clear what has to be done, but a pity anyway).
Hi Niklaus I am sorry I had forgotten about this open issue completely. I'll get back on this on Monday. In any case I'll do what's necessary to have these patches available to you for committing next week. Mukund
Hi Mukund, we are still in need for a signed JCA from you. The form is available from http://www.openoffice.org/contributing.html Thanks, Stefan
Hi Niklaus, Stefan A signed JCA has been faxed and the paper copy is on its way to you now. Mukund
While it may take some days until your name is added to the list I can confirm that we received the JCA. Thanks, Mukund. Greetings, Stefan
When I opened the test file distribution-data.xls (posted 2003-11-27) in OOo 1.1.2 on Win X Home I get the following results: Skew A -0.6484780257732190 Skew B -0.0590659837522011 Skew C 0.6621794284723160 Kurt A -0.3352291010461660 Kurt B -0.9647191301261150 Kurt C -0.3126682758511650 Geomean D 0.3385367714135190 Which are different from all of the posted screenshots. These results are closest to the OOo 1.1.0 unpatched image that was posted by mukund. My interest is more along the lines of helping to develop a test file that can be used to test new versions of the calc program. I envision developing a spreadsheet file that can self calculate the 'degree' of agreement with a bench mark standard. (i.e. agrees to the benchmark within X digits) I envision this testing several functions and not just the ones listed in this issue. However, I believe the functions in this issue might be a good place to start since there is a demonstrated difference for them. Are there others with known differences? What is the status of making an official change to how these functions perform their calculations?
I guess this is because the fixes never made it into cvs - they just hit ooo-build though, so in due course your Linux distro should have them for you.
Better late than never: This is now in the child workspace "calc25" and will be part of OOo 2.0. Sorry about the delay. The files are interpr1.cxx 1.31.2.2, interpr3.cxx 1.11.204.1.
reopening to reassign
reassigning to QA for verification
resett fixed
found fixed using cws calc25 on Linux, Solaris and Windows
closed because fix available in OOo1.9m65
*** Issue 40833 has been marked as a duplicate of this issue. ***