Apache OpenOffice (AOO) Bugzilla – Issue 1905
Xcl Imp: Performance with large file
Last modified: 2013-08-07 15:12:27 UTC
Ok, here is the issue posting for the thread I started, "Torture-testing OpenOffice", on the discuss mailing list. The URL link is a 20MB zip of the original Excel spreadsheet. At the end of this issue are URLs for Win2k users showing the performance of SO/OOo and Excel in comparison opening the same CSV file. This is a complex issue, and I hope that posting my three main messages from that thread will serve as sufficient explanation. However, there is a lot of text here, so I will try to summarize:

- SO/OOo will not successfully open large (~50MB) Excel workbooks.
- SO/OOo takes an order of magnitude longer (3 min. vs. 10 sec.) than Excel to open large but reasonably sized data-only worksheets, even when converted to its native format.

Message 1:

My casual interest in OpenOffice was piqued when the StarOffice beta was released, so I thought I'd try some torture-testing. ;-) My boss at work has a huge spreadsheet that he's been using in Excel to analyze some test data. When I say huge, I mean 23,000 rows by 80 columns, with almost every column based on formulas. It is a 50MB file in Excel 2000 format, so it really is huge. Excel opens this file in about 10-15 seconds at 10-20% CPU, using 120MB of RAM.

I tried OpenOffice with the 638c build, and it was ugly. It took about 2 minutes at 100% CPU to get about 35% open, when I stopped it because it was getting close to using 300MB of RAM, sucking my 400MB of swap space completely dry. Now I expect that a lot of this problem is overhead in converting Excel's twisted format, but I can't even get far enough to try the spreadsheet in OO's native format. I don't really have the time (patience, really) to try again with my swap space increased drastically. I like the idea and feel of OO, but if it fails a stress test this miserably, what am I to do? Any suggestions out there?

Message 2:

Well, I've tried some more stuff with the file this morning.
I increased my max paging file size to 1 GB and let OO run in the background at low priority so I could continue working. OO has been running for about two hours now. Its current status is completely unresponsive. It took about 51 minutes of processor time, then dropped to 0-5% CPU usage. I am assuming that this means the actual conversion is complete. I didn't check the memory usage very frequently during the conversion; the max I saw was ~230MB. Since the CPU usage dropped, memory usage has varied from 20-100MB. What is causing the varying memory usage, I have no idea. Since the CPU usage dropped, I have left it as the foreground application for about 3 minutes with no response. The taskbar entry for it still reads "untitled", not the name of the converted Excel spreadsheet. FYI, I'm running Win2K SP2.

I've pretty much written off OO/SO as unsuitable for working in an MS Office environment, but I still have hopes that an all-OO/SO environment would be doable, maybe even pleasant. ;-) The fact that OO/SO performs so horribly with a file that Excel handles well is simply irreconcilable.

Message 3:

I've done yet some more testing. I exported the data-only CSV from Excel and imported it into OO. It took 3 minutes to open, 3 minutes to save in OO format, and 3 minutes to open again in OO format. Those times are CPU times as reported by the Windows Task Manager, and compare to sub-10-second times for Excel. That is a huge margin. I did a quick-and-very-dirty analysis of what 3 minutes means in cycles per cell processed:

  180 seconds * 900,000,000 cycles/second
  ---------------------------------------  =  ~84,000 cycles (instructions) per cell
         24,000 rows * 80 columns

Now first of all, I freely admit that this is a rough analysis. I'm assuming that my PIII 900 can actually get 1 instruction done per clock, which is very optimistic at best.

At first I thought that most of this time was because my system only(?!) has 256MB of RAM, so I did some performance testing with the tool built into Win2k. I looked at five different indicators over the time it took SO to load the CSV and compared it to Excel. If you're running Win2k, I think you'll be able to see the HTML files I've attached that use the performance ActiveX control. Otherwise, you should be able to look at the text and chart it in a spreadsheet app, like Calc for example. ;-) I was actually surprised by the uncharacteristically simple and understandable format MS used for the data.

Anyway, the graphs are telling for SO. It briefly does a lot of paging while, I'm assuming, the file is read from disk, then does almost no paging for the remainder of the time, apparently spending almost all of its time on processing. Please tell me why this much processing needs to be done on a data-only, text-based format? I _really_ hope this is due to some bug, not inherent to the design of the filter or whatever is taking so long, because this really is unacceptable performance.

I have gotten the OK from my boss to post the files; I just need some place to put the ~20MB zip file so I can link the URL in an issue. If you guys have some space I could use, I'd be grateful. I think we'd all like to see some more testing on this issue.

Attached files:
<a href="http://myhome.spu.edu/brycesho/SO_csv_open.htm">StarOffice Performance Graph</a>
<a href="http://myhome.spu.edu/brycesho/XL_csv_open.htm">Excel Performance Graph</a>
Hi, we know about this problem, but we are still in Beta state. I think Daniel knows best about the recent development of the Excel import. Regards, Peter
I'm currently working on performance enhancements and bug fixes for import/export of huge Excel files and will add your file to my tests. But I think the main problem with this file is that it contains huge charts with >23000 data points. This is a known problem of our chart module and will hopefully be fixed in the planned new chart implementation (see http://graphics.openoffice.org/chart.html). Btw: What are the links at the end of your comments? I cannot see anything there.
Reply to Comment 1: The CSV issue is probably more frightening. All that's happening is importing fixed data. Granted, there's a lot of it, but 94 microseconds per cell is horrible. That's an eternity in today's world of computing. Reply to Comment 2: It is most definitely not solely an issue with charting. Please do not miscategorize this issue. I've tried it with and without the charts in the file, and the results were the same.
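The two per-cell figures quoted in this thread (~84,000 cycles in Message 3, ~94 microseconds here) follow from the same arithmetic. A minimal sketch that checks both numbers, assuming the reported 180 s of CPU time on a 900 MHz PIII and a 24,000 x 80 sheet, and the (admittedly optimistic) one-instruction-per-clock simplification:

```python
# Back-of-envelope check of the per-cell import cost reported in this issue.
# All inputs come from the comments above; 1 instruction/cycle is the
# original poster's stated (optimistic) assumption.
cpu_seconds = 180          # reported CPU time to open the CSV
clock_hz = 900_000_000     # PIII 900 MHz
rows, cols = 24_000, 80    # sheet dimensions

cells = rows * cols
cycles_per_cell = cpu_seconds * clock_hz / cells
micros_per_cell = cpu_seconds / cells * 1e6

print(f"{cells:,} cells")                       # 1,920,000 cells
print(f"~{cycles_per_cell:,.0f} cycles/cell")   # ~84,375 cycles/cell
print(f"~{micros_per_cell:.1f} us/cell")        # ~93.8 us/cell
```

Both reported figures are consistent: ~84,000 cycles per cell at 900 MHz is the same quantity as ~94 microseconds per cell.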
target->OOo2.0
Because of limited resources for OOo 2.0, it was decided to shift this task to the next milestone. If somebody can be found who can implement this in time for OOo 2.0, the task will be re-targeted.
Just to let you know: I have a spreadsheet with several charts on it; one of them is a simple EUR/USD change-tracking chart with about 2300 points. The other data are not huge. When I recalculate the spreadsheet, it takes about 45 seconds. After removing the chart, it recalculates instantly. So in my case the problem was the chart...
Someone should really look into this. Here is a blog post from a ZDNet writer: http://blogs.zdnet.com/Ou/?p=102 It shows his test results opening his file. I downloaded his file and confirmed this. Calc can't open the .xml file, so I had to save it as .xls, then import it into Calc and save it as .ods. I changed one column width and hit save, and it took a considerable amount of time to save.
Slow loading is a real problem, not just on Windows but also on Linux. It makes autosaving an even greater issue. To explain: this is being discussed on the users mailing list, and I decided to run my own tests. I created a very large spreadsheet to test with. The file sizes are around 20M (ods):

20M Oct 27 12:13 Large_file_calc_test.ods
52M Oct 27 13:15 Large_file_calc_test.xls

Here are the timing tests that I did. Problems: Gnumeric doesn't support ODF spreadsheets, and KSpread says it will open ODF spreadsheets but crashes on every attempt to open the file.

           Gnumeric   OOo       Excel Native  Excel.ooo.export
  Recalc   17.35      1:24.60   >1s           1:09.73
  Save     2:21.48    4:32.48   17.65         17.73
  Save as  14:35.15   2:40.98   7.64          2:37.22
  Open     11.28      2:47.98   10.27         3:58.45

Also, this topic came up on /. today: OpenOffice Bloated? http://slashdot.org/article.pl?sid=05/10/27/1425232&tid=185

Note, on Linux: a recent kernel upgrade sped things up to the above timings. It was much slower using the same spreadsheet a few days ago. That is at least an improvement. I understand that there is going to be a tradeoff in converting from ODS to the binary format that is necessary to work with, but there must be a way of speeding this process up. The slashdot article posts two links that go into even further optimization issues. If possible, add Linux to the OS field, or I can create a new issue to cover Linux.
I am going to attach a basic version of the spreadsheet I used. This is with most of the rows deleted to minimize the size of the sheet. All that needs to be done to recreate the spreadsheet is to copy row 7 all the way down to row 16501. Sheet 15 is complete as the last cell (F16501) is used for the calculation timing.
Created attachment 31171 [details] Basic test for Large spreadsheet. It needs to have cells copied to work.
My attachment actually requires that all the sheets have their cells copied down to row 16501. I had to modify it further because of the upload size limit.
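For anyone who wants a similarly sized data-only test case without rebuilding the attachment by hand, a short script can generate a CSV directly. This is only a sketch under my own assumptions: the file name, column count, and random numeric values are hypothetical, and a plain CSV cannot carry the formulas the original sheets use, so it only exercises the data-import path discussed in this issue.

```python
# Generate a large data-only CSV (~16,500 rows x 6 columns, matching the
# per-sheet scale described above) for timing spreadsheet import.
# Adjust ROWS/COLS (e.g. 24_000 x 80) to approximate the original case.
import csv
import random

ROWS, COLS = 16_500, 6

with open("large_calc_test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"col{c}" for c in range(COLS)])  # header row
    for _ in range(ROWS):
        writer.writerow([round(random.uniform(0, 1000), 4) for _ in range(COLS)])
```

Opening the resulting file in Calc, Excel, and Gnumeric gives a rough load-time comparison on any machine without needing the 20MB attachment.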
I found this issue, and it is exactly the problem I have at a company that I support. But to my surprise, the issue is from 2001 and was revived in 2005. Now it is 2010 and OpenOffice.org is at version 3.2, which we like. Will this ever be solved? (Sorry for my bad English; I'm a Spanish speaker, and Google Translator is my friend.)