Bug #5429
improve default provenance store performance
Description
Currently there can be some big performance penalties when using Kepler with provenance turned on (which uses HSQLDB by default). It would be great to improve these.
Unless noted, references to workflow execution times below refer to the REAP GDD workflow set to process 200 days of data:
https://code.ecoinformatics.org/code/reap/trunk/usecases/terrestrial/workflows/derivedMETProducts/growingDegreeDays.kar
I see/saw a few issues:
-1) At one point I mentioned Kepler shutdown was taking a very long time. This isn't an issue anymore; shutdown now seems near instant.
0) The pre-initialize stage of workflow execution can take a very long time when running with a large provenance store, and it grows longer w/ each subsequent execution, e.g. up to 15m.
Dan's fixed this issue, I believe w/ r27746. Pre-init is now close to instant or just a few seconds.
1) Execution of the workflow w/ provenance off takes a few seconds. With provenance on, it takes about 4 min to run the first time with an empty provenance store.
2) Subsequent executions of the same workflow take progressively longer to run.
E.g., here are the execution times of 9 runs of the workflow on two different machines:
10.6 MacBook, 2.2GHz Intel Core 2 Duo w/ 4GB RAM:
4:01, 4:03, 3:57, 7:43, 8:07, 8:01, 8:33, 8:10, 8:33
Ubuntu 10.04, dual 3GHz w/ 2GB RAM:
4:03, 4:13, 4:32, 9:13, 12:32, 8:08, 9:54, 9:06, 11:53
3) Startup time can be very long when the prior Kepler invocation ran data/token-intensive workflows. I believe what's happening is that HSQLDB is incorporating the changes in the .log file into the .data file (something seems to be happening w/ the .backup file too). During startup the .data file slowly grows very large (by a lot more than 200mb), the .log file finally drops to near 0, and then the .data file shrinks back down, though to a size larger than where it started. With the default log file max size of 200mb, startup can take on the order of 10-20m. I've tested w/ a variety of log file sizes: making the limit dramatically smaller, e.g. 5mb, dramatically improves startup time, but comes at a huge workflow execution time penalty (~20m to run the wf), so this is an unacceptable fix. The execution penalty starts appearing when the log file max size is set below about 100mb, and even with a 100mb log file, startup is still very slow. (A sketch of adjusting the log size is below.)
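For anyone experimenting with this, here's a minimal sketch of tuning the log size and merging the .log into the .data file eagerly via plain JDBC. The database path and credentials are hypothetical placeholders, and this assumes HSQLDB 1.8's SQL syntax (SET LOGSIZE, CHECKPOINT); Kepler's actual provenance store location will differ.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TuneHsqlLogSize {
        public static void main(String[] args) throws Exception {
            // HSQLDB 1.8-era JDBC driver class.
            Class.forName("org.hsqldb.jdbcDriver");
            // Hypothetical path/credentials; substitute the real
            // provenance store location.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hsqldb:file:/path/to/provenanceDB", "sa", "");
            Statement stmt = conn.createStatement();
            // Cap the .log file at 200mb (the default). Per the tests
            // above, limits below ~100mb avoid the slow startup but make
            // workflow execution dramatically slower.
            stmt.execute("SET LOGSIZE 200");
            // Fold pending .log changes into the .data file now rather
            // than paying that cost at the next startup; CHECKPOINT DEFRAG
            // would additionally compact the .data file.
            stmt.execute("CHECKPOINT");
            stmt.close();
            conn.close();
        }
    }

Note that a CHECKPOINT at shutdown (or periodically) would only move the merge cost out of startup rather than eliminate it, so this is a mitigation, not a fix.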
One thing I've found that improves execution time is increasing the 'memory cache exponent' setting (hsqldb.cache_scale) from the default of 14 to the max of 18. This setting "indicates the maximum number of rows of cached tables that are held in memory, calculated as 3*(2^value) (three multiplied by two to the power value). The default results in up to 3*16384 rows from all cached tables being held in memory at any time."
With a 200mb log file max size and cache_scale=18, the first run of the workflow takes about 2:17.
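To make the formula concrete: the default of 14 allows up to 3*(2^14) = 49,152 cached rows in memory, while 18 allows 3*(2^18) = 786,432. Here's a minimal sketch of setting the property via JDBC; the path/credentials are again hypothetical placeholders, and in HSQLDB 1.8 this particular property only takes effect the next time the database is opened.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TuneHsqlCacheScale {
        public static void main(String[] args) throws Exception {
            Class.forName("org.hsqldb.jdbcDriver");
            // Hypothetical path/credentials; substitute the real
            // provenance store location.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hsqldb:file:/path/to/provenanceDB", "sa", "");
            Statement stmt = conn.createStatement();
            // Raise the cached-row ceiling from the default 3*(2^14) =
            // 49,152 rows to the maximum 3*(2^18) = 786,432 rows.
            // Takes effect at the next database startup.
            stmt.execute("SET PROPERTY \"hsqldb.cache_scale\" 18");
            stmt.close();
            conn.close();
        }
    }

The same value could also be set by editing hsqldb.cache_scale in the database's .properties file while the database is offline.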