Parca output differences between machines #1154

Open

tahorst opened this issue Aug 18, 2021 · 2 comments
Labels: bug, runtime environment (python versions, pips, ports, and such)

Comments

@tahorst (Member) commented Aug 18, 2021

Parca output can vary between machines with our current environment and installation. I can run the parca without issue locally at 86cfdc5, but the PR builds failed (#1153). Comparing output between local and Sherlock parca runs shows that the first difference comes from the RNA degradation fitting.

```
*** sim_data_basal_specs.cPickle ***********************************************
{'process': {'rna_decay': {'Km_convergence': Arrays are not equal
                               Mismatched elements: 953 / 4687 (20.3%)
                               Max absolute difference: 1.13686838e-13
                               Max relative difference: 5.3754935e-14
                               x: array([0.998273, 0.998738, 0.995753, ..., 0.999113, 0.996...,
                           'stats_fit': {'LossKmOpt': (8.561869302425862e-08, 8.561764486270107e-08),
                                         'ResEndoRNKmOpt': (2.341729832266992e-08, 2.341730054311597e-08),
                                         'ResKmOpt': (1.2195148169080738e-05, 1.219514922468079e-05),
                                         'ResScaledKmOpt': (5.918788952814683e-17, 5.918789444820079e-17)}},
             'transcription': {'rna_data': (STRUCTURED ARRAY:
array([('EG10001_RNA[c]', 0.0020211, 1080, [230, 283, 321, 246], 347300.795, True, False, False, False, False, False, False, False, False, 'EG10001', 0.00023125, 339922, True),
...,
                                            STRUCTURED ARRAY:
array([('EG10001_RNA[c]', 0.0020211, 1080, [230, 283, 321, 246], 347300.795, True, False, False, False, False, False, False, False, False, 'EG10001', 0.00023125, 339922, True),
...)}}}
==> lines of differences: 11
```

Looking at the stats_fit difference, there is no difference in the values before fsolve (eg LossKm), so I would expect the difference comes from fsolve in SciPy. fsolve wraps compiled MINPACK code, so my guess is that the compiled code differs between the installations on the two machines.
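
A minimal way to check that hypothesis would be to run `scipy.optimize.fsolve` on identical, fixed inputs on both machines and diff the printed digits. The residual function below is just a hypothetical stand-in, not the actual RNA decay loss function:

```python
# Hypothetical cross-machine reproducibility check for scipy.optimize.fsolve.
# Run the same script on both machines and diff the output digit for digit.
import numpy as np
from scipy.optimize import fsolve

def residual(x):
    # Arbitrary nonlinear system standing in for the real loss function.
    return np.array([x[0] ** 2 + x[1] - 11.0, x[0] + x[1] ** 2 - 7.0])

x0 = np.array([1.0, 1.0])
solution, info, ier, msg = fsolve(residual, x0, full_output=True)

np.set_printoptions(precision=17)
print(solution)        # root found by MINPACK
print(info['fvec'])    # residual at the solution
```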

tahorst added the bug and runtime environment labels Aug 18, 2021
@1fish2 (Contributor) commented Aug 19, 2021

Narrowing this to fsolve is a good start.

Would it be straightforward to make a small test case (like the dot product test) for quicker hypothesis testing? We could test newer releases of SciPy and (if relevant) NumPy and OpenBLAS. FWIW, the latest SciPy embeds OpenBLAS 0.3.9 while the latest NumPy embeds OpenBLAS 0.3.17.
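
As a rough analogue (assuming the dot product test just fingerprints a BLAS result on fixed inputs), something like this could be run under each SciPy/NumPy/OpenBLAS combination:

```python
# Fingerprint a fixed-seed matrix product so results can be compared across
# machines and library versions; any digest mismatch flags a BLAS difference.
import hashlib
import numpy as np

rng = np.random.RandomState(0)    # fixed seed gives identical inputs everywhere
a = rng.random_sample((500, 500))
b = rng.random_sample((500, 500))

digest = hashlib.md5(np.dot(a, b).tobytes()).hexdigest()
print(np.__version__, digest)
```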

It looks like SciPy embeds the MINPACK Fortran source code and those sources haven't changed in 3 years. A difference could be due to the Fortran compiler and its compilation switches.

Are you running on the lab's compute nodes rather than whatever CPUs are in the newest nodes?

A possible workaround could be to run in a container on Sherlock. Singularity on Sherlock's CentOS might actually be able to run a Docker image, and there's also a docker2singularity tool.

@tahorst (Member, Author) commented Aug 19, 2021

Actually, I was incorrect and don't think fsolve is the culprit. We aren't saving the output of LossFunctionP (which is used in fsolve) in stats_fit. That function actually returns different results on different machines when passing Kmcounts to it, and I've checked that the Kmcounts values are the same on both. This points to Aesara as the issue, specifically in determining the Jacobian of the loss function.
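
One way to isolate that, assuming Aesara keeps the Theano-style gradient API, would be to compile a small Jacobian and print it at full precision on both machines. The residual expression below is a stand-in, not the real km_loss_function:

```python
# Compile a Jacobian with Aesara and print it at full precision so the output
# can be diffed across machines. The residual expression is a stand-in only.
import numpy as np
import aesara
import aesara.tensor as at

counts = np.array([120.0, 45.0, 310.0, 8.0])   # stand-in for Kmcounts
km = at.dvector('km')
residuals = counts / (counts + km) - 0.5        # arbitrary saturation-style residuals

jac = aesara.gradient.jacobian(residuals, km)
f_jac = aesara.function([km], jac)

np.set_printoptions(precision=17)
print(f_jac(np.array([1.0, 2.0, 3.0, 4.0])))
```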

> Would it be straightforward to make a small test case (like the dot product test) for quicker hypothesis testing? We could test newer releases of SciPy and (if relevant) NumPy and OpenBLAS. FWIW, the latest SciPy embeds OpenBLAS 0.3.9 while the latest NumPy embeds OpenBLAS 0.3.17.

This would be a great idea. We could save one of the cached KM files, which should have the inputs to km_loss_function, as an easy test. Not sure how easy it will be to reproduce on a smaller test case.

> Are you running on the lab's compute nodes rather than whatever CPUs are in the newest nodes?

It was on the newest compute node. It looks like they took away the old ones, so now we only have the one new node.

Maybe using jax or another package for these functions and Jacobians would be better and consistent across environments?
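
For reference, a hedged sketch of what that could look like with JAX; the residual function is hypothetical and the real km_loss_function would replace it. The resulting Jacobian function could then be handed to fsolve via its fprime argument:

```python
# Hypothetical JAX version of the residuals + Jacobian pair; jax.jacfwd builds
# a function computing the Jacobian of the residual vector w.r.t. km.
import jax
import jax.numpy as jnp

counts = jnp.array([120.0, 45.0, 310.0, 8.0])   # stand-in for Kmcounts

def residuals(km):
    return counts / (counts + km) - 0.5          # arbitrary saturation-style residuals

jacobian = jax.jacfwd(residuals)
print(jacobian(jnp.array([1.0, 2.0, 3.0, 4.0])))
```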

1fish2 added a commit that referenced this issue Oct 1, 2021
... per code review feedback.

And bump the validation data tasks up to priority 12 to support their downstream tasks, now that we know the added dependency links fixed the bug.

Any ideas what could cause the exception `ModuleNotFoundError: No module named 'tmpqliqwytx.m60156961fb5c4d3d33cb0876d617bf81987f03cf4a6533e8b2ceef71f39f139c'`?

There were warnings about .aesara cache files "gone from the file system". Bugs in Aesara's caching when run from multiple processes? This (in addition to Issue #1154) is more incentive to try JAX instead or to update to a newer Aesara release.

(I'll revert `ecoli-pull-request.sh` before merging.)
1fish2 added a commit that referenced this issue Oct 2, 2021