Parca output differences between machines #1154

Open

tahorst opened this issue Aug 18, 2021 · 2 comments
Labels: bug, runtime environment (python versions, pips, ports, and such)

Comments

@tahorst (Member) commented Aug 18, 2021

Parca output can vary between machines with our current environment and installation. I can run the parca without issue locally at 86cfdc5, but the PR builds failed (#1153). Comparing output between local and Sherlock parca runs shows that the first difference comes from the RNA degradation fitting.

```
*** sim_data_basal_specs.cPickle ***********************************************
{'process': {'rna_decay': {'Km_convergence': Arrays are not equal
                               Mismatched elements: 953 / 4687 (20.3%)
                               Max absolute difference: 1.13686838e-13
                               Max relative difference: 5.3754935e-14
                               x: array([0.998273, 0.998738, 0.995753, ..., 0.999113, 0.996...,
                           'stats_fit': {'LossKmOpt': (8.561869302425862e-08, 8.561764486270107e-08),
                                         'ResEndoRNKmOpt': (2.341729832266992e-08, 2.341730054311597e-08),
                                         'ResKmOpt': (1.2195148169080738e-05, 1.219514922468079e-05),
                                         'ResScaledKmOpt': (5.918788952814683e-17, 5.918789444820079e-17)}},
             'transcription': {'rna_data': (STRUCTURED ARRAY:
array([('EG10001_RNA[c]', 0.0020211, 1080, [230, 283, 321, 246], 347300.795, True, False, False, False, False, False, False, False, False, 'EG10001', 0.00023125, 339922, True),
...,
                                            STRUCTURED ARRAY:
array([('EG10001_RNA[c]', 0.0020211, 1080, [230, 283, 321, 246], 347300.795, True, False, False, False, False, False, False, False, False, 'EG10001', 0.00023125, 339922, True),
...)}}}
==> lines of differences: 11
```

Looking at the stats_fit difference, there is no difference in the values before fsolve (eg LossKm), so I would expect the difference comes from fsolve in SciPy. fsolve wraps compiled MINPACK code, so my guess is that the compiled code differs between the installations on the two machines.
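
A minimal way to check that hypothesis would be to run `scipy.optimize.fsolve` on identical, fixed inputs on both machines and diff the printed digits. The residual function below is just a hypothetical stand-in, not the actual RNA decay loss function:

```python
# Hypothetical cross-machine reproducibility check for scipy.optimize.fsolve.
# Run the same script on both machines and diff the output digit for digit.
import numpy as np
from scipy.optimize import fsolve

def residual(x):
    # Arbitrary nonlinear system standing in for the real loss function.
    return np.array([x[0] ** 2 + x[1] - 11.0, x[0] + x[1] ** 2 - 7.0])

x0 = np.array([1.0, 1.0])
solution, info, ier, msg = fsolve(residual, x0, full_output=True)

np.set_printoptions(precision=17)
print(solution)        # root found by MINPACK
print(info['fvec'])    # residual at the solution
```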

tahorst added the bug and runtime environment labels Aug 18, 2021
@1fish2 (Contributor) commented Aug 19, 2021

Narrowing this to fsolve is a good start.

Would it be straightforward to make a small test case (like the dot product test) for quicker hypothesis testing? We could test newer releases of SciPy and (if relevant) NumPy and OpenBLAS. FWIW, the latest SciPy embeds OpenBLAS 0.3.9 while the latest NumPy embeds OpenBLAS 0.3.17.
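
As a rough analogue (assuming the dot product test just fingerprints a BLAS result on fixed inputs), something like this could be run under each SciPy/NumPy/OpenBLAS combination:

```python
# Fingerprint a fixed-seed matrix product so results can be compared across
# machines and library versions; any digest mismatch flags a BLAS difference.
import hashlib
import numpy as np

rng = np.random.RandomState(0)    # fixed seed gives identical inputs everywhere
a = rng.random_sample((500, 500))
b = rng.random_sample((500, 500))

digest = hashlib.md5(np.dot(a, b).tobytes()).hexdigest()
print(np.__version__, digest)
```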

It looks like SciPy embeds the MINPACK Fortran source code and those sources haven't changed in 3 years. A difference could be due to the Fortran compiler and its compilation switches.

Are you running on the lab's compute nodes rather than whatever CPUs are in the newest nodes?

A possible workaround could be to run in a container on Sherlock. Singularity on Sherlock's CentOS might actually be able to run a Docker image, and there's also a docker2singularity tool.

@tahorst (Member, Author) commented Aug 19, 2021

Actually, I was incorrect and don't think fsolve is the culprit. We aren't saving the output of LossFunctionP (which is used in fsolve) in stats_fit. That function actually returns different results on different machines when passing Kmcounts to it, and I've checked that the Kmcounts values are the same on both. This points to Aesara as the issue, specifically in determining the Jacobian of the loss function.
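
One way to isolate that, assuming Aesara keeps the Theano-style gradient API, would be to compile a small Jacobian and print it at full precision on both machines. The residual expression below is a stand-in, not the real km_loss_function:

```python
# Compile a Jacobian with Aesara and print it at full precision so the output
# can be diffed across machines. The residual expression is a stand-in only.
import numpy as np
import aesara
import aesara.tensor as at

counts = np.array([120.0, 45.0, 310.0, 8.0])   # stand-in for Kmcounts
km = at.dvector('km')
residuals = counts / (counts + km) - 0.5        # arbitrary saturation-style residuals

jac = aesara.gradient.jacobian(residuals, km)
f_jac = aesara.function([km], jac)

np.set_printoptions(precision=17)
print(f_jac(np.array([1.0, 2.0, 3.0, 4.0])))
```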

> Would it be straightforward to make a small test case (like the dot product test) for quicker hypothesis testing? We could test newer releases of SciPy and (if relevant) NumPy and OpenBLAS. FWIW, the latest SciPy embeds OpenBLAS 0.3.9 while the latest NumPy embeds OpenBLAS 0.3.17.

This would be a great idea. We could save one of the cached KM files, which should have the inputs to km_loss_function, as an easy test. Not sure how easy it will be to reproduce on a smaller test case.

> Are you running on the lab's compute nodes rather than whatever CPUs are in the newest nodes?

It was on the newest compute node. It looks like they took away the old ones, so now we only have the one new node.

Maybe using jax or another package for these functions and Jacobians would be better and consistent across environments?
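
For reference, a hedged sketch of what that could look like with JAX; the residual function is hypothetical and the real km_loss_function would replace it. The resulting Jacobian function could then be handed to fsolve via its fprime argument:

```python
# Hypothetical JAX version of the residuals + Jacobian pair; jax.jacfwd builds
# a function computing the Jacobian of the residual vector w.r.t. km.
import jax
import jax.numpy as jnp

counts = jnp.array([120.0, 45.0, 310.0, 8.0])   # stand-in for Kmcounts

def residuals(km):
    return counts / (counts + km) - 0.5          # arbitrary saturation-style residuals

jacobian = jax.jacfwd(residuals)
print(jacobian(jnp.array([1.0, 2.0, 3.0, 4.0])))
```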

1fish2 added a commit that referenced this issue Oct 1, 2021
... per code review feedback.

And bump the validation data tasks up to priority 12 to support their downstream tasks, now that we know the added dependency links fixed the bug.

Any ideas what could cause the exception `ModuleNotFoundError: No module named 'tmpqliqwytx.m60156961fb5c4d3d33cb0876d617bf81987f03cf4a6533e8b2ceef71f39f139c'`?

There were warnings about .aesara cache files "gone from the file system". Bugs in Aesara's caching when run from multiple processes? This (in addition to Issue #1154) is more incentive to try JAX instead or to update to a newer Aesara release.

(I'll revert `ecoli-pull-request.sh` before merging.)
1fish2 added a commit that referenced this issue Oct 2, 2021