Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Langevin thermostat is not pickled correctly #4657

Closed
RiccardoFrenner opened this issue Jan 23, 2023 · 5 comments · Fixed by #4677
Closed

Langevin thermostat is not pickled correctly #4657

RiccardoFrenner opened this issue Jan 23, 2023 · 5 comments · Fixed by #4677

Comments

@RiccardoFrenner
Copy link
Contributor

When saving the system via pickle and loading it again, the force due to the langevin thermostat changes.
This should not happen since the result of the simulation then does depend on whether it was checkpointed or not.

I tested it with both the 4.2.0 commit tag and the latest commit (d9cbffc) and an empty myconfig.hpp.

The followind MWE demonstrates the problem:

import pickle
from pathlib import Path

import espressomd

SEED = 6854
CHECKPOINT_PATH = Path("checkpoint.pkl")
LOAD_CHECKPOINT = CHECKPOINT_PATH.exists()


def from_scratch():
    system = espressomd.System(box_l=[10.0]*3)
    system.time_step = 0.01
    system.cell_system.skin = 0.4
    system.cell_system.set_regular_decomposition(
        use_verlet_lists=True
    )
    system.thermostat.set_langevin(kT=1.0, gamma=5.0, seed=SEED)
    system.integrator.set_vv()
    p = system.part.add(pos=[1, 2, 3], id=0)

    system.integrator.run(1)
    print(f"A {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 1

    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(system, f)

    # run one more time to see difference with/without checkpointing
    system.integrator.run(1)
    print(f"B {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 2


def from_checkpoint():
    system = pickle.loads(CHECKPOINT_PATH.read_bytes())
    p = system.part.by_id(0)

    print(f"A {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 1

    system.integrator.run(1)
    print(f"B {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 2


def main():
    if LOAD_CHECKPOINT:
        print("Starting from checkpoint ...")
        from_checkpoint()
    else:
        print("Starting from scratch ...")
        from_scratch()


if __name__ == "__main__":
    main()

When run two times it produces the following output:

Starting from scratch ...
A     0.01, [-44.45572893 -34.81343146  -9.79069977], [1.0024712  1.9985853  3.00140793], [ 0.02484152 -0.31553735  0.09183968]
B     0.02, [-31.38295036  21.00775256  -3.94472096], [1.00049683 1.99368925 3.00183679], [-0.35435188 -0.38456574  0.02316257]

Starting from checkpoint ...
A     0.01, [-44.45572893 -34.81343146  -9.79069977], [1.0024712  1.9985853  3.00140793], [ 0.02484152 -0.31553735  0.09183968]
B     0.02, [-31.41073519  20.98599417  -3.95084015], [1.0005524  1.99373277 3.00184903], [-0.34893383 -0.38032285  0.02435582]

One expects the lines with matching letters to be exactly the same, however, lines B have different forces and thus particle positions, velocities, etc.

@jngrad
Copy link
Member

jngrad commented Feb 10, 2023

This behavior should affect all thermostats and integrators. One of the reasons for the divergence when reloading from a checkpoint file, is that particle forces and the state of electrostatic/magnetostatic/hydrodynamic actors are invalidated. More details can be found in the December 2021 mailing list thread Question about checkpoint.

While several factors contribute to the divergence, there are probably low-hanging fruits that can be identified in a coding day. As a first step, one can look at what happens when the checkpoint reloading code sets the global variable recalc_forces to false after the system state is completely restored:

bool recalc_forces = true;

It might not work out-of-the-box if long-range actors active, so one should start with a simple system like the one in the original post. It might be helpful to monitor what happens during the ghost update, e.g. by iterating through real and ghost particles of a local cell and print their properties before the checkpoint file is created and after the simulation is reloaded to get a better understanding of which data members are out-of-date. This can be achieved with this code:

for (auto const p_id : {1, 2}) {
  if (auto p = ::cell_structure.get_local_particle(p_id)) {
    std::cout << "id=" << p->id() << " pos=[" << p->pos() << "] force=[" << p->force() << "] ";
    std::cout << "(" << (p->is_ghost() ? "ghost" : "real") << " particle)\n";
  }
}

To make progress with the other sources of divergence, one will need to progressively enable more features to see which ones get invalidated during the reload. Specific feature combinations can be turned on using the checkpoint tests:

mpiexec -n 2 ./pypresso testsuite/python/save_checkpoint.py Test__lj
mpiexec -n 2 ./pypresso testsuite/python/test_checkpoint.py Test__lj

where Test__<features> determines which features are enabled. Pre-made <features> strings for valid feature combinations can be found in espressomd/espresso@6d08b88:testsuite/python/CMakeLists.txt#L109-L125. I would recommend manually removing the DPD feature from these test cases, because its random number generator is automatically updated when a particle collides with a shape-based constraint, making the trajectory inherently non-reproducible.

@RudolfWeeber
Copy link
Contributor

My notion is as follows:

  • During checkpointing, the last obtained forces on the particles are stored in the checkpoint.
  • Whether or not they are up to date is given by the value of recalc_forces
  • So, this (global) gariabl should be saved in the checkpoint.
  • If it is restored along with the particle forces in a checkpoint, the first integraiton step should not recalculate the forces.
  • Whether or not they are recalculated can (asie from inserting aprintf) be seen, if an LB is active, as then, the Warning about the forces being wrong in the 1st time step does or does not show.

@jngrad jngrad added the Bug label Feb 12, 2023
@itischler
Copy link
Contributor

I'll have a look

@itischler
Copy link
Contributor

The problem went a little deeper than just loading the state of the recalc_forces flag, due to it being changed at the start of the integration. For example in the on_integration_start() event:

espresso/src/core/event.cpp

Lines 114 to 118 in a824a7d

if (reinit_thermo) {
thermo_init(time_step);
reinit_thermo = false;
recalc_forces = true;
}

This exposing this also to the would just over complicate things.

However, I changed the documentation so that when a checkpoint is loaded the integration should be run with the reuse_force=True flag.

@jngrad
Copy link
Member

jngrad commented Feb 23, 2023

There is also the issue that P3M actors automatically re-tune themselves using the current state of the system (at time t), instead of the original system (at time t = 0), leading to slightly different parameters. This is easy to fix, however there are other actors where tuning cannot be disabled from the script interface, like MMM1D. In addition, floating-point precision is an issue for LB and P3MGPU; rounding errors introduce small deviations that make trajectories non-deterministic.

@kodiakhq kodiakhq bot closed this as completed in #4677 Feb 24, 2023
kodiakhq bot added a commit that referenced this issue Feb 24, 2023
Fixes #4657 

Description of changes:
- explain which factors affect reproducibility in checkpointed simulations in the user guide
- re-purpose the save/load samples to help measuring force jumps during checkpointing
- make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing
- improve docstrings of the MMM1D family of algorithms
jngrad pushed a commit to jngrad/espresso that referenced this issue Mar 2, 2023
…4677)

Fixes espressomd#4657 

Description of changes:
- explain which factors affect reproducibility in checkpointed simulations in the user guide
- re-purpose the save/load samples to help measuring force jumps during checkpointing
- make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing
- improve docstrings of the MMM1D family of algorithms
jngrad pushed a commit to jngrad/espresso that referenced this issue Mar 2, 2023
…4677)

Fixes espressomd#4657 

Description of changes:
- explain which factors affect reproducibility in checkpointed simulations in the user guide
- re-purpose the save/load samples to help measuring force jumps during checkpointing
- make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing
- improve docstrings of the MMM1D family of algorithms
jngrad pushed a commit to jngrad/espresso that referenced this issue Mar 2, 2023
…4677)

Fixes espressomd#4657 

Description of changes:
- explain which factors affect reproducibility in checkpointed simulations in the user guide
- re-purpose the save/load samples to help measuring force jumps during checkpointing
- make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing
- improve docstrings of the MMM1D family of algorithms
jngrad pushed a commit to jngrad/espresso that referenced this issue Mar 2, 2023
…4677)

Fixes espressomd#4657 

Description of changes:
- explain which factors affect reproducibility in checkpointed simulations in the user guide
- re-purpose the save/load samples to help measuring force jumps during checkpointing
- make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing
- improve docstrings of the MMM1D family of algorithms
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants