Langevin thermostat is not pickled correctly #4657

RiccardoFrenner · 2023-01-23T08:07:30Z

When saving the system via pickle and loading it again, the force due to the langevin thermostat changes.
This should not happen since the result of the simulation then does depend on whether it was checkpointed or not.

I tested it with both the 4.2.0 commit tag and the latest commit (d9cbffc) and an empty myconfig.hpp.

The followind MWE demonstrates the problem:

import pickle
from pathlib import Path

import espressomd

SEED = 6854
CHECKPOINT_PATH = Path("checkpoint.pkl")
LOAD_CHECKPOINT = CHECKPOINT_PATH.exists()


def from_scratch():
    system = espressomd.System(box_l=[10.0]*3)
    system.time_step = 0.01
    system.cell_system.skin = 0.4
    system.cell_system.set_regular_decomposition(
        use_verlet_lists=True
    )
    system.thermostat.set_langevin(kT=1.0, gamma=5.0, seed=SEED)
    system.integrator.set_vv()
    p = system.part.add(pos=[1, 2, 3], id=0)

    system.integrator.run(1)
    print(f"A {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 1

    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(system, f)

    # run one more time to see difference with/without checkpointing
    system.integrator.run(1)
    print(f"B {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 2


def from_checkpoint():
    system = pickle.loads(CHECKPOINT_PATH.read_bytes())
    p = system.part.by_id(0)

    print(f"A {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 1

    system.integrator.run(1)
    print(f"B {system.time:8g}, {p.f}, {p.pos}, {p.v}")
    assert system.thermostat.get_state()[0]["counter"] == 2


def main():
    if LOAD_CHECKPOINT:
        print("Starting from checkpoint ...")
        from_checkpoint()
    else:
        print("Starting from scratch ...")
        from_scratch()


if __name__ == "__main__":
    main()

When run two times it produces the following output:

Starting from scratch ...
A     0.01, [-44.45572893 -34.81343146  -9.79069977], [1.0024712  1.9985853  3.00140793], [ 0.02484152 -0.31553735  0.09183968]
B     0.02, [-31.38295036  21.00775256  -3.94472096], [1.00049683 1.99368925 3.00183679], [-0.35435188 -0.38456574  0.02316257]

Starting from checkpoint ...
A     0.01, [-44.45572893 -34.81343146  -9.79069977], [1.0024712  1.9985853  3.00140793], [ 0.02484152 -0.31553735  0.09183968]
B     0.02, [-31.41073519  20.98599417  -3.95084015], [1.0005524  1.99373277 3.00184903], [-0.34893383 -0.38032285  0.02435582]

One expects the lines with matching letters to be exactly the same, however, lines B have different forces and thus particle positions, velocities, etc.

The text was updated successfully, but these errors were encountered:

jngrad · 2023-02-10T18:29:30Z

This behavior should affect all thermostats and integrators. One of the reasons for the divergence when reloading from a checkpoint file, is that particle forces and the state of electrostatic/magnetostatic/hydrodynamic actors are invalidated. More details can be found in the December 2021 mailing list thread Question about checkpoint.

While several factors contribute to the divergence, there are probably low-hanging fruits that can be identified in a coding day. As a first step, one can look at what happens when the checkpoint reloading code sets the global variable recalc_forces to false after the system state is completely restored:

espresso/src/core/integrate.cpp

Line 89 in 6d08b88

bool recalc_forces = true;

It might not work out-of-the-box if long-range actors active, so one should start with a simple system like the one in the original post. It might be helpful to monitor what happens during the ghost update, e.g. by iterating through real and ghost particles of a local cell and print their properties before the checkpoint file is created and after the simulation is reloaded to get a better understanding of which data members are out-of-date. This can be achieved with this code:

for (auto const p_id : {1, 2}) {
  if (auto p = ::cell_structure.get_local_particle(p_id)) {
    std::cout << "id=" << p->id() << " pos=[" << p->pos() << "] force=[" << p->force() << "] ";
    std::cout << "(" << (p->is_ghost() ? "ghost" : "real") << " particle)\n";
  }
}

To make progress with the other sources of divergence, one will need to progressively enable more features to see which ones get invalidated during the reload. Specific feature combinations can be turned on using the checkpoint tests:

mpiexec -n 2 ./pypresso testsuite/python/save_checkpoint.py Test__lj
mpiexec -n 2 ./pypresso testsuite/python/test_checkpoint.py Test__lj

where Test__<features> determines which features are enabled. Pre-made <features> strings for valid feature combinations can be found in espressomd/espresso@6d08b88:testsuite/python/CMakeLists.txt#L109-L125. I would recommend manually removing the DPD feature from these test cases, because its random number generator is automatically updated when a particle collides with a shape-based constraint, making the trajectory inherently non-reproducible.

RudolfWeeber · 2023-02-12T13:22:08Z

My notion is as follows:

During checkpointing, the last obtained forces on the particles are stored in the checkpoint.
Whether or not they are up to date is given by the value of recalc_forces
So, this (global) gariabl should be saved in the checkpoint.
If it is restored along with the particle forces in a checkpoint, the first integraiton step should not recalculate the forces.
Whether or not they are recalculated can (asie from inserting aprintf) be seen, if an LB is active, as then, the Warning about the forces being wrong in the 1st time step does or does not show.

itischler · 2023-02-13T09:05:56Z

I'll have a look

itischler · 2023-02-21T14:38:41Z

The problem went a little deeper than just loading the state of the recalc_forces flag, due to it being changed at the start of the integration. For example in the on_integration_start() event:

espresso/src/core/event.cpp

Lines 114 to 118 in a824a7d

    
           if (reinit_thermo) { 
        
             thermo_init(time_step); 
        
             reinit_thermo = false; 
        
             recalc_forces = true; 
        
           }

This exposing this also to the would just over complicate things.

However, I changed the documentation so that when a checkpoint is loaded the integration should be run with the reuse_force=True flag.

jngrad · 2023-02-23T17:27:46Z

There is also the issue that P3M actors automatically re-tune themselves using the current state of the system (at time t), instead of the original system (at time t = 0), leading to slightly different parameters. This is easy to fix, however there are other actors where tuning cannot be disabled from the script interface, like MMM1D. In addition, floating-point precision is an issue for LB and P3MGPU; rounding errors introduce small deviations that make trajectories non-deterministic.

Fixes #4657 Description of changes: - explain which factors affect reproducibility in checkpointed simulations in the user guide - re-purpose the save/load samples to help measuring force jumps during checkpointing - make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing - improve docstrings of the MMM1D family of algorithms

…4677) Fixes espressomd#4657 Description of changes: - explain which factors affect reproducibility in checkpointed simulations in the user guide - re-purpose the save/load samples to help measuring force jumps during checkpointing - make the P3M family of algorithms more deterministic by avoiding re-tuning during checkpointing - improve docstrings of the MMM1D family of algorithms

jngrad added Core CodingDay labels Feb 10, 2023

jngrad added the Bug label Feb 12, 2023

itischler mentioned this issue Feb 21, 2023

Make trajectories more reproducible during checkpointing #4677

Merged

kodiakhq bot closed this as completed in #4677 Feb 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Langevin thermostat is not pickled correctly #4657

Langevin thermostat is not pickled correctly #4657

RiccardoFrenner commented Jan 23, 2023

jngrad commented Feb 10, 2023

RudolfWeeber commented Feb 12, 2023

itischler commented Feb 13, 2023

itischler commented Feb 21, 2023

jngrad commented Feb 23, 2023

Langevin thermostat is not pickled correctly #4657

Langevin thermostat is not pickled correctly #4657

Comments

RiccardoFrenner commented Jan 23, 2023

jngrad commented Feb 10, 2023

RudolfWeeber commented Feb 12, 2023

itischler commented Feb 13, 2023

itischler commented Feb 21, 2023

jngrad commented Feb 23, 2023