
Integer values in pmb.df become float when reading a pyMBE dataframe from file #102

Open · pm-blanco opened this issue on Nov 13, 2024 · 7 comments
Labels: bug (Something isn't working)

@pm-blanco (Collaborator)

This becomes an issue when restarting simulations that were set up with pyMBE, since it breaks the setup of the reaction methods, which cannot be checkpointed with the current functionality in ESPResSo.

Here is a minimal working example that shows the issue:

import pyMBE

# Create an instance of pyMBE library
pmb = pyMBE.pymbe_library(seed=42)

# Acidic particle
pmb.define_particle(
    name = "A",
    acidity = "basic",
    pka = 4,
    sigma = 1*pmb.units('reduced_length'),
    epsilon = 1*pmb.units('reduced_energy'))

pmb.define_particle(
    name = "C",
    z = 1,
    sigma = 1*pmb.units('reduced_length'),
    epsilon = 1*pmb.units('reduced_energy'))

# Save the pyMBE dataframe in a CSV file
pmb.write_pmb_df(filename='df.csv')

# Here all charge numbers and ESPResSo types are integers
print(pmb.df)

# Read the pyMBE dataframe back from the CSV file
pmb.read_pmb_df(filename='df.csv')

# Here some of the charge numbers and ESPResSo types are floats
print(pmb.df)

# This breaks the setup of the reaction methods
pmb.setup_cpH(counter_ion="C", constant_pH=2)
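
For context, the dtype change can be reproduced with plain pandas and a CSV round trip (a minimal sketch, independent of pyMBE; the column name z is only illustrative):

import numpy as np
import pandas as pd

# An integer column that also contains a missing value, stored as dtype object
df = pd.DataFrame({"z": [1, np.nan, -1]}, dtype=object)
print(df["z"].dtype)      # object, the integers stay integers

# Round trip through CSV: read_csv infers float64 because of the empty field
df.to_csv("roundtrip.csv", index=False)
df2 = pd.read_csv("roundtrip.csv")
print(df2["z"].dtype)     # float64
print(df2["z"].tolist())  # [1.0, nan, -1.0]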

@paobtorres do you want to take care of solving this?

pm-blanco added the bug label on Nov 13, 2024
@paobtorres (Contributor)

Yes! I will try to solve it

pm-blanco added this to the pyMBE 1.0.0 milestone on Nov 13, 2024
@paobtorres (Contributor) commented on Nov 23, 2024

@pm-blanco A quick update: I have partially solved the issue by changing:

for column_name in columns_dtype_int:
    df[column_name] = df[column_name].astype(object)

to:

for column_name in columns_dtype_int:
    df[column_name] = df[column_name].astype(pd.Int64Dtype())

However, it changes the missing values in that column from NaN (np.nan) to <NA>, which is pd.NA. It does not affect functionality, but the dataframe then contains a mix of NaN and <NA>, which does not look good. (I'm still working on it, but I have not created a branch just yet.)
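
For reference, a minimal sketch of the behaviour described above (plain pandas, outside pyMBE): casting to the nullable Int64 dtype restores the integers, but the missing value becomes pd.NA, which prints as <NA>.

import numpy as np
import pandas as pd

# After read_csv, an integer column with missing values comes back as float64
s = pd.Series([1.0, np.nan, 2.0])
print(s.dtype)            # float64

# Casting to the nullable integer dtype keeps the integers,
# but the missing value turns into pd.NA and is displayed as <NA>
s_int = s.astype(pd.Int64Dtype())
print(s_int.dtype)        # Int64
print(s_int[1] is pd.NA)  # True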

@paobtorres (Contributor)

Hi @jngrad! We would like your opinion on this issue. In the previous comment I described the solution I found, but we still have the problem of mixing np.nan and pd.NA.

@pm-blanco and I were discussing changing all the functions in pyMBE that use np.nan as a default so that they use pd.NA instead. What do you think about this?
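
To illustrate what the change would buy (a minimal sketch with generic values, not pyMBE code): with pd.NA as the missing-value marker a column can stay on the nullable Int64 dtype, while np.nan forces it to float64.

import numpy as np
import pandas as pd

# np.nan forces the column to float64, so the integers become 1.0, 2.0, ...
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)              # float64

# pd.NA works with the nullable Int64 dtype, so the integers stay integers
with_na = pd.Series([1, 2, pd.NA], dtype="Int64")
print(with_na.dtype)               # Int64

# pd.isna() recognises both missing-value markers
print(pd.isna(with_nan).tolist())  # [False, False, True]
print(pd.isna(with_na).tolist())   # [False, False, True]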

@jngrad (Member) commented on Dec 4, 2024

That's really tricky. According to the pandas user guide chapter Working with missing data, pd.NA behaves like both None and NaN, which came as a surprise to me. I'm used to JavaScript's null vs. undefined vs. NaN semantics, and Python's None vs. NaN semantics. Merging None and NaN into a unified data type introduces new side effects: for example, sometimes it propagates (even through comparison operations) and sometimes it doesn't (for example, you can divide 0 by 0 to yield np.nan). This means we can't use np.isnan() reliably and should always use pd.isna() instead. In addition, NumPy produces a warning the first time we generate a NaN, while Pandas sometimes flat out refuses to generate a NaN.

Here is an MWE:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(data={"time": [1., 2., pd.NaT], "pos": [1.5, 2.5, 0.]})
>>> df
  time  pos
0  1.0  1.5
1  2.0  2.5
2  NaT  0.0
>>> (0. / df["pos"])  # generates np.nan
0    0.0
1    0.0
2    NaN
Name: pos, dtype: float64
>>> np.isnan(0. / df["pos"])  # works as expected
0    False
1    False
2     True
Name: pos, dtype: bool
>>> (0. / df["time"])  # generates NaN of type object
0    0.0
1    0.0
2    NaN
Name: time, dtype: object
>>> (0. / df["time"]).to_numpy().dtype  # not convertible to numpy array of float64
dtype('O')
>>> np.isnan(np.nan)
True
>>> np.isnan(pd.NA)  # doesn't work
<NA>
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)  # always works
True
>>> pd.Series([1., np.inf, np.nan, pd.NA]).isnull()  # np.nan decays to NULL
0    False
1    False
2     True
3     True
dtype: bool
>>> 1 / np.array([0., 1.])  # warning the first time
<stdin>:1: RuntimeWarning: divide by zero encountered in divide
array([inf,  1.])
>>> 1 / np.array([0., 1.])  # no warning the second time
array([inf,  1.])
>>> (0. / pd.Series([0., 1.]))  # yields NaN
0    NaN
1    0.0
dtype: float64
>>> (0. / pd.Series([0., 1., pd.NaT]))  # doesn't run because pd.NaT changes series type
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero

I would say that np.nan isn't the best candidate to represent missing data, because it forces the data series to become a floating-point series. If there were a pd.NULL type, I would recommend it for missing data over np.nan. Looking through the Pandas docs, it seems like pd.NA is the right candidate to represent null, especially considering Pandas treats np.nan as null. But I must admit this is the first time I have read about it, and its semantics confuse me a lot, which makes me reluctant to recommend its usage. If we decide to adopt it, everyone in the pyMBE community should get familiar with pd.NA semantics.

@pm-blanco (Collaborator, Author)

Thank you for your insight, @jngrad! This is indeed a very tricky issue and no solution seems to be completely satisfying. At the moment, defaulting to pd.NA seems to be the lesser evil, but we should make our community aware of it. Maybe we could add an entry to our wiki about it, so we can refer people to it in the future?

@jngrad (Member) commented on Dec 4, 2024

I agree. We can link to the relevant chapter in the Pandas user manual, but we should explicitly mention that pd.isna() should always be preferred over np.isnan().
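
A possible snippet for such a wiki entry could look like this (just a sketch with illustrative values, not existing pyMBE documentation):

import numpy as np
import pandas as pd

# pd.isna() handles every missing-value marker and ordinary values alike
for value in [np.nan, pd.NA, pd.NaT, None, 1.0]:
    print(repr(value), "->", pd.isna(value))

# np.isnan() only works on floats
print(np.isnan(np.nan))  # True
print(np.isnan(pd.NA))   # returns <NA>, not a bool
# np.isnan(None)         # raises TypeError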

@paobtorres (Contributor)

Thanks @jngrad for your feedback on this issue. Perfect, then I will make the change to pd.NA. @pm-blanco
