Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix participant tsv values when Info contains weight and height information as numpy arrays #1310

Merged
merged 13 commits into from
Oct 3, 2024

Conversation

thht
Copy link
Contributor

@thht thht commented Sep 25, 2024

PR Description

We have raw fiff files at our institution that produce weight and height information as numpy arrays when read with the standard mne functions. These values are not unpacked so they end up as arrays in the participant.tsv. This also leads to wrong behavior when a second run for the same participant and session is written as the detection for already existing rows in participant.tsv then fails. Check the added test for an example and simulation.

This PR adds some sanitation to the handling of participant data, namely making sure that no nested lists/arrays are in there. If it does find a nested list/array and the inner array only has one item, that item gets pulled up. Otherwise, it throws an exception.

Merge checklist

Maintainer, please confirm the following before merging.
If applicable:

  • All comments are resolved
  • This is not your own PR
  • All CIs are happy
  • PR title starts with [MRG]
  • whats_new.rst is updated
  • New contributors have been added to CITATION.cff
  • PR description includes phrase "closes <#issue-number>"

@hoechenberger
Copy link
Member

Hello, thanks for the contribution! I'm wondering if this shouldn't be addressed upstream in MNE-Python itself. Is this documented behavior? Seems like a bug to me tbh!

@thht
Copy link
Contributor Author

thht commented Sep 25, 2024

Well, in my opinion, yes but who knows what kind of surprises are waiting considering the high number of formats that MNE-Python supports and the often non-standard implementations of vendors and libraries out there.

I'm happy to report and hunt this down in MNE-Python. But I think that if MNE-BIDS expects data to be of a certain format, it is a good idea to a) check if the data fulfills these requirements and b) fixes it if it is an easy fix. And this is basically all this PR does.

Copy link

codecov bot commented Sep 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.46%. Comparing base (f94c24e) to head (2d9ae41).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1310   +/-   ##
=======================================
  Coverage   97.45%   97.46%           
=======================================
  Files          40       40           
  Lines        8810     8840   +30     
=======================================
+ Hits         8586     8616   +30     
  Misses        224      224           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hoechenberger
Copy link
Member

I generally agree, but I believe MNE-Python should do the validation and not allow for invalid stuff to appear in its Info structure. I'm surprised this can be even written to disk that way … So in any case, I'd like to escalate this to upstream just to be sure

cc @larsoner

@larsoner
Copy link
Member

Yes MNE-Python should make sure everything in Info conforms to the FIF standard at the very least with its own internal read_raw_* functions. So to me this does sound like a MNE-Python bug

@thht
Copy link
Contributor Author

thht commented Sep 25, 2024

Yes, this should be validated in MNE-Python and yes, this is probably an MNE-Python bug. However, it is always the responsibility of the function that takes arguments to validate its arguments. Especially as MNE-BIDS does not explicitly rely on the reader functions of MNE-Python. The write_raw_bids takes a mne.io.Raw object as an input which intentionally allows users to manipulate its fields, including the entries of the Info dict.

@larsoner
Copy link
Member

Agreed that inputs must be validated but I think for MNE-BIDS this should be essentially assert isinstance(raw, mne.io.BaseRaw) and if it is then it should reasonably assume that MNE-Python has validated that whatever BaseRaw subclass is being used, plus its attributes (including Info), are constructed and manipulated correctly before MNE-BIDS ever sees them. And it should be MNE-Python's job to (as much as possible) ensure that the classes are instantiated and manipulated properly. If you try to draw the line elsewhere -- like by having MNE-BIDS validate that the Info object in MNE-Python is correct -- you end up in a pretty deep rabbit hole I think.

Even if you don't buy this argument, there is at least a practical one in that MNE-Python already has a bunch of Info-validation stuff built into it already, so from that standpoint it makes more sense (more DRY) to continue implementing checks there rather than here.

@sappelhoff
Copy link
Member

Thanks a lot @thht for this report!

I agree with eric to have the "core validation" in BaseRaw at mne, and then a basic check in mne-bids.

@thht
Copy link
Contributor Author

thht commented Sep 25, 2024

The thing is, I can do something like this:

raw = mne.io.read_raw('valid_data.fif');
raw.info['subject_data']['weight'] = np.array([np.random.rand((200, 20))])

write_raw_bids(raw, ....)

And the current implementation would just process it.

AFAIK, there is no proper and feasible way for MNE-Python to make sure that the Info object is always valid. I remember there to be some undocumented validation functions, but these are a) undocumented (if they are there in the first place) and b) not enforced.

And the point is not to validate the Info object in MNE-BIDS (although if there was a validation function, MNE-BIDS should use it). The point is to validate the data MNE-BIDS produces, uses to process the data and writes to disk .

In this particular case, the non-validation leads to a quite confusing error message that the participant is already in the participant file. Which is true but the specified behavior in that case is that if all information about the participant matches, there is no error. Adding this bit of validation produces either the documented behavior or a more meaningful error message.

@larsoner
Copy link
Member

larsoner commented Sep 25, 2024

AFAIK, there is no proper and feasible way for MNE-Python to make sure that the Info object is always valid.

We have some processes in place in MNE-Python that at move us in this direction. They are not perfect but incremental progress is usually made when bugs like this show up actually! So at the very least I think we should try to extend these at the MNE-Python end and see how far we can get.

these are a) undocumented (if they are there in the first place) and b) not enforced.

The idea is that they (EDIT: meaning, the enforcement functions) shouldn't need to be documented because the enforcement should automagically happen. And enforcement does happen for some operations, e.g.:

>>> info = mne.create_info(3, 1000., "eeg")
>>> info["description"] = []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1602, in __setitem__
    val = self._attributes[key](
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 997, in _check_description
    _validate_type(description, (None, str), "info['description']")
  File "/home/larsoner/python/mne-python/mne/utils/check.py", line 642, in _validate_type
    raise TypeError(
TypeError: info['description'] must be an instance of None or str, got <class 'list'> instead.

or for something you shouldn't be able to set:

>>> info["sfreq"] = "a"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1600, in __setitem__
    raise RuntimeError(self._attributes[key])
RuntimeError: sfreq cannot be set directly. Please use method inst.resample() instead.

etc. So in this sense I think we "just" need to extend this logic a bit further to the subject_info field. There is at least some rudimentary validation already present:

>>> info["subject_info"] = []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1602, in __setitem__
    val = self._attributes[key](
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1022, in _check_subject_info
    _validate_type(subject_info, (None, dict), "subject_info")
  File "/home/larsoner/python/mne-python/mne/utils/check.py", line 642, in _validate_type
    raise TypeError(
TypeError: subject_info must be an instance of None or dict, got <class 'list'> instead.

So there is probably a placeholder stub where we can already start working this validation in.

@drammock
Copy link
Member

generally I agree with @larsoner that new validation code belongs in MNE-Python. But if there are already files out there in the wild that have array-like values in the subject_data dict (as @thht demonstrated is possible above) then I see no reason not to do some small things in MNE-BIDS to e.g. silently allow singleton arrays.

@thht
Copy link
Contributor Author

thht commented Sep 25, 2024

Yes, to make it clear: It's great and important that MNE-Python handles validation of its own datastructures. And if this issue uncovered a bug in that validation and/or implementation of reading fif files, this is valuable information that @larsoner has already appreciated.

And I don't think that MNE-BIDS should handle every possible bad/strange/malformed input it gets thrown at. But as @drammock pointed out, stuff that is found in the wild and that can be easily handled, why not handle it?

@hoechenberger
Copy link
Member

We could easily validate (in MNE) at runtime with Pydantic. So saying that this is not possible is not quite right.

@hoechenberger
Copy link
Member

hoechenberger commented Sep 25, 2024

But as @drammock pointed out, stuff that is found in the wild and that can be easily handled, why not handle it?

My understanding is that you're essentially saying that there is no need to enforce a proper interface, with which I strongly disagree. It's MNE's responsibility to ensure it provides data as documented. Downstream tools like MNE-BIDS must not need to worry about this; anything else is a bug -- either in MNE or in user code.

BIDS validation, OTOH, is a different issue. My impression is that these two things are being conflated here, hence the disagreement.

@thht
Copy link
Contributor Author

thht commented Sep 25, 2024

No, this is not my point. My point is that if a proper interface is expected, it must be enforced by the software accepting the data. It's the prerogative of the software what to do in case of a violation. It can be a proper error message, or doing a workaround with our without a warning.
At the moment, this leads to unspecified behavior within MNE-BIDS, i.e. an error that the subject is already present plus the suggestion to enable overwritting the file.

My entirely subjective opinion is that in this case, the interface violation could be handled within MNE-BIDS, because it's an easy and effective fix that might handle similar errors as well and increase the user experience with minimal risk of creating new unwanted behavior. Yet, I totally get the argument that MNE-BIDS should only accept valid inputs and not correct invalid ones. That is your choice, of course.

In any case, my argument is that if MNE-BIDS expects data in a certain format, it is of course the responsibility of the provider to ensure conformity. But it is still the responsibility of the data sink to make sure the data is valid. Because it's correct to require conformity but you can never trust that incoming data meets it.

@hoechenberger
Copy link
Member

hoechenberger commented Sep 25, 2024

I suppose I'm against special-casing your specific use case. I've never seen this problem before and I wonder where one would stop? I'd argue that there's a bug in your code and the proper way to prevent this would be by improving MNE-Python; and to prevent writing invalid BIDS, MNE-BIDS must check the outputs it produces more carefully; but it's not MNE-BIDS's responsibility to ensure that an upstream package uses its own, documented interfaces correctly. This is not web dev where we receive and need to marshal input from an untrusted party. So of course we can rely on the data being submitted in the correct format; and if that fails, we need to fix that bug where it originates from (i.e., upstream), not add more special-casing.

@hoechenberger
Copy link
Member

That said, I really appreciate your input; I just don't think it's the best approach for this specific problem. But in the end it's @sappelhoff who will have to decide 😃

@thht
Copy link
Contributor Author

thht commented Sep 25, 2024

I'm pretty sure it's not my code because all it did is that the file extract some information and send it to write_raw_bids. The code should never touch the info property. Might be some side effect, sure. Gonna check tomorrow to be sure.

Anyhow, I can agree with you that MNE-BIDS should not correct the wrong input. I still think that the error message should be improved as what it reports at the moment is not what the issue is.

If you find a consensus among yourselves, I'm more than happy to alter the code so that it checks what it wants to hand off to the participants tsv writing function and just produces a better error message.

And thanks! I appreciate your work on this and I also enjoy this constructive, respectful discussion on different opinions 😀

@drammock
Copy link
Member

My understanding is that you're essentially saying that there is no need to enforce a proper interface, with which I strongly disagree.

[...]

I suppose I'm against special-casing your specific use case.

This comes across as a bit harsh to me. If I look back at the very first sentence of the PR description, @thht says

We have raw fiff files at our institution that produce weight and height information as numpy arrays when read with the standard mne functions.

...which doesn't sound like a case where the user wants to do something unusual / out-of-spec.

That said, I'm actually going to take back my earlier statement that adding more permissivity to MNE-BIDS might be alright in this case. The right fix is to adjust the MNE-Python reader function(s) (and possibly to make the MNE-BIDS error message more informative/accurate, e.g., by checking for nested arrays in the relevant fields of participants.tsv when it's first read, and warning/erroring earlier? --- instead of waiting until a second run happens to crash).

@sappelhoff
Copy link
Member

If you find a consensus among yourselves

I think we have found that consensus: We want to improve the way mne validates the info["subject_info"] data structure, so that all downstream tools using mne (of which there are a lot) can directly benefit.

I would be happy to include additional checks in mne_bids in case they are needed in a BIDS context, or in case the checks (for whatever reason) have to be more general in mne than is useful for mne_bids.

Having said all of this, I personally do not have the capacity to work on the validation in mne, so I hope that you can be convinced to tackle it @thht.

If you can't and nobody else can for the forseeable future, I would be happy to merge the present fix as an interim solution.

So to move on:

  1. am I right that we found a consensus (see beginning of message) ?
  2. would you be willing to contribute the fix to mne @thht ?

@hoechenberger
Copy link
Member

hoechenberger commented Sep 25, 2024

I'm totally fine with merging this as-is, as long as we decide to address this problem upstream sometime in the future. 👍

Copy link
Member

@larsoner larsoner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I also opened mne-tools/mne-python#12874

else:
raise ValueError(
f"Value for key {key} is a list with more "
f"than one element. This is not supported."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"than one element. This is not supported."
f"than one element. This is not supported. "
f"Got: {cur_value}"

Comment on lines +516 to +517
for key in data.keys():
cur_value = data[key]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for key in data.keys():
cur_value = data[key]
for key, cur_value in data.items():

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that I do:

data[key] = new_value

within the loop. And afaik, one should never change what is iterated over. It might be save here, because care is taken that only the current entry is changed... But I think, it is safer this way....

Comment on lines +4198 to +4201
raw.info["subject_info"] = {
"weight": np.array([75.0]),
"height": np.array([180.0]),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually just thought of an issue... whenever mne-tools/mne-python#12874 is tackled (which I was about to try to see if it's trivial!) this will raise a ValueError. So I think you need to bypass the soon-to-be-added MNE protection:

Suggested change
raw.info["subject_info"] = {
"weight": np.array([75.0]),
"height": np.array([180.0]),
}
# bypass MNE-Python setter protection
dict.__setitem__(
raw.info,
"subject_info",
{
"weight": np.array([75.0]),
"height": np.array([180.0]),
},
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this does not do the trick because while the initial setting works, the write_raw_bids function copies the raw object here. And this calls the validation again. I have not found any way around it, including trying to mock the __setitems__ method to be the original one from dict.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a commit using monkeypatch that should work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that monkeypatch will essentially always have to be there now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test to make sure that MNE-BIDS has its own validation, yes

mne_bids/tests/test_write.py Outdated Show resolved Hide resolved
@sappelhoff
Copy link
Member

I have restarted the CIs now that:

has been merged.

Once this PR has been adjusted to the new state of things in mne, we can merge it!

@thht
Copy link
Contributor Author

thht commented Oct 2, 2024

Hi! Sorry, I had some work that came up and was super urgent. I'm taking care of this now.

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly does this PR do, now that mne does not allow weight and height to be arrays anymore?

@larsoner
Copy link
Member

larsoner commented Oct 3, 2024

What exactly does this PR do, now that mne does not allow weight and height to be arrays anymore?

At least for MNE < 1.9 it will do some helpful validation

@sappelhoff sappelhoff enabled auto-merge (squash) October 3, 2024 15:20
@sappelhoff sappelhoff merged commit 65da1ae into mne-tools:main Oct 3, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants