Fix participant tsv values when `Info` contains weight and height information as numpy arrays #1310

thht · 2024-09-25T13:00:25Z

PR Description

We have raw fiff files at our institution that produce weight and height information as numpy arrays when read with the standard mne functions. These values are not unpacked so they end up as arrays in the participant.tsv. This also leads to wrong behavior when a second run for the same participant and session is written as the detection for already existing rows in participant.tsv then fails. Check the added test for an example and simulation.

This PR adds some sanitation to the handling of participant data, namely making sure that no nested lists/arrays are in there. If it does find a nested list/array and the inner array only has one item, that item gets pulled up. Otherwise, it throws an exception.

Merge checklist

Maintainer, please confirm the following before merging.
If applicable:

All comments are resolved
This is not your own PR
All CIs are happy
PR title starts with [MRG]
whats_new.rst is updated
New contributors have been added to CITATION.cff
PR description includes phrase "closes <#issue-number>"

for more information, see https://pre-commit.ci

hoechenberger · 2024-09-25T13:07:04Z

Hello, thanks for the contribution! I'm wondering if this shouldn't be addressed upstream in MNE-Python itself. Is this documented behavior? Seems like a bug to me tbh!

thht · 2024-09-25T13:12:27Z

Well, in my opinion, yes but who knows what kind of surprises are waiting considering the high number of formats that MNE-Python supports and the often non-standard implementations of vendors and libraries out there.

I'm happy to report and hunt this down in MNE-Python. But I think that if MNE-BIDS expects data to be of a certain format, it is a good idea to a) check if the data fulfills these requirements and b) fixes it if it is an easy fix. And this is basically all this PR does.

codecov · 2024-09-25T13:15:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.46%. Comparing base (f94c24e) to head (2d9ae41).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1310   +/-   ##
=======================================
  Coverage   97.45%   97.46%           
=======================================
  Files          40       40           
  Lines        8810     8840   +30     
=======================================
+ Hits         8586     8616   +30     
  Misses        224      224

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hoechenberger · 2024-09-25T13:21:20Z

I generally agree, but I believe MNE-Python should do the validation and not allow for invalid stuff to appear in its Info structure. I'm surprised this can be even written to disk that way … So in any case, I'd like to escalate this to upstream just to be sure

cc @larsoner

larsoner · 2024-09-25T13:27:28Z

Yes MNE-Python should make sure everything in Info conforms to the FIF standard at the very least with its own internal read_raw_* functions. So to me this does sound like a MNE-Python bug

thht · 2024-09-25T13:30:04Z

Yes, this should be validated in MNE-Python and yes, this is probably an MNE-Python bug. However, it is always the responsibility of the function that takes arguments to validate its arguments. Especially as MNE-BIDS does not explicitly rely on the reader functions of MNE-Python. The write_raw_bids takes a mne.io.Raw object as an input which intentionally allows users to manipulate its fields, including the entries of the Info dict.

larsoner · 2024-09-25T13:35:26Z

Agreed that inputs must be validated but I think for MNE-BIDS this should be essentially assert isinstance(raw, mne.io.BaseRaw) and if it is then it should reasonably assume that MNE-Python has validated that whatever BaseRaw subclass is being used, plus its attributes (including Info), are constructed and manipulated correctly before MNE-BIDS ever sees them. And it should be MNE-Python's job to (as much as possible) ensure that the classes are instantiated and manipulated properly. If you try to draw the line elsewhere -- like by having MNE-BIDS validate that the Info object in MNE-Python is correct -- you end up in a pretty deep rabbit hole I think.

Even if you don't buy this argument, there is at least a practical one in that MNE-Python already has a bunch of Info-validation stuff built into it already, so from that standpoint it makes more sense (more DRY) to continue implementing checks there rather than here.

sappelhoff · 2024-09-25T13:38:09Z

Thanks a lot @thht for this report!

I agree with eric to have the "core validation" in BaseRaw at mne, and then a basic check in mne-bids.

thht · 2024-09-25T13:52:22Z

The thing is, I can do something like this:

raw = mne.io.read_raw('valid_data.fif');
raw.info['subject_data']['weight'] = np.array([np.random.rand((200, 20))])

write_raw_bids(raw, ....)

And the current implementation would just process it.

AFAIK, there is no proper and feasible way for MNE-Python to make sure that the Info object is always valid. I remember there to be some undocumented validation functions, but these are a) undocumented (if they are there in the first place) and b) not enforced.

And the point is not to validate the Info object in MNE-BIDS (although if there was a validation function, MNE-BIDS should use it). The point is to validate the data MNE-BIDS produces, uses to process the data and writes to disk .

In this particular case, the non-validation leads to a quite confusing error message that the participant is already in the participant file. Which is true but the specified behavior in that case is that if all information about the participant matches, there is no error. Adding this bit of validation produces either the documented behavior or a more meaningful error message.

larsoner · 2024-09-25T13:57:47Z

AFAIK, there is no proper and feasible way for MNE-Python to make sure that the Info object is always valid.

We have some processes in place in MNE-Python that at move us in this direction. They are not perfect but incremental progress is usually made when bugs like this show up actually! So at the very least I think we should try to extend these at the MNE-Python end and see how far we can get.

these are a) undocumented (if they are there in the first place) and b) not enforced.

The idea is that they (EDIT: meaning, the enforcement functions) shouldn't need to be documented because the enforcement should automagically happen. And enforcement does happen for some operations, e.g.:

>>> info = mne.create_info(3, 1000., "eeg")
>>> info["description"] = []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1602, in __setitem__
    val = self._attributes[key](
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 997, in _check_description
    _validate_type(description, (None, str), "info['description']")
  File "/home/larsoner/python/mne-python/mne/utils/check.py", line 642, in _validate_type
    raise TypeError(
TypeError: info['description'] must be an instance of None or str, got <class 'list'> instead.

or for something you shouldn't be able to set:

>>> info["sfreq"] = "a"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1600, in __setitem__
    raise RuntimeError(self._attributes[key])
RuntimeError: sfreq cannot be set directly. Please use method inst.resample() instead.

etc. So in this sense I think we "just" need to extend this logic a bit further to the subject_info field. There is at least some rudimentary validation already present:

>>> info["subject_info"] = []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1602, in __setitem__
    val = self._attributes[key](
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/mne-python/mne/_fiff/meas_info.py", line 1022, in _check_subject_info
    _validate_type(subject_info, (None, dict), "subject_info")
  File "/home/larsoner/python/mne-python/mne/utils/check.py", line 642, in _validate_type
    raise TypeError(
TypeError: subject_info must be an instance of None or dict, got <class 'list'> instead.

So there is probably a placeholder stub where we can already start working this validation in.

drammock · 2024-09-25T14:16:48Z

generally I agree with @larsoner that new validation code belongs in MNE-Python. But if there are already files out there in the wild that have array-like values in the subject_data dict (as @thht demonstrated is possible above) then I see no reason not to do some small things in MNE-BIDS to e.g. silently allow singleton arrays.

thht · 2024-09-25T14:30:40Z

Yes, to make it clear: It's great and important that MNE-Python handles validation of its own datastructures. And if this issue uncovered a bug in that validation and/or implementation of reading fif files, this is valuable information that @larsoner has already appreciated.

And I don't think that MNE-BIDS should handle every possible bad/strange/malformed input it gets thrown at. But as @drammock pointed out, stuff that is found in the wild and that can be easily handled, why not handle it?

hoechenberger · 2024-09-25T16:24:29Z

We could easily validate (in MNE) at runtime with Pydantic. So saying that this is not possible is not quite right.

hoechenberger · 2024-09-25T16:26:41Z

But as @drammock pointed out, stuff that is found in the wild and that can be easily handled, why not handle it?

My understanding is that you're essentially saying that there is no need to enforce a proper interface, with which I strongly disagree. It's MNE's responsibility to ensure it provides data as documented. Downstream tools like MNE-BIDS must not need to worry about this; anything else is a bug -- either in MNE or in user code.

BIDS validation, OTOH, is a different issue. My impression is that these two things are being conflated here, hence the disagreement.

thht · 2024-09-25T16:54:02Z

No, this is not my point. My point is that if a proper interface is expected, it must be enforced by the software accepting the data. It's the prerogative of the software what to do in case of a violation. It can be a proper error message, or doing a workaround with our without a warning.
At the moment, this leads to unspecified behavior within MNE-BIDS, i.e. an error that the subject is already present plus the suggestion to enable overwritting the file.

My entirely subjective opinion is that in this case, the interface violation could be handled within MNE-BIDS, because it's an easy and effective fix that might handle similar errors as well and increase the user experience with minimal risk of creating new unwanted behavior. Yet, I totally get the argument that MNE-BIDS should only accept valid inputs and not correct invalid ones. That is your choice, of course.

In any case, my argument is that if MNE-BIDS expects data in a certain format, it is of course the responsibility of the provider to ensure conformity. But it is still the responsibility of the data sink to make sure the data is valid. Because it's correct to require conformity but you can never trust that incoming data meets it.

hoechenberger · 2024-09-25T17:11:11Z

I suppose I'm against special-casing your specific use case. I've never seen this problem before and I wonder where one would stop? I'd argue that there's a bug in your code and the proper way to prevent this would be by improving MNE-Python; and to prevent writing invalid BIDS, MNE-BIDS must check the outputs it produces more carefully; but it's not MNE-BIDS's responsibility to ensure that an upstream package uses its own, documented interfaces correctly. This is not web dev where we receive and need to marshal input from an untrusted party. So of course we can rely on the data being submitted in the correct format; and if that fails, we need to fix that bug where it originates from (i.e., upstream), not add more special-casing.

hoechenberger · 2024-09-25T17:22:32Z

That said, I really appreciate your input; I just don't think it's the best approach for this specific problem. But in the end it's @sappelhoff who will have to decide 😃

thht · 2024-09-25T17:35:07Z

I'm pretty sure it's not my code because all it did is that the file extract some information and send it to write_raw_bids. The code should never touch the info property. Might be some side effect, sure. Gonna check tomorrow to be sure.

Anyhow, I can agree with you that MNE-BIDS should not correct the wrong input. I still think that the error message should be improved as what it reports at the moment is not what the issue is.

If you find a consensus among yourselves, I'm more than happy to alter the code so that it checks what it wants to hand off to the participants tsv writing function and just produces a better error message.

And thanks! I appreciate your work on this and I also enjoy this constructive, respectful discussion on different opinions 😀

drammock · 2024-09-25T18:25:10Z

My understanding is that you're essentially saying that there is no need to enforce a proper interface, with which I strongly disagree.

[...]

I suppose I'm against special-casing your specific use case.

This comes across as a bit harsh to me. If I look back at the very first sentence of the PR description, @thht says

We have raw fiff files at our institution that produce weight and height information as numpy arrays when read with the standard mne functions.

...which doesn't sound like a case where the user wants to do something unusual / out-of-spec.

That said, I'm actually going to take back my earlier statement that adding more permissivity to MNE-BIDS might be alright in this case. The right fix is to adjust the MNE-Python reader function(s) (and possibly to make the MNE-BIDS error message more informative/accurate, e.g., by checking for nested arrays in the relevant fields of participants.tsv when it's first read, and warning/erroring earlier? --- instead of waiting until a second run happens to crash).

sappelhoff · 2024-09-25T18:47:08Z

If you find a consensus among yourselves

I think we have found that consensus: We want to improve the way mne validates the info["subject_info"] data structure, so that all downstream tools using mne (of which there are a lot) can directly benefit.

I would be happy to include additional checks in mne_bids in case they are needed in a BIDS context, or in case the checks (for whatever reason) have to be more general in mne than is useful for mne_bids.

Having said all of this, I personally do not have the capacity to work on the validation in mne, so I hope that you can be convinced to tackle it @thht.

If you can't and nobody else can for the forseeable future, I would be happy to merge the present fix as an interim solution.

So to move on:

am I right that we found a consensus (see beginning of message) ?
would you be willing to contribute the fix to mne @thht ?

hoechenberger · 2024-09-25T19:57:27Z

I'm totally fine with merging this as-is, as long as we decide to address this problem upstream sometime in the future. 👍

larsoner

LGTM! I also opened mne-tools/mne-python#12874

larsoner · 2024-09-26T15:21:52Z

mne_bids/write.py

+                else:
+                    raise ValueError(
+                        f"Value for key {key} is a list with more "
+                        f"than one element. This is not supported."


Suggested change

f"than one element. This is not supported."

f"than one element. This is not supported. "

f"Got: {cur_value}"

larsoner · 2024-09-26T15:22:58Z

mne_bids/write.py

+    for key in data.keys():
+        cur_value = data[key]


Suggested change

for key in data.keys():

cur_value = data[key]

for key, cur_value in data.items():

The issue here is that I do:

data[key] = new_value

within the loop. And afaik, one should never change what is iterated over. It might be save here, because care is taken that only the current entry is changed... But I think, it is safer this way....

larsoner · 2024-09-26T15:39:37Z

mne_bids/tests/test_write.py

+    raw.info["subject_info"] = {
+        "weight": np.array([75.0]),
+        "height": np.array([180.0]),
+    }


I actually just thought of an issue... whenever mne-tools/mne-python#12874 is tackled (which I was about to try to see if it's trivial!) this will raise a ValueError. So I think you need to bypass the soon-to-be-added MNE protection:

Suggested change

raw.info["subject_info"] = {

"weight": np.array([75.0]),

"height": np.array([180.0]),

}

# bypass MNE-Python setter protection

dict.__setitem__(

raw.info,

"subject_info",

{

"weight": np.array([75.0]),

"height": np.array([180.0]),

},

)

I'm afraid this does not do the trick because while the initial setting works, the write_raw_bids function copies the raw object here. And this calls the validation again. I have not found any way around it, including trying to mock the __setitems__ method to be the original one from dict.

Pushed a commit using monkeypatch that should work

so that monkeypatch will essentially always have to be there now?

For the test to make sure that MNE-BIDS has its own validation, yes

mne_bids/tests/test_write.py

sappelhoff · 2024-09-27T14:07:34Z

I have restarted the CIs now that:

BUG: subject_info dict validation mne-python#12874

has been merged.

Once this PR has been adjusted to the new state of things in mne, we can merge it!

thht · 2024-10-02T07:27:30Z

Hi! Sorry, I had some work that came up and was super urgent. I'm taking care of this now.

sappelhoff

What exactly does this PR do, now that mne does not allow weight and height to be arrays anymore?

larsoner · 2024-10-03T15:13:26Z

What exactly does this PR do, now that mne does not allow weight and height to be arrays anymore?

At least for MNE < 1.9 it will do some helpful validation

mne_bids/tests/test_write.py

mne_bids/write.py

for more information, see https://pre-commit.ci

thht and others added 4 commits September 25, 2024 14:24

add failing test

c0f2f06

add participant data sanitizing

19b3c8e

[pre-commit.ci] auto fixes from pre-commit.com hooks

ea2b7e1

for more information, see https://pre-commit.ci

update whats_new

6ae6384

add tests for untested lines

0341739

remove unneccesary check

18fa664

larsoner mentioned this pull request Sep 26, 2024

BUG: subject_info dict validation mne-tools/mne-python#12874

Closed

larsoner approved these changes Sep 26, 2024

View reviewed changes

hoechenberger approved these changes Sep 26, 2024

View reviewed changes

larsoner reviewed Sep 26, 2024

View reviewed changes

mne_bids/tests/test_write.py Show resolved Hide resolved

thht and others added 5 commits October 2, 2024 10:26

Merge branch 'main' into fix_participant_tsv

23ab766

added suggestion by eric larson

ef8ce43

tried to evade mne type checking

cdb1dd6

add match statement

5291cc4

FIX: Disable

5275ffe

sappelhoff reviewed Oct 3, 2024

View reviewed changes

mne_bids/tests/test_write.py Show resolved Hide resolved

sappelhoff reviewed Oct 3, 2024

View reviewed changes

mne_bids/write.py Show resolved Hide resolved

Apply suggestions from code review

70fe2b4

sappelhoff approved these changes Oct 3, 2024

View reviewed changes

sappelhoff enabled auto-merge (squash) October 3, 2024 15:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

2d9ae41

for more information, see https://pre-commit.ci

sappelhoff merged commit 65da1ae into mne-tools:main Oct 3, 2024
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix participant tsv values when `Info` contains weight and height information as numpy arrays #1310

Fix participant tsv values when `Info` contains weight and height information as numpy arrays #1310

thht commented Sep 25, 2024

hoechenberger commented Sep 25, 2024

thht commented Sep 25, 2024

codecov bot commented Sep 25, 2024 •

edited

Loading

hoechenberger commented Sep 25, 2024

larsoner commented Sep 25, 2024

thht commented Sep 25, 2024

larsoner commented Sep 25, 2024

sappelhoff commented Sep 25, 2024

thht commented Sep 25, 2024

larsoner commented Sep 25, 2024 •

edited

Loading

drammock commented Sep 25, 2024

thht commented Sep 25, 2024

hoechenberger commented Sep 25, 2024

hoechenberger commented Sep 25, 2024 •

edited

Loading

thht commented Sep 25, 2024

hoechenberger commented Sep 25, 2024 •

edited

Loading

hoechenberger commented Sep 25, 2024

thht commented Sep 25, 2024 •

edited

Loading

drammock commented Sep 25, 2024

sappelhoff commented Sep 25, 2024

hoechenberger commented Sep 25, 2024 •

edited

Loading

larsoner left a comment

larsoner Sep 26, 2024

larsoner Sep 26, 2024

thht Oct 2, 2024

larsoner Sep 26, 2024

thht Oct 2, 2024

larsoner Oct 3, 2024

sappelhoff Oct 3, 2024

larsoner Oct 3, 2024

sappelhoff commented Sep 27, 2024

thht commented Oct 2, 2024

sappelhoff left a comment

larsoner commented Oct 3, 2024

	f"than one element. This is not supported."
	f"than one element. This is not supported. "
	f"Got: {cur_value}"

	for key in data.keys():
	cur_value = data[key]
	for key, cur_value in data.items():

Fix participant tsv values when Info contains weight and height information as numpy arrays #1310

Fix participant tsv values when Info contains weight and height information as numpy arrays #1310

Conversation

thht commented Sep 25, 2024

PR Description

Merge checklist

hoechenberger commented Sep 25, 2024

thht commented Sep 25, 2024

codecov bot commented Sep 25, 2024 • edited Loading

Codecov Report

hoechenberger commented Sep 25, 2024

larsoner commented Sep 25, 2024

thht commented Sep 25, 2024

larsoner commented Sep 25, 2024

sappelhoff commented Sep 25, 2024

thht commented Sep 25, 2024

larsoner commented Sep 25, 2024 • edited Loading

drammock commented Sep 25, 2024

thht commented Sep 25, 2024

hoechenberger commented Sep 25, 2024

hoechenberger commented Sep 25, 2024 • edited Loading

thht commented Sep 25, 2024

hoechenberger commented Sep 25, 2024 • edited Loading

hoechenberger commented Sep 25, 2024

thht commented Sep 25, 2024 • edited Loading

drammock commented Sep 25, 2024

sappelhoff commented Sep 25, 2024

hoechenberger commented Sep 25, 2024 • edited Loading

larsoner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sappelhoff commented Sep 27, 2024

thht commented Oct 2, 2024

sappelhoff left a comment

Choose a reason for hiding this comment

larsoner commented Oct 3, 2024

Fix participant tsv values when `Info` contains weight and height information as numpy arrays #1310

Fix participant tsv values when `Info` contains weight and height information as numpy arrays #1310

codecov bot commented Sep 25, 2024 •

edited

Loading

larsoner commented Sep 25, 2024 •

edited

Loading

hoechenberger commented Sep 25, 2024 •

edited

Loading

hoechenberger commented Sep 25, 2024 •

edited

Loading

thht commented Sep 25, 2024 •

edited

Loading

hoechenberger commented Sep 25, 2024 •

edited

Loading