Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why do we need defusedxml? #12256

Closed
cbrnr opened this issue Dec 3, 2023 · 15 comments · Fixed by #12264
Closed

Why do we need defusedxml? #12256

cbrnr opened this issue Dec 3, 2023 · 15 comments · Fixed by #12264

Comments

@cbrnr
Copy link
Contributor

cbrnr commented Dec 3, 2023

I was wondering why we decided to use defusedxml as a required dependency for parsing XML instead of the standard library. The docs state that defusedxml should be used for server code, which definitely does not apply to MNE. Since we have a very strict policy for adding required dependencies, I'm a bit surprised that this was added without much discussion.

@drammock
Copy link
Member

drammock commented Dec 3, 2023

Some EEG systems store data in xml format. Since we sometimes have users share that data with us and load it locally for debugging, this protects our local machines from malicious xml.

@cbrnr
Copy link
Contributor Author

cbrnr commented Dec 3, 2023

The docs specifically state (emphasis mine): "Use of this package is recommended for any server code that parses untrusted XML data."

The only two vulnerabilities that exist in the standard library are DoS attacks (see this table). If someone parses malicious XML, the only harm done is 100% CPU load (and maybe filling up space). This is dangerous when running a server, but not in the case of MNE. Therefore, I'd rather not depend on defusedxml.

@drammock
Copy link
Member

drammock commented Dec 3, 2023

What is the downside, exactly? Personally I like being protected from 100% CPU load / having my disks get filled up with unwanted files; I don't see how this is not a problem just because I'm working on a desktop or laptop machine rather than a server. MNE parses untrusted XML.

@cbrnr
Copy link
Contributor Author

cbrnr commented Dec 4, 2023

What is the downside, exactly?

Given our strict policy for adding new requirements, I just find it surprising that this was added without any discussion. I'm not against dependencies, but MNE-Python has always been very conservative (see e.g. #11564), and if we now want to relax this a bit, we should discuss this with a larger group of devs.

If we stick with being conservative, I'm 👎 on adding defusedxml, because it isn't really necessary. The potential harm is manageable (just quit the process), and XML parsing is a tiny niche part in our code (basically, the EGI reader). Re server vs. desktop, a DoS attack on a server results in the server being unavailable until an admin steps in, which can take a while. This affects all clients which want to connect to the server. On a desktop, the user can immediately handle the problem, and no one else is affected.

@agramfort
Copy link
Member

I am with @cbrnr on this one.

@hoechenberger
Copy link
Member

hoechenberger commented Dec 4, 2023

Reconciliation attempt:

  1. Remvoe defusedxml usage for anything that's not Egi-reading related (I might be mistaken here but a 30 second search revealed that we seem to be using the package in other places that handle XML too; switch to std lib xml there)
  2. Move the defusedxml package dep to the full variant, i.e., don't make it a core dependency when running pip install mne. This is how we've been handling deps that are specific for dealing with individual data formats for a long time now.

@hoechenberger
Copy link
Member

Apparently my suggestion can be misunderstood.

So let me clarify:

I'm suggesting to keep supporting Egi files through defusedxml only, i.e., keep the existing code in this regard unchanged.

I'm further suggesting to drop the defusedxml dep from the core dependencies and move it to the full variant, effectively dropping Egi I/O support for users who don't install defusedxml or the full variant. This is consistent with e.g. BrainVision file handling.

Any other places that currently use defusedxml but are not related to Egi I/O should move back to using the std lib.

@hoechenberger
Copy link
Member

hoechenberger commented Dec 4, 2023

FYI this read may also interest you:
https://discuss.python.org/t/status-of-defusedxml-and-recommendation-in-docs/34762/

Things seem to be quite messy in the Python XML world.

@cbrnr
Copy link
Contributor Author

cbrnr commented Dec 4, 2023

I read https://github.com/tiran/defusedxml/tree/main#python-xml-libraries again, and to me, it looks like the standard library XML parsers are protected against billion laughs and quadratic blowup, and they don't expand entities anymore, at least for Python ≥ 3.8. So I think we should be safe using standard library parsers (we are using etree and minidom).

@hoechenberger
Copy link
Member

Oh that's great news, then!

@drammock
Copy link
Member

drammock commented Dec 4, 2023

Given our strict policy for adding new requirements, I just find it surprising that this was added without any discussion. [...] if we now want to relax this a bit, we should discuss this with a larger group of devs.

This is a legitimate complaint. However, it was added in #11937, defusedxml was mentioned in the PR description, the PR was open for 3 days before merging, and you (@cbrnr) commented on that PR. So it feels a bit disingenuous to say it was added without discussion. Maybe it would have been better to have a separate RFC issue about it, but it wasn't snuck in secretly or anything. If you'd like to propose a policy for dependency changes (e.g., must be open for N days, requires M approvals, etc), I'd be happy to have that discussion (in a separate issue).

...it isn't really necessary. The potential harm is manageable (just quit the process), and XML parsing is a tiny niche part in our code (basically, the EGI reader). Re server vs. desktop, a DoS attack on a server results in the server being unavailable until an admin steps in, which can take a while. This affects all clients which want to connect to the server. On a desktop, the user can immediately handle the problem, and no one else is affected.

I don't find that convincing:

  1. Terminating a running process can mean lost work.
  2. MNE is often run on shared infrastructure. Many such environments will have per-user quotas that prevent XML bombs from affecting all users, but I don't think it's very responsible for us to assume that such quotas are always going to be in place. E.g., shared resources on the scale of a single research lab or a small department may not be professionally managed and thus may not have quotas in place.

I read https://github.com/tiran/defusedxml/tree/main#python-xml-libraries again, and to me, it looks like the standard library XML parsers are protected against billion laughs and quadratic blowup, and they don't expand entities anymore, at least for Python ≥ 3.8. So I think we should be safe using standard library parsers (we are using etree and minidom).

I think you're mis-reading the table. Both etree and minidom are marked "maybe" vulnerable to billion laughs / quadratic blowup, and are marked with footnote 1, which says it depends on the underlying expat version in use. The point about entity expansion refers to different vulnerabilities in different rows of that table. The warning table on the python 3.9 docs agrees that billion laughs / quadratic expansion are still problems (it still says "vulnerable"). So I think we'd need to check pyexpat.EXPAT_VERSION in order to be sure things are safe. @cbrnr if you want to write code that performs that check (and add associated #nosec annotations wherever they're now needed), then I'd be willing to drop defusedxml. Alternatively if you can demonstrate that expat < 2.4.1 will never be used by Python 3.9, that would also convince me to drop defusedxml.

FYI this read may also interest you: https://discuss.python.org/t/status-of-defusedxml-and-recommendation-in-docs/34762/
Things seem to be quite messy in the Python XML world.

Yes, I've read that too @hoechenberger. To me the take-home message was still "stdlib xml module is still not considered safe out of the box, defusedxml is safe by default and still maintained, and lxml is more actively developed and is widely used, but still has some unsafe defaults". In other words, it did not convince me that there was a better alternative to defusedxml.

@cbrnr
Copy link
Contributor Author

cbrnr commented Dec 4, 2023

This is a legitimate complaint. However, it was added in #11937, defusedxml was mentioned in the PR description, the PR was open for 3 days before merging, and you (@cbrnr) commented on that PR. So it feels a bit disingenuous to say it was added without discussion. Maybe it would have been better to have a separate RFC issue about it, but it wasn't snuck in secretly or anything. If you'd like to propose a policy for dependency changes (e.g., must be open for N days, requires M approvals, etc), I'd be happy to have that discussion (in a separate issue).

I did not say that you snuck it in, but whenever a PR adds a new core dependency, I think it would be helpful if this was very clearly mentioned. Even though I commented, I didn't realize that this PR added a new package. So yes, some kind of more formal approval would be nice.

I think you're mis-reading the table.

"Maybe" is qualified with footnotes, and in summary, I read that to mean that stdlib XML parsers are not vulnerable with expat > 2.4.0. But it's weird that this table in the Python docs still shows "vulnerable" even for 3.12. I assume it's probably because they didn't update the table, given the ongoing discussions on the state of XML parsing in Python. But I might be wrong.

So I think we'd need to check pyexpat.EXPAT_VERSION in order to be sure things are safe.

I think it the version can be obtained with:

from xml.parsers import expat
expat.version_info

I don't know how to show expat versions for all Python versions (on different platforms), I'd hope that this was documented somewhere, but I didn't find anything yet.

I don't find that convincing [...|

This is probably the key point where we disagree. If you really want to safeguard against these attacks, it seems like we are stuck with defusedxml.

@larsoner
Copy link
Member

larsoner commented Dec 4, 2023

This is probably the key point where we disagree. If you really want to safeguard against these attacks, it seems like we are stuck with defusedxml.

I'm not sure how likely we / end users are to hit problems, but I'd rather be safe than sorry here. My vote is to keep defusedxml, and I think @hoechenberger's suggestion is a good one to make it an optional dependency for EGI and NeuroElectrics DataFormat (NEDF), the two places it looks like we use it based on git grep. I can do this quickly if you're happy with this solution @cbrnr @drammock ?

@cbrnr
Copy link
Contributor Author

cbrnr commented Dec 4, 2023

Yes, I'm happy with this solution! Thanks @larsoner!

@drammock
Copy link
Member

drammock commented Dec 4, 2023

OK with me to make it an optional dep.

To (hopefully) clarify about our different readings of the vulnerability table @cbrnr: my understanding is that expat is a C library, and as such it may have been installed at the OS level (e.g., by OS package manager), and hence Python (whether 3.9 or 3.12) cannot assume/guarantee that a specific version constraint on expat is met, and therefore Python describes itself as "(maybe) vulnerable" (and the defusedxml table draws the same conclusion, though they seem to disagree whether expat 2.4.0 or 2.4.1 is the first safe version). Of course the stdlib xml module could have been written to contain a libexpat version check, but (per the thread that @hoechenberger linked to) the xml module hasn't really been touched in a long time and probably won't be updated in that way.

There are some assumptions in there (e.g., I'm assuming that libexpat is not being vendored and distributed with Python; if it is, then it ought to be possible to decide based on Python version whether the vulnerability is gone). But aside from that I think I'm interpreting those tables correctly, which is I think the same way you're intepreting them: stdlib xml is vulnerable if the C library that it wraps is old enough. Where we seem to disagree is (1) whether that can happen with Python >= 3.9, and (2) if it can happen, whether the vulnerability is enough of a problem to justify preventative steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants