[Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path #41365
Comments
I think you will need to have URI-encoded spaces in this case, see https://en.wikipedia.org/wiki/Percent-encoding#The_application/x-www-form-urlencoded_type. Could you try replacing whitespace with |
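The percent-encoding suggestion above can be sketched with the standard library. This is a minimal illustration, not part of pyarrow: the helper name `encode_s3_uri` and the example bucket and key are hypothetical, and it assumes the filesystem layer decodes `%20` back to a space when it resolves the URI.

```python
from urllib.parse import quote


def encode_s3_uri(uri: str) -> str:
    """Percent-encode everything after the scheme, keeping '/' separators.

    Hypothetical helper for illustration only. Note that it also encodes
    characters such as '?' and '#', so it is not suitable for URIs that
    carry a query string.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep:
        # No scheme separator: treat the input as a plain path and leave it alone.
        return uri
    return f"{scheme}://{quote(rest, safe='/')}"


print(encode_s3_uri("s3://bucket/my folder/file.parquet"))
# s3://bucket/my%20folder/file.parquet
```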
In any case, we should improve the error message. I will change the labels, contributions are very welcome!
Hi @AlenkaF, I just opened a PR that implements the new error message. Please take a look at my changes if you have time, and let me know if you have any other concerns regarding the error message. Thanks.
I'm not sure there is a better strategy. Can you suggest something?
I just think there shouldn't be a default, and that an exception should be raised if no filesystem is detected, including LocalFilesystem. But I appreciate that it might break too many things at this point.
### Rationale for this change

We want to enhance error message for URI parsing error to provide more information for the syntax error scenario. When error message is generated from `uriParseSingleUriExA`, the return value might indicate a `URI_ERROR_SYNTAX` error, and `error_pos` would be set to the position causing syntax error. ([uriparser/Uri.h](https://github.com/apache/arrow/blob/c455d6b8c4ae2cb22baceb4c27e1325b973d39e1/cpp/src/arrow/vendored/uriparser/Uri.h#L288))

In the new error message, it includes the character causing syntax error and its position, so users can have a better idea why the error happens.

### What changes are included in this PR?

- Error message change in URI parsing function.

### Are these changes tested?

PR includes unit tests.

### Are there any user-facing changes?

Yes, but only for error message.

* GitHub Issue: #41365
* GitHub Issue: #43967

Authored-by: Crystal Zhou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
So going through the logic in the However, in that case you typically get an error about an empty scheme, and not a "Cannot parse URI" error. For example:
Essentially, if an error happens and the user provides a URI, we want to show the error that happens when parsing the URI, and if the user passes a local file path, we would like to show the error that happens when we assume it is a local file path (and further down the line this will then give a file not found error). But couldn't we use some basic heuristic to determine if the user is actually passing a URI? (I don't know how robust it is to check if Based on the above example, it might also be an option to only catch the "empty scheme" error and let the "Cannot parse URI" error always bubble up to the user.
There are all kinds of failure modes:
>>> pq.read_table("/local file")
Traceback (most recent call last):
...
FileNotFoundError: /local file
>>> pq.read_table("local file")
Traceback (most recent call last):
...
FileNotFoundError: local file
>>> pq.read_table("s3://invalid bucket/bar")
Traceback (most recent call last):
...
ArrowInvalid: Expected a local filesystem path, got a URI: 's3://invalid bucket/bar'
>>> pq.read_table("s3://really-nonexistent-bucket/bar")
Traceback (most recent call last):
...
OSError: Bucket 'really-nonexistent-bucket' not found
Edit: we do have such a heuristic already in C++, it's where the "Expected a local filesystem path, got a URI" message comes from. It can probably be reused.
arrow/cpp/src/arrow/filesystem/path_util.cc Lines 342 to 360 in 032e6a4
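A heuristic of this kind can be sketched in Python as follows. This is an illustration of the idea only, not a translation of the C++ code referenced above: the name `looks_like_uri` and the exact rules (reject absolute paths, require a scheme of at least two valid characters before the first `:`) are assumptions.

```python
import re

# RFC 3986 scheme syntax: a letter followed by letters, digits, '+', '.' or '-'.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*$")


def looks_like_uri(path: str) -> bool:
    """Guess whether `path` was meant as a URI rather than a local path.

    Hypothetical helper for illustration only.
    """
    if path.startswith("/"):
        # Absolute POSIX paths are never treated as URIs.
        return False
    scheme, sep, _ = path.partition(":")
    if not sep:
        # No ':' at all, so there is no scheme.
        return False
    if len(scheme) < 2:
        # A one-letter "scheme" is more likely a Windows drive such as C:\.
        return False
    return _SCHEME_RE.match(scheme) is not None
```

With these rules, `"s3://invalid bucket/bar"` is classified as a URI even though the whitespace makes it unparseable, while `"/local file"` and `"local file"` are classified as local paths.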
Yes, but those errors mostly come from after
In [1]: from pyarrow.fs import _resolve_filesystem_and_path
In [2]: _resolve_filesystem_and_path("/local file")
Out[2]: (<pyarrow._fs.LocalFileSystem at 0x7fa597bc38b0>, '/local file')
In [3]: _resolve_filesystem_and_path("local file")
Out[3]: (<pyarrow._fs.LocalFileSystem at 0x7fa591904bb0>, 'local file')
In [4]: _resolve_filesystem_and_path("s3://invalid bucket/bar")
Out[4]: (<pyarrow._fs.LocalFileSystem at 0x7fa591380eb0>, 's3://invalid bucket/bar')
In [5]: _resolve_filesystem_and_path("s3://really-nonexistent-bucket/bar")
...
OSError: Bucket 'really-nonexistent-bucket' not found
So it is for the third case where we give a wrong result in this step of the code path, which then later in
One option could be to use that in Or bind it in Python, so we can use it to decide in
Using it in
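The idea of using such a heuristic to decide between raising and falling back to the local filesystem can be sketched as below. Everything here is hypothetical: `resolve_path`, `toy_parse_uri` and `toy_is_likely_uri` are stand-ins written for this illustration and do not correspond to pyarrow functions. The sketch catches `ValueError` because pyarrow's `ArrowInvalid` is a `ValueError` subclass.

```python
def resolve_path(path, parse_uri, is_likely_uri):
    """Try to parse `path` as a URI; fall back to local only for non-URIs.

    `parse_uri` and `is_likely_uri` are injected so the sketch stays
    independent of any real parser.
    """
    try:
        return parse_uri(path)
    except ValueError:
        if is_likely_uri(path):
            # The user clearly meant a URI: surface the parsing error
            # instead of silently treating the string as a local path.
            raise
        return "local", path


def toy_parse_uri(path):
    """Stand-in parser: rejects whitespace and requires '://'."""
    if " " in path or "://" not in path:
        raise ValueError(f"Cannot parse URI: {path!r}")
    scheme, _, rest = path.partition("://")
    return scheme, rest


def toy_is_likely_uri(path):
    """Stand-in heuristic: a '://' in a path that is not absolute."""
    return "://" in path and not path.startswith("/")


print(resolve_path("/local file", toy_parse_uri, toy_is_likely_uri))
print(resolve_path("s3://bucket/key", toy_parse_uri, toy_is_likely_uri))
```

Under these assumptions, `"s3://invalid bucket/bar"` raises the parsing error rather than resolving to a local filesystem, which is the behaviour the third case in the session above is missing.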
Describe the bug, including details regarding any error messages, version, and platform.
Pyarrow fs incorrectly resolves valid S3 URIs with a whitespace as a local path:
This causes subsequent calls such as getting the file info to fail:
A quick look into the method indicates that a LocalFileSystem is chosen by default and returned if no alternative filesystem is detected, which seems like a dubious strategy...
I assume this is where the S3 filesystem should be detected but a URI containing a whitespace seems to throw an exception although it's valid:
Component(s)
Python