Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path #41365

Open
jaidisido opened this issue Apr 24, 2024 · 10 comments

Comments

@jaidisido
Copy link

jaidisido commented Apr 24, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Pyarrow fs incorrectly resolves valid S3 URIs with a whitespace as a local path:

from pyarrow.fs import _resolve_filesystem_and_path, FileSystem

uri = "s3://bucket/prefix with space/a=a"

resolved_filesystem, resolved_path = _resolve_filesystem_and_path(uri, None)

resolved_filesystem
<pyarrow._fs.LocalFileSystem at 0x10316ff30>

This causes subsequent calls such as getting the file info to fail:

path_info = resolved_filesystem.get_file_info(resolved_path)

pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI...

A quick look into the method indicates that a LocalFilesytem is chosen by default and returned if alternative filesystems are not detected which seems like a dubious strategy...

I assume this is where the S3 filesystem should be detected but a URI containing a whitespace seems to throw an exception although it's valid:

filesystem, path = FileSystem.from_uri(uri)

Cannot parse URI: 's3://bucket/prefix with space/a=a/'

Component(s)

Python

@AlenkaF
Copy link
Member

AlenkaF commented May 28, 2024

I think you will need to have URI-encoded spaces in this case, see https://en.wikipedia.org/wiki/Percent-encoding#The_application/x-www-form-urlencoded_type. Could you try replacing whitespace with +?

@AlenkaF
Copy link
Member

AlenkaF commented May 28, 2024

In any case, we should improve the error message. I will change the labels, contributions are very welcome!

@AlenkaF AlenkaF added Type: usage Issue is a user question good-first-issue and removed Type: bug labels May 28, 2024
@AlenkaF AlenkaF changed the title [Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path [C++][Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path May 28, 2024
CrystalZhou0529 added a commit to CrystalZhou0529/arrow that referenced this issue Sep 3, 2024
@CrystalZhou0529
Copy link
Contributor

Hi @AlenkaF, I just opened a PR with the implementation of the new error message. Please take a look at my changes if you have time, and let me know if you have any other concerns regarding the error message please! Thanks.

@pitrou
Copy link
Member

pitrou commented Sep 4, 2024

A quick look into the method, indicates that a LocalFilesytem is chosen by default and returned if alternative filesystems are not detected which seems like a dubious strategy...

I'm not sure there is a better strategy. Can you suggest something?

@jaidisido
Copy link
Author

A quick look into the method, indicates that a LocalFilesytem is chosen by default and returned if alternative filesystems are not detected which seems like a dubious strategy...

I'm not sure there is a better strategy. Can you suggest something?

I just think there shouldn't be a default and rather an exception raised if no filesystem is detected, including LocalFilesystem. But I appreciate that it might break too many things at this point

@jorisvandenbossche jorisvandenbossche changed the title [C++][Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path [Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path Sep 5, 2024
jorisvandenbossche pushed a commit that referenced this issue Sep 5, 2024
### Rationale for this change

We want to enhance error message for URI parsing error to provide more information for the syntax error scenario.

When error message is generated from `uriParseSingleUriExA`, the return value might indicate a `URI_ERROR_SYNTAX` error, and `error_pos` would be set to the position causing syntax error. ([uriparser/Uri.h](https://github.com/apache/arrow/blob/c455d6b8c4ae2cb22baceb4c27e1325b973d39e1/cpp/src/arrow/vendored/uriparser/Uri.h#L288))

In the new error message, it includes the character causing syntax error and its position, so users can have a better idea why the error happens.

### What changes are included in this PR?

- Error message change in URI parsing function.

### Are these changes tested?

PR includes unit tests.

### Are there any user-facing changes?

Yes, but only for error message.

* GitHub Issue: #41365
* GitHub Issue: #43967

Authored-by: Crystal Zhou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
@jorisvandenbossche
Copy link
Member

So going through the logic in the _resolve_filesystem_and_path helper function, I think the main reason we catch (and swallow) the error from FileSystem.from_uri is because we wanted to provide a better error message for local file paths that were not found.

However, in that case you typically get an error about an empty scheme, and not a "Cannot parse URI" error. For example:

In [1]: import pyarrow.fs

In [2]: pyarrow.fs.FileSystem.from_uri("local/file/path")
...
ArrowInvalid: URI has empty scheme: 'local/file/path'
/arrow/cpp/src/arrow/filesystem/filesystem.cc:937  ParseFileSystemUri(uri_string)

Essentially, if an error happens and if the user provides a URI we want to show the error that happens when parsing the URI, and it the user passes a local file path, we would like to show the error that happens when we assume it is a local file path (and further down the line this will then give a file not found error).

But couldn't we use some basic heuristic to determine if the user is actually passing a URI? (I don't know how robust it is to check if :// in the string?)

Based on the above example, it might also to only catch the "empty scheme" error and let the "Cannot parse URI" error always bubble up to the user.

@pitrou
Copy link
Member

pitrou commented Sep 5, 2024

There are all kinds of failure modes:

>>> pq.read_table("/local file")
Traceback (most recent call last):
  ...
FileNotFoundError: /local file

>>> pq.read_table("local file")
Traceback (most recent call last):
  ...
FileNotFoundError: local file

>>> pq.read_table("s3://invalid bucket/bar")
Traceback (most recent call last):
  ...
ArrowInvalid: Expected a local filesystem path, got a URI: 's3://invalid bucket/bar'

>>> pq.read_table("s3://really-nonexistent-bucket/bar")
Traceback (most recent call last):
  ...
OSError: Bucket 'really-nonexistent-bucket' not found

@pitrou
Copy link
Member

pitrou commented Sep 5, 2024

But couldn't we use some basic heuristic to determine if the user is actually passing a URI? (I don't know how robust it is to check if :// in the string?)

It's probably not robust, a local path could contain that string too (it's probably rare on Unix, but perhaps not on Windows).

Edit: we do have such a heuristic already in C++, it's where the "Expected a local filesystem path, got a URI" message comes from. It can probably reused.

bool IsLikelyUri(std::string_view v) {
if (v.empty() || v[0] == '/') {
return false;
}
const auto pos = v.find_first_of(':');
if (pos == v.npos) {
return false;
}
if (pos < 2) {
// One-letter URI schemes don't officially exist, perhaps a Windows drive letter?
return false;
}
if (pos > 36) {
// The largest IANA-registered URI scheme is "microsoft.windows.camera.multipicker"
// with 36 characters.
return false;
}
return ::arrow::util::IsValidUriScheme(v.substr(0, pos));
}

@jorisvandenbossche
Copy link
Member

There are all kinds of failure modes:

Yes, but so those errors mostly comes from after _resolve_filesystem_and_path, when we are actually trying to read based on the inferred filesystem+path combo:

In [1]: from pyarrow.fs import _resolve_filesystem_and_path

In [2]: _resolve_filesystem_and_path("/local file")
Out[2]: (<pyarrow._fs.LocalFileSystem at 0x7fa597bc38b0>, '/local file')

In [3]: _resolve_filesystem_and_path("local file")
Out[3]: (<pyarrow._fs.LocalFileSystem at 0x7fa591904bb0>, 'local file')

In [4]: _resolve_filesystem_and_path("s3://invalid bucket/bar")
Out[4]: (<pyarrow._fs.LocalFileSystem at 0x7fa591380eb0>, 's3://invalid bucket/bar')

In [5]: _resolve_filesystem_and_path("s3://really-nonexistent-bucket/bar")
...
OSError: Bucket 'really-nonexistent-bucket' not found

So it is for the third case where we give a wrong result in this step of the code path, which then later in pq.read_table, when actually trying to read the file, gives the confusing "ArrowInvalid: Expected a local filesystem path, got a URI: 's3://invalid bucket/bar'".
But so that only happens because we incorrectly return that as a local file system. While I think we should be able to detect that it is actually a URI, and therefore that we should return the error from trying to parse that URI with FileSystem.from_uri (as happens in the last case above)

Edit: we do have such a heuristic already in C++, it's where the "Expected a local filesystem path, got a URI" message comes from. It can probably reused.

One option could be to use that in FileSystemFromUri, for example when the parsing fails check if it actually is a likely URI, and if not raise a different error message?

Or bind it in Python, so we can use it to decide in _resolve_filesystem_and_path whether we want to bubble up the error or not.

@pitrou
Copy link
Member

pitrou commented Sep 5, 2024

One option could be to use that in FileSystemFromUri, for example when the parsing fails check if it actually is a likely URI, and if not raise a different error message?

Or bind it in Python, so we can use it to decide in _resolve_filesystem_and_path whether we want to bubble up the error or not.

FileSystemFromUri expects a URI, so we should keep the original error message there.

Using it in _resolve_filesystem_and_path, though, seems reasonable.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Sep 6, 2024
…3938)

### Rationale for this change

We want to enhance error message for URI parsing error to provide more information for the syntax error scenario.

When error message is generated from `uriParseSingleUriExA`, the return value might indicate a `URI_ERROR_SYNTAX` error, and `error_pos` would be set to the position causing syntax error. ([uriparser/Uri.h](https://github.com/apache/arrow/blob/c455d6b8c4ae2cb22baceb4c27e1325b973d39e1/cpp/src/arrow/vendored/uriparser/Uri.h#L288))

In the new error message, it includes the character causing syntax error and its position, so users can have a better idea why the error happens.

### What changes are included in this PR?

- Error message change in URI parsing function.

### Are these changes tested?

PR includes unit tests.

### Are there any user-facing changes?

Yes, but only for error message.

* GitHub Issue: apache#41365
* GitHub Issue: apache#43967

Authored-by: Crystal Zhou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this issue Sep 14, 2024
…3938)

### Rationale for this change

We want to enhance error message for URI parsing error to provide more information for the syntax error scenario.

When error message is generated from `uriParseSingleUriExA`, the return value might indicate a `URI_ERROR_SYNTAX` error, and `error_pos` would be set to the position causing syntax error. ([uriparser/Uri.h](https://github.com/apache/arrow/blob/c455d6b8c4ae2cb22baceb4c27e1325b973d39e1/cpp/src/arrow/vendored/uriparser/Uri.h#L288))

In the new error message, it includes the character causing syntax error and its position, so users can have a better idea why the error happens.

### What changes are included in this PR?

- Error message change in URI parsing function.

### Are these changes tested?

PR includes unit tests.

### Are there any user-facing changes?

Yes, but only for error message.

* GitHub Issue: apache#41365
* GitHub Issue: apache#43967

Authored-by: Crystal Zhou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants