encode(..., "hex") errors on non-UTF-8 binaries since Datafusion v43 #14055

Closed
Tracked by #14008
progval opened this issue Jan 9, 2025 · 1 comment · Fixed by #14087
Labels
bug Something isn't working help wanted Extra attention is needed regression Something that used to work no longer does

Comments

@progval
Contributor

progval commented Jan 9, 2025

Describe the bug

encode(..., "hex") can be used to get the hexadecimal representation of a string or a binary value. Since DataFusion v43 (specifically, since 1b3608d, i.e. #12308), it only works on strings and on binaries that happen to be valid UTF-8.

To Reproduce

vlorentz@maxxi:~/datafusion/datafusion-cli$ git checkout 1b3608da7ca59d8d987804834d004e8b3e349d18
HEAD is now at 1b3608da7 fix: coalesce schema issues (#12308)
vlorentz@maxxi:~/datafusion/datafusion-cli$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.27s
     Running `target/debug/datafusion-cli`
DataFusion CLI v42.0.0
> create table test ( foo bytea );
0 row(s) fetched. 
Elapsed 0.007 seconds.

> insert into test (foo) values (X'8f50d3f60eae370ddbf85c86219c55108a350165');
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched. 
Elapsed 0.006 seconds.

> EXPLAIN SELECT encode(foo, 'hex') FROM test;
+---------------+-----------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| logical_plan  | Projection: encode(CAST(test.foo AS Utf8), Utf8("hex"))                                 |
|               |   TableScan: test projection=[foo]                                                      |
| physical_plan | ProjectionExec: expr=[encode(CAST(foo@0 AS Utf8), hex) as encode(test.foo,Utf8("hex"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                         |
|               |                                                                                         |
+---------------+-----------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.007 seconds.

> SELECT encode(foo, 'hex') FROM test;
Arrow error: Invalid argument error: Encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 0
> 
\q

Expected behavior

vlorentz@maxxi:~/datafusion/datafusion-cli$ git checkout 1b3608da7ca59d8d987804834d004e8b3e349d18^
Previous HEAD position was 1b3608da7 fix: coalesce schema issues (#12308)
HEAD is now at 9a3f8d115 Minor: Encapsulate type check in GroupValuesColumn, avoid panic (#12620)
vlorentz@maxxi:~/datafusion/datafusion-cli$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 53.01s
     Running `target/debug/datafusion-cli`
DataFusion CLI v42.0.0
> create table test ( foo bytea );
0 row(s) fetched. 
Elapsed 0.005 seconds.

> insert into test (foo) values (X'8f50d3f60eae370ddbf85c86219c55108a350165');
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> EXPLAIN SELECT encode(foo, 'hex') FROM test;
+---------------+---------------------------------------------------------------------------+
| plan_type     | plan                                                                      |
+---------------+---------------------------------------------------------------------------+
| logical_plan  | Projection: encode(test.foo, Utf8("hex"))                                 |
|               |   TableScan: test projection=[foo]                                        |
| physical_plan | ProjectionExec: expr=[encode(foo@0, hex) as encode(test.foo,Utf8("hex"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                           |
|               |                                                                           |
+---------------+---------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.005 seconds.

> SELECT encode(foo, 'hex') FROM test;
+------------------------------------------+
| encode(test.foo,Utf8("hex"))             |
+------------------------------------------+
| 8f50d3f60eae370ddbf85c86219c55108a350165 |
+------------------------------------------+
1 row(s) fetched. 
Elapsed 0.004 seconds.

> 
\q

Additional context

Note the CAST(test.foo AS Utf8) in the first query plan, which is absent from the second one.
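To illustrate why that extra cast matters, here is a minimal standalone sketch using only the Rust standard library. It is not DataFusion's or Arrow's actual code: `cast_to_utf8` and `hex_encode` are hypothetical helpers that stand in for the CAST to Utf8 and for hex-encoding the raw bytes, respectively.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for `CAST(foo AS Utf8)`: interpreting arbitrary
/// bytes as a string requires them to be valid UTF-8.
pub fn cast_to_utf8(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Hypothetical stand-in for hex-encoding the binary value directly:
/// every byte becomes two lowercase hex digits, no UTF-8 check involved.
pub fn hex_encode(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```

With the value from the reproduction (whose first byte is 0x8f, a UTF-8 continuation byte that cannot start a sequence), `cast_to_utf8` returns an error at index 0, while `hex_encode` works on the raw bytes without any validity check. This suggests the failure comes from the cast inserted in front of encode rather than from the hex encoding itself.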

cc @mesejo

@progval progval added the bug Something isn't working label Jan 9, 2025
@alamb alamb added help wanted Extra attention is needed regression Something that used to work no longer does labels Jan 10, 2025
mesejo added a commit to mesejo/arrow-datafusion that referenced this issue Jan 11, 2025
mesejo added a commit to mesejo/arrow-datafusion that referenced this issue Jan 11, 2025
@alamb
Contributor

alamb commented Jan 13, 2025

I added this to the list of items that I think we should fix before releasing version 45.0.0:
