
"corrupt deflate stream" when inferring schema from CSV file during CREATE EXTERNAL TABLE #5041

Open · kmitchener opened this issue Jan 24, 2023 · 1 comment
Labels: bug (Something isn't working)

@kmitchener (Contributor)

Describe the bug
Inferring the schema from a gzip-compressed CSV file fails in some cases.

To Reproduce
Create a file like this one, and gzip it:

❯ cat file1.csv
Region,Units
United States,10
Canada,20
United States,10
Canada,20
United States,10
Canada,20
Canada,20
❯ gzip file1.csv

I couldn't reproduce this error with a gzipped file of exactly repeating rows; it seems to require some variation in the rows (or something else triggers it).

Then attempt to make a table from it:

❯ create external table file_csv stored as csv with header row compression type gzip location './file1.csv.gz';
ArrowError(CsvError("corrupt deflate stream"))

Expected behavior
The statement should not throw an error; it should create the external table.

Additional context
It works fine with the same file if you specify the table schema:

❯ create external table file_csv(a string,b int) stored as csv with header row compression type gzip location './file1.csv.gz';
0 rows in set. Query took 0.000 seconds.
❯ select * from file_csv;
+---------------+----+
| a             | b  |
+---------------+----+
| United States | 10 |
| Canada        | 20 |
| United States | 10 |
| Canada        | 20 |
| United States | 10 |
| Canada        | 20 |
| Canada        | 20 |
+---------------+----+
7 rows in set. Query took 0.003 seconds.
kmitchener added the bug (Something isn't working) label on Jan 24, 2023
@tustvold (Contributor)

It looks like the schema inference logic is applying the newline-delimited chunking before applying the decompression; it should just be a case of reversing the order of those two steps, in particular by using FileCompressionType::convert_stream instead of FileCompressionType::convert_read.
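For illustration, here is a minimal standalone sketch of why the order matters. It uses the flate2 crate directly, purely as an assumption to keep the example self-contained; it is not DataFusion's actual code path through FileCompressionType. Decompressing first and then splitting on newlines succeeds, while chunking the compressed bytes at newline positions first yields chunks that fail to decode:

```rust
use std::io::{BufRead, BufReader, Read, Write};

use flate2::read::GzDecoder;
use flate2::write::GzEncoder;
use flate2::Compression;

fn main() -> std::io::Result<()> {
    // Gzip the sample CSV from the reproduction above.
    let csv = b"Region,Units\nUnited States,10\nCanada,20\nUnited States,10\n";
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(csv)?;
    let compressed = encoder.finish()?;

    // Correct order (what decompress-then-chunk achieves): decode the whole
    // byte stream first, then split the *decompressed* text on newlines.
    let reader = BufReader::new(GzDecoder::new(&compressed[..]));
    for line in reader.lines() {
        println!("{}", line?);
    }

    // Buggy order: chunk the *compressed* bytes at newline (0x0A) positions,
    // then decompress a chunk. A chunk cut mid-deflate-stream fails to
    // decode, surfacing as an error like "corrupt deflate stream". If the
    // compressed bytes happen to contain no 0x0A, nothing gets split, which
    // may be why a file of exactly repeating rows did not trigger the bug.
    if let Some(pos) = compressed.iter().position(|&b| b == b'\n') {
        let chunk = &compressed[..=pos];
        let mut out = String::new();
        match GzDecoder::new(chunk).read_to_string(&mut out) {
            Ok(_) => println!("chunk decoded cleanly (split was harmless)"),
            Err(e) => println!("decoding a newline-delimited chunk failed: {e}"),
        }
    }
    Ok(())
}
```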
