Request for documentation for compressed CSV/JSON support #5657

Closed
ismail opened this issue Mar 20, 2023 · 7 comments · Fixed by #5860
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

ismail commented Mar 20, 2023

Hi,

Support for compressed CSV/JSON was added in b8a3a78, but trying to use it in a sample program like this:

use datafusion::prelude::*;
use datafusion::datasource::file_format::file_type::FileCompressionType;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let csv_options = CsvReadOptions::default()
        .has_header(true)
        .file_compression_type(FileCompressionType::BZIP2);
    let df = ctx.read_csv("summary.csv.bz2", csv_options).await?;
    let df = df
        .filter(col("status").eq(lit("OK")))?
        .select_columns(&["name", "id"])?;

    df.show().await?;
    Ok(())
}

results in

Error: SchemaError(FieldNotFound { field: Column { relation: None, name: "status" }, valid_fields: [] })

The code works fine on the uncompressed CSV. Since documentation for this feature is missing, I am wondering if I'm holding it wrong. I would appreciate it if the documentation could give an example of usage.

@ismail ismail added the enhancement New feature or request label Mar 20, 2023
Contributor

alamb commented Mar 27, 2023

I wonder if you could provide an example file?

Author

ismail commented Mar 28, 2023

Here is a randomly generated example that shows the issue on my machine. Just rename it to summary.csv.bz2.

summary.csv.bz2.zip

@alamb alamb added bug Something isn't working help wanted Extra attention is needed and removed enhancement New feature or request labels Mar 30, 2023
Contributor

alamb commented Mar 30, 2023

I am not sure what is going on here -- it would be great if someone could investigate further

Contributor

Jefffrey commented Mar 31, 2023

Specific issue seems to be in this function:

https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/core/src/datasource/listing/table.rs#L431-L444

Because the file extension is .csv.bz2 and not just .csv, the function doesn't list the file. The schema is then inferred from an empty list of files, which produces an empty schema.
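
To illustrate the mismatch, here is a minimal sketch of suffix-based file listing. It is not the DataFusion code at the linked lines: `matches_extension` and `list_matching` are hypothetical names, and the sketch assumes the listing step keeps only paths that end with the configured extension (`.csv` unless overridden for CSV reads).

```rust
/// Hypothetical helper: keep a path only if it ends with the configured extension.
fn matches_extension(path: &str, file_extension: &str) -> bool {
    path.ends_with(file_extension)
}

/// Hypothetical stand-in for the listing step: filter candidate paths by extension.
fn list_matching<'a>(paths: &[&'a str], file_extension: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| matches_extension(p, file_extension))
        .collect()
}
```

Under that assumption, listing `summary.csv.bz2` with the extension `.csv` yields no files, so there is nothing to infer a schema from, while listing with `.csv.bz2` finds the file.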

As a temporary workaround, I renamed the file from summary.csv.bz2 to summary.csv. The file was then picked up properly, but reading it ran into another issue:

Error: ArrowError(CsvError("decompression not finished but EOF reached"))

This specifically stems from here:

https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/core/src/datasource/file_format/csv.rs#L208-L215

I haven't looked into it much, but it seems similar to #5041.

Contributor

jiangzhx commented Apr 4, 2023

Following @Jefffrey's clue and #5109, I solved this problem.
When the file name ends with "csv.bz2", the read options should also set .file_extension("csv.bz2").

@ismail you can try it with #5860. I will add more examples and test cases.

    let csv_options = CsvReadOptions::default()
        .has_header(true)
        .file_compression_type(FileCompressionType::BZIP2)
        .file_extension("csv.bz2");
    let df = ctx
        .read_csv(&format!("{testdata}/csv/summary.csv.bz2"), csv_options)
        .await?;
    let df = df
        .filter(col("status").eq(lit("OK")))?
        .select_columns(&["name", "id"])?;

    df.show().await?;
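
Since the compression type and the file extension have to agree, one option is to derive both from the file name in user code. The sketch below is only an illustration under that assumption: `Compression` and `infer_read_settings` are hypothetical names, the enum is a local stand-in rather than DataFusion's `FileCompressionType`, and nothing in this thread says DataFusion does this inference itself.

```rust
/// Hypothetical local enum; it only mirrors the idea of a compression type
/// and is not DataFusion's `FileCompressionType`.
#[derive(Debug, PartialEq)]
enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Uncompressed,
}

/// Guess the compression and the full extension (e.g. "csv.bz2") from a file name.
/// `base` is the uncompressed extension without a leading dot, such as "csv" or "json".
fn infer_read_settings(path: &str, base: &str) -> Option<(Compression, String)> {
    let candidates = [
        (Compression::Gzip, "gz"),
        (Compression::Bzip2, "bz2"),
        (Compression::Xz, "xz"),
    ];
    for (compression, suffix) in candidates {
        let extension = format!("{base}.{suffix}");
        if path.ends_with(&format!(".{extension}")) {
            return Some((compression, extension));
        }
    }
    // No known compression suffix: fall back to the plain extension.
    if path.ends_with(&format!(".{base}")) {
        return Some((Compression::Uncompressed, base.to_string()));
    }
    None
}
```

The returned pair could then be mapped by hand onto `.file_compression_type(...)` and `.file_extension(...)` as in the snippet above.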

Author

ismail commented Apr 7, 2023

Sorry for the late reply. @jiangzhx I tested your branch directly, and it resolves the issue, thanks a lot!

@jiangzhx
Contributor

Sorry for the late reply. @jiangzhx I tested your branch directly, and it resolves the issue, thanks a lot!

You are welcome 😁
