
Incorrect Schema Adaption for CSV #4918

Open
tustvold opened this issue Jan 15, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@tustvold
Contributor

tustvold commented Jan 15, 2023

Describe the bug

The schema adaption logic added in #1709 misbehaves for CSV data. In particular, it incorrectly assumes that it can create a schema for the entire dataset that is a superset of the individual file schemas, that the CSV reader will pad any missing columns with nulls, and that it will reorder columns that appear in a different order.

In reality the CSV reader handles neither missing nor reordered columns; it only works by accident when a file's schema happens to be an exact prefix of the aggregate schema. This relied on a quirk of the arrow reader prior to 30.0.0; since then, the arrow reader returns an error because the schemas do not match.
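For anyone puzzling over why the prefix case ever worked: here is a minimal sketch, in plain std-only Rust (a hypothetical simulation, not the actual DataFusion or arrow code), of purely positional reading. It shows both the prefix quirk and the silent corruption for reordered columns.

```rust
// Simulate purely positional CSV reading: the header is parsed but,
// crucially, never used to map file columns onto the table schema.
fn read_positionally(csv: &str) -> (Vec<String>, Vec<Vec<String>>) {
    let mut lines = csv.lines();
    let header = lines.next().unwrap().split(',').map(str::to_string).collect();
    let rows = lines
        .map(|l| l.split(',').map(str::to_string).collect())
        .collect();
    (header, rows)
}

fn main() {
    let table_schema = ["a", "b"]; // aggregate schema, e.g. inferred from a.csv

    // Exact prefix: positions line up, so values land in the right columns.
    let (header, rows) = read_positionally("a,b\n1,2");
    assert_eq!(header, table_schema);
    assert_eq!(rows[0], ["1", "2"]); // "1" really is column "a"

    // Reordered file: positional reading silently puts "b" values under "a".
    let (header, rows) = read_positionally("b,a\n1,2");
    assert_ne!(header, table_schema);
    assert_eq!(rows[0], ["1", "2"]); // "1" belongs to "b" but fills slot 0 ("a")
}
```

Nothing in the positional path ever consults the header, which is exactly the silent data corruption described below.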

To Reproduce

Both of the following tests fail on current master

    #[tokio::test]
    async fn csv_schema_reordered() -> Result<()> {
        use object_store::path::Path;

        let session_ctx = SessionContext::new();

        let store = InMemory::new();

        let data = bytes::Bytes::from("a,b\n1,2\n3,4");
        store.put(&Path::from("a.csv"), data).await.unwrap();

        let data = bytes::Bytes::from("b,a\n1,2\n3,4");
        store.put(&Path::from("b.csv"), data).await.unwrap();

        session_ctx
            .runtime_env()
            .register_object_store("memory", "", Arc::new(store));

        let df = session_ctx
            .read_csv("memory:///", CsvReadOptions::new())
            .await
            .unwrap();
        let result = df.collect().await.unwrap();

        let expected = vec![
            "+---+---+",
            "| a | b |",
            "+---+---+",
            "| 1 | 2 |",
            "| 2 | 1 |",
            "| 3 | 4 |",
            "| 4 | 3 |",
            "+---+---+",
        ];

        crate::assert_batches_eq!(expected, &result);

        Ok(())
    }

    #[tokio::test]
    async fn csv_schema_extra_column() -> Result<()> {
        use object_store::path::Path;

        let session_ctx = SessionContext::new();

        let store = InMemory::new();

        let data = bytes::Bytes::from("a,b\n1,2\n3,4");
        store.put(&Path::from("a.csv"), data).await.unwrap();

        let data = bytes::Bytes::from("a,c\n5,6\n7,8");
        store.put(&Path::from("b.csv"), data).await.unwrap();

        session_ctx
            .runtime_env()
            .register_object_store("memory", "", Arc::new(store));

        let df = session_ctx
            .read_csv("memory:///", CsvReadOptions::new())
            .await
            .unwrap();
        let result = df.collect().await.unwrap();

        let expected = vec![
            "+---+---+---+",
            "| a | b | c |",
            "+---+---+---+",
            "| 1 | 2 |   |",
            "| 3 | 4 |   |",
            "| 5 |   | 6 |",
            "| 7 |   | 8 |",
            "+---+---+---+",
        ];

        crate::assert_batches_eq!(expected, &result);

        Ok(())
    }

Expected behavior

I think both of the following would be valid:

  • Don't perform schema adaption for CSV, as it isn't a self-describing format like JSON or Parquet; instead return an error if the schemas don't match
  • Correctly infer the schema on a per-file basis, and use that schema when reading each file
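The first option amounts to an up-front equality check before the scan starts. A minimal sketch, with a hypothetical `Field` type standing in for arrow's `Field` (not an actual DataFusion API):

```rust
// Hypothetical stand-in for arrow's Field: just a name and a type label.
#[derive(Debug, PartialEq, Clone)]
struct Field {
    name: String,
    data_type: String,
}

// Reject the scan unless every file schema matches the first one exactly,
// so mismatches surface as an error instead of silent corruption.
fn validate_identical(schemas: &[Vec<Field>]) -> Result<(), String> {
    let Some((first, rest)) = schemas.split_first() else {
        return Ok(());
    };
    for (i, schema) in rest.iter().enumerate() {
        if schema != first {
            return Err(format!(
                "file {} schema {:?} does not match {:?}",
                i + 1,
                schema,
                first
            ));
        }
    }
    Ok(())
}

fn main() {
    let field = |name: &str| Field { name: name.into(), data_type: "Utf8".into() };
    let a_csv = vec![field("a"), field("b")]; // schema of a.csv
    let b_csv = vec![field("b"), field("a")]; // schema of b.csv (reordered)
    assert!(validate_identical(&[a_csv.clone(), a_csv.clone()]).is_ok());
    assert!(validate_identical(&[a_csv, b_csv]).is_err());
}
```

This is strict by design: the reordered and extra-column cases from the repro both fail the check rather than producing wrong rows.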

Additional context
#4818 (comment)

@tustvold tustvold added the bug Something isn't working label Jan 15, 2023
@alamb
Contributor

alamb commented Jan 15, 2023

That memory store is quite clever

@comphead
Contributor

Sorry @tustvold, it's still not entirely clear to me: for the tests above, do you expect them to pass?

@tustvold
Contributor Author

It probably should error; I'm not sure that schema adaption for CSV makes all that much sense

@alamb
Contributor

alamb commented Jan 25, 2023

I think it would be valuable to support reading several CSV files from a directory as long as their schemas are compatible, even if they did not all have exactly the same schema

@tustvold
Contributor Author

as their schemas are compatible

Provided they have headers, this should be possible; all bets are off if any files lack headers, though...

That being said, this would require fairly extensive reworking of the schema adaption mechanism, as it currently does not infer per-file schemas.

My vote would be to fix the silent data corruption that currently happens by returning an error if the schemas aren't identical, and then potentially look into supporting more sophisticated schema management down the line if there is interest.
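If the more sophisticated handling is ever explored, the header-based mapping could look roughly like this hedged sketch (a hypothetical helper, not part of DataFusion): build a projection from each file's header into the aggregate schema, and error on unknown columns instead of reading positionally.

```rust
// For each column in the file's header, find its index in the aggregate
// table schema. The resulting projection tells the reader where each file
// column belongs; columns absent from a file would be null-padded by the
// caller (not shown here). Unknown columns are an error, never silent.
fn projection(table: &[&str], file_header: &[&str]) -> Result<Vec<usize>, String> {
    file_header
        .iter()
        .map(|col| {
            table
                .iter()
                .position(|t| t == col)
                .ok_or_else(|| format!("column {col:?} not in table schema"))
        })
        .collect()
}

fn main() {
    let table = ["a", "b", "c"];
    // b.csv's header "b,a": file column 0 maps to table index 1, and so on.
    assert_eq!(projection(&table, &["b", "a"]).unwrap(), vec![1, 0]);
    // A column the table has never seen is rejected up front.
    assert!(projection(&table, &["z"]).is_err());
}
```

This only works when every file has a header row, which is exactly the "all bets are off" caveat above.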

@alamb
Contributor

alamb commented Jan 25, 2023

My vote would be to fix the silent data corruption that currently happens by returning an error if the schemas aren't identical, and then potentially look into supporting more sophisticated schema management down the line if there is interest.

I agree this sounds like the best plan to me
