'read_csv' returns wrong number of records #6865

edorid · 2023-02-14T04:05:55Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Issue description

I compared the result against Python's standard csv library, and the number of records read by Polars did not match.
The attached file is actually a database dump that I have already obfuscated, but the output is still the same.

issue.csv

Reproducible example

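import csv

import polars as pl

if __name__ == "__main__":
    # Count records with the standard library parser, which handles quoted
    # fields that span multiple lines (the header row is included in the count).
    with open("issue.csv", "r", newline="") as f:
        reader = csv.reader(f)
        lines = 0
        for row in reader:
            lines += 1

    print("Lines: %s" % lines)

    # Count the rows that Polars parses from the same file.
    db_df = pl.read_csv("issue.csv")
    print("Lines: %s" % len(db_df.rows()))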

Output:

Lines: 5
Lines: 1334



### Expected behavior

The number of records should be the same (records=5).

### Installed versions

<details>

---Version info---
Polars: 0.16.4
Index type: UInt32
Platform: Linux-5.15.0-60-generic-x86_64-with
Python: 3.9.16 (main, Dec 10 2022, 13:47:19)
[GCC 10.3.1 20210424]
---Optional dependencies---
pyarrow:
pandas:
numpy:
fsspec:
connectorx:
xlsx2csv:
deltalake:
matplotlib:

</details>


alexander-beedie · 2023-02-14T05:50:25Z

Thanks for the bug report; just to clarify, from the obfuscated data it appears that you are loading logged/multi-line SQL queries?

As a temporary workaround, you could do the following to get the data into polars:

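import csv

import polars as pl

# Let the standard library parse the multi-line quoted fields,
# then hand the parsed rows to Polars.
with open("issue.csv", "r", newline="") as f:
    iter_csv = csv.reader(f)
    columns = next(iter_csv)  # the first record is the header

    df = pl.DataFrame(
        data=iter_csv,
        schema=columns,
    )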
    # ┌──────────┬────────────────────┬─────────────────────┐
    # │ type     ┆ dbname             ┆ dump                │
    # │ ---      ┆ ---                ┆ ---                 │
    # │ str      ┆ str                ┆ str                 │
    # ╞══════════╪════════════════════╪═════════════════════╡
    # │ database ┆ connections-prod   ┆ --                  │
    # │          ┆                    ┆ x x x.x_x_x_x_x_x x │
    # │          ┆                    ┆  x x.x,             │
    # │          ┆                    ┆  ...                │
    # │ database ┆ content-prod       ┆ --                  │
    # │          ┆                    ┆ -- x x x            │
    # │          ┆                    ┆ --                  │
    # │          ┆                    ┆                     │
    # │          ┆                    ┆ -- x x x x 11.16... │
    # │ database ┆ notifications-prod ┆ --                  │
    # │          ┆                    ┆ -- x x x            │
    # │          ┆                    ┆ --                  │
    # │          ┆                    ┆                     │
    # │          ┆                    ┆ -- x x x x 11.16... │
    # │ database ┆ users-prod         ┆ --                  │
    # │          ┆                    ┆ -- x x x            │
    # │          ┆                    ┆ --                  │
    # │          ┆                    ┆                     │
    # │          ┆                    ┆ -- x x x x 11.16... │
    # └──────────┴────────────────────┴─────────────────────┘

edorid · 2023-02-14T06:40:21Z

Yes, these are multi-line SQL queries generated by pg_dumpall.
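
To make the shape of the data concrete, here is a minimal sketch using only the standard library. The rows are a hypothetical stand-in for the obfuscated dump (the column names are taken from the table above, the SQL text is invented). It shows that one record with a quoted multi-line field spans several physical lines, so counting newlines and counting records give different answers.

```python
import csv
import io

# Hypothetical stand-in for the obfuscated dump: the "dump" column
# holds multi-line SQL text, so each record spans several lines.
rows = [
    ["type", "dbname", "dump"],
    ["database", "users-prod", "--\n-- x x x\n--\n\nSELECT a,\n  b;"],
    ["database", "content-prod", "--\n-- x x x\n--"],
]

# Write the rows as CSV; fields containing newlines are quoted automatically.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# Physical lines in the file versus records seen by a quote-aware parser.
physical_lines = len(text.splitlines())
records = sum(1 for _ in csv.reader(io.StringIO(text, newline="")))

print("physical lines:", physical_lines)
print("records:", records)
```

Any reader that treats every newline as a record boundary will over-count on input like this.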

ritchie46 · 2023-02-14T09:37:46Z

Right, polars succeeds if we set n_threads to 1. So with multiple threads we have trouble finding a valid point at which to split up the file into chunks. I will look into this.

This does work:

pl.read_csv("/home/ritchie46/Downloads/issue.csv", n_threads=1)
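
The split-point problem can be illustrated with a small standard-library sketch. This is not Polars' actual implementation, only a simplified model of it: `naive_split` and `quote_aware_split` are hypothetical helpers, and the sample text is invented. Cutting the file at a newline that falls inside a quoted field and parsing the two halves independently changes the record count, while cutting at a newline outside quotes preserves it.

```python
import csv
import io


def count_records(text):
    """Count CSV records with the standard library parser."""
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))


def naive_split(text, pos):
    """Cut at the first newline at or after pos, ignoring quotes."""
    cut = text.index("\n", pos) + 1
    return text[:cut], text[cut:]


def quote_aware_split(text, pos):
    """Cut at the first newline at or after pos that lies outside quotes."""
    in_quotes = False
    for i, ch in enumerate(text):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes and i >= pos:
            return text[: i + 1], text[i + 1 :]
    return text, ""


text = (
    "type,dbname,dump\n"
    'database,users-prod,"SELECT a,\n  b,\n  c;"\n'
    'database,content-prod,"SELECT d;"\n'
)

# Pick a split position that lands inside the multi-line quoted field.
pos = text.index("SELECT a")

expected = count_records(text)
naive_total = sum(count_records(part) for part in naive_split(text, pos))
aware_total = sum(count_records(part) for part in quote_aware_split(text, pos))

print("whole file:", expected)
print("naive split:", naive_total)
print("quote-aware split:", aware_total)
```

The quote-aware variant has to scan from the start of the file to know whether it is inside quotes, which is why finding a safe split point cheaply is hard and why single-threaded parsing sidesteps the issue.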

edorid added the bug (Something isn't working) and python (Related to Python Polars) labels on Feb 14, 2023

This was referenced Feb 14, 2023

fix(rust, python): use schema to define csv file splits #6873

Closed

fix(rust, python): reject multithreading on excessive ',\n' fields #6906

Merged

ritchie46 closed this as completed in #6906 Feb 15, 2023
