'read_csv' returns wrong number of records #6865

Closed
2 tasks done
edorid opened this issue Feb 14, 2023 · 3 comments · Fixed by #6906
Labels
bug Something isn't working python Related to Python Polars

Comments


edorid commented Feb 14, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I compared Polars against the csv module from the Python standard library, and the number of records read by Polars does not match.
The attached file is actually a database dump that I have obfuscated; the obfuscated file still produces the same output.

issue.csv

Reproducible example

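import csv

import polars as pl

if __name__ == "__main__":
    # Count records with the standard library. newline="" lets the csv module
    # handle line breaks inside quoted fields; the count includes the header row.
    with open("issue.csv", "r", newline="") as f:
        reader = csv.reader(f)
        lines = 0
        for row in reader:
            lines += 1

    print("Lines: %s" % lines)

    # Count the rows Polars parses from the same file (header excluded).
    db_df = pl.read_csv("issue.csv")
    print("Lines: %s" % len(db_df.rows()))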

Output:

Lines: 5
Lines: 1334


### Expected behavior

The number of records should be the same (records=5).

### Installed versions

<details>

---Version info---
Polars: 0.16.4
Index type: UInt32
Platform: Linux-5.15.0-60-generic-x86_64-with
Python: 3.9.16 (main, Dec 10 2022, 13:47:19)
[GCC 10.3.1 20210424]
---Optional dependencies---
pyarrow:
pandas:
numpy:
fsspec:
connectorx:
xlsx2csv:
deltalake:
matplotlib:

</details>

@edorid edorid added bug Something isn't working python Related to Python Polars labels Feb 14, 2023
Collaborator

alexander-beedie commented Feb 14, 2023

Thanks for the bug report; just to clarify, from the obfuscated data it appears that you are loading logged/multi-line SQL queries?

As a temporary workaround, you could do the following to get the data into polars:

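import csv

import polars as pl

with open("issue.csv", "r", newline="") as f:
    iter_csv = csv.reader(f)
    # The first row holds the column names.
    columns = next(iter_csv)

    # Build the frame while the file is still open; the reader is lazy.
    df = pl.DataFrame(
        data=iter_csv,
        schema=columns,
    )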
    # ┌──────────┬────────────────────┬─────────────────────┐
    # │ type     ┆ dbname             ┆ dump                │
    # │ ---      ┆ ---                ┆ ---                 │
    # │ str      ┆ str                ┆ str                 │
    # ╞══════════╪════════════════════╪═════════════════════╡
    # │ database ┆ connections-prod   ┆ --                  │
    # │          ┆                    ┆ x x x.x_x_x_x_x_x x │
    # │          ┆                    ┆  x x.x,             │
    # │          ┆                    ┆  ...                │
    # │ database ┆ content-prod       ┆ --                  │
    # │          ┆                    ┆ -- x x x            │
    # │          ┆                    ┆ --                  │
    # │          ┆                    ┆                     │
    # │          ┆                    ┆ -- x x x x 11.16... │
    # │ database ┆ notifications-prod ┆ --                  │
    # │          ┆                    ┆ -- x x x            │
    # │          ┆                    ┆ --                  │
    # │          ┆                    ┆                     │
    # │          ┆                    ┆ -- x x x x 11.16... │
    # │ database ┆ users-prod         ┆ --                  │
    # │          ┆                    ┆ -- x x x            │
    # │          ┆                    ┆ --                  │
    # │          ┆                    ┆                     │
    # │          ┆                    ┆ -- x x x x 11.16... │
    # └──────────┴────────────────────┴─────────────────────┘

Author

edorid commented Feb 14, 2023

Yes, these are multi-line SQL queries generated by pg_dumpall.
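To make the distinction concrete, here is a minimal sketch of how a quoted multi-line field makes the physical line count differ from the CSV record count. The sample data and the helper names (`count_physical_lines`, `count_csv_records`) are made up for illustration; this is not the attached file.

```python
import csv
import io

# Hypothetical stand-in for the attached file: a header plus two records
# whose last field is a quoted, multi-line SQL snippet.
SAMPLE = (
    "type,dbname,dump\n"
    'database,users-prod,"--\n-- x x x\n--\nSELECT 1;"\n'
    'database,content-prod,"--\nSELECT 2;"\n'
)


def count_physical_lines(text: str) -> int:
    """Count newline-delimited lines, ignoring CSV quoting entirely."""
    return len(text.splitlines())


def count_csv_records(text: str) -> int:
    """Count records as the csv module sees them, header row included."""
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))


physical = count_physical_lines(SAMPLE)
records = count_csv_records(SAMPLE)
print("physical lines:", physical)
print("csv records (incl. header):", records)
```

A reader that honours quoting reports far fewer records than there are lines in the file, which is the kind of gap reported in this issue.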

Member

ritchie46 commented Feb 14, 2023

Right, Polars succeeds if we set n_threads to 1. So with multiple threads we have trouble finding a valid point at which to split up the file. I will look into this.

This does work:

pl.read_csv("/home/ritchie46/Downloads/issue.csv", n_threads=1)
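The splitting problem can be illustrated with the standard library alone. The sketch below is not Polars' chunking logic; it only assumes that a parallel reader wants to cut the input at a newline, and it checks for each newline whether parsing the two halves independently gives the same rows as parsing the whole text. The data and the helper names (`parse`, `newline_offsets`, `split_is_safe`) are hypothetical.

```python
import csv
import io

# Hypothetical data: a header plus two records with multi-line quoted fields.
DATA = (
    "type,dbname,dump\n"
    'database,users-prod,"--\n-- x x x\nSELECT 1;"\n'
    'database,content-prod,"--\nSELECT 2;"\n'
)


def parse(text: str) -> list[list[str]]:
    """Parse CSV text into a list of rows with the csv module."""
    return list(csv.reader(io.StringIO(text, newline="")))


def newline_offsets(text: str) -> list[int]:
    """Return the offset of every newline, quoted or not."""
    return [i for i, ch in enumerate(text) if ch == "\n"]


def split_is_safe(text: str, cut: int) -> bool:
    """True if parsing both sides of the cut separately matches a full parse."""
    head, tail = text[: cut + 1], text[cut + 1 :]
    return parse(head) + parse(tail) == parse(text)


cuts = newline_offsets(DATA)
safe = [cut for cut in cuts if split_is_safe(DATA, cut)]
unsafe = [cut for cut in cuts if not split_is_safe(DATA, cut)]
print(f"{len(safe)} safe and {len(unsafe)} unsafe newline split points")
```

Any newline that sits inside a quoted field is an unsafe cut: the chunk after it starts mid-field and is parsed as extra rows, which is one way an inflated row count can arise. Reading with a single thread avoids the cut altogether.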
