
deadlock when reading parquet on disk #19751

Open
Liyixin95 opened this issue Nov 13, 2024 · 3 comments
Labels
bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@Liyixin95

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import functools
from os import mkdir

import polars as pl
from polars import col as c

CONCAT_LEN = 9


def make_parquet(num_rows):
    import random
    import string

    column1_data = random.choices(string.ascii_lowercase, k=num_rows)
    column2_data = random.choices(range(1, CONCAT_LEN), k=num_rows)

    df = pl.DataFrame({"column1": column1_data, "column2": column2_data})
    try:
        mkdir("./data")
    except Exception:
        pass
    df.write_parquet("./data/test.parquet")


def func(data, column2, output):
    return data.filter(c("column2") == column2).select(
        c("column1").n_unique().alias(output)
    )


def driver():
    # The explicit cache() on the scan is what triggers the hang.
    data = pl.scan_parquet("./data/test.parquet").cache()

    def apply(idx):
        return functools.partial(func, column2=idx, output=f"output{idx}")

    FUNC_LIST = (apply(idx) for idx in range(1, CONCAT_LEN))

    res = (f(data) for f in FUNC_LIST)

    return pl.concat(res, how="horizontal")


if __name__ == "__main__":
    make_parquet(10000000)

    lfs = [driver() for _ in range(1000)]
    dfs = pl.collect_all(lfs)

Log output

No response

Issue description

All of the Polars threads are sleeping and the CLI is stuck. I have tried a few things:

  1. Removing the cache() after pl.scan_parquet("./data/test.parquet") makes it work fine.
  2. Adding parallel="none" to pl.scan_parquet("./data/test.parquet") makes it work fine.
  3. Using IPC instead of parquet makes it work fine.
  4. The larger the CONCAT_LEN value, the higher the probability of getting stuck.

Expected behavior

The query completes successfully.

Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            Linux-5.10.0-28-amd64-x86_64-with-glibc2.31
Python:              3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.2
pyarrow              17.0.0
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           2.0.25
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@Liyixin95 Liyixin95 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Nov 13, 2024
@ritchie46
Member

Yes, remove the cache. The docstring already notes that using it isn't recommended. We should add a note that it can lead to deadlocks.

@ritchie46 ritchie46 added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Nov 13, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Nov 13, 2024
@Liyixin95
Author

> Yes, remove the cache. The docstring already notes that using it isn't recommended. We should add a note that it can lead to deadlocks.

This is a simplification of my actual scenario, where the cache is added to the query plan automatically by the optimizer.

And I found the rayon issue here; maybe this is the underlying reason?

@Liyixin95
Author

@ritchie46 I found a quick solution for this issue. Maybe I can open a PR for you to take a look?
