Reproducible example

import functools
from os import mkdir

import polars as pl
from polars import col as c

CONCAT_LEN = 9


def make_parquet(num_rows):
    import random
    import string

    column1_data = random.choices(string.ascii_lowercase, k=num_rows)
    column2_data = random.choices(range(1, CONCAT_LEN), k=num_rows)
    df = pl.DataFrame({"column1": column1_data, "column2": column2_data})
    try:
        mkdir("./data")
    except Exception:
        pass
    df.write_parquet("./data/test.parquet")


def func(data, column2, output):
    # Count distinct column1 values for one column2 bucket.
    return data.filter(c("column2") == column2).select(
        c("column1").n_unique().alias(output)
    )


def driver():
    # The cached scan is shared by every sub-query built below.
    data = pl.scan_parquet("./data/test.parquet").cache()

    def apply(idx):
        return functools.partial(func, column2=idx, output=f"output{idx}")

    FUNC_LIST = (apply(idx) for idx in range(1, CONCAT_LEN))
    res = (func(data) for func in FUNC_LIST)
    return pl.concat(res, how="horizontal")


if __name__ == "__main__":
    make_parquet(10000000)
    lfs = [driver() for _ in range(1000)]
    dfs = pl.collect_all(lfs)
Log output

No response
Issue description

All the polars threads are sleeping and the CLI is stuck. I have tried a few things (a minimal sketch of the scan-level variants follows this list):
- Remove cache() after the pl.scan_parquet("./data/test.parquet"), then it works fine.
- Pass parallel="none" to pl.scan_parquet("./data/test.parquet"), then it works fine.
- Use ipc instead of parquet, then it works fine.
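For reference, a minimal sketch of the second and third workarounds applied to the scan, assuming the same ./data layout as the repro; the IPC path is hypothetical, since the repro only writes a parquet file:

import polars as pl

# Workaround: disable parallel reading of the parquet file.
data = pl.scan_parquet("./data/test.parquet", parallel="none").cache()

# Workaround: scan IPC instead of parquet (this assumes the data was also
# written with df.write_ipc("./data/test.ipc")).
data = pl.scan_ipc("./data/test.ipc").cache()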
Expected behavior

The queries should complete successfully.
Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            Linux-5.10.0-28-amd64-x86_64-with-glibc2.31
Python:              3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
LTS CPU:             False
----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.2
pyarrow              17.0.0
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           2.0.25
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
Yes, remove the cache. The docstring already says it isn't recommended to use it. We should add a note there that it can lead to deadlocks.
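For concreteness, a minimal sketch of that suggestion applied to the reproducible example above; the name driver_no_cache is hypothetical, and the helper is the same as in the repro except that the explicit .cache() on the scan is dropped:

import functools

import polars as pl
from polars import col as c

CONCAT_LEN = 9  # same constant as in the reproducible example


def func(data, column2, output):
    # Same helper as in the reproducible example.
    return data.filter(c("column2") == column2).select(
        c("column1").n_unique().alias(output)
    )


def driver_no_cache():
    # Identical to the repro's driver(), except the explicit .cache() is dropped.
    data = pl.scan_parquet("./data/test.parquet")
    funcs = (
        functools.partial(func, column2=idx, output=f"output{idx}")
        for idx in range(1, CONCAT_LEN)
    )
    return pl.concat((f(data) for f in funcs), how="horizontal")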
This is just a simplification of my actual scenario. In my actual scenario, the cache is added to the query plan automatically by the optimizer. I also found the rayon issue here; maybe that is the underlying reason?
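As a rough way to see this, the optimized plan of a query with a shared sub-plan can be printed with explain(); with common-subplan elimination enabled (the default), it should show the CACHE nodes the optimizer inserted. A minimal sketch, assuming the test.parquet from the repro exists:

import polars as pl
from polars import col as c

lf = pl.scan_parquet("./data/test.parquet")
a = lf.filter(c("column2") == 1).select(c("column1").n_unique().alias("output1"))
b = lf.filter(c("column2") == 2).select(c("column1").n_unique().alias("output2"))

# Both branches share the same scan, so the optimizer may insert CACHE nodes
# when common-subplan elimination is applied; explain() prints the optimized plan.
print(pl.concat([a, b], how="horizontal").explain())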
@ritchie46 I found a quick fix for this issue. Maybe I can open a PR for you to take a look?