collect_async is blocking #18718
Well, I happen to have implemented that 😃 About It does use the Polars thread pool here, which runs collect in the background and only resolves the future with the GIL acquired. |
Hmm.. Is there a way I can verify this? The
I see that you are spawning the pool in Rust. I am not sure about PyO3 but in my experience, I found that spawning a separate background thread in a C++ extension would still block the event loop as Python still has to constantly poll the job for completion (which is blocking unless you spawn a thread from Python and wrap the future). |
Well, it's a different story then if it's only 50 rows. Care to provide data to reproduce? Theoretically, it shouldn't block the event loop, because it's only resolving a future by calling this callback here at the end of execution of .collect on the Rust side, |
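To make the "resolve a future from a callback" idea concrete, here is a minimal stdlib-only sketch of the general pattern. It is not Polars' actual implementation: `heavy_work` and `collect_in_background` are hypothetical names, and `time.sleep` stands in for native work that runs off the event loop thread.

```python
import asyncio
import threading
import time


def heavy_work() -> int:
    # Hypothetical stand-in for a native collect running off the loop thread.
    time.sleep(0.2)
    return 42


def collect_in_background(loop: asyncio.AbstractEventLoop) -> asyncio.Future:
    future = loop.create_future()

    def worker() -> None:
        result = heavy_work()
        # asyncio futures are not thread-safe, so the result is handed
        # back to the loop thread instead of being set directly here.
        loop.call_soon_threadsafe(future.set_result, result)

    threading.Thread(target=worker, daemon=True).start()
    return future


async def main() -> tuple[int, int]:
    loop = asyncio.get_running_loop()
    future = collect_in_background(loop)
    ticks = 0
    # The loop stays free: this counter keeps advancing while the worker runs.
    while not future.done():
        ticks += 1
        await asyncio.sleep(0.01)
    return await future, ticks


result, ticks = asyncio.run(main())
print(result, ticks)
```

Nothing in this pattern polls from the loop thread; the loop only learns about the result when the scheduled callback runs. The explicit `while` loop is there purely to show that other work proceeds in the meantime.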
I will get the data for you tomorrow.
Not holding the GIL is one thing, but it still needs to run in a separate thread for it to not block the event loop. If resolving the future involves some kind of polling of the results, the event loop will still be blocked. |
Internally it may poll on the results, but if it were to poll by blocking the whole loop, that would defeat the whole purpose of a Future. |
You can try to create the LazyFrames on separate lines and time only that (or even just print-debug), because init is sync. |
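One way to check where the blocking happens is to time the synchronous step on its own while a small monitor task measures how late the event loop wakes it up. The sketch below uses only the standard library; `build_query_sync` is a hypothetical stand-in for the synchronous LazyFrame construction, simulated here with `time.sleep`.

```python
import asyncio
import time


def build_query_sync() -> str:
    # Hypothetical stand-in for synchronous LazyFrame construction.
    time.sleep(0.2)
    return "lazy-plan"


async def max_loop_lag(stop: asyncio.Event, interval: float = 0.01) -> float:
    # Sleep for `interval` repeatedly and record the worst overshoot.
    worst = 0.0
    while not stop.is_set():
        start = time.perf_counter()
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - start - interval)
    return worst


async def main() -> tuple[float, float]:
    stop = asyncio.Event()
    monitor = asyncio.create_task(max_loop_lag(stop))
    await asyncio.sleep(0.05)  # let the monitor start ticking

    start = time.perf_counter()
    build_query_sync()  # runs on the loop thread, so the monitor stalls
    build_seconds = time.perf_counter() - start

    await asyncio.sleep(0.05)  # give the monitor a chance to record the stall
    stop.set()
    return build_seconds, await monitor


build_seconds, lag = asyncio.run(main())
print(f"build took {build_seconds:.3f}s, worst loop lag {lag:.3f}s")
```

If the reported lag is close to the build time, the synchronous step is what holds up the loop, rather than the awaited call that follows it.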
You are right. LazyFrame's creation is indeed the one responsible for blocking the event loop. I took your advice and tried using
def get_intersection(transcription: IO[bytes], diarisation: IO[bytes]) -> Awaitable[DataFrame]:
    intersection_expression = min_horizontal(
        'end_time',
        'end_time_right',
    ) - max_horizontal(
        'start_time',
        'start_time_right',
    )
    return (
        scan_ndjson(transcription)
        .join(scan_ndjson(diarisation), how='cross')
        .with_columns(intersection_expression.alias('intersection'))
        .collect_async()
    )
|
When is the blocking happening? The output of the function is an awaitable, so it seems like we need more info. Are you doing something simple like this?
df_awaitable = get_intersection(tr, di)
df = await df_awaitable
or more like this?
asyncio.get_running_loop().run_until_complete(df_awaitable)
or something else? |
I am doing the former. I expect Awaitables to completely use the event loop. It defeats the purpose to use |
You still have the issue that you're scanning an IO object, so even though the polars binary would be non-blocking, it's getting its data from Python, which is blocking. Try either making the inputs eager DFs or files. In the case where the input is In the case where the input is So maybe try one of these
def get_intersection(transcription: pl.DataFrame, diarisation: pl.DataFrame) -> Awaitable[DataFrame]:
    transcription = transcription.lazy()
    diarisation = diarisation.lazy()
or
def get_intersection(transcription: Path, diarisation: Path) -> Awaitable[DataFrame]:
    transcription = pl.scan_ndjson(transcription)
    diarisation = pl.scan_ndjson(diarisation)
|
The problem is that For now, |
I remembered that there is #17939; I have not tried it myself, but it looks promising. Then it would be possible to have something, like After some search, since the problem is actual DataFrame creation, this seems to be a duplicate of, or related to, #4351 |
Issue description
In polars, most of the CPU-bound activities happen in Rust, where the Python GIL is dropped. Ideally, collect_async should take advantage of this for polars to maximise CPU usage. As of right now, collect_async will block the main event loop and stop your single-worker server from handling any more requests until the DataFrame is created.
EDIT: DataFrame creation is blocking the event loop. We can fix this by running it in a separate thread.
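A minimal sketch of the separate-thread workaround, using only the standard library. `build_and_collect_sync` is a hypothetical stand-in for the blocking step (for example, scanning Python file objects and collecting eagerly); `time.sleep` simulates it, and the heartbeat task is only there to show that the loop keeps running.

```python
import asyncio
import time


def build_and_collect_sync() -> list[int]:
    # Hypothetical stand-in for the blocking frame creation and collect.
    time.sleep(0.2)
    return [1, 2, 3]


async def heartbeat(stop: asyncio.Event) -> int:
    # Counts how often the event loop gets to run while the work is in flight.
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def main() -> tuple[list[int], int]:
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    # asyncio.to_thread (Python 3.9+) runs the callable in the default
    # thread pool and returns an awaitable for its result.
    rows = await asyncio.to_thread(build_and_collect_sync)
    stop.set()
    return rows, await beat


rows, ticks = asyncio.run(main())
print(rows, ticks)
```

This only keeps the loop responsive if the blocking code releases the GIL for most of its run time, as `time.sleep` does here; pure-Python work in the thread would still compete with the loop for the GIL.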
Expected behavior
collect_async should not block the main event loop and should act as an actual async function that allows Python to perform context switching and process other tasks that drop the GIL concurrently.