-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(rust): introduce async at the top level #6830
Conversation
3a3ef72
to
9aa445c
Compare
I really don't see this working well. All our executors work on complete Maybe we could add it in our push based engine. But I have to think on this for a while. |
I agree that Rayon is a good fit for Polars and provides a lot of performance benefits. I am not suggesting replacing Rayon or changing the way it is used today. I am suggesting replacing the top level thread POOLs, like the one in Union. The unit of parallelization would not change, it would still be the data frame. What async/tokio adds is the ability to chain io-cpu through out the tree of executors. To be more precise in my use case most of the files are on cloud storage, so first the data needs to be downloaded. As it is downloaded from the network Rayon can kick in to decode it, project, filter and so on. I see the interplay between IO and CPU a lot more dynamic: as the downloads proceed and some data arrives early it makes sense to process that data before waiting for additional downloads since the CPU may be idle. Maybe my use case represents a minority and the added complexity doesn't belong in Polars. For successful, complex projects it is ok to say no to certain features. Let me know what you think :) |
Right.. I thought you were aiming to make all the executor API's async. I agree that on the Is there a possibility to let tokio use the rayon thread pool? |
Let me play with this and evaluate any performance impact. We should be able to easily evaluate any performance impact by comparing runs using local paths versus nearly identical runs using the |
Something with many files and much compute? Maybe groupby / agg over many parquet files with the same schema? |
I am working with data on the order of many gigabytes of parquet hosted on s3. Part of our process involves sinking partitions to arrow files on disk. Having to if there's anything I could do to test with this large dataset (up to terabyte+ of compressed parquet, partitioned into ~1gb partitions) please lmk. |
@jgmartin happy to collaborate on this 👍 I am now looking at the proper way to integrate Tokio and Rayon, this looks promising https://github.com/andybarron/tokio-rayon/blob/main/src/async_thread_pool.rs Just to set expectations: in my tests I have 80 parquet files on GCP and fetching their metadata is slow. I get about 0.5 seconds per parquet file. My current thinking is that for thousands of files you will need to centralize the stats for the files so that you fetch all the stats in a small number of operations. See the related comment here #6426 (comment) |
9aa445c
to
f2f8f7b
Compare
Hm, I have been testing on the orders table from TPC-H. With the multi-threaded union the mean duration over 10 runs is 1.810 sec. Switching to Tokio async increases the time to In related news I have been integrating in my other PR. The Tokio-Rayon integration is documented in the Tokio docs. Basically Tokio just want to do async on a small number of threads and the developer must run the CPU heavy code outside of Tokio. Rayon is one of the recommended approaches and using a onshot channel is the recommended integration. Will push some more on this but no easy wins yet.
|
One more weekend invested in learning the ecosystem and the internals of polars. Polars is a complex, highly efficient multi threaded application and the architecture change has been alluded me so far. My current thinking has now changed. Up to this point my attempt has been to:
The problems with this approach are:
My thinking has now shifted to the following approach:
In this separate runtime we use the following approach:
FYI @ritchie46 still pushing on this but still no easy wins. I am hopeful that I will be able to push over the finish line the approach I just described. |
I'm closing this pull request due to inactivity. Feel free to rebase and reopen and continue your work! |
@ritchie46 I have been looking for ways to make progress on the object_store integration. This PR provides more of a high level proof of concept, I kept it simple to make it easier to see the proposed change.
I think that the fundamental problem is that we need the Tokio runtime to be initialize much closer to the root of tree of the
Executors
, at least in order to enableasync
. This will allow all async tasks to be part of one run time and data flow-in the Execution tree as it is being fetched.I propose in this PR a possible way forward:
execute_async
function in theExecutor
thread, we can provide a default implementation for the async method on top of the regularexecute
methodWhat do you think about this approach? With something like this we can make more progress integrating cloud storage in Polars.