Extend in place! 💯 #2544

Merged
merged 4 commits into from
Feb 4, 2022

Conversation

ritchie46
Member

@ritchie46 ritchie46 commented Feb 4, 2022

This PR is quite a big deal. Until now polars/arrow memory was completely immutable. If we did an append, we simply added an array chunk to the list of chunks (sort of a linked list). This yielded very fast appends, but was detrimental to query performance, because the chunks add a lot of indirection.
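
To make the difference concrete, here is a minimal sketch of a chunked column with both strategies. `ChunkedColumn` and its methods are hypothetical toy names used for illustration only, not the actual Polars or arrow2 types; the sketch assumes a column is simply a list of `Vec<i64>` chunks.

```rust
/// Toy stand-in for a chunked array: a list of independently allocated chunks.
/// (Hypothetical type for illustration, not the Polars implementation.)
#[derive(Debug, Default)]
pub struct ChunkedColumn {
    chunks: Vec<Vec<i64>>,
}

impl ChunkedColumn {
    pub fn from_vec(values: Vec<i64>) -> Self {
        Self { chunks: vec![values] }
    }

    /// Append as described above: existing memory is never touched,
    /// the new values simply become one more chunk.
    pub fn append(&mut self, values: &[i64]) {
        self.chunks.push(values.to_vec());
    }

    /// Extend in place: write the new values into the last chunk's allocation.
    pub fn extend(&mut self, values: &[i64]) {
        match self.chunks.last_mut() {
            Some(last) => last.extend_from_slice(values),
            None => self.chunks.push(values.to_vec()),
        }
    }

    /// Number of chunks a query has to walk through.
    pub fn n_chunks(&self) -> usize {
        self.chunks.len()
    }

    /// Total number of values across all chunks.
    pub fn len(&self) -> usize {
        self.chunks.iter().map(Vec::len).sum()
    }
}
```

With this toy type, every `append` call increases `n_chunks()` by one, while repeated `extend` calls keep a single contiguous chunk.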

This hurt most in use cases where rows come in at a very slow pace and you want to run queries between the updates, for instance online learning. There Polars was not the right tool for the job, as you would need to call a rechunk to get optimal performance, which is a complete reallocation of your table! Very expensive.

With this change, we can now extend the DataFrame/Series and write to the same memory allocations. This might still reallocate, but given exponential growth strategies, this operation is amortized O(1).
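
A rough way to see the amortization is to count how often a growable buffer changes capacity while values are pushed one at a time. The sketch below uses a plain `Vec<u64>` as a stand-in for the underlying buffer; `count_growth_events` is a hypothetical helper, and the exact growth factor of `Vec` is an implementation detail of the Rust standard library, which only documents that `push` is amortized O(1).

```rust
/// Push `n` values one by one and count how many times the capacity changed,
/// i.e. how many times the buffer had to grow.
/// (Hypothetical helper; `Vec` stands in for the column's buffer.)
pub fn count_growth_events(n: usize) -> usize {
    let mut buf: Vec<u64> = Vec::new();
    let mut capacity = buf.capacity();
    let mut growth_events = 0;

    for i in 0..n {
        buf.push(i as u64);
        if buf.capacity() != capacity {
            // The buffer grew; with geometric growth this happens rarely.
            growth_events += 1;
            capacity = buf.capacity();
        }
    }
    growth_events
}
```

With geometric growth the number of growth events is logarithmic in the number of pushes, so the copying cost spread over all pushes stays constant per element.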

There is of course no magic. We can only write to the same memory iff

  • we are the only owner (The series are not shared with another dataframe)
  • the memory is not allocated by pyarrow
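
The ownership condition can be sketched with a reference-counted buffer: check at runtime whether we hold the only reference and mutate in place only then. This is a simplified illustration, not the arrow2 code. `extend_in_place_or_copy` is a hypothetical function, `Arc<Vec<i64>>` stands in for a shared buffer, the copying fallback is an assumption of this sketch, and memory allocated by a foreign library such as pyarrow is not modelled at all.

```rust
use std::sync::Arc;

/// Extend a reference-counted buffer.
/// Returns `true` if the existing allocation was written to in place,
/// `false` if the buffer was shared and had to be copied first.
/// (Hypothetical function for illustration only.)
pub fn extend_in_place_or_copy(buf: &mut Arc<Vec<i64>>, values: &[i64]) -> bool {
    match Arc::get_mut(buf) {
        // We are the only owner: safe to mutate the existing allocation.
        Some(unique) => {
            unique.extend_from_slice(values);
            true
        }
        // Someone else still holds a reference: leave their data untouched
        // and build a fresh buffer (assumed fallback for this sketch).
        None => {
            let mut copy = Vec::with_capacity(buf.len() + values.len());
            copy.extend_from_slice(buf.as_slice());
            copy.extend_from_slice(values);
            *buf = Arc::new(copy);
            false
        }
    }
}
```

After a copying call the caller owns a fresh buffer, so a subsequent call can write in place again, while the other owner keeps seeing its original data.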

@github-actions github-actions bot added the rust Related to Rust Polars label Feb 4, 2022
@ritchie46
Member Author

@jorgecarleitao @houqp FYI

@ritchie46 ritchie46 merged this pull request into master Feb 4, 2022
@ritchie46 ritchie46 deleted the extend branch February 4, 2022 15:28
@houqp

houqp commented Feb 5, 2022

very cool!

@jorgecarleitao
Collaborator

jorgecarleitao commented Feb 5, 2022

Brutal. Always innovating!

fyi @wesm @pitrou @kou @andygrove. This uses copy on write - it checks at runtime whether we are the only owners of the array and, if yes, we take exclusive mutable ownership of the buffer / array.

This was proposed by @sundy-li here jorgecarleitao/arrow2#741, inspired by what clickhouse is doing and implemented by @ritchie46 here: jorgecarleitao/arrow2#794

Labels
python Related to Python Polars rust Related to Rust Polars
3 participants