Unexpected high costs on Google Cloud Storage #2085
Comments
I wouldn't have the slightest clue what class B operations even are; I don't use GCP myself. If you can break it down for non-GCP users, that would help.
In order to write a delta table, we also always need to know the latest state of the table. As such, every write also requires us to read all relevant log files at least once, and usually there are one or more list operations as well. Are you creating checkpoints? If not, we have to read one commit file for every transaction that was ever created on the table, which can become very sizeable. We have a PR in flight that will allow us to be more economical in terms of reads, especially in append-only scenarios, where we can disregard a lot of the log - again, given there are checkpoints.
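To make that read pattern concrete, here is a rough sketch (not from this issue) of counting the commit files that have to be fetched to reconstruct the table state when no checkpoint exists. It assumes the gcsfs library and uses a made-up bucket/table path:

```python
import gcsfs

# Placeholder path; substitute your own bucket and table location.
fs = gcsfs.GCSFileSystem()
commits = [p for p in fs.ls("my-bucket/my-table/_delta_log") if p.endswith(".json")]

# Without a checkpoint, each of these JSON commit files is fetched (a class B
# operation on GCS) every time the latest table state has to be reconstructed.
print(f"{len(commits)} commit files to read per state reconstruction")
```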
@ion-elgreco mainly, class B operations are for reading objects from Google Cloud Storage. @roeap I'm not sure about checkpoints; I haven't defined any myself, so if they aren't created automatically, there are none. Note: I later changed the implementation to simply adding new parquet files, as I figured out I don't really need the functionality of Delta Lake. I just wanted to point this out in case anyone else has a similar problem, especially since it can incur high unexpected costs on cloud providers.
I fixed this by adding a regular checkpoint creation function, which reduces the number of file operations. The PR around auto checkpoints is #913.
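For reference, the periodic checkpointing described above could look roughly like the sketch below. It assumes a deltalake version that exposes DeltaTable.create_checkpoint(); the table URI and the every-100-commits interval are placeholders:

```python
from deltalake import DeltaTable

def maybe_checkpoint(table_uri: str, every_n_commits: int = 100) -> None:
    """Create a checkpoint once enough commits have accumulated."""
    dt = DeltaTable(table_uri)
    if dt.version() > 0 and dt.version() % every_n_commits == 0:
        # A checkpoint collapses the JSON commit history into a parquet file,
        # so later reads and writes no longer fetch every individual commit.
        dt.create_checkpoint()

# Call after each batch of appends, e.g.:
# maybe_checkpoint("gs://my-bucket/my-table")
```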
I'm going to close this; I don't believe there is anything actionable for the delta-rs project here.
Environment
Delta-rs version: 0.10.2
Environment:
Bug
What happened:
Not sure if this is a bug, but it was recommended on Stack Overflow that I post this issue here: https://stackoverflow.com/questions/77639348/delta-rs-package-incurs-high-costs-on-gcs/77681169#77681169.
I'm using the package to store files in a Google Cloud Storage dual-region bucket. I use the following code to store the data:
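(The original snippet was not captured in this thread. Below is a minimal sketch of the kind of write loop described here and in the following paragraph; the table URI, partition column, and batch generator are placeholders, and GCS credentials are assumed to come from the environment.)

```python
import pyarrow as pa
from deltalake import write_deltalake

TABLE_URI = "gs://my-bucket/my-table"  # placeholder bucket/table path

def batches_from_postgres():
    """Yield pyarrow Tables read from Postgres in batches (placeholder)."""
    yield pa.table({"partition_date": ["2023-12-01"], "value": [1]})

for batch in batches_from_postgres():
    # Every append has to read the Delta log to learn the latest table version,
    # which is where the unexpected class B (read) operations accumulate when
    # no checkpoints exist.
    write_deltalake(
        TABLE_URI,
        batch,
        mode="append",
        partition_by=["partition_date"],  # placeholder partition column
    )
    # The per-partition SUCCESS marker mentioned below is written separately
    # and is omitted from this sketch.
```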
The input data is a generator since I'm taking the data from a Postgres database in batches. I am saving similar data into two different tables and I'm also saving a SUCCESS file for each uploaded partition.
I have around 25,000 partitions and most of them only have a single parquet file in them. The total number of rows that I've inserted is around 700,000,000. This incurred the following costs:
Class A operations: 127,000.
Class B operations: 109,856,507.
Download Worldwide Destinations: 300 GiB.
The number of class A operations makes sense to me when accounting for two writes per partition plus an additional SUCCESS file (these are all inserts). Some partitions probably have more than one file, so the number is somewhat higher than 25,000 partitions x 3 = 75,000.
I can't figure out where so many class B operations and so much Download Worldwide Destinations traffic come from. Is this to be expected, or could it be a bug?
Can you provide any insights into why the costs are so high and how I would need to change the code to decrease them?
What you expected to happen:
Much lower costs for Class B operations on GCS.