Deleting large number of records fails with no error message #2798

Closed
jpambrun-vida opened this issue Aug 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jpambrun-vida

Environment

Delta-rs version: 0.19.0

Binding: python

Environment:

  • Cloud provider: na
  • OS: linux
  • Other:

Bug

What happened:

Calling delete() with a long predicate fails without any error message. The exit code is 1.

What you expected to happen:

Given that we need to delete in a single command to avoid creating many table versions, I expect delete() to accept an arbitrarily long query.
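
For context, a hypothetical sketch (not from the report itself) of why a single predicate matters: each delete() call commits a new table version, so deleting row-by-row would bloat the table history.

from deltalake import DeltaTable

dt = DeltaTable("./deltatable")
# Deleting in a loop would commit ~1,600 separate table versions:
for i in range(0, 16000, 10):
    dt.delete(f"x = '{i}'")  # each call is its own commit
# ...hence the need for one delete() call with an arbitrarily long predicate.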

How to reproduce it:

In this example, I create a table of 100,000 rows. Think of x as an ID; I then want to delete 10% of the rows by ID.
It seems to fail when trying to delete ~1,600 items.

from deltalake import DeltaTable, write_deltalake
import pandas as pd
import shutil

print("deleting table")
shutil.rmtree("./deltatable", ignore_errors=True)  # don't fail if the table doesn't exist yet


print("creating table")
df = pd.DataFrame({'x': [f"{i}" for i in range(100000)]})
write_deltalake('./deltatable', df)


print("deleting rows")
dt = DeltaTable("./deltatable")
# ~1,600 equality clauses joined with "or"; fails at 16000, works at 15000
query = ' or '.join([f"x = '{i}'" for i in range(16000) if i % 10 == 0])
del_metrics = dt.delete(query)
print(del_metrics)


_dt = DeltaTable("./deltatable")
print(_dt.history())

More details:

jpambrun-vida added the bug label Aug 19, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Aug 19, 2024

You are running OOM; at least when I ran it, that's what happened.

If you rewrite your SQL delete statement to x in (0, 10, 20, ...) it actually runs fine ;)

Use this instead:

query = f"x in {str(tuple([i for i in range(16000) if i % 10 == 0]))}"
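
A minimal sketch of that workaround applied end-to-end. Quoting each value is my addition, since the x column was written as strings; the unquoted integer tuple above is reported to work as well:

from deltalake import DeltaTable

dt = DeltaTable("./deltatable")
# One IN predicate instead of ~1,600 OR clauses. Values are quoted here
# because the x column holds strings (an assumption on my part).
ids = ", ".join(f"'{i}'" for i in range(0, 16000, 10))
del_metrics = dt.delete(f"x in ({ids})")
print(del_metrics)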

ion-elgreco closed this as not planned Aug 19, 2024
@jpambrun-vida
Author

Thanks. Your formulation is indeed working, even up to deleting 10% of 15,000,000 records (🤯) before memory becomes an issue. It's a bit hard to understand why it can do 10,000x more like that, but I'll take it.

@ion-elgreco
Collaborator

@jpambrun-vida it could be that DataFusion doesn't optimize a long x = '5' or x = '10' or x = '15' chain into a better query that can be pushed down.
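
To make the comparison concrete, here are the two logically equivalent predicate shapes being discussed (a sketch, not a claim about DataFusion's planner internals):

# Form that ran out of memory: ~1,600 separate equality tests joined with "or"
or_chain = " or ".join(f"x = '{i}'" for i in range(0, 16000, 10))

# Same row set as a single IN list, which completed fine
in_list = "x in ({})".format(", ".join(f"'{i}'" for i in range(0, 16000, 10)))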
