Merge update+insert truncates a delta table if the table is big enough #2362
Comments
@t1g0rz Thanks for the report.
@Blajda, I have tested the snippet above, and it works fine on
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
from deltalake import DeltaTable

# Create an empty table partitioned by some_part.
dt = DeltaTable.create(
    "./test1",
    schema=pa.schema(
        [
            pa.field("ts", pa.timestamp("us"), nullable=False),
            pa.field("some_data", pa.float64(), nullable=True),
            pa.field("some_part", pa.string(), nullable=True),
        ]
    ),
    partition_by=["some_part"],
)
# First batch: five hourly rows starting at 2023-01-01 00:00.
df = pd.DataFrame(
    {
        "ts": pd.date_range("2023-01-01", freq="1h", periods=5),
        "some_data": np.random.random(5),
        "some_part": np.random.choice(["A", "B"], 5),
    }
)
dt = DeltaTable("./test1")
# Upsert on ts, restricted to target rows at or after the earliest source timestamp.
dt.merge(
    df,
    predicate=f"t.ts::Timestamp >= '{df.ts.min()}'::Timestamp and s.ts = t.ts",
    source_alias="s",
    target_alias="t",
).when_matched_update_all().when_not_matched_insert_all().execute()
"""
{'num_source_rows': 5,
'num_target_rows_inserted': 5,
'num_target_rows_updated': 0,
'num_target_rows_deleted': 0,
'num_target_rows_copied': 0,
'num_output_rows': 5,
'num_target_files_added': 2,
'num_target_files_removed': 0,
'execution_time_ms': 11,
'scan_time_ms': 0,
'rewrite_time_ms': 1}
"""
print(pl.from_dataframe(df))
"""
┌─────────────────────┬───────────┬───────────┐
│ ts ┆ some_data ┆ some_part │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ f64 ┆ str │
╞═════════════════════╪═══════════╪═══════════╡
│ 2023-01-01 00:00:00 ┆ 0.031419 ┆ B │
│ 2023-01-01 01:00:00 ┆ 0.508243 ┆ A │
│ 2023-01-01 02:00:00 ┆ 0.260409 ┆ A │
│ 2023-01-01 03:00:00 ┆ 0.996127 ┆ B │
│ 2023-01-01 04:00:00 ┆ 0.774423 ┆ B │
└─────────────────────┴───────────┴───────────┘
"""
print(pl.read_delta('./test1').sort('ts'))
"""
┌─────────────────────┬───────────┬───────────┐
│ ts ┆ some_data ┆ some_part │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ str │
╞═════════════════════╪═══════════╪═══════════╡
│ 2023-01-01 00:00:00 ┆ 0.031419 ┆ B │
│ 2023-01-01 01:00:00 ┆ 0.508243 ┆ A │
│ 2023-01-01 02:00:00 ┆ 0.260409 ┆ A │
│ 2023-01-01 03:00:00 ┆ 0.996127 ┆ B │
│ 2023-01-01 04:00:00 ┆ 0.774423 ┆ B │
└─────────────────────┴───────────┴───────────┘
"""
dt = DeltaTable("./test1")
df = pd.DataFrame(
    {
        "ts": pd.date_range("2023-01-01 1:00:00", freq="1h", periods=5),
        "some_data": np.random.random(5),
        "some_part": np.random.choice(["A", "B"], 5),
    }
)
dt.merge(
    df,
    predicate=f"t.ts::Timestamp >= '{df.ts.min()}'::Timestamp and s.ts = t.ts",
    source_alias="s",
    target_alias="t",
).when_matched_update_all().when_not_matched_insert_all().execute()
"""
{'num_source_rows': 5,
'num_target_rows_inserted': 1,
'num_target_rows_updated': 4,
'num_target_rows_deleted': 0,
'num_target_rows_copied': 0,
'num_output_rows': 5,
'num_target_files_added': 3,
'num_target_files_removed': 1,
'execution_time_ms': 11,
'scan_time_ms': 0,
'rewrite_time_ms': 2}
"""
print(pl.from_dataframe(df))
"""
┌─────────────────────┬───────────┬───────────┐
│ ts ┆ some_data ┆ some_part │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ f64 ┆ str │
╞═════════════════════╪═══════════╪═══════════╡
│ 2023-01-01 01:00:00 ┆ 0.374384 ┆ A │
│ 2023-01-01 02:00:00 ┆ 0.479215 ┆ B │
│ 2023-01-01 03:00:00 ┆ 0.658368 ┆ A │
│ 2023-01-01 04:00:00 ┆ 0.698976 ┆ B │
│ 2023-01-01 05:00:00 ┆ 0.913481 ┆ B │
└─────────────────────┴───────────┴───────────┘
"""
print(pl.read_delta('./test1').sort('ts'))
"""
┌─────────────────────┬───────────┬───────────┐
│ ts ┆ some_data ┆ some_part │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ str │
╞═════════════════════╪═══════════╪═══════════╡
│ 2023-01-01 01:00:00 ┆ 0.374384 ┆ A │
│ 2023-01-01 02:00:00 ┆ 0.479215 ┆ B │
│ 2023-01-01 03:00:00 ┆ 0.658368 ┆ A │
│ 2023-01-01 04:00:00 ┆ 0.698976 ┆ B │
│ 2023-01-01 05:00:00 ┆ 0.913481 ┆ B │
└─────────────────────┴───────────┴───────────┘
"""
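One way to sanity-check a run like the one above is to compare the reported merge metrics with the table's row counts. The following is a minimal sketch and not part of deltalake: `expected_row_count` is a hypothetical helper, and it assumes that after an update-plus-insert merge the table should hold its previous rows, plus the inserted rows, minus the deleted rows. The metric keys and values are the ones printed above.

```python
def expected_row_count(rows_before: int, metrics: dict) -> int:
    """Row count the table should have after a merge, per the reported metrics.

    Assumption: updates replace rows in place, so only inserts and deletes
    change the total.
    """
    return (
        rows_before
        + metrics["num_target_rows_inserted"]
        - metrics["num_target_rows_deleted"]
    )


# Metrics copied from the second merge above.
second_merge = {
    "num_source_rows": 5,
    "num_target_rows_inserted": 1,
    "num_target_rows_updated": 4,
    "num_target_rows_deleted": 0,
    "num_target_rows_copied": 0,
    "num_output_rows": 5,
}

rows_before = 5  # rows in the table after the first merge
rows_after = 5   # rows printed by pl.read_delta after the second merge

expected = expected_row_count(rows_before, second_merge)
print(f"expected {expected} rows, table has {rows_after}")
```

With the numbers shown above the helper gives 6, while the table printed after the second merge has 5 rows and no longer lists 2023-01-01 00:00:00. Against a real table, the two counts could be taken from `dt.to_pyarrow_table().num_rows` before and after the merge.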
The root cause is the filter My previous comment about
# Description
Delta scan pushes filters down to the Parquet scan when possible. This adds a new configuration for the special case where an operation needs to work on entire files but still wants to perform pruning.
# Related Issue(s)
- fixes #2362
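The difference between file pruning and row-level filter pushdown can be illustrated with a toy model. This is only a sketch under stated assumptions and does not reflect delta-rs internals: files are plain lists of dicts, `rewrite_files` and its `pushdown` flag are hypothetical names, and a file is "rewritten" with whatever rows the scan returned for it.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
DataFile = List[Row]


def rewrite_files(
    files: List[DataFile],
    predicate: Callable[[Row], bool],
    pushdown: bool,
) -> List[DataFile]:
    """Toy model: rewrite every file that contains at least one matching row.

    pushdown=True  -> the row filter is applied while reading, so only
                      matching rows reach the writer.
    pushdown=False -> the predicate is used for file pruning only, and each
                      selected file is read in full.
    """
    result = []
    for rows in files:
        if not any(predicate(r) for r in rows):
            result.append(rows)  # pruned: the file is left untouched
            continue
        scanned = [r for r in rows if predicate(r)] if pushdown else list(rows)
        result.append(scanned)  # the rewritten file holds only what was scanned
    return result


# Two files; the row with ts=0 does not satisfy the predicate.
files = [
    [{"ts": 0}, {"ts": 3}, {"ts": 4}],
    [{"ts": 1}, {"ts": 2}],
]
matches = lambda row: row["ts"] >= 1

with_pushdown = rewrite_files(files, matches, pushdown=True)
without_pushdown = rewrite_files(files, matches, pushdown=False)

rows_with = sum(len(f) for f in with_pushdown)
rows_without = sum(len(f) for f in without_pushdown)
print(rows_with, rows_without)
```

In this model the row with `ts=0` disappears when the filter is pushed into the read, because the rewritten file never sees it, and it survives when the predicate is used only to choose which files to read.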
@t1g0rz Let me know if any of your workflows are having issues with the latest merge. If they are, I'll reopen the issue.
Environment
Delta-rs version: 0.16.3
Binding: python
OS: ubuntu22.04.1
Bug
What happened:
#2320 still persists in 0.16.3, but in a subtler form. Apparently, if the Delta table contains several Parquet files, the merge can remove some of them for no apparent reason.
What you expected to happen:
Updates and inserts to occur according to the predicate.
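For reference, the expected behavior can be written as a plain-Python upsert keyed on `ts`. This is a minimal sketch assuming standard upsert semantics (matched rows are replaced by the source row, unmatched source rows are inserted, unmatched target rows are kept); `reference_upsert` is a hypothetical helper, and the rows are simplified stand-ins rather than real table data.

```python
from typing import Dict, List

Row = Dict[str, object]


def reference_upsert(target: List[Row], source: List[Row], key: str = "ts") -> List[Row]:
    """Return the rows an update-all / insert-all merge on `key` should produce."""
    merged = {row[key]: row for row in target}  # start with every target row
    for row in source:
        merged[row[key]] = row  # update on match, insert otherwise
    return [merged[k] for k in sorted(merged)]


# Target holds hours 00..04, source holds hours 01..05 (hypothetical values).
target = [{"ts": f"2023-01-01 0{h}:00:00", "some_data": "old"} for h in range(0, 5)]
source = [{"ts": f"2023-01-01 0{h}:00:00", "some_data": "new"} for h in range(1, 6)]

result = reference_upsert(target, source)
for row in result:
    print(row["ts"], row["some_data"])
```

Under these semantics the result has six rows: hour 00 keeps its old value, hours 01 to 04 are updated, and hour 05 is inserted. An additional lower bound on `t.ts`, such as the one in the snippet from the comments, does not change which rows match when it is derived from the minimum source timestamp.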
How to reproduce it:
Here is the output: