Support deletion vector #1094
Comments
This looks like a tradeoff between faster reads and faster writes that needs to be decided case by case. If so, it might be better to let the user decide based on the expected workload pattern. |
+1 to supporting a user-owned tradeoff decision. I'm investigating this feature internally, and the update patterns of individual tables likely dictate the right decision. For instance, in many dimension tables, edits may be spread randomly through existing data, and merge-on-read will be more efficient. For fact tables with a mostly append pattern (but occasional fact updates), judicious partitioning plus copy-on-write may be superior. |
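One way to picture the user-owned choice discussed above is a per-table setting that the delete path consults. The sketch below is purely illustrative: `DeleteMode`, `DeletePlan`, and `plan_delete` are hypothetical names and not part of delta-rs, and the file names are just strings standing in for the data files a delete would touch.

```rust
/// Hypothetical per-table setting; not an actual delta-rs option.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DeleteMode {
    /// Rewrite every data file that contains a deleted row.
    CopyOnWrite,
    /// Keep data files untouched and record deleted row positions
    /// in a deletion vector that readers apply later.
    MergeOnRead,
}

/// What the writer would do for the files touched by a delete.
#[derive(Debug, PartialEq)]
enum DeletePlan {
    RewriteFiles(Vec<String>),
    WriteDeletionVectors(Vec<String>),
}

/// Picks the plan purely from the user's setting; no heuristics involved.
fn plan_delete(mode: DeleteMode, touched_files: Vec<String>) -> DeletePlan {
    match mode {
        DeleteMode::CopyOnWrite => DeletePlan::RewriteFiles(touched_files),
        DeleteMode::MergeOnRead => DeletePlan::WriteDeletionVectors(touched_files),
    }
}
```

A dimension table with scattered edits would be configured with `DeleteMode::MergeOnRead`, while a well-partitioned fact table could stay on `DeleteMode::CopyOnWrite`.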
Don't know if this helps, but I just tried to read a deletion vector file, and this seems to work with the roaring crate:

    use std::fs::File;
    use std::io::{ErrorKind, Read};

    use roaring::RoaringTreemap;

    /// Magic number that prefixes every serialized deletion vector (little-endian).
    const DV_MAGIC: u32 = 1681511377;

    fn get_deletion_vectors(
        filename: &str,
    ) -> Result<Vec<RoaringTreemap>, Box<dyn std::error::Error + Send + Sync>> {
        let mut file = File::open(filename)?;

        // The file starts with a one-byte format version, which must be 1.
        let mut version = [0u8; 1];
        file.read_exact(&mut version)?;
        assert_eq!(version[0], 1);

        let mut vectors = Vec::new();
        loop {
            // Each entry starts with a 4-byte big-endian data size.
            // Hitting end-of-file here means all entries have been read.
            let mut size_buf = [0u8; 4];
            match file.read_exact(&mut size_buf) {
                Ok(()) => {}
                Err(e) if e.kind() == ErrorKind::UnexpectedEof => return Ok(vectors),
                Err(e) => return Err(e.into()),
            }
            let data_size = u32::from_be_bytes(size_buf) as usize;

            // Read the whole entry up front so the file position stays aligned
            // no matter how many bytes the roaring deserializer consumes.
            let mut data = vec![0u8; data_size];
            file.read_exact(&mut data)?;

            // A 4-byte checksum follows the data; it is read but not verified here.
            let mut checksum = [0u8; 4];
            file.read_exact(&mut checksum)?;

            // Empty entries carry no bitmap, so there is nothing to deserialize.
            if data_size == 0 {
                continue;
            }
            if data.len() < 4 {
                return Err("deletion vector entry is too short to hold the magic number".into());
            }

            // The data begins with the little-endian magic number, followed by the bitmap.
            let magic = u32::from_le_bytes(data[..4].try_into()?);
            assert_eq!(magic, DV_MAGIC);
            vectors.push(RoaringTreemap::deserialize_from(&data[4..])?);
        }
    } |
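The snippet above reads each checksum without verifying it. As a rough sketch of how verification could look, the following uses only the standard library to split a deletion vector file into its raw entries and check a CRC-32 over each one. It assumes the layout described in PROTOCOL.md (one version byte, then for each entry a 4-byte big-endian size, the data, and a 4-byte checksum) and assumes the checksum is stored big-endian like the size field. `crc32`, `split_dv_frames`, and `build_dv_file` are hypothetical helper names, and `build_dv_file` exists only to produce test input.

```rust
use std::io::{self, ErrorKind, Read};

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Splits a deletion vector file into raw per-entry payloads,
/// rejecting any entry whose stored checksum does not match.
fn split_dv_frames<R: Read>(mut reader: R) -> io::Result<Vec<Vec<u8>>> {
    let mut version = [0u8; 1];
    reader.read_exact(&mut version)?;
    if version[0] != 1 {
        return Err(io::Error::new(ErrorKind::InvalidData, "unsupported version"));
    }

    let mut frames = Vec::new();
    loop {
        // End-of-file while reading the size field ends the loop.
        let mut size = [0u8; 4];
        match reader.read_exact(&mut size) {
            Ok(()) => {}
            Err(e) if e.kind() == ErrorKind::UnexpectedEof => return Ok(frames),
            Err(e) => return Err(e),
        }
        let mut data = vec![0u8; u32::from_be_bytes(size) as usize];
        reader.read_exact(&mut data)?;

        let mut checksum = [0u8; 4];
        reader.read_exact(&mut checksum)?;
        // Assumption: the checksum is big-endian, like the size field.
        if u32::from_be_bytes(checksum) != crc32(&data) {
            return Err(io::Error::new(ErrorKind::InvalidData, "checksum mismatch"));
        }
        frames.push(data);
    }
}

/// Builds an in-memory file with the same layout, for testing only.
fn build_dv_file(payloads: &[&[u8]]) -> Vec<u8> {
    let mut out = vec![1u8];
    for payload in payloads {
        out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
        out.extend_from_slice(payload);
        out.extend_from_slice(&crc32(payload).to_be_bytes());
    }
    out
}
```

Each returned payload still starts with the magic number, so it can be handed to a bitmap deserializer after skipping the first four bytes.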
Would you accept a PR that adds the required metadata as a first step? |
Hi @aersam - first of all, thanks for the code snippet; it actually saved me a bit of time working on this elsewhere. In principle we always welcome contributions, and we do in this case too, but there is one caveat. Elsewhere we are currently working hard on getting delta-kernel for Rust released, which will hopefully significantly boost our protocol support. The more complex part is that, in order to support deletion vectors, we have to either support reader v3 and writer v7 (i.e. table features) or support a whole bunch of other Delta features as well. The good news is that we are actively working on it, but since this involves some larger blocks of work, it's likely going to be a few weeks before this can fully land... With all that said, if you would benefit from having some intermediate partial support, I'd be happy to review PRs :) |
Well, if it's a matter of weeks I can wait. I know that column mapping would actually come first; I just thought that it cannot be that hard ;) I did not know about delta-kernel for Rust, and I'm really glad to hear about it! To be honest, I was a bit disappointed as I thought it would be in Java - nothing against Java, but I much prefer Rust, especially for embedding. Where do I find the code for delta-kernel/rust? Just to observe it a bit. Btw, I also corrected the snippet; it had a bug when there are multiple vectors within a file. |
@roeap where can one follow the Delta kernel initiatives? I saw delta-io/delta#1783, but that's not Rust-specific, right? Will it happen in this repo, or will there be a delta-kernel-rs? |
Trying to get the metadata running here: https://github.com/bmsuisse/delta-rs/tree/deletion_vector_meta |
# Description
This just adds the deletion vector metadata to the actions. It does not interpret it yet; reading / writing deletion vectors is not supported with this. Still, it enables use cases where you use delta-rs just for metadata retrieval.
I have to add that I'm still learning Rust, and I expect this to take some iterations until the code quality is sufficient.
# Related Issue(s)
Part of #1094: adds the required metadata
# Documentation
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors
---------
Co-authored-by: Will Jones <[email protected]>
FWIW, Fabric Data Warehouse just added support for deletion vectors, and suddenly the Delta tables it produces are no longer compatible with delta-rs :( |
Is this feature still on the roadmap? Tables produced by recent Databricks runtimes include deletion vectors by default, so it seems to me that reading them through Rust-based solutions like polars is currently not possible natively. |
Running into the same issue: the latest Databricks runtimes have deletion vectors enabled by default, and our admin won't turn them off. This breaks our Python code that reads with DeltaTable or polars. |
As a temporary workaround, DuckDB does support reading Delta tables with deletion vectors using its delta extension, which is based on delta-kernel rather than delta-rs. |
I am writing to add my support for this enhancement request. Databricks now enables deletion vectors by default when creating a new table using a SQL warehouse or Databricks Runtime 14.1 or above. Interacting with these tables (both reading and writing records) using polars, duckdb, etc. is not possible at the moment due to those libraries' reliance on delta-rs. Is this still on the roadmap? Or is anyone aware of a workaround? |
Description
For protocol version 3, we will want to support deletion vectors.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors
Question: how do we decide whether to rewrite the data files or use a deletion vector?
Use Case
This enables much faster deletes.
Related Issue(s)
Prerequisites: