Core: Add TableUtil to provide access to a table's format version #11620

Open
wants to merge 3 commits into
base: main
from

Contributor

@nastra nastra commented Nov 21, 2024

This is an alternative impl to #11587


// being able to read the format version from the PositionDeletesTable is mainly needed in
// SparkPositionDeletesRewrite when determining whether to rewrite V2 position deletes to DVs
if (table instanceof BaseMetadataTable) {
Member

I wish we had Scala or even Kotlin

Member

One of the reasons I want Scala is that I really want to just enumerate cases here. I would recommend we just go through all of our cases, narrower then broader, if we have exceptions, so:

if (PositionDeleteTable) {
  return format version
} else if (MetaTable) {
  // Sorry Brah
} else if (HasTableOperations) {
  return format version
} else {
  // Sorry Brah
}

Although I am curious why position delete table needs a format version?

Contributor Author

Although I am curious why position delete table needs a format version?

@RussellSpitzer this is mainly for SparkPositionDeletesRewrite (which operates against the position deletes table). Basically, when rewriting existing position deletes, we need to know whether to rewrite them to V2 position deletes or to DVs, and we decide that by looking at the format version of the underlying table. A table that was upgraded to V3 can still have V2 position deletes, which would then have to be rewritten as DVs.
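
To make the decision described above concrete, here is a minimal sketch of how a rewrite could pick its output format from the table's format version. The class, enum, and method names are hypothetical and do not come from Iceberg, and the `>= 3` threshold is an assumption based only on the statement that a V3 table should end up with DVs.

```java
// Hypothetical sketch: none of these names exist in Iceberg.
final class DeleteFormatSketch {
  enum DeleteFormat {
    POSITION_DELETE_FILE, // V2-style position delete files
    DELETION_VECTOR // DVs
  }

  // Assumption: tables at format version 3 or above should receive DVs,
  // while older tables keep V2-style position delete files.
  static DeleteFormat rewriteTarget(int tableFormatVersion) {
    if (tableFormatVersion < 1) {
      throw new IllegalArgumentException("Invalid format version: " + tableFormatVersion);
    }
    return tableFormatVersion >= 3
        ? DeleteFormat.DELETION_VECTOR
        : DeleteFormat.POSITION_DELETE_FILE;
  }
}
```

The point is only that the decision depends on the version of the underlying table, not on the format of the delete files that happen to exist in it.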

Member

Shouldn't we just be looking at the underlying table then? Shouldn't the converter look at the base table rather than the metadata table?

I.e.

formatVersion(metadataTable.baseTable)

Rather than

formatVersion(metadataTable)

Implicitly calling metadataTable.baseTable but only sometimes

Contributor Author

I would recommend we just go through all of our cases, narrower then broader, if we have exceptions, so:

I would prefer that too, but the fact that SerializableMetadataTable is a subclass of SerializableTable, which in turn implements HasTableOperations, makes this more difficult, and you still need to differentiate there.
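
A small sketch of why the check order matters with this hierarchy. The nested types below are simplified stand-ins that only mirror the inheritance relationships named in the comment; they are not the real Iceberg classes, and the `formatVersion()` method on the `HasTableOperations` stand-in is an assumption made to keep the example self-contained.

```java
// Simplified stand-ins that mirror only the inheritance described above.
final class CheckOrderSketch {
  interface Table {}

  // Assumption: the stand-in exposes the version directly.
  interface HasTableOperations {
    int formatVersion();
  }

  static class BaseTable implements Table, HasTableOperations {
    private final int formatVersion;

    BaseTable(int formatVersion) {
      this.formatVersion = formatVersion;
    }

    @Override
    public int formatVersion() {
      return formatVersion;
    }
  }

  static class SerializableTable implements Table, HasTableOperations {
    private final int formatVersion;

    SerializableTable(int formatVersion) {
      this.formatVersion = formatVersion;
    }

    @Override
    public int formatVersion() {
      return formatVersion;
    }
  }

  // A metadata table that inherits HasTableOperations through its parent.
  static class SerializableMetadataTable extends SerializableTable {
    SerializableMetadataTable(int formatVersion) {
      super(formatVersion);
    }
  }

  static int formatVersion(Table table) {
    // The narrower type must be tested first: a SerializableMetadataTable
    // would also match the HasTableOperations branch below.
    if (table instanceof SerializableMetadataTable) {
      throw new IllegalArgumentException("Metadata tables have no format version");
    } else if (table instanceof HasTableOperations) {
      return ((HasTableOperations) table).formatVersion();
    } else {
      throw new IllegalArgumentException("Unsupported table: " + table.getClass().getName());
    }
  }
}
```

If the two `instanceof` branches were swapped, the serialized metadata table would silently be treated as a regular table, which is the differentiation problem raised here.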

Contributor Author

Shouldn't we just be looking at the underlying table then? Shouldn't the converter look at the base table rather than the metadata table?

I'm not fully sure I follow your comment. Do you mean the calling site should first check whether it's a metadata table before calling TableUtil.formatVersion(...)?

Member

Yes. But I also don't think the caller should need to check. Shouldn't the caller know what it's doing? Like if it is compacting DeleteVectors, it knows it has a DeleteVectorMetadataTable and therefore it uses the parent table.
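
One way to read this suggestion, sketched with stand-in types: the utility stays strict and only accepts regular tables, while the caller, which already knows it holds a position deletes table, unwraps the base table itself. The `baseTable()` accessor and every class name here are hypothetical and chosen only to echo the `metadataTable.baseTable` shorthand used earlier in the thread.

```java
// Hypothetical stand-ins; not the real Iceberg types or accessors.
final class CallerUnwrapSketch {
  interface Table {}

  static final class BaseTable implements Table {
    private final int formatVersion;

    BaseTable(int formatVersion) {
      this.formatVersion = formatVersion;
    }

    int formatVersion() {
      return formatVersion;
    }
  }

  static final class PositionDeletesTable implements Table {
    private final BaseTable baseTable;

    PositionDeletesTable(BaseTable baseTable) {
      this.baseTable = baseTable;
    }

    // Assumed accessor for the table this metadata table was created from.
    BaseTable baseTable() {
      return baseTable;
    }
  }

  // Utility: no special case for any metadata table.
  static int formatVersion(Table table) {
    if (table instanceof BaseTable) {
      return ((BaseTable) table).formatVersion();
    }
    throw new IllegalArgumentException("No format version for " + table.getClass().getSimpleName());
  }

  // Caller: it knows it works on a position deletes table, so it unwraps explicitly.
  static int formatVersionForRewrite(PositionDeletesTable deletesTable) {
    return formatVersion(deletesTable.baseTable());
  }
}
```

With this split, the unwrapping is visible at the call site instead of happening implicitly for one metadata table type only.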

Contributor Author

That would require changing a bunch more places, because effectively we have a Broadcast<Table> in SparkPositionDeletesRewrite:

Broadcast<Table> tableBroadcast =
    sparkContext.broadcast(SerializableTableWithSize.copyOf(table));

Member

That is an argument for ease of the current implementation. But do you think it's the right decision going forward to have a format version for the position deletes metadata table and none of the other metadata tables?

Contributor Author

I've simplified the checks a bit. I do think it makes sense to have an easy way to fetch the format version from the PositionDeletesTable, since it basically translates 1:1 to the underlying table, so the format version of the PositionDeletesTable would be the same as that of the underlying table. The question is whether we add this logic to TableUtil or to the calling site(s), and I would like to hear what others think here /cc @amogh-jahagirdar @aokolnychyi

@nastra nastra force-pushed the table-util branch 3 times, most recently from 4515c45 to 8c69d34 Compare November 22, 2024 10:03
@@ -143,6 +143,10 @@ protected Table newTable(TableOperations ops, String tableName) {
    return new BaseTable(ops, tableName);
  }

  public Table underlyingTable() {
Contributor

@aokolnychyi aokolnychyi Nov 26, 2024

Question: What about capturing the format version as a field in SerializableTable, similar to what we do for the metadata file location? The problem right now is that calling lazyTable() may actually require a request to load the metadata, which is something we would want to ideally avoid. Historically, we kept separate fields for what is considered important information and can be accessed frequently.
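
A minimal sketch of that idea, using stand-in types rather than the real SerializableTable: the format version is copied into a plain field when the serializable copy is created, next to the metadata file location, so reading it never triggers the lazy load. All names below are hypothetical, and the "load" is simulated with a counter instead of an actual metadata request.

```java
import java.io.Serializable;

// Hypothetical stand-ins; the real SerializableTable is more involved.
final class EagerFieldSketch {
  // Stand-in for table metadata that is expensive to load.
  static final class Metadata {
    final String location;
    final int formatVersion;

    Metadata(String location, int formatVersion) {
      this.location = location;
      this.formatVersion = formatVersion;
    }
  }

  static final class SerializableTableStandIn implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String metadataFileLocation;
    private final int formatVersion; // captured eagerly at copy time
    private transient Metadata lazyMetadata; // loaded on demand only
    private transient int loadCount; // counts simulated metadata loads

    private SerializableTableStandIn(String metadataFileLocation, int formatVersion) {
      this.metadataFileLocation = metadataFileLocation;
      this.formatVersion = formatVersion;
    }

    static SerializableTableStandIn copyOf(Metadata metadata) {
      return new SerializableTableStandIn(metadata.location, metadata.formatVersion);
    }

    // Cheap: answers from the captured field without loading anything.
    int formatVersion() {
      return formatVersion;
    }

    String metadataFileLocation() {
      return metadataFileLocation;
    }

    // Expensive path: in a real system this could mean a remote request.
    Metadata lazyMetadata() {
      if (lazyMetadata == null) {
        loadCount++;
        lazyMetadata = new Metadata(metadataFileLocation, formatVersion);
      }
      return lazyMetadata;
    }

    int loadCount() {
      return loadCount;
    }
  }
}
```

The trade-off is one more field to keep in sync when the copy is created, in exchange for a format-version lookup that stays free on executors.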


if (table instanceof SerializableTable) {
  return formatVersion(((SerializableTable) table).underlyingTable());
} else if (table instanceof PositionDeletesTable) {
Contributor

I feel it is a bit weird to make this exception for PositionDeletesTable in core, given that we add it exclusively because of how the maintenance is written in Spark. I'd either return the underlying table format version for all metadata tables (which is pretty questionable, as they don't really have a format version) or unwrap the underlying table in Spark. For instance, we already have a class check in SparkTable. We can do something similar in the maintenance procedure.

Contributor Author

@RussellSpitzer had more or less the same feedback on this, so let's remove this handling for PositionDeletesTable from core and do it in the maintenance procedure

4 participants