
[VARIANT] Accept variantType RFC #4096

Open · wants to merge 2 commits into base: master
Conversation


@richardc-db (Contributor) commented Jan 28, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Moves the variant type RFC into the accepted folder and inlines it into PROTOCOL.md.

Variant has been used in production systems for many months now, and has also been added as a logical type in parquet. Additionally, Spark has a robust implementation of the variant type.

This PR also specifies that variant shredding will be a separate table feature, and removes the statement that struct fields whose names start with `_` are ignored.

How was this patch tested?

Manually tested that links work.

Does this PR introduce any user-facing changes?

@scovich (Collaborator) left a comment

Is it safe to finalize this RFC when the parquet spec is not finalized yet?

Important
This specification is still under active development, and has not been formally adopted.

I guess this RFC actually references the Spark variant spec, not the parquet variant spec -- even though it has a whole section about "Variant data in Parquet"?

Do we need commentary in the Delta spec about how it relates to those two different specifications?

PROTOCOL.md (resolved)

## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.
- must not write additional parquet struct fields.
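To make the required parquet layout concrete, here is a minimal Python sketch of the struct shape described above. The field names `value` and `metadata` come from the quoted requirement; the byte contents are placeholders, not a real Variant binary encoding.

```python
def make_variant_struct(value: bytes, metadata: bytes) -> dict:
    """Model one parquet struct cell for a `variant` column: exactly two
    binary fields, `value` and `metadata`, and nothing else."""
    return {"value": value, "metadata": metadata}

# Placeholder bytes; real values must conform to the Variant binary
# encoding specification referenced above.
cell = make_variant_struct(b"\x00", b"\x01")
assert set(cell) == {"value", "metadata"}
```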
Collaborator


Is there a particular reason we need to forbid extra fields in this specific case?

Normally the Delta spec just says readers should ignore any unknown fields they might encounter, because the lack of a table feature to protect such fields means they do not affect correctness.

Contributor Author


I don't think there's any particular reason. I think we can follow other features, i.e. modify this to say that readers can ignore fields that aren't `metadata` or `value`.

@gene-db or @cashmand, do you see any issues with this? I figure this should be OK - the shredding table feature later will specify the "important" columns for shredding.


One reason to fail is that shredding adds a typed_value column. If a shredding-unaware reader doesn't fail when it encounters a column other than value and metadata, it could incorrectly read value and metadata, which might look valid, but would not contain the full value.
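The hazard described above can be sketched as follows. The `typed_value` field name comes from the parquet shredding proposal, and the reader logic here is hypothetical, not from the spec:

```python
EXPECTED_FIELDS = {"value", "metadata"}

def strict_read(variant_struct: dict):
    """Hypothetical shredding-unaware reader that fails on unknown fields."""
    extra = set(variant_struct) - EXPECTED_FIELDS
    if extra:
        # Failing here avoids silently returning a partial value when part
        # of the data was shredded into e.g. `typed_value`.
        raise ValueError(f"unexpected variant struct fields: {extra}")
    return variant_struct["value"], variant_struct["metadata"]

plain = {"value": b"\x00", "metadata": b"\x01"}
shredded = {"value": None, "metadata": b"\x01", "typed_value": 42}

assert strict_read(plain) == (b"\x00", b"\x01")
```

A lenient reader that simply projected `value` and `metadata` out of `shredded` would return something that looks valid but does not contain the full value.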

@richardc-db (Contributor Author) commented Feb 4, 2025


@cashmand, in this case, the VariantShredding table feature would be enabled on the table though, right? So a shredding-unaware reader won't be able to read the table in the first place.

Maybe I'm missing something...


That's true. Is it possible that a customer could create a Delta table using external parquet files, and we wouldn't know that we need the shredding feature? It seems safer to me for a reader to fail.

# Variant Data Type

This feature enables support for the Variant data type, for storing semi-structured data.
The schema serialization method is described in [Schema Serialization Format](#schema-serialization-format).
Collaborator


It seems a bit odd to specifically mention the schema serialization format here? AFAIK it didn't change with this table feature?

Meanwhile, it seems like we should specifically state that the new data type is called variant, rather than relying on the example below to show it?

Collaborator


Ah -- the schema serialization section adds a new entry for variant, which explains that the type name is variant.
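As a hypothetical illustration of that entry, a top-level variant column would presumably serialize with `"type" : "variant"` in the usual Delta field layout. The column name and metadata below are made up for the example:

```python
import json

# Hypothetical schema-serialization entry for a column of type `variant`,
# following Delta's usual {"name", "type", "nullable", "metadata"} layout.
field = {"name": "v", "type": "variant", "nullable": True, "metadata": {}}
serialized = json.dumps(field)
assert json.loads(serialized)["type"] == "variant"
```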

PROTOCOL.md (resolved)
Comment on lines +1386 to +1389
"elementType" : {
"type" : "variant"
},
"containsNull" : false
Collaborator


According to the variant spec, a variant column could store a JSON null value even if the variant field is non-nullable as in this example.

How is that handled? Does the Delta spec need to say anything about it or is that the query engine's problem, independent of Delta?

Contributor Author


Hmm, yeah, I think this is less a storage-layer issue and more of an engine issue. That should be addressed in the actual binary layout of variant (which is considered in the parquet spec).

The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.

Variant shredding will be introduced in a separate `variantShredding` table feature.
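The table-feature gating discussed in this thread can be sketched as follows. The capability set and check are hypothetical client logic, not part of the spec:

```python
# Hypothetical client that supports variantType but not variantShredding.
SUPPORTED_FEATURES = {"variantType"}

def can_read(table_features: set) -> bool:
    """Reject tables whose protocol lists features this client does not
    implement, so a shredding-unaware reader fails up front."""
    return table_features <= SUPPORTED_FEATURES

assert can_read({"variantType"})
assert not can_read({"variantType", "variantShredding"})
```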
Collaborator


Should this reference to variant shredding link to the parquet variant shredding spec? Or is that overkill?

Contributor Author


I think it makes sense to keep it separate for now, because I imagine the parquet variant shredding spec may contain more information than is necessary for Delta (e.g. the parquet spec currently contains details of the binary encoding, which we don't include here).

I'd let @gene-db make the call here, though. Gene, do you have an opinion?

3 participants