Simplify Variant shredding and refactor for clarity #461

Open
wants to merge 9 commits into master

Conversation

Contributor

@rdblue rdblue commented Oct 20, 2024

Rationale for this change

Updating the Variant and shredding specs from a thorough review.

What changes are included in this PR?

Spec updates, mostly to the shredding spec to minimize it and make it clear. This also attempts to make the variant spec more consistent (for example, by using value in both).

  • Removes object and array in favor of always using typed_value
  • Makes list element and object field groups required to avoid unnecessary null cases
  • Separates cases for primitives, arrays, and objects
  • Adds individual examples for primitives, arrays, and objects
  • Adds Variant to Parquet type mapping for shredded columns
  • Clarifies that metadata must be valid for all variant values without modification
  • Updates reconstruction algorithm to be more pythonic

Do these changes have PoC implementations?

No.

We extract all homogeneous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
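To make the split concrete, the following is a minimal, illustrative writer-side sketch (not part of the spec): `ShreddedField` and `encode_variant` are hypothetical placeholders, and the logic only shows how a single row's field could be routed between `typed_value` and the binary fallback in `value`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class ShreddedField:
    value: Optional[bytes]      # Variant-encoded bytes, or None (Parquet null)
    typed_value: Optional[Any]  # shredded value, or None (Parquet null)

def shred_field(field_value: Any,
                shredded_type: type,
                encode_variant: Callable[[Any], bytes]) -> ShreddedField:
    if field_value is None:
        # the field is missing from this row's object: both columns stay null
        return ShreddedField(value=None, typed_value=None)
    if isinstance(field_value, shredded_type):
        # homogeneous case: the row matches the shredding schema
        return ShreddedField(value=None, typed_value=field_value)
    # incompatible case: keep this row's data in the binary encoding instead
    return ShreddedField(value=encode_variant(field_value), typed_value=None)
```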
Contributor Author

This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.


I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?

Contributor Author

@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names, so that readers do not need to rewrite the metadata (and values) to add them back.

For example, consider an event that looks like this:

{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}

And a shredding schema:

optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}

The top-level event_type and event_timestamp fields are shredded. But this is saying that the Variant metadata must still include those field names. That ensures that, when the entire Variant is projected and the shredded fields are merged back into the top-level Variant value, the existing binary metadata can be returned to the engine without adding the event_type and event_timestamp field names.
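A rough sketch of the read-side merge this enables (illustrative only; `metadata.field_id` and `value_obj.put_field` are hypothetical helpers, not spec APIs): when the whole Variant is projected, shredded fields are merged back into the binary object using IDs from the stored metadata, and the metadata itself is returned unmodified.

```python
def merge_shredded_fields(metadata, value_obj, shredded_fields):
    # shredded_fields maps field names (e.g. "event_type") to reconstructed values
    for name, field_value in shredded_fields.items():
        # this lookup only works if the writer kept shredded names in the metadata
        field_id = metadata.field_id(name)
        value_obj.put_field(field_id, field_value)
    # the stored metadata is reused as-is; no rewrite is needed on the read path
    return value_obj, metadata
```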


Thanks for detailed explanation. Later I realize this is about variant metadata and what I was talking about was column metadata (stats).

I get what you are saying: when the entire Variant is projected, we need to reconstruct the original value and metadata by merging back the shredded fields if the metadata after shredding excludes the shredded fields.

That makes sense to me to reduce the metadata reconstruction on the read side.


Similarly the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value` or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.


Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.

Contributor Author

The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.

In addition, every optional level in Parquet introduces another definition level. That adds up quickly with nested structures and ends up taking unnecessary space.

@@ -33,176 +33,239 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
Member

Another place I'd like to just remove some of the text here. My main goal here is just to reduce the amount of text in the spec

Contributor Author

I reduced this, but I don't think it's a problem to have a bit of context that answers the question "What is shredding and why do I care?"

The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of the `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
Contributor Author
@rdblue rdblue Oct 24, 2024

@RussellSpitzer and @gene-db, this could use some attention.

Here, if both value and typed_value are non-null I initially thought it made more sense to prefer value because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.

However, this conflicts with object fields, where the value of typed_value is preferred so that data skipping is correct. If the object's value contains a field that conflicts with a sub-field's typed_value, there is no way of knowing that from field stats. If we preferred the field value stored in the object's value, then data skipping could be out of sync with the value returned in the case of a conflict.

Member

the value is invalid

Suggested change
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
If both fields are non-null and either is not an object, the `value` is invalid. Readers must either fail or return the `typed_value`.

Member

Why aren't we just being proscriptive here? Isn't this essentially saying you can duplicate a sub-field between typed_value and value? Wouldn't it be safer to just say this cannot be done?

Contributor Author

The problem is that readers won't actually implement restrictions like this and we can't fully prevent it. It is invalid for a writer to produce a value where value and typed_value conflict. But writer bugs happen and readers need to know what to do when they encounter that situation. Otherwise we would get different behaviors between readers that are processing the same data file.

It all comes down to end users -- if a writer bug produces data like this, readers will implement the ability to read because the data still exists and can be recovered. When that happens, we want to know how it is interpreted.

@rdblue rdblue changed the title WIP: Current work on Variant specs Simplify Variant shredding and refactor for clarity Oct 24, 2024
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
Contributor

For exact numerics, we should allow truncating trailing zeros. For example, int8 value 1 and decimal(5,2) value 100 can both be represented as a JSON value 1.

Also, should the example be quoted to stay consistent?

Suggested change
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |


I think the intent of considering Exact Numeric to be a single logical type is that we consider the int8 value 1 to be logically equivalent to decimal(5,2) with unscaled value 100. If that's the case, I think we'd want the produced JSON to be the same for both (probably 1 in both cases), and not recommend having the fraction match the scale.

Contributor Author

@gene-db, @cashmand, these are concerns for the engine layer, not for storage. If Spark wants to automatically coerce between types that's fine, but the compromise that we talked about a couple months ago was to leave this out of the shredding spec and delegate the behavior to engines. Storage should always produce the data that was stored, without modification.

Contributor

Yes, the engine should be the one concerned with changing types.

However, my original question was about this JSON representation wording. Currently, the representation requirement for an Exact Numeric says "Digits in fraction must match scale". However, because Exact Numeric is considered a logical type, the value 1 could be stored in the Variant as int8 1 or decimal(5,2) 100. Both of those would be the same numeric value, so we should allow truncating trailing zeros in the JSON representation, instead of requiring that the digits in the fraction match the scale.

Contributor Author

@gene-db, the JSON representation should match the physical type as closely as possible. The reader can interpret the value however it chooses to, but a storage implementation should not discard the information.

If you want to produce 34 from 34.00 stored as decimal(9, 2) then the engine is responsible for casting the value to int8 and then producing JSON. The JSON representation for the original decimal(9, 2) value is 34.00.
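A tiny illustration of that point using Python's standard decimal module, just to show that the scale is part of the stored value rather than an engine choice:

```python
from decimal import Decimal

# a decimal(9, 2) value keeps its scale, so its faithful JSON text is "34.00"
print(Decimal("34.00"))  # 34.00
# the same logical quantity stored as int8 serializes as "34"
print(34)                # 34
```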

Contributor

@rdblue I am confused with this JSON chart then. If we are talking about "storage implementation", then are you expecting there is a "storage implementation" that is converting variant values to JSON? When will storage convert a variant value to a JSON string?

I originally thought this chart was trying to say, "When an engine wants to convert a variant value to a JSON string, here are the rules". Therefore, we should allow engines to cast integral decimals to integers before converting to JSON, as you already mentioned in your previous comment.

This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.


At this point, isn't non-shredded just a special case of shredded with no typed_value in the top level struct? I think it's automatically backwards compatible.

Contributor Author

I think the only thing that isn't backwards-compatible is that value is optional rather than required if you're shredding. But yes, writers are not required to shred.

Member

Do we need to add a note that once shredded, a file must be read using this spec and the typed_value portion can not be ignored?

Contributor Author

I added a note at the top:

When typed_value is present, readers must reconstruct shredded values according to this specification.

Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |


Just to be clear, this is only allowed for object fields, right? You mention in the array section that array elements must have one of them non-null, but I think that's also true for the top-level value/typed_value, right?

Contributor Author

I think this would be good to clarify. I think that we could state that the variant could be null this way.


This would be Variant null (i.e. present but null)? I guess this is the same as the both-null case for array elements, which still seems to me more like an error state, but I guess Variant null is the best option if a reader doesn't want to fail.


My understanding was that both typed_value and value being null means the variant is null if this is the top-level variant, and means the field is not present for a shredded field.


My preference would be to make it illegal for the top level variant field to be variant-null (same for array elements). It seems like it adds a relatively rare special case that readers would need to handle, and doesn't add much value, since the null can be encoded in value just like for shredded fields. I don't feel too strongly if the consensus is that it adds value.

Contributor Author

@cashmand, the problem with "make it illegal" is that there is no functional way to do this. We can state that writers should not set both value and typed_value to null in certain cases, but we have to define what to do if there is data that is actually written that way in order to have consistent and reliable behavior.

That's why this is called out as a requirement:

If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).


Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).


As mentioned in my previous comment, I think it would be invalid for measurement to have both value and typed_value be null, and it should be an error. I don't understand why we're recommending returning variant null as an option.

Contributor Author

This rule is to address the fact that arrays cannot contain a missing value. This is saying that if a value is required but both are null, the implementation must fill in a variant null.
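A reader-side sketch of that rule for array elements (illustrative only; `construct_variant` stands in for the reconstruction function described later in the spec, and the single byte `0x00` is Variant null):

```python
VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)

def element_to_variant(metadata, value, typed_value, construct_variant):
    if value is None and typed_value is None:
        # an array element cannot be missing: fail, or fill in Variant null
        return VARIANT_NULL
    return construct_variant(metadata, value, typed_value)
```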


This rule (both fields being null should be interpreted as JSON null) is valid only for the top-level variant and array elements? I wonder how a top-level variant can be inserted with both value and typed_value null if the top-level field is required. That seems inconsistent. For arrays, it looks like we could also require value to be a Variant-encoded null (JSON null) rather than allowing both fields to be null.

Contributor Author

@sfc-gh-saya, if the top-level field is required but both fields are null, then the reader must produce a variant null value, 00. We must state what happens in cases like this because it is possible for writers to produce them.


If the writers produce nulls for both value and typed_value, it's like a corrupted file, and I feel it's reasonable for the readers to error out rather than give a default value.


It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `ConstructVariant` with the top-level fields, which are assumed to be null if they are not present in the schema.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.


I think that here and in the array case, it would be good to clarify whether typed_value can be omitted from the schema entirely. E.g. if there's no consistent type for a field, I think we'd still want to shred the field, but put all values in value, and not require that a typed_value type be specified.


Conversely, is value always required? Would it be valid for a writer to only create a typed_value column if it knows that all values have a predictable type that can be shredded?

Contributor Author

I think that typed_value could be omitted. Wouldn't that be the case where the element is simply not shredded? I think we should call it out here, but I'll also adjust the language so that it is clear that shredding is not required, except for object fields, where you'd simply not have a shredded field (it makes no sense to shred a field and not include typed_value).

@cashmand cashmand Nov 5, 2024

Agreed, I think it makes complete sense for typed_value to be omitted, but it would be good to be clear.

I'm less clear about whether it should be valid (here, or for other types) to omit value, and treat that as equivalent to the value column being all null. I can sort of imagine cases where you're converting a typed parquet schema to variant, and know the types well enough to know that value will never be needed, but it seems like a fairly marginal benefit to omit value from the schema vs. leaving it and populating it with all nulls. Assuming that we don't want to allow that, it might be good to clarify somewhere that the value column is always required.


I do not get the benefit of optionally omitting typed_value. I think Ryan said the same thing above, but omitting typed_value should be equivalent to not shredding a field, in which case the field should not exist in the schema to begin with. Omitting typed_value just seems to increase the number of cases to deal with. A similar argument can be made for value, but being able to omit value actually has a benefit. If a writer uses V1 Parquet pages, where you cannot get the number of null values before reading the definition levels for a field, not having a value field in the schema for a perfectly shredded field would allow readers to skip figuring out that the values are all null after reading the definition levels.


If a field has an inconsistent type, it may still be useful to shred it into value in order to fetch it without the extra IO required to get the rest of the Variant.

not having a value field in the schema for a perfectly shredded field would allow readers to skip figuring out that the values are all null after reading the definition levels

Couldn't it figure this out from the row group stats, since value should be null for all rows if it could have been omitted?


Yes, I agree shredding into value in the case of inconsistent types is useful, and I also really like that the changes to the spec make it really clear when/how that happens.

Regarding the value field, yes, we could figure that out from row group stats, but those are not always present.

Contributor Author

Added this:

The typed_value field may be omitted when not shredding elements as a specific type.
When typed_value is omitted, value must be required.

I think there is value in allowing elements to be shredded. We could get dictionary encoding for them.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.
For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Member

Suggested change
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping while the rest of the Variant is lazily loaded for matching pages.

Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |
Member

This is something I was a little confused about before. How do I differentiate between

{ "foo" : { "x" : "null" }}
{ "foo" : { }}

I may be missing something here, but I'm trying to understand an empty vs. missing shredding representation.


I'm not sure what's empty or missing in your first example. Assuming you meant to put null without quotes, it would need to be written to the value column as the Variant NullType (which happens to be the byte 0x00).

An empty object is just a special case where all shredded fields are missing, and the object's value column is also null (i.e. there are no other fields that were not part of the shredding schema).

For example:
{ "foo" : { "x" : null }}: Stored in non-null foo.typed_value.x.value as 0x00
{ "foo" : { }}: both foo.typed_value.x.value and foo.typed_value.x.typed_value for field x are null, indicating that x is missing from the object. But foo.typed_value is non-null, so there is an empty object, not a missing foo.
{ "foo": { "y": 123 }}: Here, x is also missing, so its fields are null, just like the second case. Assuming there's no y in the shredding schema, foo.value would store the binary representation of { "y": 123 }. foo.typed_value should still be non-null to indicate that there is a non-null object, just like in the empty object case. I think the example Object with no shredding below contradicts that last point, but I'll comment there that I think it should be changed.

Member

Ok so if we have no difference between

| | foo.typed_value.x.typed_value | foo.typed_value.x.value |
|---|---|---|
| `foo : { x : "null" }` | null | null |
| `foo : { y : "bar" }` | null | null |


No, in the first case, foo.typed_value.x.value would be non-null, containing the variant value null (0x00). (Again, assuming you mean JSON null in your example, and not the string "null").

In the second case, it would be null (in the sense of the parquet definition level for foo.typed_value.x.value not being its maximum value).

Member

So I can't have a shredded value that is Nullable?

@cashmand cashmand Nov 7, 2024

Spark has two cast variants, cast and try_cast. Both consider it valid to cast an object to a struct and drop fields. Whether this is the right choice is a fair question, but let's assume for now that it won't change. I think you're right that for cast, we'd need the value column to check for errors due to other types. But for try_cast, I think we would only need to check the typed_value column if we could rely on it setting the definition level based on the value being an object or not.

I don't think it's a huge deal if we need this extra IO, but my preference would be to have clear and limited choices wherever possible for shredding to a given schema, so that readers can make optimal choices without risking correctness issues.


Sounds good. I would also like to draw your attention to the fact that this issue might also happen for leaf fields. For example, you can have a string field field1. A row might have an int, and for that row value will be a variant int. If you do foo:field1::number, you need to read the value field and get the int value. Having "some" value for the typed_value would not be useful here. Similarly, for the same shredding scenario, you might have a row with an empty object at that field, and I think it again makes more sense to put that in the value field.


Yes, agreed. The rule (which is meant to be spelled out in this doc, but feel free to suggest clarifications) is:

  1. If typed_value is a group and the Variant value being shredded is an object, then typed_value must be non-null. value may also be non-null (specifically, if there are fields that aren't in the shredding schema).
  2. In all other cases, at most one of typed_value and value can be non-null.

Contributor Author

I think I agree with @cashmand. When shredding an object (typed_value is a group), typed_value must be non-null. If we can rely on that rule then projections that only require specific fields don't need to read the value.

Contributor Author

I've also added this:

Readers can assume that a value is not an object if typed_value is null and that typed_value field values are correct; that is, readers do not need to read the value column if typed_value fields satisfy the required fields.
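Roughly, that assumption lets a single-field projection look like the following sketch (illustrative only; `read_column` and the `null_count` attribute stand in for whatever column access a reader actually has):

```python
def project_event_type(read_column):
    # shredded string values for $.event_type, with nulls where it was not shredded
    typed = read_column("event.typed_value.event_type.typed_value")
    fallback = None
    if typed.null_count > 0:
        # only rows where typed_value is null can hold a non-string event_type,
        # so the binary value column is read lazily rather than unconditionally
        fallback = read_column("event.typed_value.event_type.value")
    return typed, fallback
```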

# Data Skipping
All elements of an array must be non-null because `array` elements in a Variant cannot be missing.
That is, either `typed_value` or `value` (but not both) must be non-null.
Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
Member

Just for consistency: earlier in the doc this was written as `00` (Variant null), but this is fine too.

| `{"error_msg": "malformed: ..."}` | `{"error_msg", "malformed: ..."}` | null | | | | | Object with no shredding |
| `"malformed: not an object"` | `malformed: not an object` | null | | | | | Not an object (stored as Variant string) |
| `{"event_ts": 1729794240241, "click": "_button"}` | `{"click": "_button"}` | non-null | null | null | null | 1729794240241 | Field `event_type` is missing |
| `{"event_type": null, "event_ts": 1729794954163}` | null | non-null | `00` (field exists, is null) | null | null | 1729794954163 | Field `event_type` is present and is null |
Member

Some more requested examples,

Could we have an example where "event_ts" is a Date or something not transformable into a timestamp?
I assume this would make value be {"event_ts": "08-03-2025"} while typed_value would be null

I also wonder if we could do a single example for a doubly nested field showing where typed_value.address.value != null. All the examples here cover a primitive field being typed, so it may be nice to show the behavior with an object being typed.

{
 Name
 Address {
    City 
    ZIP (Shredded as INT but some values as String?)
    }
}


The `typed_value` associated with any Variant `value` field can be any shredded type according to the rules above.
Member

I don't think I understand this sentence, but I believe the intent is that you can have objects or elements within arrays also shredded?

I think the tables above are easier for me to follow than the parquet schema below. I understand though if that's difficult to depict.

Contributor Author

This is just saying that any time you have a value field, you can also have a typed_value field that might be any shredded type, like an array nested in a field or an object nested in an array.


Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Member

Do we need to specify "null" vs "variant null"? I get a little confused sometimes in the doc.

"not an object"
]
```
When the corresponding `value` column is all nulls, all values must be the shredded `typed_value` field's type.
Member

Sometimes we refer to the value as a column and sometimes as a field. Just wondering if we should take a pass to standardize, unless there is another meaning I'm not following here.

Casting behavior for Variant is delegated to processing engines.
For example, the interpretation of a string as a timestamp may depend on the engine's SQL session time zone.

## Reconstructing a Variant
Member

Reconstructing a Shredded Variant?


## Reconstructing a Variant

It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
Member

Suggested change
It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
It is possible to recover an un-shredded Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.

if value is not None:
# this is a partially shredded object
assert isinstance(value, VariantObject), "partially shredded value must be an object"
assert typed_value.keys().isdisjoint(value.keys()), "object keys must be disjoint"
Member

I know the rules above say it may return an error here or pull the value out of "typed_value". But if we are not going to allow it in this reference code, I probably would say we should just never allow it.

}
```

There are no restrictions on the repetition of Variant groups (required, optional, or repeated).
Contributor

shouldn't repeated use the 3-level list structure?

Comment on lines 48 to 50
Both fields `value` and `metadata` are of type `binary`.
The `metadata` field is required and must be a valid Variant metadata, as defined below.
The `variant_value` field is optional.

We are mixing value and variant_value. As @gene-db mentioned, we probably need to keep it as value, since Spark is already writing out value + metadata.



Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).

(missing) may be a little confusing here. We should probably remove it, since the following text already explains that value is all nulls.
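For what it's worth, the skipping rule can be sketched like this (illustrative only, assuming a hypothetical row-group statistics API):

```python
def can_skip_row_group(stats, typed_value_col, value_col, predicate):
    value_stats = stats[value_col]
    if value_stats.null_count != value_stats.num_values:
        # some rows fell back to the binary encoding, so typed_value min/max
        # do not describe every row and cannot be used to skip the row group
        return False
    typed_stats = stats[typed_value_col]
    return predicate.cannot_match(typed_stats.min, typed_stats.max)
```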

@@ -374,7 +402,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a

| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |

@gene-db When we say Logical Type and Physical Type here, what are we exactly referring to? Should we refer to Parquet logical type and Parquet physical type?


Can we clarify the Logical type and Physical Type to be "Variant Logical Type" and "Variant Physical Type"? In the Parquet context, we may think these are Parquet types.

| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Float | number | Fraction must be present | `14.20` |
Contributor

We should cover the expected format for +/- inf and NaN.

| Double | number | Fraction must be present | `1.0` |
| Date | string | ISO-8601 formatted date | `"2017-11-16"` |
| Timestamp | string | ISO-8601 formatted UTC timestamp including +00:00 offset | `"2017-11-16T22:31:08.000001+00:00"` |
| TimestampNTZ | string | ISO-8601 formatted UTC timestamp with no offset or zone | `"2017-11-16T22:31:08.000001"` |
Contributor

What precision decimal values are required?

# value is missing
return None

def primitive_to_variant(typed_value):
Contributor

Suggested change
def primitive_to_variant(typed_value):
def primitive_to_variant(typed_value: Any) -> VariantType:

It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.

```python
def construct_variant(metadata, value, typed_value):
Contributor
@Fokko Fokko Nov 8, 2024

Suggested change
def construct_variant(metadata, value, typed_value):
def construct_variant(metadata: VariantMetadata, value: Any, typed_value: Any) -> Optional[VariantType]:

Instead, I would suggest rewriting the code to return a VariantNull object instead of a Python None; then the signature becomes:

Suggested change
def construct_variant(metadata, value, typed_value):
def construct_variant(metadata: VariantMetadata, value: Any, typed_value: Any) -> VariantType:
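A minimal sketch of that shape (illustrative only; `VariantNull` is a hypothetical stand-in, and the remaining branches are elided because they would follow the reference algorithm unchanged):

```python
from typing import Any

class VariantNull:
    """Hypothetical object representing the Variant null value (the byte 0x00)."""

def construct_variant(metadata: Any, value: Any, typed_value: Any) -> Any:
    if value is None and typed_value is None:
        # return a Variant null object instead of Python None, so callers
        # never need to handle a None result from reconstruction
        return VariantNull()
    ...  # remaining branches follow the reference algorithm unchanged
```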

cloud-fan pushed a commit to apache/spark that referenced this pull request Nov 13, 2024
### What changes were proposed in this pull request?

This is a first step towards adding Variant shredding support for the Parquet writer. It adds functionality to convert a Variant value to an InternalRow that matches the current shredding spec in apache/parquet-format#461.

Once this merges, the next step will be to set up the Parquet writer to accept a shredding schema, and write these InternalRow values to Parquet instead of the raw Variant binary.

### Why are the changes needed?

First step towards adding support for shredding, which can improve Variant performance (and will be important for functionality on the read side once other tools begin writing shredded Variant columns to Parquet).

### Does this PR introduce _any_ user-facing change?

No, none of this code is currently called outside of the added tests.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48779 from cashmand/SPARK-48898-write-shredding.

Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

If a value cannot be represented by whichever of `object`, `array`, or `typed_value` is present in the schema, then it is stored in `variant_value`, and the other fields are set to null.
In the Parquet example above, if field `a` was an object or array, or a non-integer scalar, it would be stored in `variant_value`.
Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null.

Since we are talking about primitive types here, we probably just say "`typed_value` or `value` (but not both) must be non-null for primitive values" and merge with the paragraph above.


# Using variant_value vs. typed_value
If the value is not an array, `typed_value` must be null.

Should we remove this sentence since we are talking about array in this section?

```
optional group tags (VARIANT) {
required binary metadata;
optional binary value;

We can add a comment "# must be null".

On the other hand, shredding as a different logical type is not allowed.
For example, the integer value 123 could not be shredded to a string `typed_value` column as the string "123", since that would lose type information.
It would need to be written to the `variant_value` column.
If the value is not an object, `typed_value` must be null.

I guess I need to understand what this means: if the type is not an object but a primitive type or an array, then we should follow the other sections, so we don't need this here, right?


This section describes a more deeply nested example, using a top-level array as the shredding type.
A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant.

That is somewhat inconsistent with the encoding for primitive values, where both fields being null is invalid.
Why do we make the group required? Could we make the group optional, so that when the group is not present it means the field is missing?
