Simplify Variant shredding and refactor for clarity #461
Conversation
VariantShredding.md (Outdated)
We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.
I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?
@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names so that readers do not need to rewrite the metadata (and values) to add it back.
For example, consider an event that looks like this:
```
{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}
```
And a shredding schema:
```
optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}
```
The top-level `event_type` and `event_timestamp` fields are shredded. But this is saying that the Variant `metadata` must include those field names. That ensures that the existing binary metadata can be returned to the engine without adding `event_type` and `event_timestamp` fields when merging those fields into the top-level Variant `value` when the entire Variant is projected.
Thanks for the detailed explanation. Later I realized this is about Variant `metadata`, and what I was talking about was column metadata (stats).

I get what you are saying: when the entire Variant is projected, we need to reconstruct the original `value` and `metadata` by merging back the shredded fields if the `metadata` after shredding excludes the shredded fields. That makes sense to me to reduce the metadata reconstruction on the read side.
VariantShredding.md (Outdated)
Similarly the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value` or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.
Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.
The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.
In addition, every level in Parquet that is optional introduces another repetition/definition level. That adds up quickly with nested structures and ends up taking unnecessary space.
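To make the case-count argument concrete, here is a minimal sketch (not from the spec; `None` stands in for Parquet nulls) of the four states a reader has to handle when the group is required:

```python
def field_state(value, typed_value):
    # The four states of a shredded field stored as a required group
    # holding two optional columns (meanings follow the spec's tables).
    if value is None and typed_value is None:
        return "missing"            # field does not exist in the object
    if value is None:
        return "shredded"           # datum is fully typed in typed_value
    if typed_value is None:
        return "variant-encoded"    # datum kept as Variant binary in value
    return "partially shredded"     # only valid when shredding an object
```

An optional group would add a fifth (group-is-null) state, plus an extra definition level on every nested column.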
VariantShredding.md (Outdated)
@@ -33,176 +33,239 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
Another place I'd like to just remove some of the text here. My main goal here is just to reduce the amount of text in the spec.
I reduced this, but I don't think it's a problem to have a bit of context that answers the question "What is shredding and why do I care?"
The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of the `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
@RussellSpitzer and @gene-db, this could use some attention.

Here, if both `value` and `typed_value` are non-null, I initially thought it made more sense to prefer `value` because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.

However, this conflicts with object fields, where the value of `typed_value` is preferred so that data skipping is correct. If the object's `value` could contain a field that conflicts with a sub-field's `typed_value`, there is no way of knowing from field stats. If we preferred the field value stored in the object's `value`, then data skipping could be out of sync with the value returned in the case of a conflict.
> the value is invalid

Suggested change:

```diff
- If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
+ If both fields are non-null and either is not an object, the `value` is invalid. Readers must either fail or return the `typed_value`.
```
Why aren't we just being proscriptive here? Isn't this essentially saying you can duplicate a sub-field between `typed_value` and `value`? Wouldn't it be safer to just say this cannot be done?
The problem is that readers won't actually implement restrictions like this, and we can't fully prevent it. It is invalid for a writer to produce a value where `value` and `typed_value` conflict. But writer bugs happen, and readers need to know what to do when they encounter that situation. Otherwise we would get different behaviors between readers that are processing the same data file.

It all comes down to end users -- if a writer bug produces data like this, readers will implement the ability to read it because the data still exists and can be recovered. When that happens, we want to know how it is interpreted.
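For illustration, the reader-side rule described above could look like this sketch (function and flag names are hypothetical):

```python
def resolve_conflict(value, typed_value, strict=False):
    # Invalid writer output for non-objects: both columns non-null.
    # Readers must either fail (strict) or return typed_value, which
    # keeps results consistent with typed_value column statistics.
    if value is not None and typed_value is not None:
        if strict:
            raise ValueError("value and typed_value are both non-null")
        return typed_value
    return typed_value if typed_value is not None else value
```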
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
For exact numerics, we should allow truncating trailing zeros. For example, `int8` value `1` and `decimal(5,2)` value `100` can both be represented as a JSON value `1`.

Also, should the example be quoted to stay consistent?

```diff
- | Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
+ | Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |
```
I think the intent of considering Exact Numeric to be a single logical type is that we consider the `int8` value `1` to be logically equivalent to `decimal(5,2)` with unscaled value `100`. If that's the case, I think we'd want the produced JSON to be the same for both (probably `1` in both cases), and not recommend having the fraction match the scale.
@gene-db, @cashmand, these are concerns for the engine layer, not for storage. If Spark wants to automatically coerce between types that's fine, but the compromise that we talked about a couple months ago was to leave this out of the shredding spec and delegate the behavior to engines. Storage should always produce the data that was stored, without modification.
Yes, the engine should be the one concerned with changing types.

However, my original question was about this JSON representation wording. Currently, the representation requirements for an Exact Numeric say "Digits in fraction must match scale". However, because Exact Numeric is considered a logical type, the value `1` could be stored in the Variant as `int8` 1 or as `decimal(5,2)` with unscaled value 100. Both of those would be the same numeric value, so we should allow truncating trailing zeros in the JSON representation, instead of requiring that the digits in the fraction match the scale.
@gene-db, the JSON representation should match the physical type as closely as possible. The reader can interpret the value however it chooses to, but a storage implementation should not discard the information.

If you want to produce 34 from 34.00 stored as `decimal(9, 2)`, then the engine is responsible for casting the value to `int8` and then producing JSON. The JSON representation for the original `decimal(9, 2)` value is `34.00`.
@rdblue I am confused by this JSON chart then. If we are talking about a "storage implementation", then are you expecting there is a "storage implementation" that is converting variant values to JSON? When will storage convert a variant value to a JSON string?

I originally thought this chart was trying to say, "When an engine wants to convert a variant value to a JSON string, here are the rules". Therefore, we should allow engines to cast integral decimals to integers before converting to JSON, as you already mentioned in your previous comment.
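As an illustration of the scale-preserving behavior described above (helper name hypothetical; truncating `34.00` to `34` would be an engine-level cast applied before JSON conversion):

```python
from decimal import Decimal

def exact_numeric_to_json(v):
    # Render an exact numeric as JSON text without discarding type info:
    # digits in the fraction match the stored scale, and no exponent.
    if isinstance(v, int):
        return str(v)          # int8 34 -> "34"
    if isinstance(v, Decimal):
        return format(v, "f")  # decimal(9,2) 34.00 -> "34.00"
    raise TypeError("not an exact numeric")

assert exact_numeric_to_json(34) == "34"
assert exact_numeric_to_json(Decimal("34.00")) == "34.00"
```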
This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.
At this point, isn't non-shredded just a special case of shredded with no `typed_value` in the top-level struct? I think it's automatically backwards compatible.
I think the only thing that isn't backwards-compatible is that `value` is `optional` rather than `required` if you're shredding. But yes, writers are not required to shred.
Do we need to add a note that once shredded, a file must be read using this spec and the `typed_value` portion cannot be ignored?
I added a note at the top:

> When `typed_value` is present, readers must reconstruct shredded values according to this specification.
Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |
Just to be clear, this is only allowed for object fields, right? You mention in the array section that array elements must have one of them non-null, but I think that's also true for the top-level `value`/`typed_value`, right?
I think this would be good to clarify. I think that we could state that the variant could be null this way.
This would be Variant null (i.e. present but null)? I guess this is the same as the both-null case for array elements, which still seems to me more like an error state, but I guess Variant null is the best option if a reader doesn't want to fail.
My understanding was that both `typed_value` and `value` being null meant the variant is null if this is the top-level variant, and that the field is not present if this is a shredded field.
My preference would be to make it illegal for the top-level variant field to be variant-null (same for array elements). It seems like it adds a relatively rare special case that readers would need to handle, and doesn't add much value, since the null can be encoded in `value` just like for shredded fields. I don't feel too strongly if the consensus is that it adds value.
@cashmand, the problem with "make it illegal" is that there is no functional way to do this. We can state that writers should not set both `value` and `typed_value` to null in certain cases, but we have to define what to do if there is data that is actually written that way in order to have consistent and reliable behavior.

That's why this is called out as a requirement:

> If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
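Read as a reader-side sketch (helper name hypothetical), that requirement amounts to:

```python
VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)

def required_variant(value, typed_value):
    # A writer left both columns null where a value is required
    # (a required top-level Variant, or an array element).
    if value is None and typed_value is None:
        return VARIANT_NULL  # or: raise ValueError("missing required Variant")
    return value if typed_value is None else typed_value  # re-encoding elided
```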
Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).
As mentioned in my previous comment, I think it would be invalid for `measurement` to have both `value` and `typed_value` be null, and it should be an error. I don't understand why we're recommending returning Variant null as an option.
This rule is to address the fact that arrays cannot contain a missing value. This is saying that if a value is required but both are null, the implementation must fill in a variant null.
This rule (both values being null should be interpreted as JSON null) is valid only for the top-level variant and array elements? I wonder how a top-level variant can be inserted with both `value` and `typed_value` being null if the top-level field is required. That seems inconsistent. For arrays, it looks like we could also require `value` being a variant-encoded null (JSON null) rather than allowing both fields to be null.
@sfc-gh-saya, if the top-level field is required but both fields are null, then the reader must produce a variant null value, `00`. We must state what happens in cases like this because it is possible for writers to produce them.
If the writers produce nulls for both `value` and `typed_value`, it's like a corrupted file, and I feel it's reasonable for the readers to error out rather than give a default value.
It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `ConstructVariant` with the top-level fields, which are assumed to be null if they are not present in the schema.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.
I think that here and in the array case, it would be good to clarify whether `typed_value` can be omitted from the schema entirely. E.g. if there's no consistent type for a field, I think we'd still want to shred the field, but put all values in `value`, and not require that a `typed_value` type be specified.
Conversely, is `value` always required? Would it be valid for a writer to only create a `typed_value` column if it knows that all values have a predictable type that can be shredded?
I think that `typed_value` could be omitted. Wouldn't that be the case where the element is simply not shredded? I think we should call it out here, but I'll also adjust the language so that it is clear that shredding is not required, except for object fields, where you'd simply not have a shredded field (it makes no sense to shred a field and not include `typed_value`).
Agreed, I think it makes complete sense for `typed_value` to be omitted, but it would be good to be clear.

I'm less clear about whether it should be valid (here, or for other types) to omit `value`, and treat that as equivalent to the `value` column being all null. I can sort of imagine cases where you're converting a typed parquet schema to variant, and know the types well enough to know that `value` will never be needed, but it seems like a fairly marginal benefit to omit `value` from the schema vs. leaving it and populating it with all nulls. Assuming that we don't want to allow that, it might be good to clarify somewhere that the `value` column is always required.
I do not get the benefit of optionally omitting typed_value. I think Ryan said the same thing above, but omitting typed_value should be equivalent to not shredding a field, in which case the field should not exist in the schema to begin with. Omitting typed_value just seems to increase the number of cases to deal with. A similar argument can be made for value, but actually being able to omit value has a benefit: if a writer uses V1 Parquet pages, where you cannot get the number of null values before reading def levels for a field, not having a value field in the schema for a perfectly shredded field would let readers skip trying to figure out whether values are all null after reading the def levels.
If a field has an inconsistent type, it may still be useful to shred it into `value` in order to fetch it without the extra IO required to get the rest of the Variant.

> not having value field in the schema for a perfectly shredded field would allow readers skip trying to figure out values are all null after reading the def lvls

Couldn't it figure this out from the row group stats, since `value` should be null for all rows if it could have been omitted?
Yes, I agree shredding into value in the case of inconsistent types is useful, and I also really like that the changes to the spec make it really clear as to when/how that happens.

Regarding the value field: yes, we could figure that out from row group stats, but those are not always present.
Added this:

> The `typed_value` field may be omitted when not shredding elements as a specific type. When `typed_value` is omitted, `value` must be `required`.

I think there is value in allowing elements to be shredded. We could get dictionary encoding for them.
At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.
For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Suggested change:

```diff
- Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
+ Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping while the rest of the Variant is lazily loaded for matching pages.
```
Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |
This is something I was a little confused about before. How do I differentiate between

```
{ "foo" : { "x" : null }}
{ "foo" : { }}
```

I may be missing something here, but I'm trying to understand an empty vs. missing shredding representation.
I'm not sure what's empty or missing in your first example. Assuming you meant to put `null` without quotes, it would need to be written to the `value` column as the Variant NullType (which happens to be the byte `0x00`).

An empty object is just a special case where all shredded fields are missing, and the object's `value` column is also null (i.e. there are no other fields that were not part of the shredding schema).

For example:

- `{ "foo" : { "x" : null }}`: stored in non-null `foo.typed_value.x.value` as `0x00`.
- `{ "foo" : { }}`: both `foo.typed_value.x.value` and `foo.typed_value.x.typed_value` for field `x` are null, indicating that `x` is missing from the object. But `foo.typed_value` is non-null, so there is an empty object, not a missing `foo`.
- `{ "foo": { "y": 123 }}`: here, `x` is also missing, so its fields are null, just like the second case. Assuming there's no `y` in the shredding schema, `foo.value` would store the binary representation of `{ "y": 123 }`. `foo.typed_value` should still be non-null to indicate that there is a non-null object, just like in the empty object case. I think the example "Object with no shredding" below contradicts that last point, but I'll comment there that I think it should be changed.
Ok so if we have no difference between

| | foo.typed_value.x.typed_value | foo.typed_value.x.value |
|---|---|---|
| foo : { x : null } | null | null |
| foo : { y : "bar" } | null | null |
No, in the first case, `foo.typed_value.x.value` would be non-null, containing the variant value null (0x00). (Again, assuming you mean JSON null in your example, and not the string "null".)

In the second case, it would be null (in the sense of the parquet definition level for `foo.typed_value.x.value` not being its maximum value).
So I can't have a shredded value that is Nullable?
Spark has two cast variants, cast and try_cast. Both consider it valid to cast an object to a struct and drop fields. Whether this is the right choice is a fair question, but let's assume for now that it won't change. I think you're right that for `cast`, we'd need the `value` column to check for errors due to other types. But for `try_cast`, I think we would only need to check the `typed_value` column if we could rely on it setting the definition level based on the value being an object or not.

I don't think it's a huge deal if we need this extra IO, but my preference would be to have clear and limited choices wherever possible for shredding to a given schema, so that readers can make optimal choices without risking correctness issues.
Sounds good. I would also like to draw your attention to the fact that this issue might also happen for leaf fields. For example, you can have a string field field1. A row might have an int, and for that row value will be a variant int. If you do foo:field1::number, you need to read the value field and get the int value. Having "some" value for the typed_value would not be useful here. Similarly, for the same shredding scenario, you might have a row with an empty object at that field, and I think it again makes more sense to put that in the value field.
Yes, agreed. The rule (which is meant to be spelled out in this doc, but feel free to suggest clarifications) is:

- If `typed_value` is a group and the Variant value being shredded is an object, then `typed_value` must be non-null. `value` may also be non-null (specifically, if there are fields that aren't in the shredding schema).
- In all other cases, at most one of `typed_value` and `value` can be non-null.
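A compact, illustrative restatement of those two rules as a validity check (`is_variant_object` is an assumed helper; in the Variant encoding, basic type 2 in the header's low bits denotes an object):

```python
def is_variant_object(encoded):
    # Assumed helper: the low 2 bits of the first header byte hold the
    # basic type; 2 denotes an object in the Variant encoding.
    return (encoded[0] & 0b11) == 2

def valid_shredded_object(value, typed_value):
    # typed_value is the shredded group for an object-typed field.
    if typed_value is not None:
        # Partial shredding: value may only hold leftover object fields.
        return value is None or is_variant_object(value)
    # typed_value is null, so the datum (if any) must not be an object.
    return value is None or not is_variant_object(value)
```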
I think I agree with @cashmand. When shredding an object (`typed_value` is a group), `typed_value` must be non-null. If we can rely on that rule, then projections that only require specific fields don't need to read the value.
I've also added this:

> Readers can assume that a value is not an object if `typed_value` is null and that `typed_value` field values are correct; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields.
# Data Skipping
All elements of an array must be non-null because `array` elements in a Variant cannot be missing.
That is, either `typed_value` or `value` (but not both) must be non-null.
Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
Just for consistency: earlier in the doc this was written as "`00` (Variant null)", but this is fine too.
| `{"error_msg": "malformed: ..."}` | `{"error_msg", "malformed: ..."}` | null | | | | | Object with no shredding | | ||
| `"malformed: not an object"` | `malformed: not an object` | null | | | | | Not an object (stored as Variant string) | | ||
| `{"event_ts": 1729794240241, "click": "_button"}` | `{"click": "_button"}` | non-null | null | null | null | 1729794240241 | Field `event_type` is missing | | ||
| `{"event_type": null, "event_ts": 1729794954163}` | null | non-null | `00` (field exists, is null) | null | null | 1729794954163 | Field `event_type` is present and is null | |
Some more requested examples:

Could we have one where "event_ts" is a Date or something non-transformable into a timestamp? I assume this would make value be `{"event_ts": "08-03-2025"}` while typed_value would be null.

I also wonder if we could do a single example for a doubly nested field showing where typed_value.address.value != null. All the examples here cover a primitive field being typed, so it may be nice to show the behavior with an object being typed.

```
{
  Name
  Address {
    City
    ZIP (Shredded as INT but some values as String?)
  }
}
```
The `typed_value` associated with any Variant `value` field can be any shredded type according to the rules above.
I don't think I understand this sentence, but I believe the intent is that you can have objects or elements within arrays also shredded?

I think the tables above are easier for me to follow than the parquet schema below. I understand, though, if that's difficult to depict.
This is just saying that any time you have a `value` field, you can also have a `typed_value` field that might be any shredded type, like an array nested in a field or an object nested in an array.
Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Do we need to specify "null" vs. "variant null"? I get a little confused sometimes in the doc.
```
“not an object”
]
```
When the corresponding `value` column is all nulls, all values must be the shredded `typed_value` field's type.
Sometimes we refer to `value` as a column and sometimes as a field. Just wondering if we should take a pass to standardize, unless there is another meaning I'm not following here.
Casting behavior for Variant is delegated to processing engines.
For example, the interpretation of a string as a timestamp may depend on the engine's SQL session time zone.

## Reconstructing a Variant
Reconstructing a Shredded Variant?
## Reconstructing a Variant

It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
Suggested change:

```diff
- It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
+ It is possible to recover an un-shredded Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
```
```python
if value is not None:
    # this is a partially shredded object
    assert isinstance(value, VariantObject), "partially shredded value must be an object"
    assert typed_value.keys().isdisjoint(value.keys()), "object keys must be disjoint"
```
I know the rules above say it may return an error here or pull the value out of "typed_value". But if we are not going to allow it in this reference code, I probably would say we should just never allow it.
```
}
```

There are no restrictions on the repetition of Variant groups (required, optional, or repeated).
Shouldn't repeated use the 3-level list structure?
VariantEncoding.md (Outdated)
Both fields `value` and `metadata` are of type `binary`.
The `metadata` field is required and must be a valid Variant metadata, as defined below.
The `variant_value` field is optional.
We are mixing `value` and `variant_value`. As @gene-db mentioned, we probably need to keep it as `value` since Spark is already writing out as `value` + `metadata`.
Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
"(missing)" may be a little confusing here. We should probably remove it since we have the following text to explain that `value` is all nulls.
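In other words, a sketch of the skipping condition (`value_stats` stands in for the Parquet column-chunk statistics of the sibling `value` column):

```python
def typed_value_stats_usable(value_stats, num_rows):
    # typed_value min/max can only drive skipping when no row in this
    # row group fell back to value; otherwise a matching datum could
    # hide in value, outside the typed_value statistics.
    return value_stats is not None and value_stats.null_count == num_rows
```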
@@ -374,7 +402,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a

| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
@gene-db When we say Logical Type and Physical Type here, what are we exactly referring to? Should we refer to Parquet logical type and Parquet physical type?
Can we clarify the Logical type and Physical Type to be "Variant Logical Type" and "Variant Physical Type"? In the Parquet context, we may think these are Parquet types.
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Float | number | Fraction must be present | `14.20` |
We should cover the expected format for +/- inf and NaN.
| Double | number | Fraction must be present | `1.0` |
| Date | string | ISO-8601 formatted date | `"2017-11-16"` |
| Timestamp | string | ISO-8601 formatted UTC timestamp including +00:00 offset | `"2017-11-16T22:31:08.000001+00:00"` |
| TimestampNTZ | string | ISO-8601 formatted UTC timestamp with no offset or zone | `"2017-11-16T22:31:08.000001"` |
What precision decimal values are required?
```python
    # value is missing
    return None

def primitive_to_variant(typed_value):
```
Suggested change:

```diff
- def primitive_to_variant(typed_value):
+ def primitive_to_variant(typed_value: Any) -> VariantType:
```
It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.

```python
def construct_variant(metadata, value, typed_value):
```
Suggested change:

```diff
- def construct_variant(metadata, value, typed_value):
+ def construct_variant(metadata: VariantMetadata, value: Any, typed_value: Any) -> Optional[VariantType]:
```

Instead, I would suggest rewriting the code to return a `VariantNull` object instead of a Python `None`; then the signature becomes:

```diff
- def construct_variant(metadata, value, typed_value):
+ def construct_variant(metadata: VariantMetadata, value: Any, typed_value: Any) -> VariantType:
```
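For illustration, a minimal sketch of how a `VariantNull`-returning `construct_variant` could fit together, assuming the `VariantObject`, `VariantNull`, and `primitive_to_variant` helpers from the surrounding pseudocode, and modeling a shredded object group as a dict of `(value, typed_value)` pairs:

```python
def construct_variant(metadata, value, typed_value):
    if typed_value is not None:
        if isinstance(typed_value, dict):       # shredded object group
            fields = {}
            for name, (v, tv) in typed_value.items():
                if v is not None or tv is not None:  # both null => missing
                    fields[name] = construct_variant(metadata, v, tv)
            if value is not None:               # partially shredded object
                assert isinstance(value, VariantObject), \
                    "partially shredded value must be an object"
                assert fields.keys().isdisjoint(value.keys()), \
                    "object keys must be disjoint"
                fields.update(value)
            return VariantObject(metadata, fields)
        return primitive_to_variant(typed_value)
    if value is not None:
        return value                            # already Variant-encoded
    return VariantNull()                        # never a bare Python None
```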
### What changes were proposed in this pull request?
This is a first step towards adding Variant shredding support for the Parquet writer. It adds functionality to convert a Variant value to an InternalRow that matches the current shredding spec in apache/parquet-format#461. Once this merges, the next step will be to set up the Parquet writer to accept a shredding schema, and write these InternalRow values to Parquet instead of the raw Variant binary.

### Why are the changes needed?
First step towards adding support for shredding, which can improve Variant performance (and will be important for functionality on the read side once other tools begin writing shredded Variant columns to Parquet).

### Does this PR introduce _any_ user-facing change?
No, none of this code is currently called outside of the added tests.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48779 from cashmand/SPARK-48898-write-shredding.

Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
If a value cannot be represented by whichever of `object`, `array`, or `typed_value` is present in the schema, then it is stored in `variant_value`, and the other fields are set to null.
In the Parquet example above, if field `a` was an object or array, or a non-integer scalar, it would be stored in `variant_value`.
Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null.
Since we are talking about primitive types here, we could probably just say "`typed_value` or `value` (but not both) must be non-null for primitive values" and merge it with the paragraph above.
# Using variant_value vs. typed_value
If the value is not an array, `typed_value` must be null.
Should we remove this sentence since we are talking about arrays in this section?
```
optional group tags (VARIANT) {
  required binary metadata;
  optional binary value;
```
We can add a comment "# must be null".
On the other hand, shredding as a different logical type is not allowed.
For example, the integer value 123 could not be shredded to a string `typed_value` column as the string "123", since that would lose type information.
It would need to be written to the `variant_value` column.
If the value is not an object, `typed_value` must be null.
I guess I need to understand what this means: if the type is not an object but a primitive type or array, then we should follow the other sections, so we don't need this here, right?
This section describes a more deeply nested example, using a top-level array as the shredding type.
A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant.
That would have some inconsistency with the encoding for primitive values, where both nulls are invalid.

Why do we make the group required? Could we have the group be optional, so that when the group is not present, that means the field is missing?
Rationale for this change

Updating the Variant and shredding specs from a thorough review.

What changes are included in this PR?

Spec updates, mostly to the shredding spec to minimize it and make it clear. This also attempts to make the variant spec more consistent (for example, by using `value` in both).

- Removes `object` and `array` in favor of always using `typed_value`
- Uses `required` groups to avoid unnecessary null cases
- `metadata` must be valid for all variant values without modification

Do these changes have PoC implementations?
No.