Simplify Variant shredding and refactor for clarity #461

Open
wants to merge 9 commits into master

Conversation

Contributor

@rdblue rdblue commented Oct 20, 2024

Rationale for this change

Updating the Variant and shredding specs from a thorough review.

What changes are included in this PR?

Spec updates, mostly to the shredding spec to minimize it and make it clear. This also attempts to make the variant spec more consistent (for example, by using value in both).

  • Removes object and array in favor of always using typed_value
  • Makes list element and object field groups required to avoid unnecessary null cases
  • Separates cases for primitives, arrays, and objects
  • Adds individual examples for primitives, arrays, and objects
  • Adds Variant to Parquet type mapping for shredded columns
  • Clarifies that metadata must be valid for all variant values without modification
  • Updates reconstruction algorithm to be more pythonic

Do these changes have PoC implementations?

No.

We extract all homogeneous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
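To make the split concrete, the following is a minimal, illustrative writer-side sketch (not part of the spec): `ShreddedField` and `encode_variant` are hypothetical placeholders, and the logic only shows how a single row's field could be routed between `typed_value` and the binary fallback in `value`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class ShreddedField:
    value: Optional[bytes]      # Variant-encoded bytes, or None (Parquet null)
    typed_value: Optional[Any]  # shredded value, or None (Parquet null)

def shred_field(field_value: Any,
                shredded_type: type,
                encode_variant: Callable[[Any], bytes]) -> ShreddedField:
    if field_value is None:
        # the field is missing from this row's object: both columns stay null
        return ShreddedField(value=None, typed_value=None)
    if isinstance(field_value, shredded_type):
        # homogeneous case: the row matches the shredding schema
        return ShreddedField(value=None, typed_value=field_value)
    # incompatible case: keep this row's data in the binary encoding instead
    return ShreddedField(value=encode_variant(field_value), typed_value=None)
```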
Contributor Author

This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.


I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?

Contributor Author

@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names, so that readers do not need to rewrite the metadata (and values) to add them back.

For example, consider an event that looks like this:

{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}

And a shredding schema:

optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}

The top-level event_type and event_timestamp fields are shredded. But this is saying that the Variant metadata must still include those field names. That ensures that, when the entire Variant is projected and the shredded fields are merged back into the top-level Variant value, the existing binary metadata can be returned to the engine without adding the event_type and event_timestamp field names.
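A rough sketch of the read-side merge this enables (illustrative only; `metadata.field_id` and `value_obj.put_field` are hypothetical helpers, not spec APIs): when the whole Variant is projected, shredded fields are merged back into the binary object using IDs from the stored metadata, and the metadata itself is returned unmodified.

```python
def merge_shredded_fields(metadata, value_obj, shredded_fields):
    # shredded_fields maps field names (e.g. "event_type") to reconstructed values
    for name, field_value in shredded_fields.items():
        # this lookup only works if the writer kept shredded names in the metadata
        field_id = metadata.field_id(name)
        value_obj.put_field(field_id, field_value)
    # the stored metadata is reused as-is; no rewrite is needed on the read path
    return value_obj, metadata
```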


Thanks for detailed explanation. Later I realize this is about variant metadata and what I was talking about was column metadata (stats).

I get what you are saying: when the entire Variant is projected, we need to reconstruct the original value and metadata by merging back the shredded fields if the metadata after shredding excludes the shredded fields.

That makes sense to me to reduce the metadata reconstruction on the read side.


Similarly the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value` or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.


Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.

Contributor Author

The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.

In addition, every optional level in Parquet introduces another definition level. That adds up quickly with nested structures and ends up taking unnecessary space.

@@ -33,176 +33,239 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
Member

Another place I'd like to just remove some of the text here. My main goal here is just to reduce the amount of text in the spec

Contributor Author

I reduced this, but I don't think it's a problem to have a bit of context that answers the question "What is shredding and why do I care?"

The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of the `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
Contributor Author
@rdblue rdblue Oct 24, 2024

@RussellSpitzer and @gene-db, this could use some attention.

Here, if both value and typed_value are non-null I initially thought it made more sense to prefer value because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.

However, this conflicts with object fields, where the value of typed_value is preferred so that data skipping is correct. If the object's value contains a field that conflicts with a sub-field's typed_value, there is no way of knowing that from field stats. If we preferred the field value stored in the object's value, then data skipping could be out of sync with the value returned in the case of a conflict.

Member

the value is invalid

Suggested change
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
If both fields are non-null and either is not an object, the `value` is invalid. Readers must either fail or return the `typed_value`.

Member

Why aren't we just being proscriptive here? Isn't this essentially saying you can duplicate a sub-field between typed_value and value? Wouldn't it be safer to just say this cannot be done?

Contributor Author

The problem is that readers won't actually implement restrictions like this and we can't fully prevent it. It is invalid for a writer to produce a value where value and typed_value conflict. But writer bugs happen and readers need to know what to do when they encounter that situation. Otherwise we would get different behaviors between readers that are processing the same data file.

It all comes down to end users -- if a writer bug produces data like this, readers will implement the ability to read because the data still exists and can be recovered. When that happens, we want to know how it is interpreted.

@rdblue rdblue changed the title WIP: Current work on Variant specs Simplify Variant shredding and refactor for clarity Oct 24, 2024
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
Contributor

For exact numerics, we should allow truncating trailing zeros. For example, int8 value 1 and decimal(5,2) value 100 can both be represented as a JSON value 1.

Also, should the example be quoted to stay consistent?

Suggested change
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |


I think the intent of considering Exact Numeric to be a single logical type is that we consider the int8 value 1 to be logically equivalent to decimal(5,2) with unscaled value 100. If that's the case, I think we'd want the produced JSON to be the same for both (probably 1 in both cases), and not recommend having the fraction match the scale.

Contributor Author

@gene-db, @cashmand, these are concerns for the engine layer, not for storage. If Spark wants to automatically coerce between types that's fine, but the compromise that we talked about a couple months ago was to leave this out of the shredding spec and delegate the behavior to engines. Storage should always produce the data that was stored, without modification.

Contributor

Yes, the engine should be the one concerned with changing types.

However, my original question was about this JSON representation wording. Currently, the representation requirement for an Exact Numeric says "Digits in fraction must match scale". However, because Exact Numeric is considered a logical type, the value 1 could be stored in the Variant as int8 1 or decimal(5,2) 100. Both of those would be the same numeric value, so we should allow truncating trailing zeros in the JSON representation, instead of requiring that the digits in the fraction match the scale.

Contributor Author

@gene-db, the JSON representation should match the physical type as closely as possible. The reader can interpret the value however it chooses to, but a storage implementation should not discard the information.

If you want to produce 34 from 34.00 stored as decimal(9, 2) then the engine is responsible for casting the value to int8 and then producing JSON. The JSON representation for the original decimal(9, 2) value is 34.00.
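A tiny illustration of that point using Python's standard decimal module, just to show that the scale is part of the stored value rather than an engine choice:

```python
from decimal import Decimal

# a decimal(9, 2) value keeps its scale, so its faithful JSON text is "34.00"
print(Decimal("34.00"))  # 34.00
# the same logical quantity stored as int8 serializes as "34"
print(34)                # 34
```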

Contributor

@rdblue I am confused with this JSON chart then. If we are talking about "storage implementation", then are you expecting there is a "storage implementation" that is converting variant values to JSON? When will storage convert a variant value to a JSON string?

I originally thought this chart was trying to say, "When an engine wants to convert a variant value to a JSON string, here are the rules". Therefore, we should allow engines to cast integral decimals to integers before converting to JSON, as you already mentioned in your previous comment.

This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only `value` and `metadata` fields.


At this point, isn't non-shredded just a special case of shredded with no typed_value in the top level struct? I think it's automatically backwards compatible.

Contributor Author

I think the only thing that isn't backwards-compatible is that value is optional rather than required if you're shredding. But yes, writers are not required to shred.

Member

Do we need to add a note that once shredded, a file must be read using this spec and the typed_value portion can not be ignored?

Contributor Author

I added a note at the top:

When typed_value is present, readers must reconstruct shredded values according to this specification.

Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |


Just to be clear, this is only allowed for object fields, right? You mention in the array section that array elements must have one of them non-null, but I think that's also true for the top-level value/typed_value, right?

Contributor Author

I think this would be good to clarify. I think that we could state that the variant could be null this way.


This would be Variant null (i.e. present but null)? I guess this is the same as the both-null case for array elements, which still seems to me more like an error state, but I guess Variant null is the best option if a reader doesn't want to fail.


My understanding was that both typed_value and value being null means the variant is null if this is the top-level variant, and means the field is not present for a shredded field.


My preference would be to make it illegal for the top level variant field to be variant-null (same for array elements). It seems like it adds a relatively rare special case that readers would need to handle, and doesn't add much value, since the null can be encoded in value just like for shredded fields. I don't feel too strongly if the consensus is that it adds value.

Contributor Author

@cashmand, the problem with "make it illegal" is that there is no functional way to do this. We can state that writers should not set both value and typed_value to null in certain cases, but we have to define what to do if there is data that is actually written that way in order to have consistent and reliable behavior.

That's why this is called out as a requirement:

If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).


Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).


As mentioned in my previous comment, I think it would be invalid for measurement to have both value and typed_value be null, and it should be an error. I don't understand why we're recommending returning variant null as an option.

Contributor Author

This rule is to address the fact that arrays cannot contain a missing value. This is saying that if a value is required but both are null, the implementation must fill in a variant null.
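A reader-side sketch of that rule for array elements (illustrative only; `construct_variant` stands in for the reconstruction function described later in the spec, and the single byte `0x00` is Variant null):

```python
VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)

def element_to_variant(metadata, value, typed_value, construct_variant):
    if value is None and typed_value is None:
        # an array element cannot be missing: fail, or fill in Variant null
        return VARIANT_NULL
    return construct_variant(metadata, value, typed_value)
```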


This rule (both fields being null should be interpreted as JSON null) is valid only for the top-level variant and array elements? I wonder how a top-level variant can be inserted with both value and typed_value null if the top-level field is required. That seems inconsistent. For arrays, it looks like we could also require value to be a Variant-encoded null (JSON null) rather than allowing both fields to be null.

Contributor Author

@sfc-gh-saya, if the top-level field is required but both fields are null, then the reader must produce a variant null value, 00. We must state what happens in cases like this because it is possible for writers to produce them.


If the writers produce nulls for both value and typed_value, it's like a corrupted file, and I feel it's reasonable for the readers to error out rather than give a default value.


It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `ConstructVariant` with the top-level fields, which are assumed to be null if they are not present in the schema.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.


I think that here and in the array case, it would be good to clarify whether typed_value can be omitted from the schema entirely. E.g. if there's no consistent type for a field, I think we'd still want to shred the field, but put all values in value, and not require that a typed_value type be specified.


Conversely, is value always required? Would it be valid for a writer to only create a typed_value column if it knows that all values have a predictable type that can be shredded?

Contributor Author

I think that typed_value could be omitted. Wouldn't that be the case where the element is simply not shredded? I think we should call it out here, but I'll also adjust the language so that it is clear that shredding is not required, except for object fields, where you'd simply not have a shredded field (it makes no sense to shred a field and not include typed_value).

@cashmand cashmand Nov 5, 2024

Agreed, I think it makes complete sense for typed_value to be omitted, but it would be good to be clear.

I'm less clear about whether it should be valid (here, or for other types) to omit value, and treat that as equivalent to the value column being all null. I can sort of imagine cases where you're converting a typed parquet schema to variant, and know the types well enough to know that value will never be needed, but it seems like a fairly marginal benefit to omit value from the schema vs. leaving it and populating it with all nulls. Assuming that we don't want to allow that, it might be good to clarify somewhere that the value column is always required.


I do not get the benefit of optionally omitting typed_value. I think Ryan said the same thing above, but omitting typed_value should be equivalent to not shredding a field, in which case the field should not exist in the schema to begin with. Omitting typed_value just seems to increase the number of cases to deal with. A similar argument can be made for value, but being able to omit value actually has a benefit. If a writer uses V1 Parquet pages, where you cannot get the number of null values before reading the definition levels for a field, not having a value field in the schema for a perfectly shredded field would allow readers to skip figuring out that the values are all null after reading the definition levels.


If a field has an inconsistent type, it may still be useful to shred it into value in order to fetch it without the extra IO required to get the rest of the Variant.

not having a value field in the schema for a perfectly shredded field would allow readers to skip figuring out that the values are all null after reading the definition levels

Couldn't it figure this out from the row group stats, since value should be null for all rows if it could have been omitted?


Yes, I agree shredding into value in the case of inconsistent types is useful, and I also really like that the changes to the spec make it really clear when/how that happens.

Regarding the value field, yes, we could figure that out from row group stats, but those are not always present.

Contributor Author

Added this:

The typed_value field may be omitted when not shredding elements as a specific type.
When typed_value is omitted, value must be required.

I think there is value in allowing elements to be shredded. We could get dictionary encoding for them.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.
For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Member

Suggested change
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping while the rest of the Variant is lazily loaded for matching pages.

Each inner field's type is a recursively shredded variant value: that is, the fields of each object field must be one or more of `object`, `array`, `typed_value` or `variant_value`.
| `value` | `typed_value` | Meaning |
|----------|---------------|----------------------------------------------------------|
| null | null | The value is missing |
Member

This is something I was a little confused about before. How do I differentiate between

{ "foo" : { "x" : "null" }}
{ "foo" : { }}

I may be missing something here, but I'm trying to understand an empty vs. missing shredding representation.


I'm not sure what's empty or missing in your first example. Assuming you meant to put null without quotes, it would need to be written to the value column as the Variant NullType (which happens to be the byte 0x00).

An empty object is just a special case where all shredded fields are missing, and the object's value column is also null (i.e. there are no other fields that were not part of the shredding schema).

For example:
{ "foo" : { "x" : null }}: Stored in non-null foo.typed_value.x.value as 0x00
{ "foo" : { }}: both foo.typed_value.x.value and foo.typed_value.x.typed_value for field x are null, indicating that x is missing from the object. But foo.typed_value is non-null, so there is an empty object, not a missing foo.
{ "foo": { "y": 123 }}: Here, x is also missing, so its fields are null, just like the second case. Assuming there's no y in the shredding schema, foo.value would store the binary representation of { "y": 123 }. foo.typed_value should still be non-null to indicate that there is a non-null object, just like in the empty object case. I think the example Object with no shredding below contradicts that last point, but I'll comment there that I think it should be changed.

Member

Ok so if we have no difference between

| | foo.typed_value.x.typed_value | foo.typed_value.x.value |
|---|---|---|
| `foo : { x : "null" }` | null | null |
| `foo : { y : "bar" }` | null | null |


No, in the first case, foo.typed_value.x.value would be non-null, containing the variant value null (0x00). (Again, assuming you mean JSON null in your example, and not the string "null").

In the second case, it would be null (in the sense of the parquet definition level for foo.typed_value.x.value not being its maximum value).

Member

So I can't have a shredded value that is Nullable?

@cashmand cashmand Nov 7, 2024

Spark has two cast variants, cast and try_cast. Both consider it valid to cast an object to a struct and drop fields. Whether this is the right choice is a fair question, but let's assume for now that it won't change. I think you're right that for cast, we'd need the value column to check for errors due to other types. But for try_cast, I think we would only need to check the typed_value column if we could rely on it setting the definition level based on the value being an object or not.

I don't think it's a huge deal if we need this extra IO, but my preference would be to have clear and limited choices wherever possible for shredding to a given schema, so that readers can make optimal choices without risking correctness issues.


Sounds good. I would also like to draw your attention to the fact that this issue might also happen for leaf fields. For example, you can have a string field field1. A row might have an int, and for that row value will be a variant int. If you do foo:field1::number, you need to read the value field and get the int value. Having "some" value for the typed_value would not be useful here. Similarly, for the same shredding scenario, you might have a row with an empty object at that field, and I think it again makes more sense to put that in the value field.


Yes, agreed. The rule (which is meant to be spelled out in this doc, but feel free to suggest clarifications) is:

  1. If typed_value is a group and the Variant value being shredded is an object, then typed_value must be non-null. value may also be non-null (specifically, if there are fields that aren't in the shredding schema).
  2. In all other cases, at most one of typed_value and value can be non-null.

Contributor Author

I think I agree with @cashmand. When shredding an object (typed_value is a group), typed_value must be non-null. If we can rely on that rule then projections that only require specific fields don't need to read the value.

Contributor Author

I've also added this:

Readers can assume that a value is not an object if typed_value is null and that typed_value field values are correct; that is, readers do not need to read the value column if typed_value fields satisfy the required fields.
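Roughly, that assumption lets a single-field projection look like the following sketch (illustrative only; `read_column` and the `null_count` attribute stand in for whatever column access a reader actually has):

```python
def project_event_type(read_column):
    # shredded string values for $.event_type, with nulls where it was not shredded
    typed = read_column("event.typed_value.event_type.typed_value")
    fallback = None
    if typed.null_count > 0:
        # only rows where typed_value is null can hold a non-string event_type,
        # so the binary value column is read lazily rather than unconditionally
        fallback = read_column("event.typed_value.event_type.value")
    return typed, fallback
```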

# Data Skipping
All elements of an array must be non-null because `array` elements in a Variant cannot be missing.
That is, either `typed_value` or `value` (but not both) must be non-null.
Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
Member

Just for consistency: earlier in the doc this was written as `00` (Variant null), but this is fine too.

| `{"error_msg": "malformed: ..."}` | `{"error_msg", "malformed: ..."}` | null | | | | | Object with no shredding |
| `"malformed: not an object"` | `malformed: not an object` | null | | | | | Not an object (stored as Variant string) |
| `{"event_ts": 1729794240241, "click": "_button"}` | `{"click": "_button"}` | non-null | null | null | null | 1729794240241 | Field `event_type` is missing |
| `{"event_type": null, "event_ts": 1729794954163}` | null | non-null | `00` (field exists, is null) | null | null | 1729794954163 | Field `event_type` is present and is null |
Member

Some more requested examples,

Could we have an example where "event_ts" is a Date or something not transformable into a timestamp?
I assume this would make value be {"event_ts": "08-03-2025"} while typed_value would be null

I also wonder if we could do a single example for a doubly nested field showing where typed_value.address.value != null. All the examples here cover a primitive field being typed, so it may be nice to show the behavior with an object being typed.

{
 Name
 Address {
    City 
    ZIP (Shredded as INT but some values as String?)
    }
}


The `typed_value` associated with any Variant `value` field can be any shredded type according to the rules above.
Member

I don't think I understand this sentence, but I believe the intent is that you can have objects or elements within arrays also shredded?

I think the tables above are easier for me to follow than the parquet schema below. I understand though if that's difficult to depict.

Contributor Author

This is just saying that any time you have a value field, you can also have a typed_value field that might be any shredded type, like an array nested in a field or an object nested in an array.


Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Member

Do we need to specify "null" vs "variant null"? I get a little confused sometimes in the doc.

"not an object"
]
```
When the corresponding `value` column is all nulls, all values must be the shredded `typed_value` field's type.
Member

Sometimes we refer to the value as a column and sometimes as a field. Just wondering if we should take a pass to standardize, unless there is another meaning I'm not following here.

Casting behavior for Variant is delegated to processing engines.
For example, the interpretation of a string as a timestamp may depend on the engine's SQL session time zone.

## Reconstructing a Variant
Member

Reconstructing a Shredded Variant?


## Reconstructing a Variant

It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
Member

Suggested change
It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.
It is possible to recover an un-shredded Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.

if value is not None:
# this is a partially shredded object
assert isinstance(value, VariantObject), "partially shredded value must be an object"
assert typed_value.keys().isdisjoint(value.keys()), "object keys must be disjoint"
Member

I know the rules above say it may return an error here or pull the value out of "typed_value". But if we are not going to allow it in this reference code, I probably would say we should just never allow it.

}
```

There are no restrictions on the repetition of Variant groups (required, optional, or repeated).
Contributor

shouldn't repeated use the 3-level list structure?

Comment on lines 48 to 50
Both fields `value` and `metadata` are of type `binary`.
The `metadata` field is required and must be a valid Variant metadata, as defined below.
The `variant_value` field is optional.

We are mixing value and variant_value. As @gene-db mentioned, we probably need to keep it as value, since Spark is already writing out value + metadata.



Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).

(missing) may be a little confusing here. We should probably remove it, since the following text already explains that value is all nulls.
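For what it's worth, the skipping rule can be sketched like this (illustrative only, assuming a hypothetical row-group statistics API):

```python
def can_skip_row_group(stats, typed_value_col, value_col, predicate):
    value_stats = stats[value_col]
    if value_stats.null_count != value_stats.num_values:
        # some rows fell back to the binary encoding, so typed_value min/max
        # do not describe every row and cannot be used to skip the row group
        return False
    typed_stats = stats[typed_value_col]
    return predicate.cannot_match(typed_stats.min, typed_stats.max)
```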

@@ -374,7 +402,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a

| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |

@gene-db When we say Logical Type and Physical Type here, what are we exactly referring to? Should we refer to Parquet logical type and Parquet physical type?


Can we clarify the Logical type and Physical Type to be "Variant Logical Type" and "Variant Physical Type"? In the Parquet context, we may think these are Parquet types.

| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Float | number | Fraction must be present | `14.20` |
Contributor

We should cover the expected format for +/- inf and NaN.

| Double | number | Fraction must be present | `1.0` |
| Date | string | ISO-8601 formatted date | `"2017-11-16"` |
| Timestamp | string | ISO-8601 formatted UTC timestamp including +00:00 offset | `"2017-11-16T22:31:08.000001+00:00"` |
| TimestampNTZ | string | ISO-8601 formatted UTC timestamp with no offset or zone | `"2017-11-16T22:31:08.000001"` |
Contributor

What precision decimal values are required?

# value is missing
return None

def primitive_to_variant(typed_value):
Contributor

Suggested change
def primitive_to_variant(typed_value):
def primitive_to_variant(typed_value: Any) -> VariantType:

It is possible to recover a full Variant value using a recursive algorithm, where the initial call is to `construct_variant` with the top-level Variant group fields.

```python
def construct_variant(metadata, value, typed_value):
Contributor
@Fokko Fokko Nov 8, 2024

Suggested change
def construct_variant(metadata, value, typed_value):
def construct_variant(metadata: VariantMetadata, value: Any, typed_value: Any) -> Optional[VariantType]:

Instead, I would suggest rewriting the code to return a VariantNull object instead of a Python None; then the signature becomes:

Suggested change
def construct_variant(metadata, value, typed_value):
def construct_variant(metadata: VariantMetadata, value: Any, typed_value: Any) -> VariantType:
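A minimal sketch of that shape (illustrative only; `VariantNull` is a hypothetical stand-in, and the remaining branches are elided because they would follow the reference algorithm unchanged):

```python
from typing import Any

class VariantNull:
    """Hypothetical object representing the Variant null value (the byte 0x00)."""

def construct_variant(metadata: Any, value: Any, typed_value: Any) -> Any:
    if value is None and typed_value is None:
        # return a Variant null object instead of Python None, so callers
        # never need to handle a None result from reconstruction
        return VariantNull()
    ...  # remaining branches follow the reference algorithm unchanged
```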

cloud-fan pushed a commit to apache/spark that referenced this pull request Nov 13, 2024
### What changes were proposed in this pull request?

This is a first step towards adding Variant shredding support for the Parquet writer. It adds functionality to convert a Variant value to an InternalRow that matches the current shredding spec in apache/parquet-format#461.

Once this merges, the next step will be to set up the Parquet writer to accept a shredding schema, and write these InternalRow values to Parquet instead of the raw Variant binary.

### Why are the changes needed?

First step towards adding support for shredding, which can improve Variant performance (and will be important for functionality on the read side once other tools begin writing shredded Variant columns to Parquet).

### Does this PR introduce _any_ user-facing change?

No, none of this code is currently called outside of the added tests.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48779 from cashmand/SPARK-48898-write-shredding.

Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

If a value cannot be represented by whichever of `object`, `array`, or `typed_value` is present in the schema, then it is stored in `variant_value`, and the other fields are set to null.
In the Parquet example above, if field `a` was an object or array, or a non-integer scalar, it would be stored in `variant_value`.
Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null.

Since we are talking about primitive types here, we probably just say "`typed_value` or `value` (but not both) must be non-null for primitive values" and merge with the paragraph above.


# Using variant_value vs. typed_value
If the value is not an array, `typed_value` must be null.

Should we remove this sentence since we are talking about array in this section?

```
optional group tags (VARIANT) {
required binary metadata;
optional binary value;

We can add a comment "# must be null".

On the other hand, shredding as a different logical type is not allowed.
For example, the integer value 123 could not be shredded to a string `typed_value` column as the string "123", since that would lose type information.
It would need to be written to the `variant_value` column.
If the value is not an object, `typed_value` must be null.

I guess I need to understand what this means: if the type is not an object but a primitive type or an array, then we should follow the other sections, so we don't need this here, right?


This section describes a more deeply nested example, using a top-level array as the shredding type.
A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant.

That is somewhat inconsistent with the encoding for primitive values, where both fields being null is invalid.
Why do we make the group required? Could we make the group optional, so that when the group is not present it means the field is missing?
