
PARQUET-906: Add LogicalType annotation. #51

Conversation


@rdblue rdblue commented Mar 24, 2017

This commit adds a LogicalType union and a field for this logical type to SchemaElement. Adding a new structure for logical types is needed for a few reasons:

  1. Adding to the ConvertedType enum is not forward-compatible. Adding new types to the LogicalType union is forward-compatible.
  2. Using a struct for each type allows additional metadata, like isAdjustedToUTC, without adding more fields to SchemaElement that don't apply to all types.
  3. Types without additional metadata can be updated later. For example, adding an encoding field to StringType when it is needed.
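The forward-compatibility argument in point 1 can be sketched in Python (a hypothetical sketch; these are not the real Thrift-generated classes, and the type lists are illustrative): a reader that sees an unknown union member can ignore the annotation and fall back to the physical type, while an unseen enum ordinal has no safe interpretation.

```python
# Hypothetical sketch (not the real Thrift-generated code) of why a union of
# structs is forward-compatible while a bare enum is not.

KNOWN_LOGICAL_TYPES = {"STRING", "MAP", "LIST", "ENUM", "DECIMAL"}

def read_logical_type(annotation):
    """Union-style reader: an unknown member can be skipped, and the column
    falls back to its physical type instead of failing."""
    if annotation in KNOWN_LOGICAL_TYPES:
        return annotation
    return None  # unknown annotation: ignore it, keep the physical type

KNOWN_CONVERTED_TYPES = ["UTF8", "MAP", "MAP_KEY_VALUE", "LIST"]

def read_converted_type(ordinal):
    """Enum-style reader: an unseen ordinal has no safe interpretation."""
    if 0 <= ordinal < len(KNOWN_CONVERTED_TYPES):
        return KNOWN_CONVERTED_TYPES[ordinal]
    raise ValueError("unknown ConvertedType ordinal: %d" % ordinal)
```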


rdblue commented Mar 24, 2017

@julienledem, @mkornacker, here is an early version of additional metadata for time and timestamp types.

When I looked at adding just metadata for Time and Timestamp, the work was nearly identical to the LogicalType union that this PR includes, but required more spec to state what metadata should be associated with a certain ConvertedType. I think it is simpler to solve the compatibility problem (can't add to a thrift enum) and add the extra metadata at the same time by replacing ConvertedType with this new LogicalType.

I haven't updated all the docs yet because I wanted to get feedback on the approach first. Please comment. Thanks!

@julienledem julienledem left a comment

I like the general idea for this.

* Annotates a column that is always null
* Sometimes when discovering the schema of existing data
* values are always null
*/
Member

If you don't want to duplicate the comment, maybe refer to NullType below

Contributor Author

Oops, I intended to remove NULL entirely because it is an unreleased type that can be replaced with the LogicalType version.


/** Timestamp logical type annotation */
struct TimestampType {
1: required bool isAdjustedToUTC
Member

How about "withoutTimeZone" meaning the opposite of "isAdjustedToUTC"

Contributor Author

isAdjustedToUTC is more clear. With or without time zone is language from the SQL spec, which is very difficult to understand and apply. Rather than relying on it, this captures a very specific piece of information: whether the timestamp was adjusted to UTC from its original offset (or would have been if the original offset is UTC).
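A minimal standard-library sketch of the distinction being described (the variable names and values are illustrative, not part of the spec; assumes Python 3.9+ for zoneinfo):

```python
# Sketch of what "adjusted to UTC" means for a stored timestamp value.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

local = datetime(2017, 3, 24, 10, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

# isAdjustedToUTC = true: the stored value is the instant normalized to UTC,
# whatever offset it was originally written in.
adjusted = local.astimezone(timezone.utc)

# isAdjustedToUTC = false: the stored value is the local wall-clock time,
# with no zone information attached.
naive = local.replace(tzinfo=None)
```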

Member

if isAdjustedToUTC is true, we need to record the original TZ. So we should add an optional String timezone field (and then we don't need isAdjustedToUTC, since it is implied by timezone != null).
Something similar to Arrow:
https://github.com/apache/arrow/blob/a4f29f3a3ff1c64a6f547bfb0d5e4500142ea5ec/format/Schema.fbs#L117


The model we discussed for Parquet is different from the Arrow spec.
In Arrow, both Timestamp and TimestampTz are stored in UTC. The writer Timezone string is additionally stored to re-compute the Timestamp values.
For Parquet, the idea is to store Timestamp values as is from epoch without conversion to UTC.
The conversion to UTC happens only for TimestampTz values.
The Parquet approach is slightly more efficient since we don't have to re-compute the Timestamp values, thereby enabling bulk loading of the column.
The presence of a writer Timezone in the file metadata will also prohibit concatenation of files from different Timezones.

Member

In Arrow, both Timestamp and TimestampTz are stored in UTC

This isn't accurate, see: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L120. In Arrow, if the timestamp metadata does not have a time zone, it is time zone naive, not UTC.

From naive timestamp values, we can choose later to localize to UTC (which is a no-op) or localize to some other time zone (which will adjust the values to the UTC values internally based on the tz database)
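The localization semantics described here can be sketched with the standard library (an illustrative sketch, not Arrow's actual API): localizing a naive wall-clock time to UTC takes its fields at face value, while localizing to another zone shifts the underlying UTC instant according to the tz database.

```python
# Sketch of localizing a zone-naive timestamp (illustrative only).
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

naive = datetime(2017, 6, 2, 12, 0)  # wall-clock time, no zone attached

# Localize to UTC: the wall-clock fields are reinterpreted as UTC,
# so the underlying instant is exactly "12:00 UTC" (a no-op adjustment).
as_utc = naive.replace(tzinfo=timezone.utc)

# Localize to another zone: same wall-clock fields, but the underlying
# UTC instant shifts by the zone's offset (PDT is UTC-7 on this date).
as_la = naive.replace(tzinfo=ZoneInfo("America/Los_Angeles"))
```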

Member

I opened https://issues.apache.org/jira/browse/ARROW-1020 to clarify in the comments in Schema.fbs

When timezone is set, the physical values are UTC timestamps regardless of the time zone of the writer or the client, so changing the timezone does not change the underlying integers.

Contributor Author

I don't think we need to store the original zone if isAdjustedToUTC is true. This isn't done by other systems, and there is no guarantee that there is a single zone that the timestamps were converted from. This just indicates that whatever the source zone, the values have been normalized to the same one, UTC.

Member

I linked the Parquet and Arrow JIRAs about timestamp types. There's a discussion there about timestamp types and timezones:
https://issues.apache.org/jira/browse/PARQUET-905
https://issues.apache.org/jira/browse/ARROW-637

@wesm wesm Jun 2, 2017

We spoke about this on the Arrow sync yesterday.

In Arrow, we have 3 cases:

  • No time zone in the metadata: the data is time zone naive. One can later localize to a time zone (which may alter the values, because they will be internally normalized to UTC)

  • Time zone in the metadata. Whether the time zone is 'UTC' or 'America/Los_Angeles', the physical values themselves are the same: changing the time zone only changes the metadata, not the values of the int64 timestamps

What is proposed here simplifies this to either isAdjustedToUTC=false (what we currently call "no time zone" or "time zone naive" in Arrow) or isAdjustedToUTC=true (which covers BOTH the case that the time zone is set as UTC or some other time zone)

The problem I see here is that if a data processing system runs the query:

select hour(timestamp_field), count(*)
from my_parquet_data
group by 1

For timestamps with time zone, if the time zone is known then the hour function can be implemented to compute the hour values based on the time zone (e.g. America/Los_Angeles). But what's proposed here precludes that; you would need to do something like

hour_localized(timestamp_field, 'America/Los_Angeles')
...

Or maybe the analytics system has some means to cast to a timestamp with the additional metadata. Either way there's some awkwardness.
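The awkwardness can be sketched in Python: with only isAdjustedToUTC=true in the metadata, the display zone has to be supplied out of band, as in the hypothetical hour_localized call above (this helper is illustrative, not any engine's actual function).

```python
# Sketch of the concern: the engine must be told the display zone explicitly,
# because the file metadata only records "adjusted to UTC".
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def hour_localized(ts_utc, zone):
    """Hour of a UTC-stored timestamp, rendered in an explicitly given zone."""
    return ts_utc.astimezone(ZoneInfo(zone)).hour

ts = datetime(2017, 6, 2, 19, 0, tzinfo=timezone.utc)  # stored value: UTC
```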

This isn't done by other systems, and there is no guarantee that there is a single zone that the timestamps were converted from.

I see this concern, but if there is no consistency about the origin of the data, why not store "UTC" as the storage time zone? The fact that others may preserve a local time zone need not complicate this use case.

We can use the keyvalue metadata to preserve the time zone metadata in the Parquet file footer to at least maintain compatibility with Arrow, but it's not ideal.

Contributor Author

We've resolved this discussion in the Parquet sync-ups. Parquet timestamps are always stored with respect to UTC and won't have a time zone.


/** Time logical type annotation */
struct TimeType {
1: required bool isAdjustedToUTC
Member

does this actually apply to Time?


Contributor Author

Yes, this is required by the SQL spec

* bitWidth must be 8, 16, 32, or 64.
*/
struct IntType {
1: required i32 bitWidth
Member

byte instead of i32?

Contributor Author

+1

*/
struct IntType {
1: required i32 bitWidth
2: required bool isSigned
Member

just signed?

Contributor Author

I'm fine either way. isSigned is the Java convention.

}

/** Embedded logical type annotation */
struct EmbeddedType {
Member

I feel that we have too many layers here.
A union already replaces the notion of a format field.

Contributor Author

Sounds fine to me. I can add JsonType and BsonType instead.

@majetideepak

+1 LGTM

@majetideepak

I believe the physical type for Timestamp is only going to be INT64.

6: DateType DATE // use ConvertedType DATE
7: TimeType TIME // use ConvertedType TIME_MICROS or TIME_MILLIS
8: TimestampType TIMESTAMP // use ConvertedType TIMESTAMP_MICROS or TIMESTAMP_MILLIS
// 9: reserved for INTERVAL
Contributor

How about creating an empty struct IntervalType and leaving a todo with that struct instead?

@julienledem julienledem May 12, 2017

but then we cannot add required fields?

Contributor Author

I agree with Julien. Then we can't add required fields to interval and there isn't much value to adding it now.

/**
* Logical type to annotate a column that is always null.
*
* Sometimes when discovering the schema of existing data values are always
Contributor

I found this sentence a bit tricky to parse. Can you add a comma to make it more clear?

Contributor Author

I added some clarification here. Thanks for the suggestion.

2: MicroSeconds MICROS
}

/** Timestamp logical type annotation */
Contributor

Could you add a sentence or two to the comments explaining the difference between TimestampType and TimeType? They're not clear to me.

Contributor Author

These types are already documented in LogicalTypes.md. We can add some things here, like the allowed physical types, but I don't think we should maintain everything in two places.

}

/**
* Integer logical type annotation
Contributor

Is this needed in addition to the physical types? Could it just be UnsignedIntType and leave out the bitWidth?

Member

unless we want to support other widths in the future?

Contributor Author

We should talk about this in the sync-up today.

Contributor

I think I didn't make it to that sync. Was there some conclusion?

Contributor Author

Bit width is needed to signal that all of the values will fit in a particular width. Since the spec currently has 8 and 16 bit widths, we have to capture those possible values.
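A sketch of the range check a declared bit width enables (a hypothetical validator, not part of any Parquet implementation):

```python
# Hypothetical validator: with bitWidth in the metadata, a reader knows every
# value is guaranteed to fit in the annotated width.
ALLOWED_BIT_WIDTHS = (8, 16, 32, 64)

def fits(value, bit_width, is_signed):
    """True if `value` fits in an IntType(bitWidth=bit_width, isSigned=is_signed)."""
    if bit_width not in ALLOWED_BIT_WIDTHS:
        raise ValueError("bitWidth must be 8, 16, 32, or 64")
    if is_signed:
        lo, hi = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    else:
        lo, hi = 0, (1 << bit_width) - 1
    return lo <= value <= hi
```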

Member

This is consistent with the Arrow integer metadata https://github.com/apache/arrow/blob/master/format/Schema.fbs#L87

@@ -211,6 +204,93 @@ struct Statistics {
4: optional i64 distinct_count;
}

/** Empty structs to use as logical type annotations */
Contributor

I think it would help the reader if the comments for each logical type mention the physical type(s) they can annotate. Currently one needs to look them up in the table below, and then look in LogicalTypes.md or ConvertedType to find what physical types can be annotated.

Member

+1

Contributor Author

Added.

8: TimestampType TIMESTAMP // use ConvertedType TIMESTAMP_MICROS or TIMESTAMP_MILLIS
// 9: reserved for INTERVAL
10: IntType INTEGER // use ConvertedType INT_* or UINT_*
11: NullType NULL // no compatible ConvertedType
Contributor

Does the comment mean ", so leave it empty."?

Member

Yes.


Can we use NIL or NA here since NULL is a reserved keyword in C++?

Contributor Author

I've updated this to NONE. Good suggestion.

Contributor Author

I changed it to NONE, which is a little more common than NIL or NA. Python and Scala use it.


/** Empty structs to use as logical type annotations */
struct StringType {}
Member

add a comment that it is UTF8 by default ? (in case we want to add an encoding field at some point).

Contributor Author

I can add a comment about UTF8, but I don't think we should allow other encodings, unless there is a new encoding that becomes the best standard, like UTF8 is today.

Contributor Author

Added a comment about UTF-8

Contributor

How about SQL engines that let the user put anything in a string without checking its encoding? Should they annotate strings with UTF8 even though they cannot guarantee that users will actually store UTF8 in them, should they omit the annotation and thereby lose the information that those binary fields are actually strings, or should there be an encoding value for unknown string encoding? (See also: IMPALA-5982)

Contributor Author

SQL engines that store arbitrary binary shouldn't mark it as UTF-8 because it isn't guaranteed to be. Non-UTF-8 characters would cause exceptions in Java. There are two follow-ups to this:

  1. Start validating that data written by Impala is UTF-8 and add the UTF8 annotation to string columns. Scanning to verify an encoding shouldn't be too expensive.
  2. Make sure that we can add annotations later in the expected schema. This is something we should add tests for in parquet-avro and other object models. It makes sense to me that converting one of these broken Parquet schemas to Avro shouldn't assume that binary columns can be read as Strings or Utf8 objects, but if I try to read them that way by providing a schema with those fields as UTF8, I should be able to. Make sense?
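The validation in follow-up 1 can be sketched in Python (an illustrative helper, not Impala's actual code): scan each binary value and only apply the UTF8 annotation when everything decodes.

```python
# Sketch of validating that binary data is well-formed UTF-8 before
# annotating the column as a string (illustrative helper only).
def is_valid_utf8(raw: bytes) -> bool:
    """True if `raw` decodes as UTF-8 without error."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```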


lekv referenced this pull request Jun 15, 2017
… Arrow

Author: Julien Le Dem <[email protected]>

Closes #45 from julienledem/types and squashes the following commits:

2956b63 [Julien Le Dem] review feedback
94236c4 [Julien Le Dem] PARQUET-757: Bring Parquet logical types to par with Arrow

rdblue commented Aug 13, 2017

Updated and, I think, ready to commit? Any more comments?


lekv commented Sep 13, 2017

+1


rdblue commented Sep 26, 2017

@julienledem, any comments or do you think this is ready to merge?


rdblue commented Oct 6, 2017

@julienledem, can you take one last look? I think this is ready to go in.

rdblue added 6 commits October 6, 2017 17:29
* Removed EmbeddedTypes, made JsonType and BsonType top-level
* Changed IntType bitWidth to a byte
This hasn't been in a release, so there is no need to keep it, even though removing it is a forward-incompatible change.
@rdblue rdblue force-pushed the PARQUET-906-add-timestamp-adjustment-metadata branch from c997ac4 to 02f3868 on October 7, 2017
This fixes the Java test errors and uses UNKNOWN to avoid name
conflicts.
@julienledem julienledem left a comment

+1 LGTM

@julienledem

maybe just review @lekv about the comment readability

@asfgit asfgit closed this in 863875e Oct 10, 2017
asfgit pushed a commit that referenced this pull request Oct 10, 2017
UUIDs are commonly used as unique identifiers. A binary representation will reduce memory when writing or building bloom filters and will reduce cycles needed to compare values.

This commit is based on PARQUET-906 / PR #51.

Author: Ryan Blue <[email protected]>

Closes #71 from rdblue/PARQUET-1125-add-uuid-logical-type and squashes the following commits:

dc01707 [Ryan Blue] PARQUET-1125: Add UUID logical type.
* LogicalType replaces ConvertedType, but ConvertedType is still required
* for some logical types to ensure forward-compatibility in format v1.
*/
10: optional LogicalType logicalType
Contributor

I just noticed that this commit uses camelCase field names while the existing ones are underscore_separated. Was this an intentional naming convention change? In any case, it's too late to do anything about it as this has already been released as 2.4.0.

pitrou added a commit to pitrou/arrow that referenced this pull request Mar 31, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
emkornfield pushed a commit to apache/arrow that referenced this pull request Apr 1, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.

Closes #9863 from pitrou/PARQUET-1990-invalid-converted-type

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>