-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2042: Add support for unwrapping common Protobuf wrappers and… #900
Conversation
@mwong38, can you put more information in the Jira on why/what is changed? This is pretty big change and it would help people to review your code. |
@shangxinli @mwong38 any way I can help with merge? Currently, proto-to-parquet conversions do not support proper timestamp handling, and it looks like this work addresses the issue. Let me know. |
After proto3 made everything optional, there is no way to know whether a primitive has been set or not. That is, you could no longer represent a "nullable" primitive. (They later brought Parquet supports nullable primitives natively. If we were to represent these well-known Protobuf types directly, it will be nested inside a deeper data structure; very inconvenient and a waste of space. What I did here is "unwrap" these primitives and make use of Parquet's nullaibility. Additionally, I convert other well-known types such as Timestamp and Date to Parquet's. Again, rather than represent the data structure in its raw nested form (seconds/nanos or year, month, day, etc), they are converted to Parquet's native or logical representation of Timestamp and Date. You can turn on or off this features in the |
@shangxinli is there any reason this work shouldn't be merged, given @mwong38 explanation? Is there something I can help with? |
I don't see a reason why this cannot be merge. I will have a look soon. |
Can you squash all the commits? |
@Override | ||
public void addInt(int value) { | ||
LocalDate localDate = LocalDate.ofEpochDay(value); | ||
com.google.type.Date date = com.google.type.Date.newBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this particular case it's more clear to be explicit that this is the Google Protobuf Date, rather than, say, the Java built-in Date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no collision on the imports, we should use imports generally.
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
Outdated
Show resolved
Hide resolved
@mwong38 let me know if you want me to help in any way |
I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged? @belugabehr might have thoughts too. |
1 similar comment
I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged? @belugabehr might have thoughts too. |
@shangxinli any news about merging this version? Are there still any blockers? |
Will have another look soon. |
if (unwrapProtoWrappers) { | ||
String typeName = descriptor.getMessageType().getFullName(); | ||
if (typeName.equals(PROTOBUF_TIMESTAMP_TYPE)) { | ||
return builder.primitive(INT64, getRepetition(descriptor)).as(timestampType(true, TimeUnit.NANOS)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should TimeUnit be something configurable? I know protobuf timestamps support nanosecond resolution but might not be what applications want to store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's worth complicating the API. The Timestamp common Proto stores time in nanoseconds. There's no good reason to deviate from that or to truncate the resolution. If the user wishes to do more manipulation, it can be done downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, you can use the default 'TimeUnit.NANOS' while having that configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that outside the scope of ProtoSchemaConverter/ParquetWriter? I don't think it should go into the business of doing transformations. If the point of ProtoSchemaConverter
is to give the closest representation of the Protobuf object in Parquet, then it should be NANOS and nothing more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Do you still have a comment on this?
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java
Show resolved
Hide resolved
@@ -97,6 +127,46 @@ public MessageType convert(Class<? extends Message> protobufClass) { | |||
|
|||
private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) { | |||
if (descriptor.getJavaType() == JavaType.MESSAGE) { | |||
if (unwrapProtoWrappers) { | |||
String typeName = descriptor.getMessageType().getFullName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you are using type equality vs descriptor equality? (I haven't checked but I would think .Equals would work? Maybe add a comment if this is intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I'll change it to compare the Descriptor
directly to the Wrapper's Descriptor rather than the name.
@mwong38 Can you address the feedback from @emkornfield before we can merge? |
@mwong38 anyway I can help with finalizing this PR? |
I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge. |
9bf1961
to
9672ee7
Compare
Done |
@shangxinli any reason why this PR hasn't been merged yet? |
any reason why this PR hasn't been merged yet? |
I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks! |
… logical Timestamps, Date, TimeOfDay
9672ee7
to
95b45d3
Compare
Done. I really hope it's the last time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mwong38!
@shangxinli @emkornfield I see you have reviewed this PR and there is no more comment to resolve. Do you need to take a look for another pass?
Merged |
… logical Timestamps, Date, TimeOfDay
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation