Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2042: Add support for unwrapping common Protobuf wrappers and… #900

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

mwong38
Copy link
Contributor

@mwong38 mwong38 commented May 3, 2021

… logical Timestamps, Date, TimeOfDay

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@shangxinli
Copy link
Contributor

@mwong38, can you put more information in the Jira on why/what is changed? This is pretty big change and it would help people to review your code.

@sheinbergon
Copy link

@shangxinli @mwong38 any way I can help with merge? Currently, proto-to-parquet conversions do not support proper timestamp handling, and it looks like this work addresses the issue.

Let me know.

@mwong38
Copy link
Contributor Author

mwong38 commented Mar 6, 2022

After proto3 made everything optional, there is no way to know whether a primitive has been set or not. That is, you could no longer represent a "nullable" primitive. (They later brought optional keyword back, but the damage is done). The solution was to wrap a primitive in a message. For example, to represent a nullable double, there is the DoubleValue. It's analogous to Java's boxed Double.

Parquet supports nullable primitives natively. If we were to represent these well-known Protobuf types directly, it will be nested inside a deeper data structure; very inconvenient and a waste of space. What I did here is "unwrap" these primitives and make use of Parquet's nullaibility.

Additionally, I convert other well-known types such as Timestamp and Date to Parquet's. Again, rather than represent the data structure in its raw nested form (seconds/nanos or year, month, day, etc), they are converted to Parquet's native or logical representation of Timestamp and Date. You can turn on or off this features in the ProtoSchemaConverter constructor.

@sheinbergon
Copy link

sheinbergon commented Mar 6, 2022

@shangxinli is there any reason this work shouldn't be merged, given @mwong38 explanation? Is there something I can help with?

@shangxinli
Copy link
Contributor

I don't see a reason why this cannot be merge. I will have a look soon.

@shangxinli
Copy link
Contributor

Can you squash all the commits?

@Override
public void addInt(int value) {
LocalDate localDate = LocalDate.ofEpochDay(value);
com.google.type.Date date = com.google.type.Date.newBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use import?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this particular case it's more clear to be explicit that this is the Google Protobuf Date, rather than, say, the Java built-in Date.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no collision on the imports, we should use imports generally.

@sheinbergon
Copy link

@mwong38 let me know if you want me to help in any way

@dossett
Copy link
Contributor

dossett commented Mar 22, 2022

I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?

@belugabehr might have thoughts too.

1 similar comment
@dossett
Copy link
Contributor

dossett commented Mar 22, 2022

I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?

@belugabehr might have thoughts too.

@sheinbergon
Copy link

@shangxinli any news about merging this version? Are there still any blockers?

@shangxinli
Copy link
Contributor

Will have another look soon.

if (unwrapProtoWrappers) {
String typeName = descriptor.getMessageType().getFullName();
if (typeName.equals(PROTOBUF_TIMESTAMP_TYPE)) {
return builder.primitive(INT64, getRepetition(descriptor)).as(timestampType(true, TimeUnit.NANOS));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should TimeUnit be something configurable? I know protobuf timestamps support nanosecond resolution but might not be what applications want to store.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth complicating the API. The Timestamp common Proto stores time in nanoseconds. There's no good reason to deviate from that or to truncate the resolution. If the user wishes to do more manipulation, it can be done downstream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, you can use the default 'TimeUnit.NANOS' while having that configured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that outside the scope of ProtoSchemaConverter/ParquetWriter? I don't think it should go into the business of doing transformations. If the point of ProtoSchemaConverter is to give the closest representation of the Protobuf object in Parquet, then it should be NANOS and nothing more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Do you still have a comment on this?

@@ -97,6 +127,46 @@ public MessageType convert(Class<? extends Message> protobufClass) {

private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
if (descriptor.getJavaType() == JavaType.MESSAGE) {
if (unwrapProtoWrappers) {
String typeName = descriptor.getMessageType().getFullName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are using type equality vs descriptor equality? (I haven't checked but I would think .Equals would work? Maybe add a comment if this is intentional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I'll change it to compare the Descriptor directly to the Wrapper's Descriptor rather than the name.

@shangxinli
Copy link
Contributor

@mwong38 Can you address the feedback from @emkornfield before we can merge?

@sheinbergon
Copy link

@mwong38 anyway I can help with finalizing this PR?

@shangxinli
Copy link
Contributor

I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.

@mwong38
Copy link
Contributor Author

mwong38 commented Sep 4, 2022

I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.

Done

@sheinbergon
Copy link

@shangxinli any reason why this PR hasn't been merged yet?

@zhaolihe
Copy link

any reason why this PR hasn't been merged yet?

@wgtmac
Copy link
Member

wgtmac commented Oct 18, 2023

I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!

@mwong38
Copy link
Contributor Author

mwong38 commented Nov 26, 2023

I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!

Done. I really hope it's the last time.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mwong38!

@shangxinli @emkornfield I see you have reviewed this PR and there is no more comment to resolve. Do you need to take a look for another pass?

@shangxinli shangxinli merged commit c8487c7 into apache:master Nov 27, 2023
9 checks passed
@shangxinli
Copy link
Contributor

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants