Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add extra_properties to hive table properties #9475

Closed

Conversation

posulliv
Copy link
Contributor

@posulliv posulliv commented Oct 1, 2021

Fixes #954

@sajjoseph
Copy link
Contributor

sajjoseph commented Oct 2, 2021

This will allow users to specify different hive table properties that are not natively exposed in the current hive connector.
Thanks @posulliv for the contribution.

In addition, #1176 can be closed once this is merged - right?

@posulliv posulliv force-pushed the hive-extra-table-properties branch from 1e628fc to 042aa85 Compare October 4, 2021 15:51
@posulliv
Copy link
Contributor Author

posulliv commented Oct 4, 2021

@sajjoseph good catch on #1176. I had not covered the SHOW CREATE TABLE case. Updated to cover that based on #1176 and added this to the unit tests for extra_properties as well.

@posulliv posulliv force-pushed the hive-extra-table-properties branch from 042aa85 to 79ba52c Compare October 4, 2021 16:40
@dain
Copy link
Member

dain commented Oct 5, 2021

This was proposed a while back, and I don’ think it can be done that way. The issue is to change a property you must do a read/update/write cycle that includes all existing properties, and that is very dangerous as you could drop (or restore) properties.

I'm not sure how you could do this with the existing table property system. I would guess we need to extend this to support properties that look like some_base_name.arbitrary_name, where all of them have the same type. So for this case you would end up with extra_properties.auto_purge... Or maybe replace List<PropertyMetadata<?>> getTableProperties() on Connector with something like:

    /**
     * @return the table properties for this connector
     * @deprecated use getTablePropertyProvider
     */
    default List<PropertyMetadata<?>> getTableProperties()
    {
        return emptyList();
    }

    /**
     * @return the table properties for this connector
     */
    default TablePropertyProvider getTablePropertyProvider()
    {
        return TablePropertyProvider.fixedTableProperties(getTableProperties());
    }

    interface TablePropertyProvider
    {
        Set<String> getKnownTablePropertyNames();

        Optional<PropertyMetadata<?>> getTableProperty(String name);

        static TablePropertyProvider fixedTableProperties(Collection<PropertyMetadata<?>> properties)
        {
            Set<String> knownTablePropertyNames = properties.stream()
                    .map(PropertyMetadata::getName)
                    .collect(Collectors.toUnmodifiableSet());
            Map<String, PropertyMetadata<?>> tableProperties = properties.stream()
                    .collect(Collectors.toUnmodifiableMap(PropertyMetadata::getName, identity()));
            return new TablePropertyProvider()
            {
                @Override
                public Set<String> getKnownTablePropertyNames()
                {
                    return knownTablePropertyNames;
                }

                @Override
                public Optional<PropertyMetadata<?>> getTableProperty(String name)
                {
                    return Optional.ofNullable(tableProperties.get(name));
                }
            };
        }
    }

Of course this would require changing TablePropertyManager also

@posulliv
Copy link
Contributor Author

posulliv commented Oct 5, 2021

thanks for having a look @dain. I mostly opened this PR to get support for the Hive auto.purge table property.

I also opened another PR that just adds that individual table property - #9457

Given this approach will require a lot more changes, do you think the PR that simply adds the auto_purge table property is the way we should go?

@posulliv
Copy link
Contributor Author

posulliv commented Oct 5, 2021

I closed #9457 and will try to apply the feedback to this PR. Thanks again for the comments.

@posulliv
Copy link
Contributor Author

posulliv commented Oct 6, 2021

@dain I see #9523 is now open. I will close this PR.

@posulliv posulliv closed this Oct 6, 2021
@findepi
Copy link
Member

findepi commented Oct 6, 2021

@posulliv i don't think #9523 is the ideal approach. Please keep this open for now.

@findinpath
Copy link
Contributor

@posulliv this separation between Trino properties and "extra properties" may be rather misleading for the end users. How will a user now which property is a Trino property and which property is an arbitrary property?

@findepi
Copy link
Member

findepi commented Mar 8, 2022

@findinpath the separation is actually important. Otherwise, how a user would know whether a property is "supposed to be honored by Trino", and which property is just a "dumb passthru" to the metastore?

@billonahill
Copy link

We're also blocked on lack of Hive table property support when trying to specify JSON timestamp formats, as described in #4538. Do we have consensus on this approach?

@jchoi614
Copy link

FYI We added this patch on top of our Trino fork & deployed to production successfully, specifically for the purpose of specifying JSON timestamp formats. /cc @billonahill

@billonahill
Copy link

Are there plans to merge this branch into the main Trino release? Or are we exploring other ways to support this? As @jchoi614 mentioned, we've applied this and are running in in production, but want to be sure we're in sync with future releases.

@mosabua
Copy link
Member

mosabua commented Oct 28, 2022

👋 @posulliv @findepi @findinpath and others - this PR has become inactive. Could you work together to move this forward?

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@colebow
Copy link
Member

colebow commented Nov 30, 2022

Closing this one out due to inactivity, but please reopen if you would like to pick this back up.

@colebow colebow closed this Nov 30, 2022
@jchoi614
Copy link

jchoi614 commented Feb 3, 2023

Are there plans to merge this branch into the main Trino release? Or are we exploring other ways to support this? As @jchoi614 mentioned, we've applied this and are running in in production, but want to be sure we're in sync with future releases.

Following up on this - should we reopen this PR @billonahill ? I do not see the use of Hive extra properties in the current Trino release & this is a patch that has been working for us in production.

@billonahill
Copy link

Yeah @jchoi614 we should. @colebow can we please re-open this one? We have been running this patch in our production environment for quite some time now and would like to have it part of Trino. Or if there is another way to provide a timestamp format for a given table we can use another another approach.

@findinpath findinpath reopened this Feb 6, 2023
@colebow
Copy link
Member

colebow commented Feb 6, 2023

Yeah @jchoi614 we should. @colebow can we please re-open this one? We have been running this patch in our production environment for quite some time now and would like to have it part of Trino. Or if there is another way to provide a timestamp format for a given table we can use another another approach.

I'm not an authority here, but I'd make sure our PR author (@posulliv) is actually interested in picking this back up and getting it across the finish line when it hasn't been updated in over a year. If not, someone else who's interested should open a new PR with attribution in the commit message. Opening/closing a PR is superficial and only represents the status of the work - but if the author isn't interested in picking it up, it should likely remain closed.

@mosabua
Copy link
Member

mosabua commented Feb 6, 2023

@findinpath since you reopened.. does that mean @findepi and team are going to move this to merge proactively with @posulliv ?

@billonahill
Copy link

@colebow @findepi @mosabua @posulliv we'd be glad to help get this across the finish line. @jchoi614 has a patch that we could put up in a new PR with attribution, or we can clean this one up so it will merge cleanly. Let us know which route you all prefer.

@posulliv posulliv force-pushed the hive-extra-table-properties branch from 79ba52c to 79db187 Compare February 14, 2023 18:55
@posulliv posulliv force-pushed the hive-extra-table-properties branch from 79db187 to 4f2a806 Compare February 14, 2023 21:07
@posulliv posulliv force-pushed the hive-extra-table-properties branch from 4f2a806 to f7bbcde Compare February 15, 2023 16:04
Copy link
Contributor

@huberty89 huberty89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine for me

Comment on lines +692 to +693
if (extraProperties != null) {
properties.put(EXTRA_PROPERTIES, mapJsonCodec(String.class, String.class).fromJson(extraProperties));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if hive/spark or a different system add some additional properties - is it possible for us to capture all of them ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say if we Hive/Spark adds some additional properties - it won't be visible here right ? Since this JsonString - captures only the extra-properties configured via Trino - Is this intentional or can we be a bit open about the properties added by other system ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes that is correct. It was intentional as we didn't have a use case for that. Do we want to add that as part of this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would nice if we could capture those properties also - SHOW CREATE TABLES are kind of queries which allows us to migrate tables from one system to other and we should make sure we capture most of the properties in this case.

@posulliv posulliv force-pushed the hive-extra-table-properties branch 2 times, most recently from bdb92f2 to f5920d2 Compare February 22, 2023 21:21
@findinpath
Copy link
Contributor

findinpath commented Feb 27, 2023

I was looking at https://trino.io/docs/current/sql/alter-table.html#set-properties documentation and as @findepi mentioned in #9475 (comment) Trino has its own table properties that it has to honour and that's why we can't really use this syntax for doing pass throughs for arbitrary properties.

The current PR solves partly this problem by adding the "extra_properties" workaround. However it is missing the fact that the users may add through Spark arbitrary properties which are not visible then in Trino.

Nevertheless, users need a way to pass through arbitrary properties so that they can land on the metastore.

An alternative idea would be to provide system procedures to allow the user to perform such operations add/remove/list arbitrary table operations. This path doesn't involve any hacking on the SET PROPERTIES syntax. The user may change reserved properties this way - maybe adding a boundary (saying NO to changing properties which can be changed through SET PROPERTIES syntax) in this case would make sense.

@findepi
Copy link
Member

findepi commented Feb 28, 2023

The current PR solves partly this problem by adding the "extra_properties" workaround. However it is missing the fact that the users may add through Spark arbitrary properties which are not visible then in Trino.

extra_properties would expose all properties not understood by Trino, including those set after table creation

An alternative idea would be to provide system procedures to allow the user to perform such operations add/remove/list arbitrary table operations.

I think I am fine having that but I am concerned about users being able to shoot themselves in the foot.
But maybe that's OK.

@Praveen2112
Copy link
Member

An alternative idea would be to provide system procedures to allow the user to perform such operations add/remove/list arbitrary table operations.

But won't it affect the overall UX it would be like creating a table and then invoking the procedure to add/update custom property.

extra_properties would expose all properties not understood by Trino

But doesn't this allow us to migrate the table schema and related properties from one HMS to a different HMS. (One of the use case of SHOW CREATE ..)

Otherwise, how a user would know whether a property is "supposed to be honored by Trino", and which property is just a "dumb passthru" to the metastore?

I think we filter out the properties which are understood and honored by Trino and remaining properties could be exposed as extra_properties.

")\n" +
"WITH (\n" +
" extra_properties = map_from_entries(ARRAY[ROW('extra.property', 'true')]),\n" +
" format = 'ORC'\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying out this feature locally and this is what I see:

trino:default> show create table test1;
                                 Create Table                                  
-------------------------------------------------------------------------------
 CREATE TABLE hive.default.test1 (                                             
    c1 integer                                                                 
 )                                                                             
 WITH (                                                                        
    extra_properties = map_from_entries(ARRAY[ROW('extra.property', 'true')]), 
    format = 'ORC'                                                             
 )                                                                             
(1 row)
[metastore]> select * from TABLE_PARAMS where tbl_id = 56;
+--------+--------------------------------+---------------------------------------------+
| TBL_ID | PARAM_KEY                      | PARAM_VALUE                                 |
+--------+--------------------------------+---------------------------------------------+
|     56 | STATS_GENERATED_VIA_STATS_TASK | workaround for potential lack of HIVE-12730 |
|     56 | auto.purge                     | false                                       |
|     56 | extra.property                 | true                                        |
|     56 | extra_properties               | {
  "extra.property" : "true"
}             |
|     56 | numFiles                       | 0                                           |
|     56 | numRows                        | 0                                           |
|     56 | presto_query_id                | 20230227_195136_00008_skeew                 |
|     56 | presto_version                 | 406-258-gf5920d2                            |
|     56 | rawDataSize                    | 0                                           |
|     56 | totalSize                      | 0                                           |
|     56 | transient_lastDdlTime          | 1677527498                                  |
+--------+--------------------------------+---------------------------------------------+

There are a few properties missing in the show create table output.
Is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath are you asking why the additional hive table properties not supported by trino are not included in the extra properties in the show create table output?

If so, yes that was intentional. This PR was to only support extra properties that are specified explicitly by the user when the table is created through trino.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about the context where users do update from different engines the table properties.
Being limited to a number of "extra" properties defined up-front seems rather limiting to me.

Do note that a similar UX experience will need to be offered for Iceberg, Delta as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath got it. This PR was intentionally limited to the hive connector and wasn't intended to deal with other query engines.

Do you think that could be done as a follow up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that could be done as a follow up?

Definitely.
Let's first settle on how to do it in Hive and implementing the functionality in other connectors will eventually follow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR was to only support extra properties that are specified explicitly by the user when the table is created through trino.

Can we extend this functionality in this PR ? One more additional concern here is that we add an additional properties extra_properties in Hive table which can be overriden by other engines. We could skip some of the properties like STATS_GENERATED_VIA_STATS_TASK, presto_query_id and presto_version

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CREATED_BY and TRINO_VIEW_FLAG

.isInstanceOf(QueryFailedException.class)
.hasMessage("Multiple entries with same key: transactional=true and transactional=false");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NULL property values cause failures.

Query 20230227_200050_00010_skeew failed: null value in entry: extra.property=null
java.lang.NullPointerException: null value in entry: extra.property=null
	at com.google.common.collect.CollectPreconditions.checkEntryNotNull(CollectPreconditions.java:33)
	at com.google.common.collect.ImmutableMapEntry.<init>(ImmutableMapEntry.java:54)
	at com.google.common.collect.ImmutableMap.entryOf(ImmutableMap.java:339)
	at com.google.common.collect.ImmutableMap$Builder.put(ImmutableMap.java:449)
	at io.trino.plugin.hive.HiveMetadata.getEmptyTableProperties(HiveMetadata.java:1106)
	at io.trino.plugin.hive.HiveMetadata.createTable(HiveMetadata.java:934)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.createTable(ClassLoaderSafeConnectorMetadata.java:383)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath are you asking me to add a test that verifies the error when null property values are used? Or to do better error checking? Just want to be sure I understand the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's first check whether setting NULL is valid. Check whether hive does accept setting properties NULL values.
In any case, the error thrown to the user should not be NPE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive will accept null as a string literal:

: jdbc:hive2://localhost:10000/default> create table t2(c1 int) tblproperties('extra.property'='null');
0: jdbc:hive2://localhost:10000/default>

The properties are then:

: jdbc:hive2://localhost:10000/default> show create table t2;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `t2`(                                 |
|   `c1` int)                                        |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://hadoop-master:9000/user/hive/warehouse/t2' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'extra.property'='null',                         |
|   'transient_lastDdlTime'='1677618343')            |
+----------------------------------------------------+
14 rows selected (0.065 seconds)
0: jdbc:hive2://localhost:10000/default>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked also on my own

0: jdbc:hive2://localhost:10000/default> create table t2(c1 int) tblproperties('extra.property'=null);
Error: Error while compiling statement: FAILED: ParseException line 1:55 mismatched input 'null' expecting StringLiteral near '=' in specifying key/value property (state=42000,code=40000)

Specifying NULL can't be used in Hive as well as a table property.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of above null is represented as a String.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought @findinpath was testing with null as a string for an extra property which is what I tested but looks like he was not.

What changes do you want to see in the PR based on this discussion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a check if the property values are null, if null then we need to throw an new TrinoException(INVALID_TABLE_PROPERTY...

@findinpath
Copy link
Contributor

But doesn't this allow us to migrate the table schema and related properties from one HMS to a different HMS. (One of the use case of SHOW CREATE ..)

@Praveen2112 even if we specify some "extra" properties in the CREATE TABLE statement, the user will still be a bit confused because not all the properties will appear as expected in the SHOW CREATE TABLE statement.
See in #9475 (comment) presto_query_id, presto_version, etc properties which are created anyway on the fly.

We need to cover probably the need to set arbitrary properties when the table is being created. I don't think though that we need to showcase the "extra" properties in the SHOW CREATE TABLE.

@hashhar
Copy link
Member

hashhar commented Jun 15, 2023

Obsoleted by #17172

@hashhar hashhar closed this Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

Add Hive table property to for arbitrary properties