Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add extra_properties to hive table properties #16846

Conversation

Praveen2112
Copy link
Member

Description

This PR allows us to pass through additional properties to Hive when creating table in Trino.
The additional properties can provided in the following format

extra_properties = map_from_entries(ARRAY[ROW('extra.property.one', 'one'),ROW('extra.property.two', 'two')]

Additional context and related issues

Fixes #954
Extension of #9475

Release notes

(x) Release notes are required, with the following suggested text:

# Section
* Introduce `extra_properties` for adding arbitrary properties to hive table. ({issue}`954`)

Comment on lines 380 to 393
private static final Set<String> TABLE_PROPERTIES_TO_SKIP = ImmutableSet.of(
PRESTO_VERSION_NAME,
PRESTO_QUERY_ID_NAME,
BUCKETING_VERSION,
"EXTERNAL",
"numFiles",
"totalSize",
"last_modified_time",
"transient_lastDdlTime",
"last_modified_by",
"STATS_GENERATED_VIA_STATS_TASK",
"COLUMN_STATS_ACCURATE",
"numRows",
"rawDataSize");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we introduce another field in Table which captures the these properties to be filtered ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or can we have inject these properties via Guice ? So in future each metastore module can have its own bunch of properties to be filtered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against the Guice option for now. Let's keep it simple

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud. Wouldn't it be better/safer to whitelist rather than blacklist ?
I mean list properties we allow to read (as we know them and expect what they bring) rather that list properties to skip, but if we forget something or something new would be introduced we will blindly allow it ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how would we handle them if an arbitrary properties is added by a different system like hive or spark then we might not be able to list them.

@github-actions github-actions bot added hive Hive connector tests:hive labels Apr 3, 2023
@Praveen2112 Praveen2112 requested a review from huberty89 April 4, 2023 06:22
false),
new PropertyMetadata<>(
EXTRA_PROPERTIES,
"Extra table properties",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra -> Arbitrary

I'd argue that arbitrary is a better name than extra.

Something that's arbitrary seems like it's chosen at random instead of following a consistent rule.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I agree with "arbitrary", because these properties presumably have meaning- just not one that Trino understands. Maybe... "additional_properties"? It's pretty close to "extra_properties", but maybe slightly clearer than "extra". Naming is hard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about additional_properties - WDYT @findinpath ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that additional is in the very same category as extra.

these properties presumably have meaning- just not one that Trino understands.

Definitely, that's why actually arbitrary is definitely not the best choice.

From Chat GPT

A possible word for a non-canonical property name could be "unconventional" or "non-standard". Another word that could be used is "ad-hoc", which implies that the property name was created on a temporary or as-needed basis, rather than being part of a predefined standard or convention. Additionally, the term "custom" could also be used to describe a non-canonical property name, as it suggests that the name was specifically tailored to a particular use case or application.

From the answer above, we have a few candidates:

  • unconventional
  • ad hoc
  • custom

I tend to go with ad_hoc_properties - it matches IMO best the purpose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on offliine discussion we finalized the property name to extra_properties

return table;
}

public Map<String, String> getUnconsumedProperty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getUnconsumedProperty -> getUnconsumedProperties

@Praveen2112 Praveen2112 changed the title Add extra_properties to hive table properties Add arbitrary_properties to hive table properties Apr 4, 2023
Comment on lines 569 to 570
if (value instanceof Map) {
Map<?, ?> map = (Map<?, ?>) value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (value instanceof Map) {
Map<?, ?> map = (Map<?, ?>) value;
if (value instanceof Map<?, ?> map) {

@Test
public void testExtraProperties()
{
String tableName = format("%s.%s.test_extra_properties", getSession().getCatalog().get(), getSession().getSchema().get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is support for setting extra properties a follow-up?

alter table test3 set properties extra_properties =  map_from_entries(ARRAY[ROW('extra2', 'hello'),ROW('extra1', 'world')]);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But hive doesn't support altering table properties right ? i.e HiveMetadata doesn't implement setTableProperties yet. Or did I miss something here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as i remember we can only set properties during table create.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It would be as a follow-up once we have this functionality for hive connector.

@@ -1157,6 +1181,14 @@ else if (avroSchemaLiteral != null) {
tableProperties.put("numFiles", "-1");
tableProperties.put("totalSize", "-1");

// Extra properties
Map<String, String> extraProperties = getExtraProperties(tableMetadata.getProperties());
if (extraProperties != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check as well if the map is not empty.

// Extra properties
Map<String, String> extraProperties = getExtraProperties(tableMetadata.getProperties());
if (extraProperties != null) {
for (Map.Entry<String, String> extraProperty : extraProperties.entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trino:default> create table test4(x integer) with (extra_properties= MAP(ARRAY['auto.purge'], ARRAY['true']));
Query 20230404_084054_00019_fr53z failed: Multiple entries with same key: auto.purge=true and auto.purge=false
java.lang.IllegalArgumentException: Multiple entries with same key: auto.purge=true and auto.purge=false
	at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:377)
	at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:371)
	at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:241)
	at com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:132)
	at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:94)
	at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:573)
	at com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:601)
	at io.trino.plugin.hive.HiveMetadata.getEmptyTableProperties(HiveMetadata.java:1198)

We could improve the error message to signal to the user that the statement is making use of an illegal parameter.
Pls add a corresponding test

assertEquals(getOnlyElement(actualResult.getOnlyColumnAsSet()), expectedShowCreateTableSql);
assertUpdate("DROP TABLE " + tableName);
}

Copy link
Contributor

@findinpath findinpath Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trino:default> create table test10( 
            ->    name varchar(10), 
            ->    comment varchar(10), 
            ->    nationkey bigint, 
            ->    regionkey bigint, 
            ->    short_name varchar(100) 
            ->      with (partition_projection_type = 'enum', partition_projection_values = array['PL1', 'CZ1'])
            -> )
            -> with (
            ->     partitioned_by = array['short_name'], 
            ->     partition_projection_enabled = true, 
            ->     extra_properties = map_from_entries(ARRAY[ROW('extra2', 'hello'),ROW('extra1', 'world')]));
CREATE TABLE
trino:default> show create table test10;
                                                                                      Create Table                                                                                      
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CREATE TABLE hive.default.test10 (                                                                                                                                                     
    name varchar(10),                                                                                                                                                                   
    comment varchar(10),                                                                                                                                                                
    nationkey bigint,                                                                                                                                                                   
    regionkey bigint,                                                                                                                                                                   
    short_name varchar(100)                                                                                                                                                             
 )                                                                                                                                                                                      
 WITH (                                                                                                                                                                                 
    extra_properties = map_from_entries(ARRAY[ROW('extra2', 'hello'),ROW('projection.short_name.type', 'enum'),ROW('extra1', 'world'),ROW('projection.short_name.values', 'PL1,CZ1')]), 
    format = 'ORC',                                                                                                                                                                     
    partition_projection_enabled = true,                                                                                                                                                
    partitioned_by = ARRAY['short_name']                                                                                                                                                
 )                                                                                                                                                                                      
(1 row)

Is this intended to have the partition projection related properties displayed this way ?

cc @aczajkowski

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be very cautious here as it was a conscious decision to not blindly present all properties as table properties.
As in case of partition projection in Hive some properties are presented as table properties even they are pointing to columns. Which creates a lot of potential issues (e.g dropping column, keeping property etc..).
Thats why we decided to keep them / present as table column properties.
So seems there should be no problem, as they will be just additionally presented in extra_properties but it might bring confusion here as they will be named differently as well.
CC: @findepi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up show create table issue related to partition projection #16873

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So seems there should be no problem, as they will be just additionally presented in extra_properties

they shouldn't be double-presented. if these show up as column properties, they shouldn't show up as table properties.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now partition projection related properties displayed as extra properties - handling them for now would be a bit tricky - It will be resolved when we fix #16873 - as the partition projection property will be fetched from TableParameterProvider and it tracks the list of consumed properties.

What is your opinion @findinpath

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is fixing #16873 something that you're considering doing @Praveen2112 ?
I'm thinking it is worth taking a stab at it just to get more knowledge of eventual corner cases which are not yet easily visible and are relevant for this feature.

String targetPropertyKey,
Function<I, V> valueMapper)
{
Optional.ofNullable(sourcePropertyProvider.apply(sourcePropertyKey))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I believe making optional to simply check if something is not null should be replaced with a simple if for readability. But this has not been changed here, so it's ok for me to leave it

if (value instanceof Map) {
Map<?, ?> map = (Map<?, ?>) value;
return new FunctionCall(QualifiedName.of("map_from_entries"), ImmutableList.of(new Array(map.entrySet().stream()
.map(entry -> new Row(asList(toExpression(entry.getKey()), toExpression(entry.getValue()))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asList -> ImmutableList.of

Comment on lines 380 to 393
private static final Set<String> TABLE_PROPERTIES_TO_SKIP = ImmutableSet.of(
PRESTO_VERSION_NAME,
PRESTO_QUERY_ID_NAME,
BUCKETING_VERSION,
"EXTERNAL",
"numFiles",
"totalSize",
"last_modified_time",
"transient_lastDdlTime",
"last_modified_by",
"STATS_GENERATED_VIA_STATS_TASK",
"COLUMN_STATS_ACCURATE",
"numRows",
"rawDataSize");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against the Guice option for now. Let's keep it simple

{
return getSerdeProperty(table, key).map(csvSerdeProperty -> csvSerdeProperty.substring(0, 1));
return getSerdeProperty(tableParameterProvider, key).map(csvSerdeProperty -> csvSerdeProperty.substring(0, 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we certain than csvSerdeProperty is not empty? If it is we are having an exception here

public static class TableParameterProvider
{
private final Table table;
private final Set<String> consumedParameters = new HashSet<>(TABLE_PROPERTIES_TO_SKIP);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableSet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set is per-instance and mutated as part of getParameter(String key)

")",
trinoTableName,
expectedTrinoTableFormat)));
Assertions.assertThat((String) onTrino().executeQuery("SHOW CREATE TABLE " + trinoTableName).getOnlyValue())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import assertThat

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is alternative QueryAssert.assertThat used in the PTs.

rewriteProperty(sourceProperties::get, targetPropertiesBuilder, sourcePropertyKey, targetPropertyKey, valueMapper);
}

private <I, V> void rewriteProperty(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this method (and the pre-existing method this is overloading) could both be static.

public static class TableParameterProvider
{
private final Table table;
private final Set<String> consumedParameters = new HashSet<>(TABLE_PROPERTIES_TO_SKIP);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set is per-instance and mutated as part of getParameter(String key)


public Map<String, String> getUnconsumedProperty()
{
return table.getParameters().entrySet().stream()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be tracked incrementally as part of getParameter(String key) by taking an initial snapshot of table.getProperties() in the constructor, removing each property that gets consumed, and just returning the remaining map entries at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we need to use the same properties at multiple places then we might not be able to handle them.

false),
new PropertyMetadata<>(
EXTRA_PROPERTIES,
"Extra table properties",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I agree with "arbitrary", because these properties presumably have meaning- just not one that Trino understands. Maybe... "additional_properties"? It's pretty close to "extra_properties", but maybe slightly clearer than "extra". Naming is hard.

@Praveen2112 Praveen2112 changed the title Add arbitrary_properties to hive table properties Add extra_properties to hive table properties Apr 18, 2023
- Remove unused methods
- Make rewriteProperty static
- Extract rewriteProperty to consume `Function<String, I>`
@Praveen2112 Praveen2112 force-pushed the praveen/hive-extra-table-properties-2 branch from dbdb0ae to b57a6ee Compare April 20, 2023 14:27
@Praveen2112
Copy link
Member Author

@findinpath / @pettyjamesm AC

"COLUMN_STATS_ACCURATE",
"numRows",
"rawDataSize",
"numFilesErasureCoded");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this property get produced ? I'm not aware of it.

@@ -3846,4 +3892,33 @@ public boolean supportsReportingWrittenBytes(ConnectorSession session, SchemaTab
{
return true;
}

public static class TableParameterProvider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extract this class from HiveMetadata.

return table;
}

public Map<String, String> getUnconsumedProperties()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getUnconsumedProperties -> getExtraProperties

public Map<String, String> getUnconsumedProperties()
{
return table.getParameters().entrySet().stream()
.filter(entry -> !consumedParameters.contains(entry.getKey()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we lowercase the values entered in and also the values used for checking here just to be on the safe side?

return table.getParameters().get(key);
}

public Table getTable()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should likely include the logic of io.trino.plugin.hive.HiveMetadata#getSerdeProperty and should not expose getTable()

@@ -90,13 +91,12 @@ public PartitionProjectionService(
this.projectionFactories = ImmutableMap.copyOf(requireNonNull(projectionFactories, "projectionFactories is null"));
}

public Map<String, Object> getPartitionProjectionTrinoTableProperties(Table table)
public Map<String, Object> getPartitionProjectionTrinoTableProperties(HiveMetadata.TableParameterProvider tableParameterProvider)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a static method.

I'd argue that we should bring this method in HiveMetadata to have all TableParameterProvider handling in one place.

@@ -173,7 +177,25 @@ public HiveTableProperties(
PARTITION_PROJECTION_LOCATION_TEMPLATE,
"Partition projection location template",
null,
false));
false),
new PropertyMetadata<>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docs to hive.rst or consider creating a follow-up docs issue for this.

assertEquals(getOnlyElement(actualResult.getOnlyColumnAsSet()), expectedShowCreateTableSql);
assertUpdate("DROP TABLE " + tableName);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is fixing #16873 something that you're considering doing @Praveen2112 ?
I'm thinking it is worth taking a stab at it just to get more knowledge of eventual corner cases which are not yet easily visible and are relevant for this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

Add Hive table property to for arbitrary properties
7 participants