-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add extra_properties to hive table properties #16846
Add extra_properties to hive table properties #16846
Conversation
private static final Set<String> TABLE_PROPERTIES_TO_SKIP = ImmutableSet.of( | ||
PRESTO_VERSION_NAME, | ||
PRESTO_QUERY_ID_NAME, | ||
BUCKETING_VERSION, | ||
"EXTERNAL", | ||
"numFiles", | ||
"totalSize", | ||
"last_modified_time", | ||
"transient_lastDdlTime", | ||
"last_modified_by", | ||
"STATS_GENERATED_VIA_STATS_TASK", | ||
"COLUMN_STATS_ACCURATE", | ||
"numRows", | ||
"rawDataSize"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce another field in Table
which captures the these properties to be filtered ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or can we have inject these properties via Guice ? So in future each metastore module can have its own bunch of properties to be filtered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm against the Guice option for now. Let's keep it simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking out loud. Wouldn't it be better/safer to whitelist rather than blacklist ?
I mean list properties we allow to read (as we know them and expect what they bring) rather that list properties to skip, but if we forget something or something new would be introduced we will blindly allow it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how would we handle them if an arbitrary properties is added by a different system like hive or spark then we might not be able to list them.
false), | ||
new PropertyMetadata<>( | ||
EXTRA_PROPERTIES, | ||
"Extra table properties", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra -> Arbitrary
I'd argue that arbitrary is a better name than extra.
Something that's arbitrary seems like it's chosen at random instead of following a consistent rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I agree with "arbitrary", because these properties presumably have meaning- just not one that Trino understands. Maybe... "additional_properties"? It's pretty close to "extra_properties", but maybe slightly clearer than "extra". Naming is hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about additional_properties
- WDYT @findinpath ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that additional is in the very same category as extra.
these properties presumably have meaning- just not one that Trino understands.
Definitely, that's why actually arbitrary is definitely not the best choice.
From Chat GPT
A possible word for a non-canonical property name could be "unconventional" or "non-standard". Another word that could be used is "ad-hoc", which implies that the property name was created on a temporary or as-needed basis, rather than being part of a predefined standard or convention. Additionally, the term "custom" could also be used to describe a non-canonical property name, as it suggests that the name was specifically tailored to a particular use case or application.
From the answer above, we have a few candidates:
- unconventional
- ad hoc
- custom
I tend to go with ad_hoc_properties
- it matches IMO best the purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on offliine discussion we finalized the property name to extra_properties
return table; | ||
} | ||
|
||
public Map<String, String> getUnconsumedProperty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getUnconsumedProperty
-> getUnconsumedProperties
if (value instanceof Map) { | ||
Map<?, ?> map = (Map<?, ?>) value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (value instanceof Map) { | |
Map<?, ?> map = (Map<?, ?>) value; | |
if (value instanceof Map<?, ?> map) { |
@Test | ||
public void testExtraProperties() | ||
{ | ||
String tableName = format("%s.%s.test_extra_properties", getSession().getCatalog().get(), getSession().getSchema().get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is support for setting extra properties a follow-up?
alter table test3 set properties extra_properties = map_from_entries(ARRAY[ROW('extra2', 'hello'),ROW('extra1', 'world')]);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But hive doesn't support altering table properties right ? i.e HiveMetadata
doesn't implement setTableProperties
yet. Or did I miss something here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as i remember we can only set properties during table create.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It would be as a follow-up once we have this functionality for hive connector.
@@ -1157,6 +1181,14 @@ else if (avroSchemaLiteral != null) { | |||
tableProperties.put("numFiles", "-1"); | |||
tableProperties.put("totalSize", "-1"); | |||
|
|||
// Extra properties | |||
Map<String, String> extraProperties = getExtraProperties(tableMetadata.getProperties()); | |||
if (extraProperties != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check as well if the map is not empty.
// Extra properties | ||
Map<String, String> extraProperties = getExtraProperties(tableMetadata.getProperties()); | ||
if (extraProperties != null) { | ||
for (Map.Entry<String, String> extraProperty : extraProperties.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trino:default> create table test4(x integer) with (extra_properties= MAP(ARRAY['auto.purge'], ARRAY['true']));
Query 20230404_084054_00019_fr53z failed: Multiple entries with same key: auto.purge=true and auto.purge=false
java.lang.IllegalArgumentException: Multiple entries with same key: auto.purge=true and auto.purge=false
at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:377)
at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:371)
at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:241)
at com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:132)
at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:94)
at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:573)
at com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:601)
at io.trino.plugin.hive.HiveMetadata.getEmptyTableProperties(HiveMetadata.java:1198)
We could improve the error message to signal to the user that the statement is making use of an illegal parameter.
Pls add a corresponding test
assertEquals(getOnlyElement(actualResult.getOnlyColumnAsSet()), expectedShowCreateTableSql); | ||
assertUpdate("DROP TABLE " + tableName); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trino:default> create table test10(
-> name varchar(10),
-> comment varchar(10),
-> nationkey bigint,
-> regionkey bigint,
-> short_name varchar(100)
-> with (partition_projection_type = 'enum', partition_projection_values = array['PL1', 'CZ1'])
-> )
-> with (
-> partitioned_by = array['short_name'],
-> partition_projection_enabled = true,
-> extra_properties = map_from_entries(ARRAY[ROW('extra2', 'hello'),ROW('extra1', 'world')]));
CREATE TABLE
trino:default> show create table test10;
Create Table
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CREATE TABLE hive.default.test10 (
name varchar(10),
comment varchar(10),
nationkey bigint,
regionkey bigint,
short_name varchar(100)
)
WITH (
extra_properties = map_from_entries(ARRAY[ROW('extra2', 'hello'),ROW('projection.short_name.type', 'enum'),ROW('extra1', 'world'),ROW('projection.short_name.values', 'PL1,CZ1')]),
format = 'ORC',
partition_projection_enabled = true,
partitioned_by = ARRAY['short_name']
)
(1 row)
Is this intended to have the partition projection related properties displayed this way ?
cc @aczajkowski
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be very cautious here as it was a conscious decision to not blindly present all properties as table properties.
As in case of partition projection in Hive some properties are presented as table properties even they are pointing to columns. Which creates a lot of potential issues (e.g dropping column, keeping property etc..).
Thats why we decided to keep them / present as table column properties.
So seems there should be no problem, as they will be just additionally presented in extra_properties
but it might bring confusion here as they will be named differently as well.
CC: @findepi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up show create table
issue related to partition projection #16873
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So seems there should be no problem, as they will be just additionally presented in
extra_properties
they shouldn't be double-presented. if these show up as column properties, they shouldn't show up as table properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now partition projection related properties displayed as extra properties - handling them for now would be a bit tricky - It will be resolved when we fix #16873 - as the partition projection property will be fetched from TableParameterProvider
and it tracks the list of consumed properties.
What is your opinion @findinpath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is fixing #16873 something that you're considering doing @Praveen2112 ?
I'm thinking it is worth taking a stab at it just to get more knowledge of eventual corner cases which are not yet easily visible and are relevant for this feature.
String targetPropertyKey, | ||
Function<I, V> valueMapper) | ||
{ | ||
Optional.ofNullable(sourcePropertyProvider.apply(sourcePropertyKey)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I believe making optional to simply check if something is not null should be replaced with a simple if for readability. But this has not been changed here, so it's ok for me to leave it
if (value instanceof Map) { | ||
Map<?, ?> map = (Map<?, ?>) value; | ||
return new FunctionCall(QualifiedName.of("map_from_entries"), ImmutableList.of(new Array(map.entrySet().stream() | ||
.map(entry -> new Row(asList(toExpression(entry.getKey()), toExpression(entry.getValue())))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asList -> ImmutableList.of
private static final Set<String> TABLE_PROPERTIES_TO_SKIP = ImmutableSet.of( | ||
PRESTO_VERSION_NAME, | ||
PRESTO_QUERY_ID_NAME, | ||
BUCKETING_VERSION, | ||
"EXTERNAL", | ||
"numFiles", | ||
"totalSize", | ||
"last_modified_time", | ||
"transient_lastDdlTime", | ||
"last_modified_by", | ||
"STATS_GENERATED_VIA_STATS_TASK", | ||
"COLUMN_STATS_ACCURATE", | ||
"numRows", | ||
"rawDataSize"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm against the Guice option for now. Let's keep it simple
{ | ||
return getSerdeProperty(table, key).map(csvSerdeProperty -> csvSerdeProperty.substring(0, 1)); | ||
return getSerdeProperty(tableParameterProvider, key).map(csvSerdeProperty -> csvSerdeProperty.substring(0, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we certain than csvSerdeProperty
is not empty? If it is we are having an exception here
public static class TableParameterProvider | ||
{ | ||
private final Table table; | ||
private final Set<String> consumedParameters = new HashSet<>(TABLE_PROPERTIES_TO_SKIP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImmutableSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This set is per-instance and mutated as part of getParameter(String key)
")", | ||
trinoTableName, | ||
expectedTrinoTableFormat))); | ||
Assertions.assertThat((String) onTrino().executeQuery("SHOW CREATE TABLE " + trinoTableName).getOnlyValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static import assertThat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is alternative QueryAssert.assertThat
used in the PTs.
rewriteProperty(sourceProperties::get, targetPropertiesBuilder, sourcePropertyKey, targetPropertyKey, valueMapper); | ||
} | ||
|
||
private <I, V> void rewriteProperty( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this method (and the pre-existing method this is overloading) could both be static
.
public static class TableParameterProvider | ||
{ | ||
private final Table table; | ||
private final Set<String> consumedParameters = new HashSet<>(TABLE_PROPERTIES_TO_SKIP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This set is per-instance and mutated as part of getParameter(String key)
|
||
public Map<String, String> getUnconsumedProperty() | ||
{ | ||
return table.getParameters().entrySet().stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be tracked incrementally as part of getParameter(String key)
by taking an initial snapshot of table.getProperties()
in the constructor, removing each property that gets consumed, and just returning the remaining map entries at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we need to use the same properties at multiple places then we might not be able to handle them.
false), | ||
new PropertyMetadata<>( | ||
EXTRA_PROPERTIES, | ||
"Extra table properties", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I agree with "arbitrary", because these properties presumably have meaning- just not one that Trino understands. Maybe... "additional_properties"? It's pretty close to "extra_properties", but maybe slightly clearer than "extra". Naming is hard.
- Remove unused methods - Make rewriteProperty static - Extract rewriteProperty to consume `Function<String, I>`
dbdb0ae
to
b57a6ee
Compare
@findinpath / @pettyjamesm AC |
"COLUMN_STATS_ACCURATE", | ||
"numRows", | ||
"rawDataSize", | ||
"numFilesErasureCoded"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this property get produced ? I'm not aware of it.
@@ -3846,4 +3892,33 @@ public boolean supportsReportingWrittenBytes(ConnectorSession session, SchemaTab | |||
{ | |||
return true; | |||
} | |||
|
|||
public static class TableParameterProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's extract this class from HiveMetadata
.
return table; | ||
} | ||
|
||
public Map<String, String> getUnconsumedProperties() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getUnconsumedProperties
-> getExtraProperties
public Map<String, String> getUnconsumedProperties() | ||
{ | ||
return table.getParameters().entrySet().stream() | ||
.filter(entry -> !consumedParameters.contains(entry.getKey())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we lowercase the values entered in and also the values used for checking here just to be on the safe side?
return table.getParameters().get(key); | ||
} | ||
|
||
public Table getTable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should likely include the logic of io.trino.plugin.hive.HiveMetadata#getSerdeProperty
and should not expose getTable()
@@ -90,13 +91,12 @@ public PartitionProjectionService( | |||
this.projectionFactories = ImmutableMap.copyOf(requireNonNull(projectionFactories, "projectionFactories is null")); | |||
} | |||
|
|||
public Map<String, Object> getPartitionProjectionTrinoTableProperties(Table table) | |||
public Map<String, Object> getPartitionProjectionTrinoTableProperties(HiveMetadata.TableParameterProvider tableParameterProvider) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a static
method.
I'd argue that we should bring this method in HiveMetadata
to have all TableParameterProvider
handling in one place.
@@ -173,7 +177,25 @@ public HiveTableProperties( | |||
PARTITION_PROJECTION_LOCATION_TEMPLATE, | |||
"Partition projection location template", | |||
null, | |||
false)); | |||
false), | |||
new PropertyMetadata<>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add docs to hive.rst
or consider creating a follow-up docs issue for this.
assertEquals(getOnlyElement(actualResult.getOnlyColumnAsSet()), expectedShowCreateTableSql); | ||
assertUpdate("DROP TABLE " + tableName); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is fixing #16873 something that you're considering doing @Praveen2112 ?
I'm thinking it is worth taking a stab at it just to get more knowledge of eventual corner cases which are not yet easily visible and are relevant for this feature.
Description
This PR allows us to pass through additional properties to Hive when creating table in Trino.
The additional properties can provided in the following format
Additional context and related issues
Fixes #954
Extension of #9475
Release notes
(x) Release notes are required, with the following suggested text: