Snowflake: Inconsistent column name case #18085

rumbin · 2022-01-19T05:56:50Z

Superset handles the case of non-case-sensitive column names inconsistently for Snowflake connections:

Native tables/views, i.e., physical datasets, have lowercase column names, as it is internally handled by SQLAlchemy.
SQLLab's virtual datasets have UPPERCASE column names.

This is troublesome for all dashboards where filters are present which act on charts with related, mixed physical/virtual datasets.
Superset's filter scoping is case sensitive. Thus, filters on a certain column name will either be applied to the related physical or virtual datasets.

How to reproduce the bug

Register any Snowflake table with non-case-sensitive column names (i.e., UPPERCASE) as a physical dataset in Superset.
Notice how its columns are recognized and stored in lowercase.
Edit the dataset again and convert it to a virtual dataset. Use select * from … as statement text. Save the changes.
Edit the dataset again and synchronize the column names.
Notice how the column names now changed to UPPERCASE.

Expected results

The case should not depend on whether the dataset is physical or virtual.
All non-case-sensitive (i.e., UPPERCASE in Snowflake) should be converted to lowercase consistently, also for virtual datasets created in SQLLab.

Lowercase is the internal representation of SQLALchemy.

Actual results

Virtual dataset's column names are treated as UPPERCASE.

Screenshots

The easies way to experience this effect is to simply explore a Snowflake table in SQLLab.
As you can see in the following screenshot, the column names that are extracted from the schema and displayed in the schema browser on the left are represented as lowercase, while the query results of the preview table have UPPERCASE column names:

Rejected workarounds

An option would of course be to use double-quoted lowercase column aliases in the SELECT statement of the virtual dataset.
This would make these column names case-sensitive and thus be treated as lowercase.
However, with a broad userbase working with Snowflake/Superset this is very unhandy.
Furthermore, this apporach would not allow any select * statements here.

I think it simply is a bug and it needs to be fixed so the way the column names are treated is always consistent.

Environment

Tested versions:

1.0.1
1.4 rc4

Checklist

Make sure to follow these steps before submitting your issue - thank you!

I have checked the superset logs for python stacktraces and included it here as text if there are any.
I have reproduced the issue with at least the latest released version of superset.
I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

This SQLALchemy issue is closely related: snowflakedb/snowflake-sqlalchemy#157 (comment)

However, my feeling is that we should probably try to stick with SQLALchemy's way of treating the column names as lowercase instead of uppercase. However, then we need to do so consistently and correct this for the virtual datasets.

tai pointed me in Slack to this piece of code: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L217

The text was updated successfully, but these errors were encountered:

rumbin · 2022-01-19T05:59:57Z

Would it be feasible to have the case conversion configurable for each database connection?

I am thinking about ways to fix this without breaking existing datasets/charts/dashboards that would suddenly have their colunm name case changed upon resync of the column names...

nytai · 2022-01-19T06:04:53Z

This issue is the same as the one you linked snowflakedb/snowflake-sqlalchemy#157

Superset is using that library to connect to snowflake so it needs to be fixed there, there’s nothing that can be done in superset to fix this.

rumbin · 2022-01-19T06:35:04Z

@nytai I do not agree that this is the same issue.
In the upstream issue it is proposed to treat the columns as uppercase while the proposed solution here is lowercase.

I think that there needs to be a design discussion in this issue before deciding how the column case is to be handled.
This may then lead to the conclusion that the upstream bug is sufficient.

Would you please reopen the issue?

rumbin · 2022-01-19T07:08:19Z

@nytai The linked upstream issue is referring to another issue that had been closed as the issue can be handled by the client's code: snowflakedb/snowflake-sqlalchemy#157 (comment)

@villebro was involved in closing the related upstream issue: snowflakedb/snowflake-sqlalchemy#73

So it seems to me as a bit of a ping pong to where this issue is to be handled.

I'd be glad if Ville could comment on this. Maybe we could also have a Slack discussion of video call if needed.

rumbin · 2022-01-19T07:12:28Z

From the linked upstream issues I understand that it is unlikely that this porblem will be fixed upstream.

I understood from these issues that the correct case can be extracted from the cursor, but it seems that we are not currently doing so in Superset.
Furthermorte, I am not quite sure what the desired behavior in Superset would actually be.

So, in my eyes there first needs to be a design decision.

nytai · 2022-01-19T17:00:54Z

Reopening the issue for the sake of discussion.

I don't think there's any design decision here, the column names should be returned in the same case they're defined in the database.

Superset isn't doing any case transformation on the columns and it cannot safely assume that all columns names should be in upper or lower case, it's up to the db/driver to return the correct case for superset to use, imo.

nytai · 2022-01-19T17:06:37Z

Although using the cursor to get the column names is a likely fix I do feel strongly that the database driver should not be applying any case transformations on the payload returned from the db.

Since @villebro did propose that as a solution to the issue, I agree, it would be good to his thoughts on the matter

nytai · 2022-01-20T07:52:51Z

ok, revisiting this it seems unlikely that that snowflake-sqlalchemy will stop denormalizing the column names in the payload given how long it's been open and that the cursor description seems to be an acceptable workaround.

I now agree that this could be addressed in superset given the challenges it presents on dashboards with charts that are backed by virtual & physical datasets.

villebro · 2022-01-20T12:39:32Z

Ok, so this all stems from the design decision Oracle-like databases have made, which I generally call "lowercase means uppercase means caseless", which IMO is a terrible assumption (I really wish Snowflake hadn't gone with this design). I may be off on some details, but basically it boils down to this: When you write

select column
from table

and the column name is caseless (=stored in metadata as UPPERCASE), the resulting column name in the query result is COLUMN. You get the same result by writing (correct me if I misremember these)

select Column
from table

or

select COLUMN
from table

or even

select "COLUMN"
from table

but an error when writing

select "column"
from table

because the column name isn't defined case-sensitively as "column", but rather stored in the metadata as "COLUMN" (=caseless). However, when you check the column name using inspector.get_columns, the column shows up as column, i.e. lowercase, which kind of isn't really true using Oracle logic, but ends up working in Superset. The Snowflake dialect does these to/from conversions here: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L217-L234 , and these functions kind of make sense when you know the assumptions behind case handling, but is super confusing when you're trying to write code that handles the cases correctly.

Now, we actually have a method in BaseEngineSpec called get_column_spec which takes the source of the column definition as one argument:

superset/superset/db_engine_specs/base.py

Lines 1313 to 1320 in 035638c

    
           @classmethod 
        
           def get_column_spec(  # pylint: disable=unused-argument 
        
               cls, 
        
               native_type: Optional[str], 
        
               db_extra: Optional[Dict[str, Any]] = None, 
        
               source: utils.ColumnTypeSource = utils.ColumnTypeSource.GET_TABLE, 
        
               column_type_mappings: Tuple[ColumnTypeMapping, ...] = column_type_mappings, 
        
           ) -> Optional[ColumnSpec]:

What we could do here is add column_name as an argument and add a field normalized_name to the ColumnSpec type and return the normalized column name on affected dbs (Oracle, Snowflake et al) if the column metadata originates from a query (=virtual table), but leave it unchanged for physical tables. This would just need to be implemented in the SQLAlchemy model and shouldn't be lots of work, but it would be really important to do super thorough testing to make sure it handles all cases correctly (caseless, UPPERCASE, "lowercase" and "Mixed Case").

This probably sounds really confusing, so maybe it makes sense to have a kickoff call to get this work started to make sure we fully understand the problem and have a clear proposal for solving this.

rumbin · 2022-01-20T14:34:43Z

@villebro in my eyes, your elaboration on the case handling is correct (including your opinion on the poor default choice of UPPERCASE...).

About the rest of your explanation you lost me somewhere, though.
What is — in your eyes — the intended behavior of Superset regarding the caseless columns. Should they become lowercase (like the physical datasets already are) or UPPERCASE (like the virtual datasets already are)?

villebro · 2022-01-20T14:55:29Z

In theory, i think the optimal solution would be to have the column cases equal for both physical and virtual datasets. I think this problem would go away if we called the normalization function on the column names when we fetch column names from the query metadata when we create a new virtual dataset. However, I believe this would break backward compatibility with previously created virtual tables, as refreshing the column metadata for pre-existing virtual tables would change the case from upper case to lower case, which would probably break all existing charts. We could, of course, do a database migration which would update all existing chart metadata, but that would be very tricky, as column names appear in lots of different controls (we also don't know what custom controls containing column names some orgs might have if they've created custom viz plugins).

Therefore, the approach I'm proposing is to rather be aware of the dataset type, and then be able to work around this problem at query creation time, i.e. do the normalization there. As the dataset metadata would be unaffected (caseless virtual dataset column names would still be UPPER CASE), this change would be backwards compatible, and therefore wouldn't require any database migrations.

However, if column cases are different for physical vs virtual tables, there is the risk that native filters will not work properly on datasets of different types. So this may require some further planning.

nytai · 2022-01-20T17:58:37Z

I was originally thinking that what would make the most sense is for superset to use the same case that it's defined in the db (ie, most likely UPPERCASE for snowflake, but not always), however since the sqlalchemy dialect is doing this "normalization" there really is no way to know for sure what the case is in the db. We could update the get_column_spec to instead use a describe table stmt (or whatever the dialect is using to get table names) and get the column names from the cursor (not sure if this would actually work).

However, this gets very complicated when we take into consideration all the existing datasets out there, as users who upgrade to a version with this fix would start seeing issues when they run the sync columns action and the case changes (which would break all their existing dashboard filters, etc).

I too really wish snowflake had not gone with this UPPERCASE by default design, and I especially wish that the sqlalchemy dialect hadn't gone with this "normalize" column names approach.

nytai · 2022-01-20T19:54:49Z

so this is the function in sqlalchemy-snowflake that returns the table names: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L418-L487

villebro · 2022-01-20T20:13:04Z

After giving this some more thought, I agree that the best solution would probably be to store the object names in their native case (UPPERCASE), as that would always work with quotes. If we do that, then all physical table column definitions would change from lowercase to UPPERCASE (unless MixedCase). I'm not sure we can do a database migration to fix old datasets reliably, so to deal with backwards compatibility issues, we could potentially add a config flag to the dataset metadata, something like legacy_mode, which we'd set to True for existing Snowflake datasets, but that would not be set on new datasets. When set to True, column names would be set with the current normalization logic, but when disabled would return the UPPERCASE name. Then existing charts wouldn't break, and users would have the option of migrating those by hand if they so wish.

rumbin · 2022-01-20T20:38:54Z

@villebro this is what I had in mind, too.
The backward compatibility definitely needs some good care and I think that such a flag/switch would work nicely, although it introduces some ugly legacy.

agusfigueroa-htg · 2022-01-21T11:12:05Z

Hi all,

First of all thanks for the discussion. It's very enriching, and the proposed solution would reach case consistency and solve the issue as mentioned.

After sharing this with our team another option that popped up, in case this only affects filters interoperability, is to add the option to make the filters case insensitive.

In that case, instead of looking for columns named the same (dataset1.colname=dataset2.colname), we can apply the filters "case insensitive" like (lower(dataset1.colname)=lower(dataset2.colname)), creating a flag for it in case users want to apply it or not.

This would lay fully on the Superset side and would be transparent across different databases and what they decide to do in the future (return uppercase, lowercase, mix case or so).

rumbin · 2022-01-22T07:11:39Z

@agusfigueroa-htg having the column matching if dashboard filters case-insensitive is a quick win in my eyes.
However, some edge cases remain:

When filters are handled in Jinja, users will need to implement their own case-insensitivity.
When users convert a physical to a virtual dataset, or vice versa, the column case will flip, causing possible bugs. Such a conversation may be needed, e.g., when the source model now includes soft deletes which then need to be filtered out, or, the other way round, when it is decided to convert a virtual dataset into a table or view.
Users might be confused by the case inconsistency and they might not be aware of the case-insensitivity of the filters.

villebro · 2022-01-22T07:35:33Z

@agusfigueroa-htg that's a really good idea, and could be useful even for other databases. However, I do feel this is an issue we probably need to resolve, and the points raised by @rumbin are IMO all very valid. It's also good to keep in mind that this not only affects Snowflake, but even other Oracle-like dialects.

agusfigueroa-htg · 2022-01-22T14:00:19Z

Good to see that we are all in the same boat :) and the points presented by you two are pretty much valid - the filter one is more of a quick escape rather than a real fix.

@villebro if I got you correctly, it makes sense to work on how the object names are stored in Superset. I don't know where that is defined in the code (I am new to the repo!), but if you have some references at a hand I am happy to help here.

rumbin · 2022-01-27T19:22:24Z

Stumbled on a related, merged PR: #5827

agusfigueroa-htg · 2022-02-14T14:50:57Z

Hi @villebro ! I assume we want to solve this on the connector side, is that correct? If I get a rough idea of where to start looking at, I hope I can contribute on taking the first steps :)

stale · 2022-04-16T13:57:03Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

rumbin · 2022-04-16T16:09:57Z

This issue should not be closed by the stale bot, in my eyes.

agusfigueroa-htg · 2022-04-19T07:18:43Z

I support rumbin here, this issue is still around!

…

On Sat, Apr 16, 2022 at 6:10 PM rumbin ***@***.***> wrote: This issue should not be closed by the stale bot, in my eyes. — Reply to this email directly, view it on GitHub <#18085 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ASNRLXSYSWIOZZT2QEJMT3DVFLQ6BANCNFSM5MJAECMA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

villebro · 2022-04-19T07:39:03Z

Hi, I agree this is something we should address. I actually think the best solution would be to make Superset case insensitive, as I believe very few people will need to have column, CoLuMn and COLUMN in the same dataset. However, this is a pretty big breaking change, so it would likely need to first be floated as a SIP, and then be implemented in 3.0 if others agree with the change. @rumbin would you be up for the task of preparing a SIP for this? This way we could have the discussion there and iron out any disagreements before starting actual work to address this.

agusfigueroa-htg · 2022-04-19T08:06:33Z

@rumbin can count on me if I can support anyhow!

villebro · 2022-04-19T08:15:11Z

@rumbin to echo @agusfigueroa-htg 's offer for support - I'm also happy to help push this effort forward, I just don't have the time/resources to drive it.

rumbin · 2022-04-19T10:05:55Z

@villebro, @agusfigueroa-htg please have a look at #19773

rumbin · 2022-04-19T10:07:05Z

@villebro did you mean to write Superset 3.0? I was hoping that we could get it into 2.0, but maybe I am too naïve here...

villebro · 2022-04-20T06:00:46Z

Yes, unfortunately this won't make it into 2.0, as the list of breaking changes has already been decided. However, if we do reach consensus on how to fix this, we can put it behind a feature flag during 2.0, and then potentially introduce it as the new default behavior in 3.0.

rusackas · 2023-05-25T18:30:29Z

Hi all! Just checking in on this thread, since it's gone silent for a year.

Is anyone still hoping to tackle this? Or did it somehow cease to be a problem? Now's a great chance to add a feature flag for 3.0, as that would be a non-breaking change in the release being cut on or around June 15th.

rumbin · 2023-05-25T22:03:53Z

I am not sure if I will have time and need to work on this on the near future, as I am currently not using Snowflake, but that might change soon.
Thus, it may be beneficial to at least define the feature flag already.
I cannot tell, however, if the issue disappeared miraculously, but I doubt it, judging from the depth/origin of the problem.

agusfigueroa-htg · 2023-05-30T09:43:46Z

I can sadly confirm that the issue remains :(

villebro · 2023-06-15T07:07:36Z

We've been throwing around ideas about this with @rusackas for the last few weeks, and my proposal is to change the behavior going forward as follows:

virtual dataset logic remains unchanged
physical tables call denormalize_name on the column name. See example below:

>>> from snowflake.sqlalchemy.snowdialect import SnowflakeDialect
>>> dialect = SnowflakeDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name('Foo')
'Foo'
>>> dialect.denormalize_name('FOO')
'FOO'

This means that going forward, caseless physical column names will be stored in UPPERCASE, rather than lowercase. We'll also do the same for Oracle and the others that need to call the standard denormalize_name method:

>>> from sqlalchemy.dialects.oracle.base import OracleDialect
>>> dialect = OracleDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name("Foo")
'Foo'
>>> dialect.denormalize_name("FOO")
'FOO'

This is in contrast to MSSQL and other dialects that don't require name normalization:

>>> from sqlalchemy.dialects.mssql.base import MSDialect
>>> dialect = MSDialect()
>>> ms.requires_name_normalize
False
>>> ms.denormalize_name('foo')
'FOO'
>>> # this ☝️ surprised me!

So calling dialect.denormalize_name(col_name) if dialect.requires_name_normalize is True will ensure that physical and virtual datasets have the same case. Furthermore, I propose not doing any migrations to existing datasets, as the migration would be extremely risky, avoiding breaking any existing dashboards or charts. This means, that this change would only affect physical datasets created after the change. Therefore, anyone who wants to migrate old charts/dashboards to the new behavior will need to edit the dataset and click "Sync columns from source":

After this the lowercase column names will become UPPERCASE:

Keep in mind, however, that this might break existing charts, as column references in both chart and dashboard native filter metadata might be pointing to the old normalized column name.

Thoughts @agusfigueroa-htg @rumbin @nytai ?

rumbin · 2023-06-15T10:08:39Z

@villebro thanks a lot for giving this issue some love :-).

The suggested approach seems like a simple yet effective solution to me.

agusfigueroa-htg · 2023-06-15T13:12:10Z

I second @rumbin very much here, this is a simple and effective solution. Thinking about our Superset instance, this will definitely address the problems we have with dashboard filtering.

Thank you a million @villebro @rusackas, very much appreciate this.

villebro · 2023-06-21T13:53:55Z

@agusfigueroa-htg @rumbin please take a look at PR #24471

crysensible · 2023-07-17T16:42:18Z

Hi all, this change has broken all of our charts and filters due to the change from lowercase naming to uppercase. This impacted our existing old datasets when I clicked "Sync Columns From Source."

Is there any guidance on how to make this backward compatible with existing assets? @betodealmeida @Vitor-Avila

crysensible · 2023-07-18T12:31:24Z

Here was the fix for anyone who encountered this: https://status.preset.io/pages/incident/5e83b80830610b04c39e898b/64b5823a267e5c053c6ed085

Vitor-Avila · 2023-07-18T13:13:38Z

Thanks, @crysensible! We ended up reverting the PR in Preset Cloud, so syncing columns from source again would reflect the old casing.

rusackas · 2023-07-21T15:51:29Z

We're planning to fix forward on this repo. Please stay tuned. If you have an issue with this PR, you can revert it locally or open a PR to revert it for everyone if there are stronger feelings.

rumbin added the #bug Bug report label Jan 19, 2022

nytai closed this as completed Jan 19, 2022

nytai reopened this Jan 19, 2022

stale bot added the inactive Inactive for >= 30 days label Apr 16, 2022

stale bot removed the inactive Inactive for >= 30 days label Apr 16, 2022

rumbin mentioned this issue Apr 19, 2022

[SIP-84] Case-insensitive handling of datasets' column names #19773

Closed

kaxil mentioned this issue Oct 13, 2022

Disable the SDK from changing capitalization of output columns by default astronomer/astro-sdk#917

Closed

villebro mentioned this issue Jun 21, 2023

fix(oracle): denormalize column names where applicable #24471

Merged

9 tasks

villebro closed this as completed in #24471 Jun 21, 2023

Snowflake: Inconsistent column name case #18085

Snowflake: Inconsistent column name case #18085

Comments

rumbin commented Jan 19, 2022

How to reproduce the bug

Expected results

Actual results

Screenshots

Rejected workarounds

Environment

Checklist

Additional context

rumbin commented Jan 19, 2022

nytai commented Jan 19, 2022 • edited Loading

rumbin commented Jan 19, 2022

rumbin commented Jan 19, 2022

rumbin commented Jan 19, 2022

nytai commented Jan 19, 2022

nytai commented Jan 19, 2022

nytai commented Jan 20, 2022

villebro commented Jan 20, 2022 • edited Loading

rumbin commented Jan 20, 2022

villebro commented Jan 20, 2022 • edited Loading

nytai commented Jan 20, 2022

nytai commented Jan 20, 2022

villebro commented Jan 20, 2022

rumbin commented Jan 20, 2022

agusfigueroa-htg commented Jan 21, 2022

rumbin commented Jan 22, 2022

villebro commented Jan 22, 2022

agusfigueroa-htg commented Jan 22, 2022

rumbin commented Jan 27, 2022

agusfigueroa-htg commented Feb 14, 2022

stale bot commented Apr 16, 2022

rumbin commented Apr 16, 2022

agusfigueroa-htg commented Apr 19, 2022 via email

villebro commented Apr 19, 2022

agusfigueroa-htg commented Apr 19, 2022

villebro commented Apr 19, 2022

rumbin commented Apr 19, 2022

rumbin commented Apr 19, 2022

villebro commented Apr 20, 2022

rusackas commented May 25, 2023

rumbin commented May 25, 2023

agusfigueroa-htg commented May 30, 2023

villebro commented Jun 15, 2023 • edited Loading

rumbin commented Jun 15, 2023

agusfigueroa-htg commented Jun 15, 2023

villebro commented Jun 21, 2023

crysensible commented Jul 17, 2023 • edited Loading

crysensible commented Jul 18, 2023 • edited Loading

Vitor-Avila commented Jul 18, 2023

rusackas commented Jul 21, 2023

nytai commented Jan 19, 2022 •

edited

Loading

villebro commented Jan 20, 2022 •

edited

Loading

villebro commented Jan 20, 2022 •

edited

Loading

villebro commented Jun 15, 2023 •

edited

Loading

crysensible commented Jul 17, 2023 •

edited

Loading

crysensible commented Jul 18, 2023 •

edited

Loading