Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake: Inconsistent column name case #18085

Closed
3 tasks done
rumbin opened this issue Jan 19, 2022 · 41 comments · Fixed by #24471
Closed
3 tasks done

Snowflake: Inconsistent column name case #18085

rumbin opened this issue Jan 19, 2022 · 41 comments · Fixed by #24471
Labels
#bug Bug report

Comments

@rumbin
Copy link
Contributor

rumbin commented Jan 19, 2022

Superset handles the case of non-case-sensitive column names inconsistently for Snowflake connections:

  • Native tables/views, i.e., physical datasets, have lowercase column names, as it is internally handled by SQLAlchemy.
  • SQLLab's virtual datasets have UPPERCASE column names.

This is troublesome for all dashboards where filters are present which act on charts with related, mixed physical/virtual datasets.
Superset's filter scoping is case sensitive. Thus, filters on a certain column name will either be applied to the related physical or virtual datasets.

How to reproduce the bug

  1. Register any Snowflake table with non-case-sensitive column names (i.e., UPPERCASE) as a physical dataset in Superset.
  2. Notice how its columns are recognized and stored in lowercase.
  3. Edit the dataset again and convert it to a virtual dataset. Use select * from … as statement text. Save the changes.
  4. Edit the dataset again and synchronize the column names.
  5. Notice how the column names now changed to UPPERCASE.

Expected results

The case should not depend on whether the dataset is physical or virtual.
All non-case-sensitive (i.e., UPPERCASE in Snowflake) should be converted to lowercase consistently, also for virtual datasets created in SQLLab.

Lowercase is the internal representation of SQLALchemy.

Actual results

Virtual dataset's column names are treated as UPPERCASE.

Screenshots

The easies way to experience this effect is to simply explore a Snowflake table in SQLLab.
As you can see in the following screenshot, the column names that are extracted from the schema and displayed in the schema browser on the left are represented as lowercase, while the query results of the preview table have UPPERCASE column names:

image

Rejected workarounds

An option would of course be to use double-quoted lowercase column aliases in the SELECT statement of the virtual dataset.
This would make these column names case-sensitive and thus be treated as lowercase.
However, with a broad userbase working with Snowflake/Superset this is very unhandy.
Furthermore, this apporach would not allow any select * statements here.

I think it simply is a bug and it needs to be fixed so the way the column names are treated is always consistent.

Environment

Tested versions:

  • 1.0.1
  • 1.4 rc4

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

This SQLALchemy issue is closely related: snowflakedb/snowflake-sqlalchemy#157 (comment)

However, my feeling is that we should probably try to stick with SQLALchemy's way of treating the column names as lowercase instead of uppercase. However, then we need to do so consistently and correct this for the virtual datasets.

tai pointed me in Slack to this piece of code: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L217

@rumbin rumbin added the #bug Bug report label Jan 19, 2022
@rumbin
Copy link
Contributor Author

rumbin commented Jan 19, 2022

Would it be feasible to have the case conversion configurable for each database connection?

I am thinking about ways to fix this without breaking existing datasets/charts/dashboards that would suddenly have their colunm name case changed upon resync of the column names...

@nytai
Copy link
Member

nytai commented Jan 19, 2022

This issue is the same as the one you linked snowflakedb/snowflake-sqlalchemy#157

Superset is using that library to connect to snowflake so it needs to be fixed there, there’s nothing that can be done in superset to fix this.

@nytai nytai closed this as completed Jan 19, 2022
@rumbin
Copy link
Contributor Author

rumbin commented Jan 19, 2022

@nytai I do not agree that this is the same issue.
In the upstream issue it is proposed to treat the columns as uppercase while the proposed solution here is lowercase.

I think that there needs to be a design discussion in this issue before deciding how the column case is to be handled.
This may then lead to the conclusion that the upstream bug is sufficient.

Would you please reopen the issue?

@rumbin
Copy link
Contributor Author

rumbin commented Jan 19, 2022

@nytai The linked upstream issue is referring to another issue that had been closed as the issue can be handled by the client's code: snowflakedb/snowflake-sqlalchemy#157 (comment)

@villebro was involved in closing the related upstream issue: snowflakedb/snowflake-sqlalchemy#73

So it seems to me as a bit of a ping pong to where this issue is to be handled.

I'd be glad if Ville could comment on this. Maybe we could also have a Slack discussion of video call if needed.

@rumbin
Copy link
Contributor Author

rumbin commented Jan 19, 2022

From the linked upstream issues I understand that it is unlikely that this porblem will be fixed upstream.

I understood from these issues that the correct case can be extracted from the cursor, but it seems that we are not currently doing so in Superset.
Furthermorte, I am not quite sure what the desired behavior in Superset would actually be.

So, in my eyes there first needs to be a design decision.

@nytai nytai reopened this Jan 19, 2022
@nytai
Copy link
Member

nytai commented Jan 19, 2022

Reopening the issue for the sake of discussion.

I don't think there's any design decision here, the column names should be returned in the same case they're defined in the database.

Superset isn't doing any case transformation on the columns and it cannot safely assume that all columns names should be in upper or lower case, it's up to the db/driver to return the correct case for superset to use, imo.

@nytai
Copy link
Member

nytai commented Jan 19, 2022

Although using the cursor to get the column names is a likely fix I do feel strongly that the database driver should not be applying any case transformations on the payload returned from the db.

Since @villebro did propose that as a solution to the issue, I agree, it would be good to his thoughts on the matter

@nytai
Copy link
Member

nytai commented Jan 20, 2022

ok, revisiting this it seems unlikely that that snowflake-sqlalchemy will stop denormalizing the column names in the payload given how long it's been open and that the cursor description seems to be an acceptable workaround.

I now agree that this could be addressed in superset given the challenges it presents on dashboards with charts that are backed by virtual & physical datasets.

@villebro
Copy link
Member

villebro commented Jan 20, 2022

Ok, so this all stems from the design decision Oracle-like databases have made, which I generally call "lowercase means uppercase means caseless", which IMO is a terrible assumption (I really wish Snowflake hadn't gone with this design). I may be off on some details, but basically it boils down to this: When you write

select column
from table

and the column name is caseless (=stored in metadata as UPPERCASE), the resulting column name in the query result is COLUMN. You get the same result by writing (correct me if I misremember these)

select Column
from table

or

select COLUMN
from table

or even

select "COLUMN"
from table

but an error when writing

select "column"
from table

because the column name isn't defined case-sensitively as "column", but rather stored in the metadata as "COLUMN" (=caseless). However, when you check the column name using inspector.get_columns, the column shows up as column, i.e. lowercase, which kind of isn't really true using Oracle logic, but ends up working in Superset. The Snowflake dialect does these to/from conversions here: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L217-L234 , and these functions kind of make sense when you know the assumptions behind case handling, but is super confusing when you're trying to write code that handles the cases correctly.

Now, we actually have a method in BaseEngineSpec called get_column_spec which takes the source of the column definition as one argument:

@classmethod
def get_column_spec( # pylint: disable=unused-argument
cls,
native_type: Optional[str],
db_extra: Optional[Dict[str, Any]] = None,
source: utils.ColumnTypeSource = utils.ColumnTypeSource.GET_TABLE,
column_type_mappings: Tuple[ColumnTypeMapping, ...] = column_type_mappings,
) -> Optional[ColumnSpec]:

What we could do here is add column_name as an argument and add a field normalized_name to the ColumnSpec type and return the normalized column name on affected dbs (Oracle, Snowflake et al) if the column metadata originates from a query (=virtual table), but leave it unchanged for physical tables. This would just need to be implemented in the SQLAlchemy model and shouldn't be lots of work, but it would be really important to do super thorough testing to make sure it handles all cases correctly (caseless, UPPERCASE, "lowercase" and "Mixed Case").

This probably sounds really confusing, so maybe it makes sense to have a kickoff call to get this work started to make sure we fully understand the problem and have a clear proposal for solving this.

@rumbin
Copy link
Contributor Author

rumbin commented Jan 20, 2022

@villebro in my eyes, your elaboration on the case handling is correct (including your opinion on the poor default choice of UPPERCASE...).

About the rest of your explanation you lost me somewhere, though.
What is — in your eyes — the intended behavior of Superset regarding the caseless columns. Should they become lowercase (like the physical datasets already are) or UPPERCASE (like the virtual datasets already are)?

@villebro
Copy link
Member

villebro commented Jan 20, 2022

In theory, i think the optimal solution would be to have the column cases equal for both physical and virtual datasets. I think this problem would go away if we called the normalization function on the column names when we fetch column names from the query metadata when we create a new virtual dataset. However, I believe this would break backward compatibility with previously created virtual tables, as refreshing the column metadata for pre-existing virtual tables would change the case from upper case to lower case, which would probably break all existing charts. We could, of course, do a database migration which would update all existing chart metadata, but that would be very tricky, as column names appear in lots of different controls (we also don't know what custom controls containing column names some orgs might have if they've created custom viz plugins).

Therefore, the approach I'm proposing is to rather be aware of the dataset type, and then be able to work around this problem at query creation time, i.e. do the normalization there. As the dataset metadata would be unaffected (caseless virtual dataset column names would still be UPPER CASE), this change would be backwards compatible, and therefore wouldn't require any database migrations.

However, if column cases are different for physical vs virtual tables, there is the risk that native filters will not work properly on datasets of different types. So this may require some further planning.

@nytai
Copy link
Member

nytai commented Jan 20, 2022

I was originally thinking that what would make the most sense is for superset to use the same case that it's defined in the db (ie, most likely UPPERCASE for snowflake, but not always), however since the sqlalchemy dialect is doing this "normalization" there really is no way to know for sure what the case is in the db. We could update the get_column_spec to instead use a describe table stmt (or whatever the dialect is using to get table names) and get the column names from the cursor (not sure if this would actually work).

However, this gets very complicated when we take into consideration all the existing datasets out there, as users who upgrade to a version with this fix would start seeing issues when they run the sync columns action and the case changes (which would break all their existing dashboard filters, etc).

I too really wish snowflake had not gone with this UPPERCASE by default design, and I especially wish that the sqlalchemy dialect hadn't gone with this "normalize" column names approach.

@nytai
Copy link
Member

nytai commented Jan 20, 2022

so this is the function in sqlalchemy-snowflake that returns the table names: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L418-L487

@villebro
Copy link
Member

After giving this some more thought, I agree that the best solution would probably be to store the object names in their native case (UPPERCASE), as that would always work with quotes. If we do that, then all physical table column definitions would change from lowercase to UPPERCASE (unless MixedCase). I'm not sure we can do a database migration to fix old datasets reliably, so to deal with backwards compatibility issues, we could potentially add a config flag to the dataset metadata, something like legacy_mode, which we'd set to True for existing Snowflake datasets, but that would not be set on new datasets. When set to True, column names would be set with the current normalization logic, but when disabled would return the UPPERCASE name. Then existing charts wouldn't break, and users would have the option of migrating those by hand if they so wish.

@rumbin
Copy link
Contributor Author

rumbin commented Jan 20, 2022

@villebro this is what I had in mind, too.
The backward compatibility definitely needs some good care and I think that such a flag/switch would work nicely, although it introduces some ugly legacy.

@agusfigueroa-htg
Copy link

Hi all,

First of all thanks for the discussion. It's very enriching, and the proposed solution would reach case consistency and solve the issue as mentioned.

After sharing this with our team another option that popped up, in case this only affects filters interoperability, is to add the option to make the filters case insensitive.

In that case, instead of looking for columns named the same (dataset1.colname=dataset2.colname), we can apply the filters "case insensitive" like (lower(dataset1.colname)=lower(dataset2.colname)), creating a flag for it in case users want to apply it or not.

This would lay fully on the Superset side and would be transparent across different databases and what they decide to do in the future (return uppercase, lowercase, mix case or so).

@rumbin
Copy link
Contributor Author

rumbin commented Jan 22, 2022

@agusfigueroa-htg having the column matching if dashboard filters case-insensitive is a quick win in my eyes.
However, some edge cases remain:

  1. When filters are handled in Jinja, users will need to implement their own case-insensitivity.
  2. When users convert a physical to a virtual dataset, or vice versa, the column case will flip, causing possible bugs. Such a conversation may be needed, e.g., when the source model now includes soft deletes which then need to be filtered out, or, the other way round, when it is decided to convert a virtual dataset into a table or view.
  3. Users might be confused by the case inconsistency and they might not be aware of the case-insensitivity of the filters.

@villebro
Copy link
Member

@agusfigueroa-htg that's a really good idea, and could be useful even for other databases. However, I do feel this is an issue we probably need to resolve, and the points raised by @rumbin are IMO all very valid. It's also good to keep in mind that this not only affects Snowflake, but even other Oracle-like dialects.

@agusfigueroa-htg
Copy link

Good to see that we are all in the same boat :) and the points presented by you two are pretty much valid - the filter one is more of a quick escape rather than a real fix.

@villebro if I got you correctly, it makes sense to work on how the object names are stored in Superset. I don't know where that is defined in the code (I am new to the repo!), but if you have some references at a hand I am happy to help here.

@rumbin
Copy link
Contributor Author

rumbin commented Jan 27, 2022

Stumbled on a related, merged PR: #5827

@agusfigueroa-htg
Copy link

Hi @villebro ! I assume we want to solve this on the connector side, is that correct? If I get a rough idea of where to start looking at, I hope I can contribute on taking the first steps :)

@stale
Copy link

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 16, 2022
@rumbin
Copy link
Contributor Author

rumbin commented Apr 16, 2022

This issue should not be closed by the stale bot, in my eyes.

@stale stale bot removed the inactive Inactive for >= 30 days label Apr 16, 2022
@agusfigueroa-htg
Copy link

agusfigueroa-htg commented Apr 19, 2022 via email

@villebro
Copy link
Member

Hi, I agree this is something we should address. I actually think the best solution would be to make Superset case insensitive, as I believe very few people will need to have column, CoLuMn and COLUMN in the same dataset. However, this is a pretty big breaking change, so it would likely need to first be floated as a SIP, and then be implemented in 3.0 if others agree with the change. @rumbin would you be up for the task of preparing a SIP for this? This way we could have the discussion there and iron out any disagreements before starting actual work to address this.

@agusfigueroa-htg
Copy link

@rumbin can count on me if I can support anyhow!

@villebro
Copy link
Member

@rumbin to echo @agusfigueroa-htg 's offer for support - I'm also happy to help push this effort forward, I just don't have the time/resources to drive it.

@rumbin
Copy link
Contributor Author

rumbin commented Apr 19, 2022

@villebro, @agusfigueroa-htg please have a look at #19773

@rumbin
Copy link
Contributor Author

rumbin commented Apr 19, 2022

@villebro did you mean to write Superset 3.0? I was hoping that we could get it into 2.0, but maybe I am too naïve here...

@villebro
Copy link
Member

Yes, unfortunately this won't make it into 2.0, as the list of breaking changes has already been decided. However, if we do reach consensus on how to fix this, we can put it behind a feature flag during 2.0, and then potentially introduce it as the new default behavior in 3.0.

@rusackas
Copy link
Member

Hi all! Just checking in on this thread, since it's gone silent for a year.

Is anyone still hoping to tackle this? Or did it somehow cease to be a problem? Now's a great chance to add a feature flag for 3.0, as that would be a non-breaking change in the release being cut on or around June 15th.

@rumbin
Copy link
Contributor Author

rumbin commented May 25, 2023

I am not sure if I will have time and need to work on this on the near future, as I am currently not using Snowflake, but that might change soon.
Thus, it may be beneficial to at least define the feature flag already.
I cannot tell, however, if the issue disappeared miraculously, but I doubt it, judging from the depth/origin of the problem.

@agusfigueroa-htg
Copy link

I can sadly confirm that the issue remains :(

@villebro
Copy link
Member

villebro commented Jun 15, 2023

We've been throwing around ideas about this with @rusackas for the last few weeks, and my proposal is to change the behavior going forward as follows:

  • virtual dataset logic remains unchanged
  • physical tables call denormalize_name on the column name. See example below:
>>> from snowflake.sqlalchemy.snowdialect import SnowflakeDialect
>>> dialect = SnowflakeDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name('Foo')
'Foo'
>>> dialect.denormalize_name('FOO')
'FOO'

This means that going forward, caseless physical column names will be stored in UPPERCASE, rather than lowercase. We'll also do the same for Oracle and the others that need to call the standard denormalize_name method:

>>> from sqlalchemy.dialects.oracle.base import OracleDialect
>>> dialect = OracleDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name("Foo")
'Foo'
>>> dialect.denormalize_name("FOO")
'FOO'

This is in contrast to MSSQL and other dialects that don't require name normalization:

>>> from sqlalchemy.dialects.mssql.base import MSDialect
>>> dialect = MSDialect()
>>> ms.requires_name_normalize
False
>>> ms.denormalize_name('foo')
'FOO'
>>> # this ☝️ surprised me!

So calling dialect.denormalize_name(col_name) if dialect.requires_name_normalize is True will ensure that physical and virtual datasets have the same case. Furthermore, I propose not doing any migrations to existing datasets, as the migration would be extremely risky, avoiding breaking any existing dashboards or charts. This means, that this change would only affect physical datasets created after the change. Therefore, anyone who wants to migrate old charts/dashboards to the new behavior will need to edit the dataset and click "Sync columns from source":

image

After this the lowercase column names will become UPPERCASE:

image

Keep in mind, however, that this might break existing charts, as column references in both chart and dashboard native filter metadata might be pointing to the old normalized column name.

Thoughts @agusfigueroa-htg @rumbin @nytai ?

@rumbin
Copy link
Contributor Author

rumbin commented Jun 15, 2023

@villebro thanks a lot for giving this issue some love :-).

The suggested approach seems like a simple yet effective solution to me.

@agusfigueroa-htg
Copy link

I second @rumbin very much here, this is a simple and effective solution. Thinking about our Superset instance, this will definitely address the problems we have with dashboard filtering.

Thank you a million @villebro @rusackas, very much appreciate this.

@villebro
Copy link
Member

@agusfigueroa-htg @rumbin please take a look at PR #24471

@crysensible
Copy link

crysensible commented Jul 17, 2023

Hi all, this change has broken all of our charts and filters due to the change from lowercase naming to uppercase. This impacted our existing old datasets when I clicked "Sync Columns From Source."

Is there any guidance on how to make this backward compatible with existing assets? @betodealmeida @Vitor-Avila

@crysensible
Copy link

crysensible commented Jul 18, 2023

Here was the fix for anyone who encountered this: https://status.preset.io/pages/incident/5e83b80830610b04c39e898b/64b5823a267e5c053c6ed085

@Vitor-Avila
Copy link
Contributor

Thanks, @crysensible! We ended up reverting the PR in Preset Cloud, so syncing columns from source again would reflect the old casing.

@rusackas
Copy link
Member

We're planning to fix forward on this repo. Please stay tuned. If you have an issue with this PR, you can revert it locally or open a PR to revert it for everyone if there are stronger feelings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
#bug Bug report
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants