Snowflake: Inconsistent column name case #18085
Would it be feasible to have the case conversion configurable for each database connection? I am thinking about ways to fix this without breaking existing datasets/charts/dashboards that would suddenly have their column name case changed upon resync of the column names...
This issue is the same as the one you linked: snowflakedb/snowflake-sqlalchemy#157. Superset is using that library to connect to snowflake, so it needs to be fixed there; there's nothing that can be done in superset to fix this.
@nytai I do not agree that this is the same issue. I think that there needs to be a design discussion in this issue before deciding how the column case is to be handled. Would you please reopen the issue?
@nytai The linked upstream issue is referring to another issue that had been closed because it can be handled by the client's code: snowflakedb/snowflake-sqlalchemy#157 (comment). @villebro was involved in closing the related upstream issue: snowflakedb/snowflake-sqlalchemy#73. So it seems to me a bit of a ping-pong as to where this issue is to be handled. I'd be glad if Ville could comment on this. Maybe we could also have a Slack discussion or a video call if needed.
From the linked upstream issues I understand that it is unlikely that this problem will be fixed upstream. I understood from these issues that the correct case can be extracted from the cursor, but it seems that we are not currently doing so in Superset. So, in my eyes there first needs to be a design decision.
Reopening the issue for the sake of discussion. I don't think there's any design decision here: the column names should be returned in the same case they're defined in the database. Superset isn't doing any case transformation on the columns, and it cannot safely assume that all column names should be in upper or lower case; it's up to the db/driver to return the correct case for superset to use, imo.
Although using the cursor to get the column names is a likely fix, I do feel strongly that the database driver should not be applying any case transformations on the payload returned from the db. Since @villebro did propose that as a solution to the issue, I agree, it would be good to hear his thoughts on the matter.
ok, revisiting this it seems unlikely that snowflake-sqlalchemy will stop denormalizing the column names in the payload, given how long it's been open and that the cursor description seems to be an acceptable workaround. I now agree that this could be addressed in superset given the challenges it presents on dashboards with charts that are backed by virtual & physical datasets.
Ok, so this all stems from the design decision Oracle-like databases have made, which I generally call "lowercase means uppercase means caseless", which IMO is a terrible assumption (I really wish Snowflake hadn't gone with this design). I may be off on some details, but basically it boils down to this: When you write select column from table and the column name is caseless (=stored in metadata as UPPERCASE), the resulting column name in the query result is COLUMN. The same happens for select Column from table, select COLUMN from table or even select "COLUMN" from table, but you get an error when writing select "column" from table, because the column name isn't defined case-sensitively as "column", but rather stored in the metadata as "COLUMN" (=caseless). However, when you check the column name via the SQLAlchemy dialect, it is returned normalized to lowercase. Now, we actually have a method in superset/superset/db_engine_specs/base.py (lines 1313 to 1320 at 035638c).
What we could do here is add column_name as an argument and add a field normalized_name to the ColumnSpec type and return the normalized column name on affected dbs (Oracle, Snowflake et al) if the column metadata originates from a query (=virtual table), but leave it unchanged for physical tables. This would just need to be implemented in the SQLAlchemy model and shouldn't be lots of work, but it would be really important to do super thorough testing to make sure it handles all cases correctly (caseless, UPPERCASE, "lowercase" and "Mixed Case").
This probably sounds really confusing, so maybe it makes sense to have a kickoff call to get this work started to make sure we fully understand the problem and have a clear proposal for solving this.
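To make the normalized_name idea above a bit more concrete, here is a very rough sketch, assuming the column metadata originates from a query (virtual dataset). The ColumnSpec below is only a stand-in for Superset's real class, and get_column_spec_for_query is a hypothetical helper, not existing code:

from dataclasses import dataclass
from typing import Optional

from sqlalchemy.engine.interfaces import Dialect


@dataclass
class ColumnSpec:
    # stand-in for Superset's ColumnSpec, extended with the proposed field
    column_name: str
    normalized_name: Optional[str] = None


def get_column_spec_for_query(column_name: str, dialect: Dialect) -> ColumnSpec:
    # On dialects that treat caseless names as UPPERCASE (Snowflake, Oracle, ...),
    # also expose the normalized (lowercase) name for columns coming from a query;
    # for physical tables the name reported by the inspector would be kept as-is.
    if getattr(dialect, "requires_name_normalize", False):
        normalized = dialect.normalize_name(column_name)
    else:
        normalized = column_name
    return ColumnSpec(column_name=column_name, normalized_name=normalized)

For example, with the Snowflake dialect a caseless "COLUMN" would get normalized_name "column", while a mixed-case "Column" would stay untouched.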
@villebro in my eyes, your elaboration on the case handling is correct (including your opinion on the poor default choice of UPPERCASE...). As for the rest of your explanation, you lost me somewhere, though.
In theory, I think the optimal solution would be to have the column cases equal for both physical and virtual datasets. I think this problem would go away if we called the normalization function on the column names when we fetch them from the query metadata while creating a new virtual dataset. However, I believe this would break backward compatibility with previously created virtual tables, as refreshing the column metadata for pre-existing virtual tables would change the case from upper case to lower case, which would probably break all existing charts. We could, of course, do a database migration which would update all existing chart metadata, but that would be very tricky, as column names appear in lots of different controls (we also don't know what custom controls containing column names some orgs might have if they've created custom viz plugins). Therefore, the approach I'm proposing is to rather be aware of the dataset type, and then be able to work around this problem at query creation time, i.e. do the normalization there. As the dataset metadata would be unaffected (caseless virtual dataset column names would still be UPPER CASE), this change would be backwards compatible, and therefore wouldn't require any database migrations. However, if column cases are different for physical vs virtual tables, there is the risk that native filters will not work properly on datasets of different types. So this may require some further planning.
I was originally thinking that what would make the most sense is for superset to use the same case that it's defined with in the db (i.e., most likely UPPERCASE for snowflake, but not always). However, since the sqlalchemy dialect is doing this "normalization", there really is no way to know for sure what the case is in the db. We could update the engine spec to work around this, but this gets very complicated when we take into consideration all the existing datasets out there, as users who upgrade to a version with this fix would start seeing issues when they run the sync columns action and the case changes (which would break all their existing dashboard filters, etc). I too really wish snowflake had not gone with this UPPERCASE by default design, and I especially wish that the sqlalchemy dialect hadn't gone with this "normalize" column names approach.
so this is the function in snowflake-sqlalchemy that returns the table names: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L418-L487
After giving this some more thought, I agree that the best solution would probably be to store the object names in their native case (UPPERCASE), as that would always work with quotes. If we do that, then all physical table column definitions would change from lowercase to UPPERCASE (unless MixedCase). I'm not sure we can do a database migration to fix old datasets reliably, so to deal with backwards compatibility issues, we could potentially add a config flag to the dataset metadata, something like the (purely hypothetical) flag sketched below.
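For illustration only, here is a minimal sketch of what such a per-dataset flag could look like; the flag name use_native_column_case and the helper are made up for this sketch and assume the flag would live in the dataset's extra JSON metadata:

import json
from typing import Optional


def should_use_native_case(dataset_extra: Optional[str]) -> bool:
    # Read a hypothetical per-dataset flag from the dataset's "extra" JSON blob.
    # Pre-existing datasets would lack the flag and keep the old lowercase
    # behavior, while newly created datasets could default it to True.
    try:
        extra = json.loads(dataset_extra or "{}")
    except ValueError:
        extra = {}
    return bool(extra.get("use_native_column_case", False))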
@villebro this is what I had in mind, too.
Hi all, first of all thanks for the discussion. It's very enriching, and the proposed solution would reach case consistency and solve the issue as mentioned. After sharing this with our team, another option popped up, in case this only affects filter interoperability: add the option to make the filters case insensitive. In that case, instead of looking for columns with exactly the same name, the filter would match column names case-insensitively. This would lie fully on the Superset side and would be transparent across different databases and whatever they decide to do in the future (return uppercase, lowercase, mixed case or so).
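A minimal sketch of what such case-insensitive matching could look like on the Superset side; find_filter_column and the example column lists are purely illustrative and not an existing API:

from typing import Iterable, Optional


def find_filter_column(filter_column: str, dataset_columns: Iterable[str]) -> Optional[str]:
    # Prefer an exact (case-sensitive) match; otherwise fall back to a
    # case-insensitive match so "my_col" also resolves to "MY_COL".
    columns = list(dataset_columns)
    if filter_column in columns:
        return filter_column
    by_lower = {column.lower(): column for column in columns}
    return by_lower.get(filter_column.lower())


# The same filter column then resolves against both a physical (lowercase)
# and a virtual (UPPERCASE) Snowflake-backed dataset:
assert find_filter_column("order_id", ["order_id", "amount"]) == "order_id"
assert find_filter_column("order_id", ["ORDER_ID", "AMOUNT"]) == "ORDER_ID"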
@agusfigueroa-htg having the column matching of dashboard filters case-insensitive is a quick win in my eyes.
@agusfigueroa-htg that's a really good idea, and could be useful even for other databases. However, I do feel this is an issue we probably need to resolve, and the points raised by @rumbin are IMO all very valid. It's also good to keep in mind that this not only affects Snowflake, but also other Oracle-like dialects.
Good to see that we are all in the same boat :) and the points presented by you two are pretty much valid - the filter one is more of a quick escape rather than a real fix. @villebro if I got you correctly, it makes sense to work on how the object names are stored in Superset. I don't know where that is defined in the code (I am new to the repo!), but if you have some references at hand I am happy to help here.
Stumbled on a related, merged PR: #5827
Hi @villebro ! I assume we want to solve this on the connector side, is that correct? If I get a rough idea of where to start looking, I hope I can contribute by taking the first steps :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue
This issue should not be closed by the stale bot, in my eyes. |
I support rumbin here, this issue is still around!
Hi, I agree this is something we should address. I actually think the best solution would be to make Superset case insensitive, as I believe very few people will need to have column names that differ only by case.
@rumbin can count on me if I can support in any way!
@rumbin to echo @agusfigueroa-htg 's offer for support - I'm also happy to help push this effort forward, I just don't have the time/resources to drive it.
@villebro, @agusfigueroa-htg please have a look at #19773
@villebro did you mean to write Superset 3.0? I was hoping that we could get it into 2.0, but maybe I am too naïve here...
Yes, unfortunately this won't make it into 2.0, as the list of breaking changes has already been decided. However, if we do reach consensus on how to fix this, we can put it behind a feature flag during 2.0, and then potentially introduce it as the new default behavior in 3.0.
Hi all! Just checking in on this thread, since it's gone silent for a year. Is anyone still hoping to tackle this? Or did it somehow cease to be a problem? Now's a great chance to add a feature flag for 3.0, as that would be a non-breaking change in the release being cut on or around June 15th.
I am not sure if I will have time and need to work on this in the near future, as I am currently not using Snowflake, but that might change soon.
I can sadly confirm that the issue remains :(
We've been throwing around ideas about this with @rusackas for the last few weeks, and my proposal is to change the behavior going forward as follows:
>>> from snowflake.sqlalchemy.snowdialect import SnowflakeDialect
>>> dialect = SnowflakeDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name('Foo')
'Foo'
>>> dialect.denormalize_name('FOO')
'FOO'
This means that going forward, caseless physical column names will be stored in UPPERCASE, rather than lowercase. We'll also do the same for Oracle and the others that need to call the standard denormalize_name method:
>>> from sqlalchemy.dialects.oracle.base import OracleDialect
>>> dialect = OracleDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name("Foo")
'Foo'
>>> dialect.denormalize_name("FOO")
'FOO'
This is in contrast to MSSQL and other dialects that don't require name normalization:
>>> from sqlalchemy.dialects.mssql.base import MSDialect
>>> ms = MSDialect()
>>> ms.requires_name_normalize
False
>>> ms.denormalize_name('foo')
'FOO'
>>> # this ☝️ surprised me!
So calling denormalize_name can't be done blindly for every dialect; it should only be applied where requires_name_normalize is True. After this, the lowercase column names will become UPPERCASE. Keep in mind, however, that this might break existing charts, as column references in both chart and dashboard native filter metadata might be pointing to the old normalized column name. Thoughts @agusfigueroa-htg @rumbin @nytai ?
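To make that gating concrete, here is a minimal sketch; denormalize_column_name is a hypothetical helper, while requires_name_normalize and denormalize_name are the SQLAlchemy dialect attributes shown above:

from sqlalchemy.engine.interfaces import Dialect


def denormalize_column_name(name: str, dialect: Dialect) -> str:
    # Store caseless names in the database's native case. Only dialects that
    # normalize caseless identifiers (Snowflake, Oracle, ...) advertise
    # requires_name_normalize=True; for everything else (e.g. MSSQL) the name
    # is left untouched, since their denormalize_name() may uppercase
    # unconditionally, as the example above shows.
    if getattr(dialect, "requires_name_normalize", False):
        return dialect.denormalize_name(name)
    return name


# e.g. denormalize_column_name("order_id", SnowflakeDialect()) -> "ORDER_ID",
# while denormalize_column_name("order_id", MSDialect()) stays "order_id".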
@villebro thanks a lot for giving this issue some love :-). The suggested approach seems like a simple yet effective solution to me.
@agusfigueroa-htg @rumbin please take a look at PR #24471
Hi all, this change has broken all of our charts and filters due to the change from lowercase naming to uppercase. This impacted our existing old datasets when I clicked "Sync Columns From Source." Is there any guidance on how to make this backward compatible with existing assets? @betodealmeida @Vitor-Avila
Here was the fix for anyone who encountered this: https://status.preset.io/pages/incident/5e83b80830610b04c39e898b/64b5823a267e5c053c6ed085
Thanks, @crysensible! We ended up reverting the PR in Preset Cloud, so syncing columns from source again would reflect the old casing.
We're planning to fix forward on this repo. Please stay tuned. If you have an issue with this PR, you can revert it locally or open a PR to revert it for everyone if there are stronger feelings.
Superset handles the case of non-case-sensitive column names inconsistently for Snowflake connections: physical datasets get lowercase column names, while virtual datasets keep them UPPERCASE.
This is troublesome for all dashboards with filters that act on charts backed by related, mixed physical and virtual datasets.
Superset's filter scoping is case sensitive. Thus, filters on a certain column name will be applied to either the related physical datasets or the virtual ones, but not both.
How to reproduce the bug
Use select * from … as the statement text of a virtual dataset and save the changes.
Expected results
The case should not depend on whether the dataset is physical or virtual.
All non-case-sensitive column names (i.e., UPPERCASE in Snowflake) should be converted to lowercase consistently, also for virtual datasets created in SQL Lab.
Lowercase is the internal representation of SQLAlchemy.
Actual results
Virtual dataset's column names are treated as UPPERCASE.
Screenshots
The easiest way to experience this effect is to simply explore a Snowflake table in SQL Lab.
As you can see in the following screenshot, the column names that are extracted from the schema and displayed in the schema browser on the left are represented as lowercase, while the query results of the preview table have UPPERCASE column names:
Rejected workarounds
An option would of course be to use double-quoted lowercase column aliases in the SELECT statement of the virtual dataset. This would make these column names case-sensitive and thus be treated as lowercase.
However, with a broad user base working with Snowflake/Superset this is very impractical.
Furthermore, this approach would not allow any select * statements. I think it simply is a bug and it needs to be fixed so the way the column names are treated is always consistent.
Environment
Tested versions:
Checklist
Make sure to follow these steps before submitting your issue - thank you!
Additional context
This SQLALchemy issue is closely related: snowflakedb/snowflake-sqlalchemy#157 (comment)
However, my feeling is that we should probably try to stick with SQLAlchemy's way of treating the column names as lowercase instead of uppercase. If we do, then we need to do so consistently and correct this for the virtual datasets.
tai pointed me in Slack to this piece of code: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L217