Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509

Merged
merged 4 commits into from
Nov 14, 2020

Conversation

ktmud
Copy link
Member

@ktmud ktmud commented Oct 31, 2020

SUMMARY

💥 Breaking Change

Rename config value TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG, and save datasource query results to this cache, too.

Motivation

Companies like Airbnb have different security levels for Superset itself and the datasources Superset connects to: anyone can login to the same Superset, but not everyone have access to all data. Superset has SecurityManager to check whether a user has access to certain datasource, but all query results are still stored in the same cache, regardless of users' access levels. If a malicious party obtained access to the cache server via certain Superset user in a high-risk region, they get access to all cached data.

By separating cache storage for Superset metadata and the actual data users consume, we can have a better insulation of sensitive data by configuring different cache backend for high-risk regions while keeping the cache for all the Superset metadata in sync.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TEST PLAN

CI

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration: Superset admins need to manually update cache configs
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@codecov-io
Copy link

codecov-io commented Oct 31, 2020

Codecov Report

Merging #11509 (226fbe9) into master (68693c7) will increase coverage by 4.26%.
The diff coverage is 65.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11509      +/-   ##
==========================================
+ Coverage   62.86%   67.12%   +4.26%     
==========================================
  Files         889      889              
  Lines       43055    43062       +7     
  Branches     4017     4017              
==========================================
+ Hits        27065    28905    +1840     
+ Misses      15811    14060    -1751     
+ Partials      179       97      -82     
Flag Coverage Δ
cypress 56.05% <ø> (?)
javascript 62.81% <ø> (ø)
python 62.87% <65.68%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/utils/decorators.py 100.00% <ø> (+34.17%) ⬆️
superset/utils/screenshots.py 36.36% <0.00%> (ø)
superset/viz_sip38.py 0.00% <0.00%> (ø)
superset/utils/cache.py 57.57% <54.38%> (-22.43%) ⬇️
superset/models/dashboard.py 75.50% <75.00%> (+0.12%) ⬆️
superset/viz.py 59.40% <88.88%> (+0.07%) ⬆️
superset/__init__.py 100.00% <100.00%> (ø)
superset/common/query_context.py 83.23% <100.00%> (-4.72%) ⬇️
superset/config.py 90.07% <100.00%> (-0.04%) ⬇️
superset/db_engine_specs/hive.py 84.04% <100.00%> (+0.06%) ⬆️
... and 194 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68693c7...226fbe9. Read the comment docs.

from superset.common.query_object import QueryObject
from superset.connectors.base.models import BaseDatasource
from superset.connectors.connector_registry import ConnectorRegistry
from superset.exceptions import QueryObjectValidationError
from superset.extensions import cache_manager, security_manager
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import the cache_manager instead of cache to make it clear which cache we are using.

@@ -285,7 +286,11 @@ def get_df_payload( # pylint: disable=too-many-statements
status = utils.QueryStatus.FAILED
stacktrace = utils.get_stacktrace()

if is_loaded and cache_key and cache and status != utils.QueryStatus.FAILED:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and cache is not needed because of #9020

@@ -450,8 +451,8 @@ def inspector(self) -> Inspector:
return sqla.inspect(engine)

@cache_util.memoized_func(
key=lambda *args, **kwargs: "db:{}:schema:None:table_list",
attribute_in_key="id",
key=lambda self, *args, **kwargs: f"db:{self.id}:schema:None:table_list",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small refactor to make the memoize function more flexible.


def etag_cache(
max_age: int, check_perms: Callable[..., Any], cache: Cache = cache_manager.cache
) -> Callable[..., Any]:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved over from superset.utils.decorators

self._thumbnail_cache = Cache()

def init_app(self, app: Flask) -> None:
self._cache.init_app(app, app.config["CACHE_CONFIG"])
self._tables_cache.init_app(app, app.config["TABLE_NAMES_CACHE_CONFIG"])
self._data_cache.init_app(app, app.config["DATA_CACHE_CONFIG"])
Copy link
Member Author

@ktmud ktmud Oct 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't add backward compatibility because the data objects put into this cache will become much much larger. If a Superset admin has configured in-memory cache for TABLE_NAMES_CACHE_CONFIG, they might run into problems. Therefore it's better to make things break to make them aware of this change.

For most users, the migration is as simple as renaming the config value.

@mistercrunch mistercrunch added the risk:breaking-change Issues or PRs that will introduce breaking changes label Oct 31, 2020
@bkyryliuk
Copy link
Member

I will be on PTO next week, won't have a change to give a full review. 1 suggestion it would be nice to add a unit test that will document the expected behavior and will prevent from the regressions, tests can impersonate the user to hit various endpoints.

@ktmud
Copy link
Member Author

ktmud commented Oct 31, 2020

I will be on PTO next week, won't have a change to give a full review. 1 suggestion it would be nice to add a unit test that will document the expected behavior and will prevent from the regressions, tests can impersonate the user to hit various endpoints.

I added two test assertions in cache_tests.py. Do you think we need more?

UPDATING.md Outdated Show resolved Hide resolved
superset/config.py Outdated Show resolved Hide resolved
superset/config.py Outdated Show resolved Hide resolved
superset/utils/cache.py Outdated Show resolved Hide resolved
superset/utils/cache.py Outdated Show resolved Hide resolved
@ktmud ktmud force-pushed the data-cache branch 2 times, most recently from 45ce586 to c0eb415 Compare November 1, 2020 23:29
UPDATING.md Show resolved Hide resolved
self._cache.init_app(
app,
{
"CACHE_DEFAULT_TIMEOUT": app.config["CACHE_DEFAULT_TIMEOUT"],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure the global default is applied as documented.

@ktmud ktmud force-pushed the data-cache branch 4 times, most recently from 93729d6 to 9543b47 Compare November 9, 2020 05:46
@ktmud ktmud force-pushed the data-cache branch 2 times, most recently from c2a46af to ca25463 Compare November 10, 2020 20:34
@ktmud
Copy link
Member Author

ktmud commented Nov 10, 2020

@john-bodley @etr2460 So previously the test tests/databases/api_tests.py::TestDatabaseApi::test_create_database_conn_fail was failing because I setup a cache backend for DATA_CACHE_CONFIG while in the old test config TABLE_NAMES_CACHE_CONFIG was set to cache_type: null.

Not sure why this cache would fail test_create_database_conn_fail though. Maybe @dpgaspar have some idea?

@ktmud
Copy link
Member Author

ktmud commented Nov 10, 2020

Here's a test run that fails when the only change comparing to master is to enable TABLE_NAMES_CACHE_CONFIG.

@@ -95,7 +95,7 @@ def test_get_items(self):
"function_names",
"id",
]
self.assertEqual(response["count"], 2)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count = 2 seems to assume some tests run before this test. If you just initialized the database and run this test alone, this test will fail.

Copy link

@graceguo-supercat graceguo-supercat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please resolve conflicts. otherwise LGTM.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching?

Comment on lines -49 to +72
if cache_manager.tables_cache:

def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any:
if not kwargs.get("cache", True):
return f(self, *args, **kwargs)

if attribute_in_key:
cache_key = key(*args, **kwargs).format(
getattr(self, attribute_in_key)
)
else:
cache_key = key(*args, **kwargs)
o = cache_manager.tables_cache.get(cache_key)
if not kwargs.get("force") and o is not None:
return o
o = f(self, *args, **kwargs)
cache_manager.tables_cache.set(
cache_key, o, timeout=kwargs.get("cache_timeout")
)
return o

else:
# noop
def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any:
def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any:
if not kwargs.get("cache", True):
return f(self, *args, **kwargs)

cache_key = key(self, *args, **kwargs)
obj = cache.get(cache_key)
if not kwargs.get("force") and obj is not None:
return obj
obj = f(self, *args, **kwargs)
cache.set(cache_key, obj, timeout=kwargs.get("cache_timeout"))
return obj

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@ktmud
Copy link
Member Author

ktmud commented Nov 12, 2020

LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching?

I think it is referring to this logic here: https://github.com/apache/incubator-superset/blob/600a6fa92a0bbe5bfd93371db5ced6af556c3697/superset/viz.py#L423-L433

@villebro
Copy link
Member

LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching?

I think it is referring to this logic here:

https://github.com/apache/incubator-superset/blob/600a6fa92a0bbe5bfd93371db5ced6af556c3697/superset/viz.py#L423-L433

Ah sorry, I misinterpreted the comment, never mind.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good layer of optional additional security, thanks for implementing.

Copy link
Member

@bkyryliuk bkyryliuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great

superset/utils/cache.py Outdated Show resolved Hide resolved
@ktmud ktmud force-pushed the data-cache branch 2 times, most recently from 0811d14 to 60283e2 Compare November 14, 2020 05:48
@ktmud ktmud merged commit 4cfcaeb into apache:master Nov 14, 2020
@ktmud ktmud deleted the data-cache branch November 14, 2020 06:35
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
)

* feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG

The corresponding cache will now also cache the query results.

* Slice use DATA_CACHE_CONFIG CACHE_DEFAULT_TIMEOUT

* Add test for default cache timeout

* rename FAR_FUTURE to ONE_YEAR_IN_SECS
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels risk:breaking-change Issues or PRs that will introduce breaking changes size/XL 🚢 1.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants