This repository has been archived by the owner on May 11, 2021. It is now read-only.

Race condition in __getitem__ #40

Closed
wants to merge 2 commits

Conversation

@Jaidan commented Oct 13, 2016

This should fix #18. Unfortunately I wasn't able to find a reasonable test to reproduce the behavior, but this change causes __getitem__ to recurse if a cross-thread reset is detected.

@adamchainz self-assigned this Oct 24, 2016
@adamchainz (Owner)

I'm not entirely sure about this change; it moves a lot of the internals around. Also, a recursive call to __getitem__ naturally scares me: what if it infinitely recurses? 😫

Can you explain the problem in a bit more depth? It should go in the HISTORY file at least 😸

Whilst looking at this I accidentally found your gargoyle fork at https://github.com/roverdotcom/gargoyle . If you're using that with django-modeldict-yplan I can't make any guarantees about compatibility.

@Jaidan (Author) commented Oct 26, 2016

Happy to give you a rundown. We're talking about a threading issue in Python, so sorry if this is a bit complex and long-winded =)

We ran into this issue because our production environment is a multithreaded, multiprocess uWSGI configuration: multiprocess to do most of the heavy concurrency lifting, but also multithreaded to limit worker exhaustion when blocking on IO (primarily database calls). When modeldict is used in a multithreaded environment, retrieval is subject to a race condition: the _local_cache can be cleared by one thread while another thread is checking a value.

Previously, setting _local_cache to None was used to "clear" the cache. This was a problem because __getitem__ assumed it could never be None after _populate was called, yet it was possible for the cache to be cleared by another thread during a GIL context switch that occurred after _populate returned. For example:

In our environment we make heavy use of gargoyle, which means the majority of requests check many flags and as a result frequently access the modeldict _local_cache. Every time we changed a flag's setting, the _post_save handler would be called by one thread (which calls _populate(reset=True)), and frequently this occurred while another thread was blocked after its own _populate call. The result was that changing a flag would frequently cause an AttributeError due to an attempt to look up a key on a NoneType object.
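
Schematically, the old pattern looked something like this. This is a simplified sketch of the behavior described above, not the library's actual code; the class name, the inline dict, and my_flag are all stand-ins:

```python
class OldCachedDict:
    """Simplified sketch of the old pattern; not django-modeldict's real code."""

    def __init__(self):
        self._local_cache = None  # None doubles as "cleared"

    def _populate(self, reset=False):
        if reset:
            # _post_save path: drop the cache so the next read refetches.
            self._local_cache = None
        if self._local_cache is None:
            self._local_cache = {'my_flag': True}  # stand-in for the DB fetch
        return self._local_cache

    def __getitem__(self, key):
        self._populate()
        # A GIL context switch here lets another thread run
        # _populate(reset=True); if we resume while _local_cache is still
        # None, the lookup below fails on a NoneType object.
        return self._local_cache[key]
```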

This PR addresses the issue by ensuring that _local_cache is always a dictionary object, with a separate flag (_local_last_updated) tracking the initialization state of _local_cache. Threading issues are avoided by always changing the flag and _local_cache in a safe order, and by always clearing via the clear_cache method.

Infinite recursion is theoretically possible with this implementation, though the conditions required to cause it would be pretty far-fetched for this use case. We only recurse if we get a KeyError while accessing _local_cache and _local_last_updated is None, a state that only occurs if one thread clears the cache while another thread is checking a value. If one were programmatically triggering updates over and over, one could cause another thread to recurse repeatedly. In reality, this will never recurse more than once (at least in the Gargoyle use case).
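
Putting the pieces together, the fixed pattern looks roughly like this (again a sketch under the same stand-in assumptions, not the PR's exact code):

```python
import time

class FixedCachedDict:
    """Sketch of the fixed pattern described above; illustrative only."""

    def __init__(self):
        self._local_cache = {}           # always a dict, never None
        self._local_last_updated = None  # None means "not populated yet"

    def clear_cache(self):
        # Safe ordering: mark the cache stale first, then empty it, so a
        # reader that hits a missing key also observes the stale flag.
        self._local_last_updated = None
        self._local_cache = {}

    def _populate(self, reset=False):
        if reset:
            self.clear_cache()
        if self._local_last_updated is None:
            self._local_cache = {'my_flag': True}  # stand-in for the real fetch
            self._local_last_updated = time.time()
        return self._local_cache

    def __getitem__(self, key):
        self._populate()
        try:
            return self._local_cache[key]
        except KeyError:
            if self._local_last_updated is None:
                # Another thread cleared the cache mid-read: retry once the
                # recursive call has repopulated it.
                return self[key]
            raise  # genuinely missing key
```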

We could alternatively have fixed this by doing a deepcopy on _local_cache and returning that from _populate. That would be a lot simpler and require far fewer internals to be adjusted, but it has the downside of requiring a deepcopy on every _populate call, which could be expensive in a project designed entirely to improve performance, so we elected to go with the more complicated solution.
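
For comparison, that rejected alternative would have looked roughly like this, replacing _populate in the first sketch above (hypothetical, since we didn't take this route):

```python
import copy

def _populate(self, reset=False):
    if reset or self._local_cache is None:
        self._local_cache = {'my_flag': True}  # stand-in fetch
    # Hand each caller its own snapshot so no reader can observe another
    # thread clearing the shared dict; __getitem__ would index this return
    # value instead of self._local_cache. Costs a deepcopy per call.
    return copy.deepcopy(self._local_cache)
```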

That fork was previously used as a quick-fix workaround for an earlier (much larger and worse) threading issue where the cache was not being properly invalidated before updates. We are no longer using it, since that issue was already fixed in #25 (I think that was the one).

@adamchainz (Owner)

Thanks for the long explanation! Since the test changes are pretty minor, I'm also confident this should work.

Merged in 7491f4a.

A note: at YPlan we had a similar setup to what you describe, but when we dropped the multiple threads and moved to 2 processes per core, it actually improved our performance. Worth a try! Threads are complicated.

@adamchainz closed this Oct 28, 2016
@adamchainz (Owner)

Deployed on PyPI at 1.5.4: https://pypi.python.org/pypi/django-modeldict-yplan
