Table Metadata Lock by long populate calls #1030

Closed
samuelbray32 opened this issue Jul 12, 2024 · 5 comments
Labels
bug (Something isn't working), common, infrastructure (Unix, MySQL, etc. settings/issues impacting users)

Comments

@samuelbray32
Collaborator

Describe the bug

  • User 1 executes a long-running populate call (e.g. spike sorting on a long interval with limited compute resources)

  • When populate begins, it starts a transaction with this query in DataJoint: self.query("START TRANSACTION WITH CONSISTENT SNAPSHOT")

    • The transaction takes shared metadata locks on the tables it references to ensure consistency during the process
      • In particular, on highly referenced tables like AnalysisNwbfile
    • These locks persist through the entire populate call
    • They allow reads (and apparently inserts), since those don't change the table definition
    • They block altering the table, e.g. to add new foreign key references
  • User 2 attempts to declare a compute table with a reference to AnalysisNwbfile

    • The declaration requests an exclusive metadata lock on AnalysisNwbfile to add the new foreign key reference
    • This lock is incompatible with User 1's shared lock
    • The request stalls as a pending lock until the populate call finishes executing
  • User 3+ attempts to access data via fetch_nwb

    • Requires a shared metadata lock on AnalysisNwbfile
    • This would be compatible with populate's lock on its own
    • However, the request is queued behind User 2's pending exclusive lock for the table declaration
    • The fetch_nwb call stalls until User 2's pending lock times out and exits the queue (a minimal reproduction sketch follows below)
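
A minimal sketch of how this blocking chain might be reproduced with three independent MySQL connections. This is illustrative only: the host and credentials are placeholders, and the ALTER statement stands in for the DDL issued by a table declaration.

import datajoint as dj

# Three independent connections (host/credentials are placeholders)
conn1 = dj.Connection("db.example.org", "user1", "password1")  # long populate
conn2 = dj.Connection("db.example.org", "user2", "password2")  # table declaration
conn3 = dj.Connection("db.example.org", "user3", "password3")  # fetch_nwb reader

# User 1: open the snapshot transaction and read AnalysisNwbfile.
# The shared metadata lock taken here is held until COMMIT/ROLLBACK.
conn1.query("START TRANSACTION WITH CONSISTENT SNAPSHOT;")
conn1.query("SELECT * FROM common_nwbfile.analysis_nwbfile LIMIT 1;").fetchall()

# User 2 (run from a separate session; it will hang until conn1 commits):
# any DDL on the table requests an exclusive metadata lock and queues behind User 1.
# conn2.query("ALTER TABLE common_nwbfile.analysis_nwbfile COMMENT = 'illustrative DDL';")

# User 3 (separate session): a plain read is compatible with User 1's shared lock,
# but it queues behind User 2's pending exclusive request and stalls as well.
# conn3.query("SELECT * FROM common_nwbfile.analysis_nwbfile LIMIT 1;").fetchall()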

Symptoms

  • Multiple users reporting an inability to access data with fetch_nwb
  • Stalls/timeouts when declaring a new table

Diagnostic Tools
This error was most effectively debugged at the MySQL level. Useful queries are provided here for future reference.

Watch active SQL processes (requires SQL admin privileges to view other users' processes):
from time import sleep
from IPython.display import clear_output
import datajoint as dj

# Poll the server and print all active connections/queries every 2 seconds
for i in range(100):
    clear_output(wait=True)
    for process in dj.conn().query("SHOW FULL PROCESSLIST;").fetchall():
        print(process)
    sleep(2)
Get information on processes holding or pending locks on a table:
def execute_query_and_fetch_results(query):
    """
    Execute the given query and fetch the results.

    Parameters
    ----------
    query : str
        The SQL query to execute.

    Returns
    -------
    list of tuples
        The results of the query.
    """
    results = dj.conn().query(query).fetchall()
    return results

# Check if performance schema is enabled
performance_schema_query = "SHOW VARIABLES LIKE 'performance_schema';"
performance_schema_status = execute_query_and_fetch_results(performance_schema_query)
print("Performance Schema Status:")
for status in performance_schema_status:
    print(status)

# Replace 'common_nwbfile' and 'analysis_nwbfile' with your actual database and table name
database_name = 'common_nwbfile'  # the datajoint schema name
table_name = 'analysis_nwbfile'  # the table name

# Query metadata locks along with owner and user information
metadata_locks_query = f"""
SELECT 
    ml.OBJECT_TYPE, 
    ml.OBJECT_SCHEMA, 
    ml.OBJECT_NAME, 
    ml.LOCK_TYPE, 
    ml.LOCK_STATUS, 
    ml.OWNER_THREAD_ID,
    t.PROCESSLIST_ID, 
    t.PROCESSLIST_USER, 
    t.PROCESSLIST_HOST, 
    t.PROCESSLIST_DB, 
    t.PROCESSLIST_COMMAND, 
    t.PROCESSLIST_TIME, 
    t.PROCESSLIST_STATE, 
    t.PROCESSLIST_INFO
FROM 
    performance_schema.metadata_locks AS ml
JOIN 
    performance_schema.threads AS t 
ON 
    ml.OWNER_THREAD_ID = t.THREAD_ID
WHERE 
    ml.OBJECT_SCHEMA = '{database_name}' 
    AND ml.OBJECT_NAME = '{table_name}';
"""

metadata_locks = execute_query_and_fetch_results(metadata_locks_query)

# Display metadata locks with additional information
print("Metadata Locks with Owner and User Information:")
for lock in metadata_locks:
    print(lock)
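
If a single pending exclusive lock (e.g. the stalled table declaration) is blocking everyone else, an admin can clear the queue by killing that connection rather than waiting for the timeout. A minimal sketch, assuming admin privileges; the ID below is a placeholder for the PROCESSLIST_ID returned by the metadata-locks query above.

# Placeholder: substitute the PROCESSLIST_ID of the connection holding or
# pending the problematic lock, as identified in the query above
blocking_process_id = 12345
dj.conn().query(f"KILL {blocking_process_id};")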

Looking Forward
This issue arose due to a particularly long (~24+ hour) populate call on a single row. However, factors that increase the likelihood of such an event are:

  • A growing number of foreign key references to AnalysisNwbfile in user custom tables (AnalysisNwbfile currently has ~180 children; a query to check this count is sketched below), increasing the odds of conflicting lock requests
  • An increase in users running many parallelized populate calls, which increases total table lock time
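
For reference, a sketch of how the current number of child tables referencing AnalysisNwbfile can be counted from information_schema (schema/table names as used in this issue):

import datajoint as dj

# Count distinct tables with a foreign key referencing common_nwbfile.analysis_nwbfile
n_children = dj.conn().query("""
    SELECT COUNT(DISTINCT TABLE_SCHEMA, TABLE_NAME)
    FROM information_schema.KEY_COLUMN_USAGE
    WHERE REFERENCED_TABLE_SCHEMA = 'common_nwbfile'
      AND REFERENCED_TABLE_NAME = 'analysis_nwbfile';
""").fetchall()[0][0]
print(f"AnalysisNwbfile currently has {n_children} child tables")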

Potential Redresses
1. Precompute results

  • DataJoint begins the transaction (and its shared locks) within populate prior to calling make.
  • For tables with long-running make functions, we could move the analysis from the make function to a new compute function (see the sketch after this list)
    • In the intercepted populate call, "pre-compute" the results prior to beginning the transaction and calling make, avoiding long-running transactions
    • Results could be stored in a TempAnalysisNwbfile table. make() would need to be passed a key to this table and would be responsible for moving the data into AnalysisNwbfile
    • This solution would synergize with ideas for a "recompute" functionality for memory management (Recompute method in computed tables, #917)
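
A rough sketch of what this pattern might look like. All names here (the example schema, Session, TempAnalysisResult, precompute) are hypothetical and simplified, not existing spyglass or DataJoint API; the point is only that the expensive work happens before populate opens its transaction.

import datajoint as dj

schema = dj.schema("example_precompute")  # hypothetical schema for illustration


@schema
class Session(dj.Manual):
    definition = """
    session_id: varchar(32)
    """


@schema
class TempAnalysisResult(dj.Manual):
    """Hypothetical staging table holding results computed outside a transaction."""
    definition = """
    -> Session
    ---
    result_path: varchar(255)
    """


@schema
class LongComputation(dj.Computed):
    definition = """
    -> Session
    ---
    result_path: varchar(255)
    """

    def precompute(self, key):
        # Hours of work happen here, with no open transaction and therefore no
        # long-held shared metadata locks on heavily referenced tables.
        result_path = f"/stub/results/{key['session_id']}.nwb"  # placeholder analysis
        TempAnalysisResult.insert1({**key, "result_path": result_path})

    def make(self, key):
        # Called inside populate's transaction: only move the staged result,
        # so locks are held for seconds rather than hours.
        staged = (TempAnalysisResult & key).fetch1()
        self.insert1(staged)


# Stage results first, then run the (now short) transactional populate
for key in (Session - LongComputation).fetch("KEY"):
    LongComputation().precompute(key)
LongComputation.populate()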

2. Database scheduling

  • Socially define that long-running populate calls (or, alternatively, table declarations) should happen at certain times (e.g. evenings, weekends)
  • Harder to manage with a growing number of users

3. Reduce foreign key references to a single table

  • As in "Peripheral tables cause issues: deletes, field name uniqueness, etc." (#963), the issue arises from the centrality of some tables in the network
  • Could be alleviated by having multiple AnalysisNwbfile_{pipeline} tables (see the sketch after this list)
    • Make AnalysisNwbfile a mixin class
    • Implement a table of this class for each pipeline
    • Since users don't interact with AnalysisNwbfile as directly, this solution might avoid the concerns raised about similar ideas for IntervalList ("IntervalList cautious insert", #943)
    • Would require something like a merge table to group the different sources together for AnalysisNwbfileKachery
  • Could consider other SQL partitioning strategies (e.g. sharding of high-centrality tables)
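
A minimal sketch of the mixin idea, assuming hypothetical schema, table, and attribute names (the real AnalysisNwbfile definition is more involved); each pipeline would get its own concrete table sharing the mixin's definition.

import datajoint as dj

schema = dj.schema("example_analysis_split")  # hypothetical schema for illustration


class AnalysisNwbfileMixin:
    """Shared definition for per-pipeline analysis-file tables (simplified)."""
    definition = """
    analysis_file_name: varchar(255)
    ---
    analysis_file_abs_path: varchar(512)
    """


@schema
class AnalysisNwbfileSpikeSorting(AnalysisNwbfileMixin, dj.Manual):
    """Spike-sorting pipeline's analysis files; only spike-sorting tables reference this."""


@schema
class AnalysisNwbfileLFP(AnalysisNwbfileMixin, dj.Manual):
    """LFP pipeline's analysis files; long spike-sorting populates no longer lock these."""
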
samuelbray32 added the bug, common, and infrastructure labels on Jul 12, 2024
@edeno
Collaborator

edeno commented Jul 12, 2024

Solutions 1 and 3 definitely seem the most desirable. Decoding, spike sorting, and LFP extraction can all potentially have long-running computations, particularly with our longer recordings.

@samuelbray32
Collaborator Author

It's not a full solution, but as a temporary measure we could also decrease the lock-wait timeout on processes, to at least reduce how long pending declare/drop exclusive locks block table access (see the sketch below).
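
For reference, a minimal sketch of how an admin might inspect and lower the relevant MySQL timeouts. lock_wait_timeout governs how long statements wait for metadata locks; the 600-second value below is illustrative only, and changing globals requires the appropriate admin privileges.

import datajoint as dj

conn = dj.conn()

# Inspect the current metadata-lock and InnoDB row-lock wait timeouts (in seconds)
for row in conn.query(
    "SHOW VARIABLES WHERE Variable_name IN "
    "('lock_wait_timeout', 'innodb_lock_wait_timeout');"
).fetchall():
    print(row)

# Illustrative only: lower the metadata-lock wait timeout for new sessions
conn.query("SET GLOBAL lock_wait_timeout = 600;")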

@lfrank
Contributor

lfrank commented Aug 10, 2024

@CBroz1 @samuelbray32 I think we need to prioritize this for at least the most problematic tables (DLC and SpikeSorting I assume). In the meantime, what options do we have to decrease the lockout time?

@samuelbray32
Collaborator Author

@CBroz1 has implemented a method to avoid holding long transaction locks in #1067. If a table not currently flagged for the no transaction make begins causing issues, reference that PR for how to tag it.

@samuelbray32
Collaborator Author

OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

Adding the symptomatic error from the blocked table declaration to improve future searchability of this issue.
