Multithreading `diff_tables()` for Comparing Many Tables #52

alex-mirkin · 2024-10-21T22:11:09Z

I need to compare a large number of tables: several hundred in total, of which about 90% are small (roughly 100 rows each).
To optimize performance, I attempted using ThreadPoolExecutor with 32 workers, where a new connection is created and closed within each task.
However, many threads failed, resulting in the following errors:

cannot schedule new futures after shutdown
250002 (08003): Connection is closed
390111: Session no longer exists. New login required to access the service.

Is there a better way to parallelize the comparison of a large number of tables?

erezsh · 2024-10-22T05:57:18Z

Maintainer

That sounds like a bug. Can you provide a minimal example that produces such errors?

Meanwhile, perhaps you can sidestep this issue by using multiprocessing instead of threading (i.e. running each diff in a separate subprocess).
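
A minimal sketch of the multiprocessing route, assuming each table can be compared independently. `compare_all` and its `compare_one` argument are hypothetical names, not part of the library: `compare_one` stands in for a function that opens its own connections, runs `diff_tables()`, and returns a picklable result. The `executor_cls` parameter is only there so the executor can be swapped.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def compare_all(tables, compare_one, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run compare_one(table) for every table, one task per table.

    With the default ProcessPoolExecutor each task runs in a worker process,
    so connections opened inside compare_one are never shared between tasks.
    compare_one must be picklable (e.g. a module-level function) and each
    table must be picklable and hashable (e.g. a string or a tuple).
    """
    results, errors = {}, {}
    with executor_cls(max_workers=max_workers) as executor:
        future_to_table = {executor.submit(compare_one, t): t for t in tables}
        for future in as_completed(future_to_table):
            table = future_to_table[future]
            try:
                results[table] = future.result()
            except Exception as exc:  # keep going if a single table fails
                errors[table] = exc
    return results, errors
```

On platforms where worker processes are started with the spawn method, the call to `compare_all` should sit under an `if __name__ == "__main__":` guard.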

alex-mirkin · 2024-10-29T11:07:46Z

Author

Thanks, the multiprocessing approach does the trick for me.

Here is a minimal example that produces the errors with ThreadPoolExecutor (Python 3.9.2):

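import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice

# connect, TableSegment and diff_tables come from the diffing library;
# Table, mysql_conn_details and snowflake_conn_details are defined elsewhere.


def compare_table(mysql_table: Table):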
    # create db connections
    mysql_conn = connect(mysql_conn_details, thread_count=4)
    snowflake_conn = connect(snowflake_conn_details)

    table_path = (mysql_table.schema_name, mysql_table.table_name)
    pk_col_list = tuple(mysql_table.primary_key_column_names)
    extra_columns = tuple(mysql_table.non_primary_key_column_names)

    # describe the same table on each side as a TableSegment
    table_seg_mysql = TableSegment(
        database=mysql_conn,
        table_path=table_path,
        key_columns=pk_col_list,
        extra_columns=extra_columns,
        case_sensitive=False,
    )
    table_seg_snowflake = TableSegment(
        database=snowflake_conn,
        table_path=tuple(map(str.upper, table_path)),
        key_columns=pk_col_list,
        extra_columns=extra_columns,
        case_sensitive=False,
    )

    differ = diff_tables(table_seg_mysql, table_seg_snowflake, max_threadpool_size=8, bisection_factor=16)
    differ_limited = islice(differ, 10000)
    for sign, row in differ_limited:
        logging.info("Difference found: %s %s", sign, row)


def _get_tables():
    # placeholder: returns the list of tables to compare, each with its
    # metadata (key_columns, extra_columns); `tables` is built elsewhere
    return tables


mysql_tables = _get_tables()

with ThreadPoolExecutor(max_workers=4) as executor:
    future_to_table = {executor.submit(compare_table, t): t for t in mysql_tables}
    for future in as_completed(future_to_table):
        try:
            result = future.result()
            logging.info("Comparison result: %s", result)
        except Exception as e:
            logging.error("Error comparing table: %s", e)

3 replies

erezsh Oct 29, 2024
Maintainer

Can you please try it with shared=False?

    mysql_conn = connect(mysql_conn_details, thread_count=4, shared=False)
    snowflake_conn = connect(snowflake_conn_details, shared=False)

alex-mirkin Nov 8, 2024
Author

I can confirm that using shared=False and moving the connection outside my multithreaded compare_table() method resolves the issue.
Can you elaborate on how the threads are managed in this scenario?
For example, I'm using:

ThreadPoolExecutor(max_workers=4)
connect(mysql_conn_details, thread_count=4, shared=False)
max_threadpool_size=8 for diff_tables() (inside the multithreaded method)

What is the maximum number of connections to MySQL in this case: 4*4=16?
And why is it recommended that max_threadpool_size be 2x the thread_count?
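
The arrangement described in this reply (connections opened once, outside the worker function, and reused by every thread) could be sketched roughly as follows. This is an illustration under assumptions, not the library's API: `compare_with_shared_connections`, `open_connections` and `compare_one` are hypothetical names, and the connection objects are only assumed to have a `close()` method.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compare_with_shared_connections(tables, open_connections, compare_one, max_workers=4):
    """Open connections once, reuse them in every worker thread, close them last.

    open_connections() -> (source_conn, target_conn)   # hypothetical callable
    compare_one(table, source_conn, target_conn)       # hypothetical callable
    """
    source_conn, target_conn = open_connections()
    results, errors = {}, {}
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            future_to_table = {
                executor.submit(compare_one, t, source_conn, target_conn): t
                for t in tables
            }
            for future in as_completed(future_to_table):
                table = future_to_table[future]
                try:
                    results[table] = future.result()
                except Exception as exc:  # record the failure, keep going
                    errors[table] = exc
    finally:
        # Close only after the executor has shut down, so no running task
        # ever sees a connection that has already been closed.
        source_conn.close()
        target_conn.close()
    return results, errors
```

The point of the design is the lifetime of the connections: nothing inside a task closes them, so one finished task cannot invalidate a connection that another task is still using.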

erezsh Nov 10, 2024
Maintainer

It looks like it's happening because somewhere the shared connection gets closed.

I'm a little fuzzy on the exact details (I need to re-read the code), but I'll answer what I can.

When you connect(), the thread_count determines how many threads will communicate with the database.

In diff_tables(), the threadpool size determines how many threads will be used to manage the algorithm. Each such thread should occupy up to one database thread at a time. That's where the 2x suggestion comes from: the idea is to have 1 algorithm thread for each database thread, since there are two databases. (Of course, that only applies to hashdiff between two different connections.)

But if you use the same connection to run multiple diffs from multiple threads, the 2x recommendation is no longer relevant. (Maybe the best value would be 2x divided by the number of workers? Not sure.)

Maximum number of connections should be max_workers * thread_count, so yeah probably 16 in this case.
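
The arithmetic in this reply can be written out as a small sketch. The two helper names are hypothetical, and the formulas simply restate the estimates given above; they have not been checked against the library's code.

```python
def max_db_connections(max_workers: int, thread_count: int) -> int:
    """Estimated upper bound on connections to one database:
    each of max_workers workers may use up to thread_count database threads."""
    return max_workers * thread_count


def suggested_threadpool_size(thread_count: int, databases: int = 2) -> int:
    """The 2x rule of thumb: one algorithm thread per database thread,
    for each of the two databases being compared."""
    return databases * thread_count


# Values from this thread: max_workers=4, thread_count=4
print(max_db_connections(4, 4))      # 16
print(suggested_threadpool_size(4))  # 8
```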

Answer selected by alex-mirkin
