Multithreading diff_tables() for Comparing Many Tables #52
-
I need to compare a large number of tables: several hundred in total, of which about 90% are small (about 100 rows each).
Is there a better way to parallelize the comparison of a large number of tables?
Replies: 2 comments 3 replies
-
That sounds like a bug. Can you provide a minimal example that produces such errors? Meanwhile, perhaps you can sidestep this issue by multiprocessing instead of threading (i.e., run each diff in a separate subprocess).
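For illustration, here is a minimal sketch of the one-diff-per-subprocess idea, not a tested recipe. The job tuple layout, `diff_one_pair`, and `diff_many` are hypothetical names made up for this example. The `connect_to_table` and `diff_tables` calls are assumed to match the data-diff Python API of the installed version, so check them before relying on this.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def diff_one_pair(job):
    """Diff one pair of tables inside a worker process.

    `job` is a hypothetical (uri1, table1, uri2, table2, key_column) tuple.
    """
    # Imported inside the worker so every process opens its own
    # connections instead of sharing one with the parent.
    # Assumption: these two names exist in the installed data-diff version.
    from data_diff import connect_to_table, diff_tables

    uri1, name1, uri2, name2, key = job
    table1 = connect_to_table(uri1, name1, key)
    table2 = connect_to_table(uri2, name2, key)
    # Materialize the diff so that only plain rows travel back to the parent.
    return name1, list(diff_tables(table1, table2))


def diff_many(jobs, worker=diff_one_pair, max_workers=4,
              executor_cls=ProcessPoolExecutor):
    """Run `worker` once per job in a pool and collect results by table name."""
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, job) for job in jobs]
        for future in as_completed(futures):
            name, rows = future.result()  # re-raises any worker exception
            results[name] = rows
    return results


# Hypothetical usage (call it under `if __name__ == "__main__":`):
#   jobs = [(URI_A, t, URI_B, t, "id") for t in table_names]
#   diffs = diff_many(jobs, max_workers=8)
```

Since most of the tables are tiny, the per-process startup and connection cost may dominate, so a modest `max_workers` is probably a reasonable starting point.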
-
Thanks, the multiprocessing approach does the trick for me. Here is a minimal example that produces the errors with
It looks like it's happening because somewhere the shared connection gets closed.
I'm a little fuzzy on the exact details (I need to re-read the code), but I'll answer what I can.
When you connect(), the thread_count determines how many threads will communicate with the database.
In diff_tables(), the threadpool size determines how many threads will be used to manage the algorithm. Each such thread should occupy up to one database thread at a time. That's where the 2x suggestion comes from: the idea is to have one algorithm thread for each database thread, and there are two databases. (Of course, that only applies to hashdiff between two different connections.)
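To make the 2x rule of thumb concrete for the two-different-connections case, here is a rough sketch. The helper names `suggested_threadpool_size` and `diff_two_databases` are hypothetical, and the keyword arguments `thread_count`, `threaded`, and `max_threadpool_size` are assumed from the data-diff Python API, so verify them against the version you have installed.

```python
def suggested_threadpool_size(db_thread_count: int) -> int:
    """2x rule of thumb: one algorithm thread per database thread,
    with two databases each using `db_thread_count` threads."""
    if db_thread_count < 1:
        raise ValueError("db_thread_count must be at least 1")
    return 2 * db_thread_count


def diff_two_databases(uri1, uri2, table_name, key_column="id",
                       db_thread_count=4):
    """Hashdiff one table across two different connections (sketch)."""
    # Assumption: these names and keyword arguments exist in the
    # installed data-diff version.
    from data_diff import connect_to_table, diff_tables

    # thread_count: how many threads talk to each database.
    table1 = connect_to_table(uri1, table_name, key_column,
                              thread_count=db_thread_count)
    table2 = connect_to_table(uri2, table_name, key_column,
                              thread_count=db_thread_count)
    # max_threadpool_size: how many threads drive the diff algorithm.
    return list(diff_tables(
        table1, table2,
        threaded=True,
        max_threadpool_size=suggested_threadpool_size(db_thread_count),
    ))
```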
But if you use the same connection…