Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix not started database shutdown deadlock #59137

Merged
merged 5 commits into from
Jan 26, 2024

Conversation

serxa
Copy link
Member

@serxa serxa commented Jan 23, 2024

There was no way for a database that was not started up to properly shutdown. It led to the following deadlock in the main thread:

| 2    0x000000000c7bd8ca in mainEntryClickHouseServer(int, char**) ()
| | 2    0x00000000153fbe99 in Poco::Util::ServerApplication::run(int, char**) ()
| | | 2    0x000000000c7c0a11 in DB::Server::run() ()
| | | | 2    0x00000000153f3046 in Poco::Util::Application::run() ()
| | | | | 2    0x000000000c7dec91 in DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) ()
| | | | | | 2    0x000000000c7e92bf in BasicScopeGuard<DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_2>::~BasicScopeGuard() ()
| | | | | | | 2    0x0000000010b8f342 in DB::ContextSharedPart::shutdown() ()
| | | | | | | | 2    0x0000000010c1ddde in DB::DatabaseCatalog::shutdownImpl() ()
| | | | | | | | | 2    0x0000000010841b11 in DB::DatabaseReplicated::shutdown() ()
| | | | | | | | | | 2    0x0000000010841acf in DB::DatabaseReplicated::stopReplication() ()
| | | | | | | | | | | 2    0x000000001082af5e in DB::DatabaseReplicated::waitDatabaseStarted(bool) const ()
| | | | | | | | | | | | 2    0x000000000c89d1b7 in DB::AsyncLoader::wait(std::__1::shared_ptr<DB::LoadJob> const&, bool) ()
| | | | | | | | | | | | | 2    0x000000000c89d856 in DB::AsyncLoader::wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::shared_ptr<DB::LoadJob> const&) ()
| | | | | | | | | | | | | | 2    0x00007f8b201caa41 in pthread_cond_wait () from /lib/x86_64-linux-gnu/libc.so.6
| | | | | | | | | | | | | | | 2    0x00007f8b201c8117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

I replaced waitDatabaseStarted(true) with a new function IDatabase::stopLoading() which is stronger. In particular, it cancels all pending load and startup tasks. Afterward, it waits for all currently running tasks to avoid races. This way it is now possible to shutdown a database that is not fully loaded. Also, the shutdown of a database stops all its loading activities.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix a deadlock that can happen during the shutdown of the server due to metadata loading failure.

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-bugfix Pull request with bugfix, not backported by default label Jan 23, 2024
@robot-ch-test-poll1
Copy link
Contributor

robot-ch-test-poll1 commented Jan 23, 2024

This is an automated comment for commit a40da92 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker server and keeper imagesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Fast testsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
SQLancerFuzzing tests that detect logical bugs with SQLancer tool✅ success
SqllogicRun clickhouse on the sqllogic test set against sqlite and checks that all statements are passed✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success
Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure

@vdimir vdimir self-assigned this Jan 24, 2024
@serxa
Copy link
Member Author

serxa commented Jan 25, 2024

Stress test (tsan)

known data race in AggregatedDataVariants #58437

Upgrade check (debug) — Possible deadlock on shutdown

another known problem #50933

Integration tests (asan, analyzer) [1/6] — fail: 1 — test_replicated_table_attach/test.py::test_startup_with_small_bg_pool_partitioned

This timeout looks a bit suspicious, let me investigate
UPD. I found nothing interesting in logs, unfortunately, lets check if it persists. UPD2. it does not reproduce

@serxa serxa merged commit 4fb511b into master Jan 26, 2024
248 of 252 checks passed
@serxa serxa deleted the fix-async-loader-shutdown-deadlock branch January 26, 2024 10:21
@serxa
Copy link
Member Author

serxa commented Jan 26, 2024

https://s3.amazonaws.com/clickhouse-test-reports/59137/a40da9276101a7569aabbddb35d666e078899472/stress_test__tsan_.html

Look like this is another deadlock (this time with async_load_databases=true), requires a separate fix:

2024-01-25 15:50:21 Thread 1 (Thread 0x7f2cd5fa3fc0 (LWP 24555) "clickhouse-serv"):
2024-01-25 15:50:21 #0  0x00005561d2e3d15a in __sanitizer::DeadlockDetectorTLS<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >::addLock(unsigned long, unsigned long, unsigned int) ()
2024-01-25 15:50:21 #1  0x00005561d2ed83d2 in __tsan::MutexPostLock(__tsan::ThreadState*, unsigned long, unsigned long, unsigned int, int) ()
2024-01-25 15:50:21 #2  0x00005561d2e60d14 in __interceptor_pthread_mutex_lock ()
2024-01-25 15:50:21 #3  0x00005561ec44d95a in std::__1::__libcpp_mutex_lock[abi:v15000](pthread_mutex_t*) (__m=0x7b44000f8ac0) at ./contrib/llvm-project/libcxx/include/__threading_support:304
2024-01-25 15:50:21 #4  std::__1::mutex::lock (this=0x7b44000f8ac0) at ./build_docker/./contrib/llvm-project/libcxx/src/mutex.cpp:38
2024-01-25 15:50:21 #5  0x00005561db2ea694 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v15000](std::__1::mutex&) (this=0x7fff8b72b260, __m=...) at ./contrib/llvm-project/libcxx/include/__mutex_base:122
2024-01-25 15:50:21 #6  ThreadPoolImpl<ThreadFromGlobalPoolImpl<false> >::wait (this=0x7b44000f8ac0) at ./build_docker/./src/Common/ThreadPool.cpp:294
2024-01-25 15:50:21 #7  0x00005561db37d3b1 in DB::AsyncLoader::wait (this=<optimized out>) at ./build_docker/./src/Common/AsyncLoader.cpp:237
2024-01-25 15:50:21 #8  DB::AsyncLoader::stop (this=<optimized out>) at ./build_docker/./src/Common/AsyncLoader.cpp:249
2024-01-25 15:50:21 #9  0x00005561db29fe5d in DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_15::operator()() const (this=<optimized out>) at ./build_docker/./programs/server/Server.cpp:1747
2024-01-25 15:50:21 #10 BasicScopeGuard<DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_15>::invoke() (this=<optimized out>) at ./base/base/../base/scope_guard.h:99
2024-01-25 15:50:21 #11 BasicScopeGuard<DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_15>::~BasicScopeGuard() (this=<optimized out>) at ./base/base/../base/scope_guard.h:48
2024-01-25 15:50:21 #12 0x00005561db293fbb in DB::Server::main (this=0x7fff8b72dc18) at ./build_docker/./programs/server/Server.cpp:2065
2024-01-25 15:50:21 #13 0x00005561e9a0eaff in Poco::Util::Application::run (this=0x7fff8b72dc18) at ./build_docker/./base/poco/Util/src/Application.cpp:315
2024-01-25 15:50:21 #14 0x00005561db27f662 in DB::Server::run (this=0x7fff8b72dc18) at ./build_docker/./programs/server/Server.cpp:413
2024-01-25 15:50:21 #15 0x00005561e9a2e855 in Poco::Util::ServerApplication::run (this=0x7fff8b72dc18, argc=6, argv=<optimized out>) at ./build_docker/./base/poco/Util/src/ServerApplication.cpp:131
2024-01-25 15:50:21 #16 0x00005561db27c7c4 in mainEntryClickHouseServer (argc=argc@entry=6, argv=argv@entry=0x7b0c00023e80) at ./build_docker/./programs/server/Server.cpp:219
2024-01-25 15:50:21 #17 0x00005561d2ee74cf in main (argc_=<optimized out>, argv_=<optimized out>) at ./build_docker/./programs/main.cpp:515
2024-01-25 15:50:21 [Inferior 1 (process 24555) detached]

Another thread:

2024-01-25 15:47:32 Thread 827 (Thread 0x7f2a39a39640 (LWP 25416) "AsyncMetrics"):
2024-01-25 15:47:32 #0  0x00007f2cd6125117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
2024-01-25 15:47:32 #1  0x00007f2cd6127a41 in pthread_cond_wait () from /lib/x86_64-linux-gnu/libc.so.6
2024-01-25 15:47:32 #2  0x00005561d2ee3b41 in __tsan::call_pthread_cancel_with_cleanup(int (*)(void*), void (*)(void*), void*) ()
2024-01-25 15:47:32 #3  0x00005561d2e5fc7d in __interceptor_pthread_cond_wait ()
2024-01-25 15:47:32 #4  0x00005561ec3cbb79 in std::__1::__libcpp_condvar_wait[abi:v15000](pthread_cond_t*, pthread_mutex_t*) (__cv=0x7b440057c818, __m=0x189) at ./contrib/llvm-project/libcxx/include/__threading_support:335
2024-01-25 15:47:32 #5  std::__1::condition_variable::wait (this=0x7b440057c818, lk=...) at ./build_docker/./contrib/llvm-project/libcxx/src/condition_variable.cpp:46
2024-01-25 15:47:32 #6  0x00005561db380a4b in std::__1::condition_variable::wait<DB::AsyncLoader::wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::shared_ptr<DB::LoadJob> const&)::$_1>(std::__1::unique_lock<std::__1::mutex>&, DB::AsyncLoader::wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::shared_ptr<DB::LoadJob> const&)::$_1) (this=0x7b440057c818, __lk=..., __pred=...) at ./contrib/llvm-project/libcxx/include/__mutex_base:398
2024-01-25 15:47:32 #7  DB::AsyncLoader::wait (this=this@entry=0x7b3c0000cd50, job_lock=..., job=...) at ./build_docker/./src/Common/AsyncLoader.cpp:748
2024-01-25 15:47:32 #8  0x00005561db38031a in DB::AsyncLoader::wait (this=0x7b3c0000cd50, job=..., no_throw=false) at ./build_docker/./src/Common/AsyncLoader.cpp:403
2024-01-25 15:47:32 #9  0x00005561e31d46de in DB::waitLoad (loader=..., jobs=..., no_throw=false) at ./src/Common/AsyncLoader.h:487
2024-01-25 15:47:32 #10 DB::waitLoad (pool_id=0, task=..., no_throw=false) at ./src/Common/AsyncLoader.h:529
2024-01-25 15:47:32 #11 DB::DatabaseTablesSnapshotIterator::table (this=<optimized out>) at ./src/Databases/IDatabase.h:116
2024-01-25 15:47:32 #12 0x00005561e4542edb in DB::ServerAsynchronousMetrics::updateImpl (this=0x7fff8b72bc10, update_time=..., current_time=..., force_update=false, first_run=false, new_values=...) at ./build_docker/./src/Interpreters/ServerAsynchronousMetrics.cpp:275
2024-01-25 15:47:32 #13 0x00005561db367543 in DB::AsynchronousMetrics::update (this=this@entry=0x7fff8b72bc10, update_time=update_time@entry=..., force_update=false) at ./build_docker/./src/Common/AsynchronousMetrics.cpp:1587
2024-01-25 15:47:32 #14 0x00005561db368fb9 in DB::AsynchronousMetrics::run (this=this@entry=0x7fff8b72bc10) at ./build_docker/./src/Common/AsynchronousMetrics.cpp:362
2024-01-25 15:47:32 #15 0x00005561db36b7eb in DB::AsynchronousMetrics::start()::$_0::operator()() const (this=<optimized out>) at ./build_docker/./src/Common/AsynchronousMetrics.cpp:276
2024-01-25 15:47:32 #16 std::__1::__invoke[abi:v15000]<DB::AsynchronousMetrics::start()::$_0&>(DB::AsynchronousMetrics::start()::$_0&) (__f=...) at ./contrib/llvm-project/libcxx/include/__functional/invoke.h:394
2024-01-25 15:47:32 #17 std::__1::__apply_tuple_impl[abi:v15000]<DB::AsynchronousMetrics::start()::$_0&, std::__1::tuple<>&>(DB::AsynchronousMetrics::start()::$_0&, std::__1::tuple<>&, std::__1::__tuple_indices<>) (__f=..., __t=...) at ./contrib/llvm-project/libcxx/include/tuple:1789
2024-01-25 15:47:32 #18 std::__1::apply[abi:v15000]<DB::AsynchronousMetrics::start()::$_0&, std::__1::tuple<>&>(DB::AsynchronousMetrics::start()::$_0&, std::__1::tuple<>&) (__f=..., __t=...) at ./contrib/llvm-project/libcxx/include/tuple:1798
2024-01-25 15:47:32 #19 ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::AsynchronousMetrics::start()::$_0>(DB::AsynchronousMetrics::start()::$_0&&)::{lambda()#1}::operator()() (this=0x7b08006666c0) at ./src/Common/ThreadPool.h:223
2024-01-25 15:47:32 #20 std::__1::__invoke[abi:v15000]<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::AsynchronousMetrics::start()::$_0>(DB::AsynchronousMetrics::start()::$_0&&)::{lambda()#1}&>(DB::AsynchronousMetrics::start()::$_0&&) (__f=...) at ./contrib/llvm-project/libcxx/include/__functional/invoke.h:394
2024-01-25 15:47:32 #21 std::__1::__invoke_void_return_wrapper<void, true>::__call<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::AsynchronousMetrics::start()::$_0>(DB::AsynchronousMetrics::start()::$_0&&)::{lambda()#1}&>(ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::AsynchronousMetrics::start()::$_0>(DB::AsynchronousMetrics::start()::$_0&&)::{lambda()#1}&) (__args=...) at ./contrib/llvm-project/libcxx/include/__functional/invoke.h:479
2024-01-25 15:47:32 #22 std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::AsynchronousMetrics::start()::$_0>(DB::AsynchronousMetrics::start()::$_0&&)::{lambda()#1}, void ()>::operator()[abi:v15000]() (this=0x7b08006666c0) at ./contrib/llvm-project/libcxx/include/__functional/function.h:235
2024-01-25 15:47:32 #23 std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::AsynchronousMetrics::start()::$_0>(DB::AsynchronousMetrics::start()::$_0&&)::{lambda()#1}, void ()> >(std::__1::__function::__policy_storage const*) (__buf=<optimized out>) at ./contrib/llvm-project/libcxx/include/__functional/function.h:716
2024-01-25 15:47:32 #24 0x00005561db2e80af in std::__1::__function::__policy_func<void ()>::operator()[abi:v15000]() const (this=0x7f2a39a32428) at ./contrib/llvm-project/libcxx/include/__functional/function.h:848
2024-01-25 15:47:32 #25 std::__1::function<void ()>::operator()() const (this=0x7f2a39a32428) at ./contrib/llvm-project/libcxx/include/__functional/function.h:1187
2024-01-25 15:47:32 #26 ThreadPoolImpl<std::__1::thread>::worker (this=this@entry=0x7b4400004b00, thread_it=...) at ./build_docker/./src/Common/ThreadPool.cpp:455
2024-01-25 15:47:32 #27 0x00005561db2edf72 in ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::{lambda()#2}::operator()() const (this=<optimized out>) at ./build_docker/./src/Common/ThreadPool.cpp:217
2024-01-25 15:47:32 #28 std::__1::__invoke[abi:v15000]<ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::{lambda()#2}>(ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::{lambda()#2}&&) (__f=...) at ./contrib/llvm-project/libcxx/include/__functional/invoke.h:394
2024-01-25 15:47:32 #29 std::__1::__thread_execute[abi:v15000]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::{lambda()#2}>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::{lambda()#2}>&, std::__1::__tuple_indices<>) (__t=...) at ./contrib/llvm-project/libcxx/include/thread:284
2024-01-25 15:47:32 #30 std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::{lambda()#2}> >(void*) (__vp=0x7b080066d2e0) at ./contrib/llvm-project/libcxx/include/thread:295
2024-01-25 15:47:32 #31 0x00005561d2e5ee8f in __tsan_thread_start_func ()
2024-01-25 15:47:32 #32 0x00007f2cd6128ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
2024-01-25 15:47:32 #33 0x00007f2cd61ba660 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

UPD. Fixed by #59308

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-backports-created-cloud pr-bugfix Pull request with bugfix, not backported by default pr-must-backport-cloud
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants