Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deal with collation-related index corruption #14860

Merged

Conversation

ClearlyClaire
Copy link
Contributor

@ClearlyClaire ClearlyClaire commented Sep 22, 2020

A number of Mastodon instances have corrupted database indexes because of changing collation rules.

Such issues are most of the time discovered when trying to run some migrations (e.g., #14443) but can remain unnoticed for a long time, and cause issues that can be difficult to diagnose and repair.

To finally deal with that, we should:

  • make new databases be created with the C collation rules, which do not get updated and do not suffer from those issues
  • make a fixup script that deduplicates in the least damaging possible way
  • document the issue and the maintenance script
  • add a hook to db:migrate to display a warning if the database could be affected by the issue

@ClearlyClaire ClearlyClaire force-pushed the fixes/database-corruption-collation branch 3 times, most recently from f9ab8d3 to 2da9b04 Compare September 22, 2020 16:50
@ClearlyClaire
Copy link
Contributor Author

The script should be finished, although it may use some testing and maybe some optimization.

The documentation has been started in mastodon/documentation#811

@ClearlyClaire ClearlyClaire force-pushed the fixes/database-corruption-collation branch 4 times, most recently from 9614753 to 928adc1 Compare September 23, 2020 16:23
@ClearlyClaire ClearlyClaire changed the title [WiP] Deal with collation-related index corruption Deal with collation-related index corruption Sep 23, 2020
@ClearlyClaire ClearlyClaire marked this pull request as ready for review September 23, 2020 18:17
@ClearlyClaire ClearlyClaire force-pushed the fixes/database-corruption-collation branch from 928adc1 to 37b6caa Compare September 29, 2020 15:43
@ClearlyClaire
Copy link
Contributor Author

Changed the PR to leave the default database configuration as-is, because the C collation rules would cause issues with hashtags (which are stored with a particular localization yet are expected to be compared in a case-insensitive way).

Similarly removed the recommendation to re-create the database with the C collation type.

I have left the warning in the post-migration hook, since I think it is important to draw attention to the issue, but I am not sure it's a good idea since it would make it noisy for little value beyond the first time it is run.

@ClearlyClaire ClearlyClaire force-pushed the fixes/database-corruption-collation branch 2 times, most recently from 22f8198 to c6bfaec Compare October 27, 2020 10:51
@ClearlyClaire ClearlyClaire force-pushed the fixes/database-corruption-collation branch from c6bfaec to 80571a7 Compare November 10, 2020 17:54
@ClearlyClaire
Copy link
Contributor Author

ping @Gargron

@@ -48,6 +48,17 @@ namespace :db do
end
end

task :post_migration_hook do
at_exit do
unless %w(C POSIX).include?(ActiveRecord::Base.connection.execute('SELECT datcollate FROM pg_database WHERE datname = current_database();').first['datcollate'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so this would throw this warning on my server every time I run db:migrate, even though I have not updated glibc after indexing. This will probably scare a lot of people. We need to come up with something that only warns you if you actually have an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I think the only reliable way to check that there is actually an issue is to use amcheck. However, there are two issues with that:

  1. it will impact performances for a few seconds to a few minutes
  2. it requires superuser access to the database, which Mastodon typically doesn't have

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove that altogether and instead just rely on the documentation and release notes… maybe catch some specific migration errors that are likely to be caused by this… but I'm afraid those solutions would not be enough to convince people to look into silent corruptions…

@prompt.warn 'Please make sure to stop Mastodon and have a backup.'
exit(1) unless @prompt.yes?('Continue?')

deduplicate_accounts!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of our CLI-related classes are like this, but here I'm starting to wish we could extract things into their own classes and then just call them, e.g. AccountDeduplicator, CustomEmojiDeduplicator, etc. Don't think I'm going to ask you to do it right now though. Bonus of that would be each deduplicator class maintaining its own model stubs so you know exactly what needs what.

lib/mastodon/maintenance_cli.rb Outdated Show resolved Hide resolved
This tool goes through the database to detect and fix duplicates.
This operation is very slow and may cause data loss (of data that would be
inaccessible without intervention because of the existing index corruptions).
It tries its best to make sensible decisions, and asks the user in some cases.
Avoids followers hash cache being incorrect, among other things
@ClearlyClaire ClearlyClaire force-pushed the fixes/database-corruption-collation branch from 80571a7 to a94e394 Compare November 11, 2020 11:20
@Gargron Gargron merged commit 1242e57 into mastodon:master Nov 19, 2020
@OccultWarlock
Copy link

Starting up my instance now displays the warning of unsafe database storage, or whatever the wording is. It instructs me to go to the following link, which leads to a 404 error

https://docs.joinmastodon.org/admin/troubleshooting/index-corruption/#am-i-affected

What is the proper command to run (in docker) to fix the issue. My server no longer loads, throwing up 502 errors.

@Gargron
Copy link
Member

Gargron commented Nov 19, 2020

That's very quick. Well. The warning is just a warning, and I did say to @ThibG it should not be too scary. The warning doesn't do anything, it definitely isn't the reason why you're getting 502 errors, you probably didn't restart everything (in any case there is no need to guess just look at what the logs say)

@OccultWarlock
Copy link

That's very quick. Well. The warning is just a warning, and I did say to @ThibG it should not be too scary. The warning doesn't do anything, it definitely isn't the reason why you're getting 502 errors, you probably didn't restart everything (in any case there is no need to guess just look at what the logs say)

Is this error related?

docker-compose run --rm web bin/tootctl feeds build
Creating mastodon_web_run ... done
Error processing 32015: PG::IoError: ERROR: could not open file "base/12404/172861": I/O error
: SELECT "accounts".* FROM "accounts" WHERE "accounts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25)
Error processing 24: PG::IoError: ERROR: could not open file "base/12404/1272365": I/O error
: SELECT "polls".* FROM "polls" WHERE "polls"."id" = $1
Error processing 1: PG::IoError: ERROR: could not open file "base/12404/173151": I/O error
: SELECT "mentions"."status_id", "mentions"."account_id" FROM "mentions" WHERE "mentions"."silent" = $1 AND "mentions"."status_id" IN ($2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78, $79, $80, $81, $82, $83, $84, $85, $86, $87, $88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99, $100, $101, $102, $103, $104, $105, $106, $107, $108, $109, $110, $111, $112, $113, $114, $115, $116, $117, $118, $119, $120, $121, $122, $123, $124, $125, $126, $127, $128, $129, $130, $131, $132, $133, $134, $135, $136, $137, $138, $139, $140, $141, $142, $143, $144, $145, $146, $147, $148, $149, $150, $151, $152, $153, $154, $155, $156, $157, $158, $159, $160, $161, $162, $163, $164, $165, $166, $167, $168, $169, $170, $171, $172, $173, $174, $175, $176, $177, $178, $179, $180, $181, $182, $183, $184, $185, $186, $187, $188, $189, $190, $191, $192, $193, $194, $195, $196, $197, $198, $199, $200, $201, $202, $203, $204, $205, $206, $207, $208, $209, $210, $211, $212, $213, $214, $215, $216, $217, $218, $219, $220, $221, $222, $223, $224, $225, $226, $227, $228, $229, $230, $231, $232, $233, $234, $235, $236, $237, $238, $239, $240, $241, $242, $243, $244, $245, $246, $247, $248, $249, $250, $251, $252)
Error processing 131049: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
6/6 |=======================================================================================================================================================| Time: 00:00:56
Regenerated feeds for 6 accounts

@Gargron
Copy link
Member

Gargron commented Nov 19, 2020

PG::IoError: ERROR: could not open file "base/12404/172861": I/O error

This would be the reason your server doesn't start and it looks like actual filesystem-level corruption (unrelated to index corruption and the warning generated by this PR). Judging by:

Redis is configured to save RDB snapshots, but it is currently not able to persist on disk

Something is wrong with your disks -- maybe full, maybe broken, it doesn't sound good.

@OccultWarlock
Copy link

I just want to make sure I don’t break anything - how would I run to command to fix the database (command found in the link) while using Docker to run my instance? - thanks!

ClearlyClaire added a commit to ClearlyClaire/mastodon that referenced this pull request Dec 6, 2020
* Add tootctl maintenance fix-duplicates

This tool goes through the database to detect and fix duplicates.
This operation is very slow and may cause data loss (of data that would be
inaccessible without intervention because of the existing index corruptions).
It tries its best to make sensible decisions, and asks the user in some cases.

* Add warning message in db:migrate hook

* Clear Rails cache after being done with database deduplication

Avoids followers hash cache being incorrect, among other things
Gargron added a commit that referenced this pull request Dec 18, 2020
* Fix 2FA/sign-in token sessions being valid after password change (#14802)

If someone tries logging in to an account and is prompted for a 2FA
code or sign-in token, even if the account's password or e-mail is
updated in the meantime, the session will show the prompt and allow
the login process to complete with a valid 2FA code or sign-in token

* Fix Move handler not being triggered when failing to fetch target (#15107)

When failing to fetch the target account, the ProcessingWorker fails
as expected, but since it hasn't cleared the `move_in_progress` flag,
the next attempt at processing skips the `Move` activity altogether.

This commit changes it to clear the flag when encountering any
unexpected error on fetching the target account. This is likely to
occur because, of, e.g., a timeout, when many instances query the
same actor at the same time.

* Fix slow distinct queries where grouped queries are faster (#15287)

About 2x speed-up on inboxes query

* Fix possible inconsistencies in tag search (#14906)

Do not downcase the queried tag before passing it to postgres when searching:
- tags are not downcased on creation
- `arel_table[:name].lower.matches(pattern)` generates an ILIKE anyway
- if Postgres and Rails happen to use different case-folding rules,
  downcasing before query but not before insertion may mean that some
  tags with some casings are not searchable

* Fix updating account counters when account_stat is not yet created (#15108)

* Fix account processing failing because of large collections (#15027)

Fixes #15025

* Fix downloading remote media files when server returns empty filename (#14867)

Fixes #14817

* Fix webfinger redirect handling in ResolveAccountService (#15187)

* Fix webfinger redirect handling in ResolveAccountService

ResolveAccountService#process_webfinger! handled a one-step webfinger
redirection, but only accepting the result if it matched the exact URI passed
as input, defeating the point of a redirection check.

Instead, use the same logic as in `ActivityPub::FetchRemoteAccountService`,
updating the resulting `acct:` URI with the result of the first webfinger
query.

* Add tests

* Remove dependency on unused and unmaintained http_parser.rb gem (#14574)

It seems that years ago, the “http” gem dependend on the “http_parser.rb” gem
(it now depends on the “http-parser” gem), and, still years ago, we pulled
it from git in order to benefit from a bugfix that wasn't released yet (#7467).

* Add tootctl maintenance fix-duplicates (#14860, #15201, #15264, #15349, #15359)

* Fix old migration script not being able to run if it fails midway (#15361)

* Fix old migration script not being able to run if it fails midway

Improve the robustness of a migration script likely to fail because of database
corruption so it can run again once database corruptions are fixed.

* Display a specific error message in case of index corruption

Co-authored-by: Eugen Rochko <[email protected]>
Co-authored-by: Claire <[email protected]>

Co-authored-by: Eugen Rochko <[email protected]>
Co-authored-by: Claire <[email protected]>
shouo1987 pushed a commit to CrossGate-Pawoo/mastodon that referenced this pull request Dec 7, 2022
* Fix 2FA/sign-in token sessions being valid after password change (mastodon#14802)

If someone tries logging in to an account and is prompted for a 2FA
code or sign-in token, even if the account's password or e-mail is
updated in the meantime, the session will show the prompt and allow
the login process to complete with a valid 2FA code or sign-in token

* Fix Move handler not being triggered when failing to fetch target (mastodon#15107)

When failing to fetch the target account, the ProcessingWorker fails
as expected, but since it hasn't cleared the `move_in_progress` flag,
the next attempt at processing skips the `Move` activity altogether.

This commit changes it to clear the flag when encountering any
unexpected error on fetching the target account. This is likely to
occur because, of, e.g., a timeout, when many instances query the
same actor at the same time.

* Fix slow distinct queries where grouped queries are faster (mastodon#15287)

About 2x speed-up on inboxes query

* Fix possible inconsistencies in tag search (mastodon#14906)

Do not downcase the queried tag before passing it to postgres when searching:
- tags are not downcased on creation
- `arel_table[:name].lower.matches(pattern)` generates an ILIKE anyway
- if Postgres and Rails happen to use different case-folding rules,
  downcasing before query but not before insertion may mean that some
  tags with some casings are not searchable

* Fix updating account counters when account_stat is not yet created (mastodon#15108)

* Fix account processing failing because of large collections (mastodon#15027)

Fixes mastodon#15025

* Fix downloading remote media files when server returns empty filename (mastodon#14867)

Fixes mastodon#14817

* Fix webfinger redirect handling in ResolveAccountService (mastodon#15187)

* Fix webfinger redirect handling in ResolveAccountService

ResolveAccountService#process_webfinger! handled a one-step webfinger
redirection, but only accepting the result if it matched the exact URI passed
as input, defeating the point of a redirection check.

Instead, use the same logic as in `ActivityPub::FetchRemoteAccountService`,
updating the resulting `acct:` URI with the result of the first webfinger
query.

* Add tests

* Remove dependency on unused and unmaintained http_parser.rb gem (mastodon#14574)

It seems that years ago, the “http” gem dependend on the “http_parser.rb” gem
(it now depends on the “http-parser” gem), and, still years ago, we pulled
it from git in order to benefit from a bugfix that wasn't released yet (mastodon#7467).

* Add tootctl maintenance fix-duplicates (mastodon#14860, mastodon#15201, mastodon#15264, mastodon#15349, mastodon#15359)

* Fix old migration script not being able to run if it fails midway (mastodon#15361)

* Fix old migration script not being able to run if it fails midway

Improve the robustness of a migration script likely to fail because of database
corruption so it can run again once database corruptions are fixed.

* Display a specific error message in case of index corruption

Co-authored-by: Eugen Rochko <[email protected]>
Co-authored-by: Claire <[email protected]>

Co-authored-by: Eugen Rochko <[email protected]>
Co-authored-by: Claire <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants