Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks for problems in Accumulo #4957

Open
wants to merge 3 commits into
base: 3.1
Choose a base branch
from

Conversation

kevinrr888
Copy link
Member

This PR:

  • Moves existing checks (checkTablets and the fate check for dangling locks) into the appropriate new admin check command
  • Adds new checks
  • New tests in AdminCheckIT
  • SYSTEM_CONFIG now checks for
    • valid locked table/namespace ids (the locked table/namespaces exist)
    • locked table/namespaces are associated with a fate op
  • ROOT_METADATA now checks for
    • offline tablets
    • missing "columns"
    • invalid "columns"
  • ROOT_TABLE now checks for
    • offline tablets
    • tablets for metadata table have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
    • missing columns
    • invalid columns
  • METADATA_TABLE now checks for
    • offline tablets
    • tablets for user tables (and scanref) have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
    • missing columns
    • invalid columns
  • SYSTEM_FILES now checks for
    • missing system files
  • USER_FILES now checks for
    • missing user files

There are still quite a few checks that need to be added (mentioned in #4687) and probably more. This is a first/starting PR for these checks. More checks will be added in follow-ons. Something else still left todo are tests for these checks for FAILING cases.

Part of #4892

This commit:
- Moves existing checks (`checkTablets` and the fate check for dangling locks) into the appropriate new `admin check` command
- Adds new checks
- New tests in AdminCheckIT
- SYSTEM_CONFIG now checks for
	- valid locked table/namespace ids (the locked table/namespaces exist)
	- locked table/namespaces are associated with a fate op
- ROOT_METADATA now checks for
	- offline tablets
	- missing "columns"
	- invalid "columns"
- ROOT_TABLE now checks for
	- offline tablets
	- tablets for metadata table have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
	- missing columns
	- invalid columns
- METADATA_TABLE now checks for
	- offline tablets
	- tablets for user tables (and scanref) have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet
	- missing columns
	- invalid columns
- SYSTEM_FILES now checks for
	- missing system files
- USER_FILES now checks for
	- missing user files

Part of apache#4892
@kevinrr888 kevinrr888 self-assigned this Oct 8, 2024
@kevinrr888 kevinrr888 added this to the 3.1.0 milestone Oct 8, 2024
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinrr888 these changes look good. I was running the changes locally and I noticed at the top level the accumulo check-server-config command exists. Was wondering if that should be rolled into this check command, if it should could add that to the checklist on #4892.

System.out.println("Running check " + check);
// work
System.out.println("Check " + check + " completed with status " + status);
System.out.println("\n********** Looking for missing user files **********\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These prints produce a lot of noise when running the command. Could make these log at trace/debug level. This would make the output of the command very concise when there are no problems. For testing purpose could change the log4j2 testing config to set the check logger to trace/debug.

When a problem is found could log the details of that at warn. I rant this locally and manually corrupted the metadata table and saw the following output.

********** Looking for missing columns **********

Scanning the accumulo.metadata (!0) table for missing required columns...
Tablet +scanref;m is missing required columns: col FQs: [srv:dir, ~tab:~pr], col fams: [] in the accumulo.metadata (!0) table
Tablet +scanref< is missing required columns: col FQs: [srv:dir], col fams: [] in the accumulo.metadata (!0) table

********** Looking for invalid columns **********

So maybe in that output everything could be logged at trace/debug and the two lines about problems fouind at WARN level.

If this seems like a good change, it could be done in a follow on PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a nice way to avoid the verbose output. The change made sense to make in this PR, so I made it.

System.out.printf("Scanning the %s for missing required columns...\n", scanning());
try (Scanner scanner = context.createScanner(tableName(), Authorizations.EMPTY)) {
var is = new IteratorSetting(100, "tablets", WholeRowIterator.class);
scanner.addScanIterator(is);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also configure the scanner to only fetch the required columns. This will bring back less data, like it will not bring back tablets files.

m.at().family(key.getColumnFamily()).qualifier(key.getColumnQualifier())
.visibility(key.getColumnVisibility()).timestamp(key.getTimestamp())
.put(entry.getValue());
MetadataConstraints mc = new MetadataConstraints();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to create this outside of the loop.

printRunning();

System.out.println("\n********** Checking some references **********\n");
status = checkTableLocks(context, status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check seems misplaced, maybe there should be a new Check.TABLE_LOCKS for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was trying to avoid adding a new Check for at least a first round of reviews to avoid a larger diff/more refactoring. Added now. Made TABLE_LOCKS depend on SYSTEM_CONFIG, as that seemed like the most fitting location for it in the dependency tree.

assertTrue(out.contains("Checking some references"));
assertTrue(out.contains("Looking for missing columns"));
assertTrue(out.contains("Looking for invalid columns"));
assertTrue(out.contains("Check METADATA_TABLE completed with status OK"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could look for other checks running in this test and others that specify a check.

Suggested change
assertTrue(out.contains("Check METADATA_TABLE completed with status OK"));
assertTrue(out.contains("Check METADATA_TABLE completed with status OK"));
// Should not see other checks running
assertFalse(out.contains("SYSTEM_CONFIG"));

- `System.out` -> `log.trace/warn` to avoid flooding output with unnecessary/detailed info. The most important info (e.g., output of `admin check list` command, and the final run status table from the `admin check run` command) is still printed to stdout. Problems found are now logged at warn instead of stdout. Detailed, non-error info logged at trace.
- Created new check `Check.TABLE_LOCKS` which ensures that table and namespace locks are valid and are associated with a FATE op.
- New check `assertNoOtherChecksRan()` in `AdminCheckIT` which ensures only the expected checks ran
- A few misc review changes: `MetadataCheckRunner` code improved to only fetch required columns when scanning, object creation moved outside of a loop
@kevinrr888
Copy link
Member Author

@kevinrr888 these changes look good. I was running the changes locally and I noticed at the top level the accumulo check-server-config command exists. Was wondering if that should be rolled into this check command, if it should could add that to the checklist on #4892.

Thanks, I had missed that. Added to #4892. Also noticed accumulo check-compaction-config, which I also added to the checklist. Did you think this should be included as well or no?

@kevinrr888
Copy link
Member Author

Changes in 11f1f76:

  • System.out -> log.trace/warn to avoid flooding output with unnecessary/detailed info. The most important info (e.g., output of admin check list command, and the final run status table from the admin check run command) is still printed to stdout. Problems found are now logged at warn instead of stdout. Detailed, non-error info logged at trace.
  • Created new check Check.TABLE_LOCKS which ensures that table and namespace locks are valid and are associated with a FATE op. This depends on SYSTEM_CONFIG, which seemed like the most fitting location for it in the dependency tree.
  • New check assertNoOtherChecksRan() in AdminCheckIT which ensures only the expected checks ran
  • A few misc review changes: MetadataCheckRunner code improved to only fetch required columns when scanning, object creation moved outside of a loop

Let me know how these changes look. Something still left TODO is tests for failing checks. Tests should be added to ensure everything that is supposed to be checked correctly fails when it is expected to. I imagine this will take quite a bit of time to do and the changes are already pretty large, so thinking it might be best to leave for follow-on.

Comment on lines -164 to +177
System.err
.println("ERROR : table " + tableNameToCheck + " (" + tableCheckId + ") is empty");
printProblemMethod.accept(
String.format("ERROR : table %s (%s) is empty", tableNameToCheck, tableCheckId));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changes System.err -> System.out when this is run from main() which, as far as I can tell, should be ok. To keep the functionality of System.err, would have to introduce yet another param which seemed a bit much for just this one print

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants