-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checks for problems in Accumulo #4957
base: 3.1
Are you sure you want to change the base?
Conversation
This commit: - Moves existing checks (`checkTablets` and the fate check for dangling locks) into the appropriate new `admin check` command - Adds new checks - New tests in AdminCheckIT - SYSTEM_CONFIG now checks for - valid locked table/namespace ids (the locked table/namespaces exist) - locked table/namespaces are associated with a fate op - ROOT_METADATA now checks for - offline tablets - missing "columns" - invalid "columns" - ROOT_TABLE now checks for - offline tablets - tablets for metadata table have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet - missing columns - invalid columns - METADATA_TABLE now checks for - offline tablets - tablets for user tables (and scanref) have no holes, valid (null) prev end row for first tablet, and valid (null) end row for last tablet - missing columns - invalid columns - SYSTEM_FILES now checks for - missing system files - USER_FILES now checks for - missing user files Part of apache#4892
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinrr888 these changes look good. I was running the changes locally and I noticed at the top level the accumulo check-server-config
command exists. Was wondering if that should be rolled into this check command, if it should could add that to the checklist on #4892.
System.out.println("Running check " + check); | ||
// work | ||
System.out.println("Check " + check + " completed with status " + status); | ||
System.out.println("\n********** Looking for missing user files **********\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These prints produce a lot of noise when running the command. Could make these log at trace/debug level. This would make the output of the command very concise when there are no problems. For testing purpose could change the log4j2 testing config to set the check logger to trace/debug.
When a problem is found could log the details of that at warn. I rant this locally and manually corrupted the metadata table and saw the following output.
********** Looking for missing columns **********
Scanning the accumulo.metadata (!0) table for missing required columns...
Tablet +scanref;m is missing required columns: col FQs: [srv:dir, ~tab:~pr], col fams: [] in the accumulo.metadata (!0) table
Tablet +scanref< is missing required columns: col FQs: [srv:dir], col fams: [] in the accumulo.metadata (!0) table
********** Looking for invalid columns **********
So maybe in that output everything could be logged at trace/debug and the two lines about problems fouind at WARN level.
If this seems like a good change, it could be done in a follow on PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a nice way to avoid the verbose output. The change made sense to make in this PR, so I made it.
System.out.printf("Scanning the %s for missing required columns...\n", scanning()); | ||
try (Scanner scanner = context.createScanner(tableName(), Authorizations.EMPTY)) { | ||
var is = new IteratorSetting(100, "tablets", WholeRowIterator.class); | ||
scanner.addScanIterator(is); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also configure the scanner to only fetch the required columns. This will bring back less data, like it will not bring back tablets files.
m.at().family(key.getColumnFamily()).qualifier(key.getColumnQualifier()) | ||
.visibility(key.getColumnVisibility()).timestamp(key.getTimestamp()) | ||
.put(entry.getValue()); | ||
MetadataConstraints mc = new MetadataConstraints(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to create this outside of the loop.
printRunning(); | ||
|
||
System.out.println("\n********** Checking some references **********\n"); | ||
status = checkTableLocks(context, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check seems misplaced, maybe there should be a new Check.TABLE_LOCKS for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was trying to avoid adding a new Check for at least a first round of reviews to avoid a larger diff/more refactoring. Added now. Made TABLE_LOCKS depend on SYSTEM_CONFIG, as that seemed like the most fitting location for it in the dependency tree.
assertTrue(out.contains("Checking some references")); | ||
assertTrue(out.contains("Looking for missing columns")); | ||
assertTrue(out.contains("Looking for invalid columns")); | ||
assertTrue(out.contains("Check METADATA_TABLE completed with status OK")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could look for other checks running in this test and others that specify a check.
assertTrue(out.contains("Check METADATA_TABLE completed with status OK")); | |
assertTrue(out.contains("Check METADATA_TABLE completed with status OK")); | |
// Should not see other checks running | |
assertFalse(out.contains("SYSTEM_CONFIG")); |
- `System.out` -> `log.trace/warn` to avoid flooding output with unnecessary/detailed info. The most important info (e.g., output of `admin check list` command, and the final run status table from the `admin check run` command) is still printed to stdout. Problems found are now logged at warn instead of stdout. Detailed, non-error info logged at trace. - Created new check `Check.TABLE_LOCKS` which ensures that table and namespace locks are valid and are associated with a FATE op. - New check `assertNoOtherChecksRan()` in `AdminCheckIT` which ensures only the expected checks ran - A few misc review changes: `MetadataCheckRunner` code improved to only fetch required columns when scanning, object creation moved outside of a loop
Thanks, I had missed that. Added to #4892. Also noticed |
Changes in 11f1f76:
Let me know how these changes look. Something still left TODO is tests for failing checks. Tests should be added to ensure everything that is supposed to be checked correctly fails when it is expected to. I imagine this will take quite a bit of time to do and the changes are already pretty large, so thinking it might be best to leave for follow-on. |
System.err | ||
.println("ERROR : table " + tableNameToCheck + " (" + tableCheckId + ") is empty"); | ||
printProblemMethod.accept( | ||
String.format("ERROR : table %s (%s) is empty", tableNameToCheck, tableCheckId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this changes System.err
-> System.out
when this is run from main()
which, as far as I can tell, should be ok. To keep the functionality of System.err
, would have to introduce yet another param which seemed a bit much for just this one print
This PR:
checkTablets
and the fate check for dangling locks) into the appropriate newadmin check
commandThere are still quite a few checks that need to be added (mentioned in #4687) and probably more. This is a first/starting PR for these checks. More checks will be added in follow-ons. Something else still left todo are tests for these checks for FAILING cases.
Part of #4892