Nobeid/docker healthcheck #227
Conversation
Hi Naji,
I noticed that tests were not included in the pull request. What are your thoughts on adding some tests to make sure the health check works as expected? I think it would be good to see some cases where it fails and succeeds. I just want to ensure that we are maintaining a high standard of quality for our codebase.
I also wonder if we should cover the case where a database is removed from the config. I think we should cover that as well while we are at it; as I understand it, the check will just keep failing in that situation.
Thanks. 👍
docker/healthcheck.sh (outdated)
# Without the LockFile in the database directory, this check is not going to be working
# since database files are not going to be modified when there are no updates.
cutoff_date=$(($(date +%s) - $GEOIPUPDATE_FREQUENCY * 60 * 61 ))
modified_at=$(find $datbase_dir -type f -exec stat -c "%Y" {} + | sort -nr | head -n 1)
Should $datbase_dir be $database_dir?
I think we could edit the find to only look for the database files with a naming convention like -name '*.mmdb'. This way it will be a lot more reliable.
If you only look for the database files, you aren't considering the lock file.
Yes. I don't think the purpose of the health check is to check the lock file, as it can be set to a location outside the database directory as well. I could be missing cases where not checking the modification date of the lock file would cause a problem. Do you have anything in mind? Thanks
The purpose of the health check is to make sure the service is functioning correctly. This includes ensuring that it is checking for updated databases on the defined schedule. Since the databases aren't necessarily updated every time the service checks, the lock file was (or it appears that it was intended to be) acting as a proxy for indicating that the service executed on the schedule as expected. With this in mind, I'd suggest that rather than looking for the most recently updated file in some directory, the health check script should be updated to check the lock file explicitly, regardless of its location, so that HEALTHY == "I checked for updates".
Sorry, purpose was a badly chosen word at my end. I agree that it is important to ensure that the service is functioning correctly, and checking for updated databases on the defined schedule is a critical part of that. 👍
Thank you for the case you provided; I agree with you that the lock file should be checked explicitly. I was thinking of a case where the lock file is present but the update process fails, which could result in false positives. But with the case you provided, where there are no updates at all, checking the lock file does become necessary. 👍 Thank you again for your input.
Sorry, purpose was a badly chosen word at my end.
Fair enough. I just didn't want to assume anything, so I thought a comprehensive response would remove some uncertainty and ambiguity.
I was thinking of a case where the lock file is present but the update process fails, which could result in false positives.
This is something I considered as well. It will just come down to the definition of HEALTHY, and that's certainly up for debate. Maybe it should mean more than just "I checked for updates", but it certainly can't only mean "I received updated databases". However, it is important to remember that this health check is used by Docker to report on the status of the container, which narrowly scopes what HEALTHY should or should not mean.
Since there are external reasons why the databases may not be updated, using them to determine the health of the container gets a little more tricky. If that's the route someone would like to take, perhaps the mod time of the lock file should only be updated if no errors occur during the process of updating the databases. This could then account for getting HTTP errors (specifically 4xx), I/O errors, etc. However, this does make some assumptions about the purpose of the lock file, losing some generality, which may cause problems elsewhere if the lock file is used for any other purpose. Maybe, then, that necessitates some other approach to determining health.
Though, as someone who uses this app, I would appreciate an initial approach of using the lock file as it appears to have been originally intended in order to resolve the outstanding issue, and then whoever wants to make it more involved and more robust can bring forward other proposals for determining health.
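A minimal Go sketch of the idea floated here, touching the lock file's modification time only when the update completes without errors; the file path and helper names are hypothetical, not the actual geoipupdate code:

```go
package main

import (
    "log"
    "os"
    "time"
)

// runUpdate stands in for the real database update; it returns an error on
// any HTTP or I/O failure.
func runUpdate() error { return nil }

func main() {
    lockFile := "/usr/share/GeoIP/.geoipupdate.lock" // hypothetical location

    if err := runUpdate(); err != nil {
        // Leave the lock file's mtime untouched so the health check fails.
        log.Fatalf("updating databases: %v", err)
    }

    // Only after a clean run, bump the lock file's access/modification times
    // so its mtime acts as a "last successful check" marker.
    now := time.Now()
    if err := os.Chtimes(lockFile, now, now); err != nil {
        log.Printf("touching lock file: %v", err)
    }
}
```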
Fair enough. I just didn't want to assume anything, so I thought a comprehensive response would remove some uncertainty and ambiguity.
Thank you! I appreciate your input.
I wonder if we could go with another approach and store the individual times at which each database in the config was updated/checked in a file, and make the health check use that file.
Or, add a daemon mode to geoipupdate and have it expose an internal health check endpoint.
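A rough sketch of what the daemon-mode idea might look like, assuming a periodic update loop and a local HTTP endpoint; none of these names, ports, or thresholds exist in geoipupdate today:

```go
package main

import (
    "fmt"
    "net/http"
    "sync/atomic"
    "time"
)

// lastSuccess holds the unix time of the last successful update check.
var lastSuccess atomic.Int64

func updateLoop(every time.Duration) {
    for {
        // ... run the database update here ...
        lastSuccess.Store(time.Now().Unix())
        time.Sleep(every)
    }
}

func healthHandler(w http.ResponseWriter, _ *http.Request) {
    last := time.Unix(lastSuccess.Load(), 0)
    // Allow some slack over the configured frequency before reporting unhealthy.
    if time.Since(last) > 25*time.Hour {
        http.Error(w, "no successful update check recently", http.StatusServiceUnavailable)
        return
    }
    fmt.Fprintln(w, "ok")
}

func main() {
    go updateLoop(24 * time.Hour)
    http.HandleFunc("/healthz", healthHandler)
    // The Docker HEALTHCHECK could then simply query this endpoint.
    _ = http.ListenAndServe("127.0.0.1:9090", nil)
}
```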
I'm sure those suggestions could be fleshed out to provide a much more robust health check, and I'm happy to see improvements to the product. However, if I could be a little self-serving for a moment 😅, I would prefer to see this change just use the lock file as originally intended so that my open issue can be resolved. From a more objective perspective, though, I would still make the same suggestion to my own team if this was one of our services.
Thank you both for providing excellent feedback!
After some internal discussions, we've decided that there are lots of rough edges in relying on the lock file, and the solution didn't feel very robust.
I agree that having a long-lived process with a healthcheck endpoint is a great solution, but unfortunately it's a much bigger change. So, as a compromise, we've opted to have geoipupdate log its download/update results in a JSON format that can be parsed and analyzed by the healthcheck script.
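For illustration, the kind of per-edition JSON record the healthcheck script could parse from stdout might look like the following; the field names here are assumptions, not the final format:

```go
package main

import (
    "encoding/json"
    "os"
    "time"
)

// updateResult describes the outcome of one edition's download attempt.
type updateResult struct {
    EditionID string    `json:"edition_id"`
    OldHash   string    `json:"old_hash"`
    NewHash   string    `json:"new_hash"`
    CheckedAt time.Time `json:"checked_at"`
}

func main() {
    res := updateResult{
        EditionID: "GeoLite2-City",
        OldHash:   "d41d8cd9",
        NewHash:   "d41d8cd9", // unchanged: the edition was checked but not updated
        CheckedAt: time.Now(),
    }
    // One JSON object per line keeps the output easy to parse from a script.
    _ = json.NewEncoder(os.Stdout).Encode(res)
}
```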
* Output a json log to stdout describing the result of the download operation if the verbose flag is set to true.
I think we should add a new flag for this rather than adding it to the existing verbose flag so that we don't significantly change the output for existing users that are running it with the verbose flag.
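As a sketch of this suggestion, using the standard library flag package purely for illustration (the project may use a different flag library, and the flag name is hypothetical):

```go
package main

import (
    "flag"
    "fmt"
)

var (
    verbose    = flag.Bool("verbose", false, "use verbose output")
    jsonOutput = flag.Bool("output", false, "print a machine-readable JSON summary to stdout")
)

func main() {
    flag.Parse()
    // The human-readable verbose output and the JSON summary stay independent.
    fmt.Println("verbose:", *verbose, "json output:", *jsonOutput)
}
```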
Yes, I can do that! But honestly, anyone relying on the verbose flag to have a specific or consistent output should reevaluate, or probably redirect stdout to /dev/null to get rid of the new output.
I think a minor change to the output would be fine, but suddenly introducing a bunch of JSON that is sent to stdout is a pretty big change. The existing output is intended to be human-readable. This tool is used by tens of thousands of customers, many of whom have been using it for more than a decade. Introducing such a big change would likely cause breakage and support requests, especially outside of a major version release.
@@ -1,5 +1,12 @@
# CHANGELOG

## 5.0.5
At this point, this is more than just a simple bugfix. We should bump this to 5.1.0.
if config.Verbose {
    output.SetOutput(os.Stdout)
    log.SetOutput(os.Stderr)
}
These sorts of mutations of the package loggers when calling these functions seem unexpected. I think ideally the loggers wouldn't be package variables. For NewClient in particular, maybe we could allow passing the logger in as part of *Config so that users of the library could use whatever logger is appropriate for their use case.
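One possible shape for this, sketched with illustrative names rather than the current geoipupdate API:

```go
package geoipupdate

import (
    "io"
    "log"
)

// Config holds client options; Logger is supplied by the library user.
type Config struct {
    // ... existing fields ...

    // Logger receives verbose output. If nil, verbose output is discarded.
    Logger *log.Logger
}

// Client uses whatever logger the caller provided instead of a package-level one.
type Client struct {
    logger *log.Logger
}

func NewClient(config *Config) *Client {
    logger := config.Logger
    if logger == nil {
        logger = log.New(io.Discard, "", 0)
    }
    return &Client{logger: logger}
}
```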
Hi Naji, would it be possible to split the logging changes into a separate PR?
log.Printf("Usage: %s <arguments>\n", os.Args[0]) | ||
flag.PrintDefaults() | ||
//nolint: revive // deep exit from main package | ||
os.Exit(1) |
Maybe not in this story, but I think we should simplify the log calls in main in another story. What do you think?
log.Print(msg) statements followed by os.Exit(exitCode) should be replaced with log.Fatal() or log.Fatalf(), which I think should be refactored to be called in main. This way we don't have to use nolints or concatenate the errors for fatalLogger.
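A minimal sketch of that refactor; getArgs stands in for the existing argument parsing, and run is a hypothetical helper for the rest of the work:

```go
package main

import (
    "errors"
    "log"
)

// getArgs and run stand in for the existing argument parsing and update logic.
func getArgs() (string, error)    { return "", errors.New("example error") }
func run(configFile string) error { return nil }

func main() {
    configFile, err := getArgs()
    if err != nil {
        // log.Fatalf prints the message and calls os.Exit(1) itself,
        // so no separate os.Exit call or nolint directive is needed.
        log.Fatalf("parsing arguments: %v", err)
    }
    if err := run(configFile); err != nil {
        log.Fatalf("running geoipupdate: %v", err)
    }
}
```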
    log.Print(err.Error())
}
log.Printf("Usage: %s <arguments>\n", os.Args[0])
flag.PrintDefaults()
The flag package here is different than the one used in args.go. This isn't going to print anything. I think we want to use the same package in both places.
It seems the flag.Parse() call also prints out the defaults, by the way. I think it would be enough to just wrap the two errors coming from getArgs() with log.Fatalf in main. With something like:
log.Fatalf("parsing arguments: %v", err)
What do you think?
@@ -17,11 +17,10 @@ var (
    version = "unknown"
I think the comment above might need a change, since it now only applies to the 3 variables above.
@@ -17,11 +17,10 @@ var (
    version = "unknown"
    defaultConfigFile string
    defaultDatabaseDirectory string
    log = vars.NewDiscardLogger("main")
I don't think we need to create a new Logger instance. We could set the existing one using SetOutput and SetPrefix in main.
So, the reason why we added this was to check that we have a sanity test for what we are logging, right? Have you confirmed that all the outputs are the same? What would your thoughts be on having a LogDebug() function that is behind the verbose flag?
I think it is hard to tell whether these, for example, will be logged only at verbose. What are your thoughts @oschwald?
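A small sketch of both ideas, configuring the existing package-level logger from main and gating debug output behind the verbose flag; the helper name logDebug and the prefix are hypothetical:

```go
package main

import (
    "log"
    "os"
)

var verbose bool

// configureLogging reuses the package-level logger instead of creating a new one.
func configureLogging() {
    log.SetOutput(os.Stderr)
    log.SetPrefix("geoipupdate: ")
    log.SetFlags(0)
}

// logDebug only emits output when the verbose flag is set.
func logDebug(format string, args ...any) {
    if verbose {
        log.Printf(format, args...)
    }
}

func main() {
    verbose = true
    configureLogging()
    logDebug("checking %d editions for updates", 3)
}
```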
// It takes a string reference as an argument to be used as the prefix for
// all log entries.
func NewDiscardLogger(s string) *log.Logger {
    return log.New(ioutil.Discard, prefix(s), log.LstdFlags)
I think we could modify the existing logger instead of creating a new one.
)

func main() {
    log.SetFlags(0)
I think the default output format will change with the flags you have in NewDiscardLogger. I don't think we want that.
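A tiny illustration of the concern, using only the standard library: log.LstdFlags adds a date/time prefix that the existing log.SetFlags(0) output does not have.

```go
package main

import (
    "log"
    "os"
)

func main() {
    plain := log.New(os.Stderr, "", 0)               // prints: download complete
    stamped := log.New(os.Stderr, "", log.LstdFlags) // prints: 2009/11/10 23:00:00 download complete
    plain.Print("download complete")
    stamped.Print("download complete")
}
```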
@@ -47,6 +48,11 @@ func NewHTTPReader(
    retryFor time.Duration,
    verbose bool,
) Reader {
    logger := vars.NewDiscardLogger("reader")
I'm not sure if we want to have prefixes. @oschwald What are your thoughts?
I think ideally the current verbose output would be mostly the same.
fixes #225
Update the docker healthcheck script to pick up the correct time the last update was attempted.
Made sure the lock file access and modification times are updated when an update is attempted.