Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nobeid/docker healthcheck #227

Closed
wants to merge 11 commits into from
Closed

Nobeid/docker healthcheck #227

wants to merge 11 commits into from

Conversation

NajiObeid
Copy link
Contributor

fixes #225

Update the docker healthcheck script to pickup the correct time the last update was attempted.
Made sure the lock file access and modification time is updated when an update is attempted.

docker/healthcheck.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@faktas2 faktas2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Naji,

I noticed that tests were not included in the pull request. What are your thoughts on adding some tests, making sure the health check works as expected? I think it would be good to see some cases where it fails and succeeds. I just want to ensure that we are maintaining a high standard of quality for our codebase.
I also wonder if we should cover cases when a database is removed from the config. I think we should cover that as well while we are at it. According to my understanding it will just keep failing.

Thanks. 👍

docker/healthcheck.sh Outdated Show resolved Hide resolved
docker/healthcheck.sh Outdated Show resolved Hide resolved
# Without the LockFile in the database directory, this check is not going to be working
# since database files are not going to be modified when there are no updates.
cutoff_date=$(($(date +%s) - $GEOIPUPDATE_FREQUENCY * 60 * 61 ))
modified_at=$(find $datbase_dir -type f -exec stat -c "%Y" {} + | sort -nr | head -n 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should $datbase_dir be $database_dir?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could edit the find to only look for the database files with a name convention like -name '*.mmdb'. This way it will be a lot more reliable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you only look for the database files, you aren't considering the lock file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I don't think the purpose of the health check is to check the lock file as that can be set to a different location out of the database directory as well. I could be missing any cases where not checking the modification date of the lock file could cause a problem. Do you have anything in mind ? Thanks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the health check is to make sure the service is functioning correctly. This includes ensuring that it is checking for updated databases on the defined schedule. Since the databases aren't necessarily updated every time the service checks, the lock file was (or it appears that it was intended to be) acting as a proxy for indicating that the service executed on the schedule as expected. With this in mind, I'd suggest that rather than looking for the most recently updated file in some directory, the health check script should be updated to check the lock file explicitly, regardless of its location, so that HEALTHY == I checked for updates.

Copy link
Contributor

@faktas2 faktas2 Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, purpose was a badly chosen word at my end. I agree that it is important to ensure that the service is functioning correctly, and checking for updated databases on the defined schedule is a critical part of it. 👍

Thank you for the case you provided, I agree with you. I think the lock file should be checked explicitly. I was thinking of a case of when the lock file presents and the update process fails, which could result in false positives. But, with the case you provided, when there are no updates, that does make us check the lock file. 👍 Thank you again for your input.

Copy link

@cford1080 cford1080 Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, purpose was a badly chosen word at my end.

Fair enough. I just didn't want to assume anything, so I thought a comprehensive response would remove some uncertainty and ambiguity.

I was thinking of a case of when the lock file presents and the update process fails, which could result in false positives.

This is something I considered as well. It will just come down to the definition of HEALTHY, and that's certainly up for debate. Maybe it should mean more than just I checked for updates, but it certainly can't only mean I received updated databases. However, it is important to remember that this health check is used by Docker to report on the status of the container, which narrowly scopes what HEALTHY should or should not mean. Since there are external reasons why the databases may not be updated, using them to determine the health of the container gets a little more tricky. If that's the route someone would like to take, perhaps the mod time of the lock file should only be updated if no errors occur during the process of updating the databases. This could then account for getting HTTP errors (specifically 4xx), I/O errors, etc. However, this does make some assumptions about the purpose of the lock file, losing some generality, which may cause problems elsewhere if the lock file is used for any other purpose. Maybe, then, that necessitates some other approach to determining health. Though, as someone who uses this app, I would appreciate an initial approach of using the lock file as it appears to have been originally intended in order to resolve the outstanding issue, and then whoever wants to make it more involved and more robust can bring forward other proposals for determining health.

Copy link
Contributor

@faktas2 faktas2 Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I just didn't want to assume anything, so I thought a comprehensive response would remove some uncertainty and ambiguity.

Thank you! I appreciate your input.

I wonder if we could go with another approach and store the individual times that the databases in the config are updated/checked. And, make the health check use that file.

Or, make a daemon mode for geoipupdate and have an internal health check endpoint for the health check.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure those suggestions could be fleshed out to provide a much more robust health check, and I'm happy to see improvements to the product. However, if I could be a little self-serving for a moment 😅, I would prefer to see this change just use the lock file as originally intended so that my open issue can be resolved. From a more objective perspective, though, I would still make the same suggestion to my own team if this was one of our services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for providing excellent feedback!
After some internal discussions we've decided that there were lots of rough edges in relying on the lock file, and the solution somewhat didn't feel very robust.
I agree that having a long lived process with a healthcheck endpoint is a great solution but unfortunately It's a much bigger change, so as a compromise we've opted for having geoipupdate log it's download/update logs in a json format that can be parsed and analyzed by the healthcheck script.

docker/healthcheck.sh Outdated Show resolved Hide resolved
@NajiObeid NajiObeid requested a review from faktas2 April 27, 2023 17:29
docker/healthcheck.sh Outdated Show resolved Hide resolved
Comment on lines +5 to +6
* Output a json log to stdout describing the result of the download
operation if the verbose flag is set to true.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a new flag for this rather than adding it to the existing verbose flag so that we don't significantly change the output for existing users that are running it with the verbose flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I can do that! But honestly anyone relying on the verbose flag to have a specific or consistent output should reevaluate, or probably redirect stdout to dev/null to get rid of the new output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a minor change to the output would be fine, but suddenly introducing a bunch of JSON that is sent to stdout is a pretty big change. The existing output is intended to be human-readable. This tool is used by tens of thousands of customers, many of whom have been using it for more than a decade. Introducing such a big change, would likely cause breakage and support requests, especially outside of a major version release.

@@ -1,5 +1,12 @@
# CHANGELOG

## 5.0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, this is more than just a simple bugfix. We should bump this to 5.1.0.

if config.Verbose {
output.SetOutput(os.Stdout)
log.SetOutput(os.Stderr)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sorts of mutations of the package loggers when calling these functions seems unexpected. I think ideally the loggers wouldn't be package variables. For NewClient in particular, maybe we could allow passing the logger in as part of *Config so that users of the library could use whatever logger is appropriate for their use case.

Copy link
Contributor

@faktas2 faktas2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Naji, would it be possible to split the Logging changes into a different PR ?

log.Printf("Usage: %s <arguments>\n", os.Args[0])
flag.PrintDefaults()
//nolint: revive // deep exit from main package
os.Exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not in this story, but I think we should simplify the log calls in main in another story. What do you think?

log.Print(msg) statements followed by os.Exit(exitCode) should be replaced with log.Fatal() or log.Fatalf(), which I think should be refactored to be called in main. This way we don’t have to use nolints or concatenate the errors for fatalLogger.

log.Print(err.Error())
}
log.Printf("Usage: %s <arguments>\n", os.Args[0])
flag.PrintDefaults()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag package here is different than the one used in args.go. This isn't going to print anything. I think we want to use the same package in both places.

It seems the flag.Parse() call also prints out the defaults by the way. I think it would be enough to just to wrap the two errors coming from getArgs() with Log.Fatalf in main. With something like,

log.Fatalf("parsing arguments: %v", err)

What do you think?

@@ -17,11 +17,10 @@ var (
version = "unknown"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment above might need a change that it only applies to the 3 variables above.

@@ -17,11 +17,10 @@ var (
version = "unknown"
defaultConfigFile string
defaultDatabaseDirectory string
log = vars.NewDiscardLogger("main")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to create a new Logger instance. We could set the existing one using SetOutput, SetPrefix in main.

So, the reason why we added this was to check that we have a sanity test for what we are logging right? Have you confirmed if all the outputs are the same? What would your thoughts be on having a LogDebug() function that is behind the verbose flag?

I think it is hard to tell these for example will be logged only at verbose. What are your thoughts @oschwald ?

// It takes a string reference as an argument to be used as the prefix for
// all log entries.
func NewDiscardLogger(s string) *log.Logger {
return log.New(ioutil.Discard, prefix(s), log.LstdFlags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could modify the existing logger instead of creating a new one.

)

func main() {
log.SetFlags(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default output format will change with the flags you have in NewDiscardLogger. I don't think we want that.

@@ -47,6 +48,11 @@ func NewHTTPReader(
retryFor time.Duration,
verbose bool,
) Reader {
logger := vars.NewDiscardLogger("reader")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to have prefixes. @oschwald What are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally the current verbose output would be mostly the same.

@NajiObeid NajiObeid closed this May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Docker container unhealthy
4 participants