add RetryAttempts to AccessLog in JSON format #1793
Conversation
Force-pushed 9017ccb to 3c17575
@marco-jantke please rebase to fix CI.
Force-pushed 3c17575 to bfd96e5
Force-pushed bfd96e5 to abd50c4
@ldez still some errors :/

You missed your rebase, try again 😉
Force-pushed 2d804b2 to 663b3cb
Sorry for that one… I was just amending because I thought I only had to trigger the build again. BTW, now I have to run the make targets like …

You don't need to add …
Force-pushed 663b3cb to c0bfa12
Any feedback for this one?
@marco-jantke honestly, I'm a bit concerned with this one. It seems it solves a relatively rare use case and makes …
@emilevauge disclaimer: I'm obviously somewhat biased as I work with Marco. Still, I hope I can give some motivation stemming from our real-world experience with a large-ish proxy setup.

We are currently running large parts of our micro-services infrastructure off of HAProxy, where support for retry indicators at the access-log level turned out to be very helpful for us. When you run hundreds of HTTP-driven services, connectivity can fail for numerous reasons: services could be slow; intermittently failing; calling broken endpoints/paths; failing to terminate or bootstrap orderly; or just have bugs, all potentially leading to retries. On the other end, the infrastructure could have problems: proxies being overloaded; the network being slow; net-splits happening; and a myriad of other potential things.

We can't afford to expose ephemeral failures to the client right away, so we have retries enabled by default. At the same time, we strive to minimize the number of retries we see, due to the latency they impose and the fact that a higher-than-usual number is a signal of some issue. We use monitoring (namely, Prometheus) to detect retry anomalies. More often than not, however, it's impossible to understand why retries took place without analyzing at the access-log level. For instance, we are regularly able to correlate retry counters with request latency at the various stages of a request to see whether a request, say, failed on the initial connection attempt, or it took the server too long to serve the first byte. Another example we have seen is that some service endpoints caused errors while others did not, leading to retries in one case but not the other. With HAProxy, we can put the necessary pieces together since we have all relevant information presented in the access log.

Moving such information into a monitoring system is usually not feasible: you cannot track highly dynamic aspects of your requests in something like Prometheus because the number of time series (and with it, your monitoring system's memory) would quickly explode. We happen to have first-hand experience with such scenarios (both intentional and by accident).

I hope that provides some insights. Let me know what you think.
@timoreimann @marco-jantke I'm not saying this feature is not useful, I'm saying that the implementation is a bit too complex and may need more work, specifically this piece of code:

```go
if globalConfiguration.Retry != nil {
	metricsRetryListener := middlewares.NewMetricsRetryListener(metrics)
	if server.accessLoggerMiddleware != nil {
		saveRetries := &accesslog.SaveRetries{}
		retryListeners := middlewares.RetryListeners{metricsRetryListener, saveRetries}
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, retryListeners)
		lb = saveRetries.AsHandler(lb)
	} else {
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, metricsRetryListener)
	}
}
```

@marco-jantke do you think we could refactor this part?
I will give it a try :)
Force-pushed c0bfa12 to e7bfd95
@emilevauge it was worth the reinvestigation. I realised that the functionality was actually not working properly with the former implementation, during benchmarking with concurrent requests. It was mixing up the RetryAttempts that I was writing into the logs between requests, and often reset the counter to 0 even though the request actually had retries. In order to get this working I changed the …

@emilevauge @timoreimann In my last commit eff6596 I dropped the unit-test for the …
Force-pushed e7bfd95 to eff6596
Can I do something to bring this forward?
@marco-jantke @emilevauge is this still in the WIP state? If not, we should open it up for review.
The last state is that, after Emile requested some refactoring in his design review, I implemented some changes to improve the design of the code. So now it should get another round of design review, and then we can move on, AFAICS.
Seemingly your turn again, @emilevauge. :-)

I think it's a lot better :) Thanks @marco-jantke !
```go
saveRetries := &SaveRetries{}

logDataTable := &LogData{Core: make(CoreLogData)}
req := httptest.NewRequest("GET", "/some/path", nil)
```
s/"GET"/http.MethodGet/
By now I believe you're challenging me :)
Haha, I would never dare! Consider that I added this code a long time ago.
middlewares/metrics_test.go (Outdated)
```diff
-retryListener.Retried(1)
-retryListener.Retried(2)
+retryListener.Retried(req, 1)
+retryListener.Retried(req, 2)
```
Shouldn't we also somehow expand our test expectation / assertion logic when we extend the system-under-test?
In general yes, but in this case we did not expand the SUT. It only gets the request as a parameter now, but it doesn't do anything with it. The request was introduced for the access-log implementation of the RetryListener interface.
middlewares/retry.go (Outdated)
```go
// each of them about a retry attempt.
type RetryListeners []RetryListener

// Retried is there to implement the RetryListener interface. It calls Retried on each of its slice entries.
```
nit-pick: I'd suggest to say exists instead of is there.
Thanks, I am always happy about formulation improvements :)
```diff
@@ -451,74 +451,6 @@ func TestNewMetrics(t *testing.T) {
 	}
 }
 
-func TestRegisterRetryMiddleware(t *testing.T) {
```
I suppose this test is now superseded by your new tests?
Regarding the test I left a comment in the second paragraph of my comment above #1793 (comment). Please take a look.
Force-pushed eff6596 to 5f8caeb
@timoreimann thanks for review round #1. I implemented the feedback and also did a rebase/squash now, because the branch was quite outdated and squashing made the rebase easier.
LGTM, thanks!
@timoreimann As there are a lot of conflicts, I think an LGTM today is a bit optimistic ;)

@emilevauge I trust @marco-jantke to ask for a revalidating LGTM if the rebase turns out to be more complicated. :) (Besides, we have never bothered withholding LGTMs in the past despite potential conflicts.)
Force-pushed 5f8caeb to 87fa513
Rebased this PR onto the latest master now. The conflicts are all resolved and were expected, but some merging was required.
Great job @marco-jantke !
LGTM
LGTM
Force-pushed 87fa513 to 7bdee01
This PR adds the number of retries that happened to the access log. This information is very useful for us when debugging requests.

For now, this PR only adds the retry attempts to the JSON format. I am happy to extend the Common Log Format as well, but I am not an expert in CLF, so I am not sure how to add this information meaningfully without breaking backwards compatibility for parsers.
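For illustration, a JSON access-log entry could then carry the new field roughly like this (the `RetryAttempts` field name comes from the PR title; the other fields are illustrative, not the exact set Traefik emits):

```json
{
  "RequestMethod": "GET",
  "RequestPath": "/some/path",
  "DownstreamStatus": 200,
  "RetryAttempts": 2
}
```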