Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add RetryAttempts to AccessLog in JSON format #1793

Merged
merged 1 commit into from
Aug 28, 2017

Conversation

m3co-code
Copy link
Contributor

This PR adds the amount of retries happened to the access log. This information is very useful for us when debugging requests.

For now this PR only adds the retry attempts to the JSON format. I am happy to extend the common log format as well, but I am not an expert in the CLF format and so I am not sure how to add this information meaningfully without breaking backwards compatibility of parsers.

@m3co-code m3co-code force-pushed the retries-in-access-log branch from 9017ccb to 3c17575 Compare June 27, 2017 11:29
@ldez ldez added area/logs kind/enhancement a new or improved feature. labels Jun 27, 2017
@traefik traefik deleted a comment from m3co-code Jun 27, 2017
@ldez
Copy link
Contributor

ldez commented Jun 27, 2017

@marco-jantke please rebase to fix CI.

@m3co-code
Copy link
Contributor Author

@ldez still some error :/

@ldez
Copy link
Contributor

ldez commented Jun 28, 2017

you have missed your rebase, try again 😉

@m3co-code m3co-code force-pushed the retries-in-access-log branch 2 times, most recently from 2d804b2 to 663b3cb Compare June 28, 2017 08:16
@m3co-code
Copy link
Contributor Author

Sorry for that one.. I was just amending because I thought I just have to trigger the build again.

BTW now I have to run the make targets like DOCKER_VERSION=17.03.1 make validate to get them working locally. Do you want to improve the local workflow here again or add some documentation?

@ldez
Copy link
Contributor

ldez commented Jun 28, 2017

You don't need to add DOCKER_VERSION=17.03.1: in the Docker file there is a DOCKER_VERSION with the good version.

@m3co-code m3co-code force-pushed the retries-in-access-log branch from 663b3cb to c0bfa12 Compare July 7, 2017 08:45
@m3co-code
Copy link
Contributor Author

Any feedback for this one?

@emilevauge
Copy link
Member

@marco-jantke honestly, I'm a bit concerned with this one. It seems it solves a relatively rare use case and makes server.go more and more complex. @containous/traefik WDYT?

@timoreimann
Copy link
Contributor

@emilevauge disclaimer: I'm obviously somewhat biased as I work with Marco. Still, I hope I can give some motivation stemming from our real-world experience with a large-ish proxy setup.

We are currently running large parts of our micro-services infrastructure off of HAProxy where support for retry indicators on a per-access level turned out to be very helpful for us. When you run hundreds of HTTP-driven services, connectivity can fail due to numerous reasons: Services could be slow; intermittently failing; be calling broken endpoints/paths; fail to terminate/bootstrap orderly; or just have bugs, all potentially leading to retries. On the other end, the infrastructure could have problems: proxies being overloaded; network being slow; net-split happening; and a myriad of other potential things. We can't afford to expose ephemeral failures to the client right away, so we have retries enabled by default. At the same time, we strive to minimize the number of retries we see due to the latency imposed and the fact that a higher-than-usual number is a signal for some issue.

We use monitoring (namely, Prometheus) to detect retry anomalies. More often than not, however, it's impossible to understand why retries took place without analyzing on the access log level. For instance, we are regularly able to correlate retry counters with request latency at the various stages of a request to see whether a request, say, failed on the initial connection attempt, or it took the server too long to serve the first byte. Another example that we have seen is that some service endpoints caused errors while others did not, leading to retries in one case but not the other. With HAProxy, we can put the necessary pieces together since we have all relevant information presented in the access log.

Moving such information into a monitoring system is usually not feasible: You cannot track highly dynamic aspects of your requests in something like Prometheus because the number of time series (and with it, your monitoring system memory) would quickly explode. We happen to have first-hand experience with such scenarios (both intentional and by accident).

I hope that provides some insights. Let me know what you think.

@emilevauge
Copy link
Member

emilevauge commented Jul 20, 2017

@timoreimann @marco-jantke I'm not saying this feature is not useful, I'm saying that the implementation is a bit too complex and may need more work, specifically this peace of code:

if globalConfiguration.Retry != nil {
	metricsRetryListener := middlewares.NewMetricsRetryListener(metrics)

	if server.accessLoggerMiddleware != nil {
		saveRetries := &accesslog.SaveRetries{}
		retryListeners := middlewares.RetryListeners{metricsRetryListener, saveRetries}
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, retryListeners)
		lb = saveRetries.AsHandler(lb)
	} else {
		lb = registerRetryMiddleware(lb, globalConfiguration, configuration, frontend.Backend, metricsRetryListener)
	}
}

@marco-jantke do you think we could refactor this part?

@m3co-code
Copy link
Contributor Author

I will give it a try :)

@m3co-code
Copy link
Contributor Author

m3co-code commented Jul 20, 2017

@emilevauge it was worth the reinvestigation. I realised that the functionality was actually not working properly with the former implementation during benchmarking with concurrent requests. It was mixing up the RetryAttempts that I was writing into the logs between requests and often reseted the counter to 0 even though the request actually had retries. In order to get this working I changed the RetryListeners method Retried to also contain the current request. This way we can ensure that always the proper RetryAttempts information for each request access log entry is written. Additionally it made the setup server.go substantially clearer and more straight-forward.

@emilevauge @timoreimann In my last commit eff6596 I dropped the unit-test for the registerRetryMiddleware function as I simplified it and the current layout of the method does not allow for easy testing and the previous test did verify basically only one thing: that the amount of attempts on the retry middleware is configured properly. WDYT about this step? I would dare to leave it as is, as the code and functionality is rather trivial.

@m3co-code m3co-code force-pushed the retries-in-access-log branch from e7bfd95 to eff6596 Compare July 21, 2017 06:16
@m3co-code
Copy link
Contributor Author

Can I do something to bring this forward?

@timoreimann
Copy link
Contributor

@marco-jantke @emilevauge is this still in the WIP state? If not, we should open it up for review.

@m3co-code
Copy link
Contributor Author

The last state is that after Emile requested some refactoring after his design review, that I implemented some changes to improve the design of the code. So now it should get another round of design review and then we can move AFAICS.

@timoreimann
Copy link
Contributor

Seemingly your turn again, @emilevauge. :-)

@ldez ldez added the size/M label Aug 16, 2017
@emilevauge
Copy link
Member

I think it's a lot better :) Thanks @marco-jantke !
Design LGTM

saveRetries := &SaveRetries{}

logDataTable := &LogData{Core: make(CoreLogData)}
req := httptest.NewRequest("GET", "/some/path", nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/"GET"/http.MethodGet/

By now I believe you're challenging me :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I would never dare! Consider that I added this code a long time ago.

retryListener.Retried(1)
retryListener.Retried(2)
retryListener.Retried(req, 1)
retryListener.Retried(req, 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also somehow expand our test expectation / assertion logic when we extend the system-under-test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general yes, but in this case we did not expand the SUT. It only gets the request now as parameter, but it doesn't do anything with it. The request was introduced for the access log implementation of the RetryListener interface.

// each of them about a retry attempt.
type RetryListeners []RetryListener

// Retried is there to implement the RetryListener interface. It calls Retried on each of its slice entries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit-pick: I'd suggest to say exists instead of is there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I am always happy about formulation improvements :)

@@ -451,74 +451,6 @@ func TestNewMetrics(t *testing.T) {
}
}

func TestRegisterRetryMiddleware(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this test is now superseded by your new tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the test I left a comment in the second paragraph of my comment above #1793 (comment). Please take a look.

@m3co-code
Copy link
Contributor Author

m3co-code commented Aug 22, 2017

@timoreimann thanks for the review round nr1. I implemented the feedback and also did a rebase/squash now because the branch was quit outdated and squashing made the rebase easier.

Copy link
Contributor

@timoreimann timoreimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@emilevauge
Copy link
Member

emilevauge commented Aug 25, 2017

@timoreimann As there are a lot of conflicts, I think a LGTM today is a bit optimistic ;)

screenshot from 2017-08-25 17-30-55

@timoreimann
Copy link
Contributor

@emilevauge I trust in @marco-jantke to ask for a revalidating LGTM if the rebase turns out to be more complicated. :)

(Besides, we have never bothered giving LGTM in the past despite potential conflicts.)

@m3co-code m3co-code force-pushed the retries-in-access-log branch from 5f8caeb to 87fa513 Compare August 28, 2017 08:03
@m3co-code
Copy link
Contributor Author

Rebased this PR now to latest master version. Conflicts are all resolved and where expected, but some merging was required.

Copy link
Member

@emilevauge emilevauge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @marco-jantke !
LGTM

Copy link
Contributor

@ldez ldez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@traefiker traefiker force-pushed the retries-in-access-log branch from 87fa513 to 7bdee01 Compare August 28, 2017 09:50
@traefiker traefiker merged commit dae7e7a into traefik:master Aug 28, 2017
@m3co-code m3co-code deleted the retries-in-access-log branch August 28, 2017 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/logs kind/enhancement a new or improved feature. size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants