Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

periodically log about non-ideal conditions #176

Open
Civil opened this issue May 4, 2016 · 17 comments
Open

periodically log about non-ideal conditions #176

Civil opened this issue May 4, 2016 · 17 comments

Comments

@Civil
Copy link

Civil commented May 4, 2016

Currently, not all cases where drops occurs are logged, in fact if the drop happened because of stall metric, it won't be logged at all, even in debug mode.

I think that all abnormal behavior should be logged as error - cause that's the easiest way to find what's going on and with what server. So I suggest that carbon-c-relay should write a message with number of metrics dropped (e.x.: "backend #### - ## stalled metrics dropped")

@grobian
Copy link
Owner

grobian commented May 6, 2016

What do you mean with drops are invisible when stalls happen? Does the graphite metric itself not include the drops in the counter?

If the relay were to log about drops, when and how often should it do so?

@Civil
Copy link
Author

Civil commented May 6, 2016

Drop counter is increased, but even in debug mode there is nothing in the log about what happened.

E.x. if all the backend is down and relay gave up it writes "backend blah 100500 metrics dropped". If it's dropping because of stall metrics, it doesn't say anything.

That happened when I was trying to setup some test clusters with different backends. For example when communicating with influxdb I see some amount of drops, but it's not very easy to find that it's actually happens because of stalled metrics.

@grobian
Copy link
Owner

grobian commented May 6, 2016

stalling per definition means it's not dropping (because it tries to avoid doing that)

@Civil
Copy link
Author

Civil commented May 6, 2016

https://github.com/grobian/carbon-c-relay/blob/master/server.c#L570
It tries to avoid it, but it's dropping them in some cases.

@grobian
Copy link
Owner

grobian commented May 6, 2016

yeah, but it returns, so it doesn't drop (for it doesn't enqueue)

@Civil
Copy link
Author

Civil commented May 6, 2016

It increases droped counter, so on graphs it's visible as drops. If it's not actually dropping them, then it shouldn't touch the counter.

@grobian
Copy link
Owner

grobian commented May 6, 2016

if, the number of stalls > MAX_STALLS: then it is a drop, and counted as such.
else, it counts as stall and the code returns

@grobian
Copy link
Owner

grobian commented May 6, 2016

IOW:
server stall once -> ok
server stall twice -> ok
server stall thrice -> drop! (unblock client)

@Civil
Copy link
Author

Civil commented May 6, 2016

Yup. My point is that you should log that.

@grobian
Copy link
Owner

grobian commented May 6, 2016

Log what exactly? If I'd add a logline in that very bit of code, you'd spam yourself into a DOS in case you encounter a target that b0rks. That's why the counters are there. So what exactly would you like to be notified of, and when?

@Civil
Copy link
Author

Civil commented May 6, 2016

The best option is to have one logline per backend that stalls metrics. E.x. "backend blah: 100500 metrics stalled metrics dropped after 4 tries".

Otherwise identifying the issue can take quite some time.

@Civil
Copy link
Author

Civil commented Jun 14, 2016

Another use case for that - some user actually sending metrics that they really don't want to send (e.x. metric with name HASH_0xblahblah_). It'll be really nice to blackhole them but with a log message once in a while that metrics that were matching the rule was blackholed and give some examples of such metrics.

@grobian
Copy link
Owner

grobian commented Jun 15, 2016

cluster mylog
    file ip /var/log/crap.log;

match crap
   send to mylog
   stop;

The only thing here is how this would respond to logrotate, my first suspicion is that a SIGHUP won't close/open the crap.log file.

@Civil
Copy link
Author

Civil commented Jun 15, 2016

Yeah, but as you know people who send crap usually enormous amount of it, so simple logging won't help much (will use all disk possible), so that's more about sampling logging (log each 10000th metric for example).

SIGHUP thing is easy to fix actually, so it won't be a problem.

@grobian
Copy link
Owner

grobian commented Aug 11, 2016

Perhaps I should introduce the "sample" construct so you can do this.

@Civil
Copy link
Author

Civil commented Aug 11, 2016

Yes, sampling should be sufficient.

@grobian grobian changed the title [Feature Request] Better logging [Feature Request] sampling (was: Better logging) Aug 11, 2016
@grobian grobian changed the title [Feature Request] sampling (was: Better logging) periodically log about non-ideal conditions Sep 20, 2018
@grobian
Copy link
Owner

grobian commented Sep 20, 2018

re-reading this issue, I think it would be nice as in the first post when drops etc. are logged in a low-volume manner such that it's easier to spot problems from the logfiles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants