Improve handling of "Log entry with size X exceeds maximum size" #53
@rtshadow One possibility for identifying the offending line would be to switch (temporarily) from using the background transport to the synchronous transport, so the failure surfaces at the logging call site instead of in the background worker.
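A minimal sketch of what that temporary switch could look like, assuming the standard CloudLoggingHandler setup (the handler name here is made up; SyncTransport and CloudLoggingHandler are the library's own classes):

```python
import logging

import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler
from google.cloud.logging.handlers.transports import SyncTransport

client = google.cloud.logging.Client()

# SyncTransport sends each entry on the calling thread, so a failure
# surfaces at the logging call site rather than in the background worker.
handler = CloudLoggingHandler(client, name="size-debugging", transport=SyncTransport)

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)
```

Because every write then happens on the calling thread, this is best limited to a short debugging window rather than left on in production.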
@daniel-sanche, I think that this issue is also related to #448.
I did a bit of investigation into this today. It looks like the StructuredLogHandler is able to write large logs without an issue, but network logging does throw this exception. I'm currently thinking the third option seems the most feasible: throw an exception early if necessary, instead of waiting until the log is written in the background. I'm going to take a closer look.
I decided against the third option as well: calculating the size of each log before sending could cause performance issues. Instead, we enabled partial success on the batched writes, so when one entry is too large the rest of the batch is still submitted and only the oversized entry is dropped.
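For a sense of what the rejected per-log check would involve, here is a rough sketch; the helper name is invented and the 256 KB threshold is just the commonly documented per-entry quota, so adjust it to whatever size your errors report:

```python
import json

# Roughly the documented per-entry limit; adjust to whatever size your errors report.
MAX_ENTRY_BYTES = 256 * 1024

def payload_too_large(payload) -> bool:
    """Serialize the payload the way a pre-send check would have to, just to measure it."""
    if isinstance(payload, str):
        data = payload.encode("utf-8")
    else:
        data = json.dumps(payload, default=str).encode("utf-8")
    # The real entry also carries resource, labels and timestamps, so this
    # under-estimates the final serialized size.
    return len(data) > MAX_ENTRY_BYTES
```

Serializing every payload on the hot path just to measure it is the performance cost referred to above.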
I started seeing a ton of these again recently. I'm on 3.11.2 and upgraded back in August, so I expect the trigger was log volume, not a version change. Stack trace below. Let me know if you want me to open a new issue...? Thanks in advance for looking!
Hey @daniel-sanche, I too am facing the very same issue as the one reported right above. I can't spot anywhere in my code that would log such big payloads. Could this be happening in the library internals?
This shouldn't be caused by any library internals; there is very little internal logging in place. And there haven't been any major changes to this library in quite a while, so I don't think anything should have changed on that end. Are you using the standard library integration?
Hi @daniel-sanche, thanks for the quick response. I haven't been able to spot anywhere that I could log such an enormous payload; I even got a message about 9 MB once. How can I try to find the part of the code that triggers the issue? I have read in some issues that the sync transport could help identify the specific log, but I worry about the performance impact of doing that, as this is happening in a production system. Is there any other trick you have in mind? In any case, the fact that at least one other person started having this issue seems suspicious. Of course, both of us hitting a bug is still a possibility, but I will try to rule that out by downgrading to 3.10.0 as a last resort.
@chrisK824 logs are aggregated and sent in batches, right? I expect that's where the larger sizes come from here, not that you ever emitted a single log that totaled e.g. 9 MB.
@snarfed I'm not sure about the exact mechanics of batching in the background transport, but if that were the case it would happen all the time, or at least much more often. And if it were the case, it would be an obvious bug: if a library batches for optimisation, it should manage the batch size, or at least make it configurable by the user. So I doubt that's what's happening here, but I'll double-check, as it seems a valid suspicion. Will dig into the code more tomorrow.
@snarfed this specific bug is about an individual LogEntry, not the entire batch: because partial_errors is enabled, the other logs are submitted, and only the overly large one is omitted. @chrisK824 can you show an example of how you're using the library? Are you using the standard library integration?
Hi @daniel-sanche, I'm indeed using the standard library integration, yes; version 3.11.3 in the system where the issue was raised. I haven't seen the issue for two days in a row now. I also noticed that I'm already using version 3.11.2 (only a patch behind) in another project, and it has never suffered from this issue. Both of these make me suspect that there is indeed a big payload somewhere in the project, but I haven't spotted it yet; maybe it's caused by user input. Is there a feasible way to reveal which log line causes the error without using the sync transport and failing the actual code just to debug logging? I saw earlier notes on this issue mentioning that the log line and file had been added (?) to the traceback, but I can't see any such thing. Thanks for your time.
Hey @snarfed, are you by any chance using the Google Cloud App Engine legacy bundled services in your project? This is the only major difference I can spot between a project of mine that has never had this issue (on 3.11.2) and the problematic one (even though it's a patch later, at 3.11.3). I have seen some suspicious behaviour around a middleware sitting on top of the legacy bundled services middleware, and I will dig into it more tomorrow; registering my custom middleware before the legacy bundle instead of after it seems to alleviate a memory leak I'm also seeing lately. That middleware exists in both of the mentioned projects as well as others, so it doesn't seem to be the problem itself. I suspect the excessive logs are related to that middleware issue somehow (maybe a monstrous traceback after recursive middleware failures? no idea), but it's a direction to explore, as I see nothing else suspicious around this library.
@chrisK824 interesting! Thanks for asking. Afaik I'm not using the legacy bundled services, whether for logging or anything else. Here's how I initialize logging:

```python
# only relevant locally afaik
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)

# for GCP
import google.cloud.logging
logging_client = google.cloud.logging.Client()
logging_client.setup_logging(log_level=logging.DEBUG)
```

...however, I double-checked the failures I'm seeing now, in a few different projects, and they're now timeouts, not the size limit. Stack trace below. Maybe related to #725? Probably not related to this issue any more; I can move to a different one.
Hey @snarfed, nah, this is a different one, known for a while now. I have opened an issue here (on mobile now, so I can't easily refer to it, but there aren't many open). So there are no more size errors in your projects?
OK! Yeah, looks like I haven't hit the size error for a while, at least not in the last 30 days and probably much longer.
For the large logs issue, one thing you could try is to attach a custom LogFilter to drop large logs and give you more info about where they're coming from. Here's a quick example:
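(A minimal sketch of that kind of filter; the size threshold and the print-based reporting below are illustrative, not the exact snippet from this comment.)

```python
import logging

MAX_MESSAGE_BYTES = 200 * 1024  # illustrative threshold, a bit under the per-entry limit

class DropLargeLogs(logging.Filter):
    """Drop oversized records and report where they were logged from."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        size = len(message.encode("utf-8"))
        if size > MAX_MESSAGE_BYTES:
            # Use print rather than logging so the report can't itself be dropped.
            print(
                f"dropping {size}-byte log from "
                f"{record.pathname}:{record.lineno} ({record.funcName}): "
                f"{message[:200]!r}..."
            )
            return False  # returning False discards the record
        return True

# Attach to the handlers on the root logger so records propagated from child
# loggers are filtered too (logger-level filters don't see those).
for handler in logging.getLogger().handlers:
    handler.addFilter(DropLargeLogs())
```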
Running this prints the offending record's size and the file, line, and function it was logged from, while dropping the record instead of sending it.
Thanks @daniel-sanche, will try that!
Hey @daniel-sanche, got some news on this. I finally saw it again in my prod system, where the filter you proposed caught it and showed me the culprit. It was indeed a user payload after all, so false alarm here, sorry for that 😇 Meanwhile, I have a question before you close this: does this limit apply only to the regular logger when it comes to App Engine services (CloudLoggingHandler?), and not when using the log_struct method? The exact same payload that was caught by the filter was logged normally with log_struct in a middleware of mine that logs request payloads. Also, I would suggest documenting this whole thing somewhere for people to keep in mind.
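For reference, a direct structured write of the kind described here might look like this (the log name and payload are made up for illustration):

```python
import google.cloud.logging

client = google.cloud.logging.Client()
logger = client.logger("request-payloads")  # hypothetical log name

# Writes the entry straight to the Cloud Logging API, bypassing the
# standard-library handler; the per-entry size quota is enforced
# server-side regardless of which path the entry takes.
logger.log_struct({"path": "/checkout", "body": {"items": ["..."]}})
```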
I actually don't know all of the current implementation details around log limits since I'm no longer working in this area, but hopefully the docs have some details. If I can speculate though: when you use structured logging through stdout, the logs are captured and passed into Cloud Logging through a logging agent. If you're not seeing the large-payload error there, it may be because the agent automatically breaks up large logs into multiple entries using LogSplits. LogSplits are one way to handle large logs, but they aren't used by default in this library's network transport.
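A minimal sketch of the stdout-based setup being described, assuming a runtime with a logging agent (e.g. GKE or Cloud Run):

```python
import logging

from google.cloud.logging.handlers import StructuredLogHandler

# Emit each record as one JSON line on stdout; the environment's logging
# agent forwards it to Cloud Logging, so the client itself never makes the
# network write that raises the size error discussed in this thread.
handler = StructuredLogHandler()
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

logging.info("hello via the structured handler")
```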
Hi,
Lib version: google-cloud-logging==1.15.0
I'm experiencing lots of errors like the one in the title: "Log entry with size X exceeds maximum size".
This traceback doesn't help me identify the place where the offending line was logged (I'm using Python logging), so I cannot fix it.
I can see at least 3 resolutions to this; the third is to throw an exception from logging.info when the message is too long and we won't be able to upload it (so rather than doing the verification in a background thread, do it immediately).