Improve handling of "Log entry with size X exceeds maximum size" #53

Closed
przemekpastuszka opened this issue Jul 8, 2020 · 20 comments · Fixed by #650
Assignees
Labels
api: logging Issues related to the googleapis/python-logging API. lang: python Issues specific to Python. logging-resilience priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@przemekpastuszka

przemekpastuszka commented Jul 8, 2020

Hi,

Lib version: google-cloud-logging==1.15.0

I'm experiencing lots of errors like:

Traceback (most recent call last):
  File "/env/lib/python3.7/site-packages/google/cloud/logging/handlers/transports/background_thread.py", line 123, in _safely_commit_batch
    batch.commit()
  File "/env/lib/python3.7/site-packages/google/cloud/logging/logger.py", line 383, in commit
    client.logging_api.write_entries(entries, **kwargs)
  File "/env/lib/python3.7/site-packages/google/cloud/logging/_gapic.py", line 126, in write_entries
    partial_success=partial_success,
  File "/env/lib/python3.7/site-packages/google/cloud/logging_v2/gapic/logging_service_v2_client.py", line 477, in write_log_entries
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/env/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/env/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/env/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/env/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/env/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 Log entry with size 256.6K exceeds maximum size of 256.0K

This traceback doesn't help me identify the place where the offending entry was logged (I'm using Python's standard logging module), so I cannot fix it.

I can see at least three possible resolutions to this:

  • allow the user to specify that log messages may be cropped to the maximum size before being sent to Cloud Logging. That way, anyone who is OK with truncation still gets everything (at least partially) logged (a minimal sketch follows this list)
  • add information to the traceback about where the offending log entry was logged (file name + line number). That way I could identify and fix overly long log lines - right now there is no such information
  • verify the log message size eagerly at logging time, i.e. raise an exception from methods like logging.info when the message is too long to upload (so the check happens immediately rather than in a background thread)
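
For the first option, here is a minimal client-side sketch using a standard-library filter. It is illustrative only: the TruncatingFilter name and the 250,000-byte threshold are assumptions, not part of google-cloud-logging.

import logging

class TruncatingFilter(logging.Filter):
    """Crop oversized messages before they reach the Cloud Logging handler."""

    def __init__(self, max_bytes=250_000):
        super().__init__()
        self.max_bytes = max_bytes

    def filter(self, record):
        message = record.getMessage()
        if len(message.encode("utf-8")) > self.max_bytes:
            # Replace the formatted message with a cropped copy so the resulting
            # entry stays under the API's per-entry size limit (UTF-8 characters
            # are at most 4 bytes, hence the conservative character count).
            record.msg = message[: self.max_bytes // 4] + "... [truncated]"
            record.args = None
        return True

logging.getLogger().addFilter(TruncatingFilter())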
@product-auto-label product-auto-label bot added the api: logging Issues related to the googleapis/python-logging API. label Jul 8, 2020
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Jul 9, 2020
@simonz130 simonz130 added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Oct 27, 2020
@yoshi-automation yoshi-automation removed triage me I really want to be triaged. 🚨 This issue needs some love. labels Oct 27, 2020
@simonz130 simonz130 added the lang: python Issues specific to Python. label Oct 27, 2020
@tseaver
Contributor

tseaver commented Dec 14, 2020

@rtshadow One possibility for identifying the offending line would be to switch (temporarily) from using BackgroundThreadTransport to using SyncTransport, so that the code creating the offending log entry waits for it to be sent.
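
A minimal sketch of that temporary switch, assuming the standard library integration is in use (exact module paths may vary slightly between library versions):

import logging

import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler
from google.cloud.logging.handlers.transports import SyncTransport

client = google.cloud.logging.Client()
handler = CloudLoggingHandler(client, transport=SyncTransport)
logging.getLogger().addHandler(handler)

# With SyncTransport each entry is sent on the calling thread, so an oversized
# entry raises InvalidArgument here, with the offending call in the traceback.
logging.error("possibly oversized message")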

@daniel-sanche daniel-sanche added this to the v3.0.0 milestone Jan 27, 2021
@daniel-sanche daniel-sanche modified the milestones: v3.0.0 Breaking, v3 Update 1 Oct 5, 2021
@losalex
Contributor

losalex commented Sep 13, 2022

@daniel-sanche, I think this issue is also related to #448

@daniel-sanche
Contributor

I did a bit of investigation into this today. It looks like the StructuredLogHandler is able to write large logs without an issue, but network logging does throw this exception.

I'm currently thinking the third option seems the most feasible: throw an exception early, instead of waiting until the log is written in the background. I'm going to have a look at the partialSuccess flag first, though, to see whether that is a better option.

@daniel-sanche
Contributor

I decided against the third option as well: calculating the size of each log before sending could cause performance issues. Instead, we enabled partialSuccess by default, so the other logs in a batch will still go through (#649), and I added a PR that adds a partial printout of the log that caused the error when a batch has an issue (#650).

@snarfed

snarfed commented Oct 2, 2024

I started seeing a ton of these again recently. I'm on 3.11.2, which I upgraded to back in August, so I expect the trigger was log volume, not an upgraded version. Stack trace below. Let me know if you want me to open a new issue...? Thanks in advance for looking!

Traceback (most recent call last):
  File "google/cloud/logging_v2/handlers/transports/background_thread.py", line 115, in _safely_commit_batch
      batch.commit()
  File "google/cloud/logging_v2/logger.py", line 475, in commit
      raise e
  File "google/cloud/logging_v2/logger.py", line 468, in commit
      client.logging_api.write_entries(
  File "google/cloud/logging_v2/_gapic.py", line 163, in write_entries
      self._gapic_api.write_log_entries(request=request)
  File "google/cloud/logging_v2/services/logging_service_v2/client.py", line 966, in write_log_entries
      response = rpc(
               ^^^^
  File "google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 Log entry with size 264.0K exceeds maximum size of 256.0K [
  type_url: "type.googleapis.com/google.logging.v2.WriteLogEntriesPartialErrors"
  value: "\nA\010\000\022=\010\003\0229Log entry with size 264.0K exceeds maximum size of 256.0K"
]

@chrisK824

Hey @daniel-sanche, I'm facing the very same issue as the fresh one reported right above. I can't spot anywhere in my code that would log such big payloads. Could this be happening in the library internals?

@daniel-sanche
Contributor

This shouldn't be caused by library internals; there is very little internal logging in place. And there haven't been any major changes to this library in quite a while, so I don't think anything has changed on that end.

Are you using the logging stdlib integration? If so, it might be worth looking at how your handlers are set up and what is being sent through the CloudLoggingHandler. You could also try adding a logging.Filter to filter out, or alert you to, large logs in your system before they are sent through the CloudLoggingHandler.

@chrisK824

Hi @daniel-sanche, thanks for the quick response. I haven't been able to spot anywhere that I could log such an enormous payload. I even got a message about 9Mb once. How can I try to find the part of the code that triggers the issue? I have read in some issues that the sync transport could help identify the specific log, but I worry about the performance impact of doing that, since this is happening in a production system. Is there any other trick you have in mind?

In any case, the fact that at least one other person started having this issue seems suspicious. Of course, both of us hitting a bug is still a possibility, but I will try to rule that out by downgrading to 3.10.0 as a last resort.

@snarfed

snarfed commented Oct 24, 2024

@chrisK824 logs are aggregated and sent in batches, right? I expect that's where the larger sizes come from here, not that you ever emitted a single log entry that totaled e.g. 9MB.

@chrisK824

@snarfed I'm not sure about the exact mechanics of batching in the background transport, but if that were the case it would happen all the time, or at least much more often. And if it is the case, it's an obvious bug: if a library batches for optimisation, it should take care of the batch size, or at least make it configurable by the user. So I doubt that's what's happening here, but I'll double-check, since it's a valid suspicion. Will dig into the code more tomorrow.

@daniel-sanche
Contributor

@snarfed this specific bug is about an individual LogEntry, not the entire batch: google.api_core.exceptions.InvalidArgument: 400 Log entry with size 264.0K exceeds maximum size of 256.0K [ type_url: "type.googleapis.com/google.logging.v2.WriteLogEntriesPartialErrors" value: "\nA\010\000\022=\010\003\0229Log entry with size 264.0K exceeds maximum size of 256.0K" ]

Because partialSuccess is enabled, the other logs are submitted and only the overly large one is omitted.

@chrisK824 can you show an example of how you're using the library? Are you using the logging standard library integration? How often is the issue occurring? Are you seeing the issue after upgrading to a specific version? (We've only had very minor changes in this library over the last 3 months, but I don't want to rule out a change in an update completely)

@chrisK824

chrisK824 commented Oct 27, 2024

Hi @daniel-sanche, yes, I'm indeed using the standard library integration: version 3.11.3 in the system where the issue was raised.

I haven't seen the issue for two days in a row now. I also noticed that I'm already using version 3.11.2 (only a patch behind) in another project, and that one has never suffered from this issue.

Both of these make me suspect that there is indeed a big payload somewhere in the project that I haven't spotted yet. Maybe it's caused by user input.

Is there a feasible way to reveal which log line causes the error without using the sync transport and failing the actual code just to debug logging? I saw earlier notes on this issue saying that the log line and file had been added (?) to the traceback, but I can't see any such thing.

Thanks for your time.

@chrisK824

Hey @snarfed, are you by any chance using the Google Cloud App Engine legacy bundled services in your project? This is the only major difference I can spot between a project of mine that has never had this issue in months (on 3.11.2) and the problematic one (even if it is a patch later, at 3.11.3).

I have seen some suspicious behaviour around a middleware sitting on top of the legacy bundle middleware, and I will dig into it more tomorrow: registering my custom middleware before the legacy bundle instead of after it seems to alleviate a memory leak I've also been seeing lately. That middleware exists in both of the projects mentioned, as well as others, so it doesn't seem to be the problem by itself.

I suspect the excessive logs are somehow related to that middleware issue - maybe a monstrous traceback logged after recursive middleware failures? No idea. But it's an avenue to explore, as I see nothing else suspicious around this library.

@snarfed

snarfed commented Oct 28, 2024

@chrisK824 interesting! Thanks for asking. As far as I know I'm not using the legacy bundled services, whether for logging or anything else. Here's how I initialize logging:

# only relevant locally afaik
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)

# for GCP
import google.cloud.logging
logging_client = google.cloud.logging.Client()
logging_client.setup_logging(log_level=logging.DEBUG)

...however, I double-checked the failures I'm seeing now, in a few different projects, and they're now timeouts, not the size limit. Stack trace below. Maybe related to #725? Probably not related to this issue any more; I can move to a different one.

Failed to submit 1 logs.
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
    return callable_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-10-10T01:13:16.9670401+00:00"}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/handlers/transports/background_thread.py", line 115, in _safely_commit_batch
    batch.commit()
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/logger.py", line 468, in commit
    client.logging_api.write_entries(
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/_gapic.py", line 163, in write_entries
    self._gapic_api.write_log_entries(request=request)
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/services/logging_service_v2/client.py", line 966, in write_log_entries
    response = rpc(
               ^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_base.py", line 221, in _retry_error_helper

@chrisK824

Hey @snarfed, nah, this is a different one, known for a while now. I have opened an issue for it here (I'm on mobile now so I can't easily link to it, but there aren't many open).

So there are no more size errors in your projects?

@snarfed

snarfed commented Oct 28, 2024

OK! Yeah, it looks like I haven't hit the size error in a while, at least not in the last 30 days and probably much longer.

@daniel-sanche
Contributor

daniel-sanche commented Oct 28, 2024

For the large-logs issue, one thing you could try is attaching a custom logging.Filter to drop large logs and give you more info about where they're coming from. Here's a quick example:

import logging
import sys

class LargeLogDetector(logging.Filter):

    def __init__(self, limit_size=200_000):
        self.limit = limit_size

    def filter(self, record):
        record_size = sys.getsizeof(record.msg)  # rough in-memory size, not the serialized entry size
        if record_size > self.limit:
            location = f"{record.filename}:{record.lineno}"
            print(f"dropping record from: {location}")
            return False  # drop the oversized record
        return True  # keep everything else

logging.getLogger().addFilter(LargeLogDetector())
logging.error("test")
logging.error("".join(["a" for _ in range(500_000)]))

Running this gives:

ERROR:root:test
dropping record from: large-log.py:19
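
Note that a filter added to the root logger only runs for records created on the root logger itself; records from named loggers that merely propagate upward skip it. If that matters, the same filter can be attached to the handler instance (assuming handler is the CloudLoggingHandler that was installed), since handler-level filters run for every record the handler emits:

handler.addFilter(LargeLogDetector())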

@daniel-sanche daniel-sanche reopened this Oct 28, 2024
@chrisK824

Thanks @daniel-sanche, will try that!

@chrisK824

Hey @daniel-sanche, I have some news on this. I finally saw it again in my prod system, where the filter you proposed caught it and showed me the culprit. It was indeed a user payload after all, so false alarm here, sorry for that 😇

Meanwhile, I have a question before you close this. Does this limit apply only to the regular logger for App Engine services (CloudLoggingHandler?) and not when using the log_struct method? The exact same payload that was caught by the filter was logged normally with log_struct in a middleware of mine that logs request payloads.

Also, I would suggest documenting this whole thing somewhere for people to keep in mind.

@daniel-sanche
Contributor

I actually don't know all the current implementation details around log limits, since I'm no longer working in this area, but hopefully the docs have some details.

If I can speculate, though: when you use structured logging through stdout, the logs are captured and passed into Cloud Logging by a logging agent. If you're not seeing the large-payload error there, it may be because the agent automatically breaks up large logs into multiple entries using LogSplits. LogSplits are one way to handle large logs, but they aren't used by default in python-logging.
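
For anyone who wants something similar on the client side, here is a minimal sketch of splitting a message before logging it. It is an illustration only, not the agent's LogSplit mechanism and not part of google-cloud-logging; the log_in_chunks helper and the 200,000-byte margin are assumptions.

import logging

MAX_CHUNK_BYTES = 200_000  # illustrative margin below the ~256 KB per-entry limit

def log_in_chunks(logger, level, message, chunk_bytes=MAX_CHUNK_BYTES):
    """Emit one entry per chunk so no single entry exceeds the size limit."""
    encoded = message.encode("utf-8")
    total = max((len(encoded) + chunk_bytes - 1) // chunk_bytes, 1)
    for i in range(total):
        chunk = encoded[i * chunk_bytes:(i + 1) * chunk_bytes].decode("utf-8", errors="ignore")
        logger.log(level, "[part %d/%d] %s", i + 1, total, chunk)

# A 500,000-character ASCII message becomes three entries, each under the limit.
log_in_chunks(logging.getLogger(__name__), logging.INFO, "x" * 500_000)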
