Improve handling of "Log entry with size X exceeds maximum size" #53

Closed
przemekpastuszka opened this issue Jul 8, 2020 · 20 comments · Fixed by #650
Assignees
Labels
api: logging Issues related to the googleapis/python-logging API. lang: python Issues specific to Python. logging-resilience priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@przemekpastuszka

przemekpastuszka commented Jul 8, 2020

Hi,

Lib version: google-cloud-logging==1.15.0

I'm experiencing lots of errors like:

Traceback (most recent call last):
  File "/env/lib/python3.7/site-packages/google/cloud/logging/handlers/transports/background_thread.py", line 123, in _safely_commit_batch
    batch.commit()
  File "/env/lib/python3.7/site-packages/google/cloud/logging/logger.py", line 383, in commit
    client.logging_api.write_entries(entries, **kwargs)
  File "/env/lib/python3.7/site-packages/google/cloud/logging/_gapic.py", line 126, in write_entries
    partial_success=partial_success,
  File "/env/lib/python3.7/site-packages/google/cloud/logging_v2/gapic/logging_service_v2_client.py", line 477, in write_log_entries
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/env/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/env/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/env/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/env/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/env/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 Log entry with size 256.6K exceeds maximum size of 256.0K

This traceback doesn't help me identify the place where the offending entry was logged (I'm using Python's standard logging module), so I cannot fix it.

I can see at least three possible resolutions to this:

  • allow the user to specify that log messages may be cropped to the maximum size before being sent to Cloud Logging. That way, anyone who is OK with truncation still gets everything (at least partially) logged (a minimal sketch follows this list)
  • add information to the traceback about where the offending log entry was logged (file name + line number). That way I could identify and fix overly long log lines - right now there is no such information
  • verify the log message size eagerly at logging time, i.e. raise an exception from methods like logging.info when the message is too long to upload (so the check happens immediately rather than in a background thread)
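
For the first option, here is a minimal client-side sketch using a standard-library filter. It is illustrative only: the TruncatingFilter name and the 250,000-byte threshold are assumptions, not part of google-cloud-logging.

import logging

class TruncatingFilter(logging.Filter):
    """Crop oversized messages before they reach the Cloud Logging handler."""

    def __init__(self, max_bytes=250_000):
        super().__init__()
        self.max_bytes = max_bytes

    def filter(self, record):
        message = record.getMessage()
        if len(message.encode("utf-8")) > self.max_bytes:
            # Replace the formatted message with a cropped copy so the resulting
            # entry stays under the API's per-entry size limit (UTF-8 characters
            # are at most 4 bytes, hence the conservative character count).
            record.msg = message[: self.max_bytes // 4] + "... [truncated]"
            record.args = None
        return True

logging.getLogger().addFilter(TruncatingFilter())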
@product-auto-label product-auto-label bot added the api: logging Issues related to the googleapis/python-logging API. label Jul 8, 2020
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Jul 9, 2020
@simonz130 simonz130 added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Oct 27, 2020
@yoshi-automation yoshi-automation removed triage me I really want to be triaged. 🚨 This issue needs some love. labels Oct 27, 2020
@simonz130 simonz130 added the lang: python Issues specific to Python. label Oct 27, 2020
@tseaver
Contributor

tseaver commented Dec 14, 2020

@rtshadow One possibility for identifying the offending line would be to switch (temporarily) from using BackgroundThreadTransport to using SyncTransport, so that the code creating the offending log entry waits for it to be sent.
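
A minimal sketch of that temporary switch, assuming the standard library integration is in use (exact module paths may vary slightly between library versions):

import logging

import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler
from google.cloud.logging.handlers.transports import SyncTransport

client = google.cloud.logging.Client()
handler = CloudLoggingHandler(client, transport=SyncTransport)
logging.getLogger().addHandler(handler)

# With SyncTransport each entry is sent on the calling thread, so an oversized
# entry raises InvalidArgument here, with the offending call in the traceback.
logging.error("possibly oversized message")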

@daniel-sanche daniel-sanche added this to the v3.0.0 milestone Jan 27, 2021
@daniel-sanche daniel-sanche modified the milestones: v3.0.0 Breaking, v3 Update 1 Oct 5, 2021
@losalex
Contributor

losalex commented Sep 13, 2022

@daniel-sanche, I think this issue is also related to #448

@daniel-sanche
Contributor

I did a bit of investigation into this today. It looks like the StructuredLogHandler is able to write large logs without an issue, but network logging does throw this exception.

I'm currently thinking the third option seems the most feasible: throw an exception early, instead of waiting until the log is written in the background. I'm going to have a look at the partialSuccess flag first, though, to see whether that is a better option.

@daniel-sanche
Contributor

I decided against the third option as well: calculating the size of each log before sending could cause performance issues. Instead, we enabled partialSuccess by default, so the other logs in a batch will still go through (#649), and I added a PR that adds a partial printout of the log that caused the error when a batch has an issue (#650).

@snarfed

snarfed commented Oct 2, 2024

I started seeing a ton of these again recently. I'm on 3.11.2, which I upgraded to back in August, so I expect the trigger was log volume, not an upgraded version. Stack trace below. Let me know if you want me to open a new issue...? Thanks in advance for looking!

Traceback (most recent call last):
  File "google/cloud/logging_v2/handlers/transports/background_thread.py", line 115, in _safely_commit_batch
      batch.commit()
  File "google/cloud/logging_v2/logger.py", line 475, in commit
      raise e
  File "google/cloud/logging_v2/logger.py", line 468, in commit
      client.logging_api.write_entries(
  File "google/cloud/logging_v2/_gapic.py", line 163, in write_entries
      self._gapic_api.write_log_entries(request=request)
  File "google/cloud/logging_v2/services/logging_service_v2/client.py", line 966, in write_log_entries
      response = rpc(
               ^^^^
  File "google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 Log entry with size 264.0K exceeds maximum size of 256.0K [
  type_url: "type.googleapis.com/google.logging.v2.WriteLogEntriesPartialErrors"
  value: "\nA\010\000\022=\010\003\0229Log entry with size 264.0K exceeds maximum size of 256.0K"
]

@chrisK824

Hey @daniel-sanche, I'm facing the very same issue as the fresh one reported right above. I can't spot anywhere in my code that would log such big payloads. Could this be happening in the library internals?

@daniel-sanche
Contributor

This shouldn't be caused by library internals; there is very little internal logging in place. And there haven't been any major changes to this library in quite a while, so I don't think anything has changed on that end.

Are you using the logging stdlib integration? If so, it might be worth looking at how your handlers are set up and what is being sent through the CloudLoggingHandler. You could also try adding a logging.Filter to filter out, or alert you to, large logs in your system before they are sent through the CloudLoggingHandler.

@chrisK824

Hi @daniel-sanche, thanks for the quick response. I haven't been able to spot anywhere that I could log such an enormous payload. I even got a message about 9Mb once. How can I try to find the part of the code that triggers the issue? I have read in some issues that the sync transport could help identify the specific log, but I worry about the performance impact of doing that, since this is happening in a production system. Is there any other trick you have in mind?

In any case, the fact that at least one other person started having this issue seems suspicious. Of course, both of us hitting a bug is still a possibility, but I will try to rule that out by downgrading to 3.10.0 as a last resort.

@snarfed

snarfed commented Oct 24, 2024

@chrisK824 logs are aggregated and sent in batches, right? I expect that's where the larger sizes come from here, not that you ever emitted a single log entry that totaled e.g. 9MB.

@chrisK824

@snarfed I'm not sure about the exact mechanics of batching in the background transport, but if that were the case it would happen all the time, or at least much more often. And if it is the case, it's an obvious bug: if a library batches for optimisation, it should take care of the batch size, or at least make it configurable by the user. So I doubt that's what's happening here, but I'll double-check, since it's a valid suspicion. Will dig into the code more tomorrow.

@daniel-sanche
Contributor

@snarfed this specific bug is about an individual LogEntry, not the entire batch: google.api_core.exceptions.InvalidArgument: 400 Log entry with size 264.0K exceeds maximum size of 256.0K [ type_url: "type.googleapis.com/google.logging.v2.WriteLogEntriesPartialErrors" value: "\nA\010\000\022=\010\003\0229Log entry with size 264.0K exceeds maximum size of 256.0K" ]

Because partialSuccess is enabled, the other logs are submitted and only the overly large one is omitted.

@chrisK824 can you show an example of how you're using the library? Are you using the logging standard library integration? How often is the issue occurring? Are you seeing the issue after upgrading to a specific version? (We've only had very minor changes in this library over the last 3 months, but I don't want to rule out a change in an update completely)

@chrisK824

chrisK824 commented Oct 27, 2024

Hi @daniel-sanche, yes, I'm indeed using the standard library integration: version 3.11.3 in the system where the issue was raised.

I haven't seen the issue for two days in a row now. I also noticed that I'm already using version 3.11.2 (only a patch behind) in another project, and that one has never suffered from this issue.

Both of these make me suspect that there is indeed a big payload somewhere in the project that I haven't spotted yet. Maybe it's caused by user input.

Is there a feasible way to reveal which log line causes the error without using the sync transport and failing the actual code just to debug logging? I saw earlier notes on this issue saying that the log line and file had been added (?) to the traceback, but I can't see any such thing.

Thanks for your time.

@chrisK824

Hey @snarfed, are you by any chance using the Google Cloud App Engine legacy bundled services in your project? This is the only major difference I can spot between a project of mine that has never had this issue in months (on 3.11.2) and the problematic one (even if it is a patch later, at 3.11.3).

I have seen some suspicious behaviour around a middleware sitting on top of the legacy bundle middleware, and I will dig into it more tomorrow: registering my custom middleware before the legacy bundle instead of after it seems to alleviate a memory leak I've also been seeing lately. That middleware exists in both of the projects mentioned, as well as others, so it doesn't seem to be the problem by itself.

I suspect the excessive logs are somehow related to that middleware issue - maybe a monstrous traceback logged after recursive middleware failures? No idea. But it's an avenue to explore, as I see nothing else suspicious around this library.

@snarfed

snarfed commented Oct 28, 2024

@chrisK824 interesting! Thanks for asking. As far as I know I'm not using the legacy bundled services, whether for logging or anything else. Here's how I initialize logging:

# only relevant locally afaik
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)

# for GCP
import google.cloud.logging
logging_client = google.cloud.logging.Client()
logging_client.setup_logging(log_level=logging.DEBUG)

...however, I double-checked the failures I'm seeing now, in a few different projects, and they're now timeouts, not the size limit. Stack trace below. Maybe related to #725? Probably not related to this issue any more; I can move to a different one.

Failed to submit 1 logs.
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
    return callable_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-10-10T01:13:16.9670401+00:00"}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/handlers/transports/background_thread.py", line 115, in _safely_commit_batch
    batch.commit()
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/logger.py", line 468, in commit
    client.logging_api.write_entries(
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/_gapic.py", line 163, in write_entries
    self._gapic_api.write_log_entries(request=request)
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/cloud/logging_v2/services/logging_service_v2/client.py", line 966, in write_log_entries
    response = rpc(
               ^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/google/api_core/retry/retry_base.py", line 221, in _retry_error_helper

@chrisK824

Hey @snarfed, nah, this is a different one, known for a while now. I have opened an issue for it here (I'm on mobile now so I can't easily link to it, but there aren't many open).

So there are no more size errors in your projects?

@snarfed

snarfed commented Oct 28, 2024

OK! Yeah, it looks like I haven't hit the size error in a while, at least not in the last 30 days and probably much longer.

@daniel-sanche
Contributor

daniel-sanche commented Oct 28, 2024

For the large-logs issue, one thing you could try is attaching a custom logging.Filter to drop large logs and give you more info about where they're coming from. Here's a quick example:

import logging
import sys

class LargeLogDetector(logging.Filter):

    def __init__(self, limit_size=200_000):
        self.limit = limit_size

    def filter(self, record):
        record_size = sys.getsizeof(record.msg)  # rough in-memory size, not the serialized entry size
        if record_size > self.limit:
            location = f"{record.filename}:{record.lineno}"
            print(f"dropping record from: {location}")
            return False  # drop the oversized record
        return True  # keep everything else

logging.getLogger().addFilter(LargeLogDetector())
logging.error("test")
logging.error("".join(["a" for _ in range(500_000)]))

Running this gives:

ERROR:root:test
dropping record from: large-log.py:19
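
Note that a filter added to the root logger only runs for records created on the root logger itself; records from named loggers that merely propagate upward skip it. If that matters, the same filter can be attached to the handler instance (assuming handler is the CloudLoggingHandler that was installed), since handler-level filters run for every record the handler emits:

handler.addFilter(LargeLogDetector())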

@daniel-sanche daniel-sanche reopened this Oct 28, 2024
@chrisK824

Thanks @daniel-sanche, will try that!

@chrisK824

Hey @daniel-sanche, I have some news on this. I finally saw it again in my prod system, where the filter you proposed caught it and showed me the culprit. It was indeed a user payload after all, so false alarm here, sorry for that 😇

Meanwhile, I have a question before you close this. Does this limit apply only to the regular logger for App Engine services (CloudLoggingHandler?) and not when using the log_struct method? The exact same payload that was caught by the filter was logged normally with log_struct in a middleware of mine that logs request payloads.

Also, I would suggest documenting this whole thing somewhere for people to keep in mind.

@daniel-sanche
Contributor

I actually don't know all the current implementation details around log limits, since I'm no longer working in this area, but hopefully the docs have some details.

If I can speculate, though: when you use structured logging through stdout, the logs are captured and passed into Cloud Logging by a logging agent. If you're not seeing the large-payload error there, it may be because the agent automatically breaks up large logs into multiple entries using LogSplits. LogSplits are one way to handle large logs, but they aren't used by default in python-logging.
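
For anyone who wants something similar on the client side, here is a minimal sketch of splitting a message before logging it. It is an illustration only, not the agent's LogSplit mechanism and not part of google-cloud-logging; the log_in_chunks helper and the 200,000-byte margin are assumptions.

import logging

MAX_CHUNK_BYTES = 200_000  # illustrative margin below the ~256 KB per-entry limit

def log_in_chunks(logger, level, message, chunk_bytes=MAX_CHUNK_BYTES):
    """Emit one entry per chunk so no single entry exceeds the size limit."""
    encoded = message.encode("utf-8")
    total = max((len(encoded) + chunk_bytes - 1) // chunk_bytes, 1)
    for i in range(total):
        chunk = encoded[i * chunk_bytes:(i + 1) * chunk_bytes].decode("utf-8", errors="ignore")
        logger.log(level, "[part %d/%d] %s", i + 1, total, chunk)

# A 500,000-character ASCII message becomes three entries, each under the limit.
log_in_chunks(logging.getLogger(__name__), logging.INFO, "x" * 500_000)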
