You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi, i want to raise an issue about opentelemetry erlang performance on Span.set_status, but i myself was unable to reproduce it on local environment.
Recently one of production service on my company experiencing high timeout/slow API latency when it's error, which is strange because error is usually handled faster (via let it crash) then success response.
Upon further investigation and debugging on production instance (it doesn't happen on local), turn out that one of otel instrumentation library was causing this really high slow down, from ~5ms response to ~1s.
After even more investigation, it was occuring due to OpenTelemetry.Span.set_status method, and no other! here's the telemetry code:
def handle_operation_stop(_event_name, _measurements, data, config) do
errors = data.blueprint.result[:errors]
result_attributes =
[]
|> put_if(
config.trace_response_result,
{"graphql.response.result", Jason.encode!(data.blueprint.result)}
)
|> put_if(
config.trace_response_errors,
{"graphql.response.errors", Jason.encode!(errors)}
)
ctx = OpentelemetryTelemetry.set_current_telemetry_span(@tracer_id, data)
case errors do
nil -> :ok
[] -> :ok
_ -> Span.set_status(ctx, OpenTelemetry.status(:error, ""))
end
Span.set_attributes(ctx, result_attributes)
OpentelemetryTelemetry.end_telemetry_span(@tracer_id, data)
end
Only after commenting out Span.set_status could i fix the performance issue. I double check it multiple time and try commenting other part of code too! just to ensure it's really Span.set_status which are causing the issue and no other.
I tried to reproduce this locally, but i was unable to reproduce this anomaly
here's my application spec:
Elixir 1.13.3, erlang/OTP 24
additional observation i could gather that it only happen on production instance (which are handling a lot of request), in the instance when this happen most of the time the memory usage is high and dominated by ets
Hi, i want to raise an issue about opentelemetry erlang performance on Span.set_status, but i myself was unable to reproduce it on local environment.
Recently one of production service on my company experiencing high timeout/slow API latency when it's error, which is strange because error is usually handled faster (via let it crash) then success response.
Upon further investigation and debugging on production instance (it doesn't happen on local), turn out that one of otel instrumentation library was causing this really high slow down, from ~5ms response to ~1s.
After even more investigation, it was occuring due to OpenTelemetry.Span.set_status method, and no other! here's the telemetry code:
Only after commenting out Span.set_status could i fix the performance issue. I double check it multiple time and try commenting other part of code too! just to ensure it's really Span.set_status which are causing the issue and no other.
I tried to reproduce this locally, but i was unable to reproduce this anomaly
here's my application spec:
Elixir 1.13.3, erlang/OTP 24
Opentelemetry dependency:
opentelemetry, "1.0.2"
opentelemetry_api, "1.0.2"
opentelemetry_telemetry, "1.0.0-beta.7"
additional observation i could gather that it only happen on production instance (which are handling a lot of request), in the instance when this happen most of the time the memory usage is high and dominated by ets
trying to debug issue from source code, i could only spot that on https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_span_ets.erl#L140 , set_status is using ets:select_replace, and set_attributes is using ets:update_element. from what i could gather on https://www.erlang.org/doc/man/ets.html#select_replace-2 select replace could traverse the whole ets table, which might result in O(n) complexity compared to direct update which might be O(1), but i am unfamiliar with ets function used so i am not sure what i understand is correct.
the application used default config of opentelemetry (generally no set config), with custom http endpoint (edited)
The text was updated successfully, but these errors were encountered: